CAD Dataset Infrastructure — Metadata Schema¶
Home / Engineering / System Design / CAD Dataset Infrastructure / Metadata Schema
Engineering Log Entry — March 2026
SQL schema for the dataset metadata catalog — parts, annotations, snapshots, and pins — all hosted in the existing Supabase Postgres instance.
Parts Catalog¶
CREATE TABLE dataset_parts (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
part_id TEXT UNIQUE NOT NULL,
blob_hash TEXT NOT NULL,
filename TEXT NOT NULL,
file_size_bytes BIGINT,
-- Geometric properties (populated by ingestion pipeline)
face_count INTEGER,
edge_count INTEGER,
vertex_count INTEGER,
volume_mm3 REAL,
bounding_box JSONB, -- {"x": 100, "y": 50, "z": 30}
feature_types TEXT[], -- ['pocket', 'hole', 'slot', ...]
feature_count INTEGER,
-- Provenance
source TEXT, -- 'grabcad', 'customer', 'synthetic'
source_url TEXT,
license TEXT,
-- Annotation status
annotation_status TEXT DEFAULT 'unannotated',
annotator_id UUID REFERENCES auth.users(id),
annotated_at TIMESTAMPTZ,
-- ML
split TEXT, -- 'train', 'val', 'test'
-- Classification tags (difficulty, material, machine_type, industry, etc.)
-- All classification dimensions live here — see "Tagging and Schema Extensibility"
tags JSONB DEFAULT '{}',
created_at TIMESTAMPTZ DEFAULT now(),
updated_at TIMESTAMPTZ DEFAULT now()
);
CREATE INDEX idx_parts_annotation_status ON dataset_parts(annotation_status);
CREATE INDEX idx_parts_split ON dataset_parts(split);
CREATE INDEX idx_parts_source ON dataset_parts(source);
CREATE INDEX idx_parts_feature_types ON dataset_parts USING GIN(feature_types);
CREATE INDEX idx_parts_tags ON dataset_parts USING GIN(tags);
CREATE INDEX idx_parts_blob_hash ON dataset_parts(blob_hash);
Row-level security policies restrict access by role — see Access & Authentication for the full policy set and the user_role() helper function.
The feature_types array with a GIN index supports queries like "all parts containing a pocket AND a hole." The tags JSONB column handles all classification metadata:
{
"difficulty": "medium",
"material": "6061-T651",
"machine_type": "3-axis",
"industry": "aerospace",
"tolerance_class": "standard",
"model_scores": {"aagnet_v2": 0.92, "brepformer_v1": 0.87},
"benchmark": {"gouge_count": 0, "last_tested_snapshot": "v003"}
}
Annotations¶
CREATE TABLE dataset_annotations (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
part_id TEXT NOT NULL REFERENCES dataset_parts(part_id),
annotator_id UUID REFERENCES auth.users(id),
annotation_type TEXT NOT NULL, -- 'feature', 'gdt', 'setup'
annotation_data JSONB NOT NULL, -- flexible per annotation_type
model_assisted BOOLEAN DEFAULT false,
confidence REAL,
created_at TIMESTAMPTZ DEFAULT now(),
updated_at TIMESTAMPTZ DEFAULT now()
);
Annotation types are intentionally unstructured (JSONB) — feature labels, GD&T annotations, and future annotation types each have different schemas. The annotation_type discriminator tells consumers how to interpret annotation_data.
Snapshots and Pins¶
CREATE TABLE dataset_snapshots (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
version TEXT UNIQUE NOT NULL,
parent_version TEXT REFERENCES dataset_snapshots(version),
description TEXT,
created_by UUID REFERENCES auth.users(id),
parts_count INTEGER NOT NULL,
manifest_url TEXT NOT NULL, -- R2 URL to full manifest JSON
created_at TIMESTAMPTZ DEFAULT now()
);
CREATE TABLE dataset_snapshot_parts (
snapshot_id UUID REFERENCES dataset_snapshots(id),
part_id TEXT NOT NULL,
blob_hash TEXT NOT NULL,
PRIMARY KEY (snapshot_id, part_id)
);
CREATE TABLE dataset_pins (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
snapshot_id UUID REFERENCES dataset_snapshots(id),
context TEXT NOT NULL, -- 'aagnet-training-run-042'
context_type TEXT NOT NULL, -- 'training', 'benchmark', 'evaluation'
metadata JSONB,
created_at TIMESTAMPTZ DEFAULT now()
);