CAD Dataset Infrastructure — Metadata Schema¶

Home / Engineering / System Design / CAD Dataset Infrastructure / Metadata Schema

Engineering Log Entry — March 2026

SQL schema for the dataset metadata catalog — parts, annotations, snapshots, and pins — all hosted in the existing Supabase Postgres instance.

Parts Catalog¶

CREATE TABLE dataset_parts (
    id                UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    part_id           TEXT UNIQUE NOT NULL,
    blob_hash         TEXT NOT NULL,
    filename          TEXT NOT NULL,
    file_size_bytes   BIGINT,

    -- Geometric properties (populated by ingestion pipeline)
    face_count        INTEGER,
    edge_count        INTEGER,
    vertex_count      INTEGER,
    volume_mm3        REAL,
    bounding_box      JSONB,          -- {"x": 100, "y": 50, "z": 30}
    feature_types     TEXT[],         -- ['pocket', 'hole', 'slot', ...]
    feature_count     INTEGER,

    -- Provenance
    source            TEXT,           -- 'grabcad', 'customer', 'synthetic'
    source_url        TEXT,
    license           TEXT,

    -- Annotation status
    annotation_status TEXT DEFAULT 'unannotated',
    annotator_id      UUID REFERENCES auth.users(id),
    annotated_at      TIMESTAMPTZ,

    -- ML
    split             TEXT,           -- 'train', 'val', 'test'

    -- Classification tags (difficulty, material, machine_type, industry, etc.)
    -- All classification dimensions live here — see "Tagging and Schema Extensibility"
    tags              JSONB DEFAULT '{}',

    created_at        TIMESTAMPTZ DEFAULT now(),
    updated_at        TIMESTAMPTZ DEFAULT now()
);

CREATE INDEX idx_parts_annotation_status ON dataset_parts(annotation_status);
CREATE INDEX idx_parts_split ON dataset_parts(split);
CREATE INDEX idx_parts_source ON dataset_parts(source);
CREATE INDEX idx_parts_feature_types ON dataset_parts USING GIN(feature_types);
CREATE INDEX idx_parts_tags ON dataset_parts USING GIN(tags);
CREATE INDEX idx_parts_blob_hash ON dataset_parts(blob_hash);

Row-level security policies restrict access by role — see Access & Authentication for the full policy set and the user_role() helper function.

The feature_types array with a GIN index supports queries like "all parts containing a pocket AND a hole." The tags JSONB column handles all classification metadata:

{
  "difficulty": "medium",
  "material": "6061-T651",
  "machine_type": "3-axis",
  "industry": "aerospace",
  "tolerance_class": "standard",
  "model_scores": {"aagnet_v2": 0.92, "brepformer_v1": 0.87},
  "benchmark": {"gouge_count": 0, "last_tested_snapshot": "v003"}
}

Annotations¶

CREATE TABLE dataset_annotations (
    id              UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    part_id         TEXT NOT NULL REFERENCES dataset_parts(part_id),
    annotator_id    UUID REFERENCES auth.users(id),
    annotation_type TEXT NOT NULL,     -- 'feature', 'gdt', 'setup'
    annotation_data JSONB NOT NULL,    -- flexible per annotation_type
    model_assisted  BOOLEAN DEFAULT false,
    confidence      REAL,
    created_at      TIMESTAMPTZ DEFAULT now(),
    updated_at      TIMESTAMPTZ DEFAULT now()
);

Annotation types are intentionally unstructured (JSONB) — feature labels, GD&T annotations, and future annotation types each have different schemas. The annotation_type discriminator tells consumers how to interpret annotation_data.

Snapshots and Pins¶

CREATE TABLE dataset_snapshots (
    id              UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    version         TEXT UNIQUE NOT NULL,
    parent_version  TEXT REFERENCES dataset_snapshots(version),
    description     TEXT,
    created_by      UUID REFERENCES auth.users(id),
    parts_count     INTEGER NOT NULL,
    manifest_url    TEXT NOT NULL,     -- R2 URL to full manifest JSON
    created_at      TIMESTAMPTZ DEFAULT now()
);

CREATE TABLE dataset_snapshot_parts (
    snapshot_id     UUID REFERENCES dataset_snapshots(id),
    part_id         TEXT NOT NULL,
    blob_hash       TEXT NOT NULL,
    PRIMARY KEY (snapshot_id, part_id)
);

CREATE TABLE dataset_pins (
    id              UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    snapshot_id     UUID REFERENCES dataset_snapshots(id),
    context         TEXT NOT NULL,     -- 'aagnet-training-run-042'
    context_type    TEXT NOT NULL,     -- 'training', 'benchmark', 'evaluation'
    metadata        JSONB,
    created_at      TIMESTAMPTZ DEFAULT now()
);