CAD Dataset Infrastructure — Tagging & Extensibility¶

Home / Engineering / System Design / CAD Dataset Infrastructure / Tagging & Extensibility

Engineering Log Entry — March 2026

Design for extensible part classification using JSONB tags, a lightweight tag registry, and guidelines for when to promote tags to dedicated columns.

Parts need to be classified along multiple dimensions — difficulty, machine compatibility, industry sector, material type — and new dimensions will emerge as the dataset and its use cases grow. The schema must support adding new classification dimensions without database migrations, while keeping frequently-queried dimensions efficient.

Why Not Just Dedicated Columns?¶

A purely column-based approach — adding machine_type TEXT, industry TEXT, etc. to dataset_parts for every classification dimension — works when the set of dimensions is small and known upfront. It does not scale: each new dimension (machine type, industry, part complexity category) requires a schema migration, a CLI flag, and a deployment. The set of useful classifications will grow as the dataset finds new uses, so the schema needs to accommodate that without friction.

Design: JSONB-First with Tag Registry¶

Classification dimensions — difficulty, machine type, industry, material class, complexity — all live in the tags JSONB column rather than as dedicated columns. Early on, these dimensions will change frequently as we learn what's useful. Dedicated columns for each one would mean constant migrations. JSONB lets us add, rename, or remove dimensions with zero schema changes.

The schema reserves dedicated columns only for structural fields that are inherent to the data model (e.g., annotation_status, split, source, blob_hash) rather than classification metadata. A lightweight tag_definitions table acts as a registry — not enforced by foreign keys, but used by the CLI, the Carbon UI, and ingestion scripts for validation and autocomplete. Once a tag dimension has proven stable through sustained use and is queried in nearly every workflow, it can be promoted to a dedicated column.

Tag definitions table:

CREATE TABLE tag_definitions (
    id              UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    dimension       TEXT UNIQUE NOT NULL,    -- 'machine_type', 'industry', 'complexity'
    display_name    TEXT NOT NULL,           -- 'Machine Type'
    value_type      TEXT NOT NULL,           -- 'enum', 'integer_range', 'text', 'boolean'
    allowed_values  JSONB,                   -- ['3-axis', '4-axis', '5-axis'] for enums
    description     TEXT,
    created_at      TIMESTAMPTZ DEFAULT now()
);

Example registry entries:

INSERT INTO tag_definitions (dimension, display_name, value_type, allowed_values, description) VALUES
    ('difficulty', 'Difficulty', 'enum', '["easy", "medium", "hard"]',
     'Manufacturing difficulty for annotation and benchmarking'),
    ('material', 'Material', 'enum', '["6061-T651", "7075-T6", "304-stainless", "titanium-6Al4V", "delrin"]',
     'Stock material — values will expand as new materials are used'),
    ('machine_type', 'Machine Type', 'enum', '["3-axis", "4-axis", "5-axis", "mill-turn"]',
     'CNC machine axis configuration required to manufacture the part'),
    ('industry', 'Industry', 'enum', '["aerospace", "automotive", "medical", "general"]',
     'Primary industry vertical for the part'),
    ('complexity', 'Complexity', 'enum', '["simple", "moderate", "complex"]',
     'Subjective manufacturing complexity rating'),
    ('tolerance_class', 'Tolerance Class', 'enum', '["standard", "precision", "tight"]',
     'General tolerance band for the part');

Example tags JSONB on a part:

{
  "difficulty": "hard",
  "material": "titanium-6Al4V",
  "machine_type": "5-axis",
  "industry": "aerospace",
  "complexity": "complex",
  "tolerance_class": "tight",
  "model_scores": {"aagnet_v2": 0.92, "brepformer_v1": 0.87},
  "benchmark": {"gouge_count": 0, "last_tested_snapshot": "v003"}
}

Querying by tags uses the existing GIN index on tags:

-- All 5-axis aerospace parts
SELECT * FROM dataset_parts
WHERE tags @> '{"machine_type": "5-axis", "industry": "aerospace"}';

-- Multiple tag dimensions in one predicate
SELECT * FROM dataset_parts
WHERE tags @> '{"machine_type": "5-axis", "material_class": "titanium", "complexity": "complex"}';

CLI integration:

# Pull by tag dimensions
mds pull --tag machine_type=5-axis --tag industry=aerospace --output ./data/aero-5ax/
mds pull --tag complexity=simple --tag difficulty=easy --output ./data/easy/

# List registered tag dimensions and their allowed values
mds tags list

# Register a new tag dimension (no schema migration needed)
mds tags define --dimension surface_treatment --type enum \
    --values "anodized,passivated,as-machined"

# Tag parts
mds tag part-0042 machine_type=5-axis industry=aerospace
mds tag part-0042 complexity=complex

Why JSONB over a junction table?

A junction table (part_tags with rows like (part_id, 'machine_type', '5-axis')) gives cleaner relational semantics but requires a JOIN per tag dimension in queries. With JSONB and a GIN index, a single WHERE tags @> '...' clause filters by multiple dimensions in one predicate. At Anvil's current scale (10K–20K parts, fewer than 20 tag dimensions), the JSONB approach is simpler and fast enough. If the number of distinct tag dimensions grows past ~50 or per-value indexing becomes important, a junction table could be reconsidered.

Schema Evolution Guidelines¶

Default: start in tags JSONB. All classification and categorization dimensions begin in JSONB. This includes dimensions that feel obvious now (difficulty, material class) — early assumptions about what's stable are often wrong, and JSONB lets the team iterate without coordination overhead.

When to promote to a dedicated column:

A dimension earns a dedicated column only when all of these are true: it has been in active use for a meaningful period without changing semantics, it is queried in most subset pulls or used in JOINs/GROUP BYs, and it would benefit from a CHECK constraint or typed index. Structural fields like annotation_status, split, and source are dedicated columns from the start because they are part of the data model, not classification metadata.

Promoting a tag to a dedicated column:

When a JSONB tag dimension stabilizes and becomes central to queries:

Add the new column: ALTER TABLE dataset_parts ADD COLUMN machine_type TEXT;
Backfill from JSONB: UPDATE dataset_parts SET machine_type = tags->>'machine_type';
Add an index if needed.
Update the CLI to support --machine-type as a first-class flag.
Optionally remove the key from tags JSONB, or leave it for backward compatibility during the transition.

Adding new allowed values to an existing dimension:

Update the tag_definitions row's allowed_values array. No schema migration required. The CLI and Carbon UI pick up the new values on next load.

Changing feature annotation categories:

The annotation_type field on dataset_annotations is free-form TEXT, so new annotation categories (e.g., 'surface_finish', 'fixturing') are added by using a new string value — no migration. If validation is desired, the tag_definitions registry pattern can be extended to annotation types by adding a dimension = 'annotation_type' row with allowed values.