CAD Dataset Infrastructure — Annotation Pipeline

Engineering Log Entry — March 2026

This entry describes how the dataset infrastructure integrates with Carbon's annotation workflow: the data flow, the pre-labeling strategy, and presigned URL access.

The annotation tool is migrating into Carbon, which already uses Supabase as its backend. This design makes the dataset catalog a natural extension of the same database.

flowchart LR
    subgraph Ingest["Ingestion"]
        Eng["Engineer"]
        Push["mds push"]
    end

    subgraph Store["Storage"]
        R2["R2\n(STEP blobs)"]
        PG["Supabase\n(metadata)"]
    end

    subgraph PreLabel["Pre-Labeling"]
        Model["Cloud ML Model"]
    end

    subgraph Annotate["Annotation"]
        Carbon["Carbon App"]
        Annotator["Offshore Annotator"]
    end

    subgraph Version["Versioning"]
        Snap["mds snapshot create"]
    end

    Eng --> Push
    Push --> R2
    Push -->|register part| PG
    PG -->|unannotated queue| Model
    Model -->|write pre-labels| PG
    PG -->|pre-labeled queue| Carbon
    R2 -->|presigned URL| Carbon
    Annotator --> Carbon
    Carbon -->|save corrections| PG
    PG -->|approved parts| Snap

Data Flow

  1. Ingest — Engineer runs mds push to upload STEP files to R2 and register them in dataset_parts with annotation_status = 'unannotated'.

  2. Pre-label — A background job picks up unannotated parts, runs the feature recognition model, and writes results to dataset_annotations with model_assisted = true. The part's status moves to 'pre-labeled'.

  3. Annotate — Offshore annotator opens the annotation tool in Carbon. The tool queries dataset_parts for parts with status 'pre-labeled', loads the STEP file via a presigned R2 URL (generated by a Supabase Edge Function), and displays pre-labels for correction.

  4. Review — Annotator submits corrections. The annotation tool writes to dataset_annotations and updates the part's status to 'in-review' or 'approved'.

  5. Snapshot — When a batch of annotations is complete, an engineer runs mds snapshot create to freeze the current state of the catalog.
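
The status lifecycle and the batch pre-label step above can be sketched as a small state machine. This is a minimal illustration, not the pipeline's actual code: the transition table is an assumption (the text names the statuses but not every allowed move), the in-memory lists stand in for the dataset_parts and dataset_annotations tables, and `run_prelabel_batch`, `advance`, and the `model` callable are hypothetical names.

```python
from enum import Enum
from typing import Callable, List


class AnnotationStatus(str, Enum):
    UNANNOTATED = "unannotated"
    PRE_LABELED = "pre-labeled"
    IN_REVIEW = "in-review"
    APPROVED = "approved"


# Assumed transition table: the design names these statuses, but the exact
# set of legal moves between them is a guess made for this sketch.
ALLOWED = {
    AnnotationStatus.UNANNOTATED: {AnnotationStatus.PRE_LABELED},
    AnnotationStatus.PRE_LABELED: {AnnotationStatus.IN_REVIEW, AnnotationStatus.APPROVED},
    AnnotationStatus.IN_REVIEW: {AnnotationStatus.APPROVED},
    AnnotationStatus.APPROVED: set(),
}


def advance(current: AnnotationStatus, target: AnnotationStatus) -> AnnotationStatus:
    """Return the target status, or raise if the move is not allowed."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition: {current.value} -> {target.value}")
    return target


def run_prelabel_batch(parts: List[dict], model: Callable[[dict], List[dict]]) -> List[dict]:
    """Pre-label every unannotated part and return the new annotation rows.

    `parts` stands in for rows of dataset_parts; the returned list stands in
    for rows written to dataset_annotations.
    """
    annotations = []
    for part in parts:
        status = AnnotationStatus(part["annotation_status"])
        if status is not AnnotationStatus.UNANNOTATED:
            continue  # already pre-labeled, in review, or approved
        for label in model(part):
            annotations.append(
                {"part_id": part["part_id"], "annotation_data": label, "model_assisted": True}
            )
        part["annotation_status"] = advance(status, AnnotationStatus.PRE_LABELED).value
    return annotations
```

Keeping the transition check in one function means the background job and the annotation tool would reject the same illegal moves, for example an approved part being reset to 'unannotated'.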

Batch vs. Live Annotation

The Data Flow above describes batch pre-labeling: a background job processes all unannotated parts before any annotator sees them. An alternative is live inference: the ML model runs on demand when an annotator opens a part.

| Dimension | Batch Pre-Labeling | Live Inference |
| --- | --- | --- |
| Annotator latency | None; pre-labels are ready when the part is opened | Model inference delay (seconds to minutes, depending on model and hardware) |
| Compute cost | Processes all parts, including ones that may never be annotated | Only runs for parts actually opened by an annotator |
| Model freshness | Pre-labels use whichever model version ran at batch time; may go stale if the model is updated | Always uses the latest deployed model |
| Infrastructure | Background worker (cron or queue) with GPU access | Inference endpoint (serverless GPU or persistent service) callable from Carbon |
| Failure handling | Failures retried in batch; annotator never sees them | Failure blocks the annotator; needs fallback to empty labels |
| Scale behavior | Cost grows linearly with corpus size | Cost grows with annotation throughput (parts opened per day) |
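
The compute-cost and scale rows can be made concrete with a rough count of model runs. The sketch below assumes each part needs at most one inference run and ignores re-runs after a model update; the function names and any figures plugged into them are hypothetical.

```python
def batch_runs(corpus_size: int) -> int:
    """Batch pre-labeling runs the model once per part in the corpus."""
    return corpus_size


def live_runs(corpus_size: int, opened_per_day: int, days: int) -> int:
    """Live inference runs once per part opened, capped at the corpus size."""
    return min(corpus_size, opened_per_day * days)


def wasted_batch_runs(corpus_size: int, opened_per_day: int, days: int) -> int:
    """Model runs spent on parts no annotator has opened within the window."""
    return batch_runs(corpus_size) - live_runs(corpus_size, opened_per_day, days)
```

With hypothetical figures of 10,000 parts and 50 parts opened per day, live inference would run 1,500 times over 30 days, while batch pre-labeling would already have run 10,000 times. The gap closes as annotators work through the corpus.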

How live inference would work:

  1. In Carbon, the annotator opens a part whose annotation_status is 'unannotated'.
  2. Carbon calls an inference endpoint (e.g., a Modal or Replicate serverless function) with the part's blob_hash.
  3. The endpoint loads the STEP file from R2, runs the feature recognition model, and returns predicted labels.
  4. Carbon writes the predictions to dataset_annotations with model_assisted = true and displays them for correction.
  5. If the endpoint is unavailable or times out, Carbon falls back to showing empty labels and the annotator works from scratch.
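
The timeout-and-fallback behavior in the steps above might look like the following sketch. It is written in Python for illustration only. The `infer` callable is a hypothetical stand-in for the call to the inference endpoint, the 30-second default timeout is an assumed value, and recording `model_assisted = False` on fallback is an assumption (the text only says the annotator works from scratch).

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def predict_with_fallback(
    infer: Callable[[str], List[dict]],
    blob_hash: str,
    timeout_s: float = 30.0,
) -> dict:
    """Call the inference endpoint; fall back to empty labels on any failure.

    `infer` is a hypothetical wrapper around the endpoint call. It receives
    the part's blob_hash and returns a list of predicted labels.
    """
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(infer, blob_hash)
    try:
        labels = future.result(timeout=timeout_s)
        return {"labels": labels, "model_assisted": True}
    except Exception:
        # Covers both a timeout and an error raised by the endpoint call.
        return {"labels": [], "model_assisted": False}
    finally:
        # Do not block the annotator waiting for a slow call to finish.
        pool.shutdown(wait=False)
```

The caller gets a result in at most roughly `timeout_s` seconds either way, so a slow or failing endpoint degrades to manual labeling rather than blocking the part from opening.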

Recommendation

Start with batch pre-labeling — it is simpler to implement and decouples model inference from the annotation UX. Move to live inference later if the corpus grows large enough that pre-labeling all parts becomes wasteful, or if model iteration speed makes stale pre-labels a problem.

Presigned URL Access

The Carbon web app never holds R2 credentials. Instead, a Supabase Edge Function (or Cloudflare Worker) generates time-limited presigned URLs for read access:

GET /functions/v1/dataset-url?part_id=part-0001
→ { "url": "https://machenit-dataset.r2.cloudflarestorage.com/blobs/a1b2c3...?X-Amz-Expires=3600&..." }

The Edge Function verifies the caller's JWT and checks their role before generating a URL. For annotators, it also verifies the requested part is assigned to them. See Access & Authentication for the authorization logic and the claim-next-part Edge Function that handles part assignment.
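
The order of checks described above can be sketched as follows. This is an illustration under stated assumptions, not the authorization logic documented in Access & Authentication, and it is written in Python rather than the Edge Function's own runtime. The role names, the `assigned_to` field, the `blobs/<blob_hash>` object key, and the `sign` callable (a stand-in for whatever presigning call the function uses) are all hypothetical, and JWT verification is assumed to have happened before `Caller` is built.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Assumed role names; the source does not list them.
READ_ROLES = {"engineer", "annotator"}


@dataclass(frozen=True)
class Caller:
    """Claims taken from an already verified JWT."""
    user_id: str
    role: str


@dataclass(frozen=True)
class Part:
    part_id: str
    blob_hash: str
    assigned_to: Optional[str]  # hypothetical assignment column


class Forbidden(Exception):
    pass


def dataset_url(
    caller: Caller,
    part: Part,
    sign: Callable[[str, int], str],
    expires_s: int = 3600,
) -> dict:
    """Authorize the caller, then return a time-limited read URL."""
    if caller.role not in READ_ROLES:
        raise Forbidden(f"role {caller.role!r} may not read dataset blobs")
    if caller.role == "annotator" and part.assigned_to != caller.user_id:
        raise Forbidden("part is not assigned to this annotator")
    # Sign only after every check has passed.
    return {"url": sign(f"blobs/{part.blob_hash}", expires_s)}
```

Signing last means a rejected request never produces a usable URL, and the 3600-second default mirrors the X-Amz-Expires value in the example response.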

The annotation tool requests this URL from the Edge Function, then uses it to download the STEP file from R2. Write access (uploading new files) goes through the CLI or a protected API endpoint, not the browser.

GD&T and Future Annotation Types

The annotation_type field on dataset_annotations supports multiple annotation workflows without schema changes. Feature labeling uses annotation_type = 'feature'; GD&T annotations use annotation_type = 'gdt' with a different annotation_data structure. The annotation tool renders different UI based on the type.
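
One way to picture type-based rendering is a lookup table keyed on annotation_type. The sketch below is illustrative only: the renderer functions and their return strings are hypothetical placeholders, and the contents of annotation_data are left opaque because the source does not define them.

```python
from typing import Any, Callable, Dict

Renderer = Callable[[Dict[str, Any]], str]


def render_feature(annotation_data: Dict[str, Any]) -> str:
    # Placeholder for the feature-labeling UI.
    return "feature-label editor"


def render_gdt(annotation_data: Dict[str, Any]) -> str:
    # Placeholder for the GD&T UI.
    return "GD&T editor"


# Supporting a new annotation type means adding an entry here,
# with no change to the dataset_annotations schema.
RENDERERS: Dict[str, Renderer] = {
    "feature": render_feature,
    "gdt": render_gdt,
}


def render(annotation: Dict[str, Any]) -> str:
    """Pick the UI for an annotation row based on its annotation_type."""
    annotation_type = annotation["annotation_type"]
    renderer = RENDERERS.get(annotation_type)
    if renderer is None:
        raise ValueError(f"unsupported annotation_type: {annotation_type!r}")
    return renderer(annotation["annotation_data"])
```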