CAD Dataset Infrastructure — Annotation Pipeline
Engineering Log Entry — March 2026
Integration between the dataset infrastructure and Carbon's annotation workflow — data flow, pre-labeling strategy, and presigned URL access.
The annotation tool is migrating into Carbon, which already uses Supabase as its backend. This design makes the dataset catalog a natural extension of the same database.
flowchart LR
subgraph Ingest["Ingestion"]
Eng["Engineer"]
Push["mds push"]
end
subgraph Store["Storage"]
R2["R2\n(STEP blobs)"]
PG["Supabase\n(metadata)"]
end
subgraph PreLabel["Pre-Labeling"]
Model["Cloud ML Model"]
end
subgraph Annotate["Annotation"]
Carbon["Carbon App"]
Annotator["Offshore Annotator"]
end
subgraph Version["Versioning"]
Snap["mds snapshot create"]
end
Eng --> Push
Push --> R2
Push -->|register part| PG
PG -->|unannotated queue| Model
Model -->|write pre-labels| PG
PG -->|pre-labeled queue| Carbon
R2 -->|presigned URL| Carbon
Annotator --> Carbon
Carbon -->|save corrections| PG
PG -->|approved parts| Snap
Data Flow
- Ingest: An engineer runs `mds push` to upload STEP files to R2 and register them in `dataset_parts` with `annotation_status = 'unannotated'`.
- Pre-label: A background job picks up unannotated parts, runs the feature recognition model, and writes results to `dataset_annotations` with `model_assisted = true`. The part's status moves to `'pre-labeled'`.
- Annotate: An offshore annotator opens the annotation tool in Carbon. The tool queries `dataset_parts` for parts with status `'pre-labeled'`, loads the STEP file via a presigned R2 URL (generated by a Supabase Edge Function), and displays pre-labels for correction.
- Review: The annotator submits corrections. The annotation tool writes to `dataset_annotations` and updates the part's status to `'in-review'` or `'approved'`.
- Snapshot: When a batch of annotations is complete, an engineer runs `mds snapshot create` to freeze the current state of the catalog.
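The status values in the steps above form a small state machine. The following is a minimal sketch of that lifecycle, not the project's implementation: the function name `advance_status` is hypothetical, and the `'in-review'` to `'approved'` edge is an assumption, since the text only says a submission moves a part to one of those two states.

```python
# Allowed annotation_status transitions, inferred from the Data Flow steps.
# Assumption: a part in 'in-review' can later move to 'approved'.
ALLOWED_TRANSITIONS = {
    "unannotated": {"pre-labeled"},
    "pre-labeled": {"in-review", "approved"},
    "in-review": {"approved"},
    "approved": set(),
}


def advance_status(current: str, new: str) -> str:
    """Return the new status, or raise ValueError if the move is not allowed."""
    if current not in ALLOWED_TRANSITIONS:
        raise ValueError(f"unknown status: {current!r}")
    if new not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"illegal transition: {current!r} -> {new!r}")
    return new
```

Guarding writes to `annotation_status` with a check like this would keep a part from skipping the pre-labeling step or being reopened after approval by accident.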
Batch vs. Live Annotation
The Data Flow above describes batch pre-labeling: a background job processes all unannotated parts before any annotator sees them. An alternative is live inference: the ML model runs on demand when an annotator opens a part.
| Dimension | Batch Pre-Labeling | Live Inference |
|---|---|---|
| Annotator latency | None — pre-labels ready when part is opened | Model inference delay (seconds to minutes depending on model and hardware) |
| Compute cost | Processes all parts, including ones that may never be annotated | Only runs for parts actually opened by an annotator |
| Model freshness | Pre-labels use whichever model version ran at batch time; may go stale if the model is updated | Always uses the latest deployed model |
| Infrastructure | Background worker (cron or queue) with GPU access | Inference endpoint (serverless GPU or persistent service) callable from Carbon |
| Failure handling | Failures retried in batch; annotator never sees them | Failure blocks the annotator; needs fallback to empty labels |
| Scale behavior | Cost grows linearly with corpus size | Cost grows with annotation throughput (parts opened per day) |
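The batch column can be sketched as a single pass over the catalog. This is an illustration under stated assumptions: plain Python lists stand in for the `dataset_parts` and `dataset_annotations` tables, `predict` is a hypothetical stand-in for the feature recognition model, and `run_prelabel_batch` is not an existing function in the codebase.

```python
from typing import Callable


def run_prelabel_batch(
    parts: list[dict],
    annotations: list[dict],
    predict: Callable[[str], list],
) -> list[str]:
    """Pre-label every unannotated part; return the ids of parts that failed."""
    failed = []
    for part in parts:
        if part["annotation_status"] != "unannotated":
            continue
        try:
            labels = predict(part["blob_hash"])
        except Exception:
            # Leave the status as 'unannotated' so the next run retries the part.
            failed.append(part["part_id"])
            continue
        annotations.append({
            "part_id": part["part_id"],
            "annotation_type": "feature",
            "annotation_data": labels,
            "model_assisted": True,
        })
        part["annotation_status"] = "pre-labeled"
    return failed
```

Because a failed part keeps its `'unannotated'` status, it never enters the pre-labeled queue, which matches the table's point that annotators do not see batch failures.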
How live inference would work:
- The annotator opens a part in Carbon with `annotation_status = 'unannotated'`.
- Carbon calls an inference endpoint (e.g., a Modal or Replicate serverless function) with the part's `blob_hash`.
- The endpoint loads the STEP file from R2, runs the feature recognition model, and returns predicted labels.
- Carbon writes the predictions to `dataset_annotations` with `model_assisted = true` and displays them for correction.
- If the endpoint is unavailable or times out, Carbon falls back to showing empty labels and the annotator works from scratch.
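The fallback in the last step is the part most worth pinning down. A minimal sketch follows, assuming the endpoint client is passed in as a callable that raises `TimeoutError` or `ConnectionError` on failure; `labels_on_open`, `call_endpoint`, and the 30-second default are all hypothetical, and a list again stands in for `dataset_annotations`.

```python
from typing import Callable


def labels_on_open(
    part: dict,
    annotations: list[dict],
    call_endpoint: Callable[[str, float], list],
    timeout_s: float = 30.0,  # assumed default, not taken from the design
) -> list:
    """Return labels to display; fall back to [] if inference is unavailable."""
    try:
        labels = call_endpoint(part["blob_hash"], timeout_s)
    except (TimeoutError, ConnectionError):
        # Nothing is written, so the annotator starts from empty labels.
        return []
    annotations.append({
        "part_id": part["part_id"],
        "annotation_type": "feature",
        "annotation_data": labels,
        "model_assisted": True,
    })
    return labels
```

Writing the row only on success keeps `model_assisted = true` meaningful: an annotation created from scratch after a timeout is not mislabeled as model-assisted.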
Recommendation
Start with batch pre-labeling — it is simpler to implement and decouples model inference from the annotation UX. Move to live inference later if the corpus grows large enough that pre-labeling all parts becomes wasteful, or if model iteration speed makes stale pre-labels a problem.
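One way to judge when batch pre-labeling "becomes wasteful" is to track the share of batch compute spent on parts that no annotator ever opens. The sketch below assumes the per-part inference cost is the same in both modes, which ignores cold starts and idle endpoint time; the function name is hypothetical.

```python
def wasted_batch_fraction(corpus_size: int, parts_opened: int) -> float:
    """Share of batch inference runs spent on parts that were never opened."""
    if corpus_size <= 0:
        return 0.0
    opened = min(max(parts_opened, 0), corpus_size)
    return 1 - opened / corpus_size
```

With illustrative numbers, a 10,000-part corpus of which 2,500 parts are ever opened gives `wasted_batch_fraction(10_000, 2_500) == 0.75`: three quarters of the batch compute produced pre-labels nobody used.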
Presigned URL Access
The Carbon web app never accesses R2 directly. A Supabase Edge Function (or Cloudflare Worker) generates time-limited presigned URLs for read access:
GET /functions/v1/dataset-url?part_id=part-0001
→ { "url": "https://machenit-dataset.r2.cloudflarestorage.com/blobs/a1b2c3...?X-Amz-Expires=3600&..." }
The Edge Function verifies the caller's JWT and checks their role before generating a URL. For annotators, it also verifies the requested part is assigned to them. See Access & Authentication for the authorization logic and the claim-next-part Edge Function that handles part assignment.
The annotation tool fetches this URL, then loads the STEP file directly from R2. Write access (uploading new files) goes through the CLI or a protected API endpoint, not the browser.
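The `X-Amz-Expires=3600` parameter in the example response suggests a standard AWS Signature Version 4 query-string presign against R2's S3-compatible API. The sketch below shows only the signing steps, using the Python standard library. It is not the Edge Function's code (which would run in the Edge Function's own runtime and would normally rely on an S3 client library); `presign_get` and its parameters are hypothetical, and the `"auto"` region is an assumption about the R2 configuration.

```python
import datetime as dt
import hashlib
import hmac
from urllib.parse import quote


def _sign(key: bytes, msg: str) -> bytes:
    return hmac.new(key, msg.encode(), hashlib.sha256).digest()


def presign_get(host, path, access_key, secret_key,
                expires=3600, region="auto", now=None):
    """Build a SigV4 presigned GET URL for an S3-compatible object."""
    now = now or dt.datetime.now(dt.timezone.utc)
    amz_date = now.strftime("%Y%m%dT%H%M%SZ")
    datestamp = now.strftime("%Y%m%d")
    scope = f"{datestamp}/{region}/s3/aws4_request"
    params = {
        "X-Amz-Algorithm": "AWS4-HMAC-SHA256",
        "X-Amz-Credential": f"{access_key}/{scope}",
        "X-Amz-Date": amz_date,
        "X-Amz-Expires": str(expires),
        "X-Amz-SignedHeaders": "host",
    }
    # Canonical query string: URL-encoded and sorted by parameter name.
    query = "&".join(
        f"{quote(k, safe='')}={quote(v, safe='')}"
        for k, v in sorted(params.items())
    )
    uri = quote(path, safe="/")
    canonical_request = "\n".join(
        ["GET", uri, query, f"host:{host}\n", "host", "UNSIGNED-PAYLOAD"]
    )
    string_to_sign = "\n".join([
        "AWS4-HMAC-SHA256",
        amz_date,
        scope,
        hashlib.sha256(canonical_request.encode()).hexdigest(),
    ])
    # Derive the signing key: date -> region -> service -> "aws4_request".
    key = _sign(("AWS4" + secret_key).encode(), datestamp)
    for part in (region, "s3", "aws4_request"):
        key = _sign(key, part)
    signature = hmac.new(key, string_to_sign.encode(), hashlib.sha256).hexdigest()
    return f"https://{host}{uri}?{query}&X-Amz-Signature={signature}"
```

Signing happens entirely on the server side with the R2 secret key, so the browser only ever receives a URL that expires after `expires` seconds and never sees the credentials.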
GD&T and Future Annotation Types
The `annotation_type` field on `dataset_annotations` supports multiple annotation workflows without schema changes. Feature labeling uses `annotation_type = 'feature'`; GD&T annotations use `annotation_type = 'gdt'` with a different `annotation_data` structure. The annotation tool renders a different UI for each type.
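Dispatching on `annotation_type` can be sketched as a registry that maps each type to its own validator, so that adding a type means adding one entry rather than changing the schema. The keys checked inside `annotation_data` (`label`, `tolerance`) are purely hypothetical placeholders; the real structures are not specified here.

```python
def validate_feature(data: dict) -> None:
    # Hypothetical shape: a feature annotation carries a 'label' key.
    if "label" not in data:
        raise ValueError("feature annotation needs a 'label'")


def validate_gdt(data: dict) -> None:
    # Hypothetical shape: a GD&T annotation carries a 'tolerance' key.
    if "tolerance" not in data:
        raise ValueError("gdt annotation needs a 'tolerance'")


# One entry per supported annotation_type.
VALIDATORS = {"feature": validate_feature, "gdt": validate_gdt}


def validate_annotation(row: dict) -> None:
    """Validate row['annotation_data'] according to row['annotation_type']."""
    kind = row["annotation_type"]
    validator = VALIDATORS.get(kind)
    if validator is None:
        raise ValueError(f"unsupported annotation_type: {kind!r}")
    validator(row["annotation_data"])
```

The same lookup pattern would apply on the UI side, with the registry mapping each type to the component that renders it.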