CAD Dataset Infrastructure¶
Engineering Log Entry — March 2026
Storage, versioning, and metadata infrastructure for Anvil's CAD part dataset — enabling reproducible ML experiments, metadata-driven subsetting, and integration with the annotation pipeline.
Problem¶
Anvil maintains a growing corpus of CAD parts (currently 5K+, targeting 10K–20K) used for two purposes:
- ML training — Feature recognition models (AAGNet, BRepMFR, BRepFormer) need versioned, labeled datasets with reproducible train/val/test splits.
- AutoCAM benchmarking — The testing suite runs the CAM pipeline against a curated set of parts and tracks metrics over time.
Today these files are scattered across local drives and repos with no versioning, no metadata index, and no way to pull a subset without downloading everything. The annotation tool (migrating into Carbon) has no structured connection to the dataset it annotates. This document designs the infrastructure to fix that.
Key Decisions¶
| Decision | Choice | Rationale |
|---|---|---|
| Object storage | Cloudflare R2 | Free egress, S3-compatible API, Cloudflare is already an Anvil vendor |
| Metadata catalog | Supabase Postgres | Already used by Carbon — no new service to stand up |
| Dataset versioning | Snapshot manifests | Lightweight, metadata-driven subsetting, minimal infra overhead vs. DVC/LakeFS |
| File storage strategy | Content-addressed (SHA-256) | Automatic deduplication across sources, immutable blob references |
| Extensible metadata | JSONB tags column + GIN index | New classification dimensions require no schema migration; promote to dedicated columns only after they prove stable through use |
| Tagging and categorization | JSONB-first with tag registry | All classification dimensions start in JSONB tags; a tag_definitions registry table provides validation and autocomplete; graduate to dedicated columns only when a dimension is queried constantly and has settled |
Architecture Overview¶
At the highest level, data moves through five stages — from ingestion to consumption:
flowchart LR
Ingest["Ingest\n(mds push)"] --> Store["Store\n(R2 + Postgres)"]
Store --> Annotate["Annotate\n(Carbon)"]
Store --> Version["Version\n(Snapshots)"]
Version --> Consume["Consume\n(ML / Benchmark)"]
Annotate --> Version
The infrastructure supporting this lifecycle has three layers: object storage for the STEP files themselves, a metadata catalog for queryable part attributes, and snapshot manifests for dataset versioning.
flowchart TB
subgraph Consumers
ML["ML Training\n(local pulls)"]
Bench["AutoCAM Benchmark\n(CI/CD)"]
Annot["Annotation Tool\n(Carbon)"]
end
subgraph CLI["Dataset CLI (mds)"]
Pull["pull / push / query\nsnapshot / pin / diff"]
end
subgraph Backend
PG["Metadata Catalog\n(Supabase Postgres)"]
R2["Object Storage\n(Cloudflare R2)"]
end
ML --> Pull
Bench --> Pull
Annot --> PG
Annot -->|presigned URLs| R2
Pull --> PG
Pull --> R2
Object Storage — Cloudflare R2¶
STEP files are stored content-addressed by SHA-256 hash in a Cloudflare R2 bucket. Identical files are stored once regardless of how many part IDs reference them.
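A minimal sketch of content addressing, assuming the `blobs/{sha256}.step` key layout mentioned in the FAQ. The function names are illustrative and not part of the `mds` CLI.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks to bound memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        # iter() with a sentinel stops when read() returns empty bytes (EOF).
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def blob_key(sha256_hex: str) -> str:
    """Map a digest to its object key (assumed layout: blobs/{sha256}.step)."""
    return f"blobs/{sha256_hex}.step"
```

Two files with identical bytes produce the same digest and therefore the same key, which is what makes deduplication automatic: the second upload targets an object that already exists.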
Metadata Catalog — Supabase Postgres¶
Part metadata lives in the existing Supabase Postgres instance that Carbon already uses. This avoids standing up a new service and gives the annotation tool direct access to the catalog through the same database it already queries.
Structured columns cover the known metadata dimensions (geometric, manufacturing, provenance, ML). A JSONB tags column with a GIN index handles everything else — new tag types require no schema migration. When a tag type stabilizes, it can be promoted to a dedicated column.
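As an illustration of metadata-driven subsetting over the tags column, the sketch below builds a parameterized query using the Postgres JSONB containment operator `@>`, which a GIN index on a JSONB column can serve. It assumes the tags column lives on `dataset_parts` and that the driver uses `%s` placeholders (as psycopg does); the function name and tag keys are hypothetical.

```python
import json
from typing import Any


def tag_filter_query(required_tags: dict[str, Any]) -> tuple[str, tuple[str, ...]]:
    """Build a query selecting parts whose tags contain every required key/value.

    Assumed names: table dataset_parts, columns part_id and tags.
    """
    sql = "SELECT part_id FROM dataset_parts WHERE tags @> %s::jsonb"
    # The filter is passed as a bound parameter, never interpolated into the SQL.
    return sql, (json.dumps(required_tags, sort_keys=True),)


sql, params = tag_filter_query({"source": "vendor_a", "difficulty": "hard"})
```

Because the filter is an arbitrary JSON object, a newly introduced tag key can be queried the same way without any schema change.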
Versioning — Snapshot Manifests¶
A snapshot is a frozen point-in-time record of the full dataset: which parts are included and which blob each one maps to. Snapshots are stored as JSON manifests in R2 with metadata in Supabase.
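A rough sketch of what a manifest and a snapshot diff could look like, assuming a manifest is a JSON object mapping each `part_id` to the SHA-256 of its blob. The actual manifest fields are defined on the Storage & Versioning page; the shape and function names here are hypothetical.

```python
import json


def build_manifest(name: str, parts: dict[str, str]) -> str:
    """Serialize a snapshot as JSON: part_id -> SHA-256 of the blob it maps to."""
    return json.dumps({"snapshot": name, "parts": parts}, sort_keys=True, indent=2)


def diff_manifests(old_json: str, new_json: str) -> dict[str, list[str]]:
    """Report part_ids added, removed, or re-pointed to a different blob."""
    old = json.loads(old_json)["parts"]
    new = json.loads(new_json)["parts"]
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "changed": sorted(p for p in old.keys() & new.keys() if old[p] != new[p]),
    }


v001 = build_manifest("v001", {"part-a": "aaa", "part-b": "bbb"})
v002 = build_manifest("v002", {"part-b": "bbb2", "part-c": "ccc"})
changes = diff_manifests(v001, v002)
```

Since blobs are immutable and content-addressed, a manifest only needs to record identifiers and hashes, which keeps each snapshot small even as the corpus grows.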
Components¶
The design is broken into eight component pages, each covering a specific aspect of the infrastructure:
| Component | Description |
|---|---|
| Storage & Versioning | R2 bucket layout, content addressing, snapshot manifests, and versioning comparison |
| Metadata Schema | SQL schema for the parts catalog, annotations, snapshots, and pins |
| Tagging & Extensibility | JSONB-first tag design, tag registry, schema evolution guidelines |
| Dataset CLI | mds command-line tool: pull, push, query, snapshot, pin |
| Annotation Pipeline | Carbon integration, data flow, batch vs. live pre-labeling, presigned URLs |
| Ingestion & Testing | Geometry extraction pipeline and AutoCAM benchmark integration |
| Access & Authentication | Access tiers, auth flows, RLS policies, credential management |
| Carbon & the Annotation Tool | Carbon architecture overview, how the annotation tool fits in, Supabase interaction patterns |
Implementation Roadmap¶
Phase 1: Foundation¶
- Create the Cloudflare R2 bucket and provision two API tokens (read/write for engineers, read-only for CI)
- Apply the Supabase schema (tables, indexes, RLS policies, user_role() function)
- Build the mds CLI core: push, pull, query, snapshot create/list/diff, login, init
- Configure CI credentials in GitHub Actions secrets
- Write a migration script to ingest the existing 5K+ files: hash, upload, register
- Create snapshot v001 representing the initial corpus
Phase 2: Metadata Enrichment¶
- Run geometry extraction over all existing parts
- Tag known parts with source, difficulty, and any existing labels
- Define initial train/val/test splits
- Create snapshot v002 with enriched metadata
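This document does not specify how the train/val/test splits are defined. One common way to make them reproducible is to derive the split from a hash of the part ID, sketched below as an assumption rather than the chosen method; the function name and default percentages are hypothetical.

```python
import hashlib


def assign_split(part_id: str, val_pct: int = 10, test_pct: int = 10) -> str:
    """Assign a part to train/val/test from a hash of its ID.

    The same part_id always lands in the same split, independent of corpus
    size or ingestion order.
    """
    # Reduce the digest to a bucket in [0, 100).
    bucket = int(hashlib.sha256(part_id.encode("utf-8")).hexdigest(), 16) % 100
    if bucket < test_pct:
        return "test"
    if bucket < test_pct + val_pct:
        return "val"
    return "train"
```

With this approach, adding new parts to the corpus never moves an existing part between splits, so earlier evaluation results stay comparable.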
Phase 3: Annotation Integration¶
- Update the Carbon annotation tool to read part queues from dataset_parts
- Wire up presigned R2 URL generation (Supabase Edge Function) with role-based authorization
- Deploy the claim-next-part Edge Function for annotator part assignment
- Apply annotator RLS policies and establish the annotator onboarding process
- Ensure annotation writes go to dataset_annotations
- Set up the model-assisted pre-labeling pipeline
Phase 4: Team Onboarding¶
- Write CLI setup docs (install, configure credentials)
- Create standard subset recipes (training set, benchmark parts, unannotated queue)
- Pin snapshots to existing training runs and benchmarks retroactively
Cost Estimate¶
| Component | Monthly Cost |
|---|---|
| Cloudflare R2 storage (200 GB) | ~$3 |
| R2 operations (Class A + B) | ~$1 |
| R2 egress | $0 |
| Supabase (incremental — metadata is small) | $0 |
| Total | ~$4 |
Even at 1 TB, R2 storage would be ~$15/month.
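The storage figures follow from a flat per-GB rate. A rough sketch of the arithmetic, assuming the rate of about $0.015 per GB-month that the table's own numbers imply (3/200 and 15/1000) and the ~$0.09/GB S3 egress figure quoted in the FAQ. Function names are illustrative, and free-tier allowances are ignored.

```python
def r2_storage_cost_usd(stored_gb: float, usd_per_gb_month: float = 0.015) -> float:
    """Monthly storage cost at a flat per-GB rate (assumed rate, no free tier)."""
    return stored_gb * usd_per_gb_month


def s3_egress_cost_usd(pulled_gb: float, usd_per_gb: float = 0.09) -> float:
    """Egress cost for pulling data out of S3 at the per-GB rate cited in the FAQ."""
    return pulled_gb * usd_per_gb


current = r2_storage_cost_usd(200)      # about 3 USD/month
at_one_tb = r2_storage_cost_usd(1000)   # about 15 USD/month
one_full_pull = s3_egress_cost_usd(200) # about 18 USD per full 200 GB pull from S3
```

The last line shows why egress dominates the comparison: at these rates, a single full pull of the dataset from S3 would cost several times the monthly R2 storage bill.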
Frequently Asked Questions¶
Q: Can we run the annotation model or Open CASCADE as part of the CLI?
In the current design, geometry extraction runs locally as part of mds push: it computes face count, vertex count, volume, bounding box, and basic feature identification during ingestion. Which library performs the extraction (Open CASCADE via pythonocc or HOOPS Exchange) is still open and depends on what installs cleanly on developer machines.
Annotation pre-labeling is a separate concern. The current design runs it as a batch background job, not inside the CLI. The ML model that generates pre-labels needs GPU access and is decoupled from the ingestion flow — an engineer pushing files should not need a GPU or model weights locally. Pre-labeled annotations are written to dataset_annotations asynchronously, and annotators see them when they open a part in Carbon.
Q: Why Cloudflare R2 instead of S3, and how would we migrate?
R2 was chosen for three reasons: zero egress fees (significant for a dataset that engineers regularly pull subsets of), a fully S3-compatible API (no proprietary lock-in), and an existing vendor relationship (Anvil is already a Cloudflare customer; the docs site runs on Cloudflare Pages). At 200 GB, storage costs roughly $3/month with $0 egress; AWS S3 would charge about $0.09 per GB of egress on top of similar storage costs.
Migration to S3 (or any S3-compatible store) is straightforward because R2 uses the S3 API. The CLI uses boto3 with an R2 endpoint URL and credentials. Switching providers means changing the endpoint URL and credentials in ~/.mds/config.toml — no code changes to the CLI or metadata catalog. The content-addressed blob layout (blobs/{sha256}.step) is provider-agnostic.
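To illustrate why a provider switch is configuration-only, the sketch below builds the keyword arguments for `boto3.client("s3", ...)` from a small settings object. The `StoreConfig` field names are hypothetical and do not reflect the real keys in config.toml; `ACCOUNT_ID`, `KEY`, and `SECRET` are placeholders.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class StoreConfig:
    """Connection settings (hypothetical field names, not the real config keys)."""
    access_key_id: str
    secret_access_key: str
    endpoint_url: Optional[str] = None  # None lets boto3 use the default AWS endpoint
    region_name: str = "auto"           # R2 accepts "auto"; AWS needs a real region


def s3_client_kwargs(cfg: StoreConfig) -> dict[str, Any]:
    """Keyword arguments intended for boto3.client("s3", **kwargs)."""
    return {
        "endpoint_url": cfg.endpoint_url,
        "aws_access_key_id": cfg.access_key_id,
        "aws_secret_access_key": cfg.secret_access_key,
        "region_name": cfg.region_name,
    }


r2 = StoreConfig("KEY", "SECRET", endpoint_url="https://ACCOUNT_ID.r2.cloudflarestorage.com")
aws = StoreConfig("KEY", "SECRET", region_name="us-east-1")
# With boto3 installed: client = boto3.client("s3", **s3_client_kwargs(r2))
```

Only the settings object differs between the two providers; the code that uploads and downloads blobs would call the same client methods in both cases.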
Q: How does the dataset infrastructure relate to the AutoCAM testing suite?
The testing suite runs the CAM pipeline against benchmark parts and tracks metrics over time. The dataset infrastructure provides the storage and versioning layer that the testing suite pulls parts from.
Concretely: benchmark parts live in R2 alongside all other dataset parts, tracked in the metadata catalog. The CAM repository's test manifest (parts.json) references parts by part_id. A CI step pulls the needed files before a benchmark run: mds pull --snapshot v003 --part-ids-from tests/cam_benchmark/parts.json --output assets/parts/. Pinning benchmark runs to a specific dataset snapshot ensures results are reproducible even as the corpus evolves.
See the AutoCAM Testing Suite design for the full testing architecture.
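A hedged sketch of the resolution step behind such a pull: given the part IDs from the test manifest and the part-to-hash mapping from a pinned snapshot, compute the blob keys to download. It assumes the IDs have already been parsed out of parts.json (whose structure is not specified here) and that the snapshot maps `part_id` to a SHA-256 hex digest. This is not the `mds` implementation.

```python
def resolve_blob_keys(part_ids: list[str], snapshot_parts: dict[str, str]) -> dict[str, str]:
    """Map each requested part_id to its blob key in the pinned snapshot.

    Raises KeyError listing any part_id the snapshot does not contain, so a
    benchmark cannot silently run on an incomplete part set.
    """
    missing = sorted(set(part_ids) - set(snapshot_parts))
    if missing:
        raise KeyError(f"not in snapshot: {missing}")
    # Assumed key layout: blobs/{sha256}.step
    return {pid: f"blobs/{snapshot_parts[pid]}.step" for pid in part_ids}
```

Failing fast on a missing part keeps metric trends honest: a benchmark run either covers the full curated set for that snapshot or does not run at all.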
Open Questions¶
- Geometry extraction tooling: Should mds push invoke Open CASCADE directly (via pythonocc), use HOOPS Exchange, or call a separate extraction service? Depends on which libraries are easiest to install on developer machines.
- Annotation data schema: What fields does annotation_data contain for feature labels vs. GD&T? Needs input from the annotation tool team.
- Pre-labeling model deployment: Where does the cloud ML model run? Supabase Edge Functions have compute limits; a small GPU instance or a serverless GPU service (Modal, Replicate) may be needed.
- Access control for offshore annotators: Designed. See Access & Authentication for access tiers, RLS policies, and the part-claiming Edge Function.
- Deduplication across sources: If the same STEP file comes from multiple sources, content-addressing handles storage deduplication, but should the catalog show one entry or multiple? Current design: one dataset_parts row per logical part, with source indicating provenance.