CAD Dataset Infrastructure¶

Engineering Log Entry — March 2026

Storage, versioning, and metadata infrastructure for Anvil's CAD part dataset — enabling reproducible ML experiments, metadata-driven subsetting, and integration with the annotation pipeline.

Problem¶

Anvil maintains a growing corpus of CAD parts (currently 5K+, targeting 10K–20K) used for two purposes:

- ML training: feature recognition models (AAGNet, BRepMFR, BRepFormer) need versioned, labeled datasets with reproducible train/val/test splits.
- AutoCAM benchmarking: the testing suite runs the CAM pipeline against a curated set of parts and tracks metrics over time.

Today these files are scattered across local drives and repos with no versioning, no metadata index, and no way to pull a subset without downloading everything. The annotation tool (migrating into Carbon) has no structured connection to the dataset it annotates. This document designs the infrastructure to fix that.

Key Decisions¶

| Decision | Choice | Rationale |
| --- | --- | --- |
| Object storage | Cloudflare R2 | Free egress, S3-compatible API, already a Cloudflare vendor |
| Metadata catalog | Supabase Postgres | Already used by Carbon, so no new service to stand up |
| Dataset versioning | Snapshot manifests | Lightweight, metadata-driven subsetting, minimal infra overhead vs. DVC/LakeFS |
| File storage strategy | Content-addressed (SHA-256) | Automatic deduplication across sources, immutable blob references |
| Extensible metadata | JSONB `tags` column + GIN index | New classification dimensions require no schema migration; promote to dedicated columns only after they prove stable through use |
| Tagging and categorization | JSONB-first with tag registry | All classification dimensions start in JSONB `tags`; a `tag_definitions` registry table provides validation and autocomplete; graduate to dedicated columns only when a dimension is queried constantly and has settled |
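
As a rough illustration of how a `tag_definitions` registry could validate a `tags` payload before it is written, the sketch below checks tag names, value types, and allowed values in memory. The registry shape (`value_type`, `allowed_values`) and the example values are assumptions made for illustration, not the actual schema; `source` and `difficulty` are used only because the roadmap mentions them as tags.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class TagDefinition:
    """One row of a hypothetical tag_definitions registry."""
    name: str
    value_type: type
    allowed_values: frozenset = frozenset()  # empty means any value of value_type


# Hypothetical registry contents, keyed by tag name.
REGISTRY = {
    "source": TagDefinition("source", str),
    "difficulty": TagDefinition("difficulty", str, frozenset({"easy", "medium", "hard"})),
}


def validate_tags(tags: dict[str, Any], registry: dict[str, TagDefinition]) -> list[str]:
    """Return a list of problems; an empty list means the tags are valid."""
    problems = []
    for key, value in tags.items():
        definition = registry.get(key)
        if definition is None:
            problems.append(f"unknown tag: {key}")
        elif type(value) is not definition.value_type:
            # Exact type match, so True is not accepted where an int is expected.
            problems.append(f"{key}: expected {definition.value_type.__name__}")
        elif definition.allowed_values and value not in definition.allowed_values:
            problems.append(f"{key}: {value!r} is not an allowed value")
    return problems
```

The same registry rows could also feed autocomplete, since the allowed values are already enumerated per tag.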

Architecture Overview¶

At the highest level, data moves through five stages — from ingestion to consumption:

```mermaid
flowchart LR
    Ingest["Ingest\n(mds push)"] --> Store["Store\n(R2 + Postgres)"]
    Store --> Annotate["Annotate\n(Carbon)"]
    Store --> Version["Version\n(Snapshots)"]
    Version --> Consume["Consume\n(ML / Benchmark)"]
    Annotate --> Version
```

The infrastructure supporting this lifecycle has three layers: object storage for the STEP files themselves, a metadata catalog for queryable part attributes, and snapshot manifests for dataset versioning.

```mermaid
flowchart TB
    subgraph Consumers
        ML["ML Training\n(local pulls)"]
        Bench["AutoCAM Benchmark\n(CI/CD)"]
        Annot["Annotation Tool\n(Carbon)"]
    end

    subgraph CLI["Dataset CLI (mds)"]
        Pull["pull / push / query\nsnapshot / pin / diff"]
    end

    subgraph Backend
        PG["Metadata Catalog\n(Supabase Postgres)"]
        R2["Object Storage\n(Cloudflare R2)"]
    end

    ML --> Pull
    Bench --> Pull
    Annot --> PG
    Annot -->|presigned URLs| R2
    Pull --> PG
    Pull --> R2
```

Object Storage — Cloudflare R2¶

STEP files are stored content-addressed by SHA-256 hash in a Cloudflare R2 bucket. Identical files are stored once regardless of how many part IDs reference them.
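
A minimal sketch of content addressing, assuming the `blobs/{sha256}.step` key layout mentioned in the FAQ. The helper names (`sha256_of_stream`, `blob_key`, `plan_uploads`) and the part IDs in the usage note are hypothetical; the point is only that two part IDs with identical bytes resolve to the same key, so the blob is uploaded once.

```python
import hashlib
import io
from typing import BinaryIO


def sha256_of_stream(stream: BinaryIO, chunk_size: int = 1 << 20) -> str:
    """Hash a binary stream in chunks so large STEP files are never fully in memory."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def blob_key(sha256_hex: str) -> str:
    """Object key for a blob, following the blobs/{sha256}.step layout."""
    return f"blobs/{sha256_hex}.step"


def plan_uploads(parts: dict[str, bytes]) -> tuple[dict[str, str], set[str]]:
    """Map each part ID to its blob key and collect the distinct keys to upload."""
    part_to_key = {
        part_id: blob_key(sha256_of_stream(io.BytesIO(content)))
        for part_id, content in parts.items()
    }
    return part_to_key, set(part_to_key.values())
```

Calling `plan_uploads` with three part IDs, two of which carry identical bytes, yields three catalog mappings but only two keys to upload.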

Metadata Catalog — Supabase Postgres¶

Part metadata lives in the existing Supabase Postgres instance that Carbon already uses. This avoids standing up a new service and gives the annotation tool direct access to the catalog through the same database it already queries.

Structured columns cover the known metadata dimensions (geometric, manufacturing, provenance, ML). A JSONB tags column with a GIN index handles everything else — new tag types require no schema migration. When a tag type stabilizes, it can be promoted to a dedicated column.
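
To make the JSONB approach concrete, the sketch below builds a parameterized query that selects parts whose `tags` contain a given set of key/value pairs, using the Postgres `@>` containment operator (which a default GIN index on a JSONB column can serve). The `%s` placeholder style assumes a psycopg-like driver, and the presence of a `part_id` column on `dataset_parts` is also an assumption; the tag values are illustrative.

```python
import json


def tags_filter_query(required_tags: dict) -> tuple[str, tuple[str, ...]]:
    """Build a query for parts whose tags contain every pair in required_tags.

    The JSON document is passed as a bound parameter rather than formatted
    into the SQL text, so tag values cannot alter the statement.
    """
    sql = "SELECT part_id FROM dataset_parts WHERE tags @> %s::jsonb"
    return sql, (json.dumps(required_tags, sort_keys=True),)


# Example: parts tagged with both a source and a difficulty (values hypothetical).
SQL, PARAMS = tags_filter_query({"source": "vendor", "difficulty": "hard"})
```

Adding a new tag type changes only the dictionary passed in; the SQL text and the index stay the same, which is the property the design relies on.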

Versioning — Snapshot Manifests¶

A snapshot is a frozen point-in-time record of the full dataset: which parts are included and which blob each one maps to. Snapshots are stored as JSON manifests in R2 with metadata in Supabase.
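
A snapshot manifest can be pictured as a mapping from part ID to blob hash. The sketch below serializes such a mapping deterministically and diffs two snapshots; the field names (`snapshot`, `parts`) and the diff categories are assumptions for illustration, not the manifest format defined on the Storage & Versioning page.

```python
import json


def build_manifest(snapshot_id: str, part_to_blob: dict[str, str]) -> str:
    """Serialize a snapshot as JSON with sorted keys so identical content
    always produces identical bytes."""
    manifest = {"snapshot": snapshot_id, "parts": part_to_blob}
    return json.dumps(manifest, indent=2, sort_keys=True)


def diff_manifests(old: dict[str, str], new: dict[str, str]) -> dict[str, list[str]]:
    """Compare two part-to-blob mappings and report what changed between them."""
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        # Same part ID in both snapshots, but pointing at a different blob.
        "changed": sorted(k for k in old.keys() & new.keys() if old[k] != new[k]),
    }
```

Because blobs are immutable and content-addressed, a manifest of this shape is enough to reconstruct the exact files behind any past snapshot.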

Components¶

The design is broken into eight component pages, each covering a specific aspect of the infrastructure:

| Component | Description |
| --- | --- |
| Storage & Versioning | R2 bucket layout, content addressing, snapshot manifests, and versioning comparison |
| Metadata Schema | SQL schema for the parts catalog, annotations, snapshots, and pins |
| Tagging & Extensibility | JSONB-first tag design, tag registry, schema evolution guidelines |
| Dataset CLI | `mds` command-line tool: pull, push, query, snapshot, pin |
| Annotation Pipeline | Carbon integration, data flow, batch vs. live pre-labeling, presigned URLs |
| Ingestion & Testing | Geometry extraction pipeline and AutoCAM benchmark integration |
| Access & Authentication | Access tiers, auth flows, RLS policies, credential management |
| Carbon & the Annotation Tool | Carbon architecture overview, how the annotation tool fits in, Supabase interaction patterns |

Implementation Roadmap¶

Phase 1: Foundation¶

- Create the Cloudflare R2 bucket and provision two API tokens (read/write for engineers, read-only for CI)
- Apply the Supabase schema (tables, indexes, RLS policies, user_role() function)
- Build the mds CLI core: push, pull, query, snapshot create/list/diff, login, init
- Configure CI credentials in GitHub Actions secrets
- Write a migration script to ingest the existing 5K+ files: hash, upload, register
- Create snapshot v001 representing the initial corpus

Phase 2: Metadata Enrichment¶

- Run geometry extraction over all existing parts
- Tag known parts with source, difficulty, and any existing labels
- Define initial train/val/test splits
- Create snapshot v002 with enriched metadata
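
The document does not specify how splits are assigned. One common option, sketched here purely as an assumption, is to derive the split from a hash of the part ID so that the assignment is reproducible and stable as new parts arrive. The function name, the `salt` parameter, and the 80/10/10 fractions are all hypothetical.

```python
import hashlib


def assign_split(
    part_id: str,
    val_fraction: float = 0.1,
    test_fraction: float = 0.1,
    salt: str = "split-v1",
) -> str:
    """Deterministically map a part ID to 'train', 'val', or 'test'.

    The same part ID and salt always give the same split, regardless of
    which other parts exist in the corpus.
    """
    digest = hashlib.sha256(f"{salt}:{part_id}".encode("utf-8")).digest()
    # Interpret the first 8 bytes as a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    if bucket < test_fraction:
        return "test"
    if bucket < test_fraction + val_fraction:
        return "val"
    return "train"
```

Whatever method is chosen, recording the resulting split alongside each part in the snapshot keeps the snapshot, not the function, as the source of truth for reproducibility.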

Phase 3: Annotation Integration¶

- Update the Carbon annotation tool to read part queues from dataset_parts
- Wire up presigned R2 URL generation (Supabase Edge Function) with role-based authorization
- Deploy the claim-next-part Edge Function for annotator part assignment
- Apply annotator RLS policies and establish the annotator onboarding process
- Ensure annotation writes go to dataset_annotations
- Set up the model-assisted pre-labeling pipeline

Phase 4: Team Onboarding¶

- Write CLI setup docs (install, configure credentials)
- Create standard subset recipes (training set, benchmark parts, unannotated queue)
- Pin snapshots to existing training runs and benchmarks retroactively

Cost Estimate¶

| Component | Monthly Cost |
| --- | --- |
| Cloudflare R2 storage (200 GB) | ~$3 |
| R2 operations (Class A + B) | ~$1 |
| R2 egress | $0 |
| Supabase (incremental; metadata is small) | $0 |
| Total | ~$4 |

Even at 1 TB, R2 storage would be ~$15/month.
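
The storage figures above follow from a flat per-GB rate. The sketch below reproduces the arithmetic using the rate implied by the table (about $3 for 200 GB, so $0.015 per GB-month); treat that rate as an assumption to re-check against current R2 pricing. It ignores any free tier and operation charges. The egress helper is included only to show how the ~$0.09/GB S3 figure quoted in the FAQ would scale; the 500 GB pull volume in the usage note is hypothetical.

```python
# Rate implied by the cost table: ~$3/month for 200 GB. Verify before relying on it.
R2_STORAGE_USD_PER_GB_MONTH = 0.015


def monthly_storage_cost_usd(stored_gb: float) -> float:
    """Storage-only cost for one month at a flat per-GB rate."""
    return stored_gb * R2_STORAGE_USD_PER_GB_MONTH


def monthly_egress_cost_usd(pulled_gb: float, usd_per_gb: float) -> float:
    """Egress cost for one month; pass 0.0 for a provider with free egress."""
    return pulled_gb * usd_per_gb
```

For example, `monthly_storage_cost_usd(200)` comes to about $3 and `monthly_storage_cost_usd(1000)` to about $15, while pulling 500 GB in a month at $0.09/GB would add about $45 in egress on a provider that charges for it.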

Frequently Asked Questions¶

Q: Can we run the annotation model or Open CASCADE as part of the CLI?

Geometry extraction already runs locally as part of mds push: during ingestion it computes face count, vertex count, volume, bounding box, and basic feature identification. What remains open is which library backs it (Open CASCADE via pythonocc vs. HOOPS Exchange), which depends on what installs cleanly on developer machines.

Annotation pre-labeling is a separate concern. The current design runs it as a batch background job, not inside the CLI. The ML model that generates pre-labels needs GPU access and is decoupled from the ingestion flow — an engineer pushing files should not need a GPU or model weights locally. Pre-labeled annotations are written to dataset_annotations asynchronously, and annotators see them when they open a part in Carbon.

Q: Why Cloudflare R2 instead of S3, and how would we migrate?

R2 was chosen for three reasons: zero egress fees (significant for a dataset that engineers regularly pull subsets of), a fully S3-compatible API (no proprietary lock-in), and an existing vendor relationship (Anvil is already a Cloudflare customer; the docs site runs on Cloudflare Pages). At 200 GB, storage costs roughly $3/month with $0 egress; AWS S3 would add ~$0.09/GB in egress fees on top of similar storage costs.

Migration to S3 (or any S3-compatible store) is straightforward because R2 uses the S3 API. The CLI uses boto3 with an R2 endpoint URL and credentials. Switching providers means changing the endpoint URL and credentials in ~/.mds/config.toml — no code changes to the CLI or metadata catalog. The content-addressed blob layout (blobs/{sha256}.step) is provider-agnostic.

Q: How does the dataset infrastructure relate to the AutoCAM testing suite?

The testing suite runs the CAM pipeline against benchmark parts and tracks metrics over time. The dataset infrastructure provides the storage and versioning layer that the testing suite pulls parts from.

Concretely: benchmark parts live in R2 alongside all other dataset parts, tracked in the metadata catalog. The CAM repository's test manifest (parts.json) references parts by part_id. A CI step pulls the needed files before a benchmark run: mds pull --snapshot v003 --part-ids-from tests/cam_benchmark/parts.json --output assets/parts/. Pinning benchmark runs to a specific dataset snapshot ensures results are reproducible even as the corpus evolves.
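
Conceptually, the pull step resolves the part IDs in the test manifest against the pinned snapshot and fetches the matching blobs. The sketch below shows only that resolution step. It assumes parts.json is a JSON list of objects with a `part_id` field and that the snapshot maps part IDs to SHA-256 hashes; both shapes are assumptions, and `resolve_benchmark_blobs` is a hypothetical helper, not part of the mds CLI.

```python
import json


def resolve_benchmark_blobs(parts_json: str, snapshot_parts: dict[str, str]) -> dict[str, str]:
    """Map each benchmark part ID to its blob key in the pinned snapshot.

    Raises KeyError if the test manifest references a part that the
    snapshot does not contain, so a CI run fails before downloading anything.
    """
    wanted = [entry["part_id"] for entry in json.loads(parts_json)]
    missing = [part_id for part_id in wanted if part_id not in snapshot_parts]
    if missing:
        raise KeyError(f"part IDs not in snapshot: {missing}")
    return {part_id: f"blobs/{snapshot_parts[part_id]}.step" for part_id in wanted}
```

Failing early on a missing part ID keeps a benchmark run from silently covering fewer parts than the manifest lists.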

See the AutoCAM Testing Suite design for the full testing architecture.

Open Questions¶

- Geometry extraction tooling: Should mds push invoke Open CASCADE directly (via pythonocc), use HOOPS Exchange, or call a separate extraction service? Depends on which libraries are easiest to install on developer machines.
- Annotation data schema: What fields does annotation_data contain for feature labels vs. GD&T? Needs input from the annotation tool team.
- Pre-labeling model deployment: Where does the cloud ML model run? Supabase Edge Functions have compute limits; a small GPU instance or a serverless GPU service (Modal, Replicate) may be needed.
- Access control for offshore annotators: Designed. See Access & Authentication for access tiers, RLS policies, and the part-claiming Edge Function.
- Deduplication across sources: If the same STEP file comes from multiple sources, content-addressing handles storage deduplication, but should the catalog show one entry or multiple? Current design: one dataset_parts row per logical part, with source indicating provenance.