CAD Dataset Infrastructure — Design Review & Alternatives¶

Home / Engineering / System Design / CAD Dataset Infrastructure / Design Review & Alternatives

Engineering Log Entry — March 2026

Critical review of the CAD Dataset Infrastructure design — open risks, underspecified areas, and alternative approaches. Intended as a constructive companion to the design, not a replacement. Findings are organized by urgency: issues to resolve before implementation, items to address during implementation, and concerns to track for later.

Purpose¶

This page identifies gaps, risks, and design alternatives surfaced during review of the CAD Dataset Infrastructure design. The goal is to strengthen the design before implementation begins. Each issue includes a severity level, which component it relates to, and a recommended path forward. The Design Alternatives section presents tradeoff comparisons for major decision points — some reinforce the existing design choice, others suggest changes.

Issues¶

Fix Before Building¶

These are correctness or architectural issues that should be resolved before implementation begins.

`claim-next-part` Race Condition¶

Reference: Access & Authentication — Part Claiming

The claim-next-part Edge Function uses a PostgREST chain: .update().eq('annotation_status', 'pre-labeled').is('annotator_id', null).order().limit(1). This is not atomic — PostgREST translates it into an UPDATE ... WHERE without row-level locking. Two annotators hitting the endpoint simultaneously can claim the same part, because both SELECT the same row before either UPDATE commits.

At the planned initial scale of 1-3 annotators the probability is low, but it is a correctness bug that should not ship.

Recommendation

Replace the PostgREST chain with a Postgres stored procedure using FOR UPDATE SKIP LOCKED, called via Supabase rpc():

CREATE OR REPLACE FUNCTION claim_next_part(p_annotator_id UUID)
RETURNS TEXT AS $$
DECLARE
    v_part_id TEXT;
BEGIN
    SELECT part_id INTO v_part_id
    FROM dataset_parts
    WHERE annotation_status = 'pre-labeled'
      AND annotator_id IS NULL
    ORDER BY created_at ASC
    LIMIT 1
    FOR UPDATE SKIP LOCKED;

    IF v_part_id IS NULL THEN
        RETURN NULL;
    END IF;

    UPDATE dataset_parts
    SET annotator_id = p_annotator_id,
        annotation_status = 'in-progress',
        updated_at = now()
    WHERE part_id = v_part_id;

    RETURN v_part_id;
END;
$$ LANGUAGE plpgsql SECURITY DEFINER;

The Edge Function simplifies to a single rpc('claim_next_part', { p_annotator_id: user.id }) call after verifying the caller's role.

See Atomic Part Claiming in the design alternatives section for a comparison of approaches.

Annotation Data Model Needs a Defined Structure¶

Reference: Metadata Schema — Annotations

The dataset_annotations table has annotation_data JSONB NOT NULL with no defined structure. The current annotation tool stores face-level annotations (face_id to feature_type mapping) with 37 feature types, instance grouping via connected components on the Attribute Adjacency Graph, and GD&T annotations. The design needs to specify how this maps to dataset_annotations rows — specifically: is each row an entire annotation session per part, one row per feature instance, or one row per face label?

This affects query patterns, RLS performance, and how ML pre-labels merge with human corrections.

Recommendation

Define each dataset_annotations row as one annotation session per part per annotator — the entire annotation state for a part stored as a single JSONB document. This matches the current tool's session model and keeps the row count manageable.

Example annotation_data for annotation_type = 'feature':

{
  "schema_version": 1,
  "faces": {
    "face_12": {"feature_type": "through_hole", "instance_id": "inst_001"},
    "face_13": {"feature_type": "through_hole", "instance_id": "inst_001"},
    "face_45": {"feature_type": "blind_pocket", "instance_id": "inst_002"}
  },
  "instances": {
    "inst_001": {
      "feature_type": "through_hole",
      "face_ids": ["face_12", "face_13"],
      "parameters": {"diameter_mm": 6.35, "depth_mm": 12.7}
    },
    "inst_002": {
      "feature_type": "blind_pocket",
      "face_ids": ["face_45"],
      "parameters": {"length_mm": 25.0, "width_mm": 15.0, "depth_mm": 5.0}
    }
  }
}

A schema_version field allows the structure to evolve without breaking consumers. ML pre-labels use the same structure with model_assisted = true and per-instance confidence scores embedded in the instance objects.

See Annotation Storage Granularity for a comparison of approaches.

Phase A to Phase B Storage Divergence¶

Reference: Carbon & the Annotation Tool — Integration Roadmap

Phase A stores STEP files in Supabase Storage (S3) with a self-contained schema (annotationSession, annotationDocument). Phase B switches to Cloudflare R2 with content-addressed blobs via mds push. The design does not describe how files uploaded during Phase A migrate to R2, or how the two schemas converge.

Engineers will have data in two places with no tooling to reconcile them. Any annotations created during Phase A need a migration path into dataset_annotations.

Recommendation

Two viable approaches:

Option A: Start with R2 from Phase AOption B: Define a migration script

Skip Supabase Storage for STEP files entirely. Phase A uploads go directly to R2 (content-addressed), and the annotationSession table references blobs by SHA-256 hash from the start. This eliminates the migration but requires presigned URL infrastructure earlier.

Accept the S3/R2 split during Phase A. Before Phase B begins, run a migration script that: (1) hashes all S3-stored STEP files, (2) uploads them to R2 under their content-addressed paths, (3) creates dataset_parts rows pointing to the R2 blobs, (4) maps annotationDocument rows into dataset_annotations format.

Option A is cleaner if presigned URL generation can be stood up early. Option B is more pragmatic if Phase A needs to ship fast with minimal infrastructure.

Address During Implementation¶

Items that do not block starting work but need resolution during the implementation phases.

Snapshots Do Not Capture Annotation State¶

Reference: Storage & Versioning — Snapshot Manifests

Snapshots freeze part_id to blob_hash mappings but not annotation state. A snapshot taken today and replayed in six months would reconstruct the same STEP files but potentially different annotations — any labels modified after the snapshot was created would reflect the newer state, not the state at snapshot time.

For file-level reproducibility this is sufficient. Annotation state can be approximately reconstructed from dataset_annotations by filtering on created_at < snapshot.created_at, but this is fragile if annotations are updated in place rather than appended.

Recommendation

Document this as an explicit limitation of the current snapshot design. For a future enhancement, consider adding an optional annotations_hash to the snapshot manifest — a checksum of all annotation data at snapshot time — so consumers can at least detect drift. Full annotation snapshotting (copying annotation rows into a snapshot-scoped table) can be deferred until ML reproducibility requires it.

See Snapshot Scope for a comparison of approaches.

Snapshot Manifest Redundancy¶

Reference: Storage & Versioning + Metadata Schema — Snapshots and Pins

The same part-to-blob mapping exists in two places: the R2 JSON manifest (snapshots/v003.json) and the dataset_snapshot_parts Postgres table. The design does not designate which is the source of truth. If they diverge — for example, if a manifest upload fails after the table is written — consumers get inconsistent results depending on which they read.

Recommendation

Designate Postgres as the source of truth — it is transactional, queryable, and supports the CLI's mds pull --snapshot queries. The R2 manifest becomes a convenience export for offline or portable use. The mds snapshot create command should write to Postgres first (within a transaction), then generate and upload the R2 JSON. Add a mds snapshot verify command that checks consistency between the two.

OpenCascade.js for STEP Processing Is Risky¶

Reference: Carbon & the Annotation Tool — Phase A

Phase A proposes replacing pythonocc (mature Python bindings to Open CASCADE) with OpenCascade.js (a community-maintained WASM build) running server-side in a Trigger.dev task. This is the riskiest technical bet in the design:

WASM memory limits — browsers and Node.js WASM runtimes typically cap at 2-4 GB. Complex STEP assemblies can exceed this.
STEP coverage — OpenCascade.js is a subset of Open CASCADE; not all STEP entities and configurations are supported.
Maintenance — OpenCascade.js is a small community project. pythonocc has broader adoption and active maintenance.
The current tool already works with pythonocc — switching introduces risk for no functional gain.

Recommendation

Keep pythonocc as the STEP processing backend for Phase A. Run it as a containerized service — either a Docker container managed by Trigger.dev, a small Cloud Run instance, or a sidecar. The annotation tool's existing tessellation code can be reused with minimal changes. Evaluate OpenCascade.js as an optimization in a later phase, after the annotation tool is stable and the WASM limitations can be tested against the actual part corpus.

See STEP Processing Runtime for a full comparison.

No Review / QA Workflow¶

Reference: Annotation Pipeline — Data Flow

The design references in-review and approved annotation statuses but does not specify who reviews, what happens on rejection, or how annotation quality is measured. At the planned initial scale of 1-3 annotators this can work informally, but the status transitions should be defined explicitly.

Recommendation

Define the annotation status state machine:

stateDiagram-v2
    [*] --> unannotated
    unannotated --> pre_labeled : batch pre-labeling
    pre_labeled --> in_progress : annotator claims
    in_progress --> pre_labeled : annotator releases (unclaim)
    in_progress --> in_review : annotator submits
    in_review --> approved : engineer approves
    in_review --> in_progress : engineer rejects (with notes)
    approved --> [*]

For the initial implementation: engineers review. An engineer opens the part in the annotation tool, sees the annotator's work, and either approves or rejects with notes. Rejection returns the part to in_progress with the same annotator so they can correct their work.

Add a review_notes TEXT column to dataset_parts (or store in tags) for rejection feedback. Inter-annotator agreement metrics and formal QA workflows can be added when the annotator team scales beyond 3.

Shared R2 Write Key¶

Reference: Access & Authentication — R2 Bucket Credentials

All engineers share a single R2 read/write API token stored in ~/.mds/config.toml. The design lists "per-engineer R2 credentials" as a Phase 3 hardening item, but R2 does not support per-user API tokens — it only supports per-bucket tokens with configurable permissions.

A compromised engineer laptop exposes the entire bucket. There is no per-engineer audit trail at the R2 layer.

Recommendation

Accept the shared key for the initial implementation — R2's token model does not offer a better option. For per-engineer audit trails, route writes through a Cloudflare Worker or Supabase Edge Function that logs the authenticated user (from the Supabase JWT) before proxying the upload to R2. This adds latency to writes but provides accountability. The direct boto3 path remains available for bulk operations like initial migration.

Remove "per-engineer R2 credentials" from the Phase 3 hardening list since R2 does not support this. Replace with "write-audit proxy" as the hardening target.

Track for Later¶

Items that are not urgent but should be tracked as the system scales.

Pre-Labeling Model Deployment¶

Reference: Annotation Pipeline — Pre-label step

The three ML detectors (AAGNet, BrepMFR, BRepFormer) currently run as local Docker containers via the annotation tool's docker-compose.yml. The design's architecture diagram shows a "Cloud ML Model" without specifying where it runs. Modal and Replicate are mentioned as possibilities but not designed.

This does not block Phase 1 (storage/versioning) or Phase 2 (metadata enrichment). Pre-labeling can run locally or be deferred until the annotation pipeline is active. A deployment design is needed before Phase 3.

No Garbage Collection for Orphaned Blobs¶

Reference: Storage & Versioning

Content-addressed storage means blobs are never overwritten, but there is no mechanism to identify or remove blobs that are no longer referenced by any dataset_parts row. At 200 GB and ~$3/month this is a negligible cost concern, but orphaned blobs accumulate over time.

A periodic garbage collection script — list all R2 blob keys, compare against SELECT DISTINCT blob_hash FROM dataset_parts, delete the difference — addresses this. It should run infrequently (monthly) and with a dry-run mode.

No Backup / Disaster Recovery for R2¶

Reference: Storage & Versioning

R2 does not offer cross-region replication by default. The content-addressed design makes backup straightforward: a periodic rclone sync to a secondary R2 bucket, an S3 bucket, or even a local NAS provides disaster recovery. The Supabase Postgres side is covered by Supabase's built-in daily backups and point-in-time recovery.

Cost Estimate Is Incomplete¶

Reference: Overview — Cost Estimate

The $4/month estimate covers R2 storage and operations. It omits:

Component	Estimated Cost
Supabase Edge Function invocations	Free tier likely sufficient; ~$2/month beyond
Trigger.dev task execution (STEP processing)	Depends on volume; ~$5-20/month for 1K parts/month
GPU compute for ML pre-labeling	$0 if local; ~$10-50/month if cloud (Modal/Replicate)
Supabase Postgres compute (incremental)	$0 on free tier; $25/month on Pro if needed

The total is still modest, but the estimate should reflect all components to avoid surprises.

Design Alternatives¶

Metadata: DuckDB CLI vs. Supabase¶

The design uses Supabase Postgres for all metadata queries, including CLI operations like mds query and mds pull --split train. An alternative is embedding DuckDB in the CLI for local-first querying.

Dimension	Supabase Postgres (current)	DuckDB in CLI
Query latency	Network round-trip (~50-200ms)	Local, sub-millisecond
Multi-user consistency	Single source of truth	Requires sync mechanism
Setup complexity	Already exists (Carbon uses it)	New dependency in CLI
Offline capability	Requires network	Works offline after sync
Write path	Standard SQL	Read-only (writes still go to Supabase)
Annotation queue integration	Native (same database)	Not feasible for real-time

Verdict

Supabase is the right choice for the primary catalog — it serves both the CLI and the annotation tool from one source of truth. DuckDB could be a useful complement for offline querying: mds query could cache a local DuckDB snapshot of dataset_parts for fast filtering and complex analytical queries. Not a replacement, but a potential Phase 2 enhancement for ML workflows that need repeated local queries.

Self-Contained Datasets: Parquet in R2¶

HuggingFace Datasets stores metadata as Parquet files alongside data files, making datasets self-contained and portable — no database required to understand them.

Dimension	Supabase + R2 JSON (current)	Parquet files in R2
Portability	Requires Supabase access to query	Self-contained, shareable
Query capability	Full SQL, real-time	Parquet readers (DuckDB, Pandas, Polars)
HuggingFace Datasets compatibility	Requires export script	Native
Write complexity	Standard Supabase insert	Rewrite Parquet files on each change
Real-time annotation queue	Native (Supabase Realtime)	Not feasible
External sharing	Requires API access or data export	Download the directory

Verdict

Not a replacement for Supabase — the annotation queue and multi-user workflows need a live database. But Parquet is an excellent export format for ML consumption and external sharing. Consider adding mds export --format parquet as a Phase 2 feature that dumps the current catalog (or a snapshot) to Parquet files in R2 alongside the blobs. This makes the dataset consumable by anyone with standard data tools, without Supabase access.

Annotation Storage Granularity¶

Three approaches for mapping annotation data to dataset_annotations rows:

Approach	Row per...	Pros	Cons
Session blob	Part + annotator	Simple, mirrors current tool's model, single read/write per part	Hard to query individual features across parts
Feature instance	Feature instance	Queryable ("all through-holes"), supports partial updates	Many rows per part (10-50+), complex upsert logic
Face label	Face	Maximum granularity, ML-friendly	Extremely high row count (100s-1000s per part), slow RLS

Verdict

Session blob — one JSONB document per part per annotator. This matches the current tool's data model (session JSON), keeps the row count at ~1 per part, and makes the Phase A to Phase B migration straightforward. If feature-level queries become important (e.g., "find all parts with through-holes annotated by a specific annotator"), add a materialized view or extract to a separate analytics table rather than changing the primary storage granularity.

Atomic Part Claiming¶

Three approaches for race-safe annotator part assignment:

Approach	Implementation	Atomicity	Complexity	PostgREST Compatible
`FOR UPDATE SKIP LOCKED`	Postgres stored procedure via `rpc()`	Row-level lock, skips claimed rows	Low (~15 lines SQL)	Yes
`pg_advisory_lock`	Advisory lock on part ID before update	Application-level lock	Medium (must ensure release)	Yes via `rpc()`
Queue table	Separate `annotation_queue` with dequeue semantics	Table-level isolation	High (new table, additional sync)	Yes

Verdict

Stored procedure with FOR UPDATE SKIP LOCKED. This is the standard Postgres pattern for exactly this use case — "give me the next unclaimed row and lock it so nobody else gets it." It works through PostgREST's rpc() endpoint, is atomic by definition, and requires minimal code. See the recommendation above for the implementation.

STEP Processing Runtime¶

Three options for tessellating STEP files into viewer-ready meshes:

Dimension	pythonocc (container)	OCC.js (WASM / Trigger.dev)	HOOPS Exchange (service)
STEP coverage	Full (Open CASCADE C++ bindings)	Partial (community WASM build)	Broadest (commercial, 30+ formats)
Memory handling	Container memory (configurable, GB+)	WASM ~2-4 GB hard limit	Container memory (configurable)
Maintenance	Active OSS, wide adoption	Small community project	Commercial support (TechSoft3D)
Dependency	Python + Docker	Node.js only (no Python)	License fee + API integration
Current status	Already working in annotation tool	Not yet tested at Anvil	Evaluated, SDK available
Tessellation quality	Good, configurable precision	Good for supported entities	Excellent

Verdict

pythonocc in a container for Phase A. The annotation tool already has working tessellation code using pythonocc — reuse it in a containerized service (Docker on Trigger.dev, Cloud Run, or a sidecar). This eliminates the WASM risk and reuses proven code. Evaluate OpenCascade.js as a later optimization if eliminating the Python dependency becomes a priority. HOOPS Exchange remains a fallback if Open CASCADE coverage gaps appear with production parts.

Snapshot Scope¶

What should a snapshot capture?

Scope	Contents	Reproducibility Level	Implementation Complexity
Files only (current)	`part_id` to `blob_hash` mappings	Same STEP files, potentially different annotations	Low — already designed
Files + annotation hash	Above + checksum of annotation state	Detect annotation drift, reconstruct approximately	Medium — hash computation on snapshot create
Files + annotation copy	Above + full copy of all annotation rows	Exact annotation state at snapshot time	High — snapshot-scoped annotation table
Full catalog state	Above + tags, metadata, splits	Complete point-in-time database state	Very high — essentially a database dump

Verdict

Files only is correct for the initial implementation. Add annotation hash as a Phase 2 enhancement — it provides drift detection without the storage cost of copying annotation rows. Full annotation copying or catalog snapshots are overkill at current scale; Supabase's point-in-time recovery covers disaster scenarios, and dataset_annotations.created_at timestamps allow approximate historical reconstruction.

Summary of Recommendations¶

#	Issue	Priority	Recommendation	Affects
1	`claim-next-part` race condition	Fix before building	Stored procedure with `FOR UPDATE SKIP LOCKED`	Access & Auth
2	Annotation data model undefined	Fix before building	One session blob per part per annotator, with `schema_version`	Metadata Schema
3	Phase A/B storage divergence	Fix before building	Start with R2 in Phase A, or define S3-to-R2 migration script	Carbon & Annotation Tool
4	Snapshots miss annotation state	During implementation	Document as file-level only; add annotation hash in Phase 2	Storage & Versioning
5	Snapshot manifest redundancy	During implementation	Postgres is source of truth; R2 JSON is convenience export	Storage & Versioning
6	OCC.js STEP processing risk	During implementation	Use pythonocc container for Phase A; evaluate OCC.js later	Carbon & Annotation Tool
7	No review/QA workflow	During implementation	Define status state machine; engineer reviews initially	Annotation Pipeline
8	Shared R2 write key	During implementation	Accept shared key; add write-audit proxy for accountability	Access & Auth
9	Pre-labeling deployment	Track for later	Design cloud deployment before Phase 3	Annotation Pipeline
10	No blob garbage collection	Track for later	Monthly orphan scan script with dry-run mode	Storage & Versioning
11	No R2 backup	Track for later	Periodic `rclone sync` to secondary bucket	Storage & Versioning
12	Incomplete cost estimate	Track for later	Add Edge Function, Trigger.dev, and GPU cost estimates	Overview

CAD Dataset Infrastructure — Design Review & Alternatives¶

Purpose¶

Issues¶

Fix Before Building¶

claim-next-part Race Condition¶

Annotation Data Model Needs a Defined Structure¶

Phase A to Phase B Storage Divergence¶

Address During Implementation¶

Snapshots Do Not Capture Annotation State¶

Snapshot Manifest Redundancy¶

OpenCascade.js for STEP Processing Is Risky¶

No Review / QA Workflow¶

Shared R2 Write Key¶

Track for Later¶

Pre-Labeling Model Deployment¶

No Garbage Collection for Orphaned Blobs¶

No Backup / Disaster Recovery for R2¶

Cost Estimate Is Incomplete¶

Design Alternatives¶

Metadata: DuckDB CLI vs. Supabase¶

Self-Contained Datasets: Parquet in R2¶

Annotation Storage Granularity¶

Atomic Part Claiming¶

STEP Processing Runtime¶

Snapshot Scope¶

Summary of Recommendations¶

`claim-next-part` Race Condition¶