Skip to content

CAD Dataset Infrastructure — Design Review & Alternatives

Home / Engineering / System Design / CAD Dataset Infrastructure / Design Review & Alternatives

Engineering Log Entry — March 2026

Critical review of the CAD Dataset Infrastructure design — open risks, underspecified areas, and alternative approaches. Intended as a constructive companion to the design, not a replacement. Findings are organized by urgency: issues to resolve before implementation, items to address during implementation, and concerns to track for later.

Purpose

This page identifies gaps, risks, and design alternatives surfaced during review of the CAD Dataset Infrastructure design. The goal is to strengthen the design before implementation begins. Each issue includes a severity level, which component it relates to, and a recommended path forward. The Design Alternatives section presents tradeoff comparisons for major decision points — some reinforce the existing design choice, others suggest changes.


Issues

Fix Before Building

These are correctness or architectural issues that should be resolved before implementation begins.

claim-next-part Race Condition

Reference: Access & Authentication — Part Claiming

The claim-next-part Edge Function uses a PostgREST chain: .update().eq('annotation_status', 'pre-labeled').is('annotator_id', null).order().limit(1). This is not atomic — PostgREST translates it into an UPDATE ... WHERE without row-level locking. Two annotators hitting the endpoint simultaneously can claim the same part, because both SELECT the same row before either UPDATE commits.

At the planned initial scale of 1-3 annotators the probability is low, but it is a correctness bug that should not ship.

Recommendation

Replace the PostgREST chain with a Postgres stored procedure using FOR UPDATE SKIP LOCKED, called via Supabase rpc():

CREATE OR REPLACE FUNCTION claim_next_part(p_annotator_id UUID)
RETURNS TEXT AS $$
DECLARE
    v_part_id TEXT;
BEGIN
    SELECT part_id INTO v_part_id
    FROM dataset_parts
    WHERE annotation_status = 'pre-labeled'
      AND annotator_id IS NULL
    ORDER BY created_at ASC
    LIMIT 1
    FOR UPDATE SKIP LOCKED;

    IF v_part_id IS NULL THEN
        RETURN NULL;
    END IF;

    UPDATE dataset_parts
    SET annotator_id = p_annotator_id,
        annotation_status = 'in-progress',
        updated_at = now()
    WHERE part_id = v_part_id;

    RETURN v_part_id;
END;
$$ LANGUAGE plpgsql SECURITY DEFINER;

The Edge Function simplifies to a single rpc('claim_next_part', { p_annotator_id: user.id }) call after verifying the caller's role.

See Atomic Part Claiming in the design alternatives section for a comparison of approaches.

Annotation Data Model Needs a Defined Structure

Reference: Metadata Schema — Annotations

The dataset_annotations table has annotation_data JSONB NOT NULL with no defined structure. The current annotation tool stores face-level annotations (face_id to feature_type mapping) with 37 feature types, instance grouping via connected components on the Attribute Adjacency Graph, and GD&T annotations. The design needs to specify how this maps to dataset_annotations rows — specifically: is each row an entire annotation session per part, one row per feature instance, or one row per face label?

This affects query patterns, RLS performance, and how ML pre-labels merge with human corrections.

Recommendation

Define each dataset_annotations row as one annotation session per part per annotator — the entire annotation state for a part stored as a single JSONB document. This matches the current tool's session model and keeps the row count manageable.

Example annotation_data for annotation_type = 'feature':

{
  "schema_version": 1,
  "faces": {
    "face_12": {"feature_type": "through_hole", "instance_id": "inst_001"},
    "face_13": {"feature_type": "through_hole", "instance_id": "inst_001"},
    "face_45": {"feature_type": "blind_pocket", "instance_id": "inst_002"}
  },
  "instances": {
    "inst_001": {
      "feature_type": "through_hole",
      "face_ids": ["face_12", "face_13"],
      "parameters": {"diameter_mm": 6.35, "depth_mm": 12.7}
    },
    "inst_002": {
      "feature_type": "blind_pocket",
      "face_ids": ["face_45"],
      "parameters": {"length_mm": 25.0, "width_mm": 15.0, "depth_mm": 5.0}
    }
  }
}

A schema_version field allows the structure to evolve without breaking consumers. ML pre-labels use the same structure with model_assisted = true and per-instance confidence scores embedded in the instance objects.

See Annotation Storage Granularity for a comparison of approaches.

Phase A to Phase B Storage Divergence

Reference: Carbon & the Annotation Tool — Integration Roadmap

Phase A stores STEP files in Supabase Storage (S3) with a self-contained schema (annotationSession, annotationDocument). Phase B switches to Cloudflare R2 with content-addressed blobs via mds push. The design does not describe how files uploaded during Phase A migrate to R2, or how the two schemas converge.

Engineers will have data in two places with no tooling to reconcile them. Any annotations created during Phase A need a migration path into dataset_annotations.

Recommendation

Two viable approaches:

Skip Supabase Storage for STEP files entirely. Phase A uploads go directly to R2 (content-addressed), and the annotationSession table references blobs by SHA-256 hash from the start. This eliminates the migration but requires presigned URL infrastructure earlier.

Accept the S3/R2 split during Phase A. Before Phase B begins, run a migration script that: (1) hashes all S3-stored STEP files, (2) uploads them to R2 under their content-addressed paths, (3) creates dataset_parts rows pointing to the R2 blobs, (4) maps annotationDocument rows into dataset_annotations format.

Option A is cleaner if presigned URL generation can be stood up early. Option B is more pragmatic if Phase A needs to ship fast with minimal infrastructure.


Address During Implementation

Items that do not block starting work but need resolution during the implementation phases.

Snapshots Do Not Capture Annotation State

Reference: Storage & Versioning — Snapshot Manifests

Snapshots freeze part_id to blob_hash mappings but not annotation state. A snapshot taken today and replayed in six months would reconstruct the same STEP files but potentially different annotations — any labels modified after the snapshot was created would reflect the newer state, not the state at snapshot time.

For file-level reproducibility this is sufficient. Annotation state can be approximately reconstructed from dataset_annotations by filtering on created_at < snapshot.created_at, but this is fragile if annotations are updated in place rather than appended.

Recommendation

Document this as an explicit limitation of the current snapshot design. For a future enhancement, consider adding an optional annotations_hash to the snapshot manifest — a checksum of all annotation data at snapshot time — so consumers can at least detect drift. Full annotation snapshotting (copying annotation rows into a snapshot-scoped table) can be deferred until ML reproducibility requires it.

See Snapshot Scope for a comparison of approaches.

Snapshot Manifest Redundancy

Reference: Storage & Versioning + Metadata Schema — Snapshots and Pins

The same part-to-blob mapping exists in two places: the R2 JSON manifest (snapshots/v003.json) and the dataset_snapshot_parts Postgres table. The design does not designate which is the source of truth. If they diverge — for example, if a manifest upload fails after the table is written — consumers get inconsistent results depending on which they read.

Recommendation

Designate Postgres as the source of truth — it is transactional, queryable, and supports the CLI's mds pull --snapshot queries. The R2 manifest becomes a convenience export for offline or portable use. The mds snapshot create command should write to Postgres first (within a transaction), then generate and upload the R2 JSON. Add a mds snapshot verify command that checks consistency between the two.

OpenCascade.js for STEP Processing Is Risky

Reference: Carbon & the Annotation Tool — Phase A

Phase A proposes replacing pythonocc (mature Python bindings to Open CASCADE) with OpenCascade.js (a community-maintained WASM build) running server-side in a Trigger.dev task. This is the riskiest technical bet in the design:

  • WASM memory limits — browsers and Node.js WASM runtimes typically cap at 2-4 GB. Complex STEP assemblies can exceed this.
  • STEP coverage — OpenCascade.js is a subset of Open CASCADE; not all STEP entities and configurations are supported.
  • Maintenance — OpenCascade.js is a small community project. pythonocc has broader adoption and active maintenance.
  • The current tool already works with pythonocc — switching introduces risk for no functional gain.

Recommendation

Keep pythonocc as the STEP processing backend for Phase A. Run it as a containerized service — either a Docker container managed by Trigger.dev, a small Cloud Run instance, or a sidecar. The annotation tool's existing tessellation code can be reused with minimal changes. Evaluate OpenCascade.js as an optimization in a later phase, after the annotation tool is stable and the WASM limitations can be tested against the actual part corpus.

See STEP Processing Runtime for a full comparison.

No Review / QA Workflow

Reference: Annotation Pipeline — Data Flow

The design references in-review and approved annotation statuses but does not specify who reviews, what happens on rejection, or how annotation quality is measured. At the planned initial scale of 1-3 annotators this can work informally, but the status transitions should be defined explicitly.

Recommendation

Define the annotation status state machine:

stateDiagram-v2
    [*] --> unannotated
    unannotated --> pre_labeled : batch pre-labeling
    pre_labeled --> in_progress : annotator claims
    in_progress --> pre_labeled : annotator releases (unclaim)
    in_progress --> in_review : annotator submits
    in_review --> approved : engineer approves
    in_review --> in_progress : engineer rejects (with notes)
    approved --> [*]

For the initial implementation: engineers review. An engineer opens the part in the annotation tool, sees the annotator's work, and either approves or rejects with notes. Rejection returns the part to in_progress with the same annotator so they can correct their work.

Add a review_notes TEXT column to dataset_parts (or store in tags) for rejection feedback. Inter-annotator agreement metrics and formal QA workflows can be added when the annotator team scales beyond 3.

Shared R2 Write Key

Reference: Access & Authentication — R2 Bucket Credentials

All engineers share a single R2 read/write API token stored in ~/.mds/config.toml. The design lists "per-engineer R2 credentials" as a Phase 3 hardening item, but R2 does not support per-user API tokens — it only supports per-bucket tokens with configurable permissions.

A compromised engineer laptop exposes the entire bucket. There is no per-engineer audit trail at the R2 layer.

Recommendation

Accept the shared key for the initial implementation — R2's token model does not offer a better option. For per-engineer audit trails, route writes through a Cloudflare Worker or Supabase Edge Function that logs the authenticated user (from the Supabase JWT) before proxying the upload to R2. This adds latency to writes but provides accountability. The direct boto3 path remains available for bulk operations like initial migration.

Remove "per-engineer R2 credentials" from the Phase 3 hardening list since R2 does not support this. Replace with "write-audit proxy" as the hardening target.


Track for Later

Items that are not urgent but should be tracked as the system scales.

Pre-Labeling Model Deployment

Reference: Annotation Pipeline — Pre-label step

The three ML detectors (AAGNet, BrepMFR, BRepFormer) currently run as local Docker containers via the annotation tool's docker-compose.yml. The design's architecture diagram shows a "Cloud ML Model" without specifying where it runs. Modal and Replicate are mentioned as possibilities but not designed.

This does not block Phase 1 (storage/versioning) or Phase 2 (metadata enrichment). Pre-labeling can run locally or be deferred until the annotation pipeline is active. A deployment design is needed before Phase 3.

No Garbage Collection for Orphaned Blobs

Reference: Storage & Versioning

Content-addressed storage means blobs are never overwritten, but there is no mechanism to identify or remove blobs that are no longer referenced by any dataset_parts row. At 200 GB and ~$3/month this is a negligible cost concern, but orphaned blobs accumulate over time.

A periodic garbage collection script — list all R2 blob keys, compare against SELECT DISTINCT blob_hash FROM dataset_parts, delete the difference — addresses this. It should run infrequently (monthly) and with a dry-run mode.

No Backup / Disaster Recovery for R2

Reference: Storage & Versioning

R2 does not offer cross-region replication by default. The content-addressed design makes backup straightforward: a periodic rclone sync to a secondary R2 bucket, an S3 bucket, or even a local NAS provides disaster recovery. The Supabase Postgres side is covered by Supabase's built-in daily backups and point-in-time recovery.

Cost Estimate Is Incomplete

Reference: Overview — Cost Estimate

The $4/month estimate covers R2 storage and operations. It omits:

Component Estimated Cost
Supabase Edge Function invocations Free tier likely sufficient; ~$2/month beyond
Trigger.dev task execution (STEP processing) Depends on volume; ~$5-20/month for 1K parts/month
GPU compute for ML pre-labeling $0 if local; ~$10-50/month if cloud (Modal/Replicate)
Supabase Postgres compute (incremental) $0 on free tier; $25/month on Pro if needed

The total is still modest, but the estimate should reflect all components to avoid surprises.


Design Alternatives

Metadata: DuckDB CLI vs. Supabase

The design uses Supabase Postgres for all metadata queries, including CLI operations like mds query and mds pull --split train. An alternative is embedding DuckDB in the CLI for local-first querying.

Dimension Supabase Postgres (current) DuckDB in CLI
Query latency Network round-trip (~50-200ms) Local, sub-millisecond
Multi-user consistency Single source of truth Requires sync mechanism
Setup complexity Already exists (Carbon uses it) New dependency in CLI
Offline capability Requires network Works offline after sync
Write path Standard SQL Read-only (writes still go to Supabase)
Annotation queue integration Native (same database) Not feasible for real-time

Verdict

Supabase is the right choice for the primary catalog — it serves both the CLI and the annotation tool from one source of truth. DuckDB could be a useful complement for offline querying: mds query could cache a local DuckDB snapshot of dataset_parts for fast filtering and complex analytical queries. Not a replacement, but a potential Phase 2 enhancement for ML workflows that need repeated local queries.

Self-Contained Datasets: Parquet in R2

HuggingFace Datasets stores metadata as Parquet files alongside data files, making datasets self-contained and portable — no database required to understand them.

Dimension Supabase + R2 JSON (current) Parquet files in R2
Portability Requires Supabase access to query Self-contained, shareable
Query capability Full SQL, real-time Parquet readers (DuckDB, Pandas, Polars)
HuggingFace Datasets compatibility Requires export script Native
Write complexity Standard Supabase insert Rewrite Parquet files on each change
Real-time annotation queue Native (Supabase Realtime) Not feasible
External sharing Requires API access or data export Download the directory

Verdict

Not a replacement for Supabase — the annotation queue and multi-user workflows need a live database. But Parquet is an excellent export format for ML consumption and external sharing. Consider adding mds export --format parquet as a Phase 2 feature that dumps the current catalog (or a snapshot) to Parquet files in R2 alongside the blobs. This makes the dataset consumable by anyone with standard data tools, without Supabase access.

Annotation Storage Granularity

Three approaches for mapping annotation data to dataset_annotations rows:

Approach Row per... Pros Cons
Session blob Part + annotator Simple, mirrors current tool's model, single read/write per part Hard to query individual features across parts
Feature instance Feature instance Queryable ("all through-holes"), supports partial updates Many rows per part (10-50+), complex upsert logic
Face label Face Maximum granularity, ML-friendly Extremely high row count (100s-1000s per part), slow RLS

Verdict

Session blob — one JSONB document per part per annotator. This matches the current tool's data model (session JSON), keeps the row count at ~1 per part, and makes the Phase A to Phase B migration straightforward. If feature-level queries become important (e.g., "find all parts with through-holes annotated by a specific annotator"), add a materialized view or extract to a separate analytics table rather than changing the primary storage granularity.

Atomic Part Claiming

Three approaches for race-safe annotator part assignment:

Approach Implementation Atomicity Complexity PostgREST Compatible
FOR UPDATE SKIP LOCKED Postgres stored procedure via rpc() Row-level lock, skips claimed rows Low (~15 lines SQL) Yes
pg_advisory_lock Advisory lock on part ID before update Application-level lock Medium (must ensure release) Yes via rpc()
Queue table Separate annotation_queue with dequeue semantics Table-level isolation High (new table, additional sync) Yes

Verdict

Stored procedure with FOR UPDATE SKIP LOCKED. This is the standard Postgres pattern for exactly this use case — "give me the next unclaimed row and lock it so nobody else gets it." It works through PostgREST's rpc() endpoint, is atomic by definition, and requires minimal code. See the recommendation above for the implementation.

STEP Processing Runtime

Three options for tessellating STEP files into viewer-ready meshes:

Dimension pythonocc (container) OCC.js (WASM / Trigger.dev) HOOPS Exchange (service)
STEP coverage Full (Open CASCADE C++ bindings) Partial (community WASM build) Broadest (commercial, 30+ formats)
Memory handling Container memory (configurable, GB+) WASM ~2-4 GB hard limit Container memory (configurable)
Maintenance Active OSS, wide adoption Small community project Commercial support (TechSoft3D)
Dependency Python + Docker Node.js only (no Python) License fee + API integration
Current status Already working in annotation tool Not yet tested at Anvil Evaluated, SDK available
Tessellation quality Good, configurable precision Good for supported entities Excellent

Verdict

pythonocc in a container for Phase A. The annotation tool already has working tessellation code using pythonocc — reuse it in a containerized service (Docker on Trigger.dev, Cloud Run, or a sidecar). This eliminates the WASM risk and reuses proven code. Evaluate OpenCascade.js as a later optimization if eliminating the Python dependency becomes a priority. HOOPS Exchange remains a fallback if Open CASCADE coverage gaps appear with production parts.

Snapshot Scope

What should a snapshot capture?

Scope Contents Reproducibility Level Implementation Complexity
Files only (current) part_id to blob_hash mappings Same STEP files, potentially different annotations Low — already designed
Files + annotation hash Above + checksum of annotation state Detect annotation drift, reconstruct approximately Medium — hash computation on snapshot create
Files + annotation copy Above + full copy of all annotation rows Exact annotation state at snapshot time High — snapshot-scoped annotation table
Full catalog state Above + tags, metadata, splits Complete point-in-time database state Very high — essentially a database dump

Verdict

Files only is correct for the initial implementation. Add annotation hash as a Phase 2 enhancement — it provides drift detection without the storage cost of copying annotation rows. Full annotation copying or catalog snapshots are overkill at current scale; Supabase's point-in-time recovery covers disaster scenarios, and dataset_annotations.created_at timestamps allow approximate historical reconstruction.


Summary of Recommendations

# Issue Priority Recommendation Affects
1 claim-next-part race condition Fix before building Stored procedure with FOR UPDATE SKIP LOCKED Access & Auth
2 Annotation data model undefined Fix before building One session blob per part per annotator, with schema_version Metadata Schema
3 Phase A/B storage divergence Fix before building Start with R2 in Phase A, or define S3-to-R2 migration script Carbon & Annotation Tool
4 Snapshots miss annotation state During implementation Document as file-level only; add annotation hash in Phase 2 Storage & Versioning
5 Snapshot manifest redundancy During implementation Postgres is source of truth; R2 JSON is convenience export Storage & Versioning
6 OCC.js STEP processing risk During implementation Use pythonocc container for Phase A; evaluate OCC.js later Carbon & Annotation Tool
7 No review/QA workflow During implementation Define status state machine; engineer reviews initially Annotation Pipeline
8 Shared R2 write key During implementation Accept shared key; add write-audit proxy for accountability Access & Auth
9 Pre-labeling deployment Track for later Design cloud deployment before Phase 3 Annotation Pipeline
10 No blob garbage collection Track for later Monthly orphan scan script with dry-run mode Storage & Versioning
11 No R2 backup Track for later Periodic rclone sync to secondary bucket Storage & Versioning
12 Incomplete cost estimate Track for later Add Edge Function, Trigger.dev, and GPU cost estimates Overview