CAD Dataset Infrastructure — Design Review & Alternatives¶
Home / Engineering / System Design / CAD Dataset Infrastructure / Design Review & Alternatives
Engineering Log Entry — March 2026
Critical review of the CAD Dataset Infrastructure design — open risks, underspecified areas, and alternative approaches. Intended as a constructive companion to the design, not a replacement. Findings are organized by urgency: issues to resolve before implementation, items to address during implementation, and concerns to track for later.
Purpose¶
This page identifies gaps, risks, and design alternatives surfaced during review of the CAD Dataset Infrastructure design. The goal is to strengthen the design before implementation begins. Each issue includes a severity level, which component it relates to, and a recommended path forward. The Design Alternatives section presents tradeoff comparisons for major decision points — some reinforce the existing design choice, others suggest changes.
Issues¶
Fix Before Building¶
These are correctness or architectural issues that should be resolved before implementation begins.
claim-next-part Race Condition¶
Reference: Access & Authentication — Part Claiming
The claim-next-part Edge Function uses a PostgREST chain: .update().eq('annotation_status', 'pre-labeled').is('annotator_id', null).order().limit(1). This is not atomic — PostgREST translates it into an UPDATE ... WHERE without row-level locking. Two annotators hitting the endpoint simultaneously can claim the same part, because both SELECT the same row before either UPDATE commits.
At the planned initial scale of 1-3 annotators the probability is low, but it is a correctness bug that should not ship.
Recommendation
Replace the PostgREST chain with a Postgres stored procedure using FOR UPDATE SKIP LOCKED, called via Supabase rpc():
CREATE OR REPLACE FUNCTION claim_next_part(p_annotator_id UUID)
RETURNS TEXT AS $$
DECLARE
v_part_id TEXT;
BEGIN
SELECT part_id INTO v_part_id
FROM dataset_parts
WHERE annotation_status = 'pre-labeled'
AND annotator_id IS NULL
ORDER BY created_at ASC
LIMIT 1
FOR UPDATE SKIP LOCKED;
IF v_part_id IS NULL THEN
RETURN NULL;
END IF;
UPDATE dataset_parts
SET annotator_id = p_annotator_id,
annotation_status = 'in-progress',
updated_at = now()
WHERE part_id = v_part_id;
RETURN v_part_id;
END;
$$ LANGUAGE plpgsql SECURITY DEFINER;
The Edge Function simplifies to a single rpc('claim_next_part', { p_annotator_id: user.id }) call after verifying the caller's role.
See Atomic Part Claiming in the design alternatives section for a comparison of approaches.
Annotation Data Model Needs a Defined Structure¶
Reference: Metadata Schema — Annotations
The dataset_annotations table has annotation_data JSONB NOT NULL with no defined structure. The current annotation tool stores face-level annotations (face_id to feature_type mapping) with 37 feature types, instance grouping via connected components on the Attribute Adjacency Graph, and GD&T annotations. The design needs to specify how this maps to dataset_annotations rows — specifically: is each row an entire annotation session per part, one row per feature instance, or one row per face label?
This affects query patterns, RLS performance, and how ML pre-labels merge with human corrections.
Recommendation
Define each dataset_annotations row as one annotation session per part per annotator — the entire annotation state for a part stored as a single JSONB document. This matches the current tool's session model and keeps the row count manageable.
Example annotation_data for annotation_type = 'feature':
{
"schema_version": 1,
"faces": {
"face_12": {"feature_type": "through_hole", "instance_id": "inst_001"},
"face_13": {"feature_type": "through_hole", "instance_id": "inst_001"},
"face_45": {"feature_type": "blind_pocket", "instance_id": "inst_002"}
},
"instances": {
"inst_001": {
"feature_type": "through_hole",
"face_ids": ["face_12", "face_13"],
"parameters": {"diameter_mm": 6.35, "depth_mm": 12.7}
},
"inst_002": {
"feature_type": "blind_pocket",
"face_ids": ["face_45"],
"parameters": {"length_mm": 25.0, "width_mm": 15.0, "depth_mm": 5.0}
}
}
}
A schema_version field allows the structure to evolve without breaking consumers. ML pre-labels use the same structure with model_assisted = true and per-instance confidence scores embedded in the instance objects.
See Annotation Storage Granularity for a comparison of approaches.
Phase A to Phase B Storage Divergence¶
Reference: Carbon & the Annotation Tool — Integration Roadmap
Phase A stores STEP files in Supabase Storage (S3) with a self-contained schema (annotationSession, annotationDocument). Phase B switches to Cloudflare R2 with content-addressed blobs via mds push. The design does not describe how files uploaded during Phase A migrate to R2, or how the two schemas converge.
Engineers will have data in two places with no tooling to reconcile them. Any annotations created during Phase A need a migration path into dataset_annotations.
Recommendation
Two viable approaches:
Skip Supabase Storage for STEP files entirely. Phase A uploads go directly to R2 (content-addressed), and the annotationSession table references blobs by SHA-256 hash from the start. This eliminates the migration but requires presigned URL infrastructure earlier.
Accept the S3/R2 split during Phase A. Before Phase B begins, run a migration script that: (1) hashes all S3-stored STEP files, (2) uploads them to R2 under their content-addressed paths, (3) creates dataset_parts rows pointing to the R2 blobs, (4) maps annotationDocument rows into dataset_annotations format.
Option A is cleaner if presigned URL generation can be stood up early. Option B is more pragmatic if Phase A needs to ship fast with minimal infrastructure.
Address During Implementation¶
Items that do not block starting work but need resolution during the implementation phases.
Snapshots Do Not Capture Annotation State¶
Reference: Storage & Versioning — Snapshot Manifests
Snapshots freeze part_id to blob_hash mappings but not annotation state. A snapshot taken today and replayed in six months would reconstruct the same STEP files but potentially different annotations — any labels modified after the snapshot was created would reflect the newer state, not the state at snapshot time.
For file-level reproducibility this is sufficient. Annotation state can be approximately reconstructed from dataset_annotations by filtering on created_at < snapshot.created_at, but this is fragile if annotations are updated in place rather than appended.
Recommendation
Document this as an explicit limitation of the current snapshot design. For a future enhancement, consider adding an optional annotations_hash to the snapshot manifest — a checksum of all annotation data at snapshot time — so consumers can at least detect drift. Full annotation snapshotting (copying annotation rows into a snapshot-scoped table) can be deferred until ML reproducibility requires it.
See Snapshot Scope for a comparison of approaches.
Snapshot Manifest Redundancy¶
Reference: Storage & Versioning + Metadata Schema — Snapshots and Pins
The same part-to-blob mapping exists in two places: the R2 JSON manifest (snapshots/v003.json) and the dataset_snapshot_parts Postgres table. The design does not designate which is the source of truth. If they diverge — for example, if a manifest upload fails after the table is written — consumers get inconsistent results depending on which they read.
Recommendation
Designate Postgres as the source of truth — it is transactional, queryable, and supports the CLI's mds pull --snapshot queries. The R2 manifest becomes a convenience export for offline or portable use. The mds snapshot create command should write to Postgres first (within a transaction), then generate and upload the R2 JSON. Add a mds snapshot verify command that checks consistency between the two.
OpenCascade.js for STEP Processing Is Risky¶
Reference: Carbon & the Annotation Tool — Phase A
Phase A proposes replacing pythonocc (mature Python bindings to Open CASCADE) with OpenCascade.js (a community-maintained WASM build) running server-side in a Trigger.dev task. This is the riskiest technical bet in the design:
- WASM memory limits — browsers and Node.js WASM runtimes typically cap at 2-4 GB. Complex STEP assemblies can exceed this.
- STEP coverage — OpenCascade.js is a subset of Open CASCADE; not all STEP entities and configurations are supported.
- Maintenance — OpenCascade.js is a small community project. pythonocc has broader adoption and active maintenance.
- The current tool already works with pythonocc — switching introduces risk for no functional gain.
Recommendation
Keep pythonocc as the STEP processing backend for Phase A. Run it as a containerized service — either a Docker container managed by Trigger.dev, a small Cloud Run instance, or a sidecar. The annotation tool's existing tessellation code can be reused with minimal changes. Evaluate OpenCascade.js as an optimization in a later phase, after the annotation tool is stable and the WASM limitations can be tested against the actual part corpus.
See STEP Processing Runtime for a full comparison.
No Review / QA Workflow¶
Reference: Annotation Pipeline — Data Flow
The design references in-review and approved annotation statuses but does not specify who reviews, what happens on rejection, or how annotation quality is measured. At the planned initial scale of 1-3 annotators this can work informally, but the status transitions should be defined explicitly.
Recommendation
Define the annotation status state machine:
stateDiagram-v2
[*] --> unannotated
unannotated --> pre_labeled : batch pre-labeling
pre_labeled --> in_progress : annotator claims
in_progress --> pre_labeled : annotator releases (unclaim)
in_progress --> in_review : annotator submits
in_review --> approved : engineer approves
in_review --> in_progress : engineer rejects (with notes)
approved --> [*]
For the initial implementation: engineers review. An engineer opens the part in the annotation tool, sees the annotator's work, and either approves or rejects with notes. Rejection returns the part to in_progress with the same annotator so they can correct their work.
Add a review_notes TEXT column to dataset_parts (or store in tags) for rejection feedback. Inter-annotator agreement metrics and formal QA workflows can be added when the annotator team scales beyond 3.
Shared R2 Write Key¶
Reference: Access & Authentication — R2 Bucket Credentials
All engineers share a single R2 read/write API token stored in ~/.mds/config.toml. The design lists "per-engineer R2 credentials" as a Phase 3 hardening item, but R2 does not support per-user API tokens — it only supports per-bucket tokens with configurable permissions.
A compromised engineer laptop exposes the entire bucket. There is no per-engineer audit trail at the R2 layer.
Recommendation
Accept the shared key for the initial implementation — R2's token model does not offer a better option. For per-engineer audit trails, route writes through a Cloudflare Worker or Supabase Edge Function that logs the authenticated user (from the Supabase JWT) before proxying the upload to R2. This adds latency to writes but provides accountability. The direct boto3 path remains available for bulk operations like initial migration.
Remove "per-engineer R2 credentials" from the Phase 3 hardening list since R2 does not support this. Replace with "write-audit proxy" as the hardening target.
Track for Later¶
Items that are not urgent but should be tracked as the system scales.
Pre-Labeling Model Deployment¶
Reference: Annotation Pipeline — Pre-label step
The three ML detectors (AAGNet, BrepMFR, BRepFormer) currently run as local Docker containers via the annotation tool's docker-compose.yml. The design's architecture diagram shows a "Cloud ML Model" without specifying where it runs. Modal and Replicate are mentioned as possibilities but not designed.
This does not block Phase 1 (storage/versioning) or Phase 2 (metadata enrichment). Pre-labeling can run locally or be deferred until the annotation pipeline is active. A deployment design is needed before Phase 3.
No Garbage Collection for Orphaned Blobs¶
Reference: Storage & Versioning
Content-addressed storage means blobs are never overwritten, but there is no mechanism to identify or remove blobs that are no longer referenced by any dataset_parts row. At 200 GB and ~$3/month this is a negligible cost concern, but orphaned blobs accumulate over time.
A periodic garbage collection script — list all R2 blob keys, compare against SELECT DISTINCT blob_hash FROM dataset_parts, delete the difference — addresses this. It should run infrequently (monthly) and with a dry-run mode.
No Backup / Disaster Recovery for R2¶
Reference: Storage & Versioning
R2 does not offer cross-region replication by default. The content-addressed design makes backup straightforward: a periodic rclone sync to a secondary R2 bucket, an S3 bucket, or even a local NAS provides disaster recovery. The Supabase Postgres side is covered by Supabase's built-in daily backups and point-in-time recovery.
Cost Estimate Is Incomplete¶
Reference: Overview — Cost Estimate
The $4/month estimate covers R2 storage and operations. It omits:
| Component | Estimated Cost |
|---|---|
| Supabase Edge Function invocations | Free tier likely sufficient; ~$2/month beyond |
| Trigger.dev task execution (STEP processing) | Depends on volume; ~$5-20/month for 1K parts/month |
| GPU compute for ML pre-labeling | $0 if local; ~$10-50/month if cloud (Modal/Replicate) |
| Supabase Postgres compute (incremental) | $0 on free tier; $25/month on Pro if needed |
The total is still modest, but the estimate should reflect all components to avoid surprises.
Design Alternatives¶
Metadata: DuckDB CLI vs. Supabase¶
The design uses Supabase Postgres for all metadata queries, including CLI operations like mds query and mds pull --split train. An alternative is embedding DuckDB in the CLI for local-first querying.
| Dimension | Supabase Postgres (current) | DuckDB in CLI |
|---|---|---|
| Query latency | Network round-trip (~50-200ms) | Local, sub-millisecond |
| Multi-user consistency | Single source of truth | Requires sync mechanism |
| Setup complexity | Already exists (Carbon uses it) | New dependency in CLI |
| Offline capability | Requires network | Works offline after sync |
| Write path | Standard SQL | Read-only (writes still go to Supabase) |
| Annotation queue integration | Native (same database) | Not feasible for real-time |
Verdict
Supabase is the right choice for the primary catalog — it serves both the CLI and the annotation tool from one source of truth. DuckDB could be a useful complement for offline querying: mds query could cache a local DuckDB snapshot of dataset_parts for fast filtering and complex analytical queries. Not a replacement, but a potential Phase 2 enhancement for ML workflows that need repeated local queries.
Self-Contained Datasets: Parquet in R2¶
HuggingFace Datasets stores metadata as Parquet files alongside data files, making datasets self-contained and portable — no database required to understand them.
| Dimension | Supabase + R2 JSON (current) | Parquet files in R2 |
|---|---|---|
| Portability | Requires Supabase access to query | Self-contained, shareable |
| Query capability | Full SQL, real-time | Parquet readers (DuckDB, Pandas, Polars) |
| HuggingFace Datasets compatibility | Requires export script | Native |
| Write complexity | Standard Supabase insert | Rewrite Parquet files on each change |
| Real-time annotation queue | Native (Supabase Realtime) | Not feasible |
| External sharing | Requires API access or data export | Download the directory |
Verdict
Not a replacement for Supabase — the annotation queue and multi-user workflows need a live database. But Parquet is an excellent export format for ML consumption and external sharing. Consider adding mds export --format parquet as a Phase 2 feature that dumps the current catalog (or a snapshot) to Parquet files in R2 alongside the blobs. This makes the dataset consumable by anyone with standard data tools, without Supabase access.
Annotation Storage Granularity¶
Three approaches for mapping annotation data to dataset_annotations rows:
| Approach | Row per... | Pros | Cons |
|---|---|---|---|
| Session blob | Part + annotator | Simple, mirrors current tool's model, single read/write per part | Hard to query individual features across parts |
| Feature instance | Feature instance | Queryable ("all through-holes"), supports partial updates | Many rows per part (10-50+), complex upsert logic |
| Face label | Face | Maximum granularity, ML-friendly | Extremely high row count (100s-1000s per part), slow RLS |
Verdict
Session blob — one JSONB document per part per annotator. This matches the current tool's data model (session JSON), keeps the row count at ~1 per part, and makes the Phase A to Phase B migration straightforward. If feature-level queries become important (e.g., "find all parts with through-holes annotated by a specific annotator"), add a materialized view or extract to a separate analytics table rather than changing the primary storage granularity.
Atomic Part Claiming¶
Three approaches for race-safe annotator part assignment:
| Approach | Implementation | Atomicity | Complexity | PostgREST Compatible |
|---|---|---|---|---|
FOR UPDATE SKIP LOCKED |
Postgres stored procedure via rpc() |
Row-level lock, skips claimed rows | Low (~15 lines SQL) | Yes |
pg_advisory_lock |
Advisory lock on part ID before update | Application-level lock | Medium (must ensure release) | Yes via rpc() |
| Queue table | Separate annotation_queue with dequeue semantics |
Table-level isolation | High (new table, additional sync) | Yes |
Verdict
Stored procedure with FOR UPDATE SKIP LOCKED. This is the standard Postgres pattern for exactly this use case — "give me the next unclaimed row and lock it so nobody else gets it." It works through PostgREST's rpc() endpoint, is atomic by definition, and requires minimal code. See the recommendation above for the implementation.
STEP Processing Runtime¶
Three options for tessellating STEP files into viewer-ready meshes:
| Dimension | pythonocc (container) | OCC.js (WASM / Trigger.dev) | HOOPS Exchange (service) |
|---|---|---|---|
| STEP coverage | Full (Open CASCADE C++ bindings) | Partial (community WASM build) | Broadest (commercial, 30+ formats) |
| Memory handling | Container memory (configurable, GB+) | WASM ~2-4 GB hard limit | Container memory (configurable) |
| Maintenance | Active OSS, wide adoption | Small community project | Commercial support (TechSoft3D) |
| Dependency | Python + Docker | Node.js only (no Python) | License fee + API integration |
| Current status | Already working in annotation tool | Not yet tested at Anvil | Evaluated, SDK available |
| Tessellation quality | Good, configurable precision | Good for supported entities | Excellent |
Verdict
pythonocc in a container for Phase A. The annotation tool already has working tessellation code using pythonocc — reuse it in a containerized service (Docker on Trigger.dev, Cloud Run, or a sidecar). This eliminates the WASM risk and reuses proven code. Evaluate OpenCascade.js as a later optimization if eliminating the Python dependency becomes a priority. HOOPS Exchange remains a fallback if Open CASCADE coverage gaps appear with production parts.
Snapshot Scope¶
What should a snapshot capture?
| Scope | Contents | Reproducibility Level | Implementation Complexity |
|---|---|---|---|
| Files only (current) | part_id to blob_hash mappings |
Same STEP files, potentially different annotations | Low — already designed |
| Files + annotation hash | Above + checksum of annotation state | Detect annotation drift, reconstruct approximately | Medium — hash computation on snapshot create |
| Files + annotation copy | Above + full copy of all annotation rows | Exact annotation state at snapshot time | High — snapshot-scoped annotation table |
| Full catalog state | Above + tags, metadata, splits | Complete point-in-time database state | Very high — essentially a database dump |
Verdict
Files only is correct for the initial implementation. Add annotation hash as a Phase 2 enhancement — it provides drift detection without the storage cost of copying annotation rows. Full annotation copying or catalog snapshots are overkill at current scale; Supabase's point-in-time recovery covers disaster scenarios, and dataset_annotations.created_at timestamps allow approximate historical reconstruction.
Summary of Recommendations¶
| # | Issue | Priority | Recommendation | Affects |
|---|---|---|---|---|
| 1 | claim-next-part race condition |
Fix before building | Stored procedure with FOR UPDATE SKIP LOCKED |
Access & Auth |
| 2 | Annotation data model undefined | Fix before building | One session blob per part per annotator, with schema_version |
Metadata Schema |
| 3 | Phase A/B storage divergence | Fix before building | Start with R2 in Phase A, or define S3-to-R2 migration script | Carbon & Annotation Tool |
| 4 | Snapshots miss annotation state | During implementation | Document as file-level only; add annotation hash in Phase 2 | Storage & Versioning |
| 5 | Snapshot manifest redundancy | During implementation | Postgres is source of truth; R2 JSON is convenience export | Storage & Versioning |
| 6 | OCC.js STEP processing risk | During implementation | Use pythonocc container for Phase A; evaluate OCC.js later | Carbon & Annotation Tool |
| 7 | No review/QA workflow | During implementation | Define status state machine; engineer reviews initially | Annotation Pipeline |
| 8 | Shared R2 write key | During implementation | Accept shared key; add write-audit proxy for accountability | Access & Auth |
| 9 | Pre-labeling deployment | Track for later | Design cloud deployment before Phase 3 | Annotation Pipeline |
| 10 | No blob garbage collection | Track for later | Monthly orphan scan script with dry-run mode | Storage & Versioning |
| 11 | No R2 backup | Track for later | Periodic rclone sync to secondary bucket |
Storage & Versioning |
| 12 | Incomplete cost estimate | Track for later | Add Edge Function, Trigger.dev, and GPU cost estimates | Overview |