CAD Dataset Infrastructure — Ingestion & Testing

Engineering Log Entry — March 2026

This entry covers geometry extraction during ingestion and integration with the AutoCAM benchmark testing suite.

Ingestion Pipeline

When new STEP files are pushed, geometric metadata should be extracted automatically:

```mermaid
flowchart LR
    New["New STEP files"] --> Push["mds push"]
    Push --> Upload["Upload to R2"]
    Push --> Register["Register in\ndataset_parts"]
    Register --> Extract["Geometry\nextraction"]
    Extract --> Update["Update\nmetadata"]
```

The geometry extraction step uses Open CASCADE (or HOOPS Exchange) to compute face count, vertex count, volume, and bounding box, and to perform basic feature identification. In the initial implementation, it runs locally as part of mds push. As volume grows, it can move to a background worker (a Supabase Edge Function or a small Python service).
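One way to keep this step portable between mds push and a later background worker is to hide the CAD kernel behind a small interface. The sketch below is an assumption, not the actual mds code: `GeometryMetadata`, `extract_metadata`, and the `kernel` callable are hypothetical names, and `fake_kernel` stands in for the real Open CASCADE or HOOPS Exchange calls. Feature identification is left out because this entry does not specify its output shape.

```python
from dataclasses import dataclass, asdict
from pathlib import Path
from typing import Callable, Tuple

Vec3 = Tuple[float, float, float]


@dataclass(frozen=True)
class GeometryMetadata:
    """Hypothetical record holding the fields named in this entry."""
    face_count: int
    vertex_count: int
    volume: float
    bbox_min: Vec3
    bbox_max: Vec3


# A kernel is any callable that reads one STEP file and returns a record.
Kernel = Callable[[Path], GeometryMetadata]


def extract_metadata(step_path: Path, kernel: Kernel) -> dict:
    """Run the kernel on one file and return a plain dict for the catalog."""
    record = kernel(step_path)
    if record.face_count < 0 or record.vertex_count < 0:
        raise ValueError(f"negative counts reported for {step_path}")
    row = asdict(record)
    row["file_name"] = step_path.name
    return row


def fake_kernel(step_path: Path) -> GeometryMetadata:
    """Stand-in for a real CAD kernel: always reports a unit cube."""
    return GeometryMetadata(6, 8, 1.0, (0.0, 0.0, 0.0), (1.0, 1.0, 1.0))


row = extract_metadata(Path("bracket.step"), fake_kernel)
print(row["face_count"], row["volume"])
```

Because the caller only depends on the `Kernel` signature, the same function could run inside mds push today and inside a worker later without changes to the metadata update code.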

Estimated throughput: A single STEP file typically processes in a few seconds with Open CASCADE. 5K files at ~5 seconds each is about 25,000 seconds of compute: ~7 hours single-threaded, or ~1 hour with 8 parallel workers, assuming near-linear scaling. This is a one-time cost for the initial migration.
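The estimate above can be reproduced with a few lines of arithmetic. This is a back-of-the-envelope sketch: `estimate_hours` is a hypothetical helper, and it assumes files split evenly across workers with no scheduling or I/O overhead.

```python
import math


def estimate_hours(num_files: int, seconds_per_file: float, workers: int = 1) -> float:
    """Wall-clock hours, assuming an even split and linear scaling."""
    if workers < 1:
        raise ValueError("workers must be >= 1")
    # The slowest worker handles ceil(num_files / workers) files.
    files_per_worker = math.ceil(num_files / workers)
    return files_per_worker * seconds_per_file / 3600


single = estimate_hours(5000, 5.0)
parallel = estimate_hours(5000, 5.0, workers=8)
print(f"single-threaded: {single:.1f} h, 8 workers: {parallel:.1f} h")
# prints "single-threaded: 6.9 h, 8 workers: 0.9 h"
```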

AutoCAM Testing Integration

The AutoCAM testing suite uses a parts.json manifest in the CAM repository to define benchmark parts. The dataset infrastructure complements this pattern:

  • Benchmark parts live in R2 alongside all other parts, tracked in the metadata catalog.
  • The CAM repo's manifest references parts by part_id, and a CI step pulls the needed files with this command: mds pull --snapshot v003 --part-ids-from tests/cam_benchmark/parts.json --output assets/parts/
  • Benchmark runs pin to a specific dataset snapshot, so results are reproducible even as the corpus evolves.
  • As the test corpus grows beyond what belongs in git, STEP files can be removed from the repository and pulled from the dataset store during CI instead.
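The CI pull step from the list above could be wrapped in a small script so the pinned snapshot lives in one place. This is a minimal sketch: the mds pull flags are the ones quoted in this entry, while `build_pull_command`, `pull_benchmark_parts`, and the `PINNED_SNAPSHOT` constant are hypothetical names.

```python
import subprocess
from pathlib import Path
from typing import List

# Hypothetical constant: the snapshot that benchmark runs are pinned to.
PINNED_SNAPSHOT = "v003"


def build_pull_command(snapshot: str, manifest: Path, output_dir: Path) -> List[str]:
    """Build the argv for the mds pull invocation quoted in this entry."""
    return [
        "mds", "pull",
        "--snapshot", snapshot,
        "--part-ids-from", manifest.as_posix(),
        "--output", output_dir.as_posix() + "/",
    ]


def pull_benchmark_parts(manifest: Path, output_dir: Path) -> None:
    """Fetch benchmark STEP files; raises CalledProcessError if mds fails."""
    output_dir.mkdir(parents=True, exist_ok=True)
    cmd = build_pull_command(PINNED_SNAPSHOT, manifest, output_dir)
    subprocess.run(cmd, check=True)


cmd = build_pull_command(
    PINNED_SNAPSHOT,
    Path("tests/cam_benchmark/parts.json"),
    Path("assets/parts"),
)
print(" ".join(cmd))
```

Passing the arguments as a list instead of a shell string avoids quoting problems in CI, and `check=True` makes the job fail when the pull does.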