CAD Dataset Infrastructure — Ingestion & Testing
Engineering Log Entry — March 2026
Geometry extraction during ingestion and integration with the AutoCAM benchmark testing suite.
Ingestion Pipeline
When new STEP files are pushed, geometric metadata should be extracted automatically:
flowchart LR
New["New STEP files"] --> Push["mds push"]
Push --> Upload["Upload to R2"]
Push --> Register["Register in\ndataset_parts"]
Register --> Extract["Geometry\nextraction"]
Extract --> Update["Update\nmetadata"]
The geometry extraction step uses Open CASCADE (or HOOPS Exchange) to compute face count, vertex count, volume, bounding box, and basic feature identification. For the initial implementation, this runs locally as part of mds push. It can be moved to a background worker (Supabase Edge Function or a small Python service) as volume grows.
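The push flow in the diagram can be sketched as follows. This is a minimal illustration, not the mds implementation: an in-memory dict stands in for the dataset_parts table, and the uploader and extractor are passed in as plain callables rather than calling R2 or Open CASCADE. All names (GeometryMetadata, push_part, the row keys) are hypothetical.

```python
from dataclasses import asdict, dataclass
from pathlib import Path
from typing import Callable, Dict, Tuple


@dataclass
class GeometryMetadata:
    """Fields named in the text above; the types are assumptions."""
    face_count: int
    vertex_count: int
    volume: float
    # (xmin, ymin, zmin, xmax, ymax, zmax)
    bounding_box: Tuple[float, float, float, float, float, float]


# Hypothetical signatures: an uploader returns the stored object key,
# an extractor returns geometry metadata for one STEP file.
Uploader = Callable[[Path], str]
Extractor = Callable[[Path], GeometryMetadata]


def push_part(
    part_id: str,
    step_path: Path,
    upload: Uploader,
    extract: Extractor,
    catalog: Dict[str, dict],
) -> dict:
    """Upload, register, extract geometry, then update the catalog row."""
    object_key = upload(step_path)

    # Register before extraction so the part is tracked even if the
    # geometry kernel fails on a malformed file.
    row = {"part_id": part_id, "object_key": object_key, "geometry": None}
    catalog[part_id] = row

    try:
        row["geometry"] = asdict(extract(step_path))
    except Exception as exc:  # record the failure instead of aborting the push
        row["geometry_error"] = str(exc)
    return row
```

Because the extractor is only a callable here, the same function could be called inline during a push or from a background worker later, which is the migration path described above.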
Estimated throughput: A single STEP file typically processes in a few seconds with Open CASCADE. At ~5 seconds per file, 5K files take ~7 hours single-threaded (25,000 seconds in total) or ~1 hour with 8 parallel workers. This is a one-time cost for the initial migration.
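The arithmetic behind this estimate can be checked with a few lines. The sketch assumes perfect parallel scaling with no I/O or upload contention, so the parallel figure is a lower bound. The function name estimate_hours is illustrative.

```python
import math


def estimate_hours(n_files: int, seconds_per_file: float, workers: int = 1) -> float:
    """Wall-clock hours, assuming ideal scaling across workers."""
    # The busiest worker handles ceil(n_files / workers) files.
    files_per_worker = math.ceil(n_files / workers)
    return files_per_worker * seconds_per_file / 3600


single = estimate_hours(5_000, 5.0)
parallel = estimate_hours(5_000, 5.0, workers=8)
print(f"{single:.1f} h single-threaded, {parallel:.2f} h with 8 workers")
# prints "6.9 h single-threaded, 0.87 h with 8 workers"
```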
AutoCAM Testing Integration
The AutoCAM testing suite uses a parts.json manifest in the CAM repository to define benchmark parts. The dataset infrastructure complements this pattern:
- Benchmark parts live in R2 alongside all other parts, tracked in the metadata catalog.
- The CAM repo's manifest references parts by part_id. A CI step pulls the needed files: mds pull --snapshot v003 --part-ids-from tests/cam_benchmark/parts.json --output assets/parts/
- Benchmark runs pin to a specific dataset snapshot, so results are reproducible even as the corpus evolves.
- As the test corpus grows beyond what belongs in git, STEP files can be removed from the repository and pulled from the dataset store as a CI step.
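A CI job could add a small check after the pull to confirm that every part in the manifest actually arrived. The sketch below is hedged on two assumptions that this page does not specify: that parts.json has the shape {"parts": [{"part_id": "..."}]}, and that pulled files are named <part_id>.step. Both helper names are hypothetical.

```python
import json
from pathlib import Path
from typing import List


def read_part_ids(manifest_path: Path) -> List[str]:
    """Collect part IDs from the manifest.

    Assumed schema: {"parts": [{"part_id": "..."}, ...]}.
    """
    manifest = json.loads(manifest_path.read_text(encoding="utf-8"))
    ids = [entry["part_id"] for entry in manifest["parts"]]

    # A duplicated ID usually signals a copy-paste error in the manifest.
    duplicates = sorted({pid for pid in ids if ids.count(pid) > 1})
    if duplicates:
        raise ValueError(f"duplicate part_id entries: {duplicates}")
    return ids


def missing_parts(part_ids: List[str], assets_dir: Path, suffix: str = ".step") -> List[str]:
    """Return IDs with no matching file in assets_dir (file naming is assumed)."""
    return [pid for pid in part_ids if not (assets_dir / f"{pid}{suffix}").is_file()]
```

Run after the mds pull step, a non-empty result from missing_parts would fail the job before any benchmark starts, so an incomplete pull is not mistaken for a benchmark regression.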