CAD Dataset Infrastructure: Storage & Versioning
Engineering Log Entry — March 2026
Object storage design (Cloudflare R2) and dataset versioning via snapshot manifests — the foundation layer for CAD file storage and reproducibility.
Object Storage: Cloudflare R2
STEP files are stored content-addressed by SHA-256 hash in a Cloudflare R2 bucket. Identical files are stored once regardless of how many part IDs reference them.
Bucket layout:
machenit-dataset/
  blobs/
    {sha256}.step        # Content-addressed STEP files
  snapshots/
    v001.json            # Full manifest for each snapshot version
    v002.json
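
As a minimal sketch of the content-addressing scheme, the snippet below hashes a STEP file and derives the object key it would occupy under `blobs/`. The function names `sha256_of_file` and `blob_key` are hypothetical, chosen for illustration; only the `blobs/{sha256}.step` key pattern comes from the layout above.

```python
import hashlib
from pathlib import Path

BLOB_PREFIX = "blobs/"  # prefix taken from the bucket layout


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        # Read fixed-size chunks so large STEP files are never fully in memory.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def blob_key(sha256_hex: str) -> str:
    """Build the object key for a blob (hypothetical helper)."""
    is_hex = all(c in "0123456789abcdef" for c in sha256_hex)
    if len(sha256_hex) != 64 or not is_hex:
        raise ValueError(f"not a lowercase hex SHA-256 digest: {sha256_hex!r}")
    return f"{BLOB_PREFIX}{sha256_hex}.step"
```

Because the key depends only on the file's bytes, two part IDs that point at byte-identical files resolve to the same key, which is what makes the deduplication automatic.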
Why R2:
| Option | Egress Cost | S3-Compatible | Already a Vendor | Verdict |
|---|---|---|---|---|
| Cloudflare R2 | Free | Yes | Yes (docs site) | Chosen |
| AWS S3 | ~$0.09/GB | Yes | No | Viable but more expensive for frequent pulls |
| Supabase Storage | Metered | Limited API | Yes | Not designed for large binary datasets |
| MinIO (self-hosted) | Free | Yes | No | Ops burden of self-hosting |
At 200 GB, R2 storage costs roughly $3/month and egress is free, which matters for a dataset that engineers regularly pull subsets of.
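
The cost comparison can be sketched as simple arithmetic. The storage rate below is an assumption implied by the ~$3/month figure at 200 GB, the S3 egress rate is the approximate value from the table, and the 500 GB monthly pull volume is a made-up number for illustration only.

```python
# Assumption: rate implied by "~$3/month at 200 GB"; check current R2 pricing.
R2_STORAGE_USD_PER_GB_MONTH = 0.015
R2_EGRESS_USD_PER_GB = 0.0    # "Free" in the comparison table
S3_EGRESS_USD_PER_GB = 0.09   # approximate figure from the comparison table


def storage_cost(stored_gb: float, rate: float = R2_STORAGE_USD_PER_GB_MONTH) -> float:
    """Monthly storage cost in USD for a dataset of the given size."""
    return stored_gb * rate


def egress_cost(pulled_gb: float, rate_per_gb: float) -> float:
    """Monthly egress cost in USD for the given pull volume."""
    return pulled_gb * rate_per_gb


if __name__ == "__main__":
    pulled_gb = 500  # hypothetical monthly pull volume across the team
    print(f"R2 storage, 200 GB: ${storage_cost(200):.2f}/month")
    print(f"R2 egress, {pulled_gb} GB: ${egress_cost(pulled_gb, R2_EGRESS_USD_PER_GB):.2f}/month")
    print(f"S3 egress, {pulled_gb} GB: ${egress_cost(pulled_gb, S3_EGRESS_USD_PER_GB):.2f}/month")
```

Under these assumptions, egress on S3 would cost far more than the storage itself once pulls exceed a small fraction of the dataset per month.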
Versioning: Snapshot Manifests
A snapshot is a frozen point-in-time record of the full dataset: which parts are included and which blob each one maps to. Snapshots are stored as JSON manifests in R2 with metadata in Supabase.
This is deliberately lighter than DVC or LakeFS:
| Option | Partial Pulls | Metadata Queries | Infra Overhead | Verdict |
|---|---|---|---|---|
| Snapshot manifests | Native (query → pull) | Via Supabase | Minimal (thin CLI) | Chosen |
| DVC | Poor (directory-level) | None built-in | .dvc files, remote config | Poor fit for metadata-driven subsetting |
| LakeFS | Good | Requires separate index | Server deployment, learning curve | Over-engineered for a team of 5–10 |
| Git LFS | None | None | GitHub storage limits | Too limited |
Snapshot structure:
{
  "version": "v003",
  "created_at": "2026-03-18T10:00:00Z",
  "created_by": "david",
  "parent": "v002",
  "description": "Added 200 annotated parts from batch-7",
  "parts_count": 5200,
  "parts": {
    "part-0001": "sha256:a1b2c3d4...",
    "part-0002": "sha256:e5f6a7b8..."
  }
}
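
One way to produce a manifest of this shape is to derive it from its parent. The sketch below assumes that manifests are immutable and that each new version carries forward the parent's full `parts` map; the helpers `next_version` and `derive_snapshot` are hypothetical and not part of any existing CLI.

```python
from datetime import datetime, timezone
from typing import Optional


def next_version(version: str) -> str:
    """Increment a version label, keeping its zero padding ('v003' -> 'v004')."""
    width = len(version) - 1
    return f"v{int(version[1:]) + 1:0{width}d}"


def derive_snapshot(
    parent: dict,
    added_parts: dict,
    created_by: str,
    description: str,
    now: Optional[datetime] = None,
) -> dict:
    """Build a child manifest from a parent without mutating the parent.

    `now`, if given, is assumed to be a timezone-aware UTC datetime.
    """
    # Later keys win, so a re-annotated part replaces its old blob reference.
    parts = {**parent["parts"], **added_parts}
    created = now or datetime.now(timezone.utc)
    return {
        "version": next_version(parent["version"]),
        "created_at": created.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "created_by": created_by,
        "parent": parent["version"],
        "description": description,
        "parts_count": len(parts),  # always computed, never hand-edited
        "parts": parts,
    }
```

Computing `parts_count` from the map keeps the two fields from drifting apart, and the `parent` field gives each snapshot a traceable lineage.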
Pinning a snapshot version to a training run or benchmark gives full reproducibility — the exact set of files can be reconstructed from the manifest.
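
Reproducibility can be checked mechanically by comparing a local checkout against the pinned manifest. The following sketch assumes files are downloaded as `{part_id}.step` into a single directory, which is a hypothetical layout, and `verify_checkout` is an illustrative name, not an existing tool.

```python
import hashlib
from pathlib import Path


def parse_digest(ref: str) -> str:
    """Split a 'sha256:<hex>' reference and return the hex part."""
    algo, _, hex_digest = ref.partition(":")
    if algo != "sha256" or not hex_digest:
        raise ValueError(f"unsupported blob reference: {ref!r}")
    return hex_digest


def verify_checkout(manifest: dict, root: Path) -> dict:
    """Return {part_id: problem} for every part that is missing or altered.

    An empty result means the directory matches the manifest exactly.
    """
    problems = {}
    for part_id, ref in manifest["parts"].items():
        path = root / f"{part_id}.step"  # assumed local naming scheme
        if not path.is_file():
            problems[part_id] = "missing"
            continue
        # read_bytes() is fine for a sketch; chunked reads suit large files.
        actual = hashlib.sha256(path.read_bytes()).hexdigest()
        if actual != parse_digest(ref):
            problems[part_id] = "hash mismatch"
    return problems
```

A training or benchmark job could run a check like this before starting and refuse to proceed unless the result is empty.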