CAD Dataset Infrastructure: Storage & Versioning

Engineering Log Entry — March 2026

This entry covers the object storage design (Cloudflare R2) and dataset versioning via snapshot manifests. Together they form the foundation layer for CAD file storage and reproducibility.

Object Storage: Cloudflare R2

STEP files are stored content-addressed by SHA-256 hash in a Cloudflare R2 bucket. Identical files are stored once regardless of how many part IDs reference them.

Bucket layout:

machenit-dataset/
  blobs/
    {sha256}.step              # Content-addressed STEP files
  snapshots/
    v001.json                  # Full manifest for each snapshot version
    v002.json
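
As a rough illustration of the content-addressing scheme, the sketch below derives the object key for a STEP file from its SHA-256 hash, following the `blobs/{sha256}.step` layout above. The function names `sha256_of_file` and `blob_key` are hypothetical and not part of any existing tooling; the upload step itself is left out.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 of a file, reading it in chunks
    so large STEP files are never loaded into memory whole."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def blob_key(sha256_hex: str) -> str:
    """Build the object key for a blob: blobs/{sha256}.step
    (hypothetical helper mirroring the bucket layout)."""
    return f"blobs/{sha256_hex}.step"
```

Because the key depends only on file contents, two part IDs that point at byte-identical STEP files resolve to the same key, so an uploader can skip any blob whose key already exists in the bucket.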

Why R2:

| Option | Egress Cost | S3-Compatible | Already a Vendor | Verdict |
| --- | --- | --- | --- | --- |
| Cloudflare R2 | Free | Yes | Yes (docs site) | Chosen |
| AWS S3 | ~$0.09/GB | Yes | No | Viable but more expensive for frequent pulls |
| Supabase Storage | Metered | Limited API | Yes | Not designed for large binary datasets |
| MinIO (self-hosted) | Free | Yes | No | Ops burden of self-hosting |

At 200 GB, R2 storage costs ~$3/month with zero egress fees. That matters for a dataset that engineers pull subsets of regularly: at S3's ~$0.09/GB, a single full 200 GB pull would cost roughly $18 in egress alone.

Versioning: Snapshot Manifests

A snapshot is a frozen point-in-time record of the full dataset: which parts are included and which blob each one maps to. Snapshots are stored as JSON manifests in R2 with metadata in Supabase.

This is deliberately lighter than DVC or LakeFS:

| Option | Partial Pulls | Metadata Queries | Infra Overhead | Verdict |
| --- | --- | --- | --- | --- |
| Snapshot manifests | Native (query → pull) | Via Supabase | Minimal (thin CLI) | Chosen |
| DVC | Poor (directory-level) | None built-in | `.dvc` files, remote config | Poor fit for metadata-driven subsetting |
| LakeFS | Good | Requires separate index | Server deployment, learning curve | Over-engineered for team of 5–10 |
| Git LFS | None | None | GitHub storage limits | Too limited |

Snapshot structure:

{
  "version": "v003",
  "created_at": "2026-03-18T10:00:00Z",
  "created_by": "david",
  "parent": "v002",
  "description": "Added 200 annotated parts from batch-7",
  "parts_count": 5200,
  "parts": {
    "part-0001": "sha256:a1b2c3d4...",
    "part-0002": "sha256:e5f6a7b8..."
  }
}
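
One way a new snapshot could be derived from its parent is sketched below. It assumes manifests are plain dictionaries with the fields shown above and that version labels follow the zero-padded `vNNN` pattern; `next_version` and `derive_snapshot` are hypothetical names, not the actual CLI.

```python
from datetime import datetime, timezone


def next_version(version: str) -> str:
    """'v003' -> 'v004', keeping the zero-padded width (assumed format)."""
    number = version[1:]
    return f"v{int(number) + 1:0{len(number)}d}"


def derive_snapshot(parent: dict, new_parts: dict,
                    created_by: str, description: str) -> dict:
    """Build a child manifest: parent parts plus added or updated parts.

    The parent dict is not modified, so earlier snapshots stay frozen.
    """
    parts = {**parent["parts"], **new_parts}
    return {
        "version": next_version(parent["version"]),
        "created_at": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "created_by": created_by,
        "parent": parent["version"],
        "description": description,
        # Derived from the mapping so the count cannot drift from the contents.
        "parts_count": len(parts),
        "parts": parts,
    }
```

Computing `parts_count` from the mapping, rather than writing it by hand, keeps the two fields consistent in every manifest.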

Pinning a snapshot version to a training run or benchmark makes the dataset reproducible: the exact set of files can be reconstructed from the manifest.
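
To make the reconstruction step concrete, here is a minimal sketch that resolves a pinned manifest (or a subset of its part IDs) to the blob keys that would need to be downloaded. It assumes the `sha256:<hex>` reference format and the `blobs/{sha256}.step` layout shown earlier; `resolve_blob_keys` is a hypothetical helper, and the download itself is omitted.

```python
def resolve_blob_keys(manifest: dict, part_ids=None) -> dict:
    """Map part IDs to object keys for a pinned snapshot.

    With part_ids=None every part in the manifest is resolved;
    otherwise only the requested subset (a partial pull).
    """
    parts = manifest["parts"]
    wanted = list(parts) if part_ids is None else list(part_ids)
    keys = {}
    for part_id in wanted:
        if part_id not in parts:
            raise KeyError(f"{part_id} is not in snapshot {manifest['version']}")
        algo, _, digest = parts[part_id].partition(":")
        if algo != "sha256" or not digest:
            raise ValueError(f"unexpected blob reference for {part_id}")
        keys[part_id] = f"blobs/{digest}.step"
    return keys


def unique_downloads(keys: dict) -> list:
    """Distinct object keys to fetch; parts sharing a blob are fetched once."""
    return sorted(set(keys.values()))
```

Since several part IDs may reference the same blob, the number of objects to fetch can be smaller than the number of parts requested.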