CAD Dataset Infrastructure — Storage & Versioning

Engineering Log Entry — March 2026

Object storage design (Cloudflare R2) and dataset versioning via snapshot manifests — the foundation layer for CAD file storage and reproducibility.

Object Storage — Cloudflare R2

STEP files are stored content-addressed by SHA-256 hash in a Cloudflare R2 bucket. Identical files are stored once regardless of how many part IDs reference them.

Bucket layout:

machenit-dataset/
  blobs/
    {sha256}.step              # Content-addressed STEP files
  snapshots/
    v001.json                  # Full manifest for each snapshot version
    v002.json
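
A minimal sketch of how a blob key in this layout could be derived from a file's contents. The function names `sha256_of_file` and `blob_key` are hypothetical, not part of any existing tooling; only the `blobs/{sha256}.step` layout comes from the description above.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        # iter() with a sentinel keeps reading until f.read() returns b"" (end of file).
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def blob_key(sha256_hex: str) -> str:
    """Map a hex digest to its object key under the blobs/ prefix."""
    return f"blobs/{sha256_hex}.step"
```

Because the key depends only on the file's bytes, two part IDs that point at byte-identical STEP files resolve to the same object, which is what makes deduplication automatic.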

Why R2:

| Option | Egress Cost | S3-Compatible | Already a Vendor | Verdict |
| --- | --- | --- | --- | --- |
| Cloudflare R2 | Free | Yes | Yes (docs site) | Chosen |
| AWS S3 | ~$0.09/GB | Yes | No | Viable but more expensive for frequent pulls |
| Supabase Storage | Metered | Limited API | Yes | Not designed for large binary datasets |
| MinIO (self-hosted) | Free | Yes | No | Ops burden of self-hosting |

At 200 GB, R2 storage costs about $3/month (roughly $0.015 per GB-month) and egress is free, which matters for a dataset that engineers regularly pull subsets of.
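
The arithmetic behind that estimate can be sketched as follows. The $0.015 per GB-month storage rate is an assumption based on R2's published list price, the ~$0.09/GB egress figure is taken from the table above, and the 500 GB monthly pull volume is purely hypothetical.

```python
# Assumed rates: R2 storage from public list pricing, S3 egress from the table above.
R2_STORAGE_PER_GB_MONTH = 0.015
R2_EGRESS_PER_GB = 0.0
S3_EGRESS_PER_GB = 0.09


def monthly_cost(stored_gb: float, pulled_gb: float,
                 storage_rate: float, egress_rate: float) -> float:
    """Storage cost plus egress cost for one month, in USD."""
    return stored_gb * storage_rate + pulled_gb * egress_rate


# 200 GB stored, with a hypothetical 500 GB pulled per month.
r2_total = monthly_cost(200, 500, R2_STORAGE_PER_GB_MONTH, R2_EGRESS_PER_GB)

# Egress portion alone if the same pulls were served at the S3 rate in the table.
s3_egress_only = monthly_cost(0, 500, 0.0, S3_EGRESS_PER_GB)

print(f"R2 total:       ${r2_total:.2f}")
print(f"S3 egress only: ${s3_egress_only:.2f}")
```

Under these assumptions the egress line item alone would exceed the whole R2 bill, and the gap widens as pull volume grows.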

Versioning — Snapshot Manifests

A snapshot is a frozen point-in-time record of the full dataset: which parts are included and which blob each one maps to. Snapshots are stored as JSON manifests in R2 with metadata in Supabase.

This is deliberately lighter than DVC or LakeFS:

| Option | Partial Pulls | Metadata Queries | Infra Overhead | Verdict |
| --- | --- | --- | --- | --- |
| Snapshot manifests | Native (query → pull) | Via Supabase | Minimal (thin CLI) | Chosen |
| DVC | Poor (directory-level) | None built-in | .dvc files, remote config | Poor fit for metadata-driven subsetting |
| LakeFS | Good | Requires separate index | Server deployment, learning curve | Over-engineered for team of 5–10 |
| Git LFS | None | None | GitHub storage limits | Too limited |

Snapshot structure:

{
  "version": "v003",
  "created_at": "2026-03-18T10:00:00Z",
  "created_by": "david",
  "parent": "v002",
  "description": "Added 200 annotated parts from batch-7",
  "parts_count": 5200,
  "parts": {
    "part-0001": "sha256:a1b2c3d4...",
    "part-0002": "sha256:e5f6a7b8..."
  }
}
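
A minimal sketch of how a manifest with this shape might be assembled. The helper names `build_manifest` and `manifest_key` are hypothetical; the field names and the `snapshots/{version}.json` key follow the structure and bucket layout shown above. Deriving `parts_count` from the mapping, rather than passing it in, is a design choice made here so the two cannot drift apart.

```python
import json
from datetime import datetime, timezone
from typing import Dict, Optional


def build_manifest(version: str, parent: Optional[str], created_by: str,
                   description: str, parts: Dict[str, str],
                   created_at: Optional[datetime] = None) -> dict:
    """Build a snapshot manifest dict from a {part_id: sha256_hex} mapping."""
    if created_at is None:
        created_at = datetime.now(timezone.utc)
    # Normalise to UTC so the trailing "Z" is accurate. Pass a timezone-aware
    # datetime: astimezone() treats a naive one as local time.
    stamp = created_at.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return {
        "version": version,
        "created_at": stamp,
        "created_by": created_by,
        "parent": parent,
        "description": description,
        "parts_count": len(parts),
        # Sorted part IDs keep the serialised manifest stable across runs.
        "parts": {pid: f"sha256:{parts[pid]}" for pid in sorted(parts)},
    }


def manifest_key(version: str) -> str:
    """Object key for a manifest under the snapshots/ prefix."""
    return f"snapshots/{version}.json"


if __name__ == "__main__":
    manifest = build_manifest(
        "v003", "v002", "david", "Example snapshot",
        {"part-0002": "e5f6", "part-0001": "a1b2"},  # placeholder digests
    )
    print(manifest_key("v003"))
    print(json.dumps(manifest, indent=2))
```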

Pinning a snapshot version to a training run or benchmark makes it reproducible: the manifest lists every part ID and the hash of its blob, so the exact set of files can be reconstructed.
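
Reconstructing the file set from a pinned manifest could look roughly like this. `blob_keys` and `verify_blob` are hypothetical helper names; the sketch assumes every entry in `parts` uses the `sha256:` prefix shown in the snapshot structure, and it leaves the actual download from R2 out of scope.

```python
import hashlib
from typing import Iterable, Optional, Set

PREFIX = "sha256:"


def blob_keys(manifest: dict, part_ids: Optional[Iterable[str]] = None) -> Set[str]:
    """Return the object keys needed for the given parts (all parts if None).

    A set is returned because several part IDs may share one blob.
    """
    parts = manifest["parts"]
    selected = list(parts) if part_ids is None else list(part_ids)
    keys = set()
    for pid in selected:
        ref = parts[pid]  # raises KeyError if the part is not in this snapshot
        if not ref.startswith(PREFIX):
            raise ValueError(f"unexpected hash reference for {pid}: {ref!r}")
        keys.add(f"blobs/{ref[len(PREFIX):]}.step")
    return keys


def verify_blob(data: bytes, ref: str) -> bool:
    """Check downloaded bytes against the manifest's 'sha256:<hex>' reference."""
    return ref.startswith(PREFIX) and hashlib.sha256(data).hexdigest() == ref[len(PREFIX):]
```

Passing a list of part IDs returned by a metadata query gives the partial pull described earlier: only the blobs for those parts are fetched, and each one can be checked against the manifest after download.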