CAD Dataset Infrastructure: Dataset CLI
Engineering Log Entry — March 2026
Design for mds, a Python command-line tool for pulling, pushing, querying, and versioning CAD dataset parts.
It wraps boto3 (for R2, via the S3-compatible API) and supabase-py. Configuration is stored in ~/.mds/config.toml and authentication tokens in ~/.mds/auth.json.
Setup and Authentication
# First-time setup — writes template config with Supabase URL and R2 endpoint
mds init
# Log in with Supabase credentials (stores JWT in ~/.mds/auth.json)
mds login
# Log out (removes cached tokens)
mds logout
Config file (~/.mds/config.toml):
[supabase]
url = "https://xxxx.supabase.co"
[r2]
endpoint_url = "https://xxxx.r2.cloudflarestorage.com"
bucket = "machenit-dataset"
access_key_id = "..."
secret_access_key = "..."
All config values can be overridden via environment variables, which enables CI/CD use without writing config files to disk:
| Config Path | Environment Variable |
|---|---|
| supabase.url | MDS_SUPABASE_URL |
| r2.endpoint_url | MDS_R2_ENDPOINT |
| r2.bucket | MDS_R2_BUCKET |
| r2.access_key_id | MDS_R2_ACCESS_KEY_ID |
| r2.secret_access_key | MDS_R2_SECRET_ACCESS_KEY |
| (auth bypass) | MDS_SUPABASE_SERVICE_ROLE_KEY |
Precedence: environment variables > ~/.mds/config.toml > defaults. When MDS_SUPABASE_SERVICE_ROLE_KEY is set, the CLI skips interactive login and uses the service role key directly (CI path).
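The precedence rule can be sketched as a small resolver. This is an illustration under assumptions, not the actual mds code: the function names `resolve_config` and `auth_mode` are hypothetical, `file_config` stands for the parsed contents of ~/.mds/config.toml, and `DEFAULTS` is left empty because the design does not list default values.

```python
import os

# Config path -> overriding environment variable, as listed in the table above.
ENV_OVERRIDES = {
    ("supabase", "url"): "MDS_SUPABASE_URL",
    ("r2", "endpoint_url"): "MDS_R2_ENDPOINT",
    ("r2", "bucket"): "MDS_R2_BUCKET",
    ("r2", "access_key_id"): "MDS_R2_ACCESS_KEY_ID",
    ("r2", "secret_access_key"): "MDS_R2_SECRET_ACCESS_KEY",
}

# Placeholder (assumption): the design does not say what the defaults are.
DEFAULTS = {}


def resolve_config(file_config, env=None):
    """Resolve each setting as: environment variable > config file > default."""
    env = os.environ if env is None else env
    resolved = {}
    for (section, key), env_name in ENV_OVERRIDES.items():
        if env_name in env:
            value = env[env_name]
        elif key in file_config.get(section, {}):
            value = file_config[section][key]
        else:
            value = DEFAULTS.get((section, key))
        resolved[f"{section}.{key}"] = value
    return resolved


def auth_mode(env=None):
    """Pick the CI path when a service role key is present, else interactive login."""
    env = os.environ if env is None else env
    if env.get("MDS_SUPABASE_SERVICE_ROLE_KEY"):
        return "service_role"
    return "interactive"
```

Passing `env` explicitly keeps the resolver easy to test; in normal use it falls back to `os.environ`.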
See Access & Authentication for the full auth design including onboarding, credential management, and role-based access.
Core Commands
# Pull subsets by metadata query
mds pull --split train --annotation-status approved --output ./data/train/
mds pull --feature-types pocket,hole --tag difficulty=easy --output ./data/easy/
mds pull --source grabcad --limit 500 --output ./data/grabcad/
# Pull a pinned snapshot (exact reproducibility)
mds pull --snapshot v003 --output ./data/v003/
# Push new files
mds push ./new_parts/ --source customer-batch-8
# Snapshots
mds snapshot create --message "Added 200 parts from customer batch 8"
mds snapshot list
mds snapshot diff v002 v003
# Pin a snapshot to an experiment
mds pin v003 --context "aagnet-training-run-042" --type training
# Query without downloading
mds query --feature-types pocket --annotation-status approved --count
How Partial Pulls Work
Every pull is inherently partial. The CLI queries Supabase for matching part_ids and blob_hashes, then downloads only the matching blobs from R2. There is no "full checkout" concept — you always specify what you want via metadata filters or a snapshot version.
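A minimal sketch of that two-step flow, with the metadata query and the blob download injected as callables so the logic runs without network access. Everything named here is hypothetical: `PartRef`, `pull`, `query_parts` (standing in for the Supabase query), `fetch_blob` (standing in for the R2 download), and the `{part_id}.step` output naming are assumptions for illustration only.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, Iterable


@dataclass(frozen=True)
class PartRef:
    part_id: str
    blob_hash: str  # content hash identifying the blob in R2


def pull(
    query_parts: Callable[[dict], Iterable[PartRef]],
    fetch_blob: Callable[[str], bytes],
    filters: dict,
    output_dir: Path,
) -> list[Path]:
    """Resolve filters to parts, then fetch each distinct blob exactly once."""
    output_dir.mkdir(parents=True, exist_ok=True)
    blobs: dict[str, bytes] = {}  # in-memory for brevity; a real tool would stream to disk
    written = []
    for part in query_parts(filters):
        if part.blob_hash not in blobs:
            blobs[part.blob_hash] = fetch_blob(part.blob_hash)
        target = output_dir / f"{part.part_id}.step"  # assumed naming scheme
        target.write_bytes(blobs[part.blob_hash])
        written.append(target)
    return written
```

Because the metadata query decides what is fetched, two parts that share a blob hash cost only one download.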
Local Cache
Downloaded blobs are cached at ~/.mds/cache/{sha256}.step. When switching between snapshots or subsets with overlapping parts, cached files are symlinked or copied from the cache instead of re-downloaded. This makes subset switching fast when there is high overlap between queries.
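The cache lookup might look roughly like the following. It is a sketch under assumptions: `materialize` and `fetch_blob` are hypothetical names, and the symlink-first, copy-on-failure order is one plausible reading of "symlinked or copied" (for example, falling back where symlinks are not permitted).

```python
import os
import shutil
from pathlib import Path
from typing import Callable


def materialize(
    blob_hash: str,
    target: Path,
    cache_dir: Path,
    fetch_blob: Callable[[str], bytes],
) -> bool:
    """Place the blob at target, downloading only on a cache miss.

    Returns True if the blob was already cached.
    """
    cache_dir.mkdir(parents=True, exist_ok=True)
    cached = cache_dir / f"{blob_hash}.step"
    hit = cached.exists()
    if not hit:
        # Write to a temporary name first so an interrupted download
        # never leaves a truncated file under the final cache name.
        tmp = cached.with_suffix(".part")
        tmp.write_bytes(fetch_blob(blob_hash))
        os.replace(tmp, cached)
    target.parent.mkdir(parents=True, exist_ok=True)
    if target.is_symlink() or target.exists():
        target.unlink()
    try:
        os.symlink(cached.resolve(), target)
    except OSError:
        # Symlinks unavailable: fall back to a plain copy.
        shutil.copy2(cached, target)
    return hit
```

Switching from one snapshot to another then touches the network only for hashes that are not yet in the cache directory.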
Parallelism
Downloads are parallelized (configurable, default 8 concurrent transfers). At typical STEP file sizes (1–20 MB), 1,000 files total roughly 1–20 GB, so a pull takes a few minutes on a reasonable connection.
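One way to get bounded concurrency is a thread pool from the standard library. The sketch below assumes a hypothetical `download_one` callable that fetches a single blob; `download_all` is likewise an illustrative name, and the real CLI may handle retries and errors differently.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable


def download_all(
    blob_hashes: Iterable[str],
    download_one: Callable[[str], None],
    max_workers: int = 8,  # matches the documented default of 8 concurrent transfers
) -> dict[str, BaseException]:
    """Run download_one for each distinct hash on a thread pool.

    Returns a mapping of hash -> exception for the transfers that failed.
    """
    unique = list(dict.fromkeys(blob_hashes))  # de-duplicate, keep order
    failures: dict[str, BaseException] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {h: pool.submit(download_one, h) for h in unique}
        for h, future in futures.items():
            error = future.exception()  # blocks until this transfer finishes
            if error is not None:
                failures[h] = error
    return failures
```

Threads suit this workload because each transfer spends most of its time waiting on the network; collecting failures instead of raising lets one bad blob be reported without aborting the rest.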