CAD Dataset Infrastructure: Dataset CLI

Engineering Log Entry — March 2026

Design for mds, a Python command-line tool for pulling, pushing, querying, and versioning CAD dataset parts.

mds wraps boto3 (for Cloudflare R2, via its S3-compatible API) and supabase-py. Configuration is stored in ~/.mds/config.toml, and authentication tokens in ~/.mds/auth.json.

Setup and Authentication

# First-time setup — writes template config with Supabase URL and R2 endpoint
mds init

# Log in with Supabase credentials (stores JWT in ~/.mds/auth.json)
mds login

# Log out (removes cached tokens)
mds logout

Config file (~/.mds/config.toml):

[supabase]
url = "https://xxxx.supabase.co"

[r2]
endpoint_url = "https://xxxx.r2.cloudflarestorage.com"
bucket = "machenit-dataset"
access_key_id = "..."
secret_access_key = "..."

All config values can be overridden via environment variables, which enables CI/CD use without writing config files to disk:

Config Path	Environment Variable
`supabase.url`	`MDS_SUPABASE_URL`
`r2.endpoint_url`	`MDS_R2_ENDPOINT`
`r2.bucket`	`MDS_R2_BUCKET`
`r2.access_key_id`	`MDS_R2_ACCESS_KEY_ID`
`r2.secret_access_key`	`MDS_R2_SECRET_ACCESS_KEY`
(auth bypass)	`MDS_SUPABASE_SERVICE_ROLE_KEY`

Precedence: environment variables > ~/.mds/config.toml > defaults. When MDS_SUPABASE_SERVICE_ROLE_KEY is set, the CLI skips interactive login and uses the service role key directly (CI path).
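
The precedence rule can be illustrated with a minimal sketch. The helper names (`resolve_config`, `load_config`, `use_service_role`) and the flat dotted-key result are assumptions made for illustration, not the actual mds implementation; only the environment variable names come from the table above.

```python
import os
from pathlib import Path

# Dotted config path -> environment variable, as listed in the table above.
ENV_OVERRIDES = {
    "supabase.url": "MDS_SUPABASE_URL",
    "r2.endpoint_url": "MDS_R2_ENDPOINT",
    "r2.bucket": "MDS_R2_BUCKET",
    "r2.access_key_id": "MDS_R2_ACCESS_KEY_ID",
    "r2.secret_access_key": "MDS_R2_SECRET_ACCESS_KEY",
}


def resolve_config(file_values, env, defaults=None):
    """Return a flat {dotted_path: value} dict using env > file > defaults."""
    resolved = dict(defaults or {})
    # Values parsed from config.toml override the defaults.
    for section, values in file_values.items():
        if isinstance(values, dict):
            for key, value in values.items():
                resolved[f"{section}.{key}"] = value
    # Non-empty environment variables override everything else.
    for path, var in ENV_OVERRIDES.items():
        if env.get(var):
            resolved[path] = env[var]
    return resolved


def load_config(path=None, env=os.environ):
    """Hypothetical loader: read the TOML file if present, then apply overrides."""
    import tomllib  # standard library, Python 3.11+

    path = path or Path.home() / ".mds" / "config.toml"
    file_values = {}
    if path.exists():
        with path.open("rb") as fh:
            file_values = tomllib.load(fh)
    return resolve_config(file_values, env)


def use_service_role(env=os.environ):
    """True when the CI path applies and interactive login should be skipped."""
    return bool(env.get("MDS_SUPABASE_SERVICE_ROLE_KEY"))
```

Because `resolve_config` takes the file contents and environment as plain mappings, a CI job that sets only environment variables resolves a full configuration without any file on disk.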

See Access & Authentication for the full auth design including onboarding, credential management, and role-based access.

Core Commands

# Pull subsets by metadata query
mds pull --split train --annotation-status approved --output ./data/train/
mds pull --feature-types pocket,hole --tag difficulty=easy --output ./data/easy/
mds pull --source grabcad --limit 500 --output ./data/grabcad/

# Pull a pinned snapshot (exact reproducibility)
mds pull --snapshot v003 --output ./data/v003/

# Push new files
mds push ./new_parts/ --source customer-batch-8

# Snapshots
mds snapshot create --message "Added 200 parts from customer batch 8"
mds snapshot list
mds snapshot diff v002 v003

# Pin a snapshot to an experiment
mds pin v003 --context "aagnet-training-run-042" --type training

# Query without downloading
mds query --feature-types pocket --annotation-status approved --count

How Partial Pulls Work

Every pull is inherently partial. The CLI queries Supabase for matching part_ids and blob_hashes, then downloads only the matching blobs from R2. There is no "full checkout" concept — you always specify what you want via metadata filters or a snapshot version.
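
The two-step flow (metadata query, then blob download) might look roughly like the sketch below. The function `pull`, its injected callables, and the `{part_id}.step` output naming are assumptions for illustration. In the real tool, `query_parts` would presumably wrap supabase-py and `download_blob` would wrap a boto3 S3 client pointed at the R2 endpoint.

```python
from pathlib import Path
from typing import Callable, Iterable, Mapping

# Each row is assumed to carry at least "part_id" and "blob_hash".
Row = Mapping[str, str]


def pull(
    filters: Mapping[str, str],
    query_parts: Callable[[Mapping[str, str]], Iterable[Row]],
    download_blob: Callable[[str, Path], None],
    output_dir: Path,
) -> list[Path]:
    """Download only the blobs whose metadata rows match the filters."""
    output_dir.mkdir(parents=True, exist_ok=True)
    written = []
    # Step 1: the metadata store decides which parts are in scope.
    for row in query_parts(filters):
        # Step 2: fetch just that blob from object storage.
        dest = output_dir / f"{row['part_id']}.step"  # naming is an assumption
        download_blob(row["blob_hash"], dest)
        written.append(dest)
    return written
```

Nothing in this flow lists the bucket itself. The set of files on disk is determined entirely by the query result, which is why there is no equivalent of a full checkout.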

Local Cache

Downloaded blobs are cached at ~/.mds/cache/{sha256}.step. When switching between snapshots or subsets with overlapping parts, cached files are symlinked or copied from the cache instead of re-downloaded. This makes subset switching fast when there is high overlap between queries.
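
A content-addressed cache with a symlink-or-copy fallback can be sketched as follows. The function name `materialize`, the injected `fetch` callable, and the `.part` temporary suffix are hypothetical; only the `{sha256}.step` cache layout comes from the description above.

```python
import os
import shutil
from pathlib import Path


def materialize(sha256: str, dest: Path, cache_dir: Path, fetch) -> bool:
    """Place the blob at dest, downloading only on a cache miss.

    Returns True when the blob was already cached.
    """
    cache_dir.mkdir(parents=True, exist_ok=True)
    cached = cache_dir / f"{sha256}.step"
    hit = cached.exists()
    if not hit:
        # Download to a temporary name, then rename, so an interrupted
        # transfer never leaves a partial file that looks like a valid entry.
        tmp = cached.with_name(cached.name + ".part")
        fetch(sha256, tmp)
        os.replace(tmp, cached)

    dest.parent.mkdir(parents=True, exist_ok=True)
    if dest.exists() or dest.is_symlink():
        dest.unlink()
    try:
        os.symlink(cached.resolve(), dest)
    except OSError:
        # Symlinks may be unavailable (for example on some Windows setups).
        shutil.copy2(cached, dest)
    return hit
```

Keying the cache on the content hash means two snapshots that reference the same blob share one cached file, regardless of which part or query requested it.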

Parallelism

Downloads run in parallel, with a configurable number of concurrent transfers (default 8). At typical STEP file sizes (1–20 MB), a pull of 1,000 files transfers roughly 1–20 GB in total and typically takes a few minutes on a reasonable connection.
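
One way to get bounded concurrency is a thread pool, since the transfers are I/O-bound. The sketch below is an illustration under that assumption, not the confirmed mds implementation. `download_all` and `download_one` are hypothetical names, and collecting failures instead of aborting on the first error is a design choice made for this example.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def download_all(blob_hashes, download_one, max_workers=8):
    """Run download_one(hash) for each hash with bounded concurrency.

    Returns (succeeded, failed), where failed maps hash -> exception.
    """
    succeeded, failed = [], {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(download_one, h): h for h in blob_hashes}
        for future in as_completed(futures):
            blob_hash = futures[future]
            try:
                future.result()  # re-raises any exception from the worker
                succeeded.append(blob_hash)
            except Exception as exc:
                failed[blob_hash] = exc
    return succeeded, failed
```

Returning the failures separately lets the caller retry only the blobs that did not arrive, instead of restarting the whole pull.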