CAD Dataset Infrastructure — Dataset CLI

Engineering Log Entry — March 2026

Design for mds, a Python command-line tool for pulling, pushing, querying, and versioning CAD dataset parts.

mds wraps boto3 (for Cloudflare R2, via its S3-compatible API) and supabase-py. Configuration is stored in ~/.mds/config.toml, and authentication tokens are stored in ~/.mds/auth.json.

Setup and Authentication

# First-time setup — writes template config with Supabase URL and R2 endpoint
mds init

# Log in with Supabase credentials (stores JWT in ~/.mds/auth.json)
mds login

# Log out (removes cached tokens)
mds logout

Config file (~/.mds/config.toml):

[supabase]
url = "https://xxxx.supabase.co"

[r2]
endpoint_url = "https://xxxx.r2.cloudflarestorage.com"
bucket = "machenit-dataset"
access_key_id = "..."
secret_access_key = "..."

All config values can be overridden via environment variables, which enables CI/CD use without writing config files to disk:

Config Path             Environment Variable
supabase.url            MDS_SUPABASE_URL
r2.endpoint_url         MDS_R2_ENDPOINT
r2.bucket               MDS_R2_BUCKET
r2.access_key_id        MDS_R2_ACCESS_KEY_ID
r2.secret_access_key    MDS_R2_SECRET_ACCESS_KEY
(auth bypass)           MDS_SUPABASE_SERVICE_ROLE_KEY

Precedence: environment variables > ~/.mds/config.toml > defaults. When MDS_SUPABASE_SERVICE_ROLE_KEY is set, the CLI skips interactive login and uses the service role key directly (CI path).
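
The precedence rule can be sketched as a small merge function. This is an illustrative sketch, not the actual mds implementation: the function names resolve_config and use_service_role are hypothetical, the defaults argument is a placeholder because this entry does not list default values, and treating an empty environment variable as unset is an assumption.

```python
import os

# Config path -> environment variable, taken from the table above.
ENV_VARS = {
    ("supabase", "url"): "MDS_SUPABASE_URL",
    ("r2", "endpoint_url"): "MDS_R2_ENDPOINT",
    ("r2", "bucket"): "MDS_R2_BUCKET",
    ("r2", "access_key_id"): "MDS_R2_ACCESS_KEY_ID",
    ("r2", "secret_access_key"): "MDS_R2_SECRET_ACCESS_KEY",
}


def resolve_config(file_config, env=os.environ, defaults=None):
    """Merge settings with precedence: environment > config file > defaults."""
    defaults = defaults or {}
    resolved = {}
    for (section, key), env_name in ENV_VARS.items():
        if env.get(env_name):  # assumption: an empty variable counts as unset
            value = env[env_name]
        elif key in file_config.get(section, {}):
            value = file_config[section][key]
        else:
            value = defaults.get(section, {}).get(key)  # None if no default
        resolved.setdefault(section, {})[key] = value
    return resolved


def use_service_role(env=os.environ):
    """True when the CI path applies and interactive login is skipped."""
    return bool(env.get("MDS_SUPABASE_SERVICE_ROLE_KEY"))
```

Here file_config would be the nested dict obtained by parsing ~/.mds/config.toml (for example with the standard-library tomllib on Python 3.11+), or an empty dict when no config file exists, which is the CI case.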

See Access & Authentication for the full auth design including onboarding, credential management, and role-based access.

Core Commands

# Pull subsets by metadata query
mds pull --split train --annotation-status approved --output ./data/train/
mds pull --feature-types pocket,hole --tag difficulty=easy --output ./data/easy/
mds pull --source grabcad --limit 500 --output ./data/grabcad/

# Pull a pinned snapshot (exact reproducibility)
mds pull --snapshot v003 --output ./data/v003/

# Push new files
mds push ./new_parts/ --source customer-batch-8

# Snapshots
mds snapshot create --message "Added 200 parts from customer batch 8"
mds snapshot list
mds snapshot diff v002 v003

# Pin a snapshot to an experiment
mds pin v003 --context "aagnet-training-run-042" --type training

# Query without downloading
mds query --feature-types pocket --annotation-status approved --count

How Partial Pulls Work

Every pull is inherently partial. The CLI queries Supabase for matching part_ids and blob_hashes, then downloads only the matching blobs from R2. There is no "full checkout" concept — you always specify what you want via metadata filters or a snapshot version.
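
The two-phase flow (metadata query first, blob download second) might look roughly like the sketch below. The names PartRow, plan_pull, and pull are hypothetical, and the sketch assumes that several parts can share one blob_hash, so each blob is fetched only once. In the real tool the rows would come from a supabase-py query and download_blob would wrap a boto3 download against the R2 bucket; both are passed in here so the logic stays independent of those services.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PartRow:
    """One row returned by the metadata query (hypothetical shape)."""
    part_id: str
    blob_hash: str


def plan_pull(rows):
    """Group part_ids by blob_hash, keeping first-seen order of hashes."""
    plan = {}
    for row in rows:
        plan.setdefault(row.blob_hash, []).append(row.part_id)
    return plan


def pull(rows, download_blob):
    """Download each distinct blob exactly once; return the plan used."""
    plan = plan_pull(rows)
    for blob_hash in plan:
        download_blob(blob_hash)
    return plan
```

Because the plan is derived entirely from the query result, a snapshot pull and a filter-based pull can share the same download path: only the query that produces the rows differs.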

Local Cache

Downloaded blobs are cached at ~/.mds/cache/{sha256}.step. When switching between snapshots or subsets with overlapping parts, cached files are symlinked or copied from the cache instead of re-downloaded. This makes subset switching fast when there is high overlap between queries.
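
A minimal sketch of the cache lookup, assuming a symlink is tried first and a copy is the fallback (for example on filesystems without symlink support). The function name materialize and the fetch callback are hypothetical; writing to a temporary ".part" file before renaming is an added safeguard against half-written cache entries, not something this entry specifies.

```python
import os
import shutil
from pathlib import Path


def materialize(sha256, dest, cache_dir, fetch):
    """Place a blob at dest, downloading into the cache only on a miss."""
    cache_dir = Path(cache_dir)
    cached = cache_dir / f"{sha256}.step"
    if not cached.exists():
        cache_dir.mkdir(parents=True, exist_ok=True)
        tmp = cached.with_suffix(".part")
        fetch(sha256, tmp)       # hypothetical callback: writes the blob to tmp
        os.replace(tmp, cached)  # atomic rename into the cache

    dest = Path(dest)
    dest.parent.mkdir(parents=True, exist_ok=True)
    if dest.is_symlink() or dest.exists():
        dest.unlink()            # replace any stale file from an earlier pull
    try:
        os.symlink(cached.resolve(), dest)
    except OSError:
        shutil.copy2(cached, dest)  # fall back to a copy if symlinks fail
    return cached
```

With cache_dir set to ~/.mds/cache, a second pull that needs the same hash skips fetch entirely and only creates the link or copy.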

Parallelism

Downloads are parallelized (configurable, default 8 concurrent transfers). At typical STEP file sizes (1–20 MB), pulling 1,000 files transfers roughly 1 to 20 GB in total, which takes a few minutes on a reasonable connection.
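
One way to get bounded concurrency is a thread pool, which suits I/O-bound transfers. The sketch below is an assumption about how this could be done, not a description of the actual mds code: download_all and fetch are hypothetical names, and collecting failures instead of aborting on the first error is a design choice made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def download_all(blob_hashes, fetch, max_workers=8):
    """Run fetch(blob_hash) concurrently; return (succeeded, failed)."""
    succeeded, failed = [], {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fetch, h): h for h in blob_hashes}
        for future in as_completed(futures):
            blob_hash = futures[future]
            try:
                future.result()            # re-raises any error from fetch
                succeeded.append(blob_hash)
            except Exception as exc:       # keep going; report failures at the end
                failed[blob_hash] = exc
    return succeeded, failed
```

Returning the failed hashes alongside the successes lets a caller retry only the transfers that broke, rather than restarting the whole pull.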