AutoCAM Testing Suite — Design¶

Engineering Log Entry — March 2026

Architecture, component design, and implementation roadmap for the AutoCAM testing and benchmarking suite. See also the requirements.

Architecture Overview¶

The testing suite has four layers: test harness (runs the CAM CLI, collects results), CI integration (triggers runs), results backend (stores and serves data), and web frontend (dashboard + deep dive viewer).

flowchart TB
    subgraph Triggers
        PR[Pull Request]
        Push[Push to main]
        Manual[Manual / Dry Run]
    end

    subgraph CI["CI Layer (GitHub Actions)"]
        GHA[GitHub Actions Workflow]
        Runner["Self-Hosted Runner\n(Windows + Licenses)"]
    end

    subgraph Harness["Test Harness"]
        Manifest[Test Manifest\nparts.json]
        CLI["machenit CLI\n--json"]
        Collector[Result Collector]
    end

    subgraph Storage["Results Backend"]
        DB[(SQLite DB\nMetrics)]
        Artifacts["Artifact Store\n(PLY, Reports)"]
        API[REST API\nFastAPI]
    end

    subgraph Frontend["Web Frontend"]
        Dashboard[Dashboard\nTrend Charts]
        DeepDive[Deep Dive Viewer\nThree.js]
    end

    PR --> GHA
    Push --> GHA
    Manual --> GHA
    GHA --> Runner
    Runner --> Manifest
    Manifest --> CLI
    CLI --> Collector
    Collector --> DB
    Collector --> Artifacts
    DB --> API
    Artifacts --> API
    API --> Dashboard
    API --> DeepDive

Component Design¶

1. Test Manifest¶

A JSON file versioned in the CAM repository that defines the test corpus.

File: tests/cam_benchmark/parts.json

{
  "version": 1,
  "defaults": {
    "tool_library": "assets/tools/machenit_tooling_library.json",
    "stepdown": 2.0,
    "stepover": 0.0,
    "stock_allowance": 5.0
  },
  "parts": [
    {
      "id": "pocket-simple-01",
      "name": "Simple Rectangular Pocket",
      "file": "assets/parts/simple_pocket.stp",
      "category": "pockets",
      "difficulty": 1,
      "strategy": "simple-planar",
      "description": "Single rectangular pocket, 20mm deep",
      "parameters": {},
      "tags": ["should_have_no_gouges"]
    }
  ]
}

Parameters in a part entry override the defaults. This keeps manifest entries minimal for standard configurations.
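
The override rule can be illustrated in a few lines of Python. This is a minimal sketch that assumes a shallow merge, where a key in a part's `parameters` replaces the same key in `defaults`. The helper name `resolve_parameters` and the second part entry are hypothetical and exist only for the example.

```python
import json

# Trimmed manifest; "pocket-deep-02" is a made-up entry used to show an override.
MANIFEST = json.loads("""
{
  "version": 1,
  "defaults": {"stepdown": 2.0, "stepover": 0.0, "stock_allowance": 5.0},
  "parts": [
    {"id": "pocket-simple-01", "parameters": {}},
    {"id": "pocket-deep-02", "parameters": {"stepdown": 1.0}}
  ]
}
""")


def resolve_parameters(manifest: dict, part: dict) -> dict:
    """Return the effective parameters for one part; part-level values win."""
    return {**manifest.get("defaults", {}), **part.get("parameters", {})}


resolved = {p["id"]: resolve_parameters(MANIFEST, p) for p in MANIFEST["parts"]}
for part_id, params in resolved.items():
    print(part_id, params)
```

A shallow merge is enough while all parameters are scalars. Nested parameter groups would need a recursive merge instead.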

2. Test Harness¶

A Python script that orchestrates test execution. Python is chosen for scripting convenience, cross-platform support, and easy JSON handling.

File: tests/cam_benchmark/run_benchmark.py

Responsibilities:

Parse parts.json manifest.
For each part: build the CLI command, invoke machenit --json, capture stdout/stderr, measure wall-clock time.
Parse JSON output and extract metrics.
Produce a results bundle — a single JSON file containing all results plus run metadata.
Upload results to the backend (or write to a local file for dry-run mode).

Results bundle schema:

{
  "run_id": "20260312-143022-a1b2c3d",
  "commit_sha": "a1b2c3d4e5f6",
  "branch": "feature/adaptive-roughing",
  "trigger": "pull_request",
  "timestamp": "2026-03-12T14:30:22Z",
  "runner": "windows-cam-01",
  "suite_duration_ms": 842000,
  "results": [
    {
      "part_id": "pocket-simple-01",
      "status": "success",
      "wall_time_ms": 12340,
      "cli_output": { },
      "metrics": {
        "generation_time_ms": 8200,
        "evaluation_time_ms": 3100,
        "total_moves": 1842,
        "total_toolpath_length_mm": 24500.3,
        "collision_count": 0,
        "gouge_count": 0,
        "max_gouge_mm": 0.0,
        "excess_count": 12,
        "max_excess_mm": 0.42,
        "tool_count": 3,
        "setup_count": 1
      },
      "artifacts": ["pocket-simple-01.ply"]
    }
  ]
}
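
The `run_id` in the example looks like the UTC timestamp followed by the first seven characters of the commit SHA. The sketch below builds an ID of that shape and an empty bundle around it. The format is inferred from the single example above, not confirmed, and `make_run_id` and `new_bundle` are hypothetical helper names.

```python
from datetime import datetime, timezone


def make_run_id(timestamp: datetime, commit_sha: str) -> str:
    """Assumed format: YYYYMMDD-HHMMSS-<first 7 characters of the SHA>."""
    return f"{timestamp.astimezone(timezone.utc):%Y%m%d-%H%M%S}-{commit_sha[:7]}"


def new_bundle(timestamp: datetime, commit_sha: str, branch: str,
               trigger: str, runner: str) -> dict:
    """Create the run-level metadata; results are appended as parts finish."""
    utc = timestamp.astimezone(timezone.utc)
    return {
        "run_id": make_run_id(utc, commit_sha),
        "commit_sha": commit_sha,
        "branch": branch,
        "trigger": trigger,
        "timestamp": utc.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "runner": runner,
        "suite_duration_ms": None,  # filled in once the suite has finished
        "results": [],
    }


started = datetime(2026, 3, 12, 14, 30, 22, tzinfo=timezone.utc)
bundle = new_bundle(started, "a1b2c3d4e5f6", "feature/adaptive-roughing",
                    "pull_request", "windows-cam-01")
print(bundle["run_id"])
```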

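Error handling: If machenit crashes or exceeds the per-part timeout (configurable, default 5 minutes), the result entry is recorded with "status": "error" or "status": "timeout" respectively, the two failure values allowed by the status column of the results table, and stderr is captured. The harness then continues to the next part.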
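
One way to get this behaviour is `subprocess.run` with a timeout. The sketch below is an illustration, not the harness itself: `run_part` is a hypothetical helper, the demo commands use the Python interpreter as a stand-in for machenit, and the status values are the three listed in the results table (`success`, `error`, `timeout`). Whether a timeout gets its own status is a design choice, so treat that split as an assumption.

```python
import subprocess
import sys
import time

DEFAULT_TIMEOUT_S = 300  # 5 minutes per part


def _as_text(data) -> str:
    """Captured output on a timeout may be bytes or None; normalize it to str."""
    if data is None:
        return ""
    return data.decode(errors="replace") if isinstance(data, bytes) else data


def run_part(command: list, timeout_s: float = DEFAULT_TIMEOUT_S) -> dict:
    """Run one CLI invocation and return a partial result entry."""
    start = time.perf_counter()
    try:
        proc = subprocess.run(command, capture_output=True, text=True,
                              timeout=timeout_s)
        status = "success" if proc.returncode == 0 else "error"
        stdout, stderr = proc.stdout, proc.stderr
    except subprocess.TimeoutExpired as exc:
        # subprocess.run kills the child before raising, so nothing is left running.
        status, stdout, stderr = "timeout", _as_text(exc.stdout), _as_text(exc.stderr)
    except OSError as exc:
        # For example, the binary path does not exist.
        status, stdout, stderr = "error", "", str(exc)
    return {
        "status": status,
        "wall_time_ms": round((time.perf_counter() - start) * 1000),
        "stdout": stdout,
        "error_message": stderr if status != "success" else None,
    }


# Stand-in commands: a clean exit, a crash, and a hang.
ok = run_part([sys.executable, "-c", "print('{}')"])
crash = run_part([sys.executable, "-c", "import sys; sys.exit('boom')"])
slow = run_part([sys.executable, "-c", "import time; time.sleep(5)"], timeout_s=0.5)
print(ok["status"], crash["status"], slow["status"])
```

Because every failure path returns a result entry instead of raising, the calling loop can append the entry and move on to the next part.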

3. CI Integration¶

A GitHub Actions workflow on the CAM repository using a self-hosted Windows runner.

# .github/workflows/cam-benchmark.yml
name: CAM Benchmark Suite

on:
  pull_request:
    branches: [main]
  push:
    branches: [main]
  workflow_dispatch:
    inputs:
      parts_filter:
        description: 'Comma-separated part IDs (empty = all)'
        required: false

jobs:
  benchmark:
    runs-on: [self-hosted, Windows, cam-licensed]
    timeout-minutes: 60
    steps:
      - uses: actions/checkout@v4

      - name: Build machenit
        run: |
          cmake -S . -B build -DMW_DIR=C:/sdk/moduleworks -DASITUS_DIR=C:/sdk/asitus
          cmake --build build --config Release

      - name: Run benchmark suite
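        # PowerShell is the default shell on Windows runners; a trailing backtick continues the line.
        run: |
          python tests/cam_benchmark/run_benchmark.py `
            --manifest tests/cam_benchmark/parts.json `
            --binary build/Release/machenit.exe `
            --output results.json `
            --export-ply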

      - name: Upload results
        # Trailing backticks are PowerShell line continuations.
        run: |
          python tests/cam_benchmark/upload_results.py `
            --results results.json `
            --artifacts output/ `
            --api-url ${{ secrets.BENCHMARK_API_URL }}

      - name: Post PR comment
        if: github.event_name == 'pull_request'
        # Trailing backticks are PowerShell line continuations.
        run: |
          python tests/cam_benchmark/pr_comment.py `
            --results results.json `
            --repo ${{ github.repository }} `
            --pr ${{ github.event.pull_request.number }}

Self-hosted runner setup:

Install as a Windows service for auto-start on reboot.
Custom label cam-licensed ensures only this runner picks up benchmark jobs.
SDK paths configured as machine-level environment variables.
Runner workspace excluded from Windows Defender real-time scanning.

PR comment: After a PR run, a bot comment summarizes key metrics with a comparison table against the latest main baseline and a link to the full dashboard view.
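
As an illustration of what the comment body might contain, the sketch below renders a Markdown comparison table from two metric dictionaries. The function name `format_pr_comment`, the column layout, and the dashboard URL are assumptions made for the example. How the comment is actually posted is out of scope here.

```python
def format_pr_comment(baseline: dict, current: dict, dashboard_url: str) -> str:
    """Render a Markdown table of metrics present in both runs, plus a link."""
    lines = ["| Metric | main | PR | Delta |", "|---|---:|---:|---:|"]
    for name in sorted(baseline.keys() & current.keys()):
        delta = current[name] - baseline[name]
        lines.append(
            f"| `{name}` | {baseline[name]:g} | {current[name]:g} | {delta:+g} |"
        )
    lines.append("")
    lines.append(f"[Full dashboard view]({dashboard_url})")
    return "\n".join(lines)


main_metrics = {"gouge_count": 0, "generation_time_ms": 8200, "max_excess_mm": 0.42}
pr_metrics = {"gouge_count": 2, "generation_time_ms": 7900, "max_excess_mm": 0.42}

comment = format_pr_comment(
    main_metrics,
    pr_metrics,
    "https://dashboard.example.invalid/runs/20260312-143022-a1b2c3d",  # placeholder
)
print(comment)
```

The `g` format keeps integers and short decimals readable, but it switches to exponent notation from seven digits onward, so very large values would need an explicit format.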

4. Results Backend¶

A lightweight API server that stores results and serves them to the frontend.

Storage: SQLite + File System¶

Why SQLite:

Zero-ops — single file, no database server to manage.
Handles the expected data volume easily (100 parts x 10 runs/day x 365 days = ~365K rows/year).
Portable — can move the file between machines if the runner changes.
Built-in full-text search and JSON functions.

Schema:

CREATE TABLE runs (
    id          TEXT PRIMARY KEY,  -- run_id
    commit_sha  TEXT NOT NULL,
    branch      TEXT NOT NULL,
    trigger     TEXT NOT NULL,     -- pull_request | push | manual
    timestamp   TEXT NOT NULL,     -- ISO 8601
    runner      TEXT,
    suite_duration_ms INTEGER,
    metadata    TEXT               -- JSON blob for extensibility
);

CREATE TABLE results (
    id              INTEGER PRIMARY KEY AUTOINCREMENT,
    run_id          TEXT NOT NULL REFERENCES runs(id),
    part_id         TEXT NOT NULL,
    status          TEXT NOT NULL,  -- success | error | timeout
    wall_time_ms    INTEGER,
    -- Flattened metrics for fast queries:
    generation_time_ms      REAL,
    evaluation_time_ms      REAL,
    total_moves             INTEGER,
    total_toolpath_length_mm REAL,
    collision_count         INTEGER,
    gouge_count             INTEGER,
    max_gouge_mm            REAL,
    excess_count            INTEGER,
    max_excess_mm           REAL,
    tool_count              INTEGER,
    setup_count             INTEGER,
    -- Full CLI output for deep dive:
    cli_output      TEXT,          -- JSON blob
    error_message   TEXT
);

CREATE TABLE artifacts (
    id          INTEGER PRIMARY KEY AUTOINCREMENT,
    run_id      TEXT NOT NULL REFERENCES runs(id),
    part_id     TEXT NOT NULL,
    filename    TEXT NOT NULL,
    file_path   TEXT NOT NULL,     -- path in artifact store
    size_bytes  INTEGER
);

-- Indexes for dashboard queries
CREATE INDEX idx_results_part_run ON results(part_id, run_id);  -- results has no branch column; branch filtering joins runs
CREATE INDEX idx_runs_branch_ts ON runs(branch, timestamp);
CREATE INDEX idx_runs_commit ON runs(commit_sha);
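
To show how a results bundle could land in these tables, here is a minimal ingestion and history-query sketch using Python's built-in `sqlite3` module. It uses a trimmed copy of the schema (two metric columns stand in for the full set), and `ingest`, `part_history`, and `example_bundle` are hypothetical names. SQLite only enforces `REFERENCES` when `PRAGMA foreign_keys` is switched on for the connection.

```python
import json
import sqlite3

# Trimmed schema for the example; "trigger" is quoted because it is also an SQL keyword.
SCHEMA = """
CREATE TABLE runs (
    id TEXT PRIMARY KEY, commit_sha TEXT NOT NULL, branch TEXT NOT NULL,
    "trigger" TEXT NOT NULL, timestamp TEXT NOT NULL
);
CREATE TABLE results (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    run_id TEXT NOT NULL REFERENCES runs(id),
    part_id TEXT NOT NULL, status TEXT NOT NULL,
    generation_time_ms REAL, gouge_count INTEGER, cli_output TEXT
);
"""
METRIC_COLUMNS = ("generation_time_ms", "gouge_count")


def ingest(conn: sqlite3.Connection, bundle: dict) -> None:
    """Insert one results bundle; the 'with' block makes it a single transaction."""
    with conn:
        conn.execute(
            "INSERT INTO runs VALUES (?, ?, ?, ?, ?)",
            (bundle["run_id"], bundle["commit_sha"], bundle["branch"],
             bundle["trigger"], bundle["timestamp"]),
        )
        for result in bundle["results"]:
            metrics = result.get("metrics", {})
            conn.execute(
                "INSERT INTO results (run_id, part_id, status, generation_time_ms,"
                " gouge_count, cli_output) VALUES (?, ?, ?, ?, ?, ?)",
                (bundle["run_id"], result["part_id"], result["status"],
                 metrics.get("generation_time_ms"), metrics.get("gouge_count"),
                 json.dumps(result.get("cli_output", {}))),
            )


def part_history(conn: sqlite3.Connection, part_id: str, branch: str, metric: str):
    """Time series of one metric for one part on one branch, oldest first."""
    if metric not in METRIC_COLUMNS:  # column names cannot be bound as parameters
        raise ValueError(f"unknown metric: {metric}")
    query = (
        f"SELECT runs.timestamp, results.{metric} FROM results "
        "JOIN runs ON runs.id = results.run_id "
        "WHERE results.part_id = ? AND runs.branch = ? ORDER BY runs.timestamp"
    )
    return conn.execute(query, (part_id, branch)).fetchall()


def example_bundle(run_id: str, timestamp: str, gouges: int) -> dict:
    """Made-up bundle with a single part, for demonstration only."""
    return {
        "run_id": run_id, "commit_sha": "a1b2c3d4e5f6", "branch": "main",
        "trigger": "push", "timestamp": timestamp,
        "results": [{"part_id": "pocket-simple-01", "status": "success",
                     "metrics": {"generation_time_ms": 8200, "gouge_count": gouges}}],
    }


conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript(SCHEMA)
ingest(conn, example_bundle("run-b", "2026-03-12T14:30:22Z", 2))
ingest(conn, example_bundle("run-a", "2026-03-11T09:00:00Z", 0))
history = part_history(conn, "pocket-simple-01", "main", "gouge_count")
print(history)
```

Ordering by the ISO 8601 text column works because every timestamp uses the same UTC format, so lexicographic order matches chronological order.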

Artifact store: A directory on the server (or S3-compatible bucket) organized as artifacts/{run_id}/{part_id}/. Stores PLY files, detailed collision reports, and any future large outputs.

API: FastAPI (Python)¶

A small FastAPI application serving the frontend. Key endpoints:

Method	Path	Description
`POST`	`/api/runs`	Ingest a results bundle
`GET`	`/api/runs`	List runs (filterable by branch, trigger, date range)
`GET`	`/api/runs/{id}`	Get a specific run with all results
`GET`	`/api/parts`	List all known parts with latest metrics
`GET`	`/api/parts/{id}/history`	Time-series metrics for a part on a branch
`GET`	`/api/metrics/aggregate`	Corpus-wide aggregate metrics over time
`GET`	`/api/compare`	Compare two runs side-by-side
`GET`	`/api/artifacts/{run_id}/{part_id}/{filename}`	Serve artifact files

Why FastAPI: Lightweight, auto-generates OpenAPI docs, async support, easy to deploy as a single process. The entire backend is a single Python file + SQLite file — trivial to move between machines.

5. Web Frontend — Dashboard¶

A single-page application for viewing metrics trends and navigating to deep dives.

Tech Stack¶

Layer	Choice	Rationale
Framework	React (Vite)	Fast builds, large ecosystem, team familiarity
Charts	Recharts or Plotly.js	Time-series charts with zoom, hover, and click-through
3D viewer	Three.js	Industry standard for WebGL, loaders for PLY/STL built in
Styling	Tailwind CSS	Utility-first, fast to build with, minimal custom CSS
Hosting	Cloudflare Pages	Already used for docs site, free tier sufficient

Dashboard Views¶

Corpus Overview — Landing page showing:

Summary cards: total parts, latest run status, aggregate gouge/collision counts.
Metric trend chart: selectable metric (y-axis) plotted over commits to main (x-axis).
Part table: sortable/filterable table of all parts with their latest metrics. Each row links to the part's detail view.

Part Detail — Shows a specific part's metrics over time:

Trend charts for all metrics on main.
Table of recent runs with metric values.
Each run row links to the deep dive.

Run Detail — Shows all parts for a specific run:

Summary statistics.
Per-part results table (sortable, color-coded by status).
If a PR run: comparison columns showing deltas vs. main baseline.

PR Comparison — Side-by-side view of a PR run vs. its main baseline:

Delta table (green = improved, red = regressed, gray = unchanged).
Filterable to show only regressions.
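
The delta table logic could look roughly like the following. It is a sketch under one loud assumption: every metric in `LOWER_IS_BETTER` improves when its value goes down. The names `classify` and `compare_runs`, the zero default tolerance, and the sample values are illustrative only.

```python
# Assumption: for each metric listed here, a lower value is better.
LOWER_IS_BETTER = (
    "collision_count", "gouge_count", "max_gouge_mm",
    "excess_count", "max_excess_mm", "generation_time_ms",
)
COLORS = {"improved": "green", "regressed": "red", "unchanged": "gray"}


def classify(baseline: float, current: float, tolerance: float = 0.0) -> str:
    """Label a change for a lower-is-better metric."""
    delta = current - baseline
    if abs(delta) <= tolerance:
        return "unchanged"
    return "improved" if delta < 0 else "regressed"


def compare_runs(baseline: dict, current: dict, only_regressions: bool = False) -> list:
    """Build delta rows for parts and metrics present in both runs."""
    rows = []
    for part_id in sorted(baseline.keys() & current.keys()):
        for metric in LOWER_IS_BETTER:
            if metric not in baseline[part_id] or metric not in current[part_id]:
                continue
            before, after = baseline[part_id][metric], current[part_id][metric]
            verdict = classify(before, after)
            if only_regressions and verdict != "regressed":
                continue
            rows.append({"part_id": part_id, "metric": metric,
                         "delta": after - before, "verdict": verdict,
                         "color": COLORS[verdict]})
    return rows


main_run = {"pocket-simple-01": {"gouge_count": 0, "excess_count": 12,
                                 "generation_time_ms": 8200}}
pr_run = {"pocket-simple-01": {"gouge_count": 1, "excess_count": 12,
                               "generation_time_ms": 7900}}

all_rows = compare_runs(main_run, pr_run)
regressions = compare_runs(main_run, pr_run, only_regressions=True)
print(regressions)
```

Timing metrics are noisy from run to run, so in practice a non-zero tolerance per metric would probably be needed to avoid flagging jitter as a regression.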

6. Web Frontend — Deep Dive Viewer¶

The deep dive page is a dedicated 3D visualization and analysis tool for a single (run, part) pair.

Layout¶

+----------------------------------------------------------+
| Part: Simple Pocket  |  Run: abc123  |  Branch: main     |
+----------------------------------------------------------+
| +------------------+  +-------------------------------+  |
| | Metrics Panel    |  |                               |  |
| | - Generation: 8s |  |      3D Viewport              |  |
| | - Gouges: 0      |  |      (Three.js)               |  |
| | - Excess: 12     |  |                               |  |
| | - Length: 24.5m  |  |                               |  |
| | - Tools: 3       |  |                               |  |
| +------------------+  +-------------------------------+  |
| +------------------+  +-------------------------------+  |
| | Operations List  |  | Playback Controls             |  |
| | [x] Facing       |  | [|<] [<] [>||] [>] [>|]       |  |
| | [x] Roughing     |  | Speed: [1x] [2x] [5x] [10x]   |  |
| | [ ] Finishing    |  | Progress: =====>------  62%   |  |
| | [ ] Drilling     |  | Move: 1142 / 1842             |  |
| +------------------+  +-------------------------------+  |
+----------------------------------------------------------+

3D Viewport Features¶

Phase 1 — Toolpath Visualization:

Render toolpath moves as colored lines (rapid = dashed gray, cutting = solid, color-coded by operation type).
Show tool position as a cylinder/sphere at the current animation point.
Load part geometry from the STEP file (tessellated and exported as STL/GLB during the test run) as a translucent reference surface.
Orbit, pan, zoom controls (Three.js OrbitControls).
Click on a gouge/excess marker to see its details (position, deviation, move index).

Phase 2 — Stock Removal Simulation:

Three approaches to evaluate, in order of implementation ease (the third is a future option):

Approach A: Pre-computed Snapshots

During the test run, the harness exports stock state at regular intervals (e.g., every N moves) as lightweight meshes (PLY or GLB). The viewer loads these snapshots and interpolates between them during playback.

Pros: Simple viewer logic, no in-browser simulation needed, leverages ModuleWorks' high-fidelity dexel sim.

Cons: Large artifact storage per run, snapshot granularity limits scrubbing resolution. At ~50 snapshots x ~2MB each = ~100MB per part per run.

Approach B: In-Browser CSG Simulation

Use three-bvh-csg to perform boolean subtractions of swept tool volumes from the stock mesh in the browser. Batch moves into groups (e.g., 50–100 moves per CSG operation) to keep performance tractable.

Pros: No pre-computed data needed, smooth scrubbing, small artifact size.

Cons: Lower fidelity than dexel-based sim, performance limits with complex parts, must generate swept tool volumes in JS.

Approach C: WebGPU Dexel Simulation (Future)

Implement a simplified dexel grid in a WebGPU compute shader. The tool geometry and moves are uploaded as buffers; the shader subtracts tool volumes from the dexel grid in parallel.

Pros: High fidelity, real-time performance, runs entirely in browser.

Cons: Significant development effort, WebGPU browser support still maturing. Best suited for a later phase when the viewer is proven valuable.

Recommended path: Start with Approach A (pre-computed snapshots) for the MVP because it reuses the existing ModuleWorks simulation and keeps the viewer simple. Transition to Approach C if interactive scrubbing and fidelity become important enough to justify the investment.
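
For Approach A, the viewer needs to map the current playback position to a pair of snapshots and a blend factor. The sketch below shows that index arithmetic in Python for clarity (the viewer itself would be JavaScript). It assumes snapshot 0 is the initial stock, snapshot k holds the stock after k x N moves, and a final snapshot is taken after the last move. The function names are hypothetical.

```python
import math


def snapshot_count(total_moves: int, interval: int) -> int:
    """Initial stock, one snapshot per interval, and one after the final move."""
    return math.ceil(total_moves / interval) + 1


def locate(move_index: int, total_moves: int, interval: int) -> tuple:
    """Return (lower snapshot, upper snapshot, blend factor in [0, 1])."""
    if not 0 <= move_index <= total_moves:
        raise ValueError("move_index out of range")
    last = snapshot_count(total_moves, interval) - 1
    lower = move_index // interval
    if lower >= last:
        return last, last, 0.0  # exactly on the final snapshot
    lower_move = lower * interval
    # The final interval may be shorter than the others.
    upper_move = min((lower + 1) * interval, total_moves)
    blend = (move_index - lower_move) / (upper_move - lower_move)
    return lower, lower + 1, blend


# 1842 moves with a snapshot every 38 moves gives 50 snapshots.
count = snapshot_count(1842, 38)
lower, upper, blend = locate(1142, 1842, 38)
print(count, lower, upper, round(blend, 3))
```

With the playback position from the layout mock-up (move 1142 of 1842), the viewer would blend between snapshots 30 and 31. How two meshes are actually interpolated is a separate problem that this sketch does not address.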

Phase 3 — Machine Visualization (Future):

Render a simplified 5-axis machine model (imported as GLTF) with kinematic chain.
Replay the toolpath with the machine moving (table rotation, spindle translation).
This is a significant effort and should only be pursued after the toolpath viewer and stock sim are proven useful.

Data Requirements for Deep Dive¶

The test harness must export the following artifacts for each part to enable the deep dive viewer:

Artifact	Format	Purpose	Approx. Size
Toolpath moves	JSON	Animate tool position	1–5 MB
Part mesh	GLB	Reference geometry	0.5–2 MB
Initial stock mesh	GLB	Starting material block	< 0.1 MB
Stock snapshots	GLB (x50)	Stock removal replay (Phase 2A)	~100 MB
Collision markers	JSON	Highlight collision points	< 0.1 MB
Gouge/excess markers	JSON	Highlight deviation points	< 0.1 MB
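
The sizes in this table can be rolled up into a rough storage estimate. The sketch below is back-of-envelope arithmetic, not a measurement: it reuses the 100 parts and 10 runs/day from the SQLite sizing, treats each "< 0.1 MB" entry as a 0 to 0.1 MB range, and assumes every run stores every artifact for every part.

```python
# (low, high) size in MB per part per run, taken from the table above.
ARTIFACT_MB = {
    "toolpath_moves": (1.0, 5.0),
    "part_mesh": (0.5, 2.0),
    "initial_stock_mesh": (0.0, 0.1),
    "stock_snapshots": (100.0, 100.0),
    "collision_markers": (0.0, 0.1),
    "gouge_excess_markers": (0.0, 0.1),
}
PARTS = 100        # corpus size used in the SQLite sizing
RUNS_PER_DAY = 10  # run rate used in the SQLite sizing


def per_part_mb(include_snapshots: bool = True) -> tuple:
    """Low and high estimate of artifact size for one part in one run."""
    sizes = [size for name, size in ARTIFACT_MB.items()
             if include_snapshots or name != "stock_snapshots"]
    return sum(low for low, _ in sizes), sum(high for _, high in sizes)


def per_day_gb(include_snapshots: bool = True) -> tuple:
    """Low and high estimate of daily artifact growth, using 1 GB = 1000 MB."""
    low, high = per_part_mb(include_snapshots)
    scale = PARTS * RUNS_PER_DAY / 1000
    return low * scale, high * scale


for include in (True, False):
    low, high = per_day_gb(include)
    print(f"snapshots={include}: {low:.1f} to {high:.1f} GB/day")
```

Under these assumptions the stock snapshots dominate: roughly 100 GB/day with them versus a few GB/day without, which is why snapshot retention and sampling (for example, snapshots only on main) are worth deciding early.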

Design Alternatives Considered¶

Results Storage¶

Option	Pros	Cons	Verdict
SQLite + filesystem	Zero ops, portable, sufficient scale	Single-writer, no built-in replication	Chosen — right-sized for the team and data volume
PostgreSQL	Rich queries, concurrent writes	Requires a managed service or self-hosting	Overkill for < 1M rows and 2–5 users
JSON files in git	Zero infrastructure, version-controlled	Slow queries, repo bloat, merge conflicts	Too limited for time-series queries
InfluxDB / TimescaleDB	Purpose-built for time-series	Additional service to manage, learning curve	Over-engineered for this use case

Dashboard Hosting¶

Option	Pros	Cons	Verdict
Cloudflare Pages + Workers	Already used for docs, free tier, edge performance	Workers have execution limits	Chosen — unified with existing infra
GitHub Pages + static JSON	Zero cost, simple	No API, limited interactivity	Viable for MVP charts, not for deep dive
Self-hosted VM	Full control	Ops burden, cost	Unnecessary at this scale
Grafana	Powerful charting, alerting	Heavy for this use case, separate system	Better suited if we already ran it for other monitoring

Benchmark Tracking Integration¶

Option	Pros	Cons	Verdict
Custom (this design)	Full control over metrics and deep dive, CAM-specific features	More to build	Chosen — generic tools don't support 3D deep dive or CAM-specific metrics
github-action-benchmark	Free, simple, GitHub Pages charts	No deep dive, limited customization, charts only	Good for supplementary PR comments
Bencher.dev	Statistical regression detection, nice dashboard	No 3D viewer, SaaS dependency, limited customization	Could supplement but not replace

3D Viewer¶

Option	Pros	Cons	Verdict
Three.js (custom)	Full control, large ecosystem, PLY/STL loaders built in	Must build viewer from scratch	Chosen — most flexible for our specific needs
Reuse native ImGui viewer	Already built, high fidelity	Not web-accessible, requires licenses on viewer machine	Could supplement for local deep debugging
CAMotics (WASM port)	Full stock sim, open source	Major porting effort, 3-axis only	Not viable for 5-axis

Implementation Roadmap¶

Phase 1: CLI Test Harness (2–3 weeks)¶

Goal: Run the full corpus from the command line and produce a structured JSON report.

[ ] Define parts.json manifest with 20 initial parts
[ ] Write run_benchmark.py harness script
[ ] Add total_toolpath_length_mm metric to machenit CLI JSON output
[ ] Add --export-ply integration to the harness for artifact capture
[ ] Add GLB export of part mesh and stock (via a small export utility or script)
[ ] Validate harness on the Windows machine with licensed SDKs
[ ] Document dry-run usage for developers

Deliverable: A developer can run python run_benchmark.py --manifest parts.json --binary machenit.exe and get a results.json file.

Phase 2: CI Integration (1–2 weeks)¶

Goal: Benchmark suite runs automatically on PRs and merges to main.

[ ] Set up self-hosted GitHub Actions runner on the Windows machine (as service)
[ ] Write .github/workflows/cam-benchmark.yml
[ ] Implement pr_comment.py to post metric summaries on PRs
[ ] Test with a real PR cycle

Deliverable: Every PR gets an automated comment showing CAM metrics vs. main.

Phase 3: Results Backend (2–3 weeks)¶

Goal: Store results persistently and serve them via API.

[ ] Implement SQLite schema and ingestion script (upload_results.py)
[ ] Build FastAPI server with endpoints for runs, parts, history, and comparison
[ ] Set up artifact storage directory structure
[ ] Deploy API (Cloudflare Worker, small VM, or on the runner machine itself)
[ ] Backfill results from initial CI runs

Deliverable: API returns historical metrics for any part on any branch.

Phase 4: Dashboard (2–3 weeks)¶

Goal: Web-based metric trends and run exploration.

[ ] Scaffold React app (Vite + Tailwind)
[ ] Build corpus overview page with aggregate charts
[ ] Build part detail page with per-part trend lines
[ ] Build run detail page with per-part results table
[ ] Build PR comparison view with delta highlighting
[ ] Deploy to Cloudflare Pages

Deliverable: Team can view metric trends in a browser and navigate from aggregate to specific runs.

Phase 5: Deep Dive Viewer — Toolpath (2–3 weeks)¶

Goal: 3D toolpath visualization and replay for individual test results.

[ ] Build Three.js viewport with OrbitControls
[ ] Load and render part mesh (GLB via Three.js GLTFLoader)
[ ] Render toolpath moves as colored lines
[ ] Implement animated replay with play/pause/speed/scrub
[ ] Show tool position cylinder moving along the path
[ ] Overlay gouge/excess markers as colored spheres
[ ] Wire up navigation from dashboard to deep dive

Deliverable: Click a part in the dashboard, see its toolpath animated over the part geometry.

Phase 6: Deep Dive Viewer — Stock Removal (3–4 weeks)¶

Goal: Interactive stock removal simulation in the browser.

[ ] Integrate snapshot export into the test harness (GLB at intervals)
[ ] Load and display stock snapshots in the viewer, synced to playback position
[ ] Add color-coded deviation overlay (gouge = red, excess = blue, clean = green)
[ ] Implement smooth interpolation between snapshots
[ ] Evaluate three-bvh-csg as an alternative to pre-computed snapshots
[ ] Optimize for parts with large move counts

Deliverable: Replay shows material being removed from stock in sync with the toolpath.

Future Phases¶

Run-to-run comparison viewer — Side-by-side 3D comparison of two runs.
Machine visualization — Animated 5-axis machine model with kinematic replay.
WebGPU dexel simulation — High-fidelity in-browser stock removal.
Alerting — Slack/email notifications on significant regressions.
Vericut integration — Run top candidates through Vericut for production-grade validation.
Expanded corpus management — UI for adding/categorizing/retiring test parts.

Deployment Architecture¶

flowchart LR
    subgraph GitHub
        Repo[CAM Repo]
        GHA[GitHub Actions]
    end

    subgraph WinRunner["Windows Machine"]
        Runner[GH Actions Runner\nWindows Service]
        MW[ModuleWorks SDK]
        AS[Analysis Situs SDK]
        Build["machenit.exe"]
        Harness["run_benchmark.py"]
    end

    subgraph Server["API Server"]
        API[FastAPI]
        SQLite[(SQLite)]
        ArtifactDir["/artifacts"]
    end

    subgraph CF["Cloudflare"]
        Pages[Cloudflare Pages\nDashboard SPA]
    end

    Repo -->|trigger| GHA
    GHA -->|dispatch| Runner
    Runner --> Build
    Build --> Harness
    Harness -->|POST results| API
    Harness -->|upload artifacts| ArtifactDir
    Pages -->|fetch data| API
    API --> SQLite
    API --> ArtifactDir

API Server Location

The API server can run on the same Windows machine as the runner initially, or on a small cloud VM. Since the dashboard is a static SPA on Cloudflare Pages, the API just needs to be reachable from the browser. A Cloudflare Tunnel can expose a local server without a public IP.

Key Design Decisions¶

Decision	Rationale
Python for harness	Cross-platform, easy JSON handling, no compilation step. The harness is glue code — simplicity matters more than performance.
SQLite over Postgres	Zero ops overhead. At our data volume (< 1M rows), SQLite is more than sufficient and can be backed up by copying a single file.
FastAPI over serverless	A single long-running process is simpler to reason about than serverless functions for this workload. Can be moved to serverless later if needed.
Pre-computed stock snapshots	Leverages the existing high-fidelity ModuleWorks simulation rather than rebuilding a lower-fidelity version in the browser. Trading storage for simplicity.
Separate repo for dashboard	The dashboard and API are decoupled from the CAM code. The CAM repo has the harness and manifest; the dashboard repo has the frontend and API. This keeps CI concerns separate.
Self-hosted runner with labels	Licensing constraints require specific hardware. Labels ensure benchmark jobs only run on the licensed machine while other CI (linting, docs) can use GitHub-hosted runners.

Open Questions¶

API hosting — Run on the Windows machine behind a Cloudflare Tunnel, or provision a small Linux VM? The tunnel approach is zero-cost but ties availability to the Windows machine.
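Artifact retention — How long to keep PLY/GLB artifacts for deep dive? Storage grows quickly once stock snapshots are exported: at the Approach A estimate of ~100 MB per part per run, 100 parts produce ~10 GB per run, so full history would pass ~100 GB after about ten full-corpus runs rather than after a year. Consider a retention policy (e.g., keep last 30 days of artifacts, metrics forever).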
Stock snapshot format — GLB is compact and Three.js-native. PLY is what machenit already exports. Adding a GLB export path to the harness (via a conversion step) is low effort and worth the viewer simplicity.
Multi-strategy runs — Should each part run exactly one assigned strategy, or should some parts run multiple strategies for comparison? The manifest supports both; the question is whether to exercise it early.
Notification channel — Where should regression alerts go? GitHub PR comments are the MVP. Slack integration is a natural next step for push-to-main regressions.
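
If the 30-day example from the artifact retention question were adopted, selecting what to delete could be as simple as the sketch below. It only picks run IDs whose artifact directories are past the window and leaves metric rows alone. The function name, the 30-day default, and the sample runs are all illustrative assumptions.

```python
from datetime import datetime, timedelta, timezone


def parse_timestamp(value: str) -> datetime:
    """Parse the bundle's ISO 8601 UTC format, e.g. 2026-03-12T14:30:22Z."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def runs_to_prune(runs: list, now: datetime, keep_days: int = 30) -> list:
    """Return IDs of runs whose artifacts are older than the retention window."""
    cutoff = now - timedelta(days=keep_days)
    return [run["id"] for run in runs if parse_timestamp(run["timestamp"]) < cutoff]


# Made-up run records shaped like rows of the runs table.
runs = [
    {"id": "20260312-143022-a1b2c3d", "timestamp": "2026-03-12T14:30:22Z"},
    {"id": "20260401-090000-b2c3d4e", "timestamp": "2026-04-01T09:00:00Z"},
]
now = datetime(2026, 4, 15, tzinfo=timezone.utc)
stale = runs_to_prune(runs, now)
print(stale)
```

A real cleanup job would then remove `artifacts/{run_id}/` for each returned ID and the matching rows in the artifacts table, while keeping the runs and results rows so trend charts stay complete.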