MARB studies

Each MARB study is a self-contained experiment that uses the same task, the same authored part kit, and the same automated CADCLAW grader. Only the driver, or the short note we give it, changes. This page is the running log, so every result has one place to live.

A note on terms. Buildability metrics (whether the model exported a loadable STEP file, and how many part instances it placed) describe what a model built. They are not the grade, which scores how correctly each part is located, oriented, and gapped against the answer key. Grading is a separate step. Studies that report buildability say so plainly.

| Study | Driver | Date | Runs | Headline | Status |
| --- | --- | --- | --- | --- | --- |
| Local open-weight anchor | qwen3-coder-next (80B, on one local box) | 2026-05-30 | 30 | Recursive prompt study. Buildability went from 1 in 5 to 5 in 5 after naming one CAD command. More design guidance made it worse. | Published |
| First results, frontier models | Claude (Fusion, CadQuery), OpenAI Codex (CadQuery) | 2026-05-26 | 3 | Each placed about 100 authored parts from one photo. Scores 12 to 15 out of 100. None buildable yet. | Published |

Study: the local open-weight anchor

Driver: qwen3-coder-next:q4_K_M (80B total, 3B active), text only, on a single NVIDIA GX-10. CAD tool: CadQuery 2.7.0. Kit: v1.1. Blind run, no internet, no memory of past work. 30 runs in six cohorts. Buildability metrics, not grades.

The cloud models in the first results are large and expensive. This study asks the honest floor question: how does a strong coding model that a small shop could own and run for free, on one machine, do on the same task? And once it is running, what short note actually helps it?

We ran it thirty times and changed only the brief operational note before each batch of five. Every note stayed inside the fairness wall: it clarified the CAD tool or the task, and never revealed the reference design.

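| Cohort | Note given | Turn budget | Buildable file | Parts placed (median) |
| --- | --- | --- | --- | --- |
| A | None (control) | 8 | 1 of 5 | 15 |
| B | Correct CAD export idiom | 8 | 5 of 5 | 98 |
| C | B, plus a build-volume clarification | 8 | 3 of 5 | 28 |
| D | B, plus design-goal requests | 8 | 2 of 5 | 24 |
| E | Same as D | 14 | 4 of 5 | 30 |
| F | Lean note, the export idiom sharpened | 8 | 5 of 5 | 84 |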

Target is about 100 placed part instances. The kit holds authored STEP parts; the model places each one and exports a single STEP file.
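For readers who want to compare cohorts on one scale, the cohort table can be restated as a buildable rate and a share of the part target. The sketch below is illustrative only and is not part of the MARB harness. The names `Cohort`, `buildable_rate`, and `coverage` are hypothetical, and the target of "about 100" is treated as exactly 100 for the arithmetic. These are still buildability metrics, not grades.

```python
from dataclasses import dataclass

# Assumption: the page says "about 100"; we use exactly 100 for illustration.
TARGET_PARTS = 100
RUNS_PER_COHORT = 5


@dataclass(frozen=True)
class Cohort:
    """One row of the cohort table (hypothetical structure, not the harness's)."""
    label: str
    note: str
    turn_budget: int
    buildable: int     # runs out of 5 that produced a buildable file
    median_parts: int  # median part instances placed


# Values copied from the cohort table on this page.
COHORTS = [
    Cohort("A", "None (control)", 8, 1, 15),
    Cohort("B", "Correct CAD export idiom", 8, 5, 98),
    Cohort("C", "B, plus a build-volume clarification", 8, 3, 28),
    Cohort("D", "B, plus design-goal requests", 8, 2, 24),
    Cohort("E", "Same as D", 14, 4, 30),
    Cohort("F", "Lean note, the export idiom sharpened", 8, 5, 84),
]


def buildable_rate(cohort: Cohort) -> float:
    """Fraction of the cohort's runs that produced a buildable file."""
    return cohort.buildable / RUNS_PER_COHORT


def coverage(cohort: Cohort) -> float:
    """Median parts placed as a fraction of the (assumed) 100-part target."""
    return cohort.median_parts / TARGET_PARTS


if __name__ == "__main__":
    for c in COHORTS:
        print(
            f"{c.label} (budget {c.turn_budget}): "
            f"buildable {buildable_rate(c):.0%}, "
            f"median parts at {coverage(c):.0%} of target"
        )
    total_buildable = sum(c.buildable for c in COHORTS)
    total_runs = RUNS_PER_COHORT * len(COHORTS)
    print(f"All cohorts: {total_buildable} of {total_runs} runs buildable")
```

Keeping the two ratios as separate functions mirrors the table, which reports file buildability and part count as separate columns rather than as one combined score.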

What the recursion taught us

What this study does not yet answer

Method, harness, and the full run catalogue are in the open-source repository. This study reports buildability; correctness grading uses the same CADCLAW gates as every other MARB run.