MARB studies
Each MARB study is a self-contained experiment on the same task, the same authored part kit, and the same automated CADCLAW grader. Only the driver, or the short note we give it, changes. This page is the running log, so every result has one place to live.
A note on terms. Buildability metrics (did the model export a loadable STEP, and how many part instances did it place) describe what a model built. They are not the grade, which scores how correctly each part is located, oriented, and gapped against the answer key. Grading is a separate step. Studies that report buildability say so plainly.
| Study | Driver | Date | Runs | Headline | Status |
|---|---|---|---|---|---|
| Local open-weight anchor | qwen3-coder-next (80B, on one local box) | 2026-05-30 | 30 | Recursive prompt study. Buildability went from 1 in 5 to 5 in 5 after naming one CAD command. More design guidance made it worse. | Published |
| First results, frontier models | Claude (Fusion, CadQuery), OpenAI Codex (CadQuery) | 2026-05-26 | 3 | Each placed about 100 authored parts from one photo. Scores 12 to 15 out of 100. None buildable yet. | Published |
Study: the local open-weight anchor
The cloud models in the first results are large and expensive. This study asks the honest floor question: how well does a strong coding model that a small shop could own and run for free, on one machine, do on the same task? And once it is running, which short note actually helps it?
We ran it thirty times, in six batches of five, and changed only the brief operational note before each batch. The one exception is cohort E, which reuses cohort D's note with a longer turn budget. Every note stayed inside the fairness wall: it clarified the CAD tool or the task, and never revealed the reference design.
| Cohort | Note given | Turn budget | Buildable file | Parts placed (median) |
|---|---|---|---|---|
| A | None (control) | 8 | 1 of 5 | 15 |
| B | Correct CAD export idiom | 8 | 5 of 5 | 98 |
| C | B, plus a build-volume clarification | 8 | 3 of 5 | 28 |
| D | B, plus design-goal requests | 8 | 2 of 5 | 24 |
| E | Same as D | 14 | 4 of 5 | 30 |
| F | Lean note, the export idiom sharpened | 8 | 5 of 5 | 84 |
Target is about 100 placed part instances. The kit holds authored STEP parts; the model places each one and exports a single STEP file.
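The two columns on the right can be read as rates. The sketch below is only a convenience for doing that arithmetic: it copies the aggregate numbers from the table above and computes the buildable rate and the median part count as a share of the roughly 100-part target. The names `Cohort`, `buildable_rate`, `target_coverage`, and `best_cohorts` are hypothetical and are not part of the MARB harness; no per-run data is assumed.

```python
from dataclasses import dataclass

TARGET_PARTS = 100      # "about 100 placed part instances", per the note above
RUNS_PER_COHORT = 5     # each cohort is one batch of five runs


@dataclass(frozen=True)
class Cohort:
    """One row of the results table (hypothetical record type)."""
    name: str
    turn_budget: int
    buildable: int      # runs that produced a buildable file, out of five
    median_parts: int   # median part instances placed


# Values copied directly from the table; nothing here is new data.
COHORTS = [
    Cohort("A", 8, 1, 15),
    Cohort("B", 8, 5, 98),
    Cohort("C", 8, 3, 28),
    Cohort("D", 8, 2, 24),
    Cohort("E", 14, 4, 30),
    Cohort("F", 8, 5, 84),
]


def buildable_rate(cohort: Cohort) -> float:
    """Fraction of the cohort's runs that exported a buildable file."""
    return cohort.buildable / RUNS_PER_COHORT


def target_coverage(cohort: Cohort) -> float:
    """Median parts placed as a fraction of the approximate target."""
    return cohort.median_parts / TARGET_PARTS


def best_cohorts(cohorts: list[Cohort]) -> list[str]:
    """Names of the cohorts with the highest buildable count."""
    top = max(c.buildable for c in cohorts)
    return [c.name for c in cohorts if c.buildable == top]


if __name__ == "__main__":
    for c in COHORTS:
        print(
            f"{c.name}: buildable {buildable_rate(c):.0%}, "
            f"median parts {target_coverage(c):.0%} of target"
        )
```

These are buildability figures only. They say nothing about the grade, which is computed in a separate step.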
What the recursion taught us
- One tool step gated everything. With no help, four of five runs wrote sensible placement code, then failed on the final export by handing an assembly object to a function that needs a solid. Two lines naming the correct export call flipped the result from 1 of 5 to 5 of 5.
- More guidance made it worse. Adding a build-volume note, then requests to consider rigidity and manufacturability, lowered buildability each time: from 5 of 5 to 3 of 5, then to 2 of 5. A small model on a short budget spends its turns reading and reasoning instead of building.
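- More time helped less than expected. Raising the turn budget from 8 to 14 brought the export rate back up from 2 of 5 to 4 of 5, but it did not recover the part count (a median of 30, against 98 for the export idiom alone) or the design quality. The extra time went to a tool-confusion loop, not to better engineering.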
- Lean and exact won. The best note named only the failure modes the model actually hit, and dropped everything else. It matched the best buildability (5 of 5, with a median of 84 parts placed) and removed the worst failure loop.
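The export failure in the first point is a type mismatch: an assembly passed to a call that accepts only a single solid. The sketch below reproduces that shape of error with stand-in classes. It is an illustration under stated assumptions, not the code the model wrote and not the API of any real CAD library: `Solid`, `Assembly`, `export_solid`, `export_assembly`, and `build_kit` are all hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class Solid:
    """Stand-in for a single CAD solid (hypothetical)."""
    name: str


@dataclass
class Assembly:
    """Stand-in for an assembly holding many placed solids (hypothetical)."""
    children: list = field(default_factory=list)

    def add(self, solid: Solid) -> "Assembly":
        self.children.append(solid)
        return self


def export_solid(shape: Solid, path: str) -> str:
    """Stand-in exporter that accepts one solid and rejects anything else."""
    if not isinstance(shape, Solid):
        raise TypeError(
            f"export_solid needs a Solid, got {type(shape).__name__}"
        )
    return f"{path}: 1 solid"


def export_assembly(assembly: Assembly, path: str) -> str:
    """Stand-in exporter that accepts an assembly and writes every child."""
    if not isinstance(assembly, Assembly):
        raise TypeError(
            f"export_assembly needs an Assembly, got {type(assembly).__name__}"
        )
    return f"{path}: {len(assembly.children)} part instances"


def build_kit(count: int) -> Assembly:
    """Place `count` dummy parts into one assembly."""
    assembly = Assembly()
    for index in range(count):
        assembly.add(Solid(f"part_{index}"))
    return assembly


if __name__ == "__main__":
    kit = build_kit(3)
    try:
        export_solid(kit, "out.step")      # the failing pattern
    except TypeError as error:
        print("export failed:", error)
    print(export_assembly(kit, "out.step"))  # the pattern the note names
```

The placement code before the export is untouched in both paths, which matches what the runs showed: the work was sensible up to the last call, and a note naming the right exporter was enough to unblock it.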
What this study does not yet answer
- Structural quality. Across all thirty runs the output is parts placed near each other, not a rigid jointed frame. Measuring that needs the grader's higher levels for interfaces and motion, which are next on the roadmap.
- Design judgement at scale. Having a model weigh manufacturability, thermal load, vibration, fatigue, and kinematics is the real goal of the benchmark. Asking for it pays off only with a larger model, a longer budget, and a grader that can score those modes. That work belongs in the shared task definition, not in a local note.
Method, harness, and the full run catalogue are in the open-source repository. This study reports buildability; correctness grading uses the same CADCLAW gates as every other MARB run.