On this test, Depaza leads.

There is no fair way to compare apples with bananas. Different agents are built on different models, run in different environments and are optimised for different things (see how our models work for the details of our approach). We therefore defined one narrow, practical test: the exact same 10-minute production coding task, given to Grok, Claude Code and Depaza on 15 June 2026 under identical conditions. We published the full prompt, the skeleton, the verifier, every solution and the raw logs. On this test, Depaza led.

Download the full kit · Read the method

depaza-coder · benchmark

$ depaza benchmark --task durafmt

Running the 10-minute coding benchmark…

✓ 38 / 38 acceptance tests pass

✓ Engineering-quality review

Overall score (higher is better)

Depaza (Depaza · 🇪🇺), leader: 100.0

Claude Code (Anthropic · 🇺🇸): 93.8

Grok (xAI · 🇺🇸): 75.0

Overall = % of the 16 quality bars cleared

✓ Test completed successfully 🎉
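As a quick check on the arithmetic, the overall scores can be recomputed from the category totals reported in the bar-by-bar table (9/9 plus 7/7, 6/7 and 3/7). This is a minimal sketch; the variable and function names are illustrative and not part of the benchmark kit.

```python
# Bars cleared per agent: correctness total + engineering-quality total,
# as reported in the bar-by-bar table on this page.
TOTAL_BARS = 16
bars_cleared = {
    "Depaza": 9 + 7,
    "Claude Code": 9 + 6,
    "Grok": 9 + 3,
}


def overall_score(cleared: int, total: int = TOTAL_BARS) -> float:
    """Return the percentage of quality bars cleared."""
    return 100 * cleared / total


for agent, cleared in bars_cleared.items():
    # One decimal place, matching the scoreboard.
    print(f"{agent}: {overall_score(cleared):.1f}")
```

Fifteen of sixteen bars is 93.75%, which the scoreboard shows rounded to 93.8.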

Bar by bar

Depaza cleared every bar on this test.

Line-by-line review of the actual code each agent produced against the requirements we defined for the task. Green means it met that bar.

✓ Clears the bar	— Does not

The bar	Depaza (Depaza · 🇪🇺, leader)	Claude Code (Anthropic · 🇺🇸)	Grok (xAI · 🇺🇸)
Correctness & robustness	9/9	9/9	9/9
All 38 acceptance tests pass	✓	✓	✓
Roundtrip property: parse(format(x)) == x	✓	✓	✓
CLI parse + format both work	✓	✓	✓
Whitespace- and tab-tolerant parsing	✓	✓	✓
Handles very large values (1000w)	✓	✓	✓
Case-insensitive units (1H = 1h)	✓	✓	✓
Rejects malformed input cleanly	✓	✓	✓
Integer-exact, no floating-point drift	✓	✓	✓
Standard library only, zero dependencies	✓	✓	✓
Engineering quality	7/7	6/7	3/7
Long style with pluralised words	✓	✓	✓
Long style stays consistent at zero ("0 seconds")	✓	✓	—
Field-level error messages (names the bad unit)	✓	✓	—
Explicit public API (__all__)	✓	✓	—
Full type hints	✓	✓	✓
Module + function docstrings	✓	✓	✓
Rich inline documentation of the logic	✓	—	—

16/16 on this test. Depaza was the only one to clear every bar.

All three participants satisfied the nine correctness requirements. Depaza was the only one that also satisfied all seven engineering-quality requirements we defined for a complete, production-grade result. The others shipped correct, tested diffs but did not meet the full standard on this benchmark.

Time is irrelevant

Seconds are a vanity metric. Correct code is the product.

Yes, the wall-clock numbers differ. They always will, swinging with model load, network conditions and how hard an agent double-checks itself. Grok was fastest to a passing diff. Depaza took longer because it refused to stop at "the tests pass". The extra time went into the engineering bars: the rich inline documentation of the logic, which only Depaza cleared, plus the explicit public API, field-level error messages that name the bad unit, and long-style consistency at zero, all of which Grok missed. The others optimised for the stopwatch. Depaza optimised for the code that ships and has to be maintained. What a developer ships is the diff, and all three diffs pass the acceptance tests. We publish the times anyway, because we have nothing to hide. Just do not mistake a stopwatch for quality.

Every one of them crossed the same finish line: a correct, tested durafmt. The clock is the least interesting column on the page.

Reproducible by design

How we ran and published the test

We decided to benchmark a narrow, practical use case — 10-minute coding — under controlled conditions. Same task, same rules, same verifier for all three agents. We made the entire experiment public so anyone can reproduce it.

📋

Identical task

The same durafmt spec, the same starter skeleton and the same executable acceptance tests, handed to every agent.

⚖️

Objective judge

An independent verifier runs the acceptance suite plus adversarial cases. The tests decide the result on this benchmark.

🤖

Fair power mode

Every CLI ran headless with auto-approve — claude -p, grok -p, depaza -p --yolo — started within the same second.

📦

Open kit

The prompt, skeleton, verifier, every produced solution and the raw logs are all in the downloadable zip.

The honest footnote

The verifier flags three "failures" — and all three agents hit the same ones, because the verifier is wrong, not the agents. parse("1w 2d") is 777600 seconds; the verifier secretly expected a value with an extra three hours baked in. The other two ("0" and "-0s") are inputs the agents correctly reject to stay consistent with the shipped tests. We left the bug in the kit on purpose, so you can see we did not score on a curve.
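The disputed value is easy to check by hand. The sketch below assumes the conventional unit sizes (a 7-day week and a 24-hour day) and reads "an extra three hours" literally; the constant names are illustrative, and the verifier's actual expected value should be taken from the kit rather than from this calculation.

```python
# Conventional unit sizes in whole seconds (assumed: 24-hour day, 7-day week).
SECONDS_PER_HOUR = 60 * 60
SECONDS_PER_DAY = 24 * SECONDS_PER_HOUR
SECONDS_PER_WEEK = 7 * SECONDS_PER_DAY

# "1w 2d" = one week plus two days.
one_week_two_days = 1 * SECONDS_PER_WEEK + 2 * SECONDS_PER_DAY
assert one_week_two_days == 777_600

# The same duration with three extra hours added, for comparison.
with_three_extra_hours = one_week_two_days + 3 * SECONDS_PER_HOUR

print(one_week_two_days, with_three_extra_hours)
```

One week is 604800 seconds and two days are 172800 seconds, so the sum is 777600, the value all three agents returned.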

A European data point

We chose a test. We made it public. On that test we lead.

Agents built on different foundations will naturally perform differently on different tasks. We picked one concrete, time-boxed use case, ran the experiment, and published everything. To understand the approach behind Depaza, see how our models work.

How our models work · Security & hosting · Get the CLI

Try the same kind of task yourself.

Install the Depaza Code CLI and run real 10-minute coding work. The full benchmark kit is public if you want to reproduce the test.

Try for free