LLM Benchmarks
Results from Terminal Bench coding tasks · 3 models · 4 runs
omnicoder-9b
2 runs
everything-claude-code
v2.1.92 26 Tasks
3 Success
2 Partial
21 Failure
36.6% Pass Rate
11.5% Terminal Bench
| Task | Status | Tests | Duration | Tokens In | Tokens Out |
|---|---|---|---|---|---|
| build-pmars | Failure 0/4 0% | 4 failed in 0.08s | 21m 21s | 2.19M | 8.2K |
| build-pov-ray | Failure 0/3 0% | 3 failed in 1.28s | 21m 19s | 553.6K | 37.5K |
| cancel-async-tasks | Partial 5/6 83% | 1 failed, 5 passed in 13.07s | 3m 5s | 125.4K | 782 |
| circuit-fibsqrt | Failure 2/3 67% | 1 failed, 2 passed in 3.26s | 440m 3s | 1.69M | 63.3K |
| cobol-modernization | Failure 2/3 67% | 1 failed, 2 passed in 0.16s | 436m 50s | 633.6K | 5.5K |
| code-from-image | Failure 1/2 50% | 1 failed, 1 passed in 0.05s | 1m 40s | 173.9K | 636 |
| fix-git | Success 2/2 100% | 2 passed in 0.06s | 1m 39s | 314.6K | 1.9K |
| fix-ocaml-gc | Failure 0/1 0% | 1 failed in 0.05s | 25m 16s | 4.28M | 20.2K |
| git-leak-recovery | Partial 4/5 80% | 1 failed, 4 passed, 1 warning in 0.33s | 2m 3s | 309.8K | 1.5K |
| gpt2-codegolf | Failure 0/1 0% | 1 failed in 0.18s | 22m 14s | 2.87M | 31.2K |
| headless-terminal | Failure 1/7 14% | 6 failed, 1 passed in 0.21s | 1m 39s | 134.2K | 3.0K |
| kv-store-grpc | Success 7/7 100% | 7 passed in 0.15s | 1m 24s | 292.1K | 2.4K |
| make-doom-for-mips | Failure 0/3 0% | 3 failed in 30.10s | 21m 55s | 3.84M | 43.2K |
| make-mips-interpreter | Failure 0/3 0% | 3 failed in 30.20s | 21m 37s | 3.08M | 10.8K |
| path-tracing | Failure 0/5 0% | 5 failed in 0.82s | 21m 16s | 2.71M | 27.3K |
| path-tracing-reverse | Failure 2/3 67% | 1 failed, 2 passed in 1.74s | 21m 8s | 6.00M | 13.6K |
| polyglot-c-py | Failure 0/1 0% | 1 failed in 0.05s | 20m 25s | 259.6K | 2.5K |
| polyglot-rust-c | Failure 0/1 0% | 1 failed in 0.05s | 21m 32s | 750.4K | 103.1K |
| prove-plus-comm | Failure 2/4 50% | 2 failed, 2 passed in 0.28s | 21m 20s | 3.86M | 28.2K |
| pypi-server | Success 1/1 100% | 1 passed in 1.45s | 21m 13s | 680.4K | 3.1K |
| regex-chess | Failure 0/4 0% | 4 failed in 0.17s | 7m 16s | 76.6K | 32.1K |
| schemelike-metacircular-eval | Failure 0/1 0% | 1 failed in 9.74s | 21m 6s | 4.93M | 21.8K |
| torch-pipeline-parallelism | Failure 2/4 50% | 2 failed, 2 passed in 34.74s | 5m 59s | 205.3K | 5.1K |
| torch-tensor-parallelism | Failure 1/13 8% | 12 failed, 1 passed, 1 warning in 101.08s (0:01:41) | 16m 53s | 184.4K | 1.5K |
| winning-avg-corewars | Failure 2/3 67% | 1 failed, 2 passed in 0.65s | 21m 11s | 1.53M | 58.0K |
| write-compressor | Failure 0/3 0% | 3 failed in 0.80s | 21m 24s | 1.84M | 52.6K |
claude-code
v2.1.91 26 Tasks
2 Success
3 Partial
21 Failure
32.3% Pass Rate
7.7% Terminal Bench
| Task | Status | Tests | Duration | Tokens In | Tokens Out |
|---|---|---|---|---|---|
| build-pmars | Partial 3/4 75% | 1 failed, 3 passed in 0.23s | 3m 44s | 1.42M | 5.1K |
| build-pov-ray | Failure 0/3 0% | 3 failed in 1.37s | 20m 58s | 1.16M | 4.8K |
| cancel-async-tasks | Partial 5/6 83% | 1 failed, 5 passed in 13.05s | 2m 16s | 115.9K | 799 |
| circuit-fibsqrt | Failure 2/3 67% | 1 failed, 2 passed in 4.02s | 20m 52s | 133.7K | 96.7K |
| cobol-modernization | Success 3/3 100% | 3 passed in 0.28s | 7m 10s | 553.7K | 4.8K |
| code-from-image | Failure 0/2 0% | 2 failed in 0.02s | 21m 10s | 1.97M | 53.4K |
| fix-git | Failure 0/2 0% | 2 failed in 0.07s | 1m 3s | 117.4K | 675 |
| fix-ocaml-gc | Failure 0/1 0% | 1 failed in 0.05s | 26m 24s | 7.54M | 14.2K |
| git-leak-recovery | Partial 4/5 80% | 1 failed, 4 passed, 1 warning in 0.30s | 5m 54s | 470.4K | 2.6K |
| gpt2-codegolf | Failure 0/1 0% | 1 failed in 0.05s | 21m 1s | 1.13M | 63.0K |
| headless-terminal | Failure 1/7 14% | 6 failed, 1 passed in 0.45s | 4m 12s | 144.5K | 2.0K |
| kv-store-grpc | Success 7/7 100% | 7 passed in 0.14s | 9m 48s | 231.8K | 2.4K |
| make-doom-for-mips | Failure 0/3 0% | 3 failed in 30.16s | 21m 30s | 2.57M | 5.0K |
| make-mips-interpreter | Failure 0/3 0% | 3 failed in 30.22s | 21m 35s | 3.27M | 54.9K |
| path-tracing | Failure 0/5 0% | 5 failed in 0.82s | 21m 25s | 826.9K | 46.2K |
| path-tracing-reverse | Failure 2/3 67% | 1 failed, 2 passed in 2.52s | 21m 8s | 2.44M | 7.4K |
| polyglot-c-py | Failure 0/1 0% | 1 failed in 0.05s | 21m 41s | 2.15M | 24.7K |
| polyglot-rust-c | Failure 0/1 0% | 1 failed in 0.05s | 21m 35s | 2.15M | 17.7K |
| prove-plus-comm | Failure 2/4 50% | 2 failed, 2 passed in 0.39s | 21m 8s | 369.1K | 2.6K |
| pypi-server | Failure 0/1 0% | 1 failed in 8.59s | 1m 1s | 22.7K | 215 |
| regex-chess | Failure 0/4 0% | 4 failed in 0.18s | 0m 56s | 23.1K | 45 |
| schemelike-metacircular-eval | Failure 0/1 0% | 1 failed in 10.95s | 0m 59s | 22.8K | 63 |
| torch-pipeline-parallelism | Failure 0/4 0% | 4 failed in 18.09s | 4m 15s | 23.0K | 1.0K |
| torch-tensor-parallelism | Failure 0/13 0% | 13 failed, 1 warning in 127.10s (0:02:07) | 4m 58s | 22.8K | 874 |
| winning-avg-corewars | Failure 1/3 33% | 2 failed, 1 passed in 0.04s | 1m 9s | 22.8K | 73 |
| write-compressor | Failure 0/3 0% | 3 failed in 0.77s | 2m 19s | 22.6K | 86 |
qwen3.6-35B-A3B-low-hardware
1 run
claude-code
v2.1.112 26 Tasks
3 Success
1 Partial
22 Failure
39.8% Pass Rate
11.5% Terminal Bench
| Task | Status | Tests | Duration | Tokens In | Tokens Out |
|---|---|---|---|---|---|
| build-pmars | Success 4/4 100% | 4 passed in 0.23s | 8m 40s | 1.27M | 6.5K |
| build-pov-ray | Failure 0/3 0% | 3 failed in 1.28s | 24m 42s | 550.8K | 4.1K |
| cancel-async-tasks | Partial 5/6 83% | 1 failed, 5 passed in 13.07s | 8m 17s | 102.8K | 15.5K |
| circuit-fibsqrt | Failure 2/3 67% | 1 failed, 2 passed in 3.28s | 13m 52s | 52.7K | 32.2K |
| cobol-modernization | Failure 1/3 33% | 2 failed, 1 passed in 0.11s | 20m 48s | 410.8K | 13.0K |
| code-from-image | Failure 1/2 50% | 1 failed, 1 passed in 0.07s | 20m 46s | 470.8K | 18.4K |
| fix-git | Success 2/2 100% | 2 passed in 0.06s | 6m 54s | 238.1K | 1.3K |
| fix-ocaml-gc | Failure 0/1 0% | 1 failed in 0.05s | 28m 42s | 1.19M | 7.4K |
| git-leak-recovery | Success 5/5 100% | 5 passed, 1 warning in 0.26s | 15m 45s | 257.1K | 2.1K |
| gpt2-codegolf | Failure 0/1 0% | 1 failed in 0.05s | 21m 51s | 563.7K | 32.6K |
| headless-terminal | Failure 0/7 0% | 7 failed in 0.17s | 20m 49s | 24.1K | 191 |
| kv-store-grpc | Failure 5/7 71% | 2 failed, 5 passed in 0.13s | 2m 19s | 513.3K | 3.0K |
| make-doom-for-mips | Failure 0/3 0% | 3 failed in 30.14s | 21m 27s | 567.7K | 4.4K |
| make-mips-interpreter | Failure 0/3 0% | 3 failed in 30.21s | 21m 32s | 316.2K | 6.3K |
| path-tracing | Failure 0/5 0% | 5 failed in 0.85s | 21m 1s | 593.3K | 13.3K |
| path-tracing-reverse | Failure 0/3 0% | 3 failed in 1.50s | 20m 49s | 683.6K | 7.1K |
| polyglot-c-py | Failure 0/1 0% | 1 failed in 0.16s | 21m 5s | 180.4K | 24.6K |
| polyglot-rust-c | Failure 0/1 0% | 1 failed in 0.05s | 16m 4s | 24.1K | 32.0K |
| prove-plus-comm | Failure 2/4 50% | 2 failed, 2 passed in 0.26s | 21m 13s | 98.0K | 968 |
| pypi-server | Failure 0/1 0% | 1 failed in 8.59s | 20m 53s | 668.4K | 33.8K |
| regex-chess | Failure 0/4 0% | 4 failed in 0.18s | 15m 36s | 87.5K | 38.2K |
| schemelike-metacircular-eval | Failure 0/1 0% | 1 failed in 10.94s | 21m 28s | 180.5K | 1.3K |
| torch-pipeline-parallelism | Failure 0/4 0% | 4 failed in 17.76s | 22m 54s | 174.1K | 32.8K |
| torch-tensor-parallelism | Failure 9/13 69% | 4 failed, 9 passed, 1 warning in 94.73s (0:01:34) | 16m 59s | 201.1K | 7.6K |
| winning-avg-corewars | Failure 1/3 33% | 2 failed, 1 passed in 0.04s | 19m 55s | 120.2K | 49.2K |
| write-compressor | Failure 0/3 0% | 3 failed in 0.78s | 19m 59s | 24.0K | 161 |
qwopus-9b
1 run
claude-code
v2.1.92 26 Tasks
9 Success
0 Partial
17 Failure
50.5% Pass Rate
34.6% Terminal Bench
| Task | Status | Tests | Duration | Tokens In | Tokens Out |
|---|---|---|---|---|---|
| build-pmars | Success 4/4 100% | 4 passed in 0.23s | 5m 26s | 1.60M | 5.2K |
| build-pov-ray | Failure 0/3 0% | 3 failed in 1.26s | 21m 10s | 3.17M | 8.0K |
| cancel-async-tasks | Success 6/6 100% | 6 passed in 14.11s | 1m 17s | 45.5K | 539 |
| circuit-fibsqrt | Failure 2/3 67% | 1 failed, 2 passed in 3.64s | 14m 25s | 22.7K | 32.0K |
| cobol-modernization | Success 3/3 100% | 3 passed in 0.31s | 20m 51s | 730.0K | 24.9K |
| code-from-image | Failure 0/2 0% | 2 failed in 0.02s | 18m 34s | 230.4K | 32.8K |
| fix-git | Success 2/2 100% | 2 passed in 0.06s | 2m 32s | 141.5K | 1.1K |
| fix-ocaml-gc | Failure 0/1 0% | 1 failed in 0.05s | 25m 53s | 1.26M | 2.4K |
| git-leak-recovery | Success 5/5 100% | 5 passed, 1 warning in 0.28s | 1m 18s | 166.5K | 1.4K |
| gpt2-codegolf | Failure 0/1 0% | 1 failed in 0.06s | 21m 9s | 0 | 0 |
| headless-terminal | Success 7/7 100% | 7 passed in 132.17s (0:02:12) | 8m 41s | 431.8K | 5.2K |
| kv-store-grpc | Success 7/7 100% | 7 passed in 0.14s | 1m 35s | 319.6K | 1.9K |
| make-doom-for-mips | Failure 0/3 0% | 3 failed in 30.16s | 21m 36s | 2.73M | 17.6K |
| make-mips-interpreter | Failure 0/3 0% | 3 failed in 30.21s | 21m 57s | 5.08M | 6.2K |
| path-tracing | Failure 0/5 0% | 5 failed in 0.77s | 21m 20s | 668.0K | 4.5K |
| path-tracing-reverse | Failure 0/3 0% | 3 failed in 1.41s | 19m 25s | 618.9K | 35.6K |
| polyglot-c-py | Failure 0/1 0% | 1 failed in 0.08s | 21m 41s | 790.6K | 15.4K |
| polyglot-rust-c | Failure 0/1 0% | 1 failed in 0.05s | 17m 35s | 22.6K | 32.0K |
| prove-plus-comm | Success 4/4 100% | 4 passed in 0.29s | 2m 40s | 399.8K | 4.3K |
| pypi-server | Success 1/1 100% | 1 passed in 1.35s | 16m 51s | 1.46M | 5.9K |
| regex-chess | Failure 0/4 0% | 4 failed in 0.17s | 14m 47s | 23.0K | 32.0K |
| schemelike-metacircular-eval | Failure 0/1 0% | 1 failed in 10.80s | 21m 6s | 152.1K | 866 |
| torch-pipeline-parallelism | Failure 0/4 0% | 4 failed in 18.85s | 23m 36s | 0 | 0 |
| torch-tensor-parallelism | Failure 3/13 23% | 10 failed, 3 passed, 1 warning in 77.94s (0:01:17) | 24m 49s | 61.7K | 19.3K |
| winning-avg-corewars | Failure 2/3 67% | 1 failed, 2 passed in 0.65s | 20m 58s | 48.6K | 9.3K |
| write-compressor | Failure 1/3 33% | 2 failed, 1 passed in 0.80s | 9m 20s | 831.4K | 28.3K |
How I benchmark LLMs
All results on this page are produced locally, on my own hardware, under reproducible conditions. Here is exactly how.
Hardware
- CPU: AMD Ryzen 7 9800X3D
- GPU: Nvidia RTX 5080 · 16 GB VRAM
- RAM: 32 GB DDR5
- OS: CachyOS (Arch-based)
- Inference runtime: ik_llama.cpp compiled with CUDA
Test conditions
- Dataset: Terminal Bench 2 (software engineering tasks only)
- Framework: Harbor, launched via DevContainer
- Isolation: each task runs in a clean, isolated container
- Agent: the LLM exposes an OpenAI-compatible endpoint; agent scripts interact through the Harbor CLI
- Time keeper: each task is timed to measure execution performance; the agent has only 20 minutes to do the job
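To illustrate the time limit, here is a minimal sketch of running a command under a fixed budget and recording how long it took. This is not how Harbor enforces the limit internally; the function name `run_with_time_limit` and the constant `TIME_LIMIT_SECONDS` are hypothetical and only mirror the 20-minute rule stated above.

```python
import subprocess
import sys
import time

# Assumption: mirrors the 20-minute budget described above.
TIME_LIMIT_SECONDS = 20 * 60


def run_with_time_limit(command, limit_seconds=TIME_LIMIT_SECONDS):
    """Run a command and return (timed_out, elapsed_seconds).

    subprocess.run kills the child process when the timeout expires,
    then raises TimeoutExpired, which is reported here as timed_out=True.
    """
    start = time.monotonic()
    try:
        subprocess.run(command, timeout=limit_seconds, check=False)
        timed_out = False
    except subprocess.TimeoutExpired:
        timed_out = True
    return timed_out, time.monotonic() - start


if __name__ == "__main__":
    # Stand-in for an agent session: a child process that sleeps briefly.
    timed_out, elapsed = run_with_time_limit(
        [sys.executable, "-c", "import time; time.sleep(1)"],
        limit_seconds=5,
    )
    print(f"timed out: {timed_out}, elapsed: {elapsed:.1f}s")
```

The reported duration uses a monotonic clock so that it is not affected by system clock adjustments during a long run.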
Scoring rules
Success
All tests in the task pass (100%)
Partial
At least 75% of tests pass, including error runs that still meet this threshold
Failure
Fewer than 75% of tests pass, or the agent timed out and no tests ran
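The three rules above can be sketched as a small function. This is an illustration under my reading of the rules, not the dashboard's actual code; the name `classify` and the `no_tests_ran` flag are hypothetical.

```python
def classify(passed: int, total: int, no_tests_ran: bool = False) -> str:
    """Map a task's test counts to Success, Partial, or Failure."""
    # A timeout where no tests ran counts as a failure.
    if no_tests_ran or total == 0:
        return "Failure"
    if passed == total:
        return "Success"
    # Integer form of passed / total >= 0.75, avoiding float rounding.
    if 4 * passed >= 3 * total:
        return "Partial"
    return "Failure"


if __name__ == "__main__":
    # Counts taken from rows in the tables above.
    print(classify(7, 7))  # kv-store-grpc, 7/7
    print(classify(3, 4))  # build-pmars, 3/4
    print(classify(5, 7))  # kv-store-grpc, 5/7
```

With counts from the tables, 7/7 is a Success, 3/4 sits exactly on the 75% threshold and is a Partial, and 5/7 (71%) is a Failure.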
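The Pass Rate shown in each run card is the percentage of individual test cases that passed, pooled across all tasks in that run, so tasks with more tests weigh more. The Terminal Bench figure is the share of tasks scored Success (for example, 3 of 26 tasks gives 11.5%). A partial success with an agent error shows a tooltip with the raw error message.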
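As a worked example of the Pass Rate, the sketch below pools the per-task test counts from the qwopus-9b table and reproduces the 50.5% shown on its card. The function name `pass_rate` and the list name `QWOPUS_9B` are mine, introduced for illustration; the numbers are copied from the Status column in table order.

```python
# (passed, total) per task, copied from the qwopus-9b table above.
QWOPUS_9B = [
    (4, 4), (0, 3), (6, 6), (2, 3), (3, 3), (0, 2), (2, 2), (0, 1),
    (5, 5), (0, 1), (7, 7), (7, 7), (0, 3), (0, 3), (0, 5), (0, 3),
    (0, 1), (0, 1), (4, 4), (1, 1), (0, 4), (0, 1), (0, 4), (3, 13),
    (2, 3), (1, 3),
]


def pass_rate(results):
    """Percentage of individual test cases passed, pooled over all tasks."""
    passed = sum(p for p, _ in results)
    total = sum(t for _, t in results)
    return 100 * passed / total if total else 0.0


if __name__ == "__main__":
    # 47 of 93 test cases passed in this run.
    print(f"{pass_rate(QWOPUS_9B):.1f}%")  # prints "50.5%"
```

Because the counts are pooled, a task with 13 tests moves this figure far more than a task with a single test, which is why it differs from a plain average of the per-task percentages.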