Thomas Rumas

omnicoder-9b

2 runs

everything-claude-code

v2.1.92
26 Tasks
3 Success
2 Partial
21 Failure
36.6% Pass Rate
11.5% Terminal Bench
Task Status Tests Duration Tokens In Tokens Out
build-pmars
Failure 0/4 0%
4 failed in 0.08s 21m 21s 2.19M 8.2K
build-pov-ray
Failure 0/3 0%
3 failed in 1.28s 21m 19s 553.6K 37.5K
cancel-async-tasks
Partial 5/6 83%
1 failed, 5 passed in 13.07s 3m 5s 125.4K 782
circuit-fibsqrt
Failure 2/3 67%
1 failed, 2 passed in 3.26s 440m 3s 1.69M 63.3K
cobol-modernization
Failure 2/3 67%
1 failed, 2 passed in 0.16s 436m 50s 633.6K 5.5K
code-from-image
Failure 1/2 50%
1 failed, 1 passed in 0.05s 1m 40s 173.9K 636
fix-git
Success 2/2 100%
2 passed in 0.06s 1m 39s 314.6K 1.9K
fix-ocaml-gc
Failure 0/1 0%
1 failed in 0.05s 25m 16s 4.28M 20.2K
git-leak-recovery
Partial 4/5 80%
1 failed, 4 passed, 1 warning in 0.33s 2m 3s 309.8K 1.5K
gpt2-codegolf
Failure 0/1 0%
1 failed in 0.18s 22m 14s 2.87M 31.2K
headless-terminal
Failure 1/7 14%
6 failed, 1 passed in 0.21s 1m 39s 134.2K 3.0K
kv-store-grpc
Success 7/7 100%
7 passed in 0.15s 1m 24s 292.1K 2.4K
make-doom-for-mips
Failure 0/3 0%
3 failed in 30.10s 21m 55s 3.84M 43.2K
make-mips-interpreter
Failure 0/3 0%
3 failed in 30.20s 21m 37s 3.08M 10.8K
path-tracing
Failure 0/5 0%
5 failed in 0.82s 21m 16s 2.71M 27.3K
path-tracing-reverse
Failure 2/3 67%
1 failed, 2 passed in 1.74s 21m 8s 6.00M 13.6K
polyglot-c-py
Failure 0/1 0%
1 failed in 0.05s 20m 25s 259.6K 2.5K
polyglot-rust-c
Failure 0/1 0%
1 failed in 0.05s 21m 32s 750.4K 103.1K
prove-plus-comm
Failure 2/4 50%
2 failed, 2 passed in 0.28s 21m 20s 3.86M 28.2K
pypi-server
Success 1/1 100%
1 passed in 1.45s 21m 13s 680.4K 3.1K
regex-chess
Failure 0/4 0%
4 failed in 0.17s 7m 16s 76.6K 32.1K
schemelike-metacircular-eval
Failure 0/1 0%
1 failed in 9.74s 21m 6s 4.93M 21.8K
torch-pipeline-parallelism
Failure 2/4 50%
2 failed, 2 passed in 34.74s 5m 59s 205.3K 5.1K
torch-tensor-parallelism
Failure 1/13 8%
12 failed, 1 passed, 1 warning in 101.08s (0:01:41) 16m 53s 184.4K 1.5K
winning-avg-corewars
Failure 2/3 67%
1 failed, 2 passed in 0.65s 21m 11s 1.53M 58.0K
write-compressor
Failure 0/3 0%
3 failed in 0.80s 21m 24s 1.84M 52.6K

claude-code

v2.1.91
26 Tasks
2 Success
3 Partial
21 Failure
32.3% Pass Rate
7.7% Terminal Bench
Task Status Tests Duration Tokens In Tokens Out
build-pmars
Partial 3/4 75%
1 failed, 3 passed in 0.23s 3m 44s 1.42M 5.1K
build-pov-ray
Failure 0/3 0%
3 failed in 1.37s 20m 58s 1.16M 4.8K
cancel-async-tasks
Partial 5/6 83%
1 failed, 5 passed in 13.05s 2m 16s 115.9K 799
circuit-fibsqrt
Failure 2/3 67%
1 failed, 2 passed in 4.02s 20m 52s 133.7K 96.7K
cobol-modernization
Success 3/3 100%
3 passed in 0.28s 7m 10s 553.7K 4.8K
code-from-image
Failure 0/2 0%
2 failed in 0.02s 21m 10s 1.97M 53.4K
fix-git
Failure 0/2 0%
2 failed in 0.07s 1m 3s 117.4K 675
fix-ocaml-gc
Failure 0/1 0%
1 failed in 0.05s 26m 24s 7.54M 14.2K
git-leak-recovery
Partial 4/5 80%
1 failed, 4 passed, 1 warning in 0.30s 5m 54s 470.4K 2.6K
gpt2-codegolf
Failure 0/1 0%
1 failed in 0.05s 21m 1s 1.13M 63.0K
headless-terminal
Failure 1/7 14%
6 failed, 1 passed in 0.45s 4m 12s 144.5K 2.0K
kv-store-grpc
Success 7/7 100%
7 passed in 0.14s 9m 48s 231.8K 2.4K
make-doom-for-mips
Failure 0/3 0%
3 failed in 30.16s 21m 30s 2.57M 5.0K
make-mips-interpreter
Failure 0/3 0%
3 failed in 30.22s 21m 35s 3.27M 54.9K
path-tracing
Failure 0/5 0%
5 failed in 0.82s 21m 25s 826.9K 46.2K
path-tracing-reverse
Failure 2/3 67%
1 failed, 2 passed in 2.52s 21m 8s 2.44M 7.4K
polyglot-c-py
Failure 0/1 0%
1 failed in 0.05s 21m 41s 2.15M 24.7K
polyglot-rust-c
Failure 0/1 0%
1 failed in 0.05s 21m 35s 2.15M 17.7K
prove-plus-comm
Failure 2/4 50%
2 failed, 2 passed in 0.39s 21m 8s 369.1K 2.6K
pypi-server
Failure 0/1 0%
1 failed in 8.59s 1m 1s 22.7K 215
regex-chess
Failure 0/4 0%
4 failed in 0.18s 0m 56s 23.1K 45
schemelike-metacircular-eval
Failure 0/1 0%
1 failed in 10.95s 0m 59s 22.8K 63
torch-pipeline-parallelism
Failure 0/4 0%
4 failed in 18.09s 4m 15s 23.0K 1.0K
torch-tensor-parallelism
Failure 0/13 0%
13 failed, 1 warning in 127.10s (0:02:07) 4m 58s 22.8K 874
winning-avg-corewars
Failure 1/3 33%
2 failed, 1 passed in 0.04s 1m 9s 22.8K 73
write-compressor
Failure 0/3 0%
3 failed in 0.77s 2m 19s 22.6K 86

qwen3.6-35B-A3B-low-hardware

1 run

claude-code

v2.1.112
26 Tasks
3 Success
1 Partial
22 Failure
39.8% Pass Rate
11.5% Terminal Bench
Task Status Tests Duration Tokens In Tokens Out
build-pmars
Success 4/4 100%
4 passed in 0.23s 8m 40s 1.27M 6.5K
build-pov-ray
Failure 0/3 0%
3 failed in 1.28s 24m 42s 550.8K 4.1K
cancel-async-tasks
Partial 5/6 83%
1 failed, 5 passed in 13.07s 8m 17s 102.8K 15.5K
circuit-fibsqrt
Failure 2/3 67%
1 failed, 2 passed in 3.28s 13m 52s 52.7K 32.2K
cobol-modernization
Failure 1/3 33%
2 failed, 1 passed in 0.11s 20m 48s 410.8K 13.0K
code-from-image
Failure 1/2 50%
1 failed, 1 passed in 0.07s 20m 46s 470.8K 18.4K
fix-git
Success 2/2 100%
2 passed in 0.06s 6m 54s 238.1K 1.3K
fix-ocaml-gc
Failure 0/1 0%
1 failed in 0.05s 28m 42s 1.19M 7.4K
git-leak-recovery
Success 5/5 100%
5 passed, 1 warning in 0.26s 15m 45s 257.1K 2.1K
gpt2-codegolf
Failure 0/1 0%
1 failed in 0.05s 21m 51s 563.7K 32.6K
headless-terminal
Failure 0/7 0%
7 failed in 0.17s 20m 49s 24.1K 191
kv-store-grpc
Failure 5/7 71%
2 failed, 5 passed in 0.13s 2m 19s 513.3K 3.0K
make-doom-for-mips
Failure 0/3 0%
3 failed in 30.14s 21m 27s 567.7K 4.4K
make-mips-interpreter
Failure 0/3 0%
3 failed in 30.21s 21m 32s 316.2K 6.3K
path-tracing
Failure 0/5 0%
5 failed in 0.85s 21m 1s 593.3K 13.3K
path-tracing-reverse
Failure 0/3 0%
3 failed in 1.50s 20m 49s 683.6K 7.1K
polyglot-c-py
Failure 0/1 0%
1 failed in 0.16s 21m 5s 180.4K 24.6K
polyglot-rust-c
Failure 0/1 0%
1 failed in 0.05s 16m 4s 24.1K 32.0K
prove-plus-comm
Failure 2/4 50%
2 failed, 2 passed in 0.26s 21m 13s 98.0K 968
pypi-server
Failure 0/1 0%
1 failed in 8.59s 20m 53s 668.4K 33.8K
regex-chess
Failure 0/4 0%
4 failed in 0.18s 15m 36s 87.5K 38.2K
schemelike-metacircular-eval
Failure 0/1 0%
1 failed in 10.94s 21m 28s 180.5K 1.3K
torch-pipeline-parallelism
Failure 0/4 0%
4 failed in 17.76s 22m 54s 174.1K 32.8K
torch-tensor-parallelism
Failure 9/13 69%
4 failed, 9 passed, 1 warning in 94.73s (0:01:34) 16m 59s 201.1K 7.6K
winning-avg-corewars
Failure 1/3 33%
2 failed, 1 passed in 0.04s 19m 55s 120.2K 49.2K
write-compressor
Failure 0/3 0%
3 failed in 0.78s 19m 59s 24.0K 161

qwopus-9b

1 run

claude-code

v2.1.92
26 Tasks
9 Success
0 Partial
17 Failure
50.5% Pass Rate
34.6% Terminal Bench
Task Status Tests Duration Tokens In Tokens Out
build-pmars
Success 4/4 100%
4 passed in 0.23s 5m 26s 1.60M 5.2K
build-pov-ray
Failure 0/3 0%
3 failed in 1.26s 21m 10s 3.17M 8.0K
cancel-async-tasks
Success 6/6 100%
6 passed in 14.11s 1m 17s 45.5K 539
circuit-fibsqrt
Failure 2/3 67%
1 failed, 2 passed in 3.64s 14m 25s 22.7K 32.0K
cobol-modernization
Success 3/3 100%
3 passed in 0.31s 20m 51s 730.0K 24.9K
code-from-image
Failure 0/2 0%
2 failed in 0.02s 18m 34s 230.4K 32.8K
fix-git
Success 2/2 100%
2 passed in 0.06s 2m 32s 141.5K 1.1K
fix-ocaml-gc
Failure 0/1 0%
1 failed in 0.05s 25m 53s 1.26M 2.4K
git-leak-recovery
Success 5/5 100%
5 passed, 1 warning in 0.28s 1m 18s 166.5K 1.4K
gpt2-codegolf
Failure 0/1 0%
1 failed in 0.06s 21m 9s 0 0
headless-terminal
Success 7/7 100%
7 passed in 132.17s (0:02:12) 8m 41s 431.8K 5.2K
kv-store-grpc
Success 7/7 100%
7 passed in 0.14s 1m 35s 319.6K 1.9K
make-doom-for-mips
Failure 0/3 0%
3 failed in 30.16s 21m 36s 2.73M 17.6K
make-mips-interpreter
Failure 0/3 0%
3 failed in 30.21s 21m 57s 5.08M 6.2K
path-tracing
Failure 0/5 0%
5 failed in 0.77s 21m 20s 668.0K 4.5K
path-tracing-reverse
Failure 0/3 0%
3 failed in 1.41s 19m 25s 618.9K 35.6K
polyglot-c-py
Failure 0/1 0%
1 failed in 0.08s 21m 41s 790.6K 15.4K
polyglot-rust-c
Failure 0/1 0%
1 failed in 0.05s 17m 35s 22.6K 32.0K
prove-plus-comm
Success 4/4 100%
4 passed in 0.29s 2m 40s 399.8K 4.3K
pypi-server
Success 1/1 100%
1 passed in 1.35s 16m 51s 1.46M 5.9K
regex-chess
Failure 0/4 0%
4 failed in 0.17s 14m 47s 23.0K 32.0K
schemelike-metacircular-eval
Failure 0/1 0%
1 failed in 10.80s 21m 6s 152.1K 866
torch-pipeline-parallelism
Failure 0/4 0%
4 failed in 18.85s 23m 36s 0 0
torch-tensor-parallelism
Failure 3/13 23%
10 failed, 3 passed, 1 warning in 77.94s (0:01:17) 24m 49s 61.7K 19.3K
winning-avg-corewars
Failure 2/3 67%
1 failed, 2 passed in 0.65s 20m 58s 48.6K 9.3K
write-compressor
Failure 1/3 33%
2 failed, 1 passed in 0.80s 9m 20s 831.4K 28.3K

How I benchmark LLMs

All results on this page are produced locally, on my own hardware, under reproducible conditions. Here is exactly how.

Hardware
  • CPU AMD Ryzen 7 9800X3D
  • GPU Nvidia RTX 5080 · 16 GB VRAM
  • RAM 32 GB DDR5
  • OS CachyOS (Arch-based)
  • Inference runtime ik_llama.cpp compiled with CUDA
Test conditions
  • Dataset Terminal Bench 2 — software engineering tasks only
  • Framework Harbor — launched via DevContainer
  • Isolation Each task runs in a clean, isolated container
  • Agent LLM exposes an OpenAI-compatible endpoint; agent scripts interact through the Harbor CLI
  • Time keeper Each task is timed to measure execution performance, the agent have only 20 minutes to do the job
Scoring rules
Success All tests in the task pass (100%)
Partial At least 75% of tests pass — including error runs that still meet this threshold
Failure Fewer than 75% of tests pass, or the agent timed out and no tests ran

The Pass Rate shown in each run card is the weighted percentage of individual test cases that passed, across all tasks in that run. A partial success with an agent error shows a tooltip with the raw error message.