Benchmarks - Hello ! Welcome to my personnal blog

omnicoder-9b

2 runs

everything-claude-code

v2.1.92

Apr 5, 2026

26 Tasks

3 Success

2 Partial

21 Failure

36.6% Pass Rate

11.5% Terminal Bench

Task	Status	Tests	Duration	Tokens In	Tokens Out
build-pmars	Failure 0/4 0%	4 failed in 0.08s	21m 21s	2.19M	8.2K
build-pov-ray	Failure 0/3 0%	3 failed in 1.28s	21m 19s	553.6K	37.5K
cancel-async-tasks	Partial 5/6 83%	1 failed, 5 passed in 13.07s	3m 5s	125.4K	782
circuit-fibsqrt	Failure 2/3 67%	1 failed, 2 passed in 3.26s	440m 3s	1.69M	63.3K
cobol-modernization	Failure 2/3 67%	1 failed, 2 passed in 0.16s	436m 50s	633.6K	5.5K
code-from-image	Failure 1/2 50%	1 failed, 1 passed in 0.05s	1m 40s	173.9K	636
fix-git	Success 2/2 100%	2 passed in 0.06s	1m 39s	314.6K	1.9K
fix-ocaml-gc	Failure 0/1 0%	1 failed in 0.05s	25m 16s	4.28M	20.2K
git-leak-recovery	Partial 4/5 80%	1 failed, 4 passed, 1 warning in 0.33s	2m 3s	309.8K	1.5K
gpt2-codegolf	Failure 0/1 0%	1 failed in 0.18s	22m 14s	2.87M	31.2K
headless-terminal	Failure 1/7 14%	6 failed, 1 passed in 0.21s	1m 39s	134.2K	3.0K
kv-store-grpc	Success 7/7 100%	7 passed in 0.15s	1m 24s	292.1K	2.4K
make-doom-for-mips	Failure 0/3 0%	3 failed in 30.10s	21m 55s	3.84M	43.2K
make-mips-interpreter	Failure 0/3 0%	3 failed in 30.20s	21m 37s	3.08M	10.8K
path-tracing	Failure 0/5 0%	5 failed in 0.82s	21m 16s	2.71M	27.3K
path-tracing-reverse	Failure 2/3 67%	1 failed, 2 passed in 1.74s	21m 8s	6.00M	13.6K
polyglot-c-py	Failure 0/1 0%	1 failed in 0.05s	20m 25s	259.6K	2.5K
polyglot-rust-c	Failure 0/1 0%	1 failed in 0.05s	21m 32s	750.4K	103.1K
prove-plus-comm	Failure 2/4 50%	2 failed, 2 passed in 0.28s	21m 20s	3.86M	28.2K
pypi-server	Success 1/1 100%	1 passed in 1.45s	21m 13s	680.4K	3.1K
regex-chess	Failure 0/4 0%	4 failed in 0.17s	7m 16s	76.6K	32.1K
schemelike-metacircular-eval	Failure 0/1 0%	1 failed in 9.74s	21m 6s	4.93M	21.8K
torch-pipeline-parallelism	Failure 2/4 50%	2 failed, 2 passed in 34.74s	5m 59s	205.3K	5.1K
torch-tensor-parallelism	Failure 1/13 8%	12 failed, 1 passed, 1 warning in 101.08s (0:01:41)	16m 53s	184.4K	1.5K
winning-avg-corewars	Failure 2/3 67%	1 failed, 2 passed in 0.65s	21m 11s	1.53M	58.0K
write-compressor	Failure 0/3 0%	3 failed in 0.80s	21m 24s	1.84M	52.6K

claude-code

v2.1.91

Apr 3, 2026

26 Tasks

2 Success

3 Partial

21 Failure

32.3% Pass Rate

7.7% Terminal Bench

Task	Status	Tests	Duration	Tokens In	Tokens Out
build-pmars	Partial 3/4 75%	1 failed, 3 passed in 0.23s	3m 44s	1.42M	5.1K
build-pov-ray	Failure 0/3 0%	3 failed in 1.37s	20m 58s	1.16M	4.8K
cancel-async-tasks	Partial 5/6 83%	1 failed, 5 passed in 13.05s	2m 16s	115.9K	799
circuit-fibsqrt	Failure 2/3 67%	1 failed, 2 passed in 4.02s	20m 52s	133.7K	96.7K
cobol-modernization	Success 3/3 100%	3 passed in 0.28s	7m 10s	553.7K	4.8K
code-from-image	Failure 0/2 0%	2 failed in 0.02s	21m 10s	1.97M	53.4K
fix-git	Failure 0/2 0%	2 failed in 0.07s	1m 3s	117.4K	675
fix-ocaml-gc	Failure 0/1 0%	1 failed in 0.05s	26m 24s	7.54M	14.2K
git-leak-recovery	Partial 4/5 80%	1 failed, 4 passed, 1 warning in 0.30s	5m 54s	470.4K	2.6K
gpt2-codegolf	Failure 0/1 0%	1 failed in 0.05s	21m 1s	1.13M	63.0K
headless-terminal	Failure 1/7 14%	6 failed, 1 passed in 0.45s	4m 12s	144.5K	2.0K
kv-store-grpc	Success 7/7 100%	7 passed in 0.14s	9m 48s	231.8K	2.4K
make-doom-for-mips	Failure 0/3 0%	3 failed in 30.16s	21m 30s	2.57M	5.0K
make-mips-interpreter	Failure 0/3 0%	3 failed in 30.22s	21m 35s	3.27M	54.9K
path-tracing	Failure 0/5 0%	5 failed in 0.82s	21m 25s	826.9K	46.2K
path-tracing-reverse	Failure 2/3 67%	1 failed, 2 passed in 2.52s	21m 8s	2.44M	7.4K
polyglot-c-py	Failure 0/1 0%	1 failed in 0.05s	21m 41s	2.15M	24.7K
polyglot-rust-c	Failure 0/1 0%	1 failed in 0.05s	21m 35s	2.15M	17.7K
prove-plus-comm	Failure 2/4 50%	2 failed, 2 passed in 0.39s	21m 8s	369.1K	2.6K
pypi-server	Failure 0/1 0%	1 failed in 8.59s	1m 1s	22.7K	215
regex-chess	Failure 0/4 0%	4 failed in 0.18s	0m 56s	23.1K	45
schemelike-metacircular-eval	Failure 0/1 0%	1 failed in 10.95s	0m 59s	22.8K	63
torch-pipeline-parallelism	Failure 0/4 0%	4 failed in 18.09s	4m 15s	23.0K	1.0K
torch-tensor-parallelism	Failure 0/13 0%	13 failed, 1 warning in 127.10s (0:02:07)	4m 58s	22.8K	874
winning-avg-corewars	Failure 1/3 33%	2 failed, 1 passed in 0.04s	1m 9s	22.8K	73
write-compressor	Failure 0/3 0%	3 failed in 0.77s	2m 19s	22.6K	86

qwen3.6-35B-A3B-low-hardware

1 run

claude-code

v2.1.112

Apr 17, 2026

26 Tasks

3 Success

1 Partial

22 Failure

39.8% Pass Rate

11.5% Terminal Bench

Task	Status	Tests	Duration	Tokens In	Tokens Out
build-pmars	Success 4/4 100%	4 passed in 0.23s	8m 40s	1.27M	6.5K
build-pov-ray	Failure 0/3 0%	3 failed in 1.28s	24m 42s	550.8K	4.1K
cancel-async-tasks	Partial 5/6 83%	1 failed, 5 passed in 13.07s	8m 17s	102.8K	15.5K
circuit-fibsqrt	Failure 2/3 67%	1 failed, 2 passed in 3.28s	13m 52s	52.7K	32.2K
cobol-modernization	Failure 1/3 33%	2 failed, 1 passed in 0.11s	20m 48s	410.8K	13.0K
code-from-image	Failure 1/2 50%	1 failed, 1 passed in 0.07s	20m 46s	470.8K	18.4K
fix-git	Success 2/2 100%	2 passed in 0.06s	6m 54s	238.1K	1.3K
fix-ocaml-gc	Failure 0/1 0%	1 failed in 0.05s	28m 42s	1.19M	7.4K
git-leak-recovery	Success 5/5 100%	5 passed, 1 warning in 0.26s	15m 45s	257.1K	2.1K
gpt2-codegolf	Failure 0/1 0%	1 failed in 0.05s	21m 51s	563.7K	32.6K
headless-terminal	Failure 0/7 0%	7 failed in 0.17s	20m 49s	24.1K	191
kv-store-grpc	Failure 5/7 71%	2 failed, 5 passed in 0.13s	2m 19s	513.3K	3.0K
make-doom-for-mips	Failure 0/3 0%	3 failed in 30.14s	21m 27s	567.7K	4.4K
make-mips-interpreter	Failure 0/3 0%	3 failed in 30.21s	21m 32s	316.2K	6.3K
path-tracing	Failure 0/5 0%	5 failed in 0.85s	21m 1s	593.3K	13.3K
path-tracing-reverse	Failure 0/3 0%	3 failed in 1.50s	20m 49s	683.6K	7.1K
polyglot-c-py	Failure 0/1 0%	1 failed in 0.16s	21m 5s	180.4K	24.6K
polyglot-rust-c	Failure 0/1 0%	1 failed in 0.05s	16m 4s	24.1K	32.0K
prove-plus-comm	Failure 2/4 50%	2 failed, 2 passed in 0.26s	21m 13s	98.0K	968
pypi-server	Failure 0/1 0%	1 failed in 8.59s	20m 53s	668.4K	33.8K
regex-chess	Failure 0/4 0%	4 failed in 0.18s	15m 36s	87.5K	38.2K
schemelike-metacircular-eval	Failure 0/1 0%	1 failed in 10.94s	21m 28s	180.5K	1.3K
torch-pipeline-parallelism	Failure 0/4 0%	4 failed in 17.76s	22m 54s	174.1K	32.8K
torch-tensor-parallelism	Failure 9/13 69%	4 failed, 9 passed, 1 warning in 94.73s (0:01:34)	16m 59s	201.1K	7.6K
winning-avg-corewars	Failure 1/3 33%	2 failed, 1 passed in 0.04s	19m 55s	120.2K	49.2K
write-compressor	Failure 0/3 0%	3 failed in 0.78s	19m 59s	24.0K	161

qwopus-9b

1 run

claude-code

v2.1.92

Apr 7, 2026

26 Tasks

9 Success

0 Partial

17 Failure

50.5% Pass Rate

34.6% Terminal Bench

Task	Status	Tests	Duration	Tokens In	Tokens Out
build-pmars	Success 4/4 100%	4 passed in 0.23s	5m 26s	1.60M	5.2K
build-pov-ray	Failure 0/3 0%	3 failed in 1.26s	21m 10s	3.17M	8.0K
cancel-async-tasks	Success 6/6 100%	6 passed in 14.11s	1m 17s	45.5K	539
circuit-fibsqrt	Failure 2/3 67%	1 failed, 2 passed in 3.64s	14m 25s	22.7K	32.0K
cobol-modernization	Success 3/3 100%	3 passed in 0.31s	20m 51s	730.0K	24.9K
code-from-image	Failure 0/2 0%	2 failed in 0.02s	18m 34s	230.4K	32.8K
fix-git	Success 2/2 100%	2 passed in 0.06s	2m 32s	141.5K	1.1K
fix-ocaml-gc	Failure 0/1 0%	1 failed in 0.05s	25m 53s	1.26M	2.4K
git-leak-recovery	Success 5/5 100%	5 passed, 1 warning in 0.28s	1m 18s	166.5K	1.4K
gpt2-codegolf	Failure 0/1 0%	1 failed in 0.06s	21m 9s	0	0
headless-terminal	Success 7/7 100%	7 passed in 132.17s (0:02:12)	8m 41s	431.8K	5.2K
kv-store-grpc	Success 7/7 100%	7 passed in 0.14s	1m 35s	319.6K	1.9K
make-doom-for-mips	Failure 0/3 0%	3 failed in 30.16s	21m 36s	2.73M	17.6K
make-mips-interpreter	Failure 0/3 0%	3 failed in 30.21s	21m 57s	5.08M	6.2K
path-tracing	Failure 0/5 0%	5 failed in 0.77s	21m 20s	668.0K	4.5K
path-tracing-reverse	Failure 0/3 0%	3 failed in 1.41s	19m 25s	618.9K	35.6K
polyglot-c-py	Failure 0/1 0%	1 failed in 0.08s	21m 41s	790.6K	15.4K
polyglot-rust-c	Failure 0/1 0%	1 failed in 0.05s	17m 35s	22.6K	32.0K
prove-plus-comm	Success 4/4 100%	4 passed in 0.29s	2m 40s	399.8K	4.3K
pypi-server	Success 1/1 100%	1 passed in 1.35s	16m 51s	1.46M	5.9K
regex-chess	Failure 0/4 0%	4 failed in 0.17s	14m 47s	23.0K	32.0K
schemelike-metacircular-eval	Failure 0/1 0%	1 failed in 10.80s	21m 6s	152.1K	866
torch-pipeline-parallelism	Failure 0/4 0%	4 failed in 18.85s	23m 36s	0	0
torch-tensor-parallelism	Failure 3/13 23%	10 failed, 3 passed, 1 warning in 77.94s (0:01:17)	24m 49s	61.7K	19.3K
winning-avg-corewars	Failure 2/3 67%	1 failed, 2 passed in 0.65s	20m 58s	48.6K	9.3K
write-compressor	Failure 1/3 33%	2 failed, 1 passed in 0.80s	9m 20s	831.4K	28.3K

How I benchmark LLMs

All results on this page are produced locally, on my own hardware, under reproducible conditions. Here is exactly how.

Hardware

CPU AMD Ryzen 7 9800X3D
GPU Nvidia RTX 5080 · 16 GB VRAM
RAM 32 GB DDR5
OS CachyOS (Arch-based)
Inference runtime ik_llama.cpp compiled with CUDA

Test conditions

Dataset Terminal Bench 2 — software engineering tasks only
Framework Harbor — launched via DevContainer
Isolation Each task runs in a clean, isolated container
Agent LLM exposes an OpenAI-compatible endpoint; agent scripts interact through the Harbor CLI
Time keeper Each task is timed to measure execution performance, the agent have only 20 minutes to do the job

Scoring rules

Success All tests in the task pass (100%)

Partial At least 75% of tests pass — including error runs that still meet this threshold

Failure Fewer than 75% of tests pass, or the agent timed out and no tests ran

The Pass Rate shown in each run card is the weighted percentage of individual test cases that passed, across all tasks in that run. A partial success with an agent error shows a tooltip with the raw error message.

Thomas Rumas

LLM Benchmarks

omnicoder-9b

everything-claude-code

claude-code

qwen3.6-35B-A3B-low-hardware

claude-code

qwopus-9b

claude-code

How I benchmark LLMs