Leaderboard
283 tasks (V1 153 + V2 130) · 163 live platforms · two-stage scoring (HTTP interception → LLM judge)
V2 (Hermes) — 8 models
| Rank | Model | Harness | Intercepted | Reward (lenient) | Reward (strict) | Cost / task | Pass / Total |
|---|---|---|---|---|---|---|---|
| 1 | claude-opus-4-7 | hermes | 54.6% | 44.6% | 24.6% | $4.4425 | 58 / 130 |
| 2 | gpt-5.5 | hermes | 45.4% | 35.4% | 18.5% | $0.3325 | 46 / 130 |
| 3 | glm-5.1 | hermes | 48.5% | 34.6% | 17.7% | $0.1935 | 45 / 130 |
| 4 | deepseek-v4-pro | hermes | 43.9% | 33.9% | 12.3% | $0.0721 | 44 / 130 |
| 5 | deepseek-v4-flash:free | hermes | 3.1% | 2.3% | 0.0% | $0.0000 | 3 / 130 |
| 6 | z-ai/glm-4.5-air:free | hermes | 4.6% | 2.3% | 0.8% | $0.0000 | 3 / 130 |
| 7 | minimax-m2.5:free | hermes | 2.3% | 1.5% | 0.0% | $0.0000 | 2 / 130 |
| 8 | openrouter-owl-alpha | hermes | 14.6% | 0.0% | 0.0% | $0.3704 | 0 / 130 |
Intercepted = the final HTTP request matched the task's URL and method (Stage 1, deterministic). Reward (lenient) = the request was additionally judged by deepseek/deepseek-v4-pro to fulfill the instruction under the "no contradiction → match" rubric (Stage 2). Reward (strict) = same judge, strict rubric ("ambiguous → mismatch"). The rows above are ordered by Reward (lenient), not by Intercepted: gpt-5.5 ranks above glm-5.1 despite a lower Intercepted rate.
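The two stages can be sketched as follows. This is a minimal illustration, not the ClawBench implementation: `Request`, `TaskSpec`, the prefix-based URL match, and the three-valued judge verdict are all assumptions made for the example, and the judge is passed in as a plain callable rather than an LLM call.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical verdict labels; the real judge output format is not shown in this page.
MATCH, AMBIGUOUS, MISMATCH = "match", "ambiguous", "mismatch"


@dataclass
class Request:
    method: str
    url: str
    body: str = ""


@dataclass
class TaskSpec:
    method: str
    url_prefix: str   # assumption: Stage 1 matches on a URL prefix
    instruction: str


def intercepted(final: Optional[Request], task: TaskSpec) -> bool:
    """Stage 1 (deterministic): does the final request hit the task's URL/method?"""
    if final is None:
        return False
    same_method = final.method.upper() == task.method.upper()
    return same_method and final.url.startswith(task.url_prefix)


def reward(
    final: Optional[Request],
    task: TaskSpec,
    judge: Callable[[str, Request], str],
    strict: bool,
) -> bool:
    """Stage 2: only intercepted requests reach the judge."""
    if not intercepted(final, task):
        return False
    verdict = judge(task.instruction, final)
    if verdict == AMBIGUOUS:
        # lenient: no contradiction counts as a match; strict: ambiguous is a mismatch
        return not strict
    return verdict == MATCH
```

Under this model a run can never score Reward without Intercepted, and strict Reward can never exceed lenient Reward, which is consistent with every row of the table.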
Snapshot: 2026-05-20. Scoring details: eval/scoring.md ↗.
New here? About ClawBench ↗.
We give AI agents real online tasks (booking flights, ordering food, applying for jobs) on live websites, and check whether they actually submit the right thing. Best so far: claude-opus-4-7 at 44.6% Reward (lenient) on V2 (54.6% Intercepted); even the frontier closed-source models fall about 55 points short of a perfect score. How scoring works →
One harness, the whole Claw family. Runs claw-eval, WildClawBench, ClawMark, and scope-peers (WebVoyager, OSWorld, …) from one CLI — see how ↓
Featured in HF Daily Paper #3 · DeepWiki · awesome-harness-engineering · Awesome-AI-Agents · LLM-Agent-Benchmark-List
pip install clawbench-eval && clawbench run --corpus v2 --model your-model
News
View all on GitHub
- V2 default + lenient judge + 6 harnesses. Details →
- Claw-Eval suite added: 19 browser-research tasks with final-answer submission. Details →
- Canonical leaderboard moved to TIGER-Lab/ClawBench Gradio Space. Details →
- V2 leaderboard ships; top so far glm-5.1 / hermes at 18.5% reward / 48.5% intercepted. Details →
- Inline LLM judge added as second scoring stage; runs auto-produce pass/fail. Details →
- clawbench-eval published to PyPI for one-command install. Details →
- Released ClawBenchV1Trace: full 5-layer execution trace per V1 run. Details →
- Paper released on arXiv (2604.08523); #3 HuggingFace Paper of the Day. Details →
Browser-agent execution traces: curated, open for download, Apache-2.0
Real summed tokens from every run's agent-messages.jsonl × current OpenRouter list prices. V1 base: $5,177 across 1,377 runs (sonnet-4-6 $3/$15, haiku-4-5 $1/$5, gpt-5.4 $2.50/$15, gemini-3.1-pro $2/$12, glm-5 $0.60/$1.92, kimi-k2.5 $0.40/$1.90, qwen3.5-397b $0.39/$2.34, gemini-3-flash $0.50/$3, gemini-3.1-flash-lite $0.25/$1.50). V1 opus-4-6: $3,254 (the #1 V1 model — 113 dirs at $5/$25 pricing). V2: $1,745 across the full 6-model corpus (opus-4-7 $5/$25 → $1,214, gpt-5.5 $1.25/$10 → $255, glm-5.1 $0.60/$1.92 → $122, deepseek-v4-pro $0.55/$2.19 → $146, deepseek-v4-flash $0.27/$1.10 → $9, owl-alpha free). Cache reads billed at full prompt rate (conservative). OpenRouter prices ↑
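The per-run arithmetic behind these totals can be sketched like this. It is a rough illustration under stated assumptions: the `usage`, `prompt_tokens`, `cache_read_tokens`, and `completion_tokens` field names are hypothetical (the actual agent-messages.jsonl schema is not shown here), `prompt_tokens` is assumed to exclude cache reads, and the price pairs are read as USD per million prompt/completion tokens.

```python
import json
from typing import Iterable


def run_cost_usd(
    lines: Iterable[str],
    prompt_price: float,
    completion_price: float,
) -> float:
    """Estimate one run's cost from JSONL message lines.

    Prices are USD per 1M tokens (assumed unit). Field names are hypothetical.
    """
    prompt_tokens = 0
    completion_tokens = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        usage = json.loads(line).get("usage") or {}
        # Conservative choice from the text: cache reads are billed at the full prompt rate.
        prompt_tokens += usage.get("prompt_tokens", 0) + usage.get("cache_read_tokens", 0)
        completion_tokens += usage.get("completion_tokens", 0)
    total = prompt_tokens * prompt_price + completion_tokens * completion_price
    return total / 1_000_000
```

For example, a run with 600,000 fresh prompt tokens, 400,000 cache-read tokens, and 100,000 completion tokens at the $5/$25 opus pricing comes to 1,000,000 × $5 + 100,000 × $25 per million, or $7.50.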
What will you do with them?
Sample the corpus before you download
Browse the 283 task definitions these traces capture — searchable, filterable, no download. Each row is a prompt that one of the 13 frontier models attempted.
Powered by the Hugging Face Datasets Viewer · Open full dataset
Watch a real trace — played at 16×, no narration
gpt-5.4 on V1 task 862-entertainment-hobbies-movies-amc-theatres: "Book a ticket on AMC Theatres for a showing in the city". The agent navigates the live site, picks a movie, showtime, and seat, fills the checkout form, and reaches the purchase request, which the harness intercepts before submit. Intercepted ✓.
This is one of the 1,377 V1 + 676 V2 recordings shipped with every trace bundle. Full 5-layer bundle for this run: download via /traces ↓.
A real turn from this corpus
Excerpted from agent-messages.jsonl of one V2 run (z-ai/glm-5 · task 001 · Uber Eats / Pad Thai). Every trace bundle has hundreds of these, time-aligned with the recording, actions, and HTTP requests.
shared/alex_green_personal_info.json
https://ubereats.com
Inside every trace: 6 time-synchronized signals per run (multi-track recorder)
Every signal is timestamped against the same clock: click frame 1872 of recording.mp4 and you can find the exact actions.jsonl event, the LLM turn that triggered it, and the HTTP requests it fired.
Cross-org mirrors: NAIL-Group · TIGER-Lab · Apache-2.0 · Bundle format: tar.gz per run, jsonl within.
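Because the signals share one clock, going from a video frame to the matching action is a timestamp lookup. The sketch below shows one way to do it, assuming a hypothetical `ts` field (seconds since the start of the recording) on each JSONL event, a member named actions.jsonl at the root of the bundle, and a 30 fps recording; none of these details are confirmed by this page.

```python
import bisect
import json
import tarfile


def read_jsonl(bundle_path: str, member: str) -> list[dict]:
    """Read one JSONL member out of a per-run tar.gz bundle."""
    with tarfile.open(bundle_path, "r:gz") as tar:
        fh = tar.extractfile(member)
        if fh is None:  # member exists but is not a regular file
            raise FileNotFoundError(member)
        text = fh.read().decode("utf-8")
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def frame_to_seconds(frame: int, fps: float) -> float:
    """Convert a frame index to seconds on the shared clock (fps is an assumption)."""
    return frame / fps


def latest_event_at(events: list[dict], t: float):
    """Return the last event with ts <= t, or None. Events must be sorted by ts."""
    stamps = [e["ts"] for e in events]
    i = bisect.bisect_right(stamps, t)
    return events[i - 1] if i else None
```

With a 30 fps recording, frame 1872 sits at 62.4 s, so `latest_event_at(events, frame_to_seconds(1872, 30))` returns the most recent action at or before that moment. The same lookup applies to the LLM turns and HTTP requests, provided they carry a comparable timestamp.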
Cite this benchmark
Using ClawBench in your research? Please cite the arXiv paper:
@article{zhang2026clawbench,
title={ClawBench: Can AI Agents Complete Everyday Online Tasks?},
author={Yuxuan Zhang and Yubo Wang and Yipeng Zhu and Penghui Du and Junwen Miao and Xuan Lu and Wendong Xu and Yunzhuo Hao and Songcheng Cai and Xiaochen Wang and Huaisong Zhang and Xian Wu and Yi Lu and Minyi Lei and Kai Zou and Huifeng Yin and Ping Nie and Liang Chen and Dongfu Jiang and Wenhu Chen and Kelsey R. Allen},
journal={arXiv preprint arXiv:2604.08523},
year={2026}
}