Leaderboard
Per-model pass rates on ClawBench V2 (130 newer everyday tasks). Two-stage scoring: HTTP-request interception + LLM judge on the intercepted payload. Scoring details ↗
V2 Snapshot: 7 model × harness combinations
| Rank | Model | Harness | Intercepted | Reward | Pass / Total |
|---|---|---|---|---|---|
| 1 | glm-5.1 | hermes | 48.5% | 18.5% | 24 / 130 |
| 2 | claude-opus-4-7·partial | hermes | 54.7% | 13.3% | 10 / 75 |
| 3 | gpt-5.5·partial | hermes | 48.1% | 11.1% | 9 / 81 |
| 4 | deepseek-v4-pro | hermes | 43.8% | 10.0% | 13 / 130 |
| 5 | openrouter-owl-alpha | hermes | 14.6% | 4.6% | 6 / 130 |
| 6 | deepseek-v4-flash | hermes | 3.1% | 1.5% | 2 / 130 |
| 7 | glm-5.1 | openclaw | 0.0% | 0.0% | 0 / 130 |
Reward is the fraction of attempted tasks in which the agent's final HTTP request was intercepted and the judge found that it fulfilled the natural-language instruction (default judge: deepseek/deepseek-v4-pro via OpenRouter). Intercepted alone counts runs where the agent reached the endpoint; Reward additionally requires that the payload was correct. Rows marked ·partial attempted fewer than the full 130 V2 tasks (mid-run abort or queue cap), so their rates are computed over the attempted tasks only (75 and 81 here).
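The relationship between the two columns can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not ClawBench's scoring code: the `TaskResult` record and its field names are hypothetical, and the real judge is an LLM call rather than a stored boolean. The `pct` helper only reproduces the rounding used in the Pass / Total column.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    # Hypothetical per-task record; field names are illustrative only.
    intercepted: bool  # stage 1: the final HTTP request was captured
    judged_ok: bool    # stage 2: the judge accepted the captured payload


def score(results):
    """Return (intercepted_rate, reward_rate) over the attempted tasks."""
    total = len(results)
    if total == 0:
        return 0.0, 0.0
    n_intercepted = sum(r.intercepted for r in results)
    # A task earns reward only if it cleared both stages.
    n_passed = sum(r.intercepted and r.judged_ok for r in results)
    return n_intercepted / total, n_passed / total


def pct(passed, total):
    """Percentage rounded to one decimal, as shown in the table."""
    return round(100 * passed / total, 1)


# Toy data: two tasks reach the endpoint, one of them has a correct payload.
toy = [
    TaskResult(True, True),
    TaskResult(True, False),
    TaskResult(False, False),
    TaskResult(False, True),  # a judge verdict without interception does not count
]
print(score(toy))
print(pct(24, 130), pct(10, 75))
```

Because Reward requires interception first, it can never exceed the Intercepted rate for the same row, and partial rows divide by the number of attempted tasks rather than by 130.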
Snapshot generated 2026-05-12. Scoring details: eval/scoring.md ↗.
Fresh runs + V1 results: interactive HF Space ↗.
New here? About ClawBench — how it works ↗.
Download curated, high-quality execution traces
Every ClawBench run ships a full 5-layer bundle — screen recording, browser actions, agent messages, network requests, and final-request interception. Hand-verified for fidelity; suitable for training, replay, audits, and reproducing every leaderboard cell above.
ClawBench V2 Trace
130 tasks · ~5.2 GB
Newer everyday tasks (V2 corpus). 7 model × harness combinations: glm-5.1 (hermes and openclaw), claude-opus-4-7, gpt-5.5, deepseek-v4-pro, deepseek-v4-flash, and openrouter-owl-alpha (all hermes).
ClawBench V1 Trace
1,416 runs · 153 tasks
Original corpus (paper). 6 frontier models, including Claude Opus 4.6 (61.4%), Sonnet 4.6 (56.9%), GPT-5.4, and Kimi K2.5.
What's in each trace
- recording.mp4: full browser session video
- actions.jsonl: agent clicks / typing / scrolls
- agent-messages.jsonl: model I/O incl. reasoning
- requests.jsonl: every HTTP request the page made
- interception.json: final intercepted request (graded)
- run-meta.json: model, harness, scores, timing
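A downloaded trace could be loaded with something like the sketch below. It assumes that the six files listed above sit together in one directory per run, which the page does not state, and it makes no assumption about the fields inside each file: JSONL files are returned as lists of parsed objects and JSON files as parsed objects. The names `load_trace` and `read_jsonl` are illustrative helpers, not part of any ClawBench tooling.

```python
import json
from pathlib import Path

# File names as listed above; the one-directory-per-run layout is an assumption.
JSONL_FILES = ("actions.jsonl", "agent-messages.jsonl", "requests.jsonl")
JSON_FILES = ("interception.json", "run-meta.json")
VIDEO_FILE = "recording.mp4"


def read_jsonl(path):
    """Parse a JSON Lines file into a list, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def load_trace(trace_dir):
    """Load every layer found in trace_dir; missing files map to None."""
    root = Path(trace_dir)
    bundle = {}
    for name in JSONL_FILES:
        p = root / name
        bundle[name] = read_jsonl(p) if p.exists() else None
    for name in JSON_FILES:
        p = root / name
        bundle[name] = json.loads(p.read_text(encoding="utf-8")) if p.exists() else None
    # The video is not decoded here; only its path is recorded.
    video = root / VIDEO_FILE
    bundle[VIDEO_FILE] = video if video.exists() else None
    return bundle
```

Returning `None` for absent layers keeps the loader usable on incomplete runs, such as those from aborted sessions.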
Browse interactively at /traces.