Leaderboard

Per-model pass rates on ClawBench V2 (130 newer everyday tasks). Two-stage scoring: HTTP-request interception + LLM judge on the intercepted payload. Scoring details ↗


V2 Snapshot: 7 model × harness combinations

Rank  Model                     Harness   Intercepted  Reward  Pass / Total
1     glm-5.1                   hermes    48.5%        18.5%   24 / 130
2     claude-opus-4-7·partial   hermes    54.7%        13.3%   10 / 75
3     gpt-5.5·partial           hermes    48.1%        11.1%   9 / 81
4     deepseek-v4-pro           hermes    43.8%        10.0%   13 / 130
5     openrouter-owl-alpha      hermes    14.6%        4.6%    6 / 130
6     deepseek-v4-flash         hermes    3.1%         1.5%    2 / 130
7     glm-5.1                   openclaw  0.0%         0.0%    0 / 130

Intercepted = fraction of attempted tasks in which the agent's final HTTP request was intercepted, i.e. the agent reached the endpoint. Reward = fraction of attempted tasks in which that request was intercepted and its payload was judged to fulfill the natural-language instruction (default judge: deepseek/deepseek-v4-pro via OpenRouter), so Reward can never exceed Intercepted. Rows marked ·partial attempted fewer than the full 130 V2 tasks (mid-run abort or queue cap); their rates use the attempted count as the denominator, as in 10 / 75. Snapshot generated 2026-05-12.

Scoring details: eval/scoring.md ↗. Fresh runs + V1 results: interactive HF Space ↗. New here? About ClawBench: how it works ↗.
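The two columns can be reproduced from per-task outcomes. The following is a minimal sketch, not the project's scoring code: the `TaskResult` record and its field names `intercepted` and `judged_ok` are hypothetical, and the example data is synthetic, shaped like the glm-5.1 / hermes row (the 63 intercepted tasks are inferred from 48.5% of 130).

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class TaskResult:
    # Hypothetical record shape; the real run-meta schema may differ.
    intercepted: bool  # stage 1: the final HTTP request was captured
    judged_ok: bool    # stage 2: the LLM judge accepted the payload


def score(results: List[TaskResult]) -> Dict[str, float]:
    """Return intercepted and reward rates as percentages of attempted tasks."""
    total = len(results)
    if total == 0:
        return {"intercepted": 0.0, "reward": 0.0}
    n_intercepted = sum(r.intercepted for r in results)
    # A task earns reward only if it passed both stages.
    n_passed = sum(r.intercepted and r.judged_ok for r in results)
    return {
        "intercepted": round(100 * n_intercepted / total, 1),
        "reward": round(100 * n_passed / total, 1),
    }


# Synthetic data: 130 attempted, 63 intercepted, 24 of those accepted.
results = (
    [TaskResult(True, True)] * 24
    + [TaskResult(True, False)] * 39
    + [TaskResult(False, False)] * 67
)
print(score(results))  # {'intercepted': 48.5, 'reward': 18.5}
```

Because the denominator is the number of attempted tasks, the same function applies unchanged to the ·partial rows.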

Download high-quality, curated execution traces

Every ClawBench run ships a full 5-layer bundle — screen recording, browser actions, agent messages, network requests, and final-request interception. Hand-verified for fidelity; suitable for training, replay, audits, and reproducing every leaderboard cell above.

ClawBench V2 Trace

130 tasks · ~5.2 GB

Newer everyday tasks (V2 corpus). 7 model × harness combinations: glm-5.1, claude-opus-4-7, gpt-5.5, deepseek-v4-pro, deepseek-v4-flash, and openrouter-owl-alpha on the hermes harness, plus glm-5.1 on the openclaw harness.

NAIL-Group/ClawBenchV2Trace TIGER-Lab/ClawBenchV2Trace
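If the two identifiers above are Hugging Face dataset repositories (an assumption; the page does not say so explicitly), a download could be sketched as below. `fetch_traces` and `trace_patterns` are hypothetical helper names, and filtering by file name assumes the per-run files keep the names listed under "What's in each trace". Since the full V2 bundle is about 5.2 GB, restricting the download to the small JSON layers may be useful.

```python
from typing import List, Optional


def trace_patterns(files: List[str]) -> List[str]:
    """Build fnmatch-style patterns matching the given file names at any depth."""
    return [f"*{name}" for name in files]


def fetch_traces(repo_id: str, local_dir: str,
                 files: Optional[List[str]] = None) -> str:
    """Download a trace repository (or a subset of its files) and return the local path."""
    # Imported lazily so trace_patterns works without the dependency installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=repo_id,
        repo_type="dataset",  # assumption: published as a dataset repo
        local_dir=local_dir,
        allow_patterns=trace_patterns(files) if files else None,
    )


# Example call (requires network access and the huggingface_hub package):
# fetch_traces("TIGER-Lab/ClawBenchV2Trace", "./clawbench-v2",
#              files=["interception.json", "run-meta.json"])
```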

ClawBench V1 Trace

1,416 runs · 153 tasks

Original corpus (paper). 6 frontier models including Claude Opus 4.6 (61.4%), Sonnet 4.6 (56.9%), GPT-5.4, Kimi K2.5.

NAIL-Group/ClawBenchV1Trace TIGER-Lab/ClawBenchV1Trace

What's in each trace

  • recording.mp4 — full browser session video
  • actions.jsonl — agent clicks / typing / scrolls
  • agent-messages.jsonl — model I/O incl. reasoning
  • requests.jsonl — every HTTP request the page made
  • interception.json — final intercepted request (graded)
  • run-meta.json — model, harness, scores, timing
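The files listed above can be read with the standard library alone. This is a minimal sketch that assumes the six files sit side by side in one run directory; it makes no assumption about the fields inside each JSON record, and `load_trace` is a hypothetical helper name rather than part of any ClawBench tooling.

```python
import json
from pathlib import Path
from typing import Any, Dict, List


def read_jsonl(path: Path) -> List[Dict[str, Any]]:
    """Parse a JSON Lines file, skipping blank lines."""
    with path.open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


def read_json(path: Path) -> Any:
    """Parse a single JSON document."""
    return json.loads(path.read_text(encoding="utf-8"))


def load_trace(run_dir: Path) -> Dict[str, Any]:
    """Load the text-based layers of one run; the video is returned as a path only."""
    return {
        "actions": read_jsonl(run_dir / "actions.jsonl"),
        "agent_messages": read_jsonl(run_dir / "agent-messages.jsonl"),
        "requests": read_jsonl(run_dir / "requests.jsonl"),
        "interception": read_json(run_dir / "interception.json"),
        "meta": read_json(run_dir / "run-meta.json"),
        "recording": run_dir / "recording.mp4",
    }
```

A typical use would be `trace = load_trace(Path("some-run-dir"))`, followed by `len(trace["requests"])` to count the HTTP requests the page made during the run.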

Browse interactively at /traces.