ClawBench · the open benchmark for AI agents on real, live websites

Leaderboard

130 tasks · 130+ live platforms · two-stage scoring (HTTP interception → LLM judge)

V2 (Hermes) — 8 models

Rank	Model	Harness	Intercepted	Reward (lenient)	Reward (strict)	Cost / task	Pass / Total
1	claude-opus-4-7	hermes	54.6%	44.6%	24.6%	$4.4425	58 / 130
2	gpt-5.5	hermes	45.4%	35.4%	18.5%	$0.3325	46 / 130
3	glm-5.1	hermes	48.5%	34.6%	17.7%	$0.1935	45 / 130
4	deepseek-v4-pro	hermes	43.9%	33.9%	12.3%	$0.0721	44 / 130
5	deepseek-v4-flash:free	hermes	3.1%	2.3%	0.0%	$0.0000	3 / 130
6	z-ai/glm-4.5-air:free	hermes	4.6%	2.3%	0.8%	$0.0000	3 / 130
7	minimax-m2.5:free	hermes	2.3%	1.5%	0.0%	$0.0000	2 / 130
8	openrouter-owl-alpha	hermes	14.6%	0.0%	0.0%	$0.3704	0 / 130

Intercepted = the final HTTP request matched the task's URL/method (Stage 1, deterministic). Reward (lenient) = additionally judged by deepseek/deepseek-v4-pro to fulfill the instruction under the "no contradiction → match" rubric (Stage 2). Reward (strict) = same judge, strict rubric ("ambiguous → mismatch"). Rows are ordered by Reward (lenient), which equals Pass / Total. Snapshot: 2026-05-20. Scoring details: eval/scoring.md ↗. New here? About ClawBench ↗.
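The two-stage scoring described above can be sketched roughly as follows. This is an illustrative sketch only, not the ClawBench implementation: `TaskSpec`, `FinalRequest`, the prefix-matching rule, and the three-valued judge verdict (`"fulfilled"`, `"ambiguous"`, `"contradicted"`) are all assumptions made for the example; the actual rules live in eval/scoring.md.

```python
from dataclasses import dataclass
from urllib.parse import urlsplit


@dataclass(frozen=True)
class TaskSpec:
    """Hypothetical task definition: the request the harness waits for."""
    method: str       # expected HTTP method, e.g. "POST"
    url_prefix: str   # expected URL prefix (assumed matching rule)


@dataclass(frozen=True)
class FinalRequest:
    """Hypothetical record of the agent's final intercepted request."""
    method: str
    url: str


def intercepted(req: FinalRequest, task: TaskSpec) -> bool:
    """Stage 1 (deterministic): method and URL must match the task."""
    if req.method.upper() != task.method.upper():
        return False
    got, want = urlsplit(req.url), urlsplit(task.url_prefix)
    return got.netloc.lower() == want.netloc.lower() and got.path.startswith(want.path)


def reward(req: FinalRequest, task: TaskSpec, verdict: str, rubric: str) -> bool:
    """Stage 2: map an (assumed) three-valued judge verdict to pass/fail."""
    if rubric not in {"lenient", "strict"}:
        raise ValueError(f"unknown rubric: {rubric}")
    if verdict not in {"fulfilled", "ambiguous", "contradicted"}:
        raise ValueError(f"unknown verdict: {verdict}")
    if not intercepted(req, task):
        return False            # Stage 1 failed, Stage 2 never counts
    if verdict == "fulfilled":
        return True
    if verdict == "contradicted":
        return False
    # Ambiguous case: lenient treats "no contradiction" as a match,
    # strict treats "ambiguous" as a mismatch.
    return rubric == "lenient"
```

Under these assumptions, a run can be Intercepted without earning Reward, which is why the Intercepted column is always at least as high as either Reward column.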

We give AI agents real online tasks — booking flights, ordering food, applying for jobs — on live websites, and check whether they actually submit the right thing. Best so far: claude-opus-4-7 at 44.6% Reward on V2 (54.6% Intercepted) — even the frontier closed-source models leave a 55-point gap. How scoring works →

One harness, the whole Claw family. Runs claw-eval, WildClawBench, ClawMark, and scope-peers (WebVoyager, OSWorld, …) from one CLI — see how ↓

Featured in HF Daily Paper #3 · DeepWiki · awesome-harness-engineering · Awesome-AI-Agents · LLM-Agent-Benchmark-List

Resources: Paper arXiv:2604.08523 · Cite BibTeX ↓ · GitHub TIGER-AI-Lab/ClawBench · Dataset TIGER-Lab/ClawBench · Space TIGER-Lab/ClawBench · Collection Traces V1 + V2

Quick start

pip install clawbench-eval && clawbench run --corpus v2 --model your-model

PyPI Full setup ↗

News

View all on GitHub

2026.05.20 V2 default + lenient judge + 6 harnesses. Details →
2026.05.16 Claw-Eval suite added: 19 browser-research tasks with final-answer submission. Details →
2026.05.12 Canonical leaderboard moved to TIGER-Lab/ClawBench Gradio Space. Details →
2026.05.11 V2 leaderboard ships; top so far glm-5.1 / hermes at 18.5% reward / 48.5% intercepted. Details →
2026.05.09 Inline LLM judge added as second scoring stage; runs auto-produce pass/fail. Details →
2026.05.09 clawbench-eval published to PyPI for one-command install. Details →
2026.05.09 Released ClawBenchV1Trace: full 5-layer execution trace per V1 run. Details →
2026.04.11 Paper released on arXiv (2604.08523); #3 HuggingFace Paper of the Day. Details →

Browser-agent execution traces, curated and open for download (Apache-2.0)

Refreshed weekly · last 2026-05-20

1,724 judge-verified runs
frontier models
283 distinct everyday tasks
163 live platforms covered
$10K+ in frontier-model compute
5.7B tokens (input + cache + output)

Real summed tokens from every run's agent-messages.jsonl × current OpenRouter list prices.

V1 base: $5,177 across 1,377 runs (sonnet-4-6 $3/$15, haiku-4-5 $1/$5, gpt-5.4 $2.50/$15, gemini-3.1-pro $2/$12, glm-5 $0.60/$1.92, kimi-k2.5 $0.40/$1.90, qwen3.5-397b $0.39/$2.34, gemini-3-flash $0.50/$3, gemini-3.1-flash-lite $0.25/$1.50).
V1 opus-4-6: $3,254 (the #1 V1 model; 113 dirs at $5/$25 pricing).
V2: $1,745 across the full 6-model corpus (opus-4-7 $5/$25 → $1,214, gpt-5.5 $1.25/$10 → $255, glm-5.1 $0.60/$1.92 → $122, deepseek-v4-pro $0.55/$2.19 → $146, deepseek-v4-flash $0.27/$1.10 → $9, owl-alpha free).

Cache reads billed at full prompt rate (conservative). OpenRouter prices ↑
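The cost accounting above (tokens × list price, with cache reads charged at the full prompt rate) can be illustrated with a minimal sketch. It assumes that a price pair such as $5/$25 means USD per million input tokens and per million output tokens; the function name and parameters are hypothetical and do not reflect the actual schema of agent-messages.jsonl.

```python
def run_cost_usd(
    input_tokens: int,
    cache_read_tokens: int,
    output_tokens: int,
    input_price_per_m: float,
    output_price_per_m: float,
) -> float:
    """Estimate one run's cost in USD.

    Assumption: prices are USD per one million tokens. Cache reads are
    billed at the full input (prompt) rate, the conservative choice
    stated in the text.
    """
    prompt_tokens = input_tokens + cache_read_tokens
    total = prompt_tokens * input_price_per_m + output_tokens * output_price_per_m
    return total / 1_000_000


# Illustrative numbers only (not taken from any real run):
# 1.0M fresh input + 0.5M cache reads + 0.1M output at $5/$25.
example = run_cost_usd(1_000_000, 500_000, 100_000, 5.0, 25.0)
```

Summing this per-run figure over every run of a model would give a per-model total comparable to the ones listed above, provided the price assumption holds.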

      claude-opus-4-7
      claude-opus-4-6
      claude-sonnet-4-6
      claude-haiku-4-5
      gpt-5.5
      gpt-5.4 · mini
      gpt-5.3-codex · spark
      gpt-5.2
      gpt-oss-120b
      gemini-3.1-pro · flash · flash-lite
      gemini-3-flash
      deepseek-v4-pro · flash
      glm-5.1
      glm-4.5-air
      minimax-m2.5
      kimi-k2.5
      qwen3.5-397b
      owl-alpha
    

What will you do with them?

Train your own agent

Fine-tune on 918 V1 + 806 V2 frontier-model trajectories without spending $10k+ in API tokens. JSONL-native, SFT/DPO/PRM-ready. Mine success-vs-failure pairs across 13 models on identical tasks.

Get V1 · 918 runs Get V2 · 806 runs
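One way to mine success-vs-failure pairs on identical tasks is to group runs by task and pair every passing run with every failing one. The sketch below assumes each run has already been loaded into a dict with hypothetical keys `task_id`, `model`, and `passed`; the real bundle layout may differ.

```python
from collections import defaultdict
from itertools import product


def preference_pairs(runs):
    """Build (chosen, rejected) pairs from runs that attempted the same task.

    `runs` is an iterable of dicts with the assumed keys
    "task_id", "model", and "passed" (bool).
    """
    by_task = defaultdict(lambda: {"pass": [], "fail": []})
    for run in runs:
        bucket = "pass" if run["passed"] else "fail"
        by_task[run["task_id"]][bucket].append(run)

    pairs = []
    for task_id, groups in by_task.items():
        # Tasks with only passes or only failures yield no pairs.
        for chosen, rejected in product(groups["pass"], groups["fail"]):
            pairs.append(
                {
                    "task_id": task_id,
                    "chosen": chosen["model"],
                    "rejected": rejected["model"],
                }
            )
    return pairs
```

Pairs built this way share the same prompt, which is the shape DPO-style training typically expects; in practice the `chosen` and `rejected` entries would carry the full trajectories rather than model names.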

Replay & audit

Step through with video + HAR + agent reasoning side-by-side. Diagnose failure modes, audit judge calls, diff a model's pixels vs its words. Per-step frame-accurate.

Open trace browser Browse on HF Hub

Reproduce the leaderboard

Re-run any cell with our judge on your own data — or our data on your judge. The CLI consumes the same bundles you'll download. Held-out, post-cutoff tasks; no contamination.

Scoring rubric pip install clawbench-eval

Sample the corpus before you download

Browse the 283 task definitions these traces capture — searchable, filterable, no download. Each row is a prompt that one of the 13 frontier models attempted.

Browse a real trace — hover to step through screenshots

deepseek-v4-pro on V2 task 535-daily-life-shopping-etsy: "Search Etsy for a handmade blue ceramic flower vase under $50 and add it to your favorites" — the agent searches, filters results, opens a listing, and adds it to favorites before the harness intercepts. Intercepted ✓.

65 screenshots · 13m 47s

Full 5-layer trace bundle for this run: view trace →.

A real turn from this corpus

Excerpted from agent-messages.jsonl of one V2 run (z-ai/glm-5.1 · task 001 · Uber Eats / Pad Thai). Every trace bundle has hundreds of these, time-aligned with the recording, actions, and HTTP requests.

user: On Uber Eats, order delivery: one Pad Thai, deliver to home address, note "no peanuts" …
glm-5.1: I'll help you order Pad Thai on Uber Eats. Let me first read your personal info to get your delivery address.
tool_use: read_file · shared/alex_green_personal_info.json
browser: open_url · https://ubereats.com
↓ ~80 more turns until the agent's checkout request was intercepted and graded

Inside every trace: 6 time-synchronized signals per run (multi-track recorder)

Every signal is timestamped against the same clock — click frame 1872 of recording.mp4 and you can find the exact actions.jsonl event, the LLM turn that triggered it, and the HTTP requests it fired. Cross-org mirrors: NAIL-Group · TIGER-Lab · Apache-2.0 · Bundle format: tar.gz per run, jsonl within.
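Because every signal shares one clock, going from a video frame to the matching action is a timestamp lookup. The sketch below assumes a constant frame rate and events carrying a hypothetical `t` field (seconds since the start of the recording), already sorted by time; the actual fields in actions.jsonl may be named differently.

```python
import bisect


def frame_to_seconds(frame_index: int, fps: float) -> float:
    """Convert a frame index to seconds, assuming a constant frame rate."""
    return frame_index / fps


def latest_event_at(events, t_seconds):
    """Return the last event whose timestamp is <= t_seconds, or None.

    `events` must be sorted by the assumed "t" field (seconds).
    """
    times = [event["t"] for event in events]
    i = bisect.bisect_right(times, t_seconds)
    return events[i - 1] if i else None


# Illustrative data only: frame 1872 at an assumed 30 fps.
events = [
    {"t": 10.0, "action": "open_url"},
    {"t": 60.0, "action": "click"},
    {"t": 62.5, "action": "type"},
]
t = frame_to_seconds(1872, 30.0)
current = latest_event_at(events, t)
```

The same lookup applied to the LLM turns and the HTTP request log would recover the turn and requests surrounding that frame.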

Cite this benchmark

Using ClawBench in your research? Please cite the arXiv paper:

@article{zhang2026clawbench,
  title={ClawBench: Can AI Agents Complete Everyday Online Tasks?},
  author={Yuxuan Zhang and Yubo Wang and Yipeng Zhu and Penghui Du and Junwen Miao and Xuan Lu and Wendong Xu and Yunzhuo Hao and Songcheng Cai and Xiaochen Wang and Huaisong Zhang and Xian Wu and Yi Lu and Minyi Lei and Kai Zou and Huifeng Yin and Ping Nie and Liang Chen and Dongfu Jiang and Wenhu Chen and Kelsey R. Allen},
  journal={arXiv preprint arXiv:2604.08523},
  year={2026}
}

View on arXiv · Discuss on HF Papers · CITATION.cff · JSON API /api/leaderboard.json


V1 leaderboard (153 tasks)

Rank	Model	Harness	Intercepted	Reward (lenient)	Reward (strict)	Cost / task	Pass / Total
1	claude-opus-4-6	hermes	61.4%	61.4%	—	—	94 / 153
2	claude-sonnet-4-6	hermes	56.9%	56.9%	—	—	87 / 153
3	claude-haiku-4-5-20251001	hermes	30.1%	30.1%	—	—	46 / 153
4	gpt-5.4-2026-03-05	hermes	25.5%	25.5%	—	—	39 / 153
5	gpt-5.4-mini-2026-03-17	hermes	24.8%	24.8%	—	—	38 / 153
6	kimi-k2.5	hermes	17.6%	17.6%	—	—	27 / 153