ClawBench · the open benchmark for AI agents on real, live websites

Can AI agents complete everyday online tasks?

130 tasks · 63 live platforms · two-stage scoring (HTTP interception → LLM judge)

We give AI agents real online tasks — booking flights, ordering food, applying for jobs — on live websites, and check whether they actually submit the right thing.


V2 (Hermes) — 8 models

Rank	Model	Harness	Reward	Intercepted	Reward (strict)	Cost / task	Pass / Total
1	claude-opus-4-7	hermes	44.6%	54.6%	24.6%	$4.4425	58 / 130
2	gpt-5.5	hermes	35.4%	45.4%	18.5%	$0.3325	46 / 130
3	glm-5.1	hermes	34.6%	48.5%	17.7%	$0.1935	45 / 130
4	deepseek-v4-pro	hermes	33.9%	43.9%	12.3%	$0.0721	44 / 130
5	deepseek-v4-flash:free	hermes	2.3%	3.1%	0.0%	$0.0000	3 / 130
6	z-ai/glm-4.5-air:free	hermes	2.3%	4.6%	0.8%	$0.0000	3 / 130
7	minimax-m2.5:free	hermes	1.5%	2.3%	0.0%	$0.0000	2 / 130
8	openrouter-owl-alpha	hermes	0.0%	14.6%	0.0%	$0.3704	0 / 130

How scoring works

Intercepted = the agent's final HTTP request matched the task's URL/method (Stage 1, deterministic). Reward = the intercepted request was additionally judged by deepseek/deepseek-v4-pro to fulfill the instruction under the lenient "no contradiction → match" rubric (Stage 2). Reward (strict) = same judge, strict rubric ("ambiguous → mismatch"). Models are ranked by Reward (lenient). Snapshot: 2026-05-20. Full details: eval/scoring.md ↗. New here? About ClawBench ↗.
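As a rough illustration of how the three leaderboard columns relate, the sketch below aggregates per-run verdicts into Intercepted, Reward, and Reward (strict). The `RunVerdict` type and the judge labels ("match", "ambiguous", "contradiction") are assumptions made for this example; the actual verdict format is defined in eval/scoring.md.

```python
from dataclasses import dataclass


@dataclass
class RunVerdict:
    """Hypothetical per-run verdict; not the real bundle schema."""
    intercepted: bool         # Stage 1: final request matched the URL/method
    judge_label: str = "n/a"  # Stage 2 label (assumed vocabulary)


def passes(v: RunVerdict, strict: bool) -> bool:
    """Apply one rubric to one run. A run that fails Stage 1 never passes."""
    if not v.intercepted:
        return False
    if strict:
        # strict rubric: ambiguous -> mismatch
        return v.judge_label == "match"
    # lenient rubric: no contradiction -> match
    return v.judge_label != "contradiction"


def leaderboard_row(verdicts):
    """Turn a list of verdicts into the three percentage columns."""
    total = len(verdicts)

    def pct(n):
        return round(100 * n / total, 1)

    return {
        "intercepted": pct(sum(v.intercepted for v in verdicts)),
        "reward": pct(sum(passes(v, strict=False) for v in verdicts)),
        "reward_strict": pct(sum(passes(v, strict=True) for v in verdicts)),
    }


runs = (
    [RunVerdict(True, "match")] * 2
    + [RunVerdict(True, "ambiguous")]
    + [RunVerdict(True, "contradiction")]
    + [RunVerdict(False)]
)
print(leaderboard_row(runs))
# {'intercepted': 80.0, 'reward': 60.0, 'reward_strict': 40.0}
```

Under these assumptions Reward (strict) can never exceed Reward, and Reward can never exceed Intercepted, which is the ordering visible in the V2 table.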

Quick start

PyPI Full setup ↗ Submit your model

$ pip install clawbench-eval && clawbench run --corpus v2 --model your-model

How it works

1 · Real task, live website. The agent gets an everyday instruction, such as "On Uber Eats, order one Pad Thai to the home address, note 'no peanuts'", and drives a real browser on the real site.

2 · HTTP interception. The harness catches the agent's final submit request before it fires. Stage 1 is deterministic: does the URL/method match the task schema?

3 · LLM judge. In Stage 2, a judge model checks the intercepted payload against the instruction: right item, right address, right note.
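The deterministic Stage 1 check can be pictured as a comparison of HTTP method, host, and URL path. This is a minimal sketch, not the harness's implementation: `TaskSchema`, `InterceptedRequest`, the regex-based path pattern, and the example.com endpoint are all hypothetical.

```python
import re
from dataclasses import dataclass
from urllib.parse import urlsplit


@dataclass
class TaskSchema:
    """Hypothetical shape of a task's expected final request."""
    method: str
    host: str
    path_pattern: str  # regular expression matched against the URL path


@dataclass
class InterceptedRequest:
    """Hypothetical shape of the request the harness captured."""
    method: str
    url: str
    body: str = ""


def stage1_match(req: InterceptedRequest, schema: TaskSchema) -> bool:
    """Return True when method, host, and path all match the schema."""
    parts = urlsplit(req.url)
    return (
        req.method.upper() == schema.method.upper()
        and parts.hostname == schema.host
        and re.fullmatch(schema.path_pattern, parts.path) is not None
    )


schema = TaskSchema("POST", "www.example.com", r"/api/orders/?")
req = InterceptedRequest(
    "post",
    "https://www.example.com/api/orders?src=cart",
    '{"item": "Pad Thai", "note": "no peanuts"}',
)
print(stage1_match(req, schema))  # True
```

In this sketch only requests that pass the check would have their payload handed to the Stage 2 judge.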

Watch one run: deepseek-v4-pro on V2 task 535-daily-life-shopping-etsy ("Search Etsy for a handmade blue ceramic flower vase under $50 and add it to your favorites"). The agent searches, filters, opens a listing, and favorites it before the harness intercepts. Intercepted ✓ · 65 screenshots · 13m 47s

Full 5-layer trace bundle for this run: view trace →

Traces & dataset

1,724 judge-verified runs · 13 frontier models · 283 distinct everyday tasks · 163 live platforms · Apache-2.0 · refreshed weekly (last 2026-05-20). Every run ships as a 5-layer bundle: video, actions, agent messages, HTTP requests, and the graded verdict — all on one clock.

Train your own agent

918 V1 + 806 V2 frontier-model trajectories. JSONL-native, SFT/DPO/PRM-ready; mine success-vs-failure pairs across 13 models on identical tasks.
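One way to mine success-vs-failure pairs is to group runs by task and pair every passing run with every failing run on that task. The sketch below assumes each JSONL record carries `task_id`, `model`, and `passed` fields; those names, and the sample records, are illustrative rather than the published schema.

```python
import json
from collections import defaultdict
from itertools import product


def load_jsonl(lines):
    """Parse an iterable of JSONL lines, skipping blank ones."""
    return [json.loads(line) for line in lines if line.strip()]


def preference_pairs(runs):
    """Pair each passing run with each failing run on the same task."""
    by_task = defaultdict(lambda: {True: [], False: []})
    for run in runs:
        by_task[run["task_id"]][bool(run["passed"])].append(run)

    pairs = []
    for task_id, groups in by_task.items():
        for chosen, rejected in product(groups[True], groups[False]):
            pairs.append({
                "task_id": task_id,
                "chosen": chosen["model"],
                "rejected": rejected["model"],
            })
    return pairs


# Hypothetical records standing in for lines read from a downloaded file.
sample = [
    '{"task_id": "task-a", "model": "model-x", "passed": true}',
    '{"task_id": "task-a", "model": "model-y", "passed": false}',
    '{"task_id": "task-b", "model": "model-x", "passed": false}',
]
pairs = preference_pairs(load_jsonl(sample))
print(pairs)
# [{'task_id': 'task-a', 'chosen': 'model-x', 'rejected': 'model-y'}]
```

Tasks where every model failed (or every model passed) yield no pairs, so the number of usable pairs depends on how mixed the outcomes are per task.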

Get V1 · 918 runs ↗ Get V2 · 806 runs ↗

Replay & audit

Step through video + HAR + agent reasoning side-by-side. Diagnose failure modes, audit judge calls, diff a model's pixels vs its words.

Open trace browser → Browse on HF Hub ↗

Reproduce the leaderboard

Re-run any cell with our judge on your own data — or our data on your judge. The CLI consumes the same bundles you'll download.

Scoring rubric ↗ clawbench-eval on PyPI ↗

Want to read the tasks first? Browse all 283 task definitions → (searchable, filterable, no download)

Inside every trace — 6 time-synchronized signals per run

Every signal is timestamped against the same clock — click frame 1872 of recording.mp4 and you can find the exact actions.jsonl event, the LLM turn that triggered it, and the HTTP requests it fired. Cross-org mirrors: NAIL-Group · TIGER-Lab · Apache-2.0 · Bundle format: tar.gz per run, jsonl within.
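Because every signal shares one clock, mapping a video frame to the action that preceded it reduces to a sorted lookup. The following sketch assumes a constant frame rate (30 fps here) and an event field `t` holding seconds since the start of the recording; both are assumptions for illustration, not documented properties of the bundle.

```python
import bisect


def frame_to_seconds(frame: int, fps: float) -> float:
    """Convert a frame index to seconds, assuming a constant frame rate."""
    return frame / fps


def latest_event_at(events, t):
    """Return the last event with timestamp <= t, or None.

    `events` must be sorted by their "t" field.
    """
    times = [e["t"] for e in events]
    i = bisect.bisect_right(times, t)
    return events[i - 1] if i else None


# Hypothetical events standing in for parsed actions.jsonl records.
actions = [
    {"t": 61.9, "type": "click"},
    {"t": 62.3, "type": "type"},
    {"t": 62.6, "type": "click"},
]
t = frame_to_seconds(1872, fps=30.0)  # 62.4 seconds
event = latest_event_at(actions, t)
print(event)  # {'t': 62.3, 'type': 'type'}
```

The same lookup could be repeated against agent messages and HTTP requests to line up the LLM turn and network traffic for that frame.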

FAQ

How is the leaderboard scored and ranked?

Two stages. Stage 1 (Intercepted, deterministic): the run counts only if the agent's final HTTP request matches the task's URL/method schema. Stage 2 (Reward): a judge model (deepseek/deepseek-v4-pro) reads the intercepted payload against the instruction — lenient rubric ("no contradiction → match") for the headline Reward, strict rubric ("ambiguous → mismatch") for Reward (strict), visible via the "All columns" toggle. Ranked by Reward (lenient). Details: eval/scoring.md ↗

Are the tasks contaminated / memorized?

Tasks are held-out and post-cutoff: they target live third-party websites whose state changes daily, and the expected final request is defined per task rather than being recoverable from training data. The CLI (clawbench-eval ↗) also supports re-running any cell with your own judge, or scoring your own data with our judge.

What did the trace corpus cost to produce?

$10K+ in frontier-model compute over 5.7B tokens (input + cache + output), summed from every run's agent-messages.jsonl × current OpenRouter list prices. Cache reads are billed at the full prompt rate (conservative).

V1 base: $5,177 across 1,377 runs (sonnet-4-6 $3/$15, haiku-4-5 $1/$5, gpt-5.4 $2.50/$15, gemini-3.1-pro $2/$12, glm-5 $0.60/$1.92, kimi-k2.5 $0.40/$1.90, qwen3.5-397b $0.39/$2.34, gemini-3-flash $0.50/$3, gemini-3.1-flash-lite $0.25/$1.50).

V1 opus-4-6: $3,254 (the #1 V1 model; 113 run directories at $5/$25 pricing).

V2: $1,745 across the full 6-model corpus (opus-4-7 $5/$25 → $1,214, gpt-5.5 $1.25/$10 → $255, glm-5.1 $0.60/$1.92 → $122, deepseek-v4-pro $0.55/$2.19 → $146, deepseek-v4-flash $0.27/$1.10 → $9, owl-alpha free).

OpenRouter prices ↗
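The per-run arithmetic behind these totals can be sketched as token counts multiplied by list prices. This example assumes the quoted price pairs are USD per million input tokens and per million output tokens, and it uses made-up token counts; it follows the stated convention of billing cache reads at the full prompt rate.

```python
def run_cost_usd(input_tokens, cache_read_tokens, output_tokens,
                 in_price_per_m, out_price_per_m):
    """Estimate one run's cost in USD.

    Prices are assumed to be USD per million tokens. Cache reads are
    charged at the full input (prompt) rate, the conservative choice.
    """
    prompt_tokens = input_tokens + cache_read_tokens
    total = prompt_tokens * in_price_per_m + output_tokens * out_price_per_m
    return total / 1_000_000


# Hypothetical token counts for one run at $5/$25 pricing.
cost = run_cost_usd(
    input_tokens=800_000,
    cache_read_tokens=200_000,
    output_tokens=40_000,
    in_price_per_m=5.0,
    out_price_per_m=25.0,
)
print(cost)  # 6.0
```

Summing this value over every run of a model would give a per-model figure comparable to those listed above, provided the same price table is used.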

How do I submit my model?

Run the CLI (pip install clawbench-eval && clawbench run --corpus v2 --model your-model) and open a PR with your run artifacts — github.com/TIGER-AI-Lab/ClawBench/pulls ↗. Or start from the Contribute page.

Cite this benchmark

Using ClawBench in your research? Please cite the arXiv paper:

@article{zhang2026clawbench,
  title={ClawBench: Can AI Agents Complete Everyday Online Tasks?},
  author={Yuxuan Zhang and Yubo Wang and Yipeng Zhu and Penghui Du and Junwen Miao and Xuan Lu and Wendong Xu and Yunzhuo Hao and Songcheng Cai and Xiaochen Wang and Huaisong Zhang and Xian Wu and Yi Lu and Minyi Lei and Kai Zou and Huifeng Yin and Ping Nie and Liang Chen and Dongfu Jiang and Wenhu Chen and Kelsey R. Allen},
  journal={arXiv preprint arXiv:2604.08523},
  year={2026}
}

View on arXiv · Discuss on HF Papers · CITATION.cff · JSON API: /api/leaderboard.json · Contact

V1 (Hermes)

Rank	Model	Harness	Reward	Intercepted	Reward (strict)	Cost / task	Pass / Total
1	claude-opus-4-6	hermes	61.4%	61.4%	—	—	94 / 153
2	claude-sonnet-4-6	hermes	56.9%	56.9%	—	—	87 / 153
3	claude-haiku-4-5-20251001	hermes	30.1%	30.1%	—	—	46 / 153
4	gpt-5.4-2026-03-05	hermes	25.5%	25.5%	—	—	39 / 153
5	gpt-5.4-mini-2026-03-17	hermes	24.8%	24.8%	—	—	38 / 153
6	kimi-k2.5	hermes	17.6%	17.6%	—	—	27 / 153