# ClawBench

> ClawBench is a comprehensive benchmark for evaluating AI browser agents on 153 real-world everyday online tasks across 144 live websites and 8 categories. It captures 5 layers of behavioral data (session replay, screenshots, HTTP traffic, agent reasoning traces, and browser actions), includes human ground truth for every task, and scores with an agentic evaluator that provides step-level traceable diagnostics. The best-performing model (Claude Sonnet 4.6) achieves only a 33.3% success rate, revealing a large gap between current AI agents and human-level web task completion.

## Links

- [Paper](https://arxiv.org/abs/2604.08523): ClawBench: Can AI Agents Complete Everyday Online Tasks? (arXiv:2604.08523)
- [PDF](https://arxiv.org/pdf/2604.08523): Full paper PDF
- [Website](https://claw-bench.com): Interactive leaderboard, task browser, trace viewer, and agent demo gallery
- [GitHub](https://github.com/reacher-z/ClawBench): Source code — framework, evaluators, test driver, and Chrome extension
- [Dataset](https://huggingface.co/datasets/NAIL-Group/ClawBench): 153 tasks in Parquet format on Hugging Face
- [Hugging Face Papers](https://huggingface.co/papers/2604.08523): Community discussion page
- [PyPI](https://pypi.org/project/clawbench-eval/): Install with `pip install clawbench-eval`

## Key Facts

- 153 tasks across 144 live websites in 8 categories
- Categories: Daily Life, Finance, Work & Office, Development, Academic, Travel, Social, Pets
- 5 layers of behavioral data: session replay (rrweb), screenshots, HTTP traffic, agent reasoning traces, browser actions
- Human ground truth recorded for every task
- Agentic evaluator with VLM, LLM, and Human-Agent evaluation modes providing step-level traceable diagnostics
- Request interceptor prevents irreversible real-world actions (payments, form submissions) during evaluation
- 7 models evaluated: Claude Sonnet 4.6, GLM-5, Gemini 3 Flash, Claude Haiku 4.5, GPT-5.4, Kimi K2.5, Gemini 3.1 Flash Lite
- Apache 2.0 license
- COLM 2026 submission
- 21 authors from 11 institutions (UBC, Vector Institute, Etude AI, CMU, U Waterloo, SJTU, UniPat AI, ZJU, HKUST, Tsinghua, Netmind.ai)

## Leaderboard (Overall Success Rate %)

| Model | Provider | Score |
|-------|----------|-------|
| Claude Sonnet 4.6 | Anthropic | 33.3% |
| GLM-5 | Zhipu AI | 24.2% |
| Gemini 3 Flash | Google | 19.0% |
| Claude Haiku 4.5 | Anthropic | 18.3% |
| Kimi K2.5 | Moonshot AI | 15.0% |
| GPT-5.4 | OpenAI | 6.5% |
| Gemini 3.1 Flash Lite | Google | 3.3% |

## What Makes ClawBench Different

- **Real websites, not simulations**: Tasks run on 144 actual live platforms (Airbnb, Uber Eats, Coursera, Indeed, etc.), not synthetic environments
- **Everyday tasks**: Booking flights, ordering groceries, applying for jobs, scheduling appointments — tasks people actually do online
- **Safe evaluation**: Request interceptor blocks the final HTTP request before irreversible actions, allowing evaluation on production websites without side effects
- **Rich behavioral data**: 5 complementary data layers enable fine-grained analysis of where and why agents fail
- **Human baseline**: Every task has a human-completed ground-truth recording for direct comparison

## Citation

```bibtex
@article{zhang2026clawbench,
  title={ClawBench: Can AI Agents Complete Everyday Online Tasks?},
  author={Zhang, Yuxuan and Wang, Yubo and Zhu, Yipeng and Du, Penghui and Miao, Junwen and Lu, Xuan and Xu, Wendong and Hao, Yunzhuo and Cai, Songcheng and Wang, Xiaochen and Zhang, Huaisong and Wu, Xian and Lu, Yi and Lei, Minyi and Zou, Kai and Yin, Huifeng and Nie, Ping and Chen, Liang and Jiang, Dongfu and Chen, Wenhu and Allen, Kelsey R.},
  journal={arXiv preprint arXiv:2604.08523},
  year={2026}
}
```
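The request-interception idea described above (blocking the final HTTP request of an irreversible action while letting read-only traffic through) can be illustrated with a minimal sketch. Everything below — the function name, the method set, and the path hints — is hypothetical and is not taken from the ClawBench codebase:

```python
# Illustrative sketch only, NOT the actual ClawBench interceptor:
# classify an outgoing request as "irreversible" (likely to cause a
# real-world side effect such as a payment or form submission) so a
# proxy layer could block it before it leaves the browser.

# Hypothetical policy: mutating HTTP methods combined with
# sensitive-looking URL paths are treated as irreversible.
IRREVERSIBLE_METHODS = {"POST", "PUT", "DELETE", "PATCH"}
SENSITIVE_PATH_HINTS = ("checkout", "payment", "submit", "order")

def should_block(method: str, url: str) -> bool:
    """Return True if the request looks like an irreversible action."""
    if method.upper() not in IRREVERSIBLE_METHODS:
        # GET/HEAD and other read-only traffic passes through untouched.
        return False
    return any(hint in url.lower() for hint in SENSITIVE_PATH_HINTS)
```

In a real deployment this decision would sit inside a browser extension or intercepting proxy, which the paper's framework presumably implements with far more nuance (per-site rules, request bodies, etc.); the sketch only shows the shape of the safety check.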