Add ClawBench to Evaluation Harnesses & Benchmarks #9

Open
reacher-z wants to merge 1 commit into Picrew:main from reacher-z:add-clawbench

@reacher-z

Adds ClawBench to Evaluation Harnesses & Benchmarks.

ClawBench evaluates browser agents on live production websites (Uber Eats, Indeed, Craigslist, etc.). It uses a two-stage harness: first, the agent's HTTP requests are intercepted and matched against a per-task URL/method schema; second, an LLM judge evaluates the intercepted payload.
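
For readers unfamiliar with this style of evaluation, the two-stage check can be sketched roughly as below. This is only an illustration, not ClawBench's actual code: the names `InterceptedRequest`, `TaskSchema`, `matches_schema`, and `evaluate` are hypothetical, the URL is a placeholder, and the judge is a plain callable standing in for an LLM call. It also assumes a task passes if any schema-matching request is accepted by the judge, which may differ from the real harness.

```python
import re
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class InterceptedRequest:
    """One HTTP request captured from the agent's browser session (hypothetical type)."""
    method: str
    url: str
    body: str


@dataclass
class TaskSchema:
    """Per-task expectation: HTTP method plus a regex the URL must match (hypothetical type)."""
    method: str
    url_pattern: str


def matches_schema(req: InterceptedRequest, schema: TaskSchema) -> bool:
    # Stage 1: cheap structural filter on method and URL.
    same_method = req.method.upper() == schema.method.upper()
    return same_method and re.search(schema.url_pattern, req.url) is not None


def evaluate(
    requests: Iterable[InterceptedRequest],
    schema: TaskSchema,
    judge: Callable[[str], bool],
) -> bool:
    # Stage 2: only payloads that passed stage 1 are shown to the judge.
    # Assumption: any single accepted request counts as task success.
    return any(judge(req.body) for req in requests if matches_schema(req, schema))


# Usage with a stub judge; a real harness would call an LLM here instead.
schema = TaskSchema(method="POST", url_pattern=r"^https://example\.com/api/orders$")
captured = [
    InterceptedRequest("GET", "https://example.com/menu", ""),
    InterceptedRequest("POST", "https://example.com/api/orders", '{"item": "pizza"}'),
]
passed = evaluate(captured, schema, judge=lambda body: "pizza" in body)
```

Filtering on method and URL first means the judge only sees requests that could plausibly complete the task.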

  • 283 tasks (V1 153 + V2 130) across 163 live platforms · 15 life categories
  • Paper: https://arxiv.org/abs/2604.08523 · Live: https://claw-bench.com
  • Already sits next to WildClawBench in the table; the two are complementary (WildClawBench evaluates inside the OpenClaw environment; ClawBench evaluates on the open web).

Affiliation: I'm one of the maintainers.
