Hard Subset
Overview
WebArena-Verified Hard is a carefully curated subset of 258 challenging tasks selected from the full 812-task benchmark. This subset focuses on genuinely difficult tasks while maintaining broad site coverage and category diversity.
Why use the hard subset?
- Cost-effective evaluation: Evaluate on 258 tasks instead of 812 while preserving discriminative power
- Difficulty-prioritized: 48.1% of tasks have predicted success rate ≤ 0.20
- Representative coverage: Maintains balanced distribution across sites and task categories
Task Selection
The subset contains:
| Site | Single-site tasks | Multi-site tasks | Total |
|---|---|---|---|
| Shopping Admin | 55 | - | 55 |
| GitLab | 57 | - | 57 |
| Reddit | 42 | - | 42 |
| Shopping | 56 | - | 56 |
| Multi-site | - | 48 | 48 |
| Overall | 210 | 48 | 258 |
Why no single-site Map tasks?
Single-site Map tasks are excluded from the hard subset due to contamination issues identified during benchmark diagnosis. All 48 multi-site tasks (including 19 that involve Map) are included.
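As a quick sanity check, the totals in the table can be reproduced with a few lines of Python. The counts below are copied from the table; the variable names are illustrative only and are not part of the `webarena-verified` package.

```python
# Single-site rows of the table above, in order, followed by the multi-site row.
single_site_counts = [55, 57, 42, 56]
multi_site_count = 48
full_benchmark_size = 812  # size of the full benchmark, from the Overview

subset_size = sum(single_site_counts) + multi_site_count
share = subset_size / full_benchmark_size

print(subset_size)       # 258
print(f"{share:.1%}")    # 31.8%
```

The hard subset therefore covers a little under a third of the full benchmark.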
How It Was Created
The subset was constructed using a principled difficulty modeling approach:
- Difficulty Quantification: Estimated task hardness from multi-agent trajectories (8 agents) using a survival-style GLMM that models success probability as a function of steps taken
- Task Ranking: Ranked tasks by difficulty coefficient (β_t), where larger values indicate harder tasks
- Category Balancing: Within each per-site category, selected up to κ tasks based on hardness probability:
    - Default cap: κ_default = 3 tasks per category
    - Easy category cap: κ_easy = 2 tasks (for categories with median success ≥ 0.85)
- Site Coverage: Single-site Map excluded due to contamination; all 48 multi-site tasks included
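The category-balancing step can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the record keys (`task_id`, `site`, `category`, `p_hard`), the function name `select_capped`, and the choice to rank by hardness probability within each category are all assumptions made for the example. The paper defines the actual ranking and tie-breaking, and the site-coverage step (Map exclusion, multi-site inclusion) is not modeled here.

```python
from collections import defaultdict

TAU_EASY = 0.85     # median success at or above this marks a category as "easy"
KAPPA_DEFAULT = 3   # cap for ordinary categories
KAPPA_EASY = 2      # cap for easy categories


def select_capped(tasks, category_median_success):
    """Pick up to kappa tasks per (site, category), hardest first.

    tasks: list of dicts with hypothetical keys
           task_id, site, category, p_hard.
    category_median_success: dict mapping (site, category) -> median success.
    """
    by_category = defaultdict(list)
    for task in tasks:
        by_category[(task["site"], task["category"])].append(task)

    selected = []
    for key, members in by_category.items():
        is_easy = category_median_success[key] >= TAU_EASY
        cap = KAPPA_EASY if is_easy else KAPPA_DEFAULT
        ranked = sorted(members, key=lambda t: t["p_hard"], reverse=True)
        selected.extend(ranked[:cap])
    return selected


# Toy data, invented purely for illustration.
tasks = [
    {"task_id": "g1", "site": "gitlab", "category": "edit_issue", "p_hard": 0.9},
    {"task_id": "g2", "site": "gitlab", "category": "edit_issue", "p_hard": 0.1},
    {"task_id": "g3", "site": "gitlab", "category": "edit_issue", "p_hard": 0.7},
    {"task_id": "g4", "site": "gitlab", "category": "edit_issue", "p_hard": 0.5},
    {"task_id": "s1", "site": "shopping", "category": "browse", "p_hard": 0.3},
    {"task_id": "s2", "site": "shopping", "category": "browse", "p_hard": 0.2},
    {"task_id": "s3", "site": "shopping", "category": "browse", "p_hard": 0.6},
]
medians = {("gitlab", "edit_issue"): 0.40, ("shopping", "browse"): 0.90}

selected_ids = [t["task_id"] for t in select_capped(tasks, medians)]
print(selected_ids)  # ['g1', 'g3', 'g4', 's3', 's1']
```

In the toy run, the ordinary GitLab category contributes three tasks and the easy Shopping category contributes two, which is the capping behavior the list above describes.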
Selection criteria:
- τ_hard = 0.20 (threshold for "hard" classification)
- τ_easy = 0.85 (threshold for "easy" category identification)
- 16.7% of tasks have ≥ 0.90 probability of being hard
The hardest categories involve multi-step, state-changing interactions (forms, data updates), while the easiest are browse or read-only tasks.
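One plausible reading of "probability of being hard" is the probability that a task's success rate falls at or below τ_hard. The sketch below estimates it as the fraction of sampled success probabilities under the threshold. The sampling setup, the function name `hardness_probability`, and the toy numbers are assumptions for illustration; the paper defines the actual estimator.

```python
TAU_HARD = 0.20          # threshold for "hard" classification
CONFIDENT_LEVEL = 0.90   # level used in the "≥ 0.90 probability" statistic


def hardness_probability(success_samples, tau_hard=TAU_HARD):
    """Fraction of sampled success probabilities at or below tau_hard."""
    if not success_samples:
        raise ValueError("need at least one sample")
    hits = sum(1 for s in success_samples if s <= tau_hard)
    return hits / len(success_samples)


# Hypothetical per-task samples of predicted success probability.
samples_by_task = {
    "task_a": [0.05, 0.10, 0.15, 0.18, 0.12, 0.08, 0.19, 0.11, 0.30, 0.14],
    "task_b": [0.60, 0.75, 0.15, 0.80],
}

p_hard = {tid: hardness_probability(s) for tid, s in samples_by_task.items()}
confidently_hard = [tid for tid, p in p_hard.items() if p >= CONFIDENT_LEVEL]

print(p_hard)            # {'task_a': 0.9, 'task_b': 0.25}
print(confidently_hard)  # ['task_a']
```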
Usage
Export the hard subset tasks to a JSON file:
    webarena-verified subset-export \
      --name webarena-verified-hard \
      --output webarena-verified-hard.json
The exported file contains the full task definitions for all 258 tasks in the subset.
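After exporting, the file can be inspected with a short script such as the one below. This is a sketch under two assumptions that are not confirmed by this page: that the export is a top-level JSON array of task objects, and that each task carries a `sites` list. The helper names `load_tasks` and `count_by_site` are hypothetical and are not part of the `webarena-verified` CLI.

```python
import json
from collections import Counter
from pathlib import Path


def load_tasks(path):
    """Read the exported file; assumes a top-level JSON array of tasks."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    if not isinstance(data, list):
        raise ValueError("expected a top-level JSON array of task objects")
    return data


def count_by_site(tasks):
    """Count tasks per site, using the assumed `sites` field."""
    counts = Counter()
    for task in tasks:
        sites = task.get("sites", [])
        if len(sites) == 1:
            counts[sites[0]] += 1
        elif len(sites) > 1:
            counts["multi-site"] += 1
        else:
            counts["unknown"] += 1
    return counts


export_path = Path("webarena-verified-hard.json")
if export_path.exists():  # only runs once the export command has been executed
    tasks = load_tasks(export_path)
    print(len(tasks))
    print(count_by_site(tasks))
```

If the real export uses a different layout or field names, adjust `load_tasks` and `count_by_site` accordingly.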
For more subset management commands, see the Subset Manager guide.
Reference
For detailed methodology and analysis, see Section 4.5 "WebArena Verified Hard: A Representative Subset" in the WebArena Verified paper.