Evaluation
WebArena-Verified assesses web agent performance with a set of evaluators, each of which checks a different aspect of task completion.
Evaluator Configuration
Each task defines its validation requirements through evaluator configurations:
- One agent response evaluator - Every task has exactly one `AgentResponseEvaluator` configuration that validates the agent's final structured response (performed operation, status, and retrieved data).
- Zero or more network event evaluators - Depending on the expected operation, a task may include zero or more `NetworkEventEvaluator` configurations. Navigate and mutate operations typically require network validation, while retrieve operations may not need any network checks.
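
As a rough illustration of the rule above, the sketch below models two tasks and checks that each one carries exactly one agent response evaluator and any number of network event evaluators. The dictionary layout, the `eval`, `evaluator`, and `expected` keys, the `SUCCESS` status value, the example URL, and the `check_evaluator_counts` helper are all assumptions made for this example; they are not the actual WebArena-Verified schema or API.

```python
from collections import Counter

# Hypothetical task definitions. The "eval", "evaluator", and "expected" keys
# are illustrative assumptions, not the real WebArena-Verified schema.
retrieve_task = {
    "task_id": 1,
    "eval": [
        {"evaluator": "AgentResponseEvaluator",
         "expected": {"operation": "retrieve", "status": "SUCCESS"}},
    ],
}

mutate_task = {
    "task_id": 2,
    "eval": [
        {"evaluator": "AgentResponseEvaluator",
         "expected": {"operation": "mutate", "status": "SUCCESS"}},
        {"evaluator": "NetworkEventEvaluator",
         "expected": {"method": "POST", "url": "http://shop.example/checkout"}},
    ],
}

KNOWN_EVALUATORS = {"AgentResponseEvaluator", "NetworkEventEvaluator"}


def check_evaluator_counts(task):
    """Return a list of problems; an empty list means the counts are valid."""
    counts = Counter(cfg["evaluator"] for cfg in task["eval"])
    problems = []
    n_response = counts["AgentResponseEvaluator"]
    if n_response != 1:
        problems.append(
            f"expected exactly one AgentResponseEvaluator, found {n_response}"
        )
    unknown = set(counts) - KNOWN_EVALUATORS
    if unknown:
        problems.append(f"unknown evaluator types: {sorted(unknown)}")
    # Any number of NetworkEventEvaluator entries (including zero) is allowed.
    return problems


for task in (retrieve_task, mutate_task):
    print(task["task_id"], check_evaluator_counts(task) or "ok")
```

The retrieve task passes with a single evaluator, while the mutate task adds a network check for the request that performs the change.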
Evaluation Method Comparison
| Aspect | WebArena | WebArena-Verified |
|---|---|---|
| Validation Approach | DOM-based evaluation | Network event-based evaluation |
| Matching Method | Substring matching and LLM-as-judge eval | Data type-aware exact match |
| LLM-Based Evaluation | Used (LLM-as-judge) | Not used; replaced by exact match |
| Stability | Fragile - breaks with UI changes | Stable - resilient to UI changes |
| Tool Dependency | Tightly coupled to specific frameworks | Framework-flexible (any tool with network traces) |
| Offline Evaluation | Not supported | Supported - re-evaluate captured traces |
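
To make the network event-based and offline rows more concrete, here is a minimal sketch of checking a captured trace for an expected request. It relies only on the standard HAR layout (`log.entries`, each with `request.method`, `request.url`, and `response.status`); the `find_matching_events` helper, the sample trace, and the `shop.example` URLs are hypothetical and do not reflect how WebArena-Verified implements its evaluators.

```python
import json
from urllib.parse import urlsplit

# A tiny, hand-written HAR trace used only for illustration.
CAPTURED_HAR = json.loads("""
{
  "log": {
    "version": "1.2",
    "entries": [
      {"request": {"method": "GET", "url": "http://shop.example/cart?ref=nav"},
       "response": {"status": 200}},
      {"request": {"method": "POST", "url": "http://shop.example/checkout"},
       "response": {"status": 200}}
    ]
  }
}
""")


def find_matching_events(har, method, path, status):
    """Return HAR entries whose method, URL path, and status all match exactly."""
    matches = []
    for entry in har.get("log", {}).get("entries", []):
        request = entry.get("request", {})
        response = entry.get("response", {})
        if (
            request.get("method") == method
            and urlsplit(request.get("url", "")).path == path
            and response.get("status") == status
        ):
            matches.append(entry)
    return matches


# The trace is plain data, so the same check can be re-run later without a
# browser or the original agent framework.
checkout_events = find_matching_events(CAPTURED_HAR, "POST", "/checkout", 200)
print("checkout request observed:", bool(checkout_events))
```

Because the check reads only the recorded requests and responses, it does not depend on page markup, which is why this style of validation is unaffected by UI changes.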
Learn More
- Evaluation Results - Complete guide to understanding evaluation output format and results
- Network Event-Based Evaluation - Detailed guide on network trace validation using HTTP Archive (HAR) format
- Removing LLM-Based Evaluation - How we replaced LLM-as-judge with exact matching and verifiable intents
- Handling of Unachievable Tasks - Guidance on replacing N/A with explicit statuses and reducing guesswork