Skip to content

Evaluation

WebArena-Verified provides a comprehensive evaluation framework for assessing web agent performance. The system validates agent behavior through multiple evaluators that check different aspects of task completion.

Evaluator Configuration

Each task defines its validation requirements through evaluator configurations:

  • One agent response evaluator - Every task has exactly one AgentResponseEvaluator configuration that validates the agent's final structured response (performed operation, status, and retrieved data).
  • Zero or more network event evaluators - Depending on the expected operation, a task may include zero to multiple NetworkEventEvaluator configurations. Navigate and mutate operations typically require network validation, while retrieve operations may not need any network checks.

Evaluation Method Comparison

Aspect WebArena WebArena-Verified
Validation Approach DOM-based evaluation Network event-based evaluation
Matching Method Substring matching and LLM-as-judge eval Data type-aware exact match
LLM-Based Evaluation LLM-based evaluation Replaced by exact match
Stability Fragile - breaks with UI changes Stable - resilient to UI changes
Tool Dependency Tightly coupled to specific frameworks Framework-flexible (any tool with network traces)
Offline Evaluation Not supported Supported - re-evaluate captured traces

Learn More