v1.0.0

Data Schema Changes

We use WebArenaVerifiedTask, a Pydantic BaseModel, to map and interact with task data. Pydantic provides automatic validation, type safety, and seamless JSON serialization/deserialization, ensuring data integrity throughout the benchmark pipeline.

Changes from WebArena

WebArena-Verified introduces a more structured and extensible task format than the original WebArena benchmark. This section outlines the key changes.

High-Level Changes

| Aspect | WebArena | WebArena-Verified |
| --- | --- | --- |
| Evaluation Structure | Single dict with type strings | Array of typed evaluator configs |
| Field Organization | Flat structure with runtime configs | Separation of task data and runtime concerns |
| Type Safety | String-based types | Discriminated unions with Pydantic models |
| Extensibility | Limited evaluator types | Multiple evaluator types with clear interfaces |
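To illustrate what "array of typed evaluator configs" means in practice, here is a minimal standard-library sketch of the dispatch that a discriminated union performs. Pydantic does this declaratively; the dataclasses, the `parse_eval` helper, and the shape of `expected` below are hypothetical and only the two evaluator names come from this page.

```python
from dataclasses import dataclass
from typing import Any


# Hypothetical config shapes; only the evaluator names are taken from the docs.
@dataclass
class AgentResponseConfig:
    expected: Any


@dataclass
class NetworkEventConfig:
    expected: Any


# The value of the "evaluator" key is the discriminator that selects the type.
CONFIG_TYPES = {
    "AgentResponseEvaluator": AgentResponseConfig,
    "NetworkEventEvaluator": NetworkEventConfig,
}


def parse_eval(entries: list[dict[str, Any]]) -> list[Any]:
    """Turn raw eval entries into typed configs, rejecting unknown evaluators."""
    parsed = []
    for entry in entries:
        name = entry.get("evaluator")
        config_type = CONFIG_TYPES.get(name)
        if config_type is None:
            raise ValueError(f"unknown evaluator: {name!r}")
        parsed.append(config_type(expected=entry.get("expected")))
    return parsed
```

Because an unknown `evaluator` value fails at load time, a typo in a task file surfaces immediately rather than as a silently skipped check.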

Field-by-Field Mapping

| WebArena Field | WebArena-Verified Field | Rationale |
| --- | --- | --- |
| start_url (string) | start_urls (array) | Support multiple starting URLs for complex tasks |
| eval (object) | eval (array of objects) | Simplify and flatten evaluator structure for extensibility |
| eval.eval_types | eval[].evaluator | More explicit evaluator identification |
| eval.reference_answers | eval[].expected | Structured expected values per evaluator type |
| eval.reference_url | NetworkEventEvaluator | Replaced string matching with network trace validation |
| eval.program_html | NetworkEventEvaluator | Network traces now cover backend checks |
| require_login | (removed) | Runtime configuration, not task data |
| storage_state | (removed) | Runtime configuration (local file path) |
| geolocation | (removed) | Always null in dataset; removed as redundant |
| require_reset | (removed) | Always false in dataset; removed as redundant |
| - | format_specification | New: Specifies output format requirements |
| - | start_url_context | New: Provides context about starting page |
| - | revision | New: Integer revision number for task changes (minimum 1) |
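The top-level renames and removals in the table above can be sketched as a small conversion function. This is an illustration only: `migrate_task_fields` is a hypothetical helper, not part of the benchmark tooling, and it deliberately leaves the legacy `eval` object untouched because converting it depends on the evaluator types described in the next section.

```python
from typing import Any

# Fields the mapping table marks as removed from task data.
REMOVED_FIELDS = {"require_login", "storage_state", "geolocation", "require_reset"}


def migrate_task_fields(legacy: dict[str, Any], revision: int = 1) -> dict[str, Any]:
    """Apply the top-level renames and removals; `eval` still needs converting."""
    if revision < 1:
        raise ValueError("revision must be at least 1")
    # Drop runtime-only and redundant fields.
    task = {key: value for key, value in legacy.items() if key not in REMOVED_FIELDS}
    # start_url (string) becomes start_urls (array).
    task["start_urls"] = [task.pop("start_url")]
    # New field: integer revision number, minimum 1.
    task["revision"] = revision
    return task
```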

Evaluation System Migration

The evaluation system was restructured to use typed evaluators instead of string-based types:

WebArena Evaluation Types → WebArena-Verified Evaluators

| WebArena eval_types | WebArena-Verified Evaluator(s) | Purpose |
| --- | --- | --- |
| string_match | AgentResponseEvaluator (action=retrieve) | Validates agent's returned response format, action type, and result values |
| program_html (persisted changes) | NetworkEventEvaluator | Validates backend-side mutations via network traces |
| program_html (not persisted changes) | NetworkEventEvaluator | Network requests capture transient UI interactions |
| url_match | NetworkEventEvaluator + AgentResponseEvaluator (action=navigate) | Validates navigation to correct URL using network traces |
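The mapping table above translates directly into a lookup. The sketch below is a hypothetical helper (`evaluators_for` does not come from the benchmark code) that returns which evaluators a legacy `eval_types` list implies; collapsing duplicates is an assumption made here for readability.

```python
# Direct encoding of the eval_types mapping table.
EVALUATORS_BY_LEGACY_TYPE = {
    "string_match": ["AgentResponseEvaluator"],
    "program_html": ["NetworkEventEvaluator"],
    "url_match": ["NetworkEventEvaluator", "AgentResponseEvaluator"],
}


def evaluators_for(eval_types: list[str]) -> list[str]:
    """Return the evaluators implied by legacy eval_types, without duplicates."""
    result: list[str] = []
    for legacy_type in eval_types:
        # An unrecognized legacy type raises KeyError instead of being skipped.
        for evaluator in EVALUATORS_BY_LEGACY_TYPE[legacy_type]:
            if evaluator not in result:
                result.append(evaluator)
    return result
```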

Expected Agent Response Changes

WebArena-Verified introduces a structured agent response format, replacing the plain string format used in WebArena.

WebArena Agent Response

Agents returned a plain string containing the answer:

Quest Lumaflex™ Band

WebArena-Verified Agent Response

Agents return a structured JSON object with explicit action type, status, and results:

{
  "action": "retrieve",
  "status": "SUCCESS",
  "results": ["Quest Lumaflex™ Band"]
}
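A harness consuming this format might parse and sanity-check the response before evaluating it. The following is a minimal sketch, not the benchmark's own parser: `AgentResponse` and `parse_agent_response` are hypothetical names, and allowing `results` to be null is an assumption.

```python
import json
from dataclasses import dataclass
from typing import Optional

# The three action types this document names.
ACTIONS = {"retrieve", "navigate", "mutate"}


@dataclass
class AgentResponse:
    action: str
    status: str
    results: Optional[list]  # assumption: may be null when there is nothing to return


def parse_agent_response(raw: str) -> AgentResponse:
    """Parse a JSON agent response, raising ValueError on malformed input."""
    data = json.loads(raw)  # JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict):
        raise ValueError("response must be a JSON object")
    action = data.get("action")
    if not isinstance(action, str) or action not in ACTIONS:
        raise ValueError(f"unknown action: {action!r}")
    status = data.get("status")
    if not isinstance(status, str):
        raise ValueError("status must be a string")
    results = data.get("results")
    if results is not None and not isinstance(results, list):
        raise ValueError("results must be a list or null")
    return AgentResponse(action=action, status=status, results=results)
```

A legacy plain-string answer such as `Quest Lumaflex™ Band` is not valid JSON, so it is rejected up front instead of being matched by substring.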

Benefits

  • Reduced false negatives: Eliminates evaluation failures due to string parsing ambiguities.
  • Explicit status indication: Agents clearly report whether they succeeded or encountered errors. This matters most when an agent on a navigation task reaches its maximum number of iterations while it happens to be on the right page: the task failed, but without an explicit status the evaluation would incorrectly pass.
  • Action type tracking: Clear indication of the performed action (retrieve, navigate, or mutate). This is especially beneficial to differentiate between navigate and retrieve tasks. For example, a task might expect the agent to navigate, but the agent thinks it needs to retrieve some value. The agent might fail at retrieving the value but still navigate to the right page. Without the action being explicit, the validation would incorrectly pass.
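The two failure cases described in the list above can be expressed as a simple gate that runs before any page or value checks. This is a sketch under assumptions: `response_gate_passes` is a hypothetical function, and `"SUCCESS"` is the only status value shown on this page, so any other string is treated as a failure.

```python
def response_gate_passes(response: dict, expected_action: str) -> bool:
    """Pass only if the agent reports success AND performed the expected action."""
    return (
        response.get("status") == "SUCCESS"
        and response.get("action") == expected_action
    )


# Agent ended on the right page but believed it was retrieving a value.
wrong_action = {"action": "retrieve", "status": "SUCCESS", "results": []}

# Agent ran out of iterations; "ERROR" is a placeholder for any non-success status.
ran_out = {"action": "navigate", "status": "ERROR", "results": None}

correct = {"action": "navigate", "status": "SUCCESS", "results": None}
```

With a navigation task, only `correct` passes the gate; the other two fail even if the final URL happens to match.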

Result Format Specification

WebArena-Verified introduces the format_specification field to eliminate ambiguity in how agents should format their results.

Problem in WebArena

Without format specifications, agents had to interpret how to format results, leading to evaluation ambiguities. For example, Task ID 10 asks: "Tell me the full address of all US international airports that are within a driving distance of 60 km to Niagara Falls"

Different agents interpreted "full address" differently, producing varied formats that were difficult to evaluate consistently. One leaderboard agent returned:

There is one US international airport within 60 km driving distance of Niagara Falls: Buffalo-Niagara International Airport, Holtz Drive, Town of Cheektowaga, Erie County, New York, 14225, United States.

Solution in WebArena-Verified

The format_specification field explicitly defines the expected result structure. For Task ID 10, the format specification states: "Use \"name\" for the name, \"state\" for the state, and \"zip_code\" for the zip code."

Agents now return structured data:

{
  "action": "retrieve",
  "status": "SUCCESS",
  "results": [
    {
      "name": "Niagara Falls International Airport",
      "state": "New York",
      "zip_code": "14304"
    },
    {
      "name": "Buffalo-Niagara International Airport",
      "state": "New York",
      "zip_code": "14225"
    }
  ]
}
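With the keys fixed by the format specification, a structural check on `results` becomes straightforward. The sketch below is illustrative only: `malformed_results` is a hypothetical helper, and requiring exactly the three keys (no extras) is an assumption rather than documented behaviour.

```python
# Keys named in Task ID 10's format specification.
REQUIRED_KEYS = {"name", "state", "zip_code"}


def malformed_results(results: list) -> list[int]:
    """Return indexes of results that are not objects with exactly the required keys."""
    return [
        index
        for index, item in enumerate(results)
        if not isinstance(item, dict) or set(item) != REQUIRED_KEYS
    ]
```

An empty return value means every entry has the expected shape; a free-text answer placed in `results` would be reported by index.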

Benefits

  • Eliminates format ambiguity: Clear specifications for how to structure results.
  • Consistent evaluation: All agents use the same format, enabling reliable comparisons.
  • Structured data validation: Results can be validated against JSON schemas.