
Welcome to WebArena-Verified

WebArena-Verified is the reproducible release of the WebArena benchmark: the original containerized sites remain intact, but every task, reference answer, and evaluator has been re-audited to eliminate brittle string matching and ambiguous success criteria. Deterministic, JSON-based scoring and network-event-based checks let you measure web agents offline.

Key Contributions:

  • Fully audited benchmark: Every task, reference answer, and evaluator has been manually reviewed and corrected
  • Offline evaluation: Evaluate agent runs without requiring live web environments using network trace replay
  • Deterministic scoring: Removed LLM-as-a-judge evaluation and substring matching in favor of type-aware normalization and structural comparison
  • WebArena-Verified Hard subset: A difficulty-prioritized 258-task subset for cost-effective evaluation

The following quick start demonstrates these capabilities in practice. You'll bring the toolkit up, validate a single task end-to-end, and then branch into batch evaluation or custom integrations when you're ready.

This quick start is divided into two parts:

  • Part 1 (~5 minutes): Understand how evaluation works by scoring a pre-run agent log
  • Part 2 (~10 minutes): Run an agent and evaluate it

Setup

Prerequisites

  • Python 3.11+

Using uv:

git clone https://github.com/ServiceNow/webarena-verified.git
cd webarena-verified
uv sync --extra examples
source .venv/bin/activate
playwright install chromium  # Only needed for example agents, not the evaluation framework
webarena-verified --help

Using pip:

git clone https://github.com/ServiceNow/webarena-verified.git
cd webarena-verified
python -m venv .venv
source .venv/bin/activate
pip install -e ".[examples]"
playwright install chromium  # Only needed for example agents, not the evaluation framework
webarena-verified --help

Part 1: Evaluate a Pre-Run Task

Before running an agent, let's evaluate an existing agent log to understand how WebArena-Verified works. We'll use the following task that already has output in examples/agent_logs/demo/108/:

{
  "task_id": 108,
  "intent": "Get the monthly count of successful orders 01/2023-05/2023",
  "sites": ["shopping_admin"]
  ...
}

New in WebArena-Verified: Offline Evaluation

Why This Matters:

  • Evaluate agent runs without live web environments
  • Reevaluate past runs at any time
  • Compare different agents transparently with reproducible benchmarking

1. What's in a Task Log?

The task log contains two key artifacts:

examples/agent_logs/demo/108/
├── agent_response.json
└── network.har

Agent Response

Agents are required to return a valid JSON response like the following:

{
  "task_type": "RETRIEVE",
  "status": "SUCCESS",
  "retrieved_data": [
    { "month": "Jan", "count": 12 },
    { "month": "Feb", "count": 7 },
    { "month": "March", "count": 5 },
    { "month": "April", "count": 9 },
    { "month": "May", "count": 5 }
  ],
  "error_details": null
}
Field Descriptions
  • task_type (required): Type of work performed - RETRIEVE, MUTATE, or NAVIGATE
  • status (required): Task outcome - SUCCESS or error codes (NOT_FOUND_ERROR, PERMISSION_DENIED_ERROR, DATA_VALIDATION_ERROR, etc.)
  • retrieved_data: Array of items for RETRIEVE operations (otherwise null)
  • error_details: Null for SUCCESS, otherwise explains what failed and why
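
As a concrete illustration of this contract, here is a minimal Python sketch of writing such a response file; the write_agent_response helper and the output path are illustrative, not part of the toolkit:

import json
from pathlib import Path

def write_agent_response(output_dir, task_type, status, retrieved_data=None, error_details=None):
    """Write an agent_response.json with the four fields described above."""
    response = {
        "task_type": task_type,            # RETRIEVE, MUTATE, or NAVIGATE
        "status": status,                  # SUCCESS or an error code
        "retrieved_data": retrieved_data,  # list of items for RETRIEVE, otherwise None
        "error_details": error_details,    # None on SUCCESS
    }
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    (out / "agent_response.json").write_text(json.dumps(response, indent=2))

# Example: a RETRIEVE response like the one shown for task 108.
write_agent_response(
    "output/my-run/108",
    task_type="RETRIEVE",
    status="SUCCESS",
    retrieved_data=[{"month": "Jan", "count": 12}, {"month": "Feb", "count": 7}],
)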

New in WebArena-Verified: Structured Agent Response

Why This Matters:

  • Robust Evaluation: Modern LLMs rarely struggle with generating valid JSON, enabling more reliable evaluation with explicit fields:
    • task_type: Requires agents to explicitly state what operation they performed, revealing whether they truly understood the task
    • status: Allows various error status codes instead of catch-all "N/A" responses for unachievable tasks
    • retrieved_data: Structured format reduces false negatives due to parsing issues
  • Reduces False Positives: By validating both the operation type and the outcome, we ensure agents actually completed the intended task.

    Example: Navigation vs. Retrieval

    Suppose a task requires retrieving data, but the agent misunderstands and only navigates: it reaches the correct page yet never retrieves the data.

    • Original WebArena: Pass ✓ (only checked if agent reached the correct page)
    • WebArena-Verified: Fail ✗ (verifies page navigation and task_type matches RETRIEVE)

    (Or picture asking a coding agent to "review this code" and watching it start rewriting everything while you frantically mash Ctrl+C! 😱)

Network Trace

The network trace captures all network activity between the browser frontend and the backend in HAR (HTTP Archive) format, a standard widely used for debugging and analyzing web traffic. It records what the agent actually did - including page navigations, data retrievals, and modifications. Each network event includes the URL, HTTP method, and response status used by the evaluator:

{
  "request": {
    "method": "GET",
    "url": "http://192.168.1.35:7780/admin/customer/index/",
    ...
  },
  "response": {
    "status": 200,
    ...
  },
  ...
}
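
Because HAR files are plain JSON, you can inspect a trace with nothing but the standard library. The snippet below is an illustrative sketch (not part of the toolkit) that prints the method, URL, and status of each recorded entry:

import json

# HAR 1.2 stores each request/response pair under log.entries
with open("examples/agent_logs/demo/108/network.har") as f:
    entries = json.load(f)["log"]["entries"]

for entry in entries:
    request, response = entry["request"], entry["response"]
    print(response["status"], request["method"], request["url"])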

New in WebArena-Verified: Network Event Based Evaluation

Why This Matters:

  • Enables Offline Evaluation: Network traces can be evaluated without live web environments - this is the critical piece that makes reevaluation possible
  • Avoids Brittle Locators: No reliance on DOM selectors or page structure - allows for easy website updates
  • Single Evaluation Method: Works uniformly across all websites (GitLab, e-commerce, forums, etc.)

See Network Event Based Evaluation for details.

2. Run the Evaluator

webarena-verified eval-tasks \
  --task-ids 108 \
  --output-dir examples/agent_logs/demo \
  --config examples/configs/config.demo.json
Troubleshooting
  • If the webarena-verified command is not available, make sure you have activated the virtual environment correctly. See the Setup section.

This creates an eval_result.json file in the task directory (examples/agent_logs/demo/108/).

3. Examine the Evaluation Result

The evaluation result is a structured JSON document that shows:

  • The overall task status and score - Did the agent pass or fail?
  • Individual evaluator results - Each evaluator (e.g., AgentResponseEvaluator) reports its findings
  • Raw and normalized values - We show both actual (raw agent output) and actual_normalized (after type-aware normalization) to help you catch normalization issues and understand how values are being compared
  • Reproducibility checksums - We track evaluation code and task dataset checksums to ensure consistent, reproducible evaluations across different runs and environments

The annotated JSON below explains each field; the numbered notes after the example describe each marked line:

{
  "task_id": 108,
  "intent_template_id": 270,
  "sites": [
    "shopping_admin"
  ],
  "task_revision": 2, // (1)!
  "status": "success", // (2)!
  "score": 1.0, // (3)!
  "evaluators_results": [ // (4)!
    {
      "evaluator_name": "AgentResponseEvaluator", // (5)!
      "status": "success",
      "score": 1.0,
      "actual": { // (6)!
        "task_type": "RETRIEVE",
        "status": "SUCCESS",
        "retrieved_data": [
          { "month": "Jan", "count": 12 },
          { "month": "Feb", "count": 7 },
          { "month": "March", "count": 5 },
          { "month": "April", "count": 9 },
          { "month": "May", "count": 5 }
        ],
        "error_details": null
      },
      "actual_normalized": { // (7)!
        "task_type": "retrieve",
        "status": "success",
        "retrieved_data": [
          { "month": "january", "count": 12.0 },
          { "month": "february", "count": 7.0 },
          { "month": "march", "count": 5.0 },
          { "month": "april", "count": 9.0 },
          { "month": "may", "count": 5.0 }
        ]
      },
      "expected": { // (8)!
        "task_type": "retrieve",
        "status": "success",
        "retrieved_data": [
          { "month": "january", "count": 12.0 },
          { "month": "february", "count": 7.0 },
          { "month": "march", "count": 5.0 },
          { "month": "april", "count": 9.0 },
          { "month": "may", "count": 5.0 }
        ]
      },
      "assertions": null, // (9)!
      "error_msg": null // (10)!
    }
  ],
  "error_msg": null,
  "webarena_verified_version": "2.0.0", // (11)!
  "webarena_verified_evaluator_checksum": "a9e6da7e172f7ba62e388d445ccc974cc5df02529b833738051c54879319d4f8", // (12)!
  "webarena_verified_data_checksum": "33048e32d7349835e3ea656348e59ba4ca43d2068cce24d3772134e402ef8f4b" // (13)!
}
  1. Task revision number - incremented when task definition changes
  2. Overall evaluation status - success when all evaluators pass
  3. Overall score - 1.0 = complete success, 0.0 = failure
  4. Results from each evaluator that ran on this task
  5. Name of the evaluator - AgentResponseEvaluator validates structured agent responses
  6. Raw agent response before normalization - note mixed month formats ("Jan", "Feb", "March")
  7. Agent response after type-aware normalization - all months converted to lowercase ("january", "february", "march"). Notice how actual_normalized matches expected even though raw formats were mixed.
  8. Expected values from task definition - what the agent should return after normalization
  9. List of assertion failures - null means all checks passed
  10. Error message when the evaluation system itself encounters an error (not agent failures). When error_msg is not null, status is ERROR.
  11. WebArena-Verified version used for this evaluation
  12. Checksum of evaluator code - ensures evaluation logic hasn't changed
  13. Checksum of task dataset - ensures task definitions are consistent
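
To consume these results programmatically, for example to aggregate scores across many tasks, a minimal sketch based on the directory layout used in this demo could look like this:

import json
from pathlib import Path

results_dir = Path("examples/agent_logs/demo")  # one subdirectory per task
results = [json.loads(p.read_text()) for p in sorted(results_dir.glob("*/eval_result.json"))]

for result in results:
    print(f"task {result['task_id']}: {result['status']} (score={result['score']})")

passed = sum(1 for r in results if r["score"] == 1.0)
print(f"{passed}/{len(results)} tasks passed")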

New in WebArena-Verified: Type-Aware Normalization

Why This Matters:

  • Handles Common Data Types: Automatically normalizes dates, currency, URLs, coordinates, and more without requiring LLM-based evaluation
  • Format-Agnostic Comparison: In this example, month names are normalized regardless of format ("Jan" vs "January" vs "january"), ensuring reliable comparison
  • Deterministic & Cost-Effective: Eliminates the unpredictability and cost of LLM evaluators
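
The exact rules live inside the evaluators, but as a rough illustration of the idea (not the actual implementation), the month/count normalization from this example could be expressed as:

import calendar

# Canonical lowercase month names, keyed by both full name and 3-letter abbreviation.
MONTHS = {}
for i in range(1, 13):
    full = calendar.month_name[i].lower()          # "january"
    MONTHS[full] = full
    MONTHS[calendar.month_abbr[i].lower()] = full  # "jan" -> "january"

def normalize_item(item):
    """Lowercase/expand the month and coerce the count to float, as in actual_normalized."""
    month = str(item["month"]).lower()
    return {"month": MONTHS.get(month, month), "count": float(item["count"])}

print(normalize_item({"month": "Jan", "count": 12}))   # {'month': 'january', 'count': 12.0}
print(normalize_item({"month": "March", "count": 5}))  # {'month': 'march', 'count': 5.0}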

Part 2: Run and Evaluate an Agent

Now that you understand evaluation, let's run an agent and evaluate it. We'll complete the following task:

{
  "task_id": 44,
  "intent": "Open my todos page",
  "sites": ["gitlab"]
  ...
}

We'll use a special "human agent" that opens a browser and hands control to you to complete this simple navigation task (requires clicking on a single menu item).

Why not use a real AI agent implementation?

The goal of this exercise is to walk through how to use the benchmark in a straightforward way, without additional complexity. By stepping through the process manually, you'll understand exactly what agents need to produce and how evaluation works.

1. Set Up the GitLab Environment

First, you need a GitLab instance to work with. Choose one option:

What is the Demo GitLab?

This is a lightweight, bare-bones GitLab Docker image instead of the full 100 GB+ GitLab instance from the original WebArena. For simple navigation tasks like "Check out my todos", this smaller image is perfectly sufficient and much faster to download and run on your laptop! However, this task is not part of our hard subset since it only requires basic navigation.

Start the demo GitLab instance using Docker:

uv run invoke -r examples demo-gitlab-start

The GitLab instance takes 2-3 minutes to fully boot up. Wait until the command completes and shows the container status as 'running' before proceeding.

Change the default port (8012)

To use a different port, add the --port flag:

uv run invoke -r examples demo-gitlab-start --port=8080
Then update examples/configs/config.demo.json to match your port.

We'll use examples/configs/config.demo.json:

{
  "environments": {
    "__GITLAB__": {
      "urls": ["http://localhost:8012"],
      "credentials": {
        "username": "root",
        "password": "demopass"
      }
    }
  }
}

If you have your own GitLab instance running (from the original WebArena setup), update examples/configs/config.demo.json with your GitLab URL and credentials:

{
  "environments": {
    "__GITLAB__": {
      "urls": ["http://your-gitlab-url[:port]"],
      "credentials": {
        "username": "your-username",
        "password": "your-password"
      }
    }
  }
}

2. Export Task Data

Export the task information that the agent needs:

webarena-verified agent-input-get \
  --task-ids 44 \
  --config examples/configs/config.demo.json \
  --output output/tasks.demo.json

This exports only the fields that the agent needs to perform the task (intent, start_urls) plus identifying metadata (task_id, intent_template_id, and sites). Since the --config argument is provided, URL templates like __GITLAB__ are rendered to actual URLs (e.g., http://localhost:8012).

New in WebArena-Verified: Agent Runners Do Not Depend on the Benchmark's Libraries

Why This Matters:

  • Language & Framework Freedom: Your agent implementation can use any programming language (Python, JavaScript, Go, etc.) or framework - no dependency on the benchmark's libraries
  • Independent Versioning: Use any version of Playwright, Selenium, or other browser automation tools without conflicts with the benchmark
  • Lightweight Integration: Agents only need to read JSON task files and produce standard output formats (JSON response + HAR trace)
  • Alternative Approach: While we use agent-input-get CLI here to export tasks, you can also call WebArena-Verified's Python API directly within your agent code if you prefer programmatic access
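
As a sketch of how lightweight that integration can be, the loop below reads the exported task file; it assumes the export is a JSON array of task objects shaped like the task 44 entry shown in the next step:

import json
from pathlib import Path

# Assumed layout: a JSON array of task objects with task_id, intent, and start_urls.
tasks = json.loads(Path("output/tasks.demo.json").read_text())

for task in tasks:
    out_dir = Path("output/demo-run") / str(task["task_id"])
    out_dir.mkdir(parents=True, exist_ok=True)
    print(f"Task {task['task_id']}: {task['intent']} -> start at {task['start_urls']}")
    # A custom agent would take over here and write agent_response.json
    # and network.har into out_dir.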

3. Your Turn: Complete the Task

Now let's run the human agent for Task ID 44 (from the output/tasks.demo.json file we generated earlier):

{
  "sites": ["gitlab"],
  "task_id": 44,
  "intent_template_id": 303,
  "start_urls": ["http://localhost:8012"],
  "intent": "Open my todos page"
}
uv run python examples/agents/human/agent.py \
  --tasks-file output/tasks.demo.json \
  --task_id 44 \
  --task_output_dir output/demo-run/44 \
  --config examples/configs/config.demo.json

What happens next:

  1. The agent script opens a browser window and navigates to GitLab (login is handled automatically)
  2. Now it's your turn! Navigate to the todos page by clicking "To-Do List" in the left sidebar, then close the browser window
  3. Once the browser is closed, the agent prompts you in the terminal with a short questionnaire to build the agent response
  4. The agent writes the response to output/demo-run/44/agent_response.json and the network trace to output/demo-run/44/network.har
Example: Agent Response Questionnaire Output
==============================================================
Browser closed. Generating the agent response questionnaire...
==============================================================

------------------------------------------------------------
Select the performed operation:

1. RETRIEVE
2. MUTATE
3. NAVIGATE

Enter choice number > 3

------------------------------------------------------------
Select the task status:

1. SUCCESS
2. ACTION_NOT_ALLOWED_ERROR
3. PERMISSION_DENIED_ERROR
4. NOT_FOUND_ERROR
5. DATA_VALIDATION_ERROR
6. UNKNOWN_ERROR

Enter choice number > 1


------------------------------------------------------------
Proposed agent response:
{
  "task_type": "NAVIGATE",
  "status": "SUCCESS",
  "retrieved_data": null,
  "error_details": null
}
------------------------------------------------------------
Confirm and save this response?
  1. Yes
  2. No
> 1
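
The human agent produced network.har for you. If you later write your own Playwright-based agent, one way to capture the same trace is Playwright's built-in HAR recording; a minimal sketch (the paths and the navigation step are illustrative):

from pathlib import Path
from playwright.sync_api import sync_playwright

out_dir = Path("output/demo-run/44")
out_dir.mkdir(parents=True, exist_ok=True)

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    # record_har_path makes Playwright log all browser traffic to a HAR file.
    context = browser.new_context(record_har_path=out_dir / "network.har")
    page = context.new_page()
    page.goto("http://localhost:8012")  # demo GitLab instance from the config above
    # ... your agent's actions go here ...
    context.close()  # the HAR file is written when the context closes
    browser.close()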

4. Evaluate Your Run

Now let's evaluate your performance:

webarena-verified eval-tasks \
  --config examples/configs/config.demo.json \
  --task-ids 44 \
  --output-dir output/demo-run

This creates output/demo-run/44/eval_result.json with your evaluation results.

5. Review the Results

Check output/demo-run/44/eval_result.json - it will have the same structure as Part 1.

What got evaluated:

  • AgentResponseEvaluator: Validated your response structure (task_type, status, etc.)
  • NetworkEventEvaluator: Checked that you navigated to the correct URL (/dashboard/todos)
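
Conceptually, that second check only needs to find evidence in network.har that the todos page was requested successfully. A simplified illustration of this kind of check (not the actual evaluator code):

import json

with open("output/demo-run/44/network.har") as f:
    entries = json.load(f)["log"]["entries"]

navigated_to_todos = any(
    entry["request"]["method"] == "GET"
    and "/dashboard/todos" in entry["request"]["url"]
    and entry["response"]["status"] == 200
    for entry in entries
)
print("visited todos page:", navigated_to_todos)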

If you successfully navigated to the todos page and reported task_type: "NAVIGATE" with status: "SUCCESS", you should see:

{
  "status": "success",
  "score": 1.0,
  "evaluators_results": [
    {
      "evaluator_name": "AgentResponseEvaluator",
      "status": "success",
      "score": 1.0,
      ...
    },
    {
      "evaluator_name": "NetworkEventEvaluator",
      "status": "success",
      "score": 1.0,
      ...
    }
  ],
  ...
}

If you used the demo GitLab instance, you can now stop it:

uv run invoke -r examples demo-gitlab-stop

Where to Next?

Happy benchmarking!