Welcome to WebArena-Verified¶
WebArena-Verified is the reproducible release of the WebArena benchmark: the original containerized sites remain intact, but every task, reference answer, and evaluator has been re-audited to eliminate brittle string matching and ambiguous success criteria. Deterministic, JSON-based scoring and network-event-based checks let you measure web agents offline.
Key Contributions:
- Fully audited benchmark: Every task, reference answer, and evaluator has been manually reviewed and corrected
- Offline evaluation: Evaluate agent runs without requiring live web environments using network trace replay
- Deterministic scoring: Removed LLM-as-a-judge evaluation and substring matching in favor of type-aware normalization and structural comparison
- WebArena-Verified Hard subset: A difficulty-prioritized 258-task subset for cost-effective evaluation
The following quick start demonstrates these capabilities in practice. You'll bring the toolkit up, validate a single task end-to-end, and then branch into batch evaluation or custom integrations when you're ready.
This quick start is divided into two parts:
- Part 1 (~5 minutes): Understand evaluation by evaluating a pre-run agent log
- Part 2 (~10 minutes): Run an agent and evaluate it
Setup¶
Prerequisites
- Python 3.11+
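The package ships the webarena-verified CLI used throughout this guide. The exact install steps depend on how you obtained WebArena-Verified; the commands below are only a minimal sketch assuming a source checkout managed with uv (the editable-install layout is an assumption, not the official instructions):

# A minimal setup sketch, assuming a source checkout and uv (adjust to the official install instructions).
uv venv
source .venv/bin/activate
uv pip install -e .
# If the install worked, the CLI used in the rest of this guide should now be available:
webarena-verified --help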
Part 1: Evaluate a Pre-Run Task¶
Before running an agent, let's evaluate an existing agent log to understand how WebArena-Verified works. We'll use the following task that already has output in examples/agent_logs/demo/108/:
{
"task_id": 108,
"intent": "Get the monthly count of successful orders 01/2023-05/2023",
"sites": ["shopping_admin"]
...
}
New in WebArena-Verified: Offline Evaluation.
Why This Matters:
- Evaluate agent runs without live web environments
- Reevaluate past runs at any time
- Compare different agents transparently with reproducible benchmarking
1. What's in a Task Log?¶
The task log contains two key artifacts:
Agent Response¶
Agents are required to return a valid JSON response like the following:
{
"task_type": "RETRIEVE",
"status": "SUCCESS",
"retrieved_data": [
{ "month": "Jan", "count": 12 },
{ "month": "Feb", "count": 7 },
{ "month": "March", "count": 5 },
{ "month": "April", "count": 9 },
{ "month": "May", "count": 5 }
],
"error_details": null
}
Field Descriptions
- task_type (required): Type of work performed - RETRIEVE, MUTATE, or NAVIGATE
- status (required): Task outcome - SUCCESS or error codes (NOT_FOUND_ERROR, PERMISSION_DENIED_ERROR, DATA_VALIDATION_ERROR, etc.)
- retrieved_data: Array of items for RETRIEVE operations (otherwise null)
- error_details: Null for SUCCESS, otherwise explains what failed and why
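For contrast, a hypothetical failure response using the same schema might look like this (the scenario and values are illustrative, built only from the fields and status codes listed above):

{
  "task_type": "MUTATE",
  "status": "NOT_FOUND_ERROR",
  "retrieved_data": null,
  "error_details": "The order referenced in the task does not exist in the admin panel"
}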
New in WebArena-Verified: Structured Agent Response
Why This Matters:
- Robust Evaluation: Modern LLMs rarely struggle with generating valid JSON, enabling more reliable evaluation with explicit fields:
- task_type: Requires agents to explicitly state what operation they performed, revealing whether they truly understood the task
- status: Allows various error status codes instead of catch-all "N/A" responses for unachievable tasks
- retrieved_data: Structured format reduces false negatives due to parsing issues
- Reduces False Positives: By validating both the operation type and the outcome, we ensure agents actually completed the intended task.
Example: Navigation vs. Retrieval
Suppose a task requires retrieving data, but the agent misunderstands and only navigates: it reaches the correct page yet never retrieves the data.
- Original WebArena: Pass ✓ (only checked if agent reached the correct page)
- WebArena-Verified: Fail ✗ (verifies both page navigation and that task_type matches RETRIEVE)
(Or picture asking a coding agent to "review this code" and watching it start rewriting everything while you frantically mash Ctrl+C! 😱)
Network Trace¶
Captures all network activity between the browser frontend and the backend in HAR (HTTP Archive) format - a standard format widely used for debugging and analyzing web traffic. This records what the agent actually did - including page navigations, data retrievals, and modifications. Each network event includes the URL, HTTP method, and response status used by the evaluator:
{
"request": {
"method": "GET",
"url": "http://192.168.1.35:7780/admin/customer/index/",
...
},
"response": {
"status": 200,
...
},
...
}
New in WebArena-Verified: Network Event Based Evaluation
Why This Matters:
- Enables Offline Evaluation: Network traces can be evaluated without live web environments - this is the critical piece that makes reevaluation possible
- Avoids Brittle Locators: No reliance on DOM selectors or page structure - allows for easy website updates
- Single Evaluation Method: Works uniformly across all websites (GitLab, e-commerce, forums, etc.)
See Network Event Based Evaluation for details.
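As a rough, self-contained illustration of the idea (this is not the benchmark's NetworkEventEvaluator, just a standard-library sketch; the HAR path and the /dashboard/todos URL come from the Part 2 walkthrough later in this guide):

import json

# Load the HAR trace written by an agent run (path from the Part 2 example).
with open("output/demo-run/44/network.har") as f:
    har = json.load(f)

# HAR files store each request/response pair under log.entries.
entries = har["log"]["entries"]

# Did the agent ever successfully load the todos page?
visited_todos = any(
    "/dashboard/todos" in entry["request"]["url"] and entry["response"]["status"] == 200
    for entry in entries
)
print("Visited /dashboard/todos:", visited_todos)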
2. Run the evaluator¶
webarena-verified eval-tasks \
--task-ids 108 \
--output-dir examples/agent_logs/demo \
--config examples/configs/config.demo.json
Troubleshooting
- If the webarena-verified command is not available, make sure you have activated the virtual environment correctly. See the Setup section.
This creates an eval_result.json file in the task directory (examples/agent_logs/demo/108/).
3. Examine the evaluation result¶
The evaluation result is a structured JSON document that shows:
- The overall task status and score - Did the agent pass or fail?
- Individual evaluator results - Each evaluator (e.g., AgentResponseEvaluator) reports its findings
- Raw and normalized values - We show both actual (raw agent output) and actual_normalized (after type-aware normalization) to help you catch normalization issues and understand how values are being compared
- Reproducibility checksums - We track evaluation code and task dataset checksums to ensure consistent, reproducible evaluations across different runs and environments
The annotated JSON below explains each field; the numbered notes after the JSON correspond to the (1)! through (13)! markers:
{
"task_id": 108,
"intent_template_id": 270,
"sites": [
"shopping_admin"
],
"task_revision": 2, // (1)!
"status": "success", // (2)!
"score": 1.0, // (3)!
"evaluators_results": [ // (4)!
{
"evaluator_name": "AgentResponseEvaluator", // (5)!
"status": "success",
"score": 1.0,
"actual": { // (6)!
"task_type": "RETRIEVE",
"status": "SUCCESS",
"retrieved_data": [
{ "month": "Jan", "count": 12 },
{ "month": "Feb", "count": 7 },
{ "month": "March", "count": 5 },
{ "month": "April", "count": 9 },
{ "month": "May", "count": 5 }
],
"error_details": null
},
"actual_normalized": { // (7)!
"task_type": "retrieve",
"status": "success",
"retrieved_data": [
{ "month": "january", "count": 12.0 },
{ "month": "february", "count": 7.0 },
{ "month": "march", "count": 5.0 },
{ "month": "april", "count": 9.0 },
{ "month": "may", "count": 5.0 }
]
},
"expected": { // (8)!
"task_type": "retrieve",
"status": "success",
"retrieved_data": [
{ "month": "january", "count": 12.0 },
{ "month": "february", "count": 7.0 },
{ "month": "march", "count": 5.0 },
{ "month": "april", "count": 9.0 },
{ "month": "may", "count": 5.0 }
]
},
"assertions": null, // (9)!
"error_msg": null // (10)!
}
],
"error_msg": null,
"webarena_verified_version": "2.0.0", // (11)!
"webarena_verified_evaluator_checksum": "a9e6da7e172f7ba62e388d445ccc974cc5df02529b833738051c54879319d4f8", // (12)!
"webarena_verified_data_checksum": "33048e32d7349835e3ea656348e59ba4ca43d2068cce24d3772134e402ef8f4b" // (13)!
}
1. Task revision number - incremented when the task definition changes
2. Overall evaluation status - success when all evaluators pass
3. Overall score - 1.0 = complete success, 0.0 = failure
4. Results from each evaluator that ran on this task
5. Name of the evaluator - AgentResponseEvaluator validates structured agent responses
6. Raw agent response before normalization - note the mixed month formats ("Jan", "Feb", "March")
7. Agent response after type-aware normalization - all months converted to lowercase ("january", "february", "march"). Notice how actual_normalized matches expected even though the raw formats were mixed.
8. Expected values from the task definition - what the agent should return after normalization
9. List of assertion failures - null means all checks passed
10. Error message when the evaluation system itself encounters an error (not agent failures). When error_msg is not null, status is ERROR.
11. WebArena-Verified version used for this evaluation
12. Checksum of the evaluator code - ensures the evaluation logic hasn't changed
13. Checksum of the task dataset - ensures task definitions are consistent
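If you want to inspect results programmatically (for example, to aggregate scores across many tasks), eval_result.json can be read with a few lines of standard-library Python; this is a convenience sketch, not part of the benchmark's API:

import json

# Path to the evaluation result created in step 2.
with open("examples/agent_logs/demo/108/eval_result.json") as f:
    result = json.load(f)

print(f"Task {result['task_id']}: {result['status']} (score={result['score']})")
for ev in result["evaluators_results"]:
    # Each evaluator reports its own status and score alongside the overall result.
    print(f"  {ev['evaluator_name']}: {ev['status']} (score={ev['score']})")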
New in WebArena-Verified: Type-Aware Normalization
Why This Matters:
- Handles Common Data Types: Automatically normalizes dates, currency, URLs, coordinates, and more without requiring LLM-based evaluation
- Format-Agnostic Comparison: In this example, month names are normalized regardless of format ("Jan" vs "January" vs "january"), ensuring reliable comparison
- Deterministic & Cost-Effective: Eliminates the unpredictability and cost of LLM evaluators
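To make the idea concrete, here is a purely illustrative sketch of what normalizing the month values in this task could look like; the benchmark's actual normalizer covers many more types (dates, currency, URLs, coordinates) and is not reproduced here:

import calendar

# Map abbreviated and full month spellings ("Jan", "January", "january") to one canonical lowercase form.
_MONTHS = {abbr.lower(): full.lower() for abbr, full in zip(calendar.month_abbr[1:], calendar.month_name[1:])}
_MONTHS.update({full.lower(): full.lower() for full in calendar.month_name[1:]})

def normalize_month(value: str) -> str:
    # Illustrative only: unknown values fall back to a lowercased, trimmed string.
    return _MONTHS.get(value.strip().lower(), value.strip().lower())

print(normalize_month("Jan"))    # january
print(normalize_month("March"))  # march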
Part 2: Run and Evaluate an Agent¶
Now that you understand evaluation, let's run an agent and evaluate it. We'll complete Task 44 ("Open my todos page"), shown in full in step 3 below.
We'll use a special "human agent" that opens a browser and hands control to you to complete this simple navigation task (requires clicking on a single menu item).
Why not use a real AI agent implementation?
The goal of this exercise is to walk through how to use the benchmark in a straightforward way, without additional complexity. By stepping through the process manually, you'll understand exactly what agents need to produce and how evaluation works.
1. Setup GitLab Environment¶
First, you need a GitLab instance to work with. Choose one option: the lightweight demo GitLab described below, or your own GitLab instance from the original WebArena setup.
What is the Demo GitLab?
This is a lightweight, bare-bones GitLab Docker image instead of the full 100 GB+ GitLab instance from the original WebArena. For simple navigation tasks like "Check out my todos", this smaller image is perfectly sufficient and much faster to download and run on your laptop! However, this task is not part of our hard subset since it only requires basic navigation.
Start the demo GitLab instance using Docker:
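WebArena-Verified's own launcher (and the exact image name) may differ; the lines below are only a generic placeholder showing what running a container on the default port 8012 looks like - substitute the command and image documented for the demo GitLab:

# Generic placeholder - replace <demo-gitlab-image> (and the container-side port) with the documented values.
docker run -d --name gitlab-demo -p 8012:80 <demo-gitlab-image>
# Follow the logs until GitLab reports it is ready (typically 2-3 minutes).
docker logs -f gitlab-demo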
The GitLab instance takes 2-3 minutes to fully boot up. Wait until the command completes and shows the container status as 'running' before proceeding.
Change the default port (8012)
To use a different port, add the --port flag when starting the demo instance, then update examples/configs/config.demo.json to match your port.
If you're using the demo GitLab instance, the provided examples/configs/config.demo.json works as-is. If you have your own GitLab instance running (from the original WebArena setup), update examples/configs/config.demo.json with your GitLab URL and credentials:
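The full schema is covered in the Configuration Reference; the snippet below is only a hypothetical illustration of the kind of fields you would adjust (the field names are assumptions, while the URL and port come from this guide):

{
  "gitlab": {
    "url": "http://localhost:8012",
    "username": "<your-gitlab-username>",
    "password": "<your-gitlab-password>"
  }
}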
2. Export Task Data¶
Export the task information that the agent needs:
webarena-verified agent-input-get \
--task-ids 44 \
--config examples/configs/config.demo.json \
--output output/tasks.demo.json
This exports only the fields that the agent needs to perform the task (intent, start_urls) and the IDs (task_id, intent_template_id, and sites). Since the --config argument is provided, URL templates like __GITLAB__ are rendered to actual URLs (e.g., http://localhost:8012).
New in WebArena-Verified: Agent runner does not depend on benchmark dependencies
Why This Matters:
- Language & Framework Freedom: Your agent implementation can use any programming language (Python, JavaScript, Go, etc.) or framework - no dependency on the benchmark's libraries
- Independent Versioning: Use any version of Playwright, Selenium, or other browser automation tools without conflicts with the benchmark
- Lightweight Integration: Agents only need to read JSON task files and produce standard output formats (JSON response + HAR trace)
- Alternative Approach: While we use the agent-input-get CLI here to export tasks, you can also call WebArena-Verified's Python API directly within your agent code if you prefer programmatic access
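To illustrate the lightweight-integration point above, here is a minimal agent-side sketch that reads the exported tasks file from step 2 (it assumes the export is a JSON array of task objects with the fields shown in this guide; adjust if your export format differs):

import json

# Load the tasks exported by webarena-verified agent-input-get in step 2.
with open("output/tasks.demo.json") as f:
    tasks = json.load(f)  # assumed to be a list of task objects

# Pick the task to run; each entry carries the intent and start URLs the agent needs.
task = next(t for t in tasks if t["task_id"] == 44)
print(task["intent"])      # "Open my todos page"
print(task["start_urls"])  # ["http://localhost:8012"]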
3. Your Turn: Complete the Task¶
Now let's run the human agent for Task ID 44 (from the output/tasks.demo.json we generated earlier):
{
"sites": ["gitlab"],
"task_id": 44,
"intent_template_id": 303,
"start_urls": ["http://localhost:8012"],
"intent": "Open my todos page"
}
uv run python examples/agents/human/agent.py \
--tasks-file output/tasks.demo.json \
--task_id 44 \
--task_output_dir output/demo-run/44 \
--config examples/configs/config.demo.json
What happens next:
- The agent script opens a browser window and navigates to GitLab (login is handled automatically)
- Now it's your turn! Navigate to the todos page by clicking "To-Do List" in the left sidebar, then close the browser window
- The agent will prompt you in the terminal to generate the agent response saved to agent_response.json
- The agent writes its response and network event logs to output/demo-run/44/agent_response.json and output/demo-run/44/network.har
Example: Agent Response Questionnaire Output
==============================================================
Browser closed. Generating the agent response questionnaire...
==============================================================
------------------------------------------------------------
Select the performed operation:
1. RETRIEVE
2. MUTATE
3. NAVIGATE
Enter choice number > 3
------------------------------------------------------------
Select the task status:
1. SUCCESS
2. ACTION_NOT_ALLOWED_ERROR
3. PERMISSION_DENIED_ERROR
4. NOT_FOUND_ERROR
5. DATA_VALIDATION_ERROR
6. UNKNOWN_ERROR
Enter choice number > 1
------------------------------------------------------------
Proposed agent response:
{
"task_type": "NAVIGATE",
"status": "SUCCESS",
"retrieved_data": null,
"error_details": null
}
------------------------------------------------------------
Confirm and save this response?
1. Yes
2. No
> 1
4. Evaluate Your Run¶
Now let's evaluate your performance:
webarena-verified eval-tasks \
--config examples/configs/config.demo.json \
--task-ids 44 \
--output-dir output/demo-run
This creates output/demo-run/44/eval_result.json with your evaluation results.
5. Review the Results¶
Check output/demo-run/44/eval_result.json - it will have the same structure as Part 1.
What got evaluated:
- AgentResponseEvaluator: Validated your response structure (task_type, status, etc.)
- NetworkEventEvaluator: Checked that you navigated to the correct URL (/dashboard/todos)
If you successfully navigated to the todos page and reported task_type: "NAVIGATE" with status: "SUCCESS", you should see:
{
"status": "success",
"score": 1.0,
"evaluators_results": [
{
"evaluator_name": "AgentResponseEvaluator",
"status": "success",
"score": 1.0,
...
},
{
"evaluator_name": "NetworkEventEvaluator",
"status": "success",
"score": 1.0,
...
}
],
...
}
If you used the demo GitLab instance, you can now stop it:
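If you started the container with the generic placeholder from step 1 (the container name gitlab-demo is an assumption; substitute whatever launcher or name you actually used):

# Stop and remove the demo container (name follows the earlier placeholder sketch).
docker stop gitlab-demo
docker rm gitlab-demo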
Where to Next?¶
- Usage Guide - Agent workflow, batch evaluation, CLI filters, programmatic APIs
- Configuration Reference - All config options
- Evaluation Guide - Deep dive into evaluators and scoring
- API Reference - Type models and classes
Happy benchmarking!