Usage¶
This guide walks you through using WebArena-Verified to evaluate web agents. You'll learn how to get task data, run your agent, and evaluate the results using either the CLI or programmatic API.
Prerequisites¶
- Docker or Python 3.11+ (Python only required when installing as a library)
- WebArena-Verified installed (see Installation)
- Configuration file set up (see Configuration)
Installation¶
The CLI can be used three ways: with uvx (requires only uv, no installation), with the Docker image, or by installing the package into your own Python environment.
What is uvx?
uvx runs Python CLI tools in isolated, ephemeral environments without installing them. It doesn't pollute your environment and automatically handles dependencies and cleanup.
With uvx there is no installation step. Verify the CLI is working:
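For example, a quick smoke test (webarena-verified is the same package name used by the uvx commands throughout this guide):

uvx webarena-verified --help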
To use Docker instead, pull the image and verify the installation:
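The image name below is taken from the eval-tasks examples later in this guide; those examples pass CLI arguments directly to docker run, so --help should pass straight through to the CLI:

docker pull ghcr.io/servicenow/webarena-verified:latest
docker run --rm ghcr.io/servicenow/webarena-verified:latest --help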
For development or contributing, see the Contributing Guide.
Step 1: Set Up Your Configuration¶
Create a configuration file that specifies your environment URLs and credentials:
{
  "environments": {
    "__GITLAB__": {
      "urls": ["http://localhost:8012"],
      "credentials": {"username": "root", "password": "demopass"}
    },
    "__SHOPPING__": {
      "urls": ["http://localhost:7770"]
    }
  }
}
See Configuration for complete details on all configuration options.
Step 2: Get Task Data¶
Export task data using the agent-input-get command:
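A minimal invocation sketch: the --config flag is the one described in the note below, while the output flag name here is an assumption to verify against the command's --help:

uvx webarena-verified agent-input-get \
  --config config.json \
  --output agent_input.json  # assumed flag name; check agent-input-get --help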
The output file contains task metadata your agent needs:
[
  {
    "task_id": 1,
    "intent_template_id": 100,
    "sites": ["shopping"],
    "start_urls": ["http://localhost:7770/..."],
    "intent": "What is the price of..."
  }
]
URL Rendering
The --config flag is required to render template URLs (like __SHOPPING__) into actual URLs that your agent can navigate to. With the configuration above, for example, a stored URL beginning with __SHOPPING__ is rendered with http://localhost:7770 in its place.
Step 3: Run Your Agent¶
Your agent should:
- Load task data from the JSON file produced in Step 2
- For each task:
  - Navigate to the provided start_urls
  - Execute the task based on the intent
  - Save outputs to the expected location
Required output files per task:
{output_dir}/
└── {task_id}/
    ├── agent_response.json  # Agent's response (see format below)
    └── network.har          # Network trace in HAR format
Agent response format:
{
  "task_type": "RETRIEVE",
  "status": "SUCCESS",
  "retrieved_data": ["extracted data here"],
  "error_details": null
}
| Field | Type | Description |
|---|---|---|
| task_type | string | One of: RETRIEVE, MUTATE, NAVIGATE |
| status | string | One of: SUCCESS, ACTION_NOT_ALLOWED_ERROR, PERMISSION_DENIED_ERROR, NOT_FOUND_ERROR, DATA_VALIDATION_ERROR, UNKNOWN_ERROR |
| retrieved_data | array or null | Required for RETRIEVE tasks; list of extracted values |
| error_details | string or null | Optional error description |
Reference Implementation
See the human agent example in examples/agents/human/ for a complete reference implementation that demonstrates loading task data, browser automation with Playwright, and producing properly formatted output files.
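The snippet below sketches the output-writing side of that loop in plain Python. Browser automation and HAR capture are omitted; the hardcoded response mirrors the sample above and would come from your agent in practice, and agent_input.json is the (assumed) file name from the Step 2 sketch:

import json
from pathlib import Path

tasks = json.loads(Path("agent_input.json").read_text())  # exported in Step 2
output_dir = Path("output")

for task in tasks:
    task_dir = output_dir / str(task["task_id"])
    task_dir.mkdir(parents=True, exist_ok=True)

    # ... drive the browser to task["start_urls"][0], act on task["intent"],
    # and save the captured traffic to task_dir / "network.har" ...

    response = {
        "task_type": "RETRIEVE",  # one of RETRIEVE, MUTATE, NAVIGATE
        "status": "SUCCESS",
        "retrieved_data": ["extracted data here"],
        "error_details": None,
    }
    (task_dir / "agent_response.json").write_text(json.dumps(response, indent=2))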
Step 4: Evaluate Results¶
Use the eval-tasks command to score your agent's outputs.
Basic Evaluation¶
Score one or more runs. When no filters are provided, the CLI discovers every task directory under --output-dir that contains the required files.
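For example, to score everything under output/ using the config from Step 1 (the Docker and installed-package forms follow the same pattern as the filtering examples below):

uvx webarena-verified eval-tasks \
  --config config.json \
  --output-dir output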
Filtering Tasks¶
You can filter which tasks to evaluate. The examples below show each filter with uvx first, then the Docker image, then an installed package:
# Specific task IDs
uvx webarena-verified eval-tasks \
--config config.json \
--output-dir output \
--task-ids 1,2,3
# Single task
uvx webarena-verified eval-tasks \
--config config.json \
--output-dir output \
--task-ids 42
# By site
uvx webarena-verified eval-tasks \
--config config.json \
--output-dir output \
--sites shopping
# By task type
uvx webarena-verified eval-tasks \
--config config.json \
--output-dir output \
--task-type mutate
# By template ID
uvx webarena-verified eval-tasks \
--config config.json \
--output-dir output \
--template-id 5
# Combined filters
uvx webarena-verified eval-tasks \
--config config.json \
--output-dir output \
--sites shopping,reddit \
--task-type mutate
# Dry run (no scoring)
uvx webarena-verified eval-tasks \
--config config.json \
--output-dir output \
--sites reddit \
--dry-run
The same filters work with the Docker image; mount your output directory and config file into the container:
# Specific task IDs
docker run --rm \
-v /path/to/output:/data \
-v /path/to/config.json:/config.json \
ghcr.io/servicenow/webarena-verified:latest \
eval-tasks --config /config.json --output-dir /data --task-ids 1,2,3
# Single task
docker run --rm \
-v /path/to/output:/data \
-v /path/to/config.json:/config.json \
ghcr.io/servicenow/webarena-verified:latest \
eval-tasks --config /config.json --output-dir /data --task-ids 42
# By site
docker run --rm \
-v /path/to/output:/data \
-v /path/to/config.json:/config.json \
ghcr.io/servicenow/webarena-verified:latest \
eval-tasks --config /config.json --output-dir /data --sites shopping
# By task type
docker run --rm \
-v /path/to/output:/data \
-v /path/to/config.json:/config.json \
ghcr.io/servicenow/webarena-verified:latest \
eval-tasks --config /config.json --output-dir /data --task-type mutate
# By template ID
docker run --rm \
-v /path/to/output:/data \
-v /path/to/config.json:/config.json \
ghcr.io/servicenow/webarena-verified:latest \
eval-tasks --config /config.json --output-dir /data --template-id 5
# Combined filters
docker run --rm \
-v /path/to/output:/data \
-v /path/to/config.json:/config.json \
ghcr.io/servicenow/webarena-verified:latest \
eval-tasks --config /config.json --output-dir /data --sites shopping,reddit --task-type mutate
# Dry run (no scoring)
docker run --rm \
-v /path/to/output:/data \
-v /path/to/config.json:/config.json \
ghcr.io/servicenow/webarena-verified:latest \
eval-tasks --config /config.json --output-dir /data --sites reddit --dry-run
Or, with the package installed in your Python environment:
# Specific task IDs
webarena-verified eval-tasks \
--config config.json \
--output-dir output \
--task-ids 1,2,3
# Single task
webarena-verified eval-tasks \
--config config.json \
--output-dir output \
--task-ids 42
# By site
webarena-verified eval-tasks \
--config config.json \
--output-dir output \
--sites shopping
# By task type
webarena-verified eval-tasks \
--config config.json \
--output-dir output \
--task-type mutate
# By template ID
webarena-verified eval-tasks \
--config config.json \
--output-dir output \
--template-id 5
# Combined filters
webarena-verified eval-tasks \
--config config.json \
--output-dir output \
--sites shopping,reddit \
--task-type mutate
# Dry run (no scoring)
webarena-verified eval-tasks \
--config config.json \
--output-dir output \
--sites reddit \
--dry-run
Available filter flags:
| Flag | Description |
|---|---|
| --task-ids | Comma-separated task IDs (for example 1,2,3 or a single 42). |
| --sites | Comma-separated site names (shopping, reddit, gitlab, map, etc.). |
| --task-type | Task type (retrieve, mutate, or navigate). |
| --template-id | Filter by intent_template_id. |
| --dry-run | List matching tasks without scoring them. |
Understanding Evaluation Output¶
The CLI writes evaluation artifacts alongside your agent outputs:
output/
├── {task_id}/
│   ├── agent_response.json   # Response file written by your agent
│   ├── network.har           # Network trace captured during the run (HAR format)
│   └── eval_result.json      # Evaluation result written by the CLI
└── eval_log_{timestamp}.txt  # Batch evaluation log
See Evaluation Results for details on the evaluation output format.
Using the Programmatic API¶
If you prefer to integrate WebArena-Verified directly into your Python code, you can use the programmatic API.
Step 1: Initialize WebArenaVerified¶
Create a WebArenaVerified instance with your environment configuration:
from pathlib import Path
from webarena_verified.api import WebArenaVerified
from webarena_verified.types.config import WebArenaVerifiedConfig

# Initialize with configuration
config = WebArenaVerifiedConfig(
    environments={
        "__GITLAB__": {
            "urls": ["http://localhost:8012"],
            "credentials": {"username": "root", "password": "demopass"}
        }
    }
)
wa = WebArenaVerified(config=config)
Step 2: Get Task Data¶
Retrieve task information programmatically:
# Get a single task
task = wa.get_task(42)
print(f"Task intent: {task.intent}")
print(f"Start URLs: {task.start_urls}")
# Get multiple tasks
tasks = [wa.get_task(task_id) for task_id in [1, 2, 3]]
Step 3: Evaluate Agent Output¶
Once you have your agent's output, evaluate it against the task definition. You can pass agent responses as file paths or construct them directly in code:
import json
from pathlib import Path

# Evaluate a task with directly constructed content
result = wa.evaluate_task(
    task_id=44,
    agent_response={
        "task_type": "NAVIGATE",
        "status": "SUCCESS",
        "retrieved_data": None
    },
    network_trace=json.loads(Path("output/44/network.har").read_text())
)
print(f"Score: {result.score}, Status: {result.status}")
See Also¶
- Configuration - Complete configuration reference and options
- Subset Manager - Work with task subsets for focused evaluation