Usage

This guide walks you through using WebArena-Verified to evaluate web agents. You'll learn how to get task data, run your agent, and evaluate the results using either the CLI or programmatic API.

Prerequisites

  • Docker or Python 3.11+ (Python is only required when installing as a library)
  • WebArena-Verified installed (see Installation)
  • Configuration file set up (see Configuration)

Installation

You can run the CLI with uvx (no installation), Docker, or a standard Python package install.

Option 1: uvx

Prerequisite: uv

What is uvx?

uvx runs Python CLI tools in isolated, ephemeral environments without installation. It doesn't pollute your environment and automatically handles dependencies and cleanup.

No installation needed! Verify the CLI is working:

uvx webarena-verified --help

Option 2: Docker

Pull the Docker image:

docker pull ghcr.io/servicenow/webarena-verified:latest

Verify the installation:

docker run --rm ghcr.io/servicenow/webarena-verified:latest --help

Option 3: Python package

Install with uv:

uv pip install webarena-verified

Or with pip:

pip install webarena-verified

Verify the installation:

webarena-verified --help

For development or contributing, see the Contributing Guide.

Step 1: Set Up Your Configuration

Create a configuration file that specifies your environment URLs and credentials:

{
  "environments": {
    "__GITLAB__": {
      "urls": ["http://localhost:8012"],
      "credentials": {"username": "root", "password": "demopass"}
    },
    "__SHOPPING__": {
      "urls": ["http://localhost:7770"]
    }
  }
}

See Configuration for complete details on all configuration options.
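
If you prefer to generate the file from code (for example, in a test harness), here is a minimal sketch using only the standard library; the file name config.json simply matches the CLI examples in the steps below:

import json

# Same structure as the JSON example above.
config = {
    "environments": {
        "__GITLAB__": {
            "urls": ["http://localhost:8012"],
            "credentials": {"username": "root", "password": "demopass"},
        },
        "__SHOPPING__": {"urls": ["http://localhost:7770"]},
    }
}

# Write the configuration where the CLI commands below expect it.
with open("config.json", "w") as f:
    json.dump(config, f, indent=2)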

Step 2: Get Task Data

Export task information that your agent needs using the agent-input-get command:

# All tasks
webarena-verified agent-input-get \
  --config config.json \
  --output tasks.json

# Specific task IDs
webarena-verified agent-input-get \
  --task-ids 1,2,3 \
  --config config.json \
  --output tasks.json

# By site
webarena-verified agent-input-get \
  --sites shopping \
  --config config.json \
  --output tasks.json

The output file contains task metadata your agent needs:

[
  {
    "task_id": 1,
    "intent_template_id": 100,
    "sites": ["shopping"],
    "start_urls": ["http://localhost:7770/..."],
    "intent": "What is the price of..."
  }
]

URL Rendering

The --config flag is required to render template URLs (like __SHOPPING__) into actual URLs that your agent can navigate to.
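
For example, with the Step 1 configuration, a start URL templated on __SHOPPING__ renders to one rooted at http://localhost:7770, as in the tasks.json excerpt above.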

Step 3: Run Your Agent

Your agent should:

  1. Load task data from the JSON file produced in Step 2
  2. For each task:
    • Navigate to the provided start_urls
    • Execute the task based on the intent
    • Save outputs to the expected location

Required output files per task:

{output_dir}/
└── {task_id}/
    ├── agent_response.json  # Agent's response (see format below)
    └── network.har          # Network trace in HAR format

Agent response format:

{
  "task_type": "RETRIEVE",
  "status": "SUCCESS",
  "retrieved_data": ["extracted data here"],
  "error_details": null
}

  • task_type (string): One of RETRIEVE, MUTATE, NAVIGATE.
  • status (string): One of SUCCESS, ACTION_NOT_ALLOWED_ERROR, PERMISSION_DENIED_ERROR, NOT_FOUND_ERROR, DATA_VALIDATION_ERROR, UNKNOWN_ERROR.
  • retrieved_data (array or null): Required for RETRIEVE tasks; list of extracted values.
  • error_details (string or null): Optional error description.
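
Putting the step together, here is a minimal sketch of the agent loop, assuming the tasks.json file produced in Step 2. run_agent is a hypothetical placeholder for your own browsing logic, and capturing network.har is left to your browser tooling (for example, Playwright's HAR recording):

import json
from pathlib import Path

def run_agent(task: dict) -> list[str] | None:
    """Hypothetical placeholder: navigate task["start_urls"], act on
    task["intent"], and return any extracted values (or None)."""
    raise NotImplementedError

tasks = json.loads(Path("tasks.json").read_text())

for task in tasks:
    task_dir = Path("output") / str(task["task_id"])
    task_dir.mkdir(parents=True, exist_ok=True)

    retrieved = run_agent(task)

    response = {
        "task_type": "RETRIEVE",      # or MUTATE / NAVIGATE, per the task
        "status": "SUCCESS",
        "retrieved_data": retrieved,  # None for non-RETRIEVE tasks
        "error_details": None,
    }
    (task_dir / "agent_response.json").write_text(json.dumps(response, indent=2))
    # Your browser tooling must also save task_dir / "network.har".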

Reference Implementation

See the human agent example in examples/agents/human/ for a complete reference implementation that demonstrates loading task data, browser automation with Playwright, and producing properly formatted output files.

Step 4: Evaluate Results

Use the eval-tasks command to score your agent's outputs:

Basic Evaluation

Score one or more runs. When no filters are provided, the CLI discovers every task directory under --output-dir that contains the required files.

# With uvx
uvx webarena-verified eval-tasks --output-dir output

# With Docker
docker run --rm \
  -v /path/to/output:/data \
  ghcr.io/servicenow/webarena-verified:latest \
  eval-tasks --output-dir /data

# With an installed CLI
webarena-verified eval-tasks --output-dir output

Filtering Tasks

You can filter which tasks to evaluate. The same filters work whether you invoke the CLI via uvx, Docker, or a local install; all three variants are shown below.

With uvx:

# Specific task IDs
uvx webarena-verified eval-tasks \
  --config config.json \
  --output-dir output \
  --task-ids 1,2,3

# Single task
uvx webarena-verified eval-tasks \
  --config config.json \
  --output-dir output \
  --task-ids 42

# By site
uvx webarena-verified eval-tasks \
  --config config.json \
  --output-dir output \
  --sites shopping

# By task type
uvx webarena-verified eval-tasks \
  --config config.json \
  --output-dir output \
  --task-type mutate

# By template ID
uvx webarena-verified eval-tasks \
  --config config.json \
  --output-dir output \
  --template-id 5

# Combined filters
uvx webarena-verified eval-tasks \
  --config config.json \
  --output-dir output \
  --sites shopping,reddit \
  --task-type mutate

# Dry run (no scoring)
uvx webarena-verified eval-tasks \
  --config config.json \
  --output-dir output \
  --sites reddit \
  --dry-run

With Docker:

# Specific task IDs
docker run --rm \
  -v /path/to/output:/data \
  -v /path/to/config.json:/config.json \
  ghcr.io/servicenow/webarena-verified:latest \
  eval-tasks --config /config.json --output-dir /data --task-ids 1,2,3

# Single task
docker run --rm \
  -v /path/to/output:/data \
  -v /path/to/config.json:/config.json \
  ghcr.io/servicenow/webarena-verified:latest \
  eval-tasks --config /config.json --output-dir /data --task-ids 42

# By site
docker run --rm \
  -v /path/to/output:/data \
  -v /path/to/config.json:/config.json \
  ghcr.io/servicenow/webarena-verified:latest \
  eval-tasks --config /config.json --output-dir /data --sites shopping

# By task type
docker run --rm \
  -v /path/to/output:/data \
  -v /path/to/config.json:/config.json \
  ghcr.io/servicenow/webarena-verified:latest \
  eval-tasks --config /config.json --output-dir /data --task-type mutate

# By template ID
docker run --rm \
  -v /path/to/output:/data \
  -v /path/to/config.json:/config.json \
  ghcr.io/servicenow/webarena-verified:latest \
  eval-tasks --config /config.json --output-dir /data --template-id 5

# Combined filters
docker run --rm \
  -v /path/to/output:/data \
  -v /path/to/config.json:/config.json \
  ghcr.io/servicenow/webarena-verified:latest \
  eval-tasks --config /config.json --output-dir /data --sites shopping,reddit --task-type mutate

# Dry run (no scoring)
docker run --rm \
  -v /path/to/output:/data \
  -v /path/to/config.json:/config.json \
  ghcr.io/servicenow/webarena-verified:latest \
  eval-tasks --config /config.json --output-dir /data --sites reddit --dry-run

With an installed CLI:

# Specific task IDs
webarena-verified eval-tasks \
  --config config.json \
  --output-dir output \
  --task-ids 1,2,3

# Single task
webarena-verified eval-tasks \
  --config config.json \
  --output-dir output \
  --task-ids 42

# By site
webarena-verified eval-tasks \
  --config config.json \
  --output-dir output \
  --sites shopping

# By task type
webarena-verified eval-tasks \
  --config config.json \
  --output-dir output \
  --task-type mutate

# By template ID
webarena-verified eval-tasks \
  --config config.json \
  --output-dir output \
  --template-id 5

# Combined filters
webarena-verified eval-tasks \
  --config config.json \
  --output-dir output \
  --sites shopping,reddit \
  --task-type mutate

# Dry run (no scoring)
webarena-verified eval-tasks \
  --config config.json \
  --output-dir output \
  --sites reddit \
  --dry-run

Available filter flags:

  • --task-ids: Comma-separated task IDs (for example 1,2,3 or a single 42).
  • --sites: Comma-separated site names (shopping, reddit, gitlab, map, etc.).
  • --task-type: Task type (retrieve, mutate, or navigate).
  • --template-id: Filter by intent_template_id.
  • --dry-run: List matching tasks without scoring them.

Understanding Evaluation Output

The CLI writes evaluation artifacts alongside your agent outputs:

output/
├── {task_id}/
│   ├── agent_response.json  # Agent response produced by the agent
│   ├── network.har          # Network trace captured during the run (HAR format)
│   └── eval_result.json     # Evaluation result written by the CLI
└── eval_log_{timestamp}.txt # Batch evaluation log

See Evaluation Results for details on the evaluation output format.
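
If you want to summarize a batch run yourself, here is a minimal sketch that walks the per-task eval_result.json files; the score and status key names are assumptions mirroring the programmatic API's result object, so check Evaluation Results for the authoritative schema:

import json
from pathlib import Path

results = []
for path in sorted(Path("output").glob("*/eval_result.json")):
    data = json.loads(path.read_text())
    # "score" and "status" are assumed key names; see Evaluation Results.
    results.append((path.parent.name, data.get("score"), data.get("status")))

for task_id, score, status in results:
    print(f"task {task_id}: score={score} status={status}")

scores = [s for _, s, _ in results if s is not None]
if scores:
    print(f"mean score: {sum(scores) / len(scores):.2f}")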

Using the Programmatic API

If you prefer to integrate WebArena-Verified directly into your Python code, you can use the programmatic API.

Step 1: Initialize WebArenaVerified

Create a WebArenaVerified instance with your environment configuration:

from pathlib import Path
from webarena_verified.api import WebArenaVerified
from webarena_verified.types.config import WebArenaVerifiedConfig

# Initialize with configuration
config = WebArenaVerifiedConfig(
    environments={
        "__GITLAB__": {
            "urls": ["http://localhost:8012"],
            "credentials": {"username": "root", "password": "demopass"}
        }
    }
)
wa = WebArenaVerified(config=config)

Step 2: Get Task Data

Retrieve task information programmatically:

# Get a single task
task = wa.get_task(42)
print(f"Task intent: {task.intent}")
print(f"Start URLs: {task.start_urls}")

# Get multiple tasks
tasks = [wa.get_task(task_id) for task_id in [1, 2, 3]]

Step 3: Evaluate Agent Output

Once you have your agent's output, evaluate it against the task definition. You can pass agent responses as file paths or construct them directly in code:

# Evaluate a task with file paths
result = wa.evaluate_task(
    task_id=44,
    agent_response=Path("output/44/agent_response.json"),
    network_trace=Path("output/44/network.har")
)

print(f"Score: {result.score}, Status: {result.status}")
import json

# Evaluate a task with direct content
result = wa.evaluate_task(
    task_id=44,
    agent_response={
        "task_type": "NAVIGATE",
        "status": "SUCCESS",
        "retrieved_data": None
    },
    network_trace=json.loads(Path("output/44/network.har").read_text())
)

print(f"Score: {result.score}, Status: {result.status}")

See Also