Usage

This guide walks you through using WebArena-Verified to evaluate web agents. You'll learn how to get task data, run your agent, and evaluate the results using either the CLI or programmatic API.

Prerequisites

  • Configuration file set up (see Configuration)
  • Python 3.11+ with WebArena-Verified installed

Step 1: Set Up Your Configuration

Create a configuration file that specifies your environment URLs and credentials:

{
  "environments": {
    "__GITLAB__": {
      "urls": ["http://localhost:8012"],
      "credentials": {"username": "root", "password": "demopass"}
    },
    "__SHOPPING__": {
      "urls": ["http://localhost:7770"]
    }
  }
}

See Configuration for complete details on all configuration options.

Step 2: Get Task Data

Use the agent-input-get command to export the task information your agent needs:

# Export all tasks
webarena-verified agent-input-get \
  --config config.json \
  --output tasks.json

# Export specific tasks by ID
webarena-verified agent-input-get \
  --task-ids 1,2,3 \
  --config config.json \
  --output tasks.json

# Export only tasks for a given site
webarena-verified agent-input-get \
  --sites shopping \
  --config config.json \
  --output tasks.json

The output file contains task metadata your agent needs:

[
  {
    "task_id": 1,
    "intent_template_id": 100,
    "sites": ["shopping"],
    "start_urls": ["http://localhost:7770/..."],
    "intent": "What is the price of..."
  }
]

URL Rendering

The --config flag is required to render template URLs (like __SHOPPING__) into actual URLs that your agent can navigate to.
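
For intuition, rendering is a straightforward placeholder substitution. A minimal illustration, using the __SHOPPING__ base URL from the example config and a hypothetical raw task URL (the CLI does this for you; you should never need to do it yourself):

# Illustration only: placeholders like __SHOPPING__ are replaced with the
# base URLs configured in config.json.
template_url = "__SHOPPING__/product/123"  # hypothetical raw task URL
rendered_url = template_url.replace("__SHOPPING__", "http://localhost:7770")
print(rendered_url)  # http://localhost:7770/product/123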

Step 3: Run Your Agent

Your agent should:

  1. Load task data from the JSON file produced in Step 2
  2. For each task:
    • Navigate to the provided start_urls
    • Execute the task based on the intent
    • Save outputs to the expected location

Required output files per task:

{output_dir}/
└── {task_id}/
    ├── agent_response.json  # Agent's response (see format below)
    └── network.har          # Network trace in HAR format

Agent response format:

{
  "task_type": "RETRIEVE",
  "status": "SUCCESS",
  "retrieved_data": ["extracted data here"],
  "error_details": null
}
| Field | Type | Description |
| --- | --- | --- |
| task_type | string | One of: RETRIEVE, MUTATE, NAVIGATE |
| status | string | One of: SUCCESS, ACTION_NOT_ALLOWED_ERROR, PERMISSION_DENIED_ERROR, NOT_FOUND_ERROR, DATA_VALIDATION_ERROR, UNKNOWN_ERROR |
| retrieved_data | array or null | Required for RETRIEVE tasks; list of extracted values |
| error_details | string or null | Optional error description |
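
Putting Step 2 and the format above together, here is a minimal sketch of the per-task driver loop. The run_task function is a placeholder for your agent's own logic; the paths and file names follow the layout documented above:

import json
from pathlib import Path

tasks = json.loads(Path("tasks.json").read_text())  # produced in Step 2
output_dir = Path("output")

for task in tasks:
    task_dir = output_dir / str(task["task_id"])
    task_dir.mkdir(parents=True, exist_ok=True)

    # run_task is a placeholder: navigate to task["start_urls"], act on
    # task["intent"], and capture a HAR trace of the network traffic.
    response, har = run_task(task)

    # Write the two required files in the documented format.
    (task_dir / "agent_response.json").write_text(json.dumps(response))
    (task_dir / "network.har").write_text(json.dumps(har))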

Reference Implementation

See the human agent example in examples/agents/human/ for a complete reference implementation that demonstrates loading task data, automating the browser with Playwright, and producing properly formatted output files.
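
If you drive the browser with Playwright, its built-in HAR recording can produce network.har directly. A minimal sketch (the task ID and start URL are illustrative):

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    # record_har_path makes Playwright capture a HAR trace for this context
    context = browser.new_context(record_har_path="output/42/network.har")
    page = context.new_page()
    page.goto("http://localhost:7770")  # the task's start URL
    # ... agent actions ...
    context.close()  # the HAR file is written when the context closes
    browser.close()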

Step 4: Evaluate Results

Use the eval-tasks command to score your agent's outputs:

Basic Evaluation

Score a full run in one command. When no filters are provided, the CLI discovers every task directory under --output-dir that contains the required files:

webarena-verified eval-tasks --config config.json --output-dir output

Filtering Tasks

You can filter which tasks to evaluate:

# Evaluate specific tasks by ID
webarena-verified eval-tasks \
  --config config.json \
  --output-dir output \
  --task-ids 1,2,3

# Evaluate a single task
webarena-verified eval-tasks \
  --config config.json \
  --output-dir output \
  --task-ids 42

# Evaluate all tasks for a site
webarena-verified eval-tasks \
  --config config.json \
  --output-dir output \
  --sites shopping

# Evaluate by task type
webarena-verified eval-tasks \
  --config config.json \
  --output-dir output \
  --task-type mutate

# Evaluate by intent template
webarena-verified eval-tasks \
  --config config.json \
  --output-dir output \
  --template-id 5

# Combine filters
webarena-verified eval-tasks \
  --config config.json \
  --output-dir output \
  --sites shopping,reddit \
  --task-type mutate

# Preview matching tasks without scoring them
webarena-verified eval-tasks \
  --config config.json \
  --output-dir output \
  --sites reddit \
  --dry-run

Available filter flags:

| Flag | Description |
| --- | --- |
| --task-ids | Comma-separated task IDs (for example 1,2,3 or a single 42). |
| --sites | Comma-separated site names (shopping, reddit, gitlab, map, etc.). |
| --task-type | Task type (retrieve, mutate, or navigate). |
| --template-id | Filter by intent_template_id. |
| --dry-run | List matching tasks without scoring them. |

Understanding Evaluation Output

The CLI writes evaluation artifacts alongside your agent outputs:

output/
├── {task_id}/
│   ├── agent_response.json  # Response produced by your agent
│   ├── network.har          # Network trace captured during the run (HAR format)
│   └── eval_result.json     # Evaluation result written by the CLI
└── eval_log_{timestamp}.txt # Batch evaluation log

See Evaluation Results for details on the evaluation output format.
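
To summarize a batch run, you can walk the per-task eval_result.json files. A small sketch, assuming each result exposes score and status fields as the programmatic API below does (see Evaluation Results for the authoritative schema):

import json
from pathlib import Path

results = {
    p.parent.name: json.loads(p.read_text())
    for p in Path("output").glob("*/eval_result.json")
}
for task_id, result in sorted(results.items()):
    # The "score" and "status" keys are assumed here; see Evaluation Results.
    print(task_id, result.get("score"), result.get("status"))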

Using the Programmatic API

If you prefer to integrate WebArena-Verified directly into your Python code, you can use the programmatic API.

Step 1: Initialize WebArenaVerified

Create a WebArenaVerified instance with your environment configuration:

from pathlib import Path
from webarena_verified.api import WebArenaVerified
from webarena_verified.types.config import WebArenaVerifiedConfig

# Initialize with configuration
config = WebArenaVerifiedConfig(
    environments={
        "__GITLAB__": {
            "urls": ["http://localhost:8012"],
            "credentials": {"username": "root", "password": "demopass"}
        }
    }
)
wa = WebArenaVerified(config=config)

Step 2: Get Task Data

Retrieve task information programmatically:

# Get a single task
task = wa.get_task(42)
print(f"Task intent: {task.intent}")
print(f"Start URLs: {task.start_urls}")

# Get multiple tasks
tasks = [wa.get_task(task_id) for task_id in [1, 2, 3]]

Step 3: Evaluate Agent Output

Once you have your agent's output, evaluate it against the task definition. You can pass agent responses as file paths or construct them directly in code:

# Evaluate a task with file paths
result = wa.evaluate_task(
    task_id=44,
    agent_response=Path("output/44/agent_response.json"),
    network_trace=Path("output/44/network.har")
)

print(f"Score: {result.score}, Status: {result.status}")
import json

# Evaluate a task with direct content
result = wa.evaluate_task(
    task_id=44,
    agent_response={
        "task_type": "NAVIGATE",
        "status": "SUCCESS",
        "retrieved_data": None
    },
    network_trace=json.loads(Path("output/44/network.har").read_text())
)

print(f"Score: {result.score}, Status: {result.status}")

See Also