WebArenaVerified

Facade for the WebArena Verified evaluation framework.

This class provides a stable, high-level API and is the recommended interface for all WebArena Verified operations, because its interface remains stable across versions.

Example
from pathlib import Path

from webarena_verified.api import WebArenaVerified
from webarena_verified.types.config import WebArenaVerifiedConfig

# Initialize with custom config
config = WebArenaVerifiedConfig(
    environments={
        "__GITLAB__": {
            "urls": ["http://localhost:8012"],
            "credentials": {"username": "root", "password": "demopass"}
        }
    }
)
wa = WebArenaVerified(config=config)

# Evaluate a task
result = wa.evaluate_task(
    task_id=44,
    agent_response=Path("output/44/agent_response.json"),
    network_trace=Path("output/44/network.har")
)
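
The directory layout used in the example above (`output/<task_id>/agent_response.json` and `output/<task_id>/network.har`) lends itself to a small batch loop. The following is a minimal sketch, assuming that layout holds for every task. `evaluate_output_dir` is a hypothetical helper, not part of the library, and it relies only on the `evaluate_task` keyword arguments shown above.

```python
from pathlib import Path
from typing import Any


def evaluate_output_dir(wa: Any, output_root: Path, task_ids: list[int]) -> dict[int, Any]:
    """Evaluate every task whose files follow <output_root>/<task_id>/... (assumed layout).

    `wa` is expected to be a WebArenaVerified instance. This helper is
    hypothetical and is not provided by the library.
    """
    results: dict[int, Any] = {}
    for task_id in task_ids:
        task_dir = output_root / str(task_id)
        # Paths are passed through as-is; evaluate_task accepts Path inputs.
        results[task_id] = wa.evaluate_task(
            task_id=task_id,
            agent_response=task_dir / "agent_response.json",
            network_trace=task_dir / "network.har",
        )
    return results
```

With the `wa` instance from the example, a call might look like `results = evaluate_output_dir(wa, Path("output"), [44])`.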

config property

config: WebArenaVerifiedConfig

Access the configuration.

evaluate_task

evaluate_task(
    *,
    task_id: int,
    agent_response: Any,
    network_trace: list[dict] | Path | NetworkTrace,
) -> TaskEvalResult

Evaluate a single task with automatic format detection.

Parameters:

Name Type Description Default
task_id int

ID of the task to evaluate

required
agent_response Any

Agent's response in any of these formats:
- str: Raw response text (e.g., "answer: 42" or "navigate: https://example.com")
- dict: Parsed response dict (e.g., {"action": "retrieve", "value": "42"})
- list: List of values (may result in validation failure)
- None: No response (may result in validation failure)
- Path: File path to read response from

required
network_trace list[dict] | Path | NetworkTrace

Network trace in any of these formats:
- Path: HAR file path
- list: Pre-parsed list of network events/requests
- NetworkTrace: Pre-constructed NetworkTrace object

required

Returns:

Type Description
TaskEvalResult

TaskEvalResult with status, score, and detailed evaluation results. Evaluation errors are captured in the result: result.status is set to EvalStatus.ERROR and result.error_msg holds the error message.

Examples:

String response with HAR file:

from pathlib import Path

from webarena_verified.api import WebArenaVerified

wa = WebArenaVerified()
result = wa.evaluate_task(
    task_id=1,
    agent_response="answer: 42",
    network_trace=Path("trace.har")
)

Dict response with pre-parsed trace:

# network_events: a pre-parsed list of network event dicts (list[dict])
result = wa.evaluate_task(
    task_id=1,
    agent_response={"action": "retrieve", "value": "42"},
    network_trace=network_events
)

Response from file:

result = wa.evaluate_task(
    task_id=1,
    agent_response=Path("response.txt"),
    network_trace=Path("trace.har")
)
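
Because errors are reported through `result.status` and `result.error_msg`, callers typically inspect the result after each evaluation. The sketch below tallies a batch of results by status and collects the messages of errored ones. It assumes `status` is an enum member whose name is `ERROR` in the error case (as in `EvalStatus.ERROR`); `summarize` is a hypothetical helper, not a library function.

```python
from collections import Counter
from typing import Any, Iterable


def summarize(results: Iterable[Any]) -> tuple[Counter, list[str]]:
    """Tally results by status name and collect messages from errored results.

    Assumes each result exposes `status` and `error_msg` as described above.
    """
    counts: Counter = Counter()
    errors: list[str] = []
    for result in results:
        # Compare by enum member name so this sketch needs no library import.
        status_name = getattr(result.status, "name", str(result.status))
        counts[status_name] += 1
        if status_name == "ERROR":
            errors.append(str(result.error_msg))
    return counts, errors
```

Comparing by member name keeps the sketch free of library imports; in real code, comparing against `EvalStatus.ERROR` directly would be the more robust choice.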

get_task

get_task(task_id: int) -> WebArenaVerifiedTask

Get a single task by its ID.

Parameters:

Name Type Description Default
task_id int

Task ID to retrieve

required

Returns:

Type Description
WebArenaVerifiedTask

WebArenaVerifiedTask instance

Raises:

Type Description
ValueError

If task not found

Example
wa = WebArenaVerified()
task = wa.get_task(42)
print(task.intent)
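
Since `get_task` raises `ValueError` for an unknown ID, code that probes IDs of uncertain validity may want to turn that into a `None` result. A minimal sketch follows; `find_task` is a hypothetical wrapper, not part of the library.

```python
from typing import Any, Optional


def find_task(wa: Any, task_id: int) -> Optional[Any]:
    """Return the task for `task_id`, or None if get_task raises ValueError."""
    try:
        return wa.get_task(task_id)
    except ValueError:
        # Documented failure mode: the task ID does not exist.
        return None
```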

get_tasks

get_tasks(
    sites: list[WebArenaSite] | None = None,
    template_id: int | None = None,
    action: MainObjectiveType | None = None,
) -> list[WebArenaVerifiedTask]

Get all tasks, optionally filtered by criteria.

Parameters:

Name Type Description Default
sites list[WebArenaSite] | None

Filter by sites (default: None = no filter)

None
template_id int | None

Filter by template ID (default: None = no filter)

None
action MainObjectiveType | None

Filter by action type (default: None = no filter)

None

Returns:

Type Description
list[WebArenaVerifiedTask]

List of tasks matching all filter criteria (AND logic). If all parameters are None, returns all tasks.

Examples:

Get all tasks:

wa = WebArenaVerified()
all_tasks = wa.get_tasks()
print(f"Total tasks: {len(all_tasks)}")

Filter by site:

shopping_tasks = wa.get_tasks(sites=[WebArenaSite.SHOPPING])

Filter by multiple criteria:

mutate_shopping = wa.get_tasks(
    sites=[WebArenaSite.SHOPPING],
    action=MainObjectiveType.MUTATE
)
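
The filter semantics described above (a `None` argument means "no filter", and all supplied criteria are combined with AND) can be illustrated with stand-in data. This is only a sketch of the semantics, not the library's implementation: `DemoTask` and its field names are assumptions, string values stand in for `MainObjectiveType`, and site matching is left out because its exact rules are not described here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DemoTask:
    """Stand-in record; field names are assumptions, not the real task model."""
    task_id: int
    template_id: int
    action: str


def filter_tasks(
    tasks: list[DemoTask],
    template_id: Optional[int] = None,
    action: Optional[str] = None,
) -> list[DemoTask]:
    """Keep tasks that satisfy every non-None criterion (AND logic)."""
    return [
        t for t in tasks
        if (template_id is None or t.template_id == template_id)
        and (action is None or t.action == action)
    ]


demo = [
    DemoTask(1, 10, "RETRIEVE"),
    DemoTask(2, 10, "MUTATE"),
    DemoTask(3, 20, "MUTATE"),
]
# Both criteria must hold, so only task 2 is selected.
selected = filter_tasks(demo, template_id=10, action="MUTATE")
```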