WebArenaVerified¶

Facade for the WebArena Verified evaluation framework.

This class provides a high-level API that is kept stable across versions. It is the recommended interface for all WebArena Verified operations.

Example

from pathlib import Path

from webarena_verified.api import WebArenaVerified
from webarena_verified.types.config import WebArenaVerifiedConfig

# Initialize with custom config
config = WebArenaVerifiedConfig(
    environments={
        "__GITLAB__": {
            "urls": ["http://localhost:8012"],
            "credentials": {"username": "root", "password": "demopass"}
        }
    }
)
wa = WebArenaVerified(config=config)

# Evaluate a task
result = wa.evaluate_task(
    task_id=44,
    agent_response=Path("output/44/agent_response.json"),
    network_trace=Path("output/44/network.har")
)
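
The single-task call above extends naturally to a batch run. The following is a minimal sketch, not part of the library: `evaluate_output_dir` is a hypothetical helper, and the `output/<task_id>/agent_response.json` and `network.har` layout is assumed from the example above. It accepts any object with a compatible `evaluate_task` method, such as a `WebArenaVerified` instance.

```python
from pathlib import Path
from typing import Any, Dict, List


def evaluate_output_dir(wa: Any, task_ids: List[int], output_dir: Path) -> Dict[int, Any]:
    """Evaluate every task whose artifacts exist under output_dir/<task_id>/.

    `wa` is expected to be a WebArenaVerified instance. The file names are
    an assumption that mirrors the example above; adjust them to your layout.
    """
    results: Dict[int, Any] = {}
    for task_id in task_ids:
        task_dir = output_dir / str(task_id)
        response_file = task_dir / "agent_response.json"
        trace_file = task_dir / "network.har"
        if not (response_file.is_file() and trace_file.is_file()):
            # Skip tasks without both artifacts rather than passing missing paths.
            continue
        # evaluate_task takes keyword-only arguments.
        results[task_id] = wa.evaluate_task(
            task_id=task_id,
            agent_response=response_file,
            network_trace=trace_file,
        )
    return results
```

With `wa = WebArenaVerified(config=config)`, a call such as `evaluate_output_dir(wa, [44], Path("output"))` would return a dict mapping each evaluated task ID to its result.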

config `property` ¶

config: WebArenaVerifiedConfig

Access the configuration.

evaluate_task ¶

evaluate_task(
    *,
    task_id: int,
    agent_response: Any,
    network_trace: list[dict] | Path | NetworkTrace,
) -> TaskEvalResult

Evaluate a single task, automatically detecting the format of the agent response and the network trace.

Parameters:

Name	Type	Description	Default
`task_id`	`int`	ID of the task to evaluate	required
`agent_response`	`Any`	Agent's response in one of these formats: `str` (raw response text, e.g., "answer: 42" or "navigate: https://example.com"); `dict` (parsed response, e.g., {"action": "retrieve", "value": "42"}); `list` (list of values; may result in validation failure); `None` (no response; may result in validation failure); `Path` (file path to read the response from)	required
`network_trace`	`list[dict] \| Path \| NetworkTrace`	Network trace in one of these formats: `Path` (HAR file path); `list` (pre-parsed list of network events/requests); `NetworkTrace` (pre-constructed NetworkTrace object)	required

Returns:

Type	Description
`TaskEvalResult`	Result with status, score, and detailed evaluation results. Errors are captured in the result: `result.status` is set to `EvalStatus.ERROR` and `result.error_msg` holds the message.

Examples:

String response with HAR file:

from pathlib import Path

from webarena_verified.api import WebArenaVerified

wa = WebArenaVerified()
result = wa.evaluate_task(
    task_id=1,
    agent_response="answer: 42",
    network_trace=Path("trace.har")
)

Dict response with pre-parsed trace:

# network_events is a pre-parsed list[dict] of network events, built elsewhere
result = wa.evaluate_task(
    task_id=1,
    agent_response={"action": "retrieve", "value": "42"},
    network_trace=network_events
)

Response from file:

result = wa.evaluate_task(
    task_id=1,
    agent_response=Path("response.txt"),
    network_trace=Path("trace.har")
)
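
Because errors are captured on the returned result, callers typically inspect `result.status` before using the score. The sketch below shows one way to do that. `summarize_result` is a hypothetical helper. It assumes that `EvalStatus` is an enum (so members expose `.name`) and that the score is available as `result.score`; the documentation above names only `status` and `error_msg` explicitly.

```python
from typing import Any


def summarize_result(result: Any) -> str:
    """Return a one-line summary of a TaskEvalResult-like object.

    Assumptions: `status` is an enum member with a `.name`, and the score
    is exposed as `result.score`. Both should be checked against the
    actual TaskEvalResult definition.
    """
    status = result.status
    # Fall back to str() if status is not an enum member.
    status_name = getattr(status, "name", str(status))
    if status_name == "ERROR":
        # Per the docs, errors are captured on the result with error_msg.
        return f"ERROR: {result.error_msg}"
    score = getattr(result, "score", None)
    return f"{status_name}: score={score}"
```

Comparing against `EvalStatus.ERROR` directly is the more idiomatic check once the enum is imported; the name-based comparison here only keeps the sketch free of library imports.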

get_task ¶

get_task(task_id: int) -> WebArenaVerifiedTask

Get a single task by its ID.

Parameters:

Name	Type	Description	Default
`task_id`	`int`	Task ID to retrieve	required

Returns:

Type	Description
`WebArenaVerifiedTask`	The task with the given ID

Raises:

Type	Description
`ValueError`	If no task with the given ID exists

Example

wa = WebArenaVerified()
task = wa.get_task(42)
print(task.intent)
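
Since `get_task` raises `ValueError` for an unknown ID, code that handles user-supplied IDs may want to catch it. A minimal sketch, where `find_task` is a hypothetical wrapper and not part of the library:

```python
from typing import Any, Optional


def find_task(wa: Any, task_id: int) -> Optional[Any]:
    """Return the task for task_id, or None if get_task raises ValueError.

    `wa` is expected to be a WebArenaVerified instance.
    """
    try:
        return wa.get_task(task_id)
    except ValueError:
        # Documented behavior: ValueError signals that the task was not found.
        return None
```

Returning `None` suits lookups where a missing task is expected. Where a missing task indicates a bug, letting the `ValueError` propagate is usually the better choice.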

get_tasks ¶

get_tasks(
    sites: list[WebArenaSite] | None = None,
    template_id: int | None = None,
    action: MainObjectiveType | None = None,
) -> list[WebArenaVerifiedTask]

Get all tasks, optionally filtered by criteria.

Parameters:

Name	Type	Description	Default
`sites`	`list[WebArenaSite] \| None`	Filter by sites (default: None = no filter)	`None`
`template_id`	`int \| None`	Filter by template ID (default: None = no filter)	`None`
`action`	`MainObjectiveType \| None`	Filter by action type (default: None = no filter)	`None`

Returns:

Type	Description
`list[WebArenaVerifiedTask]`	List of tasks matching all filter criteria (AND logic). If all parameters are None, returns all tasks.

Examples:

Get all tasks:

wa = WebArenaVerified()
all_tasks = wa.get_tasks()
print(f"Total tasks: {len(all_tasks)}")

Filter by site:

shopping_tasks = wa.get_tasks(sites=[WebArenaSite.SHOPPING])

Filter by multiple criteria:

mutate_shopping = wa.get_tasks(
    sites=[WebArenaSite.SHOPPING],
    action=MainObjectiveType.MUTATE
)
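
The AND semantics of combined filters can be checked by comparing result sizes: the combined query should never return more tasks than either filter alone. The sketch below is a hypothetical helper (`filter_counts`) that uses only the documented `sites` and `action` parameters and accepts any object with a compatible `get_tasks` method.

```python
from typing import Any, Dict


def filter_counts(wa: Any, site: Any, action: Any) -> Dict[str, int]:
    """Count tasks matching a site filter, an action filter, and both.

    `wa` is expected to be a WebArenaVerified instance; `site` and `action`
    are expected to be WebArenaSite and MainObjectiveType members.
    """
    by_site = wa.get_tasks(sites=[site])
    by_action = wa.get_tasks(action=action)
    # With AND logic, this set is contained in each of the two above.
    by_both = wa.get_tasks(sites=[site], action=action)
    return {"site": len(by_site), "action": len(by_action), "both": len(by_both)}
```

For example, `filter_counts(wa, WebArenaSite.SHOPPING, MainObjectiveType.MUTATE)` should report a `"both"` count no larger than the `"site"` or `"action"` counts.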
