WebArenaVerified
Facade for the WebArena Verified evaluation framework.
This class provides a high-level API that remains stable across versions, and it is the recommended interface for all WebArena Verified operations.
Example
from pathlib import Path

from webarena_verified.api import WebArenaVerified
from webarena_verified.types.config import WebArenaVerifiedConfig

# Initialize with custom config
config = WebArenaVerifiedConfig(
    environments={
        "__GITLAB__": {
            "urls": ["http://localhost:8012"],
            "credentials": {"username": "root", "password": "demopass"},
        }
    }
)
wa = WebArenaVerified(config=config)

# Evaluate a task using files written by the agent run
result = wa.evaluate_task(
    task_id=44,
    agent_response=Path("output/44/agent_response.json"),
    network_trace=Path("output/44/network.har"),
)
evaluate_task
evaluate_task(
    *,
    task_id: int,
    agent_response: Any,
    network_trace: list[dict] | Path | NetworkTrace,
) -> TaskEvalResult
Evaluate a single task with automatic format detection.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `task_id` | `int` | ID of the task to evaluate | required |
| `agent_response` | `Any` | Agent's response in any of these formats: `str`: raw response text (e.g., `"answer: 42"` or `"navigate: https://example.com"`); `dict`: parsed response dict (e.g., `{"action": "retrieve", "value": "42"}`); `list`: list of values (may result in validation failure); `None`: no response (may result in validation failure); `Path`: file path to read the response from | required |
| `network_trace` | `list[dict] \| Path \| NetworkTrace` | Network trace in any of these formats: `Path`: HAR file path; `list`: pre-parsed list of network events/requests; `NetworkTrace`: pre-constructed `NetworkTrace` object | required |
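The accepted `agent_response` formats can be told apart by type alone. The following is a minimal sketch of such a type-based dispatch, not the library's own implementation: `describe_agent_response` is a hypothetical helper introduced here only to illustrate which documented format a given value falls under.

```python
from pathlib import Path
from typing import Any


def describe_agent_response(agent_response: Any) -> str:
    """Hypothetical helper: name the documented format a value falls under."""
    # None of these types subclass one another, so the check order is free.
    if isinstance(agent_response, Path):
        return "file path (response is read from disk)"
    if isinstance(agent_response, str):
        return "raw response text"
    if isinstance(agent_response, dict):
        return "parsed response dict"
    if isinstance(agent_response, list):
        return "list of values (may fail validation)"
    if agent_response is None:
        return "no response (may fail validation)"
    return "unrecognized type"


# Values taken from the parameter description; the file name is illustrative.
examples = [
    "answer: 42",
    {"action": "retrieve", "value": "42"},
    ["42"],
    None,
    Path("agent_response.json"),
]
for value in examples:
    print(repr(value), "->", describe_agent_response(value))
```

Because a `Path` is not a `str`, a file path and a raw response string are never confused; a path passed as a plain string would be treated as response text under this sketch.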
Returns:
| Type | Description |
|---|---|
| `TaskEvalResult` | Result with status, score, and detailed evaluation results. Errors are captured in the result: `result.status` is `EvalStatus.ERROR` and `result.error_msg` describes the failure. |
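Since errors are reported through the result, callers would typically check the status before using the score. The sketch below uses stand-in `EvalStatus` and `TaskEvalResult` classes so that it runs without the library. Only `EvalStatus.ERROR`, `status`, `score`, and `error_msg` come from the description above; the `SUCCESS` member, the field types, and the `summarize` helper are assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class EvalStatus(Enum):
    """Stand-in enum; only ERROR is named in the documentation above."""
    SUCCESS = "success"  # assumed member, for illustration only
    ERROR = "error"


@dataclass
class TaskEvalResult:
    """Stand-in holding the documented fields; the types are assumptions."""
    status: EvalStatus
    score: float
    error_msg: Optional[str] = None


def summarize(result: TaskEvalResult) -> str:
    """Hypothetical helper: check the status before trusting the score."""
    if result.status is EvalStatus.ERROR:
        return f"evaluation error: {result.error_msg}"
    return f"status={result.status.name}, score={result.score}"


ok = TaskEvalResult(status=EvalStatus.SUCCESS, score=1.0)
failed = TaskEvalResult(
    status=EvalStatus.ERROR, score=0.0, error_msg="trace file not found"
)
print(summarize(ok))
print(summarize(failed))
```

With the real library, the same check would be applied to the object returned by `wa.evaluate_task(...)`.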
Examples:
String response with HAR file:

from pathlib import Path

wa = WebArenaVerified()
result = wa.evaluate_task(
    task_id=1,
    agent_response="answer: 42",
    network_trace=Path("trace.har"),
)

Dict response with pre-parsed trace:

# network_events is a pre-parsed list of network events/requests
result = wa.evaluate_task(
    task_id=1,
    agent_response={"action": "retrieve", "value": "42"},
    network_trace=network_events,
)

Response from file:
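A minimal sketch of the file-based case, assuming the response file is JSON holding the dict form described in the parameter table. The file name and temporary directory are illustrative, and the `evaluate_task` call is left as a comment because it needs the installed library and a real network trace.

```python
import json
import tempfile
from pathlib import Path

# Write the agent's response to disk (assumed format: the documented dict form).
out_dir = Path(tempfile.mkdtemp())
response_file = out_dir / "agent_response.json"  # illustrative file name
response_file.write_text(json.dumps({"action": "retrieve", "value": "42"}))

# With the library installed, the path is passed directly (not executed here):
# result = wa.evaluate_task(
#     task_id=1,
#     agent_response=response_file,
#     network_trace=Path("trace.har"),
# )

print(response_file.name, "contains", json.loads(response_file.read_text()))
```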
get_task
Get a single task by its ID.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `task_id` | `int` | Task ID to retrieve | required |
Returns:
| Type | Description |
|---|---|
| `WebArenaVerifiedTask` | `WebArenaVerifiedTask` instance |
Raises:
| Type | Description |
|---|---|
| `ValueError` | If the task is not found |
get_tasks
get_tasks(
    sites: list[WebArenaSite] | None = None,
    template_id: int | None = None,
    action: MainObjectiveType | None = None,
) -> list[WebArenaVerifiedTask]
Get all tasks, optionally filtered by criteria.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `sites` | `list[WebArenaSite] \| None` | Filter by sites (default: None = no filter) | `None` |
| `template_id` | `int \| None` | Filter by template ID (default: None = no filter) | `None` |
| `action` | `MainObjectiveType \| None` | Filter by action type (default: None = no filter) | `None` |
Returns:
| Type | Description |
|---|---|
| `list[WebArenaVerifiedTask]` | List of tasks matching all filter criteria (AND logic). If all parameters are None, returns all tasks. |
Examples:
Get all tasks:
Filter by site:
Filter by multiple criteria:
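The three cases above can be sketched with stand-in types so that the AND logic is visible without the library installed. Everything here is illustrative: `Site` and `Task` stand in for `WebArenaSite` and `WebArenaVerifiedTask`, the member names and task data are invented, plain strings stand in for `MainObjectiveType`, and the rule that a task matches `sites` when it uses any listed site is an assumption rather than documented behavior.

```python
from dataclasses import dataclass
from enum import Enum


class Site(Enum):
    """Stand-in for WebArenaSite; member names are assumptions."""
    GITLAB = "gitlab"
    SHOPPING = "shopping"


@dataclass
class Task:
    """Stand-in for WebArenaVerifiedTask with only the filterable fields."""
    task_id: int
    sites: list
    template_id: int
    action: str  # stand-in for MainObjectiveType


# Invented sample data, for illustration only.
TASKS = [
    Task(1, [Site.GITLAB], 10, "retrieve"),
    Task(2, [Site.SHOPPING], 10, "navigate"),
    Task(3, [Site.GITLAB, Site.SHOPPING], 11, "retrieve"),
]


def get_tasks(sites=None, template_id=None, action=None):
    """Return tasks matching every filter that is not None (AND logic)."""
    matches = []
    for task in TASKS:
        # Assumption: a task matches `sites` if it uses any listed site.
        if sites is not None and not any(s in task.sites for s in sites):
            continue
        if template_id is not None and task.template_id != template_id:
            continue
        if action is not None and task.action != action:
            continue
        matches.append(task)
    return matches


all_tasks = get_tasks()                        # get all tasks
gitlab_tasks = get_tasks(sites=[Site.GITLAB])  # filter by site
combined = get_tasks(                          # filter by multiple criteria
    sites=[Site.GITLAB], template_id=11, action="retrieve"
)
print(len(all_tasks), len(gitlab_tasks), len(combined))
```

With the real facade the calls take the same shape, for example `wa.get_tasks()` for all tasks, with `sites`, `template_id`, and `action` given as `WebArenaSite` and `MainObjectiveType` values.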