Web Agents Evaluation¶
Overview¶
The tasks.eval.agents.web_agents module provides a framework for evaluating browser automation agents. It includes tools for web interaction, request/response logging, retry logic with failure hints, inline evaluation, and metrics calculation.
Architecture¶
Core Components¶
- Task Executor (task_executor.py) - Main execution logic with pre/post processors
- Tools (tools.py) - Browser interaction tools (click, type, scroll, etc.)
- Constants (constants.py) - Configuration constants and state keys
- Graph Config (graph_config.yaml) - Workflow configuration and system prompts
Module Structure¶
tasks/eval/agents/web_agents/
├── task_executor.py # Core execution logic
├── tools.py # Browser interaction tools
├── constants.py # Configuration constants
├── graph_config.yaml # Workflow and evaluation config
├── chat_history_seed.json # Sample input data
├── logs/ # Request/response logs
└── metadata/ # Execution metadata
Components¶
1. Task Executor (task_executor.py)¶
The task executor provides several classes for managing the evaluation workflow:
RequestResponseLogger¶
Comprehensive logging system for tracking exact request/response payloads.
Methods:
- setup_logger() - Initialize logging directory and file structure
- log_request(request_payload, step_number) - Log requests sent to the model
- log_response(response_payload, step_number) - Log responses received from the model
- is_server_error(response) - Detect server errors in responses
Log Format:
{
"step": 1,
"timestamp": "2026-03-05T12:00:00",
"type": "request|response",
"payload": {...}
}
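As a rough illustration of the log format above, an entry could be assembled as follows. The helper name `build_log_entry` is hypothetical and is not part of RequestResponseLogger; only the four keys come from the format shown.

```python
import json
from datetime import datetime


def build_log_entry(step: int, entry_type: str, payload: dict) -> dict:
    """Hypothetical helper: build one entry in the log format shown above."""
    if entry_type not in ("request", "response"):
        raise ValueError(f"unexpected entry type: {entry_type}")
    return {
        "step": step,
        "timestamp": datetime.now().isoformat(),  # ISO 8601, as in the sample
        "type": entry_type,
        "payload": payload,
    }


entry = build_log_entry(1, "request", {"messages": []})
print(json.dumps(entry, indent=2))
```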
FetchNextActionPreProcessor¶
Pre-processes requests before sending to the model.
Key Features:
- Manages chat history and state
- Handles retry logic with failure hints
- Injects retry prompts when needed
- Manages screenshot scaling
Methods:
- retry_failure_and_chat_injection(cls, state, lambda_node_dict) - Inject retry hints into chat
- apply(cls, lambda_node_dict, state) - Main preprocessing logic
Retry Hints:
- Tool Incorrect: Guides agent to use different tool
- Parameters Incorrect: Guides agent to fix parameters for same tool
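One way to picture the choice between the two hints is the sketch below. It assumes the hint depends only on whether the predicted tool matches the golden tool; the function name `choose_failure_hint` and the constant names are hypothetical, and the real logic in retry_failure_and_chat_injection may differ. The hint wording is the one quoted in the Retry Logic section of this page.

```python
# Hint wording as quoted under "Retry Logic" on this page.
TOOL_INCORRECT_HINT = "Do NOT use this tool again. Use a different tool."
PARAMS_INCORRECT_HINT = "Use the same tool with correct parameters."


def choose_failure_hint(predicted_tool: str, golden_tool: str) -> str:
    """Hypothetical helper: pick a retry hint after a failed step."""
    if predicted_tool != golden_tool:
        # Wrong tool: steer the agent away from it.
        return TOOL_INCORRECT_HINT
    # Right tool, so the failure must be in the parameters.
    return PARAMS_INCORRECT_HINT


print(choose_failure_hint("scroll", "click"))
print(choose_failure_hint("click", "click"))
```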
FetchNextActionPostProcessor¶
Post-processes model responses.
Key Features:
- Logs responses
- Extracts tool calls from model output
- Handles server errors
- Updates state with model responses
Methods:
- apply(cls, lambda_node_dict, state, messages) - Main post-processing logic
InlineEvaluationLambda¶
Performs inline evaluation of model predictions against golden responses.
Evaluation Metrics:
- Tool Match: Exact match of predicted vs golden tool
- Step Match: Comprehensive validation including:
  - Tool correctness
  - Parameter correctness (for click, type, scroll)
  - Bounding box validation (for click)
  - Scroll direction validation
  - Typed value matching
Methods:
- apply(cls, lambda_node_dict, state) - Execute inline evaluation
Validation Logic:
# Tool must match
if predicted_tool != golden_tool:
    return False
# For click: validate coordinates within bounding box
if predicted_tool == "click":
    validate_bbox(x, y, golden_bbox)
# For scroll: validate direction
if predicted_tool == "scroll":
    validate_direction(predicted_direction, golden_direction)
# For typing: validate text content
if predicted_tool == "typing":
    validate_text(predicted_text, golden_text)
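The bounding-box step can be sketched as below. This is a minimal illustration under a stated assumption: the golden `x` and `y` are taken to be the top-left corner of the box, `width` and `height` extend right and down, and the offset fields are ignored. The name `point_in_bbox` is hypothetical; the module's own validation may treat the fields differently.

```python
def point_in_bbox(x: float, y: float, bbox: dict) -> bool:
    """Hypothetical check: is the predicted click inside the golden box?

    Assumes bbox["x"], bbox["y"] is the top-left corner (an assumption,
    not confirmed by the module).
    """
    left, top = bbox["x"], bbox["y"]
    right = left + bbox["width"]
    bottom = top + bbox["height"]
    return left <= x <= right and top <= y <= bottom


golden_bbox = {"x": 146.44, "y": 94.44, "width": 82.04, "height": 61.11}
print(point_in_bbox(180.0, 120.0, golden_bbox))  # True
print(point_in_bbox(10.0, 10.0, golden_bbox))    # False
```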
RetryFlow¶
Manages retry logic and determines when to retry failed attempts.
Methods:
- apply(cls, lambda_node_dict, state) - Determine if retry is needed
Retry Conditions:
- Step evaluation failed
- Retry count < max retries
- No server errors
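The three conditions combine as a simple conjunction. The sketch below only restates them in code; the function name and argument names are hypothetical and do not mirror the actual RetryFlow.apply signature.

```python
def should_retry(
    step_passed: bool,
    retry_count: int,
    max_retries: int = 3,
    server_error: bool = False,
) -> bool:
    """Hypothetical predicate combining the three retry conditions."""
    return (not step_passed) and retry_count < max_retries and not server_error


print(should_retry(step_passed=False, retry_count=1))                     # True
print(should_retry(step_passed=False, retry_count=3))                     # False
print(should_retry(step_passed=False, retry_count=0, server_error=True))  # False
```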
ShouldContinueCondition¶
Edge condition to determine if workflow should continue.
Methods:
- apply(cls, edge_dict, state) - Check continuation conditions
Flatten¶
Post-processor that flattens nested retry structure for analysis.
Transformation:
Input: {mission_id: {retry_0: {...}, retry_1: {...}}}
Output: [{mission_id, retry_id: "retry_0", ...},
{mission_id, retry_id: "retry_1", ...}]
Methods:
- process(data, metadata) - Flatten retry structure
Output Fields:
- id, mission_id, turn, mission, navigational_directions
- golden_response, retry_id, model_response
- tool_match, step_match
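The transformation can be illustrated with a few lines of Python. This is a sketch of the idea only, assuming the nested input is a plain dict keyed by mission and then by retry key; `flatten_retries` is a hypothetical name, and the actual Flatten.process(data, metadata) may carry more fields and handle more cases.

```python
def flatten_retries(data: dict) -> list[dict]:
    """Hypothetical sketch: one output row per (mission, retry) pair."""
    rows = []
    for mission_id, retries in data.items():
        for retry_id, fields in retries.items():
            rows.append({"mission_id": mission_id, "retry_id": retry_id, **fields})
    return rows


nested = {
    "mission_01": {
        "retry_0": {"step_match": False},
        "retry_1": {"step_match": True},
    }
}
for row in flatten_retries(nested):
    print(row)
```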
2. Tools (tools.py)¶
Browser interaction tools using LangChain's @tool decorator.
Available Tools¶
| Tool | Description | Parameters |
|---|---|---|
| screenshot_tool | Capture current page state | take_screenshot: bool |
| click_tool | Click at coordinates | x: float, y: float |
| type_tool | Type at coordinates | x: float, y: float, text: str |
| typing_tool | Type without coordinates | text: str |
| scroll_tool | Scroll in direction | direction: str, amount: float |
| wait_tool | Wait for duration | time: float (milliseconds) |
| resume_tool | Resume after pause | resume: bool |
| hil_tool | Human-in-the-loop | wait_for_human: bool |
| text_clear_tool | Clear text field | clear_text: bool |
| slider_tool | Move slider | direction: str, amount: float |
Tool Usage Example¶
# Click at coordinates
click_tool.invoke({"x": 500, "y": 300})
# Type text
typing_tool.invoke({"text": "Hello World"})
# Scroll down
scroll_tool.invoke({"direction": "down", "amount": 200})
3. Constants (constants.py)¶
Configuration constants organized by category.
Categories¶
Server Error Detection
- SERVER_DOWN_ERROR - Error marker string
- SERVER_ERROR_MARKERS - List of error markers
Tool Configuration
- TOOL_NAME_SUFFIX - Suffix for tool names ("_tool")
- TOOL_NAME_MAPPINGS - Tool name aliases
- TOOL_TYPE_* - Tool type constants
Retry Configuration
- DEFAULT_MAX_RETRIES - Default retry limit (3)
- RETRY_KEY_PREFIX - Prefix for retry keys
State Keys
- CHAT_HISTORY_STATE_KEY - Chat history storage key
- MODEL_RESPONSES_KEY - Model responses storage key
- GOLDEN_RESPONSE_KEY - Golden response key
- CURRENT_TOOL_RESULT_KEY - Current tool result key
Failure Hints
- FAILURE_HINT_TOOL_INCORRECT - Hint for wrong tool
- FAILURE_HINT_PARAMS_INCORRECT - Hint for wrong parameters
- FAILURE_HINT_RETRY_TEMPLATE - Template for retry messages
4. Graph Configuration (graph_config.yaml)¶
Defines the evaluation workflow and configuration.
Data Configuration¶
data_config:
  source:
    type: "disk"
    file_path: "tasks/eval/agents/web_agents/chat_history_seed.json"
    transformations:
      - AddNewFieldTransform # Add supported tools, dimensions
      - AddRetryFieldsTransform # Add retry configuration
Note: The chat_history_seed.json file currently contains only sample data with a few records for testing and development purposes. This is not a complete dataset.
Graph Configuration¶
Properties:
- chat_conversation: multiturn - Multi-turn conversation support
- chat_history_window_size: 1000 - History window size
- retry_chat_injection - Retry prompt injection settings
Nodes:
- fetch_next_action_tools - Main LLM node for action prediction
Lambdas:
- inline_evaluation - Evaluate predictions inline
- retry_flow - Manage retry logic
Edges:
- should_continue - Conditional edge for workflow control
System Prompt Structure¶
The system prompt guides the agent with:
1. Core Principles - Think before acting, tool restrictions, trust screenshots
2. Action Guidelines - Pop-ups, ads, disabled elements, extraction, retries
3. Interaction Rules - Clicks, typing, scrolling, modals, dropdowns, tabs
4. User Confirmation - Authentication, sensitive data, destructive actions
5. Mission Context - Mission description, navigational directions
6. Screenshot Specs - Dimensions, coordinate system
Post-Processors¶
Flatten
post_processors:
  - name: tasks.eval.agents.web_agents.task_executor.Flatten
MetricCollatorPostProcessor
  - name: tasks.eval.utils.MetricCollatorPostProcessor
    params:
      aggregator_metrics_map:
        - name: "accuracy"
          unit_metrics_results: ["step_match"]
        - name: "pass@k"
          params: {k: 3}
          unit_metrics_results: ["step_match"]
        - name: "pass^k"
          params: {k: 3}
          unit_metrics_results: ["step_match"]
        - name: "step_efficiency"
          params: {key: "tool"}
          metadata:
            mission_id: "mission_id"
            step_id: "turn"
            retry_id: "retry_id"
          unit_metrics_results: ["step_match"]
Evaluation Metrics¶
Unit Metrics (Inline)¶
Evaluated during execution:
- Tool Match - Exact tool name match
- Step Match - Comprehensive step validation:
  - Tool correctness
  - Parameter validation
  - Bounding box validation (click)
  - Direction validation (scroll)
  - Text matching (typing)
Aggregator Metrics (Post-Processing)¶
Calculated after execution:
- Accuracy - Overall success rate
- Pass@k - Probability of at least one success in k attempts
  - Returns: pass@1, pass@2, ..., pass@k
- Pass^k - Probability of success in all k attempts
  - Returns: success_rate, pass^1, pass^2, ..., pass^k
- Step Efficiency - Efficiency based on retry attempts
  - Metrics: step_efficiency, first_attempt_correct, retry_correct, never_correct
  - Penalty per retry: 0.2 (configurable)
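For intuition, the two pass metrics can be sketched as follows. This page does not show the formulas the collator actually uses, so both functions are assumptions: `pass_at_k` uses the commonly cited unbiased estimator for "at least one success among k of n attempts", and `pass_hat_k` treats attempts as independent with a fixed success rate, so that succeeding on all k attempts has probability rate^k. Both function names are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Assumed estimator: P(at least one success in k draws from n attempts,
    of which c succeeded)."""
    if n - c < k:
        return 1.0  # too few failures to fill k draws without a success
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_hat_k(success_rate: float, k: int) -> float:
    """Assumed formula: P(all k independent attempts succeed)."""
    return success_rate ** k


print(pass_at_k(n=3, c=1, k=2))
print(pass_hat_k(0.75, 2))  # 0.5625
print(pass_hat_k(0.75, 3))  # 0.421875
```

With a success rate of 0.75, the pass^k sketch gives 0.5625 and 0.421875 for k = 2 and k = 3.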
Workflow¶
Execution Flow¶
1. Load Data (chat_history_seed.json)
↓
2. Add Transformations (tools, dimensions, retry fields)
↓
3. For each mission/step:
├─→ FetchNextActionPreProcessor
│ ├─ Manage chat history
│ ├─ Inject retry hints (if retry)
│ └─ Log request
↓
├─→ LLM Node (fetch_next_action_tools)
│ └─ Generate action prediction
↓
├─→ FetchNextActionPostProcessor
│ ├─ Extract tool call
│ ├─ Log response
│ └─ Update state
↓
├─→ InlineEvaluationLambda
│ ├─ Evaluate tool match
│ ├─ Evaluate step match
│ └─ Store results
↓
├─→ RetryFlow
│ └─ Determine if retry needed
↓
└─→ ShouldContinueCondition
└─ Check if workflow continues
↓
4. Flatten (Post-Processor)
└─ Flatten retry structure
↓
5. MetricCollatorPostProcessor
└─ Calculate aggregator metrics
Retry Logic¶
Initial Attempt (retry_0)
↓
Evaluate
↓
Failed? → Yes → Inject Failure Hint
↓
Retry Attempt (retry_1)
↓
Evaluate
↓
Failed? → Yes → Inject Failure Hint
↓
Retry Attempt (retry_2)
↓
Max retries reached → Stop
Failure Hints:
- Tool Incorrect: "Do NOT use this tool again. Use a different tool."
- Parameters Incorrect: "Use the same tool with correct parameters."
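The retry loop in the diagram can be approximated by the sketch below. It is an illustration under assumptions, not the module's RetryFlow: `run_with_retries` and `fake_agent` are hypothetical names, and the attempt callback is assumed to return None on success or a failure hint string to inject into the next attempt.

```python
from typing import Callable, Optional


def run_with_retries(
    attempt: Callable[[Optional[str]], Optional[str]],
    max_attempts: int = 3,
) -> dict:
    """Hypothetical loop: call attempt(hint) until it succeeds or attempts run out.

    Returns a dict keyed retry_0, retry_1, ... mapping to pass/fail.
    """
    results = {}
    hint = None  # no hint on the initial attempt
    for i in range(max_attempts):
        hint = attempt(hint)
        results[f"retry_{i}"] = hint is None
        if hint is None:
            break
    return results


def fake_agent(hint: Optional[str]) -> Optional[str]:
    # Stand-in agent: fails once, then succeeds after receiving a hint.
    if hint is None:
        return "Use the same tool with correct parameters."
    return None


print(run_with_retries(fake_agent))  # {'retry_0': False, 'retry_1': True}
```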
Usage¶
Running Evaluation¶
# Run the web agent evaluation
python -m sygra.cli.run_graph \
--config tasks/eval/agents/web_agents/graph_config.yaml
Input Data Format¶
The input data is stored in tasks/eval/agents/web_agents/chat_history_seed.json. Currently, this file contains sample data with only one mission for testing and development purposes.
Sample Data Structure¶
Each record in the input file represents one step (turn) of a mission:
{
"id": "mission_01_2",
"mission_id": "mission_01",
"mission": "search for one way flight from hyd to chennai on nov 1 2025",
"date": "2025-11-11 15:12:56",
"navigational_directions": "",
"turn": 2,
"chat_history": [
{
"role": "system",
"content": [
{
"text": "You are a web automation agent...",
"type": "text"
}
]
},
{
"role": "user",
"content": [
{
"text": "Help me now to complete the assigned mission...",
"type": "text"
}
]
},
{
"content": "I'll help you search for a one-way flight...",
"role": "assistant",
"tool_calls": [
{
"id": "tooluse_O5Dr64r9RC-lW8BNsdHTng",
"type": "function",
"function": {
"name": "screenshot_tool",
"arguments": "{\"take_screenshot\": true}"
}
}
]
},
{
"role": "tool",
"tool_call_id": "tooluse_O5Dr64r9RC-lW8BNsdHTng",
"name": "screenshot_tool",
"content": "success"
}
],
"current_user_text": "You are now midway through the assigned mission...",
"current_tool_result": {
"role": "tool",
"tool_call_id": "tooluse_O5Dr64r9RC-lW8BNsdHTng",
"name": "screenshot_tool",
"content": [
{
"image": {
"format": "png",
"source": {
"bytes": "iVBORw0KGgoAAAANSUhEUgAAA+gAAAPoCAIAAADCwUOz..."
}
}
}
]
},
"golden_response": {
"tool": "click",
"properties": {
"x": 146.44,
"y": 94.44,
"width": 82.04,
"height": 61.11,
"offset_x": 0.0,
"offset_y": 0.0
}
}
}
Field Descriptions¶
| Field | Type | Description |
|---|---|---|
| id | string | Unique identifier for this step (format: mission_id_turn) |
| mission_id | string | Identifier for the mission this step belongs to |
| mission | string | Description of the overall mission/task |
| date | string | Timestamp of the mission |
| navigational_directions | string | Optional hints or directions for this step |
| turn | integer | Step number within the mission (1-indexed) |
| chat_history | array | Complete conversation history up to this point |
| current_user_text | string | The prompt text for the current step |
| current_tool_result | object | Result from the previous tool execution (includes screenshot) |
| golden_response | object | Expected correct response for evaluation |
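Before running an evaluation, it can help to check that each record carries the fields listed above. The following is a minimal sketch; `missing_fields` is a hypothetical helper, and treating every listed field as required is an assumption (the sample record includes navigational_directions even when it is empty).

```python
# Field names taken from the table above.
EXPECTED_FIELDS = (
    "id", "mission_id", "mission", "date", "navigational_directions",
    "turn", "chat_history", "current_user_text", "current_tool_result",
    "golden_response",
)


def missing_fields(record: dict) -> list[str]:
    """Hypothetical check: names of expected fields absent from a record."""
    return [name for name in EXPECTED_FIELDS if name not in record]


record = {"id": "mission_01_2", "mission_id": "mission_01", "turn": 2}
print(missing_fields(record))
```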
Golden Response Structure¶
The golden_response contains the ground truth for evaluation:
For Click Actions:
{
"tool": "click",
"properties": {
"x": 146.44,
"y": 94.44,
"width": 82.04,
"height": 61.11,
"offset_x": 0.0,
"offset_y": 0.0
}
}
For Typing Actions:
{
"tool": "typing",
"properties": {
"text": "Hyderabad"
}
}
For Scroll Actions:
{
"tool": "scroll",
"properties": {
"direction": "down",
"amount": 200
}
}
Current Sample Data¶
The chat_history_seed.json file currently contains:
- 1 mission (mission_01)
- Multiple steps/turns for that mission
- Complete chat history for each step
- Screenshots embedded as base64 in current_tool_result
- Golden responses for evaluation
Note: This is sample data for testing purposes. A production dataset would contain multiple missions with various web automation scenarios.
Output Format¶
Flattened Output:
{
"id": "mission_1_step_1",
"mission_id": "mission_1",
"turn": 1,
"retry_id": "retry_0",
"model_response": {...},
"tool_match": {"correct": true, "score": 1.0},
"step_match": {"correct": true, "score": 1.0}
}
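Given rows in this flattened shape, a simple accuracy figure can be computed as sketched below. This counts every row equally, which is an assumption; how MetricCollatorPostProcessor weighs retries is not described on this page. The function name is hypothetical.

```python
def step_match_accuracy(rows: list[dict]) -> float:
    """Hypothetical sketch: fraction of rows whose step_match is correct."""
    if not rows:
        return 0.0
    correct = sum(1 for row in rows if row["step_match"]["correct"])
    return correct / len(rows)


rows = [
    {"retry_id": "retry_0", "step_match": {"correct": False, "score": 0.0}},
    {"retry_id": "retry_1", "step_match": {"correct": True, "score": 1.0}},
]
print(step_match_accuracy(rows))  # 0.5
```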
Metrics Output:
{
"evaluation_summary": {
"total_records": 100,
"timestamp": "2026-03-05 12:00:00",
"status": "success"
},
"results": {
"step_match-accuracy": 0.85,
"step_match-pass@1": 0.75,
"step_match-pass@2": 0.90,
"step_match-pass@3": 0.95,
"step_match-pass^1": 0.75,
"step_match-pass^2": 0.5625,
"step_match-pass^3": 0.421875,
"step_match-step_efficiency": {
"step_efficiency": 0.82,
"first_attempt_correct": 75,
"retry_correct": 10,
"never_correct": 15
}
}
}
Configuration¶
Retry Configuration¶
# In graph_config.yaml
max_retries: 3
retry_chat_injection:
  required: "yes"
  retry_prompt_injection: "yes"
Metadata Mapping¶
# For step_efficiency metric
metadata:
  mission_id: "mission_id" # Maps to mission_id field
  step_id: "turn" # Maps to turn field
  retry_id: "retry_id" # Maps to retry_id field
Image Scaling¶
IMAGE_SCALE_FACTORS = {
"50_percent": 0.5,
"25_percent": 0.25,
"30_percent": 0.3
}
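How these factors are applied is not shown on this page. As one possible illustration, if a screenshot is downscaled by a factor, a golden bounding box expressed in original pixels would need the same scaling to stay comparable. The sketch below assumes exactly that; `scale_bbox` is a hypothetical helper and not part of the module.

```python
def scale_bbox(bbox: dict, factor: float) -> dict:
    """Hypothetical helper: scale a bounding box by the screenshot factor."""
    return {key: bbox[key] * factor for key in ("x", "y", "width", "height")}


original = {"x": 100.0, "y": 50.0, "width": 80.0, "height": 60.0}
print(scale_bbox(original, 0.5))
# {'x': 50.0, 'y': 25.0, 'width': 40.0, 'height': 30.0}
```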
Logging¶
Request/Response Logs¶
Location: tasks/eval/agents/web_agents/logs/
Format: web_agent_requests_YYYYMMDD_HHMMSS.json
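A file name in this format can be produced with the standard library. The function name below is hypothetical; only the name pattern comes from this page.

```python
from datetime import datetime


def log_file_name(now: datetime) -> str:
    """Hypothetical helper: build a name like web_agent_requests_YYYYMMDD_HHMMSS.json."""
    return f"web_agent_requests_{now.strftime('%Y%m%d_%H%M%S')}.json"


print(log_file_name(datetime(2026, 3, 5, 12, 0, 0)))
# web_agent_requests_20260305_120000.json
```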
Log Entry Structure¶
{
"step": 1,
"timestamp": "2026-03-05T12:00:00.000000",
"type": "request",
"payload": {
"messages": [...],
"tools": [...],
"model": "..."
}
}
Error Handling¶
Server Errors¶
Detected by RequestResponseLogger.is_server_error():
- Checks for SERVER_ERROR_MARKERS in response
- Prevents retry on server errors
- Logs error details
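Marker-based detection of this kind usually reduces to a substring check. The sketch below is an assumption about the approach, not the module's code: the marker strings are placeholders (the real values live in SERVER_ERROR_MARKERS in constants.py and are not listed on this page), and `looks_like_server_error` is a hypothetical name.

```python
# Placeholder markers for illustration only; not the module's actual values.
EXAMPLE_MARKERS = ["Internal Server Error", "Service Unavailable"]


def looks_like_server_error(response_text: str, markers=EXAMPLE_MARKERS) -> bool:
    """Hypothetical check: does the response contain any error marker?"""
    return any(marker in response_text for marker in markers)


print(looks_like_server_error("503 Service Unavailable"))  # True
print(looks_like_server_error('{"tool": "click"}'))        # False
```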
Validation Errors¶
Handled by inline evaluation:
- Tool mismatch
- Parameter validation failures
- Bounding box violations
- Direction mismatches
Best Practices¶
- Always log requests/responses for debugging
- Use retry hints to guide agent corrections
- Validate inline to catch errors early
- Configure metadata for metrics that need context
- Monitor step efficiency to optimize retry strategy
- Review logs when debugging failures
Extending the Module¶
Adding New Tools¶
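# In tools.py
import logging

from langchain_core.tools import tool

logger = logging.getLogger(__name__)


@tool
def new_tool(param1: str, param2: float):
    """Tool description.

    Args:
        param1: Description
        param2: Description
    """
    # Parameter types above are placeholders; replace them with the real ones.
    logger.info(f"Executing new_tool: {param1}, {param2}")
    return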
Adding New Metrics¶
# In graph_config.yaml
- name: "custom_metric"
  params:
    custom_param: value
  metadata:
    field1: "source_field1"
  unit_metrics_results:
    - "step_match"
Customizing Retry Logic¶
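# In task_executor.py
# Import paths are assumed from the SyGra package layout; adjust if they differ.
from sygra.core.graph.functions.lambda_function import LambdaFunction
from sygra.core.graph.sygra_state import SygraState


class CustomRetryFlow(LambdaFunction):
    @classmethod
    def apply(cls, lambda_node_dict: dict, state: SygraState) -> SygraState:
        # Custom retry logic
        return state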
Troubleshooting¶
Common Issues¶
Issue: Metrics not receiving metadata
- Solution: Ensure metadata mapping is configured in graph_config.yaml
Issue: Retry hints not appearing
- Solution: Check retry_chat_injection.required is set to "yes"
Issue: Step match always failing
- Solution: Verify golden response format matches expected structure
Issue: Logs not being created
- Solution: Check logs directory exists and has write permissions