Unit Metrics - Technical Reference
Note: This is a technical reference for developers working with unit metrics (validators).
Architecture Overview
Unit metrics validate individual predictions and return UnitMetricResult objects.
- Input: Two lists of equal length, `golden` and `predicted`
- Output: A list of `UnitMetricResult` objects, one per golden/predicted pair
- Core output signal: `UnitMetricResult.correct` (bool), whether the prediction passed validation (True/False)
Unit metrics are designed to be:
- Modular (each metric in its own module)
- Extensible (easy to add new validators)
- Task-agnostic (inputs can be any type; validation logic decides how to interpret them)
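The input/output contract above can be pictured without the library. The snippet below is a stand-in sketch, not SyGra's implementation: `SketchResult` and `evaluate_pairs` are hypothetical names, and the fields simply mirror the `UnitMetricResult` fields used elsewhere in this reference.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class SketchResult:
    """Hypothetical stand-in for UnitMetricResult (not the library class)."""
    correct: bool
    golden: Any
    predicted: Any
    metadata: Dict[str, Any] = field(default_factory=dict)


def evaluate_pairs(
    golden: List[Any],
    predicted: List[Any],
    is_valid: Callable[[Any, Any], bool],
) -> List[SketchResult]:
    """Apply one validation rule to each golden/predicted pair."""
    if len(golden) != len(predicted):
        raise ValueError("golden and predicted must have same length")
    return [
        SketchResult(correct=is_valid(g, p), golden=g, predicted=p)
        for g, p in zip(golden, predicted)
    ]


# Inputs can be any type; the rule decides how to interpret them.
results = evaluate_pairs(["a", "b"], ["a", "c"], lambda g, p: g == p)
print([r.correct for r in results])  # [True, False]
```

Passing the rule in as a callable is only a device to keep the sketch short; in the library each rule lives in its own metric class.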
Unit Metrics Reference
| Metric | Purpose | Typical Inputs | Returns | Notes |
|---|---|---|---|---|
| ExactMatchMetric | Exact string equality check | dict/string/any | List[UnitMetricResult] | Supports case sensitivity + whitespace normalization; optional key extraction |
| ActionWithinBboxMetric | Validate predicted (x, y) inside golden bbox | dicts with bbox + coordinates | List[UnitMetricResult] | Bbox expects x, y, width, height |
| TypedValueMatchMetric | Validate typed value using exact + fuzzy matching | dicts with typed strings | List[UnitMetricResult] | Returns True if exact or fuzzy passes |
| ScrollDirectionMetric | Validate scroll direction matches golden | dicts with direction | List[UnitMetricResult] | Valid directions: up/down/left/right |
| ScrollAmountMetric | Validate scroll amount within tolerance | dicts with numeric amount | List[UnitMetricResult] | Percentage tolerance; special-case for golden 0 |
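Two of the checks in the table come down to simple geometry and arithmetic. The sketch below shows one plausible reading of each. The helper names, the inclusive bbox edges, and the use of an absolute threshold when the golden amount is 0 are all assumptions, not confirmed library behavior.

```python
def point_in_bbox(x: float, y: float, bbox: dict) -> bool:
    """True if (x, y) lies inside a bbox given as x, y, width, height.

    Assumption: points on the edge count as inside.
    """
    left, top = bbox["x"], bbox["y"]
    right, bottom = left + bbox["width"], top + bbox["height"]
    return left <= x <= right and top <= y <= bottom


def amount_within_tolerance(
    golden: float,
    predicted: float,
    tolerance_percent: float = 20.0,
    zero_threshold: float = 10.0,
) -> bool:
    """Percentage tolerance check with a special case for golden == 0.

    Assumption: a percentage of 0 is always 0, so when golden is 0 the
    prediction is compared against an absolute threshold instead.
    """
    if golden == 0:
        return abs(predicted) <= zero_threshold
    allowed = abs(golden) * tolerance_percent / 100.0
    return abs(predicted - golden) <= allowed


box = {"x": 10, "y": 20, "width": 30, "height": 10}
print(point_in_bbox(15, 25, box))         # True
print(point_in_bbox(50, 25, box))         # False
print(amount_within_tolerance(100, 115))  # True
print(amount_within_tolerance(100, 130))  # False
print(amount_within_tolerance(0, 5))      # True
```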
Basic Usage
Unit Metrics
from sygra.core.eval.metrics.unit_metrics.exact_match import ExactMatchMetric
metric = ExactMatchMetric(
case_sensitive=False,
normalize_whitespace=True,
key="text",
)
results = metric.evaluate(
golden=[{"text": "Hello World"}, {"text": "Foo"}],
predicted=[{"text": "hello world"}, {"text": "bar"}],
)
# results = [
# UnitMetricResult(correct=True, golden={...}, predicted={...}, metadata={...}),
# UnitMetricResult(correct=False, golden={...}, predicted={...}, metadata={...})
# ]
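The two normalization options can be thought of as a preprocessing step applied to both sides before comparison. The sketch below illustrates that idea under assumed semantics (lowercasing when `case_sensitive` is off, collapsing runs of whitespace when `normalize_whitespace` is on). `normalize` and `exact_match` are hypothetical helpers, not library functions.

```python
def normalize(text: str, case_sensitive: bool, normalize_whitespace: bool) -> str:
    """Apply the optional normalization steps to one string."""
    if normalize_whitespace:
        # Collapse runs of whitespace and trim both ends.
        text = " ".join(text.split())
    if not case_sensitive:
        text = text.lower()
    return text


def exact_match(
    golden: object,
    predicted: object,
    case_sensitive: bool = True,
    normalize_whitespace: bool = False,
) -> bool:
    """Compare two values as strings after the same normalization."""
    g = normalize(str(golden), case_sensitive, normalize_whitespace)
    p = normalize(str(predicted), case_sensitive, normalize_whitespace)
    return g == p


print(exact_match("Hello World", "hello   world ",
                  case_sensitive=False, normalize_whitespace=True))  # True
print(exact_match("Hello World", "hello world"))                     # False
```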
How Unit Metrics and Aggregator Metrics Work Together
from sygra.core.eval.metrics.unit_metrics.exact_match import ExactMatchMetric
from sygra.core.eval.metrics.aggregator_metrics.accuracy import AccuracyMetric
validator = ExactMatchMetric(key="tool")
unit_results = validator.evaluate(
golden=[{"tool": "click"}, {"tool": "type"}],
predicted=[{"tool": "click"}, {"tool": "scroll"}],
)
accuracy = AccuracyMetric()
print(accuracy.calculate(unit_results))
# Output: {'accuracy': 0.5}
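The `0.5` above is the fraction of results whose `correct` flag is True: one of the two predictions matched. A minimal sketch of that aggregation, assuming accuracy is simply correct / total. `accuracy_from_results` is a hypothetical helper, and returning 0.0 for an empty list is an assumption.

```python
from types import SimpleNamespace


def accuracy_from_results(results) -> dict:
    """Fraction of results with correct == True, in a one-key dict."""
    total = len(results)
    if total == 0:
        return {"accuracy": 0.0}  # assumption: empty input scores 0.0
    n_correct = sum(1 for r in results if r.correct)
    return {"accuracy": n_correct / total}


# SimpleNamespace stands in for UnitMetricResult; only .correct is read.
unit_results = [SimpleNamespace(correct=True), SimpleNamespace(correct=False)]
print(accuracy_from_results(unit_results))  # {'accuracy': 0.5}
```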
Creating UnitMetricResult
from sygra.core.eval.metrics.unit_metrics.unit_metric_result import UnitMetricResult
result = UnitMetricResult(
correct=True,
golden={"event": "click"},
predicted={"tool": "click"},
metadata={"step_id": 1},
)
Initialization Patterns
Direct Import (Recommended)
from sygra.core.eval.metrics.unit_metrics.typed_value_match import TypedValueMatchMetric
metric = TypedValueMatchMetric(
golden_text_key="text",
predicted_text_key="text",
fuzzy_match_threshold=0.8,
)
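To give a feel for what `fuzzy_match_threshold` controls, the sketch below combines an exact check with a similarity score. The similarity measure the library uses is not stated in this reference; `difflib.SequenceMatcher` from the standard library is swapped in purely for illustration, and `typed_value_matches` is a hypothetical helper.

```python
from difflib import SequenceMatcher


def typed_value_matches(
    golden: str,
    predicted: str,
    fuzzy_match_threshold: float = 0.8,
) -> bool:
    """Pass if the strings are equal or similar enough."""
    if golden == predicted:
        return True
    # ratio() is in [0, 1]; 1.0 means identical sequences.
    similarity = SequenceMatcher(None, golden, predicted).ratio()
    return similarity >= fuzzy_match_threshold


print(typed_value_matches("hello world", "hello world"))  # True (exact)
print(typed_value_matches("hello world", "helo world"))   # True (fuzzy)
print(typed_value_matches("hello world", "goodbye"))      # False
```

A higher threshold makes the fuzzy branch stricter; at 1.0 only strings that are identical under the similarity measure pass.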
From Config Dict
from sygra.core.eval.metrics.unit_metrics.scroll_amount import ScrollAmountMetric
config = {
"tolerance_percent": 20.0,
"scroll_threshold": 10.0,
}
metric = ScrollAmountMetric(**config)
Parameter Validation
Each unit metric defines its own Pydantic config class in the same module. Validation errors come from Pydantic and surface as ValidationError.
✅ Valid Initialization
from sygra.core.eval.metrics.unit_metrics.action_within_bbox import ActionWithinBboxMetric
ActionWithinBboxMetric(predicted_x_key="x", predicted_y_key="y", golden_bbox_key="bbox")
❌ Invalid Initialization
from sygra.core.eval.metrics.unit_metrics.typed_value_match import TypedValueMatchMetric
# threshold out of range -> ValidationError
TypedValueMatchMetric(fuzzy_match_threshold=1.5)
Common Issues
| Symptom | Cause | Solution |
|---|---|---|
| `ValueError: golden and predicted must have same length` | Input list lengths mismatch | Ensure both lists align per-item |
| Always returns `correct=False` | Key mismatch or unexpected input shape | Confirm keys (`*_key`) match your data |
| Fuzzy match passes unexpectedly | Threshold too low | Increase `fuzzy_match_threshold` |
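The first two symptoms can be caught before calling `evaluate`. The pre-flight check below is a sketch: `preflight` is a hypothetical helper, and it assumes dict inputs that are read through a single lookup key.

```python
from typing import Any, Dict, List


def preflight(golden: List[Any], predicted: List[Any], key: str) -> Dict[str, List[int]]:
    """Raise on a length mismatch, then list the indices where `key` is absent."""
    if len(golden) != len(predicted):
        raise ValueError(
            f"length mismatch: {len(golden)} golden vs {len(predicted)} predicted"
        )

    def missing(items: List[Any]) -> List[int]:
        # An item is flagged if it is not a dict or lacks the key.
        return [
            i for i, item in enumerate(items)
            if not isinstance(item, dict) or key not in item
        ]

    return {"golden": missing(golden), "predicted": missing(predicted)}


report = preflight(
    golden=[{"text": "a"}, {"text": "b"}],
    predicted=[{"text": "a"}, {"txt": "b"}],  # note the misspelled key
    key="text",
)
print(report)  # {'golden': [], 'predicted': [1]}
```

A non-empty index list points at the items most likely to produce a blanket `correct=False`.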
Configuration Architecture
Each metric file is self-contained and includes:
- A Pydantic config model (*MetricConfig)
- The metric implementation (*Metric)
- Helper methods for normalization / comparison
This keeps metrics modular and makes it easy to add additional validators without modifying shared config classes.
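As a rough picture of that layout, the sketch below places a config, a metric class, and a normalization helper in one module. It is stdlib-only so that it runs on its own: a dataclass stands in for the Pydantic config model, `SketchResult` stands in for `UnitMetricResult`, and `PrefixMatchMetric` is an invented validator that does not exist in the library. Any base class or registration step the library requires is not shown.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class SketchResult:
    """Hypothetical stand-in for UnitMetricResult."""
    correct: bool
    golden: Any
    predicted: Any
    metadata: Dict[str, Any] = field(default_factory=dict)


@dataclass
class PrefixMatchMetricConfig:
    """Stand-in for a Pydantic config model (*MetricConfig)."""
    key: str = "text"
    min_prefix_len: int = 3

    def __post_init__(self) -> None:
        # Plays the role that field constraints play in Pydantic.
        if self.min_prefix_len < 1:
            raise ValueError("min_prefix_len must be >= 1")


class PrefixMatchMetric:
    """Invented validator: passes if both values share a leading prefix."""

    def __init__(self, **kwargs: Any) -> None:
        self.config = PrefixMatchMetricConfig(**kwargs)

    @staticmethod
    def _normalize(value: Any) -> str:
        """Helper for normalization: trim and lowercase."""
        return str(value).strip().lower()

    def evaluate(self, golden: List[dict], predicted: List[dict]) -> List[SketchResult]:
        if len(golden) != len(predicted):
            raise ValueError("golden and predicted must have same length")
        n = self.config.min_prefix_len
        results = []
        for g, p in zip(golden, predicted):
            g_text = self._normalize(g.get(self.config.key, ""))
            p_text = self._normalize(p.get(self.config.key, ""))
            ok = len(g_text) >= n and g_text[:n] == p_text[:n]
            results.append(SketchResult(ok, g, p, {"prefix_len": n}))
        return results


metric = PrefixMatchMetric(min_prefix_len=4)
out = metric.evaluate([{"text": "Submit form"}], [{"text": "submit button"}])
print(out[0].correct)  # True
```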