check_grafana_alert_rule() in grader.py only detects the >= 5 threshold
when it appears inline in the PromQL expr field (e.g. "expr": "pg_lock_wait_time_seconds > 5").
solution.sh creates alerts this way, so test-solution passes. But agents
universally create alerts using Grafana's standard multi-step format, where the
threshold lives in a separate step -- not in the expr string. As a result, the
observability subscore fails spuriously in all 8 of 8 eval rollouts.
Format 1 -- conditions/evaluator (classic Grafana)

    "model": {
      "conditions": [{"evaluator": {"type": "gt", "params": [5]}}]
    }

Format 2 -- expression-based threshold (most common in evals)

    {"refId": "A", "model": {"expr": "pg_lock_wait_time_seconds"}},
    {"refId": "B", "model": {"expression": "A", "type": "reduce"}},
    {"refId": "C", "model": {"expression": "$B >= 5", "type": "threshold"}}

Format 3 -- math expression

    {"refId": "C", "model": {"expression": "$A > 5", "type": "math"}}

All three are semantically equivalent to pg_lock_wait_time_seconds >= 5.
The original grader only checked model.expr and missed all three.
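
To make the failure mode concrete, here is a minimal sketch of an inline-only check. The function name `inline_only_check`, the regex, and the sample payloads are illustrative assumptions; the actual code in grader.py may look different. It only shows why a check limited to model.expr accepts a solution.sh-style alert and rejects a multi-step one.

```python
import re

# Hypothetical stand-in for the grader's inline regex (assumption, not the
# real pattern): matches "> 5" or ">= 5", but not "> 50" or "> 5.5".
INLINE_THRESHOLD = re.compile(r">=?\s*5(?:\.0+)?(?![\d.])")


def inline_only_check(steps):
    """Return True only if some step's model.expr carries the threshold inline."""
    return any(
        INLINE_THRESHOLD.search(step.get("model", {}).get("expr", ""))
        for step in steps
    )


# Alert shaped like the inline example: threshold inside the PromQL string.
solution_style = [
    {"refId": "A", "model": {"expr": "pg_lock_wait_time_seconds > 5"}},
]

# Alert shaped like Format 2: threshold in a separate expression step.
agent_style = [
    {"refId": "A", "model": {"expr": "pg_lock_wait_time_seconds"}},
    {"refId": "B", "model": {"expression": "A", "type": "reduce"}},
    {"refId": "C", "model": {"expression": "$B >= 5", "type": "threshold"}},
]

print(inline_only_check(solution_style))  # True
print(inline_only_check(agent_style))     # False
```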
- 8/8 eval rollouts scored observability=0.0
- All 8 agents correctly created alerts with pg_lock_wait metric and threshold of 5
- Grader logged: "Alert does not include threshold >= 5 seconds in expressions"
- Local eval with fix applied: agent scored deadlock_prevention=1.0 but still observability=0.0 because the first version of the fix only covered Format 1 (conditions/evaluator), not Format 2 (expression-based threshold)
The patch adds detection for all three formats by inspecting:
- model.conditions[*].evaluator.params for {type: "gt"/"gte", params: [5]}
- model.expression when model.type is "threshold" or "math", applying the same >= 5 regex that already works for inline PromQL
The original inline-PromQL check (model.expr) is preserved for backward
compatibility with solution.sh.
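
The detection order described above could be sketched roughly as follows. This is not the patch itself: `model_has_threshold`, `rule_has_threshold`, and `THRESHOLD_RE` are hypothetical names, and the regex is an assumed approximation of the grader's existing >= 5 pattern.

```python
import re

# Assumed approximation of the existing threshold regex: accepts "> 5" and
# ">= 5" (optionally "5.0"), rejects other values such as "> 50" or "> 5.5".
THRESHOLD_RE = re.compile(r">=?\s*5(?:\.0+)?(?![\d.])")


def model_has_threshold(model):
    """Check one query/expression model for a threshold of 5."""
    # Original behaviour: threshold inline in the PromQL expr.
    if THRESHOLD_RE.search(str(model.get("expr", ""))):
        return True

    # Format 1: classic conditions/evaluator block.
    for cond in model.get("conditions") or []:
        evaluator = cond.get("evaluator") or {}
        params = evaluator.get("params") or []
        if evaluator.get("type") in ("gt", "gte") and params and params[0] == 5:
            return True

    # Formats 2 and 3: threshold or math expression step.
    if model.get("type") in ("threshold", "math"):
        return bool(THRESHOLD_RE.search(str(model.get("expression", ""))))

    return False


def rule_has_threshold(steps):
    """A rule passes if any of its steps carries the threshold."""
    return any(model_has_threshold(step.get("model") or {}) for step in steps)
```

Checking the inline expr first keeps a solution.sh-style alert on the same code path as before, so the added branches can only turn previously failing rules into passing ones.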