@arubis
Created February 13, 2026 01:41
fix: distributed-transaction-deadlock grader -- detect alert threshold in all Grafana multi-step alert formats

fix: detect alert threshold in all Grafana multi-step alert formats

Problem

check_grafana_alert_rule() in grader.py only detects the >= 5 threshold when it appears inline in the PromQL expr field (e.g. "expr": "pg_lock_wait_time_seconds > 5").

solution.sh creates alerts this way, so test-solution passes. Agents, however, create alerts in Grafana's standard multi-step format, where the threshold lives in a separate step rather than in the expr string. The grader therefore scores observability=0.0 on all 8 eval rollouts, even though the alerts themselves are correct.

How agents create alerts (3 observed formats)

Format 1 -- conditions/evaluator (classic Grafana)

"model": {
  "conditions": [{"evaluator": {"type": "gt", "params": [5]}}]
}

Format 2 -- expression-based threshold (most common in evals)

{"refId": "A", "model": {"expr": "pg_lock_wait_time_seconds"}},
{"refId": "B", "model": {"expression": "A", "type": "reduce"}},
{"refId": "C", "model": {"expression": "$B >= 5", "type": "threshold"}}

Format 3 -- math expression

{"refId": "C", "model": {"expression": "$A > 5", "type": "math"}}

All three express the same intent as the inline form: fire when pg_lock_wait_time_seconds crosses the 5-second threshold (with > or >=, depending on the format). The original grader only checked model.expr and missed all three.
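To make the miss concrete, here is a minimal sketch that mimics the pre-fix behaviour on the formats above. It reuses the threshold regex from grader.py; the function name `inline_threshold_found` and the sample payloads are illustrative assumptions, not code from the grader.

```python
import re

# Same pattern the grader applies to inline PromQL comparisons.
THRESHOLD_PATTERN = re.compile(r'[>]=?\s*([0-9]+\.?[0-9]*)')


def inline_threshold_found(data):
    """Hypothetical helper: look only at model.expr, as the pre-fix check did."""
    for query in data:
        expr = str(query.get("model", {}).get("expr", ""))
        for match in THRESHOLD_PATTERN.findall(expr):
            if float(match) >= 5.0:
                return True
    return False


# Alert shaped like the one solution.sh creates: threshold inline in expr.
inline_style = [
    {"refId": "A", "model": {"expr": "pg_lock_wait_time_seconds > 5"}},
]

# Alert shaped like Format 2: threshold lives in a separate step.
multi_step = [
    {"refId": "A", "model": {"expr": "pg_lock_wait_time_seconds"}},
    {"refId": "B", "model": {"expression": "A", "type": "reduce"}},
    {"refId": "C", "model": {"expression": "$B >= 5", "type": "threshold"}},
]

print(inline_threshold_found(inline_style))  # True
print(inline_threshold_found(multi_step))    # False
```

The second call returns False because no step carries the comparison in model.expr, which matches the "Alert does not include threshold" message the grader logged.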

Evidence

  • 8/8 eval rollouts scored observability=0.0
  • All 8 agents correctly created alerts with pg_lock_wait metric and threshold of 5
  • Grader logged: "Alert does not include threshold >= 5 seconds in expressions"
  • Local eval with the first version of the fix applied: the agent scored deadlock_prevention=1.0 but still observability=0.0, because that version only covered Format 1 (conditions/evaluator), not Format 2 (expression-based threshold)

Fix

The patch adds detection for all three formats by inspecting:

  1. model.conditions[*].evaluator, accepting type "gt" or "gte" when params[0] is at least 5 (Format 1)
  2. model.expression when model.type is "threshold" or "math", applying the same threshold regex that already works for inline PromQL (Formats 2 and 3)

The original inline-PromQL check (model.expr) is preserved for backward compatibility with solution.sh.
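The combined check can be sketched as a standalone per-step helper. This is a condensed illustration under assumptions, not the patched grader code: `model_has_threshold`, `_regex_thresholds`, and the `minimum` parameter are hypothetical names introduced here.

```python
import re

# Threshold regex as used in grader.py for inline PromQL comparisons.
THRESHOLD_PATTERN = re.compile(r'[>]=?\s*([0-9]+\.?[0-9]*)')


def _regex_thresholds(text):
    """Return every number that follows a > or >= comparison in text."""
    return [float(m) for m in THRESHOLD_PATTERN.findall(text)]


def model_has_threshold(model, minimum=5.0):
    """Hypothetical helper: True if one alert step encodes a threshold >= minimum."""
    # Inline PromQL, e.g. "pg_lock_wait_time_seconds > 5".
    candidates = _regex_thresholds(str(model.get("expr", "")))

    # Format 1: conditions/evaluator with type gt or gte.
    for cond in model.get("conditions", []):
        evaluator = cond.get("evaluator", {})
        params = evaluator.get("params", [])
        if evaluator.get("type") in ("gt", "gte") and params:
            try:
                candidates.append(float(params[0]))
            except (TypeError, ValueError):
                pass  # Non-numeric param: ignore it rather than fail the check.

    # Formats 2 and 3: expression field on threshold or math steps.
    if model.get("type") in ("threshold", "math"):
        candidates += _regex_thresholds(str(model.get("expression", "")))

    return any(value >= minimum for value in candidates)


print(model_has_threshold({"conditions": [{"evaluator": {"type": "gt", "params": [5]}}]}))  # True
print(model_has_threshold({"expression": "$B >= 5", "type": "threshold"}))  # True
print(model_has_threshold({"expression": "A", "type": "reduce"}))  # False
```

An alert passes when any of its steps returns True, so a caller would apply the helper to each query's model in turn.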

diff --git a/grader.py b/grader.py
index faccd5a..3a87dea 100644
--- a/grader.py
+++ b/grader.py
@@ -392,37 +392,72 @@ def check_grafana_alert_rule(alert_name=EXPECTED_GRAFANA_ALERT_TITLE):
for query in data:
model = query.get("model", {})
expr = str(model.get("expr", ""))
-
+
# Check for PostgreSQL lock wait metric (case insensitive)
# Accept pg_lock_wait metrics that directly measure lock wait time
if "pg_lock_wait" in expr.lower():
has_pg_lock_metric = True
-
- # Check for threshold of ~5 seconds (accept 4-6 range with or without decimals)
- # Using regex to handle various forms:
- # - "> 5", ">5", "> 5.0", ">5.0"
- # - ">= 5", ">=5", ">= 5.0", ">=5.0"
- # - "> 4", ">4", "> 4.5", etc.
- # Handles whitespace variations and decimal numbers
+
+ # Check for threshold of ~5 seconds in inline PromQL comparison
+ # Handles forms like: "> 5", ">5", ">= 5.0", etc.
threshold_pattern = re.compile(r'[>]=?\s*([0-9]+\.?[0-9]*)')
matches = threshold_pattern.findall(expr)
for match in matches:
try:
threshold_value = float(match)
- # Accept threshold >= 5.0 seconds as specified in task.yaml (>5 seconds)
if threshold_value >= 5.0:
has_threshold = True
break
except ValueError:
continue
-
+
+ # Also check Grafana's multi-step alert formats. Agents create
+ # alerts in several valid ways:
+ #
+ # Format 1 (conditions/evaluator): threshold in
+ # model.conditions[*].evaluator {type: "gt", params: [5]}
+ #
+ # Format 2 (expression-based threshold): threshold in
+ # model.expression e.g. "$B >= 5" with model.type = "threshold"
+ #
+ # Format 3 (math expression): threshold in
+ # model.expression e.g. "$A > 5" with model.type = "math"
+
+ # Format 1: conditions/evaluator params
+ conditions = model.get("conditions", [])
+ for cond in conditions:
+ evaluator = cond.get("evaluator", {})
+ eval_type = evaluator.get("type", "")
+ eval_params = evaluator.get("params", [])
+ if eval_type in ("gt", "gte") and eval_params:
+ try:
+ threshold_value = float(eval_params[0])
+ if threshold_value >= 5.0:
+ has_threshold = True
+ except (ValueError, TypeError, IndexError):
+ continue
+
+ # Format 2 & 3: expression field (e.g. "$B >= 5", "$A > 5")
+ expression = str(model.get("expression", ""))
+ if expression and model.get("type") in ("threshold", "math"):
+ threshold_pattern = re.compile(r'[>]=?\s*([0-9]+\.?[0-9]*)')
+ matches = threshold_pattern.findall(expression)
+ for match in matches:
+ try:
+ threshold_value = float(match)
+ if threshold_value >= 5.0:
+ has_threshold = True
+ break
+ except ValueError:
+ continue
+
# Validation: alert must reference PostgreSQL lock wait metric AND have correct threshold
# Task requirements explicitly specify "lock wait time exceeds 5 seconds"
if not has_pg_lock_metric:
print(" Alert does not reference a pg_lock_wait* metric in its expressions")
return False
if not has_threshold:
- print(" Alert does not include threshold >= 5 seconds in expressions")
+ print(" Alert does not include threshold >= 5 seconds in expressions or evaluator params")
return False
return True