Synthetic Endpoint Monitoring - Hardening Suggestions

Task ID: a6b6b25b-fbdf-4830-bd13-258c6bfd9948
Current Version: v32
Date: 2026-02-04

Context

After fixing the grader bugs (broken self-parameter methods), the task now passes test-solution with a perfect score. The agent pass rate is expected to increase with the fixed grader; it may or may not exceed the 70% threshold.

This document is a contingency plan: the hardening suggestions below should be pursued only if the task proves too easy after the grader fix. If the task stays within the acceptable range (pass rate < 70%), no further changes are needed.

The options below are evaluated by implementation effort and expected effectiveness, and are prioritized in case hardening becomes necessary.

Agent RBAC Constraints (Verified)

Before suggesting discovery paths, we must understand what agents can actually access:

| Resource | Access | Notes |
| --- | --- | --- |
| crictl images | ✅ YES | Host command, not RBAC restricted |
| kubectl get ingress -A | ❌ NO | No cluster-wide ingress permission |
| kubectl get namespaces | ❌ NO | No namespace list permission |
| ConfigMaps in observability | ✅ YES | Task-specific RBAC grant |
| Pods/Services in observability | ❌ NO | Only ConfigMap access granted |
| Wiki content (Gitea HTTP) | ✅ YES | HTTP accessible at gitea.devops.local |
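
These results can be re-verified with kubectl auth can-i impersonating the agent's identity. The sketch below assumes the agent runs as a ServiceAccount named agent in the observability namespace and that the caller has impersonation rights; both are assumptions and should be adjusted to the actual environment.

import subprocess

# Hypothetical ServiceAccount identity; replace with the one the agent actually uses.
AGENT_SA = "system:serviceaccount:observability:agent"

CHECKS = [
    ("get configmaps in observability", ["get", "configmaps", "-n", "observability"]),
    ("get pods in observability", ["get", "pods", "-n", "observability"]),
    ("get ingresses cluster-wide", ["get", "ingresses", "--all-namespaces"]),
    ("list namespaces", ["list", "namespaces"]),
]

def can_i(verb_and_resource):
    """Run `kubectl auth can-i` while impersonating the agent ServiceAccount."""
    result = subprocess.run(
        ["kubectl", "auth", "can-i", *verb_and_resource, f"--as={AGENT_SA}"],
        capture_output=True,
        text=True,
    )
    # kubectl prints "yes" or "no"; it also exits non-zero for "no".
    return result.stdout.strip() == "yes"

if __name__ == "__main__":
    for label, args in CHECKS:
        print(f"{label}: {'YES' if can_i(args) else 'NO'}")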

Effort vs Effectiveness Matrix

quadrantChart
    title Hardening Options: Effort vs Effectiveness
    x-axis Low Effort --> High Effort
    y-axis Low Effectiveness --> High Effectiveness
    quadrant-1 High Impact, More Work
    quadrant-2 Sweet Spot
    quadrant-3 Low Priority
    quadrant-4 Avoid

    "Remove version hints": [0.25, 0.75]
    "Require SLO burn rate": [0.45, 0.80]
    "Multi-protocol modules": [0.40, 0.70]
    "Stricter dashboard": [0.55, 0.72]
    "Recording rules": [0.50, 0.55]
    "Alert severities": [0.45, 0.50]
    "HA deployment": [0.55, 0.50]
    "Retention config": [0.15, 0.25]
    "Decoy services": [0.80, 0.35]
    "RBAC/ServiceAccount": [0.70, 0.40]

Key Principles

1. Discovery vs Guessing

Good hardening requires agents to perform more discovery steps, not guess at hidden information.

2. Technical Complexity

Tasks can require genuinely harder configurations that test real DevOps expertise.

| Approach | Type | Verdict |
| --- | --- | --- |
| Remove wiki entirely | Guessing | Bad |
| Remove version hints only | Discovery (crictl) | Good |
| Require SLO burn rate alerts | Technical complexity | Good |
| Require multi-protocol blackbox | Technical complexity | Good |
| Add endpoints via Ingress discovery | Requires missing RBAC | Won't work |

Recommended Hardening (Priority Order)

1. Remove Image Version Hints from Wiki

| Effort | Effectiveness | Type |
| --- | --- | --- |
| Low | High | Discovery |

Discovery Path: crictl images | grep -E "prometheus|blackbox|grafana" ✅ Verified

Currently the wiki tells agents exactly which versions to use. Remove this hint and let agents discover preloaded images via containerd.

Change required:

# setup.sh - wiki content
- Use the **most recent version** of each image.
-     crictl images | grep -E \"prometheus|blackbox|grafana\"
+ Container images are preloaded in the air-gapped environment.
+ Agents must discover available versions using standard container tooling.
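
If the hint is removed, the grader could additionally confirm that the agent pinned a preloaded image tag rather than guessing :latest. A sketch reusing the existing sh() helper follows; the deployment names are assumptions about the environment and would need to match the actual manifests.

def check_images_pinned():
    """Sketch: verify workloads pin an explicit image tag instead of guessing :latest.

    Deployment names below are assumptions; adjust to the actual manifests.
    """
    for deploy in ("prometheus", "blackbox-exporter", "grafana"):
        code, out, _ = sh(
            f"kubectl get deployment {deploy} -n observability "
            "-o jsonpath='{.spec.template.spec.containers[*].image}'"
        )
        if code != 0:
            return False, f"Deployment {deploy} not found"
        # Reject an explicit :latest tag or a missing tag (implicit latest)
        if ":latest" in out or ":" not in out:
            return False, f"{deploy} must pin a preloaded image tag (not :latest)"
    return True, "All workloads pin explicit, preloaded image tags"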

2. Require SLO-Based Burn Rate Alerts

| Effort | Effectiveness | Type |
| --- | --- | --- |
| Medium | High | Technical complexity |

This requires understanding multi-window burn rates: instead of simple "probe failed for 2m" alerts, require proper SLO-based alerting with burn-rate calculations. (With the usual 30-day SLO window, the 14.4x fast-burn threshold corresponds to exhausting the error budget in roughly two days, while a sustained 1x burn rate consumes it exactly over the window.)

Example required configuration:

groups:
  - name: synthetic-slo
    rules:
      # Fast burn: 14.4x error budget consumption
      - alert: SyntheticProbeHighBurnRate
        expr: |
          (
            1 - avg_over_time(probe_success{job="blackbox"}[5m])
          ) / (1 - 0.99) > 14.4
        for: 2m
        labels:
          severity: critical
        annotations:
          description: "High error budget burn rate detected"

      # Slow burn: 1x error budget consumption over 6h
      - alert: SyntheticProbeLowBurnRate
        expr: |
          (
            1 - avg_over_time(probe_success{job="blackbox"}[6h])
          ) / (1 - 0.99) > 1
        for: 1h
        labels:
          severity: warning

Grader check:

def check_slo_burn_rate_alerts():
    """Verify alerts use SLO burn rate pattern."""
    code, out, _ = sh(
        "kubectl get configmap prometheus-config "
        "-n observability -o yaml"
    )
    if code != 0:
        return False, "Prometheus config not readable"

    # Check for burn rate pattern
    has_burn_rate = (
        'error budget' in out.lower() or
        '14.4' in out or  # Fast burn multiplier
        '1 - avg_over_time' in out or
        'burn' in out.lower()
    )

    if not has_burn_rate:
        return False, "Alerts should use SLO burn rate pattern, not simple threshold"

    return True, "Alerts use proper SLO burn rate calculations"

3. Require Multi-Protocol Blackbox Modules

| Effort | Effectiveness | Type |
| --- | --- | --- |
| Medium | High | Technical complexity |

This requires understanding protocol-specific probe configuration: require different blackbox modules for different protocols, with proper TLS validation where appropriate.

Required modules:

  • http_2xx for HTTP endpoints with TLS validation
  • tcp_connect for raw TCP endpoints (K8s API)
  • icmp for network layer checks (optional bonus)

Current check already validates this but could be stricter:

def check_blackbox_modules():
    """Verify correct Blackbox modules used for each protocol."""
    # ... existing code ...

    # NEW: Require explicit TLS verification for HTTPS targets
    if 'https://' in out:
        if 'tls_config' not in out and 'insecure_skip_verify: false' not in out:
            return False, "HTTPS targets should have explicit TLS verification config"

    return True, "Blackbox modules correctly matched to target protocols"

4. Stricter Dashboard Validation

Effort Effectiveness Type
Medium High Technical complexity

Discovery Path: Standard Grafana/Prometheus patterns (documentation)

Require dashboards to include:

  1. Availability percentage panel (not just raw probe_success)
  2. Per-endpoint breakdown
  3. Response time histogram or percentiles

Example required query:

avg_over_time(probe_success{job="blackbox"}[1h]) * 100

Change required:

# grader.py - check_grafana_dashboard_semantics()
def check_grafana_dashboard_semantics():
    # ... existing checks ...

    # NEW: Check for percentage calculation
    has_percentage = (
        '* 100' in out or
        '*100' in out or
        'percentage' in out.lower() or
        '100 *' in out
    )
    if not has_percentage:
        return False, "Dashboard should show availability as percentage"

    # NEW: Check for response time metrics
    has_latency = (
        'probe_duration' in out or
        'probe_http_duration' in out or
        'duration_seconds' in out
    )
    if not has_latency:
        return False, "Dashboard should include response time metrics"

    return True, "Dashboard meets visualization requirements"

5. Require Recording Rules

| Effort | Effectiveness | Type |
| --- | --- | --- |
| Medium | Medium | Technical complexity |

This requires understanding Prometheus recording rules. Require recording rules that pre-compute availability metrics:

groups:
  - name: synthetic-recording
    rules:
      - record: probe:availability:5m
        expr: avg_over_time(probe_success[5m])
      - record: probe:availability:1h
        expr: avg_over_time(probe_success[1h])
      - record: probe:latency_p99:5m
        # blackbox_exporter exposes probe_duration_seconds as a gauge (no histogram buckets),
        # so take the percentile over raw samples instead of using histogram_quantile
        expr: quantile_over_time(0.99, probe_duration_seconds[5m])
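
If options 4 and 5 are both adopted, the dashboard panels can query the pre-computed series (e.g. probe:availability:1h * 100) instead of recomputing avg_over_time at render time, which also gives the grader a natural way to confirm the recording rules are actually used.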

Grader check:

def check_recording_rules():
    """Verify recording rules exist for availability metrics."""
    code, out, _ = sh(
        "kubectl get configmap prometheus-config "
        "-n observability -o yaml"
    )
    if code != 0:
        return False, "Prometheus config not readable"

    # Check for recording rule pattern
    if 'record:' not in out:
        return False, "Prometheus should have recording rules for availability metrics"

    if 'probe:' not in out.lower() and 'availability' not in out.lower():
        return False, "Recording rules should compute availability metrics"

    return True, "Recording rules configured for availability metrics"

Additional Options (If Still Too Easy)

6. Require Multiple Alert Severities

| Effort | Effectiveness |
| --- | --- |
| Medium | Medium |

Require both warning (degraded) and critical (down) alerts:

- alert: SyntheticProbeWarning
  expr: avg_over_time(probe_success[5m]) < 0.99
  for: 5m
  labels:
    severity: warning

- alert: SyntheticProbeCritical
  expr: probe_success == 0
  for: 2m
  labels:
    severity: critical
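
A matching grader check might simply confirm that both severity labels appear in the rules, following the same pattern as the earlier checks:

def check_alert_severities():
    """Sketch: verify both warning and critical alert severities are configured."""
    code, out, _ = sh(
        "kubectl get configmap prometheus-config "
        "-n observability -o yaml"
    )
    if code != 0:
        return False, "Prometheus config not readable"

    for severity in ("warning", "critical"):
        if f"severity: {severity}" not in out:
            return False, f"Missing alert with severity: {severity}"

    return True, "Both warning and critical alert severities configured"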

7. Require HA Deployment

| Effort | Effectiveness |
| --- | --- |
| Medium | Medium |

Require replicas: 2 for Prometheus or blackbox-exporter, with a corresponding PodDisruptionBudget.
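
A hedged grader sketch for this option; the deployment name blackbox-exporter is an assumption and should match whichever workload the HA requirement targets.

def check_ha_deployment():
    """Sketch: verify at least 2 replicas and a PodDisruptionBudget exist.

    The deployment name "blackbox-exporter" is an assumption; adjust as needed.
    """
    code, out, _ = sh(
        "kubectl get deployment blackbox-exporter "
        "-n observability -o jsonpath='{.spec.replicas}'"
    )
    if code != 0 or not out.strip().isdigit() or int(out.strip()) < 2:
        return False, "Expected replicas >= 2 for blackbox-exporter"

    code, out, _ = sh("kubectl get pdb -n observability -o name")
    if code != 0 or not out.strip():
        return False, "Expected a PodDisruptionBudget in the observability namespace"

    return True, "HA deployment with PodDisruptionBudget configured"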


Not Recommended

| Option | Reason |
| --- | --- |
| Remove wiki entirely | No discovery path; becomes guessing |
| Require ingress discovery | Agent RBAC doesn't allow kubectl get ingress -A |
| Decoy services | Ambiguous (which services are "critical"?); leads to guessing |
| RBAC/ServiceAccount validation | High effort; agents often get this implicitly |
| Complex probe depth (timeouts, TLS) | Low signal; easy to copy from docs |

Summary

Start with options 1-4 (verified discovery + technical complexity), then add 5-7 if pass rate remains above 70%.

| Priority | Change | Type | Effort | Expected Impact |
| --- | --- | --- | --- | --- |
| 1 | Remove version hints | Discovery | Low | -10% to -15% pass rate |
| 2 | Require SLO burn rate alerts | Technical | Medium | -15% to -20% pass rate |
| 3 | Multi-protocol blackbox | Technical | Medium | -10% to -15% pass rate |
| 4 | Stricter dashboard validation | Technical | Medium | -10% to -15% pass rate |
| 5+ | Recording rules, alert severities, HA | Technical | Medium | -5% to -10% each |

Key Insight

The most effective hardening combines:

  1. Discovery requirements that use verified accessible paths (crictl, wiki HTTP)
  2. Technical complexity that tests genuine DevOps expertise (SLO burn rates, recording rules)

Avoid hardening that relies on RBAC access the agent doesn't have (kubectl get ingress -A) or pure guessing (removing all documentation).
