Synthetic Endpoint Monitoring - Hardening Suggestions

Task ID: a6b6b25b-fbdf-4830-bd13-258c6bfd9948 Current Version: v32 Date: 2026-02-04

Context

After fixing the grader bugs (broken self-parameter methods), the task now passes test-solution with a perfect score. The agent pass rate is expected to increase with the fixed grader—it may exceed the 70% threshold, but it also might not.

This document is provided as a contingency. These hardening suggestions should only be pursued if the task proves too easy after the grader fix. If the task passes within the acceptable threshold (pass rate < 70%) after fixing the grader, no further changes are needed.

The options below are evaluated by implementation effort and expected effectiveness, prioritized for if/when hardening becomes necessary.

Agent RBAC Constraints (Verified)

Before suggesting discovery paths, we must understand what agents can actually access:

Resource	Access	Notes
`crictl images`	✅ YES	Host command, not RBAC restricted
`kubectl get ingress -A`	❌ NO	No cluster-wide ingress permission
`kubectl get namespaces`	❌ NO	No namespace list permission
ConfigMaps in `observability`	✅ YES	Task-specific RBAC grant
Pods/Services in `observability`	❌ NO	Only ConfigMap access granted
Wiki content (Gitea HTTP)	✅ YES	HTTP accessible at gitea.devops.local

Effort vs Effectiveness Matrix

quadrantChart
    title Hardening Options: Effort vs Effectiveness
    x-axis Low Effort --> High Effort
    y-axis Low Effectiveness --> High Effectiveness
    quadrant-1 High Impact, More Work
    quadrant-2 Sweet Spot
    quadrant-3 Low Priority
    quadrant-4 Avoid

    "Remove version hints": [0.25, 0.75]
    "Require SLO burn rate": [0.45, 0.80]
    "Multi-protocol modules": [0.40, 0.70]
    "Stricter dashboard": [0.55, 0.72]
    "Recording rules": [0.50, 0.55]
    "Alert severities": [0.45, 0.50]
    "HA deployment": [0.55, 0.50]
    "Retention config": [0.15, 0.25]
    "Decoy services": [0.80, 0.35]
    "RBAC/ServiceAccount": [0.70, 0.40]

Key Principles

1. Discovery vs Guessing

Good hardening requires agents to perform more discovery steps, not guess at hidden information.

2. Technical Complexity

Tasks can require genuinely harder configurations that test real DevOps expertise.

Approach	Type	Verdict
Remove wiki entirely	Guessing	Bad
Remove version hints only	Discovery (crictl)	Good
Require SLO burn rate alerts	Technical complexity	Good
Require multi-protocol blackbox	Technical complexity	Good
Add endpoints via Ingress discovery	Requires missing RBAC	Won't work

Recommended Hardening (Priority Order)

1. Remove Image Version Hints from Wiki

Effort	Effectiveness	Type
Low	High	Discovery

Discovery Path: crictl images | grep -E "prometheus|blackbox|grafana" ✅ Verified

Currently the wiki tells agents exactly which versions to use. Remove this hint and let agents discover preloaded images via containerd.

Change required:

# setup.sh - wiki content
- Use the **most recent version** of each image.
-     crictl images | grep -E \"prometheus|blackbox|grafana\"
+ Container images are preloaded in the air-gapped environment.
+ Agents must discover available versions using standard container tooling.

2. Require SLO-Based Burn Rate Alerts

Effort	Effectiveness	Type
Medium	High	Technical complexity

Requires understanding multi-window burn rates. Instead of simple "probe failed for 2m" alerts, require proper SLO-based alerting with burn rate calculations.

Example required configuration:

groups:
  - name: synthetic-slo
    rules:
      # Fast burn: 14.4x error budget consumption
      - alert: SyntheticProbeHighBurnRate
        expr: |
          (
            1 - avg_over_time(probe_success{job="blackbox"}[5m])
          ) / (1 - 0.99) > 14.4
        for: 2m
        labels:
          severity: critical
        annotations:
          description: "High error budget burn rate detected"

      # Slow burn: 1x error budget consumption over 6h
      - alert: SyntheticProbeLowBurnRate
        expr: |
          (
            1 - avg_over_time(probe_success{job="blackbox"}[6h])
          ) / (1 - 0.99) > 1
        for: 1h
        labels:
          severity: warning

Grader check:

def check_slo_burn_rate_alerts():
    """Verify alerts use SLO burn rate pattern."""
    code, out, _ = sh(
        "kubectl get configmap prometheus-config "
        "-n observability -o yaml"
    )
    if code != 0:
        return False, "Prometheus config not readable"

    # Check for burn rate pattern
    has_burn_rate = (
        'error budget' in out.lower() or
        '14.4' in out or  # Fast burn multiplier
        '1 - avg_over_time' in out or
        'burn' in out.lower()
    )

    if not has_burn_rate:
        return False, "Alerts should use SLO burn rate pattern, not simple threshold"

    return True, "Alerts use proper SLO burn rate calculations"

3. Require Multi-Protocol Blackbox Modules

Effort	Effectiveness	Type
Medium	High	Technical complexity

Requires understanding protocol-specific probe configuration. Require different blackbox modules for different protocols, with proper TLS validation where appropriate.

Required modules:

http_2xx for HTTP endpoints with TLS validation
tcp_connect for raw TCP endpoints (K8s API)
icmp for network layer checks (optional bonus)

Current check already validates this but could be stricter:

def check_blackbox_modules():
    """Verify correct Blackbox modules used for each protocol."""
    # ... existing code ...

    # NEW: Require explicit TLS verification for HTTPS targets
    if 'https://' in out:
        if 'tls_config' not in out and 'insecure_skip_verify: false' not in out:
            return False, "HTTPS targets should have explicit TLS verification config"

    return True, "Blackbox modules correctly matched to target protocols"

4. Stricter Dashboard Validation

Effort	Effectiveness	Type
Medium	High	Technical complexity

Discovery Path: Standard Grafana/Prometheus patterns (documentation)

Require dashboards to include:

Availability percentage panel (not just raw probe_success)
Per-endpoint breakdown
Response time histogram or percentiles

Example required query:

avg_over_time(probe_success{job="blackbox"}[1h]) * 100

Change required:

# grader.py - check_grafana_dashboard_semantics()
def check_grafana_dashboard_semantics():
    # ... existing checks ...

    # NEW: Check for percentage calculation
    has_percentage = (
        '* 100' in out or
        '*100' in out or
        'percentage' in out.lower() or
        '100 *' in out
    )
    if not has_percentage:
        return False, "Dashboard should show availability as percentage"

    # NEW: Check for response time metrics
    has_latency = (
        'probe_duration' in out or
        'probe_http_duration' in out or
        'duration_seconds' in out
    )
    if not has_latency:
        return False, "Dashboard should include response time metrics"

    return True, "Dashboard meets visualization requirements"

5. Require Recording Rules

Effort	Effectiveness	Type
Medium	Medium	Technical complexity

Requires understanding Prometheus recording rules. Require Prometheus recording rules for pre-computed availability metrics:

groups:
  - name: synthetic-recording
    rules:
      - record: probe:availability:5m
        expr: avg_over_time(probe_success[5m])
      - record: probe:availability:1h
        expr: avg_over_time(probe_success[1h])
      - record: probe:latency_p99:5m
        expr: histogram_quantile(0.99, sum(rate(probe_duration_seconds_bucket[5m])) by (le, instance))

Grader check:

def check_recording_rules():
    """Verify recording rules exist for availability metrics."""
    code, out, _ = sh(
        "kubectl get configmap prometheus-config "
        "-n observability -o yaml"
    )
    if code != 0:
        return False, "Prometheus config not readable"

    # Check for recording rule pattern
    if 'record:' not in out:
        return False, "Prometheus should have recording rules for availability metrics"

    if 'probe:' not in out.lower() and 'availability' not in out.lower():
        return False, "Recording rules should compute availability metrics"

    return True, "Recording rules configured for availability metrics"

Additional Options (If Still Too Easy)

6. Require Multiple Alert Severities

Effort	Effectiveness
Medium	Medium

Require both warning (degraded) and critical (down) alerts:

- alert: SyntheticProbeWarning
  expr: avg_over_time(probe_success[5m]) < 0.99
  for: 5m
  labels:
    severity: warning

- alert: SyntheticProbeCritical
  expr: probe_success == 0
  for: 2m
  labels:
    severity: critical

7. Require HA Deployment

Effort	Effectiveness
Medium	Medium

Require replicas: 2 for Prometheus or blackbox-exporter with proper PodDisruptionBudget.

Not Recommended

Option	Reason
Remove wiki entirely	No discovery path - becomes guessing
Require ingress discovery	Agent RBAC doesn't allow `kubectl get ingress -A`
Decoy services	Ambiguous - which are "critical"? Leads to guessing
RBAC/ServiceAccount validation	High effort, agents often get this implicitly
Complex probe depth (timeouts, TLS)	Low signal - easy to copy from docs

Summary

Start with options 1-4 (verified discovery + technical complexity), then add 5-7 if pass rate remains above 70%.

Priority	Change	Type	Effort	Expected Impact
1	Remove version hints	Discovery	Low	-10-15% pass rate
2	Require SLO burn rate alerts	Technical	Medium	-15-20% pass rate
3	Multi-protocol blackbox	Technical	Medium	-10-15% pass rate
4	Stricter dashboard validation	Technical	Medium	-10-15% pass rate
5+	Recording rules, alert severities, HA	Technical	Medium	-5-10% each

Key Insight

The most effective hardening combines:

Discovery requirements that use verified accessible paths (crictl, wiki HTTP)
Technical complexity that tests genuine DevOps expertise (SLO burn rates, recording rules)

Avoid hardening that relies on RBAC access the agent doesn't have (kubectl get ingress -A) or pure guessing (removing all documentation).

arubis/synthetic-monitoring-hardening-suggestions.md

Select an option

No results found