@arubis
Created February 5, 2026 00:37
Review: prometheus-observability-stack-failure (2084d83d) — annotation discoverability fix
--- a/setup.sh
+++ b/setup.sh
@@ -69,6 +69,16 @@
 # ------------------------------------------------------------
-# Prometheus discovery label mismatch (fault 4)
+# Prometheus discovery label mismatch (faults 4 & 5)
+# Fault 4: set the observability label to disabled.
+# Fault 5: remove the app label so the bleater-services endpoint
+# discovery job can no longer match these services.
+# This forces annotation-based pod discovery to be the
+# only remaining Prometheus path, making fault 1
+# (removed annotations) functionally visible.
 # ------------------------------------------------------------
-echo "Injecting Prometheus discovery label mismatch..."
+echo "Disabling Prometheus service discovery for affected services..."
 for svc in bleater-timeline-service bleater-profile-service; do
   kubectl label svc "$svc" -n bleater observability=disabled --overwrite
+  kubectl label svc "$svc" -n bleater app- 2>/dev/null || true
 done
--- a/solution.sh
+++ b/solution.sh
@@ -52,7 +52,13 @@
 echo "🔧 Restoring Prometheus discovery labels on Services..."
 for svc in "${SERVICES[@]}"; do
+  SHORT="${svc#bleater-}"
   kubectl label svc "$svc" -n "$BLEATER_NS" observability=enabled --overwrite
+  kubectl label svc "$svc" -n "$BLEATER_NS" "app=${SHORT}" --overwrite
 done
 echo "🧹 Removing restrictive NetworkPolicy..."

Review Notes: prometheus-observability-stack-failure

UUID: 2084d83d-8453-4761-81ae-2b7a34c1b0d3 (v63)
Discord: https://discord.com/channels/1427397917685321919/1453316585300430858

Verdict: NEEDS_WORK

The reference solution passes (1.0). But no agent in 8 rollouts can score above 0.50, due to a task design flaw.

The Problem

Setup removes prometheus.io/* annotations from two deployment pod templates (fault 1). The grader checks that these annotations exist (pod_metrics_live subscore, 25% weight) and gates control_plane_converged (another 25%) on that check passing. So 50% of the score depends on the agent restoring annotations.
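
For reference, the annotations in question are the conventional prometheus.io/* keys on the deployment's pod template. A minimal sketch (the key names follow the common convention; the port and path values are assumptions, not taken from the task's manifests):

spec:
  template:
    metadata:
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8080"      # assumed container port
        prometheus.io/path: "/metrics"  # assumed metrics path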

But Prometheus doesn't use annotations to discover these services. The bleater-services scrape job in prometheus-config.yaml uses endpoint-based discovery filtered by __meta_kubernetes_service_label_app. Once the agent fixes the NetworkPolicy (fault 2), Prometheus targets come back UP. The annotation removal has zero functional impact and produces no observable symptom.
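
To make that concrete, an endpoints-based job of this kind typically keeps targets via a relabel rule on the Service's app label. A rough sketch of the pattern, not copied from the task's prometheus-config.yaml:

- job_name: bleater-services
  kubernetes_sd_configs:
    - role: endpoints   # discovers Service endpoints, not bare pods
  relabel_configs:
    - source_labels: [__meta_kubernetes_service_label_app]
      regex: timeline-service|profile-service   # plus the other bleater services
      action: keep

Nothing in that path reads pod annotations, which is why stripping them produces no visible symptom once traffic flows again.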

Every agent in 8 rollouts:

  1. Finds the deny-all NetworkPolicy
  2. Fixes it
  3. Checks Prometheus -- targets are UP
  4. Declares victory

They're right to do so. The annotations are cosmetically missing but functionally irrelevant.

Score distribution (8 rollouts)

Rollout  Score  network_path  pod_metrics  service_scrape  control_plane
0        0.250  false         0.0          1.0             0.0
1        0.500  true          0.0          1.0             0.0
2        0.000  false         0.0          0.0             0.0
3        0.250  false         0.0          1.0             0.0
4        0.500  true          0.0          1.0             0.0
5        0.000  false         0.0          0.0             0.0
6        0.500  true          0.0          1.0             0.0
7        0.000  false         0.0          0.0             0.0

Mean: 0.25. pod_metrics_live fails 8/8. control_plane_converged fails 8/8 (cascades from pod_metrics).

The Fix

The attached patch adds one line to setup.sh: remove the app label from the two affected Services. The bleater-services Prometheus job filters on __meta_kubernetes_service_label_app matching timeline-service, profile-service, etc. Without that label, endpoint-based discovery stops finding them. The kubernetes-pods annotation-based job is the only remaining path -- and fault 1 already broke that.
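
For contrast, an annotation-driven kubernetes-pods job typically gates on the scrape annotation, along these lines (again a sketch of the pattern, not the task's actual config):

- job_name: kubernetes-pods
  kubernetes_sd_configs:
    - role: pod
  relabel_configs:
    # Only pods annotated prometheus.io/scrape: "true" survive this rule,
    # so fault 1 (stripped annotations) drops the two services here as well.
    - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
      regex: "true"
      action: keep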

Now when the agent fixes the NetworkPolicy and checks Prometheus, these two services are still missing. They have a real symptom to investigate, and restoring the annotations (or the app label) becomes a meaningful fix rather than a cosmetic one.

solution.sh gets a corresponding two-line addition to restore the app label.

Grader, task.yaml, and Dockerfile are unchanged.

Other Issues (not addressed by this patch)

  • NetworkPolicy name leak: deny-monitoring-metrics reveals the root cause in its name. A neutral name would require actual investigation.
  • Namespace label rigidity: 5/8 agents selected the monitoring namespace with name: monitoring instead of the built-in kubernetes.io/metadata.name: monitoring, and the grader only accepts the latter (see the sketch after this list). The agents' label is a genuine mistake, but the failure rate suggests either accepting both forms or hinting at the built-in label.
  • Scope: With 2 services and 4-5 faults, this is closer to 2 hours than 4 for a senior engineer. Previous versions had Istio mTLS complexity that was removed due to sidecar issues. May need additional layers to meet the 4-hour threshold.
  • Overlap: There is an existing prometheus-observability-stack-failure task in the Nebula repo targeting different deployments. These should be reconciled.
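
On the namespace-label point above, the two selector spellings look like this inside an allow policy (an illustrative manifest; the name and other details are not the task's):

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-monitoring-scrape   # illustrative name
  namespace: bleater
spec:
  podSelector: {}
  policyTypes: ["Ingress"]
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              # Built-in label, set automatically on every namespace;
              # the only spelling the grader accepts today.
              kubernetes.io/metadata.name: monitoring
              # 5/8 agents wrote `name: monitoring` instead, which only
              # matches if that custom label was added by hand.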