UUID: 2084d83d-8453-4761-81ae-2b7a34c1b0d3 (v63) Discord: https://discord.com/channels/1427397917685321919/1453316585300430858
Solution passes (1.0). But due to a task design flaw, 0/8 agents scored above 0.50.
The setup script removes the prometheus.io/* annotations from two deployment pod templates (fault 1). The grader checks that these annotations exist (pod_metrics_live subscore, 25% weight) and gates control_plane_converged (another 25%) on that check passing, so 50% of the score depends on the agent restoring the annotations.
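For reference, these are the conventional Prometheus scrape hints on the pod template; a minimal sketch (port and path values are assumptions, not taken from the task):

```yaml
# Deployment pod template -- the annotations that the setup script strips (fault 1).
spec:
  template:
    metadata:
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8080"      # assumed metrics port
        prometheus.io/path: "/metrics"  # assumed metrics path
```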
But Prometheus doesn't use annotations to discover these services. The bleater-services scrape job in prometheus-config.yaml uses endpoint-based discovery filtered by __meta_kubernetes_service_label_app. Once the agent fixes the NetworkPolicy (fault 2), Prometheus targets come back UP. The annotation removal has zero functional impact and produces no observable symptom.
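For context, the bleater-services job described above is roughly this shape; a sketch under assumptions (the exact regex and job options are not taken from prometheus-config.yaml):

```yaml
- job_name: bleater-services
  kubernetes_sd_configs:
    - role: endpoints            # discovery via Service endpoints, not pod annotations
  relabel_configs:
    # Keep only endpoints whose backing Service has a matching `app` label.
    - source_labels: [__meta_kubernetes_service_label_app]
      regex: timeline-service|profile-service   # plus the other services; exact regex assumed
      action: keep
```

Nothing in this discovery path reads prometheus.io/* pod annotations, which is why fault 1 produces no symptom.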
Every agent, in all 8 rollouts:
- Finds the deny-all NetworkPolicy
- Fixes it
- Checks Prometheus -- targets are UP
- Declares victory
They're right to do so. The annotations are cosmetically missing but functionally irrelevant.
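The "targets are UP" check above is quick to make and convincing; a typical spot check, assuming Prometheus runs in the monitoring namespace behind a hypothetical svc/prometheus and that jq is available:

```bash
# Expose Prometheus locally and list any scrape targets that are not healthy.
kubectl -n monitoring port-forward svc/prometheus 9090:9090 &
sleep 2
curl -s http://localhost:9090/api/v1/targets \
  | jq '.data.activeTargets[] | select(.health != "up") | {job: .labels.job, url: .scrapeUrl}'
```

With fault 2 fixed and fault 1 invisible to discovery, this returns nothing, so the agent reasonably stops.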
| Rollout | Score | network_path | pod_metrics | service_scrape | control_plane |
|---|---|---|---|---|---|
| 0 | 0.250 | false | 0.0 | 1.0 | 0.0 |
| 1 | 0.500 | true | 0.0 | 1.0 | 0.0 |
| 2 | 0.000 | false | 0.0 | 0.0 | 0.0 |
| 3 | 0.250 | false | 0.0 | 1.0 | 0.0 |
| 4 | 0.500 | true | 0.0 | 1.0 | 0.0 |
| 5 | 0.000 | false | 0.0 | 0.0 | 0.0 |
| 6 | 0.500 | true | 0.0 | 1.0 | 0.0 |
| 7 | 0.000 | false | 0.0 | 0.0 | 0.0 |
Mean: 0.25. pod_metrics_live fails 8/8. control_plane_converged fails 8/8 (cascades from pod_metrics).
The attached patch adds one line to setup.sh: remove the app label from the two affected Services. The bleater-services Prometheus job filters on __meta_kubernetes_service_label_app matching timeline-service, profile-service, etc. Without that label, endpoint-based discovery stops finding them. The kubernetes-pods annotation-based job is the only remaining path -- and fault 1 already broke that.
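A minimal sketch of the kind of line this adds, assuming the two Services are timeline-service and profile-service in a hypothetical bleater namespace (the attached patch is the actual source):

```bash
# setup.sh: drop the `app` label so endpoint-based discovery also loses the services.
kubectl -n bleater label service timeline-service profile-service app-
```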
Now when the agent fixes the NetworkPolicy and checks Prometheus, these two services are still missing. They have a real symptom to investigate, and restoring the annotations (or the app label) becomes a meaningful fix rather than a cosmetic one.
solution.sh gets a corresponding two-line addition to restore the app label.
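Again as a sketch under the same naming assumptions, with label values chosen to match what the bleater-services relabel regex expects:

```bash
# solution.sh: restore the `app` label on both Services.
kubectl -n bleater label service timeline-service app=timeline-service --overwrite
kubectl -n bleater label service profile-service app=profile-service --overwrite
```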
Grader, task.yaml, and Dockerfile are unchanged.
Other observations:
- NetworkPolicy name leak: `deny-monitoring-metrics` reveals the root cause in its name. A neutral name would force actual investigation.
- Namespace label rigidity: 5/8 agents matched the monitoring namespace with `name: monitoring` instead of `kubernetes.io/metadata.name: monitoring`, and the grader only accepts the latter. That is a genuine agent mistake, but the failure rate suggests either accepting both or hinting at the built-in label (see the sketch after this list).
- Scope: with 2 services and 4-5 faults, this is closer to 2 hours than 4 for a senior engineer. Previous versions had Istio mTLS complexity that was removed due to sidecar issues; additional layers may be needed to meet the 4-hour threshold.
- Overlap: an existing `prometheus-observability-stack-failure` task in the Nebula repo targets different deployments. The two should be reconciled.
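For the namespace-label item, a sketch of the ingress rule the grader accepts; the policy name, namespace, and podSelector here are assumptions, only the namespaceSelector label is the point:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-monitoring-scrape   # hypothetical replacement for the deny-all policy
  namespace: bleater              # hypothetical app namespace
spec:
  podSelector: {}                 # all pods in the namespace
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              # Built-in label set by Kubernetes on every namespace; a custom
              # `name: monitoring` label only works if someone has added it.
              kubernetes.io/metadata.name: monitoring
```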