Task UUID: 326ba06c-1edc-4f27-a0e3-57001213b883
Category: SRE | Difficulty: Hard | Horizon: 6h
Author: .tryps (Orestis Trypidakis)
Primary reviewer: shahryaradil (approved, promoted to secondary)
Secondary reviewer: daltoris
PASS — apex-arena test-solution scored 1.0 (4/4 subscores at 100%). solution.sh cleanly handles all obstacles, installs Chaos Mesh, and creates all 6 experiments, observability resources, RBAC, NetworkPolicy, Workflow, and Schedule.
PASS — From 10 rollouts:
| Score | Count | Rollouts |
|---|---|---|
| 100% | 3 | 2, 3, 6 |
| 75% | 4 | 0, 5, 8, 9 |
| 50% | 1 | 7 |
| 25% | 1 | 1 |
| 0% | 1 | 4 |
- Mean: 67.5% | Pass rate (100%): 30% | Meets ≤70% threshold
| Rollout | deployed | experiments | observability | network_workflow | Total |
|---|---|---|---|---|---|
| 0 | 1.0 | 0.0 | 1.0 | 1.0 | 75% |
| 1 | 1.0 | 0.0 | 0.0 | 0.0 | 25% |
| 2 | 1.0 | 1.0 | 1.0 | 1.0 | 100% |
| 3 | 1.0 | 1.0 | 1.0 | 1.0 | 100% |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 | 0% |
| 5 | 1.0 | 0.0 | 1.0 | 1.0 | 75% |
| 6 | 1.0 | 1.0 | 1.0 | 1.0 | 100% |
| 7 | 1.0 | 0.0 | 0.0 | 1.0 | 50% |
| 8 | 1.0 | 1.0 | 0.0 | 1.0 | 75% |
| 9 | 1.0 | 0.0 | 1.0 | 1.0 | 75% |
| Pass rate | 9/10 | 3/10 | 6/10 | 7/10 | |
experiments_complete is the hardest check (3/10 pass). No check is universally failed — chaos_mesh_deployed passes 9/10, showing agents consistently handle the basic deployment.
6+ distinct failure patterns across rollouts — overwhelmingly genuine:
| Pattern | Rollouts | Classification | Details |
|---|---|---|---|
| DNSChaos targets wrong service | 0, 7, 9 | Genuine | Agents target profile-service (0, 9) or authentication-service (7) instead of user-service. Task clearly says "DNS chaos (user-service)" but agents get confused by the decoy profile-service; see the sketch after this table. |
| Experiments in wrong namespace | 1, 4 | Genuine | Agents create experiments in bleater instead of chaos-mesh |
| HTTPChaos uses replace not abort | 5 | Borderline | Agent used abort: false + replace: {code: 500}; grader requires abort or delay. Replacing the response with a 500 is a valid HTTP fault pattern, but uncommon. |
| Grafana dashboard wrong namespace | 8 | Minor | Created in chaos-mesh instead of monitoring. Grader only checks monitoring. |
| Ingress port issues | 1, 7 | Genuine | Wrong port for Chaos Mesh dashboard |
| Various observability gaps | Multiple | Genuine | Different agents struggle with different parts of Prometheus rules, Grafana dashboards, etc. |
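For context on the DNSChaos misfires, here is a minimal sketch of a correctly targeted experiment, written as the Python dict the grader would see after parsing the manifest. Field names follow the Chaos Mesh v1alpha1 schema; the experiment name, DNS pattern, and pod labels are illustrative assumptions, not values taken from the task:

```python
# Hypothetical DNSChaos spec targeting user-service (not the decoy
# profile-service), created in chaos-mesh per the task's expectations.
dns_chaos = {
    "apiVersion": "chaos-mesh.org/v1alpha1",
    "kind": "DNSChaos",
    "metadata": {"name": "dns-chaos-user-service", "namespace": "chaos-mesh"},
    "spec": {
        "action": "error",                              # inject DNS errors
        "mode": "all",
        "patterns": ["user-service.bleater.svc.*"],     # domains to disrupt (assumed)
        "selector": {
            "namespaces": ["bleater"],
            "labelSelectors": {"app": "user-service"},  # the intended target
        },
        "duration": "5m",
    },
}
```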
Verdict: Failures are diverse and genuine. No uniform failure pattern suggesting grader/infra issues.
4 equal-weight subscores (0.25 each) — all represent meaningful milestones:
| Subscore | What it checks | Gameable? |
|---|---|---|
| chaos_mesh_deployed (0.25) | Controller + daemon running, webhook registered, functional test (creates real experiment), Harbor images, RBAC, obstacles resolved | No — requires actual Chaos Mesh deployment |
| experiments_complete (0.25) | All 6 experiment types with correct targets, safety controls (pause, blast radius, annotations, duration, gracePeriod) | No — requires 6 properly configured experiments |
| observability_ready (0.25) | Schedule, Prometheus rules with rate()/histogram_quantile(), Grafana dashboard with panels, Ingress, metrics | No — requires real observability setup |
| network_and_workflow (0.25) | Namespace label, NetworkPolicy, Workflow with 3+ chaos types | No — requires real cross-namespace config |
Wait times: 180s for chaos-mesh pods, 30s for webhooks, 60s for bleater pods. Appropriate for the workloads.
Functional validation: The grader creates a test PodChaos experiment and checks for finalizers — this verifies the controller is actually working, not just that pods exist. verify_selector_matches_pods() validates experiment selectors match actual pod labels.
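As a rough illustration of what that selector validation involves, here is a sketch in the spirit of verify_selector_matches_pods(); the function name and the kubernetes-client usage are assumptions, not the grader's actual code:

```python
# Sketch only: confirm an experiment selector matches at least one live pod,
# rather than trusting that the experiment resource merely exists.
from kubernetes import client, config

def selector_matches_pods(selector: dict) -> bool:
    config.load_kube_config()
    v1 = client.CoreV1Api()
    labels = selector.get("labelSelectors", {})
    label_query = ",".join(f"{k}={v}" for k, v in labels.items())
    for ns in selector.get("namespaces", []):
        pods = v1.list_namespaced_pod(ns, label_selector=label_query)
        if pods.items:
            return True
    return False
```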
Minor concerns:
- HTTPChaos check (grader.py:542): `spec.get("abort") or spec.get("delay")` doesn't accept `replace` with error codes. Affects 1/10 rollouts. Technically valid but uncommon HTTP fault pattern; a relaxed check is sketched below.
- Grafana dashboard (grader.py:898): Only checks the `monitoring` namespace. Task doesn't explicitly specify a namespace. Affects 1/10 rollouts. Could check both `monitoring` and `chaos-mesh`.
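A minimal sketch of the relaxed HTTPChaos check suggested above. Spec shapes follow the Chaos Mesh HTTPChaos schema; treating any injected 5xx replacement as a fault is my assumption:

```python
# Sketch only: accept replace-with-error-code in addition to abort/delay.
def http_fault_present(spec: dict) -> bool:
    if spec.get("abort") or spec.get("delay"):
        return True                                    # current grader behavior
    replace = spec.get("replace") or {}
    code = replace.get("code")
    return isinstance(code, int) and code >= 500       # e.g. replace: {code: 500}
```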
- Clear, specific objective — deploy Chaos Mesh with 6 named experiments targeting specific services
- Sufficient context — air-gapped environment, Helm charts cached, Harbor registry
- Doesn't reveal grading criteria — describes objectives not exact checks
- Doesn't overspecify approach — says what to deploy, not how
- Scope is substantial — Helm deployment, 6 experiments, RBAC, NetworkPolicy, Workflow, Schedule, observability (easily 4+ hours)
- Cohesive task — all parts relate to chaos engineering platform setup
Note: Task.yaml is somewhat prescriptive about obstacles (names ResourceQuota, LimitRange, PDB). This is appropriate for the task complexity — without these hints, agents would spend all their time debugging obstacles with no time for the actual chaos engineering work.
- Filesystem: Dockerfile does NOT copy solution.sh or grader.py
- task.yaml specificity: Describes objectives, not grader checks
- Setup artifacts: Obstacles (ResourceQuota, LimitRange, PDB, broken experiments) don't explain how to fix them
- Git repos: No runbooks or solution files committed
- Environment variables: ConfigMap has HARBOR_REGISTRY/VERSION/CHART_PATH — helpful for air-gapped env but doesn't reveal solution approach
- Prior run artifacts: No issues observed
No problematic overlap with any existing task. The chaos engineering domain (Chaos Mesh deployment, chaos experiments, chaos-specific observability) is unique in the repository. Low overlap with single-node-chaos-hardening (different problem domain), prometheus-observability-stack-failure (debugging existing vs. creating new), and kubernetes-security-hardening-zero-disruption (different NetworkPolicy purpose).
This is a well-constructed, substantial SRE task with:
- Verified solvability (100% from solution.sh)
- Appropriate challenge (30% pass rate, 67.5% mean)
- Diverse genuine failure patterns (6+ modes)
- Functional grader validation (not just resource existence checks)
- Good information isolation
- No overlap with existing tasks
Minor recommendations (non-blocking):
- Consider accepting `replace` with error codes in the HTTPChaos check alongside `abort`/`delay`; this would reduce 1 borderline failure
- Consider checking both the `monitoring` and `chaos-mesh` namespaces for the Grafana dashboard ConfigMap (sketched below)
- Both are edge cases affecting 1/10 rollouts each and arguably represent legitimate agent mistakes in reading requirements, so they don't block approval
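A hedged sketch of the second recommendation; the ConfigMap name is a placeholder and the kubernetes-client usage is an assumption about how grader.py queries the cluster:

```python
# Sketch only: accept the Grafana dashboard ConfigMap in either namespace.
from kubernetes import client, config

def find_dashboard_configmap(name: str = "chaos-mesh-dashboard") -> bool:
    config.load_kube_config()
    v1 = client.CoreV1Api()
    for ns in ("monitoring", "chaos-mesh"):
        try:
            v1.read_namespaced_config_map(name, ns)
            return True
        except client.exceptions.ApiException as exc:
            if exc.status != 404:                      # 404 just means "not here"
                raise
    return False
```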