Task UUID: 326ba06c-1edc-4f27-a0e3-57001213b883
Category: SRE | Difficulty: Hard | Horizon: 6h
Author: .tryps (Orestis Trypidakis)
Primary reviewer: shahryaradil (approved, promoted to secondary)
Secondary reviewer: daltoris
PASS — apex-arena test-solution scored 1.0 (4/4 subscores at 100%). solution.sh cleanly handles all obstacles, installs Chaos Mesh, and creates all 6 experiments, observability resources, RBAC, NetworkPolicy, Workflow, and Schedule.
PASS — From 10 rollouts:
| Score | Count | Rollouts |
|---|---|---|
| 100% | 3 | 2, 3, 6 |
| 75% | 4 | 0, 5, 8, 9 |
| 50% | 1 | 7 |
| 25% | 1 | 1 |
| 0% | 1 | 4 |
- Mean: 67.5% | Pass rate (100%): 30% | Meets ≤70% threshold
| Rollout | deployed | experiments | observability | network_workflow | Total |
|---|---|---|---|---|---|
| 0 | 1.0 | 0.0 | 1.0 | 1.0 | 75% |
| 1 | 1.0 | 0.0 | 0.0 | 0.0 | 25% |
| 2 | 1.0 | 1.0 | 1.0 | 1.0 | 100% |
| 3 | 1.0 | 1.0 | 1.0 | 1.0 | 100% |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 | 0% |
| 5 | 1.0 | 0.0 | 1.0 | 1.0 | 75% |
| 6 | 1.0 | 1.0 | 1.0 | 1.0 | 100% |
| 7 | 1.0 | 0.0 | 0.0 | 1.0 | 50% |
| 8 | 1.0 | 1.0 | 0.0 | 1.0 | 75% |
| 9 | 1.0 | 0.0 | 1.0 | 1.0 | 75% |
| Pass rate | 9/10 | 3/10 | 6/10 | 7/10 | |
experiments_complete is the hardest check (3/10 pass). No check is universally failed — chaos_mesh_deployed passes 9/10, showing agents consistently handle the basic deployment.
6+ distinct failure patterns across rollouts — overwhelmingly genuine:
| Pattern | Rollouts | Classification | Details |
|---|---|---|---|
| DNSChaos targets wrong service | 0, 7, 9 | Genuine | Agents target profile-service (0, 9) or authentication-service (7) instead of user-service. Task clearly says "DNS chaos (user-service)" but agents get confused by the decoy profile-service; see the sketch after this table. |
| Experiments in wrong namespace | 1, 4 | Genuine | Agents create experiments in bleater instead of chaos-mesh |
| HTTPChaos uses replace not abort | 5 | Borderline | Agent used abort: false + replace: {code: 500}; grader requires abort or delay. Replacing the response with a 500 is a valid HTTP fault pattern, but uncommon. |
| Grafana dashboard wrong namespace | 8 | Minor | Created in chaos-mesh instead of monitoring. Grader only checks monitoring. |
| Ingress port issues | 1, 7 | Genuine | Wrong port for Chaos Mesh dashboard |
| Various observability gaps | Multiple | Genuine | Different agents struggle with different parts of Prometheus rules, Grafana dashboards, etc. |
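For context on the DNSChaos misfires, here is a minimal sketch of a correctly targeted experiment, written as the Python dict the grader would see after parsing the manifest. Field names follow the Chaos Mesh v1alpha1 schema; the experiment name, DNS pattern, and pod labels are illustrative assumptions, not values taken from the task:

```python
# Hypothetical DNSChaos spec targeting user-service (not the decoy
# profile-service), created in chaos-mesh per the task's expectations.
dns_chaos = {
    "apiVersion": "chaos-mesh.org/v1alpha1",
    "kind": "DNSChaos",
    "metadata": {"name": "dns-chaos-user-service", "namespace": "chaos-mesh"},
    "spec": {
        "action": "error",                              # inject DNS errors
        "mode": "all",
        "patterns": ["user-service.bleater.svc.*"],     # domains to disrupt (assumed)
        "selector": {
            "namespaces": ["bleater"],
            "labelSelectors": {"app": "user-service"},  # the intended target
        },
        "duration": "5m",
    },
}
```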
Verdict: Failures are diverse and genuine. No uniform failure pattern suggesting grader/infra issues.
4 equal-weight subscores (0.25 each) — all represent meaningful milestones:
| Subscore | What it checks | Gameable? |
|---|---|---|
| chaos_mesh_deployed (0.25) | Controller + daemon running, webhook registered, functional test (creates real experiment), Harbor images, RBAC, obstacles resolved | No — requires actual Chaos Mesh deployment |
| experiments_complete (0.25) | All 6 experiment types with correct targets, safety controls (pause, blast radius, annotations, duration, gracePeriod) | No — requires 6 properly configured experiments |
| observability_ready (0.25) | Schedule, Prometheus rules with rate()/histogram_quantile(), Grafana dashboard with panels, Ingress, metrics | No — requires real observability setup |
| network_and_workflow (0.25) | Namespace label, NetworkPolicy, Workflow with 3+ chaos types | No — requires real cross-namespace config |
Wait times: 180s for chaos-mesh pods, 30s for webhooks, 60s for bleater pods. Appropriate for the workloads.
Functional validation: The grader creates a test PodChaos experiment and checks for finalizers — this verifies the controller is actually working, not just that pods exist. verify_selector_matches_pods() validates experiment selectors match actual pod labels.
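As a rough illustration of what that selector validation involves, here is a sketch in the spirit of verify_selector_matches_pods(); the function name and the kubernetes-client usage are assumptions, not the grader's actual code:

```python
# Sketch only: confirm an experiment selector matches at least one live pod,
# rather than trusting that the experiment resource merely exists.
from kubernetes import client, config

def selector_matches_pods(selector: dict) -> bool:
    config.load_kube_config()
    v1 = client.CoreV1Api()
    labels = selector.get("labelSelectors", {})
    label_query = ",".join(f"{k}={v}" for k, v in labels.items())
    for ns in selector.get("namespaces", []):
        pods = v1.list_namespaced_pod(ns, label_selector=label_query)
        if pods.items:
            return True
    return False
```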
Minor concerns:
- HTTPChaos check (grader.py:542): `spec.get("abort") or spec.get("delay")` doesn't accept `replace` with error codes. Affects 1/10 rollouts. Technically valid but uncommon HTTP fault pattern; a relaxed check is sketched below.
- Grafana dashboard (grader.py:898): Only checks the `monitoring` namespace. Task doesn't explicitly specify a namespace. Affects 1/10 rollouts. Could check both `monitoring` and `chaos-mesh`.
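A minimal sketch of the relaxed HTTPChaos check suggested above. Spec shapes follow the Chaos Mesh HTTPChaos schema; treating any injected 5xx replacement as a fault is my assumption:

```python
# Sketch only: accept replace-with-error-code in addition to abort/delay.
def http_fault_present(spec: dict) -> bool:
    if spec.get("abort") or spec.get("delay"):
        return True                                    # current grader behavior
    replace = spec.get("replace") or {}
    code = replace.get("code")
    return isinstance(code, int) and code >= 500       # e.g. replace: {code: 500}
```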
- Clear, specific objective — deploy Chaos Mesh with 6 named experiments targeting specific services
- Sufficient context — air-gapped environment, Helm charts cached, Harbor registry
- Doesn't reveal grading criteria — describes objectives not exact checks
- Doesn't overspecify approach — says what to deploy, not how
- Scope is substantial — Helm deployment, 6 experiments, RBAC, NetworkPolicy, Workflow, Schedule, observability (easily 4+ hours)
- Cohesive task — all parts relate to chaos engineering platform setup
Note: Task.yaml is somewhat prescriptive about obstacles (names ResourceQuota, LimitRange, PDB). This is appropriate for the task complexity — without these hints, agents would spend all their time debugging obstacles with no time for the actual chaos engineering work.
- Filesystem: Dockerfile does NOT copy solution.sh or grader.py
- task.yaml specificity: Describes objectives, not grader checks
- Setup artifacts: Obstacles (ResourceQuota, LimitRange, PDB, broken experiments) don't explain how to fix them
- Git repos: No runbooks or solution files committed
- Environment variables: ConfigMap has HARBOR_REGISTRY/VERSION/CHART_PATH — helpful for air-gapped env but doesn't reveal solution approach
- Prior run artifacts: No issues observed
No problematic overlap with any existing task. The chaos engineering domain (Chaos Mesh deployment, chaos experiments, chaos-specific observability) is unique in the repository. Low overlap with single-node-chaos-hardening (different problem domain), prometheus-observability-stack-failure (debugging existing vs. creating new), and kubernetes-security-hardening-zero-disruption (different NetworkPolicy purpose).
This is a well-constructed, substantial SRE task with:
- Verified solvability (100% from solution.sh)
- Appropriate challenge (30% pass rate, 67.5% mean)
- Diverse genuine failure patterns (6+ modes)
- Functional grader validation (not just resource existence checks)
- Good information isolation
- No overlap with existing tasks
Minor recommendations (non-blocking):
- Consider accepting `replace` with error codes in the HTTPChaos check alongside `abort`/`delay`; this would reduce 1 borderline failure
- Consider checking both the `monitoring` and `chaos-mesh` namespaces for the Grafana dashboard ConfigMap (sketched below)
- Both are edge cases affecting 1/10 rollouts each and arguably represent legitimate agent mistakes in reading requirements, so they don't block approval
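A hedged sketch of the second recommendation; the ConfigMap name is a placeholder and the kubernetes-client usage is an assumption about how grader.py queries the cluster:

```python
# Sketch only: accept the Grafana dashboard ConfigMap in either namespace.
from kubernetes import client, config

def find_dashboard_configmap(name: str = "chaos-mesh-dashboard") -> bool:
    config.load_kube_config()
    v1 = client.CoreV1Api()
    for ns in ("monitoring", "chaos-mesh"):
        try:
            v1.read_namespaced_config_map(name, ns)
            return True
        except client.exceptions.ApiException as exc:
            if exc.status != 404:                      # 404 just means "not here"
                raise
    return False
```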