Task Review: chaos-engineering-resilience (v33) - Secondary Review

Task UUID: 326ba06c-1edc-4f27-a0e3-57001213b883
Category: SRE | Difficulty: Hard | Horizon: 6h
Author: .tryps (Orestis Trypidakis)
Primary reviewer: shahryaradil (approved, promoted to secondary)
Secondary reviewer: daltoris


Verdict: APPROVE


1. Solvability

PASS — apex-arena test-solution scored 1.0 (4/4 subscores at 100%). solution.sh cleanly handles all obstacles, installs Chaos Mesh, and creates all 6 experiments plus the observability resources, RBAC, NetworkPolicy, Workflow, and Schedule.

2. Challenge (Pass Rate)

PASS — From 10 rollouts:

| Score | Count | Rollouts   |
|-------|-------|------------|
| 100%  | 3     | 2, 3, 6    |
| 75%   | 4     | 0, 5, 8, 9 |
| 50%   | 1     | 7          |
| 25%   | 1     | 1          |
| 0%    | 1     | 4          |
  • Mean: 67.5% | Pass rate (100%): 30% | Meets ≤70% threshold

Per-Subscore Breakdown

| Rollout   | deployed | experiments | observability | network_workflow | Total |
|-----------|----------|-------------|---------------|------------------|-------|
| 0         | 1.0      | 0.0         | 1.0           | 1.0              | 75%   |
| 1         | 1.0      | 0.0         | 0.0           | 0.0              | 25%   |
| 2         | 1.0      | 1.0         | 1.0           | 1.0              | 100%  |
| 3         | 1.0      | 1.0         | 1.0           | 1.0              | 100%  |
| 4         | 0.0      | 0.0         | 0.0           | 0.0              | 0%    |
| 5         | 1.0      | 0.0         | 1.0           | 1.0              | 75%   |
| 6         | 1.0      | 1.0         | 1.0           | 1.0              | 100%  |
| 7         | 1.0      | 0.0         | 0.0           | 1.0              | 50%   |
| 8         | 1.0      | 1.0         | 0.0           | 1.0              | 75%   |
| 9         | 1.0      | 0.0         | 1.0           | 1.0              | 75%   |
| Pass rate | 9/10     | 4/10        | 6/10          | 8/10             |       |

experiments_complete is the hardest check (4/10 pass). No check fails universally — chaos_mesh_deployed passes 9/10, showing that agents consistently handle the basic deployment.

3. Failure Analysis — Genuine vs Artificial

6+ distinct failure patterns across rollouts — overwhelmingly genuine:

| Pattern | Rollouts | Classification | Details |
|---------|----------|----------------|---------|
| DNSChaos targets wrong service | 0, 7, 9 | Genuine | Agents target profile-service (0, 9) or authentication-service (7) instead of user-service. The task clearly says "DNS chaos (user-service)", but agents are misled by the decoy profile-service (see the sketch below). |
| Experiments in wrong namespace | 1, 4 | Genuine | Agents create experiments in bleater instead of chaos-mesh. |
| HTTPChaos uses replace, not abort | 5 | Borderline | Agent used abort: false with replace: {code: 500}. The grader requires abort or delay. Replacing the response code with 500 is a valid, if uncommon, HTTP fault pattern. |
| Grafana dashboard in wrong namespace | 8 | Minor | Created in chaos-mesh instead of monitoring; the grader only checks monitoring. |
| Ingress port issues | 1, 7 | Genuine | Wrong port for the Chaos Mesh dashboard. |
| Various observability gaps | Multiple | Genuine | Different agents struggle with different pieces: Prometheus rules, Grafana dashboards, etc. |
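
For reference on the first pattern, a DNSChaos aimed at user-service could look roughly like the sketch below. This is an illustration only: the label selector, DNS pattern, and duration are placeholders, not values from the task.

```yaml
# Sketch of a DNSChaos whose target is user-service; every concrete value
# below is a placeholder except the service and namespace names from the
# review above.
apiVersion: chaos-mesh.org/v1alpha1
kind: DNSChaos
metadata:
  name: user-service-dns-chaos
  namespace: chaos-mesh            # experiments belong here, not in bleater
spec:
  action: error                    # fail matching DNS lookups
  mode: all
  selector:
    namespaces:
      - bleater
    labelSelectors:
      app: user-service            # assumed label; the decoy profile-service must not match
  patterns:
    - "*.external.example.com"     # placeholder domain whose resolution breaks
  duration: "5m"
```

The grader check presumably keys on the selector, which is why pointing it at profile-service or authentication-service fails the experiments_complete subscore.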

Verdict: Failures are diverse and genuine. No uniform failure pattern suggesting grader/infra issues.

4. Grader Quality

4 equal-weight subscores (0.25 each) — all represent meaningful milestones:

| Subscore | What it checks | Gameable? |
|----------|----------------|-----------|
| chaos_mesh_deployed (0.25) | Controller + daemon running, webhook registered, functional test (creates a real experiment), Harbor images, RBAC, obstacles resolved | No — requires an actual Chaos Mesh deployment |
| experiments_complete (0.25) | All 6 experiment types with correct targets and safety controls (pause, blast radius, annotations, duration, gracePeriod) | No — requires 6 properly configured experiments |
| observability_ready (0.25) | Schedule, Prometheus rules with rate()/histogram_quantile() (sketch below), Grafana dashboard with panels, Ingress, metrics | No — requires a real observability setup |
| network_and_workflow (0.25) | Namespace label, NetworkPolicy, Workflow with 3+ chaos types | No — requires real cross-namespace config |
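
To make the rate()/histogram_quantile() requirement concrete, a PrometheusRule along the following lines would satisfy that part of the check. This is a hedged sketch: the metric names, thresholds, and rule names are invented, not taken from the grader.

```yaml
# Hypothetical PrometheusRule illustrating the rate()/histogram_quantile()
# expressions the observability_ready check looks for. Metric names and
# thresholds are placeholders, not values from the task or grader.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: chaos-experiment-alerts
  namespace: monitoring
spec:
  groups:
    - name: chaos.rules
      rules:
        - alert: HighErrorRateDuringChaos
          expr: rate(http_requests_total{code=~"5.."}[5m]) > 0.05
          for: 2m
          labels:
            severity: warning
        - record: request_latency:p99
          expr: histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
```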

Wait times: 180s for chaos-mesh pods, 30s for webhooks, 60s for bleater pods. Appropriate for the workloads.

Functional validation: The grader creates a test PodChaos experiment and checks for finalizers — this verifies the controller is actually working, not just that pods exist. verify_selector_matches_pods() validates experiment selectors match actual pod labels.
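
The grader's test experiment isn't reproduced here, but the idea is roughly the sketch below: create a minimal PodChaos and confirm the controller actually processes it. Every field value in this sketch is an assumption for illustration.

```yaml
# Sketch of a minimal smoke-test PodChaos (all names and values assumed).
# On a working installation the admission webhook accepts the resource and
# the controller attaches a finalizer (e.g. chaos-mesh/records), which a
# simple existence check would never exercise.
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: grader-smoke-test
  namespace: chaos-mesh
spec:
  action: pod-failure    # non-destructive compared to pod-kill
  mode: one              # limit the blast radius to a single pod
  selector:
    namespaces:
      - bleater
  duration: "30s"
```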

Minor concerns:

  • HTTPChaos check (grader.py:542): the check accepts spec.get("abort") or spec.get("delay") but not replace with an error code. Affects 1/10 rollouts; replace is a technically valid but uncommon HTTP fault pattern (both variants are sketched below).
  • Grafana dashboard (grader.py:898): only the monitoring namespace is checked, and the task doesn't explicitly specify a namespace. Affects 1/10 rollouts. The grader could check both monitoring and chaos-mesh.
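
To make the first concern concrete, here is a hedged sketch of both HTTPChaos variants. The selector labels, port, and path are placeholders; only the abort/delay-vs-replace distinction comes from the review.

```yaml
# Variant the grader accepts: abort-based HTTP fault.
apiVersion: chaos-mesh.org/v1alpha1
kind: HTTPChaos
metadata:
  name: api-http-abort           # placeholder name
  namespace: chaos-mesh
spec:
  mode: all
  selector:
    namespaces:
      - bleater
    labelSelectors:
      app: api-gateway           # placeholder label
  target: Request
  port: 8080                     # placeholder port
  path: "*"
  abort: true
  duration: "2m"
---
# Variant rollout 5 used and the grader rejects: replace the response
# status code with 500 instead of aborting. Also a legitimate fault pattern.
apiVersion: chaos-mesh.org/v1alpha1
kind: HTTPChaos
metadata:
  name: api-http-500             # placeholder name
  namespace: chaos-mesh
spec:
  mode: all
  selector:
    namespaces:
      - bleater
    labelSelectors:
      app: api-gateway
  target: Response               # replace.code applies to responses
  port: 8080
  path: "*"
  abort: false
  replace:
    code: 500
  duration: "2m"
```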

5. task.yaml Assessment

  • Clear, specific objective — deploy Chaos Mesh with 6 named experiments targeting specific services
  • Sufficient context — air-gapped environment, Helm charts cached, Harbor registry
  • Doesn't reveal grading criteria — describes objectives not exact checks
  • Doesn't overspecify approach — says what to deploy, not how
  • Scope is substantial — Helm deployment, 6 experiments, RBAC, NetworkPolicy, Workflow, Schedule, observability (easily 4+ hours)
  • Cohesive task — all parts relate to chaos engineering platform setup

Note: task.yaml is somewhat prescriptive about the obstacles (it names the ResourceQuota, LimitRange, and PDB). This is appropriate for the task's complexity — without these hints, agents would spend all their time debugging obstacles and have no time for the actual chaos engineering work.

6. Information Isolation

  • Filesystem: Dockerfile does NOT copy solution.sh or grader.py
  • task.yaml specificity: Describes objectives, not grader checks
  • Setup artifacts: Obstacles (ResourceQuota, LimitRange, PDB, broken experiments) don't explain how to fix them
  • Git repos: No runbooks or solution files committed
  • Environment variables: ConfigMap has HARBOR_REGISTRY/VERSION/CHART_PATH — helpful for air-gapped env but doesn't reveal solution approach
  • Prior run artifacts: No issues observed

7. Overlap Detection

No problematic overlap with any existing task. The chaos engineering domain (Chaos Mesh deployment, chaos experiments, chaos-specific observability) is unique in the repository. Low overlap with single-node-chaos-hardening (different problem domain), prometheus-observability-stack-failure (debugging existing vs. creating new), and kubernetes-security-hardening-zero-disruption (different NetworkPolicy purpose).


Summary

This is a well-constructed, substantial SRE task with:

  • Verified solvability (100% from solution.sh)
  • Appropriate challenge (30% pass rate, 67.5% mean)
  • Diverse genuine failure patterns (6+ modes)
  • Functional grader validation (not just resource existence checks)
  • Good information isolation
  • No overlap with existing tasks

Minor recommendations (non-blocking):

  1. Consider accepting replace with error codes in the HTTPChaos check alongside abort/delay — this would remove one borderline failure
  2. Consider checking both the monitoring and chaos-mesh namespaces for the Grafana dashboard ConfigMap

Both are edge cases affecting 1/10 rollouts each, and arguably represent legitimate agent mistakes in reading requirements, so they don't block approval.