Task Production Feasibility & Timeline Estimate

Date: 2026-02-13
Authors: Dylan Fitzgerald + Claude (Opus 4.6)
Context: The client has ~120 accepted apex-arena tasks on the Nebula platform. They want to reach 900. This document estimates what we can deliver and in what timeframe, grounded in empirical gap analysis.


Executive Summary

We ran a systematic gap analysis across all four task categories (cloud-ops, platform-engineering, SRE, devops), scanning 185+ Discord threads across both feedback channels, 37+ local task directories, and the full Nebula infrastructure manifest. The analysis identified 28-30 deduplicated, viable new task ideas on the current Nebula platform, with 19 at low overlap risk.

Can we create 100 additional acceptable tasks? Probably not as 100 fully standalone tasks. With subtask decomposition and moderate infrastructure expansion, we can likely produce 60-80 deliverable task items, with a realistic ceiling around 100 if everything goes well. Our best estimate for standalone, non-subtask tasks is 35-50.

Can the client reach 900? Not on Nebula alone, regardless of effort. The platform's realistic ceiling is 300-500 tasks. Reaching 900 requires additional simulation platforms with different infrastructure stacks. See companion document: Task Milestone Viability Analysis.


Part 1: What the Gap Analysis Found

Methodology

flowchart LR
    A["Discord Channels<br>#task-idea-feedback<br>#task-feedback"] --> B["Thread Scanner<br>(discord_fetcher gem)"]
    C["Local Tasks<br>37+ directories"] --> D["File Analyzer<br>(task.yaml, grader.py)"]
    E["Nebula Platform<br>Helm values, K8s manifests"] --> F["Infrastructure Mapper"]

    B --> G["Coverage Matrix<br>by component × skill"]
    D --> G
    F --> G

    G --> H["Gap Identification<br>43 raw gaps"]
    H --> I["Cross-Category Dedup<br>-15 duplicates"]
    I --> J["Overlap Risk Filter<br>-3 high-risk"]
    J --> K["28-30 viable ideas<br>19 low-risk"]

    style K fill:#22c55e,color:#000
    style H fill:#eab308,color:#000
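
At its core, the coverage matrix in the diagram above is a component × skill tally over everything that already exists. A minimal sketch of the idea in Python (the component and skill tags below are illustrative only; the real pipeline derives them from Discord threads, task.yaml files, and the Helm/K8s manifests via the tools shown in the diagram):

```python
from collections import Counter
from itertools import product

# Hypothetical inputs: each existing task tagged with the Nebula component
# it exercises and the skill it tests. Hard-coded here for illustration.
existing_tasks = [
    ("prometheus", "debug-alerting"),
    ("prometheus", "debug-alerting"),
    ("argocd", "debug-sync"),
    ("keda", "debug-scaling"),
]

components = ["prometheus", "argocd", "keda", "glitchtip", "statping-ng"]
skills = ["debug-alerting", "debug-sync", "debug-scaling", "build-slo"]

# Coverage matrix: how many existing tasks hit each component x skill cell.
coverage = Counter(existing_tasks)

# A "gap" is any cell with zero coverage; these become candidate task ideas,
# which are then deduplicated and risk-filtered downstream.
gaps = [
    (component, skill)
    for component, skill in product(components, skills)
    if coverage[(component, skill)] == 0
]

print(f"{len(gaps)} raw gaps, e.g. {gaps[:3]}")
```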

Raw Data

| Category | Threads scanned | Gaps found | Low risk | Medium risk | High risk |
| --- | --- | --- | --- | --- | --- |
| SRE | 170+ | 12 | 10 | 1 | 0 |
| Platform Engineering | 165+ | 12 | 5 | 5 | 2 |
| DevOps | 211+ | 12 | 7 | 3 | 0 |
| Cloud-Ops | 185+ | 7 | 2 | 4 | 1 |
| Totals (raw) | | 43 | 24 | 13 | 3 |

After Cross-Category Deduplication

Eight ideas appeared in multiple categories (KEDA debugging, GlitchTip, Statping-ng, SLO/burn-rate, ConfigMap propagation, CronJob failures, init container deadlocks, HPA/KEDA conflicts). Removing duplicates and the 3 high-risk ideas:

  • 19 low-risk ideas (high confidence they'd pass review)
  • ~10 medium-risk ideas (~6-8 likely survive careful scoping)
  • Total unique viable ideas: ~28-30

Coverage Saturation by Domain

quadrantChart
    title Component Coverage vs. Remaining Capacity
    x-axis "Low Existing Coverage" --> "High Existing Coverage"
    y-axis "Low Remaining Capacity" --> "High Remaining Capacity"

    GlitchTip: [0.05, 0.7]
    Statping-ng: [0.05, 0.5]
    KEDA: [0.1, 0.75]
    Maddy: [0.05, 0.4]
    Event Exporter: [0.05, 0.4]
    CronJobs: [0.05, 0.6]
    Init Containers: [0.05, 0.5]
    Grafana OnCall: [0.15, 0.55]
    SLO/SLI: [0.0, 0.8]
    Prometheus: [0.9, 0.15]
    PostgreSQL: [0.95, 0.05]
    ArgoCD: [0.9, 0.1]
    Istio: [0.8, 0.2]
    Loki/Logging: [0.75, 0.15]
    CI/CD Pipelines: [0.85, 0.1]
    Harbor: [0.5, 0.35]
    Redis: [0.3, 0.45]
    Helm: [0.4, 0.35]

Reading the chart: Top-left quadrant = highest-value targets (low coverage, high remaining capacity). Bottom-right = saturated (high coverage, little room). The gap analysis confirms SLO/SLI, KEDA, GlitchTip, and CronJobs are the most fertile ground.

What the Gap Analysis Might Have Missed

The agents searched for "obvious" gaps — uncovered components, untested skill areas. They likely undercount:

  • Novel cross-component combinations (e.g., "KEDA + Istio traffic shifting", "Prometheus alert → OnCall → Mattermost → automated remediation" as a chain). Maybe 5-10 additional ideas here.
  • Process/workflow tasks rather than pure debugging (postmortems, toil audits, runbook creation). The SRE report caught some of these; there may be 3-5 more.
  • Build/create tasks vs. debug/fix tasks (most existing tasks are "something is broken, fix it" — tasks like "design and implement an SLO framework from scratch" test different skills). Maybe 5-8 more.
  • Difficulty-level variations of the same concept (a simple version and a hard version of the same component debugging). Limited value for differentiation, but maybe 3-5 more.

Adjusted estimate with creative exploration: ~40-50 viable standalone task ideas.


Part 2: Can We Create 100 Additional Acceptable Tasks?

Path A: Standalone Tasks Only (no subtask decomposition)

xychart-beta
    title "Probability of Reaching Target (Standalone Tasks)"
    x-axis ["25", "35", "50", "75", "100"]
    y-axis "Probability (%)" 0 --> 100
    bar [90, 77, 57, 32, 17]

| Milestone | Probability | Notes |
| --- | --- | --- |
| 25 new tasks | 90% | Just the low-risk identified gaps |
| 35 new tasks | 75-80% | Low-risk + surviving medium-risk ideas |
| 50 new tasks | 55-60% | Above + creative cross-component and process tasks |
| 75 new tasks | 30-35% | Requires 2-3 new infrastructure components in Nebula |
| 100 new tasks | 15-20% | Requires significant Nebula expansion + acceptance of some niche scenarios |

Path B: With Subtask Decomposition

The apex-workflows toolkit includes a full subtask system (/subtask-scope, /subtask-create, /subtask-review). If the client counts subtasks as individual task items:

  • A master task with 3-4 grader checks can often decompose into 2-3 standalone subtasks
  • Each subtask must test a genuinely different skill (not just "fix part 1, part 2")
  • Realistic decomposition ratio: ~2x (not every master task decomposes cleanly)
| Milestone | Probability | Notes |
| --- | --- | --- |
| 50 task items | 85-90% | 25-30 masters, ~half decompose into 2 subtasks |
| 75 task items | 65-70% | 35-40 masters with selective decomposition |
| 100 task items | 45-55% | 40-50 masters + decomposition + moderate infra expansion |
| 125 task items | 25-35% | Requires significant expansion + aggressive decomposition |
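
To make the decomposition arithmetic concrete, here is one possible accounting, assuming (our assumption, not a stated rule) that a decomposed master and each of its subtasks all count as separate task items, while an undecomposed master counts once:

```python
def task_items(masters: int, decompose_fraction: float, subtasks_each: int) -> float:
    """Estimated deliverable task items under one accounting assumption:
    a decomposed master contributes itself plus its subtasks; an
    undecomposed master contributes a single item."""
    decomposed = masters * decompose_fraction
    plain = masters - decomposed
    return plain + decomposed * (1 + subtasks_each)

# "25-30 masters, ~half decompose into 2 subtasks" (the 50-item row above)
for masters in (25, 30):
    print(masters, "masters ->", task_items(masters, 0.5, 2), "task items")
# 25 masters -> 50.0 task items; 30 masters -> 60.0 task items
```

This reproduces the ~2x overall ratio noted above (50-60 items from 25-30 masters); a stricter accounting in which subtasks replace their master would yield only ~37-45 items from the same inputs.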

Path C: With Infrastructure Expansion

Adding new components to Nebula creates entirely new task categories. Estimated yield per component:

| Component | Engineering effort | New tasks enabled | Notes |
| --- | --- | --- | --- |
| Argo Workflows | 2-3 weeks | 10-15 | Workflow orchestration, DAG debugging |
| Cert-manager (advanced) | 1-2 weeks | 8-12 | Already partially present; deep scenarios |
| Falco | 2-3 weeks | 8-12 | Runtime security, SIEM integration |
| Velero (advanced) | 1-2 weeks | 5-8 | Already present; complex DR scenarios |
| External Secrets Operator | 1-2 weeks | 5-10 | Secret injection from external stores |

Each component must work in the Nebula snapshot model (air-gapped, single-node, 60s boot). Not all candidates are feasible.

With 3 new components: +25-35 new task ideas, bringing the viable pool to ~65-85 standalone ideas.
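
As a rough cross-check on that figure, summing the per-component yields from the table for one plausible trio (here we assume Argo Workflows, Falco, and External Secrets Operator):

```python
# Per-component yield ranges (low, high) from the Path C table above.
component_yields = {
    "Argo Workflows": (10, 15),
    "Falco": (8, 12),
    "External Secrets Operator": (5, 10),
}

low = sum(lo for lo, _ in component_yields.values())
high = sum(hi for _, hi in component_yields.values())
print(f"+{low} to +{high} new task ideas")  # +23 to +37, in the same ballpark as +25-35
```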

Self-Argument: Why These Estimates Might Be Wrong

Too pessimistic?

  • The DevOps/SRE domain is genuinely vast; we may find more ideas during implementation as we develop deeper component expertise
  • Some "medium overlap risk" ideas might sail through review with proper scoping
  • The client's overlap criteria might be more lenient than our automated analysis assumes
  • We haven't fully explored "build from scratch" tasks (vs. "debug broken thing")

Too optimistic?

  • The formal review process may reject ideas our agents flagged as "low risk"
  • Implementation reveals problems: graders that can't reliably verify, setups that are too fragile
  • Quality degrades as we push into more niche scenarios
  • Nebula infrastructure expansion takes real engineering time and may have compatibility issues
  • Human review bandwidth is a hard limit regardless of Claude's output speed

Net assessment: The estimates above are calibrated to account for both directions. If anything, they may be slightly optimistic on the higher targets (75+) due to underweighting implementation attrition.


Part 3: How Long Would It Take?

Production Pipeline

flowchart TB
    subgraph Phase1["Phase 1: Ideation + Review (Weeks 1-2)"]
        A1["/task-idea-research<br>× 4 categories"] --> A2["~140 candidates"]
        A2 --> A3["Batch /task-idea-review<br>parallel subagents"]
        A3 --> A4["overlap-detector<br>agent"]
        A4 --> A5["~40-50 approved ideas"]
    end

    subgraph Phase2["Phase 2: Implementation (Weeks 3-6)"]
        A5 --> B1["Agent Team Lead<br>assigns tasks"]
        B1 --> B2["Implementer 1<br>task.yaml + setup.sh<br>grader.py + solution.sh"]
        B1 --> B3["Implementer 2<br>(parallel)"]
        B1 --> B4["Implementer 3<br>(parallel)"]
        B1 --> B5["Implementer 4<br>(parallel)"]
        B2 --> B6["Human Review<br>4-6 tasks/day"]
        B3 --> B6
        B4 --> B6
        B5 --> B6
    end

    subgraph Phase3["Phase 3: Testing + Quality (Weeks 5-8)"]
        B6 --> C1["test-solution<br>(parallel ports)"]
        C1 --> C2["eval --runs 8"]
        C2 --> C3["eval-analyzer<br>failure categorization"]
        C3 --> C4{Pass?}
        C4 -->|Yes| C5["Accepted Task"]
        C4 -->|No| C6["Rework Queue"]
        C6 --> B6
    end

    style Phase1 fill:#1e3a5f,color:#fff
    style Phase2 fill:#1a4731,color:#fff
    style Phase3 fill:#4a1942,color:#fff
    style C5 fill:#22c55e,color:#000
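
The unit of work in Phase 2 is a task directory: task.yaml, setup.sh, grader.py, solution.sh. Purely as an illustration of the shape of that work (the real grader contract is not specified in this document, and the namespace and workload names below are hypothetical), a grader is conceptually a short script that inspects cluster state and reports pass or fail:

```python
#!/usr/bin/env python3
"""Illustrative grader sketch only; the actual Nebula grader interface
(exit codes, output format, client library) may differ."""
import sys

from kubernetes import client, config

# Hypothetical check: the Deployment the task asks the agent to fix
# should be fully rolled out once the task is solved.
NAMESPACE = "bleater"        # hypothetical namespace
DEPLOYMENT = "bleater-api"   # hypothetical workload name


def main() -> int:
    config.load_kube_config()  # in-cluster config would also work
    apps = client.AppsV1Api()
    dep = apps.read_namespaced_deployment(DEPLOYMENT, NAMESPACE)

    desired = dep.spec.replicas or 0
    ready = dep.status.ready_replicas or 0
    if ready < desired:
        print(f"FAIL: {ready}/{desired} replicas ready")
        return 1

    print("PASS: deployment fully available")
    return 0


if __name__ == "__main__":
    sys.exit(main())
```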

Our Tooling Advantages

| Phase | Tool | Speedup vs. manual |
| --- | --- | --- |
| Ideation | /task-idea-research agents, Discord integration | 3-5x |
| Review | /task-idea-review with batch parallelism | 3-4x |
| Overlap detection | overlap-detector agent | 10x+ |
| Implementation | CREATING_APEX_TASKS.md templates, /subtask-create | 1.5-2x |
| Testing | test-solution (parallel ports), eval-analyzer | 1.5-2x |

Agent Teams: The Implementation Multiplier

Claude Code agent teams (released Feb 2026) enable parallel task implementation. This maps perfectly to our workload because:

  • Zero inter-task dependencies — each task lives in its own directory
  • Well-templated work — CREATING_APEX_TASKS.md provides consistent patterns
  • Independent validation — each task can be tested separately
  • Shared context — Nebula skill provides platform knowledge to all agents
gantt
    title Daily Implementation Rhythm with Agent Teams
    dateFormat HH:mm
    axisFormat %H:%M

    section Human
        Review yesterday's output (4-6 tasks)    :review, 09:00, 2h
        Queue next batch + context briefs         :queue, 11:00, 1h
        Spot-check in-progress work               :check, 14:00, 1h
        Test completed tasks on Nebula            :test, 15:00, 2h

    section Agent 1
        Implement Task A                          :impl1, 11:00, 5h

    section Agent 2
        Implement Task B                          :impl2, 11:00, 5h

    section Agent 3
        Implement Task C                          :impl3, 11:00, 5h

    section Agent 4
        Implement Task D                          :impl4, 11:00, 5h

Effective throughput: 4-6 tasks/day (limited by human review bandwidth, not Claude output speed).

Summary Timeline

| Target | Timeline (FTE) | Probability | Critical path |
| --- | --- | --- | --- |
| 25 tasks | 4-6 weeks | 90% | Implementation + testing |
| 50 task items | 6-9 weeks | 70-75% | Implementation + subtask scoping |
| 75 task items | 10-14 weeks | 45-50% | Infrastructure expansion |
| 100 task items | 14-20 weeks | 30-40% | Infrastructure expansion is the bottleneck |

If working part-time (50% allocation), multiply calendar time by ~1.8x (not 2x, because some phases have dead time where agents run independently).
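
Applied to the FTE timelines above, that multiplier works out as follows (a simple scaling of the table, not an independent estimate):

```python
# FTE timelines from the summary table, in weeks (low, high).
fte_weeks = {
    "25 tasks": (4, 6),
    "50 task items": (6, 9),
    "75 task items": (10, 14),
    "100 task items": (14, 20),
}

PART_TIME_FACTOR = 1.8  # 50% allocation, discounted for unattended agent runs

for target, (lo, hi) in fte_weeks.items():
    weeks = f"{round(lo * PART_TIME_FACTOR)}-{round(hi * PART_TIME_FACTOR)}"
    print(f"{target}: ~{weeks} calendar weeks at 50% allocation")
# e.g. "50 task items: ~11-16 calendar weeks at 50% allocation"
```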

Confidence Intervals (for 50 task items, our most likely ambitious target)

| Outcome | Probability |
| --- | --- |
| Done in < 6 weeks | 10% |
| Done in 6-9 weeks | 45% |
| Done in 10-14 weeks | 30% |
| Done in 15+ weeks or stall | 15% |

Part 4: The 900-Task Target

See companion document Task Milestone Viability Analysis for a detailed breakdown of reaching 200, 300, 500, and 900 tasks across single-environment and multi-environment scenarios.

Summary: Nebula's ceiling is ~300-450 standalone tasks with aggressive expansion. 900 requires 2-3 additional simulation platforms with different infrastructure stacks, 5-10 task authors, and 18-24 months.


Part 5: Recommended Strategy

Immediate (next 2 weeks)

  1. Formally review the top 19 low-risk ideas via batch /task-idea-review
  2. Begin implementation of confirmed ideas using agent teams
  3. Target: 15-20 accepted tasks in the pipeline

Short-term (weeks 3-8)

  1. Implement confirmed tasks (agent team sprint)
  2. Research creative cross-component and process tasks
  3. Selective subtask decomposition where natural
  4. Target: 40-50 deliverable task items

Medium-term (weeks 6-14, if pursuing 75+)

  1. Scope and execute 2-3 infrastructure expansions
  2. Generate and implement tasks for new components
  3. Target: 60-80 deliverable task items

Strategic (for 900-task conversation with client)

  1. Present the platform ceiling analysis honestly
  2. Propose additional platforms if client is serious about 900
  3. Frame our contribution as "the Nebula tranche" of a larger multi-platform effort

Appendix: Top 19 Low-Risk Task Ideas (deduplicated)

| # | Task Idea | Primary Category | Overlap Risk | Key Components |
| --- | --- | --- | --- | --- |
| 1 | SLO/SLI framework with multi-window burn-rate alerting | SRE | Low | Prometheus, Grafana, Alertmanager |
| 2 | Alertmanager routing tree with silences and inhibition | SRE | Low | Alertmanager, OnCall, Mattermost |
| 3 | GlitchTip error tracking pipeline integration | SRE | Very Low | GlitchTip, Bleater services |
| 4 | Health check probe cascade debugging | SRE | Low | K8s probes, Bleater services |
| 5 | LogQL-based alert rules for anomaly detection | SRE | Low | Loki, Grafana Ruler |
| 6 | KEDA ScaledObject trigger authentication debugging | Cloud-Ops | Low | KEDA, RabbitMQ, Prometheus |
| 7 | Runbook-driven incident response automation | SRE | Low | OnCall, CronJobs, Mattermost |
| 8 | Structured postmortem analysis from observability data | SRE | Very Low | Prometheus, Loki, Jaeger, Gitea |
| 9 | Prometheus remote write and metric aggregation | SRE | Very Low | Prometheus, Grafana, MinIO |
| 10 | Statping-ng status page configuration | SRE | Very Low | Statping-ng, Bleater services |
| 11 | Toil identification and automation | SRE | Low | CronJobs, Prometheus, Grafana |
| 12 | ArgoCD deployment notification pipeline | Platform Eng | Very Low | ArgoCD, Mattermost, Maddy |
| 13 | Harbor multi-project robot account CI integration | Platform Eng | Low | Harbor, Gitea, Gitea Runner |
| 14 | Multi-tenant namespace provisioning with guardrails | Platform Eng | Low | RBAC, Quotas, NetworkPolicy |
| 15 | CronJob pipeline failure with cascading job backlog | DevOps | Low | CronJobs, PostgreSQL, MinIO |
| 16 | Init container dependency chain deadlock | Cloud-Ops | Low | Init containers, Bleater services |
| 17 | Maddy SMTP relay debugging | DevOps | Low | Maddy, Alertmanager |
| 18 | Kubernetes event exporter pipeline failure | DevOps | Low | Event exporter, Loki |
| 19 | Service account token rotation and auth breakdown | Cloud-Ops | Medium | ServiceAccounts, RBAC |

Appendix: Gap Analysis Source Files

Full gap analysis reports:

  • gap-analysis-sre.md — 356 lines, 12 gaps, 75 existing SRE tasks mapped
  • gap-analysis-platform-engineering.md — 299 lines, 12 gaps, 69 existing PE tasks mapped
  • gap-analysis-devops.md — 398 lines, 12 gaps, 116 existing DevOps threads mapped
  • gap-analysis-cloud-ops.md — 289 lines, 7 gaps, 55 existing cloud-ops tasks mapped