@mrpollo
Created February 13, 2026 15:10
PX4 CI Orchestrator Analysis: PR #26257 - Four expert reports analyzing build time, cost, and architecture options

PX4 CI Orchestrator Analysis: PR #26257

Date: 2026-02-13 | Branch: mrpollo/ci_orchestration | PR: PX4/PX4-Autopilot#26257


PR Review Status

  • dagar (Daniel Agar): Has not commented on this PR and does not appear in its timeline.
  • MaEtUgR (Matthias Grob): Posted today (2026-02-13) with two key concerns:
    1. "Total CI time is x2" (28 min old vs ~63 min new)
    2. "One file instead of split checks makes it significantly harder to maintain forks"
  • farhangnaderi: Flagged hardcoded container versions, commented-out code
  • github-advanced-security[bot]: 25 comments about missing permissions: blocks (now fixed)
  • No formal approvals exist

Current State

| Metric | Old (15 separate files) | PR (single orchestrator) |
|---|---|---|
| Files | 15 workflow files, 1,091 total lines | 1 file, 1,162 lines |
| Wall-clock (success) | ~28 min | ~63 min |
| Wall-clock (T1 failure) | ~28 min (wasted) | ~6.5 min |
| Cost per success | ~$0.57 (everything runs) | ~$0.57 |
| Cost per T1 failure | ~$0.57 | ~$0.00 |
| Monthly (300 runs) | ~$171 CI-only | ~$98 CI-only |

Four Expert Reports

Option A: Optimized Single-File Orchestrator

Expert A identified that the 63-min wall-clock comes from overly conservative tier gating, not from the single-file architecture. Most T3 jobs (ubuntu-builds, macOS, ITCM, flash, failsafe) have zero real dependency on T2 results. T4 jobs (SITL, ROS, MAVROS) only need proof that code compiles (build-sitl), not that all T3 jobs passed.

Key change: Remove T3 from the critical path. The path becomes T1 (6.5 min) + build-sitl (10 min) + basic-tests (15 min) + T4 (45 min) = ~40 min.
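
A minimal sketch of that re-gating, assuming hypothetical job ids (`tier1-gates`, `build-sitl`, `basic-tests`, `ubuntu-builds`, `sitl-tests`), placeholder steps, and GitHub-hosted runners; the PR's actual job names and runner labels may differ. The T3-style job depends only on the T1 gate, while the T4-style job depends on `build-sitl` and `basic-tests` rather than on every T3 job:

```yaml
# Sketch only: job ids, runner labels, and steps are placeholders, not the PR's actual ones.
name: ci-orchestrator-sketch
on: [pull_request]

permissions:
  contents: read

jobs:
  tier1-gates:              # T1: fast lint/format gates
    runs-on: ubuntu-latest
    steps:
      - run: echo "gate checks"

  build-sitl:               # T2: proves the code compiles
    needs: tier1-gates
    runs-on: ubuntu-latest
    steps:
      - run: echo "build SITL"

  basic-tests:
    needs: build-sitl
    runs-on: ubuntu-latest
    steps:
      - run: echo "basic tests"

  ubuntu-builds:            # T3: gated on T1 only, so it runs alongside T2
    needs: tier1-gates
    runs-on: ubuntu-latest
    steps:
      - run: echo "platform build"

  sitl-tests:               # T4: waits for the SITL build and tests, not for T3
    needs: [build-sitl, basic-tests]
    runs-on: ubuntu-latest
    steps:
      - run: echo "SITL integration tests"
```

With this shape a slow T3 job can still fail the run, but it no longer delays the start of T4.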

| Aspect | Details |
|---|---|
| Wall-clock | ~38-42 min |
| Cost per run | ~$0.57 (unchanged) |
| DRY wins | Composite actions for ccache reduce 164 lines of boilerplate |
| Fork-friendliness | vars.CI_SKIP_* repository variables (no file edits) |
| Implementation | ~12 hours (3 phases) |

Pros: Single source of truth, no reusable workflow limitations, simpler GitHub UI.

Cons: Large file, merge conflicts possible, harder to test in isolation.
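
The ccache composite action and the `vars.CI_SKIP_*` switches mentioned above could look roughly like the following. This is a sketch under assumptions: the action path, its input names, the cache directory, and the `CI_SKIP_UBUNTU` variable are all hypothetical, and it presumes `ccache` is already installed on the runner or in the job's container.

```yaml
# .github/actions/setup-ccache/action.yml (hypothetical path and inputs)
name: Setup ccache
description: Restore and configure ccache for a build job (sketch)
inputs:
  cache-key-prefix:
    description: Prefix that keeps caches from different jobs apart
    required: true
  max-size:
    description: Upper bound passed to ccache
    required: false
    default: 300M
runs:
  using: composite
  steps:
    - name: Restore ccache directory
      uses: actions/cache@v4
      with:
        path: ~/.ccache
        key: ${{ inputs.cache-key-prefix }}-${{ github.sha }}
        restore-keys: |
          ${{ inputs.cache-key-prefix }}-
    - name: Configure ccache
      shell: bash            # composite `run` steps must declare a shell
      env:
        MAX_SIZE: ${{ inputs.max-size }}
      run: |
        export CCACHE_DIR="$HOME/.ccache"
        echo "CCACHE_DIR=$CCACHE_DIR" >> "$GITHUB_ENV"   # visible to later steps
        ccache --max-size="$MAX_SIZE"
        ccache --zero-stats
```

A job in the orchestrator would then call the action and honor a skip variable:

```yaml
jobs:
  ubuntu-builds:
    # A fork sets the repository variable CI_SKIP_UBUNTU (hypothetical name) to 'true'
    # to skip this job; an unset variable evaluates to an empty string, so the job runs.
    if: ${{ vars.CI_SKIP_UBUNTU != 'true' }}
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: ./.github/actions/setup-ccache
        with:
          cache-key-prefix: ccache-ubuntu
      - run: echo "build"    # placeholder for the real build step
```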


Option B: Modular Split Architecture

Expert B proposed splitting into a lightweight orchestrator (~100-150 lines) calling reusable workflows via workflow_call, plus composite actions for shared steps.

Proposed file layout:

.github/
  workflows/
    ci-orchestrator.yml        (~120 lines, calls tier workflows)
    ci-tier1-gates.yml         (~80 lines)
    ci-tier2-builds.yml        (~200 lines)
    ci-tier3-platforms.yml     (~250 lines)
    ci-tier4-integration.yml   (~350 lines)
  actions/
    setup-ccache/action.yml    (~30 lines)
    setup-px4-dev/action.yml   (~20 lines)

| Aspect | Details |
|---|---|
| Wall-clock | ~42-48 min (workflow_call adds ~1 min dispatch overhead per tier) |
| Fork-friendliness | Forks can delete entire tier files or override via workflow_call inputs |
| Limitations | 4-level nesting max, needs only within same file, output passing complexity |
| Implementation | ~16-20 hours |

Pros: Clean separation, fork-friendly, smaller files to review.

Cons: Workflow dispatch overhead, complex secrets/permissions forwarding, harder to visualize full pipeline.
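
As an illustration of the split, the orchestrator might reduce to a handful of `uses:` calls. The sketch below reuses the file names from the proposed layout; the trigger, the `skip_macos` input, and the `CI_SKIP_MACOS` variable are assumptions made for the example, not part of Expert B's report.

```yaml
# .github/workflows/ci-orchestrator.yml (sketch; trigger and inputs are assumptions)
name: CI Orchestrator
on:
  pull_request:

permissions:
  contents: read

jobs:
  tier1:
    uses: ./.github/workflows/ci-tier1-gates.yml

  tier2:
    needs: tier1
    uses: ./.github/workflows/ci-tier2-builds.yml
    secrets: inherit         # secrets reach a called workflow only when forwarded

  tier3:
    needs: tier1
    uses: ./.github/workflows/ci-tier3-platforms.yml
    with:
      skip_macos: ${{ vars.CI_SKIP_MACOS == 'true' }}   # hypothetical input and variable

  tier4:
    needs: tier2
    uses: ./.github/workflows/ci-tier4-integration.yml
    secrets: inherit
```

Each tier file declares `workflow_call` as its trigger and the inputs it accepts:

```yaml
# .github/workflows/ci-tier3-platforms.yml (sketch)
name: CI Tier 3 Platforms
on:
  workflow_call:
    inputs:
      skip_macos:
        type: boolean
        default: false

jobs:
  macos:
    if: ${{ !inputs.skip_macos }}
    runs-on: macos-latest
    steps:
      - uses: actions/checkout@v4
      - run: echo "macOS build"   # placeholder for the real build step
```

Note that `needs:` in the orchestrator can only name the calling jobs (`tier1`, `tier2`, ...), not individual jobs inside a tier file, which is the limitation listed in the table.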


Option C: Cost-Time Analysis (Expert C)

Monthly cost projections across three scenarios:

| Scenario | Monthly CI (T1-T4) | Monthly CI (T1-T5) | Dev Time Cost | Total |
|---|---|---|---|---|
| Old (All Parallel) | $171 | $771 | $0 | $771 |
| Tiered Monolith | $98 | $368 | $843 | $1,211 |
| Hybrid (Recommended) | $98 | $368 | $574 | $942 |

Key finding: When developer time is factored in, the strict tiered approach is actually more expensive than the old parallel approach. The hybrid (T1 gates, then T2+T3 parallel, then T4) recovers 11 min at zero additional CI cost.

Break-even: Tiered gating only pays for itself (including dev time) at a ~87-89% T1/T2 failure rate. The current rate is ~40%.

Counter-argument: PX4 is open-source; dev time cost may be externalized. If only CI costs matter, tiered always wins.


Option D: F1 of CI (Sub-30 Minutes, Spare No Cost)

Expert D designed a maximum-parallelism pipeline:

Core insight: The entire tier system is a cost-saving illusion, not a technical dependency. Every job fetches its own ccache from the cache service independently. Only 3 real dependency chains exist:

  1. build-sitl -> basic-tests / ekf-check / sitl-tests (ccache seeding)
  2. flash-build-current + flash-build-baseline -> flash-compare (data)
  3. Gate checks -> cancel-watchdog (abort-on-failure)

Architecture: Everything starts at T=0. Gate checks run as a cancel-trigger, not a gate. If lint fails, a watchdog job cancels all running jobs within 3 minutes.

Time  0    5    10   15   20   25   30
      |    |    |    |    |    |    |
T1:   ###............................. (watchdog: cancel on failure)
BSITL:######.......................... (8cpu spot, 5-6 min)
CTIDY:############.................... (16cpu spot, 10-12 min)
UB22: ########........................ (8cpu spot, 6-8 min)
UB24: ########........................ (8cpu spot, 6-8 min)
macOS:##################.............. (GitHub free, 15-18 min)
ITCM: ##########...................... (8cpu spot x4, 8-10 min)
FLASH:######################.......... (SPLIT into 4 parallel jobs)
FSAFE:##########...................... (4cpu spot, 8-10 min)
      |    |    |    |    |    |    |
      |    +-########................. basic-tests (waits for ccache)
      |    +-######................... ekf-check (waits for ccache)
      |    |    |    |    |    |    |
SITL1:     ####################### .. iris (8cpu spot, starts T=5)
SITL2:     ####################### .. tailsitter (8cpu spot, starts T=5)
SITL3:     ####################### .. std_vtol (8cpu spot, starts T=5)
ROS:  ############################ .. ROS integration (8cpu spot, ~28 min)
MAV1: ######################### ..... MAVROS mission (8cpu spot)
MAV2: ######################### ..... MAVROS offboard (8cpu spot)
ROST1:##########...................... ROS translation humble
ROST2:##########...................... ROS translation jazzy
      |    |    |    |    |    |    |
      DONE ----------------------> ~28 min

| Aspect | Details |
|---|---|
| Wall-clock | ~28 min (bound by ROS integration and SITL tests) |
| Cost per run | ~$3.43 (6x increase due to parallelism + 8cpu upgrades) |
| Monthly cost (300 runs) | ~$1,029 CI-only (vs $98 tiered, $171 old parallel) |
| Peak concurrent jobs | ~25 (all self-hosted via RunsOn, no GitHub limits) |
| Key optimizations | All-parallel, 8cpu spot runners, split flash-analysis, gate watchdog |
| Wasted cost on lint failure | ~$0.30 (3 min of 25 runners before watchdog cancels) |
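
The cancel-on-failure watchdog described above could be sketched as follows. Job ids and steps are placeholders, and the example assumes a GitHub-hosted runner where the `gh` CLI is preinstalled; on the RunsOn spot runners the report proposes, `gh` would need to be available in the image.

```yaml
# Sketch only: job ids and steps are placeholders.
name: ci-f1-sketch
on: [pull_request]

permissions:
  contents: read

jobs:
  gate-checks:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: echo "lint / format checks"   # placeholder

  build-sitl:                 # no `needs:`, so it starts at T=0 alongside the gates
    runs-on: ubuntu-latest
    steps:
      - run: echo "build SITL"

  cancel-watchdog:
    needs: gate-checks
    if: ${{ failure() }}      # runs only when gate-checks failed
    runs-on: ubuntu-latest
    permissions:
      actions: write          # required to cancel a workflow run
    steps:
      - name: Cancel the whole run
        env:
          GH_TOKEN: ${{ github.token }}
        run: gh run cancel "${{ github.run_id }}" --repo "${{ github.repository }}"
```

One caveat: on `pull_request` runs from forks the default token is read-only, so the cancel call would fail there and the remaining jobs would simply run to completion.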

Additional optimizations in F1 approach:

  • fetch-depth: 1 instead of 0 where full history not needed
  • Selective submodule init for ITCM (not all 31 submodules)
  • Pre-built Docker image with emscripten for failsafe-sim
  • Cache MAVROS Docker image
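
The first two items can be sketched as a checkout fragment. The job id and submodule paths are placeholders; the real list would be whichever submodules the ITCM build actually needs.

```yaml
jobs:
  itcm-check:                 # hypothetical job id
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 1      # shallow clone; use 0 only where full history is needed
          submodules: false   # skip the blanket recursive submodule init
      - name: Init only the submodules this job needs
        run: |
          # Placeholder paths: replace with the submodules the build really uses.
          git submodule update --init --depth 1 -- path/to/submodule-a path/to/submodule-b
      - run: echo "ITCM build"   # placeholder for the real build step
```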

Comparison Matrix

| | Old Parallel | Tiered (current PR) | Option A (Optimized Monolith) | Option B (Modular Split) | Option D (F1) |
|---|---|---|---|---|---|
| Wall-clock | 28 min | 63 min | 38-42 min | 42-48 min | 28 min |
| Cost/run | $0.57 | $0.57 | $0.57 | $0.57 | $3.43 |
| Monthly CI | $171 | $98 | $98 | $98 | $1,029 |
| Early exit (T1 fail) | No | Yes ($0) | Yes ($0) | Yes ($0) | Yes (~$0.30) |
| Fork-friendly | Yes | No | Medium (vars) | Yes | No |
| Maintainability | Good | Poor | Medium | Good | Poor |
| Files | 15 | 1 | 1 + 2 actions | 5 + 2 actions | 1 + 2 actions |
| Implementation | Done | Done | 12 hrs | 16-20 hrs | 20-24 hrs |
| Risk | None | None | Low | Medium | Medium (spot) |

Recommendation

Best balance: Option A (Optimized Single-File) as the immediate fix, with elements of Option B for the long term.

  1. Immediate (Phase 1, 6 hours): Restructure needs: in the current monolith to remove T3 from the critical path. This alone drops wall-clock from 63 to ~40 min at zero cost change. This directly addresses MaEtUgR's "2x time" concern.

  2. Short-term (Phase 2, 4 hours): Add composite actions for ccache, add vars.CI_SKIP_* for fork-friendliness. This addresses the maintainability and fork concerns.

  3. Medium-term (if 30 min is a hard requirement): Evaluate F1 approach selectively -- upgrade build jobs to 8cpu spot, split flash-analysis into parallel builds. Gets to ~32-35 min at ~$1.50/run.

  4. Long-term: Consider modular split (Option B) if the team grows and CI changes become frequent enough to cause merge conflicts.
