Date: 2026-02-13 | Branch: mrpollo/ci_orchestration | PR: PX4/PX4-Autopilot#26257
- dagar (Daniel Agar): Has NOT commented on this PR at all; not in the timeline.
- MaEtUgR (Matthias Grob): Posted today with two key concerns:
  - "Total CI time is x2" (28 min old vs ~63 min new)
  - "One file instead of split checks makes it significantly harder to maintain forks"
- farhangnaderi: Flagged hardcoded container versions and commented-out code
- github-advanced-security[bot]: 25 comments about missing `permissions:` blocks (now fixed)
- No formal approvals exist
| Metric | Old (15 separate files) | PR (single orchestrator) |
|---|---|---|
| Files | 15 workflow files, 1,091 total lines | 1 file, 1,162 lines |
| Wall-clock (success) | ~28 min | ~63 min |
| Wall-clock (T1 failure) | ~28 min (wasted) | ~6.5 min |
| Cost per success | ~$0.57 (everything runs) | ~$0.57 |
| Cost per T1 failure | ~$0.57 | ~$0.00 |
| Monthly (300 runs) | ~$171 CI-only | ~$98 CI-only |
Expert A identified that the 63-min wall-clock comes from overly conservative tier gating, not from the single-file architecture. Most T3 jobs (ubuntu-builds, macOS, ITCM, flash, failsafe) have zero real dependency on T2 results. T4 jobs (SITL, ROS, MAVROS) only need proof that code compiles (build-sitl), not that all T3 jobs passed.
Key change: Remove T3 from the critical path. The critical path becomes T1 (6.5 min) -> build-sitl (10 min) -> the longest T4 job; basic-tests (15 min) and the remaining T4 jobs run in parallel alongside it, bringing wall-clock from ~63 min down to ~40 min.
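The `needs:` restructuring could look like the sketch below. Job names (tier1-gates, build-sitl, ubuntu-builds, sitl-tests) are illustrative, not the PR's actual job IDs, and the build targets are representative PX4 make targets:

```yaml
# Sketch only -- hypothetical job names, not the PR's implementation.
jobs:
  tier1-gates:                  # T1: fast lint/format gates (~6.5 min)
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: make check_format
  build-sitl:                   # T2: compile proof + ccache seeding (~10 min)
    needs: tier1-gates
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: make px4_sitl_default
  ubuntu-builds:                # T3: depends only on T1, so it no
    needs: tier1-gates          # longer sits on the critical path
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: make px4_fmu-v6x_default
  sitl-tests:                   # T4: needs proof the code compiles
    needs: build-sitl           # (build-sitl), not every T3 job
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: make tests
```

The point of the sketch: T4's `needs:` references only build-sitl, so slow T3 platform builds run in parallel with integration tests instead of gating them.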
| Aspect | Details |
|---|---|
| Wall-clock | ~38-42 min |
| Cost per run | ~$0.57 (unchanged) |
| DRY wins | Composite actions for ccache reduce 164 lines of boilerplate |
| Fork-friendliness | vars.CI_SKIP_* repository variables (no file edits) |
| Implementation | ~12 hours (3 phases) |
Pros: Single source of truth, no reusable workflow limitations, simpler GitHub UI. Cons: Large file, merge conflicts possible, harder to test in isolation.
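The ccache composite action that drives Option A's DRY win could be sketched as follows. The input name, cache key scheme, and paths are assumptions, not the PR's exact code:

```yaml
# .github/actions/setup-ccache/action.yml -- illustrative sketch.
name: Setup ccache
description: Restore and configure ccache for a build job
inputs:
  cache-key:
    description: Discriminator for this job's ccache entry (assumed input name)
    required: true
runs:
  using: composite
  steps:
    - uses: actions/cache@v4
      with:
        path: ~/.ccache
        key: ccache-${{ inputs.cache-key }}-${{ github.sha }}
        restore-keys: ccache-${{ inputs.cache-key }}-
    - run: |
        ccache --max-size=1G
        ccache --zero-stats
      shell: bash
```

Each build job would then replace its repeated cache/ccache boilerplate with a single `uses: ./.github/actions/setup-ccache` step, which is where the ~164-line reduction comes from.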
Expert B proposed splitting into a lightweight orchestrator (~100-150 lines) calling reusable workflows via workflow_call, plus composite actions for shared steps.
Proposed file layout:
.github/
  workflows/
    ci-orchestrator.yml      (~120 lines, calls tier workflows)
    ci-tier1-gates.yml       (~80 lines)
    ci-tier2-builds.yml      (~200 lines)
    ci-tier3-platforms.yml   (~250 lines)
    ci-tier4-integration.yml (~350 lines)
  actions/
    setup-ccache/action.yml  (~30 lines)
    setup-px4-dev/action.yml (~20 lines)
| Aspect | Details |
|---|---|
| Wall-clock | ~42-48 min (workflow_call adds ~1 min dispatch overhead per tier) |
| Fork-friendliness | Forks can delete entire tier files or override via workflow_call inputs |
| Limitations | Max 4 levels of workflow_call nesting; `needs:` only works between jobs in the same file; passing outputs between workflows adds complexity |
| Implementation | ~16-20 hours |
Pros: Clean separation, fork-friendly, smaller files to review. Cons: Workflow dispatch overhead, complex secrets/permissions forwarding, harder to visualize full pipeline.
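Option B's orchestrator could be sketched as below, following the proposed file layout. The output name (`sitl_artifact`) is hypothetical and assumes ci-tier2-builds.yml declares it at the workflow_call level:

```yaml
# .github/workflows/ci-orchestrator.yml -- sketch, not an implementation.
name: CI
on: [push, pull_request]
jobs:
  tier1:
    uses: ./.github/workflows/ci-tier1-gates.yml
  tier2:
    needs: tier1
    uses: ./.github/workflows/ci-tier2-builds.yml
    secrets: inherit            # secrets must be forwarded explicitly
  tier4:
    needs: tier2
    uses: ./.github/workflows/ci-tier4-integration.yml
    with:
      sitl_artifact: ${{ needs.tier2.outputs.sitl_artifact }}
    secrets: inherit
```

This is also where the cons become visible: every secret and output must be threaded through `secrets:`/`with:` by hand, which is the forwarding complexity noted above.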
Monthly cost projections across three scenarios:
| Scenario | Monthly CI (T1-T4) | Monthly CI (T1-T5) | Dev Time Cost | Total |
|---|---|---|---|---|
| Old (All Parallel) | $171 | $771 | $0 | $771 |
| Tiered Monolith | $98 | $368 | $843 | $1,211 |
| Hybrid (Recommended) | $98 | $368 | $574 | $942 |
Key finding: When developer time is factored in, the strict tiered approach is actually more expensive than the old parallel approach. The hybrid (T1 gates, then T2+T3 parallel, then T4) recovers 11 min at zero additional CI cost.
Break-even: Tiered gating only pays for itself (including dev time) at ~87-89% T1/T2 failure rate. Current is ~40%.
Counter-argument: PX4 is open-source; dev time cost may be externalized. If only CI costs matter, tiered always wins.
Expert D designed a maximum-parallelism pipeline:
Core insight: The entire tier system is a cost-saving illusion, not a technical dependency. Every job fetches its own ccache from the cache service independently. Only 3 real dependency chains exist:
- build-sitl -> basic-tests / ekf-check / sitl-tests (ccache seeding)
- flash-build-current + flash-build-baseline -> flash-compare (data)
- Gate checks -> cancel-watchdog (abort-on-failure)
Architecture: Everything starts at T=0. Gate checks run as a cancel-trigger, not a gate. If lint fails, a watchdog job cancels all running jobs within 3 minutes.
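The cancel-trigger pattern could be sketched as a watchdog job that cancels the whole run via the GitHub CLI when a gate fails. Job names here are illustrative:

```yaml
# Sketch of the F1 cancel-watchdog idea -- hypothetical job names.
jobs:
  gates:                        # lint/format gates, start at T=0
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: make check_format
  cancel-watchdog:
    needs: gates
    if: failure()               # runs only when a gate job failed
    runs-on: ubuntu-latest
    permissions:
      actions: write            # needed to cancel a workflow run
    steps:
      - run: gh run cancel ${{ github.run_id }} --repo ${{ github.repository }}
        env:
          GH_TOKEN: ${{ github.token }}
```

Unlike a `needs:` gate, nothing waits on `gates`; the watchdog only converts a gate failure into a run-wide cancellation, which is what bounds the wasted cost to the few minutes before the cancel lands.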
Time 0 5 10 15 20 25 30
| | | | | | |
T1: ###............................. (watchdog: cancel on failure)
BSITL:######.......................... (8cpu spot, 5-6 min)
CTIDY:############.................... (16cpu spot, 10-12 min)
UB22: ########........................ (8cpu spot, 6-8 min)
UB24: ########........................ (8cpu spot, 6-8 min)
macOS:##################.............. (GitHub free, 15-18 min)
ITCM: ##########...................... (8cpu spot x4, 8-10 min)
FLASH:######################.......... (SPLIT into 4 parallel jobs)
FSAFE:##########...................... (4cpu spot, 8-10 min)
| | | | | | |
| +-########................. basic-tests (waits for ccache)
| +-######................... ekf-check (waits for ccache)
| | | | | | |
SITL1: ####################### .. iris (8cpu spot, starts T=5)
SITL2: ####################### .. tailsitter (8cpu spot, starts T=5)
SITL3: ####################### .. std_vtol (8cpu spot, starts T=5)
ROS: ############################ .. ROS integration (8cpu spot, ~28 min)
MAV1: ######################### ..... MAVROS mission (8cpu spot)
MAV2: ######################### ..... MAVROS offboard (8cpu spot)
ROST1:##########...................... ROS translation humble
ROST2:##########...................... ROS translation jazzy
| | | | | | |
DONE ----------------------> ~28 min
| Aspect | Details |
|---|---|
| Wall-clock | ~28 min (bound by ROS integration and SITL tests) |
| Cost per run | ~$3.43 (6x increase due to parallelism + 8cpu upgrades) |
| Monthly cost (300 runs) | ~$1,029 CI-only (vs $98 tiered, $171 old parallel) |
| Peak concurrent jobs | ~25 (all self-hosted via RunsOn, no GitHub limits) |
| Key optimizations | All-parallel, 8cpu spot runners, split flash-analysis, gate watchdog |
| Wasted cost on lint failure | ~$0.30 (3 min of 25 runners before watchdog cancels) |
Additional optimizations in F1 approach:
- `fetch-depth: 1` instead of `0` where full history is not needed
- Selective submodule init for ITCM (not all 31 submodules)
- Pre-built Docker image with emscripten for failsafe-sim
- Cache MAVROS Docker image
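The checkout optimizations above could be sketched like this; the submodule path is an example, not an audited list of what ITCM actually needs:

```yaml
# Sketch of the shallow-clone and selective-submodule optimizations.
- uses: actions/checkout@v4
  with:
    fetch-depth: 1        # shallow clone; jobs that diff against a
                          # baseline (e.g. flash-compare) still need
                          # full history
- name: Init only the submodules this job builds
  run: git submodule update --init --depth 1 src/modules/mavlink/mavlink
```

The saving is per-job: each of the ~25 parallel jobs clones independently, so trimming clone and submodule time multiplies across the whole run.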
| | Old Parallel | Tiered (current PR) | Option A (Optimized Monolith) | Option B (Modular Split) | Option D (F1) |
|---|---|---|---|---|---|
| Wall-clock | 28 min | 63 min | 38-42 min | 42-48 min | 28 min |
| Cost/run | $0.57 | $0.57 | $0.57 | $0.57 | $3.43 |
| Monthly CI | $171 | $98 | $98 | $98 | $1,029 |
| Early exit (T1 fail) | No | Yes ($0) | Yes ($0) | Yes ($0) | Yes (~$0.30) |
| Fork-friendly | Yes | No | Medium (vars) | Yes | No |
| Maintainability | Good | Poor | Medium | Good | Poor |
| Files | 15 | 1 | 1 + 2 actions | 5 + 2 actions | 1 + 2 actions |
| Implementation | Done | Done | 12 hrs | 16-20 hrs | 20-24 hrs |
| Risk | None | None | Low | Medium | Medium (spot) |
Best balance: Option A (Optimized Single-File) as the immediate fix, with elements of Option B for long-term.
- Immediate (Phase 1, 6 hours): Restructure `needs:` in the current monolith to remove T3 from the critical path. This alone drops wall-clock from 63 to ~40 min at zero cost change and directly addresses MaEtUgR's "2x time" concern.
- Short-term (Phase 2, 4 hours): Add composite actions for ccache and `vars.CI_SKIP_*` repository variables for fork-friendliness. This addresses the maintainability and fork concerns.
- Medium-term (if 30 min is a hard requirement): Evaluate the F1 approach selectively -- upgrade build jobs to 8cpu spot, split flash-analysis into parallel builds. Gets to ~32-35 min at ~$1.50/run.
- Long-term: Consider the modular split (Option B) if the team grows and CI changes become frequent enough to cause merge conflicts.
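The Phase 2 fork toggle could be sketched as below. `CI_SKIP_MACOS` is a hypothetical variable name following the `vars.CI_SKIP_*` convention, and the job is illustrative:

```yaml
# Sketch: a repository variable gates an expensive job so a fork can
# disable it without editing any workflow file.
jobs:
  macos-build:
    if: ${{ vars.CI_SKIP_MACOS != 'true' }}
    runs-on: macos-14
    steps:
      - uses: actions/checkout@v4
      - run: make px4_sitl_default
```

A fork sets the variable under Settings -> Secrets and variables -> Actions -> Variables; an unset variable evaluates to an empty string, so upstream behavior is unchanged by default.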