Date: 2026-02-13 | Branch: mrpollo/ci_orchestration | PR: PX4/PX4-Autopilot#26257
- dagar (Daniel Agar): Has NOT commented on this PR at all; not in the timeline.
- MaEtUgR (Matthias Grob): Posted today with two key concerns:
  - "Total CI time is x2" (28 min old vs ~63 min new)
  - "One file instead of split checks makes it significantly harder to maintain forks"
- farhangnaderi: Flagged hardcoded container versions and commented-out code
- github-advanced-security[bot]: 25 comments about missing `permissions:` blocks (now fixed)
- No formal approvals exist
| Metric | Old (15 separate files) | PR (single orchestrator) |
|---|---|---|
| Files | 15 workflow files, 1,091 total lines | 1 file, 1,162 lines |
| Wall-clock (success) | ~28 min | ~63 min |
| Wall-clock (T1 failure) | ~28 min (wasted) | ~6.5 min |
| Cost per success | ~$0.57 (everything runs) | ~$0.57 |
| Cost per T1 failure | ~$0.57 | ~$0.00 |
| Monthly (300 runs) | ~$171 CI-only | ~$98 CI-only |
Expert A identified that the 63-min wall-clock comes from overly conservative tier gating, not from the single-file architecture. Most T3 jobs (ubuntu-builds, macOS, ITCM, flash, failsafe) have zero real dependency on T2 results. T4 jobs (SITL, ROS, MAVROS) only need proof that code compiles (build-sitl), not that all T3 jobs passed.
Key change: Remove T3 from the critical path. The critical path becomes T1 (6.5 min) -> build-sitl (10 min) -> the longest T4 job; basic-tests (15 min) and the remaining T4 jobs run in parallel alongside it, bringing wall-clock from ~63 min down to ~40 min.
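The `needs:` restructuring could look like the sketch below. Job names (tier1-gates, build-sitl, ubuntu-builds, sitl-tests) are illustrative, not the PR's actual job IDs, and the build targets are representative PX4 make targets:

```yaml
# Sketch only -- hypothetical job names, not the PR's implementation.
jobs:
  tier1-gates:                  # T1: fast lint/format gates (~6.5 min)
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: make check_format
  build-sitl:                   # T2: compile proof + ccache seeding (~10 min)
    needs: tier1-gates
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: make px4_sitl_default
  ubuntu-builds:                # T3: depends only on T1, so it no
    needs: tier1-gates          # longer sits on the critical path
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: make px4_fmu-v6x_default
  sitl-tests:                   # T4: needs proof the code compiles
    needs: build-sitl           # (build-sitl), not every T3 job
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: make tests
```

The point of the sketch: T4's `needs:` references only build-sitl, so slow T3 platform builds run in parallel with integration tests instead of gating them.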
| Aspect | Details |
|---|---|
| Wall-clock | ~38-42 min |
| Cost per run | ~$0.57 (unchanged) |
| DRY wins | Composite actions for ccache reduce 164 lines of boilerplate |
| Fork-friendliness | vars.CI_SKIP_* repository variables (no file edits) |
| Implementation | ~12 hours (3 phases) |
Pros: Single source of truth, no reusable workflow limitations, simpler GitHub UI. Cons: Large file, merge conflicts possible, harder to test in isolation.
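The ccache composite action that drives Option A's DRY win could be sketched as follows. The input name, cache key scheme, and paths are assumptions, not the PR's exact code:

```yaml
# .github/actions/setup-ccache/action.yml -- illustrative sketch.
name: Setup ccache
description: Restore and configure ccache for a build job
inputs:
  cache-key:
    description: Discriminator for this job's ccache entry (assumed input name)
    required: true
runs:
  using: composite
  steps:
    - uses: actions/cache@v4
      with:
        path: ~/.ccache
        key: ccache-${{ inputs.cache-key }}-${{ github.sha }}
        restore-keys: ccache-${{ inputs.cache-key }}-
    - run: |
        ccache --max-size=1G
        ccache --zero-stats
      shell: bash
```

Each build job would then replace its repeated cache/ccache boilerplate with a single `uses: ./.github/actions/setup-ccache` step, which is where the ~164-line reduction comes from.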
Expert B proposed splitting into a lightweight orchestrator (~100-150 lines) calling reusable workflows via workflow_call, plus composite actions for shared steps.
Proposed file layout:
.github/
  workflows/
    ci-orchestrator.yml      (~120 lines, calls tier workflows)
    ci-tier1-gates.yml       (~80 lines)
    ci-tier2-builds.yml      (~200 lines)
    ci-tier3-platforms.yml   (~250 lines)
    ci-tier4-integration.yml (~350 lines)
  actions/
    setup-ccache/action.yml  (~30 lines)
    setup-px4-dev/action.yml (~20 lines)
| Aspect | Details |
|---|---|
| Wall-clock | ~42-48 min (workflow_call adds ~1 min dispatch overhead per tier) |
| Fork-friendliness | Forks can delete entire tier files or override via workflow_call inputs |
| Limitations | Max 4 levels of workflow_call nesting; `needs:` only works between jobs in the same file; passing outputs between workflows adds complexity |
| Implementation | ~16-20 hours |
Pros: Clean separation, fork-friendly, smaller files to review. Cons: Workflow dispatch overhead, complex secrets/permissions forwarding, harder to visualize full pipeline.
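Option B's orchestrator could be sketched as below, following the proposed file layout. The output name (`sitl_artifact`) is hypothetical and assumes ci-tier2-builds.yml declares it at the workflow_call level:

```yaml
# .github/workflows/ci-orchestrator.yml -- sketch, not an implementation.
name: CI
on: [push, pull_request]
jobs:
  tier1:
    uses: ./.github/workflows/ci-tier1-gates.yml
  tier2:
    needs: tier1
    uses: ./.github/workflows/ci-tier2-builds.yml
    secrets: inherit            # secrets must be forwarded explicitly
  tier4:
    needs: tier2
    uses: ./.github/workflows/ci-tier4-integration.yml
    with:
      sitl_artifact: ${{ needs.tier2.outputs.sitl_artifact }}
    secrets: inherit
```

This is also where the cons become visible: every secret and output must be threaded through `secrets:`/`with:` by hand, which is the forwarding complexity noted above.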
Monthly cost projections across three scenarios:
| Scenario | Monthly CI (T1-T4) | Monthly CI (T1-T5) | Dev Time Cost | Total |
|---|---|---|---|---|
| Old (All Parallel) | $171 | $771 | $0 | $771 |
| Tiered Monolith | $98 | $368 | $843 | $1,211 |
| Hybrid (Recommended) | $98 | $368 | $574 | $942 |
Key finding: When developer time is factored in, the strict tiered approach is actually more expensive than the old parallel approach. The hybrid (T1 gates, then T2+T3 parallel, then T4) recovers 11 min at zero additional CI cost.
Break-even: Tiered gating only pays for itself (including dev time) at ~87-89% T1/T2 failure rate. Current is ~40%.
Counter-argument: PX4 is open-source; dev time cost may be externalized. If only CI costs matter, tiered always wins.
Expert D designed a maximum-parallelism pipeline:
Core insight: The entire tier system is a cost-saving illusion, not a technical dependency. Every job fetches its own ccache from the cache service independently. Only 3 real dependency chains exist:
- build-sitl -> basic-tests / ekf-check / sitl-tests (ccache seeding)
- flash-build-current + flash-build-baseline -> flash-compare (data)
- Gate checks -> cancel-watchdog (abort-on-failure)
Architecture: Everything starts at T=0. Gate checks run as a cancel-trigger, not a gate. If lint fails, a watchdog job cancels all running jobs within 3 minutes.
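The cancel-trigger pattern could be sketched as a watchdog job that cancels the whole run via the GitHub CLI when a gate fails. Job names here are illustrative:

```yaml
# Sketch of the F1 cancel-watchdog idea -- hypothetical job names.
jobs:
  gates:                        # lint/format gates, start at T=0
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: make check_format
  cancel-watchdog:
    needs: gates
    if: failure()               # runs only when a gate job failed
    runs-on: ubuntu-latest
    permissions:
      actions: write            # needed to cancel a workflow run
    steps:
      - run: gh run cancel ${{ github.run_id }} --repo ${{ github.repository }}
        env:
          GH_TOKEN: ${{ github.token }}
```

Unlike a `needs:` gate, nothing waits on `gates`; the watchdog only converts a gate failure into a run-wide cancellation, which is what bounds the wasted cost to the few minutes before the cancel lands.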
Time 0 5 10 15 20 25 30
| | | | | | |
T1: ###............................. (watchdog: cancel on failure)
BSITL:######.......................... (8cpu spot, 5-6 min)
CTIDY:############.................... (16cpu spot, 10-12 min)
UB22: ########........................ (8cpu spot, 6-8 min)
UB24: ########........................ (8cpu spot, 6-8 min)
macOS:##################.............. (GitHub free, 15-18 min)
ITCM: ##########...................... (8cpu spot x4, 8-10 min)
FLASH:######################.......... (SPLIT into 4 parallel jobs)
FSAFE:##########...................... (4cpu spot, 8-10 min)
| | | | | | |
| +-########................. basic-tests (waits for ccache)
| +-######................... ekf-check (waits for ccache)
| | | | | | |
SITL1: ####################### .. iris (8cpu spot, starts T=5)
SITL2: ####################### .. tailsitter (8cpu spot, starts T=5)
SITL3: ####################### .. std_vtol (8cpu spot, starts T=5)
ROS: ############################ .. ROS integration (8cpu spot, ~28 min)
MAV1: ######################### ..... MAVROS mission (8cpu spot)
MAV2: ######################### ..... MAVROS offboard (8cpu spot)
ROST1:##########...................... ROS translation humble
ROST2:##########...................... ROS translation jazzy
| | | | | | |
DONE ----------------------> ~28 min
| Aspect | Details |
|---|---|
| Wall-clock | ~28 min (bound by ROS integration and SITL tests) |
| Cost per run | ~$3.43 (6x increase due to parallelism + 8cpu upgrades) |
| Monthly cost (300 runs) | ~$1,029 CI-only (vs $98 tiered, $171 old parallel) |
| Peak concurrent jobs | ~25 (all self-hosted via RunsOn, no GitHub limits) |
| Key optimizations | All-parallel, 8cpu spot runners, split flash-analysis, gate watchdog |
| Wasted cost on lint failure | ~$0.30 (3 min of 25 runners before watchdog cancels) |
Additional optimizations in F1 approach:
- `fetch-depth: 1` instead of `0` where full history is not needed
- Selective submodule init for ITCM (not all 31 submodules)
- Pre-built Docker image with emscripten for failsafe-sim
- Cache MAVROS Docker image
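The checkout optimizations above could be sketched like this; the submodule path is an example, not an audited list of what ITCM actually needs:

```yaml
# Sketch of the shallow-clone and selective-submodule optimizations.
- uses: actions/checkout@v4
  with:
    fetch-depth: 1        # shallow clone; jobs that diff against a
                          # baseline (e.g. flash-compare) still need
                          # full history
- name: Init only the submodules this job builds
  run: git submodule update --init --depth 1 src/modules/mavlink/mavlink
```

The saving is per-job: each of the ~25 parallel jobs clones independently, so trimming clone and submodule time multiplies across the whole run.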
| | Old Parallel | Tiered (current PR) | Option A (Optimized Monolith) | Option B (Modular Split) | Option D (F1) |
|---|---|---|---|---|---|
| Wall-clock | 28 min | 63 min | 38-42 min | 42-48 min | 28 min |
| Cost/run | $0.57 | $0.57 | $0.57 | $0.57 | $3.43 |
| Monthly CI | $171 | $98 | $98 | $98 | $1,029 |
| Early exit (T1 fail) | No | Yes ($0) | Yes ($0) | Yes ($0) | Yes (~$0.30) |
| Fork-friendly | Yes | No | Medium (vars) | Yes | No |
| Maintainability | Good | Poor | Medium | Good | Poor |
| Files | 15 | 1 | 1 + 2 actions | 5 + 2 actions | 1 + 2 actions |
| Implementation | Done | Done | 12 hrs | 16-20 hrs | 20-24 hrs |
| Risk | None | None | Low | Medium | Medium (spot) |
Best balance: Option A (Optimized Single-File) as the immediate fix, with elements of Option B for long-term.
- Immediate (Phase 1, 6 hours): Restructure `needs:` in the current monolith to remove T3 from the critical path. This alone drops wall-clock from 63 to ~40 min at zero cost change and directly addresses MaEtUgR's "2x time" concern.
- Short-term (Phase 2, 4 hours): Add composite actions for ccache and `vars.CI_SKIP_*` repository variables for fork-friendliness. This addresses the maintainability and fork concerns.
- Medium-term (if 30 min is a hard requirement): Evaluate the F1 approach selectively -- upgrade build jobs to 8cpu spot, split flash-analysis into parallel builds. Gets to ~32-35 min at ~$1.50/run.
- Long-term: Consider the modular split (Option B) if the team grows and CI changes become frequent enough to cause merge conflicts.
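The Phase 2 fork toggle could be sketched as below. `CI_SKIP_MACOS` is a hypothetical variable name following the `vars.CI_SKIP_*` convention, and the job is illustrative:

```yaml
# Sketch: a repository variable gates an expensive job so a fork can
# disable it without editing any workflow file.
jobs:
  macos-build:
    if: ${{ vars.CI_SKIP_MACOS != 'true' }}
    runs-on: macos-14
    steps:
      - uses: actions/checkout@v4
      - run: make px4_sitl_default
```

A fork sets the variable under Settings -> Secrets and variables -> Actions -> Variables; an unset variable evaluates to an empty string, so upstream behavior is unchanged by default.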