The apex-arena test-solution command has a race condition that causes Nebula-based tasks to fail intermittently. The container's entrypoint hasn't finished critical initialization (node cleanup, PV recreation) before setup.sh is executed.
When apex-arena runs test-solution, it:
- Starts a container with
docker run -d ... sleep infinity(returns immediately) - Immediately runs
docker exec setup.sh
But the container's entrypoint (docker-entrypoint-fast.sh) needs time to:
- Clean up stale nodes from the snapshot
- Recreate PersistentVolumes
- Wait for k3s to be ready
Task: scale-deployment
Symptom: kubectl get deployment timeline-service -n bleater showed 0/0 READY
Grader feedback: "Timeline-service has 0 replicas (expected: 3)"
The Nebula snapshot contains a node ID from when the snapshot was created. When a new container starts:
- A new node joins the cluster with the container's hostname
- The old node from the snapshot still exists in etcd
- Pods get scheduled on the old (non-existent) node
- Deployments show 0/0 replicas because pods can't run
The entrypoint has cleanup code to delete old nodes, but apex-arena runs setup.sh before this cleanup completes.
sequenceDiagram
participant AA as apex-arena
participant D as Docker
participant EP as Entrypoint
participant K as k3s
participant SS as setup.sh
AA->>D: docker run -d ... sleep infinity
D-->>AA: Container ID (immediate return)
Note over EP,K: Entrypoint starts in background
EP->>K: Starting k3s...
AA->>D: docker exec setup.sh
D->>SS: Execute setup.sh
Note over SS,K: setup.sh runs BEFORE<br/>entrypoint completes!
SS->>K: kubectl get deployments
K-->>SS: 0/0 replicas (pods on stale node)
EP->>K: Cleaning up old nodes...
Note over EP: Too late! setup.sh already ran
sequenceDiagram
participant AA as apex-arena
participant D as Docker
participant EP as Entrypoint
participant K as k3s
participant SS as setup.sh
AA->>D: docker run -d ... sleep infinity
D-->>AA: Container ID (immediate return)
EP->>K: Starting k3s...
EP->>K: Cleaning up old nodes...
EP->>K: Recreating PVs...
loop Wait for "Fast-boot complete!"
AA->>D: docker logs container
D-->>AA: Log output
Note over AA: Check for completion marker
end
EP->>K: Fast-boot complete!
AA->>D: docker exec setup.sh
D->>SS: Execute setup.sh
SS->>K: kubectl get deployments
K-->>SS: 3/3 replicas (correct node)
Add a wait loop in cli.py that monitors container logs for the "Fast-boot complete!" marker before proceeding with setup.sh:
# Wait for container entrypoint to complete (node cleanup, PV recreation, etc.)
console.print("⏳ Waiting for container entrypoint to complete...")
boot_timeout = 300 # 5 minutes for boot
boot_start = time.time()
entrypoint_complete = False
while time.time() - boot_start < boot_timeout:
logs_result = subprocess.run(
["docker", "logs", test_container_name],
capture_output=True,
text=True,
)
if "Fast-boot complete!" in logs_result.stdout or "Fast-boot complete!" in logs_result.stderr:
entrypoint_complete = True
console.print("✅ Container entrypoint completed")
break
# Check if container exited unexpectedly
inspect_result = subprocess.run(
["docker", "inspect", test_container_name, "--format", "{{.State.Running}}"],
capture_output=True,
text=True,
)
if inspect_result.stdout.strip() != "true":
console.print("[red]❌ Container exited unexpectedly[/red]")
sys.exit(1)
elapsed = int(time.time() - boot_start)
if elapsed > 0 and elapsed % 30 == 0:
console.print(f" Still waiting... ({elapsed}s elapsed)")
time.sleep(5)
if not entrypoint_complete:
console.print("[yellow]⚠️ Entrypoint did not complete within {boot_timeout}s, proceeding anyway[/yellow]")| Metric | Before Fix | After Fix |
|---|---|---|
| timeline-service replicas | 0/0 | 3/3 |
| Scaling test | FAIL | PASS |
| Test suite | N/A (couldn't run) | 94/96 passed |
Insert the wait loop at line 3236 in apex_arena/cli.py, immediately after the container health check and before the iptables rules are applied.
- The 5-minute timeout is generous; typical boot time is 60-90 seconds
- The fix is backward-compatible with non-Nebula tasks (they'll just see the "proceeding anyway" message if no marker is found)
- The completion marker "Fast-boot complete!" is already emitted by
docker-entrypoint-fast.sh