@arubis
Last active December 18, 2025 18:33
apex-arena: Fix race condition in test-solution by waiting for entrypoint completion
From: Claude Code
Subject: [PATCH] Fix race condition in test-solution by waiting for entrypoint completion
The test-solution command starts containers with `docker run -d ... sleep infinity`,
which returns immediately, then runs `docker exec setup.sh` before the container's
entrypoint has finished critical initialization (node cleanup, PV recreation, etc.).
This causes Nebula-based tasks to fail because:
1. Snapshot contains stale node ID from when it was created
2. Entrypoint cleans up old nodes at line ~585
3. But setup.sh runs before entrypoint reaches cleanup
4. Pods get scheduled on non-existent old node
5. Deployments show 0/0 replicas
The fix adds a wait loop that monitors container logs for "Fast-boot complete!"
before proceeding with setup.sh execution.
Tested with scale-deployment task:
- Before: 0/0 replicas, grader fails with "Timeline-service has 0 replicas"
- After: 3/3 replicas, scaling works correctly
---
apex_arena/cli.py | 40 ++++++++++++++++++++++++++++++++++++++++
1 file changed, 40 insertions(+)
diff --git a/apex_arena/cli.py b/apex_arena/cli.py
--- a/apex_arena/cli.py
+++ b/apex_arena/cli.py
@@ -3233,6 +3233,46 @@ def test_solution(task_id: str, force: bool = False):
         )
         sys.exit(1)
+    # Wait for container entrypoint to complete (node cleanup, PV recreation, etc.)
+    # The entrypoint prints "Fast-boot complete!" when initialization is done
+    console.print("⏳ Waiting for container entrypoint to complete...")
+    boot_timeout = 300  # 5 minutes for boot
+    boot_start = time.time()
+    entrypoint_complete = False
+
+    while time.time() - boot_start < boot_timeout:
+        # Check container logs for completion marker
+        logs_result = subprocess.run(
+            ["docker", "logs", test_container_name],
+            capture_output=True,
+            text=True,
+        )
+        if "Fast-boot complete!" in logs_result.stdout or "Fast-boot complete!" in logs_result.stderr:
+            entrypoint_complete = True
+            console.print("✅ Container entrypoint completed")
+            break
+
+        # Also check if container exited unexpectedly
+        inspect_result = subprocess.run(
+            ["docker", "inspect", test_container_name, "--format", "{{.State.Running}}"],
+            capture_output=True,
+            text=True,
+        )
+        if inspect_result.stdout.strip() != "true":
+            console.print(f"[red]❌ Container exited unexpectedly[/red]")
+            console.print(f"Logs: {logs_result.stdout}")
+            sys.exit(1)
+
+        # Brief status update every 30 seconds
+        elapsed = int(time.time() - boot_start)
+        if elapsed > 0 and elapsed % 30 == 0:
+            console.print(f"    Still waiting... ({elapsed}s elapsed)")
+
+        time.sleep(5)
+
+    if not entrypoint_complete:
+        console.print(f"[yellow]⚠️  Entrypoint did not complete within {boot_timeout}s, proceeding anyway[/yellow]")
+
     # Apply iptables rules to block internet
     try:
         # Get container IP and gateway

apex-arena Race Condition Fix: Entrypoint Wait

Problem Summary

The apex-arena test-solution command has a race condition that causes Nebula-based tasks to fail intermittently. The container's entrypoint hasn't finished critical initialization (node cleanup, PV recreation) before setup.sh is executed.

Root Cause

When apex-arena runs test-solution, it:

  1. Starts a container with docker run -d ... sleep infinity (returns immediately)
  2. Immediately runs docker exec setup.sh

But the container's entrypoint (docker-entrypoint-fast.sh) needs time to:

  • Clean up stale nodes from the snapshot
  • Recreate PersistentVolumes
  • Wait for k3s to be ready
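
A minimal Python sketch of this pre-fix sequence (illustrative only: the container name, image tag, and setup.sh path below are hypothetical, not the actual cli.py values):

import subprocess

# Pre-fix sequence: "docker run -d" returns as soon as the container is
# created, not when its entrypoint finishes initializing.
container_name = "apex-test-container"            # hypothetical name
image = "apex-arena/task-snapshot:latest"         # hypothetical image

subprocess.run(
    ["docker", "run", "-d", "--name", container_name, image, "sleep", "infinity"],
    check=True,
)

# At this point the entrypoint is still cleaning up stale nodes and
# recreating PVs inside the container, yet setup.sh runs anyway.
subprocess.run(["docker", "exec", container_name, "/setup.sh"], check=True)

Because `docker run -d` detaches as soon as the container exists, nothing in this sequence waits for the entrypoint's node cleanup or PV recreation to finish.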

What Was Failing

  • Task: scale-deployment
  • Symptom: kubectl get deployment timeline-service -n bleater showed 0/0 READY
  • Grader feedback: "Timeline-service has 0 replicas (expected: 3)"

Why It Failed

The Nebula snapshot contains a node ID from when the snapshot was created. When a new container starts:

  1. A new node joins the cluster with the container's hostname
  2. The old node from the snapshot still exists in etcd
  3. Pods get scheduled on the old (non-existent) node
  4. Deployments show 0/0 replicas because pods can't run

The entrypoint has cleanup code to delete old nodes, but apex-arena runs setup.sh before this cleanup completes.
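
For context, here is a rough Python rendering of what that cleanup does (the real logic is shell in docker-entrypoint-fast.sh; this sketch only approximates its intent and assumes kubectl is available inside the container):

import socket
import subprocess

# Delete any node whose name differs from this container's hostname, so pods
# are not scheduled onto the stale node carried over from the snapshot.
current_node = socket.gethostname()

nodes = subprocess.run(
    ["kubectl", "get", "nodes", "-o", "name"],
    capture_output=True, text=True, check=True,
).stdout.split()

for node in nodes:                     # entries look like "node/<hostname>"
    if node.split("/", 1)[-1] != current_node:
        subprocess.run(["kubectl", "delete", node], check=True)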

Before/After State Diagrams

Before Fix: Race Condition

sequenceDiagram
    participant AA as apex-arena
    participant D as Docker
    participant EP as Entrypoint
    participant K as k3s
    participant SS as setup.sh

    AA->>D: docker run -d ... sleep infinity
    D-->>AA: Container ID (immediate return)

    Note over EP,K: Entrypoint starts in background
    EP->>K: Starting k3s...

    AA->>D: docker exec setup.sh
    D->>SS: Execute setup.sh

    Note over SS,K: setup.sh runs BEFORE<br/>entrypoint completes!

    SS->>K: kubectl get deployments
    K-->>SS: 0/0 replicas (pods on stale node)

    EP->>K: Cleaning up old nodes...
    Note over EP: Too late! setup.sh already ran

After Fix: Proper Sequencing

sequenceDiagram
    participant AA as apex-arena
    participant D as Docker
    participant EP as Entrypoint
    participant K as k3s
    participant SS as setup.sh

    AA->>D: docker run -d ... sleep infinity
    D-->>AA: Container ID (immediate return)

    EP->>K: Starting k3s...
    EP->>K: Cleaning up old nodes...
    EP->>K: Recreating PVs...

    loop Wait for "Fast-boot complete!"
        AA->>D: docker logs container
        D-->>AA: Log output
        Note over AA: Check for completion marker
    end

    EP->>K: Fast-boot complete!
    AA->>D: docker exec setup.sh
    D->>SS: Execute setup.sh

    SS->>K: kubectl get deployments
    K-->>SS: 3/3 replicas (correct node)

The Fix

Add a wait loop in cli.py that monitors container logs for the "Fast-boot complete!" marker before proceeding with setup.sh:

# Wait for container entrypoint to complete (node cleanup, PV recreation, etc.)
console.print("⏳ Waiting for container entrypoint to complete...")
boot_timeout = 300  # 5 minutes for boot
boot_start = time.time()
entrypoint_complete = False

while time.time() - boot_start < boot_timeout:
    logs_result = subprocess.run(
        ["docker", "logs", test_container_name],
        capture_output=True,
        text=True,
    )
    if "Fast-boot complete!" in logs_result.stdout or "Fast-boot complete!" in logs_result.stderr:
        entrypoint_complete = True
        console.print("✅ Container entrypoint completed")
        break

    # Check if container exited unexpectedly
    inspect_result = subprocess.run(
        ["docker", "inspect", test_container_name, "--format", "{{.State.Running}}"],
        capture_output=True,
        text=True,
    )
    if inspect_result.stdout.strip() != "true":
        console.print("[red]❌ Container exited unexpectedly[/red]")
        sys.exit(1)

    elapsed = int(time.time() - boot_start)
    if elapsed > 0 and elapsed % 30 == 0:
        console.print(f"    Still waiting... ({elapsed}s elapsed)")

    time.sleep(5)

if not entrypoint_complete:
    console.print("[yellow]⚠️  Entrypoint did not complete within {boot_timeout}s, proceeding anyway[/yellow]")

Test Results

| Metric                     | Before Fix          | After Fix    |
| -------------------------- | ------------------- | ------------ |
| timeline-service replicas  | 0/0                 | 3/3          |
| Scaling test               | FAIL                | PASS         |
| Test suite                 | N/A (couldn't run)  | 94/96 passed |

Patch Location

Insert the wait loop at line 3236 in apex_arena/cli.py, immediately after the container health check and before the iptables rules are applied.

Notes

  • The 5-minute timeout is generous; typical boot time is 60-90 seconds
  • The fix is backward-compatible with non-Nebula tasks (if no marker is ever printed, they wait out the timeout and then see the "proceeding anyway" message)
  • The completion marker "Fast-boot complete!" is already emitted by docker-entrypoint-fast.sh