Plan: Switch to Rolling Deploys with docker-rollout

Rolling deploy plan for RemiliaNET using docker-rollout: zero-downtime deployments with container draining.

Context

Deployments currently use docker compose up -d --pull always --force-recreate, which stops and recreates all containers simultaneously, causing 10-30 seconds of downtime per deploy. docker-rollout eliminates this window by scaling up new containers alongside the old ones, waiting for their health checks to pass, and only then removing the old containers.

The original investigation (gist from ~4 months ago) identified removing container_name as a blocker. That is now resolved: PR #628 confirmed that none of the application services define container_name or host ports: bindings.

The first deploy does NOT require manually stopping any stacks. docker-rollout works against running containers: it scales up alongside them, waits for health, then removes the old ones. No manual intervention is needed.

Prerequisites (all met)

  • No container_name on app services (server, identity, property, profile-server)
  • No host port bindings on app services (only expose: for Docker internal network)
  • All app services have healthchecks defined
  • Caddy reverse proxy in front of all services using Docker DNS service names
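
These can be spot-checked against the merged compose config before the first rollout. A rough filter (staging paths shown; caddy, keycloak, and the monitoring services are expected to match):

docker compose \
  -f docker-compose.yaml \
  -f infra/staging/docker-compose.override.yaml \
  config | grep -nE 'container_name|^ +ports:'
# Any hit under server/identity/property/profile-server would be a blocker.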

Services eligible for rolling deploy

Service         Rollout?  Notes
server          Yes       Go monolith, healthcheck at /health
property        Yes       Bun service, healthcheck at /health
identity        Yes       Node service, healthcheck at /health
profile-server  Yes       Express API, healthcheck at /api/health
caddy           No        Host port bindings (80, 443); force-recreate instead
mongodb         No        Stateful infrastructure
redis           No        Stateful infrastructure
keycloak        No        Has container_name and host port 9000 (staging only)
monitoring      No        Monitoring services have container_name and host ports

Changes

1. Install docker-rollout on staging + production servers

a) Install now via SSH (MCP SSH tool — hosts: reminet-staging, reminet-prod):

mkdir -p ~/.docker/cli-plugins
# -f makes curl fail on HTTP errors instead of installing an error page as the plugin
curl -fsSL https://raw.githubusercontent.com/wowu/docker-rollout/main/docker-rollout \
  -o ~/.docker/cli-plugins/docker-rollout
chmod +x ~/.docker/cli-plugins/docker-rollout
docker rollout --version  # verify the plugin is picked up

b) Add to setup scripts so future server provisioning includes it:

Files: infra/staging/setup.sh, infra/production/setup.sh
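
A guarded version of the same install for those scripts (sketch; safe to re-run on an already-provisioned server):

PLUGIN="$HOME/.docker/cli-plugins/docker-rollout"
if [ ! -x "$PLUGIN" ]; then
  mkdir -p "$HOME/.docker/cli-plugins"
  curl -fsSL https://raw.githubusercontent.com/wowu/docker-rollout/main/docker-rollout -o "$PLUGIN"
  chmod +x "$PLUGIN"
fi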

2. Modify deploy-staging.yaml "Run deployment" step (~line 210-224)

Replace the current single docker compose up block. Key points:

  • docker compose pull first — docker-rollout does NOT pull images itself
  • Infrastructure/monitoring use traditional up -d
  • App services rolled out in dependency order (server+property → identity → profile-server)
  • Caddy force-recreated last to re-resolve DNS to new container IPs
  • Automatic rollback: if a health check fails, docker-rollout removes the NEW container and exits non-zero, leaving the old container running

cd ~/remilia/staging

echo ${{ secrets.GH_PACKAGES_TOKEN }} | docker login ghcr.io -u ${{ github.actor }} --password-stdin

# Pull all images upfront (docker-rollout does NOT pull)
docker compose \
  -f docker-compose.yaml \
  -f infra/staging/docker-compose.override.yaml \
  pull

# Infrastructure + monitoring (traditional deploy, not rolled out)
docker compose \
  -f docker-compose.yaml \
  -f infra/staging/docker-compose.override.yaml \
  up -d mongodb redis keycloak-postgres keycloak \
       node-exporter cadvisor alloy

# Rolling deploy app services in dependency order
for svc in server property identity profile-server; do
  echo "Rolling out $svc..."
  docker rollout \
    -f docker-compose.yaml \
    -f infra/staging/docker-compose.override.yaml \
    "$svc"
done

# Recreate Caddy last (re-resolves DNS to new containers, sub-second restart)
docker compose \
  -f docker-compose.yaml \
  -f infra/staging/docker-compose.override.yaml \
  up -d --force-recreate caddy

docker image prune -f

File: .github/workflows/deploy-staging.yaml
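
Note that if the SSH step runs this script without shell strict mode, a failed rollout would not stop the remaining commands, and Caddy would be recreated anyway. Whether that applies depends on how the MCP SSH tool invokes the script (an assumption worth verifying); if in doubt, prepend:

set -euo pipefail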

3. Modify deploy-production.yaml "Execute deployment" step (~line 212-227)

Same pattern, minus the keycloak services and without mongodb, which is disabled in production (managed DigitalOcean database):

cd ~/remilia/production

echo ${{ secrets.GH_PACKAGES_TOKEN }} | docker login ghcr.io -u ${{ github.actor }} --password-stdin

docker compose \
  -f docker-compose.yaml \
  -f infra/production/docker-compose.override.yaml \
  pull

# Infrastructure + monitoring (traditional)
docker compose \
  -f docker-compose.yaml \
  -f infra/production/docker-compose.override.yaml \
  up -d redis node-exporter cadvisor alloy

# Rolling deploy app services in dependency order
for svc in server property identity profile-server; do
  echo "Rolling out $svc..."
  docker rollout \
    -f docker-compose.yaml \
    -f infra/production/docker-compose.override.yaml \
    "$svc"
done

# Recreate Caddy last
docker compose \
  -f docker-compose.yaml \
  -f infra/production/docker-compose.override.yaml \
  up -d --force-recreate caddy

docker image prune -f

File: .github/workflows/deploy-production.yaml

4. DNS resolution: why Caddy needs force-recreate

Caddy resolves upstream hostnames (server:8080, etc.) at config load time and caches the IPs. After docker-rollout swaps containers (new IPs), Caddy still routes to stale IPs. The --force-recreate caddy at the end re-resolves all DNS in under a second (Caddy startup is fast).
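
If stale routing is suspected after a rollout, Caddy's reverse-proxy failures typically surface as dial errors in its logs. A quick way to look (staging paths shown):

docker compose \
  -f docker-compose.yaml \
  -f infra/staging/docker-compose.override.yaml \
  logs --since 5m caddy | grep -i 'dial tcp'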

Between individual rollout completions and the final Caddy recreate, there may be a brief window where Caddy routes to a removed container. Caddy's built-in retry logic handles this gracefully. If testing shows otherwise, we can add a docker compose exec caddy caddy reload --config /etc/caddy/Caddyfile after each individual rollout instead.
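
If that fallback is needed, the rollout loop would become (staging paths shown; untested sketch using the reload command mentioned above):

for svc in server property identity profile-server; do
  echo "Rolling out $svc..."
  docker rollout \
    -f docker-compose.yaml \
    -f infra/staging/docker-compose.override.yaml \
    "$svc"
  # Re-resolve upstream DNS immediately after each swap
  docker compose \
    -f docker-compose.yaml \
    -f infra/staging/docker-compose.override.yaml \
    exec caddy caddy reload --config /etc/caddy/Caddyfile
done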

5. Shared volumes during rollout overlap

server mounts beetle_data:/app/data (SQLite) and property in production mounts /home/deploy/remilia/data/property:/app/data. During the ~10-20s overlap, both old and new containers mount these volumes. The old container continues serving traffic; the new container only runs health checks (read-only /health endpoint). No SQLite write conflicts expected.

6. Container draining (graceful shutdown)

Two services already handle SIGTERM gracefully; two need fixes:

Service          SIGTERM handled                                 Graceful shutdown                                    Status
profile-server   Yes (app.js:396-404)                            Yes: server.close() + redisClient.quit()             Ready
property         Yes (api-service.js:428-429, sync.js:287-288)   Yes: stops monitors, closes DB                       Ready
server (beetle)  No                                              No: ListenAndServe() blocks, no signal interception  Needs fix
identity         No                                              No: no process.on('SIGTERM')                         Needs fix

6a. Fix beetle server graceful shutdown

File: packages/server/pkg/web/server.go (lines 261-291, Run() method)

The remichat module already has the pattern (packages/server/pkg/.remichat/cmd/main.go:169):

ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGINT, syscall.SIGTERM)
defer stop()

Apply the same pattern to beetle's Run():

  • Accept a context (or create one with signal.NotifyContext in cmd/main.go)
  • Pass it through to Run()
  • On ctx.Done(), call srv.Shutdown(shutdownCtx) with a 5-10s timeout (see the sketch below)

File: packages/server/cmd/main.go (~line 156) — wire signal context into app.Run()
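
A minimal sketch of the resulting shape; the App type and srv field are illustrative stand-ins for beetle's actual structs, not the real code:

package web

import (
	"context"
	"errors"
	"net/http"
	"time"
)

// App stands in for beetle's server type; only the field used here is shown.
type App struct {
	srv *http.Server
}

// Run serves until the context is cancelled (e.g. by SIGTERM via
// signal.NotifyContext), then drains in-flight requests before returning.
func (a *App) Run(ctx context.Context) error {
	errCh := make(chan error, 1)
	go func() {
		// ListenAndServe returns http.ErrServerClosed after a clean Shutdown;
		// anything else is a real failure.
		if err := a.srv.ListenAndServe(); err != nil && !errors.Is(err, http.ErrServerClosed) {
			errCh <- err
		}
	}()
	select {
	case err := <-errCh:
		return err
	case <-ctx.Done():
		// Stop accepting new connections and wait up to 10s for in-flight
		// requests to complete.
		shutdownCtx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
		defer cancel()
		return a.srv.Shutdown(shutdownCtx)
	}
}

cmd/main.go would then build the context with signal.NotifyContext (as in the remichat snippet above) and pass it to app.Run(ctx).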

6b. Fix identity service graceful shutdown

File: packages/identity/app.js (after server.listen() on line 675)

Add:

const shutdown = async () => {
  // Stop accepting new connections, then wait for in-flight requests to finish
  // before tearing down the DB connection and exiting.
  await new Promise((resolve) => server.close(resolve));
  await mongoose.connection.close();
  process.exit(0);
};
process.on('SIGTERM', shutdown);
process.on('SIGINT', shutdown);

6c. Enable drain in docker-rollout invocation

Once graceful shutdown works in all services, add --pre-stop-hook to the rollout loop in both workflows:

# $ENV stands in for staging / production in the respective workflow
for svc in server property identity profile-server; do
  docker rollout \
    -f docker-compose.yaml \
    -f infra/$ENV/docker-compose.override.yaml \
    --pre-stop-hook "sleep 5" \
    "$svc"
done

This gives the old container 5 extra seconds (sleep completes, then docker stop sends SIGTERM, then the app's graceful shutdown kicks in). The total drain window is ~15s (5s sleep + 10s docker stop grace period).

Note: We do NOT need to modify healthchecks for the drain-file pattern. That pattern is only useful if the proxy actively health-checks upstreams and routes away from unhealthy ones. Since we force-recreate Caddy at the end anyway, the simpler sleep approach is sufficient.

Verification

  1. Staging first: Deploy to staging and verify:
    • docker rollout installs and runs correctly
    • Each service rolls out without errors
    • Health check endpoints respond throughout deployment
    • No dropped requests during rollout (test with a curl loop; sketch below)
    • WebSocket connections (beetle game, identity) reconnect gracefully
    • Caddy routes to new containers after restart
  2. Production: Deploy during a low-traffic period; monitor error rates in Sentry/Grafana
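
A throwaway probe for the dropped-request check in step 1 (hypothetical URL; substitute the real staging health endpoint):

# Run during the deploy; any non-200 line (curl reports 000 on a refused
# or timed-out connection) is a dropped request.
while true; do
  code=$(curl -s -o /dev/null --max-time 2 -w '%{http_code}' https://staging.example.com/health)
  [ "$code" = "200" ] || echo "$(date +%T) got $code"
  sleep 0.25
done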