Plan: Switch to Rolling Deploys with docker-rollout

Rolling deploy plan for RemiliaNET using docker-rollout: zero-downtime deployments with container draining.

Context

Deployments currently use docker compose up -d --pull always --force-recreate, which stops and recreates all containers simultaneously, causing 10-30 seconds of downtime per deploy. docker-rollout eliminates this window by scaling up new containers alongside the old ones, waiting for their health checks to pass, and only then removing the old containers.

The original investigation (gist from ~4 months ago) identified removing container_name as a blocker. That is now resolved: PR #628 confirmed that none of the application services define container_name or host ports: bindings.

The first deploy does NOT require manually stopping any stacks. docker-rollout works against running containers: it scales up alongside them, waits for health, then removes the old ones. No manual intervention is needed.

Prerequisites (all met)

  • No container_name on app services (server, identity, property, profile-server)
  • No host port bindings on app services (only expose: for Docker internal network)
  • All app services have healthchecks defined
  • Caddy reverse proxy in front of all services using Docker DNS service names
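
These can be spot-checked against the merged compose config before the first rollout. A rough filter (staging paths shown; caddy, keycloak, and the monitoring services are expected to match):

docker compose \
  -f docker-compose.yaml \
  -f infra/staging/docker-compose.override.yaml \
  config | grep -nE 'container_name|^ +ports:'
# Any hit under server/identity/property/profile-server would be a blocker.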

Services eligible for rolling deploy

Service         Rollout?  Notes
server          Yes       Go monolith, healthcheck at /health
property        Yes       Bun service, healthcheck at /health
identity        Yes       Node service, healthcheck at /health
profile-server  Yes       Express API, healthcheck at /api/health
caddy           No        Host port bindings (80, 443); force-recreate instead
mongodb         No        Stateful infrastructure
redis           No        Stateful infrastructure
keycloak        No        Has container_name and host port 9000 (staging only)
monitoring      No        Monitoring services have container_name and host ports

Changes

1. Install docker-rollout on staging + production servers

a) Install now via SSH (MCP SSH tool — hosts: reminet-staging, reminet-prod):

mkdir -p ~/.docker/cli-plugins
# -f makes curl fail on HTTP errors instead of installing an error page as the plugin
curl -fsSL https://raw.githubusercontent.com/wowu/docker-rollout/main/docker-rollout \
  -o ~/.docker/cli-plugins/docker-rollout
chmod +x ~/.docker/cli-plugins/docker-rollout
docker rollout --version  # verify the plugin is picked up

b) Add to setup scripts so future server provisioning includes it:

Files: infra/staging/setup.sh, infra/production/setup.sh
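
A guarded version of the same install for those scripts (sketch; safe to re-run on an already-provisioned server):

PLUGIN="$HOME/.docker/cli-plugins/docker-rollout"
if [ ! -x "$PLUGIN" ]; then
  mkdir -p "$HOME/.docker/cli-plugins"
  curl -fsSL https://raw.githubusercontent.com/wowu/docker-rollout/main/docker-rollout -o "$PLUGIN"
  chmod +x "$PLUGIN"
fi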

2. Modify deploy-staging.yaml "Run deployment" step (~line 210-224)

Replace the current single docker compose up block. Key points:

  • docker compose pull first — docker-rollout does NOT pull images itself
  • Infrastructure/monitoring use traditional up -d
  • App services rolled out in dependency order (server+property → identity → profile-server)
  • Caddy force-recreated last to re-resolve DNS to new container IPs
  • Automatic rollback: if a health check fails, docker-rollout removes the NEW container and exits non-zero, leaving the old container running

cd ~/remilia/staging

echo ${{ secrets.GH_PACKAGES_TOKEN }} | docker login ghcr.io -u ${{ github.actor }} --password-stdin

# Pull all images upfront (docker-rollout does NOT pull)
docker compose \
  -f docker-compose.yaml \
  -f infra/staging/docker-compose.override.yaml \
  pull

# Infrastructure + monitoring (traditional deploy, not rolled out)
docker compose \
  -f docker-compose.yaml \
  -f infra/staging/docker-compose.override.yaml \
  up -d mongodb redis keycloak-postgres keycloak \
       node-exporter cadvisor alloy

# Rolling deploy app services in dependency order
for svc in server property identity profile-server; do
  echo "Rolling out $svc..."
  docker rollout \
    -f docker-compose.yaml \
    -f infra/staging/docker-compose.override.yaml \
    "$svc"
done

# Recreate Caddy last (re-resolves DNS to new containers, sub-second restart)
docker compose \
  -f docker-compose.yaml \
  -f infra/staging/docker-compose.override.yaml \
  up -d --force-recreate caddy

docker image prune -f

File: .github/workflows/deploy-staging.yaml
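
Note that if the SSH step runs this script without shell strict mode, a failed rollout would not stop the remaining commands, and Caddy would be recreated anyway. Whether that applies depends on how the MCP SSH tool invokes the script (an assumption worth verifying); if in doubt, prepend:

set -euo pipefail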

3. Modify deploy-production.yaml "Execute deployment" step (~line 212-227)

Same pattern, minus the keycloak services and without mongodb, which is disabled in production (managed DigitalOcean database):

cd ~/remilia/production

echo ${{ secrets.GH_PACKAGES_TOKEN }} | docker login ghcr.io -u ${{ github.actor }} --password-stdin

docker compose \
  -f docker-compose.yaml \
  -f infra/production/docker-compose.override.yaml \
  pull

# Infrastructure + monitoring (traditional)
docker compose \
  -f docker-compose.yaml \
  -f infra/production/docker-compose.override.yaml \
  up -d redis node-exporter cadvisor alloy

# Rolling deploy app services in dependency order
for svc in server property identity profile-server; do
  echo "Rolling out $svc..."
  docker rollout \
    -f docker-compose.yaml \
    -f infra/production/docker-compose.override.yaml \
    "$svc"
done

# Recreate Caddy last
docker compose \
  -f docker-compose.yaml \
  -f infra/production/docker-compose.override.yaml \
  up -d --force-recreate caddy

docker image prune -f

File: .github/workflows/deploy-production.yaml

4. DNS resolution: why Caddy needs force-recreate

Caddy resolves upstream hostnames (server:8080, etc.) at config load time and caches the IPs. After docker-rollout swaps containers (new IPs), Caddy still routes to stale IPs. The --force-recreate caddy at the end re-resolves all DNS in under a second (Caddy startup is fast).
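
If stale routing is suspected after a rollout, Caddy's reverse-proxy failures typically surface as dial errors in its logs. A quick way to look (staging paths shown):

docker compose \
  -f docker-compose.yaml \
  -f infra/staging/docker-compose.override.yaml \
  logs --since 5m caddy | grep -i 'dial tcp'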

Between individual rollout completions and the final Caddy recreate, there may be a brief window where Caddy routes to a removed container. Caddy's built-in retry logic handles this gracefully. If testing shows otherwise, we can add a docker compose exec caddy caddy reload --config /etc/caddy/Caddyfile after each individual rollout instead.
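
If that fallback is needed, the rollout loop would become (staging paths shown; untested sketch using the reload command mentioned above):

for svc in server property identity profile-server; do
  echo "Rolling out $svc..."
  docker rollout \
    -f docker-compose.yaml \
    -f infra/staging/docker-compose.override.yaml \
    "$svc"
  # Re-resolve upstream DNS immediately after each swap
  docker compose \
    -f docker-compose.yaml \
    -f infra/staging/docker-compose.override.yaml \
    exec caddy caddy reload --config /etc/caddy/Caddyfile
done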

5. Shared volumes during rollout overlap

server mounts beetle_data:/app/data (SQLite) and property in production mounts /home/deploy/remilia/data/property:/app/data. During the ~10-20s overlap, both old and new containers mount these volumes. The old container continues serving traffic; the new container only runs health checks (read-only /health endpoint). No SQLite write conflicts expected.

6. Container draining (graceful shutdown)

Two services already handle SIGTERM gracefully; two need fixes:

Service          SIGTERM handled                                 Graceful shutdown                                    Status
profile-server   Yes (app.js:396-404)                            Yes: server.close() + redisClient.quit()             Ready
property         Yes (api-service.js:428-429, sync.js:287-288)   Yes: stops monitors, closes DB                       Ready
server (beetle)  No                                              No: ListenAndServe() blocks, no signal interception  Needs fix
identity         No                                              No: no process.on('SIGTERM')                         Needs fix

6a. Fix beetle server graceful shutdown

File: packages/server/pkg/web/server.go (lines 261-291, Run() method)

The remichat module already has the pattern (packages/server/pkg/.remichat/cmd/main.go:169):

ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGINT, syscall.SIGTERM)
defer stop()

Apply the same pattern to beetle's Run():

  • Accept a context (or create one with signal.NotifyContext in cmd/main.go)
  • Pass it through to Run()
  • On ctx.Done(), call srv.Shutdown(shutdownCtx) with a 5-10s timeout (see the sketch below)

File: packages/server/cmd/main.go (~line 156) — wire signal context into app.Run()
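
A minimal sketch of the resulting shape; the App type and srv field are illustrative stand-ins for beetle's actual structs, not the real code:

package web

import (
	"context"
	"errors"
	"net/http"
	"time"
)

// App stands in for beetle's server type; only the field used here is shown.
type App struct {
	srv *http.Server
}

// Run serves until the context is cancelled (e.g. by SIGTERM via
// signal.NotifyContext), then drains in-flight requests before returning.
func (a *App) Run(ctx context.Context) error {
	errCh := make(chan error, 1)
	go func() {
		// ListenAndServe returns http.ErrServerClosed after a clean Shutdown;
		// anything else is a real failure.
		if err := a.srv.ListenAndServe(); err != nil && !errors.Is(err, http.ErrServerClosed) {
			errCh <- err
		}
	}()
	select {
	case err := <-errCh:
		return err
	case <-ctx.Done():
		// Stop accepting new connections and wait up to 10s for in-flight
		// requests to complete.
		shutdownCtx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
		defer cancel()
		return a.srv.Shutdown(shutdownCtx)
	}
}

cmd/main.go would then build the context with signal.NotifyContext (as in the remichat snippet above) and pass it to app.Run(ctx).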

6b. Fix identity service graceful shutdown

File: packages/identity/app.js (after server.listen() on line 675)

Add:

const shutdown = async () => {
  // Stop accepting new connections, then wait for in-flight requests to finish
  // before tearing down the DB connection and exiting.
  await new Promise((resolve) => server.close(resolve));
  await mongoose.connection.close();
  process.exit(0);
};
process.on('SIGTERM', shutdown);
process.on('SIGINT', shutdown);

6c. Enable drain in docker-rollout invocation

Once graceful shutdown works in all services, add --pre-stop-hook to the rollout loop in both workflows:

# $ENV stands in for staging / production in the respective workflow
for svc in server property identity profile-server; do
  docker rollout \
    -f docker-compose.yaml \
    -f infra/$ENV/docker-compose.override.yaml \
    --pre-stop-hook "sleep 5" \
    "$svc"
done

This gives the old container 5 extra seconds (sleep completes, then docker stop sends SIGTERM, then the app's graceful shutdown kicks in). The total drain window is ~15s (5s sleep + 10s docker stop grace period).

Note: We do NOT need to modify healthchecks for the drain-file pattern. That pattern is only useful if the proxy actively health-checks upstreams and routes away from unhealthy ones. Since we force-recreate Caddy at the end anyway, the simpler sleep approach is sufficient.

Verification

  1. Staging first: Deploy to staging and verify:
    • docker rollout installs and runs correctly
    • Each service rolls out without errors
    • Health check endpoints respond throughout deployment
    • No dropped requests during rollout (test with a curl loop; sketch below)
    • WebSocket connections (beetle game, identity) reconnect gracefully
    • Caddy routes to new containers after restart
  2. Production: Deploy during a low-traffic period; monitor error rates in Sentry/Grafana
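
A throwaway probe for the dropped-request check in step 1 (hypothetical URL; substitute the real staging health endpoint):

# Run during the deploy; any non-200 line (curl reports 000 on a refused
# or timed-out connection) is a dropped request.
while true; do
  code=$(curl -s -o /dev/null --max-time 2 -w '%{http_code}' https://staging.example.com/health)
  [ "$code" = "200" ] || echo "$(date +%T) got $code"
  sleep 0.25
done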