Deployments currently use `docker compose up -d --pull always --force-recreate`, which stops and recreates all containers simultaneously, causing 10-30 seconds of downtime per deploy. docker-rollout eliminates this by scaling up new containers alongside the old ones, health-checking them, then removing the old containers — zero downtime.
The original investigation (gist from ~4 months ago) identified removing container_name as a blocker. That's now resolved — PR #628 confirmed all application services have no `container_name` and no host `ports:` bindings.
First deploy does NOT require manually stopping stacks. docker-rollout works with running containers — it scales up alongside them, waits for health, then removes old ones. No manual intervention needed.
- No `container_name` on app services (server, identity, property, profile-server)
- No host port bindings on app services (only `expose:` for the Docker internal network)
- All app services have healthchecks defined
- Caddy reverse proxy in front of all services using Docker DNS service names
| Service | Rollout? | Notes |
|---|---|---|
| server | Yes | Go monolith, healthcheck at /health |
| property | Yes | Bun service, healthcheck at /health |
| identity | Yes | Node service, healthcheck at /health |
| profile-server | Yes | Express API, healthcheck at /api/health |
| caddy | No | Host port bindings (80, 443) — force-recreate instead |
| mongodb | No | Stateful infrastructure |
| redis | No | Stateful infrastructure |
| keycloak* | No | Has container_name, host port 9000 (staging only) |
| monitoring | No | Have container_name, host ports |
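docker-rollout waits for the new container to report healthy before removing the old one, so the healthcheck definitions are load-bearing. For reference, the shape they take in compose — illustrative values only; the real compose files define their own endpoints and timings:

```yaml
services:
  server:
    expose:
      - "8080"          # internal Docker network only — no host ports
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 10s
      timeout: 3s
      retries: 3
      start_period: 5s
```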
a) Install now via SSH (MCP SSH tool — hosts: reminet-staging, reminet-prod):
```shell
mkdir -p ~/.docker/cli-plugins
curl -s https://raw.githubusercontent.com/wowu/docker-rollout/main/docker-rollout \
  -o ~/.docker/cli-plugins/docker-rollout
chmod +x ~/.docker/cli-plugins/docker-rollout
docker rollout --version  # verify
```

b) Add to setup scripts so future server provisioning includes it:
Files: infra/staging/setup.sh, infra/production/setup.sh
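For the setup scripts, a guarded version avoids re-downloading on every provision run — a sketch; `install_docker_rollout` is a hypothetical helper name, the URL matches the manual install step above:

```shell
# Idempotent docker-rollout install for setup.sh (hypothetical helper).
install_docker_rollout() {
  local plugin="$HOME/.docker/cli-plugins/docker-rollout"
  if [ -x "$plugin" ]; then
    echo "docker-rollout already installed"
    return 0
  fi
  mkdir -p "$(dirname "$plugin")"
  curl -fsSL https://raw.githubusercontent.com/wowu/docker-rollout/main/docker-rollout \
    -o "$plugin"
  chmod +x "$plugin"
}
```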
Replace the current single docker compose up block. Key points:
- `docker compose pull` first — docker-rollout does NOT pull images itself
- Infrastructure/monitoring use traditional `up -d`
- App services rolled out in dependency order (server+property → identity → profile-server)
- Caddy force-recreated last to re-resolve DNS to new container IPs
- Automatic rollback: if a health check fails, docker-rollout removes the NEW container and exits non-zero, leaving the old container running
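A note on the rollback behavior for manual runs: in the Actions workflows a non-zero exit from `docker rollout` already fails the step, but when deploying by hand it helps to make the abort explicit. A sketch — `rollout_all` is a hypothetical helper, not part of docker-rollout:

```shell
# Stop at the first failed rollout: docker-rollout has already removed the
# failed NEW container, so the old one keeps serving; later services stay
# on their old images for a consistent retry.
rollout_all() {
  local env="$1"; shift
  local svc
  for svc in "$@"; do
    echo "Rolling out $svc..."
    if ! docker rollout \
        -f docker-compose.yaml \
        -f "infra/$env/docker-compose.override.yaml" \
        "$svc"; then
      echo "Rollout of $svc failed, aborting deploy" >&2
      return 1
    fi
  done
}
```

Usage: `rollout_all staging server property identity profile-server`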
```shell
cd ~/remilia/staging
echo ${{ secrets.GH_PACKAGES_TOKEN }} | docker login ghcr.io -u ${{ github.actor }} --password-stdin

# Pull all images upfront (docker-rollout does NOT pull)
docker compose \
  -f docker-compose.yaml \
  -f infra/staging/docker-compose.override.yaml \
  pull

# Infrastructure + monitoring (traditional deploy, not rolled out)
docker compose \
  -f docker-compose.yaml \
  -f infra/staging/docker-compose.override.yaml \
  up -d mongodb redis keycloak-postgres keycloak \
    node-exporter cadvisor alloy

# Rolling deploy app services in dependency order
for svc in server property identity profile-server; do
  echo "Rolling out $svc..."
  docker rollout \
    -f docker-compose.yaml \
    -f infra/staging/docker-compose.override.yaml \
    "$svc"
done

# Recreate Caddy last (re-resolves DNS to new containers, sub-second restart)
docker compose \
  -f docker-compose.yaml \
  -f infra/staging/docker-compose.override.yaml \
  up -d --force-recreate caddy

docker image prune -f
```

File: .github/workflows/deploy-staging.yaml
Same pattern, minus keycloak services and with production mongodb disabled (managed DO):
```shell
cd ~/remilia/production
echo ${{ secrets.GH_PACKAGES_TOKEN }} | docker login ghcr.io -u ${{ github.actor }} --password-stdin

docker compose \
  -f docker-compose.yaml \
  -f infra/production/docker-compose.override.yaml \
  pull

# Infrastructure + monitoring (traditional)
docker compose \
  -f docker-compose.yaml \
  -f infra/production/docker-compose.override.yaml \
  up -d redis node-exporter cadvisor alloy

# Rolling deploy app services in dependency order
for svc in server property identity profile-server; do
  echo "Rolling out $svc..."
  docker rollout \
    -f docker-compose.yaml \
    -f infra/production/docker-compose.override.yaml \
    "$svc"
done

# Recreate Caddy last
docker compose \
  -f docker-compose.yaml \
  -f infra/production/docker-compose.override.yaml \
  up -d --force-recreate caddy

docker image prune -f
```

File: .github/workflows/deploy-production.yaml
Caddy resolves upstream hostnames (`server:8080`, etc.) at config load time and caches the IPs. After docker-rollout swaps containers (new IPs), Caddy still routes to the stale IPs. The `up -d --force-recreate caddy` step at the end re-resolves all DNS in under a second (Caddy startup is fast).

Between individual rollout completions and the final Caddy recreate, there may be a brief window where Caddy routes to a removed container. Caddy's built-in retry logic handles this gracefully. If testing shows otherwise, we can run `docker compose exec caddy caddy reload --config /etc/caddy/Caddyfile` after each individual rollout instead.
server mounts beetle_data:/app/data (SQLite) and property in production mounts /home/deploy/remilia/data/property:/app/data. During the ~10-20s overlap, both old and new containers mount these volumes. The old container continues serving traffic; the new container only runs health checks (read-only /health endpoint). No SQLite write conflicts expected.
Two services already handle SIGTERM gracefully; two need fixes:
| Service | SIGTERM | Graceful shutdown | Status |
|---|---|---|---|
| profile-server | Yes (app.js:396-404) | Yes — server.close() + redisClient.quit() | Ready |
| property | Yes (api-service.js:428-429, sync.js:287-288) | Yes — stops monitors, closes DB | Ready |
| server (beetle) | No | No — ListenAndServe() blocks with no signal interception | Needs fix |
| identity | No | No — no process.on('SIGTERM') | Needs fix |
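To verify a fix by hand, stop the container and read its exit code: 0 means the app exited cleanly on SIGTERM, 137 (128+9) means Docker had to SIGKILL it after the grace period. A sketch — `check_graceful` is a hypothetical helper; pass the container name shown by `docker ps`:

```shell
# Report whether a container shuts down cleanly within the stop grace period.
check_graceful() {
  local ctr="$1"
  docker stop -t 10 "$ctr" >/dev/null
  local code
  code=$(docker inspect --format '{{.State.ExitCode}}' "$ctr")
  if [ "$code" = "0" ]; then
    echo "$ctr: clean shutdown"
  else
    echo "$ctr: exit code $code (SIGKILLed after grace period?)"
  fi
}
```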
File: packages/server/pkg/web/server.go (lines 261-291, Run() method)
The remichat module already has the pattern (packages/server/pkg/.remichat/cmd/main.go:169):
```go
ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGINT, syscall.SIGTERM)
defer stop()
```

Apply the same pattern to beetle's Run():
- Accept a context (or create one with `signal.NotifyContext` in `cmd/main.go`)
- Pass it through to `Run()`
- On `ctx.Done()`, call `srv.Shutdown(shutdownCtx)` with a 5-10s timeout
File: packages/server/cmd/main.go (~line 156) — wire signal context into app.Run()
File: packages/identity/app.js (after server.listen() on line 675)
Add:
```js
const shutdown = async () => {
  // Wait for in-flight requests to drain before closing the DB connection;
  // calling process.exit() before close() completes would kill them.
  await new Promise((resolve) => server.close(resolve));
  await mongoose.connection.close();
  process.exit(0);
};
process.on('SIGTERM', shutdown);
process.on('SIGINT', shutdown);
```

Once graceful shutdown works in all services, add `--pre-stop-hook` to the rollout loop in both workflows:
```shell
for svc in server property identity profile-server; do
  docker rollout \
    -f docker-compose.yaml \
    -f infra/$ENV/docker-compose.override.yaml \
    --pre-stop-hook "sleep 5" \
    "$svc"
done
```

This gives the old container 5 extra seconds (sleep completes, then docker stop sends SIGTERM, then the app's graceful shutdown kicks in). The total drain window is ~15s (5s sleep + 10s docker stop grace period).
Note: We do NOT need to modify healthchecks for the drain-file pattern. That pattern is only useful if the proxy actively health-checks upstreams and routes away from unhealthy ones. Since we force-recreate Caddy at the end anyway, the simpler sleep approach is sufficient.
- Staging first: deploy to staging and verify:
  - `docker rollout` installs and runs correctly
  - Each service rolls out without errors
  - Health check endpoints respond throughout deployment
  - No dropped requests during rollout (test with a `curl` loop)
  - WebSocket connections (beetle game, identity) reconnect gracefully
  - Caddy routes to new containers after restart
- Production: deploy during a low-traffic period, monitor error rates in Sentry/Grafana
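The "no dropped requests" check can be a simple probe loop run from a workstation while the deploy is in progress — a sketch; the URL and request count are placeholders:

```shell
# probe_uptime URL [COUNT] — fires COUNT requests against URL, reports
# failures, and exits non-zero if any request failed.
probe_uptime() {
  local url="$1" total="${2:-200}" failed=0 i
  for i in $(seq 1 "$total"); do
    curl -sf --max-time 2 -o /dev/null "$url" || failed=$((failed + 1))
    sleep 0.05
  done
  echo "failed: $failed/$total"
  [ "$failed" -eq 0 ]
}
# e.g. probe_uptime https://staging.example.com/health 300
```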