- SSH to citadel (5.78.40.219) hung: the TCP connection succeeded but no SSH banner was returned.
- The Hetzner VNC console showed continuous OOM killer messages:

      Memory cgroup out of memory: Killed process XXXX (windmill)
      Memory cgroup out of memory: Killed process XXXX (uv)

- CPU was pinned at ~1000-1500% (an OOM kill/respawn loop).
- The server was completely unresponsive.
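To check afterwards which processes dominated the kill loop, the kernel log can be summarised by victim name. The sketch below is a hypothetical helper (`oom_victims` is not an existing tool). It assumes kernel log lines in the `Killed process <pid> (<name>)` format shown above, read on stdin.

```bash
#!/usr/bin/env bash
# Hypothetical helper: count OOM-killer victims by process name.
# Reads kernel log text on stdin and prints "<count> <name>", most-killed first.
oom_victims() {
  # Keep only the "Killed process <pid> (<name>)" fragment of each matching line.
  grep -o 'Killed process [0-9]* ([^)]*)' \
    | sed -E 's/.*\((.*)\)$/\1/' \
    | sort | uniq -c | sort -rn
}
```

For example, `journalctl -k -b -1 | oom_victims` summarises the previous boot. This only works if journald storage is persistent, because otherwise the logs from before the power cycle are gone. `dmesg | oom_victims` covers the current boot.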
Windmill was configured with 8 worker replicas, each with a 2GB memory limit (16GB potential). The Windmill server, postgres and LSP containers had no memory limits at all. Together with ~60 other containers on the 30GB CPX51, this exhausted memory. The OOM killer then entered a death spiral: windmill and uv (the Python dependency installer) processes were killed, respawned, and killed again.
A soft reboot via `hcloud server reboot citadel` failed because the kernel was too overwhelmed to respond. The server was recovered with a hard power cycle: `hcloud server poweroff citadel` followed by `hcloud server poweron citadel`.
Edited `/etc/dokploy/compose/windmill-windmill-whifrv/code/docker-compose.yml` to cut the workers from `replicas: 8` to `replicas: 4` and to set memory limits on every Windmill service:
| Service | Memory limit |
|---|---|
| windmill-worker (×4) | 2048M each |
| windmill-server | 2048M |
| windmill-postgres | 2048M |
| windmill-lsp | 512M |
| windmill-worker-native | 128M (already set) |
New Windmill memory budget: ~12.6GB max (down from unbounded ~20GB+)
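The ~12.6GB figure can be checked against the table above. The sketch below sums the limits (the function name `windmill_budget_mib` is hypothetical). It assumes Docker reads the `M` suffix as MiB (1024 × 1024 bytes).

```bash
#!/usr/bin/env bash
# Recompute the Windmill memory ceiling from the limits in the table above.
# Assumes Docker's "M" suffix means MiB, so every value here is in MiB.
windmill_budget_mib() {
  local workers="${1:-4}"   # worker replica count (4 after the fix)
  local worker_mib=2048 server_mib=2048 postgres_mib=2048 lsp_mib=512 native_mib=128
  echo $(( workers * worker_mib + server_mib + postgres_mib + lsp_mib + native_mib ))
}

total=$(windmill_budget_mib)
# Bash arithmetic is integer-only, so awk does the MiB -> GiB division.
awk -v m="$total" 'BEGIN { printf "%d MiB = %.1f GiB\n", m, m / 1024 }'
```

This prints `12928 MiB = 12.6 GiB`. With 8 workers, `windmill_budget_mib 8` returns 21120. That is what the old replica count would cost even with the new limits on the other services.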
Added a 4G swapfile and lowered swappiness:

    fallocate -l 4G /swapfile
    chmod 600 /swapfile
    mkswap /swapfile
    swapon /swapfile
    echo "/swapfile none swap sw 0 0" >> /etc/fstab
    sysctl vm.swappiness=10

Swap with low swappiness makes it more likely that the system degrades to slowness under memory pressure, rather than OOM-killing critical services like sshd.
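`sysctl vm.swappiness=10` only changes the running kernel, so the value is lost on the next reboot. Re-running the `echo ... >> /etc/fstab` line would also add a duplicate entry. Below is a minimal sketch of an idempotent alternative, assuming a standard `/etc/sysctl.d` layout. The function name `persist_swap_settings` and the file name `99-swappiness.conf` are hypothetical choices, not part of the original setup.

```bash
#!/usr/bin/env bash
# Hypothetical helper: persist the swap settings above, safely re-runnable.
# The optional argument is a root directory (default "/") so it can be dry-run
# against a scratch tree instead of the real /etc.
persist_swap_settings() {
  local root="${1:-/}"
  local fstab="${root%/}/etc/fstab"
  local sysctl_dir="${root%/}/etc/sysctl.d"
  local entry="/swapfile none swap sw 0 0"

  # Append the fstab entry only if an identical line is not already present.
  if ! grep -qxF "$entry" "$fstab" 2>/dev/null; then
    echo "$entry" >> "$fstab"
  fi

  # A sysctl.d drop-in is re-applied at boot, unlike a one-off `sysctl` call.
  mkdir -p "$sysctl_dir"
  echo "vm.swappiness = 10" > "$sysctl_dir/99-swappiness.conf"
}
```

Run `persist_swap_settings /` as root. Then `sysctl --system` reloads all sysctl config files without a reboot.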
Important: The compose file was edited directly on disk. Dokploy will overwrite these changes on the next deploy unless the settings are also updated in Dokploy.
- Update Windmill in the Dokploy UI: go to the Windmill compose service in Dokploy and update its `docker-compose.yml` to match the changes above (4 worker replicas, memory limits on all Windmill services). This ensures the fix survives redeployments.
- Investigate the triggering Windmill job: check the Windmill UI at `windmill.knowsuchagency.ai` for recently failed or running jobs that may have triggered excessive `uv` dependency installs. Consider adding per-job memory/timeout limits in Windmill's worker settings.
- Consider further hardening:
  - Set a `MEMORY_LIMIT` env var on Windmill workers, if Windmill supports one.
  - Set Docker's `oom-score-adj` (`oom_score_adj` in compose) to protect critical containers (traefik, dokploy). sshd presumably runs on the host rather than in a container, so protect it with systemd's `OOMScoreAdjust=` instead.
  - Set up monitoring/alerts for memory usage above 80% (e.g. via Beszel, which is already running).
  - Consider whether 4 workers are sufficient for your workload, or whether the server needs to be scaled up.
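Beszel is the natural home for the 80% alert. As a hypothetical fallback (for example, run from cron), the check itself is small. The sketch below assumes Linux `/proc/meminfo` with a `MemAvailable` field and treats "used" as `MemTotal - MemAvailable`. The function names `mem_used_pct` and `check_mem` are illustrative only.

```bash
#!/usr/bin/env bash
# Hypothetical fallback memory check (not part of Beszel).
# "Used" = MemTotal - MemAvailable, as an integer percentage.
mem_used_pct() {
  local meminfo="${1:-/proc/meminfo}"
  awk '/^MemTotal:/ { t = $2 } /^MemAvailable:/ { a = $2 }
       END { if (t > 0) printf "%d\n", (t - a) * 100 / t }' "$meminfo"
}

# Returns non-zero (and warns on stderr) when usage exceeds the threshold.
check_mem() {
  local threshold="${1:-80}" meminfo="${2:-/proc/meminfo}"
  local pct
  pct=$(mem_used_pct "$meminfo")
  if [ "$pct" -gt "$threshold" ]; then
    echo "WARN: memory ${pct}% used (threshold ${threshold}%)" >&2
    return 1
  fi
  echo "OK: memory ${pct}% used"
}
```

A cron entry could then run something like `check_mem 80 || logger -t memcheck 'memory above threshold'`, or call whatever alerting hook is available.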