Citadel OOM Incident — 2026-02-09

Symptoms

SSH to citadel (5.78.40.219) hung — TCP connected but no SSH banner returned

Hetzner VNC console showed continuous OOM killer messages:

Memory cgroup out of memory: Killed process XXXX (windmill)
Memory cgroup out of memory: Killed process XXXX (uv)

CPU pinned at ~1000-1500% (OOM kill/respawn loop)
Server was completely unresponsive

Root Cause

Windmill was configured with 8 worker replicas, each with a 2GB memory limit (up to 16GB in total). On top of that came the Windmill server, Postgres, and LSP containers, none of which had memory limits, plus ~60 other containers on the 30GB CPX51, and memory was exhausted. The OOM killer then entered a death spiral: it repeatedly killed windmill and uv (Python dependency installer) processes, which were immediately respawned and killed again.

Resolution

1. Force power cycle

A soft reboot via hcloud server reboot citadel failed because the kernel was too overwhelmed to respond. Recovery required hcloud server poweroff citadel followed by hcloud server poweron citadel.
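The symptom seen during the incident (TCP connects, but no SSH banner arrives) can serve as a probe for deciding when a soft reboot has not worked. The following is a minimal sketch under stated assumptions, not the procedure used during the incident: the function names, the 90-second grace period, and the REBOOT_GRACE_S variable are hypothetical. Only the hcloud server reboot, poweroff, and poweron subcommands come from the steps above.

```shell
# Returns 0 if the line looks like an SSH identification string ("SSH-...").
is_ssh_banner() {
  case "$1" in
    SSH-*) return 0 ;;
    *)     return 1 ;;
  esac
}

# Returns 0 if host:port sends an SSH banner within the timeout (seconds).
# The probe itself runs in bash because it relies on bash's /dev/tcp redirection.
ssh_banner_ok() {
  sb_host="$1"; sb_port="${2:-22}"; sb_wait="${3:-5}"
  sb_banner="$(timeout "$sb_wait" bash -c \
    'exec 3<>"/dev/tcp/$0/$1" && IFS= read -r line <&3 && printf "%s" "$line"' \
    "$sb_host" "$sb_port" 2>/dev/null)" || return 1
  is_ssh_banner "$sb_banner"
}

# Try a soft reboot first; fall back to a hard power cycle only if
# SSH still does not answer after the grace period (assumed: 90s).
recover_server() {
  rs_server="$1"; rs_host="$2"
  hcloud server reboot "$rs_server" || true
  sleep "${REBOOT_GRACE_S:-90}"
  if ssh_banner_ok "$rs_host"; then
    return 0
  fi
  hcloud server poweroff "$rs_server" && hcloud server poweron "$rs_server"
}
```

With these definitions loaded, recover_server citadel 5.78.40.219 would attempt the same sequence described above. Checking for the banner rather than an open port matters here, since the port stayed reachable while the host was unusable.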

2. Reduced Windmill workers (8 → 4)

Edited /etc/dokploy/compose/windmill-windmill-whifrv/code/docker-compose.yml:

replicas: 8 → replicas: 4

3. Added memory limits to unbounded services

Service	Limit
windmill-worker (×4)	2048M each
windmill-server	2048M
windmill-postgres	2048M
windmill-lsp	512M
windmill-worker-native	128M (already set)

New Windmill memory budget: ~12.6GB max (4 × 2048M + 2048M + 2048M + 512M + 128M = 12928M), down from an unbounded ~20GB+
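The budget figure can be reproduced from the limits in the table. The sketch below simply sums them for a given worker count; the function and variable names are illustrative, and the 8-worker figure is hypothetical because the server, Postgres, and LSP containers had no limits before the change.

```shell
# Sum the per-service memory limits (in MB) from the table above.
# Usage: budget_mb WORKER_COUNT
budget_mb() {
  bm_workers="$1"
  bm_worker=2048; bm_server=2048; bm_postgres=2048; bm_lsp=512; bm_native=128
  echo $(( bm_workers * bm_worker + bm_server + bm_postgres + bm_lsp + bm_native ))
}

budget_mb 4   # prints 12928 (about 12.6GB)
budget_mb 8   # prints 21120 (about 20.6GB, if the same limits had applied to 8 workers)
```

Against a 30GB host, the 4-worker budget leaves roughly 17GB for the other containers, assuming every Windmill service actually reaches its limit at the same time.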

4. Added 4GB swap as safety net

fallocate -l 4G /swapfile
chmod 600 /swapfile
mkswap /swapfile
swapon /swapfile
echo "/swapfile none swap sw 0 0" >> /etc/fstab
sysctl vm.swappiness=10

Swap with low swappiness is meant to let the system degrade to slowness under memory pressure rather than OOM-kill critical services like sshd. It is a safety net, not a guarantee.
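Two follow-ups are worth sketching here. First, confirming that the swap file is actually active. Second, persisting the swappiness value: sysctl vm.swappiness=10 on its own only lasts until the next reboot. The helper names and the drop-in file name 99-swappiness.conf below are assumptions for illustration, not what was run during the incident.

```shell
# Sum the Size column (KiB) of a /proc/swaps-style file.
# Prints 0 when no swap is active.
swap_total_kib() {
  awk 'NR > 1 { total += $3 } END { print total + 0 }' "${1:-/proc/swaps}"
}

# Write the swappiness setting to a sysctl drop-in so it survives a reboot.
# The default path is a hypothetical file name under /etc/sysctl.d.
persist_swappiness() {
  printf 'vm.swappiness=%s\n' "${1:-10}" > "${2:-/etc/sysctl.d/99-swappiness.conf}"
}

echo "active swap: $(swap_total_kib) KiB"
```

Running persist_swappiness 10 as root and then sysctl --system would apply the drop-in. swapon --show and free -h give the same information as swap_total_kib in a more readable form.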

Next Steps (Dokploy)

Important: The compose file was edited directly on disk. Dokploy will overwrite these changes on the next deploy unless the settings are also updated in Dokploy.

1. Update Windmill in the Dokploy UI: go to the Windmill compose service in Dokploy and update the docker-compose.yml to match the changes above (4 replicas, memory limits on all services). This ensures the fix survives redeployments.
2. Investigate the triggering Windmill job: check the Windmill UI at windmill.knowsuchagency.ai for recently failed or running jobs that may have triggered excessive uv dependency installs. Consider adding per-job memory/timeout limits in Windmill's worker settings.
3. Consider further hardening:
- Set the MEMORY_LIMIT env var on Windmill workers, if supported
- Lower the OOM score of critical containers such as traefik and dokploy (Docker's --oom-score-adj option, or oom_score_adj in Compose). If sshd runs on the host rather than in a container, it needs the host-side equivalent, for example systemd's OOMScoreAdjust= setting
- Set up monitoring/alerts (e.g. via Beszel, which is already running) for memory usage above 80%
- Consider whether 4 workers are sufficient for the workload, or whether the server needs to be scaled up
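As a stopgap until alerting is configured in Beszel, the 80% threshold mentioned above can be checked directly from /proc/meminfo. This is a minimal sketch with hypothetical function names; it only prints a warning, and wiring it to cron or a notifier is left open.

```shell
# Print used memory as an integer percentage, from a /proc/meminfo-style file.
# "Used" here means MemTotal minus MemAvailable.
mem_used_pct() {
  awk '/^MemTotal:/     { total = $2 }
       /^MemAvailable:/ { avail = $2 }
       END { if (total > 0) printf "%d\n", (total - avail) * 100 / total }' \
    "${1:-/proc/meminfo}"
}

# Exit status 0 when usage is strictly above the threshold (default 80).
# Usage: mem_over_threshold [THRESHOLD_PCT] [MEMINFO_FILE]
mem_over_threshold() {
  mo_pct="$(mem_used_pct "${2:-/proc/meminfo}")"
  [ -n "$mo_pct" ] && [ "$mo_pct" -gt "${1:-80}" ]
}

if mem_over_threshold 80; then
  echo "WARNING: memory usage above 80% on $(uname -n)" >&2
fi
```

MemAvailable is used instead of MemFree so that reclaimable page cache does not trigger false alarms.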

