@knowsuchagency
Created February 9, 2026 07:49

Citadel OOM Incident — 2026-02-09

Symptoms

  • SSH to citadel (5.78.40.219) hung — TCP connected but no SSH banner returned
  • Hetzner VNC console showed continuous OOM killer messages:
    Memory cgroup out of memory: Killed process XXXX (windmill)
    Memory cgroup out of memory: Killed process XXXX (uv)
    
  • CPU pinned at ~1000-1500% (OOM kill/respawn loop)
  • Server was completely unresponsive

Root Cause

Windmill was configured with 8 worker replicas, each with a 2GB memory limit (up to 16GB for the workers alone). The Windmill server, postgres, and LSP containers had no memory limits at all, and roughly 60 other containers shared the same 30GB CPX51, so memory was exhausted. The OOM killer then entered a death spiral, repeatedly killing windmill and uv (Python dependency installer) processes as they respawned.

Resolution

1. Force power cycle

Soft reboot via hcloud server reboot citadel failed (kernel too overwhelmed). Used hcloud server poweroff + hcloud server poweron to recover.
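The escalation from soft reboot to hard power cycle can be sketched as a small shell function. This is a minimal sketch, not something that was scripted during the incident: the `ssh_ok` helper, the `WAIT_SECONDS` variable, the 15-second probe timeout, and the function names are assumptions introduced for illustration. Only the three `hcloud server` subcommands come from the step above. It also assumes key-based SSH login to the host.

```shell
#!/usr/bin/env bash
# Sketch: try a soft reboot first, fall back to a hard power cycle.
# ssh_ok, WAIT_SECONDS and the 15 s timeout are illustrative assumptions.

WAIT_SECONDS="${WAIT_SECONDS:-120}"

# Succeeds only if a non-interactive SSH session can be opened in time,
# which also catches the "TCP connects but no banner" state.
ssh_ok() {
  timeout 15 ssh -o BatchMode=yes "$1" true
}

recover_server() {
  local server="$1" host="$2"

  # Soft reboot; ignore its exit status and judge by SSH reachability.
  hcloud server reboot "$server" || true
  sleep "$WAIT_SECONDS"
  if ssh_ok "$host"; then
    echo "soft reboot recovered $server"
    return 0
  fi

  # Hard power cycle, as used during the incident.
  hcloud server poweroff "$server" && hcloud server poweron "$server"
}
```

Usage would look like `recover_server citadel 5.78.40.219`. Judging recovery by SSH reachability rather than by the exit status of the reboot command matters here, because a reboot request can be accepted while an overwhelmed kernel never acts on it.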

2. Reduced Windmill workers (8 → 4)

Edited /etc/dokploy/compose/windmill-windmill-whifrv/code/docker-compose.yml:

  • replicas: 8 → replicas: 4

3. Added memory limits to unbounded services

Service                   Limit
windmill-worker (×4)      2048M each
windmill-server           2048M
windmill-postgres         2048M
windmill-lsp              512M
windmill-worker-native    128M (already set)
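To confirm that no running container is left without a limit, the effective limits can be read back from Docker. A minimal sketch, assuming the Docker CLI is available on the host; both function names are hypothetical. `HostConfig.Memory` is reported in bytes, and `0` means no limit is set.

```shell
# Print the names of containers whose memory limit is 0 (unlimited).
# Reads "name bytes" pairs on stdin, one container per line.
unbounded_containers() {
  awk '$2 == 0 { sub(/^\//, "", $1); print $1 }'
}

# List every running container with its memory limit in bytes.
list_memory_limits() {
  docker ps -q | xargs -r docker inspect --format '{{.Name}} {{.HostConfig.Memory}}'
}
```

Running `list_memory_limits | unbounded_containers` on the host would then list any container that can still grow without bound, including the non-Windmill ones.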

New Windmill memory budget: ~12.6GB max (down from unbounded ~20GB+)
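The ~12.6GB figure follows from summing the limits in the table. A quick shell check, treating 1GB as 1024M and assuming a single windmill-worker-native container (the function name is hypothetical):

```shell
# Sum the Windmill memory limits from the table, in MiB.
windmill_budget_mib() {
  local workers=4 worker=2048 server=2048 postgres=2048 lsp=512 native=128
  echo $(( workers * worker + server + postgres + lsp + native ))
}

windmill_budget_mib   # prints 12928, i.e. about 12.6GB
```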

4. Added 4GB swap as safety net

fallocate -l 4G /swapfile
chmod 600 /swapfile
mkswap /swapfile
swapon /swapfile
echo "/swapfile none swap sw 0 0" >> /etc/fstab
sysctl vm.swappiness=10

With swap available and swappiness kept low, memory pressure is more likely to show up as slowness than as the OOM killer taking out critical services like sshd.
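Note that `sysctl vm.swappiness=10` changes the running kernel only; the value reverts on reboot unless it is also written to a sysctl configuration file. A minimal sketch of persisting and verifying it, assuming a distribution that reads `/etc/sysctl.d/`. The file name `99-swappiness.conf` and the function name are hypothetical choices.

```shell
# Write the swappiness setting to a sysctl drop-in file.
# Both arguments are optional; the defaults are illustrative.
persist_swappiness() {
  local value="${1:-10}" file="${2:-/etc/sysctl.d/99-swappiness.conf}"
  printf 'vm.swappiness=%s\n' "$value" > "$file"
}

# As root on the host:
#   persist_swappiness 10
#   sysctl --system              # reload all sysctl configuration files
#   sysctl -n vm.swappiness      # read the value back
#   swapon --show                # confirm /swapfile is active
```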

Next Steps (Dokploy)

Important: The compose file was edited directly on disk. Dokploy will overwrite these changes on the next deploy unless the settings are also updated in Dokploy.

  1. Update Windmill in Dokploy UI — Go to the Windmill compose service in Dokploy and update the docker-compose.yml to match the changes above (4 replicas, memory limits on all services). This ensures the fix survives redeployments.

  2. Investigate the triggering Windmill job — Check the Windmill UI at windmill.knowsuchagency.ai for recently failed/running jobs that may have triggered excessive uv dependency installs. Consider adding per-job memory/timeout limits in Windmill's worker settings.

  3. Consider further hardening:

    • Set MEMORY_LIMIT env var on Windmill workers if supported
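    • Lower the OOM score of critical containers such as traefik and dokploy (Docker's --oom-score-adj flag, or oom_score_adj in Compose). sshd normally runs on the host rather than in a container, so it would be covered by OOMScoreAdjust= in its systemd unit instead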
    • Set up monitoring/alerts (e.g. via Beszel which is already running) for memory > 80%
    • Consider whether 4 workers is sufficient for your workload, or if you need to scale the server
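The 80% memory threshold from the hardening list can also be checked directly from `/proc/meminfo`, independent of whichever monitoring tool raises the alert. A minimal sketch, not Beszel configuration: the function names and the choice of `MemAvailable` as the measure of free memory are assumptions.

```shell
# Print used memory as an integer percentage, computed as
# (MemTotal - MemAvailable) / MemTotal from a meminfo-format file.
mem_used_percent() {
  awk '/^MemTotal:/     { total = $2 }
       /^MemAvailable:/ { avail = $2 }
       END { if (total > 0) printf "%d\n", (total - avail) * 100 / total }' \
    "${1:-/proc/meminfo}"
}

# Return non-zero and print a warning when usage exceeds the threshold.
check_memory() {
  local threshold="${1:-80}" used
  used="$(mem_used_percent "${2:-/proc/meminfo}")"
  if [ "$used" -gt "$threshold" ]; then
    echo "WARNING: memory at ${used}% (threshold ${threshold}%)"
    return 1
  fi
}
```

Called as `check_memory 80` from a cron job or similar, this gives a second signal that keeps working even when the monitoring stack itself is one of the containers under memory pressure.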