Postmortem: Nodira and Mirzo Outage

Date: 2026-01-22
Authors: Mirzo (supervisor agent)
Status: Draft v5
Last Updated: 2026-01-23 04:07 PST


Summary

Both Nodira (chatbot) and Mirzo (supervisor) experienced outages on 2026-01-22 due to Linux OOM (Out of Memory) killer terminating the Claude Code subprocess.

Timeline

  • Last signs of life: 10:38 PST (18:38 UTC) - both bots
  • Owner restart: ~19:41 PST (Mirzo)
  • Total outage: ~9 hours

Root Cause

The Claude Code CLI process consumed ~5GB of RAM, triggering the Linux OOM killer. Evidence from dmesg:

Out of memory: Killed process 21285 (claude) total-vm:85525212kB, anon-rss:5366112kB
Out of memory: Killed process 21666 (claude) total-vm:82271660kB, anon-rss:2737224kB
Out of memory: Killed process 21761 (claude) total-vm:85433696kB, anon-rss:5329092kB

The claudir process continued running, but its Claude Code subprocess was dead. Without Claude, the bot could not process messages or respond.

Contributing Factor: Error Messages Ignored

claudir ignores error messages from Claude Code (CC). If CC encounters a transient API error (rate limit, overload, etc.), it may emit an error JSON message that we silently discard.

Evidence from code analysis (src/chatbot/claude_code.rs):

// Line 287-288: Unknown message types are caught and ignored
#[serde(other)]
Other,

// Line 814: "Other" messages are silently skipped
Some(OutputMessage::Other) => continue,

// Line 572-576: Unparseable JSON logged at debug level (not visible in prod)
Err(e) => {
    debug!("Parse error: {} ({})", e, preview);  // Not warn or error!
}

Impact: If Claude Code sent {"type": "error", "message": "API overloaded"}, we would:

  1. Try to parse as system, assistant, or result type → fail
  2. Fall back to Other variant → silently skip
  3. Wait for a result message that never comes
  4. Eventually return empty tool_calls (the "No structured output" we saw)

This means transient API errors could cause the same symptoms as OOM, but the OOM evidence from dmesg is strong and specific to this incident.
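
A minimal sketch of what surfacing these messages could look like, assuming an internally tagged serde enum roughly like the one in claude_code.rs (the variant names, fields, and logging macro below are assumptions, not the actual claudir code):

```rust
// Hypothetical sketch only: give error JSON its own variant so it no longer
// falls through to the #[serde(other)] catch-all. Variant and field names are
// guesses; the system/assistant/result variants are elided for brevity.
use serde::Deserialize;
use tracing::warn;

#[derive(Debug, Deserialize)]
#[serde(tag = "type", rename_all = "snake_case")]
enum OutputMessage {
    Error { message: String }, // captures {"type": "error", "message": "..."}
    #[serde(other)]
    Other,
}

fn handle_line(line: &str) {
    match serde_json::from_str::<OutputMessage>(line) {
        // Surface the error instead of silently skipping it.
        Ok(OutputMessage::Error { message }) => {
            warn!("Claude Code reported an error: {}", message);
        }
        Ok(OutputMessage::Other) => {} // still skip truly unknown types
        // Parse failures at warn (not debug) so they are visible in prod.
        Err(e) => warn!("Failed to parse Claude Code output: {}", e),
    }
}
```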

Impact

| Metric | Value |
| --- | --- |
| Total downtime | ~9 hours |
| Duration of incident | 10:38 - ~19:41 PST |
| Bots affected | Both Nodira and Mirzo |
| Root cause | OOM kill of Claude Code subprocess |
| Detection method | Owner noticed lack of responses |

Technical Context

System resources:

  • Total RAM: 7.7 GB
  • Swap: 2.0 GB (1.5 GB was in use during incident)

Memory usage at incident time:

  • xtts_server (TTS): ~2 GB (24% of RAM)
  • Claude Code: ~5 GB (65% of RAM) - exceeded available memory
  • Combined: ~7 GB - triggers OOM with minimal headroom

Architecture:

  • claudir (Rust) spawns Claude Code CLI as subprocess
  • Claude Code handles AI inference
  • If Claude Code dies, claudir has no AI backend but keeps running
  • No health check to detect dead subprocess
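
Since there is no liveness check today, the simplest fix is to poll the spawned process handle. A minimal sketch, assuming claudir keeps the std::process::Child for the CLI (this is not claudir's actual code, and the spawn invocation is simplified):

```rust
// Minimal sketch: poll the Claude Code subprocess and respawn it if it has
// exited (e.g. after an OOM kill). Not the actual claudir implementation.
use std::process::{Child, Command};
use std::thread::sleep;
use std::time::Duration;

fn spawn_claude() -> std::io::Result<Child> {
    // Real invocation flags omitted; `claude` stands in for the CLI command.
    Command::new("claude").spawn()
}

fn main() -> std::io::Result<()> {
    let mut child = spawn_claude()?;
    loop {
        if let Some(status) = child.try_wait()? {
            // Subprocess is gone: log the exit status, then self-heal.
            eprintln!("claude exited with {status}, restarting");
            child = spawn_claude()?;
        }
        sleep(Duration::from_secs(30));
    }
}
```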

5 Whys Analysis

Why were both bots down for ~9 hours?

  1. Why were the bots unresponsive? → Claude Code subprocess was dead (killed by OOM)

  2. Why was Claude Code killed? → It consumed ~5GB RAM, exceeding available memory

  3. Why did Claude Code use so much memory? → UNKNOWN - possibly extended thinking, a large context, or a memory leak

  4. Why didn't claudir detect the dead subprocess? → No health check for subprocess liveness; claudir continued running

  5. Why did it take ~9 hours to recover? → No alerting; owner was AFK; no auto-restart on subprocess death

Root cause: OOM kill + no subprocess health check + no alerting = prolonged outage

Action Items

P0 - Critical (prevent recurrence)

| Action | Rationale | Validation |
| --- | --- | --- |
| Add subprocess health check | Detect when Claude Code dies | Test: kill Claude subprocess, verify claudir detects and restarts |
| Add memory monitoring | Alert before OOM threshold | Test: monitor memory, alert at 80% |
| Reduce TTS server memory OR disable when not needed | Free ~2GB headroom | Measure memory with TTS disabled |
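
As a sketch of the memory-monitoring item above (the 80% threshold, the /proc/meminfo parsing, and the alert hook are all assumptions, not an agreed design):

```rust
// Rough sketch: read MemTotal and MemAvailable from /proc/meminfo and warn
// when usage crosses 80%. The alert channel is left as a placeholder.
use std::fs;

fn meminfo_kb(text: &str, key: &str) -> Option<u64> {
    text.lines()
        .find(|l| l.starts_with(key))
        .and_then(|l| l.split_whitespace().nth(1))
        .and_then(|v| v.parse().ok())
}

fn main() -> std::io::Result<()> {
    let text = fs::read_to_string("/proc/meminfo")?;
    let total = meminfo_kb(&text, "MemTotal:").unwrap_or(0);
    let avail = meminfo_kb(&text, "MemAvailable:").unwrap_or(0);
    if total > 0 {
        let used_pct = (total - avail) * 100 / total;
        if used_pct >= 80 {
            // Replace with a real alert (e.g. a message to the owner).
            eprintln!("memory usage at {used_pct}% ({avail} kB available)");
        }
    }
    Ok(())
}
```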

P1 - Important

| Action | Rationale | Validation |
| --- | --- | --- |
| Log CC error messages | Currently ignored - we discard type: error JSON from CC | Test: inject error, verify it's logged at warn/error level |
| Handle CC API errors | Transient errors should trigger retry, not silent failure | Test: simulate API error, verify graceful degradation |
| Log Claude Code subprocess exit status | Know when and why it dies | Verify exit is logged |
| Auto-restart Claude Code on death | Self-heal from OOM kills | Test: kill subprocess, verify auto-restart |
| Add swap space | Delay OOM threshold | Add 2-4GB swap |

P2 - Nice to have

| Action | Rationale | Validation |
| --- | --- | --- |
| Profile Claude Code memory usage | Understand why it uses 5GB | Run profiler during normal operation |
| Research Claude Code memory settings | See if memory can be limited | Check CC documentation |

Open Questions

Q1: Why does Claude Code use 5GB+ RAM?

  • Is this normal for extended thinking?
  • Is it a memory leak?
  • Does context size affect memory usage?
  • Can we limit it with env vars or flags?

Action: Research Claude Code memory behavior, profile during operation.

Q2: Should TTS stay disabled?

  • TTS uses ~2GB
  • Without TTS: 5.7GB free for Claude Code
  • With TTS: 3.7GB free - may still OOM

Decision needed: Is TTS worth the memory cost?

Q3: Can both bots run simultaneously?

  • Each bot needs Claude Code (~2-5GB)
  • Two bots = potentially 10GB
  • System has 7.7GB

Decision needed: Run one bot at a time, or upgrade hardware?

Lessons Learned

What went wrong

  • No subprocess health monitoring: claudir didn't know Claude Code was dead
  • Insufficient memory: 7.7GB is too tight for two bots + TTS
  • No OOM alerting: Linux kills processes silently unless monitored
  • No auto-recovery: Subprocess death = permanent failure until manual restart

What went well

  • Logs preserved: Could reconstruct timeline from application logs
  • dmesg available: Found the actual root cause (OOM)
  • Quick diagnosis once investigated: Root cause identified within minutes of looking

Where we got lucky

  • Not during peak usage: Fewer users affected
  • Owner eventually checked: ~9 hours is bad, but it could have been longer

Supporting Information

  • Nodira logs: data/prod/nodira/logs/claudir.log
  • Mirzo logs: data/prod/mirzo/logs/claudir.log
  • OOM evidence: dmesg output showing killed processes
  • System memory: free -h shows 7.7GB total
