Postmortem: Nodira and Mirzo Outage

Date: 2026-01-22
Authors: Mirzo (supervisor agent)
Status: Draft v5
Last Updated: 2026-01-23 04:07 PST


Summary

Both Nodira (chatbot) and Mirzo (supervisor) experienced outages on 2026-01-22 due to Linux OOM (Out of Memory) killer terminating the Claude Code subprocess.

Timeline

  • Last signs of life: 10:38 PST (18:38 UTC) - both bots
  • Owner restart: ~19:41 PST (Mirzo)
  • Total outage: ~9 hours

Root Cause

The Claude Code CLI process consumed ~5GB of RAM, triggering the Linux OOM killer. Evidence from dmesg:

Out of memory: Killed process 21285 (claude) total-vm:85525212kB, anon-rss:5366112kB
Out of memory: Killed process 21666 (claude) total-vm:82271660kB, anon-rss:2737224kB
Out of memory: Killed process 21761 (claude) total-vm:85433696kB, anon-rss:5329092kB

The claudir process continued running, but its Claude Code subprocess was dead. Without Claude, the bot could not process messages or respond.

Contributing Factor: Error Messages Ignored

claudir ignores error messages from Claude Code (CC). If CC encounters a transient API error (rate limit, overload, etc.), it may emit an error JSON message that we silently discard.

Evidence from code analysis (src/chatbot/claude_code.rs):

// Line 287-288: Unknown message types are caught and ignored
#[serde(other)]
Other,

// Line 814: "Other" messages are silently skipped
Some(OutputMessage::Other) => continue,

// Line 572-576: Unparseable JSON logged at debug level (not visible in prod)
Err(e) => {
    debug!("Parse error: {} ({})", e, preview);  // Not warn or error!
}

Impact: If Claude Code sent {"type": "error", "message": "API overloaded"}, we would:

  1. Try to parse as system, assistant, or result type → fail
  2. Fall back to Other variant → silently skip
  3. Wait for a result message that never comes
  4. Eventually return empty tool_calls (the "No structured output" we saw)

This means transient API errors could cause the same symptoms as OOM, but the OOM evidence from dmesg is strong and specific to this incident.
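
A minimal sketch of what surfacing these messages could look like, assuming an internally tagged serde enum roughly like the one in claude_code.rs (the variant names, fields, and logging macro below are assumptions, not the actual claudir code):

```rust
// Hypothetical sketch only: give error JSON its own variant so it no longer
// falls through to the #[serde(other)] catch-all. Variant and field names are
// guesses; the system/assistant/result variants are elided for brevity.
use serde::Deserialize;
use tracing::warn;

#[derive(Debug, Deserialize)]
#[serde(tag = "type", rename_all = "snake_case")]
enum OutputMessage {
    Error { message: String }, // captures {"type": "error", "message": "..."}
    #[serde(other)]
    Other,
}

fn handle_line(line: &str) {
    match serde_json::from_str::<OutputMessage>(line) {
        // Surface the error instead of silently skipping it.
        Ok(OutputMessage::Error { message }) => {
            warn!("Claude Code reported an error: {}", message);
        }
        Ok(OutputMessage::Other) => {} // still skip truly unknown types
        // Parse failures at warn (not debug) so they are visible in prod.
        Err(e) => warn!("Failed to parse Claude Code output: {}", e),
    }
}
```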

Impact

| Metric | Value |
| --- | --- |
| Total downtime | ~9 hours |
| Duration of incident | 10:38 - ~19:41 PST |
| Bots affected | Both Nodira and Mirzo |
| Root cause | OOM kill of Claude Code subprocess |
| Detection method | Owner noticed lack of responses |

Technical Context

System resources:

  • Total RAM: 7.7 GB
  • Swap: 2.0 GB (1.5 GB was in use during incident)

Memory usage at incident time:

  • xtts_server (TTS): ~2 GB (24% of RAM)
  • Claude Code: ~5 GB (65% of RAM) - exceeded available memory
  • Combined: ~7 GB - triggers OOM with minimal headroom

Architecture:

  • claudir (Rust) spawns Claude Code CLI as subprocess
  • Claude Code handles AI inference
  • If Claude Code dies, claudir has no AI backend but keeps running
  • No health check to detect dead subprocess
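
Since there is no liveness check today, the simplest fix is to poll the spawned process handle. A minimal sketch, assuming claudir keeps the std::process::Child for the CLI (this is not claudir's actual code, and the spawn invocation is simplified):

```rust
// Minimal sketch: poll the Claude Code subprocess and respawn it if it has
// exited (e.g. after an OOM kill). Not the actual claudir implementation.
use std::process::{Child, Command};
use std::thread::sleep;
use std::time::Duration;

fn spawn_claude() -> std::io::Result<Child> {
    // Real invocation flags omitted; `claude` stands in for the CLI command.
    Command::new("claude").spawn()
}

fn main() -> std::io::Result<()> {
    let mut child = spawn_claude()?;
    loop {
        if let Some(status) = child.try_wait()? {
            // Subprocess is gone: log the exit status, then self-heal.
            eprintln!("claude exited with {status}, restarting");
            child = spawn_claude()?;
        }
        sleep(Duration::from_secs(30));
    }
}
```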

5 Whys Analysis

Why were both bots down for ~9 hours?

  1. Why were the bots unresponsive? → Claude Code subprocess was dead (killed by OOM)

  2. Why was Claude Code killed? → It consumed ~5GB RAM, exceeding available memory

  3. Why did Claude Code use so much memory? → UNKNOWN - possibly extended thinking, a large context, or a memory leak

  4. Why didn't claudir detect the dead subprocess? → No health check for subprocess liveness; claudir continued running

  5. Why did it take ~9 hours to recover? → No alerting; owner was AFK; no auto-restart on subprocess death

Root cause: OOM kill + no subprocess health check + no alerting = prolonged outage

Action Items

P0 - Critical (prevent recurrence)

| Action | Rationale | Validation |
| --- | --- | --- |
| Add subprocess health check | Detect when Claude Code dies | Test: kill Claude subprocess, verify claudir detects and restarts |
| Add memory monitoring | Alert before OOM threshold | Test: monitor memory, alert at 80% |
| Reduce TTS server memory OR disable when not needed | Free ~2GB headroom | Measure memory with TTS disabled |
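
As a sketch of the memory-monitoring item above (the 80% threshold, the /proc/meminfo parsing, and the alert hook are all assumptions, not an agreed design):

```rust
// Rough sketch: read MemTotal and MemAvailable from /proc/meminfo and warn
// when usage crosses 80%. The alert channel is left as a placeholder.
use std::fs;

fn meminfo_kb(text: &str, key: &str) -> Option<u64> {
    text.lines()
        .find(|l| l.starts_with(key))
        .and_then(|l| l.split_whitespace().nth(1))
        .and_then(|v| v.parse().ok())
}

fn main() -> std::io::Result<()> {
    let text = fs::read_to_string("/proc/meminfo")?;
    let total = meminfo_kb(&text, "MemTotal:").unwrap_or(0);
    let avail = meminfo_kb(&text, "MemAvailable:").unwrap_or(0);
    if total > 0 {
        let used_pct = (total - avail) * 100 / total;
        if used_pct >= 80 {
            // Replace with a real alert (e.g. a message to the owner).
            eprintln!("memory usage at {used_pct}% ({avail} kB available)");
        }
    }
    Ok(())
}
```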

P1 - Important

| Action | Rationale | Validation |
| --- | --- | --- |
| Log CC error messages | Currently ignored - we discard type: error JSON from CC | Test: inject error, verify it's logged at warn/error level |
| Handle CC API errors | Transient errors should trigger retry, not silent failure | Test: simulate API error, verify graceful degradation |
| Log Claude Code subprocess exit status | Know when and why it dies | Verify exit is logged |
| Auto-restart Claude Code on death | Self-heal from OOM kills | Test: kill subprocess, verify auto-restart |
| Add swap space | Delay OOM threshold | Add 2-4GB swap |

P2 - Nice to have

| Action | Rationale | Validation |
| --- | --- | --- |
| Profile Claude Code memory usage | Understand why it uses 5GB | Run profiler during normal operation |
| Research Claude Code memory settings | See if memory can be limited | Check CC documentation |

Open Questions

Q1: Why does Claude Code use 5GB+ RAM?

  • Is this normal for extended thinking?
  • Is it a memory leak?
  • Does context size affect memory usage?
  • Can we limit it with env vars or flags?

Action: Research Claude Code memory behavior, profile during operation.

Q2: Should TTS stay disabled?

  • TTS uses ~2GB
  • Without TTS: 5.7GB free for Claude Code
  • With TTS: 3.7GB free - may still OOM

Decision needed: Is TTS worth the memory cost?

Q3: Can both bots run simultaneously?

  • Each bot needs Claude Code (~2-5GB)
  • Two bots = potentially 10GB
  • System has 7.7GB

Decision needed: Run one bot at a time, or upgrade hardware?

Lessons Learned

What went wrong

  • No subprocess health monitoring: claudir didn't know Claude Code was dead
  • Insufficient memory: 7.7GB is too tight for two bots + TTS
  • No OOM alerting: Linux kills processes silently unless monitored
  • No auto-recovery: Subprocess death = permanent failure until manual restart

What went well

  • Logs preserved: Could reconstruct timeline from application logs
  • dmesg available: Found the actual root cause (OOM)
  • Quick diagnosis once investigated: Root cause identified within minutes of looking

Where we got lucky

  • Not during peak usage: Fewer users affected
  • Owner eventually checked: ~9 hours is bad, but it could have been longer

Supporting Information

  • Nodira logs: data/prod/nodira/logs/claudir.log
  • Mirzo logs: data/prod/mirzo/logs/claudir.log
  • OOM evidence: dmesg output showing killed processes
  • System memory: free -h shows 7.7GB total
