Date: 2026-01-22
Authors: Mirzo (supervisor agent)
Status: Draft v5
Last Updated: 2026-01-23 04:07 PST
Both Nodira (chatbot) and Mirzo (supervisor) experienced outages on 2026-01-22 because the Linux OOM (Out of Memory) killer terminated the Claude Code subprocess.
- Last signs of life: 10:38 PST (18:38 UTC) - both bots
- Owner restart: ~19:41 PST (Mirzo)
- Total outage: ~9 hours
The Claude Code CLI process consumed ~5GB of RAM, triggering the Linux OOM killer. Evidence from `dmesg`:
Out of memory: Killed process 21285 (claude) total-vm:85525212kB, anon-rss:5366112kB
Out of memory: Killed process 21666 (claude) total-vm:82271660kB, anon-rss:2737224kB
Out of memory: Killed process 21761 (claude) total-vm:85433696kB, anon-rss:5329092kB
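To make the memory figures in these kernel lines easier to read, here is a minimal sketch of a parser for lines shaped like the excerpts above. `OomKill`, `parse_oom_line`, and `kb_to_gib` are hypothetical names and are not part of claudir. Real kernel output may carry a timestamp prefix and extra fields, which this sketch skips over but does not interpret. Converted, the three `anon-rss` values come to roughly 5.1, 2.6, and 5.1 GiB.

```rust
/// One OOM-killer victim, as reported by a single kernel log line.
/// (Hypothetical type for illustration; not part of claudir.)
#[derive(Debug, PartialEq)]
pub struct OomKill {
    pub pid: u32,
    pub name: String,
    pub anon_rss_kb: u64,
}

/// Parse a line shaped like:
/// "Out of memory: Killed process 21285 (claude) total-vm:...kB, anon-rss:...kB"
/// Returns None for lines that do not match that shape.
pub fn parse_oom_line(line: &str) -> Option<OomKill> {
    // Everything after the marker; any prefix (e.g. a timestamp) is ignored.
    let rest = line.split("Killed process ").nth(1)?;
    let (pid_str, rest) = rest.split_once(' ')?;
    let pid = pid_str.parse().ok()?;

    // The process name sits between the first pair of parentheses.
    let name_start = rest.find('(')? + 1;
    let name_end = rest.find(')')?;
    let name = rest.get(name_start..name_end)?.to_string();

    // anon-rss is reported in kB; take the leading run of digits.
    let rss = rest.split("anon-rss:").nth(1)?;
    let digits: String = rss.chars().take_while(|c| c.is_ascii_digit()).collect();
    let anon_rss_kb = digits.parse().ok()?;

    Some(OomKill { pid, name, anon_rss_kb })
}

/// Convert the kernel's kB figure to GiB.
pub fn kb_to_gib(kb: u64) -> f64 {
    kb as f64 / (1024.0 * 1024.0)
}
```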
The claudir process continued running, but its Claude Code subprocess was dead. Without Claude, the bot could not process messages or respond.
claudir ignores error messages from Claude Code. If CC encounters a transient API error (rate limit, overload, etc.), it may output an error JSON message that we silently discard.
Evidence from code analysis (src/chatbot/claude_code.rs):
// Line 287-288: Unknown message types are caught and ignored
#[serde(other)]
Other,
// Line 814: "Other" messages are silently skipped
Some(OutputMessage::Other) => continue,
// Line 572-576: Unparseable JSON logged at debug level (not visible in prod)
Err(e) => {
    debug!("Parse error: {} ({})", e, preview); // Not warn or error!
}

Impact: If Claude Code sent `{"type": "error", "message": "API overloaded"}`, we would:
- Try to parse as `system`, `assistant`, or `result` type → fail
- Fall back to `Other` variant → silently skip
- Wait for a `result` message that never comes
- Eventually return empty tool_calls (the "No structured output" we saw)
This means transient API errors could cause the same symptoms as OOM, but the OOM evidence from `dmesg` is strong and specific to this incident.
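One way to close this gap is to model errors explicitly instead of letting them fall into the catch-all. The sketch below is a simplified, std-only stand-in for the enum in src/chatbot/claude_code.rs: the `Error` variant, the `Action` type, and `handle` are all hypothetical, and the real type is deserialized with serde, which is omitted here to keep the example self-contained. The point is the shape of the decision: an error ends the wait, and an unknown type leaves a visible trace rather than being skipped.

```rust
/// Simplified stand-in for the parsed message enum (assumption: the real
/// enum has more fields and is derived with serde).
#[derive(Debug, Clone, PartialEq)]
pub enum OutputMessage {
    System,
    Assistant(String),
    Result(String),
    /// Hypothetical new variant for {"type": "error", ...} payloads.
    Error { message: String },
    /// Catch-all for message types that are not modeled.
    Other,
}

/// What the read loop should do after one message (hypothetical).
#[derive(Debug, PartialEq)]
pub enum Action {
    /// Keep reading.
    Continue,
    /// A result arrived; return it to the caller.
    Finish(String),
    /// Stop waiting and surface the failure to the caller.
    Fail(String),
    /// Keep reading, but log at warn level so it shows up in prod.
    WarnAndContinue(String),
}

pub fn handle(msg: &OutputMessage) -> Action {
    match msg {
        OutputMessage::System | OutputMessage::Assistant(_) => Action::Continue,
        OutputMessage::Result(text) => Action::Finish(text.clone()),
        // Previously this would have landed in `Other` and been skipped.
        OutputMessage::Error { message } => Action::Fail(message.clone()),
        OutputMessage::Other => {
            Action::WarnAndContinue("unrecognized message type".to_string())
        }
    }
}
```

With this shape, the caller can decide whether a `Fail` is retryable, and the "wait for a result that never comes" path is no longer reachable for error payloads.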
| Metric | Value |
|---|---|
| Total downtime | ~9 hours |
| Duration of incident | 10:38 - ~19:41 PST |
| Bots affected | Both Nodira and Mirzo |
| Root cause | OOM kill of Claude Code subprocess |
| Detection method | Owner noticed lack of responses |
System resources:
- Total RAM: 7.7 GB
- Swap: 2.0 GB (1.5 GB was in use during incident)
Memory usage at incident time:
- xtts_server (TTS): ~2 GB (24% of RAM)
- Claude Code: ~5 GB (65% of RAM) - exceeded available memory
- Combined: ~7 GB - triggers OOM with minimal headroom
Architecture:
- claudir (Rust) spawns Claude Code CLI as subprocess
- Claude Code handles AI inference
- If Claude Code dies, claudir has no AI backend but keeps running
- No health check to detect dead subprocess
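The missing liveness check can be sketched with the standard library alone. `Child::try_wait` polls without blocking, and on Unix `ExitStatusExt::signal` reports the terminating signal. The `Liveness` type and the function names are hypothetical, not claudir's actual code. Note that signal 9 (SIGKILL) is consistent with an OOM kill but does not prove one, so `dmesg` remains the confirming evidence.

```rust
use std::io;
use std::process::{Child, ExitStatus};

/// Result of one liveness poll (hypothetical type for illustration).
#[derive(Debug, PartialEq)]
pub enum Liveness {
    Running,
    /// Exited normally with this status code.
    Exited(i32),
    /// Terminated by this signal; the OOM killer sends SIGKILL (9).
    Signaled(i32),
    /// Exited, but neither a code nor a signal was reported.
    Unknown,
}

/// Pure classification step, kept separate so it is easy to test.
pub fn classify(code: Option<i32>, signal: Option<i32>) -> Liveness {
    match (code, signal) {
        (Some(c), _) => Liveness::Exited(c),
        (None, Some(s)) => Liveness::Signaled(s),
        (None, None) => Liveness::Unknown,
    }
}

#[cfg(unix)]
fn signal_of(status: &ExitStatus) -> Option<i32> {
    use std::os::unix::process::ExitStatusExt;
    status.signal()
}

#[cfg(not(unix))]
fn signal_of(_status: &ExitStatus) -> Option<i32> {
    None
}

/// Non-blocking poll of the subprocess; call this on a timer or before
/// each request is handed to the subprocess.
pub fn check(child: &mut Child) -> io::Result<Liveness> {
    Ok(match child.try_wait()? {
        None => Liveness::Running,
        Some(status) => classify(status.code(), signal_of(&status)),
    })
}

/// True when the exit looks like a SIGKILL, as an OOM kill would.
pub fn looks_like_oom_kill(liveness: &Liveness) -> bool {
    matches!(liveness, Liveness::Signaled(9))
}
```

Anything other than `Running` is the cue to log the status and restart the subprocess.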
- Why were the bots unresponsive? → The Claude Code subprocess was dead (killed by OOM).
- Why was Claude Code killed? → It consumed ~5GB of RAM, exceeding available memory.
- Why did Claude Code use so much memory? → UNKNOWN. Possibly extended thinking, large context, or a memory leak.
- Why didn't claudir detect the dead subprocess? → No health check for subprocess liveness; claudir continued running.
- Why did it take ~9 hours to recover? → No alerting; owner was AFK; no auto-restart on subprocess death.
Root cause: OOM kill + no subprocess health check + no alerting = prolonged outage
| Action | Rationale | Validation |
|---|---|---|
| Add subprocess health check | Detect when Claude Code dies | Test: kill Claude subprocess, verify claudir detects and restarts |
| Add memory monitoring | Alert before OOM threshold | Test: monitor memory, alert at 80% |
| Reduce TTS server memory OR disable when not needed | Free ~2GB headroom | Measure memory with TTS disabled |
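
The memory-monitoring action could be sketched as follows, assuming the check reads `/proc/meminfo` and treats "used" as `1 - MemAvailable / MemTotal`. The function names are hypothetical, and the 80% threshold is the value from the table above. How the alert is delivered is left out.

```rust
/// Extract a "<Key>:   <value> kB" field from /proc/meminfo-style text.
fn field_kb(meminfo: &str, key: &str) -> Option<u64> {
    meminfo.lines().find_map(|line| {
        let rest = line.strip_prefix(key)?.strip_prefix(':')?;
        rest.split_whitespace().next()?.parse().ok()
    })
}

/// Fraction of RAM in use, computed as 1 - MemAvailable / MemTotal.
/// Returns None if either field is missing or MemTotal is zero.
pub fn used_fraction(meminfo: &str) -> Option<f64> {
    let total = field_kb(meminfo, "MemTotal")? as f64;
    let available = field_kb(meminfo, "MemAvailable")? as f64;
    if total <= 0.0 {
        return None;
    }
    Some(1.0 - available / total)
}

/// True when usage is at or above `threshold` (e.g. 0.80 for 80%).
/// Unreadable input does not alert; a caller may prefer to treat it as an error.
pub fn should_alert(meminfo: &str, threshold: f64) -> bool {
    used_fraction(meminfo).map_or(false, |used| used >= threshold)
}

/// Read the live values on Linux.
pub fn read_meminfo() -> std::io::Result<String> {
    std::fs::read_to_string("/proc/meminfo")
}
```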
| Action | Rationale | Validation |
|---|---|---|
| Log CC error messages | Currently ignored - we discard `type: error` JSON from CC | Test: inject error, verify it's logged at warn/error level |
| Handle CC API errors | Transient errors should trigger retry, not silent failure | Test: simulate API error, verify graceful degradation |
| Log Claude Code subprocess exit status | Know when and why it dies | Verify exit logged |
| Auto-restart Claude Code on death | Self-heal from OOM kills | Test: kill subprocess, verify auto-restart |
| Add swap space | Delay OOM threshold | Add 2-4GB swap |
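
For the retry and auto-restart actions, a bounded exponential backoff keeps a subprocess that is repeatedly OOM-killed from being restarted in a tight loop. This is a sketch under assumptions: `backoff_delay` and `restart_with_backoff` are hypothetical names, the `start` closure stands in for whatever spawns Claude Code, and `sleep` is injected so the policy can be tested without waiting. In production `sleep` would typically be `std::thread::sleep`.

```rust
use std::time::Duration;

/// Delay before the next attempt after failed attempt `attempt` (0-based):
/// base * 2^attempt, capped at `max`.
pub fn backoff_delay(attempt: u32, base: Duration, max: Duration) -> Duration {
    // Saturate instead of overflowing for large attempt counts.
    let factor = 2u32.checked_pow(attempt).unwrap_or(u32::MAX);
    base.checked_mul(factor).unwrap_or(max).min(max)
}

/// Call `start` until it succeeds or `max_attempts` attempts have failed,
/// sleeping between attempts. `start` is always tried at least once.
/// Returns the last error if every attempt fails.
pub fn restart_with_backoff<T, E>(
    max_attempts: u32,
    base: Duration,
    max: Duration,
    mut start: impl FnMut(u32) -> Result<T, E>,
    mut sleep: impl FnMut(Duration),
) -> Result<T, E> {
    let mut attempt = 0;
    loop {
        match start(attempt) {
            Ok(value) => return Ok(value),
            // Out of attempts: give up and let the caller alert.
            Err(e) if attempt + 1 >= max_attempts => return Err(e),
            Err(_) => {
                sleep(backoff_delay(attempt, base, max));
                attempt += 1;
            }
        }
    }
}
```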
| Action | Rationale | Validation |
|---|---|---|
| Profile Claude Code memory usage | Understand why it uses 5GB | Run profiler during normal operation |
| Research Claude Code memory settings | See if memory can be limited | Check CC documentation |
- Is this normal for extended thinking?
- Is it a memory leak?
- Does context size affect memory usage?
- Can we limit it with env vars or flags?
Action: Research Claude Code memory behavior, profile during operation.
- TTS uses ~2GB
- Without TTS: 5.7GB free for Claude Code
- With TTS: 3.7GB free - may still OOM
Decision needed: Is TTS worth the memory cost?
- Each bot needs Claude Code (~2-5GB)
- Two bots = potentially 10GB
- System has 7.7GB
Decision needed: Run one bot at a time, or upgrade hardware?
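The arithmetic behind both decisions can be written down as a small budget calculation. This sketch assumes about 2GB of baseline usage by the OS and other processes, a figure inferred from the "5.7GB free without TTS" and "3.7GB free with TTS" numbers above rather than measured. `Budget` and its methods are hypothetical, and swap is ignored.

```rust
/// Memory budget in GB (hypothetical type for illustration).
pub struct Budget {
    pub total_gb: f64,
    /// Assumed OS and other-process usage; inferred, not measured.
    pub baseline_gb: f64,
    pub tts_gb: f64,
}

impl Budget {
    /// RAM left for Claude Code subprocesses.
    pub fn free_for_bots(&self, tts_enabled: bool) -> f64 {
        let tts = if tts_enabled { self.tts_gb } else { 0.0 };
        self.total_gb - self.baseline_gb - tts
    }

    /// Remaining headroom with `bots` subprocesses at `per_bot_gb` each.
    /// A negative result means the configuration does not fit in RAM.
    pub fn headroom(&self, tts_enabled: bool, bots: u32, per_bot_gb: f64) -> f64 {
        self.free_for_bots(tts_enabled) - bots as f64 * per_bot_gb
    }
}

/// The machine from this incident, under the baseline assumption above.
pub fn incident_budget() -> Budget {
    Budget { total_gb: 7.7, baseline_gb: 2.0, tts_gb: 2.0 }
}
```

Under these assumptions, one bot at its 5GB peak overshoots by about 1.3GB with TTS enabled and leaves about 0.7GB without it, while two bots at peak overshoot by about 4.3GB even with TTS disabled.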
- No subprocess health monitoring: claudir didn't know Claude Code was dead
- Insufficient memory: 7.7GB is too tight for two bots + TTS
- No OOM alerting: Linux kills processes silently unless monitored
- No auto-recovery: Subprocess death = permanent failure until manual restart
- Logs preserved: Could reconstruct timeline from application logs
- dmesg available: Found the actual root cause (OOM)
- Quick diagnosis once investigated: Root cause identified within minutes of looking
- Not during peak usage: Fewer users affected
- Owner eventually checked: ~9 hours is bad, but it could have been longer
- Nodira logs: `data/prod/nodira/logs/claudir.log`
- Mirzo logs: `data/prod/mirzo/logs/claudir.log`
- OOM evidence: `dmesg` output showing killed processes
- System memory: `free -h` shows 7.7GB total