Postmortem: Nodira & Dilya Multi-Hour Downtime

Date: 2026-02-05
Authors: Mirzo (CTO agent)
Status: Draft v5 (addressed Dilya's 5 review items)
Reviewers: Dilya (reviewed v4, provided 5 fixes)
Last Updated: 2026-02-05 17:07 UTC

Summary

Two bots experienced extended downtime on 2026-02-05:

Nodira: SIGTERM at 07:40:15 UTC, down 3 hours 48 minutes until Mirzo restarted it at 11:28 UTC (owner noticed at 11:27 UTC)
Dilya: Last log entry ~08:15 UTC, down 7 hours 24 minutes until discovered at 15:39 UTC, while this postmortem was being written

Despite Mirzo running heartbeat checks every 15 minutes, neither outage was detected because:

The heartbeat checks used pgrep patterns that could match wrong processes
There was no automated cross-bot health monitoring
The built-in heartbeat mechanism only checks during active processing, not idle state

Timeline

Nodira

Time (UTC)	Event
07:39:37	Nodira received bot message from Mirzo, debouncer fired
07:39:43	Claude Code returned `Sleep { sleep_ms: 30000 }`
07:40:15	SIGTERM received - harness began terminating
07:40:15	Claude Code child (PID 4797) reaped with SIGTERM
07:40:15	Context saved (7454 messages), process terminated
07:40 - 11:27	~4 hours of undetected downtime
11:27:22	Owner sent "fix nodira" to Mirzo
11:28:21	Mirzo restarted Nodira

Dilya

Time (UTC)	Event
~08:10	Last activity (processing messages about heartbeat fix)
~08:15	Telegram API timeout error in logs - last log entry
08:15 - 15:39	~7.5 hours of undetected downtime
15:38:53	Owner asked "is dilya alive" (while reviewing Nodira postmortem)
15:39:01	Nodira reported "Dilya's last heartbeat: 7h 29m ago"
15:39:41	Mirzo restarted Dilya

Impact

Metric	Nodira	Dilya
Total downtime	3 hours 48 minutes	7 hours 24 minutes
Detection method	Owner noticed	Discovered during postmortem review
Auto-recovery	None	None

Combined impact:

2 of 3 bots down simultaneously (only Mirzo running)
No automated alerting triggered
User-facing impact unknown (Dilya is still learning, Nodira handles main traffic)

Root Cause Analysis

5 Whys - Nodira

Why was Nodira down for 4 hours? → Mirzo's heartbeat checks didn't detect the outage
Why didn't heartbeat checks detect it? → The pgrep -af "claudir.*nodira" check kept returning a match after Nodira's harness died, producing a false positive
Why was the pattern matching wrong processes? → With -f, pgrep tests the regex against the full command line, so any process with "claudir" followed by "nodira" anywhere in its arguments matches. That includes Mirzo's Claude subprocess, whose system prompt mentions Nodira
Why didn't the built-in heartbeat mechanism detect it? → The is_heartbeat_stale() function only checks during active processing (is_processing=true). When idle, it returns false without checking timeout.
Why did Nodira receive SIGTERM? → UNKNOWN - No evidence of OOM or crash. Likely manual kill (owner testing?) but not confirmed.

5 Whys - Dilya

Why was Dilya down for 7.5 hours? → Same faulty heartbeat checks that missed Nodira also missed Dilya
Why did Dilya go down? → Logs show Telegram API timeout error at 08:15 UTC, then no further entries - harness likely crashed or hung
Why wasn't Dilya's crash detected? → No cross-bot monitoring; Mirzo only checked own processes; Nodira doesn't monitor siblings
Why did it take until 15:39 to discover? → Only discovered when owner asked "is dilya alive" while reviewing Nodira postmortem
Why wasn't there an automated alert? → No alerting system exists - all detection relies on manual observation

Technical Deep Dive

Finding 1: Heartbeat check has an idle blind spot

From src/chatbot/engine/mod.rs lines 503-507:

```rust
pub fn is_heartbeat_stale(&self) -> bool {
    // Only check timeout if we're actively processing
    if !self.is_processing.load(Ordering::SeqCst) {
        return false;  // ← IDLE BOTS ARE NEVER CHECKED
    }
    // ...
}
```

This means a dead/crashed Claude Code during idle time will never trigger the heartbeat timeout. The 30-minute timeout only applies during active message processing.
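One way to close the idle blind spot is to apply a timeout in both states instead of returning early when idle. The following is a minimal sketch, not the engine's actual code: the `Heartbeat` struct, its fields, the `beat` method, and the explicit `now_secs` parameter are all assumptions made for illustration. It also assumes something records a sign of life while idle, for example a periodic probe that the Claude Code child is still running.

```rust
use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};

/// Hypothetical stand-in for the engine's heartbeat state (names are assumptions).
pub struct Heartbeat {
    is_processing: AtomicBool,
    /// Unix time in seconds of the last sign of life from the child process.
    last_beat_secs: AtomicU64,
}

impl Heartbeat {
    pub fn new(now_secs: u64) -> Self {
        Self {
            is_processing: AtomicBool::new(false),
            last_beat_secs: AtomicU64::new(now_secs),
        }
    }

    pub fn set_processing(&self, processing: bool) {
        self.is_processing.store(processing, Ordering::SeqCst);
    }

    /// Record a sign of life: output while processing, or a successful idle probe.
    pub fn beat(&self, now_secs: u64) {
        self.last_beat_secs.store(now_secs, Ordering::SeqCst);
    }

    /// Stale when no beat arrived within the timeout for the current state.
    /// Unlike the original, the idle state is checked too.
    pub fn is_heartbeat_stale(
        &self,
        now_secs: u64,
        processing_timeout_secs: u64,
        idle_timeout_secs: u64,
    ) -> bool {
        let timeout = if self.is_processing.load(Ordering::SeqCst) {
            processing_timeout_secs
        } else {
            idle_timeout_secs
        };
        let last = self.last_beat_secs.load(Ordering::SeqCst);
        now_secs.saturating_sub(last) > timeout
    }
}
```

Passing the clock in as `now_secs` keeps the check deterministic in tests. With both timeouts set to 1800 seconds, an idle bot that stops beating would be flagged after 30 minutes.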

Finding 2: pgrep pattern was incorrect

Mirzo's heartbeat checks used:

```shell
pgrep -af "claudir.*nodira"
```

This pattern matches:

✅ Intended match, Nodira's harness: claudir data/prod/nodira/bot.json
❌ Unintended match, Mirzo's Claude subprocess: claude ... --system-prompt "...Nodira..." (the prompt text contains "nodira")

When Nodira's harness died, pgrep still returned results (Mirzo's Claude subprocess), causing Mirzo to report "both healthy".
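The false positive can be reproduced without pgrep by comparing a loose full-command-line match against a check on the argument vector itself. This is a sketch under assumptions: `loose_match` only approximates the regex `claudir.*nodira`, `is_harness` is a hypothetical helper, and the subprocess command line in the usage note is invented for illustration.

```rust
/// Approximates `pgrep -f "claudir.*nodira"`: "claudir" followed anywhere
/// later in the full command line by "nodira".
pub fn loose_match(cmdline: &str) -> bool {
    match cmdline.find("claudir") {
        Some(i) => cmdline[i + "claudir".len()..].contains("nodira"),
        None => false,
    }
}

/// Stricter check on argv (hypothetical helper): the executable's basename
/// must be `claudir` and the first argument must equal the config path.
pub fn is_harness(argv: &[&str], config_path: &str) -> bool {
    let exe_ok = argv
        .first()
        .and_then(|exe| exe.rsplit('/').next())
        .map_or(false, |name| name == "claudir");
    exe_ok && argv.get(1).map_or(false, |arg| *arg == config_path)
}
```

With a harness argv of `["claudir", "data/prod/nodira/bot.json"]` and an invented subprocess argv of `["claude", "--system-prompt", "You coordinate the claudir bots, including nodira."]`, `loose_match` accepts both joined command lines while `is_harness` accepts only the first. Any check that scans the whole command line stays exposed to prompt text, which is why comparing individual arguments is the more conservative choice.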

Finding 3: No cross-bot health monitoring

Each bot only monitors itself. There's no mechanism for:

Mirzo to check if Nodira's Telegram bot is responding
Owner to receive automated alerts when a bot goes down
Bots to ping each other for liveness

Contributing Factors

Pattern matching error - pgrep regex was too broad, matching unintended processes
Idle-only blind spot - Built-in heartbeat check disabled during idle
No automated alerting - Owner must manually check or notice missing responses
Single point of failure - Each bot runs independently with no cross-monitoring
Unknown SIGTERM source - We don't know why Nodira was killed, making prevention difficult
No sibling monitoring - Bots don't check each other's health

Detection Gap Analysis

If we had proper monitoring, when would we have detected these outages?

Proposed Fix	Nodira Detection	Dilya Detection
Correct pgrep pattern	07:45 (first heartbeat check)	08:30 (next check after crash)
Cross-bot monitoring	07:45 (Dilya/Mirzo would see Nodira down)	08:30 (Nodira/Mirzo would see Dilya down)
Idle liveness check	08:10 (30min idle timeout)	08:45 (30min idle timeout)
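systemd auto-restart	Automatic restart within seconds (needs Restart=always, since SIGTERM counts as a clean exit)	Automatic restart within seconds if the harness exited (a hang would need a watchdog)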

Conclusion: Correct pgrep pattern alone would have reduced Nodira's downtime from 4 hours to ~5 minutes. Cross-bot monitoring would provide redundancy.
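The detection times and downtime figures above can be re-derived with a few lines of arithmetic. This sketch assumes the 15-minute checks fall on quarter-hour boundaries, which is consistent with the 07:45 and 08:30 entries in the table; the helper names are illustrative only.

```rust
/// Seconds since midnight UTC for an HH:MM:SS time.
pub const fn hms(h: u32, m: u32, s: u32) -> u32 {
    h * 3600 + m * 60 + s
}

/// First scheduled check strictly after `t`, assuming checks run on
/// multiples of `interval_secs` since midnight.
pub fn next_check_after(t: u32, interval_secs: u32) -> u32 {
    (t / interval_secs + 1) * interval_secs
}

/// Format a duration in seconds as hours and whole minutes, e.g. "3h 48m".
pub fn fmt_hm(secs: u32) -> String {
    format!("{}h {}m", secs / 3600, (secs % 3600) / 60)
}
```

For Nodira, `next_check_after(hms(7, 40, 15), 900)` gives 07:45:00, a detection delay of 285 seconds, and `hms(11, 28, 21) - hms(7, 40, 15)` formats as 3h 48m. For Dilya, 08:15 to 15:39 formats as 7h 24m. The worst-case delay for any polling check is one full interval, so a 15-minute schedule can take up to 15 minutes to notice an outage.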

Action Items

P0 - Critical (prevent recurrence)

Action	Owner	Due Date	Status
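Use robust process detection: `pgrep -f "claudir data/prod/nodira/bot\.json"` (pattern pinned to the config path; pgrep always treats the pattern as a regex, so the dot is escaped, and -f still scans the full command line)	Mirzo	2026-02-06	Open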
Add PID file: harness writes `data/prod/{bot}/harness.pid` on startup, health check verifies PID exists and process alive via `kill -0`	Mirzo	2026-02-06	Open
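The PID file item could look roughly like the sketch below. It is a sketch under assumptions rather than the harness's implementation: the function names are hypothetical, and the liveness probe checks for a `/proc/<pid>` directory, a Linux-only stand-in for `kill -0` that avoids external crates.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Write the current process ID to `pid_path` (called by the harness on startup).
pub fn write_pid_file(pid_path: &Path) -> io::Result<u32> {
    let pid = std::process::id();
    fs::write(pid_path, format!("{pid}\n"))?;
    Ok(pid)
}

/// Read a PID from `pid_path`; `None` if the file is missing or malformed.
pub fn read_pid_file(pid_path: &Path) -> Option<u32> {
    fs::read_to_string(pid_path).ok()?.trim().parse().ok()
}

/// Linux-only probe (assumption): a running process has a `/proc/<pid>` directory.
pub fn pid_is_alive(pid: u32) -> bool {
    Path::new("/proc").join(pid.to_string()).is_dir()
}

/// Health check: the PID file must exist, parse, and point at a live process.
pub fn harness_is_alive(pid_path: &Path) -> bool {
    read_pid_file(pid_path).map_or(false, pid_is_alive)
}
```

A stale file left behind by a crash fails the check because its PID no longer has a `/proc` entry. The usual caveat applies to this probe and to `kill -0` alike: the kernel can reuse a PID, so a stricter check would also compare the process's command line against the expected harness invocation.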

P1 - Important (improve detection)

Action	Owner	Due Date	Status
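Add cross-bot health monitoring: Mirzo monitors Nodira AND Dilya. Caveat: the Telegram API getMe call only validates the bot token and succeeds even when the harness is down, so the check needs a signal the harness itself produces (for example a recent heartbeat timestamp)	Mirzo	2026-02-10	Open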
Add idle liveness check for ALL bots - verify CC process alive even when not processing	Mirzo	2026-02-10	Open
Add automated owner alert when ANY bot (Nodira, Dilya, or Mirzo) is detected down	Mirzo	2026-02-10	Open
Investigate Dilya crash: why did Telegram API timeout cause harness death?	Mirzo	2026-02-10	Open
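For the owner alert, one design question is how to avoid sending a new message on every failed check. A possible approach, sketched here as an assumption and not as a decided design, is to alert only when a bot's state changes. `AlertTracker` and `Alert` are hypothetical names; delivering the message (for example as a Telegram DM) is left out.

```rust
use std::collections::HashMap;

/// Alert emitted when a bot's health state changes.
#[derive(Debug, PartialEq, Eq)]
pub enum Alert {
    Down(String),
    Recovered(String),
}

/// Remembers which bots were down at the previous check.
#[derive(Default)]
pub struct AlertTracker {
    down: HashMap<String, bool>,
}

impl AlertTracker {
    /// Feed one health-check result; returns an alert only on a state change.
    /// A bot seen for the first time is treated as previously healthy.
    pub fn observe(&mut self, bot: &str, healthy: bool) -> Option<Alert> {
        let was_down = self.down.insert(bot.to_string(), !healthy).unwrap_or(false);
        match (was_down, healthy) {
            (false, false) => Some(Alert::Down(bot.to_string())),
            (true, true) => Some(Alert::Recovered(bot.to_string())),
            _ => None,
        }
    }
}
```

With this shape, a bot that stays down for several hours produces one `Down` alert at the first failed check and one `Recovered` alert after restart, instead of a message every 15 minutes.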

P2 - Nice to have

Action	Owner	Due Date	Status
Investigate why Nodira received SIGTERM - check owner activity, scripts, etc.	Mirzo	2026-02-15	Open
Add process supervisor (systemd) for automatic restart on any death - covers ALL 3 bots	Owner	TBD	Open

Lessons Learned

What went wrong

Pattern matching was fragile - Relied on regex that could match unintended processes
Heartbeat check has design flaw - Only active during processing, idle bots invisible
No cross-validation - Each bot trusted its own check without external verification
Silent failure - Combined 11 hours 12 minutes of downtime (Nodira 3h 48m + Dilya 7h 24m) with no alert to anyone
No sibling awareness - Dilya's downtime was only discovered because owner asked while reviewing Nodira postmortem
Telegram API timeout unhandled - Dilya's harness apparently crashed or hung after a Telegram API timeout (root cause still under investigation), with no auto-recovery

What went well

Logs preserved - Full timeline reconstructable from claudir.log
Quick recovery - Once notified, Mirzo restarted Nodira within 1 minute
Context saved - Nodira saved 7454 messages before terminating, no data loss

Where we got lucky

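Not during peak hours - 07:40-11:27 UTC is 12:40-16:27 local time in Uzbekistan (UTC+5), not early morning, so the off-peak assumption should be confirmed against traffic data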
Owner checked - Could have gone longer without detection
Clean shutdown - SIGTERM allowed graceful context save
The irony - While writing this postmortem about Nodira's downtime, we discovered that Dilya had been down for 7+ hours: Mirzo asked Dilya to review the postmortem, but Dilya could not respond. This illustrates the need for cross-bot health monitoring, and catching it during the postmortem was itself a matter of luck.

Open Questions

Q1: Why was SIGTERM sent?

Evidence: Clean SIGTERM (signal 15), not OOM or crash

Possibilities:

Owner manually killed Nodira (testing?)
Script or cron job
System restart

Action: Ask owner if they killed Nodira around 07:40 UTC

Q2: Should bots monitor each other? (ESCALATE TO OWNER)

This is an architectural decision requiring owner input:

Option A: Mirzo directly queries Nodira's Telegram API for liveness
Option B: Bots share a "heartbeat" table in shared SQLite
Option C: External monitoring (UptimeRobot, etc.)
Option D: systemd/supervisor for automatic restart

Mirzo's Recommendation: Option D (systemd) + Option B (heartbeat table)

Reasoning:

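systemd restarts the process automatically when it exits, with no human intervention. With Restart=always this would have covered Nodira's SIGTERM (Restart=on-failure treats SIGTERM as a clean exit and would not restart), and it would have covered Dilya if the harness exited rather than hung.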
Heartbeat table provides visibility into bot health for debugging and alerting, even if systemd handles restarts.
Options A/C add complexity without solving the core problem (dead process stays dead).
Defense in depth: systemd handles crashes, heartbeat table catches hangs/freezes.

Owner - Please decide on monitoring architecture. This affects all 3 bots.
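To make Option B concrete, here is a minimal sketch of the staleness query a shared heartbeat table would support. It uses an in-memory map as a stand-in for the shared SQLite table, so the type and method names are assumptions and no real schema is implied.

```rust
use std::collections::HashMap;

/// In-memory stand-in for the shared heartbeat table (Option B).
#[derive(Default)]
pub struct HeartbeatTable {
    /// Bot name mapped to the time of its last heartbeat, in seconds.
    last_seen: HashMap<String, u64>,
}

impl HeartbeatTable {
    /// Each bot records its own heartbeat.
    pub fn record(&mut self, bot: &str, now_secs: u64) {
        self.last_seen.insert(bot.to_string(), now_secs);
    }

    /// Bots whose last heartbeat is older than `max_age_secs`,
    /// sorted by name and paired with the heartbeat age in seconds.
    pub fn stale(&self, now_secs: u64, max_age_secs: u64) -> Vec<(String, u64)> {
        let mut out: Vec<(String, u64)> = self
            .last_seen
            .iter()
            .map(|(bot, &seen)| (bot.clone(), now_secs.saturating_sub(seen)))
            .filter(|(_, age)| *age > max_age_secs)
            .collect();
        out.sort();
        out
    }
}
```

Using seconds since midnight for the incident day, a query at 15:39:01 with Dilya's last heartbeat at 08:10:01 reports an age of 26940 seconds, which is the 7h 29m that Nodira reported. A bot that has never written a row does not appear in the result, so a real implementation would also need the list of expected bots.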

Remediation Test Plan

How we'll verify the fixes work:

P0 Fixes (Process Detection)

Test	Expected Result	How to Verify
Kill Nodira harness	pgrep returns empty	`pkill -f "claudir data/prod/nodira/bot.json" && sleep 5 && pgrep -f "claudir data/prod/nodira/bot.json"` should print nothing (the pause gives the harness time to exit after SIGTERM)
Kill Nodira harness while Mirzo's CC running	pgrep still returns empty	Same as above - Mirzo's Claude subprocess should NOT match
PID file created on startup	File exists with valid PID	`cat data/prod/nodira/harness.pid && kill -0 $(cat data/prod/nodira/harness.pid)`
PID file stale after crash	`kill -0` fails	Kill harness without cleanup, verify PID check fails

P1 Fixes (Cross-Bot Monitoring)

Test	Expected Result	How to Verify
Cross-bot detection catches dead bot	Alert within 5 minutes	Kill Nodira, wait for Mirzo's next health check, verify alert sent
Idle liveness check works	Detects frozen CC	Simulate CC freeze (SIGSTOP), verify timeout triggers
Owner alert sent	DM received	Verify DM arrives when bot detected down

Success Criteria

Nodira downtime detected within 5 minutes (vs 3h 48m), which requires health checks at least every 5 minutes rather than the current 15
Dilya downtime detected within 5 minutes (vs 7h 24m)
No false positives from pgrep pattern for 24 hours
Owner receives automated alert on any bot death

Supporting Information

Nodira logs: data/prod/nodira/logs/claudir.log
Mirzo logs: data/prod/mirzo/logs/claudir.log
Heartbeat code: src/chatbot/engine/mod.rs lines 503-525
Process check code: src/chatbot/claude_code.rs lines 294-307
Previous postmortem: data/prod/mirzo/postmortem-2026-01-22.md

References

Postmortem writing guide
Architecture docs
Subagent analysis: heartbeat mechanism investigation (2026-02-05)

nodir-t/2026-02-05-nodira-dilya-downtime.md
