@nodir-t
Created February 5, 2026 17:46
Postmortem: Nodira & Dilya Multi-Hour Downtime (2026-02-05)

Date: 2026-02-05
Authors: Mirzo (CTO agent)
Status: Draft v5 (addressed Dilya's 5 review items)
Reviewers: Dilya (reviewed v4, provided 5 fixes)
Last Updated: 2026-02-05 17:07 UTC


Summary

Two bots experienced extended downtime on 2026-02-05:

  1. Nodira: SIGTERM at 07:40:15 UTC, down 3 hours 48 minutes; the owner noticed at 11:27 UTC and Mirzo restarted it at 11:28 UTC
  2. Dilya: Last activity ~08:15 UTC, down 7 hours 24 minutes until discovered at 15:39 UTC (while writing this postmortem!)

Despite Mirzo running heartbeat checks every 15 minutes, neither outage was detected because:

  1. The heartbeat checks used pgrep patterns that could match wrong processes
  2. There was no automated cross-bot health monitoring
  3. The built-in heartbeat mechanism only checks during active processing, not idle state

Timeline

Nodira

| Time (UTC) | Event |
| --- | --- |
| 07:39:37 | Nodira received bot message from Mirzo, debouncer fired |
| 07:39:43 | Claude Code returned Sleep { sleep_ms: 30000 } |
| 07:40:15 | SIGTERM received - harness began terminating |
| 07:40:15 | Claude Code child (PID 4797) reaped with SIGTERM |
| 07:40:15 | Context saved (7454 messages), process terminated |
| 07:40 - 11:27 | ~4 hours of undetected downtime |
| 11:27:22 | Owner sent "fix nodira" to Mirzo |
| 11:28:21 | Mirzo restarted Nodira |

Dilya

| Time (UTC) | Event |
| --- | --- |
| ~08:10 | Last activity (processing messages about heartbeat fix) |
| ~08:15 | Telegram API timeout error in logs - last log entry |
| 08:15 - 15:39 | ~7.5 hours of undetected downtime |
| 15:38:53 | Owner asked "is dilya alive" (while reviewing Nodira postmortem) |
| 15:39:01 | Nodira reported "Dilya's last heartbeat: 7h 29m ago" |
| 15:39:41 | Mirzo restarted Dilya |

Impact

| Metric | Nodira | Dilya |
| --- | --- | --- |
| Total downtime | 3 hours 48 minutes | 7 hours 24 minutes |
| Detection method | Owner noticed | Discovered during postmortem review |
| Auto-recovery | None | None |

Combined impact:

  • 2 of 3 bots down simultaneously (only Mirzo running)
  • No automated alerting triggered
  • User-facing impact unknown (Dilya is still learning, Nodira handles main traffic)

Root Cause Analysis

5 Whys - Nodira

  1. Why was Nodira down for 4 hours? → Mirzo's heartbeat checks didn't detect the outage

  2. Why didn't heartbeat checks detect it? → The pgrep -af "claudir.*nodira" pattern also matched Mirzo's own Claude subprocess (which has "nodira" in its system prompt), producing a false positive

  3. Why was the pattern matching wrong processes? → pgrep regex matched any process with "claudir" and "nodira" in its arguments, including Mirzo's Claude subprocess which contains Nodira's name in its system prompt

  4. Why didn't the built-in heartbeat mechanism detect it? → The is_heartbeat_stale() function only checks during active processing (is_processing=true). When idle, it returns false without checking timeout.

  5. Why did Nodira receive SIGTERM? → UNKNOWN - No evidence of OOM or crash. Likely a manual kill (owner testing?), but not confirmed.

5 Whys - Dilya

  1. Why was Dilya down for 7.5 hours? → Same faulty heartbeat checks that missed Nodira also missed Dilya

  2. Why did Dilya go down? → Logs show Telegram API timeout error at 08:15 UTC, then no further entries - harness likely crashed or hung

  3. Why wasn't Dilya's crash detected? → No cross-bot monitoring; Mirzo only checked own processes; Nodira doesn't monitor siblings

  4. Why did it take until 15:39 to discover? → Only discovered when owner asked "is dilya alive" while reviewing Nodira postmortem

  5. Why wasn't there an automated alert? → No alerting system exists - all detection relies on manual observation

Technical Deep Dive

Finding 1: Heartbeat check has an idle blind spot

From src/chatbot/engine/mod.rs lines 503-507:

pub fn is_heartbeat_stale(&self) -> bool {
    // Only check timeout if we're actively processing
    if !self.is_processing.load(Ordering::SeqCst) {
        return false;  // ← IDLE BOTS ARE NEVER CHECKED
    }
    // ...
}

This means a dead/crashed Claude Code during idle time will never trigger the heartbeat timeout. The 30-minute timeout only applies during active message processing.
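One way to close the idle blind spot, sketched under assumptions: record a timestamp on every sign of life from the child and compare it against a timeout in both states. Only `is_processing` and the 30-minute timeout come from the code above. The `Heartbeat` struct, `last_beat_secs`, `beat`, `set_processing`, and the idle timeout value are hypothetical. The sketch also assumes something (for example a periodic ping) calls `beat` while the bot is idle.

```rust
use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};

// 30-minute timeout from the postmortem; the idle value is an assumption.
const PROCESSING_TIMEOUT_SECS: u64 = 30 * 60;
const IDLE_TIMEOUT_SECS: u64 = 30 * 60;

/// Hypothetical stand-in for the engine state. Only `is_processing`
/// mirrors the real code; the other names are illustrative.
pub struct Heartbeat {
    is_processing: AtomicBool,
    /// Unix time (seconds) of the last sign of life from the child.
    last_beat_secs: AtomicU64,
}

impl Heartbeat {
    pub fn new(now_secs: u64) -> Self {
        Self {
            is_processing: AtomicBool::new(false),
            last_beat_secs: AtomicU64::new(now_secs),
        }
    }

    /// Record a sign of life (assumed to be called while idle as well).
    pub fn beat(&self, now_secs: u64) {
        self.last_beat_secs.store(now_secs, Ordering::SeqCst);
    }

    pub fn set_processing(&self, processing: bool) {
        self.is_processing.store(processing, Ordering::SeqCst);
    }

    /// Unlike the original, this never returns early for idle bots:
    /// it picks a timeout per state and always compares the age.
    pub fn is_heartbeat_stale(&self, now_secs: u64) -> bool {
        let timeout = if self.is_processing.load(Ordering::SeqCst) {
            PROCESSING_TIMEOUT_SECS
        } else {
            IDLE_TIMEOUT_SECS
        };
        let age = now_secs.saturating_sub(self.last_beat_secs.load(Ordering::SeqCst));
        age > timeout
    }
}

fn main() {
    let hb = Heartbeat::new(0);
    println!("idle, 29 min since last beat: stale = {}", hb.is_heartbeat_stale(29 * 60));
    println!("idle, 31 min since last beat: stale = {}", hb.is_heartbeat_stale(31 * 60));
}
```

Passing the clock in as `now_secs` keeps the check deterministic in tests. A real implementation would probably read the system clock inside the method.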

Finding 2: pgrep pattern was incorrect

Mirzo's heartbeat checks used:

pgrep -af "claudir.*nodira"

This pattern matches:

  • ✅ Nodira's harness: claudir data/prod/nodira/bot.json
  • ❌ Mirzo's Claude subprocess: claude ... --system-prompt "...Nodira..." (contains "nodira" in prompt text)

When Nodira's harness died, pgrep still returned results (Mirzo's Claude subprocess), causing Mirzo to report "both healthy".
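The false positive can be reproduced without pgrep. The sketch below emulates the unanchored `claudir.*nodira` search over a full command line and contrasts it with a check on the argument vector. The Claude command line and its prompt text are made up for illustration, and `loose_match` and `strict_match` are hypothetical helpers, not part of the harness.

```rust
/// Emulates the unanchored regex `claudir.*nodira` the way `pgrep -f`
/// applies it: "claudir" anywhere in the command line, then "nodira"
/// anywhere after it.
fn loose_match(cmdline: &str) -> bool {
    match cmdline.find("claudir") {
        Some(i) => cmdline[i + "claudir".len()..].contains("nodira"),
        None => false,
    }
}

/// Stricter check on the argument vector: the executable's basename must
/// be `claudir` and its first argument must equal the config path.
fn strict_match(argv: &[&str], config_path: &str) -> bool {
    let exe_is_claudir = argv
        .first()
        .map(|exe| exe.rsplit('/').next() == Some("claudir"))
        .unwrap_or(false);
    exe_is_claudir && argv.get(1) == Some(&config_path)
}

fn main() {
    let config = "data/prod/nodira/bot.json";
    let harness = ["claudir", config];
    // Hypothetical prompt text: the real system prompt is not in the postmortem.
    let mirzo_cc = [
        "claude",
        "--system-prompt",
        "You are Mirzo, a claudir bot. Siblings: nodira, dilya.",
    ];

    for argv in [&harness[..], &mirzo_cc[..]] {
        println!(
            "{:<8} loose = {:<5} strict = {}",
            argv[0],
            loose_match(&argv.join(" ")),
            strict_match(argv, config)
        );
    }
}
```

The loose check accepts both processes, so it keeps reporting "healthy" after the harness dies. The strict check accepts only the harness, even when another process quotes the config path inside an argument.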

Finding 3: No cross-bot health monitoring

Each bot only monitors itself. There's no mechanism for:

  • Mirzo to check if Nodira's Telegram bot is responding
  • Owner to receive automated alerts when a bot goes down
  • Bots to ping each other for liveness

Contributing Factors

  1. Pattern matching error - pgrep regex was too broad, matching unintended processes
  2. Idle-only blind spot - Built-in heartbeat check disabled during idle
  3. No automated alerting - Owner must manually check or notice missing responses
  4. Single point of failure - Each bot runs independently with no cross-monitoring
  5. Unknown SIGTERM source - We don't know why Nodira was killed, making prevention difficult
  6. No sibling monitoring - Bots don't check each other's health

Detection Gap Analysis

If we had proper monitoring, when would we have detected these outages?

| Proposed Fix | Nodira Detection | Dilya Detection |
| --- | --- | --- |
| Correct pgrep pattern | 07:45 (first heartbeat check) | 08:30 (next check after crash) |
| Cross-bot monitoring | 07:45 (Dilya/Mirzo would see Nodira down) | 08:30 (Nodira/Mirzo would see Dilya down) |
| Idle liveness check | 08:10 (30min idle timeout) | 08:45 (30min idle timeout) |
| systemd auto-restart | Immediate restart, no downtime | Immediate restart, no downtime |

Conclusion: A correct pgrep pattern alone would have cut Nodira's downtime from 3 hours 48 minutes to roughly 5 minutes (SIGTERM at 07:40:15, detection at the 07:45 check, restart within about a minute). Cross-bot monitoring would add redundancy.
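The detection times in this analysis follow from simple clock arithmetic, sketched here. It assumes the 15-minute checks are aligned to the quarter hour and that a check landing exactly on the outage time does not yet see it. The helper names are illustrative.

```rust
/// Seconds since midnight UTC.
fn hms(h: u32, m: u32, s: u32) -> u32 {
    h * 3600 + m * 60 + s
}

/// First scheduled check strictly after `outage`, for checks that run
/// every `interval` seconds aligned to midnight (an assumption).
fn next_check_after(outage: u32, interval: u32) -> u32 {
    (outage / interval + 1) * interval
}

/// Format seconds since midnight as HH:MM (seconds are truncated).
fn fmt_hm(t: u32) -> String {
    format!("{:02}:{:02}", t / 3600, (t % 3600) / 60)
}

fn main() {
    let check_interval = 15 * 60; // heartbeat check every 15 minutes
    let idle_timeout = 30 * 60; // 30-minute idle timeout

    for (bot, outage) in [("Nodira", hms(7, 40, 15)), ("Dilya", hms(8, 15, 0))] {
        let detected = next_check_after(outage, check_interval);
        println!(
            "{}: next check at {}, idle timeout at {}, detection delay {}s",
            bot,
            fmt_hm(detected),
            fmt_hm(outage + idle_timeout),
            detected - outage
        );
    }
}
```

For Nodira the delay works out to 285 seconds (07:40:15 to 07:45:00), which is where the "about 5 minutes" figure comes from.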

Action Items

P0 - Critical (prevent recurrence)

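| Action | Owner | Due Date | Status |
| --- | --- | --- | --- |
| Use robust process detection: pgrep -f "claudir data/prod/nodira/bot.json" (matches on the full config path; pgrep still treats the pattern as an extended regex, so the dot in bot.json matches any character unless escaped) | Mirzo | 2026-02-06 | Open |
| Add PID file: harness writes data/prod/{bot}/harness.pid on startup, health check verifies PID exists and process alive via kill -0 | Mirzo | 2026-02-06 | Open |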
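A minimal sketch of the PID file idea, assuming Linux. The Rust standard library has no wrapper for kill(2), so the liveness check looks for `/proc/<pid>` as a stand-in for `kill -0`. `write_pid_file`, `read_pid`, and `pid_file_alive` are hypothetical names, and the demo writes to the temp directory instead of data/prod/{bot}/harness.pid.

```rust
use std::fs;
use std::io;
use std::path::Path;
use std::process;

/// Write the current process ID to `path` (what the harness would do on startup).
fn write_pid_file(path: &Path) -> io::Result<()> {
    fs::write(path, format!("{}\n", process::id()))
}

/// Parse the PID stored in `path`. Missing, unreadable, malformed,
/// or zero values all yield `None`.
fn read_pid(path: &Path) -> Option<u32> {
    fs::read_to_string(path)
        .ok()?
        .trim()
        .parse::<u32>()
        .ok()
        .filter(|&pid| pid != 0)
}

/// True if the PID file names a process that currently exists.
/// Linux-only stand-in for `kill -0 <pid>`: checks that /proc/<pid> exists.
fn pid_file_alive(path: &Path) -> bool {
    match read_pid(path) {
        Some(pid) => Path::new(&format!("/proc/{}", pid)).exists(),
        None => false,
    }
}

fn main() -> io::Result<()> {
    // Demo path only; the action item proposes data/prod/{bot}/harness.pid.
    let path = std::env::temp_dir().join("harness-demo.pid");
    write_pid_file(&path)?;
    println!("alive after startup: {}", pid_file_alive(&path));
    fs::remove_file(&path)?;
    println!("alive with no PID file: {}", pid_file_alive(&path));
    Ok(())
}
```

A PID can be reused by an unrelated process after a crash. A production check would probably also confirm that the process behind the PID is the harness, for example by combining this with the config-path match.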

P1 - Important (improve detection)

| Action | Owner | Due Date | Status |
| --- | --- | --- | --- |
| Add cross-bot health monitoring: Mirzo monitors Nodira AND Dilya via Telegram API getMe check | Mirzo | 2026-02-10 | Open |
| Add idle liveness check for ALL bots - verify CC process alive even when not processing | Mirzo | 2026-02-10 | Open |
| Add automated owner alert when ANY bot (Nodira, Dilya, or Mirzo) is detected down | Mirzo | 2026-02-10 | Open |
| Investigate Dilya crash: why did Telegram API timeout cause harness death? | Mirzo | 2026-02-10 | Open |

P2 - Nice to have

| Action | Owner | Due Date | Status |
| --- | --- | --- | --- |
| Investigate why Nodira received SIGTERM - check owner activity, scripts, etc. | Mirzo | 2026-02-15 | Open |
| Add process supervisor (systemd) for automatic restart on any death - covers ALL 3 bots | Owner | TBD | Open |

Lessons Learned

What went wrong

  1. Pattern matching was fragile - Relied on regex that could match unintended processes
  2. Heartbeat check has design flaw - Only active during processing, idle bots invisible
  3. No cross-validation - Each bot trusted its own check without external verification
  4. Silent failure - Combined downtime of about 11 hours 12 minutes (Nodira 3h 48m + Dilya 7h 24m) with no alert to anyone
  5. No sibling awareness - Dilya's downtime was only discovered because owner asked while reviewing Nodira postmortem
  6. Telegram API timeout unhandled - Dilya crashed from API timeout with no auto-recovery

What went well

  1. Logs preserved - Full timeline reconstructable from claudir.log
  2. Quick recovery - Once notified, Mirzo restarted Nodira within 1 minute
  3. Context saved - Nodira saved 7454 messages before terminating, no data loss

Where we got lucky

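  1. Not during peak hours (unverified) - 07:40-11:27 UTC is 12:40-16:27 in Uzbekistan (UTC+5), early afternoon rather than early morning; traffic data for that window should confirm it is off-peak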
  2. Owner checked - Could have gone longer without detection
  3. Clean shutdown - SIGTERM allowed graceful context save
  4. The Irony - While writing this postmortem about Nodira's downtime, we discovered Dilya had been down for 7+ hours. Mirzo asked Dilya to review the postmortem, but Dilya was down and couldn't respond. This perfectly illustrates the need for cross-bot health monitoring, and we were "lucky" to catch it while writing the postmortem.

Open Questions

Q1: Why was SIGTERM sent?

Evidence: Clean SIGTERM (signal 15), not OOM or crash

Possibilities:

  • Owner manually killed Nodira (testing?)
  • Script or cron job
  • System restart

Action: Ask owner if they killed Nodira around 07:40 UTC

Q2: Should bots monitor each other? (ESCALATE TO OWNER)

This is an architectural decision requiring owner input:

  • Option A: Mirzo directly queries Nodira's Telegram API for liveness
  • Option B: Bots share a "heartbeat" table in shared SQLite
  • Option C: External monitoring (UptimeRobot, etc.)
  • Option D: systemd/supervisor for automatic restart

Mirzo's Recommendation: Option D (systemd) + Option B (heartbeat table)

Reasoning:

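  • systemd provides instant auto-restart on crash - zero human intervention needed. This alone would likely have prevented both outages, provided the unit uses Restart=always (Restart=on-failure does not restart after a clean SIGTERM exit) and Dilya's harness actually exited rather than hung.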
  • Heartbeat table provides visibility into bot health for debugging and alerting, even if systemd handles restarts.
  • Options A/C add complexity without solving the core problem (dead process stays dead).
  • Defense in depth: systemd handles crashes, heartbeat table catches hangs/freezes.
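The staleness check behind Option B might look like the sketch below. An in-memory map stands in for the shared SQLite table, so no database crate is needed. `HeartbeatTable`, `record_heartbeat`, and `stale_bots` are hypothetical names, and the 5-minute threshold is borrowed from the success criteria. Timestamps are seconds since midnight UTC on 2026-02-05. Dilya's value is derived from the "7h 29m ago" report at 15:39:01, and the other two are invented.

```rust
use std::collections::BTreeMap;

/// In-memory stand-in for the shared heartbeat table:
/// bot name -> time of last heartbeat, in seconds.
type HeartbeatTable = BTreeMap<String, u64>;

/// Each bot would call this periodically for its own row.
fn record_heartbeat(table: &mut HeartbeatTable, bot: &str, now: u64) {
    table.insert(bot.to_string(), now);
}

/// Bots whose last heartbeat is older than `max_age_secs`, paired with
/// the heartbeat age in seconds, ordered by bot name.
fn stale_bots(table: &HeartbeatTable, now: u64, max_age_secs: u64) -> Vec<(String, u64)> {
    table
        .iter()
        .map(|(bot, &last)| (bot.clone(), now.saturating_sub(last)))
        .filter(|(_, age)| *age > max_age_secs)
        .collect()
}

fn main() {
    let mut table = HeartbeatTable::new();
    record_heartbeat(&mut table, "dilya", 8 * 3600 + 10 * 60 + 1); // 08:10:01, from the 7h 29m report
    record_heartbeat(&mut table, "nodira", 15 * 3600 + 38 * 60); // 15:38:00, invented
    record_heartbeat(&mut table, "mirzo", 15 * 3600 + 39 * 60); // 15:39:00, invented

    let now = 15 * 3600 + 39 * 60 + 1; // 15:39:01
    for (bot, age) in stale_bots(&table, now, 5 * 60) {
        println!("{} is stale: last heartbeat {}h {}m ago", bot, age / 3600, (age % 3600) / 60);
    }
}
```

Any bot, or an external script, can run the same query. That lets the table catch a hung process that a supervisor would still consider alive.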

Owner - Please decide on monitoring architecture. This affects all 3 bots.

Remediation Test Plan

How we'll verify the fixes work:

P0 Fixes (Process Detection)

| Test | Expected Result | How to Verify |
| --- | --- | --- |
| Kill Nodira harness | pgrep returns empty | Run pkill -f "claudir data/prod/nodira/bot.json", wait for the graceful shutdown (context save) to finish, then pgrep -f "claudir data/prod/nodira/bot.json" should return nothing |
| Kill Nodira harness while Mirzo's CC running | pgrep still returns empty | Same as above - Mirzo's Claude subprocess should NOT match |
| PID file created on startup | File exists with valid PID | cat data/prod/nodira/harness.pid && kill -0 $(cat data/prod/nodira/harness.pid) |
| PID file stale after crash | kill -0 fails | Kill harness without cleanup, verify PID check fails |

P1 Fixes (Cross-Bot Monitoring)

| Test | Expected Result | How to Verify |
| --- | --- | --- |
| Cross-bot detection catches dead bot | Alert within 5 minutes | Kill Nodira, wait for Mirzo's next health check, verify alert sent |
| Idle liveness check works | Detects frozen CC | Simulate CC freeze (SIGSTOP), verify timeout triggers |
| Owner alert sent | DM received | Verify DM arrives when bot detected down |

Success Criteria

  • Nodira downtime detected within 5 minutes (vs 3h 48m)
  • Dilya downtime detected within 5 minutes (vs 7h 24m)
  • No false positives from pgrep pattern for 24 hours
  • Owner receives automated alert on any bot death

Supporting Information

  • Nodira logs: data/prod/nodira/logs/claudir.log
  • Mirzo logs: data/prod/mirzo/logs/claudir.log
  • Heartbeat code: src/chatbot/engine/mod.rs lines 503-525
  • Process check code: src/chatbot/claude_code.rs lines 294-307
  • Previous postmortem: data/prod/mirzo/postmortem-2026-01-22.md
