Bot Alert Spam Analysis

Two recurring bot alerts from "The Secretary" in the backend Telegram group are spamming notifications. Both share a common pattern: transient failures triggering unbounded alert loops.

1. Backup Health Alert: Integrity check failed

Source: ~/code/dotfiles/bin/.local/bin/backup-monitor (launchd service local.backup-monitor, runs every 30 min)

What's happening

The restic repository has stale locks from old backup processes (PIDs 759, 1940, 2845) that crashed without cleanup, dating back to Feb 2-3. Every time restic check runs, it fails because the repo is locked.
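
The stale locks can be confirmed and cleared directly against the repository. A quick check, assuming the repository location is supplied the usual way (via RESTIC_REPOSITORY or -r):

restic list locks    # shows the lock IDs left behind by the crashed processes
restic unlock        # removes locks whose owning process is no longer running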

Why it spams

There is a bug in the monitor script: when the integrity check passes, LAST_INTEGRITY_CHECK is updated in the state file, but when it fails, it is not:

if check_output=$(run_integrity_check); then
    LAST_INTEGRITY_CHECK=$current_epoch    # updated on success
else
    integrity_status="FAILED"
    failure_messages+=("Integrity check failed")
    # LAST_INTEGRITY_CHECK is NOT updated here
fi

The state file shows LAST_INTEGRITY_CHECK stuck at 1770038199 (Feb 2, 08:16). Every 30-minute run therefore computes hours_since_integrity >= 4, re-runs the check, fails again, and fires another alert: an infinite loop.
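
The gating logic presumably looks something like the sketch below; current_epoch, hours_since_integrity, and the 4-hour threshold come from the script and prose above, while the surrounding structure is an assumption for illustration:

current_epoch=$(date +%s)
hours_since_integrity=$(( (current_epoch - LAST_INTEGRITY_CHECK) / 3600 ))
# With LAST_INTEGRITY_CHECK frozen at 1770038199, this condition is always true,
# so every 30-minute run re-enters the failing check-and-alert path
if (( hours_since_integrity >= 4 )); then
    :  # re-run the check, fail, alert
fi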

Fixes

  1. Immediate: restic unlock to clear stale locks
  2. Script: Update LAST_INTEGRITY_CHECK even on failure so it respects the 4-hour interval regardless of outcome (see the sketch after this list)
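
A minimal sketch of the script fix, reusing the excerpt above; the only change is that the failed branch also records the attempt time:

if check_output=$(run_integrity_check); then
    LAST_INTEGRITY_CHECK=$current_epoch    # updated on success
else
    integrity_status="FAILED"
    failure_messages+=("Integrity check failed")
    LAST_INTEGRITY_CHECK=$current_epoch    # also record failed attempts so the
                                           # 4-hour interval applies either way
fi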

2. ClaudeCodeLog Scraper Error

Source: clis/internalctl/internalctl/run_scrape_claude_changelog_analysis.py (runs in ah-control:monitor-changelog tmux window)

What's happening

The scraper uses agent-browser to visit x.com/ClaudeCodeLog every 15-25 minutes checking for new Claude Code version announcements. The browser runs inside an Apple container with Wayland + Chromium. Various transient failures occur:

| Error | Cause |
| --- | --- |
| Browser not ready: timeout | Container still starting up |
| net::ERR_NETWORK_CHANGED | Machine sleep/wake cycle |
| Target page, context or browser has been closed | Chromium crashed mid-navigation |
| Failed to connect via CDP | Race condition: browser reported ready but CDP unreachable |

These are all expected transient issues, and the scraper recovers on its own on the next run: there was a successful run at 11:45 between two error clusters.

Why it spams

Every single exception sends a Telegram notification with zero deduplication:

except Exception as e:
    error_msg = str(e)
    send_error_notification(f"Exception: {error_msg}")

A machine going to sleep overnight generated 7 error notifications. A brief network blip generates 1-2. There's no distinction between "transient browser hiccup" and "something is actually broken."

Fixes

  1. Error suppression: Track consecutive failure count in state. Only notify after N consecutive failures (e.g., 3). Send a single "recovered" message when it starts working again (see the sketch after this list).
  2. Retry logic: Retry start_browser() and navigation once before declaring failure. Most transient issues resolve on a second attempt.
  3. Error classification: Transient infrastructure errors (ERR_NETWORK_CHANGED, timeout, browser closed) could be logged locally without notifying Telegram at all.
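
A minimal sketch combining fixes 1 and 3, assuming a small JSON state file; STATE_FILE, FAILURE_THRESHOLD, and TRANSIENT_MARKERS are illustrative values, and send_error_notification is stubbed here in place of the script's existing Telegram helper:

import json
from pathlib import Path

# Illustrative values; the real script would choose its own path and threshold
STATE_FILE = Path("~/.local/state/changelog_scraper_state.json").expanduser()
FAILURE_THRESHOLD = 3
TRANSIENT_MARKERS = (
    "ERR_NETWORK_CHANGED",
    "Browser not ready",
    "has been closed",
    "Failed to connect via CDP",
)

def send_error_notification(message: str) -> None:
    # stand-in for the script's existing Telegram helper
    print(message)

def load_state() -> dict:
    try:
        return json.loads(STATE_FILE.read_text())
    except (FileNotFoundError, json.JSONDecodeError):
        return {"consecutive_failures": 0}

def save_state(state: dict) -> None:
    STATE_FILE.write_text(json.dumps(state))

def record_failure(error_msg: str) -> None:
    state = load_state()
    state["consecutive_failures"] = state.get("consecutive_failures", 0) + 1
    save_state(state)
    is_transient = any(marker in error_msg for marker in TRANSIENT_MARKERS)
    if is_transient and state["consecutive_failures"] < FAILURE_THRESHOLD:
        return  # known-transient error below the threshold: stay quiet
    send_error_notification(
        f"Exception ({state['consecutive_failures']} consecutive): {error_msg}"
    )

def record_success() -> None:
    state = load_state()
    if state.get("consecutive_failures", 0) >= FAILURE_THRESHOLD:
        # single "recovered" message after a noisy stretch
        send_error_notification(
            f"Recovered after {state['consecutive_failures']} consecutive failures"
        )
    state["consecutive_failures"] = 0
    save_state(state)

With something like this in place, the existing handler's send_error_notification(f"Exception: {error_msg}") would become record_failure(error_msg), and a successful scrape would end with record_success().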

Common theme

Both issues boil down to the same root cause: alerts that fire on every failure without any suppression, deduplication, or cooldown. The backup monitor needs its state-update bug fixed; the scraper needs a consecutive-failure threshold before alerting.
