Two recurring bot alerts from "The Secretary" in the backend Telegram group are spamming notifications. Both share a common pattern: transient failures triggering unbounded alert loops.
Source: ~/code/dotfiles/bin/.local/bin/backup-monitor (launchd service local.backup-monitor, runs every 30 min)
The restic repository has stale locks from old backup processes (PIDs 759, 1940, 2845) that crashed without cleanup, dating back to Feb 2-3. Every time `restic check` runs, it fails because the repo is locked.
On top of that, there is a bug in the monitor script: when the integrity check passes, `LAST_INTEGRITY_CHECK` is updated in the state file; when it fails, it is not:

```bash
if check_output=$(run_integrity_check); then
    LAST_INTEGRITY_CHECK=$current_epoch  # updated on success
else
    integrity_status="FAILED"
    failure_messages+=("Integrity check failed")
    # LAST_INTEGRITY_CHECK is NOT updated here
fi
```

The state file shows `LAST_INTEGRITY_CHECK` stuck at 1770038199 (Feb 2, 08:16). Every 30-minute run computes `hours_since_integrity >= 4`, re-runs the check, the check fails again because the repo is still locked, and another alert fires. Infinite loop.
- Immediate: `restic unlock` to clear the stale locks
- Script: update `LAST_INTEGRITY_CHECK` even on failure so it respects the 4-hour interval regardless of outcome (see the sketch below)
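A minimal sketch of that script change, keeping the variable names from the excerpt above; setting `integrity_status="OK"` on the success path is an assumption (the excerpt only shows the timestamp update there), and persisting the state file is left out since that part of the script isn't shown:

```bash
if check_output=$(run_integrity_check); then
    integrity_status="OK"   # assumed; the excerpt only shows the timestamp update here
else
    integrity_status="FAILED"
    failure_messages+=("Integrity check failed")
fi

# Record the attempt time on both paths so subsequent runs honor the 4-hour
# interval instead of re-running the check (and re-alerting) every 30 minutes.
LAST_INTEGRITY_CHECK=$current_epoch
```

The failure is still reported once via `failure_messages`; the change only stops the same failure from being re-checked and re-alerted every half hour.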
Source: clis/internalctl/internalctl/run_scrape_claude_changelog_analysis.py (runs in ah-control:monitor-changelog tmux window)
The scraper uses agent-browser to visit x.com/ClaudeCodeLog every 15-25 minutes to check for new Claude Code version announcements. The browser runs inside an Apple container with Wayland + Chromium. Several kinds of transient failures occur:
| Error | Cause |
|---|---|
| `Browser not ready: timeout` | Container still starting up |
| `net::ERR_NETWORK_CHANGED` | Machine sleep/wake cycle |
| `Target page, context or browser has been closed` | Chromium crashed mid-navigation |
| `Failed to connect via CDP` | Race condition: browser reported ready but CDP unreachable |
These are all expected transient issues. The scraper actually recovers on the next run - there was a successful run at 11:45 between two error clusters.
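These signatures are distinctive enough to match on substrings, which is what the error-classification idea in the recommendations below would need. A hedged sketch; the pattern list and helper name are hypothetical, not anything the scraper currently has:

```python
# Hypothetical helper: classify an exception message as a known transient
# infrastructure failure, using the signatures from the table above.
TRANSIENT_PATTERNS = (
    "Browser not ready",          # container still starting up
    "ERR_NETWORK_CHANGED",        # machine sleep/wake cycle
    "has been closed",            # Chromium crashed mid-navigation
    "Failed to connect via CDP",  # browser reported ready but CDP unreachable
)


def is_transient(error_msg: str) -> bool:
    """Return True if the error looks like a transient infrastructure hiccup."""
    return any(pattern in error_msg for pattern in TRANSIENT_PATTERNS)
```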
Every single exception sends a Telegram notification with zero deduplication:
```python
except Exception as e:
    error_msg = str(e)
    send_error_notification(f"Exception: {error_msg}")
```

A machine going to sleep overnight generated 7 error notifications. A brief network blip generates 1-2. There's no distinction between "transient browser hiccup" and "something is actually broken."
- Error suppression: track the consecutive failure count in state. Only notify after N consecutive failures (e.g., 3), and send a single "recovered" message when it starts working again (sketched below).
- Retry logic: retry `start_browser()` and navigation once before declaring failure. Most transient issues resolve on a second attempt.
- Error classification: transient infrastructure errors (`ERR_NETWORK_CHANGED`, timeouts, browser closed) could be logged locally without notifying Telegram at all.
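A sketch of the suppression idea under assumptions: the state-file path, threshold value, and helper names are illustrative, and `send_error_notification` here is only a stand-in for the script's existing Telegram helper shown in the except block above:

```python
import json
from pathlib import Path

STATE_FILE = Path("~/.local/state/changelog-scraper-alerts.json").expanduser()  # hypothetical path
FAILURE_THRESHOLD = 3  # alert only after this many consecutive failures


def send_error_notification(message: str) -> None:
    """Stand-in for the script's existing Telegram notifier."""
    print(message)


def _load_state() -> dict:
    try:
        return json.loads(STATE_FILE.read_text())
    except (FileNotFoundError, json.JSONDecodeError):
        return {"consecutive_failures": 0, "alerted": False}


def _save_state(state: dict) -> None:
    STATE_FILE.parent.mkdir(parents=True, exist_ok=True)
    STATE_FILE.write_text(json.dumps(state))


def record_success() -> None:
    """Reset the counter; send one 'recovered' message if an alert had fired."""
    state = _load_state()
    if state.get("alerted"):
        send_error_notification("Recovered: scrape succeeded after earlier failures")
    _save_state({"consecutive_failures": 0, "alerted": False})


def record_failure(error_msg: str) -> None:
    """Count the failure; notify only once the threshold is crossed."""
    state = _load_state()
    state["consecutive_failures"] += 1
    if state["consecutive_failures"] >= FAILURE_THRESHOLD and not state.get("alerted"):
        send_error_notification(
            f"Exception ({state['consecutive_failures']} consecutive): {error_msg}"
        )
        state["alerted"] = True
    _save_state(state)
```

In the scrape loop, the existing `send_error_notification(f"Exception: {error_msg}")` call would become `record_failure(error_msg)`, with `record_success()` called after a successful run; per the classification bullet, errors matching the `is_transient()` sketch above could instead be logged locally and never reach Telegram.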
Both issues boil down to the same thing: alerts that fire on every failure with no suppression, deduplication, or cooldown. The backup monitor needs its state-update bug fixed; the scraper needs a consecutive-failure threshold before alerting.