@matthew-gerstman
Created February 10, 2026 21:02
SSE Resilience Improvements Plan

SSE Resilience Improvements

Context

The SSE implementation is already well-architected with exponential backoff, heartbeat monitoring, event replay via lastEventId, and Redis-backed connection tracking. However, there are gaps in failure recovery paths — specifically around unbounded replay queries, silent replay failures, a heartbeat race condition, and lack of Redis subscriber health monitoring. These gaps mean that after prolonged disconnections or infrastructure hiccups, clients can end up with stale data and no way to detect or recover from it.

Improvements (ranked by impact-to-risk ratio)

1. Bound the event replay query + notify client of truncation

Problem: getRecentEventsFromDatabase() has no LIMIT clause. A reconnecting client with a stale/missing lastEventId can trigger an unbounded query returning all events for a topic. One reconnecting client can spike database load for everyone.

Fix:

  • Add .limit(1000) to the replay query in events.service.ts:723-727
  • Return a truncated flag when the limit is hit
  • In sse.manager.ts:262, emit a replay-truncated SSE event when truncated
  • Client-side: on replay-truncated, trigger full re-hydration

Files: apps/api/src/redis/events.service.ts, apps/api/src/utils/sse.manager.ts, dashboard/src/api/event-stream/event-stream.service.ts, dashboard/src/components/user-event-provider/user-event-provider.tsx

Complexity: Small-Medium
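
A minimal sketch of the bounded replay, assuming a generic query callback rather than the real code in events.service.ts. StoredEvent, ReplayResult, FetchRows, toReplayResult, getBoundedReplay, and formatSseEvent are hypothetical names. One variation from the fix as written: the sketch requests limit + 1 rows, so a result of exactly 1000 rows is not misreported as truncated.

```typescript
// Hypothetical shapes: the real event type in events.service.ts may differ.
interface StoredEvent {
  id: string;
  topic: string;
  payload: unknown;
}

interface ReplayResult {
  events: StoredEvent[];
  truncated: boolean;
}

const REPLAY_LIMIT = 1000;

// Stand-in for the database query. It must apply the row limit it is given.
type FetchRows = (
  topic: string,
  lastEventId: string | undefined,
  limit: number,
) => Promise<StoredEvent[]>;

// Pure helper: decides whether the result set overflowed the limit.
function toReplayResult(rows: StoredEvent[], limit: number): ReplayResult {
  const truncated = rows.length > limit;
  return { events: truncated ? rows.slice(0, limit) : rows, truncated };
}

async function getBoundedReplay(
  fetchRows: FetchRows,
  topic: string,
  lastEventId: string | undefined,
  limit: number = REPLAY_LIMIT,
): Promise<ReplayResult> {
  // Request one extra row so exactly `limit` rows is not reported as truncated.
  const rows = await fetchRows(topic, lastEventId, limit + 1);
  return toReplayResult(rows, limit);
}

// Serializes a named SSE event in the text/event-stream wire format.
function formatSseEvent(name: string, data: unknown): string {
  return `event: ${name}\ndata: ${JSON.stringify(data)}\n\n`;
}
```

When the result is truncated, the SSE manager would write formatSseEvent("replay-truncated", { limit: REPLAY_LIMIT }) after the replayed events, and the client would discard its local state and re-hydrate.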

2. Signal replay failures to the client

Problem: In sse.manager.ts:266-268, if event replay fails (DB error/timeout), the error is logged and swallowed. The client continues with stale data and no indication events were missed.

Fix:

  • In the catch block at line 266, emit a replay-failed SSE event
  • Client-side: treat replay-failed same as replay-truncated — trigger re-hydration

Files: apps/api/src/utils/sse.manager.ts, dashboard/src/components/user-event-provider/user-event-provider.tsx

Complexity: Small (3 lines server, ~10 lines client)
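
The two halves of this fix might look roughly like the sketch below. It is not the code in sse.manager.ts: replayOrSignalFailure, EventSourceLike, and wireRehydrationSignals are hypothetical names, and the replay callback is assumed to resolve to pre-formatted SSE frames.

```typescript
type Send = (frame: string) => void;

// Server side: replay missed events, or tell the client that replay failed.
// `replay` is a hypothetical stand-in that resolves to pre-formatted SSE frames.
async function replayOrSignalFailure(
  replay: () => Promise<string[]>,
  send: Send,
  logError: (err: unknown) => void,
): Promise<void> {
  try {
    const frames = await replay();
    for (const frame of frames) send(frame);
  } catch (err) {
    logError(err); // keep the existing log line
    // The data line matters: EventSource does not dispatch events with empty data.
    send("event: replay-failed\ndata: {}\n\n");
  }
}

// Client side: minimal structural stand-in for the browser EventSource.
interface EventSourceLike {
  addEventListener(type: string, listener: (ev: { data: string }) => void): void;
}

const REHYDRATE_SIGNALS = ["replay-truncated", "replay-failed"];

// Both signals mean the replayed stream has gaps, so both trigger the same recovery.
function wireRehydrationSignals(
  source: EventSourceLike,
  rehydrate: (reason: string) => void,
): void {
  for (const signal of REHYDRATE_SIGNALS) {
    source.addEventListener(signal, () => rehydrate(signal));
  }
}
```

Registering both event names through one helper keeps the truncation and failure paths from drifting apart on the client.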

3. Fix heartbeat monitor race condition

Problem: handleProactiveReconnection() runs every 1s once the 10s threshold is crossed. Each invocation increments reconnectAttempts, but if isConnecting is already true, the actual reconnect is skipped. This inflates the counter (jumps to 5+ in seconds), causing premature "Connection lost" UI and incorrect backoff timing.

Fix:

  • Add if (this.isConnecting) return guard at the top of handleProactiveReconnection()
  • Call this.stopHeartbeatMonitor() before initiating the reconnect (it gets restarted by connectUserStream)

Files: dashboard/src/api/event-stream/event-stream.service.ts (lines 423-448)

Complexity: Small (2-3 lines)
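
To illustrate the guard, here is a stripped-down stand-in for the client service. HeartbeatReconnectSketch, its connect callback, and the heartbeatMonitorRunning flag are assumptions made for the example; only isConnecting, reconnectAttempts, handleProactiveReconnection(), and stopHeartbeatMonitor() come from the plan.

```typescript
// Simplified, hypothetical stand-in for the client service: only the fields the fix touches.
class HeartbeatReconnectSketch {
  isConnecting = false;
  reconnectAttempts = 0;
  heartbeatMonitorRunning = true;
  private connect: () => void;

  constructor(connect: () => void) {
    this.connect = connect;
  }

  // Invoked by the 1s heartbeat timer once the silence threshold is crossed.
  handleProactiveReconnection(): void {
    // Guard: a reconnect is already in flight, so this tick must not count as an attempt.
    if (this.isConnecting) return;

    // Stop the timer so no further ticks arrive until the new connection restarts it.
    this.stopHeartbeatMonitor();

    this.reconnectAttempts += 1;
    this.isConnecting = true;
    this.connect();
  }

  stopHeartbeatMonitor(): void {
    this.heartbeatMonitorRunning = false;
  }
}
```

With the guard in place, five timer ticks during one slow reconnect count as a single attempt, so the backoff delay and the "Connection lost" UI track real attempts.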

4. Add Redis subscriber health monitoring

Problem: redisSubscriber only has an error handler that logs to console. No handlers for ready/reconnecting/close. If the subscriber silently degrades, events stop flowing to all SSE connections on that replica — but heartbeats still work (generated server-side), so clients don't reconnect.

Fix:

  • Add lifecycle event handlers (ready, reconnecting, close, end) with structured logging
  • Add a sentinel pub/sub health check: publish a test message every 30s, verify receipt within 5s, log critical if not received
  • Expose subscriber state on the existing health endpoint

Files: apps/api/src/redis/index.ts, optionally health route

Complexity: Medium
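
The sentinel check can be sketched as a small state machine that is independent of any particular Redis client. SentinelHealthCheck, SENTINEL_CHANNEL, and the callback names are hypothetical; the 5s timeout comes from the plan. Timestamps are passed in explicitly so the logic can be exercised without real timers.

```typescript
type Publish = (channel: string, message: string) => void;

const SENTINEL_CHANNEL = "sse:health:sentinel"; // hypothetical channel name

// Tracks one in-flight sentinel message. Production code would drive ping() and
// evaluate() from timers and call receive() from the subscriber's message handler.
class SentinelHealthCheck {
  private pending: { token: string; sentAt: number } | null = null;
  private seq = 0;
  private healthy = true;
  private publish: Publish;
  private logCritical: (message: string) => void;
  private timeoutMs: number;

  constructor(publish: Publish, logCritical: (message: string) => void, timeoutMs = 5000) {
    this.publish = publish;
    this.logCritical = logCritical;
    this.timeoutMs = timeoutMs;
  }

  // Call on an interval (every 30s in the plan) using the publisher connection.
  ping(now: number): void {
    this.evaluate(now); // account for a previous ping that never came back
    this.seq += 1;
    const token = `sentinel-${this.seq}`;
    this.pending = { token, sentAt: now };
    this.publish(SENTINEL_CHANNEL, token);
  }

  // Call when the subscriber delivers a message on SENTINEL_CHANNEL.
  receive(token: string): void {
    if (this.pending !== null && this.pending.token === token) {
      this.pending = null;
      this.healthy = true;
    }
  }

  // Call once the timeout has elapsed after a ping; returns the current health.
  evaluate(now: number): boolean {
    if (this.pending !== null && now - this.pending.sentAt >= this.timeoutMs) {
      this.logCritical(`Redis subscriber missed ${this.pending.token}`);
      this.pending = null;
      this.healthy = false;
    }
    return this.healthy;
  }

  isHealthy(): boolean {
    return this.healthy;
  }
}
```

The health endpoint could report isHealthy() alongside the subscriber's lifecycle state. Because heartbeats are generated server-side, this round trip through pub/sub is what distinguishes a live HTTP connection from a live event path.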

5. Add max reconnect threshold with recovery

Problem: Client retries forever (exponential backoff capped at 30s). After extended disconnection (e.g., laptop sleep for hours), lastEventId becomes very stale. When reconnection succeeds, replay may be truncated or fail, leaving the client with inconsistent data.

Fix:

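  • Add maxReconnectAttempts = 50 (50 × 30s = 25 min if every attempt waits at the cap; slightly less in practice, since the first attempts use shorter backoff delays). After the threshold, transition to 'failed' status and clear lastEventId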
  • Add recover() method: resets state, triggers full re-hydration, then reconnects SSE
  • Wire "Retry" button and browser online/visibilitychange events to call recover() when in failed state

Files: dashboard/src/api/event-stream/event-stream.service.ts, dashboard/src/components/user-event-provider/user-event-provider.tsx, dashboard/src/features/connection-status-indicator/

Complexity: Medium
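
One possible shape for the threshold and recovery path is sketched below. ReconnectBudgetSketch, backoffDelayMs, and onConnectionLost are hypothetical names, and the 1s backoff base is an assumption (the plan only states the 30s cap). Re-hydration is modeled as a synchronous callback to keep the example short.

```typescript
type ConnectionStatus = "connecting" | "connected" | "reconnecting" | "failed";

// Exponential backoff with a cap. The 1s base is an assumption, not from the plan.
function backoffDelayMs(attempt: number, baseMs = 1000, capMs = 30000): number {
  return Math.min(capMs, baseMs * Math.pow(2, attempt));
}

class ReconnectBudgetSketch {
  status: ConnectionStatus = "connected";
  reconnectAttempts = 0;
  lastEventId: string | null = null;
  maxReconnectAttempts: number;

  constructor(maxReconnectAttempts = 50) {
    this.maxReconnectAttempts = maxReconnectAttempts;
  }

  // Returns the delay before the next attempt, or null once the budget is spent.
  onConnectionLost(): number | null {
    if (this.status === "failed") return null;
    if (this.reconnectAttempts >= this.maxReconnectAttempts) {
      this.status = "failed";
      this.lastEventId = null; // too stale to replay from
      return null;
    }
    const delay = backoffDelayMs(this.reconnectAttempts);
    this.reconnectAttempts += 1;
    this.status = "reconnecting";
    return delay;
  }

  // Full recovery path; a no-op unless the service is in the failed state.
  // Real re-hydration is probably async and would be awaited before connecting.
  recover(rehydrate: () => void, connect: () => void): void {
    if (this.status !== "failed") return;
    this.reconnectAttempts = 0;
    this.lastEventId = null;
    this.status = "connecting";
    rehydrate();
    connect();
  }
}
```

In the browser, the "Retry" button handler and listeners for the window online event and the document visibilitychange event would all call recover(). Clearing lastEventId before reconnecting means the server is never asked to replay from a position the client has already abandoned.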

Commits

  1. fix: bound event replay query and signal truncation to client
  2. fix: signal replay failures to client for re-hydration
  3. fix: prevent heartbeat monitor race inflating reconnect counter
  4. feat: add Redis subscriber health monitoring
  5. feat: add max reconnect threshold with full recovery path

Verification

  • Replay limit: Temporarily set a very low limit (e.g., 5), disconnect and reconnect, and verify that replay-truncated triggers re-hydration
  • Replay failure: Mock getRecentEvents to throw, verify client receives replay-failed and re-hydrates
  • Heartbeat race: Add a test that fires handleProactiveReconnection() while isConnecting=true, verify counter doesn't increment
  • Redis health: Kill Redis subscriber connection, verify sentinel detects the failure within 35s
  • Max reconnect: Set maxReconnectAttempts=3 in test, verify transition to failed state and that recover() triggers re-hydration
  • Run bun obvious test --changed and bun obvious check --changed after each commit