SSE Resilience Improvements

Context

The SSE implementation already has exponential backoff, heartbeat monitoring, event replay via lastEventId, and Redis-backed connection tracking. The failure recovery paths still have four gaps: unbounded replay queries, silent replay failures, a heartbeat race condition, and no Redis subscriber health monitoring. After a prolonged disconnection or an infrastructure hiccup, a client can end up with stale data and no way to detect or recover from it.

Improvements (ranked by impact-to-risk ratio)

1. Bound the event replay query + notify client of truncation

Problem: getRecentEventsFromDatabase() has no LIMIT clause. A reconnecting client with a stale/missing lastEventId can trigger an unbounded query returning all events for a topic. One reconnecting client can spike database load for everyone.

Fix:

Add .limit(1000) to the replay query in events.service.ts:723-727
Return a truncated flag when the limit is hit
In sse.manager.ts:262, emit a replay-truncated SSE event when truncated
Client-side: on replay-truncated, trigger full re-hydration

Files: apps/api/src/redis/events.service.ts, apps/api/src/utils/sse.manager.ts, dashboard/src/api/event-stream/event-stream.service.ts, dashboard/src/components/user-event-provider/user-event-provider.tsx
Complexity: Small-Medium
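
The server-side steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the actual code in events.service.ts or sse.manager.ts: StoredEvent, fetchReplay, formatSseFrame, and buildReplayFrames are hypothetical names, an in-memory array stands in for the database query, and fetching one row past the limit (so that a result of exactly 1000 rows is not misreported as truncated) is just one possible way to detect truncation.

```typescript
// Hypothetical event shape; the real schema is not shown in this plan.
interface StoredEvent {
  id: number;
  topic: string;
  payload: string;
}

interface ReplayResult {
  events: StoredEvent[];
  truncated: boolean;
}

const REPLAY_LIMIT = 1000;

// Stand-in for the bounded query: take limit + 1 rows, return at most
// `limit` of them, and flag truncation only when the extra row exists.
function fetchReplay(
  all: StoredEvent[],
  topic: string,
  lastEventId: number,
  limit: number = REPLAY_LIMIT,
): ReplayResult {
  const rows = all
    .filter((e) => e.topic === topic && e.id > lastEventId)
    .sort((a, b) => a.id - b.id)
    .slice(0, limit + 1);
  return { events: rows.slice(0, limit), truncated: rows.length > limit };
}

// Serialize a named event in the SSE wire format: an "event:" line,
// a "data:" line, and a blank line that terminates the event.
function formatSseFrame(eventName: string, data: unknown): string {
  return `event: ${eventName}\ndata: ${JSON.stringify(data)}\n\n`;
}

// Build the frames written to a reconnecting client. The "message" event
// name and the `delivered` field are assumptions for illustration.
function buildReplayFrames(result: ReplayResult): string[] {
  const frames = result.events.map(
    (e) => `id: ${e.id}\n` + formatSseFrame("message", e.payload),
  );
  if (result.truncated) {
    frames.push(
      formatSseFrame("replay-truncated", { delivered: result.events.length }),
    );
  }
  return frames;
}
```

Sending replay-truncated after the replayed events lets the client decide what to do with the partial replay before it re-hydrates.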

2. Signal replay failures to the client

Problem: In sse.manager.ts:266-268, if event replay fails (DB error/timeout), the error is logged and swallowed. The client continues with stale data and no indication events were missed.

Fix:

In the catch block at line 266, emit a replay-failed SSE event
Client-side: treat replay-failed the same as replay-truncated and trigger re-hydration

Files: apps/api/src/utils/sse.manager.ts, dashboard/src/components/user-event-provider/user-event-provider.tsx
Complexity: Small (3 lines server, ~10 lines client)
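
On the client, both signals can share one handler. The sketch below assumes the stream object exposes an addEventListener method shaped after the one on EventSource. NamedEventSource, attachReplaySignalHandlers, and the rehydrate callback are hypothetical names, not the actual API of event-stream.service.ts or user-event-provider.tsx.

```typescript
// Minimal subset of an EventSource-like object that this sketch needs.
interface NamedEventSource {
  addEventListener(
    type: string,
    listener: (event: { data: string }) => void,
  ): void;
}

// Both signals mean the client may have missed events, so both lead to
// the same recovery action.
const REHYDRATE_SIGNALS = ["replay-truncated", "replay-failed"] as const;

function attachReplaySignalHandlers(
  source: NamedEventSource,
  rehydrate: (reason: string) => void,
): void {
  for (const signal of REHYDRATE_SIGNALS) {
    // Pass the signal name along so the caller can log why it re-hydrated.
    source.addEventListener(signal, () => rehydrate(signal));
  }
}
```

Keeping the signal names in one list means a future signal of the same kind only needs to be added in one place.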

3. Fix heartbeat monitor race condition

Problem: handleProactiveReconnection() runs every 1s once the 10s threshold is crossed. Each invocation increments reconnectAttempts, but if isConnecting is already true, the actual reconnect is skipped. This inflates the counter (jumps to 5+ in seconds), causing premature "Connection lost" UI and incorrect backoff timing.

Fix:

Add if (this.isConnecting) return guard at the top of handleProactiveReconnection()
Call this.stopHeartbeatMonitor() before initiating the reconnect (it gets restarted by connectUserStream)

Files: dashboard/src/api/event-stream/event-stream.service.ts (lines 423-448)
Complexity: Small (2-3 lines)
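
The guard described above might look roughly like this. It is a simplified stand-in, not the real service: the class name ReconnectGuardSketch, the injected startReconnect callback, and onConnectSettled are hypothetical, and the monitor is modelled as a boolean rather than a real timer.

```typescript
class ReconnectGuardSketch {
  isConnecting = false;
  reconnectAttempts = 0;
  heartbeatMonitorRunning = true;
  private readonly startReconnect: () => void;

  constructor(startReconnect: () => void) {
    this.startReconnect = startReconnect;
  }

  stopHeartbeatMonitor(): void {
    this.heartbeatMonitorRunning = false;
  }

  // Called on every monitor tick once the heartbeat is overdue.
  handleProactiveReconnection(): void {
    // Guard: a reconnect is already in flight, so do not count this tick.
    if (this.isConnecting) return;

    // Stop further ticks; the plan has connectUserStream restart the monitor.
    this.stopHeartbeatMonitor();
    this.isConnecting = true;
    this.reconnectAttempts += 1;
    this.startReconnect();
  }

  // Stand-in for the point where the connection attempt settles and the
  // monitor is running again.
  onConnectSettled(): void {
    this.isConnecting = false;
    this.heartbeatMonitorRunning = true;
  }
}
```

With the guard in place, repeated ticks during a single in-flight reconnect leave reconnectAttempts at 1, so the backoff timing and the "Connection lost" UI reflect real attempts only.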

4. Add Redis subscriber health monitoring

Problem: redisSubscriber only has an error handler that logs to console. There are no handlers for ready/reconnecting/close. If the subscriber silently degrades, events stop flowing to all SSE connections on that replica. Heartbeats are generated server-side and keep flowing, so clients see a healthy connection and never reconnect.

Fix:

Add lifecycle event handlers (ready, reconnecting, close, end) with structured logging
Add a sentinel pub/sub health check: publish a test message every 30s, verify receipt within 5s, log critical if not received
Expose subscriber state on the existing health endpoint

Files: apps/api/src/redis/index.ts, optionally health route
Complexity: Medium
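
The sentinel check reduces to bookkeeping over publish and receive times, which can be sketched without any Redis client. SentinelTracker and its methods are hypothetical names. The real wiring would publish a token on a timer (every 30s per the plan), call markReceived from the subscriber's message handler, and log at critical level for anything collectOverdue returns.

```typescript
class SentinelTracker {
  private readonly timeoutMs: number;
  // token -> publish time (ms) for sentinels the subscriber has not echoed yet
  private readonly pending = new Map<string, number>();

  constructor(timeoutMs: number = 5000) {
    this.timeoutMs = timeoutMs;
  }

  // Record that a sentinel message was published at `nowMs`.
  markPublished(token: string, nowMs: number): void {
    this.pending.set(token, nowMs);
  }

  // Record that the subscriber received the sentinel.
  markReceived(token: string): void {
    this.pending.delete(token);
  }

  // Return sentinels that have waited longer than the timeout, removing
  // them so each miss is reported once.
  collectOverdue(nowMs: number): string[] {
    const overdue: string[] = [];
    this.pending.forEach((sentAt, token) => {
      if (nowMs - sentAt > this.timeoutMs) {
        overdue.push(token);
        this.pending.delete(token);
      }
    });
    return overdue;
  }

  // Simple state that a health endpoint could expose.
  pendingCount(): number {
    return this.pending.size;
  }
}
```

Passing timestamps in explicitly, rather than reading the clock inside the class, keeps the timeout logic testable without fake timers.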

5. Add max reconnect threshold with recovery

Problem: Client retries forever (exponential backoff capped at 30s). After extended disconnection (e.g., laptop sleep for hours), lastEventId becomes very stale. When reconnection succeeds, replay may be truncated or fail, leaving the client with inconsistent data.

Fix:

Add maxReconnectAttempts = 50 (~25 min at the 30s backoff cap). After the threshold is reached, transition to 'failed' status and clear lastEventId
Add recover() method: resets state, triggers full re-hydration, then reconnects SSE
Wire "Retry" button and browser online/visibilitychange events to call recover() when in failed state

Files: dashboard/src/api/event-stream/event-stream.service.ts, dashboard/src/components/user-event-provider/user-event-provider.tsx, dashboard/src/features/connection-status-indicator/
Complexity: Medium
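
A rough sketch of the threshold and the recovery path follows. The 30s cap and the limit of 50 come from this plan; the 1s base delay, the doubling factor, and the names ReconnectPolicySketch, backoffDelayMs, and onReconnectFailed are assumptions for illustration. The rehydrate and connect callbacks are synchronous here for brevity, whereas the real ones are presumably asynchronous.

```typescript
type StreamStatus = "connected" | "reconnecting" | "failed";

const MAX_RECONNECT_ATTEMPTS = 50;
const BASE_DELAY_MS = 1000; // assumed base delay; not stated in the plan
const MAX_DELAY_MS = 30000; // 30s cap from the plan

// Exponential backoff capped at MAX_DELAY_MS.
function backoffDelayMs(attempt: number): number {
  return Math.min(BASE_DELAY_MS * Math.pow(2, attempt), MAX_DELAY_MS);
}

class ReconnectPolicySketch {
  status: StreamStatus = "connected";
  reconnectAttempts = 0;
  lastEventId: string | null = null;
  private readonly rehydrate: () => void;
  private readonly connect: () => void;

  constructor(rehydrate: () => void, connect: () => void) {
    this.rehydrate = rehydrate;
    this.connect = connect;
  }

  // Record a failed attempt. Returns the delay before the next attempt,
  // or null once the threshold is reached and retries stop.
  onReconnectFailed(): number | null {
    this.reconnectAttempts += 1;
    if (this.reconnectAttempts >= MAX_RECONNECT_ATTEMPTS) {
      this.status = "failed";
      this.lastEventId = null; // too stale to replay from
      return null;
    }
    this.status = "reconnecting";
    return backoffDelayMs(this.reconnectAttempts);
  }

  // Entry point for the "Retry" button and online/visibilitychange handlers.
  recover(): void {
    if (this.status !== "failed") return;
    this.reconnectAttempts = 0;
    this.lastEventId = null;
    this.status = "reconnecting";
    this.rehydrate(); // full re-hydration first
    this.connect(); // then reopen the stream without a lastEventId
  }
}
```

Under these assumed values, summing backoffDelayMs(0) through backoffDelayMs(49) gives 1,381,000 ms, or about 23 minutes, which is in line with the ~25 min estimate above.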

Commits

fix: bound event replay query and signal truncation to client
fix: signal replay failures to client for re-hydration
fix: prevent heartbeat monitor race inflating reconnect counter
feat: add Redis subscriber health monitoring
feat: add max reconnect threshold with full recovery path

Verification

Replay limit: Manually set a very low LIMIT (e.g., 5), disconnect and reconnect, and verify that replay-truncated triggers re-hydration
Replay failure: Mock getRecentEvents to throw, verify client receives replay-failed and re-hydrates
Heartbeat race: Add a test that fires handleProactiveReconnection() while isConnecting=true, verify counter doesn't increment
Redis health: Kill Redis subscriber connection, verify sentinel detects the failure within 35s
Max reconnect: Set maxReconnectAttempts=3 in test, verify transition to failed state and that recover() triggers re-hydration
Run bun obvious test --changed and bun obvious check --changed after each commit
