The SSE implementation is already well-architected with exponential backoff, heartbeat monitoring, event replay via lastEventId, and Redis-backed connection tracking. However, there are gaps in failure recovery paths — specifically around unbounded replay queries, silent replay failures, a heartbeat race condition, and lack of Redis subscriber health monitoring. These gaps mean that after prolonged disconnections or infrastructure hiccups, clients can end up with stale data and no way to detect or recover from it.
Problem: getRecentEventsFromDatabase() has no LIMIT clause. A reconnecting client with a stale/missing lastEventId can trigger an unbounded query returning all events for a topic. One reconnecting client can spike database load for everyone.
Fix:
- Add `.limit(1000)` to the replay query in `events.service.ts:723-727`
- Return a `truncated` flag when the limit is hit
- In `sse.manager.ts:262`, emit a `replay-truncated` SSE event when truncated
- Client-side: on `replay-truncated`, trigger full re-hydration
Files: apps/api/src/redis/events.service.ts, apps/api/src/utils/sse.manager.ts, dashboard/src/api/event-stream/event-stream.service.ts, dashboard/src/components/user-event-provider/user-event-provider.tsx
Complexity: Small-Medium
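A minimal sketch of the bounded replay, assuming a generic async query helper; `fetchEventsAfter`, `ReplayResult`, and `REPLAY_LIMIT` are illustrative names, not the actual `events.service.ts` API:

```typescript
const REPLAY_LIMIT = 1000;

interface ReplayResult<T> {
  events: T[];
  truncated: boolean; // true when the limit was hit and older events exist
}

// Fetch one row past the limit so "exactly REPLAY_LIMIT events" can be
// told apart from "more events than the limit".
async function getRecentEventsBounded<T>(
  fetchEventsAfter: (lastEventId: string, limit: number) => Promise<T[]>,
  lastEventId: string,
): Promise<ReplayResult<T>> {
  const rows = await fetchEventsAfter(lastEventId, REPLAY_LIMIT + 1);
  const truncated = rows.length > REPLAY_LIMIT;
  return {
    events: truncated ? rows.slice(0, REPLAY_LIMIT) : rows,
    truncated,
  };
}
```

On the manager side, a `truncated: true` result would drive the `replay-truncated` SSE event instead of silently replaying a partial window.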
Problem: In sse.manager.ts:266-268, if event replay fails (DB error/timeout), the error is logged and swallowed. The client continues with stale data and no indication events were missed.
Fix:
- In the catch block at line 266, emit a `replay-failed` SSE event
- Client-side: treat `replay-failed` the same as `replay-truncated` and trigger re-hydration
Files: apps/api/src/utils/sse.manager.ts, dashboard/src/components/user-event-provider/user-event-provider.tsx
Complexity: Small (3 lines server, ~10 lines client)
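The shape of this fix can be sketched as below; `sendSse`, `replay`, and `rehydrate` are hypothetical stand-ins for the real `sse.manager.ts` and client helpers:

```typescript
type SseSend = (event: string, data: unknown) => void;

// Server side: wrap the replay so a failure is surfaced to the client
// instead of only being logged.
async function replayWithFailureSignal(
  replay: () => Promise<void>,
  sendSse: SseSend,
): Promise<void> {
  try {
    await replay();
  } catch (err) {
    // Previously the error stopped here; now the client is told, so it can
    // fall back to full re-hydration instead of serving stale data.
    sendSse("replay-failed", { reason: String(err) });
  }
}

// Client side: replay-failed and replay-truncated take the same path.
function handleReplayEvent(event: string, rehydrate: () => void): void {
  if (event === "replay-failed" || event === "replay-truncated") {
    rehydrate();
  }
}
```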
Problem: handleProactiveReconnection() runs every 1s once the 10s threshold is crossed. Each invocation increments reconnectAttempts, but if isConnecting is already true, the actual reconnect is skipped. This inflates the counter (jumps to 5+ in seconds), causing premature "Connection lost" UI and incorrect backoff timing.
Fix:
- Add an `if (this.isConnecting) return` guard at the top of `handleProactiveReconnection()`
- Call `this.stopHeartbeatMonitor()` before initiating the reconnect (it gets restarted by `connectUserStream()`)
Files: dashboard/src/api/event-stream/event-stream.service.ts (lines 423-448)
Complexity: Small (2-3 lines)
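The guard can be sketched as follows; only the field and method names come from the plan, the class shape around them is assumed:

```typescript
class EventStreamSketch {
  isConnecting = false;
  reconnectAttempts = 0;
  private heartbeatTimer: ReturnType<typeof setInterval> | null = null;

  handleProactiveReconnection(): void {
    if (this.isConnecting) return; // guard: a reconnect is already in flight
    this.stopHeartbeatMonitor();   // restarted later by connectUserStream()
    this.reconnectAttempts++;      // now increments once per real attempt
    this.isConnecting = true;
    // ...initiate the actual reconnect here...
  }

  stopHeartbeatMonitor(): void {
    if (this.heartbeatTimer) {
      clearInterval(this.heartbeatTimer);
      this.heartbeatTimer = null;
    }
  }
}
```

With the guard in place, the 1s monitor tick can fire repeatedly without inflating the counter while a reconnect is pending.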
Problem: redisSubscriber only has an error handler that logs to console. No handlers for ready/reconnecting/close. If the subscriber silently degrades, events stop flowing to all SSE connections on that replica — but heartbeats still work (generated server-side), so clients don't reconnect.
Fix:
- Add lifecycle event handlers (`ready`, `reconnecting`, `close`, `end`) with structured logging
- Add a sentinel pub/sub health check: publish a test message every 30s, verify receipt within 5s, and log at critical level if it is not received
- Expose subscriber state on the existing health endpoint
Files: apps/api/src/redis/index.ts, optionally health route
Complexity: Medium
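One way the sentinel loop could look; `publish`/`subscribe` abstract the real Redis publisher and subscriber clients, `onUnhealthy` stands in for the critical-level log, and the channel name is illustrative:

```typescript
// Arm the deadline BEFORE publishing so a fast round-trip can clear it.
function startSentinel(
  publish: (channel: string, msg: string) => void,
  subscribe: (handler: (channel: string, msg: string) => void) => void,
  onUnhealthy: () => void,
  intervalMs = 30_000,
  timeoutMs = 5_000,
): () => void {
  let deadline: ReturnType<typeof setTimeout> | null = null;
  subscribe((channel) => {
    if (channel === "sse:sentinel" && deadline) {
      clearTimeout(deadline); // round-trip confirmed: pub/sub is healthy
      deadline = null;
    }
  });
  const timer = setInterval(() => {
    deadline = setTimeout(onUnhealthy, timeoutMs);
    publish("sse:sentinel", Date.now().toString());
  }, intervalMs);
  return () => {
    clearInterval(timer);
    if (deadline) clearTimeout(deadline);
  };
}
```

The same `deadline === null` state could be what the health endpoint exposes as "subscriber healthy".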
Problem: Client retries forever (exponential backoff capped at 30s). After extended disconnection (e.g., laptop sleep for hours), lastEventId becomes very stale. When reconnection succeeds, replay may be truncated or fail, leaving the client with inconsistent data.
Fix:
- Add `maxReconnectAttempts = 50` (~25 min at the 30s cap). After the threshold, transition to a `'failed'` status and clear `lastEventId`
- Add a `recover()` method: reset state, trigger full re-hydration, then reconnect the SSE stream
- Wire the "Retry" button and the browser `online`/`visibilitychange` events to call `recover()` when in the `failed` state
Files: dashboard/src/api/event-stream/event-stream.service.ts, dashboard/src/components/user-event-provider/user-event-provider.tsx, dashboard/src/features/connection-status-indicator/
Complexity: Medium
Commits:
- fix: bound event replay query and signal truncation to client
- fix: signal replay failures to client for re-hydration
- fix: prevent heartbeat monitor race inflating reconnect counter
- feat: add Redis subscriber health monitoring
- feat: add max reconnect threshold with full recovery path
Testing:
- Replay limit: manually test by setting a very low LIMIT (e.g., 5), disconnect/reconnect, verify `replay-truncated` triggers re-hydration
- Replay failure: mock `getRecentEvents` to throw, verify the client receives `replay-failed` and re-hydrates
- Heartbeat race: add a test that fires `handleProactiveReconnection()` while `isConnecting` is `true`, verify the counter doesn't increment
- Redis health: kill the Redis subscriber connection, verify the sentinel detects the failure within 35s
- Max reconnect: set `maxReconnectAttempts = 3` in a test, verify the transition to the `failed` state and that `recover()` triggers re-hydration
- Run `bun obvious test --changed` and `bun obvious check --changed` after each commit