Created
February 10, 2026 16:50
SSE resilience improvements for future PRs
# SSE Resilience Improvements: Future PRs

Findings from investigating dropped agent events in PR #6692.

## 1. Event replay gap for thread status updates (HIGH)

`emitAgentStatusEvent` in `apps/api/src/agents/obvious-v2/state/events.ts` calls
`publishToProjectUsers` without a `tx` parameter, so `thread:updated` events aren't
stored for replay. A client that reconnects during an agent run misses the status
transitions that happened while it was disconnected, and the UI gets stuck on
"thinking" or "running".

**Fix:** Pass `tx` to `publishToProjectUsers` so these events are persisted and
available for replay via `getRecentEvents`.
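
To make the gap concrete, here is a minimal in-memory sketch of the behavior described above. The types and function bodies are hypothetical stand-ins (the real signatures of `publishToProjectUsers` and `getRecentEvents` are not shown in these notes); the only assumption taken from the notes is that an event is stored for replay only when `tx` is supplied.

```typescript
// Hypothetical stand-ins: the real types and signatures are not shown in these notes.
type ProjectEvent = { type: string; threadId: string; status: string }
type Tx = { store: ProjectEvent[] } // stand-in for a transaction handle

const delivered: ProjectEvent[] = [] // what a connected client receives live

function publishToProjectUsers(event: ProjectEvent, tx?: Tx): void {
  delivered.push(event) // live delivery happens either way
  if (tx) tx.store.push(event) // assumed: persistence only happens when tx is passed
}

function getRecentEvents(tx: Tx): ProjectEvent[] {
  return [...tx.store] // replay source for reconnecting clients
}

const tx: Tx = { store: [] }

// Published without tx: delivered live, but invisible to replay.
publishToProjectUsers({ type: 'thread:updated', threadId: 't1', status: 'running' })

// Published with tx: delivered live and available for replay.
publishToProjectUsers({ type: 'thread:updated', threadId: 't1', status: 'done' }, tx)

const replayed = getRecentEvents(tx)
```

In this model a client that was disconnected for both events would only ever see the second one on reconnect, which is the stuck-status symptom.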

## 2. Validation philosophy mismatch (MEDIUM)

`validateEventData` at the SSE boundary is lenient (logs a warning, passes the event through),
but downstream handlers (`validateResource`, `hasResourceId`) are strict (drop silently).
An event can therefore pass the boundary with a warning and still disappear downstream
without a trace.

**Fix:** Make downstream handlers match the lenient pattern: log the issue to Datadog
but still attempt to process. For example, `handleResourceUpsert` could skip the Mirror
upsert but still forward to the RxJS event stream so UI components that listen for
specific event types aren't starved.
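
The lenient pattern proposed above might look roughly like the following. This is a sketch, not the actual handler: the event shape, the `mirror` map, the `forwarded` array (standing in for the RxJS event stream), and the `sse_event_validation_failed` log message are all hypothetical.

```typescript
// Hypothetical event shape and stand-ins; not taken from the real codebase.
type ResourceEvent = { type: string; data: { id?: string; [key: string]: unknown } }

const warnings: Array<{ message: string; reason: string }> = [] // stand-in for log.warn
const mirror = new Map<string, unknown>() // stand-in for the Mirror store
const forwarded: ResourceEvent[] = [] // stand-in for the RxJS event stream

function hasResourceId(event: ResourceEvent): boolean {
  return typeof event.data.id === 'string' && event.data.id.length > 0
}

function handleResourceUpsert(event: ResourceEvent): void {
  if (hasResourceId(event)) {
    mirror.set(event.data.id as string, event.data)
  } else {
    // Lenient: record the problem and skip only the step that needs the id.
    warnings.push({ message: 'sse_event_validation_failed', reason: 'missing_resource_id' })
  }
  // Forward regardless, so listeners for this event type still see it.
  forwarded.push(event)
}

handleResourceUpsert({ type: 'resource:upsert', data: { id: 'r1' } })
handleResourceUpsert({ type: 'resource:upsert', data: {} })
```

The design choice is that validation failure disables only the operation that depends on the invalid field, instead of discarding the whole event.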

## 3. Drop rate monitoring dashboard (LOW effort, HIGH value)

PR #6692 added `log.warn('sse_event_dropped', {...})` at all validation drop points,
which sends structured context to Datadog RUM as `log_warn` actions.

**Action:** Create a Datadog dashboard/monitor:

- Track `@action.name:log_warn` where `@context.message:sse_event_dropped`
- Group by `@context.reason` to see which validation drops the most events
- Alert if the drop rate exceeds a threshold (e.g., >10 per minute per user)
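
The grouping and alert condition in the list above can be prototyped offline before building the monitor. The sketch below is plain TypeScript, not Datadog configuration; the `DropLog` shape (including `userId`) and the reason strings are assumptions made for illustration.

```typescript
// Hypothetical log shape; the real context fields are not listed in these notes.
type DropLog = { userId: string; reason: string; timestampMs: number }

// Group by reason: which validation drops the most events?
function countByReason(logs: DropLog[]): Map<string, number> {
  const counts = new Map<string, number>()
  for (const log of logs) {
    counts.set(log.reason, (counts.get(log.reason) ?? 0) + 1)
  }
  return counts
}

// Alert condition: more than `perMinuteThreshold` drops for one user in one minute.
function usersOverThreshold(logs: DropLog[], perMinuteThreshold = 10): Set<string> {
  const perUserMinute = new Map<string, number>()
  const flagged = new Set<string>()
  for (const log of logs) {
    const minute = Math.floor(log.timestampMs / 60000)
    const key = `${log.userId}:${minute}`
    const next = (perUserMinute.get(key) ?? 0) + 1
    perUserMinute.set(key, next)
    if (next > perMinuteThreshold) flagged.add(log.userId)
  }
  return flagged
}

// Sample data: u1 drops 11 events in one minute, u2 drops 3.
const sampleLogs: DropLog[] = []
for (let i = 0; i < 11; i++) {
  sampleLogs.push({ userId: 'u1', reason: 'missing_resource_id', timestampMs: 1000 + i })
}
for (let i = 0; i < 3; i++) {
  sampleLogs.push({ userId: 'u2', reason: 'invalid_payload', timestampMs: 1000 + i })
}

const reasonCounts = countByReason(sampleLogs)
const flaggedUsers = usersOverThreshold(sampleLogs)
```

With the example threshold of 10, only `u1` is flagged here; running recorded logs through something like this is one way to sanity-check a threshold before alerting on it.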

## 4. `resetOnError` still risky on base observable (MEDIUM)

We protected `formatSSEMessage` with try-catch, but if `eventsService.subscribe`
emits a malformed event from Redis (e.g., bad JSON from the publisher), the base
observable throws and `resetOnError: true` tears down all subscribers.

**Fix:** Add `catchError` on `baseObservable$` before the `map`:

```typescript
import { EMPTY, catchError, map } from 'rxjs'

const sharedObservable$ = baseObservable$.pipe(
  // Swallow upstream errors (e.g. bad JSON from Redis) before they reach the shared stream
  catchError((err) => {
    logger.error({ err: serializeError(err) }, 'SSE base observable error')
    return EMPTY
  }),
  map((eventData) => { ... }),
  ...
)
```
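
One caveat worth checking against the actual pipeline: in RxJS, a `catchError` that returns `EMPTY` swallows the error but then completes the stream, so existing subscribers stop receiving events unless the source is re-subscribed. A complementary guard, not part of the fix proposed here, is to parse defensively so that a single bad payload never becomes an observable error in the first place. A minimal sketch, assuming payloads arrive as JSON strings; `safeParseEvent` and `ParsedEvent` are hypothetical names, not from the codebase.

```typescript
// Hypothetical event shape for illustration.
type ParsedEvent = { type: string; data?: unknown }

// Returns null instead of throwing, so one bad message cannot error the stream.
function safeParseEvent(raw: string): ParsedEvent | null {
  try {
    const value: unknown = JSON.parse(raw)
    if (
      typeof value === 'object' &&
      value !== null &&
      typeof (value as { type?: unknown }).type === 'string'
    ) {
      return value as ParsedEvent
    }
    return null // valid JSON, but not shaped like an event
  } catch {
    return null // malformed JSON from the publisher
  }
}

const rawMessages = [
  '{"type":"thread:updated","data":{"status":"running"}}',
  '{bad json',
  '{"type":"thread:updated","data":{"status":"done"}}',
]

// The malformed message is skipped; the messages around it still flow.
const parsed = rawMessages
  .map(safeParseEvent)
  .filter((event): event is ParsedEvent => event !== null)
```

In a real pipeline the skipped message would also be logged, in line with the drop logging described in section 3.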

## 5. Clean up dead code (`isArtifact`)

Pre-existing unused function, flagged by Biome on every lint run, in
`dashboard/src/components/user-event-provider/event-handlers.ts:192`.
Trivial removal in a separate refactor commit.