# SSE Resilience Improvements — Future PRs
Findings from investigating dropped agent events in PR #6692.
## 1. Event replay gap for thread status updates (HIGH)
`emitAgentStatusEvent` in `apps/api/src/agents/obvious-v2/state/events.ts` calls
`publishToProjectUsers` without a `tx` parameter, so `thread:updated` events aren't
stored for replay. If a client reconnects during an agent run, it misses the status
transitions published while it was disconnected, and the UI stays stuck on "thinking" or "running".
**Fix:** Pass `tx` to `publishToProjectUsers` so these events are persisted and
available for replay via `getRecentEvents`.
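
To illustrate the gap, here is a toy in-memory model, not the real implementation. It assumes that `publishToProjectUsers` stores an event for replay only when a `tx` is supplied; the signatures, the `Tx` and `ProjectEvent` types, and the in-memory `replayLog` are all hypothetical.

```typescript
// Toy model only. Signatures and types are assumptions, not the real API.
type Tx = { id: string } // stand-in for a database transaction handle

interface ProjectEvent {
  type: string
  payload: Record<string, unknown>
}

const liveListeners: Array<(e: ProjectEvent) => void> = []
const replayLog: ProjectEvent[] = []

// Assumption: the event is persisted for replay only when `tx` is passed.
function publishToProjectUsers(event: ProjectEvent, tx?: Tx): void {
  if (tx) replayLog.push(event)
  for (const listener of liveListeners) listener(event)
}

// Returns a copy of everything that was persisted for replay.
function getRecentEvents(): ProjectEvent[] {
  return [...replayLog]
}

// Both events are published while no client is connected.
const tx: Tx = { id: 'tx-1' }
publishToProjectUsers({ type: 'thread:updated', payload: { status: 'running' } })
publishToProjectUsers({ type: 'thread:updated', payload: { status: 'done' } }, tx)

// On reconnect, only the event published with `tx` can be replayed.
const replayed = getRecentEvents()
console.log(replayed.map((e) => e.payload.status))
```

In this model the first status transition is lost to any client that was not connected at publish time, which matches the stuck-UI symptom described above.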
## 2. Validation philosophy mismatch (MEDIUM)
`validateEventData` at the SSE boundary is lenient (logs warning, passes through),
but downstream handlers (`validateResource`, `hasResourceId`) are strict (drop silently).
**Fix:** Make downstream handlers match the lenient pattern — log the issue to Datadog
but still attempt to process. For example, `handleResourceUpsert` could skip Mirror
upsert but still forward to the RxJS event stream so UI components that listen for
specific event types aren't starved.
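
A minimal sketch of what a lenient `handleResourceUpsert` might look like. The event shape, the log message, and the `warnings`, `mirror`, and `forwarded` stand-ins (for Datadog logging, the Mirror store, and the RxJS event stream) are assumptions made for illustration; the real handler may be structured differently.

```typescript
// Hypothetical shapes; the real handler and event types may differ.
interface ResourceEvent {
  type: string
  resource?: { id?: string; [key: string]: unknown }
}

const warnings: string[] = [] // stand-in for log.warn (Datadog)
const mirror = new Map<string, unknown>() // stand-in for the Mirror store
const forwarded: ResourceEvent[] = [] // stand-in for the RxJS event stream

function hasResourceId(
  event: ResourceEvent,
): event is ResourceEvent & { resource: { id: string } } {
  return typeof event.resource?.id === 'string' && event.resource.id.length > 0
}

function handleResourceUpsert(event: ResourceEvent): void {
  if (hasResourceId(event)) {
    mirror.set(event.resource.id, event.resource)
  } else {
    // Lenient: record the problem and skip only the step that needs the id.
    warnings.push(`resource_upsert_skipped: missing resource id (${event.type})`)
  }
  // Forward either way so listeners keyed on event type still see the event.
  forwarded.push(event)
}

handleResourceUpsert({ type: 'resource:upsert', resource: { id: 'r1' } })
handleResourceUpsert({ type: 'resource:upsert', resource: {} })
```

After these two calls, only the valid event reaches the Mirror stand-in, the invalid one produces a warning, and both are forwarded.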
## 3. Drop rate monitoring dashboard (LOW effort, HIGH value)
PR #6692 added `log.warn('sse_event_dropped', {...})` at all validation drop points,
which sends structured context to Datadog RUM as `log_warn` actions.
**Action:** Create a Datadog dashboard/monitor:
- Track `@action.name:log_warn` where `@context.message:sse_event_dropped`
- Group by `@context.reason` to see which validation is dropping most events
- Alert if drop rate exceeds a threshold (e.g., >10 per minute per user)
## 4. `resetOnError` still risky on base observable (MEDIUM)
We protected `formatSSEMessage` with try-catch, but if `eventsService.subscribe`
emits a malformed event from Redis (e.g., bad JSON from the publisher), the base
observable throws and `resetOnError: true` tears down all subscribers.
**Fix:** Add `catchError` on `baseObservable$` before the `map`:
```typescript
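import { EMPTY } from 'rxjs'
import { catchError, map } from 'rxjs/operators'

const sharedObservable$ = baseObservable$.pipe(
  // Log upstream errors here so they never propagate downstream.
  catchError((err) => {
    logger.error({ err: serializeError(err) }, 'SSE base observable error')
    // EMPTY completes the stream instead of erroring it.
    return EMPTY
  }),
  map((eventData) => { ... }),
  ...
)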
```
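
Note that `catchError` returning `EMPTY` ends the stream with a completion, so the source is still finished after the first error. A complementary option, which is a different technique from the fix above, is to guard each message at parse time so that one bad payload is dropped without ending the stream at all. The sketch below is dependency-free and assumes raw messages arrive as JSON strings; `safeParseEvent` and `ParseResult` are hypothetical names.

```typescript
type ParseResult<T> = { ok: true; value: T } | { ok: false; error: string }

// Hypothetical helper: parse one raw message without ever throwing.
function safeParseEvent(raw: string): ParseResult<Record<string, unknown>> {
  try {
    const value: unknown = JSON.parse(raw)
    if (typeof value !== 'object' || value === null || Array.isArray(value)) {
      return { ok: false, error: 'payload is not a JSON object' }
    }
    return { ok: true, value: value as Record<string, unknown> }
  } catch (err) {
    return { ok: false, error: err instanceof Error ? err.message : String(err) }
  }
}

// One malformed message sits between two valid ones.
const rawMessages = [
  '{"type":"thread:updated"}',
  '{not json',
  '{"type":"resource:upsert"}',
]
const delivered: Record<string, unknown>[] = []
const dropped: string[] = []

for (const raw of rawMessages) {
  const result = safeParseEvent(raw)
  if (result.ok) delivered.push(result.value)
  else dropped.push(result.error) // would be a log call in practice
}
```

Here the malformed message is recorded in `dropped` and the message after it is still delivered.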
## 5. Clean up dead code (`isArtifact`)
Pre-existing unused function flagged by Biome on every lint run in
`dashboard/src/components/user-event-provider/event-handlers.ts:192`.
Trivial removal in a separate refactor commit.