LiveStore: Leader-Exit Handoff Stall - Root Cause Analysis

Executive Summary

When a leader session exits while a follower has pending events, the follower's attempt to become the new leader stalls indefinitely. This is caused by the Effect Worker mechanism not detecting worker termination and propagating it to pending streams.

Impact: Pending events remain stuck and GetLeaderSyncState times out after 1s.

Root Cause: Effect Worker streams don't error out when the underlying worker terminates, causing the SharedWorker to block new requests while old ones hang.


Problem Statement

When the leader session exits while a follower has pending events, the follower's attempt to become the new leader stalls. The GetLeaderSyncState request times out, leaving the follower unable to sync its pending events.

Reproduction

CI=1 LIVESTORE_SYNC_LEADER_EXIT_REPRO=1 direnv exec . bunx vitest run \
  tests/integration/src/tests/adapter-web/adapter-web.test.ts \
  --testNamePattern "leader exit"

Expected vs Actual Behavior

Expected:

  1. Follower acquires the lock
  2. Follower updates its message port to become the new leader
  3. Follower syncs its pending events through the new leader
  4. GetLeaderSyncState returns successfully
  5. Pending count drops to 0

Actual:

  1. Follower acquires the lock ✓
  2. Follower tries to update message port ✓
  3. STALL: GetLeaderSyncState request never returns
  4. Timeout after 1s: TimeoutException: Operation timed out after '1s'
  5. Pending count stays at 5

Architecture Overview

Key Components

  1. SharedWorker (make-shared-worker.ts):

    • Singleton worker shared across all tabs
    • Maintains leaderWorkerContextSubRef - a subscription ref holding the current leader worker
    • Routes requests to the current leader via forwardRequest / forwardRequestStream
    • Has resetCurrentWorkerCtx to tear down the previous leader's worker context when a new one takes over
  2. Client Session (persisted-adapter.ts):

    • Each tab has a client session
    • Uses WebLock for leader election: the first tab to acquire the lock becomes the leader (see the Web Locks sketch after this list)
    • When becoming leader, calls UpdateMessagePort to register with SharedWorker
    • Uses waitForSharedWorkerInitialized deferred to gate requests
  3. Leader Worker (make-leader-worker.ts):

    • Dedicated worker per tab that only runs while the tab is the leader
    • Handles sync operations, storage, etc.
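
To make the leader-election mechanics concrete, here is a minimal sketch using the browser Web Locks API directly. LiveStore's actual WebLock helper is not shown in this document, so the wrapper name and lock name below (runWhenLeader, livestore-leader-…) are illustrative assumptions.

// Minimal Web Locks sketch (assumed names; not LiveStore's actual WebLock helper)
const runWhenLeader = (storeId: string, becomeLeader: () => Promise<void>) =>
  navigator.locks.request(`livestore-leader-${storeId}`, { mode: 'exclusive' }, async () => {
    // The callback only runs once the previous holder (the old leader) releases the
    // lock, i.e. when its tab closes or it shuts down.
    await becomeLeader()
    // Keep the callback (and therefore the lock) alive for as long as this tab leads.
    await new Promise<never>(() => {})
  })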

Normal Handoff Flow

  1. Leader (tab A) holds the lock
  2. Follower (tab B) waits on WebLock.waitForDeferredLock
  3. Leader exits (shutdown or tab close)
  4. Lock is released
  5. Follower acquires lock → enters runLocked
  6. Follower creates new dedicated worker + MessageChannel
  7. Follower calls UpdateMessagePort on SharedWorker
  8. SharedWorker's UpdateMessagePort handler (sketched after this list):
    a. Calls resetCurrentWorkerCtx to close the previous worker scope
    b. Creates a new worker pool from the new port
    c. Sets leaderWorkerContextSubRef to the new worker
  9. Follower resolves waitForSharedWorkerInitialized
  10. Follower's pending requests can now flow through SharedWorker
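
A hedged sketch of the handler shape in step 8, using the names from the component list above and assuming resetCurrentWorkerCtx is an Effect; makeLeaderWorkerPool stands in for the real pool construction in make-shared-worker.ts, which is not shown in this document.

const onUpdateMessagePort = (newPort: MessagePort) =>
  Effect.gen(function* () {
    // (a) tear down the previous leader's worker scope, if any
    yield* resetCurrentWorkerCtx

    // (b) build a fresh serialized worker pool over the new port, owned by a scope
    //     that can be closed on the next handoff
    const scope = yield* Scope.make()
    const worker = yield* makeLeaderWorkerPool(newPort).pipe(Scope.extend(scope))

    // (c) publish the new leader context so forwardRequest / forwardRequestStream,
    //     which wait on the SubscriptionRef, start routing to the new leader
    yield* SubscriptionRef.set(leaderWorkerContextSubRef, { worker, scope })
  })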

Root Cause Analysis

Key Evidence from Logs

4 x forwardRequestStream:start  → streams initiated
3 x forwardRequestStream:shutdown → only 3 completed
1 x updateMessagePort:start (Tab A's only) → Tab B's never reaches handler

Tab B logs "TMP shared worker update port start" from the window (client side), then nothing: the message never reaches the SharedWorker handler.

The Actual Deadlock

Tab B Client                    SharedWorker                      Tab A Worker
     |                               |                                 |
     |--- PullStream --------------->|                                 |
     |                               |--- worker.execute(PullStream) ->|
     |                               |     (waiting for events...)     |
     |                               |                                 X (terminated)
     |                               |     (stream stuck, no error)    |
     |--- UpdateMessagePort -------->|                                 |
     |    (blocked? queued?)         |                                 |

Root Cause Explanation

The issue is NOT in resetCurrentWorkerCtx or Scope.close. It occurs earlier:

  1. Tab B's Worker.makePoolSerialized creates a pool connected to the SharedWorker
  2. Tab B's PullStream request is issued via runInWorkerStream (creates an active stream fiber in client)
  3. When Tab A (leader) exits, the SharedWorker's worker pool to Tab A's dedicated worker terminates
  4. The SharedWorker's forwardRequestStream for Tab B's PullStream is stuck:
    • It called worker.execute(req) where worker is the pool to Tab A's dedicated worker
    • That worker is now terminated, but the Effect Worker mechanism doesn't propagate the termination as an error
    • The stream just hangs indefinitely
  5. When Tab B acquires the lock and tries to call UpdateMessagePort:
    • Tab B's client-side worker pool has in-flight requests (the stuck PullStream)
    • The Effect Worker serialization blocks new requests while old ones are pending

The core issue: The Effect Worker mechanism doesn't detect worker termination and propagate it to pending streams/requests.
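
To make the failure mode concrete, here is a small self-contained Effect sketch (not LiveStore code): a stream backed by a terminated worker neither emits nor fails, so only an external timeout ever observes the hang, which is exactly the TimeoutException the follower sees.

import { Effect, Stream } from 'effect'

// Stand-in for a stream backed by a terminated dedicated worker: the MessagePort is
// gone, so no messages ever arrive and no error is ever raised by the Worker layer.
const streamFromDeadWorker = Stream.async<never>(() => {
  // never emits, never fails, never ends
})

// Draining it suspends forever; the only thing that ever notices is a timeout.
const probe = Stream.runDrain(streamFromDeadWorker).pipe(Effect.timeout('1 second'))

Effect.runPromise(probe).catch((err) => console.error(err)) // rejects after ~1s with a TimeoutException failure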


Test Scenario Details

The test ("client session sync pending sticks after leader exit (explicit)") does the following:

  1. Opens two tabs (page1/A, page2/B) with boot delay of 80ms for B
  2. Tab A becomes leader first, Tab B waits for lock
  3. Both tabs create 5 todos (commits) each
  4. Tab B (follower) syncs 5 pending events through Tab A's leader
  5. Test sends shutdown message to leader (Tab A) only
  6. Tab B acquires lock and tries to become new leader
  7. STALL: Tab B's UpdateMessagePort never completes
  8. Tab B's GetLeaderSyncState request times out after 1s
  9. Tab B's pending events (5) remain stuck

Key parameters:

  • baseQuery: barrier=1&commitCount=5&timeoutMs=8000&disableFastPath=1&manualShutdown=1
  • Tab A: sessionId=a&clientId=A&bootDelayMs=0
  • Tab B: sessionId=b&clientId=B&bootDelayMs=80

Potential Fixes

Option 1: Handle worker termination in Effect Worker

The root issue is that @effect/platform's Worker module doesn't detect when the underlying worker terminates. When Tab A's dedicated worker terminates, pending requests/streams to that worker should error out.

This would require changes to @effect/platform or a wrapper layer.
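
One possible shape for such a wrapper layer, as a hedged sketch: merge every worker-backed stream with a Deferred that the teardown path fails once the dedicated worker is known to be gone. WorkerTerminatedError, failOnTermination, and the Deferred wiring are assumptions for illustration, not existing LiveStore or @effect/platform APIs.

import { Data, Deferred, Stream } from 'effect'

// Assumed error type for this sketch
class WorkerTerminatedError extends Data.TaggedError('WorkerTerminatedError')<{}> {}

// While the Deferred is pending, the right-hand stream stays silent; once the
// teardown path fails it, the merged stream fails too, waking up any caller that
// was blocked on the in-flight request.
const failOnTermination = <A, E, R>(
  stream: Stream.Stream<A, E, R>,
  terminated: Deferred.Deferred<never, WorkerTerminatedError>,
) => Stream.merge(stream, Stream.fromEffect(Deferred.await(terminated)), { haltStrategy: 'either' })

// Teardown side (wherever the leader worker's exit is observed):
//   yield* Deferred.fail(terminated, new WorkerTerminatedError())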

Option 2: Add timeout to forwardRequestStream

Add a timeout or interrupt mechanism to detect when the underlying worker stops responding:

const forwardRequestStream = <TReq>(req: TReq) =>
  Effect.gen(function* () {
    // ...existing code...
    const stream = worker.execute(req)

    // Add a heartbeat/timeout mechanism; scopeShutdownStream stands for a stream
    // that ends when the forwarding scope shuts down
    return Stream.merge(stream, scopeShutdownStream, { haltStrategy: 'either' }).pipe(
      Stream.timeoutFail(() => new UnknownError({ cause: 'Worker stream timed out' }), Duration.seconds(5)),
    )
  })

However, this doesn't solve the root cause and would slow down legitimate operations.

Option 3: Interrupt pending streams before UpdateMessagePort

Before calling UpdateMessagePort, cancel any pending streams:

// In persisted-adapter.ts, before calling UpdateMessagePort
yield* interruptPendingStreams()
yield* sharedWorker.executeEffect(new WorkerSchema.SharedWorkerUpdateMessagePort(...))

This requires tracking active stream fibers and interrupting them.
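
A hedged sketch of the tracking side, using a FiberSet from effect for the bookkeeping; the call sites and the interruptPendingStreams helper above are assumptions, not existing LiveStore code.

import { Effect, FiberSet, Stream } from 'effect'

const example = Effect.scoped(
  Effect.gen(function* () {
    // Every forwarded/pulled stream gets forked into this set instead of ad hoc.
    const pendingStreams = yield* FiberSet.make()

    // Stream.never stands in for a long-lived PullStream request.
    yield* FiberSet.run(pendingStreams, Stream.runDrain(Stream.never))

    // interruptPendingStreams() would boil down to interrupting everything in the set...
    yield* FiberSet.clear(pendingStreams)

    // ...after which it is safe to send UpdateMessagePort.
  }),
)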

Option 4: Use separate channel for UpdateMessagePort

The UpdateMessagePort call is critical and should not be blocked by other operations. Use a separate worker pool or messaging channel for this:

// Create a dedicated channel for control messages
const controlWorker = yield* Worker.makeSerialized<ControlMessages>({ ... })
yield* controlWorker.executeEffect(new UpdateMessagePort(...))

Option 5: Make forwardRequestStream detect worker context changes (Recommended)

Instead of waiting on a scope finalizer (which can deadlock), use the leaderWorkerContextSubRef changes:

const forwardRequestStream = <TReq>(req: TReq) =>
  Effect.gen(function* () {
    const context = yield* SubscriptionRef.waitUntil(leaderWorkerContextSubRef, isNotUndefined)
    const { worker, scope } = context

    // Use the subscription to detect when context changes (new leader takes over)
    const contextChangeStream = leaderWorkerContextSubRef.changes.pipe(
      Stream.filter((ctx) => ctx !== context), // Context changed
      Stream.take(1),
      Stream.drain,
    )

    return Stream.merge(worker.execute(req), contextChangeStream, { haltStrategy: 'either' })
  })

Recommended Fix

Option 5 is the cleanest because:

  1. It uses existing state (leaderWorkerContextSubRef) to detect leader changes
  2. No circular dependency (doesn't wait for scope close)
  3. Streams terminate immediately when a new leader takes over
  4. No timeout heuristics needed

Affected Files

  • packages/@livestore/adapter-web/src/web-worker/shared-worker/make-shared-worker.ts
  • packages/@livestore/adapter-web/src/web-worker/leader-worker/make-leader-worker.ts
  • packages/@livestore/adapter-web/src/web-worker/client-session/persisted-adapter.ts
  • packages/@livestore/common/src/sync/ClientSessionSyncProcessor.ts
  • packages/@livestore/common/src/leader-thread/LeaderSyncProcessor.ts

Next Steps

  1. Implement Option 5 in make-shared-worker.ts
  2. Verify the fix works with the existing repro test
  3. Ensure no regressions in other adapter-web tests