LiveStore: Leader-Exit Handoff Stall - Root Cause Analysis

Executive Summary

When a leader session exits while a follower has pending events, the follower's attempt to become the new leader stalls indefinitely. This is caused by the Effect Worker mechanism not detecting worker termination and propagating it to pending streams.

Impact: Pending events remain stuck and GetLeaderSyncState times out after 1s.

Root Cause: Effect Worker streams don't error out when the underlying worker terminates, causing the SharedWorker to block new requests while old ones hang.


Problem Statement

When the leader session exits while a follower has pending events, the follower's attempt to become the new leader stalls. The GetLeaderSyncState request times out, leaving the follower unable to sync its pending events.

Reproduction

CI=1 LIVESTORE_SYNC_LEADER_EXIT_REPRO=1 direnv exec . bunx vitest run \
  tests/integration/src/tests/adapter-web/adapter-web.test.ts \
  --testNamePattern "leader exit"

Expected vs Actual Behavior

Expected:

  1. Follower acquires the lock
  2. Follower updates its message port to become the new leader
  3. Follower syncs its pending events through the new leader
  4. GetLeaderSyncState returns successfully
  5. Pending count drops to 0

Actual:

  1. Follower acquires the lock ✓
  2. Follower tries to update message port ✓
  3. STALL: GetLeaderSyncState request never returns
  4. Timeout after 1s: TimeoutException: Operation timed out after '1s'
  5. Pending count stays at 5

Architecture Overview

Key Components

  1. SharedWorker (make-shared-worker.ts):

    • Singleton worker shared across all tabs
    • Maintains leaderWorkerContextSubRef - a subscription ref holding the current leader worker
    • Routes requests to the current leader via forwardRequest / forwardRequestStream
    • Has resetCurrentWorkerCtx to tear down the previous leader's worker context when a new one takes over
  2. Client Session (persisted-adapter.ts):

    • Each tab has a client session
    • Uses WebLock for leader election: the first tab to acquire the lock becomes the leader (see the Web Locks sketch after this list)
    • When becoming leader, calls UpdateMessagePort to register with SharedWorker
    • Uses waitForSharedWorkerInitialized deferred to gate requests
  3. Leader Worker (make-leader-worker.ts):

    • Dedicated worker per tab that only runs while the tab is the leader
    • Handles sync operations, storage, etc.
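
To make the leader-election mechanics concrete, here is a minimal sketch using the browser Web Locks API directly. LiveStore's actual WebLock helper is not shown in this document, so the wrapper name and lock name below (runWhenLeader, livestore-leader-…) are illustrative assumptions.

// Minimal Web Locks sketch (assumed names; not LiveStore's actual WebLock helper)
const runWhenLeader = (storeId: string, becomeLeader: () => Promise<void>) =>
  navigator.locks.request(`livestore-leader-${storeId}`, { mode: 'exclusive' }, async () => {
    // The callback only runs once the previous holder (the old leader) releases the
    // lock, i.e. when its tab closes or it shuts down.
    await becomeLeader()
    // Keep the callback (and therefore the lock) alive for as long as this tab leads.
    await new Promise<never>(() => {})
  })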

Normal Handoff Flow

  1. Leader (tab A) holds the lock
  2. Follower (tab B) waits on WebLock.waitForDeferredLock
  3. Leader exits (shutdown or tab close)
  4. Lock is released
  5. Follower acquires lock → enters runLocked
  6. Follower creates new dedicated worker + MessageChannel
  7. Follower calls UpdateMessagePort on SharedWorker
  8. SharedWorker's UpdateMessagePort handler (sketched after this list):
    a. Calls resetCurrentWorkerCtx to close the previous worker scope
    b. Creates a new worker pool from the new port
    c. Sets leaderWorkerContextSubRef to the new worker
  9. Follower resolves waitForSharedWorkerInitialized
  10. Follower's pending requests can now flow through SharedWorker
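
A hedged sketch of the handler shape in step 8, using the names from the component list above and assuming resetCurrentWorkerCtx is an Effect; makeLeaderWorkerPool stands in for the real pool construction in make-shared-worker.ts, which is not shown in this document.

const onUpdateMessagePort = (newPort: MessagePort) =>
  Effect.gen(function* () {
    // (a) tear down the previous leader's worker scope, if any
    yield* resetCurrentWorkerCtx

    // (b) build a fresh serialized worker pool over the new port, owned by a scope
    //     that can be closed on the next handoff
    const scope = yield* Scope.make()
    const worker = yield* makeLeaderWorkerPool(newPort).pipe(Scope.extend(scope))

    // (c) publish the new leader context so forwardRequest / forwardRequestStream,
    //     which wait on the SubscriptionRef, start routing to the new leader
    yield* SubscriptionRef.set(leaderWorkerContextSubRef, { worker, scope })
  })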

Root Cause Analysis

Key Evidence from Logs

4 x forwardRequestStream:start  → streams initiated
3 x forwardRequestStream:shutdown → only 3 completed
1 x updateMessagePort:start (Tab A's only) → Tab B's never reaches handler

Tab B logs "TMP shared worker update port start" from the window (client side), then nothing: the message never reaches the SharedWorker handler.

The Actual Deadlock

Tab B Client                    SharedWorker                      Tab A Worker
     |                               |                                 |
     |--- PullStream --------------->|                                 |
     |                               |--- worker.execute(PullStream) ->|
     |                               |     (waiting for events...)     |
     |                               |                                 X (terminated)
     |                               |     (stream stuck, no error)    |
     |--- UpdateMessagePort -------->|                                 |
     |    (blocked? queued?)         |                                 |

Root Cause Explanation

The issue is NOT in resetCurrentWorkerCtx or Scope.close. It occurs earlier:

  1. Tab B's Worker.makePoolSerialized creates a pool connected to the SharedWorker
  2. Tab B's PullStream request is issued via runInWorkerStream (creates an active stream fiber in client)
  3. When Tab A (leader) exits, the SharedWorker's worker pool to Tab A's dedicated worker terminates
  4. The SharedWorker's forwardRequestStream for Tab B's PullStream is stuck:
    • It called worker.execute(req) where worker is the pool to Tab A's dedicated worker
    • That worker is now terminated, but the Effect Worker mechanism doesn't propagate the termination as an error
    • The stream just hangs indefinitely
  5. When Tab B acquires the lock and tries to call UpdateMessagePort:
    • Tab B's client-side worker pool has in-flight requests (the stuck PullStream)
    • The Effect Worker serialization blocks new requests while old ones are pending

The core issue: The Effect Worker mechanism doesn't detect worker termination and propagate it to pending streams/requests.
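
To make the failure mode concrete, here is a small self-contained Effect sketch (not LiveStore code): a stream backed by a terminated worker neither emits nor fails, so only an external timeout ever observes the hang, which is exactly the TimeoutException the follower sees.

import { Effect, Stream } from 'effect'

// Stand-in for a stream backed by a terminated dedicated worker: the MessagePort is
// gone, so no messages ever arrive and no error is ever raised by the Worker layer.
const streamFromDeadWorker = Stream.async<never>(() => {
  // never emits, never fails, never ends
})

// Draining it suspends forever; the only thing that ever notices is a timeout.
const probe = Stream.runDrain(streamFromDeadWorker).pipe(Effect.timeout('1 second'))

Effect.runPromise(probe).catch((err) => console.error(err)) // rejects after ~1s with a TimeoutException failure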


Test Scenario Details

The test ("client session sync pending sticks after leader exit (explicit)") does the following:

  1. Opens two tabs (page1/A, page2/B) with boot delay of 80ms for B
  2. Tab A becomes leader first, Tab B waits for lock
  3. Both tabs create 5 todos (commits) each
  4. Tab B (follower) syncs 5 pending events through Tab A's leader
  5. Test sends shutdown message to leader (Tab A) only
  6. Tab B acquires lock and tries to become new leader
  7. STALL: Tab B's UpdateMessagePort never completes
  8. Tab B's GetLeaderSyncState request times out after 1s
  9. Tab B's pending events (5) remain stuck

Key parameters:

  • baseQuery: barrier=1&commitCount=5&timeoutMs=8000&disableFastPath=1&manualShutdown=1
  • Tab A: sessionId=a&clientId=A&bootDelayMs=0
  • Tab B: sessionId=b&clientId=B&bootDelayMs=80

Potential Fixes

Option 1: Handle worker termination in Effect Worker

The root issue is that @effect/platform's Worker module doesn't detect when the underlying worker terminates. When Tab A's dedicated worker terminates, pending requests/streams to that worker should error out.

This would require changes to @effect/platform or a wrapper layer.
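
One possible shape for such a wrapper layer, as a hedged sketch: merge every worker-backed stream with a Deferred that the teardown path fails once the dedicated worker is known to be gone. WorkerTerminatedError, failOnTermination, and the Deferred wiring are assumptions for illustration, not existing LiveStore or @effect/platform APIs.

import { Data, Deferred, Stream } from 'effect'

// Assumed error type for this sketch
class WorkerTerminatedError extends Data.TaggedError('WorkerTerminatedError')<{}> {}

// While the Deferred is pending, the right-hand stream stays silent; once the
// teardown path fails it, the merged stream fails too, waking up any caller that
// was blocked on the in-flight request.
const failOnTermination = <A, E, R>(
  stream: Stream.Stream<A, E, R>,
  terminated: Deferred.Deferred<never, WorkerTerminatedError>,
) => Stream.merge(stream, Stream.fromEffect(Deferred.await(terminated)), { haltStrategy: 'either' })

// Teardown side (wherever the leader worker's exit is observed):
//   yield* Deferred.fail(terminated, new WorkerTerminatedError())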

Option 2: Add timeout to forwardRequestStream

Add a timeout or interrupt mechanism to detect when the underlying worker stops responding:

const forwardRequestStream = <TReq>(req: TReq) =>
  Effect.gen(function* () {
    // ...existing code...
    const stream = worker.execute(req)

    // Add a heartbeat/timeout mechanism; scopeShutdownStream stands for a stream
    // that ends when the forwarding scope shuts down
    return Stream.merge(stream, scopeShutdownStream, { haltStrategy: 'either' }).pipe(
      Stream.timeoutFail(() => new UnknownError({ cause: 'Worker stream timed out' }), Duration.seconds(5)),
    )
  })

However, this doesn't solve the root cause and would slow down legitimate operations.

Option 3: Interrupt pending streams before UpdateMessagePort

Before calling UpdateMessagePort, cancel any pending streams:

// In persisted-adapter.ts, before calling UpdateMessagePort
yield* interruptPendingStreams()
yield* sharedWorker.executeEffect(new WorkerSchema.SharedWorkerUpdateMessagePort(...))

This requires tracking active stream fibers and interrupting them.
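
A hedged sketch of the tracking side, using a FiberSet from effect for the bookkeeping; the call sites and the interruptPendingStreams helper above are assumptions, not existing LiveStore code.

import { Effect, FiberSet, Stream } from 'effect'

const example = Effect.scoped(
  Effect.gen(function* () {
    // Every forwarded/pulled stream gets forked into this set instead of ad hoc.
    const pendingStreams = yield* FiberSet.make()

    // Stream.never stands in for a long-lived PullStream request.
    yield* FiberSet.run(pendingStreams, Stream.runDrain(Stream.never))

    // interruptPendingStreams() would boil down to interrupting everything in the set...
    yield* FiberSet.clear(pendingStreams)

    // ...after which it is safe to send UpdateMessagePort.
  }),
)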

Option 4: Use separate channel for UpdateMessagePort

The UpdateMessagePort call is critical and should not be blocked by other operations. Use a separate worker pool or messaging channel for this:

// Create a dedicated channel for control messages
const controlWorker = yield* Worker.makeSerialized<ControlMessages>({ ... })
yield* controlWorker.executeEffect(new UpdateMessagePort(...))

Option 5: Make forwardRequestStream detect worker context changes (Recommended)

Instead of waiting on a scope finalizer (which can deadlock), use the leaderWorkerContextSubRef changes:

const forwardRequestStream = <TReq>(req: TReq) =>
  Effect.gen(function* () {
    const context = yield* SubscriptionRef.waitUntil(leaderWorkerContextSubRef, isNotUndefined)
    const { worker, scope } = context

    // Use the subscription to detect when context changes (new leader takes over)
    const contextChangeStream = leaderWorkerContextSubRef.changes.pipe(
      Stream.filter((ctx) => ctx !== context), // Context changed
      Stream.take(1),
      Stream.drain,
    )

    return Stream.merge(worker.execute(req), contextChangeStream, { haltStrategy: 'either' })
  })

Recommended Fix

Option 5 is the cleanest because:

  1. It uses existing state (leaderWorkerContextSubRef) to detect leader changes
  2. No circular dependency (doesn't wait for scope close)
  3. Streams terminate immediately when a new leader takes over
  4. No timeout heuristics needed

Affected Files

  • packages/@livestore/adapter-web/src/web-worker/shared-worker/make-shared-worker.ts
  • packages/@livestore/adapter-web/src/web-worker/leader-worker/make-leader-worker.ts
  • packages/@livestore/adapter-web/src/web-worker/client-session/persisted-adapter.ts
  • packages/@livestore/common/src/sync/ClientSessionSyncProcessor.ts
  • packages/@livestore/common/src/leader-thread/LeaderSyncProcessor.ts

Next Steps

  1. Implement Option 5 in make-shared-worker.ts
  2. Verify the fix works with the existing repro test
  3. Ensure no regressions in other adapter-web tests