
Voice Bridge Latency Architecture Review

Deep-dive analysis of every source of overhead in a Twilio → Pipecat → LLM voice pipeline.

Stack: Twilio Media Streams · Pipecat · OpenAI Whisper STT · Kimi K2.5 (Fireworks) LLM · ElevenLabs TTS

Repo: voice — an OpenClaw skill providing caller-aware phone calls with tool calling and tiered access control.


End-to-End Call Timeline

PHASE                           LATENCY         CUMULATIVE
─────────────────────────────────────────────────────────────
Twilio webhook received         —               t=0ms
├─ Form parsing                 1–5ms           ~3ms
├─ HMAC-SHA1 signature verify   2–5ms           ~7ms
├─ Caller lookup (linear scan)  0–10ms          ~12ms
├─ TwiML XML generation         1–2ms           ~14ms
└─ HTTP 200 → Twilio            <1ms            ~15ms

Twilio initiates WebSocket      —               t=15ms
├─ DNS + TLS handshake          50–150ms        ~115ms
├─ Tailscale Funnel proxy hop   2–6ms           ~119ms
├─ WebSocket accept             1–5ms           ~122ms
└─ Await Twilio "start" msg     50–200ms        ~222ms

Pipeline creation               —               t=222ms
├─ System prompt file I/O       15–75ms         ~272ms
├─ Service instantiation        3–5ms           ~275ms
├─ OpenClaw gateway handshake   11–25ms         ~293ms
├─ Pipeline wiring + task       2–5ms           ~296ms
└─ VAD model load (first call)  50–100ms        ~346ms*

Greeting generation             —               t=296ms
├─ LLM prompt + first token     180–500ms       ~596ms
├─ TTS connection + first chunk 150–250ms       ~746ms
├─ AudioDownsampler (16→8kHz)   <1ms            ~746ms
└─ Twilio → caller speaker      10–100ms        ~806ms

CALLER HEARS GREETING                           ~0.6–1.5s
─────────────────────────────────────────────────────────────

User speaks + silence detected  —               varies
├─ VAD silence buffer           1200ms FIXED    +1200ms
├─ Whisper STT inference        250–800ms       +500ms avg
├─ LLM first token              180–500ms       +350ms avg
├─ TTS first audio chunk        150–250ms       +200ms avg
└─ Network + Twilio playback    10–100ms        +50ms avg

EAR-TO-EAR RESPONSE LATENCY                    ~2.0–3.5s
─────────────────────────────────────────────────────────────

With ask_ron tool call           +100–5000ms    +2s avg
With get_date_time tool call     <1ms           negligible

* VAD model cached after first call.


Component-Level Analysis

1. VAD Silence Detection — 1200ms Fixed Floor

VADParams(confidence=0.8, stop_secs=1.2)

The Silero VAD analyzer runs on every 20ms audio frame (~5–10ms CPU per frame, 25–50% of one core). It waits for 1.2 seconds of continuous silence before declaring end-of-speech.

This is the single largest latency contributor and it's intentional — lower values cause false triggers when users pause mid-thought. The trade-off:

| stop_secs      | Behavior                | Risk                       |
|----------------|-------------------------|----------------------------|
| 0.5s           | Snappy, cuts off pauses | Interrupts thinking pauses |
| 0.8s           | Balanced                | Occasional false triggers  |
| 1.2s (current) | Conservative            | Guaranteed 1.2s floor      |
| 1.5s+          | Sluggish                | Poor UX                    |
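
For reference, this is roughly how the knob is wired up in Pipecat (import paths vary between Pipecat releases; treat this as illustrative rather than the repo's exact code):

# Illustrative VAD configuration; stop_secs is the trade-off in the table above.
from pipecat.audio.vad.silero import SileroVADAnalyzer
from pipecat.audio.vad.vad_analyzer import VADParams

vad_analyzer = SileroVADAnalyzer(
    params=VADParams(
        confidence=0.8,  # per-frame speech-probability threshold
        stop_secs=1.2,   # continuous silence required before end-of-speech fires
    )
)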

2. Audio Codec Chain — The Soxr Saga

Twilio 8kHz G.711 μ-law
    → TwilioFrameSerializer (decode to PCM-16)
    → Pipeline internal (8kHz PCM)
    → STT (8kHz PCM → text)
    → LLM (text → text)
    → TTS (text → 16kHz PCM streaming chunks)
    → AudioDownsampler (16kHz → 8kHz, stateless, <1ms/chunk)
    → TwilioFrameSerializer (encode to G.711 μ-law)
    → Twilio 8kHz G.711 μ-law

Why 16kHz TTS → 8kHz transport? Two recent critical fixes:

  • fafbadf: Pipecat's built-in SOXRStreamAudioResampler buffers ~11 consecutive 10ms chunks (~110ms) before producing output. With ElevenLabs streaming small chunks, this caused complete silence — the resampler accumulated audio but never flushed it. Fix: replace soxr with a stateless AudioDownsampler using audioop.ratecv that produces output on every single input chunk.

  • ea1680a: Setting audio_out_sample_rate=16000 (to match TTS) inadvertently activated hidden soxr resamplers inside both BaseOutputTransport and TwilioFrameSerializer. Fix: set audio_out_sample_rate=8000 so both become pass-throughs, since AudioDownsampler already converts to 8kHz before they see the frames.

# The workaround pipeline (pipeline.py):
ElevenLabsTTSService(sample_rate=16000)    # TTS outputs 16kHz
AudioDownsampler(target_rate=8000)         # Immediate 16→8kHz, no buffering
Transport(audio_out_sample_rate=8000)      # Pass-through, no soxr

AudioDownsampler per-chunk overhead: audioop.ratecv on a 160-sample chunk takes 0.1–0.5ms. Stateless (no ring buffer, no history). Every input chunk produces an output chunk.
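
For reference, a minimal sketch of the stateless approach (illustrative, not the repo's exact class; audioop is stdlib but was removed in Python 3.13):

import audioop

class StatelessDownsampler:
    """Illustrative 16 kHz → 8 kHz mono PCM-16 converter: one output chunk per input chunk."""

    def __init__(self, source_rate: int = 16000, target_rate: int = 8000):
        self.source_rate = source_rate
        self.target_rate = target_rate

    def process(self, chunk: bytes) -> bytes:
        # width=2 (16-bit samples), 1 channel, state=None: nothing is carried
        # between chunks, so output is produced immediately instead of
        # accumulating ~110ms the way the soxr-based resampler did.
        converted, _ = audioop.ratecv(chunk, 2, 1, self.source_rate, self.target_rate, None)
        return converted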

3. Context Building — Synchronous File I/O on Every Call

# pipeline.py:114 — called per-call, blocking
system_prompt = build_system_prompt(caller, config.workspace, config.context_dir)

build_system_prompt() performs synchronous disk reads:

| File                   | When Loaded       | Typical Size   |
|------------------------|-------------------|----------------|
| SOUL.md                | Always            | 300–500 tokens |
| USER.md                | Owner tier only   | 200–500 tokens |
| caller.context_file    | Trusted tier only | 200–500 tokens |
| context_dir/*.md (all) | If configured     | Unbounded      |

All reads use plain open() / f.read() — no async I/O, no caching.

Context directory is unbounded: os.listdir(ctx_dir) loads every .md file alphabetically with no size limit. A digests directory with 12 monthly reports (500 tokens each) adds 6,000 tokens to every call's system prompt.
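
A bounded loader (a sketch of the size-limit/allowlist fix recommended below; the caps are arbitrary placeholders, not values from the repo) could look like:

import os

MAX_FILES = 5               # hypothetical cap on file count
MAX_CHARS_PER_FILE = 4000   # ≈ 1,000 tokens, hypothetical cap

def load_context_dir(ctx_dir: str) -> str:
    # Filter to .md first, then cap count and per-file size instead of
    # concatenating every file unconditionally.
    names = sorted(n for n in os.listdir(ctx_dir) if n.endswith(".md"))[:MAX_FILES]
    parts = []
    for name in names:
        with open(os.path.join(ctx_dir, name), encoding="utf-8") as f:
            parts.append(f.read(MAX_CHARS_PER_FILE))
    return "\n\n".join(parts)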

Token budget by caller tier:

| Tier    | Base Tokens | With Context Dir   | Notes                                         |
|---------|-------------|--------------------|-----------------------------------------------|
| Guest   | ~480        | ~480 + context_dir | No USER.md, no tools                          |
| Trusted | ~610        | ~610 + context_dir | Caller-specific context + hallucination guard |
| Owner   | ~690        | ~690 + context_dir | Full USER.md + tool instructions              |

Latency: 5–75ms per call depending on number and size of files. On local SSD this is fast; on NFS it could be 50ms+ per file.
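
A per-caller cache keyed on file modification times would take this off the call path entirely. A rough sketch, with hypothetical helper names (prompt_files is an assumed list of the contributing paths):

import os

_prompt_cache: dict[str, tuple[tuple[float, ...], str]] = {}

def cached_system_prompt(caller, workspace, context_dir, prompt_files) -> str:
    # Rebuild only when any contributing file's mtime changes.
    mtimes = tuple(os.path.getmtime(p) for p in prompt_files if os.path.exists(p))
    key = caller.phone if caller else "guest"
    cached = _prompt_cache.get(key)
    if cached and cached[0] == mtimes:
        return cached[1]
    prompt = build_system_prompt(caller, workspace, context_dir)  # existing builder
    _prompt_cache[key] = (mtimes, prompt)
    return prompt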

4. Tool Calling — The Pipeline Stall

Two tools are registered:

get_date_time — Negligible Overhead

Pure local computation via datetime.now(). Lazy-imports zoneinfo. <1ms per call.

ask_ron — Potential 30-Second Pipeline Stall

async def handle_ask_ron(...):
    result = await gateway.ask(request, session_key=call_sid)  # blocks up to 30s
    await result_callback(result)

gateway.ask() sends a WebSocket message to the OpenClaw agent gateway and enters a blocking receive loop:

async with asyncio.timeout(self._timeout):   # 30s default
    while True:
        raw = await self._ws.recv()           # blocks until message
        msg = json.loads(raw)
        if msg["type"] == "res" and msg["id"] == req_id:
            return extract_result(msg)

While this awaits, the entire LLM→TTS path is stalled. The caller hears silence. The transport still receives audio (VAD runs), but no output is produced until the tool completes.

Timeline with ask_ron:
  t=0.0s  User: "What are Tesla's earnings?"
  t=0.5s  STT completes transcription
  t=0.7s  LLM decides to call ask_ron
  t=0.8s  Gateway request sent ──────────┐
  t=3.0s  OpenClaw returns result ◄──────┘  ← 2.2s silence
  t=3.1s  Tool result injected into context
  t=3.3s  LLM re-prompted, starts generating
  t=4.0s  TTS starts streaming to caller

No retry logic, no intermediate audio feedback, no backoff. If the gateway is slow: silence. If it times out (30s): a canned error message after half a minute of dead air.
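
A hedged sketch of the mitigation recommended later (shorter timeout plus filler audio). The handler signature and push_tts_text are stand-ins for whatever the pipeline actually exposes:

import asyncio

FILLER_DELAY_S = 1.5      # speak filler only if the gateway is actually slow
GATEWAY_TIMEOUT_S = 8.0   # instead of the 30s default

async def handle_ask_ron(gateway, request, call_sid, result_callback, push_tts_text):
    async def filler():
        await asyncio.sleep(FILLER_DELAY_S)
        await push_tts_text("Let me check on that for you.")

    filler_task = asyncio.create_task(filler())
    try:
        result = await asyncio.wait_for(
            gateway.ask(request, session_key=call_sid), timeout=GATEWAY_TIMEOUT_S
        )
    except asyncio.TimeoutError:
        result = "Sorry, that lookup is taking too long right now."
    finally:
        filler_task.cancel()
    await result_callback(result)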

Tool Schema Overhead

Both tool schemas (~210 tokens combined) are serialized and sent with every LLM request, even turns where tools aren't called. Over a 10-turn conversation, that's ~2,100 tokens of pure schema overhead. Adds ~200–300ms of prompt encoding/transmission per turn.

Guest tier correctly receives zero tool schemas (no overhead).

5. Gateway Connection — Per-Call WebSocket Handshake

# pipeline.py:127-131 — per-call, blocking pipeline creation
gateway = OpenClawClient.from_config(path=config.gateway.config_path)
await gateway.connect()   # 3-message handshake

Each call creates a new WebSocket connection to the OpenClaw gateway with a 3-step handshake:

  1. Receive connect.challenge from gateway
  2. Send connect with auth token
  3. Receive hello-ok

Cost: 11–25ms per call (localhost WebSocket). No connection pooling — 100 concurrent calls = 100 independent WebSocket connections.

If the gateway is unavailable, the exception is caught and tools silently degrade. The LLM is still told about ask_ron (via the system prompt instructions), but the handler won't be registered, so the LLM may attempt to call a tool that doesn't exist.
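
A persistent, lazily reconnected client would amortize the handshake across calls. Sketch only; the connected attribute checked here is assumed, not a confirmed OpenClawClient API:

import asyncio

_gateway = None
_gateway_lock = asyncio.Lock()

async def get_gateway(config):
    # Reuse one gateway connection per process; redo the 3-message handshake
    # only when the connection is missing or has dropped.
    global _gateway
    async with _gateway_lock:
        if _gateway is None or not _gateway.connected:  # `connected` is assumed
            _gateway = OpenClawClient.from_config(path=config.gateway.config_path)
            await _gateway.connect()
        return _gateway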

6. Security Validation

Twilio signature verification runs on every webhook:

def validate_twilio_signature(signature, url, params, auth_token):
    validator = RequestValidator(auth_token)   # stores token
    return validator.validate(url, params, signature)  # HMAC-SHA1

Cost: 2–5ms (CPU-bound cryptographic operation). Unavoidable for security.

Caller identification uses a linear scan:

def identify_caller(phone, callers_config):
    for caller in callers_config.callers:
        if caller.phone == phone:
            return caller
    return None

Cost: O(n) — with 50 callers, up to 5ms. This runs twice per call (once in webhook handler, once in WebSocket handler when the caller phone is re-extracted from Twilio's custom parameters).
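
The O(1) fix is a phone-keyed index built once at config load (sketch):

# Build once when callers_config is loaded, then reuse for both lookups.
callers_by_phone = {c.phone: c for c in callers_config.callers}

def identify_caller(phone, callers_by_phone):
    return callers_by_phone.get(phone)  # None for unknown callers, as before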

7. Tailscale Funnel Proxy

Twilio → HTTPS → Tailscale Funnel (public IP:443)
                      → Tailscale client (localhost:8765)
                      → FastAPI/Uvicorn

Per-request overhead: 2–6ms (TLS decrypt → localhost forward → TLS encrypt). For the persistent WebSocket, this is a one-time cost plus ~0.5–2ms per audio frame — negligible at 50 Hz frame rate.

8. Interruption Handling

PipelineParams(allow_interruptions=True)

When VAD detects new speech while LLM/TTS are outputting:

  • VAD detection: ~50–200ms
  • Pipecat cancels in-flight LLM + TTS: <10ms
  • Pipeline switches to processing new input
  • Interruption-to-response: 300–1000ms (VAD detect + full STT cycle)

No CPU cost beyond continuous VAD (which runs regardless). ElevenLabs streaming connection is closed on cancellation (~10ms cleanup).


Overhead Summary Table

| Source             | Cost             | Type                | Frequency      | Avoidable?            |
|--------------------|------------------|---------------------|----------------|-----------------------|
| VAD silence buffer | 1200ms           | Config              | Every turn     | Tunable (trade-off)   |
| Whisper STT        | 250–800ms        | Network + inference | Every turn     | Switch to faster STT  |
| LLM first token    | 180–500ms        | Network + inference | Every turn     | Model choice          |
| TTS first chunk    | 150–250ms        | Network + synthesis | Every turn     | Provider choice       |
| System prompt I/O  | 15–75ms          | Disk I/O            | Every call     | Cacheable             |
| Gateway handshake  | 11–25ms          | Network             | Every call     | Pool/reuse connection |
| TwiML + signature  | 8–15ms           | CPU + I/O           | Every call     | Minimal               |
| Caller lookup (×2) | 0–10ms           | CPU                 | Every call     | Use dict              |
| AudioDownsampler   | <1ms/chunk       | CPU                 | Every chunk    | Already optimal       |
| Tailscale Funnel   | 2–6ms            | Network             | Per-request    | Unavoidable           |
| Tool schema tokens | ~210 tokens/turn | Token overhead      | Every LLM call | Trim descriptions     |
| ask_ron tool call  | 100–30,000ms     | Network             | Per tool use   | Reduce timeout        |

Latency Profiles

| Scenario                            | Estimated Ear-to-Ear | Breakdown                                 |
|-------------------------------------|----------------------|-------------------------------------------|
| Best case (short reply, warm cache) | ~2.0s                | 1.2s VAD + 0.3s STT + 0.3s LLM + 0.2s TTS |
| Typical (15-word response)          | ~3.0–3.5s            | 1.2s VAD + 0.5s STT + 0.8s LLM + 0.5s TTS |
| With tool call                      | ~4.0–6.0s            | Above + 1–3s gateway                      |
| Worst case (gateway slow)           | ~6.0–35s             | Above + up to 30s gateway timeout         |
| Greeting (first audio)              | ~0.6–1.5s            | Pipeline setup + LLM + TTS                |

Architectural Strengths

  • Streaming pipeline: LLM tokens flow to TTS immediately — user hears partial responses as they're generated, not after the full response completes. This is the single biggest latency win.
  • Stateless audio resampling: Custom AudioDownsampler using audioop.ratecv produces output on every input chunk with <1ms overhead. No buffering, no accumulated silence.
  • Async throughout: Non-blocking I/O from WebSocket transport through service calls. The event loop stays responsive even during tool execution.
  • Tier-based schema control: Guest callers get zero tool overhead — no schemas in LLM requests, no gateway connection attempt.
  • Soxr bypass: Both hidden soxr resampler instances in Pipecat are configured as pass-throughs via careful sample rate alignment.

Architectural Weaknesses & Recommendations

High Impact

| Issue | Current Cost | Fix | Expected Savings |
|---|---|---|---|
| No system prompt caching | 15–75ms/call | Cache per caller, invalidate on file change | 15–75ms/call |
| ask_ron blocks audio for up to 30s | 0–30s silence | Reduce timeout to 5–10s; play filler audio ("Let me check...") while waiting | Perceived: dramatic |
| No gateway connection pool | 11–25ms/call | Maintain persistent connection, reconnect on failure | 11–25ms/call |
| Linear caller lookup | 0–10ms × 2/call | dict[str, CallerProfile] keyed by phone | ~10ms/call |

Medium Impact

| Issue | Current Cost | Fix | Expected Savings |
|---|---|---|---|
| Unbounded context_dir loading | 5–50ms + token bloat | Size limit, relevance filtering, or allowlist | Variable |
| Tool schema in every request | ~210 tokens/turn | Trim descriptions, or conditional inclusion | ~60 tokens/turn |
| Silent tool degradation | Confusing LLM behavior | Remove ask_ron from schema if gateway fails | Correctness fix |
| No context pruning | Growing context over long calls | Sliding window or summarization for old turns | Prevents slowdown |

Low Impact

| Issue | Current Cost | Fix | Expected Savings |
|---|---|---|---|
| Synchronous file I/O | 5–20ms | aiofiles or thread pool | 5–20ms (marginal) |
| Duplicate caller lookup | 5–10ms | Pass caller profile through custom params | 5–10ms |
| No connection reuse for APIs | ~5ms/call | Verify httpx/aiohttp pooling in Pipecat | ~5ms/call |
| Pipecat version unpinned | Risk of regression | Pin to known-good version | Stability |

What's Tested vs. Untested

| Tested                         | Not Tested                         |
|--------------------------------|------------------------------------|
| Audio downsampler correctness  | End-to-end latency benchmarks      |
| Soxr bypass configuration      | VAD tuning / false trigger rates   |
| Tool availability by tier      | ask_ron timeout behavior           |
| System prompt per tier         | Concurrent call handling           |
| Twilio signature validation    | Interrupt handling (barge-in)      |
| Gateway handshake + ask        | Memory/resource cleanup            |
| Greeting frame production      | LLM streaming time-to-first-token  |
| Caller identification          | Pipeline under load                |

Generated 2025-02-09. Analysis based on commit ea1680a (HEAD of main).
