Deep-dive analysis of every source of overhead in a Twilio → Pipecat → LLM voice pipeline.
Stack: Twilio Media Streams · Pipecat · OpenAI Whisper STT · Kimi K2.5 (Fireworks) LLM · ElevenLabs TTS
Repo: `voice`, an OpenClaw skill providing caller-aware phone calls with tool calling and tiered access control.
```
PHASE                                LATENCY        CUMULATIVE
─────────────────────────────────────────────────────────────
Twilio webhook received              —              t=0ms
├─ Form parsing                      1–5ms          ~3ms
├─ HMAC-SHA1 signature verify        2–5ms          ~7ms
├─ Caller lookup (linear scan)       0–10ms         ~12ms
├─ TwiML XML generation              1–2ms          ~14ms
└─ HTTP 200 → Twilio                 <1ms           ~15ms

Twilio initiates WebSocket           —              t=15ms
├─ DNS + TLS handshake               50–150ms       ~115ms
├─ Tailscale Funnel proxy hop        2–6ms          ~119ms
├─ WebSocket accept                  1–5ms          ~122ms
└─ Await Twilio "start" msg          50–200ms       ~222ms

Pipeline creation                    —              t=222ms
├─ System prompt file I/O            15–75ms        ~272ms
├─ Service instantiation             3–5ms          ~275ms
├─ OpenClaw gateway handshake        11–25ms        ~293ms
├─ Pipeline wiring + task            2–5ms          ~296ms
└─ VAD model load (first call)       50–100ms       ~346ms*

Greeting generation                  —              t=296ms
├─ LLM prompt + first token          180–500ms      ~596ms
├─ TTS connection + first chunk      150–250ms      ~746ms
├─ AudioDownsampler (16→8kHz)        <1ms           ~746ms
└─ Twilio → caller speaker           10–100ms       ~806ms

CALLER HEARS GREETING                               ~0.6–1.5s
─────────────────────────────────────────────────────────────
User speaks + silence detected       —              varies
├─ VAD silence buffer                1200ms FIXED   +1200ms
├─ Whisper STT inference             250–800ms      +500ms avg
├─ LLM first token                   180–500ms      +350ms avg
├─ TTS first audio chunk             150–250ms      +200ms avg
└─ Network + Twilio playback         10–100ms       +50ms avg

EAR-TO-EAR RESPONSE LATENCY                         ~2.0–3.5s
─────────────────────────────────────────────────────────────
With ask_ron tool call               +100–5000ms    +2s avg
With get_date_time tool call         <1ms           negligible

* VAD model cached after first call.
```
```python
VADParams(confidence=0.8, stop_secs=1.2)
```

The Silero VAD analyzer runs on every 20ms audio frame (~5–10ms CPU per frame, 25–50% of one core). It waits for 1.2 seconds of continuous silence before declaring end-of-speech.
This is the single largest latency contributor and it's intentional — lower values cause false triggers when users pause mid-thought. The trade-off:
| stop_secs | Behavior | Risk |
|---|---|---|
| 0.5s | Snappy, cuts off pauses | Interrupts thinking pauses |
| 0.8s | Balanced | Occasional false triggers |
| 1.2s (current) | Conservative | Guaranteed 1.2s floor |
| 1.5s+ | Sluggish | Poor UX |
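For reference, a sketch of where this knob lives, assuming recent Pipecat import paths (they have moved between releases):

```python
# Import paths assume a recent Pipecat release; older versions differ.
from pipecat.audio.vad.silero import SileroVADAnalyzer
from pipecat.audio.vad.vad_analyzer import VADParams

vad_analyzer = SileroVADAnalyzer(
    params=VADParams(
        confidence=0.8,  # per-frame speech-probability threshold
        stop_secs=0.8,   # end-of-speech silence window; 1.2 is the current setting
    )
)
```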
```
Twilio 8kHz G.711 μ-law
  → TwilioFrameSerializer (decode to PCM-16)
  → Pipeline internal (8kHz PCM)
  → STT (8kHz PCM → text)
  → LLM (text → text)
  → TTS (text → 16kHz PCM streaming chunks)
  → AudioDownsampler (16kHz → 8kHz, stateless, <1ms/chunk)
  → TwilioFrameSerializer (encode to G.711 μ-law)
  → Twilio 8kHz G.711 μ-law
```
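The decode step at the edge is cheap; a sketch of what the serializer's μ-law decode amounts to, using the stdlib (Twilio delivers media frames as base64 payloads; `TwilioFrameSerializer`'s actual internals may differ):

```python
import audioop
import base64

def decode_twilio_media(payload_b64: str) -> bytes:
    """One Twilio media frame: base64 → 8kHz G.711 μ-law → 8kHz PCM-16."""
    ulaw_bytes = base64.b64decode(payload_b64)
    return audioop.ulaw2lin(ulaw_bytes, 2)  # width=2 → 16-bit linear PCM
```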
Why 16kHz TTS → 8kHz transport? Two recent critical fixes:

- `fafbadf`: Pipecat's built-in `SOXRStreamAudioResampler` buffers ~11 consecutive 10ms chunks (~110ms) before producing output. With ElevenLabs streaming small chunks, this caused complete silence: the resampler accumulated audio but never flushed it. Fix: replace soxr with a stateless `AudioDownsampler` using `audioop.ratecv` that produces output on every single input chunk.
- `ea1680a`: Setting `audio_out_sample_rate=16000` (to match TTS) inadvertently activated hidden soxr resamplers inside both `BaseOutputTransport` and `TwilioFrameSerializer`. Fix: set `audio_out_sample_rate=8000` so both become pass-throughs, since `AudioDownsampler` already converts to 8kHz before they see the frames.
```
# The workaround pipeline (pipeline.py):
ElevenLabsTTSService(sample_rate=16000)   # TTS outputs 16kHz
→ AudioDownsampler(target_rate=8000)      # Immediate 16→8kHz, no buffering
→ Transport(audio_out_sample_rate=8000)   # Pass-through, no soxr
```

`AudioDownsampler` per-chunk overhead: `audioop.ratecv` on a 160-sample chunk takes 0.1–0.5ms. Stateless (no ring buffer, no history). Every input chunk produces an output chunk.
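A minimal sketch of that approach, assuming mono PCM-16 frames (the repo's `AudioDownsampler` may differ in detail; note that `ratecv` carries a tiny conversion state between calls, but unlike the soxr stream it never withholds output):

```python
import audioop  # stdlib; deprecated in 3.11, removed in 3.13


class AudioDownsampler:
    """Convert 16kHz mono PCM-16 chunks to 8kHz, one output per input."""

    def __init__(self, source_rate: int = 16000, target_rate: int = 8000):
        self._source_rate = source_rate
        self._target_rate = target_rate
        self._state = None  # ratecv's fractional-position state, not a buffer

    def process(self, chunk: bytes) -> bytes:
        # width=2 (16-bit samples), nchannels=1 (mono)
        converted, self._state = audioop.ratecv(
            chunk, 2, 1, self._source_rate, self._target_rate, self._state
        )
        return converted  # emitted immediately; nothing accumulates across chunks
```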
```python
# pipeline.py:114 — called per-call, blocking
system_prompt = build_system_prompt(caller, config.workspace, config.context_dir)
```

`build_system_prompt()` performs synchronous disk reads:
| File | When Loaded | Typical Size |
|---|---|---|
| `SOUL.md` | Always | 300–500 tokens |
| `USER.md` | Owner tier only | 200–500 tokens |
| `caller.context_file` | Trusted tier only | 200–500 tokens |
| `context_dir/*.md` (all) | If configured | Unbounded |
All reads use plain `open()` / `f.read()` — no async I/O, no caching.

Context directory is unbounded: `os.listdir(ctx_dir)` loads every `.md` file alphabetically with no size limit. A digests directory with 12 monthly reports (500 tokens each) adds 6,000 tokens to every call's system prompt.
Token budget by caller tier:
| Tier | Base Tokens | With Context Dir | Notes |
|---|---|---|---|
| Guest | ~480 | ~480 + context_dir | No USER.md, no tools |
| Trusted | ~610 | ~610 + context_dir | Caller-specific context + hallucination guard |
| Owner | ~690 | ~690 + context_dir | Full USER.md + tool instructions |
Latency: 5–75ms per call depending on number and size of files. On local SSD this is fast; on NFS it could be 50ms+ per file.
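The recommendations below suggest caching; a hypothetical mtime-keyed wrapper (only `build_system_prompt` and its arguments come from the repo; the cache, path layout, and invalidation key are illustrative):

```python
import os

_prompt_cache: dict = {}


def cached_system_prompt(caller, workspace, context_dir) -> str:
    """Rebuild the prompt only when SOUL.md changes; serve from memory otherwise.

    A production version would also key on USER.md, the caller's context
    file, and context_dir mtimes.
    """
    soul_path = os.path.join(workspace, "SOUL.md")  # path layout assumed
    key = (caller.phone, os.path.getmtime(soul_path))
    if key not in _prompt_cache:
        _prompt_cache[key] = build_system_prompt(caller, workspace, context_dir)
    return _prompt_cache[key]
```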
Two tools are registered: `get_date_time` and `ask_ron`.

`get_date_time` is pure local computation via `datetime.now()`, with a lazy import of `zoneinfo`. <1ms per call.
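A plausible shape for that handler (illustrative; the repo's actual signature and timezone handling may differ):

```python
async def handle_get_date_time(params, result_callback):
    # Lazy imports, as noted above; kept out of module scope.
    from datetime import datetime
    from zoneinfo import ZoneInfo

    # Timezone choice is an assumption; the repo may read it from config.
    now = datetime.now(ZoneInfo("America/New_York"))
    await result_callback(now.strftime("%A, %B %d, %Y at %I:%M %p %Z"))
```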
`ask_ron` is another matter:

```python
async def handle_ask_ron(...):
    result = await gateway.ask(request, session_key=call_sid)  # blocks up to 30s
    await result_callback(result)
```

`gateway.ask()` sends a WebSocket message to the OpenClaw agent gateway and enters a blocking receive loop:
```python
async with asyncio.timeout(self._timeout):  # 30s default
    while True:
        raw = await self._ws.recv()  # blocks until message
        msg = json.loads(raw)
        if msg["type"] == "res" and msg["id"] == req_id:
            return extract_result(msg)
```

While this awaits, the entire LLM→TTS path is stalled. The caller hears silence. The transport still receives audio (VAD runs), but no output is produced until the tool completes.
Timeline with ask_ron:
```
t=0.0s  User: "What are Tesla's earnings?"
t=0.5s  STT completes transcription
t=0.7s  LLM decides to call ask_ron
t=0.8s  Gateway request sent ──────────┐
t=3.0s  OpenClaw returns result ◄──────┘  ← 2.2s silence
t=3.1s  Tool result injected into context
t=3.3s  LLM re-prompted, starts generating
t=4.0s  TTS starts streaming to caller
```
No retry logic, no intermediate audio feedback, no backoff. If gateway is slow: silence. If gateway times out (30s): canned error message after half a minute of dead air.
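A hedged sketch of the obvious mitigation (recommended below): speak a filler line, then bound the gateway wait. The `speak` helper and the 8-second budget are assumptions, not repo code:

```python
import asyncio


async def handle_ask_ron_with_filler(gateway, speak, request, call_sid, result_callback):
    # 'speak' is an assumed helper that queues a TTS utterance into the pipeline.
    await speak("Let me check on that.")
    try:
        result = await asyncio.wait_for(
            gateway.ask(request, session_key=call_sid),
            timeout=8.0,  # far below the current 30s ceiling
        )
    except asyncio.TimeoutError:
        result = "Sorry, that lookup is taking too long right now."
    await result_callback(result)
```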
Both tool schemas (~210 tokens combined) are serialized and sent with every LLM request, even turns where tools aren't called. Over a 10-turn conversation, that's ~2,100 tokens of pure schema overhead. Adds ~200–300ms of prompt encoding/transmission per turn.
Guest tier correctly receives zero tool schemas (no overhead).
```python
# pipeline.py:127-131 — per-call, blocking pipeline creation
gateway = OpenClawClient.from_config(path=config.gateway.config_path)
await gateway.connect()  # 3-message handshake
```

Each call creates a new WebSocket connection to the OpenClaw gateway with a 3-step handshake:
- Receive `connect.challenge` from gateway
- Send `connect` with auth token
- Receive `hello-ok`
Cost: 11–25ms per call (localhost WebSocket). No connection pooling — 100 concurrent calls = 100 independent WebSocket connections.
If gateway is unavailable, exception is caught and tools silently degrade. The LLM still has ask_ron in its tool schema (from the system prompt instructions) but the handler won't be registered — the LLM may attempt to call a tool that doesn't exist.
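The recommendations below call for a persistent connection. A minimal sketch, assuming `OpenClawClient` tolerates being shared across calls (the module-level global, lock, and `connected` attribute are illustrative):

```python
import asyncio

_shared_gateway = None
_gateway_lock = asyncio.Lock()


async def get_gateway(config):
    """Pay the 3-message handshake once; reuse the connection afterwards."""
    global _shared_gateway
    async with _gateway_lock:
        # 'connected' is an assumed attribute; reconnect if the link dropped.
        if _shared_gateway is None or not _shared_gateway.connected:
            client = OpenClawClient.from_config(path=config.gateway.config_path)
            await client.connect()
            _shared_gateway = client
        return _shared_gateway
```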
Twilio signature verification runs on every webhook:
```python
from twilio.request_validator import RequestValidator

def validate_twilio_signature(signature, url, params, auth_token):
    validator = RequestValidator(auth_token)  # stores token
    return validator.validate(url, params, signature)  # HMAC-SHA1
```

Cost: 2–5ms (CPU-bound cryptographic operation). Unavoidable for security.
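For context, roughly how this plugs into the FastAPI webhook (the route path, header handling, and env var are assumptions):

```python
import os

from fastapi import FastAPI, Request, Response

app = FastAPI()
AUTH_TOKEN = os.environ["TWILIO_AUTH_TOKEN"]


@app.post("/voice")
async def voice_webhook(request: Request):
    form = dict(await request.form())  # the 1–5ms form-parsing step above
    signature = request.headers.get("X-Twilio-Signature", "")
    if not validate_twilio_signature(signature, str(request.url), form, AUTH_TOKEN):
        return Response(status_code=403)
    # ...caller lookup and TwiML generation follow here
    return Response(content="<Response/>", media_type="application/xml")
```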
Caller identification uses a linear scan:
```python
def identify_caller(phone, callers_config):
    for caller in callers_config.callers:
        if caller.phone == phone:
            return caller
    return None
```

Cost: O(n) — with 50 callers, up to 5ms. This runs twice per call (once in the webhook handler, once in the WebSocket handler when the caller phone is re-extracted from Twilio's custom parameters).
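The dict-based fix from the recommendations table, sketched (function names are illustrative):

```python
def build_caller_index(callers_config) -> dict:
    """Build once at config load; O(1) lookups afterwards."""
    return {caller.phone: caller for caller in callers_config.callers}


def identify_caller_fast(phone: str, caller_index: dict):
    return caller_index.get(phone)  # None for unknown callers, same as before
```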
```
Twilio → HTTPS → Tailscale Funnel (public IP:443)
       → Tailscale client (localhost:8765)
       → FastAPI/Uvicorn
```
Per-request overhead: 2–6ms (TLS decrypt → localhost forward → TLS encrypt). For the persistent WebSocket, this is a one-time cost plus ~0.5–2ms per audio frame — negligible at 50 Hz frame rate.
```python
PipelineParams(allow_interruptions=True)
```

When VAD detects new speech while LLM/TTS are outputting:
- VAD detection: ~50–200ms
- Pipecat cancels in-flight LLM + TTS: <10ms
- Pipeline switches to processing new input
- Interruption-to-response: 300–1000ms (VAD detect + full STT cycle)
No CPU cost beyond continuous VAD (which runs regardless). ElevenLabs streaming connection is closed on cancellation (~10ms cleanup).
| Source | Latency | Type | Frequency | Avoidable? |
|---|---|---|---|---|
| VAD silence buffer | 1200ms | Config | Every turn | Tunable (trade-off) |
| Whisper STT | 250–800ms | Network + inference | Every turn | Switch to faster STT |
| LLM first token | 180–500ms | Network + inference | Every turn | Model choice |
| TTS first chunk | 150–250ms | Network + synthesis | Every turn | Provider choice |
| System prompt I/O | 15–75ms | Disk I/O | Every call | Cacheable |
| Gateway handshake | 11–25ms | Network | Every call | Poolable (reuse connection) |
| TwiML + signature | 8–15ms | CPU + I/O | Every call | Minimal |
| Caller lookup (×2) | 0–10ms | CPU | Every call | Use dict |
| AudioDownsampler | <1ms/chunk | CPU | Every chunk | Already optimal |
| Tailscale Funnel | 2–6ms | Network | Per-request | Unavoidable |
| Tool schema tokens | ~210 tokens/turn | Token overhead | Every LLM call | Trim descriptions |
| ask_ron tool call | 100–30,000ms | Network | Per tool use | Reduce timeout |
| Scenario | Estimated Ear-to-Ear | Breakdown |
|---|---|---|
| Best case (short reply, warm cache) | ~2.0s | 1.2s VAD + 0.3s STT + 0.3s LLM + 0.2s TTS |
| Typical (15-word response) | ~3.0–3.5s | 1.2s VAD + 0.5s STT + 0.8s LLM + 0.5s TTS |
| With tool call | ~4.0–6.0s | Above + 1–3s gateway |
| Worst case (gateway slow) | ~6.0–35s | Above + up to 30s gateway timeout |
| Greeting (first audio) | ~0.6–1.5s | Pipeline setup + LLM + TTS |
- Streaming pipeline: LLM tokens flow to TTS immediately; the user hears partial responses as they're generated, not after the full response completes. This is the single biggest latency win.
- Stateless audio resampling: the custom `AudioDownsampler` using `audioop.ratecv` produces output on every input chunk with <1ms overhead. No buffering, no accumulated silence.
- Async throughout: non-blocking I/O from WebSocket transport through service calls. The event loop stays responsive even during tool execution.
- Tier-based schema control: guest callers get zero tool overhead: no schemas in LLM requests, no gateway connection attempt.
- Soxr bypass: both hidden soxr resampler instances in Pipecat are configured as pass-throughs via careful sample rate alignment.
| Issue | Current Cost | Fix | Expected Savings |
|---|---|---|---|
| No system prompt caching | 15–75ms/call | Cache per caller, invalidate on file change | 15–75ms/call |
| ask_ron blocks audio for up to 30s | 0–30s silence | Reduce timeout to 5–10s; play filler audio ("Let me check...") while waiting | Perceived: dramatic |
| No gateway connection pool | 11–25ms/call | Maintain persistent connection, reconnect on failure | 11–25ms/call |
| Linear caller lookup | 0–10ms × 2/call | `dict[str, CallerProfile]` keyed by phone | ~10ms/call |
| Issue | Current Cost | Fix | Expected Savings |
|---|---|---|---|
| Unbounded context_dir loading | 5–50ms + token bloat | Size limit, relevance filtering, or allowlist | Variable |
| Tool schema in every request | ~210 tokens/turn | Trim descriptions, or conditional inclusion | ~60 tokens/turn |
| Silent tool degradation | Confusing LLM behavior | Remove `ask_ron` from schema if gateway fails | Correctness fix |
| No context pruning | Growing context over long calls | Sliding window or summarization for old turns (sketch below) | Prevents slowdown |
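That last row, as a minimal sliding-window sketch (entirely illustrative; the repo has no such mechanism yet):

```python
def prune_context(messages: list, keep_last: int = 20) -> list:
    """Keep the system prompt plus the most recent turns; drop the middle."""
    system, turns = messages[:1], messages[1:]
    return system + turns[-keep_last:]
```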
| Issue | Current Cost | Fix | Expected Savings |
|---|---|---|---|
| Synchronous file I/O | 5–20ms | `aiofiles` or thread pool | 5–20ms (marginal) |
| Duplicate caller lookup | 5–10ms | Pass caller profile through custom params | 5–10ms |
| No connection reuse for APIs | ~5ms/call | Verify httpx/aiohttp pooling in Pipecat | ~5ms/call |
| Pipecat version unpinned | Risk of regression | Pin to known-good version | Stability |
| Tested | Not Tested |
|---|---|
| Audio downsampler correctness | End-to-end latency benchmarks |
| Soxr bypass configuration | VAD tuning / false trigger rates |
| Tool availability by tier | ask_ron timeout behavior |
| System prompt per tier | Concurrent call handling |
| Twilio signature validation | Interrupt handling (barge-in) |
| Gateway handshake + ask | Memory/resource cleanup |
| Greeting frame production | LLM streaming time-to-first-token |
| Caller identification | Pipeline under load |
Generated 2025-02-09. Analysis based on commit ea1680a (HEAD of main).