Deep-dive analysis of every source of overhead in a Twilio → Pipecat → LLM voice pipeline.
Stack: Twilio Media Streams · Pipecat · OpenAI Whisper STT · Kimi K2.5 (Fireworks) LLM · ElevenLabs TTS
Repo: `voice`, an OpenClaw skill providing caller-aware phone calls with tool calling and tiered access control.
```
PHASE                                LATENCY        CUMULATIVE
─────────────────────────────────────────────────────────────
Twilio webhook received              —              t=0ms
├─ Form parsing                      1–5ms          ~3ms
├─ HMAC-SHA1 signature verify        2–5ms          ~7ms
├─ Caller lookup (linear scan)       0–10ms         ~12ms
├─ TwiML XML generation              1–2ms          ~14ms
└─ HTTP 200 → Twilio                 <1ms           ~15ms

Twilio initiates WebSocket           —              t=15ms
├─ DNS + TLS handshake               50–150ms       ~115ms
├─ Tailscale Funnel proxy hop        2–6ms          ~119ms
├─ WebSocket accept                  1–5ms          ~122ms
└─ Await Twilio "start" msg          50–200ms       ~222ms

Pipeline creation                    —              t=222ms
├─ System prompt file I/O            15–75ms        ~272ms
├─ Service instantiation             3–5ms          ~275ms
├─ OpenClaw gateway handshake        11–25ms        ~293ms
├─ Pipeline wiring + task            2–5ms          ~296ms
└─ VAD model load (first call)       50–100ms       ~346ms*

Greeting generation                  —              t=296ms
├─ LLM prompt + first token          180–500ms      ~596ms
├─ TTS connection + first chunk      150–250ms      ~746ms
├─ AudioDownsampler (16→8kHz)        <1ms           ~746ms
└─ Twilio → caller speaker           10–100ms       ~806ms

CALLER HEARS GREETING                               ~0.6–1.5s
─────────────────────────────────────────────────────────────
User speaks + silence detected       —              varies
├─ VAD silence buffer                1200ms FIXED   +1200ms
├─ Whisper STT inference             250–800ms      +500ms avg
├─ LLM first token                   180–500ms      +350ms avg
├─ TTS first audio chunk             150–250ms      +200ms avg
└─ Network + Twilio playback         10–100ms       +50ms avg

EAR-TO-EAR RESPONSE LATENCY                         ~2.0–3.5s
─────────────────────────────────────────────────────────────
With ask_ron tool call               +100–5000ms    +2s avg
With get_date_time tool call         <1ms           negligible

* VAD model cached after first call.
```
```python
VADParams(confidence=0.8, stop_secs=1.2)
```

The Silero VAD analyzer runs on every 20ms audio frame (~5–10ms CPU per frame, 25–50% of one core). It waits for 1.2 seconds of continuous silence before declaring end-of-speech.
This is the single largest latency contributor and it's intentional — lower values cause false triggers when users pause mid-thought. The trade-off:
| stop_secs | Behavior | Risk |
|---|---|---|
| 0.5s | Snappy, cuts off pauses | Interrupts thinking pauses |
| 0.8s | Balanced | Occasional false triggers |
| 1.2s (current) | Conservative | Guaranteed 1.2s floor |
| 1.5s+ | Sluggish | Poor UX |
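For reference, a sketch of where this knob lives, assuming recent Pipecat import paths (they have moved between releases):

```python
# Import paths assume a recent Pipecat release; older versions differ.
from pipecat.audio.vad.silero import SileroVADAnalyzer
from pipecat.audio.vad.vad_analyzer import VADParams

vad_analyzer = SileroVADAnalyzer(
    params=VADParams(
        confidence=0.8,  # per-frame speech-probability threshold
        stop_secs=0.8,   # end-of-speech silence window; 1.2 is the current setting
    )
)
```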
```
Twilio 8kHz G.711 μ-law
  → TwilioFrameSerializer (decode to PCM-16)
  → Pipeline internal (8kHz PCM)
  → STT (8kHz PCM → text)
  → LLM (text → text)
  → TTS (text → 16kHz PCM streaming chunks)
  → AudioDownsampler (16kHz → 8kHz, stateless, <1ms/chunk)
  → TwilioFrameSerializer (encode to G.711 μ-law)
  → Twilio 8kHz G.711 μ-law
```
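The decode step at the edge is cheap; a sketch of what the serializer's μ-law decode amounts to, using the stdlib (Twilio delivers media frames as base64 payloads; `TwilioFrameSerializer`'s actual internals may differ):

```python
import audioop
import base64

def decode_twilio_media(payload_b64: str) -> bytes:
    """One Twilio media frame: base64 → 8kHz G.711 μ-law → 8kHz PCM-16."""
    ulaw_bytes = base64.b64decode(payload_b64)
    return audioop.ulaw2lin(ulaw_bytes, 2)  # width=2 → 16-bit linear PCM
```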
Why 16kHz TTS → 8kHz transport? Two recent critical fixes:

- `fafbadf`: Pipecat's built-in `SOXRStreamAudioResampler` buffers ~11 consecutive 10ms chunks (~110ms) before producing output. With ElevenLabs streaming small chunks, this caused complete silence: the resampler accumulated audio but never flushed it. Fix: replace soxr with a stateless `AudioDownsampler` using `audioop.ratecv` that produces output on every single input chunk.
- `ea1680a`: Setting `audio_out_sample_rate=16000` (to match TTS) inadvertently activated hidden soxr resamplers inside both `BaseOutputTransport` and `TwilioFrameSerializer`. Fix: set `audio_out_sample_rate=8000` so both become pass-throughs, since `AudioDownsampler` already converts to 8kHz before they see the frames.
```
# The workaround pipeline (pipeline.py):
ElevenLabsTTSService(sample_rate=16000)   # TTS outputs 16kHz
→ AudioDownsampler(target_rate=8000)      # Immediate 16→8kHz, no buffering
→ Transport(audio_out_sample_rate=8000)   # Pass-through, no soxr
```

`AudioDownsampler` per-chunk overhead: `audioop.ratecv` on a 160-sample chunk takes 0.1–0.5ms. Stateless (no ring buffer, no history). Every input chunk produces an output chunk.
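A minimal sketch of that approach, assuming mono PCM-16 frames (the repo's `AudioDownsampler` may differ in detail; note that `ratecv` carries a tiny conversion state between calls, but unlike the soxr stream it never withholds output):

```python
import audioop  # stdlib; deprecated in 3.11, removed in 3.13


class AudioDownsampler:
    """Convert 16kHz mono PCM-16 chunks to 8kHz, one output per input."""

    def __init__(self, source_rate: int = 16000, target_rate: int = 8000):
        self._source_rate = source_rate
        self._target_rate = target_rate
        self._state = None  # ratecv's fractional-position state, not a buffer

    def process(self, chunk: bytes) -> bytes:
        # width=2 (16-bit samples), nchannels=1 (mono)
        converted, self._state = audioop.ratecv(
            chunk, 2, 1, self._source_rate, self._target_rate, self._state
        )
        return converted  # emitted immediately; nothing accumulates across chunks
```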
```python
# pipeline.py:114 — called per-call, blocking
system_prompt = build_system_prompt(caller, config.workspace, config.context_dir)
```

`build_system_prompt()` performs synchronous disk reads:
| File | When Loaded | Typical Size |
|---|---|---|
| `SOUL.md` | Always | 300–500 tokens |
| `USER.md` | Owner tier only | 200–500 tokens |
| `caller.context_file` | Trusted tier only | 200–500 tokens |
| `context_dir/*.md` (all) | If configured | Unbounded |
All reads use plain `open()` / `f.read()` — no async I/O, no caching.

Context directory is unbounded: `os.listdir(ctx_dir)` loads every `.md` file alphabetically with no size limit. A digests directory with 12 monthly reports (500 tokens each) adds 6,000 tokens to every call's system prompt.
Token budget by caller tier:
| Tier | Base Tokens | With Context Dir | Notes |
|---|---|---|---|
| Guest | ~480 | ~480 + context_dir | No USER.md, no tools |
| Trusted | ~610 | ~610 + context_dir | Caller-specific context + hallucination guard |
| Owner | ~690 | ~690 + context_dir | Full USER.md + tool instructions |
Latency: 5–75ms per call depending on number and size of files. On local SSD this is fast; on NFS it could be 50ms+ per file.
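The recommendations below suggest caching; a hypothetical mtime-keyed wrapper (only `build_system_prompt` and its arguments come from the repo; the cache, path layout, and invalidation key are illustrative):

```python
import os

_prompt_cache: dict = {}


def cached_system_prompt(caller, workspace, context_dir) -> str:
    """Rebuild the prompt only when SOUL.md changes; serve from memory otherwise.

    A production version would also key on USER.md, the caller's context
    file, and context_dir mtimes.
    """
    soul_path = os.path.join(workspace, "SOUL.md")  # path layout assumed
    key = (caller.phone, os.path.getmtime(soul_path))
    if key not in _prompt_cache:
        _prompt_cache[key] = build_system_prompt(caller, workspace, context_dir)
    return _prompt_cache[key]
```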
Two tools are registered: `get_date_time` and `ask_ron`.

`get_date_time` is pure local computation via `datetime.now()`, with a lazy import of `zoneinfo`. <1ms per call.
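A plausible shape for that handler (illustrative; the repo's actual signature and timezone handling may differ):

```python
async def handle_get_date_time(params, result_callback):
    # Lazy imports, as noted above; kept out of module scope.
    from datetime import datetime
    from zoneinfo import ZoneInfo

    # Timezone choice is an assumption; the repo may read it from config.
    now = datetime.now(ZoneInfo("America/New_York"))
    await result_callback(now.strftime("%A, %B %d, %Y at %I:%M %p %Z"))
```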
`ask_ron` is another matter:

```python
async def handle_ask_ron(...):
    result = await gateway.ask(request, session_key=call_sid)  # blocks up to 30s
    await result_callback(result)
```

`gateway.ask()` sends a WebSocket message to the OpenClaw agent gateway and enters a blocking receive loop:
```python
async with asyncio.timeout(self._timeout):  # 30s default
    while True:
        raw = await self._ws.recv()  # blocks until message
        msg = json.loads(raw)
        if msg["type"] == "res" and msg["id"] == req_id:
            return extract_result(msg)
```

While this awaits, the entire LLM→TTS path is stalled. The caller hears silence. The transport still receives audio (VAD runs), but no output is produced until the tool completes.
Timeline with ask_ron:
```
t=0.0s  User: "What are Tesla's earnings?"
t=0.5s  STT completes transcription
t=0.7s  LLM decides to call ask_ron
t=0.8s  Gateway request sent ──────────┐
t=3.0s  OpenClaw returns result ◄──────┘  ← 2.2s silence
t=3.1s  Tool result injected into context
t=3.3s  LLM re-prompted, starts generating
t=4.0s  TTS starts streaming to caller
```
No retry logic, no intermediate audio feedback, no backoff. If gateway is slow: silence. If gateway times out (30s): canned error message after half a minute of dead air.
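A hedged sketch of the obvious mitigation (recommended below): speak a filler line, then bound the gateway wait. The `speak` helper and the 8-second budget are assumptions, not repo code:

```python
import asyncio


async def handle_ask_ron_with_filler(gateway, speak, request, call_sid, result_callback):
    # 'speak' is an assumed helper that queues a TTS utterance into the pipeline.
    await speak("Let me check on that.")
    try:
        result = await asyncio.wait_for(
            gateway.ask(request, session_key=call_sid),
            timeout=8.0,  # far below the current 30s ceiling
        )
    except asyncio.TimeoutError:
        result = "Sorry, that lookup is taking too long right now."
    await result_callback(result)
```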
Both tool schemas (~210 tokens combined) are serialized and sent with every LLM request, even turns where tools aren't called. Over a 10-turn conversation, that's ~2,100 tokens of pure schema overhead. Adds ~200–300ms of prompt encoding/transmission per turn.
Guest tier correctly receives zero tool schemas (no overhead).
```python
# pipeline.py:127-131 — per-call, blocking pipeline creation
gateway = OpenClawClient.from_config(path=config.gateway.config_path)
await gateway.connect()  # 3-message handshake
```

Each call creates a new WebSocket connection to the OpenClaw gateway with a 3-step handshake:
- Receive `connect.challenge` from gateway
- Send `connect` with auth token
- Receive `hello-ok`
Cost: 11–25ms per call (localhost WebSocket). No connection pooling — 100 concurrent calls = 100 independent WebSocket connections.
If gateway is unavailable, exception is caught and tools silently degrade. The LLM still has ask_ron in its tool schema (from the system prompt instructions) but the handler won't be registered — the LLM may attempt to call a tool that doesn't exist.
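The recommendations below call for a persistent connection. A minimal sketch, assuming `OpenClawClient` tolerates being shared across calls (the module-level global, lock, and `connected` attribute are illustrative):

```python
import asyncio

_shared_gateway = None
_gateway_lock = asyncio.Lock()


async def get_gateway(config):
    """Pay the 3-message handshake once; reuse the connection afterwards."""
    global _shared_gateway
    async with _gateway_lock:
        # 'connected' is an assumed attribute; reconnect if the link dropped.
        if _shared_gateway is None or not _shared_gateway.connected:
            client = OpenClawClient.from_config(path=config.gateway.config_path)
            await client.connect()
            _shared_gateway = client
        return _shared_gateway
```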
Twilio signature verification runs on every webhook:
```python
from twilio.request_validator import RequestValidator

def validate_twilio_signature(signature, url, params, auth_token):
    validator = RequestValidator(auth_token)  # stores token
    return validator.validate(url, params, signature)  # HMAC-SHA1
```

Cost: 2–5ms (CPU-bound cryptographic operation). Unavoidable for security.
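For context, roughly how this plugs into the FastAPI webhook (the route path, header handling, and env var are assumptions):

```python
import os

from fastapi import FastAPI, Request, Response

app = FastAPI()
AUTH_TOKEN = os.environ["TWILIO_AUTH_TOKEN"]


@app.post("/voice")
async def voice_webhook(request: Request):
    form = dict(await request.form())  # the 1–5ms form-parsing step above
    signature = request.headers.get("X-Twilio-Signature", "")
    if not validate_twilio_signature(signature, str(request.url), form, AUTH_TOKEN):
        return Response(status_code=403)
    # ...caller lookup and TwiML generation follow here
    return Response(content="<Response/>", media_type="application/xml")
```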
Caller identification uses a linear scan:
```python
def identify_caller(phone, callers_config):
    for caller in callers_config.callers:
        if caller.phone == phone:
            return caller
    return None
```

Cost: O(n) — with 50 callers, up to 5ms. This runs twice per call (once in the webhook handler, once in the WebSocket handler when the caller phone is re-extracted from Twilio's custom parameters).
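The dict-based fix from the recommendations table, sketched (function names are illustrative):

```python
def build_caller_index(callers_config) -> dict:
    """Build once at config load; O(1) lookups afterwards."""
    return {caller.phone: caller for caller in callers_config.callers}


def identify_caller_fast(phone: str, caller_index: dict):
    return caller_index.get(phone)  # None for unknown callers, same as before
```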
```
Twilio → HTTPS → Tailscale Funnel (public IP:443)
       → Tailscale client (localhost:8765)
       → FastAPI/Uvicorn
```
Per-request overhead: 2–6ms (TLS decrypt → localhost forward → TLS encrypt). For the persistent WebSocket, this is a one-time cost plus ~0.5–2ms per audio frame — negligible at 50 Hz frame rate.
```python
PipelineParams(allow_interruptions=True)
```

When VAD detects new speech while LLM/TTS are outputting:
- VAD detection: ~50–200ms
- Pipecat cancels in-flight LLM + TTS: <10ms
- Pipeline switches to processing new input
- Interruption-to-response: 300–1000ms (VAD detect + full STT cycle)
No CPU cost beyond continuous VAD (which runs regardless). ElevenLabs streaming connection is closed on cancellation (~10ms cleanup).
| Source | Latency | Type | Frequency | Avoidable? |
|---|---|---|---|---|
| VAD silence buffer | 1200ms | Config | Every turn | Tunable (trade-off) |
| Whisper STT | 250–800ms | Network + inference | Every turn | Switch to faster STT |
| LLM first token | 180–500ms | Network + inference | Every turn | Model choice |
| TTS first chunk | 150–250ms | Network + synthesis | Every turn | Provider choice |
| System prompt I/O | 15–75ms | Disk I/O | Every call | Cacheable |
| Gateway handshake | 11–25ms | Network | Every call | Poolable (reuse connection) |
| TwiML + signature | 8–15ms | CPU + I/O | Every call | Minimal |
| Caller lookup (×2) | 0–10ms | CPU | Every call | Use dict |
| AudioDownsampler | <1ms/chunk | CPU | Every chunk | Already optimal |
| Tailscale Funnel | 2–6ms | Network | Per-request | Unavoidable |
| Tool schema tokens | ~210 tokens/turn | Token overhead | Every LLM call | Trim descriptions |
| ask_ron tool call | 100–30,000ms | Network | Per tool use | Reduce timeout |
| Scenario | Estimated Ear-to-Ear | Breakdown |
|---|---|---|
| Best case (short reply, warm cache) | ~2.0s | 1.2s VAD + 0.3s STT + 0.3s LLM + 0.2s TTS |
| Typical (15-word response) | ~3.0–3.5s | 1.2s VAD + 0.5s STT + 0.8s LLM + 0.5s TTS |
| With tool call | ~4.0–6.0s | Above + 1–3s gateway |
| Worst case (gateway slow) | ~6.0–35s | Above + up to 30s gateway timeout |
| Greeting (first audio) | ~0.6–1.5s | Pipeline setup + LLM + TTS |
- Streaming pipeline: LLM tokens flow to TTS immediately; the user hears partial responses as they're generated, not after the full response completes. This is the single biggest latency win.
- Stateless audio resampling: the custom `AudioDownsampler` using `audioop.ratecv` produces output on every input chunk with <1ms overhead. No buffering, no accumulated silence.
- Async throughout: non-blocking I/O from WebSocket transport through service calls. The event loop stays responsive even during tool execution.
- Tier-based schema control: guest callers get zero tool overhead: no schemas in LLM requests, no gateway connection attempt.
- Soxr bypass: both hidden soxr resampler instances in Pipecat are configured as pass-throughs via careful sample rate alignment.
| Issue | Current Cost | Fix | Expected Savings |
|---|---|---|---|
| No system prompt caching | 15–75ms/call | Cache per caller, invalidate on file change | 15–75ms/call |
| ask_ron blocks audio for up to 30s | 0–30s silence | Reduce timeout to 5–10s; play filler audio ("Let me check...") while waiting | Perceived: dramatic |
| No gateway connection pool | 11–25ms/call | Maintain persistent connection, reconnect on failure | 11–25ms/call |
| Linear caller lookup | 0–10ms × 2/call | `dict[str, CallerProfile]` keyed by phone | ~10ms/call |
| Issue | Current Cost | Fix | Expected Savings |
|---|---|---|---|
| Unbounded context_dir loading | 5–50ms + token bloat | Size limit, relevance filtering, or allowlist | Variable |
| Tool schema in every request | ~210 tokens/turn | Trim descriptions, or conditional inclusion | ~60 tokens/turn |
| Silent tool degradation | Confusing LLM behavior | Remove `ask_ron` from schema if gateway fails | Correctness fix |
| No context pruning | Growing context over long calls | Sliding window or summarization for old turns (sketch below) | Prevents slowdown |
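That last row, as a minimal sliding-window sketch (entirely illustrative; the repo has no such mechanism yet):

```python
def prune_context(messages: list, keep_last: int = 20) -> list:
    """Keep the system prompt plus the most recent turns; drop the middle."""
    system, turns = messages[:1], messages[1:]
    return system + turns[-keep_last:]
```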
| Issue | Current Cost | Fix | Expected Savings |
|---|---|---|---|
| Synchronous file I/O | 5–20ms | `aiofiles` or thread pool | 5–20ms (marginal) |
| Duplicate caller lookup | 5–10ms | Pass caller profile through custom params | 5–10ms |
| No connection reuse for APIs | ~5ms/call | Verify httpx/aiohttp pooling in Pipecat | ~5ms/call |
| Pipecat version unpinned | Risk of regression | Pin to known-good version | Stability |
| Tested | Not Tested |
|---|---|
| Audio downsampler correctness | End-to-end latency benchmarks |
| Soxr bypass configuration | VAD tuning / false trigger rates |
| Tool availability by tier | ask_ron timeout behavior |
| System prompt per tier | Concurrent call handling |
| Twilio signature validation | Interrupt handling (barge-in) |
| Gateway handshake + ask | Memory/resource cleanup |
| Greeting frame production | LLM streaming time-to-first-token |
| Caller identification | Pipeline under load |
Generated 2025-02-09. Analysis based on commit ea1680a (HEAD of main).