A technical journal of building dual-process voice AI on top of Kyutai's Moshi and NVIDIA's PersonaPlex, with Letta as the reasoning backbone.
Build a Talker-Reasoner voice system where:
- System 1 (Talker): PersonaPlex/Moshi handles real-time conversation at 12.5Hz
- System 2 (Reasoner): Letta agent with persistent memory does deep thinking, web search, tool use
- A Voice Bridge (Bun + TypeScript) connects them, with Voxtral transcription + classification in parallel
```
User speaks → PersonaPlex (System 1) responds in real-time
      ↓ (parallel)
Voxtral transcribes → Classifier triggers → Letta (System 2) thinks
      ↓
System 2 responds with answer
      ↓                          ↓
Drip-feed inject            Sentence-chunked TTS
(20ch/80ms to Moshi)        (WAV per sentence)
      ↓                          ↓
PersonaPlex absorbs         Browser plays clean audio
knowledge silently          (queued, chained)
      ↓
Server drops Moshi audio frames (gate)
until TTS + 8s cooldown expires
```
Attempt 1: text_prompt injection
The obvious approach — put the answer in Moshi's `text_prompt` and reconnect:

```
text_prompt = "You just looked something up for the user and found: Based on 2026 data,
the most popular cat breeds include Ragdoll, Maine Coon..."
```
Result: PersonaPlex says "Hello, this is Lisa!" every time. Completely ignores the content.
Why: Moshi's `text_prompt` was fine-tuned on ~2,250 hours of persona-shaping dialogues only. The training data never included "relay this fact" patterns. It controls persona/role/style but NOT content. The greeting behavior is baked into the model weights from instruction fine-tuning (stage 5/5, 30k steps).
Attempt 2: Burst sendText() injection
PersonaPlex has an undocumented `sendText()` method that feeds tokens into Moshi's Inner Monologue stream mid-conversation. We sent the full 400-char response at once.
Result: "plus plus plus plus plus plus plus plus plus..." — repetition degeneration.
Reading the Moshi paper (arXiv:2410.00037) revealed why:
Moshi's Inner Monologue predicts one text token per audio frame at 12.5Hz (80ms intervals).
Burst injection of 300+ characters saturates the temporal alignment mechanism. The attention fixates on repeated tokens, causing the autoregressive loop to degenerate.
Send 20 characters every 80ms, matching Moshi's per-frame text consumption rate:
```typescript
const MOSHI_FRAME_MS = 80;   // Moshi/Mimi codec frame duration
const DRIP_CHUNK_SIZE = 20;  // ~4-5 tokens per frame

const chunks = response.match(new RegExp(`.{1,${DRIP_CHUNK_SIZE}}`, 'g')) ?? [response];
for (const chunk of chunks) {
  session.talker.sendText(chunk);
  await new Promise(r => setTimeout(r, MOSHI_FRAME_MS));
}
```

Result: 336 chars = 17 chunks over 1.36s. PersonaPlex stays stable. No degeneration.
Even with stable injection, two audio streams collide:
- PersonaPlex tries to speak the injected knowledge (garbled)
- External TTS speaks the clean answer
First idea: disconnecting `moshiWorklet` from `audioCtx.destination` in the browser. That failed:
- Janky reconnection artifacts
- Buffer residue leaks through
- Race conditions with Web Audio API scheduling
Drop PersonaPlex audio frames at the server before they reach the browser:
```typescript
// Session state
interface VoiceSession {
  suppressMoshiAudio: boolean;
  suppressMoshiUntil: number; // auto-expire timestamp
}

// Audio handler: gate PersonaPlex frames
bus.on('talker.audio', (event) => {
  if (session.suppressMoshiAudio) {
    if (Date.now() > session.suppressMoshiUntil) {
      session.suppressMoshiAudio = false; // auto-expire
    } else {
      return; // drop frame silently
    }
  }
  session.userWs.send(event.data); // pass through
});

// Suppress timing: WAV playback estimate + 8s cooldown
const estimatedPlayMs = Math.ceil((wavBuffer.byteLength / 48000) * 1000);
session.suppressMoshiUntil = Date.now() + estimatedPlayMs + 8000;
```

Instead of generating TTS for the entire response (a ~3s wait), split at sentence boundaries:
```typescript
const sentences = response.match(/[^.!?]+[.!?]+/g) ?? [response];
for (const sentence of sentences) {
  const formData = new FormData();
  formData.append('text', sentence.trim());
  const ttsResp = await fetch('http://localhost:8877/tts', { method: 'POST', body: formData });
  if (ttsResp.ok) {
    const wavBuffer = await ttsResp.arrayBuffer();
    const msg = new Uint8Array(1 + wavBuffer.byteLength);
    msg[0] = 0x10; // TTS_AUDIO message type
    msg.set(new Uint8Array(wavBuffer), 1);
    session.userWs.send(msg.buffer);
  }
}
```

The browser queues WAVs and chains playback via `source.onended` callbacks. First audio arrives ~500ms after Letta responds instead of ~3s.
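The browser-side queue reduces to a small chain: each finished clip starts the next. A framework-agnostic sketch (the Web Audio wiring is abstracted behind an injected `play` callback; class and function names are illustrative, not the project's actual client code):

```typescript
// A play function receives a WAV clip and a callback to invoke when it ends.
type PlayFn = (clip: ArrayBuffer, onEnded: () => void) => void;

class TtsPlaybackQueue {
  private queue: ArrayBuffer[] = [];
  private playing = false;

  constructor(private play: PlayFn) {}

  // Enqueue a clip; start playback immediately if nothing is playing.
  enqueue(clip: ArrayBuffer): void {
    this.queue.push(clip);
    if (!this.playing) this.next();
  }

  // Play the next clip; its completion callback chains to the one after.
  private next(): void {
    const clip = this.queue.shift();
    if (!clip) {
      this.playing = false;
      return;
    }
    this.playing = true;
    this.play(clip, () => this.next());
  }
}
```

In the real client, `play` would decode the WAV with `AudioContext.decodeAudioData`, start an `AudioBufferSourceNode`, and fire `onEnded` from `source.onended`, which is what chains the sentences gaplessly.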
After deployment, System 2 started returning "I ran into an issue thinking about that" for every query.
Diagnosis: the Reasoner's AbortController had a 30s timeout, but Letta with web_search tool calls consistently took 30-40s:
- LLM reasoning: ~5s
- `web_search` execution: ~10-15s
- LLM processes results + `send_message`: ~10-15s
Timeline from logs:
```
t=0.000s   Trigger fires
t=8.000s   Same query re-triggers (no dedup guard!)
t=30.005s  AbortError on request #1
t=30.002s  AbortError on request #2
```
Two fixes:
- Timeout: 30s → 60s
- In-flight deduplication flag to block re-triggers while Reasoner is processing
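Both fixes can live in one small wrapper around the Reasoner call. A sketch under assumed names (`guardedReason` is hypothetical; the real call is a fetch to Letta's API that passes the abort signal through):

```typescript
// Fix 1: raise the abort timeout from 30s to 60s.
const REASONER_TIMEOUT_MS = 60_000;

// Fix 2: in-flight flag so a re-trigger is dropped, not duplicated.
let inFlight = false;

async function guardedReason(
  run: (signal: AbortSignal) => Promise<string>,
): Promise<string | null> {
  if (inFlight) return null;            // dedup: ignore re-triggers while busy
  inFlight = true;
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), REASONER_TIMEOUT_MS);
  try {
    return await run(controller.signal); // e.g. fetch(lettaUrl, { signal })
  } finally {
    clearTimeout(timer);
    inFlight = false;                    // release for the next trigger
  }
}
```

With this in place, the t=8s re-trigger from the log above would return `null` immediately instead of spawning a second 30s request.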
- Server sends `0x00` as ready signal — do NOT echo it back
- Client must only send `0x01 + opus_data` — server ignores all other kinds
- `sphn.OpusStreamReader` crashes on Ogg header pages sent out of sequence
- `NODE_TLS_REJECT_UNAUTHORIZED=0` required for Bun WSS to a self-signed cert
- Server uses `asyncio.Lock` — single session at a time
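The framing rules reduce to a tiny client-side helper: prefix Opus packets with the `0x01` kind byte, and recognize (but never echo) the `0x00` ready byte. A sketch with illustrative names:

```typescript
const KIND_READY = 0x00; // server → client handshake byte; never echo it back
const KIND_AUDIO = 0x01; // client → server: kind byte + raw Opus payload

// Wrap an Opus packet in the only framing the server accepts.
function frameOpus(opus: Uint8Array): Uint8Array {
  const msg = new Uint8Array(1 + opus.byteLength);
  msg[0] = KIND_AUDIO;
  msg.set(opus, 1);
  return msg;
}

// Detect the ready signal so the client can start streaming (send nothing back).
function isReady(msg: Uint8Array): boolean {
  return msg.byteLength === 1 && msg[0] === KIND_READY;
}
```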
- `text_prompt` = persona shaping only (fine-tuned on role prompts, not content relay)
- `sendText()` = undocumented mid-conversation injection into the Inner Monologue stream
- Greeting behavior is immutable (baked into instruction fine-tuning weights)
- ~200 token budget for `text_prompt`
- Trailing slash matters: `/v1/health` → 307 redirect → `/v1/health/`
- Response extraction: actual text is in `tool_call.arguments.message` for `send_message`, not in `assistant_message`
- `web_search` tool calls add 15-25s to response time
| Component | Tech | Port |
|---|---|---|
| Voice Bridge | Bun + TypeScript | 9001 (TLS) |
| PersonaPlex/Moshi | Python (NVIDIA, Kyutai) | 8998 (WSS) |
| Letta (Reasoner) | Docker (3090 box) | 8283 |
| Pocket TTS | Python + FastAPI | 8877 |
| Voxtral | Mistral API (Mini + Small) | — |
| Browser | opus-recorder + Web Audio API | — |
```
t=0s     User says "search for cat breeds"
t=2s     Voxtral transcribes + classifies → trigger fires
t=2s     Thinking chime (440Hz, 120ms sine)
t=2s     "System 2 thinking..." indicator
t=17s    Letta responds (3 sentences)
t=17s    Server-side audio gate activates
t=17s    Drip-feed injection starts (20ch/80ms)
t=17.5s  Sentence 1 TTS arrives → plays immediately
t=19s    Sentence 2 plays
t=21s    Sentence 3 plays
t=29s    Gate expires → PersonaPlex audio resumes
```
- Moshi: a speech-text foundation model (arXiv:2410.00037)
- PersonaPlex (arXiv:2602.06053)
- VITA-Audio: Interleaved Cross-Modal Token Generation (arXiv:2505.03739)
- SpeakStream: Streaming TTS with Interleaved Data (arXiv:2505.19206)
- LLMVoX: Autoregressive Streaming TTS (arXiv:2503.04724)
- PersonaPlex HuggingFace Discussion
- Moshi GitHub FAQ
All source: vaos-voice-bridge/ on the 007-blackice-3 branch.