A technical journal of building dual-process voice AI on top of Kyutai's Moshi and NVIDIA's PersonaPlex, with Letta as the reasoning backbone.
Build a Talker-Reasoner voice system where:
- System 1 (Talker): PersonaPlex/Moshi handles real-time conversation at 12.5Hz
- System 2 (Reasoner): Letta agent with persistent memory does deep thinking, web search, tool use
- A Voice Bridge (Bun + TypeScript) connects them, with Voxtral transcription + classification in parallel
```
User speaks → PersonaPlex (System 1) responds in real-time
      ↓ (parallel)
Voxtral transcribes → Classifier triggers → Letta (System 2) thinks
      ↓
System 2 responds with answer
      ↓                          ↓
Drip-feed inject            Sentence-chunked TTS
(20ch/80ms to Moshi)        (WAV per sentence)
      ↓                          ↓
PersonaPlex absorbs         Browser plays clean audio
knowledge silently          (queued, chained)
      ↓
Server drops Moshi audio frames (gate)
until TTS + 8s cooldown expires
```
Attempt 1: text_prompt injection
The obvious approach — put the answer in Moshi's `text_prompt` and reconnect:

```
text_prompt = "You just looked something up for the user and found: Based on 2026 data,
the most popular cat breeds include Ragdoll, Maine Coon..."
```
Result: PersonaPlex says "Hello, this is Lisa!" every time. Completely ignores the content.
Why: Moshi's `text_prompt` was fine-tuned on ~2,250 hours of persona-shaping dialogues only. The training data never included "relay this fact" patterns. It controls persona/role/style but NOT content. The greeting behavior is baked into the model weights from instruction fine-tuning (stage 5/5, 30k steps).
Attempt 2: Burst sendText() injection
PersonaPlex has an undocumented `sendText()` method that feeds tokens into Moshi's Inner Monologue stream mid-conversation. We sent the full 400-char response at once.
Result: "plus plus plus plus plus plus plus plus plus..." — repetition degeneration.
Reading the Moshi paper (arXiv:2410.00037) revealed why:
Moshi's Inner Monologue predicts one text token per audio frame at 12.5Hz (80ms intervals).
Burst injection of 300+ characters saturates the temporal alignment mechanism. The attention fixates on repeated tokens, causing the autoregressive loop to degenerate.
Send 20 characters every 80ms, matching Moshi's per-frame text consumption rate:
```typescript
const MOSHI_FRAME_MS = 80;   // Moshi/Mimi codec frame duration
const DRIP_CHUNK_SIZE = 20;  // ~4-5 tokens per frame

const chunks = response.match(new RegExp(`.{1,${DRIP_CHUNK_SIZE}}`, 'g')) ?? [response];
for (const chunk of chunks) {
  session.talker.sendText(chunk);
  await new Promise(r => setTimeout(r, MOSHI_FRAME_MS));
}
```

Result: 336 chars = 17 chunks over 1.36s. PersonaPlex stays stable. No degeneration.
Even with stable injection, two audio streams collide:
- PersonaPlex tries to speak the injected knowledge (garbled)
- External TTS speaks the clean answer
First idea: disconnecting `moshiWorklet` from `audioCtx.destination` in the browser. That failed:
- Janky reconnection artifacts
- Buffer residue leaks through
- Race conditions with Web Audio API scheduling
Drop PersonaPlex audio frames at the server before they reach the browser:
```typescript
// Session state
interface VoiceSession {
  suppressMoshiAudio: boolean;
  suppressMoshiUntil: number; // auto-expire timestamp
}

// Audio handler: gate PersonaPlex frames
bus.on('talker.audio', (event) => {
  if (session.suppressMoshiAudio) {
    if (Date.now() > session.suppressMoshiUntil) {
      session.suppressMoshiAudio = false; // auto-expire
    } else {
      return; // drop frame silently
    }
  }
  session.userWs.send(event.data); // pass through
});

// Suppress timing: WAV playback estimate + 8s cooldown
const estimatedPlayMs = Math.ceil((wavBuffer.byteLength / 48000) * 1000);
session.suppressMoshiUntil = Date.now() + estimatedPlayMs + 8000;
```

Instead of generating TTS for the entire response (a ~3s wait), split at sentence boundaries:
```typescript
const sentences = response.match(/[^.!?]+[.!?]+/g) ?? [response];
for (const sentence of sentences) {
  const formData = new FormData();
  formData.append('text', sentence.trim());
  const ttsResp = await fetch('http://localhost:8877/tts', { method: 'POST', body: formData });
  if (ttsResp.ok) {
    const wavBuffer = await ttsResp.arrayBuffer();
    const msg = new Uint8Array(1 + wavBuffer.byteLength);
    msg[0] = 0x10; // TTS_AUDIO message type
    msg.set(new Uint8Array(wavBuffer), 1);
    session.userWs.send(msg.buffer);
  }
}
```

The browser queues WAVs and chains playback via `source.onended` callbacks. First audio arrives ~500ms after Letta responds instead of ~3s.
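The browser-side queue reduces to a small chain: each finished clip starts the next. A framework-agnostic sketch (the Web Audio wiring is abstracted behind an injected `play` callback; class and function names are illustrative, not the project's actual client code):

```typescript
// A play function receives a WAV clip and a callback to invoke when it ends.
type PlayFn = (clip: ArrayBuffer, onEnded: () => void) => void;

class TtsPlaybackQueue {
  private queue: ArrayBuffer[] = [];
  private playing = false;

  constructor(private play: PlayFn) {}

  // Enqueue a clip; start playback immediately if nothing is playing.
  enqueue(clip: ArrayBuffer): void {
    this.queue.push(clip);
    if (!this.playing) this.next();
  }

  // Play the next clip; its completion callback chains to the one after.
  private next(): void {
    const clip = this.queue.shift();
    if (!clip) {
      this.playing = false;
      return;
    }
    this.playing = true;
    this.play(clip, () => this.next());
  }
}
```

In the real client, `play` would decode the WAV with `AudioContext.decodeAudioData`, start an `AudioBufferSourceNode`, and fire `onEnded` from `source.onended`, which is what chains the sentences gaplessly.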
After deployment, System 2 started returning "I ran into an issue thinking about that" for every query.
Diagnosis: the Reasoner's AbortController had a 30s timeout, but Letta with web_search tool calls consistently took 30-40s:
- LLM reasoning: ~5s
- `web_search` execution: ~10-15s
- LLM processes results + `send_message`: ~10-15s
Timeline from logs:
```
t=0.000s   Trigger fires
t=8.000s   Same query re-triggers (no dedup guard!)
t=30.005s  AbortError on request #1
t=30.002s  AbortError on request #2
```
Two fixes:
- Timeout: 30s → 60s
- In-flight deduplication flag to block re-triggers while Reasoner is processing
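Both fixes can live in one small wrapper around the Reasoner call. A sketch under assumed names (`guardedReason` is hypothetical; the real call is a fetch to Letta's API that passes the abort signal through):

```typescript
// Fix 1: raise the abort timeout from 30s to 60s.
const REASONER_TIMEOUT_MS = 60_000;

// Fix 2: in-flight flag so a re-trigger is dropped, not duplicated.
let inFlight = false;

async function guardedReason(
  run: (signal: AbortSignal) => Promise<string>,
): Promise<string | null> {
  if (inFlight) return null;            // dedup: ignore re-triggers while busy
  inFlight = true;
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), REASONER_TIMEOUT_MS);
  try {
    return await run(controller.signal); // e.g. fetch(lettaUrl, { signal })
  } finally {
    clearTimeout(timer);
    inFlight = false;                    // release for the next trigger
  }
}
```

With this in place, the t=8s re-trigger from the log above would return `null` immediately instead of spawning a second 30s request.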
- Server sends `0x00` as ready signal — do NOT echo it back
- Client must only send `0x01 + opus_data` — server ignores all other kinds
- `sphn.OpusStreamReader` crashes on Ogg header pages sent out of sequence
- `NODE_TLS_REJECT_UNAUTHORIZED=0` required for Bun WSS to a self-signed cert
- Server uses `asyncio.Lock` — single session at a time
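The framing rules reduce to a tiny client-side helper: prefix Opus packets with the `0x01` kind byte, and recognize (but never echo) the `0x00` ready byte. A sketch with illustrative names:

```typescript
const KIND_READY = 0x00; // server → client handshake byte; never echo it back
const KIND_AUDIO = 0x01; // client → server: kind byte + raw Opus payload

// Wrap an Opus packet in the only framing the server accepts.
function frameOpus(opus: Uint8Array): Uint8Array {
  const msg = new Uint8Array(1 + opus.byteLength);
  msg[0] = KIND_AUDIO;
  msg.set(opus, 1);
  return msg;
}

// Detect the ready signal so the client can start streaming (send nothing back).
function isReady(msg: Uint8Array): boolean {
  return msg.byteLength === 1 && msg[0] === KIND_READY;
}
```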
- `text_prompt` = persona shaping only (fine-tuned on role prompts, not content relay)
- `sendText()` = undocumented mid-conversation injection into the Inner Monologue stream
- Greeting behavior is immutable (baked into instruction fine-tuning weights)
- ~200 token budget for `text_prompt`
- Trailing slash matters: `/v1/health` → 307 redirect → `/v1/health/`
- Response extraction: actual text is in `tool_call.arguments.message` for `send_message`, not in `assistant_message`
- `web_search` tool calls add 15-25s to response time
| Component | Tech | Port |
|---|---|---|
| Voice Bridge | Bun + TypeScript | 9001 (TLS) |
| PersonaPlex/Moshi | Python (NVIDIA, Kyutai) | 8998 (WSS) |
| Letta (Reasoner) | Docker (3090 box) | 8283 |
| Pocket TTS | Python + FastAPI | 8877 |
| Voxtral | Mistral API (Mini + Small) | — |
| Browser | opus-recorder + Web Audio API | — |
```
t=0s     User says "search for cat breeds"
t=2s     Voxtral transcribes + classifies → trigger fires
t=2s     Thinking chime (440Hz, 120ms sine)
t=2s     "System 2 thinking..." indicator
t=17s    Letta responds (3 sentences)
t=17s    Server-side audio gate activates
t=17s    Drip-feed injection starts (20ch/80ms)
t=17.5s  Sentence 1 TTS arrives → plays immediately
t=19s    Sentence 2 plays
t=21s    Sentence 3 plays
t=29s    Gate expires → PersonaPlex audio resumes
```
- Moshi: a speech-text foundation model (arXiv:2410.00037)
- PersonaPlex (arXiv:2602.06053)
- VITA-Audio: Interleaved Cross-Modal Token Generation (arXiv:2505.03739)
- SpeakStream: Streaming TTS with Interleaved Data (arXiv:2505.19206)
- LLMVoX: Autoregressive Streaming TTS (arXiv:2503.04724)
- PersonaPlex HuggingFace Discussion
- Moshi GitHub FAQ
All source: vaos-voice-bridge/ on the 007-blackice-3 branch.