
@jmanhype
Created February 17, 2026 17:36

VAOS Voice Bridge: Building a Talker-Reasoner Architecture on PersonaPlex/Moshi

A technical journal of building dual-process voice AI on top of Kyutai's Moshi and NVIDIA's PersonaPlex, with Letta as the reasoning backbone.

The Goal

Build a Talker-Reasoner voice system where:

  • System 1 (Talker): PersonaPlex/Moshi handles real-time conversation at 12.5Hz
  • System 2 (Reasoner): Letta agent with persistent memory does deep thinking, web search, tool use
  • A Voice Bridge (Bun + TypeScript) connects them, with Voxtral transcription + classification in parallel
User speaks → PersonaPlex (System 1) responds in real-time
                 ↓ (parallel)
              Voxtral transcribes → Classifier triggers → Letta (System 2) thinks
                                                               ↓
                                          System 2 responds with answer
                                               ↓                    ↓
                              Drip-feed inject             Sentence-chunked TTS
                              (20ch/80ms to Moshi)         (WAV per sentence)
                                    ↓                            ↓
                          PersonaPlex absorbs            Browser plays clean audio
                          knowledge silently             (queued, chained)
                                    ↓
                          Server drops Moshi audio frames (gate)
                          until TTS + 8s cooldown expires
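
The parallel fan-out at the top of the diagram can be sketched as follows (a minimal illustration; `onUserAudio`, `Bridge`, and the method names are assumptions, not the actual bridge code):

```typescript
// Sketch of the fan-out: user audio feeds the Talker in real time and,
// in parallel, the transcribe → classify → Reasoner path.
interface Bridge {
  talker: { sendAudio(opus: Uint8Array): void };
  transcribe(opus: Uint8Array): Promise<string>;
  shouldTriggerReasoner(text: string): boolean;
  askReasoner(text: string): Promise<void>;
}

function onUserAudio(bridge: Bridge, opus: Uint8Array): void {
  bridge.talker.sendAudio(opus);            // System 1: real-time path
  bridge.transcribe(opus).then((text) => {  // parallel System 2 path
    if (bridge.shouldTriggerReasoner(text)) void bridge.askReasoner(text);
  });
}
```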

The Hard Problem: Content Delivery Through Moshi

What We Tried First (and Why It Failed)

Attempt 1: text_prompt injection

The obvious approach — put the answer in Moshi's text_prompt and reconnect:

text_prompt = "You just looked something up for the user and found: Based on 2026 data,
the most popular cat breeds include Ragdoll, Maine Coon..."

Result: PersonaPlex says "Hello, this is Lisa!" every time. Completely ignores the content.

Why: Moshi's text_prompt was fine-tuned on ~2,250 hours of persona-shaping dialogues only. The training data never included "relay this fact" patterns. It controls persona/role/style but NOT content. The greeting behavior is baked into model weights from instruction fine-tuning (stage 5/5, 30k steps).

Attempt 2: Burst sendText() injection

PersonaPlex has an undocumented sendText() method that feeds tokens into Moshi's Inner Monologue stream mid-conversation. We sent the full 400-char response at once.

Result: "plus plus plus plus plus plus plus plus plus..." — repetition degeneration.

The Research Breakthrough

Reading the Moshi paper (arXiv:2410.00037) revealed why:

Moshi's Inner Monologue predicts one text token per audio frame at 12.5Hz (80ms intervals).

Burst injection of 300+ characters saturates the temporal alignment mechanism. The attention fixates on repeated tokens, causing the autoregressive loop to degenerate.

The Fix: Drip-Feed Token Injection

Send 20 characters every 80ms, matching Moshi's per-frame text consumption rate:

const MOSHI_FRAME_MS = 80;   // Moshi/Mimi codec frame duration
const DRIP_CHUNK_SIZE = 20;  // ~4-5 tokens per frame

const chunks = response.match(new RegExp(`.{1,${DRIP_CHUNK_SIZE}}`, 'gs')) ?? [response];  // 's' flag so '.' also matches newlines
for (const chunk of chunks) {
  session.talker.sendText(chunk);
  await new Promise(r => setTimeout(r, MOSHI_FRAME_MS));
}

Result: 336 chars = 17 chunks over 1.36s. PersonaPlex stays stable. No degeneration.

The Audio Collision Problem

Even with stable injection, two audio streams collide:

  1. PersonaPlex tries to speak the injected knowledge (garbled)
  2. External TTS speaks the clean answer

Failed: Client-Side Mute

Disconnecting moshiWorklet from audioCtx.destination in the browser:

  • Janky reconnection artifacts
  • Buffer residue leaks through
  • Race conditions with Web Audio API scheduling

Solution: Server-Side Audio Gate

Drop PersonaPlex audio frames at the server before they reach the browser:

// Session state
interface VoiceSession {
  suppressMoshiAudio: boolean;
  suppressMoshiUntil: number;  // auto-expire timestamp
}

// Audio handler: gate PersonaPlex frames
bus.on('talker.audio', (event) => {
  if (session.suppressMoshiAudio) {
    if (Date.now() > session.suppressMoshiUntil) {
      session.suppressMoshiAudio = false;  // auto-expire
    } else {
      return;  // drop frame silently
    }
  }
  session.userWs.send(event.data);  // pass through
});

// Suppress timing: WAV playback estimate + 8s cooldown
const estimatedPlayMs = Math.ceil((wavBuffer.byteLength / 48000) * 1000);
session.suppressMoshiUntil = Date.now() + estimatedPlayMs + 8000;
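
The `byteLength / 48000` estimate assumes roughly 48 kB/s of PCM. A slightly more precise variant (a sketch, assuming a canonical 44-byte RIFF/WAVE header with the fmt and data chunks at their standard offsets) reads the byte rate and data size directly:

```typescript
// Read playback duration from a standard PCM WAV header instead of
// assuming a fixed 48 000 bytes/sec.
function wavDurationMs(buf: ArrayBuffer): number {
  const view = new DataView(buf);
  const byteRate = view.getUint32(28, true);   // fmt chunk: bytes per second
  const dataBytes = view.getUint32(40, true);  // data chunk payload size
  return Math.ceil((dataBytes / byteRate) * 1000);
}
```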

Sentence-Chunked TTS

Instead of generating TTS for the entire response (3s wait), split at sentence boundaries:

const sentences = response.match(/[^.!?]+[.!?]+/g) ?? [response];
for (const sentence of sentences) {
  const formData = new FormData();
  formData.append('text', sentence.trim());
  const ttsResp = await fetch('http://localhost:8877/tts', { method: 'POST', body: formData });
  if (ttsResp.ok) {
    const wavBuffer = await ttsResp.arrayBuffer();
    const msg = new Uint8Array(1 + wavBuffer.byteLength);
    msg[0] = 0x10;  // TTS_AUDIO message type
    msg.set(new Uint8Array(wavBuffer), 1);
    session.userWs.send(msg.buffer);
  }
}

Browser queues WAVs and chains playback via source.onended callbacks. First audio arrives ~500ms after Letta responds instead of ~3s.
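
That chaining logic can be sketched independently of the Web Audio API (an illustration; `makeWavQueue` and the `play` callback are assumed names, with `play` standing in for `decodeAudioData` plus an `AudioBufferSourceNode` that invokes `onended` when a clip finishes):

```typescript
// Chained-playback queue: each clip triggers the next via its onended.
type PlayFn = (wav: ArrayBuffer, onended: () => void) => void;

function makeWavQueue(play: PlayFn): (wav: ArrayBuffer) => void {
  const queue: ArrayBuffer[] = [];
  let playing = false;
  const next = () => {
    const wav = queue.shift();
    if (!wav) { playing = false; return; }
    play(wav, next);  // chain the following clip when this one ends
  };
  return (wav: ArrayBuffer) => {
    queue.push(wav);
    if (!playing) { playing = true; next(); }
  };
}
```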

The Timeout Trap

After deployment, System 2 started returning "I ran into an issue thinking about that" for every query.

Diagnosis: The Reasoner's AbortController had a 30s timeout. Letta with web_search tool calls consistently took 30-40s:

  • LLM reasoning: ~5s
  • web_search execution: ~10-15s
  • LLM processes results + send_message: ~10-15s

Timeline from logs:

t=0.000s  Trigger fires
t=8.000s  Same query re-triggers (no dedup guard!)
t=30.005s AbortError on request #1
t=30.002s AbortError on request #2

Two fixes:

  1. Timeout: 30s → 60s
  2. In-flight deduplication flag to block re-triggers while Reasoner is processing
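
Both fixes together, as a hedged sketch (`askReasoner` and `callLetta` are illustrative names; `callLetta` stands in for the actual Letta request):

```typescript
// 60s AbortController timeout plus an in-flight flag that drops
// re-triggers while the Reasoner is still working.
const REASONER_TIMEOUT_MS = 60_000;  // was 30_000; web_search runs take 30-40s
let reasonerInFlight = false;

async function askReasoner(
  query: string,
  callLetta: (q: string, signal: AbortSignal) => Promise<string>,
): Promise<string | null> {
  if (reasonerInFlight) return null;         // dedup: ignore re-triggers
  reasonerInFlight = true;
  const ctrl = new AbortController();
  const timer = setTimeout(() => ctrl.abort(), REASONER_TIMEOUT_MS);
  try {
    return await callLetta(query, ctrl.signal);
  } catch {
    return null;                             // AbortError → graceful fallback
  } finally {
    clearTimeout(timer);
    reasonerInFlight = false;
  }
}
```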

Debugging Lessons

1. PersonaPlex Protocol Gotchas

  • Server sends 0x00 as the ready signal — do NOT echo it back
  • Client must only send 0x01 + opus_data — the server ignores all other message kinds
  • sphn.OpusStreamReader crashes on Ogg header pages sent out of sequence
  • NODE_TLS_REJECT_UNAUTHORIZED=0 is required for Bun's WSS client to accept the server's self-signed certificate
  • Server holds an asyncio.Lock — one session at a time
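
The framing rules above reduce to a one-byte message-kind prefix; a minimal sketch (constant names chosen here for illustration):

```typescript
// PersonaPlex wire framing per the gotchas: byte 0 is the message kind.
const MSG_READY = 0x00;  // server → client; must NOT be echoed back
const MSG_AUDIO = 0x01;  // client → server: 0x01 followed by opus payload

function frameAudio(opus: Uint8Array): Uint8Array {
  const msg = new Uint8Array(1 + opus.byteLength);
  msg[0] = MSG_AUDIO;
  msg.set(opus, 1);
  return msg;
}
```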

2. Moshi Architecture Facts

  • text_prompt = persona shaping only (fine-tuned on role prompts, not content relay)
  • sendText() = undocumented mid-conversation injection into Inner Monologue stream
  • Greeting behavior is immutable (baked into instruction fine-tuning weights)
  • ~200 token budget for text_prompt

3. Letta API Quirks

  • Trailing slash matters: /v1/health → 307 redirect → /v1/health/
  • Response extraction: actual text is in tool_call.arguments.message for send_message, not in assistant_message
  • web_search tool calls add 15-25s to response time
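
The extraction quirk can be sketched like this (the message shape is assumed from observed responses, not a documented Letta schema):

```typescript
// Pull the user-facing text from the send_message tool call's arguments
// rather than from assistant_message.
interface LettaMessage {
  message_type?: string;
  tool_call?: { name: string; arguments: string };  // arguments is JSON text
}

function extractReply(messages: LettaMessage[]): string | null {
  for (const m of messages) {
    if (m.tool_call?.name === "send_message") {
      try {
        return JSON.parse(m.tool_call.arguments).message ?? null;
      } catch { /* malformed arguments JSON: fall through */ }
    }
  }
  return null;
}
```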

Stack

Component           Tech                           Port
Voice Bridge        Bun + TypeScript               9001 (TLS)
PersonaPlex/Moshi   Python (NVIDIA, Kyutai)        8998 (WSS)
Letta (Reasoner)    Docker (3090 box)              8283
Pocket TTS          Python + FastAPI               8877
Voxtral             Mistral API (Mini + Small)
Browser             opus-recorder + Web Audio API

UX Timeline (Final)

t=0s     User says "search for cat breeds"
t=2s     Voxtral transcribes + classifies → trigger fires
t=2s     Thinking chime (440Hz, 120ms sine)
t=2s     "System 2 thinking..." indicator
t=17s    Letta responds (3 sentences)
t=17s    Server-side audio gate activates
t=17s    Drip-feed injection starts (20ch/80ms)
t=17.5s  Sentence 1 TTS arrives → plays immediately
t=19s    Sentence 2 plays
t=21s    Sentence 3 plays
t=29s    Gate expires → PersonaPlex audio resumes

References

  • Moshi paper (arXiv:2410.00037), cited above for the Inner Monologue frame-rate analysis

Code

All source: vaos-voice-bridge/ on the 007-blackice-3 branch.
