The Inspiration: Miranda Priestly

Before diving into technical challenges, it's worth sharing what inspired Ramble. We were deeply influenced by this scene from The Devil Wears Prada where Miranda dumps a dozen tasks on her assistant in one fell swoop. Non-stop, stream-of-consciousness. We asked ourselves: what if anyone could capture tasks that way? No typing, no careful formatting. Just speak your mind and let AI do the organizing.

We even tested our early prototypes using Miranda's exact monologue, playing the clip into a microphone. It became our north star use case.

Challenges

1. Real-time streaming with tool execution

Traditional voice assistants follow a request-response pattern: you speak, it processes, it responds. But Miranda doesn't wait for confirmation after each task. She just keeps going. We needed an architecture that could:

  • Stream audio continuously to the AI
  • Have the AI proactively call tools (create task, edit task, delete task) while the user is still speaking
  • Update the UI in real-time as tasks are created and modified
  • Handle corrections mid-stream ("Actually, make that 11 AM, not 10")

This ruled out simple transcription-then-process approaches. We needed true real-time bidirectional communication with tool-calling capabilities.

2. Multilingual support at scale

Todoist has users worldwide. We couldn't ship a feature that only worked well in English. But testing voice AI across 15+ languages presents unique challenges:

  • Native speakers have different speech patterns, speeds, and accents
  • Even within a single language, regional accents vary significantly (British vs. Australian English, European vs. Brazilian Portuguese)
  • Task semantics vary by culture (date formats, time expressions, relationship terms)
  • We couldn't just test with one person per language; we needed diversity within each language

3. Non-deterministic output validation

LLMs don't produce identical outputs for identical inputs. A user saying "call mom tomorrow" might result in differently-worded tasks across sessions. Traditional assertion-based testing ("expect output to equal X") doesn't work. We needed a testing approach that validated semantic correctness rather than exact string matching.

4. Cross-browser audio handling

Browser APIs for microphone access are notoriously inconsistent:

  • Different permission models across Chrome, Firefox, Safari
  • Windows returning phantom "default" and "communications" devices that aren't real microphones
  • Device IDs changing when users plug microphones into different USB ports
  • The deprecated ScriptProcessorNode API that could break at any browser update

Solution

Vertex AI and Gemini: Specific Capabilities

Gemini's Live API (via Vertex AI) powers the core real-time interaction:

  • Native audio streaming: We send raw PCM audio directly to the model without pre-transcription. The model handles speech recognition and semantic understanding in a single pass, reducing latency.
  • Proactive tool calling: Gemini invokes our task management tools (addTask, editTask, deleteTask) autonomously as the user speaks, without waiting for explicit commands.
  • Session resumption: The Vertex API provides resumption tokens that let users pause and continue sessions, essential for mobile users who might switch apps or lose connectivity.
  • Multilingual understanding: Gemini handles language detection and task extraction across our supported languages without requiring language-specific models.
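To make the shape of this concrete, here is a minimal sketch of the streaming-plus-tool-calling loop using the @google/genai SDK's Live API. It is illustrative rather than our production code: the model ID is a placeholder, the addTask schema is trimmed down, and error handling is omitted.

```typescript
import { GoogleGenAI, Modality, Type, type LiveServerMessage } from '@google/genai';

// Illustrative tool declaration; the real schema carries more fields (project, labels, due date, priority).
const addTask = {
  name: 'addTask',
  description: 'Create a Todoist task from what the user said.',
  parameters: {
    type: Type.OBJECT,
    properties: {
      content: { type: Type.STRING },
      due: { type: Type.STRING },
    },
    required: ['content'],
  },
};

export async function startRambleSession(onToolCall: (name: string, args: unknown) => void) {
  const ai = new GoogleGenAI({ vertexai: true, project: 'my-gcp-project', location: 'us-central1' });

  const session = await ai.live.connect({
    model: 'gemini-live-model-id', // placeholder model ID
    config: {
      responseModalities: [Modality.TEXT],
      tools: [{ functionDeclarations: [addTask] }],
    },
    callbacks: {
      onmessage: (message: LiveServerMessage) => {
        // The model calls tools proactively while the user is still speaking.
        for (const call of message.toolCall?.functionCalls ?? []) {
          onToolCall(call.name ?? '', call.args);
        }
      },
      onerror: (e) => console.error('live session error', e),
      onclose: () => console.log('live session closed'),
    },
  });

  return {
    // Send raw PCM chunks as they arrive from the client; no pre-transcription step.
    sendAudioChunk(base64Pcm: string) {
      session.sendRealtimeInput({ audio: { data: base64Pcm, mimeType: 'audio/pcm;rate=16000' } });
    },
    close() {
      session.close();
    },
  };
}
```

The important property is that tool calls arrive on the same message stream as everything else, so the client can render and update tasks while audio is still flowing.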

Nice Surprises

Session resumption was easier than expected. We initially thought maintaining conversation state across reconnections would require complex server-side session management. But once we understood Vertex's resumption token approach (the token is provided by the API and changes with each context update), implementation was straightforward across all platforms.
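In sketch form, the flow is: keep the newest handle the server sends, then pass it back when reconnecting. The field names below follow our reading of the Live API and the @google/genai SDK, so treat them as illustrative.

```typescript
import { GoogleGenAI, type LiveServerMessage } from '@google/genai';

const ai = new GoogleGenAI({ vertexai: true, project: 'my-gcp-project', location: 'us-central1' });

let latestHandle: string | undefined;

// Call this from the Live session's onmessage callback: the server pushes a fresh handle
// whenever the context is updated, so we always keep only the newest one.
export function trackResumptionHandle(message: LiveServerMessage) {
  const update = message.sessionResumptionUpdate;
  if (update?.resumable && update.newHandle) {
    latestHandle = update.newHandle;
  }
}

// On reconnect (app switch, dropped connection), hand the last handle back to the API.
export async function reconnect(model: string) {
  return ai.live.connect({
    model,
    config: { sessionResumption: { handle: latestHandle } },
    callbacks: { onmessage: trackResumptionHandle },
  });
}
```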

Context injection worked on the first try. We spent considerable time designing how to provide user context (projects, labels, preferences) to the model. We explored complex retrieval strategies and dynamic context windows. In the end, the simple "v1" approach (just passing the user's projects and labels in the system prompt) worked remarkably well. The model correctly assigns tasks to "🏠 Home" or "💼 Work" projects based on conversational context without elaborate engineering.
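A trimmed-down sketch of that "v1" context injection; the types and prompt wording are illustrative, not the exact prompt we ship:

```typescript
// Hypothetical shapes; the real context includes more than projects and labels.
interface UserContext {
  projects: { id: string; name: string }[];
  labels: { id: string; name: string }[];
  timezone: string;
}

// "v1" context injection: serialize the user's projects and labels straight into the system prompt.
export function buildSystemPrompt(ctx: UserContext): string {
  const projects = ctx.projects.map((p) => `- ${p.name} (id: ${p.id})`).join('\n');
  const labels = ctx.labels.map((l) => `- ${l.name}`).join('\n');

  return [
    "You turn the user's spoken stream of thoughts into Todoist tasks.",
    'Call addTask, editTask, or deleteTask as soon as you are confident about a task.',
    `The user's timezone is ${ctx.timezone}.`,
    'Assign tasks to one of these projects when the context makes it clear:',
    projects,
    'Available labels:',
    labels,
  ].join('\n\n');
}
```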

The Testing Challenge

Our traditional test suite couldn't handle non-deterministic outputs. We developed an LLM-as-judge approach:

  1. Native speakers from our team recorded real-world scenarios in their languages (15+ languages, ~100 recordings total)
  2. Each scenario has expected semantic outcomes (e.g., "should create 3 tasks: one about calling family, one about shopping, one about exercise on Saturday at 11 AM")
  3. A separate Gemini model acts as a "judge," evaluating whether the actual output semantically matches the expected outcome
  4. We combine structural validation (task count, priority levels, date presence) with semantic validation (did the model understand the user's intent?)
  5. Given the stochastic nature of LLMs, we accept a defined pass-rate threshold for the test suite overall, while also monitoring per-language performance to catch regressions

This approach lets us evaluate new Gemini model versions systematically, understanding not just overall performance but which specific languages might see degraded experiences.
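A condensed sketch of what one run of the suite looks like. The scenario shape, judge prompt, model ID, and 90% threshold are all illustrative rather than our actual values, and per-language tracking is omitted for brevity.

```typescript
import { GoogleGenAI } from '@google/genai';

// Hypothetical shapes for a single test scenario; ours carry more metadata (language, speaker, etc.).
interface Scenario {
  audioFile: string;
  expectedOutcome: string; // e.g. "should create 3 tasks: call family, shopping, exercise Saturday 11 AM"
  expectedTaskCount: number;
}
interface ProducedTask {
  content: string;
  due?: string;
  priority?: number;
}

const ai = new GoogleGenAI({ vertexai: true, project: 'my-gcp-project', location: 'us-central1' });

// Structural validation is cheap and deterministic, so it runs first.
function structurallyValid(scenario: Scenario, tasks: ProducedTask[]): boolean {
  return tasks.length === scenario.expectedTaskCount;
}

// Semantic validation: a separate model judges whether the produced tasks match the user's intent.
async function semanticallyValid(scenario: Scenario, tasks: ProducedTask[]): Promise<boolean> {
  const response = await ai.models.generateContent({
    model: 'judge-model-id', // placeholder model ID
    contents: [
      'You are judging a voice-to-task system.',
      `Expected outcome: ${scenario.expectedOutcome}`,
      `Produced tasks: ${JSON.stringify(tasks)}`,
      'Answer with exactly PASS or FAIL.',
    ].join('\n'),
  });
  return (response.text ?? '').trim().toUpperCase().startsWith('PASS');
}

// The suite passes if the overall pass rate clears a threshold, since individual runs are stochastic.
export async function runSuite(
  scenarios: Scenario[],
  produce: (audioFile: string) => Promise<ProducedTask[]>, // the pipeline under test
  threshold = 0.9, // illustrative threshold
): Promise<boolean> {
  let passed = 0;
  for (const scenario of scenarios) {
    const tasks = await produce(scenario.audioFile);
    if (structurallyValid(scenario, tasks) && (await semanticallyValid(scenario, tasks))) {
      passed += 1;
    }
  }
  return passed / scenarios.length >= threshold;
}
```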


Architecture

High-Level Overview

```mermaid
flowchart LR
    subgraph Client
        Mic[Microphone]
        Preview[Preview]
    end

    subgraph Backend["Aist Backend"]
        Ramble["Brain Dump > Dictation > Streaming"]
    end

    subgraph Vertex["Vertex AI"]
        Gemini[Gemini Live API]
    end

    Mic -- audio capture --> Ramble
    Ramble -- PCM audio --> Gemini
    Gemini -- tool calls --> Ramble
    Ramble -- tool calls --> Preview
```

Backend Architecture

We deliberately structured our backend to enable future voice-powered features.

Streaming Layer (provider-agnostic)

  • Manages WebSocket connections and session lifecycle
  • Handles audio format conversion (resampling, encoding)
  • Abstracts away provider differences

Dictation Module (one-way audio)

  • Extends streaming with speech-to-text focus
  • No AI responses back to user
  • Foundation for capture-focused features

Brain Dump Module (Ramble)

  • Extends dictation with Todoist-specific capabilities
  • Injects user context (projects, labels) into prompts
  • Defines task management tools
  • Forwards tool calls to client for Todoist API execution

Conversation Module (two-way audio)

  • Extends streaming with bidirectional audio
  • Ready for future conversational features
  • Same underlying infrastructure

This layered design means we can ship new voice features with minimal additional infrastructure work.

Provider flexibility: While we're using Vertex AI in production (Google's models are best-in-class for our use case), our abstraction layer also supports AWS Bedrock. We could switch providers if needed, though we'd likely see performance differences since Google's specific models aren't available elsewhere.
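In simplified form, the layering and the provider seam look roughly like this (names are illustrative, not our actual code):

```typescript
// Streaming layer: provider-agnostic session handling.
interface ToolDeclaration {
  name: string;
  description: string;
}
interface ProviderSession {
  sendAudio(pcm: ArrayBuffer): void;
  onToolCall(handler: (name: string, args: unknown) => void): void;
  close(): void;
}
interface StreamingProvider {
  // Implemented once per provider (e.g. Vertex AI, AWS Bedrock) behind the same interface.
  connect(config: { systemPrompt: string; tools: ToolDeclaration[] }): Promise<ProviderSession>;
}

// Dictation: one-way audio on top of streaming, no AI responses back to the user.
class DictationSession {
  constructor(protected session: ProviderSession) {}
  pushAudio(pcm: ArrayBuffer) {
    this.session.sendAudio(pcm);
  }
}

// Brain Dump (Ramble): dictation plus Todoist context and task tools,
// forwarding tool calls to the client for Todoist API execution.
class BrainDumpSession extends DictationSession {
  constructor(session: ProviderSession, forwardToClient: (name: string, args: unknown) => void) {
    super(session);
    session.onToolCall(forwardToClient);
  }
}
```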

Frontend Architecture

useMicrophone Hook encapsulates all browser permission complexity:

  • Five distinct permission states (loading, prompt, granted, denied, not-found)
  • Chromium-specific handling for dismissed vs. blocked prompts
  • Windows device alias filtering
  • Fuzzy matching for persisted device preferences (handles device ID instability across browser updates)
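Two of those pieces (the Windows alias filtering and the fuzzy device matching) sketched in isolation; the real hook also tracks permission state and handles the Chromium prompt quirks:

```typescript
// On Windows, Chromium exposes "default" and "communications" aliases that duplicate real devices.
function filterDeviceAliases(devices: MediaDeviceInfo[]): MediaDeviceInfo[] {
  return devices.filter(
    (d) => d.kind === 'audioinput' && d.deviceId !== 'default' && d.deviceId !== 'communications',
  );
}

// Device IDs are not stable across browser updates, so fall back to fuzzy matching on the label.
function findPersistedDevice(
  devices: MediaDeviceInfo[],
  persisted: { deviceId: string; label: string },
): MediaDeviceInfo | undefined {
  return (
    devices.find((d) => d.deviceId === persisted.deviceId) ??
    devices.find((d) => d.label !== '' && d.label === persisted.label) ??
    devices.find((d) => d.label.toLowerCase().includes(persisted.label.toLowerCase()))
  );
}

export async function listMicrophones(): Promise<MediaDeviceInfo[]> {
  const devices = await navigator.mediaDevices.enumerateDevices();
  return filterDeviceAliases(devices);
}
```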

AudioWorklet for Capture: We migrated from the deprecated ScriptProcessorNode to AudioWorklet:

  • Audio processing runs on dedicated thread (not competing with UI)
  • Proper buffer management (2048 samples ≈ 40-50ms chunks)
  • Multi-CDN support for worklet module loading
  • Electron-specific lifecycle management (so the audio pipeline doesn't keep computers awake while Todoist is open)
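A self-contained sketch of the capture pipeline. For brevity it loads the worklet from a Blob URL and hands chunks to a callback; in production the module is served from a CDN and the PCM goes over the streaming connection.

```typescript
const workletSource = `
class CaptureProcessor extends AudioWorkletProcessor {
  constructor() {
    super();
    this.buffer = new Float32Array(2048); // ~43 ms at 48 kHz
    this.offset = 0;
  }
  process(inputs) {
    const channel = inputs[0]?.[0];
    if (!channel) return true;
    for (let i = 0; i < channel.length; i++) {
      this.buffer[this.offset++] = channel[i];
      if (this.offset === this.buffer.length) {
        this.port.postMessage(this.buffer.slice(0)); // ship a copy, reuse the buffer
        this.offset = 0;
      }
    }
    return true; // keep the processor alive
  }
}
registerProcessor('capture-processor', CaptureProcessor);
`;

export async function startCapture(stream: MediaStream, onChunk: (samples: Float32Array) => void) {
  const context = new AudioContext();
  const moduleUrl = URL.createObjectURL(new Blob([workletSource], { type: 'application/javascript' }));
  await context.audioWorklet.addModule(moduleUrl);
  URL.revokeObjectURL(moduleUrl);

  const source = context.createMediaStreamSource(stream);
  const worklet = new AudioWorkletNode(context, 'capture-processor');
  worklet.port.onmessage = (event) => onChunk(event.data as Float32Array);
  source.connect(worklet);

  return () => {
    // Tear down so the audio context doesn't keep the machine awake (relevant in Electron).
    source.disconnect();
    worklet.disconnect();
    void context.close();
  };
}
```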

Canvas Waveform Visualization:

  • HiDPI/Retina display support via devicePixelRatio scaling
  • Frame-rate independent animation (consistent speed on 60Hz and 144Hz displays)
  • RMS-based amplitude calculation for smooth loudness display
  • Circular buffer for efficient history management
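Sketches of the individual pieces; our component wires these into the scrolling bar renderer:

```typescript
// RMS gives a perceptually smoother loudness value than peak amplitude.
export function rmsAmplitude(samples: Float32Array): number {
  let sum = 0;
  for (let i = 0; i < samples.length; i++) {
    sum += samples[i] * samples[i];
  }
  return Math.sqrt(sum / samples.length);
}

// Scale the backing store by devicePixelRatio so bars stay crisp on HiDPI/Retina displays.
export function setupHiDpiCanvas(canvas: HTMLCanvasElement): CanvasRenderingContext2D {
  const dpr = window.devicePixelRatio || 1;
  const { width, height } = canvas.getBoundingClientRect();
  canvas.width = Math.round(width * dpr);
  canvas.height = Math.round(height * dpr);
  const ctx = canvas.getContext('2d')!;
  ctx.scale(dpr, dpr); // draw in CSS pixels from here on
  return ctx;
}

// Frame-rate independent scrolling: advance by elapsed time, not by frame count,
// so the waveform moves at the same speed on 60 Hz and 144 Hz displays.
export function startScrolling(draw: (offsetPx: number) => void, pxPerSecond = 60): () => void {
  let last = performance.now();
  let offset = 0;
  let rafId = requestAnimationFrame(function tick(now) {
    offset += ((now - last) / 1000) * pxPerSecond;
    last = now;
    draw(offset);
    rafId = requestAnimationFrame(tick);
  });
  return () => cancelAnimationFrame(rafId);
}
```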

Architectural Diagram

We'd be happy to work with your team to develop a polished architectural diagram for the story. The diagram above captures the key components, and we can provide additional detail for any layer.


Other Details

Early Proof of Concept

For additional context, here's a video of our initial terminal-based proof-of-concept: https://www.loom.com/share/f7a4e642399f416787061c9290f8f1b5 (internal use only, not for publication)

This early version was fully conversational (two-way audio). Through iteration, we found that the "Miranda" model of continuous one-way input was more effective for rapid task capture.

Frontend Challenges

A few details that might resonate with a technical audience:

Race conditions in device switching: Users can switch microphones rapidly. Each switch triggers getUserMedia. We use AbortController to ensure only the final selection wins, and orphaned streams are properly stopped.
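The pattern, roughly (simplified from the hook):

```typescript
// "Last selection wins": abort any in-flight switch before starting a new one.
let currentSwitch: AbortController | undefined;

export async function switchMicrophone(deviceId: string): Promise<MediaStream | undefined> {
  currentSwitch?.abort();
  const controller = new AbortController();
  currentSwitch = controller;

  // getUserMedia itself doesn't accept an abort signal, so we check the signal once it resolves.
  const stream = await navigator.mediaDevices.getUserMedia({
    audio: { deviceId: { exact: deviceId } },
  });

  if (controller.signal.aborted) {
    // A newer selection won while this one was pending; don't leak the orphaned stream.
    stream.getTracks().forEach((track) => track.stop());
    return undefined;
  }
  return stream;
}
```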

Context inheritance: When users trigger Ramble from a project view, tasks automatically go to that project. Triggering from a priority group inherits that priority. This required careful state coordination across the application.

The waveform took multiple attempts: The scrolling visualization (bars flowing smoothly from right to left) was deceptively difficult. It required dedicated focus to get canvas rendering, audio processing, and smooth scrolling working together.

What We'd Do Differently

Honestly? Not much. The "simple first" approach served us well. Session resumption was easier than expected. Simple context injection worked. The layered architecture paid off immediately.

The main ongoing challenge is testing. Recorded samples from native speakers work well but are expensive to expand. Adding new scenarios requires coordinating new recordings across all our language contributors.
