Before diving into technical challenges, it's worth sharing what inspired Ramble. We were deeply influenced by this scene from The Devil Wears Prada where Miranda dumps a dozen tasks on her assistant in one fell swoop. Non-stop, stream-of-consciousness. We asked ourselves: what if anyone could capture tasks that way? No typing, no careful formatting. Just speak your mind and let AI do the organizing.
We even tested our early prototypes using Miranda's exact monologue, playing the clip into a microphone. It became our north star use case.
1. Real-time streaming with tool execution
Traditional voice assistants follow a request-response pattern: you speak, it processes, it responds. But Miranda doesn't wait for confirmation after each task. She just keeps going. We needed an architecture that could:
- Stream audio continuously to the AI
- Have the AI proactively call tools (create task, edit task, delete task) while the user is still speaking
- Update the UI in real-time as tasks are created and modified
- Handle corrections mid-stream ("Actually, make that 11 AM, not 10")
This ruled out simple transcription-then-process approaches. We needed true real-time bidirectional communication with tool-calling capabilities.
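To make that concrete, here's a minimal sketch of what the client side of such a session looks like: audio flows up continuously while tool calls flow down and are applied to the UI immediately. The message shape and names (startSession, ToolCall, onToolCall) are illustrative assumptions, not our production protocol.

```typescript
// Hypothetical message shapes for a one-way voice session with tool calls.
type ToolCall =
  | { name: "addTask"; args: { content: string; due?: string; priority?: number } }
  | { name: "editTask"; args: { id: string; content?: string; due?: string } }
  | { name: "deleteTask"; args: { id: string } };

interface ServerMessage {
  toolCalls?: ToolCall[];
}

function startSession(url: string, onToolCall: (call: ToolCall) => void) {
  const socket = new WebSocket(url);
  socket.binaryType = "arraybuffer";

  // Tool calls can arrive while the user is still speaking; apply them to the
  // preview UI immediately instead of waiting for the end of speech.
  socket.onmessage = (event) => {
    const message: ServerMessage = JSON.parse(event.data as string);
    for (const call of message.toolCalls ?? []) {
      onToolCall(call); // e.g. create/update/delete a task in the preview
    }
  };

  return {
    // Raw PCM chunks go up continuously; there is no "send and wait" turn.
    sendAudioChunk(chunk: ArrayBuffer) {
      if (socket.readyState === WebSocket.OPEN) socket.send(chunk);
    },
    close() {
      socket.close();
    },
  };
}
```

Corrections like "Actually, make that 11 AM" simply arrive as a later editTask call against a task that already exists in the preview.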
2. Multilingual support at scale
Todoist has users worldwide. We couldn't ship a feature that only worked well in English. But testing voice AI across 15+ languages presents unique challenges:
- Native speakers have different speech patterns, speeds, and accents
- Even within a single language, regional accents vary significantly (British vs. Australian English, European vs. Brazilian Portuguese)
- Task semantics vary by culture (date formats, time expressions, relationship terms)
- We couldn't just test with one person per language; we needed diversity within each language
3. Non-deterministic output validation
LLMs don't produce identical outputs for identical inputs. A user saying "call mom tomorrow" might result in differently-worded tasks across sessions. Traditional assertion-based testing ("expect output to equal X") doesn't work. We needed a testing approach that validated semantic correctness rather than exact string matching.
4. Cross-browser audio handling
Browser APIs for microphone access are notoriously inconsistent:
- Different permission models across Chrome, Firefox, Safari
- Windows returning phantom "default" and "communications" devices that aren't real microphones
- Device IDs changing when users plug microphones into different USB ports
- The deprecated ScriptProcessorNode API that could break at any browser update
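As one example of the defensive handling this requires, here's a sketch of hiding Windows' virtual device aliases from the microphone picker; the filtering shown is a simplified assumption about our actual logic.

```typescript
// Hide Windows' virtual "default"/"communications" aliases, which duplicate
// physical microphones, so users only see real hardware in the picker.
async function listRealMicrophones(): Promise<MediaDeviceInfo[]> {
  const devices = await navigator.mediaDevices.enumerateDevices();
  return devices.filter(
    (d) =>
      d.kind === "audioinput" &&
      d.deviceId !== "default" &&
      d.deviceId !== "communications"
  );
}
```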
Gemini's Live API (via Vertex AI) powers the core real-time interaction:
- Native audio streaming: We send raw PCM audio directly to the model without pre-transcription. The model handles speech recognition and semantic understanding in a single pass, reducing latency.
- Proactive tool calling: Gemini invokes our task management tools (addTask, editTask, deleteTask) autonomously as the user speaks, without waiting for explicit commands.
- Session resumption: The Vertex API provides resumption tokens that let users pause and continue sessions, essential for mobile users who might switch apps or lose connectivity.
- Multilingual understanding: Gemini handles language detection and task extraction across our supported languages without requiring language-specific models.
Session resumption was easier than expected. We initially thought maintaining conversation state across reconnections would require complex server-side session management. But once we understood Vertex's resumption token approach (the token is provided by the API and changes with each context update), implementation was straightforward across all platforms.
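A minimal sketch of the pattern: keep the latest handle the server sends and pass it back when reconnecting. The field names (sessionResumptionUpdate, newHandle) follow the Live API's bidirectional message format as we understand it and should be treated as illustrative.

```typescript
// Track the most recent resumption handle across reconnects.
let latestHandle: string | undefined;

function onServerMessage(message: { sessionResumptionUpdate?: { newHandle?: string } }) {
  // The handle changes as context accumulates; always keep the newest one.
  if (message.sessionResumptionUpdate?.newHandle) {
    latestHandle = message.sessionResumptionUpdate.newHandle;
  }
}

function buildConnectConfig() {
  // On reconnect (app switch, dropped network), pass the stored handle so
  // the model resumes with the previous conversation state.
  return latestHandle
    ? { sessionResumption: { handle: latestHandle } }
    : { sessionResumption: {} };
}
```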
Context injection worked on the first try. We spent considerable time designing how to provide user context (projects, labels, preferences) to the model. We explored complex retrieval strategies and dynamic context windows. In the end, the simple "v1" approach (just passing the user's projects and labels in the system prompt) worked remarkably well. The model correctly assigns tasks to "🏠 Home" or "💼 Work" projects based on conversational context without elaborate engineering.
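Here's roughly what that "v1" looks like; the prompt wording and types are illustrative, not our exact production prompt.

```typescript
// Serialize the user's projects and labels straight into the system prompt.
interface UserContext {
  projects: { id: string; name: string }[];
  labels: string[];
}

function buildSystemPrompt(ctx: UserContext): string {
  const projectList = ctx.projects.map((p) => `- ${p.name} (id: ${p.id})`).join("\n");
  const labelList = ctx.labels.map((l) => `- ${l}`).join("\n");

  return [
    "You convert the user's spoken stream of thoughts into Todoist tasks.",
    "Assign each task to the most appropriate existing project.",
    "",
    "User projects:",
    projectList,
    "",
    "User labels:",
    labelList,
  ].join("\n");
}
```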
Our traditional test suite couldn't handle non-deterministic outputs. We developed an LLM-as-judge approach:
- Native speakers from our team recorded real-world scenarios in their languages (15+ languages, ~100 recordings total)
- Each scenario has expected semantic outcomes (e.g., "should create 3 tasks: one about calling family, one about shopping, one about exercise on Saturday at 11 AM")
- A separate Gemini model acts as a "judge," evaluating whether the actual output semantically matches the expected outcome
- We combine structural validation (task count, priority levels, date presence) with semantic validation (did the model understand the user's intent?)
- Given the stochastic nature of LLMs, we accept a defined pass-rate threshold for the test suite overall, while also monitoring per-language performance to catch regressions
This approach lets us evaluate new Gemini model versions systematically, understanding not just overall performance but which specific languages might see degraded experiences.
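In code, the two layers of validation look roughly like this; judgeModel.evaluate stands in for a separate Gemini call and is a hypothetical wrapper, not a real SDK method.

```typescript
// Structural checks first (cheap, deterministic), then a judge model for
// semantic intent, since task wording varies between runs.
interface ExpectedOutcome {
  taskCount: number;
  description: string; // e.g. "three tasks: call family, shopping, Saturday 11 AM exercise"
}

interface CreatedTask {
  content: string;
  due?: string;
  priority?: number;
}

async function evaluateScenario(
  tasks: CreatedTask[],
  expected: ExpectedOutcome,
  judgeModel: { evaluate: (prompt: string) => Promise<{ pass: boolean; reason: string }> }
) {
  // Structural validation: properties we can assert directly.
  if (tasks.length !== expected.taskCount) {
    return { pass: false, reason: `expected ${expected.taskCount} tasks, got ${tasks.length}` };
  }

  // Semantic validation: did the model understand the user's intent?
  const prompt = [
    "Expected outcome:",
    expected.description,
    "Actual tasks:",
    JSON.stringify(tasks, null, 2),
    "Do the actual tasks semantically fulfill the expected outcome? Answer pass/fail with a reason.",
  ].join("\n");

  return judgeModel.evaluate(prompt);
}
```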
flowchart LR
subgraph Client
Mic[Microphone]
Preview[Preview]
end
subgraph Backend["Aist Backend"]
Ramble["Brain Dump > Dictation > Streaming"]
end
subgraph Vertex["Vertex AI"]
Gemini[Gemini Live API]
end
Mic -- audio capture --> Ramble
Ramble -- PCM audio --> Gemini
Gemini -- tool calls --> Ramble
Ramble -- tool calls --> Preview
We deliberately structured our backend to enable future voice-powered features.
Streaming Layer (provider-agnostic)
- Manages WebSocket connections and session lifecycle
- Handles audio format conversion (resampling, encoding)
- Abstracts away provider differences
Dictation Module (one-way audio)
- Extends streaming with speech-to-text focus
- No AI responses back to user
- Foundation for capture-focused features
Brain Dump Module (Ramble)
- Extends dictation with Todoist-specific capabilities
- Injects user context (projects, labels) into prompts
- Defines task management tools
- Forwards tool calls to client for Todoist API execution
Conversation Module (two-way audio)
- Extends streaming with bidirectional audio
- Ready for future conversational features
- Same underlying infrastructure
This layered design means we can ship new voice features with minimal additional infrastructure work.
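Sketched as interfaces, the layering looks something like this (names are illustrative, not our actual backend types):

```typescript
// Each module builds on the streaming layer instead of owning its own
// connection handling.
interface StreamingSession {
  sendAudio(chunk: ArrayBuffer): void;
  close(): void;
}

interface StreamingLayer {
  // Provider-agnostic: callers never see Vertex- or Bedrock-specific types.
  connect(options: { sampleRate: number }): Promise<StreamingSession>;
}

interface DictationModule {
  // One-way audio in, transcripts out; no audio responses to the user.
  start(onTranscript: (text: string) => void): Promise<StreamingSession>;
}

interface BrainDumpModule extends DictationModule {
  // Adds Todoist context and tool definitions; tool calls are forwarded to
  // the client, which executes them against the Todoist API.
  start(
    onTranscript: (text: string) => void,
    onToolCall?: (name: string, args: unknown) => void
  ): Promise<StreamingSession>;
}

interface ConversationModule {
  // Two-way audio for future conversational features, same infrastructure.
  start(onAudioResponse: (chunk: ArrayBuffer) => void): Promise<StreamingSession>;
}
```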
Provider flexibility: While we're using Vertex AI in production (Google's models are best-in-class for our use case), our abstraction layer also supports AWS Bedrock. We could switch providers if needed, though we'd likely see performance differences since Google's specific models aren't available elsewhere.
useMicrophone Hook encapsulates all browser permission complexity:
- Five distinct permission states (loading, prompt, granted, denied, not-found)
- Chromium-specific handling for dismissed vs. blocked prompts
- Windows device alias filtering
- Fuzzy matching for persisted device preferences (handles device ID instability across browser updates)
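For the last point, a sketch of the fallback: try the saved device ID first, then fall back to label matching; the normalization heuristic shown is a simplified assumption.

```typescript
// Restore a persisted microphone choice even when device IDs are unstable.
interface SavedPreference {
  deviceId: string;
  label: string;
}

function resolvePreferredDevice(
  saved: SavedPreference,
  available: MediaDeviceInfo[]
): MediaDeviceInfo | undefined {
  // Exact ID still valid: use it.
  const byId = available.find((d) => d.deviceId === saved.deviceId);
  if (byId) return byId;

  // ID changed (new USB port, browser update): fall back to a fuzzy label
  // match, e.g. "Blue Yeti (USB Audio)" vs "Blue Yeti".
  const normalize = (s: string) => s.toLowerCase().replace(/\(.*?\)/g, "").trim();
  return available.find(
    (d) =>
      normalize(d.label).includes(normalize(saved.label)) ||
      normalize(saved.label).includes(normalize(d.label))
  );
}
```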
AudioWorklet for Capture: We migrated from the deprecated ScriptProcessorNode to AudioWorklet:
- Audio processing runs on dedicated thread (not competing with UI)
- Proper buffer management (2048 samples ≈ 40-50ms chunks)
- Multi-CDN support for worklet module loading
- Electron-specific lifecycle management (so Todoist doesn't keep computers awake just by being open)
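A minimal sketch of the capture side, assuming worklet-scope typings (e.g. @types/audioworklet): samples accumulate into 2048-sample chunks on the audio thread and are posted to the main thread for streaming.

```typescript
// Runs inside an AudioWorkletGlobalScope, loaded via audioWorklet.addModule().
class CaptureProcessor extends AudioWorkletProcessor {
  private buffer = new Float32Array(2048);
  private offset = 0;

  process(inputs: Float32Array[][]): boolean {
    const channel = inputs[0]?.[0];
    if (!channel) return true;

    for (const sample of channel) {
      this.buffer[this.offset++] = sample;
      if (this.offset === this.buffer.length) {
        // ~40-50 ms of audio depending on sample rate; post a copy and reset.
        this.port.postMessage(this.buffer.slice(0));
        this.offset = 0;
      }
    }
    return true; // keep the processor alive
  }
}

registerProcessor("capture-processor", CaptureProcessor);
```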
Canvas Waveform Visualization:
- HiDPI/Retina display support via devicePixelRatio scaling
- Frame-rate independent animation (consistent speed on 60Hz and 144Hz displays)
- RMS-based amplitude calculation for smooth loudness display
- Circular buffer for efficient history management
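Three of those pieces sketched out, with illustrative names: RMS amplitude per audio chunk, devicePixelRatio-aware canvas sizing, and a delta-time animation loop so scroll speed doesn't depend on refresh rate.

```typescript
function rmsAmplitude(samples: Float32Array): number {
  let sum = 0;
  for (const s of samples) sum += s * s;
  // Root mean square tracks perceived loudness better than peak values,
  // so the bars move smoothly instead of spiking on transients.
  return Math.sqrt(sum / samples.length);
}

function setupCanvas(canvas: HTMLCanvasElement): CanvasRenderingContext2D {
  const dpr = window.devicePixelRatio || 1;
  const { width, height } = canvas.getBoundingClientRect();
  // Render at physical resolution so bars stay crisp on HiDPI/Retina screens,
  // while drawing code keeps using CSS-pixel coordinates.
  canvas.width = Math.round(width * dpr);
  canvas.height = Math.round(height * dpr);
  const ctx = canvas.getContext("2d")!;
  ctx.scale(dpr, dpr);
  return ctx;
}

// Frame-rate independent scrolling: advance by elapsed time, not per frame,
// so the waveform moves at the same speed on 60 Hz and 144 Hz displays.
const SCROLL_PX_PER_SECOND = 60;
let scrollOffset = 0;
let lastFrame = performance.now();
function tick(now: number) {
  scrollOffset += SCROLL_PX_PER_SECOND * ((now - lastFrame) / 1000);
  lastFrame = now;
  // ...draw amplitude bars shifted by scrollOffset...
  requestAnimationFrame(tick);
}
requestAnimationFrame(tick);
```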
We'd be happy to work with your team to develop a polished architectural diagram for the story. The diagram above captures the key components, and we can provide additional detail for any layer.
For additional context, here's a video of our initial terminal-based proof-of-concept: https://www.loom.com/share/f7a4e642399f416787061c9290f8f1b5 (internal use only, not for publication)
This early version was fully conversational (two-way audio). Through iteration, we learned that the "Miranda" model of continuous one-way input was more effective for rapid task capture.
A few details that might resonate with a technical audience:
Race conditions in device switching: Users can switch microphones rapidly. Each switch triggers getUserMedia. We use AbortController to ensure only the final selection wins, and orphaned streams are properly stopped.
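A sketch of the "last selection wins" pattern, since getUserMedia itself can't be aborted:

```typescript
// Each switch aborts the previous request; streams from stale requests are
// stopped so the browser's mic indicator doesn't stay lit.
let currentSwitch: AbortController | null = null;

async function switchMicrophone(deviceId: string): Promise<MediaStream | null> {
  currentSwitch?.abort();
  const controller = new AbortController();
  currentSwitch = controller;

  const stream = await navigator.mediaDevices.getUserMedia({
    audio: { deviceId: { exact: deviceId } },
  });

  // If a newer switch started while we were waiting, discard this stream
  // and release its tracks.
  if (controller.signal.aborted) {
    stream.getTracks().forEach((t) => t.stop());
    return null;
  }
  return stream;
}
```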
Context inheritance: When users trigger Ramble from a project view, tasks automatically go to that project. Triggering from a priority group inherits that priority. This required careful state coordination across the application.
The waveform took multiple attempts: The scrolling visualization (bars flowing smoothly from right to left) was deceptively difficult. It required dedicated focus to get canvas rendering, audio processing, and smooth scrolling working together.
Honestly? Not much. The "simple first" approach served us well. Session resumption was easier than expected. Simple context injection worked. The layered architecture paid off immediately.
The main ongoing challenge is testing. Recorded samples from native speakers work well but are expensive to expand. Adding new scenarios requires coordinating new recordings across all our language contributors.