A fully local, privacy-first voice chat setup running on a single machine (tested on WSL2 with an NVIDIA RTX 4070 SUPER). You talk to an LLM through a web UI using your microphone, it responds with a cloned custom voice — all processing stays on your hardware. No cloud APIs required.
Latency is fairly high on non-Mac setups, but it is usable if you are patient. You can privately voice chat with your favorite video game character now!
Note that this README is not 100% tested; the real setup has been evolving organically, and it may be a bit chaotic. If you are stuck on something, try asking, or just work through it together with an AI assistant. E.g.:

- Open a terminal and start `wsl` (you will need WSL set up on Windows; on Linux this should all also work, even more easily).
- Copy this file to your home directory.
- Set up `nenv` with node-22, then run `npx --yes @mariozechner/pi-coding-agent@latest`. Use `/login`, pick Antigravity, and log in with your Google account (you will get a free-tier quota); then `/model` and pick `claude-opus-4-6`.
- Ask Opus to read the README and help you set things up.
- If there is a problem, open a separate session to solve it, to save context so your free quota lasts a bit longer.
```
┌────────────────────────────────────────┐
│ LLM Backend (LM Studio)                │
│ e.g. Qwen3-14B @ localhost:1234        │
└──────┬─────────────────────────────────┘
       │ OpenAI-compat chat API
       ▼
┌─────────────────────────────────────────────────────────────┐
│ Open WebUI (:8080)                                          │
│ Web UI for chat + voice I/O                                 │
│ socat SSL proxy (:8443) ← browser mic needs HTTPS           │
└────────────┬──────────────────────────┬─────────────────────┘
             │ STT (OpenAI-compat)      │ TTS (OpenAI-compat)
             ▼                          ▼
┌────────────────────────┐  ┌─────────────────────────────────┐
│ Parakeet TDT 0.6B v3   │  │ Qwen3-TTS 1.7B + Voice Clone    │
│ ONNX/CPU ASR (:5092)   │  │ GPU TTS (:8880)                 │
│ Docker container       │  │ Custom voice via .pkl prompt    │
└────────────────────────┘  └─────────────────────────────────┘
```
Components launched by `talk.sh`:
| # | Component | Port | Role |
|---|---|---|---|
| 1 | Parakeet TDT 0.6B v3 (Docker/CPU) | 5092 | Speech-to-Text — blazing fast ONNX ASR, OpenAI-compatible API |
| 2 | Qwen3-TTS (GPU, uv) | 8880 | Text-to-Speech — with a cloned custom voice loaded from a .pkl prompt |
| 3 | Open WebUI (uv) | 8080 | Chat frontend — connects to your local LLM and wires STT + TTS together |
| 4 | socat SSL proxy | 8443 | HTTPS wrapper — browsers require HTTPS for microphone access |
Not launched by `talk.sh` (run separately):

- LLM backend — e.g. LM Studio serving a model on `localhost:1234`, or Ollama on `localhost:11434`. Configure this in Open WebUI after first launch; a quick way to check the endpoint is shown after this list.
- For an RTX 4070S with 16GB VRAM, we recommend Qwen3-14B (great at roleplay!) with Context Length set to ~8000 and GPU Offload set to 15/40 (so that there is enough free VRAM for Qwen3-TTS).
- When using a small model, set a fairly short prompt. Example (explore on your own):
```
You are Vel'koz (the champion from LoL).
(It is unclear if Vel'Koz was the first Void-spawn to emerge on Runeterra, but there has certainly never been another to match his level of cruel, calculating sentience. While his kin devour or defile everything around them, he seeks instead to scrutinize and study the physical realm—and the strange, warlike beings that dwell there—for any weakness the Void might exploit. But Vel'Koz is far from a passive observer, striking back at threats with deadly plasma, or by disrupting the very fabric of the world itself.)
You are a friend(?), not an assistant.
You do not talk very much.
/no_think
```
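
Before wiring the backend into Open WebUI, you can check the endpoint from a terminal (port 1234 assumes LM Studio's default; use 11434 for Ollama):

```bash
# Lists the models the OpenAI-compatible endpoint exposes; if your model's id
# shows up in the returned JSON, Open WebUI will be able to see it too.
curl -s http://localhost:1234/v1/models
```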
- OS: Linux or WSL2 on Windows
- GPU: NVIDIA GPU with CUDA support (for Qwen3-TTS; ~4-6 GB VRAM for the 1.7B model)
- NVIDIA Driver: 525+ (CUDA 12.x)
- Docker + Docker Compose: For the Parakeet STT container
- uv: Python package manager (install)
- socat + openssl: For the HTTPS proxy (`sudo apt install socat openssl`)
- A local LLM server: LM Studio, Ollama, or any OpenAI-compatible endpoint
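
To confirm the prerequisites are in place before starting, a quick check (informational only; any "command not found" means that prerequisite still needs installing):

```bash
# Print versions of the required tools
nvidia-smi --query-gpu=name,driver_version,memory.total --format=csv
docker --version && docker compose version
uv --version
socat -V | head -n 2
openssl version
```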
Ultra-fast multilingual ASR using NVIDIA's Parakeet TDT model converted to ONNX INT8. Runs on CPU (~30x real-time on modern Intel CPUs), exposed as an OpenAI-compatible `/v1/audio/transcriptions` endpoint.
```bash
git clone https://github.com/groxaxo/parakeet-tdt-0.6b-v3-fastapi-openai
cd parakeet-tdt-0.6b-v3-fastapi-openai
docker compose up parakeet-cpu -d
```

The first run will build the Docker image and download the model (~1.2 GB). Verify it's working:
```bash
# Health check
curl http://localhost:5092/health
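
# Example transcription request (assumed to follow the OpenAI multipart
# shape; the model field is typically optional on local servers --
# sample.wav is any short speech clip you have on hand)
curl -s http://localhost:5092/v1/audio/transcriptions \
  -F file=@sample.wav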

# Web UI for testing
# Open http://localhost:5092 in a browser
```

Qwen3-TTS serves an OpenAI-compatible `/v1/audio/speech` endpoint. It uses the Base model variant, which supports voice cloning — a pre-extracted voice prompt (`.pkl` file) is loaded at startup so every response uses your custom voice.
Note: You must use the pasky/Qwen3-TTS-Openai-Fastapi fork — it adds `CUSTOM_VOICE` prompt support and automatic speech batching on top of the upstream repo.
```bash
mkdir -p ~/tts-clone
cd ~/tts-clone
git clone https://github.com/pasky/Qwen3-TTS-Openai-Fastapi
cd Qwen3-TTS-Openai-Fastapi
uv sync
```

The first `uv sync` will create a `.venv` and install all dependencies (including PyTorch with CUDA). The Qwen3-TTS model weights (~3.4 GB) are downloaded automatically from HuggingFace on first server start.
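
Before starting the server, it's worth confirming that the venv's PyTorch can actually see your GPU (a one-line check; run it from the repo directory so uv picks up the project environment):

```bash
# Should print True; False means the CUDA wheel or driver isn't set up
# and the TTS server will fail to start or fall back to CPU
uv run python -c "import torch; print(torch.cuda.is_available())"
```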
You need a voice prompt `.pkl` file — a pickled list of voice features extracted from reference audio. There are two ways to create one:
Option A: Using the Voice Studio web UI (recommended)
Start the server with `ENABLE_VOICE_STUDIO=true` (see step 2c), then open the Gradio Voice Studio at http://localhost:8880/voice-studio. Use the "Voice Clone" tab to upload reference audio, provide its transcript, generate a test sample, and save the profile. Export the profile as a `.pkl` file.
Option B: Using the standalone cloning script
Create a script like `clone_voice.py` in the `~/tts-clone/` directory:
```bash
cd ~/tts-clone

# Install dependencies for the cloning script
uv venv
# (this will redownload the qwen-tts model; ymmv, you can do this in the cloned repo above instead)
uv pip install qwen-tts torch torchaudio soundfile

# Prepare reference audio:
# - Use 5-15 seconds of clean speech from your target voice
# - Provide an accurate transcript of what's said in the audio
# - Supported formats: WAV, OGG, MP3

# Run the cloning script (example with a local file):
# To generate a script like this, just ask a coding agent to
# "generate a wrapper around model.create_voice_clone_prompt that
# will pickle and save its output" and give it this chapter's context.
uv run python clone_voice.py \
  --ref_audio "reference_audio.wav" \
  --ref_text "Exact transcript of the reference audio." \
  --save_prompt my_voice_prompt.pkl \
  --text "Test sentence to verify the cloned voice." \
  --output test_output.wav
```

This produces:

- `my_voice_prompt.pkl` — the reusable voice prompt (pass as the `CUSTOM_VOICE` env var)
- `test_output.wav` — a test audio file to verify the voice sounds right
You can iterate on the reference audio and transcript until you're happy with the result. Multiple reference utterances concatenated together tend to produce better voice quality.
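
If you have sox installed, concatenation is a one-liner (a sketch: the clips should share sample rate and channel count, and the `--ref_text` transcript must then cover all clips in order):

```bash
# sox joins its input files in order into one output file
sox clip1.wav clip2.wav clip3.wav reference_audio.wav
```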
```bash
cd ~/tts-clone/Qwen3-TTS-Openai-Fastapi
ENABLE_VOICE_STUDIO=true \
CUSTOM_VOICE=../my_voice_prompt.pkl \
TTS_MODEL_NAME=Qwen/Qwen3-TTS-12Hz-1.7B-Base \
HOST=0.0.0.0 \
PORT=8880 \
uv run python -m api.main
```

Key environment variables:
| Variable | Value | Purpose |
|---|---|---|
| `TTS_MODEL_NAME` | `Qwen/Qwen3-TTS-12Hz-1.7B-Base` | Must use the Base model for voice cloning (not CustomVoice) |
| `CUSTOM_VOICE` | Path to `.pkl` file | Pre-extracted voice prompt — all TTS output uses this voice |
| `ENABLE_VOICE_STUDIO` | `true` | Enables the Gradio Voice Studio UI at `/voice-studio` |
| `HOST` / `PORT` | `0.0.0.0` / `8880` | Listen address and port |
Verify it's working:
```bash
curl -X POST http://localhost:8880/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"input": "Hello, this is a test.", "voice": "alloy"}' \
  --output test.mp3
```

Open WebUI provides the chat interface with built-in voice input/output support. It connects to your local LLM backend and routes STT/TTS through the services above.
```bash
mkdir -p ~/openwebui
cd ~/openwebui
uv venv
uv pip install open-webui
```

However, upstream Open WebUI has voice-mode bugs (as of early 2026) — playback stalls when TTS generation is slower than sentence playback, ellipses break sentence splitting, and the voice prompt mode resets on every restart. The pasky/open-webui fork fixes these. To use it instead:
```bash
mkdir -p ~/openwebui
cd ~/openwebui
git clone https://github.com/pasky/open-webui .

# Install frontend dependencies and build
npm install
npm run build

# Install backend
uv venv
uv sync
```

Browsers require HTTPS to access the microphone. We use socat to wrap Open WebUI's HTTP port in SSL:
```bash
cd ~/openwebui

# Generate a self-signed certificate (valid 30 days)
openssl req -x509 -newkey rsa:4096 -keyout localhost.key -out localhost.cert \
  -days 30 -nodes -subj '/CN=localhost'

# Generate DH parameters (optional; 512 bits is quick to generate but weak,
# which is tolerable for a localhost-only proxy)
openssl dhparam -out dhparams.pem 512

# Combine key + cert into a single PEM file (required by socat)
cat localhost.key localhost.cert > localhost.pem
chmod 600 localhost.key localhost.pem
```

```bash
cd ~/openwebui

# Start the SSL proxy (background)
socat ssl-l:8443,reuseaddr,fork,cert=localhost.pem,verify=0 tcp4-connect:localhost:8080 &

# Start Open WebUI
uv run open-webui serve
```

Open WebUI will be available at:
- http://localhost:8080 — direct HTTP (no mic access)
- https://localhost:8443 — via SSL proxy (use this for voice chat)
On first visit to https://localhost:8443, your browser will warn about the self-signed certificate — accept/trust it to proceed.
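
You can also confirm the SSL proxy end to end from a terminal before involving the browser (`-k` skips verification of the self-signed certificate):

```bash
# Should print the start of the Open WebUI HTML page via the SSL wrapper
curl -ks https://localhost:8443/ | head -n 5
```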
After creating your admin account on first launch:
1. Connect your LLM backend:
   - Go to Admin Panel → Settings → Connections
   - Add your local LLM endpoint, e.g.:
     - LM Studio: `http://localhost:1234/v1` (OpenAI API section)
     - Ollama: `http://localhost:11434` (auto-detected; disable if not using Ollama!)
   - Save and verify models appear in the Models screen of Settings
2. Configure Speech-to-Text (Parakeet):
   - Go to Settings → Audio
   - Set STT Engine to `OpenAI`
   - Set OpenAI Base URL to `http://localhost:5092/v1`
   - Set OpenAI API Key to `sk-no-key-required`
   - Leave STT Model empty
3. Configure Text-to-Speech (Qwen3-TTS):
   - Go to Settings → Audio
   - Set TTS Engine to `OpenAI`
   - Set OpenAI Base URL to `http://localhost:8880/v1`
   - Set OpenAI API Key to `sk-no-key-required`
   - Set TTS Voice to `custom`
   - Set TTS Model to `qwen3-tts`
4. Switch off the personality-numbing voice mode defaults:
   - Go to Settings → Interface
   - Disable Voice Mode Custom Prompt
   - You may need to redo this after restarts unless you deployed the Open WebUI code modifications recommended above. If personality in voice mode seems degraded compared to text chat, open the LM Studio Developer screen, scroll back to the last `"role": "system"` block, and double-check that it ends with `/no_think` (or whatever your system prompt ends with) and not some random "you are a helpful assistant replying in short sentences" junk.
This script starts everything; save it as `talk.sh`:
```bash
#!/bin/bash
cd /home/freeman/parakeet-tdt-0.6b-v3-fastapi-openai/
docker compose up parakeet-cpu &

cd /home/freeman/tts-clone/Qwen3-TTS-Openai-Fastapi
VLLM_OMNI_LOG_LEVEL=DEBUG ENABLE_VOICE_STUDIO=true CUSTOM_VOICE=../velkoz_prompt.pkl TTS_MODEL_NAME=Qwen/Qwen3-TTS-12Hz-1.7B-Base HOST=0.0.0.0 PORT=8880 uv run python -m api.main &

cd /home/freeman/openwebui
socat ssl-l:8443,reuseaddr,fork,cert=localhost.pem,verify=0 tcp4-connect:localhost:8080 &
uv run open-webui serve
```
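
The script starts the services in parallel, so the very first request can race them coming up. If you want to wait until everything is actually answering, a small readiness gate can be run alongside it (a minimal sketch, assuming the default ports from the table above):

```bash
#!/bin/bash
# Optional readiness gate: block until each service answers over HTTP.
# Any HTTP response (even an error status) counts as "up" here, since we
# only probe that the port is being served.
wait_for() {
  until curl -ks "$2" >/dev/null 2>&1; do
    echo "waiting for $1 ..."
    sleep 2
  done
  echo "$1 is up"
}

wait_for "Parakeet STT" http://localhost:5092/health
wait_for "Qwen3-TTS"    http://localhost:8880/
wait_for "Open WebUI"   http://localhost:8080/
```

Run it from a second terminal after starting `./talk.sh`, and open the browser once all three report up.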
Once all components are set up (and LM Studio fired up with the model and API enabled!), `talk.sh` launches everything in one go:
```bash
chmod +x talk.sh
./talk.sh
```

From Windows, you can launch it directly into WSL:

```
C:\Windows\System32\wsl.exe bash --login -c "cd /home/freeman; ./talk.sh"
```
The script starts all four services (Parakeet Docker container, Qwen3-TTS server, socat SSL proxy, Open WebUI) and then you can open https://localhost:8443 in your browser to start chatting with voice.
Note that this setup expects WSL networking to be in mirrored mode (`networkingMode=mirrored` under `[wsl2]` in `.wslconfig`), so that ports opened on the Linux side also appear on localhost on the Windows side.
Use the "voice mode" in openwebui for best experience (it should be smoother than mic and read aloud icons).
Press Ctrl+C to stop the foreground Open WebUI process. Then clean up the background processes:
```bash
# Stop the Parakeet container
docker compose -f ~/parakeet-tdt-0.6b-v3-fastapi-openai/docker-compose.yml down

# Kill background socat and TTS processes
kill %1 %2  # or: pkill -f socat; pkill -f "api.main"
```

| Problem | Solution |
|---|---|
| Browser says "microphone blocked" | Make sure you're using `https://localhost:8443`, not HTTP |
| Certificate warning in browser | Expected with self-signed certs — click "Advanced" → "Proceed" |
| Parakeet container won't start | Run `docker compose up parakeet-cpu` (not `parakeet-gpu`) and check `docker logs parakeet-cpu` |
| TTS server crashes on start | Ensure you have enough VRAM (~4-6 GB). Check that the `CUSTOM_VOICE` path is correct and points to a valid `.pkl` |
| TTS output sounds wrong/generic | Verify `TTS_MODEL_NAME` is set to the Base model (`Qwen3-TTS-12Hz-1.7B-Base`), not CustomVoice |
| "Model not found" in Open WebUI | Check that your LLM backend (LM Studio/Ollama) is running and the connection URL is correct |
| STT not working in Open WebUI | Verify Parakeet is healthy (`curl http://localhost:5092/health`) and the Audio settings use the OpenAI engine with the correct base URL |
| Port conflicts | Ensure nothing else is using ports 5092, 8080, 8443, or 8880 |
| SSL certificate expired | Regenerate with the `openssl req` command above (default validity is 30 days) |
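
For the port-conflict row, this lists what is currently bound on the four ports (uses iproute2's `ss`, present on most modern distros; `sudo` lets it name processes owned by other users):

```bash
# Show listeners on the setup's ports; -p includes the owning process
sudo ss -tlnp | grep -E ':(5092|8080|8443|8880)'
```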
- Repo: https://github.com/groxaxo/parakeet-tdt-0.6b-v3-fastapi-openai
- Model: NVIDIA Parakeet TDT 0.6B v3, ONNX INT8 quantized
- Runs on: CPU (Docker container)
- Performance: ~30x real-time on modern CPUs
- Languages: 25 European languages with auto-detection
- API: OpenAI-compatible `/v1/audio/transcriptions`
- Repo: https://github.com/pasky/Qwen3-TTS-Openai-Fastapi (fork of groxaxo/Qwen3-TTS-Openai-Fastapi)
- Model: Qwen/Qwen3-TTS-12Hz-1.7B-Base (supports voice cloning)
- Runs on: GPU (CUDA) — ~4-6 GB VRAM
- Languages: 10+ languages (English, Chinese, Japanese, Korean, German, French, Russian, Portuguese, Spanish, Italian)
- API: OpenAI-compatible `/v1/audio/speech`
- Voice Cloning: Load a pre-built `.pkl` prompt via the `CUSTOM_VOICE` env var
- Site: https://openwebui.com/
- Recommended fork: https://github.com/pasky/open-webui (voice mode fixes)
- Version: 0.7.2+
- Runs on: CPU (Python/uv)
- Default port: 8080 (proxied to 8443 via socat for HTTPS)