Generate cinematic, human-sounding AI dialogue for game cutscenes using Microsoft Neural Voices, completely scriptable from Python.
The script takes a dialogue script with two characters, generates separate voice lines, then merges them into a single `cutscene_edge.wav` ready for Unity, Unreal, Godot, etc.
⚠️ This uses Microsoft’s neural TTS over the internet via Edge-TTS. Text is sent to Microsoft’s service; audio comes back. No API key required.
Works great with:
- WSL Ubuntu
- Python 3.10+ (including 3.12)
- A basic virtualenv
- ffmpeg installed
You can keep everything in a simple folder, for example:
```
soundTester/
├── env/                   # (optional) Python virtualenv
├── cutscene_edge_tts.py   # main script
└── README.md              # this file
```
From PowerShell / CMD:

```powershell
cd C:\Users\ben\OneDrive\Desktop\testing\soundTester
wsl
```

Now you’re in WSL, in the same folder:

```bash
python3 -m venv env
source env/bin/activate   # (or `. env/bin/activate`)
```

In WSL, with your venv active:
```bash
sudo apt update
sudo apt install ffmpeg -y
pip install edge-tts pydub
```

That’s all you need.
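Before generating anything, you can sanity-check the install from Python. A minimal sketch using pydub's `which` helper (it just searches your PATH for the binary):

```python
# Confirm both libraries import and pydub can find ffmpeg on PATH.
import edge_tts  # noqa: F401 (import check only)
import pydub     # noqa: F401
from pydub.utils import which

ffmpeg_path = which("ffmpeg")
print("ffmpeg:", ffmpeg_path or "NOT FOUND (install it via apt)")
```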
Save this as `cutscene_edge_tts.py`:

```python
import asyncio
import os

from pydub import AudioSegment
import edge_tts

OUTPUT_DIR = "lines_edge"
FINAL_FILE = "cutscene_edge.wav"

CHARLES = "Dr_Charles"
VICTOR = "Dr_Victor"

# Simple emotion → (rate, pitch, volume) mapping
# tweak these to taste
STYLE_PARAMS = {
    # steady, composed
    "calm": ("-5%", "-5Hz", "+0%"),
    # commanding, tense
    "serious": ("+5%", "-2Hz", "+10%"),
    # heavier, lower energy
    "sad": ("-10%", "-10Hz", "-5%"),
    # nervous but controlled
    "fearful": ("+15%", "+10Hz", "+20%"),
    # FULL PANIC MODE
    "terrified": ("+25%", "+20Hz", "+30%"),
    # creepy / breathy effect
    "whisper": ("-10%", "-10Hz", "-20%"),
    # normal
    "neutral": ("+0%", "+0Hz", "+0%"),
}

# SCRIPT: (speaker, emotion, text)
SCRIPT = [
    # Charles - calm but concerned
    (CHARLES, "calm",
     "Hey, Dr. Victor… how’s the experiment coming along?"),
    # Victor - fearful
    (VICTOR, "fearful",
     "Not good. Something went wrong in the lab. "
     "We had a chemical spill, and I think it contaminated part of the project."),
    # Charles - serious & urgent
    (CHARLES, "serious",
     "Hold on—are you saying this could affect the patients? "
     "We may need to shut the entire hospital down before something terrible happens."),
    # Victor - terrified
    (VICTOR, "terrified",
     "I’m afraid it might already be too late. "
     "Whatever was created down there… it isn’t human anymore. "
     "There could be something horrible roaming the halls."),
    # Charles - shaken but composed
    (CHARLES, "sad",
     "If the public finds out, we’re finished. "
     "Our careers… our lives… everything."),
    # Victor - scared leadership moment
    (VICTOR, "fearful",
     "We have to stay calm. If you see anything—anything that doesn’t look human—"
     "run. Don’t try to fight it."),
    # Charles - determined command voice
    (CHARLES, "serious",
     "Understood. Let’s lock this place down and pray it doesn’t escape."),
    # Victor - whisper horror
    (VICTOR, "whisper",
     "And hope… it doesn’t find us first."),
    # Charles - final solemn sendoff
    (CHARLES, "calm",
     "Right… good luck, Doctor."),
]

# Pick two neural voices (change if you want)
VOICE_CHARLES = "en-US-EricNeural"
VOICE_VICTOR = "en-GB-RyanNeural"


async def generate_line(text: str, voice: str, path: str, rate: str, pitch: str, volume: str):
    """Use Edge TTS to synth one line to a file."""
    print(f"  [edge-tts] {voice} rate={rate} pitch={pitch} volume={volume} -> {path}")
    communicate = edge_tts.Communicate(
        text=text,
        voice=voice,
        rate=rate,
        pitch=pitch,
        volume=volume,
    )
    await communicate.save(path)


async def main_async():
    os.makedirs(OUTPUT_DIR, exist_ok=True)
    files = []

    # Generate each line as its own file
    for i, (speaker, style, line) in enumerate(SCRIPT, start=1):
        print(f"--- Line {i}: {speaker} ({style})")
        filename = f"{i:02d}_{speaker}.mp3"
        filepath = os.path.join(OUTPUT_DIR, filename)

        if speaker == CHARLES:
            voice = VOICE_CHARLES
        else:
            voice = VOICE_VICTOR

        rate, pitch, volume = STYLE_PARAMS.get(style, STYLE_PARAMS["neutral"])
        await generate_line(line, voice, filepath, rate, pitch, volume)
        files.append(filepath)

    print("\nMerging all lines into one cutscene...")

    # Start with a tiny bit of silence
    final_audio = AudioSegment.silent(duration=500)

    for path in files:
        segment = AudioSegment.from_file(path)
        final_audio += segment
        # small pause between lines for dramatic pacing
        final_audio += AudioSegment.silent(duration=400)

    final_audio.export(FINAL_FILE, format="wav")
    print(f"\n✅ DONE! Created {FINAL_FILE}")


def main():
    asyncio.run(main_async())


if __name__ == "__main__":
    main()
```

In WSL, from the project folder:
```bash
python cutscene_edge_tts.py
```

You’ll get:

- `lines_edge/` — individual MP3 files per line
- `cutscene_edge.wav` — merged final cutscene audio
On Windows, you’ll find it at:
C:\Users\ben\OneDrive\Desktop\testing\soundTester\cutscene_edge.wav
Drop that into Unity / Unreal / Godot as a single SFX track.
You control which neural voices play each character via:

```python
VOICE_CHARLES = "en-US-EricNeural"
VOICE_VICTOR = "en-GB-RyanNeural"
```

Some good options:
- `en-US-GuyNeural`
- `en-US-DavisNeural`
- `en-US-JasonNeural`
- `en-US-ChristopherNeural`
- `en-US-EricNeural`
- `en-GB-RyanNeural`
- `en-GB-ThomasNeural`
- `en-GB-AlfieNeural`
- `en-US-JennyNeural`
- `en-US-AriaNeural`
- `en-US-SaraNeural`
- `en-GB-LibbyNeural`
- `en-GB-SoniaNeural`
List all available voices:

```bash
edge-tts --list-voices | less
```

Pick your favorites and plug them into `VOICE_CHARLES` / `VOICE_VICTOR`.
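You can also query the catalog from Python with edge-tts's `list_voices()` helper. A minimal sketch (the `en-` locale filter is just an example):

```python
import asyncio
import edge_tts

async def show_english_voices():
    # Each entry is a dict with fields like "ShortName", "Gender", "Locale".
    for v in await edge_tts.list_voices():
        if v["Locale"].startswith("en-"):
            print(f'{v["ShortName"]:<28} {v["Gender"]}')

asyncio.run(show_english_voices())
```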
Edge-TTS (the library) exposes prosody controls:
- `rate` → speaking speed (e.g. `"+20%"`)
- `pitch` → pitch shift (e.g. `"+15Hz"`)
- `volume` → loudness (e.g. `"+30%"`)
We don’t use fancy XML `<mstts:express-as>` tags here, because the public endpoint is picky and tends to read them out loud. Instead, we fake “emotion” by pushing speed, pitch, and volume in different directions.
Example from `STYLE_PARAMS`:

```python
"terrified": ("+25%", "+20Hz", "+30%"),
```

That means:
- Speak 25% faster
- Raise pitch by 20 Hz (more tense / panicked)
- Boost volume by 30%
So:
- `calm` → slightly slow, lower pitch
- `serious` → a bit faster, slightly louder
- `sad` → slower, lower pitch, softer
- `fearful` / `terrified` → faster, higher pitch, louder
- `whisper` → quieter and lower
You can tweak these to match your own ear; the sketch below renders the same line in every style so you can audition them side by side.
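A minimal audition sketch. It duplicates `STYLE_PARAMS` inline so it runs standalone; the voice and line are arbitrary examples:

```python
# audition_styles.py — render the same line once per style so you can
# compare how each (rate, pitch, volume) combination actually sounds.
import asyncio
import edge_tts

VOICE = "en-GB-RyanNeural"
LINE = "Whatever was created down there… it isn’t human anymore."

STYLE_PARAMS = {
    "calm": ("-5%", "-5Hz", "+0%"),
    "serious": ("+5%", "-2Hz", "+10%"),
    "sad": ("-10%", "-10Hz", "-5%"),
    "fearful": ("+15%", "+10Hz", "+20%"),
    "terrified": ("+25%", "+20Hz", "+30%"),
    "whisper": ("-10%", "-10Hz", "-20%"),
    "neutral": ("+0%", "+0Hz", "+0%"),
}

async def main():
    for style, (rate, pitch, volume) in STYLE_PARAMS.items():
        out = f"audition_{style}.mp3"
        communicate = edge_tts.Communicate(
            text=LINE, voice=VOICE, rate=rate, pitch=pitch, volume=volume
        )
        await communicate.save(out)
        print("wrote", out)

asyncio.run(main())
```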
High-level flow of the script:

1. `SCRIPT` structure

   ```python
   SCRIPT = [ (speaker_name, style_label, text), ... ]
   ```

   Each tuple defines:
   - Who is speaking (`CHARLES` or `VICTOR`)
   - How they should sound (`"fearful"`, `"calm"`, etc.)
   - The actual line of dialogue.

2. Emotion mapping

   ```python
   STYLE_PARAMS = { "fearful": ("+15%", "+10Hz", "+20%"), ... }
   ```

   When we process a line, we grab `(rate, pitch, volume)` for its style.

3. Calling Edge-TTS

   For each line, we do:

   ```python
   communicate = edge_tts.Communicate(
       text=text,
       voice=voice,
       rate=rate,
       pitch=pitch,
       volume=volume,
   )
   await communicate.save(path)
   ```

   Under the hood:
   - `edge_tts` builds a request to Microsoft’s neural TTS service.
   - Sends your text + prosody settings over HTTPS.
   - Streams back audio as MP3 (see the streaming sketch after this list).
   - Writes it directly to `path`.
4. Per-line files

   Each line becomes something like:

   ```
   lines_edge/
   ├── 01_Dr_Charles.mp3
   ├── 02_Dr_Victor.mp3
   ├── ...
   ```

5. Merging into one WAV

   ```python
   final_audio = AudioSegment.silent(duration=500)
   for path in files:
       segment = AudioSegment.from_file(path)
       final_audio += segment
       final_audio += AudioSegment.silent(duration=400)
   final_audio.export(FINAL_FILE, format="wav")
   ```

   - `pydub` loads each MP3
   - Concatenates them with a bit of silence between lines (a crossfade variation follows this list)
   - Exports a single `cutscene_edge.wav`
6. Use in your game

   In your engine, you just treat `cutscene_edge.wav` like any other audio asset.
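Two optional variations on the flow above. First, step 3 can skip the intermediate file: `Communicate` also exposes a `stream()` generator that yields the raw MP3 chunks (plus word-timing events) if you'd rather hold the audio in memory. A minimal sketch:

```python
import asyncio
import edge_tts

async def stream_one_line():
    communicate = edge_tts.Communicate(
        text="Don’t try to fight it.",
        voice="en-GB-RyanNeural",
    )
    audio = bytearray()
    # stream() yields dicts: "audio" chunks carry MP3 bytes in "data";
    # "WordBoundary" events carry per-word timing you could use for subtitles.
    async for chunk in communicate.stream():
        if chunk["type"] == "audio":
            audio.extend(chunk["data"])
    with open("line_streamed.mp3", "wb") as f:
        f.write(bytes(audio))

asyncio.run(stream_one_line())
```

Second, if the hard cuts in step 5 sound abrupt, pydub's `append()` supports a crossfade. A minimal variation, assuming the `files` list from the script (the 50 ms value is just a starting point, and it must be shorter than both segments being joined):

```python
from pydub import AudioSegment

final_audio = AudioSegment.silent(duration=500)
for path in files:
    segment = AudioSegment.from_file(path)
    # append() with a crossfade overlaps the tail of final_audio
    # with the head of the new segment instead of hard-cutting
    final_audio = final_audio.append(segment, crossfade=50)
    final_audio += AudioSegment.silent(duration=400)
final_audio.export("cutscene_edge.wav", format="wav")
```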
To take it further:

- Add background ambience (looping hospital hallway, rumble, alarms) as a second audio track.
- Use a timeline / sequencer to:
  - Pan the camera while dialogue plays
  - Trigger screen shakes when Victor says something terrifying
- Optional: export line timestamps and use them for subtitles (we can add that by tracking durations from pydub; see the sketch below).
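A minimal sketch of that subtitle idea, assuming the `SCRIPT` and `files` lists from `cutscene_edge_tts.py`. It mirrors the merge loop's timing (500 ms lead-in, 400 ms pause per line) and writes a basic `.srt`; the `fmt_ts` helper is hypothetical, not from any library:

```python
from pydub import AudioSegment

def fmt_ts(ms: int) -> str:
    """Format milliseconds as an SRT timestamp: HH:MM:SS,mmm (hypothetical helper)."""
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, msec = divmod(rem, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{msec:03d}"

cursor = 500  # matches the 500 ms of leading silence in the merge
with open("cutscene_edge.srt", "w") as srt:
    for i, ((speaker, style, text), path) in enumerate(zip(SCRIPT, files), start=1):
        duration = len(AudioSegment.from_file(path))  # pydub lengths are in ms
        srt.write(f"{i}\n{fmt_ts(cursor)} --> {fmt_ts(cursor + duration)}\n")
        srt.write(f"{speaker}: {text}\n\n")
        cursor += duration + 400  # matches the 400 ms pause between lines
```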