Generate cinematic, human-sounding AI dialogue for game cutscenes using Microsoft Neural Voices, completely scriptable from Python.
The script takes a dialogue script with two characters, generates separate voice lines, then merges them into a single `cutscene_edge.wav` ready for Unity, Unreal, Godot, etc.
⚠️ This uses Microsoft’s neural TTS over the internet via Edge-TTS. Text is sent to Microsoft’s service; audio comes back. No API key required.
Works great with:
- WSL Ubuntu
- Python 3.10+ (including 3.12)
- A basic virtualenv
- ffmpeg installed
You can keep everything in a simple folder, for example:
```
soundTester/
├── env/                   # (optional) Python virtualenv
├── cutscene_edge_tts.py   # main script
└── README.md              # this file
```
From PowerShell / CMD:

```powershell
cd C:\Users\ben\OneDrive\Desktop\testing\soundTester
wsl
```

Now you’re in WSL, in the same folder:

```bash
python3 -m venv env
source env/bin/activate   # (or `. env/bin/activate`)
```

In WSL, with your venv active:
```bash
sudo apt update
sudo apt install ffmpeg -y
pip install edge-tts pydub
```

That’s all you need.
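Before generating anything, you can sanity-check the install from Python. A minimal sketch using pydub's `which` helper (it just searches your PATH for the binary):

```python
# Confirm both libraries import and pydub can find ffmpeg on PATH.
import edge_tts  # noqa: F401 (import check only)
import pydub     # noqa: F401
from pydub.utils import which

ffmpeg_path = which("ffmpeg")
print("ffmpeg:", ffmpeg_path or "NOT FOUND (install it via apt)")
```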
Save this as `cutscene_edge_tts.py`:

```python
import asyncio
import os

from pydub import AudioSegment
import edge_tts

OUTPUT_DIR = "lines_edge"
FINAL_FILE = "cutscene_edge.wav"

CHARLES = "Dr_Charles"
VICTOR = "Dr_Victor"

# Simple emotion → (rate, pitch, volume) mapping
# tweak these to taste
STYLE_PARAMS = {
    # steady, composed
    "calm": ("-5%", "-5Hz", "+0%"),
    # commanding, tense
    "serious": ("+5%", "-2Hz", "+10%"),
    # heavier, lower energy
    "sad": ("-10%", "-10Hz", "-5%"),
    # nervous but controlled
    "fearful": ("+15%", "+10Hz", "+20%"),
    # FULL PANIC MODE
    "terrified": ("+25%", "+20Hz", "+30%"),
    # creepy / breathy effect
    "whisper": ("-10%", "-10Hz", "-20%"),
    # normal
    "neutral": ("+0%", "+0Hz", "+0%"),
}

# SCRIPT: (speaker, emotion, text)
SCRIPT = [
    # Charles - calm but concerned
    (CHARLES, "calm",
     "Hey, Dr. Victor… how’s the experiment coming along?"),
    # Victor - fearful
    (VICTOR, "fearful",
     "Not good. Something went wrong in the lab. "
     "We had a chemical spill, and I think it contaminated part of the project."),
    # Charles - serious & urgent
    (CHARLES, "serious",
     "Hold on—are you saying this could affect the patients? "
     "We may need to shut the entire hospital down before something terrible happens."),
    # Victor - terrified
    (VICTOR, "terrified",
     "I’m afraid it might already be too late. "
     "Whatever was created down there… it isn’t human anymore. "
     "There could be something horrible roaming the halls."),
    # Charles - shaken but composed
    (CHARLES, "sad",
     "If the public finds out, we’re finished. "
     "Our careers… our lives… everything."),
    # Victor - scared leadership moment
    (VICTOR, "fearful",
     "We have to stay calm. If you see anything—anything that doesn’t look human—"
     "run. Don’t try to fight it."),
    # Charles - determined command voice
    (CHARLES, "serious",
     "Understood. Let’s lock this place down and pray it doesn’t escape."),
    # Victor - whisper horror
    (VICTOR, "whisper",
     "And hope… it doesn’t find us first."),
    # Charles - final solemn sendoff
    (CHARLES, "calm",
     "Right… good luck, Doctor."),
]

# Pick two neural voices (change if you want)
VOICE_CHARLES = "en-US-EricNeural"
VOICE_VICTOR = "en-GB-RyanNeural"


async def generate_line(text: str, voice: str, path: str, rate: str, pitch: str, volume: str):
    """Use Edge TTS to synth one line to a file."""
    print(f"  [edge-tts] {voice} rate={rate} pitch={pitch} volume={volume} -> {path}")
    communicate = edge_tts.Communicate(
        text=text,
        voice=voice,
        rate=rate,
        pitch=pitch,
        volume=volume,
    )
    await communicate.save(path)


async def main_async():
    os.makedirs(OUTPUT_DIR, exist_ok=True)
    files = []

    # Generate each line as its own file
    for i, (speaker, style, line) in enumerate(SCRIPT, start=1):
        print(f"--- Line {i}: {speaker} ({style})")
        filename = f"{i:02d}_{speaker}.mp3"
        filepath = os.path.join(OUTPUT_DIR, filename)

        if speaker == CHARLES:
            voice = VOICE_CHARLES
        else:
            voice = VOICE_VICTOR

        rate, pitch, volume = STYLE_PARAMS.get(style, STYLE_PARAMS["neutral"])
        await generate_line(line, voice, filepath, rate, pitch, volume)
        files.append(filepath)

    print("\nMerging all lines into one cutscene...")

    # Start with a tiny bit of silence
    final_audio = AudioSegment.silent(duration=500)

    for path in files:
        segment = AudioSegment.from_file(path)
        final_audio += segment
        # small pause between lines for dramatic pacing
        final_audio += AudioSegment.silent(duration=400)

    final_audio.export(FINAL_FILE, format="wav")
    print(f"\n✅ DONE! Created {FINAL_FILE}")


def main():
    asyncio.run(main_async())


if __name__ == "__main__":
    main()
```

In WSL, from the project folder:
```bash
python cutscene_edge_tts.py
```

You’ll get:

- `lines_edge/` — individual MP3 files per line
- `cutscene_edge.wav` — merged final cutscene audio
On Windows, you’ll find it at:
C:\Users\ben\OneDrive\Desktop\testing\soundTester\cutscene_edge.wav
Drop that into Unity / Unreal / Godot as a single SFX track.
You control which neural voices play each character via:

```python
VOICE_CHARLES = "en-US-EricNeural"
VOICE_VICTOR = "en-GB-RyanNeural"
```

Some good options:
- `en-US-GuyNeural`
- `en-US-DavisNeural`
- `en-US-JasonNeural`
- `en-US-ChristopherNeural`
- `en-US-EricNeural`
- `en-GB-RyanNeural`
- `en-GB-ThomasNeural`
- `en-GB-AlfieNeural`
- `en-US-JennyNeural`
- `en-US-AriaNeural`
- `en-US-SaraNeural`
- `en-GB-LibbyNeural`
- `en-GB-SoniaNeural`
List all available voices:

```bash
edge-tts --list-voices | less
```

Pick your favorites and plug them into `VOICE_CHARLES` / `VOICE_VICTOR`.
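You can also query the catalog from Python with edge-tts's `list_voices()` helper. A minimal sketch (the `en-` locale filter is just an example):

```python
import asyncio
import edge_tts

async def show_english_voices():
    # Each entry is a dict with fields like "ShortName", "Gender", "Locale".
    for v in await edge_tts.list_voices():
        if v["Locale"].startswith("en-"):
            print(f'{v["ShortName"]:<28} {v["Gender"]}')

asyncio.run(show_english_voices())
```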
Edge-TTS (the library) exposes prosody controls:
- `rate` → speaking speed (e.g. `"+20%"`)
- `pitch` → pitch shift (e.g. `"+15Hz"`)
- `volume` → loudness (e.g. `"+30%"`)
We don’t use fancy XML `<mstts:express-as>` tags here, because the public endpoint is picky and tends to read them out loud. Instead, we fake “emotion” by pushing speed, pitch, and volume in different directions.
Example from `STYLE_PARAMS`:

```python
"terrified": ("+25%", "+20Hz", "+30%"),
```

That means:
- Speak 25% faster
- Raise pitch by 20 Hz (more tense / panicked)
- Boost volume by 30%
So:
- `calm` → slightly slow, lower pitch
- `serious` → a bit faster, slightly louder
- `sad` → slower, lower pitch, softer
- `fearful` / `terrified` → faster, higher pitch, louder
- `whisper` → quieter and lower
You can tweak these to match your own ear; the sketch below renders the same line in every style so you can audition them side by side.
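A minimal audition sketch. It duplicates `STYLE_PARAMS` inline so it runs standalone; the voice and line are arbitrary examples:

```python
# audition_styles.py — render the same line once per style so you can
# compare how each (rate, pitch, volume) combination actually sounds.
import asyncio
import edge_tts

VOICE = "en-GB-RyanNeural"
LINE = "Whatever was created down there… it isn’t human anymore."

STYLE_PARAMS = {
    "calm": ("-5%", "-5Hz", "+0%"),
    "serious": ("+5%", "-2Hz", "+10%"),
    "sad": ("-10%", "-10Hz", "-5%"),
    "fearful": ("+15%", "+10Hz", "+20%"),
    "terrified": ("+25%", "+20Hz", "+30%"),
    "whisper": ("-10%", "-10Hz", "-20%"),
    "neutral": ("+0%", "+0Hz", "+0%"),
}

async def main():
    for style, (rate, pitch, volume) in STYLE_PARAMS.items():
        out = f"audition_{style}.mp3"
        communicate = edge_tts.Communicate(
            text=LINE, voice=VOICE, rate=rate, pitch=pitch, volume=volume
        )
        await communicate.save(out)
        print("wrote", out)

asyncio.run(main())
```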
High-level flow of the script:

1. `SCRIPT` structure

   ```python
   SCRIPT = [ (speaker_name, style_label, text), ... ]
   ```

   Each tuple defines:
   - Who is speaking (`CHARLES` or `VICTOR`)
   - How they should sound (`"fearful"`, `"calm"`, etc.)
   - The actual line of dialogue.

2. Emotion mapping

   ```python
   STYLE_PARAMS = { "fearful": ("+15%", "+10Hz", "+20%"), ... }
   ```

   When we process a line, we grab `(rate, pitch, volume)` for its style.

3. Calling Edge-TTS

   For each line, we do:

   ```python
   communicate = edge_tts.Communicate(
       text=text,
       voice=voice,
       rate=rate,
       pitch=pitch,
       volume=volume,
   )
   await communicate.save(path)
   ```

   Under the hood:
   - `edge_tts` builds a request to Microsoft’s neural TTS service.
   - Sends your text + prosody settings over HTTPS.
   - Streams back audio as MP3 (see the streaming sketch after this list).
   - Writes it directly to `path`.
4. Per-line files

   Each line becomes something like:

   ```
   lines_edge/
   ├── 01_Dr_Charles.mp3
   ├── 02_Dr_Victor.mp3
   ├── ...
   ```

5. Merging into one WAV

   ```python
   final_audio = AudioSegment.silent(duration=500)
   for path in files:
       segment = AudioSegment.from_file(path)
       final_audio += segment
       final_audio += AudioSegment.silent(duration=400)
   final_audio.export(FINAL_FILE, format="wav")
   ```

   - `pydub` loads each MP3
   - Concatenates them with a bit of silence between lines (a crossfade variation follows this list)
   - Exports a single `cutscene_edge.wav`
6. Use in your game

   In your engine, you just treat `cutscene_edge.wav` like any other audio asset.
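Two optional variations on the flow above. First, step 3 can skip the intermediate file: `Communicate` also exposes a `stream()` generator that yields the raw MP3 chunks (plus word-timing events) if you'd rather hold the audio in memory. A minimal sketch:

```python
import asyncio
import edge_tts

async def stream_one_line():
    communicate = edge_tts.Communicate(
        text="Don’t try to fight it.",
        voice="en-GB-RyanNeural",
    )
    audio = bytearray()
    # stream() yields dicts: "audio" chunks carry MP3 bytes in "data";
    # "WordBoundary" events carry per-word timing you could use for subtitles.
    async for chunk in communicate.stream():
        if chunk["type"] == "audio":
            audio.extend(chunk["data"])
    with open("line_streamed.mp3", "wb") as f:
        f.write(bytes(audio))

asyncio.run(stream_one_line())
```

Second, if the hard cuts in step 5 sound abrupt, pydub's `append()` supports a crossfade. A minimal variation, assuming the `files` list from the script (the 50 ms value is just a starting point, and it must be shorter than both segments being joined):

```python
from pydub import AudioSegment

final_audio = AudioSegment.silent(duration=500)
for path in files:
    segment = AudioSegment.from_file(path)
    # append() with a crossfade overlaps the tail of final_audio
    # with the head of the new segment instead of hard-cutting
    final_audio = final_audio.append(segment, crossfade=50)
    final_audio += AudioSegment.silent(duration=400)
final_audio.export("cutscene_edge.wav", format="wav")
```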
To take it further:

- Add background ambience (looping hospital hallway, rumble, alarms) as a second audio track.
- Use a timeline / sequencer to:
  - Pan the camera while dialogue plays
  - Trigger screen shakes when Victor says something terrifying
- Optional: export line timestamps and use them for subtitles (we can add that by tracking durations from pydub; see the sketch below).
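A minimal sketch of that subtitle idea, assuming the `SCRIPT` and `files` lists from `cutscene_edge_tts.py`. It mirrors the merge loop's timing (500 ms lead-in, 400 ms pause per line) and writes a basic `.srt`; the `fmt_ts` helper is hypothetical, not from any library:

```python
from pydub import AudioSegment

def fmt_ts(ms: int) -> str:
    """Format milliseconds as an SRT timestamp: HH:MM:SS,mmm (hypothetical helper)."""
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, msec = divmod(rem, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{msec:03d}"

cursor = 500  # matches the 500 ms of leading silence in the merge
with open("cutscene_edge.srt", "w") as srt:
    for i, ((speaker, style, text), path) in enumerate(zip(SCRIPT, files), start=1):
        duration = len(AudioSegment.from_file(path))  # pydub lengths are in ms
        srt.write(f"{i}\n{fmt_ts(cursor)} --> {fmt_ts(cursor + duration)}\n")
        srt.write(f"{speaker}: {text}\n\n")
        cursor += duration + 400  # matches the 400 ms pause between lines
```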