This document outlines the requirements for the Playback API, which provides a unified interface for controlling text-to-speech (TTS) playback. The API should be able to do the following (a rough interface sketch follows the list):
- Start, pause, resume, and stop
- Handle both individual and batched text/SSML input
- Report current playback state (playing, paused, stopped)
- Accept plain text and SSML input
- Support multiple utterances
- Emit events for state changes
- Provide word/sentence boundary information
- Report errors and warnings
- Select from available voices
- Configure voice parameters (rate, pitch, volume)
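
As a rough illustration of how these requirements might map onto an interface, here is a minimal TypeScript sketch. All names (`Playback`, `Utterance`, `VoiceParams`, the event names, and so on) are hypothetical placeholders, not an agreed design.

```typescript
// Hypothetical types illustrating the requirements above; names are placeholders.

type PlaybackState = "playing" | "paused" | "stopped";

interface Voice {
  id: string;
  name: string;
  language: string; // BCP-47 tag, e.g. "en-US"
}

interface VoiceParams {
  rate?: number;   // 1.0 = default speaking rate
  pitch?: number;  // 1.0 = default pitch
  volume?: number; // 0.0 to 1.0
}

interface Utterance {
  /** Plain text or SSML; `kind` tells the engine how to interpret `content`. */
  kind: "text" | "ssml";
  content: string;
}

interface BoundaryEvent {
  utteranceIndex: number;
  unit: "word" | "sentence";
  charOffset: number;
  charLength: number;
}

interface PlaybackEvents {
  statechange: (state: PlaybackState) => void;
  boundary: (boundary: BoundaryEvent) => void;
  error: (err: Error) => void;
  warning: (message: string) => void;
}

interface Playback {
  readonly state: PlaybackState;
  start(): Promise<void>;
  pause(): void;
  resume(): void;
  stop(): void;
  /** Accepts a single utterance or a batch. */
  load(utterances: Utterance | Utterance[]): void;
  setVoice(voice: Voice): void;
  setParams(params: VoiceParams): void;
  on<K extends keyof PlaybackEvents>(event: K, handler: PlaybackEvents[K]): void;
}
```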
[WIP]
A PlaybackEngineProvider lets you list the available voices and create PlaybackEngine instances configured for a specific voice, language, and so on.
A PlaybackEngine is bound to one voice, exposes its parameters for adjustment, is loaded with utterances, can preload them with context, and can speak an utterance by index.
A PlaybackNavigator then handles navigation between utterances, continuous playback, and similar concerns.
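
To make that split concrete, one possible shape for these three pieces is sketched below, reusing the hypothetical `Voice`, `VoiceParams`, and `Utterance` types from the earlier sketch. The method names and signatures are assumptions, not a committed API.

```typescript
// Hypothetical sketch of the provider / engine / navigator split described above.

interface PlaybackEngineProvider {
  /** Lists the voices available on the current platform. */
  getVoices(): Promise<Voice[]>;
  /** Creates an engine bound to one voice/parameter configuration. */
  createEngine(voice: Voice, params?: VoiceParams): Promise<PlaybackEngine>;
}

interface PlaybackEngine {
  setParams(params: VoiceParams): void;
  /** Loads the utterances this engine may be asked to speak. */
  load(utterances: Utterance[]): void;
  /** Optionally warms up synthesis around an index (e.g. the next utterance). */
  preload(index: number): Promise<void>;
  /** Speaks the utterance at the given index. */
  speak(index: number): Promise<void>;
  pause(): void;
  resume(): void;
  stop(): void;
}

interface PlaybackNavigator {
  /** Plays continuously from an index, advancing through the loaded utterances. */
  playFrom(index: number): Promise<void>;
  next(): Promise<void>;
  previous(): Promise<void>;
  stop(): void;
}
```

In this shape the engine owns a single voice and the navigator drives it for continuous playback, which is one way to keep platform idiosyncrasies behind the engine boundary; other splits are possible.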
Could we perhaps start with a review/update of the list of requirements?
With the current design it sounds like we are trying to implement two different approaches at the same time (preloading multiple utterances and playing one, versus a miniplayer), which makes it difficult to settle on types and interfaces and to manage platform idiosyncrasies. 🙏