YouTube transcript / subtitle fetching toolkit for Python - Download, extract, and process subtitles from video URLs with a simple CLI or HTTP API.
- Download YouTube subtitles from videos and channels (powered by yt-dlp)
- Multiple output formats: SRT, VTT, TXT, Markdown, PDF
- JSON output: Machine-readable output with `--json` and `--json-file` flags
- Importable module: Use as a Python library with dict-based return values
- Text extraction with automatic subtitle cleanup and optional timestamp markers
- Language selection: Download specific languages or all available subtitles
- Batch processing: Process multiple URLs from a file
- Configuration files: Project and global settings via TOML
- HTTP API: Optional FastAPI server for programmatic access
- Dry-run mode: Preview operations without downloading
- Filename sanitization: Safe, nospaces, or slugify modes
- Installation
- Quick Start
- Module Usage (Python Library)
- Usage
- Configuration
- Makefile Shortcuts
- HTTP API
- Development
- Testing
- License
- Python 3.9 or higher
- uv package manager (recommended)
# Clone or download the project
git clone https://gist.github.com/cprima/subxx
cd subxx
# Install core dependencies
uv sync
# Install with optional features
uv sync --extra extract # Text extraction (txt/md/pdf)
uv sync --extra api # HTTP API server
uv sync --extra dev # Development tools (pytest)
# Install all features
uv sync --extra extract --extra api --extra dev

make install      # Core dependencies
make install-all  # All dependencies (extract + api + dev)

# List available subtitles
uv run subxx list https://youtu.be/VIDEO_ID
# Download English subtitle (SRT format, default)
uv run subxx subs https://youtu.be/VIDEO_ID
# Extract to plain text
uv run subxx subs https://youtu.be/VIDEO_ID --txt
# Extract to Markdown with 5-minute timestamps
uv run subxx subs https://youtu.be/VIDEO_ID --md -t 300
# Extract to PDF
uv run subxx subs https://youtu.be/VIDEO_ID --pdf
# Get JSON output for automation
uv run subxx list https://youtu.be/VIDEO_ID --json
uv run subxx subs https://youtu.be/VIDEO_ID --json-file output.json

# Quick Markdown extraction (just paste video ID)
make md VIDEO_ID=dQw4w9WgXcQ
# With timestamps
make md VIDEO_ID=dQw4w9WgXcQ TIMESTAMPS=300

New in v0.4.0+: subxx can be imported and used as a Python library. Core functions now return structured data (dicts) instead of exit codes.
# From test.pypi
pip install -i https://test.pypi.org/simple/ subxx==0.4.1
# Or with uv
uv add subxx==0.4.1 --index https://test.pypi.org/simple/

from subxx import fetch_subs, extract_text
# Download subtitles
result = fetch_subs(
    url="https://www.youtube.com/watch?v=dQw4w9WgXcQ",
    langs="en",
    fmt="srt",
    output_dir="./subs",
    logger=None  # Silent mode
)

if result["status"] == "success":
    print(f"Downloaded: {result['video_title']}")
    for file_info in result["files"]:
        print(f"  {file_info['language']}: {file_info['path']}")
else:
    print(f"Error: {result['error']}")

Functions return comprehensive dicts with all data:
{
  "status": "success" | "error" | "skipped",
  "video_id": "dQw4w9WgXcQ",
  "video_title": "Rick Astley - Never Gonna Give You Up...",
  "files": [
    {
      "path": "/path/to/video.en.srt",
      "language": "en",
      "format": "srt",
      "auto_generated": false
    }
  ],
  "metadata": {...},
  "available_languages": [...],
  "download_info": {...},
  "error": null
}

from subxx import fetch_subs, extract_text
# 1. Download subtitle
result = fetch_subs(
    url="https://youtube.com/watch?v=...",
    langs="en",
    fmt="srt",
    auto=True,
    output_dir="./transcripts",
    logger=None
)

if result["status"] != "success":
    print(f"Error: {result['error']}")
    exit(1)

# 2. Extract to markdown
subtitle_file = result["files"][0]["path"]
extract_result = extract_text(
    subtitle_file=subtitle_file,
    output_format="md",
    use_chapters=True,
    logger=None
)

if extract_result["status"] == "success":
    print(f"Extracted to: {extract_result['output_files'][0]['path']}")
    print(f"Paragraphs: {len(extract_result['extracted_data']['paragraphs'])}")

from subxx import (
    fetch_subs,     # Download subtitles → dict
    extract_text,   # Extract text from srt/vtt → dict
    load_config,    # Load .subxx.toml config → dict
    get_default,    # Get config default value
    setup_logging,  # Configure logging
)

v0.3.x (not supported as module):
- Functions returned exit codes (int)
- CLI-focused design
v0.4.x (library-first):
- Functions return dicts with comprehensive data
- Optional `logger` parameter (None = silent)
- Clean separation: core functions vs CLI wrapper
Preview available subtitle languages without downloading:
# Traditional output
uv run subxx list https://youtu.be/VIDEO_ID
# JSON output
uv run subxx list https://youtu.be/VIDEO_ID --json
# Save to file
uv run subxx list https://youtu.be/VIDEO_ID --json-file metadata.json

Output:
📹 Video: Example Video Title
🕒 Duration: 12:34
✅ Manual subtitles:
- en
- es
🤖 Auto-generated subtitles:
- en, de, fr, ja, ko, pt, ru, zh-Hans, ...
Options:
- `-v, --verbose` - Debug output
- `-q, --quiet` - Errors only
Download subtitle files in SRT or VTT format:
# Download SRT (default)
uv run subxx subs https://youtu.be/VIDEO_ID
# Download VTT
uv run subxx subs https://youtu.be/VIDEO_ID --vtt
# Using --fmt flag
uv run subxx subs https://youtu.be/VIDEO_ID -f srt

Behavior: Subtitle files (SRT/VTT) are downloaded and kept on disk.
# Download English (default)
uv run subxx subs https://youtu.be/VIDEO_ID
# Download specific language
uv run subxx subs https://youtu.be/VIDEO_ID -l de
# Download multiple languages
uv run subxx subs https://youtu.be/VIDEO_ID -l "en,de,fr"
# Download all available languages
uv run subxx subs https://youtu.be/VIDEO_ID -l all

# Save to specific directory
uv run python __main__.py subs https://youtu.be/VIDEO_ID -o ~/Downloads/subs
# Use current directory (default)
uv run python __main__.py subs https://youtu.be/VIDEO_ID -o .

# Safe mode: Remove unsafe characters, keep spaces (default)
uv run python __main__.py subs URL --sanitize safe
# No spaces: Replace spaces with underscores
uv run python __main__.py subs URL --sanitize nospaces
# Slugify: Lowercase, hyphens, URL-safe
uv run python __main__.py subs URL --sanitize slugify

Examples:
- safe: "My Video Title.srt" → "My Video Title.srt"
- nospaces: "My Video Title.srt" → "My_Video_Title.srt"
- slugify: "My Video Title.srt" → "my-video-title.srt"
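For readers embedding subxx as a library, the three modes can be sketched roughly like this. This is an illustrative re-implementation under stated assumptions, not subxx's actual code; the `sanitize_filename` name is hypothetical:

```python
import re

def sanitize_filename(name: str, mode: str = "safe") -> str:
    """Sketch of the three sanitization modes described above (illustrative)."""
    if mode == "safe":
        # Drop characters unsafe on common filesystems, keep spaces
        return re.sub(r'[\\/:*?"<>|]', "", name)
    if mode == "nospaces":
        return re.sub(r'[\\/:*?"<>|]', "", name).replace(" ", "_")
    if mode == "slugify":
        stem, dot, ext = name.rpartition(".")
        base = stem if dot else name
        # Lowercase, collapse non-alphanumeric runs to hyphens
        slug = re.sub(r"[^a-z0-9]+", "-", base.lower()).strip("-")
        return slug + (dot + ext.lower() if dot else "")
    raise ValueError(f"unknown mode: {mode}")

print(sanitize_filename("My Video Title.srt", "safe"))      # My Video Title.srt
print(sanitize_filename("My Video Title.srt", "nospaces"))  # My_Video_Title.srt
print(sanitize_filename("My Video Title.srt", "slugify"))   # my-video-title.srt
```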
# Prompt before overwriting (default)
uv run python __main__.py subs URL
# Force overwrite without prompting
uv run python __main__.py subs URL --force
# Skip existing files
uv run python __main__.py subs URL --skip-existing

# Include auto-generated subtitles (default)
uv run python __main__.py subs URL --auto
# Only manual subtitles
uv run python __main__.py subs URL --no-auto

Preview what would be downloaded without actually downloading:
uv run python __main__.py subs URL --dry-run

Output:
[DRY RUN] Would download subtitle: en
New in v0.4.0: Get machine-readable JSON output for automation and scripting.
- `list` - List available languages
- `subs` - Download subtitles
# List command with JSON
uv run subxx list "https://youtu.be/dQw4w9WgXcQ" --json
# Subs command with JSON
uv run subxx subs "https://youtu.be/dQw4w9WgXcQ" --json

Example JSON output:
{
  "status": "success",
  "video_id": "dQw4w9WgXcQ",
  "video_title": "Rick Astley - Never Gonna Give You Up...",
  "files": [
    {
      "path": "Rick Astley - Never Gonna Give You Up.dQw4w9WgXcQ.NA.en.srt",
      "language": "en",
      "format": "srt",
      "auto_generated": false
    }
  ],
  "available_languages": [
    {"code": "en", "name": "en", "auto_generated": false}
  ],
  "metadata": {...}
}

# Save JSON to file
uv run subxx list URL --json-file metadata.json
uv run subxx subs URL --json-file result.json
# Both stdout and file
uv run subxx subs URL --json --json-file result.json

#!/bin/bash
# Get video metadata
metadata=$(uv run subxx list "$VIDEO_URL" --json)
video_title=$(echo "$metadata" | jq -r '.video_title')
echo "Downloading: $video_title"
# Download with JSON output
uv run subxx subs "$VIDEO_URL" --json-file download.json
# Check if successful
if [ "$(jq -r '.status' download.json)" == "success" ]; then
  echo "Success! Downloaded $(jq -r '.files | length' download.json) files"
fi

Extract clean, readable text from subtitles by automatically removing timestamps and formatting.
Key behavior: When using text formats (txt/md/pdf), subxx:
- Downloads the subtitle as SRT
- Extracts the text content
- Automatically deletes the SRT file
# Extract to plain text
uv run python __main__.py subs URL --txt

Output: Video_Title.VIDEO_ID.en.txt
Example content:
Hello world.
This is a subtitle.
Welcome to the video.
# Extract to Markdown
uv run python __main__.py subs URL --md
# Markdown with timestamp markers every 5 minutes
uv run python __main__.py subs URL --md -t 300
# Markdown with timestamp markers every 30 seconds
uv run python __main__.py subs URL --md -t 30

Output: Video_Title.VIDEO_ID.en.md
Example content (with timestamps):
## [0:00]
Hello world.
This is a subtitle.
## [5:00]
Welcome to the next section.
More content here.
## [10:00]
Final section of the video.

# Extract to PDF
uv run python __main__.py subs URL --pdf
# PDF with timestamp markers
uv run python __main__.py subs URL --pdf -t 300

Output: Video_Title.VIDEO_ID.en.pdf
Requirements: Install extraction dependencies:
uv sync --extra extract

Add timestamp markers at regular intervals for long-form content:
# Every 5 minutes (300 seconds)
uv run python __main__.py subs URL --md -t 300
# Every 30 seconds
uv run python __main__.py subs URL --txt -t 30
# Every 10 minutes
uv run python __main__.py subs URL --pdf -t 600

Format: Timestamps appear as ## [0:00], ## [5:00], ## [10:00], etc.
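The interval-to-marker mapping above can be sketched in a few lines. This is illustrative only; `timestamp_label` is a hypothetical helper, not part of subxx, and the H:MM:SS form past one hour is an assumption:

```python
def timestamp_label(seconds: int) -> str:
    """Format a marker offset as M:SS, or H:MM:SS past one hour (assumed)."""
    h, rem = divmod(seconds, 3600)
    m, s = divmod(rem, 60)
    return f"{h}:{m:02d}:{s:02d}" if h else f"{m}:{s:02d}"

# Markers for a 15-minute video at a 300-second (5-minute) interval
interval, duration = 300, 900
markers = [f"## [{timestamp_label(t)}]" for t in range(0, duration + 1, interval)]
print(markers)  # ['## [0:00]', '## [5:00]', '## [10:00]', '## [15:00]']
```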
Download subtitles for multiple URLs from a file:
# Create URLs file (one URL per line)
cat > urls.txt << EOF
https://youtu.be/VIDEO_ID_1
https://youtu.be/VIDEO_ID_2
# This is a comment
https://youtu.be/VIDEO_ID_3
EOF
# Process all URLs
uv run python __main__.py batch urls.txt
# With options
uv run python __main__.py batch urls.txt -l "en,de" -f srt -o ~/subs

Options:
- `-l, --langs` - Language codes (default: en)
- `-f, --fmt` - Output format (default: srt)
- `-o, --output-dir` - Output directory (default: .)
- `--sanitize` - Filename sanitization mode (default: safe)
- `-v, --verbose` - Verbose output
- `-q, --quiet` - Quiet mode
URL File Format (yt-dlp standard):
- One URL per line
- Lines starting with `#` are comments
- Empty lines are ignored
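A minimal sketch of this parsing convention, useful when generating or validating URL files programmatically (hypothetical helper, not subxx's implementation):

```python
def parse_url_file(text: str) -> list[str]:
    """Parse a batch URL file: one URL per line, '#' comments and blanks ignored."""
    urls = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip comments and empty lines
        urls.append(line)
    return urls

sample = """\
https://youtu.be/VIDEO_ID_1
https://youtu.be/VIDEO_ID_2
# This is a comment

https://youtu.be/VIDEO_ID_3
"""
print(parse_url_file(sample))
# ['https://youtu.be/VIDEO_ID_1', 'https://youtu.be/VIDEO_ID_2', 'https://youtu.be/VIDEO_ID_3']
```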
Extract text from existing subtitle files:
# Extract SRT to plain text
uv run python __main__.py extract video.srt
# Extract to Markdown
uv run python __main__.py extract video.srt -f md
# Extract to PDF
uv run python __main__.py extract video.srt -f pdf
# With timestamp markers every 5 minutes
uv run python __main__.py extract video.srt -f md -t 300
# Specify output file
uv run python __main__.py extract video.srt -o output.txt
# Force overwrite
uv run python __main__.py extract video.srt --force

Supported input formats: SRT, VTT
Configuration files are loaded in priority order:
1. `./.subxx.toml` (project-specific, current directory)
2. `~/.subxx.toml` (user global, home directory)
Settings are resolved in this order (highest to lowest):
1. CLI flags (e.g., `--langs en`, `--fmt srt`)
2. Config file (`.subxx.toml`)
3. Hardcoded defaults
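The precedence chain can be sketched as follows. This is illustrative; `resolve_setting` and the default values shown are assumptions, not subxx's actual internals:

```python
# Hypothetical hardcoded defaults (illustrative values)
HARDCODED = {"langs": "en", "fmt": "srt", "output_dir": "."}

def resolve_setting(key, cli_value=None, config=None):
    """A CLI flag wins, then the config file, then the hardcoded default."""
    if cli_value is not None:
        return cli_value
    if config and key in config:
        return config[key]
    return HARDCODED[key]

# As if loaded from the [defaults] table of .subxx.toml
config = {"fmt": "md", "langs": "en,de"}
print(resolve_setting("fmt", cli_value="srt", config=config))  # srt   (CLI wins)
print(resolve_setting("langs", config=config))                 # en,de (config wins)
print(resolve_setting("output_dir", config=config))            # .     (default)
```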
Copy .subxx.toml.example to .subxx.toml or ~/.subxx.toml:
cp .subxx.toml.example ~/.subxx.toml

Example config:
[defaults]
# Language codes (comma-separated or "all")
langs = "en"
# Output format: srt, vtt, txt, md, pdf
fmt = "md"
# Include auto-generated subtitles
auto = true
# Output directory (supports ~)
output_dir = "~/Downloads/subtitles"
# Filename sanitization: safe, nospaces, slugify
sanitize = "safe"
# Timestamp interval (seconds) for txt/md/pdf
timestamps = 300 # 5-minute intervals
[logging]
# Log level: DEBUG, INFO, WARNING, ERROR
level = "INFO"
# Log file (optional)
log_file = "~/.subxx/subxx.log"

Configuration 1: Download SRT files to dedicated directory
[defaults]
langs = "en"
fmt = "srt"
output_dir = "~/Downloads/subtitles"

Configuration 2: Auto-extract to Markdown with timestamps
[defaults]
langs = "en"
fmt = "md"
timestamps = 300
output_dir = "~/Documents/transcripts"

Configuration 3: Multiple languages, plain text
[defaults]
langs = "en,de,fr"
fmt = "txt"
sanitize = "slugify"
output_dir = "./subtitles"

# Installation
make install # Core dependencies
make install-all # All dependencies (extract + api + dev)
# Testing
make test # Run all tests
make test-unit # Unit tests only
make test-integration # Integration tests only
make test-coverage # Tests with coverage report
# Usage
make list VIDEO_URL=https://youtu.be/VIDEO_ID
make subs VIDEO_URL=https://youtu.be/VIDEO_ID
make md VIDEO_ID=VIDEO_ID # Quick Markdown extraction
make md VIDEO_ID=VIDEO_ID TIMESTAMPS=300 # With timestamps
# Utilities
make version # Show version
make clean # Clean cache files
make clean-all # Clean everything including .venv

# Quick Markdown extraction (just paste video ID)
make md VIDEO_ID=dQw4w9WgXcQ
# With 5-minute timestamps
make md VIDEO_ID=lHuxDMMkGJ8 TIMESTAMPS=300
# List subtitles
make list VIDEO_URL=https://youtu.be/dQw4w9WgXcQ
# Download with languages
make subs VIDEO_URL=https://youtu.be/dQw4w9WgXcQ LANGS=en,de

Start an HTTP API server for programmatic access (requires API dependencies):
# Install API dependencies
uv sync --extra api
# Or with Make
make install-api

# Start on localhost:8000 (default)
uv run python __main__.py serve
# Custom host/port
uv run python __main__.py serve --host 127.0.0.1 --port 8080

Security Warning: The API has NO authentication and should ONLY run on localhost (127.0.0.1).
POST /subs - Fetch subtitles and return content directly.
Request:
{
  "url": "https://youtu.be/VIDEO_ID",
  "langs": "en",
  "fmt": "srt",
  "auto": true,
  "sanitize": "safe"
}

Response: Subtitle file content as plain text.
Example:
curl -X POST http://127.0.0.1:8000/subs \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://youtu.be/dQw4w9WgXcQ",
    "langs": "en",
    "fmt": "srt"
  }'

Health check endpoint.
Response:
{
  "status": "ok",
  "service": "subxx"
}

Interactive API docs available at:
- Swagger UI: http://127.0.0.1:8000/docs
- ReDoc: http://127.0.0.1:8000/redoc
# Clone repository
git clone https://gist.github.com/cprima/subxx
cd subxx
# Install all dependencies (core + extract + api + dev)
uv sync --extra extract --extra api --extra dev
# Or with Make
make install-all

Updated in v0.4.1 - Restructured for Python best practices:
subxx/
├── subxx.py # Core library functions (returns dicts)
├── cli.py # CLI + API implementation (Typer/FastAPI)
├── __main__.py # Minimal entry point (3 lines)
├── test_subxx.py # Test suite (pytest)
├── conftest.py # Pytest configuration
├── pyproject.toml # Project metadata and dependencies
├── Makefile # Build and test automation
├── .subxx.toml.example # Example configuration file
└── !README.md # This file
- `subxx.py`: Core library (library-first design)
  - `fetch_subs()` → dict - Download subtitles, return structured data
  - `extract_text()` → dict - Extract text from subtitles, return structured data
  - `load_config()` → dict - Configuration management
  - Helper functions for parsing, sanitization, logging
  - Importable as Python module
- `cli.py`: CLI + API implementation
  - Typer commands: `list`, `subs`, `batch`, `extract`, `serve`, `version`
  - FastAPI HTTP server
  - JSON output handling (`--json`, `--json-file`)
  - Traditional console output with emojis
- `__main__.py`: Minimal entry point (Python best practice)
  - 3 lines: import and run CLI
  - Enables `python -m subxx` usage
# All tests
make test
# Unit tests only (fast, no network)
make test-unit
# Integration tests only
make test-integration
# With coverage report
make test-coverage
# Verbose output
make test-verbose

Test categories:
- Unit tests (`@pytest.mark.unit`): No external dependencies, mocked I/O
- Integration tests (`@pytest.mark.integration`): May use files/network
- E2E tests (`@pytest.mark.e2e`): Real YouTube API, requires internet
- Slow tests (`@pytest.mark.slow`): Network I/O, real downloads
# Run all tests except e2e (fast, for CI)
pytest -m "not e2e"
# Run only e2e tests (slow, requires internet)
pytest -m e2e
# Run unit tests only
pytest -m unit

Current coverage: ~50 tests (unit, integration, and e2e)
Key areas tested:
- Configuration loading and defaults
- Language parsing
- Filename sanitization
- Text extraction (txt/md/pdf)
- Timestamp markers
- CLI commands
- Overwrite protection
- Real YouTube subtitle download (e2e)
Exit codes:
- 0 - Success
- 1 - User cancelled
- 2 - No subtitles available
- 3 - Network error
- 4 - Invalid URL
- 5 - Configuration error
- 6 - File error
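For scripting against the CLI, a small lookup table mirrors the list above. The helper name and table are illustrative, not part of the subxx API:

```python
# Exit codes as documented above; the dict itself is a hypothetical helper.
EXIT_CODES = {
    0: "Success",
    1: "User cancelled",
    2: "No subtitles available",
    3: "Network error",
    4: "Invalid URL",
    5: "Configuration error",
    6: "File error",
}

def describe_exit(code: int) -> str:
    """Map a subxx exit code to its documented meaning."""
    return EXIT_CODES.get(code, f"Unknown exit code: {code}")

print(describe_exit(2))  # No subtitles available
```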
Error:
❌ Error: Missing dependencies for text extraction
Solution:
uv sync --extra extract

Error:
❌ Error: API dependencies not installed
Solution:
uv sync --extra api

If you see encoding errors on Windows, the tool automatically attempts to reconfigure stdout/stderr to UTF-8. If issues persist, use:
# Set console to UTF-8
chcp 65001

If downloads fail with network errors:
1. Update yt-dlp: uv sync --upgrade
2. Check firewall/proxy settings
3. Try with `--verbose` for debug output: uv run python __main__.py subs URL --verbose
- JSON output support (`--json`, `--json-file`)
- Importable Python module (library-first architecture)
- Published package on test.pypi.org
- Pythonic project structure (cli.py, minimal __main__.py)
- Publish to PyPI (production)
- Progress bars for downloads
- Retry logic for network failures
- Subtitle merging/combining
- Translation support
- Docker container
- GitHub Actions CI/CD
- SRT/VTT format conversion
- Subtitle editing/manipulation
- Batch command JSON support
- Extract command JSON support
Contributions welcome! This is an alpha project under active development.
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests for new functionality
- Ensure all tests pass: make test
- Submit a pull request
- Follow existing code style
- Add docstrings for new functions
- Update tests for changes
- Update README for new features
- Keep commits focused and atomic
This project is licensed under CC BY 4.0 (Creative Commons Attribution 4.0 International).
You are free to:
- Share - Copy and redistribute the material
- Adapt - Remix, transform, and build upon the material
Under the following terms:
- Attribution - You must give appropriate credit
See LICENSE for full details.
- Built with yt-dlp for video subtitle extraction
- CLI powered by Typer
- API built with FastAPI
- Text extraction using srt and fpdf2
Christian Prior-Mamulyan
- Email: cprior@gmail.com
- GitHub: @cprima
- Report issues: GitHub Issues
- Documentation: GitHub Gist
subxx - Simple, powerful YouTube transcript / subtitle fetching for Python.