- EPUB files are ZIP archives of HTML/XHTML documents, so their text cannot be read directly
- Needed to extract case studies and structured content from the Team Topologies book
- Initial attempts using simple text conversion lost formatting and structure
Purpose: Simple EPUB to plain text conversion using ebooklib
Dependencies:
- ebooklib - for reading EPUB structure
- beautifulsoup4 - for HTML parsing
What it does:
- Extracts all document items from EPUB
- Converts HTML content to plain text
- Joins chapters with line breaks
- Outputs total character and word counts
Limitations:
- Loses most formatting and structure
- No preservation of chapter boundaries
- Missing newlines and page breaks made content hard to parse
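A minimal sketch of what such a simple conversion could look like, assuming the standard ebooklib and BeautifulSoup calls (`epub.read_epub`, `get_items_of_type`, `get_text`). The function names `epub_to_text` and `text_stats` are illustrative and are not taken from the original `epub_to_txt.py`.

```python
def text_stats(text):
    """Return (character count, word count) for the extracted text."""
    return len(text), len(text.split())


def epub_to_text(epub_path):
    """Flatten every document item of an EPUB into one plain-text string."""
    # Third-party imports live here so text_stats() works without them.
    import ebooklib
    from ebooklib import epub
    from bs4 import BeautifulSoup

    book = epub.read_epub(epub_path)
    chapters = []
    for item in book.get_items_of_type(ebooklib.ITEM_DOCUMENT):
        soup = BeautifulSoup(item.get_content(), "html.parser")
        # get_text() drops every tag, which is where structure gets lost.
        chapters.append(soup.get_text())
    # Chapters are joined with a bare line break, so boundaries are not marked.
    return "\n".join(chapters)
```

Passing the result of `epub_to_text("input.epub")` to `text_stats` gives the character and word totals. Because all tags are discarded in one step, nothing in the output indicates where a heading or chapter began.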
Purpose: Create AI-friendly index with structured content extraction
Features:
- Parsed EPUB metadata and spine structure
- Created searchable chunks (1000 words each)
- Identified potential case studies using regex patterns
- Generated JSON index with chapter summaries and case study contexts
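The chunking step can be sketched roughly as follows. This is an illustration only: the index layout (`chunk_size`, `chunks`, `id`, `word_count`, `text`) is a hypothetical schema, not the one the original script produced.

```python
import json


def chunk_words(text, chunk_size=1000):
    """Split text into consecutive chunks of at most chunk_size words."""
    words = text.split()
    return [
        " ".join(words[start:start + chunk_size])
        for start in range(0, len(words), chunk_size)
    ]


def build_index(text, chunk_size=1000):
    """Build a JSON-serialisable index of chunks (hypothetical schema)."""
    chunks = chunk_words(text, chunk_size)
    return {
        "chunk_size": chunk_size,
        "chunks": [
            {"id": i, "word_count": len(chunk.split()), "text": chunk}
            for i, chunk in enumerate(chunks)
        ],
    }


def save_index(index, path):
    """Write the index to disk as UTF-8 JSON."""
    with open(path, "w", encoding="utf-8") as out:
        json.dump(index, out, ensure_ascii=False, indent=2)
```

Note that chunks are cut purely by word count, so a chunk can begin or end in the middle of a sentence or section.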
Issues:
- Chunking approach was overcomplicated
- Pattern matching missed some case studies (like BCG Digital Ventures)
- Boundary issues where content split across chunks
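The boundary issue is easy to reproduce in a few lines. The sketch below uses a hypothetical marker pattern and a deliberately tiny chunk size; the patterns used by the original script are not shown here, so this only illustrates the failure mode.

```python
import re

# Hypothetical marker pattern, for illustration only.
PATTERN = re.compile(r"CASE STUDY: \w+")


def split_into_chunks(text, chunk_size):
    """Cut text into chunks of chunk_size words, ignoring sentence boundaries."""
    words = text.split()
    return [
        " ".join(words[i:i + chunk_size])
        for i in range(0, len(words), chunk_size)
    ]


text = "intro words here CASE STUDY: Example follows with more text"
chunks = split_into_chunks(text, chunk_size=4)

# Searching chunk by chunk misses the marker, because "CASE" and "STUDY:"
# ended up in different chunks.
hits_in_chunks = [m.group(0) for c in chunks for m in PATTERN.finditer(c)]

# Searching the unsplit text finds it.
hits_in_full_text = [m.group(0) for m in PATTERN.finditer(text)]

print(hits_in_chunks)
print(hits_in_full_text)
```

Overlapping chunks would reduce this effect, but searching the unchunked text avoids it entirely.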
Purpose: Extract EPUB with proper structure preservation
Key Improvements:
- Added newlines before major structural elements (h1, h2, p, div tags)
- Added separators between chapters/sections with source file tracking
- Preserved case study markers with === formatting
- Generated auto-TOC from detected headings and case studies
- Better text cleaning while preserving intentional breaks
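The core idea, inserting a line break before each block-level tag and labelling each section with its source file, can be sketched as below. To keep the example dependency-free it uses the standard library's `html.parser` as a stand-in for BeautifulSoup, and the `=== SOURCE: ... ===` separator format is an assumption, not the exact format `better_epub_extract.py` writes.

```python
from html.parser import HTMLParser

# Tags that should start on a new line in the text output.
BLOCK_TAGS = {"h1", "h2", "p", "div"}


class StructuredText(HTMLParser):
    """Collect text, emitting a newline before each block-level tag."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        # Note: text inside <title>, <style> or <script> is collected too.
        self.parts.append(data)


def html_to_structured_text(html):
    """Convert HTML to text, keeping one line per block element."""
    parser = StructuredText()
    parser.feed(html)
    parser.close()
    raw_lines = "".join(parser.parts).splitlines()
    # Collapse runs of whitespace inside a line and drop empty lines.
    lines = (" ".join(line.split()) for line in raw_lines)
    return "\n".join(line for line in lines if line)


def join_sections(sections):
    """Join (source_file_name, html) pairs with a source-tracking separator."""
    out = []
    for name, html in sections:
        out.append(f"=== SOURCE: {name} ===")  # assumed separator format
        out.append(html_to_structured_text(html))
    return "\n\n".join(out)
```

With ebooklib, the `(name, html)` pairs could come from each document item's `get_name()` and its decoded `get_content()`.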
Success: This approach successfully extracted the complete BCG Digital Ventures case study content that previous methods missed.
python3 -m venv epub_env
source epub_env/bin/activate
pip install ebooklib beautifulsoup4 html2text
# Simple conversion
python epub_to_txt.py "input.epub" "output.txt"
# Structured extraction with better formatting
python better_epub_extract.py "input.epub" "structured_output.txt"
- team_topologies.txt - Simple text conversion (65,244 words)
- team_topologies_structured.txt - Better formatted version (67,120 words)
- Team Topologies by Matthew Skelton, Manuel Pais_index.json - JSON index with metadata
- ebooklib + BeautifulSoup combination for reliable EPUB parsing
- Structure preservation crucial for finding specific content
- Source tracking helps identify which HTML files contain relevant content
- Simple text conversion often better than complex chunking for targeted searches
- Complex chunking algorithms overcomplicated the problem
- Regex pattern matching alone insufficient for finding all case studies
- JSON indexing created more complexity than value for our use case
The most effective approach was:
- Simple EPUB → structured text conversion preserving formatting
- Standard text search tools (grep, etc.) on the clean text
- Manual verification of extracted content
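For the search step, a small helper comparable to `grep -i -n -C 2` may be enough. The sketch below is one possible version; the function name `search_with_context` is illustrative.

```python
def search_with_context(text, term, context=2):
    """Return (line_number, snippet) pairs for lines containing term.

    Matching is case-insensitive; each snippet includes up to `context`
    lines before and after the matching line.
    """
    lines = text.splitlines()
    needle = term.lower()
    results = []
    for i, line in enumerate(lines):
        if needle in line.lower():
            start = max(0, i - context)
            end = min(len(lines), i + context + 1)
            results.append((i + 1, "\n".join(lines[start:end])))
    return results
```

Running it over the structured text file with a term such as "BCG Digital Ventures" returns each match with its surrounding lines, which makes the manual verification step quicker.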
- BCG Digital Ventures - Engineering Enablement Team (Robin Weston)
- Sky Betting & Gaming - Platform Feature Teams (Michael Maibaum)
- Amazon - Strictly Independent Service Teams
- Auto Trader - Multiple case studies on office layout and operations
- TransUnion - Team topology evolution (Parts 1 & 2)
- And 15+ others across Parts 1 & 2 of the book
This toolset successfully converted a complex EPUB into searchable, structured content suitable for AI analysis and discussion.