EPUB Indexing Tools & Scripts

Problem Solved

EPUB files are ZIP archives of HTML/XHTML documents, so the text cannot be read directly; it has to be unpacked and parsed first
The goal was to extract case studies and other structured content from the Team Topologies book
Initial attempts at simple text conversion lost formatting and structure
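
To make the "compressed archive containing HTML" point concrete, here is a minimal stdlib sketch that lists the HTML documents inside an archive. It is not one of the tools described below; the function names and the demo file names (OEBPS/chapter1.xhtml and so on) are made up for illustration, and the demo archive is only EPUB-like (it has no container.xml or package file).

```python
import io
import zipfile


def list_html_members(epub_bytes: bytes) -> list[str]:
    """Return the names of the (X)HTML documents inside an EPUB archive."""
    with zipfile.ZipFile(io.BytesIO(epub_bytes)) as archive:
        return [
            name for name in archive.namelist()
            if name.lower().endswith((".xhtml", ".html", ".htm"))
        ]


def make_demo_epub() -> bytes:
    """Build a tiny EPUB-like ZIP in memory (hypothetical file names)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("mimetype", "application/epub+zip")
        archive.writestr("OEBPS/chapter1.xhtml",
                         "<html><body><p>Hello</p></body></html>")
        archive.writestr("OEBPS/styles.css", "p { margin: 0 }")
    return buffer.getvalue()


print(list_html_members(make_demo_epub()))
```

For a real file, read it with open(path, "rb").read() and pass the bytes in.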

Tools Created

1. epub_to_txt.py

Purpose: Simple EPUB to plain text conversion using ebooklib

Dependencies:

ebooklib - for reading EPUB structure
beautifulsoup4 - for HTML parsing

What it does:

Extracts all document items from the EPUB
Converts each item's HTML content to plain text
Joins chapters with line breaks
Prints total character and word counts
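
A minimal sketch of those steps, not the actual script. Assumptions: the real tool reads items with ebooklib and parses them with BeautifulSoup, while this sketch uses the stdlib html.parser as a stand-in so it runs without dependencies and takes the chapter HTML as plain strings. The names html_to_text and convert are hypothetical.

```python
from html.parser import HTMLParser


class _TextOnly(HTMLParser):
    """Collect text nodes and ignore every tag."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(html: str) -> str:
    """Strip all tags and return the concatenated text."""
    parser = _TextOnly()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts).strip()


def convert(documents: list[str]) -> tuple[str, int, int]:
    """Join the chapters with blank lines; return text, char count, word count."""
    text = "\n\n".join(html_to_text(doc) for doc in documents)
    return text, len(text), len(text.split())


chapters = [
    "<h1>One</h1><p>First chapter.</p>",
    "<h1>Two</h1><p>Second chapter.</p>",
]
text, chars, words = convert(chapters)
print(text)  # heading and paragraph run together: "OneFirst chapter."
print(f"{chars} characters, {words} words")
```

Because tags are simply dropped, a heading and the paragraph after it fuse when the markup has no whitespace between them. That is one plausible source of the missing newlines listed under Limitations.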

Limitations:

Most formatting and structure is lost
Chapter boundaries are not preserved
Missing newlines and page breaks make the output hard to parse

2. epub_indexer.py (Original Complex Version)

Purpose: Create AI-friendly index with structured content extraction

Features:

Parsed EPUB metadata and spine structure
Created searchable chunks (1000 words each)
Identified potential case studies using regex patterns
Generated JSON index with chapter summaries and case study contexts
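
A rough sketch of the chunk-and-index idea, under stated assumptions: the regex is a hypothetical stand-in (the patterns actually used are not recorded in these notes), the index layout is invented, and chapter summaries and EPUB metadata parsing are left out.

```python
import json
import re

CHUNK_WORDS = 1000  # chunk size stated in the notes

# Hypothetical pattern: the regexes actually used are not recorded in these notes.
CASE_STUDY_RE = re.compile(r"case\s+study\s*:", re.IGNORECASE)


def chunk_words(text, size=CHUNK_WORDS):
    """Split text into consecutive chunks of at most `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def build_index(title, text, size=CHUNK_WORDS, context_chars=40):
    """Return a JSON-serializable index of chunks and case study contexts."""
    chunks = chunk_words(text, size)
    case_studies = []
    for number, chunk in enumerate(chunks):
        for match in CASE_STUDY_RE.finditer(chunk):
            start = max(0, match.start() - context_chars)
            end = match.end() + context_chars
            case_studies.append({"chunk": number, "context": chunk[start:end]})
    return {
        "title": title,
        "chunk_count": len(chunks),
        "chunks": chunks,
        "case_studies": case_studies,
    }


# Tiny demo with an 8-word chunk size so the chunking is visible.
sample = ("intro " * 5
          + "Case Study: Example Corp adopted small teams. "
          + "filler " * 10)
index = build_index("Demo Book", sample, size=8)
print(json.dumps(index["case_studies"], indent=2))
```

Note that the context for a match is cut off at the chunk edge, which hints at the boundary problems listed under Issues.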

Issues:

The chunking approach was overcomplicated
Pattern matching missed some case studies (such as BCG Digital Ventures)
Boundary issues: content that straddled a chunk boundary was split across two chunks
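
The boundary issue can be reproduced in a few lines. This is an illustration only: the pattern and the sample text are hypothetical, and the notes do not say this is the exact reason a particular case study was missed.

```python
import re


def chunk_words(text, size):
    """Split text into consecutive chunks of at most `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


PATTERN = re.compile(r"case study", re.IGNORECASE)  # hypothetical pattern

text = "one two three Case Study follows here"
chunks = chunk_words(text, 4)
# The phrase is split: "... three Case" ends chunk 0, "Study ..." starts chunk 1.

hits_in_chunks = sum(len(PATTERN.findall(chunk)) for chunk in chunks)
hits_in_full_text = len(PATTERN.findall(text))

print(chunks)
print(hits_in_chunks, hits_in_full_text)
```

Searching the unchunked text finds the phrase; searching chunk by chunk does not.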

3. better_epub_extract.py (Final Solution)

Purpose: Extract EPUB with proper structure preservation

Key Improvements:

Added newlines before major structural elements (h1, h2, p, div tags)
Added separators between chapters/sections with source file tracking
Preserved case study markers with === formatting
Generated auto-TOC from detected headings and case studies
Better text cleaning while preserving intentional breaks
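
A minimal sketch of these improvements, not the actual better_epub_extract.py. Assumptions: the stdlib html.parser stands in for BeautifulSoup, input is a list of (source file name, HTML) pairs rather than an EPUB read through ebooklib, and the separator line, the exact === format, and the "starts with Case Study" rule are all guesses.

```python
from html.parser import HTMLParser

BLOCK_TAGS = {"h1", "h2", "p", "div"}  # tags named in the notes
HEADING_TAGS = {"h1", "h2"}


class StructuredText(HTMLParser):
    """Collect text, adding a newline before block tags and recording headings."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self.headings = []
        self._heading = None  # text buffer while inside an h1/h2

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.parts.append("\n")
        if tag in HEADING_TAGS:
            self._heading = []

    def handle_endtag(self, tag):
        if tag in HEADING_TAGS and self._heading is not None:
            title = " ".join("".join(self._heading).split())
            if title:
                self.headings.append(title)
            self._heading = None

    def handle_data(self, data):
        self.parts.append(data)
        if self._heading is not None:
            self._heading.append(data)


def extract(documents):
    """documents: list of (source_name, html). Returns (text, toc)."""
    sections, toc = [], []
    for source, html in documents:
        parser = StructuredText()
        parser.feed(html)
        parser.close()
        lines = []
        for raw in "".join(parser.parts).split("\n"):
            line = " ".join(raw.split())  # collapse spaces, keep line breaks
            if not line:
                continue
            if line.lower().startswith("case study"):
                line = f"=== {line} ==="  # hypothetical marker format
            lines.append(line)
        # Hypothetical separator that records the source file.
        sections.append(f"----- source: {source} -----\n" + "\n".join(lines))
        toc.extend((source, heading) for heading in parser.headings)
    return "\n\n".join(sections), toc


demo = [("ch1.xhtml",
         "<h1>Team  First</h1><p>Intro text.</p>"
         "<h2>Case Study: Example Corp</h2><p>Details.</p>")]
text, toc = extract(demo)
print(text)
print(toc)
```

Keeping one block element per line is what makes the later grep-style searches workable, and the source line shows which HTML file a hit came from.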

Success: This approach extracted the complete BCG Digital Ventures case study, which the previous methods had missed.

Virtual Environment Setup

python3 -m venv epub_env
source epub_env/bin/activate
pip install ebooklib beautifulsoup4 html2text

Usage Examples

# Simple conversion
python epub_to_txt.py "input.epub" "output.txt"

# Structured extraction with better formatting
python better_epub_extract.py "input.epub" "structured_output.txt"

Files Generated

team_topologies.txt - Simple text conversion (65,244 words)
team_topologies_structured.txt - Better formatted version (67,120 words)
Team Topologies by Matthew Skelton, Manuel Pais_index.json - JSON index with metadata

Key Learnings

What Worked

The ebooklib + BeautifulSoup combination gave reliable EPUB parsing
Structure preservation was crucial for finding specific content
Source tracking helped identify which HTML files contain relevant content
Simple text conversion was often better than complex chunking for targeted searches

What Didn't Work

Complex chunking algorithms overcomplicated the problem
Regex pattern matching alone was insufficient for finding all case studies
JSON indexing added more complexity than value for our use case

Best Practice Discovered

The most effective approach was:

1. Simple EPUB → structured text conversion preserving formatting
2. Standard text search tools (grep, etc.) on the clean text
3. Manual verification of extracted content
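
The search step needs nothing beyond grep (for example grep -i -n -C 1 "case study" structured_output.txt). For use inside a script, a rough Python equivalent might look like the following; grep_context and the sample text are hypothetical.

```python
def grep_context(text, needle, context=1):
    """Case-insensitive search; return (line_number, surrounding_lines) per hit."""
    lines = text.splitlines()
    needle = needle.lower()
    results = []
    for i, line in enumerate(lines):
        if needle in line.lower():
            start = max(0, i - context)
            results.append((i + 1, lines[start:i + context + 1]))
    return results


sample = "\n".join([
    "----- source: ch5.xhtml -----",
    "Intro",
    "=== Case Study: Example Corp ===",
    "Details here.",
    "More text.",
])

for number, block in grep_context(sample, "case study"):
    print(f"line {number}:")
    for line in block:
        print("   ", line)
```

The returned context is what gets checked by hand in the manual verification step.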

Case Studies Successfully Extracted

BCG Digital Ventures - Engineering Enablement Team (Robin Weston)
Sky Betting & Gaming - Platform Feature Teams (Michael Maibaum)
Amazon - Strictly Independent Service Teams
Auto Trader - Multiple case studies on office layout and operations
TransUnion - Team topology evolution (Parts 1 & 2)
And 15+ others across Parts 1 & 2 of the book

This toolset successfully converted a complex EPUB into searchable, structured content suitable for AI analysis and discussion.

amiller/epub_indexing_tools_notes.md
