@amiller
Created November 22, 2025 19:42
EPUB Indexing Tools & Scripts

Problem Solved

  • EPUB files are ZIP archives of HTML/XHTML documents, so their text cannot be read without unpacking and parsing them
  • Needed to extract case studies and structured content from the Team Topologies book
  • Initial attempts using simple text conversion lost formatting and structure
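
Because an EPUB is a ZIP container, the standard library alone can show what is inside before any parsing library gets involved. The following is a minimal sketch, not one of the scripts described here; the function name `list_epub_documents` and the list of suffixes are assumptions made for illustration.

```python
import zipfile

# Suffixes commonly used for content documents inside an EPUB container
# (an assumption; a given book may use others).
HTML_SUFFIXES = (".xhtml", ".html", ".htm")


def list_epub_documents(epub_path):
    """Return the names of HTML/XHTML members of the EPUB, in archive order."""
    with zipfile.ZipFile(epub_path) as archive:
        return [name for name in archive.namelist()
                if name.lower().endswith(HTML_SUFFIXES)]
```

Archive order is not necessarily reading order. The reading order is defined by the spine in the EPUB package document, which is one reason to use a library such as ebooklib rather than unzipping by hand.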

Tools Created

1. epub_to_txt.py

Purpose: Simple EPUB to plain text conversion using ebooklib

Dependencies:

  • ebooklib - for reading EPUB structure
  • beautifulsoup4 - for HTML parsing

What it does:

  • Extracts all document items from EPUB
  • Converts HTML content to plain text
  • Joins chapters with line breaks
  • Outputs total character and word counts

Limitations:

  • Loses most formatting and structure
  • Does not preserve chapter boundaries
  • Missing newlines and page breaks make the output hard to parse
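
The steps listed under "What it does" can be sketched roughly as follows. This is an illustrative reconstruction, not the actual epub_to_txt.py; the function names are assumptions. The ebooklib and BeautifulSoup imports are placed inside the extraction function so that the joining and counting helpers can be used without them.

```python
def extract_chapter_texts(epub_path):
    """Return one plain-text string per document item in the EPUB."""
    # Imported here so the helpers below work without the third-party packages.
    import ebooklib
    from ebooklib import epub
    from bs4 import BeautifulSoup

    book = epub.read_epub(epub_path)
    texts = []
    for item in book.get_items_of_type(ebooklib.ITEM_DOCUMENT):
        soup = BeautifulSoup(item.get_content(), "html.parser")
        texts.append(soup.get_text())
    return texts


def join_chapters(chapter_texts):
    """Join non-empty chapter texts with a blank line between them."""
    return "\n\n".join(t.strip() for t in chapter_texts if t.strip())


def text_stats(text):
    """Return (character count, whitespace-delimited word count)."""
    return len(text), len(text.split())
```

With its default arguments, `get_text()` concatenates text nodes without adding separators, so adjacent block elements can run together. That may account for some of the missing line breaks noted in the limitations above.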

2. epub_indexer.py (Original Complex Version)

Purpose: Create AI-friendly index with structured content extraction

Features:

  • Parsed EPUB metadata and spine structure
  • Created searchable chunks (1000 words each)
  • Identified potential case studies using regex patterns
  • Generated JSON index with chapter summaries and case study contexts

Issues:

  • The chunking approach was overcomplicated
  • Pattern matching missed some case studies (such as BCG Digital Ventures)
  • Chunk boundaries split content mid-passage
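
To make the boundary problem concrete, here is a small sketch of fixed-size word chunking combined with a regex search. The pattern is hypothetical (the regexes actually used are not shown here), and the function names are illustrative.

```python
import re

# Hypothetical pattern; the original script's regexes are not shown.
CASE_STUDY_RE = re.compile(r"case study:\s+\S+", re.IGNORECASE)


def chunk_words(text, size=1000):
    """Split text into chunks of at most `size` whitespace-delimited words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def chunks_with_matches(chunks, pattern=CASE_STUDY_RE):
    """Return the indexes of the chunks in which the pattern matches."""
    return [i for i, chunk in enumerate(chunks) if pattern.search(chunk)]
```

If a heading such as "Case Study: Amazon" happens to straddle a chunk boundary, neither chunk contains the full phrase and the pattern matches in neither, even though it matches the unchunked text. Overlapping chunks reduce this risk, at the cost of more complexity.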

3. better_epub_extract.py (Final Solution)

Purpose: Extract EPUB with proper structure preservation

Key Improvements:

  • Added newlines before major structural elements (h1, h2, p, div tags)
  • Added separators between chapters/sections with source file tracking
  • Preserved case study markers with === formatting
  • Generated auto-TOC from detected headings and case studies
  • Better text cleaning while preserving intentional breaks

Success: This approach successfully extracted the complete BCG Digital Ventures case study content that previous methods missed.
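
The improvements above can be illustrated with a short sketch. It is not the actual better_epub_extract.py: it uses the standard library's `html.parser` in place of BeautifulSoup so that it runs without extra installs, and the separator format, the rule for marking case studies, and all names are assumptions.

```python
from html.parser import HTMLParser

BLOCK_TAGS = {"h1", "h2", "p", "div"}
HEADING_TAGS = {"h1", "h2"}


class StructuredText(HTMLParser):
    """Collect text, starting a new line before each block-level tag."""

    def __init__(self):
        super().__init__()
        self.parts = []        # text fragments and inserted newlines
        self.headings = []     # heading texts, usable as an auto-TOC
        self._heading = None   # fragments of the heading being read, if any

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.parts.append("\n")
        if tag in HEADING_TAGS:
            self._heading = []

    def handle_endtag(self, tag):
        if tag in HEADING_TAGS and self._heading is not None:
            self.headings.append(" ".join("".join(self._heading).split()))
            self._heading = None

    def handle_data(self, data):
        self.parts.append(data)
        if self._heading is not None:
            self._heading.append(data)


def mark_line(line):
    # Hypothetical marker rule: flag lines that open with "Case Study".
    if line.lower().startswith("case study"):
        return f"=== {line} ==="
    return line


def extract_section(html, source_name):
    """Return (text, headings) for one HTML file, led by a source separator."""
    parser = StructuredText()
    parser.feed(html)
    parser.close()
    # Collapse whitespace within each line but keep the inserted line breaks.
    lines = [" ".join(line.split()) for line in "".join(parser.parts).split("\n")]
    body = "\n".join(mark_line(line) for line in lines if line)
    separator = f"----- source: {source_name} -----"  # hypothetical format
    return f"{separator}\n{body}", parser.headings
```

Calling `extract_section` once per document item and concatenating the results gives a single text file in which every passage can be traced back to its source HTML file, and the collected headings can be written out as a table of contents.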

Virtual Environment Setup

python3 -m venv epub_env
source epub_env/bin/activate
pip install ebooklib beautifulsoup4 html2text
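
After installing, a quick check that the packages are importable can save a confusing traceback later. Note that the pip name and the import name differ for beautifulsoup4, which is imported as `bs4`. A small sketch, with a hypothetical helper name:

```python
from importlib.util import find_spec

# pip package name -> module name used in `import` statements
REQUIRED = {
    "ebooklib": "ebooklib",
    "beautifulsoup4": "bs4",
    "html2text": "html2text",
}


def missing_packages(required=REQUIRED):
    """Return the pip names whose import module cannot be found."""
    return [pip_name for pip_name, module in required.items()
            if find_spec(module) is None]
```

Running `print(missing_packages())` inside the activated environment should print an empty list if the install succeeded.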

Usage Examples

# Simple conversion
python epub_to_txt.py "input.epub" "output.txt"

# Structured extraction with better formatting
python better_epub_extract.py "input.epub" "structured_output.txt"

Files Generated

  • team_topologies.txt - Simple text conversion (65,244 words)
  • team_topologies_structured.txt - Better formatted version (67,120 words)
  • Team Topologies by Matthew Skelton, Manuel Pais_index.json - JSON index with metadata

Key Learnings

What Worked

  • The ebooklib + BeautifulSoup combination parsed the EPUB reliably
  • Structure preservation was crucial for finding specific content
  • Source tracking helped identify which HTML files contain relevant content
  • Simple text conversion is often better than complex chunking for targeted searches

What Didn't Work

  • Complex chunking algorithms overcomplicated the problem
  • Regex pattern matching alone was insufficient for finding all case studies
  • JSON indexing created more complexity than value for our use case

Best Practice Discovered

The most effective approach was:

  1. Simple EPUB → structured text conversion preserving formatting
  2. Standard text search tools (grep, etc.) on the clean text
  3. Manual verification of extracted content
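
For step 2, grep on the structured text is usually enough. When it helps to know which source file a hit came from, a few lines of Python can track the most recent separator while searching. This is a sketch only; the separator format in `SOURCE_RE` is an assumption and should be adjusted to whatever the extraction script actually writes.

```python
import re

# Hypothetical separator format, e.g. "----- source: chapter01.xhtml -----".
SOURCE_RE = re.compile(r"^-{5} source: (?P<name>.+) -{5}$")


def search_with_source(text, term):
    """Yield (source_name, line_number, line) for lines containing `term`.

    The match is case-insensitive; line numbers are 1-based.
    """
    current_source = None
    needle = term.lower()
    for number, line in enumerate(text.splitlines(), start=1):
        marker = SOURCE_RE.match(line)
        if marker:
            current_source = marker.group("name")
            continue
        if needle in line.lower():
            yield current_source, number, line
```

Each hit still needs the manual check from step 3, since a plain substring search will also match passing mentions.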

Case Studies Successfully Extracted

  • BCG Digital Ventures - Engineering Enablement Team (Robin Weston)
  • Sky Betting & Gaming - Platform Feature Teams (Michael Maibaum)
  • Amazon - Strictly Independent Service Teams
  • Auto Trader - Multiple case studies on office layout and operations
  • TransUnion - Team topology evolution (Parts 1 & 2)
  • And 15+ others across Parts 1 & 2 of the book

This toolset successfully converted a complex EPUB into searchable, structured content suitable for AI analysis and discussion.
