@lmmx
Created January 29, 2026 20:47
PDF transcription skill

SKILL: Document Transcription Pipeline

This guide describes how to transcribe a PDF document (book or paper) into a hierarchical modular directory tree of markdown files. Follow each step in order.

Overview

The pipeline produces:

  1. Split PDFs - one per top-level group (chapter/section), extracted with qpdf
  2. Transcript files - page-level markdown files with YAML frontmatter, named by page number
  3. Section files - assembled from transcripts, grouped into directories by top-level section
  4. Metadata - section_groupings.tsv, hierarchical_page_index.tsv, hierarchical_page_index.md

Step 0: Determine Document Type and Grouping Key

Identify the "high bit" - the top-level grouping unit:

| Document type | Group unit | Directory prefix | Example |
| --- | --- | --- | --- |
| Book | Chapter | chapter- | chapter-01/, chapter-07/ |
| Paper | Section | section- | section-00/, section-0A/ |

For papers, top-level sections (Abstract, Introduction, Related Work, etc.) are the groups. Subsections (2.1, 4.1.1) belong to their parent group's directory.
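The mapping from a hierarchical index to its group directory can be sketched as below. The function name `group_dir` and the rule of zero-padding the group identifier to two characters are assumptions inferred from the examples in the table (chapter-01/, section-0A/); they are not stated by the guide itself.

```python
PREFIXES = {"book": "chapter-", "paper": "section-"}


def group_dir(doc_type: str, index: str) -> str:
    """Return the directory name of the top-level group that owns `index`."""
    # The top-level group is the part before the first dot: "4.1.1" -> "4".
    top = index.split(".")[0]
    # Assumption: pad to two characters, so "2" -> "02" and "A" -> "0A".
    return f"{PREFIXES[doc_type]}{top.zfill(2)}"
```

Under these assumptions, `group_dir("paper", "2.1")` gives `section-02` and `group_dir("paper", "A")` gives `section-0A`, so a subsection always lands in its parent group's directory.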

Step 1: Create Directory Structure

data/{books|papers}/{slug}/
├── pdfs/                    # Split PDFs (one per group)
├── metadata/                # Index files
├── sections/                # Assembled section files (grouped by top-level)
│   ├── {group-dir}/        # e.g. chapter-01/ or section-02/
│   │   ├── s1_title.md
│   │   ├── s1-1_subtitle.md
│   │   └── ...
│   └── ...
├── transcripts/             # Page-level transcript files (grouped by top-level)
│   ├── {group-dir}/
│   │   ├── p005_0-of-2_title.md
│   │   └── ...
│   └── ...
├── summaries/               # Optional: AI summaries
└── section_groupings.tsv    # Master grouping metadata
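One possible way to create this layout is sketched below. The function name `create_skeleton` and its parameters are hypothetical; the directory names come from the tree above.

```python
from pathlib import Path

SUBDIRS = ("pdfs", "metadata", "sections", "transcripts", "summaries")


def create_skeleton(root: Path, kind: str, slug: str, group_dirs: list[str]) -> Path:
    """Create data/{kind}/{slug}/ with the subdirectories from the tree above."""
    if kind not in ("books", "papers"):
        raise ValueError(f"kind must be 'books' or 'papers', got {kind!r}")
    base = root / "data" / kind / slug
    for name in SUBDIRS:
        (base / name).mkdir(parents=True, exist_ok=True)
    # Sections and transcripts are both grouped by top-level unit.
    for group in group_dirs:
        (base / "sections" / group).mkdir(exist_ok=True)
        (base / "transcripts" / group).mkdir(exist_ok=True)
    return base
```

A call such as `create_skeleton(Path("."), "papers", "erst", ["section-00", "section-02"])` is safe to repeat, since existing directories are left untouched.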

Step 2: Build Hierarchical Page Index

Before doing any PDF splitting or transcription, read the full PDF and build the page index. This is the single source of truth for all subsequent steps.

Create metadata/hierarchical_page_index.tsv (tab-separated):

Index	Title	Page
0	Abstract	23
1	Introduction	23
2	Related Work	26
2.1	Rhetorical Structure Theory	27
  • Index: hierarchical section number (e.g., 3.2, 4.1.1, A, R)
  • Title: full section/chapter title
  • Page: page number where this section starts (use the document's own page numbers, not PDF page numbers)

Also create metadata/hierarchical_page_index.md as a human-readable markdown table version.
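A minimal sketch of deriving the markdown version from the TSV, so the two files cannot drift apart. The function names are hypothetical, and the sample holds two rows taken from the example above. Index values are kept as strings because they may be letters (A, R) or dotted numbers (2.1).

```python
import csv
import io


def read_page_index(tsv_text: str) -> list[dict[str, str]]:
    """Parse the tab-separated page index into a list of row dicts."""
    return list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))


def to_markdown_table(rows: list[dict[str, str]]) -> str:
    """Render the rows as a Markdown table with the same three columns."""
    lines = ["| Index | Title | Page |", "| --- | --- | --- |"]
    for row in rows:
        lines.append(f"| {row['Index']} | {row['Title']} | {row['Page']} |")
    return "\n".join(lines) + "\n"


SAMPLE = "Index\tTitle\tPage\n0\tAbstract\t23\n2.1\tRhetorical Structure Theory\t27\n"
rows = read_page_index(SAMPLE)
print(to_markdown_table(rows))
```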

Step 3: Split PDF with qpdf

Split the full PDF into one file per top-level group. Install qpdf if needed: apt-get install -y qpdf

Critical: You must compute the offset between document page numbers and PDF page numbers. For example, if journal page 23 = PDF page 1, the offset is 22.

PDF_PAGE = DOCUMENT_PAGE - OFFSET

Split command pattern:

qpdf full.pdf --pages full.pdf START-END -- output.pdf
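The offset arithmetic and the command pattern can be combined as in the sketch below. The helper names are hypothetical; the qpdf arguments follow the pattern shown above. The function only builds the argument list, so it can be inspected before anything is run.

```python
def pdf_page(document_page: int, offset: int) -> int:
    """Convert a document page number to a 1-based PDF page number."""
    page = document_page - offset
    if page < 1:
        raise ValueError(f"document page {document_page} precedes the PDF start")
    return page


def qpdf_split_args(src: str, dst: str, start_doc: int, end_doc: int, offset: int) -> list[str]:
    """Build the argument list for the split command pattern above."""
    page_range = f"{pdf_page(start_doc, offset)}-{pdf_page(end_doc, offset)}"
    return ["qpdf", src, "--pages", src, page_range, "--", dst]
```

With an offset of 22, document pages 26 to 32 become PDF pages 4 to 10. The resulting list can be passed to `subprocess.run(args, check=True)` once the range looks right.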

PDF naming convention

{slug}_{group_no}_{group-title-slug}.pdf

Examples:

  • erst_02_related-work.pdf (paper section 2)
  • ong_orality_and_literacy_01_the_orality_of_language.pdf (book chapter 1)
  • erst_0A_appendix-a.pdf (appendix with letter index)

Page ranges for top-level groups

Each group's PDF covers from its first page to its last page (which may overlap with adjacent groups when sections share boundary pages). Use the page index to determine:

  • Start page: where the group's first section begins
  • End page: where the group's last subsection ends (may equal start page of next group)

Overlapping pages between adjacent PDFs are acceptable and expected at section boundaries.
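As a rough sketch, the group ranges can be derived from the page index as follows. It assumes each group may run onto the page where the next group starts, which deliberately produces the overlap described above, and it takes the document's final page number as a separate argument because the index does not record it. The function name is hypothetical.

```python
def group_page_ranges(index: list[tuple[str, int]], last_page: int) -> dict[str, tuple[int, int]]:
    """Map each top-level group to (start_page, end_page) in document pages.

    `index` holds (Index, Page) pairs in document order.
    """
    starts: dict[str, int] = {}
    for idx, page in index:
        group = idx.split(".")[0]
        starts.setdefault(group, page)  # the first entry of a group sets its start
    groups = list(starts)
    ranges = {}
    for pos, group in enumerate(groups):
        # Assumption: a group may extend onto the page where the next one starts.
        end = starts[groups[pos + 1]] if pos + 1 < len(groups) else last_page
        ranges[group] = (starts[group], end)
    return ranges
```

The ranges are in document page numbers, so the offset still has to be subtracted before they are handed to qpdf.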

Step 4: Transcribe Page-by-Page

For each page in each group PDF, create a transcript file. This is the most token-intensive step - avoid re-doing it.

Transcript filename convention

p{PAGE}_{IDX}-of-{TOTAL}_{SECTION-SLUG}.md

| Component | Description | Format |
| --- | --- | --- |
| p{PAGE} | Document page number | Zero-padded to 3 digits (p023) |
| {IDX}-of-{TOTAL} | Part index within the page | 0-based index, 1-based total (0-of-2) |
| {SECTION-SLUG} | Section title slug | Lowercase, hyphen-separated |

Multi-part pages: When a page contains text from more than one section, split it into parts ordered by position on the page. For example, if page 32 has the end of section 2.6 and the start of section 3:

  • p032_0-of-2_multiple-frameworks.md (section 2.6, in section-02/)
  • p032_1-of-2_formalism.md (section 3, in section-03/)
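The filename convention can be expressed as a small helper, sketched below. The names `slugify` and `transcript_name` are hypothetical, and the slug rule (lowercase, runs of letters and digits joined by hyphens) is one reasonable reading of "lowercase, hyphen-separated".

```python
import re


def slugify(title: str) -> str:
    """Lowercase the title and join its alphanumeric runs with hyphens."""
    return "-".join(re.findall(r"[a-z0-9]+", title.lower()))


def transcript_name(page: int, idx: int, total: int, section_title: str) -> str:
    """Build p{PAGE}_{IDX}-of-{TOTAL}_{SECTION-SLUG}.md."""
    if not 0 <= idx < total:
        raise ValueError("idx is 0-based and must be smaller than total")
    return f"p{page:03d}_{idx}-of-{total}_{slugify(section_title)}.md"
```

For instance, `transcript_name(32, 1, 2, "Formalism")` yields `p032_1-of-2_formalism.md`, matching the second example above.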

YAML frontmatter for transcripts

For books:

---
chapter_no: 1
chapter_title: The orality of language
section_no: "1.1"
section_title: The literate mind and the oral past
page_no: 5
---

For papers:

---
section_no: 2
section_title: Related Work
subsection_no: "2.1"
subsection_title: Rhetorical Structure Theory
page_no: 27
---
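  • chapter_no (books) / section_no (papers): top-level group identifier
  • chapter_title (books) / section_title (papers): top-level group title
  • section_no (books) / subsection_no (papers): hierarchical number within the group (only when the page belongs to a subsection)
  • section_title (books) / subsection_title (papers): title of that subsection (only when the page belongs to a subsection)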
  • page_no: document page number
  • note: optional (e.g., "blank page")
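A minimal sketch of emitting such a block without a YAML library is shown below. The function name is hypothetical, and the quoting rule covers only numeric-looking strings such as "2.1"; titles containing YAML-significant characters (a colon followed by a space, or a leading #) would need quoting that this sketch does not attempt.

```python
def render_frontmatter(fields: dict[str, object]) -> str:
    """Render a flat mapping as a YAML frontmatter block.

    Strings that look numeric (such as "1.1") are double-quoted so that a
    YAML parser keeps them as strings, as in the examples above.
    """
    lines = ["---"]
    for key, value in fields.items():
        if isinstance(value, str) and value.replace(".", "").isdigit():
            value = f'"{value}"'
        lines.append(f"{key}: {value}")
    lines.append("---")
    return "\n".join(lines) + "\n"
```

Passing the fields of the paper example above in the same order reproduces that block line for line.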

Text formatting rules

  • Verbatim transcription preserving original spelling and punctuation
  • Italics: _text_
  • Bold: **text**
  • Headers: Markdown headers (#, ##, etc.) matching document hierarchy
  • Tables: Markdown tables reproducing data as faithfully as possible
  • Figures: Describe with [Figure N: caption] notation
  • Equations: Use inline LaTeX where practical

Step 5: Assemble Section Files

Section files in sections/{group-dir}/ are the body content from transcript files, concatenated per section, without YAML frontmatter.

Section filename convention

s{SECTION-NUM}_{TITLE-SLUG}.md

Examples:

  • s2_related-work.md (top-level section)
  • s2-1_rhetorical-structure-theory.md (subsection, using hyphens not dots)
  • s4-1-1_dm-orphan-and-secondary-edge-annotation.md (sub-subsection)
  • sA_appendix-a-relation-labels-in-gum.md (letter-indexed section)

Note: section numbers use hyphens (s2-1) not dots (s2.1) in filenames.
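Assembly and naming can be sketched as below. All function names are hypothetical. Joining page bodies with a blank line is an assumption; a sentence that runs across a page break would need manual rejoining.

```python
import re


def section_filename(section_no: str, title: str) -> str:
    """Build s{SECTION-NUM}_{TITLE-SLUG}.md with hyphens in place of dots."""
    slug = "-".join(re.findall(r"[a-z0-9]+", title.lower()))
    return f"s{section_no.replace('.', '-')}_{slug}.md"


def strip_frontmatter(text: str) -> str:
    """Return the body of a transcript, dropping a leading YAML block."""
    lines = text.splitlines(keepends=True)
    if lines and lines[0].strip() == "---":
        for pos in range(1, len(lines)):
            if lines[pos].strip() == "---":
                return "".join(lines[pos + 1:])
    return text  # no complete frontmatter block found


def assemble_section(transcripts: list[str]) -> str:
    """Concatenate transcript bodies in the given order, separated by a blank line."""
    bodies = [strip_frontmatter(t).strip("\n") for t in transcripts]
    return "\n\n".join(bodies) + "\n"
```

Only the leading block is removed, so a `---` horizontal rule inside the body survives assembly.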

Step 6: Build section_groupings.tsv

This is the master manifest. It maps every section to its group and source transcript files.

Columns (tab-separated):

| Column | Description | Example |
| --- | --- | --- |
| output_file | Relative path to section file | section-02/s2-1_rhetorical-structure-theory.md |
| section_no or chapter_no | Top-level group number | 2 |
| section_title or chapter_title | Top-level group title | Related Work |
| group_key | Hierarchical section ID | 2.1 |
| group_title | Full section/subsection title | Rhetorical Structure Theory |
| start_page | First page of this section | 27 |
| end_page | Last page of this section | 28 |
| source_files | Comma-separated transcript file paths | section-02/p027_0-of-1_rhetorical-structure-theory.md |

The source_files column is critical - it establishes the link from assembled sections back to their source transcripts.
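Writing the manifest might look like the sketch below, which uses the paper column names (for a book, `section_no` and `section_title` would become `chapter_no` and `chapter_title`). The function name is hypothetical, and the sample row reuses the example values from the column table.

```python
import csv
import io

COLUMNS = ["output_file", "section_no", "section_title", "group_key",
           "group_title", "start_page", "end_page", "source_files"]


def write_groupings(rows: list[dict[str, object]]) -> str:
    """Serialise manifest rows as tab-separated text with a header line."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=COLUMNS, delimiter="\t", lineterminator="\n")
    writer.writeheader()
    for row in rows:
        row = dict(row)
        # source_files is stored as one comma-separated cell.
        row["source_files"] = ",".join(row["source_files"])
        writer.writerow(row)
    return out.getvalue()


row = {
    "output_file": "section-02/s2-1_rhetorical-structure-theory.md",
    "section_no": 2,
    "section_title": "Related Work",
    "group_key": "2.1",
    "group_title": "Rhetorical Structure Theory",
    "start_page": 27,
    "end_page": 28,
    "source_files": ["section-02/p027_0-of-1_rhetorical-structure-theory.md"],
}
print(write_groupings([row]))
```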

Common Mistakes to Avoid

  1. Not splitting the PDF first - Always use qpdf to create per-group PDFs before transcribing
  2. Flat directory structure - Always create group subdirectories (section-02/, chapter-01/)
  3. Missing source_files column - The grouping TSV must include source_files to link sections to transcripts
  4. Section-level naming for transcripts - Transcripts use page-based naming (p027_...), not section naming (s2-1_...)
  5. Frontmatter on section files - Section files (in sections/) have NO YAML frontmatter; only transcript files (in transcripts/) have frontmatter
  6. Re-doing transcription - Transcription is the most expensive step (heavy token I/O). When reorganizing, use metadata + shell commands to rename/move files rather than re-reading the PDF
  7. Wrong page offset - Always verify: qpdf --show-npages file.pdf and cross-reference with the document's own page numbering
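For the offset check in item 7, a small sanity test along these lines may help. The function names are hypothetical; `count_pdf_pages` shells out to the `qpdf --show-npages` command mentioned above and therefore needs qpdf installed.

```python
import subprocess


def count_pdf_pages(path: str) -> int:
    """Return the page count reported by `qpdf --show-npages`."""
    result = subprocess.run(["qpdf", "--show-npages", path],
                            capture_output=True, text=True, check=True)
    return int(result.stdout.strip())


def offset_is_plausible(npages: int, first_doc_page: int, last_doc_page: int, offset: int) -> bool:
    """Check that the first and last document pages map inside the PDF."""
    return first_doc_page - offset >= 1 and last_doc_page - offset <= npages
```

This only catches offsets that push a page outside the file; an offset that is off by one in the middle of the range still has to be confirmed by opening a page and reading its printed number.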

Reorganization Without Re-Transcription

If the directory structure needs fixing after transcription is already done:

  1. Read existing metadata files (hierarchical_page_index.tsv, section_groupings.tsv) to get page mappings
  2. Use shell scripts to batch-rename and move files
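  3. Use awk to strip YAML frontmatter before rewriting it or assembling sections: awk 'c<2 && /^---$/{c++; next} c>=2' file.md (the c<2 guard stops the script from also deleting --- horizontal rules later in the body)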
  4. Split PDFs with qpdf using page ranges from the metadata
  5. Rebuild section_groupings.tsv with the correct column format

This avoids the costly PDF-reading step entirely.
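The batch move in step 2 can be sketched in Python as follows, assuming the transcripts were originally written flat into transcripts/ and that the source_files entries in section_groupings.tsv already carry the intended group directory. Both function names are hypothetical.

```python
from pathlib import Path


def plan_moves(transcripts_dir: Path, source_files: list[str]) -> list[tuple[Path, Path]]:
    """Pair each flat transcript with its grouped destination path.

    `source_files` holds relative paths such as "section-02/p027_....md",
    as listed in section_groupings.tsv.
    """
    moves = []
    for rel in source_files:
        src = transcripts_dir / Path(rel).name  # assumed flat location
        dst = transcripts_dir / rel             # grouped location
        if src.exists() and not dst.exists():
            moves.append((src, dst))
    return moves


def apply_moves(moves: list[tuple[Path, Path]]) -> None:
    """Create group directories as needed and rename the files."""
    for src, dst in moves:
        dst.parent.mkdir(parents=True, exist_ok=True)
        src.rename(dst)
```

Separating planning from applying lets the list of moves be reviewed first, and files that are already in place are skipped on a second run.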
