SKILL: Document Transcription Pipeline

This guide describes how to transcribe a PDF document (book or paper) into a hierarchical, modular directory tree of markdown files. Follow each step in order.

Overview

The pipeline produces:

Split PDFs - one per top-level group (chapter/section), extracted with qpdf
Transcript files - page-level markdown files with YAML frontmatter, named by page number
Section files - assembled from transcripts, grouped into directories by top-level section
Metadata - section_groupings.tsv, hierarchical_page_index.tsv, hierarchical_page_index.md

Step 0: Determine Document Type and Grouping Key

Identify the "high bit" - the top-level grouping unit:

Document type	Group unit	Directory prefix	Example
Book	Chapter	`chapter-`	`chapter-01/`, `chapter-07/`
Paper	Section	`section-`	`section-00/`, `section-0A/`

For papers, top-level sections (Abstract, Introduction, Related Work, etc.) are the groups. Subsections (2.1, 4.1.1) belong to their parent group's directory.
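
The directory naming in the table can be sketched as a small helper. The names `group_dir` and `top_level` are hypothetical, and padding the group index to two characters is an assumption inferred from the examples (chapter-01, section-00, section-0A) rather than a stated rule.

```python
# Hypothetical helpers (not part of the skill itself): derive a group
# directory name from the document type and the top-level group index.
# Two-character zero padding is inferred from the examples in the table.
PREFIXES = {"book": "chapter-", "paper": "section-"}


def top_level(index: str) -> str:
    """Map a hierarchical index such as '4.1.1' to its top-level group '4'."""
    return index.split(".")[0]


def group_dir(doc_type: str, group_no) -> str:
    """Return a directory name such as 'chapter-07/' or 'section-0A/'."""
    prefix = PREFIXES[doc_type]  # KeyError for an unknown document type
    return f"{prefix}{str(group_no).zfill(2)}/"
```

A subsection is filed under its parent by combining the two: `group_dir("paper", top_level("4.1.1"))` gives `section-04/`.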

Step 1: Create Directory Structure

data/{books|papers}/{slug}/
├── pdfs/                    # Split PDFs (one per group)
├── metadata/                # Index files
├── sections/                # Assembled section files (grouped by top-level)
│   ├── {group-dir}/        # e.g. chapter-01/ or section-02/
│   │   ├── s1_title.md
│   │   ├── s1-1_subtitle.md
│   │   └── ...
│   └── ...
├── transcripts/             # Page-level transcript files (grouped by top-level)
│   ├── {group-dir}/
│   │   ├── p005_0-of-2_title.md
│   │   └── ...
│   └── ...
├── summaries/               # Optional: AI summaries
└── section_groupings.tsv    # Master grouping metadata
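
A minimal sketch of creating this tree with `pathlib`, assuming the group directory names are already known. The function name `create_skeleton` and the scratch location are illustrative only; the slug `erst` is borrowed from the naming examples in this guide.

```python
import tempfile
from pathlib import Path


def create_skeleton(root: Path, group_dirs: list[str]) -> None:
    """Create the top-level folders plus one sub-folder per group."""
    for name in ("pdfs", "metadata", "summaries"):
        (root / name).mkdir(parents=True, exist_ok=True)
    # sections/ and transcripts/ are both grouped by top-level unit.
    for parent in ("sections", "transcripts"):
        for group in group_dirs:
            (root / parent / group).mkdir(parents=True, exist_ok=True)


# Usage: build the tree in a scratch directory rather than the real data/ folder.
scratch = Path(tempfile.mkdtemp())
root = scratch / "data" / "papers" / "erst"
create_skeleton(root, ["section-00", "section-02"])
```

Because `exist_ok=True` is set, re-running the function on an existing tree is harmless.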

Step 2: Build Hierarchical Page Index

Before doing any PDF splitting or transcription, read the full PDF and build the page index. This is the single source of truth for all subsequent steps.

Create metadata/hierarchical_page_index.tsv (tab-separated):

Index	Title	Page
0	Abstract	23
1	Introduction	23
2	Related Work	26
2.1	Rhetorical Structure Theory	27

Index: hierarchical section number (e.g., 3.2, 4.1.1, A, R)
Title: full section/chapter title
Page: page number where this section starts (use the document's own page numbers, not PDF page numbers)

Also create metadata/hierarchical_page_index.md as a human-readable markdown table version.
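
One way to keep the two index files in sync is to generate the markdown table from the TSV. The sketch below works on strings so it can be tried without touching disk; `read_page_index` and `to_markdown_table` are hypothetical names, and the exact table layout is an assumption.

```python
import csv
import io


def read_page_index(tsv_text: str) -> list[dict]:
    """Parse the tab-separated index into rows with an integer Page."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [
        {"Index": r["Index"], "Title": r["Title"], "Page": int(r["Page"])}
        for r in reader
    ]


def to_markdown_table(rows: list[dict]) -> str:
    """Render the rows as a plain markdown table."""
    lines = ["| Index | Title | Page |", "| --- | --- | --- |"]
    lines += [f"| {r['Index']} | {r['Title']} | {r['Page']} |" for r in rows]
    return "\n".join(lines) + "\n"


SAMPLE = (
    "Index\tTitle\tPage\n"
    "0\tAbstract\t23\n"
    "1\tIntroduction\t23\n"
    "2\tRelated Work\t26\n"
    "2.1\tRhetorical Structure Theory\t27\n"
)
rows = read_page_index(SAMPLE)
table = to_markdown_table(rows)
```

The Index column is kept as a string on purpose, since values such as `2.1`, `A`, or `R` are labels rather than numbers.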

Step 3: Split PDF with qpdf

Split the full PDF into one file per top-level group. Install qpdf if needed: apt-get install -y qpdf

Critical: You must compute the offset between document page numbers and PDF page numbers. For example, if journal page 23 = PDF page 1, the offset is 22.

PDF_PAGE = DOCUMENT_PAGE - OFFSET

Split command pattern:

qpdf full.pdf --pages full.pdf START-END -- output.pdf
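
The offset arithmetic and the command pattern can be combined in a short sketch. It only builds the argument list and does not run qpdf; `pdf_page` and `qpdf_split_args` are hypothetical names.

```python
def pdf_page(document_page: int, offset: int) -> int:
    """PDF_PAGE = DOCUMENT_PAGE - OFFSET, rejecting pages before the PDF starts."""
    page = document_page - offset
    if page < 1:
        raise ValueError(
            f"document page {document_page} maps to PDF page {page}; check the offset"
        )
    return page


def qpdf_split_args(full_pdf: str, output_pdf: str,
                    start_doc_page: int, end_doc_page: int,
                    offset: int) -> list[str]:
    """Build: qpdf full.pdf --pages full.pdf START-END -- output.pdf"""
    start = pdf_page(start_doc_page, offset)
    end = pdf_page(end_doc_page, offset)
    return ["qpdf", full_pdf, "--pages", full_pdf, f"{start}-{end}", "--", output_pdf]
```

With the example offset of 22, document pages 26 to 32 become the PDF range `4-10`. The returned list can be handed to `subprocess.run(args, check=True)` once qpdf is installed.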

PDF naming convention

{slug}_{group_no}_{group-title-slug}.pdf

Examples:

erst_02_related-work.pdf (paper section 2)
ong_orality_and_literacy_01_the_orality_of_language.pdf (book chapter 1)
erst_0A_appendix-a.pdf (appendix with letter index)
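
A sketch of the naming pattern, assuming hyphen-separated title slugs as in the `erst` examples (the book example above uses underscores instead, so treat the separator as a per-project choice). The helper names `slugify` and `pdf_name`, and the titles passed in, are illustrative.

```python
import re


def slugify(title: str) -> str:
    """Lowercase the title and collapse non-alphanumeric runs into hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")


def pdf_name(slug: str, group_no, group_title: str) -> str:
    """Return {slug}_{group_no}_{group-title-slug}.pdf with a 2-character group_no."""
    return f"{slug}_{str(group_no).zfill(2)}_{slugify(group_title)}.pdf"
```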

Page ranges for top-level groups

Each group's PDF covers from its first page to its last page (which may overlap with adjacent groups when sections share boundary pages). Use the page index to determine:

Start page: where the group's first section begins
End page: where the group's last subsection ends (may equal start page of next group)

Overlapping pages between adjacent PDFs are acceptable and expected at section boundaries.
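
The start and end rules can be sketched from the page index alone. This version assumes the index rows are in document order and takes the simplest conservative choice for the end page: the start page of the next group, so a shared boundary page lands in both PDFs. The function name `group_ranges` and the `last_page` parameter are hypothetical.

```python
def group_ranges(rows: list[tuple[str, int]], last_page: int) -> dict[str, tuple[int, int]]:
    """Map each top-level group to (start_page, end_page) in document pages.

    rows: (hierarchical index, start page) pairs in document order.
    last_page: final document page, used as the end of the last group.
    """
    starts: dict[str, int] = {}
    for index, page in rows:
        group = index.split(".")[0]
        starts.setdefault(group, page)  # the first entry of each group wins

    groups = list(starts)
    ranges = {}
    for i, group in enumerate(groups):
        # Assumption: a group runs up to the page where the next group starts.
        end = starts[groups[i + 1]] if i + 1 < len(groups) else last_page
        ranges[group] = (starts[group], end)
    return ranges
```

If the next group begins at the top of its page, the end page can be tightened by one after checking the PDF by eye.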

Step 4: Transcribe Page-by-Page

For each page in each group PDF, create a transcript file. This is the most token-intensive step - avoid re-doing it.

Transcript filename convention

p{PAGE}_{IDX}-of-{TOTAL}_{SECTION-SLUG}.md

Component	Description	Format
`p{PAGE}`	Document page number	Zero-padded to 3 digits (`p023`)
`{IDX}-of-{TOTAL}`	Part index within the page	0-based part index, then the total number of parts (`0-of-2` is the first of two parts)
`{SECTION-SLUG}`	Section title slug	Lowercase, hyphen-separated

Multi-part pages: When a page contains text from more than one section, split it into parts ordered by position on the page. For example, if page 32 has the end of section 2.6 and the start of section 3:

p032_0-of-2_multiple-frameworks.md (section 2.6, in section-02/)
p032_1-of-2_formalism.md (section 3, in section-03/)
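
The filename convention can be sketched as follows. The helper names are hypothetical, and the titles in the usage note are guesses reconstructed from the slugs above.

```python
import re


def slugify(title: str) -> str:
    """Lowercase, hyphen-separated slug."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")


def transcript_name(page: int, idx: int, total: int, section_title: str) -> str:
    """Return p{PAGE}_{IDX}-of-{TOTAL}_{SECTION-SLUG}.md.

    idx is 0-based, total is the number of parts on the page.
    """
    if not 0 <= idx < total:
        raise ValueError(f"part index {idx} is out of range for {total} part(s)")
    return f"p{page:03d}_{idx}-of-{total}_{slugify(section_title)}.md"
```

For the page 32 example, `transcript_name(32, 0, 2, "Multiple Frameworks")` and `transcript_name(32, 1, 2, "Formalism")` reproduce the two filenames listed above.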

YAML frontmatter for transcripts

For books:

---
chapter_no: 1
chapter_title: The orality of language
section_no: "1.1"
section_title: The literate mind and the oral past
page_no: 5
---

For papers:

---
section_no: 2
section_title: Related Work
subsection_no: "2.1"
subsection_title: Rhetorical Structure Theory
page_no: 27
---

section_no (papers) / chapter_no (books): top-level group identifier
section_title (papers) / chapter_title (books): top-level group title
subsection_no (papers) / section_no (books): hierarchical number of the nested section (only for subsections)
subsection_title (papers) / section_title (books): title of the nested section (only for subsections)
page_no: document page number
note: optional (e.g., "blank page")
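
A minimal sketch of emitting this frontmatter without a YAML library. The quoting rule is an assumption chosen to match the examples: integers are written bare, and strings that look numeric (such as "2.1") or contain YAML-sensitive characters are double-quoted. `yaml_scalar` and `render_frontmatter` are hypothetical names.

```python
import json


def yaml_scalar(value) -> str:
    """Format one value; quote strings YAML could misread."""
    if isinstance(value, int):
        return str(value)
    text = str(value)
    looks_numeric = text.replace(".", "").isdigit()
    has_special = any(ch in text for ch in ":#\"'") or text != text.strip()
    # json.dumps produces a double-quoted string with standard escapes.
    return json.dumps(text) if looks_numeric or has_special else text


def render_frontmatter(fields: dict) -> str:
    """Render the fields in insertion order between '---' delimiters."""
    lines = ["---"]
    lines += [f"{key}: {yaml_scalar(val)}" for key, val in fields.items() if val is not None]
    lines.append("---")
    return "\n".join(lines) + "\n"
```

Fields set to `None` (for instance a missing `note`) are skipped, so the same call works for top-level pages and subsection pages.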

Text formatting rules

Verbatim transcription preserving original spelling and punctuation
Italics: _text_
Bold: **text**
Headers: Markdown headers (#, ##, etc.) matching document hierarchy
Tables: Markdown tables reproducing data as faithfully as possible
Figures: Describe with [Figure N: caption] notation
Equations: Use inline LaTeX where practical

Step 5: Assemble Section Files

Each section file in sections/{group-dir}/ holds the body text of that section's transcript files, concatenated in page order, with the YAML frontmatter removed.
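
The assembly step can be sketched on plain strings. Joining page bodies with a blank line is an assumption; a paragraph that runs across a page break may need manual rejoining. The names `strip_frontmatter` and `assemble_section` are hypothetical.

```python
def strip_frontmatter(text: str) -> str:
    """Remove a leading '---' ... '---' block; leave everything else intact."""
    lines = text.splitlines(keepends=True)
    if not lines or lines[0].strip() != "---":
        return text  # no frontmatter at the top
    for i in range(1, len(lines)):
        if lines[i].strip() == "---":
            return "".join(lines[i + 1:])
    return text  # unterminated frontmatter: leave the file untouched


def assemble_section(transcripts: list[str]) -> str:
    """Concatenate transcript bodies, in the order given, without frontmatter."""
    bodies = [strip_frontmatter(t).strip("\n") for t in transcripts]
    return "\n\n".join(b for b in bodies if b) + "\n"
```

Only the first frontmatter block is removed, so a `---` horizontal rule later in the body survives.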

Section filename convention

s{SECTION-NUM}_{TITLE-SLUG}.md

Examples:

s2_related-work.md (top-level section)
s2-1_rhetorical-structure-theory.md (subsection, using hyphens not dots)
s4-1-1_dm-orphan-and-secondary-edge-annotation.md (sub-subsection)
sA_appendix-a-relation-labels-in-gum.md (letter-indexed section)

Note: section numbers use hyphens (s2-1) not dots (s2.1) in filenames.
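
A short sketch of the section filename rule. The helper names are hypothetical, and the appendix title in the usage note is a guess reconstructed from its slug.

```python
import re


def slugify(title: str) -> str:
    """Lowercase, hyphen-separated slug."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")


def section_name(section_no: str, title: str) -> str:
    """Return s{SECTION-NUM}_{TITLE-SLUG}.md with dots replaced by hyphens."""
    number = section_no.replace(".", "-")  # 2.1 -> 2-1; letters such as A keep their case
    return f"s{number}_{slugify(title)}.md"
```

For example, `section_name("A", "Appendix A: Relation Labels in GUM")` yields `sA_appendix-a-relation-labels-in-gum.md`.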

Step 6: Build section_groupings.tsv

This is the master manifest. It maps every section to its group and source transcript files.

Columns (tab-separated):

Column	Description	Example
`output_file`	Relative path to section file	`section-02/s2-1_rhetorical-structure-theory.md`
`section_no` or `chapter_no`	Top-level group number	`2`
`section_title` or `chapter_title`	Top-level group title	`Related Work`
`group_key`	Hierarchical section ID	`2.1`
`group_title`	Full section/subsection title	`Rhetorical Structure Theory`
`start_page`	First page of this section	`27`
`end_page`	Last page of this section	`28`
`source_files`	Comma-separated transcript file paths	`section-02/p027_0-of-1_rhetorical-structure-theory.md`

The source_files column is critical - it establishes the link from assembled sections back to their source transcripts.
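
A sketch of writing the manifest for a paper with Python's `csv` module. The column order follows the table above; `groupings_tsv` is a hypothetical name, and the second transcript path in the test data (page 28) is invented for illustration.

```python
import csv
import io

# Paper variant of the columns; a book would use chapter_no / chapter_title.
COLUMNS = [
    "output_file", "section_no", "section_title", "group_key",
    "group_title", "start_page", "end_page", "source_files",
]


def groupings_tsv(rows: list[dict]) -> str:
    """Serialize rows to TSV; source_files is a list joined with commas."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=COLUMNS, delimiter="\t", lineterminator="\n")
    writer.writeheader()
    for row in rows:
        out = dict(row)
        out["source_files"] = ",".join(row["source_files"])
        writer.writerow(out)
    return buf.getvalue()
```

Keeping `source_files` as a list until the last moment makes it harder to drop a transcript path by accident.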

Common Mistakes to Avoid

Not splitting the PDF first - Always use qpdf to create per-group PDFs before transcribing
Flat directory structure - Always create group subdirectories (section-02/, chapter-01/)
Missing source_files column - The grouping TSV must include source_files to link sections to transcripts
Section-level naming for transcripts - Transcripts use page-based naming (p027_...), not section naming (s2-1_...)
Frontmatter on section files - Section files (in sections/) have NO YAML frontmatter; only transcript files (in transcripts/) have frontmatter
Re-doing transcription - Transcription is the most expensive step (heavy token I/O). When reorganizing, use metadata + shell commands to rename/move files rather than re-reading the PDF
Wrong page offset - Always verify the page count with qpdf --show-npages file.pdf and cross-reference it with the document's own page numbering

Reorganization Without Re-Transcription

If the directory structure needs fixing after transcription is already done:

Read existing metadata files (hierarchical_page_index.tsv, section_groupings.tsv) to get page mappings
Use shell scripts to batch-rename and move files
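Use awk to strip/rewrite YAML frontmatter: awk 'c<2 && /^---$/{c++; next} c>=2' file.md (this drops the leading frontmatter block; the c<2 guard keeps any later --- lines in the body)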
Split PDFs with qpdf using page ranges from the metadata
Rebuild section_groupings.tsv with the correct column format

This avoids the costly PDF-reading step entirely.
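
As one illustration, a move plan can be derived from the source_files column alone. The sketch assumes the transcripts currently sit flat in transcripts/ and need to go into their group subdirectories; `plan_moves` is a hypothetical name, and the function only plans the moves without touching the filesystem.

```python
from pathlib import PurePosixPath


def plan_moves(source_files_column: list[str]) -> list[tuple[str, str]]:
    """Return (current flat name, target relative path) pairs, without duplicates.

    source_files_column: one comma-separated cell per manifest row.
    """
    moves = []
    seen = set()
    for cell in source_files_column:
        for rel in cell.split(","):
            rel = rel.strip()
            if not rel or rel in seen:
                continue  # a boundary page may be listed under several sections
            seen.add(rel)
            moves.append((PurePosixPath(rel).name, rel))
    return moves
```

Each pair can then be applied with `shutil.move` relative to transcripts/, after reviewing the plan.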
