This guide describes how to transcribe a PDF document (book or paper) into a hierarchical modular directory tree of markdown files. Follow each step in order.
The pipeline produces:
- Split PDFs - one per top-level group (chapter/section), extracted with `qpdf`
- Transcript files - page-level markdown files with YAML frontmatter, named by page number
- Section files - assembled from transcripts, grouped into directories by top-level section
- Metadata - `section_groupings.tsv`, `hierarchical_page_index.tsv`, `hierarchical_page_index.md`
Identify the "high bit" - the top-level grouping unit:
| Document type | Group unit | Directory prefix | Example |
|---|---|---|---|
| Book | Chapter | `chapter-` | `chapter-01/`, `chapter-07/` |
| Paper | Section | `section-` | `section-00/`, `section-0A/` |
For papers, top-level sections (Abstract, Introduction, Related Work, etc.) are the groups. Subsections (2.1, 4.1.1) belong to their parent group's directory.
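The mapping from a hierarchical index to its group directory can be sketched as follows. This is a minimal sketch, not part of the pipeline itself: the function name `group_dir` and the `doc_type` values are hypothetical, and the two-character zero padding is inferred from the examples in the table above (`chapter-01/`, `section-0A/`).

```python
def group_dir(index: str, doc_type: str) -> str:
    """Return the directory of the top-level group that owns `index`."""
    prefix = {"book": "chapter-", "paper": "section-"}[doc_type]
    top_level = index.split(".")[0]          # "4.1.1" -> "4", "A" -> "A"
    return f"{prefix}{top_level.zfill(2)}"   # "2" -> "02", "A" -> "0A"


print(group_dir("2.1", "paper"))
print(group_dir("A", "paper"))
print(group_dir("7", "book"))
```

Subsections such as `2.1` or `4.1.1` resolve to the same directory as their parent, which is the behaviour described above for papers.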
data/{books|papers}/{slug}/
├── pdfs/ # Split PDFs (one per group)
├── metadata/ # Index files
├── sections/ # Assembled section files (grouped by top-level)
│ ├── {group-dir}/ # e.g. chapter-01/ or section-02/
│ │ ├── s1_title.md
│ │ ├── s1-1_subtitle.md
│ │ └── ...
│ └── ...
├── transcripts/ # Page-level transcript files (grouped by top-level)
│ ├── {group-dir}/
│ │ ├── p005_0-of-2_title.md
│ │ └── ...
│ └── ...
├── summaries/ # Optional: AI summaries
└── section_groupings.tsv # Master grouping metadata
Before doing any PDF splitting or transcription, read the full PDF and build the page index. This is the single source of truth for all subsequent steps.
Create metadata/hierarchical_page_index.tsv (tab-separated):
Index Title Page
0 Abstract 23
1 Introduction 23
2 Related Work 26
2.1 Rhetorical Structure Theory 27
- Index: hierarchical section number (e.g., `3.2`, `4.1.1`, `A`, `R`)
- Title: full section/chapter title
- Page: page number where this section starts (use the document's own page numbers, not PDF page numbers)
Also create metadata/hierarchical_page_index.md as a human-readable markdown table version.
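One way to keep the two index files in sync is to generate the markdown table from the TSV rather than writing both by hand. The sketch below assumes the TSV has a header row and no pipe characters in titles; the function name `page_index_to_markdown` is hypothetical.

```python
import csv
import io


def page_index_to_markdown(tsv_text: str) -> str:
    """Render tab-separated page-index text as a markdown table."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = [row for row in reader if row]  # skip blank lines
    header, body = rows[0], rows[1:]
    lines = ["| " + " | ".join(header) + " |", "|" + "---|" * len(header)]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)


sample = "Index\tTitle\tPage\n0\tAbstract\t23\n2.1\tRhetorical Structure Theory\t27\n"
table = page_index_to_markdown(sample)
print(table)
```

In practice the input would be the contents of `metadata/hierarchical_page_index.tsv` and the result would be written to `metadata/hierarchical_page_index.md`.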
Split the full PDF into one file per top-level group. Install qpdf if needed: apt-get install -y qpdf
Critical: You must compute the offset between document page numbers and PDF page numbers. For example, if journal page 23 = PDF page 1, the offset is 22.
`PDF_PAGE = DOCUMENT_PAGE - OFFSET`

Split command pattern, where START and END are PDF page numbers:

`qpdf full.pdf --pages full.pdf START-END -- output.pdf`

Name each output file `{slug}_{group_no}_{group-title-slug}.pdf`.
Examples:
- `erst_02_related-work.pdf` (paper section 2)
- `ong_orality_and_literacy_01_the_orality_of_language.pdf` (book chapter 1)
- `erst_0A_appendix-a.pdf` (appendix with letter index)
Each group's PDF covers from its first page to its last page (which may overlap with adjacent groups when sections share boundary pages). Use the page index to determine:
- Start page: where the group's first section begins
- End page: where the group's last subsection ends (may equal start page of next group)
Overlapping pages between adjacent PDFs are acceptable and expected at section boundaries.
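The range and offset arithmetic above can be sketched in a few lines. This is an illustration under stated assumptions, not a definitive implementation: the function names are hypothetical, each group is assumed to end on the next group's start page (which produces the boundary overlap), and the final page `40` and the title `Formalism` are made-up sample values.

```python
def group_page_ranges(entries, last_page):
    """Return {top_level_index: (start_page, end_page)} in document page numbers.

    entries: (index, title, start_page) tuples in document order.
    last_page: final document page, used as the end of the last group.
    """
    starts = {}
    for index, _title, page in entries:
        # The first entry seen for a top-level index marks the group's start.
        starts.setdefault(index.split(".")[0], page)
    tops = list(starts)
    ranges = {}
    for i, top in enumerate(tops):
        # Assumption: a group ends on the next group's start page.
        end = starts[tops[i + 1]] if i + 1 < len(tops) else last_page
        ranges[top] = (starts[top], end)
    return ranges


def qpdf_split_command(full_pdf, output_pdf, start_page, end_page, offset):
    """Build the qpdf argument list for one group (inputs are document pages)."""
    start, end = start_page - offset, end_page - offset
    if start < 1 or end < start:
        raise ValueError(f"bad PDF page range {start}-{end}; check the offset")
    return ["qpdf", full_pdf, "--pages", full_pdf, f"{start}-{end}", "--", output_pdf]


entries = [
    ("0", "Abstract", 23),
    ("1", "Introduction", 23),
    ("2", "Related Work", 26),
    ("2.1", "Rhetorical Structure Theory", 27),
    ("3", "Formalism", 32),
]
ranges = group_page_ranges(entries, last_page=40)  # 40 is a made-up final page
start, end = ranges["2"]
command = qpdf_split_command("full.pdf", "erst_02_related-work.pdf", start, end, offset=22)
print(" ".join(command))
```

The returned list can be passed to `subprocess.run(command, check=True)`. Building the argument list separately from running it makes the computed ranges easy to inspect before any file is written.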
For each page in each group PDF, create a transcript file. This is the most token-intensive step - avoid re-doing it.
p{PAGE}_{IDX}-of-{TOTAL}_{SECTION-SLUG}.md
| Component | Description | Format |
|---|---|---|
| `p{PAGE}` | Document page number | Zero-padded to 3 digits (`p023`) |
| `{IDX}-of-{TOTAL}` | Part index within the page | 0-based index, 1-based total (`0-of-2`) |
| `{SECTION-SLUG}` | Section title slug | Lowercase, hyphen-separated |
Multi-part pages: When a page contains content from more than one section, split it into parts ordered by position on the page. For example, if page 32 has the end of section 2.6 and the start of section 3:
- `p032_0-of-2_multiple-frameworks.md` (section 2.6, in `section-02/`)
- `p032_1-of-2_formalism.md` (section 3, in `section-03/`)
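The naming pattern can be expressed as a small helper. This is a sketch: `slugify` and `transcript_filename` are hypothetical names, the slug rule (lowercase, runs of non-alphanumerics collapsed to one hyphen) is one reasonable reading of "lowercase, hyphen-separated", and the section titles in the demo are inferred from the example slugs.

```python
import re


def slugify(title: str) -> str:
    """Lowercase the title and join its alphanumeric runs with hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")


def transcript_filename(page: int, idx: int, total: int, section_title: str) -> str:
    """Build p{PAGE}_{IDX}-of-{TOTAL}_{SECTION-SLUG}.md (0-based idx, 1-based total)."""
    if not 0 <= idx < total:
        raise ValueError(f"part index {idx} out of range for {total} part(s)")
    return f"p{page:03d}_{idx}-of-{total}_{slugify(section_title)}.md"


print(transcript_filename(32, 0, 2, "Multiple Frameworks"))
print(transcript_filename(32, 1, 2, "Formalism"))
```

A page with a single section uses `0-of-1`, as in `p027_0-of-1_rhetorical-structure-theory.md`.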
For books:
---
chapter_no: 1
chapter_title: The orality of language
section_no: "1.1"
section_title: The literate mind and the oral past
page_no: 5
---

For papers:
---
section_no: 2
section_title: Related Work
subsection_no: "2.1"
subsection_title: Rhetorical Structure Theory
page_no: 27
---

Frontmatter fields (paper key / book key):

- `section_no` / `chapter_no`: top-level group identifier
- `section_title` / `chapter_title`: top-level group title
- `subsection_no` / `section_no`: hierarchical number (only for subsections)
- `subsection_title` / `section_title`: subsection title (only for subsections)
- `page_no`: document page number
- `note`: optional (e.g., "blank page")
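When checking transcripts later, the frontmatter can be read back without a YAML library, as long as it stays as flat as the examples above. The sketch below makes that assumption explicitly (one `key: value` pair per line, no nesting); `parse_frontmatter` is a hypothetical name, and all values are returned as strings.

```python
def parse_frontmatter(text: str):
    """Split a transcript into (fields, body); fields is empty if no frontmatter."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text
    fields = {}
    for i, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":
            # Closing delimiter: everything after it is the body.
            return fields, "\n".join(lines[i + 1:])
        key, sep, value = line.partition(":")
        if sep:
            # Assumption: flat scalars only; surrounding double quotes are dropped.
            fields[key.strip()] = value.strip().strip('"')
    return {}, text  # no closing delimiter, so treat the text as body only


sample = (
    "---\n"
    "section_no: 2\n"
    "section_title: Related Work\n"
    'subsection_no: "2.1"\n'
    "subsection_title: Rhetorical Structure Theory\n"
    "page_no: 27\n"
    "---\n"
    "Body text.\n"
)
fields, body = parse_frontmatter(sample)
print(fields["subsection_no"], fields["page_no"], body)
```

For anything beyond flat key-value pairs, a real YAML parser is the safer choice.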
- Verbatim transcription preserving original spelling and punctuation
- Italics: `_text_`
- Bold: `**text**`
- Headers: Markdown headers (`#`, `##`, etc.) matching document hierarchy
- Tables: Markdown tables reproducing data as faithfully as possible
- Figures: Describe with `[Figure N: caption]` notation
- Equations: Use inline LaTeX where practical
Section files in sections/{group-dir}/ are the body content from transcript files, concatenated per section, without YAML frontmatter.
s{SECTION-NUM}_{TITLE-SLUG}.md
Examples:
- `s2_related-work.md` (top-level section)
- `s2-1_rhetorical-structure-theory.md` (subsection, using hyphens not dots)
- `s4-1-1_dm-orphan-and-secondary-edge-annotation.md` (sub-subsection)
- `sA_appendix-a-relation-labels-in-gum.md` (letter-indexed section)
Note: section numbers use hyphens (s2-1) not dots (s2.1) in filenames.
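A minimal sketch of the section-file naming rule follows. `section_filename` is a hypothetical helper, and the slug rule (lowercase, non-alphanumeric runs collapsed to one hyphen) is an assumption consistent with the examples above.

```python
import re


def section_filename(index: str, title: str) -> str:
    """Build s{SECTION-NUM}_{TITLE-SLUG}.md, replacing dots in the index with hyphens."""
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return f"s{index.replace('.', '-')}_{slug}.md"


print(section_filename("2", "Related Work"))
print(section_filename("2.1", "Rhetorical Structure Theory"))
```

Letter indexes pass through unchanged, so an index of `A` yields a name starting with `sA_`.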
`section_groupings.tsv` is the master manifest. It maps every section to its group and source transcript files.
Columns (tab-separated):
| Column | Description | Example |
|---|---|---|
| `output_file` | Relative path to section file | `section-02/s2-1_rhetorical-structure-theory.md` |
| `section_no` or `chapter_no` | Top-level group number | `2` |
| `section_title` or `chapter_title` | Top-level group title | Related Work |
| `group_key` | Hierarchical section ID | `2.1` |
| `group_title` | Full section/subsection title | Rhetorical Structure Theory |
| `start_page` | First page of this section | `27` |
| `end_page` | Last page of this section | `28` |
| `source_files` | Comma-separated transcript file paths | `section-02/p027_0-of-1_rhetorical-structure-theory.md` |
The source_files column is critical - it establishes the link from assembled sections back to their source transcripts.
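Because each manifest row names its source transcripts, a section file can be rebuilt from the row alone. The sketch below shows one way to do that; the function names are hypothetical, the row is a plain dict keyed by the column names above, and joining consecutive transcripts with a blank line is an assumption (a paragraph that spans a page break would need manual rejoining).

```python
import tempfile
from pathlib import Path


def strip_frontmatter(text: str) -> str:
    """Drop a leading '---' ... '---' block; return the text unchanged if there is none."""
    lines = text.splitlines(keepends=True)
    if lines and lines[0].strip() == "---":
        for i in range(1, len(lines)):
            if lines[i].strip() == "---":
                return "".join(lines[i + 1:])
    return text


def assemble_section(row: dict, transcripts_root: Path, sections_root: Path) -> Path:
    """Concatenate the transcripts in row['source_files'] into row['output_file']."""
    bodies = []
    for rel in row["source_files"].split(","):
        text = (transcripts_root / rel.strip()).read_text(encoding="utf-8")
        bodies.append(strip_frontmatter(text).strip("\n"))
    out = sections_root / row["output_file"]
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text("\n\n".join(bodies) + "\n", encoding="utf-8")
    return out


# Demo in a throwaway directory so no real files are touched.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    src = root / "transcripts" / "section-02" / "p027_0-of-1_rhetorical-structure-theory.md"
    src.parent.mkdir(parents=True)
    src.write_text(
        "---\nsection_no: 2\npage_no: 27\n---\n"
        "## 2.1 Rhetorical Structure Theory\n\nBody text.\n",
        encoding="utf-8",
    )
    row = {
        "output_file": "section-02/s2-1_rhetorical-structure-theory.md",
        "source_files": "section-02/p027_0-of-1_rhetorical-structure-theory.md",
    }
    out = assemble_section(row, root / "transcripts", root / "sections")
    assembled = out.read_text(encoding="utf-8")
print(assembled)
```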
Common mistakes to avoid:

- Not splitting the PDF first - Always use qpdf to create per-group PDFs before transcribing
- Flat directory structure - Always create group subdirectories (`section-02/`, `chapter-01/`)
- Missing source_files column - The grouping TSV must include `source_files` to link sections to transcripts
- Section-level naming for transcripts - Transcripts use page-based naming (`p027_...`), not section naming (`s2-1_...`)
- Frontmatter on section files - Section files (in `sections/`) have NO YAML frontmatter; only transcript files (in `transcripts/`) have frontmatter
- Re-doing transcription - Transcription is the most expensive step (heavy token I/O). When reorganizing, use metadata + shell commands to rename/move files rather than re-reading the PDF
- Wrong page offset - Always verify: run `qpdf --show-npages file.pdf` and cross-reference with the document's own page numbering
If the directory structure needs fixing after transcription is already done:
- Read existing metadata files (`hierarchical_page_index.tsv`, `section_groupings.tsv`) to get page mappings
- Use shell scripts to batch-rename and move files
- Use `awk` to strip/rewrite YAML frontmatter; for example, to strip it: `awk 'c<2 && /^---$/{c++; next} c>=2' file.md` (the `c<2` guard keeps any later `---` horizontal rules in the body)
- Split PDFs with qpdf using page ranges from the metadata
- Rebuild `section_groupings.tsv` with the correct column format
This avoids the costly PDF-reading step entirely.
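As one possible way to plan the batch move before touching any files, the grouped paths recorded in `source_files` can be matched against the names currently sitting in a flat directory. This is a dry-run sketch only: `plan_moves` is a hypothetical helper, and it assumes transcript basenames are unique across groups.

```python
from pathlib import PurePosixPath


def plan_moves(source_files_cells, existing_names):
    """Return (current_name, grouped_relative_path) pairs for files that need moving.

    source_files_cells: values of the source_files column, one per manifest row.
    existing_names: basenames currently present in a flat transcripts directory.
    """
    wanted = {}
    for cell in source_files_cells:
        for rel in cell.split(","):
            rel = rel.strip()
            if rel:
                wanted[PurePosixPath(rel).name] = rel
    # Files not mentioned in the manifest are left alone.
    return [(name, wanted[name]) for name in sorted(existing_names) if name in wanted]


moves = plan_moves(
    ["section-02/p027_0-of-1_rhetorical-structure-theory.md"],
    ["p027_0-of-1_rhetorical-structure-theory.md", "notes.txt"],
)
for current, target in moves:
    print(f"{current} -> {target}")
```

Printing the plan first makes it easy to review before running the actual rename.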