Design doc for enriching sources with full content via targeted crawling.
BLZ (pronounced "blaze") is a local-first search cache for llms.txt documentation. It keeps documentation local, searches it in milliseconds (P50 ≈ 6ms), and returns grounded spans with exact line citations.
AI agents need fast, reliable access to documentation. Traditional approaches have problems:
- Web search: Slow, noisy, can't guarantee freshness
- RAG/embeddings: Semantic drift, hallucination-prone citations
- Page-level fetching: Wastes tokens, no granular retrieval
BLZ solves this by:
- Caching documentation locally — One-time fetch, instant access
- Full-text search with BM25 — Deterministic, reproducible results
- Line-level citations — `source:1234-1256` points to exact content
- Progressive retrieval — Search → cite → expand as needed
blz add bun https://bun.sh/llms.txt # Add a source
blz query "test runner" # Search across sources
blz get bun:1234-1256 -C 5 # Retrieve cited lines with context
blz map bun --tree # Browse documentation structure
blz list # List all sources
blz sync bun # Refresh from upstream

BLZ exposes both CLI and MCP interfaces. Agents typically:
- Search for relevant docs: `blz query "authentication middleware"`
- Get citations from results: `bun:41994-42009`
- Retrieve with context: `blz get bun:41994-42009 -C 10`
- Expand if needed: `blz get bun:41994-42009 --context all`
This workflow minimizes token usage while maintaining grounded, verifiable answers.
BLZ sources have two complementary layers:
┌─────────────────────────────────────────────────────────┐
│ BLZ Source │
├─────────────────────────────────────────────────────────┤
│ INDEX LAYER │
│ ├── Curated titles & descriptions │
│ ├── URL → line range mapping │
│ ├── Maintainer-intended structure │
│ └── Semantic routing for queries │
├─────────────────────────────────────────────────────────┤
│ CONTENT LAYER │
│ ├── Full searchable text │
│ ├── Actual heading structure │
│ ├── Line-level citations │
│ └── Context expansion │
├─────────────────────────────────────────────────────────┤
│ SEARCH INDEX (Tantivy) │
│ └── BM25 search across content │
└─────────────────────────────────────────────────────────┘
The index is a curated manifest of documentation entry points. It contains:
- Titles — Human-readable names chosen by maintainers
- Descriptions — Brief summaries of what each section covers
- URLs — Links to the actual documentation pages
- Structure — Logical groupings (Guides, API Reference, etc.)
This comes from llms.txt files, which are lightweight indexes that many sites already provide.
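For illustration, a minimal llms.txt index might look like this (the project name and entries below are hypothetical):

```markdown
# Example Framework

> A small web framework. This file lists the canonical documentation entry points.

## Docs

- [Getting Started](https://example.com/docs/getting-started): Install and build a first app
- [Routing](https://example.com/docs/routing): URL patterns and handlers

## API Reference

- [Context](https://example.com/api/context): Request/response context object
```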
The content is the full searchable documentation. It contains:
- Complete text — Every word, searchable
- Heading structure — The actual document hierarchy
- Line numbers — For precise citations
- Body content — Details, examples, code snippets
This comes from llms-full.txt files (if available) or is generated by crawling the URLs from the index.
| Capability | Index | Content | Combined |
|---|---|---|---|
| Know what docs exist | ✓ | | ✓ |
| Search full text | | ✓ | ✓ |
| Curated titles/descriptions | ✓ | | ✓ |
| Line-level citations | | ✓ | ✓ |
| Semantic routing | ✓ | | ✓ |
| Context expansion | | ✓ | ✓ |
A source with only index can tell you what documentation exists and where to find it. A source with only content can be searched but lacks curated metadata. A source with both provides the best experience: curated entry points with full searchability.
blz list
# hono [index + content] https://hono.dev/llms.txt
# clerk [index only] https://clerk.com/llms.txt
# internal [content only] /path/to/docs.md
# react    [crawling 67/234] https://react.dev

The index and content layers can drift out of sync:
- Index stale, content fresh: The llms.txt hasn't been updated but we crawled new pages
- Content stale, index fresh: The llms.txt was updated with new entries we haven't crawled
- Both stale: Neither reflects current site state
Track freshness per layer:
{
"alias": "hono",
"index": {
"url": "https://hono.dev/llms.txt",
"fetchedAt": "2026-01-15T10:00:00Z",
"etag": "abc123",
"entryCount": 24
},
"content": {
"source": "crawled",
"generatedAt": "2026-01-20T14:30:00Z",
"pageCount": 24,
"crawlJobId": "blz_xyz789"
},
"reconciliation": {
"lastChecked": "2026-01-25T08:00:00Z",
"indexEntriesWithContent": 24,
"indexEntriesMissingContent": 0,
"contentPagesNotInIndex": 0,
"status": "synced"
}
}

Reconciliation on sync:
async fn reconcile_index_content(source: &mut Source) -> Result<ReconciliationResult> {
// Fetch latest index
let fresh_index = fetch_index(&source.index.url).await?;
// Compare with current content
let index_urls: HashSet<_> = fresh_index.entries.iter()
.map(|e| &e.url)
.collect();
let content_urls: HashSet<_> = source.content.pages.keys().collect();
let missing_content: Vec<_> = index_urls.difference(&content_urls).collect();
let orphaned_content: Vec<_> = content_urls.difference(&index_urls).collect();
Ok(ReconciliationResult {
index_entries_missing_content: missing_content.len(),
content_pages_not_in_index: orphaned_content.len(),
suggested_action: if missing_content.is_empty() && orphaned_content.is_empty() {
SyncAction::None
} else if !missing_content.is_empty() {
SyncAction::CrawlMissing(missing_content)
} else {
SyncAction::ReviewOrphans(orphaned_content)
}
})
}

User-facing check:
blz check hono
hono [index + content]
Index: 26 entries (fetched 10 days ago)
Content: 24 pages (crawled 5 days ago)
⚠ 2 index entries missing content:
- /docs/new-feature (added to index recently)
- /api/streaming (added to index recently)
Recommendation: blz sync hono

Many documentation sites provide an index (llms.txt) but not full content (llms-full.txt). Currently:
- Index-only sources can route you to URLs but can't be searched
- Users must wait for maintainers to add full content, or
- Manually download and format documentation
The index tells us exactly which URLs contain documentation. Instead of blindly crawling a site, we:
- Parse the index — Extract URLs, titles, descriptions
- Crawl targeted URLs — Only the pages in the index (not blog/marketing)
- Preserve metadata — Link index entries to content line ranges
- Enable full search — Content layer becomes searchable
| Scenario | Strategy |
|---|---|
| Has llms-full.txt | Just fetch it (no crawling needed) |
| Has llms.txt only | Parse index → crawl listed URLs → assemble content |
| Has neither | Agent-managed discovery → propose crawl plan → execute |
When llms.txt exists but is sparse or outdated, index-guided crawling misses important pages. We need secondary discovery strategies.
Detection: An index is considered sparse if:
- Entry count < 10 for a large site (many more pages visible)
- Index hasn't been updated in 6+ months
- Known URL patterns (e.g., `/docs/*`) aren't represented
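A minimal sketch of that check, assuming the criteria above and a page count obtained from the sitemap (the exact thresholds are illustrative, not tuned values):

```rust
use chrono::{DateTime, Duration, Utc};

/// Heuristic sparseness check; thresholds mirror the detection criteria above.
fn index_is_sparse(
    entry_count: usize,
    index_last_updated: DateTime<Utc>,
    sitemap_doc_pages: usize,
) -> bool {
    let stale = Utc::now() - index_last_updated > Duration::days(180); // ~6 months
    let tiny_for_site = entry_count < 10 && sitemap_doc_pages > entry_count * 3;
    stale || tiny_for_site
}
```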
Fallback strategies (tiered):
┌─────────────────────────────────────────────────────────────┐
│ 1. Sitemap.xml (preferred) │
│ └─ Fetch /sitemap.xml, extract doc URLs │
│ └─ Filter to doc paths (/docs/*, /api/*, /guides/*) │
│ └─ Merge with index, dedupe │
├─────────────────────────────────────────────────────────────┤
│ 2. Limited BFS from doc root (if sitemap unavailable) │
│ └─ Start from known doc root (e.g., /docs/) │
│ └─ Crawl breadth-first with depth limit (default: 3) │
│ └─ Filter to same path prefix │
│ └─ Cap total pages (default: 100 beyond index) │
├─────────────────────────────────────────────────────────────┤
│ 3. Firecrawl map (discovery mode) │
│ └─ Use firecrawl_map to discover all site URLs │
│ └─ Apply heuristics to identify doc pages │
│ └─ Propose to user for approval │
└─────────────────────────────────────────────────────────────┘
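These tiers all rely on a doc-path heuristic to filter discovered URLs. A minimal sketch of that helper (it matches the `looks_like_doc_url` call in the implementation below; the prefix lists are assumptions):

```rust
/// Rough heuristic for identifying documentation URLs by path prefix.
fn looks_like_doc_url(url: &str) -> bool {
    const DOC_PREFIXES: &[&str] = &["/docs/", "/documentation/", "/guide/", "/guides/", "/api/", "/reference/"];
    const SKIP_PREFIXES: &[&str] = &["/blog/", "/news/", "/press/", "/pricing", "/careers", "/about"];

    let path = url::Url::parse(url)
        .map(|u| u.path().to_string())
        .unwrap_or_default();
    !SKIP_PREFIXES.iter().any(|p| path.starts_with(p))
        && DOC_PREFIXES.iter().any(|p| path.starts_with(p))
}
```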
Implementation:
async fn expand_sparse_index(
index: &IndexLayer,
site_url: &str,
config: &ExpansionConfig,
) -> Result<Vec<String>> {
let index_urls: HashSet<_> = index.entries.iter().map(|e| &e.url).collect();
let mut discovered = Vec::new();
// Try sitemap first (free, structured)
if let Ok(sitemap_urls) = fetch_sitemap_urls(site_url).await {
let doc_urls: Vec<_> = sitemap_urls
.into_iter()
.filter(|url| looks_like_doc_url(url))
.filter(|url| !index_urls.contains(url))
.take(config.max_expansion)
.collect();
if !doc_urls.is_empty() {
discovered.extend(doc_urls);
return Ok(discovered);
}
}
// Fallback: limited BFS from doc root
if let Some(doc_root) = detect_doc_root(index) {
let bfs_urls = bfs_discover(
&doc_root,
config.max_depth,
config.max_expansion,
|url| !index_urls.contains(url) && url.starts_with(&doc_root),
).await?;
discovered.extend(bfs_urls);
}
Ok(discovered)
}

User prompt for expansion:
blz add clerk https://clerk.com/llms.txt
Fetching index... ✓
Found 8 entries in llms.txt
⚠ Index appears sparse for this site
Sitemap shows 45 additional doc pages not in index
Expand beyond index? [Y/n/customize]
Y: Crawl all 53 pages (8 from index + 45 from sitemap)
n: Crawl only 8 pages from index
customize: Select which additional pages to include

| BLZ Component | Crawl Integration |
|---|---|
| `blz add` | Detects index-only sources, offers to generate content |
| `blz crawl` | Start/manage crawl jobs (index-guided or discovery) |
| `blz query` | Index entries boost search ranking |
| `blz map` | Content structure + index annotations |
| `blz sync` | Cost-optimized updates via index diffing |
| MCP | blz_crawl tool for agent orchestration |
| Skills | /crawl for agent-managed discovery |
User: blz add hono https://hono.dev/llms.txt
↓
1. Fetch llms.txt (index)
2. Parse: found 24 documentation URLs
3. Check: llms-full.txt not available
↓
Prompt: "Found index with 24 doc URLs. Generate full content? [Y/n]"
↓
4. Crawl exactly those 24 URLs
5. Assemble content, preserving order from index
6. Link index entries → content line ranges
↓
Result: Source with index + content layers
Curated metadata + full searchability
Agent: "I need the Hono framework docs"
↓
Check: Does hono source exist in blz?
↓
No → /crawl https://hono.dev
Agent probes site, finds /docs/* is the doc root
Proposes: "Crawl 156 pages from /docs/*?"
User approves
Crawl starts, pages indexed progressively
↓
Yes → blz query "middleware" --source hono
Return cached results immediately
The index serves as a semantic routing table. When querying:
blz query "middleware"

The flow:
1. Check index entries first (fast, high-signal)
→ Match: "Middleware" with description "Built-in and custom middleware"
→ This entry points to lines 2341-2567 in content
2. Search content layer (comprehensive)
→ Additional matches from body text
3. Results ranked:
→ Index matches get a boost (maintainer said this is THE middleware doc)
→ Content matches follow
# Line-based citation (always works)
blz get hono:2341-2567
# Entry-based citation (resolves via index)
blz get hono:middleware
blz get hono:"Getting Started"

The index provides stable, human-readable references that map to line ranges.
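A sketch of how entry-based references could resolve through the index layer. The numeric form is parsed directly; the entry form matches curated index titles. The `content_lines` field on index entries is an assumption standing in for the index-entry → line-range link described earlier:

```rust
/// Resolve "2341-2567" or "middleware" / "Getting Started" to a content line range.
fn resolve_citation(index: &IndexLayer, reference: &str) -> Option<(usize, usize)> {
    // Numeric form: "2341-2567"
    if let Some((start, end)) = reference.split_once('-') {
        if let (Ok(s), Ok(e)) = (start.parse::<usize>(), end.parse::<usize>()) {
            return Some((s, e));
        }
    }
    // Entry form: match curated index titles case-insensitively
    index
        .entries
        .iter()
        .find(|e| e.title.eq_ignore_ascii_case(reference))
        .map(|e| (e.content_lines.0, e.content_lines.1))
}
```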
blz query "auth" --index-only
# Searches just index titles/descriptions across all sources
# Fast, high-signal, good for "where should I look?"
Results:
clerk:auth "Authentication and user management"
supabase:auth "Row-level security and auth"
hono:middleware "Built-in and custom middleware" (weak match)

The `blz map` command uses content structure as primary, with index annotations:
blz map hono --tree
# Content structure (primary) + index annotations
Hono
├── Getting Started ★ "Quick start guide"
│ ├── Installation
│ └── First App
├── Core Concepts
│ ├── Routing ★ "URL pattern matching"
│ ├── Context ★ "Request/response context"
│ └── Error Handling
└── Advanced
    └── Middleware ★ "Built-in and custom"

The ★ indicates sections featured in the index. Descriptions come from index metadata.
Many documentation sites don't provide full content files. Users need a way to generate searchable content from existing doc sites, ideally guided by the index when available.
- Index-guided crawling — Use index entries as a manifest for targeted Firecrawl fetching
- Firecrawl integration — API (default) or self-hosted for speed/cost optimization
- Progressive loading — Index content as pages arrive, prioritize specific paths
- Metadata preservation — Link index entries to content line ranges
- Agent-assisted discovery — Smart identification when no index exists
- Resilient execution — Handle flaky networks, rate limits, resumable jobs
- Cost transparency — Show credit estimates before crawling
- Native Rust HTML extraction (Firecrawl handles this well)
- Automatic sync scheduling (manual `blz sync` for now)
- Index generation from content (content → index inference)
A dumb blz crawl <url> command has problems:
- Pulls in entire site (marketing, blog, careers, etc.)
- No intelligence about doc structure
- Can't adapt mid-crawl based on what's found
An agent with a skill can:
- Probe the site first to find the docs root
- Analyze URL patterns to identify docs vs non-docs
- Propose a crawl plan for user approval
- Adapt based on early results
/crawl https://example.com
1. DISCOVER
│
├─ Check for existing llms.txt / llms-full.txt
│ └─ If found → suggest using blz add instead
│
├─ Map site URLs (firecrawl_map)
│ └─ Get list of all discoverable paths
│
└─ Analyze URL patterns
├─ /docs/*, /documentation/*, /guide/* → likely docs
├─ /blog/*, /news/*, /press/* → skip
├─ /pricing, /careers, /about → skip
└─ /api/*, /reference/* → likely API docs
2. PROPOSE
│
└─ Present crawl plan to user:
"I found 847 URLs on example.com. Here's my analysis:
📚 Docs (/docs/*): 234 pages
📖 API Reference (/api/*): 156 pages
📝 Blog (/blog/*): 312 pages — SKIP
🏢 Marketing (/, /pricing, etc.): 45 pages — SKIP
❓ Unknown: 100 pages
Proposed crawl: 390 pages (docs + API)
Estimated time: ~5 minutes
Should I proceed? [Y/n/customize]"
3. EXECUTE
│
├─ Start crawl with approved include/exclude paths
├─ Stream results progressively (see below)
└─ Report progress and any issues
4. FINALIZE
│
├─ Assemble llms-full.txt
├─ Store and index
└─ Confirm completion
# /crawl
Crawl a documentation site and add it as a blz source.
## Usage
/crawl <url> [--alias <name>]
## What This Skill Does
1. Checks if llms.txt already exists (suggests `blz add` if so)
2. Maps the site to discover all URLs
3. Analyzes patterns to identify documentation vs other content
4. Proposes a crawl plan for your approval
5. Executes crawl with progressive indexing
6. Assembles and stores as a blz source
## Tools Used
- firecrawl_map: Discover site URLs
- firecrawl_crawl: Execute crawl
- firecrawl_check_crawl_status: Monitor progress
- blz_add: Store final result

Firecrawl handles HTML → clean markdown. This is a hard problem (main content extraction, JS rendering, edge cases) that Firecrawl has solved well. We don't reinvent it — but we keep the interface pluggable.
- Main content extraction — Strips nav, footer, sidebar, cookie banners
- Clean markdown conversion — Preserves code blocks, tables, headings
- JS rendering — Handles SPAs and dynamic content
- Edge case handling — Years of iteration on weird HTML
The onlyMainContent: true option is critical — without it you get navigation, footers, and boilerplate.
Firecrawl excels at article-style pages but can struggle with:
- API references — Complex tables, code tabs, interactive playgrounds
- OpenAPI/JSON specs — Structured data that doesn't map cleanly to markdown
- PDF-heavy docs — Embedded PDFs need separate handling
Mitigation: Store original HTML alongside markdown for sources where fidelity matters. Allow per-source raw_html: true option.
┌─────────────────────────────────────────────────────────┐
│ Firecrawl API (default) │
│ ├─ Just works, no setup │
│ ├─ Pay-per-page pricing │
│ └─ Best for: most users, occasional crawls │
├─────────────────────────────────────────────────────────┤
│ Self-Hosted Firecrawl (power users) │
│ ├─ AGPL licensed, can self-host │
│ ├─ Unlimited crawls, no per-page cost │
│ ├─ No queue — dedicated capacity, faster throughput │
│ ├─ Requires: Docker, Redis, Playwright │
│ └─ Best for: heavy users, air-gapped, speed + cost │
└─────────────────────────────────────────────────────────┘
# Default: Firecrawl API
blz config set firecrawl.api_key fc-xxxxx
# Self-hosted: point to local instance
blz config set firecrawl.url http://localhost:3002
blz config set firecrawl.api_key local # or omit if not required

When an index exists, we use it to minimize crawl scope:
Index has 24 URLs
↓
Firecrawl scrapes exactly those 24 pages
↓
Result: 24 credits instead of 200+ for blind crawl
The index tells us exactly what to fetch — no site mapping, no guessing, no marketing pages.
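A sketch of the index-guided path; `firecrawl_scrape` and `ScrapeOptions` follow the shapes used in the sync code later in this doc, and error handling is simplified to fail on the first bad page:

```rust
/// Scrape exactly the URLs listed in the index: no site mapping, no discovery.
async fn crawl_from_index(index: &IndexLayer) -> Result<Vec<CrawledPage>> {
    let mut pages = Vec::with_capacity(index.entries.len());
    for entry in &index.entries {
        let page = firecrawl_scrape(&entry.url, &ScrapeOptions {
            formats: vec!["markdown".to_string()],
            only_main_content: true,
            ..Default::default()
        }).await?;
        pages.push(page);
    }
    Ok(pages)
}
```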
blz add hono https://hono.dev/llms.txt
Fetching index... ✓
Found 24 documentation URLs
No llms-full.txt available
Generate content via Firecrawl?
Pages: 24
Estimated credits: 24
[Y/n] y
Crawling via Firecrawl...
[████████████████████] 24/24 pages
Assembling content... ✓
Linking index entries... ✓
Added source 'hono'
Index entries: 24
Content: 45,230 lines
Firecrawl credits used: 24

| Scenario | Firecrawl Credits |
|---|---|
| Blind site crawl (discovery mode) | 100-500+ |
| Index-guided crawl (24 URLs) | 24 |
| With smart sync (10% changed) | ~3 |
The index dramatically reduces crawl scope. Smart sync reduces ongoing costs.
BLZ includes a Docker Compose file for local Firecrawl:
# Start local Firecrawl (from blz repo)
docker compose -f docker/docker-compose.firecrawl.yml up -d
# BLZ auto-detects localhost:3002, or configure explicitly
blz config set firecrawl.url http://localhost:3002

The included docker-compose.firecrawl.yml:
# docker/docker-compose.firecrawl.yml
services:
  firecrawl:
    # Pin to specific digest for deterministic builds
    image: mendableai/firecrawl@sha256:<pin-digest-here>
    ports:
      - "3002:3002"
    environment:
      - REDIS_URL=redis://redis:6379
      # Disable telemetry to prevent leaking private docs
      - TELEMETRY_ENABLED=false
    depends_on:
      - redis
    # Security: restrict network access
    networks:
      - internal
    security_opt:
      - no-new-privileges:true

  redis:
    image: redis:alpine
    volumes:
      - redis-data:/data
    networks:
      - internal

networks:
  internal:
    driver: bridge

volumes:
  redis-data:

Requirements: Docker and ~2GB RAM for Chromium/Playwright.
Auto-detection: BLZ checks localhost:3002 before falling back to API. No config needed if local Firecrawl is running.
Deterministic builds: Pin the Firecrawl image digest and checksum to avoid silent upstream changes. Update deliberately.
See Firecrawl self-hosting docs for advanced configuration.
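A sketch of that auto-detection, assuming any HTTP response from localhost:3002 (even an error status) means a local instance is listening; the timeout value is arbitrary:

```rust
use std::time::Duration;

/// Prefer a local Firecrawl instance if something answers on localhost:3002,
/// otherwise fall back to the configured API endpoint.
async fn detect_firecrawl_base_url(api_url: &str) -> String {
    let local = "http://localhost:3002";
    let client = reqwest::Client::builder()
        .timeout(Duration::from_millis(500))
        .build()
        .expect("reqwest client");
    match client.get(local).send().await {
        Ok(_) => local.to_string(),
        Err(_) => api_url.to_string(),
    }
}
```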
Firecrawl is AGPL-licensed. Key considerations:
- Using the API: No license obligations — it's a service
- Self-hosting unmodified: Fine, no source disclosure required
- Modifying Firecrawl: Triggers source-offer obligations if distributed
- Bundling in commercial product: Consult legal counsel
We include a Docker Compose file that pulls the official image — this is safe. If you modify Firecrawl itself, you must comply with AGPL.
Firecrawl executes JavaScript when rendering pages. For untrusted sources:
- Network isolation: The Docker Compose restricts egress by default
- SSRF risk: Malicious pages could attempt to access internal services
- Data exfiltration: JS could try to phone home with crawled content
Recommendations:
- Use self-hosted Firecrawl for sensitive/internal documentation
- Review the container's network access for your threat model
- Disable telemetry (`TELEMETRY_ENABLED=false`) to prevent data leaks
While Firecrawl is the primary extraction engine, we keep the interface pluggable for future flexibility.
- Avoid lock-in: Firecrawl could change pricing, API, or disappear
- Offline fallback: Some users need air-gapped operation
- Cost optimization: Simple sites don't need full JS rendering
- Testing: Mock extractors for unit tests
#[async_trait]
pub trait Extractor: Send + Sync {
/// Extract markdown content from a URL
async fn extract(&self, url: &str, options: &ExtractOptions) -> Result<ExtractedPage>;
/// Health check for the extractor
async fn health(&self) -> Result<ExtractorHealth>;
/// Extractor capabilities
fn capabilities(&self) -> ExtractorCapabilities;
}
pub struct ExtractorCapabilities {
pub js_rendering: bool,
pub main_content_extraction: bool,
pub rate_limit: Option<u32>, // requests per minute
}
pub struct ExtractedPage {
pub url: String,
pub markdown: String,
pub html: Option<String>, // Original HTML if requested
pub title: Option<String>,
pub metadata: HashMap<String, String>,
}

| Extractor | JS Rendering | Quality | Cost |
|---|---|---|---|
| `FirecrawlApiExtractor` | ✓ | High | Pay-per-page |
| `FirecrawlLocalExtractor` | ✓ | High | Self-hosted |
| `ReadabilityExtractor` | ✗ | Medium | Free (future) |
For v1, only Firecrawl extractors are implemented. The trait exists so we can add alternatives later without touching indexing logic.
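A sketch of how the trait could be wired up today, with only the two Firecrawl extractors from the table (the config field names and constructors are assumptions):

```rust
/// Pick an extractor from config: self-hosted Firecrawl when a URL is set,
/// otherwise the hosted API.
fn build_extractor(config: &CrawlConfig) -> Box<dyn Extractor> {
    match &config.firecrawl_url {
        Some(url) => Box::new(FirecrawlLocalExtractor::new(url.clone())),
        None => Box::new(FirecrawlApiExtractor::new(config.firecrawl_api_key.clone())),
    }
}
```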
After extraction, validate content before indexing to catch parser failures:
fn validate_extraction(page: &ExtractedPage) -> Result<(), ValidationError> {
// Minimum content length
if page.markdown.len() < 200 {
return Err(ValidationError::ContentTooShort);
}
// Check for extraction artifacts
if page.markdown.contains("cookie") && page.markdown.contains("accept") {
warn!("Possible cookie banner in extracted content: {}", page.url);
}
// Code block ratio check (API docs should have code)
let code_blocks = page.markdown.matches("```").count() / 2;
if code_blocks == 0 {
// Warn but don't fail — some pages legitimately have no code
warn!("No code blocks in extracted content: {}", page.url);
}
Ok(())
}

Failed validations are logged and can trigger re-extraction or manual review.
Track crawl health and tune performance:
struct CrawlMetrics {
// Per-source
pages_crawled: Counter,
pages_failed: Counter,
bytes_fetched: Counter,
extraction_duration_ms: Histogram,
// Tantivy
commit_duration_ms: Histogram,
segments_created: Counter,
// Firecrawl
api_latency_ms: Histogram,
rate_limit_hits: Counter,
}

[INFO] crawl:start source=hono pages=24 extractor=firecrawl-api
[INFO] crawl:page url=/docs/getting-started status=ok bytes=12340 duration=1.2s
[WARN] crawl:page url=/docs/internal status=403 error="Forbidden"
[INFO] crawl:commit pages=10 duration=45ms segments=1
[INFO] crawl:complete source=hono pages=23/24 duration=48s
blz crawl --health
# Firecrawl API: ✓ (latency: 230ms)
# Firecrawl Local: ✓ (latency: 45ms)
# Tantivy: ✓ (segments: 3, docs: 45230)

Full site crawls can take 10+ minutes. Users shouldn't have to wait for everything before they can search.
As pages complete crawling, index them immediately:
Crawl Progress Search Availability
───────────────── ────────────────────
Page 1 complete ──> Indexed, searchable
Page 2 complete ──> Indexed, searchable
Page 3 complete ──> Indexed, searchable
... ...
Page N complete ──> Full index ready
Assemble llms-full.txt
Agent or user can specify paths to crawl first:
/crawl https://docs.example.com --priority "/api/hooks,/guides/getting-started"
Firecrawl crawl order:
- Priority paths (immediate need)
- Breadth-first from root (comprehensive coverage)
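A sketch of that ordering, assuming the discovered URLs already arrive in breadth-first order: priority matches move to the front while everything else keeps its position.

```rust
/// Put priority paths at the front of the crawl queue; everything else keeps
/// its breadth-first order.
fn order_crawl_queue(discovered: Vec<String>, priority_paths: &[String]) -> Vec<String> {
    let (mut prioritized, rest): (Vec<String>, Vec<String>) = discovered
        .into_iter()
        .partition(|url| priority_paths.iter().any(|p| url.contains(p.as_str())));
    prioritized.extend(rest);
    prioritized
}
```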
Agent discovers it needs specific docs during a task:
Agent: "I need the React hooks documentation"
↓
Check: Is /docs/hooks already indexed?
↓
No → Inject /docs/hooks as priority in active crawl
OR start targeted scrape just for that path
↓
Yes → Return cached content
struct ProgressiveCrawl {
operation_id: String,
partial_index: TantivyIndex, // Growing index
pending_pages: HashSet<String>,
completed_pages: Vec<CrawledPage>,
uncommitted_count: usize,
last_commit: Instant,
}
const BATCH_SIZE: usize = 10;
const COMMIT_INTERVAL: Duration = Duration::from_secs(30);
impl ProgressiveCrawl {
async fn poll_and_index(&mut self) -> Result<CrawlStatus> {
let status = check_crawl_status(&self.operation_id).await?;
// Index any newly completed pages
for page in &status.data {
if !self.completed_pages.iter().any(|p| p.url == page.url) {
self.index_page(page).await?;
self.completed_pages.push(page.clone());
}
}
// Batch commits: every 10 pages or 30 seconds
// Avoids Tantivy segment churn while staying progressive
if self.uncommitted_count >= BATCH_SIZE
|| self.last_commit.elapsed() > COMMIT_INTERVAL
{
self.partial_index.commit()?;
self.uncommitted_count = 0;
self.last_commit = Instant::now();
}
Ok(status)
}
async fn index_page(&mut self, page: &CrawledPage) -> Result<()> {
let doc = parse_markdown_to_doc(&page.markdown);
self.partial_index.add_document(doc)?;
self.uncommitted_count += 1;
Ok(())
}
fn search(&self, query: &str) -> Result<Vec<SearchResult>> {
// Search works even mid-crawl (searches committed segments)
self.partial_index.search(query)
}
}

Track that a source is still being crawled:
{
"alias": "example",
"status": "crawling",
"pages_indexed": 67,
"pages_total": 234,
"crawl_job_id": "fc_abc123",
"searchable": true,
"complete": false
}

User sees:
blz list
# example [crawling 67/234] https://docs.example.com
# bun [ready] https://bun.sh/llms-full.txt
blz query "hooks" --source example
# ⚠ Source 'example' is still crawling (67/234 pages indexed)
# Results from indexed pages:
# ...

Keep CLI and MCP aligned. MCP tools enable agents to orchestrate crawls programmatically.
{
"name": "blz_crawl",
"description": "Crawl a documentation site to generate and index an llms-full.txt source",
"parameters": {
"action": {
"type": "string",
"enum": ["start", "status", "priority", "cancel", "resume"],
"description": "Crawl action to perform"
},
"url": {
"type": "string",
"description": "Site URL to crawl (required for 'start')"
},
"alias": {
"type": "string",
"description": "Source alias (derived from domain if omitted)"
},
"jobId": {
"type": "string",
"description": "Job ID (required for status/priority/cancel/resume)"
},
"includePaths": {
"type": "array",
"items": { "type": "string" },
"description": "Glob patterns for paths to include"
},
"excludePaths": {
"type": "array",
"items": { "type": "string" },
"description": "Glob patterns for paths to exclude"
},
"priorityPaths": {
"type": "array",
"items": { "type": "string" },
"description": "Paths to crawl first (for 'start' or 'priority' action)"
},
"limit": {
"type": "integer",
"description": "Maximum pages to crawl",
"default": 500
}
}
}

Begin a new crawl job.
// Request
{
"action": "start",
"url": "https://docs.example.com",
"alias": "example",
"includePaths": ["/docs/*", "/api/*"],
"excludePaths": ["/blog/*"],
"priorityPaths": ["/docs/getting-started", "/api/hooks"]
}
// Response
{
"action": "start",
"jobId": "blz_abc123",
"status": "started",
"url": "https://docs.example.com",
"alias": "example",
"estimatedPages": 234,
"message": "Crawl started. Use status action to monitor progress."
}

Check crawl progress and get partial results.
// Request
{
"action": "status",
"jobId": "blz_abc123"
}
// Response
{
"action": "status",
"jobId": "blz_abc123",
"status": "crawling", // "pending" | "crawling" | "completed" | "failed" | "paused"
"progress": {
"pagesDiscovered": 234,
"pagesCrawled": 67,
"pagesIndexed": 67,
"pagesFailed": 2,
"percentComplete": 28
},
"searchable": true,
"canResume": true,
"priorityQueue": ["/api/hooks"], // paths still in priority queue
"errors": [
{ "url": "/api/internal", "error": "403 Forbidden" }
]
}

Inject priority paths into an active crawl.
// Request
{
"action": "priority",
"jobId": "blz_abc123",
"priorityPaths": ["/docs/advanced/hooks", "/api/useEffect"]
}
// Response
{
"action": "priority",
"jobId": "blz_abc123",
"added": ["/docs/advanced/hooks", "/api/useEffect"],
"alreadyCrawled": [],
"message": "2 paths added to priority queue"
}

Resume a paused or failed crawl.
// Request
{
"action": "resume",
"jobId": "blz_abc123"
}
// Response
{
"action": "resume",
"jobId": "blz_abc123",
"status": "crawling",
"resumedFrom": {
"pagesCrawled": 67,
"pagesRemaining": 167
}
}

Stop a crawl and keep what's been indexed.
// Request
{
"action": "cancel",
"jobId": "blz_abc123",
"keepPartial": true // default: true
}
// Response
{
"action": "cancel",
"jobId": "blz_abc123",
"status": "cancelled",
"pagesKept": 67,
"message": "Crawl cancelled. 67 pages retained and searchable."
}

| CLI Command | MCP Action |
|---|---|
| `blz crawl <url>` | `blz_crawl(action: "start", url: ...)` |
| `blz crawl --status <id>` | `blz_crawl(action: "status", jobId: ...)` |
| `blz crawl --priority <paths>` | `blz_crawl(action: "priority", ...)` |
| `blz crawl --resume <id>` | `blz_crawl(action: "resume", jobId: ...)` |
| `blz crawl --cancel <id>` | `blz_crawl(action: "cancel", jobId: ...)` |
| Failure | Detection | Recovery |
|---|---|---|
| Network timeout | Firecrawl status = failed | Auto-retry with backoff |
| Rate limited | 429 response | Pause, wait, resume |
| Firecrawl outage | API unreachable | Pause job, retry later |
| blz crash | Job file exists, no process | Resume from checkpoint |
| Partial page fail | Individual page 4xx/5xx | Log, continue, report at end |
Persist state frequently so we can resume from any point:
// ~/.local/share/blz/jobs/blz_abc123.json
{
"id": "blz_abc123",
"firecrawlOperationId": "fc_xyz789",
"url": "https://docs.example.com",
"alias": "example",
"status": "crawling",
"config": {
"includePaths": ["/docs/*"],
"excludePaths": ["/blog/*"],
"limit": 500
},
"progress": {
"discovered": ["/docs/intro", "/docs/api", ...],
"crawled": ["/docs/intro", "/docs/api"],
"indexed": ["/docs/intro", "/docs/api"],
"failed": [{ "url": "/docs/internal", "error": "403", "attempts": 2 }],
"priorityQueue": ["/api/hooks"]
},
"timestamps": {
"startedAt": "2026-01-25T12:00:00Z",
"lastCheckpoint": "2026-01-25T12:05:30Z",
"lastActivity": "2026-01-25T12:05:28Z"
},
"retryState": {
"consecutiveFailures": 0,
"backoffUntil": null,
"totalRetries": 3
}
}

async fn crawl_with_recovery(job: &mut CrawlJob) -> Result<()> {
loop {
match poll_and_index(job).await {
Ok(status) if status.is_complete() => {
return finalize_crawl(job).await;
}
Ok(_) => {
job.checkpoint().await?;
job.reset_failure_count();
}
Err(e) if e.is_retryable() => {
job.increment_failure_count();
if job.consecutive_failures() >= MAX_RETRIES {
job.pause("Too many consecutive failures").await?;
return Err(e);
}
let backoff = exponential_backoff(job.consecutive_failures());
job.set_backoff_until(Instant::now() + backoff).await?;
tokio::time::sleep(backoff).await;
}
Err(e) => {
job.fail(&e.to_string()).await?;
return Err(e);
}
}
}
}

Scenario 1: Network blip
Crawling... 67/234 pages
[Network timeout]
Retrying in 5s...
Retrying in 15s...
Resumed. 68/234 pages...
Scenario 2: Rate limited
Crawling... 120/234 pages
[Rate limited by Firecrawl]
Paused. Will resume at 12:15:00 (2 minutes)
...
Resumed. 121/234 pages...
Scenario 3: blz process killed
# Later...
blz crawl --jobs
# blz_abc123 [paused] example 67/234 pages (last activity: 5 min ago)
blz crawl --resume blz_abc123
# Resuming crawl for 'example'...
# 68/234 pages...

Scenario 4: Firecrawl outage
Crawling... 89/234 pages
[Firecrawl API unreachable]
Job paused. Firecrawl appears to be down.
Run 'blz crawl --resume blz_abc123' to retry.
# Status shows paused job
blz crawl --status blz_abc123
# Status: paused (Firecrawl unreachable)
# Pages indexed: 89 (searchable)
# Resume: blz crawl --resume blz_abc123
Documentation changes over time. How do we:
- Know when to update?
- Update efficiently (not re-crawl everything)?
- Handle structural changes (new pages, removed pages)?
Option A: Manual sync
blz sync example
# Checks for changes, re-crawls if needed

Option B: Staleness check
blz check example
# Source 'example' last synced 14 days ago
# Run 'blz sync example' to update

Option C: HTTP caching (ETag/Last-Modified)
- Store ETag per page from initial crawl
- On sync, do conditional requests
- Only re-fetch changed pages
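A sketch of the header check, a variant of the headers_unchanged helper referenced by the smart sync code later in this doc. A HEAD request is enough because we only need the validators, not the body:

```rust
/// True if the stored ETag or Last-Modified still matches what the server reports.
/// Treats missing validators as "changed" so we never skip a page we can't verify.
async fn headers_unchanged(
    client: &reqwest::Client,
    url: &str,
    stored_etag: Option<&str>,
    stored_last_modified: Option<&str>,
) -> reqwest::Result<bool> {
    let resp = client.head(url).send().await?;
    let etag = resp.headers().get(reqwest::header::ETAG).and_then(|v| v.to_str().ok());
    let last_mod = resp.headers().get(reqwest::header::LAST_MODIFIED).and_then(|v| v.to_str().ok());
    Ok(match (etag, stored_etag) {
        (Some(server), Some(stored)) => server == stored,
        _ => matches!((last_mod, stored_last_modified), (Some(a), Some(b)) if a == b),
    })
}
```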
Firecrawl has built-in change detection. Add changeTracking to formats:
{
"url": "https://docs.example.com/page",
"formats": ["markdown", "changeTracking"]
}

Response includes:
{
"changeTracking": {
"previousScrapeAt": "2026-01-15T10:00:00Z",
"changeStatus": "same" // "new" | "same" | "changed" | "removed"
}
}

Key benefit: Firecrawl compares against your previous scrape (scoped to API key). You pay for a scrape but skip processing if changeStatus: "same".
Check cheapest sources first, only escalate when needed:
┌─────────────────────────────────────────────────────────────┐
│ Tier 0: Sitemap.xml (FREE - no Firecrawl) │
│ └─ Fetch sitemap, check <lastmod> dates │
│ └─ If lastmod unchanged → skip page entirely │
├─────────────────────────────────────────────────────────────┤
│ Tier 1: HEAD requests (FREE - no Firecrawl) │
│ └─ Check ETag / Last-Modified headers │
│ └─ If headers unchanged → skip page │
├─────────────────────────────────────────────────────────────┤
│ Tier 2: Firecrawl with changeTracking (1 credit) │
│ └─ changeStatus: "same" → skip processing │
│ └─ changeStatus: "changed" → re-index page │
│ └─ changeStatus: "new" → index new page │
└─────────────────────────────────────────────────────────────┘
Full re-crawl (simplest, most reliable)
blz sync example --full
# Re-crawls entire site, replaces index

Pros: Simple, handles all changes
Cons: Slow, uses most Firecrawl credits
Smart incremental (default, cost-optimized)
blz sync example
# 1. Map site for current URLs (detect new/removed)
# 2. Check sitemap.xml for lastmod hints (free)
# 3. HEAD requests for ETag/Last-Modified (free)
# 4. Firecrawl changeTracking for remaining (1 credit each)
# 5. Only re-index pages with changeStatus: "changed"

Pros: Minimal credit usage, catches all changes
Cons: More complex, multiple request types
Sitemap-only (if site has good sitemap)
blz sync example --sitemap-only
# 1. Fetch sitemap.xml
# 2. Check lastmod dates
# 3. Only re-crawl pages with newer lastmod

Pros: Very efficient, no Firecrawl credits for unchanged
Cons: Relies on accurate sitemap (many sites don't update lastmod)
async fn sync_smart(source: &mut Source) -> Result<SyncResult> {
let mut result = SyncResult::default();
// Step 1: Map site for current URLs
let current_urls = firecrawl_map(&source.url).await?;
let stored_urls: HashSet<_> = source.pages.keys().collect();
// Detect removed pages
for url in stored_urls.difference(&current_urls) {
source.remove_page(url).await?;
result.removed.push(url.clone());
}
// Detect new pages (definitely need to crawl)
let new_urls: Vec<_> = current_urls.difference(&stored_urls).collect();
// Step 2: Check sitemap for lastmod hints (free)
let sitemap_hints = fetch_sitemap_lastmod(&source.url).await.ok();
// Step 3: For existing pages, use tiered checking
let mut pages_to_check = Vec::new();
for url in current_urls.intersection(&stored_urls) {
// Tier 0: Sitemap lastmod
if let Some(ref hints) = sitemap_hints {
if let Some(lastmod) = hints.get(url) {
if *lastmod <= source.pages[url].last_indexed {
continue; // Skip - sitemap says unchanged
}
}
}
// Tier 1: HEAD request for ETag/Last-Modified
if let Ok(headers) = head_request(url).await {
if headers_unchanged(&headers, &source.pages[url]) {
continue; // Skip - headers say unchanged
}
}
// Tier 2: Need Firecrawl check
pages_to_check.push(url.clone());
}
// Step 4: Batch check with Firecrawl changeTracking
for url in pages_to_check {
let page = firecrawl_scrape(&url, &ScrapeOptions {
formats: vec!["markdown", "changeTracking"],
only_main_content: true,
..Default::default()
}).await?;
match page.change_tracking.change_status.as_str() {
"same" => {
// No change, skip
}
"changed" => {
source.reindex_page(&page).await?;
result.modified.push(url);
}
_ => {}
}
}
// Step 5: Crawl new pages
for url in new_urls {
let page = firecrawl_scrape(&url, &default_scrape_options()).await?;
source.index_page(&page).await?;
result.new.push(url.clone());
}
// Step 6: Regenerate llms-full.txt if anything changed
if !result.is_empty() {
source.regenerate_llms_full().await?;
}
Ok(result)
}

For a 200-page doc site with 10 changed pages:
| Approach | Firecrawl Credits | Time |
|---|---|---|
| Full re-crawl | 200 | ~5 min |
| Naive incremental | 200 | ~5 min |
| Smart (sitemap + HEAD) | ~20-50 | ~1 min |
| Smart (perfect sitemap) | ~10 | ~30 sec |
Firecrawl Observer is an open-source monitoring tool. Could integrate for automated sync triggers:
Observer monitors docs.example.com
↓
Change detected → webhook to blz
↓
blz sync example --auto
This would enable "push-based" sync instead of polling.
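A sketch of what the receiving side could look like, assuming a minimal notification payload with just the changed URL (the payload shape is hypothetical, not Observer's actual format):

```rust
use serde::Deserialize;

/// Hypothetical change notification; the real Observer payload may differ.
#[derive(Deserialize)]
struct ChangeNotification {
    url: String,
}

/// Map a notification back to a tracked source and run an incremental sync.
async fn handle_change_notification(body: &str, sources: &mut [Source]) -> Result<()> {
    let note: ChangeNotification = serde_json::from_str(body)?;
    if let Some(source) = sources.iter_mut().find(|s| note.url.starts_with(s.url.as_str())) {
        sync_smart(source).await?;
    }
    Ok(())
}
```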
Track sync state per source:
{
"alias": "example",
"origin": {
"type": "crawled",
"url": "https://docs.example.com",
"crawlConfig": {
"includePaths": ["/docs/*"],
"excludePaths": ["/blog/*"]
}
},
"sync": {
"lastFullSync": "2026-01-15T10:00:00Z",
"lastIncrementalSync": "2026-01-25T10:00:00Z",
"pageCount": 234,
"pageHashes": {
"/docs/intro": "sha256:abc123",
"/docs/api": "sha256:def456"
},
"sitemapUrl": "https://docs.example.com/sitemap.xml",
"sitemapLastChecked": "2026-01-25T10:00:00Z"
}
}

CLI:
blz sync example # Incremental (default)
blz sync example --full # Full re-crawl
blz sync example --check # Dry-run, show what would change
blz sync --all # Sync all crawled sources

MCP:
{
"action": "sync",
"alias": "example",
"mode": "incremental", // "incremental" | "full" | "check"
}

Check-mode response:

{
"action": "sync",
"alias": "example",
"mode": "check",
"changes": {
"new": ["/docs/new-feature", "/api/v2/hooks"],
"modified": ["/docs/intro"],
"removed": ["/docs/deprecated"],
"unchanged": 230
},
"recommendation": "Incremental sync will update 4 pages",
"estimatedTime": "~30 seconds"
}

When pages disappear:
- Remove from search index
- Keep in archive (optional, for history)
- Update llms-full.txt
async fn sync_incremental(source: &mut Source) -> Result<SyncResult> {
let current_urls = map_site(&source.url).await?;
let stored_urls: HashSet<_> = source.sync.page_hashes.keys().collect();
let new_urls: Vec<_> = current_urls.difference(&stored_urls).collect();
let removed_urls: Vec<_> = stored_urls.difference(&current_urls).collect();
// Crawl new pages
for url in &new_urls {
let page = scrape_page(url).await?;
source.index_page(&page).await?;
}
// Check existing pages for changes (sample or full)
let mut modified = Vec::new();
for url in current_urls.intersection(&stored_urls) {
if should_check(url, &source.sync) {
let page = scrape_page(url).await?;
let hash = hash_content(&page.markdown);
if hash != source.sync.page_hashes[url] {
source.reindex_page(&page).await?;
modified.push(url.to_string());
}
}
}
// Remove deleted pages
for url in &removed_urls {
source.remove_page(url).await?;
}
// Regenerate llms-full.txt
source.regenerate_llms_full().await?;
Ok(SyncResult { new: new_urls, removed: removed_urls, modified })
}
---
## Architecture
### External Dependency: Firecrawl
Firecrawl provides:
- Site crawling with JS rendering
- Markdown extraction with `onlyMainContent`
- Async job handling with status polling
- Rate limiting and politeness built-in
### Workflow

User                        blz                         Firecrawl
 │                           │                              │
 │  blz crawl                │                              │
 │──────────────────────────>│                              │
 │                           │  firecrawl_crawl()           │
 │                           │─────────────────────────────>│
 │                           │  operation_id                │
 │                           │<─────────────────────────────│
 │                           │                              │
 │  [progress]               │  check_crawl_status()        │
 │<──────────────────────────│─────────────────────────────>│
 │                           │  status/pages                │
 │                           │<─────────────────────────────│
 │                           │  ...                         │
 │                           │                              │
 │                           │  [crawl complete]            │
 │                           │<─────────────────────────────│
 │                           │                              │
 │                           │  assemble_llms_full()        │
 │                           │  store_and_index()           │
 │                           │                              │
 │  ✓ Added 'example'        │                              │
 │<──────────────────────────│                              │
---
## CLI Interface
### Basic Usage (Blocking)
```bash
# Crawl and add as new source
blz crawl https://docs.example.com
# Crawl with explicit alias
blz crawl https://docs.example.com --alias example
# Crawl with page limit
blz crawl https://docs.example.com --limit 100
# Dry run - show what would be crawled
blz crawl https://docs.example.com --dry-run
# Start crawl in background
blz crawl https://docs.example.com --background
# Output: Started crawl job: fc_abc123
# Check status
blz crawl --status fc_abc123
# Output: Status: crawling (67/120 pages)
# List all active jobs
blz crawl --jobs
# Cancel a job
blz crawl --cancel fc_abc123
```

| Flag | Description | Default |
|---|---|---|
| `--alias <name>` | Source alias | Derived from domain |
| `--limit <n>` | Max pages to crawl | 500 |
| `--depth <n>` | Max link depth | 5 |
| `--include <glob>` | Only crawl matching paths | - |
| `--exclude <glob>` | Skip matching paths | - |
| `--background` | Run in background | false |
| `--dry-run` | Preview without crawling | false |
| `--yes` | Skip confirmation | false |
async fn start_crawl(url: &str, options: &CrawlOptions) -> Result<String> {
let response = firecrawl_crawl(FirecrawlCrawlRequest {
url: url.to_string(),
limit: Some(options.limit),
max_discovery_depth: Some(options.depth),
include_paths: options.include.clone(),
exclude_paths: options.exclude.clone(),
scrape_options: Some(ScrapeOptions {
formats: vec!["markdown".to_string()],
only_main_content: true,
..Default::default()
}),
..Default::default()
}).await?;
Ok(response.operation_id)
}

async fn poll_until_complete(
operation_id: &str,
progress_callback: impl Fn(CrawlProgress),
) -> Result<Vec<CrawledPage>> {
loop {
let status = firecrawl_check_crawl_status(operation_id).await?;
progress_callback(CrawlProgress {
completed: status.completed,
total: status.total,
status: status.status.clone(),
});
match status.status.as_str() {
"completed" => return Ok(status.data),
"failed" => return Err(anyhow!("Crawl failed: {}", status.error)),
"cancelled" => return Err(anyhow!("Crawl was cancelled")),
_ => {
tokio::time::sleep(Duration::from_secs(2)).await;
}
}
}
}

Index-guided crawl: Use index order (maintainer-curated). This is the best ordering because the maintainer deliberately organized the entries.
Discovery crawl: Use breadth-first crawl order from root. This follows the natural site structure.
Fallback: URL path sort (alphabetical). Least ideal — /api/authentication before /api/getting-started.
fn order_pages(
pages: Vec<CrawledPage>,
index: Option<&IndexLayer>,
crawl_order: &[String],
) -> Vec<CrawledPage> {
if let Some(index) = index {
// Index-guided: use maintainer's order
let url_to_page: HashMap<_, _> = pages.into_iter()
.map(|p| (p.url.clone(), p))
.collect();
index.entries.iter()
.filter_map(|e| url_to_page.get(&e.url).cloned())
.collect()
} else if !crawl_order.is_empty() {
// Discovery: use crawl order (breadth-first)
let url_to_page: HashMap<_, _> = pages.into_iter()
.map(|p| (p.url.clone(), p))
.collect();
crawl_order.iter()
.filter_map(|url| url_to_page.get(url).cloned())
.collect()
} else {
// Fallback: URL path sort
let mut pages = pages;
pages.sort_by(|a, b| {
let path_a = Url::parse(&a.url).map(|u| u.path().to_string()).unwrap_or_default();
let path_b = Url::parse(&b.url).map(|u| u.path().to_string()).unwrap_or_default();
path_a.cmp(&path_b)
});
pages
}
}

Ensure consistent heading hierarchy across pages:
fn normalize_headings(markdown: &str, page_title: &str) -> String {
let mut output = String::new();
// Add page title as h2 (reserve h1 for doc title)
output.push_str(&format!("## {}\n\n", page_title));
// Shift all headings down by one level
for line in markdown.lines() {
if line.starts_with('#') {
// Count existing heading level
let level = line.chars().take_while(|&c| c == '#').count();
// Shift down (h1 -> h3, h2 -> h4, etc.)
let new_level = (level + 2).min(6);
let hashes = "#".repeat(new_level);
let text = line.trim_start_matches('#').trim();
output.push_str(&format!("{} {}\n", hashes, text));
} else {
output.push_str(line);
output.push('\n');
}
}
output
}

fn assemble_llms_full(
site_url: &str,
site_name: &str,
pages: Vec<CrawledPage>,
) -> String {
let mut output = String::new();
// Header
output.push_str(&format!("# {}\n\n", site_name));
output.push_str(&format!("> Documentation for {}.\n", site_name));
output.push_str(&format!("> Generated by blz from {}\n\n", site_url));
// Table of contents (optional)
output.push_str("## Contents\n\n");
for page in &pages {
let title = extract_title(&page.markdown);
output.push_str(&format!("- {}\n", title));
}
output.push_str("\n---\n\n");
// Page content
for page in pages {
let title = extract_title(&page.markdown);
let normalized = normalize_headings(&page.markdown, &title);
output.push_str(&normalized);
output.push_str("\n\n---\n\n");
}
output
}

For background mode, persist job state:
~/.local/share/blz/jobs/
<job_id>.json
{
"id": "fc_abc123",
"url": "https://docs.example.com",
"alias": "example",
"operation_id": "550e8400-e29b-41d4-a716-446655440000",
"status": "crawling",
"started_at": "2026-01-25T12:00:00Z",
"completed_at": null,
"pages_crawled": 67,
"pages_total": 120,
"error": null
}

- Created: Job file written when crawl starts
- Running: Updated periodically with progress
- Completed: Marked complete, results stored as source
- Cleanup: Job file deleted after 24h or on `--cleanup`
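A sketch of the cleanup step; it only looks at file age and ignores job status for brevity (a real implementation would also check the status field in the job file):

```rust
use std::{fs, path::Path, time::{Duration, SystemTime}};

/// Delete job files in the jobs directory that are older than 24 hours.
fn cleanup_job_files(jobs_dir: &Path) -> std::io::Result<()> {
    let cutoff = Duration::from_secs(24 * 60 * 60);
    for entry in fs::read_dir(jobs_dir)? {
        let entry = entry?;
        let modified = entry.metadata()?.modified()?;
        if SystemTime::now().duration_since(modified).unwrap_or_default() > cutoff {
            fs::remove_file(entry.path())?;
        }
    }
    Ok(())
}
```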
- Network timeout → Retry with backoff (handled by Firecrawl)
- Rate limited → Wait and retry (handled by Firecrawl)
- Partial failure → Store what we got, warn user
- Invalid URL → Exit with clear message
- Auth required → Suggest manual download
- Firecrawl API error → Show error, suggest retry
# Success
✓ Crawled 120 pages from docs.example.com
✓ Assembled llms-full.txt (45,230 lines)
✓ Added source 'example'
# Partial success
⚠ Crawled 98/120 pages (22 failed)
- /api/private: 403 Forbidden
- /old/deprecated: 404 Not Found
✓ Assembled llms-full.txt (38,450 lines)
✓ Added source 'example'
# Failure
✗ Crawl failed: Authentication required
This site requires login. Try downloading manually and using:
blz add example /path/to/downloaded.md

Derive from domain, don't prompt interactively (breaks agent workflows):
- `docs.example.com` → `example`
- `example.com/docs` → `example`
- Collision → error with suggestion: Use `--alias <name>` or `--force` to overwrite
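A sketch of the derivation (naive about multi-part public suffixes like .co.uk; collision handling is left to the caller):

```rust
/// Derive an alias from a URL host: docs.example.com -> "example",
/// example.com/docs -> "example".
fn derive_alias(raw_url: &str) -> Option<String> {
    let parsed = url::Url::parse(raw_url).ok()?;
    let host = parsed.host_str()?;
    let parts: Vec<&str> = host.split('.').collect();
    // Take the label just before the TLD; naive for multi-part suffixes.
    let label = if parts.len() >= 2 { parts[parts.len() - 2] } else { parts[0] };
    Some(label.to_string())
}
```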
blz crawl <url> # Interactive (default): always prompt
blz crawl <url> --yes # Skip confirmation
blz crawl <url> --auto=50 # Auto-approve if under 50 pages
blz crawl <url> --dry-run # Show plan and credit estimate, don't execute

- Default limit: 500 pages
- Warn and require confirmation above 100 pages
- `--limit <n>` to override
- One active crawl job at a time (v1 simplicity)
- Queue additional requests, don't reject
Track how a source was created for transparency:
blz info hono
hono (crawled)
Origin: https://hono.dev
Type: index-guided
Include: /docs/*, /api/*
Pages: 156
Last sync: 2026-01-25
Status: ready
Firecrawl: API

For sources with both index and content:
blz info clerk
clerk (index + content)
Index: https://clerk.com/llms.txt (24 entries)
Content: crawled from index URLs
Pages: 24
Last sync: 2026-01-20
Status: ready

Track what's changed since last crawl:
blz sync example --check # Show what would change
blz sync example # Incremental update
blz sync example --full # Force full re-crawl

- Detect sidebar/nav structure for better ordering
- Extract metadata (version, last updated)
- Generate index from content (content → index inference)
When index exists but is incomplete:
- Fetch the index
- Map the site to find what's not covered
- Crawl only the gaps
- Merge into unified source
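A sketch of the gap-filling flow; firecrawl_map, firecrawl_scrape, default_scrape_options, and index_page follow the shapes used in the sync code above, and looks_like_doc_url is the heuristic sketched earlier:

```rust
/// Crawl only the doc pages the site exposes that the index does not cover.
async fn crawl_index_gaps(source: &mut Source, index: &IndexLayer) -> Result<usize> {
    let site_urls = firecrawl_map(&source.url).await?;
    let indexed: std::collections::HashSet<&String> = index.entries.iter().map(|e| &e.url).collect();
    let mut added = 0;
    for url in site_urls.iter().filter(|u| looks_like_doc_url(u.as_str()) && !indexed.contains(u)) {
        let page = firecrawl_scrape(url, &default_scrape_options()).await?;
        source.index_page(&page).await?;
        added += 1;
    }
    Ok(added)
}
```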
blz crawl <url> --max-credits 50
# Abort if crawl would exceed budget

- BLZ-335: CLI command restructure (crawl fits as new top-level command)
- BLZ-337: BLZ_FORMAT env var (crawl should respect output format)
- BLZ-340: Error codes (crawl needs proper exit codes)