BLZ crawl feature design - agent-managed site crawling for llms-full.txt generation

blz crawl Feature Design

Design doc for enriching sources with full content via targeted crawling.


What is BLZ?

BLZ (pronounced "blaze") is a local-first search cache for llms.txt documentation. It keeps documentation local, searches it in milliseconds (P50 ≈ 6ms), and returns grounded spans with exact line citations.

Why BLZ Exists

AI agents need fast, reliable access to documentation. Traditional approaches have problems:

  • Web search: Slow, noisy, can't guarantee freshness
  • RAG/embeddings: Semantic drift, hallucination-prone citations
  • Page-level fetching: Wastes tokens, no granular retrieval

BLZ solves this by:

  1. Caching documentation locally — One-time fetch, instant access
  2. Full-text search with BM25 — Deterministic, reproducible results
  3. Line-level citations — source:1234-1256 points to exact content
  4. Progressive retrieval — Search → cite → expand as needed

Core Commands

blz add bun https://bun.sh/llms.txt      # Add a source
blz query "test runner"                   # Search across sources
blz get bun:1234-1256 -C 5               # Retrieve cited lines with context
blz map bun --tree                        # Browse documentation structure
blz list                                  # List all sources
blz sync bun                              # Refresh from upstream

How Agents Use BLZ

BLZ exposes both CLI and MCP interfaces. Agents typically:

  1. Search for relevant docs: blz query "authentication middleware"
  2. Get citations from results: bun:41994-42009
  3. Retrieve with context: blz get bun:41994-42009 -C 10
  4. Expand if needed: blz get bun:41994-42009 --context all

This workflow minimizes token usage while maintaining grounded, verifiable answers.


Source Architecture: Index + Content

BLZ sources have two complementary layers:

┌─────────────────────────────────────────────────────────┐
│                      BLZ Source                         │
├─────────────────────────────────────────────────────────┤
│  INDEX LAYER                                            │
│  ├── Curated titles & descriptions                      │
│  ├── URL → line range mapping                           │
│  ├── Maintainer-intended structure                      │
│  └── Semantic routing for queries                       │
├─────────────────────────────────────────────────────────┤
│  CONTENT LAYER                                          │
│  ├── Full searchable text                               │
│  ├── Actual heading structure                           │
│  ├── Line-level citations                               │
│  └── Context expansion                                  │
├─────────────────────────────────────────────────────────┤
│  SEARCH INDEX (Tantivy)                                 │
│  └── BM25 search across content                         │
└─────────────────────────────────────────────────────────┘

Index Layer

The index is a curated manifest of documentation entry points. It contains:

  • Titles — Human-readable names chosen by maintainers
  • Descriptions — Brief summaries of what each section covers
  • URLs — Links to the actual documentation pages
  • Structure — Logical groupings (Guides, API Reference, etc.)

This comes from llms.txt files, which are lightweight indexes that many sites already provide.

Content Layer

The content is the full searchable documentation. It contains:

  • Complete text — Every word, searchable
  • Heading structure — The actual document hierarchy
  • Line numbers — For precise citations
  • Body content — Details, examples, code snippets

This comes from llms-full.txt files (if available) or is generated by crawling the URLs from the index.

How They Work Together

| Capability                  | Index | Content | Combined |
|-----------------------------|-------|---------|----------|
| Know what docs exist        | ✓     |         | ✓        |
| Search full text            |       | ✓       | ✓        |
| Curated titles/descriptions | ✓     |         | ✓        |
| Line-level citations        |       | ✓       | ✓        |
| Semantic routing            | ✓     |         | ✓        |
| Context expansion           |       | ✓       | ✓        |

A source with only index can tell you what documentation exists and where to find it. A source with only content can be searched but lacks curated metadata. A source with both provides the best experience: curated entry points with full searchability.

Source States

blz list

# hono       [index + content]  https://hono.dev/llms.txt
# clerk      [index only]       https://clerk.com/llms.txt
# internal   [content only]     /path/to/docs.md
# react      [crawling 67/234]  https://react.dev

Index Freshness & Reconciliation

The index and content layers can drift out of sync:

  • Index stale, content fresh: The llms.txt hasn't been updated but we crawled new pages
  • Content stale, index fresh: The llms.txt was updated with new entries we haven't crawled
  • Both stale: Neither reflects current site state

Track freshness per layer:

{
  "alias": "hono",
  "index": {
    "url": "https://hono.dev/llms.txt",
    "fetchedAt": "2026-01-15T10:00:00Z",
    "etag": "abc123",
    "entryCount": 24
  },
  "content": {
    "source": "crawled",
    "generatedAt": "2026-01-20T14:30:00Z",
    "pageCount": 24,
    "crawlJobId": "blz_xyz789"
  },
  "reconciliation": {
    "lastChecked": "2026-01-25T08:00:00Z",
    "indexEntriesWithContent": 24,
    "indexEntriesMissingContent": 0,
    "contentPagesNotInIndex": 0,
    "status": "synced"
  }
}

Reconciliation on sync:

async fn reconcile_index_content(source: &mut Source) -> Result<ReconciliationResult> {
    // Fetch latest index
    let fresh_index = fetch_index(&source.index.url).await?;

    // Compare with current content
    let index_urls: HashSet<_> = fresh_index.entries.iter()
        .map(|e| &e.url)
        .collect();
    let content_urls: HashSet<_> = source.content.pages.keys().collect();

    let missing_content: Vec<_> = index_urls.difference(&content_urls).collect();
    let orphaned_content: Vec<_> = content_urls.difference(&index_urls).collect();

    Ok(ReconciliationResult {
        index_entries_missing_content: missing_content.len(),
        content_pages_not_in_index: orphaned_content.len(),
        suggested_action: if missing_content.is_empty() && orphaned_content.is_empty() {
            SyncAction::None
        } else if !missing_content.is_empty() {
            SyncAction::CrawlMissing(missing_content)
        } else {
            SyncAction::ReviewOrphans(orphaned_content)
        }
    })
}

User-facing check:

blz check hono

hono [index + content]
  Index: 26 entries (fetched 10 days ago)
  Content: 24 pages (crawled 5 days ago)

  ⚠ 2 index entries missing content:
    - /docs/new-feature (added to index recently)
    - /api/streaming (added to index recently)

  Recommendation: blz sync hono

How This Feature Integrates

The Gap

Many documentation sites provide an index (llms.txt) but not full content (llms-full.txt). Currently:

  1. Index-only sources can route you to URLs but can't be searched
  2. Users must wait for maintainers to add full content, or
  3. Manually download and format documentation

The Solution: Index-Guided Crawling

The index tells us exactly which URLs contain documentation. Instead of blindly crawling a site, we:

  1. Parse the index — Extract URLs, titles, descriptions
  2. Crawl targeted URLs — Only the pages in the index (not blog/marketing)
  3. Preserve metadata — Link index entries to content line ranges
  4. Enable full search — Content layer becomes searchable

Crawl Strategies

| Scenario          | Strategy                                               |
|-------------------|--------------------------------------------------------|
| Has llms-full.txt | Just fetch it (no crawling needed)                     |
| Has llms.txt only | Parse index → crawl listed URLs → assemble content     |
| Has neither       | Agent-managed discovery → propose crawl plan → execute |
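
A minimal sketch of how this decision could be encoded. The `CrawlStrategy` enum and the `head_ok` probe are assumptions for illustration; `fetch_index` matches the helper used later in this doc.

```rust
enum CrawlStrategy {
    FetchFull(String),       // llms-full.txt URL, no crawling needed
    IndexGuided(IndexLayer), // crawl exactly the URLs listed in llms.txt
    AgentDiscovery,          // /crawl skill: map, propose, execute
}

async fn choose_strategy(site_url: &str) -> Result<CrawlStrategy> {
    let base = site_url.trim_end_matches('/');

    // Full content already published: just fetch it
    let full_url = format!("{base}/llms-full.txt");
    if head_ok(&full_url).await? {
        return Ok(CrawlStrategy::FetchFull(full_url));
    }

    // Index only: crawl the listed URLs
    if let Ok(index) = fetch_index(&format!("{base}/llms.txt")).await {
        return Ok(CrawlStrategy::IndexGuided(index));
    }

    // Neither file exists: fall back to agent-managed discovery
    Ok(CrawlStrategy::AgentDiscovery)
}
```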

Fallback for Sparse Indexes

When llms.txt exists but is sparse or outdated, index-guided crawling misses important pages. We need secondary discovery strategies.

Detection: An index is considered sparse if:

  • Entry count < 10 for a large site (many more pages visible)
  • Index hasn't been updated in 6+ months
  • Known URL patterns (e.g., /docs/*) aren't represented
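
A sketch of this heuristic, using the thresholds above. The `last_updated` field, the page-count hint (e.g., from a sitemap), and the use of chrono are assumptions.

```rust
use chrono::{Duration, Utc};

fn is_sparse_index(index: &IndexLayer, site_page_count_hint: Option<usize>) -> bool {
    // Entry count < 10 while the site visibly has many more pages
    let too_few = index.entries.len() < 10
        && site_page_count_hint.map_or(false, |n| n > 10 * index.entries.len().max(1));

    // Index hasn't been updated in 6+ months
    let stale = index
        .last_updated
        .map_or(false, |t| t < Utc::now() - Duration::days(180));

    // Known doc URL patterns (e.g., /docs/*) aren't represented at all
    let missing_doc_paths = !index.entries.iter().any(|e| e.url.contains("/docs/"));

    too_few || stale || missing_doc_paths
}
```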

Fallback strategies (tiered):

┌─────────────────────────────────────────────────────────────┐
│ 1. Sitemap.xml (preferred)                                   │
│    └─ Fetch /sitemap.xml, extract doc URLs                   │
│    └─ Filter to doc paths (/docs/*, /api/*, /guides/*)       │
│    └─ Merge with index, dedupe                               │
├─────────────────────────────────────────────────────────────┤
│ 2. Limited BFS from doc root (if sitemap unavailable)        │
│    └─ Start from known doc root (e.g., /docs/)               │
│    └─ Crawl breadth-first with depth limit (default: 3)      │
│    └─ Filter to same path prefix                             │
│    └─ Cap total pages (default: 100 beyond index)            │
├─────────────────────────────────────────────────────────────┤
│ 3. Firecrawl map (discovery mode)                            │
│    └─ Use firecrawl_map to discover all site URLs            │
│    └─ Apply heuristics to identify doc pages                 │
│    └─ Propose to user for approval                           │
└─────────────────────────────────────────────────────────────┘

Implementation:

async fn expand_sparse_index(
    index: &IndexLayer,
    site_url: &str,
    config: &ExpansionConfig,
) -> Result<Vec<String>> {
    let index_urls: HashSet<_> = index.entries.iter().map(|e| &e.url).collect();
    let mut discovered = Vec::new();

    // Try sitemap first (free, structured)
    if let Ok(sitemap_urls) = fetch_sitemap_urls(site_url).await {
        let doc_urls: Vec<_> = sitemap_urls
            .into_iter()
            .filter(|url| looks_like_doc_url(url))
            .filter(|url| !index_urls.contains(url))
            .take(config.max_expansion)
            .collect();

        if !doc_urls.is_empty() {
            discovered.extend(doc_urls);
            return Ok(discovered);
        }
    }

    // Fallback: limited BFS from doc root
    if let Some(doc_root) = detect_doc_root(index) {
        let bfs_urls = bfs_discover(
            &doc_root,
            config.max_depth,
            config.max_expansion,
            |url| !index_urls.contains(url) && url.starts_with(&doc_root),
        ).await?;

        discovered.extend(bfs_urls);
    }

    Ok(discovered)
}
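
The `looks_like_doc_url` helper above is not specified; one plausible version, keyed off the same path prefixes used elsewhere in this doc:

```rust
fn looks_like_doc_url(url: &str) -> bool {
    const DOC_PREFIXES: &[&str] =
        &["/docs/", "/documentation/", "/guide/", "/guides/", "/api/", "/reference/"];
    const SKIP_PREFIXES: &[&str] =
        &["/blog/", "/news/", "/press/", "/pricing", "/careers", "/about"];

    // Compare against the URL path only, not the full URL
    let path = url::Url::parse(url)
        .map(|u| u.path().to_string())
        .unwrap_or_else(|_| url.to_string());

    if SKIP_PREFIXES.iter().any(|p| path.starts_with(p)) {
        return false;
    }
    DOC_PREFIXES.iter().any(|p| path.starts_with(p))
}
```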

User prompt for expansion:

blz add clerk https://clerk.com/llms.txt

Fetching index... ✓
Found 8 entries in llms.txt

⚠ Index appears sparse for this site
  Sitemap shows 45 additional doc pages not in index

Expand beyond index? [Y/n/customize]
  Y: Crawl all 53 pages (8 from index + 45 from sitemap)
  n: Crawl only 8 pages from index
  customize: Select which additional pages to include

Integration Points

| BLZ Component | Crawl Integration                                       |
|---------------|---------------------------------------------------------|
| blz add       | Detects index-only sources, offers to generate content  |
| blz crawl     | Start/manage crawl jobs (index-guided or discovery)     |
| blz query     | Index entries boost search ranking                      |
| blz map       | Content structure + index annotations                   |
| blz sync      | Cost-optimized updates via index diffing                |
| MCP           | blz_crawl tool for agent orchestration                  |
| Skills        | /crawl for agent-managed discovery                      |

Workflow: Index-Guided

User: blz add hono https://hono.dev/llms.txt
                    ↓
1. Fetch llms.txt (index)
2. Parse: found 24 documentation URLs
3. Check: llms-full.txt not available
                    ↓
Prompt: "Found index with 24 doc URLs. Generate full content? [Y/n]"
                    ↓
4. Crawl exactly those 24 URLs
5. Assemble content, preserving order from index
6. Link index entries → content line ranges
                    ↓
Result: Source with index + content layers
        Curated metadata + full searchability

Workflow: Discovery (No Index)

Agent: "I need the Hono framework docs"
       ↓
Check: Does hono source exist in blz?
       ↓
No  →  /crawl https://hono.dev
       Agent probes site, finds /docs/* is the doc root
       Proposes: "Crawl 156 pages from /docs/*?"
       User approves
       Crawl starts, pages indexed progressively
       ↓
Yes →  blz query "middleware" --source hono
       Return cached results immediately

Query Integration

Index as Routing Layer

The index serves as a semantic routing table. When querying:

blz query "middleware"

The flow:

1. Check index entries first (fast, high-signal)
   → Match: "Middleware" with description "Built-in and custom middleware"
   → This entry points to lines 2341-2567 in content

2. Search content layer (comprehensive)
   → Additional matches from body text

3. Results ranked:
   → Index matches get a boost (maintainer said this is THE middleware doc)
   → Content matches follow
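
A sketch of the ranking step. The boost multiplier and the `SearchResult`/`IndexEntry` field names are assumptions; the idea is simply to multiply BM25 scores for hits that fall inside an index entry's line range.

```rust
const INDEX_BOOST: f32 = 2.0; // assumed multiplier for maintainer-curated matches

fn rank_results(mut content_hits: Vec<SearchResult>, index_hits: &[IndexEntry]) -> Vec<SearchResult> {
    // Boost content hits whose line range is pointed at by an index entry
    for hit in &mut content_hits {
        let curated = index_hits
            .iter()
            .any(|e| e.line_range.contains(&hit.start_line));
        if curated {
            hit.score *= INDEX_BOOST;
        }
    }

    // Highest score first
    content_hits.sort_by(|a, b| {
        b.score
            .partial_cmp(&a.score)
            .unwrap_or(std::cmp::Ordering::Equal)
    });
    content_hits
}
```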

Entry-Based Retrieval

# Line-based citation (always works)
blz get hono:2341-2567

# Entry-based citation (resolves via index)
blz get hono:middleware
blz get hono:"Getting Started"

The index provides stable, human-readable references that map to line ranges.
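
A sketch of how an entry-based citation could resolve to a line range. The `Citation` type and the case-insensitive title match are assumptions.

```rust
enum Citation {
    Lines { alias: String, start: usize, end: usize }, // hono:2341-2567
    Entry { alias: String, name: String },             // hono:middleware
}

fn resolve_citation(cite: &Citation, source: &Source) -> Option<(usize, usize)> {
    match cite {
        Citation::Lines { start, end, .. } => Some((*start, *end)),
        Citation::Entry { name, .. } => source
            .index
            .entries
            .iter()
            // Match against the curated title, e.g. "Middleware" or "Getting Started"
            .find(|e| e.title.eq_ignore_ascii_case(name))
            .map(|e| (e.line_range.start, e.line_range.end)),
    }
}
```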

Index-Only Search

blz query "auth" --index-only

# Searches just index titles/descriptions across all sources
# Fast, high-signal, good for "where should I look?"

Results:
  clerk:auth           "Authentication and user management"
  supabase:auth        "Row-level security and auth"
  hono:middleware      "Built-in and custom middleware" (weak match)

Map Enhancement

The blz map command uses content structure as primary, with index annotations:

blz map hono --tree

# Content structure (primary) + index annotations
Hono
├── Getting Started          ★ "Quick start guide"
│   ├── Installation
│   └── First App
├── Core Concepts
│   ├── Routing              ★ "URL pattern matching"
│   ├── Context              ★ "Request/response context"
│   └── Error Handling
└── Advanced
    └── Middleware           ★ "Built-in and custom"

The ★ indicates sections featured in the index. Descriptions come from index metadata.


Problem

Many documentation sites don't provide full content files. Users need a way to generate searchable content from existing doc sites, ideally guided by the index when available.

Goals

  1. Index-guided crawling — Use index entries as a manifest for targeted Firecrawl fetching
  2. Firecrawl integration — API (default) or self-hosted for speed/cost optimization
  3. Progressive loading — Index content as pages arrive, prioritize specific paths
  4. Metadata preservation — Link index entries to content line ranges
  5. Agent-assisted discovery — Smart identification when no index exists
  6. Resilient execution — Handle flaky networks, rate limits, resumable jobs
  7. Cost transparency — Show credit estimates before crawling

Non-Goals (for v1)

  • Native Rust HTML extraction (Firecrawl handles this well)
  • Automatic sync scheduling (manual blz sync for now)
  • Index generation from content (content → index inference)

Agent-Managed Crawling (Skill-Based)

Why Agent-Managed?

A dumb blz crawl <url> command has problems:

  • Pulls in entire site (marketing, blog, careers, etc.)
  • No intelligence about doc structure
  • Can't adapt mid-crawl based on what's found

An agent with a skill can:

  • Probe the site first to find the docs root
  • Analyze URL patterns to identify docs vs non-docs
  • Propose a crawl plan for user approval
  • Adapt based on early results

Skill: /crawl

/crawl https://example.com

Agent Workflow

1. DISCOVER
   │
   ├─ Check for existing llms.txt / llms-full.txt
   │   └─ If found → suggest using blz add instead
   │
   ├─ Map site URLs (firecrawl_map)
   │   └─ Get list of all discoverable paths
   │
   └─ Analyze URL patterns
       ├─ /docs/*, /documentation/*, /guide/* → likely docs
       ├─ /blog/*, /news/*, /press/* → skip
       ├─ /pricing, /careers, /about → skip
       └─ /api/*, /reference/* → likely API docs

2. PROPOSE
   │
   └─ Present crawl plan to user:

      "I found 847 URLs on example.com. Here's my analysis:

       📚 Docs (/docs/*): 234 pages
       📖 API Reference (/api/*): 156 pages
       📝 Blog (/blog/*): 312 pages — SKIP
       🏢 Marketing (/, /pricing, etc.): 45 pages — SKIP
       ❓ Unknown: 100 pages

       Proposed crawl: 390 pages (docs + API)
       Estimated time: ~5 minutes

       Should I proceed? [Y/n/customize]"

3. EXECUTE
   │
   ├─ Start crawl with approved include/exclude paths
   ├─ Stream results progressively (see below)
   └─ Report progress and any issues

4. FINALIZE
   │
   ├─ Assemble llms-full.txt
   ├─ Store and index
   └─ Confirm completion

Skill Definition

# /crawl

Crawl a documentation site and add it as a blz source.

## Usage
/crawl <url> [--alias <name>]

## What This Skill Does
1. Checks if llms.txt already exists (suggests `blz add` if so)
2. Maps the site to discover all URLs
3. Analyzes patterns to identify documentation vs other content
4. Proposes a crawl plan for your approval
5. Executes crawl with progressive indexing
6. Assembles and stores as a blz source

## Tools Used
- firecrawl_map: Discover site URLs
- firecrawl_crawl: Execute crawl
- firecrawl_check_crawl_status: Monitor progress
- blz_add: Store final result

Firecrawl as Extraction Engine

Firecrawl handles HTML → clean markdown. This is a hard problem (main content extraction, JS rendering, edge cases) that Firecrawl has solved well. We don't reinvent it — but we keep the interface pluggable.

What Firecrawl Provides

  • Main content extraction — Strips nav, footer, sidebar, cookie banners
  • Clean markdown conversion — Preserves code blocks, tables, headings
  • JS rendering — Handles SPAs and dynamic content
  • Edge case handling — Years of iteration on weird HTML

The onlyMainContent: true option is critical — without it you get navigation, footers, and boilerplate.

Known Limitations

Firecrawl excels at article-style pages but can struggle with:

  • API references — Complex tables, code tabs, interactive playgrounds
  • OpenAPI/JSON specs — Structured data that doesn't map cleanly to markdown
  • PDF-heavy docs — Embedded PDFs need separate handling

Mitigation: Store original HTML alongside markdown for sources where fidelity matters. Allow per-source raw_html: true option.

Firecrawl Options

┌─────────────────────────────────────────────────────────┐
│ Firecrawl API (default)                                 │
│ ├─ Just works, no setup                                 │
│ ├─ Pay-per-page pricing                                 │
│ └─ Best for: most users, occasional crawls              │
├─────────────────────────────────────────────────────────┤
│ Self-Hosted Firecrawl (power users)                     │
│ ├─ AGPL licensed, can self-host                         │
│ ├─ Unlimited crawls, no per-page cost                   │
│ ├─ No queue — dedicated capacity, faster throughput     │
│ ├─ Requires: Docker, Redis, Playwright                  │
│ └─ Best for: heavy users, air-gapped, speed + cost      │
└─────────────────────────────────────────────────────────┘

Configuration

# Default: Firecrawl API
blz config set firecrawl.api_key fc-xxxxx

# Self-hosted: point to local instance
blz config set firecrawl.url http://localhost:3002
blz config set firecrawl.api_key local  # or omit if not required

Index-Guided Crawling

When an index exists, we use it to minimize crawl scope:

Index has 24 URLs
        ↓
Firecrawl scrapes exactly those 24 pages
        ↓
Result: 24 credits instead of 200+ for blind crawl

The index tells us exactly what to fetch — no site mapping, no guessing, no marketing pages.

User Experience

blz add hono https://hono.dev/llms.txt

Fetching index... ✓
Found 24 documentation URLs
No llms-full.txt available

Generate content via Firecrawl?
  Pages: 24
  Estimated credits: 24

[Y/n] y

Crawling via Firecrawl...
  [████████████████████] 24/24 pages

Assembling content... ✓
Linking index entries... ✓

Added source 'hono'
  Index entries: 24
  Content: 45,230 lines
  Firecrawl credits used: 24

Cost Comparison

| Scenario                          | Firecrawl Credits |
|-----------------------------------|-------------------|
| Blind site crawl (discovery mode) | 100-500+          |
| Index-guided crawl (24 URLs)      | 24                |
| With smart sync (10% changed)     | ~3                |

The index dramatically reduces crawl scope. Smart sync reduces ongoing costs.

Self-Hosted Setup

BLZ includes a Docker Compose file for local Firecrawl:

# Start local Firecrawl (from blz repo)
docker compose -f docker/docker-compose.firecrawl.yml up -d

# BLZ auto-detects localhost:3002, or configure explicitly
blz config set firecrawl.url http://localhost:3002

The included docker-compose.firecrawl.yml:

# docker/docker-compose.firecrawl.yml
services:
  firecrawl:
    # Pin to specific digest for deterministic builds
    image: mendableai/firecrawl@sha256:<pin-digest-here>
    ports:
      - "3002:3002"
    environment:
      - REDIS_URL=redis://redis:6379
      # Disable telemetry to prevent leaking private docs
      - TELEMETRY_ENABLED=false
    depends_on:
      - redis
    # Security: restrict network access
    networks:
      - internal
    security_opt:
      - no-new-privileges:true

  redis:
    image: redis:alpine
    volumes:
      - redis-data:/data
    networks:
      - internal

networks:
  internal:
    driver: bridge

volumes:
  redis-data:

Requirements: Docker and ~2GB RAM for Chromium/Playwright.

Auto-detection: BLZ checks localhost:3002 before falling back to API. No config needed if local Firecrawl is running.
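
A minimal sketch of that check, assuming a plain TCP connect (rather than any particular health endpoint) is enough to decide, and that explicit configuration always wins:

```rust
use std::time::Duration;
use tokio::net::TcpStream;

async fn detect_firecrawl_endpoint(configured: Option<String>) -> String {
    // Explicit config always wins
    if let Some(url) = configured {
        return url;
    }

    // Otherwise, see if a local instance is listening on 3002
    let local = tokio::time::timeout(
        Duration::from_millis(250),
        TcpStream::connect("127.0.0.1:3002"),
    )
    .await;

    match local {
        Ok(Ok(_)) => "http://localhost:3002".to_string(),
        _ => "https://api.firecrawl.dev".to_string(), // fall back to the hosted API
    }
}
```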

Deterministic builds: Pin the Firecrawl image digest and checksum to avoid silent upstream changes. Update deliberately.

See Firecrawl self-hosting docs for advanced configuration.

AGPL Compliance

Firecrawl is AGPL-licensed. Key considerations:

  • Using the API: No license obligations — it's a service
  • Self-hosting unmodified: Fine, no source disclosure required
  • Modifying Firecrawl: Triggers source-offer obligations if distributed
  • Bundling in commercial product: Consult legal counsel

We include a Docker Compose file that pulls the official image — this is safe. If you modify Firecrawl itself, you must comply with AGPL.

Security Considerations

Firecrawl executes JavaScript when rendering pages. For untrusted sources:

  • Network isolation: The Docker Compose restricts egress by default
  • SSRF risk: Malicious pages could attempt to access internal services
  • Data exfiltration: JS could try to phone home with crawled content

Recommendations:

  • Use self-hosted Firecrawl for sensitive/internal documentation
  • Review the container's network access for your threat model
  • Disable telemetry (TELEMETRY_ENABLED=false) to prevent data leaks

Pluggable Extractor Interface

While Firecrawl is the primary extraction engine, we keep the interface pluggable for future flexibility.

Why Pluggable?

  • Avoid lock-in: Firecrawl could change pricing, API, or disappear
  • Offline fallback: Some users need air-gapped operation
  • Cost optimization: Simple sites don't need full JS rendering
  • Testing: Mock extractors for unit tests

Extractor Trait

#[async_trait]
pub trait Extractor: Send + Sync {
    /// Extract markdown content from a URL
    async fn extract(&self, url: &str, options: &ExtractOptions) -> Result<ExtractedPage>;

    /// Health check for the extractor
    async fn health(&self) -> Result<ExtractorHealth>;

    /// Extractor capabilities
    fn capabilities(&self) -> ExtractorCapabilities;
}

pub struct ExtractorCapabilities {
    pub js_rendering: bool,
    pub main_content_extraction: bool,
    pub rate_limit: Option<u32>,  // requests per minute
}

pub struct ExtractedPage {
    pub url: String,
    pub markdown: String,
    pub html: Option<String>,      // Original HTML if requested
    pub title: Option<String>,
    pub metadata: HashMap<String, String>,
}

Implementations

| Extractor               | JS Rendering | Quality | Cost          |
|-------------------------|--------------|---------|---------------|
| FirecrawlApiExtractor   | ✓            | High    | Pay-per-page  |
| FirecrawlLocalExtractor | ✓            | High    | Self-hosted   |
| ReadabilityExtractor    | ✗            | Medium  | Free (future) |

For v1, only Firecrawl extractors are implemented. The trait exists so we can add alternatives later without touching indexing logic.
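
Since testing is one of the motivations for the trait, a mock implementation might look like the sketch below. The names (`MockExtractor`, `ExtractorHealth::Ok`) are illustrative, not part of any existing API.

```rust
struct MockExtractor;

#[async_trait]
impl Extractor for MockExtractor {
    async fn extract(&self, url: &str, _options: &ExtractOptions) -> Result<ExtractedPage> {
        // Return canned markdown so indexing and assembly can be tested offline
        Ok(ExtractedPage {
            url: url.to_string(),
            markdown: format!("# Mock page\n\nContent for {url}"),
            html: None,
            title: Some("Mock page".to_string()),
            metadata: HashMap::new(),
        })
    }

    async fn health(&self) -> Result<ExtractorHealth> {
        Ok(ExtractorHealth::Ok)
    }

    fn capabilities(&self) -> ExtractorCapabilities {
        ExtractorCapabilities {
            js_rendering: false,
            main_content_extraction: false,
            rate_limit: None,
        }
    }
}
```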


Validation Layer

After extraction, validate content before indexing to catch parser failures:

fn validate_extraction(page: &ExtractedPage) -> Result<(), ValidationError> {
    // Minimum content length
    if page.markdown.len() < 200 {
        return Err(ValidationError::ContentTooShort);
    }

    // Check for extraction artifacts
    if page.markdown.contains("cookie") && page.markdown.contains("accept") {
        warn!("Possible cookie banner in extracted content: {}", page.url);
    }

    // Code block ratio check (API docs should have code).
    // Warn but don't fail — some pages legitimately have no code.
    let code_blocks = page.markdown.matches("```").count() / 2;
    if code_blocks == 0 {
        warn!("No code blocks in extracted content: {}", page.url);
    }

    Ok(())
}

Failed validations are logged and can trigger re-extraction or manual review.


Observability

Track crawl health and tune performance:

Metrics

struct CrawlMetrics {
    // Per-source
    pages_crawled: Counter,
    pages_failed: Counter,
    bytes_fetched: Counter,
    extraction_duration_ms: Histogram,

    // Tantivy
    commit_duration_ms: Histogram,
    segments_created: Counter,

    // Firecrawl
    api_latency_ms: Histogram,
    rate_limit_hits: Counter,
}

Logging

[INFO] crawl:start source=hono pages=24 extractor=firecrawl-api
[INFO] crawl:page url=/docs/getting-started status=ok bytes=12340 duration=1.2s
[WARN] crawl:page url=/docs/internal status=403 error="Forbidden"
[INFO] crawl:commit pages=10 duration=45ms segments=1
[INFO] crawl:complete source=hono pages=23/24 duration=48s

Health Checks

blz crawl --health
# Firecrawl API: ✓ (latency: 230ms)
# Firecrawl Local: ✓ (latency: 45ms)
# Tantivy: ✓ (segments: 3, docs: 45230)

Progressive/Priority Loading

Problem

Full site crawls can take 10+ minutes. Users shouldn't have to wait for everything before they can search.

Solution: Stream-and-Index

As pages complete crawling, index them immediately:

Crawl Progress          Search Availability
─────────────────       ────────────────────
Page 1 complete    ──>  Indexed, searchable
Page 2 complete    ──>  Indexed, searchable
Page 3 complete    ──>  Indexed, searchable
...                     ...
Page N complete    ──>  Full index ready
                        Assemble llms-full.txt

Priority Paths

Agent or user can specify paths to crawl first:

/crawl https://docs.example.com --priority "/api/hooks,/guides/getting-started"

Firecrawl crawl order:

  1. Priority paths (immediate need)
  2. Breadth-first from root (comprehensive coverage)

Mid-Crawl Priority Injection

Agent discovers it needs specific docs during a task:

Agent: "I need the React hooks documentation"
       ↓
Check: Is /docs/hooks already indexed?
       ↓
No  →  Inject /docs/hooks as priority in active crawl
       OR start targeted scrape just for that path
       ↓
Yes →  Return cached content

Implementation Sketch

struct ProgressiveCrawl {
    operation_id: String,
    partial_index: TantivyIndex,  // Growing index
    pending_pages: HashSet<String>,
    completed_pages: Vec<CrawledPage>,
    uncommitted_count: usize,
    last_commit: Instant,
}

const BATCH_SIZE: usize = 10;
const COMMIT_INTERVAL: Duration = Duration::from_secs(30);

impl ProgressiveCrawl {
    async fn poll_and_index(&mut self) -> Result<CrawlStatus> {
        let status = check_crawl_status(&self.operation_id).await?;

        // Index any newly completed pages
        for page in &status.data {
            if !self.completed_pages.iter().any(|p| p.url == page.url) {
                self.index_page(page).await?;
                self.completed_pages.push(page.clone());
            }
        }

        // Batch commits: every 10 pages or 30 seconds
        // Avoids Tantivy segment churn while staying progressive
        if self.uncommitted_count >= BATCH_SIZE
            || self.last_commit.elapsed() > COMMIT_INTERVAL
        {
            self.partial_index.commit()?;
            self.uncommitted_count = 0;
            self.last_commit = Instant::now();
        }

        Ok(status)
    }

    async fn index_page(&mut self, page: &CrawledPage) -> Result<()> {
        let doc = parse_markdown_to_doc(&page.markdown);
        self.partial_index.add_document(doc)?;
        self.uncommitted_count += 1;
        Ok(())
    }

    fn search(&self, query: &str) -> Result<Vec<SearchResult>> {
        // Search works even mid-crawl (searches committed segments)
        self.partial_index.search(query)
    }
}
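
For the mid-crawl priority injection described earlier, a hypothetical extension of ProgressiveCrawl might look like this. The `priority_queue` field (a VecDeque) and the `PathState` enum are assumptions, not part of the struct above.

```rust
enum PathState {
    Indexed,             // already crawled and searchable
    Prioritized,         // moved to the front of the active crawl
    NeedsTargetedScrape, // not in crawl scope; scrape it directly
}

impl ProgressiveCrawl {
    /// Ensure a path is either already searchable or moved to the front of the crawl.
    fn ensure_path(&mut self, path: &str) -> PathState {
        if self.completed_pages.iter().any(|p| p.url.ends_with(path)) {
            // Already indexed; the caller can search immediately
            return PathState::Indexed;
        }
        if self.pending_pages.contains(path) {
            // Still pending: bump it ahead of everything else
            self.priority_queue.push_front(path.to_string());
            return PathState::Prioritized;
        }
        // Not discovered by the crawl at all: fall back to a targeted scrape
        PathState::NeedsTargetedScrape
    }
}
```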

Partial Source State

Track that a source is still being crawled:

{
  "alias": "example",
  "status": "crawling",
  "pages_indexed": 67,
  "pages_total": 234,
  "crawl_job_id": "fc_abc123",
  "searchable": true,
  "complete": false
}

User sees:

blz list
# example    [crawling 67/234]  https://docs.example.com
# bun        [ready]            https://bun.sh/llms-full.txt

blz query "hooks" --source example
# ⚠ Source 'example' is still crawling (67/234 pages indexed)
# Results from indexed pages:
# ...

MCP Tool Design

Keep CLI and MCP aligned. MCP tools enable agents to orchestrate crawls programmatically.

Tool: blz_crawl

{
  "name": "blz_crawl",
  "description": "Crawl a documentation site to generate and index an llms-full.txt source",
  "parameters": {
    "action": {
      "type": "string",
      "enum": ["start", "status", "priority", "cancel", "resume"],
      "description": "Crawl action to perform"
    },
    "url": {
      "type": "string",
      "description": "Site URL to crawl (required for 'start')"
    },
    "alias": {
      "type": "string",
      "description": "Source alias (derived from domain if omitted)"
    },
    "jobId": {
      "type": "string",
      "description": "Job ID (required for status/priority/cancel/resume)"
    },
    "includePaths": {
      "type": "array",
      "items": { "type": "string" },
      "description": "Glob patterns for paths to include"
    },
    "excludePaths": {
      "type": "array",
      "items": { "type": "string" },
      "description": "Glob patterns for paths to exclude"
    },
    "priorityPaths": {
      "type": "array",
      "items": { "type": "string" },
      "description": "Paths to crawl first (for 'start' or 'priority' action)"
    },
    "limit": {
      "type": "integer",
      "description": "Maximum pages to crawl",
      "default": 500
    }
  }
}

Action: start

Begin a new crawl job.

// Request
{
  "action": "start",
  "url": "https://docs.example.com",
  "alias": "example",
  "includePaths": ["/docs/*", "/api/*"],
  "excludePaths": ["/blog/*"],
  "priorityPaths": ["/docs/getting-started", "/api/hooks"]
}

// Response
{
  "action": "start",
  "jobId": "blz_abc123",
  "status": "started",
  "url": "https://docs.example.com",
  "alias": "example",
  "estimatedPages": 234,
  "message": "Crawl started. Use status action to monitor progress."
}

Action: status

Check crawl progress and get partial results.

// Request
{
  "action": "status",
  "jobId": "blz_abc123"
}

// Response
{
  "action": "status",
  "jobId": "blz_abc123",
  "status": "crawling",  // "pending" | "crawling" | "completed" | "failed" | "paused"
  "progress": {
    "pagesDiscovered": 234,
    "pagesCrawled": 67,
    "pagesIndexed": 67,
    "pagesFailed": 2,
    "percentComplete": 28
  },
  "searchable": true,
  "canResume": true,
  "priorityQueue": ["/api/hooks"],  // paths still in priority queue
  "errors": [
    { "url": "/api/internal", "error": "403 Forbidden" }
  ]
}

Action: priority

Inject priority paths into an active crawl.

// Request
{
  "action": "priority",
  "jobId": "blz_abc123",
  "priorityPaths": ["/docs/advanced/hooks", "/api/useEffect"]
}

// Response
{
  "action": "priority",
  "jobId": "blz_abc123",
  "added": ["/docs/advanced/hooks", "/api/useEffect"],
  "alreadyCrawled": [],
  "message": "2 paths added to priority queue"
}

Action: resume

Resume a paused or failed crawl.

// Request
{
  "action": "resume",
  "jobId": "blz_abc123"
}

// Response
{
  "action": "resume",
  "jobId": "blz_abc123",
  "status": "crawling",
  "resumedFrom": {
    "pagesCrawled": 67,
    "pagesRemaining": 167
  }
}

Action: cancel

Stop a crawl and keep what's been indexed.

// Request
{
  "action": "cancel",
  "jobId": "blz_abc123",
  "keepPartial": true  // default: true
}

// Response
{
  "action": "cancel",
  "jobId": "blz_abc123",
  "status": "cancelled",
  "pagesKept": 67,
  "message": "Crawl cancelled. 67 pages retained and searchable."
}

CLI ↔ MCP Alignment

| CLI Command                    | MCP Action                               |
|--------------------------------|------------------------------------------|
| `blz crawl <url>`              | blz_crawl(action: "start", url: ...)     |
| `blz crawl --status <id>`      | blz_crawl(action: "status", jobId: ...)  |
| `blz crawl --priority <paths>` | blz_crawl(action: "priority", ...)       |
| `blz crawl --resume <id>`      | blz_crawl(action: "resume", jobId: ...)  |
| `blz crawl --cancel <id>`      | blz_crawl(action: "cancel", jobId: ...)  |

Resilience & Recovery

Failure Modes

| Failure           | Detection                   | Recovery                     |
|-------------------|-----------------------------|------------------------------|
| Network timeout   | Firecrawl status = failed   | Auto-retry with backoff      |
| Rate limited      | 429 response                | Pause, wait, resume          |
| Firecrawl outage  | API unreachable             | Pause job, retry later       |
| blz crash         | Job file exists, no process | Resume from checkpoint       |
| Partial page fail | Individual page 4xx/5xx     | Log, continue, report at end |

Job Checkpointing

Persist state frequently so we can resume from any point:

// ~/.local/share/blz/jobs/blz_abc123.json
{
  "id": "blz_abc123",
  "firecrawlOperationId": "fc_xyz789",
  "url": "https://docs.example.com",
  "alias": "example",
  "status": "crawling",
  "config": {
    "includePaths": ["/docs/*"],
    "excludePaths": ["/blog/*"],
    "limit": 500
  },
  "progress": {
    "discovered": ["/docs/intro", "/docs/api", ...],
    "crawled": ["/docs/intro", "/docs/api"],
    "indexed": ["/docs/intro", "/docs/api"],
    "failed": [{ "url": "/docs/internal", "error": "403", "attempts": 2 }],
    "priorityQueue": ["/api/hooks"]
  },
  "timestamps": {
    "startedAt": "2026-01-25T12:00:00Z",
    "lastCheckpoint": "2026-01-25T12:05:30Z",
    "lastActivity": "2026-01-25T12:05:28Z"
  },
  "retryState": {
    "consecutiveFailures": 0,
    "backoffUntil": null,
    "totalRetries": 3
  }
}

Auto-Recovery Logic

async fn crawl_with_recovery(job: &mut CrawlJob) -> Result<()> {
    loop {
        match poll_and_index(job).await {
            Ok(status) if status.is_complete() => {
                return finalize_crawl(job).await;
            }
            Ok(_) => {
                job.checkpoint().await?;
                job.reset_failure_count();
            }
            Err(e) if e.is_retryable() => {
                job.increment_failure_count();

                if job.consecutive_failures() >= MAX_RETRIES {
                    job.pause("Too many consecutive failures").await?;
                    return Err(e);
                }

                let backoff = exponential_backoff(job.consecutive_failures());
                job.set_backoff_until(Instant::now() + backoff).await?;

                tokio::time::sleep(backoff).await;
            }
            Err(e) => {
                job.fail(&e.to_string()).await?;
                return Err(e);
            }
        }
    }
}
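
The `exponential_backoff` and `MAX_RETRIES` values above are unspecified; a sketch matching the 5s/15s retry cadence shown in the scenarios below:

```rust
use std::time::Duration;

const MAX_RETRIES: u32 = 5; // assumed cap before the job is paused

fn exponential_backoff(consecutive_failures: u32) -> Duration {
    // 5s, 15s, 45s, ... capped at 5 minutes
    let base = 5u64;
    let factor = 3u64.saturating_pow(consecutive_failures.saturating_sub(1));
    Duration::from_secs(base.saturating_mul(factor).min(300))
}
```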

Resume Scenarios

Scenario 1: Network blip

Crawling... 67/234 pages
[Network timeout]
Retrying in 5s...
Retrying in 15s...
Resumed. 68/234 pages...

Scenario 2: Rate limited

Crawling... 120/234 pages
[Rate limited by Firecrawl]
Paused. Will resume at 12:15:00 (2 minutes)
...
Resumed. 121/234 pages...

Scenario 3: blz process killed

# Later...
blz crawl --jobs
# blz_abc123  [paused]  example  67/234 pages  (last activity: 5 min ago)

blz crawl --resume blz_abc123
# Resuming crawl for 'example'...
# 68/234 pages...

Scenario 4: Firecrawl outage

Crawling... 89/234 pages
[Firecrawl API unreachable]
Job paused. Firecrawl appears to be down.
Run 'blz crawl --resume blz_abc123' to retry.

# Status shows paused job
blz crawl --status blz_abc123
# Status: paused (Firecrawl unreachable)
# Pages indexed: 89 (searchable)
# Resume: blz crawl --resume blz_abc123

Sync & Updates

Problem

Documentation changes over time. How do we:

  1. Know when to update?
  2. Update efficiently (not re-crawl everything)?
  3. Handle structural changes (new pages, removed pages)?

Detection Strategies

Option A: Manual sync

blz sync example
# Checks for changes, re-crawls if needed

Option B: Staleness check

blz check example
# Source 'example' last synced 14 days ago
# Run 'blz sync example' to update

Option C: HTTP caching (ETag/Last-Modified)

  • Store ETag per page from initial crawl
  • On sync, do conditional requests
  • Only re-fetch changed pages
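
A sketch of this conditional check, assuming reqwest and per-page ETag/Last-Modified values stored at crawl time (the `StoredPageMeta` type is an assumption). This is the same check used as Tier 1 in the tiered sync strategy below.

```rust
use reqwest::Client;

async fn headers_unchanged(client: &Client, url: &str, stored: &StoredPageMeta) -> bool {
    let Ok(resp) = client.head(url).send().await else {
        // On any error, assume changed so the page gets a real check
        return false;
    };

    let etag = resp
        .headers()
        .get(reqwest::header::ETAG)
        .and_then(|v| v.to_str().ok());
    let last_modified = resp
        .headers()
        .get(reqwest::header::LAST_MODIFIED)
        .and_then(|v| v.to_str().ok());

    // Unchanged if either validator matches what we stored at crawl time
    etag.is_some_and(|e| Some(e) == stored.etag.as_deref())
        || last_modified.is_some_and(|m| Some(m) == stored.last_modified.as_deref())
}
```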

Firecrawl Change Tracking

Firecrawl has built-in change detection. Add changeTracking to formats:

{
  "url": "https://docs.example.com/page",
  "formats": ["markdown", "changeTracking"]
}

Response includes:

{
  "changeTracking": {
    "previousScrapeAt": "2026-01-15T10:00:00Z",
    "changeStatus": "same"  // "new" | "same" | "changed" | "removed"
  }
}

Key benefit: Firecrawl compares against your previous scrape (scoped to API key). You pay for a scrape but skip processing if changeStatus: "same".

Tiered Sync Strategy (Cost-Optimized)

Check cheapest sources first, only escalate when needed:

┌─────────────────────────────────────────────────────────────┐
│ Tier 0: Sitemap.xml (FREE - no Firecrawl)                   │
│         └─ Fetch sitemap, check <lastmod> dates             │
│         └─ If lastmod unchanged → skip page entirely        │
├─────────────────────────────────────────────────────────────┤
│ Tier 1: HEAD requests (FREE - no Firecrawl)                 │
│         └─ Check ETag / Last-Modified headers               │
│         └─ If headers unchanged → skip page                 │
├─────────────────────────────────────────────────────────────┤
│ Tier 2: Firecrawl with changeTracking (1 credit)            │
│         └─ changeStatus: "same" → skip processing           │
│         └─ changeStatus: "changed" → re-index page          │
│         └─ changeStatus: "new" → index new page             │
└─────────────────────────────────────────────────────────────┘

Sync Approaches

Full re-crawl (simplest, most reliable)

blz sync example --full
# Re-crawls entire site, replaces index

Pros: Simple, handles all changes
Cons: Slow, uses the most Firecrawl credits

Smart incremental (default, cost-optimized)

blz sync example
# 1. Map site for current URLs (detect new/removed)
# 2. Check sitemap.xml for lastmod hints (free)
# 3. HEAD requests for ETag/Last-Modified (free)
# 4. Firecrawl changeTracking for remaining (1 credit each)
# 5. Only re-index pages with changeStatus: "changed"

Pros: Minimal credit usage, catches all changes
Cons: More complex, multiple request types

Sitemap-only (if site has good sitemap)

blz sync example --sitemap-only
# 1. Fetch sitemap.xml
# 2. Check lastmod dates
# 3. Only re-crawl pages with newer lastmod

Pros: Very efficient, no Firecrawl credits for unchanged pages
Cons: Relies on an accurate sitemap (many sites don't update lastmod)

Sync Implementation

async fn sync_smart(source: &mut Source) -> Result<SyncResult> {
    let mut result = SyncResult::default();

    // Step 1: Map site for current URLs
    let current_urls: HashSet<String> = firecrawl_map(&source.url).await?.into_iter().collect();
    let stored_urls: HashSet<String> = source.pages.keys().cloned().collect();

    // Detect removed pages
    for url in stored_urls.difference(&current_urls) {
        source.remove_page(url).await?;
        result.removed.push(url.clone());
    }

    // Detect new pages (definitely need to crawl)
    let new_urls: Vec<_> = current_urls.difference(&stored_urls).collect();

    // Step 2: Check sitemap for lastmod hints (free)
    let sitemap_hints = fetch_sitemap_lastmod(&source.url).await.ok();

    // Step 3: For existing pages, use tiered checking
    let mut pages_to_check = Vec::new();

    for url in current_urls.intersection(&stored_urls) {
        // Tier 0: Sitemap lastmod
        if let Some(ref hints) = sitemap_hints {
            if let Some(lastmod) = hints.get(url) {
                if *lastmod <= source.pages[url].last_indexed {
                    continue; // Skip - sitemap says unchanged
                }
            }
        }

        // Tier 1: HEAD request for ETag/Last-Modified
        if let Ok(headers) = head_request(url).await {
            if headers_unchanged(&headers, &source.pages[url]) {
                continue; // Skip - headers say unchanged
            }
        }

        // Tier 2: Need Firecrawl check
        pages_to_check.push(url.clone());
    }

    // Step 4: Batch check with Firecrawl changeTracking
    for url in pages_to_check {
        let page = firecrawl_scrape(&url, &ScrapeOptions {
            formats: vec!["markdown", "changeTracking"],
            only_main_content: true,
            ..Default::default()
        }).await?;

        match page.change_tracking.change_status.as_str() {
            "same" => {
                // No change, skip
            }
            "changed" => {
                source.reindex_page(&page).await?;
                result.modified.push(url);
            }
            _ => {}
        }
    }

    // Step 5: Crawl new pages
    for url in new_urls {
        let page = firecrawl_scrape(&url, &default_scrape_options()).await?;
        source.index_page(&page).await?;
        result.new.push(url.clone());
    }

    // Step 6: Regenerate llms-full.txt if anything changed
    if !result.is_empty() {
        source.regenerate_llms_full().await?;
    }

    Ok(result)
}

Sync Cost Comparison

For a 200-page doc site with 10 changed pages:

| Approach                | Firecrawl Credits | Time    |
|-------------------------|-------------------|---------|
| Full re-crawl           | 200               | ~5 min  |
| Naive incremental       | 200               | ~5 min  |
| Smart (sitemap + HEAD)  | ~20-50            | ~1 min  |
| Smart (perfect sitemap) | ~10               | ~30 sec |

Firecrawl Observer Integration (Future)

Firecrawl Observer is an open-source monitoring tool. Could integrate for automated sync triggers:

Observer monitors docs.example.com
        ↓
Change detected → webhook to blz
        ↓
blz sync example --auto

This would enable "push-based" sync instead of polling.

Sync Metadata

Track sync state per source:

{
  "alias": "example",
  "origin": {
    "type": "crawled",
    "url": "https://docs.example.com",
    "crawlConfig": {
      "includePaths": ["/docs/*"],
      "excludePaths": ["/blog/*"]
    }
  },
  "sync": {
    "lastFullSync": "2026-01-15T10:00:00Z",
    "lastIncrementalSync": "2026-01-25T10:00:00Z",
    "pageCount": 234,
    "pageHashes": {
      "/docs/intro": "sha256:abc123",
      "/docs/api": "sha256:def456"
    },
    "sitemapUrl": "https://docs.example.com/sitemap.xml",
    "sitemapLastChecked": "2026-01-25T10:00:00Z"
  }
}

Sync CLI/MCP

CLI:

blz sync example              # Incremental (default)
blz sync example --full       # Full re-crawl
blz sync example --check      # Dry-run, show what would change
blz sync --all                # Sync all crawled sources

MCP:

{
  "action": "sync",
  "alias": "example",
  "mode": "incremental",  // "incremental" | "full" | "check"
}

Change Detection Response

{
  "action": "sync",
  "alias": "example",
  "mode": "check",
  "changes": {
    "new": ["/docs/new-feature", "/api/v2/hooks"],
    "modified": ["/docs/intro"],
    "removed": ["/docs/deprecated"],
    "unchanged": 230
  },
  "recommendation": "Incremental sync will update 4 pages",
  "estimatedTime": "~30 seconds"
}

Handling Removed Pages

When pages disappear:

  1. Remove from search index
  2. Keep in archive (optional, for history)
  3. Update llms-full.txt

async fn sync_incremental(source: &mut Source) -> Result<SyncResult> {
    let current_urls: HashSet<String> = map_site(&source.url).await?.into_iter().collect();
    let stored_urls: HashSet<String> = source.sync.page_hashes.keys().cloned().collect();

    let new_urls: Vec<&String> = current_urls.difference(&stored_urls).collect();
    let removed_urls: Vec<&String> = stored_urls.difference(&current_urls).collect();

    let mut result = SyncResult::default();

    // Crawl new pages
    for url in &new_urls {
        let page = scrape_page(url).await?;
        source.index_page(&page).await?;
        result.new.push((*url).clone());
    }

    // Check existing pages for changes (sample or full)
    for url in current_urls.intersection(&stored_urls) {
        if should_check(url, &source.sync) {
            let page = scrape_page(url).await?;
            let hash = hash_content(&page.markdown);
            if hash != source.sync.page_hashes[url] {
                source.reindex_page(&page).await?;
                result.modified.push(url.clone());
            }
        }
    }

    // Remove deleted pages
    for url in &removed_urls {
        source.remove_page(url).await?;
        result.removed.push((*url).clone());
    }

    // Regenerate llms-full.txt
    source.regenerate_llms_full().await?;

    Ok(result)
}

---

## Architecture

### External Dependency: Firecrawl

Firecrawl provides:
- Site crawling with JS rendering
- Markdown extraction with `onlyMainContent`
- Async job handling with status polling
- Rate limiting and politeness built-in

### Workflow

User                    blz                      Firecrawl
  │                      │                          │
  │  blz crawl           │                          │
  │─────────────────────>│                          │
  │                      │  firecrawl_crawl()       │
  │                      │─────────────────────────>│
  │                      │  operation_id            │
  │                      │<─────────────────────────│
  │                      │                          │
  │  [progress]          │  check_crawl_status()    │
  │<─────────────────────│─────────────────────────>│
  │                      │  status/pages            │
  │                      │<─────────────────────────│
  │                      │  ...                     │
  │                      │                          │
  │                      │  [crawl complete]        │
  │                      │<─────────────────────────│
  │                      │                          │
  │                      │  assemble_llms_full()    │
  │                      │  store_and_index()       │
  │                      │                          │
  │  ✓ Added 'example'   │                          │
  │<─────────────────────│                          │


---

## CLI Interface

### Basic Usage (Blocking)

```bash
# Crawl and add as new source
blz crawl https://docs.example.com

# Crawl with explicit alias
blz crawl https://docs.example.com --alias example

# Crawl with page limit
blz crawl https://docs.example.com --limit 100

# Dry run - show what would be crawled
blz crawl https://docs.example.com --dry-run
```

Background Mode

# Start crawl in background
blz crawl https://docs.example.com --background
# Output: Started crawl job: fc_abc123

# Check status
blz crawl --status fc_abc123
# Output: Status: crawling (67/120 pages)

# List all active jobs
blz crawl --jobs

# Cancel a job
blz crawl --cancel fc_abc123

Options

| Flag               | Description               | Default             |
|--------------------|---------------------------|---------------------|
| `--alias <name>`   | Source alias              | Derived from domain |
| `--limit <n>`      | Max pages to crawl        | 500                 |
| `--depth <n>`      | Max link depth            | 5                   |
| `--include <glob>` | Only crawl matching paths | -                   |
| `--exclude <glob>` | Skip matching paths       | -                   |
| `--background`     | Run in background         | false               |
| `--dry-run`        | Preview without crawling  | false               |
| `--yes`            | Skip confirmation         | false               |

Firecrawl Integration

Starting a Crawl

async fn start_crawl(url: &str, options: &CrawlOptions) -> Result<String> {
    let response = firecrawl_crawl(FirecrawlCrawlRequest {
        url: url.to_string(),
        limit: Some(options.limit),
        max_discovery_depth: Some(options.depth),
        include_paths: options.include.clone(),
        exclude_paths: options.exclude.clone(),
        scrape_options: Some(ScrapeOptions {
            formats: vec!["markdown".to_string()],
            only_main_content: true,
            ..Default::default()
        }),
        ..Default::default()
    }).await?;

    Ok(response.operation_id)
}

Polling Status

async fn poll_until_complete(
    operation_id: &str,
    progress_callback: impl Fn(CrawlProgress),
) -> Result<Vec<CrawledPage>> {
    loop {
        let status = firecrawl_check_crawl_status(operation_id).await?;

        progress_callback(CrawlProgress {
            completed: status.completed,
            total: status.total,
            status: status.status.clone(),
        });

        match status.status.as_str() {
            "completed" => return Ok(status.data),
            "failed" => return Err(anyhow!("Crawl failed: {}", status.error)),
            "cancelled" => return Err(anyhow!("Crawl was cancelled")),
            _ => {
                tokio::time::sleep(Duration::from_secs(2)).await;
            }
        }
    }
}

Assembly Logic

Page Ordering

Index-guided crawl: Use index order (maintainer-curated). This is the best ordering because the maintainer deliberately organized the entries.

Discovery crawl: Use breadth-first crawl order from root. This follows the natural site structure.

Fallback: URL path sort (alphabetical). Least ideal — /api/authentication before /api/getting-started.

fn order_pages(
    pages: Vec<CrawledPage>,
    index: Option<&IndexLayer>,
    crawl_order: &[String],
) -> Vec<CrawledPage> {
    if let Some(index) = index {
        // Index-guided: use maintainer's order
        let url_to_page: HashMap<_, _> = pages.into_iter()
            .map(|p| (p.url.clone(), p))
            .collect();

        index.entries.iter()
            .filter_map(|e| url_to_page.get(&e.url).cloned())
            .collect()
    } else if !crawl_order.is_empty() {
        // Discovery: use crawl order (breadth-first)
        let url_to_page: HashMap<_, _> = pages.into_iter()
            .map(|p| (p.url.clone(), p))
            .collect();

        crawl_order.iter()
            .filter_map(|url| url_to_page.get(url).cloned())
            .collect()
    } else {
        // Fallback: URL path sort
        let mut pages = pages;
        pages.sort_by(|a, b| {
            let path_a = Url::parse(&a.url).map(|u| u.path().to_string()).unwrap_or_default();
            let path_b = Url::parse(&b.url).map(|u| u.path().to_string()).unwrap_or_default();
            path_a.cmp(&path_b)
        });
        pages
    }
}

Heading Normalization

Ensure consistent heading hierarchy across pages:

fn normalize_headings(markdown: &str, page_title: &str) -> String {
    let mut output = String::new();

    // Add page title as h2 (reserve h1 for doc title)
    output.push_str(&format!("## {}\n\n", page_title));

    // Shift all headings down by one level
    for line in markdown.lines() {
        if line.starts_with('#') {
            // Count existing heading level
            let level = line.chars().take_while(|&c| c == '#').count();
            // Shift down (h1 -> h3, h2 -> h4, etc.)
            let new_level = (level + 2).min(6);
            let hashes = "#".repeat(new_level);
            let text = line.trim_start_matches('#').trim();
            output.push_str(&format!("{} {}\n", hashes, text));
        } else {
            output.push_str(line);
            output.push('\n');
        }
    }

    output
}

Full Assembly

fn assemble_llms_full(
    site_url: &str,
    site_name: &str,
    pages: Vec<CrawledPage>,
) -> String {
    let mut output = String::new();

    // Header
    output.push_str(&format!("# {}\n\n", site_name));
    output.push_str(&format!("> Documentation for {}.\n", site_name));
    output.push_str(&format!("> Generated by blz from {}\n\n", site_url));

    // Table of contents (optional)
    output.push_str("## Contents\n\n");
    for page in &pages {
        let title = extract_title(&page.markdown);
        output.push_str(&format!("- {}\n", title));
    }
    output.push_str("\n---\n\n");

    // Page content
    for page in pages {
        let title = extract_title(&page.markdown);
        let normalized = normalize_headings(&page.markdown, &title);
        output.push_str(&normalized);
        output.push_str("\n\n---\n\n");
    }

    output
}

Job Persistence

For background mode, persist job state:

Storage Location

~/.local/share/blz/jobs/
  <job_id>.json

Job Schema

{
  "id": "fc_abc123",
  "url": "https://docs.example.com",
  "alias": "example",
  "operation_id": "550e8400-e29b-41d4-a716-446655440000",
  "status": "crawling",
  "started_at": "2026-01-25T12:00:00Z",
  "completed_at": null,
  "pages_crawled": 67,
  "pages_total": 120,
  "error": null
}

Job Lifecycle

  1. Created: Job file written when crawl starts
  2. Running: Updated periodically with progress
  3. Completed: Marked complete, results stored as source
  4. Cleanup: Job file deleted after 24h or on --cleanup

Error Handling

Recoverable Errors

  • Network timeout → Retry with backoff (handled by Firecrawl)
  • Rate limited → Wait and retry (handled by Firecrawl)
  • Partial failure → Store what we got, warn user

Fatal Errors

  • Invalid URL → Exit with clear message
  • Auth required → Suggest manual download
  • Firecrawl API error → Show error, suggest retry

User Feedback

# Success
✓ Crawled 120 pages from docs.example.com
✓ Assembled llms-full.txt (45,230 lines)
✓ Added source 'example'

# Partial success
⚠ Crawled 98/120 pages (22 failed)
  - /api/private: 403 Forbidden
  - /old/deprecated: 404 Not Found
✓ Assembled llms-full.txt (38,450 lines)
✓ Added source 'example'

# Failure
✗ Crawl failed: Authentication required
  This site requires login. Try downloading manually and using:
  blz add example /path/to/downloaded.md

Decisions (Resolved)

Alias Derivation

Derive the alias from the domain; don't prompt interactively (prompts break agent workflows):

  • docs.example.com → example
  • example.com/docs → example
  • Collision → error with a suggestion: use --alias <name> or --force to overwrite
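
A sketch of that derivation using the url crate (collision handling elided):

```rust
fn derive_alias(site_url: &str) -> Option<String> {
    let host = url::Url::parse(site_url).ok()?.host_str()?.to_string();

    // Strip common subdomain prefixes: docs.example.com -> example.com
    let host = host
        .trim_start_matches("www.")
        .trim_start_matches("docs.")
        .to_string();

    // Take the leftmost remaining label: example.com -> example
    host.split('.').next().map(|s| s.to_string())
}
```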

Approval Modes

blz crawl <url>                    # Interactive (default): always prompt
blz crawl <url> --yes              # Skip confirmation
blz crawl <url> --auto=50          # Auto-approve if under 50 pages
blz crawl <url> --dry-run          # Show plan and credit estimate, don't execute

Page Limits

  • Default limit: 500 pages
  • Warn and require confirmation above 100 pages
  • --limit <n> to override

Rate Limiting

  • One active crawl job at a time (v1 simplicity)
  • Queue additional requests, don't reject

Provenance Metadata

Track how a source was created for transparency:

blz info hono

hono (crawled)
  Origin: https://hono.dev
  Type: index-guided
  Include: /docs/*, /api/*
  Pages: 156
  Last sync: 2026-01-25
  Status: ready
  Firecrawl: API

For sources with both index and content:

blz info clerk

clerk (index + content)
  Index: https://clerk.com/llms.txt (24 entries)
  Content: crawled from index URLs
  Pages: 24
  Last sync: 2026-01-20
  Status: ready

Future Enhancements

Incremental Updates

Track what's changed since last crawl:

blz sync example --check    # Show what would change
blz sync example            # Incremental update
blz sync example --full     # Force full re-crawl

Smart Assembly

  • Detect sidebar/nav structure for better ordering
  • Extract metadata (version, last updated)
  • Generate index from content (content → index inference)

Hybrid Crawl

When index exists but is incomplete:

  1. Fetch the index
  2. Map the site to find what's not covered
  3. Crawl only the gaps
  4. Merge into unified source
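
A sketch of the gap computation in steps 2-3, reusing helpers assumed earlier in this doc (`firecrawl_map`, `looks_like_doc_url`):

```rust
async fn hybrid_crawl_targets(index: &IndexLayer, site_url: &str) -> Result<Vec<String>> {
    let indexed: HashSet<String> = index.entries.iter().map(|e| e.url.clone()).collect();

    // Map the site, keep doc-looking pages the index doesn't already cover
    let gaps: Vec<String> = firecrawl_map(site_url)
        .await?
        .into_iter()
        .filter(|u| looks_like_doc_url(u))
        .filter(|u| !indexed.contains(u))
        .collect();

    Ok(gaps)
}
```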

Budget Caps

blz crawl <url> --max-credits 50
# Abort if crawl would exceed budget

Related Issues

  • BLZ-335: CLI command restructure (crawl fits as new top-level command)
  • BLZ-337: BLZ_FORMAT env var (crawl should respect output format)
  • BLZ-340: Error codes (crawl needs proper exit codes)

References
