Ultimate Web Scraping AI Skill (Claude, ChatGPT)

Discovery Strategies Reference

Systematic approaches to find the best data source on a target site. Work through these in order — stop as soon as you find a viable approach.

Table of Contents

  1. Network Request Interception
  2. Server-Rendered JSON Blobs
  3. CMS and Platform APIs
  4. GraphQL Endpoints
  5. Structured Data in HTML
  6. Data Attributes
  7. Stable CSS Selectors
  8. Discovery Report Template

1. Network Request Interception

The single most valuable discovery technique. Most modern sites load data via XHR/fetch — finding these endpoints gives you a clean JSON API.

What to look for

  • Requests to /api/, /v1/, /v2/, /_next/data/, /graphql
  • JSON responses with the data you need
  • Pagination parameters: ?page=2, ?offset=20, ?cursor=abc
  • Auth headers: Authorization: Bearer ..., API keys in query params

Critical: check page 2 and interactions

Many SPAs load the first page from server-rendered HTML but fetch subsequent pages via API. Always try these actions:

  • Click "next page" or "load more"
  • Change sort order or apply a filter
  • Scroll down (infinite scroll triggers)
  • Open a detail view / modal
  • Use the site's search

Each of these actions may reveal API endpoints not visible on initial load.

Using Claude in Chrome for discovery

Always use the read_network_requests tool rather than injecting JS-based network interceptors (e.g., patching fetch or XMLHttpRequest). JS interceptors are lost on page navigation, which is exactly when the most interesting requests happen (pagination, filter changes, search). The read_network_requests tool survives navigations and captures everything.

Workflow:

1. Call read_network_requests once to start tracking (even before navigating)
2. Navigate to the target URL
3. Wait for the page to load (2-3 seconds)
4. Read network requests filtered by useful patterns:
   - urlPattern="/api/"    → REST endpoints
   - urlPattern="graphql"  → GraphQL
   - urlPattern="_rsc"     → React Server Components
   - urlPattern="_next"    → Next.js data routes
5. Interact with the page (paginate, filter, search, sort)
6. Read network requests again — new endpoints often appear only on
   interaction, not on initial load
7. Inspect promising requests:
   - URL pattern and query parameters
   - Response format (JSON, RSC payload, HTML)
   - Pagination mechanism (page number, cursor, offset)

Why not JS interceptors? When you navigate to page 2 or apply a filter, the browser often does a full navigation (not a client-side transition). This destroys any fetch/XHR monkey-patches you injected. The read_network_requests tool is browser-level and persists across navigations within the same domain.

Using curl to verify an endpoint

Once you find a candidate endpoint, verify with curl:

curl -s 'https://example.com/api/products?page=1&limit=20' \
  -H 'Accept: application/json' \
  -H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)' \
  | python3 -m json.tool | head -50

Check (see the httpx sketch after this list):

  • Does it return JSON without browser context?
  • Does it require cookies or auth tokens?
  • Does pagination work by incrementing a parameter?
  • What's the rate limit? (check X-RateLimit-* headers)
  • Test the page size limit — many APIs accept much larger limit or per_page values than the frontend uses (e.g., 500 or 1000 instead of 20). Fewer requests means less rate limiting risk. Try increasing the limit parameter in steps: 50 → 100 → 250 → 500. Stop when the API errors or caps the response.
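The same checks can be scripted once the curl probe looks promising. A minimal httpx sketch; the URL, query parameters, and response keys are placeholders for whatever discovery turned up:

# /// script
# dependencies = ["httpx"]
# ///
import httpx

URL = "https://example.com/api/products"  # candidate endpoint from discovery

with httpx.Client(
    headers={"Accept": "application/json",
             "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)"},
    timeout=30,
    follow_redirects=True,
) as client:
    # No cookies, no auth headers: does it still return JSON?
    resp = client.get(URL, params={"page": 1, "limit": 20})
    print("status:", resp.status_code)

    # Surface any rate-limit hints the API exposes
    for header in ("X-RateLimit-Limit", "X-RateLimit-Remaining", "Retry-After"):
        if header in resp.headers:
            print(header, "=", resp.headers[header])

    data = resp.json()
    if isinstance(data, list):
        items = data
    else:
        items = data.get("items") or data.get("results") or data.get("data") or []
    print("items on page 1:", len(items))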

2. Server-Rendered JSON Blobs

Many frameworks embed the full page data as JSON in the initial HTML. This is often the easiest approach — one HTTP request, all data included.

Framework-specific patterns

| Framework | Where to look | Extraction |
|-----------|---------------|------------|
| Next.js | <script id="__NEXT_DATA__" type="application/json"> | json.loads(script.string)["props"]["pageProps"] |
| Nuxt.js | <script>window.__NUXT__= or <script id="__NUXT_DATA__"> | Parse the JS assignment or JSON |
| Remix | <script>window.__remixContext = | Parse the JS object |
| Gatsby | <script>/*! gatsby-page-data */ or inline pageContext | Extract from script content |
| SvelteKit | <script type="application/json" data-sveltekit-fetched> | json.loads(script.string) |
| Angular Universal | <script id="serverApp-state" type="application/json"> | json.loads(script.string) |
| Vue SSR | <script>window.__INITIAL_STATE__= | Parse the JS assignment |
| Astro | <script type="application/json" data-astro-*> | json.loads(script.string) |

React Server Components (RSC) payloads

Next.js App Router sites (13.4+) do NOT use __NEXT_DATA__. Instead they stream React Server Components as a custom line-based format. You'll recognise RSC by:

  • Network requests with ?_rsc= query parameter
  • Content-Type: text/x-component in response headers
  • Response body with lines like 0:["$","div",null,{"children":...}]

RSC payloads contain the full page data but in a nested, non-trivial format. During discovery, check if the same data is available more cleanly (JSON-LD, <script> tags, or an API). Only parse RSC directly when no better source exists.

When to use RSC parsing

RSC is the right choice when:

  • No JSON-LD or structured data in the HTML
  • No API endpoints found via network inspection
  • The site is a Next.js App Router app (RSC requests visible in network)
  • The RSC payload contains data not available elsewhere (e.g., aggregated counts, metadata, filters)

RSC format overview

Each line in an RSC response is: ID:TYPE_OR_JSON

0:["$","$L1",null,{"children":["$","div",null,...]}]
1:["$","$L2",null,{"data":{"products":[...]}}]
2:"$Sreact.suspense"
3:["$","ul",null,{"children":[["$","li",...]]}]
  • Lines starting with a number + colon contain data or component trees
  • JSON arrays/objects are embedded within each line
  • Data you want is typically nested inside arrays with "children" or "data" keys

Extraction pattern

import json
import re

import httpx


def fetch_rsc(url: str, client: httpx.Client) -> str:
    """Fetch the RSC payload for a URL."""
    resp = client.get(
        url,
        params={"_rsc": "1"},  # trigger RSC response
        headers={
            "RSC": "1",  # required header for RSC
            "Next-Router-State-Tree": "",
            **HEADERS,
        },
    )
    resp.raise_for_status()
    return resp.text


def parse_rsc_payload(rsc_text: str) -> list[dict]:
    """Extract JSON objects from RSC line protocol.
    
    RSC lines look like:
        0:["$","div",null,{...}]
        1:{"key":"value"}
        2:"plain string"
    
    We extract all parseable JSON from each line and collect
    any dicts/lists that look like data (not React component trees).
    """
    results = []

    for line in rsc_text.splitlines():
        # Strip the line ID prefix: "0:", "1a:", "2f:" etc.
        match = re.match(r'^[0-9a-f]+:', line)
        if not match:
            continue
        payload = line[match.end():]

        try:
            parsed = json.loads(payload)
        except (json.JSONDecodeError, TypeError):
            continue

        # Collect interesting data — skip plain strings and React markers
        if isinstance(parsed, dict):
            results.append(parsed)
        elif isinstance(parsed, list):
            # Walk the RSC array to find nested data objects
            _extract_data_from_rsc_tree(parsed, results)

    return results


def _extract_data_from_rsc_tree(node, results: list):
    """Recursively walk an RSC tree and extract data-like dicts.
    
    RSC arrays follow the pattern: ["$", "tag", key, props_dict]
    Data objects are typically inside props dicts under keys like
    "data", "items", "results", "children", "pageProps".
    """
    if isinstance(node, dict):
        # If it has data-like keys, collect it
        data_keys = {"data", "items", "results", "products", "listings",
                     "pageProps", "initialData", "props"}
        if data_keys & set(node.keys()):
            results.append(node)
        # Also recurse into all dict values
        for v in node.values():
            if isinstance(v, (dict, list)):
                _extract_data_from_rsc_tree(v, results)
    elif isinstance(node, list):
        for item in node:
            if isinstance(item, (dict, list)):
                _extract_data_from_rsc_tree(item, results)

Tips for RSC discovery

  • Compare RSC payload vs HTML source: sometimes the HTML already contains the data as JSON-LD or <script> tags (server-rendered for SEO), making RSC parsing unnecessary.
  • Multiple RSC requests: App Router may split data across several RSC fetches (shell, page data, suspended chunks). Check all _rsc requests.
  • The RSC: 1 header is required — without it, the server returns full HTML instead of the RSC payload.
  • Pagination via RSC: when paginating client-side, the browser fetches only the RSC diff (not full HTML). This can be much smaller and faster to parse than the full page.

Framework detection cheat sheet

import json
import re

import httpx
from bs4 import BeautifulSoup

html = httpx.get(url).text
soup = BeautifulSoup(html, "html.parser")

# Next.js
tag = soup.find("script", id="__NEXT_DATA__")
if tag:
    data = json.loads(tag.string)
    page_data = data["props"]["pageProps"]

# Nuxt
tag = soup.find("script", string=re.compile(r"window\.__NUXT__"))
if tag:
    # Extract JSON from: window.__NUXT__={...}
    match = re.search(r'window\.__NUXT__\s*=\s*(.+?);\s*$', tag.string, re.DOTALL)
    if match:
        data = json.loads(match.group(1))

# Generic initial state
for tag in soup.find_all("script"):
    if tag.string and "window.__INITIAL_STATE__" in tag.string:
        match = re.search(r'window\.__INITIAL_STATE__\s*=\s*(.+?);\s*$',
                          tag.string, re.DOTALL)
        if match:
            data = json.loads(match.group(1))

3. CMS and Platform APIs

If the site runs on a known platform, there's almost certainly a REST API.

WordPress

# Check if it's WordPress
curl -s https://example.com/wp-json/ | python3 -m json.tool

# List posts (public by default)
curl -s 'https://example.com/wp-json/wp/v2/posts?per_page=20&page=1'

# Other endpoints
/wp-json/wp/v2/pages
/wp-json/wp/v2/categories
/wp-json/wp/v2/tags
/wp-json/wp/v2/users        # sometimes restricted
/wp-json/wp/v2/media
/wp-json/wp/v2/comments

Detection: look for <link rel="https://api.w.org/"> in the HTML head, or /wp-content/ and /wp-includes/ paths in the page source.
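If the wp-json route responds, pagination can be scripted directly. A minimal httpx sketch, assuming the default public wp/v2/posts route (WordPress caps per_page at 100 and reports the page count in the X-WP-TotalPages response header):

# /// script
# dependencies = ["httpx"]
# ///
import httpx

BASE = "https://example.com"  # target WordPress site

with httpx.Client(timeout=30, follow_redirects=True) as client:
    # The first page also tells us how many pages exist
    resp = client.get(f"{BASE}/wp-json/wp/v2/posts", params={"per_page": 100, "page": 1})
    resp.raise_for_status()
    total_pages = int(resp.headers.get("X-WP-TotalPages", "1"))
    posts = resp.json()

    for page in range(2, total_pages + 1):
        resp = client.get(f"{BASE}/wp-json/wp/v2/posts", params={"per_page": 100, "page": page})
        resp.raise_for_status()
        posts.extend(resp.json())

print(len(posts), "posts fetched")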

Shopify

# Product listing (JSON)
curl -s 'https://store.example.com/products.json?limit=250&page=1'

# Collection products
curl -s 'https://store.example.com/collections/all/products.json'

# Single product
curl -s 'https://store.example.com/products/product-handle.json'

# Storefront API (GraphQL, may need token)
curl -s 'https://store.example.com/api/2024-01/graphql.json' \
  -H 'Content-Type: application/json' \
  -H 'X-Shopify-Storefront-Access-Token: TOKEN' \
  -d '{"query": "{ products(first: 10) { edges { node { title } } } }"}'

Detection: Shopify.theme, cdn.shopify.com, myshopify.com.

Drupal (JSON:API)

curl -s 'https://example.com/jsonapi/node/article?page[limit]=20'

Detection: <meta name="Generator" content="Drupal", /sites/default/files/.

Ghost

curl -s 'https://example.com/ghost/api/content/posts/?key=API_KEY&limit=20'

Detection: <meta name="generator" content="Ghost".

Webflow

# Webflow sites expose collection data via their API
# Look for data-w-id attributes and /api/v1/ endpoints

Detection: webflow.js, data-wf-site, data-wf-page.

Squarespace

# Append ?format=json to most URLs
curl -s 'https://example.com/blog?format=json'

Detection: squarespace.com, static.squarespace.com.

Wix

# Wix sites use a data API internally
# Look for /_api/ endpoints in XHR requests
# Common: /_api/communities-blog-node-api/

Detection: wix.com, parastorage.com, static.wixstatic.com.


4. GraphQL Endpoints

Detection

Common GraphQL endpoint URLs:

  • /graphql
  • /gql
  • /api/graphql
  • /v1/graphql
  • /__graphql

Check for:

  • Requests with Content-Type: application/json and a query field in body
  • The string __typename in responses
  • Relay-style edges / node / pageInfo patterns

Introspection query

curl -s 'https://example.com/graphql' \
  -H 'Content-Type: application/json' \
  -d '{"query": "{ __schema { queryType { name } types { name fields { name type { name } } } } }"}'

Many production endpoints disable introspection, but it's worth trying. If it works, you get the full schema — which makes building queries trivial.

Query building without introspection

If introspection is disabled, observe the queries the frontend makes (via network tab), then adapt them. The frontend's queries are typically well-tested and paginated.
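A minimal httpx sketch of that approach. The query shape below is a hypothetical example with Relay-style cursor pagination; the real field names come from whatever request body the frontend sends:

# /// script
# dependencies = ["httpx"]
# ///
import httpx

GRAPHQL_URL = "https://example.com/graphql"  # endpoint found during discovery

# Hypothetical query: copy the real one from the network tab and trim unused fields
QUERY = """
query Products($first: Int!, $after: String) {
  products(first: $first, after: $after) {
    edges { node { id title } }
    pageInfo { hasNextPage endCursor }
  }
}
"""

def fetch_all(client: httpx.Client) -> list[dict]:
    items, cursor = [], None
    while True:
        resp = client.post(
            GRAPHQL_URL,
            json={"query": QUERY, "variables": {"first": 50, "after": cursor}},
        )
        resp.raise_for_status()
        conn = resp.json()["data"]["products"]
        items.extend(edge["node"] for edge in conn["edges"])
        if not conn["pageInfo"]["hasNextPage"]:
            return items
        cursor = conn["pageInfo"]["endCursor"]

with httpx.Client(timeout=30) as client:
    print(len(fetch_all(client)), "items")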


5. Structured Data in HTML

Many sites embed structured data for SEO. This is clean, reliable, and rarely changes.

JSON-LD

import json
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")
for tag in soup.find_all("script", type="application/ld+json"):
    data = json.loads(tag.string)
    # Common types: Product, Article, Event, Recipe, Organization
    print(data.get("@type"), data)

Microdata / RDFa

Less convenient to parse but still structured. Use the extruct library:

import extruct

data = extruct.extract(html, syntaxes=["json-ld", "microdata", "rdfa", "opengraph"])

6. Data Attributes

Modern frameworks often attach data to DOM elements via data- attributes. These are typically more stable than class names.

What to look for

# Find elements with data attributes
for elem in soup.find_all(attrs={"data-product-id": True}):
    print(elem["data-product-id"], elem.get("data-price"))

# Common patterns
# data-id, data-item-id, data-product-id
# data-price, data-currency
# data-category, data-type
# data-url, data-href, data-src
# data-testid (React Testing Library — surprisingly stable)
# data-cy (Cypress test IDs — also stable)

7. Stable CSS Selectors

Last resort for DOM-based scraping. Prefer semantic HTML elements over class names.

Good selectors (stable across redesigns)

article                  /* semantic HTML5 */
article h2 a            /* link inside article heading */
table tbody tr td        /* table data */
li                       /* list items */
time[datetime]           /* dates with machine-readable value */
[itemprop="name"]        /* microdata attributes */
nav a                    /* navigation links */
main section             /* main content sections */
figure img               /* images with semantic wrapper */

Bad selectors (will break)

.css-1a2b3c             /* CSS-in-JS generated */
.MuiButton-root         /* Material UI internals */
.sc-bdVTJa              /* styled-components hash */
.tw-flex.tw-gap-4       /* Tailwind utility soup */
._3xk2z                 /* minified class names */
div > div > div > span  /* fragile nesting */

Resilience tips

  • Anchor on <table>, <article>, <section> when possible
  • Use attribute selectors: [role="listitem"], [aria-label="Price"]
  • Combine tag + attribute: li[data-testid]
  • Avoid nth-child selectors — position changes break them
  • Test that the selector matches a consistent element count across multiple pages (see the sketch after this list)
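A quick way to test that last point: fetch a few pages and compare how many elements a candidate selector matches on each (the URLs and selector below are placeholders):

# /// script
# dependencies = ["httpx", "beautifulsoup4"]
# ///
import httpx
from bs4 import BeautifulSoup

SELECTOR = "article h2 a"  # candidate selector under test
URLS = [
    "https://example.com/products?page=1",
    "https://example.com/products?page=2",
    "https://example.com/products?page=3",
]

counts = []
with httpx.Client(timeout=30, follow_redirects=True) as client:
    for url in URLS:
        soup = BeautifulSoup(client.get(url).text, "html.parser")
        counts.append(len(soup.select(SELECTOR)))
        print(url, "->", counts[-1], "matches")

# A stable listing selector should match roughly the same count on every page
if len(set(counts)) > 1:
    print("Warning: match count varies across pages; selector may be unreliable")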

8. Discovery Report Template

After running discovery, produce a brief report for the user:

## Discovery Report: [site name]

**Target URL:** https://example.com/products
**Date:** 2024-01-15

### Findings

| # | Method | Endpoint/Selector | Auth needed? | Paginated? | Max page size | Notes |
|---|--------|-------------------|-------------|-----------|--------------|-------|
| 1 | REST API | /api/v1/products?page={n} | No | Yes (page param) | 250 (tested) | Returns JSON, default 20 items/page |
| 2 | __NEXT_DATA__ | Embedded in HTML | No | No (single page) | N/A | Full product list on first load |
| 3 | CSS selectors | article.product-card | No | No | N/A | 20 items per page load |

### Recommendation

**Use approach #1 (REST API)** because:
- Clean JSON response, no HTML parsing needed
- Built-in pagination
- No authentication required
- Likely the most stable (API contracts change less than HTML)

### Caveats
- robots.txt allows /api/ paths
- No visible rate limit headers (recommend 1-2 req/s)
- TOS does not explicitly prohibit scraping

Scraping Patterns Reference

Python script templates and patterns for the scraping phase. All scripts use uv inline metadata for dependency management.

Table of Contents

  1. Script Template
  2. HTTP Client Ladder
  3. Output Handlers
  4. Pagination Patterns
  5. Playwright Stealth
  6. Error Handling and Retry
  7. Rate Limiting
  8. Common Extraction Patterns

1. Script Template

Every generated script should follow this structure. Use uv script metadata so the user can run it with uv run scrape.py without managing a virtual environment.

# /// script
# requires-python = ">=3.11"
# dependencies = [
#     "httpx",
#     "selectolax",
#     # add "pyarrow" for --format parquet and "duckdb" for --connection output
# ]
# ///
"""
Scraper: [Site Name] — [what it extracts]
Discovery: [method used, e.g., REST API at /api/v1/products]

Usage:
    uv run scrape.py --output data/products.jsonl
    uv run scrape.py --output data/products.csv --format csv
    uv run scrape.py --connection "postgresql://user:pass@host/db?table=products"
"""

import argparse
import json
import logging
import sys
import time
from datetime import datetime, timezone
from pathlib import Path

import httpx

# ── Configuration ──────────────────────────────────────────────
BASE_URL = "https://example.com"
API_URL = f"{BASE_URL}/api/v1/products"
USER_AGENT = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/131.0.0.0 Safari/537.36"
)
HEADERS = {
    "User-Agent": USER_AGENT,
    "Accept": "application/json",
    "Accept-Language": "en-US,en;q=0.9",
}
REQUEST_DELAY = 1.5  # seconds between requests
PAGE_SIZE = 20  # Try increasing during discovery — many APIs accept 100-500+

# ── Logging ────────────────────────────────────────────────────
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
    stream=sys.stderr,
)
log = logging.getLogger(__name__)


def parse_args() -> argparse.Namespace:
    p = argparse.ArgumentParser(description=__doc__)
    p.add_argument("--output", "-o", type=Path, help="Output file path")
    p.add_argument(
        "--format", "-f",
        choices=["jsonl", "csv", "parquet"],
        default="jsonl",
        help="Output format (default: jsonl)",
    )
    p.add_argument("--connection", "-c", help="DB connection string")
    p.add_argument("--page-limit", type=int, default=0, help="Max pages (0=all)")
    p.add_argument("--date-range", help="Date range: YYYY-MM-DD:YYYY-MM-DD")
    return p.parse_args()


def fetch_page(client: httpx.Client, page: int) -> list[dict]:
    """Fetch a single page of results. Adapt to your endpoint."""
    resp = client.get(API_URL, params={"page": page, "limit": PAGE_SIZE})
    resp.raise_for_status()
    data = resp.json()
    return data.get("items", data.get("results", data.get("data", [])))


def transform(raw: dict) -> dict:
    """Normalize a single raw record. Customize per site."""
    return {
        "id": raw.get("id"),
        "title": raw.get("title", "").strip(),
        "url": raw.get("url"),
        "scraped_at": datetime.now(timezone.utc).isoformat(),
        # add more fields as needed
    }


def scrape(args: argparse.Namespace) -> list[dict]:
    records = []
    page = 1
    with httpx.Client(headers=HEADERS, timeout=30, follow_redirects=True) as client:
        while True:
            log.info("Fetching page %d", page)
            items = fetch_page(client, page)
            if not items:
                log.info("No more items, stopping at page %d", page)
                break
            for item in items:
                records.append(transform(item))
            log.info("Got %d items (total: %d)", len(items), len(records))
            if args.page_limit and page >= args.page_limit:
                log.info("Reached page limit %d", args.page_limit)
                break
            page += 1
            time.sleep(REQUEST_DELAY)
    return records


def write_output(records: list[dict], args: argparse.Namespace) -> None:
    if args.connection:
        write_to_db(records, args.connection)
        return

    output = args.output or Path(f"output.{args.format}")
    output.parent.mkdir(parents=True, exist_ok=True)

    if args.format == "jsonl":
        with open(output, "w") as f:
            for r in records:
                f.write(json.dumps(r, ensure_ascii=False) + "\n")
    elif args.format == "csv":
        import csv
        if not records:
            return
        with open(output, "w", newline="") as f:
            w = csv.DictWriter(f, fieldnames=records[0].keys())
            w.writeheader()
            w.writerows(records)
    elif args.format == "parquet":
        import pyarrow as pa
        import pyarrow.parquet as pq
        table = pa.Table.from_pylist(records)
        pq.write_table(table, output)

    log.info("Wrote %d records to %s", len(records), output)


def write_to_db(records: list[dict], connection: str) -> None:
    """Write records to a database. Supports PostgreSQL connection strings."""
    import duckdb
    db = duckdb.connect()
    db.execute("INSTALL postgres; LOAD postgres;")
    # Use DuckDB to write to postgres via its postgres extension,
    # or adapt to use psycopg/sqlalchemy as preferred
    import pyarrow as pa
    table = pa.Table.from_pylist(records)
    db.register("records", table)
    # Adapt table name from connection string query params
    from urllib.parse import urlparse, parse_qs
    parsed = urlparse(connection)
    table_name = parse_qs(parsed.query).get("table", ["scraped_data"])[0]
    clean_conn = connection.split("?")[0]
    db.execute(f"ATTACH '{clean_conn}' AS pg (TYPE postgres)")
    db.execute(f"CREATE TABLE IF NOT EXISTS pg.{table_name} AS SELECT * FROM records WHERE false")
    db.execute(f"INSERT INTO pg.{table_name} SELECT * FROM records")
    log.info("Wrote %d records to %s.%s", len(records), clean_conn, table_name)


def main() -> int:
    args = parse_args()
    try:
        records = scrape(args)
        if not records:
            log.warning("No records scraped")
            return 1
        write_output(records, args)
        return 0
    except httpx.HTTPStatusError as e:
        log.error("HTTP error: %s", e)
        return 2
    except Exception as e:
        log.error("Fatal error: %s", e, exc_info=True)
        return 2


if __name__ == "__main__":
    sys.exit(main())

2. HTTP Client Ladder

Try these in order. Move to the next only if the previous one gets blocked.

Level 1: httpx (default)

# /// script
# dependencies = ["httpx[http2]"]
# ///
import httpx

with httpx.Client(
    headers=HEADERS,
    timeout=30,
    follow_redirects=True,
    http2=True,  # HTTP/2 requires the httpx[http2] extra (h2 package)
) as client:
    resp = client.get(url)

Level 2: curl_cffi (TLS fingerprint bypass)

When sites block based on TLS fingerprint (JA3/JA4). curl_cffi impersonates real browser TLS fingerprints.

# /// script
# dependencies = ["curl_cffi"]
# ///
from curl_cffi import requests as curl_requests

resp = curl_requests.get(
    url,
    headers=HEADERS,
    impersonate="chrome131",  # match a real browser
    timeout=30,
)

Available impersonation targets (use recent ones):

  • "chrome131", "chrome124", "chrome120"
  • "safari18_0", "safari17_5"
  • "edge131"

Level 3: Playwright (full browser, last resort)

See Playwright Stealth section below.


3. Output Handlers

JSONL (default, recommended)

Append-friendly, one JSON object per line. Best for streaming and incremental scraping.

with open(output, "w") as f:
    for record in records:
        f.write(json.dumps(record, ensure_ascii=False, default=str) + "\n")

CSV

import csv

with open(output, "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=records[0].keys())
    writer.writeheader()
    writer.writerows(records)

Parquet

Best for analytical workloads and DuckDB/MotherDuck ingestion.

# /// script
# dependencies = ["pyarrow"]
# ///
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.Table.from_pylist(records)
pq.write_table(table, output, compression="snappy")

PostgreSQL via DuckDB

# /// script
# dependencies = ["duckdb", "pyarrow"]
# ///
import duckdb

db = duckdb.connect()
db.install_extension("postgres")
db.load_extension("postgres")
table = pa.Table.from_pylist(records)
db.register("records", table)
db.execute(f"ATTACH '{conn_string}' AS pg (TYPE postgres)")
db.execute(f"INSERT INTO pg.{table_name} SELECT * FROM records")

4. Pagination Patterns

Probe the maximum page size first

Before paginating, test whether the API accepts larger page sizes than the frontend defaults. Many APIs let you request 250–500 items per page even if the UI only shows 20. This dramatically reduces the number of requests and lowers the chance of hitting rate limits.

# During discovery, probe the max page size:
for size in [50, 100, 250, 500]:
    resp = client.get(url, params={"limit": size, "page": 1})
    if resp.status_code == 200 and len(resp.json().get("items", [])) > 0:
        log.info("Page size %d works (%d items)", size, len(resp.json()["items"]))
    else:
        log.info("Page size %d rejected or capped", size)
        break

Page-number based

page = 1
while True:
    data = client.get(url, params={"page": page, "per_page": 20}).json()
    items = data["items"]
    if not items:
        break
    yield from items
    page += 1
    time.sleep(REQUEST_DELAY)

Cursor-based

cursor = None
while True:
    params = {"limit": 20}
    if cursor:
        params["cursor"] = cursor
    data = client.get(url, params=params).json()
    yield from data["items"]
    cursor = data.get("next_cursor")
    if not cursor:
        break
    time.sleep(REQUEST_DELAY)

Offset-based

offset = 0
limit = 20
while True:
    data = client.get(url, params={"offset": offset, "limit": limit}).json()
    items = data["results"]
    if not items:
        break
    yield from items
    offset += limit
    time.sleep(REQUEST_DELAY)

Next-URL based

next_url = initial_url
while next_url:
    data = client.get(next_url).json()
    yield from data["results"]
    next_url = data.get("next")  # full URL to next page
    time.sleep(REQUEST_DELAY)

Infinite scroll / load-more (JS required)

If all HTTP approaches fail for pagination, use Playwright:

async def scroll_and_collect(page):
    previous_height = 0
    while True:
        await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
        await page.wait_for_timeout(2000)
        current_height = await page.evaluate("document.body.scrollHeight")
        if current_height == previous_height:
            break
        previous_height = current_height

5. Playwright Stealth

Only use when httpx and curl_cffi both fail. Apply ALL of these anti-detection measures.

# /// script
# requires-python = ">=3.11"
# dependencies = [
#     "playwright",
# ]
# ///
"""Run `playwright install chromium` before first use."""

from playwright.sync_api import sync_playwright

# Same UA string as in the script template
USER_AGENT = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/131.0.0.0 Safari/537.36"
)

def create_stealth_browser():
    pw = sync_playwright().start()
    browser = pw.chromium.launch(
        headless=True,
        args=[
            "--disable-blink-features=AutomationControlled",
            "--disable-features=IsolateOrigins,site-per-process",
            "--disable-infobars",
            "--no-first-run",
            "--no-default-browser-check",
        ],
    )
    context = browser.new_context(
        user_agent=USER_AGENT,
        viewport={"width": 1920, "height": 1080},
        locale="en-US",
        timezone_id="America/New_York",
        # Realistic screen params
        screen={"width": 1920, "height": 1080},
        color_scheme="light",
    )

    # Remove automation indicators
    context.add_init_script("""
        // Remove webdriver flag
        Object.defineProperty(navigator, 'webdriver', { get: () => undefined });

        // Fix plugins array (headless has empty plugins)
        Object.defineProperty(navigator, 'plugins', {
            get: () => [1, 2, 3, 4, 5],
        });

        // Fix languages
        Object.defineProperty(navigator, 'languages', {
            get: () => ['en-US', 'en'],
        });

        // Fix permissions query
        const originalQuery = window.navigator.permissions.query;
        window.navigator.permissions.query = (parameters) =>
            parameters.name === 'notifications'
                ? Promise.resolve({ state: Notification.permission })
                : originalQuery(parameters);

        // Fix chrome object
        window.chrome = { runtime: {} };

        // Fix connection info
        Object.defineProperty(navigator, 'connection', {
            get: () => ({ effectiveType: '4g', rtt: 50, downlink: 10, saveData: false }),
        });

        // Hide headless renderer
        Object.defineProperty(navigator, 'hardwareConcurrency', { get: () => 8 });
        Object.defineProperty(navigator, 'deviceMemory', { get: () => 8 });
    """)

    page = context.new_page()
    return pw, browser, page

Anti-detection checklist

When using Playwright, verify ALL of these (a verification sketch follows the list):

  • navigator.webdriver returns undefined (not true)
  • navigator.plugins has entries (not empty)
  • navigator.languages is set (not empty)
  • window.chrome object exists
  • User-Agent matches a real, recent browser
  • Viewport is a realistic resolution (1920x1080, not 800x600)
  • --disable-blink-features=AutomationControlled is set
  • Real-looking mouse movements and delays between actions
  • Timezone and locale are set consistently
  • hardwareConcurrency and deviceMemory return reasonable values
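A small sketch that checks several of these from Python, assuming the create_stealth_browser() helper above; run it once to confirm the init script took effect:

# Quick spot-check of the anti-detection checklist
pw, browser, page = create_stealth_browser()
page.goto("https://example.com")

checks = page.evaluate("""() => ({
    webdriver: String(navigator.webdriver),          // expect "undefined"
    plugins: navigator.plugins.length,               // expect > 0
    languages: navigator.languages,                  // expect non-empty
    chrome: typeof window.chrome,                    // expect "object"
    hardwareConcurrency: navigator.hardwareConcurrency,
    deviceMemory: navigator.deviceMemory,
    userAgent: navigator.userAgent,
})""")

for key, value in checks.items():
    print(f"{key}: {value}")

browser.close()
pw.stop()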

Playwright with human-like behavior

import asyncio
import random

async def human_like_delay():
    """Random delay between 0.5-2.5 seconds."""
    await asyncio.sleep(random.uniform(0.5, 2.5))

async def human_like_scroll(page):
    """Scroll with varying speed like a human."""
    for _ in range(random.randint(2, 5)):
        await page.mouse.wheel(0, random.randint(200, 600))
        await asyncio.sleep(random.uniform(0.3, 1.0))

6. Error Handling and Retry

from time import sleep

import httpx

MAX_RETRIES = 3
RETRY_BACKOFF = [2, 5, 15]  # seconds


def fetch_with_retry(client: httpx.Client, url: str, **kwargs) -> httpx.Response:
    for attempt in range(MAX_RETRIES):
        try:
            resp = client.get(url, **kwargs)
            if resp.status_code == 429:
                retry_after = int(resp.headers.get("Retry-After", RETRY_BACKOFF[attempt]))
                log.warning("Rate limited, sleeping %ds", retry_after)
                sleep(retry_after)
                continue
            resp.raise_for_status()
            return resp
        except httpx.HTTPStatusError as e:
            if e.response.status_code >= 500 and attempt < MAX_RETRIES - 1:
                log.warning("Server error %d, retry in %ds",
                            e.response.status_code, RETRY_BACKOFF[attempt])
                sleep(RETRY_BACKOFF[attempt])
                continue
            raise
        except (httpx.ConnectError, httpx.ReadTimeout) as e:
            if attempt < MAX_RETRIES - 1:
                log.warning("Connection error: %s, retry in %ds",
                            e, RETRY_BACKOFF[attempt])
                sleep(RETRY_BACKOFF[attempt])
                continue
            raise
    raise RuntimeError(f"Failed after {MAX_RETRIES} retries: {url}")

7. Rate Limiting

Simple sleep-based (default)

REQUEST_DELAY = 1.5  # seconds
time.sleep(REQUEST_DELAY)

Adaptive rate limiting

class RateLimiter:
    def __init__(self, requests_per_second: float = 1.0):
        self.min_interval = 1.0 / requests_per_second
        self.last_request = 0.0

    def wait(self):
        elapsed = time.time() - self.last_request
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self.last_request = time.time()

limiter = RateLimiter(requests_per_second=0.5)  # 1 req every 2 seconds

Respect Retry-After headers

if resp.status_code == 429:
    wait = int(resp.headers.get("Retry-After", 60))
    log.warning("Rate limited. Waiting %d seconds.", wait)
    time.sleep(wait)

8. Common Extraction Patterns

JSON API response

data = resp.json()
items = data["items"]  # or data["results"], data["data"], etc.

BeautifulSoup HTML parsing

# /// script
# dependencies = ["httpx", "beautifulsoup4"]
# ///
from bs4 import BeautifulSoup

soup = BeautifulSoup(resp.text, "html.parser")
for article in soup.select("article"):
    title = article.select_one("h2 a")
    yield {
        "title": title.get_text(strip=True) if title else None,
        "url": title["href"] if title else None,
    }

selectolax (faster alternative to BS4)

# /// script
# dependencies = ["httpx", "selectolax"]
# ///
from selectolax.parser import HTMLParser

tree = HTMLParser(resp.text)
for node in tree.css("article"):
    title_node = node.css_first("h2 a")
    yield {
        "title": title_node.text(strip=True) if title_node else None,
        "url": title_node.attrs.get("href") if title_node else None,
    }

JSON-LD extraction

import json
from bs4 import BeautifulSoup

soup = BeautifulSoup(resp.text, "html.parser")
for script in soup.find_all("script", type="application/ld+json"):
    try:
        data = json.loads(script.string)
        if isinstance(data, list):
            yield from data
        else:
            yield data
    except json.JSONDecodeError:
        continue

NEXT_DATA extraction

import json, re

match = re.search(
    r'<script id="__NEXT_DATA__" type="application/json">(.*?)</script>',
    resp.text,
    re.DOTALL,
)
if match:
    data = json.loads(match.group(1))
    page_props = data["props"]["pageProps"]

React Server Components (RSC) extraction

For Next.js App Router sites (13.4+) that don't have __NEXT_DATA__. The RSC payload is fetched by adding the RSC: 1 header.

# /// script
# dependencies = ["httpx"]
# ///
import json
import re

import httpx


def fetch_rsc_payload(client: httpx.Client, url: str) -> str:
    """Fetch RSC payload instead of full HTML."""
    resp = client.get(
        url,
        headers={
            **HEADERS,
            "RSC": "1",
            "Next-Router-State-Tree": "",
        },
    )
    resp.raise_for_status()
    return resp.text


def extract_data_from_rsc(rsc_text: str) -> list[dict]:
    """Parse the RSC line protocol and extract data objects.
    
    Each RSC line is: LINE_ID:JSON_PAYLOAD
    We parse each line's JSON and recursively look for data-like dicts.
    """
    data_objects = []
    for line in rsc_text.splitlines():
        m = re.match(r'^[0-9a-f]+:', line)
        if not m:
            continue
        try:
            parsed = json.loads(line[m.end():])
        except (json.JSONDecodeError, TypeError):
            continue
        _walk_rsc(parsed, data_objects)
    return data_objects


def _walk_rsc(node, out: list):
    """Recursively collect dicts that look like data (not React nodes)."""
    if isinstance(node, dict):
        # Heuristic: dicts with data-like keys are interesting
        useful_keys = {"data", "items", "results", "products", "listings",
                       "props", "pageProps", "initialData", "records"}
        if useful_keys & set(node.keys()):
            out.append(node)
        for v in node.values():
            if isinstance(v, (dict, list)):
                _walk_rsc(v, out)
    elif isinstance(node, list):
        for item in node:
            if isinstance(item, (dict, list)):
                _walk_rsc(item, out)

When to use RSC vs HTML: If the site also embeds JSON-LD or other structured data in the HTML, prefer that — it's simpler and more stable. RSC parsing is the fallback when the data only exists in the RSC stream.

Table extraction

from bs4 import BeautifulSoup

soup = BeautifulSoup(resp.text, "html.parser")
table = soup.select_one("table")
headers = [th.get_text(strip=True) for th in table.select("thead th")]
for row in table.select("tbody tr"):
    cells = [td.get_text(strip=True) for td in row.select("td")]
    yield dict(zip(headers, cells))

name: web-scraper
description: Build reliable, production-grade web scrapers through a two-phase approach: Discovery (finding the best data source on a site) then Scraping (generating a Python script for scheduled extraction). Use this skill whenever the user wants to scrape a website, extract data from web pages, build a crawler, set up recurring data collection, reverse-engineer a site's API, or find hidden data endpoints. Also trigger when the user mentions "scrape", "crawl", "extract from site", "web data", "pull data from URL", "scheduled scraping", "monitor a page", or asks how to get data from a specific website.

Web Scraper Skill

Build scrapers that survive site redesigns by finding the most stable and efficient data source first, then generating a clean Python script for recurring extraction.

Two-Phase Workflow

Every scraping task MUST go through both phases in order. Skipping discovery leads to fragile scrapers that break on the first deploy.

Phase 1 — Discovery

Goal: find the most efficient and stable way to get the data. Prefer structured endpoints over DOM parsing. Work down the priority list until you find something that works.

Read references/discovery-strategies.md before starting discovery.

Discovery priority (highest to lowest):

  1. Public/undocumented REST API — look in XHR/fetch requests, especially on pagination or filter actions (page 2, sort, search). Often missed on initial page load.
  2. GraphQL endpoint — check for /graphql, /gql, or __relay in network requests.
  3. Server-rendered JSON blobs — __NEXT_DATA__, __NUXT__, window.__INITIAL_STATE__, window.__DATA__, Remix __remixContext, Gatsby pageContext, etc.
  4. React Server Components (RSC) — Next.js App Router sites use RSC instead of __NEXT_DATA__. Look for ?_rsc= requests in the network tab. Only parse RSC directly if no cleaner source (JSON-LD, API) exists.
  5. CMS/platform API — WordPress REST (/wp-json/wp/v2/), Shopify Storefront API, Drupal JSON:API, Contentful, Strapi, Ghost, etc.
  6. Structured data in HTML — <script type="application/ld+json">, microdata, RDFa.
  7. data- attributes — e.g., data-product-id, data-price, data-sku.
  8. Stable CSS selectors — semantic HTML elements (<article>, <li>, <table>, <time>) over class names. Avoid framework-generated classes like .css-1a2b3c or .MuiButton-root.

Discovery tools:

  • read_network_requests (Claude in Chrome) — preferred for intercepting XHR/fetch/RSC requests. Survives page navigations unlike JS-based interceptors. Always start network tracking before navigating.
  • read_page / javascript_tool — for inspecting DOM, JSON-LD, data attributes, embedded script tags.
  • curl / httpx from the terminal — for verifying endpoints work outside the browser context.

Output of discovery: a short report documenting what was found, which approach is recommended, and why.

Phase 2 — Scraping Script Generation

Goal: produce a single Python script the user can run on a schedule (daily, monthly, etc.) with minimal dependencies.

Read references/scraping-patterns.md before generating the script.

Key principles:

  • httpx > requests — async-capable, HTTP/2 support, better defaults.
  • Pure HTTP > headless browser — always. Only fall back to browser when the data genuinely requires JS execution.
  • curl_cffi as middle ground — when httpx gets blocked by TLS fingerprinting but full browser is overkill.
  • Playwright as last resort — with stealth measures applied (see reference file for anti-detection checklist).

Script requirements:

uv run scrape.py \
  --output ./data/output.jsonl \       # or .csv, .parquet
  --format jsonl \                      # jsonl | csv | parquet
  --date-range 2024-01-01:2024-01-31 \ # optional
  --page-limit 10 \                     # optional, for pagination
  --connection postgres://...           # optional, write to DB instead

The generated script must:

  • Use uv script metadata header (# /// script) for dependency management
  • Accept CLI args via argparse or click
  • Support output to file (JSONL default) or DB connection string
  • Include rate limiting / polite delays
  • Handle pagination automatically
  • Log progress to stderr
  • Exit with proper codes (0 success, 1 partial, 2 failure)
  • Include a USER_AGENT constant and HEADERS dict at the top for easy tuning
  • Be a single file — no package structure needed

Decision Flowchart

Start
  │
  ├─ Has the user provided a URL? ──No──▶ Ask for the target URL
  │
  Yes
  │
  ├─ Run Discovery (Phase 1)
  │   ├─ API endpoint found? ──Yes──▶ Use httpx + JSON parsing
  │   ├─ GraphQL endpoint?    ──Yes──▶ Use httpx + GraphQL queries
  │   ├─ JSON blob in HTML?   ──Yes──▶ Use httpx + regex/bs4 extraction
  │   ├─ RSC payload?         ──Yes──▶ Check if cleaner source exists
  │   │   ├─ JSON-LD/API also available? ──Yes──▶ Use that instead
  │   │   └─ RSC is only source?         ──Yes──▶ Use httpx + RSC parser
  │   ├─ CMS API available?   ──Yes──▶ Use httpx + platform-specific path
  │   ├─ Structured data?     ──Yes──▶ Use httpx + bs4/json extraction
  │   ├─ Stable selectors?    ──Yes──▶ Use httpx + bs4 CSS selectors
  │   └─ JS-rendered only?    ──Yes──▶ Try curl_cffi first, then Playwright
  │
  ├─ Report findings to user
  │
  ├─ Run Scraping (Phase 2)
  │   └─ Generate Python script following patterns in reference file
  │
  └─ Deliver script + usage instructions

Important Caveats

  • Always check robots.txt and mention it to the user (a minimal check is sketched after this list).
  • Note if the site has terms of service that restrict scraping.
  • Add appropriate delays between requests (1-3s default for polite scraping).
  • For authenticated endpoints, prompt the user for credentials or tokens rather than hardcoding anything.
  • If the site uses Cloudflare, Akamai, or similar WAFs, flag this early and adjust the strategy accordingly.
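A minimal stdlib sketch for the robots.txt check (URL and path are placeholders):

from urllib.robotparser import RobotFileParser

rp = RobotFileParser("https://example.com/robots.txt")
rp.read()

# Report whether the candidate path is allowed for a generic user agent
print("allowed:", rp.can_fetch("*", "https://example.com/api/v1/products"))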