Ultimate Web Scraping AI Skill (Claude, ChatGPT)

Discovery Strategies Reference

Systematic approaches to find the best data source on a target site. Work through these in order — stop as soon as you find a viable approach.

Table of Contents

  1. Network Request Interception
  2. Server-Rendered JSON Blobs
  3. CMS and Platform APIs
  4. GraphQL Endpoints
  5. Structured Data in HTML
  6. Data Attributes
  7. Stable CSS Selectors
  8. Discovery Report Template

1. Network Request Interception

The single most valuable discovery technique. Most modern sites load data via XHR/fetch — finding these endpoints gives you a clean JSON API.

What to look for

  • Requests to /api/, /v1/, /v2/, /_next/data/, /graphql
  • JSON responses with the data you need
  • Pagination parameters: ?page=2, ?offset=20, ?cursor=abc
  • Auth headers: Authorization: Bearer ..., API keys in query params

Critical: check page 2 and interactions

Many SPAs load the first page from server-rendered HTML but fetch subsequent pages via API. Always try these actions:

  • Click "next page" or "load more"
  • Change sort order or apply a filter
  • Scroll down (infinite scroll triggers)
  • Open a detail view / modal
  • Use the site's search

Each of these actions may reveal API endpoints not visible on initial load.

Using Claude in Chrome for discovery

Always use the read_network_requests tool rather than injecting JS-based network interceptors (e.g., patching fetch or XMLHttpRequest). JS interceptors are lost on page navigation, which is exactly when the most interesting requests happen (pagination, filter changes, search). The read_network_requests tool survives navigations and captures everything.

Workflow:

1. Call read_network_requests once to start tracking (even before navigating)
2. Navigate to the target URL
3. Wait for the page to load (2-3 seconds)
4. Read network requests filtered by useful patterns:
   - urlPattern="/api/"    → REST endpoints
   - urlPattern="graphql"  → GraphQL
   - urlPattern="_rsc"     → React Server Components
   - urlPattern="_next"    → Next.js data routes
5. Interact with the page (paginate, filter, search, sort)
6. Read network requests again — new endpoints often appear only on
   interaction, not on initial load
7. Inspect promising requests:
   - URL pattern and query parameters
   - Response format (JSON, RSC payload, HTML)
   - Pagination mechanism (page number, cursor, offset)

Why not JS interceptors? When you navigate to page 2 or apply a filter, the browser often does a full navigation (not a client-side transition). This destroys any fetch/XHR monkey-patches you injected. The read_network_requests tool is browser-level and persists across navigations within the same domain.

Using curl to verify an endpoint

Once you find a candidate endpoint, verify with curl:

curl -s 'https://example.com/api/products?page=1&limit=20' \
  -H 'Accept: application/json' \
  -H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)' \
  | python3 -m json.tool | head -50

Check (see the httpx sketch after this list):

  • Does it return JSON without browser context?
  • Does it require cookies or auth tokens?
  • Does pagination work by incrementing a parameter?
  • What's the rate limit? (check X-RateLimit-* headers)
  • Test the page size limit — many APIs accept much larger limit or per_page values than the frontend uses (e.g., 500 or 1000 instead of 20). Fewer requests means less rate limiting risk. Try increasing the limit parameter in steps: 50 → 100 → 250 → 500. Stop when the API errors or caps the response.
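The same checks can be scripted once the curl probe looks promising. A minimal httpx sketch; the URL, query parameters, and response keys are placeholders for whatever discovery turned up:

# /// script
# dependencies = ["httpx"]
# ///
import httpx

URL = "https://example.com/api/products"  # candidate endpoint from discovery

with httpx.Client(
    headers={"Accept": "application/json",
             "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)"},
    timeout=30,
    follow_redirects=True,
) as client:
    # No cookies, no auth headers: does it still return JSON?
    resp = client.get(URL, params={"page": 1, "limit": 20})
    print("status:", resp.status_code)

    # Surface any rate-limit hints the API exposes
    for header in ("X-RateLimit-Limit", "X-RateLimit-Remaining", "Retry-After"):
        if header in resp.headers:
            print(header, "=", resp.headers[header])

    data = resp.json()
    if isinstance(data, list):
        items = data
    else:
        items = data.get("items") or data.get("results") or data.get("data") or []
    print("items on page 1:", len(items))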

2. Server-Rendered JSON Blobs

Many frameworks embed the full page data as JSON in the initial HTML. This is often the easiest approach — one HTTP request, all data included.

Framework-specific patterns

| Framework | Where to look | Extraction |
|-----------|---------------|------------|
| Next.js | <script id="__NEXT_DATA__" type="application/json"> | json.loads(script.string)["props"]["pageProps"] |
| Nuxt.js | <script>window.__NUXT__= or <script id="__NUXT_DATA__"> | Parse the JS assignment or JSON |
| Remix | <script>window.__remixContext = | Parse the JS object |
| Gatsby | <script>/*! gatsby-page-data */ or inline pageContext | Extract from script content |
| SvelteKit | <script type="application/json" data-sveltekit-fetched> | json.loads(script.string) |
| Angular Universal | <script id="serverApp-state" type="application/json"> | json.loads(script.string) |
| Vue SSR | <script>window.__INITIAL_STATE__= | Parse the JS assignment |
| Astro | <script type="application/json" data-astro-*> | json.loads(script.string) |

React Server Components (RSC) payloads

Next.js App Router sites (13.4+) do NOT use __NEXT_DATA__. Instead they stream React Server Components as a custom line-based format. You'll recognise RSC by:

  • Network requests with ?_rsc= query parameter
  • Content-Type: text/x-component in response headers
  • Response body with lines like 0:["$","div",null,{"children":...}]

RSC payloads contain the full page data but in a nested, non-trivial format. During discovery, check if the same data is available more cleanly (JSON-LD, <script> tags, or an API). Only parse RSC directly when no better source exists.

When to use RSC parsing

RSC is the right choice when:

  • No JSON-LD or structured data in the HTML
  • No API endpoints found via network inspection
  • The site is a Next.js App Router app (RSC requests visible in network)
  • The RSC payload contains data not available elsewhere (e.g., aggregated counts, metadata, filters)

RSC format overview

Each line in an RSC response is: ID:TYPE_OR_JSON

0:["$","$L1",null,{"children":["$","div",null,...]}]
1:["$","$L2",null,{"data":{"products":[...]}}]
2:"$Sreact.suspense"
3:["$","ul",null,{"children":[["$","li",...]]}]
  • Lines starting with a number + colon contain data or component trees
  • JSON arrays/objects are embedded within each line
  • Data you want is typically nested inside arrays with "children" or "data" keys

Extraction pattern

import json
import re

import httpx


def fetch_rsc(url: str, client: httpx.Client) -> str:
    """Fetch the RSC payload for a URL."""
    resp = client.get(
        url,
        params={"_rsc": "1"},  # trigger RSC response
        headers={
            "RSC": "1",  # required header for RSC
            "Next-Router-State-Tree": "",
            **HEADERS,
        },
    )
    resp.raise_for_status()
    return resp.text


def parse_rsc_payload(rsc_text: str) -> list[dict]:
    """Extract JSON objects from RSC line protocol.
    
    RSC lines look like:
        0:["$","div",null,{...}]
        1:{"key":"value"}
        2:"plain string"
    
    We extract all parseable JSON from each line and collect
    any dicts/lists that look like data (not React component trees).
    """
    results = []

    for line in rsc_text.splitlines():
        # Strip the line ID prefix: "0:", "1a:", "2f:" etc.
        match = re.match(r'^[0-9a-f]+:', line)
        if not match:
            continue
        payload = line[match.end():]

        try:
            parsed = json.loads(payload)
        except (json.JSONDecodeError, TypeError):
            continue

        # Collect interesting data — skip plain strings and React markers
        if isinstance(parsed, dict):
            results.append(parsed)
        elif isinstance(parsed, list):
            # Walk the RSC array to find nested data objects
            _extract_data_from_rsc_tree(parsed, results)

    return results


def _extract_data_from_rsc_tree(node, results: list):
    """Recursively walk an RSC tree and extract data-like dicts.
    
    RSC arrays follow the pattern: ["$", "tag", key, props_dict]
    Data objects are typically inside props dicts under keys like
    "data", "items", "results", "children", "pageProps".
    """
    if isinstance(node, dict):
        # If it has data-like keys, collect it
        data_keys = {"data", "items", "results", "products", "listings",
                     "pageProps", "initialData", "props"}
        if data_keys & set(node.keys()):
            results.append(node)
        # Also recurse into all dict values
        for v in node.values():
            if isinstance(v, (dict, list)):
                _extract_data_from_rsc_tree(v, results)
    elif isinstance(node, list):
        for item in node:
            if isinstance(item, (dict, list)):
                _extract_data_from_rsc_tree(item, results)

Tips for RSC discovery

  • Compare RSC payload vs HTML source: sometimes the HTML already contains the data as JSON-LD or <script> tags (server-rendered for SEO), making RSC parsing unnecessary.
  • Multiple RSC requests: App Router may split data across several RSC fetches (shell, page data, suspended chunks). Check all _rsc requests.
  • The RSC: 1 header is required — without it, the server returns full HTML instead of the RSC payload.
  • Pagination via RSC: when paginating client-side, the browser fetches only the RSC diff (not full HTML). This can be much smaller and faster to parse than the full page.

Framework detection cheat sheet

import json
import re

import httpx
from bs4 import BeautifulSoup

html = httpx.get(url).text
soup = BeautifulSoup(html, "html.parser")

# Next.js
tag = soup.find("script", id="__NEXT_DATA__")
if tag:
    data = json.loads(tag.string)
    page_data = data["props"]["pageProps"]

# Nuxt
tag = soup.find("script", string=re.compile(r"window\.__NUXT__"))
if tag:
    # Extract JSON from: window.__NUXT__={...}
    match = re.search(r'window\.__NUXT__\s*=\s*(.+?);\s*$', tag.string, re.DOTALL)
    if match:
        data = json.loads(match.group(1))

# Generic initial state
for tag in soup.find_all("script"):
    if tag.string and "window.__INITIAL_STATE__" in tag.string:
        match = re.search(r'window\.__INITIAL_STATE__\s*=\s*(.+?);\s*$',
                          tag.string, re.DOTALL)
        if match:
            data = json.loads(match.group(1))

3. CMS and Platform APIs

If the site runs on a known platform, there's almost certainly a REST API.

WordPress

# Check if it's WordPress
curl -s https://example.com/wp-json/ | python3 -m json.tool

# List posts (public by default)
curl -s 'https://example.com/wp-json/wp/v2/posts?per_page=20&page=1'

# Other endpoints
/wp-json/wp/v2/pages
/wp-json/wp/v2/categories
/wp-json/wp/v2/tags
/wp-json/wp/v2/users        # sometimes restricted
/wp-json/wp/v2/media
/wp-json/wp/v2/comments

Detection: look for <link rel="https://api.w.org/"> in the HTML head, or /wp-content/ and /wp-includes/ paths in the page source.
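If the wp-json route responds, pagination can be scripted directly. A minimal httpx sketch, assuming the default public wp/v2/posts route (WordPress caps per_page at 100 and reports the page count in the X-WP-TotalPages response header):

# /// script
# dependencies = ["httpx"]
# ///
import httpx

BASE = "https://example.com"  # target WordPress site

with httpx.Client(timeout=30, follow_redirects=True) as client:
    # The first page also tells us how many pages exist
    resp = client.get(f"{BASE}/wp-json/wp/v2/posts", params={"per_page": 100, "page": 1})
    resp.raise_for_status()
    total_pages = int(resp.headers.get("X-WP-TotalPages", "1"))
    posts = resp.json()

    for page in range(2, total_pages + 1):
        resp = client.get(f"{BASE}/wp-json/wp/v2/posts", params={"per_page": 100, "page": page})
        resp.raise_for_status()
        posts.extend(resp.json())

print(len(posts), "posts fetched")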

Shopify

# Product listing (JSON)
curl -s 'https://store.example.com/products.json?limit=250&page=1'

# Collection products
curl -s 'https://store.example.com/collections/all/products.json'

# Single product
curl -s 'https://store.example.com/products/product-handle.json'

# Storefront API (GraphQL, may need token)
curl -s 'https://store.example.com/api/2024-01/graphql.json' \
  -H 'Content-Type: application/json' \
  -H 'X-Shopify-Storefront-Access-Token: TOKEN' \
  -d '{"query": "{ products(first: 10) { edges { node { title } } } }"}'

Detection: Shopify.theme, cdn.shopify.com, myshopify.com.

Drupal (JSON:API)

curl -s 'https://example.com/jsonapi/node/article?page[limit]=20'

Detection: <meta name="Generator" content="Drupal", /sites/default/files/.

Ghost

curl -s 'https://example.com/ghost/api/content/posts/?key=API_KEY&limit=20'

Detection: <meta name="generator" content="Ghost".

Webflow

# Webflow sites expose collection data via their API
# Look for data-w-id attributes and /api/v1/ endpoints

Detection: webflow.js, data-wf-site, data-wf-page.

Squarespace

# Append ?format=json to most URLs
curl -s 'https://example.com/blog?format=json'

Detection: squarespace.com, static.squarespace.com.

Wix

# Wix sites use a data API internally
# Look for /_api/ endpoints in XHR requests
# Common: /_api/communities-blog-node-api/

Detection: wix.com, parastorage.com, static.wixstatic.com.


4. GraphQL Endpoints

Detection

Common GraphQL endpoint URLs:

  • /graphql
  • /gql
  • /api/graphql
  • /v1/graphql
  • /__graphql

Check for:

  • Requests with Content-Type: application/json and a query field in body
  • The string __typename in responses
  • Relay-style edges / node / pageInfo patterns

Introspection query

curl -s 'https://example.com/graphql' \
  -H 'Content-Type: application/json' \
  -d '{"query": "{ __schema { queryType { name } types { name fields { name type { name } } } } }"}'

Many production endpoints disable introspection, but it's worth trying. If it works, you get the full schema — which makes building queries trivial.

Query building without introspection

If introspection is disabled, observe the queries the frontend makes (via network tab), then adapt them. The frontend's queries are typically well-tested and paginated.
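A minimal httpx sketch of that approach. The query shape below is a hypothetical example with Relay-style cursor pagination; the real field names come from whatever request body the frontend sends:

# /// script
# dependencies = ["httpx"]
# ///
import httpx

GRAPHQL_URL = "https://example.com/graphql"  # endpoint found during discovery

# Hypothetical query: copy the real one from the network tab and trim unused fields
QUERY = """
query Products($first: Int!, $after: String) {
  products(first: $first, after: $after) {
    edges { node { id title } }
    pageInfo { hasNextPage endCursor }
  }
}
"""

def fetch_all(client: httpx.Client) -> list[dict]:
    items, cursor = [], None
    while True:
        resp = client.post(
            GRAPHQL_URL,
            json={"query": QUERY, "variables": {"first": 50, "after": cursor}},
        )
        resp.raise_for_status()
        conn = resp.json()["data"]["products"]
        items.extend(edge["node"] for edge in conn["edges"])
        if not conn["pageInfo"]["hasNextPage"]:
            return items
        cursor = conn["pageInfo"]["endCursor"]

with httpx.Client(timeout=30) as client:
    print(len(fetch_all(client)), "items")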


5. Structured Data in HTML

Many sites embed structured data for SEO. This is clean, reliable, and rarely changes.

JSON-LD

import json
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")
for tag in soup.find_all("script", type="application/ld+json"):
    data = json.loads(tag.string)
    # Common types: Product, Article, Event, Recipe, Organization
    print(data.get("@type"), data)

Microdata / RDFa

Less convenient to parse but still structured. Use the extruct library:

import extruct

data = extruct.extract(html, syntaxes=["json-ld", "microdata", "rdfa", "opengraph"])

6. Data Attributes

Modern frameworks often attach data to DOM elements via data- attributes. These are typically more stable than class names.

What to look for

# Find elements with data attributes
for elem in soup.find_all(attrs={"data-product-id": True}):
    print(elem["data-product-id"], elem.get("data-price"))

# Common patterns
# data-id, data-item-id, data-product-id
# data-price, data-currency
# data-category, data-type
# data-url, data-href, data-src
# data-testid (React Testing Library — surprisingly stable)
# data-cy (Cypress test IDs — also stable)

7. Stable CSS Selectors

Last resort for DOM-based scraping. Prefer semantic HTML elements over class names.

Good selectors (stable across redesigns)

article                  /* semantic HTML5 */
article h2 a            /* link inside article heading */
table tbody tr td        /* table data */
li                       /* list items */
time[datetime]           /* dates with machine-readable value */
[itemprop="name"]        /* microdata attributes */
nav a                    /* navigation links */
main section             /* main content sections */
figure img               /* images with semantic wrapper */

Bad selectors (will break)

.css-1a2b3c             /* CSS-in-JS generated */
.MuiButton-root         /* Material UI internals */
.sc-bdVTJa              /* styled-components hash */
.tw-flex.tw-gap-4       /* Tailwind utility soup */
._3xk2z                 /* minified class names */
div > div > div > span  /* fragile nesting */

Resilience tips

  • Anchor on <table>, <article>, <section> when possible
  • Use attribute selectors: [role="listitem"], [aria-label="Price"]
  • Combine tag + attribute: li[data-testid]
  • Avoid nth-child selectors — position changes break them
  • Test that the selector matches a consistent element count across multiple pages (see the sketch after this list)
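A quick way to test that last point: fetch a few pages and compare how many elements a candidate selector matches on each (the URLs and selector below are placeholders):

# /// script
# dependencies = ["httpx", "beautifulsoup4"]
# ///
import httpx
from bs4 import BeautifulSoup

SELECTOR = "article h2 a"  # candidate selector under test
URLS = [
    "https://example.com/products?page=1",
    "https://example.com/products?page=2",
    "https://example.com/products?page=3",
]

counts = []
with httpx.Client(timeout=30, follow_redirects=True) as client:
    for url in URLS:
        soup = BeautifulSoup(client.get(url).text, "html.parser")
        counts.append(len(soup.select(SELECTOR)))
        print(url, "->", counts[-1], "matches")

# A stable listing selector should match roughly the same count on every page
if len(set(counts)) > 1:
    print("Warning: match count varies across pages; selector may be unreliable")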

8. Discovery Report Template

After running discovery, produce a brief report for the user:

## Discovery Report: [site name]

**Target URL:** https://example.com/products
**Date:** 2024-01-15

### Findings

| # | Method | Endpoint/Selector | Auth needed? | Paginated? | Max page size | Notes |
|---|--------|-------------------|-------------|-----------|--------------|-------|
| 1 | REST API | /api/v1/products?page={n} | No | Yes (page param) | 250 (tested) | Returns JSON, default 20 items/page |
| 2 | __NEXT_DATA__ | Embedded in HTML | No | No (single page) | N/A | Full product list on first load |
| 3 | CSS selectors | article.product-card | No | No | N/A | 20 items per page load |

### Recommendation

**Use approach #1 (REST API)** because:
- Clean JSON response, no HTML parsing needed
- Built-in pagination
- No authentication required
- Likely the most stable (API contracts change less than HTML)

### Caveats
- robots.txt allows /api/ paths
- No visible rate limit headers (recommend 1-2 req/s)
- TOS does not explicitly prohibit scraping

Scraping Patterns Reference

Python script templates and patterns for the scraping phase. All scripts use uv inline metadata for dependency management.

Table of Contents

  1. Script Template
  2. HTTP Client Ladder
  3. Output Handlers
  4. Pagination Patterns
  5. Playwright Stealth
  6. Error Handling and Retry
  7. Rate Limiting
  8. Common Extraction Patterns

1. Script Template

Every generated script should follow this structure. Use uv script metadata so the user can run it with uv run scrape.py without managing a virtual environment.

# /// script
# requires-python = ">=3.11"
# dependencies = [
#     "httpx",
#     "selectolax",
#     # add "pyarrow" for --format parquet and "duckdb" for --connection output
# ]
# ///
"""
Scraper: [Site Name] — [what it extracts]
Discovery: [method used, e.g., REST API at /api/v1/products]

Usage:
    uv run scrape.py --output data/products.jsonl
    uv run scrape.py --output data/products.csv --format csv
    uv run scrape.py --connection "postgresql://user:pass@host/db?table=products"
"""

import argparse
import json
import logging
import sys
import time
from datetime import datetime, timezone
from pathlib import Path

import httpx

# ── Configuration ──────────────────────────────────────────────
BASE_URL = "https://example.com"
API_URL = f"{BASE_URL}/api/v1/products"
USER_AGENT = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/131.0.0.0 Safari/537.36"
)
HEADERS = {
    "User-Agent": USER_AGENT,
    "Accept": "application/json",
    "Accept-Language": "en-US,en;q=0.9",
}
REQUEST_DELAY = 1.5  # seconds between requests
PAGE_SIZE = 20  # Try increasing during discovery — many APIs accept 100-500+

# ── Logging ────────────────────────────────────────────────────
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
    stream=sys.stderr,
)
log = logging.getLogger(__name__)


def parse_args() -> argparse.Namespace:
    p = argparse.ArgumentParser(description=__doc__)
    p.add_argument("--output", "-o", type=Path, help="Output file path")
    p.add_argument(
        "--format", "-f",
        choices=["jsonl", "csv", "parquet"],
        default="jsonl",
        help="Output format (default: jsonl)",
    )
    p.add_argument("--connection", "-c", help="DB connection string")
    p.add_argument("--page-limit", type=int, default=0, help="Max pages (0=all)")
    p.add_argument("--date-range", help="Date range: YYYY-MM-DD:YYYY-MM-DD")
    return p.parse_args()


def fetch_page(client: httpx.Client, page: int) -> list[dict]:
    """Fetch a single page of results. Adapt to your endpoint."""
    resp = client.get(API_URL, params={"page": page, "limit": PAGE_SIZE})
    resp.raise_for_status()
    data = resp.json()
    return data.get("items", data.get("results", data.get("data", [])))


def transform(raw: dict) -> dict:
    """Normalize a single raw record. Customize per site."""
    return {
        "id": raw.get("id"),
        "title": raw.get("title", "").strip(),
        "url": raw.get("url"),
        "scraped_at": datetime.now(timezone.utc).isoformat(),
        # add more fields as needed
    }


def scrape(args: argparse.Namespace) -> list[dict]:
    records = []
    page = 1
    with httpx.Client(headers=HEADERS, timeout=30, follow_redirects=True) as client:
        while True:
            log.info("Fetching page %d", page)
            items = fetch_page(client, page)
            if not items:
                log.info("No more items, stopping at page %d", page)
                break
            for item in items:
                records.append(transform(item))
            log.info("Got %d items (total: %d)", len(items), len(records))
            if args.page_limit and page >= args.page_limit:
                log.info("Reached page limit %d", args.page_limit)
                break
            page += 1
            time.sleep(REQUEST_DELAY)
    return records


def write_output(records: list[dict], args: argparse.Namespace) -> None:
    if args.connection:
        write_to_db(records, args.connection)
        return

    output = args.output or Path(f"output.{args.format}")
    output.parent.mkdir(parents=True, exist_ok=True)

    if args.format == "jsonl":
        with open(output, "w") as f:
            for r in records:
                f.write(json.dumps(r, ensure_ascii=False) + "\n")
    elif args.format == "csv":
        import csv
        if not records:
            return
        with open(output, "w", newline="") as f:
            w = csv.DictWriter(f, fieldnames=records[0].keys())
            w.writeheader()
            w.writerows(records)
    elif args.format == "parquet":
        import pyarrow as pa
        import pyarrow.parquet as pq
        table = pa.Table.from_pylist(records)
        pq.write_table(table, output)

    log.info("Wrote %d records to %s", len(records), output)


def write_to_db(records: list[dict], connection: str) -> None:
    """Write records to a database. Supports PostgreSQL connection strings."""
    import duckdb
    db = duckdb.connect()
    db.execute("INSTALL postgres; LOAD postgres;")
    # Use DuckDB to write to postgres via its postgres extension,
    # or adapt to use psycopg/sqlalchemy as preferred
    import pyarrow as pa
    table = pa.Table.from_pylist(records)
    db.register("records", table)
    # Adapt table name from connection string query params
    from urllib.parse import urlparse, parse_qs
    parsed = urlparse(connection)
    table_name = parse_qs(parsed.query).get("table", ["scraped_data"])[0]
    clean_conn = connection.split("?")[0]
    db.execute(f"ATTACH '{clean_conn}' AS pg (TYPE postgres)")
    db.execute(f"CREATE TABLE IF NOT EXISTS pg.{table_name} AS SELECT * FROM records WHERE false")
    db.execute(f"INSERT INTO pg.{table_name} SELECT * FROM records")
    log.info("Wrote %d records to %s.%s", len(records), clean_conn, table_name)


def main() -> int:
    args = parse_args()
    try:
        records = scrape(args)
        if not records:
            log.warning("No records scraped")
            return 1
        write_output(records, args)
        return 0
    except httpx.HTTPStatusError as e:
        log.error("HTTP error: %s", e)
        return 2
    except Exception as e:
        log.error("Fatal error: %s", e, exc_info=True)
        return 2


if __name__ == "__main__":
    sys.exit(main())

2. HTTP Client Ladder

Try these in order. Move to the next only if the previous one gets blocked.

Level 1: httpx (default)

# /// script
# dependencies = ["httpx[http2]"]
# ///
import httpx

with httpx.Client(
    headers=HEADERS,
    timeout=30,
    follow_redirects=True,
    http2=True,  # HTTP/2 requires the httpx[http2] extra (h2 package)
) as client:
    resp = client.get(url)

Level 2: curl_cffi (TLS fingerprint bypass)

When sites block based on TLS fingerprint (JA3/JA4). curl_cffi impersonates real browser TLS fingerprints.

# /// script
# dependencies = ["curl_cffi"]
# ///
from curl_cffi import requests as curl_requests

resp = curl_requests.get(
    url,
    headers=HEADERS,
    impersonate="chrome131",  # match a real browser
    timeout=30,
)

Available impersonation targets (use recent ones):

  • "chrome131", "chrome124", "chrome120"
  • "safari18_0", "safari17_5"
  • "edge131"

Level 3: Playwright (full browser, last resort)

See Playwright Stealth section below.


3. Output Handlers

JSONL (default, recommended)

Append-friendly, one JSON object per line. Best for streaming and incremental scraping.

with open(output, "w") as f:
    for record in records:
        f.write(json.dumps(record, ensure_ascii=False, default=str) + "\n")

CSV

import csv

with open(output, "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=records[0].keys())
    writer.writeheader()
    writer.writerows(records)

Parquet

Best for analytical workloads and DuckDB/MotherDuck ingestion.

# /// script
# dependencies = ["pyarrow"]
# ///
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.Table.from_pylist(records)
pq.write_table(table, output, compression="snappy")

PostgreSQL via DuckDB

# /// script
# dependencies = ["duckdb", "pyarrow"]
# ///
import duckdb

db = duckdb.connect()
db.install_extension("postgres")
db.load_extension("postgres")
table = pa.Table.from_pylist(records)
db.register("records", table)
db.execute(f"ATTACH '{conn_string}' AS pg (TYPE postgres)")
db.execute(f"INSERT INTO pg.{table_name} SELECT * FROM records")

4. Pagination Patterns

Probe the maximum page size first

Before paginating, test whether the API accepts larger page sizes than the frontend defaults. Many APIs let you request 250–500 items per page even if the UI only shows 20. This dramatically reduces the number of requests and lowers the chance of hitting rate limits.

# During discovery, probe the max page size:
for size in [50, 100, 250, 500]:
    resp = client.get(url, params={"limit": size, "page": 1})
    if resp.status_code == 200 and len(resp.json().get("items", [])) > 0:
        log.info("Page size %d works (%d items)", size, len(resp.json()["items"]))
    else:
        log.info("Page size %d rejected or capped", size)
        break

Page-number based

page = 1
while True:
    data = client.get(url, params={"page": page, "per_page": 20}).json()
    items = data["items"]
    if not items:
        break
    yield from items
    page += 1
    time.sleep(REQUEST_DELAY)

Cursor-based

cursor = None
while True:
    params = {"limit": 20}
    if cursor:
        params["cursor"] = cursor
    data = client.get(url, params=params).json()
    yield from data["items"]
    cursor = data.get("next_cursor")
    if not cursor:
        break
    time.sleep(REQUEST_DELAY)

Offset-based

offset = 0
limit = 20
while True:
    data = client.get(url, params={"offset": offset, "limit": limit}).json()
    items = data["results"]
    if not items:
        break
    yield from items
    offset += limit
    time.sleep(REQUEST_DELAY)

Next-URL based

next_url = initial_url
while next_url:
    data = client.get(next_url).json()
    yield from data["results"]
    next_url = data.get("next")  # full URL to next page
    time.sleep(REQUEST_DELAY)

Infinite scroll / load-more (JS required)

If all HTTP approaches fail for pagination, use Playwright:

async def scroll_and_collect(page):
    previous_height = 0
    while True:
        await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
        await page.wait_for_timeout(2000)
        current_height = await page.evaluate("document.body.scrollHeight")
        if current_height == previous_height:
            break
        previous_height = current_height

5. Playwright Stealth

Only use when httpx and curl_cffi both fail. Apply ALL of these anti-detection measures.

# /// script
# requires-python = ">=3.11"
# dependencies = [
#     "playwright",
# ]
# ///
"""Run `playwright install chromium` before first use."""

from playwright.sync_api import sync_playwright

# Same UA string as in the script template
USER_AGENT = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/131.0.0.0 Safari/537.36"
)

def create_stealth_browser():
    pw = sync_playwright().start()
    browser = pw.chromium.launch(
        headless=True,
        args=[
            "--disable-blink-features=AutomationControlled",
            "--disable-features=IsolateOrigins,site-per-process",
            "--disable-infobars",
            "--no-first-run",
            "--no-default-browser-check",
        ],
    )
    context = browser.new_context(
        user_agent=USER_AGENT,
        viewport={"width": 1920, "height": 1080},
        locale="en-US",
        timezone_id="America/New_York",
        # Realistic screen params
        screen={"width": 1920, "height": 1080},
        color_scheme="light",
    )

    # Remove automation indicators
    context.add_init_script("""
        // Remove webdriver flag
        Object.defineProperty(navigator, 'webdriver', { get: () => undefined });

        // Fix plugins array (headless has empty plugins)
        Object.defineProperty(navigator, 'plugins', {
            get: () => [1, 2, 3, 4, 5],
        });

        // Fix languages
        Object.defineProperty(navigator, 'languages', {
            get: () => ['en-US', 'en'],
        });

        // Fix permissions query
        const originalQuery = window.navigator.permissions.query;
        window.navigator.permissions.query = (parameters) =>
            parameters.name === 'notifications'
                ? Promise.resolve({ state: Notification.permission })
                : originalQuery(parameters);

        // Fix chrome object
        window.chrome = { runtime: {} };

        // Fix connection info
        Object.defineProperty(navigator, 'connection', {
            get: () => ({ effectiveType: '4g', rtt: 50, downlink: 10, saveData: false }),
        });

        // Hide headless renderer
        Object.defineProperty(navigator, 'hardwareConcurrency', { get: () => 8 });
        Object.defineProperty(navigator, 'deviceMemory', { get: () => 8 });
    """)

    page = context.new_page()
    return pw, browser, page

Anti-detection checklist

When using Playwright, verify ALL of these (a verification sketch follows the list):

  • navigator.webdriver returns undefined (not true)
  • navigator.plugins has entries (not empty)
  • navigator.languages is set (not empty)
  • window.chrome object exists
  • User-Agent matches a real, recent browser
  • Viewport is a realistic resolution (1920x1080, not 800x600)
  • --disable-blink-features=AutomationControlled is set
  • Real-looking mouse movements and delays between actions
  • Timezone and locale are set consistently
  • hardwareConcurrency and deviceMemory return reasonable values
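A small sketch that checks several of these from Python, assuming the create_stealth_browser() helper above; run it once to confirm the init script took effect:

# Quick spot-check of the anti-detection checklist
pw, browser, page = create_stealth_browser()
page.goto("https://example.com")

checks = page.evaluate("""() => ({
    webdriver: String(navigator.webdriver),          // expect "undefined"
    plugins: navigator.plugins.length,               // expect > 0
    languages: navigator.languages,                  // expect non-empty
    chrome: typeof window.chrome,                    // expect "object"
    hardwareConcurrency: navigator.hardwareConcurrency,
    deviceMemory: navigator.deviceMemory,
    userAgent: navigator.userAgent,
})""")

for key, value in checks.items():
    print(f"{key}: {value}")

browser.close()
pw.stop()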

Playwright with human-like behavior

import asyncio
import random

async def human_like_delay():
    """Random delay between 0.5-2.5 seconds."""
    await asyncio.sleep(random.uniform(0.5, 2.5))

async def human_like_scroll(page):
    """Scroll with varying speed like a human."""
    for _ in range(random.randint(2, 5)):
        await page.mouse.wheel(0, random.randint(200, 600))
        await asyncio.sleep(random.uniform(0.3, 1.0))

6. Error Handling and Retry

from time import sleep

import httpx

MAX_RETRIES = 3
RETRY_BACKOFF = [2, 5, 15]  # seconds


def fetch_with_retry(client: httpx.Client, url: str, **kwargs) -> httpx.Response:
    for attempt in range(MAX_RETRIES):
        try:
            resp = client.get(url, **kwargs)
            if resp.status_code == 429:
                retry_after = int(resp.headers.get("Retry-After", RETRY_BACKOFF[attempt]))
                log.warning("Rate limited, sleeping %ds", retry_after)
                sleep(retry_after)
                continue
            resp.raise_for_status()
            return resp
        except httpx.HTTPStatusError as e:
            if e.response.status_code >= 500 and attempt < MAX_RETRIES - 1:
                log.warning("Server error %d, retry in %ds",
                            e.response.status_code, RETRY_BACKOFF[attempt])
                sleep(RETRY_BACKOFF[attempt])
                continue
            raise
        except (httpx.ConnectError, httpx.ReadTimeout) as e:
            if attempt < MAX_RETRIES - 1:
                log.warning("Connection error: %s, retry in %ds",
                            e, RETRY_BACKOFF[attempt])
                sleep(RETRY_BACKOFF[attempt])
                continue
            raise
    raise RuntimeError(f"Failed after {MAX_RETRIES} retries: {url}")

7. Rate Limiting

Simple sleep-based (default)

REQUEST_DELAY = 1.5  # seconds
time.sleep(REQUEST_DELAY)

Adaptive rate limiting

class RateLimiter:
    def __init__(self, requests_per_second: float = 1.0):
        self.min_interval = 1.0 / requests_per_second
        self.last_request = 0.0

    def wait(self):
        elapsed = time.time() - self.last_request
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self.last_request = time.time()

limiter = RateLimiter(requests_per_second=0.5)  # 1 req every 2 seconds

Respect Retry-After headers

if resp.status_code == 429:
    wait = int(resp.headers.get("Retry-After", 60))
    log.warning("Rate limited. Waiting %d seconds.", wait)
    time.sleep(wait)

8. Common Extraction Patterns

JSON API response

data = resp.json()
items = data["items"]  # or data["results"], data["data"], etc.

BeautifulSoup HTML parsing

# /// script
# dependencies = ["httpx", "beautifulsoup4"]
# ///
from bs4 import BeautifulSoup

soup = BeautifulSoup(resp.text, "html.parser")
for article in soup.select("article"):
    title = article.select_one("h2 a")
    yield {
        "title": title.get_text(strip=True) if title else None,
        "url": title["href"] if title else None,
    }

selectolax (faster alternative to BS4)

# /// script
# dependencies = ["httpx", "selectolax"]
# ///
from selectolax.parser import HTMLParser

tree = HTMLParser(resp.text)
for node in tree.css("article"):
    title_node = node.css_first("h2 a")
    yield {
        "title": title_node.text(strip=True) if title_node else None,
        "url": title_node.attrs.get("href") if title_node else None,
    }

JSON-LD extraction

import json
from bs4 import BeautifulSoup

soup = BeautifulSoup(resp.text, "html.parser")
for script in soup.find_all("script", type="application/ld+json"):
    try:
        data = json.loads(script.string)
        if isinstance(data, list):
            yield from data
        else:
            yield data
    except json.JSONDecodeError:
        continue

NEXT_DATA extraction

import json, re

match = re.search(
    r'<script id="__NEXT_DATA__" type="application/json">(.*?)</script>',
    resp.text,
    re.DOTALL,
)
if match:
    data = json.loads(match.group(1))
    page_props = data["props"]["pageProps"]

React Server Components (RSC) extraction

For Next.js App Router sites (13.4+) that don't have __NEXT_DATA__. The RSC payload is fetched by adding the RSC: 1 header.

# /// script
# dependencies = ["httpx"]
# ///
import json
import re

import httpx


def fetch_rsc_payload(client: httpx.Client, url: str) -> str:
    """Fetch RSC payload instead of full HTML."""
    resp = client.get(
        url,
        headers={
            **HEADERS,
            "RSC": "1",
            "Next-Router-State-Tree": "",
        },
    )
    resp.raise_for_status()
    return resp.text


def extract_data_from_rsc(rsc_text: str) -> list[dict]:
    """Parse the RSC line protocol and extract data objects.
    
    Each RSC line is: LINE_ID:JSON_PAYLOAD
    We parse each line's JSON and recursively look for data-like dicts.
    """
    data_objects = []
    for line in rsc_text.splitlines():
        m = re.match(r'^[0-9a-f]+:', line)
        if not m:
            continue
        try:
            parsed = json.loads(line[m.end():])
        except (json.JSONDecodeError, TypeError):
            continue
        _walk_rsc(parsed, data_objects)
    return data_objects


def _walk_rsc(node, out: list):
    """Recursively collect dicts that look like data (not React nodes)."""
    if isinstance(node, dict):
        # Heuristic: dicts with data-like keys are interesting
        useful_keys = {"data", "items", "results", "products", "listings",
                       "props", "pageProps", "initialData", "records"}
        if useful_keys & set(node.keys()):
            out.append(node)
        for v in node.values():
            if isinstance(v, (dict, list)):
                _walk_rsc(v, out)
    elif isinstance(node, list):
        for item in node:
            if isinstance(item, (dict, list)):
                _walk_rsc(item, out)

When to use RSC vs HTML: If the site also embeds JSON-LD or other structured data in the HTML, prefer that — it's simpler and more stable. RSC parsing is the fallback when the data only exists in the RSC stream.

Table extraction

from bs4 import BeautifulSoup

soup = BeautifulSoup(resp.text, "html.parser")
table = soup.select_one("table")
headers = [th.get_text(strip=True) for th in table.select("thead th")]
for row in table.select("tbody tr"):
    cells = [td.get_text(strip=True) for td in row.select("td")]
    yield dict(zip(headers, cells))

name: web-scraper
description: Build reliable, production-grade web scrapers through a two-phase approach: Discovery (finding the best data source on a site) then Scraping (generating a Python script for scheduled extraction). Use this skill whenever the user wants to scrape a website, extract data from web pages, build a crawler, set up recurring data collection, reverse-engineer a site's API, or find hidden data endpoints. Also trigger when the user mentions "scrape", "crawl", "extract from site", "web data", "pull data from URL", "scheduled scraping", "monitor a page", or asks how to get data from a specific website.

Web Scraper Skill

Build scrapers that survive site redesigns by finding the most stable and efficient data source first, then generating a clean Python script for recurring extraction.

Two-Phase Workflow

Every scraping task MUST go through both phases in order. Skipping discovery leads to fragile scrapers that break on the first deploy.

Phase 1 — Discovery

Goal: find the most efficient and stable way to get the data. Prefer structured endpoints over DOM parsing. Work down the priority list until you find something that works.

Read references/discovery-strategies.md before starting discovery.

Discovery priority (highest to lowest):

  1. Public/undocumented REST API — look in XHR/fetch requests, especially on pagination or filter actions (page 2, sort, search). Often missed on initial page load.
  2. GraphQL endpoint — check for /graphql, /gql, or __relay in network requests.
  3. Server-rendered JSON blobs — __NEXT_DATA__, __NUXT__, window.__INITIAL_STATE__, window.__DATA__, Remix __remixContext, Gatsby pageContext, etc.
  4. React Server Components (RSC) — Next.js App Router sites use RSC instead of __NEXT_DATA__. Look for ?_rsc= requests in the network tab. Only parse RSC directly if no cleaner source (JSON-LD, API) exists.
  5. CMS/platform API — WordPress REST (/wp-json/wp/v2/), Shopify Storefront API, Drupal JSON:API, Contentful, Strapi, Ghost, etc.
  6. Structured data in HTML — <script type="application/ld+json">, microdata, RDFa.
  7. data- attributes — e.g., data-product-id, data-price, data-sku.
  8. Stable CSS selectors — semantic HTML elements (<article>, <li>, <table>, <time>) over class names. Avoid framework-generated classes like .css-1a2b3c or .MuiButton-root.

Discovery tools:

  • read_network_requests (Claude in Chrome) — preferred for intercepting XHR/fetch/RSC requests. Survives page navigations unlike JS-based interceptors. Always start network tracking before navigating.
  • read_page / javascript_tool — for inspecting DOM, JSON-LD, data attributes, embedded script tags.
  • curl / httpx from the terminal — for verifying endpoints work outside the browser context.

Output of discovery: a short report documenting what was found, which approach is recommended, and why.

Phase 2 — Scraping Script Generation

Goal: produce a single Python script the user can run on a schedule (daily, monthly, etc.) with minimal dependencies.

Read references/scraping-patterns.md before generating the script.

Key principles:

  • httpx > requests — async-capable, HTTP/2 support, better defaults.
  • Pure HTTP > headless browser — always. Only fall back to browser when the data genuinely requires JS execution.
  • curl_cffi as middle ground — when httpx gets blocked by TLS fingerprinting but full browser is overkill.
  • Playwright as last resort — with stealth measures applied (see reference file for anti-detection checklist).

Script requirements:

uv run scrape.py \
  --output ./data/output.jsonl \       # or .csv, .parquet
  --format jsonl \                      # jsonl | csv | parquet
  --date-range 2024-01-01:2024-01-31 \ # optional
  --page-limit 10 \                     # optional, for pagination
  --connection postgres://...           # optional, write to DB instead

The generated script must:

  • Use uv script metadata header (# /// script) for dependency management
  • Accept CLI args via argparse or click
  • Support output to file (JSONL default) or DB connection string
  • Include rate limiting / polite delays
  • Handle pagination automatically
  • Log progress to stderr
  • Exit with proper codes (0 success, 1 partial, 2 failure)
  • Include a USER_AGENT constant and HEADERS dict at the top for easy tuning
  • Be a single file — no package structure needed

Decision Flowchart

Start
  │
  ├─ Has the user provided a URL? ──No──▶ Ask for the target URL
  │
  Yes
  │
  ├─ Run Discovery (Phase 1)
  │   ├─ API endpoint found? ──Yes──▶ Use httpx + JSON parsing
  │   ├─ GraphQL endpoint?    ──Yes──▶ Use httpx + GraphQL queries
  │   ├─ JSON blob in HTML?   ──Yes──▶ Use httpx + regex/bs4 extraction
  │   ├─ RSC payload?         ──Yes──▶ Check if cleaner source exists
  │   │   ├─ JSON-LD/API also available? ──Yes──▶ Use that instead
  │   │   └─ RSC is only source?         ──Yes──▶ Use httpx + RSC parser
  │   ├─ CMS API available?   ──Yes──▶ Use httpx + platform-specific path
  │   ├─ Structured data?     ──Yes──▶ Use httpx + bs4/json extraction
  │   ├─ Stable selectors?    ──Yes──▶ Use httpx + bs4 CSS selectors
  │   └─ JS-rendered only?    ──Yes──▶ Try curl_cffi first, then Playwright
  │
  ├─ Report findings to user
  │
  ├─ Run Scraping (Phase 2)
  │   └─ Generate Python script following patterns in reference file
  │
  └─ Deliver script + usage instructions

Important Caveats

  • Always check robots.txt and mention it to the user (a minimal check is sketched after this list).
  • Note if the site has terms of service that restrict scraping.
  • Add appropriate delays between requests (1-3s default for polite scraping).
  • For authenticated endpoints, prompt the user for credentials or tokens rather than hardcoding anything.
  • If the site uses Cloudflare, Akamai, or similar WAFs, flag this early and adjust the strategy accordingly.
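A minimal stdlib sketch for the robots.txt check (URL and path are placeholders):

from urllib.robotparser import RobotFileParser

rp = RobotFileParser("https://example.com/robots.txt")
rp.read()

# Report whether the candidate path is allowed for a generic user agent
print("allowed:", rp.can_fetch("*", "https://example.com/api/v1/products"))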