Systematic approaches to find the best data source on a target site. Work through these in order — stop as soon as you find a viable approach.
- Network Request Interception
- Server-Rendered JSON Blobs
- CMS and Platform APIs
- GraphQL Endpoints
- Structured Data in HTML
- Data Attributes
- Stable CSS Selectors
- Discovery Report Template
## Network Request Interception

The single most valuable discovery technique. Most modern sites load data via XHR/fetch — finding these endpoints gives you a clean JSON API.
What to look for:

- Requests to `/api/`, `/v1/`, `/v2/`, `/_next/data/`, `/graphql`
- JSON responses with the data you need
- Pagination parameters: `?page=2`, `?offset=20`, `?cursor=abc`
- Auth headers: `Authorization: Bearer ...`, API keys in query params
Many SPAs load the first page from server-rendered HTML but fetch subsequent pages via API. Always try these actions:
- Click "next page" or "load more"
- Change sort order or apply a filter
- Scroll down (infinite scroll triggers)
- Open a detail view / modal
- Use the site's search
Each of these actions may reveal API endpoints not visible on initial load.
Always use the read_network_requests tool rather than injecting JS-based
network interceptors (e.g., patching fetch or XMLHttpRequest). JS
interceptors are lost on page navigation, which is exactly when the most
interesting requests happen (pagination, filter changes, search). The
read_network_requests tool survives navigations and captures everything.
Workflow:
1. Call read_network_requests once to start tracking (even before navigating)
2. Navigate to the target URL
3. Wait for the page to load (2-3 seconds)
4. Read network requests filtered by useful patterns:
- urlPattern="/api/" → REST endpoints
- urlPattern="graphql" → GraphQL
- urlPattern="_rsc" → React Server Components
- urlPattern="_next" → Next.js data routes
5. Interact with the page (paginate, filter, search, sort)
6. Read network requests again — new endpoints often appear only on
interaction, not on initial load
7. Inspect promising requests:
- URL pattern and query parameters
- Response format (JSON, RSC payload, HTML)
- Pagination mechanism (page number, cursor, offset)
Why not JS interceptors? When you navigate to page 2 or apply a filter,
the browser often does a full navigation (not a client-side transition).
This destroys any fetch/XHR monkey-patches you injected. The
read_network_requests tool is browser-level and persists across
navigations within the same domain.
Once you find a candidate endpoint, verify with curl:
```bash
curl -s 'https://example.com/api/products?page=1&limit=20' \
  -H 'Accept: application/json' \
  -H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)' \
  | python3 -m json.tool | head -50
```

Check:

- Does it return JSON without browser context?
- Does it require cookies or auth tokens?
- Does pagination work by incrementing a parameter?
- What's the rate limit? (check `X-RateLimit-*` headers)
- Test the page size limit — many APIs accept much larger `limit` or `per_page` values than the frontend uses (e.g., 500 or 1000 instead of 20). Fewer requests means less rate-limiting risk. Try increasing the limit parameter in steps: 50 → 100 → 250 → 500. Stop when the API errors or caps the response (see the probe sketch below).
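A minimal sketch of that page-size probe, assuming a hypothetical JSON-array endpoint with a `limit` parameter:

```python
import httpx

def probe_page_size(url: str, param: str = "limit") -> int:
    """Step the page-size parameter up until the API errors or caps it.

    Assumes the endpoint returns a JSON array; adjust the length check
    if the payload is wrapped in an envelope like {"items": [...]}.
    """
    best = 0
    with httpx.Client(headers={"Accept": "application/json"}) as client:
        for size in (50, 100, 250, 500, 1000):
            resp = client.get(url, params={param: size, "page": 1})
            if resp.status_code != 200:
                break  # API rejected this size outright
            returned = len(resp.json())
            best = max(best, min(returned, size))
            if returned < size:
                break  # API silently capped the response
    return best

# probe_page_size("https://example.com/api/products")  # hypothetical endpoint
```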
## Server-Rendered JSON Blobs

Many frameworks embed the full page data as JSON in the initial HTML. This is often the easiest approach — one HTTP request, all data included.
| Framework | Where to look | Extraction |
|---|---|---|
| Next.js | `<script id="__NEXT_DATA__" type="application/json">` | `json.loads(script.string)["props"]["pageProps"]` |
| Nuxt.js | `<script>window.__NUXT__=` or `<script id="__NUXT_DATA__">` | Parse the JS assignment or JSON |
| Remix | `<script>window.__remixContext =` | Parse the JS object |
| Gatsby | `<script>/*! gatsby-page-data */` or inline `pageContext` | Extract from script content |
| SvelteKit | `<script type="application/json" data-sveltekit-fetched>` | `json.loads(script.string)` |
| Angular Universal | `<script id="serverApp-state" type="application/json">` | `json.loads(script.string)` |
| Vue SSR | `<script>window.__INITIAL_STATE__=` | Parse the JS assignment |
| Astro | `<script type="application/json" data-astro-*>` | `json.loads(script.string)` |
### React Server Components (RSC)

Next.js App Router sites (13.4+) do NOT use __NEXT_DATA__. Instead they
stream React Server Components as a custom line-based format. You'll
recognise RSC by:
- Network requests with a `?_rsc=` query parameter
- `Content-Type: text/x-component` in response headers
- Response body with lines like `0:["$","div",null,{"children":...}]`
RSC payloads contain the full page data but in a nested, non-trivial format.
During discovery, check if the same data is available more cleanly
(JSON-LD, <script> tags, or an API). Only parse RSC directly when no
better source exists.
RSC is the right choice when:
- No JSON-LD or structured data in the HTML
- No API endpoints found via network inspection
- The site is a Next.js App Router app (RSC requests visible in network)
- The RSC payload contains data not available elsewhere (e.g., aggregated counts, metadata, filters)
Each line in an RSC response has the form `ID:TYPE_OR_JSON`:

```
0:["$","$L1",null,{"children":["$","div",null,...]}]
1:["$","$L2",null,{"data":{"products":[...]}}]
2:"$Sreact.suspense"
3:["$","ul",null,{"children":[["$","li",...]]}]
```

- Lines starting with a number + colon contain data or component trees
- JSON arrays/objects are embedded within each line
- The data you want is typically nested inside arrays under `"children"` or `"data"` keys
```python
import json
import re

import httpx


def fetch_rsc(url: str, client: httpx.Client) -> str:
    """Fetch the RSC payload for a URL."""
    resp = client.get(
        url,
        params={"_rsc": "1"},  # trigger RSC response
        headers={
            "RSC": "1",  # required header for RSC
            "Next-Router-State-Tree": "",
            **HEADERS,  # browser-like headers (User-Agent etc.) defined elsewhere
        },
    )
    resp.raise_for_status()
    return resp.text


def parse_rsc_payload(rsc_text: str) -> list[dict]:
    """Extract JSON objects from the RSC line protocol.

    RSC lines look like:
        0:["$","div",null,{...}]
        1:{"key":"value"}
        2:"plain string"

    We extract all parseable JSON from each line and collect
    any dicts/lists that look like data (not React component trees).
    """
    results = []
    for line in rsc_text.splitlines():
        # Strip the line ID prefix: "0:", "1a:", "2f:" etc.
        match = re.match(r'^[0-9a-f]+:', line)
        if not match:
            continue
        payload = line[match.end():]
        try:
            parsed = json.loads(payload)
        except (json.JSONDecodeError, TypeError):
            continue
        # Collect interesting data — skip plain strings and React markers
        if isinstance(parsed, dict):
            results.append(parsed)
        elif isinstance(parsed, list):
            # Walk the RSC array to find nested data objects
            _extract_data_from_rsc_tree(parsed, results)
    return results


def _extract_data_from_rsc_tree(node, results: list):
    """Recursively walk an RSC tree and extract data-like dicts.

    RSC arrays follow the pattern: ["$", "tag", key, props_dict]
    Data objects are typically inside props dicts under keys like
    "data", "items", "results", "children", "pageProps".
    """
    if isinstance(node, dict):
        # If it has data-like keys, collect it
        data_keys = {"data", "items", "results", "products", "listings",
                     "pageProps", "initialData", "props"}
        if data_keys & set(node.keys()):
            results.append(node)
        # Also recurse into all dict values
        for v in node.values():
            if isinstance(v, (dict, list)):
                _extract_data_from_rsc_tree(v, results)
    elif isinstance(node, list):
        for item in node:
            if isinstance(item, (dict, list)):
                _extract_data_from_rsc_tree(item, results)
```

- Compare RSC payload vs HTML source: sometimes the HTML already contains the data as JSON-LD or `<script>` tags (server-rendered for SEO), making RSC parsing unnecessary.
- Multiple RSC requests: App Router may split data across several RSC fetches (shell, page data, suspended chunks). Check all `_rsc` requests.
- The `RSC: 1` header is required — without it, the server returns full HTML instead of the RSC payload.
- Pagination via RSC: when paginating client-side, the browser fetches only the RSC diff (not full HTML). This can be much smaller and faster to parse than the full page.
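A brief usage sketch combining the two helpers above. The `products` key filter is hypothetical; inspect the collected dicts to see what the target site actually nests:

```python
import httpx

# Uses fetch_rsc / parse_rsc_payload defined above
with httpx.Client() as client:
    payload = fetch_rsc("https://example.com/products", client)  # hypothetical URL
    blobs = parse_rsc_payload(payload)
    # Inspect what came back, then filter for the data-bearing blobs
    product_blobs = [b for b in blobs if "products" in b or "data" in b]
```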
```python
import json
import re

import httpx
from bs4 import BeautifulSoup

html = httpx.get(url).text  # url: the target page
soup = BeautifulSoup(html, "html.parser")

# Next.js
tag = soup.find("script", id="__NEXT_DATA__")
if tag:
    data = json.loads(tag.string)
    page_data = data["props"]["pageProps"]

# Nuxt
tag = soup.find("script", string=re.compile(r"window\.__NUXT__"))
if tag:
    # Extract JSON from: window.__NUXT__={...}
    match = re.search(r'window\.__NUXT__\s*=\s*(.+?);\s*$', tag.string, re.DOTALL)
    if match:
        data = json.loads(match.group(1))

# Generic initial state
for tag in soup.find_all("script"):
    if tag.string and "window.__INITIAL_STATE__" in tag.string:
        match = re.search(r'window\.__INITIAL_STATE__\s*=\s*(.+?);\s*$',
                          tag.string, re.DOTALL)
        if match:
            data = json.loads(match.group(1))
```

## CMS and Platform APIs

If the site runs on a known platform, there's almost certainly a REST API.
### WordPress

```bash
# Check if it's WordPress
curl -s https://example.com/wp-json/ | python3 -m json.tool

# List posts (public by default)
curl -s 'https://example.com/wp-json/wp/v2/posts?per_page=20&page=1'

# Other endpoints
/wp-json/wp/v2/pages
/wp-json/wp/v2/categories
/wp-json/wp/v2/tags
/wp-json/wp/v2/users      # sometimes restricted
/wp-json/wp/v2/media
/wp-json/wp/v2/comments
```

Detection: look for `<link rel="https://api.w.org/"` in the HTML head, or `/wp-content/` / `/wp-includes/` in the page source.
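A minimal pagination sketch against those endpoints, assuming the standard `X-WP-TotalPages` response header (WordPress caps `per_page` at 100):

```python
import httpx

def fetch_all_wp_posts(base: str) -> list[dict]:
    """Page through /wp-json/wp/v2/posts until the last page.

    WordPress reports the page count in the X-WP-TotalPages header
    and caps per_page at 100.
    """
    posts, page = [], 1
    with httpx.Client() as client:
        while True:
            resp = client.get(
                f"{base}/wp-json/wp/v2/posts",
                params={"per_page": 100, "page": page},
            )
            resp.raise_for_status()
            posts += resp.json()
            if page >= int(resp.headers.get("X-WP-TotalPages", page)):
                return posts
            page += 1

# fetch_all_wp_posts("https://example.com")
```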
### Shopify

```bash
# Product listing (JSON)
curl -s 'https://store.example.com/products.json?limit=250&page=1'

# Collection products
curl -s 'https://store.example.com/collections/all/products.json'

# Single product
curl -s 'https://store.example.com/products/product-handle.json'

# Storefront API (GraphQL, may need token)
curl -s 'https://store.example.com/api/2024-01/graphql.json' \
  -H 'Content-Type: application/json' \
  -H 'X-Shopify-Storefront-Access-Token: TOKEN' \
  -d '{"query": "{ products(first: 10) { edges { node { title } } } }"}'
```

Detection: `Shopify.theme`, `cdn.shopify.com`, `myshopify.com`.
### Drupal

```bash
curl -s 'https://example.com/jsonapi/node/article?page[limit]=20'
```

Detection: `<meta name="Generator" content="Drupal"`, `/sites/default/files/`.
### Ghost

```bash
curl -s 'https://example.com/ghost/api/content/posts/?key=API_KEY&limit=20'
```

Detection: `<meta name="generator" content="Ghost"`.
### Webflow

```bash
# Webflow sites expose collection data via their API
# Look for data-w-id attributes and /api/v1/ endpoints
```

Detection: `webflow.js`, `data-wf-site`, `data-wf-page`.
### Squarespace

```bash
# Append ?format=json to most URLs
curl -s 'https://example.com/blog?format=json'
```

Detection: `squarespace.com`, `static.squarespace.com`.
### Wix

```bash
# Wix sites use a data API internally
# Look for /_api/ endpoints in XHR requests
# Common: /_api/communities-blog-node-api/
```

Detection: `wix.com`, `parastorage.com`, `static.wixstatic.com`.
## GraphQL Endpoints

Common GraphQL endpoint URLs:

```
/graphql
/gql
/api/graphql
/v1/graphql
/__graphql
```

Check for:

- Requests with `Content-Type: application/json` and a `query` field in the body
- The string `__typename` in responses
- Relay-style `edges`/`node`/`pageInfo` patterns
Try an introspection query:

```bash
curl -s 'https://example.com/graphql' \
  -H 'Content-Type: application/json' \
  -d '{"query": "{ __schema { queryType { name } types { name fields { name type { name } } } } }"}'
```

Many production endpoints disable introspection, but it's worth trying. If it works, you get the full schema — which makes building queries trivial.
If introspection is disabled, observe the queries the frontend makes (via network tab), then adapt them. The frontend's queries are typically well-tested and paginated.
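A sketch of that approach: replay a query copied from the network tab with your own pagination variables. The query shape and field names below are hypothetical placeholders for whatever the frontend actually sends:

```python
import httpx

# Query copied from the frontend's network tab; the field and variable
# names here are hypothetical placeholders.
QUERY = """
query Products($first: Int!, $after: String) {
  products(first: $first, after: $after) {
    edges { node { id title } }
    pageInfo { hasNextPage endCursor }
  }
}
"""

def fetch_all(endpoint: str) -> list[dict]:
    """Page through a Relay-style connection using cursor pagination."""
    items: list[dict] = []
    cursor = None
    with httpx.Client() as client:
        while True:
            resp = client.post(endpoint, json={
                "query": QUERY,
                "variables": {"first": 50, "after": cursor},
            })
            resp.raise_for_status()
            conn = resp.json()["data"]["products"]
            items += [edge["node"] for edge in conn["edges"]]
            if not conn["pageInfo"]["hasNextPage"]:
                return items
            cursor = conn["pageInfo"]["endCursor"]
```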
## Structured Data in HTML

Many sites embed structured data for SEO. This is clean, reliable, and rarely changes.

### JSON-LD
```python
import json

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")
for tag in soup.find_all("script", type="application/ld+json"):
    data = json.loads(tag.string)
    # Common types: Product, Article, Event, Recipe, Organization
    print(data.get("@type"), data)
```

### Microdata and RDFa

Microdata and RDFa are less convenient to parse but still structured. Use the `extruct` library:
```python
import extruct

data = extruct.extract(html, syntaxes=["json-ld", "microdata", "rdfa", "opengraph"])
```

## Data Attributes

Modern frameworks often attach data to DOM elements via `data-` attributes.
These are typically more stable than class names.
```python
# Find elements with data attributes
for elem in soup.find_all(attrs={"data-product-id": True}):
    print(elem["data-product-id"], elem.get("data-price"))

# Common patterns
# data-id, data-item-id, data-product-id
# data-price, data-currency
# data-category, data-type
# data-url, data-href, data-src
# data-testid (React Testing Library — surprisingly stable)
# data-cy (Cypress test IDs — also stable)
```

## Stable CSS Selectors

Last resort for DOM-based scraping. Prefer semantic HTML elements over class names.
Prefer:

```css
article                 /* semantic HTML5 */
article h2 a            /* link inside article heading */
table tbody tr td       /* table data */
li                      /* list items */
time[datetime]          /* dates with machine-readable value */
[itemprop="name"]       /* microdata attributes */
nav a                   /* navigation links */
main section            /* main content sections */
figure img              /* images with semantic wrapper */
```

Avoid:

```css
.css-1a2b3c             /* CSS-in-JS generated */
.MuiButton-root         /* Material UI internals */
.sc-bdVTJa              /* styled-components hash */
.tw-flex.tw-gap-4       /* Tailwind utility soup */
._3xk2z                 /* minified class names */
div > div > div > span  /* fragile nesting */
```

- Anchor on `<table>`, `<article>`, `<section>` when possible
- Use attribute selectors: `[role="listitem"]`, `[aria-label="Price"]`
- Combine tag + attribute: `li[data-testid]`
- Avoid nth-child selectors — position changes break them
- Test that a selector matches a consistent count across multiple pages (see the sketch below)
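A quick sketch for that last check, counting selector matches across a few pages (URLs hypothetical):

```python
import httpx
from bs4 import BeautifulSoup

def selector_counts(urls: list[str], selector: str) -> dict[str, int]:
    """Count how many elements a CSS selector matches on each page.

    Wildly different counts across similar pages suggest the selector
    is anchored on something fragile.
    """
    counts = {}
    for url in urls:
        soup = BeautifulSoup(httpx.get(url).text, "html.parser")
        counts[url] = len(soup.select(selector))
    return counts

# Hypothetical listing pages:
# selector_counts(
#     [f"https://example.com/products?page={n}" for n in (1, 2, 3)],
#     "article h2 a",
# )
```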
## Discovery Report Template

After running discovery, produce a brief report for the user:
```markdown
## Discovery Report: [site name]

**Target URL:** https://example.com/products
**Date:** 2024-01-15

### Findings

| # | Method | Endpoint/Selector | Auth needed? | Paginated? | Max page size | Notes |
|---|--------|-------------------|--------------|------------|---------------|-------|
| 1 | REST API | /api/v1/products?page={n} | No | Yes (page param) | 250 (tested) | Returns JSON, default 20 items/page |
| 2 | __NEXT_DATA__ | Embedded in HTML | No | No (single page) | N/A | Full product list on first load |
| 3 | CSS selectors | article.product-card | No | No | N/A | 20 items per page load |

### Recommendation

**Use approach #1 (REST API)** because:

- Clean JSON response, no HTML parsing needed
- Built-in pagination
- No authentication required
- Likely the most stable (API contracts change less than HTML)

### Caveats

- robots.txt allows /api/ paths
- No visible rate limit headers (recommend 1-2 req/s)
- TOS does not explicitly prohibit scraping
```