@ethanabrooks
Created December 30, 2025 14:36
{
"cells": [
{
"cell_type": "markdown",
"id": "0a3a1d08",
"metadata": {
"papermill": {
"duration": 0.002038,
"end_time": "2025-12-30T14:30:08.256611",
"exception": false,
"start_time": "2025-12-30T14:30:08.254573",
"status": "completed"
},
"tags": []
},
"source": [
"# Web Content Extraction Tool Comparison\n",
"\n",
"Comparing tools for extracting readable text/markdown from web pages.\n",
"\n",
"## Tools Evaluated\n",
"\n",
"| Tool | Type | Notes |\n",
"| ------------------- | ---------- | ---------------------------------------- |\n",
"| trafilatura | Python | Purpose-built for web text extraction |\n",
"| newspaper3k | Python | News article extraction |\n",
"| readability-lxml | Python | Python port of Arc90's Readability |\n",
"| Mozilla Readability | JavaScript | Original Firefox Reader View library |\n",
"| Playwright | Python | Browser automation for JS-rendered pages |\n",
"| html2text | Python | HTML to Markdown converter |\n",
"| BeautifulSoup | Python | Manual extraction baseline |\n",
"| Parallel.ai | API | Commercial service (requires API key) |\n",
"| Exa | API | AI-native search and content extraction (requires API key) |\n"
]
},
{
"cell_type": "markdown",
"id": "025255de",
"metadata": {
"papermill": {
"duration": 0.001373,
"end_time": "2025-12-30T14:30:08.259563",
"exception": false,
"start_time": "2025-12-30T14:30:08.258190",
"status": "completed"
},
"tags": []
},
"source": [
"## Configuration\n"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "d880d862",
"metadata": {
"execution": {
"iopub.execute_input": "2025-12-30T14:30:08.263953Z",
"iopub.status.busy": "2025-12-30T14:30:08.263825Z",
"iopub.status.idle": "2025-12-30T14:30:08.267098Z",
"shell.execute_reply": "2025-12-30T14:30:08.266653Z"
},
"papermill": {
"duration": 0.0059,
"end_time": "2025-12-30T14:30:08.267564",
"exception": false,
"start_time": "2025-12-30T14:30:08.261664",
"status": "completed"
},
"tags": [
"parameters"
]
},
"outputs": [],
"source": [
"# Parameters - these can be overridden by papermill\n",
"TEST_URL = \"https://en.wikipedia.org/wiki/WBA_interim_middleweight_championship#List_of_interim_champions\"\n",
"MAX_CHARS = 3000 # Maximum characters to display per output"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "a47623a4",
"metadata": {
"execution": {
"iopub.execute_input": "2025-12-30T14:30:08.271754Z",
"iopub.status.busy": "2025-12-30T14:30:08.271639Z",
"iopub.status.idle": "2025-12-30T14:30:08.273381Z",
"shell.execute_reply": "2025-12-30T14:30:08.272945Z"
},
"papermill": {
"duration": 0.004361,
"end_time": "2025-12-30T14:30:08.273885",
"exception": false,
"start_time": "2025-12-30T14:30:08.269524",
"status": "completed"
},
"tags": [
"injected-parameters"
]
},
"outputs": [],
"source": [
"# Parameters\n",
"TEST_URL = \"https://www.dailymail.co.uk/\"\n"
]
},
{
"cell_type": "markdown",
"id": "b24ee9dd",
"metadata": {
"papermill": {
"duration": 0.001756,
"end_time": "2025-12-30T14:30:08.277155",
"exception": false,
"start_time": "2025-12-30T14:30:08.275399",
"status": "completed"
},
"tags": []
},
"source": [
"## Setup\n"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "97de6407",
"metadata": {
"execution": {
"iopub.execute_input": "2025-12-30T14:30:08.280768Z",
"iopub.status.busy": "2025-12-30T14:30:08.280672Z",
"iopub.status.idle": "2025-12-30T14:30:08.323592Z",
"shell.execute_reply": "2025-12-30T14:30:08.322389Z"
},
"papermill": {
"duration": 0.046387,
"end_time": "2025-12-30T14:30:08.325103",
"exception": false,
"start_time": "2025-12-30T14:30:08.278716",
"status": "completed"
},
"tags": []
},
"outputs": [],
"source": [
"import json\n",
"import os\n",
"import subprocess\n",
"import requests\n",
"from pathlib import Path"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "ee68e648",
"metadata": {
"execution": {
"iopub.execute_input": "2025-12-30T14:30:08.330888Z",
"iopub.status.busy": "2025-12-30T14:30:08.330756Z",
"iopub.status.idle": "2025-12-30T14:30:09.685366Z",
"shell.execute_reply": "2025-12-30T14:30:09.685006Z"
},
"papermill": {
"duration": 1.358071,
"end_time": "2025-12-30T14:30:09.686126",
"exception": false,
"start_time": "2025-12-30T14:30:08.328055",
"status": "completed"
},
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Fetched 1,309,998 bytes\n"
]
}
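],
"source": [
"headers = {\n",
"    \"User-Agent\": \"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36\",\n",
"    \"Accept\": \"text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8\",\n",
"    \"Accept-Language\": \"en-US,en;q=0.9\",\n",
"}\n",
"html_content = None\n",
"fetch_error = None\n",
"\n",
"try:\n",
"    response = requests.get(TEST_URL, headers=headers, timeout=30)\n",
"except requests.RequestException as e:\n",
"    # Network-level failures (DNS, timeout, TLS) are recorded instead of aborting the run\n",
"    fetch_error = f\"{type(e).__name__}: {e}\"\n",
"    print(f\"Fetch failed: {fetch_error}\")\n",
"else:\n",
"    if response.ok:\n",
"        html_content = response.text\n",
"        # response.content is the raw body, so its length is a true byte count\n",
"        print(f\"Fetched {len(response.content):,} bytes\")\n",
"    else:\n",
"        fetch_error = f\"HTTP {response.status_code}: {response.reason}\"\n",
"        print(f\"Fetch failed: {fetch_error}\")"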
]
},
{
"cell_type": "markdown",
"id": "20a00121",
"metadata": {
"papermill": {
"duration": 0.001385,
"end_time": "2025-12-30T14:30:09.689427",
"exception": false,
"start_time": "2025-12-30T14:30:09.688042",
"status": "completed"
},
"tags": []
},
"source": [
"## 1. Trafilatura\n",
"\n",
"[trafilatura](https://trafilatura.readthedocs.io/) - Purpose-built for web text extraction with native markdown output.\n"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "77dc050c",
"metadata": {
"execution": {
"iopub.execute_input": "2025-12-30T14:30:09.693140Z",
"iopub.status.busy": "2025-12-30T14:30:09.693021Z",
"iopub.status.idle": "2025-12-30T14:30:09.989360Z",
"shell.execute_reply": "2025-12-30T14:30:09.988767Z"
},
"papermill": {
"duration": 0.299041,
"end_time": "2025-12-30T14:30:09.989958",
"exception": false,
"start_time": "2025-12-30T14:30:09.690917",
"status": "completed"
},
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Products featured in these articles are selected by our buyline writers, who scour the internet to let you know about great deals. If you click on or make a purchase using links in these articles, MailOnline will earn a small payment or commission on any sales. To find out more, click here\n",
"\n",
"SHOPPING: Diet culture is out. We break down the GLP-1 boom, why metabolism-first weight care is taking over and how Freya makes it affordable to start.\n",
"\n",
"SHOPPING: At-home rowing is the workout of the moment, and right now you can save up to $500 on Hydrow at-home rowers that help you shed pounds, build muscle, and virtually travel all over the world.\n",
"\n",
"SHOPPING: Give your wellness routine a new year boost with NAD+, the supplement that restores vitality to aging cells and helps you find a new lease on life with increased energy and focus.\n",
"\n",
"SHOPPING: Streamline your beauty routine in the new year with a four-piece tarte makeup kit including a jumbo size Face Tape foundation, a full size Maracuja Juicy lip gloss, a brush, and a chic bag.\n",
"\n",
"SHOPPING: The fan-favourite item has been hailed as a 'must-have' for anyone looking to revive tired eyes and appear instantly more awake - in just 15 minutes.\n",
"\n",
"SHOPPING: U Beauty just launched at Sephora, and the brand known for its cult-favorite, science-backed formulas is offering Daily Mail readers an exclusive 20 percent discount for a limited time.\n",
"\n",
"SHOPPING: Reboot your wellness routine with the 'life-changing' supplement packed with 11 bioavailable ingredients shown to boost energy, improve skin, help sleep, and so much more.\n"
]
}
],
"source": [
"import trafilatura\n",
"\n",
"if html_content:\n",
"    trafilatura_text = trafilatura.extract(\n",
"        html_content,\n",
"        output_format=\"markdown\",\n",
"        include_tables=True,\n",
"        include_links=True,\n",
"        include_images=False,\n",
"    )\n",
"    print(trafilatura_text[:MAX_CHARS] if trafilatura_text else \"No content\")\n",
"else:\n",
"    trafilatura_text = None\n",
"    print(f\"Skipped: {fetch_error}\")"
]
},
{
"cell_type": "markdown",
"id": "37293fa6",
"metadata": {
"papermill": {
"duration": 0.001996,
"end_time": "2025-12-30T14:30:09.993700",
"exception": false,
"start_time": "2025-12-30T14:30:09.991704",
"status": "completed"
},
"tags": []
},
"source": [
"## 2. Newspaper3k\n",
"\n",
"[newspaper3k](https://newspaper.readthedocs.io/) - Designed for news article extraction.\n"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "f406ca17",
"metadata": {
"execution": {
"iopub.execute_input": "2025-12-30T14:30:09.997823Z",
"iopub.status.busy": "2025-12-30T14:30:09.997651Z",
"iopub.status.idle": "2025-12-30T14:30:10.323013Z",
"shell.execute_reply": "2025-12-30T14:30:10.322512Z"
},
"papermill": {
"duration": 0.328286,
"end_time": "2025-12-30T14:30:10.323741",
"exception": false,
"start_time": "2025-12-30T14:30:09.995455",
"status": "completed"
},
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Row all over the world from your living room with big savings on the at-home rower that's helped thousands shed pounds and get stronger\n",
"\n",
"SHOPPING: At-home rowing is the workout of the moment, and right now you can save up to $500 on Hydrow at-home rowers that help you shed pounds, build muscle, and virtually travel all over the world.\n"
]
}
],
"source": [
"from newspaper import Article\n",
"\n",
"article = Article(TEST_URL)\n",
"newspaper_error = None\n",
"\n",
"if html_content:\n",
"    article.set_html(html_content)\n",
"    try:\n",
"        article.parse()\n",
"    except ValueError as e:\n",
"        newspaper_error = f\"ValueError: {e}\"\n",
"\n",
"    if newspaper_error:\n",
"        print(f\"Newspaper3k error: {newspaper_error}\")\n",
"    else:\n",
"        print(article.text[:MAX_CHARS] if article.text else \"No content\")\n",
"else:\n",
"    print(f\"Skipped: {fetch_error}\")"
]
},
{
"cell_type": "markdown",
"id": "4580af76",
"metadata": {
"papermill": {
"duration": 0.002789,
"end_time": "2025-12-30T14:30:10.328575",
"exception": false,
"start_time": "2025-12-30T14:30:10.325786",
"status": "completed"
},
"tags": []
},
"source": [
"## 3. Readability-lxml\n",
"\n",
"[readability-lxml](https://github.com/buriy/python-readability) - Python port of Arc90's Readability, the project Mozilla Readability also derives from.\n",
"It outputs HTML, so we pipe the result through html2text to get markdown.\n"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "a0111b20",
"metadata": {
"execution": {
"iopub.execute_input": "2025-12-30T14:30:10.332496Z",
"iopub.status.busy": "2025-12-30T14:30:10.332378Z",
"iopub.status.idle": "2025-12-30T14:30:10.398166Z",
"shell.execute_reply": "2025-12-30T14:30:10.397776Z"
},
"papermill": {
"duration": 0.068437,
"end_time": "2025-12-30T14:30:10.398646",
"exception": false,
"start_time": "2025-12-30T14:30:10.330209",
"status": "completed"
},
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[](/buyline/article-15402157/hydrow-rower-rowing-machine-new-year-sale.html?ico=mail_best_commerce_xp_desktop_185)\n",
"\n",
"SHOPPING: At-home rowing is the workout of the moment, and right now you can save up to $500 on Hydrow at-home rowers that help you shed pounds, build muscle, and virtually travel all over the world.\n",
"\n"
]
}
],
"source": [
"from readability import Document\n",
"import html2text\n",
"\n",
"h2t = html2text.HTML2Text()\n",
"h2t.ignore_links = False\n",
"h2t.ignore_images = True\n",
"h2t.body_width = 0\n",
"\n",
"readability_markdown = \"\"\n",
"readability_error = None\n",
"\n",
"if html_content:\n",
"    doc = Document(html_content)\n",
"    try:\n",
"        readable_html = doc.summary()\n",
"        readability_markdown = h2t.handle(readable_html)\n",
"    except Exception as e:\n",
"        readability_error = f\"{type(e).__name__}: {e}\"\n",
"\n",
"    if readability_error:\n",
"        print(f\"Readability-lxml error: {readability_error}\")\n",
"    else:\n",
"        print(readability_markdown[:MAX_CHARS])\n",
"else:\n",
"    print(f\"Skipped: {fetch_error}\")"
]
},
{
"cell_type": "markdown",
"id": "3ef93b51",
"metadata": {
"papermill": {
"duration": 0.002347,
"end_time": "2025-12-30T14:30:10.402684",
"exception": false,
"start_time": "2025-12-30T14:30:10.400337",
"status": "completed"
},
"tags": []
},
"source": [
"## 4. Mozilla Readability (JavaScript)\n",
"\n",
"[Mozilla Readability](https://github.com/mozilla/readability) - Original Firefox Reader View library, called via Node.js.\n"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "3dba3e69",
"metadata": {
"execution": {
"iopub.execute_input": "2025-12-30T14:30:10.406875Z",
"iopub.status.busy": "2025-12-30T14:30:10.406752Z",
"iopub.status.idle": "2025-12-30T14:30:12.053818Z",
"shell.execute_reply": "2025-12-30T14:30:12.053393Z"
},
"papermill": {
"duration": 1.649953,
"end_time": "2025-12-30T14:30:12.054412",
"exception": false,
"start_time": "2025-12-30T14:30:10.404459",
"status": "completed"
},
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"## [Why metabolism-first weight care is replacing diet culture and the affordable place we found to start your 2026 journey](https://example.com/buyline/article-15419593/metabolism-glp1-weight-loss-wellness-health-Freya.html?ico=mail_best_commerce_xp_desktop_185)\n",
"\n",
"[](https://example.com/buyline/article-15419593/metabolism-glp1-weight-loss-wellness-health-Freya.html?ico=mail_best_commerce_xp_desktop_185)\n",
"\n",
"SHOPPING: Diet culture is out. We break down the GLP-1 boom, why metabolism-first weight care is taking over and how Freya makes it affordable to start.\n",
"\n",
"## [Row all over the world from your living room with big savings on the at-home rower that's helped thousands shed pounds and get stronger](https://example.com/buyline/article-15402157/hydrow-rower-rowing-machine-new-year-sale.html?ico=mail_best_commerce_xp_desktop_185)\n",
"\n",
"[](https://example.com/buyline/article-15402157/hydrow-rower-rowing-machine-new-year-sale.html?ico=mail_best_commerce_xp_desktop_185)\n",
"\n",
"SHOPPING: At-home rowing is the workout of the moment, and right now you can save up to $500 on Hydrow at-home rowers that help you shed pounds, build muscle, and virtually travel all over the world.\n",
"\n"
]
}
],
"source": [
"script_path = Path(\"readability_extract.js\")\n",
"mozilla_markdown = \"\"\n",
"\n",
"if not html_content:\n",
"    print(f\"Skipped: {fetch_error}\")\n",
"elif not script_path.exists():\n",
"    print(\"readability_extract.js not found\")\n",
"else:\n",
"    result = subprocess.run(\n",
"        [\"node\", str(script_path)],\n",
"        input=html_content,\n",
"        capture_output=True,\n",
"        text=True,\n",
"        timeout=30,\n",
"    )\n",
"    if result.returncode == 0:\n",
"        mozilla_result = json.loads(result.stdout)\n",
"        mozilla_markdown = h2t.handle(mozilla_result.get(\"content\", \"\"))\n",
"        print(mozilla_markdown[:MAX_CHARS])\n",
"    else:\n",
"        print(f\"Error: {result.stderr}\")"
]
},
{
"cell_type": "markdown",
"id": "ceaffb93",
"metadata": {
"papermill": {
"duration": 0.001354,
"end_time": "2025-12-30T14:30:12.057404",
"exception": false,
"start_time": "2025-12-30T14:30:12.056050",
"status": "completed"
},
"tags": []
},
"source": [
"## 5. Playwright\n",
"\n",
"[Playwright](https://playwright.dev/) - Browser automation that renders JavaScript before extraction. The rendered HTML is then passed to trafilatura.\n"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "5e9bfba5",
"metadata": {
"execution": {
"iopub.execute_input": "2025-12-30T14:30:12.061173Z",
"iopub.status.busy": "2025-12-30T14:30:12.061045Z",
"iopub.status.idle": "2025-12-30T14:30:17.270269Z",
"shell.execute_reply": "2025-12-30T14:30:17.269859Z"
},
"papermill": {
"duration": 5.212116,
"end_time": "2025-12-30T14:30:17.270848",
"exception": false,
"start_time": "2025-12-30T14:30:12.058732",
"status": "completed"
},
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"##\n",
"[\n",
"Devastated mother of TV news anchor, 30, reveals shocking way she found out daughter had died in plane crash](/news/article-15422197/carley-mccord-mother-moment-daughter-killed-louisiana-plane-crash.html)\n",
"\n",
"\n",
"This is a modal window.\n",
"\n",
"Beginning of dialog window. Escape will cancel and close the window.\n",
"\n",
"End of dialog window.\n"
]
}
],
"source": [
"import asyncio\n",
"from playwright.async_api import async_playwright, TimeoutError as PlaywrightTimeout\n",
"import nest_asyncio\n",
"\n",
"nest_asyncio.apply()\n",
"\n",
"\n",
"async def fetch_with_playwright(\n",
"    url: str, timeout: int = 30000\n",
") -> tuple[str | None, str | None]:\n",
"    \"\"\"Returns (html, error). One will be None.\"\"\"\n",
"    try:\n",
"        async with async_playwright() as p:\n",
"            browser = await p.chromium.launch(headless=True)\n",
"            page = await browser.new_page()\n",
"            response = await page.goto(url, wait_until=\"domcontentloaded\", timeout=timeout)\n",
"            await page.wait_for_timeout(3000)  # Let JS render\n",
"            html = await page.content()\n",
"            await browser.close()\n",
"            status = response.status if response else None\n",
"            if status and status >= 400:\n",
"                return None, f\"HTTP {status}\"\n",
"            return html, None\n",
"    except PlaywrightTimeout:\n",
"        return None, f\"Timeout after {timeout}ms\"\n",
"    except Exception as e:\n",
"        return None, f\"{type(e).__name__}: {e}\"\n",
"\n",
"\n",
"playwright_html = None\n",
"playwright_extracted = None\n",
"playwright_error = None\n",
"\n",
"loop = asyncio.get_event_loop()\n",
"result = loop.run_until_complete(fetch_with_playwright(TEST_URL))\n",
"playwright_html, playwright_error = result\n",
"\n",
"if playwright_html:\n",
"    playwright_extracted = trafilatura.extract(\n",
"        playwright_html,\n",
"        output_format=\"markdown\",\n",
"        include_tables=True,\n",
"        include_links=True,\n",
"        include_images=False,\n",
"    )\n",
"    print(\n",
"        playwright_extracted[:MAX_CHARS]\n",
"        if playwright_extracted\n",
"        else \"No content extracted from HTML\"\n",
"    )\n",
"else:\n",
"    print(f\"Playwright error: {playwright_error}\")"
]
},
{
"cell_type": "markdown",
"id": "6e522bf1",
"metadata": {
"papermill": {
"duration": 0.00153,
"end_time": "2025-12-30T14:30:17.274347",
"exception": false,
"start_time": "2025-12-30T14:30:17.272817",
"status": "completed"
},
"tags": []
},
"source": [
"## 6. Parallel.ai\n",
"\n",
"[Parallel.ai](https://docs.parallel.ai/) - Commercial API for web extraction, called here through its Python SDK.\n"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "a66a4eb3",
"metadata": {
"execution": {
"iopub.execute_input": "2025-12-30T14:30:17.278441Z",
"iopub.status.busy": "2025-12-30T14:30:17.278327Z",
"iopub.status.idle": "2025-12-30T14:30:18.228795Z",
"shell.execute_reply": "2025-12-30T14:30:18.228330Z"
},
"papermill": {
"duration": 0.95336,
"end_time": "2025-12-30T14:30:18.229401",
"exception": false,
"start_time": "2025-12-30T14:30:17.276041",
"status": "completed"
},
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[](/)\n",
"\n",
"* [Home](/home/index.html)\n",
"* [News](/news/index.html)\n",
"* [Royals](/news/royals/index.html)\n",
"* [U.S.](/ushome/index.html)\n",
"* [Sport](/sport/index.html)\n",
"* [TV](/tv/index.html)\n",
"* [Showbiz](/tvshowbiz/index.html)\n",
"* [Lifestyle](/lifestyle/index.html)\n",
"* [Health](/health/index.html)\n",
"* [Science](/sciencetech/index.html)\n",
"* [Money](/money/index.html)\n",
"* [Travel](/travel/index.html)\n",
"* [Podcasts](/podcasts/index.html)\n",
"* [Buyline](/buyline/index.html)\n",
" \n",
" + [Beauty](/buyline/beauty/index.html)\n",
" + [Fashion](/buyline/fashion/index.html)\n",
" + [Home and Garden](/buyline/home-garden/index.html)\n",
" + [Black Friday](/buyline/black-friday/index.html)\n",
" \n",
" \n",
" + [My Profile](/registration/profile.html)\n",
" + [Logout](/registration/logout.html)\n",
" \n",
" \n",
" + [Login](/registration/login.html?reg_source=navigation&targetUrl=)\n",
"\n",
"* [Latest Headlines](/home/latest/index.html)\n",
"* [Australia](/auhome/index.html)\n",
"* [Scotland](/scotland/index.html)\n",
"* [Books](/home/books/index.html)\n",
"* [Rewards](https://www.mailplus.co.uk/rewards)\n",
"* [Mail Shop](https://www.mailshop.co.uk/?utm_source=mailonline&utm_medium=referral&utm_campaign=mol-nav&utm_content=top-navbar)\n",
"* [Cars](/money/cars/index.html)\n",
"* [Property](/property/index.html)\n",
"* [Columnists](/columnists/index.html)\n",
"* [Games](https://games.dailymail.co.uk/all-games?arkpromo=TopNavAllGames)\n",
"\n",
"\n",
"* [My Profile](/registration/profile.html)\n",
"* [Logout](/registration/logout.html)\n",
"\n",
"\n",
"* [Login](/registration/login.html?reg_source=navigation&targetUrl=)\n",
"\n",
"UK Edition [Privacy Policy](/privacy) [Feedback](/home/contactus/index.html)\n",
"\n",
"**Tuesday, Dec 30th 2025** [12PM **5°C** 3PM **5°C** 5-Day Forecast](/home/weather/index.html)\n",
"\n",
"# Home\n",
"\n",
"Last updated: 14:11 GMT, 30 December 2025\n",
"\n",
"Advertisement\n",
"\n",
"## [Anthony Joshua's mum rushes to his bedside at 'Nigeria's best hospital' as eyewitness describes crash that killed his two friends as 'sounding like a bomb was going off'](/sport/boxing/article-15421993/Anthony-Joshua-mum-bedside-Nigeria-best-hospital-eyewitness-describes-crash-killed-two-friends-sounding-like-bomb-going-off.html)\n",
"\n",
"[](/sport/boxing/article-15421993/Anthony-Joshua-mum-bedside-Nigeria-best-hospital-eyewitness-describes-crash-killed-two-friends-sounding-like-bomb-going-off.html)\n",
"\n",
"[Anthony Joshua's mum has visited his bedside in hospital as he recovers from injuries sustained in a car crash in Nigeria that killed two of his close friends. Joshua is being treated at the Duchess International Hospital in Lagos, which has been named as the best private hospital in Nigeria for the past two years. Nigeria's president confirmed on Monday that Joshua's mum had joined him at the hospital after calling the heavyweight boxing star.](/sport/boxing/article-15421993/Anthony-Joshua-mum-bedside-Nigeria-best-hospital-eyewitness-describes-crash-killed-two-friends-sounding-like-bomb-going-off.html)\n",
"\n",
"* \n",
"* 17\n",
"* [94 comments](https://www.dailymail.co.uk/sport/boxing/article-15421993/Anthony-Joshua-mum-bedside-Nigeria-best-hospital-eyewitness-describes-crash-\n"
]
}
],
"source": [
"from parallel import Parallel\n",
"\n",
"parallel_result = None\n",
"parallel_error = None\n",
"\n",
"api_key = os.getenv(\"PARALLEL_API_KEY\")\n",
"if not api_key:\n",
"    parallel_error = \"PARALLEL_API_KEY not set\"\n",
"else:\n",
"    client = Parallel(api_key=api_key)\n",
"    extract = client.beta.extract(\n",
"        urls=[TEST_URL],\n",
"        objective=\"Extract the main content of this page\",\n",
"        excerpts=True,\n",
"        full_content=True,\n",
"    )\n",
"    parallel_result = extract.results\n",
"\n",
"if parallel_result:\n",
"    for result in parallel_result:\n",
"        if result.full_content:\n",
"            print(result.full_content[:MAX_CHARS])\n",
"else:\n",
"    # Fall back to a message when the API returned no results and no error was recorded\n",
"    print(f\"Parallel.ai error: {parallel_error or 'no results returned'}\")"
]
},
{
"cell_type": "markdown",
"id": "fb8e4c91",
"metadata": {
"papermill": {
"duration": 0.001771,
"end_time": "2025-12-30T14:30:18.233190",
"exception": false,
"start_time": "2025-12-30T14:30:18.231419",
"status": "completed"
},
"tags": []
},
"source": [
"## 7. Exa\n",
"\n",
"[Exa](https://exa.ai/) - AI-native search and content extraction API.\n"
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "fd2bc1d0",
"metadata": {
"execution": {
"iopub.execute_input": "2025-12-30T14:30:18.237690Z",
"iopub.status.busy": "2025-12-30T14:30:18.237571Z",
"iopub.status.idle": "2025-12-30T14:30:18.782405Z",
"shell.execute_reply": "2025-12-30T14:30:18.781899Z"
},
"papermill": {
"duration": 0.547973,
"end_time": "2025-12-30T14:30:18.783056",
"exception": false,
"start_time": "2025-12-30T14:30:18.235083",
"status": "completed"
},
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"US Home | Daily Mail Online\n",
"[![Daily Mail - news, sport, celebrity, science and health stories](https://i.dailymail.co.uk/i/sitelogos/DailyMail_Main.png)](https://www.dailymail.co.uk/)[](https://www.mailsubscriptions.co.uk/info/418990/choose-your-subscription?molclicksource=banner)\n",
"* [Home](https://www.dailymail.co.uk/ushome/index.html)\n",
"* [Showbiz](https://www.dailymail.co.uk/usshowbiz/index.html)\n",
"* [TV](https://www.dailymail.co.uk/tv/us/index.html)\n",
"* [Sports](https://www.dailymail.co.uk/sport/us/index.html)\n",
"* [Royals](https://www.dailymail.co.uk/news/royals/index.html)\n",
"* [Video](https://www.dailymail.co.uk/video/index.html)\n",
"* [News](https://www.dailymail.co.uk/news/us/index.html)\n",
"* [Lifestyle](https://www.dailymail.co.uk/lifestyle/us/index.html)\n",
"* [Money](https://www.dailymail.co.uk/yourmoney/index.html)\n",
"* [U.K.](https://www.dailymail.co.uk/home/index.html)\n",
"* [Buyline](https://www.dailymail.co.uk/buyline/us/index.html)\n",
"* [DailyMail+](https://www.dailymail.co.uk/mailplus/us/index.html)\n",
"* [Latest Headlines](https://www.dailymail.co.uk/home/latest/index.html)\n",
"* [Science](https://www.dailymail.co.uk/sciencetech/index.html)\n",
"* [Health](https://www.dailymail.co.uk/health/us/index.html)\n",
"* [Podcasts](https://www.dailymail.co.uk/podcasts/index.html)\n",
"* [Travel](https://www.dailymail.co.uk/travel/index.html)\n",
"* [Australia](https://www.dailymail.co.uk/auhome/index.html)\n",
"* [Games](https://games.dailymail.co.uk/all-games?arkpromo=TopNavAllGames)\n",
"* [Puzzles](https://www.dailymail.co.uk/puzzles/index.html)\n",
"* [My Profile](https://www.dailymail.co.uk/registration/profile.html)\n",
"* [Logout](https://www.dailymail.co.uk/registration/logout.html)\n",
"* [Login](https://www.dailymail.co.uk/registration/login.html?reg_source=navigation&targetUrl=)\n",
"![](https://i.dailymail.co.uk/i/pix/channelheaders/US.png)\n",
"US Edition[Privacy Policy](https://www.dailymail.co.uk/privacy)[Feedback](https://www.dailymail.co.uk/home/contactus/index.html)\n",
"![](https://i.dailymail.co.uk/i/furniture/facebook/DailyMail/DailyMail.png)\n",
"**Sunday, Dec 21st 2025**[6PM**63°F**9PM**61°F**5-Day Forecast](https://www.dailymail.co.uk/home/weather/index.html)\n",
"# Home\n",
"Updated: 19:39 EST\n",
"Advertisement\n",
"## [DOJ forced into Epstein files U-turn after Trump photos vanished... as Pam Bondi is warned she may face CHARGES](https://www.dailymail.co.uk/news/article-15404691/DOJ-Epstein-Trump-photo-Pam-Bondi-charges-contempt.html)\n",
"[![DOJ forced into Epstein files U-turn after Trump photos vanished as Pam Bondi is warned](https://i.dailymail.co.uk/1s/2025/12/22/00/104929933-0-image-m-1_1766363247531.jpg)](https://www.dailymail.co.uk/news/article-15404691/DOJ-Epstein-Trump-photo-Pam-Bondi-charges-contempt.html)\n",
"[The Department of Justice was forced to make an embarrassing U-turn on its release of documents related to Jeffrey Epstein's sex crimes after a photo of President Donald Trump was removed from the files. The missing photo showed Trump alongside his wife Melania, Epstein, and the pedophile's longtime associate Ghislaine\n"
]
}
],
"source": [
"from exa_py import Exa\n",
"\n",
"exa_result = None\n",
"exa_error = None\n",
"\n",
"exa_api_key = os.getenv(\"EXA_API_KEY\")\n",
"if not exa_api_key:\n",
"    exa_error = \"EXA_API_KEY not set\"\n",
"else:\n",
"    exa = Exa(exa_api_key)\n",
"    results = exa.get_contents(urls=[TEST_URL], text=True)\n",
"    if results.results:\n",
"        exa_result = results.results[0].text\n",
"\n",
"if exa_result:\n",
"    print(exa_result[:MAX_CHARS])\n",
"else:\n",
"    # Fall back to a message when the API returned no text and no error was recorded\n",
"    print(f\"Exa error: {exa_error or 'no text returned'}\")"
]
},
{
"cell_type": "markdown",
"id": "b5c89c1a",
"metadata": {
"papermill": {
"duration": 0.001947,
"end_time": "2025-12-30T14:30:18.787986",
"exception": false,
"start_time": "2025-12-30T14:30:18.786039",
"status": "completed"
},
"tags": []
},
"source": [
"## 8. html2text (direct)\n",
"\n",
"[html2text](https://github.com/Alir3z4/html2text) - Converts HTML to Markdown without readability filtering.\n"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "298e1fb0",
"metadata": {
"execution": {
"iopub.execute_input": "2025-12-30T14:30:18.795248Z",
"iopub.status.busy": "2025-12-30T14:30:18.795132Z",
"iopub.status.idle": "2025-12-30T14:30:18.865218Z",
"shell.execute_reply": "2025-12-30T14:30:18.864894Z"
},
"papermill": {
"duration": 0.074583,
"end_time": "2025-12-30T14:30:18.865934",
"exception": false,
"start_time": "2025-12-30T14:30:18.791351",
"status": "completed"
},
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" \n",
"\n",
"[](/)\n",
"\n",
" * [Home](/ushome/index.html)\n",
" * [Showbiz](/usshowbiz/index.html)\n",
" * [TV](/tv/us/index.html)\n",
" * [Sports](/sport/us/index.html)\n",
" * [Royals](/news/royals/index.html)\n",
" * [Video](/video/index.html)\n",
" * [News](/news/us/index.html)\n",
" * [Lifestyle](/lifestyle/us/index.html)\n",
" * [Money](/yourmoney/index.html)\n",
" * [U.K.](/home/index.html)\n",
" * [Buyline](/buyline/us/index.html)\n",
"\n",
"\n",
"\n",
" * [Latest Headlines](/home/latest/index.html)\n",
" * [ Science ](/sciencetech/index.html)\n",
" * [ Health ](/health/us/index.html)\n",
" * [ Podcasts ](/podcasts/index.html)\n",
" * [ Travel ](/travel/index.html)\n",
" * [ Australia ](/auhome/index.html)\n",
" * [ Games ](https://games.dailymail.co.uk/all-games?arkpromo=TopNavAllGames)\n",
" * [ Puzzles ](/puzzles/index.html)\n",
"\n",
"\n",
" * [My Profile](/registration/profile.html)\n",
" * [Logout](/registration/logout.html)\n",
"\n",
"\n",
" * [ Login ](/registration/login.html?reg_source=navigation&targetUrl=)\n",
"\n",
"\n",
"\n",
"US Edition [Privacy Policy](/privacy) [Feedback](/home/contactus/index.html)\n",
"\n",
"**Tuesday, Dec 30th 2025**[ 10AM **33 °F** 1PM **34 °F** 5-Day Forecast](/home/weather/index.html)\n",
"\n",
" \n",
"\n",
"# Home\n",
"\n",
"Updated: 09:20 EST\n",
"\n",
"Advertisement\n",
"\n",
"## [ Devastated mother of TV news anchor, 30, reveals shocking way she found out daughter had died in plane crash](/news/article-15422197/carley-mccord-mother-moment-daughter-killed-louisiana-plane-crash.html)\n",
"\n",
"[ ](/news/article-15422197/carley-mccord-mother-moment-daughter-killed-louisiana-plane-crash.html)\n",
"\n",
"[ Journalist Carley McCord, 30, was among five people killed in a plane crash on her way to a football game on December 28, 2019. The plane came down near a post office and Walmart n Lafayette, Louisiana, hitting a car, flipping it over and then slamming into a tree. ](/news/article-15422197/carley-mccord-mother-moment-daughter-killed-louisiana-plane-crash.html)\n",
"\n",
" * * * [ 8 comments ](https://www.dailymail.co.uk/news/article-15422197/carley-mccord-mother-moment-daughter-killed-louisiana-plane-crash.html?ico=comment-anchor#comments)\n",
" * [ 1 video ](https://www.dailymail.co.uk/news/article-15422197/carley-mccord-mother-moment-daughter-killed-louisiana-plane-crash.html#video)\n",
"\n",
"\n",
"\n",
"###### \n",
"\n",
"## [ Another bourbon distiller files for bankruptcy as sector collapses](/yourmoney/article-15420329/bourbon-distillery-bankruptcy-ohio-kentucky.html)\n",
"\n",
"[ ](/yourmoney/article-15420329/bourbon-distillery-bankruptcy-ohio-kentucky.html)\n",
"\n",
"[ The bourbon buzz is fading fast. Another American spirits maker has gone bust, adding to growing signs that the US whiskey and bourbon boom has sharply reversed. 'It's a sad day for bourbon, to be honest with you,' whiskey expert Fred Minnick said. ](/yourmoney/article-15420329/bourbon-distillery-bankruptcy-ohio-kentucky.html)\n",
"\n",
" * * [ 416 comments ](https://www.dailymail.co.uk/yourmoney/article-15420329/bourbon-distillery-bankruptcy-ohio-kentucky.html?ico=comment-anchor#comments)\n",
" * 13 shares Another bourbon distiller files for bankruptcy as sector collapses\n",
"\n",
"\n",
"\n",
"## ['Truth nuke' explodes on CNN as Somal\n"
]
}
],
"source": [
"if html_content:\n",
" html2text_output = h2t.handle(html_content)\n",
" print(html2text_output[:MAX_CHARS])\n",
"else:\n",
" html2text_output = \"\"\n",
" print(f\"Skipped: {fetch_error}\")"
]
},
{
"cell_type": "markdown",
"id": "f6584f23",
"metadata": {
"papermill": {
"duration": 0.002122,
"end_time": "2025-12-30T14:30:18.870108",
"exception": false,
"start_time": "2025-12-30T14:30:18.867986",
"status": "completed"
},
"tags": []
},
"source": [
"## 9. BeautifulSoup\n",
"\n",
"[BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/) - Manual text extraction baseline.\n"
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "1a165cb6",
"metadata": {
"execution": {
"iopub.execute_input": "2025-12-30T14:30:18.874520Z",
"iopub.status.busy": "2025-12-30T14:30:18.874386Z",
"iopub.status.idle": "2025-12-30T14:30:18.947426Z",
"shell.execute_reply": "2025-12-30T14:30:18.947091Z"
},
"papermill": {
"duration": 0.076201,
"end_time": "2025-12-30T14:30:18.948217",
"exception": false,
"start_time": "2025-12-30T14:30:18.872016",
"status": "completed"
},
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"US Home | Daily Mail Online\n",
"Home\n",
"Showbiz\n",
"TV\n",
"Sports\n",
"Royals\n",
"Video\n",
"News\n",
"Lifestyle\n",
"Money\n",
"U.K.\n",
"Buyline\n",
"Latest Headlines\n",
"Science\n",
"Health\n",
"Podcasts\n",
"Travel\n",
"Australia\n",
"Games\n",
"Puzzles\n",
"My Profile\n",
"Logout\n",
"Login\n",
"US Edition\n",
"Privacy Policy\n",
"Feedback\n",
"Tuesday, Dec 30th 2025\n",
"10AM\n",
"33°F\n",
"1PM\n",
"34°F\n",
"5-Day Forecast\n",
"Home\n",
"Updated: 09:20 EST\n",
"Advertisement\n",
"Devastated mother of TV news anchor, 30, reveals shocking way she found out daughter had died in plane crash\n",
"Journalist Carley McCord, 30, was among five people killed in a plane crash on her way to a football game on December 28, 2019. The plane came down near a post office and Walmart n Lafayette, Louisiana, hitting a car, flipping it over and then slamming into a tree.\n",
"8\n",
"comments\n",
"1\n",
"video\n",
"Another bourbon distiller files for bankruptcy as sector collapses\n",
"The bourbon buzz is fading fast. Another American spirits maker has gone bust, adding to growing signs that the US whiskey and bourbon boom has sharply reversed. 'It's a sad day for bourbon, to be honest with you,' whiskey expert Fred Minnick said.\n",
"416\n",
"comments\n",
"13\n",
"shares\n",
"Another bourbon distiller files for bankruptcy as sector collapses\n",
"'Truth nuke' explodes on CNN as Somali daycare scandal sparks GOP demand for total migrant shutdown\n",
"NEW\n",
"Scott Jennings dropped what social media described as a 'truth nuke' when he clashed with CNN host Abby Phillip over the lack of accountability by elected officials for the childcare fraud.\n",
"4\n",
"comments\n",
"1\n",
"video\n",
"share\n",
"'Truth nuke' explodes on CNN as Somali scandal sparks new GOP demand\n",
"Popular holiday destination dubbed 'little Paris' to introduce new tourist tax in 2026\n",
"Travelers heading to this city in 2026 will be stung by a new tourist tax, despite growing backlash from the hotel industry.\n",
"16\n",
"comments\n",
"share\n",
"Holiday spot dubbed 'little Paris' to introduce tourist tax in 2026\n",
"I'm a leading surgeon, professor... and a recovering alcoholic. This is the shocking amount I was drinking at my lowest, how I stayed functional - and how I finally turned my life around: DR CHARLES KNOWLES\n",
"According to experts, functioning alcoholics are often high achievers with stressful jobs, while also suffering low self-esteem and mental health problems such as depression.\n",
"43\n",
"237\n",
"comments\n",
"All the reasons PR gurus gave for leaving Team Sussex\n",
"Denise and ex Aaron evicted from LA mansion\n",
"Kevin Costner, 70, enjoys a night out with daughter, 39\n",
"Dakota, 36, sparks romance rumors with musician, 28\n",
"Amelia Hamlin showcases rock hard abs in bikini\n",
"Sharon Stone's son announces his engagement\n",
"All the reasons PR gurus gave for leaving Team Sussex\n",
"Denise and ex Aaron evicted from LA mansion\n",
"Kevin Costner, 70, enjoys a night out with daughter, 39\n",
"Dakota, 36, sparks romance rumors with musician, 28\n",
"Amelia Hamlin showcases rock hard abs in bikini\n",
"Sharon Stone's son announces his engagement\n",
"All the reasons PR gurus gave for leaving Team Sussex\n",
"Denise and ex Aaron evicted from LA mansion\n",
"Kevin Costner, 70, enjoys a night out with daughter, 39\n",
"Dakota, 36, sparks romance rumors with mus\n"
]
}
],
"source": [
"from bs4 import BeautifulSoup\n",
"\n",
"if html_content:\n",
" soup = BeautifulSoup(html_content, \"lxml\")\n",
" for el in soup([\"script\", \"style\", \"nav\", \"footer\", \"header\"]):\n",
" el.decompose()\n",
"\n",
" content = soup.find(\"div\", {\"id\": \"mw-content-text\"})\n",
" bs_text = (\n",
" content.get_text(separator=\"\\n\", strip=True)\n",
" if content\n",
" else soup.get_text(separator=\"\\n\", strip=True)\n",
" )\n",
" print(bs_text[:MAX_CHARS])\n",
"else:\n",
" bs_text = \"\"\n",
" print(f\"Skipped: {fetch_error}\")"
]
},
{
"cell_type": "markdown",
"id": "899b203c",
"metadata": {
"papermill": {
"duration": 0.002677,
"end_time": "2025-12-30T14:30:18.953096",
"exception": false,
"start_time": "2025-12-30T14:30:18.950419",
"status": "completed"
},
"tags": []
},
"source": [
"## Summary\n"
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "0888b242",
"metadata": {
"execution": {
"iopub.execute_input": "2025-12-30T14:30:18.959573Z",
"iopub.status.busy": "2025-12-30T14:30:18.959311Z",
"iopub.status.idle": "2025-12-30T14:30:18.962685Z",
"shell.execute_reply": "2025-12-30T14:30:18.962132Z"
},
"papermill": {
"duration": 0.007596,
"end_time": "2025-12-30T14:30:18.963132",
"exception": false,
"start_time": "2025-12-30T14:30:18.955536",
"status": "completed"
},
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"exa : 191,160 chars\n",
"html2text : 139,268 chars\n",
"parallel.ai : 71,194 chars\n",
"beautifulsoup : 63,728 chars\n",
"trafilatura : 1,578 chars\n",
"mozilla readability : 1,170 chars\n",
"newspaper3k : 336 chars\n",
"playwright : 328 chars\n",
"readability-lxml : 316 chars\n"
]
}
],
"source": [
"results = {\n",
" \"trafilatura\": len(trafilatura_text or \"\"),\n",
" \"newspaper3k\": len(article.text or \"\")\n",
" if html_content and not newspaper_error\n",
" else 0,\n",
" \"readability-lxml\": len(readability_markdown),\n",
" \"mozilla readability\": len(mozilla_markdown),\n",
" \"playwright\": len(playwright_extracted or \"\"),\n",
" \"parallel.ai\": len(parallel_result[0].full_content or \"\") if parallel_result else 0,\n",
" \"exa\": len(exa_result or \"\"),\n",
" \"html2text\": len(html2text_output),\n",
" \"beautifulsoup\": len(bs_text),\n",
"}\n",
"\n",
"if fetch_error:\n",
" print(\n",
" f\"Note: requests fetch failed ({fetch_error}), some tools used Playwright-fetched HTML\\n\"\n",
" )\n",
"\n",
"for name, length in sorted(results.items(), key=lambda x: -x[1]):\n",
" print(f\"{name:25s}: {length:>8,} chars\")"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.9"
},
"papermill": {
"default_parameters": {},
"duration": 11.984834,
"end_time": "2025-12-30T14:30:19.386343",
"environment_variables": {},
"exception": null,
"input_path": "compare_extractors.ipynb",
"output_path": "cluttered.ipynb",
"parameters": {
"TEST_URL": "https://www.dailymail.co.uk/"
},
"start_time": "2025-12-30T14:30:07.401509",
"version": "2.6.0"
}
},
"nbformat": 4,
"nbformat_minor": 5
}