@danyaljj
Created February 6, 2026 00:49
| Dataset / Platform | Scope (what it includes) | Approx. size | Bulk access | License / reuse for products & demos |
| --- | --- | --- | --- | --- |
| Semantic Scholar Open Research Corpus (S2ORC) | Papers (metadata, abstracts, references); full text only for OA subset | ~80M papers; ~8–12M full text | Static snapshots (≤2020) + API-backed bulk | Mixed. Metadata generally reusable. Full text only if OA license permits (e.g., CC-BY). You may build research products, but must respect per-paper licenses and cannot redistribute closed text. |
| OpenAlex | Papers/works, authors, venues, concepts, citations | 400M+ works | Monthly open snapshots | CC0 (public domain). ✅ Fully safe for commercial products, public demos, redistribution, citation graphs, analytics. One of the safest choices. |
| Crossref | DOI metadata + references (when deposited) | 150M+ DOIs | Annual public dumps / REST API | Metadata is open, but references may have restrictions depending on depositor. Generally safe for analytics and discovery, not for redistributing publisher text. |
| PubMed | Biomedical citations & abstracts | 39M+ records | FTP baseline | Abstracts are freely reusable for research & products. No full text included. ✅ Safe for public demos using abstracts & metadata. |
| PubMed Central (PMC) | Full-text biomedical articles (XML/PDF) | ~9M articles | OA subset + archive | Mixed. PMC hosts both OA and non-OA. PMC Open Access Subset is safe (CC licenses). Non-OA content ❌ cannot be redistributed. |
| arXiv | Preprint papers (full text) | ~2.5M papers | Full bulk (S3) | Per-paper licenses (CC-BY, CC-BY-SA, arXiv non-exclusive). Many allow redistribution, some don’t. You must filter by license for public products. |
| ACL Anthology | NLP/CL papers (mostly full text PDFs) | ~120k papers | Direct download | Mostly permissive for research, but licenses vary by paper. Public demos OK if you don’t redistribute PDFs wholesale and respect licenses. |
| CORE | OA papers (metadata + full text) | ~291M records; ~34M full text | Bulk access with agreement | OA only, but licenses vary (CC-BY, CC-BY-NC, etc.). Commercial reuse may be restricted for NC papers. Good for research; license filtering required for products. |
| Unpaywall | OA links & metadata (no hosted full text) | ~100M records | Snapshot downloads | Metadata is open (CC0). Links only. ✅ Very safe for discovery products and demos. |
| OpenCitations | Citation graphs + bibliographic metadata | 2B+ citation edges | Full bulk dumps | CC0. ✅ Safe for any research, product, commercial use, and redistribution. |
| Dimensions (Dimensions.ai) | Publications, citations, grants, patents, trials, policy docs | 100M+ pubs + grants/patents | API / enterprise only | Commercial, proprietary license. ❌ No open redistribution. Public demos typically restricted to UI screenshots or limited API usage under contract. |
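
Several rows above (S2ORC, arXiv, CORE) require filtering by per-paper license before text is reused in a product. The following is a minimal Python sketch of such a filter, not a definitive implementation. The `Paper` record, its `license` field, and the two allowlists are hypothetical; none of the sources above is guaranteed to expose license tags in this shape, so the normalization step would need adapting to each dataset's actual metadata. Anything with a missing or unrecognized license is excluded by default.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical allowlists (assumption): check each source's actual terms.
# Note that CC-BY-SA also carries share-alike obligations for derivatives.
REDISTRIBUTABLE = {"cc0", "cc-by", "cc-by-sa"}
NONCOMMERCIAL_ONLY = {"cc-by-nc", "cc-by-nc-sa"}


@dataclass
class Paper:
    """Hypothetical record; real datasets name and encode these fields differently."""
    paper_id: str
    license: Optional[str]  # raw license string, or None when unknown


def normalize(tag: Optional[str]) -> Optional[str]:
    """Lowercase, hyphenate, and drop a trailing version such as '-4.0'."""
    if tag is None:
        return None
    cleaned = tag.strip().lower().replace(" ", "-").replace("_", "-")
    return re.sub(r"-\d+(\.\d+)*$", "", cleaned)


def can_redistribute(paper: Paper, commercial: bool) -> bool:
    """Conservative check: unknown or unlisted licenses are rejected."""
    tag = normalize(paper.license)
    if tag is None:
        return False
    if tag in REDISTRIBUTABLE:
        return True
    if tag in NONCOMMERCIAL_ONLY:
        return not commercial
    return False


if __name__ == "__main__":
    papers = [
        Paper("a", "CC BY 4.0"),
        Paper("b", "CC-BY-NC-4.0"),
        Paper("c", None),
        Paper("d", "arXiv non-exclusive"),
    ]
    keep = [p.paper_id for p in papers if can_redistribute(p, commercial=True)]
    print(keep)
```

Defaulting to exclusion is a deliberate choice here: a paper dropped by mistake costs coverage, while a paper redistributed by mistake may breach its license.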