| Dataset / Platform | Scope (what it includes) | Approx. size | Bulk access | License / reuse for products & demos |
|---|---|---|---|---|
| Semantic Scholar Open Research Corpus (S2ORC) | Papers (metadata, abstracts, references); full text only for OA subset | ~80M papers; ~8–12M full text | Static snapshots (≤2020) + API-backed bulk | Mixed. Metadata generally reusable. Full text only if OA license permits (e.g., CC-BY). You may build research products, but must respect per-paper licenses and cannot redistribute closed text. |
| OpenAlex | Papers/works, authors, venues, concepts, citations | 400M+ works | Monthly open snapshots | CC0 (public domain). ✅ Fully safe for commercial products, public demos, redistribution, citation graphs, analytics. One of the safest choices. |
| Crossref | DOI metadata + references (when deposited) | 150M+ DOIs | Annual public dumps / REST API | Metadata is open, and deposited references have been openly distributed since 2022. Generally safe for analytics and discovery, not for redistributing publisher text. |
| PubMed | Biomedical citations & abstracts | 39M+ records | FTP baseline | Abstracts are freely reusable for research & products. No full text included. ✅ Safe for public demos using abstracts & metadata. |
| PubMed Central (PMC) | Full-text biomedical articles (XML/PDF) | ~9M articles | OA subset + archive | Mixed. PMC hosts both OA and non-OA. PMC Open Access Subset is safe (CC licenses). Non-OA content ❌ cannot be redistributed. |
| arXiv | Preprint papers (full text) | ~2.5M papers | Full bulk (requester-pays S3) | Per-paper licenses (CC-BY, CC-BY-SA, arXiv non-exclusive distribution license). Many allow redistribution, some don’t. You must filter by license for public products. |
| ACL Anthology | NLP/CL papers (mostly full text PDFs) | ~120k papers | Direct download | Mostly permissive for research, but licenses vary by paper. Public demos OK if you don’t redistribute PDFs wholesale and respect licenses. |
| CORE | OA papers (metadata + full text) | ~291M records; ~34M full text | Bulk access with agreement | OA only, but licenses vary (CC-BY, CC-BY-NC, etc.). Commercial reuse may be restricted for NC papers. Good for research; license filtering required for products. |
| Unpaywall | OA links & metadata (no hosted full text) | ~100M records | Snapshot downloads | Metadata is open (CC0). Links only. ✅ Very safe for discovery products and demos. |
| OpenCitations | Citation graphs + bibliographic metadata | 2B+ citation edges | Full bulk dumps | CC0. ✅ Safe for any research, product, commercial use, and redistribution. |
| Dimensions (Dimensions.ai) | Publications, citations, grants, patents, trials, policy docs | 100M+ pubs + grants/patents | API / enterprise only | Commercial, proprietary license. ❌ No open redistribution. Public demos typically restricted to UI screenshots or limited API usage under contract. |
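
Several rows above (S2ORC, arXiv, CORE, PMC) require filtering by per-paper license before building a public product. The following is a minimal sketch of such a filter, not any platform's official method. The record layout (a dict with a `license` key), the helper names, and the `COMMERCIAL_OK` allow-list are all assumptions made for illustration; check each source's actual license field and terms before relying on it.

```python
import re
from typing import Iterable, Optional

# Hypothetical allow-list of license tags treated as safe for commercial
# redistribution. This is an assumption for illustration, not legal advice.
COMMERCIAL_OK = {"cc0", "cc-by", "cc-by-sa"}

# Matches Creative Commons license URLs such as
# http://creativecommons.org/licenses/by/4.0/
_CC_URL = re.compile(r"creativecommons\.org/(licenses|publicdomain)/([a-z\-]+)/")


def license_tag(raw: Optional[str]) -> Optional[str]:
    """Normalize a raw license string or CC URL to a short tag like 'cc-by'."""
    if not raw:
        return None
    text = raw.strip().lower()
    match = _CC_URL.search(text)
    if match:
        kind, code = match.groups()
        if kind == "publicdomain":
            return "cc0" if code == "zero" else "public-domain-" + code
        return "cc-" + code
    # Fall back to a plain-text label, e.g. "CC BY-NC" -> "cc-by-nc".
    return text.replace(" ", "-").replace("_", "-")


def is_redistributable(raw: Optional[str]) -> bool:
    """Return True only for licenses on the allow-list; unknown means False."""
    return license_tag(raw) in COMMERCIAL_OK


def filter_records(records: Iterable[dict]) -> list:
    """Keep records whose (hypothetical) 'license' field is on the allow-list."""
    return [r for r in records if is_redistributable(r.get("license"))]
```

The filter is deliberately conservative: a missing or unrecognized license (including arXiv's non-exclusive distribution license) is treated as not redistributable, so anything outside the allow-list is dropped rather than guessed at.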
Created
February 6, 2026 00:49