A cleaned, analysis-ready version of CMS T-MSIS Medicaid claims data covering January 2018 – December 2024, hosted by Carolina Cloud — a bioinformatics and data cloud built in the Research Triangle.
Why we're hosting this: Public datasets should be easy to use. We've cleaned, normalized, and split these files so you can be doing analysis in minutes. If you have an interesting analysis you want to run but need serious compute to do it, email derek@carolinacloud.io — describe what you're investigating and we'll give you free credits on a high-RAM instance.
import polars as pl
claims = pl.read_parquet('https://console.carolinacloud.io/static/datasets/medicaid-claims.parquet')
providers = pl.read_parquet('https://console.carolinacloud.io/static/datasets/medicaid-providers.parquet')
codes = pl.read_parquet('https://console.carolinacloud.io/static/datasets/medicaid-codes.parquet')from polars import col
(
claims
.group_by('hcpcs_code')
.agg(col('total_paid').sum().alias('total_paid'))
.sort('total_paid', descending=True)
.head(10)
.join(codes.select('hcpcs_code', 'short_description'), on='hcpcs_code', how='left')
)provider_slim = providers.select('npi', 'business_name', 'provider_last_name',
'provider_first_name', 'practice_state', 'provider_taxonomy_code')
enriched = claims.join(provider_slim, on='npi', how='left')(
claims
.filter(col('hcpcs_code') == 'J0696') # Ceftriaxone injection
.group_by('month')
.agg(col('total_paid').sum(), col('n_claims').sum())
.sort('month')
)Are orthopedic surgeons billing for psychotherapy? Are psychiatrists billing for surgery? This query surfaces the top HCPCS code per taxonomy group — great for sanity checking or finding anomalies.
from polars import col
(
claims
.join(providers.select('npi', 'provider_taxonomy_code'), on='npi', how='left')
.join(codes.select('hcpcs_code', 'short_description'), on='hcpcs_code', how='left')
.group_by('provider_taxonomy_code', 'hcpcs_code', 'short_description')
.agg(col('n_claims').sum().alias('total_claims'))
.sort('total_claims', descending=True)
.group_by('provider_taxonomy_code')
.agg(col('hcpcs_code').first(), col('short_description').first(), col('total_claims').first())
.sort('total_claims', descending=True)
)Flag providers billing codes that are almost never billed by their specialty — a classic fraud/abuse signal. Here we find providers whose top billed code is billed by fewer than 1% of providers sharing their taxonomy.
# Step 1: typical codes per taxonomy
specialty_codes = (
claims
.join(providers.select('npi', 'provider_taxonomy_code'), on='npi', how='left')
.group_by('provider_taxonomy_code', 'hcpcs_code')
.agg(pl.n_unique('npi').alias('n_providers_billing'))
)
specialty_totals = (
providers
.group_by('provider_taxonomy_code')
.agg(pl.len().alias('n_providers_in_specialty'))
)
specialty_codes = specialty_codes.join(specialty_totals, on='provider_taxonomy_code', how='left').with_columns(
(col('n_providers_billing') / col('n_providers_in_specialty')).alias('pct_of_specialty_billing_this_code')
)
# Step 2: each provider's top billed code
provider_top_code = (
claims
.group_by('npi', 'hcpcs_code')
.agg(col('n_claims').sum())
.sort('n_claims', descending=True)
.group_by('npi')
.agg(col('hcpcs_code').first().alias('top_hcpcs_code'))
)
# Step 3: flag mismatches
(
provider_top_code
.join(providers.select('npi', 'provider_taxonomy_code', 'provider_first_name', 'provider_last_name', 'business_name'), on='npi', how='left')
.join(specialty_codes.rename({'hcpcs_code': 'top_hcpcs_code'}), on=['provider_taxonomy_code', 'top_hcpcs_code'], how='left')
.filter(col('pct_of_specialty_billing_this_code') < 0.01)
.join(codes.select('hcpcs_code', 'short_description').rename({'hcpcs_code': 'top_hcpcs_code'}), on='top_hcpcs_code', how='left')
.sort('pct_of_specialty_billing_this_code')
.select('npi', 'business_name', 'provider_last_name', 'provider_taxonomy_code', 'top_hcpcs_code', 'short_description', 'pct_of_specialty_billing_this_code')
)Find providers billing an unusual number of claims relative to peers billing the same code in the same state — a billing volume anomaly flag.
peer_stats = (
claims
.join(providers.select('npi', 'practice_state'), on='npi', how='left')
.group_by('practice_state', 'hcpcs_code')
.agg(
col('n_claims').mean().alias('peer_mean_claims'),
col('n_claims').std().alias('peer_std_claims'),
)
)
(
claims
.join(providers.select('npi', 'practice_state', 'business_name', 'provider_last_name'), on='npi', how='left')
.join(peer_stats, on=['practice_state', 'hcpcs_code'], how='left')
.with_columns(
((col('n_claims') - col('peer_mean_claims')) / col('peer_std_claims')).alias('z_score')
)
.filter(col('z_score') > 4)
.sort('z_score', descending=True)
.join(codes.select('hcpcs_code', 'short_description'), on='hcpcs_code', how='left')
)| claims | providers | codes | |
|---|---|---|---|
| Source | CMS T-MSIS | CMS NPPES NPI Registry | CMS HCPCS 2026 |
| Rows | 227 million | 9.4 million | 9,006 |
| Columns | 7 | 22 | 3 |
| Size on disk | ~11 GB | ~1.2 GB | < 1 MB |
| Granularity | Provider × Procedure × Month | One row per NPI | One row per HCPCS code |
| Time Period | Jan 2018 – Dec 2024 | Current as of Feb 2026 | 2026 release |
| Join Key | npi, hcpcs_code |
npi |
hcpcs_code |
The claims file alone is ~11 GB on disk. Polars is lazy and efficient, but in-memory operations — especially joins and aggregations — typically require roughly 2× the data size in RAM, which puts you around 22 GB just to work with claims before joining anything. The providers file adds another ~1.2 GB but it's wide (22 columns of strings), and joining it to the long claims table is where memory really balloons. A full enriched join of both datasets is well into 128 GB territory.
If your laptop can't handle it, don't fight it — use .select() to grab only the provider columns you actually need before joining, or filter claims down to a state or time range first. If you have a serious analysis that requires the full dataset, email derek@carolinacloud.io with what you're investigating and we'll set you up on a high-RAM instance.
Context on why this data matters right now: DOGE/HHS has been publicly posting Medicaid spending anomalies on social media. This dataset lets you do that analysis yourself, rigorously, rather than taking curated screenshots at face value.
The core dataset. Each row is one provider billing one procedure code in one month. 227 million rows, 7 columns, ~11 GB.
| Column | Type | Description |
|---|---|---|
npi |
String | Billing provider NPI. Join to providers.npi. |
servicing_provider_npi_num |
String | NPI of the provider who actually performed the service, if different from the billing provider. Common in group practices. |
hcpcs_code |
String | Procedure/service code. Join to codes.hcpcs_code for descriptions. |
month |
Date | First day of the claim month (e.g. 2024-03-01 = March 2024). |
n_unique_patients |
Int | Number of distinct Medicaid beneficiaries who received this procedure from this provider in this month. |
n_claims |
Int | Total claims submitted. A single patient may generate multiple claims. |
total_paid |
Float | Total dollars paid by Medicaid for this provider × procedure × month combination. |
Important caveats:
- Rows with fewer than 12 claims are suppressed by CMS for privacy. This dataset represents the majority of Medicaid spend but is not exhaustive.
- This covers fee-for-service and managed care encounter data. Capitation payments to MCOs (Managed Care Organizations) — which represent ~70% of total Medicaid enrollment — are not reflected here. Do not use
total_paidas a proxy for total Medicaid spending. - Data accuracy depends on state submissions to T-MSIS. Quality varies by state. See CMS DQ Atlas for known issues.
Cleaned provider reference data from the NPPES NPI Registry. 9.4 million rows, 22 columns, ~1.2 GB.
| Column | Type | Description |
|---|---|---|
npi |
String | 10-digit National Provider Identifier. Primary join key. |
business_name |
String | Legal business name. Populated for organizations (hospitals, group practices). Null for individual providers — use provider_first_name + provider_last_name instead. |
provider_last_name |
String | Last name of individual provider. Null for organizations. Lowercased. |
provider_first_name |
String | First name of individual provider. Null for organizations. Lowercased. |
provider_middle_initial |
String | Middle name/initial. Sparsely populated. Lowercased. |
provider_credential |
String | Credential stripped to letters only (e.g. MD, M.D., m.d. all become md). Lowercased. |
provider_business_mailing_address |
String | Mailing address lines 1 + 2 concatenated. Lowercased. |
business_mailing_city |
String | Mailing address city. Lowercased. |
business_mailing_state |
String | Mailing address 2-letter state code. Lowercased. |
business_mailing_zip5 |
String | Mailing address 5-digit ZIP (truncated from ZIP+4). |
business_mailing_country |
String | Country code for non-US addresses. Blank = United States. |
business_phone_number |
String | Mailing address phone number. |
business_fax_number |
String | Mailing address fax number. Sparsely populated. |
provider_business_practice_location_address |
String | Physical practice address lines 1 + 2. More useful than mailing address for geographic analysis. Lowercased. |
practice_city |
String | Practice location city. Lowercased. |
practice_state |
String | Practice location 2-letter state code. Lowercased. |
practice_zip5 |
String | Practice location 5-digit ZIP. Use this for geographic analysis. |
practice_country |
String | Country code for non-US practice locations. Blank = United States. |
practice_phone_number |
String | Practice location phone number. |
provider_enumeration_date |
Date | Date the provider was assigned their NPI by CMS. Not the date they joined Medicaid. Rough proxy for years in practice. |
provider_sex |
String | M / F. Null for organizations or where not reported. |
provider_taxonomy_code |
String | Primary NUCC taxonomy code identifying provider specialty (e.g. 207X00000X = Orthopedic Surgery, 251G00000X = Critical Access Hospital). Join to the NUCC taxonomy crosswalk for descriptions. Providers may have up to 15 taxonomy codes; only the primary is included here. |
HCPCS procedure code descriptions from the CMS 2026 annual release. 9,006 rows, 3 columns, trivially small.
| Column | Type | Description |
|---|---|---|
hcpcs_code |
String | 5-character HCPCS code. Join to claims.hcpcs_code. |
long_description |
String | Full plain-English description of the procedure or service (up to 80 chars). |
short_description |
String | Abbreviated description (up to 28 chars). Useful for chart labels. |
Note on CPT codes: HCPCS codes starting with a letter (A–V) are Level II codes maintained by CMS and are fully public. Numeric codes (00000–99999) are CPT codes owned by the AMA and carry copyright restrictions. Descriptions for numeric codes in this file are the AMA's "consumer-friendly" descriptions provided under CMS's licensing agreement.
- Claims: CMS T-MSIS — the federal data system collecting Medicaid/CHIP claims from all 50 states, DC, and territories.
- Providers: CMS NPPES NPI Registry — the national registry of all healthcare providers. Updated monthly.
- Codes: CMS HCPCS Annual Release — the official procedure code file. Updated quarterly.
T-MSIS (Transformed Medicaid Statistical Information System) is CMS's system for collecting standardized Medicaid and CHIP data from states. States submit data monthly covering fee-for-service claims, managed care encounters, beneficiary enrollment, and provider information.
Why does this dataset only show ~$180B in 2024 when total Medicaid spending is ~$900B?
Several reasons:
- Managed care capitation — ~70% of Medicaid beneficiaries are enrolled in managed care. States pay MCOs (like Centene, Molina, UnitedHealthcare) a fixed monthly rate per member. That capitation payment doesn't appear as a claim — only the encounters MCOs report back to states do, and that reporting is inconsistent.
- Institutional claims — inpatient hospital and long-term care (nursing homes, LTSS) are large cost drivers that may use claim types not captured here.
- Cell suppression — low-volume provider-procedure combinations are dropped for privacy.
- State data quality — T-MSIS data quality varies significantly by state.
This dataset is best used for relative comparisons (which providers bill the most, which procedures are trending, geographic patterns) rather than as an accounting of total Medicaid dollars.
These datasets are hosted on Carolina Cloud, a bioinformatics and data cloud based in the Research Triangle, NC. We specialize in high-RAM compute for genomics and large-scale data workloads.
Have a compelling analysis in mind but need the RAM to run it? Email derek@carolinacloud.io with what you want to investigate and we'll get you set up.