Medicaid Provider Spending Dataset

A cleaned, analysis-ready version of CMS T-MSIS Medicaid claims data covering January 2018 – December 2024, hosted by Carolina Cloud — a bioinformatics and data cloud built in the Research Triangle.

Why we're hosting this: Public datasets should be easy to use. We've cleaned, normalized, and split these files so you can be doing analysis in minutes. If you have an interesting analysis you want to run but need serious compute to do it, email derek@carolinacloud.io — describe what you're investigating and we'll give you free credits on a high-RAM instance.

Quickstart

import polars as pl

claims    = pl.read_parquet('https://console.carolinacloud.io/static/datasets/medicaid-claims.parquet')
providers = pl.read_parquet('https://console.carolinacloud.io/static/datasets/medicaid-providers.parquet')
codes     = pl.read_parquet('https://console.carolinacloud.io/static/datasets/medicaid-codes.parquet')

Example: Top 10 procedures by total Medicaid spend

from polars import col

(
    claims
    .group_by('hcpcs_code')
    .agg(col('total_paid').sum().alias('total_paid'))
    .sort('total_paid', descending=True)
    .head(10)
    .join(codes.select('hcpcs_code', 'short_description'), on='hcpcs_code', how='left')
)

Example: Join providers to claims

provider_slim = providers.select('npi', 'business_name', 'provider_last_name',
                                  'provider_first_name', 'practice_state', 'provider_taxonomy_code')

enriched = claims.join(provider_slim, on='npi', how='left')

Example: Monthly spend trend for a specific procedure

(
    claims
    .filter(col('hcpcs_code') == 'J0696')   # Ceftriaxone injection
    .group_by('month')
    .agg(col('total_paid').sum(), col('n_claims').sum())
    .sort('month')
)

Example: Most billed procedure by provider specialty

Are orthopedic surgeons billing for psychotherapy? Are psychiatrists billing for surgery? This query surfaces the top HCPCS code per taxonomy group — great for sanity checking or finding anomalies.

from polars import col

(
    claims
    .join(providers.select('npi', 'provider_taxonomy_code'), on='npi', how='left')
    .join(codes.select('hcpcs_code', 'short_description'), on='hcpcs_code', how='left')
    .group_by('provider_taxonomy_code', 'hcpcs_code', 'short_description')
    .agg(col('n_claims').sum().alias('total_claims'))
    .sort('total_claims', descending=True)
    .group_by('provider_taxonomy_code')
    .agg(col('hcpcs_code').first(), col('short_description').first(), col('total_claims').first())
    .sort('total_claims', descending=True)
)

Example: Specialty mismatch detection

Flag providers billing codes that are almost never billed by their specialty — a classic fraud/abuse signal. Here we find providers whose top billed code is billed by fewer than 1% of providers sharing their taxonomy.

# Step 1: typical codes per taxonomy
specialty_codes = (
    claims
    .join(providers.select('npi', 'provider_taxonomy_code'), on='npi', how='left')
    .group_by('provider_taxonomy_code', 'hcpcs_code')
    .agg(pl.n_unique('npi').alias('n_providers_billing'))
)

specialty_totals = (
    providers
    .group_by('provider_taxonomy_code')
    .agg(pl.len().alias('n_providers_in_specialty'))
)

specialty_codes = specialty_codes.join(specialty_totals, on='provider_taxonomy_code', how='left').with_columns(
    (col('n_providers_billing') / col('n_providers_in_specialty')).alias('pct_of_specialty_billing_this_code')
)

# Step 2: each provider's top billed code
provider_top_code = (
    claims
    .group_by('npi', 'hcpcs_code')
    .agg(col('n_claims').sum())
    .sort('n_claims', descending=True)
    .group_by('npi')
    .agg(col('hcpcs_code').first().alias('top_hcpcs_code'))
)

# Step 3: flag mismatches
(
    provider_top_code
    .join(providers.select('npi', 'provider_taxonomy_code', 'provider_first_name', 'provider_last_name', 'business_name'), on='npi', how='left')
    .join(specialty_codes.rename({'hcpcs_code': 'top_hcpcs_code'}), on=['provider_taxonomy_code', 'top_hcpcs_code'], how='left')
    .filter(col('pct_of_specialty_billing_this_code') < 0.01)
    .join(codes.select('hcpcs_code', 'short_description').rename({'hcpcs_code': 'top_hcpcs_code'}), on='top_hcpcs_code', how='left')
    .sort('pct_of_specialty_billing_this_code')
    .select('npi', 'business_name', 'provider_last_name', 'provider_taxonomy_code', 'top_hcpcs_code', 'short_description', 'pct_of_specialty_billing_this_code')
)

Example: High-volume, high-spend outliers per state

Find providers billing an unusual number of claims relative to peers billing the same code in the same state — a billing volume anomaly flag.

peer_stats = (
    claims
    .join(providers.select('npi', 'practice_state'), on='npi', how='left')
    .group_by('practice_state', 'hcpcs_code')
    .agg(
        col('n_claims').mean().alias('peer_mean_claims'),
        col('n_claims').std().alias('peer_std_claims'),
    )
)

(
    claims
    .join(providers.select('npi', 'practice_state', 'business_name', 'provider_last_name'), on='npi', how='left')
    .join(peer_stats, on=['practice_state', 'hcpcs_code'], how='left')
    .with_columns(
        ((col('n_claims') - col('peer_mean_claims')) / col('peer_std_claims')).alias('z_score')
    )
    .filter(col('z_score') > 4)
    .sort('z_score', descending=True)
    .join(codes.select('hcpcs_code', 'short_description'), on='hcpcs_code', how='left')
)

Dataset Overview

	claims	providers	codes
Source	CMS T-MSIS	CMS NPPES NPI Registry	CMS HCPCS 2026
Rows	227 million	9.4 million	9,006
Columns	7	22	3
Size on disk	~11 GB	~1.2 GB	< 1 MB
Granularity	Provider × Procedure × Month	One row per NPI	One row per HCPCS code
Time Period	Jan 2018 – Dec 2024	Current as of Feb 2026	2026 release
Join Key	`npi`, `hcpcs_code`	`npi`	`hcpcs_code`

A note on RAM

The claims file alone is ~11 GB on disk. Polars is lazy and efficient, but in-memory operations — especially joins and aggregations — typically require roughly 2× the data size in RAM, which puts you around 22 GB just to work with claims before joining anything. The providers file adds another ~1.2 GB but it's wide (22 columns of strings), and joining it to the long claims table is where memory really balloons. A full enriched join of both datasets is well into 128 GB territory.

If your laptop can't handle it, don't fight it — use .select() to grab only the provider columns you actually need before joining, or filter claims down to a state or time range first. If you have a serious analysis that requires the full dataset, email derek@carolinacloud.io with what you're investigating and we'll set you up on a high-RAM instance.

Context on why this data matters right now: DOGE/HHS has been publicly posting Medicaid spending anomalies on social media. This dataset lets you do that analysis yourself, rigorously, rather than taking curated screenshots at face value.

`claims.parquet`

The core dataset. Each row is one provider billing one procedure code in one month. 227 million rows, 7 columns, ~11 GB.

Column	Type	Description
`npi`	String	Billing provider NPI. Join to `providers.npi`.
`servicing_provider_npi_num`	String	NPI of the provider who actually performed the service, if different from the billing provider. Common in group practices.
`hcpcs_code`	String	Procedure/service code. Join to `codes.hcpcs_code` for descriptions.
`month`	Date	First day of the claim month (e.g. `2024-03-01` = March 2024).
`n_unique_patients`	Int	Number of distinct Medicaid beneficiaries who received this procedure from this provider in this month.
`n_claims`	Int	Total claims submitted. A single patient may generate multiple claims.
`total_paid`	Float	Total dollars paid by Medicaid for this provider × procedure × month combination.

Important caveats:

Rows with fewer than 12 claims are suppressed by CMS for privacy. This dataset represents the majority of Medicaid spend but is not exhaustive.
This covers fee-for-service and managed care encounter data. Capitation payments to MCOs (Managed Care Organizations) — which represent ~70% of total Medicaid enrollment — are not reflected here. Do not use total_paid as a proxy for total Medicaid spending.
Data accuracy depends on state submissions to T-MSIS. Quality varies by state. See CMS DQ Atlas for known issues.

`providers.parquet`

Cleaned provider reference data from the NPPES NPI Registry. 9.4 million rows, 22 columns, ~1.2 GB.

Column	Type	Description
`npi`	String	10-digit National Provider Identifier. Primary join key.
`business_name`	String	Legal business name. Populated for organizations (hospitals, group practices). Null for individual providers — use `provider_first_name` + `provider_last_name` instead.
`provider_last_name`	String	Last name of individual provider. Null for organizations. Lowercased.
`provider_first_name`	String	First name of individual provider. Null for organizations. Lowercased.
`provider_middle_initial`	String	Middle name/initial. Sparsely populated. Lowercased.
`provider_credential`	String	Credential stripped to letters only (e.g. `MD`, `M.D.`, `m.d.` all become `md`). Lowercased.
`provider_business_mailing_address`	String	Mailing address lines 1 + 2 concatenated. Lowercased.
`business_mailing_city`	String	Mailing address city. Lowercased.
`business_mailing_state`	String	Mailing address 2-letter state code. Lowercased.
`business_mailing_zip5`	String	Mailing address 5-digit ZIP (truncated from ZIP+4).
`business_mailing_country`	String	Country code for non-US addresses. Blank = United States.
`business_phone_number`	String	Mailing address phone number.
`business_fax_number`	String	Mailing address fax number. Sparsely populated.
`provider_business_practice_location_address`	String	Physical practice address lines 1 + 2. More useful than mailing address for geographic analysis. Lowercased.
`practice_city`	String	Practice location city. Lowercased.
`practice_state`	String	Practice location 2-letter state code. Lowercased.
`practice_zip5`	String	Practice location 5-digit ZIP. Use this for geographic analysis.
`practice_country`	String	Country code for non-US practice locations. Blank = United States.
`practice_phone_number`	String	Practice location phone number.
`provider_enumeration_date`	Date	Date the provider was assigned their NPI by CMS. Not the date they joined Medicaid. Rough proxy for years in practice.
`provider_sex`	String	`M` / `F`. Null for organizations or where not reported.
`provider_taxonomy_code`	String	Primary NUCC taxonomy code identifying provider specialty (e.g. `207X00000X` = Orthopedic Surgery, `251G00000X` = Critical Access Hospital). Join to the NUCC taxonomy crosswalk for descriptions. Providers may have up to 15 taxonomy codes; only the primary is included here.

`codes.parquet`

HCPCS procedure code descriptions from the CMS 2026 annual release. 9,006 rows, 3 columns, trivially small.

Column	Type	Description
`hcpcs_code`	String	5-character HCPCS code. Join to `claims.hcpcs_code`.
`long_description`	String	Full plain-English description of the procedure or service (up to 80 chars).
`short_description`	String	Abbreviated description (up to 28 chars). Useful for chart labels.

Note on CPT codes: HCPCS codes starting with a letter (A–V) are Level II codes maintained by CMS and are fully public. Numeric codes (00000–99999) are CPT codes owned by the AMA and carry copyright restrictions. Descriptions for numeric codes in this file are the AMA's "consumer-friendly" descriptions provided under CMS's licensing agreement.

About the Data Sources

Claims: CMS T-MSIS — the federal data system collecting Medicaid/CHIP claims from all 50 states, DC, and territories.
Providers: CMS NPPES NPI Registry — the national registry of all healthcare providers. Updated monthly.
Codes: CMS HCPCS Annual Release — the official procedure code file. Updated quarterly.

What is T-MSIS?

T-MSIS (Transformed Medicaid Statistical Information System) is CMS's system for collecting standardized Medicaid and CHIP data from states. States submit data monthly covering fee-for-service claims, managed care encounters, beneficiary enrollment, and provider information.

Why does this dataset only show ~$180B in 2024 when total Medicaid spending is ~$900B?

Several reasons:

Managed care capitation — ~70% of Medicaid beneficiaries are enrolled in managed care. States pay MCOs (like Centene, Molina, UnitedHealthcare) a fixed monthly rate per member. That capitation payment doesn't appear as a claim — only the encounters MCOs report back to states do, and that reporting is inconsistent.
Institutional claims — inpatient hospital and long-term care (nursing homes, LTSS) are large cost drivers that may use claim types not captured here.
Cell suppression — low-volume provider-procedure combinations are dropped for privacy.
State data quality — T-MSIS data quality varies significantly by state.

This dataset is best used for relative comparisons (which providers bill the most, which procedures are trending, geographic patterns) rather than as an accounting of total Medicaid dollars.

Hosted by Carolina Cloud

These datasets are hosted on Carolina Cloud, a bioinformatics and data cloud based in the Research Triangle, NC. We specialize in high-RAM compute for genomics and large-scale data workloads.

Have a compelling analysis in mind but need the RAM to run it? Email derek@carolinacloud.io with what you want to investigate and we'll get you set up.

Ddfulton/medicaid.md

Select an option

No results found