Skip to content

Instantly share code, notes, and snippets.

@Ddfulton
Created February 14, 2026 19:40
Show Gist options
  • Select an option

  • Save Ddfulton/4e35d675ec7ffdeef983f0117ab1bd52 to your computer and use it in GitHub Desktop.

Select an option

Save Ddfulton/4e35d675ec7ffdeef983f0117ab1bd52 to your computer and use it in GitHub Desktop.

Medicaid Provider Spending Dataset

A cleaned, analysis-ready version of CMS T-MSIS Medicaid claims data covering January 2018 – December 2024, hosted by Carolina Cloud — a bioinformatics and data cloud built in the Research Triangle.

Why we're hosting this: Public datasets should be easy to use. We've cleaned, normalized, and split these files so you can be doing analysis in minutes. If you have an interesting analysis you want to run but need serious compute to do it, email derek@carolinacloud.io — describe what you're investigating and we'll give you free credits on a high-RAM instance.


Quickstart

import polars as pl

claims    = pl.read_parquet('https://console.carolinacloud.io/static/datasets/medicaid-claims.parquet')
providers = pl.read_parquet('https://console.carolinacloud.io/static/datasets/medicaid-providers.parquet')
codes     = pl.read_parquet('https://console.carolinacloud.io/static/datasets/medicaid-codes.parquet')

Example: Top 10 procedures by total Medicaid spend

from polars import col

(
    claims
    .group_by('hcpcs_code')
    .agg(col('total_paid').sum().alias('total_paid'))
    .sort('total_paid', descending=True)
    .head(10)
    .join(codes.select('hcpcs_code', 'short_description'), on='hcpcs_code', how='left')
)

Example: Join providers to claims

provider_slim = providers.select('npi', 'business_name', 'provider_last_name',
                                  'provider_first_name', 'practice_state', 'provider_taxonomy_code')

enriched = claims.join(provider_slim, on='npi', how='left')

Example: Monthly spend trend for a specific procedure

(
    claims
    .filter(col('hcpcs_code') == 'J0696')   # Ceftriaxone injection
    .group_by('month')
    .agg(col('total_paid').sum(), col('n_claims').sum())
    .sort('month')
)

Example: Most billed procedure by provider specialty

Are orthopedic surgeons billing for psychotherapy? Are psychiatrists billing for surgery? This query surfaces the top HCPCS code per taxonomy group — great for sanity checking or finding anomalies.

from polars import col

(
    claims
    .join(providers.select('npi', 'provider_taxonomy_code'), on='npi', how='left')
    .join(codes.select('hcpcs_code', 'short_description'), on='hcpcs_code', how='left')
    .group_by('provider_taxonomy_code', 'hcpcs_code', 'short_description')
    .agg(col('n_claims').sum().alias('total_claims'))
    .sort('total_claims', descending=True)
    .group_by('provider_taxonomy_code')
    .agg(col('hcpcs_code').first(), col('short_description').first(), col('total_claims').first())
    .sort('total_claims', descending=True)
)

Example: Specialty mismatch detection

Flag providers billing codes that are almost never billed by their specialty — a classic fraud/abuse signal. Here we find providers whose top billed code is billed by fewer than 1% of providers sharing their taxonomy.

# Step 1: typical codes per taxonomy
specialty_codes = (
    claims
    .join(providers.select('npi', 'provider_taxonomy_code'), on='npi', how='left')
    .group_by('provider_taxonomy_code', 'hcpcs_code')
    .agg(pl.n_unique('npi').alias('n_providers_billing'))
)

specialty_totals = (
    providers
    .group_by('provider_taxonomy_code')
    .agg(pl.len().alias('n_providers_in_specialty'))
)

specialty_codes = specialty_codes.join(specialty_totals, on='provider_taxonomy_code', how='left').with_columns(
    (col('n_providers_billing') / col('n_providers_in_specialty')).alias('pct_of_specialty_billing_this_code')
)

# Step 2: each provider's top billed code
provider_top_code = (
    claims
    .group_by('npi', 'hcpcs_code')
    .agg(col('n_claims').sum())
    .sort('n_claims', descending=True)
    .group_by('npi')
    .agg(col('hcpcs_code').first().alias('top_hcpcs_code'))
)

# Step 3: flag mismatches
(
    provider_top_code
    .join(providers.select('npi', 'provider_taxonomy_code', 'provider_first_name', 'provider_last_name', 'business_name'), on='npi', how='left')
    .join(specialty_codes.rename({'hcpcs_code': 'top_hcpcs_code'}), on=['provider_taxonomy_code', 'top_hcpcs_code'], how='left')
    .filter(col('pct_of_specialty_billing_this_code') < 0.01)
    .join(codes.select('hcpcs_code', 'short_description').rename({'hcpcs_code': 'top_hcpcs_code'}), on='top_hcpcs_code', how='left')
    .sort('pct_of_specialty_billing_this_code')
    .select('npi', 'business_name', 'provider_last_name', 'provider_taxonomy_code', 'top_hcpcs_code', 'short_description', 'pct_of_specialty_billing_this_code')
)

Example: High-volume, high-spend outliers per state

Find providers billing an unusual number of claims relative to peers billing the same code in the same state — a billing volume anomaly flag.

peer_stats = (
    claims
    .join(providers.select('npi', 'practice_state'), on='npi', how='left')
    .group_by('practice_state', 'hcpcs_code')
    .agg(
        col('n_claims').mean().alias('peer_mean_claims'),
        col('n_claims').std().alias('peer_std_claims'),
    )
)

(
    claims
    .join(providers.select('npi', 'practice_state', 'business_name', 'provider_last_name'), on='npi', how='left')
    .join(peer_stats, on=['practice_state', 'hcpcs_code'], how='left')
    .with_columns(
        ((col('n_claims') - col('peer_mean_claims')) / col('peer_std_claims')).alias('z_score')
    )
    .filter(col('z_score') > 4)
    .sort('z_score', descending=True)
    .join(codes.select('hcpcs_code', 'short_description'), on='hcpcs_code', how='left')
)

Dataset Overview

claims providers codes
Source CMS T-MSIS CMS NPPES NPI Registry CMS HCPCS 2026
Rows 227 million 9.4 million 9,006
Columns 7 22 3
Size on disk ~11 GB ~1.2 GB < 1 MB
Granularity Provider × Procedure × Month One row per NPI One row per HCPCS code
Time Period Jan 2018 – Dec 2024 Current as of Feb 2026 2026 release
Join Key npi, hcpcs_code npi hcpcs_code

A note on RAM

The claims file alone is ~11 GB on disk. Polars is lazy and efficient, but in-memory operations — especially joins and aggregations — typically require roughly 2× the data size in RAM, which puts you around 22 GB just to work with claims before joining anything. The providers file adds another ~1.2 GB but it's wide (22 columns of strings), and joining it to the long claims table is where memory really balloons. A full enriched join of both datasets is well into 128 GB territory.

If your laptop can't handle it, don't fight it — use .select() to grab only the provider columns you actually need before joining, or filter claims down to a state or time range first. If you have a serious analysis that requires the full dataset, email derek@carolinacloud.io with what you're investigating and we'll set you up on a high-RAM instance.

Context on why this data matters right now: DOGE/HHS has been publicly posting Medicaid spending anomalies on social media. This dataset lets you do that analysis yourself, rigorously, rather than taking curated screenshots at face value.


claims.parquet

The core dataset. Each row is one provider billing one procedure code in one month. 227 million rows, 7 columns, ~11 GB.

Column Type Description
npi String Billing provider NPI. Join to providers.npi.
servicing_provider_npi_num String NPI of the provider who actually performed the service, if different from the billing provider. Common in group practices.
hcpcs_code String Procedure/service code. Join to codes.hcpcs_code for descriptions.
month Date First day of the claim month (e.g. 2024-03-01 = March 2024).
n_unique_patients Int Number of distinct Medicaid beneficiaries who received this procedure from this provider in this month.
n_claims Int Total claims submitted. A single patient may generate multiple claims.
total_paid Float Total dollars paid by Medicaid for this provider × procedure × month combination.

Important caveats:

  • Rows with fewer than 12 claims are suppressed by CMS for privacy. This dataset represents the majority of Medicaid spend but is not exhaustive.
  • This covers fee-for-service and managed care encounter data. Capitation payments to MCOs (Managed Care Organizations) — which represent ~70% of total Medicaid enrollment — are not reflected here. Do not use total_paid as a proxy for total Medicaid spending.
  • Data accuracy depends on state submissions to T-MSIS. Quality varies by state. See CMS DQ Atlas for known issues.

providers.parquet

Cleaned provider reference data from the NPPES NPI Registry. 9.4 million rows, 22 columns, ~1.2 GB.

Column Type Description
npi String 10-digit National Provider Identifier. Primary join key.
business_name String Legal business name. Populated for organizations (hospitals, group practices). Null for individual providers — use provider_first_name + provider_last_name instead.
provider_last_name String Last name of individual provider. Null for organizations. Lowercased.
provider_first_name String First name of individual provider. Null for organizations. Lowercased.
provider_middle_initial String Middle name/initial. Sparsely populated. Lowercased.
provider_credential String Credential stripped to letters only (e.g. MD, M.D., m.d. all become md). Lowercased.
provider_business_mailing_address String Mailing address lines 1 + 2 concatenated. Lowercased.
business_mailing_city String Mailing address city. Lowercased.
business_mailing_state String Mailing address 2-letter state code. Lowercased.
business_mailing_zip5 String Mailing address 5-digit ZIP (truncated from ZIP+4).
business_mailing_country String Country code for non-US addresses. Blank = United States.
business_phone_number String Mailing address phone number.
business_fax_number String Mailing address fax number. Sparsely populated.
provider_business_practice_location_address String Physical practice address lines 1 + 2. More useful than mailing address for geographic analysis. Lowercased.
practice_city String Practice location city. Lowercased.
practice_state String Practice location 2-letter state code. Lowercased.
practice_zip5 String Practice location 5-digit ZIP. Use this for geographic analysis.
practice_country String Country code for non-US practice locations. Blank = United States.
practice_phone_number String Practice location phone number.
provider_enumeration_date Date Date the provider was assigned their NPI by CMS. Not the date they joined Medicaid. Rough proxy for years in practice.
provider_sex String M / F. Null for organizations or where not reported.
provider_taxonomy_code String Primary NUCC taxonomy code identifying provider specialty (e.g. 207X00000X = Orthopedic Surgery, 251G00000X = Critical Access Hospital). Join to the NUCC taxonomy crosswalk for descriptions. Providers may have up to 15 taxonomy codes; only the primary is included here.

codes.parquet

HCPCS procedure code descriptions from the CMS 2026 annual release. 9,006 rows, 3 columns, trivially small.

Column Type Description
hcpcs_code String 5-character HCPCS code. Join to claims.hcpcs_code.
long_description String Full plain-English description of the procedure or service (up to 80 chars).
short_description String Abbreviated description (up to 28 chars). Useful for chart labels.

Note on CPT codes: HCPCS codes starting with a letter (A–V) are Level II codes maintained by CMS and are fully public. Numeric codes (00000–99999) are CPT codes owned by the AMA and carry copyright restrictions. Descriptions for numeric codes in this file are the AMA's "consumer-friendly" descriptions provided under CMS's licensing agreement.


About the Data Sources

  • Claims: CMS T-MSIS — the federal data system collecting Medicaid/CHIP claims from all 50 states, DC, and territories.
  • Providers: CMS NPPES NPI Registry — the national registry of all healthcare providers. Updated monthly.
  • Codes: CMS HCPCS Annual Release — the official procedure code file. Updated quarterly.

What is T-MSIS?

T-MSIS (Transformed Medicaid Statistical Information System) is CMS's system for collecting standardized Medicaid and CHIP data from states. States submit data monthly covering fee-for-service claims, managed care encounters, beneficiary enrollment, and provider information.

Why does this dataset only show ~$180B in 2024 when total Medicaid spending is ~$900B?

Several reasons:

  1. Managed care capitation — ~70% of Medicaid beneficiaries are enrolled in managed care. States pay MCOs (like Centene, Molina, UnitedHealthcare) a fixed monthly rate per member. That capitation payment doesn't appear as a claim — only the encounters MCOs report back to states do, and that reporting is inconsistent.
  2. Institutional claims — inpatient hospital and long-term care (nursing homes, LTSS) are large cost drivers that may use claim types not captured here.
  3. Cell suppression — low-volume provider-procedure combinations are dropped for privacy.
  4. State data quality — T-MSIS data quality varies significantly by state.

This dataset is best used for relative comparisons (which providers bill the most, which procedures are trending, geographic patterns) rather than as an accounting of total Medicaid dollars.


Hosted by Carolina Cloud

These datasets are hosted on Carolina Cloud, a bioinformatics and data cloud based in the Research Triangle, NC. We specialize in high-RAM compute for genomics and large-scale data workloads.

Have a compelling analysis in mind but need the RAM to run it? Email derek@carolinacloud.io with what you want to investigate and we'll get you set up.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment