@MaxGhenis
Last active December 11, 2025 21:30

ONS Firm Data Wishlist for UK Microsimulation & OLG Models

Prepared for Vahid's meeting with ONS firm data team, December 2025

Context

PolicyEngine is exploring building an OLG (Overlapping Generations) model for the UK, similar to OG-USA/OG-Core but with greater firm-level heterogeneity. We also want to expand our firm microsimulation capabilities for tax policy analysis (VAT, corporation tax, business rates).

This document outlines what firm-level data would be most valuable from ONS.

What ONS Already Publishes

Based on the Trends in UK Business Dynamism and Productivity bulletin:

| Source | Coverage | Key Variables |
|---|---|---|
| Annual Business Survey (ABS) | 1997-2023, ~98% of turnover | Turnover, employment, GVA, intermediate consumption |
| Longitudinal Business Database (LBD) | 1999-present, quarterly | Firm entry/exit, employment dynamics |
| Inter-departmental Business Register (IDBR) | Universe of businesses | Registration, legal status |

Published Aggregates

  • Firm size distributions by turnover band
  • Entry/exit rates (aggregate and by broad sector)
  • Sectoral capital-output ratios
  • Labor shares by sector
  • Aggregate markup trends
  • Job creation/destruction rates

The Gap: Joint Distributions

Individual moments exist, but cross-tabulations are limited. For calibrating heterogeneous firm models, we need the shape of distributions and correlations between characteristics.
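
To make the gap concrete, here is a minimal sketch with made-up counts (not ONS figures): two hypothetical firm populations share identical size and age marginals but imply very different conditional moments, which is what a heterogeneous-firm calibration would actually use.

```python
# Two hypothetical firm populations cross-classified by size (rows) and age (columns).
# All counts are illustrative assumptions, not published statistics.
sorted_joint = [[40, 10],   # small firms: young, old
                [10, 40]]   # large firms: young, old
mixed_joint = [[25, 25],
               [25, 25]]

def marginals(table):
    """Return (row totals, column totals) of a 2-D count table."""
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    return rows, cols

def share_old_given_large(table):
    """P(old | large): a conditional moment that the marginals cannot pin down."""
    return table[1][1] / sum(table[1])

same_marginals = marginals(sorted_joint) == marginals(mixed_joint)
print(same_marginals)
print(share_old_given_large(sorted_joint), share_old_given_large(mixed_joint))
```

Both tables would look identical in any publication that reports only size and age distributions separately.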

For Firm Microsimulation

Our VAT threshold analysis had to construct synthetic microdata via optimization because no joint distributions were published, only marginals from ONS and HMRC separately. Better data would enable:

  • More accurate policy simulations
  • Behavioral response modeling (bunching near thresholds)
  • Distributional analysis by firm type
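
As an illustration of why marginals alone are a weak basis, the sketch below uses iterative proportional fitting (IPF), a standard technique named here for illustration and not necessarily the optimization used in our VAT analysis. The seed table, targets, and category labels are all hypothetical. IPF hits the target margins but carries over whatever association the seed assumes, so the joint structure is an input rather than something learned from the data.

```python
def ipf(seed, row_targets, col_targets, iterations=200):
    """Scale a positive seed table until its row and column sums match the targets."""
    table = [row[:] for row in seed]
    for _ in range(iterations):
        # Row step: rescale each row to its target total.
        for i, target in enumerate(row_targets):
            factor = target / sum(table[i])
            table[i] = [cell * factor for cell in table[i]]
        # Column step: rescale each column to its target total.
        for j, target in enumerate(col_targets):
            factor = target / sum(row[j] for row in table)
            for row in table:
                row[j] *= factor
    return table

# Hypothetical example: rows are two sectors, columns are below/above a turnover threshold.
seed = [[4.0, 1.0], [1.0, 4.0]]      # assumed prior association (odds ratio 16)
row_targets = [60.0, 40.0]           # "published" sector totals (made up)
col_targets = [70.0, 30.0]           # "published" turnover-band totals (made up)

fitted = ipf(seed, row_targets, col_targets)
odds_ratio = (fitted[0][0] * fitted[1][1]) / (fitted[0][1] * fitted[1][0])
print(fitted, odds_ratio)
```

The fitted table matches both sets of totals, yet its odds ratio is still the seed's. A published cross-tab would replace that assumption with data.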

For OLG Models

OG-Core's firm sector uses representative firms per sector. For heterogeneous firms, we need:

| Parameter Type | Examples |
|---|---|
| Production | Capital-labor ratios, TFP distributions, markup dispersion |
| Dynamics | Entry/exit rates, growth transitions, survival by age |
| Labor | Wage distributions by firm size, hiring patterns |

Prioritized Cross-Tab Requests

We evaluated candidate cross-tabulations against five criteria:

  1. Calibration value - Improves model fit to real economy
  2. New policy questions - Unlocks analyses we can't currently do
  3. Publication potential - Leads to high-visibility outputs
  4. Feasibility - Likelihood ONS can/will provide
  5. Reusability - Useful across multiple projects

Recommended Asks

| Priority | Cross-Tabulation | Rationale |
|---|---|---|
| 1 | Entry/Exit rates × Sector × Firm Size | Core to OLG dynamics, high feasibility |
| 2 | Fine turnover bins (£5k increments) near £70k-£120k × Sector | Critical for VAT/corp tax threshold analysis |
| 3 | Turnover × Firm Age × Sector | Enables firm lifecycle modeling |
| 4 | Capital stock × Employment × Sector | Essential for production function calibration |
| 5 | Turnover × Employment × Sector | Baseline joint distribution |
| 6 | Size class transition matrices (panel) | Highest reuse value for dynamic models |
| 7 | Markup × Firm Size × Sector | Novel for market power analysis |
| 8 | Wage bill × Turnover × Sector | Labor share heterogeneity |

Detailed Justifications

Entry/Exit × Sector × Size

  • Why: OLG models need firm birth/death rates to calibrate steady-state firm distributions and transition dynamics
  • Current gap: Published rates are aggregate or by broad sector only
  • Ideal format: Annual rates by 2-digit SIC, size class (micro/small/medium/large), for last 10 years
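
A minimal sketch of how these rates feed a steady state, under a deliberately simple assumed law of motion: the firm count in a sector-size cell evolves as N(t+1) = (1 - exit_rate) * N(t) + entrants, so the steady state is entrants / exit_rate. The numbers and function name are hypothetical, and a real model would add transitions between size classes.

```python
def simulate_firm_count(entrants_per_year, exit_rate, years, initial=0.0):
    """Iterate N(t+1) = (1 - exit_rate) * N(t) + entrants for a single cell."""
    count = initial
    for _ in range(years):
        count = (1.0 - exit_rate) * count + entrants_per_year
    return count

ENTRANTS = 100.0   # hypothetical annual firm births in one sector-size cell
EXIT_RATE = 0.10   # hypothetical annual exit rate for that cell

steady_state = ENTRANTS / EXIT_RATE
simulated = simulate_firm_count(ENTRANTS, EXIT_RATE, years=200)
print(steady_state, simulated)
```

With only aggregate rates, every cell is forced to share the same exit rate, and so the same ratio of steady-state firm count to entrants.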

Fine Turnover Bins Near Policy Thresholds

  • Why: Policy thresholds such as the VAT registration threshold (£90k of turnover) and the corporation tax small profits limit (£50k of profits) create behavioral responses. Current ONS turnover bands (e.g. £50k-£99k) are too coarse to identify bunching.
  • Current gap: Finest public data is £50k bands
  • Ideal format: £5k bins from £50k-£150k, by sector
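
The sketch below illustrates the coarseness problem with a fabricated population: firms are spread evenly over turnover except that some are shifted from just above the £90k threshold to just below it. The counts and the size of the shift are assumptions chosen for clarity. With £50k bands the bunching nets out within one band and vanishes; with £5k bins it is obvious.

```python
from collections import Counter

VAT_THRESHOLD = 90_000  # registration threshold cited above

def synthetic_turnovers():
    """Hypothetical population: 10 firms per £1k step from £50k to £149k,
    with excess mass just below the threshold and missing mass just above it."""
    turnovers = []
    for t in range(50_000, 150_000, 1_000):
        n = 10
        if VAT_THRESHOLD - 5_000 <= t < VAT_THRESHOLD:
            n = 16   # bunching below the threshold
        elif VAT_THRESHOLD <= t < VAT_THRESHOLD + 5_000:
            n = 4    # hole above the threshold
        turnovers.extend([t] * n)
    return turnovers

def histogram(turnovers, width):
    """Count firms per bin, keyed by the bin's lower edge."""
    return Counter((t // width) * width for t in turnovers)

turnovers = synthetic_turnovers()
coarse = histogram(turnovers, 50_000)
fine = histogram(turnovers, 5_000)
print(coarse[50_000], coarse[100_000])
print(fine[80_000], fine[85_000], fine[90_000], fine[95_000])
```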

Turnover × Age × Sector

  • Why: Firm age is strongly predictive of growth, exit risk, and productivity. Essential for lifecycle modeling.
  • Current gap: Age distributions exist separately from turnover distributions
  • Ideal format: Joint distribution by firm age cohort (0-2, 3-5, 6-10, 11+), turnover band, sector

Capital × Employment × Sector

  • Why: Calibrating production functions requires knowing capital intensity variation across firm types
  • Current gap: Capital stocks published at sector level only
  • Ideal format: Average capital per worker by sector and firm size class

Transition Matrices

  • Why: Dynamic models need P(size class at t+1 | size class at t) to calibrate adjustment costs and growth processes
  • Current gap: Not published; would require LBD panel analysis
  • Ideal format: 5×5 transition matrix (by size class) for each major sector, annual
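
For concreteness, a transition matrix of this kind could be tabulated from a panel of (state at t, state at t+1) pairs as sketched below. The request above does not list the five states, so the sketch assumes the four size classes named earlier plus an exit state. The panel itself is fabricated.

```python
from collections import Counter

# Assumption: four size classes plus "exit" make up the five states.
STATES = ["micro", "small", "medium", "large", "exit"]

def transition_matrix(panel):
    """Estimate P(state at t+1 | state at t) from (state_t, state_t1) pairs.
    Rows with no observed origin (here, "exit") are left as zeros."""
    counts = Counter(panel)
    matrix = []
    for origin in STATES:
        row_total = sum(counts[(origin, dest)] for dest in STATES)
        matrix.append([
            counts[(origin, dest)] / row_total if row_total else 0.0
            for dest in STATES
        ])
    return matrix

# Fabricated one-year panel: 100 firms starting in each size class.
panel = (
    [("micro", "micro")] * 80 + [("micro", "small")] * 10 + [("micro", "exit")] * 10
    + [("small", "micro")] * 5 + [("small", "small")] * 75
    + [("small", "medium")] * 15 + [("small", "exit")] * 5
    + [("medium", "small")] * 10 + [("medium", "medium")] * 80 + [("medium", "large")] * 10
    + [("large", "medium")] * 10 + [("large", "large")] * 90
)

matrix = transition_matrix(panel)
for state, row in zip(STATES, matrix):
    print(state, row)
```

A tabulation in this shape contains only conditional shares, which may make it easier to release than the underlying panel.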

Alternative Access Routes

If custom tabulations aren't feasible:

  1. Secure Research Service - Direct access to anonymised microdata (requires accreditation, but PolicyEngine has academic collaborators)
  2. Existing detailed tables - ONS may have unpublished tables from previous projects
  3. HMRC linkage - Some data may exist in linked ONS-HMRC datasets

Questions for the Meeting

  1. What's the process for requesting custom tabulations, and which requests would require SRS access instead?
  2. Are there existing unpublished tables that match our needs?
  3. What's the timeline for data requests?
  4. Is there appetite for a formal data-sharing agreement for ongoing research?
  5. Can longitudinal linkages (LBD panel) be provided as tabulations, or only via SRS?

How PolicyEngine Technology Could Help ONS

This could be a two-way collaboration. We've built open-source tools that may address challenges ONS faces:

1. Firm Tax Microsimulation Framework

The problem: Bespoke models for corporation tax, business rates, or VAT policy analysis often end up as one-off spreadsheets or scripts. Does ONS build models like these?

Our approach: PolicyEngine's rules-as-code framework encodes tax-benefit logic in a modular, version-controlled, testable way. We currently cover household taxes and benefits; extending to firm taxes would mean:

  • Transparent, auditable corporation tax calculations
  • Easy scenario analysis (rate changes, threshold shifts, reliefs)
  • API access for integration with other tools
  • Automatic handling of policy changes over time

We'd be interested in collaborating on a firm tax module if ONS sees value.

2. Data Integration with Microimpute

The problem: Combining multiple firm datasets (e.g., ABS survey data with HMRC administrative tax records) requires statistical matching or imputation when direct linkage isn't possible.

Our approach: microimpute is our open-source package for imputing variables across datasets. It supports:

  • Quantile regression forests for continuous variables
  • Multiple imputation for uncertainty quantification
  • Calibration to known marginals
  • Donor-based matching methods

If ONS needs to combine, say, detailed ABS characteristics with HMRC tax liability data without full record linkage, microimpute could help.
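
To show the general idea of donor-based matching (this is a generic sketch, not microimpute's API, and every name and value in it is hypothetical), each recipient record below borrows a tax value from the donor record closest in turnover.

```python
def nearest_donor_impute(recipients, donors):
    """For each recipient turnover, copy the tax value of the donor with the closest turnover."""
    imputed = []
    for turnover in recipients:
        donor = min(donors, key=lambda d: abs(d["turnover"] - turnover))
        imputed.append({"turnover": turnover, "tax": donor["tax"]})
    return imputed

# Hypothetical donor file (has the tax variable) and recipient file (lacks it).
donors = [
    {"turnover": 60_000, "tax": 2_000},
    {"turnover": 95_000, "tax": 5_500},
    {"turnover": 140_000, "tax": 9_000},
]
recipients = [58_000, 100_000, 130_000]

imputed = nearest_donor_impute(recipients, donors)
print([row["tax"] for row in imputed])
```

Real applications would match on several variables and draw from a conditional distribution instead of copying a single nearest value, which is where methods like quantile regression forests come in.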

3. Geographic Calibration

The problem: Producing reliable estimates at fine geographic levels (constituencies, local authorities) when surveys are designed for national/regional representativeness.

Our approach: We've built survey-enhance for reweighting survey data to match area-level targets. For UK household data, we calibrate to all 650 parliamentary constituencies using:

  • Gradient-based optimization for survey weights
  • Multiple constraint types (means, totals, quantiles)
  • Entropy-based regularization to stay close to original design weights

This could apply to firm surveys if ONS wants constituency-level business statistics without running massive sample boosts.
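
A toy version of the reweighting idea, written from scratch and not taken from survey-enhance: gradient descent on log-weights minimizes the squared relative error against one area-level total, plus an entropy-style penalty for drifting from the design weights. The loss form, hyperparameters, and data are all assumptions for illustration.

```python
import math

def calibrate(values, design_weights, target_total, alpha=0.01, lr=0.5, steps=2000):
    """Gradient descent on log-weights: squared relative error on the weighted total
    plus an entropy-style penalty that keeps weights near the design weights."""
    log_w = [math.log(d) for d in design_weights]
    total_design = sum(design_weights)
    for _ in range(steps):
        w = [math.exp(u) for u in log_w]
        rel_err = sum(wi * x for wi, x in zip(w, values)) / target_total - 1.0
        for i, (wi, x, d) in enumerate(zip(w, values, design_weights)):
            # Gradient of the fit term with respect to log-weight i.
            grad_fit = 2.0 * rel_err * x * wi / target_total
            # Gradient of alpha * sum(w*log(w/d) - w + d) / sum(d) with respect to log-weight i.
            grad_penalty = alpha * math.log(wi / d) * wi / total_design
            log_w[i] -= lr * (grad_fit + grad_penalty)
    return [math.exp(u) for u in log_w]

# Hypothetical sample of four firms with employment 1-4 and equal design weights.
employment = [1.0, 2.0, 3.0, 4.0]
design_weights = [10.0, 10.0, 10.0, 10.0]
TARGET_EMPLOYMENT = 120.0  # made-up area-level total (design-weighted total is 100)

weights = calibrate(employment, design_weights, TARGET_EMPLOYMENT)
calibrated_total = sum(w * x for w, x in zip(weights, employment))
print(weights, calibrated_total)
```

Because the penalty trades off against fit, the calibrated total lands close to the target without matching it exactly. Raising or lowering `alpha` moves along that trade-off.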

4. Synthetic Data Generation

The problem: Releasing microdata with disclosure control is costly; synthetic data is an alternative but quality varies.

Our approach: For our VAT analysis, we generated synthetic firm microdata that matches published marginals from multiple sources. The optimization-based approach ensures:

  • Exact calibration to published totals
  • Plausible joint distributions
  • Very low disclosure risk (fully synthetic, built only from published aggregates)

We could collaborate on synthetic firm datasets that ONS could release publicly and that we could use for modeling.

About PolicyEngine

PolicyEngine is a nonprofit that builds open-source tax and benefit microsimulation models. Our UK model covers the full tax-benefit system and is used by researchers, journalists, and policymakers. We're expanding into firm-level modeling to analyze business taxation and macroeconomic policy.

Contact


Last updated: December 2025
