Impala Partition Refresh – Complete, Practical Guide

This document explains why Impala requires partition refresh operations, what exactly happens internally, what breaks if you skip them, and all available strategies to make newly written Parquet data queryable and performant.

This is written to be understandable for new engineers, while still being accurate for production systems.


1. The Core Problem Impala Solves

When you write Parquet files directly to HDFS, Impala is not notified automatically.

Impala works on metadata, not filesystem events.

After data lands in HDFS, Impala must know:

  1. Does this partition exist?
  2. Which files belong to that partition?
  3. How much data is there (statistics)?

If any of these are missing or stale, Impala queries may:

  • Miss data
  • Scan unnecessary files
  • Choose inefficient join strategies
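
A hypothetical illustration (the table name and the new date are made up):

-- Spark has just written Parquet files for cob_dt_id=20250203 to HDFS
SELECT COUNT(*) FROM db.table WHERE cob_dt_id = 20250203;
-- Can return 0 or a stale count until Impala's metadata is updated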

2. How Impala Sees a Table (Mental Model)

Impala does not scan HDFS directories dynamically.

Instead, it relies on:

  • Hive Metastore (schema, partitions)
  • Impala Catalog cache (fast, in-memory metadata)

Query → Impala Planner → Metadata → HDFS Files

If metadata is wrong, the query plan is wrong.
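
To see what Impala currently believes about a table, you can ask the catalog directly; a sketch, assuming the table is db.table:

SHOW PARTITIONS db.table;
-- Partitions registered in the catalog

SHOW FILES IN db.table PARTITION (cob_dt_id=20250202);
-- The exact files Impala will scan for that partition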


3. What a Partition Really Is

For a partitioned table like:

PARTITIONED BY (cob_dt_id INT)

HDFS layout looks like:

/table_path/
  cob_dt_id=20250201/
  cob_dt_id=20250202/

But Impala only sees partitions that are:

  • Registered in the metastore
  • Cached in the catalog

A directory existing in HDFS does not mean Impala knows about it.
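
A sketch of the mismatch, reusing the layout above (the third date is hypothetical):

-- HDFS already contains /table_path/cob_dt_id=20250203/ written by an external job
SHOW PARTITIONS db.table;
-- Lists only 20250201 and 20250202; 20250203 stays invisible
-- until it is registered with ALTER TABLE ... ADD PARTITION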


4. What ALTER TABLE ADD PARTITION Does

ALTER TABLE db.table
ADD PARTITION (cob_dt_id=20250202);

Step-by-step internally

  1. Registers the partition key/value
  2. Maps it to an HDFS directory
  3. Stores this mapping in Hive Metastore
  4. Updates Impala Catalog metadata

What it does NOT do

  • Does not scan files
  • Does not load statistics

If you skip this step

  • Impala may ignore the partition entirely
  • REFRESH db.table PARTITION (...) fails, because the catalog does not know the partition
  • Query behavior becomes unpredictable
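
For repeated ingestion runs, the idempotent form is safer; a sketch (LOCATION is optional and shown here with the example path from section 3):

ALTER TABLE db.table
ADD IF NOT EXISTS PARTITION (cob_dt_id=20250202)
LOCATION '/table_path/cob_dt_id=20250202';
-- IF NOT EXISTS makes reruns harmless; LOCATION is only needed
-- when the directory deviates from the table's default layout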

5. What REFRESH PARTITION Does

REFRESH db.table PARTITION (cob_dt_id=20250202);

Step-by-step internally

  1. Contacts the HDFS NameNode
  2. Lists the files under the partition directory
  3. Updates the cached file metadata (file names, sizes, block locations)
  4. Makes new files visible to queries

What it does NOT do

  • Does not create partitions
  • Does not compute statistics

If you skip this step

  • New Parquet files may not be read
  • Overwritten files may still be queried
  • Results can be stale
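
Scoping the refresh to one partition keeps the NameNode work small; a sketch:

-- Cheap: lists only the one directory that changed
REFRESH db.table PARTITION (cob_dt_id=20250202);

-- More expensive: lists every partition directory of the table
REFRESH db.table;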

6. What COMPUTE INCREMENTAL STATS Does

COMPUTE INCREMENTAL STATS db.table
PARTITION (cob_dt_id=20250202);

Step-by-step internally

  1. Scans the partition's data and Parquet footer metadata
  2. Calculates:
    • Row counts
    • Column cardinality (NDV)
    • Data size
  3. Updates planner statistics

Why stats matter

Impala uses stats to decide:

  • Broadcast vs shuffle joins
  • Join order
  • Predicate selectivity

If you skip this step

  • Queries still run
  • Performance degrades over time
  • Join-heavy queries suffer badly
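
You can verify that the statistics landed; a sketch (the exact output columns vary by Impala version):

COMPUTE INCREMENTAL STATS db.table PARTITION (cob_dt_id=20250202);
SHOW TABLE STATS db.table;
-- The #Rows value for cob_dt_id=20250202 should now be a real count
-- instead of -1, Impala's marker for "no stats"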

7. What Happens If You Do NOTHING

If you only write Parquet files to HDFS:

| Aspect              | Result            |
|---------------------|-------------------|
| Partition discovery | ❌ Not guaranteed |
| File visibility     | ❌ Inconsistent   |
| Query correctness   | ⚠️ Unreliable     |
| Performance         | ❌ Degrades       |

You may sometimes see new data anyway (for example, if something else happened to trigger a metadata reload), but this is accidental, not guaranteed.


8. The Recommended Combined Strategy (Best Practice)

1. Write Parquet to HDFS
2. ADD PARTITION (if new)
3. REFRESH PARTITION
4. COMPUTE INCREMENTAL STATS

Why this works

| Concern                | Solved  |
|------------------------|---------|
| Partition registration | ✅      |
| File discovery         | ✅      |
| Query planning (stats) | ✅      |
| Cluster impact         | Minimal |

This is safe, fast, and scalable.
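
Put together, the post-write sequence is three statements; a minimal sketch, assuming the Parquet files for cob_dt_id=20250202 are already on HDFS:

-- 1. Register the partition (safe to rerun)
ALTER TABLE db.table ADD IF NOT EXISTS PARTITION (cob_dt_id=20250202);

-- 2. Discover the new files
REFRESH db.table PARTITION (cob_dt_id=20250202);

-- 3. Feed the planner
COMPUTE INCREMENTAL STATS db.table PARTITION (cob_dt_id=20250202);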


9. Other Ways to Achieve the Same Goal

9.1 MSCK REPAIR TABLE

MSCK REPAIR TABLE db.table;

  • A Hive statement (see the Impala equivalent sketched below)
  • Scans the entire directory tree
  • Discovers all unregistered partitions
  • Very slow for large tables

Use only for:

  • One-time migrations
  • Recovery scenarios
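
If you are repairing from Impala rather than Hive, the equivalent statement (available in recent Impala versions) is:

ALTER TABLE db.table RECOVER PARTITIONS;
-- Scans the table's directory tree and registers any partition
-- directories that are missing from the metastore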

9.2 INVALIDATE METADATA

INVALIDATE METADATA db.table;

  • Drops all cached metadata for the table
  • Forces a full reload on the next access
  • Expensive and disruptive

Use only for:

  • Schema changes
  • Emergency fixes

9.3 Impala INSERT / CTAS

INSERT INTO TABLE db.table
PARTITION (cob_dt_id=20250202)
SELECT ...;

  • Impala manages metadata automatically
  • Best when Impala owns ingestion

Not suitable if:

  • Spark or external systems write data
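
The CTAS variant works the same way; a sketch (column names are made up, and Impala expects partition columns last in the select list):

CREATE TABLE db.new_table
PARTITIONED BY (cob_dt_id)
STORED AS PARQUET
AS SELECT trade_id, amount, cob_dt_id
FROM db.source_table;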

10. Strategy Comparison Table

| Strategy      | Partitions | Files | Stats | Daily Use |
|---------------|------------|-------|-------|-----------|
| ADD PARTITION | ✅         | ❌    | ❌    | ✅        |
| REFRESH       | ❌         | ✅    | ❌    | ✅        |
| INCR STATS    | ❌         | ❌    | ✅    | ✅        |
| Combined      | ✅         | ✅    | ✅    | ✅        |
| MSCK          | ✅         | ❌    | ❌    | ❌        |
| INVALIDATE    | ✅         | ✅    | ❌    | ❌        |

11. Final Takeaway

  • Impala is metadata-driven, not filesystem-driven
  • Writing Parquet files is only half the job
  • Refreshing partitions is mandatory for correctness
  • Statistics are mandatory for performance

If you treat these steps as part of ingestion, Impala becomes predictable, fast, and reliable.


Recommended rule:

If data lands outside Impala, always refresh metadata explicitly.
