Impala Partition Refresh – Complete, Practical Guide

This document explains why Impala requires partition refresh operations, what exactly happens internally, what breaks if you skip them, and all available strategies to make newly written Parquet data queryable and performant.

This is written to be understandable for new engineers, while still being accurate for production systems.


1. The Core Problem Impala Solves

When you write Parquet files directly to HDFS, Impala is not notified automatically.

Impala works on metadata, not filesystem events.

After data lands in HDFS, Impala must know:

  1. Does this partition exist?
  2. Which files belong to that partition?
  3. How much data is there (statistics)?

If any of these are missing or stale, Impala queries may:

  • Miss data
  • Scan unnecessary files
  • Choose inefficient join strategies
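
A hypothetical illustration (the table name and the new date are made up):

-- Spark has just written Parquet files for cob_dt_id=20250203 to HDFS
SELECT COUNT(*) FROM db.table WHERE cob_dt_id = 20250203;
-- Can return 0 or a stale count until Impala's metadata is updated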

2. How Impala Sees a Table (Mental Model)

Impala does not scan HDFS directories dynamically.

Instead, it relies on:

  • Hive Metastore (schema, partitions)
  • Impala Catalog cache (fast, in-memory metadata)

Query → Impala Planner → Metadata → HDFS Files

If metadata is wrong, the query plan is wrong.
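
To see what Impala currently believes about a table, you can ask the catalog directly; a sketch, assuming the table is db.table:

SHOW PARTITIONS db.table;
-- Partitions registered in the catalog

SHOW FILES IN db.table PARTITION (cob_dt_id=20250202);
-- The exact files Impala will scan for that partition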


3. What a Partition Really Is

For a partitioned table like:

PARTITIONED BY (cob_dt_id INT)

HDFS layout looks like:

/table_path/
  cob_dt_id=20250201/
  cob_dt_id=20250202/

But Impala only sees partitions that are:

  • Registered in the metastore
  • Cached in the catalog

A directory existing in HDFS does not mean Impala knows about it.
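
A sketch of the mismatch, reusing the layout above (the third date is hypothetical):

-- HDFS already contains /table_path/cob_dt_id=20250203/ written by an external job
SHOW PARTITIONS db.table;
-- Lists only 20250201 and 20250202; 20250203 stays invisible
-- until it is registered with ALTER TABLE ... ADD PARTITION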


4. What ALTER TABLE ADD PARTITION Does

ALTER TABLE db.table
ADD PARTITION (cob_dt_id=20250202);

Step-by-step internally

  1. Registers the partition key/value
  2. Maps it to an HDFS directory
  3. Stores this mapping in Hive Metastore
  4. Updates Impala Catalog metadata

What it does NOT do

  • Does not scan files
  • Does not load statistics

If you skip this step

  • Impala may ignore the partition entirely
  • REFRESH db.table PARTITION (...) fails, because the catalog does not know the partition
  • Query behavior becomes unpredictable
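
For repeated ingestion runs, the idempotent form is safer; a sketch (LOCATION is optional and shown here with the example path from section 3):

ALTER TABLE db.table
ADD IF NOT EXISTS PARTITION (cob_dt_id=20250202)
LOCATION '/table_path/cob_dt_id=20250202';
-- IF NOT EXISTS makes reruns harmless; LOCATION is only needed
-- when the directory deviates from the table's default layout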

5. What REFRESH PARTITION Does

REFRESH db.table PARTITION (cob_dt_id=20250202);

Step-by-step internally

  1. Contacts the HDFS NameNode
  2. Lists the files under the partition directory
  3. Updates the cached file metadata (file names, sizes, block locations)
  4. Makes new files visible to queries

What it does NOT do

  • Does not create partitions
  • Does not compute statistics

If you skip this step

  • New Parquet files may not be read
  • Overwritten files may still be queried
  • Results can be stale
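
Scoping the refresh to one partition keeps the NameNode work small; a sketch:

-- Cheap: lists only the one directory that changed
REFRESH db.table PARTITION (cob_dt_id=20250202);

-- More expensive: lists every partition directory of the table
REFRESH db.table;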

6. What COMPUTE INCREMENTAL STATS Does

COMPUTE INCREMENTAL STATS db.table
PARTITION (cob_dt_id=20250202);

Step-by-step internally

  1. Scans the partition's data and Parquet footer metadata
  2. Calculates:
    • Row counts
    • Column cardinality (NDV)
    • Data size
  3. Updates planner statistics

Why stats matter

Impala uses stats to decide:

  • Broadcast vs shuffle joins
  • Join order
  • Predicate selectivity

If you skip this step

  • Queries still run
  • Performance degrades over time
  • Join-heavy queries suffer badly
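
You can verify that the statistics landed; a sketch (the exact output columns vary by Impala version):

COMPUTE INCREMENTAL STATS db.table PARTITION (cob_dt_id=20250202);
SHOW TABLE STATS db.table;
-- The #Rows value for cob_dt_id=20250202 should now be a real count
-- instead of -1, Impala's marker for "no stats"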

7. What Happens If You Do NOTHING

If you only write Parquet files to HDFS:

| Aspect              | Result            |
|---------------------|-------------------|
| Partition discovery | ❌ Not guaranteed |
| File visibility     | ❌ Inconsistent   |
| Query correctness   | ⚠️ Unreliable     |
| Performance         | ❌ Degrades       |

You may sometimes see new data anyway (for example, if something else happened to trigger a metadata reload), but this is accidental, not guaranteed.


8. The Recommended Combined Strategy (Best Practice)

1. Write Parquet to HDFS
2. ADD PARTITION (if new)
3. REFRESH PARTITION
4. COMPUTE INCREMENTAL STATS

Why this works

| Concern                | Solved  |
|------------------------|---------|
| Partition registration | ✅      |
| File discovery         | ✅      |
| Query planning (stats) | ✅      |
| Cluster impact         | Minimal |

This is safe, fast, and scalable.
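
Put together, the post-write sequence is three statements; a minimal sketch, assuming the Parquet files for cob_dt_id=20250202 are already on HDFS:

-- 1. Register the partition (safe to rerun)
ALTER TABLE db.table ADD IF NOT EXISTS PARTITION (cob_dt_id=20250202);

-- 2. Discover the new files
REFRESH db.table PARTITION (cob_dt_id=20250202);

-- 3. Feed the planner
COMPUTE INCREMENTAL STATS db.table PARTITION (cob_dt_id=20250202);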


9. Other Ways to Achieve the Same Goal

9.1 MSCK REPAIR TABLE

MSCK REPAIR TABLE db.table;

  • A Hive statement (see the Impala equivalent sketched below)
  • Scans the entire directory tree
  • Discovers all unregistered partitions
  • Very slow for large tables

Use only for:

  • One-time migrations
  • Recovery scenarios
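
If you are repairing from Impala rather than Hive, the equivalent statement (available in recent Impala versions) is:

ALTER TABLE db.table RECOVER PARTITIONS;
-- Scans the table's directory tree and registers any partition
-- directories that are missing from the metastore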

9.2 INVALIDATE METADATA

INVALIDATE METADATA db.table;

  • Drops all cached metadata for the table
  • Forces a full reload on the next access
  • Expensive and disruptive

Use only for:

  • Schema changes
  • Emergency fixes

9.3 Impala INSERT / CTAS

INSERT INTO TABLE db.table
PARTITION (cob_dt_id=20250202)
SELECT ...;

  • Impala manages metadata automatically
  • Best when Impala owns ingestion

Not suitable if:

  • Spark or external systems write data
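
The CTAS variant works the same way; a sketch (column names are made up, and Impala expects partition columns last in the select list):

CREATE TABLE db.new_table
PARTITIONED BY (cob_dt_id)
STORED AS PARQUET
AS SELECT trade_id, amount, cob_dt_id
FROM db.source_table;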

10. Strategy Comparison Table

| Strategy      | Partitions | Files | Stats | Daily Use |
|---------------|------------|-------|-------|-----------|
| ADD PARTITION | ✅         | ❌    | ❌    | ✅        |
| REFRESH       | ❌         | ✅    | ❌    | ✅        |
| INCR STATS    | ❌         | ❌    | ✅    | ✅        |
| Combined      | ✅         | ✅    | ✅    | ✅        |
| MSCK          | ✅         | ❌    | ❌    | ❌        |
| INVALIDATE    | ✅         | ✅    | ❌    | ❌        |

11. Final Takeaway

  • Impala is metadata-driven, not filesystem-driven
  • Writing Parquet files is only half the job
  • Refreshing partitions is mandatory for correctness
  • Statistics are mandatory for performance

If you treat these steps as part of ingestion, Impala becomes predictable, fast, and reliable.


Recommended rule:

If data lands outside Impala, always refresh metadata explicitly.
