This document explains why Impala requires partition refresh operations, what exactly happens internally, what breaks if you skip them, and all available strategies to make newly written Parquet data queryable and performant.
This is written to be understandable for new engineers, while still being accurate for production systems.
When you write Parquet files directly to HDFS, Impala is not notified automatically.
Impala works on metadata, not filesystem events.
After data lands in HDFS, Impala must know:
- Does this partition exist?
- Which files belong to that partition?
- How much data is there (statistics)?
If any of these are missing or stale, Impala queries may:
- Miss data
- Scan unnecessary files
- Choose inefficient join strategies
Impala does not scan HDFS directories dynamically.
Instead, it relies on:
- Hive Metastore (schema, partitions)
- Impala Catalog cache (fast, in-memory metadata)
Query → Impala Planner → Metadata → HDFS Files
If metadata is wrong, the query plan is wrong.
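This metadata-driven behavior can be modeled in a few lines. The sketch below is illustrative only: a dict stands in for HDFS, another dict stands in for Impala's catalog cache; none of this is real Impala code.

```python
# Minimal model of Impala's metadata-driven reads (illustrative only).
# "hdfs" is a dict standing in for the filesystem; "catalog" is the
# cached file list Impala actually plans against.

hdfs = {"/table_path/cob_dt_id=20250202": ["part-0001.parquet"]}
catalog = {p: list(files) for p, files in hdfs.items()}  # snapshot at load time

# A new file lands in HDFS directly (e.g. written by Spark)...
hdfs["/table_path/cob_dt_id=20250202"].append("part-0002.parquet")

# ...but queries plan against the cached snapshot, so it is invisible:
assert "part-0002.parquet" not in catalog["/table_path/cob_dt_id=20250202"]

# A REFRESH re-lists the directory and updates the cache:
catalog["/table_path/cob_dt_id=20250202"] = list(hdfs["/table_path/cob_dt_id=20250202"])
assert "part-0002.parquet" in catalog["/table_path/cob_dt_id=20250202"]
```

The point: the filesystem can change freely, but until the cache is refreshed, Impala plans against the old snapshot.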
For a partitioned table like:

```sql
PARTITIONED BY (cob_dt_id INT)
```

the HDFS layout looks like:

```
/table_path/
    cob_dt_id=20250201/
    cob_dt_id=20250202/
```
But Impala only sees partitions that are:
- Registered in the metastore
- Cached in the catalog
A directory existing in HDFS does not mean Impala knows about it.
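The `key=value` directory convention above can be expressed as a small helper. This is a sketch for illustration; `partition_dir` is not an Impala or Hive API, and the paths are the placeholders used in this document.

```python
# Sketch: how a partition key/value maps to an HDFS directory,
# mirroring the Hive/Impala "key=value" layout convention.

def partition_dir(table_path: str, **partition_spec) -> str:
    """Build the HDFS directory for a partition spec, e.g. cob_dt_id=20250202."""
    parts = "/".join(f"{k}={v}" for k, v in partition_spec.items())
    return f"{table_path.rstrip('/')}/{parts}"

print(partition_dir("/table_path", cob_dt_id=20250202))
# -> /table_path/cob_dt_id=20250202
```

Creating that directory (e.g. from Spark) is exactly the situation where Impala has the path on disk but no metadata entry for it.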
```sql
ALTER TABLE db.table
ADD PARTITION (cob_dt_id=20250202);
```

What it does:
- Registers the partition key/value
- Maps it to an HDFS directory
- Stores this mapping in Hive Metastore
- Updates Impala Catalog metadata

What it does not do:
- Does not scan files
- Does not load statistics

If you skip it:
- Impala may ignore the partition entirely
- REFRESH will not work
- Query behavior becomes unpredictable
```sql
REFRESH db.table PARTITION (cob_dt_id=20250202);
```

What it does:
- Contacts the NameNode
- Lists files under the partition directory
- Updates the file metadata cache
- Makes new files visible to queries

What it does not do:
- Does not create partitions
- Does not compute statistics

If you skip it:
- New Parquet files may not be read
- Overwritten files may still be queried
- Results can be stale
```sql
COMPUTE INCREMENTAL STATS db.table
PARTITION (cob_dt_id=20250202);
```

What it does:
- Reads Parquet metadata
- Calculates row counts, column cardinality, and data size
- Updates planner statistics

Impala uses stats to decide:
- Broadcast vs shuffle joins
- Join order
- Predicate selectivity

If you skip it:
- Queries still run
- Performance degrades over time
- Join-heavy queries suffer badly
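To see why row counts matter, here is a toy join-strategy chooser. This is NOT Impala's real cost model; the threshold and the no-stats fallback are invented purely to demonstrate the idea that a planner without row counts is guessing.

```python
# Toy illustration of why row-count stats matter for join planning.
# The threshold and logic are invented for demonstration only.

BROADCAST_LIMIT_ROWS = 1_000_000  # hypothetical threshold

def choose_join_strategy(build_side_rows: int) -> str:
    """With stats, a small build side can be broadcast to every node.
    Without stats (rows unknown, modeled here as -1), the planner
    cannot make an informed choice."""
    if build_side_rows < 0:
        return "shuffle (no stats: conservative guess)"
    if build_side_rows <= BROADCAST_LIMIT_ROWS:
        return "broadcast"
    return "shuffle"
```

A small dimension table with known stats gets broadcast cheaply; the same table with missing stats may be shuffled, or a huge table may be wrongly broadcast, and join-heavy queries pay for it.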
If you only write Parquet files to HDFS:
| Aspect | Result |
|---|---|
| Partition discovery | ❌ Not guaranteed |
| File visibility | ❌ Inconsistent |
| Query correctness | ❌ At risk |
| Performance | ❌ Degrades |
You may sometimes see data due to cached metadata, but this is accidental.
1. Write Parquet to HDFS
2. ADD PARTITION (if new)
3. REFRESH PARTITION
4. COMPUTE INCREMENTAL STATS
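The three metadata steps above can be scripted as statement generation. A sketch, not a client library: in production you would run these through `impala-shell -q` or a connector; here we only build the SQL strings. `IF NOT EXISTS` (valid Impala syntax) is added so the ADD PARTITION step is idempotent on reruns.

```python
# Sketch of the daily ingestion workflow as SQL statement generation.
# Only builds strings; executing them is left to impala-shell or a client.

def ingestion_statements(table: str, partition_col: str, value) -> list[str]:
    part = f"{partition_col}={value}"
    return [
        # IF NOT EXISTS makes the step safe to re-run for an existing partition
        f"ALTER TABLE {table} ADD IF NOT EXISTS PARTITION ({part});",
        f"REFRESH {table} PARTITION ({part});",
        f"COMPUTE INCREMENTAL STATS {table} PARTITION ({part});",
    ]

for stmt in ingestion_statements("db.table", "cob_dt_id", 20250202):
    print(stmt)
```

Running the steps in this order matters: the partition must exist before REFRESH can list its files, and stats are computed last over the refreshed file set.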
| Concern | Solved |
|---|---|
| Partition registration | ✅ |
| File discovery | ✅ |
| Query planning | ✅ |
| Cluster impact | Minimal |
This is safe, fast, and scalable.
```sql
MSCK REPAIR TABLE db.table;
```

- Scans the entire directory tree
- Discovers all partitions
- Very slow for large tables
Use only for:
- One-time migrations
- Recovery scenarios
```sql
INVALIDATE METADATA db.table;
```

- Drops all cached metadata
- Forces a full reload
- Expensive and disruptive
Use only for:
- Schema changes
- Emergency fixes
```sql
INSERT INTO TABLE db.table
PARTITION (cob_dt_id=20250202)
SELECT ...;
```

- Impala manages metadata automatically
- Best when Impala owns ingestion
Not suitable if:
- Spark or external systems write data
| Strategy | Partitions | Files | Stats | Daily Use |
|---|---|---|---|---|
| ADD PARTITION | ✅ | ❌ | ❌ | ✅ |
| REFRESH | ❌ | ✅ | ❌ | ✅ |
| INCR STATS | ❌ | ❌ | ✅ | ✅ |
| Combined | ✅ | ✅ | ✅ | ⭐ |
| MSCK | ✅ | ❌ | ❌ | ❌ |
| INVALIDATE | ✅ | ✅ | ❌ | ❌ |
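The comparison table can be condensed into a decision helper. This is a hypothetical summary of this document's recommendations, not Impala terminology or tooling.

```python
# Hypothetical decision helper summarizing the strategy table above.

def pick_strategy(new_partition: bool, schema_changed: bool,
                  impala_writes: bool) -> list[str]:
    if schema_changed:
        # Emergency/schema-change path: full metadata rebuild
        return ["INVALIDATE METADATA"]
    if impala_writes:
        # Impala-managed ingestion needs no manual refresh
        return ["INSERT (Impala manages metadata)"]
    # External writers (e.g. Spark): the combined daily workflow
    steps = ["ADD PARTITION"] if new_partition else []
    return steps + ["REFRESH PARTITION", "COMPUTE INCREMENTAL STATS"]
```

For the common case of Spark writing a fresh daily partition, this yields the full three-step combined workflow.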
- Impala is metadata-driven, not filesystem-driven
- Writing Parquet files is only half the job
- Refreshing partitions is mandatory for correctness
- Statistics are mandatory for performance
If you treat these steps as part of ingestion, Impala becomes predictable, fast, and reliable.
Recommended rule:
If data lands outside Impala, always refresh metadata explicitly.