ParallelSparseTrie::prune() parallelism analysis - rayon overhead makes parallel path always slower

ParallelSparseTrie::prune() Parallelism Analysis

Summary

Benchmarking revealed that rayon parallelism in ParallelSparseTrie::prune() causes significant overhead without any performance benefit. The parallel path is always slower than serial, even at the maximum of 256 lower subtries.

Benchmark Results

Serial vs Parallel (forced paths)

Subtries   Serial     Parallel    Overhead
4          1.62 µs    27.5 µs     17x slower
8          2.67 µs    43.3 µs     16x slower
16         4.77 µs    68.1 µs     14x slower
32         9.03 µs    96.3 µs     11x slower
64         18.2 µs    104.6 µs    5.7x slower

Threshold Sweep (32 subtries)

Threshold       Time      Notes
1 (parallel)    81.3 µs   Always parallel
4 (parallel)    84.4 µs   Current default
64 (serial)     9.06 µs   Threshold > subtries
MAX (serial)    9.01 µs   Always serial
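
The threshold being swept is the gate that decides which path runs. A minimal sketch of that shape; the constant name and function boundaries are invented here for illustration, not the exact code in trie.rs:

// Illustrative gate only; PARALLEL_THRESHOLD and the function names are
// placeholders, not the actual identifiers in trie.rs.
const PARALLEL_THRESHOLD: usize = 4; // the "current default" row above

fn prune_lower_subtries(subtrie_count: usize) {
    if subtrie_count >= PARALLEL_THRESHOLD {
        prune_parallel(); // rayon path: pays the dispatch overhead up front
    } else {
        prune_serial(); // plain loop over the subtries
    }
}

fn prune_parallel() { /* par_iter_mut over lower subtries */ }
fn prune_serial() { /* sequential loop over lower subtries */ }

fn main() {
    // Setting the threshold above the subtrie count (e.g. 64 for 32 subtries)
    // forces the serial path, which is what the 9 µs rows measure.
    prune_lower_subtries(32);
}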

Root Cause Analysis

1. Rayon Dispatch Overhead

  • Base overhead: ~80-100 µs just for thread-pool dispatch (see the sketch below)
  • Additional overhead from the mem::replace pattern (extracting subtries from the Vec and reinserting them afterwards)
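
To make the dispatch cost concrete, here is a standalone micro-benchmark sketch (not the project's benchmark; it only requires the rayon crate) that mirrors the situation in prune(): a handful of trivially cheap items handed to the global thread pool:

// Minimal sketch: rayon fork/join dispatch dwarfs trivial per-item work.
use rayon::prelude::*;
use std::time::Instant;

fn main() {
    // Warm up the global pool so we measure per-call dispatch, not pool startup.
    let _: u32 = (0..1024u32).into_par_iter().sum();

    let mut data: Vec<u64> = (0..32).collect();

    let t = Instant::now();
    data.iter_mut().for_each(|x| *x = x.wrapping_mul(3));
    let serial = t.elapsed();

    let t = Instant::now();
    data.par_iter_mut().for_each(|x| *x = x.wrapping_mul(3));
    let parallel = t.elapsed();

    // With 32 cheap items, the parallel run is dominated by dispatch,
    // mirroring the prune() numbers above.
    println!("serial: {serial:?}, parallel: {parallel:?}");
}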

2. Per-Subtrie Work is Trivially Fast

  • Each subtrie's retain() with binary search predicates: ~0.3 µs
  • Work is memory-bound (HashMap iteration), not CPU-bound
  • No cache locality benefit from parallelism

3. Break-Even Analysis

Rayon overhead = 80 µs
Per-subtrie work = 0.3 µs
Assuming 8-core perfect scaling:

Break-even: N × 0.3 = 80 + (N × 0.3 / 8)
N × 0.2625 = 80
N ≈ 305 subtries

But max subtries = 256, so parallelism can never break even.

At 256 subtries:

  • Serial: 256 × 0.3 = 77 µs
  • Parallel: 80 + (256 × 0.3 / 8) = 80 + 9.6 ≈ 90 µs (still slower, even with perfect scaling)
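
The same arithmetic as a tiny standalone program (constants taken from the measurements above; the 8-way perfect scaling is an assumption):

// Back-of-envelope check of the break-even math.
fn main() {
    let rayon_overhead_us = 80.0; // measured dispatch overhead
    let per_subtrie_us = 0.3;     // measured per-subtrie retain() cost
    let cores = 8.0;              // assumed perfect scaling

    // Serial: N * 0.3. Parallel: 80 + N * 0.3 / 8. Break-even where equal.
    let break_even = rayon_overhead_us / (per_subtrie_us - per_subtrie_us / cores);
    println!("break-even at ~{break_even:.0} subtries"); // ~305

    let n = 256.0; // hard cap on lower subtries
    println!(
        "at 256: serial ~{:.0} µs, parallel ~{:.0} µs",
        n * per_subtrie_us,
        rayon_overhead_us + n * per_subtrie_us / cores
    ); // serial ~77 µs, parallel ~90 µs
}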

Recommendation

Remove parallelism from prune() entirely.

The current threshold of 4 causes 17x overhead at small subtrie counts. Since:

  • Max lower subtries = 256
  • Serial at 256 subtries = ~77 µs
  • Rayon overhead alone = ~80-100 µs
  • Work per subtrie = ~0.3 µs (too fast to benefit)

Parallelism will never be beneficial for prune().

Code Change

Replaced the #[cfg(feature = "std")] parallel path with a simple serial loop:

// Before: ~60 lines of rayon parallel code with threshold check
// After:
for (subtrie_idx, subtrie) in self.lower_subtries.iter_mut().enumerate() {
    if fully_pruned_subtries[subtrie_idx] {
        continue;
    }
    if let Some(s) = subtrie.as_revealed_mut() {
        let bucket = &roots_by_lower[subtrie_idx];
        s.nodes.retain(|p, _| {
            !is_strict_descendant_in(bucket, p) && 
            !is_strict_descendant_in(&roots_upper, p)
        });
        s.inner.values.retain(|p, _| {
            !starts_with_pruned_in(bucket, p) && 
            !starts_with_pruned_in(&roots_upper, p)
        });
    }
}
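
The bucket predicates used above aren't shown here. A rough sketch of what they could look like, assuming each bucket is a sorted, prefix-free list of pruned-root nibble paths; the real implementation works on reth's Nibbles type and its own helpers, so this is illustrative only:

// Hypothetical helpers over plain nibble slices; the real code uses `Nibbles`.
// `bucket` is assumed to be sorted and prefix-free (no pruned root is itself
// a descendant of another), which makes a single predecessor check sufficient.

/// True if `path` lies strictly below some pruned root in `bucket`.
fn is_strict_descendant_in(bucket: &[Vec<u8>], path: &[u8]) -> bool {
    // Binary search for the last root <= path, then check prefix + strictness.
    let idx = bucket.partition_point(|root| root.as_slice() <= path);
    idx.checked_sub(1)
        .map(|i| path.len() > bucket[i].len() && path.starts_with(&bucket[i]))
        .unwrap_or(false)
}

/// True if `path` equals or lies below some pruned root in `bucket`.
fn starts_with_pruned_in(bucket: &[Vec<u8>], path: &[u8]) -> bool {
    let idx = bucket.partition_point(|root| root.as_slice() <= path);
    idx.checked_sub(1)
        .map(|i| path.starts_with(&bucket[i]))
        .unwrap_or(false)
}

fn main() {
    let bucket = vec![vec![0x1, 0x2], vec![0x3]];
    assert!(is_strict_descendant_in(&bucket, &[0x1, 0x2, 0x7]));
    assert!(!is_strict_descendant_in(&bucket, &[0x1, 0x2])); // equal, not strict
    assert!(starts_with_pruned_in(&bucket, &[0x1, 0x2]));
    assert!(!starts_with_pruned_in(&bucket, &[0x2]));
}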

Benchmark Setup

  • CPU: (run on standard benchmark machine)
  • Criterion with 100-200 samples per benchmark
  • Test tries built with 4 leaves per subtrie
  • Prune depth = 2 (typical cross-payload caching scenario)
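
The measurement loop follows the usual Criterion pattern; a generic sketch of that shape (Trie, build_trie, and prune here are stand-in stubs, not the actual types in prune.rs):

// Generic Criterion harness sketch; the real prune.rs benchmarks the actual
// ParallelSparseTrie, these stubs only show the measurement structure.
use criterion::{criterion_group, criterion_main, BatchSize, BenchmarkId, Criterion};

struct Trie;
fn build_trie(_subtries: usize, _leaves_per_subtrie: usize) -> Trie { Trie }
impl Trie {
    fn prune(&mut self, _depth: usize) { /* serial or parallel path under test */ }
}

fn bench_prune(c: &mut Criterion) {
    let mut group = c.benchmark_group("prune");
    for subtries in [4usize, 8, 16, 32, 64] {
        group.bench_with_input(BenchmarkId::from_parameter(subtries), &subtries, |b, &n| {
            b.iter_batched(
                || build_trie(n, 4),      // fresh trie per iteration: 4 leaves per subtrie
                |mut trie| trie.prune(2), // prune depth 2, per the setup above
                BatchSize::SmallInput,
            );
        });
    }
    group.finish();
}

criterion_group!(benches, bench_prune);
criterion_main!(benches);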

Files

  • Benchmark: crates/trie/sparse-parallel/benches/prune.rs
  • Implementation: crates/trie/sparse-parallel/src/trie.rs (line ~1122)