ParallelSparseTrie::prune() parallelism analysis - rayon overhead makes parallel path always slower

ParallelSparseTrie::prune() Parallelism Analysis

Summary

Benchmarking revealed that rayon parallelism in ParallelSparseTrie::prune() causes significant overhead without any performance benefit. The parallel path is always slower than serial, even at the maximum of 256 lower subtries.

Benchmark Results

Serial vs Parallel (forced paths)

Subtries   Serial     Parallel    Overhead
4          1.62 µs    27.5 µs     17x slower
8          2.67 µs    43.3 µs     16x slower
16         4.77 µs    68.1 µs     14x slower
32         9.03 µs    96.3 µs     11x slower
64         18.2 µs    104.6 µs    5.7x slower

Threshold Sweep (32 subtries)

Threshold       Time      Notes
1 (parallel)    81.3 µs   Always parallel
4 (parallel)    84.4 µs   Current default
64 (serial)     9.06 µs   Threshold > subtries
MAX (serial)    9.01 µs   Always serial
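
The threshold being swept is the gate that decides which path runs. A minimal sketch of that shape; the constant name and function boundaries are invented here for illustration, not the exact code in trie.rs:

// Illustrative gate only; PARALLEL_THRESHOLD and the function names are
// placeholders, not the actual identifiers in trie.rs.
const PARALLEL_THRESHOLD: usize = 4; // the "current default" row above

fn prune_lower_subtries(subtrie_count: usize) {
    if subtrie_count >= PARALLEL_THRESHOLD {
        prune_parallel(); // rayon path: pays the dispatch overhead up front
    } else {
        prune_serial(); // plain loop over the subtries
    }
}

fn prune_parallel() { /* par_iter_mut over lower subtries */ }
fn prune_serial() { /* sequential loop over lower subtries */ }

fn main() {
    // Setting the threshold above the subtrie count (e.g. 64 for 32 subtries)
    // forces the serial path, which is what the 9 µs rows measure.
    prune_lower_subtries(32);
}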

Root Cause Analysis

1. Rayon Dispatch Overhead

  • Base overhead: ~80-100 µs just for thread-pool dispatch (see the sketch below)
  • Additional overhead from the mem::replace pattern (extracting subtries from the Vec and reinserting them afterwards)
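
To make the dispatch cost concrete, here is a standalone micro-benchmark sketch (not the project's benchmark; it only requires the rayon crate) that mirrors the situation in prune(): a handful of trivially cheap items handed to the global thread pool:

// Minimal sketch: rayon fork/join dispatch dwarfs trivial per-item work.
use rayon::prelude::*;
use std::time::Instant;

fn main() {
    // Warm up the global pool so we measure per-call dispatch, not pool startup.
    let _: u32 = (0..1024u32).into_par_iter().sum();

    let mut data: Vec<u64> = (0..32).collect();

    let t = Instant::now();
    data.iter_mut().for_each(|x| *x = x.wrapping_mul(3));
    let serial = t.elapsed();

    let t = Instant::now();
    data.par_iter_mut().for_each(|x| *x = x.wrapping_mul(3));
    let parallel = t.elapsed();

    // With 32 cheap items, the parallel run is dominated by dispatch,
    // mirroring the prune() numbers above.
    println!("serial: {serial:?}, parallel: {parallel:?}");
}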

2. Per-Subtrie Work is Trivially Fast

  • Each subtrie's retain() with binary search predicates: ~0.3 µs
  • Work is memory-bound (HashMap iteration), not CPU-bound
  • No cache locality benefit from parallelism

3. Break-Even Analysis

Rayon overhead = 80 µs
Per-subtrie work = 0.3 µs
Assuming 8-core perfect scaling:

Break-even: N × 0.3 = 80 + (N × 0.3 / 8)
N × 0.2625 = 80
N ≈ 305 subtries

But max subtries = 256, so parallelism can never break even.

At 256 subtries:

  • Serial: 256 × 0.3 = 77 µs
  • Parallel: 80 + (256 × 0.3 / 8) = 80 + 9.6 ≈ 90 µs (still slower, even with perfect scaling)
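
The same arithmetic as a tiny standalone program (constants taken from the measurements above; the 8-way perfect scaling is an assumption):

// Back-of-envelope check of the break-even math.
fn main() {
    let rayon_overhead_us = 80.0; // measured dispatch overhead
    let per_subtrie_us = 0.3;     // measured per-subtrie retain() cost
    let cores = 8.0;              // assumed perfect scaling

    // Serial: N * 0.3. Parallel: 80 + N * 0.3 / 8. Break-even where equal.
    let break_even = rayon_overhead_us / (per_subtrie_us - per_subtrie_us / cores);
    println!("break-even at ~{break_even:.0} subtries"); // ~305

    let n = 256.0; // hard cap on lower subtries
    println!(
        "at 256: serial ~{:.0} µs, parallel ~{:.0} µs",
        n * per_subtrie_us,
        rayon_overhead_us + n * per_subtrie_us / cores
    ); // serial ~77 µs, parallel ~90 µs
}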

Recommendation

Remove parallelism from prune() entirely.

The current threshold of 4 causes 17x overhead at small subtrie counts. Since:

  • Max lower subtries = 256
  • Serial at 256 subtries = ~77 µs
  • Rayon overhead alone = ~80-100 µs
  • Work per subtrie = ~0.3 µs (too fast to benefit)

Parallelism will never be beneficial for prune().

Code Change

Replaced the #[cfg(feature = "std")] parallel path with a simple serial loop:

// Before: ~60 lines of rayon parallel code with threshold check
// After:
for (subtrie_idx, subtrie) in self.lower_subtries.iter_mut().enumerate() {
    if fully_pruned_subtries[subtrie_idx] {
        continue;
    }
    if let Some(s) = subtrie.as_revealed_mut() {
        let bucket = &roots_by_lower[subtrie_idx];
        s.nodes.retain(|p, _| {
            !is_strict_descendant_in(bucket, p) && 
            !is_strict_descendant_in(&roots_upper, p)
        });
        s.inner.values.retain(|p, _| {
            !starts_with_pruned_in(bucket, p) && 
            !starts_with_pruned_in(&roots_upper, p)
        });
    }
}
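
The bucket predicates used above aren't shown here. A rough sketch of what they could look like, assuming each bucket is a sorted, prefix-free list of pruned-root nibble paths; the real implementation works on reth's Nibbles type and its own helpers, so this is illustrative only:

// Hypothetical helpers over plain nibble slices; the real code uses `Nibbles`.
// `bucket` is assumed to be sorted and prefix-free (no pruned root is itself
// a descendant of another), which makes a single predecessor check sufficient.

/// True if `path` lies strictly below some pruned root in `bucket`.
fn is_strict_descendant_in(bucket: &[Vec<u8>], path: &[u8]) -> bool {
    // Binary search for the last root <= path, then check prefix + strictness.
    let idx = bucket.partition_point(|root| root.as_slice() <= path);
    idx.checked_sub(1)
        .map(|i| path.len() > bucket[i].len() && path.starts_with(&bucket[i]))
        .unwrap_or(false)
}

/// True if `path` equals or lies below some pruned root in `bucket`.
fn starts_with_pruned_in(bucket: &[Vec<u8>], path: &[u8]) -> bool {
    let idx = bucket.partition_point(|root| root.as_slice() <= path);
    idx.checked_sub(1)
        .map(|i| path.starts_with(&bucket[i]))
        .unwrap_or(false)
}

fn main() {
    let bucket = vec![vec![0x1, 0x2], vec![0x3]];
    assert!(is_strict_descendant_in(&bucket, &[0x1, 0x2, 0x7]));
    assert!(!is_strict_descendant_in(&bucket, &[0x1, 0x2])); // equal, not strict
    assert!(starts_with_pruned_in(&bucket, &[0x1, 0x2]));
    assert!(!starts_with_pruned_in(&bucket, &[0x2]));
}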

Benchmark Setup

  • CPU: (run on standard benchmark machine)
  • Criterion with 100-200 samples per benchmark
  • Test tries built with 4 leaves per subtrie
  • Prune depth = 2 (typical cross-payload caching scenario)
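
The measurement loop follows the usual Criterion pattern; a generic sketch of that shape (Trie, build_trie, and prune here are stand-in stubs, not the actual types in prune.rs):

// Generic Criterion harness sketch; the real prune.rs benchmarks the actual
// ParallelSparseTrie, these stubs only show the measurement structure.
use criterion::{criterion_group, criterion_main, BatchSize, BenchmarkId, Criterion};

struct Trie;
fn build_trie(_subtries: usize, _leaves_per_subtrie: usize) -> Trie { Trie }
impl Trie {
    fn prune(&mut self, _depth: usize) { /* serial or parallel path under test */ }
}

fn bench_prune(c: &mut Criterion) {
    let mut group = c.benchmark_group("prune");
    for subtries in [4usize, 8, 16, 32, 64] {
        group.bench_with_input(BenchmarkId::from_parameter(subtries), &subtries, |b, &n| {
            b.iter_batched(
                || build_trie(n, 4),      // fresh trie per iteration: 4 leaves per subtrie
                |mut trie| trie.prune(2), // prune depth 2, per the setup above
                BatchSize::SmallInput,
            );
        });
    }
    group.finish();
}

criterion_group!(benches, bench_prune);
criterion_main!(benches);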

Files

  • Benchmark: crates/trie/sparse-parallel/benches/prune.rs
  • Implementation: crates/trie/sparse-parallel/src/trie.rs (line ~1122)