Benchmarking revealed that rayon parallelism in `ParallelSparseTrie::prune()` causes significant overhead without any performance benefit. The parallel path is always slower than the serial one, even at the maximum of 256 lower subtries. Measured prune time by lower-subtrie count:

| Subtries | Serial | Parallel | Overhead |
|---|---|---|---|
| 4 | 1.62 µs | 27.5 µs | 17x slower |
| 8 | 2.67 µs | 43.3 µs | 16x slower |
| 16 | 4.77 µs | 68.1 µs | 14x slower |
| 32 | 9.03 µs | 96.3 µs | 11x slower |
| 64 | 18.2 µs | 104.6 µs | 5.7x slower |
Sweeping the parallelism threshold (the minimum number of subtries at which the parallel path is taken) shows the same picture:

| Threshold | Time | Notes |
|---|---|---|
| 1 (parallel) | 81.3 µs | Always parallel |
| 4 (parallel) | 84.4 µs | Current default |
| 64 (serial) | 9.06 µs | Threshold > subtries |
| MAX (serial) | 9.01 µs | Always serial |
Where the parallel overhead comes from:

- Base overhead: ~80-100 µs just for thread pool dispatch
- Additional overhead from the `mem::replace` pattern (extracting subtries into a `Vec` and reinserting them; see the sketch below)
- Each subtrie's `retain()` with binary-search predicates takes only ~0.3 µs
- The work is memory-bound (`HashMap` iteration), not CPU-bound
- No cache-locality benefit from parallelism
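The extract-and-reinsert shape the parallel path needed is roughly the following. This is an illustrative sketch with made-up types and names, not reth's actual code:

```rust
use rayon::prelude::*;
use std::collections::HashMap;
use std::mem;

/// Stand-in for a lower subtrie (illustrative only).
struct Subtrie {
    nodes: HashMap<Vec<u8>, u64>,
}

impl Subtrie {
    /// Placeholder for the real per-subtrie `retain()` work (~0.3 µs each).
    fn retain_unpruned(&mut self) {
        self.nodes.retain(|_, _| true);
    }
}

fn prune_parallel(slots: &mut [Option<Subtrie>]) {
    // 1. Extract every subtrie from its slot into a temporary Vec so the
    //    batch can be handed to rayon (the `mem::replace` cost noted above).
    let mut taken: Vec<(usize, Subtrie)> = slots
        .iter_mut()
        .enumerate()
        .filter_map(|(i, slot)| mem::replace(slot, None).map(|s| (i, s)))
        .collect();

    // 2. Per-subtrie work on the rayon pool; the dispatch alone costs
    //    ~80-100 µs per the measurements above.
    taken.par_iter_mut().for_each(|(_, subtrie)| subtrie.retain_unpruned());

    // 3. Reinsert every subtrie into its original slot.
    for (i, subtrie) in taken {
        slots[i] = Some(subtrie);
    }
}
```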
Break-even analysis, taking rayon overhead = 80 µs and per-subtrie work = 0.3 µs, and assuming perfect 8-core scaling:

- Break-even point: N × 0.3 = 80 + (N × 0.3 / 8), i.e. N × 0.2625 = 80, so N ≈ 305 subtries
- The maximum is 256 lower subtries, so parallelism can never break even
- At 256 subtries: serial = 256 × 0.3 ≈ 77 µs; parallel = 80 + 256 × 0.3 / 8 ≈ 90 µs (still slower, even with perfect scaling)
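As a quick sanity check, the arithmetic above can be reproduced directly (constants are the measured values; 8-way perfect scaling is an assumption):

```rust
fn main() {
    let rayon_overhead_us = 80.0_f64; // measured base cost of parallel dispatch
    let per_subtrie_us = 0.3_f64;     // measured retain() cost per subtrie
    let cores = 8.0_f64;              // assumed perfect scaling factor

    // Serial cost N * 0.3 equals parallel cost 80 + N * 0.3 / 8 at:
    let break_even = rayon_overhead_us / (per_subtrie_us - per_subtrie_us / cores);
    println!("break-even: ~{break_even:.0} subtries"); // ~305, above the 256 maximum

    let n = 256.0_f64;
    println!("serial at 256:   ~{:.0} µs", n * per_subtrie_us); // ~77
    println!("parallel at 256: ~{:.0} µs", rayon_overhead_us + n * per_subtrie_us / cores); // ~90
}
```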
Recommendation: remove parallelism from `prune()` entirely. The current threshold of 4 causes up to 17x overhead at small subtrie counts. Since:
- Max lower subtries = 256
- Serial at 256 subtries = ~77 µs
- Rayon overhead alone = ~80-100 µs
- Work per subtrie = ~0.3 µs (too fast to benefit)
Parallelism will never be beneficial for `prune()`.
Replaced the `#[cfg(feature = "std")]` parallel path with a simple serial loop (the binary-search predicates it calls are sketched at the end of this section):
```rust
// Before: ~60 lines of rayon parallel code with a threshold check.
// After:
for (subtrie_idx, subtrie) in self.lower_subtries.iter_mut().enumerate() {
    // Fully pruned subtries need no per-node filtering here.
    if fully_pruned_subtries[subtrie_idx] {
        continue;
    }
    if let Some(s) = subtrie.as_revealed_mut() {
        let bucket = &roots_by_lower[subtrie_idx];
        // Remove nodes that are strict descendants of a root in either set.
        s.nodes.retain(|p, _| {
            !is_strict_descendant_in(bucket, p) && !is_strict_descendant_in(&roots_upper, p)
        });
        // Remove values whose path starts with a pruned prefix in either set.
        s.inner.values.retain(|p, _| {
            !starts_with_pruned_in(bucket, p) && !starts_with_pruned_in(&roots_upper, p)
        });
    }
}
```

- CPU: (run on standard benchmark machine)
- Criterion with 100-200 samples per benchmark
- Test tries built with 4 leaves per subtrie
- Prune depth = 2 (typical cross-payload caching scenario)
- Benchmark: `crates/trie/sparse-parallel/benches/prune.rs`
- Implementation: `crates/trie/sparse-parallel/src/trie.rs` (line ~1122)
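For reference, the `is_strict_descendant_in` / `starts_with_pruned_in` calls in the serial loop above are the binary-search predicates from the overhead breakdown. A minimal sketch of their likely shape, assuming each bucket is a lexicographically sorted set of non-overlapping pruned-subtree root paths (shown over byte slices here; the actual helpers in `trie.rs` work on nibble paths and may differ):

```rust
/// True if some root in the sorted, non-overlapping `roots` set is a proper
/// prefix of `path`, i.e. `path` lies strictly inside a pruned subtree.
fn is_strict_descendant_in(roots: &[Vec<u8>], path: &[u8]) -> bool {
    // Binary search for the greatest root <= path; with non-overlapping
    // roots, that is the only candidate that can be a prefix of `path`.
    let idx = roots.partition_point(|r| r.as_slice() <= path);
    idx.checked_sub(1)
        .map_or(false, |i| path.len() > roots[i].len() && path.starts_with(&roots[i]))
}

/// Same lookup, but a non-strict prefix match (the root path itself counts).
fn starts_with_pruned_in(roots: &[Vec<u8>], path: &[u8]) -> bool {
    let idx = roots.partition_point(|r| r.as_slice() <= path);
    idx.checked_sub(1)
        .map_or(false, |i| path.starts_with(&roots[i]))
}
```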