[AURON #2160] perf: SIMD short-circuit in JoinHashMap probe#2161
[AURON #2160] perf: SIMD short-circuit in JoinHashMap probe#2161yew1eb wants to merge 1 commit intoapache:masterfrom
Conversation
There was a problem hiding this comment.
Pull request overview
Optimizes the SIMD-based probe path in the native-engine join hash map by short-circuiting the “empty slot” SIMD comparison when a hash match is found, targeting reduced instruction count in typical high-hit-rate join workloads.
Changes:
- Splits the probe condition into a fast-path (hash match) and slow-path (empty slot) to avoid an unconditional empty-mask SIMD compare.
- Returns
MapValue::EMPTYdirectly when an empty slot is detected in the probed group.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| let hash_matched = self.map[e].hashes.simd_eq(Simd::splat(hashes[i])); | ||
| let empty = self.map[e].hashes.simd_eq(Simd::splat(0)); | ||
|
|
||
| if let Some(pos) = (hash_matched | empty).first_set() { | ||
| // Fast path: check hash match first (common case) | ||
| if let Some(pos) = hash_matched.first_set() { | ||
| hashes[i] = unsafe { | ||
| // safety: transmute MapValue(u32) to u32 | ||
| std::mem::transmute(self.map[e].values[pos]) | ||
| }; | ||
| break; | ||
| } | ||
|
|
||
| // Slow path: check empty slot only when no match | ||
| let empty = self.map[e].hashes.simd_eq(Simd::splat(0)); | ||
| if empty.any() { |
There was a problem hiding this comment.
The correctness of checking hash_matched.first_set() before computing the empty mask relies on an invariant that, within a MapValueGroup, all occupied lanes are packed from the beginning (i.e., there cannot be an empty lane before a later occupied lane). That invariant currently holds because insertion always uses empty.first_set(), but it's not stated here and a future change (e.g., tombstones/deletes or a different insertion strategy) could break this lookup logic. Please document this invariant explicitly (or add a debug-only assertion) so this fast-path doesn't become subtly incorrect later.
ShreyeshArangath
left a comment
There was a problem hiding this comment.
How was the performance tested? Can you share some logs/numbers in the PR description as well?
We should probably set up a microbenchmark for lookup_many with controlled hit rates (0%, 50%, 100%) if possible, WDYT?
@ShreyeshArangath Done. Added benches/join_hash_map.rs with 0%/50%/100% hit rates across 5M/10M/20M keys. The numbers are in the PR description: on M2 Pro the win is ~4–5% between hit=0% and hit=100%, which is modest but expected since this is a small hot-path cleanup. Should be safe to merge. |
7a9cce3 to
3cd3a0b
Compare
[AURON-2160] Optimize join hash map probe by checking hash_matched first before computing empty mask. This reduces ~50% SIMD instructions when hash hit rate is high (typical join scenarios). Before: Always compute both hash_matched and empty SIMD masks. After: Only compute empty mask when hash_matched has no hits. Also add a criterion microbenchmark (benches/join_hash_map.rs) covering realistic BHJ build sizes (5M/10M/20M keys) × three hit rates (0/50/100%). Results on Apple M2 Pro (probe_size=4096): build size | hit=0% | hit=50% | hit=100% ----------------+---------+---------+--------- 5M (~128 MB) | 6.63 µs | 6.52 µs | 6.35 µs 10M (~256 MB) | 6.68 µs | 6.50 µs | 6.36 µs 20M (~512 MB) | 6.70 µs | 6.59 µs | 6.36 µs Latency stays flat because prefetch_read_data (4-step ahead) fully pipelines cache misses. The hit=100% path is consistently ~4-5% faster, aligning with the optimization goal. Instruction-count savings can be confirmed on x86 via: perf stat -e instructions Run benchmark: cargo bench --bench join_hash_map -p datafusion-ext-plans
Which issue does this PR close?
Closes #2160
Rationale for this change
Optimize join hash map probe by checking hash_matched first before computing empty mask.
What changes are included in this PR?
Changes:
hash_matchedbefore computingemptymask.benches/join_hash_map.rswith 0%/50%/100% hit rates × 5M/10M/20M keys.Are there any user-facing changes?
How was this patch tested?
Benchmark (M2 Pro, probe_size=4096):
hit=100% is consistently ~4–5% faster than hit=0%.