Skip to content

[AURON #2160] perf: SIMD short-circuit in JoinHashMap probe#2161

Open
yew1eb wants to merge 1 commit intoapache:masterfrom
yew1eb:AURON_2160
Open

[AURON #2160] perf: SIMD short-circuit in JoinHashMap probe#2161
yew1eb wants to merge 1 commit intoapache:masterfrom
yew1eb:AURON_2160

Conversation

@yew1eb
Copy link
Copy Markdown
Contributor

@yew1eb yew1eb commented Apr 3, 2026

Which issue does this PR close?

Closes #2160

Rationale for this change

Optimize join hash map probe by checking hash_matched first before computing empty mask.

What changes are included in this PR?

Changes:

  • Reordered SIMD probe to check hash_matched before computing empty mask.
  • Added benches/join_hash_map.rs with 0%/50%/100% hit rates × 5M/10M/20M keys.

Are there any user-facing changes?

How was this patch tested?

Benchmark (M2 Pro, probe_size=4096):

build size map size hit=0% hit=50% hit=100%
5M keys ~128 MB 6.63 µs 6.52 µs 6.35 µs
10M keys ~256 MB 6.68 µs 6.50 µs 6.36 µs
20M keys ~512 MB 6.70 µs 6.59 µs 6.36 µs

hit=100% is consistently ~4–5% faster than hit=0%.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Optimizes the SIMD-based probe path in the native-engine join hash map by short-circuiting the “empty slot” SIMD comparison when a hash match is found, targeting reduced instruction count in typical high-hit-rate join workloads.

Changes:

  • Splits the probe condition into a fast-path (hash match) and slow-path (empty slot) to avoid an unconditional empty-mask SIMD compare.
  • Returns MapValue::EMPTY directly when an empty slot is detected in the probed group.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines 255 to +268
let hash_matched = self.map[e].hashes.simd_eq(Simd::splat(hashes[i]));
let empty = self.map[e].hashes.simd_eq(Simd::splat(0));

if let Some(pos) = (hash_matched | empty).first_set() {
// Fast path: check hash match first (common case)
if let Some(pos) = hash_matched.first_set() {
hashes[i] = unsafe {
// safety: transmute MapValue(u32) to u32
std::mem::transmute(self.map[e].values[pos])
};
break;
}

// Slow path: check empty slot only when no match
let empty = self.map[e].hashes.simd_eq(Simd::splat(0));
if empty.any() {
Copy link

Copilot AI Apr 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The correctness of checking hash_matched.first_set() before computing the empty mask relies on an invariant that, within a MapValueGroup, all occupied lanes are packed from the beginning (i.e., there cannot be an empty lane before a later occupied lane). That invariant currently holds because insertion always uses empty.first_set(), but it's not stated here and a future change (e.g., tombstones/deletes or a different insertion strategy) could break this lookup logic. Please document this invariant explicitly (or add a debug-only assertion) so this fast-path doesn't become subtly incorrect later.

Copilot uses AI. Check for mistakes.
Comment thread native-engine/datafusion-ext-plans/src/joins/join_hash_map.rs Outdated
Copy link
Copy Markdown
Contributor

@ShreyeshArangath ShreyeshArangath left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How was the performance tested? Can you share some logs/numbers in the PR description as well?

We should probably set up a microbenchmark for lookup_many with controlled hit rates (0%, 50%, 100%) if possible, WDYT?

@github-actions github-actions Bot added the build label Apr 28, 2026
@yew1eb
Copy link
Copy Markdown
Contributor Author

yew1eb commented Apr 28, 2026

How was the performance tested? Can you share some logs/numbers in the PR description as well?

We should probably set up a microbenchmark for lookup_many with controlled hit rates (0%, 50%, 100%) if possible, WDYT?

@ShreyeshArangath Done. Added benches/join_hash_map.rs with 0%/50%/100% hit rates across 5M/10M/20M keys. The numbers are in the PR description: on M2 Pro the win is ~4–5% between hit=0% and hit=100%, which is modest but expected since this is a small hot-path cleanup. Should be safe to merge.

@yew1eb yew1eb force-pushed the AURON_2160 branch 3 times, most recently from 7a9cce3 to 3cd3a0b Compare April 28, 2026 03:55
[AURON-2160] Optimize join hash map probe by checking hash_matched
first before computing empty mask. This reduces ~50% SIMD instructions
when hash hit rate is high (typical join scenarios).

Before: Always compute both hash_matched and empty SIMD masks.
After: Only compute empty mask when hash_matched has no hits.

Also add a criterion microbenchmark (benches/join_hash_map.rs) covering
realistic BHJ build sizes (5M/10M/20M keys) × three hit rates (0/50/100%).

Results on Apple M2 Pro (probe_size=4096):

  build size      | hit=0%  | hit=50% | hit=100%
  ----------------+---------+---------+---------
  5M  (~128 MB)   | 6.63 µs | 6.52 µs | 6.35 µs
  10M (~256 MB)   | 6.68 µs | 6.50 µs | 6.36 µs
  20M (~512 MB)   | 6.70 µs | 6.59 µs | 6.36 µs

Latency stays flat because prefetch_read_data (4-step ahead) fully
pipelines cache misses. The hit=100% path is consistently ~4-5% faster,
aligning with the optimization goal. Instruction-count savings can be
confirmed on x86 via: perf stat -e instructions

Run benchmark:
  cargo bench --bench join_hash_map -p datafusion-ext-plans
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

JoinHashMap probe SIMD short-circuit optimization

3 participants