bench: add routed expert locality profiler by hexxyan · Pull Request #307 · antirez/ds4

hexxyan · 2026-05-31T12:01:21Z

Summary

Opt-in profiler that records per-layer per-token MoE expert selections during CPU decode and writes detailed JSON statistics. Answers the question: does DeepSeek-V4's MoE exhibit exploitable expert locality?

This is a diagnostic, data-gathering PR: it makes no performance claims and leaves the inference path unchanged when the profiler is not activated.

How it works:

Hooks into layer_routed_moe_one() and layer_routed_moe_one_prealloc() (CPU decode path)
Records selected expert IDs and router weights per layer per token
At engine shutdown, computes statistics and writes JSON
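
The recording step can be pictured roughly as follows. This is a minimal sketch, not the PR's code: only `g_expert_profile_active` is a name taken from this description, while `expert_profile`, `profile_begin`, `profile_record`, `profile_end`, and `PROF_TOP_K` are hypothetical names chosen for illustration, and the flat buffer layout is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical: experts selected per token per layer. */
enum { PROF_TOP_K = 6 };

/* Hypothetical flat buffers indexed as [layer][token][slot]. */
typedef struct {
    int32_t *ids;       /* selected expert IDs */
    float   *weights;   /* matching router weights */
    int      n_layers;
    int      cap_tokens;
    int      n_tokens;  /* highest recorded token index + 1 */
} expert_profile;

static bool g_expert_profile_active = false; /* flag name from the PR text */
static expert_profile g_prof;

/* Allocate the buffers and switch profiling on. Returns false on failure. */
bool profile_begin(int n_layers, int cap_tokens) {
    assert(n_layers > 0 && cap_tokens > 0);
    size_t n = (size_t)n_layers * (size_t)cap_tokens * PROF_TOP_K;
    g_prof.ids = malloc(n * sizeof *g_prof.ids);
    g_prof.weights = malloc(n * sizeof *g_prof.weights);
    if (!g_prof.ids || !g_prof.weights) {
        free(g_prof.ids);
        free(g_prof.weights);
        g_prof.ids = NULL;
        g_prof.weights = NULL;
        return false;
    }
    g_prof.n_layers = n_layers;
    g_prof.cap_tokens = cap_tokens;
    g_prof.n_tokens = 0;
    g_expert_profile_active = true;
    return true;
}

/* Would be called from the routed-MoE layer function after expert selection. */
void profile_record(int layer, int token, const int32_t *ids, const float *w) {
    if (!g_expert_profile_active) return;                /* single check when off */
    if (layer < 0 || layer >= g_prof.n_layers) return;
    if (token < 0 || token >= g_prof.cap_tokens) return; /* drop past capacity */
    size_t base = ((size_t)layer * (size_t)g_prof.cap_tokens + (size_t)token) * PROF_TOP_K;
    for (int i = 0; i < PROF_TOP_K; i++) {
        g_prof.ids[base + i] = ids[i];
        g_prof.weights[base + i] = w[i];
    }
    if (token + 1 > g_prof.n_tokens) g_prof.n_tokens = token + 1;
}

/* Release the buffers and switch profiling off (statistics would be written first). */
void profile_end(void) {
    free(g_prof.ids);
    free(g_prof.weights);
    g_prof.ids = NULL;
    g_prof.weights = NULL;
    g_prof.n_tokens = 0;
    g_expert_profile_active = false;
}
```

In this sketch, tokens past the configured capacity are silently dropped rather than growing the buffer, which keeps the hook allocation-free during decode. Whether the PR does the same is not stated here.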

Per-layer JSON output includes:

Expert frequency histogram and average weight distribution
Top-10 experts with percentage and cumulative coverage curve
Shannon entropy / entropy ratio (measures uniformity vs concentration)
Adjacent-token top-k overlap (how many of 6 experts stay the same)
Adjacent-token Jaccard similarity
Position stability (same slot, same expert across tokens)
Hash-routing vs top-k-routing per layer
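
For readers unfamiliar with the adjacent-token metrics, here is a minimal sketch of how overlap, Jaccard similarity, and position stability could be computed for one pair of consecutive tokens. The function names are hypothetical and the definitions are the standard set-based ones; the PR's exact formulas may differ. It assumes the expert IDs within one token's top-k are distinct.

```c
#include <assert.h>
#include <stdint.h>

/* Number of expert IDs present in both top-k sets (order ignored). */
int topk_overlap(const int32_t *a, const int32_t *b, int k) {
    int shared = 0;
    for (int i = 0; i < k; i++) {
        for (int j = 0; j < k; j++) {
            if (a[i] == b[j]) { shared++; break; }
        }
    }
    return shared;
}

/* Jaccard similarity |A ∩ B| / |A ∪ B|; with |A| = |B| = k the union is 2k - shared. */
double topk_jaccard(const int32_t *a, const int32_t *b, int k) {
    assert(k > 0);
    int shared = topk_overlap(a, b, k);
    return (double)shared / (double)(2 * k - shared);
}

/* Position stability: number of slots holding the same expert in both tokens. */
int slot_matches(const int32_t *a, const int32_t *b, int k) {
    int same = 0;
    for (int i = 0; i < k; i++) {
        if (a[i] == b[i]) same++;
    }
    return same;
}
```

As an example, for top-6 sets {3, 17, 42, 8, 99, 120} and {17, 3, 42, 5, 99, 7}, four experts are shared, so the overlap is 4 and the Jaccard similarity is 4 / 8 = 0.5, while only two slots (those holding 42 and 99) match by position. Averaging these values over all adjacent token pairs in a layer gives per-layer figures of the kind listed above.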

Usage:

# Via env var (works with any ds4 binary)
DS4_EXPERT_PROFILE=profile.json ./ds4-server -m <model.gguf> --cpu

# Via ds4-bench flag
./ds4-bench --cpu --expert-profile profile.json -m <model.gguf>

Why this PR: Before investing in expert prefetch, cache, or mini-GEMM grouping optimizations, we need data on whether expert selection has temporal locality. PowerInfer showed this is viable for ReLU-sparse models; this profiler determines whether ds4's SiLU/SwiGLU MoE exhibits similar patterns.

Verification

make cpu clean build
make ds4 (Metal) clean build
make test passes (extractors + q4k_dot)
--help shows --expert-profile flag
--expert-profile without --cpu correctly errors
Real model profiling output validated (needs a model)

Design Notes

CPU only: Expert selection patterns are a property of the model, not of the backend. The GPU path does not expose selected experts to the CPU, and profiling on the CPU is sufficient for locality analysis.
Capacity: Records up to 8192 tokens by default. For DeepSeek-V4 (61 layers × 8192 tokens × 6 experts), this uses ~12 MB — trivial.
Zero overhead when off: All profiling is guarded by g_expert_profile_active.
Multi-frontier caveat: In ds4-bench multi-frontier mode, adjacent-token overlap/Jaccard metrics include a few cross-frontier boundary pairs. Histograms, entropy, and top-expert stats are unaffected. These metrics are most meaningful with a single frontier or sufficiently many generated tokens per frontier.
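
The capacity figure can be checked with a short calculation. The sketch below assumes about 4 bytes stored per selected expert, which is what the ~12 MB estimate implies; the actual record layout is not given in this description, and `profile_buffer_bytes` is a hypothetical helper.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes needed for one fixed-size record per (layer, token, expert slot). */
size_t profile_buffer_bytes(size_t n_layers, size_t cap_tokens,
                            size_t top_k, size_t bytes_per_record) {
    assert(bytes_per_record > 0);
    return n_layers * cap_tokens * top_k * bytes_per_record;
}
```

With the numbers above, `profile_buffer_bytes(61, 8192, 6, 4)` gives 11,993,088 bytes, or about 12 MB. If each record took 8 bytes (for example a 32-bit ID plus a 32-bit float weight), the total would be about 24 MB, still small next to the model weights.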

Opt-in profiler that records per-layer per-token MoE expert selections during CPU decode. Outputs JSON with per-layer statistics:

- Expert frequency histogram and weight distribution
- Top-10 experts with cumulative coverage curve
- Shannon entropy / entropy ratio (uniformity measure)
- Adjacent-token top-k overlap and Jaccard similarity
- Position stability (same slot, same expert across tokens)
- Hash-routing vs top-k-routing per layer

Usage:

DS4_EXPERT_PROFILE=profile.json ./ds4-server -m <model> --cpu
./ds4-bench --cpu --expert-profile profile.json -m <model>

No performance impact when not activated. CPU-only: expert selection patterns are a model property, not backend-dependent.

hexxyan force-pushed the bench/expert-locality-profiler branch from 85d201d to e5455bb Compare May 31, 2026 12:23

hexxyan wants to merge 1 commit into antirez:main from hexxyan:bench/expert-locality-profiler