bench: add routed expert locality profiler #307
Open
hexxyan wants to merge 1 commit into
Opt-in profiler that records per-layer per-token MoE expert selections during CPU decode.

Outputs JSON with per-layer statistics:
- Expert frequency histogram and weight distribution
- Top-10 experts with cumulative coverage curve
- Shannon entropy / entropy ratio (uniformity measure)
- Adjacent-token top-k overlap and Jaccard similarity
- Position stability (same slot, same expert across tokens)
- Hash-routing vs top-k-routing per layer

Usage:
    DS4_EXPERT_PROFILE=profile.json ./ds4-server -m <model> --cpu
    ./ds4-bench --cpu --expert-profile profile.json -m <model>

No performance impact when not activated. CPU-only: expert selection patterns are a model property, not backend-dependent.
Summary
Opt-in profiler that records per-layer per-token MoE expert selections during CPU decode and writes detailed JSON statistics. Answers the question: does DeepSeek-V4's MoE exhibit exploitable expert locality?
This is a diagnostic/data-gathering PR — it makes no performance claims and does not change the inference path when not activated.
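As an illustration of how an opt-in hook can leave the inference path untouched when disabled, the sketch below gates all recording behind a single global flag. Only the flag name `g_expert_profile_active`, the `DS4_EXPERT_PROFILE` variable, and the `--expert-profile` option come from this PR; the functions, the counter, and the enabling logic are hypothetical and may differ from the actual implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Flag name taken from the PR; everything else here is hypothetical. */
static bool g_expert_profile_active = false;
static size_t g_recorded_selections = 0;   /* stand-in for per-layer state */

/* Enable profiling only when a non-empty output path is supplied, for
 * example getenv("DS4_EXPERT_PROFILE") or the --expert-profile argument. */
void expert_profile_configure(const char *output_path) {
    g_expert_profile_active = (output_path != NULL && output_path[0] != '\0');
}

/* Would be called from the routed-MoE decode path per selected expert.
 * When profiling is off, the only cost is the flag check below. */
void expert_profile_record(int layer, int expert_id) {
    if (!g_expert_profile_active) return;
    assert(layer >= 0 && expert_id >= 0);
    g_recorded_selections++;               /* real code would update histograms */
}

/* Number of selections recorded so far. */
size_t expert_profile_recorded(void) {
    return g_recorded_selections;
}
```

With this shape, a caller would pass the environment value or the command-line argument to `expert_profile_configure()` once at startup, and the decode path would pay only a flag check per call when profiling is disabled.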
How it works:
- `layer_routed_moe_one()` and `layer_routed_moe_one_prealloc()` (CPU decode path)

Per-layer JSON output includes:
Usage:
Why this PR: Before investing in expert prefetch, cache, or mini-GEMM grouping optimizations, we need data on whether expert selection has temporal locality. PowerInfer showed this is viable for ReLU-sparse models; this profiler determines whether ds4's SiLU/SwiGLU MoE exhibits similar patterns.
Verification
- `make cpu` clean build
- `make ds4` (Metal) clean build
- `make test` passes (extractors + q4k_dot)
- `--help` shows `--expert-profile` flag
- `--expert-profile` without `--cpu` correctly errors

Design Notes
- `g_expert_profile_active`.
- In `ds4-bench` multi-frontier mode, adjacent-token overlap/Jaccard metrics include a few cross-frontier boundary pairs. Histograms, entropy, and top-expert stats are unaffected. These metrics are most meaningful with a single frontier or sufficiently many generated tokens per frontier.
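For readers interpreting the adjacent-token metrics, the following sketch shows one way top-k overlap and Jaccard similarity could be computed, along with a variant that skips pairs straddling a frontier boundary. It is an illustration only: per the note above, the profiler in this PR does include those boundary pairs, and the `frontier_id` array, the row-major selection layout, and all function names are assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* Number of expert ids shared by two top-k sets.
 * Assumes ids are unique within each set. */
size_t topk_overlap(const int *a, const int *b, size_t k) {
    size_t shared = 0;
    for (size_t i = 0; i < k; i++)
        for (size_t j = 0; j < k; j++)
            if (a[i] == b[j]) { shared++; break; }
    return shared;
}

/* Jaccard similarity: |A and B| / |A or B|.
 * With |A| = |B| = k, the union has 2k - overlap elements. */
double topk_jaccard(const int *a, const int *b, size_t k) {
    if (k == 0) return 0.0;
    size_t shared = topk_overlap(a, b, k);
    return (double)shared / (double)(2 * k - shared);
}

/* Mean Jaccard over adjacent tokens in one layer, skipping pairs whose
 * tokens belong to different frontiers. sel holds n_tokens rows of k
 * expert ids; frontier_id[t] is a hypothetical per-token frontier tag. */
double mean_adjacent_jaccard(const int *sel, const int *frontier_id,
                             size_t n_tokens, size_t k) {
    assert(sel != NULL && frontier_id != NULL);
    double sum = 0.0;
    size_t pairs = 0;
    for (size_t t = 1; t < n_tokens; t++) {
        if (frontier_id[t] != frontier_id[t - 1]) continue;  /* boundary pair */
        sum += topk_jaccard(sel + (t - 1) * k, sel + t * k, k);
        pairs++;
    }
    return pairs ? sum / (double)pairs : 0.0;
}
```

When each frontier generates many tokens, the boundary pairs are a small fraction of all pairs, which is consistent with the note that the reported metrics remain meaningful in that case.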