
bench: add routed expert locality profiler #307

Open
hexxyan wants to merge 1 commit into antirez:main from hexxyan:bench/expert-locality-profiler
hexxyan commented May 31, 2026

Summary

Opt-in profiler that records per-layer per-token MoE expert selections during CPU decode and writes detailed JSON statistics. Answers the question: does DeepSeek-V4's MoE exhibit exploitable expert locality?

This is a diagnostic/data-gathering PR — it makes no performance claims and does not change the inference path when not activated.

How it works:

  • Hooks into layer_routed_moe_one() and layer_routed_moe_one_prealloc() (CPU decode path)
  • Records selected expert IDs and router weights per layer per token
  • At engine shutdown, computes statistics and writes JSON

Per-layer JSON output includes:

  • Expert frequency histogram and average weight distribution
  • Top-10 experts with percentage and cumulative coverage curve
  • Shannon entropy / entropy ratio (measures uniformity vs concentration)
  • Adjacent-token top-k overlap (how many of 6 experts stay the same)
  • Adjacent-token Jaccard similarity
  • Position stability (same slot, same expert across tokens)
  • Hash-routing vs top-k-routing per layer
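
As a rough illustration of what the statistics above measure, here is a minimal Python sketch that computes them from a list of per-token expert selections for one layer. It is not the profiler's code (the profiler lives in the C inference path) and it does not read the profiler's JSON, whose schema is not shown in this PR. The function names, the input layout, and the toy data are all assumptions made for the example.

```python
import math
from collections import Counter

# Hypothetical input layout: one list of selected expert IDs per token,
# in slot order, for a single layer. Not the profiler's actual format.

def entropy_stats(selections, n_experts):
    """Shannon entropy (bits) of expert usage and its ratio to the uniform maximum."""
    counts = Counter(e for tok in selections for e in tok)
    total = sum(counts.values())
    h = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return h, h / math.log2(n_experts)

def coverage_curve(selections, top_n=10):
    """(expert, share, cumulative share) for the top_n most frequent experts."""
    counts = Counter(e for tok in selections for e in tok)
    total = sum(counts.values())
    curve, cum = [], 0
    for expert, c in counts.most_common(top_n):
        cum += c
        curve.append((expert, c / total, cum / total))
    return curve

def adjacent_metrics(selections):
    """Mean top-k overlap, Jaccard similarity, and position stability
    over consecutive token pairs."""
    pairs = list(zip(selections, selections[1:]))
    overlap = [len(set(a) & set(b)) for a, b in pairs]
    jaccard = [len(set(a) & set(b)) / len(set(a) | set(b)) for a, b in pairs]
    # Position stability: fraction of slots holding the same expert in both tokens.
    stability = [sum(x == y for x, y in zip(a, b)) / len(a) for a, b in pairs]
    n = len(pairs)
    return sum(overlap) / n, sum(jaccard) / n, sum(stability) / n

# Toy data: 3 tokens, top-3 routing over 4 experts.
toy = [[0, 1, 2], [0, 1, 3], [3, 1, 0]]
mean_overlap, mean_jaccard, mean_stability = adjacent_metrics(toy)
uniform_h, uniform_ratio = entropy_stats([[0, 1], [2, 3]], n_experts=4)
toy_curve = coverage_curve(toy)
```

An entropy ratio near 1.0 means experts are used almost uniformly, while a low ratio together with a steep coverage curve suggests a small hot set. Overlap and Jaccard ignore slot order; position stability does not, which is why the second toy pair has full overlap but low stability.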

Usage:

# Via env var (works with any ds4 binary)
DS4_EXPERT_PROFILE=profile.json ./ds4-server -m <model.gguf> --cpu

# Via ds4-bench flag
./ds4-bench --cpu --expert-profile profile.json -m <model.gguf>

Why this PR: Before investing in expert prefetch, cache, or mini-GEMM grouping optimizations, we need data on whether expert selection has temporal locality. PowerInfer showed that exploiting activation locality is viable for ReLU-sparse models; this profiler determines whether ds4's SiLU/SwiGLU MoE exhibits similar patterns.

Verification

  • make cpu clean build
  • make ds4 (Metal) clean build
  • make test passes (extractors + q4k_dot)
  • --help shows --expert-profile flag
  • --expert-profile without --cpu correctly errors
  • Real model profiling output validated (needs a model)

Design Notes

  • CPU only: Expert selection patterns are a model property, not backend-dependent. GPU path does not expose selected experts to the CPU, and profiling on CPU is sufficient for locality analysis.
  • Capacity: Records up to 8192 tokens by default. For DeepSeek-V4 (61 layers × 8192 tokens × 6 experts), this uses ~12 MB — trivial.
  • Zero overhead when off: All profiling guarded by g_expert_profile_active.
  • Multi-frontier caveat: In ds4-bench multi-frontier mode, adjacent-token overlap/Jaccard metrics include a few cross-frontier boundary pairs. Histograms, entropy, and top-expert stats are unaffected. These metrics are most meaningful with a single frontier or sufficiently many generated tokens per frontier.
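
The capacity figure above can be sanity-checked with a short calculation. The PR does not state the in-memory record layout, so the 4 bytes per recorded selection used here is an assumption chosen for illustration; a layout that stores a 32-bit ID and a 32-bit weight separately would roughly double the result.

```python
# Back-of-the-envelope check of the profiler buffer size.
LAYERS = 61           # from the PR text
MAX_TOKENS = 8192     # default capacity, from the PR text
TOP_K = 6             # experts selected per token, from the PR text
BYTES_PER_RECORD = 4  # ASSUMPTION: not stated in the PR

records = LAYERS * MAX_TOKENS * TOP_K
total_bytes = records * BYTES_PER_RECORD

print(f"{records} records, {total_bytes / 1e6:.2f} MB")
```

Under that assumption the buffer holds 2,998,272 records in about 11.99 MB, which matches the ~12 MB quoted above.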

Opt-in profiler that records per-layer per-token MoE expert selections
during CPU decode.  Outputs JSON with per-layer statistics:

  - Expert frequency histogram and weight distribution
  - Top-10 experts with cumulative coverage curve
  - Shannon entropy / entropy ratio (uniformity measure)
  - Adjacent-token top-k overlap and Jaccard similarity
  - Position stability (same slot, same expert across tokens)
  - Hash-routing vs top-k-routing per layer

Usage:
  DS4_EXPERT_PROFILE=profile.json ./ds4-server -m <model> --cpu
  ./ds4-bench --cpu --expert-profile profile.json -m <model>

No performance impact when not activated.  CPU-only: expert selection
patterns are a model property, not backend-dependent.
hexxyan force-pushed the bench/expert-locality-profiler branch from 85d201d to e5455bb on May 31, 2026 12:23