[OP][Optimization] Fused NoauxTC Kernel by ShaneGZhu · Pull Request #7679 · PaddlePaddle/FastDeploy

ShaneGZhu · 2026-04-30T03:40:02Z

Motivation

💡 If this PR is a Cherry Pick, the PR title needs to follow the format by adding the [Cherry-Pick] label at the very beginning and appending the original PR ID at the end. For example, [Cherry-Pick][CI] Add check trigger and logic(#5191)

Modifications

Usage or Command

Accuracy Tests

Checklist

- [ ] Add at least a tag in the PR title.
  - Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
  - You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run pre-commit before commit.
- [ ] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
- [ ] If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

paddle-bot · 2026-04-30T03:40:08Z

Thanks for your contribution!

PaddlePaddle-bot · 2026-04-30T03:50:32Z

🤖 Paddle-CI-Agent | ci_status_monitor | 2026-04-30 16:37:53

This CI report was generated from the following code (refreshed every 30 minutes):

PR commit: bdac974
Merge base: dae246e (branch: develop)
View full diff
CI details

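1 Task overview

All required tasks have passed (no required tasks are configured); 1 optional task failed, which does not block merging.

| Total runs (reruns) | Total tasks | ✅ Passed | ❌ Failed | ⏳ Running | ⏸️ Waiting | Skipped |
| --- | --- | --- | --- | --- | --- | --- |
| 2 (0) | 2 | 1 | 1 | 0 | 0 | 0 |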

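2 Task status summary

2.1 Required tasks: 0/0 passed

Required tasks block merging; failures must be handled first.

This PR has no required tasks configured.

2.2 Optional tasks: 1/2 passed

Optional tasks do not block merging; failures are for reference only.

| Status | Task | Duration | Log | Rerun |
| --- | --- | --- | --- | --- |
| ❌ | `Trigger Jenkins for PR` | 50s | Job | - |
| ✅ | The other 1 optional task passed | - | - | - |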

3 Failure details (required only)

No required tasks failed.

PaddlePaddle-bot

🤖 Paddle-CI-Agent | pr_review | 2026-04-30 16:19:59

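📋 Review summary

PR overview: This PR fuses the two previously separate kernels in NoauxTC MoE routing, topk_with_k2_kernel (which computes the group scores) and group_idx_and_topk_idx_kernel (which selects the top-k experts), into a single CUDA kernel. It caches the sigmoid scores in shared memory and keeps group_scores in registers, eliminating the intermediate global buffer to improve MoE inference performance.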
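For readers unfamiliar with the routing that the fused kernel computes, the following is a minimal pure-Python sketch for a single token. It is not the PR's code. The function name `grouped_topk_reference` and the parameters `n_group`, `topk_group`, and `top_k` are illustrative assumptions; the review only states that the kernel computes sigmoid + bias, reduces each group with a warp-level top-2, selects groups, and then selects experts. Summing the two largest biased scores per group and returning the unbiased sigmoid scores as weights are assumptions based on the usual grouped routing scheme, and tie-breaking in the real kernel may differ.

```python
import math


def sigmoid(x):
    """Logistic function that turns a gating logit into a score."""
    return 1.0 / (1.0 + math.exp(-x))


def grouped_topk_reference(gating_output, e_score_correction_bias,
                           n_group, topk_group, top_k):
    """Route one token: pick top_k experts from the topk_group best groups."""
    num_experts = len(gating_output)
    assert num_experts % n_group == 0
    group_size = num_experts // n_group

    # Phase 0: sigmoid scores and bias-corrected scores for every expert.
    scores = [sigmoid(x) for x in gating_output]
    biased = [s + b for s, b in zip(scores, e_score_correction_bias)]

    # Phase 1 (assumption): a group's score is the sum of its two largest
    # bias-corrected scores.
    group_scores = []
    for g in range(n_group):
        chunk = sorted(biased[g * group_size:(g + 1) * group_size], reverse=True)
        group_scores.append(sum(chunk[:2]))

    # Keep the topk_group highest-scoring groups (ties go to the lower index).
    ranked_groups = sorted(range(n_group), key=lambda g: (-group_scores[g], g))
    kept = set(ranked_groups[:topk_group])

    # Select top_k experts among the kept groups, ranked by biased score.
    candidates = [e for e in range(num_experts) if e // group_size in kept]
    topk_idx = sorted(candidates, key=lambda e: (-biased[e], e))[:top_k]

    # Assumption: routing weights come from the unbiased sigmoid scores.
    topk_weights = [scores[e] for e in topk_idx]
    return topk_idx, topk_weights
```

With logits `[2.0, 1.0, -1.0, -2.0, 0.5, 0.4, 3.0, -3.0]`, zero bias, `n_group=4`, `topk_group=2`, and `top_k=2`, this sketch selects experts 0 and 1. Expert 6 has the largest individual score but is dropped because its group does not survive the group-level selection, which is the behaviour that the group-score step exists to produce.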

Scope of changes: custom_ops/gpu_ops/ (noaux_tc kernel), model_executor/layers/moe/moe.py (Python call site), model_executor/models/deepseek_v3.py (unrelated change mixed in)

Affected tags: [OP] [Models]

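📝 PR conventions check

Every section of the PR description (Motivation / Modifications / Usage or Command / Accuracy Tests) is an empty placeholder with no actual content, and none of the Checklist items are checked.

Suggested title (can be copied as-is):

[OP][Optimization] Fused NoauxTC Kernel (the title already complies: it carries two official tags and is semantically clear, so no change is needed)

Suggested PR description (can be copied as-is; it must reproduce the full structure of the checklist §D2 template):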

## Motivation

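Fuse the CUDA kernels that NoauxTC MoE routing previously ran in two steps into a single kernel:
- Previous approach: `topk_with_k2_kernel` first reduces each token's bias-corrected expert scores per group into group_scores and writes them to global memory; `group_idx_and_topk_idx_kernel` then reads group_scores back to perform the topk_group selection.
- New approach: inside `group_idx_and_topk_idx_kernel`, a merged Phase 0+1 loop computes sigmoid + bias and performs the warp-level group-score reduction, keeping the result in registers (group_scores_reg). The intermediate global group_scores buffer no longer takes part in the actual routing computation, which reduces global memory reads and writes and improves MoE routing efficiency.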

## Modifications

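- `custom_ops/gpu_ops/noauxtc_kernel.h`:
  - Refactor `group_idx_and_topk_idx_kernel`: add a fused loop consisting of Phase 0 (sigmoid + bias written to smem_scores_warp) and Phase 1 (warp-level top-2 reduction into the group_scores_reg register); Phases 2/3/4 now read from registers/smem instead of the global group_scores
  - Add a `WarpSelect` constructor that takes an explicit `smem_base` offset, so that the WarpSelect staging smem does not overwrite the smem_scores_warp region
  - Remove the standalone call path of `topk_with_k2_kernel` (its launch logic is deleted from `invokeNoAuxTc`)
- `custom_ops/gpu_ops/noaux_tc.cu`: update parameter names (`scores`/`scores_with_bias` → `gating_output`/`e_score_correction_bias`); recompute the dynamic smem size (adding the smem_scores_warp region); remove the topk_with_k2 kernel launch
- `custom_ops/gpu_ops/cpp_extensions.cc`: update the parameter names in the `NoauxTc` function signature accordingly
- `fastdeploy/model_executor/layers/moe/moe.py`: on the Python call side, move the sigmoid + bias computation into the kernel (`noaux_tc` branch only; the `noaux_tc_redundant` branch is not yet fused)
- `fastdeploy/model_executor/models/deepseek_v3.py`: add `paddle.enable_compat(scope={"deep_gemm": True})` and change `deep_gemm` to a direct import (unrelated to this kernel optimization; splitting it out is recommended)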

## Usage or Command

N/A (internal kernel optimization; the public Python API is unchanged)

## Accuracy Tests

N/A (to be added: we recommend providing an output-consistency comparison for DeepSeek-V3 on standard benchmarks before and after the fusion)

## Checklist

- [x] Add at least a tag in the PR title.
  - Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
  - You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [ ] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.

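Issues

| Level | File | Summary |
| --- | --- | --- |
| 📝 PR conventions | PR description | All sections are empty placeholders; the Checklist is unchecked |
| 🟡 Suggestion | `fastdeploy/model_executor/models/deepseek_v3.py:73` | Unrelated change mixed in: the new `paddle.enable_compat` call and the changed `import deep_gemm` path are unrelated to this PR's topic |
| ❓ Question | `fastdeploy/model_executor/layers/moe/moe.py:117` | The `noaux_tc_redundant` branch is not yet fused, so the two branches compute sigmoid in different places; please confirm the accuracy impact and the follow-up plan |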

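Overall assessment

The kernel fusion approach is clear and the shared-memory layout is well designed. Caching group_scores in registers eliminates the round trip through global memory, so the optimization is headed in the right direction. We recommend adding accuracy comparison results and splitting the unrelated change in deepseek_v3.py (the deep_gemm import path adjustment) into a separate PR.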

PaddlePaddle-bot · 2026-04-30T08:27:18Z

        radix_topk_ragged_transform,
    )

+    paddle.enable_compat(scope={"deep_gemm": True})


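🟡 Suggestion: The newly added paddle.enable_compat(scope={"deep_gemm": True}) is unrelated to this PR's topic (NoauxTC kernel fusion), and it has no hardware-version condition (the same call in fp8_utils.py is triggered only on SM100+). Consider splitting it into a separate PR and adding the necessary conditional gating.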
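One possible shape for the conditional gating is sketched below. The helper `should_enable_deep_gemm_compat` is hypothetical and does not exist in FastDeploy; it only encodes the "SM100+" condition mentioned above as "compute capability major version of at least 10". How the `(major, minor)` tuple is obtained is left to the caller and is an assumption here; the existing check in fp8_utils.py may be written differently.

```python
def should_enable_deep_gemm_compat(compute_capability):
    """Return True only for SM100 or newer (compute capability >= 10.0).

    `compute_capability` is a (major, minor) tuple supplied by the caller.
    This helper is hypothetical and only illustrates the gating condition.
    """
    major, _minor = compute_capability
    return major >= 10


# Hypothetical call site, left as a comment because it needs Paddle and a GPU.
# `capability` stands for whatever (major, minor) tuple the caller has queried.
#
#     if should_enable_deep_gemm_compat(capability):
#         paddle.enable_compat(scope={"deep_gemm": True})
```

Keeping the condition in a small pure function would make it straightforward to unit-test without a GPU.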

PaddlePaddle-bot · 2026-04-30T08:27:18Z

    else:
-        # noaux_tc_redundant returns 4 values: scores, topk_values, topk_idx,
-        # and tokens_per_expert_stats_list_out (inplace updated)
+        # noaux_tc_redundant still takes scores + scores_with_bias (not yet fused)


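❓ Question: The comment on the noaux_tc_redundant branch says it is "not yet fused", and the branch still computes sigmoid(gating_output) + e_score_correction_bias in Python. Please confirm:

1. Is a follow-up PR planned to apply the same kernel fusion to noaux_tc_redundant?

2. The two branches now behave differently (one computes sigmoid inside the kernel, the other on the Python side). Could this affect numerical accuracy (float16 precision differences)?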
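To make the float16 concern concrete, here is a minimal stdlib-only sketch that compares a path rounding every intermediate to half precision with a path that keeps full precision. It is purely illustrative: the PR does not state which dtypes the kernel or the Python path use for intermediates, and `to_fp16`, `biased_score_rounded_early`, and `biased_score_rounded_late` are hypothetical helpers rather than FastDeploy functions.

```python
import math
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def sigmoid(x):
    """Logistic function evaluated in double precision."""
    return 1.0 / (1.0 + math.exp(-x))


def biased_score_rounded_early(logit, bias):
    """Hypothetical path: the sigmoid and the sum are each rounded to float16."""
    s = to_fp16(sigmoid(to_fp16(logit)))
    return to_fp16(s + to_fp16(bias))


def biased_score_rounded_late(logit, bias):
    """Hypothetical path: full-precision arithmetic with no float16 rounding."""
    return sigmoid(logit) + bias


def max_abs_difference(logits, bias):
    """Largest gap between the two paths over a list of logits."""
    return max(
        abs(biased_score_rounded_early(x, bias) - biased_score_rounded_late(x, bias))
        for x in logits
    )
```

Calling `max_abs_difference` on a sweep of logits gives a rough feel for how large the per-score gap can get. Because expert selection depends on the ordering of biased scores, even small gaps could change which expert wins a near-tie, which is why an end-to-end consistency check between the two branches would be more informative than a per-element bound.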

ShaneGZhu added 3 commits April 28, 2026 13:04

draft

5b60093

fused_noauxtc

dc5467d

temp dev

30c892a

ShaneGZhu had a problem deploying to Metax_ci April 30, 2026 03:40 — with GitHub Actions Failure

clean

fe4d762

ShaneGZhu had a problem deploying to Metax_ci April 30, 2026 06:29 — with GitHub Actions Failure

ShaneGZhu changed the title ~~[Ops][Optimization]Fused Noauxtc Kernel~~ [OP][Optimization] Fused NoauxTC Kernel Apr 30, 2026

Rename the bias parameter to e_score_correction_bias.

bdac974

ShaneGZhu had a problem deploying to Metax_ci April 30, 2026 08:11 — with GitHub Actions Failure

PaddlePaddle-bot reviewed Apr 30, 2026

ShaneGZhu wants to merge 5 commits into PaddlePaddle:develop from ShaneGZhu:noaux_dev
