
[OP][Optimization] Fused NoauxTC Kernel#7679

Draft
ShaneGZhu wants to merge 5 commits into PaddlePaddle:develop from ShaneGZhu:noaux_dev

@ShaneGZhu
Contributor

Motivation

💡 If this PR is a Cherry Pick, the PR title needs to follow the format by adding the [Cherry-Pick] label at the very beginning and appending the original PR ID at the end. For example, [Cherry-Pick][CI] Add check trigger and logic(#5191)

Modifications

Usage or Command

Accuracy Tests

Checklist

  • Add at least a tag in the PR title.
    • Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code, run pre-commit before commit.
  • Add unit tests. Please write the reason in this PR if no unit tests.
  • Provide accuracy results.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

@paddle-bot

paddle-bot Bot commented Apr 30, 2026

Thanks for your contribution!

@PaddlePaddle-bot

PaddlePaddle-bot commented Apr 30, 2026

🤖 Paddle-CI-Agent | ci_status_monitor | 2026-04-30 16:37:53

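This CI report was generated from the following code (refreshed every 30 minutes):


1 Task overview

All required tasks have passed (no required tasks are configured). 1 optional task failed, which does not block merging.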

| Total runs (reruns) | Total tasks | ✅ Passed | ❌ Failed | ⏳ Running | ⏸️ Waiting | Skipped |
|---|---|---|---|---|---|---|
| 2 (0) | 2 | 1 | 1 | 0 | 0 | 0 |

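2 Task status summary

2.1 Required tasks: 0/0 passed

Required tasks block merging; failures should be handled first.

This PR has no required tasks configured.

2.2 Optional tasks: 1/2 passed

Optional tasks do not block merging; failures are for reference only.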

| Status | Task | Duration | Log | Rerun |
|---|---|---|---|---|
| | Trigger Jenkins for PR | 50s | Job | - |
| | The remaining 1 optional task passed | - | - | - |

3 Failure details (required tasks only)

No required tasks failed.

PaddlePaddle-bot

This comment was marked as outdated.


@ShaneGZhu ShaneGZhu changed the title from "[Ops][Optimization]Fused Noauxtc Kernel" to "[OP][Optimization] Fused NoauxTC Kernel" on Apr 30, 2026

@PaddlePaddle-bot PaddlePaddle-bot left a comment


🤖 Paddle-CI-Agent | pr_review | 2026-04-30 16:19:59

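📋 Review summary

PR overview: In NoauxTC MoE routing, the PR fuses the previously separate topk_with_k2_kernel (which computes group scores) and group_idx_and_topk_idx_kernel (which selects the top-k experts) into a single CUDA kernel. It caches the sigmoid scores in shared memory and keeps group_scores in registers, which eliminates the intermediate global buffer and improves MoE inference performance.

Scope of changes: custom_ops/gpu_ops/ (noaux_tc kernel), model_executor/layers/moe/moe.py (Python call site), model_executor/models/deepseek_v3.py (unrelated change mixed in)

Affected tags: [OP] [Models]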
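
As a reading aid, the routing described in the PR overview can be written as a plain Python reference for a single token. This is a minimal sketch under stated assumptions, not the kernel's implementation: the function name `noaux_tc_reference` is hypothetical, the group score is assumed to be the sum of the two largest biased scores in each group, the returned weights are assumed to be the unbiased sigmoid scores, and no renormalization or scaling is applied.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def noaux_tc_reference(gating_output, e_score_correction_bias,
                       n_group, topk_group, top_k):
    """Route one token; returns (expert_ids, weights). Hypothetical helper."""
    n_experts = len(gating_output)
    assert n_experts % n_group == 0
    group_size = n_experts // n_group

    # Sigmoid scores, and the bias-corrected scores used for selection.
    scores = [sigmoid(x) for x in gating_output]
    biased = [s + b for s, b in zip(scores, e_score_correction_bias)]

    # Group score: ASSUMED to be the sum of the top-2 biased scores per group.
    group_scores = []
    for g in range(n_group):
        chunk = sorted(biased[g * group_size:(g + 1) * group_size], reverse=True)
        group_scores.append(sum(chunk[:2]))

    # Keep the topk_group best groups (ties broken by lower group index).
    kept = set(sorted(range(n_group),
                      key=lambda g: (-group_scores[g], g))[:topk_group])

    # Pick the top_k experts among the kept groups by biased score.
    candidates = [e for e in range(n_experts) if e // group_size in kept]
    chosen = sorted(candidates, key=lambda e: (-biased[e], e))[:top_k]

    # ASSUMPTION: weights come from the unbiased sigmoid scores.
    weights = [scores[e] for e in chosen]
    return chosen, weights
```

A reference like this could serve as the baseline when comparing the fused and unfused paths on identical inputs.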


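📝 PR convention check

Every section of the PR description (Motivation / Modifications / Usage or Command / Accuracy Tests) is an empty placeholder with no actual content, and none of the Checklist items are checked.

Suggested title (can be copied as-is):

  • [OP][Optimization] Fused NoauxTC Kernel (the title already complies: it carries two official tags and is semantically clear, so no change is needed)

Suggested PR description (can be copied as-is; it must reproduce the full structure of the checklist §D2 template):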

## Motivation

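Fuse the two CUDA kernels that NoauxTC MoE routing previously ran as separate steps into a single kernel:
- Original approach: `topk_with_k2_kernel` first reduces each token's bias-corrected expert scores per group into group_scores and writes them to global memory; `group_idx_and_topk_idx_kernel` then reads group_scores back to select the topk_group groups.
- New approach: inside `group_idx_and_topk_idx_kernel`, a merged Phase 0+1 loop computes sigmoid + bias and performs the warp-level group score reduction, keeping the result in registers (group_scores_reg). The intermediate global group_scores buffer no longer takes part in the routing computation, which reduces global memory reads and writes and improves MoE routing efficiency.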

## Modifications

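- `custom_ops/gpu_ops/noauxtc_kernel.h`:
  - Refactor `group_idx_and_topk_idx_kernel`: add a fused loop covering Phase 0 (sigmoid + bias written to smem_scores_warp) and Phase 1 (warp-level top-2 reduction into the group_scores_reg registers); Phases 2/3/4 now read from registers/shared memory instead of the global group_scores
  - Add a `WarpSelect` constructor that takes an explicit `smem_base` offset, so that the WarpSelect staging shared memory does not overwrite the smem_scores_warp region
  - Remove the standalone call path of `topk_with_k2_kernel` (its launch logic is deleted from `invokeNoAuxTc`)
- `custom_ops/gpu_ops/noaux_tc.cu`: rename parameters (`scores`/`scores_with_bias` → `gating_output`/`e_score_correction_bias`); recompute the dynamic shared memory size (adding the smem_scores_warp region); remove the topk_with_k2 kernel launch
- `custom_ops/gpu_ops/cpp_extensions.cc`: update the parameter names in the `NoauxTc` function signature to match
- `fastdeploy/model_executor/layers/moe/moe.py`: on the Python call site, move the sigmoid + bias computation into the kernel (`noaux_tc` branch only; the `noaux_tc_redundant` branch is not yet fused)
- `fastdeploy/model_executor/models/deepseek_v3.py`: add `paddle.enable_compat(scope={"deep_gemm": True})` and switch `deep_gemm` to a direct import (unrelated to this kernel optimization; splitting it out is recommended)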

## Usage or Command

N/A (internal kernel optimization; the public Python API is unchanged)

## Accuracy Tests

N/A (to be added: we suggest providing an output-consistency comparison for DeepSeek-V3 on a standard benchmark before and after the fusion)

## Checklist

- [x] Add at least a tag in the PR title.
  - Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
  - You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [ ] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.

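Issues

| Level | File | Summary |
|---|---|---|
| 📝 PR convention | PR description | All sections are empty placeholders; Checklist is unchecked |
| 🟡 Suggestion | fastdeploy/model_executor/models/deepseek_v3.py:73 | Unrelated change mixed in: the new paddle.enable_compat call and the changed import path for deep_gemm are unrelated to this PR's topic |
| ❓ Question | fastdeploy/model_executor/layers/moe/moe.py:117 | The noaux_tc_redundant branch is not yet fused, so the two branches compute sigmoid in different places; please confirm the accuracy impact and the follow-up plan |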

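Overall assessment

The kernel fusion approach is clear and the shared memory layout is well designed. Caching group_scores in registers removes the round trip through global memory, so the optimization goes in the right direction. We recommend adding accuracy comparison results and moving the unrelated change in deepseek_v3.py (the deep_gemm import path adjustment) into a separate PR.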
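
To make the warp-level top-2 reduction mentioned in the review concrete, here is a small Python simulation of a butterfly reduction across lanes. It is only a sketch under assumptions: the helper names are hypothetical, the XOR-partner exchange pattern (in the style of `__shfl_xor_sync`) is assumed rather than taken from the PR, and the group score is assumed to be the sum of the two largest values.

```python
NEG_INF = float("-inf")


def merge_top2(a, b):
    """Merge two (largest, second) pairs into the top-2 of their union."""
    hi = max(a[0], b[0])
    lo = max(min(a[0], b[0]), a[1], b[1])
    return (hi, lo)


def warp_top2_sum(lane_values):
    """Simulate a butterfly top-2 reduction over a power-of-two lane count."""
    n = len(lane_values)
    assert n >= 2 and n & (n - 1) == 0, "lane count must be a power of two"

    # Each lane starts with its own value and an empty second slot.
    state = [(v, NEG_INF) for v in lane_values]

    offset = n // 2
    while offset:
        # Every lane merges with the lane whose index differs in one bit.
        # The two lanes cover disjoint value sets, so nothing is counted twice.
        state = [merge_top2(state[lane], state[lane ^ offset])
                 for lane in range(n)]
        offset //= 2

    # After log2(n) steps every lane holds the same result, so it can stay
    # in a register without a write to global memory.
    hi, lo = state[0]
    return hi + lo
```

For example, `warp_top2_sum([float((i * 7) % 32) for i in range(32)])` reduces a permutation of 0..31 and yields 31 + 30.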

radix_topk_ragged_transform,
)

paddle.enable_compat(scope={"deep_gemm": True})

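🟡 Suggestion: The newly added paddle.enable_compat(scope={"deep_gemm": True}) is unrelated to this PR's topic (NoauxTC kernel fusion), and it is not guarded by any hardware-version check (the same call in fp8_utils.py is only triggered on SM100+). We recommend splitting it into a separate PR and adding the necessary conditional gating.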
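
One possible shape for such a gate is sketched below. The helper `needs_deep_gemm_compat` is hypothetical, and the SM100 threshold is taken only from the review comment above; the actual condition used in fp8_utils.py is not shown on this page and may differ.

```python
def needs_deep_gemm_compat(capability, min_sm=100):
    """Return True when a (major, minor) compute capability is >= min_sm.

    Hypothetical helper: (10, 0) corresponds to SM100, (9, 0) to SM90.
    """
    major, minor = capability
    return major * 10 + minor >= min_sm


# Hypothetical call site (not executed here; needs a CUDA build of Paddle):
#   if needs_deep_gemm_compat(paddle.device.cuda.get_device_capability()):
#       paddle.enable_compat(scope={"deep_gemm": True})
```

Keeping the predicate as a pure function makes it testable without a GPU.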

else:
# noaux_tc_redundant returns 4 values: scores, topk_values, topk_idx,
# and tokens_per_expert_stats_list_out (inplace updated)
# noaux_tc_redundant still takes scores + scores_with_bias (not yet fused)

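❓ Question: The comment on the noaux_tc_redundant branch says it is "not yet fused", and the branch still computes sigmoid(gating_output) + e_score_correction_bias in Python. Please confirm:

  1. Is a follow-up PR planned to apply the same kernel fusion to noaux_tc_redundant?
  2. The two branches behave differently (one computes sigmoid inside the kernel, the other on the Python side). Could this affect numerical accuracy (float16 precision differences)?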
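
The precision concern in question 2 can be illustrated with a small stdlib-only experiment. This is a sketch, not a statement about what either branch actually does: it assumes one path keeps the biased scores in wide precision while the other rounds them through float16, and all helper names are hypothetical. Near-tied scores can then select different experts.

```python
import math
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def top_k(values, k):
    # Highest value first; ties broken by lower index.
    return sorted(range(len(values)), key=lambda i: (-values[i], i))[:k]


def compare_paths(gating_output, bias, k):
    """Return (wide_precision_topk, fp16_rounded_topk) for one token."""
    wide = [sigmoid(x) + b for x, b in zip(gating_output, bias)]
    narrow = [to_fp16(to_fp16(sigmoid(x)) + b)
              for x, b in zip(gating_output, bias)]
    return top_k(wide, k), top_k(narrow, k)
```

With logits `[0.0, 0.0001]` and zero bias, the wide path ranks expert 1 first, while after float16 rounding both scores collapse to 0.5 and the tie-break picks expert 0. A comparison of this kind on real gating outputs would show how often the two branches disagree.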
