issue/1153: add fused FFN operator and hardware-task mutual awareness analyzer (#1154)
Merged
Conversation
voltjia approved these changes on May 9, 2026.
Ziminli approved these changes on May 9, 2026.
Collaborator comment: "Thank you, Prof. Zhou." (感谢周老师)
## Overview

This PR, linked to #1153, adds two mutually independent modules. Both are gated by a build option or an environment variable and are disabled by default, so they do not touch the source, signatures, or default build path of existing operators:

- **Fused FFN operator** (`infiniopFusedFFN`): fuses the main FFN path of the LLM decode stage into a single operator.
- **Hardware-Task Mutual Awareness Analyzer** (`--mutual-awareness=y`): adds runtime context collection and goal-driven kernel selection along the operator dispatch chain.

## Module Details
### 1. Hardware-Task Mutual Awareness Analyzer
Enabled via `--mutual-awareness=y`; when disabled, it produces no compiled output and has zero ABI impact.

- `analyzer/`: five core submodules, `OpTraceRing` / `PhaseDetector` / `ResourceSensor` / `IntentGenerator` / `MutualAwarenessAnalyzer`
- `OpDispatcher` gains `OptimizationGoal` overloads (`registerDevice(device, fn, goal)` / `lookup(device, goal)`), fully backward compatible; only `Attention::execute` is wired into the goal-aware path, and all other operators are unchanged
- `infinirt` gains a unified resource-snapshot API, `infinirtGetMemInfo` / `infinirtGetDeviceResourceSnapshot`, layered by platform: NVIDIA / Iluvatar go through NVML / IXML (via dlopen), MetaX uses `mcMemGetInfo`, CPU reads `/proc/meminfo`, and other backends return NOOP / fallback
- `infinicore.analyzer` is exposed automatically when the feature is enabled
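The goal-aware dispatch described above can be sketched in Python. The names `OpDispatcher`, `registerDevice`, `lookup`, and `OptimizationGoal` follow the PR; the specific goal values and the fall-back-to-default behavior are assumptions for illustration, not the actual C++ implementation:

```python
# Hypothetical sketch of goal-aware dispatch: the goal values and the
# fallback logic are assumptions, not the library's real implementation.
from enum import Enum, auto

class OptimizationGoal(Enum):
    DEFAULT = auto()       # assumed goal values, for illustration only
    LOW_LATENCY = auto()
    LOW_MEMORY = auto()

class OpDispatcher:
    def __init__(self):
        # keyed by (device, goal), so plain device-only registration stays valid
        self._table = {}

    def register_device(self, device, fn, goal=OptimizationGoal.DEFAULT):
        self._table[(device, goal)] = fn

    def lookup(self, device, goal=OptimizationGoal.DEFAULT):
        # try the goal-aware entry first, then fall back to the device's
        # default kernel, so operators that never pass a goal are unaffected
        fn = self._table.get((device, goal))
        if fn is None:
            fn = self._table.get((device, OptimizationGoal.DEFAULT))
        return fn

d = OpDispatcher()
d.register_device("cuda", lambda: "generic-attention")
d.register_device("cuda", lambda: "low-latency-attention",
                  OptimizationGoal.LOW_LATENCY)
assert d.lookup("cuda", OptimizationGoal.LOW_LATENCY)() == "low-latency-attention"
assert d.lookup("cuda", OptimizationGoal.LOW_MEMORY)() == "generic-attention"
```

The fallback is what makes the overloads backward compatible: a lookup with an unregistered goal still resolves to the same kernel the old device-only API would have returned.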
### 2. Fused FFN Operator

Public C API in `include/infiniop/ops/fused_ffn.h`, computing `out = Down( SwiGLU( GateUp( RMSNorm(in) ) ) ) + residual`.

- Supports two layouts for `gate_up_weight` (`[2*di, d]` / `[d, 2*di]`) and two layouts for `down_weight`
- The default path composes `gemm` / `rms_norm` / `swiglu` / `add` sub-operator descriptors; an optional deep-fused path (`INFINIOP_FUSED_FFN_DEEP=1` lets the scheduler decide, `=2` forces it) merges GateUp + SwiGLU into a single kernel, removing an HBM round trip, which helps small `ntok`
- When `out == residual`, the residual is folded into the Down GEMM via `beta=1`
- MetaX calls GEMM through `mcblas` and reuses the RMSNorm / SwiGLU / ResidualAdd kernels from the NVIDIA `kernel.cuh`
- `test/infiniop/fused_ffn.py`: 13 shapes × 2 dtypes, covering three typical architectures (LLaMA-7B / 13B / Qwen) × {with / without residual}
- Appends the `fused_ffn` ctypes registration at the end of `test/infiniop/libinfiniop/op_register.py`
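As a semantic reference for the fused formula above, here is a minimal pure-Python sketch. It assumes the `[2*di, d]` layout for `gate_up_weight` and a `[d, di]` layout for `down_weight`; all helper names are hypothetical and this is not the library's C API:

```python
# Reference semantics of out = Down(SwiGLU(GateUp(RMSNorm(x)))) + residual.
# Assumes gate_up_w is [2*di, d] and down_w is [d, di]; single-token version.
import math

def rms_norm(x, w, eps=1e-6):
    # x, w: length-d vectors
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms * wi for v, wi in zip(x, w)]

def matvec(W, x):
    # W: rows x cols nested list, x: length-cols vector
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in W]

def silu(v):
    return v / (1.0 + math.exp(-v))

def fused_ffn_ref(x, norm_w, gate_up_w, down_w, residual=None):
    h = rms_norm(x, norm_w)                 # RMSNorm
    gu = matvec(gate_up_w, h)               # GateUp GEMM -> length 2*di
    di = len(gu) // 2
    act = [silu(g) * u for g, u in zip(gu[:di], gu[di:])]   # SwiGLU
    out = matvec(down_w, act)               # Down GEMM -> length d
    if residual is not None:
        # models the beta=1 residual fold when out aliases residual
        out = [o + r for o, r in zip(out, residual)]
    return out

# tiny demo: d=2, di=2, residual aliasing the input (the out == residual case)
x = [1.0, -1.0]
y = fused_ffn_ref(x, [1.0, 1.0],
                  [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]],
                  [[0.5, 0.5], [0.5, -0.5]],
                  residual=x)
assert len(y) == len(x)
```

The deep-fused path described above computes the GateUp GEMM and the SwiGLU lines in one kernel instead of materializing `gu` to HBM in between; the math is identical.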
## Testing and Verification

### Build

Verified under three configurations:

- `--metax-gpu=y --use-mc=y --mutual-awareness=y`
- `--iluvatar-gpu=y --cpu=y --mutual-awareness=y`
- `--cpu=y --mutual-awareness=y`

### Correctness
**Fused FFN**

- MetaX (`--metax`): 26/26 PASS
- CPU (`--cpu`): 24/24 PASS

**Mutual Awareness**

- With `--mutual-awareness=n`, the analyzer symbols are not linked

### Regression
Sanity regressions of the five related operators `rms_norm` / `swiglu` / `add` / `gemm` / `attention` were run on the Iluvatar / CPU and MetaX (metax) platforms; all PASS.

## Impact Scope
- Under the default build (`--mutual-awareness=n`): analyzer files are not compiled, dispatch has no goal-aware branch, and the new `infinirt` APIs are NOOP on disabled platforms; zero intrusion
- `xmake.lua` only adds a `mutual-awareness` option and the `ENABLE_MUTUAL_AWARENESS` macro; `fused_ffn` source files are picked up by the existing glob, with no new build rules

## Checklist
- Branch name starts with `issue/1153`