feat: auto-pad FP8 GEMM dimensions for unaligned sequence packing by NoonePauseferg · Pull Request #2911 · NVIDIA/TransformerEngine

NoonePauseferg · 2026-04-21T15:40:05Z

Problem

cuBLAS FP8 GEMM requires lda/ldb % 16 == 0 and m % 8 == 0. With sequence packing (used in RL training frameworks like VERL, OpenRLHF), the total token count per micro-batch is dynamic and almost never aligned to 16:

Micro-batch: 8 sequences, total = 11486 tokens
11486 / TP_size(2) = 5743 tokens per rank
5743 % 16 = 15 ≠ 0 → cuBLAS FP8 GEMM crashes with:
  Assertion failed: ret.lda % 16 == 0

This affects:

RL training (GRPO/PPO with sequence packing and dynamic batch sizes)
MoE models (GroupedLinear per-expert token counts are dynamic after AllToAll dispatch)
Any workload with variable-length packed sequences + FP8

Currently users must manually pad tensors before calling TE modules. External padding corrupts training — padding tokens distort FP8 scale factors, causing 40–500× gradient explosion (documented in #2892).

Solution

Auto-pad m and k dimensions to multiples of 16 inside cublas_gemm() using temporary buffers. No external padding needed, no training corruption.

How it works

m-padding (output dimension, e.g. fprop/dgrad):

Allocate padded output buffer via cudaMallocAsync (stream-ordered, no CPU sync)
Run cuBLAS GEMM into padded buffer (ldd = m_padded)
Copy m_real rows per column back to original output via cudaMemcpy2DAsync
cudaFreeAsync — no sync, no pipeline bubbles

k-padding (contraction dimension, e.g. wgrad where k = num_tokens):

Allocate zero-initialized padded copies of A and/or B via cudaMallocAsync
Copy original data with cudaMemcpy2DAsync (k_real rows per column, rest stays zero)
Run GEMM — zero-padded rows contribute 0 to dot product (mathematically exact)
cudaFreeAsync

Changes

transformer_engine/common/gemm/cublaslt_gemm.cu: auto-pad logic in cublas_gemm(), removed lda%16 / m%8 assertions in CanonicalizeGemmInput()
transformer_engine/pytorch/utils.py: relaxed assert_dim_for_fp8_exec() — C++ handles alignment now

Results

Tested on H100 (SM90), TE 2.12, PyTorch 2.9.1, CUDA 12.9.
Full RL training pipeline: DeepSeek 10B MoE, 4 nodes × 8 H100, TP=2 PP=2 EP=2, sequence packing.

Training quality (FP8 E2E vs BF16 baseline)

Metric	BF16 baseline	FP8 E2E (this PR)	FP8 with external padding
grad_norm	0.29	0.27–0.30	3.3–500
training_log_ppl	1.28	1.34	6.87
log_ppl_diff	0.0003	0.018–0.035	5.30

Per-layer gradient accuracy (FP8 with unaligned M=5743 vs BF16)

Layer type	Dimensions	FP8/BF16 ratio
kv_down_proj (MLA)	1536→512	1.0000×
kv_up_proj (MLA)	512→1536	1.0000×
q_proj	1536→1536	1.0000×
proj (output)	1536→1536	1.0000×
shared_expert fc1	1536→2560	1.0000×
shared_expert fc2	1280→1536	1.0000×
MoE fc1 (32 experts)	1536→2560	1.0000×
MoE fc2 (32 experts)	1280→1536	1.0000×

Memory & performance

Worst-case temp buffer: 15 rows × 4096 × 2B = 120 KB per GEMM
cudaMallocAsync/cudaFreeAsync reuses stream-ordered pool — no fragmentation
Aligned vs unaligned perf: no regression (1.2ms vs 1.1ms per iter)

Addressing previous review comments

The initial version of this PR had issues flagged by the review bot. All have been addressed:

Issue	Status	Fix
P1: Output buffer corruption (writing to undersized buffer with padded ldd)	Fixed	Separate `_pad_D` temp buffer via `cudaMallocAsync`; `cudaMemcpy2DAsync` copies only `m_real` rows back
P1: Block-scaled FP8 scale coupling (padding corrupts per-block scales)	Fixed	Padding uses zero-filled copies; zeros don't affect dot product. Scale factors computed on original data before GEMM
P2: Unused variables (`m_orig`, `k_orig`, `did_pad`)	Fixed	Removed. Now uses `m_real`/`k_real` throughout
P2: Read-beyond-buffer safety (relying on allocator alignment)	Fixed	No out-of-bounds reads. All padded data is in explicitly allocated temp buffers

Related issues

[BUG] SIGSEGV in fused_attn_fwd when cu_seqlens_q != cu_seqlens_q_padded with FP8 blockwise #2892 — SIGSEGV with unpadded cu_seqlens (FP8 + sequence packing)
Issue Using Qwen2-VL HuggingFace Model with Transformer Engine FP8 Training #1889 — cuBLAS assertion with unaligned FP8 dims

greptile-apps · 2026-04-21T15:47:18Z

Greptile Summary

This PR auto-pads FP8 GEMM m and k dimensions to multiples of 16 inside cublas_gemm() using scratch buffers, removing the caller-side alignment requirement that caused crashes with dynamic sequence-packed token counts. Two new P1 defects remain in the C++ path:

GELU auxiliary buffer not padded: when m is padded and a GELU epilogue is active, ld_gelumat = ldd = m_padded is set but pre_gelu_out was allocated for m_real rows per column \u2014 every column after the first is written out of bounds.
Unchecked CUDA errors: cudaMallocAsync, cudaMemsetAsync, cudaMemcpy2DAsync, and cudaFreeAsync return values are not checked with NVTE_CHECK_CUDA, leaving OOM and device-unsupported failures silent.

Confidence Score: 3/5

Not safe to merge — silent memory corruption in the GELU-fused FP8 path and unchecked CUDA allocation errors.

Two P1 defects remain: the GELU auxiliary output buffer is not padded alongside _pad_D (out-of-bounds writes when gelu fusion is active and m is unaligned), and none of the new cudaMallocAsync/cudaMemsetAsync/cudaMemcpy2DAsync/cudaFreeAsync calls propagate errors via NVTE_CHECK_CUDA. The core non-gelu path the author tested appears correct, but the gelu path common in MLP layers can silently corrupt memory.

transformer_engine/common/gemm/cublaslt_gemm.cu — specifically the GELU epilogue setup (~line 504) and all new cudaMallocAsync/cudaMemcpy2DAsync call sites.

Important Files Changed

Filename	Overview
transformer_engine/common/gemm/cublaslt_gemm.cu	Adds auto-padding of m/k to multiples of 16 for FP8 GEMMs with scratch buffers. Two P1 issues remain: GELU auxiliary buffer not padded (memory corruption when gelu fusion is active with unaligned m), and CUDA async errors unchecked.
transformer_engine/pytorch/utils.py	Replaces assert_dim_for_fp8_exec body with pass, removing dimension-alignment validation for all callers.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[cublas_gemm called] --> B{is_fp8_a or is_fp8_b?}
    B -- No --> C[m=m_real, k=k_real, ldd=m_real]
    B -- Yes --> D[m = ceil16 m_real / k = ceil16 k_real / ldd = m]
    D --> E{m != m_real?}
    E -- Yes --> F[cudaMallocAsync _pad_D / cudaMemsetAsync zeros]
    E -- No --> G[_pad_D = nullptr]
    F --> H[CanonicalizeGemmInput with padded m,k]
    G --> H
    H --> I{k != k_real?}
    I -- Yes --> J[Allocate _pad_A / _pad_B / cudaMemcpy2DAsync real data]
    I -- No --> K[Use original A/B pointers]
    J --> L[cublasLtMatmul]
    K --> L
    L --> M{_pad_D?}
    M -- Yes --> N[cudaMemcpy2DAsync m_real rows back / cudaFreeAsync _pad_D]
    M -- No --> O[Done]
    N --> O
    L -.->|gelu active + m padded| P[pre_gelu_out written with ldd=m but buffer=m_real rows -> memory corruption]
    style P fill:#f66,color:#fff

Comments Outside Diff (1)

transformer_engine/common/gemm/cublaslt_gemm.cu, line 504 (link)

GELU auxiliary output buffer not padded when m is padded

When m != m_real and a GELU epilogue is active, ld_gelumat = ldd = m (padded) is set at line 504. cuBLAS then writes column j of the pre-GELU auxiliary output to byte offset j × m × sizeof(T) in pre_gelu_out. But pre_gelu_out = outputPreGelu->data.dptr was allocated for m_real rows per column, so every column after the first is written beyond the buffer. This is silent memory corruption — structurally the same problem as the original _pad_D bug (addressed by the new _pad_D allocation), but not fixed for the GELU auxiliary path.

A compatible fix is to allocate a padded GELU auxiliary buffer the same way _pad_D is allocated, use it during the GEMM, and cudaMemcpy2DAsync the m_real valid rows back into outputPreGelu->data.dptr afterward, then cudaFreeAsync the padded buffer.

_{Reviews (2): Last reviewed commit: "[pre-commit.ci] auto fixes from pre-comm..." | Re-trigger Greptile}

greptile-apps · 2026-04-21T15:47:22Z

+      m += m_pad;
+      k += k_pad;
+      did_pad = true;
+    }
+  }
+  const int ldd = m;


Output-buffer corruption when m is padded

When m_pad > 0, ldd is set to m_padded and Ddesc tells cuBLAS the output matrix has m_padded rows. But the output buffer outputD->data.dptr was allocated for m_orig × n elements, not m_padded × n. Because cuBLAS writes column j at byte offset j × ldd × sizeof(T) = j × m_padded × sizeof(T), every column after the first is written to a position shifted by j × m_pad elements relative to what the caller expects. The caller's buffer for column j starts at j × m_orig, so column 1 onwards is silently misaligned and contains corrupt data.

This manifests in practice with MoE AllToAll (the third use case listed in the PR): after dispatch, each expert receives a variable token count. When transa == CUBLAS_OP_T (e.g., weight-gradient GEMM), m = A0 = token_count_per_expert, which can be unaligned. For n > 1 output columns, the gradient accumulation tensor will have scrambled contents.

A safe approach for the m-dimension is to keep ldd = m_orig and only report m_padded to cuBLAS as the logical number of rows — or, as standard practice suggests, allocate a temporary padded output buffer and copy only the valid m_orig rows back.

greptile-apps · 2026-04-21T15:47:23Z

+  const bool is_fp8_a = is_fp8_dtype(inputA->data.dtype) ||
+                         (inputA->has_columnwise_data() && is_fp8_dtype(inputA->columnwise_data.dtype));
+  const bool is_fp8_b = is_fp8_dtype(inputB->data.dtype) ||
+                         (inputB->has_columnwise_data() && is_fp8_dtype(inputB->columnwise_data.dtype));
+  const int m_orig = m;
+  const int k_orig = k;
+  bool did_pad = false;
+  if (is_fp8_a || is_fp8_b) {


Auto-padding fires for MXFP8 and block-scaled FP8, breaking scale-factor coupling

is_fp8_dtype() returns true for any FP8 data type regardless of the scaling mode — it will trigger for NVTE_MXFP8_SCALING, NVTE_BLOCK_SCALING_1D, and NVTE_BLOCK_SCALING_2D tensors as well as plain tensor-scaled FP8. For MXFP8 and block-scaled modes, scale factors are tied to a fixed-size block (32 or 128 elements) along the contracted dimension k. Padding k by up to 15 elements causes the last block's scale factor to be applied to phantom (out-of-bounds) data, producing incorrect accumulation values for that block.

Consider scoping the auto-padding to tensor-scaled FP8 only:

const bool is_tensor_fp8_a = is_tensor_scaling(inputA->scaling_mode) && (is_fp8_dtype(inputA->data.dtype) || (inputA->has_columnwise_data() && is_fp8_dtype(inputA->columnwise_data.dtype))); const bool is_tensor_fp8_b = is_tensor_scaling(inputB->scaling_mode) && (is_fp8_dtype(inputB->data.dtype) || (inputB->has_columnwise_data() && is_fp8_dtype(inputB->columnwise_data.dtype))); if (is_tensor_fp8_a || is_tensor_fp8_b) { /* pad */ }

greptile-apps · 2026-04-21T15:47:24Z

+  const int m_orig = m;
+  const int k_orig = k;
+  bool did_pad = false;


Unused variables — dead code

m_orig, k_orig, and did_pad are assigned but never referenced again anywhere in cublas_gemm. They appear to be leftovers from a planned output-truncation step that was not implemented. They can be removed.

Suggested change

const int m_orig = m;

const int k_orig = k;

bool did_pad = false;

greptile-apps · 2026-04-21T15:47:25Z

+      // Pad m and k to multiples of 16.
+      // For the GEMM, we pass padded m/k. Input data pointers still point to
+      // the original (unpadded) buffers. cuBLAS will read beyond the valid data
+      // for the padded rows — this is OK as long as:
+      // 1. The padded area is within allocated memory (tensor allocations are
+      //    typically page-aligned, so a few extra rows are safe)
+      // 2. The padded rows' values don't matter (they only affect padded output rows
+      //    which we discard)


Read-beyond-buffer safety relies on allocator implementation details

The comment claims reading beyond the valid input buffer is safe "due to page-aligned GPU allocations." This is generally true for CUDA's caching allocator (allocations are rounded up to 512-byte or larger boundaries), but it is not guaranteed by the CUDA API. A future allocator, a custom allocator (cudaMallocAsync pools, RAPIDS RMM, etc.), or a tensor obtained from a memory-mapped source might not have this guarantee. A brief note acknowledging the reliance on the caching allocator, or a check like NVTE_CHECK(k_pad < 16), would make the intent explicit and defensible.

ptrendx · 2026-04-21T20:16:35Z

Hi @NoonePauseferg - the PR as-is contains many unrelated commits (I think you made it on top of the 2.12 branch), which makes it very difficult to see the actual changes. Could you rebase the PR on top of the current main branch? Thank you!

NoonePauseferg · 2026-04-22T10:10:09Z

Hi @NoonePauseferg - the PR as-is contains many unrelated commits (I think you made it on top of the 2.12 branch), which makes it very difficult to see the actual changes. Could you rebase the PR on top of the current main branch? Thank you!

yeah, working on it - soon gonna fix pr

cuBLAS FP8 GEMM requires lda/ldb % 16 == 0 and m % 8 == 0. RL training frameworks (VERL, OpenRLHF) use sequence packing where total token counts are dynamic and rarely aligned. Manual pre-padding corrupts training by distorting FP8 scale factors (proven: BF16 with padding tokens = grad_norm 1064x explosion). Changes in cublas_gemm(): - Detect FP8 inputs, round up m and k to multiples of 16 - Allocate padded temp buffers via cudaMallocAsync (stream-ordered) - For k-padding: zero-pad A/B columns beyond k_real with cudaMemcpy2D - For m-padding: GEMM into padded output, copy m_real rows back - cudaFreeAsync for cleanup (no CPU-GPU sync, no pipeline bubbles) Changes in utils.py: - Relax assert_dim_for_fp8_exec — C++ now handles alignment internally Tested on H100 (SM90), TE 2.12, PyTorch 2.9.1, CUDA 12.9: - DeepSeek 10B MoE, 4 nodes x 8 GPUs, TP=2 PP=2 EP=2 - FP8/BF16 grad ratio: 0.99-1.00 across all layer types - grad_norm: 0.27-0.30 (BF16 baseline: 0.29) - Memory overhead: <120KB per GEMM (worst case +15 pad rows) - No performance regression (cudaMallocAsync reuses pool) Related: NVIDIA#2892 NVIDIA#1889

for more information, see https://pre-commit.ci

greptile-apps Bot reviewed Apr 21, 2026

View reviewed changes

ptrendx added community-contribution PRs from external contributor outside the core maintainers, representing community-driven work. labels Apr 21, 2026

NoonePauseferg force-pushed the fix/fp8-gemm-auto-alignment-padding branch from 756b153 to b65b244 Compare April 22, 2026 10:19

[pre-commit.ci] auto fixes from pre-commit.com hooks

b6f0ea4

for more information, see https://pre-commit.ci

ptrendx self-assigned this Apr 23, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: auto-pad FP8 GEMM dimensions for unaligned sequence packing#2911

feat: auto-pad FP8 GEMM dimensions for unaligned sequence packing#2911
NoonePauseferg wants to merge 2 commits intoNVIDIA:mainfrom
NoonePauseferg:fix/fp8-gemm-auto-alignment-padding

NoonePauseferg commented Apr 21, 2026 •

edited

Loading

Uh oh!

greptile-apps Bot commented Apr 21, 2026 •

edited

Loading

Comments Outside Diff (1)

Uh oh!

greptile-apps Bot Apr 21, 2026

Uh oh!

greptile-apps Bot Apr 21, 2026

Uh oh!

greptile-apps Bot Apr 21, 2026

Uh oh!

greptile-apps Bot Apr 21, 2026

Uh oh!

ptrendx commented Apr 21, 2026

Uh oh!

NoonePauseferg commented Apr 22, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

	const int m_orig = m;
	const int k_orig = k;
	bool did_pad = false;

Conversation

NoonePauseferg commented Apr 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Problem

Solution

How it works

Changes

Results

Training quality (FP8 E2E vs BF16 baseline)

Per-layer gradient accuracy (FP8 with unaligned M=5743 vs BF16)

Memory & performance

Addressing previous review comments

Related issues

Uh oh!

greptile-apps Bot commented Apr 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 3/5

Important Files Changed

Flowchart

Comments Outside Diff (1)

Uh oh!

greptile-apps Bot Apr 21, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot Apr 21, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot Apr 21, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot Apr 21, 2026

Choose a reason for hiding this comment

Uh oh!

ptrendx commented Apr 21, 2026

Uh oh!

NoonePauseferg commented Apr 22, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

NoonePauseferg commented Apr 21, 2026 •

edited

Loading

greptile-apps Bot commented Apr 21, 2026 •

edited

Loading