Add AI written qwen3_moe example#2887

Open
skyw wants to merge 9 commits into NVIDIA:main from skyw:vibe_qwen3
Conversation

@skyw skyw commented Apr 15, 2026

Description

An almost pure TE-module implementation of the Qwen3 MoE model.

Type of change

  • Documentation change (change only to the documentation, either a fix or a new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

Please list the changes introduced in this PR:

  • Add a Qwen3 MoE model implemented with TE modules only
  • Add a simple test that matches the HF counterpart

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

skyw added 4 commits April 15, 2026 11:16
Signed-off-by: Hao Wu <skyw@nvidia.com>
Signed-off-by: Hao Wu <skyw@nvidia.com>
Signed-off-by: Hao Wu <skyw@nvidia.com>
Signed-off-by: Hao Wu <skyw@nvidia.com>
@ksivaman ksivaman self-requested a review April 15, 2026 18:39
Contributor

greptile-apps Bot commented Apr 15, 2026

Greptile Summary

This PR adds a new examples/pytorch/qwen3_moe/ directory with a pure-TE implementation of Qwen3 MoE (Qwen3MoeForCausalLM) and a numerical comparison test against the HuggingFace reference model. The model mapping is well-structured and the test correctly seeds weights and checks both forward logits and backward gradients. Remaining findings are all P2: a truncated module docstring, tokens_per_expert being int32 (TE GroupedLinear typically expects int64), a silent skip of None-gradient parameters in the backward test, and a minor debuggability gap in the expert-weight name-mapping fallthrough.

Confidence Score: 5/5

Safe to merge; all remaining findings are P2 style/quality improvements with no blocking correctness issues

All four findings are P2: a truncated docstring, a potential int32 dtype concern (uncertain without running against the actual TE kernel), a test coverage gap, and a minor fallthrough comment. None are definitive runtime breakages in the model itself. The core model mapping logic is sound and the test structure is reasonable.

examples/pytorch/qwen3_moe/model.py — verify tokens_per_expert dtype accepted by te_ops.GroupedLinear; examples/pytorch/qwen3_moe/test_vs_hf.py — address truncated docstring and None-grad skip
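The `tokens_per_expert` finding above concerns the dtype of the per-expert token counts handed to the grouped GEMM. A minimal sketch of computing those counts in plain PyTorch (the function name `tokens_per_expert_counts` is illustrative, not the PR's API); `torch.bincount` already returns int64, which is the dtype grouped-GEMM split sizes typically expect:

```python
import torch

def tokens_per_expert_counts(expert_ids: torch.Tensor, num_experts: int) -> torch.Tensor:
    """Count how many (token, expert) assignments each expert receives.

    `expert_ids` holds the top-k expert index per token, shape (tokens, k).
    torch.bincount returns int64, matching the split-size dtype a grouped
    GEMM typically expects; the cast below just makes the intent explicit.
    """
    counts = torch.bincount(expert_ids.flatten(), minlength=num_experts)
    return counts.to(torch.int64)

# Example: 4 tokens, top-2 routing over 3 experts.
expert_ids = torch.tensor([[0, 1], [0, 2], [1, 2], [0, 1]])
counts = tokens_per_expert_counts(expert_ids, num_experts=3)
# counts is tensor([3, 3, 2]), dtype torch.int64, and sums to tokens * k = 8
```

If the model instead builds the counts with an int32 histogram, a single `.to(torch.int64)` at this point would address the review finding.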

Important Files Changed

| Filename | Overview |
| --- | --- |
| `examples/pytorch/qwen3_moe/model.py` | Full TE-based Qwen3 MoE implementation; potential int32 dtype issue for `tokens_per_expert` passed to `GroupedLinear` |
| `examples/pytorch/qwen3_moe/test_vs_hf.py` | HF comparison test; truncated docstring, silent None-grad skip in backward loop, and a no-op `data.copy_()` before backward (addressed in prior review thread) |
| `examples/pytorch/qwen3_moe/config.py` | Frozen dataclass mirroring HuggingFace `Qwen3MoeConfig` defaults; no issues found |
| `examples/pytorch/qwen3_moe/README.md` | Concise README with module mapping table, file descriptions, and correct `cd` + `python test_vs_hf.py` invocation instructions |
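The comparison test described above seeds both models with the same weights, then checks forward logits and backward gradients. A generic sketch of that pattern on toy modules (not the PR's actual test code), which also shows why asserting on a `None` gradient is preferable to silently skipping it:

```python
import torch

torch.manual_seed(0)

# Two toy "implementations" that should agree numerically once weights match.
ref = torch.nn.Linear(8, 8)
impl = torch.nn.Linear(8, 8)
impl.load_state_dict(ref.state_dict())  # copy weights, as the PR's test does per-parameter

x = torch.randn(2, 8)
x_ref = x.clone().requires_grad_(True)
x_impl = x.clone().requires_grad_(True)

out_ref = ref(x_ref)
out_impl = impl(x_impl)
torch.testing.assert_close(out_impl, out_ref)  # forward: logits match

out_ref.sum().backward()
out_impl.sum().backward()
torch.testing.assert_close(x_impl.grad, x_ref.grad)  # backward: input grads match

for p_ref, p_impl in zip(ref.parameters(), impl.parameters()):
    # Fail loudly on a missing gradient instead of skipping it silently,
    # per the None-grad finding in the review.
    assert p_impl.grad is not None, "parameter received no gradient"
    torch.testing.assert_close(p_impl.grad, p_ref.grad)
```

In the real test the tolerance would be loosened (e.g. via `rtol`/`atol` in `assert_close`) to account for fused-kernel numerics differing from the HF eager path.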

Flowchart

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["input_ids (batch, seq_len)"] --> B["embed_tokens\n(nn.Embedding)"]
    B --> C["RotaryPositionEmbedding\nfreqs"]
    C --> D

    subgraph LAYER["Qwen3MoeDecoderLayer (×N)"]
        D["hidden_states"] --> E["te.MultiheadAttention\n(fused LN + QKV + QK-norm + RoPE + attn + O)"]
        E --> F["+ residual"]
        F --> G["te.RMSNorm\npost_attention_layernorm"]
        G --> H

        subgraph MOE["Qwen3MoeBlock"]
            H["hidden_flat (tokens, hidden)"] --> I["Qwen3MoeRouter\n(softmax + top-k)"]
            I --> J["moe_permute_with_probs"]
            J --> K["te_ops.GroupedLinear\n(gate+up, int32 tokens_per_expert⚠)"]
            K --> L["te_ops.SwiGLU"]
            L --> M["te_ops.GroupedLinear\n(down)"]
            M --> N["moe_unpermute\n(prob-weighted combine)"]
        end

        N --> O["+ residual"]
    end

    O --> P["te.RMSNorm\nfinal norm"]
    P --> Q["te.Linear\nlm_head"]
    Q --> R["logits (batch, seq_len, vocab_size)"]
```
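The MoE block in the flowchart routes each token to its top-k experts, runs the expert MLPs, and combines the outputs weighted by the router probabilities. A dense-notation PyTorch sketch of that math (the TE version instead permutes tokens into contiguous per-expert groups and runs grouped GEMMs, but the result is equivalent; `moe_forward` and its arguments are illustrative names, not the PR's API):

```python
import torch
import torch.nn.functional as F

def moe_forward(hidden, gate_w, expert_fns, top_k=2):
    """Softmax router, top-k selection, per-expert MLP, prob-weighted combine.

    hidden:     (tokens, hidden_dim) flattened token activations
    gate_w:     (num_experts, hidden_dim) router weight
    expert_fns: one callable per expert (stands in for the gate+up/SwiGLU/down MLP)
    """
    logits = hidden @ gate_w.t()                    # (tokens, num_experts)
    probs = F.softmax(logits, dim=-1)
    top_probs, top_ids = probs.topk(top_k, dim=-1)  # (tokens, top_k)

    out = torch.zeros_like(hidden)
    for e, expert in enumerate(expert_fns):
        # Tokens that routed to expert e, and which top-k slot they used.
        token_idx, slot = (top_ids == e).nonzero(as_tuple=True)
        if token_idx.numel() == 0:
            continue
        out[token_idx] += top_probs[token_idx, slot, None] * expert(hidden[token_idx])
    return out

# Tiny usage example: 4 tokens, 3 identity "experts".
torch.manual_seed(0)
hidden = torch.randn(4, 8)
gate_w = torch.randn(3, 8)
experts = [lambda x: x for _ in range(3)]
combined = moe_forward(hidden, gate_w, experts)
```

The per-expert Python loop here is what `moe_permute_with_probs` + `GroupedLinear` + `moe_unpermute` replace with a single permute, batched GEMMs over contiguous expert groups, and an unpermute-with-combine.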

Reviews (5): Last reviewed commit: "Merge branch 'main' into vibe_qwen3"

Comment thread examples/pytorch/qwen3_moe/test_vs_hf.py Outdated
Comment thread examples/pytorch/qwen3_moe/test_vs_hf.py
skyw and others added 4 commits April 15, 2026 12:30
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Hao Wu <skyw@users.noreply.github.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Hao Wu <skyw@users.noreply.github.com>
Signed-off-by: Hao Wu <skyw@nvidia.com>
