
feat: fused Metal q4 inference for MLX 4-bit models#82

Open
cjchanh wants to merge 1 commit into evilsocket:main from cjchanh:q4-metal-patchset

Conversation


cjchanh commented Apr 13, 2026

What

This patch enables Cake to serve MLX 4-bit quantized models that upstream main cannot load. On M5 Max, Qwen2.5-7B-Instruct-4bit loaded at 9.5 GiB and generated 10 tokens at 56.71 tok/s in a bounded API run.

How

  • q4_matvec_f16 and q4_matmul_tiled_f16 MSL kernels that read packed 4-bit weights and dequantize on-the-fly during matmul
  • QuantizedLinear layer type that stores packed U32 weights + F16 scales/biases on Metal without expansion
  • MetalMlxBackend VarBuilder that auto-detects MLX 4-bit format and keeps weights packed
  • MLP and Attention layers use polymorphic LinearWeight (Dense/Quantized)
  • Non-quantized tensors (embeddings, norms, lm_head) fall through to standard F16 dequantization
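
As a rough illustration of what "dequantize on-the-fly" means for the q4 kernels, here is a minimal CPU sketch of unpacking one row of 4-bit weights. The group size, low-nibble-first packing order, and the affine form w = q * scale + bias are assumptions about the MLX layout, not details confirmed by this PR; the real work happens inside the MSL kernels.

```rust
// Hedged sketch: dequantize one row of MLX-style 4-bit packed weights.
// Assumptions (not confirmed by the PR): group size 64, low-nibble-first
// packing (8 values per u32), and affine dequant w = q * scale + bias.
fn dequant_row_q4(packed: &[u32], scales: &[f32], biases: &[f32], group_size: usize) -> Vec<f32> {
    let mut out = Vec::with_capacity(packed.len() * 8);
    for (i, &word) in packed.iter().enumerate() {
        for j in 0..8 {
            // Extract the j-th 4-bit value from this u32.
            let q = ((word >> (4 * j)) & 0xF) as f32;
            // Scales/biases are shared per group of `group_size` weights.
            let g = (i * 8 + j) / group_size;
            out.push(q * scales[g] + biases[g]);
        }
    }
    out
}

fn main() {
    // One u32 holds the quantized values 0..=7; scale 0.5, bias -4.0.
    let row = dequant_row_q4(&[0x7654_3210u32], &[0.5], &[-4.0], 64);
    assert_eq!(row, vec![-4.0, -3.5, -3.0, -2.5, -2.0, -1.5, -1.0, -0.5]);
    println!("{:?}", row);
}
```

The fused kernels in the patch perform this unpacking per-element inside the matmul inner loop, so the full F16 weight matrix is never materialized in memory.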

Tested

On Apple Silicon (M5 Max, M1 Air) and iPad Air M3; benchmark numbers are in the follow-up comment below.

Files

  • cake-core/src/backends/metal/ops.msl — MSL kernels
  • cake-core/src/backends/metal/mod.rs — kernel dispatch + CustomOp
  • cake-core/src/backends/mod.rs — trait method
  • cake-core/src/utils/quantized_linear.rs — QuantizedWeight + LinearWeight
  • cake-core/src/utils/mlx_quant.rs — MLX detection
  • cake-core/src/utils/gptq.rs — MetalMlxBackend
  • cake-core/src/utils/mod.rs — auto-detection wiring
  • cake-core/src/models/common/mlp.rs — quantized MLP
  • cake-core/src/models/common/attention.rs — quantized attention
  • cake-core/tests/unit_tests/test_quantization.rs — q4 validation tests
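
A minimal sketch of the Dense/Quantized dispatch the MLP and Attention layers rely on, per the description above. The variant names follow the PR text, but the fields and method are illustrative assumptions, not the actual definitions in quantized_linear.rs.

```rust
// Hedged sketch: a polymorphic linear weight as described in the PR.
// Field names and types are assumptions for illustration only.
enum LinearWeight {
    // Standard dense path (embeddings, norms, lm_head fall through here).
    Dense { weight: Vec<f32> },
    // MLX q4 path: packed U32 nibbles plus per-group scales/biases.
    Quantized { packed: Vec<u32>, scales: Vec<f32>, biases: Vec<f32> },
}

impl LinearWeight {
    fn is_quantized(&self) -> bool {
        matches!(self, LinearWeight::Quantized { .. })
    }
}

fn main() {
    let dense = LinearWeight::Dense { weight: vec![0.0; 4] };
    let q4 = LinearWeight::Quantized {
        packed: vec![0x7654_3210],
        scales: vec![0.5],
        biases: vec![-4.0],
    };
    assert!(!dense.is_quantized());
    assert!(q4.is_quantized());
    println!("dispatch ok");
}
```

An enum lets the forward pass pick the fused q4 kernel or the standard matmul per layer without duplicating the MLP/Attention code paths.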


cjchanh commented Apr 14, 2026

Additional benchmark context for this PR.

Benchmark results

M5 Max (128 GB), single-device

  • model: mlx-community/Qwen2.5-7B-Instruct-4bit
  • throughput: 56.71 tok/s (10 tokens in a bounded API request)
  • loaded memory: 9.5 GiB
  • upstream main: cannot load this checkpoint (shape mismatch for model.embed_tokens.weight, expected [152064, 3584], got [152064, 448])
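
The shape mismatch above is exactly what MLX 4-bit packing produces: 8 values per u32, so the stored inner dimension is the logical one divided by 8 (3584 / 8 = 448). A minimal sketch of that heuristic follows; `looks_like_mlx_q4` is a hypothetical name, not the detection function the PR actually adds.

```rust
// Hedged sketch of the shape heuristic behind MLX 4-bit auto-detection.
// An MLX q4 tensor packs 8 values into each u32, so the stored column
// count is the expected column count divided by 8.
// `looks_like_mlx_q4` is an illustrative name, not the PR's function.
fn looks_like_mlx_q4(stored_cols: usize, expected_cols: usize) -> bool {
    stored_cols * 8 == expected_cols
}

fn main() {
    // The embed_tokens case from the upstream load failure:
    // expected [152064, 3584], got [152064, 448].
    assert!(looks_like_mlx_q4(448, 3584));
    // A dense F16 tensor keeps its full width and is left alone.
    assert!(!looks_like_mlx_q4(3584, 3584));
    println!("ok");
}
```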

M5 Max + iPad Air M3, distributed (0.5B)

  • model: mlx-community/Qwen2.5-0.5B-Instruct-4bit
  • iPad worker discovered via zero-config discovery
  • iPad assigned model.layers.0
  • throughput: 55.44 tok/s

M1 Air + iPad Air M3, distributed (0.5B)

  • model: mlx-community/Qwen2.5-0.5B-Instruct-4bit
  • iPad worker assigned model.layers.0 via manual topology
  • throughput: 29.17 tok/s on a bounded request
  • master memory at generation: 854.7 MiB

Tested on Apple Silicon (M5 Max, M1 Air) and iPad Air M3.

The key distinction of this patch is that it does not just optimize an existing path: it enables Cake to serve MLX 4-bit checkpoints that upstream main cannot load at all, and it does so at usable throughput on Apple Silicon.
