
feat: fused Metal q4 inference for MLX 4-bit models#82

Open
cjchanh wants to merge 1 commit into evilsocket:main from cjchanh:q4-metal-patchset

Conversation


cjchanh commented Apr 13, 2026

What

This patch enables Cake to serve MLX 4-bit quantized models that upstream main cannot load. On M5 Max, Qwen2.5-7B-Instruct-4bit loaded at 9.5 GiB and generated 10 tokens at 56.71 tok/s in a bounded API run.

How

  • q4_matvec_f16 and q4_matmul_tiled_f16 MSL kernels that read packed 4-bit weights and dequantize on-the-fly during matmul
  • QuantizedLinear layer type that stores packed U32 weights + F16 scales/biases on Metal without expansion
  • MetalMlxBackend VarBuilder that auto-detects MLX 4-bit format and keeps weights packed
  • MLP and Attention layers use polymorphic LinearWeight (Dense/Quantized)
  • Non-quantized tensors (embeddings, norms, lm_head) fall through to standard F16 dequantization
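
As a rough illustration of what "dequantize on-the-fly" means for the q4 kernels, here is a minimal CPU sketch of unpacking one row of 4-bit weights. The group size, low-nibble-first packing order, and the affine form w = q * scale + bias are assumptions about the MLX layout, not details confirmed by this PR; the real work happens inside the MSL kernels.

```rust
// Hedged sketch: dequantize one row of MLX-style 4-bit packed weights.
// Assumptions (not confirmed by the PR): group size 64, low-nibble-first
// packing (8 values per u32), and affine dequant w = q * scale + bias.
fn dequant_row_q4(packed: &[u32], scales: &[f32], biases: &[f32], group_size: usize) -> Vec<f32> {
    let mut out = Vec::with_capacity(packed.len() * 8);
    for (i, &word) in packed.iter().enumerate() {
        for j in 0..8 {
            // Extract the j-th 4-bit value from this u32.
            let q = ((word >> (4 * j)) & 0xF) as f32;
            // Scales/biases are shared per group of `group_size` weights.
            let g = (i * 8 + j) / group_size;
            out.push(q * scales[g] + biases[g]);
        }
    }
    out
}

fn main() {
    // One u32 holds the quantized values 0..=7; scale 0.5, bias -4.0.
    let row = dequant_row_q4(&[0x7654_3210u32], &[0.5], &[-4.0], 64);
    assert_eq!(row, vec![-4.0, -3.5, -3.0, -2.5, -2.0, -1.5, -1.0, -0.5]);
    println!("{:?}", row);
}
```

The fused kernels in the patch perform this unpacking per-element inside the matmul inner loop, so the full F16 weight matrix is never materialized in memory.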

Tested

On Apple Silicon (M5 Max, M1 Air) and iPad Air M3; benchmark numbers are in the follow-up comment below.

Files

  • cake-core/src/backends/metal/ops.msl — MSL kernels
  • cake-core/src/backends/metal/mod.rs — kernel dispatch + CustomOp
  • cake-core/src/backends/mod.rs — trait method
  • cake-core/src/utils/quantized_linear.rs — QuantizedWeight + LinearWeight
  • cake-core/src/utils/mlx_quant.rs — MLX detection
  • cake-core/src/utils/gptq.rs — MetalMlxBackend
  • cake-core/src/utils/mod.rs — auto-detection wiring
  • cake-core/src/models/common/mlp.rs — quantized MLP
  • cake-core/src/models/common/attention.rs — quantized attention
  • cake-core/tests/unit_tests/test_quantization.rs — q4 validation tests
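
A minimal sketch of the Dense/Quantized dispatch the MLP and Attention layers rely on, per the description above. The variant names follow the PR text, but the fields and method are illustrative assumptions, not the actual definitions in quantized_linear.rs.

```rust
// Hedged sketch: a polymorphic linear weight as described in the PR.
// Field names and types are assumptions for illustration only.
enum LinearWeight {
    // Standard dense path (embeddings, norms, lm_head fall through here).
    Dense { weight: Vec<f32> },
    // MLX q4 path: packed U32 nibbles plus per-group scales/biases.
    Quantized { packed: Vec<u32>, scales: Vec<f32>, biases: Vec<f32> },
}

impl LinearWeight {
    fn is_quantized(&self) -> bool {
        matches!(self, LinearWeight::Quantized { .. })
    }
}

fn main() {
    let dense = LinearWeight::Dense { weight: vec![0.0; 4] };
    let q4 = LinearWeight::Quantized {
        packed: vec![0x7654_3210],
        scales: vec![0.5],
        biases: vec![-4.0],
    };
    assert!(!dense.is_quantized());
    assert!(q4.is_quantized());
    println!("dispatch ok");
}
```

An enum lets the forward pass pick the fused q4 kernel or the standard matmul per layer without duplicating the MLP/Attention code paths.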


cjchanh commented Apr 14, 2026

Additional benchmark context for this PR.

Benchmark results

M5 Max (128 GB), single-device

  • model: mlx-community/Qwen2.5-7B-Instruct-4bit
  • throughput: 56.71 tok/s (10 tokens in a bounded API request)
  • loaded memory: 9.5 GiB
  • upstream main: cannot load this checkpoint (shape mismatch for model.embed_tokens.weight, expected [152064, 3584], got [152064, 448])
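
The shape mismatch above is exactly what MLX 4-bit packing produces: 8 values per u32, so the stored inner dimension is the logical one divided by 8 (3584 / 8 = 448). A minimal sketch of that heuristic follows; `looks_like_mlx_q4` is a hypothetical name, not the detection function the PR actually adds.

```rust
// Hedged sketch of the shape heuristic behind MLX 4-bit auto-detection.
// An MLX q4 tensor packs 8 values into each u32, so the stored column
// count is the expected column count divided by 8.
// `looks_like_mlx_q4` is an illustrative name, not the PR's function.
fn looks_like_mlx_q4(stored_cols: usize, expected_cols: usize) -> bool {
    stored_cols * 8 == expected_cols
}

fn main() {
    // The embed_tokens case from the upstream load failure:
    // expected [152064, 3584], got [152064, 448].
    assert!(looks_like_mlx_q4(448, 3584));
    // A dense F16 tensor keeps its full width and is left alone.
    assert!(!looks_like_mlx_q4(3584, 3584));
    println!("ok");
}
```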

M5 Max + iPad Air M3, distributed (0.5B)

  • model: mlx-community/Qwen2.5-0.5B-Instruct-4bit
  • iPad worker discovered via zero-config discovery
  • iPad assigned model.layers.0
  • throughput: 55.44 tok/s

M1 Air + iPad Air M3, distributed (0.5B)

  • model: mlx-community/Qwen2.5-0.5B-Instruct-4bit
  • iPad worker assigned model.layers.0 via manual topology
  • throughput: 29.17 tok/s on a bounded request
  • master memory at generation: 854.7 MiB

Tested on Apple Silicon (M5 Max, M1 Air) and iPad Air M3.

The key distinction of this patch is that it does not just optimize an existing path: it enables Cake to serve MLX 4-bit checkpoints that upstream main cannot load at all, and it does so at usable throughput on Apple Silicon.
