Model compression toolkit engineered for enhanced usability, comprehensiveness, and efficiency.
Runtime patches for vLLM — Qwen3.6 (27B int4 / 35B-A3B FP8) on consumer Ampere. 50+ patches: TurboQuant KV, MTP / DFlash / ngram spec-decode, FULL cudagraph, 256K-320K context. v7.64: P67 non-pow-2 GQA, Cliff 1 fix, 6 docs (FAQ/HARDWARE/CONFIGS/CLIFFS), Genesis Compat Layer.
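For orientation, here is a minimal sketch of how a quantized, long-context Qwen checkpoint might be loaded with stock vLLM's offline Python API. The model id, quantization setting, and context length are illustrative assumptions only; the patched fork described above may expose different or additional knobs for its KV quantization and speculative decoding features.

```python
# Minimal stock-vLLM offline inference sketch (illustrative; not this fork's specific API).
from vllm import LLM, SamplingParams

# Assumed settings: an AWQ int4 Qwen checkpoint with a long context window.
llm = LLM(
    model="Qwen/Qwen2.5-32B-Instruct-AWQ",  # placeholder checkpoint, not Qwen3.6
    quantization="awq",                     # int4 AWQ weights
    max_model_len=262144,                   # 256K context, if the model and GPU memory allow it
    gpu_memory_utilization=0.90,
)

params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Summarize speculative decoding in two sentences."], params)
print(outputs[0].outputs[0].text)
```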
vLLM Qwen 3.6-27B (AWQ-INT4) + DFlash speculative decoding on AMD Strix Halo (gfx1151 iGPU, 128 GB UMA, ROCm 7.13). 24.8 t/s single-stream, vision, tool calling, 256K context, OpenAI-compatible, Docker. Matches DGX Spark FP8+DFlash+MTP at a third of the cost. No CUDA.
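Since the server above is described as OpenAI-compatible, a client request could look roughly like the following; the base URL, port, API key, and model name are assumptions for illustration and should be replaced with whatever the running server reports.

```python
# Illustrative client call against an OpenAI-compatible vLLM server (assumed at localhost:8000).
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # assumed host/port for the local server
    api_key="not-needed-locally",         # vLLM typically ignores the key unless one is configured
)

resp = client.chat.completions.create(
    model="qwen3.6-27b-awq",              # placeholder model name; query /v1/models for the real one
    messages=[{"role": "user", "content": "What hardware am I running on?"}],
    max_tokens=128,
)
print(resp.choices[0].message.content)
```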
Local AI workstation: discover, run, chat with, benchmark, and generate images from open-weight models. DFlash/DDTree speculative decoding, five cache-compression strategies (including RotorQuant, TriAttention, TurboQuant, and ChaosEngine), and MLX + llama.cpp + vLLM backends.
llama.cpp fork optimized for NVIDIA DGX Spark / GB10 (Blackwell, SM 12.1) — TurboQuant weights + KV, NVFP4, DFlash MTP