BitDecoding is a high-performance, GPU-optimized system
designed to accelerate long-context LLM decoding with a low-bit KV
cache. It achieves a 3-9x speedup over FlashAttention-2.
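To see why a low-bit KV cache matters at long context, here is a back-of-the-envelope memory sketch. The model shape below (32 layers, 32 KV heads, head dimension 128) is an illustrative assumption, not a BitDecoding-specific configuration, and it ignores quantization metadata such as scales and zero points.

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, bits: int) -> int:
    """Bytes for the K and V caches of one sequence (factor 2 = K plus V)."""
    return 2 * layers * kv_heads * head_dim * seq_len * bits // 8

# Illustrative 7B-class shape at a 32K-token context.
fp16 = kv_cache_bytes(32, 32, 128, 32 * 1024, 16)
int4 = kv_cache_bytes(32, 32, 128, 32 * 1024, 4)
print(f"fp16 KV cache: {fp16 / 2**30:.0f} GiB")  # 16 GiB
print(f"int4 KV cache: {int4 / 2**30:.0f} GiB")  # 4 GiB
```

Under these assumptions, moving from 16-bit to 4-bit shrinks the per-sequence KV cache from 16 GiB to 4 GiB, which is the memory headroom (and bandwidth saving) that low-bit decoding kernels exploit.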

git clone --recursive https://github.com/DD-DuDa/BitDecoding.git
conda create -n bitdecode python=3.10
conda activate bitdecode
pip install -r requirements.txt
bash install.sh
- Run the GSM8K example:
cd evaluation
bash scripts/example.sh
If you find BitDecoding useful or want to use it in your projects, please kindly cite our paper:
@inproceedings{11408481,
  author={Du, Dayou and Cao, Shijie and Cheng, Jianyi and Mai, Luo and Cao, Ting and Yang, Mao},
  booktitle={2026 IEEE International Symposium on High Performance Computer Architecture (HPCA)},
  title={BitDecoding: Unlocking Tensor Cores for Long-Context LLMs with Low-Bit KV Cache},
  year={2026},
  pages={1-13},
  doi={10.1109/HPCA68181.2026.11408481}
}
BitDecoding is inspired by many open-source libraries, including (but not limited to) flash-attention, flute, Atom, omniserve, and KIVI.

