BitDecoding

BitDecoding is a high-performance, GPU-optimized system designed to accelerate long-context LLMs decoding with a low-bit KV cache. Achieve 3-9x speedup than Flash Attention v2.

Benchmark

Kernel Performance in RTX4090
Kernel Performance in A100

Installation

git clone --recursive https://github.com/DD-DuDa/BitDecoding.git
conda create -n bitdecode python=3.10
conda activate bitdecode
pip install -r requirements.txt
bash install.sh

Quick Start

Run the GSM8K example
```
cd evaluation
bash scripts/example.sh
```

Citation

If you find BitDecoding useful or want to use in your projects, please kindly cite our paper:

@INPROCEEDINGS{11408481,
  author={Du, Dayou and Cao, Shijie and Cheng, Jianyi and Mai, Luo and Cao, Ting and Yang, Mao},
  booktitle={2026 IEEE International Symposium on High Performance Computer Architecture (HPCA)}, 
  title={BitDecoding: Unlocking Tensor Cores for Long-Context LLMs with Low-Bit KV Cache}, 
  year={2026},
  volume={},
  number={},
  pages={1-13},
  keywords={Tensors;Quantization (signal);Layout;Graphics processing units;Computer architecture;Throughput;Decoding;Systems support;Kernel;Optimization},
  doi={10.1109/HPCA68181.2026.11408481}}

Acknowledgement

BitDecoding is inspired by many open-source libraries, including (but not limited to) flash-attention, flute, Atom, omniserve, KIVI.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BitDecoding

Benchmark

Installation

Quick Start

Citation

Acknowledgement

FilesExpand file tree

README.md

Latest commit

History

README.md

File metadata and controls

BitDecoding

Benchmark

Installation

Quick Start

Citation

Acknowledgement