# FlashMLA

FlashMLA is an efficient MLA decoding kernel for Hopper GPUs, optimized for serving variable-length sequences.

Currently released:
- BF16, FP16
- Paged kvcache with block size of 64
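
To illustrate what a paged KV cache with a block size of 64 implies for indexing, here is a minimal sketch in plain Python. The helper names (`blocks_needed`, `build_block_table`, `locate_token`), the sequential block allocation, and the `-1` padding value are illustrative assumptions and not part of the FlashMLA API.

```python
BLOCK_SIZE = 64  # block size of the paged KV cache, as listed above


def blocks_needed(seqlen, block_size=BLOCK_SIZE):
    """Number of cache blocks required to hold `seqlen` tokens (ceiling division)."""
    return -(-seqlen // block_size)


def build_block_table(cache_seqlens, block_size=BLOCK_SIZE):
    """Map each sequence to a row of physical block ids, padded with -1.

    Hypothetical allocator: block ids are simply handed out in order.
    """
    max_blocks = max((blocks_needed(n, block_size) for n in cache_seqlens), default=0)
    table, next_id = [], 0
    for n in cache_seqlens:
        k = blocks_needed(n, block_size)
        table.append(list(range(next_id, next_id + k)) + [-1] * (max_blocks - k))
        next_id += k
    return table


def locate_token(block_table, seq_idx, pos, block_size=BLOCK_SIZE):
    """Return (physical_block_id, offset_in_block) for token `pos` of a sequence."""
    return block_table[seq_idx][pos // block_size], pos % block_size


if __name__ == "__main__":
    table = build_block_table([100, 64, 1])
    print(table)
    print(locate_token(table, 0, 70))
```

Because sequences only occupy whole blocks, a batch of variable-length sequences wastes at most one partially filled block per sequence.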

## Quick start

### Install

```bash
python setup.py install
```

### Benchmark

```bash
python tests/test_flash_mla.py
```

The kernel achieves up to 3000 GB/s in memory-bound configurations and 580 TFLOPS in compute-bound configurations on H800 SXM5, using CUDA 12.8.
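
Throughput figures of this kind are ratios of work done to elapsed time. The sketch below shows only that generic arithmetic; it is not the benchmark script's exact accounting, the function names are hypothetical, and the numbers in the example are made up for illustration rather than measured.

```python
def achieved_gb_per_s(bytes_moved, elapsed_s):
    """Memory throughput in GB/s (1 GB = 1e9 bytes)."""
    return bytes_moved / elapsed_s / 1e9


def achieved_tflops(flop_count, elapsed_s):
    """Compute throughput in TFLOPS (1 TFLOP = 1e12 floating-point operations)."""
    return flop_count / elapsed_s / 1e12


if __name__ == "__main__":
    # Hypothetical workload: 1e12 bytes read and 2e14 FLOPs in 0.5 s.
    print(achieved_gb_per_s(1e12, 0.5), "GB/s")
    print(achieved_tflops(2e14, 0.5), "TFLOPS")
```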

### Usage

```python
from flash_mla import get_mla_metadata, flash_mla_with_kvcache

tile_scheduler_metadata, num_splits = get_mla_metadata(cache_seqlens, s_q * h_q // h_kv, h_kv)

for i in range(num_layers):
    ...
    o_i, lse_i = flash_mla_with_kvcache(
        q_i, kvcache_i, block_table, cache_seqlens, dv,
        tile_scheduler_metadata, num_splits, causal=True,
    )
    ...
```
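
The second argument to `get_mla_metadata` in the snippet above is the expression `s_q * h_q // h_kv`. As a minimal sketch, assuming `s_q` is the number of query tokens per sequence, `h_q` the number of query heads, and `h_kv` the number of KV heads (this README does not define the names, so that reading is an assumption), the value can be computed and sanity-checked before any GPU call. The helper name and the example values are hypothetical.

```python
from typing import Tuple


def mla_metadata_args(s_q: int, h_q: int, h_kv: int) -> Tuple[int, int]:
    """Return the two integers passed after `cache_seqlens` in the call above.

    The first is s_q * h_q // h_kv, the second is h_kv itself.
    """
    if min(s_q, h_q, h_kv) <= 0:
        raise ValueError("s_q, h_q and h_kv must all be positive")
    return s_q * h_q // h_kv, h_kv


if __name__ == "__main__":
    # Hypothetical decoding setup: one query token, 128 query heads, one KV head.
    print(mla_metadata_args(1, 128, 1))
```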

## Requirements

- Hopper GPUs
- CUDA 12.3 and above
    - **But we highly recommend 12.8 or above for the best performance**
- PyTorch 2.0 and above
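
The requirements above can be checked programmatically before installing. The following is a rough sketch, not part of FlashMLA: the function names are hypothetical, and it assumes that a Hopper GPU reports CUDA compute capability 9.x. Note that `torch.version.cuda` is the CUDA version PyTorch was built against, which may differ from the toolkit used to compile the extension.

```python
import re

MIN_CUDA = (12, 3)          # minimum CUDA version listed above
RECOMMENDED_CUDA = (12, 8)  # recommended for best performance
MIN_TORCH = (2, 0)          # minimum PyTorch version listed above
HOPPER_MAJOR = 9            # assumption: Hopper reports compute capability 9.x


def parse_version(text):
    """Extract (major, minor) from strings such as '12.8' or '2.3.1+cu121'."""
    match = re.match(r"(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"unrecognized version string: {text!r}")
    return int(match.group(1)), int(match.group(2))


def check_requirements(capability, cuda_version, torch_version):
    """Return (errors, warnings) as two lists of human-readable strings."""
    errors, warnings = [], []
    if capability[0] != HOPPER_MAJOR:
        errors.append(f"GPU compute capability {capability} is not Hopper")
    if cuda_version is None:
        errors.append("this PyTorch build has no CUDA support")
    else:
        cuda = parse_version(cuda_version)
        if cuda < MIN_CUDA:
            errors.append(f"CUDA {cuda_version} is older than 12.3")
        elif cuda < RECOMMENDED_CUDA:
            warnings.append(f"CUDA {cuda_version} works, but 12.8 or above is recommended")
    if parse_version(torch_version) < MIN_TORCH:
        errors.append(f"PyTorch {torch_version} is older than 2.0")
    return errors, warnings


def check_local_environment():
    """Query the local install. Requires PyTorch and a visible CUDA device."""
    import torch  # imported lazily so the pure checks above work without it

    return check_requirements(
        torch.cuda.get_device_capability(),
        torch.version.cuda,
        torch.__version__,
    )
```

Keeping the comparison logic separate from the PyTorch queries makes it possible to test the checks on a machine without a GPU.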

## Acknowledgement

FlashMLA is inspired by the [FlashAttention 2&3](https://github.com/dao-AILab/flash-attention/) and [cutlass](https://github.com/nvidia/cutlass) projects.

## Citation

```bibtex
@misc{flashmla2025,
      title={FlashMLA: Efficient MLA decoding kernels},
      author={Jiashi Li},
      year={2025},
      publisher = {GitHub},
      howpublished = {\url{https://github.com/deepseek-ai/FlashMLA}},
}
```