# FlashMLA

FlashMLA is an efficient MLA decoding kernel for Hopper GPUs, optimized for serving variable-length sequences.

Currently released:
- BF16, FP16
- Paged kvcache with block size of 64
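
To make the paging scheme concrete, here is a minimal pure-Python sketch of how a block table for a paged KV cache with 64-token blocks could be laid out. The helper names (`blocks_needed`, `build_block_table`, `locate_token`), the sequential block assignment, and the `-1` padding value are illustrative assumptions, not part of the FlashMLA API.

```python
from __future__ import annotations

BLOCK_SIZE = 64  # page size stated in the feature list above


def blocks_needed(seqlen: int, block_size: int = BLOCK_SIZE) -> int:
    """Number of cache blocks required to hold `seqlen` tokens (ceiling division)."""
    return (seqlen + block_size - 1) // block_size


def build_block_table(cache_seqlens: list[int], block_size: int = BLOCK_SIZE) -> list[list[int]]:
    """Assign physical block ids to each sequence; rows are padded with -1 (assumed sentinel)."""
    max_blocks = max((blocks_needed(n, block_size) for n in cache_seqlens), default=0)
    table, next_block = [], 0
    for n in cache_seqlens:
        k = blocks_needed(n, block_size)
        row = list(range(next_block, next_block + k))  # hypothetical sequential allocation
        next_block += k
        table.append(row + [-1] * (max_blocks - k))
    return table


def locate_token(table: list[list[int]], seq_idx: int, token_idx: int,
                 block_size: int = BLOCK_SIZE) -> tuple[int, int]:
    """Map (sequence, token position) to (physical block id, offset inside that block)."""
    return table[seq_idx][token_idx // block_size], token_idx % block_size
```

With sequences of different lengths, each row of the table only references as many blocks as that sequence needs, which is what makes paging a good fit for variable-length serving.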

## Quick start

### Install

```bash
python setup.py install
```

### Benchmark

```bash
python tests/test_flash_mla.py
```

FlashMLA achieves up to 3000 GB/s in memory-bound configurations and 580 TFLOPS in compute-bound configurations on H800 SXM5, using CUDA 12.8.
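
The sketch below shows one way bandwidth and throughput figures of this kind can be derived from a measured kernel time. The accounting is a simplifying assumption (each KV cache element is read once, Q and the output are each touched once, two FLOPs per multiply-add, causal masking ignored), and the function names are hypothetical; the benchmark script may count differently.

```python
def attention_bytes(total_kv_tokens, batch, s_q, h_q, h_kv, d, dv, bytes_per_elem=2):
    """Approximate bytes moved by one decoding call (BF16/FP16 use 2 bytes per element)."""
    kv = total_kv_tokens * h_kv * d   # KV cache entries read
    q = batch * s_q * h_q * d         # query elements read
    out = batch * s_q * h_q * dv      # output elements written
    return (kv + q + out) * bytes_per_elem


def attention_flops(total_kv_tokens, s_q, h_q, d, dv):
    """Approximate FLOPs: Q.K^T costs 2*d per (query, key) pair, P.V costs 2*dv."""
    return 2 * s_q * total_kv_tokens * h_q * (d + dv)


def achieved_rates(nbytes, nflops, seconds):
    """Return (GB/s, TFLOPS) for a kernel that took `seconds` to run."""
    return nbytes / seconds / 1e9, nflops / seconds / 1e12
```

A configuration is memory-bound when the GB/s figure approaches the hardware limit first, and compute-bound when the TFLOPS figure does.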

### Usage

```python
from flash_mla import get_mla_metadata, flash_mla_with_kvcache

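# Compute the tile scheduler metadata once, outside the layer loop.
tile_scheduler_metadata, num_splits = get_mla_metadata(cache_seqlens, s_q * h_q // h_kv, h_kv)

for i in range(num_layers):
    ...
    # Every layer reuses the same metadata; the call returns the output and its log-sum-exp.
    o_i, lse_i = flash_mla_with_kvcache(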
        q_i, kvcache_i, block_table, cache_seqlens, dv,
        tile_scheduler_metadata, num_splits, causal=True,
    )
    ...
```
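
The second argument passed to `get_mla_metadata` above is `s_q * h_q // h_kv`, which reads as the number of query rows that share each KV head. A small sketch of that arithmetic follows; the helper name and the example values are illustrative assumptions, not requirements of the library.

```python
def query_rows_per_kv_head(s_q: int, h_q: int, h_kv: int) -> int:
    """Compute s_q * h_q // h_kv, the value passed as the second argument in the usage snippet."""
    if h_kv <= 0 or h_q % h_kv != 0:
        # Assumption: query heads are split evenly across KV heads.
        raise ValueError("h_q must be a positive multiple of h_kv")
    return s_q * h_q // h_kv


# Hypothetical decoding setup: one new query token, 128 query heads, one shared KV head.
rows = query_rows_per_kv_head(s_q=1, h_q=128, h_kv=1)
```

Here `rows` is 128: all query heads of the single new token attend to the same KV head.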

## Requirements

- Hopper GPUs
- CUDA 12.3 and above
    - **But we highly recommend 12.8 or above for the best performance**
- PyTorch 2.0 and above

## Acknowledgement

FlashMLA is inspired by [FlashAttention 2&3](https://github.com/dao-AILab/flash-attention/) and [cutlass](https://github.com/nvidia/cutlass) projects.

## Community Support

### MetaX 

For [MetaX](https://www.metax-tech.com) GPUs, the corresponding FlashMLA version can be found at:
- [MetaX-MACA/FlashMLA](https://github.com/MetaX-MACA/FlashMLA)

### Moore Threads (WIP)
For [Moore Threads](https://www.mthreads.com) GPUs, the corresponding FlashMLA version can be found at:
- [MooreThreads/MT-DeepSeek](https://github.com/MooreThreads/MT-DeepSeek)

## Citation

```bibtex
@misc{flashmla2025,
      title={FlashMLA: Efficient MLA decoding kernels},
      author={Jiashi Li},
      year={2025},
      publisher = {GitHub},
      howpublished = {\url{https://github.com/deepseek-ai/FlashMLA}},
}
```