# FlashMLA

FlashMLA is an efficient MLA decoding kernel for Hopper GPUs, optimized for serving variable-length sequences.

Currently released:
- BF16, FP16
- Paged kvcache with block size of 64
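
To illustrate what a paged KV cache with a block size of 64 means for addressing, here is a minimal Python sketch. The helper names (`blocks_needed`, `locate_token`) and the per-sequence list of physical block IDs are assumptions made for illustration; they are not part of the FlashMLA API.

```python
BLOCK_SIZE = 64  # block size stated in the feature list


def blocks_needed(seqlen: int, block_size: int = BLOCK_SIZE) -> int:
    """Number of cache blocks required to hold `seqlen` tokens (ceiling division)."""
    return (seqlen + block_size - 1) // block_size


def locate_token(
    block_table_row: list[int], pos: int, block_size: int = BLOCK_SIZE
) -> tuple[int, int]:
    """Map a token position to (physical block ID, offset inside that block).

    `block_table_row` is a hypothetical per-sequence list of physical block IDs.
    """
    logical_block, offset = divmod(pos, block_size)
    return block_table_row[logical_block], offset


# A 130-token sequence spans three blocks; the physical IDs need not be contiguous.
row = [7, 2, 9]
print(blocks_needed(130))      # 3
print(locate_token(row, 129))  # (9, 1)
```

Because a sequence only ever holds whole blocks, at most 63 token slots per sequence are left unused, regardless of sequence length.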

## Quick start

### Install

```bash
python setup.py install
```

### Benchmark

```bash
python tests/test_flash_mla.py
```

FlashMLA achieves up to 3000 GB/s in memory-bound configurations and 580 TFLOPS in compute-bound configurations on an H800 SXM5 with CUDA 12.8.
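
The two figures are two ways of normalizing the same kernel time: bytes moved per second and floating-point operations per second. The sketch below shows that arithmetic under a common accounting for attention (two FLOPs per multiply-add across the QK and PV products; one read of the KV cache, one read of Q, and one write of the output). The function name, its parameters, and the example inputs are hypothetical and illustrative, not measurements or the benchmark script's exact formula.

```python
def mla_decode_throughput(
    seconds: float,
    batch: int,
    s_q: int,
    h_q: int,
    h_kv: int,
    d: int,
    dv: int,
    total_kv_tokens: int,
    bytes_per_elem: int = 2,  # BF16 / FP16
) -> tuple[float, float]:
    """Return (TFLOPS, GB/s) for one kernel call that took `seconds`.

    `total_kv_tokens` is the sum of the cached sequence lengths over the batch.
    """
    # QK^T costs 2*d FLOPs and PV costs 2*dv FLOPs per (query, head, KV token).
    flops = 2 * s_q * total_kv_tokens * h_q * (d + dv)
    # Bytes moved: KV cache read + Q read + output write.
    n_bytes = bytes_per_elem * (
        total_kv_tokens * h_kv * d + batch * s_q * h_q * d + batch * s_q * h_q * dv
    )
    return flops / seconds / 1e12, n_bytes / seconds / 1e9


# Illustrative inputs only (assumed shapes and timing, not a measurement).
tflops, gbps = mla_decode_throughput(
    seconds=1e-3, batch=128, s_q=1, h_q=128, h_kv=1, d=576, dv=512,
    total_kv_tokens=128 * 4096,
)
print(f"{tflops:.1f} TFLOPS, {gbps:.1f} GB/s")
```

Whichever of the two numbers sits closer to the hardware peak indicates whether a given configuration is memory-bound or compute-bound.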

### Usage

```python
from flash_mla import get_mla_metadata, flash_mla_with_kvcache

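# Compute the scheduling metadata once, outside the layer loop, and reuse it for every layer.
tile_scheduler_metadata, num_splits = get_mla_metadata(cache_seqlens, s_q * h_q // h_kv, h_kv)

for i in range(num_layers):
    ...
    # o_i is the attention output; lse_i is the log-sum-exp of the attention scores.
    o_i, lse_i = flash_mla_with_kvcache(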
        q_i, kvcache_i, block_table, cache_seqlens, dv,
        tile_scheduler_metadata, num_splits, causal=True,
    )
    ...
```
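
The snippet above takes its inputs as given. As a rough sketch of the bookkeeping behind two of them, the pure-Python helpers below derive a padded block table from per-sequence cache lengths and compute the second argument passed to `get_mla_metadata`. The sequential block allocation, the `-1` padding value, the nested-list layout, and both helper names are assumptions for illustration; a real serving stack would use its own allocator and pass tensors.

```python
BLOCK_SIZE = 64  # paged KV cache block size from the feature list


def build_block_table(cache_seqlens: list[int], pad: int = -1) -> list[list[int]]:
    """Assign physical block IDs sequentially and pad rows to equal length.

    Hypothetical allocation policy: blocks are handed out in order, starting at 0.
    """
    counts = [-(-n // BLOCK_SIZE) for n in cache_seqlens]  # ceiling division
    width = max(counts, default=0)
    table, next_block = [], 0
    for count in counts:
        row = list(range(next_block, next_block + count))
        table.append(row + [pad] * (width - count))
        next_block += count
    return table


def q_rows_per_kv_head(s_q: int, h_q: int, h_kv: int) -> int:
    """Second argument of get_mla_metadata in the snippet above: s_q * h_q // h_kv."""
    return s_q * h_q // h_kv


print(build_block_table([130, 64, 1]))  # [[0, 1, 2], [3, -1, -1], [4, -1, -1]]
print(q_rows_per_kv_head(s_q=1, h_q=128, h_kv=1))  # 128
```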

## Requirements

- Hopper GPUs
- CUDA 12.3 and above
    - **We highly recommend 12.8 or above for the best performance**
- PyTorch 2.0 and above
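
The requirements above can be checked programmatically before building. The sketch below is one way to do it, assuming the versions are supplied as strings; `parse_version` and `check_requirements` are hypothetical helpers, not part of FlashMLA. With PyTorch installed, the inputs would typically come from `torch.version.cuda` (the CUDA version PyTorch was built with, which may differ from the local toolkit), `torch.__version__`, and `torch.cuda.get_device_capability()`.

```python
import re


def parse_version(text: str) -> tuple[int, ...]:
    """Extract the leading numeric components, e.g. '2.3.1+cu121' -> (2, 3, 1)."""
    match = re.match(r"\d+(?:\.\d+)*", text)
    if match is None:
        raise ValueError(f"no version number in {text!r}")
    return tuple(int(part) for part in match.group().split("."))


def check_requirements(
    cuda_version: str, torch_version: str, capability: tuple[int, int]
) -> list[str]:
    """Return a list of problems; an empty list means every hard requirement is met."""
    problems = []
    if capability[0] != 9:  # Hopper GPUs report compute capability 9.x
        problems.append(f"compute capability {capability} is not Hopper (9.x)")
    if parse_version(cuda_version) < (12, 3):
        problems.append(f"CUDA {cuda_version} is older than 12.3")
    if parse_version(torch_version) < (2, 0):
        problems.append(f"PyTorch {torch_version} is older than 2.0")
    return problems


print(check_requirements("12.8", "2.3.1+cu121", (9, 0)))  # []
print(check_requirements("12.1", "1.13.1", (8, 0)))
```

The check covers only the hard minimums; the recommendation of CUDA 12.8 or above is a performance hint and is deliberately not treated as a failure here.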

## Acknowledgement

FlashMLA is inspired by the [FlashAttention 2&3](https://github.com/dao-AILab/flash-attention/) and [cutlass](https://github.com/nvidia/cutlass) projects.

## Community Support

### MetaX
For MetaX GPUs, visit the official website: [MetaX](https://www.metax-tech.com).

The corresponding FlashMLA version can be found at: [MetaX-MACA/FlashMLA](https://github.com/MetaX-MACA/FlashMLA).

### Moore Threads
For the Moore Threads GPU, visit the official website: [Moore Threads](https://www.mthreads.com/).

The corresponding FlashMLA version is available on GitHub: [MooreThreads/MT-flashMLA](https://github.com/MooreThreads/MT-flashMLA).

### Hygon DCU
For the Hygon DCU, visit the official website: [Hygon Developer](https://developer.sourcefind.cn/).

The corresponding FlashMLA version is available here: [OpenDAS/MLAttention](https://developer.sourcefind.cn/codes/OpenDAS/MLAttention).

### Intellifusion
For the Intellifusion NNP, visit the official website: [Intellifusion](https://www.intellif.com).

The corresponding FlashMLA version is available on Gitee: [Intellifusion/tyllm](https://gitee.com/Intellifusion_2025/tyllm/blob/master/python/tylang/flash_mla.py).

### Iluvatar Corex
For Iluvatar Corex GPUs, visit the official website: [Iluvatar Corex](https://www.iluvatar.com).

The corresponding FlashMLA version is available on GitHub: [Deep-Spark/FlashMLA](https://github.com/Deep-Spark/FlashMLA/tree/iluvatar_flashmla).

## Citation

```bibtex
@misc{flashmla2025,
      title={FlashMLA: Efficient MLA decoding kernels},
      author={Jiashi Li},
      year={2025},
      publisher = {GitHub},
      howpublished = {\url{https://github.com/deepseek-ai/FlashMLA}},
}
```