# FlashMLA

## Performance Update (2025.04.22)

We're excited to announce the new release of FlashMLA, which delivers a 5% to 15% performance improvement on compute-bound workloads, achieving up to 660 TFLOPS on NVIDIA H800 SXM5 GPUs. The interface of the new version is fully compatible with the old one. Just switch to the new version and enjoy the instant speedup! 🚀🚀🚀

We'd also love to share the technical details behind the new kernel. Check out our deep-dive write-up [here](docs/20250422-new-kernel-deep-dive.md).

The new kernel primarily targets compute-intensive settings, where the number of q heads $\times$ the number of q tokens per request is $\ge 64$ (the number of q tokens per request is 1 if MTP is disabled). For memory-bound cases, we recommend using version [b31bfe7](https://github.com/deepseek-ai/FlashMLA/tree/b31bfe72a83ea205467b3271a5845440a03ed7cb) for optimal performance.
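
As a rough illustration of the criterion above, the check can be written as a small helper. The function name `is_compute_bound` and its parameters are hypothetical and not part of the FlashMLA API; only the threshold of 64 comes from the text.

```python
def is_compute_bound(num_q_heads: int, q_tokens_per_request: int = 1, threshold: int = 64) -> bool:
    """Return True when (q heads) x (q tokens per request) reaches the threshold.

    q_tokens_per_request is 1 when MTP is disabled.
    """
    return num_q_heads * q_tokens_per_request >= threshold


# Illustrative head counts only, not tied to any particular model.
print(is_compute_bound(128))    # 128 * 1 = 128, at or above 64
print(is_compute_bound(16, 2))  # 16 * 2 = 32, below 64
```

A configuration for which this returns `False` falls into the memory-bound case, where the older version linked above is the recommended choice.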

## Introduction

FlashMLA is an efficient MLA decoding kernel for Hopper GPUs, optimized for serving variable-length sequences.

Currently released:
- BF16, FP16
- Paged kvcache with block size of 64
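
To make the paged layout concrete, here is a minimal sketch of the index arithmetic that a paged cache with 64-token blocks generally implies. The helper names and the example block-table row are hypothetical; this illustrates the general paging scheme, not FlashMLA's internal implementation.

```python
BLOCK_SIZE = 64  # block size stated in the list above


def num_blocks_needed(seq_len: int, block_size: int = BLOCK_SIZE) -> int:
    """Number of blocks required to hold seq_len cached tokens (ceiling division)."""
    return (seq_len + block_size - 1) // block_size


def locate_token(token_idx: int, block_table_row: list[int], block_size: int = BLOCK_SIZE) -> tuple[int, int]:
    """Map a logical token position to (physical block id, offset inside that block)."""
    return block_table_row[token_idx // block_size], token_idx % block_size


# Hypothetical block-table row: logical blocks 0, 1, 2 live in physical blocks 7, 2, 9.
row = [7, 2, 9]
print(num_blocks_needed(130))  # 3
print(locate_token(70, row))   # (2, 6)
```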

## Requirements

- Hopper GPUs
- CUDA 12.3 and above
    - **But we highly recommend 12.8 or above for the best performance**
- PyTorch 2.0 and above
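
The requirements above can be checked programmatically. The sketch below is one possible way to do so and is not shipped with FlashMLA: `parse_version` and `check_requirements` are hypothetical helpers, and the check assumes that Hopper GPUs report CUDA compute capability 9.x. In a real environment the inputs would typically come from `torch.cuda.get_device_capability()`, `torch.version.cuda`, and `torch.__version__`; they are passed in explicitly here so the example runs without a GPU.

```python
def parse_version(text: str) -> tuple[int, int]:
    """Parse the leading 'major.minor' of strings such as '12.8' or '2.3.1+cu121'."""
    major, minor = text.split("+")[0].split(".")[:2]
    return int(major), int(minor)


def check_requirements(compute_capability: tuple[int, int], cuda_version: str, torch_version: str) -> list[str]:
    """Return a list of unmet requirements (empty when everything is satisfied)."""
    problems = []
    # Assumption: Hopper GPUs report CUDA compute capability 9.x (sm90).
    if compute_capability[0] != 9:
        problems.append("a Hopper GPU is required")
    if parse_version(cuda_version) < (12, 3):
        problems.append("CUDA 12.3 or above is required")
    if parse_version(torch_version) < (2, 0):
        problems.append("PyTorch 2.0 or above is required")
    return problems


print(check_requirements((9, 0), "12.8", "2.3.1+cu121"))  # no unmet requirements
print(check_requirements((8, 0), "12.1", "1.13.1"))       # three unmet requirements
```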

## Quick start

### Install

```bash
pip install -v .
```

### Benchmark

#### Testing MLA Decoding 

```bash
python tests/test_flash_mla_sm90.py
```

#### Testing MLA Forward/Backward

```bash
python tests/test_fmha_sm100.py
```

FlashMLA achieves up to 3000 GB/s in memory-bound configurations and 660 TFLOPS in compute-bound configurations on H800 SXM5, using CUDA 12.8.

Note: For memory-bound cases, we recommend using version [b31bfe7](https://github.com/deepseek-ai/FlashMLA/tree/b31bfe72a83ea205467b3271a5845440a03ed7cb) for optimal performance.
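
For readers unfamiliar with the two units quoted above, the sketch below shows how figures in TFLOPS and GB/s are usually derived from a timed kernel run. The helper `achieved_throughput` is hypothetical, and the inputs are made-up numbers chosen for easy arithmetic; they are not FlashMLA measurements.

```python
def achieved_throughput(total_flops: float, total_bytes: float, seconds: float) -> tuple[float, float]:
    """Return (TFLOPS, GB/s) for a kernel run that took `seconds`."""
    tflops = total_flops / seconds / 1e12      # floating-point operations per second, in units of 10^12
    gb_per_s = total_bytes / seconds / 1e9     # bytes moved per second, in units of 10^9
    return tflops, gb_per_s


# Made-up inputs, purely illustrative.
tflops, gb_per_s = achieved_throughput(total_flops=3.3e11, total_bytes=1.5e9, seconds=1e-3)
print(f"{tflops:.0f} TFLOPS, {gb_per_s:.0f} GB/s")
```

A memory-bound configuration is typically judged by the GB/s figure and a compute-bound one by the TFLOPS figure.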

### Usage

```python
from flash_mla import get_mla_metadata, flash_mla_with_kvcache

# Compute the scheduling metadata once, outside the layer loop; every layer reuses it.
tile_scheduler_metadata, num_splits = get_mla_metadata(cache_seqlens, s_q * h_q // h_kv, h_kv)

for i in range(num_layers):
    ...
    # Run MLA decoding for layer i against its paged KV cache.
    o_i, lse_i = flash_mla_with_kvcache(
        q_i, kvcache_i, block_table, cache_seqlens, dv,
        tile_scheduler_metadata, num_splits, causal=True,
    )
    ...
```
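
The second argument passed to `get_mla_metadata` above is plain integer arithmetic on the query length and head counts. The sketch below evaluates that expression for a few made-up values; the helper name `q_rows_per_kv_head` is hypothetical, while the expression itself is copied from the snippet.

```python
def q_rows_per_kv_head(s_q: int, h_q: int, h_kv: int) -> int:
    """Evaluate `s_q * h_q // h_kv`, the expression passed to get_mla_metadata above.

    Python evaluates `*` and `//` left to right, so this is (s_q * h_q) // h_kv.
    """
    return s_q * h_q // h_kv


# Made-up values for illustration.
print(q_rows_per_kv_head(s_q=1, h_q=128, h_kv=1))  # 128
print(q_rows_per_kv_head(s_q=2, h_q=16, h_kv=4))   # 8
```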

## Acknowledgement

FlashMLA is inspired by the [FlashAttention 2&3](https://github.com/dao-AILab/flash-attention/) and [CUTLASS](https://github.com/nvidia/cutlass) projects.

## Community Support

### MetaX
For MetaX GPUs, visit the official website: [MetaX](https://www.metax-tech.com).

The corresponding FlashMLA version can be found at: [MetaX-MACA/FlashMLA](https://github.com/MetaX-MACA/FlashMLA)

### Moore Threads
For the Moore Threads GPU, visit the official website: [Moore Threads](https://www.mthreads.com/).

The corresponding FlashMLA version is available on GitHub: [MooreThreads/MT-flashMLA](https://github.com/MooreThreads/MT-flashMLA).


### Hygon DCU
For the Hygon DCU, visit the official website: [Hygon Developer](https://developer.sourcefind.cn/).

The corresponding FlashMLA version is available here: [OpenDAS/MLAttention](https://developer.sourcefind.cn/codes/OpenDAS/MLAttention).


### Intellifusion
For the Intellifusion NNP, visit the official website: [Intellifusion](https://www.intellif.com).

The corresponding FlashMLA version is available on Gitee: [Intellifusion/tyllm](https://gitee.com/Intellifusion_2025/tyllm/blob/master/python/tylang/flash_mla.py).


### Iluvatar Corex
For Iluvatar Corex GPUs, visit the official website: [Iluvatar Corex](https://www.iluvatar.com).

The corresponding FlashMLA version is available on GitHub: [Deep-Spark/FlashMLA](https://github.com/Deep-Spark/FlashMLA/tree/iluvatar_flashmla)

### AMD Instinct
For AMD Instinct GPUs, visit the official website: [AMD Instinct](https://www.amd.com/en/products/accelerators/instinct.html).

The corresponding FlashMLA version can be found at: [AITER/MLA](https://github.com/ROCm/aiter/blob/main/aiter/mla.py)

## Citation

```bibtex
@misc{flashmla2025,
      title={FlashMLA: Efficient MLA decoding kernels},
      author={Jiashi Li and Shengyu Liu},
      year={2025},
      publisher = {GitHub},
      howpublished = {\url{https://github.com/deepseek-ai/FlashMLA}},
}
```