# FlashMLA

FlashMLA is an efficient MLA decoding kernel for Hopper GPUs, optimized for serving variable-length sequences.

Currently released:
- BF16, FP16
- Paged kvcache with block size of 64
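
To make the paged layout concrete, here is a minimal pure-Python sketch of how a block table can map a logical token position to a physical cache slot when each block holds 64 tokens. The helpers `blocks_needed` and `locate_token` and the sample block table are hypothetical and only illustrate the idea; they are not part of the FlashMLA API, and the kernel's actual memory layout may differ.

```python
BLOCK_SIZE = 64  # block size of the paged kvcache, as stated above


def blocks_needed(seqlen, block_size=BLOCK_SIZE):
    """Number of cache blocks required to hold `seqlen` tokens (ceiling division)."""
    return (seqlen + block_size - 1) // block_size


def locate_token(block_table_row, position, block_size=BLOCK_SIZE):
    """Map a logical token position to (physical_block_id, offset_in_block)."""
    logical_block, offset = divmod(position, block_size)
    return block_table_row[logical_block], offset


# Hypothetical block table row for one sequence: logical block i is stored
# in physical block row[i]. Physical blocks need not be contiguous.
row = [7, 2, 9]

print(blocks_needed(130))      # 3
print(locate_token(row, 0))    # (7, 0)
print(locate_token(row, 129))  # (9, 1)
```

Because sequences only reserve whole blocks, a batch of variable-length sequences can share one physical cache without padding every sequence to the longest length.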

## Quick start

### Install

```bash
python setup.py install
```

### Benchmark

```bash
python tests/test_flash_mla.py
```

FlashMLA achieves up to 3000 GB/s in memory-bound configurations and 580 TFLOPS in computation-bound configurations on H800 SXM5, using CUDA 12.8.
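
For context on how figures of this kind are usually derived, the sketch below converts a measured kernel time into achieved bandwidth and compute throughput. The byte and FLOP counts are inputs you would work out for your own configuration. The function names and the sample numbers are illustrative assumptions, not measurements taken from FlashMLA.

```python
def achieved_bandwidth_gbs(bytes_moved, seconds):
    """Achieved memory bandwidth in GB/s (1 GB = 1e9 bytes)."""
    return bytes_moved / seconds / 1e9


def achieved_tflops(flops, seconds):
    """Achieved compute throughput in TFLOPS (1 TFLOP = 1e12 floating-point ops)."""
    return flops / seconds / 1e12


# Made-up example: a kernel that moves 3e9 bytes and performs 5.8e11 FLOPs
# in one millisecond.
elapsed = 1e-3
print(f"{achieved_bandwidth_gbs(3_000_000_000, elapsed):.0f} GB/s")
print(f"{achieved_tflops(580_000_000_000, elapsed):.0f} TFLOPS")
```

A configuration is memory-bound when the bandwidth figure approaches the hardware limit first, and computation-bound when the TFLOPS figure does.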

### Usage

```python
from flash_mla import get_mla_metadata, flash_mla_with_kvcache

# Compute the scheduling metadata once, outside the layer loop, and reuse it for every layer.
tile_scheduler_metadata, num_splits = get_mla_metadata(cache_seqlens, s_q * h_q // h_kv, h_kv)

for i in range(num_layers):
    ...
    o_i, lse_i = flash_mla_with_kvcache(
        q_i, kvcache_i, block_table, cache_seqlens, dv,
        tile_scheduler_metadata, num_splits, causal=True,
    )
    ...
```
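
The call above returns an output and a second value named `lse_i`. Assuming this is the log-sum-exp of the scaled attention scores, as is conventional in FlashAttention-style kernels (the README does not spell this out), the pure-Python reference below shows what the two values mean for a single query, and why the log-sum-exp is useful: results computed over disjoint chunks of the KV cache can be merged exactly. `attention_with_lse` and `merge_partials` are hypothetical helper names for illustration; this is not FlashMLA's implementation.

```python
import math


def attention_with_lse(q, keys, values, scale):
    """Single-query softmax attention. Returns (output, lse)."""
    scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)  # subtract the max for numerical stability
    exp_scores = [math.exp(s - m) for s in scores]
    denom = sum(exp_scores)
    lse = m + math.log(denom)  # log(sum(exp(scores)))
    dv = len(values[0])
    out = [sum(w * v[j] for w, v in zip(exp_scores, values)) / denom
           for j in range(dv)]
    return out, lse


def merge_partials(o_a, lse_a, o_b, lse_b):
    """Combine attention results computed over two disjoint KV chunks."""
    m = max(lse_a, lse_b)
    w_a = math.exp(lse_a - m)
    w_b = math.exp(lse_b - m)
    total = w_a + w_b
    out = [(w_a * a + w_b * b) / total for a, b in zip(o_a, o_b)]
    return out, m + math.log(total)


# Toy data: 4 cached tokens, head dim 2, value dim 1.
q = [0.3, -0.2]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
values = [[1.0], [2.0], [3.0], [4.0]]

full_o, full_lse = attention_with_lse(q, keys, values, scale=0.5)
o_a, lse_a = attention_with_lse(q, keys[:2], values[:2], scale=0.5)
o_b, lse_b = attention_with_lse(q, keys[2:], values[2:], scale=0.5)
merged_o, merged_lse = merge_partials(o_a, lse_a, o_b, lse_b)

# The merged result matches attention over the full cache.
assert abs(merged_lse - full_lse) < 1e-9
assert all(abs(x - y) < 1e-9 for x, y in zip(merged_o, full_o))
```

A reference like this is also handy for checking kernel outputs on small inputs before running the full benchmark.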

## Requirements

- Hopper GPUs
- CUDA 12.3 and above
    - **But we highly recommend 12.8 or above for the best performance**
- PyTorch 2.0 and above
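
The requirements above can be checked programmatically before building. The sketch below is a hypothetical helper, not something shipped with FlashMLA. It assumes Hopper GPUs report CUDA compute capability 9.x, and that version strings start with `major.minor`.

```python
def parse_version(text):
    """Parse the leading 'major.minor' of a version string such as '12.8' or '2.3.1+cu121'."""
    parts = text.split("+")[0].split(".")
    return int(parts[0]), int(parts[1])


def check_requirements(compute_capability, cuda_version, torch_version):
    """Return a list of human-readable problems; an empty list means all checks pass."""
    problems = []
    if compute_capability[0] != 9:  # assumption: Hopper is compute capability 9.x
        problems.append("GPU is not Hopper (compute capability 9.x)")
    if parse_version(cuda_version) < (12, 3):
        problems.append("CUDA is older than 12.3")
    if parse_version(torch_version) < (2, 0):
        problems.append("PyTorch is older than 2.0")
    return problems


print(check_requirements((9, 0), "12.8", "2.3.1+cu121"))  # []
print(check_requirements((8, 0), "12.1", "1.13.1"))
```

With PyTorch installed, the three inputs can be read from `torch.cuda.get_device_capability()`, `torch.version.cuda`, and `torch.__version__`.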

## Acknowledgement

FlashMLA is inspired by the [FlashAttention 2&3](https://github.com/dao-AILab/flash-attention/) and [cutlass](https://github.com/nvidia/cutlass) projects.

## Community Support

### MetaX 

For MetaX GPUs, visit the official website: [MetaX](https://www.metax-tech.com).

The corresponding FlashMLA version can be found at:
[MetaX-MACA/FlashMLA](https://github.com/MetaX-MACA/FlashMLA)

### Moore Threads

For the Moore Threads GPU, visit the official website: [Moore Threads](https://www.mthreads.com/).

The corresponding FlashMLA version is available on GitHub:  
[MooreThreads/MT-flashMLA](https://github.com/MooreThreads/MT-flashMLA).


### Hygon DCU

For the Hygon DCU, visit the official website: [Hygon Developer](https://developer.sourcefind.cn/).

The corresponding FlashMLA version is available here:  
[OpenDAS/MLAttention](https://developer.sourcefind.cn/codes/OpenDAS/MLAttention).


### Intellifusion

For the Intellifusion NNP, visit the official website: [Intellifusion](https://www.intellif.com).

The corresponding FlashMLA version is available on Gitee:  
[Intellifusion/tyllm](https://gitee.com/Intellifusion_2025/tyllm/blob/master/python/tylang/flash_mla.py).


### Iluvatar Corex

For Iluvatar Corex GPUs, visit the official website: [Iluvatar Corex](https://www.iluvatar.com).

The corresponding FlashMLA version is available on GitHub:  
[Deep-Spark/FlashMLA](https://github.com/Deep-Spark/FlashMLA/tree/iluvatar_flashmla)

## Citation

```bibtex
@misc{flashmla2025,
      title={FlashMLA: Efficient MLA decoding kernels},
      author={Jiashi Li},
      year={2025},
      publisher = {GitHub},
      howpublished = {\url{https://github.com/deepseek-ai/FlashMLA}},
}
```