# MLAttention

## Introduction

MLAttention is an efficient MLA decoding kernel, optimized for variable-length sequence serving.

Currently supported precisions:
- BF16, FP16

Currently supported implementations:
- OpenAI Triton

Implementations under development:
- Cutlass

## Installation

### Installing from source

```bash
python3 -m pip install .
```

### Running the unit tests

```bash
pytest -s tests/test_triton_decode_attention.py
```


## Usage

```python
from triton_mla_op.triton_decode_attention import decode_attention_fwd

...
decode_attention_fwd(
    q,              # query tensor for the current decode step
    k_buffer,       # cached key buffer
    v_buffer,       # cached value buffer
    o,              # output tensor
    req_to_token,   # request-to-token-index mapping into the KV cache
    b_seq_len,      # per-request sequence lengths
    attn_logits,    # intermediate attention-logits buffer
    num_kv_splits,  # number of KV splits used for parallelism
    sm_scale,       # softmax scaling factor
)
...
```
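To clarify the semantics of the call above, here is a minimal NumPy sketch of what a decode-attention kernel computes for a single request: one new query token attends over a variable-length cached KV sequence. This is an illustrative reference only (the function name, shapes, and layout are assumptions for this sketch), not the Triton kernel's actual implementation, which additionally handles batching, paged KV caches via `req_to_token`, and split-KV parallelism.

```python
import numpy as np

def decode_attention_ref(q, k_cache, v_cache, seq_len, sm_scale):
    """Reference decode attention for one request.

    q:                (num_heads, head_dim)      -- the single new query token
    k_cache, v_cache: (max_len, num_heads, head_dim) -- preallocated KV buffers
    seq_len:          number of valid cached tokens for this request
    sm_scale:         softmax scaling factor (typically 1 / sqrt(head_dim))
    """
    k = k_cache[:seq_len]  # only the valid prefix of the buffer is attended to
    v = v_cache[:seq_len]
    # Scaled dot-product scores per head: (num_heads, seq_len)
    scores = np.einsum("hd,shd->hs", q, k) * sm_scale
    # Numerically stable softmax over the sequence axis
    scores -= scores.max(axis=1, keepdims=True)
    probs = np.exp(scores)
    probs /= probs.sum(axis=1, keepdims=True)
    # Probability-weighted sum of values: (num_heads, head_dim)
    return np.einsum("hs,shd->hd", probs, v)
```

Because the softmax weights sum to 1 along the sequence axis, the output for each head is a convex combination of that head's cached value vectors.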

## MLAttention development progress

A Cutlass-based version of MLAttention is currently under active development.

We will post development progress updates in this repository; follow our developer community for the latest news.