# MLAttention

## Introduction

MLAttention is an efficient MLA decoding kernel, optimized for variable-length sequence serving.

Currently supported precisions:
- BF16, FP16

Currently supported implementations:
- OpenAI Triton

Implementations under development:
- Cutlass

## Installation

### Installing from source

```bash
python3 -m pip install .
```

### Running the unit tests

```bash
pytest -s tests/test_triton_decode_attention.py
```


## Usage

```python
from triton_mla_op.triton_decode_attention import decode_attention_fwd

...
decode_attention_fwd(
    q,              # query tensor for the current decode step
    k_buffer,       # cached key buffer
    v_buffer,       # cached value buffer
    o,              # output tensor
    req_to_token,   # request-to-token-index mapping into the KV cache
    b_seq_len,      # per-request sequence lengths
    attn_logits,    # intermediate attention-logits buffer
    num_kv_splits,  # number of KV splits used for parallelism
    sm_scale,       # softmax scaling factor
)
...
```
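To clarify the semantics of the call above, here is a minimal NumPy sketch of what a decode-attention kernel computes for a single request: one new query token attends over a variable-length cached KV sequence. This is an illustrative reference only (the function name, shapes, and layout are assumptions for this sketch), not the Triton kernel's actual implementation, which additionally handles batching, paged KV caches via `req_to_token`, and split-KV parallelism.

```python
import numpy as np

def decode_attention_ref(q, k_cache, v_cache, seq_len, sm_scale):
    """Reference decode attention for one request.

    q:                (num_heads, head_dim)      -- the single new query token
    k_cache, v_cache: (max_len, num_heads, head_dim) -- preallocated KV buffers
    seq_len:          number of valid cached tokens for this request
    sm_scale:         softmax scaling factor (typically 1 / sqrt(head_dim))
    """
    k = k_cache[:seq_len]  # only the valid prefix of the buffer is attended to
    v = v_cache[:seq_len]
    # Scaled dot-product scores per head: (num_heads, seq_len)
    scores = np.einsum("hd,shd->hs", q, k) * sm_scale
    # Numerically stable softmax over the sequence axis
    scores -= scores.max(axis=1, keepdims=True)
    probs = np.exp(scores)
    probs /= probs.sum(axis=1, keepdims=True)
    # Probability-weighted sum of values: (num_heads, head_dim)
    return np.einsum("hs,shd->hd", probs, v)
```

Because the softmax weights sum to 1 along the sequence axis, the output for each head is a convex combination of that head's cached value vectors.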

## MLAttention development progress

A Cutlass-based version of MLAttention is currently under active development.

We will post development progress updates in this repository; follow our developer community for the latest news.