@@ -41,6 +41,10 @@ The support matrix is split into two parts: MHA (standard attention) and MLA (mu
| **FA4** | 128 | ❌ | ❌ | ❌ | ❌ |
| **Ascend MLA (NPU)** | 128 | ❌ | ❌ | ❌ | ❌ |
```{note}
Multimodal attention is selected by `--mm-attention-backend`. The "MultiModal" column indicates whether a corresponding multimodal implementation exists for that backend family.
```
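As a sketch of how that flag is used at launch (the model path and the backend value are placeholders, not taken from the table above; check `--help` on your build for the supported choices):

```shell
# Launch the server with an explicit multimodal attention backend.
# The model path and the value "sdpa" are assumed examples.
python -m sglang.launch_server \
  --model-path Qwen/Qwen2-VL-7B-Instruct \
  --mm-attention-backend sdpa
```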
```{warning}
FlashMLA's FP8 KV cache is currently broken; see upstream [#8856](https://github.com/sgl-project/sglang/pull/8856). Use a non-FP8 KV cache or another backend when FP8 KV cache is required.
```
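Until that is resolved, one workaround is to keep the KV cache in its default dtype while still using FlashMLA. A hedged sketch (the model path is a placeholder, and the `--kv-cache-dtype` flag and its `auto` value are assumptions; verify against your server arguments):

```shell
# Keep the KV cache in the default (non-FP8) dtype while using FlashMLA.
# --kv-cache-dtype and the value "auto" are assumed here; verify with --help.
python -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-V3 \
  --attention-backend flashmla \
  --kv-cache-dtype auto
```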
...
@@ -103,9 +107,9 @@ If you set only one of `--prefill-attention-backend` or `--decode-attention-back
If both are specified and differ, SGLang automatically enables a hybrid wrapper to dispatch to the chosen backend per phase.
```
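The per-phase dispatch described above is exercised by naming a different backend for each phase. A minimal sketch (the model path and the two backend values are illustrative; only the flag names come from this document):

```shell
# Use one backend for prefill and another for decode;
# SGLang wraps them in the hybrid dispatcher automatically.
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --prefill-attention-backend fa3 \
  --decode-attention-backend flashinfer
```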
## User Guide
### Launch Command for Different Attention Backends
- FlashInfer (default for non-Hopper machines, e.g., A100, A40)
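For example (the model path is a placeholder; since FlashInfer is already the default on these GPUs, the flag only makes the choice explicit):

```shell
# Explicitly select the FlashInfer attention backend.
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --attention-backend flashinfer
```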