Unverified commit 32ec68fa authored by b8zhong, committed by GitHub

keep attention backend document up to date (#12741)

parent 3c0a6df8
@@ -19,13 +19,13 @@ The support matrix is split into two parts: MHA (standard attention) and MLA (mu
| **FA3 (FlashAttention 3)** | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| **FA4 (FlashAttention 4)** | 128 | ❌ | ❌ | ❌ | ❌ | ❌ |
| **Triton** | ❌ | ❌ | ✅ | ✅ | ✅ | ✅ |
| **Torch Native (SDPA)** | ❌ | ❌ | ❌ | ❌ | ❌ | |
| **FlexAttention (PyTorch)** | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| **TRTLLM MHA** | 16, 32 or 64 | ✅ | ✅ | ❌ | ✅ | ❌ |
| **Dual Chunk FlashAttention** | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ |
| **AITER (ROCm)** | ✅ | ❌ | ✅ | ✅ | ❌ | |
| **Wave (ROCm)** | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ |
| **Ascend (NPU)** | ✅ | ❌ | ❌ | ❌ | ❌ | |
| **Intel XPU** | ✅ | ❌ | ❌ | ❌ | ✅ | ❌ |
### MLA Backends
@@ -41,6 +41,10 @@ The support matrix is split into two parts: MHA (standard attention) and MLA (mu
| **FA4** | 128 | ❌ | ❌ | ❌ | ❌ |
| **Ascend MLA (NPU)** | 128 | ❌ | ❌ | ❌ | ❌ |
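
A hedged sketch of what serving an MLA model with one of these backends could look like; the model path and `--tp` size are placeholders, and `cutlass_mla` is simply one value from the `--attention-backend` option list in the flags table further down:

```bash
# Sketch only: serve a DeepSeek-style MLA model with an explicit MLA backend.
# The model path and --tp size are illustrative placeholders.
python3 -m sglang.launch_server \
  --model deepseek-ai/DeepSeek-V3 \
  --tp 8 \
  --attention-backend cutlass_mla
```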
```{note}
Multimodal attention is selected by `--mm-attention-backend`. The "MultiModal" column indicates whether a corresponding multimodal implementation exists for that backend family.
```
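
As a hedged illustration of the flag (the model path is a placeholder; the values come from the `--mm-attention-backend` options in the flags table further down), a multimodal launch can pin the text and multimodal attention kernels separately:

```bash
# Sketch only: choose the multimodal attention kernels independently of the
# main attention backend. The model path is an illustrative placeholder.
python3 -m sglang.launch_server \
  --model Qwen/Qwen2.5-VL-7B-Instruct \
  --attention-backend fa3 \
  --mm-attention-backend triton_attn
```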
```{warning}
FlashMLA FP8 KV cache is currently not working. See [#8856](https://github.com/sgl-project/sglang/pull/8856). Use a non-FP8 KV cache with FlashMLA, or switch to another backend if FP8 KV cache is required.
```
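
Until that is fixed, one possible workaround sketch is to keep the KV cache in its default non-FP8 dtype when using FlashMLA; the model path, `--tp` size, and the use of `--kv-cache-dtype auto` here are assumptions for illustration:

```bash
# Workaround sketch: run FlashMLA with the default (non-FP8) KV cache dtype.
# Model path, --tp size, and --kv-cache-dtype usage are illustrative assumptions.
python3 -m sglang.launch_server \
  --model deepseek-ai/DeepSeek-V3 \
  --tp 8 \
  --attention-backend flashmla \
  --kv-cache-dtype auto
```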
@@ -103,9 +107,9 @@ If you set only one of `--prefill-attention-backend` or `--decode-attention-back
If both are specified and differ, SGLang automatically enables a hybrid wrapper to dispatch to the chosen backend per phase.
```
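
For example, a hybrid launch might look like the following sketch (the model path is a placeholder; both flags and their values are documented in the flags table at the end of this page):

```bash
# Hybrid dispatch sketch: FA3 kernels for prefill, FlashInfer kernels for decode.
# The model path is an illustrative placeholder.
python3 -m sglang.launch_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --prefill-attention-backend fa3 \
  --decode-attention-backend flashinfer
```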
-## User guide
+## User Guide
-### Launch command for different attention backends.
+### Launch Command for Different Attention Backends
- FlashInfer (Default for Non-Hopper Machines, e.g., A100, A40)
```bash
@@ -212,7 +216,7 @@ python3 -m sglang.launch_server \
--attention-backend flex_attention
```
-- Dual Chunk FlashAttention (MHA-only)
+- Dual Chunk FlashAttention
```bash
python3 -m sglang.launch_server \
--model Qwen/Qwen2.5-14B-Instruct-1M \
@@ -241,7 +241,7 @@ Please consult the documentation below and [server_args.py](https://github.com/s
| `--decode-attention-backend` | Choose the kernels for decode attention layers (takes priority over `--attention-backend`). | `None` | `triton`, `torch_native`, `flex_attention`, `nsa`, `cutlass_mla`, `fa3`, `fa4`, `flashinfer`, `flashmla`, `trtllm_mla`, `trtllm_mha`, `dual_chunk_flash_attn`, `aiter`, `wave`, `intel_amx`, `ascend` |
| `--sampling-backend` | Choose the kernels for sampling layers. | `None` | `flashinfer`, `pytorch` |
| `--grammar-backend` | Choose the backend for grammar-guided decoding. | `None` | `xgrammar`, `outlines`, `llguidance`, `none` |
-| `--mm-attention-backend` | Set multimodal attention backend. | `None` | `sdpa`, `fa3`, `triton_attn`, `ascend_attn` |
+| `--mm-attention-backend` | Set multimodal attention backend. | `None` | `sdpa`, `fa3`, `triton_attn`, `ascend_attn`, `aiter_attn` |
| `--nsa-prefill` | Choose the NSA backend for the prefill stage (overrides `--attention-backend` when running DeepSeek NSA-style attention). | `flashmla_sparse` | `flashmla_sparse`, `flashmla_decode`, `fa3`, `tilelang`, `aiter` |
| `--nsa-decode` | Choose the NSA backend for the decode stage when running DeepSeek NSA-style attention. Overrides `--attention-backend` for decoding. | `flashmla_kv` | `flashmla_prefill`, `flashmla_kv`, `fa3`, `tilelang`, `aiter` |
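
As a combined illustration of how these flags compose (the model path is a placeholder; each value is taken from the option lists above):

```bash
# Sketch only: select the attention, sampling, and grammar backends together.
# The model path is an illustrative placeholder.
python3 -m sglang.launch_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --attention-backend triton \
  --sampling-backend pytorch \
  --grammar-backend xgrammar
```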