Unverified commit 32ec68fa authored by b8zhong, committed by GitHub

keep attention backend document up to date (#12741)

parent 3c0a6df8
@@ -19,13 +19,13 @@ The support matrix is split into two parts: MHA (standard attention) and MLA (mu
| **FA3 (FlashAttention 3)** | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| **FA4 (FlashAttention 4)** | 128 | ❌ | ❌ | ❌ | ❌ | ❌ |
| **Triton** | ❌ | ❌ | ✅ | ✅ | ✅ | ✅ |
| **Torch Native (SDPA)** | ❌ | ❌ | ❌ | ❌ | ❌ | |
| **FlexAttention (PyTorch)** | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| **TRTLLM MHA** | 16, 32 or 64 | ✅ | ✅ | ❌ | ✅ | ❌ |
| **Dual Chunk FlashAttention** | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ |
| **AITER (ROCm)** | ✅ | ❌ | ✅ | ✅ | ❌ | |
| **Wave (ROCm)** | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ |
| **Ascend (NPU)** | ✅ | ❌ | ❌ | ❌ | ❌ | |
| **Intel XPU** | ✅ | ❌ | ❌ | ❌ | ✅ | ❌ |
### MLA Backends
@@ -41,6 +41,10 @@ The support matrix is split into two parts: MHA (standard attention) and MLA (mu
| **FA4** | 128 | ❌ | ❌ | ❌ | ❌ |
| **Ascend MLA (NPU)** | 128 | ❌ | ❌ | ❌ | ❌ |
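
A hedged sketch of what serving an MLA model with one of these backends could look like; the model path and `--tp` size are placeholders, and `cutlass_mla` is simply one value from the `--attention-backend` option list in the flags table further down:

```bash
# Sketch only: serve a DeepSeek-style MLA model with an explicit MLA backend.
# The model path and --tp size are illustrative placeholders.
python3 -m sglang.launch_server \
  --model deepseek-ai/DeepSeek-V3 \
  --tp 8 \
  --attention-backend cutlass_mla
```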
```{note}
Multimodal attention is selected by `--mm-attention-backend`. The "MultiModal" column indicates whether a corresponding multimodal implementation exists for that backend family.
```
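
As a hedged illustration of the flag (the model path is a placeholder; the values come from the `--mm-attention-backend` options in the flags table further down), a multimodal launch can pin the text and multimodal attention kernels separately:

```bash
# Sketch only: choose the multimodal attention kernels independently of the
# main attention backend. The model path is an illustrative placeholder.
python3 -m sglang.launch_server \
  --model Qwen/Qwen2.5-VL-7B-Instruct \
  --attention-backend fa3 \
  --mm-attention-backend triton_attn
```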
```{warning}
FlashMLA FP8 KV cache is currently not working. See [#8856](https://github.com/sgl-project/sglang/pull/8856). Use a non-FP8 KV cache with FlashMLA, or switch to another backend if FP8 KV cache is required.
```
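
Until that is fixed, one possible workaround sketch is to keep the KV cache in its default non-FP8 dtype when using FlashMLA; the model path, `--tp` size, and the use of `--kv-cache-dtype auto` here are assumptions for illustration:

```bash
# Workaround sketch: run FlashMLA with the default (non-FP8) KV cache dtype.
# Model path, --tp size, and --kv-cache-dtype usage are illustrative assumptions.
python3 -m sglang.launch_server \
  --model deepseek-ai/DeepSeek-V3 \
  --tp 8 \
  --attention-backend flashmla \
  --kv-cache-dtype auto
```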
@@ -103,9 +107,9 @@ If you set only one of `--prefill-attention-backend` or `--decode-attention-back
If both are specified and differ, SGLang automatically enables a hybrid wrapper to dispatch to the chosen backend per phase.
```
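
For example, a hybrid launch might look like the following sketch (the model path is a placeholder; both flags and their values are documented in the flags table at the end of this page):

```bash
# Hybrid dispatch sketch: FA3 kernels for prefill, FlashInfer kernels for decode.
# The model path is an illustrative placeholder.
python3 -m sglang.launch_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --prefill-attention-backend fa3 \
  --decode-attention-backend flashinfer
```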
-## User guide
+## User Guide
-### Launch command for different attention backends.
+### Launch Command for Different Attention Backends
- FlashInfer (Default for Non-Hopper Machines, e.g., A100, A40)
```bash
@@ -212,7 +216,7 @@ python3 -m sglang.launch_server \
--attention-backend flex_attention
```
-- Dual Chunk FlashAttention (MHA-only)
+- Dual Chunk FlashAttention
```bash
python3 -m sglang.launch_server \
--model Qwen/Qwen2.5-14B-Instruct-1M \
@@ -241,7 +241,7 @@ Please consult the documentation below and [server_args.py](https://github.com/s
| `--decode-attention-backend` | Choose the kernels for decode attention layers (takes priority over `--attention-backend`). | `None` | `triton`, `torch_native`, `flex_attention`, `nsa`, `cutlass_mla`, `fa3`, `fa4`, `flashinfer`, `flashmla`, `trtllm_mla`, `trtllm_mha`, `dual_chunk_flash_attn`, `aiter`, `wave`, `intel_amx`, `ascend` |
| `--sampling-backend` | Choose the kernels for sampling layers. | `None` | `flashinfer`, `pytorch` |
| `--grammar-backend` | Choose the backend for grammar-guided decoding. | `None` | `xgrammar`, `outlines`, `llguidance`, `none` |
-| `--mm-attention-backend` | Set multimodal attention backend. | `None` | `sdpa`, `fa3`, `triton_attn`, `ascend_attn` |
+| `--mm-attention-backend` | Set multimodal attention backend. | `None` | `sdpa`, `fa3`, `triton_attn`, `ascend_attn`, `aiter_attn` |
| `--nsa-prefill` | Choose the NSA backend for the prefill stage (overrides `--attention-backend` when running DeepSeek NSA-style attention). | `flashmla_sparse` | `flashmla_sparse`, `flashmla_decode`, `fa3`, `tilelang`, `aiter` |
| `--nsa-decode` | Choose the NSA backend for the decode stage when running DeepSeek NSA-style attention. Overrides `--attention-backend` for decoding. | `flashmla_kv` | `flashmla_prefill`, `flashmla_kv`, `fa3`, `tilelang`, `aiter` |
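
As a combined illustration of how these flags compose (the model path is a placeholder; each value is taken from the option lists above):

```bash
# Sketch only: select the attention, sampling, and grammar backends together.
# The model path is an illustrative placeholder.
python3 -m sglang.launch_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --attention-backend triton \
  --sampling-backend pytorch \
  --grammar-backend xgrammar
```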