[Doc] Update documents for FA4 (#11778)

Commit 44f0ece9, authored Oct 19, 2025 by Baizhou Zhang, committed by GitHub Oct 19, 2025

Showing 1 changed file with 5 additions and 4 deletions

docs/advanced_features/attention_backend.md +5 -4

--- a/docs/advanced_features/attention_backend.md
+++ b/docs/advanced_features/attention_backend.md
@@ -17,7 +17,7 @@ The support matrix is split into two parts: MHA (standard attention) and MLA (mu
 |---------------------------------|-----------------------------|------------------|-----------------|-----------------|--------------------|----------------|
 | **FlashInfer**                  | ✅                          | ✅               | ✅              | ✅              | ✅                 | ❌             |
 | **FA3 (FlashAttention 3)**      | ✅                          | ✅               | ✅              | ✅              | ✅                 | ✅             |
-| **FA4 (FlashAttention 4)**      | ❌                          | ❌               | ❌              | ❌              | ❌                 | ❌             |
+| **FA4 (FlashAttention 4)**      | 128                         | ❌               | ❌              | ❌              | ❌                 | ❌             |
 | **Triton**                      | ❌                          | ❌               | ✅              | ✅              | ✅                 | ✅             |
 | **Torch Native (SDPA)**         | ❌                          | ❌               | ❌              | ❌              | ❌                 | ❌             |
 | **FlexAttention (PyTorch)**     | ❌                          | ❌               | ❌              | ❌              | ❌                 | ❌             |
@@ -37,7 +37,7 @@ The support matrix is split into two parts: MHA (standard attention) and MLA (mu
 | **TRTLLM MLA (Blackwell)** | 32 or 64                  | ✅               | ✅                       | ✅              | ❌              |
 | **FA3 (FlashAttention 3)** | n/a                       | ❌               | ✅                       | ✅              | ⚠️ (page_size=1 only) |
 | **Triton**                 | n/a                       | ❌               | ❌                       | ✅              | ⚠️ (page_size=1 only) |
-| **FA4**                    | n/a                       | ❌               | ❌                       | ❌              | ❌              |
+| **FA4**                    | 128                       | ❌               | ❌                       | ❌              | ❌              |
 | **Ascend MLA (NPU)**       | 128                       | ❌               | ❌                       | ❌              | ❌              |

 ```{warning}
@@ -53,13 +53,14 @@ FlashMLA FP8 KV cache is currently not working. See upstream issue [#8856](https
 Speculative decoding topk: `topk` is the number of draft tokens sampled per step from the draft model. `topk = 1` follows classic EAGLE; `topk > 1` explores multiple branches and requires backend support in both draft and verification paths.
 ```

-Note: Many backends that do not natively operate on pages can emulate `page_size > 1` at the wrapper layer by expanding page tables to per-token indices. The "Page Size > 1 (native)" column indicates true in-kernel paging. Some backends require fixed native page sizes and cannot be reduced/emulated differently: TRTLLM MHA (16/32/64), TRTLLM MLA (32/64), FlashMLA (64), Cutlass MLA (128), Ascend (128).
+Note: Many backends that do not natively operate on pages can emulate `page_size > 1` at the wrapper layer by expanding page tables to per-token indices. The "Page Size > 1 (native)" column indicates true in-kernel paging. Some backends require fixed native page sizes and cannot be reduced/emulated differently: TRTLLM MHA (16/32/64), TRTLLM MLA (32/64), FlashMLA (64), Cutlass MLA (128), FA4 (128), Ascend (128).

 MLA page-size constraints:
 - FlashInfer MLA: page_size = 1.
 - FlashMLA: page_size = 64.
 - Cutlass MLA: page_size = 128.
 - TRTLLM MLA: page_size ∈ {32, 64}.
+- FA4: page_size = 128.

 ### Hybrid attention (different backends for prefill vs decode) (Experimental)

@@ -224,7 +225,7 @@ python3 -m sglang.launch_server \
 python3 -m sglang.launch_server \
  --tp 8 \
  --model deepseek-ai/DeepSeek-R1 \
-  --attention-backend fa4 \
+  --prefill-attention-backend fa4 \
  --trust-remote-code
 ```
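
The note in the diff says a wrapper layer can emulate `page_size > 1` by expanding page tables to per-token indices. A minimal sketch of that expansion follows. It assumes KV-cache slots are laid out contiguously per page, so a token's slot is `page_id * page_size + offset`. The function name `expand_page_table` is hypothetical and is not taken from the SGLang codebase.

```python
from typing import List


def expand_page_table(page_table: List[int], page_size: int, seq_len: int) -> List[int]:
    """Expand a per-sequence page table into per-token KV-cache slot indices.

    Assumption (not confirmed by the source): slots within a page are
    contiguous, so slot = page_id * page_size + offset_in_page.
    """
    if page_size < 1:
        raise ValueError("page_size must be >= 1")
    # Ceiling division: number of pages needed to hold seq_len tokens.
    needed_pages = -(-seq_len // page_size)
    if needed_pages > len(page_table):
        raise ValueError("page_table is too short for seq_len")
    return [
        page_table[t // page_size] * page_size + (t % page_size)
        for t in range(seq_len)
    ]


# A 6-token sequence stored in pages 3 and 0 with page_size = 4.
print(expand_page_table([3, 0], page_size=4, seq_len=6))
```

A kernel that only understands per-token indices can consume the expanded list directly. Backends with fixed native page sizes, such as FA4 at 128 per this change, do not go through this kind of emulation.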