Unverified commit 44f0ece9 authored by Baizhou Zhang, committed by GitHub

[Doc] Update documents for FA4 (#11778)

parent be0058bc
@@ -17,7 +17,7 @@ The support matrix is split into two parts: MHA (standard attention) and MLA (mu
|---------------------------------|-----------------------------|------------------|-----------------|-----------------|--------------------|----------------|
| **FlashInfer**                  | ✅                          | ✅               | ✅              | ✅              | ✅                 | ❌             |
| **FA3 (FlashAttention 3)**      | ✅                          | ✅               | ✅              | ✅              | ✅                 | ✅             |
| **FA4 (FlashAttention 4)**      | 128                         | ❌               | ❌              | ❌              | ❌                 | ❌             |
| **Triton**                      | ❌                          | ❌               | ✅              | ✅              | ✅                 | ✅             |
| **Torch Native (SDPA)**         | ❌                          | ❌               | ❌              | ❌              | ❌                 | ❌             |
| **FlexAttention (PyTorch)**     | ❌                          | ❌               | ❌              | ❌              | ❌                 | ❌             |
@@ -37,7 +37,7 @@ The support matrix is split into two parts: MHA (standard attention) and MLA (mu
| **TRTLLM MLA (Blackwell)**  | 32 or 64 | ✅ | ✅ | ✅ | ❌ |
| **FA3 (FlashAttention 3)**  | n/a      | ❌ | ✅ | ✅ | ⚠️ (page_size=1 only) |
| **Triton**                  | n/a      | ❌ | ❌ | ✅ | ⚠️ (page_size=1 only) |
| **FA4**                     | 128      | ❌ | ❌ | ❌ | ❌ |
| **Ascend MLA (NPU)**        | 128      | ❌ | ❌ | ❌ | ❌ |
```{warning}
@@ -53,13 +53,14 @@ FlashMLA FP8 KV cache is currently not working. See upstream issue [#8856](https
Speculative decoding topk: `topk` is the number of draft tokens sampled per step from the draft model. `topk = 1` follows classic EAGLE; `topk > 1` explores multiple branches and requires backend support in both draft and verification paths.
```
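As a toy illustration of the `topk` distinction above (this is not EAGLE itself, and `draft_branches` is a hypothetical helper): with `topk = 1` the draft tokens form a single chain, while `topk > 1` multiplies the candidate branches that the verification path must score.

```python
import itertools


def draft_branches(step_candidates):
    """step_candidates[i] holds the top-k token ids proposed at draft step i.

    Returns every branch (one token per step) formed by the candidates.
    """
    return [list(path) for path in itertools.product(*step_candidates)]


# topk = 1 over three steps: a single chain of draft tokens.
print(draft_branches([[5], [9], [2]]))        # [[5, 9, 2]]

# topk = 2 over two steps: 2 * 2 = 4 branches to verify.
print(len(draft_branches([[5, 7], [9, 1]])))  # 4
```

The branch count grows as `topk ** steps`, which is why `topk > 1` needs explicit support in both the draft and verification kernels.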
Note: Many backends that do not natively operate on pages can emulate `page_size > 1` at the wrapper layer by expanding page tables to per-token indices. The "Page Size > 1 (native)" column indicates true in-kernel paging. Some backends require fixed native page sizes and cannot be reduced/emulated differently: TRTLLM MHA (16/32/64), TRTLLM MLA (32/64), FlashMLA (64), Cutlass MLA (128), FA4 (128), Ascend (128).
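A minimal sketch of the wrapper-layer emulation described above (the function name and layout are illustrative, not SGLang's actual code): a page table with `page_size > 1` is flattened into per-token slot indices, which a kernel that only understands `page_size = 1` can consume directly.

```python
def expand_page_table(page_ids, seq_len, page_size):
    """Expand logical pages into per-token KV-pool slot indices.

    page_ids: physical page id for each logical page of the sequence.
    Assumes a KV pool laid out as contiguous pages of `page_size` slots.
    """
    token_indices = []
    for pos in range(seq_len):
        page = page_ids[pos // page_size]      # which physical page holds this token
        offset = pos % page_size               # slot within that page
        token_indices.append(page * page_size + offset)
    return token_indices


# A 5-token sequence stored in physical pages 7 and 3, page_size = 4:
print(expand_page_table([7, 3], seq_len=5, page_size=4))
# [28, 29, 30, 31, 12]
```

The expansion costs one index per token, which is why native in-kernel paging (the "Page Size > 1 (native)" column) is preferred for long sequences.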
MLA page-size constraints:
- FlashInfer MLA: page_size = 1.
- FlashMLA: page_size = 64.
- Cutlass MLA: page_size = 128.
- TRTLLM MLA: page_size ∈ {32, 64}.
- FA4: page_size = 128.
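The constraints above can be expressed as a lookup table. This is an illustrative check only (the backend keys and function are hypothetical, not SGLang's validation code):

```python
# Fixed native page sizes per MLA backend, from the list above.
MLA_PAGE_SIZES = {
    "flashinfer_mla": {1},
    "flashmla": {64},
    "cutlass_mla": {128},
    "trtllm_mla": {32, 64},
    "fa4": {128},
}


def page_size_supported(backend: str, page_size: int) -> bool:
    """True if `page_size` is valid for `backend` (unknown backends pass)."""
    allowed = MLA_PAGE_SIZES.get(backend)
    return allowed is None or page_size in allowed


print(page_size_supported("trtllm_mla", 64))  # True
print(page_size_supported("fa4", 64))         # False: FA4 requires 128
```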
### Hybrid attention (different backends for prefill vs decode) (Experimental)
@@ -224,7 +225,7 @@ python3 -m sglang.launch_server \
python3 -m sglang.launch_server \
--tp 8 \
--model deepseek-ai/DeepSeek-R1 \
--prefill-attention-backend fa4 \
--trust-remote-code
```