OpenDAS / vllm_cscc
Commit 5e75a14a (Unverified)
Authored Feb 09, 2026 by Michael Goin; committed by GitHub on Feb 09, 2026
[Doc] Add DCP support to attention backend doc (#33936)
Parent: e7e52781
Showing 2 changed files with 769 additions and 644 deletions
docs/design/attention_backends.md (+26 -25)
tools/pre_commit/generate_attention_backend_docs.py (+743 -619)
docs/design/attention_backends.md
@@ -152,6 +152,7 @@ Priority is **1 = highest** (tried first).
 | **Sink** | Attention sink support (for StreamingLLM) |
 | **Sparse** | Sparse attention support (MLA only) |
 | **MM Prefix** | Multimodal prefix full attention support |
+| **DCP** | Decode Context Parallelism support (`--decode-context-parallel-size`) |
 | **Attention Types** | Supported attention patterns (Decoder, Encoder, Enc-Dec) |
 | **Compute Cap.** | Required CUDA compute capability (N/A for non-CUDA backends) |
@@ -159,20 +160,20 @@ Priority is **1 = highest** (tried first).
 ## Standard Attention (MHA, MQA, GQA) Backends

-| Backend | Version | Dtypes | KV Dtypes | Block Sizes | Head Sizes | Sink | MM Prefix | Attention Types | Compute Cap. |
-|---------|---------|--------|-----------|-------------|------------|------|-----------|-----------------|--------------|
-| `CPU_ATTN` | | fp16, bf16, fp32 | `auto` | Any | 32, 64, 80, 96, 112, 128, 160, 192, 224, 256 | ❌ | ❌ | All | N/A |
-| `FLASHINFER` | Native† | fp16, bf16 | `auto`, `bfloat16`, `fp8`, `fp8_e4m3`, `fp8_e5m2` | 16, 32, 64 | 64, 128, 256 | ❌ | ❌ | Decoder | 7.x-9.x |
-| `FLASHINFER` | TRTLLM† | fp16, bf16 | `auto`, `bfloat16`, `fp8`, `fp8_e4m3`, `fp8_e5m2` | 16, 32, 64 | 64, 128, 256 | ✅ | ❌ | Decoder | 10.x |
-| `FLASH_ATTN` | FA2* | fp16, bf16 | `auto`, `bfloat16` | %16 | Any | ❌ | ❌ | All | ≥8.0 |
-| `FLASH_ATTN` | FA3* | fp16, bf16 | `auto`, `bfloat16`, `fp8`, `fp8_e4m3`, `fp8_e5m2` | %16 | Any | ✅ | ❌ | All | 9.x |
-| `FLASH_ATTN_DIFFKV` | | fp16, bf16 | `auto` | Any | Any | ❌ | ❌ | Decoder | Any |
-| `FLEX_ATTENTION` | | fp16, bf16, fp32 | `auto`, `bfloat16` | Any | Any | ❌ | ✅ | Decoder, Encoder Only | Any |
-| `ROCM_AITER_FA` | | fp16, bf16 | `auto` | 16, 32 | 64, 128, 256 | ❌ | ❌ | Decoder | N/A |
-| `ROCM_AITER_UNIFIED_ATTN` | | fp16, bf16 | `auto` | Any | Any | ❌ | ❌ | Decoder | N/A |
-| `ROCM_ATTN` | | fp16, bf16, fp32 | `auto` | 16, 32, 544 | 32, 64, 96, 128, 160, 192, 224, 256 | ❌ | ❌ | Decoder | N/A |
-| `TREE_ATTN` | | fp16, bf16 | `auto` | %16 | 32, 64, 96, 128, 160, 192, 224, 256 | ❌ | ❌ | Decoder | Any |
-| `TRITON_ATTN` | | fp16, bf16, fp32 | `auto`, `bfloat16`, `fp8`, `fp8_e4m3`, `fp8_e5m2` | %16 | Any | ✅ | ✅ | All | Any |
+| Backend | Version | Dtypes | KV Dtypes | Block Sizes | Head Sizes | Sink | MM Prefix | DCP | Attention Types | Compute Cap. |
+|---------|---------|--------|-----------|-------------|------------|------|-----------|-----|-----------------|--------------|
+| `CPU_ATTN` | | fp16, bf16, fp32 | `auto` | Any | 32, 64, 80, 96, 112, 128, 160, 192, 224, 256 | ❌ | ❌ | ❌ | All | N/A |
+| `FLASHINFER` | Native† | fp16, bf16 | `auto`, `bfloat16`, `fp8`, `fp8_e4m3`, `fp8_e5m2` | 16, 32, 64 | 64, 128, 256 | ❌ | ❌ | ✅ | Decoder | 7.x-9.x |
+| `FLASHINFER` | TRTLLM† | fp16, bf16 | `auto`, `bfloat16`, `fp8`, `fp8_e4m3`, `fp8_e5m2` | 16, 32, 64 | 64, 128, 256 | ✅ | ❌ | ✅ | Decoder | 10.x |
+| `FLASH_ATTN` | FA2* | fp16, bf16 | `auto`, `bfloat16` | %16 | Any | ❌ | ❌ | ✅ | All | ≥8.0 |
+| `FLASH_ATTN` | FA3* | fp16, bf16 | `auto`, `bfloat16`, `fp8`, `fp8_e4m3`, `fp8_e5m2` | %16 | Any | ✅ | ❌ | ✅ | All | 9.x |
+| `FLASH_ATTN_DIFFKV` | | fp16, bf16 | `auto` | Any | Any | ❌ | ❌ | ✅ | Decoder | Any |
+| `FLEX_ATTENTION` | | fp16, bf16, fp32 | `auto`, `bfloat16` | Any | Any | ❌ | ✅ | ❌ | Decoder, Encoder Only | Any |
+| `ROCM_AITER_FA` | | fp16, bf16 | `auto` | 16, 32 | 64, 128, 256 | ❌ | ❌ | ❌ | Decoder | N/A |
+| `ROCM_AITER_UNIFIED_ATTN` | | fp16, bf16 | `auto` | Any | Any | ❌ | ❌ | ❌ | Decoder | N/A |
+| `ROCM_ATTN` | | fp16, bf16, fp32 | `auto` | 16, 32, 544 | 32, 64, 96, 128, 160, 192, 224, 256 | ❌ | ❌ | ❌ | Decoder | N/A |
+| `TREE_ATTN` | | fp16, bf16 | `auto` | %16 | 32, 64, 96, 128, 160, 192, 224, 256 | ❌ | ❌ | ❌ | Decoder | Any |
+| `TRITON_ATTN` | | fp16, bf16, fp32 | `auto`, `bfloat16`, `fp8`, `fp8_e4m3`, `fp8_e5m2` | %16 | Any | ✅ | ✅ | ❌ | All | Any |

 > **†** FlashInfer uses TRTLLM attention on Blackwell (SM100), which supports sinks. Disable via `--attention-config.use_trtllm_attention=0`.
 >
@@ -199,14 +200,14 @@ configuration.
 ### Decode Backends

-| Backend | Dtypes | KV Dtypes | Block Sizes | Head Sizes | Sink | Sparse | MM Prefix | Attention Types | Compute Cap. |
-|---------|--------|-----------|-------------|------------|------|--------|-----------|-----------------|--------------|
-| `CUTLASS_MLA` | fp16, bf16 | `auto`, `bfloat16`, `fp8`, `fp8_e4m3` | 128 | Any | ❌ | ❌ | ❌ | Decoder | 10.x |
-| `FLASHINFER_MLA` | fp16, bf16 | `auto`, `bfloat16`, `fp8`, `fp8_e4m3` | 32, 64 | Any | ❌ | ❌ | ❌ | Decoder | 10.x |
-| `FLASHMLA` | fp16, bf16 | `auto`, `bfloat16`, `fp8`, `fp8_e4m3` | 64 | Any | ❌ | ❌ | ❌ | Decoder | 9.x-10.x |
-| `FLASHMLA_SPARSE` | bf16 | `auto`, `bfloat16`, `fp8_ds_mla` | 64 | 576 | ❌ | ✅ | ❌ | Decoder | 9.x-10.x |
-| `FLASH_ATTN_MLA` | fp16, bf16 | `auto`, `bfloat16` | %16 | Any | ❌ | ❌ | ❌ | Decoder | 9.x |
-| `ROCM_AITER_MLA` | fp16, bf16 | `auto` | 1 | Any | ❌ | ❌ | ❌ | Decoder | N/A |
-| `ROCM_AITER_MLA_SPARSE` | fp16, bf16 | `auto` | Any | 576 | ❌ | ❌ | ❌ | Decoder | N/A |
-| `ROCM_AITER_TRITON_MLA` | fp16, bf16 | `auto` | Any | Any | ❌ | ❌ | ❌ | Decoder | N/A |
-| `TRITON_MLA` | fp16, bf16 | `auto`, `bfloat16` | Any | Any | ❌ | ❌ | ❌ | Decoder | Any |
+| Backend | Dtypes | KV Dtypes | Block Sizes | Head Sizes | Sink | Sparse | MM Prefix | DCP | Attention Types | Compute Cap. |
+|---------|--------|-----------|-------------|------------|------|--------|-----------|-----|-----------------|--------------|
+| `CUTLASS_MLA` | fp16, bf16 | `auto`, `bfloat16`, `fp8`, `fp8_e4m3` | 128 | Any | ❌ | ❌ | ❌ | ✅ | Decoder | 10.x |
+| `FLASHINFER_MLA` | fp16, bf16 | `auto`, `bfloat16`, `fp8`, `fp8_e4m3` | 32, 64 | Any | ❌ | ❌ | ❌ | ❌ | Decoder | 10.x |
+| `FLASHMLA` | fp16, bf16 | `auto`, `bfloat16`, `fp8`, `fp8_e4m3` | 64 | Any | ❌ | ❌ | ❌ | ✅ | Decoder | 9.x-10.x |
+| `FLASHMLA_SPARSE` | bf16 | `auto`, `bfloat16`, `fp8_ds_mla` | 64 | 576 | ❌ | ✅ | ❌ | ❌ | Decoder | 9.x-10.x |
+| `FLASH_ATTN_MLA` | fp16, bf16 | `auto`, `bfloat16` | %16 | Any | ❌ | ❌ | ❌ | ✅ | Decoder | 9.x |
+| `ROCM_AITER_MLA` | fp16, bf16 | `auto` | 1 | Any | ❌ | ❌ | ❌ | ❌ | Decoder | N/A |
+| `ROCM_AITER_MLA_SPARSE` | fp16, bf16 | `auto` | Any | 576 | ❌ | ❌ | ❌ | ❌ | Decoder | N/A |
+| `ROCM_AITER_TRITON_MLA` | fp16, bf16 | `auto` | Any | Any | ❌ | ❌ | ❌ | ❌ | Decoder | N/A |
+| `TRITON_MLA` | fp16, bf16 | `auto`, `bfloat16` | Any | Any | ❌ | ❌ | ❌ | ✅ | Decoder | Any |
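
For readers who want to query the new DCP column rather than scan the tables, the values can be summarized as a small lookup. This is only an illustrative sketch: the names `DCP_SUPPORT` and `dcp_capable` are hypothetical and not part of vLLM, and the booleans are transcribed by hand from the two tables in this commit, so they may drift from later revisions of the doc.

```python
# Hypothetical helper, not vLLM API. Values copied from the DCP column
# of the tables in docs/design/attention_backends.md at this commit.
DCP_SUPPORT = {
    "standard": {
        "CPU_ATTN": False,
        "FLASHINFER": True,         # both the Native and TRTLLM rows
        "FLASH_ATTN": True,         # both the FA2 and FA3 rows
        "FLASH_ATTN_DIFFKV": True,
        "FLEX_ATTENTION": False,
        "ROCM_AITER_FA": False,
        "ROCM_AITER_UNIFIED_ATTN": False,
        "ROCM_ATTN": False,
        "TREE_ATTN": False,
        "TRITON_ATTN": False,
    },
    "decode": {
        "CUTLASS_MLA": True,
        "FLASHINFER_MLA": False,
        "FLASHMLA": True,
        "FLASHMLA_SPARSE": False,
        "FLASH_ATTN_MLA": True,
        "ROCM_AITER_MLA": False,
        "ROCM_AITER_MLA_SPARSE": False,
        "ROCM_AITER_TRITON_MLA": False,
        "TRITON_MLA": True,
    },
}


def dcp_capable(table: str) -> list[str]:
    """Return the backends in the given table whose DCP column is a check mark."""
    return sorted(name for name, ok in DCP_SUPPORT[table].items() if ok)


if __name__ == "__main__":
    for table in DCP_SUPPORT:
        print(table, dcp_capable(table))
```

Per the tables, a run that sets `--decode-context-parallel-size` would need one of the backends this helper returns for the relevant table.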
tools/pre_commit/generate_attention_backend_docs.py
This diff is collapsed.