sglang · Commits · a076ec1a

Unverified commit a076ec1a, authored Oct 30, 2025 by b8zhong, committed by GitHub on Oct 30, 2025

Revert "fix llama4 kv cache layout" (#12437)

Parent: 72b5f3d0
Showing 2 changed files with 1 addition and 8 deletions (+1 / -8):
  docs/advanced_features/attention_backend.md   +1 / -1
  python/sglang/srt/server_args.py              +0 / -7
docs/advanced_features/attention_backend.md (view file @ a076ec1a)

@@ -21,7 +21,7 @@ The support matrix is split into two parts: MHA (standard attention) and MLA (multi-head latent attention)
 | **Triton**                    | ❌ | ❌ | ✅ | ✅ | ✅ | ✅ |
 | **Torch Native (SDPA)**       | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
 | **FlexAttention (PyTorch)**   | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
-| **TRTLLM MHA**                | 16, 32 or 64 | ❌ | ✅ | ❌ | ❌ | ❌ |
+| **TRTLLM MHA**                | 16, 32 or 64 | ✅ | ✅ | ❌ | ❌ | ❌ |
 | **Dual Chunk FlashAttention** | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ |
 | **AITER (ROCm)**              | ✅ | ❌ | ✅ | ✅ | ❌ | ❌ |
 | **Wave (ROCm)**               | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ |
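As a usage note, here is a minimal sketch of what the restored ✅ enables in practice: selecting the TRTLLM MHA backend together with an FP8 KV cache through sglang's offline engine. The model id, the Engine keyword arguments, and the "fp8_e4m3" dtype string are illustrative assumptions and are not taken from this commit.

    # Hedged sketch: select the TRTLLM MHA backend whose support-matrix row
    # changes above. Model path and dtype values are examples, not part of
    # this commit.
    import sglang as sgl

    llm = sgl.Engine(
        model_path="meta-llama/Llama-4-Scout-17B-16E-Instruct",  # assumed checkpoint
        attention_backend="trtllm_mha",   # backend named in the support matrix
        kv_cache_dtype="fp8_e4m3",        # assumed FP8 KV-cache dtype string
    )
    print(llm.generate("Hello", {"max_new_tokens": 8}))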
python/sglang/srt/server_args.py (view file @ a076ec1a)

@@ -980,13 +980,6 @@ class ServerArgs:
                 logger.warning(
                     "Use trtllm_mha as attention backend on sm100 for Llama4 model"
                 )
-            if is_sm100_supported() and self.attention_backend == "trtllm_mha":
-                # TODO(brayden): remove this once TRTLLM MHA kernel for FP8 w/ tileSizeKv=128 is available.
-                # This is a Llama 4 specific issue only.
-                self.kv_cache_dtype = "bfloat16"
-                logger.warning(
-                    "Setting kv_cache_dtype to bfloat16 for Llama4 with trtllm_mha backend, due to a missing FlashInfer TRTLLM MHA kernel for FP8 KV Cache"
-                )
             if is_sm100_supported() and self.moe_runner_backend == "auto":
                 if self.quantization in {"fp8", "modelopt_fp8"}:
                     self.moe_runner_backend = "flashinfer_trtllm"
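The removed block had pinned kv_cache_dtype to bfloat16 whenever trtllm_mha was active on SM100 hardware, so after this revert a user-supplied FP8 KV-cache dtype is left in place for Llama 4. For readers unfamiliar with the is_sm100_supported() gate referenced in the hunk, a check of the same general shape can be written with plain torch; the sketch below is illustrative and is not sglang's implementation.

    import torch

    def looks_like_sm100() -> bool:
        # SM100 corresponds to CUDA compute capability 10.x (NVIDIA Blackwell,
        # e.g. B200-class GPUs). This is only a rough stand-in for sglang's
        # own is_sm100_supported() helper.
        if not torch.cuda.is_available():
            return False
        major, _minor = torch.cuda.get_device_capability()
        return major == 10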