sglang, commit 0d9e89ec
Authored Aug 12, 2025 by Jimmy; committed by GitHub Aug 11, 2025
[PD]decode: add CLIP_MAX_NEW_TOKEN for pop_preallocated (#8866)
Parent: 3d64fda3
Showing 2 changed files with 11 additions and 3 deletions:
docs/advanced_features/pd_disaggregation.md (+1, -0)
python/sglang/srt/disaggregation/decode.py (+10, -3)
docs/advanced_features/pd_disaggregation.md @ 0d9e89ec
@@ -67,6 +67,7 @@ Please be aware that this setting will cause prefill instances to take a longer
 | **`SGLANG_DISAGGREGATION_HEARTBEAT_INTERVAL`** | Interval (seconds) between health checks to prefill bootstrap servers | `5.0` |
 | **`SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE`** | Consecutive heartbeat failures before marking prefill server offline | `2` |
 | **`SGLANG_DISAGGREGATION_WAITING_TIMEOUT`** | Timeout (seconds) for receiving KV Cache after request initialization | `300` |
+| **`SGLANG_CLIP_MAX_NEW_TOKENS_ESTIMATION`** | Clip request param "max_tokens" to pre_allocate | `4096` |
 
 If a greater mean TTFT is acceptable, you can `export SGLANG_DISAGGREGATION_WAITING_TIMEOUT=600` (10 minutes) to relax the timeout condition.
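To illustrate how an integer setting such as `SGLANG_CLIP_MAX_NEW_TOKENS_ESTIMATION` with a default of `4096` might be read, here is a minimal sketch. The helper `read_int_env` is hypothetical and is not sglang's `get_int_env_var`; falling back to the default on a missing or non-integer value is an assumption made for this example.

```python
import os


def read_int_env(name: str, default: int) -> int:
    """Hypothetical stand-in for an integer env-var reader (not sglang's code)."""
    raw = os.environ.get(name)
    if raw is None or not raw.strip():
        # Variable unset or empty: use the documented default.
        return default
    try:
        return int(raw)
    except ValueError:
        # Assumption: a non-integer value falls back to the default.
        return default


# 4096 is the default listed in the table row added by this commit.
clip = read_int_env("SGLANG_CLIP_MAX_NEW_TOKENS_ESTIMATION", 4096)
```

Under this sketch, setting `SGLANG_CLIP_MAX_NEW_TOKENS_ESTIMATION=8192` before the process starts would raise the clip value to 8192.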
python/sglang/srt/disaggregation/decode.py @ 0d9e89ec
@@ -51,7 +51,7 @@ from sglang.srt.mem_cache.base_prefix_cache import BasePrefixCache
 from sglang.srt.mem_cache.memory_pool import KVCache, ReqToTokenPool
 from sglang.srt.model_executor.forward_batch_info import ForwardMode
 from sglang.srt.torch_memory_saver_adapter import TorchMemorySaverAdapter
-from sglang.srt.utils import require_mlp_sync
+from sglang.srt.utils import get_int_env_var, require_mlp_sync
 
 logger = logging.getLogger(__name__)
@@ -59,6 +59,10 @@ if TYPE_CHECKING:
     from sglang.srt.managers.schedule_batch import Req
     from sglang.srt.managers.scheduler import Scheduler
 
+DECODE_CLIP_MAX_NEW_TOKEN = get_int_env_var("SGLANG_CLIP_MAX_NEW_TOKENS_ESTIMATION", 4096)
+
 class DecodeReqToTokenPool:
     """
@@ -384,7 +388,10 @@ class DecodePreallocQueue:
                 max(
                     required_tokens_for_request,
                     origin_input_len
-                    + decode_req.req.sampling_params.max_new_tokens
+                    + min(
+                        decode_req.req.sampling_params.max_new_tokens,
+                        DECODE_CLIP_MAX_NEW_TOKEN,
+                    )
                     - retractable_tokens,
                 )
                 > allocatable_tokens
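The effect of the clip on this comparison can be checked with a small standalone sketch. The function names `estimated_tokens` and `exceeds_allocatable` are hypothetical, the numbers are made up for illustration, and what the caller does when the comparison is true is not shown in this hunk, so it is left out here.

```python
DEFAULT_CLIP = 4096  # default shown in the diff for SGLANG_CLIP_MAX_NEW_TOKENS_ESTIMATION


def estimated_tokens(
    required_tokens_for_request: int,
    origin_input_len: int,
    max_new_tokens: int,
    retractable_tokens: int,
    clip: int = DEFAULT_CLIP,
) -> int:
    """Mirror of the max(...) expression in the hunk, with max_new_tokens clipped."""
    return max(
        required_tokens_for_request,
        origin_input_len + min(max_new_tokens, clip) - retractable_tokens,
    )


def exceeds_allocatable(estimate: int, allocatable_tokens: int) -> bool:
    """Mirror of the `> allocatable_tokens` comparison."""
    return estimate > allocatable_tokens


# Illustrative numbers: a 1000-token prompt asking for up to 32768 new tokens.
clipped = estimated_tokens(1100, 1000, 32768, 0)                # 1000 + 4096
unclipped = estimated_tokens(1100, 1000, 32768, 0, clip=32768)  # 1000 + 32768
```

With an allocatable budget of 20000 tokens, the unclipped estimate (33768) exceeds the budget while the clipped estimate (5096) does not, which is the kind of over-reservation for large `max_new_tokens` values that the clip avoids.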
@@ -433,7 +440,7 @@ class DecodePreallocQueue:
         need_space_for_single_req = (
             max(
                 [
-                    x.sampling_params.max_new_tokens
+                    min(x.sampling_params.max_new_tokens, DECODE_CLIP_MAX_NEW_TOKEN)
                     + len(x.origin_input_ids)
                     - retractable_tokens
                     for x in self.scheduler.running_batch.reqs
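The same clip applies to the per-request estimate taken over the running batch. Below is a rough sketch with a hypothetical `FakeReq` dataclass standing in for the scheduler's request objects. The hunk is truncated, so returning 0 for an empty batch is an assumption of this example, not something the source shows.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FakeReq:
    """Hypothetical stand-in for a running request; not sglang's Req class."""
    origin_input_ids: List[int]
    max_new_tokens: int


def need_space_for_single_req(
    reqs: List[FakeReq], retractable_tokens: int, clip: int = 4096
) -> int:
    """Largest clipped per-request estimate across the batch."""
    return max(
        (
            min(r.max_new_tokens, clip) + len(r.origin_input_ids) - retractable_tokens
            for r in reqs
        ),
        default=0,  # assumption: an empty batch needs no space
    )


batch = [
    FakeReq(origin_input_ids=[0] * 100, max_new_tokens=50),
    FakeReq(origin_input_ids=[0] * 10, max_new_tokens=100_000),
]
space = need_space_for_single_req(batch, retractable_tokens=20)
```

Here the second request dominates: its 100000-token limit is clipped to 4096, so it contributes 4096 + 10 - 20 = 4086 rather than 99990.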