OpenDAS / dynamo
Commit 3fa8448b
Authored Nov 12, 2025 by Ziqi Fan; committed by GitHub on Nov 13, 2025

chore: enlarge default KVBM leader-worker timeout and better wording (#4283)

Signed-off-by: Ziqi Fan <ziqif@nvidia.com>

Parent: 1f44fca7
Changes: 5 changed files with 12 additions and 9 deletions

docs/kvbm/trtllm-setup.md (+3 -3)
docs/kvbm/vllm-setup.md (+3 -3)
lib/kvbm/src/block_manager/distributed/leader.rs (+1 -1)
lib/llm/src/block_manager/distributed/worker.rs (+4 -1)
lib/llm/src/block_manager/distributed/zmq.rs (+1 -1)
docs/kvbm/trtllm-setup.md
@@ -115,11 +115,11 @@ trtllm-serve Qwen/Qwen3-0.6B --host localhost --port 8000 --backend pytorch --ex
 ## Troubleshooting
 1. Allocating large memory and disk storage can take some time and lead to KVBM worker initialization timeout.
-To avoid it, please set a longer timeout for leader–worker initialization.
+To avoid it, please set a longer timeout (default 1800 seconds) for leader–worker initialization.
 ```bash
-# 1200 means 1200 seconds timeout
-export DYN_KVBM_LEADER_WORKER_INIT_TIMEOUT_SECS=1200
+# 3600 means 3600 seconds timeout
+export DYN_KVBM_LEADER_WORKER_INIT_TIMEOUT_SECS=3600
 ```
 2. When offloading to disk is enabled, KVBM could fail to start up if fallocate is not supported to create the files.
docs/kvbm/vllm-setup.md
@@ -107,11 +107,11 @@ vllm serve --kv-transfer-config '{"kv_connector":"DynamoConnector","kv_role":"kv
 ## Troubleshooting
 1. Allocating large memory and disk storage can take some time and lead to KVBM worker initialization timeout.
-To avoid it, please set a longer timeout for leader–worker initialization.
+To avoid it, please set a longer timeout (default 1800 seconds) for leader–worker initialization.
 ```bash
-# 1200 means 1200 seconds timeout
-export DYN_KVBM_LEADER_WORKER_INIT_TIMEOUT_SECS=1200
+# 3600 means 3600 seconds timeout
+export DYN_KVBM_LEADER_WORKER_INIT_TIMEOUT_SECS=3600
 ```
 2. When offloading to disk is enabled, KVBM could fail to start up if fallocate is not supported to create the files.
lib/kvbm/src/block_manager/distributed/leader.rs
@@ -16,7 +16,7 @@ const DISK_CACHE: &str = "DYN_KVBM_DISK_CACHE_GB";
 const DISK_CACHE_OVERRIDE: &str = "DYN_KVBM_DISK_CACHE_OVERRIDE_NUM_BLOCKS";
 const LEADER_WORKER_INIT_TIMEOUT_SECS: &str = "DYN_KVBM_LEADER_WORKER_INIT_TIMEOUT_SECS";
-const DEFAULT_INIT_TIMEOUT_SECS: u64 = 120;
+const DEFAULT_INIT_TIMEOUT_SECS: u64 = 1800;
 fn read_env_usize(key: &str) -> Option<usize> {
     std::env::var(key).ok()?.trim().parse::<usize>().ok()
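The hunk shows only the constants, not the code that combines them. As a rough illustration of how an override variable and a compiled-in default typically interact, here is a minimal sketch. The helper names `resolve_init_timeout` and `init_timeout_from_env` are hypothetical and do not appear in the commit, and the fallback to the default on missing or unparsable values is an assumption rather than confirmed repository behavior.

```rust
use std::time::Duration;

/// Default mirrored from the diff above (1800 seconds after this commit).
const DEFAULT_INIT_TIMEOUT_SECS: u64 = 1800;

/// Hypothetical helper: turn the raw value of
/// DYN_KVBM_LEADER_WORKER_INIT_TIMEOUT_SECS into a Duration.
/// Assumption: missing, empty, or unparsable values fall back to the default.
fn resolve_init_timeout(raw: Option<&str>) -> Duration {
    let secs = raw
        .and_then(|v| v.trim().parse::<u64>().ok())
        .unwrap_or(DEFAULT_INIT_TIMEOUT_SECS);
    Duration::from_secs(secs)
}

/// Hypothetical wrapper that reads the variable from the process environment.
#[allow(dead_code)]
fn init_timeout_from_env() -> Duration {
    let raw = std::env::var("DYN_KVBM_LEADER_WORKER_INIT_TIMEOUT_SECS").ok();
    resolve_init_timeout(raw.as_deref())
}
```

Separating the parsing step from the environment read keeps the fallback logic testable without mutating process-wide state.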
lib/llm/src/block_manager/distributed/worker.rs
@@ -423,7 +423,10 @@ struct GatedPing {
 impl Handler for GatedPing {
     async fn handle(&self, mut message: MessageHandle) -> anyhow::Result<()> {
         if !self.state.is_ready() {
-            tracing::info!("Ping received but worker not ready; deferring ACK");
+            tracing::info!(
+                "KVBM worker is under initialization. It could take a while if set with large CPU or DISK cache size. Please wait..."
+            );
+            tracing::debug!("Ping received but worker not ready; deferring ACK");
             // Prevent Drop panic; leader won't get an ACK for this round and will retry.
             message.mark_handled();
             return Ok(());
lib/llm/src/block_manager/distributed/zmq.rs
@@ -220,7 +220,7 @@ impl ZmqActiveMessageLeader {
                     "Timed out waiting for ping readiness after handshake."
                 ));
             }
-            tracing::info!("Handshake: final readiness ping...");
+            tracing::debug!("Handshake: final readiness ping...");
             let ping = this.broadcast(ZMQ_PING_MESSAGE, vec![]).await?;
             tokio::select! {
                 _ = ping => break,
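The worker and leader hunks together describe a retry pattern: the worker withholds its ACK until it is ready, and the leader keeps pinging until it gets one or the init timeout expires. The sketch below models that pattern in synchronous Rust using only the standard library. It is a simplification under stated assumptions, not the repository's implementation, which is async and uses `tokio::select!`. The function `wait_until_ready` and its `ping` closure are hypothetical stand-ins for one broadcast round.

```rust
use std::time::{Duration, Instant};

/// Hypothetical model of the leader side of the handshake.
/// `ping` stands in for one ping round and returns true when the worker ACKs.
/// Retries until an ACK arrives or `timeout` has elapsed, sleeping `interval`
/// between rounds. Returns the number of rounds used on success.
fn wait_until_ready<F: FnMut() -> bool>(
    mut ping: F,
    timeout: Duration,
    interval: Duration,
) -> Result<u32, String> {
    let deadline = Instant::now() + timeout;
    let mut attempts = 0u32;
    loop {
        attempts += 1;
        if ping() {
            return Ok(attempts);
        }
        // No ACK this round: give up once the deadline has passed.
        if Instant::now() >= deadline {
            return Err(format!(
                "Timed out waiting for ping readiness after {attempts} attempt(s)"
            ));
        }
        std::thread::sleep(interval);
    }
}
```

With a long worker initialization, every unanswered round repeats the same log lines, which is consistent with the commit demoting the per-round messages to `debug` while keeping one user-facing `info` message.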