OpenDAS / dynamo
Commit 3fa8448b
Authored Nov 12, 2025 by Ziqi Fan; committed by GitHub, Nov 13, 2025

chore: enlarge default KVBM leader-worker timeout and better wording (#4283)

Signed-off-by: Ziqi Fan <ziqif@nvidia.com>
Parent: 1f44fca7
Showing 5 changed files with 12 additions and 9 deletions (+12 -9)

docs/kvbm/trtllm-setup.md  +3 -3
docs/kvbm/vllm-setup.md  +3 -3
lib/kvbm/src/block_manager/distributed/leader.rs  +1 -1
lib/llm/src/block_manager/distributed/worker.rs  +4 -1
lib/llm/src/block_manager/distributed/zmq.rs  +1 -1
docs/kvbm/trtllm-setup.md
...
@@ -115,11 +115,11 @@ trtllm-serve Qwen/Qwen3-0.6B --host localhost --port 8000 --backend pytorch --ex
 ## Troubleshooting
 1. Allocating large memory and disk storage can take some time and lead to KVBM worker initialization timeout.
-To avoid it, please set a longer timeout for leader–worker initialization.
+To avoid it, please set a longer timeout (default 1800 seconds) for leader–worker initialization.
 ```bash
-# 1200 means 1200 seconds timeout
+# 3600 means 3600 seconds timeout
-export DYN_KVBM_LEADER_WORKER_INIT_TIMEOUT_SECS=1200
+export DYN_KVBM_LEADER_WORKER_INIT_TIMEOUT_SECS=3600
 ```
 2. When offloading to disk is enabled, KVBM could fail to start up if fallocate is not supported to create the files.
...
docs/kvbm/vllm-setup.md
...
@@ -107,11 +107,11 @@ vllm serve --kv-transfer-config '{"kv_connector":"DynamoConnector","kv_role":"kv
 ## Troubleshooting
 1. Allocating large memory and disk storage can take some time and lead to KVBM worker initialization timeout.
-To avoid it, please set a longer timeout for leader–worker initialization.
+To avoid it, please set a longer timeout (default 1800 seconds) for leader–worker initialization.
 ```bash
-# 1200 means 1200 seconds timeout
+# 3600 means 3600 seconds timeout
-export DYN_KVBM_LEADER_WORKER_INIT_TIMEOUT_SECS=1200
+export DYN_KVBM_LEADER_WORKER_INIT_TIMEOUT_SECS=3600
 ```
 2. When offloading to disk is enabled, KVBM could fail to start up if fallocate is not supported to create the files.
...
lib/kvbm/src/block_manager/distributed/leader.rs
...
@@ -16,7 +16,7 @@ const DISK_CACHE: &str = "DYN_KVBM_DISK_CACHE_GB";
 const DISK_CACHE_OVERRIDE: &str = "DYN_KVBM_DISK_CACHE_OVERRIDE_NUM_BLOCKS";
 const LEADER_WORKER_INIT_TIMEOUT_SECS: &str = "DYN_KVBM_LEADER_WORKER_INIT_TIMEOUT_SECS";
-const DEFAULT_INIT_TIMEOUT_SECS: u64 = 120;
+const DEFAULT_INIT_TIMEOUT_SECS: u64 = 1800;
 fn read_env_usize(key: &str) -> Option<usize> {
     std::env::var(key).ok()?.trim().parse::<usize>().ok()
...
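
The hunk above only raises the default; how the override is consumed is not shown. The following is a minimal sketch of one plausible resolution path, assuming the variable holds whole seconds and that an unset or unparsable value falls back to the default. `resolve_init_timeout` is a hypothetical helper, not a function from this commit; only the two constants and the value 1800 come from the diff.

```rust
use std::time::Duration;

// Constant names and the default value are taken from the diff above.
const LEADER_WORKER_INIT_TIMEOUT_SECS: &str = "DYN_KVBM_LEADER_WORKER_INIT_TIMEOUT_SECS";
const DEFAULT_INIT_TIMEOUT_SECS: u64 = 1800;

/// Hypothetical helper (assumption): turn an optional raw value into a
/// timeout, using the default when the value is missing or not a number.
fn resolve_init_timeout(raw: Option<&str>) -> Duration {
    let secs = raw
        .and_then(|v| v.trim().parse::<u64>().ok())
        .unwrap_or(DEFAULT_INIT_TIMEOUT_SECS);
    Duration::from_secs(secs)
}

fn main() {
    // Read the environment once, then keep the parsing logic pure and testable.
    let raw = std::env::var(LEADER_WORKER_INIT_TIMEOUT_SECS).ok();
    let timeout = resolve_init_timeout(raw.as_deref());
    println!("leader-worker init timeout: {}s", timeout.as_secs());
}
```

Keeping the environment read separate from the parsing makes the fallback behaviour easy to check without mutating process state.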
lib/llm/src/block_manager/distributed/worker.rs
...
@@ -423,7 +423,10 @@ struct GatedPing {
 impl Handler for GatedPing {
     async fn handle(&self, mut message: MessageHandle) -> anyhow::Result<()> {
         if !self.state.is_ready() {
-            tracing::info!("Ping received but worker not ready; deferring ACK");
+            tracing::info!(
+                "KVBM worker is under initialization. It could take a while if set with large CPU or DISK cache size. Please wait..."
+            );
+            tracing::debug!("Ping received but worker not ready; deferring ACK");
             // Prevent Drop panic; leader won't get an ACK for this round and will retry.
             message.mark_handled();
             return Ok(());
...
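
The gated-ping pattern in this hunk can be illustrated in isolation. The sketch below is an assumption-laden stand-in, not the crate's code: `MessageHandle`, `WorkerState`, and `PingOutcome` here are simplified hypothetical types, and the panic-on-drop guard is inferred only from the "Prevent Drop panic" comment in the diff.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

#[derive(Debug, PartialEq)]
enum PingOutcome {
    Acked,
    Deferred,
}

/// Hypothetical stand-in for a message that must be explicitly handled.
struct MessageHandle {
    handled: bool,
}

impl MessageHandle {
    fn new() -> Self {
        Self { handled: false }
    }
    fn mark_handled(&mut self) {
        self.handled = true;
    }
}

impl Drop for MessageHandle {
    fn drop(&mut self) {
        // Assumed guard: dropping an unhandled message is treated as a bug.
        if !self.handled && !std::thread::panicking() {
            panic!("message dropped without being handled");
        }
    }
}

struct WorkerState {
    ready: AtomicBool,
}

/// While the worker is still initializing, swallow the ping without an ACK
/// so the leader retries later; once ready, acknowledge it.
fn handle_ping(state: &WorkerState, mut message: MessageHandle) -> PingOutcome {
    if !state.ready.load(Ordering::Acquire) {
        message.mark_handled(); // no ACK this round, but no Drop panic either
        return PingOutcome::Deferred;
    }
    message.mark_handled(); // stands in for sending the real ACK
    PingOutcome::Acked
}

fn main() {
    let state = WorkerState { ready: AtomicBool::new(false) };
    println!("{:?}", handle_ping(&state, MessageHandle::new()));
    state.ready.store(true, Ordering::Release);
    println!("{:?}", handle_ping(&state, MessageHandle::new()));
}
```

Because a deferred ping is indistinguishable from a lost one on the leader side, the leader's retry window has to cover the worker's full initialization time.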
lib/llm/src/block_manager/distributed/zmq.rs
...
@@ -220,7 +220,7 @@ impl ZmqActiveMessageLeader {
                         "Timed out waiting for ping readiness after handshake."
                     ));
                 }
-                tracing::info!("Handshake: final readiness ping...");
+                tracing::debug!("Handshake: final readiness ping...");
                 let ping = this.broadcast(ZMQ_PING_MESSAGE, vec![]).await?;
                 tokio::select! {
                     _ = ping => break,
...
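
For readers without the surrounding file, the leader-side loop visible in this hunk (ping, wait, retry until an overall deadline) can be approximated with the standard library alone. This sketch swaps the `tokio::select!` and ZMQ broadcast for a `std::sync::mpsc` channel with `recv_timeout`; `wait_for_ack` and its parameters are hypothetical, and only the timeout error string is taken from the diff.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::{Duration, Instant};

/// Hypothetical helper: re-ping every `interval` until an ACK arrives or
/// `deadline` elapses. Returns the number of pings sent on success.
fn wait_for_ack(
    acks: &mpsc::Receiver<()>,
    interval: Duration,
    deadline: Duration,
) -> Result<u32, String> {
    let start = Instant::now();
    let mut pings = 0;
    while start.elapsed() < deadline {
        pings += 1; // stands in for broadcasting the readiness ping
        match acks.recv_timeout(interval) {
            Ok(()) => return Ok(pings),
            // No ACK this round (worker deferred it); ping again.
            Err(mpsc::RecvTimeoutError::Timeout) => continue,
            Err(mpsc::RecvTimeoutError::Disconnected) => {
                return Err("worker channel closed".to_string());
            }
        }
    }
    Err("Timed out waiting for ping readiness after handshake.".to_string())
}

fn main() {
    let (tx, rx) = mpsc::channel();
    // Simulated worker that becomes ready after a short initialization.
    let worker = thread::spawn(move || {
        thread::sleep(Duration::from_millis(50));
        let _ = tx.send(());
    });
    let result = wait_for_ack(&rx, Duration::from_millis(20), Duration::from_secs(2));
    worker.join().unwrap();
    println!("{:?}", result);
}
```

With a slow worker and a short deadline this loop fails even though the worker would eventually come up, which is the failure mode the larger default is meant to avoid.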