Commit 3fa8448b (unverified), authored by Ziqi Fan, committed by GitHub

chore: enlarge default KVBM leader-worker timeout and better wording (#4283)


Signed-off-by: Ziqi Fan <ziqif@nvidia.com>
parent 1f44fca7
@@ -115,11 +115,11 @@ trtllm-serve Qwen/Qwen3-0.6B --host localhost --port 8000 --backend pytorch --ex
 ## Troubleshooting
 1. Allocating large memory and disk storage can take some time and lead to KVBM worker initialization timeout.
-To avoid it, please set a longer timeout for leader–worker initialization.
+To avoid it, please set a longer timeout (default 1800 seconds) for leader–worker initialization.
 ```bash
-# 1200 means 1200 seconds timeout
-export DYN_KVBM_LEADER_WORKER_INIT_TIMEOUT_SECS=1200
+# 3600 means 3600 seconds timeout
+export DYN_KVBM_LEADER_WORKER_INIT_TIMEOUT_SECS=3600
 ```
 2. When offloading to disk is enabled, KVBM could fail to start up if fallocate is not supported to create the files.
......
@@ -107,11 +107,11 @@ vllm serve --kv-transfer-config '{"kv_connector":"DynamoConnector","kv_role":"kv
 ## Troubleshooting
 1. Allocating large memory and disk storage can take some time and lead to KVBM worker initialization timeout.
-To avoid it, please set a longer timeout for leader–worker initialization.
+To avoid it, please set a longer timeout (default 1800 seconds) for leader–worker initialization.
 ```bash
-# 1200 means 1200 seconds timeout
-export DYN_KVBM_LEADER_WORKER_INIT_TIMEOUT_SECS=1200
+# 3600 means 3600 seconds timeout
+export DYN_KVBM_LEADER_WORKER_INIT_TIMEOUT_SECS=3600
 ```
 2. When offloading to disk is enabled, KVBM could fail to start up if fallocate is not supported to create the files.
......
@@ -16,7 +16,7 @@ const DISK_CACHE: &str = "DYN_KVBM_DISK_CACHE_GB";
 const DISK_CACHE_OVERRIDE: &str = "DYN_KVBM_DISK_CACHE_OVERRIDE_NUM_BLOCKS";
 const LEADER_WORKER_INIT_TIMEOUT_SECS: &str = "DYN_KVBM_LEADER_WORKER_INIT_TIMEOUT_SECS";
-const DEFAULT_INIT_TIMEOUT_SECS: u64 = 120;
+const DEFAULT_INIT_TIMEOUT_SECS: u64 = 1800;
 fn read_env_usize(key: &str) -> Option<usize> {
     std::env::var(key).ok()?.trim().parse::<usize>().ok()
......
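The hunk above shows only the constants and the `read_env_usize` helper; the code that resolves the timeout is elided. The following is a minimal sketch of one plausible resolution, assuming the variable holds a whole number of seconds and that an unset or unparsable value falls back to the default. The helper names `parse_timeout_secs` and `init_timeout` are hypothetical and do not come from the commit.

```rust
use std::time::Duration;

const LEADER_WORKER_INIT_TIMEOUT_SECS: &str = "DYN_KVBM_LEADER_WORKER_INIT_TIMEOUT_SECS";
const DEFAULT_INIT_TIMEOUT_SECS: u64 = 1800;

/// Hypothetical helper (not from the commit): parse a raw value as seconds,
/// falling back to the default when it is missing or not a valid u64.
fn parse_timeout_secs(raw: Option<&str>) -> u64 {
    raw.and_then(|s| s.trim().parse::<u64>().ok())
        .unwrap_or(DEFAULT_INIT_TIMEOUT_SECS)
}

/// Hypothetical helper (not from the commit): resolve the leader-worker
/// initialization timeout from the environment.
fn init_timeout() -> Duration {
    let raw = std::env::var(LEADER_WORKER_INIT_TIMEOUT_SECS).ok();
    Duration::from_secs(parse_timeout_secs(raw.as_deref()))
}
```

Under these assumptions, `export DYN_KVBM_LEADER_WORKER_INIT_TIMEOUT_SECS=3600` would yield a one-hour timeout, while a typo such as `3600s` would silently fall back to 1800 seconds. Whether the real code falls back or reports an error is not visible in this diff.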
@@ -423,7 +423,10 @@ struct GatedPing {
 impl Handler for GatedPing {
     async fn handle(&self, mut message: MessageHandle) -> anyhow::Result<()> {
         if !self.state.is_ready() {
-            tracing::info!("Ping received but worker not ready; deferring ACK");
+            tracing::info!(
+                "KVBM worker is under initialization. It could take a while if set with large CPU or DISK cache size. Please wait..."
+            );
+            tracing::debug!("Ping received but worker not ready; deferring ACK");
             // Prevent Drop panic; leader won't get an ACK for this round and will retry.
             message.mark_handled();
             return Ok(());
......
@@ -220,7 +220,7 @@ impl ZmqActiveMessageLeader {
             "Timed out waiting for ping readiness after handshake."
         ));
     }
-    tracing::info!("Handshake: final readiness ping...");
+    tracing::debug!("Handshake: final readiness ping...");
     let ping = this.broadcast(ZMQ_PING_MESSAGE, vec![]).await?;
     tokio::select! {
         _ = ping => break,
......
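The two Rust hunks above describe a retry protocol. The worker withholds its ACK until it is ready, and the leader keeps pinging until an ACK arrives or the initialization deadline passes. The following synchronous sketch illustrates that shape only. The real code is async and uses `tokio::select!`, and the function `wait_until_ready` and its parameters are hypothetical stand-ins, not part of the commit.

```rust
use std::time::{Duration, Instant};

/// Hypothetical stand-in for the leader's readiness loop (not from the commit).
/// `ping` models one ping round and returns true once the worker ACKs.
/// Returns the number of ping rounds used, or an error after `timeout`.
fn wait_until_ready<F>(
    mut ping: F,
    timeout: Duration,
    retry_interval: Duration,
) -> Result<u32, String>
where
    F: FnMut() -> bool,
{
    let deadline = Instant::now() + timeout;
    let mut attempts = 0;
    loop {
        // Check the deadline before each round, as the diff context suggests.
        if Instant::now() >= deadline {
            return Err("Timed out waiting for ping readiness after handshake.".to_string());
        }
        attempts += 1;
        if ping() {
            return Ok(attempts);
        }
        // No ACK this round: the worker is still initializing, so retry later.
        std::thread::sleep(retry_interval);
    }
}
```

With this shape, a larger timeout only widens the window in which retries are allowed. It does not slow down the case where the worker becomes ready quickly, which is consistent with raising the default from 120 to 1800 seconds.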