Commit 3fa8448b authored by Ziqi Fan, committed by GitHub

chore: enlarge default KVBM leader-worker timeout and better wording (#4283)


Signed-off-by: Ziqi Fan <ziqif@nvidia.com>
parent 1f44fca7
......@@ -115,11 +115,11 @@ trtllm-serve Qwen/Qwen3-0.6B --host localhost --port 8000 --backend pytorch --ex
## Troubleshooting
1. Allocating large memory and disk storage can take some time and lead to KVBM worker initialization timeout.
-To avoid it, please set a longer timeout for leader–worker initialization.
+To avoid it, please set a longer timeout (default 1800 seconds) for leader–worker initialization.
```bash
-# 1200 means 1200 seconds timeout
-export DYN_KVBM_LEADER_WORKER_INIT_TIMEOUT_SECS=1200
+# 3600 means 3600 seconds timeout
+export DYN_KVBM_LEADER_WORKER_INIT_TIMEOUT_SECS=3600
```
2. When offloading to disk is enabled, KVBM could fail to start up if fallocate is not supported to create the files.
......
......@@ -107,11 +107,11 @@ vllm serve --kv-transfer-config '{"kv_connector":"DynamoConnector","kv_role":"kv
## Troubleshooting
1. Allocating large memory and disk storage can take some time and lead to KVBM worker initialization timeout.
-To avoid it, please set a longer timeout for leader–worker initialization.
+To avoid it, please set a longer timeout (default 1800 seconds) for leader–worker initialization.
```bash
-# 1200 means 1200 seconds timeout
-export DYN_KVBM_LEADER_WORKER_INIT_TIMEOUT_SECS=1200
+# 3600 means 3600 seconds timeout
+export DYN_KVBM_LEADER_WORKER_INIT_TIMEOUT_SECS=3600
```
2. When offloading to disk is enabled, KVBM could fail to start up if fallocate is not supported to create the files.
......
......@@ -16,7 +16,7 @@ const DISK_CACHE: &str = "DYN_KVBM_DISK_CACHE_GB";
const DISK_CACHE_OVERRIDE: &str = "DYN_KVBM_DISK_CACHE_OVERRIDE_NUM_BLOCKS";
const LEADER_WORKER_INIT_TIMEOUT_SECS: &str = "DYN_KVBM_LEADER_WORKER_INIT_TIMEOUT_SECS";
-const DEFAULT_INIT_TIMEOUT_SECS: u64 = 120;
+const DEFAULT_INIT_TIMEOUT_SECS: u64 = 1800;
fn read_env_usize(key: &str) -> Option<usize> {
std::env::var(key).ok()?.trim().parse::<usize>().ok()
......
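For readers following the constant change above, here is a minimal sketch of how a timeout like this might be resolved from the environment with a fallback to the default. It is not the code from this commit: the helpers `parse_timeout` and `init_timeout_from_env` are hypothetical names, and the choice to fall back silently on an unparsable value is an assumption. Only the two constants are taken from the diff.

```rust
use std::time::Duration;

// These two constants mirror the ones visible in the diff.
const LEADER_WORKER_INIT_TIMEOUT_SECS: &str = "DYN_KVBM_LEADER_WORKER_INIT_TIMEOUT_SECS";
const DEFAULT_INIT_TIMEOUT_SECS: u64 = 1800;

/// Hypothetical helper: turn an optional raw string into a timeout.
/// Missing or unparsable values fall back to the default (an assumption).
fn parse_timeout(raw: Option<&str>) -> Duration {
    let secs = raw
        .and_then(|v| v.trim().parse::<u64>().ok())
        .unwrap_or(DEFAULT_INIT_TIMEOUT_SECS);
    Duration::from_secs(secs)
}

/// Hypothetical helper: read the timeout from the process environment.
fn init_timeout_from_env() -> Duration {
    let raw = std::env::var(LEADER_WORKER_INIT_TIMEOUT_SECS).ok();
    parse_timeout(raw.as_deref())
}
```

Splitting the parsing from the environment lookup keeps the fallback logic testable without mutating process-wide environment variables.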
......@@ -423,7 +423,10 @@ struct GatedPing {
impl Handler for GatedPing {
async fn handle(&self, mut message: MessageHandle) -> anyhow::Result<()> {
if !self.state.is_ready() {
-tracing::info!("Ping received but worker not ready; deferring ACK");
+tracing::info!(
+"KVBM worker is under initialization. It could take a while if set with large CPU or DISK cache size. Please wait..."
+);
+tracing::debug!("Ping received but worker not ready; deferring ACK");
// Prevent Drop panic; leader won't get an ACK for this round and will retry.
message.mark_handled();
return Ok(());
......
......@@ -220,7 +220,7 @@ impl ZmqActiveMessageLeader {
"Timed out waiting for ping readiness after handshake."
));
}
-tracing::info!("Handshake: final readiness ping...");
+tracing::debug!("Handshake: final readiness ping...");
let ping = this.broadcast(ZMQ_PING_MESSAGE, vec![]).await?;
tokio::select! {
_ = ping => break,
......
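The last two hunks describe a leader that keeps pinging until the worker acknowledges readiness, or gives up once the initialization timeout elapses. The sketch below illustrates that retry-until-deadline pattern only. It is a blocking, standard-library simplification and an assumption throughout: the actual code uses async ZMQ broadcasts and `tokio::select!`, and the function `wait_until_ready`, its parameters, and the attempt counter are hypothetical.

```rust
use std::time::{Duration, Instant};

/// Hypothetical sketch: call `is_ready` repeatedly until it returns true
/// or `timeout` has elapsed. Returns the number of attempts on success.
fn wait_until_ready<F: FnMut() -> bool>(
    mut is_ready: F,
    timeout: Duration,
    interval: Duration,
) -> Result<u32, String> {
    let deadline = Instant::now() + timeout;
    let mut attempts = 0u32;
    loop {
        attempts += 1;
        // A "ping": the worker only acknowledges once it is initialized.
        if is_ready() {
            return Ok(attempts);
        }
        // No ACK this round; give up if the deadline has passed.
        if Instant::now() >= deadline {
            return Err("Timed out waiting for ping readiness after handshake.".to_string());
        }
        // Otherwise wait briefly and retry.
        std::thread::sleep(interval);
    }
}
```

With a pattern like this, a longer timeout simply allows more retry rounds, which is why a worker allocating a large CPU or disk cache benefits from the larger default.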