Commit 1bec3555 (unverified), authored by Ziqi Fan, committed by GitHub

docs: add details on KVBM metrics to help customer self-debug perf (#4316)


Signed-off-by: Ziqi Fan <ziqif@nvidia.com>
parent af26a013
File suppressed by a .gitattributes entry or the file's encoding is unsupported.
@@ -112,24 +112,6 @@ Alternatively, can use "trtllm-serve" with KVBM by replacing the above two [DYNA
trtllm-serve Qwen/Qwen3-0.6B --host localhost --port 8000 --backend pytorch --extra_llm_api_options /tmp/kvbm_llm_api_config.yaml
```
## Troubleshooting
1. Allocating large memory and disk storage can take some time and lead to KVBM worker initialization timeout.
To avoid it, please set a longer timeout (default 1800 seconds) for leader–worker initialization.
```bash
# 3600 means 3600 seconds timeout
export DYN_KVBM_LEADER_WORKER_INIT_TIMEOUT_SECS=3600
```
2. When offloading to disk is enabled, KVBM could fail to start up if fallocate is not supported to create the files.
To bypass the issue, please use disk zerofill fallback.
```bash
# Set to true to enable fallback behavior when disk operations fail (e.g. fallocate not available)
export DYN_KVBM_DISK_ZEROFILL_FALLBACK=true
```
## Enable and View KVBM Metrics
Follow the steps below to enable metrics collection and view the metrics in a Grafana dashboard:
@@ -152,6 +134,37 @@ sudo ufw allow 6880/tcp
View Grafana metrics at http://localhost:3000 (default login: dynamo/dynamo) and look for the KVBM Dashboard.
KVBM currently provides the following metrics out of the box:
- `kvbm_matched_tokens`: Number of matched tokens
- `kvbm_offload_blocks_d2h`: Number of blocks offloaded from device to host
- `kvbm_offload_blocks_h2d`: Number of blocks offloaded from host to disk
- `kvbm_offload_blocks_d2d`: Number of blocks offloaded from device to disk (bypassing host memory)
- `kvbm_onboard_blocks_d2d`: Number of blocks onboarded from disk to device
- `kvbm_onboard_blocks_h2d`: Number of blocks onboarded from host to device
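
To read these counters without opening Grafana, the Prometheus text output can be summed on the command line. The following is a minimal sketch, not part of KVBM: `sum_metric` is a hypothetical helper, the sample values are made up, and it assumes the metrics are exposed under exactly the names listed above, without trailing timestamps.

```bash
# Hypothetical helper: sum all samples of one metric from Prometheus
# text-format input on stdin. Assumes each sample line ends with its value.
sum_metric() {
  awk -v name="$1" '
    /^#/ { next }                       # skip HELP/TYPE comment lines
    {
      metric = $1
      sub(/[{].*/, "", metric)          # drop a label suffix, if present
      if (metric == name) total += $NF  # the value is the last field
    }
    END { printf "%d\n", total }
  '
}

# Made-up sample input; in practice this might come from something like
#   curl -s http://localhost:6880/metrics
# (the /metrics path is an assumption, not confirmed by this document).
sample='# TYPE kvbm_onboard_blocks_h2d counter
kvbm_onboard_blocks_h2d 120
kvbm_onboard_blocks_d2d 30
kvbm_matched_tokens 4096'

h2d=$(printf '%s\n' "$sample" | sum_metric kvbm_onboard_blocks_h2d)
d2d=$(printf '%s\n' "$sample" | sum_metric kvbm_onboard_blocks_d2d)
echo "onboarded blocks (host + disk): $((h2d + d2d))"
```

A total that stays near zero under load would suggest that few offloaded blocks are being reused.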
## Troubleshooting
1. If enabling KVBM yields no TTFT improvement, or even degrades performance, one possible cause is a low prefix cache hit rate in KVBM, which leaves few offloaded KV blocks to reuse.
To check, enable KVBM metrics as described above and look at the Grafana dashboard panels `Onboard Blocks - Host to Device` and `Onboard Blocks - Disk to Device`.
If they show a large number of onboarded KV blocks, as in the example below, this cause can be ruled out:
![Grafana Example](kvbm_metrics_grafana.png)
2. Allocating large memory and disk storage can take some time and lead to KVBM worker initialization timeout.
To avoid it, please set a longer timeout (default 1800 seconds) for leader–worker initialization.
```bash
# Timeout in seconds (3600 s = 1 hour)
export DYN_KVBM_LEADER_WORKER_INIT_TIMEOUT_SECS=3600
```
3. When offloading to disk is enabled, KVBM can fail to start if the filesystem does not support fallocate for creating its files.
To work around this, enable the disk zerofill fallback.
```bash
# Set to true to enable fallback behavior when disk operations fail (e.g. fallocate not available)
export DYN_KVBM_DISK_ZEROFILL_FALLBACK=true
```
## Benchmark KVBM
Once the model is loaded and ready, follow the steps below to use LMBenchmark to benchmark KVBM performance:
......
@@ -104,24 +104,6 @@ Alternatively, can use `vllm serve` directly to use KVBM for aggregated serving:
vllm serve --kv-transfer-config '{"kv_connector":"DynamoConnector","kv_role":"kv_both", "kv_connector_module_path": "kvbm.vllm_integration.connector"}' Qwen/Qwen3-0.6B
```
## Troubleshooting
1. Allocating large memory and disk storage can take some time and lead to KVBM worker initialization timeout.
To avoid it, please set a longer timeout (default 1800 seconds) for leader–worker initialization.
```bash
# 3600 means 3600 seconds timeout
export DYN_KVBM_LEADER_WORKER_INIT_TIMEOUT_SECS=3600
```
2. When offloading to disk is enabled, KVBM could fail to start up if fallocate is not supported to create the files.
To bypass the issue, please use disk zerofill fallback.
```bash
# Set to true to enable fallback behavior when disk operations fail (e.g. fallocate not available)
export DYN_KVBM_DISK_ZEROFILL_FALLBACK=true
```
## Enable and View KVBM Metrics
Follow the steps below to enable metrics collection and view the metrics in a Grafana dashboard:
@@ -145,6 +127,37 @@ sudo ufw allow 6880/tcp
View Grafana metrics at http://localhost:3000 (default login: dynamo/dynamo) and look for the KVBM Dashboard.
KVBM currently provides the following metrics out of the box:
- `kvbm_matched_tokens`: Number of matched tokens
- `kvbm_offload_blocks_d2h`: Number of blocks offloaded from device to host
- `kvbm_offload_blocks_h2d`: Number of blocks offloaded from host to disk
- `kvbm_offload_blocks_d2d`: Number of blocks offloaded from device to disk (bypassing host memory)
- `kvbm_onboard_blocks_d2d`: Number of blocks onboarded from disk to device
- `kvbm_onboard_blocks_h2d`: Number of blocks onboarded from host to device
## Troubleshooting
1. If enabling KVBM yields no TTFT improvement, or even degrades performance, one possible cause is a low prefix cache hit rate in KVBM, which leaves few offloaded KV blocks to reuse.
To check, enable KVBM metrics as described above and look at the Grafana dashboard panels `Onboard Blocks - Host to Device` and `Onboard Blocks - Disk to Device`.
If they show a large number of onboarded KV blocks, as in the example below, this cause can be ruled out:
![Grafana Example](kvbm_metrics_grafana.png)
2. Allocating large memory and disk storage can take some time and lead to KVBM worker initialization timeout.
To avoid it, please set a longer timeout (default 1800 seconds) for leader–worker initialization.
```bash
# Timeout in seconds (3600 s = 1 hour)
export DYN_KVBM_LEADER_WORKER_INIT_TIMEOUT_SECS=3600
```
3. When offloading to disk is enabled, KVBM can fail to start if the filesystem does not support fallocate for creating its files.
To work around this, enable the disk zerofill fallback.
```bash
# Set to true to enable fallback behavior when disk operations fail (e.g. fallocate not available)
export DYN_KVBM_DISK_ZEROFILL_FALLBACK=true
```
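
Whether the fallback is needed can be estimated before launch by probing the target directory. The sketch below is an illustration only: `can_fallocate` is a hypothetical helper, `/tmp` is a placeholder for the actual disk cache directory, and it uses the util-linux `fallocate` command, which may not exercise exactly the same call that KVBM makes.

```bash
# Hypothetical helper: succeed if a file in the given directory can be
# preallocated with the util-linux fallocate command.
can_fallocate() {
  local dir="$1" probe status
  probe=$(mktemp "$dir/kvbm_probe.XXXXXX") || return 2  # could not create probe file
  fallocate -l 1M "$probe" 2>/dev/null
  status=$?
  rm -f "$probe"                                        # always clean up the probe
  [ "$status" -eq 0 ]
}

cache_dir=/tmp  # placeholder; point this at the directory used for disk offloading

# Any failure, including a missing fallocate binary or an unwritable
# directory, is treated here as "fallocate not supported".
if ! can_fallocate "$cache_dir"; then
  export DYN_KVBM_DISK_ZEROFILL_FALLBACK=true
fi
```

The probe is conservative by design: it enables the fallback whenever preallocation cannot be confirmed.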
## Benchmark KVBM
Once the model is loaded and ready, follow the steps below to use LMBenchmark to benchmark KVBM performance:
......