Unverified commit 8e72fb69, authored by Keiven C, committed by GitHub

docs: DYN-1967 update metrics docs after kvstats removal (#5704)


Signed-off-by: Keiven Chang <keivenchang@users.noreply.github.com>
Co-authored-by: Keiven Chang <keivenchang@users.noreply.github.com>
parent e5557803
@@ -227,7 +227,6 @@ Dynamo exports several metrics useful for autoscaling. These are available at th
 | `dynamo_frontend_time_to_first_token_seconds` | Histogram | TTFT latency | ✅ Workers |
 | `dynamo_frontend_inter_token_latency_seconds` | Histogram | ITL latency | ✅ Decode |
 | `dynamo_frontend_request_duration_seconds` | Histogram | Total request duration | ⚠️ General |
-| `kvstats_gpu_cache_usage_percent` | Gauge | GPU KV cache usage (0-1) | ✅ Decode |
 #### Metric Labels
@@ -641,7 +640,7 @@ Avoid configuring multiple autoscalers for the same service:
 |--------------|---------------------|---------------|
 | Frontend | CPU utilization, request rate | `dynamo_frontend_requests_total` |
 | Prefill | Queue depth, TTFT | `dynamo_frontend_queued_requests`, `dynamo_frontend_time_to_first_token_seconds` |
-| Decode | KV cache utilization, ITL | `kvstats_gpu_cache_usage_percent`, `dynamo_frontend_inter_token_latency_seconds` |
+| Decode | ITL | `dynamo_frontend_inter_token_latency_seconds` |
 ### 3. Configure Stabilization Windows
......
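With `kvstats_gpu_cache_usage_percent` removed, the Decode row leaves the ITL histogram as the recommended scaling signal. A histogram has to be reduced to a single value (for example p95) before an autoscaler can act on it. The sketch below shows one way to estimate a quantile from cumulative bucket counts, using the linear interpolation that Prometheus's `histogram_quantile()` applies. The function name `estimate_quantile` and the sample bucket values are illustrative assumptions, not Dynamo output.

```python
import math

# Illustrative (made-up) cumulative buckets for an ITL histogram:
# (upper bound in seconds, cumulative observation count).
SAMPLE_BUCKETS = [(0.01, 10), (0.05, 60), (0.1, 90), (math.inf, 100)]


def estimate_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    Interpolates linearly inside the bucket that contains the target rank,
    treating the first bucket's lower bound as 0.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    ordered = sorted(buckets)
    if not ordered or not math.isinf(ordered[-1][0]):
        raise ValueError("buckets must include a +Inf upper bound")

    total = ordered[-1][1]
    if total == 0:
        return math.nan

    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in ordered:
        if count >= rank:
            if math.isinf(upper_bound):
                # The quantile falls in the +Inf bucket: report the
                # highest finite bound instead of infinity.
                return lower_bound
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound
```

In a real deployment the bucket counts would come from the `_bucket` series of `dynamo_frontend_inter_token_latency_seconds`, typically as a rate over a time window rather than raw totals.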
@@ -123,19 +123,6 @@ DYN_SYSTEM_PORT=8081 python -m dynamo.vllm --model <model>
 curl http://localhost:8081/metrics
 ```
-### KV Router Statistics (kvstats)
-KV router statistics are automatically exposed by LLM workers and KV router components on the backend system status port (port 8081) with the `dynamo_component_kvstats_*` prefix. These metrics provide insights into GPU memory usage and cache efficiency:
-- `dynamo_component_kvstats_active_blocks`: Number of active KV cache blocks currently in use (gauge)
-- `dynamo_component_kvstats_total_blocks`: Total number of KV cache blocks available (gauge)
-- `dynamo_component_kvstats_gpu_cache_usage_percent`: GPU cache usage as a percentage (0.0-1.0) (gauge)
-- `dynamo_component_kvstats_gpu_prefix_cache_hit_rate`: GPU prefix cache hit rate as a percentage (0.0-1.0) (gauge)
-These metrics are published by:
-- **LLM Workers**: vLLM and TRT-LLM backends publish these metrics through their respective publishers
-- **KV Router**: The KV router component aggregates and exposes these metrics for load balancing decisions
 ### Specialized Component Metrics
 Some components expose additional metrics specific to their functionality:
......
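Since this change drops the `dynamo_component_kvstats_*` section from the docs, one quick sanity check is to confirm that a worker's `/metrics` output no longer lists such series. The following is a minimal sketch that scans Prometheus text-format output for metric names with a given prefix. The helper name `metric_names_with_prefix` and the sample payload are hypothetical, and whether a given build still exports these series is an assumption to verify against a running worker.

```python
# Hypothetical sample payload in Prometheus text exposition format.
SAMPLE_METRICS = """\
# HELP dynamo_frontend_requests_total Total requests
# TYPE dynamo_frontend_requests_total counter
dynamo_frontend_requests_total{model="m"} 12
dynamo_component_kvstats_active_blocks 3
"""


def metric_names_with_prefix(exposition_text, prefix):
    """Return the sorted, de-duplicated metric names that start with prefix."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        # Skip blank lines and HELP/TYPE comment lines.
        if not line or line.startswith("#"):
            continue
        # A sample line looks like `name{labels} value` or `name value`.
        head = line.split("{", 1)[0].split()
        if head and head[0].startswith(prefix):
            names.add(head[0])
    return sorted(names)
```

Feeding it the body returned by `curl http://localhost:8081/metrics` together with the prefix `dynamo_component_kvstats_` should yield an empty list if the series are no longer exported.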
@@ -233,7 +233,6 @@ Dynamo exports several metrics useful for autoscaling. These are available at th
 | `dynamo_frontend_time_to_first_token_seconds` | Histogram | TTFT latency | ✅ Workers |
 | `dynamo_frontend_inter_token_latency_seconds` | Histogram | ITL latency | ✅ Decode |
 | `dynamo_frontend_request_duration_seconds` | Histogram | Total request duration | ⚠️ General |
-| `kvstats_gpu_cache_usage_percent` | Gauge | GPU KV cache usage (0-1) | ✅ Decode |
 #### Metric Labels
@@ -647,7 +646,7 @@ Avoid configuring multiple autoscalers for the same service:
 |--------------|---------------------|---------------|
 | Frontend | CPU utilization, request rate | `dynamo_frontend_requests_total` |
 | Prefill | Queue depth, TTFT | `dynamo_frontend_queued_requests`, `dynamo_frontend_time_to_first_token_seconds` |
-| Decode | KV cache utilization, ITL | `kvstats_gpu_cache_usage_percent`, `dynamo_frontend_inter_token_latency_seconds` |
+| Decode | ITL | `dynamo_frontend_inter_token_latency_seconds` |
 ### 3. Configure Stabilization Windows
......
@@ -122,19 +122,6 @@ DYN_SYSTEM_PORT=8081 python -m dynamo.vllm --model <model>
 curl http://localhost:8081/metrics
 ```
-### KV Router Statistics (kvstats)
-KV router statistics are automatically exposed by LLM workers and KV router components on the backend system status port (port 8081) with the `dynamo_component_kvstats_*` prefix. These metrics provide insights into GPU memory usage and cache efficiency:
-- `dynamo_component_kvstats_active_blocks`: Number of active KV cache blocks currently in use (gauge)
-- `dynamo_component_kvstats_total_blocks`: Total number of KV cache blocks available (gauge)
-- `dynamo_component_kvstats_gpu_cache_usage_percent`: GPU cache usage as a percentage (0.0-1.0) (gauge)
-- `dynamo_component_kvstats_gpu_prefix_cache_hit_rate`: GPU prefix cache hit rate as a percentage (0.0-1.0) (gauge)
-These metrics are published by:
-- **LLM Workers**: vLLM and TRT-LLM backends publish these metrics through their respective publishers
-- **KV Router**: The KV router component aggregates and exposes these metrics for load balancing decisions
 ### Specialized Component Metrics
 Some components expose additional metrics specific to their functionality:
......
@@ -16,21 +16,20 @@ cargo run -p dynamo-codegen --bin gen-python-prometheus-names
 - Parses Rust AST from `lib/runtime/src/metrics/prometheus_names.rs`
 - Generates Python classes with constants at `lib/bindings/python/src/dynamo/prometheus_names.py`
-- Handles macro-generated constants (e.g., `kvstats_name!("active_blocks")` → `"kvstats_active_blocks"`)
 ### Example
 **Rust input:**
 ```rust
-pub mod kvstats {
-    pub const ACTIVE_BLOCKS: &str = kvstats_name!("active_blocks");
+pub mod kvrouter {
+    pub const KV_CACHE_EVENTS_APPLIED: &str = "kv_cache_events_applied";
 }
 ```
 **Python output:**
 ```python
-class kvstats:
-    ACTIVE_BLOCKS = "kvstats_active_blocks"
+class kvrouter:
+    KV_CACHE_EVENTS_APPLIED = "kv_cache_events_applied"
 ```
 ### When to run
......
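To make the Rust-to-Python mapping in the example concrete, here is a deliberately simplified sketch of the transformation. It is not the `gen-python-prometheus-names` implementation: the real tool is described as parsing the Rust AST, whereas this stand-in uses regular expressions and only handles flat `pub mod` blocks with plain string-literal constants. The function name `rust_consts_to_python` is hypothetical.

```python
import re

# Matches a flat `pub mod name { ... }` block (no nested braces).
MOD_RE = re.compile(r"pub\s+mod\s+(\w+)\s*\{(.*?)\}", re.DOTALL)
# Matches `pub const NAME: &str = "value";` with a plain string literal.
CONST_RE = re.compile(r'pub\s+const\s+(\w+)\s*:\s*&str\s*=\s*"([^"]*)"\s*;')


def rust_consts_to_python(rust_source):
    """Emit one Python class per Rust module, one attribute per constant."""
    lines = []
    for mod_name, body in MOD_RE.findall(rust_source):
        consts = CONST_RE.findall(body)
        if not consts:
            continue  # skip modules without string-literal constants
        lines.append(f"class {mod_name}:")
        for const_name, value in consts:
            lines.append(f'    {const_name} = "{value}"')
        lines.append("")
    return "\n".join(lines).rstrip() + "\n"


RUST_INPUT = """
pub mod kvrouter {
    pub const KV_CACHE_EVENTS_APPLIED: &str = "kv_cache_events_applied";
}
"""

if __name__ == "__main__":
    print(rust_consts_to_python(RUST_INPUT), end="")
```

A regex approach like this cannot expand macros such as the removed `kvstats_name!(...)`, which is one reason an AST-based generator is the more robust design.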
@@ -196,7 +196,7 @@ Parses lib/runtime/src/metrics/prometheus_names.rs and generates a pure Python
 module with 1:1 constant mappings at lib/bindings/python/src/dynamo/prometheus_names.py
 This allows Python code to import Prometheus metric constants without Rust bindings:
-from dynamo.prometheus_names import frontend_service, kvstats
+from dynamo.prometheus_names import frontend_service
 OPTIONS:
 --source PATH Path to Rust source file
......