Unverified Commit ff06b17e authored by Keiven C, committed by GitHub

fix: guarantee RouterRequestMetrics availability & documentation updates (#6558)


Signed-off-by: Keiven Chang <keivenchang@users.noreply.github.com>
Co-authored-by: Keiven Chang <keivenchang@users.noreply.github.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
parent 6c2714e0
......@@ -103,6 +103,44 @@ We also supports running lightweight mock engines that simulate vLLM behavior wi
**Note**: The `--speedup-ratio` parameter controls the inference speed of mocker engines. A higher value (e.g., 2.0) makes the mocker engines simulate faster inference, allowing benchmarks to complete more quickly. This is particularly useful for testing router performance without waiting for realistic inference times.
#### Disaggregated Serving with Mockers (No GPU Required)
You can test disaggregated serving entirely with mockers by launching separate prefill and decode mocker groups that share a namespace. This is useful for validating routing logic, metrics, and the prefill-decode handoff without any GPUs.
```bash
NAMESPACE="test-disagg"
MODEL="Qwen/Qwen3-0.6B"
# Terminal 1: Decode mockers (2 workers)
python -m dynamo.mocker --model-path "$MODEL" \
--endpoint "dyn://${NAMESPACE}.backend.generate" \
--disaggregation-mode decode --num-workers 2 \
--speedup-ratio 10 --block-size 16
# Terminal 2: Prefill mockers (2 workers)
python -m dynamo.mocker --model-path "$MODEL" \
--endpoint "dyn://${NAMESPACE}.prefill.generate" \
--disaggregation-mode prefill --num-workers 2 \
--speedup-ratio 10 --block-size 16
# Terminal 3: Frontend with KV router
# --model-path must be the on-disk snapshot directory
MODEL_PATH=$(find ~/.cache/huggingface/hub/models--Qwen--Qwen3-0.6B/snapshots -mindepth 1 -maxdepth 1 -type d | head -1)
python -m dynamo.frontend --namespace "$NAMESPACE" \
--model-name "$MODEL" --model-path "$MODEL_PATH" \
--router-mode kv --http-port 8000 --kv-cache-block-size 16
```
Verify it works:
```bash
# Send a request (should show prefill_worker_id and decode_worker_id in nvext)
curl -s localhost:8000/v1/chat/completions -H "Content-Type: application/json" \
-d '{"model":"Qwen/Qwen3-0.6B","messages":[{"role":"user","content":"Hello"}],"max_tokens":10}' | python3 -m json.tool
# Check router metrics
curl -s localhost:8000/metrics | grep "^# HELP dynamo_component_router"
```
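The `grep` check above can also be scripted, for example in a smoke test. The following is a minimal sketch that filters Prometheus text exposition for `# HELP` lines of router metrics. The helper name `router_metric_names` and the `SAMPLE` payload (including its label values) are illustrative assumptions, not output captured from a real deployment.

```python
# Illustrative sample of /metrics output; names follow the docs above,
# label values and samples are made up for this sketch.
SAMPLE = """\
# HELP dynamo_component_router_requests_total Total number of requests processed by the router
# TYPE dynamo_component_router_requests_total counter
dynamo_component_router_requests_total{dynamo_namespace="test-disagg"} 0
# HELP dynamo_component_uptime_seconds DistributedRuntime uptime
# TYPE dynamo_component_uptime_seconds gauge
dynamo_component_uptime_seconds 12.5
"""


def router_metric_names(exposition: str, prefix: str = "dynamo_component_router") -> list[str]:
    """Return metric names whose `# HELP` line starts with `prefix`."""
    names = []
    for line in exposition.splitlines():
        # A HELP line looks like: "# HELP <name> <free-form description>"
        parts = line.split(maxsplit=3)
        if len(parts) >= 3 and parts[0] == "#" and parts[1] == "HELP":
            if parts[2].startswith(prefix):
                names.append(parts[2])
    return names


print(router_metric_names(SAMPLE))
```

Against a running frontend, the same function could be fed the body of `http://localhost:8000/metrics` (fetched with `urllib.request.urlopen`, for instance) instead of `SAMPLE`.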
### Step 2: Start the Router
In a **new terminal**, launch the Dynamo router using the Python CLI:
......
......@@ -264,9 +264,12 @@ The `router_temperature` parameter controls routing randomness:
## Prometheus Metrics
The router exposes Prometheus metrics on the frontend's HTTP port (default 8000) at `/metrics`. All router metrics require `--router-mode kv` and will not appear when using `round-robin` or `random` routing.
The router exposes Prometheus metrics on the frontend's HTTP port (default 8000) at `/metrics`:
For the full list of router metrics (`dynamo_router_*`, `dynamo_router_overhead_*`, per-worker gauges), see the [Metrics reference](../../observability/metrics.md#router-metrics).
- **Router request metrics** (`dynamo_component_router_*`): Registered via the component's metrics hierarchy and exposed on the frontend via the `drt_metrics` bridge. In KV mode (aggregated and disaggregated) they are populated per-request; in non-KV modes (direct/random/round-robin) they are registered with zero values. The standalone router (`python -m dynamo.router`) also registers these metrics, available on `DYN_SYSTEM_PORT` when set.
- **Routing overhead metrics** (`dynamo_router_overhead_*`) and **per-worker gauges** (`dynamo_frontend_worker_*`): Registered on the frontend's own Prometheus registry. These are frontend-only and not available on the standalone router.
For the full list of router metrics, see the [Metrics reference](../../observability/metrics.md#router-metrics).
## Disaggregated Serving
......
......@@ -103,23 +103,20 @@ This hierarchical structure allows you to create metrics at the appropriate leve
### Backend Component Metrics
**Backend workers** (`python -m dynamo.vllm`, `python -m dynamo.sglang`, etc.) expose `dynamo_component_*` metrics on port 8081 by default (configurable via `DYN_SYSTEM_PORT`).
**Backend workers** (`python -m dynamo.vllm`, `python -m dynamo.sglang`, etc.) expose `dynamo_component_*` metrics on the system status port (configurable via `DYN_SYSTEM_PORT`, disabled by default). In Kubernetes the operator typically sets `DYN_SYSTEM_PORT=9090`; for local development you must set it explicitly (e.g. `DYN_SYSTEM_PORT=8081`).
The core Dynamo backend system automatically exposes metrics on the system status port (default: 8081, configurable via `DYN_SYSTEM_PORT`) at the `/metrics` endpoint with the `dynamo_component_*` prefix for all components that use the `DistributedRuntime` framework:
The core Dynamo backend system exposes metrics at the `/metrics` endpoint with the `dynamo_component_*` prefix for all components that use the `DistributedRuntime` framework:
- `dynamo_component_inflight_requests`: Requests currently being processed (gauge)
- `dynamo_component_request_bytes_total`: Total bytes received in requests (counter)
- `dynamo_component_request_duration_seconds`: Request processing time (histogram)
- `dynamo_component_requests_total`: Total requests processed (counter)
- `dynamo_component_response_bytes_total`: Total bytes sent in responses (counter)
- `dynamo_component_uptime_seconds`: DistributedRuntime uptime (gauge). Automatically updated before each Prometheus scrape on both the frontend (`/metrics` on port 8000) and system status server (`/metrics` on port 8081).
- `dynamo_component_uptime_seconds`: DistributedRuntime uptime (gauge). Automatically updated before each Prometheus scrape on both the frontend (`/metrics` on port 8000) and the system status server (`/metrics` on `DYN_SYSTEM_PORT` when set).
**Access backend component metrics:**
```bash
# Default port 8081
curl http://localhost:8081/metrics
# Or with custom port
# Set DYN_SYSTEM_PORT to enable the system status server
DYN_SYSTEM_PORT=8081 python -m dynamo.vllm --model <model>
curl http://localhost:8081/metrics
```
......@@ -219,32 +216,28 @@ Suppose the backend allows 3 concurrent requests and there are 10 clients contin
### Router Metrics
When using the KV cache router (`--router-mode kv`), the frontend exposes additional metrics for monitoring routing decisions and overhead. These metrics are not registered when using `round-robin` or `random` routing, so they will not appear in `/metrics` output at all. Defined in `lib/llm/src/kv_router/metrics.rs`.
The router exposes metrics for monitoring routing decisions and overhead. Defined in `lib/llm/src/kv_router/metrics.rs`.
For router configuration and tuning, see the [Router Guide](../components/router/router-guide.md).
#### Router Request Metrics (`dynamo_router_*`)
#### Router Request Metrics (`dynamo_component_router_*`)
Histograms and counters for aggregate request-level statistics. Only registered when `--router-mode kv` is used. If no requests have been routed yet, the metrics will exist but show zero values. Exposed on the frontend port (default 8000) at `/metrics`.
Histograms and counters for aggregate request-level statistics. Eagerly registered via `from_component()` with the DRT `MetricsRegistry` hierarchy. On the frontend, exposed at `/metrics` on the HTTP port (default 8000) via the `drt_metrics` bridge. On the standalone router (`python -m dynamo.router`), exposed on `DYN_SYSTEM_PORT` when set. Populated per-request when `--router-mode kv` is active; registered with zero values in non-KV modes.
All metrics carry a `router_id` constant label (the frontend's discovery instance ID). Filter in Prometheus with:
```promql
dynamo_router_requests_total{router_id="12345"}
```
All metrics carry the standard hierarchy labels (`dynamo_namespace`, `dynamo_component`, `dynamo_endpoint`).
| Metric | Type | Description |
|--------|------|-------------|
| `dynamo_router_requests_total` | Counter | Total requests processed by the router |
| `dynamo_router_time_to_first_token_seconds` | Histogram | Time to first token (seconds) |
| `dynamo_router_inter_token_latency_seconds` | Histogram | Average inter-token latency (seconds) |
| `dynamo_router_input_sequence_tokens` | Histogram | Input sequence length (tokens) |
| `dynamo_router_output_sequence_tokens` | Histogram | Output sequence length (tokens) |
| `dynamo_router_kv_hit_rate` | Histogram | Predicted KV cache hit rate at routing time (0.0-1.0) |
| `dynamo_component_router_requests_total` | Counter | Total requests processed by the router |
| `dynamo_component_router_time_to_first_token_seconds` | Histogram | Time to first token (seconds) |
| `dynamo_component_router_inter_token_latency_seconds` | Histogram | Average inter-token latency (seconds) |
| `dynamo_component_router_input_sequence_tokens` | Histogram | Input sequence length (tokens) |
| `dynamo_component_router_output_sequence_tokens` | Histogram | Output sequence length (tokens) |
| `dynamo_component_router_kv_hit_rate` | Histogram | Predicted KV cache hit rate at routing time (0.0-1.0) |
#### Per-Request Routing Overhead (`dynamo_router_overhead_*`)
Histograms (in milliseconds) tracking the time spent in each phase of the routing decision for every request. Created on first routing decision. Same `router_id` label as the request metrics above.
Histograms (in milliseconds) tracking the time spent in each phase of the routing decision for every request. Registered on the frontend port (default 8000) at `/metrics` with a `router_id` label (the frontend's discovery instance ID).
| Metric | Type | Description |
|--------|------|-------------|
......
......@@ -186,6 +186,8 @@ class labels:
MODEL_NAME = "model_name"
# Label for worker type (e.g., "aggregated", "prefill", "decode", "encoder", etc.)
WORKER_TYPE = "worker_type"
# Label for router instance (discovery.instance_id() of the frontend)
ROUTER_ID = "router_id"
class model_info:
......@@ -200,10 +202,20 @@ class name_prefix:
COMPONENT = "dynamo_component"
# Prefix for frontend service metrics
FRONTEND = "dynamo_frontend"
# Prefix for KV router metrics (used with router_id label)
ROUTER = "dynamo_router"
class router_request:
"""Router per-request metrics (component-scoped via `MetricsHierarchy`)."""
# Prefix prepended to `frontend_service::*` names to form router metric names.
# e.g. `"router_"` + `frontend_service::REQUESTS_TOTAL` → `"router_requests_total"`.
METRIC_PREFIX = "router_"
class routing_overhead:
"""Routing overhead phase latency histogram names (component-scoped)."""
"""Routing overhead phase latency histogram suffixes."""
# Time spent computing block hashes
BLOCK_HASHING_MS = "overhead_block_hashing_ms"
......
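To make the naming scheme concrete, here is a small sketch of how the exported metric name is assembled from the constants above. The helper names `router_metric` and `exported_name` are hypothetical, and joining the component prefix with an underscore is an assumption inferred from the documented result `dynamo_component_router_requests_total`; the real prefixing happens inside the metrics hierarchy.

```python
# Constants mirrored from the hunk above.
COMPONENT_PREFIX = "dynamo_component"  # name_prefix.COMPONENT
METRIC_PREFIX = "router_"              # router_request.METRIC_PREFIX
REQUESTS_TOTAL = "requests_total"      # frontend_service suffix, per the comment above


def router_metric(suffix: str) -> str:
    """Build the component-local name: "router_" + suffix."""
    return f"{METRIC_PREFIX}{suffix}"


def exported_name(suffix: str) -> str:
    """Assumed final name after the hierarchy prepends the component prefix."""
    return f"{COMPONENT_PREFIX}_{router_metric(suffix)}"


print(router_metric(REQUESTS_TOTAL))
print(exported_name(REQUESTS_TOTAL))
```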
......@@ -9,7 +9,9 @@ use crate::{
engines::StreamingEngineAdapter,
entrypoint::{EngineConfig, RouterConfig},
http::service::metrics::Metrics,
kv_router::{DirectRoutingRouter, KvPushRouter, KvRouter, PrefillRouter},
kv_router::{
DirectRoutingRouter, KvPushRouter, KvRouter, PrefillRouter, metrics::RouterRequestMetrics,
},
migration::Migration,
model_card::ModelDeploymentCard,
namespace::NamespaceFilter,
......@@ -280,6 +282,12 @@ where
)
.await?;
// Eagerly register router request metrics so they appear as zeros even in
// non-KV modes (Direct, Random, RoundRobin) where KvPushRouter is never created.
// In KV mode, KvPushRouter::new() also calls from_component() (idempotent via
// OnceLock), which covers the standalone router path as well.
RouterRequestMetrics::from_component(client.endpoint.component());
let service_backend = match router_mode {
RouterMode::Direct => {
ServiceBackend::from_engine(Arc::new(DirectRoutingRouter::new(router)))
......@@ -291,8 +299,7 @@ where
let Some(chooser) = chooser else {
anyhow::bail!("RouterMode::KV requires KVRouter to not be null");
};
let kv_push_router = KvPushRouter::new(router, chooser);
ServiceBackend::from_engine(Arc::new(kv_push_router))
ServiceBackend::from_engine(Arc::new(KvPushRouter::new(router, chooser)))
}
};
......
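The comment in this hunk relies on `from_component()` being idempotent. As a rough analogy only (Python, not the Rust implementation), the sketch below shows the initialize-once pattern: the first caller creates the shared instance and every later call returns it unchanged. The class name `RouterRequestMetricsSketch` and its fields are hypothetical.

```python
import threading


class RouterRequestMetricsSketch:
    """Toy stand-in for a process-wide, initialize-once metrics holder."""

    _instance = None
    _lock = threading.Lock()

    def __init__(self, component: str):
        self.component = component
        self.requests_total = 0  # starts at zero, visible before any request

    @classmethod
    def from_component(cls, component: str) -> "RouterRequestMetricsSketch":
        # Only the first call constructs the instance; later calls reuse it,
        # regardless of which component they pass in.
        with cls._lock:
            if cls._instance is None:
                cls._instance = cls(component)
            return cls._instance


eager = RouterRequestMetricsSketch.from_component("pipeline-builder")
later = RouterRequestMetricsSketch.from_component("kv-push-router")
print(eager is later, later.component)
```

One consequence of this pattern, in the sketch at least, is that the arguments of the first call determine the instance; arguments passed by later callers are ignored.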
......@@ -18,9 +18,7 @@ use super::metrics;
use super::metrics::register_worker_timing_metrics;
use crate::discovery::ModelManager;
use crate::endpoint_type::EndpointType;
use crate::kv_router::metrics::{
RouterRequestMetrics, RoutingOverheadMetrics, register_worker_load_metrics,
};
use crate::kv_router::metrics::{RoutingOverheadMetrics, register_worker_load_metrics};
use crate::request_template::RequestTemplate;
use anyhow::Result;
use axum_server::tls_rustls::RustlsConfig;
......@@ -430,9 +428,6 @@ impl HttpServiceConfigBuilder {
if let Some(ref discovery) = config.drt_discovery {
let instance_id = discovery.instance_id();
if let Err(e) = RouterRequestMetrics::register(&registry, instance_id) {
tracing::warn!("Failed to register router request metrics: {}", e);
}
if let Err(e) = RoutingOverheadMetrics::register(&registry, instance_id) {
tracing::warn!("Failed to register routing overhead metrics: {}", e);
}
......
......@@ -4,19 +4,55 @@
//! Prometheus metrics for the KV router.
//!
//! This module centralizes all router-side Prometheus metric definitions:
//!
//! - [`WorkerLoadMetrics`]: Per-worker active decode blocks and prefill tokens gauges.
//! Registered on the frontend's own `prometheus::Registry` (default port 8000).
//! Populated by `KvWorkerMonitor` in the frontend when receiving ActiveLoad events.
//! - Frontend (aggregated and disaggregated): available on default port 8000
//! - Standalone router (`python -m dynamo.router`): not created (frontend-only)
//!
//! - [`RoutingOverheadMetrics`]: Per-request routing phase latency histograms.
//! Registered on the frontend's own `prometheus::Registry` (default port 8000).
//! Populated by `KvPushRouter` in the frontend during routing decisions.
//! - Frontend (aggregated and disaggregated): available on default port 8000
//! - Standalone router: not created (frontend-only)
//!
//! - [`RouterRequestMetrics`]: Per-request aggregate histograms (TTFT, ITL, tokens, KV hit rate).
//! Registered on the DRT `MetricsRegistry` hierarchy via `Component::metrics()`.
//! Eagerly created so they appear as zeros before any requests arrive.
//! Populated by `KvPushRouter::generate()` and its `RequestGuard` as it observes
//! the streaming response (TTFT on first token, ITL per output block,
//! ISL/OSL/kv_hit_rate at routing and completion).
//! - Frontend, non-KV modes (direct/random/round-robin): always zero (registered
//! on default port 8000, but never populated since KvPushRouter is not used)
//! - Frontend, KV mode (aggregated and disaggregated): available on default port
//! 8000 via the `drt_metrics` bridge, populated per-request
//! - Standalone router (`python -m dynamo.router`): available on `DYN_SYSTEM_PORT`
//! when set (default is `-1`, disabled), populated per-request
//!
//! The standalone router does not create `WorkerLoadMetrics` or
//! `RoutingOverheadMetrics` (those are frontend-only). It only exposes
//! `RouterRequestMetrics` and standard DRT transport metrics
//! (`dynamo_component_inflight_requests`, `dynamo_component_requests_total`, etc.)
//! via the system status server when `DYN_SYSTEM_PORT` is explicitly set.
//!
//! See also: `docs/pages/observability/metrics.md` (Router Metrics section).
use std::sync::{Arc, LazyLock, OnceLock};
use std::time::Duration;
use dynamo_runtime::component::Component;
use dynamo_runtime::metrics::MetricsHierarchy;
use dynamo_runtime::metrics::prometheus_names::{
frontend_service, labels, name_prefix, routing_overhead,
frontend_service, labels, name_prefix, router_request, routing_overhead,
};
use prometheus::{HistogramOpts, IntCounter, IntGaugeVec, Opts};
/// Build a router metric name: `"router_" + frontend_service_suffix`.
fn router_metric(suffix: &str) -> String {
format!("{}{}", router_request::METRIC_PREFIX, suffix)
}
use dynamo_runtime::traits::DistributedRuntimeProvider;
use prometheus::{HistogramOpts, IntGaugeVec, Opts};
use crate::http::service::metrics::generate_log_buckets;
......@@ -86,6 +122,7 @@ pub static WORKER_LOAD_METRICS: LazyLock<WorkerLoadMetrics> = LazyLock::new(|| W
});
/// Register the worker load gauges with the given Prometheus registry.
/// Called during frontend HTTP service setup (`service_v2.rs`), served on port 8000.
pub fn register_worker_load_metrics(
registry: &prometheus::Registry,
) -> Result<(), prometheus::Error> {
......@@ -113,7 +150,9 @@ static ROUTING_OVERHEAD_METRICS: OnceLock<Arc<RoutingOverheadMetrics>> = OnceLoc
impl RoutingOverheadMetrics {
/// Register routing overhead histograms with the given registry and store for later use.
/// Metric names: `dynamo_router_overhead_*` with const label `router_id=instance_id`.
/// Call once during HTTP service setup when `--router-mode kv` is used.
/// Called during frontend HTTP service setup (`service_v2.rs`), so these metrics
/// are served on the frontend's own port (default 8000). Not available in the
/// standalone router, which has no frontend HTTP server.
pub fn register(
registry: &prometheus::Registry,
instance_id: u64,
......@@ -204,13 +243,40 @@ impl RoutingOverheadMetrics {
}
// ---------------------------------------------------------------------------
// Router request metrics (dynamo_router_* with router_id label)
// Router request metrics (dynamo_component_router_* via MetricsHierarchy)
// ---------------------------------------------------------------------------
/// Aggregate per-request metrics observed at the router level.
/// Registered via `register()` with `dynamo_router_*` names and `router_id` label.
///
/// Component-scoped via `from_component()` to get automatic `dynamo_component_` prefix,
/// `dynamo_namespace`/`dynamo_component`/`dynamo_endpoint` labels, and registration
/// with the DRT `MetricsRegistry` hierarchy.
///
/// # Scrapeability
///
/// - **Frontend, non-KV modes**: Always zero (registered but never populated).
/// - **Frontend, KV mode (aggregated and disaggregated)**: Available on the
/// frontend's `/metrics` endpoint (default port 8000) via the `drt_metrics`
/// bridge, populated per-request.
/// - **Standalone router** (`python -m dynamo.router`): Available on the system
/// status server when `DYN_SYSTEM_PORT` is set, populated per-request.
///
/// # When these metrics are created
///
/// Eagerly in `KvPushRouter::new()`, so they appear as zeros before any requests.
/// Both the frontend pipeline and the standalone router (via Python bindings)
/// create a `KvPushRouter`, so both get these metrics registered automatically.
///
/// # Why component-scoped
///
/// These metrics MUST be registered through the Component hierarchy (not a standalone
/// registry). In hierarchical planner deployments, the frontend's router is the global
/// entry point, but each worker pool has its own local router (e.g. prefill pool,
/// decode pool). Component-scoped metrics let each local router emit metrics with
/// distinct `dynamo_component` labels, so pools can be monitored and scaled
/// independently.
pub struct RouterRequestMetrics {
pub requests_total: IntCounter,
pub requests_total: prometheus::IntCounter,
pub time_to_first_token_seconds: prometheus::Histogram,
pub inter_token_latency_seconds: prometheus::Histogram,
pub input_sequence_tokens: prometheus::Histogram,
......@@ -221,109 +287,77 @@ pub struct RouterRequestMetrics {
static ROUTER_REQUEST_METRICS: OnceLock<Arc<RouterRequestMetrics>> = OnceLock::new();
impl RouterRequestMetrics {
/// Register router request metrics with the given registry and store for later use.
/// Metric names: `dynamo_router_*` with const label `router_id=instance_id`.
/// Call once during HTTP service setup when `--router-mode kv` is used.
pub fn register(
registry: &prometheus::Registry,
instance_id: u64,
) -> Result<(), prometheus::Error> {
let m = ROUTER_REQUEST_METRICS.get_or_init(|| {
let router_id = instance_id.to_string();
let requests_total = IntCounter::with_opts(
Opts::new(
format!(
"{}_{}",
name_prefix::ROUTER,
frontend_service::REQUESTS_TOTAL
),
"Total number of requests processed by the router",
)
.const_label(labels::ROUTER_ID, &router_id),
)
.expect("dynamo_router_requests_total");
let time_to_first_token_seconds = prometheus::Histogram::with_opts(
HistogramOpts::new(
format!(
"{}_{}",
name_prefix::ROUTER,
frontend_service::TIME_TO_FIRST_TOKEN_SECONDS
),
"Time to first token observed at the router",
)
.const_label(labels::ROUTER_ID, &router_id)
.buckets(generate_log_buckets(0.001, 480.0, 18)),
)
.expect("dynamo_router_time_to_first_token_seconds");
let inter_token_latency_seconds = prometheus::Histogram::with_opts(
HistogramOpts::new(
format!(
"{}_{}",
name_prefix::ROUTER,
frontend_service::INTER_TOKEN_LATENCY_SECONDS
),
"Average inter-token latency observed at the router",
)
.const_label(labels::ROUTER_ID, &router_id)
.buckets(generate_log_buckets(0.001, 2.0, 13)),
)
.expect("dynamo_router_inter_token_latency_seconds");
let input_sequence_tokens = prometheus::Histogram::with_opts(
HistogramOpts::new(
format!(
"{}_{}",
name_prefix::ROUTER,
frontend_service::INPUT_SEQUENCE_TOKENS
),
"Input sequence length in tokens observed at the router",
)
.const_label(labels::ROUTER_ID, &router_id)
.buckets(generate_log_buckets(50.0, 128000.0, 12)),
)
.expect("dynamo_router_input_sequence_tokens");
let output_sequence_tokens = prometheus::Histogram::with_opts(
HistogramOpts::new(
format!(
"{}_{}",
name_prefix::ROUTER,
frontend_service::OUTPUT_SEQUENCE_TOKENS
),
"Output sequence length in tokens observed at the router",
)
.const_label(labels::ROUTER_ID, &router_id)
.buckets(generate_log_buckets(50.0, 32000.0, 10)),
)
.expect("dynamo_router_output_sequence_tokens");
let kv_hit_rate = prometheus::Histogram::with_opts(
HistogramOpts::new(
format!("{}_{}", name_prefix::ROUTER, frontend_service::KV_HIT_RATE),
"Predicted KV cache hit rate at routing time (0.0-1.0)",
)
.const_label(labels::ROUTER_ID, &router_id)
.buckets(prometheus::linear_buckets(0.0, 0.05, 21).unwrap()),
)
.expect("dynamo_router_kv_hit_rate");
Arc::new(Self {
requests_total,
time_to_first_token_seconds,
inter_token_latency_seconds,
input_sequence_tokens,
output_sequence_tokens,
kv_hit_rate,
})
});
registry.register(Box::new(m.requests_total.clone()))?;
registry.register(Box::new(m.time_to_first_token_seconds.clone()))?;
registry.register(Box::new(m.inter_token_latency_seconds.clone()))?;
registry.register(Box::new(m.input_sequence_tokens.clone()))?;
registry.register(Box::new(m.output_sequence_tokens.clone()))?;
registry.register(Box::new(m.kv_hit_rate.clone()))?;
Ok(())
}
/// Create from a Component, memoized in a static OnceLock.
/// Uses the MetricsHierarchy API which auto-prepends `dynamo_component_`,
/// injects hierarchy labels, and registers with the DRT `MetricsRegistry`.
/// Also adds `router_id` (discovery instance_id) to distinguish router instances.
///
/// Called eagerly by `KvPushRouter::new()` so metrics appear as zeros at startup.
pub fn from_component(component: &Component) -> Arc<Self> {
ROUTER_REQUEST_METRICS
.get_or_init(|| {
let instance_id = component.drt().discovery().instance_id();
let router_id = instance_id.to_string();
let extra_labels: &[(&str, &str)] = &[(labels::ROUTER_ID, &router_id)];
/// Returns the registered metrics if `register()` was called earlier.
pub fn get() -> Option<Arc<Self>> {
ROUTER_REQUEST_METRICS.get().cloned()
let metrics = component.metrics();
let requests_total = metrics
.create_intcounter(
&router_metric(frontend_service::REQUESTS_TOTAL),
"Total number of requests processed by the router",
extra_labels,
)
.expect("failed to create router_requests_total");
let time_to_first_token_seconds = metrics
.create_histogram(
&router_metric(frontend_service::TIME_TO_FIRST_TOKEN_SECONDS),
"Time to first token observed at the router",
extra_labels,
Some(generate_log_buckets(0.001, 480.0, 18)),
)
.expect("failed to create router_time_to_first_token_seconds");
let inter_token_latency_seconds = metrics
.create_histogram(
&router_metric(frontend_service::INTER_TOKEN_LATENCY_SECONDS),
"Average inter-token latency observed at the router",
extra_labels,
Some(generate_log_buckets(0.001, 2.0, 13)),
)
.expect("failed to create router_inter_token_latency_seconds");
let input_sequence_tokens = metrics
.create_histogram(
&router_metric(frontend_service::INPUT_SEQUENCE_TOKENS),
"Input sequence length in tokens observed at the router",
extra_labels,
Some(generate_log_buckets(50.0, 128000.0, 12)),
)
.expect("failed to create router_input_sequence_tokens");
let output_sequence_tokens = metrics
.create_histogram(
&router_metric(frontend_service::OUTPUT_SEQUENCE_TOKENS),
"Output sequence length in tokens observed at the router",
extra_labels,
Some(generate_log_buckets(50.0, 32000.0, 10)),
)
.expect("failed to create router_output_sequence_tokens");
let kv_hit_rate = metrics
.create_histogram(
&router_metric(frontend_service::KV_HIT_RATE),
"Predicted KV cache hit rate at routing time (0.0-1.0)",
extra_labels,
Some(prometheus::linear_buckets(0.0, 0.05, 21).unwrap()),
)
.expect("failed to create router_kv_hit_rate");
Arc::new(Self {
requests_total,
time_to_first_token_seconds,
inter_token_latency_seconds,
input_sequence_tokens,
output_sequence_tokens,
kv_hit_rate,
})
})
.clone()
}
}
......
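For reference, the `kv_hit_rate` histogram above uses `linear_buckets(0.0, 0.05, 21)`, that is, a start of 0.0, a width of 0.05, and 21 buckets. The sketch below reproduces only that arithmetic in Python to show the resulting upper bounds span 0.0 through 1.0; it is not the library code, and the function name `linear_bucket_bounds` is hypothetical.

```python
def linear_bucket_bounds(start: float, width: float, count: int) -> list[float]:
    """Return `count` bucket upper bounds: start, start + width, ..."""
    if count < 1 or width <= 0:
        raise ValueError("count must be >= 1 and width must be > 0")
    # Multiply rather than accumulate to limit floating-point drift,
    # and round for readable output.
    return [round(start + width * i, 10) for i in range(count)]


bounds = linear_bucket_bounds(0.0, 0.05, 21)
print(len(bounds), bounds[0], bounds[-1])
```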
......@@ -50,7 +50,7 @@ struct RequestGuard {
chooser: Arc<KvRouter>,
context_id: String,
tracker: Option<Arc<RequestTracker>>,
request_metrics: Option<Arc<RouterRequestMetrics>>,
request_metrics: Arc<RouterRequestMetrics>,
cumulative_osl: usize,
metrics_recorded: bool,
freed: bool,
......@@ -87,8 +87,10 @@ impl RequestGuard {
if !self.first_token_recorded && new_tokens > 0 {
if let Some(ref tracker) = self.tracker {
tracker.record_first_token();
if let (Some(m), Some(ttft)) = (&self.request_metrics, tracker.ttft_ms()) {
m.time_to_first_token_seconds.observe(ttft / 1000.0);
if let Some(ttft) = tracker.ttft_ms() {
self.request_metrics
.time_to_first_token_seconds
.observe(ttft / 1000.0);
}
}
self.first_token_recorded = true;
......@@ -116,9 +118,10 @@ impl RequestGuard {
if let Some(ref tracker) = self.tracker {
tracker.record_osl(self.cumulative_osl);
tracker.record_finish();
if let (Some(m), Some(avg_itl)) = (&self.request_metrics, tracker.avg_itl_ms())
{
m.inter_token_latency_seconds.observe(avg_itl / 1000.0);
if let Some(avg_itl) = tracker.avg_itl_ms() {
self.request_metrics
.inter_token_latency_seconds
.observe(avg_itl / 1000.0);
}
}
......@@ -144,10 +147,10 @@ impl RequestGuard {
tracker.record_finish();
tracker.record_osl(self.cumulative_osl);
}
if let Some(ref m) = self.request_metrics {
m.output_sequence_tokens.observe(self.cumulative_osl as f64);
m.requests_total.inc();
}
self.request_metrics
.output_sequence_tokens
.observe(self.cumulative_osl as f64);
self.request_metrics.requests_total.inc();
}
}
......@@ -175,6 +178,10 @@ impl KvPushRouter {
inner: PushRouter<PreprocessedRequest, Annotated<LLMEngineOutput>>,
chooser: Arc<KvRouter>,
) -> Self {
// Eagerly register router request metrics (as zeros) so they are
// scrapeable before any requests arrive. Both the frontend pipeline
// and the standalone router create KvPushRouter, so this covers both.
RouterRequestMetrics::from_component(chooser.client().endpoint.component());
KvPushRouter { inner, chooser }
}
......@@ -366,7 +373,8 @@ impl AsyncEngine<SingleIn<PreprocessedRequest>, ManyOut<Annotated<LLMEngineOutpu
}
// Record routing metrics on tracker and observe ISL + prefill start.
let request_metrics = RouterRequestMetrics::get();
let request_metrics =
RouterRequestMetrics::from_component(self.chooser.client().endpoint.component());
if let Some(ref tracker) = request.tracker {
let (routing_token_ids, _) = request.block_mm_routing_info();
let isl_blocks = routing_token_ids.len().div_ceil(block_size);
......@@ -376,14 +384,13 @@ impl AsyncEngine<SingleIn<PreprocessedRequest>, ManyOut<Annotated<LLMEngineOutpu
overlap_amount as usize * block_size,
);
tracker.record_worker_full(instance_id, dp_rank, self.chooser.worker_type());
if let (Some(m), Some(hit_rate)) = (&request_metrics, tracker.kv_hit_rate()) {
m.kv_hit_rate.observe(hit_rate);
if let Some(hit_rate) = tracker.kv_hit_rate() {
request_metrics.kv_hit_rate.observe(hit_rate);
}
}
if let Some(ref m) = request_metrics {
m.input_sequence_tokens
.observe(request.token_ids.len() as f64);
}
request_metrics
.input_sequence_tokens
.observe(request.token_ids.len() as f64);
// Handle query-only requests: early return with worker info
if is_query_only {
......
......@@ -417,6 +417,18 @@ pub mod kvbm {
pub const OBJECT_WRITE_FAILURES: &str = "object_write_failures";
}
/// Router per-request metrics (component-scoped via `MetricsHierarchy`).
///
/// Metric names are composed as `"{METRIC_PREFIX}{frontend_service::*}"` at init time,
/// then passed to `component.metrics().create_*()` which auto-prepends `dynamo_component_`,
/// yielding e.g. `dynamo_component_router_requests_total`.
/// See `lib/llm/src/kv_router/metrics.rs` `RouterRequestMetrics::from_component()`.
pub mod router_request {
/// Prefix prepended to `frontend_service::*` names to form router metric names.
/// e.g. `"router_"` + `frontend_service::REQUESTS_TOTAL` → `"router_requests_total"`.
pub const METRIC_PREFIX: &str = "router_";
}
/// Routing overhead phase latency histogram suffixes.
///
/// Combined with `name_prefix::ROUTER` ("dynamo_router") in `RoutingOverheadMetrics::register()`,
......