refactor: standardize Prometheus metric naming conventions (part 1) (#3035)

Signed-off-by: Keiven Chang <keivenchang@users.noreply.github.com>

Commit f4a3a6b6 (unverified), authored Sep 30, 2025 by Keiven C; committed by GitHub Sep 30, 2025
10 changed files
--- a/deploy/metrics/README.md
+++ b/deploy/metrics/README.md
@@ -70,8 +70,8 @@ Some components expose additional metrics specific to their functionality:

 When using Dynamo HTTP Frontend (`--framework VLLM` or `--framework TRTLLM`), these metrics are automatically exposed with the `dynamo_frontend_*` prefix and include `model` labels containing the model name:

-- `dynamo_frontend_inflight_requests_total`: Inflight requests (gauge)
-- `dynamo_frontend_queued_requests_total`: Number of requests in HTTP processing queue (gauge)
+- `dynamo_frontend_inflight_requests`: Inflight requests (gauge)
+- `dynamo_frontend_queued_requests`: Number of requests in HTTP processing queue (gauge)
 - `dynamo_frontend_input_sequence_tokens`: Input sequence length (histogram)
 - `dynamo_frontend_inter_token_latency_seconds`: Inter-token latency (histogram)
 - `dynamo_frontend_output_sequence_tokens`: Output sequence length (histogram)
@@ -79,6 +79,8 @@ When using Dynamo HTTP Frontend (`--framework VLLM` or `--framework TRTLLM`), th
 - `dynamo_frontend_requests_total`: Total LLM requests (counter)
 - `dynamo_frontend_time_to_first_token_seconds`: Time to first token (histogram)

+**Note**: The `dynamo_frontend_inflight_requests` metric tracks requests from HTTP handler start until the complete response is finished, while `dynamo_frontend_queued_requests` tracks requests from HTTP handler start until first token generation begins (including prefill time). HTTP queue time is a subset of inflight time.
+
 ##### Model Configuration Metrics

 The frontend also exposes model configuration metrics with the `dynamo_frontend_model_*` prefix. These metrics are populated from the worker backend registration service when workers register with the system:
@@ -91,7 +93,7 @@ These metrics come from the runtime configuration provided by worker backends du
 - `dynamo_frontend_model_max_num_batched_tokens`: Maximum number of batched tokens for a worker serving the model (gauge)

 **MDC Metrics (from ModelDeploymentCard):**
-These metrics come from the Model Deployment Card information provided by worker backends during registration.
+These metrics come from the Model Deployment Card information provided by worker backends during registration. Note that when multiple worker instances register with the same model name, only the first instance's configuration metrics (runtime config and MDC metrics) will be populated. Subsequent instances with duplicate model names will be skipped for configuration metric updates, though the worker count metric will reflect all instances.

 - `dynamo_frontend_model_context_length`: Maximum context length for a worker serving the model (gauge)
 - `dynamo_frontend_model_kv_cache_block_size`: KV cache block size for a worker serving the model (gauge)
@@ -100,10 +102,6 @@ These metrics come from the Model Deployment Card information provided by worker
 **Worker Management Metrics:**
 - `dynamo_frontend_model_workers`: Number of worker instances currently serving the model (gauge)

-**Important Notes:**
-- The `dynamo_frontend_inflight_requests_total` metric tracks requests from HTTP handler start until the complete response is finished, while `dynamo_frontend_queued_requests_total` tracks requests from HTTP handler start until first token generation begins (including prefill time). HTTP queue time is a subset of inflight time.
-- **Model Name Deduplication**: When multiple worker instances register with the same model name, only the first instance's configuration metrics (runtime config and MDC metrics) will be populated. Subsequent instances with duplicate model names will be skipped for configuration metric updates, though the worker count metric will reflect all instances.
-
 #### Request Processing Flow

 This section explains the distinction between two key metrics used to track request processing:
@@ -148,10 +146,10 @@ Try launching a frontend and a Mocker backend that allows 3 concurrent requests:
 $ python -m dynamo.frontend --http-port 8000
 $ python -m dynamo.mocker --model-path Qwen/Qwen3-0.6B --max-num-seqs 3
 # Launch your 10 concurrent clients here
-# Then check the queued_requests_total and inflight_requests_total metrics from the frontend:
+# Then check the queued_requests and inflight_requests metrics from the frontend:
 $ curl -s localhost:8000/metrics|grep -v '^#'|grep -E 'queue|inflight'
-dynamo_frontend_queued_requests_total{model="qwen/qwen3-0.6b"} 7
-dynamo_frontend_inflight_requests_total{model="qwen/qwen3-0.6b"} 10
+dynamo_frontend_queued_requests{model="qwen/qwen3-0.6b"} 7
+dynamo_frontend_inflight_requests{model="qwen/qwen3-0.6b"} 10
 ```

 **Real setup using vLLM (instead of Mocker):**
@@ -294,8 +292,8 @@ let component = namespace.component("my_component")?;
 let endpoint = component.endpoint("my_endpoint")?;

 // Create endpoint-level counters (this is a Prometheus Counter type)
-let total_requests = endpoint.create_counter(
-    "total_requests",
+let requests_total = endpoint.create_counter(
+    "requests_total",
    "Total requests across all namespaces",
    &[]
 )?;
@@ -472,8 +470,8 @@ let latency = endpoint.create_histogram(

 ```rust
 // Aggregate metrics across multiple endpoints
-let total_requests = namespace.create_counter(
-    "total_requests",
+let requests_total = namespace.create_counter(
+    "requests_total",
    "Total requests across all endpoints",
    &[]
 )?;
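
The README note in this file says queue time is a subset of inflight time: both gauges start at HTTP handler start, `queued_requests` drops when first-token generation begins, and `inflight_requests` drops when the response completes. A minimal sketch of that bookkeeping follows. `FrontendGauges` and its method names are hypothetical stand-ins for illustration, not Dynamo code, and the sketch assumes the three requests admitted by the Mocker have already started generating tokens.

```python
from dataclasses import dataclass


@dataclass
class FrontendGauges:
    """Hypothetical stand-in for the two frontend gauges (not Dynamo code)."""

    inflight_requests: int = 0
    queued_requests: int = 0

    def on_handler_start(self) -> None:
        # Both gauges begin counting when the HTTP handler starts.
        self.inflight_requests += 1
        self.queued_requests += 1

    def on_first_token(self) -> None:
        # The request leaves the queue once first-token generation begins.
        self.queued_requests -= 1

    def on_response_complete(self) -> None:
        # The request stops being inflight when the full response is done.
        self.inflight_requests -= 1


gauges = FrontendGauges()
for _ in range(10):  # 10 concurrent clients, as in the README example
    gauges.on_handler_start()
for _ in range(3):   # backend admits 3 sequences (--max-num-seqs 3)
    gauges.on_first_token()

print(gauges.queued_requests, gauges.inflight_requests)  # prints: 7 10
```

Under these assumptions the values line up with the `curl` output in the README hunk (7 queued, 10 inflight), and `queued_requests` can never exceed `inflight_requests`.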

--- a/lib/bindings/python/rust/prometheus_names.rs
+++ b/lib/bindings/python/rust/prometheus_names.rs
@@ -28,13 +28,32 @@
 //! # Access metrics directly (no constructor call needed!)
 //! frontend = prometheus_names.frontend
 //! print(frontend.requests_total)           # "dynamo_frontend_requests_total"
+//! print(frontend.queued_requests)          # "dynamo_frontend_queued_requests"
+//! print(frontend.inflight_requests)        # "dynamo_frontend_inflight_requests"
+//! print(frontend.disconnected_clients)     # "dynamo_frontend_disconnected_clients"
 //! print(frontend.request_duration_seconds) # "dynamo_frontend_request_duration_seconds"
+//! print(frontend.input_sequence_tokens)    # "dynamo_frontend_input_sequence_tokens"
+//! print(frontend.output_sequence_tokens)   # "dynamo_frontend_output_sequence_tokens"
+//! print(frontend.time_to_first_token_seconds) # "dynamo_frontend_time_to_first_token_seconds"
 //! print(frontend.inter_token_latency_seconds) # "dynamo_frontend_inter_token_latency_seconds"
+//! print(frontend.model_context_length)     # "dynamo_frontend_model_context_length"
+//! print(frontend.model_kv_cache_block_size) # "dynamo_frontend_model_kv_cache_block_size"
+//! print(frontend.model_migration_limit)    # "dynamo_frontend_model_migration_limit"
 //!
 //! work_handler = prometheus_names.work_handler
 //! print(work_handler.requests_total)       # "dynamo_component_requests_total"
+//! print(work_handler.request_bytes_total)  # "dynamo_component_request_bytes_total"
+//! print(work_handler.response_bytes_total) # "dynamo_component_response_bytes_total"
+//! print(work_handler.inflight_requests)    # "dynamo_component_inflight_requests"
+//! print(work_handler.request_duration_seconds) # "dynamo_component_request_duration_seconds"
 //! print(work_handler.errors_total)         # "dynamo_component_errors_total"
 //!
+//! kvstats = prometheus_names.kvstats
+//! print(kvstats.active_blocks)             # "kvstats_active_blocks"
+//! print(kvstats.total_blocks)              # "kvstats_total_blocks"
+//! print(kvstats.gpu_cache_usage_percent)   # "kvstats_gpu_cache_usage_percent"
+//! print(kvstats.gpu_prefix_cache_hit_rate) # "kvstats_gpu_prefix_cache_hit_rate"
+//!
 //! # Use in Prometheus queries
 //! query = f"rate({frontend.requests_total}[5m])"
 //! pattern = rf'{work_handler.requests_total}\{{[^}}]*model="[^"]*"[^}}]*\}}'
@@ -60,6 +79,12 @@ impl PrometheusNames {
    fn work_handler(&self) -> WorkHandler {
        WorkHandler
    }
+
+    /// KV stats metrics
+    #[getter]
+    fn kvstats(&self) -> KvStatsMetrics {
+        KvStatsMetrics
+    }
 }

 /// Frontend service metrics (LLM HTTP service)
@@ -86,21 +111,21 @@ impl FrontendService {

    /// Number of requests waiting in HTTP queue before receiving the first response
    #[getter]
-    fn queued_requests_total(&self) -> String {
+    fn queued_requests(&self) -> String {
        format!(
            "{}_{}",
            name_prefix::FRONTEND,
-            frontend_service::QUEUED_REQUESTS_TOTAL
+            frontend_service::QUEUED_REQUESTS
        )
    }

    /// Number of inflight requests going to the engine (vLLM, SGLang, ...)
    #[getter]
-    fn inflight_requests_total(&self) -> String {
+    fn inflight_requests(&self) -> String {
        format!(
            "{}_{}",
            name_prefix::FRONTEND,
-            frontend_service::INFLIGHT_REQUESTS_TOTAL
+            frontend_service::INFLIGHT_REQUESTS
        )
    }

@@ -153,6 +178,76 @@ impl FrontendService {
            frontend_service::INTER_TOKEN_LATENCY_SECONDS
        )
    }
+
+    /// Number of disconnected clients
+    #[getter]
+    fn disconnected_clients(&self) -> String {
+        format!(
+            "{}_{}",
+            name_prefix::FRONTEND,
+            frontend_service::DISCONNECTED_CLIENTS
+        )
+    }
+
+    /// Model total KV blocks
+    #[getter]
+    fn model_total_kv_blocks(&self) -> String {
+        format!(
+            "{}_{}",
+            name_prefix::FRONTEND,
+            frontend_service::MODEL_TOTAL_KV_BLOCKS
+        )
+    }
+
+    /// Model max number of sequences
+    #[getter]
+    fn model_max_num_seqs(&self) -> String {
+        format!(
+            "{}_{}",
+            name_prefix::FRONTEND,
+            frontend_service::MODEL_MAX_NUM_SEQS
+        )
+    }
+
+    /// Model max number of batched tokens
+    #[getter]
+    fn model_max_num_batched_tokens(&self) -> String {
+        format!(
+            "{}_{}",
+            name_prefix::FRONTEND,
+            frontend_service::MODEL_MAX_NUM_BATCHED_TOKENS
+        )
+    }
+
+    /// Model context length
+    #[getter]
+    fn model_context_length(&self) -> String {
+        format!(
+            "{}_{}",
+            name_prefix::FRONTEND,
+            frontend_service::MODEL_CONTEXT_LENGTH
+        )
+    }
+
+    /// Model KV cache block size
+    #[getter]
+    fn model_kv_cache_block_size(&self) -> String {
+        format!(
+            "{}_{}",
+            name_prefix::FRONTEND,
+            frontend_service::MODEL_KV_CACHE_BLOCK_SIZE
+        )
+    }
+
+    /// Model migration limit
+    #[getter]
+    fn model_migration_limit(&self) -> String {
+        format!(
+            "{}_{}",
+            name_prefix::FRONTEND,
+            frontend_service::MODEL_MIGRATION_LIMIT
+        )
+    }
 }

 /// Work handler metrics (component request processing)
@@ -219,11 +314,44 @@ impl WorkHandler {
    }
 }

+/// KV stats metrics (KV cache statistics)
+/// These methods return the metric names with the "kvstats_" prefix
+#[pyclass]
+pub struct KvStatsMetrics;
+
+#[pymethods]
+impl KvStatsMetrics {
+    /// Number of active KV cache blocks currently in use
+    #[getter]
+    fn active_blocks(&self) -> String {
+        kvstats::ACTIVE_BLOCKS.to_string()
+    }
+
+    /// Total number of KV cache blocks available
+    #[getter]
+    fn total_blocks(&self) -> String {
+        kvstats::TOTAL_BLOCKS.to_string()
+    }
+
+    /// GPU cache usage as a percentage (0.0-1.0)
+    #[getter]
+    fn gpu_cache_usage_percent(&self) -> String {
+        kvstats::GPU_CACHE_USAGE_PERCENT.to_string()
+    }
+
+    /// GPU prefix cache hit rate as a percentage (0.0-1.0)
+    #[getter]
+    fn gpu_prefix_cache_hit_rate(&self) -> String {
+        kvstats::GPU_PREFIX_CACHE_HIT_RATE.to_string()
+    }
+}
+
 /// Add prometheus_names module to the Python bindings
 pub fn add_to_module(m: &Bound<'_, PyModule>) -> PyResult<()> {
    m.add_class::<PrometheusNames>()?;
    m.add_class::<FrontendService>()?;
    m.add_class::<WorkHandler>()?;
+    m.add_class::<KvStatsMetrics>()?;

    // Add a module-level singleton instance for convenience
    let prometheus_names_instance = PrometheusNames;
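
The frontend getters in this file all build names with `format!("{}_{}", prefix, name)`. The sketch below mirrors that composition in plain Python so it runs without the `dynamo` bindings installed. The prefix values are assumptions inferred from the doc-comment outputs above (for example `"dynamo_frontend_requests_total"`), and `metric_name` is a hypothetical helper, not part of the bindings.

```python
import re

# Assumed prefix values, inferred from the doc-comment outputs in the diff.
FRONTEND_PREFIX = "dynamo_frontend"
COMPONENT_PREFIX = "dynamo_component"


def metric_name(prefix: str, name: str) -> str:
    """Hypothetical mirror of the Rust `format!("{}_{}", prefix, name)`."""
    return f"{prefix}_{name}"


frontend_requests_total = metric_name(FRONTEND_PREFIX, "requests_total")
work_handler_requests_total = metric_name(COMPONENT_PREFIX, "requests_total")

# Same query and pattern shapes as the doc comment in the bindings.
query = f"rate({frontend_requests_total}[5m])"
pattern = rf'{work_handler_requests_total}\{{[^}}]*model="[^"]*"[^}}]*\}}'

# Made-up exposition line, used only to exercise the pattern.
sample_line = 'dynamo_component_requests_total{endpoint="generate",model="qwen"} 5'

print(query)                                 # rate(dynamo_frontend_requests_total[5m])
print(bool(re.search(pattern, sample_line)))  # True
```

With the real bindings, the two name strings would come from `prometheus_names.frontend.requests_total` and `prometheus_names.work_handler.requests_total` instead of the helper.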

--- a/lib/bindings/python/src/dynamo/_core.pyi
+++ b/lib/bindings/python/src/dynamo/_core.pyi
@@ -12,6 +12,9 @@ from typing import (
    Tuple,
 )

+# Prometheus metric names are defined in a separate module
+from ._prometheus_names import prometheus_names
+
 def log_message(level: str, message: str, module: str, file: str, line: int) -> None:
    """
    Log a message from Python with file and line info
@@ -1376,134 +1379,7 @@ class VirtualConnectorClient:
        """Blocks until there is a new decision to fetch using 'get'"""
        ...

-class PrometheusNames:
-    """
-    Main container for all Prometheus metric name constants
-    """
-
-    @property
-    def frontend(self) -> FrontendService:
-        """
-        Frontend service metrics
-        """
-        ...
-
-    @property
-    def work_handler(self) -> WorkHandler:
-        """
-        Work handler metrics
-        """
-        ...
-
-class FrontendService:
-    """
-    Frontend service metrics (LLM HTTP service)
-    These methods return the full metric names with the "dynamo_frontend_" prefix
-    """
-
-    @property
-    def requests_total(self) -> str:
-        """
-        Total number of LLM requests processed
-        """
-        ...
-
-    @property
-    def queued_requests_total(self) -> str:
-        """
-        Number of requests waiting in HTTP queue before receiving the first response
-        """
-        ...
-
-    @property
-    def inflight_requests_total(self) -> str:
-        """
-        Number of inflight requests going to the engine (vLLM, SGLang, ...)
-        """
-        ...
-
-    @property
-    def request_duration_seconds(self) -> str:
-        """
-        Duration of LLM requests
-        """
-        ...
-
-    @property
-    def input_sequence_tokens(self) -> str:
-        """
-        Input sequence length in tokens
-        """
-        ...
-
-    @property
-    def output_sequence_tokens(self) -> str:
-        """
-        Output sequence length in tokens
-        """
-        ...
-
-    @property
-    def time_to_first_token_seconds(self) -> str:
-        """
-        Time to first token in seconds
-        """
-        ...
-
-    @property
-    def inter_token_latency_seconds(self) -> str:
-        """
-        Inter-token latency in seconds
-        """
-        ...
-
-class WorkHandler:
-    """
-    Work handler metrics (component request processing)
-    These methods return the full metric names with the "dynamo_component_" prefix
-    """
-
-    @property
-    def requests_total(self) -> str:
-        """
-        Total number of requests processed by work handler
-        """
-        ...
-
-    @property
-    def request_bytes_total(self) -> str:
-        """
-        Total number of bytes received in requests by work handler
-        """
-        ...
-
-    @property
-    def response_bytes_total(self) -> str:
-        """
-        Total number of bytes sent in responses by work handler
-        """
-        ...
-
-    @property
-    def inflight_requests(self) -> str:
-        """
-        Number of requests currently being processed by work handler
-        """
-        ...
-
-    @property
-    def request_duration_seconds(self) -> str:
-        """
-        Time spent processing requests by work handler (histogram)
-        """
-        ...
-
-    @property
-    def errors_total(self) -> str:
-        """
-        Total number of errors in work handler processing
-        """
-        ...
-
-# Module-level singleton instance for convenient access
-prometheus_names: PrometheusNames
+__all__ = [
+    # ... existing exports ...
+    "prometheus_names"
+]
--- a/lib/bindings/python/src/dynamo/_prometheus_names.pyi
+++ b/lib/bindings/python/src/dynamo/_prometheus_names.pyi
+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+
+"""
+Python type stubs for Prometheus metric name constants
+
+⚠️  **CRITICAL: SYNC WITH RUST SOURCE** ⚠️
+This file must stay in sync with:
+- Source: `lib/runtime/src/metrics/prometheus_names.rs`
+- Bindings: `lib/bindings/python/rust/prometheus_names.rs`
+
+When the Rust source is modified, update all three files immediately.
+"""
+
+class PrometheusNames:
+    """
+    Main container for all Prometheus metric name constants
+    """
+
+    @property
+    def frontend(self) -> FrontendService:
+        """
+        Frontend service metrics
+        """
+        ...
+
+    @property
+    def work_handler(self) -> WorkHandler:
+        """
+        Work handler metrics
+        """
+        ...
+
+    @property
+    def kvstats(self) -> KvStatsMetrics:
+        """
+        KV stats metrics
+        """
+        ...
+
+class FrontendService:
+    """
+    Frontend service metrics (LLM HTTP service)
+    These methods return the full metric names with the "dynamo_frontend_" prefix
+    """
+
+    @property
+    def requests_total(self) -> str:
+        """
+        Total number of LLM requests processed
+        """
+        ...
+
+    @property
+    def queued_requests(self) -> str:
+        """
+        Number of requests waiting in HTTP queue before receiving the first response
+        """
+        ...
+
+    @property
+    def inflight_requests(self) -> str:
+        """
+        Number of inflight requests going to the engine (vLLM, SGLang, ...)
+        """
+        ...
+
+    @property
+    def request_duration_seconds(self) -> str:
+        """
+        Duration of LLM requests
+        """
+        ...
+
+    @property
+    def input_sequence_tokens(self) -> str:
+        """
+        Input sequence length in tokens
+        """
+        ...
+
+    @property
+    def output_sequence_tokens(self) -> str:
+        """
+        Output sequence length in tokens
+        """
+        ...
+
+    @property
+    def time_to_first_token_seconds(self) -> str:
+        """
+        Time to first token in seconds
+        """
+        ...
+
+    @property
+    def inter_token_latency_seconds(self) -> str:
+        """
+        Inter-token latency in seconds
+        """
+        ...
+
+    @property
+    def disconnected_clients(self) -> str:
+        """
+        Number of disconnected clients
+        """
+        ...
+
+    @property
+    def model_total_kv_blocks(self) -> str:
+        """
+        Model total KV blocks
+        """
+        ...
+
+    @property
+    def model_max_num_seqs(self) -> str:
+        """
+        Model max number of sequences
+        """
+        ...
+
+    @property
+    def model_max_num_batched_tokens(self) -> str:
+        """
+        Model max number of batched tokens
+        """
+        ...
+
+    @property
+    def model_context_length(self) -> str:
+        """
+        Model context length
+        """
+        ...
+
+    @property
+    def model_kv_cache_block_size(self) -> str:
+        """
+        Model KV cache block size
+        """
+        ...
+
+    @property
+    def model_migration_limit(self) -> str:
+        """
+        Model migration limit
+        """
+        ...
+
+class WorkHandler:
+    """
+    Work handler metrics (component request processing)
+    These methods return the full metric names with the "dynamo_component_" prefix
+    """
+
+    @property
+    def requests_total(self) -> str:
+        """
+        Total number of requests processed by work handler
+        """
+        ...
+
+    @property
+    def request_bytes_total(self) -> str:
+        """
+        Total number of bytes received in requests by work handler
+        """
+        ...
+
+    @property
+    def response_bytes_total(self) -> str:
+        """
+        Total number of bytes sent in responses by work handler
+        """
+        ...
+
+    @property
+    def inflight_requests(self) -> str:
+        """
+        Number of requests currently being processed by work handler
+        """
+        ...
+
+    @property
+    def request_duration_seconds(self) -> str:
+        """
+        Time spent processing requests by work handler (histogram)
+        """
+        ...
+
+    @property
+    def errors_total(self) -> str:
+        """
+        Total number of errors in work handler processing
+        """
+        ...
+
+class KvStatsMetrics:
+    """
+    KV stats metrics (KV cache statistics)
+    These methods return the metric names with the "kvstats_" prefix
+    """
+
+    @property
+    def active_blocks(self) -> str:
+        """
+        Number of active KV cache blocks currently in use
+        """
+        ...
+
+    @property
+    def total_blocks(self) -> str:
+        """
+        Total number of KV cache blocks available
+        """
+        ...
+
+    @property
+    def gpu_cache_usage_percent(self) -> str:
+        """
+        GPU cache usage as a percentage (0.0-1.0)
+        """
+        ...
+
+    @property
+    def gpu_prefix_cache_hit_rate(self) -> str:
+        """
+        GPU prefix cache hit rate as a percentage (0.0-1.0)
+        """
+        ...
+
+# Module-level singleton instance for convenient access
+prometheus_names: PrometheusNames
+
+
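
The stub's docstring asks that it stay in sync with the Rust source and bindings. One way to catch drift, sketched here under assumptions, is to compare the `#[getter]` names in the Rust bindings against the `@property` names in the stub. The two snippets are trimmed excerpts embedded as strings so the example is self-contained; `rust_getters` and `stub_properties` are hypothetical helpers, and a real check would read the files from disk and compare class by class.

```python
import re

# Trimmed excerpts standing in for the real files (illustration only).
RUST_SNIPPET = """
    #[getter]
    fn queued_requests(&self) -> String {
    #[getter]
    fn inflight_requests(&self) -> String {
    #[getter]
    fn disconnected_clients(&self) -> String {
"""

STUB_SNIPPET = """
    @property
    def queued_requests(self) -> str:
    @property
    def inflight_requests(self) -> str:
"""


def rust_getters(source: str) -> set:
    """Names of functions annotated with #[getter] in the Rust text."""
    return set(re.findall(r"#\[getter\]\s*fn\s+(\w+)\s*\(", source))


def stub_properties(source: str) -> set:
    """Names of @property methods in the .pyi text."""
    return set(re.findall(r"@property\s*def\s+(\w+)\s*\(", source))


missing_in_stub = rust_getters(RUST_SNIPPET) - stub_properties(STUB_SNIPPET)
missing_in_rust = stub_properties(STUB_SNIPPET) - rust_getters(RUST_SNIPPET)

print(sorted(missing_in_stub))  # ['disconnected_clients']
print(sorted(missing_in_rust))  # []
```

The stub excerpt deliberately omits `disconnected_clients` to show what a drift report would look like; the stub added in this commit does include it.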
--- a/lib/llm/src/http/service/metrics.rs
+++ b/lib/llm/src/http/service/metrics.rs
@@ -134,7 +134,8 @@ impl Metrics {
    ///
    /// The following metrics will be created with the configured prefix:
    /// - `{prefix}_requests_total` - IntCounterVec for the total number of requests processed
-    /// - `{prefix}_inflight_requests` - IntGaugeVec for the number of inflight requests
+    /// - `{prefix}_inflight_requests` - IntGaugeVec for the number of inflight/concurrent requests
+    /// - `{prefix}_disconnected_clients` - IntGauge for the number of disconnected clients
    /// - `{prefix}_request_duration_seconds` - HistogramVec for the duration of requests
    /// - `{prefix}_input_sequence_tokens` - HistogramVec for input sequence length in tokens
    /// - `{prefix}_output_sequence_tokens` - HistogramVec for output sequence length in tokens
@@ -185,7 +186,7 @@ impl Metrics {

        let inflight_gauge = IntGaugeVec::new(
            Opts::new(
-                frontend_metric_name(frontend_service::INFLIGHT_REQUESTS_TOTAL),
+                frontend_metric_name(frontend_service::INFLIGHT_REQUESTS),
                "Number of inflight requests",
            ),
            &["model"],
@@ -193,14 +194,14 @@ impl Metrics {
        .unwrap();

        let client_disconnect_gauge = prometheus::IntGauge::new(
-            frontend_metric_name("client_disconnects"),
-            "Number of connections dropped by clients",
+            frontend_metric_name(frontend_service::DISCONNECTED_CLIENTS),
+            "Number of disconnected clients",
        )
        .unwrap();

        let http_queue_gauge = IntGaugeVec::new(
            Opts::new(
-                frontend_metric_name(frontend_service::QUEUED_REQUESTS_TOTAL),
+                frontend_metric_name(frontend_service::QUEUED_REQUESTS),
                "Number of requests in HTTP processing queue",
            ),
            &["model"],

--- a/lib/llm/tests/http_metrics.rs
+++ b/lib/llm/tests/http_metrics.rs
@@ -90,9 +90,9 @@ async fn test_metrics_prefix_default() {

        // Assert metrics that are actually present in the default configuration
        assert!(body.contains("dynamo_frontend_requests_total"));
-        assert!(body.contains("dynamo_frontend_inflight_requests_total"));
+        assert!(body.contains("dynamo_frontend_inflight_requests"));
        assert!(body.contains("dynamo_frontend_request_duration_seconds"));
-        assert!(body.contains("dynamo_frontend_client_disconnects"));
+        assert!(body.contains("dynamo_frontend_disconnected_clients"));

        token.cancel();
        let _ = handle.await;
@@ -271,10 +271,10 @@ async fn test_metrics_with_mock_model() {
        // Assert that key metrics are present with the mockmodel
        assert!(metrics_body.contains("dynamo_frontend_requests_total"));
        assert!(metrics_body.contains("model=\"mockmodel\""));
-        assert!(metrics_body.contains("dynamo_frontend_inflight_requests_total"));
+        assert!(metrics_body.contains("dynamo_frontend_inflight_requests"));
        assert!(metrics_body.contains("dynamo_frontend_request_duration_seconds"));
        assert!(metrics_body.contains("dynamo_frontend_output_sequence_tokens"));
-        assert!(metrics_body.contains("dynamo_frontend_queued_requests_total"));
+        assert!(metrics_body.contains("dynamo_frontend_queued_requests"));

        // Verify specific request counter incremented
        assert!(metrics_body.contains("endpoint=\"chat_completions\""));
@@ -386,6 +386,23 @@ mod integration_tests {
            .await
            .unwrap();

+        // Manually save the model card and update metrics
+        // This simulates what the ModelWatcher polling task would do in production
+        let card = local_model.card().clone();
+        manager.save_model_card("test-mdc-key", card.clone());
+
+        if let Err(e) = service
+            .state()
+            .metrics_clone()
+            .update_metrics_from_mdc(&card)
+        {
+            tracing::debug!(
+                model = %card.display_name,
+                error = %e,
+                "Failed to update MDC metrics in test"
+            );
+        }
+
        // Start the HTTP service
        let token = CancellationToken::new();
        let cancel_token = token.clone();
@@ -456,10 +473,10 @@ mod integration_tests {
        let model_name = model.service_name();
        assert!(metrics_body.contains("dynamo_frontend_requests_total"));
        assert!(metrics_body.contains(&format!("model=\"{}\"", model_name)));
-        assert!(metrics_body.contains("dynamo_frontend_inflight_requests_total"));
+        assert!(metrics_body.contains("dynamo_frontend_inflight_requests"));
        assert!(metrics_body.contains("dynamo_frontend_request_duration_seconds"));
        assert!(metrics_body.contains("dynamo_frontend_output_sequence_tokens"));
-        assert!(metrics_body.contains("dynamo_frontend_queued_requests_total"));
+        assert!(metrics_body.contains("dynamo_frontend_queued_requests"));

        // Assert MDC-based model configuration metrics are present
        // These MUST be present for the test to pass
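
A note on the renamed assertions in this file: `body.contains("dynamo_frontend_inflight_requests")` is a substring check, so it would also pass against output that still used the old `dynamo_frontend_inflight_requests_total` name. The sketch below shows the difference between a substring check and an exact-name check on Prometheus text output. `metric_names` is a hypothetical helper and the two sample bodies are made up for illustration.

```python
import re


def metric_names(exposition: str) -> set:
    """Collect exact metric names from Prometheus text-format output."""
    names = set()
    for line in exposition.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # The metric name ends at the first '{' (labels) or whitespace (value).
        names.add(re.split(r"[{\s]", line, maxsplit=1)[0])
    return names


old_body = 'dynamo_frontend_inflight_requests_total{model="mockmodel"} 0'
new_body = 'dynamo_frontend_inflight_requests{model="mockmodel"} 0'
target = "dynamo_frontend_inflight_requests"

print(target in old_body)                # True: substring also matches the old name
print(target in metric_names(old_body))  # False: exact name is absent
print(target in metric_names(new_body))  # True
```

Whether the Rust tests need this stricter check is a judgment call; the sketch only illustrates what the current assertions do and do not verify.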

--- a/lib/runtime/src/metrics.rs
+++ b/lib/runtime/src/metrics.rs
@@ -1176,8 +1176,8 @@ dynamo_component_nats_client_connection_state 1
 # TYPE dynamo_component_latency histogram
 dynamo_component_latency_bucket{le="0.1"} 10
 dynamo_component_latency_bucket{le="0.5"} 25
-dynamo_component_nats_service_total_requests 100
-dynamo_component_nats_service_total_errors 5"#;
+dynamo_component_nats_service_requests_total 100
+dynamo_component_nats_service_errors_total 5"#;

        // Test remove_nats_lines (excludes NATS lines but keeps help/type)
        let filtered_out = super::test_helpers::remove_nats_lines(test_input);
@@ -1421,7 +1421,11 @@ mod test_metricsregistry_nats {
                1.0,
                1.0,
            ), // Should be connected
-            (build_component_metric_name(nats_client::CONNECTS), 1.0, 1.0), // Should have 1 connection
+            (
+                build_component_metric_name(nats_client::CURRENT_CONNECTIONS),
+                1.0,
+                1.0,
+            ), // Should have 1 connection
            (
                build_component_metric_name(nats_client::IN_TOTAL_BYTES),
                800.0,
@@ -1444,22 +1448,22 @@ mod test_metricsregistry_nats {
            ), // Wide range around 2
            // Component NATS metrics (ordered to match COMPONENT_NATS_METRICS)
            (
-                build_component_metric_name(nats_service::AVG_PROCESSING_MS),
+                build_component_metric_name(nats_service::PROCESSING_MS_AVG),
                0.0,
                0.0,
            ), // No processing yet
            (
-                build_component_metric_name(nats_service::TOTAL_ERRORS),
+                build_component_metric_name(nats_service::ERRORS_TOTAL),
                0.0,
                0.0,
            ), // No errors yet
            (
-                build_component_metric_name(nats_service::TOTAL_REQUESTS),
+                build_component_metric_name(nats_service::REQUESTS_TOTAL),
                0.0,
                0.0,
            ), // No requests yet
            (
-                build_component_metric_name(nats_service::TOTAL_PROCESSING_MS),
+                build_component_metric_name(nats_service::PROCESSING_MS_TOTAL),
                0.0,
                0.0,
            ), // No processing yet
@@ -1550,7 +1554,11 @@ mod test_metricsregistry_nats {
                1.0,
                1.0,
            ), // Connected
-            (build_component_metric_name(nats_client::CONNECTS), 1.0, 1.0), // 1 connection
+            (
+                build_component_metric_name(nats_client::CURRENT_CONNECTIONS),
+                1.0,
+                1.0,
+            ), // 1 connection
            (
                build_component_metric_name(nats_client::IN_TOTAL_BYTES),
                20000.0,
@@ -1573,22 +1581,22 @@ mod test_metricsregistry_nats {
            ), // Wide range around 16
            // Component NATS metrics
            (
-                build_component_metric_name(nats_service::AVG_PROCESSING_MS),
+                build_component_metric_name(nats_service::PROCESSING_MS_AVG),
                0.0,
                1.0,
            ), // Low processing time
            (
-                build_component_metric_name(nats_service::TOTAL_ERRORS),
+                build_component_metric_name(nats_service::ERRORS_TOTAL),
                0.0,
                0.0,
            ), // No errors
            (
-                build_component_metric_name(nats_service::TOTAL_REQUESTS),
+                build_component_metric_name(nats_service::REQUESTS_TOTAL),
                0.0,
                0.0,
            ), // No work handler requests
            (
-                build_component_metric_name(nats_service::TOTAL_PROCESSING_MS),
+                build_component_metric_name(nats_service::PROCESSING_MS_TOTAL),
                0.0,
                5.0,
            ), // Low total processing time
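
The renames in this file follow one pattern: a leading `total_` or `avg_` moves to the end of the name (`TOTAL_REQUESTS` to `REQUESTS_TOTAL`, `AVG_PROCESSING_MS` to `PROCESSING_MS_AVG`). A minimal sketch of that transformation is shown below. `move_prefix_to_suffix` is a hypothetical helper, and it assumes the metric name strings have the same shape as the constant names, which the diff confirms only for `total_requests` and `total_errors`.

```python
def move_prefix_to_suffix(name: str) -> str:
    """Move a leading `total_` or `avg_` to the end of a metric name."""
    for prefix in ("total_", "avg_"):
        if name.startswith(prefix):
            # Drop the prefix and re-attach it (without the underscore) as a suffix.
            return f"{name[len(prefix):]}_{prefix[:-1]}"
    return name  # already in suffix form, or not affected


for old in ("total_requests", "total_errors", "avg_processing_ms", "total_processing_ms"):
    print(old, "->", move_prefix_to_suffix(old))
```

Names that already end in `_total` or `_avg` pass through unchanged, so the helper is safe to apply twice.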

--- a/lib/runtime/src/metrics/prometheus_names.rs
+++ b/lib/runtime/src/metrics/prometheus_names.rs
@@ -20,26 +20,38 @@
 //! **Prefix**: Component identifier (`dynamo_component_`, `dynamo_frontend_`, etc.)
 //! **Name**: Descriptive snake_case name indicating what is measured
 //! **Suffix**:
-//!   - Units: `_seconds`, `_bytes`, `_ms`, `_percent`
-//!   - Counters: `_total` (not `total_` prefix)
+//!   - Units: `_seconds`, `_bytes`, `_ms`, `_percent`, `_messages`, `_connections`
+//!   - Counters: `_total` (not `total_` prefix) - for cumulative metrics that only increase
+//!   - Gauges: No `_total` suffix - for current state metrics that can go up and down
 //!   - Note: Do not use `_counter`, `_gauge`, `_time`, or `_size` in Prometheus names (too vague)
 //!
 //! **Common Transformations**:
 //! - ❌ `_counter` → ✅ `_total`
+//! - ❌ `_sum` → ✅ `_total`
+//! - ❌ `_gauge` → ✅ (no suffix needed for current values)
 //! - ❌ `_time` → ✅ `_seconds`, `_ms`, `_hours`, `_duration_seconds`
+//! - ❌ `_time_total` → ✅ `_seconds_total`, `_ms_total`, `_hours_total`
+//! - ❌ `_total_time` → ✅ `_seconds_total`, `_ms_total`, `_hours_total`
+//! - ❌ `_total_time_seconds` → ✅ `_seconds_total`
+//! - ❌ `_average_time` → ✅ `_seconds_avg`, `_ms_avg`
 //! - ❌ `_size` → ✅ `_bytes`, `_total`, `_length`
-//! - ❌ `_gauge` → ✅ (no suffix needed for current values)
+//! - ❌ `_some_request_size` → ✅ `_some_request_bytes_avg`
 //! - ❌ `_rate` → ✅ `_per_second`, `_per_minute`
+//! - ❌ `disconnected_clients_total` → ✅ `disconnected_clients` (gauge, not counter)
+//! - ❌ `inflight_requests_total` → ✅ `inflight_requests` (gauge, not counter)
+//! - ❌ `connections_total` → ✅ `current_connections` (gauge, not counter)
 //!
 //! **Examples**:
 //! - ✅ `dynamo_frontend_requests_total` - Total request counter (not `incoming_requests`)
 //! - ✅ `dynamo_frontend_request_duration_seconds` - Request duration histogram (not `response_time`)
 //! - ✅ `dynamo_component_errors_total` - Total error counter (not `total_errors`)
 //! - ✅ `dynamo_component_memory_usage_bytes` - Memory usage gauge
-//! - ✅ `dynamo_frontend_inflight_requests_total` - Current inflight requests gauge
+//! - ✅ `dynamo_frontend_inflight_requests` - Current inflight requests gauge
 //! - ✅ `nats_client_connection_duration_ms` - Connection time in milliseconds
 //! - ✅ `dynamo_component_cpu_usage_percent` - CPU usage percentage
 //! - ✅ `dynamo_frontend_tokens_per_second` - Token generation rate
+//! - ✅ `nats_client_current_connections` - Current active connections gauge
+//! - ✅ `nats_client_in_messages` - Total messages received counter
 //!
 //! ## Key Differences: Prometheus Metric Names vs Prometheus Label Names
 //!
@@ -83,11 +95,15 @@ pub mod frontend_service {
    /// Total number of LLM requests processed
    pub const REQUESTS_TOTAL: &str = "requests_total";

-    /// Number of requests waiting in HTTP queue before receiving the first response.
-    pub const QUEUED_REQUESTS_TOTAL: &str = "queued_requests_total";
+    /// Number of requests waiting in HTTP queue before receiving the first response (gauge)
+    pub const QUEUED_REQUESTS: &str = "queued_requests";
+
+    /// Number of inflight/concurrent requests going to the engine (vLLM, SGLang, ...)
+    /// Note: This is a gauge metric (current state) that can go up and down, so no _total suffix
+    pub const INFLIGHT_REQUESTS: &str = "inflight_requests";

-    /// Number of inflight requests going to the engine (vLLM, SGLang, ...)
-    pub const INFLIGHT_REQUESTS_TOTAL: &str = "inflight_requests_total";
+    /// Number of disconnected clients (gauge that can go up and down)
+    pub const DISCONNECTED_CLIENTS: &str = "disconnected_clients";

    /// Duration of LLM requests
    pub const REQUEST_DURATION_SECONDS: &str = "request_duration_seconds";
@@ -157,6 +173,7 @@ pub mod work_handler {
    pub const RESPONSE_BYTES_TOTAL: &str = "response_bytes_total";

    /// Number of requests currently being processed by work handler
+    /// Note: This is a gauge metric (current state) that can go up and down, so no _total suffix
    pub const INFLIGHT_REQUESTS: &str = "inflight_requests";

    /// Time spent processing requests by work handler (histogram)
@@ -214,8 +231,9 @@ pub mod nats_client {
    /// Total number of messages sent by NATS client
    pub const OUT_MESSAGES: &str = nats_client_name!("out_messages");

-    /// Total number of connections established by NATS client
-    pub const CONNECTS: &str = nats_client_name!("connects");
+    /// Current number of active connections for NATS client
+    /// Note: Gauge metric measuring current connections, not cumulative total
+    pub const CURRENT_CONNECTIONS: &str = nats_client_name!("current_connections");

    /// Current connection state of NATS client (0=disconnected, 1=connected, 2=reconnecting)
    pub const CONNECTION_STATE: &str = nats_client_name!("connection_state");
@@ -234,16 +252,16 @@ pub mod nats_service {
    pub const PREFIX: &str = nats_service_name!("");

    /// Average processing time in milliseconds (maps to: average_processing_time in ms)
-    pub const AVG_PROCESSING_MS: &str = nats_service_name!("avg_processing_time_ms");
+    pub const PROCESSING_MS_AVG: &str = nats_service_name!("processing_ms_avg");

    /// Total errors across all endpoints (maps to: num_errors)
-    pub const TOTAL_ERRORS: &str = nats_service_name!("total_errors");
+    pub const ERRORS_TOTAL: &str = nats_service_name!("errors_total");

    /// Total requests across all endpoints (maps to: num_requests)
-    pub const TOTAL_REQUESTS: &str = nats_service_name!("total_requests");
+    pub const REQUESTS_TOTAL: &str = nats_service_name!("requests_total");

    /// Total processing time in milliseconds (maps to: processing_time in ms)
-    pub const TOTAL_PROCESSING_MS: &str = nats_service_name!("total_processing_time_ms");
+    pub const PROCESSING_MS_TOTAL: &str = nats_service_name!("processing_ms_total");

    /// Number of active services (derived from ServiceSet.services)
    pub const ACTIVE_SERVICES: &str = nats_service_name!("active_services");
@@ -255,7 +273,7 @@ pub mod nats_service {
 /// All NATS client Prometheus metric names as an array for iteration/validation
 pub const DRT_NATS_METRICS: &[&str] = &[
    nats_client::CONNECTION_STATE,
-    nats_client::CONNECTS,
+    nats_client::CURRENT_CONNECTIONS,
    nats_client::IN_TOTAL_BYTES,
    nats_client::IN_MESSAGES,
    nats_client::OUT_OVERHEAD_BYTES,
@@ -265,10 +283,10 @@ pub const DRT_NATS_METRICS: &[&str] = &[
 /// All component service Prometheus metric names as an array for iteration/validation
 /// (ordered to match NatsStatsMetrics fields)
 pub const COMPONENT_NATS_METRICS: &[&str] = &[
-    nats_service::AVG_PROCESSING_MS, // maps to: average_processing_time (nanoseconds)
-    nats_service::TOTAL_ERRORS,      // maps to: num_errors
-    nats_service::TOTAL_REQUESTS,    // maps to: num_requests
-    nats_service::TOTAL_PROCESSING_MS, // maps to: processing_time (nanoseconds)
+    nats_service::PROCESSING_MS_AVG, // maps to: average_processing_time (nanoseconds)
+    nats_service::ERRORS_TOTAL,      // maps to: num_errors
+    nats_service::REQUESTS_TOTAL,    // maps to: num_requests
+    nats_service::PROCESSING_MS_TOTAL, // maps to: processing_time (nanoseconds)
    nats_service::ACTIVE_SERVICES,   // derived from ServiceSet.services
    nats_service::ACTIVE_ENDPOINTS,  // derived from ServiceInfo.endpoints
 ];
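
The naming rules documented in `prometheus_names.rs` above could be checked mechanically. The following is a minimal sketch, not part of this commit: `MetricKind` and `check_metric_name` are hypothetical names introduced here for illustration, and the rule set covers only the suffix and `total` placement conventions listed in the module docs. It deliberately does not require counters to end in `_total`, since the docs also accept names such as `nats_client_in_messages`.

```rust
/// Kind of metric being named (hypothetical enum for this sketch).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum MetricKind {
    Counter,
    Gauge,
}

/// Returns the convention violations found in `name`; an empty list means
/// the name passes the checks in this sketch. Hypothetical helper, not an
/// API of the Dynamo runtime.
pub fn check_metric_name(name: &str, kind: MetricKind) -> Vec<&'static str> {
    let mut problems = Vec::new();

    // Vague suffixes that the module docs say to avoid.
    for (suffix, msg) in [
        ("_counter", "use `_total` instead of `_counter`"),
        ("_gauge", "drop the `_gauge` suffix"),
        ("_time", "use a unit such as `_seconds` or `_ms` instead of `_time`"),
        ("_size", "use `_bytes`, `_total`, or `_length` instead of `_size`"),
    ] {
        if name.ends_with(suffix) {
            problems.push(msg);
        }
    }

    // `total` is a suffix, never a prefix or an infix.
    if name.starts_with("total_") || name.contains("_total_") {
        problems.push("`total` belongs at the end of the name as `_total`");
    }

    // Gauges describe current state, so they carry no `_total` suffix.
    if kind == MetricKind::Gauge && name.ends_with("_total") {
        problems.push("gauges should not end in `_total`");
    }

    problems
}
```

Under these assumptions, `check_metric_name("inflight_requests_total", MetricKind::Gauge)` reports one problem and `check_metric_name("inflight_requests", MetricKind::Gauge)` reports none. A check like this would need an allow-list for the `nats_service` metrics, which this commit intentionally keeps as `IntGauge` values with `_total` names because they mirror upstream counters.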

--- a/lib/runtime/src/service.rs
+++ b/lib/runtime/src/service.rs
@@ -306,15 +306,18 @@ mod tests {
 /// Flow: NATS Service → NatsStatsMetrics (Counters) → Metrics Callback → Prometheus Gauge
 /// Note: These are snapshots updated when execute_metrics_callbacks() is called.
 #[derive(Debug, Clone)]
+/// Prometheus metrics for NATS server components.
+/// Note: Metrics with `_total` names use IntGauge because we copy counter values
+/// from underlying services rather than incrementing directly.
 pub struct ComponentNatsServerPrometheusMetrics {
    /// Average processing time in milliseconds (maps to: average_processing_time)
-    pub service_avg_processing_ms: prometheus::Gauge,
+    pub service_processing_ms_avg: prometheus::Gauge,
    /// Total errors across all endpoints (maps to: num_errors)
-    pub service_total_errors: prometheus::IntGauge,
+    pub service_errors_total: prometheus::IntGauge,
    /// Total requests across all endpoints (maps to: num_requests)
-    pub service_total_requests: prometheus::IntGauge,
+    pub service_requests_total: prometheus::IntGauge,
    /// Total processing time in milliseconds (maps to: processing_time)
-    pub service_total_processing_ms: prometheus::IntGauge,
+    pub service_processing_ms_total: prometheus::IntGauge,
    /// Number of active services (derived from ServiceSet.services)
    pub service_active_services: prometheus::IntGauge,
    /// Number of active endpoints (derived from ServiceInfo.endpoints)
@@ -336,26 +339,26 @@ impl ComponentNatsServerPrometheusMetrics {

        let labels: &[(&str, &str)] = &labels_vec;

-        let service_avg_processing_ms = component.create_gauge(
-            nats_service::AVG_PROCESSING_MS,
+        let service_processing_ms_avg = component.create_gauge(
+            nats_service::PROCESSING_MS_AVG,
            "Average processing time across all component endpoints in milliseconds",
            labels,
        )?;

-        let service_total_errors = component.create_intgauge(
-            nats_service::TOTAL_ERRORS,
+        let service_errors_total = component.create_intgauge(
+            nats_service::ERRORS_TOTAL,
            "Total number of errors across all component endpoints",
            labels,
        )?;

-        let service_total_requests = component.create_intgauge(
-            nats_service::TOTAL_REQUESTS,
+        let service_requests_total = component.create_intgauge(
+            nats_service::REQUESTS_TOTAL,
            "Total number of requests across all component endpoints",
            labels,
        )?;

-        let service_total_processing_ms = component.create_intgauge(
-            nats_service::TOTAL_PROCESSING_MS,
+        let service_processing_ms_total = component.create_intgauge(
+            nats_service::PROCESSING_MS_TOTAL,
            "Total processing time across all component endpoints in milliseconds",
            labels,
        )?;
@@ -373,10 +376,10 @@ impl ComponentNatsServerPrometheusMetrics {
        )?;

        Ok(Self {
-            service_avg_processing_ms,
-            service_total_errors,
-            service_total_requests,
-            service_total_processing_ms,
+            service_processing_ms_avg,
+            service_errors_total,
+            service_requests_total,
+            service_processing_ms_total,
            service_active_services,
            service_active_endpoints,
        })
@@ -414,14 +417,14 @@ impl ComponentNatsServerPrometheusMetrics {
        if processing_time_samples > 0 && total_requests > 0 {
            let avg_time_nanos = total_processing_time_nanos as f64 / total_requests as f64;
            let avg_time_ms = avg_time_nanos / 1_000_000.0; // Convert nanoseconds to milliseconds
-            self.service_avg_processing_ms.set(avg_time_ms);
+            self.service_processing_ms_avg.set(avg_time_ms);
        } else {
-            self.service_avg_processing_ms.set(0.0);
+            self.service_processing_ms_avg.set(0.0);
        }

-        self.service_total_errors.set(total_errors as i64); // maps to: num_errors
-        self.service_total_requests.set(total_requests as i64); // maps to: num_requests
-        self.service_total_processing_ms
+        self.service_errors_total.set(total_errors as i64); // maps to: num_errors
+        self.service_requests_total.set(total_requests as i64); // maps to: num_requests
+        self.service_processing_ms_total
            .set((total_processing_time_nanos / 1_000_000) as i64); // maps to: processing_time (converted to milliseconds)
        self.service_active_services.set(service_count); // derived from ServiceSet.services
        self.service_active_endpoints.set(endpoint_count as i64); // derived from ServiceInfo.endpoints
@@ -429,10 +432,10 @@ impl ComponentNatsServerPrometheusMetrics {

    /// Reset all metrics to zero. Useful when no data is available or to clear stale values.
    pub fn reset_to_zeros(&self) {
-        self.service_avg_processing_ms.set(0.0);
-        self.service_total_errors.set(0);
-        self.service_total_requests.set(0);
-        self.service_total_processing_ms.set(0);
+        self.service_processing_ms_avg.set(0.0);
+        self.service_errors_total.set(0);
+        self.service_requests_total.set(0);
+        self.service_processing_ms_total.set(0);
        self.service_active_services.set(0);
        self.service_active_endpoints.set(0);
    }
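
The unit handling in the update path above (nanosecond inputs, millisecond outputs) can be isolated as a pure function. This is a minimal sketch under assumptions: `snapshot_ms` is a hypothetical helper, and the `u64` input types are guesses, since the diff does not show the declared types of `total_processing_time_nanos`, `total_requests`, or `processing_time_samples`.

```rust
/// Computes the values written to `service_processing_ms_avg` and
/// `service_processing_ms_total` from nanosecond totals.
/// Hypothetical helper; input types are assumed to be `u64`.
pub fn snapshot_ms(
    total_processing_time_nanos: u64,
    total_requests: u64,
    processing_time_samples: u64,
) -> (f64, i64) {
    // Average is reported as a float in milliseconds, or 0.0 when there is
    // nothing to average (this also avoids dividing by zero).
    let avg_ms = if processing_time_samples > 0 && total_requests > 0 {
        let avg_nanos = total_processing_time_nanos as f64 / total_requests as f64;
        avg_nanos / 1_000_000.0
    } else {
        0.0
    };

    // Total uses integer division, so sub-millisecond remainders are dropped.
    let total_ms = (total_processing_time_nanos / 1_000_000) as i64;

    (avg_ms, total_ms)
}
```

For example, 10,000,000 ns over 4 requests gives an average of 2.5 ms and a total of 10 ms. Because the total is truncated, 1,999,999 ns is reported as 1 ms.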

--- a/lib/runtime/src/transports/nats.rs
+++ b/lib/runtime/src/transports/nats.rs
@@ -919,8 +919,8 @@ impl DRTNatsClientPrometheusMetrics {
            &[],
        )?;
        let connects = drt.create_intgauge(
-            nats_metrics::CONNECTS,
-            "Total number of connections established by NATS client",
+            nats_metrics::CURRENT_CONNECTIONS,
+            "Current number of active connections for NATS client",
            &[],
        )?;
        let connection_state = drt.create_intgauge(