refactor: consolidate Observability files (e.g. OTEL docker-compose, md files) (#4173)

Signed-off-by: Keiven Chang <keivenchang@users.noreply.github.com>
Co-authored-by: Keiven Chang <keivenchang@users.noreply.github.com>

Unverified Commit afccc9d4 authored Nov 10, 2025 by Keiven C Committed by GitHub Nov 10, 2025
20 changed files
--- a/deploy/logging/grafana/loki-datasource.yaml
+++ b/deploy/logging/grafana/loki-datasource.yaml
--- a/deploy/logging/values/alloy-values.yaml
+++ b/deploy/logging/values/alloy-values.yaml
--- a/deploy/logging/values/loki-values.yaml
+++ b/deploy/logging/values/loki-values.yaml
--- a/deploy/metrics/prometheus.yml
+++ b/deploy/metrics/prometheus.yml
--- a/deploy/tracing/grafana/provisioning/datasources/tempo.yaml
+++ b/deploy/tracing/grafana/provisioning/datasources/tempo.yaml
@@ -9,7 +9,7 @@ datasources:
    access: proxy
    url: http://tempo:3200
    uid: tempo
-    isDefault: true
+    isDefault: false
    editable: true
    jsonData:
      httpMethod: GET

--- a/deploy/tracing/tempo.yaml
+++ b/deploy/tracing/tempo.yaml
--- a/deploy/tracing/docker-compose.yml
+++ b/deploy/tracing/docker-compose.yml
-# SPDX-FileCopyrightText: Copyright (c) 2024-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: Apache-2.0
-version: '3.8'
-services:
-  # Tempo - Distributed tracing backend
-  tempo:
-    image: grafana/tempo:2.8.2
-    command: [ "-config.file=/etc/tempo.yaml" ]
-    volumes:
-      - ./tempo.yaml:/etc/tempo.yaml
-      - tempo-data:/tmp/tempo
-    ports:
-      - "3200:3200"   # Tempo HTTP
-      - "4317:4317"   # OTLP gRPC receiver (accessible from host)
-      - "4318:4318"   # OTLP HTTP receiver (accessible from host)
-  # Grafana - Visualization and dashboards
-  grafana:
-    image: grafana/grafana:12.2.0
-    ports:
-      - "3000:3000"
-    environment:
-      - GF_SECURITY_ADMIN_PASSWORD=admin
-      - GF_FEATURE_TOGGLES_ENABLE=traceqlEditor
-    volumes:
-      - grafana-data:/var/lib/grafana
-      - ./grafana/provisioning:/etc/grafana/provisioning
-    depends_on:
-      - tempo
-volumes:
-  tempo-data:
-  grafana-data:
--- a/docs/_sections/observability.rst
+++ b/docs/_sections/observability.rst
@@ -4,6 +4,10 @@ Observability
 .. toctree::
   :hidden:
+   Overview <../observability/README>
+   Prometheus + Grafana Setup <../observability/prometheus-grafana>
   Metrics <../observability/metrics>
+   Metrics Developer Guide <../observability/metrics-developer-guide>
+   Health Checks <../observability/health-checks>
+   Tracing <../observability/tracing>
   Logging <../observability/logging>
-   Health Checks <../observability/health-checks>
\ No newline at end of file
--- a/docs/hidden_toctree.rst
+++ b/docs/hidden_toctree.rst
@@ -26,6 +26,7 @@
   kubernetes/api_reference.md
   kubernetes/deployment/create_deployment.md
+   kubernetes/deployment/dynamomodel-guide.md
   kubernetes/fluxcd.md
   kubernetes/grove.md

--- a/docs/kubernetes/observability/logging.md
+++ b/docs/kubernetes/observability/logging.md
@@ -25,6 +25,8 @@ While this guide does not use Prometheus, it assumes Grafana is pre-installed wi
 ### 3. Environment Variables
+#### Kubernetes Setup Variables
 The following env variables are set:
 - `MONITORING_NAMESPACE`: The namespace where Loki is installed
 - `DYN_NAMESPACE`: The namespace where Dynamo Cloud Operator is installed
@@ -34,6 +36,14 @@ export MONITORING_NAMESPACE=monitoring
 export DYN_NAMESPACE=dynamo-system
 ```
+#### Dynamo Logging Variables
+| Variable | Description | Example |
+|----------|-------------|---------|
+| `DYN_LOGGING_JSONL` | Enable JSONL logging format (required for Loki) | `true` |
+| `DYN_LOG` | Log levels per target `<default_level>,<module_path>=<level>,<module_path>=<level>` | `DYN_LOG=info,dynamo_runtime::system_status_server=trace` |
+| `DYN_LOG_USE_LOCAL_TZ` | Use local timezone for timestamps | `true` |
 ## Installation Steps
 ### 1. Install Loki
@@ -46,7 +56,7 @@ helm repo add grafana https://grafana.github.io/helm-charts
 helm repo update
 # Install Loki
-helm install --values deploy/logging/values/loki-values.yaml loki grafana/loki -n $MONITORING_NAMESPACE
+helm install --values deploy/observability/k8s/logging/values/loki-values.yaml loki grafana/loki -n $MONITORING_NAMESPACE
 ```
 Our configuration (`loki-values.yaml`) sets up Loki in a simple configuration that is suitable for testing and development. It uses a local MinIO for storage. The installation pods can be viewed with:
@@ -60,7 +70,7 @@ Next, install the Grafana Alloy collector to gather logs from your Kubernetes cl
 ```bash
 # Generate a custom values file with the namespace information
-envsubst < deploy/logging/values/alloy-values.yaml > alloy-custom-values.yaml
+envsubst < deploy/observability/k8s/logging/values/alloy-values.yaml > alloy-custom-values.yaml
 # Install the collector
 helm install --values alloy-custom-values.yaml alloy grafana/k8s-monitoring -n $MONITORING_NAMESPACE
@@ -110,10 +120,10 @@ Since we are using Grafana with the Prometheus Operator, we can simply apply the
 ```bash
 # Configure Grafana with the Loki datasource
-envsubst < deploy/logging/grafana/loki-datasource.yaml | kubectl apply -n $MONITORING_NAMESPACE -f -
+envsubst < deploy/observability/k8s/logging/grafana/loki-datasource.yaml | kubectl apply -n $MONITORING_NAMESPACE -f -
 # Configure Grafana with the Dynamo Logs dashboard
-envsubst < deploy/logging/grafana/logging-dashboard.yaml | kubectl apply -n $MONITORING_NAMESPACE -f -
+envsubst < deploy/observability/k8s/logging/grafana/logging-dashboard.yaml | kubectl apply -n $MONITORING_NAMESPACE -f -
 ```
 > [!Note]
@@ -141,4 +151,4 @@ kubectl port-forward svc/prometheus-grafana 3000:80 -n $MONITORING_NAMESPACE
 If everything is working, under Home > Dashboards > Dynamo Logs, you should see a dashboard that can be used to view the logs associated with our DynamoGraphDeployments
-The dashboard enables filtering by DynamoGraphDeployment, namespace, and component type (e.g frontend, worker, etc).
+The dashboard enables filtering by DynamoGraphDeployment, namespace, and component type (e.g., frontend, worker, etc.).
\ No newline at end of file
--- a/docs/kubernetes/observability/metrics.md
+++ b/docs/kubernetes/observability/metrics.md
@@ -128,9 +128,7 @@ spec:
 Apply the Dynamo dashboard configuration to populate Grafana with the Dynamo dashboard:
 ```bash
-pushd deploy/metrics/k8s
+kubectl apply -n monitoring -f deploy/observability/k8s/grafana-dynamo-dashboard-configmap.yaml
-kubectl apply -n monitoring -f grafana-dynamo-dashboard-configmap.yaml
-popd
 ```
 The dashboard is embedded in the ConfigMap. Because it is labeled with `grafana_dashboard: "1"`, Grafana discovers it and adds it to its list of available dashboards. The dashboard includes panels for:

--- a/docs/observability/README.md
+++ b/docs/observability/README.md
+<!--
+SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+SPDX-License-Identifier: Apache-2.0
+-->
+# Dynamo Observability
+## Getting Started Quickly
+This is an example to get started quickly on a single machine.
+### Prerequisites
+Install these on your machine:
+- [Docker](https://docs.docker.com/get-docker/)
+- [Docker Compose](https://docs.docker.com/compose/install/)
+### Starting the Observability Stack
+Dynamo provides a Docker Compose-based observability stack that includes Prometheus, Grafana, Tempo, and various exporters for metrics, tracing, and visualization.
+From the Dynamo root directory:
+```bash
+# Start infrastructure (NATS, etcd)
+docker compose -f deploy/docker-compose.yml up -d
+# Start observability stack (Prometheus, Grafana, Tempo, DCGM GPU exporter, NATS exporter)
+docker compose -f deploy/docker-observability.yml up -d
+```
+For detailed setup instructions and configuration, see [Prometheus + Grafana Setup](prometheus-grafana.md).
+## Observability Documentation
+| Guide | Description | Environment Variables to Control |
+|-------|-------------|----------------------------------|
+| [Metrics](metrics.md) | Available metrics reference | `DYN_SYSTEM_PORT`† |
+| [Health Checks](health-checks.md) | Component health monitoring and readiness probes | `DYN_SYSTEM_PORT`†, `DYN_SYSTEM_STARTING_HEALTH_STATUS`, `DYN_SYSTEM_HEALTH_PATH`, `DYN_SYSTEM_LIVE_PATH`, `DYN_SYSTEM_USE_ENDPOINT_HEALTH_STATUS` |
+| [Tracing](tracing.md) | Distributed tracing with OpenTelemetry and Tempo | `DYN_LOGGING_JSONL`†, `OTEL_EXPORT_ENABLED`†, `OTEL_EXPORTER_OTLP_TRACES_ENDPOINT`†, `OTEL_SERVICE_NAME`† |
+| [Logging](logging.md) | Structured logging configuration | `DYN_LOGGING_JSONL`†, `DYN_LOG`, `DYN_LOG_USE_LOCAL_TZ`, `DYN_LOGGING_CONFIG_PATH`, `OTEL_SERVICE_NAME`†, `OTEL_EXPORT_ENABLED`†, `OTEL_EXPORTER_OTLP_TRACES_ENDPOINT`† |
+**Variables marked with † are shared across multiple observability systems.**
+## Developer Guides
+| Guide | Description | Environment Variables to Control |
+|-------|-------------|----------------------------------|
+| [Metrics Developer Guide](metrics-developer-guide.md) | Creating custom metrics in Rust and Python | `DYN_SYSTEM_PORT`† |
+## Kubernetes
+For Kubernetes-specific setup and configuration, see [docs/kubernetes/observability/](../kubernetes/observability/).
+---
+## Topology
+The Docker Compose observability stack provides:
+- **Prometheus** on `http://localhost:9090` - metrics collection and querying
+- **Grafana** on `http://localhost:3000` - visualization dashboards (username: `dynamo`, password: `dynamo`)
+- **Tempo** on `http://localhost:3200` - distributed tracing backend
+- **DCGM Exporter** on `http://localhost:9401/metrics` - GPU metrics
+- **NATS Exporter** on `http://localhost:7777/metrics` - NATS messaging metrics
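
To confirm that the endpoints listed above are reachable once the stack is up, a small TCP probe can help. This is a minimal sketch, assuming the host ports from the list; `SERVICES`, `port_open`, and `check_stack` are illustrative names and not part of Dynamo.

```python
import socket

# Host ports taken from the list above; adjust them if your compose file maps ports differently.
SERVICES = {
    "prometheus": 9090,
    "grafana": 3000,
    "tempo": 3200,
    "dcgm-exporter": 9401,
    "nats-exporter": 7777,
}


def port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_stack(host: str = "localhost") -> dict:
    """Probe every service port and report which ones accept connections."""
    return {name: port_open(host, port) for name, port in SERVICES.items()}


if __name__ == "__main__":
    for name, ok in check_stack().items():
        print(f"{name:15s} {'up' if ok else 'DOWN'}")
```

An open port only shows that something is listening; it does not prove the service behind it is healthy.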
+### Service Relationship Diagram
+```mermaid
+graph TD
+    BROWSER[Browser] -->|:3000| GRAFANA[Grafana :3000]
+    subgraph DockerComposeNetwork [Network inside Docker Compose]
+        NATS_PROM_EXP[nats-prom-exp :7777 /metrics] -->|:8222/varz| NATS_SERVER[nats-server :4222, :6222, :8222]
+        PROMETHEUS[Prometheus server :9090] -->|:2379/metrics| ETCD_SERVER[etcd-server :2379, :2380]
+        PROMETHEUS -->|:9401/metrics| DCGM_EXPORTER[dcgm-exporter :9401]
+        PROMETHEUS -->|:7777/metrics| NATS_PROM_EXP
+        PROMETHEUS -->|:8000/metrics| DYNAMOFE[Dynamo HTTP FE :8000]
+        PROMETHEUS -->|:8081/metrics| DYNAMOBACKEND[Dynamo backend :8081]
+        DYNAMOFE --> DYNAMOBACKEND
+        GRAFANA -->|:9090/query API| PROMETHEUS
+    end
+```
+The dcgm-exporter service in the Docker Compose network is configured to use port 9401 instead of the default port 9400. This adjustment is made to avoid port conflicts with other dcgm-exporter instances that may be running simultaneously. Such a configuration is typical in distributed systems like SLURM.
+### Configuration Files
+The following configuration files are located in the `deploy/` and `deploy/observability/` directories:
+- [docker-compose.yml](../../deploy/docker-compose.yml): Defines NATS and etcd services
+- [docker-observability.yml](../../deploy/docker-observability.yml): Defines Prometheus, Grafana, Tempo, and exporters
+- [prometheus.yml](../../deploy/observability/prometheus.yml): Contains Prometheus scraping configuration
+- [grafana-datasources.yml](../../deploy/observability/grafana-datasources.yml): Contains Grafana datasource configuration
+- [grafana_dashboards/dashboard-providers.yml](../../deploy/observability/grafana_dashboards/dashboard-providers.yml): Contains Grafana dashboard provider configuration
+- [grafana_dashboards/dynamo.json](../../deploy/observability/grafana_dashboards/dynamo.json): A general Dynamo dashboard covering both software and hardware metrics
+- [grafana_dashboards/dcgm-metrics.json](../../deploy/observability/grafana_dashboards/dcgm-metrics.json): Contains Grafana dashboard configuration for DCGM GPU metrics
+- [grafana_dashboards/kvbm.json](../../deploy/observability/grafana_dashboards/kvbm.json): Contains Grafana dashboard configuration for KVBM metrics
--- a/docs/observability/health-checks.md
+++ b/docs/observability/health-checks.md
@@ -11,6 +11,38 @@ Dynamo provides health check and liveness HTTP endpoints for each component whic
 can be used to configure startup, liveness and readiness probes in
 orchestration frameworks such as Kubernetes.
+## Environment Variables
+| Variable | Description | Default | Example |
+|----------|-------------|---------|---------|
+| `DYN_SYSTEM_PORT` | System status server port (enabled when set to a positive value) | `-1` (disabled) | `9090` |
+| `DYN_SYSTEM_STARTING_HEALTH_STATUS` | Initial health status | `notready` | `ready`, `notready` |
+| `DYN_SYSTEM_HEALTH_PATH` | Custom health endpoint path | `/health` | `/custom/health` |
+| `DYN_SYSTEM_LIVE_PATH` | Custom liveness endpoint path | `/live` | `/custom/live` |
+| `DYN_SYSTEM_USE_ENDPOINT_HEALTH_STATUS` | Endpoints required for ready state | none | `["generate"]` |
+## Getting Started Quickly
+Enable health checks and query endpoints:
+```bash
+# Start your Dynamo components
+python -m dynamo.frontend --http-port 8000 &
+# Enable system status server on port 8081
+DYN_SYSTEM_PORT=8081 python -m dynamo.vllm --model Qwen/Qwen3-0.6B --enforce-eager &
+```
+Check health status:
+```bash
+# Frontend health (port 8000)
+curl -s localhost:8000/health | jq
+# Worker health (port 8081)
+curl -s localhost:8081/health | jq
+```
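
In scripts it is often handier to wait for a component than to run `curl` by hand. The following is a minimal sketch, assuming that a ready component answers `/health` with HTTP 200 and that any other status or a refused connection means "not ready yet"; `http_status` and `wait_until_ready` are illustrative helpers, not Dynamo APIs.

```python
import time
import urllib.error
import urllib.request


def http_status(url: str, timeout: float = 2.0):
    """Return the HTTP status code for url, or None if the server is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # non-2xx responses still carry a status code
    except (urllib.error.URLError, OSError):
        return None


def wait_until_ready(url, attempts=30, delay=1.0, fetch=http_status, sleep=time.sleep):
    """Poll url until it answers 200; return True on success, False if attempts run out."""
    for _ in range(attempts):
        if fetch(url) == 200:
            return True
        sleep(delay)
    return False
```

For the setup above, `wait_until_ready("http://localhost:8000/health")` would cover the frontend and `wait_until_ready("http://localhost:8081/health")` the worker. The `fetch` and `sleep` parameters exist only so the loop can be exercised without a running server.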
 ## Frontend Liveness Check
 The frontend liveness endpoint reports a status of `live` as long as
@@ -124,16 +156,6 @@ when initializing and HTTP status code `HTTP/1.1 200 OK` once ready.
 > **Note**: Both /live and /ready return the same information
-### Environment Variables for Enabling Health Checks
-| **Environment Variable** | **Description**     | **Example Settings**                             |
-| -------------------------| ------------------- | ------------------------------------------------ |
-| `DYN_SYSTEM_PORT`        | Specifies the port for the system status server (automatically enables it when set to a positive value). | `9090`, `8081`                           |
-| `DYN_SYSTEM_STARTING_HEALTH_STATUS`     | Sets the initial health status of the system (ready/not ready).                | `ready`, `notready`      |
-| `DYN_SYSTEM_HEALTH_PATH`                | Custom path for the health endpoint.                                         | `/custom/health`           |
-| `DYN_SYSTEM_LIVE_PATH`                   | Custom path for the liveness endpoint.                                       | `/custom/live`            |
-| `DYN_SYSTEM_USE_ENDPOINT_HEALTH_STATUS` | Specifies endpoints to check for determining overall system health status.    | `["generate"]`            |
 ### Example Environment Setting
 ```

--- a/docs/observability/logging.md
+++ b/docs/observability/logging.md
 <!--
 SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 SPDX-License-Identifier: Apache-2.0
-Licensed under the Apache License, Version 2.0 (the "License");
-you may not use this file except in compliance with the License.
-You may obtain a copy of the License at
-http://www.apache.org/licenses/LICENSE-2.0
-Unless required by applicable law or agreed to in writing, software
-distributed under the License is distributed on an "AS IS" BASIS,
-WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-See the License for the specific language governing permissions and
-limitations under the License.
 -->
 # Dynamo Logging
@@ -24,18 +12,38 @@ JSONL is enabled logs additionally contain `span` creation and exit
 events as well as support for `trace_id` and `span_id` fields for
 distributed tracing.
-## Environment Variables for configuring Logging
+## Environment Variables
+| Variable | Description | Default | Example |
+|----------|-------------|---------|---------|
+| `DYN_LOGGING_JSONL` | Enable JSONL logging format | `false` | `true` |
+| `DYN_LOG` | Log levels per target `<default_level>,<module_path>=<level>,<module_path>=<level>` | `info` | `DYN_LOG=info,dynamo_runtime::system_status_server=trace` |
+| `DYN_LOG_USE_LOCAL_TZ` | Use local timezone for timestamps (default is UTC) | `false` | `true` |
+| `DYN_LOGGING_CONFIG_PATH` | Path to custom TOML logging configuration | none | `/path/to/config.toml` |
+| `OTEL_SERVICE_NAME` | Service name for trace and span information | `dynamo` | `dynamo-frontend` |
+| `OTEL_EXPORT_ENABLED` | Enable OTLP trace exporting | `false` | `true` |
+| `OTEL_EXPORTER_OTLP_TRACES_ENDPOINT` | OTLP exporter endpoint | `http://localhost:4317` | `http://tempo:4317` |
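
The `DYN_LOG` format in the table (`<default_level>,<module_path>=<level>,...`) can be made concrete with a small parser. This is a sketch of the documented format only, not Dynamo's actual filter implementation; `parse_dyn_log` is a hypothetical helper, and the `info` fallback mirrors the default listed in the table.

```python
def parse_dyn_log(spec: str, fallback: str = "info"):
    """Split a DYN_LOG-style string into (default_level, {module_path: level})."""
    default_level = fallback
    overrides = {}
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "=" in part:
            # A per-module override such as "my_crate::module=trace".
            module_path, level = part.rsplit("=", 1)
            overrides[module_path.strip()] = level.strip().lower()
        else:
            # A bare level sets the default for every other target.
            default_level = part.lower()
    return default_level, overrides


print(parse_dyn_log("info,dynamo_runtime::system_status_server=trace"))
```

Read this way, everything logs at `info` except `dynamo_runtime::system_status_server`, which logs at `trace`.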
+## Getting Started Quickly
+### Start Observability Stack
+For collecting and visualizing logs with Grafana Loki (Kubernetes), or viewing trace context in logs alongside Grafana Tempo, start the observability stack. See [Observability Getting Started](README.md#getting-started-quickly) for instructions.
+### Enable Structured Logging
+Enable structured JSONL logging:
-| Environment Variable                | Description                                 | Example Settings                  |
+```bash
-| ----------------------------------- | --------------------------------------------| ---------------------------------------------------- |
+export DYN_LOGGING_JSONL=true
-| `DYN_LOGGING_JSONL`                | Enable JSONL logging format (default: READABLE)                  | `DYN_LOGGING_JSONL=true`                          |
+export DYN_LOG=debug
-| `DYN_LOG_USE_LOCAL_TZ`             | Use local timezone for logging timestamps (default: UTC)         | `DYN_LOG_USE_LOCAL_TZ=1`                       |
-| `DYN_LOG`                          | Log levels per target `<default_level>,<module_path>=<level>,<module_path>=<level>`             | `DYN_LOG=info,dynamo_runtime::system_status_server:trace`  |
-| `DYN_LOGGING_CONFIG_PATH`          | Path to custom TOML logging configuration file            | `DYN_LOGGING_CONFIG_PATH=/path/to/config.toml`|
-| `OTEL_SERVICE_NAME`                | Service name for OpenTelemetry traces (default: `dynamo`) | `OTEL_SERVICE_NAME=dynamo-frontend` |
-| `OTEL_EXPORT_ENABLED`              | Enable OTLP trace exporting (set to `1` to enable) | `OTEL_EXPORT_ENABLED=1` |
-| `OTEL_EXPORTER_OTLP_TRACES_ENDPOINT`             | OTLP exporter endpoint (default: http://localhost:4317) | `OTEL_EXPORTER_OTLP_TRACES_ENDPOINT=http://tempo:4317` |
+# Start your Dynamo components
+python -m dynamo.frontend --http-port 8000 &
+python -m dynamo.vllm --model Qwen/Qwen3-0.6B --enforce-eager &
+```
+Logs will be written to stderr in JSONL format with trace context.
 ## Available Logging Levels
@@ -85,68 +93,57 @@ Resulting Log format:
 {"time":"2025-09-02T15:53:31.943747Z","level":"INFO","target":"log","message":"Scheduler config values: {'max_num_seqs': 256, 'max_num_batched_tokens': 2048}","log.file":"/opt/dynamo/venv/lib/python3.12/site-packages/dynamo/vllm/main.py","log.line":268,"log.target":"main.get_engine_cache_info"}
 ```
-## OpenTelemetry Distributed Tracing
+## Logging of Trace and Span IDs
-When `DYN_LOGGING_JSONL` is enabled, Dynamo uses OpenTelemetry for distributed tracing. All logs include `trace_id` and `span_id` fields, and spans are automatically created for requests. By default, traces are **not exported**. To export traces to an observability backend (like Tempo, Jaeger, or Zipkin), set `OTEL_EXPORT_ENABLED=1`.
-### Behavior
+When `DYN_LOGGING_JSONL` is enabled, all logs include `trace_id` and `span_id` fields, and spans are automatically created for requests. This is useful for correlating log messages with traces, and for short debugging sessions where you want to examine trace context in logs without setting up a full tracing backend.
- **With `DYN_LOGGING_JSONL=true` only**: OpenTelemetry layer is active, generating trace context and span IDs for all requests. Traces appear in logs but are not exported anywhere.
+The trace and span information uses the OpenTelemetry format and libraries, which means the IDs are compatible with OpenTelemetry-based tracing backends like Tempo or Jaeger if you later choose to enable trace export.
- **With `OTEL_EXPORT_ENABLED=1` and `DYN_LOGGING_JSONL=true`**: Same as above, plus traces are exported to an OTLP collector for visualization.
-### Configuration
+**Note:** This section overlaps with [Distributed Tracing with Tempo](tracing.md), which covers trace visualization in Grafana Tempo and persistent trace analysis.
-To enable OTLP trace exporting:
+### Configuration for Logging
-1. Set `OTEL_EXPORT_ENABLED=1` to enable trace export
+To see trace information in logs:
-2. Optionally configure the endpoint using `OTEL_EXPORTER_OTLP_TRACES_ENDPOINT` (default: `http://localhost:4317`)
-3. Optionally set `OTEL_SERVICE_NAME` to identify the service (useful in Kubernetes, default: `dynamo`)
-**Export Settings:**
- **Protocol**: gRPC (Tonic)
- **Service Name**: Value of `OTEL_SERVICE_NAME` env var, or `dynamo` if not set
- **Endpoint**: Value of `OTEL_EXPORTER_OTLP_TRACES_ENDPOINT` env var, or `http://localhost:4317` if not set
-### Example: JSONL Logging Only (No Export)
 ```bash
 export DYN_LOGGING_JSONL=true
-# OpenTelemetry is active, traces appear in logs, but nothing is exported
+export DYN_LOG=debug  # Set to debug to see detailed trace logs
+# Start your Dynamo components (e.g., frontend and worker)
+python -m dynamo.frontend --http-port 8000 &
+python -m dynamo.vllm --model Qwen/Qwen3-0.6B --enforce-eager &
 ```
-### Example: JSONL Logging + Trace Export to Tempo
+This enables JSONL logging with `trace_id` and `span_id` fields. Traces appear in logs but are not exported to any backend.
+### Example Request
+Send a request to generate logs with trace context:
 ```bash
-export DYN_LOGGING_JSONL=true
+curl -H 'Content-Type: application/json' \
-export OTEL_EXPORT_ENABLED=1
+-H 'x-request-id: test-trace-001' \
-export OTEL_EXPORTER_OTLP_TRACES_ENDPOINT=http://tempo:4317
+-d '{
-export OTEL_SERVICE_NAME=dynamo-frontend
+  "model": "Qwen/Qwen3-0.6B",
-# OpenTelemetry is active, traces appear in logs AND are exported to Tempo
+  "max_completion_tokens": 100,
+  "messages": [
+    {"role": "user", "content": "What is the capital of France?"}
+  ]
+}' \
+http://localhost:8000/v1/chat/completions
 ```
-## Trace and Span Information
+Check the logs (stderr) for JSONL output containing `trace_id`, `span_id`, and `x_request_id` fields.
-### Example Request
+## Trace and Span Information in Logs
-```sh
+This section shows how trace and span information appears in JSONL logs. These logs can be used to understand request flows even without a trace visualization backend.
-curl -X POST http://localhost:8000/v1/chat/completions \
-  -H 'Content-Type: application/json' \
+### Example Disaggregated Trace in Grafana
-  -d '{
-    "model": "Qwen/Qwen3-0.6B",
-    "messages": [
-      {
-        "role": "user",
-        "content": "Explain why Roger Federer is considered one of the greatest tennis players of all time"
-      }
-    ],
-    "stream": true,
-    "max_tokens": 1000,
-  }'
-```
 When viewing the corresponding trace in Grafana, you should be able to see something like the following:
-![Trace Example](./grafana-disagg-trace.png)
+![Disaggregated Trace Example](grafana-disagg-trace.png)
 ### Trace Overview
@@ -208,18 +205,18 @@ When viewing the corresponding trace in Grafana, you should be able to see somet
 | **Busy Time** | 3,795,258 ns (3.79ms) |
 | **Idle Time** | 3,996,532,471 ns (3.99s) |
-### Frontend Logs
+### Frontend Logs with Trace Context
 The following shows the JSONL logs from the frontend service for the same request. Note the `trace_id` field (`b672ccf48683b392891c5cb4163d4b51`) that correlates all logs for this request, and the `span_id` field that identifies individual operations:
 ```
-{"time":"2025-10-31T20:52:07.707164Z","level":"INFO","file":"/opt/dynamo/lib/runtime/src/logging.rs","line":806,"target":"dynamo_runtime::logging","message":"OpenTelemetry OTLP export enabled","endpoint":"http://tempo.tm.svc.cluster.local:4317","service":"frontend"}
+{"time":"2025-10-31T20:52:07.707164Z","level":"INFO","file":"/opt/dynamo/lib/runtime/src/logging.rs","line":806,"target":"dynamo_runtime::logging","message":"OTLP export enabled","endpoint":"http://tempo.tm.svc.cluster.local:4317","service":"frontend"}
 {"time":"2025-10-31T20:52:10.707164Z","level":"DEBUG","file":"/opt/dynamo/lib/runtime/src/pipeline/network/tcp/server.rs","line":230,"target":"dynamo_runtime::pipeline::network::tcp::server","message":"Registering new TcpStream on 10.0.4.65:41959","method":"POST","span_id":"5c20cc08e6afb2b7","span_name":"http-request","trace_id":"b672ccf48683b392891c5cb4163d4b51","uri":"/v1/chat/completions","version":"HTTP/1.1"}
 {"time":"2025-10-31T20:52:10.745264Z","level":"DEBUG","file":"/opt/dynamo/lib/llm/src/kv_router/prefill_router.rs","line":232,"target":"dynamo_llm::kv_router::prefill_router","message":"Prefill succeeded, using disaggregated params for decode","method":"POST","span_id":"5c20cc08e6afb2b7","span_name":"http-request","trace_id":"b672ccf48683b392891c5cb4163d4b51","uri":"/v1/chat/completions","version":"HTTP/1.1"}
 {"time":"2025-10-31T20:52:10.745545Z","level":"DEBUG","file":"/opt/dynamo/lib/runtime/src/pipeline/network/tcp/server.rs","line":230,"target":"dynamo_runtime::pipeline::network::tcp::server","message":"Registering new TcpStream on 10.0.4.65:41959","method":"POST","span_id":"5c20cc08e6afb2b7","span_name":"http-request","trace_id":"b672ccf48683b392891c5cb4163d4b51","uri":"/v1/chat/completions","version":"HTTP/1.1"}
 ```
-## Custom Request IDs
+## Custom Request IDs in Logs
 You can provide a custom request ID using the `x-request-id` header. This ID will be attached to all spans and logs for that request, making it easier to correlate traces with application-level request tracking.
@@ -237,7 +234,7 @@ curl -X POST http://localhost:8000/v1/chat/completions \
        "content": "Explain why Roger Federer is considered one of the greatest tennis players of all time"
      }
    ],
-    "stream": true,
+    "stream": false,
    "max_tokens": 1000
  }'
 ```
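
Without a tracing backend, the correlation described in this file can be done directly on the JSONL output. The following is a minimal sketch that groups log records by `trace_id` (or by `x_request_id` for custom request IDs); `group_by_field` is an illustrative helper, and the sample records are abbreviated from the frontend logs shown earlier.

```python
import json
from collections import defaultdict


def group_by_field(lines, field="trace_id"):
    """Group parsed JSONL records by the given field; records without it go under None."""
    groups = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON output mixed into stderr
        if not isinstance(record, dict):
            continue  # a bare JSON value is not a log record
        groups[record.get(field)].append(record)
    return dict(groups)


# Abbreviated records modeled on the frontend logs above.
sample = [
    '{"time":"2025-10-31T20:52:07.707164Z","level":"INFO","message":"OTLP export enabled"}',
    '{"time":"2025-10-31T20:52:10.707164Z","level":"DEBUG","message":"Registering new TcpStream on 10.0.4.65:41959","span_id":"5c20cc08e6afb2b7","trace_id":"b672ccf48683b392891c5cb4163d4b51"}',
    '{"time":"2025-10-31T20:52:10.745264Z","level":"DEBUG","message":"Prefill succeeded, using disaggregated params for decode","span_id":"5c20cc08e6afb2b7","trace_id":"b672ccf48683b392891c5cb4163d4b51"}',
]

for trace_id, records in group_by_field(sample).items():
    print(trace_id, len(records))
```

Passing `field="x_request_id"` groups the same records by the custom request ID instead, assuming that field is present in the log output.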

--- a/docs/observability/metrics-developer-guide.md
+++ b/docs/observability/metrics-developer-guide.md
+<!--
+SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+SPDX-License-Identifier: Apache-2.0
+-->
+# Metrics Developer Guide
+This guide explains how to create and use custom metrics in Dynamo components using the Dynamo metrics API.
+## Metrics Exposure
+All metrics created via the Dynamo metrics API are automatically exposed on the `/metrics` HTTP endpoint in Prometheus Exposition Format text when the following environment variable is set:
+- `DYN_SYSTEM_PORT=<port>` - Port for the metrics endpoint (set to a positive value to enable; the default, `-1`, disables it)
+Example:
+```bash
+DYN_SYSTEM_PORT=8081 python -m dynamo.vllm --model <model>
+```
+Prometheus Exposition Format text metrics will be available at: `http://localhost:8081/metrics`
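
To show what the text format at `/metrics` looks like to a consumer, here is a minimal parsing sketch. It handles only the common `name{labels} value` lines, skips `# HELP` and `# TYPE` comments, and does not unescape label values; the metric names in `SAMPLE_TEXT` are hypothetical and not the names Dynamo actually emits.

```python
import re

# A sample line is: metric name, optional {labels}, then the value.
_SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
_LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Parse Prometheus text exposition into a list of (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = _SAMPLE_RE.match(line)
        if match is None:
            continue
        name, raw_labels, value = match.groups()
        labels = dict(_LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


# Hypothetical scrape output, for illustration only.
SAMPLE_TEXT = """\
# HELP requests_total Total requests
# TYPE requests_total counter
requests_total{model_type="llama",model_size="7b"} 3
active_connections 42
"""

for name, labels, value in parse_metrics(SAMPLE_TEXT):
    print(name, labels, value)
```

Against a running component, the input text could come from `urllib.request.urlopen("http://localhost:8081/metrics").read().decode()`. For production use, an established Prometheus client library parser is a safer choice than this regex.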
+## Metric Name Constants
+The [prometheus_names.rs](../../lib/runtime/src/metrics/prometheus_names.rs) module provides centralized metric name constants and sanitization functions to ensure consistency across all Dynamo components.
+---
+## Metrics API in Rust
+The metrics API is accessible through the `.metrics()` method on runtime, namespace, component, and endpoint objects. See [Runtime Hierarchy](metrics.md#runtime-hierarchy) for details on the hierarchical structure.
+### Available Methods
+- `.metrics().create_counter()`: Create a counter metric
+- `.metrics().create_gauge()`: Create a gauge metric
+- `.metrics().create_histogram()`: Create a histogram metric
+- `.metrics().create_countervec()`: Create a counter with labels
+- `.metrics().create_gaugevec()`: Create a gauge with labels
+- `.metrics().create_histogramvec()`: Create a histogram with labels
+### Creating Metrics
+```rust
+use dynamo_runtime::DistributedRuntime;
+let runtime = DistributedRuntime::new()?;
+let endpoint = runtime.namespace("my_namespace").component("my_component").endpoint("my_endpoint");
+// Simple metrics
+let requests_total = endpoint.metrics().create_counter(
+    "requests_total",
+    "Total requests",
+    &[]
+)?;
+let active_connections = endpoint.metrics().create_gauge(
+    "active_connections",
+    "Active connections",
+    &[]
+)?;
+let latency = endpoint.metrics().create_histogram(
+    "latency_seconds",
+    "Request latency",
+    &[],
+    Some(vec![0.001, 0.01, 0.1, 1.0, 10.0])
+)?;
+```
+### Using Metrics
+```rust
+// Counters
+requests_total.inc();
+// Gauges
+active_connections.set(42.0);
+active_connections.inc();
+active_connections.dec();
+// Histograms
+latency.observe(0.023);  // 23ms
+```
+### Vector Metrics with Labels
+```rust
+// Create vector metrics with label names
+let requests_by_model = endpoint.metrics().create_countervec(
+    "requests_by_model",
+    "Requests by model",
+    &["model_type", "model_size"],
+    &[]
+)?;
+let memory_by_gpu = endpoint.metrics().create_gaugevec(
+    "gpu_memory_bytes",
+    "GPU memory by device",
+    &["gpu_id", "memory_type"],
+    &[]
+)?;
+// Use with specific label values
+requests_by_model.with_label_values(&["llama", "7b"]).inc();
+memory_by_gpu.with_label_values(&["0", "allocated"]).set(8192.0);
+```
+### Advanced Features
+**Custom histogram buckets:**
+```rust
+let latency = endpoint.metrics().create_histogram(
+    "latency_seconds",
+    "Request latency",
+    &[],
+    Some(vec![0.001, 0.01, 0.1, 1.0, 10.0])
+)?;
+```
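
Bucket boundaries matter because Prometheus histograms are cumulative: each bucket counts every observation less than or equal to its upper bound, and a `+Inf` bucket is always added. The following sketch illustrates that semantics in plain Python with the bucket list used above; `cumulative_buckets` is an illustrative helper and does not call the Dynamo metrics API.

```python
import math


def cumulative_buckets(observations, bounds):
    """Return {upper_bound: count} the way a Prometheus histogram reports `le` buckets."""
    edges = sorted(bounds) + [math.inf]  # Prometheus always adds a +Inf bucket
    return {edge: sum(1 for v in observations if v <= edge) for edge in edges}


# Three example latencies in seconds: 23 ms, 4 ms, and 2.5 s.
counts = cumulative_buckets([0.023, 0.004, 2.5], [0.001, 0.01, 0.1, 1.0, 10.0])
for upper_bound, count in counts.items():
    print(f"le={upper_bound}: {count}")
```

With these inputs, the 23 ms observation first appears in the `0.1` bucket and in every larger one, which is why buckets should bracket the latencies you expect to see.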
+**Constant labels:**
+```rust
+let counter = endpoint.metrics().create_counter(
+    "requests_total",
+    "Total requests",
+    &[("region", "us-west"), ("env", "prod")]
+)?;
+```
+---
+## Metrics API in Python
+Python components can create and manage Prometheus metrics using the same metrics API through Python bindings.
+### Available Methods
+- `endpoint.metrics.create_counter()` / `create_intcounter()`: Create a counter metric
+- `endpoint.metrics.create_gauge()` / `create_intgauge()`: Create a gauge metric
+- `endpoint.metrics.create_histogram()`: Create a histogram metric
+- `endpoint.metrics.create_countervec()` / `create_intcountervec()`: Create a counter with labels
+- `endpoint.metrics.create_gaugevec()` / `create_intgaugevec()`: Create a gauge with labels
+- `endpoint.metrics.create_histogramvec()`: Create a histogram with labels
+All metrics are imported from `dynamo.prometheus_metrics`.
+### Creating Metrics
+```python
+from dynamo.runtime import DistributedRuntime
+drt = DistributedRuntime()
+endpoint = drt.namespace("my_namespace").component("my_component").endpoint("my_endpoint")
+# Simple metrics
+requests_total = endpoint.metrics.create_intcounter(
+    "requests_total",
+    "Total requests"
+)
+active_connections = endpoint.metrics.create_intgauge(
+    "active_connections",
+    "Active connections"
+)
+latency = endpoint.metrics.create_histogram(
+    "latency_seconds",
+    "Request latency",
+    buckets=[0.001, 0.01, 0.1, 1.0, 10.0]
+)
+```
+### Using Metrics
+```python
+# Counters
+requests_total.inc()
+requests_total.inc_by(5)
+# Gauges
+active_connections.set(42)
+active_connections.inc()
+active_connections.dec()
+# Histograms
+latency.observe(0.023)  # 23ms
+```
+### Vector Metrics with Labels
+```python
+# Create vector metrics with label names
+requests_by_model = endpoint.metrics.create_intcountervec(
+    "requests_by_model",
+    "Requests by model",
+    ["model_type", "model_size"]
+)
+memory_by_gpu = endpoint.metrics.create_intgaugevec(
+    "gpu_memory_bytes",
+    "GPU memory by device",
+    ["gpu_id", "memory_type"]
+)
+# Use with specific label values
+requests_by_model.inc({"model_type": "llama", "model_size": "7b"})
+memory_by_gpu.set(8192, {"gpu_id": "0", "memory_type": "allocated"})
+```
+### Advanced Features
+**Constant labels:**
+```python
+counter = endpoint.metrics.create_intcounter(
+    "requests_total",
+    "Total requests",
+    [("region", "us-west"), ("env", "prod")]
+)
+```
+**Metric introspection:**
+```python
+print(counter.name())            # "my_namespace_my_component_my_endpoint_requests_total"
+print(counter.const_labels())    # {"dynamo_namespace": "my_namespace", ...}
+print(requests_by_model.variable_labels())  # ["model_type", "model_size"]
+```
+**Update patterns:**
+Background thread updates:
+```python
+import threading
+import time
+def update_loop():
+    # compute_current_connections() is a placeholder for application-specific logic
+    while True:
+        active_connections.set(compute_current_connections())
+        time.sleep(2)
+threading.Thread(target=update_loop, daemon=True).start()
+```
+Callback-based updates (called before each `/metrics` scrape):
+```python
+def update_metrics():
+    active_connections.set(compute_current_connections())
+endpoint.metrics.register_callback(update_metrics)
+```
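To make the difference between the two update patterns concrete, here is a minimal sketch of the callback idea using a stand-in registry. `ScrapeRegistry`, `set_gauge`, and `scrape` are hypothetical names used only for illustration; they are not Dynamo's implementation or API. The point is that registered callbacks run at scrape time, so the reported value is never older than the scrape itself.

```python
class ScrapeRegistry:
    """Hypothetical stand-in for a metrics registry (not Dynamo's API)."""

    def __init__(self):
        self._callbacks = []
        self._gauges = {}

    def register_callback(self, fn):
        # Callbacks are stored and invoked later, once per scrape.
        self._callbacks.append(fn)

    def set_gauge(self, name, value):
        self._gauges[name] = value

    def scrape(self):
        # Refresh values first, then render them as "name value" lines.
        for fn in self._callbacks:
            fn()
        return "\n".join(f"{name} {value}" for name, value in sorted(self._gauges.items()))


registry = ScrapeRegistry()
connections = ["a", "b", "c"]  # illustrative application state

registry.register_callback(
    lambda: registry.set_gauge("active_connections", len(connections))
)

connections.append("d")  # state changes after registration
output = registry.scrape()  # the callback picks up the change at scrape time
print(output)
```

A background thread, by contrast, reports whatever was set on its last iteration, which may lag the scrape by up to one sleep interval.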
+### Examples
+Example scripts: [lib/bindings/python/examples/metrics/](../../lib/bindings/python/examples/metrics/)
+```bash
+cd ~/dynamo/lib/bindings/python/examples/metrics
+DYN_SYSTEM_PORT=8081 ./server_with_loop.py
+DYN_SYSTEM_PORT=8081 ./server_with_callback.py
+```
+---
+## Related Documentation
+- [Metrics Overview](metrics.md)
+- [Prometheus and Grafana Setup](prometheus-grafana.md)
+- [Distributed Runtime Architecture](../design_docs/distributed_runtime.md)
+- [Python Metrics Examples](../../lib/bindings/python/examples/metrics/)
--- a/docs/observability/metrics.md
+++ b/docs/observability/metrics.md
@@ -3,27 +3,91 @@ SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All
 SPDX-License-Identifier: Apache-2.0
 -->
-# Dynamo MetricsRegistry
+# Dynamo Metrics
 ## Overview
-Dynamo provides built-in metrics capabilities through the `MetricsRegistry` trait, which is automatically available whenever you use the `DistributedRuntime` framework. This guide explains how to use metrics for observability and monitoring across all Dynamo components.
+Dynamo provides built-in metrics capabilities through the Dynamo metrics API, which is automatically available whenever you use the `DistributedRuntime` framework. This document serves as a reference for all available metrics in Dynamo.
-## Automatic Metrics
+**For visualization setup instructions**, see the [Prometheus and Grafana Setup Guide](prometheus-grafana.md).
-Dynamo automatically exposes metrics with the `dynamo_` name prefixes. It also adds the following labels `dynamo_namespace`, `dynamo_component`, and `dynamo_endpoint` to indicate which component is providing the metric.
+**For creating custom metrics**, see the [Metrics Developer Guide](metrics-developer-guide.md).
-**Frontend Metrics**: When using Dynamo HTTP Frontend (`--framework VLLM` or `--framework TRTLLM`), these metrics are automatically exposed with the `dynamo_frontend_*` prefix and include `model` labels containing the model name. These cover request handling, token processing, and latency measurements. See [prometheus-grafana.md](prometheus-grafana.md#available-metrics) for the complete list of frontend metrics.
+## Environment Variables
-**Component Metrics**: The core Dynamo backend system automatically exposes metrics with the `dynamo_component_*` prefix for all components that use the `DistributedRuntime` framework. These include request counts, processing times, byte transfers, and system uptime metrics. See [prometheus-grafana.md](prometheus-grafana.md#available-metrics) for the complete list of component metrics.
+| Variable | Description | Default | Example |
+|----------|-------------|---------|---------|
+| `DYN_SYSTEM_PORT` | System metrics/health port | `-1` (disabled) | `8081` |
-**Specialized Component Metrics**: Components can also expose additional metrics specific to their functionality. For example, a `preprocessor` component exposes metrics with the `dynamo_preprocessor_*` prefix. See [prometheus-grafana.md](prometheus-grafana.md#available-metrics) for details on specialized component metrics.
+## Getting Started Quickly
-**Kubernetes Integration**: For comprehensive Kubernetes deployment and monitoring setup, see the [Kubernetes Metrics Guide](../kubernetes/observability/metrics.md). This includes Prometheus Operator setup, metrics collection configuration, and visualization in Grafana.
+The following example runs on a single machine.
-## Metrics Hierarchy
+### Start Observability Stack
-The `MetricsRegistry` trait is implemented by `DistributedRuntime`, `Namespace`, `Component`, and `Endpoint`, providing a hierarchical approach to metric collection that matches Dynamo's distributed architecture:
+For visualizing metrics with Prometheus and Grafana, start the observability stack. See [Observability Getting Started](README.md#getting-started-quickly) for instructions.
+### Launch Dynamo Components
+Launch a frontend and vLLM backend to test metrics:
+```bash
+# Run each command in its own terminal (both are long-running processes)
+python -m dynamo.frontend --http-port 8000
+# Enable system metrics server on port 8081
+DYN_SYSTEM_PORT=8081 python -m dynamo.vllm --model Qwen/Qwen3-0.6B \
+   --enforce-eager --no-enable-prefix-caching --max-num-seqs 3
+```
+Wait for the vLLM worker to start, then send requests and check metrics:
+```bash
+# Send a request
+curl -H 'Content-Type: application/json' \
+-d '{
+  "model": "Qwen/Qwen3-0.6B",
+  "max_completion_tokens": 100,
+  "messages": [{"role": "user", "content": "Hello"}]
+}' \
+http://localhost:8000/v1/chat/completions
+# Check metrics from the worker
+curl -s localhost:8081/metrics | grep dynamo_component
+```
+## Exposed Metrics
+Dynamo exposes metrics in Prometheus Exposition Format text at the `/metrics` HTTP endpoint. All Dynamo-generated metrics use the `dynamo_*` prefix and include labels (`dynamo_namespace`, `dynamo_component`, `dynamo_endpoint`) to identify the source component.
+**Example Prometheus Exposition Format text:**
+```
+# HELP dynamo_component_requests_total Total requests processed
+# TYPE dynamo_component_requests_total counter
+dynamo_component_requests_total{dynamo_namespace="default",dynamo_component="worker",dynamo_endpoint="generate"} 42
+# HELP dynamo_component_request_duration_seconds Request processing time
+# TYPE dynamo_component_request_duration_seconds histogram
+dynamo_component_request_duration_seconds_bucket{dynamo_namespace="default",dynamo_component="worker",dynamo_endpoint="generate",le="0.005"} 10
+dynamo_component_request_duration_seconds_bucket{dynamo_namespace="default",dynamo_component="worker",dynamo_endpoint="generate",le="0.01"} 15
+dynamo_component_request_duration_seconds_bucket{dynamo_namespace="default",dynamo_component="worker",dynamo_endpoint="generate",le="+Inf"} 42
+dynamo_component_request_duration_seconds_sum{dynamo_namespace="default",dynamo_component="worker",dynamo_endpoint="generate"} 2.5
+dynamo_component_request_duration_seconds_count{dynamo_namespace="default",dynamo_component="worker",dynamo_endpoint="generate"} 42
+```
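As an illustration of how this text format can be consumed, the sketch below parses sample lines into `(name, labels, value)` tuples and derives a mean request duration from the `_sum` and `_count` series. It is a simplified reader written for this example, not a full Prometheus parser: it skips comment lines and ignores optional timestamps. The function names are hypothetical.

```python
import re

SAMPLE = '''\
# HELP dynamo_component_requests_total Total requests processed
# TYPE dynamo_component_requests_total counter
dynamo_component_requests_total{dynamo_namespace="default",dynamo_component="worker",dynamo_endpoint="generate"} 42
dynamo_component_request_duration_seconds_sum{dynamo_namespace="default",dynamo_component="worker",dynamo_endpoint="generate"} 2.5
dynamo_component_request_duration_seconds_count{dynamo_namespace="default",dynamo_component="worker",dynamo_endpoint="generate"} 42
'''

# metric_name{label="value",...} value
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_samples(text):
    """Return (name, labels, value) tuples, skipping comments and blank lines."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # e.g. lines with timestamps, which this sketch does not handle
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


def mean_duration(samples, base="dynamo_component_request_duration_seconds"):
    """Mean of a histogram, computed as sum / count over all label sets."""
    total = sum(v for n, _, v in samples if n == base + "_sum")
    count = sum(v for n, _, v in samples if n == base + "_count")
    return total / count if count else 0.0


samples = parse_samples(SAMPLE)
print(samples[0][1]["dynamo_endpoint"])
print(mean_duration(samples))
```

In practice the input would come from `curl -s localhost:8081/metrics`; the embedded `SAMPLE` string keeps the example runnable without a server.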
+### Metric Categories
+Dynamo exposes several categories of metrics:
+- **Frontend Metrics** (`dynamo_frontend_*`) - Request handling, token processing, and latency measurements
+- **Component Metrics** (`dynamo_component_*`) - Request counts, processing times, byte transfers, and system uptime
+- **Specialized Component Metrics** (e.g., `dynamo_preprocessor_*`) - Component-specific metrics
+- **Engine Metrics** (Pass-through) - Backend engines expose their own metrics: [vLLM](../backends/vllm/prometheus.md) (`vllm:*`), [SGLang](../backends/sglang/prometheus.md) (`sglang:*`), [TensorRT-LLM](../backends/trtllm/prometheus.md) (`trtllm:*`)
+## Runtime Hierarchy
+The Dynamo metrics API is available on `DistributedRuntime`, `Namespace`, `Component`, and `Endpoint`, providing a hierarchical approach to metric collection that matches Dynamo's distributed architecture:
 - `DistributedRuntime`: Global metrics across the entire runtime
 - `Namespace`: Metrics scoped to a specific dynamo_namespace
@@ -32,65 +96,116 @@ The `MetricsRegistry` trait is implemented by `DistributedRuntime`, `Namespace`,
 This hierarchical structure allows you to create metrics at the appropriate level of granularity for your monitoring needs.
+## Available Metrics
-## Getting Started
+### Backend Component Metrics
-For a complete setup guide including Docker Compose configuration, Prometheus setup, and Grafana dashboards, see the [Getting Started section](prometheus-grafana.md#getting-started) in the Prometheus and Grafana guide.
+The core Dynamo backend system automatically exposes metrics with the `dynamo_component_*` prefix for all components that use the `DistributedRuntime` framework:
-The quick start includes:
+- `dynamo_component_inflight_requests`: Requests currently being processed (gauge)
- Docker Compose setup for Prometheus and Grafana
+- `dynamo_component_request_bytes_total`: Total bytes received in requests (counter)
- Pre-configured dashboards and datasources
+- `dynamo_component_request_duration_seconds`: Request processing time (histogram)
- Access URLs for all monitoring endpoints
+- `dynamo_component_requests_total`: Total requests processed (counter)
- GPU targeting configuration
+- `dynamo_component_response_bytes_total`: Total bytes sent in responses (counter)
+- `dynamo_component_system_uptime_seconds`: DistributedRuntime uptime (gauge)
-## Implementation Examples
+### KV Router Statistics (kvstats)
-Examples of creating metrics at different hierarchy levels and using dynamic labels are included in this document below.
+KV router statistics are automatically exposed by LLM workers and KV router components with the `dynamo_component_kvstats_*` prefix. These metrics provide insights into GPU memory usage and cache efficiency:
-### Grafana Dashboards
+- `dynamo_component_kvstats_active_blocks`: Number of active KV cache blocks currently in use (gauge)
+- `dynamo_component_kvstats_total_blocks`: Total number of KV cache blocks available (gauge)
+- `dynamo_component_kvstats_gpu_cache_usage_percent`: GPU cache usage as a fraction between 0.0 and 1.0 (gauge)
+- `dynamo_component_kvstats_gpu_prefix_cache_hit_rate`: GPU prefix cache hit rate as a fraction between 0.0 and 1.0 (gauge)
-Use dashboards in `deploy/metrics/grafana_dashboards/`:
+These metrics are published by:
- `grafana-dynamo-dashboard.json`: General Dynamo dashboard
+- **LLM Workers**: vLLM and TRT-LLM backends publish these metrics through their respective publishers
- `grafana-dcgm-metrics.json`: DCGM GPU metrics dashboard
+- **KV Router**: The KV router component aggregates and exposes these metrics for load balancing decisions
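When reading these gauges in a dashboard or script, it can help to convert them into more familiar quantities. The sketch below is a hedged example: the helper names are hypothetical, and treating free blocks as `total_blocks - active_blocks` is an assumption based on the metric descriptions, not a documented Dynamo formula.

```python
def free_blocks(active_blocks, total_blocks):
    """Blocks not currently in use (assumes active is a subset of total)."""
    return max(total_blocks - active_blocks, 0)


def as_percent(fraction):
    """Convert a 0.0-1.0 gauge value into a 0-100 percentage for display."""
    return fraction * 100.0


# Illustrative values only, not real measurements.
print(free_blocks(active_blocks=250, total_blocks=1000))
print(as_percent(0.25))
```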
-## Metrics Visualization Architecture
+### Specialized Component Metrics
-### Service Topology
+Some components expose additional metrics specific to their functionality:
-The metrics system follows this architecture for collecting and visualizing metrics:
+- `dynamo_preprocessor_*`: Metrics specific to preprocessor components
-```mermaid
+### Frontend Metrics
-graph TD
-    BROWSER[Browser] -->|:3001| GRAFANA[Grafana :3001]
+When using the Dynamo HTTP Frontend (`--framework VLLM` or `--framework TRTLLM`), frontend metrics are automatically exposed with the `dynamo_frontend_*` prefix and include a `model` label containing the model name:
-    subgraph DockerComposeNetwork [Network inside Docker Compose]
-        NATS_PROM_EXP[nats-prom-exp :7777 /metrics] -->|:8222/varz| NATS_SERVER[nats-server :4222, :6222, :8222]
+- `dynamo_frontend_inflight_requests`: Inflight requests (gauge)
-        PROMETHEUS[Prometheus server :9090] -->|:2379/metrics| ETCD_SERVER[etcd-server :2379, :2380]
+- `dynamo_frontend_queued_requests`: Number of requests in HTTP processing queue (gauge)
-        PROMETHEUS -->|:9401/metrics| DCGM_EXPORTER[dcgm-exporter :9401]
+- `dynamo_frontend_input_sequence_tokens`: Input sequence length (histogram)
-        PROMETHEUS -->|:7777/metrics| NATS_PROM_EXP
+- `dynamo_frontend_inter_token_latency_seconds`: Inter-token latency (histogram)
-        PROMETHEUS -->|:8000/metrics| DYNAMOFE[Dynamo HTTP FE :8000]
+- `dynamo_frontend_output_sequence_tokens`: Output sequence length (histogram)
-        PROMETHEUS -->|:8081/metrics| DYNAMOBACKEND[Dynamo backend :8081]
+- `dynamo_frontend_request_duration_seconds`: LLM request duration (histogram)
-        DYNAMOFE --> DYNAMOBACKEND
+- `dynamo_frontend_requests_total`: Total LLM requests (counter)
-        GRAFANA -->|:9090/query API| PROMETHEUS
+- `dynamo_frontend_time_to_first_token_seconds`: Time to first token (histogram)
-    end
-```
+**Note**: The `dynamo_frontend_inflight_requests` metric tracks requests from HTTP handler start until the complete response is finished, while `dynamo_frontend_queued_requests` tracks requests from HTTP handler start until first token generation begins (including prefill time). HTTP queue time is a subset of inflight time.
+#### Model Configuration Metrics
-### Grafana Dashboard
+The frontend also exposes model configuration metrics with the `dynamo_frontend_model_*` prefix. These metrics are populated from the worker backend registration service when workers register with the system:
-The metrics system includes a pre-configured Grafana dashboard for visualizing service metrics:
+**Runtime Config Metrics (from ModelRuntimeConfig):**
+These metrics come from the runtime configuration provided by worker backends during registration.
-![Grafana Dynamo Dashboard](./grafana-dynamo-composite.png)
+- `dynamo_frontend_model_total_kv_blocks`: Total KV blocks available for a worker serving the model (gauge)
+- `dynamo_frontend_model_max_num_seqs`: Maximum number of sequences for a worker serving the model (gauge)
+- `dynamo_frontend_model_max_num_batched_tokens`: Maximum number of batched tokens for a worker serving the model (gauge)
-## Detailed Setup Guide
+**MDC Metrics (from ModelDeploymentCard):**
+These metrics come from the Model Deployment Card information provided by worker backends during registration. Note that when multiple worker instances register with the same model name, only the first instance's configuration metrics (runtime config and MDC metrics) will be populated. Subsequent instances with duplicate model names will be skipped for configuration metric updates, though the worker count metric will reflect all instances.
-For complete setup instructions including Docker Compose, Prometheus configuration, and Grafana dashboards, see:
+- `dynamo_frontend_model_context_length`: Maximum context length for a worker serving the model (gauge)
+- `dynamo_frontend_model_kv_cache_block_size`: KV cache block size for a worker serving the model (gauge)
+- `dynamo_frontend_model_migration_limit`: Request migration limit for a worker serving the model (gauge)
-```{toctree}
+**Worker Management Metrics:**
-:hidden:
+- `dynamo_frontend_model_workers`: Number of worker instances currently serving the model (gauge)
-prometheus-grafana
+### Request Processing Flow
+This section explains the distinction between two key metrics used to track request processing:
+1. **Inflight**: Tracks requests from HTTP handler start until the complete response is finished
+2. **HTTP Queue**: Tracks requests from HTTP handler start until first token generation begins (including prefill time)
+**Example Request Flow:**
+```bash
+curl -s localhost:8000/v1/completions -H "Content-Type: application/json" -d '{
+  "model": "Qwen/Qwen3-0.6B",
+  "prompt": "Hello let's talk about LLMs",
+  "stream": false,
+  "max_tokens": 1000
+}'
 ```
- [Prometheus and Grafana Setup Guide](prometheus-grafana.md)
+**Timeline:**
+```
+Timeline:    0, 1, ...
+Client ────> Frontend:8000 ────────────────────> Dynamo component/backend (vLLM, SGLang, TRT)
+             │request start                     │received                              │
+             |                                  |                                      |
+             │                                  ├──> start prefill ──> first token ──> |last token
+             │                                  │     (not impl)       |               |
+             ├─────actual HTTP queue¹ ──────────┘                      │               |
+             │                                                         │               │
+             ├─────implemented HTTP queue ─────────────────────────────┘               |
+             │                                                                         │
+             └─────────────────────────────────── Inflight ────────────────────────────┘
+```
+**Concurrency Example:**
+Suppose the backend allows 3 concurrent requests and there are 10 clients continuously hitting the frontend:
+- All 10 requests will be counted as inflight (from start until complete response)
+- 7 requests will be in HTTP queue most of the time
+- 3 requests will be actively processed (between first token and last token)
+**Key Differences:**
+- **Inflight**: Measures total request lifetime including processing time
+- **HTTP Queue**: Measures queuing time before processing begins (including prefill time)
+- **HTTP Queue ≤ Inflight** (HTTP queue is a subset of inflight time)
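The arithmetic behind the concurrency example can be sketched as follows. This is a simplified steady-state model, not how the frontend computes its gauges: it assumes every client always has one outstanding request and that any request not yet past its first token counts toward the HTTP queue. The function names are hypothetical.

```python
def steady_state_gauges(clients, backend_slots):
    """Approximate gauge values when clients continuously send requests."""
    inflight = clients
    active = min(clients, backend_slots)
    return {"inflight": inflight, "http_queue": inflight - active, "active": active}


def request_durations(t_start, t_first_token, t_last_token):
    """Per-request view: HTTP queue time ends at first token, inflight at last token."""
    http_queue = t_first_token - t_start
    inflight = t_last_token - t_start
    return http_queue, inflight


print(steady_state_gauges(clients=10, backend_slots=3))
queue_s, inflight_s = request_durations(t_start=0.0, t_first_token=1.5, t_last_token=4.0)
print(queue_s, inflight_s)
```

Because both intervals start at the same instant and the first token cannot arrive after the last one, the queue duration can never exceed the inflight duration.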
 ## Related Documentation

--- a/docs/observability/prometheus-grafana.md
+++ b/docs/observability/prometheus-grafana.md
--- a/deploy/tracing/trace.png
+++ b/deploy/tracing/trace.png
--- a/deploy/tracing/README.md
+++ b/deploy/tracing/README.md
@@ -5,87 +5,61 @@ SPDX-License-Identifier: Apache-2.0
 # Distributed Tracing with Tempo
-This guide explains how to set up and view distributed traces in Grafana Tempo for Dynamo workloads.
 ## Overview
-Dynamo supports OpenTelemetry-based distributed tracing, allowing you to visualize request flows across Frontend and Worker components. Traces are exported to Tempo via OTLP (OpenTelemetry Protocol) and visualized in Grafana.
+Dynamo supports OpenTelemetry-based distributed tracing for visualizing request flows across Frontend and Worker components. Traces are exported to Tempo via OTLP (OpenTelemetry Protocol) and visualized in Grafana.
+**Requirements:** Set `DYN_LOGGING_JSONL=true` and `OTEL_EXPORT_ENABLED=true` to export traces to Tempo.
-## Prerequisites
+This guide covers a single-GPU demo setup using Docker Compose. For Kubernetes deployments, see [Kubernetes Deployment](#kubernetes-deployment).
- Docker and Docker Compose (for local deployment)
+**Note:** This section overlaps with [Logging of OpenTelemetry Tracing](logging.md), since OpenTelemetry covers both logging and tracing. The approach documented here is intended for persistent trace visualization and analysis. For short debugging sessions that examine trace context directly in logs, see the [Logging](logging.md) guide.
- Kubernetes cluster with kubectl access (for Kubernetes deployment)
- Dynamo runtime with tracing support
 ## Environment Variables
-Dynamo's tracing is configured via environment variables. For complete logging documentation, see [docs/observability/logging.md](../../docs/observability/logging.md).
+| Variable | Description | Default | Example |
+|----------|-------------|---------|---------|
+| `DYN_LOGGING_JSONL` | Enable JSONL logging format (required for tracing) | `false` | `true` |
+| `OTEL_EXPORT_ENABLED` | Enable OTLP trace export | `false` | `true` |
+| `OTEL_EXPORTER_OTLP_TRACES_ENDPOINT` | OTLP gRPC endpoint for Tempo | `http://localhost:4317` | `http://tempo:4317` |
+| `OTEL_SERVICE_NAME` | Service name for identifying components | `dynamo` | `dynamo-frontend` |
-### Required Environment Variables
+## Getting Started Quickly
-| Variable | Description | Example Value |
+### 1. Start Observability Stack
-|----------|-------------|---------------|
-| `DYN_LOGGING_JSONL` | Enable JSONL logging format (required for tracing) | `true` |
-| `OTEL_EXPORT_ENABLED` | Enable OTLP trace export | `1` |
-| `OTEL_EXPORTER_OTLP_TRACES_ENDPOINT` | OTLP gRPC endpoint for Tempo | `http://localhost:4317` (local) or `http://tempo:4317` (docker) |
-| `OTEL_SERVICE_NAME` | Service name for identifying components | `dynamo-frontend`, `dynamo-worker-prefill`, `dynamo-worker-decode` |
-**Note:** When `OTEL_EXPORT_ENABLED=1`, logging initialization is deferred until the runtime is available (required by the OTEL exporter). This means some early logs will be dropped. This will be fixed in a future release.
+Start the observability stack (Prometheus, Grafana, Tempo, exporters). See [Observability Getting Started](README.md#getting-started-quickly) for instructions.
-### Example Configuration
+### 2. Set Environment Variables
+Configure Dynamo components to export traces:
 ```bash
 # Enable JSONL logging and tracing
 export DYN_LOGGING_JSONL=true
+export OTEL_EXPORT_ENABLED=true
-# Enable trace export to Tempo
+export OTEL_EXPORTER_OTLP_TRACES_ENDPOINT=http://localhost:4317
-export OTEL_EXPORT_ENABLED=1
-# Set the Tempo endpoint (docker-compose network)
-export OTEL_EXPORTER_OTLP_TRACES_ENDPOINT=http://tempo:4317
-# Set service name to identify this component
-export OTEL_SERVICE_NAME=dynamo-frontend
 ```
---
+### 3. Start Dynamo Components (Single GPU)
-## Local Deployment with Docker Compose
-### 1. Start Tempo and Grafana
-From the `deploy/tracing` directory, start the observability stack:
+For a simple single-GPU deployment, start the frontend and a single vLLM worker:
 ```bash
-cd deploy/tracing
+# Start the frontend with tracing enabled
-docker-compose up -d
+export OTEL_SERVICE_NAME=dynamo-frontend
-```
+python -m dynamo.frontend --router-mode kv --http-port=8000 &
-This will start:
- **Tempo** on `http://localhost:3200` (HTTP API) and `localhost:4317` (OTLP gRPC)
- **Grafana** on `http://localhost:3000` (username: `admin`, password: `admin`)
-Verify services are running:
+# Start a single vLLM worker (aggregated prefill and decode)
+export OTEL_SERVICE_NAME=dynamo-worker-vllm
+python -m dynamo.vllm --model Qwen/Qwen3-0.6B --enforce-eager &
-```bash
+wait
-docker-compose ps
 ```
-### 2. Set Environment Variables
+This runs both prefill and decode on the same GPU, providing a simpler setup for testing tracing.
-Configure Dynamo components to export traces:
+### Alternative: Disaggregated Deployment (2 GPUs)
-```bash
-# Enable JSONL logging and tracing
-export DYN_LOGGING_JSONL=true
-export OTEL_EXPORT_ENABLED=1
-export OTEL_EXPORTER_OTLP_TRACES_ENDPOINT=http://localhost:4317
-# Set service names for each component
-export OTEL_SERVICE_NAME=dynamo-frontend
-```
-### 3. Run vLLM Disaggregated Deployment
 Run the vLLM disaggregated script with tracing enabled:
@@ -106,70 +80,66 @@ trap 'echo Cleaning up...; kill 0' EXIT
 # Enable tracing
 export DYN_LOGGING_JSONL=true
-export OTEL_EXPORT_ENABLED=1
+export OTEL_EXPORT_ENABLED=true
 export OTEL_EXPORTER_OTLP_TRACES_ENDPOINT=http://localhost:4317
 # Run frontend
 export OTEL_SERVICE_NAME=dynamo-frontend
 python -m dynamo.frontend --router-mode kv --http-port=8000 &
-# Run decode worker
+# Run decode worker; make sure to wait for it to start up
 export OTEL_SERVICE_NAME=dynamo-worker-decode
 CUDA_VISIBLE_DEVICES=0 python3 -m dynamo.vllm --model Qwen/Qwen3-0.6B --enforce-eager &
-# Run prefill worker
+# Run prefill worker; make sure to wait for it to start up
 export OTEL_SERVICE_NAME=dynamo-worker-prefill
 CUDA_VISIBLE_DEVICES=1 python3 -m dynamo.vllm \
    --model Qwen/Qwen3-0.6B \
    --enforce-eager \
    --is-prefill-worker &
-wait
 ```
+This disaggregated setup runs prefill and decode on separate GPUs for better resource utilization.
 ### 4. Generate Traces
-Send requests to the frontend to generate traces:
+Send requests to the frontend to generate traces (works for both aggregated and disaggregated deployments). **Note the `x-request-id` header**, which allows you to easily search for and correlate this specific trace in Grafana:
 ```bash
-curl -d '{
+curl -H 'Content-Type: application/json' \
+-H 'x-request-id: test-trace-001' \
+-d '{
  "model": "Qwen/Qwen3-0.6B",
  "max_completion_tokens": 100,
  "messages": [
    {"role": "user", "content": "What is the capital of France?"}
  ]
 }' \
-H 'Content-Type: application/json' \
-H 'x-request-id: test-trace-001' \
 http://localhost:8000/v1/chat/completions
 ```
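The same request can be built from Python, which makes it easy to generate a fresh `x-request-id` per call. This is a minimal sketch using only the standard library; `build_chat_request` and `send` are hypothetical helper names, and the URL and model are copied from the curl example above.

```python
import json
import urllib.request
import uuid


def build_chat_request(prompt, request_id=None, base_url="http://localhost:8000"):
    """Build (but do not send) a chat completion request tagged with x-request-id."""
    request_id = request_id or str(uuid.uuid4())
    body = json.dumps({
        "model": "Qwen/Qwen3-0.6B",
        "max_completion_tokens": 100,
        "messages": [{"role": "user", "content": prompt}],
    }).encode("utf-8")
    request = urllib.request.Request(
        base_url + "/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json", "x-request-id": request_id},
        method="POST",
    )
    return request, request_id


def send(request):
    """Send the request; requires a running frontend on the target URL."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())


request, request_id = build_chat_request(
    "What is the capital of France?", request_id="test-trace-001"
)
print(request.full_url, request_id)
```

Calling `send(request)` then issues the request; the returned `request_id` is the value to search for in Grafana.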
 ### 5. View Traces in Grafana Tempo
 1. Open Grafana at `http://localhost:3000`
-2. Login with username `admin` and password `admin`
+2. Log in with username `dynamo` and password `dynamo`
 3. Navigate to **Explore** (compass icon in the left sidebar)
 4. Select **Tempo** as the data source (should be selected by default)
-5. Use the **Search** tab to find traces:
+5. In the query type, select **"Search"** (not TraceQL, not Service Graph)
+6. Use the **Search** tab to find traces:
   - Search by **Service Name** (e.g., `dynamo-frontend`)
   - Search by **Span Name** (e.g., `http-request`, `handle_payload`)
   - Search by **Tags** (e.g., `x_request_id=test-trace-001`)
-6. Click on a trace to view the detailed flame graph
+7. Click on a trace to view the detailed flame graph
 #### Example Trace View
 Below is an example of what a trace looks like in Grafana Tempo:
-![Trace Example](./trace.png)
+![Trace Example](trace.png)
 ### 6. Stop Services
-When done, stop the Tempo and Grafana stack:
+When done, stop the observability stack. See [Observability Getting Started](README.md#getting-started-quickly) for Docker Compose commands.
-```bash
-cd deploy/tracing
-docker-compose down
-```
 ---
@@ -192,7 +162,7 @@ spec:
    - name: DYN_LOGGING_JSONL
      value: "true"
    - name: OTEL_EXPORT_ENABLED
-      value: "1"
+      value: "true"
    - name: OTEL_EXPORTER_OTLP_TRACES_ENDPOINT
      value: "http://tempo.observability.svc.cluster.local:4317"

--- a/examples/backends/sglang/launch/agg.sh
+++ b/examples/backends/sglang/launch/agg.sh
@@ -17,7 +17,7 @@ python3 -m dynamo.frontend --http-port=8000 &
 DYNAMO_PID=$!
 # run worker with metrics enabled
-DYN_SYSTEM_ENABLED=true DYN_SYSTEM_PORT=8081 \
+DYN_SYSTEM_PORT=8081 \
 python3 -m dynamo.sglang \
  --model-path Qwen/Qwen3-0.6B \
  --served-model-name Qwen/Qwen3-0.6B \