feat(observability): Add Grafana dashboard and monitoring setup for… (#4639)

Signed-off-by: rwipfelnv <rwipfel@nvidia.com> Co-authored-by: Claude <noreply@anthropic.com>

Unverified Commit abacb96e authored Feb 17, 2026 by rwipfelnv Committed by GitHub Feb 17, 2026
9 changed files
--- a/.gitignore
+++ b/.gitignore
@@ -131,3 +131,7 @@ package-lock.json

 # Compiled static libraries
 *.a
+
+# macOS
+.DS_Store
+**/.DS_Store
--- a/deploy/observability/grafana_dashboards/DASHBOARD_METRICS.md
+++ b/deploy/observability/grafana_dashboards/DASHBOARD_METRICS.md
+# Grafana Dashboard Metrics Documentation
+
+This document explains where each panel in `disagg-dashboard.json` gets its data and how that data is displayed.
+
+## Dashboard Organization
+
+The dashboard is organized in **logical request flow order** (21 panels across 7 rows):
+
+**Row 1: Frontend Health** (User-facing metrics - y=0)
+- Frontend Requests/Sec (x=0), Avg TTFT (x=8), Avg Request Duration (x=16)
+
+**Row 2: Frontend Details** (y=8)
+- Avg Inter-Token Latency (x=0), Avg ISL/OSL (x=8), **Queued Requests** ⭐ (x=16)
+
+**Row 3: Prefill Workers** (The typical bottleneck! - y=16)
+- Prefill Worker Processing Time ⭐ (x=0), Prefill Worker Throughput (x=8), Component Latency Comparison (x=16)
+
+**Row 4: Decode Workers** (y=24)
+- Request Throughput (x=0), Avg Request Duration (x=8), KV Cache Utilization (%) (x=16)
+
+**Row 5: KV Cache + GPU** (y=32)
+- KV Cache Blocks (Active/Total) ⭐ (x=0), GPU Compute Utilization (x=8), GPU Memory Used (x=16)
+
+**Row 6: NIXL Transfer Metrics** (y=40)
+- GPU Memory Bandwidth (x=0), NVLink Bandwidth (GB/s) (x=8), Worker CPU Usage (x=16)
+
+**Row 7: Node + Worker** (y=48)
+- Node CPU Utilization (x=0), Worker Request Throughput (x=8), Worker Data Transfer (x=16)
+
+⭐ = Key metrics for diagnosing TTFT bottlenecks
+
+## Metric Sources
+
+### Frontend Metrics (from Frontend Pod)
+These metrics use the `dynamo_frontend_*` prefix and are collected from the frontend deployment pod.
+
+| Panel | Metric | Formula | Description |
+|-------|--------|---------|-------------|
+| **Frontend Requests / Sec** | `dynamo_frontend_requests_total` | `rate(...[30s])` | Rate of requests per second hitting the frontend, broken down by request_type and status |
+| **Frontend Avg Time to First Token** ⭐ | `dynamo_frontend_time_to_first_token_seconds_{sum,count}` | `1000 * (rate(sum[5m]) / rate(count[5m]))` | Average time (in ms) from request arrival to first token over the last 5 minutes. Includes queue wait, prefill compute, and NIXL transfer time. **The primary performance metric** |
+| **Frontend Avg Request Duration** | `dynamo_frontend_request_duration_seconds_{sum,count}` | `1000 * (rate(sum[5m]) / rate(count[5m]))` | Total end-to-end request duration in milliseconds over the last 5 minutes |
+| **Frontend Avg Inter-Token Latency** | `dynamo_frontend_inter_token_latency_seconds_{sum,count}` | `1000 * (rate(sum[5m]) / rate(count[5m]))` | Average time (in ms) between token generations during decode phase over the last 5 minutes |
+| **Frontend Avg Input/Output Sequence Length** | `dynamo_frontend_input_sequence_tokens_{sum,count}` & `dynamo_frontend_output_sequence_tokens_{sum,count}` | `rate(sum[5m]) / rate(count[5m])` for each | Average input prompt length (ISL) and output generation length (OSL) in tokens over the last 5 minutes |
+| **Frontend Queued Requests** ⭐⭐⭐ | `dynamo_frontend_queued_requests` | Raw value | Number of requests waiting in queue. **THE key metric for diagnosing worker saturation.** High values (>10) indicate workers cannot keep up with load. Yellow threshold at 10, red at 50 |
+
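The averaged panels in the table above all divide the increase of a histogram's `_sum` by the increase of its `_count` over the same window. A minimal sketch of that arithmetic, assuming two raw scrapes of the counters are available; the function name `avg_latency_ms` and the sample values are hypothetical, and unlike PromQL `rate()` this sketch does not handle counter resets:

```shell
# Average latency in ms from two scrapes of a histogram's _sum and _count.
# Mirrors 1000 * (rate(sum[5m]) / rate(count[5m])): the window length cancels,
# so only the deltas of the two counters matter.
avg_latency_ms() {
  local sum_start="$1" sum_end="$2" count_start="$3" count_end="$4"
  awk -v s0="$sum_start" -v s1="$sum_end" -v c0="$count_start" -v c1="$count_end" '
    BEGIN {
      dc = c1 - c0
      if (dc <= 0) {          # no requests observed in the window
        print "NaN"
        exit
      }
      printf "%.1f\n", 1000 * (s1 - s0) / dc
    }'
}

# Hypothetical sample: 120 requests took 54 s in total during the window.
avg_latency_ms 1000.0 1054.0 4000 4120
```

With these sample values the function prints `450.0`, i.e. an average of 450 ms per request. An empty window prints `NaN`, which is also what the PromQL division yields when no requests arrived.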
+### GPU Metrics (from DCGM Exporter)
+These metrics come from the DCGM (Data Center GPU Manager) exporter running as a DaemonSet in the `gpu-operator` namespace. DCGM collects hardware-level GPU metrics.
+
+| Panel | Metric | Formula | Description |
+|-------|--------|---------|-------------|
+| **GPU Compute Utilization** | `DCGM_FI_DEV_GPU_UTIL` | Raw value | GPU compute utilization percentage (0-100) for each GPU. Prefill workers show high utilization during prefill phase |
+| **GPU Memory Bandwidth** | `DCGM_FI_DEV_MEM_COPY_UTIL` | Raw value | GPU memory copy bandwidth utilization percentage (0-100). **Spikes indicate KV cache transfers over NIXL**. On single-node deployments, NIXL uses CUDA IPC (GPU→Host→Host→GPU) not direct GPU-to-GPU. Yellow threshold at 60%, red at 80% |
+| **NVLink Bandwidth (GB/s)** | `DCGM_FI_PROF_NVLINK_TX_BYTES` & `DCGM_FI_PROF_NVLINK_RX_BYTES` | `(rate(TX_BYTES[1m]) + rate(RX_BYTES[1m])) / 1e9` | NVLink transfer bandwidth in GB/s (rate of change) per GPU, measured from DCGM profiling metrics. Shows total bidirectional bandwidth (TX + RX). This includes intra-pod TP communication (TP=2 for prefill, TP=4 for decode). Low bandwidth (<1 GB/s) indicates inter-pod NIXL KV cache transfers may be using host memory copies instead of direct NVLink/GPUDirect. Yellow threshold at 5 GB/s, red at 10 GB/s |
+| **GPU Memory Used** | `DCGM_FI_DEV_FB_USED` | `value / 1024` | GPU framebuffer memory used, converted from MiB (as reported by DCGM) to GiB. Prefill workers allocate KV blocks on decode workers via NIXL |
+
+### Prefill Worker Metrics (from Prefill Worker Pods)
+These metrics come from the prefill worker pods' system endpoints (port 9090). They track request processing for prefill operations.
+
+| Panel | Metric | Formula | Description |
+|-------|--------|---------|-------------|
+| **Prefill Worker Processing Time** | `dynamo_component_request_duration_seconds_{sum,count,bucket}{dynamo_component="prefill",dynamo_endpoint="generate"}` | `1000 * rate(sum[5m]) / rate(count[5m])` for avg, `histogram_quantile(0.99, ...)` for P99 | Average and P99 time (in ms) spent processing prefill requests. **Includes prefill computation AND KV cache transfer over NIXL** |
+| **Prefill Worker Throughput** | `dynamo_component_requests_total{dynamo_component="prefill",dynamo_endpoint="generate"}` | `rate(...[5m])` | Rate of prefill requests being processed in requests/second |
+
+### Decode Worker Metrics (from Decode Worker Pods)
+These metrics come from the decode worker pods' system endpoints (port 9090). In disaggregated mode, decode workers receive KV cache from prefill workers and perform token generation.
+
+| Panel | Metric | Formula | Description |
+|-------|--------|---------|-------------|
+| **Component Latency - Prefill vs Decode** | `dynamo_component_request_duration_seconds_{sum,count}{dynamo_component="prefill",dynamo_endpoint="generate"}` & `{dynamo_component="backend",dynamo_endpoint="generate"}` | `rate(sum[5m]) / rate(count[5m])` | Average request duration for prefill workers (includes NIXL transfer) vs decode workers (entire decode session for all output tokens) over the last 5 minutes. **Note**: Decode worker latency measures the FULL decode session duration, not just time to first token. Only shows `generate` endpoint (filters out `clear_kv_blocks` maintenance operations) |
+| **Decode Worker - Request Throughput** | `dynamo_component_requests_total{dynamo_component="backend"}` | `rate(...[5m])` | Rate of requests processed by decode workers in requests/second |
+| **Decode Worker - Avg Request Duration** | `dynamo_component_request_duration_seconds_{sum,count}{dynamo_component="backend"}` | `rate(sum[5m]) / rate(count[5m])` | Average time decode workers spend processing requests (decode phase only) over the last 5 minutes |
+| **KV Cache Utilization** | `dynamo_component_kvstats_gpu_cache_usage_percent` | Raw value (0-100%) | GPU memory utilization for KV cache storage of active requests. High values (>90%) indicate workers are at capacity and requests are queueing. **Note**: Only available for decode workers - prefill workers in disaggregated mode don't expose this metric. Monitor Prefill Worker Processing Time instead for prefill capacity |
+| **KV Cache Blocks (Active/Total)** ⭐ | `dynamo_component_kvstats_active_blocks` & `dynamo_component_kvstats_total_blocks` | Raw values | Number of KV cache blocks in use vs total available for decode workers. When active approaches total, decode workers are at capacity. Shows numeric values (e.g., 2048/5297). **Note**: Only for decode workers |
+
+### CPU Metrics (from cAdvisor and Node Exporter)
+These metrics come from Kubernetes cAdvisor (container metrics) and Node Exporter (node-level metrics). CPU bottlenecks can impact prefill/decode performance.
+
+| Panel | Metric | Formula | Description |
+|-------|--------|---------|-------------|
+| **Worker CPU Usage** | `container_cpu_usage_seconds_total{namespace="$namespace",pod=~".*worker.*",container="main"}` | `rate(...[5m])` | CPU cores used by worker pods. Value shows actual CPU consumption (e.g., 2.5 = 2.5 cores). Yellow at 30 cores, red at 50 cores |
+| **Node CPU Utilization** | `node_cpu_seconds_total{mode="idle"}` | `100 - (avg(rate(idle)) * 100)` | Overall node CPU utilization percentage. Shows aggregate CPU usage across all cores |
+
+### Worker Metrics (from Worker Pods)
+These metrics track request processing across all worker pods (prefill and decode).
+
+| Panel | Metric | Formula | Description |
+|-------|--------|---------|-------------|
+| **Worker Request Throughput** | `dynamo_component_requests_total{dynamo_endpoint="generate"}` | `rate(...[5m])` | Requests per second processed by each worker, broken down by component type (prefill, backend). Shows overall system throughput |
+| **Worker Data Transfer** | `dynamo_component_request_bytes_total` & `dynamo_component_response_bytes_total` | `rate(...[5m])` | Bytes per second transferred in requests (IN) and responses (OUT). Shows data throughput across worker pods |
+
+## Metric Label Filters
+
+### Component Name Filtering
+- **Prefill workers**: `dynamo_component="prefill"`
+- **Decode workers**: `dynamo_component="backend"`
+- **All workers**: Filter by `dynamo_endpoint="generate"` to exclude maintenance operations like `clear_kv_blocks`
+
+### Important Labels
+- `pod`: Specific pod name (e.g., `llama3-70b-disagg-sn-0-vllmprefillworker-hrnt5`)
+- `namespace`: Kubernetes namespace (e.g., `robert`)
+- `dynamo_component`: Component type (`prefill`, `backend`, `frontend`)
+- `dynamo_endpoint`: Endpoint name (`generate`, `clear_kv_blocks`)
+- `gpu`: GPU index (0-7 for DCGM metrics)
+- `Hostname`: Node hostname (for DCGM metrics)
+
+## Metric Collection Architecture
+
+```text
+┌─────────────────┐
+│  Frontend Pod   │ ──► dynamo_frontend_* metrics (HTTP port)
+└─────────────────┘
+
+┌─────────────────┐
+│ Prefill Worker  │ ──► dynamo_component_* metrics (system port 9090)
+│     Pods        │     ├─ dynamo_component_request_* (request stats)
+└─────────────────┘     ├─ dynamo_component_*_bytes_total (data transfer)
+                        └─ container_cpu_* metrics (cAdvisor)
+
+┌─────────────────┐
+│ Decode Worker   │ ──► dynamo_component_* metrics (system port 9090)
+│     Pods        │     ├─ dynamo_component_request_* (component stats)
+└─────────────────┘     └─ container_cpu_* metrics (cAdvisor)
+
+┌─────────────────┐
+│ DCGM Exporter   │ ──► DCGM_FI_DEV_* metrics (GPU metrics port)
+│   DaemonSet     │     ├─ GPU compute utilization
+│ (gpu-operator)  │     ├─ GPU memory bandwidth (NIXL indicator)
+└─────────────────┘     ├─ GPU memory usage
+                        └─ GPU temperature
+
+┌─────────────────┐
+│ Node Exporter   │ ──► node_* metrics (node metrics port)
+│   DaemonSet     │     ├─ node_cpu_seconds_total (CPU by mode)
+│   (monitoring)  │     ├─ node_load1/5/15 (load average)
+└─────────────────┘     └─ node_memory_* (memory stats)
+
+┌─────────────────┐
+│    cAdvisor     │ ──► container_* metrics (built into kubelet)
+│   (kubelet)     │     ├─ container_cpu_usage_seconds_total
+└─────────────────┘     ├─ container_cpu_cfs_throttled_periods_total
+                        └─ container_memory_*
+
+           ▼
+   ┌─────────────────┐
+   │  Prometheus     │ ◄─── ServiceMonitor for DCGM & Node Exporter
+   │  (monitoring)   │ ◄─── PodMonitor for Dynamo workers
+   └─────────────────┘ ◄─── Scrapes cAdvisor from kubelet
+           ▼
+   ┌─────────────────┐
+   │    Grafana      │
+   │  (monitoring)   │
+   └─────────────────┘
+```
+
+## PodMonitor Configuration
+
+The Dynamo operator automatically creates PodMonitors for metrics-enabled deployments:
+- **Label**: `nvidia.com/metrics-enabled: "true"` on all worker pods
+- **Endpoint**: System port (9090) with path `/metrics`
+- **Namespace**: Pods in any namespace are discovered (via `podMonitorNamespaceSelector={}`)
+
+To opt a deployment out of metrics collection:
+```yaml
+apiVersion: nvidia.com/v1
+kind: DynamoGraphDeployment
+metadata:
+  annotations:
+    nvidia.com/enable-metrics: "false"
+```
+
+## DCGM ServiceMonitor Configuration
+
+The DCGM ServiceMonitor must be manually created (see `dcgm-servicemonitor.yaml`):
+- **Namespace**: `gpu-operator` (where DCGM exporter runs)
+- **Label**: `release: prometheus` (required for Prometheus discovery)
+- **Selector**: `app: nvidia-dcgm-exporter`
+- **Endpoint**: `gpu-metrics` port with path `/metrics`
+
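The four settings listed above map onto a ServiceMonitor manifest roughly as follows. This is a sketch built only from that list, not the contents of `dcgm-servicemonitor.yaml`; the helper name `render_dcgm_servicemonitor` and the default resource name are assumptions:

```shell
# Print a ServiceMonitor manifest for the DCGM exporter.
# Sketch only: field values come from the bullet list above, and the
# default resource name is an assumption.
render_dcgm_servicemonitor() {
  local name="${1:-nvidia-dcgm-exporter}"
  cat <<EOF
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: ${name}
  namespace: gpu-operator
  labels:
    release: prometheus   # required for Prometheus discovery
spec:
  selector:
    matchLabels:
      app: nvidia-dcgm-exporter
  endpoints:
    - port: gpu-metrics
      path: /metrics
EOF
}

render_dcgm_servicemonitor
```

Piping the output to `kubectl apply --dry-run=client -f -` validates the manifest without creating it; dropping `--dry-run=client` applies it to the cluster.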
+## Troubleshooting
+
+### No metrics showing up:
+1. Check PodMonitor exists: `kubectl get podmonitor -A`
+2. Check pods have metrics label: `kubectl get pods -n <namespace> -l nvidia.com/metrics-enabled=true`
+3. Check Prometheus targets: Visit Prometheus UI → Status → Targets
+
+### DCGM metrics missing:
+1. Check DCGM exporter running: `kubectl get daemonset -A | grep dcgm-exporter`
+2. Check ServiceMonitor exists: `kubectl get servicemonitor -n gpu-operator`
+3. Verify `release: prometheus` label on ServiceMonitor
+
+### Prefill queue metrics showing zero:
+- These metrics only populate when **remote prefill** requests are processed
+- In local-only mode, decode workers handle prefill themselves (no queue)
+- Check deployment mode and request routing configuration
+
+### KV Cache metrics only showing decode workers:
+**Important Limitation**: In disaggregated mode, prefill workers (`--is-prefill-worker`) do NOT expose `dynamo_component_kvstats_*` metrics. Only decode workers expose these.
+
+**Why this happens:**
+- Prefill workers transfer KV cache to decode workers via NIXL
+- They don't maintain long-term KV cache state
+- Only decode workers track KV cache utilization metrics
+
+**How to diagnose prefill worker capacity bottlenecks:**
+1. **Check worker logs** for KV cache size at startup:
+   ```bash
+   kubectl logs -n <namespace> <prefill-worker-pod> | grep "GPU KV cache size"
+   # Example output: GPU KV cache size: 254,336 tokens
+   ```
+
+2. **Calculate maximum concurrent requests**:
+   ```text
+   Max Concurrency = KV Cache Size ÷ Tokens Per Request
+   # Example with 131,072 tokens per request: 254,336 ÷ 131,072 ≈ 1.94 requests per prefill worker
+   ```
+
+3. **Monitor indirect indicators in dashboard**:
+   - **Prefill Worker Processing Time**: High avg (>5s) or P99 (>10s) indicates saturation
+   - **Frontend Avg TTFT**: If much higher than Prefill Processing Time, indicates queueing
+   - **Gap = TTFT - Prefill Processing Time** = Queue wait time
+
+4. **Performance signature of prefill KV cache bottleneck**:
+   - Min TTFT is low (1-3s) - proves system CAN be fast
+   - Avg/Max TTFT is very high (>30s) - proves requests are queueing
+   - Large variance (Max ÷ Min > 20×) - signature of queueing behavior
+   - This variance pattern is not what a purely compute-bound system would show; it points to requests queueing behind a KV cache capacity bottleneck
+
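The capacity and queue-wait estimates from steps 2 and 3 can be scripted. A minimal sketch, assuming the KV cache size is copied from the worker log with the thousands separators removed; the function names and the TTFT and processing-time readings are hypothetical:

```shell
# Max concurrent requests one prefill worker can hold:
# KV cache size (tokens) divided by tokens per request.
max_concurrency() {
  local kv_cache_tokens="$1" tokens_per_request="$2"
  awk -v kv="$kv_cache_tokens" -v per="$tokens_per_request" '
    BEGIN {
      if (per <= 0) {
        print "NaN"
        exit
      }
      printf "%.2f\n", kv / per
    }'
}

# Estimated queue wait in ms:
# Frontend Avg TTFT minus Prefill Worker Processing Time, floored at zero.
queue_wait_ms() {
  local ttft_ms="$1" prefill_ms="$2"
  awk -v t="$ttft_ms" -v p="$prefill_ms" '
    BEGIN {
      w = t - p
      if (w < 0) w = 0
      printf "%.0f\n", w
    }'
}

max_concurrency 254336 131072   # values from the example in step 2
queue_wait_ms 32000 2500        # hypothetical dashboard readings
```

The first call prints `1.94`, matching the example in step 2. The second prints `29500`: with these readings, roughly 29.5 s of a 32 s TTFT would be queue wait rather than prefill work.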
+### High CPU usage:
+1. **Worker CPU Usage showing high values (>30 cores)**:
+   - Check if workers have sufficient CPU limits configured
+   - May indicate CPU-bound operations (tokenization, scheduling)
+   - Compare against GPU utilization - CPU should not be the bottleneck
+
+2. **What's normal?**:
+   - vLLM workers use CPU for:
+     - Request scheduling and batching
+     - Tokenization (input/output processing)
+     - KV cache management
+     - TCP/gRPC communication (request plane)
+   - Expect moderate CPU usage (5-20 cores per worker)
+   - GPU compute should dominate, not CPU
+
+## Dashboard Variables
+
+The dashboard uses two template variables for flexibility:
+
+### Datasource Variable
+- **Variable**: `${datasource}`
+- **Type**: `datasource`
+- **Query**: `prometheus`
+- **Auto-selects**: Default Prometheus instance
+- **Purpose**: Allows the dashboard to automatically connect to your Prometheus instance without hardcoding UIDs
+
+### Namespace Variable
+- **Variable**: `${namespace}`
+- **Type**: `query`
+- **Query**: `label_values(dynamo_frontend_requests_total, namespace)`
+- **Purpose**: Allows filtering metrics by Kubernetes namespace (e.g., "robert", "default")
+- **Auto-populated**: Dynamically discovers namespaces from frontend pods
+
+**Usage**: All dashboard queries filter by `namespace="$namespace"` to show metrics for the selected deployment. You can switch between different Dynamo deployments in different namespaces using the namespace dropdown at the top of the dashboard.
--- a/deploy/observability/grafana_dashboards/README.md
+++ b/deploy/observability/grafana_dashboards/README.md
@@ -4,9 +4,12 @@ This directory contains example Grafana dashboards for Dynamo observability. The

 - `dynamo.json` - General Dynamo dashboard showing software and hardware metrics
 - `sglang.json` - SGLang engine metrics (request latency, throughput, cache) and HiCache KV cache metrics (GPU/CPU tier usage, eviction/load-back, PIN count)
+- `disagg-dashboard.json` - Dashboard for disaggregated serving - See [DASHBOARD_METRICS.md](DASHBOARD_METRICS.md) for detailed documentation on all metrics and panels
 - `dcgm-metrics.json` - GPU metrics dashboard using DCGM exporter data
 - `kvbm.json` - KV Block Manager metrics dashboard
 - `temp-loki.json` - Logging dashboard for Loki integration
 - `dashboard-providers.yml` - Configuration file for dashboard provisioning

 For setup instructions and usage, see [Observability Documentation](../../../docs/pages/observability/).
+
+For Kubernetes deployment setup, see [../k8s/MONITORING_SETUP.md](../k8s/MONITORING_SETUP.md).
--- a/deploy/observability/grafana_dashboards/disagg-dashboard.json
+++ b/deploy/observability/grafana_dashboards/disagg-dashboard.json
--- a/deploy/observability/k8s/MONITORING_SETUP.md
+++ b/deploy/observability/k8s/MONITORING_SETUP.md
+# Monitoring Setup for Dynamo Disaggregated Deployment
+
+## Prerequisites
+
+The Kubernetes cluster must have the GPU Operator installed with the DCGM exporter ServiceMonitor enabled, so that Prometheus can collect GPU metrics.
+
+```bash
+helm upgrade --install gpu-operator nvidia/gpu-operator \
+  --namespace gpu-operator \
+  --create-namespace \
+  --set operator.defaultRuntime=containerd \
+  --set gdrcopy.enabled=true \
+  --set dcgmExporter.serviceMonitor.enabled=true \
+  --set dcgmExporter.serviceMonitor.additionalLabels.release=prometheus \
+  --wait --timeout=600s
+```
+
+The key settings for monitoring are:
+- `dcgmExporter.serviceMonitor.enabled=true` - Enables ServiceMonitor creation
+- `dcgmExporter.serviceMonitor.additionalLabels.release=prometheus` - Adds label for Prometheus discovery
+
+## Installation
+
+Once the cluster is properly configured, run:
+
+```bash
+./setup-monitoring.sh
+```
+
+This script will:
+1. Install kube-prometheus-stack (Prometheus + Grafana + Alertmanager)
+2. Configure Prometheus to discover pod monitors across all namespaces
+3. Update Dynamo operator with Prometheus endpoint
+4. Configure DCGM custom metrics for NVLink profiling
+5. Verify DCGM ServiceMonitor exists (created by GPU operator)
+6. Deploy Grafana disaggregated dashboard ConfigMap (auto-imported by Grafana sidecar)
+7. Provide Grafana credentials
+
+## Verification
+
+Check that GPU metrics are flowing:
+
+```bash
+# Verify DCGM ServiceMonitor exists and is owned by ClusterPolicy
+kubectl get servicemonitor -n gpu-operator nvidia-dcgm-exporter -o yaml | grep -A 5 ownerReferences
+
+# Query Prometheus for GPU metrics
+kubectl exec -n monitoring prometheus-prometheus-kube-prometheus-prometheus-0 -- \
+  wget -q -O- 'http://localhost:9090/api/v1/query?query=DCGM_FI_DEV_GPU_UTIL'
+```
+
+Expected: the ServiceMonitor shows `ownerReferences` pointing to the ClusterPolicy, and the query returns one series per GPU (8 or more on an 8-GPU node).
--- a/deploy/observability/k8s/cleanup-monitoring.sh
+++ b/deploy/observability/k8s/cleanup-monitoring.sh
+#!/bin/bash
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+
+# Script to cleanup/undo monitoring stack installation
+
+set -e
+
+echo "=========================================="
+echo "Cleaning up Prometheus & Grafana for Dynamo"
+echo "=========================================="
+
+# Step 1: Delete Grafana dashboard ConfigMap
+echo ""
+echo "Step 1: Removing Grafana dashboard ConfigMap..."
+kubectl delete configmap grafana-disagg-dashboard -n monitoring --ignore-not-found=true
+
+# Step 2: Revert DCGM custom metrics configuration
+echo ""
+echo "Step 2: Reverting DCGM custom metrics to default..."
+kubectl delete configmap dcgm-exporter-metrics-config -n gpu-operator --ignore-not-found=true
+
+echo "Adding required Helm repositories..."
+helm repo add nvidia https://helm.ngc.nvidia.com/nvidia 2>/dev/null || true
+helm repo add nvidia-dynamo https://helm.ngc.nvidia.com/nvidia/ai-dynamo 2>/dev/null || true
+helm repo update
+
+echo "Reverting GPU Operator DCGM settings to default..."
+helm upgrade gpu-operator nvidia/gpu-operator \
+  --namespace gpu-operator \
+  --reuse-values \
+  --set dcgmExporter.config.name=""
+
+echo "Restarting DCGM exporter to apply default metrics..."
+kubectl rollout restart daemonset nvidia-dcgm-exporter -n gpu-operator
+kubectl rollout status daemonset nvidia-dcgm-exporter -n gpu-operator --timeout=180s
+
+# Step 3: Revert Dynamo operator Prometheus endpoint
+echo ""
+echo "Step 3: Removing Prometheus endpoint from Dynamo operator..."
+DYNAMO_VERSION=$(helm list -n dynamo -o json | jq -r '.[] | select(.name=="dynamo-platform") | .chart' | sed 's/dynamo-platform-//')
+echo "Detected Dynamo Platform version: ${DYNAMO_VERSION}"
+
+# Delete the conflicting secret (grove-operator will recreate it)
+echo "Removing grove-webhook-server-cert to avoid conflict..."
+kubectl delete secret grove-webhook-server-cert -n dynamo --ignore-not-found=true
+
+helm upgrade dynamo-platform nvidia-dynamo/dynamo-platform \
+  --version "${DYNAMO_VERSION}" \
+  --namespace dynamo \
+  --reuse-values \
+  --set prometheusEndpoint=""
+
+# Step 4: Uninstall kube-prometheus-stack
+echo ""
+echo "Step 4: Uninstalling kube-prometheus-stack..."
+helm uninstall prometheus -n monitoring || echo "Prometheus stack not found, skipping..."
+
+echo "Deleting kube-prometheus-stack CRDs (Helm doesn't remove these automatically)..."
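+# Collect the CRD names first: with no matches, a bare `kubectl delete` fails and `set -e` would abort the script
+CRDS=$(kubectl get crd -o name | grep monitoring.coreos.com || true)
+if [[ -n "${CRDS}" ]]; then
+    # Word splitting is intentional here: one resource name per argument
+    kubectl delete ${CRDS} --ignore-not-found=true
+fi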
+
+# Step 5: Delete monitoring namespace (optional)
+echo ""
+read -p "Delete monitoring namespace? This will remove all monitoring data (y/N): " -n 1 -r
+echo
+if [[ $REPLY =~ ^[Yy]$ ]]; then
+    echo "Deleting monitoring namespace..."
+    kubectl delete namespace monitoring --ignore-not-found=true
+else
+    echo "Keeping monitoring namespace"
+fi
+
+echo ""
+echo "=========================================="
+echo "✅ Cleanup Complete!"
+echo "=========================================="
+echo ""
+echo "You can now run setup-monitoring.sh to reinstall with a clean slate"
+echo ""
--- a/deploy/observability/k8s/dcgm-metrics-with-nvlink.csv
+++ b/deploy/observability/k8s/dcgm-metrics-with-nvlink.csv
+# Format
+# If line starts with a '#' it is considered a comment
+# DCGM FIELD, Prometheus metric type, help message
+
+# Clocks
+DCGM_FI_DEV_SM_CLOCK,  gauge, SM clock frequency (in MHz).
+DCGM_FI_DEV_MEM_CLOCK, gauge, Memory clock frequency (in MHz).
+
+# Temperature
+DCGM_FI_DEV_MEMORY_TEMP, gauge, Memory temperature (in C).
+DCGM_FI_DEV_GPU_TEMP,    gauge, GPU temperature (in C).
+
+# Power
+DCGM_FI_DEV_POWER_USAGE,              gauge, Power draw (in W).
+DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION, counter, Total energy consumption since boot (in mJ).
+
+# PCIE
+# DCGM_FI_PROF_PCIE_TX_BYTES,  counter, Total number of bytes transmitted through PCIe TX via NVML.
+# DCGM_FI_PROF_PCIE_RX_BYTES,  counter, Total number of bytes received through PCIe RX via NVML.
+DCGM_FI_DEV_PCIE_REPLAY_COUNTER, counter, Total number of PCIe retries.
+
+# Utilization (the sample period varies depending on the product)
+DCGM_FI_DEV_GPU_UTIL,      gauge, GPU utilization (in %).
+DCGM_FI_DEV_MEM_COPY_UTIL, gauge, Memory utilization (in %).
+DCGM_FI_DEV_ENC_UTIL,      gauge, Encoder utilization (in %).
+DCGM_FI_DEV_DEC_UTIL,      gauge, Decoder utilization (in %).
+
+# Errors and violations
+DCGM_FI_DEV_XID_ERRORS,            gauge,   Value of the last XID error encountered.
+# DCGM_FI_DEV_POWER_VIOLATION,       counter, Throttling duration due to power constraints (in us).
+# DCGM_FI_DEV_THERMAL_VIOLATION,     counter, Throttling duration due to thermal constraints (in us).
+# DCGM_FI_DEV_SYNC_BOOST_VIOLATION,  counter, Throttling duration due to sync-boost constraints (in us).
+# DCGM_FI_DEV_BOARD_LIMIT_VIOLATION, counter, Throttling duration due to board limit constraints (in us).
+# DCGM_FI_DEV_LOW_UTIL_VIOLATION,    counter, Throttling duration due to low utilization (in us).
+# DCGM_FI_DEV_RELIABILITY_VIOLATION, counter, Throttling duration due to reliability constraints (in us).
+
+# Memory usage
+DCGM_FI_DEV_FB_FREE, gauge, Framebuffer memory free (in MiB).
+DCGM_FI_DEV_FB_USED, gauge, Framebuffer memory used (in MiB).
+DCGM_FI_DEV_FB_RESERVED, gauge, Framebuffer memory reserved (in MiB).
+
+# ECC
+# DCGM_FI_DEV_ECC_SBE_VOL_TOTAL, counter, Total number of single-bit volatile ECC errors.
+# DCGM_FI_DEV_ECC_DBE_VOL_TOTAL, counter, Total number of double-bit volatile ECC errors.
+# DCGM_FI_DEV_ECC_SBE_AGG_TOTAL, counter, Total number of single-bit persistent ECC errors.
+# DCGM_FI_DEV_ECC_DBE_AGG_TOTAL, counter, Total number of double-bit persistent ECC errors.
+
+# Retired pages
+# DCGM_FI_DEV_RETIRED_SBE,     counter, Total number of retired pages due to single-bit errors.
+# DCGM_FI_DEV_RETIRED_DBE,     counter, Total number of retired pages due to double-bit errors.
+# DCGM_FI_DEV_RETIRED_PENDING, counter, Total number of pages pending retirement.
+
+# NVLink
+# DCGM_FI_DEV_NVLINK_CRC_FLIT_ERROR_COUNT_TOTAL, counter, Total number of NVLink flow-control CRC errors.
+# DCGM_FI_DEV_NVLINK_CRC_DATA_ERROR_COUNT_TOTAL, counter, Total number of NVLink data CRC errors.
+# DCGM_FI_DEV_NVLINK_REPLAY_ERROR_COUNT_TOTAL,   counter, Total number of NVLink retries.
+# DCGM_FI_DEV_NVLINK_RECOVERY_ERROR_COUNT_TOTAL, counter, Total number of NVLink recovery errors.
+DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL,            counter, Total number of NVLink bandwidth counters for all lanes.
+# DCGM_FI_DEV_NVLINK_BANDWIDTH_L0,               counter, The number of bytes of active NVLink rx or tx data including both header and payload.
+
+# NVLink Profiling Metrics - ADDED FOR NIXL TRANSFER MONITORING
+DCGM_FI_PROF_NVLINK_TX_BYTES, gauge, Total number of bytes of active NVLink tx (transmit) data including both header and payload.
+DCGM_FI_PROF_NVLINK_RX_BYTES, gauge, Total number of bytes of active NVLink rx (receive) data including both header and payload.
+
+# VGPU License status
+DCGM_FI_DEV_VGPU_LICENSE_STATUS, gauge, vGPU License status
+
+# Remapped rows
+DCGM_FI_DEV_UNCORRECTABLE_REMAPPED_ROWS, counter, Number of remapped rows for uncorrectable errors
+DCGM_FI_DEV_CORRECTABLE_REMAPPED_ROWS,   counter, Number of remapped rows for correctable errors
+DCGM_FI_DEV_ROW_REMAP_FAILURE,           gauge,   Whether remapping of rows has failed
+
+# Static configuration information. These appear as labels on the other metrics
+DCGM_FI_DRIVER_VERSION,        label, Driver Version
+# DCGM_FI_NVML_VERSION,          label, NVML Version
+# DCGM_FI_DEV_BRAND,             label, Device Brand
+# DCGM_FI_DEV_SERIAL,            label, Device Serial Number
+# DCGM_FI_DEV_OEM_INFOROM_VER,   label, OEM inforom version
+# DCGM_FI_DEV_ECC_INFOROM_VER,   label, ECC inforom version
+# DCGM_FI_DEV_POWER_INFOROM_VER, label, Power management object inforom version
+# DCGM_FI_DEV_INFOROM_IMAGE_VER, label, Inforom image version
+# DCGM_FI_DEV_VBIOS_VERSION,     label, VBIOS version of the device
+
+# DCP metrics
+DCGM_FI_PROF_GR_ENGINE_ACTIVE,   gauge, Ratio of time the graphics engine is active.
+# DCGM_FI_PROF_SM_ACTIVE,          gauge, The ratio of cycles an SM has at least 1 warp assigned.
+# DCGM_FI_PROF_SM_OCCUPANCY,       gauge, The ratio of number of warps resident on an SM.
+DCGM_FI_PROF_PIPE_TENSOR_ACTIVE, gauge, Ratio of cycles the tensor (HMMA) pipe is active.
+DCGM_FI_PROF_DRAM_ACTIVE,        gauge, Ratio of cycles the device memory interface is active sending or receiving data.
+# DCGM_FI_PROF_PIPE_FP64_ACTIVE,   gauge, Ratio of cycles the fp64 pipes are active.
+# DCGM_FI_PROF_PIPE_FP32_ACTIVE,   gauge, Ratio of cycles the fp32 pipes are active.
+# DCGM_FI_PROF_PIPE_FP16_ACTIVE,   gauge, Ratio of cycles the fp16 pipes are active.
+DCGM_FI_PROF_PCIE_TX_BYTES,      gauge, The rate of data transmitted over the PCIe bus - including both protocol headers and data payloads - in bytes per second.
+DCGM_FI_PROF_PCIE_RX_BYTES,      gauge, The rate of data received over the PCIe bus - including both protocol headers and data payloads - in bytes per second.
--- a/deploy/observability/k8s/grafana-disagg-dashboard-configmap.yaml
+++ b/deploy/observability/k8s/grafana-disagg-dashboard-configmap.yaml
--- a/deploy/observability/k8s/setup-monitoring.sh
+++ b/deploy/observability/k8s/setup-monitoring.sh
+#!/bin/bash
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+
+# Script to install Prometheus and Grafana monitoring stack for Dynamo
+# Following the official Dynamo Kubernetes observability guide:
+# https://github.com/ai-dynamo/dynamo/blob/main/docs/kubernetes/observability/metrics.md
+
+set -e
+
+echo "=========================================="
+echo "Installing Prometheus & Grafana for Dynamo"
+echo "=========================================="
+
+# Step 1: Add Helm repositories
+echo ""
+echo "Step 1: Adding required Helm repositories..."
+helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
+helm repo add nvidia https://helm.ngc.nvidia.com/nvidia 2>/dev/null || true
+helm repo add nvidia-dynamo https://helm.ngc.nvidia.com/nvidia/ai-dynamo 2>/dev/null || true
+
+# Step 2: Update Helm repositories
+echo ""
+echo "Step 2: Updating Helm repositories..."
+helm repo update
+
+# Step 3: Install kube-prometheus-stack
+echo ""
+echo "Step 3: Installing kube-prometheus-stack..."
+echo "This includes: Prometheus Operator, Prometheus, Grafana, Alertmanager"
+helm upgrade --install prometheus -n monitoring --create-namespace prometheus-community/kube-prometheus-stack \
+  --set prometheus.prometheusSpec.podMonitorSelectorNilUsesHelmValues=false \
+  --set-json 'prometheus.prometheusSpec.podMonitorNamespaceSelector={}' \
+  --set-json 'prometheus.prometheusSpec.probeNamespaceSelector={}'
+
+# Step 4: Wait for pods to be ready
+echo ""
+echo "Step 4: Waiting for monitoring stack pods to be ready..."
+echo "This may take 1-2 minutes..."
+kubectl wait --for=condition=ready pod -l "release=prometheus" -n monitoring --timeout=180s
+
+# Step 5: Verify installation
+echo ""
+echo "Step 5: Verifying installation..."
+kubectl get pods -n monitoring
+
+# Step 6: Update Dynamo operator with Prometheus endpoint
+echo ""
+echo "Step 6: Updating Dynamo operator with Prometheus endpoint..."
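+# Detect the currently installed version to avoid accidental upgrades
+DYNAMO_VERSION=$(helm list -n dynamo -o json | jq -r '.[] | select(.name=="dynamo-platform") | .chart' | sed 's/^dynamo-platform-//')
+# Abort if no release was found; an empty --version would not pin the chart
+if [ -z "${DYNAMO_VERSION}" ]; then
+    echo "Error: no dynamo-platform release found in namespace 'dynamo'. Install it before running this script." >&2
+    exit 1
+fi
+echo "Detected Dynamo Platform version: ${DYNAMO_VERSION}"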
+
+# Delete the conflicting secret (grove-operator will recreate it)
+echo "Removing grove-webhook-server-cert to avoid conflict..."
+kubectl delete secret grove-webhook-server-cert -n dynamo --ignore-not-found=true
+
+# Perform the upgrade
+echo "Running Helm upgrade..."
+helm upgrade dynamo-platform nvidia-dynamo/dynamo-platform \
+  --version "${DYNAMO_VERSION}" \
+  --namespace dynamo \
+  --reuse-values \
+  --set prometheusEndpoint=http://prometheus-kube-prometheus-prometheus.monitoring.svc.cluster.local:9090
+
+# Step 7: Configure DCGM custom metrics for NVLink profiling
+echo ""
+echo "Step 7: Configuring DCGM custom metrics for NVLink profiling..."
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+kubectl create configmap dcgm-exporter-metrics-config \
+  --from-file=dcgm-metrics.csv="$SCRIPT_DIR/dcgm-metrics-with-nvlink.csv" \
+  --namespace=gpu-operator \
+  --dry-run=client -o yaml | kubectl apply -f -
+
+echo "Updating GPU Operator to use custom DCGM metrics..."
+helm upgrade gpu-operator nvidia/gpu-operator \
+  --namespace gpu-operator \
+  --reuse-values \
+  --set dcgmExporter.config.name=dcgm-exporter-metrics-config
+
+echo "Restarting DCGM exporter to apply new metrics configuration..."
+kubectl rollout restart daemonset nvidia-dcgm-exporter -n gpu-operator
+kubectl rollout status daemonset nvidia-dcgm-exporter -n gpu-operator --timeout=60s
+
+echo "✅ DCGM custom metrics configured with NVLink profiling support"
+
+# Step 8: Verify DCGM ServiceMonitor exists (created by GPU operator during cluster setup)
+echo ""
+echo "Step 8: Verifying DCGM exporter ServiceMonitor..."
+if kubectl get servicemonitor -n gpu-operator nvidia-dcgm-exporter &>/dev/null; then
+    echo "✅ DCGM ServiceMonitor found - GPU metrics will be available in Prometheus/Grafana"
+else
+    echo "⚠️  DCGM ServiceMonitor not found."
+    echo "The GPU operator should have been installed with serviceMonitor enabled in createEks.sh"
+    echo "Please verify the cluster was created with the updated createEks.sh that includes:"
+    echo "  --set dcgmExporter.serviceMonitor.enabled=true"
+    echo "  --set dcgmExporter.serviceMonitor.additionalLabels.release=prometheus"
+fi
+
+# Step 9: Deploy Grafana dashboard ConfigMap
+echo ""
+echo "Step 9: Deploying Grafana disaggregated dashboard ConfigMap..."
+kubectl apply -f "$SCRIPT_DIR/grafana-disagg-dashboard-configmap.yaml"
+echo "✅ Dashboard ConfigMap deployed - Grafana sidecar will auto-import it within a few seconds"
+
+# Step 10: Get Grafana credentials
+echo ""
+echo "Step 10: Retrieving Grafana credentials..."
+GRAFANA_USER=$(kubectl get secret -n monitoring prometheus-grafana -o jsonpath="{.data.admin-user}" | base64 --decode)
+GRAFANA_PASSWORD=$(kubectl get secret -n monitoring prometheus-grafana -o jsonpath="{.data.admin-password}" | base64 --decode)
+
+echo ""
+echo "=========================================="
+echo "✅ Installation Complete!"
+echo "=========================================="
+echo ""
+echo "Grafana Access:"
+echo "  Username: $GRAFANA_USER"
+echo "  Password: $GRAFANA_PASSWORD"
+echo ""
+echo "To access Grafana:"
+echo "  kubectl port-forward svc/prometheus-grafana 3000:80 -n monitoring"
+echo "  Then visit: http://localhost:3000"
+echo ""
+echo "To access Prometheus:"
+echo "  kubectl port-forward svc/prometheus-kube-prometheus-prometheus 9090:9090 -n monitoring"
+echo "  Then visit: http://localhost:9090"
+echo ""
+echo "Next Steps:"
+echo "  1. Deploy or redeploy your DynamoGraphDeployment with DYN_SYSTEM_ENABLED=true"
+echo "  2. The Dynamo operator will automatically create PodMonitors for metrics collection"
+echo "  3. View metrics in Grafana under Dashboards → General → Dynamo Disaggregated Analysis"
+echo ""