Unverified commit abacb96e, authored by rwipfelnv, committed by GitHub

feat(observability): Add Grafana dashboard and monitoring setup for… (#4639)


Signed-off-by: rwipfelnv <rwipfel@nvidia.com>
Co-authored-by: Claude <noreply@anthropic.com>
parent 97f79537
@@ -131,3 +131,7 @@ package-lock.json
# Compiled static libraries
*.a
# macOS
.DS_Store
**/.DS_Store
# Grafana Dashboard Metrics Documentation
This document explains where each panel in the `disagg-dashboard.json` gets its data and how it's displayed.
## Dashboard Organization
The dashboard is organized in **logical request flow order** (21 panels across 7 rows):
**Row 1: Frontend Health** (User-facing metrics - y=0)
- Frontend Requests/Sec (x=0), Avg TTFT (x=8), Avg Request Duration (x=16)
**Row 2: Frontend Details** (y=8)
- Avg Inter-Token Latency (x=0), Avg ISL/OSL (x=8), **Queued Requests** ⭐ (x=16)
**Row 3: Prefill Workers** (The typical bottleneck! - y=16)
- Prefill Worker Processing Time ⭐ (x=0), Prefill Worker Throughput (x=8), Component Latency Comparison (x=16)
**Row 4: Decode Workers** (y=24)
- Request Throughput (x=0), Avg Request Duration (x=8), KV Cache Utilization (%) (x=16)
**Row 5: KV Cache + GPU** (y=32)
- KV Cache Blocks (Active/Total) ⭐ (x=0), GPU Compute Utilization (x=8), GPU Memory Used (x=16)
**Row 6: NIXL Transfer Metrics** (y=40)
- GPU Memory Bandwidth (x=0), NVLink Bandwidth (GB/s) (x=8), Worker CPU Usage (x=16)
**Row 7: Node + Worker** (y=48)
- Node CPU Utilization (x=0), Worker Request Throughput (x=8), Worker Data Transfer (x=16)
⭐ = Key metrics for diagnosing TTFT bottlenecks
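The x/y coordinates in the row listing follow a regular grid: three panels per row, with x and y both advancing in steps of 8. A minimal sketch of that layout rule, assuming panels are numbered from 0 in reading order (`grid_pos` is a hypothetical helper for illustration, not part of the dashboard tooling):

```shell
# grid_pos INDEX: print the x/y position of the panel at zero-based INDEX,
# assuming three panels per row and a step of 8 in both directions,
# as in the row listing above.
grid_pos() {
  echo "x=$(( ($1 % 3) * 8 )) y=$(( ($1 / 3) * 8 ))"
}

grid_pos 0    # Frontend Requests/Sec -> x=0 y=0
grid_pos 20   # Worker Data Transfer  -> x=16 y=48
```

This can help when adding a panel by hand: pick the next free index and reuse the resulting coordinates in the panel's `gridPos`.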
## Metric Sources
### Frontend Metrics (from Frontend Pod)
These metrics come from the `dynamo_frontend_*` namespace and are collected from the frontend deployment pod.
| Panel | Metric | Formula | Description |
|-------|--------|---------|-------------|
| **Frontend Requests / Sec** | `dynamo_frontend_requests_total` | `rate(...[30s])` | Rate of requests per second hitting the frontend, broken down by request_type and status |
| **Frontend Avg Time to First Token** ⭐ | `dynamo_frontend_time_to_first_token_seconds_{sum,count}` | `1000 * (rate(sum[5m]) / rate(count[5m]))` | Average time (in ms) from request arrival to first token over the last 5 minutes. Includes queue wait, prefill compute, and NIXL transfer time. **The primary performance metric** |
| **Frontend Avg Request Duration** | `dynamo_frontend_request_duration_seconds_{sum,count}` | `1000 * (rate(sum[5m]) / rate(count[5m]))` | Total end-to-end request duration in milliseconds over the last 5 minutes |
| **Frontend Avg Inter-Token Latency** | `dynamo_frontend_inter_token_latency_seconds_{sum,count}` | `1000 * (rate(sum[5m]) / rate(count[5m]))` | Average time (in ms) between token generations during decode phase over the last 5 minutes |
| **Frontend Avg Input/Output Sequence Length** | `dynamo_frontend_input_sequence_tokens_{sum,count}` & `dynamo_frontend_output_sequence_tokens_{sum,count}` | `rate(sum[5m]) / rate(count[5m])` for each | Average input prompt length (ISL) and output generation length (OSL) in tokens over the last 5 minutes |
| **Frontend Queued Requests** ⭐⭐⭐ | `dynamo_frontend_queued_requests` | Raw value | Number of requests waiting in queue. **THE key metric for diagnosing worker saturation.** High values (>10) indicate workers cannot keep up with load. Yellow threshold at 10, red at 50 |
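The `1000 * (rate(sum[5m]) / rate(count[5m]))` formula used by the latency panels reduces to a ratio of counter deltas, because the window length cancels out. A small worked sketch of that arithmetic, using made-up counter samples (`avg_ms` is a hypothetical helper, not something Prometheus or Dynamo provides):

```shell
# avg_ms SUM_START SUM_END COUNT_START COUNT_END
# Mirrors 1000 * (rate(sum[5m]) / rate(count[5m])): the window length cancels,
# so the average is delta(sum) / delta(count), converted to milliseconds.
avg_ms() {
  awk -v s0="$1" -v s1="$2" -v c0="$3" -v c1="$4" 'BEGIN {
    dc = c1 - c0
    if (dc <= 0) { print "NaN"; exit }   # no requests observed in the window
    printf "%.1f\n", 1000 * (s1 - s0) / dc
  }'
}

# Hypothetical samples taken 5 minutes apart:
# the _sum grew by 60 seconds while the _count grew by 120 requests.
avg_ms 120.0 180.0 400 520   # -> 500.0
```

An empty window (no new requests) has no defined average, which is why the panels can show gaps during idle periods.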
### GPU Metrics (from DCGM Exporter)
These metrics come from the DCGM (Data Center GPU Manager) exporter running as a DaemonSet in the `gpu-operator` namespace. DCGM collects hardware-level GPU metrics.
| Panel | Metric | Formula | Description |
|-------|--------|---------|-------------|
| **GPU Compute Utilization** | `DCGM_FI_DEV_GPU_UTIL` | Raw value | GPU compute utilization percentage (0-100) for each GPU. Prefill workers show high utilization during prefill phase |
| **GPU Memory Bandwidth** | `DCGM_FI_DEV_MEM_COPY_UTIL` | Raw value | GPU memory copy bandwidth utilization percentage (0-100). **Spikes indicate KV cache transfers over NIXL**. On single-node deployments, NIXL uses CUDA IPC (GPU→Host→Host→GPU) not direct GPU-to-GPU. Yellow threshold at 60%, red at 80% |
| **NVLink Bandwidth (GB/s)** | `DCGM_FI_PROF_NVLINK_TX_BYTES` & `DCGM_FI_PROF_NVLINK_RX_BYTES` | `(rate(TX_BYTES[1m]) + rate(RX_BYTES[1m])) / 1e9` | NVLink transfer bandwidth in GB/s (rate of change) per GPU, measured from DCGM profiling metrics. Shows total bidirectional bandwidth (TX + RX). This includes intra-pod TP communication (TP=2 for prefill, TP=4 for decode). Low bandwidth (<1 GB/s) indicates inter-pod NIXL KV cache transfers may be using host memory copies instead of direct NVLink/GPUDirect. Yellow threshold at 5 GB/s, red at 10 GB/s |
| **GPU Memory Used** | `DCGM_FI_DEV_FB_USED` | `value / 1024` | GPU framebuffer memory used in GiB (DCGM reports the raw value in MiB). Prefill workers allocate KV blocks on decode workers via NIXL |
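The NVLink panel formula, `(rate(TX_BYTES[1m]) + rate(RX_BYTES[1m])) / 1e9`, can be checked by hand for a single GPU. The sketch below mirrors that formula with made-up byte deltas (`nvlink_gbs` is a hypothetical helper used only for illustration):

```shell
# nvlink_gbs TX_DELTA_BYTES RX_DELTA_BYTES WINDOW_SECONDS
# Mirrors (rate(TX_BYTES[1m]) + rate(RX_BYTES[1m])) / 1e9 for a single GPU.
nvlink_gbs() {
  awk -v tx="$1" -v rx="$2" -v w="$3" 'BEGIN {
    printf "%.2f\n", (tx / w + rx / w) / 1e9
  }'
}

# Hypothetical deltas over a 60 s window: 90 GB transmitted, 30 GB received.
nvlink_gbs 90000000000 30000000000 60   # -> 2.00
```

A result of 2.00 GB/s sits above the 1 GB/s "low bandwidth" mark and below the 5 GB/s yellow threshold described in the table.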
### Prefill Worker Metrics (from Prefill Worker Pods)
These metrics come from the prefill worker pods' system endpoints (port 9090). They track request processing for prefill operations.
| Panel | Metric | Formula | Description |
|-------|--------|---------|-------------|
| **Prefill Worker Processing Time** | `dynamo_component_request_duration_seconds_{sum,count,bucket}{dynamo_component="prefill",dynamo_endpoint="generate"}` | `1000 * rate(sum[5m]) / rate(count[5m])` for avg, `histogram_quantile(0.99, ...)` for P99 | Average and P99 time (in ms) spent processing prefill requests. **Includes prefill computation AND KV cache transfer over NIXL** |
| **Prefill Worker Throughput** | `dynamo_component_requests_total{dynamo_component="prefill",dynamo_endpoint="generate"}` | `rate(...[5m])` | Rate of prefill requests being processed in requests/second |
### Decode Worker Metrics (from Decode Worker Pods)
These metrics come from the decode worker pods' system endpoints (port 9090). In disaggregated mode, decode workers receive KV cache from prefill workers and perform token generation.
| Panel | Metric | Formula | Description |
|-------|--------|---------|-------------|
| **Component Latency - Prefill vs Decode** | `dynamo_component_request_duration_seconds_{sum,count}{dynamo_component="prefill",dynamo_endpoint="generate"}` & `{dynamo_component="backend",dynamo_endpoint="generate"}` | `rate(sum[5m]) / rate(count[5m])` | Average request duration for prefill workers (includes NIXL transfer) vs decode workers (entire decode session for all output tokens) over the last 5 minutes. **Note**: Decode worker latency measures the FULL decode session duration, not just time to first token. Only shows `generate` endpoint (filters out `clear_kv_blocks` maintenance operations) |
| **Decode Worker - Request Throughput** | `dynamo_component_requests_total{dynamo_component="backend"}` | `rate(...[5m])` | Rate of requests processed by decode workers in requests/second |
| **Decode Worker - Avg Request Duration** | `dynamo_component_request_duration_seconds_{sum,count}{dynamo_component="backend"}` | `rate(sum[5m]) / rate(count[5m])` | Average time decode workers spend processing requests (decode phase only) over the last 5 minutes |
| **KV Cache Utilization** | `dynamo_component_kvstats_gpu_cache_usage_percent` | Raw value (0-100%) | GPU memory utilization for KV cache storage of active requests. High values (>90%) indicate workers are at capacity and requests are queueing. **Note**: Only available for decode workers - prefill workers in disaggregated mode don't expose this metric. Monitor Prefill Worker Processing Time instead for prefill capacity |
| **KV Cache Blocks (Active/Total)** ⭐ | `dynamo_component_kvstats_active_blocks` & `dynamo_component_kvstats_total_blocks` | Raw values | Number of KV cache blocks in use vs total available for decode workers. When active approaches total, decode workers are at capacity. Shows numeric values (e.g., 2048/5297). **Note**: Only for decode workers |
### CPU Metrics (from cAdvisor and Node Exporter)
These metrics come from Kubernetes cAdvisor (container metrics) and Node Exporter (node-level metrics). CPU bottlenecks can impact prefill/decode performance.
| Panel | Metric | Formula | Description |
|-------|--------|---------|-------------|
| **Worker CPU Usage** | `container_cpu_usage_seconds_total{namespace="$namespace",pod=~".*worker.*",container="main"}` | `rate(...[5m])` | CPU cores used by worker pods. Value shows actual CPU consumption (e.g., 2.5 = 2.5 cores). Yellow at 30 cores, red at 50 cores |
| **Node CPU Utilization** | `node_cpu_seconds_total{mode="idle"}` | `100 - (avg(rate(idle)) * 100)` | Overall node CPU utilization percentage. Shows aggregate CPU usage across all cores |
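The Node CPU Utilization formula, `100 - (avg(rate(idle)) * 100)`, averages the per-core idle rate and inverts it. A minimal sketch of the same calculation, assuming each argument is one core's idle rate as a fraction between 0 and 1 (`node_cpu_util` is a hypothetical helper):

```shell
# node_cpu_util IDLE_RATE...
# Each argument is one core's idle rate: the fraction of each second spent
# idle, i.e. the per-core rate of node_cpu_seconds_total{mode="idle"}.
node_cpu_util() {
  printf '%s\n' "$@" | awk 'NF { sum += $1; n++ } END {
    if (n == 0) { print "NaN"; exit }   # no cores given
    printf "%.1f\n", 100 - (sum / n) * 100
  }'
}

# Four hypothetical cores, on average 75% idle.
node_cpu_util 0.9 0.7 0.8 0.6   # -> 25.0
```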
### Worker Metrics (from Worker Pods)
These metrics track request processing across all worker pods (prefill and decode).
| Panel | Metric | Formula | Description |
|-------|--------|---------|-------------|
| **Worker Request Throughput** | `dynamo_component_requests_total{dynamo_endpoint="generate"}` | `rate(...[5m])` | Requests per second processed by each worker, broken down by component type (prefill, backend). Shows overall system throughput |
| **Worker Data Transfer** | `dynamo_component_request_bytes_total` & `dynamo_component_response_bytes_total` | `rate(...[5m])` | Bytes per second transferred in requests (IN) and responses (OUT). Shows data throughput across worker pods |
## Metric Label Filters
### Component Name Filtering
- **Prefill workers**: `dynamo_component="prefill"`
- **Decode workers**: `dynamo_component="backend"`
- **All workers**: Filter by `dynamo_endpoint="generate"` to exclude maintenance operations like `clear_kv_blocks`
### Important Labels
- `pod`: Specific pod name (e.g., `llama3-70b-disagg-sn-0-vllmprefillworker-hrnt5`)
- `namespace`: Kubernetes namespace (e.g., `robert`)
- `dynamo_component`: Component type (`prefill`, `backend`, `frontend`)
- `dynamo_endpoint`: Endpoint name (`generate`, `clear_kv_blocks`)
- `gpu`: GPU index (0-7 for DCGM metrics)
- `Hostname`: Node hostname (for DCGM metrics)
## Metric Collection Architecture
```text
┌─────────────────┐
│ Frontend Pod    │ ──► dynamo_frontend_* metrics (HTTP port)
└─────────────────┘
┌─────────────────┐
│ Prefill Worker  │ ──► dynamo_component_* metrics (system port 9090)
│ Pods            │     ├─ dynamo_component_request_* (request stats)
└─────────────────┘     └─ dynamo_component_*_bytes_total (data transfer)
                    └─ container_cpu_* metrics (cAdvisor)
┌─────────────────┐
│ Decode Worker   │ ──► dynamo_component_* metrics (system port 9090)
│ Pods            │     └─ dynamo_component_request_* (component stats)
└─────────────────┘ └─ container_cpu_* metrics (cAdvisor)
┌─────────────────┐
│ DCGM Exporter   │ ──► DCGM_FI_DEV_* metrics (GPU metrics port)
│ DaemonSet       │     ├─ GPU compute utilization
│ (gpu-operator)  │     ├─ GPU memory bandwidth (NIXL indicator)
└─────────────────┘     ├─ GPU memory usage
                        └─ GPU temperature
┌─────────────────┐
│ Node Exporter   │ ──► node_* metrics (node metrics port)
│ DaemonSet       │     ├─ node_cpu_seconds_total (CPU by mode)
│ (monitoring)    │     ├─ node_load1/5/15 (load average)
└─────────────────┘     └─ node_memory_* (memory stats)
┌─────────────────┐
│ cAdvisor        │ ──► container_* metrics (built into kubelet)
│ (kubelet)       │     ├─ container_cpu_usage_seconds_total
└─────────────────┘     ├─ container_cpu_cfs_throttled_periods_total
                        └─ container_memory_*
┌─────────────────┐
│ Prometheus      │ ◄─── ServiceMonitor for DCGM & Node Exporter
│ (monitoring)    │ ◄─── PodMonitor for Dynamo workers
└─────────────────┘ ◄─── Scrapes cAdvisor from kubelet
┌─────────────────┐
│ Grafana         │
│ (monitoring)    │
└─────────────────┘
```
## PodMonitor Configuration
The Dynamo operator automatically creates PodMonitors for metrics-enabled deployments:
- **Label**: `nvidia.com/metrics-enabled: "true"` on all worker pods
- **Endpoint**: System port (9090) with path `/metrics`
- **Namespace**: Pods in any namespace are discovered (via `podMonitorNamespaceSelector={}`)
To opt a deployment out of metrics collection:
```yaml
apiVersion: nvidia.com/v1
kind: DynamoGraphDeployment
metadata:
  annotations:
    nvidia.com/enable-metrics: "false"
```
## DCGM ServiceMonitor Configuration
The DCGM ServiceMonitor must be manually created (see `dcgm-servicemonitor.yaml`):
- **Namespace**: `gpu-operator` (where DCGM exporter runs)
- **Label**: `release: prometheus` (required for Prometheus discovery)
- **Selector**: `app: nvidia-dcgm-exporter`
- **Endpoint**: `gpu-metrics` port with path `/metrics`
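The four settings listed here map onto a Prometheus Operator `ServiceMonitor` roughly as sketched below. This is an illustration assembled from the bullets, not the contents of `dcgm-servicemonitor.yaml`; the `metadata.name` value and the helper name `dcgm_servicemonitor_manifest` are assumptions.

```shell
# Print a ServiceMonitor manifest built from the settings listed above.
# metadata.name is an assumption; adjust it to match your cluster.
dcgm_servicemonitor_manifest() {
  cat <<'EOF'
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: nvidia-dcgm-exporter
  namespace: gpu-operator
  labels:
    release: prometheus
spec:
  selector:
    matchLabels:
      app: nvidia-dcgm-exporter
  endpoints:
    - port: gpu-metrics
      path: /metrics
EOF
}

# Review the output first, then apply it with:
#   dcgm_servicemonitor_manifest | kubectl apply -f -
dcgm_servicemonitor_manifest
```

Printing the manifest before applying it makes it easy to compare against whatever the GPU operator may already have created in the `gpu-operator` namespace.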
## Troubleshooting
### No metrics showing up:
1. Check PodMonitor exists: `kubectl get podmonitor -A`
2. Check pods have metrics label: `kubectl get pods -n <namespace> -l nvidia.com/metrics-enabled=true`
3. Check Prometheus targets: Visit Prometheus UI → Status → Targets
### DCGM metrics missing:
1. Check DCGM exporter running: `kubectl get daemonset -A | grep dcgm-exporter`
2. Check ServiceMonitor exists: `kubectl get servicemonitor -n gpu-operator`
3. Verify `release: prometheus` label on ServiceMonitor
### Prefill queue metrics showing zero:
- These metrics only populate when **remote prefill** requests are processed
- In local-only mode, decode workers handle prefill themselves (no queue)
- Check deployment mode and request routing configuration
### KV Cache metrics only showing decode workers:
**Important Limitation**: In disaggregated mode, prefill workers (`--is-prefill-worker`) do NOT expose `dynamo_component_kvstats_*` metrics. Only decode workers expose these.
**Why this happens:**
- Prefill workers transfer KV cache to decode workers via NIXL
- They don't maintain long-term KV cache state
- Only decode workers track KV cache utilization metrics
**How to diagnose prefill worker capacity bottlenecks:**
1. **Check worker logs** for KV cache size at startup:
```bash
kubectl logs -n <namespace> <prefill-worker-pod> | grep "GPU KV cache size"
# Example output: GPU KV cache size: 254,336 tokens
```
2. **Calculate maximum concurrent requests**:
```text
Max Concurrency = KV Cache Size ÷ Tokens Per Request
# For 131,072 tokens per request: 254,336 ÷ 131,072 ≈ 1.94 requests per prefill worker
```
3. **Monitor indirect indicators in dashboard**:
- **Prefill Worker Processing Time**: High avg (>5s) or P99 (>10s) indicates saturation
- **Frontend Avg TTFT**: If much higher than Prefill Processing Time, indicates queueing
- **Gap = TTFT - Prefill Processing Time** = Queue wait time
4. **Performance signature of prefill KV cache bottleneck**:
- Min TTFT is low (1-3s) - proves system CAN be fast
- Avg/Max TTFT is very high (>30s) - proves requests are queueing
- Large variance (Max ÷ Min > 20×) - signature of queueing behavior
- This variance pattern would not appear if the bottleneck were compute-bound: slow compute raises every request's TTFT, whereas a KV cache capacity limit leaves some requests fast and makes the rest wait in the queue
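The two calculations in the steps above, maximum concurrency and the queue-wait gap, can be sketched as small shell helpers. The concurrency figures (254,336 cached tokens, 131,072 tokens per request) come from step 2; the TTFT and prefill readings are made up, and both function names are hypothetical.

```shell
# max_concurrency KV_CACHE_TOKENS TOKENS_PER_REQUEST
# Max Concurrency = KV Cache Size / Tokens Per Request
max_concurrency() {
  awk -v kv="$1" -v per="$2" 'BEGIN { printf "%.2f\n", kv / per }'
}

# queue_wait_ms TTFT_MS PREFILL_MS
# Gap = TTFT - Prefill Processing Time, read as time spent waiting in queue.
queue_wait_ms() {
  echo $(( $1 - $2 ))
}

max_concurrency 254336 131072   # -> 1.94
queue_wait_ms 32000 4500        # hypothetical readings -> 27500
```

A concurrency below 2 combined with a queue-wait gap much larger than the prefill time itself matches the signature described in step 4.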
### High CPU usage:
1. **Worker CPU Usage showing high values (>30 cores)**:
- Check if workers have sufficient CPU limits configured
- May indicate CPU-bound operations (tokenization, scheduling)
- Compare against GPU utilization - CPU should not be the bottleneck
2. **What's normal?**:
- vLLM workers use CPU for:
- Request scheduling and batching
- Tokenization (input/output processing)
- KV cache management
- TCP/gRPC communication (request plane)
- Expect moderate CPU usage (5-20 cores per worker)
- GPU compute should dominate, not CPU
## Dashboard Variables
The dashboard uses two template variables for flexibility:
### Datasource Variable
- **Variable**: `${datasource}`
- **Type**: `datasource`
- **Query**: `prometheus`
- **Auto-selects**: Default Prometheus instance
- **Purpose**: Allows the dashboard to automatically connect to your Prometheus instance without hardcoding UIDs
### Namespace Variable
- **Variable**: `${namespace}`
- **Type**: `query`
- **Query**: `label_values(dynamo_frontend_requests_total, namespace)`
- **Purpose**: Allows filtering metrics by Kubernetes namespace (e.g., "robert", "default")
- **Auto-populated**: Dynamically discovers namespaces from frontend pods
**Usage**: All dashboard queries filter by `namespace="$namespace"` to show metrics for the selected deployment. You can switch between different Dynamo deployments in different namespaces using the namespace dropdown at the top of the dashboard.
@@ -4,9 +4,12 @@ This directory contains example Grafana dashboards for Dynamo observability. The
- `dynamo.json` - General Dynamo dashboard showing software and hardware metrics
- `sglang.json` - SGLang engine metrics (request latency, throughput, cache) and HiCache KV cache metrics (GPU/CPU tier usage, eviction/load-back, PIN count)
- `disagg-dashboard.json` - Dashboard for disaggregated serving - See [DASHBOARD_METRICS.md](DASHBOARD_METRICS.md) for detailed documentation on all metrics and panels
- `dcgm-metrics.json` - GPU metrics dashboard using DCGM exporter data
- `kvbm.json` - KV Block Manager metrics dashboard
- `temp-loki.json` - Logging dashboard for Loki integration
- `dashboard-providers.yml` - Configuration file for dashboard provisioning
For setup instructions and usage, see [Observability Documentation](../../../docs/pages/observability/).
For Kubernetes deployment setup, see [../k8s/MONITORING_SETUP.md](../k8s/MONITORING_SETUP.md).
# Monitoring Setup for Dynamo Disaggregated Deployment
## Prerequisites
The k8s cluster must be created with GPU operator configured to enable DCGM ServiceMonitor for Prometheus metrics collection.
```bash
helm upgrade --install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --create-namespace \
  --set operator.defaultRuntime=containerd \
  --set gdrcopy.enabled=true \
  --set dcgmExporter.serviceMonitor.enabled=true \
  --set dcgmExporter.serviceMonitor.additionalLabels.release=prometheus \
  --wait --timeout=600s
```
The key settings for monitoring are:
- `dcgmExporter.serviceMonitor.enabled=true` - Enables ServiceMonitor creation
- `dcgmExporter.serviceMonitor.additionalLabels.release=prometheus` - Adds label for Prometheus discovery
## Installation
Once the cluster is properly configured, run:
```bash
./setup-monitoring.sh
```
This script will:
1. Install kube-prometheus-stack (Prometheus + Grafana + Alertmanager)
2. Configure Prometheus to discover pod monitors across all namespaces
3. Update Dynamo operator with Prometheus endpoint
4. Configure DCGM custom metrics for NVLink profiling
5. Verify DCGM ServiceMonitor exists (created by GPU operator)
6. Deploy Grafana disaggregated dashboard ConfigMap (auto-imported by Grafana sidecar)
7. Provide Grafana credentials
## Verification
Check that GPU metrics are flowing:
```bash
# Verify DCGM ServiceMonitor exists and is owned by ClusterPolicy
kubectl get servicemonitor -n gpu-operator nvidia-dcgm-exporter -o yaml | grep -A 5 ownerReferences
# Query Prometheus for GPU metrics
kubectl exec -n monitoring prometheus-prometheus-kube-prometheus-prometheus-0 -- \
wget -q -O- 'http://localhost:9090/api/v1/query?query=DCGM_FI_DEV_GPU_UTIL'
```
Expected: ServiceMonitor should show `ownerReferences` pointing to ClusterPolicy, and query should return 8+ GPU series.
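To check the "8+ GPU series" expectation without reading the raw JSON, the query output can be piped through a small counter. The sketch below counts `"metric"` keys, which appear once per series in an instant-vector response; `count_series` is a hypothetical helper and the sample response is made up.

```shell
# count_series: read a Prometheus /api/v1/query JSON response on stdin and
# print how many series it contains. It counts "metric" keys, one per series
# in an instant-vector result. This is a rough sketch; if jq is available,
# jq '.data.result | length' is the sturdier choice.
count_series() {
  grep -o '"metric":' | wc -l | tr -d ' '
}

# Hypothetical two-GPU response, trimmed to the fields that matter here.
sample='{"status":"success","data":{"resultType":"vector","result":[
{"metric":{"__name__":"DCGM_FI_DEV_GPU_UTIL","gpu":"0"},"value":[0,"97"]},
{"metric":{"__name__":"DCGM_FI_DEV_GPU_UTIL","gpu":"1"},"value":[0,"12"]}]}}'

printf '%s\n' "$sample" | count_series   # -> 2
```

In practice, append `| count_series` to the `kubectl exec ... wget` command above and compare the number against the GPU count of the node.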
#!/bin/bash
# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
# Script to cleanup/undo monitoring stack installation
set -e
echo "=========================================="
echo "Cleaning up Prometheus & Grafana for Dynamo"
echo "=========================================="
# Step 1: Delete Grafana dashboard ConfigMap
echo ""
echo "Step 1: Removing Grafana dashboard ConfigMap..."
kubectl delete configmap grafana-disagg-dashboard -n monitoring --ignore-not-found=true
# Step 2: Revert DCGM custom metrics configuration
echo ""
echo "Step 2: Reverting DCGM custom metrics to default..."
kubectl delete configmap dcgm-exporter-metrics-config -n gpu-operator --ignore-not-found=true
echo "Adding required Helm repositories..."
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia 2>/dev/null || true
helm repo add nvidia-dynamo https://helm.ngc.nvidia.com/nvidia/ai-dynamo 2>/dev/null || true
helm repo update
echo "Reverting GPU Operator DCGM settings to default..."
helm upgrade gpu-operator nvidia/gpu-operator \
--namespace gpu-operator \
--reuse-values \
--set dcgmExporter.config.name=""
echo "Restarting DCGM exporter to apply default metrics..."
kubectl rollout restart daemonset nvidia-dcgm-exporter -n gpu-operator
kubectl rollout status daemonset nvidia-dcgm-exporter -n gpu-operator --timeout=180s
# Step 3: Revert Dynamo operator Prometheus endpoint
echo ""
echo "Step 3: Removing Prometheus endpoint from Dynamo operator..."
DYNAMO_VERSION=$(helm list -n dynamo -o json | jq -r '.[] | select(.name=="dynamo-platform") | .chart' | sed 's/dynamo-platform-//')
echo "Detected Dynamo Platform version: ${DYNAMO_VERSION}"
# Delete the conflicting secret (grove-operator will recreate it)
echo "Removing grove-webhook-server-cert to avoid conflict..."
kubectl delete secret grove-webhook-server-cert -n dynamo --ignore-not-found=true
helm upgrade dynamo-platform nvidia-dynamo/dynamo-platform \
--version "${DYNAMO_VERSION}" \
--namespace dynamo \
--reuse-values \
--set prometheusEndpoint=""
# Step 4: Uninstall kube-prometheus-stack
echo ""
echo "Step 4: Uninstalling kube-prometheus-stack..."
helm uninstall prometheus -n monitoring || echo "Prometheus stack not found, skipping..."
echo "Deleting kube-prometheus-stack CRDs (Helm doesn't remove these automatically)..."
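# Guard against an empty list: "kubectl delete" with no resources exits non-zero and would abort the script under "set -e"
MONITORING_CRDS=$(kubectl get crd -o name | grep monitoring.coreos.com || true)
if [[ -n "${MONITORING_CRDS}" ]]; then
  kubectl delete ${MONITORING_CRDS} --ignore-not-found=true
fi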
# Step 5: Delete monitoring namespace (optional)
echo ""
read -p "Delete monitoring namespace? This will remove all monitoring data (y/N): " -n 1 -r
echo
if [[ $REPLY =~ ^[Yy]$ ]]; then
  echo "Deleting monitoring namespace..."
  kubectl delete namespace monitoring --ignore-not-found=true
else
  echo "Keeping monitoring namespace"
fi
echo ""
echo "=========================================="
echo "✅ Cleanup Complete!"
echo "=========================================="
echo ""
echo "You can now run setup-monitoring.sh to reinstall with a clean slate"
echo ""
# Format
# If line starts with a '#' it is considered a comment
# DCGM FIELD, Prometheus metric type, help message
# Clocks
DCGM_FI_DEV_SM_CLOCK, gauge, SM clock frequency (in MHz).
DCGM_FI_DEV_MEM_CLOCK, gauge, Memory clock frequency (in MHz).
# Temperature
DCGM_FI_DEV_MEMORY_TEMP, gauge, Memory temperature (in C).
DCGM_FI_DEV_GPU_TEMP, gauge, GPU temperature (in C).
# Power
DCGM_FI_DEV_POWER_USAGE, gauge, Power draw (in W).
DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION, counter, Total energy consumption since boot (in mJ).
# PCIE
# DCGM_FI_PROF_PCIE_TX_BYTES, counter, Total number of bytes transmitted through PCIe TX via NVML.
# DCGM_FI_PROF_PCIE_RX_BYTES, counter, Total number of bytes received through PCIe RX via NVML.
DCGM_FI_DEV_PCIE_REPLAY_COUNTER, counter, Total number of PCIe retries.
# Utilization (the sample period varies depending on the product)
DCGM_FI_DEV_GPU_UTIL, gauge, GPU utilization (in %).
DCGM_FI_DEV_MEM_COPY_UTIL, gauge, Memory utilization (in %).
DCGM_FI_DEV_ENC_UTIL, gauge, Encoder utilization (in %).
DCGM_FI_DEV_DEC_UTIL, gauge, Decoder utilization (in %).
# Errors and violations
DCGM_FI_DEV_XID_ERRORS, gauge, Value of the last XID error encountered.
# DCGM_FI_DEV_POWER_VIOLATION, counter, Throttling duration due to power constraints (in us).
# DCGM_FI_DEV_THERMAL_VIOLATION, counter, Throttling duration due to thermal constraints (in us).
# DCGM_FI_DEV_SYNC_BOOST_VIOLATION, counter, Throttling duration due to sync-boost constraints (in us).
# DCGM_FI_DEV_BOARD_LIMIT_VIOLATION, counter, Throttling duration due to board limit constraints (in us).
# DCGM_FI_DEV_LOW_UTIL_VIOLATION, counter, Throttling duration due to low utilization (in us).
# DCGM_FI_DEV_RELIABILITY_VIOLATION, counter, Throttling duration due to reliability constraints (in us).
# Memory usage
DCGM_FI_DEV_FB_FREE, gauge, Framebuffer memory free (in MiB).
DCGM_FI_DEV_FB_USED, gauge, Framebuffer memory used (in MiB).
DCGM_FI_DEV_FB_RESERVED, gauge, Framebuffer memory reserved (in MiB).
# ECC
# DCGM_FI_DEV_ECC_SBE_VOL_TOTAL, counter, Total number of single-bit volatile ECC errors.
# DCGM_FI_DEV_ECC_DBE_VOL_TOTAL, counter, Total number of double-bit volatile ECC errors.
# DCGM_FI_DEV_ECC_SBE_AGG_TOTAL, counter, Total number of single-bit persistent ECC errors.
# DCGM_FI_DEV_ECC_DBE_AGG_TOTAL, counter, Total number of double-bit persistent ECC errors.
# Retired pages
# DCGM_FI_DEV_RETIRED_SBE, counter, Total number of retired pages due to single-bit errors.
# DCGM_FI_DEV_RETIRED_DBE, counter, Total number of retired pages due to double-bit errors.
# DCGM_FI_DEV_RETIRED_PENDING, counter, Total number of pages pending retirement.
# NVLink
# DCGM_FI_DEV_NVLINK_CRC_FLIT_ERROR_COUNT_TOTAL, counter, Total number of NVLink flow-control CRC errors.
# DCGM_FI_DEV_NVLINK_CRC_DATA_ERROR_COUNT_TOTAL, counter, Total number of NVLink data CRC errors.
# DCGM_FI_DEV_NVLINK_REPLAY_ERROR_COUNT_TOTAL, counter, Total number of NVLink retries.
# DCGM_FI_DEV_NVLINK_RECOVERY_ERROR_COUNT_TOTAL, counter, Total number of NVLink recovery errors.
DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL, counter, Total number of NVLink bandwidth counters for all lanes.
# DCGM_FI_DEV_NVLINK_BANDWIDTH_L0, counter, The number of bytes of active NVLink rx or tx data including both header and payload.
# NVLink Profiling Metrics - ADDED FOR NIXL TRANSFER MONITORING
DCGM_FI_PROF_NVLINK_TX_BYTES, gauge, Total number of bytes of active NVLink tx (transmit) data including both header and payload.
DCGM_FI_PROF_NVLINK_RX_BYTES, gauge, Total number of bytes of active NVLink rx (receive) data including both header and payload.
# VGPU License status
DCGM_FI_DEV_VGPU_LICENSE_STATUS, gauge, vGPU License status
# Remapped rows
DCGM_FI_DEV_UNCORRECTABLE_REMAPPED_ROWS, counter, Number of remapped rows for uncorrectable errors
DCGM_FI_DEV_CORRECTABLE_REMAPPED_ROWS, counter, Number of remapped rows for correctable errors
DCGM_FI_DEV_ROW_REMAP_FAILURE, gauge, Whether remapping of rows has failed
# Static configuration information. These appear as labels on the other metrics
DCGM_FI_DRIVER_VERSION, label, Driver Version
# DCGM_FI_NVML_VERSION, label, NVML Version
# DCGM_FI_DEV_BRAND, label, Device Brand
# DCGM_FI_DEV_SERIAL, label, Device Serial Number
# DCGM_FI_DEV_OEM_INFOROM_VER, label, OEM inforom version
# DCGM_FI_DEV_ECC_INFOROM_VER, label, ECC inforom version
# DCGM_FI_DEV_POWER_INFOROM_VER, label, Power management object inforom version
# DCGM_FI_DEV_INFOROM_IMAGE_VER, label, Inforom image version
# DCGM_FI_DEV_VBIOS_VERSION, label, VBIOS version of the device
# DCP metrics
DCGM_FI_PROF_GR_ENGINE_ACTIVE, gauge, Ratio of time the graphics engine is active.
# DCGM_FI_PROF_SM_ACTIVE, gauge, The ratio of cycles an SM has at least 1 warp assigned.
# DCGM_FI_PROF_SM_OCCUPANCY, gauge, The ratio of number of warps resident on an SM.
DCGM_FI_PROF_PIPE_TENSOR_ACTIVE, gauge, Ratio of cycles the tensor (HMMA) pipe is active.
DCGM_FI_PROF_DRAM_ACTIVE, gauge, Ratio of cycles the device memory interface is active sending or receiving data.
# DCGM_FI_PROF_PIPE_FP64_ACTIVE, gauge, Ratio of cycles the fp64 pipes are active.
# DCGM_FI_PROF_PIPE_FP32_ACTIVE, gauge, Ratio of cycles the fp32 pipes are active.
# DCGM_FI_PROF_PIPE_FP16_ACTIVE, gauge, Ratio of cycles the fp16 pipes are active.
DCGM_FI_PROF_PCIE_TX_BYTES, gauge, The rate of data transmitted over the PCIe bus - including both protocol headers and data payloads - in bytes per second.
DCGM_FI_PROF_PCIE_RX_BYTES, gauge, The rate of data received over the PCIe bus - including both protocol headers and data payloads - in bytes per second.
#!/bin/bash
# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
# Script to install Prometheus and Grafana monitoring stack for Dynamo
# Following the official Dynamo Kubernetes observability guide:
# https://github.com/ai-dynamo/dynamo/blob/main/docs/kubernetes/observability/metrics.md
set -e
echo "=========================================="
echo "Installing Prometheus & Grafana for Dynamo"
echo "=========================================="
# Step 1: Add Helm repositories
echo ""
echo "Step 1: Adding required Helm repositories..."
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia 2>/dev/null || true
helm repo add nvidia-dynamo https://helm.ngc.nvidia.com/nvidia/ai-dynamo 2>/dev/null || true
# Step 2: Update Helm repositories
echo ""
echo "Step 2: Updating Helm repositories..."
helm repo update
# Step 3: Install kube-prometheus-stack
echo ""
echo "Step 3: Installing kube-prometheus-stack..."
echo "This includes: Prometheus Operator, Prometheus, Grafana, Alertmanager"
helm upgrade --install prometheus -n monitoring --create-namespace prometheus-community/kube-prometheus-stack \
--set prometheus.prometheusSpec.podMonitorSelectorNilUsesHelmValues=false \
--set-json 'prometheus.prometheusSpec.podMonitorNamespaceSelector={}' \
--set-json 'prometheus.prometheusSpec.probeNamespaceSelector={}'
# Step 4: Wait for pods to be ready
echo ""
echo "Step 4: Waiting for monitoring stack pods to be ready..."
echo "This may take 1-2 minutes..."
kubectl wait --for=condition=ready pod -l "release=prometheus" -n monitoring --timeout=180s
# Step 5: Verify installation
echo ""
echo "Step 5: Verifying installation..."
kubectl get pods -n monitoring
# Step 6: Update Dynamo operator with Prometheus endpoint
echo ""
echo "Step 6: Updating Dynamo operator with Prometheus endpoint..."
# Detect currently installed version to avoid accidental upgrades
DYNAMO_VERSION=$(helm list -n dynamo -o json | jq -r '.[] | select(.name=="dynamo-platform") | .chart' | sed 's/dynamo-platform-//')
echo "Detected Dynamo Platform version: ${DYNAMO_VERSION}"
# Delete the conflicting secret (grove-operator will recreate it)
echo "Removing grove-webhook-server-cert to avoid conflict..."
kubectl delete secret grove-webhook-server-cert -n dynamo --ignore-not-found=true
# Perform the upgrade
echo "Running Helm upgrade..."
helm upgrade dynamo-platform nvidia-dynamo/dynamo-platform \
--version "${DYNAMO_VERSION}" \
--namespace dynamo \
--reuse-values \
--set prometheusEndpoint=http://prometheus-kube-prometheus-prometheus.monitoring.svc.cluster.local:9090
# Step 7: Configure DCGM custom metrics for NVLink profiling
echo ""
echo "Step 7: Configuring DCGM custom metrics for NVLink profiling..."
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
kubectl create configmap dcgm-exporter-metrics-config \
--from-file=dcgm-metrics.csv="$SCRIPT_DIR/dcgm-metrics-with-nvlink.csv" \
--namespace=gpu-operator \
--dry-run=client -o yaml | kubectl apply -f -
echo "Updating GPU Operator to use custom DCGM metrics..."
helm upgrade gpu-operator nvidia/gpu-operator \
--namespace gpu-operator \
--reuse-values \
--set dcgmExporter.config.name=dcgm-exporter-metrics-config
echo "Restarting DCGM exporter to apply new metrics configuration..."
kubectl rollout restart daemonset nvidia-dcgm-exporter -n gpu-operator
kubectl rollout status daemonset nvidia-dcgm-exporter -n gpu-operator --timeout=60s
echo "✅ DCGM custom metrics configured with NVLink profiling support"
# Step 8: Verify DCGM ServiceMonitor exists (created by GPU operator during cluster setup)
echo ""
echo "Step 8: Verifying DCGM exporter ServiceMonitor..."
if kubectl get servicemonitor -n gpu-operator nvidia-dcgm-exporter &>/dev/null; then
  echo "✅ DCGM ServiceMonitor found - GPU metrics will be available in Prometheus/Grafana"
else
  echo "⚠️ DCGM ServiceMonitor not found."
  echo "The GPU operator should have been installed with serviceMonitor enabled in createEks.sh"
  echo "Please verify the cluster was created with the updated createEks.sh that includes:"
  echo " --set dcgmExporter.serviceMonitor.enabled=true"
  echo " --set dcgmExporter.serviceMonitor.additionalLabels.release=prometheus"
fi
# Step 9: Deploy Grafana dashboard ConfigMap
echo ""
echo "Step 9: Deploying Grafana disaggregated dashboard ConfigMap..."
kubectl apply -f "$SCRIPT_DIR/grafana-disagg-dashboard-configmap.yaml"
echo "✅ Dashboard ConfigMap deployed - Grafana sidecar will auto-import it within a few seconds"
# Step 10: Get Grafana credentials
echo ""
echo "Step 10: Retrieving Grafana credentials..."
GRAFANA_USER=$(kubectl get secret -n monitoring prometheus-grafana -o jsonpath="{.data.admin-user}" | base64 --decode)
GRAFANA_PASSWORD=$(kubectl get secret -n monitoring prometheus-grafana -o jsonpath="{.data.admin-password}" | base64 --decode)
echo ""
echo "=========================================="
echo "✅ Installation Complete!"
echo "=========================================="
echo ""
echo "Grafana Access:"
echo " Username: $GRAFANA_USER"
echo " Password: $GRAFANA_PASSWORD"
echo ""
echo "To access Grafana:"
echo " kubectl port-forward svc/prometheus-grafana 3000:80 -n monitoring"
echo " Then visit: http://localhost:3000"
echo ""
echo "To access Prometheus:"
echo " kubectl port-forward svc/prometheus-kube-prometheus-prometheus 9090:9090 -n monitoring"
echo " Then visit: http://localhost:9090"
echo ""
echo "Next Steps:"
echo " 1. Deploy or redeploy your DynamoGraphDeployment with DYN_SYSTEM_ENABLED=true"
echo " 2. The Dynamo operator will automatically create PodMonitors for metrics collection"
echo " 3. View metrics in Grafana under Dashboards → General → Dynamo Disaggregated Analysis"
echo ""