# Grafana Dashboard Metrics Documentation

This document explains where each panel in the `disagg-dashboard.json` gets its data and how it's displayed.

## Dashboard Organization

The dashboard is organized in **logical request flow order** (21 panels across 6 rows):

**Row 1: Frontend Health** (User-facing metrics - y=0)
- Frontend Requests/Sec (x=0), Avg TTFT (x=8), Avg Request Duration (x=16)

**Row 2: Frontend Details** (y=8)
- Avg Inter-Token Latency (x=0), Avg ISL/OSL (x=8), **Queued Requests** ⭐ (x=16)

**Row 3: Prefill Workers** (The typical bottleneck! - y=16)
- Prefill Worker Processing Time ⭐ (x=0), Prefill Worker Throughput (x=8), Component Latency Comparison (x=16)

**Row 4: Decode Workers** (y=24)
- Request Throughput (x=0), Avg Request Duration (x=8), KV Cache Utilization (%) (x=16)

**Row 5: KV Cache + GPU** (y=32)
- KV Cache Blocks (Total) (x=0), GPU Compute Utilization (x=8), GPU Memory Used (x=16)

**Row 6: NIXL Transfer Metrics** (y=40)
- GPU Memory Bandwidth (x=0), NVLink Bandwidth (GB/s) (x=8), Worker CPU Usage (x=16)

**Row 7: Node + Worker** (y=48)
- Node CPU Utilization (x=0), Worker Request Throughput (x=8), Worker Data Transfer (x=16)

⭐ = Key metrics for diagnosing TTFT bottlenecks

## Metric Sources

### Frontend Metrics (from Frontend Pod)
These metrics come from the `dynamo_frontend_*` namespace and are collected from the frontend deployment pod.

| Panel | Metric | Formula | Description |
|-------|--------|---------|-------------|
| **Frontend Requests / Sec** | `dynamo_frontend_requests_total` | `rate(...[30s])` | Rate of requests per second hitting the frontend, broken down by request_type and status |
| **Frontend Avg Time to First Token** ⭐ | `dynamo_frontend_time_to_first_token_seconds_{sum,count}` | `1000 * (rate(sum[5m]) / rate(count[5m]))` | Average time (in ms) from request arrival to first token over the last 5 minutes. Includes queue wait, prefill compute, and NIXL transfer time. **The primary performance metric** |
| **Frontend Avg Request Duration** | `dynamo_frontend_request_duration_seconds_{sum,count}` | `1000 * (rate(sum[5m]) / rate(count[5m]))` | Total end-to-end request duration in milliseconds over the last 5 minutes |
| **Frontend Avg Inter-Token Latency** | `dynamo_frontend_inter_token_latency_seconds_{sum,count}` | `1000 * (rate(sum[5m]) / rate(count[5m]))` | Average time (in ms) between token generations during decode phase over the last 5 minutes |
| **Frontend Avg Input/Output Sequence Length** | `dynamo_frontend_input_sequence_tokens_{sum,count}` & `dynamo_frontend_output_sequence_tokens_{sum,count}` | `rate(sum[5m]) / rate(count[5m])` for each | Average input prompt length (ISL) and output generation length (OSL) in tokens over the last 5 minutes |
| **Frontend Queued Requests** ⭐⭐⭐ | `dynamo_frontend_queued_requests` | Raw value | Number of requests waiting in queue. **THE key metric for diagnosing worker saturation.** High values (>10) indicate workers cannot keep up with load. Yellow threshold at 10, red at 50 |

### GPU Metrics (from DCGM Exporter)
These metrics come from the DCGM (Data Center GPU Manager) exporter running as a DaemonSet in the `gpu-operator` namespace. DCGM collects hardware-level GPU metrics.

| Panel | Metric | Formula | Description |
|-------|--------|---------|-------------|
| **GPU Compute Utilization** | `DCGM_FI_DEV_GPU_UTIL` | Raw value | GPU compute utilization percentage (0-100) for each GPU. Prefill workers show high utilization during prefill phase |
| **GPU Memory Bandwidth** | `DCGM_FI_DEV_MEM_COPY_UTIL` | Raw value | GPU memory copy bandwidth utilization percentage (0-100). **Spikes indicate KV cache transfers over NIXL**. On single-node deployments, NIXL uses CUDA IPC (GPU→Host→Host→GPU) not direct GPU-to-GPU. Yellow threshold at 60%, red at 80% |
| **NVLink Bandwidth (GB/s)** | `DCGM_FI_PROF_NVLINK_TX_BYTES` & `DCGM_FI_PROF_NVLINK_RX_BYTES` | `(rate(TX_BYTES[1m]) + rate(RX_BYTES[1m])) / 1e9` | NVLink transfer bandwidth in GB/s (rate of change) per GPU, measured from DCGM profiling metrics. Shows total bidirectional bandwidth (TX + RX). This includes intra-pod TP communication (TP=2 for prefill, TP=4 for decode). Low bandwidth (<1 GB/s) indicates inter-pod NIXL KV cache transfers may be using host memory copies instead of direct NVLink/GPUDirect. Yellow threshold at 5 GB/s, red at 10 GB/s |
| **GPU Memory Used** | `DCGM_FI_DEV_FB_USED` | `value / 1024` | GPU framebuffer memory used in GB. Prefill workers allocate KV blocks on decode workers via NIXL |

### Prefill Worker Metrics (from Prefill Worker Pods)
These metrics come from the prefill worker pods' system endpoints (port 9090). They track request processing for prefill operations.

| Panel | Metric | Formula | Description |
|-------|--------|---------|-------------|
| **Prefill Worker Processing Time** | `dynamo_component_request_duration_seconds_{sum,count,bucket}{dynamo_component="prefill",dynamo_endpoint="generate"}` | `1000 * rate(sum[5m]) / rate(count[5m])` for avg, `histogram_quantile(0.99, ...)` for P99 | Average and P99 time (in ms) spent processing prefill requests. **Includes prefill computation AND KV cache transfer over NIXL** |
| **Prefill Worker Throughput** | `dynamo_component_requests_total{dynamo_component="prefill",dynamo_endpoint="generate"}` | `rate(...[5m])` | Rate of prefill requests being processed in requests/second |

### Decode Worker Metrics (from Decode Worker Pods)
These metrics come from the decode worker pods' system endpoints (port 9090). In disaggregated mode, decode workers receive KV cache from prefill workers and perform token generation.

| Panel | Metric | Formula | Description |
|-------|--------|---------|-------------|
| **Component Latency - Prefill vs Decode** | `dynamo_component_request_duration_seconds_{sum,count}{dynamo_component="prefill",dynamo_endpoint="generate"}` & `{dynamo_component="backend",dynamo_endpoint="generate"}` | `rate(sum[5m]) / rate(count[5m])` | Average request duration for prefill workers (includes NIXL transfer) vs decode workers (entire decode session for all output tokens) over the last 5 minutes. **Note**: Decode worker latency measures the FULL decode session duration, not just time to first token. Only shows `generate` endpoint (filters out `clear_kv_blocks` maintenance operations) |
| **Decode Worker - Request Throughput** | `dynamo_component_requests_total{dynamo_component="backend"}` | `rate(...[5m])` | Rate of requests processed by decode workers in requests/second |
| **Decode Worker - Avg Request Duration** | `dynamo_component_request_duration_seconds_{sum,count}{dynamo_component="backend"}` | `rate(sum[5m]) / rate(count[5m])` | Average time decode workers spend processing requests (decode phase only) over the last 5 minutes |
| **KV Cache Utilization** | `dynamo_component_gpu_cache_usage_percent` | Raw value (0-100%) | GPU memory utilization for KV cache storage of active requests. High values (>90%) indicate workers are at capacity and requests are queueing. **Note**: Only available for decode workers - prefill workers in disaggregated mode don't expose this metric. Monitor Prefill Worker Processing Time instead for prefill capacity |
| **KV Cache Blocks (Total)** | `dynamo_component_total_blocks` | Raw value | Total number of KV cache blocks available on decode workers. **Note**: Only for decode workers |

### CPU Metrics (from cAdvisor and Node Exporter)
These metrics come from Kubernetes cAdvisor (container metrics) and Node Exporter (node-level metrics). CPU bottlenecks can impact prefill/decode performance.

| Panel | Metric | Formula | Description |
|-------|--------|---------|-------------|
| **Worker CPU Usage** | `container_cpu_usage_seconds_total{namespace="robert",pod=~".*worker.*",container="main"}` | `rate(...[5m])` | CPU cores used by worker pods. Value shows actual CPU consumption (e.g., 2.5 = 2.5 cores). Yellow at 30 cores, red at 50 cores |
| **Node CPU Utilization** | `node_cpu_seconds_total{mode="idle"}` | `100 - (avg(rate(idle)) * 100)` | Overall node CPU utilization percentage. Shows aggregate CPU usage across all cores |

### Worker Metrics (from Worker Pods)
These metrics track request processing across all worker pods (prefill and decode).

| Panel | Metric | Formula | Description |
|-------|--------|---------|-------------|
| **Worker Request Throughput** | `dynamo_component_requests_total{dynamo_endpoint="generate"}` | `rate(...[5m])` | Requests per second processed by each worker, broken down by component type (prefill, backend). Shows overall system throughput |
| **Worker Data Transfer** | `dynamo_component_request_bytes_total` & `dynamo_component_response_bytes_total` | `rate(...[5m])` | Bytes per second transferred in requests (IN) and responses (OUT). Shows data throughput across worker pods |

## Metric Label Filters

### Component Name Filtering
- **Prefill workers**: `dynamo_component="prefill"`
- **Decode workers**: `dynamo_component="backend"`
- **All workers**: Filter by `dynamo_endpoint="generate"` to exclude maintenance operations like `clear_kv_blocks`

### Important Labels
- `pod`: Specific pod name (e.g., `llama3-70b-disagg-sn-0-vllmprefillworker-hrnt5`)
- `namespace`: Kubernetes namespace (e.g., `robert`)
- `dynamo_component`: Component type (`prefill`, `backend`, `frontend`)
- `dynamo_endpoint`: Endpoint name (`generate`, `clear_kv_blocks`)
- `gpu`: GPU index (0-7 for DCGM metrics)
- `Hostname`: Node hostname (for DCGM metrics)

## Metric Collection Architecture

```text
┌─────────────────┐
│  Frontend Pod   │ ──► dynamo_frontend_* metrics (HTTP port)
└─────────────────┘

┌─────────────────┐
│ Prefill Worker  │ ──► dynamo_component_* metrics (system port 9090)
│     Pods        │     ├─ dynamo_component_request_* (request stats)
└─────────────────┘     └─ dynamo_component_*_bytes_total (data transfer)
                        └─ container_cpu_* metrics (cAdvisor)

┌─────────────────┐
│ Decode Worker   │ ──► dynamo_component_* metrics (system port 9090)
│     Pods        │     └─ dynamo_component_request_* (component stats)
└─────────────────┘     └─ container_cpu_* metrics (cAdvisor)

┌─────────────────┐
│ DCGM Exporter   │ ──► DCGM_FI_DEV_* metrics (GPU metrics port)
│   DaemonSet     │     ├─ GPU compute utilization
│ (gpu-operator)  │     ├─ GPU memory bandwidth (NIXL indicator)
└─────────────────┘     ├─ GPU memory usage
                        └─ GPU temperature

┌─────────────────┐
│ Node Exporter   │ ──► node_* metrics (node metrics port)
│   DaemonSet     │     ├─ node_cpu_seconds_total (CPU by mode)
│   (monitoring)  │     ├─ node_load1/5/15 (load average)
└─────────────────┘     └─ node_memory_* (memory stats)

┌─────────────────┐
│    cAdvisor     │ ──► container_* metrics (built into kubelet)
│   (kubelet)     │     ├─ container_cpu_usage_seconds_total
└─────────────────┘     ├─ container_cpu_cfs_throttled_periods_total
                        └─ container_memory_*

           ▼
   ┌─────────────────┐
   │  Prometheus     │ ◄─── ServiceMonitor for DCGM & Node Exporter
   │  (monitoring)   │ ◄─── PodMonitor for Dynamo workers
   └─────────────────┘ ◄─── Scrapes cAdvisor from kubelet
           ▼
   ┌─────────────────┐
   │    Grafana      │
   │  (monitoring)   │
   └─────────────────┘
```

## PodMonitor Configuration

The Dynamo operator automatically creates PodMonitors for metrics-enabled deployments:
- **Label**: `nvidia.com/metrics-enabled: "true"` on all worker pods
- **Endpoint**: System port (9090) with path `/metrics`
- **Namespace**: Pods in any namespace are discovered (via `podMonitorNamespaceSelector={}`)

To opt-out a deployment:
```yaml
apiVersion: nvidia.com/v1
kind: DynamoGraphDeployment
metadata:
  annotations:
    nvidia.com/enable-metrics: "false"
```

## DCGM ServiceMonitor Configuration

The DCGM ServiceMonitor must be manually created (see `dcgm-servicemonitor.yaml`):
- **Namespace**: `gpu-operator` (where DCGM exporter runs)
- **Label**: `release: prometheus` (required for Prometheus discovery)
- **Selector**: `app: nvidia-dcgm-exporter`
- **Endpoint**: `gpu-metrics` port with path `/metrics`

## Troubleshooting

### No metrics showing up:
1. Check PodMonitor exists: `kubectl get podmonitor -A`
2. Check pods have metrics label: `kubectl get pods -n <namespace> -l nvidia.com/metrics-enabled=true`
3. Check Prometheus targets: Visit Prometheus UI → Status → Targets

### DCGM metrics missing:
1. Check DCGM exporter running: `kubectl get daemonset -A | grep dcgm-exporter`
2. Check ServiceMonitor exists: `kubectl get servicemonitor -n gpu-operator`
3. Verify `release: prometheus` label on ServiceMonitor

### Prefill queue metrics showing zero:
- These metrics only populate when **remote prefill** requests are processed
- In local-only mode, decode workers handle prefill themselves (no queue)
- Check deployment mode and request routing configuration

### KV Cache metrics only showing decode workers:
**Important Limitation**: In disaggregated mode, prefill workers (`--disaggregation-mode prefill`) do NOT expose `dynamo_component_total_blocks` or `dynamo_component_gpu_cache_usage_percent` metrics. Only decode workers expose these.

**Why this happens:**
- Prefill workers transfer KV cache to decode workers via NIXL
- They don't maintain long-term KV cache state
- Only decode workers track KV cache utilization metrics

**How to diagnose prefill worker capacity bottlenecks:**
1. **Check worker logs** for KV cache size at startup:
   ```bash
   kubectl logs -n <namespace> <prefill-worker-pod> | grep "GPU KV cache size"
   # Example output: GPU KV cache size: 254,336 tokens
   ```

2. **Calculate maximum concurrent requests**:
   ```text
   Max Concurrency = KV Cache Size ÷ Tokens Per Request
   # For ISL=8192: 254,336 ÷ 131,072 = 1.94 requests per prefill worker
   ```

3. **Monitor indirect indicators in dashboard**:
   - **Prefill Worker Processing Time**: High avg (>5s) or P99 (>10s) indicates saturation
   - **Frontend Avg TTFT**: If much higher than Prefill Processing Time, indicates queueing
   - **Gap = TTFT - Prefill Processing Time** = Queue wait time

4. **Performance signature of prefill KV cache bottleneck**:
   - Min TTFT is low (1-3s) - proves system CAN be fast
   - Avg/Max TTFT is very high (>30s) - proves requests are queueing
   - Large variance (Max ÷ Min > 20×) - signature of queueing behavior
   - This variance pattern is impossible if the issue was compute-bound - it's the mathematical signature of KV cache capacity bottleneck

### High CPU usage:
1. **Worker CPU Usage showing high values (>30 cores)**:
   - Check if workers have sufficient CPU limits configured
   - May indicate CPU-bound operations (tokenization, scheduling)
   - Compare against GPU utilization - CPU should not be the bottleneck

2. **What's normal?**:
   - vLLM workers use CPU for:
     - Request scheduling and batching
     - Tokenization (input/output processing)
     - KV cache management
     - TCP/gRPC communication (request plane)
   - Expect moderate CPU usage (5-20 cores per worker)
   - GPU compute should dominate, not CPU

## Dashboard Variables

The dashboard uses two template variables for flexibility:

### Datasource Variable
- **Variable**: `${datasource}`
- **Type**: `datasource`
- **Query**: `prometheus`
- **Auto-selects**: Default Prometheus instance
- **Purpose**: Allows the dashboard to automatically connect to your Prometheus instance without hardcoding UIDs

### Namespace Variable
- **Variable**: `${namespace}`
- **Type**: `query`
- **Query**: `label_values(dynamo_frontend_requests_total, namespace)`
- **Purpose**: Allows filtering metrics by Kubernetes namespace (e.g., "robert", "default")
- **Auto-populated**: Dynamically discovers namespaces from frontend pods

**Usage**: All dashboard queries filter by `namespace="$namespace"` to show metrics for the selected deployment. You can switch between different Dynamo deployments in different namespaces using the namespace dropdown at the top of the dashboard.

---

## Intel XPU-SMI Metrics (from XPU-SMI Exporter)

These metrics come from the Intel XPU-SMI Prometheus exporter (`xpu-smi-exporter` job). XPU-SMI collects hardware-level Intel GPU metrics equivalent to NVIDIA's DCGM.

### Setup

Launch the XPU-SMI Prometheus exporter on the host, then start the observability stack with the XPU overlay:
```bash
# Install Intel XPU-SMI (xpumanager): https://github.com/intel/xpumanager
# Start the exporter (serves Prometheus metrics on port 9966)
python deploy/observability/xpu_smi_exporter.py --port 9966 &

# Start base services
docker compose -f deploy/docker-compose.yml up -d

# Start observability with XPU overlay (uses prometheus-xpu.yml + xpu-alert-rules.yml)
docker compose -f deploy/docker-observability.yml -f deploy/docker-observability-xpu.yml up -d
```

### XPU-SMI Dashboard Panels (`xpu-smi-metrics.json`)

| Panel | Metric | Formula | Description |
|-------|--------|---------|-------------|
| **XPU Compute Utilization** | `xpu_engine_group_compute_engine_util` | Raw value (0-100%) | Compute engine utilization per XPU device. Equivalent to `DCGM_FI_DEV_GPU_UTIL` |
| **XPU Memory Usage** | `xpu_memory_used_bytes`, `xpu_memory_free_bytes` | Raw values | HBM/VRAM used and free bytes per XPU device. Equivalent to `DCGM_FI_DEV_FB_USED/FB_FREE` |
| **XPU Temperature** | `xpu_temperature_celsius` | Raw value (°C) | GPU die and memory temperature. Labels: `location="gpu"` or `location="memory"`. Thresholds: yellow@70°C, red@85°C |
| **XPU Power Usage** | `xpu_power_watts` | Raw value (W) | Instantaneous power draw per XPU device. Equivalent to `DCGM_FI_DEV_POWER_USAGE` |
| **XPU Engine Utilization** | `xpu_engine_group_compute_engine_util`, `xpu_engine_group_copy_engine_util`, `xpu_engine_group_render_engine_util` | Raw values (0-100%) | Per-engine-group utilization breakdown |
| **XPU Memory Bandwidth** | `xpu_memory_read_bytes_per_second`, `xpu_memory_write_bytes_per_second` | Raw value (bytes/sec) | HBM read/write bandwidth in bytes/sec |
| **XPU PCIe Bandwidth** | `xpu_pcie_read_bytes_per_second`, `xpu_pcie_write_bytes_per_second` | Raw value (bytes/sec) | PCIe read/write bandwidth. Equivalent to `DCGM_FI_PROF_PCIE_RX/TX_BYTES` |
| **Avg XPU Utilization** | `xpu_engine_group_compute_engine_util` | `avg(...)` | Average utilization gauge across all XPU devices |
| **Max XPU Temperature** | `xpu_temperature_celsius{location="gpu"}` | `max(...)` | Maximum temperature gauge across all XPU devices |

### XPU vs NVIDIA DCGM Metric Mapping

| NVIDIA DCGM Metric | Intel XPU-SMI Metric | Description |
|---|---|---|
| `DCGM_FI_DEV_GPU_UTIL` | `xpu_engine_group_compute_engine_util` | Compute utilization % |
| `DCGM_FI_DEV_FB_USED` | `xpu_memory_used_bytes` | Memory used |
| `DCGM_FI_DEV_FB_FREE` | `xpu_memory_free_bytes` | Memory free |
| `DCGM_FI_DEV_GPU_TEMP` | `xpu_temperature_celsius{location="gpu"}` | GPU temperature |
| `DCGM_FI_DEV_MEMORY_TEMP` | `xpu_temperature_celsius{location="memory"}` | Memory temperature |
| `DCGM_FI_DEV_POWER_USAGE` | `xpu_power_watts` | Power draw (W) |
| `DCGM_FI_PROF_PCIE_RX_BYTES` | `xpu_pcie_read_bytes_per_second` | PCIe RX bytes/sec |
| `DCGM_FI_PROF_PCIE_TX_BYTES` | `xpu_pcie_write_bytes_per_second` | PCIe TX bytes/sec |

### Metric Architecture (XPU)

```text
┌─────────────────┐
│  Intel XPU-SMI  │ ──► xpu_* metrics (Prometheus port :9966)
│    Exporter     │     ├─ xpu_engine_group_compute_engine_util
│  (host process) │     ├─ xpu_memory_used_bytes / free_bytes
└─────────────────┘     ├─ xpu_temperature_celsius
                        ├─ xpu_power_watts
                        └─ xpu_pcie_*_bytes_total

           ▼
   ┌─────────────────┐
   │  Prometheus     │ ◄─── scrape job: xpu-smi-exporter (port 9966)
   │  (monitoring)   │
   └─────────────────┘
           ▼
   ┌─────────────────┐
   │    Grafana      │ ◄─── xpu-smi-metrics.json dashboard
   │  (monitoring)   │
   └─────────────────┘
```

### XPU Alert Rules (`xpu-alert-rules.yml`)

| Alert | Condition | Severity | Description |
|-------|-----------|----------|-------------|
| `XPUHighTemperature` | temp > 85°C for 2m | warning | XPU GPU die overheating |
| `XPUCriticalTemperature` | temp > 95°C for 30s | critical | Immediate risk of thermal throttle/shutdown |
| `XPUMemoryAlmostFull` | mem > 90% for 1m | warning | KV cache allocation may fail |
| `XPUMemoryCritical` | mem > 98% for 30s | critical | OOM imminent |
| `XPUHighPowerDraw` | power > 400W for 5m | warning | Sustained high power draw |
| `XPUExporterDown` | `up{job="xpu-smi-exporter"} == 0` for 1m | critical | Monitoring blind spot |
| `XPULowComputeUtilizationDuringLoad` | util < 10% during active traffic (`rate()` > 0) | warning | Possible dispatch issue |
| `XPUWorkerLivenessLost` | no XPU metrics + active traffic (`rate()` > 0) | critical | XPU worker crash suspected |

### Troubleshooting XPU Metrics

#### XPU metrics not showing in Prometheus:
1. Verify XPU-SMI exporter is running: `curl http://localhost:9966/metrics | grep xpu_`
2. Check Intel GPU is visible: `xpu-smi discovery`
3. Verify scrape job in Prometheus UI: Status → Targets → `xpu-smi-exporter`
4. Check firewall: `sudo ufw allow 9966/tcp`

#### XPU device not detected:
```bash
# Check device visibility in container
ls /dev/dri/
# Should show renderD128, card0, etc.

# Verify XPU-SMI can see the device
xpu-smi discovery
# Expected: lists Intel GPU devices with model name, driver version
```