"vscode:/vscode.git/clone" did not exist on "37b9a1362e348a7e1f9db01c25da02c92f5b7cfc"
Commit 4218bbae authored by WenjiaoYue, committed by GitHub

feat: Add Intel XPU (Intel GPU) monitoring support to observability stack (#7511)


Signed-off-by: Wenjiao Yue <wenjiao.yue@intel.com>
parent 8f9c9998
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
# Docker Compose override for Intel XPU observability.
# Replaces NVIDIA DCGM with Intel XPU-SMI monitoring.
#
# Usage (XPU environment):
# # 1. Start the XPU exporter on the host (requires xpu-smi installed):
# python3 deploy/observability/xpu_smi_exporter.py --port 9966 &
#
# # 2. Start base services:
# docker compose -f deploy/docker-compose.yml up -d
#
# # 3. Start observability with XPU overlay:
# docker compose -f deploy/docker-observability.yml -f deploy/docker-observability-xpu.yml up -d
services:
  # Override Prometheus to use XPU-specific config and alert rules
  prometheus:
    volumes:
      - ./observability/prometheus-xpu.yml:/etc/prometheus/prometheus.yml
      - ./observability/xpu-alert-rules.yml:/etc/prometheus/xpu-alert-rules.yml
@@ -22,8 +22,12 @@ volumes:
services:
# DCGM stands for Data Center GPU Manager: https://developer.nvidia.com/dcgm
# dcgm-exporter is a tool from NVIDIA that exposes DCGM metrics in Prometheus format.
# Requires NVIDIA GPU and runtime. Enable with:
# docker compose --profile nvidia -f deploy/docker-observability.yml up -d
dcgm-exporter:
image: nvidia/dcgm-exporter:4.2.3-4.1.3-ubi9
profiles:
- nvidia
ports:
# Expose dcgm-exporter on port 9401 both inside and outside the container
# to avoid conflicts with other dcgm-exporter instances in distributed environments.
@@ -81,7 +85,6 @@ services:
extra_hosts:
- "host.docker.internal:host-gateway"
depends_on:
- dcgm-exporter
- nats-prometheus-exporter
# Loki - Log aggregation backend
@@ -165,5 +168,4 @@ services:
depends_on:
- prometheus
- tempo
- loki
- loki
\ No newline at end of file
@@ -18,6 +18,7 @@ apiVersion: 1
datasources:
- name: prometheus
type: prometheus
uid: prometheus
access: proxy
url: http://prometheus:9090
isDefault: true
@@ -259,3 +259,106 @@ The dashboard uses two template variables for flexibility:
- **Auto-populated**: Dynamically discovers namespaces from frontend pods
**Usage**: All dashboard queries filter by `namespace="$namespace"` to show metrics for the selected deployment. You can switch between different Dynamo deployments in different namespaces using the namespace dropdown at the top of the dashboard.
---
## Intel XPU-SMI Metrics (from XPU-SMI Exporter)
These metrics come from the Intel XPU-SMI Prometheus exporter (`xpu-smi-exporter` job). XPU-SMI collects hardware-level Intel GPU metrics and fills the same role in this stack as NVIDIA's DCGM.
### Setup
Launch the XPU-SMI Prometheus exporter on the host, then start the observability stack with the XPU overlay:
```bash
# Install Intel XPU-SMI (xpumanager): https://github.com/intel/xpumanager
# Start the exporter (serves Prometheus metrics on port 9966)
python3 deploy/observability/xpu_smi_exporter.py --port 9966 &
# Start base services
docker compose -f deploy/docker-compose.yml up -d
# Start observability with XPU overlay (uses prometheus-xpu.yml + xpu-alert-rules.yml)
docker compose -f deploy/docker-observability.yml -f deploy/docker-observability-xpu.yml up -d
```
### XPU-SMI Dashboard Panels (`xpu-smi-metrics.json`)
| Panel | Metric | Formula | Description |
|-------|--------|---------|-------------|
| **XPU Compute Utilization** | `xpu_engine_group_compute_engine_util` | Raw value (0-100%) | Compute engine utilization per XPU device. Equivalent to `DCGM_FI_DEV_GPU_UTIL` |
| **XPU Memory Usage** | `xpu_memory_used_bytes`, `xpu_memory_free_bytes` | Raw values | HBM/VRAM used and free bytes per XPU device. Equivalent to `DCGM_FI_DEV_FB_USED/FB_FREE` |
| **XPU Temperature** | `xpu_temperature_celsius` | Raw value (°C) | GPU die and memory temperature. Labels: `location="gpu"` or `location="memory"`. Thresholds: yellow@70°C, red@85°C |
| **XPU Power Usage** | `xpu_power_watts` | Raw value (W) | Instantaneous power draw per XPU device. Equivalent to `DCGM_FI_DEV_POWER_USAGE` |
| **XPU Engine Utilization** | `xpu_engine_group_compute_engine_util`, `xpu_engine_group_copy_engine_util`, `xpu_engine_group_render_engine_util` | Raw values (0-100%) | Per-engine-group utilization breakdown |
| **XPU Memory Bandwidth** | `xpu_memory_read_bytes_per_second`, `xpu_memory_write_bytes_per_second` | Raw value (bytes/sec) | HBM read/write bandwidth in bytes/sec |
| **XPU PCIe Bandwidth** | `xpu_pcie_read_bytes_per_second`, `xpu_pcie_write_bytes_per_second` | Raw value (bytes/sec) | PCIe read/write bandwidth. Equivalent to `DCGM_FI_PROF_PCIE_RX/TX_BYTES` |
| **Avg XPU Utilization** | `xpu_engine_group_compute_engine_util` | `avg(...)` | Average utilization gauge across all XPU devices |
| **Max XPU Temperature** | `xpu_temperature_celsius{location="gpu"}` | `max(...)` | Maximum temperature gauge across all XPU devices |
### XPU vs NVIDIA DCGM Metric Mapping
| NVIDIA DCGM Metric | Intel XPU-SMI Metric | Description |
|---|---|---|
| `DCGM_FI_DEV_GPU_UTIL` | `xpu_engine_group_compute_engine_util` | Compute utilization % |
| `DCGM_FI_DEV_FB_USED` | `xpu_memory_used_bytes` | Memory used |
| `DCGM_FI_DEV_FB_FREE` | `xpu_memory_free_bytes` | Memory free |
| `DCGM_FI_DEV_GPU_TEMP` | `xpu_temperature_celsius{location="gpu"}` | GPU temperature |
| `DCGM_FI_DEV_MEMORY_TEMP` | `xpu_temperature_celsius{location="memory"}` | Memory temperature |
| `DCGM_FI_DEV_POWER_USAGE` | `xpu_power_watts` | Power draw (W) |
| `DCGM_FI_PROF_PCIE_RX_BYTES` | `xpu_pcie_read_bytes_per_second` | PCIe RX bytes/sec |
| `DCGM_FI_PROF_PCIE_TX_BYTES` | `xpu_pcie_write_bytes_per_second` | PCIe TX bytes/sec |
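
When porting existing DCGM-based queries or alerts, the table above can be kept as a small lookup. The following is a minimal sketch; the `DCGM_TO_XPU` dict and the `to_xpu_selector` helper are hypothetical names introduced for illustration and are not part of the repository. It only translates metric names. Label names and units can differ between the two exporters (the XPU memory metrics are in bytes, for example), so thresholds should be re-checked after translation.

```python
from typing import Optional

# Mapping copied from the table above: DCGM metric name -> XPU-SMI selector.
DCGM_TO_XPU = {
    "DCGM_FI_DEV_GPU_UTIL": "xpu_engine_group_compute_engine_util",
    "DCGM_FI_DEV_FB_USED": "xpu_memory_used_bytes",
    "DCGM_FI_DEV_FB_FREE": "xpu_memory_free_bytes",
    "DCGM_FI_DEV_GPU_TEMP": 'xpu_temperature_celsius{location="gpu"}',
    "DCGM_FI_DEV_MEMORY_TEMP": 'xpu_temperature_celsius{location="memory"}',
    "DCGM_FI_DEV_POWER_USAGE": "xpu_power_watts",
    "DCGM_FI_PROF_PCIE_RX_BYTES": "xpu_pcie_read_bytes_per_second",
    "DCGM_FI_PROF_PCIE_TX_BYTES": "xpu_pcie_write_bytes_per_second",
}


def to_xpu_selector(dcgm_name: str) -> Optional[str]:
    """Hypothetical helper: return the XPU-SMI selector for a DCGM metric.

    Returns None when the table has no counterpart for the given name.
    """
    return DCGM_TO_XPU.get(dcgm_name)


# Example: look up the XPU counterpart of the DCGM GPU temperature metric.
print(to_xpu_selector("DCGM_FI_DEV_GPU_TEMP"))
```
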
### Metric Architecture (XPU)
```text
┌─────────────────┐
│ Intel XPU-SMI   │ ──► xpu_* metrics (Prometheus port :9966)
│ Exporter        │     ├─ xpu_engine_group_compute_engine_util
│ (host process)  │     ├─ xpu_memory_used_bytes / free_bytes
└─────────────────┘     ├─ xpu_temperature_celsius
                        ├─ xpu_power_watts
                        └─ xpu_pcie_*_bytes_per_second
┌─────────────────┐
│ Prometheus      │ ◄─── scrape job: xpu-smi-exporter (port 9966)
│ (monitoring)    │
└─────────────────┘
┌─────────────────┐
│ Grafana         │ ◄─── xpu-smi-metrics.json dashboard
│ (monitoring)    │
└─────────────────┘
```
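
To make the first hop of the diagram concrete, the sketch below renders samples in the Prometheus text exposition format that the scrape job reads from port 9966. It is an illustration only and not the exporter's actual code; `render_sample` and `render_family` are hypothetical helper names, and the sample values are made up.

```python
def escape_label_value(value: str) -> str:
    """Escape backslash, newline and double quote as the text format requires."""
    return value.replace("\\", "\\\\").replace("\n", "\\n").replace('"', '\\"')


def render_sample(name: str, labels: dict, value: float) -> str:
    """Render one sample line, e.g. metric{label="x"} 1.0."""
    label_str = ",".join(
        f'{key}="{escape_label_value(str(val))}"'
        for key, val in sorted(labels.items())
    )
    return f"{name}{{{label_str}}} {value}" if label_str else f"{name} {value}"


def render_family(name: str, help_text: str, metric_type: str, samples: list) -> str:
    """Render HELP/TYPE metadata followed by every sample of one metric family."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} {metric_type}"]
    lines += [render_sample(name, labels, value) for labels, value in samples]
    return "\n".join(lines) + "\n"


# Two temperature samples for device 0, distinguished by the location label.
print(
    render_family(
        "xpu_temperature_celsius",
        "XPU temperature in Celsius",
        "gauge",
        [
            ({"device_id": "0", "location": "gpu"}, 61.0),
            ({"device_id": "0", "location": "memory"}, 55.0),
        ],
    )
)
```

Sharing one metric name across both temperature sensors and separating them with a `location` label is what lets the dashboard and alert rules select `location="gpu"` or `location="memory"` from the same series family.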
### XPU Alert Rules (`xpu-alert-rules.yml`)
| Alert | Condition | Severity | Description |
|-------|-----------|----------|-------------|
| `XPUHighTemperature` | temp > 85°C for 2m | warning | XPU GPU die overheating |
| `XPUCriticalTemperature` | temp > 95°C for 30s | critical | Immediate risk of thermal throttle/shutdown |
| `XPUMemoryAlmostFull` | mem > 90% for 1m | warning | KV cache allocation may fail |
| `XPUMemoryCritical` | mem > 98% for 30s | critical | OOM imminent |
| `XPUHighPowerDraw` | power > 400W for 5m | warning | Sustained high power draw |
| `XPUExporterDown` | `up{job="xpu-smi-exporter"} == 0` for 1m | critical | Monitoring blind spot |
| `XPULowComputeUtilizationDuringLoad` | util < 10% during active traffic (`rate()` > 0) | warning | Possible dispatch issue |
| `XPUWorkerLivenessLost` | no XPU metrics + active traffic (`rate()` > 0) | critical | XPU worker crash suspected |
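
The two memory alerts compare a used-memory ratio against 90% and 98%. A small worked example, assuming the ratio is computed as used / (used + free) as in `xpu-alert-rules.yml`, is sketched below. The function names are hypothetical, and the sketch ignores the `for:` hold duration that Prometheus applies before an alert fires.

```python
from typing import Optional

GIB = 1024 ** 3


def memory_used_ratio(used_bytes: float, free_bytes: float) -> float:
    """Fraction of device memory in use: used / (used + free)."""
    total = used_bytes + free_bytes
    if total <= 0:
        raise ValueError("used + free must be positive")
    return used_bytes / total


def memory_severity(used_bytes: float, free_bytes: float) -> Optional[str]:
    """Return the severity the thresholds in the table would assign, if any."""
    ratio = memory_used_ratio(used_bytes, free_bytes)
    if ratio > 0.98:
        return "critical"  # XPUMemoryCritical threshold
    if ratio > 0.90:
        return "warning"  # XPUMemoryAlmostFull threshold
    return None


# 60 GiB used and 4 GiB free gives 60 / 64 = 0.9375, above 90% but below 98%.
print(memory_used_ratio(60 * GIB, 4 * GIB), memory_severity(60 * GIB, 4 * GIB))
```
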
### Troubleshooting XPU Metrics
#### XPU metrics not showing in Prometheus:
1. Verify XPU-SMI exporter is running: `curl http://localhost:9966/metrics | grep xpu_`
2. Check Intel GPU is visible: `xpu-smi discovery`
3. Verify scrape job in Prometheus UI: Status → Targets → `xpu-smi-exporter`
4. Check firewall: `sudo ufw allow 9966/tcp`
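
Step 1 can also be scripted, which is handy when checking several hosts. The sketch below is a rough equivalent of the `curl ... | grep xpu_` check using only the Python standard library. `fetch_metrics` and `xpu_metric_names` are hypothetical helper names, and the URL assumes the exporter runs locally on the default port 9966.

```python
import urllib.request


def xpu_metric_names(exposition_text: str) -> set:
    """Return the distinct xpu_* metric names found in a /metrics payload."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # The metric name ends at the first "{" (labels) or space (value).
        name = line.split("{", 1)[0].split(" ", 1)[0]
        if name.startswith("xpu_"):
            names.add(name)
    return names


def fetch_metrics(url: str = "http://localhost:9966/metrics", timeout: float = 5.0) -> str:
    """Download the exporter's /metrics page (assumed local URL)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


# Usage on a host where the exporter is running:
#   print(sorted(xpu_metric_names(fetch_metrics())))
```

An empty result while the HTTP request succeeds suggests the exporter is up but `xpu-smi` returned no data, which points to step 2 rather than to the scrape configuration.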
#### XPU device not detected:
```bash
# Check device visibility in container
ls /dev/dri/
# Should show renderD128, card0, etc.
# Verify XPU-SMI can see the device
xpu-smi discovery
# Expected: lists Intel GPU devices with model name, driver version
```
{
"_copyright": "SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. SPDX-License-Identifier: Apache-2.0",
"annotations": {
"list": [
{
"builtIn": 1,
"datasource": {
"type": "grafana",
"uid": "-- Grafana --"
},
"enable": true,
"hide": true,
"iconColor": "rgba(0, 211, 255, 1)",
"name": "Annotations & Alerts",
"type": "dashboard"
}
]
},
"editable": true,
"fiscalYearStartMonth": 0,
"graphTooltip": 0,
"id": 10,
"links": [],
"liveNow": false,
"panels": [
{
"datasource": {
"type": "prometheus",
"uid": "prometheus"
},
"fieldConfig": {
"defaults": {
"color": {
"mode": "palette-classic"
},
"custom": {
"axisCenteredZero": false,
"axisColorMode": "text",
"axisLabel": "",
"axisPlacement": "auto",
"barAlignment": 0,
"drawStyle": "line",
"fillOpacity": 20,
"gradientMode": "none",
"hideFrom": {
"legend": false,
"tooltip": false,
"viz": false
},
"lineInterpolation": "smooth",
"lineWidth": 2,
"pointSize": 5,
"scaleDistribution": {
"type": "linear"
},
"showPoints": "never",
"spanNulls": false,
"stacking": {
"group": "A",
"mode": "none"
},
"thresholdsStyle": {
"mode": "off"
}
},
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "green",
"value": null
},
{
"color": "red",
"value": 80
}
]
},
"unit": "percent",
"min": 0,
"max": 100
},
"overrides": []
},
"gridPos": {
"h": 8,
"w": 12,
"x": 0,
"y": 0
},
"id": 1,
"options": {
"legend": {
"calcs": [
"mean",
"max"
],
"displayMode": "table",
"placement": "right",
"showLegend": true
},
"tooltip": {
"mode": "single",
"sort": "none"
}
},
"targets": [
{
"datasource": {
"type": "prometheus",
"uid": "prometheus"
},
"editorMode": "code",
"expr": "xpu_engine_group_compute_engine_util{device_id=~\"$device_id\"}",
"legendFormat": "XPU {{device_id}} Compute",
"range": true,
"refId": "A"
}
],
"title": "XPU Compute Utilization",
"type": "timeseries"
},
{
"datasource": {
"type": "prometheus",
"uid": "prometheus"
},
"fieldConfig": {
"defaults": {
"color": {
"mode": "palette-classic"
},
"custom": {
"axisCenteredZero": false,
"axisColorMode": "text",
"axisLabel": "",
"axisPlacement": "auto",
"barAlignment": 0,
"drawStyle": "line",
"fillOpacity": 20,
"gradientMode": "none",
"hideFrom": {
"legend": false,
"tooltip": false,
"viz": false
},
"lineInterpolation": "smooth",
"lineWidth": 2,
"pointSize": 5,
"scaleDistribution": {
"type": "linear"
},
"showPoints": "never",
"spanNulls": false,
"stacking": {
"group": "A",
"mode": "none"
},
"thresholdsStyle": {
"mode": "off"
}
},
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "green",
"value": null
},
{
"color": "red",
"value": 80
}
]
},
"unit": "bytes"
},
"overrides": []
},
"gridPos": {
"h": 8,
"w": 12,
"x": 12,
"y": 0
},
"id": 2,
"options": {
"legend": {
"calcs": [
"mean",
"max"
],
"displayMode": "table",
"placement": "right",
"showLegend": true
},
"tooltip": {
"mode": "single",
"sort": "none"
}
},
"targets": [
{
"datasource": {
"type": "prometheus",
"uid": "prometheus"
},
"editorMode": "code",
"expr": "xpu_memory_used_bytes{device_id=~\"$device_id\"}",
"legendFormat": "XPU {{device_id}} Used",
"range": true,
"refId": "A"
},
{
"datasource": {
"type": "prometheus",
"uid": "prometheus"
},
"editorMode": "code",
"expr": "xpu_memory_free_bytes{device_id=~\"$device_id\"}",
"legendFormat": "XPU {{device_id}} Free",
"range": true,
"refId": "B"
}
],
"title": "XPU Memory Usage",
"type": "timeseries"
},
{
"datasource": {
"type": "prometheus",
"uid": "prometheus"
},
"fieldConfig": {
"defaults": {
"color": {
"mode": "palette-classic"
},
"custom": {
"axisCenteredZero": false,
"axisColorMode": "text",
"axisLabel": "",
"axisPlacement": "auto",
"barAlignment": 0,
"drawStyle": "line",
"fillOpacity": 20,
"gradientMode": "none",
"hideFrom": {
"legend": false,
"tooltip": false,
"viz": false
},
"lineInterpolation": "smooth",
"lineWidth": 2,
"pointSize": 5,
"scaleDistribution": {
"type": "linear"
},
"showPoints": "never",
"spanNulls": false,
"stacking": {
"group": "A",
"mode": "none"
},
"thresholdsStyle": {
"mode": "off"
}
},
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "green",
"value": null
},
{
"color": "yellow",
"value": 70
},
{
"color": "red",
"value": 85
}
]
},
"unit": "celsius"
},
"overrides": []
},
"gridPos": {
"h": 8,
"w": 12,
"x": 0,
"y": 8
},
"id": 3,
"options": {
"legend": {
"calcs": [
"mean",
"max"
],
"displayMode": "table",
"placement": "right",
"showLegend": true
},
"tooltip": {
"mode": "single",
"sort": "none"
}
},
"targets": [
{
"datasource": {
"type": "prometheus",
"uid": "prometheus"
},
"editorMode": "code",
"expr": "xpu_temperature_celsius{device_id=~\"$device_id\", location=\"gpu\"}",
"legendFormat": "XPU {{device_id}} GPU Temp",
"range": true,
"refId": "A"
},
{
"datasource": {
"type": "prometheus",
"uid": "prometheus"
},
"editorMode": "code",
"expr": "xpu_temperature_celsius{device_id=~\"$device_id\", location=\"memory\"}",
"legendFormat": "XPU {{device_id}} Memory Temp",
"range": true,
"refId": "B"
}
],
"title": "XPU Temperature",
"type": "timeseries"
},
{
"datasource": {
"type": "prometheus",
"uid": "prometheus"
},
"fieldConfig": {
"defaults": {
"color": {
"mode": "palette-classic"
},
"custom": {
"axisCenteredZero": false,
"axisColorMode": "text",
"axisLabel": "",
"axisPlacement": "auto",
"barAlignment": 0,
"drawStyle": "line",
"fillOpacity": 20,
"gradientMode": "none",
"hideFrom": {
"legend": false,
"tooltip": false,
"viz": false
},
"lineInterpolation": "smooth",
"lineWidth": 2,
"pointSize": 5,
"scaleDistribution": {
"type": "linear"
},
"showPoints": "never",
"spanNulls": false,
"stacking": {
"group": "A",
"mode": "none"
},
"thresholdsStyle": {
"mode": "off"
}
},
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "green",
"value": null
},
{
"color": "red",
"value": 80
}
]
},
"unit": "watt"
},
"overrides": []
},
"gridPos": {
"h": 8,
"w": 12,
"x": 12,
"y": 8
},
"id": 4,
"options": {
"legend": {
"calcs": [
"mean",
"max"
],
"displayMode": "table",
"placement": "right",
"showLegend": true
},
"tooltip": {
"mode": "single",
"sort": "none"
}
},
"targets": [
{
"datasource": {
"type": "prometheus",
"uid": "prometheus"
},
"editorMode": "code",
"expr": "xpu_power_watts{device_id=~\"$device_id\"}",
"legendFormat": "XPU {{device_id}} Power",
"range": true,
"refId": "A"
}
],
"title": "XPU Power Usage",
"type": "timeseries"
},
{
"datasource": {
"type": "prometheus",
"uid": "prometheus"
},
"fieldConfig": {
"defaults": {
"color": {
"mode": "palette-classic"
},
"custom": {
"axisCenteredZero": false,
"axisColorMode": "text",
"axisLabel": "",
"axisPlacement": "auto",
"barAlignment": 0,
"drawStyle": "line",
"fillOpacity": 20,
"gradientMode": "none",
"hideFrom": {
"legend": false,
"tooltip": false,
"viz": false
},
"lineInterpolation": "smooth",
"lineWidth": 2,
"pointSize": 5,
"scaleDistribution": {
"type": "linear"
},
"showPoints": "never",
"spanNulls": false,
"stacking": {
"group": "A",
"mode": "none"
},
"thresholdsStyle": {
"mode": "off"
}
},
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "green",
"value": null
},
{
"color": "red",
"value": 80
}
]
},
"unit": "percent",
"min": 0,
"max": 100
},
"overrides": []
},
"gridPos": {
"h": 8,
"w": 12,
"x": 0,
"y": 16
},
"id": 5,
"options": {
"legend": {
"calcs": [
"mean",
"max"
],
"displayMode": "table",
"placement": "right",
"showLegend": true
},
"tooltip": {
"mode": "single",
"sort": "none"
}
},
"targets": [
{
"datasource": {
"type": "prometheus",
"uid": "prometheus"
},
"editorMode": "code",
"expr": "xpu_engine_group_compute_engine_util{device_id=~\"$device_id\"}",
"legendFormat": "XPU {{device_id}} Compute Engine",
"range": true,
"refId": "A"
},
{
"datasource": {
"type": "prometheus",
"uid": "prometheus"
},
"editorMode": "code",
"expr": "xpu_engine_group_copy_engine_util{device_id=~\"$device_id\"}",
"legendFormat": "XPU {{device_id}} Copy Engine",
"range": true,
"refId": "B"
},
{
"datasource": {
"type": "prometheus",
"uid": "prometheus"
},
"editorMode": "code",
"expr": "xpu_engine_group_render_engine_util{device_id=~\"$device_id\"}",
"legendFormat": "XPU {{device_id}} Render Engine",
"range": true,
"refId": "C"
}
],
"title": "XPU Engine Utilization",
"type": "timeseries"
},
{
"datasource": {
"type": "prometheus",
"uid": "prometheus"
},
"fieldConfig": {
"defaults": {
"color": {
"mode": "palette-classic"
},
"custom": {
"axisCenteredZero": false,
"axisColorMode": "text",
"axisLabel": "",
"axisPlacement": "auto",
"barAlignment": 0,
"drawStyle": "line",
"fillOpacity": 20,
"gradientMode": "none",
"hideFrom": {
"legend": false,
"tooltip": false,
"viz": false
},
"lineInterpolation": "smooth",
"lineWidth": 2,
"pointSize": 5,
"scaleDistribution": {
"type": "linear"
},
"showPoints": "never",
"spanNulls": false,
"stacking": {
"group": "A",
"mode": "none"
},
"thresholdsStyle": {
"mode": "off"
}
},
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "green",
"value": null
},
{
"color": "red",
"value": 80
}
]
},
"unit": "binBps"
},
"overrides": []
},
"gridPos": {
"h": 8,
"w": 12,
"x": 12,
"y": 16
},
"id": 6,
"options": {
"legend": {
"calcs": [
"mean",
"max"
],
"displayMode": "table",
"placement": "right",
"showLegend": true
},
"tooltip": {
"mode": "single",
"sort": "none"
}
},
"targets": [
{
"datasource": {
"type": "prometheus",
"uid": "prometheus"
},
"editorMode": "code",
"expr": "xpu_memory_read_bytes_per_second{device_id=~\"$device_id\"}",
"legendFormat": "XPU {{device_id}} Read BW",
"range": true,
"refId": "A"
},
{
"datasource": {
"type": "prometheus",
"uid": "prometheus"
},
"editorMode": "code",
"expr": "xpu_memory_write_bytes_per_second{device_id=~\"$device_id\"}",
"legendFormat": "XPU {{device_id}} Write BW",
"range": true,
"refId": "B"
}
],
"title": "XPU Memory Bandwidth",
"type": "timeseries"
},
{
"datasource": {
"type": "prometheus",
"uid": "prometheus"
},
"fieldConfig": {
"defaults": {
"color": {
"mode": "palette-classic"
},
"custom": {
"axisCenteredZero": false,
"axisColorMode": "text",
"axisLabel": "",
"axisPlacement": "auto",
"barAlignment": 0,
"drawStyle": "line",
"fillOpacity": 20,
"gradientMode": "none",
"hideFrom": {
"legend": false,
"tooltip": false,
"viz": false
},
"lineInterpolation": "smooth",
"lineWidth": 2,
"pointSize": 5,
"scaleDistribution": {
"type": "linear"
},
"showPoints": "never",
"spanNulls": false,
"stacking": {
"group": "A",
"mode": "none"
},
"thresholdsStyle": {
"mode": "off"
}
},
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "green",
"value": null
},
{
"color": "red",
"value": 80
}
]
},
"unit": "binBps"
},
"overrides": []
},
"gridPos": {
"h": 8,
"w": 12,
"x": 0,
"y": 24
},
"id": 7,
"options": {
"legend": {
"calcs": [
"mean",
"max"
],
"displayMode": "table",
"placement": "right",
"showLegend": true
},
"tooltip": {
"mode": "single",
"sort": "none"
}
},
"targets": [
{
"datasource": {
"type": "prometheus",
"uid": "prometheus"
},
"editorMode": "code",
"expr": "xpu_pcie_read_bytes_per_second{device_id=~\"$device_id\"}",
"legendFormat": "XPU {{device_id}} PCIe RX",
"range": true,
"refId": "A"
},
{
"datasource": {
"type": "prometheus",
"uid": "prometheus"
},
"editorMode": "code",
"expr": "xpu_pcie_write_bytes_per_second{device_id=~\"$device_id\"}",
"legendFormat": "XPU {{device_id}} PCIe TX",
"range": true,
"refId": "B"
}
],
"title": "XPU PCIe Bandwidth",
"type": "timeseries"
},
{
"datasource": {
"type": "prometheus",
"uid": "prometheus"
},
"fieldConfig": {
"defaults": {
"color": {
"mode": "thresholds"
},
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "green",
"value": null
},
{
"color": "yellow",
"value": 50
},
{
"color": "red",
"value": 80
}
]
},
"unit": "percent",
"min": 0,
"max": 100
},
"overrides": []
},
"gridPos": {
"h": 8,
"w": 6,
"x": 12,
"y": 24
},
"id": 8,
"options": {
"orientation": "auto",
"reduceOptions": {
"calcs": [
"lastNotNull"
],
"fields": "",
"values": false
},
"showThresholdLabels": false,
"showThresholdMarkers": true
},
"targets": [
{
"datasource": {
"type": "prometheus",
"uid": "prometheus"
},
"editorMode": "code",
"expr": "avg(xpu_engine_group_compute_engine_util{device_id=~\"$device_id\"})",
"legendFormat": "__auto",
"range": true,
"refId": "A"
}
],
"title": "Avg XPU Utilization",
"type": "gauge"
},
{
"datasource": {
"type": "prometheus",
"uid": "prometheus"
},
"fieldConfig": {
"defaults": {
"color": {
"mode": "thresholds"
},
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "green",
"value": null
},
{
"color": "yellow",
"value": 70
},
{
"color": "red",
"value": 85
}
]
},
"unit": "celsius"
},
"overrides": []
},
"gridPos": {
"h": 8,
"w": 6,
"x": 18,
"y": 24
},
"id": 9,
"options": {
"orientation": "auto",
"reduceOptions": {
"calcs": [
"lastNotNull"
],
"fields": "",
"values": false
},
"showThresholdLabels": false,
"showThresholdMarkers": true
},
"targets": [
{
"datasource": {
"type": "prometheus",
"uid": "prometheus"
},
"editorMode": "code",
"expr": "max(xpu_temperature_celsius{device_id=~\"$device_id\", location=\"gpu\"})",
"legendFormat": "__auto",
"range": true,
"refId": "A"
}
],
"title": "Max XPU Temperature",
"type": "gauge"
}
],
"refresh": "5s",
"schemaVersion": 36,
"style": "dark",
"tags": [
"xpu",
"intel",
"xpu-smi"
],
"templating": {
"list": [
{
"allValue": ".*",
"current": {
"selected": true,
"text": "All",
"value": "$__all"
},
"datasource": {
"type": "prometheus",
"uid": "prometheus"
},
"definition": "label_values(xpu_frequency_mhz, device_id)",
"hide": 0,
"includeAll": true,
"multi": true,
"name": "device_id",
"options": [],
"query": {
"query": "label_values(xpu_frequency_mhz, device_id)",
"refId": "StandardVariableQuery"
},
"refresh": 1,
"regex": "",
"sort": 1,
"type": "query",
"label": "XPU Device"
}
]
},
"time": {
"from": "now-30m",
"to": "now"
},
"timepicker": {},
"timezone": "",
"title": "Intel XPU-SMI Monitoring Dashboard",
"uid": "xpu-smi-dashboard",
"version": 1,
"weekStart": ""
}
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# Prometheus configuration for Intel XPU environments.
# Used via: docker compose -f deploy/docker-observability.yml -f deploy/docker-observability-xpu.yml up -d
# Requires xpu_smi_exporter.py running on the host:
# python3 deploy/observability/xpu_smi_exporter.py --port 9966
global:
  scrape_interval: 10s
  evaluation_interval: 10s
rule_files:
  - "/etc/prometheus/xpu-alert-rules.yml"
scrape_configs:
  - job_name: 'nats-prometheus-exporter'
    scrape_interval: 2s
    static_configs:
      - targets: ['nats-prometheus-exporter:7777']
  - job_name: 'etcd-server'
    scrape_interval: 2s
    static_configs:
      - targets: ['etcd-server:2379']
  - job_name: 'dynamo-frontend'
    scrape_interval: 10s
    static_configs:
      - targets: ['host.docker.internal:8000']
  - job_name: 'dynamo-backend'
    scrape_interval: 6s
    static_configs:
      - targets: ['host.docker.internal:8081']
  - job_name: 'kvbm-metrics'
    scrape_interval: 2s
    static_configs:
      - targets: ['host.docker.internal:6880']
  - job_name: 'xpu-smi-exporter'
    scrape_interval: 5s
    scrape_timeout: 5s
    static_configs:
      - targets: ['host.docker.internal:9966']
    relabel_configs:
      - source_labels: [__address__]
        target_label: instance
        replacement: 'xpu-host'
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
groups:
  - name: xpu_health
    interval: 15s
    rules:
      - alert: XPUHighTemperature
        expr: xpu_temperature_celsius{location="gpu"} > 85
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Intel XPU temperature too high"
          description: "XPU device {{ $labels.device_id }} GPU temperature is {{ $value | printf \"%.1f\" }}°C (threshold: 85°C)"
      - alert: XPUCriticalTemperature
        expr: xpu_temperature_celsius{location="gpu"} > 95
        for: 30s
        labels:
          severity: critical
        annotations:
          summary: "Intel XPU temperature critical"
          description: "XPU device {{ $labels.device_id }} GPU temperature is {{ $value | printf \"%.1f\" }}°C; immediate action required"
      - alert: XPUMemoryAlmostFull
        expr: xpu_memory_used_bytes / (xpu_memory_used_bytes + xpu_memory_free_bytes) > 0.90
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Intel XPU memory usage above 90%"
          description: "XPU device {{ $labels.device_id }} memory usage is {{ $value | humanizePercentage }}"
      - alert: XPUMemoryCritical
        expr: xpu_memory_used_bytes / (xpu_memory_used_bytes + xpu_memory_free_bytes) > 0.98
        for: 30s
        labels:
          severity: critical
        annotations:
          summary: "Intel XPU memory usage critical (>98%)"
          description: "XPU device {{ $labels.device_id }} memory is almost exhausted: {{ $value | humanizePercentage }} used"
      - alert: XPUHighPowerDraw
        expr: xpu_power_watts > 400
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Intel XPU sustained high power draw"
          description: "XPU device {{ $labels.device_id }} power draw is {{ $value | printf \"%.1f\" }}W for over 5 minutes"
      - alert: XPUExporterDown
        expr: up{job="xpu-smi-exporter"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Intel XPU-SMI exporter is down"
          description: "Cannot scrape XPU metrics from {{ $labels.instance }}. XPU health monitoring is unavailable."
  - name: xpu_sla
    interval: 30s
    rules:
      - alert: XPULowComputeUtilizationDuringLoad
        expr: |
          xpu_engine_group_compute_engine_util < 10
          and on() sum(rate(dynamo_frontend_requests_total[5m])) > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "XPU compute utilization low while requests are active"
          description: "XPU device {{ $labels.device_id }} compute utilization is only {{ $value | printf \"%.1f\" }}% despite active frontend traffic. Possible scheduling or dispatch issue."
      - alert: XPUWorkerLivenessLost
        expr: |
          absent(xpu_engine_group_compute_engine_util)
          and on() sum(rate(dynamo_frontend_requests_total[5m])) > 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "XPU worker liveness lost: no XPU metrics while serving requests"
          description: "No XPU metrics are being reported while Dynamo frontend is receiving requests. XPU worker may have crashed."
#!/usr/bin/env python3
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
"""Intel XPU-SMI Prometheus Exporter.
Collects Intel GPU metrics via xpu-smi and exposes them in Prometheus format
on a configurable HTTP port (default: 9966).
Usage:
python xpu_smi_exporter.py [--port 9966] [--interval 5]
Metrics exposed (matching the Grafana dashboard xpu-smi-metrics.json):
xpu_power_watts - GPU power draw in watts
xpu_frequency_mhz - GPU core frequency in MHz
xpu_memory_used_bytes - GPU memory used in bytes
xpu_memory_free_bytes - GPU memory free in bytes
xpu_memory_utilization_ratio - GPU memory utilization (0-1)
xpu_temperature_celsius - GPU core/memory temperature (dump metrics 3 and 4; label location=gpu|memory)
xpu_pcie_read_bytes_per_second - PCIe read throughput (gauge, bytes/sec)
xpu_pcie_write_bytes_per_second - PCIe write throughput (gauge, bytes/sec)
xpu_engine_group_compute_engine_util - Compute engine utilization %
xpu_engine_group_render_engine_util - Render engine utilization %
xpu_engine_group_copy_engine_util - Copy engine utilization %
xpu_memory_read_bytes_per_second - Memory read throughput (gauge, bytes/sec)
xpu_memory_write_bytes_per_second - Memory write throughput (gauge, bytes/sec)
"""
import argparse
import json
import logging
import subprocess
import sys
import threading
import time
from http.server import BaseHTTPRequestHandler, HTTPServer
logging.basicConfig(
level=logging.INFO,
format="[%(asctime)s] %(levelname)s %(name)s: %(message)s",
)
logger = logging.getLogger("xpu-smi-exporter")
# xpu-smi dump metric IDs
# 0=GPU Util%, 1=Power(W), 2=Freq(MHz), 3=CoreTemp(C), 4=MemTemp(C),
# 5=MemUtil%, 6=MemRead(kB/s), 7=MemWrite(kB/s), 18=MemUsed(MiB),
# 19=PCIeRead(kB/s), 20=PCIeWrite(kB/s),
# 31=ComputeEngGrp%, 32=RenderEngGrp%, 33=MediaEngGrp%, 34=CopyEngGrp%
DUMP_METRICS = "0,1,2,3,4,5,6,7,18,19,20,31,32,33,34"
# Metric name in dump header -> (prometheus_name, help, type, unit_conversion, extra_labels)
# unit_conversion: multiply raw value by this factor
# extra_labels: additional Prometheus labels (e.g. location for temperature)
DUMP_HEADER_MAP = {
"GPU Utilization (%)": (
"xpu_gpu_utilization_percent",
"GPU utilization percentage",
"gauge",
1,
{},
),
"GPU Power (W)": (
"xpu_power_watts",
"GPU power consumption in watts",
"gauge",
1,
{},
),
"GPU Frequency (MHz)": (
"xpu_frequency_mhz",
"GPU core frequency in MHz",
"gauge",
1,
{},
),
"GPU Core Temperature (Celsius Degree)": (
"xpu_temperature_celsius",
"XPU temperature in Celsius",
"gauge",
1,
{"location": "gpu"},
),
"GPU Memory Temperature (Celsius Degree)": (
"xpu_temperature_celsius",
"XPU temperature in Celsius",
"gauge",
1,
{"location": "memory"},
),
"GPU Memory Utilization (%)": (
"xpu_memory_utilization_percent",
"GPU memory utilization percentage",
"gauge",
1,
{},
),
"GPU Memory Read (kB/s)": (
"xpu_memory_read_bytes_per_second",
"GPU memory read throughput in bytes per second",
"gauge",
1024,
{},
),
"GPU Memory Write (kB/s)": (
"xpu_memory_write_bytes_per_second",
"GPU memory write throughput in bytes per second",
"gauge",
1024,
{},
),
"GPU Memory Used (MiB)": (
"xpu_memory_used_bytes",
"GPU memory used in bytes",
"gauge",
1048576,
{},
),
"PCIe Read (kB/s)": (
"xpu_pcie_read_bytes_per_second",
"PCIe read throughput in bytes per second",
"gauge",
1024,
{},
),
"PCIe Write (kB/s)": (
"xpu_pcie_write_bytes_per_second",
"PCIe write throughput in bytes per second",
"gauge",
1024,
{},
),
"Compute engine group utilization (%)": (
"xpu_engine_group_compute_engine_util",
"Compute engine group utilization percentage",
"gauge",
1,
{},
),
"Render engine group utilization (%)": (
"xpu_engine_group_render_engine_util",
"Render engine group utilization percentage",
"gauge",
1,
{},
),
"Media engine group utilization (%)": (
"xpu_engine_group_media_engine_util",
"Media engine group utilization percentage",
"gauge",
1,
{},
),
"Copy engine group utilization (%)": (
"xpu_engine_group_copy_engine_util",
"Copy engine group utilization percentage",
"gauge",
1,
{},
),
}
class MetricsCollector:
"""Collects XPU metrics from xpu-smi commands.
Runs a background thread that periodically calls xpu-smi and caches the
results. The /metrics handler returns the cached snapshot instantly,
avoiding Prometheus scrape-timeout issues caused by slow xpu-smi calls.
"""
def __init__(self, interval: int = 5):
self._lock = threading.Lock()
self._metrics: dict = {}
self._devices: list = []
self._device_memory_total: dict = {} # device_id -> total memory bytes
self._interval = interval
self._discover_devices()

    def _discover_devices(self):
        """Discover available XPU devices."""
        try:
            result = subprocess.run(
                ["xpu-smi", "discovery", "-j"],
                capture_output=True,
                text=True,
                timeout=10,
            )
            data = json.loads(result.stdout)
            self._devices = [d["device_id"] for d in data.get("device_list", [])]
            # Get total memory per device
            for dev_id in self._devices:
                self._get_device_memory_total(dev_id)
            logger.info(
                f"Discovered {len(self._devices)} XPU device(s): {self._devices}"
            )
        except Exception as e:
            logger.error(f"Failed to discover devices: {e}")
            self._devices = []

    def _get_device_memory_total(self, device_id: int):
        """Get total physical memory for a device."""
        try:
            result = subprocess.run(
                ["xpu-smi", "discovery", "-d", str(device_id), "-j"],
                capture_output=True,
                text=True,
                timeout=10,
            )
            data = json.loads(result.stdout)
            total = int(data.get("memory_physical_size_byte", 0))
            self._device_memory_total[device_id] = total
            logger.info(
                f"Device {device_id}: total memory = {total / (1024**3):.1f} GiB"
            )
        except Exception as e:
            logger.warning(f"Failed to get memory total for device {device_id}: {e}")

    def _collect_dump_metrics(self, device_id: int) -> dict:
        """Collect metrics via xpu-smi dump for a single device."""
        metrics = {}
        try:
            result = subprocess.run(
                [
                    "xpu-smi",
                    "dump",
                    "-d",
                    str(device_id),
                    "-m",
                    DUMP_METRICS,
                    "-n",
                    "1",
                ],
                capture_output=True,
                text=True,
                timeout=15,
            )
            lines = result.stdout.strip().split("\n")
            if len(lines) < 2:
                return metrics

            # Parse header
            header_line = lines[0]
            headers = [h.strip() for h in header_line.split(",")]

            # Parse data (last line)
            data_line = lines[-1]
            values = [v.strip() for v in data_line.split(",")]
            if len(headers) != len(values):
                logger.warning(
                    f"Header/value count mismatch: {len(headers)} vs {len(values)}"
                )
                return metrics

            # Skip Timestamp and DeviceId columns
            for i in range(2, len(headers)):
                header = headers[i]
                raw_val = values[i]
                if raw_val == "N/A" or raw_val == "":
                    continue
                mapping = DUMP_HEADER_MAP.get(header)
                if not mapping:
                    continue
                prom_name, help_text, metric_type, conversion, extra_labels = mapping
                try:
                    val = float(raw_val) * conversion
                    labels = {"device_id": str(device_id), **extra_labels}
                    # Use a composite key to handle metrics with the same name
                    # but different labels (e.g. xpu_temperature_celsius with
                    # location=gpu vs location=memory)
                    label_suffix = "_".join(
                        f"{k}={v}" for k, v in sorted(extra_labels.items())
                    )
                    metric_key = (
                        f"{prom_name}:{label_suffix}" if label_suffix else prom_name
                    )
                    metrics[metric_key] = {
                        "name": prom_name,
                        "value": val,
                        "help": help_text,
                        "type": metric_type,
                        "labels": labels,
                    }
                except ValueError:
                    continue
        except subprocess.TimeoutExpired:
            logger.warning(f"xpu-smi dump timed out for device {device_id}")
        except Exception as e:
            logger.warning(f"Error collecting dump metrics for device {device_id}: {e}")
        return metrics

    def _collect_stats_metrics(self, device_id: int) -> dict:
        """Collect metrics via xpu-smi stats for a single device (fallback/supplement)."""
        metrics = {}
        try:
            result = subprocess.run(
                ["xpu-smi", "stats", "-d", str(device_id), "-j"],
                capture_output=True,
                text=True,
                timeout=10,
            )
            data = json.loads(result.stdout)
            labels = {"device_id": str(device_id)}

            # Device-level metrics
            for entry in data.get("device_level", []):
                mtype = entry.get("metrics_type", "")
                val = entry.get("value")
                if val is None:
                    continue
                if mtype == "XPUM_STATS_POWER":
                    metrics["xpu_power_watts"] = {
                        "name": "xpu_power_watts",
                        "value": float(val),
                        "help": "GPU power consumption in watts",
                        "type": "gauge",
                        "labels": labels,
                    }

            # Tile-level metrics (aggregate to device level)
            tile_data = data.get("tile_level", [])
            if tile_data:
                mem_used_sum = 0.0
                mem_util_sum = 0.0
                freq_sum = 0.0
                tile_count = 0
                for tile in tile_data:
                    tile_count += 1
                    for entry in tile.get("data_list", []):
                        mtype = entry.get("metrics_type", "")
                        val = entry.get("value")
                        if val is None:
                            continue
                        if mtype == "XPUM_STATS_MEMORY_USED":
                            mem_used_sum += float(val)  # MiB
                        elif mtype == "XPUM_STATS_MEMORY_UTILIZATION":
                            mem_util_sum += float(val)
                        elif mtype == "XPUM_STATS_GPU_FREQUENCY":
                            freq_sum += float(val)
                if tile_count > 0:
                    # Memory used: sum across tiles, convert MiB -> bytes
                    mem_used_bytes = mem_used_sum * 1048576
                    metrics["xpu_memory_used_bytes"] = {
                        "name": "xpu_memory_used_bytes",
                        "value": mem_used_bytes,
                        "help": "GPU memory used in bytes",
                        "type": "gauge",
                        "labels": labels,
                    }
                    # Memory free: total - used
                    total = self._device_memory_total.get(device_id, 0)
                    if total > 0:
                        metrics["xpu_memory_free_bytes"] = {
                            "name": "xpu_memory_free_bytes",
                            "value": max(0, total - mem_used_bytes),
                            "help": "GPU memory free in bytes",
                            "type": "gauge",
                            "labels": labels,
                        }
                    # Average frequency across tiles
                    metrics["xpu_frequency_mhz"] = {
                        "name": "xpu_frequency_mhz",
                        "value": freq_sum / tile_count,
                        "help": "GPU core frequency in MHz",
                        "type": "gauge",
                        "labels": labels,
                    }
        except subprocess.TimeoutExpired:
            logger.warning(f"xpu-smi stats timed out for device {device_id}")
        except Exception as e:
            logger.warning(f"Error collecting stats for device {device_id}: {e}")
        return metrics

    def start_background_collection(self):
        """Start a daemon thread that collects metrics periodically."""

        def _loop():
            while True:
                try:
                    self.collect()
                except Exception as e:
                    logger.error(f"Background collection error: {e}")
                time.sleep(self._interval)

        t = threading.Thread(target=_loop, daemon=True)
        t.start()
        logger.info(f"Background collection started (interval={self._interval}s)")

    def collect(self):
        """Collect all metrics from all devices."""
        all_metrics = {}
        for dev_id in self._devices:
            # Collect from dump first
            dump_metrics = self._collect_dump_metrics(dev_id)
            # Collect from stats (supplements dump, especially for memory)
            stats_metrics = self._collect_stats_metrics(dev_id)
            # Merge: dump takes priority for metrics it provides,
            # stats fills in what dump doesn't have
            merged = {**stats_metrics, **dump_metrics}
            # But for memory_used_bytes and memory_free_bytes, prefer stats
            # since dump often returns N/A for memory
            if "xpu_memory_used_bytes" in stats_metrics:
                merged["xpu_memory_used_bytes"] = stats_metrics["xpu_memory_used_bytes"]
            if "xpu_memory_free_bytes" in stats_metrics:
                merged["xpu_memory_free_bytes"] = stats_metrics["xpu_memory_free_bytes"]
            for name, data in merged.items():
                if name not in all_metrics:
                    all_metrics[name] = []
                all_metrics[name].append(data)
        with self._lock:
            self._metrics = all_metrics

    def format_prometheus(self) -> str:
        """Format collected metrics in Prometheus exposition format."""
        with self._lock:
            metrics = self._metrics.copy()

        # Group entries by actual Prometheus metric name (from 'name' field)
        grouped: dict = {}
        for _key, entries in metrics.items():
            for entry in entries:
                metric_name = entry.get("name", _key)
                if metric_name not in grouped:
                    grouped[metric_name] = []
                grouped[metric_name].append(entry)

        lines = []
        for metric_name, entries in sorted(grouped.items()):
            if not entries:
                continue
            first = entries[0]
            lines.append(f"# HELP {metric_name} {first['help']}")
            lines.append(f"# TYPE {metric_name} {first['type']}")
            for entry in entries:
                label_parts = ",".join(
                    f'{k}="{v}"' for k, v in sorted(entry["labels"].items())
                )
                lines.append(f"{metric_name}{{{label_parts}}} {entry['value']}")
        lines.append("")
        return "\n".join(lines)
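
Review note: `format_prometheus` interpolates label values directly into the output. That is fine for the numeric `device_id` values used here, but the Prometheus text exposition format requires backslash, double-quote, and newline characters inside label values to be escaped. A minimal sketch of that escaping follows; `escape_label_value` and `format_sample` are hypothetical helpers, not part of the exporter.

```python
from typing import Dict


def escape_label_value(value: str) -> str:
    """Escape a label value for the Prometheus text exposition format."""
    # Backslashes must be handled first so later escapes are not doubled.
    return (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
    )


def format_sample(name: str, labels: Dict[str, str], value: float) -> str:
    """Render one sample line with sorted, escaped labels."""
    label_parts = ",".join(
        f'{key}="{escape_label_value(str(val))}"'
        for key, val in sorted(labels.items())
    )
    return f"{name}{{{label_parts}}} {value}"
```

Sorting the labels mirrors what `format_prometheus` already does, so output stays stable between scrapes.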


class MetricsHandler(BaseHTTPRequestHandler):
    """HTTP handler for /metrics endpoint."""

    collector: MetricsCollector = None  # Set by main

    def do_GET(self):
        try:
            if self.path == "/metrics" or self.path == "/":
                output = self.collector.format_prometheus()
                self.send_response(200)
                self.send_header(
                    "Content-Type", "text/plain; version=0.0.4; charset=utf-8"
                )
                self.end_headers()
                self.wfile.write(output.encode("utf-8"))
            elif self.path == "/healthz":
                self.send_response(200)
                self.send_header("Content-Type", "text/plain")
                self.end_headers()
                self.wfile.write(b"ok\n")
            else:
                self.send_response(404)
                self.end_headers()
        except BrokenPipeError:
            pass

    def log_message(self, format, *args):
        # Suppress per-request logging to reduce noise
        pass


def main():
    parser = argparse.ArgumentParser(description="Intel XPU-SMI Prometheus Exporter")
    parser.add_argument(
        "--port",
        type=int,
        default=9966,
        help="Port to expose Prometheus metrics on (default: 9966)",
    )
    parser.add_argument(
        "--interval",
        type=int,
        default=5,
        help="Seconds between background metric collections (default: 5)",
    )
    args = parser.parse_args()

    collector = MetricsCollector(interval=args.interval)
    if not collector._devices:
        logger.error("No XPU devices found. Exiting.")
        sys.exit(1)

    # Do an initial collection to verify it works
    collector.collect()
    initial = collector.format_prometheus()
    logger.info(f"Initial collection complete, {len(initial)} bytes of metrics")

    # Start background collection so /metrics returns cached data instantly
    collector.start_background_collection()

    MetricsHandler.collector = collector
    server = HTTPServer(("0.0.0.0", args.port), MetricsHandler)
    logger.info(f"Serving XPU metrics on http://0.0.0.0:{args.port}/metrics")
    try:
        server.serve_forever()
    except KeyboardInterrupt:
        logger.info("Shutting down exporter")
        server.shutdown()


if __name__ == "__main__":
    main()
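
Review note: one way to sanity-check the exporter's output without a full Prometheus setup is to parse the text it serves. The sketch below is a deliberately small parser, not a complete implementation of the exposition format: it assumes one sample per line and no trailing timestamps, which matches what `format_prometheus` emits. `parse_exposition` is a hypothetical helper name.

```python
from typing import Dict


def parse_exposition(text: str) -> Dict[str, float]:
    """Map each 'name{labels}' series string to its float value."""
    samples: Dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and '# HELP' / '# TYPE' comment lines.
        if not line or line.startswith("#"):
            continue
        # The value is everything after the last space on the line.
        series, _, value = line.rpartition(" ")
        samples[series] = float(value)
    return samples
```

With the exporter running on its default port, the text could be fetched with `urllib.request.urlopen("http://localhost:9966/metrics").read().decode("utf-8")` and passed to this function; a missing `xpu_memory_used_bytes` series, for example, would then show up as an absent key.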