feat: Add Intel XPU (Intel GPU) monitoring support to observability stack (#7511)

Signed-off-by: Wenjiao Yue <wenjiao.yue@intel.com>

Commit 4218bbae (unverified) authored Apr 09, 2026 by WenjiaoYue; committed by GitHub Apr 08, 2026
8 changed files
--- a/deploy/docker-observability-xpu.yml
+++ b/deploy/docker-observability-xpu.yml
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+
+# Docker Compose override for Intel XPU observability.
+# Replaces NVIDIA DCGM with Intel XPU-SMI monitoring.
+#
+# Usage (XPU environment):
+#   # 1. Start the XPU exporter on the host (requires xpu-smi installed):
+#   python3 deploy/observability/xpu_smi_exporter.py --port 9966 &
+#
+#   # 2. Start base services:
+#   docker compose -f deploy/docker-compose.yml up -d
+#
+#   # 3. Start observability with XPU overlay:
+#   docker compose -f deploy/docker-observability.yml -f deploy/docker-observability-xpu.yml up -d
+
+services:
+  # Override Prometheus to use XPU-specific config and alert rules
+  prometheus:
+    volumes:
+      - ./observability/prometheus-xpu.yml:/etc/prometheus/prometheus.yml
+      - ./observability/xpu-alert-rules.yml:/etc/prometheus/xpu-alert-rules.yml
--- a/deploy/docker-observability.yml
+++ b/deploy/docker-observability.yml
@@ -22,8 +22,12 @@ volumes:
 services:
  # DCGM stands for Data Center GPU Manager: https://developer.nvidia.com/dcgm
  # dcgm-exporter is a tool from NVIDIA that exposes DCGM metrics in Prometheus format.
+  # Requires NVIDIA GPU and runtime. Enable with:
+  #   docker compose --profile nvidia -f deploy/docker-observability.yml up -d
  dcgm-exporter:
    image: nvidia/dcgm-exporter:4.2.3-4.1.3-ubi9
+    profiles:
+      - nvidia
    ports:
      # Expose dcgm-exporter on port 9401 both inside and outside the container
      # to avoid conflicts with other dcgm-exporter instances in distributed environments.
@@ -81,7 +85,6 @@ services:
    extra_hosts:
    - "host.docker.internal:host-gateway"
    depends_on:
-      - dcgm-exporter
      - nats-prometheus-exporter

  # Loki - Log aggregation backend
@@ -165,5 +168,4 @@ services:
    depends_on:
      - prometheus
      - tempo
-      - loki
-
+      - loki
\ No newline at end of file
--- a/deploy/observability/grafana-datasources.yml
+++ b/deploy/observability/grafana-datasources.yml
@@ -18,6 +18,7 @@ apiVersion: 1
 datasources:
  - name: prometheus
    type: prometheus
+    uid: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
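Review note: the explicit `uid: prometheus` matters because the new `xpu-smi-metrics.json` dashboard references its datasource as `"uid": "prometheus"` in every panel and target. The following is a minimal sketch, not part of the commit, of how one might check that a dashboard only references provisioned datasource UIDs. The function names and the inline `sample` dashboard are hypothetical; only the `uid` values come from the diff.

```python
from typing import Any, Iterator


def iter_datasource_uids(node: Any) -> Iterator[str]:
    """Yield every datasource uid found anywhere in a dashboard JSON tree."""
    if isinstance(node, dict):
        ds = node.get("datasource")
        if isinstance(ds, dict) and "uid" in ds:
            yield ds["uid"]
        for value in node.values():
            yield from iter_datasource_uids(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_datasource_uids(item)


def unresolved_uids(dashboard: dict, provisioned: set) -> set:
    """Return uids the dashboard uses that are not provisioned.

    Grafana's built-in annotation datasource ("-- Grafana --") is ignored.
    """
    builtin = {"-- Grafana --"}
    return set(iter_datasource_uids(dashboard)) - set(provisioned) - builtin


# Hypothetical, trimmed-down dashboard shaped like xpu-smi-metrics.json.
sample = {
    "annotations": {
        "list": [{"datasource": {"type": "grafana", "uid": "-- Grafana --"}}]
    },
    "panels": [
        {
            "datasource": {"type": "prometheus", "uid": "prometheus"},
            "targets": [
                {"datasource": {"type": "prometheus", "uid": "prometheus"}}
            ],
        }
    ],
}

print(unresolved_uids(sample, {"prometheus"}))  # set()
```

To run it against the real file, load the dashboard with `json.load` and pass the result in place of `sample`.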
--- a/deploy/observability/grafana_dashboards/DASHBOARD_METRICS.md
+++ b/deploy/observability/grafana_dashboards/DASHBOARD_METRICS.md
@@ -259,3 +259,106 @@ The dashboard uses two template variables for flexibility:
 - **Auto-populated**: Dynamically discovers namespaces from frontend pods

 **Usage**: All dashboard queries filter by `namespace="$namespace"` to show metrics for the selected deployment. You can switch between different Dynamo deployments in different namespaces using the namespace dropdown at the top of the dashboard.
+
+---
+
+## Intel XPU-SMI Metrics (from XPU-SMI Exporter)
+
+These metrics come from the Intel XPU-SMI Prometheus exporter (the `xpu-smi-exporter` scrape job). XPU-SMI collects hardware-level Intel GPU metrics and fills the role that DCGM fills for NVIDIA GPUs.
+
+### Setup
+
+Launch the XPU-SMI Prometheus exporter on the host, then start the observability stack with the XPU overlay:
+```bash
+# Install Intel XPU-SMI (xpumanager): https://github.com/intel/xpumanager
+# Start the exporter (serves Prometheus metrics on port 9966)
+python3 deploy/observability/xpu_smi_exporter.py --port 9966 &
+
+# Start base services
+docker compose -f deploy/docker-compose.yml up -d
+
+# Start observability with XPU overlay (uses prometheus-xpu.yml + xpu-alert-rules.yml)
+docker compose -f deploy/docker-observability.yml -f deploy/docker-observability-xpu.yml up -d
+```
+
+### XPU-SMI Dashboard Panels (`xpu-smi-metrics.json`)
+
+| Panel | Metric | Formula | Description |
+|-------|--------|---------|-------------|
+| **XPU Compute Utilization** | `xpu_engine_group_compute_engine_util` | Raw value (0-100%) | Compute engine utilization per XPU device. Equivalent to `DCGM_FI_DEV_GPU_UTIL` |
+| **XPU Memory Usage** | `xpu_memory_used_bytes`, `xpu_memory_free_bytes` | Raw values | HBM/VRAM used and free bytes per XPU device. Equivalent to `DCGM_FI_DEV_FB_USED/FB_FREE` |
+| **XPU Temperature** | `xpu_temperature_celsius` | Raw value (°C) | GPU die and memory temperature. Labels: `location="gpu"` or `location="memory"`. Thresholds: yellow@70°C, red@85°C |
+| **XPU Power Usage** | `xpu_power_watts` | Raw value (W) | Instantaneous power draw per XPU device. Equivalent to `DCGM_FI_DEV_POWER_USAGE` |
+| **XPU Engine Utilization** | `xpu_engine_group_compute_engine_util`, `xpu_engine_group_copy_engine_util`, `xpu_engine_group_render_engine_util` | Raw values (0-100%) | Per-engine-group utilization breakdown |
+| **XPU Memory Bandwidth** | `xpu_memory_read_bytes_per_second`, `xpu_memory_write_bytes_per_second` | Raw values (bytes/sec) | HBM read/write bandwidth per XPU device |
+| **XPU PCIe Bandwidth** | `xpu_pcie_read_bytes_per_second`, `xpu_pcie_write_bytes_per_second` | Raw values (bytes/sec) | PCIe read/write bandwidth per XPU device. Equivalent to `DCGM_FI_PROF_PCIE_RX/TX_BYTES` |
+| **Avg XPU Utilization** | `xpu_engine_group_compute_engine_util` | `avg(...)` | Average utilization gauge across all XPU devices |
+| **Max XPU Temperature** | `xpu_temperature_celsius{location="gpu"}` | `max(...)` | Maximum temperature gauge across all XPU devices |
+
+### XPU vs NVIDIA DCGM Metric Mapping
+
+| NVIDIA DCGM Metric | Intel XPU-SMI Metric | Description |
+|---|---|---|
+| `DCGM_FI_DEV_GPU_UTIL` | `xpu_engine_group_compute_engine_util` | Compute utilization % |
+| `DCGM_FI_DEV_FB_USED` | `xpu_memory_used_bytes` | Memory used |
+| `DCGM_FI_DEV_FB_FREE` | `xpu_memory_free_bytes` | Memory free |
+| `DCGM_FI_DEV_GPU_TEMP` | `xpu_temperature_celsius{location="gpu"}` | GPU temperature |
+| `DCGM_FI_DEV_MEMORY_TEMP` | `xpu_temperature_celsius{location="memory"}` | Memory temperature |
+| `DCGM_FI_DEV_POWER_USAGE` | `xpu_power_watts` | Power draw (W) |
+| `DCGM_FI_PROF_PCIE_RX_BYTES` | `xpu_pcie_read_bytes_per_second` | PCIe RX bytes/sec |
+| `DCGM_FI_PROF_PCIE_TX_BYTES` | `xpu_pcie_write_bytes_per_second` | PCIe TX bytes/sec |
+
+### Metric Architecture (XPU)
+
+```text
+┌─────────────────┐
+│  Intel XPU-SMI  │ ──► xpu_* metrics (Prometheus port :9966)
+│    Exporter     │     ├─ xpu_engine_group_compute_engine_util
+│  (host process) │     ├─ xpu_memory_used_bytes / free_bytes
+└─────────────────┘     ├─ xpu_temperature_celsius
+                        ├─ xpu_power_watts
+                        └─ xpu_pcie_*_bytes_per_second
+
+           ▼
+   ┌─────────────────┐
+   │  Prometheus     │ ◄─── scrape job: xpu-smi-exporter (port 9966)
+   │  (monitoring)   │
+   └─────────────────┘
+           ▼
+   ┌─────────────────┐
+   │    Grafana      │ ◄─── xpu-smi-metrics.json dashboard
+   │  (monitoring)   │
+   └─────────────────┘
+```
+
+### XPU Alert Rules (`xpu-alert-rules.yml`)
+
+| Alert | Condition | Severity | Description |
+|-------|-----------|----------|-------------|
+| `XPUHighTemperature` | temp > 85°C for 2m | warning | XPU GPU die overheating |
+| `XPUCriticalTemperature` | temp > 95°C for 30s | critical | Immediate risk of thermal throttle/shutdown |
+| `XPUMemoryAlmostFull` | mem > 90% for 1m | warning | KV cache allocation may fail |
+| `XPUMemoryCritical` | mem > 98% for 30s | critical | OOM imminent |
+| `XPUHighPowerDraw` | power > 400W for 5m | warning | Sustained high power draw |
+| `XPUExporterDown` | `up{job="xpu-smi-exporter"} == 0` for 1m | critical | Monitoring blind spot |
+| `XPULowComputeUtilizationDuringLoad` | util < 10% during active traffic (`rate()` > 0) | warning | Possible dispatch issue |
+| `XPUWorkerLivenessLost` | no XPU metrics + active traffic (`rate()` > 0) | critical | XPU worker crash suspected |
+
+### Troubleshooting XPU Metrics
+
+#### XPU metrics not showing in Prometheus:
+1. Verify XPU-SMI exporter is running: `curl http://localhost:9966/metrics | grep xpu_`
+2. Check Intel GPU is visible: `xpu-smi discovery`
+3. Verify scrape job in Prometheus UI: Status → Targets → `xpu-smi-exporter`
+4. Check firewall: `sudo ufw allow 9966/tcp`
+
+#### XPU device not detected:
+```bash
+# Check device visibility in container
+ls /dev/dri/
+# Should show renderD128, card0, etc.
+
+# Verify XPU-SMI can see the device
+xpu-smi discovery
+# Expected: lists Intel GPU devices with model name, driver version
+```
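Review note: the first troubleshooting step (`curl ... | grep xpu_`) only confirms that some `xpu_*` series exist. A slightly stricter check compares the exporter output against the metric names the dashboard table documents. The sketch below is an illustration, not code from the commit: the helper names and the sample payload are hypothetical, and the parser handles only the basic `name{labels} value` line shape of the Prometheus text format.

```python
import re

# Metric names taken from the dashboard table in DASHBOARD_METRICS.md.
EXPECTED = {
    "xpu_engine_group_compute_engine_util",
    "xpu_engine_group_copy_engine_util",
    "xpu_engine_group_render_engine_util",
    "xpu_memory_used_bytes",
    "xpu_memory_free_bytes",
    "xpu_temperature_celsius",
    "xpu_power_watts",
    "xpu_memory_read_bytes_per_second",
    "xpu_memory_write_bytes_per_second",
    "xpu_pcie_read_bytes_per_second",
    "xpu_pcie_write_bytes_per_second",
}

# A sample line starts with the metric name; labels and value follow.
_NAME = re.compile(r"^([a-zA-Z_:][a-zA-Z0-9_:]*)")


def metric_names(exposition_text: str) -> set:
    """Collect metric names from Prometheus text-format output."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP/TYPE comments
            continue
        match = _NAME.match(line)
        if match:
            names.add(match.group(1))
    return names


def missing_metrics(exposition_text: str, expected=EXPECTED) -> list:
    """Return the expected metric names absent from the payload, sorted."""
    return sorted(set(expected) - metric_names(exposition_text))


# Hypothetical payload; the values are made up for illustration.
sample = """# HELP xpu_power_watts Power draw
# TYPE xpu_power_watts gauge
xpu_power_watts{device_id="0"} 123.0
xpu_temperature_celsius{device_id="0",location="gpu"} 55
"""

wanted = {"xpu_power_watts", "xpu_temperature_celsius", "xpu_memory_used_bytes"}
print(missing_metrics(sample, wanted))  # ['xpu_memory_used_bytes']
```

Against a live exporter, the payload could be fetched with `urllib.request.urlopen("http://localhost:9966/metrics").read().decode()` and passed to `missing_metrics` with the default `EXPECTED` set.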
--- a/deploy/observability/grafana_dashboards/xpu-smi-metrics.json
+++ b/deploy/observability/grafana_dashboards/xpu-smi-metrics.json
+{
+  "_copyright": "SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. SPDX-License-Identifier: Apache-2.0",
+  "annotations": {
+    "list": [
+      {
+        "builtIn": 1,
+        "datasource": {
+          "type": "grafana",
+          "uid": "-- Grafana --"
+        },
+        "enable": true,
+        "hide": true,
+        "iconColor": "rgba(0, 211, 255, 1)",
+        "name": "Annotations & Alerts",
+        "type": "dashboard"
+      }
+    ]
+  },
+  "editable": true,
+  "fiscalYearStartMonth": 0,
+  "graphTooltip": 0,
+  "id": 10,
+  "links": [],
+  "liveNow": false,
+  "panels": [
+    {
+      "datasource": {
+        "type": "prometheus",
+        "uid": "prometheus"
+      },
+      "fieldConfig": {
+        "defaults": {
+          "color": {
+            "mode": "palette-classic"
+          },
+          "custom": {
+            "axisCenteredZero": false,
+            "axisColorMode": "text",
+            "axisLabel": "",
+            "axisPlacement": "auto",
+            "barAlignment": 0,
+            "drawStyle": "line",
+            "fillOpacity": 20,
+            "gradientMode": "none",
+            "hideFrom": {
+              "legend": false,
+              "tooltip": false,
+              "viz": false
+            },
+            "lineInterpolation": "smooth",
+            "lineWidth": 2,
+            "pointSize": 5,
+            "scaleDistribution": {
+              "type": "linear"
+            },
+            "showPoints": "never",
+            "spanNulls": false,
+            "stacking": {
+              "group": "A",
+              "mode": "none"
+            },
+            "thresholdsStyle": {
+              "mode": "off"
+            }
+          },
+          "mappings": [],
+          "thresholds": {
+            "mode": "absolute",
+            "steps": [
+              {
+                "color": "green",
+                "value": null
+              },
+              {
+                "color": "red",
+                "value": 80
+              }
+            ]
+          },
+          "unit": "percent",
+          "min": 0,
+          "max": 100
+        },
+        "overrides": []
+      },
+      "gridPos": {
+        "h": 8,
+        "w": 12,
+        "x": 0,
+        "y": 0
+      },
+      "id": 1,
+      "options": {
+        "legend": {
+          "calcs": [
+            "mean",
+            "max"
+          ],
+          "displayMode": "table",
+          "placement": "right",
+          "showLegend": true
+        },
+        "tooltip": {
+          "mode": "single",
+          "sort": "none"
+        }
+      },
+      "targets": [
+        {
+          "datasource": {
+            "type": "prometheus",
+            "uid": "prometheus"
+          },
+          "editorMode": "code",
+          "expr": "xpu_engine_group_compute_engine_util{device_id=~\"$device_id\"}",
+          "legendFormat": "XPU {{device_id}} Compute",
+          "range": true,
+          "refId": "A"
+        }
+      ],
+      "title": "XPU Compute Utilization",
+      "type": "timeseries"
+    },
+    {
+      "datasource": {
+        "type": "prometheus",
+        "uid": "prometheus"
+      },
+      "fieldConfig": {
+        "defaults": {
+          "color": {
+            "mode": "palette-classic"
+          },
+          "custom": {
+            "axisCenteredZero": false,
+            "axisColorMode": "text",
+            "axisLabel": "",
+            "axisPlacement": "auto",
+            "barAlignment": 0,
+            "drawStyle": "line",
+            "fillOpacity": 20,
+            "gradientMode": "none",
+            "hideFrom": {
+              "legend": false,
+              "tooltip": false,
+              "viz": false
+            },
+            "lineInterpolation": "smooth",
+            "lineWidth": 2,
+            "pointSize": 5,
+            "scaleDistribution": {
+              "type": "linear"
+            },
+            "showPoints": "never",
+            "spanNulls": false,
+            "stacking": {
+              "group": "A",
+              "mode": "none"
+            },
+            "thresholdsStyle": {
+              "mode": "off"
+            }
+          },
+          "mappings": [],
+          "thresholds": {
+            "mode": "absolute",
+            "steps": [
+              {
+                "color": "green",
+                "value": null
+              },
+              {
+                "color": "red",
+                "value": 80
+              }
+            ]
+          },
+          "unit": "bytes"
+        },
+        "overrides": []
+      },
+      "gridPos": {
+        "h": 8,
+        "w": 12,
+        "x": 12,
+        "y": 0
+      },
+      "id": 2,
+      "options": {
+        "legend": {
+          "calcs": [
+            "mean",
+            "max"
+          ],
+          "displayMode": "table",
+          "placement": "right",
+          "showLegend": true
+        },
+        "tooltip": {
+          "mode": "single",
+          "sort": "none"
+        }
+      },
+      "targets": [
+        {
+          "datasource": {
+            "type": "prometheus",
+            "uid": "prometheus"
+          },
+          "editorMode": "code",
+          "expr": "xpu_memory_used_bytes{device_id=~\"$device_id\"}",
+          "legendFormat": "XPU {{device_id}} Used",
+          "range": true,
+          "refId": "A"
+        },
+        {
+          "datasource": {
+            "type": "prometheus",
+            "uid": "prometheus"
+          },
+          "editorMode": "code",
+          "expr": "xpu_memory_free_bytes{device_id=~\"$device_id\"}",
+          "legendFormat": "XPU {{device_id}} Free",
+          "range": true,
+          "refId": "B"
+        }
+      ],
+      "title": "XPU Memory Usage",
+      "type": "timeseries"
+    },
+    {
+      "datasource": {
+        "type": "prometheus",
+        "uid": "prometheus"
+      },
+      "fieldConfig": {
+        "defaults": {
+          "color": {
+            "mode": "palette-classic"
+          },
+          "custom": {
+            "axisCenteredZero": false,
+            "axisColorMode": "text",
+            "axisLabel": "",
+            "axisPlacement": "auto",
+            "barAlignment": 0,
+            "drawStyle": "line",
+            "fillOpacity": 20,
+            "gradientMode": "none",
+            "hideFrom": {
+              "legend": false,
+              "tooltip": false,
+              "viz": false
+            },
+            "lineInterpolation": "smooth",
+            "lineWidth": 2,
+            "pointSize": 5,
+            "scaleDistribution": {
+              "type": "linear"
+            },
+            "showPoints": "never",
+            "spanNulls": false,
+            "stacking": {
+              "group": "A",
+              "mode": "none"
+            },
+            "thresholdsStyle": {
+              "mode": "off"
+            }
+          },
+          "mappings": [],
+          "thresholds": {
+            "mode": "absolute",
+            "steps": [
+              {
+                "color": "green",
+                "value": null
+              },
+              {
+                "color": "yellow",
+                "value": 70
+              },
+              {
+                "color": "red",
+                "value": 85
+              }
+            ]
+          },
+          "unit": "celsius"
+        },
+        "overrides": []
+      },
+      "gridPos": {
+        "h": 8,
+        "w": 12,
+        "x": 0,
+        "y": 8
+      },
+      "id": 3,
+      "options": {
+        "legend": {
+          "calcs": [
+            "mean",
+            "max"
+          ],
+          "displayMode": "table",
+          "placement": "right",
+          "showLegend": true
+        },
+        "tooltip": {
+          "mode": "single",
+          "sort": "none"
+        }
+      },
+      "targets": [
+        {
+          "datasource": {
+            "type": "prometheus",
+            "uid": "prometheus"
+          },
+          "editorMode": "code",
+          "expr": "xpu_temperature_celsius{device_id=~\"$device_id\", location=\"gpu\"}",
+          "legendFormat": "XPU {{device_id}} GPU Temp",
+          "range": true,
+          "refId": "A"
+        },
+        {
+          "datasource": {
+            "type": "prometheus",
+            "uid": "prometheus"
+          },
+          "editorMode": "code",
+          "expr": "xpu_temperature_celsius{device_id=~\"$device_id\", location=\"memory\"}",
+          "legendFormat": "XPU {{device_id}} Memory Temp",
+          "range": true,
+          "refId": "B"
+        }
+      ],
+      "title": "XPU Temperature",
+      "type": "timeseries"
+    },
+    {
+      "datasource": {
+        "type": "prometheus",
+        "uid": "prometheus"
+      },
+      "fieldConfig": {
+        "defaults": {
+          "color": {
+            "mode": "palette-classic"
+          },
+          "custom": {
+            "axisCenteredZero": false,
+            "axisColorMode": "text",
+            "axisLabel": "",
+            "axisPlacement": "auto",
+            "barAlignment": 0,
+            "drawStyle": "line",
+            "fillOpacity": 20,
+            "gradientMode": "none",
+            "hideFrom": {
+              "legend": false,
+              "tooltip": false,
+              "viz": false
+            },
+            "lineInterpolation": "smooth",
+            "lineWidth": 2,
+            "pointSize": 5,
+            "scaleDistribution": {
+              "type": "linear"
+            },
+            "showPoints": "never",
+            "spanNulls": false,
+            "stacking": {
+              "group": "A",
+              "mode": "none"
+            },
+            "thresholdsStyle": {
+              "mode": "off"
+            }
+          },
+          "mappings": [],
+          "thresholds": {
+            "mode": "absolute",
+            "steps": [
+              {
+                "color": "green",
+                "value": null
+              },
+              {
+                "color": "red",
+                "value": 80
+              }
+            ]
+          },
+          "unit": "watt"
+        },
+        "overrides": []
+      },
+      "gridPos": {
+        "h": 8,
+        "w": 12,
+        "x": 12,
+        "y": 8
+      },
+      "id": 4,
+      "options": {
+        "legend": {
+          "calcs": [
+            "mean",
+            "max"
+          ],
+          "displayMode": "table",
+          "placement": "right",
+          "showLegend": true
+        },
+        "tooltip": {
+          "mode": "single",
+          "sort": "none"
+        }
+      },
+      "targets": [
+        {
+          "datasource": {
+            "type": "prometheus",
+            "uid": "prometheus"
+          },
+          "editorMode": "code",
+          "expr": "xpu_power_watts{device_id=~\"$device_id\"}",
+          "legendFormat": "XPU {{device_id}} Power",
+          "range": true,
+          "refId": "A"
+        }
+      ],
+      "title": "XPU Power Usage",
+      "type": "timeseries"
+    },
+    {
+      "datasource": {
+        "type": "prometheus",
+        "uid": "prometheus"
+      },
+      "fieldConfig": {
+        "defaults": {
+          "color": {
+            "mode": "palette-classic"
+          },
+          "custom": {
+            "axisCenteredZero": false,
+            "axisColorMode": "text",
+            "axisLabel": "",
+            "axisPlacement": "auto",
+            "barAlignment": 0,
+            "drawStyle": "line",
+            "fillOpacity": 20,
+            "gradientMode": "none",
+            "hideFrom": {
+              "legend": false,
+              "tooltip": false,
+              "viz": false
+            },
+            "lineInterpolation": "smooth",
+            "lineWidth": 2,
+            "pointSize": 5,
+            "scaleDistribution": {
+              "type": "linear"
+            },
+            "showPoints": "never",
+            "spanNulls": false,
+            "stacking": {
+              "group": "A",
+              "mode": "none"
+            },
+            "thresholdsStyle": {
+              "mode": "off"
+            }
+          },
+          "mappings": [],
+          "thresholds": {
+            "mode": "absolute",
+            "steps": [
+              {
+                "color": "green",
+                "value": null
+              },
+              {
+                "color": "red",
+                "value": 80
+              }
+            ]
+          },
+          "unit": "percent",
+          "min": 0,
+          "max": 100
+        },
+        "overrides": []
+      },
+      "gridPos": {
+        "h": 8,
+        "w": 12,
+        "x": 0,
+        "y": 16
+      },
+      "id": 5,
+      "options": {
+        "legend": {
+          "calcs": [
+            "mean",
+            "max"
+          ],
+          "displayMode": "table",
+          "placement": "right",
+          "showLegend": true
+        },
+        "tooltip": {
+          "mode": "single",
+          "sort": "none"
+        }
+      },
+      "targets": [
+        {
+          "datasource": {
+            "type": "prometheus",
+            "uid": "prometheus"
+          },
+          "editorMode": "code",
+          "expr": "xpu_engine_group_compute_engine_util{device_id=~\"$device_id\"}",
+          "legendFormat": "XPU {{device_id}} Compute Engine",
+          "range": true,
+          "refId": "A"
+        },
+        {
+          "datasource": {
+            "type": "prometheus",
+            "uid": "prometheus"
+          },
+          "editorMode": "code",
+          "expr": "xpu_engine_group_copy_engine_util{device_id=~\"$device_id\"}",
+          "legendFormat": "XPU {{device_id}} Copy Engine",
+          "range": true,
+          "refId": "B"
+        },
+        {
+          "datasource": {
+            "type": "prometheus",
+            "uid": "prometheus"
+          },
+          "editorMode": "code",
+          "expr": "xpu_engine_group_render_engine_util{device_id=~\"$device_id\"}",
+          "legendFormat": "XPU {{device_id}} Render Engine",
+          "range": true,
+          "refId": "C"
+        }
+      ],
+      "title": "XPU Engine Utilization",
+      "type": "timeseries"
+    },
+    {
+      "datasource": {
+        "type": "prometheus",
+        "uid": "prometheus"
+      },
+      "fieldConfig": {
+        "defaults": {
+          "color": {
+            "mode": "palette-classic"
+          },
+          "custom": {
+            "axisCenteredZero": false,
+            "axisColorMode": "text",
+            "axisLabel": "",
+            "axisPlacement": "auto",
+            "barAlignment": 0,
+            "drawStyle": "line",
+            "fillOpacity": 20,
+            "gradientMode": "none",
+            "hideFrom": {
+              "legend": false,
+              "tooltip": false,
+              "viz": false
+            },
+            "lineInterpolation": "smooth",
+            "lineWidth": 2,
+            "pointSize": 5,
+            "scaleDistribution": {
+              "type": "linear"
+            },
+            "showPoints": "never",
+            "spanNulls": false,
+            "stacking": {
+              "group": "A",
+              "mode": "none"
+            },
+            "thresholdsStyle": {
+              "mode": "off"
+            }
+          },
+          "mappings": [],
+          "thresholds": {
+            "mode": "absolute",
+            "steps": [
+              {
+                "color": "green",
+                "value": null
+              },
+              {
+                "color": "red",
+                "value": 80
+              }
+            ]
+          },
+          "unit": "binBps"
+        },
+        "overrides": []
+      },
+      "gridPos": {
+        "h": 8,
+        "w": 12,
+        "x": 12,
+        "y": 16
+      },
+      "id": 6,
+      "options": {
+        "legend": {
+          "calcs": [
+            "mean",
+            "max"
+          ],
+          "displayMode": "table",
+          "placement": "right",
+          "showLegend": true
+        },
+        "tooltip": {
+          "mode": "single",
+          "sort": "none"
+        }
+      },
+      "targets": [
+        {
+          "datasource": {
+            "type": "prometheus",
+            "uid": "prometheus"
+          },
+          "editorMode": "code",
+          "expr": "xpu_memory_read_bytes_per_second{device_id=~\"$device_id\"}",
+          "legendFormat": "XPU {{device_id}} Read BW",
+          "range": true,
+          "refId": "A"
+        },
+        {
+          "datasource": {
+            "type": "prometheus",
+            "uid": "prometheus"
+          },
+          "editorMode": "code",
+          "expr": "xpu_memory_write_bytes_per_second{device_id=~\"$device_id\"}",
+          "legendFormat": "XPU {{device_id}} Write BW",
+          "range": true,
+          "refId": "B"
+        }
+      ],
+      "title": "XPU Memory Bandwidth",
+      "type": "timeseries"
+    },
+    {
+      "datasource": {
+        "type": "prometheus",
+        "uid": "prometheus"
+      },
+      "fieldConfig": {
+        "defaults": {
+          "color": {
+            "mode": "palette-classic"
+          },
+          "custom": {
+            "axisCenteredZero": false,
+            "axisColorMode": "text",
+            "axisLabel": "",
+            "axisPlacement": "auto",
+            "barAlignment": 0,
+            "drawStyle": "line",
+            "fillOpacity": 20,
+            "gradientMode": "none",
+            "hideFrom": {
+              "legend": false,
+              "tooltip": false,
+              "viz": false
+            },
+            "lineInterpolation": "smooth",
+            "lineWidth": 2,
+            "pointSize": 5,
+            "scaleDistribution": {
+              "type": "linear"
+            },
+            "showPoints": "never",
+            "spanNulls": false,
+            "stacking": {
+              "group": "A",
+              "mode": "none"
+            },
+            "thresholdsStyle": {
+              "mode": "off"
+            }
+          },
+          "mappings": [],
+          "thresholds": {
+            "mode": "absolute",
+            "steps": [
+              {
+                "color": "green",
+                "value": null
+              },
+              {
+                "color": "red",
+                "value": 80
+              }
+            ]
+          },
+          "unit": "binBps"
+        },
+        "overrides": []
+      },
+      "gridPos": {
+        "h": 8,
+        "w": 12,
+        "x": 0,
+        "y": 24
+      },
+      "id": 7,
+      "options": {
+        "legend": {
+          "calcs": [
+            "mean",
+            "max"
+          ],
+          "displayMode": "table",
+          "placement": "right",
+          "showLegend": true
+        },
+        "tooltip": {
+          "mode": "single",
+          "sort": "none"
+        }
+      },
+      "targets": [
+        {
+          "datasource": {
+            "type": "prometheus",
+            "uid": "prometheus"
+          },
+          "editorMode": "code",
+          "expr": "xpu_pcie_read_bytes_per_second{device_id=~\"$device_id\"}",
+          "legendFormat": "XPU {{device_id}} PCIe RX",
+          "range": true,
+          "refId": "A"
+        },
+        {
+          "datasource": {
+            "type": "prometheus",
+            "uid": "prometheus"
+          },
+          "editorMode": "code",
+          "expr": "xpu_pcie_write_bytes_per_second{device_id=~\"$device_id\"}",
+          "legendFormat": "XPU {{device_id}} PCIe TX",
+          "range": true,
+          "refId": "B"
+        }
+      ],
+      "title": "XPU PCIe Bandwidth",
+      "type": "timeseries"
+    },
+    {
+      "datasource": {
+        "type": "prometheus",
+        "uid": "prometheus"
+      },
+      "fieldConfig": {
+        "defaults": {
+          "color": {
+            "mode": "thresholds"
+          },
+          "mappings": [],
+          "thresholds": {
+            "mode": "absolute",
+            "steps": [
+              {
+                "color": "green",
+                "value": null
+              },
+              {
+                "color": "yellow",
+                "value": 50
+              },
+              {
+                "color": "red",
+                "value": 80
+              }
+            ]
+          },
+          "unit": "percent",
+          "min": 0,
+          "max": 100
+        },
+        "overrides": []
+      },
+      "gridPos": {
+        "h": 8,
+        "w": 6,
+        "x": 12,
+        "y": 24
+      },
+      "id": 8,
+      "options": {
+        "orientation": "auto",
+        "reduceOptions": {
+          "calcs": [
+            "lastNotNull"
+          ],
+          "fields": "",
+          "values": false
+        },
+        "showThresholdLabels": false,
+        "showThresholdMarkers": true
+      },
+      "targets": [
+        {
+          "datasource": {
+            "type": "prometheus",
+            "uid": "prometheus"
+          },
+          "editorMode": "code",
+          "expr": "avg(xpu_engine_group_compute_engine_util{device_id=~\"$device_id\"})",
+          "legendFormat": "__auto",
+          "range": true,
+          "refId": "A"
+        }
+      ],
+      "title": "Avg XPU Utilization",
+      "type": "gauge"
+    },
+    {
+      "datasource": {
+        "type": "prometheus",
+        "uid": "prometheus"
+      },
+      "fieldConfig": {
+        "defaults": {
+          "color": {
+            "mode": "thresholds"
+          },
+          "mappings": [],
+          "thresholds": {
+            "mode": "absolute",
+            "steps": [
+              {
+                "color": "green",
+                "value": null
+              },
+              {
+                "color": "yellow",
+                "value": 70
+              },
+              {
+                "color": "red",
+                "value": 85
+              }
+            ]
+          },
+          "unit": "celsius"
+        },
+        "overrides": []
+      },
+      "gridPos": {
+        "h": 8,
+        "w": 6,
+        "x": 18,
+        "y": 24
+      },
+      "id": 9,
+      "options": {
+        "orientation": "auto",
+        "reduceOptions": {
+          "calcs": [
+            "lastNotNull"
+          ],
+          "fields": "",
+          "values": false
+        },
+        "showThresholdLabels": false,
+        "showThresholdMarkers": true
+      },
+      "targets": [
+        {
+          "datasource": {
+            "type": "prometheus",
+            "uid": "prometheus"
+          },
+          "editorMode": "code",
+          "expr": "max(xpu_temperature_celsius{device_id=~\"$device_id\", location=\"gpu\"})",
+          "legendFormat": "__auto",
+          "range": true,
+          "refId": "A"
+        }
+      ],
+      "title": "Max XPU Temperature",
+      "type": "gauge"
+    }
+  ],
+  "refresh": "5s",
+  "schemaVersion": 36,
+  "style": "dark",
+  "tags": [
+    "xpu",
+    "intel",
+    "xpu-smi"
+  ],
+  "templating": {
+    "list": [
+      {
+        "allValue": ".*",
+        "current": {
+          "selected": true,
+          "text": "All",
+          "value": "$__all"
+        },
+        "datasource": {
+          "type": "prometheus",
+          "uid": "prometheus"
+        },
+        "definition": "label_values(xpu_frequency_mhz, device_id)",
+        "hide": 0,
+        "includeAll": true,
+        "multi": true,
+        "name": "device_id",
+        "options": [],
+        "query": {
+          "query": "label_values(xpu_frequency_mhz, device_id)",
+          "refId": "StandardVariableQuery"
+        },
+        "refresh": 1,
+        "regex": "",
+        "sort": 1,
+        "type": "query",
+        "label": "XPU Device"
+      }
+    ]
+  },
+  "time": {
+    "from": "now-30m",
+    "to": "now"
+  },
+  "timepicker": {},
+  "timezone": "",
+  "title": "Intel XPU-SMI Monitoring Dashboard",
+  "uid": "xpu-smi-dashboard",
+  "version": 1,
+  "weekStart": ""
+}
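Review note: every panel in the dashboard above filters with `device_id=~"$device_id"`, so the behavior of the `device_id` variable matters. The following is a minimal Python sketch of how a selection is assumed to become a matcher. It assumes that Grafana joins multi-value selections into a regex alternation for Prometheus queries and that Prometheus regex matchers are fully anchored. `expand_device_var` and `matches` are hypothetical helpers for illustration and are not part of this PR.

```python
import re
from typing import Sequence

ALL_VALUE = ".*"  # "allValue" from the dashboard's templating block


def expand_device_var(selected: Sequence[str], all_selected: bool = False) -> str:
    """Return the regex assumed to be substituted for $device_id."""
    if all_selected:
        return ALL_VALUE
    # Assumption: multi-value selections are joined as regex alternatives.
    return "(" + "|".join(re.escape(v) for v in selected) + ")"


def matches(device_id: str, pattern: str) -> bool:
    """Emulate an anchored =~ matcher with re.fullmatch."""
    return re.fullmatch(pattern, device_id) is not None


if __name__ == "__main__":
    pattern = expand_device_var(["0", "1"])
    print(pattern)
    print(matches("1", pattern), matches("10", pattern))
```

Because the match is anchored, selecting devices 0 and 1 should not accidentally include a device with ID 10.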
--- a/deploy/observability/prometheus-xpu.yml
+++ b/deploy/observability/prometheus-xpu.yml
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# Prometheus configuration for Intel XPU environments.
+# Used via: docker compose -f deploy/docker-observability.yml -f deploy/docker-observability-xpu.yml up -d
+# Requires xpu_smi_exporter.py running on the host:
+#   python3 deploy/observability/xpu_smi_exporter.py --port 9966
+
+global:
+  scrape_interval: 10s
+  evaluation_interval: 10s
+
+rule_files:
+  - "/etc/prometheus/xpu-alert-rules.yml"
+
+scrape_configs:
+  - job_name: 'nats-prometheus-exporter'
+    scrape_interval: 2s
+    static_configs:
+      - targets: ['nats-prometheus-exporter:7777']
+
+  - job_name: 'etcd-server'
+    scrape_interval: 2s
+    static_configs:
+      - targets: ['etcd-server:2379']
+
+  - job_name: 'dynamo-frontend'
+    scrape_interval: 10s
+    static_configs:
+      - targets: ['host.docker.internal:8000']
+
+  - job_name: 'dynamo-backend'
+    scrape_interval: 6s
+    static_configs:
+      - targets: ['host.docker.internal:8081']
+
+  - job_name: 'kvbm-metrics'
+    scrape_interval: 2s
+    static_configs:
+      - targets: ['host.docker.internal:6880']
+
+  - job_name: 'xpu-smi-exporter'
+    scrape_interval: 5s
+    scrape_timeout: 5s
+    static_configs:
+      - targets: ['host.docker.internal:9966']
+    relabel_configs:
+      - source_labels: [__address__]
+        target_label: instance
+        replacement: 'xpu-host'
--- a/deploy/observability/xpu-alert-rules.yml
+++ b/deploy/observability/xpu-alert-rules.yml
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+groups:
+  - name: xpu_health
+    interval: 15s
+    rules:
+      - alert: XPUHighTemperature
+        expr: xpu_temperature_celsius{location="gpu"} > 85
+        for: 2m
+        labels:
+          severity: warning
+        annotations:
+          summary: "Intel XPU temperature too high"
+          description: "XPU device {{ $labels.device_id }} GPU temperature is {{ $value | printf \"%.1f\" }}°C (threshold: 85°C)"
+
+      - alert: XPUCriticalTemperature
+        expr: xpu_temperature_celsius{location="gpu"} > 95
+        for: 30s
+        labels:
+          severity: critical
+        annotations:
+          summary: "Intel XPU temperature critical"
+          description: "XPU device {{ $labels.device_id }} GPU temperature is {{ $value | printf \"%.1f\" }}°C — immediate action required"
+
+      - alert: XPUMemoryAlmostFull
+        expr: xpu_memory_used_bytes / (xpu_memory_used_bytes + xpu_memory_free_bytes) > 0.90
+        for: 1m
+        labels:
+          severity: warning
+        annotations:
+          summary: "Intel XPU memory usage above 90%"
+          description: "XPU device {{ $labels.device_id }} memory usage is {{ $value | humanizePercentage }}"
+
+      - alert: XPUMemoryCritical
+        expr: xpu_memory_used_bytes / (xpu_memory_used_bytes + xpu_memory_free_bytes) > 0.98
+        for: 30s
+        labels:
+          severity: critical
+        annotations:
+          summary: "Intel XPU memory usage critical (>98%)"
+          description: "XPU device {{ $labels.device_id }} memory is almost exhausted: {{ $value | humanizePercentage }} used"
+
+      - alert: XPUHighPowerDraw
+        expr: xpu_power_watts > 400
+        for: 5m
+        labels:
+          severity: warning
+        annotations:
+          summary: "Intel XPU sustained high power draw"
+          description: "XPU device {{ $labels.device_id }} power draw is {{ $value | printf \"%.1f\" }}W for over 5 minutes"
+
+      - alert: XPUExporterDown
+        expr: up{job="xpu-smi-exporter"} == 0
+        for: 1m
+        labels:
+          severity: critical
+        annotations:
+          summary: "Intel XPU-SMI exporter is down"
+          description: "Cannot scrape XPU metrics from {{ $labels.instance }}. XPU health monitoring is unavailable."
+
+  - name: xpu_sla
+    interval: 30s
+    rules:
+      - alert: XPULowComputeUtilizationDuringLoad
+        expr: |
+          xpu_engine_group_compute_engine_util < 10
+          and on() sum(rate(dynamo_frontend_requests_total[5m])) > 0
+        for: 5m
+        labels:
+          severity: warning
+        annotations:
+          summary: "XPU compute utilization low while requests are active"
+          description: "XPU device {{ $labels.device_id }} compute utilization is only {{ $value | printf \"%.1f\" }}% despite active frontend traffic. Possible scheduling or dispatch issue."
+
+      - alert: XPUWorkerLivenessLost
+        expr: |
+          absent(xpu_engine_group_compute_engine_util)
+          and on() sum(rate(dynamo_frontend_requests_total[5m])) > 0
+        for: 2m
+        labels:
+          severity: critical
+        annotations:
+          summary: "XPU worker liveness lost — no XPU metrics while serving requests"
+          description: "No XPU metrics are being reported while Dynamo frontend is receiving requests. XPU worker may have crashed."
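Review note: the two memory alerts above share the expression `used / (used + free)` and differ only in threshold (0.90 and 0.98). The Python sketch below reproduces that arithmetic so the thresholds can be sanity-checked against real device sizes. It ignores the `for:` hold durations, and the helper names are hypothetical, not part of this PR.

```python
from typing import Optional

WARNING_RATIO = 0.90   # XPUMemoryAlmostFull threshold
CRITICAL_RATIO = 0.98  # XPUMemoryCritical threshold


def memory_usage_ratio(used_bytes: float, free_bytes: float) -> float:
    """Same ratio as the alert expression: used / (used + free)."""
    total = used_bytes + free_bytes
    if total <= 0:
        raise ValueError("used + free must be positive")
    return used_bytes / total


def memory_alert_severity(used_bytes: float, free_bytes: float) -> Optional[str]:
    """Return the highest severity whose threshold is exceeded, or None.

    In Prometheus both rules are evaluated independently, so above 0.98
    the warning and the critical alert would both be pending or firing;
    this helper only reports the more severe one.
    """
    ratio = memory_usage_ratio(used_bytes, free_bytes)
    if ratio > CRITICAL_RATIO:
        return "critical"
    if ratio > WARNING_RATIO:
        return "warning"
    return None


if __name__ == "__main__":
    gib = 1024 ** 3
    print(memory_alert_severity(15 * gib, 1 * gib))
    print(memory_alert_severity(63 * gib, 1 * gib))
```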
--- a/deploy/observability/xpu_smi_exporter.py
+++ b/deploy/observability/xpu_smi_exporter.py
+#!/usr/bin/env python3
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+
+"""Intel XPU-SMI Prometheus Exporter.
+
+Collects Intel GPU metrics via xpu-smi and exposes them in Prometheus format
+on a configurable HTTP port (default: 9966).
+
+Usage:
+    python xpu_smi_exporter.py [--port 9966] [--interval 5]
+
+Metrics exposed (matching the Grafana dashboard xpu-smi-metrics.json):
+    xpu_gpu_utilization_percent    - GPU utilization %
+    xpu_power_watts                - GPU power draw in watts
+    xpu_frequency_mhz              - GPU core frequency in MHz
+    xpu_memory_used_bytes          - GPU memory used in bytes
+    xpu_memory_free_bytes          - GPU memory free in bytes (total - used, from stats)
+    xpu_memory_utilization_percent - GPU memory utilization %
+    xpu_temperature_celsius        - Temperature in Celsius; location="gpu" (dump metric 3)
+                                     or location="memory" (dump metric 4)
+    xpu_pcie_read_bytes_per_second  - PCIe read throughput (gauge, bytes/sec)
+    xpu_pcie_write_bytes_per_second - PCIe write throughput (gauge, bytes/sec)
+    xpu_engine_group_compute_engine_util - Compute engine utilization %
+    xpu_engine_group_render_engine_util  - Render engine utilization %
+    xpu_engine_group_media_engine_util   - Media engine utilization %
+    xpu_engine_group_copy_engine_util    - Copy engine utilization %
+    xpu_memory_read_bytes_per_second - Memory read throughput (gauge, bytes/sec)
+    xpu_memory_write_bytes_per_second - Memory write throughput (gauge, bytes/sec)
+"""
+
+import argparse
+import json
+import logging
+import subprocess
+import sys
+import threading
+import time
+from http.server import BaseHTTPRequestHandler, HTTPServer
+
+logging.basicConfig(
+    level=logging.INFO,
+    format="[%(asctime)s] %(levelname)s %(name)s: %(message)s",
+)
+logger = logging.getLogger("xpu-smi-exporter")
+
+# xpu-smi dump metric IDs
+# 0=GPU Util%, 1=Power(W), 2=Freq(MHz), 3=CoreTemp(C), 4=MemTemp(C),
+# 5=MemUtil%, 6=MemRead(kB/s), 7=MemWrite(kB/s), 18=MemUsed(MiB),
+# 19=PCIeRead(kB/s), 20=PCIeWrite(kB/s),
+# 31=ComputeEngGrp%, 32=RenderEngGrp%, 33=MediaEngGrp%, 34=CopyEngGrp%
+DUMP_METRICS = "0,1,2,3,4,5,6,7,18,19,20,31,32,33,34"
+
+# Metric name in dump header -> (prometheus_name, help, type, unit_conversion, extra_labels)
+# unit_conversion: multiply raw value by this factor
+# extra_labels: additional Prometheus labels (e.g. location for temperature)
+DUMP_HEADER_MAP = {
+    "GPU Utilization (%)": (
+        "xpu_gpu_utilization_percent",
+        "GPU utilization percentage",
+        "gauge",
+        1,
+        {},
+    ),
+    "GPU Power (W)": (
+        "xpu_power_watts",
+        "GPU power consumption in watts",
+        "gauge",
+        1,
+        {},
+    ),
+    "GPU Frequency (MHz)": (
+        "xpu_frequency_mhz",
+        "GPU core frequency in MHz",
+        "gauge",
+        1,
+        {},
+    ),
+    "GPU Core Temperature (Celsius Degree)": (
+        "xpu_temperature_celsius",
+        "XPU temperature in Celsius",
+        "gauge",
+        1,
+        {"location": "gpu"},
+    ),
+    "GPU Memory Temperature (Celsius Degree)": (
+        "xpu_temperature_celsius",
+        "XPU temperature in Celsius",
+        "gauge",
+        1,
+        {"location": "memory"},
+    ),
+    "GPU Memory Utilization (%)": (
+        "xpu_memory_utilization_percent",
+        "GPU memory utilization percentage",
+        "gauge",
+        1,
+        {},
+    ),
+    "GPU Memory Read (kB/s)": (
+        "xpu_memory_read_bytes_per_second",
+        "GPU memory read throughput in bytes per second",
+        "gauge",
+        1024,
+        {},
+    ),
+    "GPU Memory Write (kB/s)": (
+        "xpu_memory_write_bytes_per_second",
+        "GPU memory write throughput in bytes per second",
+        "gauge",
+        1024,
+        {},
+    ),
+    "GPU Memory Used (MiB)": (
+        "xpu_memory_used_bytes",
+        "GPU memory used in bytes",
+        "gauge",
+        1048576,
+        {},
+    ),
+    "PCIe Read (kB/s)": (
+        "xpu_pcie_read_bytes_per_second",
+        "PCIe read throughput in bytes per second",
+        "gauge",
+        1024,
+        {},
+    ),
+    "PCIe Write (kB/s)": (
+        "xpu_pcie_write_bytes_per_second",
+        "PCIe write throughput in bytes per second",
+        "gauge",
+        1024,
+        {},
+    ),
+    "Compute engine group utilization (%)": (
+        "xpu_engine_group_compute_engine_util",
+        "Compute engine group utilization percentage",
+        "gauge",
+        1,
+        {},
+    ),
+    "Render engine group utilization (%)": (
+        "xpu_engine_group_render_engine_util",
+        "Render engine group utilization percentage",
+        "gauge",
+        1,
+        {},
+    ),
+    "Media engine group utilization (%)": (
+        "xpu_engine_group_media_engine_util",
+        "Media engine group utilization percentage",
+        "gauge",
+        1,
+        {},
+    ),
+    "Copy engine group utilization (%)": (
+        "xpu_engine_group_copy_engine_util",
+        "Copy engine group utilization percentage",
+        "gauge",
+        1,
+        {},
+    ),
+}
+
+
+class MetricsCollector:
+    """Collects XPU metrics from xpu-smi commands.
+
+    Runs a background thread that periodically calls xpu-smi and caches the
+    results.  The /metrics handler returns the cached snapshot instantly,
+    avoiding Prometheus scrape-timeout issues caused by slow xpu-smi calls.
+    """
+
+    def __init__(self, interval: int = 5):
+        self._lock = threading.Lock()
+        self._metrics: dict = {}
+        self._devices: list = []
+        self._device_memory_total: dict = {}  # device_id -> total memory bytes
+        self._interval = interval
+        self._discover_devices()
+
+    def _discover_devices(self):
+        """Discover available XPU devices."""
+        try:
+            result = subprocess.run(
+                ["xpu-smi", "discovery", "-j"],
+                capture_output=True,
+                text=True,
+                timeout=10,
+            )
+            data = json.loads(result.stdout)
+            self._devices = [d["device_id"] for d in data.get("device_list", [])]
+            # Get total memory per device
+            for dev_id in self._devices:
+                self._get_device_memory_total(dev_id)
+            logger.info(
+                f"Discovered {len(self._devices)} XPU device(s): {self._devices}"
+            )
+        except Exception as e:
+            logger.error(f"Failed to discover devices: {e}")
+            self._devices = []
+
+    def _get_device_memory_total(self, device_id: int):
+        """Get total physical memory for a device."""
+        try:
+            result = subprocess.run(
+                ["xpu-smi", "discovery", "-d", str(device_id), "-j"],
+                capture_output=True,
+                text=True,
+                timeout=10,
+            )
+            data = json.loads(result.stdout)
+            total = int(data.get("memory_physical_size_byte", 0))
+            self._device_memory_total[device_id] = total
+            logger.info(
+                f"Device {device_id}: total memory = {total / (1024**3):.1f} GiB"
+            )
+        except Exception as e:
+            logger.warning(f"Failed to get memory total for device {device_id}: {e}")
+
+    def _collect_dump_metrics(self, device_id: int) -> dict:
+        """Collect metrics via xpu-smi dump for a single device."""
+        metrics = {}
+        try:
+            result = subprocess.run(
+                [
+                    "xpu-smi",
+                    "dump",
+                    "-d",
+                    str(device_id),
+                    "-m",
+                    DUMP_METRICS,
+                    "-n",
+                    "1",
+                ],
+                capture_output=True,
+                text=True,
+                timeout=15,
+            )
+            lines = result.stdout.strip().split("\n")
+            if len(lines) < 2:
+                return metrics
+
+            # Parse header
+            header_line = lines[0]
+            headers = [h.strip() for h in header_line.split(",")]
+            # Parse data (last line)
+            data_line = lines[-1]
+            values = [v.strip() for v in data_line.split(",")]
+
+            if len(headers) != len(values):
+                logger.warning(
+                    f"Header/value count mismatch: {len(headers)} vs {len(values)}"
+                )
+                return metrics
+
+            # Skip Timestamp and DeviceId columns
+            for i in range(2, len(headers)):
+                header = headers[i]
+                raw_val = values[i]
+
+                if raw_val == "N/A" or raw_val == "":
+                    continue
+
+                mapping = DUMP_HEADER_MAP.get(header)
+                if not mapping:
+                    continue
+
+                prom_name, help_text, metric_type, conversion, extra_labels = mapping
+                try:
+                    val = float(raw_val) * conversion
+                    labels = {"device_id": str(device_id), **extra_labels}
+                    # Use a composite key to handle metrics with the same name
+                    # but different labels (e.g. xpu_temperature_celsius with
+                    # location=gpu vs location=memory)
+                    label_suffix = "_".join(
+                        f"{k}={v}" for k, v in sorted(extra_labels.items())
+                    )
+                    metric_key = (
+                        f"{prom_name}:{label_suffix}" if label_suffix else prom_name
+                    )
+                    metrics[metric_key] = {
+                        "name": prom_name,
+                        "value": val,
+                        "help": help_text,
+                        "type": metric_type,
+                        "labels": labels,
+                    }
+                except ValueError:
+                    continue
+
+        except subprocess.TimeoutExpired:
+            logger.warning(f"xpu-smi dump timed out for device {device_id}")
+        except Exception as e:
+            logger.warning(f"Error collecting dump metrics for device {device_id}: {e}")
+        return metrics
+
+    def _collect_stats_metrics(self, device_id: int) -> dict:
+        """Collect metrics via xpu-smi stats for a single device (fallback/supplement)."""
+        metrics = {}
+        try:
+            result = subprocess.run(
+                ["xpu-smi", "stats", "-d", str(device_id), "-j"],
+                capture_output=True,
+                text=True,
+                timeout=10,
+            )
+            data = json.loads(result.stdout)
+            labels = {"device_id": str(device_id)}
+
+            # Device-level metrics
+            for entry in data.get("device_level", []):
+                mtype = entry.get("metrics_type", "")
+                val = entry.get("value")
+                if val is None:
+                    continue
+                if mtype == "XPUM_STATS_POWER":
+                    metrics["xpu_power_watts"] = {
+                        "name": "xpu_power_watts",
+                        "value": float(val),
+                        "help": "GPU power consumption in watts",
+                        "type": "gauge",
+                        "labels": labels,
+                    }
+
+            # Tile-level metrics (aggregate to device level)
+            tile_data = data.get("tile_level", [])
+            if tile_data:
+                mem_used_sum = 0.0
+                mem_util_sum = 0.0
+                freq_sum = 0.0
+                tile_count = 0
+                for tile in tile_data:
+                    tile_count += 1
+                    for entry in tile.get("data_list", []):
+                        mtype = entry.get("metrics_type", "")
+                        val = entry.get("value")
+                        if val is None:
+                            continue
+                        if mtype == "XPUM_STATS_MEMORY_USED":
+                            mem_used_sum += float(val)  # MiB
+                        elif mtype == "XPUM_STATS_MEMORY_UTILIZATION":
+                            mem_util_sum += float(val)
+                        elif mtype == "XPUM_STATS_GPU_FREQUENCY":
+                            freq_sum += float(val)
+
+                if tile_count > 0:
+                    # Memory used: sum across tiles, convert MiB -> bytes
+                    mem_used_bytes = mem_used_sum * 1048576
+                    metrics["xpu_memory_used_bytes"] = {
+                        "name": "xpu_memory_used_bytes",
+                        "value": mem_used_bytes,
+                        "help": "GPU memory used in bytes",
+                        "type": "gauge",
+                        "labels": labels,
+                    }
+                    # Memory free: total - used
+                    total = self._device_memory_total.get(device_id, 0)
+                    if total > 0:
+                        metrics["xpu_memory_free_bytes"] = {
+                            "name": "xpu_memory_free_bytes",
+                            "value": max(0, total - mem_used_bytes),
+                            "help": "GPU memory free in bytes",
+                            "type": "gauge",
+                            "labels": labels,
+                        }
+                    # Average frequency across tiles
+                    metrics["xpu_frequency_mhz"] = {
+                        "name": "xpu_frequency_mhz",
+                        "value": freq_sum / tile_count,
+                        "help": "GPU core frequency in MHz",
+                        "type": "gauge",
+                        "labels": labels,
+                    }
+
+        except subprocess.TimeoutExpired:
+            logger.warning(f"xpu-smi stats timed out for device {device_id}")
+        except Exception as e:
+            logger.warning(f"Error collecting stats for device {device_id}: {e}")
+        return metrics
+
+    def start_background_collection(self):
+        """Start a daemon thread that collects metrics periodically."""
+
+        def _loop():
+            while True:
+                try:
+                    self.collect()
+                except Exception as e:
+                    logger.error(f"Background collection error: {e}")
+                time.sleep(self._interval)
+
+        t = threading.Thread(target=_loop, daemon=True)
+        t.start()
+        logger.info(f"Background collection started (interval={self._interval}s)")
+
+    def collect(self):
+        """Collect all metrics from all devices."""
+        all_metrics = {}
+        for dev_id in self._devices:
+            # Collect from dump first
+            dump_metrics = self._collect_dump_metrics(dev_id)
+            # Collect from stats (supplements dump, especially for memory)
+            stats_metrics = self._collect_stats_metrics(dev_id)
+
+            # Merge: dump takes priority for metrics it provides,
+            # stats fills in what dump doesn't have
+            merged = {**stats_metrics, **dump_metrics}
+            # But for memory_used_bytes and memory_free_bytes, prefer stats
+            # since dump often returns N/A for memory
+            if "xpu_memory_used_bytes" in stats_metrics:
+                merged["xpu_memory_used_bytes"] = stats_metrics["xpu_memory_used_bytes"]
+            if "xpu_memory_free_bytes" in stats_metrics:
+                merged["xpu_memory_free_bytes"] = stats_metrics["xpu_memory_free_bytes"]
+
+            for name, data in merged.items():
+                if name not in all_metrics:
+                    all_metrics[name] = []
+                all_metrics[name].append(data)
+
+        with self._lock:
+            self._metrics = all_metrics
+
+    def format_prometheus(self) -> str:
+        """Format collected metrics in Prometheus exposition format."""
+        with self._lock:
+            metrics = self._metrics.copy()
+
+        # Group entries by actual Prometheus metric name (from 'name' field)
+        grouped: dict = {}
+        for _key, entries in metrics.items():
+            for entry in entries:
+                metric_name = entry.get("name", _key)
+                if metric_name not in grouped:
+                    grouped[metric_name] = []
+                grouped[metric_name].append(entry)
+
+        lines = []
+        for metric_name, entries in sorted(grouped.items()):
+            if not entries:
+                continue
+            first = entries[0]
+            lines.append(f"# HELP {metric_name} {first['help']}")
+            lines.append(f"# TYPE {metric_name} {first['type']}")
+            for entry in entries:
+                label_parts = ",".join(
+                    f'{k}="{v}"' for k, v in sorted(entry["labels"].items())
+                )
+                lines.append(f"{metric_name}{{{label_parts}}} {entry['value']}")
+        lines.append("")
+        return "\n".join(lines)
+
+
+class MetricsHandler(BaseHTTPRequestHandler):
+    """HTTP handler for /metrics endpoint."""
+
+    collector: MetricsCollector = None  # Set by main
+
+    def do_GET(self):
+        try:
+            if self.path == "/metrics" or self.path == "/":
+                output = self.collector.format_prometheus()
+                self.send_response(200)
+                self.send_header(
+                    "Content-Type", "text/plain; version=0.0.4; charset=utf-8"
+                )
+                self.end_headers()
+                self.wfile.write(output.encode("utf-8"))
+            elif self.path == "/healthz":
+                self.send_response(200)
+                self.send_header("Content-Type", "text/plain")
+                self.end_headers()
+                self.wfile.write(b"ok\n")
+            else:
+                self.send_response(404)
+                self.end_headers()
+        except BrokenPipeError:
+            pass
+
+    def log_message(self, format, *args):
+        # Suppress per-request logging to reduce noise
+        pass
+
+
+def main():
+    parser = argparse.ArgumentParser(description="Intel XPU-SMI Prometheus Exporter")
+    parser.add_argument(
+        "--port",
+        type=int,
+        default=9966,
+        help="Port to expose Prometheus metrics on (default: 9966)",
+    )
+    parser.add_argument(
+        "--interval",
+        type=int,
+        default=5,
+        help="Seconds between background metric collections (default: 5)",
+    )
+    args = parser.parse_args()
+
+    collector = MetricsCollector(interval=args.interval)
+    if not collector._devices:
+        logger.error("No XPU devices found. Exiting.")
+        sys.exit(1)
+
+    # Do an initial collection to verify it works
+    collector.collect()
+    initial = collector.format_prometheus()
+    logger.info(f"Initial collection complete, {len(initial)} bytes of metrics")
+
+    # Start background collection so /metrics returns cached data instantly
+    collector.start_background_collection()
+
+    MetricsHandler.collector = collector
+
+    server = HTTPServer(("0.0.0.0", args.port), MetricsHandler)
+    logger.info(f"Serving XPU metrics on http://0.0.0.0:{args.port}/metrics")
+    try:
+        server.serve_forever()
+    except KeyboardInterrupt:
+        logger.info("Shutting down exporter")
+    finally:
+        # serve_forever() has already returned here; release the listening socket.
+        server.server_close()
+
+
+if __name__ == "__main__":
+    main()
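Review note: the dump path in `_collect_dump_metrics` reads the first line as the CSV header and the last line as data, skips the first two columns and any `N/A` values, and applies a per-metric unit conversion. The Python sketch below isolates that logic so it can be exercised without `xpu-smi` installed. `SAMPLE` is a hypothetical input shaped like what the parser expects, not output captured from a real device, and `parse_dump` and `to_exposition` are illustrative helpers that do not exist in the PR.

```python
# Subset of DUMP_HEADER_MAP: header -> (prometheus_name, unit_conversion)
HEADER_MAP = {
    "GPU Power (W)": ("xpu_power_watts", 1),
    "GPU Memory Used (MiB)": ("xpu_memory_used_bytes", 1048576),
    "PCIe Read (kB/s)": ("xpu_pcie_read_bytes_per_second", 1024),
}

# Hypothetical sample input (not real xpu-smi output).
SAMPLE = """Timestamp, DeviceId, GPU Power (W), GPU Memory Used (MiB), PCIe Read (kB/s)
06:14:46.000,    0, 120.5, 2048, N/A
"""


def parse_dump(text: str) -> dict:
    """Map the last data row to {prometheus_name: converted_value}."""
    lines = text.strip().split("\n")
    if len(lines) < 2:
        return {}
    headers = [h.strip() for h in lines[0].split(",")]
    values = [v.strip() for v in lines[-1].split(",")]
    if len(headers) != len(values):
        return {}
    parsed = {}
    # Skip the Timestamp and DeviceId columns, as the exporter does.
    for header, raw in list(zip(headers, values))[2:]:
        if raw in ("N/A", "") or header not in HEADER_MAP:
            continue
        name, factor = HEADER_MAP[header]
        try:
            parsed[name] = float(raw) * factor
        except ValueError:
            continue
    return parsed


def to_exposition(parsed: dict, device_id: str) -> str:
    """Render samples as Prometheus text lines (HELP/TYPE omitted for brevity)."""
    return "\n".join(
        f'{name}{{device_id="{device_id}"}} {value}'
        for name, value in sorted(parsed.items())
    ) + "\n"


if __name__ == "__main__":
    print(to_exposition(parse_dump(SAMPLE), "0"), end="")
```

A unit test along these lines could pin the parsing behavior, including the `N/A` handling, independently of the hardware.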