Unverified commit ebd23361, authored by Keiven C and committed by GitHub

feat: add a new composite SW/HW grafana (DYN-678) (#1788)


Co-authored-by: Keiven Chang <keivenchang@users.noreply.github.com>
parent 0584b081
# Metrics
The `metrics` component is a utility that can collect, aggregate, and publish
metrics from a Dynamo deployment for use in other applications or visualization
tools like Prometheus and Grafana.
metrics from a Dynamo deployment. After collecting and aggregating metrics from
workers, it exposes them via an HTTP `/metrics` endpoint in Prometheus format
that other applications or visualization tools like Prometheus server and Grafana can
pull from.
**Note**: This is a demo implementation. The metrics component is currently under active development and this documentation will change as the implementation evolves.
- In this demo the metric names use the prefix "llm", but in production they will be prefixed with "nv_llm" (e.g., the HTTP `/metrics` endpoint will serve metrics with "nv_llm" prefixes)
- This demo only works with examples/llm/configs/agg.yaml; other configurations will not work
<div align="center">
<img src="images/dynamo_metrics_grafana.png" alt="Dynamo Metrics Dashboard"/>
......@@ -22,16 +28,16 @@ For example:
```bash
# Default namespace is "dynamo", but can be configured with --namespace
# For more detailed output, try setting the env var: DYN_LOG=debug
metrics --component my_component --endpoint my_endpoint
metrics --component MyComponent --endpoint my_endpoint
# 2025-03-17T00:07:05.202558Z INFO metrics: Scraping endpoint dynamo/my_component/my_endpoint for stats
# 2025-03-17T00:07:05.202558Z INFO metrics: Scraping endpoint dynamo/MyComponent/my_endpoint for stats
# 2025-03-17T00:07:05.202955Z INFO metrics: Prometheus metrics server started at 0.0.0.0:9091/metrics
# ...
```
With no matching endpoints running to collect stats from, you should see warnings in the logs:
```bash
2025-03-17T00:07:06.204756Z WARN metrics: No endpoints found matching dynamo/my_component/my_endpoint
2025-03-17T00:07:06.204756Z WARN metrics: No endpoints found matching dynamo/MyComponent/my_endpoint
```
After a worker with a matching endpoint gets started, the endpoint
......@@ -44,22 +50,23 @@ so below are some examples of workers and how they can be monitored.
### Mock Worker
For quick testing and debugging, there is a Rust-based
[mock worker](src/bin/mock_worker.rs) that registers a mock
`StatsHandler` under an endpoint named
`dynamo/my_component/my_endpoint` and publishes random data.
To try out how `metrics` works, there is a demo Rust-based
[mock worker](src/bin/mock_worker.rs) that provides sample data through two mechanisms:
1. Exposes a stats handler at `dynamo/MyComponent/my_endpoint` that responds to polling requests (from `metrics`) with randomly generated `ForwardPassMetrics` data
2. Publishes mock `KVHitRateEvent` data every second to demonstrate event-based metrics
Step 1: Launch a mock worker via the following command (if already built):
```bash
# Can run multiple workers in separate shells to see aggregation as well.
# Or to build/run from source: cargo run --bin mock_worker
# or build/run from source: DYN_LOG=DEBUG cargo run --bin mock_worker
mock_worker
# 2025-03-16T23:49:28.101668Z INFO mock_worker: Starting Mock Worker on Endpoint: dynamo/my_component/my_endpoint
# 2025-03-16T23:49:28.101668Z INFO mock_worker: Starting Mock Worker on Endpoint: dynamo/MyComponent/my_endpoint
```
To monitor the metrics of these mock workers, run:
Step 2: Monitor the metrics of these mock workers. The `metrics` component serves its own
Prometheus endpoint at `/metrics` on port 9091 (the default when `--port` is not specified):
```bash
metrics --component my_component --endpoint my_endpoint
metrics --component MyComponent --endpoint my_endpoint
```
### Real Worker
......@@ -69,13 +76,14 @@ see the examples in [examples/llm](../../examples/llm).
For example, for a VLLM + KV Routing based deployment that
exposes statistics on an endpoint labeled
`dynamo/VllmWorker/load_metrics`:
`dynamo/VllmWorker/load_metrics` (note: this does NOT currently work
with any other example such as examples/vllm_v0, vllm_v1, ...):
```bash
cd deploy/examples/llm
dynamo serve <vllm kv routing example args>
dynamo serve graphs.agg:Frontend -f configs/agg.yaml
```
To monitor the metrics of these VllmWorkers, run:
Then, to monitor the metrics of these VllmWorkers, run:
```bash
metrics --component VllmWorker --endpoint load_metrics
```
......@@ -105,10 +113,10 @@ Prometheus server or curl client can pull from:
```bash
# Start metrics server on default host (0.0.0.0) and port (9091)
metrics --component my_component --endpoint my_endpoint
metrics --component MyComponent --endpoint my_endpoint
# Or specify a custom port
metrics --component my_component --endpoint my_endpoint --port 9092
metrics --component MyComponent --endpoint my_endpoint --port 9092
```
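For a quick programmatic check of pull mode, the endpoint can be read with nothing but the Python standard library. The sketch below assumes the defaults described above (port 9091, path `/metrics`) and a `metrics` process reachable on `localhost`. `fetch_metrics`, `sample_lines`, and the `opener` parameter are hypothetical names introduced here; `opener` exists only so the function can be exercised without a live server.

```python
import urllib.request


def fetch_metrics(host="localhost", port=9091, timeout=5.0, opener=urllib.request.urlopen):
    """Return the Prometheus text exposed by the metrics component.

    `opener` defaults to urllib's urlopen; pass a stand-in for offline testing.
    """
    url = f"http://{host}:{port}/metrics"
    with opener(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


def sample_lines(text):
    """Drop HELP/TYPE comments and blank lines, keeping only sample lines."""
    return [line for line in text.splitlines() if line and not line.startswith("#")]


# Example (requires a running `metrics` process; use port=9092 if it was
# started with --port 9092):
#     print("\n".join(sample_lines(fetch_metrics())))
```

Keeping the HTTP call behind a small function makes it easy to reuse the same code against a custom `--port` value.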
In pull mode:
......@@ -121,12 +129,12 @@ curl localhost:9091/metrics
# # HELP llm_kv_blocks_active Active KV cache blocks
# # TYPE llm_kv_blocks_active gauge
# llm_kv_blocks_active{component="my_component",endpoint="my_endpoint",worker_id="7587884888253033398"} 40
# llm_kv_blocks_active{component="my_component",endpoint="my_endpoint",worker_id="7587884888253033401"} 2
# llm_kv_blocks_active{component="MyComponent",endpoint="my_endpoint",worker_id="7587884888253033398"} 40
# llm_kv_blocks_active{component="MyComponent",endpoint="my_endpoint",worker_id="7587884888253033401"} 2
# # HELP llm_kv_blocks_total Total KV cache blocks
# # TYPE llm_kv_blocks_total gauge
# llm_kv_blocks_total{component="my_component",endpoint="my_endpoint",worker_id="7587884888253033398"} 100
# llm_kv_blocks_total{component="my_component",endpoint="my_endpoint",worker_id="7587884888253033401"} 100
# llm_kv_blocks_total{component="MyComponent",endpoint="my_endpoint",worker_id="7587884888253033398"} 100
# llm_kv_blocks_total{component="MyComponent",endpoint="my_endpoint",worker_id="7587884888253033401"} 100
```
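As a rough illustration of how this output can be consumed, the following Python sketch parses sample lines like those shown above (with the leading `# ` shell-comment prefix removed) and computes per-worker KV cache utilization as `100 * llm_kv_blocks_active / llm_kv_blocks_total`. The function names `parse_samples` and `kv_utilization` are hypothetical and not part of Dynamo, and the parser handles only simple `name{labels} value` lines, not the full Prometheus exposition format.

```python
import re

# Matches simple sample lines such as: name{key="value",...} 42
SAMPLE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_samples(text):
    """Return a list of (metric_name, labels_dict, float_value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # ignore lines this simplified parser does not understand
        labels = dict(LABEL_RE.findall(match.group("labels")))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


def kv_utilization(samples):
    """Map worker_id -> percent of KV cache blocks in use."""
    active, total = {}, {}
    for name, labels, value in samples:
        worker = labels.get("worker_id")
        if name == "llm_kv_blocks_active":
            active[worker] = value
        elif name == "llm_kv_blocks_total":
            total[worker] = value
    return {
        worker: 100.0 * active[worker] / total[worker]
        for worker in active
        if total.get(worker)  # skip workers with a missing or zero total
    }


SAMPLE_OUTPUT = """\
# HELP llm_kv_blocks_active Active KV cache blocks
# TYPE llm_kv_blocks_active gauge
llm_kv_blocks_active{component="MyComponent",endpoint="my_endpoint",worker_id="7587884888253033398"} 40
llm_kv_blocks_active{component="MyComponent",endpoint="my_endpoint",worker_id="7587884888253033401"} 2
# HELP llm_kv_blocks_total Total KV cache blocks
# TYPE llm_kv_blocks_total gauge
llm_kv_blocks_total{component="MyComponent",endpoint="my_endpoint",worker_id="7587884888253033398"} 100
llm_kv_blocks_total{component="MyComponent",endpoint="my_endpoint",worker_id="7587884888253033401"} 100
"""

if __name__ == "__main__":
    for worker, percent in sorted(kv_utilization(parse_samples(SAMPLE_OUTPUT)).items()):
        print(f"worker {worker}: {percent:.1f}% of KV blocks active")
```

For anything beyond a quick check, a maintained Prometheus text-format parser is likely a better choice than this regular-expression approach.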
### Push Mode
......@@ -145,7 +153,7 @@ Start the metrics component in `--push` mode, specifying the host and port of yo
```bash
# Push metrics to a Prometheus PushGateway every --push-interval seconds
metrics \
--component my_component \
--component MyComponent \
--endpoint my_endpoint \
--host 127.0.0.1 \
--port 9091 \
......@@ -173,7 +181,7 @@ For easy iteration while making edits to the metrics component, you can use `car
to build and run with your local changes:
```bash
cargo run --bin metrics -- --component my_component --endpoint my_endpoint
cargo run --bin metrics -- --component MyComponent --endpoint my_endpoint
```
......@@ -146,7 +146,7 @@ async fn backend(runtime: DistributedRuntime) -> Result<()> {
let namespace = runtime.namespace("dynamo")?;
// we must first create a service, then we can attach one or more endpoints
let component = namespace
.component("my_component")?
.component("MyComponent")?
.service_builder()
.create()
.await?;
......
......@@ -100,16 +100,18 @@ Note: You may need to adjust the target based on your host configuration and net
Grafana is pre-configured with:
- Prometheus datasource
- Sample dashboard for visualizing service metrics
![grafana image](./grafana1.png)
![grafana image](./grafana-dynamo-composite.png)
## Required Files
The following configuration files should be present in this directory:
- [docker-compose.yml](./docker-compose.yml): Defines the Prometheus and Grafana services
- [prometheus.yml](./prometheus.yml): Contains Prometheus scraping configuration
- [grafana.json](./grafana.json): Contains Grafana dashboard configuration
- [grafana-datasources.yml](./grafana-datasources.yml): Contains Grafana datasource configuration
- [grafana-dashboard-providers.yml](./grafana-dashboard-providers.yml): Contains Grafana dashboard provider configuration
- [grafana_dashboards/grafana-dashboard-providers.yml](./grafana_dashboards/grafana-dashboard-providers.yml): Contains Grafana dashboard provider configuration
- [grafana_dashboards/grafana-dynamo-dashboard.json](./grafana_dashboards/grafana-dynamo-dashboard.json): A general Dynamo Dashboard for both SW and HW metrics.
- [grafana_dashboards/grafana-llm-metrics.json](./grafana_dashboards/grafana-llm-metrics.json): Contains Grafana dashboard configuration for LLM specific metrics.
- [grafana_dashboards/grafana-dcgm-metrics.json](./grafana_dashboards/grafana-dcgm-metrics.json): Contains Grafana dashboard configuration for DCGM GPU metrics
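Before starting the stack, it can help to confirm that the listed files are actually present. The Python sketch below is one way to do that. `REQUIRED_FILES` is an assumption: it takes the top-level files and the `grafana_dashboards/` entries from the list above and omits the top-level `grafana.json` and `grafana-dashboard-providers.yml` entries, on the assumption that this change moves them under `grafana_dashboards/`. `missing_files` is a hypothetical helper, not something shipped with Dynamo.

```python
from pathlib import Path

# Assumed post-change layout, relative to the directory holding docker-compose.yml.
REQUIRED_FILES = [
    "docker-compose.yml",
    "prometheus.yml",
    "grafana-datasources.yml",
    "grafana_dashboards/grafana-dashboard-providers.yml",
    "grafana_dashboards/grafana-dynamo-dashboard.json",
    "grafana_dashboards/grafana-llm-metrics.json",
    "grafana_dashboards/grafana-dcgm-metrics.json",
]


def missing_files(base_dir=".", required=REQUIRED_FILES):
    """Return the required paths that are not regular files under base_dir."""
    base = Path(base_dir)
    return [name for name in required if not (base / name).is_file()]


if __name__ == "__main__":
    missing = missing_files()
    if missing:
        print("Missing configuration files:")
        for name in missing:
            print(f"  {name}")
    else:
        print("All required configuration files are present.")
```

Run it from the directory that contains `docker-compose.yml`, or pass that directory as `base_dir`.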
## Running the example `metrics` component
......
......@@ -13,6 +13,7 @@
# See the License for the specific language governing permissions and
# limitations under the License.
# IMPORTANT NOTE: Make sure this is in sync with lib/runtime/docker-compose.yml
networks:
server:
driver: bridge
......@@ -83,6 +84,8 @@ services:
networks:
- monitoring
# To access Prometheus from another machine, you may need to open port 9090 in your host's firewall. On Ubuntu:
# sudo ufw allow 9090/tcp
prometheus:
image: prom/prometheus:v3.4.1
container_name: prometheus
......@@ -98,11 +101,13 @@ services:
restart: unless-stopped
# Example to pull from the /query endpoint:
# {__name__=~"DCGM.*", job="dcgm-exporter"}
ports:
- "9090:9090"
networks:
- monitoring
ports:
- "9090:9090"
profiles: [metrics]
extra_hosts:
- "host.docker.internal:host-gateway"
depends_on:
- dcgm-exporter
- nats-prometheus-exporter
......@@ -110,23 +115,29 @@ services:
# grafana connects to prometheus via the /query endpoint.
# Default credentials are dynamo/dynamo.
# To access Grafana from another machine, you may need to open port 3001 in your host's firewall. On Ubuntu:
# sudo ufw allow 3001/tcp
grafana:
image: grafana/grafana-enterprise:12.0.1
container_name: grafana
volumes:
- ./grafana.json:/etc/grafana/provisioning/dashboards/llm-worker-dashboard.json
- ./grafana-dcgm-dashboard.json:/etc/grafana/provisioning/dashboards/dcgm-dashboard.json
- ./grafana_dashboards:/etc/grafana/provisioning/dashboards
- ./grafana-datasources.yml:/etc/grafana/provisioning/datasources/datasources.yml
- ./grafana-dashboard-providers.yml:/etc/grafana/provisioning/dashboards/dashboard-providers.yml
environment:
# Port 3000 is already used by "dynamo serve", so use 3001
- GF_SERVER_HTTP_PORT=3001
# do not make it admin/admin, because you will be prompted to change the password every time
- GF_SECURITY_ADMIN_USER=dynamo
- GF_SECURITY_ADMIN_PASSWORD=dynamo
- GF_USERS_ALLOW_SIGN_UP=false
- GF_INSTALL_PLUGINS=grafana-piechart-panel
# Default min interval is 5s, but can be configured lower
- GF_DASHBOARDS_MIN_REFRESH_INTERVAL=2s
# Disable password change requirement
- GF_SECURITY_DISABLE_INITIAL_ADMIN_CREATION=false
- GF_SECURITY_ADMIN_PASSWORD_POLICY=false
- GF_AUTH_DISABLE_LOGIN_FORM=false
- GF_AUTH_DISABLE_SIGNOUT_MENU=false
restart: unless-stopped
ports:
- "3001:3001"
......
File suppressed by a .gitattributes entry or the file's encoding is unsupported.
{
"annotations": {
"list": [
{
"builtIn": 1,
"datasource": {
"type": "grafana",
"uid": "-- Grafana --"
},
"enable": true,
"hide": true,
"iconColor": "rgba(0, 211, 255, 1)",
"name": "Annotations & Alerts",
"type": "dashboard"
}
]
},
"copyright": [
"SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.",
"SPDX-License-Identifier: Apache-2.0",
"Licensed under the Apache License, Version 2.0 (the \"License\");",
"you may not use this file except in compliance with the License.",
"You may obtain a copy of the License at",
"http://www.apache.org/licenses/LICENSE-2.0",
"Unless required by applicable law or agreed to in writing, software",
"distributed under the License is distributed on an \"AS IS\" BASIS,",
"WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.",
"See the License for the specific language governing permissions and",
"limitations under the License."
],
"editable": true,
"fiscalYearStartMonth": 0,
"graphTooltip": 0,
"id": 1,
"links": [],
"liveNow": false,
"panels": [
{
"datasource": {
"type": "prometheus",
"uid": "prometheus"
},
"fieldConfig": {
"defaults": {
"color": {
"mode": "palette-classic"
},
"custom": {
"axisCenteredZero": false,
"axisColorMode": "text",
"axisLabel": "",
"axisPlacement": "auto",
"barAlignment": 0,
"drawStyle": "line",
"fillOpacity": 20,
"gradientMode": "none",
"hideFrom": {
"legend": false,
"tooltip": false,
"viz": false
},
"lineInterpolation": "smooth",
"lineWidth": 2,
"pointSize": 5,
"scaleDistribution": {
"type": "linear"
},
"showPoints": "never",
"spanNulls": false,
"stacking": {
"group": "A",
"mode": "none"
},
"thresholdsStyle": {
"mode": "off"
}
},
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "green",
"value": null
},
{
"color": "red",
"value": 80
}
]
},
"unit": "percent"
},
"overrides": []
},
"gridPos": {
"h": 8,
"w": 12,
"x": 0,
"y": 0
},
"id": 1,
"options": {
"legend": {
"calcs": [
"mean",
"max"
],
"displayMode": "table",
"placement": "right",
"showLegend": true
},
"tooltip": {
"mode": "multi",
"sort": "none"
}
},
"title": "KV Cache Utilization by Worker",
"type": "timeseries",
"targets": [
{
"datasource": {
"type": "prometheus",
"uid": "prometheus"
},
"editorMode": "code",
"expr": "100 * llm_kv_blocks_active{component=\"$component\", endpoint=\"$endpoint\"} / llm_kv_blocks_total{component=\"$component\", endpoint=\"$endpoint\"}",
"legendFormat": "Worker {{worker_id}}",
"range": true,
"refId": "A"
}
]
},
{
"datasource": {
"type": "prometheus",
"uid": "prometheus"
},
"fieldConfig": {
"defaults": {
"color": {
"mode": "palette-classic"
},
"custom": {
"axisCenteredZero": false,
"axisColorMode": "text",
"axisLabel": "",
"axisPlacement": "auto",
"barAlignment": 0,
"drawStyle": "line",
"fillOpacity": 20,
"gradientMode": "none",
"hideFrom": {
"legend": false,
"tooltip": false,
"viz": false
},
"lineInterpolation": "smooth",
"lineWidth": 2,
"pointSize": 5,
"scaleDistribution": {
"type": "linear"
},
"showPoints": "never",
"spanNulls": false,
"stacking": {
"group": "A",
"mode": "none"
},
"thresholdsStyle": {
"mode": "off"
}
},
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "green",
"value": null
},
{
"color": "red",
"value": 80
}
]
},
"unit": "percent"
},
"overrides": []
},
"gridPos": {
"h": 8,
"w": 12,
"x": 12,
"y": 0
},
"id": 2,
"options": {
"legend": {
"calcs": [
"mean",
"max"
],
"displayMode": "table",
"placement": "right",
"showLegend": true
},
"tooltip": {
"mode": "multi",
"sort": "none"
}
},
"title": "Request Slot Utilization by Worker",
"type": "timeseries",
"targets": [
{
"datasource": {
"type": "prometheus",
"uid": "prometheus"
},
"editorMode": "code",
"expr": "100 * llm_requests_active_slots{component=\"$component\", endpoint=\"$endpoint\"} / llm_requests_total_slots{component=\"$component\", endpoint=\"$endpoint\"}",
"legendFormat": "Worker {{worker_id}}",
"range": true,
"refId": "A"
}
]
},
{
"datasource": {
"type": "prometheus",
"uid": "prometheus"
},
"fieldConfig": {
"defaults": {
"color": {
"mode": "thresholds"
},
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "green",
"value": null
},
{
"color": "yellow",
"value": 50
},
{
"color": "red",
"value": 80
}
]
},
"unit": "percent"
},
"overrides": []
},
"gridPos": {
"h": 8,
"w": 4,
"x": 0,
"y": 8
},
"id": 3,
"options": {
"orientation": "auto",
"reduceOptions": {
"calcs": [
"lastNotNull"
],
"fields": "",
"values": false
},
"showThresholdLabels": false,
"showThresholdMarkers": true
},
"pluginVersion": "10.0.0",
"title": "Average KV Cache Utilization",
"type": "gauge",
"targets": [
{
"datasource": {
"type": "prometheus",
"uid": "prometheus"
},
"editorMode": "code",
"expr": "100 * avg(llm_kv_blocks_active{component=\"$component\", endpoint=\"$endpoint\"}) / avg(llm_kv_blocks_total{component=\"$component\", endpoint=\"$endpoint\"})",
"legendFormat": "__auto",
"range": true,
"refId": "A"
}
]
},
{
"datasource": {
"type": "prometheus",
"uid": "prometheus"
},
"fieldConfig": {
"defaults": {
"color": {
"mode": "thresholds"
},
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "green",
"value": null
},
{
"color": "yellow",
"value": 50
},
{
"color": "red",
"value": 80
}
]
},
"unit": "percent"
},
"overrides": []
},
"gridPos": {
"h": 8,
"w": 4,
"x": 4,
"y": 8
},
"id": 4,
"options": {
"orientation": "auto",
"reduceOptions": {
"calcs": [
"lastNotNull"
],
"fields": "",
"values": false
},
"showThresholdLabels": false,
"showThresholdMarkers": true
},
"pluginVersion": "10.0.0",
"title": "Average Request Slot Utilization",
"type": "gauge",
"targets": [
{
"datasource": {
"type": "prometheus",
"uid": "prometheus"
},
"editorMode": "code",
"expr": "100 * avg(llm_requests_active_slots{component=\"$component\", endpoint=\"$endpoint\"}) / avg(llm_requests_total_slots{component=\"$component\", endpoint=\"$endpoint\"})",
"legendFormat": "__auto",
"range": true,
"refId": "A"
}
]
},
{
"datasource": {
"type": "prometheus",
"uid": "prometheus"
},
"fieldConfig": {
"defaults": {
"color": {
"mode": "thresholds"
},
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "green",
"value": null
}
]
},
"unit": "percent"
},
"overrides": []
},
"gridPos": {
"h": 8,
"w": 4,
"x": 8,
"y": 8
},
"id": 7,
"options": {
"orientation": "auto",
"reduceOptions": {
"calcs": [
"lastNotNull"
],
"fields": "",
"values": false
},
"showThresholdLabels": false,
"showThresholdMarkers": true
},
"pluginVersion": "10.0.0",
"title": "Average KV Cache Hit Rate",
"type": "gauge",
"targets": [
{
"datasource": {
"type": "prometheus",
"uid": "prometheus"
},
"editorMode": "code",
"expr": "100 * avg(llm_kv_hit_rate_percent{component=\"$component\", endpoint=\"$endpoint\"})",
"legendFormat": "__auto",
"range": true,
"refId": "A"
}
]
},
{
"datasource": {
"type": "prometheus",
"uid": "prometheus"
},
"fieldConfig": {
"defaults": {
"color": {
"mode": "palette-classic"
},
"custom": {
"axisCenteredZero": false,
"axisColorMode": "text",
"axisLabel": "",
"axisPlacement": "auto",
"barAlignment": 0,
"drawStyle": "line",
"fillOpacity": 20,
"gradientMode": "none",
"hideFrom": {
"legend": false,
"tooltip": false,
"viz": false
},
"lineInterpolation": "smooth",
"lineWidth": 2,
"pointSize": 5,
"scaleDistribution": {
"type": "linear"
},
"showPoints": "never",
"spanNulls": false,
"stacking": {
"group": "A",
"mode": "none"
},
"thresholdsStyle": {
"mode": "off"
}
},
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "green",
"value": null
}
]
},
"unit": "none"
},
"overrides": []
},
"gridPos": {
"h": 8,
"w": 12,
"x": 12,
"y": 8
},
"id": 5,
"options": {
"legend": {
"calcs": [
"mean",
"max"
],
"displayMode": "table",
"placement": "right",
"showLegend": true
},
"tooltip": {
"mode": "multi",
"sort": "none"
}
},
"title": "Load Average & Standard Deviation",
"type": "timeseries",
"targets": [
{
"datasource": {
"type": "prometheus",
"uid": "prometheus"
},
"editorMode": "code",
"expr": "llm_load_avg{component=\"$component\", endpoint=\"$endpoint\"}",
"legendFormat": "Average",
"range": true,
"refId": "A"
},
{
"datasource": {
"type": "prometheus",
"uid": "prometheus"
},
"editorMode": "code",
"expr": "llm_load_std{component=\"$component\", endpoint=\"$endpoint\"}",
"hide": false,
"legendFormat": "StdDev",
"range": true,
"refId": "B"
}
]
},
{
"datasource": {
"type": "prometheus",
"uid": "prometheus"
},
"fieldConfig": {
"defaults": {
"color": {
"mode": "palette-classic"
},
"custom": {
"axisCenteredZero": false,
"axisColorMode": "text",
"axisLabel": "",
"axisPlacement": "auto",
"barAlignment": 0,
"drawStyle": "line",
"fillOpacity": 20,
"gradientMode": "none",
"hideFrom": {
"legend": false,
"tooltip": false,
"viz": false
},
"lineInterpolation": "smooth",
"lineWidth": 2,
"pointSize": 5,
"scaleDistribution": {
"type": "linear"
},
"showPoints": "never",
"spanNulls": false,
"stacking": {
"group": "A",
"mode": "none"
},
"thresholdsStyle": {
"mode": "off"
}
},
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "green",
"value": null
}
]
},
"unit": "percent"
},
"overrides": []
},
"gridPos": {
"h": 8,
"w": 12,
"x": 0,
"y": 16
},
"id": 8,
"options": {
"legend": {
"calcs": [
"mean",
"max"
],
"displayMode": "table",
"placement": "right",
"showLegend": true
},
"tooltip": {
"mode": "multi",
"sort": "none"
}
},
"title": "KV Cache Hit Rate by Worker",
"type": "timeseries",
"targets": [
{
"datasource": {
"type": "prometheus",
"uid": "prometheus"
},
"editorMode": "code",
"expr": "100 * llm_kv_hit_rate_percent{component=\"$component\", endpoint=\"$endpoint\"}",
"legendFormat": "Worker {{worker_id}}",
"range": true,
"refId": "A"
}
]
},
{
"datasource": {
"type": "prometheus",
"uid": "prometheus"
},
"fieldConfig": {
"defaults": {
"color": {
"mode": "palette-classic"
},
"custom": {
"axisCenteredZero": false,
"axisColorMode": "text",
"axisLabel": "",
"axisPlacement": "auto",
"barAlignment": 0,
"drawStyle": "line",
"fillOpacity": 20,
"gradientMode": "none",
"hideFrom": {
"legend": false,
"tooltip": false,
"viz": false
},
"lineInterpolation": "smooth",
"lineWidth": 2,
"pointSize": 5,
"scaleDistribution": {
"type": "linear"
},
"showPoints": "never",
"spanNulls": false,
"stacking": {
"group": "A",
"mode": "none"
},
"thresholdsStyle": {
"mode": "off"
}
},
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "green",
"value": null
}
]
},
"unit": "percent"
},
"overrides": []
},
"gridPos": {
"h": 8,
"w": 12,
"x": 12,
"y": 16
},
"id": 9,
"options": {
"legend": {
"calcs": [
"mean",
"max"
],
"displayMode": "table",
"placement": "right",
"showLegend": true
},
"tooltip": {
"mode": "multi",
"sort": "none"
}
},
"title": "Average KV Cache Hit Rate",
"type": "timeseries",
"targets": [
{
"datasource": {
"type": "prometheus",
"uid": "prometheus"
},
"editorMode": "code",
"expr": "avg(100 * llm_kv_hit_rate_percent{component=\"$component\", endpoint=\"$endpoint\"})",
"legendFormat": "Average Hit Rate",
"range": true,
"refId": "A"
}
]
},
{
"datasource": {
"type": "prometheus",
"uid": "prometheus"
},
"fieldConfig": {
"defaults": {
"color": {
"mode": "palette-classic"
},
"custom": {
"axisCenteredZero": false,
"axisColorMode": "text",
"axisLabel": "",
"axisPlacement": "auto",
"barAlignment": 0,
"drawStyle": "line",
"fillOpacity": 20,
"gradientMode": "none",
"hideFrom": {
"legend": false,
"tooltip": false,
"viz": false
},
"lineInterpolation": "smooth",
"lineWidth": 2,
"pointSize": 5,
"scaleDistribution": {
"type": "linear"
},
"showPoints": "never",
"spanNulls": false,
"stacking": {
"group": "A",
"mode": "none"
},
"thresholdsStyle": {
"mode": "off"
}
},
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "green",
"value": null
}
]
},
"unit": "none"
},
"overrides": []
},
"gridPos": {
"h": 8,
"w": 24,
"x": 0,
"y": 24
},
"id": 6,
"options": {
"legend": {
"calcs": [
"mean",
"max"
],
"displayMode": "table",
"placement": "right",
"showLegend": true
},
"tooltip": {
"mode": "multi",
"sort": "none"
}
},
"title": "Available Resources",
"type": "timeseries",
"targets": [
{
"datasource": {
"type": "prometheus",
"uid": "prometheus"
},
"editorMode": "code",
"expr": "sum(llm_kv_blocks_total{component=\"$component\", endpoint=\"$endpoint\"} - llm_kv_blocks_active{component=\"$component\", endpoint=\"$endpoint\"})",
"legendFormat": "Available KV Blocks",
"range": true,
"refId": "A"
},
{
"datasource": {
"type": "prometheus",
"uid": "prometheus"
},
"editorMode": "code",
"expr": "sum(llm_requests_total_slots{component=\"$component\", endpoint=\"$endpoint\"} - llm_requests_active_slots{component=\"$component\", endpoint=\"$endpoint\"})",
"hide": false,
"legendFormat": "Available Request Slots",
"range": true,
"refId": "B"
}
]
}
],
"refresh": "2s",
"schemaVersion": 38,
"style": "dark",
"tags": [
"llm",
"metrics"
],
"templating": {
"list": [
{
"current": {
"selected": false,
"text": "component",
"value": "vllm"
},
"datasource": {
"type": "prometheus",
"uid": "prometheus"
},
"definition": "label_values(llm_kv_blocks_active, component)",
"hide": 0,
"includeAll": false,
"label": "Component",
"multi": false,
"name": "component",
"options": [],
"query": {
"query": "label_values(llm_kv_blocks_active, component)",
"refId": "StandardVariableQuery"
},
"refresh": 1,
"regex": "",
"skipUrlSync": false,
"sort": 0,
"type": "query"
},
{
"current": {
"selected": false,
"text": "endpoint",
"value": "load_metrics"
},
"datasource": {
"type": "prometheus",
"uid": "prometheus"
},
"definition": "label_values(llm_kv_blocks_active{component=\"$component\"}, endpoint)",
"hide": 0,
"includeAll": false,
"label": "Endpoint",
"multi": false,
"name": "endpoint",
"options": [],
"query": {
"query": "label_values(llm_kv_blocks_active{component=\"$component\"}, endpoint)",
"refId": "StandardVariableQuery"
},
"refresh": 1,
"regex": "",
"skipUrlSync": false,
"sort": 0,
"type": "query"
}
]
},
"time": {
"from": "now-5m",
"to": "now"
},
"timepicker": {},
"timezone": "",
"title": "LLM Worker Metrics",
"uid": "llm-worker-metrics",
"version": 1,
"weekStart": ""
}
\ No newline at end of file
......@@ -15,57 +15,48 @@
}
]
},
"copyright": [
"SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.",
"SPDX-License-Identifier: Apache-2.0",
"Licensed under the Apache License, Version 2.0 (the \"License\");",
"you may not use this file except in compliance with the License.",
"You may obtain a copy of the License at",
"http://www.apache.org/licenses/LICENSE-2.0",
"Unless required by applicable law or agreed to in writing, software",
"distributed under the License is distributed on an \"AS IS\" BASIS,",
"WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.",
"See the License for the specific language governing permissions and",
"limitations under the License."
],
"description": "Various stats, from Dynamo runtime, GPU HW, NATS, etcd, ...",
"editable": true,
"fiscalYearStartMonth": 0,
"graphTooltip": 0,
"id": 2,
"id": 1,
"links": [],
"liveNow": false,
"panels": [
{
"datasource": {
"type": "prometheus",
"uid": "prometheus"
"uid": "P1809F7CD0C75ACF3"
},
"description": "nv_llm_http_service_requests_total (1m)",
"fieldConfig": {
"defaults": {
"color": {
"mode": "palette-classic"
},
"custom": {
"axisBorderShow": false,
"axisCenteredZero": false,
"axisColorMode": "text",
"axisLabel": "",
"axisLabel": "Requests",
"axisPlacement": "auto",
"barAlignment": 0,
"barWidthFactor": 0.6,
"drawStyle": "line",
"fillOpacity": 20,
"fillOpacity": 0,
"gradientMode": "none",
"hideFrom": {
"legend": false,
"tooltip": false,
"viz": false
},
"lineInterpolation": "smooth",
"lineWidth": 2,
"insertNulls": false,
"lineInterpolation": "linear",
"lineWidth": 1,
"pointSize": 5,
"scaleDistribution": {
"type": "linear"
},
"showPoints": "never",
"showPoints": "auto",
"spanNulls": false,
"stacking": {
"group": "A",
......@@ -80,90 +71,85 @@
"mode": "absolute",
"steps": [
{
"color": "green",
"value": null
"color": "green"
},
{
"color": "red",
"value": 80
}
]
},
"unit": "percent",
"min": 0,
"max": 100
}
},
"overrides": []
},
"gridPos": {
"h": 8,
"w": 12,
"w": 8,
"x": 0,
"y": 0
},
"id": 1,
"id": 14,
"options": {
"legend": {
"calcs": [
"mean",
"max"
],
"displayMode": "table",
"placement": "right",
"calcs": [],
"displayMode": "list",
"placement": "bottom",
"showLegend": true
},
"tooltip": {
"mode": "multi",
"hideZeros": false,
"mode": "single",
"sort": "none"
}
},
"title": "GPU Utilization",
"type": "timeseries",
"pluginVersion": "12.0.1",
"targets": [
{
"datasource": {
"type": "prometheus",
"uid": "prometheus"
},
"editorMode": "code",
"expr": "DCGM_FI_DEV_GPU_UTIL",
"legendFormat": "GPU {{gpu}} ({{modelName}})",
"expr": "rate(nv_llm_http_service_requests_total[30s])",
"legendFormat": "{{request_type}}, {{status}},",
"range": true,
"refId": "A"
}
]
],
"title": "Requests / Sec",
"type": "timeseries"
},
{
"datasource": {
"type": "prometheus",
"uid": "prometheus"
"uid": "P1809F7CD0C75ACF3"
},
"description": "nv_llm_http_service_time_to_first_token_seconds (sum/count)",
"fieldConfig": {
"defaults": {
"color": {
"mode": "palette-classic"
},
"custom": {
"axisBorderShow": false,
"axisCenteredZero": false,
"axisColorMode": "text",
"axisLabel": "",
"axisLabel": "milliseconds",
"axisPlacement": "auto",
"barAlignment": 0,
"barWidthFactor": 0.6,
"drawStyle": "line",
"fillOpacity": 20,
"fillOpacity": 0,
"gradientMode": "none",
"hideFrom": {
"legend": false,
"tooltip": false,
"viz": false
},
"insertNulls": false,
"lineInterpolation": "smooth",
"lineWidth": 2,
"lineWidth": 1,
"pointSize": 5,
"scaleDistribution": {
"type": "linear"
},
"showPoints": "never",
"showPoints": "auto",
"spanNulls": false,
"stacking": {
"group": "A",
......@@ -178,100 +164,85 @@
"mode": "absolute",
"steps": [
{
"color": "green",
"value": null
"color": "green"
},
{
"color": "red",
"value": 80
}
]
},
"unit": "bytes",
"min": 0
}
},
"overrides": []
},
"gridPos": {
"h": 8,
"w": 12,
"x": 12,
"w": 8,
"x": 8,
"y": 0
},
"id": 2,
"id": 12,
"options": {
"legend": {
"calcs": [
"mean",
"max"
],
"displayMode": "table",
"placement": "right",
"calcs": [],
"displayMode": "list",
"placement": "bottom",
"showLegend": true
},
"tooltip": {
"mode": "multi",
"hideZeros": false,
"mode": "single",
"sort": "none"
}
},
"title": "GPU Memory Usage",
"type": "timeseries",
"pluginVersion": "12.0.1",
"targets": [
{
"datasource": {
"type": "prometheus",
"uid": "prometheus"
},
"editorMode": "code",
"expr": "DCGM_FI_DEV_FB_USED * 1024 * 1024",
"legendFormat": "GPU {{gpu}} Used",
"expr": "1000*(nv_llm_http_service_time_to_first_token_seconds_sum/nv_llm_http_service_time_to_first_token_seconds_count)",
"legendFormat": "{{model}}",
"range": true,
"refId": "A"
},
{
"datasource": {
"type": "prometheus",
"uid": "prometheus"
},
"editorMode": "code",
"expr": "DCGM_FI_DEV_FB_FREE * 1024 * 1024",
"legendFormat": "GPU {{gpu}} Free",
"range": true,
"refId": "B"
}
]
],
"title": "Avg Time to First Token",
"type": "timeseries"
},
{
"datasource": {
"type": "prometheus",
"uid": "prometheus"
"uid": "P1809F7CD0C75ACF3"
},
"description": "nv_llm_http_service_inter_token_latency_seconds (sum/count)",
"fieldConfig": {
"defaults": {
"color": {
"mode": "palette-classic"
},
"custom": {
"axisBorderShow": false,
"axisCenteredZero": false,
"axisColorMode": "text",
"axisLabel": "",
"axisLabel": "milliseconds",
"axisPlacement": "auto",
"barAlignment": 0,
"barWidthFactor": 0.6,
"drawStyle": "line",
"fillOpacity": 20,
"fillOpacity": 0,
"gradientMode": "none",
"hideFrom": {
"legend": false,
"tooltip": false,
"viz": false
},
"lineInterpolation": "smooth",
"lineWidth": 2,
"insertNulls": false,
"lineInterpolation": "linear",
"lineWidth": 1,
"pointSize": 5,
"scaleDistribution": {
"type": "linear"
},
"showPoints": "never",
"showPoints": "auto",
"spanNulls": false,
"stacking": {
"group": "A",
......@@ -286,103 +257,85 @@
"mode": "absolute",
"steps": [
{
"color": "green",
"value": null
},
{
"color": "yellow",
"value": 70
"color": "green"
},
{
"color": "red",
"value": 85
"value": 80
}
]
},
"unit": "celsius"
}
},
"overrides": []
},
"gridPos": {
"h": 8,
"w": 12,
"x": 0,
"y": 8
"w": 8,
"x": 16,
"y": 0
},
"id": 3,
"id": 16,
"options": {
"legend": {
"calcs": [
"mean",
"max"
],
"displayMode": "table",
"placement": "right",
"calcs": [],
"displayMode": "list",
"placement": "bottom",
"showLegend": true
},
"tooltip": {
"mode": "multi",
"hideZeros": false,
"mode": "single",
"sort": "none"
}
},
"title": "GPU Temperature",
"type": "timeseries",
"pluginVersion": "12.0.1",
"targets": [
{
"datasource": {
"type": "prometheus",
"uid": "prometheus"
},
"editorMode": "code",
"expr": "DCGM_FI_DEV_GPU_TEMP",
"legendFormat": "GPU {{gpu}} Temp",
"expr": "1000*(nv_llm_http_service_inter_token_latency_seconds_sum/nv_llm_http_service_inter_token_latency_seconds_count)",
"legendFormat": "{{model}}",
"range": true,
"refId": "A"
},
{
"datasource": {
"type": "prometheus",
"uid": "prometheus"
},
"editorMode": "code",
"expr": "DCGM_FI_DEV_MEMORY_TEMP",
"legendFormat": "GPU {{gpu}} Memory Temp",
"range": true,
"refId": "B"
}
]
],
"title": "Avg Inter-Token Latency",
"type": "timeseries"
},
{
"datasource": {
"type": "prometheus",
"uid": "prometheus"
"uid": "P1809F7CD0C75ACF3"
},
"description": "nv_llm_http_service_request_duration (sum/count)",
"fieldConfig": {
"defaults": {
"color": {
"mode": "palette-classic"
},
"custom": {
"axisBorderShow": false,
"axisCenteredZero": false,
"axisColorMode": "text",
"axisLabel": "",
"axisLabel": "milliseconds",
"axisPlacement": "auto",
"barAlignment": 0,
"barWidthFactor": 0.6,
"drawStyle": "line",
"fillOpacity": 20,
"fillOpacity": 0,
"gradientMode": "none",
"hideFrom": {
"legend": false,
"tooltip": false,
"viz": false
},
"lineInterpolation": "smooth",
"lineWidth": 2,
"insertNulls": false,
"lineInterpolation": "linear",
"lineWidth": 1,
"pointSize": 5,
"scaleDistribution": {
"type": "linear"
},
"showPoints": "never",
"showPoints": "auto",
"spanNulls": false,
"stacking": {
"group": "A",
......@@ -397,187 +350,85 @@
"mode": "absolute",
"steps": [
{
"color": "green",
"value": null
}
]
},
"unit": "watt"
},
"overrides": []
},
"gridPos": {
"h": 8,
"w": 12,
"x": 12,
"y": 8
},
"id": 4,
"options": {
"legend": {
"calcs": [
"mean",
"max"
],
"displayMode": "table",
"placement": "right",
"showLegend": true
},
"tooltip": {
"mode": "multi",
"sort": "none"
}
},
"title": "GPU Power Usage",
"type": "timeseries",
"targets": [
{
"datasource": {
"type": "prometheus",
"uid": "prometheus"
},
"editorMode": "code",
"expr": "DCGM_FI_DEV_POWER_USAGE",
"legendFormat": "GPU {{gpu}} Power",
"range": true,
"refId": "A"
}
]
},
{
"datasource": {
"type": "prometheus",
"uid": "prometheus"
},
"fieldConfig": {
"defaults": {
"color": {
"mode": "palette-classic"
},
"custom": {
"axisCenteredZero": false,
"axisColorMode": "text",
"axisLabel": "",
"axisPlacement": "auto",
"barAlignment": 0,
"drawStyle": "line",
"fillOpacity": 20,
"gradientMode": "none",
"hideFrom": {
"legend": false,
"tooltip": false,
"viz": false
},
"lineInterpolation": "smooth",
"lineWidth": 2,
"pointSize": 5,
"scaleDistribution": {
"type": "linear"
},
"showPoints": "never",
"spanNulls": false,
"stacking": {
"group": "A",
"mode": "none"
},
"thresholdsStyle": {
"mode": "off"
}
},
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
"color": "green"
},
{
"color": "green",
"value": null
"color": "red",
"value": 80
}
]
},
"unit": "hertz"
}
},
"overrides": []
},
"gridPos": {
"h": 8,
"w": 12,
"w": 8,
"x": 0,
"y": 16
"y": 8
},
"id": 5,
"id": 17,
"options": {
"legend": {
"calcs": [
"mean",
"max"
],
"displayMode": "table",
"placement": "right",
"calcs": [],
"displayMode": "list",
"placement": "bottom",
"showLegend": true
},
"tooltip": {
"mode": "multi",
"hideZeros": false,
"mode": "single",
"sort": "none"
}
},
"title": "GPU Clock Speeds",
"type": "timeseries",
"pluginVersion": "12.0.1",
"targets": [
{
"datasource": {
"type": "prometheus",
"uid": "prometheus"
},
"editorMode": "code",
"expr": "DCGM_FI_DEV_SM_CLOCK * 1000000",
"legendFormat": "GPU {{gpu}} SM Clock",
"expr": "1000*(nv_llm_http_service_request_duration_seconds_sum / nv_llm_http_service_request_duration_seconds_count)",
"legendFormat": "{{model}}",
"range": true,
"refId": "A"
},
{
"datasource": {
"type": "prometheus",
"uid": "prometheus"
},
"editorMode": "code",
"expr": "DCGM_FI_DEV_MEM_CLOCK * 1000000",
"legendFormat": "GPU {{gpu}} Memory Clock",
"range": true,
"refId": "B"
}
]
],
"title": "Avg Request Duration",
"type": "timeseries"
},
{
"datasource": {
"type": "prometheus",
"uid": "prometheus"
"uid": "P1809F7CD0C75ACF3"
},
"description": "The length is the number of tokens. nv_llm_http_service_input_sequence_tokens",
"fieldConfig": {
"defaults": {
"color": {
"mode": "palette-classic"
},
"custom": {
"axisBorderShow": false,
"axisCenteredZero": false,
"axisColorMode": "text",
"axisLabel": "",
"axisLabel": "Tokens",
"axisPlacement": "auto",
"barAlignment": 0,
"barWidthFactor": 0.6,
"drawStyle": "line",
"fillOpacity": 20,
"fillOpacity": 0,
"gradientMode": "none",
"hideFrom": {
"legend": false,
"tooltip": false,
"viz": false
},
"lineInterpolation": "smooth",
"lineWidth": 2,
"insertNulls": false,
"lineInterpolation": "linear",
"lineWidth": 1,
"pointSize": 5,
"scaleDistribution": {
"type": "linear"
},
"showPoints": "never",
"showPoints": "auto",
"spanNulls": false,
"stacking": {
"group": "A",
......@@ -592,70 +443,67 @@
"mode": "absolute",
"steps": [
{
"color": "green",
"value": null
"color": "green"
},
{
"color": "red",
"value": 80
}
]
},
"unit": "percent",
"min": 0,
"max": 100
}
},
"overrides": []
},
"gridPos": {
"h": 8,
"w": 12,
"x": 12,
"y": 16
"w": 8,
"x": 8,
"y": 8
},
"id": 6,
"id": 11,
"options": {
"legend": {
"calcs": [
"mean",
"max"
],
"displayMode": "table",
"placement": "right",
"calcs": [],
"displayMode": "list",
"placement": "bottom",
"showLegend": true
},
"tooltip": {
"mode": "multi",
"hideZeros": false,
"mode": "single",
"sort": "none"
}
},
"title": "GPU Engine Activity",
"type": "timeseries",
"pluginVersion": "12.0.1",
"targets": [
{
"datasource": {
"type": "prometheus",
"uid": "prometheus"
},
"editorMode": "code",
"expr": "DCGM_FI_PROF_GR_ENGINE_ACTIVE * 100",
"legendFormat": "GPU {{gpu}} Graphics Engine",
"expr": "nv_llm_http_service_input_sequence_tokens_sum / nv_llm_http_service_input_sequence_tokens_count",
"legendFormat": "ISL",
"range": true,
"refId": "A"
},
{
"datasource": {
"type": "prometheus",
"uid": "prometheus"
"uid": "P1809F7CD0C75ACF3"
},
"editorMode": "code",
"expr": "DCGM_FI_PROF_PIPE_TENSOR_ACTIVE * 100",
"legendFormat": "GPU {{gpu}} Tensor Core",
"expr": "nv_llm_http_service_output_sequence_tokens_sum / nv_llm_http_service_output_sequence_tokens_count",
"hide": false,
"instant": false,
"legendFormat": "OSL",
"range": true,
"refId": "B"
}
]
],
"title": "Avg Input/Output Sequence Length",
"type": "timeseries"
},
{
"datasource": {
"type": "prometheus",
"uid": "prometheus"
"uid": "P1809F7CD0C75ACF3"
},
"fieldConfig": {
"defaults": {
......@@ -663,26 +511,29 @@
"mode": "palette-classic"
},
"custom": {
"axisBorderShow": false,
"axisCenteredZero": false,
"axisColorMode": "text",
"axisLabel": "",
"axisPlacement": "auto",
"axisPlacement": "left",
"barAlignment": 0,
"barWidthFactor": 0.6,
"drawStyle": "line",
"fillOpacity": 20,
"fillOpacity": 0,
"gradientMode": "none",
"hideFrom": {
"legend": false,
"tooltip": false,
"viz": false
},
"insertNulls": false,
"lineInterpolation": "smooth",
"lineWidth": 2,
"lineWidth": 1,
"pointSize": 5,
"scaleDistribution": {
"type": "linear"
},
"showPoints": "never",
"showPoints": "auto",
"spanNulls": false,
"stacking": {
"group": "A",
......@@ -697,208 +548,79 @@
"mode": "absolute",
"steps": [
{
"color": "green",
"value": null
"color": "green"
},
{
"color": "red",
"value": 79.9954
}
]
},
"unit": "binBps"
}
},
"overrides": []
},
"gridPos": {
"h": 8,
"w": 12,
"x": 0,
"y": 24
"w": 8,
"x": 16,
"y": 8
},
"id": 7,
"id": 1,
"options": {
"legend": {
"calcs": [
"mean",
"max"
],
"displayMode": "table",
"placement": "right",
"calcs": [],
"displayMode": "list",
"placement": "bottom",
"showLegend": true
},
"tooltip": {
"mode": "multi",
"hideZeros": false,
"mode": "single",
"sort": "none"
}
},
"title": "PCIe Bandwidth",
"type": "timeseries",
"pluginVersion": "12.0.1",
"targets": [
{
"datasource": {
"type": "prometheus",
"uid": "prometheus"
"uid": "P1809F7CD0C75ACF3"
},
"editorMode": "code",
"expr": "rate(DCGM_FI_PROF_PCIE_RX_BYTES[10s])",
"legendFormat": "GPU {{gpu}} PCIe RX",
"exemplar": false,
"expr": "DCGM_FI_DEV_GPU_UTIL",
"instant": false,
"legendFormat": "{{__name__}} (%)",
"range": true,
"refId": "A"
},
{
"datasource": {
"type": "prometheus",
"uid": "prometheus"
"uid": "P1809F7CD0C75ACF3"
},
"editorMode": "code",
"expr": "rate(DCGM_FI_PROF_PCIE_TX_BYTES[10s])",
"legendFormat": "GPU {{gpu}} PCIe TX",
"exemplar": false,
"expr": "DCGM_FI_DEV_POWER_USAGE",
"hide": false,
"instant": false,
"legendFormat": "{{__name__}} (Watts)",
"range": true,
"refId": "B"
}
]
},
{
"datasource": {
"type": "prometheus",
"uid": "prometheus"
},
"fieldConfig": {
"defaults": {
"color": {
"mode": "thresholds"
},
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "green",
"value": null
},
{
"color": "yellow",
"value": 50
},
{
"color": "red",
"value": 80
}
]
},
"unit": "percent"
},
"overrides": []
},
"gridPos": {
"h": 8,
"w": 6,
"x": 12,
"y": 24
},
"id": 8,
"options": {
"orientation": "auto",
"reduceOptions": {
"calcs": [
"lastNotNull"
],
"fields": "",
"values": false
},
"showThresholdLabels": false,
"showThresholdMarkers": true
},
"pluginVersion": "10.0.0",
"title": "Average GPU Utilization",
"type": "gauge",
"targets": [
{
"datasource": {
"type": "prometheus",
"uid": "prometheus"
},
"editorMode": "code",
"expr": "avg(DCGM_FI_DEV_GPU_UTIL)",
"legendFormat": "__auto",
"range": true,
"refId": "A"
}
]
},
{
"datasource": {
"type": "prometheus",
"uid": "prometheus"
},
"fieldConfig": {
"defaults": {
"color": {
"mode": "thresholds"
},
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "green",
"value": null
},
{
"color": "yellow",
"value": 70
},
{
"color": "red",
"value": 85
}
]
},
"unit": "celsius"
},
"overrides": []
},
"gridPos": {
"h": 8,
"w": 6,
"x": 18,
"y": 24
},
"id": 9,
"options": {
"orientation": "auto",
"reduceOptions": {
"calcs": [
"lastNotNull"
],
"fields": "",
"values": false
},
"showThresholdLabels": false,
"showThresholdMarkers": true
},
"pluginVersion": "10.0.0",
"title": "Max GPU Temperature",
"type": "gauge",
"targets": [
{
"datasource": {
"type": "prometheus",
"uid": "prometheus"
},
"editorMode": "code",
"expr": "max(DCGM_FI_DEV_GPU_TEMP)",
"legendFormat": "__auto",
"range": true,
"refId": "A"
}
]
],
"title": "DCGM GPU Utilization",
"type": "timeseries"
}
],
"refresh": "5s",
"schemaVersion": 36,
"style": "dark",
"preload": false,
"refresh": "",
"schemaVersion": 41,
"tags": [
"dcgm",
"gpu",
"nvidia"
"Dynamo",
"DCGM",
"etcd",
"NATS"
],
"templating": {
"list": []
......@@ -908,9 +630,8 @@
"to": "now"
},
"timepicker": {},
"timezone": "",
"title": "DCGM GPU Monitoring Dashboard",
"uid": "dcgm-dashboard",
"version": 1,
"weekStart": ""
}
"timezone": "browser",
"title": "Dynamo Dashboard",
"uid": "a7d3733f-f8e7-423a-ab4b-b18e3d7d0357",
"version": 5
}
\ No newline at end of file
......@@ -33,14 +33,23 @@ scrape_configs:
static_configs:
- targets: ['dcgm-exporter:9400'] # on the "monitoring" network
# This is a demo service that needs to be launched manually. See components/metrics/README.md
# Note that you may need to disable the firewall on your host. On Ubuntu: sudo ufw allow 8000/tcp
- job_name: 'llm-demo'
scrape_interval: 10s
static_configs:
- targets: ['host.docker.internal:8000'] # on the "monitoring" network
# This is another demo aggregator that needs to be launched manually. See components/metrics/README.md
# Note that you may need to disable the firewall on your host. On Ubuntu: sudo ufw allow 9091/tcp
- job_name: 'metrics-aggregation-service'
scrape_interval: 2s
static_configs:
# - targets: ['localhost:9091'] # metrics aggregation service on host
- targets: ['host.docker.internal:9091'] # metrics aggregation service on host
# Uncomment to see its own Prometheus metrics
# - job_name: 'prometheus'
# scrape_interval: 5s
# static_configs:
# - targets: ['prometheus:9090'] # on the "monitoring" network
# Uncomment to see the metrics-aggregation-service metrics
# - job_name: 'metrics-aggregation-service'
# scrape_interval: 2s
# static_configs:
# - targets: ['host.docker.internal:9091'] # metrics aggregation service on host