Unverified commit ebd23361, authored by Keiven C, committed by GitHub

feat: add a new composite SW/HW grafana (DYN-678) (#1788)


Co-authored-by: Keiven Chang <keivenchang@users.noreply.github.com>
parent 0584b081
# Metrics # Metrics
The `metrics` component is a utility that can collect, aggregate, and publish The `metrics` component is a utility that can collect, aggregate, and publish
metrics from a Dynamo deployment for use in other applications or visualization metrics from a Dynamo deployment. After collecting and aggregating metrics from
tools like Prometheus and Grafana. workers, it exposes them via an HTTP `/metrics` endpoint in Prometheus format
that other applications or visualization tools like Prometheus server and Grafana can
pull from.
**Note**: This is a demo implementation. The metrics component is currently under active development and this documentation will change as the implementation evolves.
- In this demo the metrics names use the prefix "llm", but in production they will be prefixed with "nv_llm" (e.g., the HTTP `/metrics` endpoint will serve metrics with "nv_llm" prefixes)
- This demo only works with examples/llm/configs/agg.yaml; other configurations will not work
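
To check which prefix a running deployment actually serves, one option is to list the distinct metric names in the `/metrics` output. The sketch below is a rough text filter, not a full Prometheus parser, and the function name `metric_names` is hypothetical (it is not part of the `metrics` component).

```bash
# Hypothetical helper (not part of Dynamo): print the unique metric names
# found in Prometheus text-format input on stdin.
# Steps: drop HELP/TYPE/comment lines, strip labels and values,
# drop blank lines, then de-duplicate.
metric_names() {
  grep -v '^#' \
    | sed -E 's/[{ ].*//' \
    | grep -v '^$' \
    | sort -u
}
```

For example, assuming the metrics server is running on its default port, `curl -s localhost:9091/metrics | metric_names` should list names starting with `llm_` in this demo.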
<div align="center"> <div align="center">
<img src="images/dynamo_metrics_grafana.png" alt="Dynamo Metrics Dashboard"/> <img src="images/dynamo_metrics_grafana.png" alt="Dynamo Metrics Dashboard"/>
...@@ -22,16 +28,16 @@ For example: ...@@ -22,16 +28,16 @@ For example:
```bash ```bash
# Default namespace is "dynamo", but can be configured with --namespace # Default namespace is "dynamo", but can be configured with --namespace
# For more detailed output, try setting the env var: DYN_LOG=debug # For more detailed output, try setting the env var: DYN_LOG=debug
metrics --component my_component --endpoint my_endpoint metrics --component MyComponent --endpoint my_endpoint
# 2025-03-17T00:07:05.202558Z INFO metrics: Scraping endpoint dynamo/my_component/my_endpoint for stats # 2025-03-17T00:07:05.202558Z INFO metrics: Scraping endpoint dynamo/MyComponent/my_endpoint for stats
# 2025-03-17T00:07:05.202955Z INFO metrics: Prometheus metrics server started at 0.0.0.0:9091/metrics # 2025-03-17T00:07:05.202955Z INFO metrics: Prometheus metrics server started at 0.0.0.0:9091/metrics
# ... # ...
``` ```
With no matching endpoints running to collect stats from, you should see warnings in the logs: With no matching endpoints running to collect stats from, you should see warnings in the logs:
```bash ```bash
2025-03-17T00:07:06.204756Z WARN metrics: No endpoints found matching dynamo/my_component/my_endpoint 2025-03-17T00:07:06.204756Z WARN metrics: No endpoints found matching dynamo/MyComponent/my_endpoint
``` ```
After a worker with a matching endpoint gets started, the endpoint After a worker with a matching endpoint gets started, the endpoint
...@@ -44,22 +50,23 @@ so below are some examples of workers and how they can be monitored. ...@@ -44,22 +50,23 @@ so below are some examples of workers and how they can be monitored.
### Mock Worker ### Mock Worker
For quick testing and debugging, there is a Rust-based To try out how `metrics` works, there is a demo Rust-based
[mock worker](src/bin/mock_worker.rs) that registers a mock [mock worker](src/bin/mock_worker.rs) that provides sample data through two mechanisms:
`StatsHandler` under an endpoint named 1. Exposes a stats handler at `dynamo/MyComponent/my_endpoint` that responds to polling requests (from `metrics`) with randomly generated `ForwardPassMetrics` data
`dynamo/my_component/my_endpoint` and publishes random data. 2. Publishes mock `KVHitRateEvent` data every second to demonstrate event-based metrics
Step 1: Launch a mock worker via the following command (if already built):
```bash ```bash
# Can run multiple workers in separate shells to see aggregation as well. # or build/run from source: DYN_LOG=DEBUG cargo run --bin mock_worker
# Or to build/run from source: cargo run --bin mock_worker
mock_worker mock_worker
# 2025-03-16T23:49:28.101668Z INFO mock_worker: Starting Mock Worker on Endpoint: dynamo/my_component/my_endpoint # 2025-03-16T23:49:28.101668Z INFO mock_worker: Starting Mock Worker on Endpoint: dynamo/MyComponent/my_endpoint
``` ```
To monitor the metrics of these mock workers, run: Step 2: Monitor the metrics of these mock workers. The `metrics` component serves its own Prometheus endpoint on
port 9091 (the default when --port is not specified) at /metrics:
```bash ```bash
metrics --component my_component --endpoint my_endpoint metrics --component MyComponent --endpoint my_endpoint
``` ```
### Real Worker ### Real Worker
...@@ -69,13 +76,14 @@ see the examples in [examples/llm](../../examples/llm). ...@@ -69,13 +76,14 @@ see the examples in [examples/llm](../../examples/llm).
For example, for a VLLM + KV Routing based deployment that For example, for a VLLM + KV Routing based deployment that
exposes statistics on an endpoint labeled exposes statistics on an endpoint labeled
`dynamo/VllmWorker/load_metrics`: `dynamo/VllmWorker/load_metrics` (note: this does NOT currently work
with any other example such as examples/vllm_v0, vllm_v1, ...):
```bash ```bash
cd deploy/examples/llm cd deploy/examples/llm
dynamo serve <vllm kv routing example args> dynamo serve graphs.agg:Frontend -f configs/agg.yaml
``` ```
To monitor the metrics of these VllmWorkers, run: Then, to monitor the metrics of these VllmWorkers, run:
```bash ```bash
metrics --component VllmWorker --endpoint load_metrics metrics --component VllmWorker --endpoint load_metrics
``` ```
...@@ -105,10 +113,10 @@ Prometheus server or curl client can pull from: ...@@ -105,10 +113,10 @@ Prometheus server or curl client can pull from:
```bash ```bash
# Start metrics server on default host (0.0.0.0) and port (9091) # Start metrics server on default host (0.0.0.0) and port (9091)
metrics --component my_component --endpoint my_endpoint metrics --component MyComponent --endpoint my_endpoint
# Or specify a custom port # Or specify a custom port
metrics --component my_component --endpoint my_endpoint --port 9092 metrics --component MyComponent --endpoint my_endpoint --port 9092
``` ```
In pull mode: In pull mode:
...@@ -121,12 +129,12 @@ curl localhost:9091/metrics ...@@ -121,12 +129,12 @@ curl localhost:9091/metrics
# # HELP llm_kv_blocks_active Active KV cache blocks # # HELP llm_kv_blocks_active Active KV cache blocks
# # TYPE llm_kv_blocks_active gauge # # TYPE llm_kv_blocks_active gauge
# llm_kv_blocks_active{component="my_component",endpoint="my_endpoint",worker_id="7587884888253033398"} 40 # llm_kv_blocks_active{component="MyComponent",endpoint="my_endpoint",worker_id="7587884888253033398"} 40
# llm_kv_blocks_active{component="my_component",endpoint="my_endpoint",worker_id="7587884888253033401"} 2 # llm_kv_blocks_active{component="MyComponent",endpoint="my_endpoint",worker_id="7587884888253033401"} 2
# # HELP llm_kv_blocks_total Total KV cache blocks # # HELP llm_kv_blocks_total Total KV cache blocks
# # TYPE llm_kv_blocks_total gauge # # TYPE llm_kv_blocks_total gauge
# llm_kv_blocks_total{component="my_component",endpoint="my_endpoint",worker_id="7587884888253033398"} 100 # llm_kv_blocks_total{component="MyComponent",endpoint="my_endpoint",worker_id="7587884888253033398"} 100
# llm_kv_blocks_total{component="my_component",endpoint="my_endpoint",worker_id="7587884888253033401"} 100 # llm_kv_blocks_total{component="MyComponent",endpoint="my_endpoint",worker_id="7587884888253033401"} 100
``` ```
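
Because each worker reports its own series (distinguished by `worker_id`), it can be handy to total a metric across workers straight from the pulled text. The following is a minimal sketch, assuming the text format shown in the sample output with no trailing timestamps; `sum_metric` is a hypothetical helper name, not something the `metrics` component provides.

```bash
# Hypothetical helper (not part of Dynamo): sum one metric across all label
# sets (for example, across worker_id values) in Prometheus text-format
# input on stdin. Assumes the sample value is the last field on each line.
sum_metric() {
  awk -v name="$1" '
    /^#/ { next }
    index($0, name "{") == 1 || $1 == name { sum += $NF }
    END { print sum + 0 }
  '
}
```

With two workers reporting 40 and 2 active blocks as in the sample output, `curl -s localhost:9091/metrics | sum_metric llm_kv_blocks_active` would print 42.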
### Push Mode ### Push Mode
...@@ -145,7 +153,7 @@ Start the metrics component in `--push` mode, specifying the host and port of yo ...@@ -145,7 +153,7 @@ Start the metrics component in `--push` mode, specifying the host and port of yo
```bash ```bash
# Push metrics to a Prometheus PushGateway every --push-interval seconds # Push metrics to a Prometheus PushGateway every --push-interval seconds
metrics \ metrics \
--component my_component \ --component MyComponent \
--endpoint my_endpoint \ --endpoint my_endpoint \
--host 127.0.0.1 \ --host 127.0.0.1 \
--port 9091 \ --port 9091 \
...@@ -173,7 +181,7 @@ For easy iteration while making edits to the metrics component, you can use `car ...@@ -173,7 +181,7 @@ For easy iteration while making edits to the metrics component, you can use `car
to build and run with your local changes: to build and run with your local changes:
```bash ```bash
cargo run --bin metrics -- --component my_component --endpoint my_endpoint cargo run --bin metrics -- --component MyComponent --endpoint my_endpoint
``` ```
...@@ -146,7 +146,7 @@ async fn backend(runtime: DistributedRuntime) -> Result<()> { ...@@ -146,7 +146,7 @@ async fn backend(runtime: DistributedRuntime) -> Result<()> {
let namespace = runtime.namespace("dynamo")?; let namespace = runtime.namespace("dynamo")?;
// we must first create a service, then we can attach one or more endpoints // we must first create a service, then we can attach one or more endpoints
let component = namespace let component = namespace
.component("my_component")? .component("MyComponent")?
.service_builder() .service_builder()
.create() .create()
.await?; .await?;
......
...@@ -100,16 +100,18 @@ Note: You may need to adjust the target based on your host configuration and net ...@@ -100,16 +100,18 @@ Note: You may need to adjust the target based on your host configuration and net
Grafana is pre-configured with: Grafana is pre-configured with:
- Prometheus datasource - Prometheus datasource
- Sample dashboard for visualizing service metrics - Sample dashboard for visualizing service metrics
![grafana image](./grafana1.png) ![grafana image](./grafana-dynamo-composite.png)
## Required Files ## Required Files
The following configuration files should be present in this directory: The following configuration files should be present in this directory:
- [docker-compose.yml](./docker-compose.yml): Defines the Prometheus and Grafana services - [docker-compose.yml](./docker-compose.yml): Defines the Prometheus and Grafana services
- [prometheus.yml](./prometheus.yml): Contains Prometheus scraping configuration - [prometheus.yml](./prometheus.yml): Contains Prometheus scraping configuration
- [grafana.json](./grafana.json): Contains Grafana dashboard configuration
- [grafana-datasources.yml](./grafana-datasources.yml): Contains Grafana datasource configuration - [grafana-datasources.yml](./grafana-datasources.yml): Contains Grafana datasource configuration
- [grafana-dashboard-providers.yml](./grafana-dashboard-providers.yml): Contains Grafana dashboard provider configuration - [grafana_dashboards/grafana-dashboard-providers.yml](./grafana_dashboards/grafana-dashboard-providers.yml): Contains Grafana dashboard provider configuration
- [grafana_dashboards/grafana-dynamo-dashboard.json](./grafana_dashboards/grafana-dynamo-dashboard.json): A general Dynamo Dashboard for both SW and HW metrics.
- [grafana_dashboards/grafana-llm-metrics.json](./grafana_dashboards/grafana-llm-metrics.json): Contains Grafana dashboard configuration for LLM specific metrics.
- [grafana_dashboards/grafana-dcgm-metrics.json](./grafana_dashboards/grafana-dcgm-metrics.json): Contains Grafana dashboard configuration for DCGM GPU metrics
## Running the example `metrics` component ## Running the example `metrics` component
......
...@@ -13,6 +13,7 @@ ...@@ -13,6 +13,7 @@
# See the License for the specific language governing permissions and # See the License for the specific language governing permissions and
# limitations under the License. # limitations under the License.
# IMPORTANT NOTE: Make sure this is in sync with lib/runtime/docker-compose.yml
networks: networks:
server: server:
driver: bridge driver: bridge
...@@ -83,6 +84,8 @@ services: ...@@ -83,6 +84,8 @@ services:
networks: networks:
- monitoring - monitoring
# To access Prometheus from another machine, you may need to disable the firewall on your host. On Ubuntu:
# sudo ufw allow 9090/tcp
prometheus: prometheus:
image: prom/prometheus:v3.4.1 image: prom/prometheus:v3.4.1
container_name: prometheus container_name: prometheus
...@@ -98,11 +101,13 @@ services: ...@@ -98,11 +101,13 @@ services:
restart: unless-stopped restart: unless-stopped
# Example to pull from the /query endpoint: # Example to pull from the /query endpoint:
# {__name__=~"DCGM.*", job="dcgm-exporter"} # {__name__=~"DCGM.*", job="dcgm-exporter"}
ports:
- "9090:9090"
networks: networks:
- monitoring - monitoring
ports:
- "9090:9090"
profiles: [metrics] profiles: [metrics]
extra_hosts:
- "host.docker.internal:host-gateway"
depends_on: depends_on:
- dcgm-exporter - dcgm-exporter
- nats-prometheus-exporter - nats-prometheus-exporter
...@@ -110,23 +115,29 @@ services: ...@@ -110,23 +115,29 @@ services:
# grafana connects to prometheus via the /query endpoint. # grafana connects to prometheus via the /query endpoint.
# Default credentials are dynamo/dynamo. # Default credentials are dynamo/dynamo.
# To access Grafana from another machine, you may need to disable the firewall on your host. On Ubuntu:
# sudo ufw allow 3001/tcp
grafana: grafana:
image: grafana/grafana-enterprise:12.0.1 image: grafana/grafana-enterprise:12.0.1
container_name: grafana container_name: grafana
volumes: volumes:
- ./grafana.json:/etc/grafana/provisioning/dashboards/llm-worker-dashboard.json - ./grafana_dashboards:/etc/grafana/provisioning/dashboards
- ./grafana-dcgm-dashboard.json:/etc/grafana/provisioning/dashboards/dcgm-dashboard.json
- ./grafana-datasources.yml:/etc/grafana/provisioning/datasources/datasources.yml - ./grafana-datasources.yml:/etc/grafana/provisioning/datasources/datasources.yml
- ./grafana-dashboard-providers.yml:/etc/grafana/provisioning/dashboards/dashboard-providers.yml
environment: environment:
# Port 3000 is already used by "dynamo serve", so use 3001 # Port 3000 is already used by "dynamo serve", so use 3001
- GF_SERVER_HTTP_PORT=3001 - GF_SERVER_HTTP_PORT=3001
# do not make it admin/admin, because you will be prompted to change the password every time
- GF_SECURITY_ADMIN_USER=dynamo - GF_SECURITY_ADMIN_USER=dynamo
- GF_SECURITY_ADMIN_PASSWORD=dynamo - GF_SECURITY_ADMIN_PASSWORD=dynamo
- GF_USERS_ALLOW_SIGN_UP=false - GF_USERS_ALLOW_SIGN_UP=false
- GF_INSTALL_PLUGINS=grafana-piechart-panel - GF_INSTALL_PLUGINS=grafana-piechart-panel
# Default min interval is 5s, but can be configured lower # Default min interval is 5s, but can be configured lower
- GF_DASHBOARDS_MIN_REFRESH_INTERVAL=2s - GF_DASHBOARDS_MIN_REFRESH_INTERVAL=2s
# Disable password change requirement
- GF_SECURITY_DISABLE_INITIAL_ADMIN_CREATION=false
- GF_SECURITY_ADMIN_PASSWORD_POLICY=false
- GF_AUTH_DISABLE_LOGIN_FORM=false
- GF_AUTH_DISABLE_SIGNOUT_MENU=false
restart: unless-stopped restart: unless-stopped
ports: ports:
- "3001:3001" - "3001:3001"
......
File suppressed by a .gitattributes entry or the file's encoding is unsupported.
This diff is collapsed.
...@@ -33,14 +33,23 @@ scrape_configs: ...@@ -33,14 +33,23 @@ scrape_configs:
static_configs: static_configs:
- targets: ['dcgm-exporter:9400'] # on the "monitoring" network - targets: ['dcgm-exporter:9400'] # on the "monitoring" network
# This is a demo service that needs to be launched manually. See components/metrics/README.md
# Note that you may need to disable the firewall on your host. On Ubuntu: sudo ufw allow 8000/tcp
- job_name: 'llm-demo'
scrape_interval: 10s
static_configs:
- targets: ['host.docker.internal:8000'] # on the "monitoring" network
# This is another demo aggregator that needs to be launched manually. See components/metrics/README.md
# Note that you may need to disable the firewall on your host. On Ubuntu: sudo ufw allow 9091/tcp
- job_name: 'metrics-aggregation-service'
scrape_interval: 2s
static_configs:
# - targets: ['localhost:9091'] # metrics aggregation service on host
- targets: ['host.docker.internal:9091'] # metrics aggregation service on host
# Uncomment to see its own Prometheus metrics # Uncomment to see its own Prometheus metrics
# - job_name: 'prometheus' # - job_name: 'prometheus'
# scrape_interval: 5s # scrape_interval: 5s
# static_configs: # static_configs:
# - targets: ['prometheus:9090'] # on the "monitoring" network # - targets: ['prometheus:9090'] # on the "monitoring" network
# Uncomment to see the metrics-aggregation-service metrics
# - job_name: 'metrics-aggregation-service'
# scrape_interval: 2s
# static_configs:
# - targets: ['host.docker.internal:9091'] # metrics aggregation service on host
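
The firewall notes in this scrape configuration depend on which host ports Prometheus actually scrapes. As a rough aid, the sketch below lists every uncommented target in a Prometheus config. It is a plain text filter rather than a YAML parser, it assumes the single-line `targets: [...]` style used in this file, and `list_targets` is a hypothetical helper name.

```bash
# Hypothetical helper (not part of Dynamo): print the host:port of every
# uncommented `- targets: [...]` line in a Prometheus config read from stdin.
# Commented-out targets (lines starting with '#') are skipped.
list_targets() {
  grep -E "^[[:space:]]*-[[:space:]]*targets:" \
    | sed -E "s/.*\[([^]]*)\].*/\1/" \
    | tr -d "' \"" \
    | tr ',' '\n'
}
```

Running `list_targets < prometheus.yml` prints one target per line; entries that point at `host.docker.internal` are the ones whose ports may need to be reachable from the Prometheus container.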