Commit ebd23361 authored by Keiven C, committed by GitHub

feat: add a new composite SW/HW grafana (DYN-678) (#1788)


Co-authored-by: Keiven Chang <keivenchang@users.noreply.github.com>
parent 0584b081
# Metrics
The `metrics` component is a utility that can collect, aggregate, and publish
metrics from a Dynamo deployment for use in other applications or visualization
tools like Prometheus and Grafana.
metrics from a Dynamo deployment. After collecting and aggregating metrics from
workers, it exposes them via an HTTP `/metrics` endpoint in Prometheus format
that other applications or visualization tools like Prometheus server and Grafana can
pull from.
**Note**: This is a demo implementation. The metrics component is currently under active development and this documentation will change as the implementation evolves.
- In this demo the metric names use the prefix "llm", but in production they will be prefixed with "nv_llm" (e.g., the HTTP `/metrics` endpoint will serve metrics with "nv_llm" prefixes)
- This demo only works with `examples/llm/configs/agg.yaml`; other configurations will not work
<div align="center">
<img src="images/dynamo_metrics_grafana.png" alt="Dynamo Metrics Dashboard"/>
......@@ -22,16 +28,16 @@ For example:
```bash
# Default namespace is "dynamo", but can be configured with --namespace
# For more detailed output, try setting the env var: DYN_LOG=debug
metrics --component my_component --endpoint my_endpoint
metrics --component MyComponent --endpoint my_endpoint
# 2025-03-17T00:07:05.202558Z INFO metrics: Scraping endpoint dynamo/my_component/my_endpoint for stats
# 2025-03-17T00:07:05.202558Z INFO metrics: Scraping endpoint dynamo/MyComponent/my_endpoint for stats
# 2025-03-17T00:07:05.202955Z INFO metrics: Prometheus metrics server started at 0.0.0.0:9091/metrics
# ...
```
With no matching endpoints running to collect stats from, you should see warnings in the logs:
```bash
2025-03-17T00:07:06.204756Z WARN metrics: No endpoints found matching dynamo/my_component/my_endpoint
2025-03-17T00:07:06.204756Z WARN metrics: No endpoints found matching dynamo/MyComponent/my_endpoint
```
After a worker with a matching endpoint gets started, the endpoint
......@@ -44,22 +50,23 @@ so below are some examples of workers and how they can be monitored.
### Mock Worker
For quick testing and debugging, there is a Rust-based
[mock worker](src/bin/mock_worker.rs) that registers a mock
`StatsHandler` under an endpoint named
`dynamo/my_component/my_endpoint` and publishes random data.
To try out how `metrics` works, there is a demo Rust-based
[mock worker](src/bin/mock_worker.rs) that provides sample data through two mechanisms:
1. Exposes a stats handler at `dynamo/MyComponent/my_endpoint` that responds to polling requests (from `metrics`) with randomly generated `ForwardPassMetrics` data
2. Publishes mock `KVHitRateEvent` data every second to demonstrate event-based metrics
Step 1: Launch a mock worker with the following command (if already built):
```bash
# Can run multiple workers in separate shells to see aggregation as well.
# Or to build/run from source: cargo run --bin mock_worker
# or build/run from source: DYN_LOG=DEBUG cargo run --bin mock_worker
mock_worker
# 2025-03-16T23:49:28.101668Z INFO mock_worker: Starting Mock Worker on Endpoint: dynamo/my_component/my_endpoint
# 2025-03-16T23:49:28.101668Z INFO mock_worker: Starting Mock Worker on Endpoint: dynamo/MyComponent/my_endpoint
```
To monitor the metrics of these mock workers, run:
Step 2: Monitor the metrics of these mock workers. `metrics` serves its own Prometheus endpoint
at `/metrics` on port 9091 (the default when `--port` is not specified):
```bash
metrics --component my_component --endpoint my_endpoint
metrics --component MyComponent --endpoint my_endpoint
```
### Real Worker
......@@ -69,13 +76,14 @@ see the examples in [examples/llm](../../examples/llm).
For example, for a VLLM + KV Routing based deployment that
exposes statistics on an endpoint labeled
`dynamo/VllmWorker/load_metrics`:
`dynamo/VllmWorker/load_metrics` (note: this does NOT currently work
with any other example such as examples/vllm_v0, vllm_v1, ...):
```bash
cd deploy/examples/llm
dynamo serve <vllm kv routing example args>
dynamo serve graphs.agg:Frontend -f configs/agg.yaml
```
To monitor the metrics of these VllmWorkers, run:
Then, to monitor the metrics of these VllmWorkers, run:
```bash
metrics --component VllmWorker --endpoint load_metrics
```
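Judging by the log lines and commands above, the endpoint label appears to follow the pattern `namespace/component/endpoint`, with each part matching one flag. As a minimal sketch of that mapping (the helper name `metrics_args_for` is hypothetical and not part of Dynamo; it only uses the `--namespace`, `--component`, and `--endpoint` flags mentioned in this README):

```bash
# Hypothetical helper: turn "namespace/component/endpoint" into a metrics command line.
metrics_args_for() {
  path="$1"
  ns="${path%%/*}"      # text before the first "/"
  rest="${path#*/}"     # text after the first "/"
  comp="${rest%%/*}"    # middle part
  ep="${rest#*/}"       # last part
  printf 'metrics --namespace %s --component %s --endpoint %s\n' "$ns" "$comp" "$ep"
}

metrics_args_for dynamo/VllmWorker/load_metrics
# prints: metrics --namespace dynamo --component VllmWorker --endpoint load_metrics
```

Because `dynamo` is the default namespace, the `--namespace` part can be dropped for the examples in this README.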
......@@ -105,10 +113,10 @@ Prometheus server or curl client can pull from:
```bash
# Start metrics server on default host (0.0.0.0) and port (9091)
metrics --component my_component --endpoint my_endpoint
metrics --component MyComponent --endpoint my_endpoint
# Or specify a custom port
metrics --component my_component --endpoint my_endpoint --port 9092
metrics --component MyComponent --endpoint my_endpoint --port 9092
```
In pull mode:
......@@ -121,12 +129,12 @@ curl localhost:9091/metrics
# # HELP llm_kv_blocks_active Active KV cache blocks
# # TYPE llm_kv_blocks_active gauge
# llm_kv_blocks_active{component="my_component",endpoint="my_endpoint",worker_id="7587884888253033398"} 40
# llm_kv_blocks_active{component="my_component",endpoint="my_endpoint",worker_id="7587884888253033401"} 2
# llm_kv_blocks_active{component="MyComponent",endpoint="my_endpoint",worker_id="7587884888253033398"} 40
# llm_kv_blocks_active{component="MyComponent",endpoint="my_endpoint",worker_id="7587884888253033401"} 2
# # HELP llm_kv_blocks_total Total KV cache blocks
# # TYPE llm_kv_blocks_total gauge
# llm_kv_blocks_total{component="my_component",endpoint="my_endpoint",worker_id="7587884888253033398"} 100
# llm_kv_blocks_total{component="my_component",endpoint="my_endpoint",worker_id="7587884888253033401"} 100
# llm_kv_blocks_total{component="MyComponent",endpoint="my_endpoint",worker_id="7587884888253033398"} 100
# llm_kv_blocks_total{component="MyComponent",endpoint="my_endpoint",worker_id="7587884888253033401"} 100
```
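To check the aggregation by hand, the per-worker samples above can be summed with standard tools. The following is a minimal sketch rather than part of the `metrics` component: the helper name `sum_metric` is hypothetical, and it assumes the text format shown above (one `name{labels} value` sample per line, no trailing timestamps, `# HELP` and `# TYPE` lines starting with `#`).

```bash
# Hypothetical helper: sum all samples of one metric across workers.
# Reads Prometheus text exposition format on stdin.
sum_metric() {
  awk -v name="$1" '
    /^#/ { next }                       # skip HELP and TYPE comment lines
    {
      i = index($1, "{")                # labels start at the first "{"
      metric = (i > 0) ? substr($1, 1, i - 1) : $1
      if (metric == name) total += $NF  # last field is the sample value
    }
    END { print total + 0 }
  '
}

# Usage against a running metrics component on the default port:
#   curl -s localhost:9091/metrics | sum_metric llm_kv_blocks_active
```

With the two workers in the sample output above, `llm_kv_blocks_active` would sum to 42 (40 + 2).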
### Push Mode
......@@ -145,7 +153,7 @@ Start the metrics component in `--push` mode, specifying the host and port of yo
```bash
# Push metrics to a Prometheus PushGateway every --push-interval seconds
metrics \
--component my_component \
--component MyComponent \
--endpoint my_endpoint \
--host 127.0.0.1 \
--port 9091 \
......@@ -173,7 +181,7 @@ For easy iteration while making edits to the metrics component, you can use `car
to build and run with your local changes:
```bash
cargo run --bin metrics -- --component my_component --endpoint my_endpoint
cargo run --bin metrics -- --component MyComponent --endpoint my_endpoint
```
......@@ -146,7 +146,7 @@ async fn backend(runtime: DistributedRuntime) -> Result<()> {
let namespace = runtime.namespace("dynamo")?;
// we must first create a service, then we can attach one or more endpoints
let component = namespace
.component("my_component")?
.component("MyComponent")?
.service_builder()
.create()
.await?;
......
......@@ -100,16 +100,18 @@ Note: You may need to adjust the target based on your host configuration and net
Grafana is pre-configured with:
- Prometheus datasource
- Sample dashboard for visualizing service metrics
![grafana image](./grafana1.png)
![grafana image](./grafana-dynamo-composite.png)
## Required Files
The following configuration files should be present in this directory:
- [docker-compose.yml](./docker-compose.yml): Defines the Prometheus and Grafana services
- [prometheus.yml](./prometheus.yml): Contains Prometheus scraping configuration
- [grafana.json](./grafana.json): Contains Grafana dashboard configuration
- [grafana-datasources.yml](./grafana-datasources.yml): Contains Grafana datasource configuration
- [grafana-dashboard-providers.yml](./grafana-dashboard-providers.yml): Contains Grafana dashboard provider configuration
- [grafana_dashboards/grafana-dashboard-providers.yml](./grafana_dashboards/grafana-dashboard-providers.yml): Contains Grafana dashboard provider configuration
- [grafana_dashboards/grafana-dynamo-dashboard.json](./grafana_dashboards/grafana-dynamo-dashboard.json): A general Dynamo Dashboard for both SW and HW metrics.
- [grafana_dashboards/grafana-llm-metrics.json](./grafana_dashboards/grafana-llm-metrics.json): Contains Grafana dashboard configuration for LLM specific metrics.
- [grafana_dashboards/grafana-dcgm-metrics.json](./grafana_dashboards/grafana-dcgm-metrics.json): Contains Grafana dashboard configuration for DCGM GPU metrics
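A quick way to confirm the layout before starting the containers is to check for each file. This is a minimal sketch, not part of the repository: the helper name `check_required_files` is hypothetical, and the file list assumes the layout after this change (dashboards under `grafana_dashboards/`).

```bash
# Hypothetical helper: report which required files are missing from a
# directory (defaults to the current one). Returns non-zero if any is missing.
check_required_files() {
  dir="${1:-.}"
  missing=0
  for f in \
    docker-compose.yml \
    prometheus.yml \
    grafana-datasources.yml \
    grafana_dashboards/grafana-dashboard-providers.yml \
    grafana_dashboards/grafana-dynamo-dashboard.json \
    grafana_dashboards/grafana-llm-metrics.json \
    grafana_dashboards/grafana-dcgm-metrics.json
  do
    if [ ! -f "$dir/$f" ]; then
      echo "missing: $f"
      missing=1
    fi
  done
  return "$missing"
}

# Usage, from the directory that holds docker-compose.yml:
#   check_required_files . && echo "all required files present"
```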
## Running the example `metrics` component
......
......@@ -13,6 +13,7 @@
# See the License for the specific language governing permissions and
# limitations under the License.
# IMPORTANT NOTE: Make sure this is in sync with lib/runtime/docker-compose.yml
networks:
server:
driver: bridge
......@@ -83,6 +84,8 @@ services:
networks:
- monitoring
# To access Prometheus from another machine, you may need to disable the firewall on your host. On Ubuntu:
# sudo ufw allow 9090/tcp
prometheus:
image: prom/prometheus:v3.4.1
container_name: prometheus
......@@ -98,11 +101,13 @@ services:
restart: unless-stopped
# Example to pull from the /query endpoint:
# {__name__=~"DCGM.*", job="dcgm-exporter"}
ports:
- "9090:9090"
networks:
- monitoring
ports:
- "9090:9090"
profiles: [metrics]
extra_hosts:
- "host.docker.internal:host-gateway"
depends_on:
- dcgm-exporter
- nats-prometheus-exporter
......@@ -110,23 +115,29 @@ services:
# grafana connects to prometheus via the /query endpoint.
# Default credentials are dynamo/dynamo.
# To access Grafana from another machine, you may need to disable the firewall on your host. On Ubuntu:
# sudo ufw allow 3001/tcp
grafana:
image: grafana/grafana-enterprise:12.0.1
container_name: grafana
volumes:
- ./grafana.json:/etc/grafana/provisioning/dashboards/llm-worker-dashboard.json
- ./grafana-dcgm-dashboard.json:/etc/grafana/provisioning/dashboards/dcgm-dashboard.json
- ./grafana_dashboards:/etc/grafana/provisioning/dashboards
- ./grafana-datasources.yml:/etc/grafana/provisioning/datasources/datasources.yml
- ./grafana-dashboard-providers.yml:/etc/grafana/provisioning/dashboards/dashboard-providers.yml
environment:
# Port 3000 is already used by "dynamo serve", so use 3001
- GF_SERVER_HTTP_PORT=3001
# do not make it admin/admin, because you will be prompted to change the password every time
- GF_SECURITY_ADMIN_USER=dynamo
- GF_SECURITY_ADMIN_PASSWORD=dynamo
- GF_USERS_ALLOW_SIGN_UP=false
- GF_INSTALL_PLUGINS=grafana-piechart-panel
# Default min interval is 5s, but can be configured lower
- GF_DASHBOARDS_MIN_REFRESH_INTERVAL=2s
# Disable password change requirement
- GF_SECURITY_DISABLE_INITIAL_ADMIN_CREATION=false
- GF_SECURITY_ADMIN_PASSWORD_POLICY=false
- GF_AUTH_DISABLE_LOGIN_FORM=false
- GF_AUTH_DISABLE_SIGNOUT_MENU=false
restart: unless-stopped
ports:
- "3001:3001"
......
File suppressed by a .gitattributes entry or the file's encoding is unsupported.
This diff is collapsed.
......@@ -33,14 +33,23 @@ scrape_configs:
static_configs:
- targets: ['dcgm-exporter:9400'] # on the "monitoring" network
# This is a demo service that needs to be launched manually. See components/metrics/README.md
# Note that you may need to disable the firewall on your host. On Ubuntu: sudo ufw allow 8000/tcp
- job_name: 'llm-demo'
scrape_interval: 10s
static_configs:
- targets: ['host.docker.internal:8000'] # on the "monitoring" network
# This is another demo aggregator that needs to be launched manually. See components/metrics/README.md
# Note that you may need to disable the firewall on your host. On Ubuntu: sudo ufw allow 9091/tcp
- job_name: 'metrics-aggregation-service'
scrape_interval: 2s
static_configs:
# - targets: ['localhost:9091'] # metrics aggregation service on host
- targets: ['host.docker.internal:9091'] # metrics aggregation service on host
# Uncomment to see its own Prometheus metrics
# - job_name: 'prometheus'
# scrape_interval: 5s
# static_configs:
# - targets: ['prometheus:9090'] # on the "monitoring" network
# Uncomment to see the metrics-aggregation-service metrics
# - job_name: 'metrics-aggregation-service'
# scrape_interval: 2s
# static_configs:
# - targets: ['host.docker.internal:9091'] # metrics aggregation service on host
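Since this scrape configuration mixes active and commented-out jobs, it can help to list which targets Prometheus will actually scrape. The following is a minimal sketch under stated assumptions: the helper name `list_scrape_targets` is hypothetical, and it relies on the formatting used in this file (one `- targets: ['host:port']` entry per line, single-quoted, with disabled entries prefixed by `#`). It is a text filter, not a YAML parser.

```bash
# Hypothetical helper: print active (uncommented) scrape targets from a
# prometheus.yml read on stdin, one host:port per line.
list_scrape_targets() {
  grep -E "^[[:space:]]*-[[:space:]]*targets:" \
    | grep -oE "'[^']+'" \
    | tr -d "'"
}

# Usage:
#   list_scrape_targets < prometheus.yml
```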