feat: add a new composite SW/HW grafana (DYN-678) (#1788)

Co-authored-by: Keiven Chang <keivenchang@users.noreply.github.com>

Commit ebd23361 (unverified), authored Jul 07, 2025 by Keiven C, committed by GitHub Jul 08, 2025
9 changed files
--- a/components/metrics/README.md
+++ b/components/metrics/README.md
 # Metrics

 The `metrics` component is a utility that can collect, aggregate, and publish
-metrics from a Dynamo deployment for use in other applications or visualization
-tools like Prometheus and Grafana.
+metrics from a Dynamo deployment. After collecting and aggregating metrics from
+workers, it exposes them via an HTTP `/metrics` endpoint in Prometheus format
+that other applications or visualization tools like Prometheus server and Grafana can
+pull from.
+
+**Note**: This is a demo implementation. The metrics component is currently under active development and this documentation will change as the implementation evolves.
+- In this demo the metric names use the prefix "llm", but in production they will be prefixed with "nv_llm" (e.g., the HTTP `/metrics` endpoint will serve metrics with "nv_llm" prefixes)
+- This demo only works with examples/llm/configs/agg.yaml; other configurations will not work

 <div align="center">
  <img src="images/dynamo_metrics_grafana.png" alt="Dynamo Metrics Dashboard"/>
@@ -22,16 +28,16 @@ For example:
 ```bash
 # Default namespace is "dynamo", but can be configured with --namespace
 # For more detailed output, try setting the env var: DYN_LOG=debug
-metrics --component my_component --endpoint my_endpoint
+metrics --component MyComponent --endpoint my_endpoint

-# 2025-03-17T00:07:05.202558Z  INFO metrics: Scraping endpoint dynamo/my_component/my_endpoint for stats
+# 2025-03-17T00:07:05.202558Z  INFO metrics: Scraping endpoint dynamo/MyComponent/my_endpoint for stats
 # 2025-03-17T00:07:05.202955Z  INFO metrics: Prometheus metrics server started at 0.0.0.0:9091/metrics
 # ...
 ```

 With no matching endpoints running to collect stats from, you should see warnings in the logs:
 ```bash
-2025-03-17T00:07:06.204756Z  WARN metrics: No endpoints found matching dynamo/my_component/my_endpoint
+2025-03-17T00:07:06.204756Z  WARN metrics: No endpoints found matching dynamo/MyComponent/my_endpoint
 ```

 After a worker with a matching endpoint gets started, the endpoint
@@ -44,22 +50,23 @@ so below are some examples of workers and how they can be monitored.

 ### Mock Worker

-For quick testing and debugging, there is a Rust-based
-[mock worker](src/bin/mock_worker.rs) that registers a mock
-`StatsHandler` under an endpoint named
-`dynamo/my_component/my_endpoint` and publishes random data.
+To try out how `metrics` works, there is a demo Rust-based
+[mock worker](src/bin/mock_worker.rs) that provides sample data through two mechanisms:
+1. Exposes a stats handler at `dynamo/MyComponent/my_endpoint` that responds to polling requests (from `metrics`) with randomly generated `ForwardPassMetrics` data
+2. Publishes mock `KVHitRateEvent` data every second to demonstrate event-based metrics

+Step 1: Launch a mock worker with the following command (if already built):
 ```bash
-# Can run multiple workers in separate shells to see aggregation as well.
-# Or to build/run from source: cargo run --bin mock_worker
+# or build/run from source: DYN_LOG=DEBUG cargo run --bin mock_worker
 mock_worker

-# 2025-03-16T23:49:28.101668Z  INFO mock_worker: Starting Mock Worker on Endpoint: dynamo/my_component/my_endpoint
+# 2025-03-16T23:49:28.101668Z  INFO mock_worker: Starting Mock Worker on Endpoint: dynamo/MyComponent/my_endpoint
 ```

-To monitor the metrics of these mock workers, run:
+Step 2: Run `metrics` to monitor these mock workers and serve its own Prometheus endpoint
+at `/metrics` on port 9091 (the default when `--port` is not specified):
 ```bash
-metrics --component my_component --endpoint my_endpoint
+metrics --component MyComponent --endpoint my_endpoint
 ```

 ### Real Worker
@@ -69,13 +76,14 @@ see the examples in [examples/llm](../../examples/llm).

 For example, for a VLLM + KV Routing based deployment that
 exposes statistics on an endpoint labeled
-`dynamo/VllmWorker/load_metrics`:
+`dynamo/VllmWorker/load_metrics` (note: this does NOT currently work
+with any other example such as examples/vllm_v0, vllm_v1, ...):
 ```bash
 cd deploy/examples/llm
-dynamo serve <vllm kv routing example args>
+dynamo serve graphs.agg:Frontend -f configs/agg.yaml
 ```

-To monitor the metrics of these VllmWorkers, run:
+Then, to monitor the metrics of these VllmWorkers, run:
 ```bash
 metrics --component VllmWorker --endpoint load_metrics
 ```
@@ -105,10 +113,10 @@ Prometheus server or curl client can pull from:

 ```bash
 # Start metrics server on default host (0.0.0.0) and port (9091)
-metrics --component my_component --endpoint my_endpoint
+metrics --component MyComponent --endpoint my_endpoint

 # Or specify a custom port
-metrics --component my_component --endpoint my_endpoint --port 9092
+metrics --component MyComponent --endpoint my_endpoint --port 9092
 ```

 In pull mode:
@@ -121,12 +129,12 @@ curl localhost:9091/metrics

 # # HELP llm_kv_blocks_active Active KV cache blocks
 # # TYPE llm_kv_blocks_active gauge
-# llm_kv_blocks_active{component="my_component",endpoint="my_endpoint",worker_id="7587884888253033398"} 40
-# llm_kv_blocks_active{component="my_component",endpoint="my_endpoint",worker_id="7587884888253033401"} 2
+# llm_kv_blocks_active{component="MyComponent",endpoint="my_endpoint",worker_id="7587884888253033398"} 40
+# llm_kv_blocks_active{component="MyComponent",endpoint="my_endpoint",worker_id="7587884888253033401"} 2
 # # HELP llm_kv_blocks_total Total KV cache blocks
 # # TYPE llm_kv_blocks_total gauge
-# llm_kv_blocks_total{component="my_component",endpoint="my_endpoint",worker_id="7587884888253033398"} 100
-# llm_kv_blocks_total{component="my_component",endpoint="my_endpoint",worker_id="7587884888253033401"} 100
+# llm_kv_blocks_total{component="MyComponent",endpoint="my_endpoint",worker_id="7587884888253033398"} 100
+# llm_kv_blocks_total{component="MyComponent",endpoint="my_endpoint",worker_id="7587884888253033401"} 100
 ```

 ### Push Mode
@@ -145,7 +153,7 @@ Start the metrics component in `--push` mode, specifying the host and port of yo
 ```bash
 # Push metrics to a Prometheus PushGateway every --push-interval seconds
 metrics \
-    --component my_component \
+    --component MyComponent \
    --endpoint my_endpoint \
    --host 127.0.0.1 \
    --port 9091 \
@@ -173,7 +181,7 @@ For easy iteration while making edits to the metrics component, you can use `car
 to build and run with your local changes:

 ```bash
-cargo run --bin metrics -- --component my_component --endpoint my_endpoint
+cargo run --bin metrics -- --component MyComponent --endpoint my_endpoint
 ```


--- a/components/metrics/src/bin/mock_worker.rs
+++ b/components/metrics/src/bin/mock_worker.rs
@@ -146,7 +146,7 @@ async fn backend(runtime: DistributedRuntime) -> Result<()> {
    let namespace = runtime.namespace("dynamo")?;
    // we must first create a service, then we can attach one or more endpoints
    let component = namespace
-        .component("my_component")?
+        .component("MyComponent")?
        .service_builder()
        .create()
        .await?;

--- a/deploy/metrics/README.md
+++ b/deploy/metrics/README.md
@@ -100,16 +100,18 @@ Note: You may need to adjust the target based on your host configuration and net
 Grafana is pre-configured with:
 - Prometheus datasource
 - Sample dashboard for visualizing service metrics
-![grafana image](./grafana1.png)
+![grafana image](./grafana-dynamo-composite.png)

 ## Required Files

 The following configuration files should be present in this directory:
 - [docker-compose.yml](./docker-compose.yml): Defines the Prometheus and Grafana services
 - [prometheus.yml](./prometheus.yml): Contains Prometheus scraping configuration
-- [grafana.json](./grafana.json): Contains Grafana dashboard configuration
 - [grafana-datasources.yml](./grafana-datasources.yml): Contains Grafana datasource configuration
-- [grafana-dashboard-providers.yml](./grafana-dashboard-providers.yml): Contains Grafana dashboard provider configuration
+- [grafana_dashboards/grafana-dashboard-providers.yml](./grafana_dashboards/grafana-dashboard-providers.yml): Contains Grafana dashboard provider configuration
+- [grafana_dashboards/grafana-dynamo-dashboard.json](./grafana_dashboards/grafana-dynamo-dashboard.json): Contains the general Dynamo dashboard for both software (SW) and hardware (HW) metrics
+- [grafana_dashboards/grafana-llm-metrics.json](./grafana_dashboards/grafana-llm-metrics.json): Contains Grafana dashboard configuration for LLM-specific metrics
+- [grafana_dashboards/grafana-dcgm-metrics.json](./grafana_dashboards/grafana-dcgm-metrics.json): Contains Grafana dashboard configuration for DCGM GPU metrics

 ## Running the example `metrics` component


--- a/deploy/metrics/docker-compose.yml
+++ b/deploy/metrics/docker-compose.yml
@@ -13,6 +13,7 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.

+# IMPORTANT NOTE: Make sure this is in sync with lib/runtime/docker-compose.yml
 networks:
  server:
    driver: bridge
@@ -83,6 +84,8 @@ services:
    networks:
      - monitoring

+  # To access Prometheus from another machine, you may need to open port 9090 in your host firewall. On Ubuntu:
+  # sudo ufw allow 9090/tcp
  prometheus:
    image: prom/prometheus:v3.4.1
    container_name: prometheus
@@ -98,11 +101,13 @@ services:
    restart: unless-stopped
    # Example to pull from the /query endpoint:
    # {__name__=~"DCGM.*", job="dcgm-exporter"}
-    ports:
-      - "9090:9090"
    networks:
      - monitoring
+    ports:
+      - "9090:9090"
    profiles: [metrics]
+    extra_hosts:
+    - "host.docker.internal:host-gateway"
    depends_on:
      - dcgm-exporter
      - nats-prometheus-exporter
@@ -110,23 +115,29 @@ services:

  # grafana connects to prometheus via the /query endpoint.
  # Default credentials are dynamo/dynamo.
+  # To access Grafana from another machine, you may need to open port 3001 in your host firewall. On Ubuntu:
+  # sudo ufw allow 3001/tcp
  grafana:
    image: grafana/grafana-enterprise:12.0.1
    container_name: grafana
    volumes:
-      - ./grafana.json:/etc/grafana/provisioning/dashboards/llm-worker-dashboard.json
-      - ./grafana-dcgm-dashboard.json:/etc/grafana/provisioning/dashboards/dcgm-dashboard.json
+      - ./grafana_dashboards:/etc/grafana/provisioning/dashboards
      - ./grafana-datasources.yml:/etc/grafana/provisioning/datasources/datasources.yml
-      - ./grafana-dashboard-providers.yml:/etc/grafana/provisioning/dashboards/dashboard-providers.yml
    environment:
      # Port 3000 is already used by "dynamo serve", so use 3001
      - GF_SERVER_HTTP_PORT=3001
+      # do not make it admin/admin, because you will be prompted to change the password every time
      - GF_SECURITY_ADMIN_USER=dynamo
      - GF_SECURITY_ADMIN_PASSWORD=dynamo
      - GF_USERS_ALLOW_SIGN_UP=false
      - GF_INSTALL_PLUGINS=grafana-piechart-panel
      # Default min interval is 5s, but can be configured lower
      - GF_DASHBOARDS_MIN_REFRESH_INTERVAL=2s
+      # Disable password change requirement
+      - GF_SECURITY_DISABLE_INITIAL_ADMIN_CREATION=false
+      - GF_SECURITY_ADMIN_PASSWORD_POLICY=false
+      - GF_AUTH_DISABLE_LOGIN_FORM=false
+      - GF_AUTH_DISABLE_SIGNOUT_MENU=false
    restart: unless-stopped
    ports:
      - "3001:3001"

--- a/deploy/metrics/grafana-dynamo-composite.png
+++ b/deploy/metrics/grafana-dynamo-composite.png
--- a/deploy/metrics/grafana.json
+++ b/deploy/metrics/grafana.json
--- a/deploy/metrics/grafana-dashboard-providers.yml
+++ b/deploy/metrics/grafana-dashboard-providers.yml
--- a/deploy/metrics/grafana-dcgm-dashboard.json
+++ b/deploy/metrics/grafana-dcgm-dashboard.json
--- a/deploy/metrics/prometheus.yml
+++ b/deploy/metrics/prometheus.yml
@@ -33,14 +33,23 @@ scrape_configs:
    static_configs:
      - targets: ['dcgm-exporter:9400']  # on the "monitoring" network

+  # This is a demo service that needs to be launched manually. See components/metrics/README.md
+  # Note that you may need to open port 8000 in your host firewall. On Ubuntu: sudo ufw allow 8000/tcp
+  - job_name: 'llm-demo'
+    scrape_interval: 10s
+    static_configs:
+      - targets: ['host.docker.internal:8000']  # demo service on host
+
+  # This is another demo aggregator that needs to be launched manually. See components/metrics/README.md
+  # Note that you may need to open port 9091 in your host firewall. On Ubuntu: sudo ufw allow 9091/tcp
+  - job_name: 'metrics-aggregation-service'
+    scrape_interval: 2s
+    static_configs:
+      # - targets: ['localhost:9091']  # metrics aggregation service on host
+      - targets: ['host.docker.internal:9091']  # metrics aggregation service on host
+
  # Uncomment to see its own Prometheus metrics
  # - job_name: 'prometheus'
  #   scrape_interval: 5s
  #   static_configs:
  #     - targets: ['prometheus:9090']  # on the "monitoring" network
-
-  # Uncomment to see the metrics-aggregation-service metrics
-  # - job_name: 'metrics-aggregation-service'
-  #   scrape_interval: 2s
-  #   static_configs:
-  #     - targets: ['host.docker.internal:9091']  # metrics aggregation service on host
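
The comments added to `prometheus.yml` mention opening host ports for each manually launched target. As a rough sketch of how one might check which targets are actually active, the hypothetical helper below (`list_scrape_targets` is an illustrative name, not part of this repository) prints the `host:port` of every uncommented `targets:` entry. It assumes the single-line `- targets: ['host:port', ...]` style used in this file and is not a general YAML parser.

```bash
#!/usr/bin/env bash
# Sketch only: list_scrape_targets is a hypothetical helper.
# Prints one host:port per line for each uncommented "- targets: [...]" entry.
# Reads the file given as $1, or stdin when no argument is passed.
list_scrape_targets() {
    grep -E '^[[:space:]]*-[[:space:]]*targets:' "${1:-/dev/stdin}" \
        | sed 's/].*//' \
        | grep -oE "'[^']+'" \
        | tr -d "'"
}
```

Run from `deploy/metrics`, `list_scrape_targets prometheus.yml` should list entries such as `host.docker.internal:8000` and `host.docker.internal:9091`, which can then be compared against the `ufw allow` rules suggested in the comments.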