Commit e756f390, authored by Keiven C, committed by GitHub

docs: clarify metrics visualization instructions. Removed unused file. (#1824)


Co-authored-by: Keiven Chang <keivenchang@users.noreply.github.com>
parent 5e2f29f5
......@@ -4,55 +4,30 @@ This directory contains configuration for visualizing metrics from the metrics a
## Components
- **Prometheus**: Collects and stores metrics from the service
- **Grafana**: Provides visualization dashboards for the metrics
- **Prometheus Server**: Collects and stores metrics from Dynamo services and other components.
- **Grafana**: Provides dashboards by querying the Prometheus Server.
## Topology
Default Service Relationship Diagram:
```text
┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│ nats-server │     │ etcd-server │     │dcgm-exporter│
│   :4222     │     │   :2379     │     │   :9400     │
│   :6222     │     │   :2380     │     │             │
│   :8222     │     │             │     │             │
└──────┬──────┘     └──────┬──────┘     └──────┬──────┘
       │                   │                   │
       │ :8222/varz        │ :2379/metrics     │ :9400/metrics
       │                   │                   │
       ▼                   │                   │
┌─────────────┐            │                   │
│nats-prom-exp│            │                   │
│   :7777     │            │                   │
│             │            │                   │
│  /metrics   │            │                   │
└──────┬──────┘            │                   │
       │                   │                   │
       │ :7777/metrics     │                   │
       │                   │                   │
       ▼                   ▼                   ▼
┌─────────────────────────────────────────────────┐
│                   prometheus                    │
│                      :9090                      │
│                                                 │
│  scrapes: nats-prom-exp:7777/metrics            │
│           etcd-server:2379/metrics              │
│           dcgm-exporter:9400/metrics            │
└──────────────────┬──────────────────────────────┘
                   │ :9090/query API
            ┌─────────────┐
            │   grafana   │
            │   :3001     │
            │             │
            └─────────────┘
```mermaid
graph TD
BROWSER[Browser] -->|:3001| GRAFANA[Grafana :3001]
BROWSER[Browser] -->|:9400| DCGM_EXPORTER2["external dcgm_exporter 0.0.0.0:9400"]
subgraph DockerComposeNetwork [Network inside Docker Compose]
NATS_PROM_EXP[nats-prom-exp :7777 /metrics] -->|:8222/varz| NATS_SERVER[nats-server :4222, :6222, :8222]
PROMETHEUS[Prometheus server :9090] -->|:2379/metrics| ETCD_SERVER[etcd-server :2379, :2380]
PROMETHEUS -->|:9400/metrics| DCGM_EXPORTER[dcgm-exporter :9400]
PROMETHEUS -->|:7777/metrics| NATS_PROM_EXP
PROMETHEUS -->|:8000/metrics| DYNAMOFE[Dynamo HTTP FE :8000]
GRAFANA -->|:9090/query API| PROMETHEUS
end
BROWSER -->|:9401/metrics| DCGM_EXPORTER
```
Networks:
- monitoring: nats-prom-exp, etcd-server, dcgm-exporter, prometheus, grafana
- default: nats-server (accessible via host network)
The dcgm-exporter within the Docker Compose network is configured to bind to port 9400 internally, but it is exposed externally on port 9401. This setup helps prevent conflicts with other dcgm-exporters that might be running concurrently, such as in distributed environments like SLURM.
As of Q2 2025, Dynamo HTTP Frontend metrics are exposed when you build containers with `--framework VLLM_V1` or `--framework TENSORRTLLM`.
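One way to confirm the host-facing ports described above is to probe them from the machine running Docker Compose. The sketch below is illustrative only: the `localhost` host name and the helper names `endpoints` and `check_endpoints` are assumptions rather than part of this repository, and it assumes Grafana's `/api/health` path is reachable without authentication.

```bash
# Host-reachable endpoints taken from the diagram above:
# Grafana on :3001 and the Compose-managed dcgm-exporter published on :9401.
# "localhost" is an assumption; substitute the Docker host if it differs.
endpoints() {
  printf '%s\n' \
    'grafana http://localhost:3001/api/health' \
    'dcgm-exporter http://localhost:9401/metrics'
}

# Probe each endpoint and report whether it answered with a success status.
check_endpoints() {
  endpoints | while read -r name url; do
    if curl -fsS --max-time 3 -o /dev/null "$url"; then
      echo "ok   $name $url"
    else
      echo "FAIL $name $url"
    fi
  done
}
```

Calling `check_endpoints` once the stack is up prints one line per endpoint. A `FAIL` on `:9401` while another exporter answers on `:9400` would suggest you are reaching an external dcgm-exporter rather than the Compose-managed one.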
## Getting Started
......@@ -66,7 +41,7 @@ Networks:
docker compose -f deploy/metrics/docker-compose.yml --profile metrics up -d # In addition to the above, start Prometheus & Grafana
```
If you have particular GPU(s) to use, set the variable below before docker compose:
To target specific GPU(s), export the variable below before running Docker Compose:
```bash
export CUDA_VISIBLE_DEVICES=0,2
```
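A mistyped device list is easy to miss, so it can help to check the variable before starting the stack. The following is a minimal sketch, not part of the repository: `valid_gpu_list` and `start_metrics_stack` are hypothetical helper names, and the check assumes integer GPU indices only, although CUDA also accepts other identifier forms such as UUIDs.

```bash
# Return success only for a non-empty, comma-separated list of integer GPU
# indices such as "0,2". Other identifier forms that CUDA accepts (for
# example UUIDs) are deliberately rejected by this simplified check.
valid_gpu_list() {
  case "$1" in
    '' | *[!0-9,]* | ,* | *, | *,,*) return 1 ;;
    *) return 0 ;;
  esac
}

# Hypothetical wrapper: validate the variable, then start the metrics profile
# with the same Docker Compose command shown earlier in this section.
start_metrics_stack() {
  if ! valid_gpu_list "${CUDA_VISIBLE_DEVICES:-}"; then
    echo "CUDA_VISIBLE_DEVICES must look like 0,2" >&2
    return 1
  fi
  docker compose -f deploy/metrics/docker-compose.yml --profile metrics up -d
}
```

With `CUDA_VISIBLE_DEVICES=0,2` exported, `start_metrics_stack` runs the Compose command. With the variable unset or malformed, it prints an error and returns without starting anything.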
......
......@@ -47,7 +47,8 @@ The simplest way to deploy the pre-requisite services is using
defined in [deploy/metrics/docker-compose.yml](../../deploy/metrics/docker-compose.yml).
```
docker-compose up -d
# At the root of the repository:
docker compose -f deploy/metrics/docker-compose.yml up -d
```
This will deploy a [NATS.io](https://nats.io/) server and an [etcd](https://etcd.io/)
......
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
services:
  nats-server:
    image: nats
    command: [ "-js", "--trace" ]
    ports:
      - 4222:4222
      - 6222:6222
      - 8222:8222
  etcd-server:
    image: bitnami/etcd
    environment:
      - ALLOW_NONE_AUTHENTICATION=yes
    ports:
      - 2379:2379
      - 2380:2380
  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    volumes:
      - ./metrics/prometheus.yml:/etc/prometheus/prometheus.yml
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      # These provide the web console functionality
      - '--web.console.libraries=/etc/prometheus/console_libraries'
      - '--web.console.templates=/etc/prometheus/consoles'
      - '--web.enable-lifecycle'
    restart: unless-stopped
    # TODO: Use more explicit networking setup when metrics is containerized
    #ports:
    #  - "9090:9090"
    #networks:
    #  - monitoring
    network_mode: "host"
    profiles: [metrics]
  grafana:
    image: grafana/grafana-enterprise:latest
    container_name: grafana
    volumes:
      - ./metrics/grafana.json:/etc/grafana/provisioning/dashboards/llm-worker-dashboard.json
      - ./metrics/grafana-datasources.yml:/etc/grafana/provisioning/datasources/datasources.yml
      - ./metrics/grafana-dashboard-providers.yml:/etc/grafana/provisioning/dashboards/dashboard-providers.yml
    environment:
      # Port 3000 is used by "dynamo serve", so use 3001
      - GF_SERVER_HTTP_PORT=3001
      - GF_SECURITY_ADMIN_USER=admin
      - GF_SECURITY_ADMIN_PASSWORD=admin
      - GF_USERS_ALLOW_SIGN_UP=false
      - GF_INSTALL_PLUGINS=grafana-piechart-panel
      # Default min interval is 5s, but can be configured lower
      - GF_DASHBOARDS_MIN_REFRESH_INTERVAL=2s
    restart: unless-stopped
    # TODO: Use more explicit networking setup when metrics is containerized
    #ports:
    #  - "3001:3001"
    #networks:
    #  - monitoring
    network_mode: "host"
    profiles: [metrics]
    depends_on:
      - prometheus