# Metrics Visualization with Prometheus and Grafana

This directory contains configuration for visualizing metrics from the metrics aggregation service using Prometheus and Grafana.

## Components

- **Prometheus**: Collects and stores metrics from the service
- **Grafana**: Provides visualization dashboards for the metrics

## Topology

Default Service Relationship Diagram:
```text
     ┌─────────────┐    ┌─────────────┐    ┌─────────────┐
     │ nats-server │    │ etcd-server │    │dcgm-exporter│
     │   :4222     │    │   :2379     │    │   :9400     │
     │   :6222     │    │   :2380     │    │             │
     │   :8222     │    │             │    │             │
     └──────┬──────┘    └──────┬──────┘    └──────┬──────┘
            │                  │                  │
            │ :8222/varz       │ :2379/metrics    │ :9400/metrics
            │                  │                  │
            ▼                  │                  │
     ┌─────────────┐           │                  │
     │nats-prom-exp│           │                  │
     │   :7777     │           │                  │
     │             │           │                  │
     │  /metrics   │           │                  │
     └──────┬──────┘           │                  │
            │                  │                  │
            │ :7777/metrics    │                  │
            │                  │                  │
            ▼                  ▼                  ▼
     ┌─────────────────────────────────────────────────┐
     │                prometheus                       │
     │                  :9090                          │
     │                                                 │
     │  scrapes: nats-prom-exp:7777/metrics            │
     │           etcd-server:2379/metrics              │
     │           dcgm-exporter:9400/metrics            │
     └──────────────────┬──────────────────────────────┘
                        │
                        │ :9090/query API
                        │
                        ▼
                ┌─────────────┐
                │   grafana   │
                │    :3001    │
                │             │
                └─────────────┘
```

Networks:
- monitoring: nats-prom-exp, etcd-server, dcgm-exporter, prometheus, grafana
- default: nats-server (accessible via host network)
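
As the diagram shows, the NATS Prometheus exporter reads the NATS monitoring endpoint at `:8222/varz` and re-exposes it in Prometheus format on `:7777/metrics`. The following is a minimal Python sketch, not part of this repository, for inspecting the raw `/varz` JSON directly. It assumes the default port shown above and a NATS server reachable on localhost; the helper names `summarize_varz` and `fetch_varz` are hypothetical, and the selected fields are only a handful of the counters NATS reports.

```python
import json
import urllib.request

# Assumed default from the topology diagram; adjust for your setup.
NATS_VARZ_URL = "http://localhost:8222/varz"

# A few counters reported by the NATS /varz endpoint (not exhaustive).
FIELDS = ("connections", "in_msgs", "out_msgs", "in_bytes", "out_bytes")


def summarize_varz(varz: dict) -> dict:
    """Pick a few counters out of a decoded /varz document.

    Missing fields are reported as None rather than raising.
    """
    return {name: varz.get(name) for name in FIELDS}


def fetch_varz(url: str = NATS_VARZ_URL, timeout: float = 5.0) -> dict:
    """Fetch and decode the /varz JSON document from a running NATS server."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


# Usage (requires a running NATS server):
#   print(summarize_varz(fetch_varz()))
```

Comparing these raw values with what appears on `:7777/metrics` can help confirm that the exporter hop in the diagram is working.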

## Getting Started

1. Make sure Docker and Docker Compose are installed on your system.

2. Start the Dynamo dependencies. The commands below assume you are at the root of the Dynamo repository:

   ```bash
   docker compose -f deploy/metrics/docker-compose.yml up -d  # Minimum components for Dynamo: etcd/nats/dcgm-exporter
   # or
   docker compose -f deploy/metrics/docker-compose.yml --profile metrics up -d  # In addition to the above, start Prometheus & Grafana
   ```

   To use specific GPU(s), set the variable below before running `docker compose`:
   ```bash
   export CUDA_VISIBLE_DEVICES=0,2
   ```

3. The following web servers are now running. Endpoints that end in `/metrics` serve data in Prometheus format:
   - Grafana: `http://localhost:3001` (default login: dynamo/dynamo)
   - Prometheus Server: `http://localhost:9090`
   - NATS Server: `http://localhost:8222` (monitoring endpoints: /varz, /healthz, etc.)
   - NATS Prometheus Exporter: `http://localhost:7777/metrics`
   - etcd Server: `http://localhost:2379/metrics`
   - DCGM Exporter: `http://localhost:9401/metrics`

4. Optionally, to experiment further, see `components/metrics/README.md` for details on launching a metrics server (subscribes to NATS), a `mock_worker` (publishes to NATS), and real workers.

   - Start the [components/metrics](../../components/metrics/README.md) application to begin monitoring for metric events from dynamo workers and aggregating them on a Prometheus metrics endpoint: `http://localhost:9091/metrics`.
   - Uncomment the appropriate lines in prometheus.yml to poll port 9091.
   - Start worker(s) that publish KV Cache metrics: [examples/rust/service_metrics/bin/server](../../lib/runtime/examples/service_metrics/README.md) can populate dummy KV Cache metrics.
   - For a real workflow with real data, see the KV Routing example in [examples/llm/utils/vllm.py](../../examples/llm/utils/vllm.py).
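
To confirm that the services from step 3 are reachable, a small probe script can help. This is a sketch using only the Python standard library; the URLs are copied from the list in step 3, while the `ENDPOINTS` table and the `probe` and `report` helpers are hypothetical names introduced for illustration.

```python
import urllib.error
import urllib.request

# URLs taken from step 3; adjust if you changed any port mappings.
ENDPOINTS = {
    "grafana": "http://localhost:3001",
    "prometheus": "http://localhost:9090",
    "nats-server": "http://localhost:8222/varz",
    "nats-prom-exp": "http://localhost:7777/metrics",
    "etcd-server": "http://localhost:2379/metrics",
    "dcgm-exporter": "http://localhost:9401/metrics",
}


def probe(url: str, timeout: float = 3.0) -> bool:
    """Return True if the URL answers with an HTTP status below 400."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status < 400
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, or an HTTP error status.
        return False


def report(endpoints: dict, check=probe) -> dict:
    """Map each service name to "up" or "down" using the given check."""
    return {name: ("up" if check(url) else "down") for name, url in endpoints.items()}


# Usage (requires the docker compose services to be running):
#   for name, state in report(ENDPOINTS).items():
#       print(f"{name}: {state}")
```

The `check` parameter exists so the reporting logic can be exercised without a network, for example by passing a stub function.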


## Configuration

### Prometheus

The Prometheus configuration is defined in [prometheus.yml](./prometheus.yml). It is configured to scrape metrics from the metrics aggregation service endpoint.

Note: You may need to adjust the target based on your host configuration and network setup.
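
One way to check whether the targets in `prometheus.yml` match your setup is to ask Prometheus which targets it is scraping. The sketch below uses the Prometheus HTTP API endpoint `/api/v1/targets` and lists each active target's scrape URL and health. It assumes Prometheus is reachable at `http://localhost:9090`, as listed in Getting Started; `target_health` and `fetch_targets` are hypothetical helper names.

```python
import json
import urllib.request

PROMETHEUS_URL = "http://localhost:9090"  # assumed default from Getting Started


def target_health(payload: dict) -> dict:
    """Map each active target's scrape URL to its health ("up", "down", "unknown")."""
    targets = payload.get("data", {}).get("activeTargets", [])
    return {t["scrapeUrl"]: t["health"] for t in targets}


def fetch_targets(base_url: str = PROMETHEUS_URL, timeout: float = 5.0) -> dict:
    """Fetch the decoded JSON response of GET /api/v1/targets."""
    with urllib.request.urlopen(f"{base_url}/api/v1/targets", timeout=timeout) as resp:
        return json.load(resp)


# Usage (requires a running Prometheus server):
#   for url, health in target_health(fetch_targets()).items():
#       print(f"{health:8} {url}")
```

A target that stays `down` is often a sign that the host or port in `prometheus.yml` does not match where the service is actually listening.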

### Grafana

Grafana is pre-configured with:
- Prometheus datasource
- Sample dashboard for visualizing service metrics
![grafana image](./grafana-dynamo-composite.png)

## Required Files

The following configuration files should be present in this directory:
- [docker-compose.yml](./docker-compose.yml): Defines the Dynamo dependencies (etcd, NATS, DCGM exporter) and, under the `metrics` profile, the Prometheus and Grafana services
- [prometheus.yml](./prometheus.yml): Contains Prometheus scraping configuration
- [grafana-datasources.yml](./grafana-datasources.yml): Contains Grafana datasource configuration
- [grafana_dashboards/grafana-dashboard-providers.yml](./grafana_dashboards/grafana-dashboard-providers.yml): Contains Grafana dashboard provider configuration
- [grafana_dashboards/grafana-dynamo-dashboard.json](./grafana_dashboards/grafana-dynamo-dashboard.json): A general Dynamo Dashboard for both SW and HW metrics.
- [grafana_dashboards/grafana-llm-metrics.json](./grafana_dashboards/grafana-llm-metrics.json): Contains Grafana dashboard configuration for LLM specific metrics.
- [grafana_dashboards/grafana-dcgm-metrics.json](./grafana_dashboards/grafana-dcgm-metrics.json): Contains Grafana dashboard configuration for DCGM GPU metrics

## Running the example `metrics` component

When you run the example [components/metrics](../../components/metrics/README.md) component, it exposes a Prometheus `/metrics` endpoint with the following metrics (defined in [../../components/metrics/src/lib.rs](../../components/metrics/src/lib.rs)):
- `llm_requests_active_slots`: Number of currently active request slots per worker
- `llm_requests_total_slots`: Total available request slots per worker
- `llm_kv_blocks_active`: Number of active KV blocks per worker
- `llm_kv_blocks_total`: Total KV blocks available per worker
- `llm_kv_hit_rate_percent`: Cumulative KV Cache hit percent per worker
- `llm_load_avg`: Average load across workers
- `llm_load_std`: Load standard deviation across workers
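
The metrics above can be combined into derived values. As an illustration, the sketch below parses Prometheus text-format output and computes KV block utilization per label set as `llm_kv_blocks_active / llm_kv_blocks_total`. It is a deliberately simplified parser (it ignores timestamps, escaped quotes, and label values containing spaces or braces), and the function names are hypothetical. Check the actual output of `http://localhost:9091/metrics` for the real label set before relying on it.

```python
import re

# Matches lines like: metric_name{label="v",...} 12.5
# Simplified: no timestamps, no escaped quotes or spaces inside label values.
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(\{[^}]*\})?\s+(\S+)$')


def parse_samples(text: str) -> dict:
    """Parse exposition text into {(metric_name, label_string): value}."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        match = SAMPLE_RE.match(line)
        if match:
            name, labels, value = match.groups()
            samples[(name, labels or "")] = float(value)
    return samples


def kv_block_utilization(samples: dict) -> dict:
    """Return the active/total KV block ratio per label set, skipping zero totals."""
    result = {}
    for (name, labels), active in samples.items():
        if name != "llm_kv_blocks_active":
            continue
        total = samples.get(("llm_kv_blocks_total", labels))
        if total:  # skips missing totals and totals equal to zero
            result[labels] = active / total
    return result
```

In Prometheus or Grafana, the same ratio can be written as the PromQL expression `llm_kv_blocks_active / llm_kv_blocks_total`, assuming both series carry identical labels.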

## Troubleshooting

1. Verify services are running:
   ```bash
   docker compose ps
   ```

2. Check logs:
   ```bash
   docker compose logs prometheus
   docker compose logs grafana
   ```
  ```