# Metrics Visualization with Prometheus and Grafana

This directory contains configuration for visualizing metrics from the metrics aggregation service using Prometheus and Grafana.

> [!NOTE]
> For detailed information about Dynamo's metrics system, including hierarchical metrics, automatic labeling, and usage examples, see the [Metrics Guide](./metrics.md).

## Overview

### Components

- **Prometheus Server**: Collects and stores metrics from Dynamo services and other components.
- **Grafana**: Provides dashboards by querying the Prometheus Server.

### Topology

Default Service Relationship Diagram:
```mermaid
graph TD
    BROWSER[Browser] -->|:3001| GRAFANA[Grafana :3001]
    subgraph DockerComposeNetwork [Network inside Docker Compose]
        NATS_PROM_EXP[nats-prom-exp :7777 /metrics] -->|:8222/varz| NATS_SERVER[nats-server :4222, :6222, :8222]
        PROMETHEUS[Prometheus server :9090] -->|:2379/metrics| ETCD_SERVER[etcd-server :2379, :2380]
        PROMETHEUS -->|:9401/metrics| DCGM_EXPORTER[dcgm-exporter :9401]
        PROMETHEUS -->|:7777/metrics| NATS_PROM_EXP
        PROMETHEUS -->|:8000/metrics| DYNAMOFE[Dynamo HTTP FE :8000]
        PROMETHEUS -->|:8081/metrics| DYNAMOBACKEND[Dynamo backend :8081]
        DYNAMOFE --> DYNAMOBACKEND
        GRAFANA -->|:9090/query API| PROMETHEUS
    end
```

The dcgm-exporter service in the Docker Compose network is configured to use port 9401 instead of the default port 9400. This avoids port conflicts with other dcgm-exporter instances that may already be running on the same host, which is common on shared clusters managed by schedulers such as SLURM.

As of Q2 2025, Dynamo HTTP Frontend metrics are exposed when you build containers with `--framework VLLM` or `--framework TRTLLM`.

### Available Metrics

#### Backend Component Metrics

The core Dynamo backend system automatically exposes metrics with the `dynamo_component_*` prefix for all components that use the `DistributedRuntime` framework:

- `dynamo_component_inflight_requests`: Requests currently being processed (gauge)
- `dynamo_component_request_bytes_total`: Total bytes received in requests (counter)
- `dynamo_component_request_duration_seconds`: Request processing time (histogram)
- `dynamo_component_requests_total`: Total requests processed (counter)
- `dynamo_component_response_bytes_total`: Total bytes sent in responses (counter)
- `dynamo_component_system_uptime_seconds`: DistributedRuntime uptime (gauge)
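As a rough illustration of how these series look to a scraper, the sketch below parses a few lines of Prometheus text exposition output and keeps only the `dynamo_component_*` samples. The sample payload and the helper names `parse_sample` and `samples_with_prefix` are made up for this example. It assumes no optional timestamps are present and is not how Prometheus itself parses a scrape.

```rust
/// Parse one sample line of the Prometheus text exposition format into
/// (series, value). Comment lines (`# HELP`, `# TYPE`) and blank lines
/// yield `None`.
fn parse_sample(line: &str) -> Option<(String, f64)> {
    let line = line.trim();
    if line.is_empty() || line.starts_with('#') {
        return None;
    }
    // Assumption: the value is the last space-separated token (no timestamp).
    let (series, value) = line.rsplit_once(' ')?;
    let value: f64 = value.parse().ok()?;
    Some((series.trim().to_string(), value))
}

/// Keep only the samples whose series name starts with `prefix`.
fn samples_with_prefix(body: &str, prefix: &str) -> Vec<(String, f64)> {
    body.lines()
        .filter_map(parse_sample)
        .filter(|(series, _)| series.starts_with(prefix))
        .collect()
}

fn main() {
    // Hypothetical scrape output; the values are invented for illustration.
    let body = "\
# HELP dynamo_component_requests_total Total requests processed
# TYPE dynamo_component_requests_total counter
dynamo_component_requests_total 42
dynamo_component_inflight_requests 3
process_cpu_seconds_total 1.5
";
    for (series, value) in samples_with_prefix(body, "dynamo_component_") {
        println!("{series} = {value}");
    }
}
```

Filtering on the prefix is the programmatic equivalent of piping `/metrics` output through `grep`.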

#### KV Router Statistics (kvstats)

KV router statistics are automatically exposed by LLM workers and KV router components with the `dynamo_component_kvstats_*` prefix. These metrics provide insights into GPU memory usage and cache efficiency:

- `dynamo_component_kvstats_active_blocks`: Number of active KV cache blocks currently in use (gauge)
- `dynamo_component_kvstats_total_blocks`: Total number of KV cache blocks available (gauge)
- `dynamo_component_kvstats_gpu_cache_usage_percent`: GPU cache usage as a fraction between 0.0 and 1.0, despite the `_percent` suffix (gauge)
- `dynamo_component_kvstats_gpu_prefix_cache_hit_rate`: GPU prefix cache hit rate as a fraction between 0.0 and 1.0 (gauge)

These metrics are published by:
- **LLM Workers**: vLLM and TRT-LLM backends publish these metrics through their respective publishers
- **KV Router**: The KV router component aggregates and exposes these metrics for load balancing decisions
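The usage and hit-rate gauges above are reported on a 0.0-1.0 scale. The sketch below shows one way such a value could be derived from the two block gauges. It assumes usage is simply active blocks divided by total blocks, which may differ from what a given backend actually reports, and `cache_usage_fraction` is a hypothetical helper.

```rust
/// Fraction of KV cache blocks in use, in the range 0.0..=1.0.
/// Returns 0.0 when no blocks are available, to avoid dividing by zero.
fn cache_usage_fraction(active_blocks: u64, total_blocks: u64) -> f64 {
    if total_blocks == 0 {
        return 0.0;
    }
    (active_blocks as f64 / total_blocks as f64).clamp(0.0, 1.0)
}

fn main() {
    // Hypothetical gauge readings.
    let usage = cache_usage_fraction(250, 1000);
    // Multiply by 100 only for display; the stored value stays a fraction.
    println!("KV cache usage: {:.1}%", usage * 100.0);
}
```

When plotting these gauges in Grafana, a unit that expects a 0.0-1.0 input avoids showing 0.25 as "0.25%".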

#### Specialized Component Metrics

Some components expose additional metrics specific to their functionality:

- `dynamo_preprocessor_*`: Metrics specific to preprocessor components

#### Frontend Metrics

When using Dynamo HTTP Frontend (`--framework VLLM` or `--framework TRTLLM`), these metrics are automatically exposed with the `dynamo_frontend_*` prefix and include `model` labels containing the model name:

- `dynamo_frontend_inflight_requests`: Inflight requests (gauge)
- `dynamo_frontend_queued_requests`: Number of requests in HTTP processing queue (gauge)
- `dynamo_frontend_input_sequence_tokens`: Input sequence length (histogram)
- `dynamo_frontend_inter_token_latency_seconds`: Inter-token latency (histogram)
- `dynamo_frontend_output_sequence_tokens`: Output sequence length (histogram)
- `dynamo_frontend_request_duration_seconds`: LLM request duration (histogram)
- `dynamo_frontend_requests_total`: Total LLM requests (counter)
- `dynamo_frontend_time_to_first_token_seconds`: Time to first token (histogram)

**Note**: The `dynamo_frontend_inflight_requests` metric tracks requests from HTTP handler start until the complete response is finished, while `dynamo_frontend_queued_requests` tracks requests from HTTP handler start until first token generation begins (including prefill time). HTTP queue time is a subset of inflight time.
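Unlike the two gauges discussed in the note, counters such as `dynamo_frontend_requests_total` only ever increase, so dashboards usually plot their rate of change rather than the raw value. The sketch below shows the basic idea for two scrapes, including the usual treatment of a counter reset after a restart. It is a simplification of what PromQL's `rate()` does (which also extrapolates to the window boundaries), and the sample values are hypothetical.

```rust
/// One counter sample: (Unix timestamp in seconds, counter value).
type Sample = (f64, f64);

/// Increase of a counter between two samples. If the value went down, the
/// process is assumed to have restarted and the counter to have reset to
/// zero, so the newer value is itself the increase.
fn counter_increase(older: Sample, newer: Sample) -> f64 {
    if newer.1 >= older.1 {
        newer.1 - older.1
    } else {
        newer.1
    }
}

/// Average per-second rate between two samples; `None` if time did not advance.
fn counter_rate(older: Sample, newer: Sample) -> Option<f64> {
    let elapsed = newer.0 - older.0;
    if elapsed <= 0.0 {
        return None;
    }
    Some(counter_increase(older, newer) / elapsed)
}

fn main() {
    // Hypothetical scrapes of dynamo_frontend_requests_total, 15 s apart.
    let older = (1_000.0, 120.0);
    let newer = (1_015.0, 150.0);
    if let Some(rate) = counter_rate(older, newer) {
        println!("requests per second: {rate:.2}");
    }
}
```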

##### Model Configuration Metrics

The frontend also exposes model configuration metrics with the `dynamo_frontend_model_*` prefix. These metrics are populated from the worker backend registration service when workers register with the system:

**Runtime Config Metrics (from ModelRuntimeConfig):**
These metrics come from the runtime configuration provided by worker backends during registration.

- `dynamo_frontend_model_total_kv_blocks`: Total KV blocks available for a worker serving the model (gauge)
- `dynamo_frontend_model_max_num_seqs`: Maximum number of sequences for a worker serving the model (gauge)
- `dynamo_frontend_model_max_num_batched_tokens`: Maximum number of batched tokens for a worker serving the model (gauge)

**MDC Metrics (from ModelDeploymentCard):**
These metrics come from the Model Deployment Card information provided by worker backends during registration. Note that when multiple worker instances register with the same model name, only the first instance's configuration metrics (runtime config and MDC metrics) will be populated. Subsequent instances with duplicate model names will be skipped for configuration metric updates, though the worker count metric will reflect all instances.

- `dynamo_frontend_model_context_length`: Maximum context length for a worker serving the model (gauge)
- `dynamo_frontend_model_kv_cache_block_size`: KV cache block size for a worker serving the model (gauge)
- `dynamo_frontend_model_migration_limit`: Request migration limit for a worker serving the model (gauge)

**Worker Management Metrics:**
- `dynamo_frontend_model_workers`: Number of worker instances currently serving the model (gauge)
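The "first instance wins" behavior described above can be pictured with a small bookkeeping sketch. `ModelRegistry`, `ModelConfig`, and `register_worker` are hypothetical names invented for this illustration and are not Dynamo's actual types. The sketch only mirrors the documented outcome: configuration values come from the first registration, while the worker count reflects every instance.

```rust
use std::collections::HashMap;

/// Hypothetical subset of the configuration a worker reports at registration.
#[derive(Debug, Clone, PartialEq)]
struct ModelConfig {
    context_length: u64,
    kv_cache_block_size: u64,
}

/// Toy stand-in for per-model bookkeeping in the frontend.
#[derive(Default)]
struct ModelRegistry {
    configs: HashMap<String, ModelConfig>,
    workers: HashMap<String, u64>,
}

impl ModelRegistry {
    fn register_worker(&mut self, model: &str, config: ModelConfig) {
        // Every instance counts toward the worker gauge.
        *self.workers.entry(model.to_string()).or_insert(0) += 1;
        // Only the first instance's configuration is kept.
        self.configs.entry(model.to_string()).or_insert(config);
    }
}

fn main() {
    let mut registry = ModelRegistry::default();
    let model = "qwen/qwen3-0.6b";
    registry.register_worker(model, ModelConfig { context_length: 4096, kv_cache_block_size: 16 });
    registry.register_worker(model, ModelConfig { context_length: 8192, kv_cache_block_size: 32 });
    println!("workers = {}", registry.workers[model]);
    println!("config  = {:?}", registry.configs[model]);
}
```

A practical consequence is that workers serving the same model name with different settings will not be distinguishable through the configuration gauges.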

#### Request Processing Flow

This section explains the distinction between two key metrics used to track request processing:

1. **Inflight**: Tracks requests from HTTP handler start until the complete response is finished
2. **HTTP Queue**: Tracks requests from HTTP handler start until first token generation begins (including prefill time)

**Example Request Flow:**
```bash
curl -s localhost:8000/v1/completions -H "Content-Type: application/json" -d '{
  "model": "Qwen/Qwen3-0.6B",
  "prompt": "Hello let's talk about LLMs",
  "stream": false,
  "max_tokens": 1000
}'
```

**Timeline:**
```
Timeline:    0, 1, ...
Client ────> Frontend:8000 ────────────────────> Dynamo component/backend (vLLM, SGLang, TRT)
             │request start                     │received                              │
             |                                  |                                      |
             │                                  ├──> start prefill ──> first token ──> |last token
             │                                  │     (not impl)       |               |
             ├─────actual HTTP queue¹ ──────────┘                      │               |
             │                                                         │               │
             ├─────implemented HTTP queue ─────────────────────────────┘               |
             │                                                                         │
             └─────────────────────────────────── Inflight ────────────────────────────┘
```

**Concurrency Example:**
Suppose the backend allows 3 concurrent requests and there are 10 clients continuously hitting the frontend:
- All 10 requests will be counted as inflight (from start until complete response)
- 7 requests will be in HTTP queue most of the time
- 3 requests will be actively processed (between first token and last token)
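The arithmetic behind this example can be written down as a small sketch. It assumes an idealized steady state in which the backend always runs exactly its maximum number of concurrent requests and every other request is still waiting for its first token. `steady_state_gauges` is a hypothetical helper, not part of Dynamo.

```rust
/// Expected (inflight, queued, active) counts in an idealized steady state.
/// `clients` is the number of clients that each keep one request open,
/// `max_concurrent` is the backend's concurrency limit.
fn steady_state_gauges(clients: u32, max_concurrent: u32) -> (u32, u32, u32) {
    // Every open request counts as inflight until its response completes.
    let inflight = clients;
    // Requests beyond the concurrency limit have not produced a first token.
    let queued = clients.saturating_sub(max_concurrent);
    // The remainder is between first token and last token.
    let active = inflight - queued;
    (inflight, queued, active)
}

fn main() {
    let (inflight, queued, active) = steady_state_gauges(10, 3);
    println!("inflight={inflight} queued={queued} active={active}");
}
```

Real measurements fluctuate around these values, because requests finish and start at slightly different times between scrapes.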

**Testing Setup:**
Try launching a frontend and a Mocker backend that allows 3 concurrent requests:
```bash
$ python -m dynamo.frontend --http-port 8000
$ python -m dynamo.mocker --model-path Qwen/Qwen3-0.6B --max-num-seqs 3
# Launch your 10 concurrent clients here
# Then check the queued_requests and inflight_requests metrics from the frontend:
$ curl -s localhost:8000/metrics | grep -v '^#' | grep -E 'queue|inflight'
dynamo_frontend_queued_requests{model="qwen/qwen3-0.6b"} 7
dynamo_frontend_inflight_requests{model="qwen/qwen3-0.6b"} 10
```

**Real setup using vLLM (instead of Mocker):**
```bash
$ python -m dynamo.vllm --model Qwen/Qwen3-0.6B  \
   --enforce-eager --no-enable-prefix-caching --max-num-seqs 3
```

**Key Differences:**
- **Inflight**: Measures total request lifetime including processing time
- **HTTP Queue**: Measures queuing time before processing begins (including prefill time)
- **HTTP Queue ≤ Inflight** (HTTP queue is a subset of inflight time)

### Required Files

The following configuration files are located under the `deploy/` directory, most of them in `deploy/metrics/`:
- [docker-compose.yml](../../deploy/docker-compose.yml): Defines the Prometheus and Grafana services
- [prometheus.yml](../../deploy/metrics/prometheus.yml): Contains Prometheus scraping configuration
- [grafana-datasources.yml](../../deploy/metrics/grafana-datasources.yml): Contains Grafana datasource configuration
- [grafana_dashboards/grafana-dashboard-providers.yml](../../deploy/metrics/grafana_dashboards/grafana-dashboard-providers.yml): Contains Grafana dashboard provider configuration
- [grafana_dashboards/grafana-dynamo-dashboard.json](../../deploy/metrics/grafana_dashboards/grafana-dynamo-dashboard.json): A general Dynamo Dashboard for both SW and HW metrics.
- [grafana_dashboards/grafana-dcgm-metrics.json](../../deploy/metrics/grafana_dashboards/grafana-dcgm-metrics.json): Contains Grafana dashboard configuration for DCGM GPU metrics
- [grafana_dashboards/grafana-kvbm-dashboard.json](../../deploy/metrics/grafana_dashboards/grafana-kvbm-dashboard.json): Contains Grafana dashboard configuration for KVBM metrics

### Metric Name Constants

The [prometheus_names.rs](../../lib/runtime/src/metrics/prometheus_names.rs) module provides centralized Prometheus metric name constants and sanitization utilities for the Dynamo metrics system. This module ensures consistency across all components and prevents metric name duplication.

#### Key Features

- **Centralized Constants**: All Prometheus metric names are defined as constants to avoid duplication and typos
- **Automatic Sanitization**: Functions to sanitize metric and label names according to Prometheus naming rules
- **Component Organization**: Metric names are organized by component (frontend, work_handler, nats_client, etc.)
- **Validation Arrays**: Arrays of metric names for iteration and validation purposes

#### Metric Name Prefixes

- `dynamo_component_*`: Core component metrics (requests, latency, bytes, etc.)
- `dynamo_frontend_*`: Frontend service metrics (LLM HTTP service)
- `nats_client_*`: NATS client connection and message metrics
- `nats_service_*`: NATS service statistics metrics
- `kvstats_*`: KV cache statistics from LLM workers

#### Sanitization Functions

The module provides functions to ensure metric and label names comply with Prometheus naming conventions:

- `sanitize_prometheus_name()`: Sanitizes metric names (allows colons and `__`)
- `sanitize_prometheus_label()`: Sanitizes label names (no colons, no `__` prefix)
- `build_component_metric_name()`: Builds full component metric names with proper prefixing

This centralized approach ensures all Dynamo components use consistent, valid Prometheus metric names without manual coordination.
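To make the naming rules concrete, here is a minimal sketch of what sanitization along these lines could look like. It is not the code in `prometheus_names.rs`: the function names below are deliberately different and hypothetical, and the replacement strategy (substituting `_` for invalid characters) is an assumption. The rules themselves follow the Prometheus data model: metric names match `[a-zA-Z_:][a-zA-Z0-9_:]*`, label names match `[a-zA-Z_][a-zA-Z0-9_]*`, and label names starting with `__` are reserved.

```rust
/// Replace every character that is not valid in a Prometheus metric name
/// with an underscore, and make sure the name does not start with a digit.
fn sanitize_metric_name(raw: &str) -> String {
    let mut out: String = raw
        .chars()
        .map(|c| if c.is_ascii_alphanumeric() || c == '_' || c == ':' { c } else { '_' })
        .collect();
    if out.chars().next().map_or(true, |c| c.is_ascii_digit()) {
        out.insert(0, '_');
    }
    out
}

/// Same idea for label names: colons are not allowed, and a leading `__`
/// is collapsed to a single underscore because that prefix is reserved.
fn sanitize_label_name(raw: &str) -> String {
    let mut out: String = raw
        .chars()
        .map(|c| if c.is_ascii_alphanumeric() || c == '_' { c } else { '_' })
        .collect();
    if out.chars().next().map_or(true, |c| c.is_ascii_digit()) {
        out.insert(0, '_');
    }
    while out.starts_with("__") {
        out.remove(0);
    }
    out
}

/// Hypothetical helper that prepends the documented component prefix.
fn component_metric_name(name: &str) -> String {
    format!("dynamo_component_{}", sanitize_metric_name(name))
}

fn main() {
    println!("{}", sanitize_metric_name("http.requests:total"));
    println!("{}", sanitize_label_name("__model-name"));
    println!("{}", component_metric_name("requests-total"));
}
```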

## Getting Started

### Prerequisites

1. Make sure Docker and Docker Compose are installed on your system

### Quick Start

1. From the root of the Dynamo repository, start the Dynamo dependencies:

   ```bash
   # Start the basic services (etcd & natsd), along with Prometheus and Grafana
   docker compose -f deploy/docker-compose.yml --profile metrics up -d

   # Alternatively, start only the minimum components for Dynamo (etcd, NATS, dcgm-exporter), without Prometheus and Grafana
   docker compose -f deploy/docker-compose.yml up -d
   ```

   Optional: To target specific GPU(s), export the variable below before running Docker Compose:

   ```bash
   export CUDA_VISIBLE_DEVICES=0,2
   ```

2. The following web servers are now running. Endpoints that end in `/metrics` serve data in Prometheus format:
   - Grafana: `http://localhost:3001` (default login: dynamo/dynamo)
   - Prometheus Server: `http://localhost:9090`
   - NATS Server: `http://localhost:8222` (monitoring endpoints: /varz, /healthz, etc.)
   - NATS Prometheus Exporter: `http://localhost:7777/metrics`
   - etcd Server: `http://localhost:2379/metrics`
   - DCGM Exporter: `http://localhost:9401/metrics`


3. Start worker(s) that publish KV Cache metrics. The example in [lib/runtime/examples/service_metrics/README.md](../../lib/runtime/examples/service_metrics/README.md) can populate dummy KV Cache metrics.

### Configuration

#### Prometheus

The Prometheus configuration is specified in [prometheus.yml](../../deploy/metrics/prometheus.yml). This file is set up to collect metrics from the metrics aggregation service endpoint.

You may need to modify the scrape targets to match your host configuration and network environment.

After changing `prometheus.yml`, recreate the Prometheus container with the command below so that the new configuration is loaded. Sending a `kill -HUP` signal alone is not sufficient, because the volume that contains `prometheus.yml` is cached.

```bash
docker compose -f deploy/docker-compose.yml up prometheus -d --force-recreate
```

#### Grafana

Grafana is pre-configured with:
- Prometheus datasource
- Sample dashboard for visualizing service metrics

![grafana image](./grafana-dynamo-composite.png)

### Troubleshooting

1. Verify services are running:
   ```bash
   docker compose ps
   ```

2. Check logs:
   ```bash
   docker compose logs prometheus
   docker compose logs grafana
   ```

3. Check Prometheus targets at `http://localhost:9090/targets` to verify metric collection.

## Developer Guide

### Creating Metrics at Different Hierarchy Levels

#### Runtime-Level Metrics

```rust
use dynamo_runtime::DistributedRuntime;

let runtime = DistributedRuntime::new()?;
let namespace = runtime.namespace("my_namespace")?;
let component = namespace.component("my_component")?;
let endpoint = component.endpoint("my_endpoint")?;

// Create endpoint-level counters (this is a Prometheus Counter type)
let requests_total = endpoint.metrics().create_counter(
    "requests_total",
    "Total requests across all namespaces",
    &[]
)?;

let active_connections = endpoint.metrics().create_gauge(
    "active_connections",
    "Number of active client connections",
    &[]
)?;
```

#### Namespace-Level Metrics

```rust
let namespace = runtime.namespace("my_model")?;

// Namespace-scoped metrics
let model_requests = namespace.metrics().create_counter(
    "model_requests",
    "Requests for this specific model",
    &[]
)?;

let model_latency = namespace.metrics().create_histogram(
    "model_latency_seconds",
    "Model inference latency",
    &[],
    Some(vec![0.001, 0.01, 0.1, 1.0, 10.0])
)?;
```

#### Component-Level Metrics

```rust
let component = namespace.component("backend")?;

// Component-specific metrics
let backend_requests = component.metrics().create_counter(
    "backend_requests",
    "Requests handled by this backend component",
    &[]
)?;

let gpu_memory_usage = component.metrics().create_gauge(
    "gpu_memory_bytes",
    "GPU memory usage in bytes",
    &[]
)?;
```

#### Endpoint-Level Metrics

```rust
let endpoint = component.endpoint("generate")?;

// Endpoint-specific metrics
let generate_requests = endpoint.metrics().create_counter(
    "generate_requests",
    "Generate endpoint requests",
    &[]
)?;

let generate_latency = endpoint.metrics().create_histogram(
    "generate_latency_seconds",
    "Generate endpoint latency",
    &[],
    Some(vec![0.001, 0.01, 0.1, 1.0, 10.0])
)?;
```

### Creating Vector Metrics with Dynamic Labels

Use vector metrics when you need to track metrics with different label values:

```rust
// Counter with labels
let requests_by_model = endpoint.metrics().create_countervec(
    "requests_by_model",
    "Requests by model type",
    &["model_type", "model_size"],
    &[]  // no constant labels
)?;

// Increment with specific labels
requests_by_model.with_label_values(&["llama", "7b"]).inc();
requests_by_model.with_label_values(&["gpt", "13b"]).inc();

// Gauge with labels
let memory_by_gpu = component.metrics().create_gaugevec(
    "gpu_memory_bytes",
    "GPU memory usage by device",
    &["gpu_id", "memory_type"],
    &[]  // no constant labels
)?;

memory_by_gpu.with_label_values(&["0", "allocated"]).set(8192.0);
memory_by_gpu.with_label_values(&["0", "cached"]).set(4096.0);
```
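Conceptually, a vector metric keeps one independent time series per distinct combination of label values. The toy model below illustrates that idea with a plain `HashMap`. `ToyCounterVec` is a hypothetical type written for this explanation only, and it says nothing about how the `prometheus` crate or Dynamo store their series internally.

```rust
use std::collections::HashMap;

/// Toy model of a labeled counter: one value per combination of label values.
struct ToyCounterVec {
    label_names: Vec<String>,
    series: HashMap<Vec<String>, f64>,
}

impl ToyCounterVec {
    fn new(label_names: &[&str]) -> Self {
        Self {
            label_names: label_names.iter().map(|n| n.to_string()).collect(),
            series: HashMap::new(),
        }
    }

    /// Increment the series identified by `label_values`, creating it on first use.
    fn inc(&mut self, label_values: &[&str]) {
        assert_eq!(
            label_values.len(),
            self.label_names.len(),
            "one value is required per label name"
        );
        let key: Vec<String> = label_values.iter().map(|v| v.to_string()).collect();
        *self.series.entry(key).or_insert(0.0) += 1.0;
    }

    /// Current value of a series, or 0.0 if it was never incremented.
    fn get(&self, label_values: &[&str]) -> f64 {
        let key: Vec<String> = label_values.iter().map(|v| v.to_string()).collect();
        self.series.get(&key).copied().unwrap_or(0.0)
    }
}

fn main() {
    let mut requests = ToyCounterVec::new(&["model_type", "model_size"]);
    requests.inc(&["llama", "7b"]);
    requests.inc(&["llama", "7b"]);
    requests.inc(&["gpt", "13b"]);
    println!("llama/7b = {}", requests.get(&["llama", "7b"]));
    println!("distinct series = {}", requests.series.len());
}
```

Because every new combination creates another series, labels with unbounded values (request IDs, user IDs) are best avoided.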

### Creating Histograms

Histograms are useful for measuring distributions of values like latency:

```rust
let latency_histogram = endpoint.metrics().create_histogram(
    "request_latency_seconds",
    "Request latency distribution",
    &[],
    Some(vec![0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1.0, 5.0])
)?;

// Record latency values
latency_histogram.observe(0.023); // 23ms
latency_histogram.observe(0.156); // 156ms
```
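Prometheus histogram buckets are cumulative: the bucket with upper bound `le` counts every observation less than or equal to `le`, and an implicit `+Inf` bucket counts all observations. The sketch below recomputes those counts for the two observations above using plain Rust. `cumulative_buckets` is a hypothetical helper for illustration and not part of any metrics library.

```rust
/// Cumulative bucket counts as Prometheus exposes them: the entry for upper
/// bound `le` counts every observation `<= le`. The implicit `+Inf` bucket
/// (equal to the total number of observations) is appended last.
fn cumulative_buckets(bounds: &[f64], observations: &[f64]) -> Vec<u64> {
    let mut counts: Vec<u64> = bounds
        .iter()
        .map(|le| observations.iter().filter(|v| **v <= *le).count() as u64)
        .collect();
    counts.push(observations.len() as u64);
    counts
}

fn main() {
    let bounds = [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1.0, 5.0];
    let observations = [0.023, 0.156];
    let counts = cumulative_buckets(&bounds, &observations);
    for (le, count) in bounds.iter().zip(&counts) {
        println!("le={le}: {count}");
    }
    println!("le=+Inf: {}", counts[bounds.len()]);
}
```

The 23 ms observation first appears in the `0.05` bucket and the 156 ms observation in the `0.5` bucket, so bucket boundaries should bracket the latencies you expect to see.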

### Transitioning from Plain Prometheus

If you're currently using plain Prometheus metrics, transitioning to Dynamo's `MetricsRegistry` is straightforward:

#### Before (Plain Prometheus)

```rust
use prometheus::{Counter, Opts, Registry};

// Create a registry to hold metrics
let registry = Registry::new();
let counter_opts = Opts::new("my_counter", "My custom counter");
let counter = Counter::with_opts(counter_opts).unwrap();
registry.register(Box::new(counter.clone())).unwrap();

// Use the counter
counter.inc();

// To expose metrics, you'd need to set up an HTTP server manually
// and implement the /metrics endpoint yourself
```

#### After (Dynamo MetricsRegistry)

```rust
let counter = endpoint.metrics().create_counter(
    "my_counter",
    "My custom counter",
    &[]
)?;

counter.inc();
```

**Note:** The metric is automatically registered when created via the endpoint's `metrics().create_counter()` factory method.

**Benefits of Dynamo's approach:**
- **Automatic registration**: Metrics created via the endpoint's `metrics().create_*()` factory methods are automatically registered with the system
- **Automatic labeling**: Metrics are labeled with namespace, component, and endpoint information
- **Consistent naming**: Metric names share the `dynamo_` prefix
- **Built-in HTTP metrics endpoint**: Available when enabled with `DYN_SYSTEM_ENABLED=true`
- **Hierarchical organization**: Metrics are organized by runtime, namespace, component, and endpoint

### Advanced Features

#### Custom Buckets for Histograms

```rust
// Define custom buckets for your use case
let custom_buckets = vec![0.001, 0.01, 0.1, 1.0, 10.0];
let latency = endpoint.metrics().create_histogram(
    "api_latency_seconds",
    "API latency in seconds",
    &[],
    Some(custom_buckets)
)?;
```

#### Metric Aggregation

```rust
// Aggregate metrics across multiple endpoints
let requests_total = namespace.metrics().create_counter(
    "requests_total",
    "Total requests across all endpoints",
    &[]
)?;
```
