# Metrics

The `metrics` component is a utility that can collect, aggregate, and publish
metrics from a Dynamo deployment for use in other applications or visualization
tools like Prometheus and Grafana.

<div align="center">
  <img src="images/dynamo_metrics_grafana.png" alt="Dynamo Metrics Dashboard"/>
</div>

## Quickstart

To start the `metrics` component, point it at the `namespace/component/endpoint`
trio of the Dynamo workers whose metrics you want to monitor.

This will:
1. Collect statistics from workers associated with that `namespace/component/endpoint`
2. Postprocess and aggregate those statistics across the workers
3. Publish them on a Prometheus-compatible metrics endpoint

For example:
```bash
# Default namespace is "dynamo", but can be configured with --namespace
# For more detailed output, try setting the env var: DYN_LOG=debug
metrics --component my_component --endpoint my_endpoint

# 2025-03-17T00:07:05.202558Z  INFO metrics: Scraping endpoint dynamo/my_component/my_endpoint for stats
# 2025-03-17T00:07:05.202955Z  INFO metrics: Prometheus metrics server started at 0.0.0.0:9091/metrics
# ...
```

With no matching endpoints running to collect stats from, you should see warnings in the logs:
```bash
2025-03-17T00:07:06.204756Z  WARN metrics: No endpoints found matching dynamo/my_component/my_endpoint
```

After a worker with a matching endpoint gets started, the endpoint
will get automatically discovered and the warnings will stop.

## Workers

The `metrics` component needs running workers to gather metrics from,
so below are some examples of workers and how they can be monitored.

### Mock Worker

For quick testing and debugging, there is a Rust-based
[mock worker](src/bin/mock_worker.rs) that registers a mock
`StatsHandler` under an endpoint named
`dynamo/my_component/my_endpoint` and publishes random data.

```bash
# Can run multiple workers in separate shells to see aggregation as well.
# Or to build/run from source: cargo run --bin mock_worker
mock_worker

# 2025-03-16T23:49:28.101668Z  INFO mock_worker: Starting Mock Worker on Endpoint: dynamo/my_component/my_endpoint
```

To monitor the metrics of these mock workers, run:
```bash
metrics --component my_component --endpoint my_endpoint
```

### Real Worker

To gather metrics from a more realistic deployment,
see the examples in [examples/llm](../../examples/llm).

For example, for a vLLM + KV Routing based deployment that
exposes statistics on an endpoint labeled
`dynamo/VllmWorker/load_metrics`:
```bash
cd deploy/examples/llm
dynamo serve <vllm kv routing example args>
```

To monitor the metrics of these VllmWorkers, run:
```bash
metrics --component VllmWorker --endpoint load_metrics
```

**NOTE**: `load_metrics` is currently a
[hard-coded](https://github.com/ai-dynamo/dynamo/blob/d5220c7b1151372ba3d2a061c7d0a7ed72724789/lib/llm/src/kv_router/publisher.rs#L108)
endpoint name used for Python-based workers that register a `KvMetricsPublisher`.

## Visualization

To visualize the metrics being exposed on the Prometheus endpoint,
see the Prometheus and Grafana configurations in
[deploy/metrics](../../deploy/metrics):
```bash
docker compose -f deploy/docker-compose.yml --profile metrics up -d
```

## Metrics Collection Modes

The metrics component supports two modes for exposing metrics in Prometheus format:

### Pull Mode (Default)

When running in pull mode (the default), the metrics component will expose a
Prometheus metrics endpoint on the specified host and port that a
Prometheus server or curl client can pull from:

```bash
# Start metrics server on default host (0.0.0.0) and port (9091)
metrics --component my_component --endpoint my_endpoint

# Or specify a custom port
metrics --component my_component --endpoint my_endpoint --port 9092
```

In pull mode:
- The `--host` parameter must be a valid IPv4 or IPv6 address (e.g., "0.0.0.0", "127.0.0.1")
- The `--port` parameter specifies which port the HTTP server will listen on

You can then query the metrics using:
```bash
curl localhost:9091/metrics

# # HELP llm_kv_blocks_active Active KV cache blocks
# # TYPE llm_kv_blocks_active gauge
# llm_kv_blocks_active{component="my_component",endpoint="my_endpoint",worker_id="7587884888253033398"} 40
# llm_kv_blocks_active{component="my_component",endpoint="my_endpoint",worker_id="7587884888253033401"} 2
# # HELP llm_kv_blocks_total Total KV cache blocks
# # TYPE llm_kv_blocks_total gauge
# llm_kv_blocks_total{component="my_component",endpoint="my_endpoint",worker_id="7587884888253033398"} 100
# llm_kv_blocks_total{component="my_component",endpoint="my_endpoint",worker_id="7587884888253033401"} 100
```
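
As a rough illustration of consuming this output, the sketch below sums the
`llm_kv_blocks_active` and `llm_kv_blocks_total` gauges across all workers and
prints the aggregate KV cache utilization. The `kv_utilization` function name is
hypothetical and not part of Dynamo, and the sketch assumes samples carry no
trailing timestamp, so the value is the last field on each line:

```bash
# Hypothetical helper: reads Prometheus text format on stdin and prints
# aggregate KV cache utilization (active blocks / total blocks) as a percentage.
kv_utilization() {
  awk '
    /^#/ { next }                                   # skip HELP/TYPE comment lines
    /^llm_kv_blocks_active[{ ]/ { active += $NF }   # last field is the sample value
    /^llm_kv_blocks_total[{ ]/  { total  += $NF }
    END {
      if (total > 0) printf "%.1f%%\n", 100 * active / total
      else           print "no llm_kv_blocks_total samples found"
    }
  '
}

# Against a live endpoint in pull mode (default port):
#   curl -s localhost:9091/metrics | kv_utilization
```

For the sample output above (42 active blocks out of 200 total), this would print `21.0%`.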

### Push Mode

For ephemeral or batch jobs, or when metrics need to be pushed through a firewall,
you can use Push mode. In this mode, the metrics component will periodically push
metrics to an externally hosted
[Prometheus PushGateway](https://prometheus.io/docs/instrumenting/pushing/):

Start a Prometheus PushGateway service via Docker:
```bash
docker run --rm -d -p 9091:9091 --name pushgateway prom/pushgateway
```

Start the metrics component in `--push` mode, specifying the host and port of your PushGateway:
```bash
# Push metrics to a Prometheus PushGateway every --push-interval seconds
metrics \
    --component my_component \
    --endpoint my_endpoint \
    --host 127.0.0.1 \
    --port 9091 \
    --push
```

When using Push mode:
- The `--host` parameter must be a valid IPv4 or IPv6 address (e.g., "0.0.0.0", "127.0.0.1")
  that the Prometheus PushGateway is running on
- The `--port` parameter specifies the port of the Prometheus PushGateway
- The push interval can be configured with `--push-interval` (default: 2 seconds)
- A default job name of "dynamo_metrics" is used for the Prometheus job label
- Metrics persist in the PushGateway until explicitly deleted
- Prometheus should be configured to scrape the PushGateway with `honor_labels: true`
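
The `honor_labels: true` setting above could look like the following minimal
sketch, which writes a Prometheus config that scrapes the PushGateway. The file
name `prometheus-pushgateway.yml`, the scrape job name `pushgateway`, and the
target address are assumptions for illustration; the target must be an address
at which your Prometheus server can reach the PushGateway:

```bash
# Write a minimal Prometheus config that scrapes the PushGateway.
# File name, scrape job name, and target address are assumptions; adjust to your setup.
cat > prometheus-pushgateway.yml <<'EOF'
scrape_configs:
  - job_name: pushgateway
    # Keep the job/instance labels attached to the pushed metrics
    # instead of overwriting them with the scrape target's labels.
    honor_labels: true
    static_configs:
      - targets: ["127.0.0.1:9091"]
EOF
```

Prometheus can then be started with `--config.file=prometheus-pushgateway.yml`.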

To view the metrics hosted on the PushGateway:
```bash
# View all metrics
# curl http://<pushgateway_ip>:<pushgateway_port>/metrics
curl 127.0.0.1:9091/metrics
```
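
Because pushed metrics persist until they are deleted, stale values may need to
be cleaned up once workers are gone. The sketch below builds the PushGateway
group URL for the default `dynamo_metrics` job; `pushgateway_job_url` is a
hypothetical helper, and it assumes the metrics are pushed under the job label
alone. If additional grouping labels are used, they would need to be appended
to the URL path:

```bash
# Hypothetical helper: build the PushGateway URL for a job group.
# Defaults follow this README: gateway at 127.0.0.1:9091, job "dynamo_metrics".
pushgateway_job_url() {
  local gateway="${1:-127.0.0.1:9091}"
  local job="${2:-dynamo_metrics}"
  printf 'http://%s/metrics/job/%s\n' "$gateway" "$job"
}

# Delete every metric in that group (requires a running PushGateway):
#   curl -X DELETE "$(pushgateway_job_url)"
```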

## Building/Running from Source

For easy iteration while making edits to the metrics component, you can use `cargo run`
to build and run with your local changes:

```bash
cargo run --bin metrics -- --component my_component --endpoint my_endpoint
```