<!--
SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
-->

# vLLM Prometheus Metrics

## Overview

When running vLLM through Dynamo, vLLM engine metrics are automatically passed through and exposed on Dynamo's `/metrics` endpoint, served on the system port set by `DYN_SYSTEM_PORT` (disabled by default; the examples in this guide use 8081). This lets you access both vLLM engine metrics (prefixed with `vllm:`) and Dynamo runtime metrics (prefixed with `dynamo_*`) from a single worker backend endpoint.

**For the complete and authoritative list of all vLLM metrics**, always refer to the [official vLLM Metrics Design documentation](https://docs.vllm.ai/en/latest/design/metrics.html).

**For LMCache metrics and integration**, see the [LMCache Integration Guide](LMCache_Integration.md).

**For Dynamo runtime metrics**, see the [Dynamo Metrics Guide](../../observability/metrics.md).

**For visualization setup instructions**, see the [Prometheus and Grafana Setup Guide](../../observability/prometheus-grafana.md).

## Environment Variables and Flags

| Variable/Flag | Description | Default | Example |
|---------------|-------------|---------|---------|
| `DYN_SYSTEM_PORT` | System metrics/health port. Required to expose `/metrics` endpoint. | `-1` (disabled) | `8081` |
| `--connector` | KV connector to use. Use `lmcache` to enable LMCache metrics. | `nixl` | `--connector lmcache` |
| `ENABLE_LMCACHE` | Sets Dynamo's recommended LMCache defaults (optional). | Not set | `ENABLE_LMCACHE=1` |

## Getting Started Quickly

This example runs everything on a single machine.

### Start Observability Stack

For visualizing metrics with Prometheus and Grafana, start the observability stack. See [Observability Getting Started](../../observability/README.md#getting-started-quickly) for instructions.

### Launch Dynamo Components

Launch a frontend and vLLM backend to test metrics:

```bash
# Start frontend (default port 8000, override with --http-port or DYN_HTTP_PORT env var)
python -m dynamo.frontend

# Enable system metrics server on port 8081
DYN_SYSTEM_PORT=8081 python -m dynamo.vllm --model <model_name> \
   --enforce-eager --no-enable-prefix-caching --max-num-seqs 3
```

Wait for the vLLM worker to start, then send requests and check metrics:

```bash
# Send a request
curl -H 'Content-Type: application/json' \
-d '{
  "model": "<model_name>",
  "max_completion_tokens": 100,
  "messages": [{"role": "user", "content": "Hello"}]
}' \
http://localhost:8000/v1/chat/completions

# Check metrics from the worker
curl -s localhost:8081/metrics | grep "^vllm:"
```
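
Because vLLM metrics only appear once the engine has finished initializing, a script that checks them may need to poll rather than read once. The following is a minimal sketch of that idea using only the Python standard library. The helper names (`fetch_metrics`, `wait_for_prefix`) and the timeout values are hypothetical and not part of Dynamo or vLLM; the URL assumes `DYN_SYSTEM_PORT=8081` as in the commands above.

```python
import time
import urllib.request
from typing import Callable

# Assumption: the worker was started with DYN_SYSTEM_PORT=8081 on this machine.
METRICS_URL = "http://localhost:8081/metrics"


def fetch_metrics(url: str = METRICS_URL) -> str:
    """Return the raw exposition text from the worker's /metrics endpoint."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return resp.read().decode("utf-8")


def wait_for_prefix(
    fetch: Callable[[], str],
    prefix: str = "vllm:",
    timeout_s: float = 120.0,
    interval_s: float = 2.0,
) -> list[str]:
    """Poll until at least one sample line starts with `prefix`, then return those lines.

    Raises TimeoutError if no matching line shows up within `timeout_s` seconds.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            text = fetch()
        except OSError:
            text = ""  # endpoint not reachable yet; retry after the interval
        lines = [ln for ln in text.splitlines() if ln.startswith(prefix)]
        if lines:
            return lines
        if time.monotonic() >= deadline:
            raise TimeoutError(f"no {prefix!r} metrics after {timeout_s}s")
        time.sleep(interval_s)
```

Calling `wait_for_prefix(fetch_metrics)` returns the same lines as the `curl ... | grep "^vllm:"` command, but only after the worker has started reporting them. Passing the fetch function as an argument keeps the polling logic testable without a running worker.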

## Exposed Metrics

vLLM exposes metrics in Prometheus Exposition Format text at the `/metrics` HTTP endpoint. All vLLM engine metrics use the `vllm:` prefix and include labels (e.g., `model_name`, `finished_reason`, `scheduling_event`) to identify the source.

**Example Prometheus Exposition Format text:**

```
# HELP vllm:request_success_total Number of successfully finished requests.
# TYPE vllm:request_success_total counter
vllm:request_success_total{finished_reason="length",model_name="meta-llama/Llama-3.1-8B"} 15.0
vllm:request_success_total{finished_reason="stop",model_name="meta-llama/Llama-3.1-8B"} 150.0

# HELP vllm:time_to_first_token_seconds Histogram of time to first token in seconds.
# TYPE vllm:time_to_first_token_seconds histogram
vllm:time_to_first_token_seconds_bucket{le="0.001",model_name="meta-llama/Llama-3.1-8B"} 0.0
vllm:time_to_first_token_seconds_bucket{le="0.005",model_name="meta-llama/Llama-3.1-8B"} 5.0
vllm:time_to_first_token_seconds_count{model_name="meta-llama/Llama-3.1-8B"} 165.0
vllm:time_to_first_token_seconds_sum{model_name="meta-llama/Llama-3.1-8B"} 89.38
```

**Note:** The specific metrics shown above are examples and may vary depending on your vLLM version. Always inspect your actual `/metrics` endpoint or refer to the [official documentation](https://docs.vllm.ai/en/latest/design/metrics.html) for the current list.
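
As a worked example of reading this format, the sketch below parses the sample lines from the example text and derives two values: the total number of successful requests, and the mean time to first token (histogram `_sum` divided by `_count`). It is a simplified, standard-library-only illustration; the regular expressions are an assumption that covers the lines shown here and do not handle escaped quotes in label values or optional timestamps. For production use, a full parser such as the one in `prometheus_client` is the safer choice.

```python
import re

# Simplified patterns: `name{labels} value` or `name value`.
# Assumption: no escaped quotes inside label values and no trailing timestamps.
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse_samples(text):
    """Yield (name, labels, value) for each sample line, skipping comments and blanks."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        yield name, labels, float(value)


def metric_total(text, metric):
    """Sum a metric's value across all of its label combinations."""
    return sum(v for name, _labels, v in parse_samples(text) if name == metric)


def histogram_mean(text, metric):
    """Mean observation of a histogram (<metric>_sum / <metric>_count), or None if empty."""
    count = metric_total(text, metric + "_count")
    return metric_total(text, metric + "_sum") / count if count else None


# The sample lines from the example exposition text.
EXAMPLE = """\
# TYPE vllm:request_success_total counter
vllm:request_success_total{finished_reason="length",model_name="meta-llama/Llama-3.1-8B"} 15.0
vllm:request_success_total{finished_reason="stop",model_name="meta-llama/Llama-3.1-8B"} 150.0
# TYPE vllm:time_to_first_token_seconds histogram
vllm:time_to_first_token_seconds_count{model_name="meta-llama/Llama-3.1-8B"} 165.0
vllm:time_to_first_token_seconds_sum{model_name="meta-llama/Llama-3.1-8B"} 89.38
"""

total_success = metric_total(EXAMPLE, "vllm:request_success_total")
mean_ttft = histogram_mean(EXAMPLE, "vllm:time_to_first_token_seconds")
```

With the example values, `total_success` is 165.0 (15 + 150) and `mean_ttft` is 89.38 / 165, roughly 0.54 seconds. Note that both helpers sum across every label combination, so with several models loaded you would filter on `model_name` first.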

### Metric Categories

vLLM provides metrics in the following categories (all prefixed with `vllm:`):

- **Request metrics** - Request success, failure, and completion tracking
- **Performance metrics** - Latency, throughput, and timing measurements
- **Resource usage** - System resource consumption
- **Scheduler metrics** - Scheduling and queue management
- **Disaggregation metrics** - Metrics specific to disaggregated deployments (when enabled)

**Note:** Specific metrics are subject to change between vLLM versions. Always refer to the [official documentation](https://docs.vllm.ai/en/latest/design/metrics.html) or inspect the `/metrics` endpoint for your vLLM version.

## Available Metrics

The official vLLM documentation includes complete metric definitions with:
- Detailed explanations and design rationale
- Counter, Gauge, and Histogram metric types
- Metric labels (e.g., `model_name`, `finished_reason`, `scheduling_event`)
- Information about v1 metrics migration
- Future work and deprecated metrics

For the complete and authoritative list of all vLLM metrics, see the [official vLLM Metrics Design documentation](https://docs.vllm.ai/en/latest/design/metrics.html).

## LMCache Metrics

When LMCache is enabled with `--connector lmcache` and `DYN_SYSTEM_PORT` is set, LMCache metrics (prefixed with `lmcache:`) are automatically exposed via Dynamo's `/metrics` endpoint alongside vLLM and Dynamo metrics.

### Minimum Requirements

To access LMCache metrics, both of these are required:
1. `--connector lmcache` - Enables LMCache in vLLM
2. `DYN_SYSTEM_PORT=8081` - Enables Dynamo's metrics HTTP endpoint

**Minimal example:**
```bash
DYN_SYSTEM_PORT=8081 \
python -m dynamo.vllm --model Qwen/Qwen3-0.6B --connector lmcache
```

**Recommended (with Dynamo defaults):**
```bash
DYN_SYSTEM_PORT=8081 ENABLE_LMCACHE=1 \
python -m dynamo.vllm --model Qwen/Qwen3-0.6B --connector lmcache
```

### Viewing LMCache Metrics

```bash
# View all LMCache metrics
curl -s localhost:8081/metrics | grep "^lmcache:"
```

**For complete LMCache configuration and metric details**, see:
- [LMCache Integration Guide](LMCache_Integration.md) - Setup and configuration
- [LMCache Observability Documentation](https://docs.lmcache.ai/production/observability/vllm_endpoint.html) - Complete metrics reference

## Implementation Details

- vLLM v1 uses multiprocess metrics collection via `prometheus_client.multiprocess`
- `PROMETHEUS_MULTIPROC_DIR`: vLLM sets this environment variable to a temporary directory where multiprocess metrics are stored as memory-mapped files. Each worker process writes its metrics to separate files in this directory, which are aggregated when `/metrics` is scraped.
- Dynamo uses `MultiProcessCollector` to aggregate metrics from all worker processes
- Before being exposed, metrics are filtered by the `vllm:` prefix, and additionally by the `lmcache:` prefix when LMCache is enabled
- The integration uses Dynamo's `register_engine_metrics_callback()` function with the global `REGISTRY`
- Metrics appear after vLLM engine initialization completes
- vLLM v1 metrics are different from v0 - see the [official documentation](https://docs.vllm.ai/en/latest/design/metrics.html) for migration details
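
The prefix filtering step can be pictured with a small standalone sketch. This is not Dynamo's implementation (see `components/src/dynamo/common/utils/prometheus.py` for the real integration code); the function name `filter_exposition` is hypothetical, and the code only illustrates the idea of keeping sample lines and their `# HELP` / `# TYPE` comments when the metric name starts with an allowed prefix.

```python
def filter_exposition(text, prefixes=("vllm:",)):
    """Keep only lines belonging to metrics whose name starts with one of `prefixes`.

    Sample lines are matched on the line itself; `# HELP` and `# TYPE` comment
    lines are matched on the metric name that follows the keyword.
    """
    allowed = tuple(prefixes)
    kept = []
    for line in text.splitlines():
        if line.startswith("# HELP ") or line.startswith("# TYPE "):
            parts = line.split(" ", 3)  # ["#", "HELP", "<name>", "<rest>"]
            name = parts[2] if len(parts) > 2 else ""
        else:
            name = line
        if name.startswith(allowed):
            kept.append(line)
    return "\n".join(kept) + ("\n" if kept else "")
```

For example, `filter_exposition(text, ("vllm:", "lmcache:"))` mirrors the case where LMCache is enabled, while the default argument keeps only `vllm:` metrics.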

## Related Documentation

### vLLM Metrics
- [Official vLLM Metrics Design Documentation](https://docs.vllm.ai/en/latest/design/metrics.html)
- [vLLM Production Metrics User Guide](https://docs.vllm.ai/en/latest/usage/metrics.html)
- [vLLM GitHub - Metrics Implementation](https://github.com/vllm-project/vllm/tree/main/vllm/v1/metrics)

### Dynamo Metrics
- [Dynamo Metrics Guide](../../observability/metrics.md) - Complete documentation on Dynamo runtime metrics
- [Prometheus and Grafana Setup](../../observability/prometheus-grafana.md) - Visualization setup instructions
- Dynamo runtime metrics (prefixed with `dynamo_*`) are available at the same `/metrics` endpoint alongside vLLM metrics
  - Implementation: `lib/runtime/src/metrics.rs` (Rust runtime metrics)
  - Metric names: `lib/runtime/src/metrics/prometheus_names.rs` (metric name constants)
  - Integration code: `components/src/dynamo/common/utils/prometheus.py` - Prometheus utilities and callback registration