trtllm-observability.md 12 KB
Newer Older
1
2
3
---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
4
title: Prometheus
5
---
6

7
8
9
10
For general TensorRT-LLM features and configuration, see the [Reference Guide](trtllm-reference-guide.md).

---

11
12
## Overview

13
When running TensorRT-LLM through Dynamo, TensorRT-LLM's Prometheus metrics are automatically passed through and exposed on Dynamo's `/metrics` endpoint (default port 8081). This allows you to access both TensorRT-LLM engine metrics (prefixed with `trtllm_`) and Dynamo runtime metrics (prefixed with `dynamo_*`) from a single worker backend endpoint.
14

15
Additional performance metrics are available via non-Prometheus APIs (see [Non-Prometheus Performance Metrics](#non-prometheus-performance-metrics) below).
16

17
18
As of the date of this documentation, the included TensorRT-LLM version 1.1.0rc5 exposes **5 basic Prometheus metrics**. Note that the `trtllm_` prefix is added by Dynamo.

19
**For Dynamo runtime metrics**, see the [Dynamo Metrics Guide](../../observability/metrics.md).
20

21
**For visualization setup instructions**, see the [Prometheus and Grafana Setup Guide](../../observability/prometheus-grafana.md).
22

23
## Environment Variables
24

25
26
27
| Variable | Description | Default | Example |
|----------|-------------|---------|---------|
| `DYN_SYSTEM_PORT` | System metrics/health port | `-1` (disabled) | `8081` |
28

29
## Getting Started Quickly
30

31
This is a single machine example.
32

33
### Start Observability Stack
34

35
For visualizing metrics with Prometheus and Grafana, start the observability stack. See [Observability Getting Started](../../observability/README.md#getting-started-quickly) for instructions.
36

37
### Launch Dynamo Components
38

39
Launch a frontend and TensorRT-LLM backend to test metrics:
40
41

```bash
42
43
# Start frontend (default port 8000, override with --http-port or DYN_HTTP_PORT env var)
$ python -m dynamo.frontend
44

45
46
47
# Enable system metrics server on port 8081 and enable metrics collection
$ DYN_SYSTEM_PORT=8081 python -m dynamo.trtllm --model <model_name> --publish-events-and-metrics
```
48

49
**Note:** The `backend` must be set to `"pytorch"` for metrics collection (enforced in `components/src/dynamo/trtllm/main.py`). TensorRT-LLM's `MetricsCollector` integration has only been tested/validated with the PyTorch backend.
50

51
Wait for the TensorRT-LLM worker to start, then send requests and check metrics:
52
53

```bash
54
55
56
57
58
# Send a request
curl -H 'Content-Type: application/json' \
-d '{
  "model": "<model_name>",
  "max_completion_tokens": 100,
59
  "messages": [{"role": "user", "content": "Explain why Roger Federer is considered one of the greatest tennis players of all time"}]
60
61
62
63
}' \
http://localhost:8000/v1/chat/completions

# Check metrics from the worker
64
curl -s localhost:8081/metrics | grep "^trtllm_"
65
66
```

67
## Exposed Metrics
68

69
TensorRT-LLM exposes metrics in Prometheus Exposition Format text at the `/metrics` HTTP endpoint. All TensorRT-LLM engine metrics use the `trtllm_` prefix and include labels (e.g., `model_name`, `engine_type`, `finished_reason`) to identify the source.
70

71
**Note:** TensorRT-LLM uses `model_name` instead of Dynamo's standard `model` label convention.
72

73
**Example Prometheus Exposition Format text:**
74
75

```
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
# HELP trtllm_request_success_total Count of successfully processed requests.
# TYPE trtllm_request_success_total counter
trtllm_request_success_total{model_name="Qwen/Qwen3-0.6B",engine_type="trtllm",finished_reason="stop"} 150.0
trtllm_request_success_total{model_name="Qwen/Qwen3-0.6B",engine_type="trtllm",finished_reason="length"} 5.0

# HELP trtllm_time_to_first_token_seconds Histogram of time to first token in seconds.
# TYPE trtllm_time_to_first_token_seconds histogram
trtllm_time_to_first_token_seconds_bucket{le="0.01",model_name="Qwen/Qwen3-0.6B",engine_type="trtllm"} 0.0
trtllm_time_to_first_token_seconds_bucket{le="0.05",model_name="Qwen/Qwen3-0.6B",engine_type="trtllm"} 12.0
trtllm_time_to_first_token_seconds_count{model_name="Qwen/Qwen3-0.6B",engine_type="trtllm"} 150.0
trtllm_time_to_first_token_seconds_sum{model_name="Qwen/Qwen3-0.6B",engine_type="trtllm"} 8.75

# HELP trtllm_e2e_request_latency_seconds Histogram of end to end request latency in seconds.
# TYPE trtllm_e2e_request_latency_seconds histogram
trtllm_e2e_request_latency_seconds_bucket{le="0.5",model_name="Qwen/Qwen3-0.6B",engine_type="trtllm"} 25.0
trtllm_e2e_request_latency_seconds_count{model_name="Qwen/Qwen3-0.6B",engine_type="trtllm"} 150.0
trtllm_e2e_request_latency_seconds_sum{model_name="Qwen/Qwen3-0.6B",engine_type="trtllm"} 45.2

# HELP trtllm_time_per_output_token_seconds Histogram of time per output token in seconds.
# TYPE trtllm_time_per_output_token_seconds histogram
trtllm_time_per_output_token_seconds_bucket{le="0.1",model_name="Qwen/Qwen3-0.6B",engine_type="trtllm"} 120.0
trtllm_time_per_output_token_seconds_count{model_name="Qwen/Qwen3-0.6B",engine_type="trtllm"} 150.0
trtllm_time_per_output_token_seconds_sum{model_name="Qwen/Qwen3-0.6B",engine_type="trtllm"} 12.5

# HELP trtllm_request_queue_time_seconds Histogram of time spent in WAITING phase for request.
# TYPE trtllm_request_queue_time_seconds histogram
trtllm_request_queue_time_seconds_bucket{le="1.0",model_name="Qwen/Qwen3-0.6B",engine_type="trtllm"} 140.0
trtllm_request_queue_time_seconds_count{model_name="Qwen/Qwen3-0.6B",engine_type="trtllm"} 150.0
trtllm_request_queue_time_seconds_sum{model_name="Qwen/Qwen3-0.6B",engine_type="trtllm"} 32.1
105
106
```

107
**Note:** The specific metrics shown above are examples and may vary depending on your TensorRT-LLM version. Always inspect your actual `/metrics` endpoint for the current list.
108

109
### Metric Categories
110

111
TensorRT-LLM provides metrics in the following categories (all prefixed with `trtllm_`):
112
113
114
115
116

- **Request metrics** - Request success tracking and latency measurements
- **Performance metrics** - Time to first token (TTFT), time per output token (TPOT), and queue time

**Note:** Metrics may change between TensorRT-LLM versions. Always inspect the `/metrics` endpoint for your version.
117

118
119
## Available Metrics

120
The following metrics are exposed via Dynamo's `/metrics` endpoint (with the `trtllm_` prefix added by Dynamo) for TensorRT-LLM version 1.1.0rc5:
121

122
- `trtllm_request_success_total` (Counter) — Count of successfully processed requests by finish reason
123
  - Labels: `model_name`, `engine_type`, `finished_reason`
124
- `trtllm_e2e_request_latency_seconds` (Histogram) — End-to-end request latency (seconds)
125
  - Labels: `model_name`, `engine_type`
126
- `trtllm_time_to_first_token_seconds` (Histogram) — Time to first token, TTFT (seconds)
127
  - Labels: `model_name`, `engine_type`
128
- `trtllm_time_per_output_token_seconds` (Histogram) — Time per output token, TPOT (seconds)
129
  - Labels: `model_name`, `engine_type`
130
- `trtllm_request_queue_time_seconds` (Histogram) — Time a request spends waiting in the queue (seconds)
131
132
133
134
135
136
  - Labels: `model_name`, `engine_type`

These metric names and availability are subject to change with TensorRT-LLM version updates.

TensorRT-LLM provides Prometheus metrics through the `MetricsCollector` class (see [tensorrt_llm/metrics/collector.py](https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/metrics/collector.py)).

137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
### Additional Operational Metrics

Dynamo adds the following operational metrics for TensorRT-LLM workers. These complement the engine's native metrics above with request-level observability that the engine does not provide. All metrics use the `trtllm_` prefix and are automatically enabled when `--publish-events-and-metrics` is set.

Metric name constants are defined in `lib/runtime/src/metrics/prometheus_names.rs` (`trtllm_additional` module).

#### Request Type Tracking

- `trtllm_request_type_image_total` (Counter) — Total number of requests containing image/multimodal content
  - Labels: `model_name`, `disaggregation_mode`, `engine_type`
- `trtllm_request_type_structured_output_total` (Counter) — Total number of requests using guided/structured decoding (JSON, regex, grammar, etc.)
  - Labels: `model_name`, `disaggregation_mode`, `engine_type`

#### Abort Tracking

- `trtllm_num_aborted_requests_total` (Counter) — Total number of aborted/cancelled requests
  - Labels: `model_name`, `disaggregation_mode`, `engine_type`

#### KV Cache Transfer Metrics (Disaggregated Deployments)

These metrics are only recorded in disaggregated (prefill + decode) deployments when a KV cache transfer actually occurs. They are sourced from TensorRT-LLM's `RequestPerfMetrics.timing_metrics`.

- `trtllm_kv_transfer_success_total` (Counter) — Total number of successful KV cache transfers (recorded on prefill side)
  - Labels: `model_name`, `disaggregation_mode`, `engine_type`
- `trtllm_kv_transfer_latency_seconds` (Histogram) — KV cache transfer latency per request in seconds
  - Labels: `model_name`, `disaggregation_mode`, `engine_type`
- `trtllm_kv_transfer_bytes` (Histogram) — KV cache transfer size per request in bytes
  - Labels: `model_name`, `disaggregation_mode`, `engine_type`
  - Buckets: 100KB, 500KB, 1MB, 5MB, 10MB, 50MB, 100MB, 500MB, 1GB, 5GB
- `trtllm_kv_transfer_speed_gb_s` (Histogram) — KV cache transfer speed per request in GB/s
  - Labels: `model_name`, `disaggregation_mode`, `engine_type`

169
170
171
172
173
## Non-Prometheus Performance Metrics

TensorRT-LLM provides extensive performance data beyond the basic Prometheus metrics. These are not currently exposed to Prometheus.

### Available via Code References
174
175
176
177
178

- **RequestPerfMetrics Structure**: [tensorrt_llm/executor/result.py](https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/executor/result.py) - KV cache, timing, speculative decoding metrics
- **Engine Statistics**: `engine.llm.get_stats_async()` - System-wide aggregate statistics
- **KV Cache Events**: `engine.llm.get_kv_cache_events_async()` - Real-time cache operations

179
180
### Example RequestPerfMetrics JSON Structure

181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
```json
{
  "timing_metrics": {
    "arrival_time": 1234567890.123,
    "first_scheduled_time": 1234567890.135,
    "first_token_time": 1234567890.150,
    "last_token_time": 1234567890.300,
    "kv_cache_size": 2048576,
    "kv_cache_transfer_start": 1234567890.140,
    "kv_cache_transfer_end": 1234567890.145
  },
  "kv_cache_metrics": {
    "num_total_allocated_blocks": 100,
    "num_new_allocated_blocks": 10,
    "num_reused_blocks": 90,
    "num_missed_blocks": 5
  },
  "speculative_decoding": {
    "acceptance_rate": 0.85,
    "total_accepted_draft_tokens": 42,
    "total_draft_tokens": 50
  }
}
```

206
207
208
209
210
**Note:** These structures are valid as of the date of this documentation but are subject to change with TensorRT-LLM version updates.

## Implementation Details

- **Prometheus Integration**: Uses the `MetricsCollector` class from `tensorrt_llm.metrics` (see [collector.py](https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/metrics/collector.py))
211
- **Dynamo Integration**: Uses `register_engine_metrics_callback()` function with `metric_prefix_filter=["trtllm_"]`
212
213
214
- **Engine Configuration**: `return_perf_metrics` set to `True` when `--publish-events-and-metrics` is enabled
- **Initialization**: Metrics appear after TensorRT-LLM engine initialization completes
- **Metadata**: `MetricsCollector` initialized with model metadata (model name, engine type)
215

216
## Related Documentation
217
218

### TensorRT-LLM Metrics
219
220
- See the [Non-Prometheus Performance Metrics](#non-prometheus-performance-metrics) section above for detailed performance data and source code references
- [TensorRT-LLM Metrics Collector](https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/metrics/collector.py) - Source code reference
221
222

### Dynamo Metrics
223
224
225
- [Dynamo Metrics Guide](../../observability/metrics.md) - Complete documentation on Dynamo runtime metrics
- [Prometheus and Grafana Setup](../../observability/prometheus-grafana.md) - Visualization setup instructions
- Dynamo runtime metrics (prefixed with `dynamo_*`) are available at the same `/metrics` endpoint alongside TensorRT-LLM metrics
226
227
  - Implementation: `lib/runtime/src/metrics.rs` (Rust runtime metrics)
  - Metric names: `lib/runtime/src/metrics/prometheus_names.rs` (metric name constants)
228
  - Integration code: `components/src/dynamo/common/utils/prometheus.py` - Prometheus utilities and callback registration