Commit f817c595 authored by Keiven C, committed by GitHub

docs: reorganize prometheus.md to be consistent with docs/observability/metrics.md (#4262)


Signed-off-by: Keiven Chang <keivenchang@users.noreply.github.com>
Co-authored-by: Keiven Chang <keivenchang@users.noreply.github.com>
parent d392bbdd
# SGLang Prometheus Metrics <!--
SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
**📚 Official Documentation**: [SGLang Production Metrics](https://docs.sglang.ai/references/production_metrics.html) SPDX-License-Identifier: Apache-2.0
-->
This document describes how SGLang Prometheus metrics are exposed in Dynamo. # SGLang Prometheus Metrics
## Overview
When running SGLang through Dynamo, SGLang engine metrics are automatically passed through and exposed on Dynamo's `/metrics` endpoint (default port 8081). This allows you to access both SGLang engine metrics (prefixed with `sglang:`) and Dynamo runtime metrics (prefixed with `dynamo_*`) from a single worker backend endpoint.
For the complete and authoritative list of all SGLang metrics, always refer to the official documentation linked above. **For the complete and authoritative list of all SGLang metrics**, always refer to the [official SGLang Production Metrics documentation](https://docs.sglang.ai/references/production_metrics.html).
Dynamo runtime metrics are documented in [docs/observability/metrics.md](../../observability/metrics.md).
## Metric Reference **For Dynamo runtime metrics**, see the [Dynamo Metrics Guide](../../observability/metrics.md).
The official documentation includes: **For visualization setup instructions**, see the [Prometheus and Grafana Setup Guide](../../observability/prometheus-grafana.md).
- Complete metric definitions with HELP and TYPE descriptions
- Example metric output in Prometheus exposition format
- Counter, Gauge, and Histogram metrics
- Metric labels (e.g., `model_name`, `engine_type`, `tp_rank`, `pp_rank`)
- Setup guide for Prometheus + Grafana monitoring
- Troubleshooting tips and configuration examples
## Metric Categories ## Environment Variables
SGLang provides metrics in the following categories (all prefixed with `sglang:`): | Variable | Description | Default | Example |
- Throughput metrics |----------|-------------|---------|---------|
- Resource usage | `DYN_SYSTEM_PORT` | System metrics/health port | `-1` (disabled) | `8081` |
- Latency metrics
- Disaggregation metrics (when enabled)
**Note:** Specific metrics are subject to change between SGLang versions. Always refer to the [official documentation](https://docs.sglang.ai/references/production_metrics.html) or inspect the `/metrics` endpoint for your SGLang version. ## Getting Started Quickly
## Enabling Metrics in Dynamo This is a single machine example.
SGLang metrics are automatically exposed when running SGLang through Dynamo with metrics enabled. ### Start Observability Stack
## Inspecting Metrics For visualizing metrics with Prometheus and Grafana, start the observability stack. See [Observability Getting Started](../../observability/README.md#getting-started-quickly) for instructions.
To see the actual metrics available in your SGLang version: ### Launch Dynamo Components
### 1. Launch SGLang with Metrics Enabled Launch a frontend and SGLang backend to test metrics:
```bash ```bash
# Set system metrics port (automatically enables metrics server) # Start frontend (default port 8000, override with --http-port or DYN_HTTP_PORT env var)
export DYN_SYSTEM_PORT=8081 $ python -m dynamo.frontend
# Start SGLang worker with metrics enabled # Enable system metrics server on port 8081
python -m dynamo.sglang --model <model_name> --enable-metrics $ DYN_SYSTEM_PORT=8081 python -m dynamo.sglang --model <model_name> --enable-metrics
# Wait for engine to initialize
``` ```
Metrics will be available at: `http://localhost:8081/metrics` Wait for the SGLang worker to start, then send requests and check metrics:
### 2. Fetch Metrics via curl
```bash ```bash
curl http://localhost:8081/metrics | grep "^sglang:" # Send a request
curl -H 'Content-Type: application/json' \
-d '{
"model": "<model_name>",
"max_completion_tokens": 100,
"messages": [{"role": "user", "content": "Hello"}]
}' \
http://localhost:8000/v1/chat/completions
# Check metrics from the worker
curl -s localhost:8081/metrics | grep "^sglang:"
``` ```
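The `curl ... | grep "^sglang:"` check can also be scripted. The following is a minimal sketch rather than part of Dynamo: `fetch_metrics_text` and `filter_by_prefix` are hypothetical helper names, and the default URL assumes the worker was started with `DYN_SYSTEM_PORT=8081` as in the commands above.

```python
from urllib.request import urlopen


def fetch_metrics_text(url: str = "http://localhost:8081/metrics", timeout: float = 5.0) -> str:
    """Download the raw Prometheus exposition text from a worker (hypothetical helper)."""
    with urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def filter_by_prefix(text: str, prefix: str = "sglang:") -> list[str]:
    """Keep only sample lines that start with `prefix`, like `grep "^sglang:"`.

    Comment lines (# HELP / # TYPE) start with "#", so they are dropped too.
    """
    return [line for line in text.splitlines() if line.startswith(prefix)]
```

Once the worker is running, `filter_by_prefix(fetch_metrics_text())` should return the same lines as the `curl` pipeline; passing `prefix="dynamo_"` selects the Dynamo runtime metrics from the same endpoint instead.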
### 3. Example Output ## Exposed Metrics
**Note:** The specific metrics shown below are examples and may vary depending on your SGLang version. Always inspect your actual `/metrics` endpoint for the current list. SGLang exposes metrics in Prometheus Exposition Format text at the `/metrics` HTTP endpoint. All SGLang engine metrics use the `sglang:` prefix and include labels (e.g., `model_name`, `engine_type`, `tp_rank`, `pp_rank`) to identify the source.
**Example Prometheus Exposition Format text:**
```
# HELP sglang:prompt_tokens_total Number of prefill tokens processed.
# TYPE sglang:prompt_tokens_total counter
sglang:prompt_tokens_total{model_name="meta-llama/Llama-3.1-8B-Instruct"} 8128902.0
# HELP sglang:generation_tokens_total Number of generation tokens processed.
# TYPE sglang:generation_tokens_total counter
sglang:generation_tokens_total{model_name="meta-llama/Llama-3.1-8B-Instruct"} 7557572.0
# HELP sglang:cache_hit_rate The cache hit rate
# TYPE sglang:cache_hit_rate gauge
sglang:cache_hit_rate{model_name="meta-llama/Llama-3.1-8B-Instruct"} 0.0075
```
**Note:** The specific metrics shown above are examples and may vary depending on your SGLang version. Always inspect your actual `/metrics` endpoint or refer to the [official documentation](https://docs.sglang.ai/references/production_metrics.html) for the current list.
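To work with individual samples from the exposition text, each line can be split into a metric name, a label set, and a value. The sketch below is a simplified illustration, not SGLang's or Dynamo's own parser: `parse_sample` is a hypothetical helper, and it assumes lines without trailing timestamps and label values without escaped quotes.

```python
import re

# Matches `name{labels} value` or `name value`. Escaped quotes inside label
# values and optional timestamps are not handled in this simplified sketch.
SAMPLE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_sample(line: str):
    """Split one exposition-format sample line into (name, labels, value)."""
    match = SAMPLE_RE.match(line.strip())
    if match is None:
        return None  # comment (# HELP / # TYPE), blank, or unsupported line
    labels = dict(LABEL_RE.findall(match.group("labels") or ""))
    return match.group("name"), labels, float(match.group("value"))


line = 'sglang:cache_hit_rate{model_name="meta-llama/Llama-3.1-8B-Instruct"} 0.0075'
print(parse_sample(line))
```

For anything beyond quick inspection, a maintained parser such as the one shipped with `prometheus_client` is likely a better choice than a hand-written regular expression.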
### Metric Categories
SGLang provides metrics in the following categories (all prefixed with `sglang:`):
- **Throughput metrics** - Token processing rates
- **Resource usage** - System resource consumption
- **Latency metrics** - Request and token latency measurements
- **Disaggregation metrics** - Metrics specific to disaggregated deployments (when enabled)
**Note:** Specific metrics are subject to change between SGLang versions. Always refer to the [official documentation](https://docs.sglang.ai/references/production_metrics.html) or inspect the `/metrics` endpoint for your SGLang version.
## Available Metrics
The official SGLang documentation includes complete metric definitions with:
- HELP and TYPE descriptions
- Counter, Gauge, and Histogram metric types
- Metric labels (e.g., `model_name`, `engine_type`, `tp_rank`, `pp_rank`)
- Setup guide for Prometheus + Grafana monitoring
- Troubleshooting tips and configuration examples
For the complete and authoritative list of all SGLang metrics, see the [official SGLang Production Metrics documentation](https://docs.sglang.ai/references/production_metrics.html).
## Implementation Details
- SGLang uses multiprocess metrics collection via `prometheus_client.multiprocess.MultiProcessCollector`
@@ -83,16 +108,16 @@ sglang:cache_hit_rate{model_name="meta-llama/Llama-3.1-8B-Instruct"} 0.0075
- The integration uses Dynamo's `register_engine_metrics_callback()` function
- Metrics appear after SGLang engine initialization completes
## See Also ## Related Documentation
### SGLang Metrics
- [Official SGLang Production Metrics](https://docs.sglang.ai/references/production_metrics.html)
- [SGLang GitHub - Metrics Collector](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/metrics/collector.py)
### Dynamo Metrics
- **Dynamo Metrics Guide**: See [docs/observability/metrics.md](../../observability/metrics.md) for complete documentation on Dynamo runtime metrics - [Dynamo Metrics Guide](../../observability/metrics.md) - Complete documentation on Dynamo runtime metrics
- **Dynamo Runtime Metrics**: Metrics prefixed with `dynamo_*` for runtime, components, endpoints, and namespaces - [Prometheus and Grafana Setup](../../observability/prometheus-grafana.md) - Visualization setup instructions
- Dynamo runtime metrics (prefixed with `dynamo_*`) are available at the same `/metrics` endpoint alongside SGLang metrics
- Implementation: `lib/runtime/src/metrics.rs` (Rust runtime metrics)
- Metric names: `lib/runtime/src/metrics/prometheus_names.rs` (metric name constants)
- Available at the same `/metrics` endpoint alongside SGLang metrics - Integration code: `components/src/dynamo/common/utils/prometheus.py` - Prometheus utilities and callback registration
- **Integration Code**: `components/src/dynamo/common/utils/prometheus.py` - Prometheus utilities and callback registration
# TensorRT-LLM Prometheus Metrics <!--
SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
-->
This document describes how TensorRT-LLM Prometheus metrics are exposed in Dynamo, as well as where to find non-Prometheus metrics. # TensorRT-LLM Prometheus Metrics
## Overview
When running TensorRT-LLM through Dynamo, TensorRT-LLM's Prometheus metrics are automatically passed through and exposed on Dynamo's `/metrics` endpoint (default port 8081). This allows you to access both TensorRT-LLM engine metrics (prefixed with `trtllm:`) and Dynamo runtime metrics (prefixed with `dynamo_*`) from a single worker backend endpoint.
Additional performance metrics are available via non-Prometheus APIs in the RequestPerfMetrics section below.
As of the date of this documentation, the included TensorRT-LLM version 1.1.0rc5 exposes **5 basic Prometheus metrics**. Note that the `trtllm:` prefix is added by Dynamo.
Dynamo runtime metrics are documented in [docs/observability/metrics.md](../../observability/metrics.md). Additional performance metrics are available via non-Prometheus APIs (see [Non-Prometheus Performance Metrics](#non-prometheus-performance-metrics) below).
## Metric Reference
TensorRT-LLM provides Prometheus metrics through the `MetricsCollector` class (see [tensorrt_llm/metrics/collector.py](https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/metrics/collector.py)), which includes: **For Dynamo runtime metrics**, see the [Dynamo Metrics Guide](../../observability/metrics.md).
- Counter and Histogram metrics
- Metric labels (e.g., `model_name`, `engine_type`, `finished_reason`) - note that TensorRT-LLM uses `model_name` instead of Dynamo's standard `model` label convention
### Current Prometheus Metrics (TensorRT-LLM 1.1.0rc5) **For visualization setup instructions**, see the [Prometheus and Grafana Setup Guide](../../observability/prometheus-grafana.md).
The following metrics are exposed via Dynamo's `/metrics` endpoint (with the `trtllm:` prefix added by Dynamo): ## Environment Variables
- `trtllm:request_success_total` (Counter) — Count of successfully processed requests by finish reason | Variable | Description | Default | Example |
- Labels: `model_name`, `engine_type`, `finished_reason` |----------|-------------|---------|---------|
- `trtllm:e2e_request_latency_seconds` (Histogram) — End-to-end request latency (seconds) | `DYN_SYSTEM_PORT` | System metrics/health port | `-1` (disabled) | `8081` |
- Labels: `model_name`, `engine_type`
- `trtllm:time_to_first_token_seconds` (Histogram) — Time to first token, TTFT (seconds)
- Labels: `model_name`, `engine_type`
- `trtllm:time_per_output_token_seconds` (Histogram) — Time per output token, TPOT (seconds)
- Labels: `model_name`, `engine_type`
- `trtllm:request_queue_time_seconds` (Histogram) — Time a request spends waiting in the queue (seconds)
- Labels: `model_name`, `engine_type`
These metric names and availability are subject to change with TensorRT-LLM version updates. ## Getting Started Quickly
## Metric Categories This is a single machine example.
TensorRT-LLM provides metrics in the following categories (all prefixed with `trtllm:`): ### Start Observability Stack
- Request metrics (latency, throughput)
- Performance metrics (TTFT, TPOT, queue time)
**Note:** Metrics may change between TensorRT-LLM versions. Always inspect the `/metrics` endpoint for your version. For visualizing metrics with Prometheus and Grafana, start the observability stack. See [Observability Getting Started](../../observability/README.md#getting-started-quickly) for instructions.
## Enabling Metrics in Dynamo ### Launch Dynamo Components
TensorRT-LLM Prometheus metrics are automatically exposed when running TensorRT-LLM through Dynamo with the `--publish-events-and-metrics` flag. Launch a frontend and TensorRT-LLM backend to test metrics:
### Required Configuration
```bash ```bash
python -m dynamo.trtllm --model <model_name> --publish-events-and-metrics # Start frontend (default port 8000, override with --http-port or DYN_HTTP_PORT env var)
``` $ python -m dynamo.frontend
### Backend Requirement # Enable system metrics server on port 8081 and enable metrics collection
- `backend`: Must be set to `"pytorch"` for metrics collection (enforced in `components/src/dynamo/trtllm/main.py`) $ DYN_SYSTEM_PORT=8081 python -m dynamo.trtllm --model <model_name> --publish-events-and-metrics
- TensorRT-LLM's `MetricsCollector` integration has only been tested/validated with the PyTorch backend ```
## Inspecting Metrics
To see the actual metrics available in your TensorRT-LLM version: **Note:** The `backend` must be set to `"pytorch"` for metrics collection (enforced in `components/src/dynamo/trtllm/main.py`). TensorRT-LLM's `MetricsCollector` integration has only been tested/validated with the PyTorch backend.
### 1. Launch TensorRT-LLM with Metrics Enabled Wait for the TensorRT-LLM worker to start, then send requests and check metrics:
```bash ```bash
# Set system metrics port (automatically enables metrics server) # Send a request
export DYN_SYSTEM_PORT=8081 curl -H 'Content-Type: application/json' \
-d '{
# Start TensorRT-LLM worker with metrics enabled "model": "<model_name>",
python -m dynamo.trtllm --model <model_name> --publish-events-and-metrics "max_completion_tokens": 100,
"messages": [{"role": "user", "content": "Hello"}]
# Wait for engine to initialize }' \
http://localhost:8000/v1/chat/completions
# Check metrics from the worker
curl -s localhost:8081/metrics | grep "^trtllm:"
``` ```
Metrics will be available at: `http://localhost:8081/metrics` ## Exposed Metrics
### 2. Fetch Metrics via curl TensorRT-LLM exposes metrics in Prometheus Exposition Format text at the `/metrics` HTTP endpoint. All TensorRT-LLM engine metrics use the `trtllm:` prefix and include labels (e.g., `model_name`, `engine_type`, `finished_reason`) to identify the source.
```bash
curl http://localhost:8081/metrics | grep "^trtllm:"
```
### 3. Example Output **Note:** TensorRT-LLM uses `model_name` instead of Dynamo's standard `model` label convention.
**Note:** The specific metrics shown below are examples and may vary depending on your TensorRT-LLM version. Always inspect your actual `/metrics` endpoint for the current list. **Example Prometheus Exposition Format text:**
```
# HELP trtllm:request_success_total Count of successfully processed requests.
@@ -102,38 +87,50 @@ trtllm:time_to_first_token_seconds_sum{model_name="Qwen/Qwen3-0.6B",engine_type=
trtllm:e2e_request_latency_seconds_bucket{le="0.5",model_name="Qwen/Qwen3-0.6B",engine_type="trtllm"} 25.0
trtllm:e2e_request_latency_seconds_count{model_name="Qwen/Qwen3-0.6B",engine_type="trtllm"} 150.0
trtllm:e2e_request_latency_seconds_sum{model_name="Qwen/Qwen3-0.6B",engine_type="trtllm"} 45.2
# HELP trtllm:time_per_output_token_seconds Histogram of time per output token in seconds.
# TYPE trtllm:time_per_output_token_seconds histogram
trtllm:time_per_output_token_seconds_bucket{le="0.1",model_name="Qwen/Qwen3-0.6B",engine_type="trtllm"} 120.0
trtllm:time_per_output_token_seconds_count{model_name="Qwen/Qwen3-0.6B",engine_type="trtllm"} 150.0
trtllm:time_per_output_token_seconds_sum{model_name="Qwen/Qwen3-0.6B",engine_type="trtllm"} 12.5
# HELP trtllm:request_queue_time_seconds Histogram of time spent in WAITING phase for request.
# TYPE trtllm:request_queue_time_seconds histogram
trtllm:request_queue_time_seconds_bucket{le="1.0",model_name="Qwen/Qwen3-0.6B",engine_type="trtllm"} 140.0
trtllm:request_queue_time_seconds_count{model_name="Qwen/Qwen3-0.6B",engine_type="trtllm"} 150.0
trtllm:request_queue_time_seconds_sum{model_name="Qwen/Qwen3-0.6B",engine_type="trtllm"} 32.1
```
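A Prometheus histogram exposes `_sum` and `_count` series alongside its buckets, so the mean observation is `_sum / _count`. The sketch below applies that to the end-to-end latency values from the example output. `histogram_mean` is a hypothetical helper, and it assumes a single label set per metric (with several series, the last one seen wins).

```python
def histogram_mean(lines, metric):
    """Return sum/count for a Prometheus histogram, or None if there are no observations."""
    total = count = None
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip # HELP / # TYPE comments and blank lines
        series, _, value = line.rpartition(" ")
        name = series.split("{", 1)[0]
        if name == f"{metric}_sum":
            total = float(value)
        elif name == f"{metric}_count":
            count = float(value)
    if total is None or not count:
        return None
    return total / count


sample = [
    'trtllm:e2e_request_latency_seconds_count{model_name="Qwen/Qwen3-0.6B",engine_type="trtllm"} 150.0',
    'trtllm:e2e_request_latency_seconds_sum{model_name="Qwen/Qwen3-0.6B",engine_type="trtllm"} 45.2',
]
mean = histogram_mean(sample, "trtllm:e2e_request_latency_seconds")
print(f"{mean:.3f} s")
```

With the example values (45.2 seconds over 150 requests), the mean end-to-end latency is about 0.30 seconds per request. This is an average since process start; rate-based queries in Prometheus are the usual way to get a windowed value.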
## Implementation Details **Note:** The specific metrics shown above are examples and may vary depending on your TensorRT-LLM version. Always inspect your actual `/metrics` endpoint for the current list.
- **Prometheus Integration**: Uses the `MetricsCollector` class from `tensorrt_llm.metrics` (see [collector.py](https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/metrics/collector.py)) ### Metric Categories
- **Dynamo Integration**: Uses `register_engine_metrics_callback()` function with `add_prefix="trtllm:"`
- **Engine Configuration**: `return_perf_metrics` set to `True` when `--publish-events-and-metrics` is enabled
- **Initialization**: Metrics appear after TensorRT-LLM engine initialization completes
- **Metadata**: `MetricsCollector` initialized with model metadata (model name, engine type)
## TensorRT-LLM Specific: Non-Prometheus Performance Metrics TensorRT-LLM provides metrics in the following categories (all prefixed with `trtllm:`):
- **Request metrics** - Request success tracking and latency measurements
- **Performance metrics** - Time to first token (TTFT), time per output token (TPOT), and queue time
**Note:** Metrics may change between TensorRT-LLM versions. Always inspect the `/metrics` endpoint for your version.
TensorRT-LLM provides extensive performance data beyond the basic Prometheus metrics. These are **not exposed to Prometheus**. ## Available Metrics
The following metrics are exposed via Dynamo's `/metrics` endpoint (with the `trtllm:` prefix added by Dynamo) for TensorRT-LLM version 1.1.0rc5:
- `trtllm:request_success_total` (Counter) — Count of successfully processed requests by finish reason
- Labels: `model_name`, `engine_type`, `finished_reason`
- `trtllm:e2e_request_latency_seconds` (Histogram) — End-to-end request latency (seconds)
- Labels: `model_name`, `engine_type`
- `trtllm:time_to_first_token_seconds` (Histogram) — Time to first token, TTFT (seconds)
- Labels: `model_name`, `engine_type`
- `trtllm:time_per_output_token_seconds` (Histogram) — Time per output token, TPOT (seconds)
- Labels: `model_name`, `engine_type`
- `trtllm:request_queue_time_seconds` (Histogram) — Time a request spends waiting in the queue (seconds)
- Labels: `model_name`, `engine_type`
These metric names and availability are subject to change with TensorRT-LLM version updates.
TensorRT-LLM provides Prometheus metrics through the `MetricsCollector` class (see [tensorrt_llm/metrics/collector.py](https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/metrics/collector.py)).
## Non-Prometheus Performance Metrics
TensorRT-LLM provides extensive performance data beyond the basic Prometheus metrics. These are not currently exposed to Prometheus.
### Available via Code References
### Available via Code References:
- **RequestPerfMetrics Structure**: [tensorrt_llm/executor/result.py](https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/executor/result.py) - KV cache, timing, speculative decoding metrics
- **Engine Statistics**: `engine.llm.get_stats_async()` - System-wide aggregate statistics
- **KV Cache Events**: `engine.llm.get_kv_cache_events_async()` - Real-time cache operations
### Example RequestPerfMetrics JSON Structure: ### Example RequestPerfMetrics JSON Structure
```json
{
  "timing_metrics": {
@@ -159,17 +156,26 @@ TensorRT-LLM provides extensive performance data beyond the basic Prometheus met
}
```
**Note**: These structures are valid as of the date of this documentation but are subject to change with TensorRT-LLM version updates. **Note:** These structures are valid as of the date of this documentation but are subject to change with TensorRT-LLM version updates.
## Implementation Details
- **Prometheus Integration**: Uses the `MetricsCollector` class from `tensorrt_llm.metrics` (see [collector.py](https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/metrics/collector.py))
- **Dynamo Integration**: Uses `register_engine_metrics_callback()` function with `add_prefix="trtllm:"`
- **Engine Configuration**: `return_perf_metrics` set to `True` when `--publish-events-and-metrics` is enabled
- **Initialization**: Metrics appear after TensorRT-LLM engine initialization completes
- **Metadata**: `MetricsCollector` initialized with model metadata (model name, engine type)
## See Also ## Related Documentation
### TensorRT-LLM Metrics
- See the "TensorRT-LLM Specific: Non-Prometheus Performance Metrics" section above for detailed performance data and source code references - See the [Non-Prometheus Performance Metrics](#non-prometheus-performance-metrics) section above for detailed performance data and source code references
- [TensorRT-LLM Metrics Collector](https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/metrics/collector.py) - Source code reference
### Dynamo Metrics
- **Dynamo Metrics Guide**: See [docs/observability/metrics.md](../../observability/metrics.md) for complete documentation on Dynamo runtime metrics - [Dynamo Metrics Guide](../../observability/metrics.md) - Complete documentation on Dynamo runtime metrics
- **Dynamo Runtime Metrics**: Metrics prefixed with `dynamo_*` for runtime, components, endpoints, and namespaces - [Prometheus and Grafana Setup](../../observability/prometheus-grafana.md) - Visualization setup instructions
- Dynamo runtime metrics (prefixed with `dynamo_*`) are available at the same `/metrics` endpoint alongside TensorRT-LLM metrics
- Implementation: `lib/runtime/src/metrics.rs` (Rust runtime metrics)
- Metric names: `lib/runtime/src/metrics/prometheus_names.rs` (metric name constants)
- Available at the same `/metrics` endpoint alongside TensorRT-LLM metrics - Integration code: `components/src/dynamo/common/utils/prometheus.py` - Prometheus utilities and callback registration
- **Integration Code**: `components/src/dynamo/common/utils/prometheus.py` - Prometheus utilities and callback registration
# vLLM Prometheus Metrics <!--
SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
**📚 Official Documentation**: [vLLM Metrics Design](https://docs.vllm.ai/en/latest/design/metrics.html) SPDX-License-Identifier: Apache-2.0
-->
This document describes how vLLM Prometheus metrics are exposed in Dynamo. # vLLM Prometheus Metrics
## Overview
When running vLLM through Dynamo, vLLM engine metrics are automatically passed through and exposed on Dynamo's `/metrics` endpoint (default port 8081). This allows you to access both vLLM engine metrics (prefixed with `vllm:`) and Dynamo runtime metrics (prefixed with `dynamo_*`) from a single worker backend endpoint.
For the complete and authoritative list of all vLLM metrics, always refer to the official documentation linked above. **For the complete and authoritative list of all vLLM metrics**, always refer to the [official vLLM Metrics Design documentation](https://docs.vllm.ai/en/latest/design/metrics.html).
Dynamo runtime metrics are documented in [docs/observability/metrics.md](../../observability/metrics.md). **For Dynamo runtime metrics**, see the [Dynamo Metrics Guide](../../observability/metrics.md).
## Metric Reference **For visualization setup instructions**, see the [Prometheus and Grafana Setup Guide](../../observability/prometheus-grafana.md).
The official documentation includes: ## Environment Variables
- Complete metric definitions with detailed explanations
- Counter, Gauge, and Histogram metrics
- Metric labels (e.g., `model_name`, `finished_reason`, `scheduling_event`)
- Design rationale and implementation details
- Information about v1 metrics migration
- Future work and deprecated metrics
## Metric Categories | Variable | Description | Default | Example |
|----------|-------------|---------|---------|
| `DYN_SYSTEM_PORT` | System metrics/health port | `-1` (disabled) | `8081` |
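Because `DYN_SYSTEM_PORT` defaults to `-1` (disabled), a monitoring script may want to check the variable before trying to scrape the worker. The following is a small sketch under stated assumptions: `metrics_url` is a hypothetical helper, the `localhost` host is illustrative, and treating every non-positive port as disabled is an assumption made here (the table only documents `-1`).

```python
import os


def metrics_url(env=None, host="localhost"):
    """Return the worker /metrics URL, or None when the system port is disabled."""
    env = os.environ if env is None else env
    # The table above lists -1 (disabled) as the default value.
    port = int(env.get("DYN_SYSTEM_PORT", "-1"))
    if port <= 0:
        # Assumption: any non-positive port means the metrics server is off.
        return None
    return f"http://{host}:{port}/metrics"
```

For example, `metrics_url({"DYN_SYSTEM_PORT": "8081"})` yields `http://localhost:8081/metrics`, while an unset variable yields `None`.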
vLLM provides metrics in the following categories (all prefixed with `vllm:`): ## Getting Started Quickly
- Request metrics
- Performance metrics
- Resource usage
- Scheduler metrics
- Disaggregation metrics (when enabled)
**Note:** Specific metrics are subject to change between vLLM versions. Always refer to the [official documentation](https://docs.vllm.ai/en/latest/design/metrics.html) or inspect the `/metrics` endpoint for your vLLM version. This is a single machine example.
## Enabling Metrics in Dynamo
vLLM metrics are automatically exposed when running vLLM through Dynamo with metrics enabled. ### Start Observability Stack
## Inspecting Metrics For visualizing metrics with Prometheus and Grafana, start the observability stack. See [Observability Getting Started](../../observability/README.md#getting-started-quickly) for instructions.
To see the actual metrics available in your vLLM version: ### Launch Dynamo Components
### 1. Launch vLLM with Metrics Enabled Launch a frontend and vLLM backend to test metrics:
```bash ```bash
# Set system metrics port (automatically enables metrics server) # Start frontend (default port 8000, override with --http-port or DYN_HTTP_PORT env var)
export DYN_SYSTEM_PORT=8081 $ python -m dynamo.frontend
# Start vLLM worker (metrics enabled by default via --disable-log-stats=false)
python -m dynamo.vllm --model <model_name>
# Wait for engine to initialize # Enable system metrics server on port 8081
$ DYN_SYSTEM_PORT=8081 python -m dynamo.vllm --model <model_name> \
--enforce-eager --no-enable-prefix-caching --max-num-seqs 3
``` ```
Metrics will be available at: `http://localhost:8081/metrics` Wait for the vLLM worker to start, then send requests and check metrics:
### 2. Fetch Metrics via curl
```bash ```bash
curl http://localhost:8081/metrics | grep "^vllm:" # Send a request
curl -H 'Content-Type: application/json' \
-d '{
"model": "<model_name>",
"max_completion_tokens": 100,
"messages": [{"role": "user", "content": "Hello"}]
}' \
http://localhost:8000/v1/chat/completions
# Check metrics from the worker
curl -s localhost:8081/metrics | grep "^vllm:"
``` ```
### 3. Example Output ## Exposed Metrics
vLLM exposes metrics in Prometheus Exposition Format text at the `/metrics` HTTP endpoint. All vLLM engine metrics use the `vllm:` prefix and include labels (e.g., `model_name`, `finished_reason`, `scheduling_event`) to identify the source.
**Note:** The specific metrics shown below are examples and may vary depending on your vLLM version. Always inspect your actual `/metrics` endpoint for the current list. **Example Prometheus Exposition Format text:**
```
# HELP vllm:request_success_total Number of successfully finished requests.
# TYPE vllm:request_success_total counter
vllm:request_success_total{finished_reason="length",model_name="meta-llama/Llama-3.1-8B"} 15.0
vllm:request_success_total{finished_reason="stop",model_name="meta-llama/Llama-3.1-8B"} 150.0
# HELP vllm:time_to_first_token_seconds Histogram of time to first token in seconds.
# TYPE vllm:time_to_first_token_seconds histogram
vllm:time_to_first_token_seconds_bucket{le="0.001",model_name="meta-llama/Llama-3.1-8B"} 0.0
@@ -78,6 +78,31 @@ vllm:time_to_first_token_seconds_count{model_name="meta-llama/Llama-3.1-8B"} 165
vllm:time_to_first_token_seconds_sum{model_name="meta-llama/Llama-3.1-8B"} 89.38
```
**Note:** The specific metrics shown above are examples and may vary depending on your vLLM version. Always inspect your actual `/metrics` endpoint or refer to the [official documentation](https://docs.vllm.ai/en/latest/design/metrics.html) for the current list.
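
For readers who want to inspect this output programmatically rather than with `grep`, the following is a minimal Python sketch, not code from Dynamo or vLLM. It parses exposition text with the standard library only, keeps samples whose names start with `vllm:`, and sums a counter across its label values. The names `Sample`, `parse_samples`, and `total` are hypothetical helpers introduced for illustration, the `example_other_metric` line is made up to show the prefix filter, and the regular expressions assume simple label values (no escaped quotes) and no trailing timestamps.

```python
import re
from typing import Dict, List, NamedTuple


class Sample(NamedTuple):
    """One parsed sample line (hypothetical helper type)."""
    name: str
    labels: Dict[str, str]
    value: float


# Matches lines such as: metric_name{key="value",...} 1.0
# Simplification: no escaped quotes in label values, no trailing timestamp.
_SAMPLE_RE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
_LABEL_RE = re.compile(r'([A-Za-z_][A-Za-z0-9_]*)="([^"]*)"')


def parse_samples(text: str, prefix: str = "vllm:") -> List[Sample]:
    """Return samples whose metric name starts with `prefix`."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        match = _SAMPLE_RE.match(line)
        if match is None or not match.group(1).startswith(prefix):
            continue
        labels = dict(_LABEL_RE.findall(match.group(2) or ""))
        samples.append(Sample(match.group(1), labels, float(match.group(3))))
    return samples


def total(samples: List[Sample], name: str) -> float:
    """Sum one metric across all of its label combinations."""
    return sum(s.value for s in samples if s.name == name)


# Sample lines copied from the example above, plus one made-up
# non-vLLM line to demonstrate the prefix filter.
EXAMPLE = """\
# HELP vllm:request_success_total Number of successfully finished requests.
# TYPE vllm:request_success_total counter
vllm:request_success_total{finished_reason="length",model_name="meta-llama/Llama-3.1-8B"} 15.0
vllm:request_success_total{finished_reason="stop",model_name="meta-llama/Llama-3.1-8B"} 150.0
vllm:time_to_first_token_seconds_bucket{le="0.001",model_name="meta-llama/Llama-3.1-8B"} 0.0
vllm:time_to_first_token_seconds_sum{model_name="meta-llama/Llama-3.1-8B"} 89.38
example_other_metric 1.0
"""

samples = parse_samples(EXAMPLE)
print(total(samples, "vllm:request_success_total"))  # 15.0 + 150.0
```

To run the same parsing against a live worker, the text could be fetched from `http://localhost:8081/metrics` (the port used in the commands in this guide) and passed to `parse_samples`. For anything beyond quick inspection, the parser shipped with the `prometheus_client` package is likely a better choice than a hand-written regular expression.
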
### Metric Categories
vLLM provides metrics in the following categories (all prefixed with `vllm:`):
- **Request metrics** - Request success, failure, and completion tracking
- **Performance metrics** - Latency, throughput, and timing measurements
- **Resource usage** - System resource consumption
- **Scheduler metrics** - Scheduling and queue management
- **Disaggregation metrics** - Metrics specific to disaggregated deployments (when enabled)
**Note:** Specific metrics are subject to change between vLLM versions. Always refer to the [official documentation](https://docs.vllm.ai/en/latest/design/metrics.html) or inspect the `/metrics` endpoint for your vLLM version.
## Available Metrics
The official vLLM documentation includes complete metric definitions with:
- Detailed explanations and design rationale
- Counter, Gauge, and Histogram metric types
- Metric labels (e.g., `model_name`, `finished_reason`, `scheduling_event`)
- Information about v1 metrics migration
- Future work and deprecated metrics
For the complete and authoritative list of all vLLM metrics, see the [official vLLM Metrics Design documentation](https://docs.vllm.ai/en/latest/design/metrics.html).
## Implementation Details
- vLLM v1 uses multiprocess metrics collection via `prometheus_client.multiprocess`
...@@ -87,7 +112,7 @@ vllm:time_to_first_token_seconds_sum{model_name="meta-llama/Llama-3.1-8B"} 89.38
- Metrics appear after vLLM engine initialization completes
- vLLM v1 metrics are different from v0 - see the [official documentation](https://docs.vllm.ai/en/latest/design/metrics.html) for migration details
## Related Documentation
### vLLM Metrics
- [Official vLLM Metrics Design Documentation](https://docs.vllm.ai/en/latest/design/metrics.html)
...@@ -95,9 +120,9 @@ vllm:time_to_first_token_seconds_sum{model_name="meta-llama/Llama-3.1-8B"} 89.38
- [vLLM GitHub - Metrics Implementation](https://github.com/vllm-project/vllm/tree/main/vllm/v1/metrics)
### Dynamo Metrics
- [Dynamo Metrics Guide](../../observability/metrics.md) - Complete documentation on Dynamo runtime metrics
- [Prometheus and Grafana Setup](../../observability/prometheus-grafana.md) - Visualization setup instructions
- Dynamo runtime metrics (prefixed with `dynamo_*`) are available at the same `/metrics` endpoint alongside vLLM metrics
- Implementation: `lib/runtime/src/metrics.rs` (Rust runtime metrics)
- Metric names: `lib/runtime/src/metrics/prometheus_names.rs` (metric name constants)
- Integration code: `components/src/dynamo/common/utils/prometheus.py` - Prometheus utilities and callback registration
...@@ -26,8 +26,8 @@ orchestration frameworks such as Kubernetes.
Enable health checks and query endpoints:
```bash
# Start your Dynamo components (default port 8000, override with --http-port or DYN_HTTP_PORT env var)
python -m dynamo.frontend &
# Enable system status server on port 8081
DYN_SYSTEM_PORT=8081 python -m dynamo.vllm --model Qwen/Qwen3-0.6B --enforce-eager &
...
...@@ -38,8 +38,8 @@ Enable structured JSONL logging:
export DYN_LOGGING_JSONL=true
export DYN_LOG=debug
# Start your Dynamo components (default port 8000, override with --http-port or DYN_HTTP_PORT env var)
python -m dynamo.frontend &
python -m dynamo.vllm --model Qwen/Qwen3-0.6B --enforce-eager &
```
...@@ -109,8 +109,8 @@ To see trace information in logs:
export DYN_LOGGING_JSONL=true
export DYN_LOG=debug # Set to debug to see detailed trace logs
# Start your Dynamo components (e.g., frontend and worker) (default port 8000, override with --http-port or DYN_HTTP_PORT env var)
python -m dynamo.frontend &
python -m dynamo.vllm --model Qwen/Qwen3-0.6B --enforce-eager &
```
...
...@@ -33,7 +33,8 @@ For visualizing metrics with Prometheus and Grafana, start the observability sta
Launch a frontend and vLLM backend to test metrics:
```bash
# Start frontend (default port 8000, override with --http-port or DYN_HTTP_PORT env var)
$ python -m dynamo.frontend
# Enable system metrics server on port 8081
$ DYN_SYSTEM_PORT=8081 python -m dynamo.vllm --model Qwen/Qwen3-0.6B \
...
...@@ -36,8 +36,8 @@ Start the observability stack (Prometheus, Grafana, Tempo, exporters). See [Obse
Start frontend and worker (a simple single GPU example):
```bash
# Start frontend (default port 8000, override with --http-port or DYN_HTTP_PORT env var)
python -m dynamo.frontend &
# Start vLLM worker with metrics enabled on port 8081
DYN_SYSTEM_PORT=8081 python -m dynamo.vllm --model Qwen/Qwen3-0.6B --enforce-eager
...
...@@ -46,9 +46,9 @@ export OTEL_EXPORTER_OTLP_TRACES_ENDPOINT=http://localhost:4317
For a simple single-GPU deployment, start the frontend and a single vLLM worker:
```bash
# Start the frontend with tracing enabled (default port 8000, override with --http-port or DYN_HTTP_PORT env var)
export OTEL_SERVICE_NAME=dynamo-frontend
python -m dynamo.frontend --router-mode kv &
# Start a single vLLM worker (aggregated prefill and decode)
export OTEL_SERVICE_NAME=dynamo-worker-vllm
...@@ -83,9 +83,9 @@ export DYN_LOGGING_JSONL=true
export OTEL_EXPORT_ENABLED=true
export OTEL_EXPORTER_OTLP_TRACES_ENDPOINT=http://localhost:4317
# Run frontend (default port 8000, override with --http-port or DYN_HTTP_PORT env var)
export OTEL_SERVICE_NAME=dynamo-frontend
python -m dynamo.frontend --router-mode kv &
# Run decode worker, make sure to wait for start up
export OTEL_SERVICE_NAME=dynamo-worker-decode
...