@@ -5,17 +5,17 @@ Ensure the v1 LLM Engine exposes a superset of the metrics available in v0.
## Objectives
- Achieve parity of metrics between v0 and v1.
- The priority use case is accessing these metrics via Prometheus, as this is what we expect to be used in production environments.
- Logging support (i.e. printing metrics to the info log) is provided for more ad-hoc testing, debugging, development, and exploratory use cases.
## Background
Metrics in vLLM can be categorized as follows:
1. Server-level metrics: Global metrics that track the state and performance of the LLM engine. These are typically exposed as Gauges or Counters in Prometheus.
2. Request-level metrics: Metrics that track the characteristics (e.g. size and timing) of individual requests. These are typically exposed as Histograms in Prometheus and are often the SLOs that an SRE monitoring vLLM will be tracking.
The mental model is that server-level metrics help explain the values of request-level metrics.
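To make the distinction between the metric types concrete, here is a minimal, standard-library-only sketch of how Counters, Gauges, and Histograms behave. The classes and variable names (`finished_requests`, `running_requests`, `request_latency`) are hypothetical and exist only for illustration; this is a toy model of the semantics, not the vLLM implementation or the Prometheus client API.

```python
import bisect
import math


class Counter:
    """Monotonically increasing value, e.g. total finished requests."""

    def __init__(self):
        self.value = 0.0

    def inc(self, amount=1.0):
        if amount < 0:
            raise ValueError("counters can only increase")
        self.value += amount


class Gauge:
    """Value that can go up or down, e.g. requests currently running."""

    def __init__(self):
        self.value = 0.0

    def set(self, value):
        self.value = value


class Histogram:
    """Cumulative bucket counts plus a running sum and count."""

    def __init__(self, upper_bounds):
        # A final +Inf bucket always catches every observation.
        self.upper_bounds = sorted(upper_bounds) + [math.inf]
        self.bucket_counts = [0] * len(self.upper_bounds)
        self.sum = 0.0
        self.count = 0

    def observe(self, value):
        # Index of the first bucket whose upper bound is >= value.
        first = bisect.bisect_left(self.upper_bounds, value)
        # Buckets are cumulative: every bucket from `first` onward is bumped.
        for i in range(first, len(self.bucket_counts)):
            self.bucket_counts[i] += 1
        self.sum += value
        self.count += 1


# Server-level metrics (hypothetical names).
finished_requests = Counter()
running_requests = Gauge()
# Request-level metric (hypothetical name and bucket boundaries, in seconds).
request_latency = Histogram([0.5, 1.0, 2.5])

running_requests.set(2)
for seconds in (0.3, 1.0, 3.0):
    request_latency.observe(seconds)
    finished_requests.inc()
running_requests.set(0)
```

In this sketch the gauge and counter describe the state of the server as a whole, while the histogram records one observation per request, which is what makes it suitable for latency SLOs.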
### v0 Metrics
...
@@ -65,20 +65,20 @@ vLLM also provides [a reference example](../../examples/online_serving/prometheu
The subset of metrics exposed in the Grafana dashboard gives us an indication of which metrics are especially important:
- `vllm:e2e_request_latency_seconds_bucket` - End-to-end request latency measured in seconds.
- `vllm:request_success_total` - Number of finished requests by their finish reason: either an EOS token was generated or the max sequence length was reached.
- `vllm:request_queue_time_seconds` - Queue time.
- `vllm:request_prefill_time_seconds` - Request prefill time.
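A `_bucket` series such as `vllm:e2e_request_latency_seconds_bucket` holds cumulative counts per upper bound, and dashboards typically turn these into latency percentiles. The sketch below shows one way to estimate a quantile from such buckets by linear interpolation, similar in spirit to what PromQL's `histogram_quantile` does. The function name `estimate_quantile` and the sample bucket values are assumptions made up for illustration, not data from vLLM.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    `buckets` is a list of (upper_bound, cumulative_count) pairs sorted by
    upper bound, where the last upper bound is +Inf.
    """
    if not buckets or not 0.0 <= q <= 1.0:
        raise ValueError("need buckets and a quantile in [0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The quantile falls in the +Inf bucket: report the highest
                # finite upper bound, since nothing more precise is known.
                return prev_bound
            in_bucket = count - prev_count
            if in_bucket == 0:
                return prev_bound
            # Assume observations are spread evenly within the bucket.
            fraction = (rank - prev_count) / in_bucket
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical cumulative bucket counts, upper bounds in seconds.
sample_buckets = [(0.5, 10), (1.0, 30), (2.5, 90), (math.inf, 100)]
median_latency = estimate_quantile(0.5, sample_buckets)
p95_latency = estimate_quantile(0.95, sample_buckets)
```

The estimate is only as precise as the bucket boundaries: a quantile that lands in the `+Inf` bucket can only be reported as the largest finite bound.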
- [Benchmarking LLM Workloads for Performance Evaluation and Autoscaling in Kubernetes](https://docs.google.com/document/d/1k4Q4X14hW4vftElIuYGDu5KDe2LtV1XammoG-Xi3bbQ)