# Production Metrics

vLLM exposes a number of metrics that can be used to monitor the health of the
system. These metrics are exposed via the `/metrics` endpoint on the vLLM
OpenAI-compatible API server.

You can start the server using Python, or using [Docker](../deployment/docker.md):

```bash
vllm serve unsloth/Llama-3.2-1B-Instruct
```

Then query the endpoint to get the latest metrics from the server:

??? console "Output"

    ```console
    $ curl http://0.0.0.0:8000/metrics

    # HELP vllm:iteration_tokens_total Histogram of number of tokens per engine_step.
    # TYPE vllm:iteration_tokens_total histogram
    vllm:iteration_tokens_total_sum{model_name="unsloth/Llama-3.2-1B-Instruct"} 0.0
    vllm:iteration_tokens_total_bucket{le="1.0",model_name="unsloth/Llama-3.2-1B-Instruct"} 3.0
    vllm:iteration_tokens_total_bucket{le="8.0",model_name="unsloth/Llama-3.2-1B-Instruct"} 3.0
    vllm:iteration_tokens_total_bucket{le="16.0",model_name="unsloth/Llama-3.2-1B-Instruct"} 3.0
    vllm:iteration_tokens_total_bucket{le="32.0",model_name="unsloth/Llama-3.2-1B-Instruct"} 3.0
    vllm:iteration_tokens_total_bucket{le="64.0",model_name="unsloth/Llama-3.2-1B-Instruct"} 3.0
    vllm:iteration_tokens_total_bucket{le="128.0",model_name="unsloth/Llama-3.2-1B-Instruct"} 3.0
    vllm:iteration_tokens_total_bucket{le="256.0",model_name="unsloth/Llama-3.2-1B-Instruct"} 3.0
    vllm:iteration_tokens_total_bucket{le="512.0",model_name="unsloth/Llama-3.2-1B-Instruct"} 3.0
    ...
    ```
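
The same endpoint can also be read programmatically. The following is a minimal sketch, not part of vLLM: `parse_metrics` and `fetch_metrics` are hypothetical helper names, the URL is the one used in the `curl` call above, and the parser handles only plain sample lines like those in the output (no timestamps or exemplars).

```python
import re
from urllib.request import urlopen

# One sample line: metric name, optional {labels}, then the value.
_SAMPLE_RE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
# One label pair inside the braces: key="value".
_LABEL_RE = re.compile(r'([A-Za-z_][A-Za-z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Parse Prometheus text-format samples into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = _SAMPLE_RE.match(line)
        if match is None:
            continue  # skip anything this simple parser does not understand
        name, raw_labels, raw_value = match.groups()
        labels = dict(_LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(raw_value)))
    return samples


def fetch_metrics(url="http://0.0.0.0:8000/metrics"):
    """Download and parse the metrics page (assumes a server is running)."""
    with urlopen(url, timeout=5) as response:
        return parse_metrics(response.read().decode("utf-8"))
```

With a server running, `fetch_metrics()` returns one tuple per sample, so a histogram's buckets can be selected by filtering on the metric name and reading the `le` label. For anything beyond a quick check, an established parser such as the one shipped with the `prometheus_client` package is likely a better choice.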

The following metrics are exposed:

## General Metrics

--8<-- "docs/generated/metrics/general.md"

## Speculative Decoding Metrics

--8<-- "docs/generated/metrics/spec_decode.md"

## NIXL KV Connector Metrics

--8<-- "docs/generated/metrics/nixl_connector.md"

## Deprecation Policy

Note: when a metric is deprecated in version `X.Y`, it is hidden in version `X.Y+1`,
where it can be re-enabled using the `--show-hidden-metrics-for-version=X.Y` escape hatch,
and it is removed in version `X.Y+2`.
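
The lifecycle in this note can be sketched as a small function. This is only an illustration of the rule as worded above, not vLLM code: `metric_status` and its status strings are hypothetical, versions are `(major, minor)` tuples, and the sketch assumes both versions share the same major number.

```python
def metric_status(deprecated_in, current, show_hidden_for=None):
    """Classify a metric deprecated in `deprecated_in` as seen from `current`.

    `show_hidden_for` stands in for the value passed to
    --show-hidden-metrics-for-version (None when the flag is not set).
    """
    if current[0] != deprecated_in[0]:
        raise ValueError("this sketch only compares versions within one major series")

    # Number of minor releases since the deprecation.
    age = current[1] - deprecated_in[1]
    if age < 0:
        return "active"       # not yet deprecated
    if age == 0:
        return "deprecated"   # version X.Y: still exposed
    if age == 1:
        # Version X.Y+1: hidden unless the escape hatch names X.Y.
        return "re-enabled" if show_hidden_for == deprecated_in else "hidden"
    return "removed"          # version X.Y+2 and later
```

For example, a metric deprecated in `(0, 7)` is reported as `"hidden"` in `(0, 8)`, as `"re-enabled"` in `(0, 8)` when `show_hidden_for=(0, 7)`, and as `"removed"` in `(0, 9)` regardless of the flag.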