Commit f17aec0d (unverified) authored by Reid, committed by GitHub

[doc] Fold long code blocks to improve readability (#19926)


Signed-off-by: reidliu41 <reid201711@gmail.com>
Co-authored-by: reidliu41 <reid201711@gmail.com>
parent 493c2753
...@@ -90,6 +90,8 @@ Currently, there are no pre-built ROCm wheels. ...@@ -90,6 +90,8 @@ Currently, there are no pre-built ROCm wheels.
4. Build vLLM. For example, vLLM on ROCM 6.3 can be built with the following steps: 4. Build vLLM. For example, vLLM on ROCM 6.3 can be built with the following steps:
??? Commands
```bash ```bash
pip install --upgrade pip pip install --upgrade pip
...@@ -201,8 +203,10 @@ DOCKER_BUILDKIT=1 docker build \ ...@@ -201,8 +203,10 @@ DOCKER_BUILDKIT=1 docker build \
To run the above docker image `vllm-rocm`, use the below command: To run the above docker image `vllm-rocm`, use the below command:
```console ??? Command
docker run -it \
```console
docker run -it \
--network=host \ --network=host \
--group-add=video \ --group-add=video \
--ipc=host \ --ipc=host \
...@@ -213,7 +217,7 @@ docker run -it \ ...@@ -213,7 +217,7 @@ docker run -it \
-v <path/to/model>:/app/model \ -v <path/to/model>:/app/model \
vllm-rocm \ vllm-rocm \
bash bash
``` ```
Where the `<path/to/model>` is the location where the model is stored, for example, the weights for llama2 or llama3 models. Where the `<path/to/model>` is the location where the model is stored, for example, the weights for llama2 or llama3 models.
......
...@@ -200,7 +200,7 @@ INFO 08-01 21:37:59 hpu_model_runner.py:509] Generated 48 decode buckets: [(1, 1 ...@@ -200,7 +200,7 @@ INFO 08-01 21:37:59 hpu_model_runner.py:509] Generated 48 decode buckets: [(1, 1
`min` determines the lowest value of the bucket. `step` determines the interval between buckets, and `max` determines the upper bound of the bucket. Furthermore, interval between `min` and `step` has special handling -- `min` gets multiplied by consecutive powers of two, until `step` gets reached. We call this the ramp-up phase and it is used for handling lower batch sizes with minimum wastage, while allowing larger padding on larger batch sizes. `min` determines the lowest value of the bucket. `step` determines the interval between buckets, and `max` determines the upper bound of the bucket. Furthermore, interval between `min` and `step` has special handling -- `min` gets multiplied by consecutive powers of two, until `step` gets reached. We call this the ramp-up phase and it is used for handling lower batch sizes with minimum wastage, while allowing larger padding on larger batch sizes.
Example (with ramp-up) Example (with ramp-up):
```text ```text
min = 2, step = 32, max = 64 min = 2, step = 32, max = 64
...@@ -209,7 +209,7 @@ min = 2, step = 32, max = 64 ...@@ -209,7 +209,7 @@ min = 2, step = 32, max = 64
=> buckets = ramp_up + stable => (2, 4, 8, 16, 32, 64) => buckets = ramp_up + stable => (2, 4, 8, 16, 32, 64)
``` ```
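The bucket rule described above can be sketched in a few lines of Python. This is a minimal illustration, not vLLM's actual implementation: the function name `warmup_buckets` is hypothetical, and stopping the ramp-up at `max` as well as at `step` is an assumption.

```python
def warmup_buckets(bmin: int, bstep: int, bmax: int) -> list:
    """Return bucket sizes: a power-of-two ramp-up from bmin, then multiples of bstep."""
    if bmin <= 0 or bstep <= 0:
        raise ValueError("bmin and bstep must be positive")
    buckets = []
    value = bmin
    # Ramp-up phase: keep doubling bmin until bstep is reached
    # (assumption: the ramp-up also never exceeds bmax).
    while value < bstep and value <= bmax:
        buckets.append(value)
        value *= 2
    # Stable phase: evenly spaced buckets from bstep up to and including bmax.
    buckets.extend(range(bstep, bmax + 1, bstep))
    return buckets


print(warmup_buckets(2, 32, 64))      # [2, 4, 8, 16, 32, 64]
print(warmup_buckets(128, 128, 512))  # [128, 256, 384, 512]
```

When `min` equals `step`, the ramp-up loop never runs, so only the evenly spaced stable buckets remain.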
Example (without ramp-up) Example (without ramp-up):
```text ```text
min = 128, step = 128, max = 512 min = 128, step = 128, max = 512
...@@ -232,19 +232,21 @@ As an example, if a request of 3 sequences, with max sequence length of 412 come ...@@ -232,19 +232,21 @@ As an example, if a request of 3 sequences, with max sequence length of 412 come
Warmup is an optional, but highly recommended step occurring before vLLM server starts listening. It executes a forward pass for each bucket with dummy data. The goal is to pre-compile all graphs and not incur any graph compilation overheads within bucket boundaries during server runtime. Each warmup step is logged during vLLM startup: Warmup is an optional, but highly recommended step occurring before vLLM server starts listening. It executes a forward pass for each bucket with dummy data. The goal is to pre-compile all graphs and not incur any graph compilation overheads within bucket boundaries during server runtime. Each warmup step is logged during vLLM startup:
```text ??? Logs
INFO 08-01 22:26:47 hpu_model_runner.py:1066] [Warmup][Prompt][1/24] batch_size:4 seq_len:1024 free_mem:79.16 GiB
INFO 08-01 22:26:47 hpu_model_runner.py:1066] [Warmup][Prompt][2/24] batch_size:4 seq_len:896 free_mem:55.43 GiB ```text
INFO 08-01 22:26:48 hpu_model_runner.py:1066] [Warmup][Prompt][3/24] batch_size:4 seq_len:768 free_mem:55.43 GiB INFO 08-01 22:26:47 hpu_model_runner.py:1066] [Warmup][Prompt][1/24] batch_size:4 seq_len:1024 free_mem:79.16 GiB
... INFO 08-01 22:26:47 hpu_model_runner.py:1066] [Warmup][Prompt][2/24] batch_size:4 seq_len:896 free_mem:55.43 GiB
INFO 08-01 22:26:59 hpu_model_runner.py:1066] [Warmup][Prompt][24/24] batch_size:1 seq_len:128 free_mem:55.43 GiB INFO 08-01 22:26:48 hpu_model_runner.py:1066] [Warmup][Prompt][3/24] batch_size:4 seq_len:768 free_mem:55.43 GiB
INFO 08-01 22:27:00 hpu_model_runner.py:1066] [Warmup][Decode][1/48] batch_size:4 seq_len:2048 free_mem:55.43 GiB ...
INFO 08-01 22:27:00 hpu_model_runner.py:1066] [Warmup][Decode][2/48] batch_size:4 seq_len:1920 free_mem:55.43 GiB INFO 08-01 22:26:59 hpu_model_runner.py:1066] [Warmup][Prompt][24/24] batch_size:1 seq_len:128 free_mem:55.43 GiB
INFO 08-01 22:27:01 hpu_model_runner.py:1066] [Warmup][Decode][3/48] batch_size:4 seq_len:1792 free_mem:55.43 GiB INFO 08-01 22:27:00 hpu_model_runner.py:1066] [Warmup][Decode][1/48] batch_size:4 seq_len:2048 free_mem:55.43 GiB
... INFO 08-01 22:27:00 hpu_model_runner.py:1066] [Warmup][Decode][2/48] batch_size:4 seq_len:1920 free_mem:55.43 GiB
INFO 08-01 22:27:16 hpu_model_runner.py:1066] [Warmup][Decode][47/48] batch_size:2 seq_len:128 free_mem:55.43 GiB INFO 08-01 22:27:01 hpu_model_runner.py:1066] [Warmup][Decode][3/48] batch_size:4 seq_len:1792 free_mem:55.43 GiB
INFO 08-01 22:27:16 hpu_model_runner.py:1066] [Warmup][Decode][48/48] batch_size:1 seq_len:128 free_mem:55.43 GiB ...
``` INFO 08-01 22:27:16 hpu_model_runner.py:1066] [Warmup][Decode][47/48] batch_size:2 seq_len:128 free_mem:55.43 GiB
INFO 08-01 22:27:16 hpu_model_runner.py:1066] [Warmup][Decode][48/48] batch_size:1 seq_len:128 free_mem:55.43 GiB
```
This example uses the same buckets as in the [Bucketing Mechanism][gaudi-bucketing-mechanism] section. Each output line corresponds to execution of a single bucket. When bucket is executed for the first time, its graph is compiled and can be reused later on, skipping further graph compilations. This example uses the same buckets as in the [Bucketing Mechanism][gaudi-bucketing-mechanism] section. Each output line corresponds to execution of a single bucket. When bucket is executed for the first time, its graph is compiled and can be reused later on, skipping further graph compilations.
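Because each warmup line describes exactly one bucket, the log can be turned into structured records for inspection. The following is a rough sketch that assumes the line layout shown in the log excerpt above, including the `GiB` unit. The names `WARMUP_RE` and `parse_warmup_line` are hypothetical helpers, not part of vLLM.

```python
import re
from typing import Optional

# Matches the bucket part of a warmup line, for example:
# "[Warmup][Prompt][1/24] batch_size:4 seq_len:1024 free_mem:79.16 GiB"
WARMUP_RE = re.compile(
    r"\[Warmup\]\[(?P<phase>[^\]]+)\]\[(?P<index>\d+)/(?P<total>\d+)\]\s+"
    r"batch_size:(?P<batch_size>\d+)\s+seq_len:(?P<seq_len>\d+)\s+"
    r"free_mem:(?P<free_mem>[\d.]+) GiB"
)


def parse_warmup_line(line: str) -> Optional[dict]:
    """Extract phase, progress, bucket and free memory from one warmup log line."""
    match = WARMUP_RE.search(line)
    if match is None:
        return None  # Not a warmup line, or a layout this sketch does not cover.
    return {
        "phase": match["phase"],
        "index": int(match["index"]),
        "total": int(match["total"]),
        "bucket": (int(match["batch_size"]), int(match["seq_len"])),
        "free_mem_gib": float(match["free_mem"]),
    }


sample = (
    "INFO 08-01 22:26:47 hpu_model_runner.py:1066] "
    "[Warmup][Prompt][1/24] batch_size:4 seq_len:1024 free_mem:79.16 GiB"
)
print(parse_warmup_line(sample))
```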
...@@ -279,37 +281,39 @@ When there's large amount of requests pending, vLLM scheduler will attempt to fi ...@@ -279,37 +281,39 @@ When there's large amount of requests pending, vLLM scheduler will attempt to fi
Each described step is logged by vLLM server, as follows (negative values correspond to memory being released): Each described step is logged by vLLM server, as follows (negative values correspond to memory being released):
```text ??? Logs
INFO 08-02 17:37:44 hpu_model_runner.py:493] Prompt bucket config (min, step, max_warmup) bs:[1, 32, 4], seq:[128, 128, 1024]
INFO 08-02 17:37:44 hpu_model_runner.py:499] Generated 24 prompt buckets: [(1, 128), (1, 256), (1, 384), (1, 512), (1, 640), (1, 768), (1, 896), (1, 1024), (2, 128), (2, 256), (2, 384), (2, 512), (2, 640), (2, 768), (2, 896), (2, 1024), (4, 128), (4, 256), (4, 384), (4, 512), (4, 640), (4, 768), (4, 896), (4, 1024)] ```text
INFO 08-02 17:37:44 hpu_model_runner.py:504] Decode bucket config (min, step, max_warmup) bs:[1, 128, 4], seq:[128, 128, 2048] INFO 08-02 17:37:44 hpu_model_runner.py:493] Prompt bucket config (min, step, max_warmup) bs:[1, 32, 4], seq:[128, 128, 1024]
INFO 08-02 17:37:44 hpu_model_runner.py:509] Generated 48 decode buckets: [(1, 128), (1, 256), (1, 384), (1, 512), (1, 640), (1, 768), (1, 896), (1, 1024), (1, 1152), (1, 1280), (1, 1408), (1, 1536), (1, 1664), (1, 1792), (1, 1920), (1, 2048), (2, 128), (2, 256), (2, 384), (2, 512), (2, 640), (2, 768), (2, 896), (2, 1024), (2, 1152), (2, 1280), (2, 1408), (2, 1536), (2, 1664), (2, 1792), (2, 1920), (2, 2048), (4, 128), (4, 256), (4, 384), (4, 512), (4, 640), (4, 768), (4, 896), (4, 1024), (4, 1152), (4, 1280), (4, 1408), (4, 1536), (4, 1664), (4, 1792), (4, 1920), (4, 2048)] INFO 08-02 17:37:44 hpu_model_runner.py:499] Generated 24 prompt buckets: [(1, 128), (1, 256), (1, 384), (1, 512), (1, 640), (1, 768), (1, 896), (1, 1024), (2, 128), (2, 256), (2, 384), (2, 512), (2, 640), (2, 768), (2, 896), (2, 1024), (4, 128), (4, 256), (4, 384), (4, 512), (4, 640), (4, 768), (4, 896), (4, 1024)]
INFO 08-02 17:37:52 hpu_model_runner.py:430] Pre-loading model weights on hpu:0 took 14.97 GiB of device memory (14.97 GiB/94.62 GiB used) and 2.95 GiB of host memory (475.2 GiB/1007 GiB used) INFO 08-02 17:37:44 hpu_model_runner.py:504] Decode bucket config (min, step, max_warmup) bs:[1, 128, 4], seq:[128, 128, 2048]
INFO 08-02 17:37:52 hpu_model_runner.py:438] Wrapping in HPU Graph took 0 B of device memory (14.97 GiB/94.62 GiB used) and -252 KiB of host memory (475.2 GiB/1007 GiB used) INFO 08-02 17:37:44 hpu_model_runner.py:509] Generated 48 decode buckets: [(1, 128), (1, 256), (1, 384), (1, 512), (1, 640), (1, 768), (1, 896), (1, 1024), (1, 1152), (1, 1280), (1, 1408), (1, 1536), (1, 1664), (1, 1792), (1, 1920), (1, 2048), (2, 128), (2, 256), (2, 384), (2, 512), (2, 640), (2, 768), (2, 896), (2, 1024), (2, 1152), (2, 1280), (2, 1408), (2, 1536), (2, 1664), (2, 1792), (2, 1920), (2, 2048), (4, 128), (4, 256), (4, 384), (4, 512), (4, 640), (4, 768), (4, 896), (4, 1024), (4, 1152), (4, 1280), (4, 1408), (4, 1536), (4, 1664), (4, 1792), (4, 1920), (4, 2048)]
INFO 08-02 17:37:52 hpu_model_runner.py:442] Loading model weights took in total 14.97 GiB of device memory (14.97 GiB/94.62 GiB used) and 2.95 GiB of host memory (475.2 GiB/1007 GiB used) INFO 08-02 17:37:52 hpu_model_runner.py:430] Pre-loading model weights on hpu:0 took 14.97 GiB of device memory (14.97 GiB/94.62 GiB used) and 2.95 GiB of host memory (475.2 GiB/1007 GiB used)
INFO 08-02 17:37:54 hpu_worker.py:134] Model profiling run took 504 MiB of device memory (15.46 GiB/94.62 GiB used) and 180.9 MiB of host memory (475.4 GiB/1007 GiB used) INFO 08-02 17:37:52 hpu_model_runner.py:438] Wrapping in HPU Graph took 0 B of device memory (14.97 GiB/94.62 GiB used) and -252 KiB of host memory (475.2 GiB/1007 GiB used)
INFO 08-02 17:37:54 hpu_worker.py:158] Free device memory: 79.16 GiB, 39.58 GiB usable (gpu_memory_utilization=0.5), 15.83 GiB reserved for HPUGraphs (VLLM_GRAPH_RESERVED_MEM=0.4), 23.75 GiB reserved for KV cache INFO 08-02 17:37:52 hpu_model_runner.py:442] Loading model weights took in total 14.97 GiB of device memory (14.97 GiB/94.62 GiB used) and 2.95 GiB of host memory (475.2 GiB/1007 GiB used)
INFO 08-02 17:37:54 hpu_executor.py:85] # HPU blocks: 1519, # CPU blocks: 0 INFO 08-02 17:37:54 hpu_worker.py:134] Model profiling run took 504 MiB of device memory (15.46 GiB/94.62 GiB used) and 180.9 MiB of host memory (475.4 GiB/1007 GiB used)
INFO 08-02 17:37:54 hpu_worker.py:190] Initializing cache engine took 23.73 GiB of device memory (39.2 GiB/94.62 GiB used) and -1.238 MiB of host memory (475.4 GiB/1007 GiB used) INFO 08-02 17:37:54 hpu_worker.py:158] Free device memory: 79.16 GiB, 39.58 GiB usable (gpu_memory_utilization=0.5), 15.83 GiB reserved for HPUGraphs (VLLM_GRAPH_RESERVED_MEM=0.4), 23.75 GiB reserved for KV cache
INFO 08-02 17:37:54 hpu_model_runner.py:1066] [Warmup][Prompt][1/24] batch_size:4 seq_len:1024 free_mem:55.43 GiB INFO 08-02 17:37:54 hpu_executor.py:85] # HPU blocks: 1519, # CPU blocks: 0
... INFO 08-02 17:37:54 hpu_worker.py:190] Initializing cache engine took 23.73 GiB of device memory (39.2 GiB/94.62 GiB used) and -1.238 MiB of host memory (475.4 GiB/1007 GiB used)
INFO 08-02 17:38:22 hpu_model_runner.py:1066] [Warmup][Decode][48/48] batch_size:1 seq_len:128 free_mem:55.43 GiB INFO 08-02 17:37:54 hpu_model_runner.py:1066] [Warmup][Prompt][1/24] batch_size:4 seq_len:1024 free_mem:55.43 GiB
INFO 08-02 17:38:22 hpu_model_runner.py:1159] Using 15.85 GiB/55.43 GiB of free device memory for HPUGraphs, 7.923 GiB for prompt and 7.923 GiB for decode (VLLM_GRAPH_PROMPT_RATIO=0.3) ...
INFO 08-02 17:38:22 hpu_model_runner.py:1066] [Warmup][Graph/Prompt][1/24] batch_size:1 seq_len:128 free_mem:55.43 GiB INFO 08-02 17:38:22 hpu_model_runner.py:1066] [Warmup][Decode][48/48] batch_size:1 seq_len:128 free_mem:55.43 GiB
... INFO 08-02 17:38:22 hpu_model_runner.py:1159] Using 15.85 GiB/55.43 GiB of free device memory for HPUGraphs, 7.923 GiB for prompt and 7.923 GiB for decode (VLLM_GRAPH_PROMPT_RATIO=0.3)
INFO 08-02 17:38:26 hpu_model_runner.py:1066] [Warmup][Graph/Prompt][11/24] batch_size:1 seq_len:896 free_mem:48.77 GiB INFO 08-02 17:38:22 hpu_model_runner.py:1066] [Warmup][Graph/Prompt][1/24] batch_size:1 seq_len:128 free_mem:55.43 GiB
INFO 08-02 17:38:27 hpu_model_runner.py:1066] [Warmup][Graph/Decode][1/48] batch_size:4 seq_len:128 free_mem:47.51 GiB ...
... INFO 08-02 17:38:26 hpu_model_runner.py:1066] [Warmup][Graph/Prompt][11/24] batch_size:1 seq_len:896 free_mem:48.77 GiB
INFO 08-02 17:38:41 hpu_model_runner.py:1066] [Warmup][Graph/Decode][48/48] batch_size:1 seq_len:2048 free_mem:47.35 GiB INFO 08-02 17:38:27 hpu_model_runner.py:1066] [Warmup][Graph/Decode][1/48] batch_size:4 seq_len:128 free_mem:47.51 GiB
INFO 08-02 17:38:41 hpu_model_runner.py:1066] [Warmup][Graph/Prompt][12/24] batch_size:4 seq_len:256 free_mem:47.35 GiB ...
INFO 08-02 17:38:42 hpu_model_runner.py:1066] [Warmup][Graph/Prompt][13/24] batch_size:2 seq_len:512 free_mem:45.91 GiB INFO 08-02 17:38:41 hpu_model_runner.py:1066] [Warmup][Graph/Decode][48/48] batch_size:1 seq_len:2048 free_mem:47.35 GiB
INFO 08-02 17:38:42 hpu_model_runner.py:1066] [Warmup][Graph/Prompt][14/24] batch_size:1 seq_len:1024 free_mem:44.48 GiB INFO 08-02 17:38:41 hpu_model_runner.py:1066] [Warmup][Graph/Prompt][12/24] batch_size:4 seq_len:256 free_mem:47.35 GiB
INFO 08-02 17:38:43 hpu_model_runner.py:1066] [Warmup][Graph/Prompt][15/24] batch_size:2 seq_len:640 free_mem:43.03 GiB INFO 08-02 17:38:42 hpu_model_runner.py:1066] [Warmup][Graph/Prompt][13/24] batch_size:2 seq_len:512 free_mem:45.91 GiB
INFO 08-02 17:38:43 hpu_model_runner.py:1128] Graph/Prompt captured:15 (62.5%) used_mem:14.03 GiB buckets:[(1, 128), (1, 256), (1, 384), (1, 512), (1, 640), (1, 768), (1, 896), (1, 1024), (2, 128), (2, 256), (2, 384), (2, 512), (2, 640), (4, 128), (4, 256)] INFO 08-02 17:38:42 hpu_model_runner.py:1066] [Warmup][Graph/Prompt][14/24] batch_size:1 seq_len:1024 free_mem:44.48 GiB
INFO 08-02 17:38:43 hpu_model_runner.py:1128] Graph/Decode captured:48 (100.0%) used_mem:161.9 MiB buckets:[(1, 128), (1, 256), (1, 384), (1, 512), (1, 640), (1, 768), (1, 896), (1, 1024), (1, 1152), (1, 1280), (1, 1408), (1, 1536), (1, 1664), (1, 1792), (1, 1920), (1, 2048), (2, 128), (2, 256), (2, 384), (2, 512), (2, 640), (2, 768), (2, 896), (2, 1024), (2, 1152), (2, 1280), (2, 1408), (2, 1536), (2, 1664), (2, 1792), (2, 1920), (2, 2048), (4, 128), (4, 256), (4, 384), (4, 512), (4, 640), (4, 768), (4, 896), (4, 1024), (4, 1152), (4, 1280), (4, 1408), (4, 1536), (4, 1664), (4, 1792), (4, 1920), (4, 2048)] INFO 08-02 17:38:43 hpu_model_runner.py:1066] [Warmup][Graph/Prompt][15/24] batch_size:2 seq_len:640 free_mem:43.03 GiB
INFO 08-02 17:38:43 hpu_model_runner.py:1206] Warmup finished in 49 secs, allocated 14.19 GiB of device memory INFO 08-02 17:38:43 hpu_model_runner.py:1128] Graph/Prompt captured:15 (62.5%) used_mem:14.03 GiB buckets:[(1, 128), (1, 256), (1, 384), (1, 512), (1, 640), (1, 768), (1, 896), (1, 1024), (2, 128), (2, 256), (2, 384), (2, 512), (2, 640), (4, 128), (4, 256)]
INFO 08-02 17:38:43 hpu_executor.py:91] init_cache_engine took 37.92 GiB of device memory (53.39 GiB/94.62 GiB used) and 57.86 MiB of host memory (475.4 GiB/1007 GiB used) INFO 08-02 17:38:43 hpu_model_runner.py:1128] Graph/Decode captured:48 (100.0%) used_mem:161.9 MiB buckets:[(1, 128), (1, 256), (1, 384), (1, 512), (1, 640), (1, 768), (1, 896), (1, 1024), (1, 1152), (1, 1280), (1, 1408), (1, 1536), (1, 1664), (1, 1792), (1, 1920), (1, 2048), (2, 128), (2, 256), (2, 384), (2, 512), (2, 640), (2, 768), (2, 896), (2, 1024), (2, 1152), (2, 1280), (2, 1408), (2, 1536), (2, 1664), (2, 1792), (2, 1920), (2, 2048), (4, 128), (4, 256), (4, 384), (4, 512), (4, 640), (4, 768), (4, 896), (4, 1024), (4, 1152), (4, 1280), (4, 1408), (4, 1536), (4, 1664), (4, 1792), (4, 1920), (4, 2048)]
``` INFO 08-02 17:38:43 hpu_model_runner.py:1206] Warmup finished in 49 secs, allocated 14.19 GiB of device memory
INFO 08-02 17:38:43 hpu_executor.py:91] init_cache_engine took 37.92 GiB of device memory (53.39 GiB/94.62 GiB used) and 57.86 MiB of host memory (475.4 GiB/1007 GiB used)
```
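The memory split reported by the `hpu_worker.py:158` line in the log above can be reproduced with simple arithmetic. This is a sketch whose formula is inferred from the logged numbers rather than taken from the vLLM source, and `split_device_memory` is a hypothetical helper name.

```python
def split_device_memory(free_gib: float,
                        gpu_memory_utilization: float,
                        graph_reserved_ratio: float) -> tuple:
    """Split free device memory into usable, HPUGraph and KV-cache portions (GiB)."""
    # Only a fraction of the free memory is made available to vLLM.
    usable = free_gib * gpu_memory_utilization
    # A share of the usable memory is set aside for HPUGraphs
    # (VLLM_GRAPH_RESERVED_MEM in the log line).
    graphs = usable * graph_reserved_ratio
    # Whatever remains is left for the KV cache.
    kv_cache = usable - graphs
    return usable, graphs, kv_cache


usable, graphs, kv_cache = split_device_memory(79.16, 0.5, 0.4)
print(f"usable={usable:.2f} GiB, HPUGraphs={graphs:.2f} GiB, KV cache={kv_cache:.2f} GiB")
# usable=39.58 GiB, HPUGraphs=15.83 GiB, KV cache=23.75 GiB
```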
### Recommended vLLM Parameters ### Recommended vLLM Parameters
......
...@@ -147,20 +147,22 @@ curl http://localhost:8000/v1/completions \ ...@@ -147,20 +147,22 @@ curl http://localhost:8000/v1/completions \
Since this server is compatible with OpenAI API, you can use it as a drop-in replacement for any applications using OpenAI API. For example, another way to query the server is via the `openai` Python package: Since this server is compatible with OpenAI API, you can use it as a drop-in replacement for any applications using OpenAI API. For example, another way to query the server is via the `openai` Python package:
```python ??? Code
from openai import OpenAI
```python
from openai import OpenAI
# Modify OpenAI's API key and API base to use vLLM's API server. # Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY" openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1" openai_api_base = "http://localhost:8000/v1"
client = OpenAI( client = OpenAI(
api_key=openai_api_key, api_key=openai_api_key,
base_url=openai_api_base, base_url=openai_api_base,
) )
completion = client.completions.create(model="Qwen/Qwen2.5-1.5B-Instruct", completion = client.completions.create(model="Qwen/Qwen2.5-1.5B-Instruct",
prompt="San Francisco is a") prompt="San Francisco is a")
print("Completion result:", completion) print("Completion result:", completion)
``` ```
A more detailed client example can be found here: <gh-file:examples/online_serving/openai_completion_client.py> A more detailed client example can be found here: <gh-file:examples/online_serving/openai_completion_client.py>
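For environments without the `openai` package, the same completion request can be built with the Python standard library alone. This is a minimal sketch, assuming the server exposes the OpenAI-style `/v1/completions` endpoint at `http://localhost:8000`. The helper name `build_completion_request` and the `max_tokens` default are illustrative assumptions.

```python
import json
import urllib.request


def build_completion_request(prompt: str,
                             model: str = "Qwen/Qwen2.5-1.5B-Instruct",
                             base_url: str = "http://localhost:8000/v1",
                             api_key: str = "EMPTY",
                             max_tokens: int = 16) -> urllib.request.Request:
    """Build (but do not send) a POST request for the completions endpoint."""
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        url=f"{base_url}/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


request = build_completion_request("San Francisco is a")
print(request.full_url)

# To send the request against a running server:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```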
...@@ -184,26 +186,28 @@ curl http://localhost:8000/v1/chat/completions \ ...@@ -184,26 +186,28 @@ curl http://localhost:8000/v1/chat/completions \
Alternatively, you can use the `openai` Python package: Alternatively, you can use the `openai` Python package:
```python ??? Code
from openai import OpenAI
# Set OpenAI's API key and API base to use vLLM's API server. ```python
openai_api_key = "EMPTY" from openai import OpenAI
openai_api_base = "http://localhost:8000/v1" # Set OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"
client = OpenAI( client = OpenAI(
api_key=openai_api_key, api_key=openai_api_key,
base_url=openai_api_base, base_url=openai_api_base,
) )
chat_response = client.chat.completions.create( chat_response = client.chat.completions.create(
model="Qwen/Qwen2.5-1.5B-Instruct", model="Qwen/Qwen2.5-1.5B-Instruct",
messages=[ messages=[
{"role": "system", "content": "You are a helpful assistant."}, {"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Tell me a joke."}, {"role": "user", "content": "Tell me a joke."},
] ]
) )
print("Chat response:", chat_response) print("Chat response:", chat_response)
``` ```
## On Attention Backends ## On Attention Backends
......
...@@ -85,11 +85,13 @@ and automatically applies the model's [chat template](https://huggingface.co/doc ...@@ -85,11 +85,13 @@ and automatically applies the model's [chat template](https://huggingface.co/doc
In general, only instruction-tuned models have a chat template. In general, only instruction-tuned models have a chat template.
Base models may perform poorly as they are not trained to respond to the chat conversation. Base models may perform poorly as they are not trained to respond to the chat conversation.
```python ??? Code
from vllm import LLM
```python
from vllm import LLM
llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct") llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")
conversation = [ conversation = [
{ {
"role": "system", "role": "system",
"content": "You are a helpful assistant" "content": "You are a helpful assistant"
...@@ -106,14 +108,14 @@ conversation = [ ...@@ -106,14 +108,14 @@ conversation = [
"role": "user", "role": "user",
"content": "Write an essay about the importance of higher education.", "content": "Write an essay about the importance of higher education.",
}, },
] ]
outputs = llm.chat(conversation) outputs = llm.chat(conversation)
for output in outputs: for output in outputs:
prompt = output.prompt prompt = output.prompt
generated_text = output.outputs[0].text generated_text = output.outputs[0].text
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}") print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
``` ```
A code example can be found here: <gh-file:examples/offline_inference/basic/chat.py> A code example can be found here: <gh-file:examples/offline_inference/basic/chat.py>
......
...@@ -70,7 +70,10 @@ To make your model compatible with the Transformers backend, it needs: ...@@ -70,7 +70,10 @@ To make your model compatible with the Transformers backend, it needs:
2. `MyAttention` must use `ALL_ATTENTION_FUNCTIONS` to call attention. 2. `MyAttention` must use `ALL_ATTENTION_FUNCTIONS` to call attention.
3. `MyModel` must contain `_supports_attention_backend = True`. 3. `MyModel` must contain `_supports_attention_backend = True`.
```python title="modeling_my_model.py" <details>
<summary>modeling_my_model.py</summary>
```python
from transformers import PreTrainedModel from transformers import PreTrainedModel
from torch import nn from torch import nn
...@@ -93,6 +96,8 @@ class MyModel(PreTrainedModel): ...@@ -93,6 +96,8 @@ class MyModel(PreTrainedModel):
_supports_attention_backend = True _supports_attention_backend = True
``` ```
</details>
Here is what happens in the background when this model is loaded: Here is what happens in the background when this model is loaded:
1. The config is loaded. 1. The config is loaded.
...@@ -103,7 +108,10 @@ That's it! ...@@ -103,7 +108,10 @@ That's it!
For your model to be compatible with vLLM's tensor parallel and/or pipeline parallel features, you must add `base_model_tp_plan` and/or `base_model_pp_plan` to your model's config class: For your model to be compatible with vLLM's tensor parallel and/or pipeline parallel features, you must add `base_model_tp_plan` and/or `base_model_pp_plan` to your model's config class:
```python title="configuration_my_model.py" <details>
<summary>configuration_my_model.py</summary>
```python
from transformers import PretrainedConfig from transformers import PretrainedConfig
...@@ -123,6 +131,8 @@ class MyConfig(PretrainedConfig): ...@@ -123,6 +131,8 @@ class MyConfig(PretrainedConfig):
} }
``` ```
</details>
- `base_model_tp_plan` is a `dict` that maps fully qualified layer name patterns to tensor parallel styles (currently only `"colwise"` and `"rowwise"` are supported). - `base_model_tp_plan` is a `dict` that maps fully qualified layer name patterns to tensor parallel styles (currently only `"colwise"` and `"rowwise"` are supported).
- `base_model_pp_plan` is a `dict` that maps direct child layer names to `tuple`s of `list`s of `str`s: - `base_model_pp_plan` is a `dict` that maps direct child layer names to `tuple`s of `list`s of `str`s:
* You only need to do this for layers which are not present on all pipeline stages * You only need to do this for layers which are not present on all pipeline stages
...@@ -198,6 +208,9 @@ huggingface-cli scan-cache --dir ~/.cache/huggingface/hub ...@@ -198,6 +208,9 @@ huggingface-cli scan-cache --dir ~/.cache/huggingface/hub
Use the Hugging Face CLI to interactively [delete downloaded model](https://huggingface.co/docs/huggingface_hub/guides/manage-cache#clean-your-cache) from the cache: Use the Hugging Face CLI to interactively [delete downloaded model](https://huggingface.co/docs/huggingface_hub/guides/manage-cache#clean-your-cache) from the cache:
<details>
<summary>Commands</summary>
```console ```console
# The `delete-cache` command requires extra dependencies to work with the TUI. # The `delete-cache` command requires extra dependencies to work with the TUI.
# Please run `pip install huggingface_hub[cli]` to install them. # Please run `pip install huggingface_hub[cli]` to install them.
...@@ -224,6 +237,8 @@ Start deletion. ...@@ -224,6 +237,8 @@ Start deletion.
Done. Deleted 1 repo(s) and 0 revision(s) for a total of 438.9M. Done. Deleted 1 repo(s) and 0 revision(s) for a total of 438.9M.
``` ```
</details>
#### Using a proxy #### Using a proxy
Here are some tips for loading/downloading models from Hugging Face using a proxy: Here are some tips for loading/downloading models from Hugging Face using a proxy:
...@@ -601,6 +616,8 @@ Specified using `--task generate`. ...@@ -601,6 +616,8 @@ Specified using `--task generate`.
For the best results, we recommend using the following dependency versions (tested on A10 and L40): For the best results, we recommend using the following dependency versions (tested on A10 and L40):
??? Dependency versions
```text ```text
# Core vLLM-compatible dependencies with Molmo accuracy setup (tested on L40) # Core vLLM-compatible dependencies with Molmo accuracy setup (tested on L40)
torch==2.5.1 torch==2.5.1
......
...@@ -13,19 +13,21 @@ pip install langchain langchain_community -q ...@@ -13,19 +13,21 @@ pip install langchain langchain_community -q
To run inference on a single or multiple GPUs, use `VLLM` class from `langchain`. To run inference on a single or multiple GPUs, use `VLLM` class from `langchain`.
```python ??? Code
from langchain_community.llms import VLLM
llm = VLLM(model="mosaicml/mpt-7b", ```python
from langchain_community.llms import VLLM
llm = VLLM(model="mosaicml/mpt-7b",
trust_remote_code=True, # mandatory for hf models trust_remote_code=True, # mandatory for hf models
max_new_tokens=128, max_new_tokens=128,
top_k=10, top_k=10,
top_p=0.95, top_p=0.95,
temperature=0.8, temperature=0.8,
# tensor_parallel_size=... # for distributed inference # tensor_parallel_size=... # for distributed inference
) )
print(llm("What is the capital of France ?")) print(llm("What is the capital of France ?"))
``` ```
Please refer to this [Tutorial](https://python.langchain.com/docs/integrations/llms/vllm) for more details. Please refer to this [Tutorial](https://python.langchain.com/docs/integrations/llms/vllm) for more details.
...@@ -15,22 +15,24 @@ vllm serve NousResearch/Meta-Llama-3-8B-Instruct \ ...@@ -15,22 +15,24 @@ vllm serve NousResearch/Meta-Llama-3-8B-Instruct \
To call the server, in your preferred text editor, create a script that uses an HTTP client. Include any messages that you want to send to the model. Then run that script. Below is an example script using the [official OpenAI Python client](https://github.com/openai/openai-python). To call the server, in your preferred text editor, create a script that uses an HTTP client. Include any messages that you want to send to the model. Then run that script. Below is an example script using the [official OpenAI Python client](https://github.com/openai/openai-python).
```python ??? Code
from openai import OpenAI
client = OpenAI( ```python
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:8000/v1", base_url="http://localhost:8000/v1",
api_key="token-abc123", api_key="token-abc123",
) )
completion = client.chat.completions.create( completion = client.chat.completions.create(
model="NousResearch/Meta-Llama-3-8B-Instruct", model="NousResearch/Meta-Llama-3-8B-Instruct",
messages=[ messages=[
{"role": "user", "content": "Hello!"} {"role": "user", "content": "Hello!"}
] ]
) )
print(completion.choices[0].message) print(completion.choices[0].message)
``` ```
!!! tip !!! tip
vLLM supports some parameters that are not supported by OpenAI, `top_k` for example. vLLM supports some parameters that are not supported by OpenAI, `top_k` for example.
...@@ -147,8 +149,10 @@ with `--enable-request-id-headers`. ...@@ -147,8 +149,10 @@ with `--enable-request-id-headers`.
> rather than within the vLLM layer for this reason. > rather than within the vLLM layer for this reason.
> See [this PR](https://github.com/vllm-project/vllm/pull/11529) for more details. > See [this PR](https://github.com/vllm-project/vllm/pull/11529) for more details.
```python ??? Code
completion = client.chat.completions.create(
```python
completion = client.chat.completions.create(
model="NousResearch/Meta-Llama-3-8B-Instruct", model="NousResearch/Meta-Llama-3-8B-Instruct",
messages=[ messages=[
{"role": "user", "content": "Classify this sentiment: vLLM is wonderful!"} {"role": "user", "content": "Classify this sentiment: vLLM is wonderful!"}
...@@ -156,18 +160,18 @@ completion = client.chat.completions.create( ...@@ -156,18 +160,18 @@ completion = client.chat.completions.create(
extra_headers={ extra_headers={
"x-request-id": "sentiment-classification-00001", "x-request-id": "sentiment-classification-00001",
} }
) )
print(completion._request_id) print(completion._request_id)
completion = client.completions.create( completion = client.completions.create(
model="NousResearch/Meta-Llama-3-8B-Instruct", model="NousResearch/Meta-Llama-3-8B-Instruct",
prompt="A robot may not injure a human being", prompt="A robot may not injure a human being",
extra_headers={ extra_headers={
"x-request-id": "completion-test", "x-request-id": "completion-test",
} }
) )
print(completion._request_id) print(completion._request_id)
``` ```
## API Reference ## API Reference
...@@ -184,15 +188,19 @@ Code example: <gh-file:examples/online_serving/openai_completion_client.py> ...@@ -184,15 +188,19 @@ Code example: <gh-file:examples/online_serving/openai_completion_client.py>
The following [sampling parameters][sampling-params] are supported. The following [sampling parameters][sampling-params] are supported.
```python ??? Code
--8<-- "vllm/entrypoints/openai/protocol.py:completion-sampling-params"
``` ```python
--8<-- "vllm/entrypoints/openai/protocol.py:completion-sampling-params"
```
The following extra parameters are supported: The following extra parameters are supported:
```python ??? Code
--8<-- "vllm/entrypoints/openai/protocol.py:completion-extra-params"
``` ```python
--8<-- "vllm/entrypoints/openai/protocol.py:completion-extra-params"
```
[](){ #chat-api } [](){ #chat-api }
...@@ -212,15 +220,19 @@ Code example: <gh-file:examples/online_serving/openai_chat_completion_client.py> ...@@ -212,15 +220,19 @@ Code example: <gh-file:examples/online_serving/openai_chat_completion_client.py>
The following [sampling parameters][sampling-params] are supported. The following [sampling parameters][sampling-params] are supported.
```python ??? Code
--8<-- "vllm/entrypoints/openai/protocol.py:chat-completion-sampling-params"
``` ```python
--8<-- "vllm/entrypoints/openai/protocol.py:chat-completion-sampling-params"
```
The following extra parameters are supported: The following extra parameters are supported:
```python ??? Code
--8<-- "vllm/entrypoints/openai/protocol.py:chat-completion-extra-params"
``` ```python
--8<-- "vllm/entrypoints/openai/protocol.py:chat-completion-extra-params"
```
[](){ #embeddings-api } [](){ #embeddings-api }
...@@ -259,6 +271,8 @@ and passing a list of `messages` in the request. Refer to the examples below for ...@@ -259,6 +271,8 @@ and passing a list of `messages` in the request. Refer to the examples below for
Since the request schema is not defined by OpenAI client, we post a request to the server using the lower-level `requests` library: Since the request schema is not defined by OpenAI client, we post a request to the server using the lower-level `requests` library:
??? Code
```python ```python
import requests import requests
...@@ -316,15 +330,19 @@ The following [pooling parameters][pooling-params] are supported. ...@@ -316,15 +330,19 @@ The following [pooling parameters][pooling-params] are supported.
The following extra parameters are supported by default: The following extra parameters are supported by default:
```python ??? Code
--8<-- "vllm/entrypoints/openai/protocol.py:embedding-extra-params"
``` ```python
--8<-- "vllm/entrypoints/openai/protocol.py:embedding-extra-params"
```
For chat-like input (i.e. if `messages` is passed), these extra parameters are supported instead: For chat-like input (i.e. if `messages` is passed), these extra parameters are supported instead:
```python ??? Code
--8<-- "vllm/entrypoints/openai/protocol.py:chat-embedding-extra-params"
``` ```python
--8<-- "vllm/entrypoints/openai/protocol.py:chat-embedding-extra-params"
```
[](){ #transcriptions-api } [](){ #transcriptions-api }
...@@ -343,15 +361,19 @@ Code example: <gh-file:examples/online_serving/openai_transcription_client.py> ...@@ -343,15 +361,19 @@ Code example: <gh-file:examples/online_serving/openai_transcription_client.py>
The following [sampling parameters][sampling-params] are supported. The following [sampling parameters][sampling-params] are supported.
```python ??? Code
--8<-- "vllm/entrypoints/openai/protocol.py:transcription-sampling-params"
``` ```python
--8<-- "vllm/entrypoints/openai/protocol.py:transcription-sampling-params"
```
The following extra parameters are supported: The following extra parameters are supported:
```python ??? Code
--8<-- "vllm/entrypoints/openai/protocol.py:transcription-extra-params"
``` ```python
--8<-- "vllm/entrypoints/openai/protocol.py:transcription-extra-params"
```
[](){ #tokenizer-api } [](){ #tokenizer-api }
...@@ -387,8 +409,6 @@ Code example: <gh-file:examples/online_serving/openai_classification_client.py> ...@@ -387,8 +409,6 @@ Code example: <gh-file:examples/online_serving/openai_classification_client.py>
You can classify multiple texts by passing an array of strings: You can classify multiple texts by passing an array of strings:
Request:
```bash ```bash
curl -v "http://127.0.0.1:8000/classify" \ curl -v "http://127.0.0.1:8000/classify" \
-H "Content-Type: application/json" \ -H "Content-Type: application/json" \
...@@ -401,10 +421,10 @@ curl -v "http://127.0.0.1:8000/classify" \ ...@@ -401,10 +421,10 @@ curl -v "http://127.0.0.1:8000/classify" \
}' }'
``` ```
Response: ??? Response
```bash ```bash
{ {
"id": "classify-7c87cac407b749a6935d8c7ce2a8fba2", "id": "classify-7c87cac407b749a6935d8c7ce2a8fba2",
"object": "list", "object": "list",
"created": 1745383065, "created": 1745383065,
...@@ -435,13 +455,11 @@ Response: ...@@ -435,13 +455,11 @@ Response:
"completion_tokens": 0, "completion_tokens": 0,
"prompt_tokens_details": null "prompt_tokens_details": null
} }
} }
``` ```
You can also pass a string directly to the `input` field: You can also pass a string directly to the `input` field:
Request:
```bash ```bash
curl -v "http://127.0.0.1:8000/classify" \ curl -v "http://127.0.0.1:8000/classify" \
-H "Content-Type: application/json" \ -H "Content-Type: application/json" \
...@@ -451,10 +469,10 @@ curl -v "http://127.0.0.1:8000/classify" \ ...@@ -451,10 +469,10 @@ curl -v "http://127.0.0.1:8000/classify" \
}' }'
``` ```
Response: ??? Response
```bash ```bash
{ {
"id": "classify-9bf17f2847b046c7b2d5495f4b4f9682", "id": "classify-9bf17f2847b046c7b2d5495f4b4f9682",
"object": "list", "object": "list",
"created": 1745383213, "created": 1745383213,
...@@ -476,8 +494,8 @@ Response: ...@@ -476,8 +494,8 @@ Response:
"completion_tokens": 0, "completion_tokens": 0,
"prompt_tokens_details": null "prompt_tokens_details": null
} }
} }
``` ```
#### Extra parameters #### Extra parameters
...@@ -508,8 +526,6 @@ Code example: <gh-file:examples/online_serving/openai_cross_encoder_score.py> ...@@ -508,8 +526,6 @@ Code example: <gh-file:examples/online_serving/openai_cross_encoder_score.py>
You can pass a string to both `text_1` and `text_2`, forming a single sentence pair. You can pass a string to both `text_1` and `text_2`, forming a single sentence pair.
Request:
```bash ```bash
curl -X 'POST' \ curl -X 'POST' \
'http://127.0.0.1:8000/score' \ 'http://127.0.0.1:8000/score' \
...@@ -523,10 +539,10 @@ curl -X 'POST' \ ...@@ -523,10 +539,10 @@ curl -X 'POST' \
}' }'
``` ```
Response: ??? Response
```bash ```bash
{ {
"id": "score-request-id", "id": "score-request-id",
"object": "list", "object": "list",
"created": 693447, "created": 693447,
...@@ -539,8 +555,8 @@ Response: ...@@ -539,8 +555,8 @@ Response:
} }
], ],
"usage": {} "usage": {}
} }
``` ```
#### Batch inference #### Batch inference
...@@ -548,10 +564,10 @@ You can pass a string to `text_1` and a list to `text_2`, forming multiple sente ...@@ -548,10 +564,10 @@ You can pass a string to `text_1` and a list to `text_2`, forming multiple sente
where each pair is built from `text_1` and a string in `text_2`. where each pair is built from `text_1` and a string in `text_2`.
The total number of pairs is `len(text_2)`. The total number of pairs is `len(text_2)`.
Request: ??? Request
```bash ```bash
curl -X 'POST' \ curl -X 'POST' \
'http://127.0.0.1:8000/score' \ 'http://127.0.0.1:8000/score' \
-H 'accept: application/json' \ -H 'accept: application/json' \
-H 'Content-Type: application/json' \ -H 'Content-Type: application/json' \
...@@ -562,13 +578,13 @@ curl -X 'POST' \ ...@@ -562,13 +578,13 @@ curl -X 'POST' \
"The capital of Brazil is Brasilia.", "The capital of Brazil is Brasilia.",
"The capital of France is Paris." "The capital of France is Paris."
] ]
}' }'
``` ```
Response: ??? Response
```bash ```bash
{ {
"id": "score-request-id", "id": "score-request-id",
"object": "list", "object": "list",
"created": 693570, "created": 693570,
...@@ -586,17 +602,17 @@ Response: ...@@ -586,17 +602,17 @@ Response:
} }
], ],
"usage": {} "usage": {}
} }
``` ```
You can pass a list to both `text_1` and `text_2`, forming multiple sentence pairs You can pass a list to both `text_1` and `text_2`, forming multiple sentence pairs
where each pair is built from a string in `text_1` and the corresponding string in `text_2` (similar to `zip()`). where each pair is built from a string in `text_1` and the corresponding string in `text_2` (similar to `zip()`).
The total number of pairs is `len(text_2)`. The total number of pairs is `len(text_2)`.
Request: ??? Request
```bash ```bash
curl -X 'POST' \ curl -X 'POST' \
'http://127.0.0.1:8000/score' \ 'http://127.0.0.1:8000/score' \
-H 'accept: application/json' \ -H 'accept: application/json' \
-H 'Content-Type: application/json' \ -H 'Content-Type: application/json' \
...@@ -611,13 +627,13 @@ curl -X 'POST' \ ...@@ -611,13 +627,13 @@ curl -X 'POST' \
"The capital of Brazil is Brasilia.", "The capital of Brazil is Brasilia.",
"The capital of France is Paris." "The capital of France is Paris."
] ]
}' }'
``` ```
Response: ??? Response
```bash ```bash
{ {
"id": "score-request-id", "id": "score-request-id",
"object": "list", "object": "list",
"created": 693447, "created": 693447,
...@@ -635,8 +651,8 @@ Response: ...@@ -635,8 +651,8 @@ Response:
} }
], ],
"usage": {} "usage": {}
} }
``` ```
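The pairing rules described above can be sketched in a few lines of Python. This is an illustration only, not vLLM's implementation: the helper name `build_score_pairs` is hypothetical, and rejecting a list `text_1` paired with a string `text_2`, or lists of unequal length, is an assumption because this section does not cover those cases.

```python
def build_score_pairs(text_1, text_2):
    """Illustrative pairing logic for the three documented input shapes."""
    # Case 1: two strings form a single sentence pair.
    if isinstance(text_1, str) and isinstance(text_2, str):
        return [(text_1, text_2)]
    # Case 2: one string against a list gives len(text_2) pairs.
    if isinstance(text_1, str):
        return [(text_1, candidate) for candidate in text_2]
    # Assumption: a list paired with a single string is not described here.
    if isinstance(text_2, str):
        raise ValueError("list text_1 with string text_2 is not covered")
    # Case 3: two lists are paired element-wise, similar to zip().
    if len(text_1) != len(text_2):
        raise ValueError("text_1 and text_2 must have the same length")
    return list(zip(text_1, text_2))


# Hypothetical queries; the candidate sentences come from the examples above.
pairs = build_score_pairs(
    ["Which city is the capital of Brazil?", "Which city is the capital of France?"],
    ["The capital of Brazil is Brasilia.", "The capital of France is Paris."],
)
for left, right in pairs:
    print(left, "<->", right)
```

In every accepted case the number of pairs equals the number of strings supplied in `text_2`, which matches the counts stated above.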
#### Extra parameters #### Extra parameters
...@@ -675,10 +691,10 @@ Code example: <gh-file:examples/online_serving/jinaai_rerank_client.py> ...@@ -675,10 +691,10 @@ Code example: <gh-file:examples/online_serving/jinaai_rerank_client.py>
Note that the `top_n` request parameter is optional and will default to the length of the `documents` field. Note that the `top_n` request parameter is optional and will default to the length of the `documents` field.
Result documents will be sorted by relevance, and the `index` property can be used to determine original order. Result documents will be sorted by relevance, and the `index` property can be used to determine original order.
Request: ??? Request
```bash ```bash
curl -X 'POST' \ curl -X 'POST' \
'http://127.0.0.1:8000/v1/rerank' \ 'http://127.0.0.1:8000/v1/rerank' \
-H 'accept: application/json' \ -H 'accept: application/json' \
-H 'Content-Type: application/json' \ -H 'Content-Type: application/json' \
...@@ -690,13 +706,13 @@ curl -X 'POST' \ ...@@ -690,13 +706,13 @@ curl -X 'POST' \
"The capital of France is Paris.", "The capital of France is Paris.",
"Horses and cows are both animals" "Horses and cows are both animals"
] ]
}' }'
``` ```
Response: ??? Response
```bash ```bash
{ {
"id": "rerank-fae51b2b664d4ed38f5969b612edff77", "id": "rerank-fae51b2b664d4ed38f5969b612edff77",
"model": "BAAI/bge-reranker-base", "model": "BAAI/bge-reranker-base",
"usage": { "usage": {
...@@ -718,8 +734,8 @@ Response: ...@@ -718,8 +734,8 @@ Response:
"relevance_score": 0.0005860328674316406 "relevance_score": 0.0005860328674316406
} }
] ]
} }
``` ```
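Because results come back sorted by relevance, a client that needs the scores in request order has to re-sort them by `index`. The sketch below shows one way to do that. The response literal is hypothetical: the `results` key and the score values are assumptions for illustration, not captured server output, and only the `index` and `relevance_score` fields mentioned above are used.

```python
# Hypothetical inputs: the documents sent in the request, in their original order.
documents = [
    "Horses and cows are both animals",
    "The capital of France is Paris.",
]

# Hypothetical response body; the "results" key and the scores are assumptions.
response = {
    "results": [
        {"index": 1, "relevance_score": 0.99},
        {"index": 0, "relevance_score": 0.0006},
    ]
}


def in_original_order(results):
    """Return the results re-sorted by their position in the request."""
    return sorted(results, key=lambda item: item["index"])


# Results arrive sorted by relevance, so the first entry is the best match.
best = response["results"][0]
print("Best match:", documents[best["index"]])

# Re-sorting by `index` lines the scores up with the original documents.
for item in in_original_order(response["results"]):
    print(documents[item["index"]], item["relevance_score"])
```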
#### Extra parameters #### Extra parameters
......
...@@ -12,28 +12,32 @@ vllm serve unsloth/Llama-3.2-1B-Instruct ...@@ -12,28 +12,32 @@ vllm serve unsloth/Llama-3.2-1B-Instruct
Then query the endpoint to get the latest metrics from the server: Then query the endpoint to get the latest metrics from the server:
```console ??? Output
$ curl http://0.0.0.0:8000/metrics
```console
# HELP vllm:iteration_tokens_total Histogram of number of tokens per engine_step. $ curl http://0.0.0.0:8000/metrics
# TYPE vllm:iteration_tokens_total histogram
vllm:iteration_tokens_total_sum{model_name="unsloth/Llama-3.2-1B-Instruct"} 0.0 # HELP vllm:iteration_tokens_total Histogram of number of tokens per engine_step.
vllm:iteration_tokens_total_bucket{le="1.0",model_name="unsloth/Llama-3.2-1B-Instruct"} 3.0 # TYPE vllm:iteration_tokens_total histogram
vllm:iteration_tokens_total_bucket{le="8.0",model_name="unsloth/Llama-3.2-1B-Instruct"} 3.0 vllm:iteration_tokens_total_sum{model_name="unsloth/Llama-3.2-1B-Instruct"} 0.0
vllm:iteration_tokens_total_bucket{le="16.0",model_name="unsloth/Llama-3.2-1B-Instruct"} 3.0 vllm:iteration_tokens_total_bucket{le="1.0",model_name="unsloth/Llama-3.2-1B-Instruct"} 3.0
vllm:iteration_tokens_total_bucket{le="32.0",model_name="unsloth/Llama-3.2-1B-Instruct"} 3.0 vllm:iteration_tokens_total_bucket{le="8.0",model_name="unsloth/Llama-3.2-1B-Instruct"} 3.0
vllm:iteration_tokens_total_bucket{le="64.0",model_name="unsloth/Llama-3.2-1B-Instruct"} 3.0 vllm:iteration_tokens_total_bucket{le="16.0",model_name="unsloth/Llama-3.2-1B-Instruct"} 3.0
vllm:iteration_tokens_total_bucket{le="128.0",model_name="unsloth/Llama-3.2-1B-Instruct"} 3.0 vllm:iteration_tokens_total_bucket{le="32.0",model_name="unsloth/Llama-3.2-1B-Instruct"} 3.0
vllm:iteration_tokens_total_bucket{le="256.0",model_name="unsloth/Llama-3.2-1B-Instruct"} 3.0 vllm:iteration_tokens_total_bucket{le="64.0",model_name="unsloth/Llama-3.2-1B-Instruct"} 3.0
vllm:iteration_tokens_total_bucket{le="512.0",model_name="unsloth/Llama-3.2-1B-Instruct"} 3.0 vllm:iteration_tokens_total_bucket{le="128.0",model_name="unsloth/Llama-3.2-1B-Instruct"} 3.0
... vllm:iteration_tokens_total_bucket{le="256.0",model_name="unsloth/Llama-3.2-1B-Instruct"} 3.0
``` vllm:iteration_tokens_total_bucket{le="512.0",model_name="unsloth/Llama-3.2-1B-Instruct"} 3.0
...
```
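The output above is plain text with one sample per line, so it can be inspected without extra tooling. The following is a minimal sketch that parses a few of the lines shown above using only the standard library. The names `parse_samples`, `LINE_RE`, and `LABEL_RE` are hypothetical, and the parser is deliberately simplified: it assumes every sample carries a label set and that label values contain no escaped quotes.

```python
import re

# A few lines copied from the example output above.
SAMPLE = '''\
# HELP vllm:iteration_tokens_total Histogram of number of tokens per engine_step.
# TYPE vllm:iteration_tokens_total histogram
vllm:iteration_tokens_total_sum{model_name="unsloth/Llama-3.2-1B-Instruct"} 0.0
vllm:iteration_tokens_total_bucket{le="1.0",model_name="unsloth/Llama-3.2-1B-Instruct"} 3.0
vllm:iteration_tokens_total_bucket{le="8.0",model_name="unsloth/Llama-3.2-1B-Instruct"} 3.0
...
'''

# name{label="value",...} value
LINE_RE = re.compile(r'^(?P<name>[^{\s]+)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$')
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_samples(text):
    """Return (metric_name, labels, value) tuples for each labelled sample line."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and the "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        match = LINE_RE.match(line)
        if match is None:
            continue  # anything else, such as the "..." elision
        labels = dict(LABEL_RE.findall(match["labels"]))
        samples.append((match["name"], labels, float(match["value"])))
    return samples


samples = parse_samples(SAMPLE)
# Map each histogram upper bound ("le") to its cumulative count.
buckets = {
    labels["le"]: value
    for name, labels, value in samples
    if name.endswith("_bucket")
}
print(buckets)
```

For anything beyond a quick check, a full parser such as the one shipped with the `prometheus_client` package is likely a better fit than this regex-based sketch.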
The following metrics are exposed: The following metrics are exposed:
```python ??? Code
--8<-- "vllm/engine/metrics.py:metrics-definitions"
``` ```python
--8<-- "vllm/engine/metrics.py:metrics-definitions"
```
Note: when metrics are deprecated in version `X.Y`, they are hidden in version `X.Y+1` Note: when metrics are deprecated in version `X.Y`, they are hidden in version `X.Y+1`
but can be re-enabled using the `--show-hidden-metrics-for-version=X.Y` escape hatch, but can be re-enabled using the `--show-hidden-metrics-for-version=X.Y` escape hatch,
......
...@@ -60,68 +60,70 @@ To identify the particular CUDA operation that causes the error, you can add `-- ...@@ -60,68 +60,70 @@ To identify the particular CUDA operation that causes the error, you can add `--
If GPU/CPU communication cannot be established, you can use the following Python script and follow the instructions below to confirm whether the GPU/CPU communication is working correctly. If GPU/CPU communication cannot be established, you can use the following Python script and follow the instructions below to confirm whether the GPU/CPU communication is working correctly.
```python ??? Code
# Test PyTorch NCCL
import torch ```python
import torch.distributed as dist # Test PyTorch NCCL
dist.init_process_group(backend="nccl") import torch
local_rank = dist.get_rank() % torch.cuda.device_count() import torch.distributed as dist
torch.cuda.set_device(local_rank) dist.init_process_group(backend="nccl")
data = torch.FloatTensor([1,] * 128).to("cuda") local_rank = dist.get_rank() % torch.cuda.device_count()
dist.all_reduce(data, op=dist.ReduceOp.SUM) torch.cuda.set_device(local_rank)
torch.cuda.synchronize() data = torch.FloatTensor([1,] * 128).to("cuda")
value = data.mean().item() dist.all_reduce(data, op=dist.ReduceOp.SUM)
world_size = dist.get_world_size() torch.cuda.synchronize()
assert value == world_size, f"Expected {world_size}, got {value}" value = data.mean().item()
world_size = dist.get_world_size()
print("PyTorch NCCL is successful!") assert value == world_size, f"Expected {world_size}, got {value}"
# Test PyTorch GLOO print("PyTorch NCCL is successful!")
gloo_group = dist.new_group(ranks=list(range(world_size)), backend="gloo")
cpu_data = torch.FloatTensor([1,] * 128) # Test PyTorch GLOO
dist.all_reduce(cpu_data, op=dist.ReduceOp.SUM, group=gloo_group) gloo_group = dist.new_group(ranks=list(range(world_size)), backend="gloo")
value = cpu_data.mean().item() cpu_data = torch.FloatTensor([1,] * 128)
assert value == world_size, f"Expected {world_size}, got {value}" dist.all_reduce(cpu_data, op=dist.ReduceOp.SUM, group=gloo_group)
value = cpu_data.mean().item()
print("PyTorch GLOO is successful!") assert value == world_size, f"Expected {world_size}, got {value}"
if world_size <= 1: print("PyTorch GLOO is successful!")
if world_size <= 1:
exit() exit()
# Test vLLM NCCL, with cuda graph # Test vLLM NCCL, with cuda graph
from vllm.distributed.device_communicators.pynccl import PyNcclCommunicator from vllm.distributed.device_communicators.pynccl import PyNcclCommunicator
pynccl = PyNcclCommunicator(group=gloo_group, device=local_rank) pynccl = PyNcclCommunicator(group=gloo_group, device=local_rank)
# pynccl is enabled by default for 0.6.5+, # pynccl is enabled by default for 0.6.5+,
# but for 0.6.4 and below, we need to enable it manually. # but for 0.6.4 and below, we need to enable it manually.
# keep the code for backward compatibility because people # keep the code for backward compatibility because people
# prefer to read the latest documentation. # prefer to read the latest documentation.
pynccl.disabled = False pynccl.disabled = False
s = torch.cuda.Stream() s = torch.cuda.Stream()
with torch.cuda.stream(s): with torch.cuda.stream(s):
data.fill_(1) data.fill_(1)
out = pynccl.all_reduce(data, stream=s) out = pynccl.all_reduce(data, stream=s)
value = out.mean().item() value = out.mean().item()
assert value == world_size, f"Expected {world_size}, got {value}" assert value == world_size, f"Expected {world_size}, got {value}"
print("vLLM NCCL is successful!") print("vLLM NCCL is successful!")
g = torch.cuda.CUDAGraph() g = torch.cuda.CUDAGraph()
with torch.cuda.graph(cuda_graph=g, stream=s): with torch.cuda.graph(cuda_graph=g, stream=s):
out = pynccl.all_reduce(data, stream=torch.cuda.current_stream()) out = pynccl.all_reduce(data, stream=torch.cuda.current_stream())
data.fill_(1) data.fill_(1)
g.replay() g.replay()
torch.cuda.current_stream().synchronize() torch.cuda.current_stream().synchronize()
value = out.mean().item() value = out.mean().item()
assert value == world_size, f"Expected {world_size}, got {value}" assert value == world_size, f"Expected {world_size}, got {value}"
print("vLLM NCCL with cuda graph is successful!") print("vLLM NCCL with cuda graph is successful!")
dist.destroy_process_group(gloo_group) dist.destroy_process_group(gloo_group)
dist.destroy_process_group() dist.destroy_process_group()
``` ```
If you are testing with a single node, adjust `--nproc-per-node` to the number of GPUs you want to use: If you are testing with a single node, adjust `--nproc-per-node` to the number of GPUs you want to use:
...@@ -165,8 +167,10 @@ WARNING 12-11 14:50:37 multiproc_worker_utils.py:281] CUDA was previously ...@@ -165,8 +167,10 @@ WARNING 12-11 14:50:37 multiproc_worker_utils.py:281] CUDA was previously
or an error from Python that looks like this: or an error from Python that looks like this:
```console ??? Logs
RuntimeError:
```console
RuntimeError:
An attempt has been made to start a new process before the An attempt has been made to start a new process before the
current process has finished its bootstrapping phase. current process has finished its bootstrapping phase.
...@@ -183,7 +187,7 @@ RuntimeError: ...@@ -183,7 +187,7 @@ RuntimeError:
To fix this issue, refer to the "Safe importing of main module" To fix this issue, refer to the "Safe importing of main module"
section in https://docs.python.org/3/library/multiprocessing.html section in https://docs.python.org/3/library/multiprocessing.html
``` ```
then you must update your Python code to guard usage of `vllm` behind an `if then you must update your Python code to guard usage of `vllm` behind an `if
__name__ == '__main__':` block. For example, instead of this: __name__ == '__main__':` block. For example, instead of this:
...@@ -207,20 +211,22 @@ if __name__ == '__main__': ...@@ -207,20 +211,22 @@ if __name__ == '__main__':
vLLM heavily depends on `torch.compile` to optimize the model for better performance, which introduces the dependency on the `torch.compile` functionality and the `triton` library. By default, we use `torch.compile` to [optimize some functions](https://github.com/vllm-project/vllm/pull/10406) in the model. Before running vLLM, you can check if `torch.compile` is working as expected by running the following script: vLLM heavily depends on `torch.compile` to optimize the model for better performance, which introduces the dependency on the `torch.compile` functionality and the `triton` library. By default, we use `torch.compile` to [optimize some functions](https://github.com/vllm-project/vllm/pull/10406) in the model. Before running vLLM, you can check if `torch.compile` is working as expected by running the following script:
```python ??? Code
import torch
@torch.compile ```python
def f(x): import torch
@torch.compile
def f(x):
# a simple function to test torch.compile # a simple function to test torch.compile
x = x + 1 x = x + 1
x = x * 2 x = x * 2
x = x.sin() x = x.sin()
return x return x
x = torch.randn(4, 4).cuda() x = torch.randn(4, 4).cuda()
print(f(x)) print(f(x))
``` ```
If it raises errors from the `torch/_inductor` directory, it usually means you have a custom `triton` library that is incompatible with the version of PyTorch you are using. See [this issue](https://github.com/vllm-project/vllm/issues/12219) for an example. If it raises errors from the `torch/_inductor` directory, it usually means you have a custom `triton` library that is incompatible with the version of PyTorch you are using. See [this issue](https://github.com/vllm-project/vllm/issues/12219) for an example.
......
...@@ -10,8 +10,10 @@ The list of data collected by the latest version of vLLM can be found here: <gh- ...@@ -10,8 +10,10 @@ The list of data collected by the latest version of vLLM can be found here: <gh-
Here is an example as of v0.4.0: Here is an example as of v0.4.0:
```json ??? Output
{
```json
{
"uuid": "fbe880e9-084d-4cab-a395-8984c50f1109", "uuid": "fbe880e9-084d-4cab-a395-8984c50f1109",
"provider": "GCP", "provider": "GCP",
"num_cpu": 24, "num_cpu": 24,
...@@ -38,8 +40,8 @@ Here is an example as of v0.4.0: ...@@ -38,8 +40,8 @@ Here is an example as of v0.4.0:
"enable_prefix_caching": false, "enable_prefix_caching": false,
"enforce_eager": false, "enforce_eager": false,
"disable_custom_all_reduce": true "disable_custom_all_reduce": true
} }
``` ```
You can preview the collected data by running the following command: You can preview the collected data by running the following command:
......