Commit e661d594 authored by zhuwenwen

Merge tag 'v0.5.4' into v0.5.4-dtk24.04.1

parents 6b16ea2e 4db5176d
.. _benchmarks:

Benchmark suites of vLLM
========================
vLLM contains two sets of benchmarks:

+ **Performance benchmarks**: benchmark vLLM's performance under various workloads at a high frequency (each time a pull request, or PR for short, is being merged into vLLM). See `vLLM performance dashboard <https://perf.vllm.ai>`_ for the latest performance results.

+ **Nightly benchmarks**: compare vLLM's performance against alternatives (tgi, trt-llm, and lmdeploy) when there are major updates of vLLM (e.g., bumping up to a new version). The latest results are available in the `vLLM GitHub README <https://github.com/vllm-project/vllm/blob/main/README.md>`_.

Trigger a benchmark
-------------------

Both the performance benchmarks and the nightly benchmarks can be triggered by submitting a PR to vLLM and labeling it with ``perf-benchmarks`` and ``nightly-benchmarks``.
.. note::

   Please refer to `vLLM performance benchmark descriptions <https://github.com/vllm-project/vllm/blob/main/.buildkite/nightly-benchmarks/tests/descriptions.md>`_ and `vLLM nightly benchmark descriptions <https://github.com/vllm-project/vllm/blob/main/.buildkite/nightly-benchmarks/nightly-descriptions.md>`_ for detailed descriptions of the benchmark environment, workload, and metrics.

.. _bits_and_bytes:

BitsAndBytes
============
vLLM now supports `BitsAndBytes <https://github.com/TimDettmers/bitsandbytes>`_ for more efficient model inference.
BitsAndBytes quantizes models to reduce memory usage and enhance performance without significantly sacrificing accuracy.
Compared to other quantization methods, BitsAndBytes eliminates the need for calibrating the quantized model with input data.
Below are the steps to use BitsAndBytes with vLLM.

.. code-block:: console

   $ pip install "bitsandbytes>=0.42.0"

vLLM reads the model's config file and supports both in-flight quantization and pre-quantized checkpoints.

You can find bitsandbytes-quantized models at https://huggingface.co/models?other=bitsandbytes.
These repositories usually include a ``config.json`` file with a ``quantization_config`` section.
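
As an illustration, the presence of this section can be checked with a few lines of Python. This is a minimal sketch: the sample ``config.json`` content and the helper name ``has_bnb_quantization_config`` are assumptions made for the example, and the ``quant_method`` and ``load_in_4bit`` fields mirror what such repositories commonly contain rather than a guaranteed schema.

.. code-block:: python

   import json

   # Hypothetical excerpt of a pre-quantized repository's config.json;
   # real files contain many more fields.
   SAMPLE_CONFIG = """
   {
     "model_type": "llama",
     "quantization_config": {
       "quant_method": "bitsandbytes",
       "load_in_4bit": true
     }
   }
   """


   def has_bnb_quantization_config(config_text: str) -> bool:
       """Return True if the config declares a bitsandbytes quantization section."""
       config = json.loads(config_text)
       quant_config = config.get("quantization_config")
       if not isinstance(quant_config, dict):
           return False
       # Assumption: the section names its method under "quant_method".
       return quant_config.get("quant_method") == "bitsandbytes"


   print(has_bnb_quantization_config(SAMPLE_CONFIG))
   print(has_bnb_quantization_config('{"model_type": "llama"}'))

A checkpoint whose config lacks this section would instead be a candidate for in-flight quantization.
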
Read quantized checkpoint
-------------------------
.. code-block:: python

   from vllm import LLM
   import torch

   # unsloth/tinyllama-bnb-4bit is a pre-quantized checkpoint.
   model_id = "unsloth/tinyllama-bnb-4bit"
   llm = LLM(model=model_id, dtype=torch.bfloat16, trust_remote_code=True,
             quantization="bitsandbytes", load_format="bitsandbytes")
In-flight quantization: load as 4-bit quantization
--------------------------------------------------
.. code-block:: python

   from vllm import LLM
   import torch

   model_id = "huggyllama/llama-7b"
   llm = LLM(model=model_id, dtype=torch.bfloat16, trust_remote_code=True,
             quantization="bitsandbytes", load_format="bitsandbytes")
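
To put the memory savings in perspective, the weight footprint can be estimated with simple arithmetic. The sketch below assumes roughly 7 billion parameters for a 7B model, 2 bytes per parameter in bfloat16 and half a byte per parameter at 4 bits. It ignores quantization metadata, activations and the KV cache, so real usage will be higher. The function name is illustrative and not part of vLLM.

.. code-block:: python

   def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
       """Approximate weight storage in GB (10**9 bytes), weights only."""
       return num_params * bits_per_param / 8 / 1e9


   NUM_PARAMS = 7e9  # assumption: a "7B" model has about 7 billion parameters

   bf16_gb = weight_memory_gb(NUM_PARAMS, 16)
   int4_gb = weight_memory_gb(NUM_PARAMS, 4)

   print(f"bfloat16 weights: {bf16_gb:.1f} GB")  # 14.0 GB
   print(f"4-bit weights:    {int4_gb:.1f} GB")  # 3.5 GB
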
...@@ -50,7 +50,7 @@ You can also additionally specify :code:`--pipeline-parallel-size` to enable pip ...@@ -50,7 +50,7 @@ You can also additionally specify :code:`--pipeline-parallel-size` to enable pip
$ --pipeline-parallel-size 2 $ --pipeline-parallel-size 2
.. note:: .. note::
Pipeline parallel is a beta feature. It is only supported for online serving as well as LLaMa, GPT2, and Mixtral style models. Pipeline parallel is a beta feature. It is only supported for online serving as well as LLaMa, GPT2, Mixtral, Qwen, Qwen2, and Nemotron style models.
Multi-Node Inference and Serving Multi-Node Inference and Serving
-------------------------------- --------------------------------
...@@ -79,7 +79,7 @@ On the rest of the worker nodes, run the following command: ...@@ -79,7 +79,7 @@ On the rest of the worker nodes, run the following command:
$ --worker \ $ --worker \
$ /path/to/the/huggingface/home/in/this/node $ /path/to/the/huggingface/home/in/this/node
Then you get a ray cluster of containers. Note that you need to keep the shells running these commands alive to hold the cluster. Any shell disconnect will terminate the cluster. Then you get a ray cluster of containers. Note that you need to keep the shells running these commands alive to hold the cluster. Any shell disconnect will terminate the cluster. In addition, please note that the argument ``ip_of_head_node`` should be the IP address of the head node, which is accessible by all the worker nodes. A common misunderstanding is to use the IP address of the worker node, which is not correct.
Then, on any node, use ``docker exec -it node /bin/bash`` to enter the container, execute ``ray status`` to check the status of the Ray cluster. You should see the right number of nodes and GPUs. Then, on any node, use ``docker exec -it node /bin/bash`` to enter the container, execute ``ray status`` to check the status of the Ray cluster. You should see the right number of nodes and GPUs.
...@@ -101,7 +101,7 @@ You can also use tensor parallel without pipeline parallel, just set the tensor ...@@ -101,7 +101,7 @@ You can also use tensor parallel without pipeline parallel, just set the tensor
To make tensor parallel performant, you should make sure the communication between nodes is efficient, e.g. using high-speed network cards like Infiniband. To correctly set up the cluster to use Infiniband, append additional arguments like ``--privileged -e NCCL_IB_HCA=mlx5`` to the ``run_cluster.sh`` script. Please contact your system administrator for more information on how to set up the flags. One way to confirm if the Infiniband is working is to run vLLM with ``NCCL_DEBUG=TRACE`` environment variable set, e.g. ``NCCL_DEBUG=TRACE vllm serve ...`` and check the logs for the NCCL version and the network used. If you find ``[send] via NET/Socket`` in the logs, it means NCCL uses raw TCP Socket, which is not efficient for cross-node tensor parallel. If you find ``[send] via NET/IB/GDRDMA`` in the logs, it means NCCL uses Infiniband with GPU-Direct RDMA, which is efficient. To make tensor parallel performant, you should make sure the communication between nodes is efficient, e.g. using high-speed network cards like Infiniband. To correctly set up the cluster to use Infiniband, append additional arguments like ``--privileged -e NCCL_IB_HCA=mlx5`` to the ``run_cluster.sh`` script. Please contact your system administrator for more information on how to set up the flags. One way to confirm if the Infiniband is working is to run vLLM with ``NCCL_DEBUG=TRACE`` environment variable set, e.g. ``NCCL_DEBUG=TRACE vllm serve ...`` and check the logs for the NCCL version and the network used. If you find ``[send] via NET/Socket`` in the logs, it means NCCL uses raw TCP Socket, which is not efficient for cross-node tensor parallel. If you find ``[send] via NET/IB/GDRDMA`` in the logs, it means NCCL uses Infiniband with GPU-Direct RDMA, which is efficient.
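
The log check described above can be scripted. The following is a minimal sketch, assuming the NCCL trace output has already been captured as text. The function name and return labels are made up for illustration, and the sample inputs contain only the two marker strings quoted above rather than real NCCL log lines.

.. code-block:: python

   def classify_nccl_transport(log_text: str) -> str:
       """Report which cross-node transport the captured log text indicates."""
       if "[send] via NET/IB/GDRDMA" in log_text:
           # Infiniband with GPU-Direct RDMA: efficient.
           return "infiniband-gdrdma"
       if "[send] via NET/Socket" in log_text:
           # Raw TCP socket: not efficient for cross-node tensor parallel.
           return "tcp-socket"
       # Neither marker was found in the text.
       return "unknown"


   print(classify_nccl_transport("[send] via NET/IB/GDRDMA"))
   print(classify_nccl_transport("[send] via NET/Socket"))
   print(classify_nccl_transport("no transport lines here"))
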
.. warning:: .. warning::
After you start the Ray cluster, you'd better also check the GPU-GPU communication between nodes. It can be non-trivial to set up. Please refer to the `sanity check script <https://docs.vllm.ai/en/latest/getting_started/debugging.html>`_ for more information. After you start the Ray cluster, you'd better also check the GPU-GPU communication between nodes. It can be non-trivial to set up. Please refer to the `sanity check script <https://docs.vllm.ai/en/latest/getting_started/debugging.html>`_ for more information. If you need to set some environment variables for the communication configuration, you can append them to the ``run_cluster.sh`` script, e.g. ``-e NCCL_SOCKET_IFNAME=eth0``. Note that setting environment variables in the shell (e.g. ``NCCL_SOCKET_IFNAME=eth0 vllm serve ...``) only works for the processes in the same node, not for the processes in the other nodes. Setting environment variables when you create the cluster is the recommended way. See the `discussion <https://github.com/vllm-project/vllm/issues/6803>`_ for more information.
.. warning:: .. warning::
......
...@@ -5,9 +5,9 @@ Deploying and scaling up with SkyPilot ...@@ -5,9 +5,9 @@ Deploying and scaling up with SkyPilot
.. raw:: html .. raw:: html
<p align="center"> <p align="center">
<img src="https://imgur.com/yxtzPEu.png" alt="vLLM"/> <img src="https://imgur.com/yxtzPEu.png" alt="vLLM"/>
</p> </p>
vLLM can be **run and scaled to multiple service replicas on clouds and Kubernetes** with `SkyPilot <https://github.com/skypilot-org/skypilot>`__, an open-source framework for running LLMs on any cloud. More examples for various open models, such as Llama-3, Mixtral, etc, can be found in `SkyPilot AI gallery <https://skypilot.readthedocs.io/en/latest/gallery/index.html>`__. vLLM can be **run and scaled to multiple service replicas on clouds and Kubernetes** with `SkyPilot <https://github.com/skypilot-org/skypilot>`__, an open-source framework for running LLMs on any cloud. More examples for various open models, such as Llama-3, Mixtral, etc, can be found in `SkyPilot AI gallery <https://skypilot.readthedocs.io/en/latest/gallery/index.html>`__.
...@@ -21,8 +21,8 @@ Prerequisites ...@@ -21,8 +21,8 @@ Prerequisites
.. code-block:: console .. code-block:: console
pip install skypilot-nightly pip install skypilot-nightly
sky check sky check
Run on a single instance Run on a single instance
...@@ -32,64 +32,64 @@ See the vLLM SkyPilot YAML for serving, `serving.yaml <https://github.com/skypil ...@@ -32,64 +32,64 @@ See the vLLM SkyPilot YAML for serving, `serving.yaml <https://github.com/skypil
.. code-block:: yaml .. code-block:: yaml
resources: resources:
accelerators: {L4, A10g, A10, L40, A40, A100, A100-80GB} # We can use cheaper accelerators for 8B model. accelerators: {L4, A10g, A10, L40, A40, A100, A100-80GB} # We can use cheaper accelerators for 8B model.
use_spot: True use_spot: True
disk_size: 512 # Ensure model checkpoints can fit. disk_size: 512 # Ensure model checkpoints can fit.
disk_tier: best disk_tier: best
ports: 8081 # Expose to internet traffic. ports: 8081 # Expose to internet traffic.
envs: envs:
MODEL_NAME: meta-llama/Meta-Llama-3-8B-Instruct MODEL_NAME: meta-llama/Meta-Llama-3-8B-Instruct
HF_TOKEN: <your-huggingface-token> # Change to your own huggingface token, or use --env to pass. HF_TOKEN: <your-huggingface-token> # Change to your own huggingface token, or use --env to pass.
setup: | setup: |
conda create -n vllm python=3.10 -y conda create -n vllm python=3.10 -y
conda activate vllm conda activate vllm
pip install vllm==0.4.0.post1 pip install vllm==0.4.0.post1
# Install Gradio for web UI. # Install Gradio for web UI.
pip install gradio openai pip install gradio openai
pip install flash-attn==2.5.7 pip install flash-attn==2.5.7
run: | run: |
conda activate vllm conda activate vllm
echo 'Starting vllm api server...' echo 'Starting vllm api server...'
python -u -m vllm.entrypoints.openai.api_server \ python -u -m vllm.entrypoints.openai.api_server \
--port 8081 \ --port 8081 \
--model $MODEL_NAME \ --model $MODEL_NAME \
--trust-remote-code \ --trust-remote-code \
--tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \ --tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
2>&1 | tee api_server.log & 2>&1 | tee api_server.log &
echo 'Waiting for vllm api server to start...' echo 'Waiting for vllm api server to start...'
while ! `cat api_server.log | grep -q 'Uvicorn running on'`; do sleep 1; done while ! `cat api_server.log | grep -q 'Uvicorn running on'`; do sleep 1; done
echo 'Starting gradio server...' echo 'Starting gradio server...'
git clone https://github.com/vllm-project/vllm.git || true git clone https://github.com/vllm-project/vllm.git || true
python vllm/examples/gradio_openai_chatbot_webserver.py \ python vllm/examples/gradio_openai_chatbot_webserver.py \
-m $MODEL_NAME \ -m $MODEL_NAME \
--port 8811 \ --port 8811 \
--model-url http://localhost:8081/v1 \ --model-url http://localhost:8081/v1 \
--stop-token-ids 128009,128001 --stop-token-ids 128009,128001
Start serving the Llama-3 8B model on any of the candidate GPUs listed (L4, A10g, ...): Start serving the Llama-3 8B model on any of the candidate GPUs listed (L4, A10g, ...):
.. code-block:: console .. code-block:: console
HF_TOKEN="your-huggingface-token" sky launch serving.yaml --env HF_TOKEN HF_TOKEN="your-huggingface-token" sky launch serving.yaml --env HF_TOKEN
Check the output of the command. There will be a shareable gradio link (like the last line of the following). Open it in your browser to use the LLaMA model to do the text completion. Check the output of the command. There will be a shareable gradio link (like the last line of the following). Open it in your browser to use the LLaMA model to do the text completion.
.. code-block:: console .. code-block:: console
(task, pid=7431) Running on public URL: https://<gradio-hash>.gradio.live (task, pid=7431) Running on public URL: https://<gradio-hash>.gradio.live
**Optional**: Serve the 70B model instead of the default 8B and use more GPU: **Optional**: Serve the 70B model instead of the default 8B and use more GPU:
.. code-block:: console .. code-block:: console
HF_TOKEN="your-huggingface-token" sky launch serving.yaml --gpus A100:8 --env HF_TOKEN --env MODEL_NAME=meta-llama/Meta-Llama-3-70B-Instruct HF_TOKEN="your-huggingface-token" sky launch serving.yaml --gpus A100:8 --env HF_TOKEN --env MODEL_NAME=meta-llama/Meta-Llama-3-70B-Instruct
Scale up to multiple replicas Scale up to multiple replicas
...@@ -99,151 +99,212 @@ SkyPilot can scale up the service to multiple service replicas with built-in aut ...@@ -99,151 +99,212 @@ SkyPilot can scale up the service to multiple service replicas with built-in aut
.. code-block:: yaml .. code-block:: yaml
service: service:
replicas: 2 replicas: 2
# An actual request for readiness probe. # An actual request for readiness probe.
readiness_probe: readiness_probe:
path: /v1/chat/completions path: /v1/chat/completions
post_data: post_data:
model: $MODEL_NAME model: $MODEL_NAME
messages: messages:
- role: user - role: user
content: Hello! What is your name? content: Hello! What is your name?
max_tokens: 1 max_tokens: 1
.. raw:: html .. raw:: html
<details> <details>
<summary>Click to see the full recipe YAML</summary> <summary>Click to see the full recipe YAML</summary>
.. code-block:: yaml .. code-block:: yaml
service: service:
replicas: 2 replicas: 2
# An actual request for readiness probe. # An actual request for readiness probe.
readiness_probe: readiness_probe:
path: /v1/chat/completions path: /v1/chat/completions
post_data: post_data:
model: $MODEL_NAME model: $MODEL_NAME
messages: messages:
- role: user - role: user
content: Hello! What is your name? content: Hello! What is your name?
max_tokens: 1 max_tokens: 1
resources: resources:
accelerators: {L4, A10g, A10, L40, A40, A100, A100-80GB} # We can use cheaper accelerators for 8B model. accelerators: {L4, A10g, A10, L40, A40, A100, A100-80GB} # We can use cheaper accelerators for 8B model.
use_spot: True use_spot: True
disk_size: 512 # Ensure model checkpoints can fit. disk_size: 512 # Ensure model checkpoints can fit.
disk_tier: best disk_tier: best
ports: 8081 # Expose to internet traffic. ports: 8081 # Expose to internet traffic.
envs: envs:
MODEL_NAME: meta-llama/Meta-Llama-3-8B-Instruct MODEL_NAME: meta-llama/Meta-Llama-3-8B-Instruct
HF_TOKEN: <your-huggingface-token> # Change to your own huggingface token, or use --env to pass. HF_TOKEN: <your-huggingface-token> # Change to your own huggingface token, or use --env to pass.
setup: | setup: |
conda create -n vllm python=3.10 -y conda create -n vllm python=3.10 -y
conda activate vllm conda activate vllm
pip install vllm==0.4.0.post1 pip install vllm==0.4.0.post1
# Install Gradio for web UI. # Install Gradio for web UI.
pip install gradio openai pip install gradio openai
pip install flash-attn==2.5.7 pip install flash-attn==2.5.7
run: | run: |
conda activate vllm conda activate vllm
echo 'Starting vllm api server...' echo 'Starting vllm api server...'
python -u -m vllm.entrypoints.openai.api_server \ python -u -m vllm.entrypoints.openai.api_server \
--port 8081 \ --port 8081 \
--model $MODEL_NAME \ --model $MODEL_NAME \
--trust-remote-code \ --trust-remote-code \
--tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \ --tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
2>&1 | tee api_server.log & 2>&1 | tee api_server.log
echo 'Waiting for vllm api server to start...'
while ! `cat api_server.log | grep -q 'Uvicorn running on'`; do sleep 1; done
echo 'Starting gradio server...'
git clone https://github.com/vllm-project/vllm.git || true
python vllm/examples/gradio_openai_chatbot_webserver.py \
-m $MODEL_NAME \
--port 8811 \
--model-url http://localhost:8081/v1 \
--stop-token-ids 128009,128001
.. raw:: html .. raw:: html
</details> </details>
Start serving the Llama-3 8B model on multiple replicas: Start serving the Llama-3 8B model on multiple replicas:
.. code-block:: console .. code-block:: console
HF_TOKEN="your-huggingface-token" sky serve up -n vllm serving.yaml --env HF_TOKEN HF_TOKEN="your-huggingface-token" sky serve up -n vllm serving.yaml --env HF_TOKEN
Wait until the service is ready: Wait until the service is ready:
.. code-block:: console .. code-block:: console
watch -n10 sky serve status vllm watch -n10 sky serve status vllm
.. raw:: html .. raw:: html
<details> <details>
<summary>Example outputs:</summary> <summary>Example outputs:</summary>
.. code-block:: console .. code-block:: console
Services Services
NAME VERSION UPTIME STATUS REPLICAS ENDPOINT NAME VERSION UPTIME STATUS REPLICAS ENDPOINT
vllm 1 35s READY 2/2 xx.yy.zz.100:30001 vllm 1 35s READY 2/2 xx.yy.zz.100:30001
Service Replicas Service Replicas
SERVICE_NAME ID VERSION IP LAUNCHED RESOURCES STATUS REGION SERVICE_NAME ID VERSION IP LAUNCHED RESOURCES STATUS REGION
vllm 1 1 xx.yy.zz.121 18 mins ago 1x GCP({'L4': 1}) READY us-east4 vllm 1 1 xx.yy.zz.121 18 mins ago 1x GCP([Spot]{'L4': 1}) READY us-east4
vllm 2 1 xx.yy.zz.245 18 mins ago 1x GCP({'L4': 1}) READY us-east4 vllm 2 1 xx.yy.zz.245 18 mins ago 1x GCP([Spot]{'L4': 1}) READY us-east4
.. raw:: html .. raw:: html
</details> </details>
After the service is READY, you can find a single endpoint for the service and access the service with the endpoint: After the service is READY, you can find a single endpoint for the service and access the service with the endpoint:
.. code-block:: console .. code-block:: console
ENDPOINT=$(sky serve status --endpoint 8081 vllm) ENDPOINT=$(sky serve status --endpoint 8081 vllm)
curl -L http://$ENDPOINT/v1/chat/completions \ curl -L http://$ENDPOINT/v1/chat/completions \
-H "Content-Type: application/json" \ -H "Content-Type: application/json" \
-d '{ -d '{
"model": "meta-llama/Meta-Llama-3-8B-Instruct", "model": "meta-llama/Meta-Llama-3-8B-Instruct",
"messages": [ "messages": [
{ {
"role": "system", "role": "system",
"content": "You are a helpful assistant." "content": "You are a helpful assistant."
}, },
{ {
"role": "user", "role": "user",
"content": "Who are you?" "content": "Who are you?"
} }
], ],
"stop_token_ids": [128009, 128001] "stop_token_ids": [128009, 128001]
}' }'
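
For readers who prefer Python, the same request can be assembled with the standard library. This is a sketch only: it builds the request object without sending it, the helper name ``build_chat_request`` is hypothetical, and the endpoint value is the placeholder shown in the example output rather than a real address.

.. code-block:: python

   import json
   import urllib.request


   def build_chat_request(endpoint: str) -> urllib.request.Request:
       """Build (but do not send) a POST request matching the curl call."""
       payload = {
           "model": "meta-llama/Meta-Llama-3-8B-Instruct",
           "messages": [
               {"role": "system", "content": "You are a helpful assistant."},
               {"role": "user", "content": "Who are you?"},
           ],
           "stop_token_ids": [128009, 128001],
       }
       return urllib.request.Request(
           f"http://{endpoint}/v1/chat/completions",
           data=json.dumps(payload).encode("utf-8"),
           headers={"Content-Type": "application/json"},
           method="POST",
       )


   # Placeholder endpoint; substitute the output of `sky serve status --endpoint`.
   request = build_chat_request("xx.yy.zz.100:30001")
   print(request.get_method(), request.full_url)

Passing the object to ``urllib.request.urlopen`` would send it once a real endpoint is substituted.
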
To enable autoscaling, you could specify additional configs in `services`: To enable autoscaling, you could replace the `replicas` with the following configs in `service`:
.. code-block:: yaml .. code-block:: yaml
services: service:
replica_policy: replica_policy:
min_replicas: 0 min_replicas: 2
max_replicas: 3 max_replicas: 4
target_qps_per_replica: 2 target_qps_per_replica: 2
This will scale the service up when the QPS exceeds 2 for each replica. This will scale the service up when the QPS exceeds 2 for each replica.
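
As a rough mental model of this policy, the desired replica count can be thought of as the total load divided by the per-replica target, clamped to the configured bounds. The sketch below is a simplified illustration under that assumption and is not SkyPilot's actual autoscaling algorithm. The function name is hypothetical, while the default values mirror the YAML above.

.. code-block:: python

   import math


   def desired_replicas(total_qps: float,
                        target_qps_per_replica: float = 2,
                        min_replicas: int = 2,
                        max_replicas: int = 4) -> int:
       """Clamp ceil(total_qps / target) to [min_replicas, max_replicas]."""
       needed = math.ceil(total_qps / target_qps_per_replica)
       return max(min_replicas, min(max_replicas, needed))


   for qps in (1, 5, 20):
       print(qps, desired_replicas(qps))

Under these assumptions, a light load stays at the minimum of 2 replicas, a load of 5 QPS asks for 3, and a heavy load is capped at 4.
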
.. raw:: html
<details>
<summary>Click to see the full recipe YAML</summary>
.. code-block:: yaml
service:
replica_policy:
min_replicas: 2
max_replicas: 4
target_qps_per_replica: 2
# An actual request for readiness probe.
readiness_probe:
path: /v1/chat/completions
post_data:
model: $MODEL_NAME
messages:
- role: user
content: Hello! What is your name?
max_tokens: 1
resources:
accelerators: {L4, A10g, A10, L40, A40, A100, A100-80GB} # We can use cheaper accelerators for 8B model.
use_spot: True
disk_size: 512 # Ensure model checkpoints can fit.
disk_tier: best
ports: 8081 # Expose to internet traffic.
envs:
MODEL_NAME: meta-llama/Meta-Llama-3-8B-Instruct
HF_TOKEN: <your-huggingface-token> # Change to your own huggingface token, or use --env to pass.
setup: |
conda create -n vllm python=3.10 -y
conda activate vllm
pip install vllm==0.4.0.post1
# Install Gradio for web UI.
pip install gradio openai
pip install flash-attn==2.5.7
run: |
conda activate vllm
echo 'Starting vllm api server...'
python -u -m vllm.entrypoints.openai.api_server \
--port 8081 \
--model $MODEL_NAME \
--trust-remote-code \
--tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
2>&1 | tee api_server.log
.. raw:: html
</details>
To update the service with the new config:
.. code-block:: console
HF_TOKEN="your-huggingface-token" sky serve update vllm serving.yaml --env HF_TOKEN
To stop the service:
.. code-block:: console
sky serve down vllm
**Optional**: Connect a GUI to the endpoint **Optional**: Connect a GUI to the endpoint
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
...@@ -253,58 +314,53 @@ It is also possible to access the Llama-3 service with a separate GUI frontend, ...@@ -253,58 +314,53 @@ It is also possible to access the Llama-3 service with a separate GUI frontend,
.. raw:: html .. raw:: html
<details> <details>
<summary>Click to see the full GUI YAML</summary> <summary>Click to see the full GUI YAML</summary>
.. code-block:: yaml .. code-block:: yaml
envs: envs:
MODEL_NAME: meta-llama/Meta-Llama-3-70B-Instruct MODEL_NAME: meta-llama/Meta-Llama-3-8B-Instruct
ENDPOINT: x.x.x.x:3031 # Address of the API server running vllm. ENDPOINT: x.x.x.x:3031 # Address of the API server running vllm.
resources: resources:
cpus: 2 cpus: 2
setup: | setup: |
conda activate vllm conda create -n vllm python=3.10 -y
if [ $? -ne 0 ]; then conda activate vllm
conda create -n vllm python=3.10 -y
conda activate vllm # Install Gradio for web UI.
fi pip install gradio openai
# Install Gradio for web UI. run: |
pip install gradio openai conda activate vllm
export PATH=$PATH:/sbin
run: |
conda activate vllm echo 'Starting gradio server...'
export PATH=$PATH:/sbin git clone https://github.com/vllm-project/vllm.git || true
WORKER_IP=$(hostname -I | cut -d' ' -f1) python vllm/examples/gradio_openai_chatbot_webserver.py \
CONTROLLER_PORT=21001 -m $MODEL_NAME \
WORKER_PORT=21002 --port 8811 \
--model-url http://$ENDPOINT/v1 \
echo 'Starting gradio server...' --stop-token-ids 128009,128001 | tee ~/gradio.log
git clone https://github.com/vllm-project/vllm.git || true
python vllm/examples/gradio_openai_chatbot_webserver.py \
-m $MODEL_NAME \
--port 8811 \
--model-url http://$ENDPOINT/v1 \
--stop-token-ids 128009,128001 | tee ~/gradio.log
.. raw:: html .. raw:: html
</details> </details>
1. Start the chat web UI: 1. Start the chat web UI:
.. code-block:: console .. code-block:: console
sky launch -c gui ./gui.yaml --env ENDPOINT=$(sky serve status --endpoint vllm) sky launch -c gui ./gui.yaml --env ENDPOINT=$(sky serve status --endpoint vllm)
2. Then, we can access the GUI at the returned gradio link: 2. Then, we can access the GUI at the returned gradio link:
.. code-block:: console .. code-block:: console
| INFO | stdout | Running on public URL: https://6141e84201ce0bb4ed.gradio.live | INFO | stdout | Running on public URL: https://6141e84201ce0bb4ed.gradio.live
...@@ -31,7 +31,10 @@ def post_http_request(prompt: str, ...@@ -31,7 +31,10 @@ def post_http_request(prompt: str,
"max_tokens": 16, "max_tokens": 16,
"stream": stream, "stream": stream,
} }
response = requests.post(api_url, headers=headers, json=pload, stream=True) response = requests.post(api_url,
headers=headers,
json=pload,
stream=stream)
return response return response
......
...@@ -16,7 +16,7 @@ ...@@ -16,7 +16,7 @@
#### Run on H100 system for speed if FP8; number of GPUs depends on the model size #### Run on H100 system for speed if FP8; number of GPUs depends on the model size
#### Example: quantize Llama2-7b model from HF to FP8 with FP8 KV Cache: #### Example: quantize Llama2-7b model from HF to FP8 with FP8 KV Cache:
`python quantize.py --model_dir ./ll2-7b --dtype float16 --qformat fp8 --kv_cache_dtype fp8 --output_dir ./ll2_7b_fp8 --calib_size 512 --tp_size 1` `python quantize.py --model-dir ./ll2-7b --dtype float16 --qformat fp8 --kv-cache-dtype fp8 --output-dir ./ll2_7b_fp8 --calib-size 512 --tp-size 1`
Outputs: model structure, quantized model & parameters (with scaling factors) are in JSON and Safetensors (npz is generated only for the reference) Outputs: model structure, quantized model & parameters (with scaling factors) are in JSON and Safetensors (npz is generated only for the reference)
``` ```
......
import requests
from PIL import Image

from vllm import LLM, SamplingParams


def run_fuyu():
    llm = LLM(model="adept/fuyu-8b", max_model_len=4096)

    # single-image prompt
    prompt = "What is the highest life expectancy at of male?\n"
    url = "https://huggingface.co/adept/fuyu-8b/resolve/main/chart.png"
    image = Image.open(requests.get(url, stream=True).raw)
    sampling_params = SamplingParams(temperature=0, max_tokens=64)

    outputs = llm.generate(
        {
            "prompt": prompt,
            "multi_modal_data": {
                "image": image
            },
        },
        sampling_params=sampling_params)

    for o in outputs:
        generated_text = o.outputs[0].text
        print(generated_text)


if __name__ == "__main__":
    run_fuyu()
from vllm import LLM
from vllm.assets.image import ImageAsset


def run_llava():
    llm = LLM(model="llava-hf/llava-1.5-7b-hf")

    prompt = "USER: <image>\nWhat is the content of this image?\nASSISTANT:"

    image = ImageAsset("stop_sign").pil_image

    outputs = llm.generate({
        "prompt": prompt,
        "multi_modal_data": {
            "image": image
        },
    })

    for o in outputs:
        generated_text = o.outputs[0].text
        print(generated_text)


if __name__ == "__main__":
    run_llava()
from io import BytesIO

import requests
from PIL import Image

from vllm import LLM, SamplingParams


def run_llava_next():
    llm = LLM(model="llava-hf/llava-v1.6-mistral-7b-hf", max_model_len=4096)

    prompt = "[INST] <image>\nWhat is shown in this image? [/INST]"
    url = "https://h2o-release.s3.amazonaws.com/h2ogpt/bigben.jpg"
    image = Image.open(BytesIO(requests.get(url).content))
    sampling_params = SamplingParams(temperature=0.8,
                                     top_p=0.95,
                                     max_tokens=100)

    outputs = llm.generate(
        {
            "prompt": prompt,
            "multi_modal_data": {
                "image": image
            }
        },
        sampling_params=sampling_params)

    generated_text = ""
    for o in outputs:
        generated_text += o.outputs[0].text

    print(f"LLM output:{generated_text}")


if __name__ == "__main__":
    run_llava_next()
"""
This example shows how to use vLLM for running offline inference
with the correct prompt format on vision language models.
For most models, the prompt format should follow corresponding examples
on HuggingFace model repository.
"""
from transformers import AutoTokenizer

from vllm import LLM, SamplingParams
from vllm.assets.image import ImageAsset
from vllm.utils import FlexibleArgumentParser

# Input image and question
image = ImageAsset("cherry_blossom").pil_image.convert("RGB")
question = "What is the content of this image?"


# LLaVA-1.5
def run_llava(question):
    prompt = f"USER: <image>\n{question}\nASSISTANT:"
    llm = LLM(model="llava-hf/llava-1.5-7b-hf")
    return llm, prompt


# LLaVA-1.6/LLaVA-NeXT
def run_llava_next(question):
    prompt = f"[INST] <image>\n{question} [/INST]"
    llm = LLM(model="llava-hf/llava-v1.6-mistral-7b-hf")
    return llm, prompt


# Fuyu
def run_fuyu(question):
    prompt = f"{question}\n"
    llm = LLM(model="adept/fuyu-8b")
    return llm, prompt


# Phi-3-Vision
def run_phi3v(question):
    prompt = f"<|user|>\n<|image_1|>\n{question}<|end|>\n<|assistant|>\n"  # noqa: E501
    # Note: The default setting of max_num_seqs (256) and
    # max_model_len (128k) for this model may cause OOM.
    # You may lower either to run this example on lower-end GPUs.

    # In this example, we override max_num_seqs to 5 while
    # keeping the original context length of 128k.
    llm = LLM(
        model="microsoft/Phi-3-vision-128k-instruct",
        trust_remote_code=True,
        max_num_seqs=5,
    )
    return llm, prompt


# PaliGemma
def run_paligemma(question):
    # PaliGemma has special prompt format for VQA
    prompt = "caption en"
    llm = LLM(model="google/paligemma-3b-mix-224")
    return llm, prompt


# Chameleon
def run_chameleon(question):
    prompt = f"{question}<image>"
    llm = LLM(model="facebook/chameleon-7b")
    return llm, prompt


# MiniCPM-V
def run_minicpmv(question):
    # 2.0
    # The official repo doesn't work yet, so we need to use a fork for now
    # For more details, please see: https://github.com/vllm-project/vllm/pull/4087#issuecomment-2250397630 # noqa
    # model_name = "HwwwH/MiniCPM-V-2"

    # 2.5
    model_name = "openbmb/MiniCPM-Llama3-V-2_5"
    tokenizer = AutoTokenizer.from_pretrained(model_name,
                                              trust_remote_code=True)
    llm = LLM(
        model=model_name,
        trust_remote_code=True,
    )

    messages = [{
        'role': 'user',
        'content': f'(<image>./</image>)\n{question}'
    }]
    prompt = tokenizer.apply_chat_template(messages,
                                           tokenize=False,
                                           add_generation_prompt=True)
    return llm, prompt


# InternVL
def run_internvl(question):
    # Generally, InternVL can use chatml template for conversation
    TEMPLATE = "<|im_start|>User\n{prompt}<|im_end|>\n<|im_start|>Assistant\n"
    prompt = f"<image>\n{question}\n"
    prompt = TEMPLATE.format(prompt=prompt)
    llm = LLM(
        model="OpenGVLab/InternVL2-4B",
        trust_remote_code=True,
        max_num_seqs=5,
    )
    return llm, prompt


# BLIP-2
def run_blip2(question):
    # BLIP-2 prompt format is inaccurate on HuggingFace model repository.
    # See https://huggingface.co/Salesforce/blip2-opt-2.7b/discussions/15#64ff02f3f8cf9e4f5b038262 #noqa
    prompt = f"Question: {question} Answer:"
    llm = LLM(model="Salesforce/blip2-opt-2.7b")
    return llm, prompt


model_example_map = {
    "llava": run_llava,
    "llava-next": run_llava_next,
    "fuyu": run_fuyu,
    "phi3_v": run_phi3v,
    "paligemma": run_paligemma,
    "chameleon": run_chameleon,
    "minicpmv": run_minicpmv,
    "blip-2": run_blip2,
    "internvl_chat": run_internvl,
}


def main(args):
    model = args.model_type
    if model not in model_example_map:
        raise ValueError(f"Model type {model} is not supported.")

    llm, prompt = model_example_map[model](question)

    # We set temperature to 0.2 so that outputs can be different
    # even when all prompts are identical when running batch inference.
    sampling_params = SamplingParams(temperature=0.2, max_tokens=64)

    assert args.num_prompts > 0
    if args.num_prompts == 1:
        # Single inference
        inputs = {
            "prompt": prompt,
            "multi_modal_data": {
                "image": image
            },
        }
    else:
        # Batch inference
        inputs = [{
            "prompt": prompt,
            "multi_modal_data": {
                "image": image
            },
        } for _ in range(args.num_prompts)]

    outputs = llm.generate(inputs, sampling_params=sampling_params)

    for o in outputs:
        generated_text = o.outputs[0].text
        print(generated_text)


if __name__ == "__main__":
    parser = FlexibleArgumentParser(
        description='Demo on using vLLM for offline inference with '
        'vision language models')
    parser.add_argument('--model-type',
                        '-m',
                        type=str,
                        default="llava",
                        choices=model_example_map.keys(),
                        help='Huggingface "model_type".')
    parser.add_argument('--num-prompts',
                        type=int,
                        default=1,
                        help='Number of prompts to run.')

    args = parser.parse_args()
    main(args)
...@@ -13,11 +13,14 @@ client = OpenAI( ...@@ -13,11 +13,14 @@ client = OpenAI(
models = client.models.list() models = client.models.list()
model = models.data[0].id model = models.data[0].id
responses = client.embeddings.create(input=[ responses = client.embeddings.create(
"Hello my name is", input=[
"The best thing about vLLM is that it supports many different models" "Hello my name is",
], "The best thing about vLLM is that it supports many different models"
model=model) ],
model=model,
encoding_format="float",
)
for data in responses.data: for data in responses.data:
print(data.embedding) # list of float of len 4096 print(data.embedding) # list of float of len 4096
@@ -42,6 +42,7 @@ chat_completion_from_url = client.chat.completions.create(
         ],
     }],
     model=model,
+    max_tokens=64,
 )
 result = chat_completion_from_url.choices[0].message.content
@@ -78,6 +79,7 @@ chat_completion_from_base64 = client.chat.completions.create(
         ],
     }],
     model=model,
+    max_tokens=64,
 )
 result = chat_completion_from_base64.choices[0].message.content
......
from vllm import LLM
from vllm.assets.image import ImageAsset


def run_paligemma():
    llm = LLM(model="google/paligemma-3b-mix-224")

    prompt = "caption es"

    image = ImageAsset("stop_sign").pil_image

    outputs = llm.generate({
        "prompt": prompt,
        "multi_modal_data": {
            "image": image
        },
    })

    for o in outputs:
        generated_text = o.outputs[0].text
        print(generated_text)


if __name__ == "__main__":
    run_paligemma()
from vllm import LLM, SamplingParams
from vllm.assets.image import ImageAsset


def run_phi3v():
    model_path = "microsoft/Phi-3-vision-128k-instruct"

    # Note: The default setting of max_num_seqs (256) and
    # max_model_len (128k) for this model may cause OOM.
    # You may lower either to run this example on lower-end GPUs.
    # In this example, we override max_num_seqs to 5 while
    # keeping the original context length of 128k.
    llm = LLM(
        model=model_path,
        trust_remote_code=True,
        max_num_seqs=5,
    )

    image = ImageAsset("cherry_blossom").pil_image

    # single-image prompt
    prompt = "<|user|>\n<|image_1|>\nWhat is the season?<|end|>\n<|assistant|>\n"  # noqa: E501
    sampling_params = SamplingParams(temperature=0, max_tokens=64)

    outputs = llm.generate(
        {
            "prompt": prompt,
            "multi_modal_data": {
                "image": image
            },
        },
        sampling_params=sampling_params)
    for o in outputs:
        generated_text = o.outputs[0].text
        print(generated_text)


if __name__ == "__main__":
    run_phi3v()
{%- for message in messages -%}
{%- if message['role'] == 'user' -%}
{{- 'Question: ' + message['content'] + ' ' -}}
{%- elif message['role'] == 'assistant' -%}
{{- 'Answer: ' + message['content'] + ' ' -}}
{%- endif -%}
{%- endfor -%}
{%- if add_generation_prompt -%}
{{- 'Answer:' -}}
{% endif %}
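As a rough illustration of what the chat template above produces, the following plain-Python sketch mirrors its concatenation logic. The function name `render_question_answer_prompt` is hypothetical and not part of vLLM; the real template is rendered by Jinja, and this sketch assumes the `-` whitespace-control markers strip all whitespace between blocks.

```python
def render_question_answer_prompt(messages, add_generation_prompt=True):
    """Approximate the Question/Answer chat template in plain Python.

    Hypothetical helper for illustration only; it is not a vLLM API.
    """
    parts = []
    for message in messages:
        if message["role"] == "user":
            parts.append("Question: " + message["content"] + " ")
        elif message["role"] == "assistant":
            parts.append("Answer: " + message["content"] + " ")
        # Any other role (e.g. "system") is skipped, as in the template.
    if add_generation_prompt:
        # Cue the model to start answering.
        parts.append("Answer:")
    return "".join(parts)


if __name__ == "__main__":
    conversation = [
        {"role": "user", "content": "What is in the image?"},
        {"role": "assistant", "content": "A stop sign."},
        {"role": "user", "content": "What color is it?"},
    ]
    print(render_question_answer_prompt(conversation))
```

Each turn is flattened into a single line of text, so a multi-turn conversation becomes one `Question: ... Answer: ...` string ending with an open `Answer:` cue when `add_generation_prompt` is set.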
@@ -96,23 +96,20 @@ echo 'vLLM yapf: Done'
 # Run mypy
 echo 'vLLM mypy:'
-mypy tests --config-file pyproject.toml
-mypy vllm/*.py --config-file pyproject.toml
-mypy vllm/attention --config-file pyproject.toml
-mypy vllm/core --config-file pyproject.toml
-mypy vllm/distributed --config-file pyproject.toml
-mypy vllm/engine --config-file pyproject.toml
-mypy vllm/entrypoints --config-file pyproject.toml
-mypy vllm/executor --config-file pyproject.toml
-mypy vllm/logging --config-file pyproject.toml
-mypy vllm/lora --config-file pyproject.toml
-mypy vllm/model_executor --config-file pyproject.toml
-mypy vllm/multimodal --config-file pyproject.toml
-mypy vllm/prompt_adapter --config-file pyproject.toml
-mypy vllm/spec_decode --config-file pyproject.toml
-mypy vllm/transformers_utils --config-file pyproject.toml
-mypy vllm/usage --config-file pyproject.toml
-mypy vllm/worker --config-file pyproject.toml
+mypy --follow-imports skip  # Note that this is less strict than CI
+mypy tests --follow-imports skip
+mypy vllm/attention --follow-imports skip
+mypy vllm/core --follow-imports skip
+mypy vllm/distributed --follow-imports skip
+mypy vllm/engine --follow-imports skip
+mypy vllm/entrypoints --follow-imports skip
+mypy vllm/executor --follow-imports skip
+mypy vllm/lora --follow-imports skip
+mypy vllm/model_executor --follow-imports skip
+mypy vllm/prompt_adapter --follow-imports skip
+mypy vllm/spec_decode --follow-imports skip
+mypy vllm/worker --follow-imports skip
+echo 'vLLM mypy: Done'
 # If git diff returns a file that is in the skip list, the file may be checked anyway:
@@ -131,7 +128,7 @@ spell_check_all(){
   codespell --toml pyproject.toml "${CODESPELL_EXCLUDES[@]}"
 }
 # Spelling check of files that differ from main branch.
 spell_check_changed() {
     # The `if` guard ensures that the list of filenames is not empty, which
     # could cause ruff to receive 0 positional arguments, making it hang
@@ -245,12 +242,6 @@ echo 'vLLM isort: Done'
 # NOTE: Keep up to date with .github/workflows/clang-format.yml
 CLANG_FORMAT_EXCLUDES=(
     'csrc/moe/topk_softmax_kernels.cu'
-    'csrc/punica/bgmv/bgmv_bf16_bf16_bf16.cu'
-    'csrc/punica/bgmv/bgmv_config.h'
-    'csrc/punica/bgmv/bgmv_impl.cuh'
-    'csrc/punica/bgmv/vec_dtypes.cuh'
-    'csrc/punica/punica_ops.cu'
-    'csrc/punica/type_convert.h'
 )
 # Format specified files with clang-format
......
@@ -5,7 +5,7 @@ requires = [
     "ninja",
     "packaging",
     "setuptools >= 49.4.0",
-    "torch == 2.3.1",
+    "torch == 2.4.0",
     "wheel",
 ]
 build-backend = "setuptools.build_meta"
@@ -48,9 +48,22 @@ python_version = "3.8"
 ignore_missing_imports = true
 check_untyped_defs = true
-follow_imports = "skip"
+follow_imports = "silent"
-files = "vllm"
+# After fixing type errors resulting from follow_imports: "skip" -> "silent",
+# move the directory here and remove it from format.sh and mypy.yaml
+files = [
+    "vllm/*.py",
+    "vllm/adapter_commons",
+    "vllm/assets",
+    "vllm/inputs",
+    "vllm/logging",
+    "vllm/multimodal",
+    "vllm/platforms",
+    "vllm/transformers_utils",
+    "vllm/triton_utils",
+    "vllm/usage",
+]
 # TODO(woosuk): Include the code from Megatron and HuggingFace.
 exclude = [
     "vllm/model_executor/parallel_utils/|vllm/model_executor/models/",
......
# Dependencies for Ray accelerated DAG
cupy-cuda12x
ray >= 2.32
\ No newline at end of file
@@ -3,5 +3,5 @@ cmake>=3.21
 ninja
 packaging
 setuptools>=49.4.0
-torch==2.3.1
+torch==2.4.0
 wheel
@@ -6,7 +6,7 @@ numpy < 2.0.0
 requests
 tqdm
 py-cpuinfo
-transformers >= 4.42.4 # Required for Gemma 2 and for additional chat template parameters.
+transformers >= 4.43.2 # Required for Chameleon and Llama 3.1 hotfix.
 tokenizers >= 0.19.1 # Required for Llama 3.
 fastapi
 aiohttp
......