Commit e7c1b7f3 authored by zhuwenwen

Merge branch 'v0.5.4-dtk24.04.1'

parents 7462218e 04c62b93
.. _bits_and_bytes:
BitsAndBytes
==================
vLLM now supports `BitsAndBytes <https://github.com/TimDettmers/bitsandbytes>`_ for more efficient model inference.
BitsAndBytes quantizes models to reduce memory usage and enhance performance without significantly sacrificing accuracy.
Compared to other quantization methods, BitsAndBytes eliminates the need for calibrating the quantized model with input data.
Below are the steps to utilize BitsAndBytes with vLLM.
.. code-block:: console

   $ pip install "bitsandbytes>=0.42.0"
vLLM reads the model's config file and supports both in-flight quantization and pre-quantized checkpoints.
You can find bitsandbytes-quantized models at https://huggingface.co/models?other=bitsandbytes.
These repositories usually include a ``config.json`` file with a ``quantization_config`` section.
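As a rough illustration of the point above, the sketch below checks whether a parsed ``config.json`` declares a bitsandbytes ``quantization_config``. The helper name ``is_bnb_prequantized`` and both sample configs are hypothetical, and the ``quant_method`` and ``load_in_4bit`` keys are assumed to follow the layout that Hugging Face Transformers commonly writes; inspect the actual file in the repository you use.

.. code-block:: python

   import json


   def is_bnb_prequantized(config: dict) -> bool:
       """Return True if the config declares a bitsandbytes quantization_config."""
       quant_cfg = config.get("quantization_config")
       if not isinstance(quant_cfg, dict):
           # No quantization_config section: treat as an unquantized checkpoint.
           return False
       # Assumed layout: "quant_method" names the quantizer.
       return quant_cfg.get("quant_method") == "bitsandbytes"


   # Hypothetical config.json contents, inlined so the sketch runs offline.
   prequantized = json.loads(
       '{"model_type": "llama", '
       '"quantization_config": {"quant_method": "bitsandbytes", "load_in_4bit": true}}'
   )
   unquantized = json.loads('{"model_type": "llama"}')

   print(is_bnb_prequantized(prequantized))
   print(is_bnb_prequantized(unquantized))

A checkpoint without the section is a candidate for in-flight quantization instead.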
Read quantized checkpoint
-------------------------
.. code-block:: python

   from vllm import LLM
   import torch

   # unsloth/tinyllama-bnb-4bit is a pre-quantized checkpoint.
   model_id = "unsloth/tinyllama-bnb-4bit"
   llm = LLM(model=model_id, dtype=torch.bfloat16, trust_remote_code=True,
             quantization="bitsandbytes", load_format="bitsandbytes")
In-flight quantization: load as 4-bit quantization
--------------------------------------------------
.. code-block:: python

   from vllm import LLM
   import torch

   # huggyllama/llama-7b is not pre-quantized; it is quantized while loading.
   model_id = "huggyllama/llama-7b"
   llm = LLM(model=model_id, dtype=torch.bfloat16, trust_remote_code=True,
             quantization="bitsandbytes", load_format="bitsandbytes")
@@ -3,7 +3,10 @@
FP8
==================
vLLM supports FP8 (8-bit floating point) weight and activation quantization using hardware acceleration on GPUs such as Nvidia H100 and AMD MI300x.
Currently, only Hopper and Ada Lovelace GPUs are officially supported for W8A8.
Ampere GPUs are supported for W8A16 (weight-only FP8) utilizing Marlin kernels.
Quantization of models with FP8 allows for a 2x reduction in model memory requirements and up to a 1.6x improvement in throughput with minimal impact on accuracy.
Please visit the HF collection of `quantized FP8 checkpoints of popular LLMs ready to use with vLLM <https://huggingface.co/collections/neuralmagic/fp8-llms-for-vllm-666742ed2b78b7ac8df13127>`_.
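To make the 2x memory figure concrete, here is a back-of-the-envelope sketch that compares weight storage at 16 bits and 8 bits per parameter. It assumes a hypothetical 7-billion-parameter model and counts weights only, ignoring activations, the KV cache, and any quantization scales, so real savings will differ somewhat.

.. code-block:: python

   def weight_memory_gib(num_params: float, bits_per_weight: int) -> float:
       """Approximate memory needed for the model weights alone, in GiB."""
       return num_params * bits_per_weight / 8 / 2**30


   num_params = 7e9  # hypothetical 7B-parameter model
   fp16_gib = weight_memory_gib(num_params, 16)
   fp8_gib = weight_memory_gib(num_params, 8)

   print(f"16-bit weights: {fp16_gib:.1f} GiB")
   print(f"FP8 weights:    {fp8_gib:.1f} GiB")
   print(f"reduction:      {fp16_gib / fp8_gib:.1f}x")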
......
.. _supported_hardware_for_quantization:
Supported Hardware for Quantization Kernels
===========================================
The table below shows the compatibility of various quantization implementations with different hardware platforms in vLLM:
============== ====== ======= ======= ===== ====== ======= ========= ======= ============== ==========
Implementation Volta  Turing  Ampere  Ada   Hopper AMD GPU Intel GPU x86 CPU AWS Inferentia Google TPU
============== ====== ======= ======= ===== ====== ======= ========= ======= ============== ==========
AQLM           ✅      ✅       ✅       ✅     ✅      ❌       ❌         ❌       ❌              ❌
AWQ            ❌      ✅       ✅       ✅     ✅      ❌       ❌         ❌       ❌              ❌
DeepSpeedFP    ✅      ✅       ✅       ✅     ✅      ❌       ❌         ❌       ❌              ❌
FP8            ❌      ❌       ✅       ✅     ✅      ❌       ❌         ❌       ❌              ❌
Marlin         ❌      ❌       ✅       ✅     ✅      ❌       ❌         ❌       ❌              ❌
GPTQ           ✅      ✅       ✅       ✅     ✅      ❌       ❌         ❌       ❌              ❌
SqueezeLLM     ✅      ✅       ✅       ✅     ✅      ❌       ❌         ❌       ❌              ❌
bitsandbytes   ✅      ✅       ✅       ✅     ✅      ❌       ❌         ❌       ❌              ❌
============== ====== ======= ======= ===== ====== ======= ========= ======= ============== ==========
Notes:
^^^^^^
- Volta refers to SM 7.0, Turing to SM 7.5, Ampere to SM 8.0/8.6, Ada to SM 8.9, and Hopper to SM 9.0.
- "✅" indicates that the quantization method is supported on the specified hardware.
- "❌" indicates that the quantization method is not supported on the specified hardware.
Please note that this compatibility chart may be subject to change as vLLM continues to evolve and expand its support for different hardware platforms and quantization methods.
For the most up-to-date information on hardware support and quantization methods, please check the `quantization directory <https://github.com/vllm-project/vllm/tree/main/vllm/model_executor/layers/quantization>`_ or consult with the vLLM development team.
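The chart and its notes can be turned into a small lookup, sketched below. The dictionaries are a hand transcription of the NVIDIA columns above, and the helper name ``is_supported`` is hypothetical; it is not part of vLLM, and the transcription will go stale as the chart changes.

.. code-block:: python

   # Compute capability (major, minor) -> architecture name, per the notes above.
   SM_TO_ARCH = {
       (7, 0): "Volta",
       (7, 5): "Turing",
       (8, 0): "Ampere",
       (8, 6): "Ampere",
       (8, 9): "Ada",
       (9, 0): "Hopper",
   }

   ALL_ARCHS = {"Volta", "Turing", "Ampere", "Ada", "Hopper"}

   # NVIDIA columns of the chart, transcribed by hand.
   SUPPORTED_ARCHS = {
       "AQLM": ALL_ARCHS,
       "AWQ": ALL_ARCHS - {"Volta"},
       "DeepSpeedFP": ALL_ARCHS,
       "FP8": {"Ampere", "Ada", "Hopper"},
       "Marlin": {"Ampere", "Ada", "Hopper"},
       "GPTQ": ALL_ARCHS,
       "SqueezeLLM": ALL_ARCHS,
       "bitsandbytes": ALL_ARCHS,
   }


   def is_supported(method: str, capability: tuple[int, int]) -> bool:
       """Return True if the chart marks `method` as supported on this capability."""
       arch = SM_TO_ARCH.get(capability)
       return arch is not None and arch in SUPPORTED_ARCHS.get(method, set())


   print(is_supported("AWQ", (7, 0)))
   print(is_supported("FP8", (8, 6)))

On a machine with PyTorch and a CUDA device, the capability tuple can be obtained from ``torch.cuda.get_device_capability()``.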
.. _deploying_with_cerebrium:
Deploying with Cerebrium
============================
.. raw:: html
<p align="center">
<img src="https://i.ibb.co/hHcScTT/Screenshot-2024-06-13-at-10-14-54.png" alt="vLLM_plus_cerebrium"/>
</p>
vLLM can be run on a cloud-based GPU machine with `Cerebrium <https://www.cerebrium.ai/>`__, a serverless AI infrastructure platform that makes it easier for companies to build and deploy AI-based applications.
To install the Cerebrium client, run:
.. code-block:: console
$ pip install cerebrium
$ cerebrium login
Next, create your Cerebrium project by running:
.. code-block:: console
$ cerebrium init vllm-project
Next, to install the required packages, add the following to your ``cerebrium.toml``:

.. code-block:: toml

   [cerebrium.deployment]
   docker_base_image_url = "nvidia/cuda:12.1.1-runtime-ubuntu22.04"

   [cerebrium.dependencies.pip]
   vllm = "latest"
Next, let us add our code to handle inference for the LLM of your choice (``mistralai/Mistral-7B-Instruct-v0.1`` for this example). Add the following code to your ``main.py``:

.. code-block:: python

   from vllm import LLM, SamplingParams

   llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.1")

   def run(prompts: list[str], temperature: float = 0.8, top_p: float = 0.95):
       sampling_params = SamplingParams(temperature=temperature, top_p=top_p)
       outputs = llm.generate(prompts, sampling_params)

       # Collect each prompt together with its first generated completion.
       results = []
       for output in outputs:
           prompt = output.prompt
           generated_text = output.outputs[0].text
           results.append({"prompt": prompt, "generated_text": generated_text})

       return {"results": results}
Then, run the following command to deploy it to the cloud:
.. code-block:: console
$ cerebrium deploy
If the deployment succeeds, Cerebrium returns a curl command that you can use to run inference. Remember to end the URL with the name of the function you are calling (``/run`` in this case):

.. code-block:: console

   curl -X POST https://api.cortex.cerebrium.ai/v4/p-xxxxxx/vllm/run \
       -H 'Content-Type: application/json' \
       -H 'Authorization: <JWT TOKEN>' \
       --data '{
         "prompts": [
           "Hello, my name is",
           "The president of the United States is",
           "The capital of France is",
           "The future of AI is"
         ]
       }'
You should get a response like:
.. code-block:: json
{
"run_id": "52911756-3066-9ae8-bcc9-d9129d1bd262",
"result": {
"result": [
{
"prompt": "Hello, my name is",
"generated_text": " Sarah, and I'm a teacher. I teach elementary school students. One of"
},
{
"prompt": "The president of the United States is",
"generated_text": " elected every four years. This is a democratic system.\n\n5. What"
},
{
"prompt": "The capital of France is",
"generated_text": " Paris.\n"
},
{
"prompt": "The future of AI is",
"generated_text": " bright, but it's important to approach it with a balanced and nuanced perspective."
}
]
},
"run_time_ms": 152.53663063049316
}
You now have an autoscaling endpoint where you only pay for the compute you use!
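The same call can be issued from Python. The sketch below uses only the standard library to build the POST request that the curl example sends; the helper name ``build_run_request`` is hypothetical, and the endpoint and token are the placeholders from above rather than working values. The request is built but not sent, so the snippet runs without a deployment.

.. code-block:: python

   import json
   import urllib.request


   def build_run_request(endpoint: str, token: str, prompts: list[str]) -> urllib.request.Request:
       """Build (but do not send) the POST request for the deployed run function."""
       body = json.dumps({"prompts": prompts}).encode("utf-8")
       return urllib.request.Request(
           endpoint,
           data=body,
           headers={"Content-Type": "application/json", "Authorization": token},
           method="POST",
       )


   # Placeholder endpoint and token, copied from the curl example above.
   request = build_run_request(
       "https://api.cortex.cerebrium.ai/v4/p-xxxxxx/vllm/run",
       "<JWT TOKEN>",
       ["Hello, my name is", "The capital of France is"],
   )
   print(request.get_method(), request.full_url)

   # To actually send it (requires a real deployment and token):
   # with urllib.request.urlopen(request) as response:
   #     print(json.load(response))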
@@ -3,9 +3,8 @@
Deploying with Docker
============================
vLLM offers an official Docker image for deployment.
The image can be used to run OpenAI compatible server and is available on Docker Hub as `vllm/vllm-openai <https://hub.docker.com/r/vllm/vllm-openai/tags>`_.
.. code-block:: console
@@ -25,7 +24,7 @@ The image is available on Docker Hub as `vllm/vllm-openai <https://hub.docker.co
memory to share data between processes under the hood, particularly for tensor parallel inference.
You can build and run vLLM from source via the provided `Dockerfile <https://github.com/vllm-project/vllm/blob/main/Dockerfile>`_. To build vLLM:
.. code-block:: console
......
@@ -40,7 +40,7 @@ Next, to provision a VM instance with LLM of your choice(`NousResearch/Llama-2-7
gpu: 24GB
commands:
  - pip install vllm
  - vllm serve $MODEL --port 8000
model:
  format: openai
  type: chat
......
@@ -3,7 +3,26 @@
Distributed Inference and Serving
=================================
How to decide the distributed inference strategy?
-------------------------------------------------
Before going into the details of distributed inference and serving, let's first make it clear when to use distributed inference and what are the strategies available. The common practice is:
- **Single GPU (no distributed inference)**: If your model fits in a single GPU, you probably don't need to use distributed inference. Just use the single GPU to run the inference.
- **Single-Node Multi-GPU (tensor parallel inference)**: If your model is too large to fit in a single GPU, but it can fit in a single node with multiple GPUs, you can use tensor parallelism. The tensor parallel size is the number of GPUs you want to use. For example, if you have 4 GPUs in a single node, you can set the tensor parallel size to 4.
- **Multi-Node Multi-GPU (tensor parallel plus pipeline parallel inference)**: If your model is too large to fit in a single node, you can use tensor parallel together with pipeline parallelism. The tensor parallel size is the number of GPUs you want to use in each node, and the pipeline parallel size is the number of nodes you want to use. For example, if you have 16 GPUs in 2 nodes (8GPUs per node), you can set the tensor parallel size to 8 and the pipeline parallel size to 2.
In short, you should increase the number of GPUs and the number of nodes until you have enough GPU memory to hold the model. The tensor parallel size should be the number of GPUs in each node, and the pipeline parallel size should be the number of nodes.
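The rule of thumb above can be written down directly. This is a minimal sketch with a hypothetical helper name, ``choose_parallel_sizes``; it encodes only the guidance in this section and does not check whether the model actually fits in memory.

.. code-block:: python

   def choose_parallel_sizes(gpus_per_node: int, num_nodes: int) -> tuple[int, int]:
       """Return (tensor_parallel_size, pipeline_parallel_size) per the rule of thumb."""
       if gpus_per_node < 1 or num_nodes < 1:
           raise ValueError("gpus_per_node and num_nodes must be positive")
       tensor_parallel_size = gpus_per_node  # GPUs within each node
       pipeline_parallel_size = num_nodes    # one pipeline stage per node
       return tensor_parallel_size, pipeline_parallel_size


   print(choose_parallel_sizes(gpus_per_node=1, num_nodes=1))  # single GPU
   print(choose_parallel_sizes(gpus_per_node=4, num_nodes=1))  # single node, 4 GPUs
   print(choose_parallel_sizes(gpus_per_node=8, num_nodes=2))  # 16 GPUs in 2 nodes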
After adding enough GPUs and nodes to hold the model, you can run vLLM first, which will print some logs like ``# GPU blocks: 790``. Multiply the number by ``16`` (the block size), and you can get roughly the maximum number of tokens that can be served on the current configuration. If this number is not satisfying, e.g. you want higher throughput, you can further increase the number of GPUs or nodes, until the number of blocks is enough.
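The token estimate is a single multiplication. The sketch below pulls the block count out of a log line and applies it; the function names are hypothetical, and the regular expression assumes the log contains the literal text ``# GPU blocks: <number>`` as quoted above.

.. code-block:: python

   import re


   def parse_gpu_blocks(log_line: str) -> int:
       """Extract the block count from a line containing '# GPU blocks: N'."""
       match = re.search(r"# GPU blocks: (\d+)", log_line)
       if match is None:
           raise ValueError("no GPU block count found in log line")
       return int(match.group(1))


   def max_cached_tokens(num_gpu_blocks: int, block_size: int = 16) -> int:
       """Rough upper bound on tokens that fit in the GPU blocks."""
       return num_gpu_blocks * block_size


   blocks = parse_gpu_blocks("# GPU blocks: 790")
   print(max_cached_tokens(blocks))  # 790 * 16 = 12640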
.. note::
There is one edge case: if the model fits in a single node with multiple GPUs, but the number of GPUs cannot divide the model size evenly, you can use pipeline parallelism, which splits the model along layers and supports uneven splits. In this case, the tensor parallel size should be 1 and the pipeline parallel size should be the number of GPUs.
Details for Distributed Inference and Serving
----------------------------------------------
vLLM supports distributed tensor-parallel inference and serving. Currently, we support `Megatron-LM's tensor parallel algorithm <https://arxiv.org/pdf/1909.08053.pdf>`_. We also support pipeline parallel as a beta feature for online serving. We manage the distributed runtime with either `Ray <https://github.com/ray-project/ray>`_ or python native multiprocessing. Multiprocessing can be used when deploying on a single node, multi-node inferencing currently requires Ray.
Multiprocessing will be used by default when not running in a Ray placement group and if there are sufficient GPUs available on the same node for the configured :code:`tensor_parallel_size`, otherwise Ray will be used. This default can be overridden via the :code:`LLM` class :code:`distributed_executor_backend` argument or :code:`--distributed-executor-backend` API server argument. Set it to :code:`mp` for multiprocessing or :code:`ray` for Ray. It's not required for Ray to be installed for the multiprocessing case.
@@ -19,20 +38,73 @@ To run multi-GPU serving, pass in the :code:`--tensor-parallel-size` argument wh
.. code-block:: console

   $ vllm serve facebook/opt-13b \
   $     --tensor-parallel-size 4

You can also additionally specify :code:`--pipeline-parallel-size` to enable pipeline parallelism. For example, to run API server on 8 GPUs with pipeline parallelism and tensor parallelism:

.. code-block:: console

   $ vllm serve gpt2 \
   $     --tensor-parallel-size 4 \
   $     --pipeline-parallel-size 2
.. note::
Pipeline parallel is a beta feature. It is only supported for online serving, and only for LLaMa, GPT2, Mixtral, Qwen, Qwen2, and Nemotron style models.
Multi-Node Inference and Serving
--------------------------------
If a single node does not have enough GPUs to hold the model, you can run the model using multiple nodes. It is important to make sure the execution environment is the same on all nodes, including the model path and the Python environment. The recommended way is to use docker images to ensure the same environment, and hide the heterogeneity of the host machines via mapping them into the same docker configuration.
The first step is to start containers and organize them into a cluster. We have provided a helper `script <https://github.com/vllm-project/vllm/tree/main/examples/run_cluster.sh>`_ to start the cluster.
Pick a node as the head node, and run the following command:
.. code-block:: console

   $ bash run_cluster.sh \
   $     vllm/vllm-openai \
   $     ip_of_head_node \
   $     --head \
   $     /path/to/the/huggingface/home/in/this/node
On the rest of the worker nodes, run the following command:
.. code-block:: console
$ bash run_cluster.sh \
$ vllm/vllm-openai \
$ ip_of_head_node \
$ --worker \
$ /path/to/the/huggingface/home/in/this/node
Then you get a ray cluster of containers. Note that you need to keep the shells running these commands alive to hold the cluster. Any shell disconnect will terminate the cluster. In addition, please note that the argument ``ip_of_head_node`` should be the IP address of the head node, which is accessible by all the worker nodes. A common misunderstanding is to use the IP address of the worker node, which is not correct.
Then, on any node, use ``docker exec -it node /bin/bash`` to enter the container, execute ``ray status`` to check the status of the Ray cluster. You should see the right number of nodes and GPUs.
After that, on any node, you can use vLLM as usual, just as you have all the GPUs on one node. The common practice is to set the tensor parallel size to the number of GPUs in each node, and the pipeline parallel size to the number of nodes. For example, if you have 16 GPUs in 2 nodes (8GPUs per node), you can set the tensor parallel size to 8 and the pipeline parallel size to 2:
.. code-block:: console
$ vllm serve /path/to/the/model/in/the/container \
$ --tensor-parallel-size 8 \
$ --pipeline-parallel-size 2
You can also use tensor parallel without pipeline parallel, just set the tensor parallel size to the number of GPUs in the cluster. For example, if you have 16 GPUs in 2 nodes (8GPUs per node), you can set the tensor parallel size to 16:
.. code-block:: console
$ vllm serve /path/to/the/model/in/the/container \
$ --tensor-parallel-size 16
To make tensor parallel performant, you should make sure the communication between nodes is efficient, e.g. using high-speed network cards like Infiniband. To correctly set up the cluster to use Infiniband, append additional arguments like ``--privileged -e NCCL_IB_HCA=mlx5`` to the ``run_cluster.sh`` script. Please contact your system administrator for more information on how to set up the flags. One way to confirm if the Infiniband is working is to run vLLM with ``NCCL_DEBUG=TRACE`` environment variable set, e.g. ``NCCL_DEBUG=TRACE vllm serve ...`` and check the logs for the NCCL version and the network used. If you find ``[send] via NET/Socket`` in the logs, it means NCCL uses raw TCP Socket, which is not efficient for cross-node tensor parallel. If you find ``[send] via NET/IB/GDRDMA`` in the logs, it means NCCL uses Infiniband with GPU-Direct RDMA, which is efficient.
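The log check described above can be automated along these lines. Only the two substrings ``via NET/Socket`` and ``via NET/IB/GDRDMA`` come from the text; the function name and the sample log lines are made up for illustration, and real NCCL output contains more detail.

.. code-block:: python

   def classify_nccl_transport(log_text: str) -> str:
       """Classify the cross-node transport reported in NCCL debug logs."""
       # Any socket-based channel is worth flagging, so check for it first.
       if "via NET/Socket" in log_text:
           return "tcp-socket"
       if "via NET/IB/GDRDMA" in log_text:
           return "infiniband-gdrdma"
       return "unknown"


   # Hypothetical log excerpts; only the "via NET/..." markers matter here.
   slow_log = "node0: Channel 00 : 0[0] -> 1[0] [send] via NET/Socket/0"
   fast_log = "node0: Channel 00 : 0[0] -> 1[0] [send] via NET/IB/GDRDMA/0"

   print(classify_nccl_transport(slow_log))
   print(classify_nccl_transport(fast_log))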
.. warning::
After you start the Ray cluster, you'd better also check the GPU-GPU communication between nodes. It can be non-trivial to set up. Please refer to the `sanity check script <https://docs.vllm.ai/en/latest/getting_started/debugging.html>`_ for more information. If you need to set some environment variables for the communication configuration, you can append them to the ``run_cluster.sh`` script, e.g. ``-e NCCL_SOCKET_IFNAME=eth0``. Note that setting environment variables in the shell (e.g. ``NCCL_SOCKET_IFNAME=eth0 vllm serve ...``) only works for the processes in the same node, not for the processes in the other nodes. Setting environment variables when you create the cluster is the recommended way. See the `discussion <https://github.com/vllm-project/vllm/issues/6803>`_ for more information.
.. warning::

   Please make sure you downloaded the model to all the nodes (with the same path), or the model is downloaded to some distributed file system that is accessible by all nodes.

   When you use huggingface repo id to refer to the model, you should append your huggingface token to the ``run_cluster.sh`` script, e.g. ``-e HF_TOKEN=``. The recommended way is to download the model first, and then use the path to refer to the model.
@@ -3,6 +3,11 @@ Environment Variables
vLLM uses the following environment variables to configure the system:
.. warning::
Please note that ``VLLM_PORT`` and ``VLLM_HOST_IP`` set the port and ip for vLLM's **internal usage**. It is not the port and ip for the API server. If you use ``--host $VLLM_HOST_IP`` and ``--port $VLLM_PORT`` to start the API server, it will not work.
All environment variables used by vLLM are prefixed with ``VLLM_``. **Special care should be taken for Kubernetes users**: please do not name the service as ``vllm``, otherwise environment variables set by Kubernetes might conflict with vLLM's environment variables, because `Kubernetes sets environment variables for each service with the capitalized service name as the prefix <https://kubernetes.io/docs/concepts/services-networking/service/#environment-variables>`_.
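To see why a service named ``vllm`` is a problem, the sketch below derives the fixed-name variables Kubernetes injects for a service (upper-cased name, dashes converted to underscores) and compares them with the two vLLM variables named in the warning. The function names are hypothetical, and the per-port variables such as ``<NAME>_PORT_<port>_TCP`` are left out for brevity.

.. code-block:: python

   # vLLM variables named in the warning above; the full list lives in vllm/envs.py.
   VLLM_INTERNAL_VARS = {"VLLM_PORT", "VLLM_HOST_IP"}


   def kubernetes_service_env_names(service_name: str) -> list[str]:
       """Fixed-name environment variables Kubernetes sets for a service."""
       prefix = service_name.upper().replace("-", "_")
       return [
           f"{prefix}_SERVICE_HOST",
           f"{prefix}_SERVICE_PORT",
           f"{prefix}_PORT",  # Docker-links style, e.g. tcp://10.0.0.11:8000
       ]


   def colliding_vars(service_name: str) -> list[str]:
       """Return the injected names that clash with vLLM's own variables."""
       names = kubernetes_service_env_names(service_name)
       return [name for name in names if name in VLLM_INTERNAL_VARS]


   print(colliding_vars("vllm"))
   print(colliding_vars("llm-api"))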
.. literalinclude:: ../../../vllm/envs.py
   :language: python
   :start-after: begin-env-vars-definition
......
Frequently Asked Questions
===========================
Q: How can I serve multiple models on a single port using the OpenAI API?
A: Assuming you mean serving multiple models at once with the OpenAI-compatible server, that is not currently supported. Instead, you can run multiple instances of the server (each serving a different model) at the same time, and add another layer that routes each incoming request to the correct server.
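The routing layer can be as simple as a lookup on the ``model`` field that OpenAI-style request bodies carry. The sketch below shows only that decision; the model names, ports, and the ``route`` helper are hypothetical, and a real deployment would forward the request with a reverse proxy or gateway rather than this function.

.. code-block:: python

   # Hypothetical mapping from served model name to the vLLM server hosting it.
   BACKENDS = {
       "model-a": "http://localhost:8000",
       "model-b": "http://localhost:8001",
   }


   def route(request_body: dict) -> str:
       """Return the base URL of the server that serves the requested model."""
       model = request_body.get("model")
       try:
           return BACKENDS[model]
       except KeyError:
           raise ValueError(f"no server is configured for model {model!r}") from None


   print(route({"model": "model-b", "prompt": "Hello"}))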
----------------------------------------
Q: Which model to use for offline inference embedding?
A: If you want to use an embedding model, try https://huggingface.co/intfloat/e5-mistral-7b-instruct. By contrast, models such as Llama-3-8b and Mistral-7B-Instruct-v0.3 are generation models rather than embedding models.
@@ -8,6 +8,7 @@ Integrations
deploying_with_kserve
deploying_with_triton
deploying_with_bentoml
deploying_with_cerebrium
deploying_with_lws
deploying_with_dstack
serving_with_langchain
@@ -4,7 +4,7 @@ vLLM provides an HTTP server that implements OpenAI's [Completions](https://plat
You can start the server using Python, or using [Docker](deploying_with_docker.rst):
```bash
vllm serve NousResearch/Meta-Llama-3-8B-Instruct --dtype auto --api-key token-abc123
```
To call the server, you can use the official OpenAI Python client library, or any other HTTP client.
@@ -97,9 +97,7 @@ template, or the template in string form. Without a chat template, the server wi
and all chat requests will error.
```bash
vllm serve <model> --chat-template ./path-to-chat-template.jinja
```
vLLM community provides a set of chat templates for popular models. You can find them in the examples
@@ -109,8 +107,8 @@ directory [here](https://github.com/vllm-project/vllm/tree/main/examples/)
```{argparse}
:module: vllm.entrypoints.openai.cli_args
:func: create_parser_for_docs
:prog: vllm serve
```
## Tool calling in the chat completion API
......
@@ -5,9 +5,9 @@ Deploying and scaling up with SkyPilot
.. raw:: html

    <p align="center">
        <img src="https://imgur.com/yxtzPEu.png" alt="vLLM"/>
    </p>

vLLM can be **run and scaled to multiple service replicas on clouds and Kubernetes** with `SkyPilot <https://github.com/skypilot-org/skypilot>`__, an open-source framework for running LLMs on any cloud. More examples for various open models, such as Llama-3, Mixtral, etc, can be found in `SkyPilot AI gallery <https://skypilot.readthedocs.io/en/latest/gallery/index.html>`__.
@@ -21,8 +21,8 @@ Prerequisites
.. code-block:: console

    pip install skypilot-nightly
    sky check

Run on a single instance
@@ -32,64 +32,64 @@ See the vLLM SkyPilot YAML for serving, `serving.yaml <https://github.com/skypil
.. code-block:: yaml

    resources:
      accelerators: {L4, A10g, A10, L40, A40, A100, A100-80GB} # We can use cheaper accelerators for 8B model.
      use_spot: True
      disk_size: 512  # Ensure model checkpoints can fit.
      disk_tier: best
      ports: 8081  # Expose to internet traffic.

    envs:
      MODEL_NAME: meta-llama/Meta-Llama-3-8B-Instruct
      HF_TOKEN: <your-huggingface-token>  # Change to your own huggingface token, or use --env to pass.

    setup: |
      conda create -n vllm python=3.10 -y
      conda activate vllm

      pip install vllm==0.4.0.post1
      # Install Gradio for web UI.
      pip install gradio openai
      pip install flash-attn==2.5.7

    run: |
      conda activate vllm
      echo 'Starting vllm api server...'
      python -u -m vllm.entrypoints.openai.api_server \
        --port 8081 \
        --model $MODEL_NAME \
        --trust-remote-code \
        --tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
        2>&1 | tee api_server.log &

      echo 'Waiting for vllm api server to start...'
      while ! `cat api_server.log | grep -q 'Uvicorn running on'`; do sleep 1; done

      echo 'Starting gradio server...'
      git clone https://github.com/vllm-project/vllm.git || true
      python vllm/examples/gradio_openai_chatbot_webserver.py \
        -m $MODEL_NAME \
        --port 8811 \
        --model-url http://localhost:8081/v1 \
        --stop-token-ids 128009,128001

Start serving the Llama-3 8B model on any of the candidate GPUs listed (L4, A10g, ...):

.. code-block:: console

    HF_TOKEN="your-huggingface-token" sky launch serving.yaml --env HF_TOKEN

Check the output of the command. There will be a shareable gradio link (like the last line of the following). Open it in your browser to use the LLaMA model to do the text completion.

.. code-block:: console

    (task, pid=7431) Running on public URL: https://<gradio-hash>.gradio.live

**Optional**: Serve the 70B model instead of the default 8B and use more GPU:

.. code-block:: console

    HF_TOKEN="your-huggingface-token" sky launch serving.yaml --gpus A100:8 --env HF_TOKEN --env MODEL_NAME=meta-llama/Meta-Llama-3-70B-Instruct

Scale up to multiple replicas
...@@ -99,151 +99,212 @@ SkyPilot can scale up the service to multiple service replicas with built-in aut ...@@ -99,151 +99,212 @@ SkyPilot can scale up the service to multiple service replicas with built-in aut
.. code-block:: yaml .. code-block:: yaml
service: service:
replicas: 2 replicas: 2
# An actual request for readiness probe. # An actual request for readiness probe.
readiness_probe: readiness_probe:
path: /v1/chat/completions path: /v1/chat/completions
post_data: post_data:
model: $MODEL_NAME model: $MODEL_NAME
messages: messages:
- role: user - role: user
content: Hello! What is your name? content: Hello! What is your name?
max_tokens: 1 max_tokens: 1
.. raw:: html .. raw:: html
<details> <details>
<summary>Click to see the full recipe YAML</summary> <summary>Click to see the full recipe YAML</summary>
.. code-block:: yaml .. code-block:: yaml
service: service:
replicas: 2 replicas: 2
# An actual request for readiness probe. # An actual request for readiness probe.
readiness_probe: readiness_probe:
path: /v1/chat/completions path: /v1/chat/completions
post_data: post_data:
model: $MODEL_NAME model: $MODEL_NAME
messages: messages:
- role: user - role: user
content: Hello! What is your name? content: Hello! What is your name?
max_tokens: 1 max_tokens: 1
resources: resources:
accelerators: {L4, A10g, A10, L40, A40, A100, A100-80GB} # We can use cheaper accelerators for 8B model. accelerators: {L4, A10g, A10, L40, A40, A100, A100-80GB} # We can use cheaper accelerators for 8B model.
use_spot: True use_spot: True
disk_size: 512 # Ensure model checkpoints can fit. disk_size: 512 # Ensure model checkpoints can fit.
disk_tier: best disk_tier: best
ports: 8081 # Expose to internet traffic. ports: 8081 # Expose to internet traffic.
envs: envs:
MODEL_NAME: meta-llama/Meta-Llama-3-8B-Instruct MODEL_NAME: meta-llama/Meta-Llama-3-8B-Instruct
HF_TOKEN: <your-huggingface-token> # Change to your own huggingface token, or use --env to pass. HF_TOKEN: <your-huggingface-token> # Change to your own huggingface token, or use --env to pass.
setup: | setup: |
conda create -n vllm python=3.10 -y conda create -n vllm python=3.10 -y
conda activate vllm conda activate vllm
pip install vllm==0.4.0.post1 pip install vllm==0.4.0.post1
# Install Gradio for web UI. # Install Gradio for web UI.
pip install gradio openai pip install gradio openai
pip install flash-attn==2.5.7 pip install flash-attn==2.5.7
run: | run: |
conda activate vllm conda activate vllm
echo 'Starting vllm api server...' echo 'Starting vllm api server...'
python -u -m vllm.entrypoints.openai.api_server \ python -u -m vllm.entrypoints.openai.api_server \
--port 8081 \ --port 8081 \
--model $MODEL_NAME \ --model $MODEL_NAME \
--trust-remote-code \ --trust-remote-code \
--tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \ --tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
2>&1 | tee api_server.log & 2>&1 | tee api_server.log
echo 'Waiting for vllm api server to start...'
while ! `cat api_server.log | grep -q 'Uvicorn running on'`; do sleep 1; done
echo 'Starting gradio server...'
git clone https://github.com/vllm-project/vllm.git || true
python vllm/examples/gradio_openai_chatbot_webserver.py \
-m $MODEL_NAME \
--port 8811 \
--model-url http://localhost:8081/v1 \
--stop-token-ids 128009,128001
.. raw:: html .. raw:: html
</details> </details>
Start serving the Llama-3 8B model on multiple replicas: Start serving the Llama-3 8B model on multiple replicas:
.. code-block:: console .. code-block:: console
HF_TOKEN="your-huggingface-token" sky serve up -n vllm serving.yaml --env HF_TOKEN HF_TOKEN="your-huggingface-token" sky serve up -n vllm serving.yaml --env HF_TOKEN
Wait until the service is ready: Wait until the service is ready:
.. code-block:: console .. code-block:: console
watch -n10 sky serve status vllm watch -n10 sky serve status vllm
.. raw:: html .. raw:: html
<details> <details>
<summary>Example outputs:</summary> <summary>Example outputs:</summary>
.. code-block:: console .. code-block:: console
Services Services
NAME VERSION UPTIME STATUS REPLICAS ENDPOINT NAME VERSION UPTIME STATUS REPLICAS ENDPOINT
vllm 1 35s READY 2/2 xx.yy.zz.100:30001 vllm 1 35s READY 2/2 xx.yy.zz.100:30001
Service Replicas Service Replicas
SERVICE_NAME ID VERSION IP LAUNCHED RESOURCES STATUS REGION SERVICE_NAME ID VERSION IP LAUNCHED RESOURCES STATUS REGION
vllm 1 1 xx.yy.zz.121 18 mins ago 1x GCP({'L4': 1}) READY us-east4 vllm 1 1 xx.yy.zz.121 18 mins ago 1x GCP([Spot]{'L4': 1}) READY us-east4
vllm 2 1 xx.yy.zz.245 18 mins ago 1x GCP({'L4': 1}) READY us-east4 vllm 2 1 xx.yy.zz.245 18 mins ago 1x GCP([Spot]{'L4': 1}) READY us-east4
.. raw:: html .. raw:: html
</details> </details>
After the service is READY, you can find a single endpoint for the service and access the service with the endpoint: After the service is READY, you can find a single endpoint for the service and access the service with the endpoint:
.. code-block:: console .. code-block:: console
ENDPOINT=$(sky serve status --endpoint 8081 vllm) ENDPOINT=$(sky serve status --endpoint 8081 vllm)
curl -L http://$ENDPOINT/v1/chat/completions \ curl -L http://$ENDPOINT/v1/chat/completions \
-H "Content-Type: application/json" \ -H "Content-Type: application/json" \
-d '{ -d '{
"model": "meta-llama/Meta-Llama-3-8B-Instruct", "model": "meta-llama/Meta-Llama-3-8B-Instruct",
"messages": [ "messages": [
{ {
"role": "system", "role": "system",
"content": "You are a helpful assistant." "content": "You are a helpful assistant."
}, },
{ {
"role": "user", "role": "user",
"content": "Who are you?" "content": "Who are you?"
} }
], ],
"stop_token_ids": [128009, 128001] "stop_token_ids": [128009, 128001]
}' }'
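The same request can also be built from Python. The following is a minimal sketch using only the standard library; `build_chat_request` is a hypothetical helper introduced for illustration, and the endpoint shown is a placeholder for the value returned by `sky serve status --endpoint 8081 vllm`. The payload mirrors the curl example above.

```python
import json
import urllib.request


def build_chat_request(endpoint: str, model: str,
                       user_message: str) -> urllib.request.Request:
    """Build (but do not send) a POST request mirroring the curl example."""
    payload = {
        "model": model,
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": user_message},
        ],
        "stop_token_ids": [128009, 128001],
    }
    return urllib.request.Request(
        url=f"http://{endpoint}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    # Placeholder endpoint; substitute the address of your own service.
    req = build_chat_request("xx.yy.zz.100:30001",
                             "meta-llama/Meta-Llama-3-8B-Instruct",
                             "Who are you?")
    print(req.full_url)
    # To send it, pass `req` to urllib.request.urlopen and JSON-decode the body.
```

Separating request construction from sending keeps the payload easy to inspect before any traffic reaches the service.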
To enable autoscaling, you could specify additional configs in `services`: To enable autoscaling, you could replace the `replicas` with the following configs in `service`:
.. code-block:: yaml .. code-block:: yaml
services: service:
replica_policy: replica_policy:
min_replicas: 0 min_replicas: 2
max_replicas: 3 max_replicas: 4
target_qps_per_replica: 2 target_qps_per_replica: 2
This will scale the service up when the QPS exceeds 2 for each replica. This will scale the service up when the QPS exceeds 2 for each replica.
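As a rough mental model of how these three settings interact, the sketch below computes a replica count from an observed request rate. It is a simplified illustration that assumes the autoscaler aims for about `ceil(qps / target_qps_per_replica)` replicas, clamped to the configured bounds. `desired_replicas` is a hypothetical helper, not part of SkyPilot, and the real autoscaler may apply additional smoothing or delays.

```python
import math


def desired_replicas(total_qps: float, target_qps_per_replica: float,
                     min_replicas: int, max_replicas: int) -> int:
    """Illustrative only: replicas needed to keep per-replica QPS at or below target."""
    needed = math.ceil(total_qps / target_qps_per_replica)
    # Clamp to the bounds given in replica_policy.
    return max(min_replicas, min(max_replicas, needed))


if __name__ == "__main__":
    for qps in (1, 5, 20):
        print(qps, desired_replicas(qps, 2, min_replicas=2, max_replicas=4))
```

Under this assumption, a load below the target stays at `min_replicas`, and a load far above it is capped at `max_replicas`.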
.. raw:: html
<details>
<summary>Click to see the full recipe YAML</summary>
.. code-block:: yaml
service:
replica_policy:
min_replicas: 2
max_replicas: 4
target_qps_per_replica: 2
# An actual request for readiness probe.
readiness_probe:
path: /v1/chat/completions
post_data:
model: $MODEL_NAME
messages:
- role: user
content: Hello! What is your name?
max_tokens: 1
resources:
accelerators: {L4, A10g, A10, L40, A40, A100, A100-80GB} # We can use cheaper accelerators for 8B model.
use_spot: True
disk_size: 512 # Ensure model checkpoints can fit.
disk_tier: best
ports: 8081 # Expose to internet traffic.
envs:
MODEL_NAME: meta-llama/Meta-Llama-3-8B-Instruct
HF_TOKEN: <your-huggingface-token> # Change to your own huggingface token, or use --env to pass.
setup: |
conda create -n vllm python=3.10 -y
conda activate vllm
pip install vllm==0.4.0.post1
# Install Gradio for web UI.
pip install gradio openai
pip install flash-attn==2.5.7
run: |
conda activate vllm
echo 'Starting vllm api server...'
python -u -m vllm.entrypoints.openai.api_server \
--port 8081 \
--model $MODEL_NAME \
--trust-remote-code \
--tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
2>&1 | tee api_server.log
.. raw:: html
</details>
To update the service with the new config:
.. code-block:: console
HF_TOKEN="your-huggingface-token" sky serve update vllm serving.yaml --env HF_TOKEN
To stop the service:
.. code-block:: console
sky serve down vllm
**Optional**: Connect a GUI to the endpoint **Optional**: Connect a GUI to the endpoint
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
...@@ -253,58 +314,53 @@ It is also possible to access the Llama-3 service with a separate GUI frontend, ...@@ -253,58 +314,53 @@ It is also possible to access the Llama-3 service with a separate GUI frontend,
.. raw:: html .. raw:: html
<details> <details>
<summary>Click to see the full GUI YAML</summary> <summary>Click to see the full GUI YAML</summary>
.. code-block:: yaml .. code-block:: yaml
envs: envs:
MODEL_NAME: meta-llama/Meta-Llama-3-70B-Instruct MODEL_NAME: meta-llama/Meta-Llama-3-8B-Instruct
ENDPOINT: x.x.x.x:3031 # Address of the API server running vllm. ENDPOINT: x.x.x.x:3031 # Address of the API server running vllm.
resources: resources:
cpus: 2 cpus: 2
setup: | setup: |
conda activate vllm conda create -n vllm python=3.10 -y
if [ $? -ne 0 ]; then conda activate vllm
conda create -n vllm python=3.10 -y
conda activate vllm # Install Gradio for web UI.
fi pip install gradio openai
# Install Gradio for web UI. run: |
pip install gradio openai conda activate vllm
export PATH=$PATH:/sbin
run: |
conda activate vllm echo 'Starting gradio server...'
export PATH=$PATH:/sbin git clone https://github.com/vllm-project/vllm.git || true
WORKER_IP=$(hostname -I | cut -d' ' -f1) python vllm/examples/gradio_openai_chatbot_webserver.py \
CONTROLLER_PORT=21001 -m $MODEL_NAME \
WORKER_PORT=21002 --port 8811 \
--model-url http://$ENDPOINT/v1 \
echo 'Starting gradio server...' --stop-token-ids 128009,128001 | tee ~/gradio.log
git clone https://github.com/vllm-project/vllm.git || true
python vllm/examples/gradio_openai_chatbot_webserver.py \
-m $MODEL_NAME \
--port 8811 \
--model-url http://$ENDPOINT/v1 \
--stop-token-ids 128009,128001 | tee ~/gradio.log
.. raw:: html .. raw:: html
</details> </details>
1. Start the chat web UI: 1. Start the chat web UI:
.. code-block:: console .. code-block:: console
sky launch -c gui ./gui.yaml --env ENDPOINT=$(sky serve status --endpoint vllm) sky launch -c gui ./gui.yaml --env ENDPOINT=$(sky serve status --endpoint vllm)
2. Then, we can access the GUI at the returned gradio link: 2. Then, we can access the GUI at the returned gradio link:
.. code-block:: console .. code-block:: console
| INFO | stdout | Running on public URL: https://6141e84201ce0bb4ed.gradio.live | INFO | stdout | Running on public URL: https://6141e84201ce0bb4ed.gradio.live
.. _tensorizer:
Loading Models with CoreWeave's Tensorizer
==========================================
vLLM supports loading models with `CoreWeave's Tensorizer <https://docs.coreweave.com/coreweave-machine-learning-and-ai/inference/tensorizer>`_.
vLLM model tensors that have been serialized to disk, an HTTP/HTTPS endpoint, or S3 endpoint can be deserialized
at runtime extremely quickly directly to the GPU, resulting in significantly
shorter Pod startup times and lower CPU memory usage. Tensor encryption is also supported.
For more information on CoreWeave's Tensorizer, please refer to
`CoreWeave's Tensorizer documentation <https://github.com/coreweave/tensorizer>`_. For more information on serializing a vLLM model, as well as a general usage guide to using Tensorizer with vLLM, see
the `vLLM example script <https://docs.vllm.ai/en/stable/getting_started/examples/tensorize_vllm_model.html>`_.
\ No newline at end of file
"""Example Python client for vllm.entrypoints.api_server""" """Example Python client for `vllm.entrypoints.api_server`
NOTE: The API server is used only for demonstration and simple performance
benchmarks. It is not intended for production use.
For production use, we recommend `vllm serve` and the OpenAI client API.
"""
import argparse import argparse
import json import json
...@@ -27,7 +31,10 @@ def post_http_request(prompt: str, ...@@ -27,7 +31,10 @@ def post_http_request(prompt: str,
"max_tokens": 16, "max_tokens": 16,
"stream": stream, "stream": stream,
} }
response = requests.post(api_url, headers=headers, json=pload, stream=True) response = requests.post(api_url,
headers=headers,
json=pload,
stream=stream)
return response return response
......
import argparse
from vllm import LLM, SamplingParams from vllm import LLM, SamplingParams
from vllm.utils import FlexibleArgumentParser
def main(): def main():
parser = argparse.ArgumentParser(description='AQLM examples') parser = FlexibleArgumentParser(description='AQLM examples')
parser.add_argument('--model', parser.add_argument('--model',
'-m', '-m',
...@@ -17,7 +16,7 @@ def main(): ...@@ -17,7 +16,7 @@ def main():
type=int, type=int,
default=0, default=0,
help='known good models by index, [0-4]') help='known good models by index, [0-4]')
parser.add_argument('--tensor_parallel_size', parser.add_argument('--tensor-parallel-size',
'-t', '-t',
type=int, type=int,
default=1, default=1,
......
from vllm import LLM, SamplingParams
# Sample prompts.
prompts = [
"Hello, my name is",
"The president of the United States is",
"The capital of France is",
"The future of AI is",
]
# Create a sampling params object.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
# Create an LLM.
llm = LLM(model="meta-llama/Llama-2-13b-chat-hf", cpu_offload_gb=10)
# Generate texts from the prompts. The output is a list of RequestOutput objects
# that contain the prompt, generated text, and other information.
outputs = llm.generate(prompts, sampling_params)
# Print the outputs.
for output in outputs:
prompt = output.prompt
generated_text = output.outputs[0].text
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
...@@ -2,7 +2,7 @@ import argparse ...@@ -2,7 +2,7 @@ import argparse
import glob import glob
import json import json
import os import os
from typing import Any, Callable, Dict, Iterable, List, Optional, Tuple from typing import Any, Callable, Dict, List, Optional, Tuple
import numpy as np import numpy as np
import torch import torch
...@@ -19,7 +19,7 @@ def _prepare_hf_weights( ...@@ -19,7 +19,7 @@ def _prepare_hf_weights(
quantized_model_dir: str, quantized_model_dir: str,
load_format: str = "auto", load_format: str = "auto",
fall_back_to_pt: bool = True, fall_back_to_pt: bool = True,
) -> Tuple[str, List[str], bool]: ) -> Tuple[List[str], bool]:
if not os.path.isdir(quantized_model_dir): if not os.path.isdir(quantized_model_dir):
raise FileNotFoundError( raise FileNotFoundError(
f"The quantized model directory `{quantized_model_dir}` " f"The quantized model directory `{quantized_model_dir}` "
...@@ -94,7 +94,7 @@ def _hf_tensorfile_iterator(filename: str, load_format: str, ...@@ -94,7 +94,7 @@ def _hf_tensorfile_iterator(filename: str, load_format: str,
def _kv_scales_extractor( def _kv_scales_extractor(
hf_tensor_files: Iterable[str], hf_tensor_files: List[str],
use_safetensors: bool, use_safetensors: bool,
rank_keyword: str = "rank", rank_keyword: str = "rank",
expected_tp_size: Optional[int] = None) -> Dict[int, Dict[int, float]]: expected_tp_size: Optional[int] = None) -> Dict[int, Dict[int, float]]:
...@@ -115,7 +115,7 @@ def _kv_scales_extractor( ...@@ -115,7 +115,7 @@ def _kv_scales_extractor(
for char in rank_keyword: for char in rank_keyword:
assert not char.isdecimal( assert not char.isdecimal(
), f"Rank keyword {rank_keyword} contains a numeric character!" ), f"Rank keyword {rank_keyword} contains a numeric character!"
rank_scales_map = {} rank_scales_map: Dict[int, Dict[int, float]] = {}
for tensor_file in hf_tensor_files: for tensor_file in hf_tensor_files:
try: try:
rank_idx = tensor_file.find(rank_keyword) rank_idx = tensor_file.find(rank_keyword)
...@@ -141,7 +141,7 @@ def _kv_scales_extractor( ...@@ -141,7 +141,7 @@ def _kv_scales_extractor(
raise raise
if rank not in rank_scales_map: if rank not in rank_scales_map:
layer_scales_map = {} layer_scales_map: Dict[int, float] = {}
rank_scales_map[rank] = layer_scales_map rank_scales_map[rank] = layer_scales_map
else: else:
raise RuntimeError( raise RuntimeError(
...@@ -222,7 +222,7 @@ def _metadata_extractor(quantized_model_dir: str, ...@@ -222,7 +222,7 @@ def _metadata_extractor(quantized_model_dir: str,
"does not exist.") "does not exist.")
metadata_files = glob.glob(os.path.join(quantized_model_dir, "*.json")) metadata_files = glob.glob(os.path.join(quantized_model_dir, "*.json"))
result = {} result: Dict[str, Any] = {}
for file in metadata_files: for file in metadata_files:
with open(file) as f: with open(file) as f:
try: try:
...@@ -327,7 +327,7 @@ if __name__ == "__main__": ...@@ -327,7 +327,7 @@ if __name__ == "__main__":
"--quantization-param-path <filename>). This is only used " "--quantization-param-path <filename>). This is only used "
"if the KV cache dtype is FP8 and on ROCm (AMD GPU).") "if the KV cache dtype is FP8 and on ROCm (AMD GPU).")
parser.add_argument( parser.add_argument(
"--quantized_model", "--quantized-model",
help="Specify the directory containing a single quantized HF model. " help="Specify the directory containing a single quantized HF model. "
"It is expected that the quantization format is FP8_E4M3, for use " "It is expected that the quantization format is FP8_E4M3, for use "
"on ROCm (AMD GPU).", "on ROCm (AMD GPU).",
...@@ -339,18 +339,18 @@ if __name__ == "__main__": ...@@ -339,18 +339,18 @@ if __name__ == "__main__":
choices=["auto", "safetensors", "npz", "pt"], choices=["auto", "safetensors", "npz", "pt"],
default="auto") default="auto")
parser.add_argument( parser.add_argument(
"--output_dir", "--output-dir",
help="Optionally specify the output directory. By default the " help="Optionally specify the output directory. By default the "
"KV cache scaling factors will be saved in the model directory, " "KV cache scaling factors will be saved in the model directory, "
"however you can override this behavior here.", "however you can override this behavior here.",
default=None) default=None)
parser.add_argument( parser.add_argument(
"--output_name", "--output-name",
help="Optionally specify the output filename.", help="Optionally specify the output filename.",
# TODO: Change this once additional scaling factors are enabled # TODO: Change this once additional scaling factors are enabled
default="kv_cache_scales.json") default="kv_cache_scales.json")
parser.add_argument( parser.add_argument(
"--tp_size", "--tp-size",
help="Optionally specify the tensor-parallel (TP) size that the " help="Optionally specify the tensor-parallel (TP) size that the "
"quantized model should correspond to. If specified, during KV " "quantized model should correspond to. If specified, during KV "
"cache scaling factor extraction the observed TP size will be " "cache scaling factor extraction the observed TP size will be "
......
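The flag renames in this file (for example `--output_dir` to `--output-dir`) should not require changes to code that reads the parsed values, because `argparse` derives the attribute name from the long option by replacing dashes with underscores. A minimal sketch with the standard library parser, using two of the flags above for illustration (the description string and argument values are placeholders):

```python
import argparse

parser = argparse.ArgumentParser(description="Dash-style flags demo")
parser.add_argument("--output-dir", default=None)
parser.add_argument("--tp-size", type=int, default=None)

# Dashes in the flag name become underscores in the attribute name.
args = parser.parse_args(["--output-dir", "./scales", "--tp-size", "2"])
print(args.output_dir, args.tp_size)
```

Callers on the command line do need to switch to the dashed spelling, since the standard parser does not accept the old underscore form once the option has been renamed.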
...@@ -16,7 +16,7 @@ ...@@ -16,7 +16,7 @@
#### Run on H100 system for speed if FP8; number of GPUs depends on the model size #### Run on H100 system for speed if FP8; number of GPUs depends on the model size
#### Example: quantize Llama2-7b model from HF to FP8 with FP8 KV Cache: #### Example: quantize Llama2-7b model from HF to FP8 with FP8 KV Cache:
`python quantize.py --model_dir ./ll2-7b --dtype float16 --qformat fp8 --kv_cache_dtype fp8 --output_dir ./ll2_7b_fp8 --calib_size 512 --tp_size 1` `python quantize.py --model-dir ./ll2-7b --dtype float16 --qformat fp8 --kv-cache-dtype fp8 --output-dir ./ll2_7b_fp8 --calib-size 512 --tp-size 1`
Outputs: model structure, quantized model & parameters (with scaling factors) are in JSON and Safetensors (npz is generated only for the reference) Outputs: model structure, quantized model & parameters (with scaling factors) are in JSON and Safetensors (npz is generated only for the reference)
``` ```
......
import argparse
import os
import subprocess
import torch
from PIL import Image
from vllm import LLM
from vllm.multimodal.image import ImageFeatureData, ImagePixelData
# The assets are located at `s3://air-example-data-2/vllm_opensource_llava/`.
# You can use `.buildkite/download-images.sh` to download them
def run_llava_pixel_values(*, disable_image_processor: bool = False):
llm = LLM(
model="llava-hf/llava-1.5-7b-hf",
image_input_type="pixel_values",
image_token_id=32000,
image_input_shape="1,3,336,336",
image_feature_size=576,
disable_image_processor=disable_image_processor,
)
prompt = "<image>" * 576 + (
"\nUSER: What is the content of this image?\nASSISTANT:")
if disable_image_processor:
image = torch.load("images/stop_sign_pixel_values.pt")
else:
image = Image.open("images/stop_sign.jpg")
outputs = llm.generate({
"prompt": prompt,
"multi_modal_data": ImagePixelData(image),
})
for o in outputs:
generated_text = o.outputs[0].text
print(generated_text)
def run_llava_image_features():
llm = LLM(
model="llava-hf/llava-1.5-7b-hf",
image_input_type="image_features",
image_token_id=32000,
image_input_shape="1,576,1024",
image_feature_size=576,
)
prompt = "<image>" * 576 + (
"\nUSER: What is the content of this image?\nASSISTANT:")
image: torch.Tensor = torch.load("images/stop_sign_image_features.pt")
outputs = llm.generate({
"prompt": prompt,
"multi_modal_data": ImageFeatureData(image),
})
for o in outputs:
generated_text = o.outputs[0].text
print(generated_text)
def main(args):
if args.type == "pixel_values":
run_llava_pixel_values()
else:
run_llava_image_features()
if __name__ == "__main__":
parser = argparse.ArgumentParser(description="Demo on Llava")
parser.add_argument("--type",
type=str,
choices=["pixel_values", "image_features"],
default="pixel_values",
help="image input type")
args = parser.parse_args()
# Download from s3
s3_bucket_path = "s3://air-example-data-2/vllm_opensource_llava/"
local_directory = "images"
# Make sure the local directory exists or create it
os.makedirs(local_directory, exist_ok=True)
# Use AWS CLI to sync the directory, assume anonymous access
subprocess.check_call([
"aws",
"s3",
"sync",
s3_bucket_path,
local_directory,
"--no-sign-request",
])
main(args)
...@@ -2,6 +2,7 @@ import argparse ...@@ -2,6 +2,7 @@ import argparse
from typing import List, Tuple from typing import List, Tuple
from vllm import EngineArgs, LLMEngine, RequestOutput, SamplingParams from vllm import EngineArgs, LLMEngine, RequestOutput, SamplingParams
from vllm.utils import FlexibleArgumentParser
def create_test_prompts() -> List[Tuple[str, SamplingParams]]: def create_test_prompts() -> List[Tuple[str, SamplingParams]]:
...@@ -55,7 +56,7 @@ def main(args: argparse.Namespace): ...@@ -55,7 +56,7 @@ def main(args: argparse.Namespace):
if __name__ == '__main__': if __name__ == '__main__':
parser = argparse.ArgumentParser( parser = FlexibleArgumentParser(
description='Demo on using the LLMEngine class directly') description='Demo on using the LLMEngine class directly')
parser = EngineArgs.add_cli_args(parser) parser = EngineArgs.add_cli_args(parser)
args = parser.parse_args() args = parser.parse_args()
......