Commit afd0da21 authored by zhuwenwen's avatar zhuwenwen
Browse files

Merge tag 'v0.7.1' into v0.7.1-dev

parents 1a11f127 4f4d427a
# Integrations
```{toctree}
:maxdepth: 1
run_on_sky
deploying_with_kserve
deploying_with_kubeai
deploying_with_triton
deploying_with_bentoml
deploying_with_cerebrium
deploying_with_lws
deploying_with_dstack
serving_with_langchain
serving_with_llamaindex
serving_with_llamastack
```
# External Integrations
:::{toctree}
:maxdepth: 1
langchain
llamaindex
:::
(run-on-langchain)= (serving-langchain)=
# Serving with Langchain # LangChain
vLLM is also available via [Langchain](https://github.com/langchain-ai/langchain) . vLLM is also available via [LangChain](https://github.com/langchain-ai/langchain) .
To install langchain, run To install LangChain, run
```console ```console
$ pip install langchain langchain_community -q pip install langchain langchain_community -q
``` ```
To run inference on a single or multiple GPUs, use `VLLM` class from `langchain`. To run inference on a single or multiple GPUs, use `VLLM` class from `langchain`.
......
(run-on-llamaindex)= (serving-llamaindex)=
# Serving with llama_index # LlamaIndex
vLLM is also available via [llama_index](https://github.com/run-llama/llama_index) . vLLM is also available via [LlamaIndex](https://github.com/run-llama/llama_index) .
To install llamaindex, run To install LlamaIndex, run
```console ```console
$ pip install llama-index-llms-vllm -q pip install llama-index-llms-vllm -q
``` ```
To run inference on a single or multiple GPUs, use `Vllm` class from `llamaindex`. To run inference on a single or multiple GPUs, use `Vllm` class from `llamaindex`.
......
...@@ -4,10 +4,10 @@ vLLM exposes a number of metrics that can be used to monitor the health of the ...@@ -4,10 +4,10 @@ vLLM exposes a number of metrics that can be used to monitor the health of the
system. These metrics are exposed via the `/metrics` endpoint on the vLLM system. These metrics are exposed via the `/metrics` endpoint on the vLLM
OpenAI compatible API server. OpenAI compatible API server.
You can start the server using Python, or using [Docker](deploying_with_docker.md): You can start the server using Python, or using [Docker](#deployment-docker):
```console ```console
$ vllm serve unsloth/Llama-3.2-1B-Instruct vllm serve unsloth/Llama-3.2-1B-Instruct
``` ```
Then query the endpoint to get the latest metrics from the server: Then query the endpoint to get the latest metrics from the server:
...@@ -31,8 +31,8 @@ vllm:iteration_tokens_total_bucket{le="512.0",model_name="unsloth/Llama-3.2-1B-I ...@@ -31,8 +31,8 @@ vllm:iteration_tokens_total_bucket{le="512.0",model_name="unsloth/Llama-3.2-1B-I
The following metrics are exposed: The following metrics are exposed:
```{literalinclude} ../../../vllm/engine/metrics.py :::{literalinclude} ../../../vllm/engine/metrics.py
:end-before: end-metrics-definitions :end-before: end-metrics-definitions
:language: python :language: python
:start-after: begin-metrics-definitions :start-after: begin-metrics-definitions
``` :::
...@@ -4,21 +4,21 @@ ...@@ -4,21 +4,21 @@
This page teaches you how to pass multi-modal inputs to [multi-modal models](#supported-mm-models) in vLLM. This page teaches you how to pass multi-modal inputs to [multi-modal models](#supported-mm-models) in vLLM.
```{note} :::{note}
We are actively iterating on multi-modal support. See [this RFC](gh-issue:4194) for upcoming changes, We are actively iterating on multi-modal support. See [this RFC](gh-issue:4194) for upcoming changes,
and [open an issue on GitHub](https://github.com/vllm-project/vllm/issues/new/choose) if you have any feedback or feature requests. and [open an issue on GitHub](https://github.com/vllm-project/vllm/issues/new/choose) if you have any feedback or feature requests.
``` :::
## Offline Inference ## Offline Inference
To input multi-modal data, follow this schema in {class}`vllm.inputs.PromptType`: To input multi-modal data, follow this schema in {class}`vllm.inputs.PromptType`:
- `prompt`: The prompt should follow the format that is documented on HuggingFace. - `prompt`: The prompt should follow the format that is documented on HuggingFace.
- `multi_modal_data`: This is a dictionary that follows the schema defined in {class}`vllm.multimodal.MultiModalDataDict`. - `multi_modal_data`: This is a dictionary that follows the schema defined in {class}`vllm.multimodal.inputs.MultiModalDataDict`.
### Image ### Image
You can pass a single image to the {code}`'image'` field of the multi-modal dictionary, as shown in the following examples: You can pass a single image to the `'image'` field of the multi-modal dictionary, as shown in the following examples:
```python ```python
llm = LLM(model="llava-hf/llava-1.5-7b-hf") llm = LLM(model="llava-hf/llava-1.5-7b-hf")
...@@ -60,7 +60,7 @@ for o in outputs: ...@@ -60,7 +60,7 @@ for o in outputs:
print(generated_text) print(generated_text)
``` ```
Full example: <gh-file:examples/offline_inference_vision_language.py> Full example: <gh-file:examples/offline_inference/vision_language.py>
To substitute multiple images inside the same text prompt, you can pass in a list of images instead: To substitute multiple images inside the same text prompt, you can pass in a list of images instead:
...@@ -91,7 +91,7 @@ for o in outputs: ...@@ -91,7 +91,7 @@ for o in outputs:
print(generated_text) print(generated_text)
``` ```
Full example: <gh-file:examples/offline_inference_vision_language_multi_image.py> Full example: <gh-file:examples/offline_inference/vision_language_multi_image.py>
Multi-image input can be extended to perform video captioning. We show this with [Qwen2-VL](https://huggingface.co/Qwen/Qwen2-VL-2B-Instruct) as it supports videos: Multi-image input can be extended to perform video captioning. We show this with [Qwen2-VL](https://huggingface.co/Qwen/Qwen2-VL-2B-Instruct) as it supports videos:
...@@ -122,21 +122,21 @@ for o in outputs: ...@@ -122,21 +122,21 @@ for o in outputs:
### Video ### Video
You can pass a list of NumPy arrays directly to the {code}`'video'` field of the multi-modal dictionary You can pass a list of NumPy arrays directly to the `'video'` field of the multi-modal dictionary
instead of using multi-image input. instead of using multi-image input.
Full example: <gh-file:examples/offline_inference_vision_language.py> Full example: <gh-file:examples/offline_inference/vision_language.py>
### Audio ### Audio
You can pass a tuple {code}`(array, sampling_rate)` to the {code}`'audio'` field of the multi-modal dictionary. You can pass a tuple `(array, sampling_rate)` to the `'audio'` field of the multi-modal dictionary.
Full example: <gh-file:examples/offline_inference_audio_language.py> Full example: <gh-file:examples/offline_inference/audio_language.py>
### Embedding ### Embedding
To input pre-computed embeddings belonging to a data type (i.e. image, video, or audio) directly to the language model, To input pre-computed embeddings belonging to a data type (i.e. image, video, or audio) directly to the language model,
pass a tensor of shape {code}`(num_items, feature_size, hidden_size of LM)` to the corresponding field of the multi-modal dictionary. pass a tensor of shape `(num_items, feature_size, hidden_size of LM)` to the corresponding field of the multi-modal dictionary.
```python ```python
# Inference with image embeddings as input # Inference with image embeddings as input
...@@ -199,17 +199,17 @@ for o in outputs: ...@@ -199,17 +199,17 @@ for o in outputs:
print(generated_text) print(generated_text)
``` ```
## Online Inference ## Online Serving
Our OpenAI-compatible server accepts multi-modal data via the [Chat Completions API](https://platform.openai.com/docs/api-reference/chat). Our OpenAI-compatible server accepts multi-modal data via the [Chat Completions API](https://platform.openai.com/docs/api-reference/chat).
```{important} :::{important}
A chat template is **required** to use Chat Completions API. A chat template is **required** to use Chat Completions API.
Although most models come with a chat template, for others you have to define one yourself. Although most models come with a chat template, for others you have to define one yourself.
The chat template can be inferred based on the documentation on the model's HuggingFace repo. The chat template can be inferred based on the documentation on the model's HuggingFace repo.
For example, LLaVA-1.5 (`llava-hf/llava-1.5-7b-hf`) requires a chat template that can be found here: <gh-file:examples/template_llava.jinja> For example, LLaVA-1.5 (`llava-hf/llava-1.5-7b-hf`) requires a chat template that can be found here: <gh-file:examples/template_llava.jinja>
``` :::
### Image ### Image
...@@ -271,30 +271,31 @@ chat_response = client.chat.completions.create( ...@@ -271,30 +271,31 @@ chat_response = client.chat.completions.create(
print("Chat completion output:", chat_response.choices[0].message.content) print("Chat completion output:", chat_response.choices[0].message.content)
``` ```
Full example: <gh-file:examples/openai_chat_completion_client_for_multimodal.py> Full example: <gh-file:examples/online_serving/openai_chat_completion_client_for_multimodal.py>
```{tip} :::{tip}
Loading from local file paths is also supported on vLLM: You can specify the allowed local media path via `--allowed-local-media-path` when launching the API server/engine, Loading from local file paths is also supported on vLLM: You can specify the allowed local media path via `--allowed-local-media-path` when launching the API server/engine,
and pass the file path as `url` in the API request. and pass the file path as `url` in the API request.
``` :::
```{tip} :::{tip}
There is no need to place image placeholders in the text content of the API request - they are already represented by the image content. There is no need to place image placeholders in the text content of the API request - they are already represented by the image content.
In fact, you can place image placeholders in the middle of the text by interleaving text and image content. In fact, you can place image placeholders in the middle of the text by interleaving text and image content.
``` :::
````{note} :::{note}
By default, the timeout for fetching images through HTTP URL is `5` seconds. By default, the timeout for fetching images through HTTP URL is `5` seconds.
You can override this by setting the environment variable: You can override this by setting the environment variable:
```console ```console
$ export VLLM_IMAGE_FETCH_TIMEOUT=<timeout> export VLLM_IMAGE_FETCH_TIMEOUT=<timeout>
``` ```
````
:::
### Video ### Video
Instead of {code}`image_url`, you can pass a video file via {code}`video_url`. Here is a simple example using [LLaVA-OneVision](https://huggingface.co/llava-hf/llava-onevision-qwen2-0.5b-ov-hf). Instead of `image_url`, you can pass a video file via `video_url`. Here is a simple example using [LLaVA-OneVision](https://huggingface.co/llava-hf/llava-onevision-qwen2-0.5b-ov-hf).
First, launch the OpenAI-compatible server: First, launch the OpenAI-compatible server:
...@@ -303,6 +304,7 @@ vllm serve llava-hf/llava-onevision-qwen2-0.5b-ov-hf --task generate --max-model ...@@ -303,6 +304,7 @@ vllm serve llava-hf/llava-onevision-qwen2-0.5b-ov-hf --task generate --max-model
``` ```
Then, you can use the OpenAI client as follows: Then, you can use the OpenAI client as follows:
```python ```python
from openai import OpenAI from openai import OpenAI
...@@ -342,16 +344,17 @@ result = chat_completion_from_url.choices[0].message.content ...@@ -342,16 +344,17 @@ result = chat_completion_from_url.choices[0].message.content
print("Chat completion output from image url:", result) print("Chat completion output from image url:", result)
``` ```
Full example: <gh-file:examples/openai_chat_completion_client_for_multimodal.py> Full example: <gh-file:examples/online_serving/openai_chat_completion_client_for_multimodal.py>
````{note} :::{note}
By default, the timeout for fetching videos through HTTP URL is `30` seconds. By default, the timeout for fetching videos through HTTP URL is `30` seconds.
You can override this by setting the environment variable: You can override this by setting the environment variable:
```console ```console
$ export VLLM_VIDEO_FETCH_TIMEOUT=<timeout> export VLLM_VIDEO_FETCH_TIMEOUT=<timeout>
``` ```
````
:::
### Audio ### Audio
...@@ -418,7 +421,7 @@ result = chat_completion_from_base64.choices[0].message.content ...@@ -418,7 +421,7 @@ result = chat_completion_from_base64.choices[0].message.content
print("Chat completion output from input audio:", result) print("Chat completion output from input audio:", result)
``` ```
Alternatively, you can pass {code}`audio_url`, which is the audio counterpart of {code}`image_url` for image input: Alternatively, you can pass `audio_url`, which is the audio counterpart of `image_url` for image input:
```python ```python
chat_completion_from_url = client.chat.completions.create( chat_completion_from_url = client.chat.completions.create(
...@@ -445,26 +448,27 @@ result = chat_completion_from_url.choices[0].message.content ...@@ -445,26 +448,27 @@ result = chat_completion_from_url.choices[0].message.content
print("Chat completion output from audio url:", result) print("Chat completion output from audio url:", result)
``` ```
Full example: <gh-file:examples/openai_chat_completion_client_for_multimodal.py> Full example: <gh-file:examples/online_serving/openai_chat_completion_client_for_multimodal.py>
````{note} :::{note}
By default, the timeout for fetching audios through HTTP URL is `10` seconds. By default, the timeout for fetching audios through HTTP URL is `10` seconds.
You can override this by setting the environment variable: You can override this by setting the environment variable:
```console ```console
$ export VLLM_AUDIO_FETCH_TIMEOUT=<timeout> export VLLM_AUDIO_FETCH_TIMEOUT=<timeout>
``` ```
````
:::
### Embedding ### Embedding
vLLM's Embeddings API is a superset of OpenAI's [Embeddings API](https://platform.openai.com/docs/api-reference/embeddings), vLLM's Embeddings API is a superset of OpenAI's [Embeddings API](https://platform.openai.com/docs/api-reference/embeddings),
where a list of chat `messages` can be passed instead of batched `inputs`. This enables multi-modal inputs to be passed to embedding models. where a list of chat `messages` can be passed instead of batched `inputs`. This enables multi-modal inputs to be passed to embedding models.
```{tip} :::{tip}
The schema of `messages` is exactly the same as in Chat Completions API. The schema of `messages` is exactly the same as in Chat Completions API.
You can refer to the above tutorials for more details on how to pass each type of multi-modal data. You can refer to the above tutorials for more details on how to pass each type of multi-modal data.
``` :::
Usually, embedding models do not expect chat-based input, so we need to use a custom chat template to format the text and images. Usually, embedding models do not expect chat-based input, so we need to use a custom chat template to format the text and images.
Refer to the examples below for illustration. Refer to the examples below for illustration.
...@@ -476,13 +480,13 @@ vllm serve TIGER-Lab/VLM2Vec-Full --task embed \ ...@@ -476,13 +480,13 @@ vllm serve TIGER-Lab/VLM2Vec-Full --task embed \
--trust-remote-code --max-model-len 4096 --chat-template examples/template_vlm2vec.jinja --trust-remote-code --max-model-len 4096 --chat-template examples/template_vlm2vec.jinja
``` ```
```{important} :::{important}
Since VLM2Vec has the same model architecture as Phi-3.5-Vision, we have to explicitly pass `--task embed` Since VLM2Vec has the same model architecture as Phi-3.5-Vision, we have to explicitly pass `--task embed`
to run this model in embedding mode instead of text generation mode. to run this model in embedding mode instead of text generation mode.
The custom chat template is completely different from the original one for this model, The custom chat template is completely different from the original one for this model,
and can be found here: <gh-file:examples/template_vlm2vec.jinja> and can be found here: <gh-file:examples/template_vlm2vec.jinja>
``` :::
Since the request schema is not defined by OpenAI client, we post a request to the server using the lower-level `requests` library: Since the request schema is not defined by OpenAI client, we post a request to the server using the lower-level `requests` library:
...@@ -517,16 +521,16 @@ vllm serve MrLight/dse-qwen2-2b-mrl-v1 --task embed \ ...@@ -517,16 +521,16 @@ vllm serve MrLight/dse-qwen2-2b-mrl-v1 --task embed \
--trust-remote-code --max-model-len 8192 --chat-template examples/template_dse_qwen2_vl.jinja --trust-remote-code --max-model-len 8192 --chat-template examples/template_dse_qwen2_vl.jinja
``` ```
```{important} :::{important}
Like with VLM2Vec, we have to explicitly pass `--task embed`. Like with VLM2Vec, we have to explicitly pass `--task embed`.
Additionally, `MrLight/dse-qwen2-2b-mrl-v1` requires an EOS token for embeddings, which is handled Additionally, `MrLight/dse-qwen2-2b-mrl-v1` requires an EOS token for embeddings, which is handled
by a custom chat template: <gh-file:examples/template_dse_qwen2_vl.jinja> by a custom chat template: <gh-file:examples/template_dse_qwen2_vl.jinja>
``` :::
```{important} :::{important}
Also important, `MrLight/dse-qwen2-2b-mrl-v1` requires a placeholder image of the minimum image size for text query embeddings. See the full code Also important, `MrLight/dse-qwen2-2b-mrl-v1` requires a placeholder image of the minimum image size for text query embeddings. See the full code
example below for details. example below for details.
``` :::
Full example: <gh-file:examples/openai_chat_embedding_client_for_multimodal.py> Full example: <gh-file:examples/online_serving/openai_chat_embedding_client_for_multimodal.py>
(offline-inference)=
# Offline Inference
You can run vLLM in your own code on a list of prompts.
The offline API is based on the {class}`~vllm.LLM` class.
To initialize the vLLM engine, create a new instance of `LLM` and specify the model to run.
For example, the following code downloads the [`facebook/opt-125m`](https://huggingface.co/facebook/opt-125m) model from HuggingFace
and runs it in vLLM using the default configuration.
```python
llm = LLM(model="facebook/opt-125m")
```
After initializing the `LLM` instance, you can perform model inference using various APIs.
The available APIs depend on the type of model that is being run:
- [Generative models](#generative-models) output logprobs which are sampled from to obtain the final output text.
- [Pooling models](#pooling-models) output their hidden states directly.
Please refer to the above pages for more details about each API.
:::{seealso}
[API Reference](/api/offline_inference/index)
:::
## Configuration Options
This section lists the most common options for running the vLLM engine.
For a full list, refer to the [Engine Arguments](#engine-args) page.
(model-resolution)=
### Model resolution
vLLM loads HuggingFace-compatible models by inspecting the `architectures` field in `config.json` of the model repository
and finding the corresponding implementation that is registered to vLLM.
Nevertheless, our model resolution may fail for the following reasons:
- The `config.json` of the model repository lacks the `architectures` field.
- Unofficial repositories refer to a model using alternative names which are not recorded in vLLM.
- The same architecture name is used for multiple models, creating ambiguity as to which model should be loaded.
To fix this, explicitly specify the model architecture by passing `config.json` overrides to the `hf_overrides` option.
For example:
```python
model = LLM(
model="cerebras/Cerebras-GPT-1.3B",
hf_overrides={"architectures": ["GPT2LMHeadModel"]}, # GPT-2
)
```
Our [list of supported models](#supported-models) shows the model architectures that are recognized by vLLM.
### Reducing memory usage
Large models might cause your machine to run out of memory (OOM). Here are some options that help alleviate this problem.
#### Tensor Parallelism (TP)
Tensor parallelism (`tensor_parallel_size` option) can be used to split the model across multiple GPUs.
The following code splits the model across 2 GPUs.
```python
llm = LLM(model="ibm-granite/granite-3.1-8b-instruct",
tensor_parallel_size=2)
```
:::{important}
To ensure that vLLM initializes CUDA correctly, you should avoid calling related functions (e.g. {func}`torch.cuda.set_device`)
before initializing vLLM. Otherwise, you may run into an error like `RuntimeError: Cannot re-initialize CUDA in forked subprocess`.
To control which devices are used, please instead set the `CUDA_VISIBLE_DEVICES` environment variable.
:::
#### Quantization
Quantized models take less memory at the cost of lower precision.
Statically quantized models can be downloaded from HF Hub (some popular ones are available at [Neural Magic](https://huggingface.co/neuralmagic))
and used directly without extra configuration.
Dynamic quantization is also supported via the `quantization` option -- see [here](#quantization-index) for more details.
#### Context length and batch size
You can further reduce memory usage by limiting the context length of the model (`max_model_len` option)
and the maximum batch size (`max_num_seqs` option).
```python
llm = LLM(model="adept/fuyu-8b",
max_model_len=2048,
max_num_seqs=2)
```
### Performance optimization and tuning
You can potentially improve the performance of vLLM by finetuning various options.
Please refer to [this guide](#optimization-and-tuning) for more details.
# OpenAI Compatible Server (openai-compatible-server)=
vLLM provides an HTTP server that implements OpenAI's [Completions](https://platform.openai.com/docs/api-reference/completions) and [Chat](https://platform.openai.com/docs/api-reference/chat) API, and more! # OpenAI-Compatible Server
vLLM provides an HTTP server that implements OpenAI's [Completions API](https://platform.openai.com/docs/api-reference/completions), [Chat API](https://platform.openai.com/docs/api-reference/chat), and more!
You can start the server via the [`vllm serve`](#vllm-serve) command, or through [Docker](#deployment-docker):
You can start the server via the [`vllm serve`](#vllm-serve) command, or through [Docker](deploying_with_docker.md):
```bash ```bash
vllm serve NousResearch/Meta-Llama-3-8B-Instruct --dtype auto --api-key token-abc123 vllm serve NousResearch/Meta-Llama-3-8B-Instruct --dtype auto --api-key token-abc123
``` ```
To call the server, you can use the [official OpenAI Python client](https://github.com/openai/openai-python), or any other HTTP client. To call the server, you can use the [official OpenAI Python client](https://github.com/openai/openai-python), or any other HTTP client.
```python ```python
from openai import OpenAI from openai import OpenAI
client = OpenAI( client = OpenAI(
...@@ -46,8 +50,14 @@ In addition, we have the following custom APIs: ...@@ -46,8 +50,14 @@ In addition, we have the following custom APIs:
- Applicable to all [pooling models](../models/pooling_models.md). - Applicable to all [pooling models](../models/pooling_models.md).
- [Score API](#score-api) (`/score`) - [Score API](#score-api) (`/score`)
- Only applicable to [cross-encoder models](../models/pooling_models.md) (`--task score`). - Only applicable to [cross-encoder models](../models/pooling_models.md) (`--task score`).
- [Re-rank API](#rerank-api) (`/rerank`, `/v1/rerank`, `/v2/rerank`)
- Implements [Jina AI's v1 re-rank API](https://jina.ai/reranker/)
- Also compatible with [Cohere's v1 & v2 re-rank APIs](https://docs.cohere.com/v2/reference/rerank)
- Jina and Cohere's APIs are very similar; Jina's includes extra information in the rerank endpoint's response.
- Only applicable to [cross-encoder models](../models/pooling_models.md) (`--task score`).
(chat-template)= (chat-template)=
## Chat Template ## Chat Template
In order for the language model to support chat protocol, vLLM requires the model to include In order for the language model to support chat protocol, vLLM requires the model to include
...@@ -69,6 +79,7 @@ vLLM community provides a set of chat templates for popular models. You can find ...@@ -69,6 +79,7 @@ vLLM community provides a set of chat templates for popular models. You can find
With the inclusion of multi-modal chat APIs, the OpenAI spec now accepts chat messages in a new format which specifies With the inclusion of multi-modal chat APIs, the OpenAI spec now accepts chat messages in a new format which specifies
both a `type` and a `text` field. An example is provided below: both a `type` and a `text` field. An example is provided below:
```python ```python
completion = client.chat.completions.create( completion = client.chat.completions.create(
model="NousResearch/Meta-Llama-3-8B-Instruct", model="NousResearch/Meta-Llama-3-8B-Instruct",
...@@ -78,7 +89,7 @@ completion = client.chat.completions.create( ...@@ -78,7 +89,7 @@ completion = client.chat.completions.create(
) )
``` ```
Most chat templates for LLMs expect the `content` field to be a string, but there are some newer models like Most chat templates for LLMs expect the `content` field to be a string, but there are some newer models like
`meta-llama/Llama-Guard-3-1B` that expect the content to be formatted according to the OpenAI schema in the `meta-llama/Llama-Guard-3-1B` that expect the content to be formatted according to the OpenAI schema in the
request. vLLM provides best-effort support to detect this automatically, which is logged as a string like request. vLLM provides best-effort support to detect this automatically, which is logged as a string like
*"Detected the chat template content format to be..."*, and internally converts incoming requests to match *"Detected the chat template content format to be..."*, and internally converts incoming requests to match
...@@ -113,12 +124,12 @@ completion = client.chat.completions.create( ...@@ -113,12 +124,12 @@ completion = client.chat.completions.create(
## Extra HTTP Headers ## Extra HTTP Headers
Only `X-Request-Id` HTTP request header is supported for now. It can be enabled Only `X-Request-Id` HTTP request header is supported for now. It can be enabled
with `--enable-request-id-headers`. with `--enable-request-id-headers`.
> Note that enablement of the headers can impact performance significantly at high QPS > Note that enablement of the headers can impact performance significantly at high QPS
> rates. We recommend implementing HTTP headers at the router level (e.g. via Istio), > rates. We recommend implementing HTTP headers at the router level (e.g. via Istio),
> rather than within the vLLM layer for this reason. > rather than within the vLLM layer for this reason.
> See https://github.com/vllm-project/vllm/pull/11529 for more details. > See [this PR](https://github.com/vllm-project/vllm/pull/11529) for more details.
```python ```python
completion = client.chat.completions.create( completion = client.chat.completions.create(
...@@ -145,15 +156,16 @@ print(completion._request_id) ...@@ -145,15 +156,16 @@ print(completion._request_id)
## CLI Reference ## CLI Reference
(vllm-serve)= (vllm-serve)=
### `vllm serve` ### `vllm serve`
The `vllm serve` command is used to launch the OpenAI-compatible server. The `vllm serve` command is used to launch the OpenAI-compatible server.
```{argparse} :::{argparse}
:module: vllm.entrypoints.openai.cli_args :module: vllm.entrypoints.openai.cli_args
:func: create_parser_for_docs :func: create_parser_for_docs
:prog: vllm serve :prog: vllm serve
``` :::
#### Configuration file #### Configuration file
...@@ -173,43 +185,45 @@ uvicorn-log-level: "info" ...@@ -173,43 +185,45 @@ uvicorn-log-level: "info"
To use the above config file: To use the above config file:
```bash ```bash
$ vllm serve SOME_MODEL --config config.yaml vllm serve SOME_MODEL --config config.yaml
``` ```
```{note} :::{note}
In case an argument is supplied simultaneously using command line and the config file, the value from the command line will take precedence. In case an argument is supplied simultaneously using command line and the config file, the value from the command line will take precedence.
The order of priorities is `command line > config file values > defaults`. The order of priorities is `command line > config file values > defaults`.
``` :::
## API Reference ## API Reference
(completions-api)= (completions-api)=
### Completions API ### Completions API
Our Completions API is compatible with [OpenAI's Completions API](https://platform.openai.com/docs/api-reference/completions); Our Completions API is compatible with [OpenAI's Completions API](https://platform.openai.com/docs/api-reference/completions);
you can use the [official OpenAI Python client](https://github.com/openai/openai-python) to interact with it. you can use the [official OpenAI Python client](https://github.com/openai/openai-python) to interact with it.
Code example: <gh-file:examples/openai_completion_client.py> Code example: <gh-file:examples/online_serving/openai_completion_client.py>
#### Extra parameters #### Extra parameters
The following [sampling parameters (click through to see documentation)](../dev/sampling_params.md) are supported. The following [sampling parameters](#sampling-params) are supported.
```{literalinclude} ../../../vllm/entrypoints/openai/protocol.py :::{literalinclude} ../../../vllm/entrypoints/openai/protocol.py
:language: python :language: python
:start-after: begin-completion-sampling-params :start-after: begin-completion-sampling-params
:end-before: end-completion-sampling-params :end-before: end-completion-sampling-params
``` :::
The following extra parameters are supported: The following extra parameters are supported:
```{literalinclude} ../../../vllm/entrypoints/openai/protocol.py :::{literalinclude} ../../../vllm/entrypoints/openai/protocol.py
:language: python :language: python
:start-after: begin-completion-extra-params :start-after: begin-completion-extra-params
:end-before: end-completion-extra-params :end-before: end-completion-extra-params
``` :::
(chat-api)= (chat-api)=
### Chat API ### Chat API
Our Chat API is compatible with [OpenAI's Chat Completions API](https://platform.openai.com/docs/api-reference/chat); Our Chat API is compatible with [OpenAI's Chat Completions API](https://platform.openai.com/docs/api-reference/chat);
...@@ -217,30 +231,31 @@ you can use the [official OpenAI Python client](https://github.com/openai/openai ...@@ -217,30 +231,31 @@ you can use the [official OpenAI Python client](https://github.com/openai/openai
We support both [Vision](https://platform.openai.com/docs/guides/vision)- and We support both [Vision](https://platform.openai.com/docs/guides/vision)- and
[Audio](https://platform.openai.com/docs/guides/audio?audio-generation-quickstart-example=audio-in)-related parameters; [Audio](https://platform.openai.com/docs/guides/audio?audio-generation-quickstart-example=audio-in)-related parameters;
see our [Multimodal Inputs](../usage/multimodal_inputs.md) guide for more information. see our [Multimodal Inputs](#multimodal-inputs) guide for more information.
- *Note: `image_url.detail` parameter is not supported.* - *Note: `image_url.detail` parameter is not supported.*
Code example: <gh-file:examples/openai_chat_completion_client.py> Code example: <gh-file:examples/online_serving/openai_chat_completion_client.py>
#### Extra parameters #### Extra parameters
The following [sampling parameters (click through to see documentation)](../dev/sampling_params.md) are supported. The following [sampling parameters](#sampling-params) are supported.
```{literalinclude} ../../../vllm/entrypoints/openai/protocol.py :::{literalinclude} ../../../vllm/entrypoints/openai/protocol.py
:language: python :language: python
:start-after: begin-chat-completion-sampling-params :start-after: begin-chat-completion-sampling-params
:end-before: end-chat-completion-sampling-params :end-before: end-chat-completion-sampling-params
``` :::
The following extra parameters are supported: The following extra parameters are supported:
```{literalinclude} ../../../vllm/entrypoints/openai/protocol.py :::{literalinclude} ../../../vllm/entrypoints/openai/protocol.py
:language: python :language: python
:start-after: begin-chat-completion-extra-params :start-after: begin-chat-completion-extra-params
:end-before: end-chat-completion-extra-params :end-before: end-chat-completion-extra-params
``` :::
(embeddings-api)= (embeddings-api)=
### Embeddings API ### Embeddings API
Our Embeddings API is compatible with [OpenAI's Embeddings API](https://platform.openai.com/docs/api-reference/embeddings); Our Embeddings API is compatible with [OpenAI's Embeddings API](https://platform.openai.com/docs/api-reference/embeddings);
...@@ -249,39 +264,40 @@ you can use the [official OpenAI Python client](https://github.com/openai/openai ...@@ -249,39 +264,40 @@ you can use the [official OpenAI Python client](https://github.com/openai/openai
If the model has a [chat template](#chat-template), you can replace `inputs` with a list of `messages` (same schema as [Chat API](#chat-api)) If the model has a [chat template](#chat-template), you can replace `inputs` with a list of `messages` (same schema as [Chat API](#chat-api))
which will be treated as a single prompt to the model. which will be treated as a single prompt to the model.
```{tip} :::{tip}
This enables multi-modal inputs to be passed to embedding models, see [this page](#multimodal-inputs) for details. This enables multi-modal inputs to be passed to embedding models, see [this page](#multimodal-inputs) for details.
``` :::
Code example: <gh-file:examples/openai_embedding_client.py> Code example: <gh-file:examples/online_serving/openai_embedding_client.py>
#### Extra parameters #### Extra parameters
The following [pooling parameters (click through to see documentation)](../dev/pooling_params.md) are supported. The following [pooling parameters](#pooling-params) are supported.
```{literalinclude} ../../../vllm/entrypoints/openai/protocol.py :::{literalinclude} ../../../vllm/entrypoints/openai/protocol.py
:language: python :language: python
:start-after: begin-embedding-pooling-params :start-after: begin-embedding-pooling-params
:end-before: end-embedding-pooling-params :end-before: end-embedding-pooling-params
``` :::
The following extra parameters are supported by default: The following extra parameters are supported by default:
```{literalinclude} ../../../vllm/entrypoints/openai/protocol.py :::{literalinclude} ../../../vllm/entrypoints/openai/protocol.py
:language: python :language: python
:start-after: begin-embedding-extra-params :start-after: begin-embedding-extra-params
:end-before: end-embedding-extra-params :end-before: end-embedding-extra-params
``` :::
For chat-like input (i.e. if `messages` is passed), these extra parameters are supported instead: For chat-like input (i.e. if `messages` is passed), these extra parameters are supported instead:
```{literalinclude} ../../../vllm/entrypoints/openai/protocol.py :::{literalinclude} ../../../vllm/entrypoints/openai/protocol.py
:language: python :language: python
:start-after: begin-chat-embedding-extra-params :start-after: begin-chat-embedding-extra-params
:end-before: end-chat-embedding-extra-params :end-before: end-chat-embedding-extra-params
``` :::
(tokenizer-api)= (tokenizer-api)=
### Tokenizer API ### Tokenizer API
Our Tokenizer API is a simple wrapper over [HuggingFace-style tokenizers](https://huggingface.co/docs/transformers/en/main_classes/tokenizer). Our Tokenizer API is a simple wrapper over [HuggingFace-style tokenizers](https://huggingface.co/docs/transformers/en/main_classes/tokenizer).
...@@ -291,15 +307,17 @@ It consists of two endpoints: ...@@ -291,15 +307,17 @@ It consists of two endpoints:
- `/detokenize` corresponds to calling `tokenizer.decode()`. - `/detokenize` corresponds to calling `tokenizer.decode()`.
(pooling-api)= (pooling-api)=
### Pooling API ### Pooling API
Our Pooling API encodes input prompts using a [pooling model](../models/pooling_models.md) and returns the corresponding hidden states. Our Pooling API encodes input prompts using a [pooling model](../models/pooling_models.md) and returns the corresponding hidden states.
The input format is the same as [Embeddings API](#embeddings-api), but the output data can contain an arbitrary nested list, not just a 1-D list of floats. The input format is the same as [Embeddings API](#embeddings-api), but the output data can contain an arbitrary nested list, not just a 1-D list of floats.
Code example: <gh-file:examples/openai_pooling_client.py> Code example: <gh-file:examples/online_serving/openai_pooling_client.py>
(score-api)= (score-api)=
### Score API ### Score API
Our Score API applies a cross-encoder model to predict scores for sentence pairs. Our Score API applies a cross-encoder model to predict scores for sentence pairs.
...@@ -307,7 +325,7 @@ Usually, the score for a sentence pair refers to the similarity between two sent ...@@ -307,7 +325,7 @@ Usually, the score for a sentence pair refers to the similarity between two sent
You can find the documentation for these kind of models at [sbert.net](https://www.sbert.net/docs/package_reference/cross_encoder/cross_encoder.html). You can find the documentation for these kind of models at [sbert.net](https://www.sbert.net/docs/package_reference/cross_encoder/cross_encoder.html).
Code example: <gh-file:examples/openai_cross_encoder_score.py> Code example: <gh-file:examples/online_serving/openai_cross_encoder_score.py>
#### Single inference #### Single inference
...@@ -445,18 +463,105 @@ Response: ...@@ -445,18 +463,105 @@ Response:
#### Extra parameters #### Extra parameters
The following [pooling parameters (click through to see documentation)](../dev/pooling_params.md) are supported. The following [pooling parameters](#pooling-params) are supported.
```{literalinclude} ../../../vllm/entrypoints/openai/protocol.py :::{literalinclude} ../../../vllm/entrypoints/openai/protocol.py
:language: python :language: python
:start-after: begin-score-pooling-params :start-after: begin-score-pooling-params
:end-before: end-score-pooling-params :end-before: end-score-pooling-params
``` :::
The following extra parameters are supported: The following extra parameters are supported:
```{literalinclude} ../../../vllm/entrypoints/openai/protocol.py :::{literalinclude} ../../../vllm/entrypoints/openai/protocol.py
:language: python :language: python
:start-after: begin-score-extra-params :start-after: begin-score-extra-params
:end-before: end-score-extra-params :end-before: end-score-extra-params
:::
(rerank-api)=
### Re-rank API
Our Re-rank API applies a cross-encoder model to predict relevant scores between a single query, and
each of a list of documents. Usually, the score for a sentence pair refers to the similarity between two sentences, on
a scale of 0 to 1.
You can find the documentation for these kind of models at [sbert.net](https://www.sbert.net/docs/package_reference/cross_encoder/cross_encoder.html).
The rerank endpoints support popular re-rank models such as `BAAI/bge-reranker-base` and other models supporting the
`score` task. Additionally, `/rerank`, `/v1/rerank`, and `/v2/rerank`
endpoints are compatible with both [Jina AI's re-rank API interface](https://jina.ai/reranker/) and
[Cohere's re-rank API interface](https://docs.cohere.com/v2/reference/rerank) to ensure compatibility with
popular open-source tools.
Code example: <gh-file:examples/online_serving/jinaai_rerank_client.py>
#### Example Request
Note that the `top_n` request parameter is optional and will default to the length of the `documents` field.
Result documents will be sorted by relevance, and the `index` property can be used to determine original order.
Request:
```bash
curl -X 'POST' \
'http://127.0.0.1:8000/v1/rerank' \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"model": "BAAI/bge-reranker-base",
"query": "What is the capital of France?",
"documents": [
"The capital of Brazil is Brasilia.",
"The capital of France is Paris.",
"Horses and cows are both animals"
]
}'
``` ```
Response:
```bash
{
"id": "rerank-fae51b2b664d4ed38f5969b612edff77",
"model": "BAAI/bge-reranker-base",
"usage": {
"total_tokens": 56
},
"results": [
{
"index": 1,
"document": {
"text": "The capital of France is Paris."
},
"relevance_score": 0.99853515625
},
{
"index": 0,
"document": {
"text": "The capital of Brazil is Brasilia."
},
"relevance_score": 0.0005860328674316406
}
]
}
```
#### Extra parameters
The following [pooling parameters](#pooling-params) are supported.
:::{literalinclude} ../../../vllm/entrypoints/openai/protocol.py
:language: python
:start-after: begin-rerank-pooling-params
:end-before: end-rerank-pooling-params
:::
The following extra parameters are supported:
:::{literalinclude} ../../../vllm/entrypoints/openai/protocol.py
:language: python
:start-after: begin-rerank-extra-params
:end-before: end-rerank-extra-params
:::
...@@ -45,7 +45,7 @@ You can preview the collected data by running the following command: ...@@ -45,7 +45,7 @@ You can preview the collected data by running the following command:
tail ~/.config/vllm/usage_stats.json tail ~/.config/vllm/usage_stats.json
``` ```
## Opt-out of Usage Stats Collection ## Opting out
You can opt-out of usage stats collection by setting the `VLLM_NO_USAGE_STATS` or `DO_NOT_TRACK` environment variable, or by creating a `~/.config/vllm/do_not_track` file: You can opt-out of usage stats collection by setting the `VLLM_NO_USAGE_STATS` or `DO_NOT_TRACK` environment variable, or by creating a `~/.config/vllm/do_not_track` file:
......
(compatibility-matrix)=
# Compatibility Matrix
The tables below show mutually exclusive features and the support on some hardware.
```{note}
Check the '✗' with links to see tracking issue for unsupported feature/hardware combination.
```
## Feature x Feature
```{raw} html
<style>
/* Make smaller to try to improve readability */
td {
font-size: 0.8rem;
text-align: center;
}
th {
text-align: center;
font-size: 0.8rem;
}
</style>
```
```{list-table}
:header-rows: 1
:stub-columns: 1
:widths: auto
* - Feature
- [CP](#chunked-prefill)
- [APC](#apc)
- [LoRA](#lora-adapter)
- <abbr title="Prompt Adapter">prmpt adptr</abbr>
- [SD](#spec_decode)
- CUDA graph
- <abbr title="Pooling Models">pooling</abbr>
- <abbr title="Encoder-Decoder Models">enc-dec</abbr>
- <abbr title="Logprobs">logP</abbr>
- <abbr title="Prompt Logprobs">prmpt logP</abbr>
- <abbr title="Async Output Processing">async output</abbr>
- multi-step
- <abbr title="Multimodal Inputs">mm</abbr>
- best-of
- beam-search
- <abbr title="Guided Decoding">guided dec</abbr>
* - [CP](#chunked-prefill)
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
* - [APC](#apc)
- ✅
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
* - [LoRA](#lora-adapter)
- [✗](gh-pr:9057)
- ✅
-
-
-
-
-
-
-
-
-
-
-
-
-
-
* - <abbr title="Prompt Adapter">prmpt adptr</abbr>
- ✅
- ✅
- ✅
-
-
-
-
-
-
-
-
-
-
-
-
-
* - [SD](#spec_decode)
- ✅
- ✅
- ✗
- ✅
-
-
-
-
-
-
-
-
-
-
-
-
* - CUDA graph
- ✅
- ✅
- ✅
- ✅
- ✅
-
-
-
-
-
-
-
-
-
-
-
* - <abbr title="Pooling Models">pooling</abbr>
- ✗
- ✗
- ✗
- ✗
- ✗
- ✗
-
-
-
-
-
-
-
-
-
-
* - <abbr title="Encoder-Decoder Models">enc-dec</abbr>
- ✗
- [✗](gh-issue:7366)
- ✗
- ✗
- [✗](gh-issue:7366)
- ✅
- ✅
-
-
-
-
-
-
-
-
-
* - <abbr title="Logprobs">logP</abbr>
- ✅
- ✅
- ✅
- ✅
- ✅
- ✅
- ✗
- ✅
-
-
-
-
-
-
-
-
* - <abbr title="Prompt Logprobs">prmpt logP</abbr>
- ✅
- ✅
- ✅
- ✅
- [✗](gh-pr:8199)
- ✅
- ✗
- ✅
- ✅
-
-
-
-
-
-
-
* - <abbr title="Async Output Processing">async output</abbr>
- ✅
- ✅
- ✅
- ✅
- ✗
- ✅
- ✗
- ✗
- ✅
- ✅
-
-
-
-
-
-
* - multi-step
- ✗
- ✅
- ✗
- ✅
- ✗
- ✅
- ✗
- ✗
- ✅
- [✗](gh-issue:8198)
- ✅
-
-
-
-
-
* - <abbr title="Multimodal Inputs">mm</abbr>
- ✅
- [✗](gh-pr:8348)
- [✗](gh-pr:7199)
- ?
- ?
- ✅
- ✅
- ✅
- ✅
- ✅
- ✅
- ?
-
-
-
-
* - best-of
- ✅
- ✅
- ✅
- ✅
- [✗](gh-issue:6137)
- ✅
- ✗
- ✅
- ✅
- ✅
- ?
- [✗](gh-issue:7968)
- ✅
-
-
-
* - beam-search
- ✅
- ✅
- ✅
- ✅
- [✗](gh-issue:6137)
- ✅
- ✗
- ✅
- ✅
- ✅
- ?
- [✗](gh-issue:7968>)
- ?
- ✅
-
-
* - <abbr title="Guided Decoding">guided dec</abbr>
- ✅
- ✅
- ?
- ?
- ✅
- ✅
- ✗
- ?
- ✅
- ✅
- ✅
- [✗](gh-issue:9893)
- ?
- ✅
- ✅
-
```
### Feature x Hardware
```{list-table}
:header-rows: 1
:stub-columns: 1
:widths: auto
* - Feature
- Volta
- Turing
- Ampere
- Ada
- Hopper
- CPU
- AMD
* - [CP](#chunked-prefill)
- [✗](gh-issue:2729)
- ✅
- ✅
- ✅
- ✅
- ✅
- ✅
* - [APC](#apc)
- [✗](gh-issue:3687)
- ✅
- ✅
- ✅
- ✅
- ✅
- ✅
* - [LoRA](#lora-adapter)
- ✅
- ✅
- ✅
- ✅
- ✅
- [✗](gh-pr:4830)
- ✅
* - <abbr title="Prompt Adapter">prmpt adptr</abbr>
- ✅
- ✅
- ✅
- ✅
- ✅
- [✗](gh-issue:8475)
- ✅
* - [SD](#spec_decode)
- ✅
- ✅
- ✅
- ✅
- ✅
- ✅
- ✅
* - CUDA graph
- ✅
- ✅
- ✅
- ✅
- ✅
- ✗
- ✅
* - <abbr title="Pooling Models">pooling</abbr>
- ✅
- ✅
- ✅
- ✅
- ✅
- ✅
- ?
* - <abbr title="Encoder-Decoder Models">enc-dec</abbr>
- ✅
- ✅
- ✅
- ✅
- ✅
- ✅
- ✗
* - <abbr title="Multimodal Inputs">mm</abbr>
- ✅
- ✅
- ✅
- ✅
- ✅
- ✅
- ✅
* - <abbr title="Logprobs">logP</abbr>
- ✅
- ✅
- ✅
- ✅
- ✅
- ✅
- ✅
* - <abbr title="Prompt Logprobs">prmpt logP</abbr>
- ✅
- ✅
- ✅
- ✅
- ✅
- ✅
- ✅
* - <abbr title="Async Output Processing">async output</abbr>
- ✅
- ✅
- ✅
- ✅
- ✅
- ✗
- ✗
* - multi-step
- ✅
- ✅
- ✅
- ✅
- ✅
- [✗](gh-issue:8477)
- ✅
* - best-of
- ✅
- ✅
- ✅
- ✅
- ✅
- ✅
- ✅
* - beam-search
- ✅
- ✅
- ✅
- ✅
- ✅
- ✅
- ✅
* - <abbr title="Guided Decoding">guided dec</abbr>
- ✅
- ✅
- ✅
- ✅
- ✅
- ✅
- ✅
```
### Quantizer Utilities
`quantize.py`: NVIDIA Quantization utilities using TensorRT-Model-Optimizer, ported
from TensorRT-LLM: [`examples/quantization/quantize.py`](https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/quantization/quantize.py)
### Prerequisite
#### AMMO (AlgorithMic Model Optimization) Installation: nvidia-ammo 0.7.1 or later
`pip install --no-cache-dir --extra-index-url https://pypi.nvidia.com nvidia-ammo`
#### AMMO Download (code and docs)
`https://developer.nvidia.com/downloads/assets/cuda/files/nvidia-ammo/nvidia_ammo-0.5.0.tar.gz`
`https://developer.nvidia.com/downloads/assets/cuda/files/nvidia-ammo/nvidia_ammo-0.7.1.tar.gz`
### Usage
#### Run on H100 system for speed if FP8; number of GPUs depends on the model size
#### Example: quantize Llama2-7b model from HF to FP8 with FP8 KV Cache:
`python quantize.py --model-dir ./ll2-7b --dtype float16 --qformat fp8 --kv-cache-dtype fp8 --output-dir ./ll2_7b_fp8 --calib-size 512 --tp-size 1`
Outputs: model structure, quantized model & parameters (with scaling factors) are in JSON and Safetensors (npz is generated only for the reference)
```
# ll ./ll2_7b_fp8/
total 19998244
drwxr-xr-x 2 root root 4096 Feb 7 01:08 ./
drwxrwxr-x 8 1060 1061 4096 Feb 7 01:08 ../
-rw-r--r-- 1 root root 176411 Feb 7 01:08 llama_tp1.json
-rw-r--r-- 1 root root 13477087480 Feb 7 01:09 llama_tp1_rank0.npz
-rw-r--r-- 1 root root 7000893272 Feb 7 01:08 rank0.safetensors
#
```
# SPDX-FileCopyrightText: Copyright (c) 2022-2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved. # noqa: E501
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Adapted from examples/quantization/hf_ptq.py
"""
import argparse
import copy
import json
import random
import time
import ammo.torch.quantization as atq
import numpy as np
import torch
from ammo.torch.export import export_model_config
from datasets import load_dataset
from torch.utils.data import DataLoader
from transformers import AutoModelForCausalLM, AutoTokenizer
RAND_SEED = 1234
MAX_SEQ_LEN = 2048
EMPTY_CFG = {
"quant_cfg": {
"*weight_quantizer": {
"enable": False,
},
"*input_quantizer": {
"enable": False
},
"*lm_head*": {
"enable": False
},
"*output_layer*": {
"enable": False
},
"default": {
"enable": False
},
},
"algorithm": "max",
}
KV_CACHE_CFG = {
"*.query_key_value.output_quantizer": {
"num_bits": 8,
"axis": None,
"enable": True
},
"*.Wqkv.output_quantizer": {
"num_bits": 8,
"axis": None,
"enable": True
},
"*.W_pack.output_quantizer": {
"num_bits": 8,
"axis": None,
"enable": True
},
"*.c_attn.output_quantizer": {
"num_bits": 8,
"axis": None,
"enable": True
},
"*.k_proj.output_quantizer": {
"num_bits": 8,
"axis": None,
"enable": True
},
"*.v_proj.output_quantizer": {
"num_bits": 8,
"axis": None,
"enable": True
},
}
QUANT_CFG_CHOICES = {
"int8_sq": atq.INT8_SMOOTHQUANT_CFG,
"fp8": atq.FP8_DEFAULT_CFG,
"int4_awq": atq.INT4_AWQ_CFG,
"w4a8_awq": atq.W4A8_AWQ_BETA_CFG,
"int8_wo": EMPTY_CFG,
"int4_wo": EMPTY_CFG,
"full_prec": EMPTY_CFG,
}
MODEL_NAME_PATTERN_MAP = {
"GPT2": "gpt2",
"Xverse": "llama",
"Llama": "llama",
"Mistral": "llama",
"GPTJ": "gptj",
"FalconForCausalLM": "falcon",
"RWForCausalLM": "falcon",
"baichuan": "baichuan",
"MPT": "mpt",
"Bloom": "bloom",
"ChatGLM": "chatglm",
"QWen": "qwen",
}
def get_tokenizer(ckpt_path, max_seq_len=MAX_SEQ_LEN, model_type=None):
print(f"Initializing tokenizer from {ckpt_path}")
tokenizer = AutoTokenizer.from_pretrained(
ckpt_path,
model_max_length=max_seq_len,
padding_side="left",
trust_remote_code=True,
)
if model_type and model_type == "qwen":
# qwen use token id 151643 as pad and eos tokens
tokenizer.pad_token = tokenizer.convert_ids_to_tokens(151643)
tokenizer.eos_token = tokenizer.convert_ids_to_tokens(151643)
# can't set attribute 'pad_token' for "<unk>"
if tokenizer.pad_token != "<unk>":
tokenizer.pad_token = tokenizer.eos_token
if tokenizer.pad_token is None:
tokenizer.pad_token = tokenizer.eos_token
assert (tokenizer.pad_token
is not None), f"Pad token for {model_type} cannot be set!"
return tokenizer
def get_model(ckpt_path, dtype="fp16", device="cuda"):
print(f"Initializing model from {ckpt_path}")
if dtype == "bf16" or dtype == "bfloat16":
dtype = torch.bfloat16
elif dtype == "fp16" or dtype == "float16":
dtype = torch.float16
elif dtype == "fp32" or dtype == "float32":
dtype = torch.float32
else:
raise NotImplementedError(f"Unknown dtype {dtype}")
# model_kwargs = {"torch_dtype": dtype}
model_kwargs = {"torch_dtype": "auto"}
model = AutoModelForCausalLM.from_pretrained(ckpt_path,
device_map="auto",
**model_kwargs,
trust_remote_code=True)
model.eval()
model_dtype = next(model.parameters()).dtype
if dtype != model_dtype:
print("[TensorRT-LLM][WARNING] The manually set model data type is "
f"{dtype}, but the data type of the HuggingFace model is "
f"{model_dtype}.")
return model
def get_model_type(model):
for k, v in MODEL_NAME_PATTERN_MAP.items():
if k.lower() in type(model).__name__.lower():
return v
return None
def get_calib_dataloader(data="cnn_dailymail",
tokenizer=None,
batch_size=1,
calib_size=512,
block_size=512,
device=None):
print("Loading calibration dataset")
if data == "pileval":
dataset = load_dataset(
"json",
data_files="https://the-eye.eu/public/AI/pile/val.jsonl.zst",
split="train")
dataset = dataset["text"][:calib_size]
elif data == "cnn_dailymail":
dataset = load_dataset("cnn_dailymail", name="3.0.0", split="train")
dataset = dataset["article"][:calib_size]
else:
raise NotImplementedError
batch_encoded = tokenizer.batch_encode_plus(dataset,
return_tensors="pt",
padding="max_length",
truncation=True,
max_length=block_size)
if device:
batch_encoded = batch_encoded.to(device)
batch_encoded = batch_encoded["input_ids"]
calib_dataloader = DataLoader(batch_encoded,
batch_size=batch_size,
shuffle=False)
return calib_dataloader
def quantize_model(model, quant_cfg, calib_dataloader=None):
def calibrate_loop():
if calib_dataloader is None:
return
"""Adjusts weights and scaling factors based on selected algorithms."""
for idx, data in enumerate(calib_dataloader):
print(f"Calibrating batch {idx}")
model(data)
print("Starting quantization...")
start_time = time.time()
atq.quantize(model, quant_cfg, forward_loop=calibrate_loop)
end_time = time.time()
print("Quantization done. Total time used: {:.2f} s.".format(end_time -
start_time))
return model
def main(args):
if not torch.cuda.is_available():
raise OSError("GPU is required for inference.")
random.seed(RAND_SEED)
np.random.seed(RAND_SEED)
model = get_model(args.model_dir, args.dtype, args.device)
model_type = get_model_type(model)
tokenizer = get_tokenizer(args.model_dir, model_type=model_type)
if args.qformat in ["full_prec", "int8_wo", "int4_wo"
] and args.kv_cache_dtype is None:
print(f"No quantization applied, export {args.dtype} model")
else:
if "awq" in args.qformat:
if args.calib_size > 32:
print("AWQ calibration could take longer with calib_size = "
f"{args.calib_size}, Using calib_size=32 instead")
args.calib_size = 32
print("\nAWQ calibration could take longer than other calibration "
"methods. Please increase the batch size to speed up the "
"calibration process. Batch size can be set by adding the "
"argument --batch_size <batch_size> to the command line.\n")
calib_dataloader = get_calib_dataloader(
tokenizer=tokenizer,
batch_size=args.batch_size,
calib_size=args.calib_size,
device=args.device,
)
if args.qformat in QUANT_CFG_CHOICES:
quant_cfg = QUANT_CFG_CHOICES[args.qformat]
else:
raise ValueError(
f"Unsupported quantization format: {args.qformat}")
if "awq" in args.qformat:
quant_cfg = copy.deepcopy(QUANT_CFG_CHOICES[args.qformat])
weight_quantizer = quant_cfg["quant_cfg"][
"*weight_quantizer"] # type: ignore
if isinstance(weight_quantizer, list):
weight_quantizer = weight_quantizer[0]
weight_quantizer["block_sizes"][-1] = args.awq_block_size
if args.kv_cache_dtype is not None:
if args.kv_cache_dtype == "fp8":
for value in KV_CACHE_CFG.values():
value.update({"num_bits": (4, 3)}) # type: ignore
quant_cfg["quant_cfg"].update(KV_CACHE_CFG) # type: ignore
print(quant_cfg)
model = quantize_model(model, quant_cfg, calib_dataloader)
with torch.inference_mode():
if model_type is None:
print(f"Unknown model type {type(model).__name__}. Continue "
"exporting...")
model_type = f"unknown:{type(model).__name__}"
export_path = args.output_dir
start_time = time.time()
if args.qformat == "int4_awq" and model_type == "qwen":
torch.save(model.state_dict(), export_path)
else:
export_npz = (model_type not in [
'gptj', 'falcon', 'chatglm', 'mpt', 'llama', 'baichuan'
])
# export safetensors
export_model_config(
model,
model_type,
getattr(torch, args.dtype),
export_dir=export_path,
inference_tensor_parallel=args.tp_size,
inference_pipeline_parallel=args.pp_size,
# export_tensorrt_llm_config=(not export_npz),
export_tensorrt_llm_config=False,
export_npz=export_npz)
# Workaround for wo quantization
if args.qformat in ["int8_wo", "int4_wo", "full_prec"]:
with open(f"{export_path}/config.json") as f:
tensorrt_llm_config = json.load(f)
if args.qformat == "int8_wo":
tensorrt_llm_config["quantization"]["quant_algo"] = 'W8A16'
elif args.qformat == "int4_wo":
tensorrt_llm_config["quantization"]["quant_algo"] = 'W4A16'
else:
tensorrt_llm_config["quantization"]["quant_algo"] = None
with open(f"{export_path}/config.json", "w") as f:
json.dump(tensorrt_llm_config, f, indent=4)
end_time = time.time()
print("Quantized model exported to {} \nTotal time used {:.2f} s.".
format(export_path, end_time - start_time))
if __name__ == "__main__":
parser = argparse.ArgumentParser(description=__doc__)
parser.add_argument("--model-dir",
help="Specify where the HuggingFace model is",
required=True)
parser.add_argument("--device", default="cuda")
parser.add_argument("--dtype", help="Model data type.", default="float16")
parser.add_argument(
"--qformat",
help="Quantization format.",
default="full_prec",
choices=[
"fp8", "int8_sq", "int4_awq", "w4a8_awq", "int8_wo", "int4_wo",
"full_prec"
],
)
parser.add_argument("--batch-size",
help="Batch size for calibration.",
type=int,
default=1)
parser.add_argument("--calib-size",
help="Number of samples for calibration.",
type=int,
default=512)
parser.add_argument("--output-dir", default="exported_model")
parser.add_argument("--tp-size", type=int, default=1)
parser.add_argument("--pp-size", type=int, default=1)
parser.add_argument("--awq-block-size", type=int, default=128)
parser.add_argument("--kv-cache-dtype",
help="KV Cache dtype.",
default=None,
choices=["int8", "fp8", None])
args = parser.parse_args()
main(args)
...@@ -67,7 +67,37 @@ def run_qwen2_audio(question: str, audio_count: int): ...@@ -67,7 +67,37 @@ def run_qwen2_audio(question: str, audio_count: int):
return llm, prompt, stop_token_ids return llm, prompt, stop_token_ids
model_example_map = {"ultravox": run_ultravox, "qwen2_audio": run_qwen2_audio} def run_minicpmo(question: str, audio_count: int):
model_name = "openbmb/MiniCPM-o-2_6"
tokenizer = AutoTokenizer.from_pretrained(model_name,
trust_remote_code=True)
llm = LLM(model=model_name,
trust_remote_code=True,
max_model_len=4096,
max_num_seqs=5,
limit_mm_per_prompt={"audio": audio_count})
stop_tokens = ['<|im_end|>', '<|endoftext|>']
stop_token_ids = [tokenizer.convert_tokens_to_ids(i) for i in stop_tokens]
audio_placeholder = "(<audio>./</audio>)" * audio_count
audio_chat_template = "{% for message in messages %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n<|spk_bos|><|spk|><|spk_eos|><|tts_bos|>' }}{% endif %}" # noqa: E501
messages = [{
'role': 'user',
'content': f'{audio_placeholder}\n{question}'
}]
prompt = tokenizer.apply_chat_template(messages,
tokenize=False,
add_generation_prompt=True,
chat_template=audio_chat_template)
return llm, prompt, stop_token_ids
model_example_map = {
"ultravox": run_ultravox,
"qwen2_audio": run_qwen2_audio,
"minicpmo": run_minicpmo
}
def main(args): def main(args):
......
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment