Commit ec5e299c authored by zhuwenwen's avatar zhuwenwen
Browse files

Merge tag 'v0.7.3' into v0.7.3-dev

parents 47bd229c ed6e9075
...@@ -46,7 +46,7 @@ for output in outputs: ...@@ -46,7 +46,7 @@ for output in outputs:
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}") print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
``` ```
A code example can be found here: <gh-file:examples/offline_inference/basic.py> A code example can be found here: <gh-file:examples/offline_inference/basic/basic.py>
### `LLM.beam_search` ### `LLM.beam_search`
...@@ -103,7 +103,7 @@ for output in outputs: ...@@ -103,7 +103,7 @@ for output in outputs:
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}") print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
``` ```
A code example can be found here: <gh-file:examples/offline_inference/chat.py> A code example can be found here: <gh-file:examples/offline_inference/basic/chat.py>
If the model doesn't have a chat template or you want to specify another one, If the model doesn't have a chat template or you want to specify another one,
you can explicitly pass a chat template: you can explicitly pass a chat template:
......
...@@ -28,10 +28,10 @@ The selected option sets the default pooler used to extract the final hidden sta ...@@ -28,10 +28,10 @@ The selected option sets the default pooler used to extract the final hidden sta
- * Embedding (`embed`) - * Embedding (`embed`)
* `LAST` * `LAST`
* ✅︎ * ✅︎
* *
- * Classification (`classify`) - * Classification (`classify`)
* `LAST` * `LAST`
* *
* ✅︎ * ✅︎
- * Sentence Pair Scoring (`score`) - * Sentence Pair Scoring (`score`)
* \* * \*
...@@ -39,8 +39,8 @@ The selected option sets the default pooler used to extract the final hidden sta ...@@ -39,8 +39,8 @@ The selected option sets the default pooler used to extract the final hidden sta
* \* * \*
- * Reward Modeling (`reward`) - * Reward Modeling (`reward`)
* `ALL` * `ALL`
* *
* *
::: :::
\*The default pooler is always defined by the model. \*The default pooler is always defined by the model.
...@@ -88,7 +88,7 @@ embeds = output.outputs.embedding ...@@ -88,7 +88,7 @@ embeds = output.outputs.embedding
print(f"Embeddings: {embeds!r} (size={len(embeds)})") print(f"Embeddings: {embeds!r} (size={len(embeds)})")
``` ```
A code example can be found here: <gh-file:examples/offline_inference/embedding.py> A code example can be found here: <gh-file:examples/offline_inference/basic/embed.py>
### `LLM.classify` ### `LLM.classify`
...@@ -103,7 +103,7 @@ probs = output.outputs.probs ...@@ -103,7 +103,7 @@ probs = output.outputs.probs
print(f"Class Probabilities: {probs!r} (size={len(probs)})") print(f"Class Probabilities: {probs!r} (size={len(probs)})")
``` ```
A code example can be found here: <gh-file:examples/offline_inference/classification.py> A code example can be found here: <gh-file:examples/offline_inference/basic/classify.py>
### `LLM.score` ### `LLM.score`
...@@ -125,7 +125,7 @@ score = output.outputs.score ...@@ -125,7 +125,7 @@ score = output.outputs.score
print(f"Score: {score}") print(f"Score: {score}")
``` ```
A code example can be found here: <gh-file:examples/offline_inference/scoring.py> A code example can be found here: <gh-file:examples/offline_inference/basic/score.py>
## Online Serving ## Online Serving
......
...@@ -42,7 +42,7 @@ Alternatively, you can [open an issue on GitHub](https://github.com/vllm-project ...@@ -42,7 +42,7 @@ Alternatively, you can [open an issue on GitHub](https://github.com/vllm-project
### Transformers fallback ### Transformers fallback
After the merge of <gh-pr:11330>, `vllm` can fallback to models that are available in `transformers`. This does not work for all models for now, but most decoder language models are supported, and vision language model support is planned! `vllm` can fallback to models that are available in `transformers`. This does not work for all models for now, but most decoder language models are supported, and vision language model support is planned!
To check if the backend is `transformers`, you can simply do this: To check if the backend is `transformers`, you can simply do this:
...@@ -56,9 +56,13 @@ If it is `TransformersModel` then it means it's based on `transformers`! ...@@ -56,9 +56,13 @@ If it is `TransformersModel` then it means it's based on `transformers`!
#### Supported features #### Supported features
##### LORA and quantization ##### Quantization
Both are not supported yet! Make sure to open an issue and we'll work on this together with the `transformers` team! Transformers fallback has supported most of available quantization in vLLM (except GGUF). See [Quantization page](#quantization-index) for more information about supported quantization in vllm.
##### LoRA
LoRA hasn't supported on transformers fallback yet! Make sure to open an issue and we'll work on this together with the `transformers` team!
Usually `transformers` model load weights via the `load_adapters` API, that depends on PEFT. We need to work a bit to either use this api (for now this would result in some weights not being marked as loaded) or replace modules accordingly. Usually `transformers` model load weights via the `load_adapters` API, that depends on PEFT. We need to work a bit to either use this api (for now this would result in some weights not being marked as loaded) or replace modules accordingly.
...@@ -429,7 +433,7 @@ See [this page](#generative-models) for more information on how to use generativ ...@@ -429,7 +433,7 @@ See [this page](#generative-models) for more information on how to use generativ
* ✅︎ * ✅︎
- * `TeleChat2ForCausalLM` - * `TeleChat2ForCausalLM`
* TeleChat2 * TeleChat2
* `TeleAI/TeleChat2-3B`, `TeleAI/TeleChat2-7B`, `TeleAI/TeleChat2-35B`, etc. * `Tele-AI/TeleChat2-3B`, `Tele-AI/TeleChat2-7B`, `Tele-AI/TeleChat2-35B`, etc.
* ✅︎ * ✅︎
* ✅︎ * ✅︎
- * `XverseForCausalLM` - * `XverseForCausalLM`
...@@ -699,10 +703,10 @@ See [this page](#generative-models) for more information on how to use generativ ...@@ -699,10 +703,10 @@ See [this page](#generative-models) for more information on how to use generativ
* *
* ✅︎ * ✅︎
* ✅︎ * ✅︎
- * `DeepseekVLV2ForCausalLM` - * `DeepseekVLV2ForCausalLM`<sup>^</sup>
* DeepSeek-VL2 * DeepSeek-VL2
* T + I<sup>+</sup> * T + I<sup>+</sup>
* `deepseek-ai/deepseek-vl2-tiny`, `deepseek-ai/deepseek-vl2-small`, `deepseek-ai/deepseek-vl2` etc. (see note) * `deepseek-ai/deepseek-vl2-tiny`, `deepseek-ai/deepseek-vl2-small`, `deepseek-ai/deepseek-vl2` etc.
* *
* ✅︎ * ✅︎
* ✅︎ * ✅︎
...@@ -713,20 +717,20 @@ See [this page](#generative-models) for more information on how to use generativ ...@@ -713,20 +717,20 @@ See [this page](#generative-models) for more information on how to use generativ
* *
* ✅︎ * ✅︎
* ✅︎ * ✅︎
- * `ChatGLMModel` - * `GLM4VForCausalLM`<sup>^</sup>
* GLM-4V * GLM-4V
* T + I * T + I
* `THUDM/glm-4v-9b` etc. * `THUDM/glm-4v-9b`, `THUDM/cogagent-9b-20241220` etc.
* ✅︎
* ✅︎ * ✅︎
* ✅︎ * ✅︎
*
- * `H2OVLChatModel` - * `H2OVLChatModel`
* H2OVL * H2OVL
* T + I<sup>E+</sup> * T + I<sup>E+</sup>
* `h2oai/h2ovl-mississippi-800m`, `h2oai/h2ovl-mississippi-2b`, etc. * `h2oai/h2ovl-mississippi-800m`, `h2oai/h2ovl-mississippi-2b`, etc.
* *
* ✅︎ * ✅︎
* \* * ✅︎\*
- * `Idefics3ForConditionalGeneration` - * `Idefics3ForConditionalGeneration`
* Idefics3 * Idefics3
* T + I * T + I
...@@ -793,7 +797,7 @@ See [this page](#generative-models) for more information on how to use generativ ...@@ -793,7 +797,7 @@ See [this page](#generative-models) for more information on how to use generativ
- * `MolmoForCausalLM` - * `MolmoForCausalLM`
* Molmo * Molmo
* T + I * T + I
* `allenai/Molmo-7B-D-0924`, `allenai/Molmo-72B-0924`, etc. * `allenai/Molmo-7B-D-0924`, `allenai/Molmo-7B-O-0924`, etc.
* ✅︎ * ✅︎
* ✅︎ * ✅︎
* ✅︎ * ✅︎
...@@ -804,7 +808,7 @@ See [this page](#generative-models) for more information on how to use generativ ...@@ -804,7 +808,7 @@ See [this page](#generative-models) for more information on how to use generativ
* *
* ✅︎ * ✅︎
* ✅︎ * ✅︎
- * `PaliGemmaForConditionalGeneration` - * `PaliGemmaForConditionalGeneration`\*
* PaliGemma, PaliGemma 2 * PaliGemma, PaliGemma 2
* T + I<sup>E</sup> * T + I<sup>E</sup>
* `google/paligemma-3b-pt-224`, `google/paligemma-3b-mix-224`, `google/paligemma2-3b-ft-docci-448`, etc. * `google/paligemma-3b-pt-224`, `google/paligemma-3b-mix-224`, `google/paligemma2-3b-ft-docci-448`, etc.
...@@ -825,7 +829,7 @@ See [this page](#generative-models) for more information on how to use generativ ...@@ -825,7 +829,7 @@ See [this page](#generative-models) for more information on how to use generativ
* *
* ✅︎ * ✅︎
* ✅︎ * ✅︎
- * `QWenLMHeadModel` - * `QwenVLForConditionalGeneration`<sup>^</sup>
* Qwen-VL * Qwen-VL
* T + I<sup>E+</sup> * T + I<sup>E+</sup>
* `Qwen/Qwen-VL`, `Qwen/Qwen-VL-Chat`, etc. * `Qwen/Qwen-VL`, `Qwen/Qwen-VL-Chat`, etc.
...@@ -850,27 +854,26 @@ See [this page](#generative-models) for more information on how to use generativ ...@@ -850,27 +854,26 @@ See [this page](#generative-models) for more information on how to use generativ
* Qwen2.5-VL * Qwen2.5-VL
* T + I<sup>E+</sup> + V<sup>E+</sup> * T + I<sup>E+</sup> + V<sup>E+</sup>
* `Qwen/Qwen2.5-VL-3B-Instruct`, `Qwen/Qwen2.5-VL-72B-Instruct`, etc. * `Qwen/Qwen2.5-VL-3B-Instruct`, `Qwen/Qwen2.5-VL-72B-Instruct`, etc.
* * ✅︎
* ✅︎ * ✅︎
* ✅︎ * ✅︎
- * `UltravoxModel` - * `UltravoxModel`
* Ultravox * Ultravox
* T + A<sup>E+</sup> * T + A<sup>E+</sup>
* `fixie-ai/ultravox-v0_3` * `fixie-ai/ultravox-v0_5-llama-3_2-1b`
* ✅︎ * ✅︎
* ✅︎ * ✅︎
* ✅︎ * ✅︎
::: :::
<sup>^</sup> You need to set the architecture name via `--hf-overrides` to match the one in vLLM.
&nbsp;&nbsp;&nbsp;&nbsp;• For example, to use DeepSeek-VL2 series models:
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;`--hf-overrides '{"architectures": ["DeepseekVLV2ForCausalLM"]}'`
<sup>E</sup> Pre-computed embeddings can be inputted for this modality. <sup>E</sup> Pre-computed embeddings can be inputted for this modality.
<sup>+</sup> Multiple items can be inputted per text prompt for this modality. <sup>+</sup> Multiple items can be inputted per text prompt for this modality.
:::{note} :::{note}
To use DeepSeek-VL2 series models, you have to pass `--hf_overrides '{"architectures": ["DeepseekVLV2ForCausalLM"]}'` when running vLLM. `h2oai/h2ovl-mississippi-2b` will be available in V1 once we support backends other than FlashAttention.
:::
:::{note}
H2O-VL series models will be available in V1 once we support backends other than FlashAttention.
::: :::
:::{note} :::{note}
...@@ -882,6 +885,10 @@ The official `openbmb/MiniCPM-V-2` doesn't work yet, so we need to use a fork (` ...@@ -882,6 +885,10 @@ The official `openbmb/MiniCPM-V-2` doesn't work yet, so we need to use a fork (`
For more details, please see: <gh-pr:4087#issuecomment-2250397630> For more details, please see: <gh-pr:4087#issuecomment-2250397630>
::: :::
:::{note}
Currently the PaliGemma model series is implemented without PrefixLM attention mask. This model series may be deprecated in a future release.
:::
:::{note} :::{note}
`mistral-community/pixtral-12b` does not support V1 yet. `mistral-community/pixtral-12b` does not support V1 yet.
::: :::
...@@ -939,6 +946,26 @@ The following table lists those that are tested in vLLM. ...@@ -939,6 +946,26 @@ The following table lists those that are tested in vLLM.
* ✅︎ * ✅︎
::: :::
#### Transcription (`--task transcription`)
Speech2Text models trained specifically for Automatic Speech Recognition.
:::{list-table}
:widths: 25 25 25 5 5
:header-rows: 1
- * Architecture
* Models
* Example HF Models
* [LoRA](#lora-adapter)
* [PP](#distributed-serving)
- * `Whisper`
* Whisper-based
* `openai/whisper-large-v3-turbo`
* 🚧
* 🚧
:::
_________________ _________________
## Model Support Policy ## Model Support Policy
......
...@@ -75,11 +75,15 @@ bash run_cluster.sh \ ...@@ -75,11 +75,15 @@ bash run_cluster.sh \
-e VLLM_HOST_IP=ip_of_this_node -e VLLM_HOST_IP=ip_of_this_node
``` ```
Then you get a ray cluster of containers. Note that you need to keep the shells running these commands alive to hold the cluster. Any shell disconnect will terminate the cluster. In addition, please note that the argument `ip_of_head_node` should be the IP address of the head node, which is accessible by all the worker nodes. The IP addresses of each worker node should be specified in the `VLLM_HOST_IP` environment variable, and should be different for each worker node. Please check the network configuration of your cluster to make sure the nodes can communicate with each other through the specified IP addresses. Then you get a ray cluster of **containers**. Note that you need to keep the shells running these commands alive to hold the cluster. Any shell disconnect will terminate the cluster. In addition, please note that the argument `ip_of_head_node` should be the IP address of the head node, which is accessible by all the worker nodes. The IP addresses of each worker node should be specified in the `VLLM_HOST_IP` environment variable, and should be different for each worker node. Please check the network configuration of your cluster to make sure the nodes can communicate with each other through the specified IP addresses.
:::{warning}
Since this is a ray cluster of **containers**, all the following commands should be executed in the **containers**, otherwise you are executing the commands on the host machine, which is not connected to the ray cluster. To enter the container, you can use `docker exec -it node /bin/bash`.
:::
Then, on any node, use `docker exec -it node /bin/bash` to enter the container, execute `ray status` to check the status of the Ray cluster. You should see the right number of nodes and GPUs. Then, on any node, use `docker exec -it node /bin/bash` to enter the container, execute `ray status` to check the status of the Ray cluster. You should see the right number of nodes and GPUs.
After that, on any node, you can use vLLM as usual, just as you have all the GPUs on one node. The common practice is to set the tensor parallel size to the number of GPUs in each node, and the pipeline parallel size to the number of nodes. For example, if you have 16 GPUs in 2 nodes (8 GPUs per node), you can set the tensor parallel size to 8 and the pipeline parallel size to 2: After that, on any node, use `docker exec -it node /bin/bash` to enter the container again. **In the container**, you can use vLLM as usual, just as you have all the GPUs on one node. The common practice is to set the tensor parallel size to the number of GPUs in each node, and the pipeline parallel size to the number of nodes. For example, if you have 16 GPUs in 2 nodes (8 GPUs per node), you can set the tensor parallel size to 8 and the pipeline parallel size to 2:
```console ```console
vllm serve /path/to/the/model/in/the/container \ vllm serve /path/to/the/model/in/the/container \
......
...@@ -4,7 +4,7 @@ ...@@ -4,7 +4,7 @@
Below, you can find an explanation of every engine argument for vLLM: Below, you can find an explanation of every engine argument for vLLM:
<!--- pyml disable-num-lines 7 no-space-in-emphasis--> <!--- pyml disable-num-lines 7 no-space-in-emphasis -->
```{eval-rst} ```{eval-rst}
.. argparse:: .. argparse::
:module: vllm.engine.arg_utils :module: vllm.engine.arg_utils
...@@ -17,7 +17,7 @@ Below, you can find an explanation of every engine argument for vLLM: ...@@ -17,7 +17,7 @@ Below, you can find an explanation of every engine argument for vLLM:
Below are the additional arguments related to the asynchronous engine: Below are the additional arguments related to the asynchronous engine:
<!--- pyml disable-num-lines 7 no-space-in-emphasis--> <!--- pyml disable-num-lines 7 no-space-in-emphasis -->
```{eval-rst} ```{eval-rst}
.. argparse:: .. argparse::
:module: vllm.engine.arg_utils :module: vllm.engine.arg_utils
......
...@@ -184,8 +184,8 @@ llm = LLM("openbmb/MiniCPM-V-2_6", trust_remote_code=True, limit_mm_per_prompt={ ...@@ -184,8 +184,8 @@ llm = LLM("openbmb/MiniCPM-V-2_6", trust_remote_code=True, limit_mm_per_prompt={
mm_data = { mm_data = {
"image": { "image": {
"image_embeds": image_embeds, "image_embeds": image_embeds,
# image_size_list is needed to calculate details of the sliced image. # image_sizes is needed to calculate details of the sliced image.
"image_size_list": [image.size for image in images], # list of image sizes "image_sizes": [image.size for image in images], # list of image sizes
} }
} }
...@@ -359,12 +359,12 @@ export VLLM_VIDEO_FETCH_TIMEOUT=<timeout> ...@@ -359,12 +359,12 @@ export VLLM_VIDEO_FETCH_TIMEOUT=<timeout>
### Audio ### Audio
Audio input is supported according to [OpenAI Audio API](https://platform.openai.com/docs/guides/audio?audio-generation-quickstart-example=audio-in). Audio input is supported according to [OpenAI Audio API](https://platform.openai.com/docs/guides/audio?audio-generation-quickstart-example=audio-in).
Here is a simple example using Ultravox-v0.3. Here is a simple example using Ultravox-v0.5-1B.
First, launch the OpenAI-compatible server: First, launch the OpenAI-compatible server:
```bash ```bash
vllm serve fixie-ai/ultravox-v0_3 vllm serve fixie-ai/ultravox-v0_5-llama-3_2-1b
``` ```
Then, you can use the OpenAI client as follows: Then, you can use the OpenAI client as follows:
......
...@@ -41,6 +41,8 @@ We currently support the following OpenAI APIs: ...@@ -41,6 +41,8 @@ We currently support the following OpenAI APIs:
- *Note: `parallel_tool_calls` and `user` parameters are ignored.* - *Note: `parallel_tool_calls` and `user` parameters are ignored.*
- [Embeddings API](#embeddings-api) (`/v1/embeddings`) - [Embeddings API](#embeddings-api) (`/v1/embeddings`)
- Only applicable to [embedding models](../models/pooling_models.md) (`--task embed`). - Only applicable to [embedding models](../models/pooling_models.md) (`--task embed`).
- [Transcriptions API](#transcriptions-api) (`/v1/audio/transcriptions`)
- Only applicable to Automatic Speech Recognition (ASR) models (OpenAI Whisper) (`--task generate`).
In addition, we have the following custom APIs: In addition, we have the following custom APIs:
...@@ -296,6 +298,17 @@ For chat-like input (i.e. if `messages` is passed), these extra parameters are s ...@@ -296,6 +298,17 @@ For chat-like input (i.e. if `messages` is passed), these extra parameters are s
:end-before: end-chat-embedding-extra-params :end-before: end-chat-embedding-extra-params
::: :::
(transcriptions-api)=
### Transcriptions API
Our Transcriptions API is compatible with [OpenAI's Transcriptions API](https://platform.openai.com/docs/api-reference/audio/createTranscription);
you can use the [official OpenAI Python client](https://github.com/openai/openai-python) to interact with it.
<!-- TODO: api enforced limits + uploading audios -->
Code example: <gh-file:examples/online_serving/openai_transcription_client.py>
(tokenizer-api)= (tokenizer-api)=
### Tokenizer API ### Tokenizer API
......
# SPDX-License-Identifier: Apache-2.0
from vllm import LLM, SamplingParams
from vllm.utils import FlexibleArgumentParser
def main():
parser = FlexibleArgumentParser(description='AQLM examples')
parser.add_argument('--model',
'-m',
type=str,
default=None,
help='model path, as for HF')
parser.add_argument('--choice',
'-c',
type=int,
default=0,
help='known good models by index, [0-4]')
parser.add_argument('--tensor-parallel-size',
'-t',
type=int,
default=1,
help='tensor parallel size')
args = parser.parse_args()
models = [
"ISTA-DASLab/Llama-2-7b-AQLM-2Bit-1x16-hf",
"ISTA-DASLab/Llama-2-7b-AQLM-2Bit-2x8-hf",
"ISTA-DASLab/Llama-2-13b-AQLM-2Bit-1x16-hf",
"ISTA-DASLab/Mixtral-8x7b-AQLM-2Bit-1x16-hf",
"BlackSamorez/TinyLlama-1_1B-Chat-v1_0-AQLM-2Bit-1x16-hf",
]
model = LLM(args.model if args.model is not None else models[args.choice],
tensor_parallel_size=args.tensor_parallel_size)
sampling_params = SamplingParams(max_tokens=100, temperature=0)
outputs = model.generate("Hello my name is",
sampling_params=sampling_params)
print(outputs[0].outputs[0].text)
if __name__ == '__main__':
main()
# SPDX-License-Identifier: Apache-2.0
from vllm import LLM, SamplingParams
# Sample prompts.
prompts = [
"Hello, my name is",
"The president of the United States is",
"The capital of France is",
"The future of AI is",
]
# Create a sampling params object.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
# Create an LLM.
llm = LLM(model="snowflake/snowflake-arctic-instruct",
quantization="deepspeedfp",
tensor_parallel_size=8,
trust_remote_code=True)
# Generate texts from the prompts. The output is a list of RequestOutput objects
# that contain the prompt, generated text, and other information.
outputs = llm.generate(prompts, sampling_params)
# Print the outputs.
for output in outputs:
prompt = output.prompt
generated_text = output.outputs[0].text
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
...@@ -24,9 +24,9 @@ question_per_audio_count = { ...@@ -24,9 +24,9 @@ question_per_audio_count = {
# Unless specified, these settings have been tested to work on a single L4. # Unless specified, these settings have been tested to work on a single L4.
# Ultravox 0.3 # Ultravox 0.5-1B
def run_ultravox(question: str, audio_count: int): def run_ultravox(question: str, audio_count: int):
model_name = "fixie-ai/ultravox-v0_3" model_name = "fixie-ai/ultravox-v0_5-llama-3_2-1b"
tokenizer = AutoTokenizer.from_pretrained(model_name) tokenizer = AutoTokenizer.from_pretrained(model_name)
messages = [{ messages = [{
......
# Basic
The `LLM` class provides the primary Python interface for doing offline inference, which is interacting with a model without using a separate model inference server.
## Usage
The first script in this example shows the most basic usage of vLLM. If you are new to Python and vLLM, you should start here.
```bash
python examples/offline_inference/basic/basic.py
```
The rest of the scripts include an [argument parser](https://docs.python.org/3/library/argparse.html), which you can use to pass any arguments that are compatible with [`LLM`](https://docs.vllm.ai/en/latest/api/offline_inference/llm.html). Try running the script with `--help` for a list of all available arguments.
```bash
python examples/offline_inference/basic/classify.py
```
```bash
python examples/offline_inference/basic/embed.py
```
```bash
python examples/offline_inference/basic/score.py
```
The chat and generate scripts also accept the [sampling parameters](https://docs.vllm.ai/en/latest/api/inference_params.html#sampling-parameters): `max_tokens`, `temperature`, `top_p` and `top_k`.
```bash
python examples/offline_inference/basic/chat.py
```
```bash
python examples/offline_inference/basic/generate.py
```
## Features
In the scripts that support passing arguments, you can experiment with the following features.
### Default generation config
The `--generation-config` argument specifies where the generation config will be loaded from when calling `LLM.get_default_sampling_params()`. If set to ‘auto’, the generation config will be loaded from model path. If set to a folder path, the generation config will be loaded from the specified folder path. If it is not provided, vLLM defaults will be used.
> If max_new_tokens is specified in generation config, then it sets a server-wide limit on the number of output tokens for all requests.
Try it yourself with the following argument:
```bash
--generation-config auto
```
### Quantization
#### AQLM
vLLM supports models that are quantized using AQLM.
Try one yourself by passing one of the following models to the `--model` argument:
- `ISTA-DASLab/Llama-2-7b-AQLM-2Bit-1x16-hf`
- `ISTA-DASLab/Llama-2-7b-AQLM-2Bit-2x8-hf`
- `ISTA-DASLab/Llama-2-13b-AQLM-2Bit-1x16-hf`
- `ISTA-DASLab/Mixtral-8x7b-AQLM-2Bit-1x16-hf`
- `BlackSamorez/TinyLlama-1_1B-Chat-v1_0-AQLM-2Bit-1x16-hf`
> Some of these models are likely to be too large for a single GPU. You can split them across multiple GPUs by setting `--tensor-parallel-size` to the number of required GPUs.
#### GGUF
vLLM supports models that are quantized using GGUF.
Try one yourself by downloading a GUFF quantised model and using the following arguments:
```python
from huggingface_hub import hf_hub_download
repo_id = "bartowski/Phi-3-medium-4k-instruct-GGUF"
filename = "Phi-3-medium-4k-instruct-IQ2_M.gguf"
print(hf_hub_download(repo_id, filename=filename))
```
```bash
--model {local-path-printed-above} --tokenizer microsoft/Phi-3-medium-4k-instruct
```
### CPU offload
The `--cpu-offload-gb` argument can be seen as a virtual way to increase the GPU memory size. For example, if you have one 24 GB GPU and set this to 10, virtually you can think of it as a 34 GB GPU. Then you can load a 13B model with BF16 weight, which requires at least 26GB GPU memory. Note that this requires fast CPU-GPU interconnect, as part of the model is loaded from CPU memory to GPU memory on the fly in each model forward pass.
Try it yourself with the following arguments:
```bash
--model meta-llama/Llama-2-13b-chat-hf --cpu-offload-gb 10
```
# SPDX-License-Identifier: Apache-2.0
from vllm import LLM, EngineArgs
from vllm.utils import FlexibleArgumentParser
def main(args: dict):
# Pop arguments not used by LLM
max_tokens = args.pop("max_tokens")
temperature = args.pop("temperature")
top_p = args.pop("top_p")
top_k = args.pop("top_k")
chat_template_path = args.pop("chat_template_path")
# Create an LLM
llm = LLM(**args)
# Create sampling params object
sampling_params = llm.get_default_sampling_params()
if max_tokens is not None:
sampling_params.max_tokens = max_tokens
if temperature is not None:
sampling_params.temperature = temperature
if top_p is not None:
sampling_params.top_p = top_p
if top_k is not None:
sampling_params.top_k = top_k
def print_outputs(outputs):
for output in outputs:
prompt = output.prompt
generated_text = output.outputs[0].text
print(f"Prompt: {prompt!r}")
print(f"Generated text: {generated_text!r}")
print("-" * 80)
print("=" * 80)
# In this script, we demonstrate how to pass input to the chat method:
conversation = [
{
"role": "system",
"content": "You are a helpful assistant"
},
{
"role": "user",
"content": "Hello"
},
{
"role": "assistant",
"content": "Hello! How can I assist you today?"
},
{
"role": "user",
"content":
"Write an essay about the importance of higher education.",
},
]
outputs = llm.chat(conversation, sampling_params, use_tqdm=False)
print_outputs(outputs)
# You can run batch inference with llm.chat API
conversations = [conversation for _ in range(10)]
# We turn on tqdm progress bar to verify it's indeed running batch inference
outputs = llm.chat(conversations, sampling_params, use_tqdm=True)
print_outputs(outputs)
# A chat template can be optionally supplied.
# If not, the model will use its default chat template.
if chat_template_path is not None:
with open(chat_template_path) as f:
chat_template = f.read()
outputs = llm.chat(
conversations,
sampling_params,
use_tqdm=False,
chat_template=chat_template,
)
if __name__ == "__main__":
parser = FlexibleArgumentParser()
# Add engine args
engine_group = parser.add_argument_group("Engine arguments")
EngineArgs.add_cli_args(engine_group)
engine_group.set_defaults(model="meta-llama/Llama-3.2-1B-Instruct")
# Add sampling params
sampling_group = parser.add_argument_group("Sampling parameters")
sampling_group.add_argument("--max-tokens", type=int)
sampling_group.add_argument("--temperature", type=float)
sampling_group.add_argument("--top-p", type=float)
sampling_group.add_argument("--top-k", type=int)
# Add example params
parser.add_argument("--chat-template-path", type=str)
args: dict = vars(parser.parse_args())
main(args)
# SPDX-License-Identifier: Apache-2.0
from argparse import Namespace
from vllm import LLM, EngineArgs
from vllm.utils import FlexibleArgumentParser
def main(args: Namespace):
# Sample prompts.
prompts = [
"Hello, my name is",
"The president of the United States is",
"The capital of France is",
"The future of AI is",
]
# Create an LLM.
# You should pass task="classify" for classification models
model = LLM(**vars(args))
# Generate logits. The output is a list of ClassificationRequestOutputs.
outputs = model.classify(prompts)
# Print the outputs.
for prompt, output in zip(prompts, outputs):
probs = output.outputs.probs
probs_trimmed = ((str(probs[:16])[:-1] +
", ...]") if len(probs) > 16 else probs)
print(f"Prompt: {prompt!r} | "
f"Class Probabilities: {probs_trimmed} (size={len(probs)})")
if __name__ == "__main__":
parser = FlexibleArgumentParser()
parser = EngineArgs.add_cli_args(parser)
# Set example specific arguments
parser.set_defaults(model="jason9693/Qwen2.5-1.5B-apeach",
task="classify",
enforce_eager=True)
args = parser.parse_args()
main(args)
# SPDX-License-Identifier: Apache-2.0
from argparse import Namespace
from vllm import LLM, EngineArgs
from vllm.utils import FlexibleArgumentParser
def main(args: Namespace):
# Sample prompts.
prompts = [
"Hello, my name is",
"The president of the United States is",
"The capital of France is",
"The future of AI is",
]
# Create an LLM.
# You should pass task="embed" for embedding models
model = LLM(**vars(args))
# Generate embedding. The output is a list of EmbeddingRequestOutputs.
outputs = model.embed(prompts)
# Print the outputs.
for prompt, output in zip(prompts, outputs):
embeds = output.outputs.embedding
embeds_trimmed = ((str(embeds[:16])[:-1] +
", ...]") if len(embeds) > 16 else embeds)
print(f"Prompt: {prompt!r} | "
f"Embeddings: {embeds_trimmed} (size={len(embeds)})")
if __name__ == "__main__":
parser = FlexibleArgumentParser()
parser = EngineArgs.add_cli_args(parser)
# Set example specific arguments
parser.set_defaults(model="intfloat/e5-mistral-7b-instruct",
task="embed",
enforce_eager=True)
args = parser.parse_args()
main(args)
# SPDX-License-Identifier: Apache-2.0
from vllm import LLM, EngineArgs
from vllm.utils import FlexibleArgumentParser
def main(args: dict):
# Pop arguments not used by LLM
max_tokens = args.pop("max_tokens")
temperature = args.pop("temperature")
top_p = args.pop("top_p")
top_k = args.pop("top_k")
# Create an LLM
llm = LLM(**args)
# Create a sampling params object
sampling_params = llm.get_default_sampling_params()
if max_tokens is not None:
sampling_params.max_tokens = max_tokens
if temperature is not None:
sampling_params.temperature = temperature
if top_p is not None:
sampling_params.top_p = top_p
if top_k is not None:
sampling_params.top_k = top_k
# Generate texts from the prompts. The output is a list of RequestOutput
# objects that contain the prompt, generated text, and other information.
prompts = [
"Hello, my name is",
"The president of the United States is",
"The capital of France is",
"The future of AI is",
]
outputs = llm.generate(prompts, sampling_params)
# Print the outputs.
for output in outputs:
prompt = output.prompt
generated_text = output.outputs[0].text
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
if __name__ == "__main__":
parser = FlexibleArgumentParser()
# Add engine args
engine_group = parser.add_argument_group("Engine arguments")
EngineArgs.add_cli_args(engine_group)
engine_group.set_defaults(model="meta-llama/Llama-3.2-1B-Instruct")
# Add sampling params
sampling_group = parser.add_argument_group("Sampling parameters")
sampling_group.add_argument("--max-tokens", type=int)
sampling_group.add_argument("--temperature", type=float)
sampling_group.add_argument("--top-p", type=float)
sampling_group.add_argument("--top-k", type=int)
args: dict = vars(parser.parse_args())
main(args)
# SPDX-License-Identifier: Apache-2.0
from argparse import Namespace
from vllm import LLM, EngineArgs
from vllm.utils import FlexibleArgumentParser
def main(args: Namespace):
# Sample prompts.
text_1 = "What is the capital of France?"
texts_2 = [
"The capital of Brazil is Brasilia.",
"The capital of France is Paris.",
]
# Create an LLM.
# You should pass task="score" for cross-encoder models
model = LLM(**vars(args))
# Generate scores. The output is a list of ScoringRequestOutputs.
outputs = model.score(text_1, texts_2)
# Print the outputs.
for text_2, output in zip(texts_2, outputs):
score = output.outputs.score
print(f"Pair: {[text_1, text_2]!r} | Score: {score}")
if __name__ == "__main__":
parser = FlexibleArgumentParser()
parser = EngineArgs.add_cli_args(parser)
# Set example specific arguments
parser.set_defaults(model="BAAI/bge-reranker-v2-m3",
task="score",
enforce_eager=True)
args = parser.parse_args()
main(args)
# SPDX-License-Identifier: Apache-2.0
from vllm import LLM
# Sample prompts.
prompts = [
"Hello, my name is",
"The president of the United States is",
"The capital of France is",
"The future of AI is",
]
# Create an LLM with built-in default generation config.
# The generation config is set to None by default to keep
# the behavior consistent with the previous version.
# If you want to use the default generation config from the model,
# you should set the generation_config to "auto".
llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct", generation_config="auto")
# Load the default sampling parameters from the model.
sampling_params = llm.get_default_sampling_params()
# Modify the sampling parameters if needed.
sampling_params.temperature = 0.5
# Generate texts from the prompts. The output is a list of RequestOutput objects
# that contain the prompt, generated text, and other information.
outputs = llm.generate(prompts, sampling_params)
# Print the outputs.
for output in outputs:
prompt = output.prompt
generated_text = output.outputs[0].text
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
# SPDX-License-Identifier: Apache-2.0
from vllm import LLM, SamplingParams
llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")
sampling_params = SamplingParams(temperature=0.5)
def print_outputs(outputs):
for output in outputs:
prompt = output.prompt
generated_text = output.outputs[0].text
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
print("-" * 80)
print("=" * 80)
# In this script, we demonstrate how to pass input to the chat method:
conversation = [
{
"role": "system",
"content": "You are a helpful assistant"
},
{
"role": "user",
"content": "Hello"
},
{
"role": "assistant",
"content": "Hello! How can I assist you today?"
},
{
"role": "user",
"content": "Write an essay about the importance of higher education.",
},
]
outputs = llm.chat(conversation,
sampling_params=sampling_params,
use_tqdm=False)
print_outputs(outputs)
# You can run batch inference with llm.chat API
conversation = [
{
"role": "system",
"content": "You are a helpful assistant"
},
{
"role": "user",
"content": "Hello"
},
{
"role": "assistant",
"content": "Hello! How can I assist you today?"
},
{
"role": "user",
"content": "Write an essay about the importance of higher education.",
},
]
conversations = [conversation for _ in range(10)]
# We turn on tqdm progress bar to verify it's indeed running batch inference
outputs = llm.chat(messages=conversations,
sampling_params=sampling_params,
use_tqdm=True)
print_outputs(outputs)
# A chat template can be optionally supplied.
# If not, the model will use its default chat template.
# with open('template_falcon_180b.jinja', "r") as f:
# chat_template = f.read()
# outputs = llm.chat(
# conversations,
# sampling_params=sampling_params,
# use_tqdm=False,
# chat_template=chat_template,
# )
# SPDX-License-Identifier: Apache-2.0
from vllm import LLM
# Sample prompts.
prompts = [
"Hello, my name is",
"The president of the United States is",
"The capital of France is",
"The future of AI is",
]
# Create an LLM.
# You should pass task="classify" for classification models
model = LLM(
model="jason9693/Qwen2.5-1.5B-apeach",
task="classify",
enforce_eager=True,
)
# Generate logits. The output is a list of ClassificationRequestOutputs.
outputs = model.classify(prompts)
# Print the outputs.
for prompt, output in zip(prompts, outputs):
probs = output.outputs.probs
probs_trimmed = ((str(probs[:16])[:-1] +
", ...]") if len(probs) > 16 else probs)
print(f"Prompt: {prompt!r} | "
f"Class Probabilities: {probs_trimmed} (size={len(probs)})")
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment