@@ -42,7 +42,7 @@ Alternatively, you can [open an issue on GitHub](https://github.com/vllm-project
...
@@ -42,7 +42,7 @@ Alternatively, you can [open an issue on GitHub](https://github.com/vllm-project
### Transformers fallback
### Transformers fallback
After the merge of <gh-pr:11330>, `vllm` can fall back to models that are available in `transformers`. This does not yet work for all models, but most decoder language models are supported, and vision language model support is planned!
`vllm` can fall back to models that are available in `transformers`. This does not yet work for all models, but most decoder language models are supported, and vision language model support is planned!
To check if the backend is `transformers`, you can simply do this:
To check if the backend is `transformers`, you can simply do this:
...
@@ -56,9 +56,13 @@ If it is `TransformersModel` then it means it's based on `transformers`!
...
@@ -56,9 +56,13 @@ If it is `TransformersModel` then it means it's based on `transformers`!
#### Supported features
#### Supported features
##### LORA and quantization
##### Quantization
Neither is supported yet! Make sure to open an issue and we'll work on this together with the `transformers` team!
The Transformers fallback supports most of the quantization methods available in vLLM (except GGUF). See the [Quantization page](#quantization-index) for more information about supported quantization in vLLM.
##### LoRA
LoRA is not yet supported on the Transformers fallback! Make sure to open an issue and we'll work on this together with the `transformers` team!
Usually, `transformers` models load weights via the `load_adapters` API, which depends on PEFT. We need to do some work to either use this API (for now this would result in some weights not being marked as loaded) or replace modules accordingly.
Usually, `transformers` models load weights via the `load_adapters` API, which depends on PEFT. We need to do some work to either use this API (for now this would result in some weights not being marked as loaded) or replace modules accordingly.
...
@@ -429,7 +433,7 @@ See [this page](#generative-models) for more information on how to use generativ
...
@@ -429,7 +433,7 @@ See [this page](#generative-models) for more information on how to use generativ
* ✅︎
* ✅︎
-*`TeleChat2ForCausalLM`
-*`TeleChat2ForCausalLM`
* TeleChat2
* TeleChat2
*`TeleAI/TeleChat2-3B`, `TeleAI/TeleChat2-7B`, `TeleAI/TeleChat2-35B`, etc.
*`Tele-AI/TeleChat2-3B`, `Tele-AI/TeleChat2-7B`, `Tele-AI/TeleChat2-35B`, etc.
* ✅︎
* ✅︎
* ✅︎
* ✅︎
-*`XverseForCausalLM`
-*`XverseForCausalLM`
...
@@ -699,10 +703,10 @@ See [this page](#generative-models) for more information on how to use generativ
...
@@ -699,10 +703,10 @@ See [this page](#generative-models) for more information on how to use generativ
*
*
* ✅︎
* ✅︎
* ✅︎
* ✅︎
-*`DeepseekVLV2ForCausalLM`
-*`DeepseekVLV2ForCausalLM`<sup>^</sup>
* DeepSeek-VL2
* DeepSeek-VL2
* T + I<sup>+</sup>
* T + I<sup>+</sup>
*`deepseek-ai/deepseek-vl2-tiny`, `deepseek-ai/deepseek-vl2-small`, `deepseek-ai/deepseek-vl2` etc. (see note)
*`deepseek-ai/deepseek-vl2-tiny`, `deepseek-ai/deepseek-vl2-small`, `deepseek-ai/deepseek-vl2` etc.
*
*
* ✅︎
* ✅︎
* ✅︎
* ✅︎
...
@@ -713,20 +717,20 @@ See [this page](#generative-models) for more information on how to use generativ
...
@@ -713,20 +717,20 @@ See [this page](#generative-models) for more information on how to use generativ
*
*
* ✅︎
* ✅︎
* ✅︎
* ✅︎
-*`ChatGLMModel`
-*`GLM4VForCausalLM`<sup>^</sup>
* GLM-4V
* GLM-4V
* T + I
* T + I
*`THUDM/glm-4v-9b` etc.
*`THUDM/glm-4v-9b`, `THUDM/cogagent-9b-20241220` etc.
* ✅︎
* ✅︎
* ✅︎
* ✅︎
* ✅︎
*
-*`H2OVLChatModel`
-*`H2OVLChatModel`
* H2OVL
* H2OVL
* T + I<sup>E+</sup>
* T + I<sup>E+</sup>
*`h2oai/h2ovl-mississippi-800m`, `h2oai/h2ovl-mississippi-2b`, etc.
*`h2oai/h2ovl-mississippi-800m`, `h2oai/h2ovl-mississippi-2b`, etc.
*
*
* ✅︎
* ✅︎
*\*
*✅︎\*
-*`Idefics3ForConditionalGeneration`
-*`Idefics3ForConditionalGeneration`
* Idefics3
* Idefics3
* T + I
* T + I
...
@@ -793,7 +797,7 @@ See [this page](#generative-models) for more information on how to use generativ
...
@@ -793,7 +797,7 @@ See [this page](#generative-models) for more information on how to use generativ
-*`MolmoForCausalLM`
-*`MolmoForCausalLM`
* Molmo
* Molmo
* T + I
* T + I
*`allenai/Molmo-7B-D-0924`, `allenai/Molmo-72B-0924`, etc.
*`allenai/Molmo-7B-D-0924`, `allenai/Molmo-7B-O-0924`, etc.
* ✅︎
* ✅︎
* ✅︎
* ✅︎
* ✅︎
* ✅︎
...
@@ -804,7 +808,7 @@ See [this page](#generative-models) for more information on how to use generativ
...
@@ -804,7 +808,7 @@ See [this page](#generative-models) for more information on how to use generativ
*
*
* ✅︎
* ✅︎
* ✅︎
* ✅︎
-*`PaliGemmaForConditionalGeneration`
-*`PaliGemmaForConditionalGeneration`\*
* PaliGemma, PaliGemma 2
* PaliGemma, PaliGemma 2
* T + I<sup>E</sup>
* T + I<sup>E</sup>
*`google/paligemma-3b-pt-224`, `google/paligemma-3b-mix-224`, `google/paligemma2-3b-ft-docci-448`, etc.
*`google/paligemma-3b-pt-224`, `google/paligemma-3b-mix-224`, `google/paligemma2-3b-ft-docci-448`, etc.
...
@@ -825,7 +829,7 @@ See [this page](#generative-models) for more information on how to use generativ
...
@@ -825,7 +829,7 @@ See [this page](#generative-models) for more information on how to use generativ
*
*
* ✅︎
* ✅︎
* ✅︎
* ✅︎
-*`QWenLMHeadModel`
-*`QwenVLForConditionalGeneration`<sup>^</sup>
* Qwen-VL
* Qwen-VL
* T + I<sup>E+</sup>
* T + I<sup>E+</sup>
*`Qwen/Qwen-VL`, `Qwen/Qwen-VL-Chat`, etc.
*`Qwen/Qwen-VL`, `Qwen/Qwen-VL-Chat`, etc.
...
@@ -850,27 +854,26 @@ See [this page](#generative-models) for more information on how to use generativ
...
@@ -850,27 +854,26 @@ See [this page](#generative-models) for more information on how to use generativ
* Qwen2.5-VL
* Qwen2.5-VL
* T + I<sup>E+</sup> + V<sup>E+</sup>
* T + I<sup>E+</sup> + V<sup>E+</sup>
*`Qwen/Qwen2.5-VL-3B-Instruct`, `Qwen/Qwen2.5-VL-72B-Instruct`, etc.
*`Qwen/Qwen2.5-VL-3B-Instruct`, `Qwen/Qwen2.5-VL-72B-Instruct`, etc.
*
* ✅︎
* ✅︎
* ✅︎
* ✅︎
* ✅︎
-*`UltravoxModel`
-*`UltravoxModel`
* Ultravox
* Ultravox
* T + A<sup>E+</sup>
* T + A<sup>E+</sup>
*`fixie-ai/ultravox-v0_3`
*`fixie-ai/ultravox-v0_5-llama-3_2-1b`
* ✅︎
* ✅︎
* ✅︎
* ✅︎
* ✅︎
* ✅︎
:::
:::
<sup>^</sup> You need to set the architecture name via `--hf-overrides` to match the one in vLLM.
• For example, to use DeepSeek-VL2 series models:
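A minimal sketch of how such an override might be assembled, assuming `--hf-overrides` accepts a JSON object and that `architectures` is the key to set. The helper name `build_serve_command` is hypothetical, and the command is only printed here, not executed.

```python
import json
import shlex


def build_serve_command(model: str, architecture: str) -> str:
    """Return a `vllm serve` command line that overrides the architecture name."""
    # Assumption: --hf-overrides takes a JSON object whose "architectures"
    # entry replaces the list found in the model's Hugging Face config.
    overrides = json.dumps({"architectures": [architecture]})
    parts = ["vllm", "serve", model, "--hf-overrides", overrides]
    # shlex.quote wraps the JSON in single quotes so the shell passes it intact.
    return " ".join(shlex.quote(part) for part in parts)


command = build_serve_command(
    "deepseek-ai/deepseek-vl2-tiny", "DeepseekVLV2ForCausalLM"
)
print(command)
```

Building the JSON with `json.dumps` rather than by hand avoids quoting mistakes when the override contains nested values.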
Then you get a Ray cluster of containers. Note that you need to keep the shells running these commands alive to hold the cluster. Any shell disconnect will terminate the cluster. In addition, please note that the argument `ip_of_head_node` should be the IP address of the head node, which must be accessible by all the worker nodes. The IP address of each worker node should be specified in the `VLLM_HOST_IP` environment variable, and should be different for each worker node. Please check the network configuration of your cluster to make sure the nodes can communicate with each other through the specified IP addresses.
Then you get a Ray cluster of **containers**. Note that you need to keep the shells running these commands alive to hold the cluster. Any shell disconnect will terminate the cluster. In addition, please note that the argument `ip_of_head_node` should be the IP address of the head node, which must be accessible by all the worker nodes. The IP address of each worker node should be specified in the `VLLM_HOST_IP` environment variable, and should be different for each worker node. Please check the network configuration of your cluster to make sure the nodes can communicate with each other through the specified IP addresses.
:::{warning}
Since this is a Ray cluster of **containers**, all the following commands should be executed in the **containers**. Otherwise, you are executing the commands on the host machine, which is not connected to the Ray cluster. To enter the container, you can use `docker exec -it node /bin/bash`.
:::
Then, on any node, use `docker exec -it node /bin/bash` to enter the container, and execute `ray status` to check the status of the Ray cluster. You should see the right number of nodes and GPUs.
Then, on any node, use `docker exec -it node /bin/bash` to enter the container, and execute `ray status` to check the status of the Ray cluster. You should see the right number of nodes and GPUs.
After that, on any node, you can use vLLM as usual, just as if you had all the GPUs on one node. The common practice is to set the tensor parallel size to the number of GPUs in each node, and the pipeline parallel size to the number of nodes. For example, if you have 16 GPUs in 2 nodes (8 GPUs per node), you can set the tensor parallel size to 8 and the pipeline parallel size to 2:
After that, on any node, use `docker exec -it node /bin/bash` to enter the container again. **In the container**, you can use vLLM as usual, just as if you had all the GPUs on one node. The common practice is to set the tensor parallel size to the number of GPUs in each node, and the pipeline parallel size to the number of nodes. For example, if you have 16 GPUs in 2 nodes (8 GPUs per node), you can set the tensor parallel size to 8 and the pipeline parallel size to 2:
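The sizing rule above can be sketched as a small helper. This is only an illustration of the arithmetic: the function name `parallel_sizes` is hypothetical, and it assumes every node has the same number of GPUs and that the flags are spelled `--tensor-parallel-size` and `--pipeline-parallel-size`.

```python
def parallel_sizes(total_gpus: int, num_nodes: int) -> tuple[int, int]:
    """Return (tensor_parallel_size, pipeline_parallel_size) for a uniform cluster."""
    if total_gpus <= 0 or num_nodes <= 0:
        raise ValueError("total_gpus and num_nodes must be positive")
    if total_gpus % num_nodes != 0:
        # The rule of thumb assumes the same GPU count on every node.
        raise ValueError("GPUs must divide evenly across nodes")
    gpus_per_node = total_gpus // num_nodes
    # Tensor parallelism spans the GPUs inside one node,
    # pipeline parallelism spans the nodes.
    return gpus_per_node, num_nodes


tp_size, pp_size = parallel_sizes(total_gpus=16, num_nodes=2)
print(f"--tensor-parallel-size {tp_size} --pipeline-parallel-size {pp_size}")
```

Keeping tensor parallelism within a node follows the common practice stated above; whether it is the best split for a given cluster depends on the interconnect between nodes.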
The `LLM` class provides the primary Python interface for doing offline inference, which is interacting with a model without using a separate model inference server.
## Usage
The first script in this example shows the most basic usage of vLLM. If you are new to Python and vLLM, you should start here.
```bash
python examples/offline_inference/basic/basic.py
```
The rest of the scripts include an [argument parser](https://docs.python.org/3/library/argparse.html), which you can use to pass any arguments that are compatible with [`LLM`](https://docs.vllm.ai/en/latest/api/offline_inference/llm.html). Try running the script with `--help` for a list of all available arguments.
The chat and generate scripts also accept the [sampling parameters](https://docs.vllm.ai/en/latest/api/inference_params.html#sampling-parameters): `max_tokens`, `temperature`, `top_p` and `top_k`.
In the scripts that support passing arguments, you can experiment with the following features.
### Default generation config
The `--generation-config` argument specifies where the generation config will be loaded from when calling `LLM.get_default_sampling_params()`. If set to `auto`, the generation config will be loaded from the model path. If set to a folder path, the generation config will be loaded from that folder. If it is not provided, vLLM defaults will be used.
> If `max_new_tokens` is specified in the generation config, it sets a server-wide limit on the number of output tokens for all requests.
Try it yourself with the following argument:
```bash
--generation-config auto
```
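The three cases described for `--generation-config` can be summarised in a few lines of plain Python. This is a sketch of the documented rule only, not vLLM's implementation; the function name `generation_config_source` and the returned `"vllm-defaults"` marker are hypothetical.

```python
from typing import Optional


def generation_config_source(generation_config: Optional[str], model_path: str) -> str:
    """Describe where the generation config would be loaded from."""
    if generation_config is None:
        # Argument not provided: vLLM defaults are used.
        return "vllm-defaults"
    if generation_config == "auto":
        # "auto": load the generation config from the model path.
        return model_path
    # Anything else is treated as a folder path.
    return generation_config


print(generation_config_source("auto", "/models/my-model"))
print(generation_config_source("/configs/custom", "/models/my-model"))
print(generation_config_source(None, "/models/my-model"))
```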
### Quantization
#### AQLM
vLLM supports models that are quantized using AQLM.
Try one yourself by passing one of the following models to the `--model` argument:
> Some of these models are likely to be too large for a single GPU. You can split them across multiple GPUs by setting `--tensor-parallel-size` to the number of required GPUs.
#### GGUF
vLLM supports models that are quantized using GGUF.
Try one yourself by downloading a GGUF quantized model and using the following arguments:
The `--cpu-offload-gb` argument can be seen as a virtual way to increase the GPU memory size. For example, if you have one 24 GB GPU and set this to 10, you can think of it as a 34 GB GPU. You can then load a 13B model with BF16 weights, which requires at least 26 GB of GPU memory. Note that this requires a fast CPU-GPU interconnect, because part of the model is loaded from CPU memory to GPU memory on the fly during each forward pass.
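The arithmetic behind this example can be checked with a short sketch. It counts model weights only and ignores activations and the KV cache, so it gives a lower bound rather than a full memory estimate; the helper names are hypothetical.

```python
def weights_gb(num_params: int, bytes_per_param: int) -> float:
    """Approximate weight memory in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


def weights_fit(gpu_gb: float, cpu_offload_gb: float, required_gb: float) -> bool:
    """Check whether the weights fit in GPU memory plus the offload budget."""
    return required_gb <= gpu_gb + cpu_offload_gb


# A 13B-parameter model in BF16 uses 2 bytes per parameter.
required = weights_gb(13_000_000_000, 2)

print(required)                          # weight memory in GB
print(weights_fit(24, 0, required))      # one 24 GB GPU, no offload
print(weights_fit(24, 10, required))     # same GPU with 10 GB offloaded to CPU
```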