Commit 31330101 authored by zhuwenwen's avatar zhuwenwen

Merge tag 'v0.8.4' into v0.8.4-dev

parents e8933c34 dc1b4a6f
# TorchAO
TorchAO is an architecture optimization library for PyTorch. It provides high-performance dtypes, optimization techniques, and kernels for inference and training, and it composes with native PyTorch features such as torch.compile and FSDP. Some benchmark numbers can be found [here](https://github.com/pytorch/ao/tree/main/torchao/quantization#benchmarks).
We recommend installing the latest torchao nightly with
```console
# Install the latest TorchAO nightly build
# Choose the CUDA version that matches your system (cu126, cu128, etc.)
pip install --pre "torchao>=10.0.0" --index-url https://download.pytorch.org/whl/nightly/cu126
```
## Quantizing HuggingFace Models
You can quantize your own Hugging Face model with torchao, for example through [transformers](https://huggingface.co/docs/transformers/main/en/quantization/torchao) or [diffusers](https://huggingface.co/docs/diffusers/en/quantization/torchao), and save the checkpoint to the Hugging Face Hub, like [this one](https://huggingface.co/jerryzh168/llama3-8b-int8wo), with the following example code:
```Python
import torch
from transformers import TorchAoConfig, AutoModelForCausalLM, AutoTokenizer
from torchao.quantization import Int8WeightOnlyConfig
model_name = "meta-llama/Meta-Llama-3-8B"
quantization_config = TorchAoConfig(Int8WeightOnlyConfig())
quantized_model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto", quantization_config=quantization_config)
tokenizer = AutoTokenizer.from_pretrained(model_name)
input_text = "What are we having for dinner?"
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")
hub_repo = "YOUR_HUB_REPO_ID"  # Replace with your Hub repo ID
tokenizer.push_to_hub(hub_repo)
quantized_model.push_to_hub(hub_repo, safe_serialization=False)
```
Alternatively, you can use the TorchAO Quantization space for quantizing models with a simple UI.
See: https://huggingface.co/spaces/medmekk/TorchAO_Quantization
...@@ -245,6 +245,8 @@ Example supported models: ...@@ -245,6 +245,8 @@ Example supported models:
* `meta-llama/Llama-3.2-3B-Instruct`\* (use with `examples/tool_chat_template_llama3.2_pythonic.jinja`) * `meta-llama/Llama-3.2-3B-Instruct`\* (use with `examples/tool_chat_template_llama3.2_pythonic.jinja`)
* `Team-ACE/ToolACE-8B` (use with `examples/tool_chat_template_toolace.jinja`) * `Team-ACE/ToolACE-8B` (use with `examples/tool_chat_template_toolace.jinja`)
* `fixie-ai/ultravox-v0_4-ToolACE-8B` (use with `examples/tool_chat_template_toolace.jinja`) * `fixie-ai/ultravox-v0_4-ToolACE-8B` (use with `examples/tool_chat_template_toolace.jinja`)
* `meta-llama/Llama-4-Scout-17B-16E-Instruct`\* (use with `examples/tool_chat_template_llama4_pythonic.jinja`)
* `meta-llama/Llama-4-Maverick-17B-128E-Instruct`\* (use with `examples/tool_chat_template_llama4_pythonic.jinja`)
Flags: `--tool-call-parser pythonic --chat-template {see_above}` Flags: `--tool-call-parser pythonic --chat-template {see_above}`
......
...@@ -17,6 +17,7 @@ def fix_case(text: str) -> str: ...@@ -17,6 +17,7 @@ def fix_case(text: str) -> str:
"cli": "CLI", "cli": "CLI",
"cpu": "CPU", "cpu": "CPU",
"llm": "LLM", "llm": "LLM",
"mae": "MAE",
"tpu": "TPU", "tpu": "TPU",
"aqlm": "AQLM", "aqlm": "AQLM",
"gguf": "GGUF", "gguf": "GGUF",
...@@ -24,6 +25,7 @@ def fix_case(text: str) -> str: ...@@ -24,6 +25,7 @@ def fix_case(text: str) -> str:
"rlhf": "RLHF", "rlhf": "RLHF",
"vllm": "vLLM", "vllm": "vLLM",
"openai": "OpenAI", "openai": "OpenAI",
"lmcache": "LMCache",
"multilora": "MultiLoRA", "multilora": "MultiLoRA",
"mlpspeculator": "MLPSpeculator", "mlpspeculator": "MLPSpeculator",
r"fp\d+": lambda x: x.group(0).upper(), # e.g. fp16, fp32 r"fp\d+": lambda x: x.group(0).upper(), # e.g. fp16, fp32
......
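The hunks above show only entries of the substitution table, not how `fix_case` applies them. A minimal sketch of one way such a table could be applied is given below; the function name `fix_case_sketch`, the loop, and the word-boundary matching are assumptions for illustration, not the actual implementation in the changed file.

```python
import re


def fix_case_sketch(text: str) -> str:
    """Hypothetical illustration of applying a case-fixing table."""
    # Keys are regex patterns; values are either a replacement string
    # or a callable that receives the match object.
    subs = {
        "llm": "LLM",
        "mae": "MAE",
        "tpu": "TPU",
        "vllm": "vLLM",
        "lmcache": "LMCache",
        r"fp\d+": lambda m: m.group(0).upper(),  # e.g. fp16, fp32
    }
    for pattern, repl in subs.items():
        # Assumed: \b word boundaries keep "llm" from matching inside "vllm".
        text = re.sub(rf"\b{pattern}\b", repl, text, flags=re.IGNORECASE)
    return text


print(fix_case_sketch("lmcache with vllm on tpu using fp16"))
```

`re.sub` accepts both plain strings and callables as the replacement, which is why string entries and the `lambda` entry can share one table.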
...@@ -12,7 +12,7 @@ There are no pre-built wheels or images for this device, so you must build vLLM ...@@ -12,7 +12,7 @@ There are no pre-built wheels or images for this device, so you must build vLLM
- OS: `macOS Sonoma` or later - OS: `macOS Sonoma` or later
- SDK: `XCode 15.4` or later with Command Line Tools - SDK: `XCode 15.4` or later with Command Line Tools
- Compiler: `Apple Clang >= 15.0.0` and `Apple Clang < 17.0.0` - Compiler: `Apple Clang >= 15.0.0`
## Set up using Python ## Set up using Python
...@@ -51,14 +51,6 @@ If the build has error like the following snippet where standard C++ headers can ...@@ -51,14 +51,6 @@ If the build has error like the following snippet where standard C++ headers can
1 error generated. 1 error generated.
``` ```
If run with error like the following snippet you need to check clang version and install a compatible version.
```text
AttributeError: '_OpNamespace' '_C' object has no attribute 'silu_and_mul'
```
More information can be found in <gh-issue:15941>.
## Set up using Docker ## Set up using Docker
### Pre-built images ### Pre-built images
......
...@@ -156,10 +156,3 @@ vLLM V1 is currently optimized for decoder-only transformers. Models requiring ...@@ -156,10 +156,3 @@ vLLM V1 is currently optimized for decoder-only transformers. Models requiring
cross-attention between separate encoder and decoder are not yet supported (e.g., `BartForConditionalGeneration`, `MllamaForConditionalGeneration`). cross-attention between separate encoder and decoder are not yet supported (e.g., `BartForConditionalGeneration`, `MllamaForConditionalGeneration`).
For a complete list of supported models, see the [list of supported models](https://docs.vllm.ai/en/latest/models/supported_models.html). For a complete list of supported models, see the [list of supported models](https://docs.vllm.ai/en/latest/models/supported_models.html).
## Frequently Asked Questions
**I'm using vLLM V1 and I'm getting CUDA OOM errors. What should I do?**
The default `max_num_seqs` has been raised from `256` in V0 to `1024` in V1. If you encounter CUDA OOM only when using V1 engine, try setting a lower value of `max_num_seqs` or `gpu_memory_utilization`.
On the other hand, if you get an error about insufficient memory for the cache blocks, you should increase `gpu_memory_utilization` as this indicates that your GPU has sufficient memory but you're not allocating enough to vLLM for KV cache blocks.
...@@ -9,7 +9,7 @@ shorter Pod startup times and CPU memory usage. Tensor encryption is also suppor ...@@ -9,7 +9,7 @@ shorter Pod startup times and CPU memory usage. Tensor encryption is also suppor
For more information on CoreWeave's Tensorizer, please refer to For more information on CoreWeave's Tensorizer, please refer to
[CoreWeave's Tensorizer documentation](https://github.com/coreweave/tensorizer). For more information on serializing a vLLM model, as well a general usage guide to using Tensorizer with vLLM, see [CoreWeave's Tensorizer documentation](https://github.com/coreweave/tensorizer). For more information on serializing a vLLM model, as well a general usage guide to using Tensorizer with vLLM, see
the [vLLM example script](https://docs.vllm.ai/en/stable/getting_started/examples/offline_inference/tensorize_vllm_model.html). the [vLLM example script](https://docs.vllm.ai/en/latest/getting_started/examples/tensorize_vllm_model.html).
:::{note} :::{note}
Note that to use this feature you will need to install `tensorizer` by running `pip install vllm[tensorizer]`. Note that to use this feature you will need to install `tensorizer` by running `pip install vllm[tensorizer]`.
......
...@@ -160,6 +160,35 @@ If vLLM successfully returns text (for generative models) or hidden states (for ...@@ -160,6 +160,35 @@ If vLLM successfully returns text (for generative models) or hidden states (for
Otherwise, please refer to [Adding a New Model](#new-model) for instructions on how to implement your model in vLLM. Otherwise, please refer to [Adding a New Model](#new-model) for instructions on how to implement your model in vLLM.
Alternatively, you can [open an issue on GitHub](https://github.com/vllm-project/vllm/issues/new/choose) to request vLLM support. Alternatively, you can [open an issue on GitHub](https://github.com/vllm-project/vllm/issues/new/choose) to request vLLM support.
#### Using a proxy
Here are some tips for loading/downloading models from Hugging Face using a proxy:
- Set the proxy globally for your session (or set it in the profile file):
```shell
export http_proxy=http://your.proxy.server:port
export https_proxy=http://your.proxy.server:port
```
- Set the proxy for just the current command:
```shell
https_proxy=http://your.proxy.server:port huggingface-cli download <model_name>
# or use vllm cmd directly
https_proxy=http://your.proxy.server:port vllm serve <model_name> --disable-log-requests
```
- Set the proxy in the Python interpreter:
```python
import os
os.environ['http_proxy'] = 'http://your.proxy.server:port'
os.environ['https_proxy'] = 'http://your.proxy.server:port'
```
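The per-command and Python variants above can also be combined: a script can pass the proxy to a single child process without changing its own environment. The sketch below assumes `huggingface-cli` is on the `PATH`; the helper names `with_proxy` and `download_via_proxy` are hypothetical and not part of vLLM or `huggingface_hub`.

```python
import os
import subprocess


def with_proxy(proxy_url: str, base_env=None) -> dict:
    """Return a copy of the environment with both proxy variables set."""
    env = dict(os.environ if base_env is None else base_env)
    env["http_proxy"] = proxy_url
    env["https_proxy"] = proxy_url
    return env


def download_via_proxy(model_name: str, proxy_url: str) -> int:
    """Run `huggingface-cli download` with the proxy applied to that process only."""
    result = subprocess.run(
        ["huggingface-cli", "download", model_name],
        env=with_proxy(proxy_url),
        check=False,
    )
    return result.returncode
```

Because `with_proxy` copies the environment instead of mutating `os.environ`, other code in the same process keeps its original network settings.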
### ModelScope ### ModelScope
To use models from [ModelScope](https://www.modelscope.cn) instead of Hugging Face Hub, set an environment variable: To use models from [ModelScope](https://www.modelscope.cn) instead of Hugging Face Hub, set an environment variable:
...@@ -233,9 +262,9 @@ See [this page](#generative-models) for more information on how to use generativ ...@@ -233,9 +262,9 @@ See [this page](#generative-models) for more information on how to use generativ
* `facebook/bart-base`, `facebook/bart-large-cnn`, etc. * `facebook/bart-base`, `facebook/bart-large-cnn`, etc.
* *
* *
- * `ChatGLMModel` - * `ChatGLMModel`, `ChatGLMForConditionalGeneration`
* ChatGLM * ChatGLM
* `THUDM/chatglm2-6b`, `THUDM/chatglm3-6b`, etc. * `THUDM/chatglm2-6b`, `THUDM/chatglm3-6b`, `ShieldLM-6B-chatglm3`, etc.
* ✅︎ * ✅︎
* ✅︎ * ✅︎
- * `CohereForCausalLM`, `Cohere2ForCausalLM` - * `CohereForCausalLM`, `Cohere2ForCausalLM`
...@@ -303,6 +332,11 @@ See [this page](#generative-models) for more information on how to use generativ ...@@ -303,6 +332,11 @@ See [this page](#generative-models) for more information on how to use generativ
* `THUDM/glm-4-9b-chat-hf`, etc. * `THUDM/glm-4-9b-chat-hf`, etc.
* ✅︎ * ✅︎
* ✅︎ * ✅︎
- * `Glm4ForCausalLM`
* GLM-4-0414
* `THUDM/GLM-4-32B-Chat-0414`, etc.
* ✅︎
* ✅︎
- * `GPT2LMHeadModel` - * `GPT2LMHeadModel`
* GPT-2 * GPT-2
* `gpt2`, `gpt2-xl`, etc. * `gpt2`, `gpt2-xl`, etc.
...@@ -478,6 +512,16 @@ See [this page](#generative-models) for more information on how to use generativ ...@@ -478,6 +512,16 @@ See [this page](#generative-models) for more information on how to use generativ
* `Qwen/Qwen1.5-MoE-A2.7B`, `Qwen/Qwen1.5-MoE-A2.7B-Chat`, etc. * `Qwen/Qwen1.5-MoE-A2.7B`, `Qwen/Qwen1.5-MoE-A2.7B-Chat`, etc.
* *
* ✅︎ * ✅︎
- * `Qwen3ForCausalLM`
* Qwen3
* `Qwen/Qwen3-8B`, etc.
* ✅︎
* ✅︎
- * `Qwen3MoeForCausalLM`
* Qwen3MoE
* `Qwen/Qwen3-MoE-15B-A2B`, etc.
* ✅︎
* ✅︎
- * `StableLmForCausalLM` - * `StableLmForCausalLM`
* StableLM * StableLM
* `stabilityai/stablelm-3b-4e1t`, `stabilityai/stablelm-base-alpha-7b-v2`, etc. * `stabilityai/stablelm-3b-4e1t`, `stabilityai/stablelm-base-alpha-7b-v2`, etc.
...@@ -715,7 +759,7 @@ On the other hand, modalities separated by `/` are mutually exclusive. ...@@ -715,7 +759,7 @@ On the other hand, modalities separated by `/` are mutually exclusive.
See [this page](#multimodal-inputs) on how to pass multi-modal inputs to the model. See [this page](#multimodal-inputs) on how to pass multi-modal inputs to the model.
:::{important} :::{important}
To enable multiple multi-modal items per text prompt, you have to set `limit_mm_per_prompt` (offline inference) **To enable multiple multi-modal items per text prompt in vLLM V0**, you have to set `limit_mm_per_prompt` (offline inference)
or `--limit-mm-per-prompt` (online serving). For example, to enable passing up to 4 images per text prompt: or `--limit-mm-per-prompt` (online serving). For example, to enable passing up to 4 images per text prompt:
Offline inference: Offline inference:
...@@ -733,6 +777,8 @@ Online serving: ...@@ -733,6 +777,8 @@ Online serving:
vllm serve Qwen/Qwen2-VL-7B-Instruct --limit-mm-per-prompt image=4 vllm serve Qwen/Qwen2-VL-7B-Instruct --limit-mm-per-prompt image=4
``` ```
**This is no longer required if you are using vLLM V1.**
::: :::
:::{note} :::{note}
...@@ -834,9 +880,16 @@ See [this page](#generative-models) for more information on how to use generativ ...@@ -834,9 +880,16 @@ See [this page](#generative-models) for more information on how to use generativ
* *
* ✅︎ * ✅︎
- * `InternVLChatModel` - * `InternVLChatModel`
* InternVideo 2.5, InternVL 2.5, Mono-InternVL, InternVL 2.0 * InternVL 3.0, InternVideo 2.5, InternVL 2.5, Mono-InternVL, InternVL 2.0
* T + I<sup>E+</sup> * T + I<sup>E+</sup>
* `OpenGVLab/InternVideo2_5_Chat_8B`, `OpenGVLab/InternVL2_5-4B`, `OpenGVLab/Mono-InternVL-2B`, `OpenGVLab/InternVL2-4B`, etc. * `OpenGVLab/InternVL3-9B`, `OpenGVLab/InternVideo2_5_Chat_8B`, `OpenGVLab/InternVL2_5-4B`, `OpenGVLab/Mono-InternVL-2B`, `OpenGVLab/InternVL2-4B`, etc.
*
* ✅︎
* ✅︎
- * `Llama4ForConditionalGeneration`
* Llama 4
* T + I<sup>+</sup>
* `meta-llama/Llama-4-Scout-17B-16E-Instruct`, `meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8`, `meta-llama/Llama-4-Maverick-17B-128E-Instruct`, etc.
* *
* ✅︎ * ✅︎
* ✅︎ * ✅︎
...@@ -980,6 +1033,13 @@ See [this page](#generative-models) for more information on how to use generativ ...@@ -980,6 +1033,13 @@ See [this page](#generative-models) for more information on how to use generativ
* *
* ✅︎ * ✅︎
* ✅︎ * ✅︎
- * `SmolVLMForConditionalGeneration`
* SmolVLM2
* T + I
* `SmolVLM2-2.2B-Instruct`
*
* ✅︎
* ✅︎
- * `UltravoxModel` - * `UltravoxModel`
* Ultravox * Ultravox
* T + A<sup>E+</sup> * T + A<sup>E+</sup>
...@@ -996,9 +1056,6 @@ See [this page](#generative-models) for more information on how to use generativ ...@@ -996,9 +1056,6 @@ See [this page](#generative-models) for more information on how to use generativ
<sup>+</sup> Multiple items can be inputted per text prompt for this modality. <sup>+</sup> Multiple items can be inputted per text prompt for this modality.
:::{important} :::{important}
To use Gemma3 series models, you have to install Hugging Face Transformers library from source via
`pip install git+https://github.com/huggingface/transformers`.
Pan-and-scan image pre-processing is currently supported on V0 (but not V1). Pan-and-scan image pre-processing is currently supported on V0 (but not V1).
You can enable it by passing `--mm-processor-kwargs '{"do_pan_and_scan": True}'`. You can enable it by passing `--mm-processor-kwargs '{"do_pan_and_scan": True}'`.
::: :::
...@@ -1135,5 +1192,5 @@ We have the following levels of testing for models: ...@@ -1135,5 +1192,5 @@ We have the following levels of testing for models:
1. **Strict Consistency**: We compare the output of the model with the output of the model in the HuggingFace Transformers library under greedy decoding. This is the most stringent test. Please refer to [models tests](https://github.com/vllm-project/vllm/blob/main/tests/models) for the models that have passed this test. 1. **Strict Consistency**: We compare the output of the model with the output of the model in the HuggingFace Transformers library under greedy decoding. This is the most stringent test. Please refer to [models tests](https://github.com/vllm-project/vllm/blob/main/tests/models) for the models that have passed this test.
2. **Output Sensibility**: We check if the output of the model is sensible and coherent, by measuring the perplexity of the output and checking for any obvious errors. This is a less stringent test. 2. **Output Sensibility**: We check if the output of the model is sensible and coherent, by measuring the perplexity of the output and checking for any obvious errors. This is a less stringent test.
3. **Runtime Functionality**: We check if the model can be loaded and run without errors. This is the least stringent test. Please refer to [functionality tests](gh-dir:tests) and [examples](gh-dir:main/examples) for the models that have passed this test. 3. **Runtime Functionality**: We check if the model can be loaded and run without errors. This is the least stringent test. Please refer to [functionality tests](gh-dir:tests) and [examples](gh-dir:examples) for the models that have passed this test.
4. **Community Feedback**: We rely on the community to provide feedback on the models. If a model is broken or not working as expected, we encourage users to raise issues to report it or open pull requests to fix it. The rest of the models fall under this category. 4. **Community Feedback**: We rely on the community to provide feedback on the models. If a model is broken or not working as expected, we encourage users to raise issues to report it or open pull requests to fix it. The rest of the models fall under this category.
\ No newline at end of file
...@@ -110,6 +110,30 @@ If you run out of CPU RAM, try the following options: ...@@ -110,6 +110,30 @@ If you run out of CPU RAM, try the following options:
- (Multi-modal models only) you can set the size of multi-modal input cache using `VLLM_MM_INPUT_CACHE_GIB` environment variable (default 4 GiB). - (Multi-modal models only) you can set the size of multi-modal input cache using `VLLM_MM_INPUT_CACHE_GIB` environment variable (default 4 GiB).
- (CPU backend only) you can set the size of KV cache using `VLLM_CPU_KVCACHE_SPACE` environment variable (default 4 GiB). - (CPU backend only) you can set the size of KV cache using `VLLM_CPU_KVCACHE_SPACE` environment variable (default 4 GiB).
#### Disable unused modalities
You can disable unused modalities (except for text) by setting their limits to zero.
For example, if your application only accepts image input, there is no need to allocate any memory for videos.
```python
from vllm import LLM
# Accept images but not videos
llm = LLM(model="Qwen/Qwen2.5-VL-3B-Instruct",
limit_mm_per_prompt={"video": 0})
```
You can even run a multi-modal model for text-only inference:
```python
from vllm import LLM
# Don't accept images. Just text.
llm = LLM(model="google/gemma-3-27b-it",
limit_mm_per_prompt={"image": 0})
```
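When the set of modalities differs from model to model, the limits can be built generically: start from zero for every modality and overlay only the limits a given model needs. This mirrors the dict merge used in the example scripts changed in this commit. The helper name `merge_mm_limits` is hypothetical, and the `|` operator on dicts requires Python 3.9 or later.

```python
def merge_mm_limits(model_limits=None) -> dict:
    """Zero out every modality, then overlay the limits the model needs."""
    default_limits = {"image": 0, "video": 0, "audio": 0}
    # Keys on the right-hand side win, so explicit limits override the zeros.
    return default_limits | dict(model_limits or {})


print(merge_mm_limits({"audio": 2}))
print(merge_mm_limits(None))
```

The resulting dict can then be passed as `limit_mm_per_prompt`, as in the examples above.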
### Performance optimization and tuning ### Performance optimization and tuning
You can potentially improve the performance of vLLM by finetuning various options. You can potentially improve the performance of vLLM by finetuning various options.
......
...@@ -2,15 +2,15 @@ ...@@ -2,15 +2,15 @@
# OpenAI-Compatible Server # OpenAI-Compatible Server
vLLM provides an HTTP server that implements OpenAI's [Completions API](https://platform.openai.com/docs/api-reference/completions), [Chat API](https://platform.openai.com/docs/api-reference/chat), and more! vLLM provides an HTTP server that implements OpenAI's [Completions API](https://platform.openai.com/docs/api-reference/completions), [Chat API](https://platform.openai.com/docs/api-reference/chat), and more! This functionality lets you serve models and interact with them using an HTTP client.
You can start the server via the [`vllm serve`](#vllm-serve) command, or through [Docker](#deployment-docker): In your terminal, you can [install](../getting_started/installation.md) vLLM, then start the server with the [`vllm serve`](#vllm-serve) command. (You can also use our [Docker](#deployment-docker) image.)
```bash ```bash
vllm serve NousResearch/Meta-Llama-3-8B-Instruct --dtype auto --api-key token-abc123 vllm serve NousResearch/Meta-Llama-3-8B-Instruct --dtype auto --api-key token-abc123
``` ```
To call the server, you can use the [official OpenAI Python client](https://github.com/openai/openai-python), or any other HTTP client. To call the server, in your preferred text editor, create a script that uses an HTTP client. Include any messages that you want to send to the model. Then run that script. Below is an example script using the [official OpenAI Python client](https://github.com/openai/openai-python).
```python ```python
from openai import OpenAI from openai import OpenAI
......
...@@ -47,7 +47,7 @@ def run_minicpmo(question: str, audio_count: int) -> ModelRequestData: ...@@ -47,7 +47,7 @@ def run_minicpmo(question: str, audio_count: int) -> ModelRequestData:
model=model_name, model=model_name,
trust_remote_code=True, trust_remote_code=True,
max_model_len=4096, max_model_len=4096,
max_num_seqs=5, max_num_seqs=2,
limit_mm_per_prompt={"audio": audio_count}, limit_mm_per_prompt={"audio": audio_count},
) )
...@@ -196,16 +196,14 @@ def main(args): ...@@ -196,16 +196,14 @@ def main(args):
req_data = model_example_map[model](question_per_audio_count[audio_count], req_data = model_example_map[model](question_per_audio_count[audio_count],
audio_count) audio_count)
# Disable other modalities to save memory
default_limits = {"image": 0, "video": 0, "audio": 0}
req_data.engine_args.limit_mm_per_prompt = default_limits | dict(
req_data.engine_args.limit_mm_per_prompt or {})
engine_args = asdict(req_data.engine_args) | {"seed": args.seed} engine_args = asdict(req_data.engine_args) | {"seed": args.seed}
llm = LLM(**engine_args) llm = LLM(**engine_args)
# To maintain code compatibility in this script, we add LoRA here.
# You can also add LoRA using:
# llm.generate(prompts, lora_request=lora_request,...)
if req_data.lora_requests:
for lora_request in req_data.lora_requests:
llm.llm_engine.add_lora(lora_request=lora_request)
# We set temperature to 0.2 so that outputs can be different # We set temperature to 0.2 so that outputs can be different
# even when all prompts are identical when running batch inference. # even when all prompts are identical when running batch inference.
sampling_params = SamplingParams(temperature=0.2, sampling_params = SamplingParams(temperature=0.2,
...@@ -226,8 +224,15 @@ def main(args): ...@@ -226,8 +224,15 @@ def main(args):
if args.num_prompts > 1: if args.num_prompts > 1:
# Batch inference # Batch inference
inputs = [inputs] * args.num_prompts inputs = [inputs] * args.num_prompts
# Add LoRA request if applicable
outputs = llm.generate(inputs, sampling_params=sampling_params) lora_request = (req_data.lora_requests *
args.num_prompts if req_data.lora_requests else None)
outputs = llm.generate(
inputs,
sampling_params=sampling_params,
lora_request=lora_request,
)
for o in outputs: for o in outputs:
generated_text = o.outputs[0].text generated_text = o.outputs[0].text
......
...@@ -7,89 +7,103 @@ from transformers import AutoTokenizer ...@@ -7,89 +7,103 @@ from transformers import AutoTokenizer
from vllm import LLM, SamplingParams from vllm import LLM, SamplingParams
parser = argparse.ArgumentParser()
def load_prompts(dataset_path, num_prompts):
parser.add_argument( if os.path.exists(dataset_path):
"--dataset", prompts = []
type=str, try:
default="./examples/data/gsm8k.jsonl", with open(dataset_path) as f:
help="downloaded from the eagle repo " \ for line in f:
"https://github.com/SafeAILab/EAGLE/blob/main/eagle/data/" data = json.loads(line)
) prompts.append(data["turns"][0])
parser.add_argument("--max_num_seqs", type=int, default=8) except Exception as e:
parser.add_argument("--num_prompts", type=int, default=80) print(f"Error reading dataset: {e}")
parser.add_argument("--num_spec_tokens", type=int, default=2) return []
parser.add_argument("--tp", type=int, default=1) else:
parser.add_argument("--draft_tp", type=int, default=1) prompts = [
parser.add_argument("--enforce_eager", action='store_true') "The future of AI is", "The president of the United States is"
parser.add_argument("--enable_chunked_prefill", action='store_true') ]
parser.add_argument("--max_num_batched_tokens", type=int, default=2048)
parser.add_argument("--temp", type=float, default=0) return prompts[:num_prompts]
args = parser.parse_args()
def main():
print(args) parser = argparse.ArgumentParser()
parser.add_argument(
model_dir = "meta-llama/Meta-Llama-3-8B-Instruct" "--dataset",
eagle_dir = "abhigoyal/EAGLE-LLaMA3-Instruct-8B-vllm" type=str,
default="./examples/data/gsm8k.jsonl",
max_model_len = 2048 help="downloaded from the eagle repo " \
"https://github.com/SafeAILab/EAGLE/blob/main/eagle/data/"
tokenizer = AutoTokenizer.from_pretrained(model_dir) )
parser.add_argument("--max_num_seqs", type=int, default=8)
if os.path.exists(args.dataset): parser.add_argument("--num_prompts", type=int, default=80)
prompts = [] parser.add_argument("--num_spec_tokens", type=int, default=2)
num_prompts = args.num_prompts parser.add_argument("--tp", type=int, default=1)
with open(args.dataset) as f: parser.add_argument("--draft_tp", type=int, default=1)
for line in f: parser.add_argument("--enforce_eager", action='store_true')
data = json.loads(line) parser.add_argument("--enable_chunked_prefill", action='store_true')
prompts.append(data["turns"][0]) parser.add_argument("--max_num_batched_tokens", type=int, default=2048)
else: parser.add_argument("--temp", type=float, default=0)
prompts = ["The future of AI is", "The president of the United States is"] args = parser.parse_args()
prompts = prompts[:args.num_prompts] model_dir = "meta-llama/Meta-Llama-3-8B-Instruct"
num_prompts = len(prompts) eagle_dir = "abhigoyal/EAGLE-LLaMA3-Instruct-8B-vllm"
prompt_ids = [ max_model_len = 2048
tokenizer.apply_chat_template([{
"role": "user", tokenizer = AutoTokenizer.from_pretrained(model_dir)
"content": prompt
}], prompts = load_prompts(args.dataset, args.num_prompts)
add_generation_prompt=True)
for prompt in prompts prompt_ids = [
] tokenizer.apply_chat_template([{
"role": "user",
llm = LLM( "content": prompt
model=model_dir, }],
trust_remote_code=True, add_generation_prompt=True)
tensor_parallel_size=args.tp, for prompt in prompts
enable_chunked_prefill=args.enable_chunked_prefill, ]
max_num_batched_tokens=args.max_num_batched_tokens,
enforce_eager=args.enforce_eager, llm = LLM(
max_model_len=max_model_len, model=model_dir,
max_num_seqs=args.max_num_seqs, trust_remote_code=True,
gpu_memory_utilization=0.8, tensor_parallel_size=args.tp,
speculative_config={ enable_chunked_prefill=args.enable_chunked_prefill,
"model": eagle_dir, max_num_batched_tokens=args.max_num_batched_tokens,
"num_speculative_tokens": args.num_spec_tokens, enforce_eager=args.enforce_eager,
"draft_tensor_parallel_size": args.draft_tp, max_model_len=max_model_len,
"max_model_len": max_model_len, max_num_seqs=args.max_num_seqs,
}, gpu_memory_utilization=0.8,
disable_log_stats=False, speculative_config={
) "method": "eagle",
"model": eagle_dir,
sampling_params = SamplingParams(temperature=args.temp, max_tokens=256) "num_speculative_tokens": args.num_spec_tokens,
"draft_tensor_parallel_size": args.draft_tp,
outputs = llm.generate(prompt_token_ids=prompt_ids, "max_model_len": max_model_len,
sampling_params=sampling_params) },
disable_log_stats=False,
# calculate the average number of accepted tokens per forward pass, +1 is )
# to account for the token from the target model that's always going to be
# accepted sampling_params = SamplingParams(temperature=args.temp, max_tokens=256)
acceptance_counts = [0] * (args.num_spec_tokens + 1)
for output in outputs: outputs = llm.generate(prompt_token_ids=prompt_ids,
for step, count in enumerate(output.metrics.spec_token_acceptance_counts): sampling_params=sampling_params)
acceptance_counts[step] += count
# calculate the average number of accepted tokens per forward pass, +1 is
print(f"mean acceptance length: \ # to account for the token from the target model that's always going to be
{sum(acceptance_counts) / acceptance_counts[0]:.2f}") # accepted
acceptance_counts = [0] * (args.num_spec_tokens + 1)
for output in outputs:
for step, count in enumerate(
output.metrics.spec_token_acceptance_counts):
acceptance_counts[step] += count
print("-" * 50)
print(f"mean acceptance length: \
{sum(acceptance_counts) / acceptance_counts[0]:.2f}")
print("-" * 50)
if __name__ == "__main__":
main()
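The acceptance-length calculation at the end of the script above can be tried in isolation. The sketch below uses made-up per-request counts in place of `output.metrics.spec_token_acceptance_counts`; the function name `mean_acceptance_length` and the sample numbers are assumptions for illustration only.

```python
def mean_acceptance_length(per_request_counts, num_spec_tokens: int) -> float:
    """Average number of tokens accepted per target-model forward pass."""
    # Index 0 counts the token from the target model, which is always accepted;
    # indices 1..num_spec_tokens count accepted draft tokens at each position.
    acceptance_counts = [0] * (num_spec_tokens + 1)
    for counts in per_request_counts:
        for step, count in enumerate(counts):
            acceptance_counts[step] += count
    if acceptance_counts[0] == 0:
        raise ValueError("no forward passes recorded")
    return sum(acceptance_counts) / acceptance_counts[0]


# Two hypothetical requests, each with 10 forward passes and 2 draft tokens.
sample = [[10, 6, 2], [10, 4, 0]]
print(f"mean acceptance length: {mean_acceptance_length(sample, 2):.2f}")
```

Dividing by `acceptance_counts[0]` normalizes by the number of forward passes, so a value of 1.0 would mean no draft token was ever accepted.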
# SPDX-License-Identifier: Apache-2.0
from argparse import Namespace
from vllm import LLM, EngineArgs
from vllm.utils import FlexibleArgumentParser
def main(args: Namespace):
# Sample prompts.
prompts = [
"Follow the white rabbit.", # English
"Sigue al conejo blanco.", # Spanish
"Suis le lapin blanc.", # French
"跟着白兔走。", # Chinese
"اتبع الأرنب الأبيض.", # Arabic
"Folge dem weißen Kaninchen.", # German
]
# Create an LLM.
# You should pass task="embed" for embedding models
model = LLM(**vars(args))
# Generate embedding. The output is a list of EmbeddingRequestOutputs.
# Only text matching task is supported for now. See #16120
outputs = model.embed(prompts)
# Print the outputs.
print("\nGenerated Outputs:")
print("Only text matching task is supported for now. See #16120")
print("-" * 60)
for prompt, output in zip(prompts, outputs):
embeds = output.outputs.embedding
embeds_trimmed = ((str(embeds[:16])[:-1] +
", ...]") if len(embeds) > 16 else embeds)
print(f"Prompt: {prompt!r} \n"
f"Embeddings for text matching: {embeds_trimmed} "
f"(size={len(embeds)})")
print("-" * 60)
if __name__ == "__main__":
parser = FlexibleArgumentParser()
parser = EngineArgs.add_cli_args(parser)
# Set example specific arguments
parser.set_defaults(model="jinaai/jina-embeddings-v3",
task="embed",
trust_remote_code=True)
args = parser.parse_args()
main(args)
# SPDX-License-Identifier: Apache-2.0
from argparse import Namespace
from vllm import LLM, EngineArgs, PoolingParams
from vllm.utils import FlexibleArgumentParser
def main(args: Namespace):
# Sample prompts.
prompts = [
"Follow the white rabbit.", # English
"Sigue al conejo blanco.", # Spanish
"Suis le lapin blanc.", # French
"跟着白兔走。", # Chinese
"اتبع الأرنب الأبيض.", # Arabic
"Folge dem weißen Kaninchen.", # German
]
# Create an LLM.
# You should pass task="embed" for embedding models
model = LLM(**vars(args))
# Generate embedding. The output is a list of EmbeddingRequestOutputs.
outputs = model.embed(prompts, pooling_params=PoolingParams(dimensions=32))
# Print the outputs.
print("\nGenerated Outputs:")
print("-" * 60)
for prompt, output in zip(prompts, outputs):
embeds = output.outputs.embedding
embeds_trimmed = ((str(embeds[:16])[:-1] +
", ...]") if len(embeds) > 16 else embeds)
print(f"Prompt: {prompt!r} \n"
f"Embeddings: {embeds_trimmed} "
f"(size={len(embeds)})")
print("-" * 60)
if __name__ == "__main__":
parser = FlexibleArgumentParser()
parser = EngineArgs.add_cli_args(parser)
# Set example specific arguments
parser.set_defaults(model="jinaai/jina-embeddings-v3",
task="embed",
trust_remote_code=True)
args = parser.parse_args()
main(args)
...@@ -75,8 +75,6 @@ prompts = [ ...@@ -75,8 +75,6 @@ prompts = [
enc_dec_prompt1, enc_dec_prompt2, enc_dec_prompt3 enc_dec_prompt1, enc_dec_prompt2, enc_dec_prompt3
] + zipped_prompt_list ] + zipped_prompt_list
print(prompts)
# Create a sampling params object. # Create a sampling params object.
sampling_params = SamplingParams( sampling_params = SamplingParams(
temperature=0, temperature=0,
...@@ -91,10 +89,13 @@ sampling_params = SamplingParams( ...@@ -91,10 +89,13 @@ sampling_params = SamplingParams(
outputs = llm.generate(prompts, sampling_params) outputs = llm.generate(prompts, sampling_params)
# Print the outputs. # Print the outputs.
for output in outputs: print("-" * 50)
for i, output in enumerate(outputs):
prompt = output.prompt prompt = output.prompt
encoder_prompt = output.encoder_prompt encoder_prompt = output.encoder_prompt
generated_text = output.outputs[0].text generated_text = output.outputs[0].text
print(f"Encoder prompt: {encoder_prompt!r}, " print(f"Output {i+1}:")
f"Decoder prompt: {prompt!r}, " print(f"Encoder prompt: {encoder_prompt!r}\n"
f"Decoder prompt: {prompt!r}\n"
f"Generated text: {generated_text!r}") f"Generated text: {generated_text!r}")
print("-" * 50)
...@@ -56,7 +56,7 @@ def run_florence2(): ...@@ -56,7 +56,7 @@ def run_florence2():
def run_mllama(): def run_mllama():
engine_args = EngineArgs( engine_args = EngineArgs(
model="meta-llama/Llama-3.2-11B-Vision-Instruct", model="meta-llama/Llama-3.2-11B-Vision-Instruct",
max_model_len=4096, max_model_len=8192,
max_num_seqs=2, max_num_seqs=2,
limit_mm_per_prompt={"image": 1}, limit_mm_per_prompt={"image": 1},
dtype="half", dtype="half",
...@@ -133,6 +133,11 @@ def main(args): ...@@ -133,6 +133,11 @@ def main(args):
req_data = model_example_map[model]() req_data = model_example_map[model]()
# Disable other modalities to save memory
default_limits = {"image": 0, "video": 0, "audio": 0}
req_data.engine_args.limit_mm_per_prompt = default_limits | dict(
req_data.engine_args.limit_mm_per_prompt or {})
engine_args = asdict(req_data.engine_args) | {"seed": args.seed} engine_args = asdict(req_data.engine_args) | {"seed": args.seed}
llm = LLM(**engine_args) llm = LLM(**engine_args)
......
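The hunk above starts from all-zero modality limits and lets the per-model `limit_mm_per_prompt` win wherever it sets a value. A minimal sketch of that merge in isolation, assuming Python 3.9+ for the dict `|` operator; the helper name `merge_mm_limits` is hypothetical and not part of vLLM:

```python
def merge_mm_limits(overrides):
    """Return modality limits with every unspecified modality disabled.

    `overrides` stands in for `req_data.engine_args.limit_mm_per_prompt`
    and may be None. This helper is illustrative only.
    """
    default_limits = {"image": 0, "video": 0, "audio": 0}
    # With `a | b`, keys from the right-hand dict take precedence.
    return default_limits | dict(overrides or {})


if __name__ == "__main__":
    print(merge_mm_limits({"image": 1}))
    print(merge_mm_limits(None))
```

Because the right-hand operand wins, a model that asks for one image keeps that setting while video and audio stay at zero.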
# SPDX-License-Identifier: Apache-2.0 # SPDX-License-Identifier: Apache-2.0
"""
This file demonstrates using the `LLMEngine`
for processing prompts with various sampling parameters.
"""
import argparse import argparse
from vllm import EngineArgs, LLMEngine, RequestOutput, SamplingParams from vllm import EngineArgs, LLMEngine, RequestOutput, SamplingParams
...@@ -26,6 +29,7 @@ def process_requests(engine: LLMEngine, ...@@ -26,6 +29,7 @@ def process_requests(engine: LLMEngine,
"""Continuously process a list of prompts and handle the outputs.""" """Continuously process a list of prompts and handle the outputs."""
request_id = 0 request_id = 0
print('-' * 50)
while test_prompts or engine.has_unfinished_requests(): while test_prompts or engine.has_unfinished_requests():
if test_prompts: if test_prompts:
prompt, sampling_params = test_prompts.pop(0) prompt, sampling_params = test_prompts.pop(0)
...@@ -37,6 +41,7 @@ def process_requests(engine: LLMEngine, ...@@ -37,6 +41,7 @@ def process_requests(engine: LLMEngine,
for request_output in request_outputs: for request_output in request_outputs:
if request_output.finished: if request_output.finished:
print(request_output) print(request_output)
print('-' * 50)
def initialize_engine(args: argparse.Namespace) -> LLMEngine: def initialize_engine(args: argparse.Namespace) -> LLMEngine:
......
...@@ -13,9 +13,14 @@ from vllm.sampling_params import SamplingParams ...@@ -13,9 +13,14 @@ from vllm.sampling_params import SamplingParams
# - Server: # - Server:
# #
# ```bash # ```bash
# # Mistral format
# vllm serve mistralai/Mistral-Small-3.1-24B-Instruct-2503 \ # vllm serve mistralai/Mistral-Small-3.1-24B-Instruct-2503 \
# --tokenizer-mode mistral --config-format mistral --load-format mistral \ # --tokenizer-mode mistral --config-format mistral --load-format mistral \
# --limit-mm-per-prompt 'image=4' --max-model-len 16384 # --limit-mm-per-prompt 'image=4' --max-model-len 16384
#
# # HF format
# vllm serve mistralai/Mistral-Small-3.1-24B-Instruct-2503 \
# --limit-mm-per-prompt 'image=4' --max-model-len 16384
# ``` # ```
# #
# - Client: # - Client:
...@@ -44,19 +49,22 @@ from vllm.sampling_params import SamplingParams ...@@ -44,19 +49,22 @@ from vllm.sampling_params import SamplingParams
# python demo.py simple # python demo.py simple
# python demo.py advanced # python demo.py advanced
# Lower max_model_len and/or max_num_seqs on low-VRAM GPUs.
# These scripts have been tested on 2x L40 GPUs
def run_simple_demo(args: argparse.Namespace): def run_simple_demo(args: argparse.Namespace):
model_name = "mistralai/Mistral-Small-3.1-24B-Instruct-2503" model_name = "mistralai/Mistral-Small-3.1-24B-Instruct-2503"
sampling_params = SamplingParams(max_tokens=8192) sampling_params = SamplingParams(max_tokens=8192)
# Lower max_model_len and/or max_num_seqs on low-VRAM GPUs.
llm = LLM( llm = LLM(
model=model_name, model=model_name,
tokenizer_mode="mistral", tokenizer_mode="mistral" if args.format == "mistral" else "auto",
config_format="mistral", config_format="mistral" if args.format == "mistral" else "auto",
load_format="mistral", load_format="mistral" if args.format == "mistral" else "auto",
max_model_len=4096, max_model_len=4096,
max_num_seqs=2, max_num_seqs=2,
tensor_parallel_size=2,
disable_mm_preprocessor_cache=args.disable_mm_preprocessor_cache, disable_mm_preprocessor_cache=args.disable_mm_preprocessor_cache,
) )
...@@ -82,23 +90,25 @@ def run_simple_demo(args: argparse.Namespace): ...@@ -82,23 +90,25 @@ def run_simple_demo(args: argparse.Namespace):
}, },
] ]
outputs = llm.chat(messages, sampling_params=sampling_params) outputs = llm.chat(messages, sampling_params=sampling_params)
print("-" * 50)
print(outputs[0].outputs[0].text) print(outputs[0].outputs[0].text)
print("-" * 50)
def run_advanced_demo(args: argparse.Namespace): def run_advanced_demo(args: argparse.Namespace):
model_name = "mistralai/Mistral-Small-3.1-24B-Instruct-2503" model_name = "mistralai/Mistral-Small-3.1-24B-Instruct-2503"
max_img_per_msg = 5 max_img_per_msg = 3
max_tokens_per_img = 4096 max_tokens_per_img = 4096
sampling_params = SamplingParams(max_tokens=8192, temperature=0.7) sampling_params = SamplingParams(max_tokens=8192, temperature=0.7)
llm = LLM( llm = LLM(
model=model_name, model=model_name,
tokenizer_mode="mistral", tokenizer_mode="mistral" if args.format == "mistral" else "auto",
config_format="mistral", config_format="mistral" if args.format == "mistral" else "auto",
load_format="mistral", load_format="mistral" if args.format == "mistral" else "auto",
limit_mm_per_prompt={"image": max_img_per_msg}, limit_mm_per_prompt={"image": max_img_per_msg},
max_model_len=max_img_per_msg * max_tokens_per_img, max_model_len=max_img_per_msg * max_tokens_per_img,
tensor_parallel_size=2,
disable_mm_preprocessor_cache=args.disable_mm_preprocessor_cache, disable_mm_preprocessor_cache=args.disable_mm_preprocessor_cache,
) )
...@@ -153,7 +163,9 @@ def run_advanced_demo(args: argparse.Namespace): ...@@ -153,7 +163,9 @@ def run_advanced_demo(args: argparse.Namespace):
] ]
outputs = llm.chat(messages=messages, sampling_params=sampling_params) outputs = llm.chat(messages=messages, sampling_params=sampling_params)
print("-" * 50)
print(outputs[0].outputs[0].text) print(outputs[0].outputs[0].text)
print("-" * 50)
def main(): def main():
...@@ -166,6 +178,11 @@ def main(): ...@@ -166,6 +178,11 @@ def main():
help="Specify the demo mode: 'simple' or 'advanced'", help="Specify the demo mode: 'simple' or 'advanced'",
) )
parser.add_argument('--format',
choices=["mistral", "hf"],
default="mistral",
help='Specify the format of the model to load.')
parser.add_argument( parser.add_argument(
'--disable-mm-preprocessor-cache', '--disable-mm-preprocessor-cache',
action='store_true', action='store_true',
......
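Both demos in the diff above repeat the same conditional for three loader arguments. A small sketch of how the `--format` choice maps onto those arguments, assuming only the values shown in the diff (`"mistral"` and `"auto"`); the helper name `loader_kwargs` is hypothetical and not part of the example script:

```python
def loader_kwargs(fmt):
    """Map the --format choice ("mistral" or "hf") to loader arguments.

    Illustrative only: "mistral" selects the Mistral-native tokenizer,
    config, and weight format; anything else falls back to "auto".
    """
    mode = "mistral" if fmt == "mistral" else "auto"
    return {
        "tokenizer_mode": mode,
        "config_format": mode,
        "load_format": mode,
    }


if __name__ == "__main__":
    print(loader_kwargs("mistral"))
    print(loader_kwargs("hf"))
```

The resulting dict could be unpacked into the `LLM(...)` call so the two demo functions would not need to duplicate the conditional.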
# SPDX-License-Identifier: Apache-2.0 # SPDX-License-Identifier: Apache-2.0
"""
This file demonstrates the usage of text generation with an LLM model,
comparing the performance with and without speculative decoding.
Note that `v1` is not supported yet:
VLLM_USE_V1=0 python examples/offline_inference/mlpspeculator.py
"""
import gc import gc
import time import time
...@@ -7,7 +14,7 @@ from vllm import LLM, SamplingParams ...@@ -7,7 +14,7 @@ from vllm import LLM, SamplingParams
def time_generation(llm: LLM, prompts: list[str], def time_generation(llm: LLM, prompts: list[str],
sampling_params: SamplingParams): sampling_params: SamplingParams, title: str):
# Generate texts from the prompts. The output is a list of RequestOutput # Generate texts from the prompts. The output is a list of RequestOutput
# objects that contain the prompt, generated text, and other information. # objects that contain the prompt, generated text, and other information.
# Warmup first # Warmup first
...@@ -16,11 +23,15 @@ def time_generation(llm: LLM, prompts: list[str], ...@@ -16,11 +23,15 @@ def time_generation(llm: LLM, prompts: list[str],
start = time.time() start = time.time()
outputs = llm.generate(prompts, sampling_params) outputs = llm.generate(prompts, sampling_params)
end = time.time() end = time.time()
print((end - start) / sum([len(o.outputs[0].token_ids) for o in outputs])) print("-" * 50)
print(title)
print("time: ",
(end - start) / sum(len(o.outputs[0].token_ids) for o in outputs))
# Print the outputs. # Print the outputs.
for output in outputs: for output in outputs:
generated_text = output.outputs[0].text generated_text = output.outputs[0].text
print(f"text: {generated_text!r}") print(f"text: {generated_text!r}")
print("-" * 50)
if __name__ == "__main__": if __name__ == "__main__":
...@@ -41,8 +52,7 @@ if __name__ == "__main__": ...@@ -41,8 +52,7 @@ if __name__ == "__main__":
# Create an LLM without spec decoding # Create an LLM without spec decoding
llm = LLM(model="meta-llama/Llama-2-13b-chat-hf") llm = LLM(model="meta-llama/Llama-2-13b-chat-hf")
print("Without speculation") time_generation(llm, prompts, sampling_params, "Without speculation")
time_generation(llm, prompts, sampling_params)
del llm del llm
gc.collect() gc.collect()
...@@ -55,5 +65,4 @@ if __name__ == "__main__": ...@@ -55,5 +65,4 @@ if __name__ == "__main__":
}, },
) )
print("With speculation") time_generation(llm, prompts, sampling_params, "With speculation")
time_generation(llm, prompts, sampling_params)
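The timing line changed above reports elapsed seconds divided by the total number of generated tokens. A self-contained sketch of that calculation, using `SimpleNamespace` stand-ins for the `RequestOutput` objects so it runs without vLLM; the function name `seconds_per_token` and the zero-token guard are additions for illustration:

```python
from types import SimpleNamespace


def seconds_per_token(elapsed, outputs):
    """Average wall-clock seconds per generated token across all outputs."""
    # A generator expression avoids building an intermediate list.
    total_tokens = sum(len(o.outputs[0].token_ids) for o in outputs)
    if total_tokens == 0:
        raise ValueError("no tokens were generated")
    return elapsed / total_tokens


if __name__ == "__main__":
    # Stand-ins that mimic only the attributes the calculation reads.
    fake_outputs = [
        SimpleNamespace(outputs=[SimpleNamespace(token_ids=[1, 2, 3])]),
        SimpleNamespace(outputs=[SimpleNamespace(token_ids=[4])]),
    ]
    print("time: ", seconds_per_token(2.0, fake_outputs))
```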
...@@ -61,6 +61,7 @@ def process_requests(engine: LLMEngine, ...@@ -61,6 +61,7 @@ def process_requests(engine: LLMEngine,
"""Continuously process a list of prompts and handle the outputs.""" """Continuously process a list of prompts and handle the outputs."""
request_id = 0 request_id = 0
print("-" * 50)
while test_prompts or engine.has_unfinished_requests(): while test_prompts or engine.has_unfinished_requests():
if test_prompts: if test_prompts:
prompt, sampling_params, lora_request = test_prompts.pop(0) prompt, sampling_params, lora_request = test_prompts.pop(0)
...@@ -75,6 +76,7 @@ def process_requests(engine: LLMEngine, ...@@ -75,6 +76,7 @@ def process_requests(engine: LLMEngine,
for request_output in request_outputs: for request_output in request_outputs:
if request_output.finished: if request_output.finished:
print(request_output) print(request_output)
print("-" * 50)
def initialize_engine() -> LLMEngine: def initialize_engine() -> LLMEngine:
......
...@@ -12,27 +12,36 @@ prompts = [ ...@@ -12,27 +12,36 @@ prompts = [
# Create a sampling params object. # Create a sampling params object.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95) sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
# Create an LLM.
llm = LLM( def main():
model="TinyLlama/TinyLlama-1.1B-Chat-v1.0", # Create an LLM.
max_num_seqs=8, llm = LLM(
# The max_model_len and block_size arguments are required to be same as model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
# max sequence length when targeting neuron device. max_num_seqs=8,
# Currently, this is a known limitation in continuous batching support # The max_model_len and block_size arguments are required to be same as
# in transformers-neuronx. # max sequence length when targeting neuron device.
# TODO(liangfu): Support paged-attention in transformers-neuronx. # Currently, this is a known limitation in continuous batching support
max_model_len=1024, # in transformers-neuronx.
block_size=1024, # TODO(liangfu): Support paged-attention in transformers-neuronx.
# The device can be automatically detected when AWS Neuron SDK is installed. max_model_len=1024,
# The device argument can be either unspecified for automated detection, block_size=1024,
# or explicitly assigned. # ruff: noqa: E501
device="neuron", # The device can be automatically detected when AWS Neuron SDK is installed.
tensor_parallel_size=2) # The device argument can be either unspecified for automated detection,
# Generate texts from the prompts. The output is a list of RequestOutput objects # or explicitly assigned.
# that contain the prompt, generated text, and other information. device="neuron",
outputs = llm.generate(prompts, sampling_params) tensor_parallel_size=2)
# Print the outputs. # Generate texts from the prompts. The output is a list of RequestOutput objects
for output in outputs: # that contain the prompt, generated text, and other information.
prompt = output.prompt outputs = llm.generate(prompts, sampling_params)
generated_text = output.outputs[0].text # Print the outputs.
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}") print("-" * 50)
for output in outputs:
prompt = output.prompt
generated_text = output.outputs[0].text
print(f"Prompt: {prompt!r}\nGenerated text: {generated_text!r}")
print("-" * 50)
if __name__ == "__main__":
main()