Merge tag 'v0.8.4' into v0.8.4-dev

31330101 · zhuwenwen · e8933c34 · dc1b4a6f · 31330101 · 31330101
Commit 31330101 authored Apr 16, 2025 by zhuwenwen
20 changed files
--- a/docs/source/features/quantization/torchao.md
+++ b/docs/source/features/quantization/torchao.md
+# TorchAO
+TorchAO is an architecture optimization library for PyTorch. It provides high-performance dtypes, optimization techniques, and kernels for inference and training, and it composes with native PyTorch features such as `torch.compile` and FSDP. Some benchmark numbers can be found [here](https://github.com/pytorch/ao/tree/main/torchao/quantization#benchmarks).
+We recommend installing the latest torchao nightly with:
+```console
+# Install the latest TorchAO nightly build
+# Choose the CUDA version that matches your system (cu126, cu128, etc.)
+pip install --pre "torchao>=0.10.0" --index-url https://download.pytorch.org/whl/nightly/cu126
+```
+## Quantizing HuggingFace Models
+You can quantize your own Hugging Face model with torchao, for example through [transformers](https://huggingface.co/docs/transformers/main/en/quantization/torchao) or [diffusers](https://huggingface.co/docs/diffusers/en/quantization/torchao), and save the checkpoint to the Hugging Face Hub (like [this one](https://huggingface.co/jerryzh168/llama3-8b-int8wo)) using the following example code:
+```Python
+import torch
+from transformers import TorchAoConfig, AutoModelForCausalLM, AutoTokenizer
+from torchao.quantization import Int8WeightOnlyConfig
+model_name = "meta-llama/Meta-Llama-3-8B"
+quantization_config = TorchAoConfig(Int8WeightOnlyConfig())
+quantized_model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto", quantization_config=quantization_config)
+tokenizer = AutoTokenizer.from_pretrained(model_name)
+input_text = "What are we having for dinner?"
+input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")
+hub_repo = "<YOUR_HUB_REPO_ID>"  # replace with your Hub repo ID
+tokenizer.push_to_hub(hub_repo)
+quantized_model.push_to_hub(hub_repo, safe_serialization=False)
+```
+Alternatively, you can use the TorchAO Quantization space for quantizing models with a simple UI.
+See: https://huggingface.co/spaces/medmekk/TorchAO_Quantization
--- a/docs/source/features/tool_calling.md
+++ b/docs/source/features/tool_calling.md
@@ -245,6 +245,8 @@ Example supported models:
 * `meta-llama/Llama-3.2-3B-Instruct`\* (use with `examples/tool_chat_template_llama3.2_pythonic.jinja`)
 * `Team-ACE/ToolACE-8B` (use with `examples/tool_chat_template_toolace.jinja`)
 * `fixie-ai/ultravox-v0_4-ToolACE-8B` (use with `examples/tool_chat_template_toolace.jinja`)
+* `meta-llama/Llama-4-Scout-17B-16E-Instruct`\* (use with `examples/tool_chat_template_llama4_pythonic.jinja`)
+* `meta-llama/Llama-4-Maverick-17B-128E-Instruct`\* (use with `examples/tool_chat_template_llama4_pythonic.jinja`)
 Flags: `--tool-call-parser pythonic --chat-template {see_above}`

--- a/docs/source/generate_examples.py
+++ b/docs/source/generate_examples.py
@@ -17,6 +17,7 @@ def fix_case(text: str) -> str:
        "cli": "CLI",
        "cpu": "CPU",
        "llm": "LLM",
+        "mae": "MAE",
        "tpu": "TPU",
        "aqlm": "AQLM",
        "gguf": "GGUF",
@@ -24,6 +25,7 @@ def fix_case(text: str) -> str:
        "rlhf": "RLHF",
        "vllm": "vLLM",
        "openai": "OpenAI",
+        "lmcache": "LMCache",
        "multilora": "MultiLoRA",
        "mlpspeculator": "MLPSpeculator",
        r"fp\d+": lambda x: x.group(0).upper(),  # e.g. fp16, fp32
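
The substitution table above mixes plain-string replacements with a regex key whose value is a callable. The body of `fix_case` is not shown in this hunk, so the following is only a minimal sketch of how such a table could be applied, assuming whole-word, case-insensitive matching. `SUBS` and `fix_case_sketch` are hypothetical names, and the table is a small subset of the real one.

```python
import re
from typing import Callable, Union

# Hypothetical subset of the substitution table shown in the diff.
# Values are either literal replacements or callables that receive the match.
Replacement = Union[str, Callable[[re.Match], str]]
SUBS: dict[str, Replacement] = {
    "mae": "MAE",
    "lmcache": "LMCache",
    "vllm": "vLLM",
    r"fp\d+": lambda m: m.group(0).upper(),  # e.g. fp16, fp32
}


def fix_case_sketch(text: str) -> str:
    """Apply every pattern as a whole-word, case-insensitive substitution."""
    for pattern, repl in SUBS.items():
        # re.sub accepts both a string and a callable as the replacement.
        text = re.sub(rf"\b{pattern}\b", repl, text, flags=re.IGNORECASE)
    return text


print(fix_case_sketch("lmcache mae fp16 example"))
```

With this approach, adding a new acronym such as `"mae"` or `"lmcache"` only requires a new dictionary entry, which matches the shape of the change in this hunk.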

--- a/docs/source/getting_started/installation/cpu/apple.inc.md
+++ b/docs/source/getting_started/installation/cpu/apple.inc.md
@@ -12,7 +12,7 @@ There are no pre-built wheels or images for this device, so you must build vLLM
 - OS: `macOS Sonoma` or later
 - SDK: `XCode 15.4` or later with Command Line Tools
- Compiler: `Apple Clang >= 15.0.0` and `Apple Clang < 17.0.0`
+- Compiler: `Apple Clang >= 15.0.0`
 ## Set up using Python
@@ -51,14 +51,6 @@ If the build has error like the following snippet where standard C++ headers can
      1 error generated.
 ```
-If run with error like the following snippet you need to check clang version and install a compatible version.
-```text
-AttributeError: '_OpNamespace' '_C' object has no attribute 'silu_and_mul'
-```
-More information can be found in <gh-issue:15941>.
 ## Set up using Docker
 ### Pre-built images

--- a/docs/source/getting_started/v1_user_guide.md
+++ b/docs/source/getting_started/v1_user_guide.md
@@ -156,10 +156,3 @@ vLLM V1 is currently optimized for decoder-only transformers. Models requiring
  cross-attention between separate encoder and decoder are not yet supported (e.g., `BartForConditionalGeneration`, `MllamaForConditionalGeneration`).
 For a complete list of supported models, see the [list of supported models](https://docs.vllm.ai/en/latest/models/supported_models.html).
-## Frequently Asked Questions
-**I'm using vLLM V1 and I'm getting CUDA OOM errors. What should I do?**
-The default `max_num_seqs` has been raised from `256` in V0 to `1024` in V1. If you encounter CUDA OOM only when using V1 engine, try setting a lower value of `max_num_seqs` or `gpu_memory_utilization`.
-On the other hand, if you get an error about insufficient memory for the cache blocks, you should increase `gpu_memory_utilization` as this indicates that your GPU has sufficient memory but you're not allocating enough to vLLM for KV cache blocks.
--- a/docs/source/models/extensions/tensorizer.md
+++ b/docs/source/models/extensions/tensorizer.md
@@ -9,7 +9,7 @@ shorter Pod startup times and CPU memory usage. Tensor encryption is also suppor
 For more information on CoreWeave's Tensorizer, please refer to
 [CoreWeave's Tensorizer documentation](https://github.com/coreweave/tensorizer). For more information on serializing a vLLM model, as well a general usage guide to using Tensorizer with vLLM, see
-the [vLLM example script](https://docs.vllm.ai/en/stable/getting_started/examples/offline_inference/tensorize_vllm_model.html).
+the [vLLM example script](https://docs.vllm.ai/en/latest/getting_started/examples/tensorize_vllm_model.html).
 :::{note}
 Note that to use this feature you will need to install `tensorizer` by running `pip install vllm[tensorizer]`.

--- a/docs/source/models/supported_models.md
+++ b/docs/source/models/supported_models.md
@@ -160,6 +160,35 @@ If vLLM successfully returns text (for generative models) or hidden states (for
 Otherwise, please refer to [Adding a New Model](#new-model) for instructions on how to implement your model in vLLM.
 Alternatively, you can [open an issue on GitHub](https://github.com/vllm-project/vllm/issues/new/choose) to request vLLM support.
+#### Using a proxy
+Here are some tips for loading/downloading models from Hugging Face using a proxy:
+- Set the proxy globally for your session (or set it in the profile file):
+```shell
+export http_proxy=http://your.proxy.server:port
+export https_proxy=http://your.proxy.server:port
+```
+- Set the proxy for just the current command:
+```shell
+https_proxy=http://your.proxy.server:port huggingface-cli download <model_name>
+# or use vllm cmd directly
+https_proxy=http://your.proxy.server:port  vllm serve <model_name> --disable-log-requests
+```
+- Set the proxy in the Python interpreter:
+```python
+import os
+os.environ['http_proxy'] = 'http://your.proxy.server:port'
+os.environ['https_proxy'] = 'http://your.proxy.server:port'
+```
 ### ModelScope
 To use models from [ModelScope](https://www.modelscope.cn) instead of Hugging Face Hub, set an environment variable:
@@ -233,9 +262,9 @@ See [this page](#generative-models) for more information on how to use generativ
  * `facebook/bart-base`, `facebook/bart-large-cnn`, etc.
  *
  *
-- * `ChatGLMModel`
+- * `ChatGLMModel`, `ChatGLMForConditionalGeneration`
  * ChatGLM
-  * `THUDM/chatglm2-6b`, `THUDM/chatglm3-6b`, etc.
+  * `THUDM/chatglm2-6b`, `THUDM/chatglm3-6b`, `ShieldLM-6B-chatglm3`, etc.
  * ✅︎
  * ✅︎
 - * `CohereForCausalLM`, `Cohere2ForCausalLM`
@@ -303,6 +332,11 @@ See [this page](#generative-models) for more information on how to use generativ
  * `THUDM/glm-4-9b-chat-hf`, etc.
  * ✅︎
  * ✅︎
+- * `Glm4ForCausalLM`
+  * GLM-4-0414
+  * `THUDM/GLM-4-32B-Chat-0414`, etc.
+  * ✅︎
+  * ✅︎
 - * `GPT2LMHeadModel`
  * GPT-2
  * `gpt2`, `gpt2-xl`, etc.
@@ -478,6 +512,16 @@ See [this page](#generative-models) for more information on how to use generativ
  * `Qwen/Qwen1.5-MoE-A2.7B`, `Qwen/Qwen1.5-MoE-A2.7B-Chat`, etc.
  *
  * ✅︎
+- * `Qwen3ForCausalLM`
+  * Qwen3
+  * `Qwen/Qwen3-8B`, etc.
+  * ✅︎
+  * ✅︎
+- * `Qwen3MoeForCausalLM`
+  * Qwen3MoE
+  * `Qwen/Qwen3-MoE-15B-A2B`, etc.
+  * ✅︎
+  * ✅︎
 - * `StableLmForCausalLM`
  * StableLM
  * `stabilityai/stablelm-3b-4e1t`, `stabilityai/stablelm-base-alpha-7b-v2`, etc.
@@ -715,7 +759,7 @@ On the other hand, modalities separated by `/` are mutually exclusive.
 See [this page](#multimodal-inputs) on how to pass multi-modal inputs to the model.
 :::{important}
-To enable multiple multi-modal items per text prompt, you have to set `limit_mm_per_prompt` (offline inference)
+**To enable multiple multi-modal items per text prompt in vLLM V0**, you have to set `limit_mm_per_prompt` (offline inference)
 or `--limit-mm-per-prompt` (online serving). For example, to enable passing up to 4 images per text prompt:
 Offline inference:
@@ -733,6 +777,8 @@ Online serving:
 vllm serve Qwen/Qwen2-VL-7B-Instruct --limit-mm-per-prompt image=4
 ```
+**This is no longer required if you are using vLLM V1.**
 :::
 :::{note}
@@ -834,9 +880,16 @@ See [this page](#generative-models) for more information on how to use generativ
  *
  * ✅︎
 - * `InternVLChatModel`
-  * InternVideo 2.5, InternVL 2.5, Mono-InternVL, InternVL 2.0
+  * InternVL 3.0, InternVideo 2.5, InternVL 2.5, Mono-InternVL, InternVL 2.0
  * T + I<sup>E+</sup>
-  * `OpenGVLab/InternVideo2_5_Chat_8B`, `OpenGVLab/InternVL2_5-4B`, `OpenGVLab/Mono-InternVL-2B`, `OpenGVLab/InternVL2-4B`, etc.
+  * `OpenGVLab/InternVL3-9B`, `OpenGVLab/InternVideo2_5_Chat_8B`, `OpenGVLab/InternVL2_5-4B`, `OpenGVLab/Mono-InternVL-2B`, `OpenGVLab/InternVL2-4B`, etc.
+  *
+  * ✅︎
+  * ✅︎
+- * `Llama4ForConditionalGeneration`
+  * Llama 4
+  * T + I<sup>+</sup>
+  * `meta-llama/Llama-4-Scout-17B-16E-Instruct`, `meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8`, `meta-llama/Llama-4-Maverick-17B-128E-Instruct`, etc.
  *
  * ✅︎
  * ✅︎
@@ -980,6 +1033,13 @@ See [this page](#generative-models) for more information on how to use generativ
  *
  * ✅︎
  * ✅︎
+- * `SmolVLMForConditionalGeneration`
+  * SmolVLM2
+  * T + I
+  * `SmolVLM2-2.2B-Instruct`
+  *
+  * ✅︎
+  * ✅︎
 - * `UltravoxModel`
  * Ultravox
  * T + A<sup>E+</sup>
@@ -996,9 +1056,6 @@ See [this page](#generative-models) for more information on how to use generativ
 <sup>+</sup> Multiple items can be inputted per text prompt for this modality.
 :::{important}
-To use Gemma3 series models, you have to install Hugging Face Transformers library from source via
-`pip install git+https://github.com/huggingface/transformers`.
 Pan-and-scan image pre-processing is currently supported on V0 (but not V1).
 You can enable it by passing `--mm-processor-kwargs '{"do_pan_and_scan": True}'`.
 :::
@@ -1135,5 +1192,5 @@ We have the following levels of testing for models:
 1. **Strict Consistency**: We compare the output of the model with the output of the model in the HuggingFace Transformers library under greedy decoding. This is the most stringent test. Please refer to [models tests](https://github.com/vllm-project/vllm/blob/main/tests/models) for the models that have passed this test.
 2. **Output Sensibility**: We check if the output of the model is sensible and coherent, by measuring the perplexity of the output and checking for any obvious errors. This is a less stringent test.
-3. **Runtime Functionality**: We check if the model can be loaded and run without errors. This is the least stringent test. Please refer to [functionality tests](gh-dir:tests) and [examples](gh-dir:main/examples) for the models that have passed this test.
+3. **Runtime Functionality**: We check if the model can be loaded and run without errors. This is the least stringent test. Please refer to [functionality tests](gh-dir:tests) and [examples](gh-dir:examples) for the models that have passed this test.
 4. **Community Feedback**: We rely on the community to provide feedback on the models. If a model is broken or not working as expected, we encourage users to raise issues to report it or open pull requests to fix it. The rest of the models fall under this category.
\ No newline at end of file
--- a/docs/source/serving/offline_inference.md
+++ b/docs/source/serving/offline_inference.md
@@ -110,6 +110,30 @@ If you run out of CPU RAM, try the following options:
 - (Multi-modal models only) you can set the size of multi-modal input cache using `VLLM_MM_INPUT_CACHE_GIB` environment variable (default 4 GiB).
 - (CPU backend only) you can set the size of KV cache using `VLLM_CPU_KVCACHE_SPACE` environment variable (default 4 GiB).
+#### Disable unused modalities
+You can disable unused modalities (except for text) by setting their limits to zero.
+For example, if your application only accepts image input, there is no need to allocate any memory for videos.
+```python
+from vllm import LLM
+# Accept images but not videos
+llm = LLM(model="Qwen/Qwen2.5-VL-3B-Instruct",
+          limit_mm_per_prompt={"video": 0})
+```
+You can even run a multi-modal model for text-only inference:
+```python
+from vllm import LLM
+# Don't accept images. Just text.
+llm = LLM(model="google/gemma-3-27b-it",
+          limit_mm_per_prompt={"image": 0})
+```
 ### Performance optimization and tuning
 You can potentially improve the performance of vLLM by finetuning various options.

--- a/docs/source/serving/openai_compatible_server.md
+++ b/docs/source/serving/openai_compatible_server.md
@@ -2,15 +2,15 @@
 # OpenAI-Compatible Server
-vLLM provides an HTTP server that implements OpenAI's [Completions API](https://platform.openai.com/docs/api-reference/completions), [Chat API](https://platform.openai.com/docs/api-reference/chat), and more!
+vLLM provides an HTTP server that implements OpenAI's [Completions API](https://platform.openai.com/docs/api-reference/completions), [Chat API](https://platform.openai.com/docs/api-reference/chat), and more! This functionality lets you serve models and interact with them using an HTTP client.
-You can start the server via the [`vllm serve`](#vllm-serve) command, or through [Docker](#deployment-docker):
+In your terminal, you can [install](../getting_started/installation.md) vLLM, then start the server with the [`vllm serve`](#vllm-serve) command. (You can also use our [Docker](#deployment-docker) image.)
 ```bash
 vllm serve NousResearch/Meta-Llama-3-8B-Instruct --dtype auto --api-key token-abc123
 ```
-To call the server, you can use the [official OpenAI Python client](https://github.com/openai/openai-python), or any other HTTP client.
+To call the server, in your preferred text editor, create a script that uses an HTTP client. Include any messages that you want to send to the model. Then run that script. Below is an example script using the [official OpenAI Python client](https://github.com/openai/openai-python).
 ```python
 from openai import OpenAI

--- a/examples/offline_inference/audio_language.py
+++ b/examples/offline_inference/audio_language.py
@@ -47,7 +47,7 @@ def run_minicpmo(question: str, audio_count: int) -> ModelRequestData:
        model=model_name,
        trust_remote_code=True,
        max_model_len=4096,
-        max_num_seqs=5,
+        max_num_seqs=2,
        limit_mm_per_prompt={"audio": audio_count},
    )
@@ -196,16 +196,14 @@ def main(args):
    req_data = model_example_map[model](question_per_audio_count[audio_count],
                                        audio_count)
+    # Disable other modalities to save memory
+    default_limits = {"image": 0, "video": 0, "audio": 0}
+    req_data.engine_args.limit_mm_per_prompt = default_limits | dict(
+        req_data.engine_args.limit_mm_per_prompt or {})
    engine_args = asdict(req_data.engine_args) | {"seed": args.seed}
    llm = LLM(**engine_args)
-    # To maintain code compatibility in this script, we add LoRA here.
-    # You can also add LoRA using:
-    # llm.generate(prompts, lora_request=lora_request,...)
-    if req_data.lora_requests:
-        for lora_request in req_data.lora_requests:
-            llm.llm_engine.add_lora(lora_request=lora_request)
    # We set temperature to 0.2 so that outputs can be different
    # even when all prompts are identical when running batch inference.
    sampling_params = SamplingParams(temperature=0.2,
@@ -226,8 +224,15 @@ def main(args):
    if args.num_prompts > 1:
        # Batch inference
        inputs = [inputs] * args.num_prompts
+    # Add LoRA request if applicable
-    outputs = llm.generate(inputs, sampling_params=sampling_params)
+    lora_request = (req_data.lora_requests *
+                    args.num_prompts if req_data.lora_requests else None)
+    outputs = llm.generate(
+        inputs,
+        sampling_params=sampling_params,
+        lora_request=lora_request,
+    )
    for o in outputs:
        generated_text = o.outputs[0].text
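
The new `lora_request` expression relies on Python list repetition so that the number of LoRA requests lines up with the number of prompts in the batch. The sketch below isolates that expression with plain strings standing in for the real request objects. `expand_lora_requests` is a hypothetical helper name, and the placeholder values are assumptions for illustration only.

```python
from typing import Optional


def expand_lora_requests(lora_requests: Optional[list],
                         num_prompts: int) -> Optional[list]:
    """Repeat the per-example LoRA requests once per prompt, or return None."""
    # An empty list or None both mean "no LoRA", so None is passed through.
    return lora_requests * num_prompts if lora_requests else None


# A single adapter repeated for a batch of three identical prompts.
print(expand_lora_requests(["adapter-a"], 3))
# No adapters configured for this model.
print(expand_lora_requests(None, 3))
```

Note that the lengths only match the batch size when the example defines exactly one LoRA request, which appears to be the assumption in the script.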

--- a/examples/offline_inference/eagle.py
+++ b/examples/offline_inference/eagle.py
@@ -7,89 +7,103 @@ from transformers import AutoTokenizer
 from vllm import LLM, SamplingParams
-parser = argparse.ArgumentParser()
-parser.add_argument(
-    "--dataset",
-    type=str,
-    default="./examples/data/gsm8k.jsonl",
-    help="downloaded from the eagle repo " \
-    "https://github.com/SafeAILab/EAGLE/blob/main/eagle/data/"
-)
-parser.add_argument("--max_num_seqs", type=int, default=8)
-parser.add_argument("--num_prompts", type=int, default=80)
-parser.add_argument("--num_spec_tokens", type=int, default=2)
-parser.add_argument("--tp", type=int, default=1)
-parser.add_argument("--draft_tp", type=int, default=1)
-parser.add_argument("--enforce_eager", action='store_true')
-parser.add_argument("--enable_chunked_prefill", action='store_true')
-parser.add_argument("--max_num_batched_tokens", type=int, default=2048)
-parser.add_argument("--temp", type=float, default=0)
-args = parser.parse_args()
-print(args)
-model_dir = "meta-llama/Meta-Llama-3-8B-Instruct"
-eagle_dir = "abhigoyal/EAGLE-LLaMA3-Instruct-8B-vllm"
-max_model_len = 2048
-tokenizer = AutoTokenizer.from_pretrained(model_dir)
-if os.path.exists(args.dataset):
-    prompts = []
-    num_prompts = args.num_prompts
-    with open(args.dataset) as f:
-        for line in f:
-            data = json.loads(line)
-            prompts.append(data["turns"][0])
-else:
-    prompts = ["The future of AI is", "The president of the United States is"]
-prompts = prompts[:args.num_prompts]
-num_prompts = len(prompts)
-prompt_ids = [
-    tokenizer.apply_chat_template([{
-        "role": "user",
-        "content": prompt
-    }],
-                                  add_generation_prompt=True)
-    for prompt in prompts
-]
-llm = LLM(
-    model=model_dir,
-    trust_remote_code=True,
-    tensor_parallel_size=args.tp,
-    enable_chunked_prefill=args.enable_chunked_prefill,
-    max_num_batched_tokens=args.max_num_batched_tokens,
-    enforce_eager=args.enforce_eager,
-    max_model_len=max_model_len,
-    max_num_seqs=args.max_num_seqs,
-    gpu_memory_utilization=0.8,
-    speculative_config={
-        "model": eagle_dir,
-        "num_speculative_tokens": args.num_spec_tokens,
-        "draft_tensor_parallel_size": args.draft_tp,
-        "max_model_len": max_model_len,
-    },
-    disable_log_stats=False,
-)
-sampling_params = SamplingParams(temperature=args.temp, max_tokens=256)
-outputs = llm.generate(prompt_token_ids=prompt_ids,
-                       sampling_params=sampling_params)
-# calculate the average number of accepted tokens per forward pass, +1 is
-# to account for the token from the target model that's always going to be
-# accepted
-acceptance_counts = [0] * (args.num_spec_tokens + 1)
-for output in outputs:
-    for step, count in enumerate(output.metrics.spec_token_acceptance_counts):
-        acceptance_counts[step] += count
-print(f"mean acceptance length: \
-    {sum(acceptance_counts) / acceptance_counts[0]:.2f}")
+def load_prompts(dataset_path, num_prompts):
+    if os.path.exists(dataset_path):
+        prompts = []
+        try:
+            with open(dataset_path) as f:
+                for line in f:
+                    data = json.loads(line)
+                    prompts.append(data["turns"][0])
+        except Exception as e:
+            print(f"Error reading dataset: {e}")
+            return []
+    else:
+        prompts = [
+            "The future of AI is", "The president of the United States is"
+        ]
+    return prompts[:num_prompts]
+def main():
+    parser = argparse.ArgumentParser()
+    parser.add_argument(
+        "--dataset",
+        type=str,
+        default="./examples/data/gsm8k.jsonl",
+        help="downloaded from the eagle repo " \
+        "https://github.com/SafeAILab/EAGLE/blob/main/eagle/data/"
+    )
+    parser.add_argument("--max_num_seqs", type=int, default=8)
+    parser.add_argument("--num_prompts", type=int, default=80)
+    parser.add_argument("--num_spec_tokens", type=int, default=2)
+    parser.add_argument("--tp", type=int, default=1)
+    parser.add_argument("--draft_tp", type=int, default=1)
+    parser.add_argument("--enforce_eager", action='store_true')
+    parser.add_argument("--enable_chunked_prefill", action='store_true')
+    parser.add_argument("--max_num_batched_tokens", type=int, default=2048)
+    parser.add_argument("--temp", type=float, default=0)
+    args = parser.parse_args()
+    model_dir = "meta-llama/Meta-Llama-3-8B-Instruct"
+    eagle_dir = "abhigoyal/EAGLE-LLaMA3-Instruct-8B-vllm"
+    max_model_len = 2048
+    tokenizer = AutoTokenizer.from_pretrained(model_dir)
+    prompts = load_prompts(args.dataset, args.num_prompts)
+    prompt_ids = [
+        tokenizer.apply_chat_template([{
+            "role": "user",
+            "content": prompt
+        }],
+                                      add_generation_prompt=True)
+        for prompt in prompts
+    ]
+    llm = LLM(
+        model=model_dir,
+        trust_remote_code=True,
+        tensor_parallel_size=args.tp,
+        enable_chunked_prefill=args.enable_chunked_prefill,
+        max_num_batched_tokens=args.max_num_batched_tokens,
+        enforce_eager=args.enforce_eager,
+        max_model_len=max_model_len,
+        max_num_seqs=args.max_num_seqs,
+        gpu_memory_utilization=0.8,
+        speculative_config={
+            "method": "eagle",
+            "model": eagle_dir,
+            "num_speculative_tokens": args.num_spec_tokens,
+            "draft_tensor_parallel_size": args.draft_tp,
+            "max_model_len": max_model_len,
+        },
+        disable_log_stats=False,
+    )
+    sampling_params = SamplingParams(temperature=args.temp, max_tokens=256)
+    outputs = llm.generate(prompt_token_ids=prompt_ids,
+                           sampling_params=sampling_params)
+    # calculate the average number of accepted tokens per forward pass, +1 is
+    # to account for the token from the target model that's always going to be
+    # accepted
+    acceptance_counts = [0] * (args.num_spec_tokens + 1)
+    for output in outputs:
+        for step, count in enumerate(
+                output.metrics.spec_token_acceptance_counts):
+            acceptance_counts[step] += count
+    print("-" * 50)
+    print(f"mean acceptance length: \
+        {sum(acceptance_counts) / acceptance_counts[0]:.2f}")
+    print("-" * 50)
+if __name__ == "__main__":
+    main()
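
The mean acceptance length printed at the end of `main()` can be checked by hand on a small example. The sketch below reproduces the same arithmetic outside vLLM, assuming (as the code comment suggests) that index 0 of each per-request list counts forward passes, since the target model's own token is always accepted, and that index `k` counts passes in which the `k`-th speculative token was accepted. `mean_acceptance_length` and the sample counts are hypothetical.

```python
def mean_acceptance_length(per_request_counts: list[list[int]],
                           num_spec_tokens: int) -> float:
    """Aggregate per-step acceptance counts and average per forward pass."""
    acceptance_counts = [0] * (num_spec_tokens + 1)
    for counts in per_request_counts:
        for step, count in enumerate(counts):
            acceptance_counts[step] += count
    # Total accepted tokens divided by the number of forward passes.
    return sum(acceptance_counts) / acceptance_counts[0]


# Two made-up requests with num_spec_tokens=2.
sample = [[10, 6, 3], [10, 4, 1]]
# Totals are [20, 10, 4], so the result is 34 / 20.
print(f"{mean_acceptance_length(sample, 2):.2f}")
```

As in the script, the division fails if no forward pass was recorded (index 0 is zero), so a guard may be worth adding when the prompt list can be empty.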
--- a/examples/offline_inference/embed_jina_embeddings_v3.py
+++ b/examples/offline_inference/embed_jina_embeddings_v3.py
+# SPDX-License-Identifier: Apache-2.0
+from argparse import Namespace
+from vllm import LLM, EngineArgs
+from vllm.utils import FlexibleArgumentParser
+def main(args: Namespace):
+    # Sample prompts.
+    prompts = [
+        "Follow the white rabbit.",  # English
+        "Sigue al conejo blanco.",  # Spanish
+        "Suis le lapin blanc.",  # French
+        "跟着白兔走。",  # Chinese
+        "اتبع الأرنب الأبيض.",  # Arabic
+        "Folge dem weißen Kaninchen.",  # German
+    ]
+    # Create an LLM.
+    # You should pass task="embed" for embedding models
+    model = LLM(**vars(args))
+    # Generate embedding. The output is a list of EmbeddingRequestOutputs.
+    # Only text matching task is supported for now. See #16120
+    outputs = model.embed(prompts)
+    # Print the outputs.
+    print("\nGenerated Outputs:")
+    print("Only text matching task is supported for now. See #16120")
+    print("-" * 60)
+    for prompt, output in zip(prompts, outputs):
+        embeds = output.outputs.embedding
+        embeds_trimmed = ((str(embeds[:16])[:-1] +
+                           ", ...]") if len(embeds) > 16 else embeds)
+        print(f"Prompt: {prompt!r} \n"
+              f"Embeddings for text matching: {embeds_trimmed} "
+              f"(size={len(embeds)})")
+        print("-" * 60)
+if __name__ == "__main__":
+    parser = FlexibleArgumentParser()
+    parser = EngineArgs.add_cli_args(parser)
+    # Set example specific arguments
+    parser.set_defaults(model="jinaai/jina-embeddings-v3",
+                        task="embed",
+                        trust_remote_code=True)
+    args = parser.parse_args()
+    main(args)
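
The `embeds_trimmed` expression in this example shortens long vectors for display by slicing the list, dropping the closing bracket of its string form, and appending an ellipsis. A standalone sketch of that idea follows. `trim_embedding` is a hypothetical helper, and unlike the original expression it always returns a string.

```python
def trim_embedding(embeds: list[float], max_items: int = 16) -> str:
    """Return a printable form of the vector, truncated to max_items values."""
    if len(embeds) > max_items:
        # str(list) ends with "]", so drop it before appending the ellipsis.
        return str(embeds[:max_items])[:-1] + ", ...]"
    return str(embeds)


print(trim_embedding([0.0, 0.5, 1.0], max_items=2))
print(trim_embedding([1.0, 2.0]))
```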
--- a/examples/offline_inference/embed_matryoshka_fy.py
+++ b/examples/offline_inference/embed_matryoshka_fy.py
+# SPDX-License-Identifier: Apache-2.0
+from argparse import Namespace
+from vllm import LLM, EngineArgs, PoolingParams
+from vllm.utils import FlexibleArgumentParser
+def main(args: Namespace):
+    # Sample prompts.
+    prompts = [
+        "Follow the white rabbit.",  # English
+        "Sigue al conejo blanco.",  # Spanish
+        "Suis le lapin blanc.",  # French
+        "跟着白兔走。",  # Chinese
+        "اتبع الأرنب الأبيض.",  # Arabic
+        "Folge dem weißen Kaninchen.",  # German
+    ]
+    # Create an LLM.
+    # You should pass task="embed" for embedding models
+    model = LLM(**vars(args))
+    # Generate embedding. The output is a list of EmbeddingRequestOutputs.
+    outputs = model.embed(prompts, pooling_params=PoolingParams(dimensions=32))
+    # Print the outputs.
+    print("\nGenerated Outputs:")
+    print("-" * 60)
+    for prompt, output in zip(prompts, outputs):
+        embeds = output.outputs.embedding
+        embeds_trimmed = ((str(embeds[:16])[:-1] +
+                           ", ...]") if len(embeds) > 16 else embeds)
+        print(f"Prompt: {prompt!r} \n"
+              f"Embeddings: {embeds_trimmed} "
+              f"(size={len(embeds)})")
+        print("-" * 60)
+if __name__ == "__main__":
+    parser = FlexibleArgumentParser()
+    parser = EngineArgs.add_cli_args(parser)
+    # Set example specific arguments
+    parser.set_defaults(model="jinaai/jina-embeddings-v3",
+                        task="embed",
+                        trust_remote_code=True)
+    args = parser.parse_args()
+    main(args)
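
This example requests 32-dimensional output through `PoolingParams(dimensions=32)`. The diff does not show how the reduction is performed internally. A common way to shrink a Matryoshka-style embedding is to keep the leading dimensions and re-normalize, and the sketch below illustrates only that general idea, as an assumption rather than a description of vLLM's implementation. `truncate_embedding` is a hypothetical name.

```python
import math


def truncate_embedding(embedding: list[float], dimensions: int) -> list[float]:
    """Keep the first `dimensions` values and rescale to unit L2 norm."""
    head = embedding[:dimensions]
    norm = math.sqrt(sum(x * x for x in head))
    # Leave an all-zero vector untouched to avoid dividing by zero.
    return [x / norm for x in head] if norm > 0 else head


# A made-up 3-dimensional vector reduced to 2 dimensions.
print(truncate_embedding([3.0, 4.0, 12.0], 2))
```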
--- a/examples/offline_inference/encoder_decoder.py
+++ b/examples/offline_inference/encoder_decoder.py
@@ -75,8 +75,6 @@ prompts = [
    enc_dec_prompt1, enc_dec_prompt2, enc_dec_prompt3
 ] + zipped_prompt_list
-print(prompts)
 # Create a sampling params object.
 sampling_params = SamplingParams(
    temperature=0,
@@ -91,10 +89,13 @@ sampling_params = SamplingParams(
 outputs = llm.generate(prompts, sampling_params)
 # Print the outputs.
-for output in outputs:
+print("-" * 50)
+for i, output in enumerate(outputs):
    prompt = output.prompt
    encoder_prompt = output.encoder_prompt
    generated_text = output.outputs[0].text
-    print(f"Encoder prompt: {encoder_prompt!r}, "
-          f"Decoder prompt: {prompt!r}, "
+    print(f"Output {i+1}:")
+    print(f"Encoder prompt: {encoder_prompt!r}\n"
+          f"Decoder prompt: {prompt!r}\n"
          f"Generated text: {generated_text!r}")
+    print("-" * 50)
--- a/examples/offline_inference/encoder_decoder_multimodal.py
+++ b/examples/offline_inference/encoder_decoder_multimodal.py
@@ -56,7 +56,7 @@ def run_florence2():
 def run_mllama():
    engine_args = EngineArgs(
        model="meta-llama/Llama-3.2-11B-Vision-Instruct",
-        max_model_len=4096,
+        max_model_len=8192,
        max_num_seqs=2,
        limit_mm_per_prompt={"image": 1},
        dtype="half",
@@ -133,6 +133,11 @@ def main(args):
    req_data = model_example_map[model]()
+    # Disable other modalities to save memory
+    default_limits = {"image": 0, "video": 0, "audio": 0}
+    req_data.engine_args.limit_mm_per_prompt = default_limits | dict(
+        req_data.engine_args.limit_mm_per_prompt or {})
    engine_args = asdict(req_data.engine_args) | {"seed": args.seed}
    llm = LLM(**engine_args)
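
The "disable other modalities" lines rely on the dictionary union operator (Python 3.9+), where keys from the right-hand operand win. A minimal sketch of the same merge, using a hypothetical `merge_limits` helper, shows how a model's own limits override the all-zero defaults:

```python
from typing import Optional

# Every modality is disabled unless the example explicitly enables it.
DEFAULT_LIMITS = {"image": 0, "video": 0, "audio": 0}


def merge_limits(overrides: Optional[dict]) -> dict:
    """Merge per-model limits over the defaults; right-hand keys take priority."""
    return DEFAULT_LIMITS | dict(overrides or {})


print(merge_limits({"image": 1}))
print(merge_limits(None))
```

Because `|` builds a new dictionary, the defaults are not mutated between calls.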

--- a/examples/offline_inference/llm_engine_example.py
+++ b/examples/offline_inference/llm_engine_example.py
 # SPDX-License-Identifier: Apache-2.0
+"""
+This file demonstrates using the `LLMEngine`
+for processing prompts with various sampling parameters.
+"""
 import argparse
 from vllm import EngineArgs, LLMEngine, RequestOutput, SamplingParams
@@ -26,6 +29,7 @@ def process_requests(engine: LLMEngine,
    """Continuously process a list of prompts and handle the outputs."""
    request_id = 0
+    print('-' * 50)
    while test_prompts or engine.has_unfinished_requests():
        if test_prompts:
            prompt, sampling_params = test_prompts.pop(0)
@@ -37,6 +41,7 @@ def process_requests(engine: LLMEngine,
        for request_output in request_outputs:
            if request_output.finished:
                print(request_output)
+                print('-' * 50)
 def initialize_engine(args: argparse.Namespace) -> LLMEngine:

--- a/examples/offline_inference/mistral-small.py
+++ b/examples/offline_inference/mistral-small.py
@@ -13,9 +13,14 @@ from vllm.sampling_params import SamplingParams
 # - Server:
 #
 # ```bash
+# # Mistral format
 # vllm serve mistralai/Mistral-Small-3.1-24B-Instruct-2503 \
 #   --tokenizer-mode mistral --config-format mistral --load-format mistral \
 #   --limit-mm-per-prompt 'image=4' --max-model-len 16384
+#
+# # HF format
+# vllm serve mistralai/Mistral-Small-3.1-24B-Instruct-2503 \
+#   --limit-mm-per-prompt 'image=4' --max-model-len 16384
 # ```
 #
 # - Client:
@@ -44,19 +49,22 @@ from vllm.sampling_params import SamplingParams
 #     python demo.py simple
 #     python demo.py advanced
+# Lower max_model_len and/or max_num_seqs on low-VRAM GPUs.
+# These scripts have been tested on 2x L40 GPUs
 def run_simple_demo(args: argparse.Namespace):
    model_name = "mistralai/Mistral-Small-3.1-24B-Instruct-2503"
    sampling_params = SamplingParams(max_tokens=8192)
-    # Lower max_model_len and/or max_num_seqs on low-VRAM GPUs.
    llm = LLM(
        model=model_name,
-        tokenizer_mode="mistral",
-        config_format="mistral",
-        load_format="mistral",
+        tokenizer_mode="mistral" if args.format == "mistral" else "auto",
+        config_format="mistral" if args.format == "mistral" else "auto",
+        load_format="mistral" if args.format == "mistral" else "auto",
        max_model_len=4096,
        max_num_seqs=2,
+        tensor_parallel_size=2,
        disable_mm_preprocessor_cache=args.disable_mm_preprocessor_cache,
    )
@@ -82,23 +90,25 @@ def run_simple_demo(args: argparse.Namespace):
        },
    ]
    outputs = llm.chat(messages, sampling_params=sampling_params)
+    print("-" * 50)
    print(outputs[0].outputs[0].text)
+    print("-" * 50)
 def run_advanced_demo(args: argparse.Namespace):
    model_name = "mistralai/Mistral-Small-3.1-24B-Instruct-2503"
-    max_img_per_msg = 5
+    max_img_per_msg = 3
    max_tokens_per_img = 4096
    sampling_params = SamplingParams(max_tokens=8192, temperature=0.7)
    llm = LLM(
        model=model_name,
-        tokenizer_mode="mistral",
+        tokenizer_mode="mistral" if args.format == "mistral" else "auto",
-        config_format="mistral",
+        config_format="mistral" if args.format == "mistral" else "auto",
-        load_format="mistral",
+        load_format="mistral" if args.format == "mistral" else "auto",
        limit_mm_per_prompt={"image": max_img_per_msg},
        max_model_len=max_img_per_msg * max_tokens_per_img,
+        tensor_parallel_size=2,
        disable_mm_preprocessor_cache=args.disable_mm_preprocessor_cache,
    )
@@ -153,7 +163,9 @@ def run_advanced_demo(args: argparse.Namespace):
    ]
    outputs = llm.chat(messages=messages, sampling_params=sampling_params)
+    print("-" * 50)
    print(outputs[0].outputs[0].text)
+    print("-" * 50)
 def main():
@@ -166,6 +178,11 @@ def main():
        help="Specify the demo mode: 'simple' or 'advanced'",
    )
+    parser.add_argument('--format',
+                        choices=["mistral", "hf"],
+                        default="mistral",
+                        help='Specify the format of the model to load.')
    parser.add_argument(
        '--disable-mm-preprocessor-cache',
        action='store_true',

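The `--format` flag in the diff above switches three `LLM` arguments together, and the advanced demo derives `max_model_len` from its image budget. A minimal sketch of that logic follows, assuming it is factored into a helper. `format_kwargs` and `build_parser` are hypothetical names that appear neither in the example script nor in vLLM, and the snippet only builds the keyword values without importing vLLM.

```python
import argparse


def format_kwargs(fmt: str) -> dict[str, str]:
    """Map the --format choice to the three loader-related LLM arguments.

    Hypothetical helper: the example script inlines these conditionals.
    """
    value = "mistral" if fmt == "mistral" else "auto"
    return {
        "tokenizer_mode": value,
        "config_format": value,
        "load_format": value,
    }


def build_parser() -> argparse.ArgumentParser:
    # Mirrors the --format argument added in the diff.
    parser = argparse.ArgumentParser()
    parser.add_argument("--format",
                        choices=["mistral", "hf"],
                        default="mistral",
                        help="Specify the format of the model to load.")
    return parser


args = build_parser().parse_args(["--format", "hf"])
llm_kwargs = format_kwargs(args.format)

# The advanced demo sizes the context window from the image budget.
max_img_per_msg = 3
max_tokens_per_img = 4096
max_model_len = max_img_per_msg * max_tokens_per_img

print(llm_kwargs)
print(max_model_len)
```

With `--format hf` all three values fall back to `"auto"`. Lowering `max_img_per_msg` from 5 to 3 shrinks the requested context from 20480 to 12288 tokens.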
--- a/examples/offline_inference/mlpspeculator.py
+++ b/examples/offline_inference/mlpspeculator.py
 # SPDX-License-Identifier: Apache-2.0
+"""
+This file demonstrates the usage of text generation with an LLM model,
+comparing the performance with and without speculative decoding.
+Note that `v1` is not supported yet, so run with:
+VLLM_USE_V1=0 python examples/offline_inference/mlpspeculator.py
+"""
 import gc
 import time
@@ -7,7 +14,7 @@ from vllm import LLM, SamplingParams
 def time_generation(llm: LLM, prompts: list[str],
-                    sampling_params: SamplingParams):
+                    sampling_params: SamplingParams, title: str):
    # Generate texts from the prompts. The output is a list of RequestOutput
    # objects that contain the prompt, generated text, and other information.
    # Warmup first
@@ -16,11 +23,15 @@ def time_generation(llm: LLM, prompts: list[str],
    start = time.time()
    outputs = llm.generate(prompts, sampling_params)
    end = time.time()
-    print((end - start) / sum([len(o.outputs[0].token_ids) for o in outputs]))
+    print("-" * 50)
+    print(title)
+    print("time: ",
+          (end - start) / sum(len(o.outputs[0].token_ids) for o in outputs))
    # Print the outputs.
    for output in outputs:
        generated_text = output.outputs[0].text
        print(f"text: {generated_text!r}")
+        print("-" * 50)
 if __name__ == "__main__":
@@ -41,8 +52,7 @@ if __name__ == "__main__":
    # Create an LLM without spec decoding
    llm = LLM(model="meta-llama/Llama-2-13b-chat-hf")
-    print("Without speculation")
+    time_generation(llm, prompts, sampling_params, "Without speculation")
-    time_generation(llm, prompts, sampling_params)
    del llm
    gc.collect()
@@ -55,5 +65,4 @@ if __name__ == "__main__":
        },
    )
-    print("With speculation")
+    time_generation(llm, prompts, sampling_params, "With speculation")
-    time_generation(llm, prompts, sampling_params)
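The reworked `time_generation` reports wall-clock seconds per generated token. The sketch below isolates that calculation so it can be checked without loading a model. `FakeCompletion`, `FakeRequestOutput`, and `seconds_per_token` are hypothetical stand-ins that copy only the `outputs[0].token_ids` shape used in the diff. The zero-token guard is an addition of this sketch, not something the example does.

```python
from dataclasses import dataclass, field


@dataclass
class FakeCompletion:
    # Stand-in for a completion object exposing token_ids.
    token_ids: list[int] = field(default_factory=list)


@dataclass
class FakeRequestOutput:
    # Stand-in for a request output holding a list of completions.
    outputs: list[FakeCompletion] = field(default_factory=list)


def seconds_per_token(elapsed: float,
                      outputs: list[FakeRequestOutput]) -> float:
    """Divide elapsed time by the tokens in each request's first completion."""
    total_tokens = sum(len(o.outputs[0].token_ids) for o in outputs)
    if total_tokens == 0:
        raise ValueError("no tokens were generated")
    return elapsed / total_tokens


outputs = [
    FakeRequestOutput([FakeCompletion([1, 2, 3])]),
    FakeRequestOutput([FakeCompletion([4, 5, 6, 7, 8])]),
]
print("time: ", seconds_per_token(2.0, outputs))
```

Passing a generator expression to `sum`, as the diff now does, avoids building the intermediate list that the old `sum([...])` form created.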
--- a/examples/offline_inference/multilora_inference.py
+++ b/examples/offline_inference/multilora_inference.py
@@ -61,6 +61,7 @@ def process_requests(engine: LLMEngine,
    """Continuously process a list of prompts and handle the outputs."""
    request_id = 0
+    print("-" * 50)
    while test_prompts or engine.has_unfinished_requests():
        if test_prompts:
            prompt, sampling_params, lora_request = test_prompts.pop(0)
@@ -75,6 +76,7 @@ def process_requests(engine: LLMEngine,
        for request_output in request_outputs:
            if request_output.finished:
                print(request_output)
+                print("-" * 50)
 def initialize_engine() -> LLMEngine:

--- a/examples/offline_inference/neuron.py
+++ b/examples/offline_inference/neuron.py
@@ -12,27 +12,36 @@ prompts = [
 # Create a sampling params object.
 sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
-# Create an LLM.
-llm = LLM(
-    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
-    max_num_seqs=8,
-    # The max_model_len and block_size arguments are required to be same as
-    # max sequence length when targeting neuron device.
-    # Currently, this is a known limitation in continuous batching support
-    # in transformers-neuronx.
-    # TODO(liangfu): Support paged-attention in transformers-neuronx.
-    max_model_len=1024,
-    block_size=1024,
-    # The device can be automatically detected when AWS Neuron SDK is installed.
-    # The device argument can be either unspecified for automated detection,
-    # or explicitly assigned.
-    device="neuron",
-    tensor_parallel_size=2)
-# Generate texts from the prompts. The output is a list of RequestOutput objects
-# that contain the prompt, generated text, and other information.
-outputs = llm.generate(prompts, sampling_params)
-# Print the outputs.
-for output in outputs:
-    prompt = output.prompt
-    generated_text = output.outputs[0].text
-    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
+def main():
+    # Create an LLM.
+    llm = LLM(
+        model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
+        max_num_seqs=8,
+        # The max_model_len and block_size arguments are required to be same as
+        # max sequence length when targeting neuron device.
+        # Currently, this is a known limitation in continuous batching support
+        # in transformers-neuronx.
+        # TODO(liangfu): Support paged-attention in transformers-neuronx.
+        max_model_len=1024,
+        block_size=1024,
+        # ruff: noqa: E501
+        # The device can be automatically detected when AWS Neuron SDK is installed.
+        # The device argument can be either unspecified for automated detection,
+        # or explicitly assigned.
+        device="neuron",
+        tensor_parallel_size=2)
+    # Generate texts from the prompts. The output is a list of RequestOutput objects
+    # that contain the prompt, generated text, and other information.
+    outputs = llm.generate(prompts, sampling_params)
+    # Print the outputs.
+    print("-" * 50)
+    for output in outputs:
+        prompt = output.prompt
+        generated_text = output.outputs[0].text
+        print(f"Prompt: {prompt!r}\nGenerated text: {generated_text!r}")
+        print("-" * 50)
+if __name__ == "__main__":
+    main()
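The comments in the Neuron example say that `max_model_len` and `block_size` must both match the maximum sequence length. A minimal sketch of a pre-flight check follows, assuming the requirement amounts to the two values being equal, as they are in the example (both 1024). `check_neuron_lengths` is a hypothetical helper, not a vLLM or transformers-neuronx API, and the snippet does not import vLLM.

```python
def check_neuron_lengths(max_model_len: int, block_size: int) -> None:
    """Raise early if the two length settings disagree.

    Assumption: on the Neuron backend both values must equal the maximum
    sequence length, so they must equal each other.
    """
    if max_model_len != block_size:
        raise ValueError(
            f"max_model_len ({max_model_len}) must equal "
            f"block_size ({block_size}) when targeting neuron")


# Keyword values copied from the example; validated before building the LLM.
neuron_kwargs = {
    "max_num_seqs": 8,
    "max_model_len": 1024,
    "block_size": 1024,
    "device": "neuron",
    "tensor_parallel_size": 2,
}
check_neuron_lengths(neuron_kwargs["max_model_len"],
                     neuron_kwargs["block_size"])
print("length settings are consistent")
```

Running the check before building the `LLM` reports a mismatch as a plain `ValueError` with both values in the message.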