Merge tag 'v0.8.3' into v0.8.3-dev

fcfc474d · zhuwenwen · bb94d2e5 · 296c6572 · fcfc474d · fcfc474d
Commit fcfc474d authored Apr 09, 2025 by zhuwenwen
20 changed files
--- a/docs/source/index.md
+++ b/docs/source/index.md
@@ -43,7 +43,7 @@ vLLM is flexible and easy to use with:
 - Tensor parallelism and pipeline parallelism support for distributed inference
 - Streaming outputs
 - OpenAI-compatible API server
- Support NVIDIA GPUs, AMD CPUs and GPUs, Intel CPUs, Gaudi® accelerators and GPUs, PowerPC CPUs, TPU, and AWS Trainium and Inferentia Accelerators.
+- Support NVIDIA GPUs, AMD CPUs and GPUs, Intel CPUs, Gaudi® accelerators and GPUs, IBM Power CPUs, TPU, and AWS Trainium and Inferentia Accelerators.
 - Prefix caching support
 - Multi-lora support
@@ -77,9 +77,9 @@ getting_started/v1_user_guide
 :caption: Models
 :maxdepth: 1
+models/supported_models
 models/generative_models
 models/pooling_models
-models/supported_models
 models/extensions/index
 :::

--- a/docs/source/models/generative_models.md
+++ b/docs/source/models/generative_models.md
@@ -23,6 +23,8 @@ It is similar to [its counterpart in HF Transformers](https://huggingface.co/doc
 except that tokenization and detokenization are also performed automatically.
 ```python
+from vllm import LLM
 llm = LLM(model="facebook/opt-125m")
 outputs = llm.generate("Hello, my name is")
@@ -36,6 +38,8 @@ You can optionally control the language generation by passing {class}`~vllm.Samp
 For example, you can use greedy sampling by setting `temperature=0`:
 ```python
+from vllm import LLM, SamplingParams
 llm = LLM(model="facebook/opt-125m")
 params = SamplingParams(temperature=0)
 outputs = llm.generate("Hello, my name is", params)
@@ -83,6 +87,8 @@ Base models may perform poorly as they are not trained to respond to the chat co
 :::
 ```python
+from vllm import LLM
 llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")
 conversation = [
    {

--- a/docs/source/models/pooling_models.md
+++ b/docs/source/models/pooling_models.md
@@ -68,6 +68,8 @@ The {class}`~vllm.LLM.encode` method is available to all pooling models in vLLM.
 It returns the extracted hidden states directly, which is useful for reward models.
 ```python
+from vllm import LLM
 llm = LLM(model="Qwen/Qwen2.5-Math-RM-72B", task="reward")
 (output,) = llm.encode("Hello, my name is")
@@ -81,6 +83,8 @@ The {class}`~vllm.LLM.embed` method outputs an embedding vector for each prompt.
 It is primarily designed for embedding models.
 ```python
+from vllm import LLM
 llm = LLM(model="intfloat/e5-mistral-7b-instruct", task="embed")
 (output,) = llm.embed("Hello, my name is")
@@ -96,6 +100,8 @@ The {class}`~vllm.LLM.classify` method outputs a probability vector for each pro
 It is primarily designed for classification models.
 ```python
+from vllm import LLM
 llm = LLM(model="jason9693/Qwen2.5-1.5B-apeach", task="classify")
 (output,) = llm.classify("Hello, my name is")
@@ -116,6 +122,8 @@ To handle RAG at a higher level, you should use integration frameworks such as [
 :::
 ```python
+from vllm import LLM
 llm = LLM(model="BAAI/bge-reranker-v2-m3", task="score")
 (output,) = llm.score("What is the capital of France?",
                      "The capital of Brazil is Brasilia.")

--- a/docs/source/models/supported_models.md
+++ b/docs/source/models/supported_models.md
 (supported-models)=
-# List of Supported Models
+# Supported Models
-vLLM supports generative and pooling models across various tasks.
+vLLM supports [generative](generative-models) and [pooling](pooling-models) models across various tasks.
 If a model supports more than one task, you can set the task via the `--task` argument.
 For each task, we list the model architectures that have been implemented in vLLM.
 Alongside each architecture, we include some popular models that use it.
-## Loading a Model
+## Model Implementation
-### HuggingFace Hub
+### vLLM
-By default, vLLM loads models from [HuggingFace (HF) Hub](https://huggingface.co/models).
+If vLLM natively supports a model, its implementation can be found in <gh-file:vllm/model_executor/models>.
-To determine whether a given model is natively supported, you can check the `config.json` file inside the HF repository.
+These models are what we list in <project:#supported-text-models> and <project:#supported-mm-models>.
-If the `"architectures"` field contains a model architecture listed below, then it should be natively supported.
-Models do not _need_ to be natively supported to be used in vLLM.
+(transformers-backend)=
-The <project:#transformers-fallback> enables you to run models directly using their Transformers implementation (or even remote code on the Hugging Face Model Hub!).
-:::{tip}
+### Transformers
-The easiest way to check if your model is really supported at runtime is to run the program below:
-```python
+vLLM also supports model implementations that are available in Transformers. This does not currently work for all models, but most decoder language models are supported, and vision language model support is planned!
-from vllm import LLM
-# For generative models (task=generate) only
+To check if the modeling backend is Transformers, you can simply do this:
-llm = LLM(model=..., task="generate")  # Name or path of your model
-output = llm.generate("Hello, my name is")
-print(output)
-# For pooling models (task={embed,classify,reward,score}) only
+```python
-llm = LLM(model=..., task="embed")  # Name or path of your model
-output = llm.encode("Hello, my name is")
-print(output)
-```
-If vLLM successfully returns text (for generative models) or hidden states (for pooling models), it indicates that your model is supported.
-:::
-Otherwise, please refer to [Adding a New Model](#new-model) for instructions on how to implement your model in vLLM.
-Alternatively, you can [open an issue on GitHub](https://github.com/vllm-project/vllm/issues/new/choose) to request vLLM support.
-(transformers-fallback)=
-### Transformers fallback
-vLLM can fallback to model implementations that are available in Transformers. This does not work for all models for now, but most decoder language models are supported, and vision language model support is planned!
-To check if the backend is Transformers, you can simply do this:
-```python 
 from vllm import LLM
 llm = LLM(model=..., task="generate")  # Name or path of your model
 llm.apply_model(lambda model: print(type(model)))
 ```
-If it is `TransformersModel` then it means it's based on Transformers!
+If it is `TransformersForCausalLM` then it means it's based on Transformers!
 :::{tip}
-You can force the use of `TransformersModel` by setting `model_impl="transformers"` for <project:#offline-inference> or `--model-impl transformers` for the <project:#openai-compatible-server>.
+You can force the use of `TransformersForCausalLM` by setting `model_impl="transformers"` for <project:#offline-inference> or `--model-impl transformers` for the <project:#openai-compatible-server>.
 :::
 :::{note}
@@ -69,27 +42,26 @@ vLLM may not fully optimise the Transformers implementation so you may see degra
 #### Supported features
-The Transformers fallback explicitly supports the following features:
+The Transformers modeling backend explicitly supports the following features:
 - <project:#quantization-index> (except GGUF)
 - <project:#lora-adapter>
- <project:#distributed-serving> (requires `transformers>=4.49.0`)
+- <project:#distributed-serving>
-#### Remote code
+#### Remote Code
-Earlier we mentioned that the Transformers fallback enables you to run remote code models directly in vLLM.
+If your model is neither supported natively by vLLM or Transformers, you can still run it in vLLM!
-If you are interested in this feature, this section is for you!
 Simply set `trust_remote_code=True` and vLLM will run any model on the Model Hub that is compatible with Transformers.
 Provided that the model writer implements their model in a compatible way, this means that you can run new models before they are officially supported in Transformers or vLLM!
-```python 
+```python
 from vllm import LLM
 llm = LLM(model=..., task="generate", trust_remote_code=True)  # Name or path of your model
 llm.apply_model(lambda model: print(model.__class__))
 ```
-To make your model compatible with the Transformers fallback, it needs:
+To make your model compatible with the Transformers backend, it needs:
 ```{code-block} python
 :caption: modeling_my_model.py
@@ -119,9 +91,11 @@ Here is what happens in the background:
 1. The config is loaded
 2. `MyModel` Python class is loaded from the `auto_map`, and we check that the model `_supports_attention_backend`.
-3. The `TransformersModel` backend is used. See <gh-file:vllm/model_executor/models/transformers.py>, which leverage `self.config._attn_implementation = "vllm"`, thus the need to use `ALL_ATTENTION_FUNCTION`.
+3. The `TransformersForCausalLM` backend is used. See <gh-file:vllm/model_executor/models/transformers.py>, which leverage `self.config._attn_implementation = "vllm"`, thus the need to use `ALL_ATTENTION_FUNCTION`.
+That's it!
-To make your model compatible with tensor parallel, it needs:
+For your model to be compatible with vLLM's tensor parallel and/or pipeline parallel features, you must add `base_model_tp_plan` and/or `base_model_pp_plan` to your model's config class:
 ```{code-block} python
 :caption: configuration_my_model.py
@@ -130,20 +104,65 @@ from transformers import PretrainedConfig
 class MyConfig(PretrainedConfig):
  base_model_tp_plan = {
-    "layers.*.self_attn.q_proj": "colwise",
+    "layers.*.self_attn.k_proj": "colwise",
-    ...
+    "layers.*.self_attn.v_proj": "colwise",
+    "layers.*.self_attn.o_proj": "rowwise",
+    "layers.*.mlp.gate_proj": "colwise",
+    "layers.*.mlp.up_proj": "colwise",
+    "layers.*.mlp.down_proj": "rowwise",
+  }
+  base_model_pp_plan = {
+    "embed_tokens": (["input_ids"], ["inputs_embeds"]),
+    "layers": (["hidden_states", "attention_mask"], ["hidden_states"]),
+    "norm": (["hidden_states"], ["hidden_states"]),
  }
 ```
+- `base_model_tp_plan` is a `dict` that maps fully qualified layer name patterns to tensor parallel styles (currently only `"colwise"` and `"rowwise"` are supported).
+- `base_model_pp_plan` is a `dict` that maps direct child layer names to `tuple`s of `list`s of `str`s:
+  * You only need to do this for layers which are not present on all pipeline stages
+  * vLLM assumes that there will be only one `nn.ModuleList`, which is distributed across the pipeline stages
+  * The `list` in the first element of the `tuple` contains the names of the input arguments
+  * The `list` in the last element of the `tuple` contains the names of the variables the layer outputs to in your modeling code
+## Loading a Model
+### Hugging Face Hub
+By default, vLLM loads models from [Hugging Face (HF) Hub](https://huggingface.co/models).
+To determine whether a given model is natively supported, you can check the `config.json` file inside the HF repository.
+If the `"architectures"` field contains a model architecture listed below, then it should be natively supported.
+Models do not _need_ to be natively supported to be used in vLLM.
+The [Transformers backend](#transformers-backend) enables you to run models directly using their Transformers implementation (or even remote code on the Hugging Face Model Hub!).
 :::{tip}
-`base_model_tp_plan` is a `dict` that maps fully qualified layer name patterns to tensor parallel styles (currently only `"colwise"` and `"rowwise"` are supported).
+The easiest way to check if your model is really supported at runtime is to run the program below:
+```python
+from vllm import LLM
+# For generative models (task=generate) only
+llm = LLM(model=..., task="generate")  # Name or path of your model
+output = llm.generate("Hello, my name is")
+print(output)
+# For pooling models (task={embed,classify,reward,score}) only
+llm = LLM(model=..., task="embed")  # Name or path of your model
+output = llm.encode("Hello, my name is")
+print(output)
+```
+If vLLM successfully returns text (for generative models) or hidden states (for pooling models), it indicates that your model is supported.
 :::
-That's it!
+Otherwise, please refer to [Adding a New Model](#new-model) for instructions on how to implement your model in vLLM.
+Alternatively, you can [open an issue on GitHub](https://github.com/vllm-project/vllm/issues/new/choose) to request vLLM support.
 ### ModelScope
-To use models from [ModelScope](https://www.modelscope.cn) instead of HuggingFace Hub, set an environment variable:
+To use models from [ModelScope](https://www.modelscope.cn) instead of Hugging Face Hub, set an environment variable:
 ```shell
 export VLLM_USE_MODELSCOPE=True
@@ -165,6 +184,8 @@ output = llm.encode("Hello, my name is")
 print(output)
 ```
+(supported-text-models)=
 ## List of Text-only Language Models
 ### Generative Models
@@ -197,6 +218,11 @@ See [this page](#generative-models) for more information on how to use generativ
  * `baichuan-inc/Baichuan2-13B-Chat`, `baichuan-inc/Baichuan-7B`, etc.
  * ✅︎
  * ✅︎
+- * `BambaForCausalLM`
+  * Bamba
+  * `ibm-ai-platform/Bamba-9B-fp8`, `ibm-ai-platform/Bamba-9B`
+  *
+  *
 - * `BloomForCausalLM`
  * BLOOM, BLOOMZ, BLOOMChat
  * `bigscience/bloom`, `bigscience/bloomz`, etc.
@@ -224,7 +250,7 @@ See [this page](#generative-models) for more information on how to use generativ
  * ✅︎
 - * `DeciLMForCausalLM`
  * DeciLM
-  * `Deci/DeciLM-7B`, `Deci/DeciLM-7B-instruct`, etc.
+  * `nvidia/Llama-3_3-Nemotron-Super-49B-v1`, etc.
  *
  * ✅︎
 - * `DeepseekForCausalLM`
@@ -482,6 +508,11 @@ See [this page](#generative-models) for more information on how to use generativ
  * `xverse/XVERSE-7B-Chat`, `xverse/XVERSE-13B-Chat`, `xverse/XVERSE-65B-Chat`, etc.
  * ✅︎
  * ✅︎
+- * `MiniMaxText01ForCausalLM`
+  * MiniMax-Text
+  * `MiniMaxAI/MiniMax-Text-01`, etc.
+  *
+  * ✅︎
 - * `Zamba2ForCausalLM`
  * Zamba2
  * `Zyphra/Zamba2-7B-instruct`, `Zyphra/Zamba2-2.7B-instruct`, `Zyphra/Zamba2-1.2B-instruct`, etc.
@@ -545,7 +576,7 @@ you should explicitly specify the task type to ensure that the model is used in
  *
 - * `XLMRobertaModel`
  * XLM-RoBERTa-based
-  * `intfloat/multilingual-e5-large`, etc.
+  * `intfloat/multilingual-e5-large`, `jinaai/jina-reranker-v2-base-multilingual`, etc.
  *
  *
 :::
@@ -732,6 +763,13 @@ See [this page](#generative-models) for more information on how to use generativ
  *
  * ✅︎
  * ✅︎
+- * `AyaVisionForConditionalGeneration`
+  * Aya Vision
+  * T + I<sup>+</sup>
+  * `CohereForAI/aya-vision-8b`, `CohereForAI/aya-vision-32b`, etc.
+  *
+  * ✅︎
+  * ✅︎
 - * `Blip2ForConditionalGeneration`
  * BLIP-2
  * T + I<sup>E</sup>
@@ -802,6 +840,13 @@ See [this page](#generative-models) for more information on how to use generativ
  *
  * ✅︎
  * ✅︎
+- * `Llama4ForConditionalGeneration`
+  * Llama-4-17B-Omni-Instruct
+  * T + I<sup>+</sup>
+  * `meta-llama/Llama-4-Scout-17B-16E-Instruct`, `meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8`, `meta-llama/Llama-4-Maverick-17B-128E-Instruct`, etc.
+  *
+  *
+  * ✅︎
 - * `LlavaForConditionalGeneration`
  * LLaVA-1.5
  * T + I<sup>E+</sup>
@@ -836,14 +881,21 @@ See [this page](#generative-models) for more information on how to use generativ
  * `openbmb/MiniCPM-o-2_6`, etc.
  * ✅︎
  * ✅︎
-  *
+  * ✅︎
 - * `MiniCPMV`
  * MiniCPM-V
  * T + I<sup>E+</sup> + V<sup>E+</sup>
  * `openbmb/MiniCPM-V-2` (see note), `openbmb/MiniCPM-Llama3-V-2_5`, `openbmb/MiniCPM-V-2_6`, etc.
  * ✅︎
  * ✅︎
+  * ✅︎
+- * `Mistral3ForConditionalGeneration`
+  * Mistral3
+  * T + I<sup>+</sup>
+  * `mistralai/Mistral-Small-3.1-24B-Instruct-2503`, etc.
  *
+  * ✅︎
+  * ✅︎
 - * `MllamaForConditionalGeneration`
  * Llama 3.2
  * T + I<sup>+</sup>
@@ -853,7 +905,7 @@ See [this page](#generative-models) for more information on how to use generativ
  *
 - * `MolmoForCausalLM`
  * Molmo
-  * T + I
+  * T + I<sup>+</sup>
  * `allenai/Molmo-7B-D-0924`, `allenai/Molmo-7B-O-0924`, etc.
  * ✅︎
  * ✅︎
@@ -921,6 +973,13 @@ See [this page](#generative-models) for more information on how to use generativ
  * ✅︎
  * ✅︎
  * ✅︎
+- * `SkyworkR1VChatModel`
+  * Skywork-R1V-38B
+  * T + I
+  * `Skywork/Skywork-R1V-38B`
+  *
+  * ✅︎
+  * ✅︎
 - * `UltravoxModel`
  * Ultravox
  * T + A<sup>E+</sup>
@@ -930,10 +989,10 @@ See [this page](#generative-models) for more information on how to use generativ
  * ✅︎
 :::
-<sup>^</sup> You need to set the architecture name via `--hf-overrides` to match the one in vLLM.  
+<sup>^</sup> You need to set the architecture name via `--hf-overrides` to match the one in vLLM.
-&nbsp;&nbsp;&nbsp;&nbsp;• For example, to use DeepSeek-VL2 series models:  
+&nbsp;&nbsp;&nbsp;&nbsp;• For example, to use DeepSeek-VL2 series models:
-&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;`--hf-overrides '{"architectures": ["DeepseekVLV2ForCausalLM"]}'`  
+&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;`--hf-overrides '{"architectures": ["DeepseekVLV2ForCausalLM"]}'`
-<sup>E</sup> Pre-computed embeddings can be inputted for this modality.  
+<sup>E</sup> Pre-computed embeddings can be inputted for this modality.
 <sup>+</sup> Multiple items can be inputted per text prompt for this modality.
 :::{important}
@@ -1059,7 +1118,7 @@ At vLLM, we are committed to facilitating the integration and support of third-p
 2. **Best-Effort Consistency**: While we aim to maintain a level of consistency between the models implemented in vLLM and other frameworks like transformers, complete alignment is not always feasible. Factors like acceleration techniques and the use of low-precision computations can introduce discrepancies. Our commitment is to ensure that the implemented models are functional and produce sensible results.
    :::{tip}
-    When comparing the output of `model.generate` from HuggingFace Transformers with the output of `llm.generate` from vLLM, note that the former reads the model's generation config file (i.e., [generation_config.json](https://github.com/huggingface/transformers/blob/19dabe96362803fb0a9ae7073d03533966598b17/src/transformers/generation/utils.py#L1945)) and applies the default parameters for generation, while the latter only uses the parameters passed to the function. Ensure all sampling parameters are identical when comparing outputs.
+    When comparing the output of `model.generate` from Hugging Face Transformers with the output of `llm.generate` from vLLM, note that the former reads the model's generation config file (i.e., [generation_config.json](https://github.com/huggingface/transformers/blob/19dabe96362803fb0a9ae7073d03533966598b17/src/transformers/generation/utils.py#L1945)) and applies the default parameters for generation, while the latter only uses the parameters passed to the function. Ensure all sampling parameters are identical when comparing outputs.
    :::
 3. **Issue Resolution and Model Updates**: Users are encouraged to report any bugs or issues they encounter with third-party models. Proposed fixes should be submitted via PRs, with a clear explanation of the problem and the rationale behind the proposed solution. If a fix for one model impacts another, we rely on the community to highlight and address these cross-model dependencies. Note: for bugfix PRs, it is good etiquette to inform the original author to seek their feedback.

--- a/docs/source/performance/optimization.md
+++ b/docs/source/performance/optimization.md
@@ -31,6 +31,8 @@ vLLM supports an experimental feature chunked prefill. Chunked prefill allows to
 You can enable the feature by specifying `--enable-chunked-prefill` in the command line or setting `enable_chunked_prefill=True` in the LLM constructor.
 ```python
+from vllm import LLM
 llm = LLM(model="meta-llama/Llama-2-7b-hf", enable_chunked_prefill=True)
 # Set max_num_batched_tokens to tune performance.
 # NOTE: 2048 is the default max_num_batched_tokens for chunked prefill.

--- a/docs/source/serving/multimodal_inputs.md
+++ b/docs/source/serving/multimodal_inputs.md
@@ -21,6 +21,8 @@ To input multi-modal data, follow this schema in {class}`vllm.inputs.PromptType`
 You can pass a single image to the `'image'` field of the multi-modal dictionary, as shown in the following examples:
 ```python
+from vllm import LLM
 llm = LLM(model="llava-hf/llava-1.5-7b-hf")
 # Refer to the HuggingFace repo for the correct format to use
@@ -65,6 +67,8 @@ Full example: <gh-file:examples/offline_inference/vision_language.py>
 To substitute multiple images inside the same text prompt, you can pass in a list of images instead:
 ```python
+from vllm import LLM
 llm = LLM(
    model="microsoft/Phi-3.5-vision-instruct",
    trust_remote_code=True,  # Required to load Phi-3.5-vision
@@ -96,6 +100,8 @@ Full example: <gh-file:examples/offline_inference/vision_language_multi_image.py
 Multi-image input can be extended to perform video captioning. We show this with [Qwen2-VL](https://huggingface.co/Qwen/Qwen2-VL-2B-Instruct) as it supports videos:
 ```python
+from vllm import LLM
 # Specify the maximum number of frames per video to be 4. This can be changed.
 llm = LLM("Qwen/Qwen2-VL-2B-Instruct", limit_mm_per_prompt={"image": 4})
@@ -139,6 +145,8 @@ To input pre-computed embeddings belonging to a data type (i.e. image, video, or
 pass a tensor of shape `(num_items, feature_size, hidden_size of LM)` to the corresponding field of the multi-modal dictionary.
 ```python
+from vllm import LLM
 # Inference with image embeddings as input
 llm = LLM(model="llava-hf/llava-1.5-7b-hf")

--- a/docs/source/serving/offline_inference.md
+++ b/docs/source/serving/offline_inference.md
@@ -11,6 +11,8 @@ For example, the following code downloads the [`facebook/opt-125m`](https://hugg
 and runs it in vLLM using the default configuration.
 ```python
+from vllm import LLM
 llm = LLM(model="facebook/opt-125m")
 ```
@@ -47,6 +49,8 @@ To fix this, explicitly specify the model architecture by passing `config.json`
 For example:
 ```python
+from vllm import LLM
 model = LLM(
    model="cerebras/Cerebras-GPT-1.3B",
    hf_overrides={"architectures": ["GPT2LMHeadModel"]},  # GPT-2
@@ -92,6 +96,8 @@ You can further reduce memory usage by limiting the context length of the model
 and the maximum batch size (`max_num_seqs` option).
 ```python
+from vllm import LLM
 llm = LLM(model="adept/fuyu-8b",
          max_model_len=2048,
          max_num_seqs=2)

--- a/docs/source/serving/openai_compatible_server.md
+++ b/docs/source/serving/openai_compatible_server.md
@@ -188,6 +188,7 @@ For example:
 ```yaml
 # config.yaml
+model: meta-llama/Llama-3.1-8B-Instruct
 host: "127.0.0.1"
 port: 6379
 uvicorn-log-level: "info"
@@ -196,12 +197,13 @@ uvicorn-log-level: "info"
 To use the above config file:
 ```bash
-vllm serve SOME_MODEL --config config.yaml
+vllm serve --config config.yaml
 ```
 :::{note}
 In case an argument is supplied simultaneously using command line and the config file, the value from the command line will take precedence.
 The order of priorities is `command line > config file values > defaults`.
+e.g. `vllm serve SOME_MODEL --config config.yaml`, SOME_MODEL takes precedence over `model` in config file.
 :::
 ## API Reference

--- a/docs/source/serving/usage_stats.md
+++ b/docs/source/serving/usage_stats.md
 # Usage Stats Collection
-vLLM collects anonymous usage data by default to help the engineering team better understand which hardware and model configurations are widely used. This data allows them to prioritize their efforts on the most common workloads. The collected data is transparent, does not contain any sensitive information, and will be publicly released for the community's benefit.
+vLLM collects anonymous usage data by default to help the engineering team better understand which hardware and model configurations are widely used. This data allows them to prioritize their efforts on the most common workloads. The collected data is transparent, does not contain any sensitive information.
+A subset of the data, after cleaning and aggregation, will be publicly released for the community's benefit. For example, you can see the 2024 usage report [here](https://2024.vllm.ai).
 ## What data is collected?

--- a/examples/offline_inference/basic/basic.py
+++ b/examples/offline_inference/basic/basic.py
@@ -13,13 +13,16 @@ if __name__ == '__main__':
    # Create a sampling params object.
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=16)
-    # Create an LLM.
+# Create an LLM.
-    llm = LLM(model="facebook/opt-125m",tensor_parallel_size=1, dtype="float16",trust_remote_code=True, enforce_eager=True)
+llm = LLM(model="facebook/opt-125m")
-    # Generate texts from the prompts. The output is a list of RequestOutput objects
+# Generate texts from the prompts. The output is a list of RequestOutput objects
-    # that contain the prompt, generated text, and other information.
+# that contain the prompt, generated text, and other information.
-    outputs = llm.generate(prompts, sampling_params)
+outputs = llm.generate(prompts, sampling_params)
-    # Print the outputs.
+# Print the outputs.
-    for output in outputs:
+print("\nGenerated Outputs:\n" + "-" * 60)
-        prompt = output.prompt
+for output in outputs:
-        generated_text = output.outputs[0].text
+    prompt = output.prompt
-        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
+    generated_text = output.outputs[0].text
+    print(f"Prompt:    {prompt!r}")
+    print(f"Output:    {generated_text!r}")
+    print("-" * 60)
--- a/examples/offline_inference/basic/chat.py
+++ b/examples/offline_inference/basic/chat.py
@@ -27,12 +27,13 @@ def main(args: dict):
        sampling_params.top_k = top_k
    def print_outputs(outputs):
+        print("\nGenerated Outputs:\n" + "-" * 80)
        for output in outputs:
            prompt = output.prompt
            generated_text = output.outputs[0].text
-            print(f"Prompt: {prompt!r}")
+            print(f"Prompt: {prompt!r}\n")
            print(f"Generated text: {generated_text!r}")
-        print("-" * 80)
+            print("-" * 80)
    print("=" * 80)

--- a/examples/offline_inference/basic/classify.py
+++ b/examples/offline_inference/basic/classify.py
@@ -23,12 +23,14 @@ def main(args: Namespace):
    outputs = model.classify(prompts)
    # Print the outputs.
+    print("\nGenerated Outputs:\n" + "-" * 60)
    for prompt, output in zip(prompts, outputs):
        probs = output.outputs.probs
        probs_trimmed = ((str(probs[:16])[:-1] +
                          ", ...]") if len(probs) > 16 else probs)
-        print(f"Prompt: {prompt!r} | "
+        print(f"Prompt: {prompt!r} \n"
              f"Class Probabilities: {probs_trimmed} (size={len(probs)})")
+        print("-" * 60)
 if __name__ == "__main__":

--- a/examples/offline_inference/basic/embed.py
+++ b/examples/offline_inference/basic/embed.py
@@ -23,12 +23,14 @@ def main(args: Namespace):
    outputs = model.embed(prompts)
    # Print the outputs.
+    print("\nGenerated Outputs:\n" + "-" * 60)
    for prompt, output in zip(prompts, outputs):
        embeds = output.outputs.embedding
        embeds_trimmed = ((str(embeds[:16])[:-1] +
                           ", ...]") if len(embeds) > 16 else embeds)
-        print(f"Prompt: {prompt!r} | "
+        print(f"Prompt: {prompt!r} \n"
              f"Embeddings: {embeds_trimmed} (size={len(embeds)})")
+        print("-" * 60)
 if __name__ == "__main__":

--- a/examples/offline_inference/basic/score.py
+++ b/examples/offline_inference/basic/score.py
@@ -22,9 +22,11 @@ def main(args: Namespace):
    outputs = model.score(text_1, texts_2)
    # Print the outputs.
+    print("\nGenerated Outputs:\n" + "-" * 60)
    for text_2, output in zip(texts_2, outputs):
        score = output.outputs.score
-        print(f"Pair: {[text_1, text_2]!r} | Score: {score}")
+        print(f"Pair: {[text_1, text_2]!r} \nScore: {score}")
+        print("-" * 60)
 if __name__ == "__main__":

--- a/examples/offline_inference/data_parallel.py
+++ b/examples/offline_inference/data_parallel.py
 # SPDX-License-Identifier: Apache-2.0
-# usage:
+"""
-# VLLM_USE_V1=1 python examples/offline_inference/data_parallel.py
+Usage:
-# we need to have a launcher to create multiple data parallel
+Single node:
-# ranks. And each rank will create a vLLM instance to process its own prompts.
+    python examples/offline_inference/data_parallel.py \
+            --model="ibm-research/PowerMoE-3b" \
+            --dp-size=2 \
+            --tp-size=2
+Multi-node:
+    Node 0 (assume the node has ip of 10.99.48.128):
+            python examples/offline_inference/data_parallel.py \
+                    --model="ibm-research/PowerMoE-3b" \
+                    --dp-size=2 \
+                    --tp-size=2 \
+                    --node-size=2 \
+                    --node-rank=0 \
+                    --master-addr=10.99.48.128 \
+                    --master-port=13345
+    Node 1:
+            python examples/offline_inference/data_parallel.py \
+                    --model="ibm-research/PowerMoE-3b" \
+                    --dp-size=2 \
+                    --tp-size=2 \
+                    --node-size=2 \
+                    --node-rank=1 \
+                    --master-addr=10.99.48.128 \
+                    --master-port=13345
+"""
 import os
+from time import sleep
 from vllm import LLM, SamplingParams
 from vllm.utils import get_open_port
-GPUs_per_dp_rank = 2
-DP_size = 2
-def main(dp_size, dp_rank, dp_master_ip, dp_master_port, GPUs_per_dp_rank):
+def main(model, dp_size, local_dp_rank, global_dp_rank, dp_master_ip,
-    os.environ["VLLM_DP_RANK"] = str(dp_rank)
+         dp_master_port, GPUs_per_dp_rank):
+    os.environ["VLLM_DP_RANK"] = str(global_dp_rank)
+    os.environ["VLLM_DP_RANK_LOCAL"] = str(local_dp_rank)
    os.environ["VLLM_DP_SIZE"] = str(dp_size)
    os.environ["VLLM_DP_MASTER_IP"] = dp_master_ip
    os.environ["VLLM_DP_MASTER_PORT"] = str(dp_master_port)
-    # set devices for each dp_rank
-    os.environ["CUDA_VISIBLE_DEVICES"] = ",".join(
+    # CUDA_VISIBLE_DEVICES for each DP rank is set automatically inside the
-        str(i) for i in range(dp_rank * GPUs_per_dp_rank, (dp_rank + 1) *
+    # engine processes.
-                              GPUs_per_dp_rank))
    # Sample prompts.
    prompts = [
@@ -28,20 +51,20 @@ def main(dp_size, dp_rank, dp_master_ip, dp_master_port, GPUs_per_dp_rank):
        "The president of the United States is",
        "The capital of France is",
        "The future of AI is",
-    ]
+    ] * 100
    # with DP, each rank should process different prompts.
    # usually all the DP ranks process a full dataset,
    # and each rank processes a different part of the dataset.
    promts_per_rank = len(prompts) // dp_size
-    start = dp_rank * promts_per_rank
+    start = global_dp_rank * promts_per_rank
    end = start + promts_per_rank
    prompts = prompts[start:end]
    if len(prompts) == 0:
        # if any rank has no prompts to process,
        # we need to set a placeholder prompt
        prompts = ["Placeholder"]
-    print(f"DP rank {dp_rank} needs to process {len(prompts)} prompts")
+    print(f"DP rank {global_dp_rank} needs to process {len(prompts)} prompts")
    # Create a sampling params object.
    # since we are doing data parallel, every rank can have different
@@ -49,37 +72,96 @@ def main(dp_size, dp_rank, dp_master_ip, dp_master_port, GPUs_per_dp_rank):
    # ranks for demonstration.
    sampling_params = SamplingParams(temperature=0.8,
                                     top_p=0.95,
-                                     max_tokens=16 * (dp_rank + 1))
+                                     max_tokens=[16, 20][global_dp_rank % 2])
    # Create an LLM.
-    llm = LLM(model="ibm-research/PowerMoE-3b",
+    llm = LLM(model=model,
              tensor_parallel_size=GPUs_per_dp_rank,
              enforce_eager=True,
              enable_expert_parallel=True)
    outputs = llm.generate(prompts, sampling_params)
    # Print the outputs.
-    for output in outputs:
+    for i, output in enumerate(outputs):
+        if i >= 5:
+            # print only 5 outputs
+            break
        prompt = output.prompt
        generated_text = output.outputs[0].text
-        print(f"DP rank {dp_rank}, Prompt: {prompt!r}, "
+        print(f"DP rank {global_dp_rank}, Prompt: {prompt!r}, "
              f"Generated text: {generated_text!r}")
+    # Give engines time to pause their processing loops before exiting.
+    sleep(1)
 if __name__ == "__main__":
+    import argparse
+    parser = argparse.ArgumentParser(description="Data Parallel Inference")
+    parser.add_argument("--model",
+                        type=str,
+                        default="ibm-research/PowerMoE-3b",
+                        help="Model name or path")
+    parser.add_argument("--dp-size",
+                        type=int,
+                        default=2,
+                        help="Data parallel size")
+    parser.add_argument("--tp-size",
+                        type=int,
+                        default=2,
+                        help="Tensor parallel size")
+    parser.add_argument("--node-size",
+                        type=int,
+                        default=1,
+                        help="Total number of nodes")
+    parser.add_argument("--node-rank",
+                        type=int,
+                        default=0,
+                        help="Rank of the current node")
+    parser.add_argument("--master-addr",
+                        type=str,
+                        default="",
+                        help="Master node IP address")
+    parser.add_argument("--master-port",
+                        type=int,
+                        default=0,
+                        help="Master node port")
+    args = parser.parse_args()
+    dp_size = args.dp_size
+    tp_size = args.tp_size
+    node_size = args.node_size
+    node_rank = args.node_rank
+    if node_size == 1:
+        dp_master_ip = "127.0.0.1"
+        dp_master_port = get_open_port()
+    else:
+        dp_master_ip = args.master_addr
+        dp_master_port = args.master_port
+    assert dp_size % node_size == 0, "dp_size should be divisible by node_size"
+    dp_per_node = dp_size // node_size
    from multiprocessing import Process
-    dp_master_ip = "127.0.0.1"
-    dp_master_port = get_open_port()
    procs = []
-    for i in range(DP_size):
+    for local_dp_rank, global_dp_rank in enumerate(
+            range(node_rank * dp_per_node, (node_rank + 1) * dp_per_node)):
        proc = Process(target=main,
-                       args=(DP_size, i, dp_master_ip, dp_master_port,
+                       args=(args.model, dp_size, local_dp_rank,
-                             GPUs_per_dp_rank))
+                             global_dp_rank, dp_master_ip, dp_master_port,
+                             tp_size))
        proc.start()
        procs.append(proc)
    exit_code = 0
    for proc in procs:
-        proc.join()
+        proc.join(timeout=300)
-        if proc.exitcode:
+        if proc.exitcode is None:
+            print(f"Killing process {proc.pid} that "
+                  f"didn't stop within 5 minutes.")
+            proc.kill()
+            exit_code = 1
+        elif proc.exitcode:
            exit_code = proc.exitcode
    exit(exit_code)
--- a/examples/offline_inference/eagle.py
+++ b/examples/offline_inference/eagle.py
@@ -69,10 +69,12 @@ llm = LLM(
    max_model_len=max_model_len,
    max_num_seqs=args.max_num_seqs,
    gpu_memory_utilization=0.8,
-    speculative_model=eagle_dir,
+    speculative_config={
-    num_speculative_tokens=args.num_spec_tokens,
+        "model": eagle_dir,
-    speculative_draft_tensor_parallel_size=args.draft_tp,
+        "num_speculative_tokens": args.num_spec_tokens,
-    speculative_max_model_len=max_model_len,
+        "draft_tensor_parallel_size": args.draft_tp,
+        "max_model_len": max_model_len,
+    },
    disable_log_stats=False,
 )

--- a/examples/offline_inference/load_sharded_state.py
+++ b/examples/offline_inference/load_sharded_state.py
+# SPDX-License-Identifier: Apache-2.0
+"""
+Validates the loading of a model saved with the sharded_state format.
+This script demonstrates how to load a model that was previously saved
+using save_sharded_state.py and validates it by running inference.
+Example usage:
+(First need to save a sharded_state mode)
+python save_sharded_state.py \
+    --model /path/to/load \
+    --quantization deepspeedfp \
+    --tensor-parallel-size 8 \
+    --output /path/to/save/sharded/modele
+python load_sharded_state.py \
+    --model /path/to/saved/sharded/model \
+    --load-format sharded_state \
+    --quantization deepspeedfp \
+    --tensor-parallel-size 8 \
+    --prompt "Hello, my name is" \
+    --max-tokens 50
+"""
+import dataclasses
+from vllm import LLM, EngineArgs, SamplingParams
+from vllm.utils import FlexibleArgumentParser
+def parse_args():
+    parser = FlexibleArgumentParser()
+    # Add engine arguments
+    EngineArgs.add_cli_args(parser)
+    # Override default load_format for clarity
+    parser.set_defaults(load_format="sharded_state")
+    # Add validation arguments
+    parser.add_argument("--prompt",
+                        type=str,
+                        default="Hello, world!",
+                        help="Prompt for validation")
+    parser.add_argument("--max-tokens",
+                        type=int,
+                        default=100,
+                        help="Maximum number of tokens to generate")
+    parser.add_argument("--temperature",
+                        type=float,
+                        default=0.7,
+                        help="Sampling temperature")
+    parser.add_argument("--top-p",
+                        type=float,
+                        default=1.0,
+                        help="Top-p sampling parameter")
+    return parser.parse_args()
+def main():
+    args = parse_args()
+    engine_args = EngineArgs.from_cli_args(args)
+    print(f"Loading model from {engine_args.model} "
+          f"using format {engine_args.load_format}")
+    print(f"Tensor parallel size: {engine_args.tensor_parallel_size}")
+    # Load the model using engine args
+    llm = LLM(**dataclasses.asdict(engine_args))
+    # Prepare sampling parameters
+    sampling_params = SamplingParams(
+        temperature=args.temperature,
+        top_p=args.top_p,
+        max_tokens=args.max_tokens,
+    )
+    print("\nRunning inference:")
+    print(f"Prompt: {args.prompt}")
+    # Generate completion
+    outputs = llm.generate(args.prompt, sampling_params)
+    # Display generated text
+    print("\nGenerated outputs:")
+    for output in outputs:
+        generated_text = output.outputs[0].text
+        print("-" * 50)
+        print(f"Full output: {args.prompt}{generated_text}")
+        print("-" * 50)
+if __name__ == "__main__":
+    main()
\ No newline at end of file
--- a/examples/offline_inference/save_sharded_state.py
+++ b/examples/offline_inference/save_sharded_state.py
@@ -57,10 +57,25 @@ def main(args):
    # Prepare output directory
    Path(args.output).mkdir(exist_ok=True)
    # Dump worker states to output directory
-    model_executor = llm.llm_engine.model_executor
-    model_executor.save_sharded_state(path=args.output,
+    # Check which engine version is being used
-                                      pattern=args.file_pattern,
+    is_v1_engine = hasattr(llm.llm_engine, "engine_core")
-                                      max_size=args.max_file_size)
+    if is_v1_engine:
+        # For V1 engine, we need to use engine_core.save_sharded_state
+        print("Using V1 engine save path")
+        llm.llm_engine.engine_core.save_sharded_state(
+            path=args.output,
+            pattern=args.file_pattern,
+            max_size=args.max_file_size)
+    else:
+        # For V0 engine
+        print("Using V0 engine save path")
+        model_executor = llm.llm_engine.model_executor
+        model_executor.save_sharded_state(path=args.output,
+                                          pattern=args.file_pattern,
+                                          max_size=args.max_file_size)
    # Copy metadata files to output directory
    for file in os.listdir(model_path):
        if os.path.splitext(file)[1] not in (".bin", ".pt", ".safetensors"):

--- a/examples/offline_inference/torchrun_example.py
+++ b/examples/offline_inference/torchrun_example.py
@@ -23,10 +23,14 @@ sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
 # Use `distributed_executor_backend="external_launcher"` so that
 # this llm engine/instance only creates one worker.
+# it is important to set an explicit seed to make sure that
+# all ranks have the same random seed, so that sampling can be
+# deterministic across ranks.
 llm = LLM(
    model="facebook/opt-125m",
    tensor_parallel_size=2,
    distributed_executor_backend="external_launcher",
+    seed=0,
 )
 outputs = llm.generate(prompts, sampling_params)

--- a/examples/offline_inference/tpu.py
+++ b/examples/offline_inference/tpu.py
@@ -14,10 +14,7 @@ answers = [
 ]
 N = 1
 # Currently, top-p sampling is disabled. `top_p` should be 1.0.
-sampling_params = SamplingParams(temperature=0.7,
+sampling_params = SamplingParams(temperature=0, top_p=1.0, n=N, max_tokens=16)
-                                 top_p=1.0,
-                                 n=N,
-                                 max_tokens=16)
 # Set `enforce_eager=True` to avoid ahead-of-time compilation.
 # In real workloads, `enforace_eager` should be `False`.