Commit 86ae693f authored by Cyrus Leung, committed by GitHub

[Deprecation][2/N] Replace `--task` with `--runner` and `--convert` (#21470)


Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
parent 8f605ee3
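The flag replacement named in the commit title can be sketched as a small argument-rewriting helper. This is an illustration only and is not part of vLLM: `TASK_TO_FLAGS` and `migrate_task_args` are hypothetical names, and the mapping is inferred from the documentation changes in this diff (for example, `task="embed"` becoming `runner="pooling"`, with `--convert embed` described as automatic for convertible architectures). `--task transcription` is deliberately left alone because the diff does not show its replacement.

```python
# Hypothetical mapping, inferred from the doc changes in this diff:
# deprecated --task value -> (--runner value, --convert value or None).
TASK_TO_FLAGS = {
    "generate": ("generate", None),
    "embed": ("pooling", "embed"),
    "classify": ("pooling", "classify"),
    "reward": ("pooling", "reward"),
    "score": ("pooling", "classify"),
}


def migrate_task_args(args, explicit_convert=False):
    """Rewrite `--task X` in an argument list to `--runner ...`.

    `--convert` is only emitted when explicit_convert is True, because the
    updated docs say the conversion is usually detected automatically.
    Task values missing from the mapping are left unchanged.
    """
    out = []
    i = 0
    while i < len(args):
        arg = args[i]
        if arg == "--task" and i + 1 < len(args) and args[i + 1] in TASK_TO_FLAGS:
            runner, convert = TASK_TO_FLAGS[args[i + 1]]
            out += ["--runner", runner]
            if explicit_convert and convert is not None:
                out += ["--convert", convert]
            i += 2  # skip the flag and its value
        else:
            out.append(arg)
            i += 1
    return out


old = ["vllm", "serve", "intfloat/e5-mistral-7b-instruct", "--task", "embed"]
print(" ".join(migrate_task_args(old)))
print(" ".join(migrate_task_args(old, explicit_convert=True)))
```

The sketch only handles the space-separated form (`--task embed`); a `--task=embed` spelling would pass through untouched.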
...@@ -343,7 +343,7 @@ Here is a simple example using Phi-3.5-Vision. ...@@ -343,7 +343,7 @@ Here is a simple example using Phi-3.5-Vision.
First, launch the OpenAI-compatible server: First, launch the OpenAI-compatible server:
```bash ```bash
vllm serve microsoft/Phi-3.5-vision-instruct --task generate \ vllm serve microsoft/Phi-3.5-vision-instruct --runner generate \
--trust-remote-code --max-model-len 4096 --limit-mm-per-prompt '{"image":2}' --trust-remote-code --max-model-len 4096 --limit-mm-per-prompt '{"image":2}'
``` ```
...@@ -422,7 +422,7 @@ Instead of `image_url`, you can pass a video file via `video_url`. Here is a sim ...@@ -422,7 +422,7 @@ Instead of `image_url`, you can pass a video file via `video_url`. Here is a sim
First, launch the OpenAI-compatible server: First, launch the OpenAI-compatible server:
```bash ```bash
vllm serve llava-hf/llava-onevision-qwen2-0.5b-ov-hf --task generate --max-model-len 8192 vllm serve llava-hf/llava-onevision-qwen2-0.5b-ov-hf --runner generate --max-model-len 8192
``` ```
Then, you can use the OpenAI client as follows: Then, you can use the OpenAI client as follows:
......
...@@ -34,7 +34,7 @@ Prompt embeddings are passed in as base64 encoded torch tensors. ...@@ -34,7 +34,7 @@ Prompt embeddings are passed in as base64 encoded torch tensors.
First, launch the OpenAI-compatible server: First, launch the OpenAI-compatible server:
```bash ```bash
vllm serve meta-llama/Llama-3.2-1B-Instruct --task generate \ vllm serve meta-llama/Llama-3.2-1B-Instruct --runner generate \
--max-model-len 4096 --enable-prompt-embeds --max-model-len 4096 --enable-prompt-embeds
``` ```
......
...@@ -2,12 +2,19 @@ ...@@ -2,12 +2,19 @@
vLLM provides first-class support for generative models, which covers most of LLMs. vLLM provides first-class support for generative models, which covers most of LLMs.
In vLLM, generative models implement the [VllmModelForTextGeneration][vllm.model_executor.models.VllmModelForTextGeneration] interface. In vLLM, generative models implement the [VllmModelForTextGeneration][vllm.model_executor.models.VllmModelForTextGeneration] interface.
Based on the final hidden states of the input, these models output log probabilities of the tokens to generate, Based on the final hidden states of the input, these models output log probabilities of the tokens to generate,
which are then passed through [Sampler][vllm.model_executor.layers.Sampler] to obtain the final text. which are then passed through [Sampler][vllm.model_executor.layers.Sampler] to obtain the final text.
For generative models, the only supported `--task` option is `"generate"`. ## Configuration
Usually, this is automatically inferred so you don't have to specify it.
### Model Runner (`--runner`)
Run a model in generation mode via the option `--runner generate`.
!!! tip
There is no need to set this option in the vast majority of cases as vLLM can automatically
detect the model runner to use via `--runner auto`.
## Offline Inference ## Offline Inference
......
# Pooling Models # Pooling Models
vLLM also supports pooling models, including embedding, reranking and reward models. vLLM also supports pooling models, such as embedding, classification and reward models.
In vLLM, pooling models implement the [VllmModelForPooling][vllm.model_executor.models.VllmModelForPooling] interface. In vLLM, pooling models implement the [VllmModelForPooling][vllm.model_executor.models.VllmModelForPooling] interface.
These models use a [Pooler][vllm.model_executor.layers.Pooler] to extract the final hidden states of the input These models use a [Pooler][vllm.model_executor.layers.pooler.Pooler] to extract the final hidden states of the input
before returning them. before returning them.
!!! note !!! note
...@@ -11,18 +11,39 @@ before returning them. ...@@ -11,18 +11,39 @@ before returning them.
As shown in the [Compatibility Matrix](../features/compatibility_matrix.md), most vLLM features are not applicable to As shown in the [Compatibility Matrix](../features/compatibility_matrix.md), most vLLM features are not applicable to
pooling models as they only work on the generation or decode stage, so performance may not improve as much. pooling models as they only work on the generation or decode stage, so performance may not improve as much.
If the model doesn't implement this interface, you can set `--task` which tells vLLM ## Configuration
to convert the model into a pooling model.
| `--task` | Model type | Supported pooling tasks | ### Model Runner
|------------|----------------------|-------------------------------|
| `embed` | Embedding model | `encode`, `embed` |
| `classify` | Classification model | `encode`, `classify`, `score` |
| `reward` | Reward model | `encode` |
## Pooling Tasks Run a model in pooling mode via the option `--runner pooling`.
In vLLM, we define the following pooling tasks and corresponding APIs: !!! tip
There is no need to set this option in the vast majority of cases as vLLM can automatically
detect the model runner to use via `--runner auto`.
### Model Conversion
vLLM can adapt models for various pooling tasks via the option `--convert <type>`.
If `--runner pooling` has been set (manually or automatically) but the model does not implement the
[VllmModelForPooling][vllm.model_executor.models.VllmModelForPooling] interface,
vLLM will attempt to automatically convert the model according to the architecture names
shown in the table below.
| Architecture | `--convert` | Supported pooling tasks |
|-------------------------------------------------|-------------|-------------------------------|
| `*ForTextEncoding`, `*EmbeddingModel`, `*Model` | `embed` | `encode`, `embed` |
| `*For*Classification`, `*ClassificationModel` | `classify` | `encode`, `classify`, `score` |
| `*ForRewardModeling`, `*RewardModel` | `reward` | `encode` |
!!! tip
You can explicitly set `--convert <type>` to specify how to convert the model.
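The architecture-name rules in the new table can be read as glob patterns. The following is a minimal sketch of that reading, not vLLM's actual implementation: `CONVERT_PATTERNS` and `guess_convert_type` are hypothetical names, and the ordering (catch-all `*Model` last) is an assumption made here so that names such as `*RewardModel` are not swallowed by the broader pattern. It only inspects names; whether a conversion actually happens also depends on the model not already implementing the pooling interface.

```python
from fnmatch import fnmatchcase

# Patterns copied from the table above. The ordering is an assumption of this
# sketch: the catch-all `*Model` goes last so that more specific names such as
# `*RewardModel` are not classified as embedding models.
CONVERT_PATTERNS = [
    ("*ForTextEncoding", "embed"),
    ("*EmbeddingModel", "embed"),
    ("*For*Classification", "classify"),
    ("*ClassificationModel", "classify"),
    ("*ForRewardModeling", "reward"),
    ("*RewardModel", "reward"),
    ("*Model", "embed"),
]


def guess_convert_type(architecture):
    """Return the `--convert` value suggested by the table, or None."""
    for pattern, convert_type in CONVERT_PATTERNS:
        if fnmatchcase(architecture, pattern):
            return convert_type
    return None


for arch in ["BertModel", "GPT2ForSequenceClassification",
             "Qwen2ForRewardModel", "LlamaForCausalLM"]:
    print(arch, "->", guess_convert_type(arch))
```

Names that match no row, such as `LlamaForCausalLM`, yield `None` here, which corresponds to the case where `--convert <type>` would have to be set explicitly.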
### Pooling Tasks
Each pooling model in vLLM supports one or more of these tasks according to
[Pooler.get_supported_tasks][vllm.model_executor.layers.pooler.Pooler.get_supported_tasks],
enabling the corresponding APIs:
| Task | APIs | | Task | APIs |
|------------|--------------------| |------------|--------------------|
...@@ -31,11 +52,19 @@ In vLLM, we define the following pooling tasks and corresponding APIs: ...@@ -31,11 +52,19 @@ In vLLM, we define the following pooling tasks and corresponding APIs:
| `classify` | `classify` | | `classify` | `classify` |
| `score` | `score` | | `score` | `score` |
\*The `score` API falls back to `embed` task if the model does not support `score` task. \* The `score` API falls back to `embed` task if the model does not support `score` task.
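The fallback described in the footnote can be pictured with a short sketch. It rests on assumptions and is not vLLM's code: it assumes the fallback scores a text pair by the cosine similarity of the two embeddings, and `score_pair`, `embed_fn`, and `score_fn` are hypothetical stand-ins for a model's `score` and `embed` capabilities.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def score_pair(text_1, text_2, supported_tasks, embed_fn=None, score_fn=None):
    """Use the `score` task if supported, otherwise fall back to `embed`."""
    if "score" in supported_tasks:
        return score_fn(text_1, text_2)
    if "embed" in supported_tasks:
        # Assumed fallback: compare the two embeddings by cosine similarity.
        return cosine_similarity(embed_fn(text_1), embed_fn(text_2))
    raise ValueError("model supports neither `score` nor `embed`")


def toy_embed(text):
    """Toy embedding for demonstration only: shifted counts of 'a' and 'e'."""
    return [text.count("a") + 1.0, text.count("e") + 1.0]


print(score_pair("banana", "bandana", {"encode", "embed"}, embed_fn=toy_embed))
```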
Each pooling model in vLLM supports one or more of these tasks according to [Pooler.get_supported_tasks][vllm.model_executor.layers.Pooler.get_supported_tasks]. ### Pooler Configuration
By default, the pooler assigned to each task has the following attributes: #### Predefined models
If the [Pooler][vllm.model_executor.layers.pooler.Pooler] defined by the model accepts `pooler_config`,
you can override some of its attributes via the `--override-pooler-config` option.
#### Converted models
If the model has been converted via `--convert` (see above),
the pooler assigned to each task has the following attributes by default:
| Task | Pooling Type | Normalization | Softmax | | Task | Pooling Type | Normalization | Softmax |
|------------|----------------|---------------|---------| |------------|----------------|---------------|---------|
...@@ -43,20 +72,12 @@ By default, the pooler assigned to each task has the following attributes: ...@@ -43,20 +72,12 @@ By default, the pooler assigned to each task has the following attributes:
| `embed` | `LAST` | ✅︎ | ❌ | | `embed` | `LAST` | ✅︎ | ❌ |
| `classify` | `LAST` | ❌ | ✅︎ | | `classify` | `LAST` | ❌ | ✅︎ |
These defaults may be overridden by the model's implementation in vLLM.
When loading [Sentence Transformers](https://huggingface.co/sentence-transformers) models, When loading [Sentence Transformers](https://huggingface.co/sentence-transformers) models,
we attempt to override the defaults based on its Sentence Transformers configuration file (`modules.json`), its Sentence Transformers configuration file (`modules.json`) takes priority over the model's defaults.
which takes priority over the model's defaults.
You can further customize this via the `--override-pooler-config` option, You can further customize this via the `--override-pooler-config` option,
which takes priority over both the model's and Sentence Transformers's defaults. which takes priority over both the model's and Sentence Transformers's defaults.
!!! note
The above configuration may be disregarded if the model's implementation in vLLM defines its own pooler
that is not based on [PoolerConfig][vllm.config.PoolerConfig].
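The defaults and the override order described above can be sketched in plain Python. This is an illustration under assumptions, not vLLM's pooler: the dictionary keys and the functions `resolve_pooler_config` and `pool` are hypothetical, normalization is assumed to mean L2 normalization, and only the two table rows visible in this diff (`embed` and `classify`) are modeled.

```python
import math

# Defaults for the two rows visible in the table above (pooling type LAST).
DEFAULTS = {
    "embed": {"pooling_type": "LAST", "normalize": True, "softmax": False},
    "classify": {"pooling_type": "LAST", "normalize": False, "softmax": True},
}


def resolve_pooler_config(task, st_config=None, override=None):
    """Layer settings: defaults < Sentence Transformers < explicit override."""
    config = dict(DEFAULTS[task])
    config.update(st_config or {})
    config.update(override or {})
    return config


def pool(hidden_states, config):
    """Apply LAST pooling, then optional L2 normalization or softmax."""
    if config["pooling_type"] != "LAST":
        raise NotImplementedError("this sketch only covers LAST pooling")
    vec = list(hidden_states[-1])  # hidden state of the last token
    if config["normalize"]:
        norm = math.sqrt(sum(x * x for x in vec))
        vec = [x / norm for x in vec]
    if config["softmax"]:
        peak = max(vec)  # subtract the max for numerical stability
        exps = [math.exp(x - peak) for x in vec]
        total = sum(exps)
        vec = [e / total for e in exps]
    return vec


hidden = [[0.5, 0.5], [3.0, 4.0]]  # two tokens, hidden size 2
print(pool(hidden, resolve_pooler_config("embed")))
print(pool(hidden, resolve_pooler_config("classify")))
print(pool(hidden, resolve_pooler_config("embed", override={"normalize": False})))
```

The last call mirrors what `--override-pooler-config` is described as doing: the explicit override wins over both the task default and any Sentence Transformers setting.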
## Offline Inference ## Offline Inference
The [LLM][vllm.LLM] class provides various methods for offline inference. The [LLM][vllm.LLM] class provides various methods for offline inference.
...@@ -70,7 +91,7 @@ It returns the extracted hidden states directly, which is useful for reward mode ...@@ -70,7 +91,7 @@ It returns the extracted hidden states directly, which is useful for reward mode
```python ```python
from vllm import LLM from vllm import LLM
llm = LLM(model="Qwen/Qwen2.5-Math-RM-72B", task="reward") llm = LLM(model="Qwen/Qwen2.5-Math-RM-72B", runner="pooling")
(output,) = llm.encode("Hello, my name is") (output,) = llm.encode("Hello, my name is")
data = output.outputs.data data = output.outputs.data
...@@ -85,7 +106,7 @@ It is primarily designed for embedding models. ...@@ -85,7 +106,7 @@ It is primarily designed for embedding models.
```python ```python
from vllm import LLM from vllm import LLM
llm = LLM(model="intfloat/e5-mistral-7b-instruct", task="embed") llm = LLM(model="intfloat/e5-mistral-7b-instruct", runner="pooling")
(output,) = llm.embed("Hello, my name is") (output,) = llm.embed("Hello, my name is")
embeds = output.outputs.embedding embeds = output.outputs.embedding
...@@ -102,7 +123,7 @@ It is primarily designed for classification models. ...@@ -102,7 +123,7 @@ It is primarily designed for classification models.
```python ```python
from vllm import LLM from vllm import LLM
llm = LLM(model="jason9693/Qwen2.5-1.5B-apeach", task="classify") llm = LLM(model="jason9693/Qwen2.5-1.5B-apeach", runner="pooling")
(output,) = llm.classify("Hello, my name is") (output,) = llm.classify("Hello, my name is")
probs = output.outputs.probs probs = output.outputs.probs
...@@ -123,7 +144,7 @@ It is designed for embedding models and cross encoder models. Embedding models u ...@@ -123,7 +144,7 @@ It is designed for embedding models and cross encoder models. Embedding models u
```python ```python
from vllm import LLM from vllm import LLM
llm = LLM(model="BAAI/bge-reranker-v2-m3", task="score") llm = LLM(model="BAAI/bge-reranker-v2-m3", runner="pooling")
(output,) = llm.score("What is the capital of France?", (output,) = llm.score("What is the capital of France?",
"The capital of Brazil is Brasilia.") "The capital of Brazil is Brasilia.")
...@@ -175,7 +196,7 @@ You can change the output dimensions of embedding models that support Matryoshka ...@@ -175,7 +196,7 @@ You can change the output dimensions of embedding models that support Matryoshka
from vllm import LLM, PoolingParams from vllm import LLM, PoolingParams
llm = LLM(model="jinaai/jina-embeddings-v3", llm = LLM(model="jinaai/jina-embeddings-v3",
task="embed", runner="pooling",
trust_remote_code=True) trust_remote_code=True)
outputs = llm.embed(["Follow the white rabbit."], outputs = llm.embed(["Follow the white rabbit."],
pooling_params=PoolingParams(dimensions=32)) pooling_params=PoolingParams(dimensions=32))
......
# Supported Models # Supported Models
vLLM supports [generative](./generative_models.md) and [pooling](./pooling_models.md) models across various tasks. vLLM supports [generative](./generative_models.md) and [pooling](./pooling_models.md) models across various tasks.
If a model supports more than one task, you can set the task via the `--task` argument.
For each task, we list the model architectures that have been implemented in vLLM. For each task, we list the model architectures that have been implemented in vLLM.
Alongside each architecture, we include some popular models that use it. Alongside each architecture, we include some popular models that use it.
...@@ -24,7 +23,7 @@ To check if the modeling backend is Transformers, you can simply do this: ...@@ -24,7 +23,7 @@ To check if the modeling backend is Transformers, you can simply do this:
```python ```python
from vllm import LLM from vllm import LLM
llm = LLM(model=..., task="generate") # Name or path of your model llm = LLM(model=...) # Name or path of your model
llm.apply_model(lambda model: print(type(model))) llm.apply_model(lambda model: print(type(model)))
``` ```
...@@ -158,13 +157,13 @@ The [Transformers backend][transformers-backend] enables you to run models direc ...@@ -158,13 +157,13 @@ The [Transformers backend][transformers-backend] enables you to run models direc
```python ```python
from vllm import LLM from vllm import LLM
# For generative models (task=generate) only # For generative models (runner=generate) only
llm = LLM(model=..., task="generate") # Name or path of your model llm = LLM(model=..., runner="generate") # Name or path of your model
output = llm.generate("Hello, my name is") output = llm.generate("Hello, my name is")
print(output) print(output)
# For pooling models (task={embed,classify,reward,score}) only # For pooling models (runner=pooling) only
llm = LLM(model=..., task="embed") # Name or path of your model llm = LLM(model=..., runner="pooling") # Name or path of your model
output = llm.encode("Hello, my name is") output = llm.encode("Hello, my name is")
print(output) print(output)
``` ```
...@@ -281,13 +280,13 @@ And use with `trust_remote_code=True`. ...@@ -281,13 +280,13 @@ And use with `trust_remote_code=True`.
```python ```python
from vllm import LLM from vllm import LLM
llm = LLM(model=..., revision=..., task=..., trust_remote_code=True) llm = LLM(model=..., revision=..., runner=..., trust_remote_code=True)
# For generative models (task=generate) only # For generative models (runner=generate) only
output = llm.generate("Hello, my name is") output = llm.generate("Hello, my name is")
print(output) print(output)
# For pooling models (task={embed,classify,reward,score}) only # For pooling models (runner=pooling) only
output = llm.encode("Hello, my name is") output = llm.encode("Hello, my name is")
print(output) print(output)
``` ```
...@@ -312,8 +311,6 @@ See [this page](generative_models.md) for more information on how to use generat ...@@ -312,8 +311,6 @@ See [this page](generative_models.md) for more information on how to use generat
#### Text Generation #### Text Generation
Specified using `--task generate`.
<style> <style>
th { th {
white-space: nowrap; white-space: nowrap;
...@@ -420,25 +417,27 @@ See [this page](./pooling_models.md) for more information on how to use pooling ...@@ -420,25 +417,27 @@ See [this page](./pooling_models.md) for more information on how to use pooling
!!! important !!! important
Since some model architectures support both generative and pooling tasks, Since some model architectures support both generative and pooling tasks,
you should explicitly specify the task type to ensure that the model is used in pooling mode instead of generative mode. you should explicitly specify `--runner pooling` to ensure that the model is used in pooling mode instead of generative mode.
#### Text Embedding #### Text Embedding
Specified using `--task embed`.
| Architecture | Models | Example HF Models | [LoRA](../features/lora.md) | [PP](../serving/distributed_serving.md) | [V1](gh-issue:8779) | | Architecture | Models | Example HF Models | [LoRA](../features/lora.md) | [PP](../serving/distributed_serving.md) | [V1](gh-issue:8779) |
|--------------|--------|-------------------|----------------------|---------------------------|---------------------| |--------------|--------|-------------------|----------------------|---------------------------|---------------------|
| `BertModel` | BERT-based | `BAAI/bge-base-en-v1.5`, `Snowflake/snowflake-arctic-embed-xs`, etc. | | | | | `BertModel`<sup>C</sup> | BERT-based | `BAAI/bge-base-en-v1.5`, `Snowflake/snowflake-arctic-embed-xs`, etc. | | | |
| `Gemma2Model` | Gemma 2-based | `BAAI/bge-multilingual-gemma2`, etc. | ✅︎ | | ✅︎ | | `Gemma2Model`<sup>C</sup> | Gemma 2-based | `BAAI/bge-multilingual-gemma2`, etc. | ✅︎ | | ✅︎ |
| `GritLM` | GritLM | `parasail-ai/GritLM-7B-vllm`. | ✅︎ | ✅︎ | | | `GritLM` | GritLM | `parasail-ai/GritLM-7B-vllm`. | ✅︎ | ✅︎ | |
| `GteModel` | Arctic-Embed-2.0-M | `Snowflake/snowflake-arctic-embed-m-v2.0`. | | | | | `GteModel`<sup>C</sup> | Arctic-Embed-2.0-M | `Snowflake/snowflake-arctic-embed-m-v2.0`. | | | |
| `GteNewModel` | mGTE-TRM (see note) | `Alibaba-NLP/gte-multilingual-base`, etc. | | | | | `GteNewModel`<sup>C</sup> | mGTE-TRM (see note) | `Alibaba-NLP/gte-multilingual-base`, etc. | | | |
| `ModernBertModel` | ModernBERT-based | `Alibaba-NLP/gte-modernbert-base`, etc. | | | | | `ModernBertModel`<sup>C</sup> | ModernBERT-based | `Alibaba-NLP/gte-modernbert-base`, etc. | | | |
| `NomicBertModel` | Nomic BERT | `nomic-ai/nomic-embed-text-v1`, `nomic-ai/nomic-embed-text-v2-moe`, `Snowflake/snowflake-arctic-embed-m-long`, etc. | | | | | `NomicBertModel`<sup>C</sup> | Nomic BERT | `nomic-ai/nomic-embed-text-v1`, `nomic-ai/nomic-embed-text-v2-moe`, `Snowflake/snowflake-arctic-embed-m-long`, etc. | | | |
| `LlamaModel`, `LlamaForCausalLM`, `MistralModel`, etc. | Llama-based | `intfloat/e5-mistral-7b-instruct`, etc. | ✅︎ | ✅︎ | ✅︎ | | `LlamaModel`<sup>C</sup>, `LlamaForCausalLM`<sup>C</sup>, `MistralModel`<sup>C</sup>, etc. | Llama-based | `intfloat/e5-mistral-7b-instruct`, etc. | ✅︎ | ✅︎ | ✅︎ |
| `Qwen2Model`, `Qwen2ForCausalLM` | Qwen2-based | `ssmits/Qwen2-7B-Instruct-embed-base` (see note), `Alibaba-NLP/gte-Qwen2-7B-instruct` (see note), etc. | ✅︎ | ✅︎ | ✅︎ | | `Qwen2Model`<sup>C</sup>, `Qwen2ForCausalLM`<sup>C</sup> | Qwen2-based | `ssmits/Qwen2-7B-Instruct-embed-base` (see note), `Alibaba-NLP/gte-Qwen2-7B-instruct` (see note), etc. | ✅︎ | ✅︎ | ✅︎ |
| `Qwen3Model`, `Qwen3ForCausalLM` | Qwen3-based | `Qwen/Qwen3-Embedding-0.6B`, etc. | ✅︎ | ✅︎ | ✅︎ | | `Qwen3Model`<sup>C</sup>, `Qwen3ForCausalLM`<sup>C</sup> | Qwen3-based | `Qwen/Qwen3-Embedding-0.6B`, etc. | ✅︎ | ✅︎ | ✅︎ |
| `RobertaModel`, `RobertaForMaskedLM` | RoBERTa-based | `sentence-transformers/all-roberta-large-v1`, etc. | | | | | `RobertaModel`, `RobertaForMaskedLM` | RoBERTa-based | `sentence-transformers/all-roberta-large-v1`, etc. | | | |
| `*Model`<sup>C</sup>, `*ForCausalLM`<sup>C</sup>, etc. | Generative models | N/A | \* | \* | \* |
<sup>C</sup> Automatically converted into an embedding model via `--convert embed`. ([details](./pooling_models.md#model-conversion))
\* Feature support is the same as that of the original model.
!!! note !!! note
`ssmits/Qwen2-7B-Instruct-embed-base` has an improperly defined Sentence Transformers config. `ssmits/Qwen2-7B-Instruct-embed-base` has an improperly defined Sentence Transformers config.
...@@ -460,14 +459,16 @@ of the whole prompt are extracted from the normalized hidden state corresponding ...@@ -460,14 +459,16 @@ of the whole prompt are extracted from the normalized hidden state corresponding
#### Reward Modeling #### Reward Modeling
Specified using `--task reward`.
| Architecture | Models | Example HF Models | [LoRA](../features/lora.md) | [PP](../serving/distributed_serving.md) | [V1](gh-issue:8779) | | Architecture | Models | Example HF Models | [LoRA](../features/lora.md) | [PP](../serving/distributed_serving.md) | [V1](gh-issue:8779) |
|--------------|--------|-------------------|----------------------|---------------------------|---------------------| |--------------|--------|-------------------|----------------------|---------------------------|---------------------|
| `InternLM2ForRewardModel` | InternLM2-based | `internlm/internlm2-1_8b-reward`, `internlm/internlm2-7b-reward`, etc. | ✅︎ | ✅︎ | ✅︎ | | `InternLM2ForRewardModel` | InternLM2-based | `internlm/internlm2-1_8b-reward`, `internlm/internlm2-7b-reward`, etc. | ✅︎ | ✅︎ | ✅︎ |
| `LlamaForCausalLM` | Llama-based | `peiyi9979/math-shepherd-mistral-7b-prm`, etc. | ✅︎ | ✅︎ | ✅︎ | | `LlamaForCausalLM`<sup>C</sup> | Llama-based | `peiyi9979/math-shepherd-mistral-7b-prm`, etc. | ✅︎ | ✅︎ | ✅︎ |
| `Qwen2ForRewardModel` | Qwen2-based | `Qwen/Qwen2.5-Math-RM-72B`, etc. | ✅︎ | ✅︎ | ✅︎ | | `Qwen2ForRewardModel` | Qwen2-based | `Qwen/Qwen2.5-Math-RM-72B`, etc. | ✅︎ | ✅︎ | ✅︎ |
| `Qwen2ForProcessRewardModel` | Qwen2-based | `Qwen/Qwen2.5-Math-PRM-7B`, etc. | ✅︎ | ✅︎ | ✅︎ | | `Qwen2ForProcessRewardModel` | Qwen2-based | `Qwen/Qwen2.5-Math-PRM-7B`, etc. | ✅︎ | ✅︎ | ✅︎ |
| `*Model`<sup>C</sup>, `*ForCausalLM`<sup>C</sup>, etc. | Generative models | N/A | \* | \* | \* |
<sup>C</sup> Automatically converted into a reward model via `--convert reward`. ([details](./pooling_models.md#model-conversion))
\* Feature support is the same as that of the original model.
If your model is not in the above list, we will try to automatically convert the model using If your model is not in the above list, we will try to automatically convert the model using
[as_reward_model][vllm.model_executor.models.adapters.as_reward_model]. By default, we return the hidden states of each token directly. [as_reward_model][vllm.model_executor.models.adapters.as_reward_model]. By default, we return the hidden states of each token directly.
...@@ -478,28 +479,31 @@ If your model is not in the above list, we will try to automatically convert the ...@@ -478,28 +479,31 @@ If your model is not in the above list, we will try to automatically convert the
#### Classification #### Classification
Specified using `--task classify`.
| Architecture | Models | Example HF Models | [LoRA](../features/lora.md) | [PP](../serving/distributed_serving.md) | [V1](gh-issue:8779) | | Architecture | Models | Example HF Models | [LoRA](../features/lora.md) | [PP](../serving/distributed_serving.md) | [V1](gh-issue:8779) |
|--------------|--------|-------------------|----------------------|---------------------------|---------------------| |--------------|--------|-------------------|----------------------|---------------------------|---------------------|
| `JambaForSequenceClassification` | Jamba | `ai21labs/Jamba-tiny-reward-dev`, etc. | ✅︎ | ✅︎ | | | `JambaForSequenceClassification` | Jamba | `ai21labs/Jamba-tiny-reward-dev`, etc. | ✅︎ | ✅︎ | |
| `GPT2ForSequenceClassification` | GPT2 | `nie3e/sentiment-polish-gpt2-small` | | | ✅︎ | | `GPT2ForSequenceClassification` | GPT2 | `nie3e/sentiment-polish-gpt2-small` | | | ✅︎ |
| `*Model`<sup>C</sup>, `*ForCausalLM`<sup>C</sup>, etc. | Generative models | N/A | \* | \* | \* |
<sup>C</sup> Automatically converted into a classification model via `--convert classify`. ([details](./pooling_models.md#model-conversion))
\* Feature support is the same as that of the original model.
If your model is not in the above list, we will try to automatically convert the model using If your model is not in the above list, we will try to automatically convert the model using
[as_seq_cls_model][vllm.model_executor.models.adapters.as_seq_cls_model]. By default, the class probabilities are extracted from the softmaxed hidden state corresponding to the last token. [as_seq_cls_model][vllm.model_executor.models.adapters.as_seq_cls_model]. By default, the class probabilities are extracted from the softmaxed hidden state corresponding to the last token.
#### Sentence Pair Scoring #### Sentence Pair Scoring
Specified using `--task score`. | Architecture | Models | Example HF Models | [LoRA](../features/lora.md) | [PP](../serving/distributed_serving.md) | [V1](gh-issue:8779) |
|--------------|--------|-------------------|----------------------|---------------------------|---------------------|
| `BertForSequenceClassification` | BERT-based | `cross-encoder/ms-marco-MiniLM-L-6-v2`, etc. | | | |
| `GemmaForSequenceClassification` | Gemma-based | `BAAI/bge-reranker-v2-gemma` (see note), etc. | ✅︎ | ✅︎ | ✅︎ |
| `Qwen2ForSequenceClassification` | Qwen2-based | `mixedbread-ai/mxbai-rerank-base-v2` (see note), etc. | ✅︎ | ✅︎ | ✅︎ |
| `Qwen3ForSequenceClassification` | Qwen3-based | `tomaarsen/Qwen3-Reranker-0.6B-seq-cls`, `Qwen/Qwen3-Reranker-0.6B` (see note), etc. | ✅︎ | ✅︎ | ✅︎ |
| `RobertaForSequenceClassification` | RoBERTa-based | `cross-encoder/quora-roberta-base`, etc. | | | |
| `XLMRobertaForSequenceClassification` | XLM-RoBERTa-based | `BAAI/bge-reranker-v2-m3`, etc. | | | |
| Architecture | Models | Example HF Models | [V1](gh-issue:8779) | <sup>C</sup> Automatically converted into a classification model via `--convert classify`. ([details](./pooling_models.md#model-conversion))
|--------------|--------|-------------------|---------------------| \* Feature support is the same as that of the original model.
| `BertForSequenceClassification` | BERT-based | `cross-encoder/ms-marco-MiniLM-L-6-v2`, etc. | |
| `GemmaForSequenceClassification` | Gemma-based | `BAAI/bge-reranker-v2-gemma` (see note), etc. | |
| `Qwen2ForSequenceClassification` | Qwen2-based | `mixedbread-ai/mxbai-rerank-base-v2` (see note), etc. | ✅︎ |
| `Qwen3ForSequenceClassification` | Qwen3-based | `tomaarsen/Qwen3-Reranker-0.6B-seq-cls`, `Qwen/Qwen3-Reranker-0.6B` (see note), etc. | ✅︎ |
| `RobertaForSequenceClassification` | RoBERTa-based | `cross-encoder/quora-roberta-base`, etc. | |
| `XLMRobertaForSequenceClassification` | XLM-RoBERTa-based | `BAAI/bge-reranker-v2-m3`, etc. | |
!!! note !!! note
Load the official original `BAAI/bge-reranker-v2-gemma` by using the following command. Load the official original `BAAI/bge-reranker-v2-gemma` by using the following command.
...@@ -575,8 +579,6 @@ See [this page](generative_models.md) for more information on how to use generat ...@@ -575,8 +579,6 @@ See [this page](generative_models.md) for more information on how to use generat
#### Text Generation #### Text Generation
Specified using `--task generate`.
| Architecture | Models | Inputs | Example HF Models | [LoRA](../features/lora.md) | [PP](../serving/distributed_serving.md) | [V1](gh-issue:8779) | | Architecture | Models | Inputs | Example HF Models | [LoRA](../features/lora.md) | [PP](../serving/distributed_serving.md) | [V1](gh-issue:8779) |
|--------------|--------|--------|-------------------|----------------------|---------------------------|---------------------| |--------------|--------|--------|-------------------|----------------------|---------------------------|---------------------|
| `AriaForConditionalGeneration` | Aria | T + I<sup>+</sup> | `rhymes-ai/Aria` | | | ✅︎ | | `AriaForConditionalGeneration` | Aria | T + I<sup>+</sup> | `rhymes-ai/Aria` | | | ✅︎ |
...@@ -705,8 +707,6 @@ Some models are supported only via the [Transformers backend](#transformers). Th ...@@ -705,8 +707,6 @@ Some models are supported only via the [Transformers backend](#transformers). Th
#### Transcription #### Transcription
Specified using `--task transcription`.
Speech2Text models trained specifically for Automatic Speech Recognition. Speech2Text models trained specifically for Automatic Speech Recognition.
| Architecture | Models | Example HF Models | [LoRA](../features/lora.md) | [PP](../serving/distributed_serving.md) | [V1](gh-issue:8779) | | Architecture | Models | Example HF Models | [LoRA](../features/lora.md) | [PP](../serving/distributed_serving.md) | [V1](gh-issue:8779) |
...@@ -719,14 +719,10 @@ See [this page](./pooling_models.md) for more information on how to use pooling ...@@ -719,14 +719,10 @@ See [this page](./pooling_models.md) for more information on how to use pooling
!!! important !!! important
Since some model architectures support both generative and pooling tasks, Since some model architectures support both generative and pooling tasks,
you should explicitly specify the task type to ensure that the model is used in pooling mode instead of generative mode. you should explicitly specify `--runner pooling` to ensure that the model is used in pooling mode instead of generative mode.
#### Text Embedding #### Text Embedding
Specified using `--task embed`.
Any text generation model can be converted into an embedding model by passing `--task embed`.
!!! note !!! note
To get the best results, you should use pooling models that are specifically trained as such. To get the best results, you should use pooling models that are specifically trained as such.
...@@ -734,19 +730,24 @@ The following table lists those that are tested in vLLM. ...@@ -734,19 +730,24 @@ The following table lists those that are tested in vLLM.
| Architecture | Models | Inputs | Example HF Models | [LoRA](../features/lora.md) | [PP](../serving/distributed_serving.md) | [V1](gh-issue:8779) | | Architecture | Models | Inputs | Example HF Models | [LoRA](../features/lora.md) | [PP](../serving/distributed_serving.md) | [V1](gh-issue:8779) |
|--------------|--------|--------|-------------------|----------------------|---------------------------|---------------------| |--------------|--------|--------|-------------------|----------------------|---------------------------|---------------------|
| `LlavaNextForConditionalGeneration` | LLaVA-NeXT-based | T / I | `royokong/e5-v` | | | | | `LlavaNextForConditionalGeneration`<sup>C</sup> | LLaVA-NeXT-based | T / I | `royokong/e5-v` | | | |
| `Phi3VForCausalLM` | Phi-3-Vision-based | T + I | `TIGER-Lab/VLM2Vec-Full` | 🚧 | ✅︎ | | | `Phi3VForCausalLM`<sup>C</sup> | Phi-3-Vision-based | T + I | `TIGER-Lab/VLM2Vec-Full` | 🚧 | ✅︎ | |
| `*ForConditionalGeneration`<sup>C</sup>, `*ForCausalLM`<sup>C</sup>, etc. | Generative models | \* | N/A | \* | \* | \* |
<sup>C</sup> Automatically converted into an embedding model via `--convert embed`. ([details](./pooling_models.md#model-conversion))
\* Feature support is the same as that of the original model.
--- ---
#### Scoring #### Scoring
Specified using `--task score`.
| Architecture | Models | Inputs | Example HF Models | [LoRA][lora-adapter] | [PP][distributed-serving] | [V1](gh-issue:8779) | | Architecture | Models | Inputs | Example HF Models | [LoRA][lora-adapter] | [PP][distributed-serving] | [V1](gh-issue:8779) |
|-------------------------------------|--------------------|----------|--------------------------|------------------------|-----------------------------|-----------------------| |-------------------------------------|--------------------|----------|--------------------------|------------------------|-----------------------------|-----------------------|
| `JinaVLForSequenceClassification` | JinaVL-based | T + I<sup>E+</sup> | `jinaai/jina-reranker-m0`, etc. | | | ✅︎ | | `JinaVLForSequenceClassification` | JinaVL-based | T + I<sup>E+</sup> | `jinaai/jina-reranker-m0`, etc. | | | ✅︎ |
<sup>C</sup> Automatically converted into a classification model via `--convert classify`. ([details](./pooling_models.md#model-conversion))
\* Feature support is the same as that of the original model.
## Model Support Policy ## Model Support Policy
At vLLM, we are committed to facilitating the integration and support of third-party models within our ecosystem. Our approach is designed to balance the need for robustness and the practical limitations of supporting a wide range of models. Here’s how we manage third-party model support: At vLLM, we are committed to facilitating the integration and support of third-party models within our ecosystem. Our approach is designed to balance the need for robustness and the practical limitations of supporting a wide range of models. Here’s how we manage third-party model support:
......
@@ -45,17 +45,17 @@ To call the server, in your preferred text editor, create a script that uses an
 We currently support the following OpenAI APIs:
 - [Completions API][completions-api] (`/v1/completions`)
-    - Only applicable to [text generation models](../models/generative_models.md) (`--task generate`).
+    - Only applicable to [text generation models](../models/generative_models.md).
     - *Note: `suffix` parameter is not supported.*
 - [Chat Completions API][chat-api] (`/v1/chat/completions`)
-    - Only applicable to [text generation models](../models/generative_models.md) (`--task generate`) with a [chat template][chat-template].
+    - Only applicable to [text generation models](../models/generative_models.md) with a [chat template][chat-template].
     - *Note: `parallel_tool_calls` and `user` parameters are ignored.*
 - [Embeddings API][embeddings-api] (`/v1/embeddings`)
-    - Only applicable to [embedding models](../models/pooling_models.md) (`--task embed`).
+    - Only applicable to [embedding models](../models/pooling_models.md).
 - [Transcriptions API][transcriptions-api] (`/v1/audio/transcriptions`)
-    - Only applicable to Automatic Speech Recognition (ASR) models (OpenAI Whisper) (`--task generate`).
+    - Only applicable to [Automatic Speech Recognition (ASR) models](../models/supported_models.md#transcription).
 - [Translation API][translations-api] (`/v1/audio/translations`)
-    - Only applicable to Automatic Speech Recognition (ASR) models (OpenAI Whisper) (`--task generate`).
+    - Only applicable to [Automatic Speech Recognition (ASR) models](../models/supported_models.md#transcription).
 In addition, we have the following custom APIs:
@@ -64,14 +64,14 @@ In addition, we have the following custom APIs:
 - [Pooling API][pooling-api] (`/pooling`)
     - Applicable to all [pooling models](../models/pooling_models.md).
 - [Classification API][classification-api] (`/classify`)
-    - Only applicable to [classification models](../models/pooling_models.md) (`--task classify`).
+    - Only applicable to [classification models](../models/pooling_models.md).
 - [Score API][score-api] (`/score`)
-    - Applicable to embedding models and [cross-encoder models](../models/pooling_models.md) (`--task score`).
+    - Applicable to [embedding models and cross-encoder models](../models/pooling_models.md).
 - [Re-rank API][rerank-api] (`/rerank`, `/v1/rerank`, `/v2/rerank`)
     - Implements [Jina AI's v1 re-rank API](https://jina.ai/reranker/)
     - Also compatible with [Cohere's v1 & v2 re-rank APIs](https://docs.cohere.com/v2/reference/rerank)
     - Jina and Cohere's APIs are very similar; Jina's includes extra information in the rerank endpoint's response.
-    - Only applicable to [cross-encoder models](../models/pooling_models.md) (`--task score`).
+    - Only applicable to [cross-encoder models](../models/pooling_models.md).
 [](){ #chat-template }
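As a rough illustration of the endpoint list in this hunk, the sketch below builds (but does not send) a request for the Embeddings API at `/v1/embeddings`. The helper name `build_embeddings_request`, the base URL `http://localhost:8000`, and the choice of `intfloat/e5-mistral-7b-instruct` as the served model are assumptions for illustration only; the JSON body follows the OpenAI embeddings request shape (`model`, `input`).

```python
import json
import urllib.request


def build_embeddings_request(
    base_url: str, model: str, texts: list[str]
) -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible embeddings endpoint.

    Hypothetical helper, not part of vLLM. It only constructs the request;
    nothing is sent until the caller passes it to urllib.request.urlopen().
    """
    payload = {"model": model, "input": texts}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/embeddings",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Assumed setup: a server started with
#   vllm serve intfloat/e5-mistral-7b-instruct --runner pooling
# and listening on localhost:8000.
request = build_embeddings_request(
    "http://localhost:8000",
    "intfloat/e5-mistral-7b-instruct",
    ["Hello, my name is"],
)
```

Sending the request is left to the caller (for example `urllib.request.urlopen(request)`), since it needs a running server.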
@@ -250,14 +250,14 @@ and passing a list of `messages` in the request. Refer to the examples below for
 To serve the model:
 ```bash
-vllm serve TIGER-Lab/VLM2Vec-Full --task embed \
+vllm serve TIGER-Lab/VLM2Vec-Full --runner pooling \
   --trust-remote-code \
   --max-model-len 4096 \
   --chat-template examples/template_vlm2vec.jinja
 ```
 !!! important
-    Since VLM2Vec has the same model architecture as Phi-3.5-Vision, we have to explicitly pass `--task embed`
+    Since VLM2Vec has the same model architecture as Phi-3.5-Vision, we have to explicitly pass `--runner pooling`
     to run this model in embedding mode instead of text generation mode.
     The custom chat template is completely different from the original one for this model,
@@ -296,14 +296,14 @@ and passing a list of `messages` in the request. Refer to the examples below for
 To serve the model:
 ```bash
-vllm serve MrLight/dse-qwen2-2b-mrl-v1 --task embed \
+vllm serve MrLight/dse-qwen2-2b-mrl-v1 --runner pooling \
   --trust-remote-code \
   --max-model-len 8192 \
   --chat-template examples/template_dse_qwen2_vl.jinja
 ```
 !!! important
-    Like with VLM2Vec, we have to explicitly pass `--task embed`.
+    Like with VLM2Vec, we have to explicitly pass `--runner pooling`.
     Additionally, `MrLight/dse-qwen2-2b-mrl-v1` requires an EOS token for embeddings, which is handled
     by a custom chat template: <gh-file:examples/template_dse_qwen2_vl.jinja>
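The flag replacements visible in this diff can be summarised as a small lookup: `--task generate` becomes `--runner generate`, while `--task embed`, `classify`, `score` and `reward` all become `--runner pooling`. The sketch below applies that lookup to an argument list. `migrate_task_flag` and `TASK_TO_RUNNER` are hypothetical names, not part of vLLM, and the helper deliberately never emits `--convert`, because no command in this diff shows when that flag is needed.

```python
# Hypothetical migration helper; it encodes only the replacements that
# appear in this diff and is not an official vLLM utility.
TASK_TO_RUNNER = {
    "generate": "generate",
    "embed": "pooling",
    "classify": "pooling",
    "score": "pooling",
    "reward": "pooling",
}


def migrate_task_flag(argv: list[str]) -> list[str]:
    """Return a copy of argv with `--task X` rewritten to `--runner Y`."""
    out: list[str] = []
    i = 0
    while i < len(argv):
        arg = argv[i]
        if arg == "--task" and i + 1 < len(argv):
            # Space-separated form: --task embed
            value = argv[i + 1]
            i += 2
        elif arg.startswith("--task="):
            # Equals form: --task=embed
            value = arg.split("=", 1)[1]
            i += 1
        else:
            # Any other argument is passed through unchanged.
            out.append(arg)
            i += 1
            continue
        if value not in TASK_TO_RUNNER:
            raise ValueError(f"no known replacement for --task {value}")
        out += ["--runner", TASK_TO_RUNNER[value]]
    return out


old_cmd = ["vllm", "serve", "TIGER-Lab/VLM2Vec-Full", "--task", "embed",
           "--trust-remote-code"]
new_cmd = migrate_task_flag(old_cmd)
```

Values outside the lookup raise an error rather than being guessed, so a command using a task not covered here has to be migrated by hand.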
......
...@@ -12,7 +12,9 @@ def parse_args(): ...@@ -12,7 +12,9 @@ def parse_args():
parser = EngineArgs.add_cli_args(parser) parser = EngineArgs.add_cli_args(parser)
# Set example specific arguments # Set example specific arguments
parser.set_defaults( parser.set_defaults(
model="jason9693/Qwen2.5-1.5B-apeach", task="classify", enforce_eager=True model="jason9693/Qwen2.5-1.5B-apeach",
runner="pooling",
enforce_eager=True,
) )
return parser.parse_args() return parser.parse_args()
...@@ -27,7 +29,7 @@ def main(args: Namespace): ...@@ -27,7 +29,7 @@ def main(args: Namespace):
] ]
# Create an LLM. # Create an LLM.
# You should pass task="classify" for classification models # You should pass runner="pooling" for classification models
llm = LLM(**vars(args)) llm = LLM(**vars(args))
# Generate logits. The output is a list of ClassificationRequestOutputs. # Generate logits. The output is a list of ClassificationRequestOutputs.
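The example scripts in this diff rely on `parser.set_defaults(...)` to preselect `runner="pooling"` while still letting the user override it on the command line. A minimal sketch of that behaviour using only the standard library follows; `make_parser` and its three options are stand-ins for the much larger parser that `EngineArgs.add_cli_args` produces, so the option definitions here are assumptions.

```python
import argparse


def make_parser() -> argparse.ArgumentParser:
    """Stand-in parser (an assumption) with a few options named like the real ones."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--model", default=None)
    parser.add_argument("--runner", default=None)
    parser.add_argument("--enforce-eager", action="store_true")
    # Example-specific defaults, mirroring the updated script in the diff.
    parser.set_defaults(
        model="jason9693/Qwen2.5-1.5B-apeach",
        runner="pooling",
        enforce_eager=True,
    )
    return parser


# With no CLI arguments, the example-specific defaults apply.
default_args = make_parser().parse_args([])

# An explicit flag on the command line still takes precedence.
override_args = make_parser().parse_args(["--runner", "generate"])
```

Because the defaults are set at the parser level, the resulting namespace can be expanded into keyword arguments the same way the scripts do with `LLM(**vars(args))`.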
......
@@ -13,7 +13,7 @@ def parse_args():
     # Set example specific arguments
     parser.set_defaults(
         model="intfloat/e5-mistral-7b-instruct",
-        task="embed",
+        runner="pooling",
         enforce_eager=True,
         max_model_len=1024,
     )
@@ -30,7 +30,7 @@ def main(args: Namespace):
     ]
     # Create an LLM.
-    # You should pass task="embed" for embedding models
+    # You should pass runner="pooling" for embedding models
     llm = LLM(**vars(args))
     # Generate embedding. The output is a list of EmbeddingRequestOutputs.
......
@@ -12,7 +12,9 @@ def parse_args():
     parser = EngineArgs.add_cli_args(parser)
     # Set example specific arguments
     parser.set_defaults(
-        model="BAAI/bge-reranker-v2-m3", task="score", enforce_eager=True
+        model="BAAI/bge-reranker-v2-m3",
+        runner="pooling",
+        enforce_eager=True,
     )
     return parser.parse_args()
@@ -26,7 +28,7 @@ def main(args: Namespace):
     ]
     # Create an LLM.
-    # You should pass task="score" for cross-encoder models
+    # You should pass runner="pooling" for cross-encoder models
     llm = LLM(**vars(args))
     # Generate scores. The output is a list of ScoringRequestOutputs.
......
@@ -12,7 +12,9 @@ def parse_args():
     parser = EngineArgs.add_cli_args(parser)
     # Set example specific arguments
     parser.set_defaults(
-        model="jinaai/jina-embeddings-v3", task="embed", trust_remote_code=True
+        model="jinaai/jina-embeddings-v3",
+        runner="pooling",
+        trust_remote_code=True,
     )
     return parser.parse_args()
@@ -29,7 +31,7 @@ def main(args: Namespace):
     ]
     # Create an LLM.
-    # You should pass task="embed" for embedding models
+    # You should pass runner="pooling" for embedding models
     llm = LLM(**vars(args))
     # Generate embedding. The output is a list of EmbeddingRequestOutputs.
......
@@ -12,7 +12,9 @@ def parse_args():
     parser = EngineArgs.add_cli_args(parser)
     # Set example specific arguments
     parser.set_defaults(
-        model="jinaai/jina-embeddings-v3", task="embed", trust_remote_code=True
+        model="jinaai/jina-embeddings-v3",
+        runner="pooling",
+        trust_remote_code=True,
     )
     return parser.parse_args()
@@ -29,7 +31,7 @@ def main(args: Namespace):
     ]
     # Create an LLM.
-    # You should pass task="embed" for embedding models
+    # You should pass runner="pooling" for embedding models
     llm = LLM(**vars(args))
     # Generate embedding. The output is a list of EmbeddingRequestOutputs.
......
@@ -17,7 +17,7 @@ model_name = "Qwen/Qwen3-Reranker-0.6B"
 # Models converted offline using this method can not only be more efficient
 # and support the vllm score API, but also make the init parameters more
 # concise, for example.
-# llm = LLM(model="tomaarsen/Qwen3-Reranker-0.6B-seq-cls", task="score")
+# llm = LLM(model="tomaarsen/Qwen3-Reranker-0.6B-seq-cls", runner="pooling")
 # If you want to load the official original version, the init parameters are
 # as follows.
@@ -27,7 +27,7 @@ def get_llm() -> LLM:
     """Initializes and returns the LLM model for Qwen3-Reranker."""
     return LLM(
         model=model_name,
-        task="score",
+        runner="pooling",
         hf_overrides={
             "architectures": ["Qwen3ForSequenceClassification"],
             "classifier_from_token": ["no", "yes"],
......
@@ -70,7 +70,7 @@ def run_e5_v(query: Query) -> ModelRequestData:
     engine_args = EngineArgs(
         model="royokong/e5-v",
-        task="embed",
+        runner="pooling",
         max_model_len=4096,
         limit_mm_per_prompt={"image": 1},
     )
@@ -102,7 +102,7 @@ def run_vlm2vec(query: Query) -> ModelRequestData:
     engine_args = EngineArgs(
         model="TIGER-Lab/VLM2Vec-Full",
-        task="embed",
+        runner="pooling",
         max_model_len=4096,
         trust_remote_code=True,
         mm_processor_kwargs={"num_crops": 4},
@@ -122,7 +122,7 @@ def run_jinavl_reranker(query: Query) -> ModelRequestData:
     engine_args = EngineArgs(
         model="jinaai/jina-reranker-m0",
-        task="score",
+        runner="pooling",
         max_model_len=32768,
         trust_remote_code=True,
         mm_processor_kwargs={
......
@@ -9,7 +9,7 @@ Launch the vLLM server with the following command:
 vllm serve llava-hf/llava-1.5-7b-hf
 (multi-image inference with Phi-3.5-vision-instruct)
-vllm serve microsoft/Phi-3.5-vision-instruct --task generate \
+vllm serve microsoft/Phi-3.5-vision-instruct --runner generate \
     --trust-remote-code --max-model-len 4096 --limit-mm-per-prompt '{"image":2}'
 (audio inference with Ultravox)
......
@@ -92,7 +92,7 @@ def dse_qwen2_vl(inp: dict):
 def parse_args():
     parser = argparse.ArgumentParser(
         "Script to call a specified VLM through the API. Make sure to serve "
-        "the model with --task embed before running this."
+        "the model with `--runner pooling` before running this."
     )
     parser.add_argument(
         "--model",
......
@@ -3,7 +3,7 @@
 """
 Example online usage of Score API.
-Run `vllm serve <model> --task score` to start up the server in vLLM.
+Run `vllm serve <model> --runner pooling` to start up the server in vLLM.
 """
 import argparse
......
@@ -3,7 +3,7 @@
 """
 Example online usage of Score API.
-Run `vllm serve <model> --task score` to start up the server in vLLM.
+Run `vllm serve <model> --runner pooling` to start up the server in vLLM.
 """
 import argparse
......
@@ -3,7 +3,7 @@
 """
 Example online usage of Pooling API.
-Run `vllm serve <model> --task <embed|classify|reward|score>`
+Run `vllm serve <model> --runner pooling`
 to start up the server in vLLM.
 """
......
@@ -10,7 +10,7 @@ This script demonstrates how to:
 Run the vLLM server first:
 vllm serve meta-llama/Llama-3.2-1B-Instruct \
-    --task generate \
+    --runner generate \
     --max-model-len 4096 \
     --enable-prompt-embeds
......
@@ -148,9 +148,6 @@ def async_tp_pass_on_test_model(local_rank: int, world_size: int,
     # in the vllm_config, it's not really used.
     model_name = "nm-testing/TinyLlama-1.1B-Chat-v1.0-FP8-e2e"
     vllm_config.model_config = ModelConfig(model=model_name,
-                                           task="auto",
-                                           tokenizer=model_name,
-                                           tokenizer_mode="auto",
                                            trust_remote_code=True,
                                            dtype=dtype,
                                            seed=42)
......