Commit 3de2b1ea authored by Cyrus Leung, committed by GitHub

[Doc] Show default pooling method in a table (#11904)


Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
parent b844b99a
@@ -8,14 +8,14 @@ In vLLM, generative models implement the {class}`~vllm.model_executor.models.Vll
 Based on the final hidden states of the input, these models output log probabilities of the tokens to generate,
 which are then passed through {class}`~vllm.model_executor.layers.Sampler` to obtain the final text.
+For generative models, the only supported `--task` option is `"generate"`.
+Usually, this is automatically inferred so you don't have to specify it.
 ## Offline Inference
 The {class}`~vllm.LLM` class provides various methods for offline inference.
 See [Engine Arguments](#engine-args) for a list of options when initializing the model.
-For generative models, the only supported {code}`task` option is {code}`"generate"`.
-Usually, this is automatically inferred so you don't have to specify it.
 ### `LLM.generate`
 The {class}`~vllm.LLM.generate` method is available to all generative models in vLLM.
@@ -33,7 +33,7 @@ for output in outputs:
 ```
 You can optionally control the language generation by passing {class}`~vllm.SamplingParams`.
-For example, you can use greedy sampling by setting {code}`temperature=0`:
+For example, you can use greedy sampling by setting `temperature=0`:
 ```python
 llm = LLM(model="facebook/opt-125m")
......
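The hunk above says that `temperature=0` selects greedy sampling. As a rough illustration of why, the following sketch shows a temperature-scaled sampler over raw logits in plain Python. It does not use vLLM, and `sample_token` is a hypothetical helper written for this note, not part of {class}`~vllm.model_executor.layers.Sampler`; it assumes the common convention that a temperature of zero means "take the argmax".

```python
import math
import random


def sample_token(logits, temperature, rng=None):
    """Pick a token index from raw logits (illustrative only, not vLLM code).

    A temperature of 0 is treated as greedy decoding: return the index of
    the largest logit. Otherwise, sample from softmax(logits / temperature).
    """
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])

    rng = rng or random.Random()
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max before exp() for numerical stability
    weights = [math.exp(x - peak) for x in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


logits = [1.0, 3.5, 0.2]
greedy = sample_token(logits, temperature=0)      # always the argmax index
sampled = sample_token(logits, temperature=1.0)   # may vary between calls
```

With greedy decoding the result is deterministic for a given set of logits, which is why it is often used when reproducible output matters more than diversity.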
@@ -14,30 +14,53 @@ As shown in the [Compatibility Matrix](#compatibility-matrix), most vLLM feature
 pooling models as they only work on the generation or decode stage, so performance may not improve as much.
 ```
-## Offline Inference
-The {class}`~vllm.LLM` class provides various methods for offline inference.
-See [Engine Arguments](#engine-args) for a list of options when initializing the model.
-For pooling models, we support the following {code}`task` options:
-- Embedding ({code}`"embed"` / {code}`"embedding"`)
-- Classification ({code}`"classify"`)
-- Sentence Pair Scoring ({code}`"score"`)
-- Reward Modeling ({code}`"reward"`)
-The selected task determines the default {class}`~vllm.model_executor.layers.Pooler` that is used:
-- Embedding: Extract only the hidden states corresponding to the last token, and apply normalization.
-- Classification: Extract only the hidden states corresponding to the last token, and apply softmax.
-- Sentence Pair Scoring: Extract only the hidden states corresponding to the last token, and apply softmax.
-- Reward Modeling: Extract all of the hidden states and return them directly.
+For pooling models, we support the following `--task` options.
+The selected option sets the default pooler used to extract the final hidden states:
+```{list-table}
+:widths: 50 25 25 25
+:header-rows: 1
+* - Task
+  - Pooling Type
+  - Normalization
+  - Softmax
+* - Embedding (`embed`)
+  - `LAST`
+  - ✅︎
+  - ✗
+* - Classification (`classify`)
+  - `LAST`
+  - ✗
+  - ✅︎
+* - Sentence Pair Scoring (`score`)
+  - \*
+  - \*
+  - \*
+* - Reward Modeling (`reward`)
+  - `ALL`
+  - ✗
+  - ✗
+```
+\*The default pooler is always defined by the model.
+```{note}
+If the model's implementation in vLLM defines its own pooler, the default pooler is set to that instead of the one specified in this table.
+```
 When loading [Sentence Transformers](https://huggingface.co/sentence-transformers) models,
-we attempt to override the default pooler based on its Sentence Transformers configuration file ({code}`modules.json`).
-You can customize the model's pooling method via the {code}`override_pooler_config` option,
+we attempt to override the default pooler based on its Sentence Transformers configuration file (`modules.json`).
+```{tip}
+You can customize the model's pooling method via the `--override-pooler-config` option,
 which takes priority over both the model's and Sentence Transformers's defaults.
+```
+## Offline Inference
+The {class}`~vllm.LLM` class provides various methods for offline inference.
+See [Engine Arguments](#engine-args) for a list of options when initializing the model.
 ### `LLM.encode`
......
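The pooling defaults and override rules described in the hunk above can be summarized as a small lookup plus a priority chain. The sketch below is plain Python and does not import vLLM; `PoolerDefaults`, `TASK_DEFAULTS`, and `resolve_pooler` are hypothetical names invented for illustration. The per-task values are transcribed from the table in the diff. The ordering between the Sentence Transformers configuration and a model-defined pooler is an assumption inferred from the prose, which only states explicitly that `--override-pooler-config` wins over both.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PoolerDefaults:
    """Hypothetical record of one table row (not a vLLM class)."""
    pooling_type: str
    normalize: bool
    softmax: bool


# Per-task defaults as listed in the table. "score" is left out on purpose:
# the table marks it with "*", meaning the model always defines its pooler.
TASK_DEFAULTS = {
    "embed": PoolerDefaults("LAST", normalize=True, softmax=False),
    "classify": PoolerDefaults("LAST", normalize=False, softmax=True),
    "reward": PoolerDefaults("ALL", normalize=False, softmax=False),
}


def resolve_pooler(
    task: str,
    model_default: Optional[PoolerDefaults] = None,
    sentence_transformers: Optional[PoolerDefaults] = None,
    override: Optional[PoolerDefaults] = None,
) -> PoolerDefaults:
    """Return the first available setting, highest priority first.

    Assumed order: user override, then Sentence Transformers config,
    then the model's own pooler, then the per-task table default.
    """
    for candidate in (override, sentence_transformers, model_default):
        if candidate is not None:
            return candidate
    if task not in TASK_DEFAULTS:
        raise ValueError(f"no table default for task {task!r}; the model must define it")
    return TASK_DEFAULTS[task]


# With nothing else configured, the table default applies.
embed_default = resolve_pooler("embed")

# A user-supplied override takes priority ("MEAN" is an illustrative value).
custom = PoolerDefaults("MEAN", normalize=False, softmax=False)
embed_custom = resolve_pooler("embed", model_default=embed_default, override=custom)
```

Keeping `score` out of the lookup mirrors the footnote in the table: there is no generic fallback for that task, so the sketch raises an error instead of guessing one.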