Unverified Commit 483ea646 authored by Harry Mellor's avatar Harry Mellor Committed by GitHub
Browse files

[Docs] Replace all explicit anchors with real links (#27087)


Signed-off-by: default avatarHarry Mellor <19981378+hmellor@users.noreply.github.com>
parent e20eba75
......@@ -2,8 +2,8 @@
This guide will help you quickly get started with vLLM to perform:
- [Offline batched inference][quickstart-offline]
- [Online serving using OpenAI-compatible server][quickstart-online]
- [Offline batched inference](#offline-batched-inference)
- [Online serving using OpenAI-compatible server](#openai-compatible-server)
## Prerequisites
......@@ -42,8 +42,6 @@ uv pip install vllm --torch-backend=auto
!!! note
For more detail and non-CUDA platforms, please refer [here](installation/README.md) for specific instructions on how to install vLLM.
[](){ #quickstart-offline }
## Offline Batched Inference
With vLLM installed, you can start generating texts for list of input prompts (i.e. offline batch inferencing). See the example script: [examples/offline_inference/basic/basic.py](../../examples/offline_inference/basic/basic.py)
......@@ -57,7 +55,7 @@ The first line of this example imports the classes [LLM][vllm.LLM] and [Sampling
from vllm import LLM, SamplingParams
```
The next section defines a list of input prompts and sampling parameters for text generation. The [sampling temperature](https://arxiv.org/html/2402.05201v1) is set to `0.8` and the [nucleus sampling probability](https://en.wikipedia.org/wiki/Top-p_sampling) is set to `0.95`. You can find more information about the sampling parameters [here][sampling-params].
The next section defines a list of input prompts and sampling parameters for text generation. The [sampling temperature](https://arxiv.org/html/2402.05201v1) is set to `0.8` and the [nucleus sampling probability](https://en.wikipedia.org/wiki/Top-p_sampling) is set to `0.95`. You can find more information about the sampling parameters [here](../api/README.md#inference-parameters).
!!! important
By default, vLLM will use sampling parameters recommended by model creator by applying the `generation_config.json` from the Hugging Face model repository if it exists. In most cases, this will provide you with the best results by default if [SamplingParams][vllm.SamplingParams] is not specified.
......@@ -135,8 +133,6 @@ for output in outputs:
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```
[](){ #quickstart-online }
## OpenAI-Compatible Server
vLLM can be deployed as a server that implements the OpenAI API protocol. This allows vLLM to be used as a drop-in replacement for applications using OpenAI API.
......@@ -150,7 +146,7 @@ vllm serve Qwen/Qwen2.5-1.5B-Instruct
!!! note
By default, the server uses a predefined chat template stored in the tokenizer.
You can learn about overriding it [here][chat-template].
You can learn about overriding it [here](../serving/openai_compatible_server.md#chat-template).
!!! important
By default, the server applies `generation_config.json` from the huggingface model repository if it exists. This means the default values of certain sampling parameters can be overridden by those recommended by the model creator.
......
......@@ -140,5 +140,5 @@ outputs = llm.chat(conversation, chat_template=custom_template)
Our [OpenAI-Compatible Server](../serving/openai_compatible_server.md) provides endpoints that correspond to the offline APIs:
- [Completions API][completions-api] is similar to `LLM.generate` but only accepts text.
- [Chat API][chat-api] is similar to `LLM.chat`, accepting both text and [multi-modal inputs](../features/multimodal_inputs.md) for models with a chat template.
- [Completions API](../serving/openai_compatible_server.md#completions-api) is similar to `LLM.generate` but only accepts text.
- [Chat API](../serving/openai_compatible_server.md#chat-api) is similar to `LLM.chat`, accepting both text and [multi-modal inputs](../features/multimodal_inputs.md) for models with a chat template.
......@@ -185,10 +185,10 @@ print(f"Data: {data!r}")
Our [OpenAI-Compatible Server](../serving/openai_compatible_server.md) provides endpoints that correspond to the offline APIs:
- [Pooling API][pooling-api] is similar to `LLM.encode`, being applicable to all types of pooling models.
- [Embeddings API][embeddings-api] is similar to `LLM.embed`, accepting both text and [multi-modal inputs](../features/multimodal_inputs.md) for embedding models.
- [Classification API][classification-api] is similar to `LLM.classify` and is applicable to sequence classification models.
- [Score API][score-api] is similar to `LLM.score` for cross-encoder models.
- [Pooling API](../serving/openai_compatible_server.md#pooling-api) is similar to `LLM.encode`, being applicable to all types of pooling models.
- [Embeddings API](../serving/openai_compatible_server.md#embeddings-api) is similar to `LLM.embed`, accepting both text and [multi-modal inputs](../features/multimodal_inputs.md) for embedding models.
- [Classification API](../serving/openai_compatible_server.md#classification-api) is similar to `LLM.classify` and is applicable to sequence classification models.
- [Score API](../serving/openai_compatible_server.md#score-api) is similar to `LLM.score` for cross-encoder models.
## Matryoshka Embeddings
......
......@@ -11,9 +11,7 @@ Alongside each architecture, we include some popular models that use it.
If vLLM natively supports a model, its implementation can be found in [vllm/model_executor/models](../../vllm/model_executor/models).
These models are what we list in [supported-text-models][supported-text-models] and [supported-mm-models][supported-mm-models].
[](){ #transformers-backend }
These models are what we list in [supported text models](#list-of-text-only-language-models) and [supported multimodal models](#list-of-multimodal-language-models).
### Transformers
......@@ -60,7 +58,7 @@ For a model to be compatible with the Transformers backend for vLLM it must:
- be a Transformers compatible custom model (see [Transformers - Customizing models](https://huggingface.co/docs/transformers/en/custom_models)):
- The model directory must have the correct structure (e.g. `config.json` is present).
- `config.json` must contain `auto_map.AutoModel`.
- be a Transformers backend for vLLM compatible model (see [writing-custom-models][writing-custom-models]):
- be a Transformers backend for vLLM compatible model (see [Writing custom models](#writing-custom-models)):
- Customisation should be done in the base model (e.g. in `MyModel`, not `MyModelForCausalLM`).
If the compatible model is:
......@@ -70,8 +68,6 @@ If the compatible model is:
This means that, with the Transformers backend for vLLM, new models can be used before they are officially supported in Transformers or vLLM!
[](){ #writing-custom-models }
#### Writing custom models
This section details the necessary modifications to make to a Transformers compatible custom model that make it compatible with the Transformers backend for vLLM. (We assume that a Transformers compatible custom model has already been created, see [Transformers - Customizing models](https://huggingface.co/docs/transformers/en/custom_models)).
......@@ -164,7 +160,7 @@ To determine whether a given model is natively supported, you can check the `con
If the `"architectures"` field contains a model architecture listed below, then it should be natively supported.
Models do not _need_ to be natively supported to be used in vLLM.
The [Transformers backend][transformers-backend] enables you to run models directly using their Transformers implementation (or even remote code on the Hugging Face Model Hub!).
The [Transformers backend](#transformers) enables you to run models directly using their Transformers implementation (or even remote code on the Hugging Face Model Hub!).
!!! tip
The easiest way to check if your model is really supported at runtime is to run the program below:
......@@ -306,8 +302,6 @@ output = llm.encode("Hello, my name is")
print(output)
```
[](){ #feature-status-legend }
## Feature Status Legend
- ✅︎ indicates that the feature is supported for the model.
......@@ -316,8 +310,6 @@ print(output)
- ⚠️ indicates that the feature is available but may have known issues or limitations.
[](){ #supported-text-models }
## List of Text-only Language Models
### Generative Models
......@@ -583,8 +575,6 @@ These models primarily support the [`LLM.encode`](./pooling_models.md#llmencode)
!!! note
Named Entity Recognition (NER) usage, please refer to [examples/offline_inference/pooling/ner.py](../../examples/offline_inference/pooling/ner.py), [examples/online_serving/pooling/ner_client.py](../../examples/online_serving/pooling/ner_client.py).
[](){ #supported-mm-models }
## List of Multimodal Language Models
The following modalities are supported depending on the model:
......
......@@ -4,7 +4,7 @@ For general troubleshooting, see [Troubleshooting](../usage/troubleshooting.md).
## Verify inter-node GPU communication
After you start the Ray cluster, verify GPU-to-GPU communication across nodes. Proper configuration can be non-trivial. For more information, see [troubleshooting script][troubleshooting-incorrect-hardware-driver]. If you need additional environment variables for communication configuration, append them to [examples/online_serving/run_cluster.sh](../../examples/online_serving/run_cluster.sh), for example `-e NCCL_SOCKET_IFNAME=eth0`. Setting environment variables during cluster creation is recommended because the variables propagate to all nodes. In contrast, setting environment variables in the shell affects only the local node. For more information, see <https://github.com/vllm-project/vllm/issues/6803>.
After you start the Ray cluster, verify GPU-to-GPU communication across nodes. Proper configuration can be non-trivial. For more information, see [troubleshooting script](../usage/troubleshooting.md#incorrect-hardwaredriver). If you need additional environment variables for communication configuration, append them to [examples/online_serving/run_cluster.sh](../../examples/online_serving/run_cluster.sh), for example `-e NCCL_SOCKET_IFNAME=eth0`. Setting environment variables during cluster creation is recommended because the variables propagate to all nodes. In contrast, setting environment variables in the shell affects only the local node. For more information, see <https://github.com/vllm-project/vllm/issues/6803>.
## No available node types can fulfill resource request
......
......@@ -19,7 +19,7 @@ The available APIs depend on the model type:
- [Pooling models](../models/pooling_models.md) output their hidden states directly.
!!! info
[API Reference][offline-inference-api]
[API Reference](../api/README.md#offline-inference)
## Ray Data LLM API
......
......@@ -44,37 +44,35 @@ To call the server, in your preferred text editor, create a script that uses an
We currently support the following OpenAI APIs:
- [Completions API][completions-api] (`/v1/completions`)
- [Completions API](#completions-api) (`/v1/completions`)
- Only applicable to [text generation models](../models/generative_models.md).
- *Note: `suffix` parameter is not supported.*
- [Chat Completions API][chat-api] (`/v1/chat/completions`)
- Only applicable to [text generation models](../models/generative_models.md) with a [chat template][chat-template].
- [Chat Completions API](#chat-api) (`/v1/chat/completions`)
- Only applicable to [text generation models](../models/generative_models.md) with a [chat template](../serving/openai_compatible_server.md#chat-template).
- *Note: `parallel_tool_calls` and `user` parameters are ignored.*
- [Embeddings API][embeddings-api] (`/v1/embeddings`)
- [Embeddings API](#embeddings-api) (`/v1/embeddings`)
- Only applicable to [embedding models](../models/pooling_models.md).
- [Transcriptions API][transcriptions-api] (`/v1/audio/transcriptions`)
- [Transcriptions API](#transcriptions-api) (`/v1/audio/transcriptions`)
- Only applicable to [Automatic Speech Recognition (ASR) models](../models/supported_models.md#transcription).
- [Translation API][translations-api] (`/v1/audio/translations`)
- [Translation API](#translations-api) (`/v1/audio/translations`)
- Only applicable to [Automatic Speech Recognition (ASR) models](../models/supported_models.md#transcription).
In addition, we have the following custom APIs:
- [Tokenizer API][tokenizer-api] (`/tokenize`, `/detokenize`)
- [Tokenizer API](#tokenizer-api) (`/tokenize`, `/detokenize`)
- Applicable to any model with a tokenizer.
- [Pooling API][pooling-api] (`/pooling`)
- [Pooling API](#pooling-api) (`/pooling`)
- Applicable to all [pooling models](../models/pooling_models.md).
- [Classification API][classification-api] (`/classify`)
- [Classification API](#classification-api) (`/classify`)
- Only applicable to [classification models](../models/pooling_models.md).
- [Score API][score-api] (`/score`)
- [Score API](#score-api) (`/score`)
- Applicable to [embedding models and cross-encoder models](../models/pooling_models.md).
- [Re-rank API][rerank-api] (`/rerank`, `/v1/rerank`, `/v2/rerank`)
- [Re-rank API](#re-rank-api) (`/rerank`, `/v1/rerank`, `/v2/rerank`)
- Implements [Jina AI's v1 re-rank API](https://jina.ai/reranker/)
- Also compatible with [Cohere's v1 & v2 re-rank APIs](https://docs.cohere.com/v2/reference/rerank)
- Jina and Cohere's APIs are very similar; Jina's includes extra information in the rerank endpoint's response.
- Only applicable to [cross-encoder models](../models/pooling_models.md).
[](){ #chat-template }
## Chat Template
In order for the language model to support chat protocol, vLLM requires the model to include
......@@ -174,8 +172,6 @@ with `--enable-request-id-headers`.
## API Reference
[](){ #completions-api }
### Completions API
Our Completions API is compatible with [OpenAI's Completions API](https://platform.openai.com/docs/api-reference/completions);
......@@ -185,7 +181,7 @@ Code example: [examples/online_serving/openai_completion_client.py](../../exampl
#### Extra parameters
The following [sampling parameters][sampling-params] are supported.
The following [sampling parameters](../api/README.md#inference-parameters) are supported.
??? code
......@@ -201,8 +197,6 @@ The following extra parameters are supported:
--8<-- "vllm/entrypoints/openai/protocol.py:completion-extra-params"
```
[](){ #chat-api }
### Chat API
Our Chat API is compatible with [OpenAI's Chat Completions API](https://platform.openai.com/docs/api-reference/chat);
......@@ -218,7 +212,7 @@ Code example: [examples/online_serving/openai_chat_completion_client.py](../../e
#### Extra parameters
The following [sampling parameters][sampling-params] are supported.
The following [sampling parameters](../api/README.md#inference-parameters) are supported.
??? code
......@@ -234,8 +228,6 @@ The following extra parameters are supported:
--8<-- "vllm/entrypoints/openai/protocol.py:chat-completion-extra-params"
```
[](){ #embeddings-api }
### Embeddings API
Our Embeddings API is compatible with [OpenAI's Embeddings API](https://platform.openai.com/docs/api-reference/embeddings);
......@@ -243,7 +235,7 @@ you can use the [official OpenAI Python client](https://github.com/openai/openai
Code example: [examples/online_serving/pooling/openai_embedding_client.py](../../examples/online_serving/pooling/openai_embedding_client.py)
If the model has a [chat template][chat-template], you can replace `inputs` with a list of `messages` (same schema as [Chat API][chat-api])
If the model has a [chat template](../serving/openai_compatible_server.md#chat-template), you can replace `inputs` with a list of `messages` (same schema as [Chat API](#chat-api))
which will be treated as a single prompt to the model. Here is a convenience function for calling the API while retaining OpenAI's type annotations:
??? code
......@@ -369,8 +361,6 @@ For chat-like input (i.e. if `messages` is passed), these extra parameters are s
--8<-- "vllm/entrypoints/openai/protocol.py:chat-embedding-extra-params"
```
[](){ #transcriptions-api }
### Transcriptions API
Our Transcriptions API is compatible with [OpenAI's Transcriptions API](https://platform.openai.com/docs/api-reference/audio/createTranscription);
......@@ -468,7 +458,7 @@ For `verbose_json` response format:
#### Extra Parameters
The following [sampling parameters][sampling-params] are supported.
The following [sampling parameters](../api/README.md#inference-parameters) are supported.
??? code
......@@ -484,8 +474,6 @@ The following extra parameters are supported:
--8<-- "vllm/entrypoints/openai/protocol.py:transcription-extra-params"
```
[](){ #translations-api }
### Translations API
Our Translation API is compatible with [OpenAI's Translations API](https://platform.openai.com/docs/api-reference/audio/createTranslation);
......@@ -500,7 +488,7 @@ Code example: [examples/online_serving/openai_translation_client.py](../../examp
#### Extra Parameters
The following [sampling parameters][sampling-params] are supported.
The following [sampling parameters](../api/README.md#inference-parameters) are supported.
```python
--8<-- "vllm/entrypoints/openai/protocol.py:translation-sampling-params"
......@@ -512,8 +500,6 @@ The following extra parameters are supported:
--8<-- "vllm/entrypoints/openai/protocol.py:translation-extra-params"
```
[](){ #tokenizer-api }
### Tokenizer API
Our Tokenizer API is a simple wrapper over [HuggingFace-style tokenizers](https://huggingface.co/docs/transformers/en/main_classes/tokenizer).
......@@ -522,18 +508,14 @@ It consists of two endpoints:
- `/tokenize` corresponds to calling `tokenizer.encode()`.
- `/detokenize` corresponds to calling `tokenizer.decode()`.
[](){ #pooling-api }
### Pooling API
Our Pooling API encodes input prompts using a [pooling model](../models/pooling_models.md) and returns the corresponding hidden states.
The input format is the same as [Embeddings API][embeddings-api], but the output data can contain an arbitrary nested list, not just a 1-D list of floats.
The input format is the same as [Embeddings API](#embeddings-api), but the output data can contain an arbitrary nested list, not just a 1-D list of floats.
Code example: [examples/online_serving/pooling/openai_pooling_client.py](../../examples/online_serving/pooling/openai_pooling_client.py)
[](){ #classification-api }
### Classification API
Our Classification API directly supports Hugging Face sequence-classification models such as [ai21labs/Jamba-tiny-reward-dev](https://huggingface.co/ai21labs/Jamba-tiny-reward-dev) and [jason9693/Qwen2.5-1.5B-apeach](https://huggingface.co/jason9693/Qwen2.5-1.5B-apeach).
......@@ -649,8 +631,6 @@ The following extra parameters are supported:
--8<-- "vllm/entrypoints/openai/protocol.py:classification-extra-params"
```
[](){ #score-api }
### Score API
Our Score API can apply a cross-encoder model or an embedding model to predict scores for sentence or multimodal pairs. When using an embedding model the score corresponds to the cosine similarity between each embedding pair.
......@@ -856,8 +836,6 @@ The following extra parameters are supported:
--8<-- "vllm/entrypoints/openai/protocol.py:score-extra-params"
```
[](){ #rerank-api }
### Re-rank API
Our Re-rank API can apply an embedding model or a cross-encoder model to predict relevant scores between a single query, and
......
......@@ -80,8 +80,6 @@ You might also need to set `export NCCL_SOCKET_IFNAME=<your_network_interface>`
If vLLM crashes and the error trace captures it somewhere around `self.graph.replay()` in `vllm/worker/model_runner.py`, it is a CUDA error inside CUDAGraph.
To identify the particular CUDA operation that causes the error, you can add `--enforce-eager` to the command line, or `enforce_eager=True` to the [LLM][vllm.LLM] class to disable the CUDAGraph optimization and isolate the exact CUDA operation that causes the error.
[](){ #troubleshooting-incorrect-hardware-driver }
## Incorrect hardware/driver
If GPU/CPU communication cannot be established, you can use the following Python script and follow the instructions below to confirm whether the GPU/CPU communication is working correctly.
......@@ -178,8 +176,6 @@ If the test script hangs or crashes, usually it means the hardware/drivers are b
Adjust `--nproc-per-node`, `--nnodes`, and `--node-rank` according to your setup, being sure to execute different commands (with different `--node-rank`) on different nodes.
[](){ #troubleshooting-python-multiprocessing }
## Python multiprocessing
### `RuntimeError` Exception
......
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment