Unverified commit a1fe24d9, authored by Harry Mellor, committed by GitHub

Migrate docs from Sphinx to MkDocs (#18145)


Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
parent d0bc2f81
(meetups)= ---
title: vLLM Meetups
# vLLM Meetups ---
[](){ #meetups }
We host regular meetups in San Francisco Bay Area every 2 months. We will share the project updates from the vLLM team and have guest speakers from the industry to share their experience and insights. Please find the materials of our previous meetups below: We host regular meetups in San Francisco Bay Area every 2 months. We will share the project updates from the vLLM team and have guest speakers from the industry to share their experience and insights. Please find the materials of our previous meetups below:
......
# Dockerfile # Dockerfile
We provide a <gh-file:docker/Dockerfile> to construct the image for running an OpenAI compatible server with vLLM. We provide a <gh-file:docker/Dockerfile> to construct the image for running an OpenAI compatible server with vLLM.
More information about deploying with Docker can be found [here](#deployment-docker). More information about deploying with Docker can be found [here][deployment-docker].
Below is a visual representation of the multi-stage Dockerfile. The build graph contains the following nodes: Below is a visual representation of the multi-stage Dockerfile. The build graph contains the following nodes:
...@@ -17,11 +17,9 @@ The edges of the build graph represent: ...@@ -17,11 +17,9 @@ The edges of the build graph represent:
- `RUN --mount=(.\*)from=...` dependencies (with a dotted line and an empty diamond arrow head) - `RUN --mount=(.\*)from=...` dependencies (with a dotted line and an empty diamond arrow head)
> :::{figure} /assets/contributing/dockerfile-stages-dependency.png > <figure markdown="span">
> :align: center > ![](../../assets/contributing/dockerfile-stages-dependency.png){ align="center" alt="query" width="100%" }
> :alt: query > </figure>
> :width: 100%
> :::
> >
> Made using: <https://github.com/patrickhoefler/dockerfilegraph> > Made using: <https://github.com/patrickhoefler/dockerfilegraph>
> >
......
---
title: Adding a New Model
---
[](){ #new-model }
This section provides more information on how to integrate a [PyTorch](https://pytorch.org/) model into vLLM.
Contents:
- [Basic](basic.md)
- [Registration](registration.md)
- [Tests](tests.md)
- [Multimodal](multimodal.md)
!!! note
The complexity of adding a new model depends heavily on the model's architecture.
The process is fairly straightforward if the model shares a similar architecture with an existing model in vLLM.
However, for models that include new operators (e.g., a new attention mechanism), the process can be a bit more complex.
!!! tip
If you are encountering issues while integrating your model into vLLM, feel free to open a [GitHub issue](https://github.com/vllm-project/vllm/issues)
or ask on our [developer slack](https://slack.vllm.ai).
We will be happy to help you out!
(new-model-basic)= ---
title: Implementing a Basic Model
# Implementing a Basic Model ---
[](){ #new-model-basic }
This guide walks you through the steps to implement a basic vLLM model. This guide walks you through the steps to implement a basic vLLM model.
...@@ -10,9 +11,8 @@ First, clone the PyTorch model code from the source repository. ...@@ -10,9 +11,8 @@ First, clone the PyTorch model code from the source repository.
For instance, vLLM's [OPT model](gh-file:vllm/model_executor/models/opt.py) was adapted from For instance, vLLM's [OPT model](gh-file:vllm/model_executor/models/opt.py) was adapted from
HuggingFace's [modeling_opt.py](https://github.com/huggingface/transformers/blob/main/src/transformers/models/opt/modeling_opt.py) file. HuggingFace's [modeling_opt.py](https://github.com/huggingface/transformers/blob/main/src/transformers/models/opt/modeling_opt.py) file.
:::{warning} !!! warning
Make sure to review and adhere to the original code's copyright and licensing terms! Make sure to review and adhere to the original code's copyright and licensing terms!
:::
## 2. Make your code compatible with vLLM ## 2. Make your code compatible with vLLM
...@@ -67,7 +67,7 @@ class MyModel(nn.Module): ...@@ -67,7 +67,7 @@ class MyModel(nn.Module):
... ...
``` ```
- Rewrite the {meth}`~torch.nn.Module.forward` method of your model to remove any unnecessary code, such as training-specific code. Modify the input parameters to treat `input_ids` and `positions` as flattened tensors with a single batch size dimension, without a max-sequence length dimension. - Rewrite the [forward][torch.nn.Module.forward] method of your model to remove any unnecessary code, such as training-specific code. Modify the input parameters to treat `input_ids` and `positions` as flattened tensors with a single batch size dimension, without a max-sequence length dimension.
```python ```python
def forward( def forward(
...@@ -78,10 +78,9 @@ def forward( ...@@ -78,10 +78,9 @@ def forward(
... ...
``` ```
:::{note} !!! note
Currently, vLLM supports the basic multi-head attention mechanism and its variant with rotary positional embeddings. Currently, vLLM supports the basic multi-head attention mechanism and its variant with rotary positional embeddings.
If your model employs a different attention mechanism, you will need to implement a new attention layer in vLLM. If your model employs a different attention mechanism, you will need to implement a new attention layer in vLLM.
:::
For reference, check out our [Llama implementation](gh-file:vllm/model_executor/models/llama.py). vLLM already supports a large number of models. It is recommended to find a model similar to yours and adapt it to your model's architecture. Check out <gh-dir:vllm/model_executor/models> for more examples. For reference, check out our [Llama implementation](gh-file:vllm/model_executor/models/llama.py). vLLM already supports a large number of models. It is recommended to find a model similar to yours and adapt it to your model's architecture. Check out <gh-dir:vllm/model_executor/models> for more examples.
...@@ -89,7 +88,7 @@ For reference, check out our [Llama implementation](gh-file:vllm/model_executor/ ...@@ -89,7 +88,7 @@ For reference, check out our [Llama implementation](gh-file:vllm/model_executor/
If your model is too large to fit into a single GPU, you can use tensor parallelism to manage it. If your model is too large to fit into a single GPU, you can use tensor parallelism to manage it.
To do this, substitute your model's linear and embedding layers with their tensor-parallel versions. To do this, substitute your model's linear and embedding layers with their tensor-parallel versions.
For the embedding layer, you can simply replace {class}`torch.nn.Embedding` with `VocabParallelEmbedding`. For the output LM head, you can use `ParallelLMHead`. For the embedding layer, you can simply replace [torch.nn.Embedding][] with `VocabParallelEmbedding`. For the output LM head, you can use `ParallelLMHead`.
When it comes to the linear layers, we provide the following options to parallelize them: When it comes to the linear layers, we provide the following options to parallelize them:
- `ReplicatedLinear`: Replicates the inputs and weights across multiple GPUs. No memory saving. - `ReplicatedLinear`: Replicates the inputs and weights across multiple GPUs. No memory saving.
...@@ -107,7 +106,7 @@ This method should load the weights from the HuggingFace's checkpoint file and a ...@@ -107,7 +106,7 @@ This method should load the weights from the HuggingFace's checkpoint file and a
## 5. Register your model ## 5. Register your model
See [this page](#new-model-registration) for instructions on how to register your new model to be used by vLLM. See [this page][new-model-registration] for instructions on how to register your new model to be used by vLLM.
## Frequently Asked Questions ## Frequently Asked Questions
......
(new-model-registration)= ---
title: Registering a Model to vLLM
# Registering a Model to vLLM ---
[](){ #new-model-registration }
vLLM relies on a model registry to determine how to run each model. vLLM relies on a model registry to determine how to run each model.
A list of pre-registered architectures can be found [here](#supported-models). A list of pre-registered architectures can be found [here][supported-models].
If your model is not on this list, you must register it to vLLM. If your model is not on this list, you must register it to vLLM.
This page provides detailed instructions on how to do so. This page provides detailed instructions on how to do so.
## Built-in models ## Built-in models
To add a model directly to the vLLM library, start by forking our [GitHub repository](https://github.com/vllm-project/vllm) and then [build it from source](#build-from-source). To add a model directly to the vLLM library, start by forking our [GitHub repository](https://github.com/vllm-project/vllm) and then [build it from source][build-from-source].
This gives you the ability to modify the codebase and test your model. This gives you the ability to modify the codebase and test your model.
After you have implemented your model (see [tutorial](#new-model-basic)), put it into the <gh-dir:vllm/model_executor/models> directory. After you have implemented your model (see [tutorial][new-model-basic]), put it into the <gh-dir:vllm/model_executor/models> directory.
Then, add your model class to `_VLLM_MODELS` in <gh-file:vllm/model_executor/models/registry.py> so that it is automatically registered upon importing vLLM. Then, add your model class to `_VLLM_MODELS` in <gh-file:vllm/model_executor/models/registry.py> so that it is automatically registered upon importing vLLM.
Finally, update our [list of supported models](#supported-models) to promote your model! Finally, update our [list of supported models][supported-models] to promote your model!
:::{important} !!! warning
The list of models in each section should be maintained in alphabetical order. The list of models in each section should be maintained in alphabetical order.
:::
## Out-of-tree models ## Out-of-tree models
You can load an external model using a plugin without modifying the vLLM codebase. You can load an external model using a plugin without modifying the vLLM codebase.
:::{seealso} !!! info
[vLLM's Plugin System](#plugin-system) [vLLM's Plugin System][plugin-system]
:::
To register the model, use the following code: To register the model, use the following code:
...@@ -45,11 +44,9 @@ from vllm import ModelRegistry ...@@ -45,11 +44,9 @@ from vllm import ModelRegistry
ModelRegistry.register_model("YourModelForCausalLM", "your_code:YourModelForCausalLM") ModelRegistry.register_model("YourModelForCausalLM", "your_code:YourModelForCausalLM")
``` ```
:::{important} !!! warning
If your model is a multimodal model, ensure the model class implements the {class}`~vllm.model_executor.models.interfaces.SupportsMultiModal` interface. If your model is a multimodal model, ensure the model class implements the [SupportsMultiModal][vllm.model_executor.models.interfaces.SupportsMultiModal] interface.
Read more about that [here](#supports-multimodal). Read more about that [here][supports-multimodal].
:::
:::{note} !!! note
Although you can directly put these code snippets in your script using `vllm.LLM`, the recommended way is to place these snippets in a vLLM plugin. This ensures compatibility with various vLLM features like distributed inference and the API server. Although you can directly put these code snippets in your script using `vllm.LLM`, the recommended way is to place these snippets in a vLLM plugin. This ensures compatibility with various vLLM features like distributed inference and the API server.
:::
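The `"your_code:YourModelForCausalLM"` argument above is a `<module>:<class>` string. A plausible reading, stated here as an assumption and not as vLLM's actual implementation, is that the string form lets the registry defer importing the module until the model is needed. The sketch below illustrates that idea with a toy registry; `LazyRegistry` and `resolve` are hypothetical names, and `collections:OrderedDict` stands in for a real model class so the example runs without vLLM installed.

```python
import importlib
from typing import Dict, Union


class LazyRegistry:
    """Toy registry (hypothetical, not vLLM code) mapping names to classes."""

    def __init__(self) -> None:
        # Values are either a class or a "<module>:<class>" string.
        self._entries: Dict[str, Union[str, type]] = {}

    def register_model(self, arch: str, model: Union[str, type]) -> None:
        if isinstance(model, str) and model.count(":") != 1:
            raise ValueError("expected a '<module>:<class>' string")
        self._entries[arch] = model

    def resolve(self, arch: str) -> type:
        entry = self._entries[arch]
        if isinstance(entry, str):
            # The import happens here, on first use, not at registration time.
            module_name, class_name = entry.split(":")
            entry = getattr(importlib.import_module(module_name), class_name)
            self._entries[arch] = entry  # cache the resolved class
        return entry


registry = LazyRegistry()
registry.register_model("OrderedDictForDemo", "collections:OrderedDict")
resolved = registry.resolve("OrderedDictForDemo")
```

Registering a string is cheap because nothing is imported until `resolve` is called, which can matter when the target module has expensive import-time side effects.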
(new-model-tests)= ---
title: Writing Unit Tests
# Writing Unit Tests ---
[](){ #new-model-tests }
This page explains how to write unit tests to verify the implementation of your model. This page explains how to write unit tests to verify the implementation of your model.
...@@ -14,14 +15,12 @@ Without them, the CI for your PR will fail. ...@@ -14,14 +15,12 @@ Without them, the CI for your PR will fail.
Include an example HuggingFace repository for your model in <gh-file:tests/models/registry.py>. Include an example HuggingFace repository for your model in <gh-file:tests/models/registry.py>.
This enables a unit test that loads dummy weights to ensure that the model can be initialized in vLLM. This enables a unit test that loads dummy weights to ensure that the model can be initialized in vLLM.
:::{important} !!! warning
The list of models in each section should be maintained in alphabetical order. The list of models in each section should be maintained in alphabetical order.
:::
:::{tip} !!! tip
If your model requires a development version of HF Transformers, you can set If your model requires a development version of HF Transformers, you can set
`min_transformers_version` to skip the test in CI until the model is released. `min_transformers_version` to skip the test in CI until the model is released.
:::
## Optional Tests ## Optional Tests
...@@ -34,16 +33,16 @@ These tests compare the model outputs of vLLM against [HF Transformers](https:// ...@@ -34,16 +33,16 @@ These tests compare the model outputs of vLLM against [HF Transformers](https://
#### Generative models #### Generative models
For [generative models](#generative-models), there are two levels of correctness tests, as defined in <gh-file:tests/models/utils.py>: For [generative models][generative-models], there are two levels of correctness tests, as defined in <gh-file:tests/models/utils.py>:
- Exact correctness (`check_outputs_equal`): The text outputted by vLLM should exactly match the text outputted by HF. - Exact correctness (`check_outputs_equal`): The text outputted by vLLM should exactly match the text outputted by HF.
- Logprobs similarity (`check_logprobs_close`): The logprobs outputted by vLLM should be in the top-k logprobs outputted by HF, and vice versa. - Logprobs similarity (`check_logprobs_close`): The logprobs outputted by vLLM should be in the top-k logprobs outputted by HF, and vice versa.
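To make the logprobs similarity criterion concrete, here is a simplified sketch of the idea. It is not the code in `tests/models/utils.py`: the function name `logprobs_close`, the data layout (one dictionary of top-k token ids to logprobs per position), and the choice to stop at the first divergence are all assumptions made for illustration.

```python
from typing import Dict, List, Sequence

TopK = Dict[int, float]  # token id -> logprob, for one position


def logprobs_close(
    tokens_a: Sequence[int],
    topk_a: List[TopK],
    tokens_b: Sequence[int],
    topk_b: List[TopK],
) -> bool:
    """Return True if the two generations agree up to top-k tolerance."""
    for pos, (tok_a, tok_b) in enumerate(zip(tokens_a, tokens_b)):
        if tok_a == tok_b:
            continue
        # First divergence: each side's token must appear in the other
        # side's top-k. Later positions are conditioned on different
        # prefixes, so they are not compared.
        return tok_a in topk_b[pos] and tok_b in topk_a[pos]
    return True


tokens_a = [5, 7, 9]
tokens_b = [5, 8, 2]
topk_a = [{5: -0.1, 6: -2.5}, {7: -0.6, 8: -0.9}, {9: -0.2, 1: -3.0}]
topk_b = [{5: -0.1, 6: -2.4}, {8: -0.7, 7: -0.8}, {2: -0.3, 4: -2.0}]
close = logprobs_close(tokens_a, topk_a, tokens_b, topk_b)

# Same tokens, but run B no longer ranks token 7 in its top-k at position 1.
topk_b_far = [{5: -0.1, 6: -2.4}, {8: -0.1, 3: -4.0}, {2: -0.3, 4: -2.0}]
far = logprobs_close(tokens_a, topk_a, tokens_b, topk_b_far)
```

This kind of check tolerates small numerical differences that flip the top choice between two near-tied tokens, which an exact text comparison would report as a failure.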
#### Pooling models #### Pooling models
For [pooling models](#pooling-models), we simply check the cosine similarity, as defined in <gh-file:tests/models/embedding/utils.py>. For [pooling models][pooling-models], we simply check the cosine similarity, as defined in <gh-file:tests/models/embedding/utils.py>.
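As an illustration of a cosine-similarity check between two embedding vectors, the following minimal sketch uses only the standard library. The helper names and the `tol` threshold are hypothetical and are not taken from `tests/models/embedding/utils.py`.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def embeddings_match(
    u: Sequence[float], v: Sequence[float], tol: float = 1e-2
) -> bool:
    # tol is an illustrative threshold, not a value used by vLLM's tests.
    return 1.0 - cosine_similarity(u, v) <= tol


vllm_embedding = [1.0, 2.0, 3.0]   # stand-in for a vLLM output
hf_embedding = [1.01, 1.99, 3.0]   # stand-in for an HF Transformers output
matches = embeddings_match(vllm_embedding, hf_embedding)
```

Cosine similarity ignores vector magnitude, so it compares the direction of the two embeddings and is insensitive to a uniform scaling of either one.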
(mm-processing-tests)= [](){ #mm-processing-tests }
### Multi-modal processing ### Multi-modal processing
......
...@@ -27,7 +27,21 @@ See <gh-file:LICENSE>. ...@@ -27,7 +27,21 @@ See <gh-file:LICENSE>.
## Developing ## Developing
Depending on the kind of development you'd like to do (e.g. Python, CUDA), you can choose to build vLLM with or without compilation. Depending on the kind of development you'd like to do (e.g. Python, CUDA), you can choose to build vLLM with or without compilation.
Check out the [building from source](#build-from-source) documentation for details. Check out the [building from source][build-from-source] documentation for details.
### Building the docs
Install the dependencies:
```bash
pip install -r requirements/docs.txt
```
Start the autoreloading MkDocs server:
```bash
mkdocs serve
```
## Testing ## Testing
...@@ -48,29 +62,25 @@ pre-commit run mypy-3.9 --hook-stage manual --all-files ...@@ -48,29 +62,25 @@ pre-commit run mypy-3.9 --hook-stage manual --all-files
pytest tests/ pytest tests/
``` ```
:::{tip} !!! tip
Since the <gh-file:docker/Dockerfile> ships with Python 3.12, all tests in CI (except `mypy`) are run with Python 3.12. Since the <gh-file:docker/Dockerfile> ships with Python 3.12, all tests in CI (except `mypy`) are run with Python 3.12.
Therefore, we recommend developing with Python 3.12 to minimise the chance of your local environment clashing with our CI environment. Therefore, we recommend developing with Python 3.12 to minimise the chance of your local environment clashing with our CI environment.
:::
:::{note} !!! note
Currently, the repository is not fully checked by `mypy`. Currently, the repository is not fully checked by `mypy`.
:::
:::{note} !!! note
Currently, not all unit tests pass when run on CPU platforms. If you don't have access to a GPU Currently, not all unit tests pass when run on CPU platforms. If you don't have access to a GPU
platform to run unit tests locally, rely on the continuous integration system to run the tests for platform to run unit tests locally, rely on the continuous integration system to run the tests for
now. now.
:::
## Issues ## Issues
If you encounter a bug or have a feature request, please [search existing issues](https://github.com/vllm-project/vllm/issues?q=is%3Aissue) first to see if it has already been reported. If not, please [file a new issue](https://github.com/vllm-project/vllm/issues/new/choose), providing as much relevant information as possible. If you encounter a bug or have a feature request, please [search existing issues](https://github.com/vllm-project/vllm/issues?q=is%3Aissue) first to see if it has already been reported. If not, please [file a new issue](https://github.com/vllm-project/vllm/issues/new/choose), providing as much relevant information as possible.
:::{important} !!! warning
If you discover a security vulnerability, please follow the instructions [here](gh-file:SECURITY.md#reporting-a-vulnerability). If you discover a security vulnerability, please follow the instructions [here](gh-file:SECURITY.md#reporting-a-vulnerability).
:::
## Pull Requests & Code Reviews ## Pull Requests & Code Reviews
...@@ -106,9 +116,8 @@ appropriately to indicate the type of change. Please use one of the following: ...@@ -106,9 +116,8 @@ appropriately to indicate the type of change. Please use one of the following:
- `[Misc]` for PRs that do not fit the above categories. Please use this - `[Misc]` for PRs that do not fit the above categories. Please use this
sparingly. sparingly.
:::{note} !!! note
If the PR spans more than one category, please include all relevant prefixes. If the PR spans more than one category, please include all relevant prefixes.
:::
### Code Quality ### Code Quality
......