Unverified Commit b4bab816 authored by Harry Mellor's avatar Harry Mellor Committed by GitHub
Browse files

Remove unnecessary explicit title anchors and use relative links instead (#20620)


Signed-off-by: default avatarHarry Mellor <19981378+hmellor@users.noreply.github.com>
parent b91cb3fa
...@@ -48,4 +48,4 @@ For more information, check out the following: ...@@ -48,4 +48,4 @@ For more information, check out the following:
- [vLLM announcing blog post](https://vllm.ai) (intro to PagedAttention) - [vLLM announcing blog post](https://vllm.ai) (intro to PagedAttention)
- [vLLM paper](https://arxiv.org/abs/2309.06180) (SOSP 2023) - [vLLM paper](https://arxiv.org/abs/2309.06180) (SOSP 2023)
- [How continuous batching enables 23x throughput in LLM inference while reducing p50 latency](https://www.anyscale.com/blog/continuous-batching-llm-inference) by Cade Daniel et al. - [How continuous batching enables 23x throughput in LLM inference while reducing p50 latency](https://www.anyscale.com/blog/continuous-batching-llm-inference) by Cade Daniel et al.
- [vLLM Meetups][meetups] - [vLLM Meetups](community/meetups.md)
...@@ -64,7 +64,7 @@ vLLM provides experimental support for multi-modal models through the [vllm.mult ...@@ -64,7 +64,7 @@ vLLM provides experimental support for multi-modal models through the [vllm.mult
Multi-modal inputs can be passed alongside text and token prompts to [supported models][supported-mm-models] Multi-modal inputs can be passed alongside text and token prompts to [supported models][supported-mm-models]
via the `multi_modal_data` field in [vllm.inputs.PromptType][]. via the `multi_modal_data` field in [vllm.inputs.PromptType][].
Looking to add your own multi-modal model? Please follow the instructions listed [here][supports-multimodal]. Looking to add your own multi-modal model? Please follow the instructions listed [here](../contributing/model/multimodal.md).
- [vllm.multimodal.MULTIMODAL_REGISTRY][] - [vllm.multimodal.MULTIMODAL_REGISTRY][]
......
--- ---
title: Contact Us title: Contact Us
--- ---
[](){ #contactus }
--8<-- "README.md:contact-us" --8<-- "README.md:contact-us"
--- ---
title: Meetups title: Meetups
--- ---
[](){ #meetups }
We host regular meetups in San Francisco Bay Area every 2 months. We will share the project updates from the vLLM team and have guest speakers from the industry to share their experience and insights. Please find the materials of our previous meetups below: We host regular meetups in San Francisco Bay Area every 2 months. We will share the project updates from the vLLM team and have guest speakers from the industry to share their experience and insights. Please find the materials of our previous meetups below:
......
...@@ -33,7 +33,7 @@ Quantized models take less memory at the cost of lower precision. ...@@ -33,7 +33,7 @@ Quantized models take less memory at the cost of lower precision.
Statically quantized models can be downloaded from HF Hub (some popular ones are available at [Red Hat AI](https://huggingface.co/RedHatAI)) Statically quantized models can be downloaded from HF Hub (some popular ones are available at [Red Hat AI](https://huggingface.co/RedHatAI))
and used directly without extra configuration. and used directly without extra configuration.
Dynamic quantization is also supported via the `quantization` option -- see [here][quantization-index] for more details. Dynamic quantization is also supported via the `quantization` option -- see [here](../features/quantization/README.md) for more details.
## Context length and batch size ## Context length and batch size
......
--- ---
title: Engine Arguments title: Engine Arguments
--- ---
[](){ #engine-args }
Engine arguments control the behavior of the vLLM engine. Engine arguments control the behavior of the vLLM engine.
- For [offline inference][offline-inference], they are part of the arguments to [LLM][vllm.LLM] class. - For [offline inference](../serving/offline_inference.md), they are part of the arguments to [LLM][vllm.LLM] class.
- For [online serving][serving-openai-compatible-server], they are part of the arguments to `vllm serve`. - For [online serving](../serving/openai_compatible_server.md), they are part of the arguments to `vllm serve`.
You can look at [EngineArgs][vllm.engine.arg_utils.EngineArgs] and [AsyncEngineArgs][vllm.engine.arg_utils.AsyncEngineArgs] to see the available engine arguments. You can look at [EngineArgs][vllm.engine.arg_utils.EngineArgs] and [AsyncEngineArgs][vllm.engine.arg_utils.AsyncEngineArgs] to see the available engine arguments.
......
...@@ -20,4 +20,4 @@ model = LLM( ...@@ -20,4 +20,4 @@ model = LLM(
) )
``` ```
Our [list of supported models][supported-models] shows the model architectures that are recognized by vLLM. Our [list of supported models](../models/supported_models.md) shows the model architectures that are recognized by vLLM.
--- ---
title: Server Arguments title: Server Arguments
--- ---
[](){ #serve-args }
The `vllm serve` command is used to launch the OpenAI-compatible server. The `vllm serve` command is used to launch the OpenAI-compatible server.
...@@ -13,7 +12,7 @@ To see the available CLI arguments, run `vllm serve --help`! ...@@ -13,7 +12,7 @@ To see the available CLI arguments, run `vllm serve --help`!
## Configuration file ## Configuration file
You can load CLI arguments via a [YAML](https://yaml.org/) config file. You can load CLI arguments via a [YAML](https://yaml.org/) config file.
The argument names must be the long form of those outlined [above][serve-args]. The argument names must be the long form of those outlined [above](serve_args.md).
For example: For example:
......
--- ---
title: Benchmark Suites title: Benchmark Suites
--- ---
[](){ #benchmarks }
vLLM contains two sets of benchmarks: vLLM contains two sets of benchmarks:
......
# Dockerfile # Dockerfile
We provide a <gh-file:docker/Dockerfile> to construct the image for running an OpenAI compatible server with vLLM. We provide a <gh-file:docker/Dockerfile> to construct the image for running an OpenAI compatible server with vLLM.
More information about deploying with Docker can be found [here][deployment-docker]. More information about deploying with Docker can be found [here](../../deployment/docker.md).
Below is a visual representation of the multi-stage Dockerfile. The build graph contains the following nodes: Below is a visual representation of the multi-stage Dockerfile. The build graph contains the following nodes:
......
--- ---
title: Summary title: Summary
--- ---
[](){ #new-model }
!!! important !!! important
Many decoder language models can now be automatically loaded using the [Transformers backend][transformers-backend] without having to implement them in vLLM. See if `vllm serve <model>` works first! Many decoder language models can now be automatically loaded using the [Transformers backend][transformers-backend] without having to implement them in vLLM. See if `vllm serve <model>` works first!
vLLM models are specialized [PyTorch](https://pytorch.org/) models that take advantage of various [features][compatibility-matrix] to optimize their performance. vLLM models are specialized [PyTorch](https://pytorch.org/) models that take advantage of various [features](../../features/compatibility_matrix.md) to optimize their performance.
The complexity of integrating a model into vLLM depends heavily on the model's architecture. The complexity of integrating a model into vLLM depends heavily on the model's architecture.
The process is considerably straightforward if the model shares a similar architecture with an existing model in vLLM. The process is considerably straightforward if the model shares a similar architecture with an existing model in vLLM.
......
--- ---
title: Basic Model title: Basic Model
--- ---
[](){ #new-model-basic }
This guide walks you through the steps to implement a basic vLLM model. This guide walks you through the steps to implement a basic vLLM model.
...@@ -108,7 +107,7 @@ This method should load the weights from the HuggingFace's checkpoint file and a ...@@ -108,7 +107,7 @@ This method should load the weights from the HuggingFace's checkpoint file and a
## 5. Register your model ## 5. Register your model
See [this page][new-model-registration] for instructions on how to register your new model to be used by vLLM. See [this page](registration.md) for instructions on how to register your new model to be used by vLLM.
## Frequently Asked Questions ## Frequently Asked Questions
......
--- ---
title: Multi-Modal Support title: Multi-Modal Support
--- ---
[](){ #supports-multimodal }
This document walks you through the steps to extend a basic model so that it accepts [multi-modal inputs][multimodal-inputs]. This document walks you through the steps to extend a basic model so that it accepts [multi-modal inputs](../../features/multimodal_inputs.md).
## 1. Update the base vLLM model ## 1. Update the base vLLM model
It is assumed that you have already implemented the model in vLLM according to [these steps][new-model-basic]. It is assumed that you have already implemented the model in vLLM according to [these steps](basic.md).
Further update the model as follows: Further update the model as follows:
- Implement [get_placeholder_str][vllm.model_executor.models.interfaces.SupportsMultiModal.get_placeholder_str] to define the placeholder string which is used to represent the multi-modal item in the text prompt. This should be consistent with the chat template of the model. - Implement [get_placeholder_str][vllm.model_executor.models.interfaces.SupportsMultiModal.get_placeholder_str] to define the placeholder string which is used to represent the multi-modal item in the text prompt. This should be consistent with the chat template of the model.
...@@ -483,7 +482,7 @@ Afterwards, create a subclass of [BaseMultiModalProcessor][vllm.multimodal.proce ...@@ -483,7 +482,7 @@ Afterwards, create a subclass of [BaseMultiModalProcessor][vllm.multimodal.proce
to fill in the missing details about HF processing. to fill in the missing details about HF processing.
!!! info !!! info
[Multi-Modal Data Processing][mm-processing] [Multi-Modal Data Processing](../../design/mm_processing.md)
### Multi-modal fields ### Multi-modal fields
...@@ -846,7 +845,7 @@ Examples: ...@@ -846,7 +845,7 @@ Examples:
### Handling prompt updates unrelated to multi-modal data ### Handling prompt updates unrelated to multi-modal data
[_get_prompt_updates][vllm.multimodal.processing.BaseMultiModalProcessor._get_prompt_updates] assumes that each application of prompt update corresponds to one multi-modal item. If the HF processor performs additional processing regardless of how many multi-modal items there are, you should override [_apply_hf_processor_tokens_only][vllm.multimodal.processing.BaseMultiModalProcessor._apply_hf_processor_tokens_only] so that the processed token inputs are consistent with the result of applying the HF processor on text inputs. This is because token inputs bypass the HF processor according to [our design][mm-processing]. [_get_prompt_updates][vllm.multimodal.processing.BaseMultiModalProcessor._get_prompt_updates] assumes that each application of prompt update corresponds to one multi-modal item. If the HF processor performs additional processing regardless of how many multi-modal items there are, you should override [_apply_hf_processor_tokens_only][vllm.multimodal.processing.BaseMultiModalProcessor._apply_hf_processor_tokens_only] so that the processed token inputs are consistent with the result of applying the HF processor on text inputs. This is because token inputs bypass the HF processor according to [our design](../../design/mm_processing.md).
Examples: Examples:
......
--- ---
title: Registering a Model title: Registering a Model
--- ---
[](){ #new-model-registration }
vLLM relies on a model registry to determine how to run each model. vLLM relies on a model registry to determine how to run each model.
A list of pre-registered architectures can be found [here][supported-models]. A list of pre-registered architectures can be found [here](../../models/supported_models.md).
If your model is not on this list, you must register it to vLLM. If your model is not on this list, you must register it to vLLM.
This page provides detailed instructions on how to do so. This page provides detailed instructions on how to do so.
...@@ -14,16 +13,16 @@ This page provides detailed instructions on how to do so. ...@@ -14,16 +13,16 @@ This page provides detailed instructions on how to do so.
To add a model directly to the vLLM library, start by forking our [GitHub repository](https://github.com/vllm-project/vllm) and then [build it from source][build-from-source]. To add a model directly to the vLLM library, start by forking our [GitHub repository](https://github.com/vllm-project/vllm) and then [build it from source][build-from-source].
This gives you the ability to modify the codebase and test your model. This gives you the ability to modify the codebase and test your model.
After you have implemented your model (see [tutorial][new-model-basic]), put it into the <gh-dir:vllm/model_executor/models> directory. After you have implemented your model (see [tutorial](basic.md)), put it into the <gh-dir:vllm/model_executor/models> directory.
Then, add your model class to `_VLLM_MODELS` in <gh-file:vllm/model_executor/models/registry.py> so that it is automatically registered upon importing vLLM. Then, add your model class to `_VLLM_MODELS` in <gh-file:vllm/model_executor/models/registry.py> so that it is automatically registered upon importing vLLM.
Finally, update our [list of supported models][supported-models] to promote your model! Finally, update our [list of supported models](../../models/supported_models.md) to promote your model!
!!! important !!! important
The list of models in each section should be maintained in alphabetical order. The list of models in each section should be maintained in alphabetical order.
## Out-of-tree models ## Out-of-tree models
You can load an external model [using a plugin][plugin-system] without modifying the vLLM codebase. You can load an external model [using a plugin](../../design/plugin_system.md) without modifying the vLLM codebase.
To register the model, use the following code: To register the model, use the following code:
...@@ -51,4 +50,4 @@ def register(): ...@@ -51,4 +50,4 @@ def register():
!!! important !!! important
If your model is a multimodal model, ensure the model class implements the [SupportsMultiModal][vllm.model_executor.models.interfaces.SupportsMultiModal] interface. If your model is a multimodal model, ensure the model class implements the [SupportsMultiModal][vllm.model_executor.models.interfaces.SupportsMultiModal] interface.
Read more about that [here][supports-multimodal]. Read more about that [here](multimodal.md).
--- ---
title: Unit Testing title: Unit Testing
--- ---
[](){ #new-model-tests }
This page explains how to write unit tests to verify the implementation of your model. This page explains how to write unit tests to verify the implementation of your model.
......
--- ---
title: Using Docker title: Using Docker
--- ---
[](){ #deployment-docker }
[](){ #deployment-docker-pre-built-image } [](){ #deployment-docker-pre-built-image }
...@@ -32,7 +31,7 @@ podman run --gpus all \ ...@@ -32,7 +31,7 @@ podman run --gpus all \
--model mistralai/Mistral-7B-v0.1 --model mistralai/Mistral-7B-v0.1
``` ```
You can add any other [engine-args][engine-args] you need after the image tag (`vllm/vllm-openai:latest`). You can add any other [engine-args](../configuration/engine_args.md) you need after the image tag (`vllm/vllm-openai:latest`).
!!! note !!! note
You can either use the `ipc=host` flag or `--shm-size` flag to allow the You can either use the `ipc=host` flag or `--shm-size` flag to allow the
......
--- ---
title: Anything LLM title: Anything LLM
--- ---
[](){ #deployment-anything-llm }
[Anything LLM](https://github.com/Mintplex-Labs/anything-llm) is a full-stack application that enables you to turn any document, resource, or piece of content into context that any LLM can use as references during chatting. [Anything LLM](https://github.com/Mintplex-Labs/anything-llm) is a full-stack application that enables you to turn any document, resource, or piece of content into context that any LLM can use as references during chatting.
......
--- ---
title: AutoGen title: AutoGen
--- ---
[](){ #deployment-autogen }
[AutoGen](https://github.com/microsoft/autogen) is a framework for creating multi-agent AI applications that can act autonomously or work alongside humans. [AutoGen](https://github.com/microsoft/autogen) is a framework for creating multi-agent AI applications that can act autonomously or work alongside humans.
......
--- ---
title: BentoML title: BentoML
--- ---
[](){ #deployment-bentoml }
[BentoML](https://github.com/bentoml/BentoML) allows you to deploy a large language model (LLM) server with vLLM as the backend, which exposes OpenAI-compatible endpoints. You can serve the model locally or containerize it as an OCI-compliant image and deploy it on Kubernetes. [BentoML](https://github.com/bentoml/BentoML) allows you to deploy a large language model (LLM) server with vLLM as the backend, which exposes OpenAI-compatible endpoints. You can serve the model locally or containerize it as an OCI-compliant image and deploy it on Kubernetes.
......
--- ---
title: Cerebrium title: Cerebrium
--- ---
[](){ #deployment-cerebrium }
<p align="center"> <p align="center">
<img src="https://i.ibb.co/hHcScTT/Screenshot-2024-06-13-at-10-14-54.png" alt="vLLM_plus_cerebrium"/> <img src="https://i.ibb.co/hHcScTT/Screenshot-2024-06-13-at-10-14-54.png" alt="vLLM_plus_cerebrium"/>
......
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment