Commit 8ceffbf3 (unverified), authored by Cyrus Leung, committed by GitHub

[Doc][3/N] Reorganize Serving section (#11766)


Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
parent d93d2d74
......@@ -77,7 +77,7 @@ pip install vllm
Visit our [documentation](https://vllm.readthedocs.io/en/latest/) to learn more.
- [Installation](https://vllm.readthedocs.io/en/latest/getting_started/installation.html)
- [Quickstart](https://vllm.readthedocs.io/en/latest/getting_started/quickstart.html)
- [Supported Models](https://vllm.readthedocs.io/en/latest/models/supported_models.html)
- [List of Supported Models](https://vllm.readthedocs.io/en/latest/models/supported_models.html)
## Contributing
......
# Dockerfile
We provide a <gh-file:Dockerfile> to construct the image for running an OpenAI compatible server with vLLM.
More information about deploying with Docker can be found [here](../../serving/deploying_with_docker.md).
More information about deploying with Docker can be found [here](#deployment-docker).
Below is a visual representation of the multi-stage Dockerfile. The build graph contains the following nodes:
......
......@@ -3,7 +3,7 @@
# Model Registration
vLLM relies on a model registry to determine how to run each model.
A list of pre-registered architectures can be found on the [Supported Models](#supported-models) page.
A list of pre-registered architectures can be found [here](#supported-models).
If your model is not on this list, you must register it to vLLM.
This page provides detailed instructions on how to do so.
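
To make the idea of a registry concrete, here is a minimal toy sketch of a lookup from architecture name to model class. It is not vLLM's implementation: `ToyModelRegistry`, `resolve`, and `MyModelForCausalLM` are hypothetical names chosen for illustration only.

```python
from typing import Dict, Type


class ToyModelRegistry:
    """Toy mapping from architecture name to model class (illustrative only)."""

    def __init__(self) -> None:
        self._models: Dict[str, Type] = {}

    def register_model(self, arch: str, model_cls: Type) -> None:
        # Later registrations for the same name overwrite earlier ones.
        self._models[arch] = model_cls

    def resolve(self, arch: str) -> Type:
        # Fail loudly when an architecture has not been registered.
        try:
            return self._models[arch]
        except KeyError:
            raise ValueError(f"Architecture {arch!r} is not registered") from None


class MyModelForCausalLM:
    """Hypothetical placeholder for a user-defined model class."""


registry = ToyModelRegistry()
registry.register_model("MyModelForCausalLM", MyModelForCausalLM)
```

With this sketch, `registry.resolve("MyModelForCausalLM")` returns the class, while an unregistered name raises `ValueError`, which mirrors why an unlisted model has to be registered before it can be run.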
......@@ -16,7 +16,7 @@ This gives you the ability to modify the codebase and test your model.
After you have implemented your model (see [tutorial](#new-model-basic)), put it into the <gh-dir:vllm/model_executor/models> directory.
Then, add your model class to `_VLLM_MODELS` in <gh-file:vllm/model_executor/models/registry.py> so that it is automatically registered upon importing vLLM.
You should also include an example HuggingFace repository for this model in <gh-file:tests/models/registry.py> to run the unit tests.
Finally, update the [Supported Models](#supported-models) documentation page to promote your model!
Finally, update our [list of supported models](#supported-models) to promote your model!
```{important}
The list of models in each section should be maintained in alphabetical order.
......
(deploying-with-docker)=
(deployment-docker)=
# Deploying with Docker
# Using Docker
## Use vLLM's Official Docker Image
......
(deploying-with-bentoml)=
(deployment-bentoml)=
# Deploying with BentoML
# BentoML
[BentoML](https://github.com/bentoml/BentoML) allows you to deploy a large language model (LLM) server with vLLM as the backend, which exposes OpenAI-compatible endpoints. You can serve the model locally or containerize it as an OCI-compliant image and deploy it on Kubernetes.
......
(deploying-with-cerebrium)=
(deployment-cerebrium)=
# Deploying with Cerebrium
# Cerebrium
```{raw} html
<p align="center">
......
(deploying-with-dstack)=
(deployment-dstack)=
# Deploying with dstack
# dstack
```{raw} html
<p align="center">
......
(deploying-with-helm)=
(deployment-helm)=
# Deploying with Helm
# Helm
A Helm chart to deploy vLLM for Kubernetes
......@@ -38,7 +38,7 @@ chart **including persistent volumes** and deletes the release.
## Architecture
```{image} architecture_helm_deployment.png
```{image} /assets/deployment/architecture_helm_deployment.png
```
## Values
......
# Using other frameworks
```{toctree}
:maxdepth: 1
bentoml
cerebrium
dstack
helm
lws
skypilot
triton
```
(deploying-with-lws)=
(deployment-lws)=
# Deploying with LWS
# LWS
LeaderWorkerSet (LWS) is a Kubernetes API that aims to address common deployment patterns of AI/ML inference workloads.
A major use case is for multi-host/multi-node distributed inference.
......
(on-cloud)=
(deployment-skypilot)=
# Deploying and scaling up with SkyPilot
# SkyPilot
```{raw} html
<p align="center">
......@@ -12,9 +12,9 @@ vLLM can be **run and scaled to multiple service replicas on clouds and Kubernet
## Prerequisites
- Go to the [HuggingFace model page](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) and request access to the model {code}`meta-llama/Meta-Llama-3-8B-Instruct`.
- Go to the [HuggingFace model page](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) and request access to the model `meta-llama/Meta-Llama-3-8B-Instruct`.
- Check that you have installed SkyPilot ([docs](https://skypilot.readthedocs.io/en/latest/getting-started/installation.html)).
- Check that {code}`sky check` shows clouds or Kubernetes are enabled.
- Check that `sky check` shows clouds or Kubernetes are enabled.
```console
pip install skypilot-nightly
......
(deploying-with-triton)=
(deployment-triton)=
# Deploying with NVIDIA Triton
# NVIDIA Triton
The [Triton Inference Server](https://github.com/triton-inference-server) hosts a tutorial demonstrating how to quickly deploy a simple [facebook/opt-125m](https://huggingface.co/facebook/opt-125m) model using vLLM. Please see [Deploying a vLLM model in Triton](https://github.com/triton-inference-server/tutorials/blob/main/Quick_Deploy/vLLM/README.md#deploying-a-vllm-model-in-triton) for more details.
# External Integrations
```{toctree}
:maxdepth: 1
kserve
kubeai
llamastack
```
(deploying-with-kserve)=
(deployment-kserve)=
# Deploying with KServe
# KServe
vLLM can be deployed with [KServe](https://github.com/kserve/kserve) on Kubernetes for highly scalable distributed model serving.
......
(deploying-with-kubeai)=
(deployment-kubeai)=
# Deploying with KubeAI
# KubeAI
[KubeAI](https://github.com/substratusai/kubeai) is a Kubernetes operator that enables you to deploy and manage AI models on Kubernetes. It provides a simple and scalable way to deploy vLLM in production. Functionality such as scale-from-zero, load-based autoscaling, model caching, and much more is provided out of the box with zero external dependencies.
......
(run-on-llamastack)=
(deployment-llamastack)=
# Serving with Llama Stack
# Llama Stack
vLLM is also available via [Llama Stack](https://github.com/meta-llama/llama-stack).
......
(deploying-with-k8s)=
(deployment-k8s)=
# Deploying with Kubernetes
# Using Kubernetes
Using Kubernetes to deploy vLLM is a scalable and efficient way to serve machine learning models. This guide will walk you through the process of deploying vLLM with Kubernetes, including the necessary prerequisites, steps for deployment, and testing.
......
(nginxloadbalancer)=
# Deploying with Nginx Loadbalancer
# Using Nginx
This document shows how to launch multiple vLLM serving containers and use Nginx to act as a load balancer between the servers.
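
As a rough illustration of the load-balancing setup described here, the sketch below renders a minimal Nginx configuration from a list of backend addresses. The helper `render_nginx_conf`, the upstream name `backend`, and the hostnames `vllm0` and `vllm1` are assumptions made for this example, not taken from the guide. The output is meant to be included inside Nginx's `http` context (for example as a `conf.d` file), and with no balancing method specified Nginx distributes requests round-robin.

```python
from typing import List


def render_nginx_conf(backends: List[str], listen_port: int = 80) -> str:
    """Render a minimal Nginx config that balances across `backends`."""
    if not backends:
        raise ValueError("at least one backend is required")
    # One `server` entry per vLLM container inside the upstream group.
    upstream_lines = "\n".join(f"    server {b};" for b in backends)
    return (
        "upstream backend {\n"
        f"{upstream_lines}\n"
        "}\n"
        "server {\n"
        f"    listen {listen_port};\n"
        "    location / {\n"
        "        proxy_pass http://backend;\n"
        "    }\n"
        "}\n"
    )


# Hypothetical container hostnames and ports.
conf = render_nginx_conf(["vllm0:8000", "vllm1:8000"])
```

Printing `conf` gives a file that could be mounted into an Nginx container; the real guide may use different directives or a different balancing method.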
......
......@@ -57,7 +57,7 @@ More API details can be found in the {doc}`Offline Inference
The code for the `LLM` class can be found in <gh-file:vllm/entrypoints/llm.py>.
### OpenAI-compatible API server
### OpenAI-Compatible API Server
The second primary interface to vLLM is via its OpenAI-compatible API server.
This server can be started using the `vllm serve` command.
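
Once a server is running, a client talks to it over HTTP using the OpenAI request format. The following is a minimal sketch using only the Python standard library. It assumes the server listens on `localhost:8000` and serves the `/v1/completions` route; the helper names `build_completion_request` and `send`, and the model name passed in, are placeholders for illustration.

```python
import json
import urllib.request

# Assumed address of a locally started server; adjust to your deployment.
BASE_URL = "http://localhost:8000/v1"


def build_completion_request(
    model: str, prompt: str, max_tokens: int = 32, base_url: str = BASE_URL
) -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style completion request."""
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        url=f"{base_url}/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request) -> dict:
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

For example, `send(build_completion_request("<model-name>", "Hello"))` would return the decoded JSON response, provided a server for that model is reachable at the assumed address.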
......