[Doc][3/N] Reorganize Serving section (#11766)

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

Unverified Commit 8ceffbf3 authored Jan 07, 2025 by Cyrus Leung Committed by GitHub Jan 07, 2025
20 changed files
--- a/README.md
+++ b/README.md
@@ -77,7 +77,7 @@ pip install vllm
 Visit our [documentation](https://vllm.readthedocs.io/en/latest/) to learn more.
 - [Installation](https://vllm.readthedocs.io/en/latest/getting_started/installation.html)
 - [Quickstart](https://vllm.readthedocs.io/en/latest/getting_started/quickstart.html)
-- [Supported Models](https://vllm.readthedocs.io/en/latest/models/supported_models.html)
+- [List of Supported Models](https://vllm.readthedocs.io/en/latest/models/supported_models.html)

 ## Contributing


--- a/docs/source/serving/architecture_helm_deployment.png
+++ b/docs/source/serving/architecture_helm_deployment.png
--- a/docs/source/contributing/dockerfile/dockerfile.md
+++ b/docs/source/contributing/dockerfile/dockerfile.md
 # Dockerfile

 We provide a <gh-file:Dockerfile> to construct the image for running an OpenAI compatible server with vLLM.
-More information about deploying with Docker can be found [here](../../serving/deploying_with_docker.md).
+More information about deploying with Docker can be found [here](#deployment-docker).

 Below is a visual representation of the multi-stage Dockerfile. The build graph contains the following nodes:


--- a/docs/source/contributing/model/registration.md
+++ b/docs/source/contributing/model/registration.md
@@ -3,7 +3,7 @@
 # Model Registration

 vLLM relies on a model registry to determine how to run each model.
-A list of pre-registered architectures can be found on the [Supported Models](#supported-models) page.
+A list of pre-registered architectures can be found [here](#supported-models).

 If your model is not on this list, you must register it to vLLM.
 This page provides detailed instructions on how to do so.
@@ -16,7 +16,7 @@ This gives you the ability to modify the codebase and test your model.
 After you have implemented your model (see [tutorial](#new-model-basic)), put it into the <gh-dir:vllm/model_executor/models> directory.
 Then, add your model class to `_VLLM_MODELS` in <gh-file:vllm/model_executor/models/registry.py> so that it is automatically registered upon importing vLLM.
 You should also include an example HuggingFace repository for this model in <gh-file:tests/models/registry.py> to run the unit tests.
-Finally, update the [Supported Models](#supported-models) documentation page to promote your model!
+Finally, update our [list of supported models](#supported-models) to promote your model!

 ```{important}
 The list of models in each section should be maintained in alphabetical order.

--- a/docs/source/serving/deploying_with_docker.md
+++ b/docs/source/serving/deploying_with_docker.md
-(deploying-with-docker)=
+(deployment-docker)=

-# Deploying with Docker
+# Using Docker

 ## Use vLLM's Official Docker Image


--- a/docs/source/serving/deploying_with_bentoml.md
+++ b/docs/source/serving/deploying_with_bentoml.md
-(deploying-with-bentoml)=
+(deployment-bentoml)=

-# Deploying with BentoML
+# BentoML

 [BentoML](https://github.com/bentoml/BentoML) allows you to deploy a large language model (LLM) server with vLLM as the backend, which exposes OpenAI-compatible endpoints. You can serve the model locally or containerize it as an OCI-compliant image and deploy it on Kubernetes.


--- a/docs/source/serving/deploying_with_cerebrium.md
+++ b/docs/source/serving/deploying_with_cerebrium.md
-(deploying-with-cerebrium)=
+(deployment-cerebrium)=

-# Deploying with Cerebrium
+# Cerebrium

 ```{raw} html
 <p align="center">

--- a/docs/source/serving/deploying_with_dstack.md
+++ b/docs/source/serving/deploying_with_dstack.md
-(deploying-with-dstack)=
+(deployment-dstack)=

-# Deploying with dstack
+# dstack

 ```{raw} html
 <p align="center">

--- a/docs/source/serving/deploying_with_helm.md
+++ b/docs/source/serving/deploying_with_helm.md
-(deploying-with-helm)=
+(deployment-helm)=

-# Deploying with Helm
+# Helm

 A Helm chart to deploy vLLM for Kubernetes

@@ -38,7 +38,7 @@ chart **including persistent volumes** and deletes the release.

 ## Architecture

-```{image} architecture_helm_deployment.png
+```{image} /assets/deployment/architecture_helm_deployment.png
 ```

 ## Values

--- a/docs/source/deployment/frameworks/index.md
+++ b/docs/source/deployment/frameworks/index.md
+# Using other frameworks
+
+```{toctree}
+:maxdepth: 1
+
+bentoml
+cerebrium
+dstack
+helm
+lws
+skypilot
+triton
+```
--- a/docs/source/serving/deploying_with_lws.md
+++ b/docs/source/serving/deploying_with_lws.md
-(deploying-with-lws)=
+(deployment-lws)=

-# Deploying with LWS
+# LWS

 LeaderWorkerSet (LWS) is a Kubernetes API that aims to address common deployment patterns of AI/ML inference workloads.
 A major use case is for multi-host/multi-node distributed inference.

--- a/docs/source/serving/run_on_sky.md
+++ b/docs/source/serving/run_on_sky.md
-(on-cloud)=
+(deployment-skypilot)=

-# Deploying and scaling up with SkyPilot
+# SkyPilot

 ```{raw} html
 <p align="center">
@@ -12,9 +12,9 @@ vLLM can be **run and scaled to multiple service replicas on clouds and Kubernet

 ## Prerequisites

-- Go to the [HuggingFace model page](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) and request access to the model {code}`meta-llama/Meta-Llama-3-8B-Instruct`.
+- Go to the [HuggingFace model page](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) and request access to the model `meta-llama/Meta-Llama-3-8B-Instruct`.
 - Check that you have installed SkyPilot ([docs](https://skypilot.readthedocs.io/en/latest/getting-started/installation.html)).
-- Check that {code}`sky check` shows clouds or Kubernetes are enabled.
+- Check that `sky check` shows clouds or Kubernetes are enabled.

 ```console
 pip install skypilot-nightly

--- a/docs/source/serving/deploying_with_triton.md
+++ b/docs/source/serving/deploying_with_triton.md
-(deploying-with-triton)=
+(deployment-triton)=

-# Deploying with NVIDIA Triton
+# NVIDIA Triton

 The [Triton Inference Server](https://github.com/triton-inference-server) hosts a tutorial demonstrating how to quickly deploy a simple [facebook/opt-125m](https://huggingface.co/facebook/opt-125m) model using vLLM. Please see [Deploying a vLLM model in Triton](https://github.com/triton-inference-server/tutorials/blob/main/Quick_Deploy/vLLM/README.md#deploying-a-vllm-model-in-triton) for more details.
--- a/docs/source/deployment/integrations/index.md
+++ b/docs/source/deployment/integrations/index.md
+# External Integrations
+
+```{toctree}
+:maxdepth: 1
+
+kserve
+kubeai
+llamastack
+```
--- a/docs/source/serving/deploying_with_kserve.md
+++ b/docs/source/serving/deploying_with_kserve.md
-(deploying-with-kserve)=
+(deployment-kserve)=

-# Deploying with KServe
+# KServe

 vLLM can be deployed with [KServe](https://github.com/kserve/kserve) on Kubernetes for highly scalable distributed model serving.


--- a/docs/source/serving/deploying_with_kubeai.md
+++ b/docs/source/serving/deploying_with_kubeai.md
-(deploying-with-kubeai)=
+(deployment-kubeai)=

-# Deploying with KubeAI
+# KubeAI

 [KubeAI](https://github.com/substratusai/kubeai) is a Kubernetes operator that enables you to deploy and manage AI models on Kubernetes. It provides a simple and scalable way to deploy vLLM in production. Functionality such as scale-from-zero, load based autoscaling, model caching, and much more is provided out of the box with zero external dependencies.


--- a/docs/source/serving/serving_with_llamastack.md
+++ b/docs/source/serving/serving_with_llamastack.md
-(run-on-llamastack)=
+(deployment-llamastack)=

-# Serving with Llama Stack
+# Llama Stack

 vLLM is also available via [Llama Stack](https://github.com/meta-llama/llama-stack).


--- a/docs/source/serving/deploying_with_k8s.md
+++ b/docs/source/serving/deploying_with_k8s.md
-(deploying-with-k8s)=
+(deployment-k8s)=

-# Deploying with Kubernetes
+# Using Kubernetes

 Using Kubernetes to deploy vLLM is a scalable and efficient way to serve machine learning models. This guide will walk you through the process of deploying vLLM with Kubernetes, including the necessary prerequisites, steps for deployment, and testing.


--- a/docs/source/serving/deploying_with_nginx.md
+++ b/docs/source/serving/deploying_with_nginx.md
 (nginxloadbalancer)=

-# Deploying with Nginx Loadbalancer
+# Using Nginx

 This document shows how to launch multiple vLLM serving containers and use Nginx to act as a load balancer between the servers.


--- a/docs/source/design/arch_overview.md
+++ b/docs/source/design/arch_overview.md
@@ -57,7 +57,7 @@ More API details can be found in the {doc}`Offline Inference

 The code for the `LLM` class can be found in <gh-file:vllm/entrypoints/llm.py>.

-### OpenAI-compatible API server
+### OpenAI-Compatible API Server

 The second primary interface to vLLM is via its OpenAI-compatible API server.
 This server can be started using the `vllm serve` command.