[Docs] Convert rST to MyST (Markdown) (#11145)

Signed-off-by: Rafael Vasquez <rafvasq21@gmail.com>

Commit 32aa2059, authored Dec 23, 2024 by Rafael Vasquez; committed by GitHub Dec 23, 2024
20 changed files
--- a/docs/source/quantization/int8.md
+++ b/docs/source/quantization/int8.md
+(int8)=
+# INT8 W8A8
+vLLM supports quantizing weights and activations to INT8 for memory savings and inference acceleration.
+This quantization method is particularly useful for reducing model size while maintaining good performance.
+Please visit the HF collection of [quantized INT8 checkpoints of popular LLMs ready to use with vLLM](https://huggingface.co/collections/neuralmagic/int8-llms-for-vllm-668ec32c049dca0369816415).
+```{note}
+INT8 computation is supported on NVIDIA GPUs with compute capability >= 7.5 (Turing, Ampere, Ada Lovelace, Hopper).
+```
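As a quick pre-flight check, the hardware requirement in the note can be expressed as a small helper. This is a minimal sketch: `supports_int8` and `MIN_INT8_CAPABILITY` are hypothetical names, not part of vLLM, and the threshold assumes Turing (7.5) is the oldest supported architecture, in line with the architectures the note lists. On a machine with PyTorch and a CUDA GPU, the tuple could come from `torch.cuda.get_device_capability()`.

```python
# Hypothetical helper (not a vLLM API): checks the INT8 hardware requirement.
MIN_INT8_CAPABILITY = (7, 5)  # Turing, the oldest architecture listed in the note


def supports_int8(capability: tuple[int, int]) -> bool:
    """Return True if a (major, minor) compute capability meets the INT8 minimum."""
    return tuple(capability) >= MIN_INT8_CAPABILITY


if __name__ == "__main__":
    for name, cap in [("Volta", (7, 0)), ("Turing", (7, 5)), ("Hopper", (9, 0))]:
        print(f"{name} {cap}: {supports_int8(cap)}")
```

Comparing tuples keeps the major/minor ordering correct without converting the capability to a float.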
+## Prerequisites
+To use INT8 quantization with vLLM, you'll need to install the [llm-compressor](https://github.com/vllm-project/llm-compressor/) library:
+```console
+$ pip install llmcompressor
+```
+## Quantization Process
+The quantization process involves four main steps:
+1. Loading the model
+2. Preparing calibration data
+3. Applying quantization
+4. Evaluating accuracy in vLLM
+### 1. Loading the Model
+Use `SparseAutoModelForCausalLM`, which wraps `AutoModelForCausalLM`, for saving and loading quantized models:
+```python
+from llmcompressor.transformers import SparseAutoModelForCausalLM
+from transformers import AutoTokenizer
+MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"
+model = SparseAutoModelForCausalLM.from_pretrained(
+    MODEL_ID, device_map="auto", torch_dtype="auto",
+)
+tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
+```
+### 2. Preparing Calibration Data
+When quantizing activations to INT8, you need sample data to estimate the activation scales.
+It's best to use calibration data that closely matches your deployment data.
+For a general-purpose instruction-tuned model, you can use a dataset like `ultrachat`:
+```python
+from datasets import load_dataset
+NUM_CALIBRATION_SAMPLES = 512
+MAX_SEQUENCE_LENGTH = 2048
+# Load and preprocess the dataset
+ds = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft")
+ds = ds.shuffle(seed=42).select(range(NUM_CALIBRATION_SAMPLES))
+def preprocess(example):
+    return {"text": tokenizer.apply_chat_template(example["messages"], tokenize=False)}
+ds = ds.map(preprocess)
+def tokenize(sample):
+    return tokenizer(sample["text"], padding=False, max_length=MAX_SEQUENCE_LENGTH, truncation=True, add_special_tokens=False)
+ds = ds.map(tokenize, remove_columns=ds.column_names)
+```
+### 3. Applying Quantization
+Now, apply the quantization algorithms:
+```python
+from llmcompressor.transformers import oneshot
+from llmcompressor.modifiers.quantization import GPTQModifier
+from llmcompressor.modifiers.smoothquant import SmoothQuantModifier
+# Configure the quantization algorithms
+recipe = [
+    SmoothQuantModifier(smoothing_strength=0.8),
+    GPTQModifier(targets="Linear", scheme="W8A8", ignore=["lm_head"]),
+]
+# Apply quantization
+oneshot(
+    model=model,
+    dataset=ds,
+    recipe=recipe,
+    max_seq_length=MAX_SEQUENCE_LENGTH,
+    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
+)
+# Save the compressed model
+SAVE_DIR = MODEL_ID.split("/")[1] + "-W8A8-Dynamic-Per-Token"
+model.save_pretrained(SAVE_DIR, save_compressed=True)
+tokenizer.save_pretrained(SAVE_DIR)
+```
+This process creates a W8A8 model with weights and activations quantized to 8-bit integers.
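To make "quantized to 8-bit integers" concrete, here is a minimal pure-Python sketch of symmetric INT8 quantization with a single shared scale. It is illustrative only and is not llm-compressor's implementation: `quantize_int8` and `dequantize_int8` are hypothetical names, and the one-scale-per-list scheme stands in for whatever granularity (for example per channel or per token) the real recipe uses.

```python
# Illustrative only: symmetric INT8 quantization of one list of values.
# This is NOT llm-compressor's implementation; all names are hypothetical.


def quantize_int8(values: list[float]) -> tuple[list[int], float]:
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(v) for v in values)
    # Fall back to a scale of 1.0 when every value is zero to avoid dividing by zero.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize_int8(quantized: list[int], scale: float) -> list[float]:
    """Recover approximate float values from the integers and the scale."""
    return [q * scale for q in quantized]


if __name__ == "__main__":
    weights = [0.5, -1.27, 0.02, 1.0]
    q, scale = quantize_int8(weights)
    print("quantized:", q)
    print("restored: ", dequantize_int8(q, scale))
```

The scale is derived from the largest magnitude in the data, which is why representative calibration samples matter when activation scales are estimated ahead of time.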
+### 4. Evaluating Accuracy
+After quantization, you can load and run the model in vLLM:
+```python
+from vllm import LLM
+model = LLM("./Meta-Llama-3-8B-Instruct-W8A8-Dynamic-Per-Token")
+```
+To evaluate accuracy, you can use `lm_eval`:
+```console
+$ lm_eval --model vllm \
+  --model_args pretrained="./Meta-Llama-3-8B-Instruct-W8A8-Dynamic-Per-Token",add_bos_token=true \
+  --tasks gsm8k \
+  --num_fewshot 5 \
+  --limit 250 \
+  --batch_size 'auto'
+```
+```{note}
+Quantized models can be sensitive to the presence of the `bos` token. Make sure to include the `add_bos_token=True` argument when running evaluations.
+```
+## Best Practices
+- Start with 512 samples for calibration data (increase if accuracy drops)
+- Use a sequence length of 2048 as a starting point
+- Employ the chat template or instruction template that the model was trained with
+- If you've fine-tuned a model, consider using a sample of your training data for calibration
+## Troubleshooting and Support
+If you encounter any issues or have feature requests, please open an issue on the `vllm-project/llm-compressor` GitHub repository.
--- a/docs/source/quantization/int8.rst
+++ b/docs/source/quantization/int8.rst
-.. _int8:
-INT8 W8A8
-==================
-vLLM supports quantizing weights and activations to INT8 for memory savings and inference acceleration.
-This quantization method is particularly useful for reducing model size while maintaining good performance.
-Please visit the HF collection of `quantized INT8 checkpoints of popular LLMs ready to use with vLLM <https://huggingface.co/collections/neuralmagic/int8-llms-for-vllm-668ec32c049dca0369816415>`_.
-.. note::
-   INT8 computation is supported on NVIDIA GPUs with compute capability > 7.5 (Turing, Ampere, Ada Lovelace, Hopper).
-Prerequisites
-------------
-To use INT8 quantization with vLLM, you'll need to install the `llm-compressor <https://github.com/vllm-project/llm-compressor/>`_ library:
-.. code-block:: console
-   $ pip install llmcompressor
-Quantization Process
--------------------
-The quantization process involves four main steps:
-1. Loading the model
-2. Preparing calibration data
-3. Applying quantization
-4. Evaluating accuracy in vLLM
-1. Loading the Model
-^^^^^^^^^^^^^^^^^^^^
-Use ``SparseAutoModelForCausalLM``, which wraps ``AutoModelForCausalLM``, for saving and loading quantized models:
-.. code-block:: python
-   from llmcompressor.transformers import SparseAutoModelForCausalLM
-   from transformers import AutoTokenizer
-   MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"
-   model = SparseAutoModelForCausalLM.from_pretrained(
-       MODEL_ID, device_map="auto", torch_dtype="auto",
-   )
-   tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
-2. Preparing Calibration Data
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-When quantizing activations to INT8, you need sample data to estimate the activation scales.
-It's best to use calibration data that closely matches your deployment data. 
-For a general-purpose instruction-tuned model, you can use a dataset like ``ultrachat``:
-.. code-block:: python
-   from datasets import load_dataset
-   NUM_CALIBRATION_SAMPLES = 512
-   MAX_SEQUENCE_LENGTH = 2048
-   # Load and preprocess the dataset
-   ds = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft")
-   ds = ds.shuffle(seed=42).select(range(NUM_CALIBRATION_SAMPLES))
-   def preprocess(example):
-       return {"text": tokenizer.apply_chat_template(example["messages"], tokenize=False)}
-   ds = ds.map(preprocess)
-   def tokenize(sample):
-       return tokenizer(sample["text"], padding=False, max_length=MAX_SEQUENCE_LENGTH, truncation=True, add_special_tokens=False)
-   ds = ds.map(tokenize, remove_columns=ds.column_names)
-3. Applying Quantization
-^^^^^^^^^^^^^^^^^^^^^^^^
-Now, apply the quantization algorithms:
-.. code-block:: python
-   from llmcompressor.transformers import oneshot
-   from llmcompressor.modifiers.quantization import GPTQModifier
-   from llmcompressor.modifiers.smoothquant import SmoothQuantModifier
-   # Configure the quantization algorithms
-   recipe = [
-       SmoothQuantModifier(smoothing_strength=0.8),
-       GPTQModifier(targets="Linear", scheme="W8A8", ignore=["lm_head"]),
-   ]
-   # Apply quantization
-   oneshot(
-       model=model,
-       dataset=ds,
-       recipe=recipe,
-       max_seq_length=MAX_SEQUENCE_LENGTH,
-       num_calibration_samples=NUM_CALIBRATION_SAMPLES,
-   )
-   # Save the compressed model
-   SAVE_DIR = MODEL_ID.split("/")[1] + "-W8A8-Dynamic-Per-Token"
-   model.save_pretrained(SAVE_DIR, save_compressed=True)
-   tokenizer.save_pretrained(SAVE_DIR)
-This process creates a W8A8 model with weights and activations quantized to 8-bit integers.
-4. Evaluating Accuracy
-^^^^^^^^^^^^^^^^^^^^^^
-After quantization, you can load and run the model in vLLM:
-.. code-block:: python
-   from vllm import LLM
-   model = LLM("./Meta-Llama-3-8B-Instruct-W8A8-Dynamic-Per-Token")
-To evaluate accuracy, you can use ``lm_eval``:
-.. code-block:: console
-   $ lm_eval --model vllm \
-     --model_args pretrained="./Meta-Llama-3-8B-Instruct-W8A8-Dynamic-Per-Token",add_bos_token=true \
-     --tasks gsm8k \
-     --num_fewshot 5 \
-     --limit 250 \
-     --batch_size 'auto'
-.. note::
-   Quantized models can be sensitive to the presence of the ``bos`` token. Make sure to include the ``add_bos_token=True`` argument when running evaluations.
-Best Practices
--------------
- Start with 512 samples for calibration data (increase if accuracy drops)
- Use a sequence length of 2048 as a starting point
- Employ the chat template or instruction template that the model was trained with
- If you've fine-tuned a model, consider using a sample of your training data for calibration
-Troubleshooting and Support
---------------------------
-If you encounter any issues or have feature requests, please open an issue on the ``vllm-project/llm-compressor`` GitHub repository.
--- a/docs/source/quantization/supported_hardware.rst
+++ b/docs/source/quantization/supported_hardware.rst
-.. _supported_hardware_for_quantization:
+(supported-hardware-for-quantization)=
-Supported Hardware for Quantization Kernels
+# Supported Hardware for Quantization Kernels
-===========================================
+The table below shows the compatibility of various quantization implementations with different hardware platforms in vLLM:
-The table below shows the compatibility of various quantization implementations with different hardware platforms in vLLM:
+```{eval-rst}
 .. list-table::
   :header-rows: 1
   :widths: 20 8 8 8 8 8 8 8 8 8 8
   * - Implementation
     - Volta
     - Turing
     - Ampere
     - Ada
     - Hopper
     - AMD GPU
     - Intel GPU
     - x86 CPU
     - AWS Inferentia
     - Google TPU
   * - AWQ
     - ✗
     - ✅︎
     - ✅︎
     - ✅︎
     - ✅︎
     - ✗
     - ✅︎
     - ✅︎
     - ✗
     - ✗
   * - GPTQ
     - ✅︎
     - ✅︎
     - ✅︎
     - ✅︎
     - ✅︎
     - ✗
     - ✅︎
     - ✅︎
     - ✗
     - ✗
   * - Marlin (GPTQ/AWQ/FP8)
     - ✗
     - ✗
     - ✅︎
     - ✅︎
     - ✅︎
     - ✗
     - ✗
     - ✗
     - ✗
     - ✗
   * - INT8 (W8A8)
     - ✗
     - ✅︎
     - ✅︎
     - ✅︎
     - ✅︎
     - ✗
     - ✗
     - ✅︎
     - ✗
     - ✗
   * - FP8 (W8A8)
     - ✗
     - ✗
     - ✗
     - ✅︎
     - ✅︎
     - ✅︎
     - ✗
     - ✗
     - ✗
     - ✗
   * - AQLM
     - ✅︎
     - ✅︎
     - ✅︎
     - ✅︎
     - ✅︎
     - ✗
     - ✗
     - ✗
     - ✗
     - ✗
   * - bitsandbytes
     - ✅︎
     - ✅︎
     - ✅︎
     - ✅︎
     - ✅︎
     - ✗
     - ✗
     - ✗
     - ✗
     - ✗
   * - DeepSpeedFP
     - ✅︎
     - ✅︎
     - ✅︎
     - ✅︎
     - ✅︎
     - ✗
     - ✗
     - ✗
     - ✗
     - ✗
   * - GGUF
     - ✅︎
     - ✅︎
     - ✅︎
     - ✅︎
     - ✅︎
     - ✗
     - ✗
     - ✗
     - ✗
     - ✗
+```
-Notes:
-^^^^^^
+## Notes:
 - Volta refers to SM 7.0, Turing to SM 7.5, Ampere to SM 8.0/8.6, Ada to SM 8.9, and Hopper to SM 9.0.
 - "✅︎" indicates that the quantization method is supported on the specified hardware.
 - "✗" indicates that the quantization method is not supported on the specified hardware.
 Please note that this compatibility chart may be subject to change as vLLM continues to evolve and expand its support for different hardware platforms and quantization methods.
-For the most up-to-date information on hardware support and quantization methods, please check the `quantization directory <https://github.com/vllm-project/vllm/tree/main/vllm/model_executor/layers/quantization>`_ or consult with the vLLM development team.
+For the most up-to-date information on hardware support and quantization methods, please check the [quantization directory](https://github.com/vllm-project/vllm/tree/main/vllm/model_executor/layers/quantization) or consult with the vLLM development team.
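For scripted checks, the SM-to-architecture mapping from the notes and the NVIDIA columns of a few table rows can be encoded as plain data. This is a sketch under stated assumptions: `supported_methods`, `ARCH_BY_SM`, and `NVIDIA_SUPPORT` are hypothetical names, not vLLM APIs, and the data is a hand-transcribed snapshot of the chart above that may drift as the chart changes.

```python
# Transcribed by hand from the table and notes above; the helper is hypothetical.
ARCH_BY_SM = {
    (7, 0): "Volta",
    (7, 5): "Turing",
    (8, 0): "Ampere",
    (8, 6): "Ampere",
    (8, 9): "Ada",
    (9, 0): "Hopper",
}

# NVIDIA columns only, for a subset of the rows in the table.
NVIDIA_SUPPORT = {
    "AWQ": {"Turing", "Ampere", "Ada", "Hopper"},
    "GPTQ": {"Volta", "Turing", "Ampere", "Ada", "Hopper"},
    "Marlin (GPTQ/AWQ/FP8)": {"Ampere", "Ada", "Hopper"},
    "INT8 (W8A8)": {"Turing", "Ampere", "Ada", "Hopper"},
    "FP8 (W8A8)": {"Ada", "Hopper"},
}


def supported_methods(sm: tuple[int, int]) -> list[str]:
    """Return the transcribed methods supported on a given SM version, sorted by name."""
    arch = ARCH_BY_SM.get(tuple(sm))
    if arch is None:
        return []  # SM version not covered by the notes
    return sorted(name for name, archs in NVIDIA_SUPPORT.items() if arch in archs)


if __name__ == "__main__":
    for sm in [(7, 0), (7, 5), (8, 9)]:
        print(sm, supported_methods(sm))
```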
--- a/docs/source/serving/deploying_with_bentoml.md
+++ b/docs/source/serving/deploying_with_bentoml.md
+(deploying-with-bentoml)=
+# Deploying with BentoML
+[BentoML](https://github.com/bentoml/BentoML) allows you to deploy a large language model (LLM) server with vLLM as the backend, which exposes OpenAI-compatible endpoints. You can serve the model locally or containerize it as an OCI-compliant image and deploy it on Kubernetes.
+For details, see the tutorial [vLLM inference in the BentoML documentation](https://docs.bentoml.com/en/latest/use-cases/large-language-models/vllm.html).
--- a/docs/source/serving/deploying_with_bentoml.rst
+++ b/docs/source/serving/deploying_with_bentoml.rst
-.. _deploying_with_bentoml:
-Deploying with BentoML
-======================
-`BentoML <https://github.com/bentoml/BentoML>`_ allows you to deploy a large language model (LLM) server with vLLM as the backend, which exposes OpenAI-compatible endpoints. You can serve the model locally or containerize it as an OCI-complicant image and deploy it on Kubernetes.
-For details, see the tutorial `vLLM inference in the BentoML documentation <https://docs.bentoml.com/en/latest/use-cases/large-language-models/vllm.html>`_.
\ No newline at end of file
--- a/docs/source/serving/deploying_with_cerebrium.md
+++ b/docs/source/serving/deploying_with_cerebrium.md
+(deploying-with-cerebrium)=
+# Deploying with Cerebrium
+```{raw} html
+<p align="center">
+    <img src="https://i.ibb.co/hHcScTT/Screenshot-2024-06-13-at-10-14-54.png" alt="vLLM_plus_cerebrium"/>
+</p>
+```
+vLLM can be run on a cloud-based GPU machine with [Cerebrium](https://www.cerebrium.ai/), a serverless AI infrastructure platform that makes it easier for companies to build and deploy AI-based applications.
+To install the Cerebrium client, run:
+```console
+$ pip install cerebrium
+$ cerebrium login
+```
+Next, to create your Cerebrium project, run:
+```console
+$ cerebrium init vllm-project
+```
+Next, to install the required packages, add the following to your cerebrium.toml:
+```toml
+[cerebrium.deployment]
+docker_base_image_url = "nvidia/cuda:12.1.1-runtime-ubuntu22.04"
+[cerebrium.dependencies.pip]
+vllm = "latest"
+```
+Next, add code to handle inference for the LLM of your choice (`mistralai/Mistral-7B-Instruct-v0.1` for this example). Add the following code to your `main.py`:
+```python
+from vllm import LLM, SamplingParams
+llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.1")
+def run(prompts: list[str], temperature: float = 0.8, top_p: float = 0.95):
+    sampling_params = SamplingParams(temperature=temperature, top_p=top_p)
+    outputs = llm.generate(prompts, sampling_params)
+    # Collect the prompt and generated text for each output.
+    results = []
+    for output in outputs:
+        prompt = output.prompt
+        generated_text = output.outputs[0].text
+        results.append({"prompt": prompt, "generated_text": generated_text})
+    return {"results": results}
+```
+Then, run the following command to deploy it to the cloud:
+```console
+$ cerebrium deploy
+```
+If successful, you should receive a curl command that you can use to run inference. Remember to end the URL with the name of the function you are calling (in our case, `/run`):
+```console
+curl -X POST https://api.cortex.cerebrium.ai/v4/p-xxxxxx/vllm/run \
+ -H 'Content-Type: application/json' \
+ -H 'Authorization: <JWT TOKEN>' \
+ --data '{
+   "prompts": [
+     "Hello, my name is",
+     "The president of the United States is",
+     "The capital of France is",
+     "The future of AI is"
+   ]
+ }'
+```
+You should get a response like:
+```json
+{
+    "run_id": "52911756-3066-9ae8-bcc9-d9129d1bd262",
+    "result": {
+        "result": [
+            {
+                "prompt": "Hello, my name is",
+                "generated_text": " Sarah, and I'm a teacher. I teach elementary school students. One of"
+            },
+            {
+                "prompt": "The president of the United States is",
+                "generated_text": " elected every four years. This is a democratic system.\n\n5. What"
+            },
+            {
+                "prompt": "The capital of France is",
+                "generated_text": " Paris.\n"
+            },
+            {
+                "prompt": "The future of AI is",
+                "generated_text": " bright, but it's important to approach it with a balanced and nuanced perspective."
+            }
+        ]
+    },
+    "run_time_ms": 152.53663063049316
+}
+```
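The same request can also be built from Python with the standard library. This is a minimal sketch: `build_run_request` is a hypothetical helper, the endpoint URL and token are the placeholders from the curl example, and the actual network call is left commented out because it needs a real project ID and token.

```python
import json
import urllib.request


def build_run_request(endpoint: str, token: str, prompts: list[str]) -> urllib.request.Request:
    """Build a POST request equivalent to the curl example (hypothetical helper)."""
    payload = json.dumps({"prompts": prompts}).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=payload,
        headers={"Content-Type": "application/json", "Authorization": token},
        method="POST",
    )


if __name__ == "__main__":
    request = build_run_request(
        "https://api.cortex.cerebrium.ai/v4/p-xxxxxx/vllm/run",  # placeholder URL
        "<JWT TOKEN>",  # placeholder token
        ["Hello, my name is", "The capital of France is"],
    )
    print(request.get_method(), request.full_url)
    # Sending the request requires a real project ID and token:
    # with urllib.request.urlopen(request) as response:
    #     print(json.load(response))
```

Keeping request construction separate from sending makes it easy to inspect the payload before any compute is billed.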
+You now have an autoscaling endpoint where you only pay for the compute you use!
--- a/docs/source/serving/deploying_with_cerebrium.rst
+++ b/docs/source/serving/deploying_with_cerebrium.rst
-.. _deploying_with_cerebrium:
-Deploying with Cerebrium
-============================
-.. raw:: html
-    <p align="center">
-        <img src="https://i.ibb.co/hHcScTT/Screenshot-2024-06-13-at-10-14-54.png" alt="vLLM_plus_cerebrium"/>
-    </p>
-vLLM can be run on a cloud based GPU machine with `Cerebrium <https://www.cerebrium.ai/>`__, a serverless AI infrastructure platform that makes it easier for companies to build and deploy AI based applications.
-To install the Cerebrium client, run:
-.. code-block:: console
-    $ pip install cerebrium
-    $ cerebrium login
-Next, create your Cerebrium project, run:
-.. code-block:: console
-    $ cerebrium init vllm-project
-Next, to install the required packages, add the following to your cerebrium.toml:
-.. code-block:: toml
-    [cerebrium.deployment]
-    docker_base_image_url = "nvidia/cuda:12.1.1-runtime-ubuntu22.04"
-    [cerebrium.dependencies.pip]
-    vllm = "latest"
-Next, let us add our code to handle inference for the LLM of your choice(`mistralai/Mistral-7B-Instruct-v0.1` for this example), add the following code to your main.py`:
-.. code-block:: python
-    from vllm import LLM, SamplingParams
-    llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.1")
-    def run(prompts: list[str], temperature: float = 0.8, top_p: float = 0.95):
-        sampling_params = SamplingParams(temperature=temperature, top_p=top_p)
-        outputs = llm.generate(prompts, sampling_params)
-        # Print the outputs.
-        results = []
-        for output in outputs:
-            prompt = output.prompt
-            generated_text = output.outputs[0].text
-            results.append({"prompt": prompt, "generated_text": generated_text})
-        return {"results": results}
-Then, run the following code to deploy it to the cloud
-.. code-block:: console
-    $ cerebrium deploy
-If successful, you should be returned a CURL command that you can call inference against. Just remember to end the url with the function name you are calling (in our case /run)
-.. code-block:: python
-    curl -X POST https://api.cortex.cerebrium.ai/v4/p-xxxxxx/vllm/run \
-     -H 'Content-Type: application/json' \
-     -H 'Authorization: <JWT TOKEN>' \
-     --data '{
-       "prompts": [
-         "Hello, my name is",
-         "The president of the United States is",
-         "The capital of France is",
-         "The future of AI is"
-       ]
-     }'
-You should get a response like:
-.. code-block:: python
-    {
-        "run_id": "52911756-3066-9ae8-bcc9-d9129d1bd262",
-        "result": {
-            "result": [
-                {
-                    "prompt": "Hello, my name is",
-                    "generated_text": " Sarah, and I'm a teacher. I teach elementary school students. One of"
-                },
-                {
-                    "prompt": "The president of the United States is",
-                    "generated_text": " elected every four years. This is a democratic system.\n\n5. What"
-                },
-                {
-                    "prompt": "The capital of France is",
-                    "generated_text": " Paris.\n"
-                },
-                {
-                    "prompt": "The future of AI is",
-                    "generated_text": " bright, but it's important to approach it with a balanced and nuanced perspective."
-                }
-            ]
-        },
-        "run_time_ms": 152.53663063049316
-    }
-You now have an autoscaling endpoint where you only pay for the compute you use!
--- a/docs/source/serving/deploying_with_docker.md
+++ b/docs/source/serving/deploying_with_docker.md
+(deploying-with-docker)=
+# Deploying with Docker
+## Use vLLM's Official Docker Image
+vLLM offers an official Docker image for deployment.
+The image can be used to run an OpenAI-compatible server and is available on Docker Hub as [vllm/vllm-openai](https://hub.docker.com/r/vllm/vllm-openai/tags).
+```console
+$ docker run --runtime nvidia --gpus all \
+    -v ~/.cache/huggingface:/root/.cache/huggingface \
+    --env "HUGGING_FACE_HUB_TOKEN=<secret>" \
+    -p 8000:8000 \
+    --ipc=host \
+    vllm/vllm-openai:latest \
+    --model mistralai/Mistral-7B-v0.1
+```
+```{note}
+You can use either the `--ipc=host` flag or the `--shm-size` flag to allow the
+container to access the host's shared memory. vLLM uses PyTorch, which uses shared
+memory to share data between processes under the hood, particularly for tensor parallel inference.
+```
+## Building vLLM's Docker Image from Source
+You can build and run vLLM from source via the provided [Dockerfile](https://github.com/vllm-project/vllm/blob/main/Dockerfile). To build vLLM:
+```console
+$ # Optionally add: --build-arg max_jobs=8 --build-arg nvcc_threads=2
+$ DOCKER_BUILDKIT=1 docker build . --target vllm-openai --tag vllm/vllm-openai
+```
+```{note}
+By default vLLM will build for all GPU types for widest distribution. If you are just building for the
+current GPU type the machine is running on, you can add the argument `--build-arg torch_cuda_arch_list=""`
+for vLLM to find the current GPU type and build for that.
+```
+## Building for Arm64/aarch64
+A Docker container can be built for aarch64 systems such as the Nvidia Grace-Hopper. At the time of writing, this requires the use
+of PyTorch Nightly and should be considered **experimental**. Using the flag `--platform "linux/arm64"` will attempt to build for arm64.
+```{note}
+Multiple modules must be compiled, so this process can take a while. We recommend using the `--build-arg max_jobs=` and `--build-arg nvcc_threads=`
+flags to speed up the build process. However, ensure your `max_jobs` is substantially larger than `nvcc_threads` to get the most benefit.
+Keep an eye on memory usage with parallel jobs as it can be substantial (see example below).
+```
+```console
+# Example of building on Nvidia GH200 server. (Memory usage: ~15GB, Build time: ~1475s / ~25 min, Image size: 6.93GB)
+$ python3 use_existing_torch.py
+$ DOCKER_BUILDKIT=1 docker build . \
+  --target vllm-openai \
+  --platform "linux/arm64" \
+  -t vllm/vllm-gh200-openai:latest \
+  --build-arg max_jobs=66 \
+  --build-arg nvcc_threads=2 \
+  --build-arg torch_cuda_arch_list="9.0+PTX" \
+  --build-arg vllm_fa_cmake_gpu_arches="90-real"
+```
+## Use the custom-built vLLM Docker image
+To run vLLM with the custom-built Docker image:
+```console
+$ docker run --runtime nvidia --gpus all \
+    -v ~/.cache/huggingface:/root/.cache/huggingface \
+    -p 8000:8000 \
+    --env "HUGGING_FACE_HUB_TOKEN=<secret>" \
+    vllm/vllm-openai <args...>
+```
+The argument `vllm/vllm-openai` specifies the image to run, and should be replaced with the name of the custom-built image (the `-t` tag from the build command).
+```{note}
+**For versions 0.4.1 and 0.4.2 only**: the vLLM Docker images for these versions are meant to be run as the root user, because a library under the root user's home directory, `/root/.config/vllm/nccl/cu12/libnccl.so.2.18.1`, must be loaded at runtime. If you run the container as a different user, you may need to first change the permissions of the library (and all its parent directories) so that the user can access it, then run vLLM with the environment variable `VLLM_NCCL_SO_PATH=/root/.config/vllm/nccl/cu12/libnccl.so.2.18.1`.
+```
--- a/docs/source/serving/deploying_with_docker.rst
+++ b/docs/source/serving/deploying_with_docker.rst
-.. _deploying_with_docker:
-Deploying with Docker
-============================
-Use vLLM's Official Docker Image
--------------------------------
-vLLM offers an official Docker image for deployment.
-The image can be used to run OpenAI compatible server and is available on Docker Hub as `vllm/vllm-openai <https://hub.docker.com/r/vllm/vllm-openai/tags>`_.
-.. code-block:: console
-    $ docker run --runtime nvidia --gpus all \
-        -v ~/.cache/huggingface:/root/.cache/huggingface \
-        --env "HUGGING_FACE_HUB_TOKEN=<secret>" \
-        -p 8000:8000 \
-        --ipc=host \
-        vllm/vllm-openai:latest \
-        --model mistralai/Mistral-7B-v0.1
-.. note::
-        You can either use the ``ipc=host`` flag or ``--shm-size`` flag to allow the
-        container to access the host's shared memory. vLLM uses PyTorch, which uses shared
-        memory to share data between processes under the hood, particularly for tensor parallel inference.
-Building vLLM's Docker Image from Source
----------------------------------------
-You can build and run vLLM from source via the provided `Dockerfile <https://github.com/vllm-project/vllm/blob/main/Dockerfile>`_. To build vLLM:
-.. code-block:: console
-    $ # optionally specifies: --build-arg max_jobs=8 --build-arg nvcc_threads=2
-    $ DOCKER_BUILDKIT=1 docker build . --target vllm-openai --tag vllm/vllm-openai
-.. note::
-        By default vLLM will build for all GPU types for widest distribution. If you are just building for the
-        current GPU type the machine is running on, you can add the argument ``--build-arg torch_cuda_arch_list=""``
-        for vLLM to find the current GPU type and build for that.
-Building for Arm64/aarch64
--------------------------
-A docker container can be built for aarch64 systems such as the Nvidia Grace-Hopper. At time of this writing, this requires the use
-of PyTorch Nightly and should be considered **experimental**. Using the flag ``--platform "linux/arm64"`` will attempt to build for arm64.
-.. note::
-        Multiple modules must be compiled, so this process can take a while. Recommend using ``--build-arg max_jobs=`` & ``--build-arg nvcc_threads=``
-        flags to speed up build process. However, ensure your ``max_jobs`` is substantially larger than ``nvcc_threads`` to get the most benefits.
-        Keep an eye on memory usage with parallel jobs as it can be substantial (see example below).
-.. code-block:: console
-    # Example of building on Nvidia GH200 server. (Memory usage: ~15GB, Build time: ~1475s / ~25 min, Image size: 6.93GB)
-    $ python3 use_existing_torch.py
-    $ DOCKER_BUILDKIT=1 docker build . \
-      --target vllm-openai \
-      --platform "linux/arm64" \
-      -t vllm/vllm-gh200-openai:latest \
-      --build-arg max_jobs=66 \
-      --build-arg nvcc_threads=2 \
-      --build-arg torch_cuda_arch_list="9.0+PTX" \
-      --build-arg vllm_fa_cmake_gpu_arches="90-real"
-Use the custom-built vLLM Docker image
--------------------------------------
-To run vLLM with the custom-built Docker image:
-.. code-block:: console
-    $ docker run --runtime nvidia --gpus all \
-        -v ~/.cache/huggingface:/root/.cache/huggingface \
-        -p 8000:8000 \
-        --env "HUGGING_FACE_HUB_TOKEN=<secret>" \
-        vllm/vllm-openai <args...>
-The argument ``vllm/vllm-openai`` specifies the image to run, and should be replaced with the name of the custom-built image (the ``-t`` tag from the build command).
-.. note::
-        **For version 0.4.1 and 0.4.2 only** - the vLLM docker images under these versions are supposed to be run under the root user since a library under the root user's home directory, i.e. ``/root/.config/vllm/nccl/cu12/libnccl.so.2.18.1`` is required to be loaded during runtime. If you are running the container under a different user, you may need to first change the permissions of the library (and all the parent directories) to allow the user to access it, then run vLLM with environment variable ``VLLM_NCCL_SO_PATH=/root/.config/vllm/nccl/cu12/libnccl.so.2.18.1`` .
--- a/docs/source/serving/deploying_with_dstack.md
+++ b/docs/source/serving/deploying_with_dstack.md
+(deploying-with-dstack)=
+# Deploying with dstack
+```{raw} html
+<p align="center">
+    <img src="https://i.ibb.co/71kx6hW/vllm-dstack.png" alt="vLLM_plus_dstack"/>
+</p>
+```
+vLLM can be run on a cloud-based GPU machine with [dstack](https://dstack.ai/), an open-source framework for running LLMs on any cloud. This tutorial assumes that you have already configured credentials, gateway, and GPU quotas on your cloud environment.
+To install the dstack client, run:
+```console
+$ pip install "dstack[all]"
+$ dstack server
+```
+Next, to configure your dstack project, run:
+```console
+$ mkdir -p vllm-dstack
+$ cd vllm-dstack
+$ dstack init
+```
+Next, to provision a VM instance with the LLM of your choice (`NousResearch/Llama-2-7b-chat-hf` for this example), create the following `serve.dstack.yml` file for the dstack `Service`:
+```yaml
+type: service
+python: "3.11"
+env:
+    - MODEL=NousResearch/Llama-2-7b-chat-hf
+port: 8000
+resources:
+    gpu: 24GB
+commands:
+    - pip install vllm
+    - vllm serve $MODEL --port 8000
+model:
+    format: openai
+    type: chat
+    name: NousResearch/Llama-2-7b-chat-hf
+```
+Then, run the following CLI for provisioning:
+```console
+$ dstack run . -f serve.dstack.yml
+⠸ Getting run plan...
+ Configuration  serve.dstack.yml
+ Project        deep-diver-main
+ User           deep-diver
+ Min resources  2..xCPU, 8GB.., 1xGPU (24GB)
+ Max price      -
+ Max duration   -
+ Spot policy    auto
+ Retry policy   no
+ #  BACKEND  REGION       INSTANCE       RESOURCES                               SPOT  PRICE
+ 1  gcp   us-central1  g2-standard-4  4xCPU, 16GB, 1xL4 (24GB), 100GB (disk)  yes   $0.223804
+ 2  gcp   us-east1     g2-standard-4  4xCPU, 16GB, 1xL4 (24GB), 100GB (disk)  yes   $0.223804
+ 3  gcp   us-west1     g2-standard-4  4xCPU, 16GB, 1xL4 (24GB), 100GB (disk)  yes   $0.223804
+    ...
+ Shown 3 of 193 offers, $5.876 max
+Continue? [y/n]: y
+⠙ Submitting run...
+⠏ Launching spicy-treefrog-1 (pulling)
+spicy-treefrog-1 provisioning completed (running)
+Service is published at ...
+```
+After the provisioning, you can interact with the model by using the OpenAI SDK:
+```python
+from openai import OpenAI
+client = OpenAI(
+    base_url="https://gateway.<gateway domain>",
+    api_key="<YOUR-DSTACK-SERVER-ACCESS-TOKEN>"
+)
+completion = client.chat.completions.create(
+    model="NousResearch/Llama-2-7b-chat-hf",
+    messages=[
+        {
+            "role": "user",
+            "content": "Compose a poem that explains the concept of recursion in programming.",
+        }
+    ]
+)
+print(completion.choices[0].message.content)
+```
+```{note}
+dstack automatically handles authentication on the gateway using dstack's tokens. If you don't want to configure a gateway, you can provision a dstack `Task` instead of a `Service`. The `Task` is for development purposes only. For more hands-on material on serving vLLM with dstack, check out [this repository](https://github.com/dstackai/dstack-examples/tree/main/deployment/vllm).
+```
--- a/docs/source/serving/deploying_with_dstack.rst
+++ b/docs/source/serving/deploying_with_dstack.rst
-.. _deploying_with_dstack:
-Deploying with dstack
-============================
-.. raw:: html
-    <p align="center">
-        <img src="https://i.ibb.co/71kx6hW/vllm-dstack.png" alt="vLLM_plus_dstack"/>
-    </p>
-vLLM can be run on a cloud based GPU machine with `dstack <https://dstack.ai/>`__, an open-source framework for running LLMs on any cloud. This tutorial assumes that you have already configured credentials, gateway, and GPU quotas on your cloud environment.
-To install dstack client, run:
-.. code-block:: console
-    $ pip install "dstack[all]
-    $ dstack server
-Next, to configure your dstack project, run:
-.. code-block:: console
-    $ mkdir -p vllm-dstack
-    $ cd vllm-dstack
-    $ dstack init
-Next, to provision a VM instance with LLM of your choice(`NousResearch/Llama-2-7b-chat-hf` for this example), create the following `serve.dstack.yml` file for the dstack `Service`:
-.. code-block:: yaml
-    type: service
-    python: "3.11"
-    env:
-        - MODEL=NousResearch/Llama-2-7b-chat-hf
-    port: 8000
-    resources:
-        gpu: 24GB
-    commands:
-        - pip install vllm
-        - vllm serve $MODEL --port 8000
-    model:
-        format: openai
-        type: chat
-        name: NousResearch/Llama-2-7b-chat-hf
-Then, run the following CLI for provisioning:
-.. code-block:: console
-    $ dstack run . -f serve.dstack.yml
-    ⠸ Getting run plan...
-     Configuration  serve.dstack.yml             
-     Project        deep-diver-main              
-     User           deep-diver                   
-     Min resources  2..xCPU, 8GB.., 1xGPU (24GB) 
-     Max price      -                            
-     Max duration   -                            
-     Spot policy    auto                         
-     Retry policy   no                           
-     #  BACKEND  REGION       INSTANCE       RESOURCES                               SPOT  PRICE       
-     1  gcp   us-central1  g2-standard-4  4xCPU, 16GB, 1xL4 (24GB), 100GB (disk)  yes   $0.223804   
-     2  gcp   us-east1     g2-standard-4  4xCPU, 16GB, 1xL4 (24GB), 100GB (disk)  yes   $0.223804   
-     3  gcp   us-west1     g2-standard-4  4xCPU, 16GB, 1xL4 (24GB), 100GB (disk)  yes   $0.223804   
-        ...                                                                                            
-     Shown 3 of 193 offers, $5.876 max
-    Continue? [y/n]: y
-    ⠙ Submitting run...
-    ⠏ Launching spicy-treefrog-1 (pulling)
-    spicy-treefrog-1 provisioning completed (running)
-    Service is published at ...
-After the provisioning, you can interact with the model by using the OpenAI SDK:
-.. code-block:: python
-    from openai import OpenAI
-    client = OpenAI(
-        base_url="https://gateway.<gateway domain>",
-        api_key="<YOUR-DSTACK-SERVER-ACCESS-TOKEN>"
-    )
-    completion = client.chat.completions.create(
-        model="NousResearch/Llama-2-7b-chat-hf",
-        messages=[
-            {
-                "role": "user",
-                "content": "Compose a poem that explains the concept of recursion in programming.",
-            }
-        ]
-    )
-    print(completion.choices[0].message.content)
-.. note::
-    dstack automatically handles authentication on the gateway using dstack's tokens. Meanwhile, if you don't want to configure a gateway, you can provision dstack `Task` instead of `Service`. The `Task` is for development purpose only. If you want to know more about hands-on materials how to serve vLLM using dstack, check out `this repository <https://github.com/dstackai/dstack-examples/tree/main/deployment/vllm>`__
--- a/docs/source/serving/deploying_with_helm.rst
+++ b/docs/source/serving/deploying_with_helm.rst
-.. _deploying_with_helm:
+(deploying-with-helm)=
-Deploying with Helm
+# Deploying with Helm
-===================
 A Helm chart to deploy vLLM for Kubernetes
@@ -9,44 +8,42 @@ Helm is a package manager for Kubernetes. It will help you to deploy vLLM on k8s
 This guide will walk you through the process of deploying vLLM with Helm, including the necessary prerequisites, the steps for `helm install`, and documentation on the architecture and values file.
-Prerequisites
+## Prerequisites
-------------
 Before you begin, ensure that you have the following:
 - A running Kubernetes cluster
- NVIDIA Kubernetes Device Plugin (``k8s-device-plugin``): This can be found at `https://github.com/NVIDIA/k8s-device-plugin <https://github.com/NVIDIA/k8s-device-plugin>`__
+- NVIDIA Kubernetes Device Plugin (`k8s-device-plugin`): This can be found at [https://github.com/NVIDIA/k8s-device-plugin](https://github.com/NVIDIA/k8s-device-plugin)
 - Available GPU resources in your cluster
 - An S3 bucket containing the model to be deployed
-Installing the chart
+## Installing the chart
--------------------
-To install the chart with the release name ``test-vllm``:
-.. code-block:: console
-    helm upgrade --install --create-namespace --namespace=ns-vllm test-vllm . -f values.yaml --set secrets.s3endpoint=$ACCESS_POINT --set secrets.s3bucketname=$BUCKET --set secrets.s3accesskeyid=$ACCESS_KEY --set secrets.s3accesskey=$SECRET_KEY
+To install the chart with the release name `test-vllm`:
-Uninstalling the Chart
+```console
----------------------
+helm upgrade --install --create-namespace --namespace=ns-vllm test-vllm . -f values.yaml --set secrets.s3endpoint=$ACCESS_POINT --set secrets.s3bucketname=$BUCKET --set secrets.s3accesskeyid=$ACCESS_KEY --set secrets.s3accesskey=$SECRET_KEY
+```
-To uninstall the ``test-vllm`` deployment:
+## Uninstalling the Chart
-.. code-block:: console
+To uninstall the `test-vllm` deployment:
-    helm uninstall test-vllm --namespace=ns-vllm
+```console
+helm uninstall test-vllm --namespace=ns-vllm
+```
 The command removes all the Kubernetes components associated with the
 chart **including persistent volumes** and deletes the release.
-Architecture
+## Architecture
------------
-.. image:: architecture_helm_deployment.png
+```{image} architecture_helm_deployment.png
+```
-Values
+## Values
------
+```{eval-rst}
 .. list-table:: Values
   :widths: 25 25 25 25
   :header-rows: 1
@@ -251,3 +248,4 @@ Values
     - string
     - test
     - Release name
+```
--- a/docs/source/serving/deploying_with_k8s.rst
+++ b/docs/source/serving/deploying_with_k8s.rst
-.. _deploying_with_k8s:
+(deploying-with-k8s)=
-Deploying with Kubernetes
+# Deploying with Kubernetes
-==========================
 Using Kubernetes to deploy vLLM is a scalable and efficient way to serve machine learning models. This guide will walk you through the process of deploying vLLM with Kubernetes, including the necessary prerequisites, steps for deployment, and testing.
-Prerequisites
+## Prerequisites
-------------
 Before you begin, ensure that you have the following:
 - A running Kubernetes cluster
 - NVIDIA Kubernetes Device Plugin (`k8s-device-plugin`): This can be found at [https://github.com/NVIDIA/k8s-device-plugin/](https://github.com/NVIDIA/k8s-device-plugin/)
 - Available GPU resources in your cluster
-Deployment Steps
+## Deployment Steps
----------------
-1.  **Create a PVC , Secret and Deployment for vLLM**
+1. **Create a PVC, Secret, and Deployment for vLLM**
 The PVC is used to store the model cache. It is optional; you can use hostPath or other storage options instead.
-.. code-block:: yaml
+```yaml
+apiVersion: v1
-  apiVersion: v1
+kind: PersistentVolumeClaim
-  kind: PersistentVolumeClaim
+metadata:
-  metadata:
+  name: mistral-7b
-    name: mistral-7b
+  namespace: default
-    namespace: default
+spec:
-  spec:
+  accessModes:
-    accessModes:
+  - ReadWriteOnce
-    - ReadWriteOnce
+  resources:
-    resources:
+    requests:
-      requests:
+      storage: 50Gi
-        storage: 50Gi
+  storageClassName: default
-    storageClassName: default
+  volumeMode: Filesystem
-    volumeMode: Filesystem
+```
 The Secret is optional and only required for accessing gated models. You can skip this step if you are not using gated models.
-.. code-block:: yaml
+```yaml
+apiVersion: v1
-  apiVersion: v1
+kind: Secret
-  kind: Secret
+metadata:
-  metadata:
+  name: hf-token-secret
-    name: hf-token-secret
+  namespace: default
-    namespace: default
+type: Opaque
-  type: Opaque
+data:
-  data:
+  token: "REPLACE_WITH_TOKEN"
-    token: "REPLACE_WITH_TOKEN"
+```
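Kubernetes expects values under a Secret's `data` field to be base64-encoded (plain text goes under `stringData` instead). The following is a minimal sketch of encoding a token before pasting it over `REPLACE_WITH_TOKEN`; the function names are illustrative and not part of the original guide:

```python
import base64


def encode_secret_value(token: str) -> str:
    """Return the base64 text expected under a Secret's `data` field."""
    return base64.b64encode(token.encode("utf-8")).decode("ascii")


def decode_secret_value(encoded: str) -> str:
    """Reverse the encoding, e.g. to double-check what was stored."""
    return base64.b64decode(encoded).decode("utf-8")
```

Decoding the stored value again is a quick way to confirm that no stray newline or whitespace was encoded along with the token.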
 Create a deployment file for vLLM to run the model server. The following example deploys the `Mistral-7B-Instruct-v0.3` model:
-.. code-block:: yaml
+```yaml
+apiVersion: apps/v1
-  apiVersion: apps/v1
+kind: Deployment
-  kind: Deployment
+metadata:
-  metadata:
+  name: mistral-7b
-    name: mistral-7b
+  namespace: default
-    namespace: default
+  labels:
-    labels:
+    app: mistral-7b
+spec:
+  replicas: 1
+  selector:
+    matchLabels:
      app: mistral-7b
-  spec:
+  template:
-    replicas: 1
+    metadata:
-    selector:
+      labels:
-      matchLabels:
        app: mistral-7b
-    template:
+    spec:
-      metadata:
+      volumes:
-        labels:
+      - name: cache-volume
-          app: mistral-7b
+        persistentVolumeClaim:
-      spec:
+          claimName: mistral-7b
-        volumes:
+      # vLLM needs to access the host's shared memory for tensor parallel inference.
-        - name: cache-volume
+      - name: shm
-          persistentVolumeClaim:
+        emptyDir:
-            claimName: mistral-7b
+          medium: Memory
-        # vLLM needs to access the host's shared memory for tensor parallel inference.
+          sizeLimit: "2Gi"
+      containers:
+      - name: mistral-7b
+        image: vllm/vllm-openai:latest
+        command: ["/bin/sh", "-c"]
+        args: [
+          "vllm serve mistralai/Mistral-7B-Instruct-v0.3 --trust-remote-code --enable-chunked-prefill --max_num_batched_tokens 1024"
+        ]
+        env:
+        - name: HUGGING_FACE_HUB_TOKEN
+          valueFrom:
+            secretKeyRef:
+              name: hf-token-secret
+              key: token
+        ports:
+        - containerPort: 8000
+        resources:
+          limits:
+            cpu: "10"
+            memory: 20G
+            nvidia.com/gpu: "1"
+          requests:
+            cpu: "2"
+            memory: 6G
+            nvidia.com/gpu: "1"
+        volumeMounts:
+        - mountPath: /root/.cache/huggingface
+          name: cache-volume
        - name: shm
-          emptyDir:
+          mountPath: /dev/shm
-            medium: Memory
+        livenessProbe:
-            sizeLimit: "2Gi"
+          httpGet:
-        containers:
+            path: /health
-        - name: mistral-7b
+            port: 8000
-          image: vllm/vllm-openai:latest
+          initialDelaySeconds: 60
-          command: ["/bin/sh", "-c"]
+          periodSeconds: 10
-          args: [
+        readinessProbe:
-            "vllm serve mistralai/Mistral-7B-Instruct-v0.3 --trust-remote-code --enable-chunked-prefill --max_num_batched_tokens 1024"
+          httpGet:
-          ]
+            path: /health
-          env:
+            port: 8000
-          - name: HUGGING_FACE_HUB_TOKEN
+          initialDelaySeconds: 60
-            valueFrom:
+          periodSeconds: 5
-              secretKeyRef:
+```
-                name: hf-token-secret
-                key: token
-          ports:
-          - containerPort: 8000
-          resources:
-            limits:
-              cpu: "10"
-              memory: 20G
-              nvidia.com/gpu: "1"
-            requests:
-              cpu: "2"
-              memory: 6G
-              nvidia.com/gpu: "1"
-          volumeMounts:
-          - mountPath: /root/.cache/huggingface
-            name: cache-volume
-          - name: shm
-            mountPath: /dev/shm
-          livenessProbe:
-            httpGet:
-              path: /health
-              port: 8000
-            initialDelaySeconds: 60
-            periodSeconds: 10
-          readinessProbe:
-            httpGet:
-              path: /health
-              port: 8000
-            initialDelaySeconds: 60
-            periodSeconds: 5
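The probe settings above mean that `/health` is first checked after 60 seconds and then polled on a fixed period. The sketch below imitates that timing from a client's point of view; it is a simplified illustration and not how the kubelet is implemented (failure thresholds and timeouts are ignored), and `wait_until_healthy` and its parameters are hypothetical names:

```python
import time
from typing import Callable


def wait_until_healthy(
    check: Callable[[], bool],
    initial_delay: float = 60.0,  # mirrors initialDelaySeconds
    period: float = 5.0,          # mirrors the readiness probe's periodSeconds
    max_attempts: int = 10,       # assumption of this sketch, not in the manifest
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `check` after an initial delay; return True once it succeeds."""
    sleep(initial_delay)
    for attempt in range(max_attempts):
        if check():
            return True
        # Wait one period between attempts, but not after the last one.
        if attempt < max_attempts - 1:
            sleep(period)
    return False
```

Here `check` would typically issue a GET against `/health` on port 8000 and return whether the status code was 200. Injecting `sleep` keeps the function easy to exercise without real waiting.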
 2. **Create a Kubernetes Service for vLLM**
 Next, create a Kubernetes Service file to expose the `mistral-7b` deployment:
-.. code-block:: yaml
+```yaml
+apiVersion: v1
-    apiVersion: v1
+kind: Service
-    kind: Service
+metadata:
-    metadata:
+  name: mistral-7b
-      name: mistral-7b
+  namespace: default
-      namespace: default
+spec:
-    spec:
+  ports:
-      ports:
+  - name: http-mistral-7b
-      - name: http-mistral-7b
+    port: 80
-        port: 80
+    protocol: TCP
-        protocol: TCP
+    targetPort: 8000
-        targetPort: 8000
+  # The label selector should match the deployment labels; it is also useful for the prefix caching feature
-      # The label selector should match the deployment labels & it is useful for prefix caching feature
+  selector:
-      selector:
+    app: mistral-7b
-        app: mistral-7b
+  sessionAffinity: None
-      sessionAffinity: None
+  type: ClusterIP
-      type: ClusterIP
+```
 3. **Deploy and Test**
-Apply the deployment and service configurations using ``kubectl apply -f <filename>``:
+Apply the deployment and service configurations using `kubectl apply -f <filename>`:
-.. code-block:: console
+```console
+kubectl apply -f deployment.yaml
+kubectl apply -f service.yaml
+```
-    kubectl apply -f deployment.yaml
+To test the deployment, run the following `curl` command:
-    kubectl apply -f service.yaml
-To test the deployment, run the following ``curl`` command:
+```console
+curl http://mistral-7b.default.svc.cluster.local/v1/completions \
-.. code-block:: console
+  -H "Content-Type: application/json" \
+  -d '{
-    curl http://mistral-7b.default.svc.cluster.local/v1/completions \
+        "model": "mistralai/Mistral-7B-Instruct-v0.3",
-      -H "Content-Type: application/json" \
+        "prompt": "San Francisco is a",
-      -d '{
+        "max_tokens": 7,
-            "model": "mistralai/Mistral-7B-Instruct-v0.3",
+        "temperature": 0
-            "prompt": "San Francisco is a",
+      }'
-            "max_tokens": 7,
+```
-            "temperature": 0
-          }'
 If the service is correctly deployed, you should receive a response from the vLLM model.
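The same test request can be issued from Python using only the standard library. This is a sketch that assumes the code runs somewhere the in-cluster service name resolves (for example, inside a pod); `build_completion_request` and `send` are hypothetical helper names:

```python
import json
import urllib.request


def build_completion_request(
    base_url: str = "http://mistral-7b.default.svc.cluster.local",
) -> urllib.request.Request:
    """Build the same POST request as the curl command, without sending it."""
    payload = {
        "model": "mistralai/Mistral-7B-Instruct-v0.3",
        "prompt": "San Francisco is a",
        "max_tokens": 7,
        "temperature": 0,
    }
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 30.0) -> dict:
    """Send the request and return the decoded JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.load(resp)
```

Calling `send(build_completion_request())` should return a JSON object; in the OpenAI-compatible completions format, the generated text is expected under `choices[0]["text"]`.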
-Conclusion
+## Conclusion
----------
 Deploying vLLM with Kubernetes allows for efficient scaling and management of ML models leveraging GPU resources. By following the steps outlined above, you should be able to set up and test a vLLM deployment within your Kubernetes cluster. If you encounter any issues or have suggestions, please feel free to contribute to the documentation.
--- a/docs/source/serving/deploying_with_kserve.md
+++ b/docs/source/serving/deploying_with_kserve.md
+(deploying-with-kserve)=
+# Deploying with KServe
+vLLM can be deployed with [KServe](https://github.com/kserve/kserve) on Kubernetes for highly scalable distributed model serving.
+Please see [this guide](https://kserve.github.io/website/latest/modelserving/v1beta1/llm/huggingface/) for more details on using vLLM with KServe.
--- a/docs/source/serving/deploying_with_kserve.rst
+++ b/docs/source/serving/deploying_with_kserve.rst
-.. _deploying_with_kserve:
-Deploying with KServe
-============================
-vLLM can be deployed with `KServe <https://github.com/kserve/kserve>`_ on Kubernetes for highly scalable distributed model serving.
-Please see `this guide <https://kserve.github.io/website/latest/modelserving/v1beta1/llm/huggingface/>`_ for more details on using vLLM with KServe.
--- a/docs/source/serving/deploying_with_kubeai.md
+++ b/docs/source/serving/deploying_with_kubeai.md
+(deploying-with-kubeai)=
+# Deploying with KubeAI
+[KubeAI](https://github.com/substratusai/kubeai) is a Kubernetes operator that enables you to deploy and manage AI models on Kubernetes. It provides a simple and scalable way to deploy vLLM in production. Functionality such as scale-from-zero, load-based autoscaling, model caching, and much more is provided out of the box with zero external dependencies.
+Please see the Installation Guides for environment-specific instructions:
+- [Any Kubernetes Cluster](https://www.kubeai.org/installation/any/)
+- [EKS](https://www.kubeai.org/installation/eks/)
+- [GKE](https://www.kubeai.org/installation/gke/)
+Once you have KubeAI installed, you can
+[configure text generation models](https://www.kubeai.org/how-to/configure-text-generation-models/)
+using vLLM.
--- a/docs/source/serving/deploying_with_kubeai.rst
+++ b/docs/source/serving/deploying_with_kubeai.rst
-.. _deploying_with_kubeai:
-Deploying with KubeAI
-=====================
-`KubeAI <https://github.com/substratusai/kubeai>`_ is a Kubernetes operator that enables you to deploy and manage AI models on Kubernetes. It provides a simple and scalable way to deploy vLLM in production. Functionality such as scale-from-zero, load based autoscaling, model caching, and much more is provided out of the box with zero external dependencies.
-Please see the Installation Guides for environment specific instructions:
-* `Any Kubernetes Cluster <https://www.kubeai.org/installation/any/>`_
-* `EKS <https://www.kubeai.org/installation/eks/>`_
-* `GKE <https://www.kubeai.org/installation/gke/>`_
-Once you have KubeAI installed, you can
-`configure text generation models <https://www.kubeai.org/how-to/configure-text-generation-models/>`_
-using vLLM.
\ No newline at end of file
--- a/docs/source/serving/deploying_with_lws.rst
+++ b/docs/source/serving/deploying_with_lws.rst
-.. _deploying_with_lws:
+(deploying-with-lws)=
-Deploying with LWS
+# Deploying with LWS
-============================
 LeaderWorkerSet (LWS) is a Kubernetes API that aims to address common deployment patterns of AI/ML inference workloads.
 A major use case is for multi-host/multi-node distributed inference.
-vLLM can be deployed with `LWS <https://github.com/kubernetes-sigs/lws>`_ on Kubernetes for distributed model serving.
+vLLM can be deployed with [LWS](https://github.com/kubernetes-sigs/lws) on Kubernetes for distributed model serving.
-Please see `this guide <https://github.com/kubernetes-sigs/lws/tree/main/docs/examples/vllm>`_ for more details on
+Please see [this guide](https://github.com/kubernetes-sigs/lws/tree/main/docs/examples/vllm) for more details on
 deploying vLLM on Kubernetes using LWS.
--- a/docs/source/serving/deploying_with_nginx.md
+++ b/docs/source/serving/deploying_with_nginx.md
+(nginxloadbalancer)=
+# Deploying with Nginx Loadbalancer
+This document shows how to launch multiple vLLM serving containers and use Nginx to act as a load balancer between the servers.
+Table of contents:
+1. [Build Nginx Container](#nginxloadbalancer-nginx-build)
+2. [Create Simple Nginx Config file](#nginxloadbalancer-nginx-conf)
+3. [Build vLLM Container](#nginxloadbalancer-nginx-vllm-container)
+4. [Create Docker Network](#nginxloadbalancer-nginx-docker-network)
+5. [Launch vLLM Containers](#nginxloadbalancer-nginx-launch-container)
+6. [Launch Nginx](#nginxloadbalancer-nginx-launch-nginx)
+7. [Verify That vLLM Servers Are Ready](#nginxloadbalancer-nginx-verify-nginx)
+(nginxloadbalancer-nginx-build)=
+## Build Nginx Container
+This guide assumes that you have just cloned the vLLM project and you're currently in the vllm root directory.
+```console
+export vllm_root=`pwd`
+```
+Create a file named `Dockerfile.nginx`:
+```dockerfile
+FROM nginx:latest
+RUN rm /etc/nginx/conf.d/default.conf
+EXPOSE 80
+CMD ["nginx", "-g", "daemon off;"]
+```
+Build the container:
+```console
+docker build . -f Dockerfile.nginx --tag nginx-lb
+```
+(nginxloadbalancer-nginx-conf)=
+## Create Simple Nginx Config file
+Create a file named `nginx_conf/nginx.conf`. You can add as many servers as you'd like; the example below starts with two. To add more, add another `server vllmN:8000 max_fails=3 fail_timeout=10000s;` entry to `upstream backend`.
+```nginx
+upstream backend {
+    least_conn;
+    server vllm0:8000 max_fails=3 fail_timeout=10000s;
+    server vllm1:8000 max_fails=3 fail_timeout=10000s;
+}
+server {
+    listen 80;
+    location / {
+        proxy_pass http://backend;
+        proxy_set_header Host $host;
+        proxy_set_header X-Real-IP $remote_addr;
+        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
+        proxy_set_header X-Forwarded-Proto $scheme;
+    }
+}
+```
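Since each additional backend is just another `server vllmN:8000 ...` line, the `upstream backend` block can be generated for any number of servers. A minimal sketch, where `render_upstream` is a hypothetical helper and the parameters mirror the values in the config above:

```python
def render_upstream(num_servers: int, port: int = 8000) -> str:
    """Render the `upstream backend` block for servers vllm0..vllm{N-1}."""
    if num_servers < 1:
        raise ValueError("at least one backend server is required")
    lines = ["upstream backend {", "    least_conn;"]
    for i in range(num_servers):
        lines.append(f"    server vllm{i}:{port} max_fails=3 fail_timeout=10000s;")
    lines.append("}")
    return "\n".join(lines)
```

For example, `render_upstream(2)` reproduces the two-server block shown above, and `render_upstream(4)` adds `vllm2` and `vllm3`.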
+(nginxloadbalancer-nginx-vllm-container)=
+## Build vLLM Container
+```console
+cd $vllm_root
+docker build -f Dockerfile . --tag vllm
+```
+If you are behind a proxy, you can pass the proxy settings to the docker build command as shown below:
+```console
+cd $vllm_root
+docker build -f Dockerfile . --tag vllm --build-arg http_proxy=$http_proxy --build-arg https_proxy=$https_proxy
+```
+(nginxloadbalancer-nginx-docker-network)=
+## Create Docker Network
+```console
+docker network create vllm_nginx
+```
+(nginxloadbalancer-nginx-launch-container)=
+## Launch vLLM Containers
+Notes:
+- If you have your HuggingFace models cached somewhere else, update `hf_cache_dir` below.
+- If you don't have an existing HuggingFace cache, start `vllm0` first and wait for the model to finish downloading and the server to be ready. This ensures that `vllm1` can reuse the model you just downloaded instead of downloading it again.
+- The example below assumes the GPU backend. If you are using the CPU backend, remove `--gpus all` and add the `VLLM_CPU_KVCACHE_SPACE` and `VLLM_CPU_OMP_THREADS_BIND` environment variables to the docker run command.
+- If you don't want to use `Llama-2-7b-chat-hf`, change the model name passed to your vLLM servers.
+```console
+mkdir -p ~/.cache/huggingface/hub/
+hf_cache_dir=~/.cache/huggingface/
+docker run -itd --ipc host --privileged --network vllm_nginx --gpus all --shm-size=10.24gb -v $hf_cache_dir:/root/.cache/huggingface/ -p 8081:8000 --name vllm0 vllm --model meta-llama/Llama-2-7b-chat-hf
+docker run -itd --ipc host --privileged --network vllm_nginx --gpus all --shm-size=10.24gb -v $hf_cache_dir:/root/.cache/huggingface/ -p 8082:8000 --name vllm1 vllm --model meta-llama/Llama-2-7b-chat-hf
+```
+```{note}
+If you are behind a proxy, you can pass the proxy settings to the docker run command via `-e http_proxy=$http_proxy -e https_proxy=$https_proxy`.
+```
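The two `docker run` commands above differ only in the container name and the published host port. The sketch below builds the argument list for the N-th container; `vllm_run_args` is a hypothetical helper, and the host-port scheme (8081 plus the index) is inferred from the two commands shown rather than stated by the guide:

```python
def vllm_run_args(
    index: int,
    hf_cache_dir: str,
    model: str = "meta-llama/Llama-2-7b-chat-hf",
    first_host_port: int = 8081,
) -> list:
    """Return the `docker run` argument list for container `vllm{index}`."""
    return [
        "docker", "run", "-itd",
        "--ipc", "host",
        "--privileged",
        "--network", "vllm_nginx",
        "--gpus", "all",
        "--shm-size=10.24gb",
        "-v", f"{hf_cache_dir}:/root/.cache/huggingface/",
        "-p", f"{first_host_port + index}:8000",
        "--name", f"vllm{index}",
        "vllm",
        "--model", model,
    ]
```

The list can be handed to `subprocess.run(...)` directly. In that case pass an already expanded cache path (for example via `os.path.expanduser`), because no shell is involved to expand `~`.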
+(nginxloadbalancer-nginx-launch-nginx)=
+## Launch Nginx
+```console
+docker run -itd -p 8000:80 --network vllm_nginx -v ./nginx_conf/:/etc/nginx/conf.d/ --name nginx-lb nginx-lb:latest
+```
+(nginxloadbalancer-nginx-verify-nginx)=
+## Verify That vLLM Servers Are Ready
+```console
+docker logs vllm0 | grep Uvicorn
+docker logs vllm1 | grep Uvicorn
+```
+Both outputs should look like this:
+```console
+INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
+```
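The readiness check can also be scripted. The following is a sketch, assuming the `docker` CLI is on the `PATH`; `server_ready` and `container_ready` are hypothetical helper names, and both output streams are inspected because the sketch makes no assumption about which one the server logs to:

```python
import subprocess


def server_ready(log_text: str) -> bool:
    """Return True if the log text contains the Uvicorn startup line."""
    return any("Uvicorn running on" in line for line in log_text.splitlines())


def container_ready(name: str) -> bool:
    """Fetch a container's logs with `docker logs` and look for the startup line."""
    result = subprocess.run(
        ["docker", "logs", name],
        capture_output=True,
        text=True,
        check=False,
    )
    return server_ready(result.stdout + "\n" + result.stderr)
```

A call such as `all(container_ready(n) for n in ("vllm0", "vllm1"))` then indicates whether both backends have started.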
--- a/docs/source/serving/deploying_with_nginx.rst
+++ b/docs/source/serving/deploying_with_nginx.rst
-.. _nginxloadbalancer:
-Deploying with Nginx Loadbalancer
-=================================
-This document shows how to launch multiple vLLM serving containers and use Nginx to act as a load balancer between the servers. 
-Table of contents:
-#. :ref:`Build Nginx Container <nginxloadbalancer_nginx_build>`
-#. :ref:`Create Simple Nginx Config file <nginxloadbalancer_nginx_conf>`
-#. :ref:`Build vLLM Container <nginxloadbalancer_nginx_vllm_container>`
-#. :ref:`Create Docker Network <nginxloadbalancer_nginx_docker_network>`
-#. :ref:`Launch vLLM Containers <nginxloadbalancer_nginx_launch_container>`
-#. :ref:`Launch Nginx <nginxloadbalancer_nginx_launch_nginx>`
-#. :ref:`Verify That vLLM Servers Are Ready <nginxloadbalancer_nginx_verify_nginx>`
-.. _nginxloadbalancer_nginx_build:
-Build Nginx Container
---------------------
-This guide assumes that you have just cloned the vLLM project and you're currently in the vllm root directory.
-.. code-block:: console
-    export vllm_root=`pwd`
-Create a file named ``Dockerfile.nginx``:
-.. code-block:: console
-    FROM nginx:latest
-    RUN rm /etc/nginx/conf.d/default.conf
-    EXPOSE 80
-    CMD ["nginx", "-g", "daemon off;"]
-Build the container:
-.. code-block:: console
-    docker build . -f Dockerfile.nginx --tag nginx-lb
-.. _nginxloadbalancer_nginx_conf:
-Create Simple Nginx Config file
-------------------------------
-Create a file named ``nginx_conf/nginx.conf``. Note that you can add as many servers as you'd like. In the below example we'll start with two. To add more, add another ``server vllmN:8000 max_fails=3 fail_timeout=10000s;`` entry to ``upstream backend``.
-.. code-block:: console
-    upstream backend {
-        least_conn;
-        server vllm0:8000 max_fails=3 fail_timeout=10000s;
-        server vllm1:8000 max_fails=3 fail_timeout=10000s;
-    }     
-    server {
-        listen 80;
-        location / {
-            proxy_pass http://backend;
-            proxy_set_header Host $host;
-            proxy_set_header X-Real-IP $remote_addr;
-            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
-            proxy_set_header X-Forwarded-Proto $scheme;
-        }
-    }
-.. _nginxloadbalancer_nginx_vllm_container:
-Build vLLM Container
--------------------
-.. code-block:: console
-    cd $vllm_root
-    docker build -f Dockerfile . --tag vllm
-If you are behind proxy, you can pass the proxy settings to the docker build command as shown below:
-.. code-block:: console
-    cd $vllm_root
-    docker build -f Dockerfile . --tag vllm --build-arg http_proxy=$http_proxy --build-arg https_proxy=$https_proxy
-.. _nginxloadbalancer_nginx_docker_network:
-Create Docker Network
---------------------
-.. code-block:: console
-    docker network create vllm_nginx
-.. _nginxloadbalancer_nginx_launch_container:
-Launch vLLM Containers
----------------------
-Notes:
-* If you have your HuggingFace models cached somewhere else, update ``hf_cache_dir`` below. 
-* If you don't have an existing HuggingFace cache you will want to start ``vllm0`` and wait for the model to complete downloading and the server to be ready. This will ensure that ``vllm1`` can leverage the model you just downloaded and it won't have to be downloaded again.
-* The below example assumes GPU backend used. If you are using CPU backend, remove ``--gpus all``, add ``VLLM_CPU_KVCACHE_SPACE`` and ``VLLM_CPU_OMP_THREADS_BIND`` environment variables to the docker run command.
-* Adjust the model name that you want to use in your vLLM servers if you don't want to use ``Llama-2-7b-chat-hf``. 
-.. code-block:: console
-    mkdir -p ~/.cache/huggingface/hub/
-    hf_cache_dir=~/.cache/huggingface/
-    docker run -itd --ipc host --privileged --network vllm_nginx --gpus all --shm-size=10.24gb -v $hf_cache_dir:/root/.cache/huggingface/ -p 8081:8000 --name vllm0 vllm --model meta-llama/Llama-2-7b-chat-hf
-    docker run -itd --ipc host --privileged --network vllm_nginx --gpus all --shm-size=10.24gb -v $hf_cache_dir:/root/.cache/huggingface/ -p 8082:8000 --name vllm1 vllm --model meta-llama/Llama-2-7b-chat-hf
-.. note::
-    If you are behind proxy, you can pass the proxy settings to the docker run command via ``-e http_proxy=$http_proxy -e https_proxy=$https_proxy``.
-.. _nginxloadbalancer_nginx_launch_nginx:
-Launch Nginx
------------
-.. code-block:: console
-    docker run -itd -p 8000:80 --network vllm_nginx -v ./nginx_conf/:/etc/nginx/conf.d/ --name nginx-lb nginx-lb:latest
-.. _nginxloadbalancer_nginx_verify_nginx:
-Verify That vLLM Servers Are Ready
----------------------------------
-.. code-block:: console
-    docker logs vllm0 | grep Uvicorn
-    docker logs vllm1 | grep Uvicorn
-Both outputs should look like this:
-.. code-block:: console
-    INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)