[Doc] Convert docs to use colon fences (#12471)

Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>

Commit dd6a3a02 · authored Jan 29, 2025 by Harry Mellor · committed by GitHub Jan 29, 2025
20 changed files
--- a/docs/source/getting_started/installation/ai_accelerator/tpu.inc.md
+++ b/docs/source/getting_started/installation/ai_accelerator/tpu.inc.md
@@ -47,10 +47,10 @@ When you request queued resources, the request is added to a queue maintained by
 the Cloud TPU service. When the requested resource becomes available, it's
 assigned to your Google Cloud project for your immediate exclusive use.
-```{note}
+:::{note}
 In all of the following commands, replace the ALL CAPS parameter names with
 appropriate values. See the parameter descriptions table for more information.
-```
+:::
 ### Provision Cloud TPUs with GKE
@@ -75,33 +75,33 @@ gcloud alpha compute tpus queued-resources create QUEUED_RESOURCE_ID \
 --service-account SERVICE_ACCOUNT
 ```
-```{list-table} Parameter descriptions
+:::{list-table} Parameter descriptions
 :header-rows: 1
-* - Parameter name
+- * Parameter name
-  - Description
+  * Description
-* - QUEUED_RESOURCE_ID
+- * QUEUED_RESOURCE_ID
-  - The user-assigned ID of the queued resource request.
+  * The user-assigned ID of the queued resource request.
-* - TPU_NAME
+- * TPU_NAME
-  - The user-assigned name of the TPU which is created when the queued
+  * The user-assigned name of the TPU which is created when the queued
    resource request is allocated.
-* - PROJECT_ID
+- * PROJECT_ID
-  - Your Google Cloud project
+  * Your Google Cloud project
-* - ZONE
+- * ZONE
-  - The GCP zone where you want to create your Cloud TPU. The value you use
+  * The GCP zone where you want to create your Cloud TPU. The value you use
    depends on the version of TPUs you are using. For more information, see
    `TPU regions and zones <https://cloud.google.com/tpu/docs/regions-zones>`_
-* - ACCELERATOR_TYPE
+- * ACCELERATOR_TYPE
-  - The TPU version you want to use. Specify the TPU version, for example
+  * The TPU version you want to use. Specify the TPU version, for example
    `v5litepod-4` specifies a v5e TPU with 4 cores. For more information,
    see `TPU versions <https://cloud.devsite.corp.google.com/tpu/docs/system-architecture-tpu-vm#versions>`_.
-* - RUNTIME_VERSION
+- * RUNTIME_VERSION
-  - The TPU VM runtime version to use. For more information see `TPU VM images <https://cloud.google.com/tpu/docs/runtimes>`_.
+  * The TPU VM runtime version to use. For more information see `TPU VM images <https://cloud.google.com/tpu/docs/runtimes>`_.
-* - SERVICE_ACCOUNT
+- * SERVICE_ACCOUNT
-  - The email address for your service account. You can find it in the IAM
+  * The email address for your service account. You can find it in the IAM
    Cloud Console under *Service Accounts*. For example:
    `tpu-service-account@<your_project_ID>.iam.gserviceaccount.com`
-```
+:::
 Connect to your TPU using SSH:
@@ -178,15 +178,15 @@ Run the Docker image with the following command:
 docker run --privileged --net host --shm-size=16G -it vllm-tpu
 ```
-```{note}
+:::{note}
 Since TPU relies on XLA which requires static shapes, vLLM bucketizes the
 possible input shapes and compiles an XLA graph for each shape. The
 compilation time may take 20~30 minutes in the first run. However, the
 compilation time reduces to ~5 minutes afterwards because the XLA graphs are
 cached in the disk (in {code}`VLLM_XLA_CACHE_PATH` or {code}`~/.cache/vllm/xla_cache` by default).
-```
+:::
-````{tip}
+:::{tip}
 If you encounter the following error:
 ```console
@@ -198,9 +198,10 @@ file or directory
 Install OpenBLAS with the following command:
 ```console
-$ sudo apt-get install libopenblas-base libopenmpi-dev libomp-dev
+sudo apt-get install libopenblas-base libopenmpi-dev libomp-dev
 ```
-````
+:::
 ## Extra information

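Reviewer note: the hunks above all apply the same mechanical rewrite, turning a backtick-fenced MyST directive into a colon-fenced one while leaving ordinary code fences such as `console` blocks alone. The following is a minimal sketch of that rewrite, not the script used for this commit (the source does not say how the change was produced). The function name `to_colon_fences` and both regular expressions are hypothetical. It assumes directives are not nested inside other directives, and it keeps each fence at its original length.

```python
import re

# Opening fence of a MyST directive, e.g. a backtick fence followed by {note}.
OPEN_DIRECTIVE = re.compile(r"^(\s*)(`{3,})(\{[\w-]+\}.*)$")
# Any other backtick fence line: bare, or with a plain info string like console.
PLAIN_FENCE = re.compile(r"^\s*(`{3,})([\w-]*)\s*$")


def to_colon_fences(text: str) -> str:
    """Rewrite backtick-fenced directives as colon fences (flat case only)."""
    out = []
    stack = []  # (fence_length, is_directive) for every fence still open
    for line in text.splitlines():
        # Inside an ordinary code block everything is literal content.
        in_code = bool(stack) and not stack[-1][1]
        directive = OPEN_DIRECTIVE.match(line)
        fence = PLAIN_FENCE.match(line)
        if directive and not in_code:
            indent, ticks, rest = directive.groups()
            stack.append((len(ticks), True))
            line = f"{indent}{':' * len(ticks)}{rest}"
        elif fence:
            ticks, info = fence.groups()
            if stack and not info and len(ticks) == stack[-1][0]:
                # A bare fence of matching length closes the innermost block.
                _, was_directive = stack.pop()
                if was_directive:
                    line = line.replace("`", ":")
            elif not in_code:
                # An ordinary code block opens; remember it so its closing
                # fence is not mistaken for the end of a directive.
                stack.append((len(ticks), False))
        out.append(line)
    return "\n".join(out)
```

Unlike the commit, which shortens the four-backtick `tip` to three colons, this sketch would emit four colons for it. Both forms should parse, since a colon fence cannot be closed by a backtick fence inside it.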
--- a/docs/source/getting_started/installation/cpu/apple.inc.md
+++ b/docs/source/getting_started/installation/cpu/apple.inc.md
@@ -25,9 +25,9 @@ pip install -r requirements-cpu.txt
 pip install -e . 
 ```
-```{note}
+:::{note}
 On macOS the `VLLM_TARGET_DEVICE` is automatically set to `cpu`, which currently is the only supported device.
-```
+:::
 #### Troubleshooting

--- a/docs/source/getting_started/installation/cpu/index.md
+++ b/docs/source/getting_started/installation/cpu/index.md
@@ -2,86 +2,86 @@
 vLLM is a Python library that supports the following CPU variants. Select your CPU type to see vendor specific instructions:
-::::{tab-set}
+:::::{tab-set}
 :sync-group: device
-:::{tab-item} x86
+::::{tab-item} x86
 :sync: x86
-```{include} x86.inc.md
+:::{include} x86.inc.md
 :start-after: "# Installation"
 :end-before: "## Requirements"
-```
 :::
-:::{tab-item} ARM
+::::
+::::{tab-item} ARM
 :sync: arm
-```{include} arm.inc.md
+:::{include} arm.inc.md
 :start-after: "# Installation"
 :end-before: "## Requirements"
-```
 :::
-:::{tab-item} Apple silicon
+::::
+::::{tab-item} Apple silicon
 :sync: apple
-```{include} apple.inc.md
+:::{include} apple.inc.md
 :start-after: "# Installation"
 :end-before: "## Requirements"
-```
 :::
 ::::
+:::::
 ## Requirements
 - Python: 3.9 -- 3.12
-::::{tab-set}
+:::::{tab-set}
 :sync-group: device
-:::{tab-item} x86
+::::{tab-item} x86
 :sync: x86
-```{include} x86.inc.md
+:::{include} x86.inc.md
 :start-after: "## Requirements"
 :end-before: "## Set up using Python"
-```
 :::
-:::{tab-item} ARM
+::::
+::::{tab-item} ARM
 :sync: arm
-```{include} arm.inc.md
+:::{include} arm.inc.md
 :start-after: "## Requirements"
 :end-before: "## Set up using Python"
-```
 :::
-:::{tab-item} Apple silicon
+::::
+::::{tab-item} Apple silicon
 :sync: apple
-```{include} apple.inc.md
+:::{include} apple.inc.md
 :start-after: "## Requirements"
 :end-before: "## Set up using Python"
-```
 :::
 ::::
+:::::
 ## Set up using Python
 ### Create a new Python environment
-```{include} ../python_env_setup.inc.md
+:::{include} ../python_env_setup.inc.md
-```
+:::
 ### Pre-built wheels
@@ -89,41 +89,41 @@ Currently, there are no pre-built CPU wheels.
 ### Build wheel from source
-::::{tab-set}
+:::::{tab-set}
 :sync-group: device
-:::{tab-item} x86
+::::{tab-item} x86
 :sync: x86
-```{include} x86.inc.md
+:::{include} x86.inc.md
 :start-after: "### Build wheel from source"
 :end-before: "## Set up using Docker"
-```
 :::
-:::{tab-item} ARM
+::::
+::::{tab-item} ARM
 :sync: arm
-```{include} arm.inc.md
+:::{include} arm.inc.md
 :start-after: "### Build wheel from source"
 :end-before: "## Set up using Docker"
-```
 :::
-:::{tab-item} Apple silicon
+::::
+::::{tab-item} Apple silicon
 :sync: apple
-```{include} apple.inc.md
+:::{include} apple.inc.md
 :start-after: "### Build wheel from source"
 :end-before: "## Set up using Docker"
-```
 :::
 ::::
+:::::
 ## Set up using Docker
 ### Pre-built images
@@ -142,9 +142,9 @@ $ docker run -it \
             vllm-cpu-env
 ```
-:::{tip}
+::::{tip}
 For ARM or Apple silicon, use `Dockerfile.arm`
-:::
+::::
 ## Supported features

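Reviewer note: once `include` becomes a colon fence, the `tab-set`, `tab-item` and `include` directives in this file nest three deep, which is why each outer fence gains a colon (5, 4, 3). The sketch below checks that convention: every nested colon fence must be strictly shorter than its parent. The name `check_colon_nesting` and the message wording are hypothetical, and the "strictly shorter" rule is taken from the pattern in this diff rather than from the parser's specification.

```python
import re

# A colon fence line: three or more colons, optionally followed by {directive}.
FENCE = re.compile(r"^\s*(:{3,})(\{[\w-]+\})?")


def check_colon_nesting(text: str) -> list[str]:
    """Return a list of problems found in the colon-fence nesting of `text`."""
    problems = []
    stack = []  # colon counts of the directives that are currently open
    for lineno, line in enumerate(text.splitlines(), start=1):
        m = FENCE.match(line)
        if m is None:
            continue  # option lines such as ":sync: x86" start with one colon
        colons = len(m.group(1))
        if m.group(2):
            # Opening fence: must use fewer colons than the enclosing one.
            if stack and colons >= stack[-1]:
                problems.append(
                    f"line {lineno}: nested fence needs fewer than "
                    f"{stack[-1]} colons"
                )
            stack.append(colons)
        elif stack and colons == stack[-1]:
            stack.pop()  # closing fence for the innermost open directive
        else:
            problems.append(
                f"line {lineno}: closing fence matches no open directive"
            )
    problems.extend(f"unclosed directive opened with {n} colons" for n in stack)
    return problems
```

Run against the new `tab-set` blocks it should report nothing. Run against a naive conversion that leaves `tab-item` and `include` both at three colons, it flags the `include` line.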
--- a/docs/source/getting_started/installation/cpu/x86.inc.md
+++ b/docs/source/getting_started/installation/cpu/x86.inc.md
@@ -17,10 +17,10 @@ vLLM initially supports basic model inferencing and serving on x86 CPU platform,
 :::{include} build.inc.md
 :::
-```{note}
+:::{note}
 - AVX512_BF16 is an extension ISA provides native BF16 data type conversion and vector product instructions, which brings some performance improvement compared with pure AVX512. The CPU backend build script will check the host CPU flags to determine whether to enable AVX512_BF16.
 - If you want to force enable AVX512_BF16 for the cross-compilation, please set environment variable `VLLM_CPU_AVX512BF16=1` before the building.
-```
+:::
 ## Set up using Docker

--- a/docs/source/getting_started/installation/gpu/cuda.inc.md
+++ b/docs/source/getting_started/installation/gpu/cuda.inc.md
@@ -10,9 +10,9 @@ vLLM contains pre-compiled C++ and CUDA (12.1) binaries.
 ### Create a new Python environment
-```{note}
+:::{note}
 PyTorch installed via `conda` will statically link `NCCL` library, which can cause issues when vLLM tries to use `NCCL`. See <gh-issue:8420> for more details.
-```
+:::
 In order to be performant, vLLM has to compile many cuda kernels. The compilation unfortunately introduces binary incompatibility with other CUDA versions and PyTorch versions, even for the same PyTorch version with different building configurations.
@@ -100,10 +100,10 @@ pip install --editable .
 You can find more information about vLLM's wheels in <project:#install-the-latest-code>.
-```{note}
+:::{note}
 There is a possibility that your source code may have a different commit ID compared to the latest vLLM wheel, which could potentially lead to unknown errors.
 It is recommended to use the same commit ID for the source code as the vLLM wheel you have installed. Please refer to <project:#install-the-latest-code> for instructions on how to install a specified wheel.
-```
+:::
 #### Full build (with compilation)
@@ -115,7 +115,7 @@ cd vllm
 pip install -e .
 ```
-```{tip}
+:::{tip}
 Building from source requires a lot of compilation. If you are building from source repeatedly, it's more efficient to cache the compilation results.
 For example, you can install [ccache](https://github.com/ccache/ccache) using `conda install ccache` or `apt install ccache` .
@@ -123,7 +123,7 @@ As long as `which ccache` command can find the `ccache` binary, it will be used
 [sccache](https://github.com/mozilla/sccache) works similarly to `ccache`, but has the capability to utilize caching in remote storage environments.
 The following environment variables can be set to configure the vLLM `sccache` remote: `SCCACHE_BUCKET=vllm-build-sccache SCCACHE_REGION=us-west-2 SCCACHE_S3_NO_CREDENTIALS=1`. We also recommend setting `SCCACHE_IDLE_TIMEOUT=0`.
-```
+:::
 ##### Use an existing PyTorch installation

--- a/docs/source/getting_started/installation/gpu/index.md
+++ b/docs/source/getting_started/installation/gpu/index.md
@@ -2,299 +2,299 @@
 vLLM is a Python library that supports the following GPU variants. Select your GPU type to see vendor specific instructions:
-::::{tab-set}
+:::::{tab-set}
 :sync-group: device
-:::{tab-item} CUDA
+::::{tab-item} CUDA
 :sync: cuda
-```{include} cuda.inc.md
+:::{include} cuda.inc.md
 :start-after: "# Installation"
 :end-before: "## Requirements"
-```
 :::
-:::{tab-item} ROCm
+::::
+::::{tab-item} ROCm
 :sync: rocm
-```{include} rocm.inc.md
+:::{include} rocm.inc.md
 :start-after: "# Installation"
 :end-before: "## Requirements"
-```
 :::
-:::{tab-item} XPU
+::::
+::::{tab-item} XPU
 :sync: xpu
-```{include} xpu.inc.md
+:::{include} xpu.inc.md
 :start-after: "# Installation"
 :end-before: "## Requirements"
-```
 :::
 ::::
+:::::
 ## Requirements
 - OS: Linux
 - Python: 3.9 -- 3.12
-::::{tab-set}
+:::::{tab-set}
 :sync-group: device
-:::{tab-item} CUDA
+::::{tab-item} CUDA
 :sync: cuda
-```{include} cuda.inc.md
+:::{include} cuda.inc.md
 :start-after: "## Requirements"
 :end-before: "## Set up using Python"
-```
 :::
-:::{tab-item} ROCm
+::::
+::::{tab-item} ROCm
 :sync: rocm
-```{include} rocm.inc.md
+:::{include} rocm.inc.md
 :start-after: "## Requirements"
 :end-before: "## Set up using Python"
-```
 :::
-:::{tab-item} XPU
+::::
+::::{tab-item} XPU
 :sync: xpu
-```{include} xpu.inc.md
+:::{include} xpu.inc.md
 :start-after: "## Requirements"
 :end-before: "## Set up using Python"
-```
 :::
 ::::
+:::::
 ## Set up using Python
 ### Create a new Python environment
-```{include} ../python_env_setup.inc.md
+:::{include} ../python_env_setup.inc.md
-```
+:::
-::::{tab-set}
+:::::{tab-set}
 :sync-group: device
-:::{tab-item} CUDA
+::::{tab-item} CUDA
 :sync: cuda
-```{include} cuda.inc.md
+:::{include} cuda.inc.md
 :start-after: "## Create a new Python environment"
 :end-before: "### Pre-built wheels"
-```
 :::
-:::{tab-item} ROCm
+::::
+::::{tab-item} ROCm
 :sync: rocm
 There is no extra information on creating a new Python environment for this device.
-:::
+::::
-:::{tab-item} XPU
+::::{tab-item} XPU
 :sync: xpu
 There is no extra information on creating a new Python environment for this device.
-:::
 ::::
+:::::
 ### Pre-built wheels
-::::{tab-set}
+:::::{tab-set}
 :sync-group: device
-:::{tab-item} CUDA
+::::{tab-item} CUDA
 :sync: cuda
-```{include} cuda.inc.md
+:::{include} cuda.inc.md
 :start-after: "### Pre-built wheels"
 :end-before: "### Build wheel from source"
-```
 :::
-:::{tab-item} ROCm
+::::
+::::{tab-item} ROCm
 :sync: rocm
-```{include} rocm.inc.md
+:::{include} rocm.inc.md
 :start-after: "### Pre-built wheels"
 :end-before: "### Build wheel from source"
-```
 :::
-:::{tab-item} XPU
+::::
+::::{tab-item} XPU
 :sync: xpu
-```{include} xpu.inc.md
+:::{include} xpu.inc.md
 :start-after: "### Pre-built wheels"
 :end-before: "### Build wheel from source"
-```
 :::
 ::::
+:::::
 (build-from-source)=
 ### Build wheel from source
-::::{tab-set}
+:::::{tab-set}
 :sync-group: device
-:::{tab-item} CUDA
+::::{tab-item} CUDA
 :sync: cuda
-```{include} cuda.inc.md
+:::{include} cuda.inc.md
 :start-after: "### Build wheel from source"
 :end-before: "## Set up using Docker"
-```
 :::
-:::{tab-item} ROCm
+::::
+::::{tab-item} ROCm
 :sync: rocm
-```{include} rocm.inc.md
+:::{include} rocm.inc.md
 :start-after: "### Build wheel from source"
 :end-before: "## Set up using Docker"
-```
 :::
-:::{tab-item} XPU
+::::
+::::{tab-item} XPU
 :sync: xpu
-```{include} xpu.inc.md
+:::{include} xpu.inc.md
 :start-after: "### Build wheel from source"
 :end-before: "## Set up using Docker"
-```
 :::
 ::::
+:::::
 ## Set up using Docker
 ### Pre-built images
-::::{tab-set}
+:::::{tab-set}
 :sync-group: device
-:::{tab-item} CUDA
+::::{tab-item} CUDA
 :sync: cuda
-```{include} cuda.inc.md
+:::{include} cuda.inc.md
 :start-after: "### Pre-built images"
 :end-before: "### Build image from source"
-```
 :::
-:::{tab-item} ROCm
+::::
+::::{tab-item} ROCm
 :sync: rocm
-```{include} rocm.inc.md
+:::{include} rocm.inc.md
 :start-after: "### Pre-built images"
 :end-before: "### Build image from source"
-```
 :::
-:::{tab-item} XPU
+::::
+::::{tab-item} XPU
 :sync: xpu
-```{include} xpu.inc.md
+:::{include} xpu.inc.md
 :start-after: "### Pre-built images"
 :end-before: "### Build image from source"
-```
 :::
 ::::
+:::::
 ### Build image from source
-::::{tab-set}
+:::::{tab-set}
 :sync-group: device
-:::{tab-item} CUDA
+::::{tab-item} CUDA
 :sync: cuda
-```{include} cuda.inc.md
+:::{include} cuda.inc.md
 :start-after: "### Build image from source"
 :end-before: "## Supported features"
-```
 :::
-:::{tab-item} ROCm
+::::
+::::{tab-item} ROCm
 :sync: rocm
-```{include} rocm.inc.md
+:::{include} rocm.inc.md
 :start-after: "### Build image from source"
 :end-before: "## Supported features"
-```
 :::
-:::{tab-item} XPU
+::::
+::::{tab-item} XPU
 :sync: xpu
-```{include} xpu.inc.md
+:::{include} xpu.inc.md
 :start-after: "### Build image from source"
 :end-before: "## Supported features"
-```
 :::
 ::::
+:::::
 ## Supported features
-::::{tab-set}
+:::::{tab-set}
 :sync-group: device
-:::{tab-item} CUDA
+::::{tab-item} CUDA
 :sync: cuda
-```{include} cuda.inc.md
+:::{include} cuda.inc.md
 :start-after: "## Supported features"
-```
 :::
-:::{tab-item} ROCm
+::::
+::::{tab-item} ROCm
 :sync: rocm
-```{include} rocm.inc.md
+:::{include} rocm.inc.md
 :start-after: "## Supported features"
-```
 :::
-:::{tab-item} XPU
+::::
+::::{tab-item} XPU
 :sync: xpu
-```{include} xpu.inc.md
+:::{include} xpu.inc.md
 :start-after: "## Supported features"
-```
 :::
 ::::
+:::::
--- a/docs/source/getting_started/installation/gpu/rocm.inc.md
+++ b/docs/source/getting_started/installation/gpu/rocm.inc.md
@@ -16,10 +16,10 @@ Currently, there are no pre-built ROCm wheels.
 However, the [AMD Infinity hub for vLLM](https://hub.docker.com/r/rocm/vllm/tags) offers a prebuilt, optimized
 docker image designed for validating inference performance on the AMD Instinct™ MI300X accelerator.
-```{tip}
+:::{tip}
 Please check [LLM inference performance validation on AMD Instinct MI300X](https://rocm.docs.amd.com/en/latest/how-to/performance-validation/mi300x/vllm-benchmark.html)
 for instructions on how to use this prebuilt docker image.
-```
+:::
 ### Build wheel from source
@@ -47,9 +47,9 @@ for instructions on how to use this prebuilt docker image.
    cd ../..
    ```
-    ```{note}
+    :::{note}
-    - If you see HTTP issue related to downloading packages during building triton, please try again as the HTTP error is intermittent.
+    If you see HTTP issue related to downloading packages during building triton, please try again as the HTTP error is intermittent.
-    ```
+    :::
 2. Optionally, if you choose to use CK flash attention, you can install [flash attention for ROCm](https://github.com/ROCm/flash-attention/tree/ck_tile)
@@ -67,9 +67,9 @@ for instructions on how to use this prebuilt docker image.
    cd ..
    ```
-    ```{note}
+    :::{note}
-    - You might need to downgrade the "ninja" version to 1.10 it is not used when compiling flash-attention-2 (e.g. `pip install ninja==1.10.2.4`)
+    You might need to downgrade the "ninja" version to 1.10 it is not used when compiling flash-attention-2 (e.g. `pip install ninja==1.10.2.4`)
-    ```
+    :::
 3. Build vLLM. For example, vLLM on ROCM 6.2 can be built with the following steps:
@@ -95,17 +95,18 @@ for instructions on how to use this prebuilt docker image.
    This may take 5-10 minutes. Currently, `pip install .` does not work for ROCm installation.
-    ```{tip}
+<!--- pyml disable-num-lines 5 ul-indent-->
+    :::{tip}
    - Triton flash attention is used by default. For benchmarking purposes, it is recommended to run a warm up step before collecting perf numbers.
    - Triton flash attention does not currently support sliding window attention. If using half precision, please use CK flash-attention for sliding window support.
    - To use CK flash-attention or PyTorch naive attention, please use this flag `export VLLM_USE_TRITON_FLASH_ATTN=0` to turn off triton flash attention.
    - The ROCm version of PyTorch, ideally, should match the ROCm driver version.
-    ```
+    :::
-```{tip}
+:::{tip}
 - For MI300x (gfx942) users, to achieve optimal performance, please refer to [MI300x tuning guide](https://rocm.docs.amd.com/en/latest/how-to/tuning-guides/mi300x/index.html) for performance optimization and tuning tips on system and workflow level.
  For vLLM, please refer to [vLLM performance optimization](https://rocm.docs.amd.com/en/latest/how-to/tuning-guides/mi300x/workload.html#vllm-performance-optimization).
-```
+:::
 ## Set up using Docker

--- a/docs/source/getting_started/installation/gpu/xpu.inc.md
+++ b/docs/source/getting_started/installation/gpu/xpu.inc.md
@@ -30,10 +30,10 @@ pip install -v -r requirements-xpu.txt
 VLLM_TARGET_DEVICE=xpu python setup.py install
 ```
-```{note}
+:::{note}
 - FP16 is the default data type in the current XPU backend. The BF16 data
  type will be supported in the future.
-```
+:::
 ## Set up using Docker

--- a/docs/source/getting_started/installation/index.md
+++ b/docs/source/getting_started/installation/index.md
@@ -4,10 +4,10 @@
 vLLM supports the following hardware platforms:
-```{toctree}
+:::{toctree}
 :maxdepth: 1
 gpu/index
 cpu/index
 ai_accelerator/index
-```
+:::
--- a/docs/source/getting_started/installation/python_env_setup.inc.md
+++ b/docs/source/getting_started/installation/python_env_setup.inc.md
@@ -6,9 +6,9 @@ conda create -n myenv python=3.12 -y
 conda activate myenv
 ```
-```{note}
+:::{note}
 [PyTorch has deprecated the conda release channel](https://github.com/pytorch/pytorch/issues/138506). If you use `conda`, please only use it to create Python environment rather than installing packages.
-```
+:::
 Or you can create a new Python environment using [uv](https://docs.astral.sh/uv/), a very fast Python environment manager. Please follow the [documentation](https://docs.astral.sh/uv/#getting-started) to install `uv`. After installing `uv`, you can create a new Python environment using the following command:

--- a/docs/source/getting_started/quickstart.md
+++ b/docs/source/getting_started/quickstart.md
@@ -32,9 +32,9 @@ conda activate myenv
 pip install vllm
 ```
-```{note}
+:::{note}
 For non-CUDA platforms, please refer [here](#installation-index) for specific instructions on how to install vLLM.
-```
+:::
 (quickstart-offline)=
@@ -69,9 +69,9 @@ The {class}`~vllm.LLM` class initializes vLLM's engine and the [OPT-125M model](
 llm = LLM(model="facebook/opt-125m")
 ```
-```{note}
+:::{note}
 By default, vLLM downloads models from [HuggingFace](https://huggingface.co/). If you would like to use models from [ModelScope](https://www.modelscope.cn), set the environment variable `VLLM_USE_MODELSCOPE` before initializing the engine.
-```
+:::
 Now, the fun part! The outputs are generated using `llm.generate`. It adds the input prompts to the vLLM engine's waiting queue and executes the vLLM engine to generate the outputs with high throughput. The outputs are returned as a list of `RequestOutput` objects, which include all of the output tokens.
@@ -97,10 +97,10 @@ Run the following command to start the vLLM server with the [Qwen2.5-1.5B-Instru
 vllm serve Qwen/Qwen2.5-1.5B-Instruct
 ```
-```{note}
+:::{note}
 By default, the server uses a predefined chat template stored in the tokenizer.
 You can learn about overriding it [here](#chat-template).
-```
+:::
 This server can be queried in the same format as OpenAI API. For example, to list the models:

--- a/docs/source/getting_started/troubleshooting.md
+++ b/docs/source/getting_started/troubleshooting.md
@@ -4,9 +4,9 @@
 This document outlines some troubleshooting strategies you can consider. If you think you've discovered a bug, please [search existing issues](https://github.com/vllm-project/vllm/issues?q=is%3Aissue) first to see if it has already been reported. If not, please [file a new issue](https://github.com/vllm-project/vllm/issues/new/choose), providing as much relevant information as possible.
-```{note}
+:::{note}
 Once you've debugged a problem, remember to turn off any debugging environment variables defined, or simply start a new shell to avoid being affected by lingering debugging settings. Otherwise, the system might be slow with debugging functionalities left activated.
-```
+:::
 ## Hangs downloading a model
@@ -18,9 +18,9 @@ It's recommended to download the model first using the [huggingface-cli](https:/
 If the model is large, it can take a long time to load it from disk. Pay attention to where you store the model. Some clusters have shared filesystems across nodes, e.g. a distributed filesystem or a network filesystem, which can be slow.
 It'd be better to store the model in a local disk. Additionally, have a look at the CPU memory usage, when the model is too large it might take a lot of CPU memory, slowing down the operating system because it needs to frequently swap between disk and memory.
-```{note}
+:::{note}
 To isolate the model downloading and loading issue, you can use the `--load-format dummy` argument to skip loading the model weights. This way, you can check if the model downloading and loading is the bottleneck.
-```
+:::
 ## Out of memory
@@ -132,14 +132,14 @@ If the script runs successfully, you should see the message `sanity check is suc
 If the test script hangs or crashes, usually it means the hardware/drivers are broken in some sense. You should try to contact your system administrator or hardware vendor for further assistance. As a common workaround, you can try to tune some NCCL environment variables, such as `export NCCL_P2P_DISABLE=1` to see if it helps. Please check [their documentation](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html) for more information. Please only use these environment variables as a temporary workaround, as they might affect the performance of the system. The best solution is still to fix the hardware/drivers so that the test script can run successfully.
-```{note}
+:::{note}
 A multi-node environment is more complicated than a single-node one. If you see errors such as `torch.distributed.DistNetworkError`, it is likely that the network/DNS setup is incorrect. In that case, you can manually assign node rank and specify the IP via command line arguments:
 - In the first node, run `NCCL_DEBUG=TRACE torchrun --nnodes 2 --nproc-per-node=2 --node-rank 0 --master_addr $MASTER_ADDR test.py`.
 - In the second node, run `NCCL_DEBUG=TRACE torchrun --nnodes 2 --nproc-per-node=2 --node-rank 1 --master_addr $MASTER_ADDR test.py`.
 Adjust `--nproc-per-node`, `--nnodes`, and `--node-rank` according to your setup, being sure to execute different commands (with different `--node-rank`) on different nodes.
-```
+:::
 (troubleshooting-python-multiprocessing)=

--- a/docs/source/index.md
+++ b/docs/source/index.md
 # Welcome to vLLM
-```{figure} ./assets/logos/vllm-logo-text-light.png
+:::{figure} ./assets/logos/vllm-logo-text-light.png
 :align: center
 :alt: vLLM
 :class: no-scaled-link
 :width: 60%
-```
+:::
-```{raw} html
+:::{raw} html
 <p style="text-align:center">
 <strong>Easy, fast, and cheap LLM serving for everyone
 </strong>
@@ -19,7 +19,7 @@
 <a class="github-button" href="https://github.com/vllm-project/vllm/subscription" data-icon="octicon-eye" data-size="large" aria-label="Watch">Watch</a>
 <a class="github-button" href="https://github.com/vllm-project/vllm/fork" data-icon="octicon-repo-forked" data-size="large" aria-label="Fork">Fork</a>
 </p>
-```
+:::
 vLLM is a fast and easy-to-use library for LLM inference and serving.
@@ -58,7 +58,7 @@ For more information, check out the following:
 % How to start using vLLM?
-```{toctree}
+:::{toctree}
 :caption: Getting Started
 :maxdepth: 1
@@ -67,11 +67,11 @@ getting_started/quickstart
 getting_started/examples/examples_index
 getting_started/troubleshooting
 getting_started/faq
-```
+:::
 % What does vLLM support?
-```{toctree}
+:::{toctree}
 :caption: Models
 :maxdepth: 1
@@ -79,11 +79,11 @@ models/generative_models
 models/pooling_models
 models/supported_models
 models/extensions/index
-```
+:::
 % Additional capabilities
-```{toctree}
+:::{toctree}
 :caption: Features
 :maxdepth: 1
@@ -96,11 +96,11 @@ features/automatic_prefix_caching
 features/disagg_prefill
 features/spec_decode
 features/compatibility_matrix
-```
+:::
 % Details about running vLLM
-```{toctree}
+:::{toctree}
 :caption: Inference and Serving
 :maxdepth: 1
@@ -113,11 +113,11 @@ serving/engine_args
 serving/env_vars
 serving/usage_stats
 serving/integrations/index
-```
+:::
 % Scaling up vLLM for production
-```{toctree}
+:::{toctree}
 :caption: Deployment
 :maxdepth: 1
@@ -126,21 +126,21 @@ deployment/k8s
 deployment/nginx
 deployment/frameworks/index
 deployment/integrations/index
-```
+:::
 % Making the most out of vLLM
-```{toctree}
+:::{toctree}
 :caption: Performance
 :maxdepth: 1
 performance/optimization
 performance/benchmarks
-```
+:::
 % Explanation of vLLM internals
-```{toctree}
+:::{toctree}
 :caption: Design Documents
 :maxdepth: 2
@@ -151,11 +151,11 @@ design/kernel/paged_attention
 design/mm_processing
 design/automatic_prefix_caching
 design/multiprocessing
-```
+:::
 % How to contribute to the vLLM project
-```{toctree}
+:::{toctree}
 :caption: Developer Guide
 :maxdepth: 2
@@ -164,11 +164,11 @@ contributing/profiling/profiling_index
 contributing/dockerfile/dockerfile
 contributing/model/index
 contributing/vulnerability_management
-```
+:::
 % Technical API specifications
-```{toctree}
+:::{toctree}
 :caption: API Reference
 :maxdepth: 2
@@ -177,18 +177,18 @@ api/engine/index
 api/inference_params
 api/multimodal/index
 api/model/index
-```
+:::
 % Latest news and acknowledgements
-```{toctree}
+:::{toctree}
 :caption: Community
 :maxdepth: 1
 community/blog
 community/meetups
 community/sponsors
-```
+:::
 ## Indices and tables

--- a/docs/source/models/extensions/index.md
+++ b/docs/source/models/extensions/index.md
 # Built-in Extensions
-```{toctree}
+:::{toctree}
 :maxdepth: 1
 runai_model_streamer
 tensorizer
-```
+:::
--- a/docs/source/models/extensions/runai_model_streamer.md
+++ b/docs/source/models/extensions/runai_model_streamer.md
@@ -48,6 +48,6 @@ You can read further about CPU buffer memory limiting [here](https://github.com/
 vllm serve /home/meta-llama/Llama-3.2-3B-Instruct --load-format runai_streamer --model-loader-extra-config '{"memory_limit":5368709120}'
 ```
-```{note}
+:::{note}
 For further instructions about tunable parameters and additional parameters configurable through environment variables, read the [Environment Variables Documentation](https://github.com/run-ai/runai-model-streamer/blob/master/docs/src/env-vars.md).
-```
+:::
--- a/docs/source/models/extensions/tensorizer.md
+++ b/docs/source/models/extensions/tensorizer.md
@@ -11,6 +11,6 @@ For more information on CoreWeave's Tensorizer, please refer to
 [CoreWeave's Tensorizer documentation](https://github.com/coreweave/tensorizer). For more information on serializing a vLLM model, as well a general usage guide to using Tensorizer with vLLM, see
 the [vLLM example script](https://docs.vllm.ai/en/stable/getting_started/examples/offline_inference/tensorize_vllm_model.html).
-```{note}
+:::{note}
 Note that to use this feature you will need to install `tensorizer` by running `pip install vllm[tensorizer]`.
-```
+:::
--- a/docs/source/models/generative_models.md
+++ b/docs/source/models/generative_models.md
@@ -70,10 +70,10 @@ The {class}`~vllm.LLM.chat` method implements chat functionality on top of {clas
 In particular, it accepts input similar to [OpenAI Chat Completions API](https://platform.openai.com/docs/api-reference/chat)
 and automatically applies the model's [chat template](https://huggingface.co/docs/transformers/en/chat_templating) to format the prompt.
-```{important}
+:::{important}
 In general, only instruction-tuned models have a chat template.
 Base models may perform poorly as they are not trained to respond to the chat conversation.
-```
+:::
 ```python
 llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")

--- a/docs/source/models/pooling_models.md
+++ b/docs/source/models/pooling_models.md
@@ -8,54 +8,54 @@ In vLLM, pooling models implement the {class}`~vllm.model_executor.models.VllmMo
 These models use a {class}`~vllm.model_executor.layers.Pooler` to extract the final hidden states of the input
 before returning them.
-```{note}
+:::{note}
 We currently support pooling models primarily as a matter of convenience.
 As shown in the [Compatibility Matrix](#compatibility-matrix), most vLLM features are not applicable to
 pooling models as they only work on the generation or decode stage, so performance may not improve as much.
-```
+:::
 For pooling models, we support the following `--task` options.
 The selected option sets the default pooler used to extract the final hidden states:
-```{list-table}
+:::{list-table}
 :widths: 50 25 25 25
 :header-rows: 1
-* - Task
+- * Task
-  - Pooling Type
+  * Pooling Type
-  - Normalization
+  * Normalization
-  - Softmax
+  * Softmax
-* - Embedding (`embed`)
+- * Embedding (`embed`)
-  - `LAST`
+  * `LAST`
-  - ✅︎
+  * ✅︎
-  - ✗
+  * ✗
-* - Classification (`classify`)
+- * Classification (`classify`)
-  - `LAST`
+  * `LAST`
-  - ✗
+  * ✗
-  - ✅︎
+  * ✅︎
-* - Sentence Pair Scoring (`score`)
+- * Sentence Pair Scoring (`score`)
-  - \*
+  * \*
-  - \*
+  * \*
-  - \*
+  * \*
-* - Reward Modeling (`reward`)
+- * Reward Modeling (`reward`)
-  - `ALL`
+  * `ALL`
-  - ✗
+  * ✗
-  - ✗
+  * ✗
-```
+:::
 \*The default pooler is always defined by the model.
-```{note}
+:::{note}
 If the model's implementation in vLLM defines its own pooler, the default pooler is set to that instead of the one specified in this table.
-```
+:::
 When loading [Sentence Transformers](https://huggingface.co/sentence-transformers) models,
 we attempt to override the default pooler based on its Sentence Transformers configuration file (`modules.json`).
-```{tip}
+:::{tip}
 You can customize the model's pooling method via the `--override-pooler-config` option,
 which takes priority over both the model's and Sentence Transformers's defaults.
-```
+:::
 ## Offline Inference
@@ -111,10 +111,10 @@ The {class}`~vllm.LLM.score` method outputs similarity scores between sentence p
 It is primarily designed for [cross-encoder models](https://www.sbert.net/examples/applications/cross-encoder/README.html).
 These types of models serve as rerankers between candidate query-document pairs in RAG systems.
-```{note}
+:::{note}
 vLLM can only perform the model inference component (e.g. embedding, reranking) of RAG.
 To handle RAG at a higher level, you should use integration frameworks such as [LangChain](https://github.com/langchain-ai/langchain).
-```
+:::
 ```python
 llm = LLM(model="BAAI/bge-reranker-v2-m3", task="score")

--- a/docs/source/models/supported_models.md
+++ b/docs/source/models/supported_models.md
--- a/docs/source/serving/distributed_serving.md
+++ b/docs/source/serving/distributed_serving.md
@@ -14,9 +14,9 @@ In short, you should increase the number of GPUs and the number of nodes until y
 After adding enough GPUs and nodes to hold the model, you can run vLLM first, which will print some logs like `# GPU blocks: 790`. Multiply the number by `16` (the block size), and you can get roughly the maximum number of tokens that can be served on the current configuration. If this number is not satisfying, e.g. you want higher throughput, you can further increase the number of GPUs or nodes, until the number of blocks is enough.
-```{note}
+:::{note}
 There is one edge case: if the model fits in a single node with multiple GPUs, but the number of GPUs cannot divide the model size evenly, you can use pipeline parallelism, which splits the model along layers and supports uneven splits. In this case, the tensor parallel size should be 1 and the pipeline parallel size should be the number of GPUs.
-```
+:::
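The block-count arithmetic in the paragraph above can be sketched as follows. This is a minimal illustration and not part of vLLM: the helper name `estimate_max_tokens` is hypothetical, and the default block size of 16 and the example count of 790 are taken from the text.

```python
def estimate_max_tokens(num_gpu_blocks: int, block_size: int = 16) -> int:
    """Rough number of tokens that fit in the KV cache.

    num_gpu_blocks is the value read from the `# GPU blocks: ...` log line;
    block_size defaults to 16, the block size quoted in the text.
    """
    if num_gpu_blocks < 0 or block_size <= 0:
        raise ValueError("block count must be >= 0 and block size must be > 0")
    return num_gpu_blocks * block_size


# Using the example log value from the text: `# GPU blocks: 790`.
print(estimate_max_tokens(790))  # 12640
```

The result is only an estimate, as the text notes; if it is too low for the target workload, add GPUs or nodes and re-check the logged block count.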
 ## Running vLLM on a single node
@@ -94,12 +94,12 @@ vllm serve /path/to/the/model/in/the/container \
 To make tensor parallel performant, you should make sure the communication between nodes is efficient, e.g. using high-speed network cards like Infiniband. To correctly set up the cluster to use Infiniband, append additional arguments like `--privileged -e NCCL_IB_HCA=mlx5` to the `run_cluster.sh` script. Please contact your system administrator for more information on how to set up the flags. One way to confirm if the Infiniband is working is to run vLLM with `NCCL_DEBUG=TRACE` environment variable set, e.g. `NCCL_DEBUG=TRACE vllm serve ...` and check the logs for the NCCL version and the network used. If you find `[send] via NET/Socket` in the logs, it means NCCL uses raw TCP Socket, which is not efficient for cross-node tensor parallel. If you find `[send] via NET/IB/GDRDMA` in the logs, it means NCCL uses Infiniband with GPU-Direct RDMA, which is efficient.
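The log check described above could be automated with a small script along these lines. This is a sketch under stated assumptions: the function name `classify_nccl_transport` and the sample log lines are hypothetical, and only the two marker strings (`[send] via NET/Socket` and `[send] via NET/IB/GDRDMA`) come from the text.

```python
SOCKET_MARKER = "[send] via NET/Socket"     # raw TCP sockets (inefficient)
IB_MARKER = "[send] via NET/IB/GDRDMA"      # Infiniband with GPU-Direct RDMA


def classify_nccl_transport(log_text: str) -> str:
    """Classify captured NCCL_DEBUG=TRACE output by transport.

    Returns "socket" if any line shows the TCP socket marker (the case to
    investigate), "ib_gdrdma" if only the Infiniband marker is seen, and
    "unknown" if neither marker appears.
    """
    saw_ib = False
    for line in log_text.splitlines():
        if SOCKET_MARKER in line:
            return "socket"
        if IB_MARKER in line:
            saw_ib = True
    return "ib_gdrdma" if saw_ib else "unknown"


# Hypothetical log excerpt; real NCCL lines carry more fields than this.
sample = "\n".join([
    "node0: NCCL INFO Channel 00 : 0[0] -> 1[0] [send] via NET/IB/GDRDMA",
    "node0: NCCL INFO Connected all rings",
])
print(classify_nccl_transport(sample))  # ib_gdrdma
```

To use it, redirect the output of a run started with `NCCL_DEBUG=TRACE` to a file and pass the file contents to the function.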
-```{warning}
+:::{warning}
 After you start the Ray cluster, you'd better also check the GPU-GPU communication between nodes. It can be non-trivial to set up. Please refer to the [sanity check script](#troubleshooting-incorrect-hardware-driver) for more information. If you need to set some environment variables for the communication configuration, you can append them to the `run_cluster.sh` script, e.g. `-e NCCL_SOCKET_IFNAME=eth0`. Note that setting environment variables in the shell (e.g. `NCCL_SOCKET_IFNAME=eth0 vllm serve ...`) only works for the processes in the same node, not for the processes in the other nodes. Setting environment variables when you create the cluster is the recommended way. See <gh-issue:6803> for more information.
-```
+:::
-```{warning}
+:::{warning}
 Please make sure you downloaded the model to all the nodes (with the same path), or the model is downloaded to some distributed file system that is accessible by all nodes.
 When you use huggingface repo id to refer to the model, you should append your huggingface token to the `run_cluster.sh` script, e.g. `-e HF_TOKEN=`. The recommended way is to download the model first, and then use the path to refer to the model.
-```
+:::