[doc] Fold long code blocks to improve readability (#19926)

Signed-off-by: reidliu41 <reid201711@gmail.com>
Co-authored-by: reidliu41 <reid201711@gmail.com>

Unverified commit f17aec0d, authored Jun 23, 2025 by Reid, committed by GitHub Jun 23, 2025
10 changed files
--- a/docs/getting_started/installation/gpu/rocm.inc.md
+++ b/docs/getting_started/installation/gpu/rocm.inc.md
@@ -90,24 +90,26 @@ Currently, there are no pre-built ROCm wheels.

 4. Build vLLM. For example, vLLM on ROCM 6.3 can be built with the following steps:

-    ```bash
-    pip install --upgrade pip
-
-    # Build & install AMD SMI
-    pip install /opt/rocm/share/amd_smi
-
-    # Install dependencies
-    pip install --upgrade numba \
-        scipy \
-        huggingface-hub[cli,hf_transfer] \
-        setuptools_scm
-    pip install "numpy<2"
-    pip install -r requirements/rocm.txt
-
-    # Build vLLM for MI210/MI250/MI300.
-    export PYTORCH_ROCM_ARCH="gfx90a;gfx942"
-    python3 setup.py develop
-    ```
+    ??? Commands
+
+        ```bash
+        pip install --upgrade pip
+
+        # Build & install AMD SMI
+        pip install /opt/rocm/share/amd_smi
+
+        # Install dependencies
+        pip install --upgrade numba \
+            scipy \
+            huggingface-hub[cli,hf_transfer] \
+            setuptools_scm
+        pip install "numpy<2"
+        pip install -r requirements/rocm.txt
+
+        # Build vLLM for MI210/MI250/MI300.
+        export PYTORCH_ROCM_ARCH="gfx90a;gfx942"
+        python3 setup.py develop
+        ```

    This may take 5-10 minutes. Currently, `pip install .` does not work for ROCm installation.

@@ -201,19 +203,21 @@ DOCKER_BUILDKIT=1 docker build \

 To run the above docker image `vllm-rocm`, use the below command:

-```console
-docker run -it \
-   --network=host \
-   --group-add=video \
-   --ipc=host \
-   --cap-add=SYS_PTRACE \
-   --security-opt seccomp=unconfined \
-   --device /dev/kfd \
-   --device /dev/dri \
-   -v <path/to/model>:/app/model \
-   vllm-rocm \
-   bash
-```
+??? Command
+
+    ```console
+    docker run -it \
+    --network=host \
+    --group-add=video \
+    --ipc=host \
+    --cap-add=SYS_PTRACE \
+    --security-opt seccomp=unconfined \
+    --device /dev/kfd \
+    --device /dev/dri \
+    -v <path/to/model>:/app/model \
+    vllm-rocm \
+    bash
+    ```

 Where the `<path/to/model>` is the location where the model is stored, for example, the weights for llama2 or llama3 models.


--- a/docs/getting_started/installation/intel_gaudi.md
+++ b/docs/getting_started/installation/intel_gaudi.md
@@ -200,7 +200,7 @@ INFO 08-01 21:37:59 hpu_model_runner.py:509] Generated 48 decode buckets: [(1, 1

 `min` determines the lowest value of the bucket. `step` determines the interval between buckets, and `max` determines the upper bound of the bucket. Furthermore, interval between `min` and `step` has special handling -- `min` gets multiplied by consecutive powers of two, until `step` gets reached. We call this the ramp-up phase and it is used for handling lower batch sizes with minimum wastage, while allowing larger padding on larger batch sizes.

-Example (with ramp-up)
+Example (with ramp-up):

 ```text
 min = 2, step = 32, max = 64
@@ -209,7 +209,7 @@ min = 2, step = 32, max = 64
 => buckets = ramp_up + stable => (2, 4, 8, 16, 32, 64)
 ```

-Example (without ramp-up)
+Example (without ramp-up):

 ```text
 min = 128, step = 128, max = 512
@@ -232,19 +232,21 @@ As an example, if a request of 3 sequences, with max sequence length of 412 come

 Warmup is an optional, but highly recommended step occurring before vLLM server starts listening. It executes a forward pass for each bucket with dummy data. The goal is to pre-compile all graphs and not incur any graph compilation overheads within bucket boundaries during server runtime. Each warmup step is logged during vLLM startup:

-```text
-INFO 08-01 22:26:47 hpu_model_runner.py:1066] [Warmup][Prompt][1/24] batch_size:4 seq_len:1024 free_mem:79.16 GiB
-INFO 08-01 22:26:47 hpu_model_runner.py:1066] [Warmup][Prompt][2/24] batch_size:4 seq_len:896 free_mem:55.43 GiB
-INFO 08-01 22:26:48 hpu_model_runner.py:1066] [Warmup][Prompt][3/24] batch_size:4 seq_len:768 free_mem:55.43 GiB
-...
-INFO 08-01 22:26:59 hpu_model_runner.py:1066] [Warmup][Prompt][24/24] batch_size:1 seq_len:128 free_mem:55.43 GiB
-INFO 08-01 22:27:00 hpu_model_runner.py:1066] [Warmup][Decode][1/48] batch_size:4 seq_len:2048 free_mem:55.43 GiB
-INFO 08-01 22:27:00 hpu_model_runner.py:1066] [Warmup][Decode][2/48] batch_size:4 seq_len:1920 free_mem:55.43 GiB
-INFO 08-01 22:27:01 hpu_model_runner.py:1066] [Warmup][Decode][3/48] batch_size:4 seq_len:1792 free_mem:55.43 GiB
-...
-INFO 08-01 22:27:16 hpu_model_runner.py:1066] [Warmup][Decode][47/48] batch_size:2 seq_len:128 free_mem:55.43 GiB
-INFO 08-01 22:27:16 hpu_model_runner.py:1066] [Warmup][Decode][48/48] batch_size:1 seq_len:128 free_mem:55.43 GiB
-```
+??? Logs
+
+    ```text
+    INFO 08-01 22:26:47 hpu_model_runner.py:1066] [Warmup][Prompt][1/24] batch_size:4 seq_len:1024 free_mem:79.16 GiB
+    INFO 08-01 22:26:47 hpu_model_runner.py:1066] [Warmup][Prompt][2/24] batch_size:4 seq_len:896 free_mem:55.43 GiB
+    INFO 08-01 22:26:48 hpu_model_runner.py:1066] [Warmup][Prompt][3/24] batch_size:4 seq_len:768 free_mem:55.43 GiB
+    ...
+    INFO 08-01 22:26:59 hpu_model_runner.py:1066] [Warmup][Prompt][24/24] batch_size:1 seq_len:128 free_mem:55.43 GiB
+    INFO 08-01 22:27:00 hpu_model_runner.py:1066] [Warmup][Decode][1/48] batch_size:4 seq_len:2048 free_mem:55.43 GiB
+    INFO 08-01 22:27:00 hpu_model_runner.py:1066] [Warmup][Decode][2/48] batch_size:4 seq_len:1920 free_mem:55.43 GiB
+    INFO 08-01 22:27:01 hpu_model_runner.py:1066] [Warmup][Decode][3/48] batch_size:4 seq_len:1792 free_mem:55.43 GiB
+    ...
+    INFO 08-01 22:27:16 hpu_model_runner.py:1066] [Warmup][Decode][47/48] batch_size:2 seq_len:128 free_mem:55.43 GiB
+    INFO 08-01 22:27:16 hpu_model_runner.py:1066] [Warmup][Decode][48/48] batch_size:1 seq_len:128 free_mem:55.43 GiB
+    ```

 This example uses the same buckets as in the [Bucketing Mechanism][gaudi-bucketing-mechanism] section. Each output line corresponds to execution of a single bucket. When bucket is executed for the first time, its graph is compiled and can be reused later on, skipping further graph compilations.

@@ -279,37 +281,39 @@ When there's large amount of requests pending, vLLM scheduler will attempt to fi

 Each described step is logged by vLLM server, as follows (negative values correspond to memory being released):

-```text
-INFO 08-02 17:37:44 hpu_model_runner.py:493] Prompt bucket config (min, step, max_warmup) bs:[1, 32, 4], seq:[128, 128, 1024]
-INFO 08-02 17:37:44 hpu_model_runner.py:499] Generated 24 prompt buckets: [(1, 128), (1, 256), (1, 384), (1, 512), (1, 640), (1, 768), (1, 896), (1, 1024), (2, 128), (2, 256), (2, 384), (2, 512), (2, 640), (2, 768), (2, 896), (2, 1024), (4, 128), (4, 256), (4, 384), (4, 512), (4, 640), (4, 768), (4, 896), (4, 1024)]
-INFO 08-02 17:37:44 hpu_model_runner.py:504] Decode bucket config (min, step, max_warmup) bs:[1, 128, 4], seq:[128, 128, 2048]
-INFO 08-02 17:37:44 hpu_model_runner.py:509] Generated 48 decode buckets: [(1, 128), (1, 256), (1, 384), (1, 512), (1, 640), (1, 768), (1, 896), (1, 1024), (1, 1152), (1, 1280), (1, 1408), (1, 1536), (1, 1664), (1, 1792), (1, 1920), (1, 2048), (2, 128), (2, 256), (2, 384), (2, 512), (2, 640), (2, 768), (2, 896), (2, 1024), (2, 1152), (2, 1280), (2, 1408), (2, 1536), (2, 1664), (2, 1792), (2, 1920), (2, 2048), (4, 128), (4, 256), (4, 384), (4, 512), (4, 640), (4, 768), (4, 896), (4, 1024), (4, 1152), (4, 1280), (4, 1408), (4, 1536), (4, 1664), (4, 1792), (4, 1920), (4, 2048)]
-INFO 08-02 17:37:52 hpu_model_runner.py:430] Pre-loading model weights on hpu:0 took 14.97 GiB of device memory (14.97 GiB/94.62 GiB used) and 2.95 GiB of host memory (475.2 GiB/1007 GiB used)
-INFO 08-02 17:37:52 hpu_model_runner.py:438] Wrapping in HPU Graph took 0 B of device memory (14.97 GiB/94.62 GiB used) and -252 KiB of host memory (475.2 GiB/1007 GiB used)
-INFO 08-02 17:37:52 hpu_model_runner.py:442] Loading model weights took in total 14.97 GiB of device memory (14.97 GiB/94.62 GiB used) and 2.95 GiB of host memory (475.2 GiB/1007 GiB used)
-INFO 08-02 17:37:54 hpu_worker.py:134] Model profiling run took 504 MiB of device memory (15.46 GiB/94.62 GiB used) and 180.9 MiB of host memory (475.4 GiB/1007 GiB used)
-INFO 08-02 17:37:54 hpu_worker.py:158] Free device memory: 79.16 GiB, 39.58 GiB usable (gpu_memory_utilization=0.5), 15.83 GiB reserved for HPUGraphs (VLLM_GRAPH_RESERVED_MEM=0.4), 23.75 GiB reserved for KV cache
-INFO 08-02 17:37:54 hpu_executor.py:85] # HPU blocks: 1519, # CPU blocks: 0
-INFO 08-02 17:37:54 hpu_worker.py:190] Initializing cache engine took 23.73 GiB of device memory (39.2 GiB/94.62 GiB used) and -1.238 MiB of host memory (475.4 GiB/1007 GiB used)
-INFO 08-02 17:37:54 hpu_model_runner.py:1066] [Warmup][Prompt][1/24] batch_size:4 seq_len:1024 free_mem:55.43 GiB
-...
-INFO 08-02 17:38:22 hpu_model_runner.py:1066] [Warmup][Decode][48/48] batch_size:1 seq_len:128 free_mem:55.43 GiB
-INFO 08-02 17:38:22 hpu_model_runner.py:1159] Using 15.85 GiB/55.43 GiB of free device memory for HPUGraphs, 7.923 GiB for prompt and 7.923 GiB for decode (VLLM_GRAPH_PROMPT_RATIO=0.3)
-INFO 08-02 17:38:22 hpu_model_runner.py:1066] [Warmup][Graph/Prompt][1/24] batch_size:1 seq_len:128 free_mem:55.43 GiB
-...
-INFO 08-02 17:38:26 hpu_model_runner.py:1066] [Warmup][Graph/Prompt][11/24] batch_size:1 seq_len:896 free_mem:48.77 GiB
-INFO 08-02 17:38:27 hpu_model_runner.py:1066] [Warmup][Graph/Decode][1/48] batch_size:4 seq_len:128 free_mem:47.51 GiB
-...
-INFO 08-02 17:38:41 hpu_model_runner.py:1066] [Warmup][Graph/Decode][48/48] batch_size:1 seq_len:2048 free_mem:47.35 GiB
-INFO 08-02 17:38:41 hpu_model_runner.py:1066] [Warmup][Graph/Prompt][12/24] batch_size:4 seq_len:256 free_mem:47.35 GiB
-INFO 08-02 17:38:42 hpu_model_runner.py:1066] [Warmup][Graph/Prompt][13/24] batch_size:2 seq_len:512 free_mem:45.91 GiB
-INFO 08-02 17:38:42 hpu_model_runner.py:1066] [Warmup][Graph/Prompt][14/24] batch_size:1 seq_len:1024 free_mem:44.48 GiB
-INFO 08-02 17:38:43 hpu_model_runner.py:1066] [Warmup][Graph/Prompt][15/24] batch_size:2 seq_len:640 free_mem:43.03 GiB
-INFO 08-02 17:38:43 hpu_model_runner.py:1128] Graph/Prompt captured:15 (62.5%) used_mem:14.03 GiB buckets:[(1, 128), (1, 256), (1, 384), (1, 512), (1, 640), (1, 768), (1, 896), (1, 1024), (2, 128), (2, 256), (2, 384), (2, 512), (2, 640), (4, 128), (4, 256)]
-INFO 08-02 17:38:43 hpu_model_runner.py:1128] Graph/Decode captured:48 (100.0%) used_mem:161.9 MiB buckets:[(1, 128), (1, 256), (1, 384), (1, 512), (1, 640), (1, 768), (1, 896), (1, 1024), (1, 1152), (1, 1280), (1, 1408), (1, 1536), (1, 1664), (1, 1792), (1, 1920), (1, 2048), (2, 128), (2, 256), (2, 384), (2, 512), (2, 640), (2, 768), (2, 896), (2, 1024), (2, 1152), (2, 1280), (2, 1408), (2, 1536), (2, 1664), (2, 1792), (2, 1920), (2, 2048), (4, 128), (4, 256), (4, 384), (4, 512), (4, 640), (4, 768), (4, 896), (4, 1024), (4, 1152), (4, 1280), (4, 1408), (4, 1536), (4, 1664), (4, 1792), (4, 1920), (4, 2048)]
-INFO 08-02 17:38:43 hpu_model_runner.py:1206] Warmup finished in 49 secs, allocated 14.19 GiB of device memory
-INFO 08-02 17:38:43 hpu_executor.py:91] init_cache_engine took 37.92 GiB of device memory (53.39 GiB/94.62 GiB used) and 57.86 MiB of host memory (475.4 GiB/1007 GiB used)
-```
+??? Logs
+
+    ```text
+    INFO 08-02 17:37:44 hpu_model_runner.py:493] Prompt bucket config (min, step, max_warmup) bs:[1, 32, 4], seq:[128, 128, 1024]
+    INFO 08-02 17:37:44 hpu_model_runner.py:499] Generated 24 prompt buckets: [(1, 128), (1, 256), (1, 384), (1, 512), (1, 640), (1, 768), (1, 896), (1, 1024), (2, 128), (2, 256), (2, 384), (2, 512), (2, 640), (2, 768), (2, 896), (2, 1024), (4, 128), (4, 256), (4, 384), (4, 512), (4, 640), (4, 768), (4, 896), (4, 1024)]
+    INFO 08-02 17:37:44 hpu_model_runner.py:504] Decode bucket config (min, step, max_warmup) bs:[1, 128, 4], seq:[128, 128, 2048]
+    INFO 08-02 17:37:44 hpu_model_runner.py:509] Generated 48 decode buckets: [(1, 128), (1, 256), (1, 384), (1, 512), (1, 640), (1, 768), (1, 896), (1, 1024), (1, 1152), (1, 1280), (1, 1408), (1, 1536), (1, 1664), (1, 1792), (1, 1920), (1, 2048), (2, 128), (2, 256), (2, 384), (2, 512), (2, 640), (2, 768), (2, 896), (2, 1024), (2, 1152), (2, 1280), (2, 1408), (2, 1536), (2, 1664), (2, 1792), (2, 1920), (2, 2048), (4, 128), (4, 256), (4, 384), (4, 512), (4, 640), (4, 768), (4, 896), (4, 1024), (4, 1152), (4, 1280), (4, 1408), (4, 1536), (4, 1664), (4, 1792), (4, 1920), (4, 2048)]
+    INFO 08-02 17:37:52 hpu_model_runner.py:430] Pre-loading model weights on hpu:0 took 14.97 GiB of device memory (14.97 GiB/94.62 GiB used) and 2.95 GiB of host memory (475.2 GiB/1007 GiB used)
+    INFO 08-02 17:37:52 hpu_model_runner.py:438] Wrapping in HPU Graph took 0 B of device memory (14.97 GiB/94.62 GiB used) and -252 KiB of host memory (475.2 GiB/1007 GiB used)
+    INFO 08-02 17:37:52 hpu_model_runner.py:442] Loading model weights took in total 14.97 GiB of device memory (14.97 GiB/94.62 GiB used) and 2.95 GiB of host memory (475.2 GiB/1007 GiB used)
+    INFO 08-02 17:37:54 hpu_worker.py:134] Model profiling run took 504 MiB of device memory (15.46 GiB/94.62 GiB used) and 180.9 MiB of host memory (475.4 GiB/1007 GiB used)
+    INFO 08-02 17:37:54 hpu_worker.py:158] Free device memory: 79.16 GiB, 39.58 GiB usable (gpu_memory_utilization=0.5), 15.83 GiB reserved for HPUGraphs (VLLM_GRAPH_RESERVED_MEM=0.4), 23.75 GiB reserved for KV cache
+    INFO 08-02 17:37:54 hpu_executor.py:85] # HPU blocks: 1519, # CPU blocks: 0
+    INFO 08-02 17:37:54 hpu_worker.py:190] Initializing cache engine took 23.73 GiB of device memory (39.2 GiB/94.62 GiB used) and -1.238 MiB of host memory (475.4 GiB/1007 GiB used)
+    INFO 08-02 17:37:54 hpu_model_runner.py:1066] [Warmup][Prompt][1/24] batch_size:4 seq_len:1024 free_mem:55.43 GiB
+    ...
+    INFO 08-02 17:38:22 hpu_model_runner.py:1066] [Warmup][Decode][48/48] batch_size:1 seq_len:128 free_mem:55.43 GiB
+    INFO 08-02 17:38:22 hpu_model_runner.py:1159] Using 15.85 GiB/55.43 GiB of free device memory for HPUGraphs, 7.923 GiB for prompt and 7.923 GiB for decode (VLLM_GRAPH_PROMPT_RATIO=0.3)
+    INFO 08-02 17:38:22 hpu_model_runner.py:1066] [Warmup][Graph/Prompt][1/24] batch_size:1 seq_len:128 free_mem:55.43 GiB
+    ...
+    INFO 08-02 17:38:26 hpu_model_runner.py:1066] [Warmup][Graph/Prompt][11/24] batch_size:1 seq_len:896 free_mem:48.77 GiB
+    INFO 08-02 17:38:27 hpu_model_runner.py:1066] [Warmup][Graph/Decode][1/48] batch_size:4 seq_len:128 free_mem:47.51 GiB
+    ...
+    INFO 08-02 17:38:41 hpu_model_runner.py:1066] [Warmup][Graph/Decode][48/48] batch_size:1 seq_len:2048 free_mem:47.35 GiB
+    INFO 08-02 17:38:41 hpu_model_runner.py:1066] [Warmup][Graph/Prompt][12/24] batch_size:4 seq_len:256 free_mem:47.35 GiB
+    INFO 08-02 17:38:42 hpu_model_runner.py:1066] [Warmup][Graph/Prompt][13/24] batch_size:2 seq_len:512 free_mem:45.91 GiB
+    INFO 08-02 17:38:42 hpu_model_runner.py:1066] [Warmup][Graph/Prompt][14/24] batch_size:1 seq_len:1024 free_mem:44.48 GiB
+    INFO 08-02 17:38:43 hpu_model_runner.py:1066] [Warmup][Graph/Prompt][15/24] batch_size:2 seq_len:640 free_mem:43.03 GiB
+    INFO 08-02 17:38:43 hpu_model_runner.py:1128] Graph/Prompt captured:15 (62.5%) used_mem:14.03 GiB buckets:[(1, 128), (1, 256), (1, 384), (1, 512), (1, 640), (1, 768), (1, 896), (1, 1024), (2, 128), (2, 256), (2, 384), (2, 512), (2, 640), (4, 128), (4, 256)]
+    INFO 08-02 17:38:43 hpu_model_runner.py:1128] Graph/Decode captured:48 (100.0%) used_mem:161.9 MiB buckets:[(1, 128), (1, 256), (1, 384), (1, 512), (1, 640), (1, 768), (1, 896), (1, 1024), (1, 1152), (1, 1280), (1, 1408), (1, 1536), (1, 1664), (1, 1792), (1, 1920), (1, 2048), (2, 128), (2, 256), (2, 384), (2, 512), (2, 640), (2, 768), (2, 896), (2, 1024), (2, 1152), (2, 1280), (2, 1408), (2, 1536), (2, 1664), (2, 1792), (2, 1920), (2, 2048), (4, 128), (4, 256), (4, 384), (4, 512), (4, 640), (4, 768), (4, 896), (4, 1024), (4, 1152), (4, 1280), (4, 1408), (4, 1536), (4, 1664), (4, 1792), (4, 1920), (4, 2048)]
+    INFO 08-02 17:38:43 hpu_model_runner.py:1206] Warmup finished in 49 secs, allocated 14.19 GiB of device memory
+    INFO 08-02 17:38:43 hpu_executor.py:91] init_cache_engine took 37.92 GiB of device memory (53.39 GiB/94.62 GiB used) and 57.86 MiB of host memory (475.4 GiB/1007 GiB used)
+    ```

 ### Recommended vLLM Parameters

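The memory-budget line in the `intel_gaudi.md` startup log above (`hpu_worker.py:158`) can be checked by hand. The following is a minimal sketch, assuming the usable budget is free device memory times `gpu_memory_utilization`, the HPUGraph reservation is the usable budget times `VLLM_GRAPH_RESERVED_MEM`, and the KV cache receives the remainder. The function name `split_device_memory` is hypothetical and is not part of vLLM.

```python
def split_device_memory(free_gib: float,
                        gpu_memory_utilization: float,
                        graph_reserved_mem: float) -> tuple[float, float, float]:
    """Return (usable, hpu_graphs, kv_cache) in GiB under the assumptions above."""
    # Share of free device memory that vLLM is allowed to use.
    usable = free_gib * gpu_memory_utilization
    # Share of the usable budget set aside for HPUGraphs.
    hpu_graphs = usable * graph_reserved_mem
    # Whatever is left of the usable budget goes to the KV cache.
    kv_cache = usable - hpu_graphs
    return usable, hpu_graphs, kv_cache


# Values taken from the log line: 79.16 GiB free, 0.5 utilization, 0.4 graph reservation.
usable, hpu_graphs, kv_cache = split_device_memory(79.16, 0.5, 0.4)
print(f"{usable:.2f} GiB usable, {hpu_graphs:.2f} GiB for HPUGraphs, "
      f"{kv_cache:.2f} GiB for KV cache")
```

With the logged inputs this reproduces the 39.58 GiB, 15.83 GiB and 23.75 GiB figures shown in the log.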

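The ramp-up rule described in the `intel_gaudi.md` hunk above (`min` is multiplied by consecutive powers of two until `step` is reached, after which buckets advance by `step` up to `max`) can be sketched as follows. This is an illustration only, not vLLM's implementation: the function name `generate_buckets` is hypothetical, and capping the ramp-up phase at `max` is an assumption made so that a config such as `bs:[1, 32, 4]` yields the batch sizes 1, 2, 4 seen in the logs.

```python
def generate_buckets(bmin: int, step: int, bmax: int) -> list[int]:
    """Sketch of ramp-up + stable bucket boundaries for one dimension."""
    # Ramp-up phase: double from `bmin` while still below `step`.
    # Assumption: ramp-up values never exceed `bmax`.
    ramp_up = []
    value = bmin
    while value < step and value <= bmax:
        ramp_up.append(value)
        value *= 2
    # Stable phase: multiples of `step`, from `step` up to and including `bmax`.
    stable = list(range(step, bmax + 1, step))
    return ramp_up + stable


# Example with ramp-up from the docs: min = 2, step = 32, max = 64.
print(generate_buckets(2, 32, 64))  # [2, 4, 8, 16, 32, 64]

# Prompt bucket config from the log: bs:[1, 32, 4], seq:[128, 128, 1024].
batch_sizes = generate_buckets(1, 32, 4)
seq_lens = generate_buckets(128, 128, 1024)
prompt_buckets = [(bs, seq) for bs in batch_sizes for seq in seq_lens]
print(len(prompt_buckets))  # 24
```

Under these assumptions the sketch reproduces the `(2, 4, 8, 16, 32, 64)` example and the count of 24 prompt buckets reported in the log. The decode config `bs:[1, 128, 4], seq:[128, 128, 2048]` gives 48 combinations in the same way.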
--- a/docs/getting_started/quickstart.md
+++ b/docs/getting_started/quickstart.md
@@ -147,20 +147,22 @@ curl http://localhost:8000/v1/completions \

 Since this server is compatible with OpenAI API, you can use it as a drop-in replacement for any applications using OpenAI API. For example, another way to query the server is via the `openai` Python package:

-```python
-from openai import OpenAI
-
-# Modify OpenAI's API key and API base to use vLLM's API server.
-openai_api_key = "EMPTY"
-openai_api_base = "http://localhost:8000/v1"
-client = OpenAI(
-    api_key=openai_api_key,
-    base_url=openai_api_base,
-)
-completion = client.completions.create(model="Qwen/Qwen2.5-1.5B-Instruct",
-                                      prompt="San Francisco is a")
-print("Completion result:", completion)
-```
+??? Code
+
+    ```python
+    from openai import OpenAI
+
+    # Modify OpenAI's API key and API base to use vLLM's API server.
+    openai_api_key = "EMPTY"
+    openai_api_base = "http://localhost:8000/v1"
+    client = OpenAI(
+        api_key=openai_api_key,
+        base_url=openai_api_base,
+    )
+    completion = client.completions.create(model="Qwen/Qwen2.5-1.5B-Instruct",
+                                        prompt="San Francisco is a")
+    print("Completion result:", completion)
+    ```

 A more detailed client example can be found here: <gh-file:examples/online_serving/openai_completion_client.py>

@@ -184,26 +186,28 @@ curl http://localhost:8000/v1/chat/completions \

 Alternatively, you can use the `openai` Python package:

-```python
-from openai import OpenAI
-# Set OpenAI's API key and API base to use vLLM's API server.
-openai_api_key = "EMPTY"
-openai_api_base = "http://localhost:8000/v1"
-
-client = OpenAI(
-    api_key=openai_api_key,
-    base_url=openai_api_base,
-)
-
-chat_response = client.chat.completions.create(
-    model="Qwen/Qwen2.5-1.5B-Instruct",
-    messages=[
-        {"role": "system", "content": "You are a helpful assistant."},
-        {"role": "user", "content": "Tell me a joke."},
-    ]
-)
-print("Chat response:", chat_response)
-```
+??? Code
+
+    ```python
+    from openai import OpenAI
+    # Set OpenAI's API key and API base to use vLLM's API server.
+    openai_api_key = "EMPTY"
+    openai_api_base = "http://localhost:8000/v1"
+
+    client = OpenAI(
+        api_key=openai_api_key,
+        base_url=openai_api_base,
+    )
+
+    chat_response = client.chat.completions.create(
+        model="Qwen/Qwen2.5-1.5B-Instruct",
+        messages=[
+            {"role": "system", "content": "You are a helpful assistant."},
+            {"role": "user", "content": "Tell me a joke."},
+        ]
+    )
+    print("Chat response:", chat_response)
+    ```

 ## On Attention Backends


--- a/docs/models/generative_models.md
+++ b/docs/models/generative_models.md
@@ -85,35 +85,37 @@ and automatically applies the model's [chat template](https://huggingface.co/doc
    In general, only instruction-tuned models have a chat template.
    Base models may perform poorly as they are not trained to respond to the chat conversation.

-```python
-from vllm import LLM
-
-llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")
-conversation = [
-    {
-        "role": "system",
-        "content": "You are a helpful assistant"
-    },
-    {
-        "role": "user",
-        "content": "Hello"
-    },
-    {
-        "role": "assistant",
-        "content": "Hello! How can I assist you today?"
-    },
-    {
-        "role": "user",
-        "content": "Write an essay about the importance of higher education.",
-    },
-]
-outputs = llm.chat(conversation)
-
-for output in outputs:
-    prompt = output.prompt
-    generated_text = output.outputs[0].text
-    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
-```
+??? Code
+
+    ```python
+    from vllm import LLM
+
+    llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")
+    conversation = [
+        {
+            "role": "system",
+            "content": "You are a helpful assistant"
+        },
+        {
+            "role": "user",
+            "content": "Hello"
+        },
+        {
+            "role": "assistant",
+            "content": "Hello! How can I assist you today?"
+        },
+        {
+            "role": "user",
+            "content": "Write an essay about the importance of higher education.",
+        },
+    ]
+    outputs = llm.chat(conversation)
+
+    for output in outputs:
+        prompt = output.prompt
+        generated_text = output.outputs[0].text
+        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
+    ```

 A code example can be found here: <gh-file:examples/offline_inference/basic/chat.py>


--- a/docs/models/supported_models.md
+++ b/docs/models/supported_models.md
@@ -70,7 +70,10 @@ To make your model compatible with the Transformers backend, it needs:
 2. `MyAttention` must use `ALL_ATTENTION_FUNCTIONS` to call attention.
 3. `MyModel` must contain `_supports_attention_backend = True`.

-```python title="modeling_my_model.py"
+<details>
+<summary>modeling_my_model.py</summary>
+
+```python

 from transformers import PreTrainedModel
 from torch import nn
@@ -93,6 +96,8 @@ class MyModel(PreTrainedModel):
    _supports_attention_backend = True
 ```

+</details>
+
 Here is what happens in the background when this model is loaded:

 1. The config is loaded.
@@ -103,7 +108,10 @@ That's it!

 For your model to be compatible with vLLM's tensor parallel and/or pipeline parallel features, you must add `base_model_tp_plan` and/or `base_model_pp_plan` to your model's config class:

-```python title="configuration_my_model.py"
+<details>
+<summary>configuration_my_model.py</summary>
+
+```python

 from transformers import PretrainedConfig

@@ -123,6 +131,8 @@ class MyConfig(PretrainedConfig):
    }
 ```

+</details>
+
 - `base_model_tp_plan` is a `dict` that maps fully qualified layer name patterns to tensor parallel styles (currently only `"colwise"` and `"rowwise"` are supported).
 - `base_model_pp_plan` is a `dict` that maps direct child layer names to `tuple`s of `list`s of `str`s:
    * You only need to do this for layers which are not present on all pipeline stages
@@ -198,6 +208,9 @@ huggingface-cli scan-cache --dir ~/.cache/huggingface/hub

 Use the Hugging Face CLI to interactively [delete downloaded model](https://huggingface.co/docs/huggingface_hub/guides/manage-cache#clean-your-cache) from the cache:

+<details>
+<summary>Commands</summary>
+
 ```console
 # The `delete-cache` command requires extra dependencies to work with the TUI.
 # Please run `pip install huggingface_hub[cli]` to install them.
@@ -224,6 +237,8 @@ Start deletion.
 Done. Deleted 1 repo(s) and 0 revision(s) for a total of 438.9M.
 ```

+</details>
+
 #### Using a proxy

 Here are some tips for loading/downloading models from Hugging Face using a proxy:
@@ -601,27 +616,29 @@ Specified using `--task generate`.

    For the best results, we recommend using the following dependency versions (tested on A10 and L40):

-    ```text
-    # Core vLLM-compatible dependencies with Molmo accuracy setup (tested on L40)
-    torch==2.5.1
-    torchvision==0.20.1
-    transformers==4.48.1
-    tokenizers==0.21.0
-    tiktoken==0.7.0
-    vllm==0.7.0
-
-    # Optional but recommended for improved performance and stability
-    triton==3.1.0
-    xformers==0.0.28.post3
-    uvloop==0.21.0
-    protobuf==5.29.3
-    openai==1.60.2
-    opencv-python-headless==4.11.0.86
-    pillow==10.4.0
-
-    # Installed FlashAttention (for float16 only)
-    flash-attn>=2.5.6  # Not used in float32, but should be documented
-    ```
+    ??? Dependency versions
+
+        ```text
+        # Core vLLM-compatible dependencies with Molmo accuracy setup (tested on L40)
+        torch==2.5.1
+        torchvision==0.20.1
+        transformers==4.48.1
+        tokenizers==0.21.0
+        tiktoken==0.7.0
+        vllm==0.7.0
+
+        # Optional but recommended for improved performance and stability
+        triton==3.1.0
+        xformers==0.0.28.post3
+        uvloop==0.21.0
+        protobuf==5.29.3
+        openai==1.60.2
+        opencv-python-headless==4.11.0.86
+        pillow==10.4.0
+
+        # Installed FlashAttention (for float16 only)
+        flash-attn>=2.5.6  # Not used in float32, but should be documented
+        ```

    **Note:** Make sure you understand the security implications of using outdated packages.


--- a/docs/serving/integrations/langchain.md
+++ b/docs/serving/integrations/langchain.md
@@ -13,19 +13,21 @@ pip install langchain langchain_community -q

 To run inference on a single or multiple GPUs, use `VLLM` class from `langchain`.

-```python
-from langchain_community.llms import VLLM
-
-llm = VLLM(model="mosaicml/mpt-7b",
-           trust_remote_code=True,  # mandatory for hf models
-           max_new_tokens=128,
-           top_k=10,
-           top_p=0.95,
-           temperature=0.8,
-           # tensor_parallel_size=... # for distributed inference
-)
-
-print(llm("What is the capital of France ?"))
-```
+??? Code
+
+    ```python
+    from langchain_community.llms import VLLM
+
+    llm = VLLM(model="mosaicml/mpt-7b",
+            trust_remote_code=True,  # mandatory for hf models
+            max_new_tokens=128,
+            top_k=10,
+            top_p=0.95,
+            temperature=0.8,
+            # tensor_parallel_size=... # for distributed inference
+    )
+
+    print(llm("What is the capital of France ?"))
+    ```

 Please refer to this [Tutorial](https://python.langchain.com/docs/integrations/llms/vllm) for more details.
--- a/docs/serving/openai_compatible_server.md
+++ b/docs/serving/openai_compatible_server.md
@@ -15,22 +15,24 @@ vllm serve NousResearch/Meta-Llama-3-8B-Instruct \

 To call the server, in your preferred text editor, create a script that uses an HTTP client. Include any messages that you want to send to the model. Then run that script. Below is an example script using the [official OpenAI Python client](https://github.com/openai/openai-python).

-```python
-from openai import OpenAI
-client = OpenAI(
-    base_url="http://localhost:8000/v1",
-    api_key="token-abc123",
-)
+??? Code

-completion = client.chat.completions.create(
-    model="NousResearch/Meta-Llama-3-8B-Instruct",
-    messages=[
-        {"role": "user", "content": "Hello!"}
-    ]
-)
+    ```python
+    from openai import OpenAI
+    client = OpenAI(
+        base_url="http://localhost:8000/v1",
+        api_key="token-abc123",
+    )

-print(completion.choices[0].message)
-```
+    completion = client.chat.completions.create(
+        model="NousResearch/Meta-Llama-3-8B-Instruct",
+        messages=[
+            {"role": "user", "content": "Hello!"}
+        ]
+    )
+
+    print(completion.choices[0].message)
+    ```

 !!! tip
    vLLM supports some parameters that are not supported by OpenAI, `top_k` for example.
@@ -147,27 +149,29 @@ with `--enable-request-id-headers`.
 > rather than within the vLLM layer for this reason.
 > See [this PR](https://github.com/vllm-project/vllm/pull/11529) for more details.

-```python
-completion = client.chat.completions.create(
-    model="NousResearch/Meta-Llama-3-8B-Instruct",
-    messages=[
-        {"role": "user", "content": "Classify this sentiment: vLLM is wonderful!"}
-    ],
-    extra_headers={
-        "x-request-id": "sentiment-classification-00001",
-    }
-)
-print(completion._request_id)
+??? Code

-completion = client.completions.create(
-    model="NousResearch/Meta-Llama-3-8B-Instruct",
-    prompt="A robot may not injure a human being",
-    extra_headers={
-        "x-request-id": "completion-test",
-    }
-)
-print(completion._request_id)
-```
+    ```python
+    completion = client.chat.completions.create(
+        model="NousResearch/Meta-Llama-3-8B-Instruct",
+        messages=[
+            {"role": "user", "content": "Classify this sentiment: vLLM is wonderful!"}
+        ],
+        extra_headers={
+            "x-request-id": "sentiment-classification-00001",
+        }
+    )
+    print(completion._request_id)
+
+    completion = client.completions.create(
+        model="NousResearch/Meta-Llama-3-8B-Instruct",
+        prompt="A robot may not injure a human being",
+        extra_headers={
+            "x-request-id": "completion-test",
+        }
+    )
+    print(completion._request_id)
+    ```

 ## API Reference

@@ -184,15 +188,19 @@ Code example: <gh-file:examples/online_serving/openai_completion_client.py>

 The following [sampling parameters][sampling-params] are supported.

-```python
--8<-- "vllm/entrypoints/openai/protocol.py:completion-sampling-params"
-```
+??? Code
+
+    ```python
+    --8<-- "vllm/entrypoints/openai/protocol.py:completion-sampling-params"
+    ```

 The following extra parameters are supported:

-```python
--8<-- "vllm/entrypoints/openai/protocol.py:completion-extra-params"
-```
+??? Code
+
+    ```python
+    --8<-- "vllm/entrypoints/openai/protocol.py:completion-extra-params"
+    ```

 [](){ #chat-api }

@@ -212,15 +220,19 @@ Code example: <gh-file:examples/online_serving/openai_chat_completion_client.py>

 The following [sampling parameters][sampling-params] are supported.

-```python
--8<-- "vllm/entrypoints/openai/protocol.py:chat-completion-sampling-params"
-```
+??? Code
+
+    ```python
+    --8<-- "vllm/entrypoints/openai/protocol.py:chat-completion-sampling-params"
+    ```

 The following extra parameters are supported:

-```python
--8<-- "vllm/entrypoints/openai/protocol.py:chat-completion-extra-params"
-```
+??? Code
+
+    ```python
+    --8<-- "vllm/entrypoints/openai/protocol.py:chat-completion-extra-params"
+    ```

 [](){ #embeddings-api }

@@ -259,29 +271,31 @@ and passing a list of `messages` in the request. Refer to the examples below for

    Since the request schema is not defined by OpenAI client, we post a request to the server using the lower-level `requests` library:

-    ```python
-    import requests
-
-    image_url = "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
-
-    response = requests.post(
-        "http://localhost:8000/v1/embeddings",
-        json={
-            "model": "TIGER-Lab/VLM2Vec-Full",
-            "messages": [{
-                "role": "user",
-                "content": [
-                    {"type": "image_url", "image_url": {"url": image_url}},
-                    {"type": "text", "text": "Represent the given image."},
-                ],
-            }],
-            "encoding_format": "float",
-        },
-    )
-    response.raise_for_status()
-    response_json = response.json()
-    print("Embedding output:", response_json["data"][0]["embedding"])
-    ```
+    ??? Code
+
+        ```python
+        import requests
+
+        image_url = "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
+
+        response = requests.post(
+            "http://localhost:8000/v1/embeddings",
+            json={
+                "model": "TIGER-Lab/VLM2Vec-Full",
+                "messages": [{
+                    "role": "user",
+                    "content": [
+                        {"type": "image_url", "image_url": {"url": image_url}},
+                        {"type": "text", "text": "Represent the given image."},
+                    ],
+                }],
+                "encoding_format": "float",
+            },
+        )
+        response.raise_for_status()
+        response_json = response.json()
+        print("Embedding output:", response_json["data"][0]["embedding"])
+        ```

 === "DSE-Qwen2-MRL"

@@ -316,15 +330,19 @@ The following [pooling parameters][pooling-params] are supported.

 The following extra parameters are supported by default:

-```python
--8<-- "vllm/entrypoints/openai/protocol.py:embedding-extra-params"
-```
+??? Code
+
+    ```python
+    --8<-- "vllm/entrypoints/openai/protocol.py:embedding-extra-params"
+    ```

 For chat-like input (i.e. if `messages` is passed), these extra parameters are supported instead:

-```python
--8<-- "vllm/entrypoints/openai/protocol.py:chat-embedding-extra-params"
-```
+??? Code
+
+    ```python
+    --8<-- "vllm/entrypoints/openai/protocol.py:chat-embedding-extra-params"
+    ```

 [](){ #transcriptions-api }

@@ -343,15 +361,19 @@ Code example: <gh-file:examples/online_serving/openai_transcription_client.py>

 The following [sampling parameters][sampling-params] are supported.

-```python
--8<-- "vllm/entrypoints/openai/protocol.py:transcription-sampling-params"
-```
+??? Code
+
+    ```python
+    --8<-- "vllm/entrypoints/openai/protocol.py:transcription-sampling-params"
+    ```

 The following extra parameters are supported:

-```python
--8<-- "vllm/entrypoints/openai/protocol.py:transcription-extra-params"
-```
+??? Code
+
+    ```python
+    --8<-- "vllm/entrypoints/openai/protocol.py:transcription-extra-params"
+    ```

 [](){ #tokenizer-api }

@@ -387,8 +409,6 @@ Code example: <gh-file:examples/online_serving/openai_classification_client.py>

 You can classify multiple texts by passing an array of strings:

-Request:
-
 ```bash
 curl -v "http://127.0.0.1:8000/classify" \
  -H "Content-Type: application/json" \
@@ -401,47 +421,45 @@ curl -v "http://127.0.0.1:8000/classify" \
  }'
 ```

-Response:
+??? Response

-```bash
-{
-  "id": "classify-7c87cac407b749a6935d8c7ce2a8fba2",
-  "object": "list",
-  "created": 1745383065,
-  "model": "jason9693/Qwen2.5-1.5B-apeach",
-  "data": [
-    {
-      "index": 0,
-      "label": "Default",
-      "probs": [
-        0.565970778465271,
-        0.4340292513370514
-      ],
-      "num_classes": 2
-    },
+    ```bash
    {
-      "index": 1,
-      "label": "Spoiled",
-      "probs": [
-        0.26448777318000793,
-        0.7355121970176697
+      "id": "classify-7c87cac407b749a6935d8c7ce2a8fba2",
+      "object": "list",
+      "created": 1745383065,
+      "model": "jason9693/Qwen2.5-1.5B-apeach",
+      "data": [
+        {
+          "index": 0,
+          "label": "Default",
+          "probs": [
+            0.565970778465271,
+            0.4340292513370514
+          ],
+          "num_classes": 2
+        },
+        {
+          "index": 1,
+          "label": "Spoiled",
+          "probs": [
+            0.26448777318000793,
+            0.7355121970176697
+          ],
+          "num_classes": 2
+        }
      ],
-      "num_classes": 2
+      "usage": {
+        "prompt_tokens": 20,
+        "total_tokens": 20,
+        "completion_tokens": 0,
+        "prompt_tokens_details": null
+      }
    }
-  ],
-  "usage": {
-    "prompt_tokens": 20,
-    "total_tokens": 20,
-    "completion_tokens": 0,
-    "prompt_tokens_details": null
-  }
-}
-```
+    ```

 You can also pass a string directly to the `input` field:

-Request:
-
 ```bash
 curl -v "http://127.0.0.1:8000/classify" \
  -H "Content-Type: application/json" \
@@ -451,33 +469,33 @@ curl -v "http://127.0.0.1:8000/classify" \
  }'
 ```

-Response:
+??? Response

-```bash
-{
-  "id": "classify-9bf17f2847b046c7b2d5495f4b4f9682",
-  "object": "list",
-  "created": 1745383213,
-  "model": "jason9693/Qwen2.5-1.5B-apeach",
-  "data": [
+    ```bash
    {
-      "index": 0,
-      "label": "Default",
-      "probs": [
-        0.565970778465271,
-        0.4340292513370514
+      "id": "classify-9bf17f2847b046c7b2d5495f4b4f9682",
+      "object": "list",
+      "created": 1745383213,
+      "model": "jason9693/Qwen2.5-1.5B-apeach",
+      "data": [
+        {
+          "index": 0,
+          "label": "Default",
+          "probs": [
+            0.565970778465271,
+            0.4340292513370514
+          ],
+          "num_classes": 2
+        }
      ],
-      "num_classes": 2
+      "usage": {
+        "prompt_tokens": 10,
+        "total_tokens": 10,
+        "completion_tokens": 0,
+        "prompt_tokens_details": null
+      }
    }
-  ],
-  "usage": {
-    "prompt_tokens": 10,
-    "total_tokens": 10,
-    "completion_tokens": 0,
-    "prompt_tokens_details": null
-  }
-}
-```
+    ```

 #### Extra parameters

@@ -508,8 +526,6 @@ Code example: <gh-file:examples/online_serving/openai_cross_encoder_score.py>

 You can pass a string to both `text_1` and `text_2`, forming a single sentence pair.

-Request:
-
 ```bash
 curl -X 'POST' \
  'http://127.0.0.1:8000/score' \
@@ -523,24 +539,24 @@ curl -X 'POST' \
 }'
 ```

-Response:
+??? Response

-```bash
-{
-  "id": "score-request-id",
-  "object": "list",
-  "created": 693447,
-  "model": "BAAI/bge-reranker-v2-m3",
-  "data": [
+    ```bash
    {
-      "index": 0,
-      "object": "score",
-      "score": 1
+      "id": "score-request-id",
+      "object": "list",
+      "created": 693447,
+      "model": "BAAI/bge-reranker-v2-m3",
+      "data": [
+        {
+          "index": 0,
+          "object": "score",
+          "score": 1
+        }
+      ],
+      "usage": {}
    }
-  ],
-  "usage": {}
-}
-```
+    ```

 #### Batch inference

@@ -548,95 +564,95 @@ You can pass a string to `text_1` and a list to `text_2`, forming multiple sente
 where each pair is built from `text_1` and a string in `text_2`.
 The total number of pairs is `len(text_2)`.

-Request:
+??? Request

-```bash
-curl -X 'POST' \
-  'http://127.0.0.1:8000/score' \
-  -H 'accept: application/json' \
-  -H 'Content-Type: application/json' \
-  -d '{
-  "model": "BAAI/bge-reranker-v2-m3",
-  "text_1": "What is the capital of France?",
-  "text_2": [
-    "The capital of Brazil is Brasilia.",
-    "The capital of France is Paris."
-  ]
-}'
-```
+    ```bash
+    curl -X 'POST' \
+      'http://127.0.0.1:8000/score' \
+      -H 'accept: application/json' \
+      -H 'Content-Type: application/json' \
+      -d '{
+      "model": "BAAI/bge-reranker-v2-m3",
+      "text_1": "What is the capital of France?",
+      "text_2": [
+        "The capital of Brazil is Brasilia.",
+        "The capital of France is Paris."
+      ]
+    }'
+    ```

-Response:
+??? Response

-```bash
-{
-  "id": "score-request-id",
-  "object": "list",
-  "created": 693570,
-  "model": "BAAI/bge-reranker-v2-m3",
-  "data": [
-    {
-      "index": 0,
-      "object": "score",
-      "score": 0.001094818115234375
-    },
+    ```bash
    {
-      "index": 1,
-      "object": "score",
-      "score": 1
+      "id": "score-request-id",
+      "object": "list",
+      "created": 693570,
+      "model": "BAAI/bge-reranker-v2-m3",
+      "data": [
+        {
+          "index": 0,
+          "object": "score",
+          "score": 0.001094818115234375
+        },
+        {
+          "index": 1,
+          "object": "score",
+          "score": 1
+        }
+      ],
+      "usage": {}
    }
-  ],
-  "usage": {}
-}
-```
+    ```

 You can pass a list to both `text_1` and `text_2`, forming multiple sentence pairs
 where each pair is built from a string in `text_1` and the corresponding string in `text_2` (similar to `zip()`).
 The total number of pairs is `len(text_2)`.

-Request:
+??? Request

-```bash
-curl -X 'POST' \
-  'http://127.0.0.1:8000/score' \
-  -H 'accept: application/json' \
-  -H 'Content-Type: application/json' \
-  -d '{
-  "model": "BAAI/bge-reranker-v2-m3",
-  "encoding_format": "float",
-  "text_1": [
-    "What is the capital of Brazil?",
-    "What is the capital of France?"
-  ],
-  "text_2": [
-    "The capital of Brazil is Brasilia.",
-    "The capital of France is Paris."
-  ]
-}'
-```
+    ```bash
+    curl -X 'POST' \
+      'http://127.0.0.1:8000/score' \
+      -H 'accept: application/json' \
+      -H 'Content-Type: application/json' \
+      -d '{
+      "model": "BAAI/bge-reranker-v2-m3",
+      "encoding_format": "float",
+      "text_1": [
+        "What is the capital of Brazil?",
+        "What is the capital of France?"
+      ],
+      "text_2": [
+        "The capital of Brazil is Brasilia.",
+        "The capital of France is Paris."
+      ]
+    }'
+    ```

-Response:
+??? Response

-```bash
-{
-  "id": "score-request-id",
-  "object": "list",
-  "created": 693447,
-  "model": "BAAI/bge-reranker-v2-m3",
-  "data": [
-    {
-      "index": 0,
-      "object": "score",
-      "score": 1
-    },
+    ```bash
    {
-      "index": 1,
-      "object": "score",
-      "score": 1
+      "id": "score-request-id",
+      "object": "list",
+      "created": 693447,
+      "model": "BAAI/bge-reranker-v2-m3",
+      "data": [
+        {
+          "index": 0,
+          "object": "score",
+          "score": 1
+        },
+        {
+          "index": 1,
+          "object": "score",
+          "score": 1
+        }
+      ],
+      "usage": {}
    }
-  ],
-  "usage": {}
-}
-```
+    ```

 #### Extra parameters

@@ -675,51 +691,51 @@ Code example: <gh-file:examples/online_serving/jinaai_rerank_client.py>
 Note that the `top_n` request parameter is optional and will default to the length of the `documents` field.
 Result documents will be sorted by relevance, and the `index` property can be used to determine original order.

-Request:
+??? Request

-```bash
-curl -X 'POST' \
-  'http://127.0.0.1:8000/v1/rerank' \
-  -H 'accept: application/json' \
-  -H 'Content-Type: application/json' \
-  -d '{
-  "model": "BAAI/bge-reranker-base",
-  "query": "What is the capital of France?",
-  "documents": [
-    "The capital of Brazil is Brasilia.",
-    "The capital of France is Paris.",
-    "Horses and cows are both animals"
-  ]
-}'
-```
+    ```bash
+    curl -X 'POST' \
+      'http://127.0.0.1:8000/v1/rerank' \
+      -H 'accept: application/json' \
+      -H 'Content-Type: application/json' \
+      -d '{
+      "model": "BAAI/bge-reranker-base",
+      "query": "What is the capital of France?",
+      "documents": [
+        "The capital of Brazil is Brasilia.",
+        "The capital of France is Paris.",
+        "Horses and cows are both animals"
+      ]
+    }'
+    ```

-Response:
+??? Response

-```bash
-{
-  "id": "rerank-fae51b2b664d4ed38f5969b612edff77",
-  "model": "BAAI/bge-reranker-base",
-  "usage": {
-    "total_tokens": 56
-  },
-  "results": [
-    {
-      "index": 1,
-      "document": {
-        "text": "The capital of France is Paris."
-      },
-      "relevance_score": 0.99853515625
-    },
+    ```bash
    {
-      "index": 0,
-      "document": {
-        "text": "The capital of Brazil is Brasilia."
+      "id": "rerank-fae51b2b664d4ed38f5969b612edff77",
+      "model": "BAAI/bge-reranker-base",
+      "usage": {
+        "total_tokens": 56
      },
-      "relevance_score": 0.0005860328674316406
+      "results": [
+        {
+          "index": 1,
+          "document": {
+            "text": "The capital of France is Paris."
+          },
+          "relevance_score": 0.99853515625
+        },
+        {
+          "index": 0,
+          "document": {
+            "text": "The capital of Brazil is Brasilia."
+          },
+          "relevance_score": 0.0005860328674316406
+        }
+      ]
    }
-  ]
-}
-```
+    ```

 #### Extra parameters

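As a reading aid for the rerank response shown above, here is a minimal sketch of how a client might use the `index` property to recover the original document order. The response body is copied from the example; the helper names `best_document` and `scores_in_input_order` are hypothetical and not part of vLLM.

```python
# Response body copied from the rerank example in the docs.
rerank_response = {
    "id": "rerank-fae51b2b664d4ed38f5969b612edff77",
    "model": "BAAI/bge-reranker-base",
    "usage": {"total_tokens": 56},
    "results": [
        {
            "index": 1,
            "document": {"text": "The capital of France is Paris."},
            "relevance_score": 0.99853515625,
        },
        {
            "index": 0,
            "document": {"text": "The capital of Brazil is Brasilia."},
            "relevance_score": 0.0005860328674316406,
        },
    ],
}


def best_document(response):
    """Hypothetical helper: text of the highest-scoring result."""
    top = max(response["results"], key=lambda r: r["relevance_score"])
    return top["document"]["text"]


def scores_in_input_order(response):
    """Hypothetical helper: (index, score) pairs sorted by original position."""
    pairs = [(r["index"], r["relevance_score"]) for r in response["results"]]
    return sorted(pairs, key=lambda pair: pair[0])


print(best_document(rerank_response))
print(scores_in_input_order(rerank_response))
```

Using `max` on `relevance_score` rather than taking the first result keeps the sketch correct even if a caller reorders the list.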

--- a/docs/usage/metrics.md
+++ b/docs/usage/metrics.md
@@ -12,28 +12,32 @@ vllm serve unsloth/Llama-3.2-1B-Instruct

 Then query the endpoint to get the latest metrics from the server:

-```console
-$ curl http://0.0.0.0:8000/metrics
-
-# HELP vllm:iteration_tokens_total Histogram of number of tokens per engine_step.
-# TYPE vllm:iteration_tokens_total histogram
-vllm:iteration_tokens_total_sum{model_name="unsloth/Llama-3.2-1B-Instruct"} 0.0
-vllm:iteration_tokens_total_bucket{le="1.0",model_name="unsloth/Llama-3.2-1B-Instruct"} 3.0
-vllm:iteration_tokens_total_bucket{le="8.0",model_name="unsloth/Llama-3.2-1B-Instruct"} 3.0
-vllm:iteration_tokens_total_bucket{le="16.0",model_name="unsloth/Llama-3.2-1B-Instruct"} 3.0
-vllm:iteration_tokens_total_bucket{le="32.0",model_name="unsloth/Llama-3.2-1B-Instruct"} 3.0
-vllm:iteration_tokens_total_bucket{le="64.0",model_name="unsloth/Llama-3.2-1B-Instruct"} 3.0
-vllm:iteration_tokens_total_bucket{le="128.0",model_name="unsloth/Llama-3.2-1B-Instruct"} 3.0
-vllm:iteration_tokens_total_bucket{le="256.0",model_name="unsloth/Llama-3.2-1B-Instruct"} 3.0
-vllm:iteration_tokens_total_bucket{le="512.0",model_name="unsloth/Llama-3.2-1B-Instruct"} 3.0
-...
-```
+??? Output
+
+    ```console
+    $ curl http://0.0.0.0:8000/metrics
+
+    # HELP vllm:iteration_tokens_total Histogram of number of tokens per engine_step.
+    # TYPE vllm:iteration_tokens_total histogram
+    vllm:iteration_tokens_total_sum{model_name="unsloth/Llama-3.2-1B-Instruct"} 0.0
+    vllm:iteration_tokens_total_bucket{le="1.0",model_name="unsloth/Llama-3.2-1B-Instruct"} 3.0
+    vllm:iteration_tokens_total_bucket{le="8.0",model_name="unsloth/Llama-3.2-1B-Instruct"} 3.0
+    vllm:iteration_tokens_total_bucket{le="16.0",model_name="unsloth/Llama-3.2-1B-Instruct"} 3.0
+    vllm:iteration_tokens_total_bucket{le="32.0",model_name="unsloth/Llama-3.2-1B-Instruct"} 3.0
+    vllm:iteration_tokens_total_bucket{le="64.0",model_name="unsloth/Llama-3.2-1B-Instruct"} 3.0
+    vllm:iteration_tokens_total_bucket{le="128.0",model_name="unsloth/Llama-3.2-1B-Instruct"} 3.0
+    vllm:iteration_tokens_total_bucket{le="256.0",model_name="unsloth/Llama-3.2-1B-Instruct"} 3.0
+    vllm:iteration_tokens_total_bucket{le="512.0",model_name="unsloth/Llama-3.2-1B-Instruct"} 3.0
+    ...
+    ```

 The following metrics are exposed:

-```python
--8<-- "vllm/engine/metrics.py:metrics-definitions"
-```
+??? Code
+
+    ```python
+    --8<-- "vllm/engine/metrics.py:metrics-definitions"
+    ```

 Note: when metrics are deprecated in version `X.Y`, they are hidden in version `X.Y+1`
 but can be re-enabled using the `--show-hidden-metrics-for-version=X.Y` escape hatch,

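The `/metrics` output shown in this file is plain Prometheus text exposition format. The sketch below parses a few lines copied from that output using only the standard library. The function `parse_metrics` and both regular expressions are illustrative assumptions; they cover only the simple `name{labels} value` lines shown here, not the full exposition format.

```python
import re

# A few lines copied from the example /metrics output.
SAMPLE = '''\
# HELP vllm:iteration_tokens_total Histogram of number of tokens per engine_step.
# TYPE vllm:iteration_tokens_total histogram
vllm:iteration_tokens_total_sum{model_name="unsloth/Llama-3.2-1B-Instruct"} 0.0
vllm:iteration_tokens_total_bucket{le="1.0",model_name="unsloth/Llama-3.2-1B-Instruct"} 3.0
vllm:iteration_tokens_total_bucket{le="8.0",model_name="unsloth/Llama-3.2-1B-Instruct"} 3.0
'''

# Simplified patterns: metric name, optional {labels}, then a numeric value.
SAMPLE_LINE = re.compile(
    r'^(?P<name>[^{\s]+)(?:\{(?P<labels>[^}]*)\})?\s+(?P<value>\S+)$'
)
LABEL = re.compile(r'(\w+)="([^"]*)"')


def parse_metrics(text):
    """Return a list of (name, labels, value) tuples for each sample line."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_LINE.match(line)
        if match is None:
            continue  # skip anything that is not a sample line
        labels = dict(LABEL.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


for name, labels, value in parse_metrics(SAMPLE):
    print(name, labels, value)
```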
--- a/docs/usage/troubleshooting.md
+++ b/docs/usage/troubleshooting.md
@@ -60,68 +60,70 @@ To identify the particular CUDA operation that causes the error, you can add `--

 If GPU/CPU communication cannot be established, you can use the following Python script and follow the instructions below to confirm whether the GPU/CPU communication is working correctly.

-```python
-# Test PyTorch NCCL
-import torch
-import torch.distributed as dist
-dist.init_process_group(backend="nccl")
-local_rank = dist.get_rank() % torch.cuda.device_count()
-torch.cuda.set_device(local_rank)
-data = torch.FloatTensor([1,] * 128).to("cuda")
-dist.all_reduce(data, op=dist.ReduceOp.SUM)
-torch.cuda.synchronize()
-value = data.mean().item()
-world_size = dist.get_world_size()
-assert value == world_size, f"Expected {world_size}, got {value}"
-
-print("PyTorch NCCL is successful!")
-
-# Test PyTorch GLOO
-gloo_group = dist.new_group(ranks=list(range(world_size)), backend="gloo")
-cpu_data = torch.FloatTensor([1,] * 128)
-dist.all_reduce(cpu_data, op=dist.ReduceOp.SUM, group=gloo_group)
-value = cpu_data.mean().item()
-assert value == world_size, f"Expected {world_size}, got {value}"
-
-print("PyTorch GLOO is successful!")
-
-if world_size <= 1:
-    exit()
-
-# Test vLLM NCCL, with cuda graph
-from vllm.distributed.device_communicators.pynccl import PyNcclCommunicator
-
-pynccl = PyNcclCommunicator(group=gloo_group, device=local_rank)
-# pynccl is enabled by default for 0.6.5+,
-# but for 0.6.4 and below, we need to enable it manually.
-# Keep this line for backward compatibility, because people
-# prefer to read the latest documentation.
-pynccl.disabled = False
-
-s = torch.cuda.Stream()
-with torch.cuda.stream(s):
-    data.fill_(1)
-    out = pynccl.all_reduce(data, stream=s)
-    value = out.mean().item()
+??? Code
+
+    ```python
+    # Test PyTorch NCCL
+    import torch
+    import torch.distributed as dist
+    dist.init_process_group(backend="nccl")
+    local_rank = dist.get_rank() % torch.cuda.device_count()
+    torch.cuda.set_device(local_rank)
+    data = torch.FloatTensor([1,] * 128).to("cuda")
+    dist.all_reduce(data, op=dist.ReduceOp.SUM)
+    torch.cuda.synchronize()
+    value = data.mean().item()
+    world_size = dist.get_world_size()
    assert value == world_size, f"Expected {world_size}, got {value}"

-print("vLLM NCCL is successful!")
+    print("PyTorch NCCL is successful!")

-g = torch.cuda.CUDAGraph()
-with torch.cuda.graph(cuda_graph=g, stream=s):
-    out = pynccl.all_reduce(data, stream=torch.cuda.current_stream())
+    # Test PyTorch GLOO
+    gloo_group = dist.new_group(ranks=list(range(world_size)), backend="gloo")
+    cpu_data = torch.FloatTensor([1,] * 128)
+    dist.all_reduce(cpu_data, op=dist.ReduceOp.SUM, group=gloo_group)
+    value = cpu_data.mean().item()
+    assert value == world_size, f"Expected {world_size}, got {value}"

-data.fill_(1)
-g.replay()
-torch.cuda.current_stream().synchronize()
-value = out.mean().item()
-assert value == world_size, f"Expected {world_size}, got {value}"
+    print("PyTorch GLOO is successful!")

-print("vLLM NCCL with cuda graph is successful!")
+    if world_size <= 1:
+        exit()

-dist.destroy_process_group(gloo_group)
-dist.destroy_process_group()
-```
+    # Test vLLM NCCL, with cuda graph
+    from vllm.distributed.device_communicators.pynccl import PyNcclCommunicator
+
+    pynccl = PyNcclCommunicator(group=gloo_group, device=local_rank)
+    # pynccl is enabled by default for 0.6.5+,
+    # but for 0.6.4 and below, we need to enable it manually.
+    # Keep this line for backward compatibility, because people
+    # prefer to read the latest documentation.
+    pynccl.disabled = False
+
+    s = torch.cuda.Stream()
+    with torch.cuda.stream(s):
+        data.fill_(1)
+        out = pynccl.all_reduce(data, stream=s)
+        value = out.mean().item()
+        assert value == world_size, f"Expected {world_size}, got {value}"
+
+    print("vLLM NCCL is successful!")
+
+    g = torch.cuda.CUDAGraph()
+    with torch.cuda.graph(cuda_graph=g, stream=s):
+        out = pynccl.all_reduce(data, stream=torch.cuda.current_stream())
+
+    data.fill_(1)
+    g.replay()
+    torch.cuda.current_stream().synchronize()
+    value = out.mean().item()
+    assert value == world_size, f"Expected {world_size}, got {value}"
+
+    print("vLLM NCCL with cuda graph is successful!")
+
+    dist.destroy_process_group(gloo_group)
+    dist.destroy_process_group()
+    ```

 If you are testing with a single node, adjust `--nproc-per-node` to the number of GPUs you want to use:

@@ -165,25 +167,27 @@ WARNING 12-11 14:50:37 multiproc_worker_utils.py:281] CUDA was previously

 or an error from Python that looks like this:

-```console
-RuntimeError:
-        An attempt has been made to start a new process before the
-        current process has finished its bootstrapping phase.
+??? Logs

-        This probably means that you are not using fork to start your
-        child processes and you have forgotten to use the proper idiom
-        in the main module:
+    ```console
+    RuntimeError:
+            An attempt has been made to start a new process before the
+            current process has finished its bootstrapping phase.

-            if __name__ == '__main__':
-                freeze_support()
-                ...
+            This probably means that you are not using fork to start your
+            child processes and you have forgotten to use the proper idiom
+            in the main module:

-        The "freeze_support()" line can be omitted if the program
-        is not going to be frozen to produce an executable.
+                if __name__ == '__main__':
+                    freeze_support()
+                    ...

-        To fix this issue, refer to the "Safe importing of main module"
-        section in https://docs.python.org/3/library/multiprocessing.html
-```
+            The "freeze_support()" line can be omitted if the program
+            is not going to be frozen to produce an executable.
+
+            To fix this issue, refer to the "Safe importing of main module"
+            section in https://docs.python.org/3/library/multiprocessing.html
+    ```

 then you must update your Python code to guard usage of `vllm` behind an `if
 __name__ == '__main__':` block. For example, instead of this:
@@ -207,20 +211,22 @@ if __name__ == '__main__':

 vLLM depends heavily on `torch.compile` to optimize the model for better performance, which introduces a dependency on `torch.compile` itself and on the `triton` library. By default, we use `torch.compile` to [optimize some functions](https://github.com/vllm-project/vllm/pull/10406) in the model. Before running vLLM, you can check if `torch.compile` is working as expected by running the following script:

-```python
-import torch
-
-@torch.compile
-def f(x):
-    # a simple function to test torch.compile
-    x = x + 1
-    x = x * 2
-    x = x.sin()
-    return x
-
-x = torch.randn(4, 4).cuda()
-print(f(x))
-```
+??? Code
+
+    ```python
+    import torch
+
+    @torch.compile
+    def f(x):
+        # a simple function to test torch.compile
+        x = x + 1
+        x = x * 2
+        x = x.sin()
+        return x
+
+    x = torch.randn(4, 4).cuda()
+    print(f(x))
+    ```

 If it raises errors from the `torch/_inductor` directory, it usually means you have a custom `triton` library that is incompatible with the version of PyTorch you are using. See [this issue](https://github.com/vllm-project/vllm/issues/12219) for an example.

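Since the `torch.compile` check above depends on both `torch` and `triton` being importable, a quick pre-check can rule out a missing package before looking at `torch/_inductor` errors. This is a minimal sketch using only the standard library; the helper name `missing_modules` is hypothetical, and the check confirms only that the modules can be located, not that their versions are compatible.

```python
from importlib.util import find_spec


def missing_modules(names):
    """Return the top-level module names that cannot be located."""
    return [name for name in names if find_spec(name) is None]


# Assumption: "torch" and "triton" are the import names of the two packages.
missing = missing_modules(["torch", "triton"])
if missing:
    print("Not importable:", ", ".join(missing))
else:
    print("torch and triton can both be located.")
```

A clean result here narrows the problem to a version mismatch rather than a missing installation.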

--- a/docs/usage/usage_stats.md
+++ b/docs/usage/usage_stats.md
@@ -10,36 +10,38 @@ The list of data collected by the latest version of vLLM can be found here: <gh-

 Here is an example as of v0.4.0:

-```json
-{
-  "uuid": "fbe880e9-084d-4cab-a395-8984c50f1109",
-  "provider": "GCP",
-  "num_cpu": 24,
-  "cpu_type": "Intel(R) Xeon(R) CPU @ 2.20GHz",
-  "cpu_family_model_stepping": "6,85,7",
-  "total_memory": 101261135872,
-  "architecture": "x86_64",
-  "platform": "Linux-5.10.0-28-cloud-amd64-x86_64-with-glibc2.31",
-  "gpu_count": 2,
-  "gpu_type": "NVIDIA L4",
-  "gpu_memory_per_device": 23580639232,
-  "model_architecture": "OPTForCausalLM",
-  "vllm_version": "0.3.2+cu123",
-  "context": "LLM_CLASS",
-  "log_time": 1711663373492490000,
-  "source": "production",
-  "dtype": "torch.float16",
-  "tensor_parallel_size": 1,
-  "block_size": 16,
-  "gpu_memory_utilization": 0.9,
-  "quantization": null,
-  "kv_cache_dtype": "auto",
-  "enable_lora": false,
-  "enable_prefix_caching": false,
-  "enforce_eager": false,
-  "disable_custom_all_reduce": true
-}
-```
+??? Output
+
+    ```json
+    {
+      "uuid": "fbe880e9-084d-4cab-a395-8984c50f1109",
+      "provider": "GCP",
+      "num_cpu": 24,
+      "cpu_type": "Intel(R) Xeon(R) CPU @ 2.20GHz",
+      "cpu_family_model_stepping": "6,85,7",
+      "total_memory": 101261135872,
+      "architecture": "x86_64",
+      "platform": "Linux-5.10.0-28-cloud-amd64-x86_64-with-glibc2.31",
+      "gpu_count": 2,
+      "gpu_type": "NVIDIA L4",
+      "gpu_memory_per_device": 23580639232,
+      "model_architecture": "OPTForCausalLM",
+      "vllm_version": "0.3.2+cu123",
+      "context": "LLM_CLASS",
+      "log_time": 1711663373492490000,
+      "source": "production",
+      "dtype": "torch.float16",
+      "tensor_parallel_size": 1,
+      "block_size": 16,
+      "gpu_memory_utilization": 0.9,
+      "quantization": null,
+      "kv_cache_dtype": "auto",
+      "enable_lora": false,
+      "enable_prefix_caching": false,
+      "enforce_eager": false,
+      "disable_custom_all_reduce": true
+    }
+    ```

 You can preview the collected data by running the following command: