[doc] Fold long code blocks to improve readability (#19926)

Signed-off-by: reidliu41 <reid201711@gmail.com>
Co-authored-by: reidliu41 <reid201711@gmail.com>

Commit f17aec0d authored Jun 23, 2025 by Reid; committed by GitHub Jun 23, 2025
10 changed files
--- a/docs/getting_started/installation/gpu/rocm.inc.md
+++ b/docs/getting_started/installation/gpu/rocm.inc.md
@@ -90,6 +90,8 @@ Currently, there are no pre-built ROCm wheels.

 4. Build vLLM. For example, vLLM on ROCM 6.3 can be built with the following steps:

+    ??? Commands
+
        ```bash
        pip install --upgrade pip

@@ -201,8 +203,10 @@ DOCKER_BUILDKIT=1 docker build \

 To run the above docker image `vllm-rocm`, use the below command:

-```console
-docker run -it \
+??? Command
+
+    ```console
+    docker run -it \
    --network=host \
    --group-add=video \
    --ipc=host \
@@ -213,7 +217,7 @@ docker run -it \
    -v <path/to/model>:/app/model \
    vllm-rocm \
    bash
-```
+    ```

 Where the `<path/to/model>` is the location where the model is stored, for example, the weights for llama2 or llama3 models.


--- a/docs/getting_started/installation/intel_gaudi.md
+++ b/docs/getting_started/installation/intel_gaudi.md
@@ -200,7 +200,7 @@ INFO 08-01 21:37:59 hpu_model_runner.py:509] Generated 48 decode buckets: [(1, 1

 `min` determines the lowest value of the bucket. `step` determines the interval between buckets, and `max` determines the upper bound of the bucket. Furthermore, interval between `min` and `step` has special handling -- `min` gets multiplied by consecutive powers of two, until `step` gets reached. We call this the ramp-up phase and it is used for handling lower batch sizes with minimum wastage, while allowing larger padding on larger batch sizes.

-Example (with ramp-up)
+Example (with ramp-up):

 ```text
 min = 2, step = 32, max = 64
@@ -209,7 +209,7 @@ min = 2, step = 32, max = 64
 => buckets = ramp_up + stable => (2, 4, 8, 16, 32, 64)
 ```

-Example (without ramp-up)
+Example (without ramp-up):

 ```text
 min = 128, step = 128, max = 512
@@ -232,19 +232,21 @@ As an example, if a request of 3 sequences, with max sequence length of 412 come

 Warmup is an optional, but highly recommended step occurring before vLLM server starts listening. It executes a forward pass for each bucket with dummy data. The goal is to pre-compile all graphs and not incur any graph compilation overheads within bucket boundaries during server runtime. Each warmup step is logged during vLLM startup:

-```text
-INFO 08-01 22:26:47 hpu_model_runner.py:1066] [Warmup][Prompt][1/24] batch_size:4 seq_len:1024 free_mem:79.16 GiB
-INFO 08-01 22:26:47 hpu_model_runner.py:1066] [Warmup][Prompt][2/24] batch_size:4 seq_len:896 free_mem:55.43 GiB
-INFO 08-01 22:26:48 hpu_model_runner.py:1066] [Warmup][Prompt][3/24] batch_size:4 seq_len:768 free_mem:55.43 GiB
-...
-INFO 08-01 22:26:59 hpu_model_runner.py:1066] [Warmup][Prompt][24/24] batch_size:1 seq_len:128 free_mem:55.43 GiB
-INFO 08-01 22:27:00 hpu_model_runner.py:1066] [Warmup][Decode][1/48] batch_size:4 seq_len:2048 free_mem:55.43 GiB
-INFO 08-01 22:27:00 hpu_model_runner.py:1066] [Warmup][Decode][2/48] batch_size:4 seq_len:1920 free_mem:55.43 GiB
-INFO 08-01 22:27:01 hpu_model_runner.py:1066] [Warmup][Decode][3/48] batch_size:4 seq_len:1792 free_mem:55.43 GiB
-...
-INFO 08-01 22:27:16 hpu_model_runner.py:1066] [Warmup][Decode][47/48] batch_size:2 seq_len:128 free_mem:55.43 GiB
-INFO 08-01 22:27:16 hpu_model_runner.py:1066] [Warmup][Decode][48/48] batch_size:1 seq_len:128 free_mem:55.43 GiB
-```
+??? Logs
+
+    ```text
+    INFO 08-01 22:26:47 hpu_model_runner.py:1066] [Warmup][Prompt][1/24] batch_size:4 seq_len:1024 free_mem:79.16 GiB
+    INFO 08-01 22:26:47 hpu_model_runner.py:1066] [Warmup][Prompt][2/24] batch_size:4 seq_len:896 free_mem:55.43 GiB
+    INFO 08-01 22:26:48 hpu_model_runner.py:1066] [Warmup][Prompt][3/24] batch_size:4 seq_len:768 free_mem:55.43 GiB
+    ...
+    INFO 08-01 22:26:59 hpu_model_runner.py:1066] [Warmup][Prompt][24/24] batch_size:1 seq_len:128 free_mem:55.43 GiB
+    INFO 08-01 22:27:00 hpu_model_runner.py:1066] [Warmup][Decode][1/48] batch_size:4 seq_len:2048 free_mem:55.43 GiB
+    INFO 08-01 22:27:00 hpu_model_runner.py:1066] [Warmup][Decode][2/48] batch_size:4 seq_len:1920 free_mem:55.43 GiB
+    INFO 08-01 22:27:01 hpu_model_runner.py:1066] [Warmup][Decode][3/48] batch_size:4 seq_len:1792 free_mem:55.43 GiB
+    ...
+    INFO 08-01 22:27:16 hpu_model_runner.py:1066] [Warmup][Decode][47/48] batch_size:2 seq_len:128 free_mem:55.43 GiB
+    INFO 08-01 22:27:16 hpu_model_runner.py:1066] [Warmup][Decode][48/48] batch_size:1 seq_len:128 free_mem:55.43 GiB
+    ```

 This example uses the same buckets as in the [Bucketing Mechanism][gaudi-bucketing-mechanism] section. Each output line corresponds to execution of a single bucket. When bucket is executed for the first time, its graph is compiled and can be reused later on, skipping further graph compilations.
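
For reference, the ramp-up rule quoted in the bucketing hunk above can be sketched in a few lines of Python. This is a minimal illustration, not the code in `hpu_model_runner.py`: the helper name `make_buckets` is hypothetical, and clipping to `max` is an assumption inferred from the logged bucket lists (batch sizes 1, 2, 4 for `bs:[1, 32, 4]`).

```python
def make_buckets(bmin: int, step: int, bmax: int) -> list[int]:
    """Hypothetical helper: ramp-up by powers of two, then fixed steps."""
    ramp_up = []
    value = bmin
    # Ramp-up phase: multiply min by consecutive powers of two until step is reached.
    while value < step:
        ramp_up.append(value)
        value *= 2
    # Stable phase: multiples of step up to and including max.
    stable = list(range(step, bmax + 1, step))
    # Assumption: buckets above max are dropped (matches the logged configs).
    return [b for b in ramp_up + stable if b <= bmax]


def bucket_grid(bs_cfg, seq_cfg):
    """Cartesian product of batch-size buckets and sequence-length buckets."""
    return [(b, s) for b in make_buckets(*bs_cfg) for s in make_buckets(*seq_cfg)]


print(make_buckets(2, 32, 64))
print(len(bucket_grid((1, 32, 4), (128, 128, 1024))))
```

With the prompt configuration from the logs (`bs:[1, 32, 4]`, `seq:[128, 128, 1024]`) the sketch yields 24 (batch size, sequence length) pairs, the same count as the logged `[Warmup][Prompt][1/24]` steps.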

@@ -279,37 +281,39 @@ When there's large amount of requests pending, vLLM scheduler will attempt to fi

 Each described step is logged by vLLM server, as follows (negative values correspond to memory being released):

-```text
-INFO 08-02 17:37:44 hpu_model_runner.py:493] Prompt bucket config (min, step, max_warmup) bs:[1, 32, 4], seq:[128, 128, 1024]
-INFO 08-02 17:37:44 hpu_model_runner.py:499] Generated 24 prompt buckets: [(1, 128), (1, 256), (1, 384), (1, 512), (1, 640), (1, 768), (1, 896), (1, 1024), (2, 128), (2, 256), (2, 384), (2, 512), (2, 640), (2, 768), (2, 896), (2, 1024), (4, 128), (4, 256), (4, 384), (4, 512), (4, 640), (4, 768), (4, 896), (4, 1024)]
-INFO 08-02 17:37:44 hpu_model_runner.py:504] Decode bucket config (min, step, max_warmup) bs:[1, 128, 4], seq:[128, 128, 2048]
-INFO 08-02 17:37:44 hpu_model_runner.py:509] Generated 48 decode buckets: [(1, 128), (1, 256), (1, 384), (1, 512), (1, 640), (1, 768), (1, 896), (1, 1024), (1, 1152), (1, 1280), (1, 1408), (1, 1536), (1, 1664), (1, 1792), (1, 1920), (1, 2048), (2, 128), (2, 256), (2, 384), (2, 512), (2, 640), (2, 768), (2, 896), (2, 1024), (2, 1152), (2, 1280), (2, 1408), (2, 1536), (2, 1664), (2, 1792), (2, 1920), (2, 2048), (4, 128), (4, 256), (4, 384), (4, 512), (4, 640), (4, 768), (4, 896), (4, 1024), (4, 1152), (4, 1280), (4, 1408), (4, 1536), (4, 1664), (4, 1792), (4, 1920), (4, 2048)]
-INFO 08-02 17:37:52 hpu_model_runner.py:430] Pre-loading model weights on hpu:0 took 14.97 GiB of device memory (14.97 GiB/94.62 GiB used) and 2.95 GiB of host memory (475.2 GiB/1007 GiB used)
-INFO 08-02 17:37:52 hpu_model_runner.py:438] Wrapping in HPU Graph took 0 B of device memory (14.97 GiB/94.62 GiB used) and -252 KiB of host memory (475.2 GiB/1007 GiB used)
-INFO 08-02 17:37:52 hpu_model_runner.py:442] Loading model weights took in total 14.97 GiB of device memory (14.97 GiB/94.62 GiB used) and 2.95 GiB of host memory (475.2 GiB/1007 GiB used)
-INFO 08-02 17:37:54 hpu_worker.py:134] Model profiling run took 504 MiB of device memory (15.46 GiB/94.62 GiB used) and 180.9 MiB of host memory (475.4 GiB/1007 GiB used)
-INFO 08-02 17:37:54 hpu_worker.py:158] Free device memory: 79.16 GiB, 39.58 GiB usable (gpu_memory_utilization=0.5), 15.83 GiB reserved for HPUGraphs (VLLM_GRAPH_RESERVED_MEM=0.4), 23.75 GiB reserved for KV cache
-INFO 08-02 17:37:54 hpu_executor.py:85] # HPU blocks: 1519, # CPU blocks: 0
-INFO 08-02 17:37:54 hpu_worker.py:190] Initializing cache engine took 23.73 GiB of device memory (39.2 GiB/94.62 GiB used) and -1.238 MiB of host memory (475.4 GiB/1007 GiB used)
-INFO 08-02 17:37:54 hpu_model_runner.py:1066] [Warmup][Prompt][1/24] batch_size:4 seq_len:1024 free_mem:55.43 GiB
-...
-INFO 08-02 17:38:22 hpu_model_runner.py:1066] [Warmup][Decode][48/48] batch_size:1 seq_len:128 free_mem:55.43 GiB
-INFO 08-02 17:38:22 hpu_model_runner.py:1159] Using 15.85 GiB/55.43 GiB of free device memory for HPUGraphs, 7.923 GiB for prompt and 7.923 GiB for decode (VLLM_GRAPH_PROMPT_RATIO=0.3)
-INFO 08-02 17:38:22 hpu_model_runner.py:1066] [Warmup][Graph/Prompt][1/24] batch_size:1 seq_len:128 free_mem:55.43 GiB
-...
-INFO 08-02 17:38:26 hpu_model_runner.py:1066] [Warmup][Graph/Prompt][11/24] batch_size:1 seq_len:896 free_mem:48.77 GiB
-INFO 08-02 17:38:27 hpu_model_runner.py:1066] [Warmup][Graph/Decode][1/48] batch_size:4 seq_len:128 free_mem:47.51 GiB
-...
-INFO 08-02 17:38:41 hpu_model_runner.py:1066] [Warmup][Graph/Decode][48/48] batch_size:1 seq_len:2048 free_mem:47.35 GiB
-INFO 08-02 17:38:41 hpu_model_runner.py:1066] [Warmup][Graph/Prompt][12/24] batch_size:4 seq_len:256 free_mem:47.35 GiB
-INFO 08-02 17:38:42 hpu_model_runner.py:1066] [Warmup][Graph/Prompt][13/24] batch_size:2 seq_len:512 free_mem:45.91 GiB
-INFO 08-02 17:38:42 hpu_model_runner.py:1066] [Warmup][Graph/Prompt][14/24] batch_size:1 seq_len:1024 free_mem:44.48 GiB
-INFO 08-02 17:38:43 hpu_model_runner.py:1066] [Warmup][Graph/Prompt][15/24] batch_size:2 seq_len:640 free_mem:43.03 GiB
-INFO 08-02 17:38:43 hpu_model_runner.py:1128] Graph/Prompt captured:15 (62.5%) used_mem:14.03 GiB buckets:[(1, 128), (1, 256), (1, 384), (1, 512), (1, 640), (1, 768), (1, 896), (1, 1024), (2, 128), (2, 256), (2, 384), (2, 512), (2, 640), (4, 128), (4, 256)]
-INFO 08-02 17:38:43 hpu_model_runner.py:1128] Graph/Decode captured:48 (100.0%) used_mem:161.9 MiB buckets:[(1, 128), (1, 256), (1, 384), (1, 512), (1, 640), (1, 768), (1, 896), (1, 1024), (1, 1152), (1, 1280), (1, 1408), (1, 1536), (1, 1664), (1, 1792), (1, 1920), (1, 2048), (2, 128), (2, 256), (2, 384), (2, 512), (2, 640), (2, 768), (2, 896), (2, 1024), (2, 1152), (2, 1280), (2, 1408), (2, 1536), (2, 1664), (2, 1792), (2, 1920), (2, 2048), (4, 128), (4, 256), (4, 384), (4, 512), (4, 640), (4, 768), (4, 896), (4, 1024), (4, 1152), (4, 1280), (4, 1408), (4, 1536), (4, 1664), (4, 1792), (4, 1920), (4, 2048)]
-INFO 08-02 17:38:43 hpu_model_runner.py:1206] Warmup finished in 49 secs, allocated 14.19 GiB of device memory
-INFO 08-02 17:38:43 hpu_executor.py:91] init_cache_engine took 37.92 GiB of device memory (53.39 GiB/94.62 GiB used) and 57.86 MiB of host memory (475.4 GiB/1007 GiB used)
-```
+??? Logs
+
+    ```text
+    INFO 08-02 17:37:44 hpu_model_runner.py:493] Prompt bucket config (min, step, max_warmup) bs:[1, 32, 4], seq:[128, 128, 1024]
+    INFO 08-02 17:37:44 hpu_model_runner.py:499] Generated 24 prompt buckets: [(1, 128), (1, 256), (1, 384), (1, 512), (1, 640), (1, 768), (1, 896), (1, 1024), (2, 128), (2, 256), (2, 384), (2, 512), (2, 640), (2, 768), (2, 896), (2, 1024), (4, 128), (4, 256), (4, 384), (4, 512), (4, 640), (4, 768), (4, 896), (4, 1024)]
+    INFO 08-02 17:37:44 hpu_model_runner.py:504] Decode bucket config (min, step, max_warmup) bs:[1, 128, 4], seq:[128, 128, 2048]
+    INFO 08-02 17:37:44 hpu_model_runner.py:509] Generated 48 decode buckets: [(1, 128), (1, 256), (1, 384), (1, 512), (1, 640), (1, 768), (1, 896), (1, 1024), (1, 1152), (1, 1280), (1, 1408), (1, 1536), (1, 1664), (1, 1792), (1, 1920), (1, 2048), (2, 128), (2, 256), (2, 384), (2, 512), (2, 640), (2, 768), (2, 896), (2, 1024), (2, 1152), (2, 1280), (2, 1408), (2, 1536), (2, 1664), (2, 1792), (2, 1920), (2, 2048), (4, 128), (4, 256), (4, 384), (4, 512), (4, 640), (4, 768), (4, 896), (4, 1024), (4, 1152), (4, 1280), (4, 1408), (4, 1536), (4, 1664), (4, 1792), (4, 1920), (4, 2048)]
+    INFO 08-02 17:37:52 hpu_model_runner.py:430] Pre-loading model weights on hpu:0 took 14.97 GiB of device memory (14.97 GiB/94.62 GiB used) and 2.95 GiB of host memory (475.2 GiB/1007 GiB used)
+    INFO 08-02 17:37:52 hpu_model_runner.py:438] Wrapping in HPU Graph took 0 B of device memory (14.97 GiB/94.62 GiB used) and -252 KiB of host memory (475.2 GiB/1007 GiB used)
+    INFO 08-02 17:37:52 hpu_model_runner.py:442] Loading model weights took in total 14.97 GiB of device memory (14.97 GiB/94.62 GiB used) and 2.95 GiB of host memory (475.2 GiB/1007 GiB used)
+    INFO 08-02 17:37:54 hpu_worker.py:134] Model profiling run took 504 MiB of device memory (15.46 GiB/94.62 GiB used) and 180.9 MiB of host memory (475.4 GiB/1007 GiB used)
+    INFO 08-02 17:37:54 hpu_worker.py:158] Free device memory: 79.16 GiB, 39.58 GiB usable (gpu_memory_utilization=0.5), 15.83 GiB reserved for HPUGraphs (VLLM_GRAPH_RESERVED_MEM=0.4), 23.75 GiB reserved for KV cache
+    INFO 08-02 17:37:54 hpu_executor.py:85] # HPU blocks: 1519, # CPU blocks: 0
+    INFO 08-02 17:37:54 hpu_worker.py:190] Initializing cache engine took 23.73 GiB of device memory (39.2 GiB/94.62 GiB used) and -1.238 MiB of host memory (475.4 GiB/1007 GiB used)
+    INFO 08-02 17:37:54 hpu_model_runner.py:1066] [Warmup][Prompt][1/24] batch_size:4 seq_len:1024 free_mem:55.43 GiB
+    ...
+    INFO 08-02 17:38:22 hpu_model_runner.py:1066] [Warmup][Decode][48/48] batch_size:1 seq_len:128 free_mem:55.43 GiB
+    INFO 08-02 17:38:22 hpu_model_runner.py:1159] Using 15.85 GiB/55.43 GiB of free device memory for HPUGraphs, 7.923 GiB for prompt and 7.923 GiB for decode (VLLM_GRAPH_PROMPT_RATIO=0.3)
+    INFO 08-02 17:38:22 hpu_model_runner.py:1066] [Warmup][Graph/Prompt][1/24] batch_size:1 seq_len:128 free_mem:55.43 GiB
+    ...
+    INFO 08-02 17:38:26 hpu_model_runner.py:1066] [Warmup][Graph/Prompt][11/24] batch_size:1 seq_len:896 free_mem:48.77 GiB
+    INFO 08-02 17:38:27 hpu_model_runner.py:1066] [Warmup][Graph/Decode][1/48] batch_size:4 seq_len:128 free_mem:47.51 GiB
+    ...
+    INFO 08-02 17:38:41 hpu_model_runner.py:1066] [Warmup][Graph/Decode][48/48] batch_size:1 seq_len:2048 free_mem:47.35 GiB
+    INFO 08-02 17:38:41 hpu_model_runner.py:1066] [Warmup][Graph/Prompt][12/24] batch_size:4 seq_len:256 free_mem:47.35 GiB
+    INFO 08-02 17:38:42 hpu_model_runner.py:1066] [Warmup][Graph/Prompt][13/24] batch_size:2 seq_len:512 free_mem:45.91 GiB
+    INFO 08-02 17:38:42 hpu_model_runner.py:1066] [Warmup][Graph/Prompt][14/24] batch_size:1 seq_len:1024 free_mem:44.48 GiB
+    INFO 08-02 17:38:43 hpu_model_runner.py:1066] [Warmup][Graph/Prompt][15/24] batch_size:2 seq_len:640 free_mem:43.03 GiB
+    INFO 08-02 17:38:43 hpu_model_runner.py:1128] Graph/Prompt captured:15 (62.5%) used_mem:14.03 GiB buckets:[(1, 128), (1, 256), (1, 384), (1, 512), (1, 640), (1, 768), (1, 896), (1, 1024), (2, 128), (2, 256), (2, 384), (2, 512), (2, 640), (4, 128), (4, 256)]
+    INFO 08-02 17:38:43 hpu_model_runner.py:1128] Graph/Decode captured:48 (100.0%) used_mem:161.9 MiB buckets:[(1, 128), (1, 256), (1, 384), (1, 512), (1, 640), (1, 768), (1, 896), (1, 1024), (1, 1152), (1, 1280), (1, 1408), (1, 1536), (1, 1664), (1, 1792), (1, 1920), (1, 2048), (2, 128), (2, 256), (2, 384), (2, 512), (2, 640), (2, 768), (2, 896), (2, 1024), (2, 1152), (2, 1280), (2, 1408), (2, 1536), (2, 1664), (2, 1792), (2, 1920), (2, 2048), (4, 128), (4, 256), (4, 384), (4, 512), (4, 640), (4, 768), (4, 896), (4, 1024), (4, 1152), (4, 1280), (4, 1408), (4, 1536), (4, 1664), (4, 1792), (4, 1920), (4, 2048)]
+    INFO 08-02 17:38:43 hpu_model_runner.py:1206] Warmup finished in 49 secs, allocated 14.19 GiB of device memory
+    INFO 08-02 17:38:43 hpu_executor.py:91] init_cache_engine took 37.92 GiB of device memory (53.39 GiB/94.62 GiB used) and 57.86 MiB of host memory (475.4 GiB/1007 GiB used)
+    ```
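
The figures in the `hpu_worker.py:158` log line above are consistent with a simple split of free device memory. The sketch below only reproduces that arithmetic from the logged values; the function name `split_memory` is hypothetical and the formula is inferred from the numbers in the log, not taken from the vLLM source.

```python
def split_memory(free_gib: float, utilization: float, graph_fraction: float):
    """Hypothetical helper reproducing the logged memory budget.

    utilization    ~ gpu_memory_utilization (0.5 in the log)
    graph_fraction ~ VLLM_GRAPH_RESERVED_MEM (0.4 in the log)
    """
    usable = free_gib * utilization        # share vLLM is allowed to use
    graphs = usable * graph_fraction       # reserved for HPUGraphs
    kv_cache = usable - graphs             # remainder goes to the KV cache
    return usable, graphs, kv_cache


usable, graphs, kv_cache = split_memory(79.16, 0.5, 0.4)
print(f"usable={usable:.2f} GiB graphs={graphs:.2f} GiB kv_cache={kv_cache:.2f} GiB")
```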

 ### Recommended vLLM Parameters


--- a/docs/getting_started/quickstart.md
+++ b/docs/getting_started/quickstart.md
@@ -147,20 +147,22 @@ curl http://localhost:8000/v1/completions \

 Since this server is compatible with OpenAI API, you can use it as a drop-in replacement for any applications using OpenAI API. For example, another way to query the server is via the `openai` Python package:

-```python
-from openai import OpenAI
+??? Code
+
+    ```python
+    from openai import OpenAI

-# Modify OpenAI's API key and API base to use vLLM's API server.
-openai_api_key = "EMPTY"
-openai_api_base = "http://localhost:8000/v1"
-client = OpenAI(
+    # Modify OpenAI's API key and API base to use vLLM's API server.
+    openai_api_key = "EMPTY"
+    openai_api_base = "http://localhost:8000/v1"
+    client = OpenAI(
        api_key=openai_api_key,
        base_url=openai_api_base,
-)
-completion = client.completions.create(model="Qwen/Qwen2.5-1.5B-Instruct",
+    )
+    completion = client.completions.create(model="Qwen/Qwen2.5-1.5B-Instruct",
                                        prompt="San Francisco is a")
-print("Completion result:", completion)
-```
+    print("Completion result:", completion)
+    ```

 A more detailed client example can be found here: <gh-file:examples/online_serving/openai_completion_client.py>

@@ -184,26 +186,28 @@ curl http://localhost:8000/v1/chat/completions \

 Alternatively, you can use the `openai` Python package:

-```python
-from openai import OpenAI
-# Set OpenAI's API key and API base to use vLLM's API server.
-openai_api_key = "EMPTY"
-openai_api_base = "http://localhost:8000/v1"
+??? Code
+
+    ```python
+    from openai import OpenAI
+    # Set OpenAI's API key and API base to use vLLM's API server.
+    openai_api_key = "EMPTY"
+    openai_api_base = "http://localhost:8000/v1"

-client = OpenAI(
+    client = OpenAI(
        api_key=openai_api_key,
        base_url=openai_api_base,
-)
+    )

-chat_response = client.chat.completions.create(
+    chat_response = client.chat.completions.create(
        model="Qwen/Qwen2.5-1.5B-Instruct",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Tell me a joke."},
        ]
-)
-print("Chat response:", chat_response)
-```
+    )
+    print("Chat response:", chat_response)
+    ```

 ## On Attention Backends


--- a/docs/models/generative_models.md
+++ b/docs/models/generative_models.md
@@ -85,11 +85,13 @@ and automatically applies the model's [chat template](https://huggingface.co/doc
    In general, only instruction-tuned models have a chat template.
    Base models may perform poorly as they are not trained to respond to the chat conversation.

-```python
-from vllm import LLM
+??? Code
+
+    ```python
+    from vllm import LLM

-llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")
-conversation = [
+    llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")
+    conversation = [
        {
            "role": "system",
            "content": "You are a helpful assistant"
@@ -106,14 +108,14 @@ conversation = [
            "role": "user",
            "content": "Write an essay about the importance of higher education.",
        },
-]
-outputs = llm.chat(conversation)
+    ]
+    outputs = llm.chat(conversation)

-for output in outputs:
+    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
-```
+    ```

 A code example can be found here: <gh-file:examples/offline_inference/basic/chat.py>


--- a/docs/models/supported_models.md
+++ b/docs/models/supported_models.md
@@ -70,7 +70,10 @@ To make your model compatible with the Transformers backend, it needs:
 2. `MyAttention` must use `ALL_ATTENTION_FUNCTIONS` to call attention.
 3. `MyModel` must contain `_supports_attention_backend = True`.

-```python title="modeling_my_model.py"
+<details>
+<summary>modeling_my_model.py</summary>
+
+```python

 from transformers import PreTrainedModel
 from torch import nn
@@ -93,6 +96,8 @@ class MyModel(PreTrainedModel):
    _supports_attention_backend = True
 ```

+</details>
+
 Here is what happens in the background when this model is loaded:

 1. The config is loaded.
@@ -103,7 +108,10 @@ That's it!

 For your model to be compatible with vLLM's tensor parallel and/or pipeline parallel features, you must add `base_model_tp_plan` and/or `base_model_pp_plan` to your model's config class:

-```python title="configuration_my_model.py"
+<details>
+<summary>configuration_my_model.py</summary>
+
+```python

 from transformers import PretrainedConfig

@@ -123,6 +131,8 @@ class MyConfig(PretrainedConfig):
    }
 ```

+</details>
+
 - `base_model_tp_plan` is a `dict` that maps fully qualified layer name patterns to tensor parallel styles (currently only `"colwise"` and `"rowwise"` are supported).
 - `base_model_pp_plan` is a `dict` that maps direct child layer names to `tuple`s of `list`s of `str`s:
    * You only need to do this for layers which are not present on all pipeline stages
@@ -198,6 +208,9 @@ huggingface-cli scan-cache --dir ~/.cache/huggingface/hub

 Use the Hugging Face CLI to interactively [delete downloaded model](https://huggingface.co/docs/huggingface_hub/guides/manage-cache#clean-your-cache) from the cache:

+<details>
+<summary>Commands</summary>
+
 ```console
 # The `delete-cache` command requires extra dependencies to work with the TUI.
 # Please run `pip install huggingface_hub[cli]` to install them.
@@ -224,6 +237,8 @@ Start deletion.
 Done. Deleted 1 repo(s) and 0 revision(s) for a total of 438.9M.
 ```

+</details>
+
 #### Using a proxy

 Here are some tips for loading/downloading models from Hugging Face using a proxy:
@@ -601,6 +616,8 @@ Specified using `--task generate`.

    For the best results, we recommend using the following dependency versions (tested on A10 and L40):

+    ??? Dependency versions
+
        ```text
        # Core vLLM-compatible dependencies with Molmo accuracy setup (tested on L40)
        torch==2.5.1

--- a/docs/serving/integrations/langchain.md
+++ b/docs/serving/integrations/langchain.md
@@ -13,19 +13,21 @@ pip install langchain langchain_community -q

 To run inference on a single or multiple GPUs, use `VLLM` class from `langchain`.

-```python
-from langchain_community.llms import VLLM
+??? Code

-llm = VLLM(model="mosaicml/mpt-7b",
+    ```python
+    from langchain_community.llms import VLLM
+
+    llm = VLLM(model="mosaicml/mpt-7b",
            trust_remote_code=True,  # mandatory for hf models
            max_new_tokens=128,
            top_k=10,
            top_p=0.95,
            temperature=0.8,
            # tensor_parallel_size=... # for distributed inference
-)
+    )

-print(llm("What is the capital of France ?"))
-```
+    print(llm("What is the capital of France ?"))
+    ```

 Please refer to this [Tutorial](https://python.langchain.com/docs/integrations/llms/vllm) for more details.
--- a/docs/serving/openai_compatible_server.md
+++ b/docs/serving/openai_compatible_server.md
@@ -15,22 +15,24 @@ vllm serve NousResearch/Meta-Llama-3-8B-Instruct \

 To call the server, in your preferred text editor, create a script that uses an HTTP client. Include any messages that you want to send to the model. Then run that script. Below is an example script using the [official OpenAI Python client](https://github.com/openai/openai-python).

-```python
-from openai import OpenAI
-client = OpenAI(
+??? Code
+
+    ```python
+    from openai import OpenAI
+    client = OpenAI(
        base_url="http://localhost:8000/v1",
        api_key="token-abc123",
-)
+    )

-completion = client.chat.completions.create(
+    completion = client.chat.completions.create(
        model="NousResearch/Meta-Llama-3-8B-Instruct",
        messages=[
            {"role": "user", "content": "Hello!"}
        ]
-)
+    )

-print(completion.choices[0].message)
-```
+    print(completion.choices[0].message)
+    ```

 !!! tip
    vLLM supports some parameters that are not supported by OpenAI, `top_k` for example.
@@ -147,8 +149,10 @@ with `--enable-request-id-headers`.
 > rather than within the vLLM layer for this reason.
 > See [this PR](https://github.com/vllm-project/vllm/pull/11529) for more details.

-```python
-completion = client.chat.completions.create(
+??? Code
+
+    ```python
+    completion = client.chat.completions.create(
        model="NousResearch/Meta-Llama-3-8B-Instruct",
        messages=[
            {"role": "user", "content": "Classify this sentiment: vLLM is wonderful!"}
@@ -156,18 +160,18 @@ completion = client.chat.completions.create(
        extra_headers={
            "x-request-id": "sentiment-classification-00001",
        }
-)
-print(completion._request_id)
+    )
+    print(completion._request_id)

-completion = client.completions.create(
+    completion = client.completions.create(
        model="NousResearch/Meta-Llama-3-8B-Instruct",
        prompt="A robot may not injure a human being",
        extra_headers={
            "x-request-id": "completion-test",
        }
-)
-print(completion._request_id)
-```
+    )
+    print(completion._request_id)
+    ```

 ## API Reference

@@ -184,15 +188,19 @@ Code example: <gh-file:examples/online_serving/openai_completion_client.py>

 The following [sampling parameters][sampling-params] are supported.

-```python
--8<-- "vllm/entrypoints/openai/protocol.py:completion-sampling-params"
-```
+??? Code
+
+    ```python
+    --8<-- "vllm/entrypoints/openai/protocol.py:completion-sampling-params"
+    ```

 The following extra parameters are supported:

-```python
--8<-- "vllm/entrypoints/openai/protocol.py:completion-extra-params"
-```
+??? Code
+
+    ```python
+    --8<-- "vllm/entrypoints/openai/protocol.py:completion-extra-params"
+    ```

 [](){ #chat-api }

@@ -212,15 +220,19 @@ Code example: <gh-file:examples/online_serving/openai_chat_completion_client.py>

 The following [sampling parameters][sampling-params] are supported.

-```python
--8<-- "vllm/entrypoints/openai/protocol.py:chat-completion-sampling-params"
-```
+??? Code
+
+    ```python
+    --8<-- "vllm/entrypoints/openai/protocol.py:chat-completion-sampling-params"
+    ```

 The following extra parameters are supported:

-```python
--8<-- "vllm/entrypoints/openai/protocol.py:chat-completion-extra-params"
-```
+??? Code
+
+    ```python
+    --8<-- "vllm/entrypoints/openai/protocol.py:chat-completion-extra-params"
+    ```

 [](){ #embeddings-api }

@@ -259,6 +271,8 @@ and passing a list of `messages` in the request. Refer to the examples below for

    Since the request schema is not defined by OpenAI client, we post a request to the server using the lower-level `requests` library:

+    ??? Code
+
        ```python
        import requests

@@ -316,15 +330,19 @@ The following [pooling parameters][pooling-params] are supported.

 The following extra parameters are supported by default:

-```python
--8<-- "vllm/entrypoints/openai/protocol.py:embedding-extra-params"
-```
+??? Code
+
+    ```python
+    --8<-- "vllm/entrypoints/openai/protocol.py:embedding-extra-params"
+    ```

 For chat-like input (i.e. if `messages` is passed), these extra parameters are supported instead:

-```python
--8<-- "vllm/entrypoints/openai/protocol.py:chat-embedding-extra-params"
-```
+??? Code
+
+    ```python
+    --8<-- "vllm/entrypoints/openai/protocol.py:chat-embedding-extra-params"
+    ```

 [](){ #transcriptions-api }

@@ -343,15 +361,19 @@ Code example: <gh-file:examples/online_serving/openai_transcription_client.py>

 The following [sampling parameters][sampling-params] are supported.

-```python
--8<-- "vllm/entrypoints/openai/protocol.py:transcription-sampling-params"
-```
+??? Code
+
+    ```python
+    --8<-- "vllm/entrypoints/openai/protocol.py:transcription-sampling-params"
+    ```

 The following extra parameters are supported:

-```python
--8<-- "vllm/entrypoints/openai/protocol.py:transcription-extra-params"
-```
+??? Code
+
+    ```python
+    --8<-- "vllm/entrypoints/openai/protocol.py:transcription-extra-params"
+    ```

 [](){ #tokenizer-api }

@@ -387,8 +409,6 @@ Code example: <gh-file:examples/online_serving/openai_classification_client.py>

 You can classify multiple texts by passing an array of strings:

-Request:
-
 ```bash
 curl -v "http://127.0.0.1:8000/classify" \
  -H "Content-Type: application/json" \
@@ -401,10 +421,10 @@ curl -v "http://127.0.0.1:8000/classify" \
  }'
 ```

-Response:
+??? Response

-```bash
-{
+    ```bash
+    {
      "id": "classify-7c87cac407b749a6935d8c7ce2a8fba2",
      "object": "list",
      "created": 1745383065,
@@ -435,13 +455,11 @@ Response:
        "completion_tokens": 0,
        "prompt_tokens_details": null
      }
-}
-```
+    }
+    ```

 You can also pass a string directly to the `input` field:

-Request:
-
 ```bash
 curl -v "http://127.0.0.1:8000/classify" \
  -H "Content-Type: application/json" \
@@ -451,10 +469,10 @@ curl -v "http://127.0.0.1:8000/classify" \
  }'
 ```

-Response:
+??? Response

-```bash
-{
+    ```bash
+    {
      "id": "classify-9bf17f2847b046c7b2d5495f4b4f9682",
      "object": "list",
      "created": 1745383213,
@@ -476,8 +494,8 @@ Response:
        "completion_tokens": 0,
        "prompt_tokens_details": null
      }
-}
-```
+    }
+    ```

 #### Extra parameters

@@ -508,8 +526,6 @@ Code example: <gh-file:examples/online_serving/openai_cross_encoder_score.py>

 You can pass a string to both `text_1` and `text_2`, forming a single sentence pair.

-Request:
-
 ```bash
 curl -X 'POST' \
  'http://127.0.0.1:8000/score' \
@@ -523,10 +539,10 @@ curl -X 'POST' \
 }'
 ```

-Response:
+??? Response

-```bash
-{
+    ```bash
+    {
      "id": "score-request-id",
      "object": "list",
      "created": 693447,
@@ -539,8 +555,8 @@ Response:
        }
      ],
      "usage": {}
-}
-```
+    }
+    ```

 #### Batch inference

@@ -548,10 +564,10 @@ You can pass a string to `text_1` and a list to `text_2`, forming multiple sente
 where each pair is built from `text_1` and a string in `text_2`.
 The total number of pairs is `len(text_2)`.

-Request:
+??? Request

-```bash
-curl -X 'POST' \
+    ```bash
+    curl -X 'POST' \
      'http://127.0.0.1:8000/score' \
      -H 'accept: application/json' \
      -H 'Content-Type: application/json' \
@@ -562,13 +578,13 @@ curl -X 'POST' \
        "The capital of Brazil is Brasilia.",
        "The capital of France is Paris."
      ]
-}'
-```
+    }'
+    ```

-Response:
+??? Response

-```bash
-{
+    ```bash
+    {
      "id": "score-request-id",
      "object": "list",
      "created": 693570,
@@ -586,17 +602,17 @@ Response:
        }
      ],
      "usage": {}
-}
-```
+    }
+    ```

 You can pass a list to both `text_1` and `text_2`, forming multiple sentence pairs
 where each pair is built from a string in `text_1` and the corresponding string in `text_2` (similar to `zip()`).
 The total number of pairs is `len(text_2)`.

-Request:
+??? Request

-```bash
-curl -X 'POST' \
+    ```bash
+    curl -X 'POST' \
      'http://127.0.0.1:8000/score' \
      -H 'accept: application/json' \
      -H 'Content-Type: application/json' \
@@ -611,13 +627,13 @@ curl -X 'POST' \
        "The capital of Brazil is Brasilia.",
        "The capital of France is Paris."
      ]
-}'
-```
+    }'
+    ```

-Response:
+??? Response

-```bash
-{
+    ```bash
+    {
      "id": "score-request-id",
      "object": "list",
      "created": 693447,
@@ -635,8 +651,8 @@ Response:
        }
      ],
      "usage": {}
-}
-```
+    }
+    ```
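
The three pairing modes described for the score API (string and string, string and list, list and list) can be summarized with a small sketch. The helper `build_pairs` is hypothetical and runs purely client-side for illustration; rejecting lists of unequal length is an assumption, since the text only says the pairing is similar to `zip()`.

```python
def build_pairs(text_1, text_2):
    """Hypothetical illustration of how sentence pairs are formed."""
    if isinstance(text_1, str) and isinstance(text_2, str):
        # Single sentence pair.
        return [(text_1, text_2)]
    if isinstance(text_1, str):
        # One query against each entry of text_2: len(text_2) pairs.
        return [(text_1, t) for t in text_2]
    if len(text_1) != len(text_2):
        # Assumption: mismatched lengths are treated as an error.
        raise ValueError("text_1 and text_2 must have the same length")
    # Element-wise pairing, similar to zip().
    return list(zip(text_1, text_2))


pairs = build_pairs(
    "What is the capital of France?",
    ["The capital of Brazil is Brasilia.", "The capital of France is Paris."],
)
print(len(pairs))
```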

 #### Extra parameters

@@ -675,10 +691,10 @@ Code example: <gh-file:examples/online_serving/jinaai_rerank_client.py>
 Note that the `top_n` request parameter is optional and will default to the length of the `documents` field.
 Result documents will be sorted by relevance, and the `index` property can be used to determine original order.

-Request:
+??? Request

-```bash
-curl -X 'POST' \
+    ```bash
+    curl -X 'POST' \
      'http://127.0.0.1:8000/v1/rerank' \
      -H 'accept: application/json' \
      -H 'Content-Type: application/json' \
@@ -690,13 +706,13 @@ curl -X 'POST' \
        "The capital of France is Paris.",
        "Horses and cows are both animals"
      ]
-}'
-```
+    }'
+    ```

-Response:
+??? Response

-```bash
-{
+    ```bash
+    {
      "id": "rerank-fae51b2b664d4ed38f5969b612edff77",
      "model": "BAAI/bge-reranker-base",
      "usage": {
@@ -718,8 +734,8 @@ Response:
          "relevance_score": 0.0005860328674316406
        }
      ]
-}
-```
+    }
+    ```

 #### Extra parameters

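The two notes in this file's hunks can be illustrated with a small client-side sketch: the `zip()`-style pairing of `text_1` with `text_2` for `/score`, and recovering the original document order from the `index` property of `/v1/rerank` results. It makes no HTTP calls and is not vLLM's implementation. The helper names `build_pairs` and `restore_original_order`, the question strings, and the relevance scores are hypothetical, and the sketch assumes both lists have the same length.

```python
def build_pairs(text_1, text_2):
    """Pair each string in text_1 with the string at the same position in text_2."""
    # Assumption for this sketch: both inputs are equal-length lists.
    if len(text_1) != len(text_2):
        raise ValueError("text_1 and text_2 must have the same length")
    return list(zip(text_1, text_2))


def restore_original_order(results):
    """Re-sort rerank results by `index`, the position in the `documents` list."""
    return sorted(results, key=lambda item: item["index"])


# Illustrative inputs; the questions are made up for this example.
pairs = build_pairs(
    ["What is the capital of Brazil?", "What is the capital of France?"],
    ["The capital of Brazil is Brasilia.", "The capital of France is Paris."],
)
print(len(pairs))

# Illustrative results, listed most relevant first with made-up scores.
ranked = [
    {"index": 1, "relevance_score": 0.9},
    {"index": 0, "relevance_score": 0.1},
]
print([item["index"] for item in restore_original_order(ranked)])
```

With equal-length lists, the number of pairs matches `len(text_2)`, consistent with the note on the score endpoint.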

--- a/docs/usage/metrics.md
+++ b/docs/usage/metrics.md
@@ -12,28 +12,32 @@ vllm serve unsloth/Llama-3.2-1B-Instruct

 Then query the endpoint to get the latest metrics from the server:

-```console
-$ curl http://0.0.0.0:8000/metrics
-
-# HELP vllm:iteration_tokens_total Histogram of number of tokens per engine_step.
-# TYPE vllm:iteration_tokens_total histogram
-vllm:iteration_tokens_total_sum{model_name="unsloth/Llama-3.2-1B-Instruct"} 0.0
-vllm:iteration_tokens_total_bucket{le="1.0",model_name="unsloth/Llama-3.2-1B-Instruct"} 3.0
-vllm:iteration_tokens_total_bucket{le="8.0",model_name="unsloth/Llama-3.2-1B-Instruct"} 3.0
-vllm:iteration_tokens_total_bucket{le="16.0",model_name="unsloth/Llama-3.2-1B-Instruct"} 3.0
-vllm:iteration_tokens_total_bucket{le="32.0",model_name="unsloth/Llama-3.2-1B-Instruct"} 3.0
-vllm:iteration_tokens_total_bucket{le="64.0",model_name="unsloth/Llama-3.2-1B-Instruct"} 3.0
-vllm:iteration_tokens_total_bucket{le="128.0",model_name="unsloth/Llama-3.2-1B-Instruct"} 3.0
-vllm:iteration_tokens_total_bucket{le="256.0",model_name="unsloth/Llama-3.2-1B-Instruct"} 3.0
-vllm:iteration_tokens_total_bucket{le="512.0",model_name="unsloth/Llama-3.2-1B-Instruct"} 3.0
-...
-```
+??? Output
+
+    ```console
+    $ curl http://0.0.0.0:8000/metrics
+
+    # HELP vllm:iteration_tokens_total Histogram of number of tokens per engine_step.
+    # TYPE vllm:iteration_tokens_total histogram
+    vllm:iteration_tokens_total_sum{model_name="unsloth/Llama-3.2-1B-Instruct"} 0.0
+    vllm:iteration_tokens_total_bucket{le="1.0",model_name="unsloth/Llama-3.2-1B-Instruct"} 3.0
+    vllm:iteration_tokens_total_bucket{le="8.0",model_name="unsloth/Llama-3.2-1B-Instruct"} 3.0
+    vllm:iteration_tokens_total_bucket{le="16.0",model_name="unsloth/Llama-3.2-1B-Instruct"} 3.0
+    vllm:iteration_tokens_total_bucket{le="32.0",model_name="unsloth/Llama-3.2-1B-Instruct"} 3.0
+    vllm:iteration_tokens_total_bucket{le="64.0",model_name="unsloth/Llama-3.2-1B-Instruct"} 3.0
+    vllm:iteration_tokens_total_bucket{le="128.0",model_name="unsloth/Llama-3.2-1B-Instruct"} 3.0
+    vllm:iteration_tokens_total_bucket{le="256.0",model_name="unsloth/Llama-3.2-1B-Instruct"} 3.0
+    vllm:iteration_tokens_total_bucket{le="512.0",model_name="unsloth/Llama-3.2-1B-Instruct"} 3.0
+    ...
+    ```

 The following metrics are exposed:

-```python
--8<-- "vllm/engine/metrics.py:metrics-definitions"
-```
+??? Code
+
+    ```python
+    --8<-- "vllm/engine/metrics.py:metrics-definitions"
+    ```

 Note: when metrics are deprecated in version `X.Y`, they are hidden in version `X.Y+1`
 but can be re-enabled using the `--show-hidden-metrics-for-version=X.Y` escape hatch,

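As a rough illustration of how the `/metrics` output shown in this file could be consumed, the sketch below parses a few of the sample lines into `(name, labels, value)` tuples using only the standard library. The function name `parse_metrics` and the regular expressions are assumptions made for this example, not part of vLLM. The regexes only cover simple lines like those in the sample, so a dedicated Prometheus parser is likely the safer choice for real use.

```python
import re

# A few lines copied from the sample output in the docs.
SAMPLE = """\
# HELP vllm:iteration_tokens_total Histogram of number of tokens per engine_step.
# TYPE vllm:iteration_tokens_total histogram
vllm:iteration_tokens_total_sum{model_name="unsloth/Llama-3.2-1B-Instruct"} 0.0
vllm:iteration_tokens_total_bucket{le="1.0",model_name="unsloth/Llama-3.2-1B-Instruct"} 3.0
vllm:iteration_tokens_total_bucket{le="8.0",model_name="unsloth/Llama-3.2-1B-Instruct"} 3.0
"""

# Hypothetical patterns: metric name, optional {labels}, then a numeric value.
LINE_RE = re.compile(r'^(?P<name>[^{\s]+)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)$')
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_metrics(text):
    """Return a list of (name, labels, value) tuples, skipping comment lines."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = LINE_RE.match(line)
        if match is None:
            continue  # ignore anything this simple pattern cannot handle
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


for name, labels, value in parse_metrics(SAMPLE):
    print(name, labels.get("le"), value)
```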
--- a/docs/usage/troubleshooting.md
+++ b/docs/usage/troubleshooting.md
@@ -60,68 +60,70 @@ To identify the particular CUDA operation that causes the error, you can add `--

 If GPU/CPU communication cannot be established, you can use the following Python script and follow the instructions below to confirm whether the GPU/CPU communication is working correctly.

-```python
-# Test PyTorch NCCL
-import torch
-import torch.distributed as dist
-dist.init_process_group(backend="nccl")
-local_rank = dist.get_rank() % torch.cuda.device_count()
-torch.cuda.set_device(local_rank)
-data = torch.FloatTensor([1,] * 128).to("cuda")
-dist.all_reduce(data, op=dist.ReduceOp.SUM)
-torch.cuda.synchronize()
-value = data.mean().item()
-world_size = dist.get_world_size()
-assert value == world_size, f"Expected {world_size}, got {value}"
-
-print("PyTorch NCCL is successful!")
-
-# Test PyTorch GLOO
-gloo_group = dist.new_group(ranks=list(range(world_size)), backend="gloo")
-cpu_data = torch.FloatTensor([1,] * 128)
-dist.all_reduce(cpu_data, op=dist.ReduceOp.SUM, group=gloo_group)
-value = cpu_data.mean().item()
-assert value == world_size, f"Expected {world_size}, got {value}"
-
-print("PyTorch GLOO is successful!")
-
-if world_size <= 1:
+??? Code
+
+    ```python
+    # Test PyTorch NCCL
+    import torch
+    import torch.distributed as dist
+    dist.init_process_group(backend="nccl")
+    local_rank = dist.get_rank() % torch.cuda.device_count()
+    torch.cuda.set_device(local_rank)
+    data = torch.FloatTensor([1,] * 128).to("cuda")
+    dist.all_reduce(data, op=dist.ReduceOp.SUM)
+    torch.cuda.synchronize()
+    value = data.mean().item()
+    world_size = dist.get_world_size()
+    assert value == world_size, f"Expected {world_size}, got {value}"
+
+    print("PyTorch NCCL is successful!")
+
+    # Test PyTorch GLOO
+    gloo_group = dist.new_group(ranks=list(range(world_size)), backend="gloo")
+    cpu_data = torch.FloatTensor([1,] * 128)
+    dist.all_reduce(cpu_data, op=dist.ReduceOp.SUM, group=gloo_group)
+    value = cpu_data.mean().item()
+    assert value == world_size, f"Expected {world_size}, got {value}"
+
+    print("PyTorch GLOO is successful!")
+
+    if world_size <= 1:
        exit()

-# Test vLLM NCCL, with cuda graph
-from vllm.distributed.device_communicators.pynccl import PyNcclCommunicator
+    # Test vLLM NCCL, with cuda graph
+    from vllm.distributed.device_communicators.pynccl import PyNcclCommunicator

-pynccl = PyNcclCommunicator(group=gloo_group, device=local_rank)
-# pynccl is enabled by default for 0.6.5+,
-# but for 0.6.4 and below, we need to enable it manually.
-# keep the code for backward compatibility when because people
-# prefer to read the latest documentation.
-pynccl.disabled = False
+    pynccl = PyNcclCommunicator(group=gloo_group, device=local_rank)
+    # pynccl is enabled by default for 0.6.5+,
+    # but for 0.6.4 and below, we need to enable it manually.
+    # Keep this line for backward compatibility, because people
+    # prefer to read the latest documentation.
+    pynccl.disabled = False

-s = torch.cuda.Stream()
-with torch.cuda.stream(s):
+    s = torch.cuda.Stream()
+    with torch.cuda.stream(s):
        data.fill_(1)
        out = pynccl.all_reduce(data, stream=s)
        value = out.mean().item()
        assert value == world_size, f"Expected {world_size}, got {value}"

-print("vLLM NCCL is successful!")
+    print("vLLM NCCL is successful!")

-g = torch.cuda.CUDAGraph()
-with torch.cuda.graph(cuda_graph=g, stream=s):
+    g = torch.cuda.CUDAGraph()
+    with torch.cuda.graph(cuda_graph=g, stream=s):
        out = pynccl.all_reduce(data, stream=torch.cuda.current_stream())

-data.fill_(1)
-g.replay()
-torch.cuda.current_stream().synchronize()
-value = out.mean().item()
-assert value == world_size, f"Expected {world_size}, got {value}"
+    data.fill_(1)
+    g.replay()
+    torch.cuda.current_stream().synchronize()
+    value = out.mean().item()
+    assert value == world_size, f"Expected {world_size}, got {value}"

-print("vLLM NCCL with cuda graph is successful!")
+    print("vLLM NCCL with cuda graph is successful!")

-dist.destroy_process_group(gloo_group)
-dist.destroy_process_group()
-```
+    dist.destroy_process_group(gloo_group)
+    dist.destroy_process_group()
+    ```

 If you are testing with a single node, adjust `--nproc-per-node` to the number of GPUs you want to use:

@@ -165,8 +167,10 @@ WARNING 12-11 14:50:37 multiproc_worker_utils.py:281] CUDA was previously

 or an error from Python that looks like this:

-```console
-RuntimeError:
+??? Logs
+
+    ```console
+    RuntimeError:
            An attempt has been made to start a new process before the
            current process has finished its bootstrapping phase.

@@ -183,7 +187,7 @@ RuntimeError:

            To fix this issue, refer to the "Safe importing of main module"
            section in https://docs.python.org/3/library/multiprocessing.html
-```
+    ```

 then you must update your Python code to guard usage of `vllm` behind a `if
 __name__ == '__main__':` block. For example, instead of this:
@@ -207,20 +211,22 @@ if __name__ == '__main__':

 vLLM heavily depends on `torch.compile` to optimize the model for better performance, which introduces the dependency on the `torch.compile` functionality and the `triton` library. By default, we use `torch.compile` to [optimize some functions](https://github.com/vllm-project/vllm/pull/10406) in the model. Before running vLLM, you can check if `torch.compile` is working as expected by running the following script:

-```python
-import torch
+??? Code

-@torch.compile
-def f(x):
+    ```python
+    import torch
+
+    @torch.compile
+    def f(x):
        # a simple function to test torch.compile
        x = x + 1
        x = x * 2
        x = x.sin()
        return x

-x = torch.randn(4, 4).cuda()
-print(f(x))
-```
+    x = torch.randn(4, 4).cuda()
+    print(f(x))
+    ```

 If it raises errors from `torch/_inductor` directory, usually it means you have a custom `triton` library that is not compatible with the version of PyTorch you are using. See [this issue](https://github.com/vllm-project/vllm/issues/12219) for example.


--- a/docs/usage/usage_stats.md
+++ b/docs/usage/usage_stats.md
@@ -10,8 +10,10 @@ The list of data collected by the latest version of vLLM can be found here: <gh-

 Here is an example as of v0.4.0:

-```json
-{
+??? Output
+
+    ```json
+    {
      "uuid": "fbe880e9-084d-4cab-a395-8984c50f1109",
      "provider": "GCP",
      "num_cpu": 24,
@@ -38,8 +40,8 @@ Here is an example as of v0.4.0:
      "enable_prefix_caching": false,
      "enforce_eager": false,
      "disable_custom_all_reduce": true
-}
-```
+    }
+    ```

 You can preview the collected data by running the following command: