[doc] Fold long code blocks to improve readability (#19926)

Signed-off-by: reidliu41 <reid201711@gmail.com>
Co-authored-by: reidliu41 <reid201711@gmail.com>

Unverified commit f17aec0d, authored Jun 23, 2025 by Reid; committed by GitHub on Jun 23, 2025
20 changed files
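
The hunks in this commit repeat one transformation: a top-level fenced code block is placed under a `??? <Title>` line and indented by four spaces so that it renders collapsed. The sketch below shows that transformation in isolation. It is not the tooling used for this commit (nothing here indicates whether the edits were scripted). The function name `fold_long_blocks`, the `min_lines` threshold, and the default title are hypothetical, and the sketch assumes the docs build enables a collapsible-block extension such as the PyMdown Extensions Details syntax.

```python
import re

# An opening or closing fence at column 0: three or more backticks,
# optionally followed by an info string such as "python" or "bash".
FENCE = re.compile(r"^(`{3,})[^`]*$")


def fold_long_blocks(markdown: str, min_lines: int = 10, title: str = "Code") -> str:
    """Wrap top-level fenced blocks with >= min_lines body lines in `??? title`.

    `min_lines` and `title` are illustrative parameters; the commit itself
    does not state a length threshold.
    """
    lines = markdown.split("\n")
    out: list[str] = []
    i = 0
    while i < len(lines):
        match = FENCE.match(lines[i])
        if match is None:
            out.append(lines[i])
            i += 1
            continue

        # Look for the closing fence: the same backtick run with no info string.
        fence = match.group(1)
        j = i + 1
        while j < len(lines) and lines[j].rstrip() != fence:
            j += 1
        if j == len(lines):
            # Unterminated fence: copy the remainder unchanged.
            out.extend(lines[i:])
            break

        block = lines[i : j + 1]
        body_len = j - i - 1
        if body_len >= min_lines:
            out.append(f"??? {title}")
            out.append("")
            # Indent by four spaces so the block belongs to the collapsible.
            out.extend(("    " + line) if line else "" for line in block)
        else:
            out.extend(block)
        i = j + 1
    return "\n".join(out)
```

Because only fences at column 0 are matched, blocks that are already indented (inside a list item or an existing `???` block) are left untouched, so running the function twice gives the same result as running it once.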
--- a/docs/ci/update_pytorch_version.md
+++ b/docs/ci/update_pytorch_version.md
@@ -91,7 +91,7 @@ source to unblock the update process.
 ### FlashInfer
 Here is how to build and install it from source with torch2.7.0+cu128 in vLLM [Dockerfile](https://github.com/vllm-project/vllm/blob/27bebcd89792d5c4b08af7a65095759526f2f9e1/docker/Dockerfile#L259-L271):
-```
+```bash
 export TORCH_CUDA_ARCH_LIST='7.5 8.0 8.9 9.0 10.0+PTX'
 export FLASHINFER_ENABLE_SM90=1
 uv pip install --system --no-build-isolation "git+https://github.com/flashinfer-ai/flashinfer@v0.2.6.post1"
@@ -105,14 +105,14 @@ team if you want to get the package published there.
 ### xFormers
 Similar to FlashInfer, here is how to build and install xFormers from source:
-```
+```bash
 export TORCH_CUDA_ARCH_LIST='7.0 7.5 8.0 8.9 9.0 10.0+PTX'
 MAX_JOBS=16 uv pip install --system --no-build-isolation "git+https://github.com/facebookresearch/xformers@v0.0.30"
 ```
 ### Mamba
-```
+```bash
 uv pip install --system --no-build-isolation "git+https://github.com/state-spaces/mamba@v2.2.4"
 ```

--- a/docs/cli/README.md
+++ b/docs/cli/README.md
@@ -16,35 +16,33 @@ vllm {chat,complete,serve,bench,collect-env,run-batch}
 Start the vLLM OpenAI Compatible API server.
-Examples:
+??? Examples
-```bash
+    ```bash
-# Start with a model
+    # Start with a model
-vllm serve meta-llama/Llama-2-7b-hf
+    vllm serve meta-llama/Llama-2-7b-hf
-# Specify the port
+    # Specify the port
-vllm serve meta-llama/Llama-2-7b-hf --port 8100
+    vllm serve meta-llama/Llama-2-7b-hf --port 8100
-# Check with --help for more options
+    # Check with --help for more options
-# To list all groups
+    # To list all groups
-vllm serve --help=listgroup
+    vllm serve --help=listgroup
-# To view a argument group
+    # To view a argument group
-vllm serve --help=ModelConfig
+    vllm serve --help=ModelConfig
-# To view a single argument
+    # To view a single argument
-vllm serve --help=max-num-seqs
+    vllm serve --help=max-num-seqs
-# To search by keyword
+    # To search by keyword
-vllm serve --help=max
+    vllm serve --help=max
-```
+    ```
 ## chat
 Generate chat completions via the running API server.
-Examples:
 ```bash
 # Directly connect to localhost API without arguments
 vllm chat
@@ -60,8 +58,6 @@ vllm chat --quick "hi"
 Generate text completions based on the given prompt via the running API server.
-Examples:
 ```bash
 # Directly connect to localhost API without arguments
 vllm complete
@@ -73,6 +69,8 @@ vllm complete --url http://{vllm-serve-host}:{vllm-serve-port}/v1
 vllm complete --quick "The future of AI is"
 ```
+</details>
 ## bench
 Run benchmark tests for latency online serving throughput and offline inference throughput.
@@ -89,8 +87,6 @@ vllm bench {latency, serve, throughput}
 Benchmark the latency of a single batch of requests.
-Example:
 ```bash
 vllm bench latency \
    --model meta-llama/Llama-3.2-1B-Instruct \
@@ -104,8 +100,6 @@ vllm bench latency \
 Benchmark the online serving throughput.
-Example:
 ```bash
 vllm bench serve \
    --model meta-llama/Llama-3.2-1B-Instruct \
@@ -120,8 +114,6 @@ vllm bench serve \
 Benchmark offline inference throughput.
-Example:
 ```bash
 vllm bench throughput \
    --model meta-llama/Llama-3.2-1B-Instruct \
@@ -143,7 +135,8 @@ vllm collect-env
 Run batch prompts and write results to file.
-Examples:
+<details>
+<summary>Examples</summary>
 ```bash
 # Running with a local file
@@ -159,6 +152,8 @@ vllm run-batch \
    --model meta-llama/Meta-Llama-3-8B-Instruct
 ```
+</details>
 ## More Help
 For detailed options of any subcommand, use:

--- a/docs/configuration/conserving_memory.md
+++ b/docs/configuration/conserving_memory.md
@@ -57,19 +57,21 @@ By default, we optimize model inference using CUDA graphs which take up extra me
 You can adjust `compilation_config` to achieve a better balance between inference speed and memory usage:
-```python
+??? Code
-from vllm import LLM
-from vllm.config import CompilationConfig, CompilationLevel
+    ```python
+    from vllm import LLM
+    from vllm.config import CompilationConfig, CompilationLevel
-llm = LLM(
+    llm = LLM(
        model="meta-llama/Llama-3.1-8B-Instruct",
        compilation_config=CompilationConfig(
            level=CompilationLevel.PIECEWISE,
            # By default, it goes up to max_num_seqs
            cudagraph_capture_sizes=[1, 2, 4, 8, 16],
        ),
-)
+    )
-```
+    ```
 You can disable graph capturing completely via the `enforce_eager` flag:
@@ -127,18 +129,20 @@ reduce the size of the processed multi-modal inputs, which in turn saves memory.
 Here are some examples:
-```python
+??? Code
-from vllm import LLM
-# Available for Qwen2-VL series models
+    ```python
-llm = LLM(model="Qwen/Qwen2.5-VL-3B-Instruct",
+    from vllm import LLM
+    # Available for Qwen2-VL series models
+    llm = LLM(model="Qwen/Qwen2.5-VL-3B-Instruct",
            mm_processor_kwargs={
                "max_pixels": 768 * 768,  # Default is 1280 * 28 * 28
            })
-# Available for InternVL series models
+    # Available for InternVL series models
-llm = LLM(model="OpenGVLab/InternVL2-2B",
+    llm = LLM(model="OpenGVLab/InternVL2-2B",
            mm_processor_kwargs={
                "max_dynamic_patch": 4,  # Default is 12
            })
-```
+    ```
--- a/docs/configuration/env_vars.md
+++ b/docs/configuration/env_vars.md
@@ -7,6 +7,8 @@ vLLM uses the following environment variables to configure the system:
    All environment variables used by vLLM are prefixed with `VLLM_`. **Special care should be taken for Kubernetes users**: please do not name the service as `vllm`, otherwise environment variables set by Kubernetes might conflict with vLLM's environment variables, because [Kubernetes sets environment variables for each service with the capitalized service name as the prefix](https://kubernetes.io/docs/concepts/services-networking/service/#environment-variables).
-```python
+??? Code
--8<-- "vllm/envs.py:env-vars-definition"
-```
+    ```python
+    --8<-- "vllm/envs.py:env-vars-definition"
+    ```
--- a/docs/contributing/README.md
+++ b/docs/contributing/README.md
@@ -93,25 +93,27 @@ For additional features and advanced configurations, refer to the official [MkDo
 ## Testing
-```bash
+??? note "Commands"
-pip install -r requirements/dev.txt
-# Linting, formatting and static type checking
+    ```bash
-pre-commit install --hook-type pre-commit --hook-type commit-msg
+    pip install -r requirements/dev.txt
-# You can manually run pre-commit with
+    # Linting, formatting and static type checking
-pre-commit run --all-files
+    pre-commit install --hook-type pre-commit --hook-type commit-msg
-# To manually run something from CI that does not run
+    # You can manually run pre-commit with
-# locally by default, you can run:
+    pre-commit run --all-files
-pre-commit run mypy-3.9 --hook-stage manual --all-files
-# Unit tests
+    # To manually run something from CI that does not run
-pytest tests/
+    # locally by default, you can run:
+    pre-commit run mypy-3.9 --hook-stage manual --all-files
-# Run tests for a single test file with detailed output
+    # Unit tests
-pytest -s -v tests/test_logger.py
+    pytest tests/
-```
+    # Run tests for a single test file with detailed output
+    pytest -s -v tests/test_logger.py
+    ```
 !!! tip
    Since the <gh-file:docker/Dockerfile> ships with Python 3.12, all tests in CI (except `mypy`) are run with Python 3.12.

--- a/docs/contributing/model/basic.md
+++ b/docs/contributing/model/basic.md
@@ -27,33 +27,35 @@ All vLLM modules within the model must include a `prefix` argument in their cons
 The initialization code should look like this:
-```python
+??? Code
-from torch import nn
-from vllm.config import VllmConfig
+    ```python
-from vllm.attention import Attention
+    from torch import nn
+    from vllm.config import VllmConfig
+    from vllm.attention import Attention
-class MyAttention(nn.Module):
+    class MyAttention(nn.Module):
        def __init__(self, vllm_config: VllmConfig, prefix: str):
            super().__init__()
            self.attn = Attention(prefix=f"{prefix}.attn")
-class MyDecoderLayer(nn.Module):
+    class MyDecoderLayer(nn.Module):
        def __init__(self, vllm_config: VllmConfig, prefix: str):
            super().__init__()
            self.self_attn = MyAttention(prefix=f"{prefix}.self_attn")
-class MyModel(nn.Module):
+    class MyModel(nn.Module):
        def __init__(self, vllm_config: VllmConfig, prefix: str):
            super().__init__()
            self.layers = nn.ModuleList(
                [MyDecoderLayer(vllm_config, prefix=f"{prefix}.layers.{i}") for i in range(vllm_config.model_config.hf_config.num_hidden_layers)]
            )
-class MyModelForCausalLM(nn.Module):
+    class MyModelForCausalLM(nn.Module):
        def __init__(self, vllm_config: VllmConfig, prefix: str = ""):
            super().__init__()
            self.model = MyModel(vllm_config, prefix=f"{prefix}.model")
-```
+    ```
 ### Computation Code

--- a/docs/contributing/model/multimodal.md
+++ b/docs/contributing/model/multimodal.md
@@ -25,6 +25,8 @@ Further update the model as follows:
 - Implement [get_multimodal_embeddings][vllm.model_executor.models.interfaces.SupportsMultiModal.get_multimodal_embeddings] that returns the embeddings from running the multimodal inputs through the multimodal tokenizer of the model. Below we provide a boilerplate of a typical implementation pattern, but feel free to adjust it to your own needs.
+    ??? Code
        ```python
        class YourModelForImage2Seq(nn.Module):
            ...
@@ -53,6 +55,8 @@ Further update the model as follows:
 - Implement [get_input_embeddings][vllm.model_executor.models.interfaces.SupportsMultiModal.get_input_embeddings] to merge `multimodal_embeddings` with text embeddings from the `input_ids`. If input processing for the model is implemented correctly (see sections below), then you can leverage the utility function we provide to easily merge the embeddings.
+    ??? Code
        ```python
        from .utils import merge_multimodal_embeddings
@@ -135,6 +139,8 @@ Assuming that the memory usage increases with the number of tokens, the dummy in
    Looking at the code of HF's `LlavaForConditionalGeneration`:
+    ??? Code
        ```python
        # https://github.com/huggingface/transformers/blob/v4.47.1/src/transformers/models/llava/modeling_llava.py#L530-L544
        n_image_tokens = (input_ids == self.config.image_token_index).sum().item()
@@ -157,6 +163,8 @@ Assuming that the memory usage increases with the number of tokens, the dummy in
    The number of placeholder feature tokens per image is `image_features.shape[1]`.
    `image_features` is calculated inside the `get_image_features` method:
+    ??? Code
        ```python
        # https://github.com/huggingface/transformers/blob/v4.47.1/src/transformers/models/llava/modeling_llava.py#L290-L300
        image_outputs = self.vision_tower(pixel_values, output_hidden_states=True)
@@ -193,6 +201,8 @@ Assuming that the memory usage increases with the number of tokens, the dummy in
    To find the sequence length, we turn to the code of `CLIPVisionEmbeddings`:
+    ??? Code
        ```python
        # https://github.com/huggingface/transformers/blob/v4.47.1/src/transformers/models/clip/modeling_clip.py#L247-L257
        target_dtype = self.patch_embedding.weight.dtype
@@ -218,6 +228,8 @@ Assuming that the memory usage increases with the number of tokens, the dummy in
    Overall, the number of placeholder feature tokens for an image can be calculated as:
+    ??? Code
        ```python
        def get_num_image_tokens(
            self,
@@ -241,6 +253,8 @@ Assuming that the memory usage increases with the number of tokens, the dummy in
    Notice that the number of image tokens doesn't depend on the image width and height.
    We can simply use a dummy `image_size` to calculate the multimodal profiling data:
+    ??? Code
        ```python
        # NOTE: In actuality, this is usually implemented as part of the
        # model's subclass of `BaseProcessingInfo`, but we show it as is
@@ -284,6 +298,8 @@ Assuming that the memory usage increases with the number of tokens, the dummy in
    Looking at the code of HF's `FuyuForCausalLM`:
+    ??? Code
        ```python
        # https://github.com/huggingface/transformers/blob/v4.48.3/src/transformers/models/fuyu/modeling_fuyu.py#L311-L322
        if image_patches is not None and past_key_values is None:
@@ -312,6 +328,8 @@ Assuming that the memory usage increases with the number of tokens, the dummy in
    In `FuyuImageProcessor.preprocess`, the images are resized and padded to the target `FuyuImageProcessor.size`,
    returning the dimensions after resizing (but before padding) as metadata.
+    ??? Code
        ```python
        # https://github.com/huggingface/transformers/blob/v4.48.3/src/transformers/models/fuyu/processing_fuyu.py#L541-L544
        image_encoding = self.image_processor.preprocess(images, **output_kwargs["images_kwargs"])
@@ -348,6 +366,8 @@ Assuming that the memory usage increases with the number of tokens, the dummy in
    In `FuyuImageProcessor.preprocess_with_tokenizer_info`, the images are split into patches based on this metadata:
+    ??? Code
        ```python
        # https://github.com/huggingface/transformers/blob/v4.48.3/src/transformers/models/fuyu/processing_fuyu.py#L417-L425
        model_image_input = self.image_processor.preprocess_with_tokenizer_info(
@@ -384,6 +404,8 @@ Assuming that the memory usage increases with the number of tokens, the dummy in
    The number of patches is in turn defined by `FuyuImageProcessor.get_num_patches`:
+    ??? Code
        ```python
        # https://github.com/huggingface/transformers/blob/v4.48.3/src/transformers/models/fuyu/image_processing_fuyu.py#L552-L562
        patch_size = patch_size if patch_size is not None else self.patch_size
@@ -419,6 +441,8 @@ Assuming that the memory usage increases with the number of tokens, the dummy in
    For the multimodal image profiling data, the logic is very similar to LLaVA:
+    ??? Code
        ```python
        def get_dummy_mm_data(
            self,
@@ -455,6 +479,7 @@ return a schema of the tensors outputted by the HF processor that are related to
    The output of `CLIPImageProcessor` is a simple tensor with shape
    `(num_images, num_channels, image_height, image_width)`:
    ```python
    # https://github.com/huggingface/transformers/blob/v4.47.1/src/transformers/models/clip/image_processing_clip.py#L339-L345
    images = [
@@ -505,6 +530,8 @@ return a schema of the tensors outputted by the HF processor that are related to
    In order to support the use of [MultiModalFieldConfig.batched][] like in LLaVA,
    we remove the extra batch dimension by overriding [BaseMultiModalProcessor._call_hf_processor][]:
+    ??? Code
        ```python
        def _call_hf_processor(
            self,
@@ -573,6 +600,8 @@ Each [PromptUpdate][vllm.multimodal.processing.PromptUpdate] instance specifies
    It simply repeats each input `image_token` a number of times equal to the number of placeholder feature tokens (`num_image_tokens`).
    Based on this, we override [_get_prompt_updates][vllm.multimodal.processing.BaseMultiModalProcessor._get_prompt_updates] as follows:
+    ??? Code
        ```python
        def _get_prompt_updates(
            self,
@@ -616,6 +645,8 @@ Each [PromptUpdate][vllm.multimodal.processing.PromptUpdate] instance specifies
    We define a helper function to return `ncols` and `nrows` directly:
+    ??? Code
        ```python
        def get_image_feature_grid_size(
            self,
@@ -644,6 +675,8 @@ Each [PromptUpdate][vllm.multimodal.processing.PromptUpdate] instance specifies
    Based on this, we can initially define our replacement tokens as:
+    ??? Code
        ```python
        def get_replacement(item_idx: int):
            images = mm_items.get_items("image", ImageProcessorItems)
@@ -662,6 +695,8 @@ Each [PromptUpdate][vllm.multimodal.processing.PromptUpdate] instance specifies
    However, this is not entirely correct. After `FuyuImageProcessor.preprocess_with_tokenizer_info` is called,
    a BOS token (`<s>`) is also added to the promopt:
+    ??? Code
        ```python
        # https://github.com/huggingface/transformers/blob/v4.48.3/src/transformers/models/fuyu/processing_fuyu.py#L417-L435
        model_image_input = self.image_processor.preprocess_with_tokenizer_info(
@@ -687,6 +722,8 @@ Each [PromptUpdate][vllm.multimodal.processing.PromptUpdate] instance specifies
    To assign the vision embeddings to only the image tokens, instead of a string
    you can return an instance of [PromptUpdateDetails][vllm.multimodal.processing.PromptUpdateDetails]:
+    ??? Code
        ```python
        hf_config = self.info.get_hf_config()
        bos_token_id = hf_config.bos_token_id  # `<s>`
@@ -712,6 +749,8 @@ Each [PromptUpdate][vllm.multimodal.processing.PromptUpdate] instance specifies
    Finally, noticing that the HF processor removes the `|ENDOFTEXT|` token from the tokenized prompt,
    we can search for it to conduct the replacement at the start of the string:
+    ??? Code
        ```python
        def _get_prompt_updates(
            self,

--- a/docs/contributing/profiling.md
+++ b/docs/contributing/profiling.md
@@ -97,11 +97,11 @@ to manually kill the profiler and generate your `nsys-rep` report.
 You can view these profiles either as summaries in the CLI, using `nsys stats [profile-file]`, or in the GUI by installing Nsight [locally following the directions here](https://developer.nvidia.com/nsight-systems/get-started).
-CLI example:
+??? CLI example
-```bash
+    ```bash
-nsys stats report1.nsys-rep
+    nsys stats report1.nsys-rep
-...
+    ...
    ** CUDA GPU Kernel Summary (cuda_gpu_kern_sum):
    Time (%)  Total Time (ns)  Instances   Avg (ns)     Med (ns)    Min (ns)  Max (ns)   StdDev (ns)                                                  Name                                                
@@ -115,8 +115,8 @@ nsys stats report1.nsys-rep
        2.6      587,283,113     37,824     15,526.7      3,008.0     2,719  2,517,756    139,091.1  std::enable_if<T2>(int)0&&vllm::_typeConvert<T1>::exists, void>::type vllm::fused_add_rms_norm_kern…
        1.9      418,362,605     18,912     22,121.5      3,871.0     3,328  2,523,870    175,248.2  void vllm::rotary_embedding_kernel<c10::BFloat16, (bool)1>(const long *, T1 *, T1 *, const T1 *, in…
        0.7      167,083,069     18,880      8,849.7      2,240.0     1,471  2,499,996    101,436.1  void vllm::reshape_and_cache_flash_kernel<__nv_bfloat16, __nv_bfloat16, (vllm::Fp8KVCacheDataType)0…
-... 
+    ... 
-```
+    ```
 GUI example:

--- a/docs/deployment/docker.md
+++ b/docs/deployment/docker.md
@@ -97,10 +97,12 @@ of PyTorch Nightly and should be considered **experimental**. Using the flag `--
    flags to speed up build process. However, ensure your `max_jobs` is substantially larger than `nvcc_threads` to get the most benefits.
    Keep an eye on memory usage with parallel jobs as it can be substantial (see example below).
-```console
+??? Command
-# Example of building on Nvidia GH200 server. (Memory usage: ~15GB, Build time: ~1475s / ~25 min, Image size: 6.93GB)
-python3 use_existing_torch.py
+    ```console
-DOCKER_BUILDKIT=1 docker build . \
+    # Example of building on Nvidia GH200 server. (Memory usage: ~15GB, Build time: ~1475s / ~25 min, Image size: 6.93GB)
+    python3 use_existing_torch.py
+    DOCKER_BUILDKIT=1 docker build . \
    --file docker/Dockerfile \
    --target vllm-openai \
    --platform "linux/arm64" \
@@ -109,7 +111,7 @@ DOCKER_BUILDKIT=1 docker build . \
    --build-arg nvcc_threads=2 \
    --build-arg torch_cuda_arch_list="9.0 10.0+PTX" \
    --build-arg vllm_fa_cmake_gpu_arches="90-real"
-```
+    ```
 !!! note
    If you are building the `linux/arm64` image on a non-ARM host (e.g., an x86_64 machine), you need to ensure your system is set up for cross-compilation using QEMU. This allows your host machine to emulate ARM64 execution.

--- a/docs/deployment/frameworks/autogen.md
+++ b/docs/deployment/frameworks/autogen.md
@@ -30,14 +30,16 @@ python -m vllm.entrypoints.openai.api_server \
 - Call it with AutoGen:
-```python
+??? Code
-import asyncio
-from autogen_core.models import UserMessage
-from autogen_ext.models.openai import OpenAIChatCompletionClient
-from autogen_core.models import ModelFamily
+    ```python
+    import asyncio
+    from autogen_core.models import UserMessage
+    from autogen_ext.models.openai import OpenAIChatCompletionClient
+    from autogen_core.models import ModelFamily
-async def main() -> None:
+    async def main() -> None:
        # Create a model client
        model_client = OpenAIChatCompletionClient(
            model="mistralai/Mistral-7B-Instruct-v0.2",
@@ -73,8 +75,8 @@ async def main() -> None:
        await model_client.close()
-asyncio.run(main())
+    asyncio.run(main())
-```
+    ```
 For details, see the tutorial:

--- a/docs/deployment/frameworks/cerebrium.md
+++ b/docs/deployment/frameworks/cerebrium.md
@@ -34,12 +34,14 @@ vllm = "latest"
 Next, let us add our code to handle inference for the LLM of your choice (`mistralai/Mistral-7B-Instruct-v0.1` for this example), add the following code to your `main.py`:
-```python
+??? Code
-from vllm import LLM, SamplingParams
-llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.1")
+    ```python
+    from vllm import LLM, SamplingParams
-def run(prompts: list[str], temperature: float = 0.8, top_p: float = 0.95):
+    llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.1")
+    def run(prompts: list[str], temperature: float = 0.8, top_p: float = 0.95):
        sampling_params = SamplingParams(temperature=temperature, top_p=top_p)
        outputs = llm.generate(prompts, sampling_params)
@@ -52,7 +54,7 @@ def run(prompts: list[str], temperature: float = 0.8, top_p: float = 0.95):
            results.append({"prompt": prompt, "generated_text": generated_text})
        return {"results": results}
-```
+    ```
 Then, run the following code to deploy it to the cloud:
@@ -62,8 +64,10 @@ cerebrium deploy
 If successful, you should be returned a CURL command that you can call inference against. Just remember to end the url with the function name you are calling (in our case`/run`)
-```python
+??? Command
-curl -X POST https://api.cortex.cerebrium.ai/v4/p-xxxxxx/vllm/run \
+    ```python
+    curl -X POST https://api.cortex.cerebrium.ai/v4/p-xxxxxx/vllm/run \
    -H 'Content-Type: application/json' \
    -H 'Authorization: <JWT TOKEN>' \
    --data '{
@@ -74,12 +78,14 @@ curl -X POST https://api.cortex.cerebrium.ai/v4/p-xxxxxx/vllm/run \
        "The future of AI is"
    ]
    }'
-```
+    ```
 You should get a response like:
-```python
+??? Response
-{
+    ```python
+    {
        "run_id": "52911756-3066-9ae8-bcc9-d9129d1bd262",
        "result": {
            "result": [
@@ -102,7 +108,7 @@ You should get a response like:
            ]
        },
        "run_time_ms": 152.53663063049316
-}
+    }
-```
+    ```
 You now have an autoscaling endpoint where you only pay for the compute you use!
--- a/docs/deployment/frameworks/dstack.md
+++ b/docs/deployment/frameworks/dstack.md
@@ -26,30 +26,34 @@ dstack init
 Next, to provision a VM instance with LLM of your choice (`NousResearch/Llama-2-7b-chat-hf` for this example), create the following `serve.dstack.yml` file for the dstack `Service`:
-```yaml
+??? Config
-type: service
-python: "3.11"
+    ```yaml
-env:
+    type: service
+    python: "3.11"
+    env:
        - MODEL=NousResearch/Llama-2-7b-chat-hf
-port: 8000
+    port: 8000
-resources:
+    resources:
        gpu: 24GB
-commands:
+    commands:
        - pip install vllm
        - vllm serve $MODEL --port 8000
-model:
+    model:
        format: openai
        type: chat
        name: NousResearch/Llama-2-7b-chat-hf
-```
+    ```
 Then, run the following CLI for provisioning:
-```console
+??? Command
-$ dstack run . -f serve.dstack.yml
+    ```console
+    $ dstack run . -f serve.dstack.yml
-⠸ Getting run plan...
+    ⠸ Getting run plan...
    Configuration  serve.dstack.yml
    Project        deep-diver-main
    User           deep-diver
@@ -66,24 +70,26 @@ $ dstack run . -f serve.dstack.yml
        ...
    Shown 3 of 193 offers, $5.876 max
-Continue? [y/n]: y
+    Continue? [y/n]: y
-⠙ Submitting run...
+    ⠙ Submitting run...
-⠏ Launching spicy-treefrog-1 (pulling)
+    ⠏ Launching spicy-treefrog-1 (pulling)
-spicy-treefrog-1 provisioning completed (running)
+    spicy-treefrog-1 provisioning completed (running)
-Service is published at ...
+    Service is published at ...
-```
+    ```
 After the provisioning, you can interact with the model by using the OpenAI SDK:
-```python
+??? Code
-from openai import OpenAI
+    ```python
+    from openai import OpenAI
-client = OpenAI(
+    client = OpenAI(
        base_url="https://gateway.<gateway domain>",
        api_key="<YOUR-DSTACK-SERVER-ACCESS-TOKEN>"
-)
+    )
-completion = client.chat.completions.create(
+    completion = client.chat.completions.create(
        model="NousResearch/Llama-2-7b-chat-hf",
        messages=[
            {
@@ -91,10 +97,10 @@ completion = client.chat.completions.create(
                "content": "Compose a poem that explains the concept of recursion in programming.",
            }
        ]
-)
+    )
-print(completion.choices[0].message.content)
+    print(completion.choices[0].message.content)
-```
+    ```
 !!! note
    dstack automatically handles authentication on the gateway using dstack's tokens. Meanwhile, if you don't want to configure a gateway, you can provision dstack `Task` instead of `Service`. The `Task` is for development purpose only. If you want to know more about hands-on materials how to serve vLLM using dstack, check out [this repository](https://github.com/dstackai/dstack-examples/tree/main/deployment/vllm)
--- a/docs/deployment/frameworks/haystack.md
+++ b/docs/deployment/frameworks/haystack.md
@@ -27,29 +27,29 @@ vllm serve mistralai/Mistral-7B-Instruct-v0.1
 - Use the `OpenAIGenerator` and `OpenAIChatGenerator` components in Haystack to query the vLLM server.
-```python
+??? Code
-from haystack.components.generators.chat import OpenAIChatGenerator
-from haystack.dataclasses import ChatMessage
-from haystack.utils import Secret
-generator = OpenAIChatGenerator(
+    ```python
+    from haystack.components.generators.chat import OpenAIChatGenerator
+    from haystack.dataclasses import ChatMessage
+    from haystack.utils import Secret
+    generator = OpenAIChatGenerator(
        # for compatibility with the OpenAI API, a placeholder api_key is needed
        api_key=Secret.from_token("VLLM-PLACEHOLDER-API-KEY"),
        model="mistralai/Mistral-7B-Instruct-v0.1",
        api_base_url="http://{your-vLLM-host-ip}:{your-vLLM-host-port}/v1",
        generation_kwargs = {"max_tokens": 512}
-)
+    )
-response = generator.run(
+    response = generator.run(
      messages=[ChatMessage.from_user("Hi. Can you help me plan my next trip to Italy?")]
-)
+    )
-print("-"*30)
-print(response)
-print("-"*30)
-```
-Output e.g.:
+    print("-"*30)
+    print(response)
+    print("-"*30)
+    ```
 ```console
 ------------------------------

--- a/docs/deployment/frameworks/litellm.md
+++ b/docs/deployment/frameworks/litellm.md
@@ -34,21 +34,23 @@ vllm serve qwen/Qwen1.5-0.5B-Chat
 - Call it with litellm:
-```python
+??? Code
-import litellm 
-messages = [{ "content": "Hello, how are you?","role": "user"}]
+    ```python
+    import litellm 
-# hosted_vllm is prefix key word and necessary
+    messages = [{ "content": "Hello, how are you?","role": "user"}]
-response = litellm.completion(
+    # hosted_vllm is prefix key word and necessary
+    response = litellm.completion(
                model="hosted_vllm/qwen/Qwen1.5-0.5B-Chat", # pass the vllm model name
                messages=messages,
                api_base="http://{your-vllm-server-host}:{your-vllm-server-port}/v1",
                temperature=0.2,
                max_tokens=80)
-print(response)
+    print(response)
-```
+    ```
 ### Embeddings

--- a/docs/deployment/frameworks/lws.md
+++ b/docs/deployment/frameworks/lws.md
@@ -17,12 +17,14 @@ vLLM can be deployed with [LWS](https://github.com/kubernetes-sigs/lws) on Kuber
 Deploy the following yaml file `lws.yaml`
-```yaml
+??? Yaml
-apiVersion: leaderworkerset.x-k8s.io/v1
-kind: LeaderWorkerSet
+    ```yaml
-metadata:
+    apiVersion: leaderworkerset.x-k8s.io/v1
+    kind: LeaderWorkerSet
+    metadata:
      name: vllm
-spec:
+    spec:
      replicas: 2
      leaderWorkerTemplate:
        size: 2
@@ -94,12 +96,12 @@ spec:
              emptyDir:
                medium: Memory
                sizeLimit: 15Gi
---
+    ---
-apiVersion: v1
+    apiVersion: v1
-kind: Service
+    kind: Service
-metadata:
+    metadata:
      name: vllm-leader
-spec:
+    spec:
      ports:
        - name: http
          port: 8080
@@ -109,7 +111,7 @@ spec:
        leaderworkerset.sigs.k8s.io/name: vllm
        role: leader
      type: ClusterIP
-```
+    ```
 ```bash
 kubectl apply -f lws.yaml
@@ -175,8 +177,10 @@ curl http://localhost:8080/v1/completions \
 The output should be similar to the following
-```text
+??? Output
-{
+    ```text
+    {
      "id": "cmpl-1bb34faba88b43f9862cfbfb2200949d",
      "object": "text_completion",
      "created": 1715138766,
@@ -195,5 +199,5 @@ The output should be similar to the following
        "total_tokens": 12,
        "completion_tokens": 7
      }
-}
+    }
-```
+    ```
--- a/docs/deployment/frameworks/skypilot.md
+++ b/docs/deployment/frameworks/skypilot.md
@@ -24,19 +24,21 @@ sky check
 See the vLLM SkyPilot YAML for serving, [serving.yaml](https://github.com/skypilot-org/skypilot/blob/master/llm/vllm/serve.yaml).
-```yaml
+??? Yaml
-resources:
+    ```yaml
+    resources:
      accelerators: {L4, A10g, A10, L40, A40, A100, A100-80GB} # We can use cheaper accelerators for 8B model.
      use_spot: True
      disk_size: 512  # Ensure model checkpoints can fit.
      disk_tier: best
      ports: 8081  # Expose to internet traffic.
-envs:
+    envs:
      MODEL_NAME: meta-llama/Meta-Llama-3-8B-Instruct
      HF_TOKEN: <your-huggingface-token>  # Change to your own huggingface token, or use --env to pass.
-setup: |
+    setup: |
      conda create -n vllm python=3.10 -y
      conda activate vllm
@@ -45,7 +47,7 @@ setup: |
      pip install gradio openai
      pip install flash-attn==2.5.7
-run: |
+    run: |
      conda activate vllm
      echo 'Starting vllm api server...'
      python -u -m vllm.entrypoints.openai.api_server \
@@ -65,7 +67,7 @@ run: |
        --port 8811 \
        --model-url http://localhost:8081/v1 \
        --stop-token-ids 128009,128001
-```
+    ```
 Start the serving the Llama-3 8B model on any of the candidate GPUs listed (L4, A10g, ...):
@@ -93,8 +95,10 @@ HF_TOKEN="your-huggingface-token" \
 SkyPilot can scale up the service to multiple service replicas with built-in autoscaling, load-balancing and fault-tolerance. You can do it by adding a services section to the YAML file.
-```yaml
+??? Yaml
-service:
+    ```yaml
+    service:
      replicas: 2
      # An actual request for readiness probe.
      readiness_probe:
@@ -105,13 +109,12 @@ service:
          - role: user
            content: Hello! What is your name?
      max_completion_tokens: 1
-```
+    ```
-<details>
+??? Yaml
-<summary>Click to see the full recipe YAML</summary>
-```yaml
+    ```yaml
-service:
+    service:
      replicas: 2
      # An actual request for readiness probe.
      readiness_probe:
@@ -123,18 +126,18 @@ service:
              content: Hello! What is your name?
          max_completion_tokens: 1
-resources:
+    resources:
      accelerators: {L4, A10g, A10, L40, A40, A100, A100-80GB} # We can use cheaper accelerators for 8B model.
      use_spot: True
      disk_size: 512  # Ensure model checkpoints can fit.
      disk_tier: best
      ports: 8081  # Expose to internet traffic.
-envs:
+    envs:
      MODEL_NAME: meta-llama/Meta-Llama-3-8B-Instruct
      HF_TOKEN: <your-huggingface-token>  # Change to your own huggingface token, or use --env to pass.
-setup: |
+    setup: |
      conda create -n vllm python=3.10 -y
      conda activate vllm
@@ -143,7 +146,7 @@ setup: |
      pip install gradio openai
      pip install flash-attn==2.5.7
-run: |
+    run: |
      conda activate vllm
      echo 'Starting vllm api server...'
      python -u -m vllm.entrypoints.openai.api_server \
@@ -152,9 +155,7 @@ run: |
        --trust-remote-code \
        --tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
        2>&1 | tee api_server.log
-```
+    ```
-</details>
 Start serving the Llama-3 8B model on multiple replicas:
@@ -170,8 +171,7 @@ Wait until the service is ready:
 watch -n10 sky serve status vllm
 ```
-<details>
+Example outputs:
-<summary>Example outputs:</summary>
 ```console
 Services
@@ -184,13 +184,13 @@ vllm          1   1        xx.yy.zz.121  18 mins ago  1x GCP([Spot]{'L4': 1})  R
 vllm          2   1        xx.yy.zz.245  18 mins ago  1x GCP([Spot]{'L4': 1})  READY   us-east4
 ```
-</details>
 After the service is READY, you can find a single endpoint for the service and access the service with the endpoint:
-```console
+??? Commands
-ENDPOINT=$(sky serve status --endpoint 8081 vllm)
-curl -L http://$ENDPOINT/v1/chat/completions \
+    ```bash
+    ENDPOINT=$(sky serve status --endpoint 8081 vllm)
+    curl -L http://$ENDPOINT/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "meta-llama/Meta-Llama-3-8B-Instruct",
@@ -206,7 +206,7 @@ curl -L http://$ENDPOINT/v1/chat/completions \
        ],
        "stop_token_ids": [128009,  128001]
      }'
-```
+    ```
 To enable autoscaling, you could replace the `replicas` with the following configs in `service`:
@@ -220,11 +220,10 @@ service:
 This will scale the service up when the QPS exceeds 2 for each replica.
-<details>
+??? Yaml
-<summary>Click to see the full recipe YAML</summary>
-```yaml
+    ```yaml
-service:
+    service:
      replica_policy:
        min_replicas: 2
        max_replicas: 4
@@ -239,18 +238,18 @@ service:
              content: Hello! What is your name?
          max_completion_tokens: 1
-resources:
+    resources:
      accelerators: {L4, A10g, A10, L40, A40, A100, A100-80GB} # We can use cheaper accelerators for 8B model.
      use_spot: True
      disk_size: 512  # Ensure model checkpoints can fit.
      disk_tier: best
      ports: 8081  # Expose to internet traffic.
-envs:
+    envs:
      MODEL_NAME: meta-llama/Meta-Llama-3-8B-Instruct
      HF_TOKEN: <your-huggingface-token>  # Change to your own huggingface token, or use --env to pass.
-setup: |
+    setup: |
      conda create -n vllm python=3.10 -y
      conda activate vllm
@@ -259,7 +258,7 @@ setup: |
      pip install gradio openai
      pip install flash-attn==2.5.7
-run: |
+    run: |
      conda activate vllm
      echo 'Starting vllm api server...'
      python -u -m vllm.entrypoints.openai.api_server \
@@ -268,9 +267,7 @@ run: |
        --trust-remote-code \
        --tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
        2>&1 | tee api_server.log
-```
+    ```
-</details>
 To update the service with the new config:
@@ -288,25 +285,24 @@ sky serve down vllm
 It is also possible to access the Llama-3 service with a separate GUI frontend, so that user requests sent to the GUI are load-balanced across replicas.
-<details>
+??? Yaml
-<summary>Click to see the full GUI YAML</summary>
-```yaml
+    ```yaml
-envs:
+    envs:
      MODEL_NAME: meta-llama/Meta-Llama-3-8B-Instruct
      ENDPOINT: x.x.x.x:3031 # Address of the API server running vllm.
-resources:
+    resources:
      cpus: 2
-setup: |
+    setup: |
      conda create -n vllm python=3.10 -y
      conda activate vllm
      # Install Gradio for web UI.
      pip install gradio openai
-run: |
+    run: |
      conda activate vllm
      export PATH=$PATH:/sbin
@@ -317,9 +313,7 @@ run: |
        --port 8811 \
        --model-url http://$ENDPOINT/v1 \
        --stop-token-ids 128009,128001 | tee ~/gradio.log
-```
+    ```
-</details>
 1. Start the chat web UI:

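The curl call against the SkyPilot service endpoint in this file can also be expressed in Python. The following is a minimal sketch using only the standard library, not part of the commit. The helper name `build_chat_request`, the `localhost:8081` endpoint, and the message text are assumptions for illustration, since the diff elides the `messages` array used in the docs.

```python
import json
import urllib.request


def build_chat_request(endpoint: str, model: str, user_message: str) -> urllib.request.Request:
    """Build (but do not send) a POST request for /v1/chat/completions."""
    payload = {
        "model": model,
        # Hypothetical message; the diff elides the messages used in the docs.
        "messages": [{"role": "user", "content": user_message}],
        # Same stop token IDs as the curl example.
        "stop_token_ids": [128009, 128001],
    }
    return urllib.request.Request(
        url=f"http://{endpoint}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_chat_request(
    # Placeholder; use the output of `sky serve status --endpoint 8081 vllm`.
    endpoint="localhost:8081",
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    user_message="Hello! What is your name?",
)
print(request.full_url)
```

Once the service is READY, passing `request` to `urllib.request.urlopen` would send it. Building the request separately keeps the payload easy to inspect before any network traffic happens.
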
--- a/docs/deployment/integrations/production-stack.md
+++ b/docs/deployment/integrations/production-stack.md
@@ -60,10 +60,10 @@ And then you can send out a query to the OpenAI-compatible API to check the avai
 curl -o- http://localhost:30080/models
 ```
-Expected output:
+??? Output
-```json
+    ```json
-{
+    {
      "object": "list",
      "data": [
        {
@@ -74,8 +74,8 @@ Expected output:
          "root": null
        }
      ]
-}
+    }
-```
+    ```
 To send an actual request, you can issue a curl request to the OpenAI `/completions` endpoint:
@@ -89,10 +89,10 @@ curl -X POST http://localhost:30080/completions \
  }'
 ```
-Expected output:
+??? Output
-```json
+    ```json
-{
+    {
      "id": "completion-id",
      "object": "text_completion",
      "created": 1737428424,
@@ -104,8 +104,8 @@ Expected output:
          "finish_reason": "length"
        }
      ]
-}
+    }
-```
+    ```
 ### Uninstall
@@ -121,8 +121,10 @@ sudo helm uninstall vllm
 The core vLLM production stack configuration is managed with YAML. Here is the example configuration used in the installation above:
-```yaml
+??? Yaml
-servingEngineSpec:
+    ```yaml
+    servingEngineSpec:
      runtimeClassName: ""
      modelSpec:
      - name: "opt125m"
@@ -137,7 +139,7 @@ servingEngineSpec:
        requestGPU: 1
        pvcStorage: "10Gi"
-```
+    ```
 In this YAML configuration:
 * **`modelSpec`** includes:

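The `/models` response shown in this file is an OpenAI-style list object, so checking which models are available amounts to reading the `id` of each entry under `data`. A minimal sketch follows. The sample body and the `facebook/opt-125m` ID are assumptions, because the diff elides those lines of the expected output, and `list_model_ids` is a hypothetical helper.

```python
import json

# Hypothetical response body shaped like the expected output in the docs;
# the "id" and per-entry "object" values are assumptions.
sample_response = """
{
  "object": "list",
  "data": [
    {"id": "facebook/opt-125m", "object": "model", "root": null}
  ]
}
"""


def list_model_ids(response_text: str) -> list[str]:
    """Return the model IDs from an OpenAI-style model list response."""
    body = json.loads(response_text)
    if body.get("object") != "list":
        raise ValueError(f"unexpected response object: {body.get('object')!r}")
    return [entry["id"] for entry in body.get("data", [])]


model_ids = list_model_ids(sample_response)
print(model_ids)
```
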
--- a/docs/deployment/k8s.md
+++ b/docs/deployment/k8s.md
@@ -29,39 +29,43 @@ Alternatively, you can deploy vLLM to Kubernetes using any of the following:
 First, create a Kubernetes PVC and Secret for downloading and storing the Hugging Face model:
-```bash
+??? Config
-cat <<EOF |kubectl apply -f -
-apiVersion: v1
+    ```bash
-kind: PersistentVolumeClaim
+    cat <<EOF |kubectl apply -f -
-metadata:
+    apiVersion: v1
+    kind: PersistentVolumeClaim
+    metadata:
      name: vllm-models
-spec:
+    spec:
      accessModes:
        - ReadWriteOnce
      volumeMode: Filesystem
      resources:
        requests:
          storage: 50Gi
---
+    ---
-apiVersion: v1
+    apiVersion: v1
-kind: Secret
+    kind: Secret
-metadata:
+    metadata:
      name: hf-token-secret
-type: Opaque
+    type: Opaque
-data:
+    data:
      token: $(HF_TOKEN)
-EOF
+    EOF
-```
+    ```
 Next, start the vLLM server as a Kubernetes Deployment and Service:
-```bash
+??? Config
-cat <<EOF |kubectl apply -f -
-apiVersion: apps/v1
+    ```bash
-kind: Deployment
+    cat <<EOF |kubectl apply -f -
-metadata:
+    apiVersion: apps/v1
+    kind: Deployment
+    metadata:
      name: vllm-server
-spec:
+    spec:
      replicas: 1
      selector:
        matchLabels:
@@ -93,12 +97,12 @@ spec:
          - name: llama-storage
            persistentVolumeClaim:
              claimName: vllm-models
---
+    ---
-apiVersion: v1
+    apiVersion: v1
-kind: Service
+    kind: Service
-metadata:
+    metadata:
      name: vllm-server
-spec:
+    spec:
      selector:
        app.kubernetes.io/name: vllm
      ports:
@@ -106,8 +110,8 @@ spec:
        port: 8000
        targetPort: 8000
      type: ClusterIP
-EOF
+    EOF
-```
+    ```
 We can verify that the vLLM server has started successfully via the logs (this might take a couple of minutes to download the model):
@@ -128,6 +132,9 @@ INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
      PVC is used to store the model cache and it is optional, you can use hostPath or other storage options
+      <details>
+      <summary>Yaml</summary>
      ```yaml
      apiVersion: v1
      kind: PersistentVolumeClaim
@@ -144,6 +151,8 @@ INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
        volumeMode: Filesystem
      ```
+      </details>
      Secret is optional and only required for accessing gated models, you can skip this step if you are not using gated models
      ```yaml
@@ -163,6 +172,9 @@ INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
      NVIDIA GPU:
+      <details>
+      <summary>Yaml</summary>
      ```yaml
      apiVersion: apps/v1
      kind: Deployment
@@ -233,10 +245,15 @@ INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
                periodSeconds: 5
      ```
+      </details>
      AMD GPU:
      You can refer to the `deployment.yaml` below if using AMD ROCm GPU like MI300X.
+      <details>
+      <summary>Yaml</summary>
      ```yaml
      apiVersion: apps/v1
      kind: Deployment
@@ -305,12 +322,17 @@ INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
                mountPath: /dev/shm
      ```
+      </details>
      You can get the full example with steps and sample yaml files from <https://github.com/ROCm/k8s-device-plugin/tree/master/example/vllm-serve>.
 2. Create a Kubernetes Service for vLLM
      Next, create a Kubernetes Service file to expose the `mistral-7b` deployment:
+      <details>
+      <summary>Yaml</summary>
      ```yaml
      apiVersion: v1
      kind: Service
@@ -330,6 +352,8 @@ INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
        type: ClusterIP
      ```
+      </details>
 3. Deploy and Test
      Apply the deployment and service configurations using `kubectl apply -f <filename>`:

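The Secret manifest in this file stores the Hugging Face token under `data`. Kubernetes expects values under `data` to be base64-encoded, while `stringData` accepts plain strings. The sketch below shows one way to produce an encoded manifest. It is an illustration, not part of the commit. The helpers `encode_secret_value` and `render_hf_secret` and the token value are hypothetical.

```python
import base64


def encode_secret_value(plain_text: str) -> str:
    """Base64-encode a string for use under a Secret's `data` field."""
    return base64.b64encode(plain_text.encode("utf-8")).decode("ascii")


def render_hf_secret(token: str, name: str = "hf-token-secret") -> str:
    """Render a Secret manifest with the token stored base64-encoded."""
    return "\n".join([
        "apiVersion: v1",
        "kind: Secret",
        "metadata:",
        f"  name: {name}",
        "type: Opaque",
        "data:",
        f"  token: {encode_secret_value(token)}",
    ])


# "hf_example_token" is a made-up placeholder, not a real token.
manifest = render_hf_secret("hf_example_token")
print(manifest)
```

The rendered text can be piped to `kubectl apply -f -` in the same way as the heredoc in the docs.
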
--- a/docs/deployment/nginx.md
+++ b/docs/deployment/nginx.md
@@ -36,13 +36,15 @@ docker build . -f Dockerfile.nginx --tag nginx-lb
 Create a file named `nginx_conf/nginx.conf`. Note that you can add as many servers as you'd like. In the below example we'll start with two. To add more, add another `server vllmN:8000 max_fails=3 fail_timeout=10000s;` entry to `upstream backend`.
-```console
+??? Config
-upstream backend {
+    ```console
+    upstream backend {
        least_conn;
        server vllm0:8000 max_fails=3 fail_timeout=10000s;
        server vllm1:8000 max_fails=3 fail_timeout=10000s;
-}
+    }
-server {
+    server {
        listen 80;
        location / {
            proxy_pass http://backend;
@@ -51,8 +53,8 @@ server {
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
            proxy_set_header X-Forwarded-Proto $scheme;
        }
-}
+    }
-```
+    ```
 [](){ #nginxloadbalancer-nginx-vllm-container }
@@ -93,10 +95,12 @@ Notes:
 - The below example assumes GPU backend used. If you are using CPU backend, remove `--gpus device=ID`, add `VLLM_CPU_KVCACHE_SPACE` and `VLLM_CPU_OMP_THREADS_BIND` environment variables to the docker run command.
 - Adjust the model name that you want to use in your vLLM servers if you don't want to use `Llama-2-7b-chat-hf`.
-```console
+??? Commands
-mkdir -p ~/.cache/huggingface/hub/
-hf_cache_dir=~/.cache/huggingface/
+    ```console
-docker run \
+    mkdir -p ~/.cache/huggingface/hub/
+    hf_cache_dir=~/.cache/huggingface/
+    docker run \
        -itd \
        --ipc host \
        --network vllm_nginx \
@@ -106,7 +110,7 @@ docker run \
        -p 8081:8000 \
        --name vllm0 vllm \
        --model meta-llama/Llama-2-7b-chat-hf
-docker run \
+    docker run \
        -itd \
        --ipc host \
        --network vllm_nginx \
@@ -116,7 +120,7 @@ docker run \
        -p 8082:8000 \
        --name vllm1 vllm \
        --model meta-llama/Llama-2-7b-chat-hf
-```
+    ```
 !!! note
    If you are behind proxy, you can pass the proxy settings to the docker run command via `-e http_proxy=$http_proxy -e https_proxy=$https_proxy`.

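The Nginx instructions in this file say that more backends can be added by appending another `server vllmN:8000 max_fails=3 fail_timeout=10000s;` entry to `upstream backend`. A small sketch of generating that block for N servers follows. `render_upstream` is a hypothetical helper, and it assumes the containers are named `vllm0`, `vllm1`, and so on, as in the docs.

```python
def render_upstream(num_servers: int, port: int = 8000) -> str:
    """Render an `upstream backend` block with one entry per vLLM server."""
    if num_servers < 1:
        raise ValueError("at least one server is required")
    lines = ["upstream backend {", "    least_conn;"]
    for index in range(num_servers):
        # Same per-server settings as the two entries in the example config.
        lines.append(f"    server vllm{index}:{port} max_fails=3 fail_timeout=10000s;")
    lines.append("}")
    return "\n".join(lines)


upstream_block = render_upstream(3)
print(upstream_block)
```

Each generated entry needs a matching `docker run ... --name vllmN` container on the `vllm_nginx` network, otherwise Nginx cannot resolve the upstream host.
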
--- a/docs/design/arch_overview.md
+++ b/docs/design/arch_overview.md
@@ -22,31 +22,33 @@ server.
 Here is a sample of `LLM` class usage:
-```python
+??? Code
-from vllm import LLM, SamplingParams
-# Define a list of input prompts
+    ```python
-prompts = [
+    from vllm import LLM, SamplingParams
+    # Define a list of input prompts
+    prompts = [
        "Hello, my name is",
        "The capital of France is",
        "The largest ocean is",
-]
+    ]
-# Define sampling parameters
+    # Define sampling parameters
-sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
+    sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
-# Initialize the LLM engine with the OPT-125M model
+    # Initialize the LLM engine with the OPT-125M model
-llm = LLM(model="facebook/opt-125m")
+    llm = LLM(model="facebook/opt-125m")
-# Generate outputs for the input prompts
+    # Generate outputs for the input prompts
-outputs = llm.generate(prompts, sampling_params)
+    outputs = llm.generate(prompts, sampling_params)
-# Print the generated outputs
+    # Print the generated outputs
-for output in outputs:
+    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
-```
+    ```
 More API details can be found in the [Offline Inference](#offline-inference-api) section of the API docs.
@@ -178,6 +180,8 @@ vision-language model.
    To avoid accidentally passing incorrect arguments, the constructor is now keyword-only. This ensures that the constructor will raise an error if old configurations are passed. vLLM developers have already made this change for all models within vLLM. For out-of-tree registered models, developers need to update their models, for example by adding shim code to adapt the old constructor signature to the new one:
+    ??? Code
        ```python
        class MyOldModel(nn.Module):
            def __init__(