[doc] Fold long code blocks to improve readability (#19926)

Signed-off-by: reidliu41 <reid201711@gmail.com>
Co-authored-by: reidliu41 <reid201711@gmail.com>

Commit f17aec0d (unverified), authored Jun 23, 2025 by Reid, committed by GitHub Jun 23, 2025
20 changed files
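
The change repeated across these files follows one pattern: a fenced code block is placed under a collapsible `??? Title` header and indented by four spaces. The sketch below illustrates that transformation only. `fold_code_block` is a hypothetical helper written for this note, not part of the commit or of vLLM, and the blank line after the header is a stylistic assumption.

```python
import textwrap

# Built programmatically so this example does not contain a literal fence.
FENCE = "`" * 3


def fold_code_block(block: str, title: str = "Code") -> str:
    """Wrap a fenced code block in a collapsible `??? title` section.

    Hypothetical helper for illustration: it indents every non-blank line
    of the block by four spaces and puts the `???` header above it.
    """
    indented = textwrap.indent(block.strip("\n"), "    ")
    return f"??? {title}\n\n{indented}\n"


original = f"{FENCE}bash\nvllm serve meta-llama/Llama-2-7b-hf\n{FENCE}"
folded = fold_code_block(original, title="Examples")
print(folded)
```

Short blocks, such as the `vllm chat` and `vllm complete` examples in this diff, are left unfolded, so a helper like this would be applied selectively rather than to every fence.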
--- a/docs/ci/update_pytorch_version.md
+++ b/docs/ci/update_pytorch_version.md
@@ -91,7 +91,7 @@ source to unblock the update process.
 ### FlashInfer
 Here is how to build and install it from source with torch2.7.0+cu128 in vLLM [Dockerfile](https://github.com/vllm-project/vllm/blob/27bebcd89792d5c4b08af7a65095759526f2f9e1/docker/Dockerfile#L259-L271):
-```
+```bash
 export TORCH_CUDA_ARCH_LIST='7.5 8.0 8.9 9.0 10.0+PTX'
 export FLASHINFER_ENABLE_SM90=1
 uv pip install --system --no-build-isolation "git+https://github.com/flashinfer-ai/flashinfer@v0.2.6.post1"
@@ -105,14 +105,14 @@ team if you want to get the package published there.
 ### xFormers
 Similar to FlashInfer, here is how to build and install xFormers from source:
-```
+```bash
 export TORCH_CUDA_ARCH_LIST='7.0 7.5 8.0 8.9 9.0 10.0+PTX'
 MAX_JOBS=16 uv pip install --system --no-build-isolation "git+https://github.com/facebookresearch/xformers@v0.0.30"
 ```
 ### Mamba
-```
+```bash
 uv pip install --system --no-build-isolation "git+https://github.com/state-spaces/mamba@v2.2.4"
 ```

--- a/docs/cli/README.md
+++ b/docs/cli/README.md
@@ -16,35 +16,33 @@ vllm {chat,complete,serve,bench,collect-env,run-batch}
 Start the vLLM OpenAI Compatible API server.
-Examples:
+??? Examples
-```bash
+    ```bash
-# Start with a model
+    # Start with a model
-vllm serve meta-llama/Llama-2-7b-hf
+    vllm serve meta-llama/Llama-2-7b-hf
-# Specify the port
+    # Specify the port
-vllm serve meta-llama/Llama-2-7b-hf --port 8100
+    vllm serve meta-llama/Llama-2-7b-hf --port 8100
-# Check with --help for more options
+    # Check with --help for more options
-# To list all groups
+    # To list all groups
-vllm serve --help=listgroup
+    vllm serve --help=listgroup
-# To view a argument group
+    # To view a argument group
-vllm serve --help=ModelConfig
+    vllm serve --help=ModelConfig
-# To view a single argument
+    # To view a single argument
-vllm serve --help=max-num-seqs
+    vllm serve --help=max-num-seqs
-# To search by keyword
+    # To search by keyword
-vllm serve --help=max
+    vllm serve --help=max
-```
+    ```
 ## chat
 Generate chat completions via the running API server.
-Examples:
 ```bash
 # Directly connect to localhost API without arguments
 vllm chat
@@ -60,8 +58,6 @@ vllm chat --quick "hi"
 Generate text completions based on the given prompt via the running API server.
-Examples:
 ```bash
 # Directly connect to localhost API without arguments
 vllm complete
@@ -73,6 +69,8 @@ vllm complete --url http://{vllm-serve-host}:{vllm-serve-port}/v1
 vllm complete --quick "The future of AI is"
 ```
+</details>
 ## bench
 Run benchmark tests for latency online serving throughput and offline inference throughput.
@@ -89,8 +87,6 @@ vllm bench {latency, serve, throughput}
 Benchmark the latency of a single batch of requests.
-Example:
 ```bash
 vllm bench latency \
    --model meta-llama/Llama-3.2-1B-Instruct \
@@ -104,8 +100,6 @@ vllm bench latency \
 Benchmark the online serving throughput.
-Example:
 ```bash
 vllm bench serve \
    --model meta-llama/Llama-3.2-1B-Instruct \
@@ -120,8 +114,6 @@ vllm bench serve \
 Benchmark offline inference throughput.
-Example:
 ```bash
 vllm bench throughput \
    --model meta-llama/Llama-3.2-1B-Instruct \
@@ -143,7 +135,8 @@ vllm collect-env
 Run batch prompts and write results to file.
-Examples:
+<details>
+<summary>Examples</summary>
 ```bash
 # Running with a local file
@@ -159,6 +152,8 @@ vllm run-batch \
    --model meta-llama/Meta-Llama-3-8B-Instruct
 ```
+</details>
 ## More Help
 For detailed options of any subcommand, use:

--- a/docs/configuration/conserving_memory.md
+++ b/docs/configuration/conserving_memory.md
@@ -57,19 +57,21 @@ By default, we optimize model inference using CUDA graphs which take up extra me
 You can adjust `compilation_config` to achieve a better balance between inference speed and memory usage:
-```python
+??? Code
-from vllm import LLM
-from vllm.config import CompilationConfig, CompilationLevel
+    ```python
+    from vllm import LLM
-llm = LLM(
+    from vllm.config import CompilationConfig, CompilationLevel
-    model="meta-llama/Llama-3.1-8B-Instruct",
-    compilation_config=CompilationConfig(
+    llm = LLM(
-        level=CompilationLevel.PIECEWISE,
+        model="meta-llama/Llama-3.1-8B-Instruct",
-        # By default, it goes up to max_num_seqs
+        compilation_config=CompilationConfig(
-        cudagraph_capture_sizes=[1, 2, 4, 8, 16],
+            level=CompilationLevel.PIECEWISE,
-    ),
+            # By default, it goes up to max_num_seqs
-)
+            cudagraph_capture_sizes=[1, 2, 4, 8, 16],
-```
+        ),
+    )
+    ```
 You can disable graph capturing completely via the `enforce_eager` flag:
@@ -127,18 +129,20 @@ reduce the size of the processed multi-modal inputs, which in turn saves memory.
 Here are some examples:
-```python
+??? Code
-from vllm import LLM
-# Available for Qwen2-VL series models
+    ```python
-llm = LLM(model="Qwen/Qwen2.5-VL-3B-Instruct",
+    from vllm import LLM
-          mm_processor_kwargs={
-              "max_pixels": 768 * 768,  # Default is 1280 * 28 * 28
+    # Available for Qwen2-VL series models
-          })
+    llm = LLM(model="Qwen/Qwen2.5-VL-3B-Instruct",
+            mm_processor_kwargs={
-# Available for InternVL series models
+                "max_pixels": 768 * 768,  # Default is 1280 * 28 * 28
-llm = LLM(model="OpenGVLab/InternVL2-2B",
+            })
-          mm_processor_kwargs={
-              "max_dynamic_patch": 4,  # Default is 12
+    # Available for InternVL series models
-          })
+    llm = LLM(model="OpenGVLab/InternVL2-2B",
-```
+            mm_processor_kwargs={
+                "max_dynamic_patch": 4,  # Default is 12
+            })
+    ```
--- a/docs/configuration/env_vars.md
+++ b/docs/configuration/env_vars.md
@@ -7,6 +7,8 @@ vLLM uses the following environment variables to configure the system:
    All environment variables used by vLLM are prefixed with `VLLM_`. **Special care should be taken for Kubernetes users**: please do not name the service as `vllm`, otherwise environment variables set by Kubernetes might conflict with vLLM's environment variables, because [Kubernetes sets environment variables for each service with the capitalized service name as the prefix](https://kubernetes.io/docs/concepts/services-networking/service/#environment-variables).
-```python
+??? Code
--8<-- "vllm/envs.py:env-vars-definition"
-```
+    ```python
+    --8<-- "vllm/envs.py:env-vars-definition"
+    ```
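
The Kubernetes warning in the context line of this hunk can be made concrete with a small sketch. It lists the variable names Kubernetes injects for a Service, following the naming scheme in the linked Kubernetes documentation (service name upper-cased, dashes turned into underscores), and filters those that fall under the `VLLM_` prefix. `service_env_names` and `colliding_names` are hypothetical helpers, and port 8000 is an assumed example value.

```python
def service_env_names(service_name: str, port: int, proto: str = "tcp") -> list[str]:
    """Names of the environment variables Kubernetes injects for a Service.

    Hypothetical helper: it only reproduces the documented naming scheme,
    it does not talk to a cluster.
    """
    prefix = service_name.upper().replace("-", "_")
    port_key = f"{prefix}_PORT_{port}_{proto.upper()}"
    return [
        f"{prefix}_SERVICE_HOST",
        f"{prefix}_SERVICE_PORT",
        f"{prefix}_PORT",
        port_key,
        f"{port_key}_PROTO",
        f"{port_key}_PORT",
        f"{port_key}_ADDR",
    ]


def colliding_names(service_name: str, port: int) -> list[str]:
    """Injected names that land in the `VLLM_` namespace used by vLLM."""
    return [n for n in service_env_names(service_name, port) if n.startswith("VLLM_")]


print(colliding_names("vllm", 8000))        # every injected name starts with VLLM_
print(colliding_names("llm-server", 8000))  # no names in the VLLM_ namespace
```

Any of the names printed for the `vllm` service could shadow a vLLM setting of the same name, which is why a differently named Service avoids the problem.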
--- a/docs/contributing/README.md
+++ b/docs/contributing/README.md
@@ -93,25 +93,27 @@ For additional features and advanced configurations, refer to the official [MkDo
 ## Testing
-```bash
+??? note "Commands"
-pip install -r requirements/dev.txt
-# Linting, formatting and static type checking
+    ```bash
-pre-commit install --hook-type pre-commit --hook-type commit-msg
+    pip install -r requirements/dev.txt
-# You can manually run pre-commit with
+    # Linting, formatting and static type checking
-pre-commit run --all-files
+    pre-commit install --hook-type pre-commit --hook-type commit-msg
-# To manually run something from CI that does not run
+    # You can manually run pre-commit with
-# locally by default, you can run:
+    pre-commit run --all-files
-pre-commit run mypy-3.9 --hook-stage manual --all-files
-# Unit tests
+    # To manually run something from CI that does not run
-pytest tests/
+    # locally by default, you can run:
+    pre-commit run mypy-3.9 --hook-stage manual --all-files
-# Run tests for a single test file with detailed output
+    # Unit tests
-pytest -s -v tests/test_logger.py
+    pytest tests/
-```
+    # Run tests for a single test file with detailed output
+    pytest -s -v tests/test_logger.py
+    ```
 !!! tip
    Since the <gh-file:docker/Dockerfile> ships with Python 3.12, all tests in CI (except `mypy`) are run with Python 3.12.

--- a/docs/contributing/model/basic.md
+++ b/docs/contributing/model/basic.md
@@ -27,33 +27,35 @@ All vLLM modules within the model must include a `prefix` argument in their cons
 The initialization code should look like this:
-```python
+??? Code
-from torch import nn
-from vllm.config import VllmConfig
+    ```python
-from vllm.attention import Attention
+    from torch import nn
+    from vllm.config import VllmConfig
-class MyAttention(nn.Module):
+    from vllm.attention import Attention
-    def __init__(self, vllm_config: VllmConfig, prefix: str):
-        super().__init__()
+    class MyAttention(nn.Module):
-        self.attn = Attention(prefix=f"{prefix}.attn")
+        def __init__(self, vllm_config: VllmConfig, prefix: str):
+            super().__init__()
-class MyDecoderLayer(nn.Module):
+            self.attn = Attention(prefix=f"{prefix}.attn")
-    def __init__(self, vllm_config: VllmConfig, prefix: str):
-        super().__init__()
+    class MyDecoderLayer(nn.Module):
-        self.self_attn = MyAttention(prefix=f"{prefix}.self_attn")
+        def __init__(self, vllm_config: VllmConfig, prefix: str):
+            super().__init__()
-class MyModel(nn.Module):
+            self.self_attn = MyAttention(prefix=f"{prefix}.self_attn")
-    def __init__(self, vllm_config: VllmConfig, prefix: str):
-        super().__init__()
+    class MyModel(nn.Module):
-        self.layers = nn.ModuleList(
+        def __init__(self, vllm_config: VllmConfig, prefix: str):
-            [MyDecoderLayer(vllm_config, prefix=f"{prefix}.layers.{i}") for i in range(vllm_config.model_config.hf_config.num_hidden_layers)]
+            super().__init__()
-        )
+            self.layers = nn.ModuleList(
+                [MyDecoderLayer(vllm_config, prefix=f"{prefix}.layers.{i}") for i in range(vllm_config.model_config.hf_config.num_hidden_layers)]
-class MyModelForCausalLM(nn.Module):
+            )
-    def __init__(self, vllm_config: VllmConfig, prefix: str = ""):
-        super().__init__()
+    class MyModelForCausalLM(nn.Module):
-        self.model = MyModel(vllm_config, prefix=f"{prefix}.model")
+        def __init__(self, vllm_config: VllmConfig, prefix: str = ""):
-```
+            super().__init__()
+            self.model = MyModel(vllm_config, prefix=f"{prefix}.model")
+    ```
 ### Computation Code

--- a/docs/contributing/model/multimodal.md
+++ b/docs/contributing/model/multimodal.md
--- a/docs/contributing/profiling.md
+++ b/docs/contributing/profiling.md
@@ -97,26 +97,26 @@ to manually kill the profiler and generate your `nsys-rep` report.
 You can view these profiles either as summaries in the CLI, using `nsys stats [profile-file]`, or in the GUI by installing Nsight [locally following the directions here](https://developer.nvidia.com/nsight-systems/get-started).
-CLI example:
+??? CLI example
-```bash
+    ```bash
-nsys stats report1.nsys-rep
+    nsys stats report1.nsys-rep
-...
+    ...
- ** CUDA GPU Kernel Summary (cuda_gpu_kern_sum):
+    ** CUDA GPU Kernel Summary (cuda_gpu_kern_sum):
- Time (%)  Total Time (ns)  Instances   Avg (ns)     Med (ns)    Min (ns)  Max (ns)   StdDev (ns)                                                  Name                                                
+    Time (%)  Total Time (ns)  Instances   Avg (ns)     Med (ns)    Min (ns)  Max (ns)   StdDev (ns)                                                  Name                                                
- --------  ---------------  ---------  -----------  -----------  --------  ---------  -----------  ----------------------------------------------------------------------------------------------------
+    --------  ---------------  ---------  -----------  -----------  --------  ---------  -----------  ----------------------------------------------------------------------------------------------------
-     46.3   10,327,352,338     17,505    589,965.9    144,383.0    27,040  3,126,460    944,263.8  sm90_xmma_gemm_bf16bf16_bf16f32_f32_tn_n_tilesize128x128x64_warpgroupsize1x1x1_execute_segment_k_of…
+        46.3   10,327,352,338     17,505    589,965.9    144,383.0    27,040  3,126,460    944,263.8  sm90_xmma_gemm_bf16bf16_bf16f32_f32_tn_n_tilesize128x128x64_warpgroupsize1x1x1_execute_segment_k_of…
-     14.8    3,305,114,764      5,152    641,520.7    293,408.0   287,296  2,822,716    867,124.9  sm90_xmma_gemm_bf16bf16_bf16f32_f32_tn_n_tilesize256x128x64_warpgroupsize2x1x1_execute_segment_k_of…
+        14.8    3,305,114,764      5,152    641,520.7    293,408.0   287,296  2,822,716    867,124.9  sm90_xmma_gemm_bf16bf16_bf16f32_f32_tn_n_tilesize256x128x64_warpgroupsize2x1x1_execute_segment_k_of…
-     12.1    2,692,284,876     14,280    188,535.4     83,904.0    19,328  2,862,237    497,999.9  sm90_xmma_gemm_bf16bf16_bf16f32_f32_tn_n_tilesize64x128x64_warpgroupsize1x1x1_execute_segment_k_off…
+        12.1    2,692,284,876     14,280    188,535.4     83,904.0    19,328  2,862,237    497,999.9  sm90_xmma_gemm_bf16bf16_bf16f32_f32_tn_n_tilesize64x128x64_warpgroupsize1x1x1_execute_segment_k_off…
-      9.5    2,116,600,578     33,920     62,399.8     21,504.0    15,326  2,532,285    290,954.1  sm90_xmma_gemm_bf16bf16_bf16f32_f32_tn_n_tilesize64x64x64_warpgroupsize1x1x1_execute_segment_k_off_…
+        9.5    2,116,600,578     33,920     62,399.8     21,504.0    15,326  2,532,285    290,954.1  sm90_xmma_gemm_bf16bf16_bf16f32_f32_tn_n_tilesize64x64x64_warpgroupsize1x1x1_execute_segment_k_off_…
-      5.0    1,119,749,165     18,912     59,208.4      9,056.0     6,784  2,578,366    271,581.7  void vllm::act_and_mul_kernel<c10::BFloat16, &vllm::silu_kernel<c10::BFloat16>, (bool)1>(T1 *, cons…
+        5.0    1,119,749,165     18,912     59,208.4      9,056.0     6,784  2,578,366    271,581.7  void vllm::act_and_mul_kernel<c10::BFloat16, &vllm::silu_kernel<c10::BFloat16>, (bool)1>(T1 *, cons…
-      4.1      916,662,515     21,312     43,011.6     19,776.0     8,928  2,586,205    199,790.1  void cutlass::device_kernel<flash::enable_sm90_or_later<flash::FlashAttnFwdSm90<flash::CollectiveMa…
+        4.1      916,662,515     21,312     43,011.6     19,776.0     8,928  2,586,205    199,790.1  void cutlass::device_kernel<flash::enable_sm90_or_later<flash::FlashAttnFwdSm90<flash::CollectiveMa…
-      2.6      587,283,113     37,824     15,526.7      3,008.0     2,719  2,517,756    139,091.1  std::enable_if<T2>(int)0&&vllm::_typeConvert<T1>::exists, void>::type vllm::fused_add_rms_norm_kern…
+        2.6      587,283,113     37,824     15,526.7      3,008.0     2,719  2,517,756    139,091.1  std::enable_if<T2>(int)0&&vllm::_typeConvert<T1>::exists, void>::type vllm::fused_add_rms_norm_kern…
-      1.9      418,362,605     18,912     22,121.5      3,871.0     3,328  2,523,870    175,248.2  void vllm::rotary_embedding_kernel<c10::BFloat16, (bool)1>(const long *, T1 *, T1 *, const T1 *, in…
+        1.9      418,362,605     18,912     22,121.5      3,871.0     3,328  2,523,870    175,248.2  void vllm::rotary_embedding_kernel<c10::BFloat16, (bool)1>(const long *, T1 *, T1 *, const T1 *, in…
-      0.7      167,083,069     18,880      8,849.7      2,240.0     1,471  2,499,996    101,436.1  void vllm::reshape_and_cache_flash_kernel<__nv_bfloat16, __nv_bfloat16, (vllm::Fp8KVCacheDataType)0…
+        0.7      167,083,069     18,880      8,849.7      2,240.0     1,471  2,499,996    101,436.1  void vllm::reshape_and_cache_flash_kernel<__nv_bfloat16, __nv_bfloat16, (vllm::Fp8KVCacheDataType)0…
-... 
+    ... 
-```
+    ```
 GUI example:

--- a/docs/deployment/docker.md
+++ b/docs/deployment/docker.md
@@ -97,19 +97,21 @@ of PyTorch Nightly and should be considered **experimental**. Using the flag `--
    flags to speed up build process. However, ensure your `max_jobs` is substantially larger than `nvcc_threads` to get the most benefits.
    Keep an eye on memory usage with parallel jobs as it can be substantial (see example below).
-```console
+??? Command
-# Example of building on Nvidia GH200 server. (Memory usage: ~15GB, Build time: ~1475s / ~25 min, Image size: 6.93GB)
-python3 use_existing_torch.py
+    ```console
-DOCKER_BUILDKIT=1 docker build . \
+    # Example of building on Nvidia GH200 server. (Memory usage: ~15GB, Build time: ~1475s / ~25 min, Image size: 6.93GB)
-  --file docker/Dockerfile \
+    python3 use_existing_torch.py
-  --target vllm-openai \
+    DOCKER_BUILDKIT=1 docker build . \
-  --platform "linux/arm64" \
+    --file docker/Dockerfile \
-  -t vllm/vllm-gh200-openai:latest \
+    --target vllm-openai \
-  --build-arg max_jobs=66 \
+    --platform "linux/arm64" \
-  --build-arg nvcc_threads=2 \
+    -t vllm/vllm-gh200-openai:latest \
-  --build-arg torch_cuda_arch_list="9.0 10.0+PTX" \
+    --build-arg max_jobs=66 \
-  --build-arg vllm_fa_cmake_gpu_arches="90-real"
+    --build-arg nvcc_threads=2 \
-```
+    --build-arg torch_cuda_arch_list="9.0 10.0+PTX" \
+    --build-arg vllm_fa_cmake_gpu_arches="90-real"
+    ```
 !!! note
    If you are building the `linux/arm64` image on a non-ARM host (e.g., an x86_64 machine), you need to ensure your system is set up for cross-compilation using QEMU. This allows your host machine to emulate ARM64 execution.

--- a/docs/deployment/frameworks/autogen.md
+++ b/docs/deployment/frameworks/autogen.md
@@ -30,51 +30,53 @@ python -m vllm.entrypoints.openai.api_server \
 - Call it with AutoGen:
-```python
+??? Code
-import asyncio
-from autogen_core.models import UserMessage
+    ```python
-from autogen_ext.models.openai import OpenAIChatCompletionClient
+    import asyncio
-from autogen_core.models import ModelFamily
+    from autogen_core.models import UserMessage
+    from autogen_ext.models.openai import OpenAIChatCompletionClient
+    from autogen_core.models import ModelFamily
-async def main() -> None:
-    # Create a model client
-    model_client = OpenAIChatCompletionClient(
+    async def main() -> None:
-        model="mistralai/Mistral-7B-Instruct-v0.2",
+        # Create a model client
-        base_url="http://{your-vllm-host-ip}:{your-vllm-host-port}/v1",
+        model_client = OpenAIChatCompletionClient(
-        api_key="EMPTY",
+            model="mistralai/Mistral-7B-Instruct-v0.2",
-        model_info={
+            base_url="http://{your-vllm-host-ip}:{your-vllm-host-port}/v1",
-            "vision": False,
+            api_key="EMPTY",
-            "function_calling": False,
+            model_info={
-            "json_output": False,
+                "vision": False,
-            "family": ModelFamily.MISTRAL,
+                "function_calling": False,
-            "structured_output": True,
+                "json_output": False,
-        },
+                "family": ModelFamily.MISTRAL,
-    )
+                "structured_output": True,
+            },
-    messages = [UserMessage(content="Write a very short story about a dragon.", source="user")]
+        )
-    # Create a stream.
+        messages = [UserMessage(content="Write a very short story about a dragon.", source="user")]
-    stream = model_client.create_stream(messages=messages)
+        # Create a stream.
-    # Iterate over the stream and print the responses.
+        stream = model_client.create_stream(messages=messages)
-    print("Streamed responses:")
-    async for response in stream:
+        # Iterate over the stream and print the responses.
-        if isinstance(response, str):
+        print("Streamed responses:")
-            # A partial response is a string.
+        async for response in stream:
-            print(response, flush=True, end="")
+            if isinstance(response, str):
-        else:
+                # A partial response is a string.
-            # The last response is a CreateResult object with the complete message.
+                print(response, flush=True, end="")
-            print("\n\n------------\n")
+            else:
-            print("The complete response:", flush=True)
+                # The last response is a CreateResult object with the complete message.
-            print(response.content, flush=True)
+                print("\n\n------------\n")
+                print("The complete response:", flush=True)
-    # Close the client when done.
+                print(response.content, flush=True)
-    await model_client.close()
+        # Close the client when done.
+        await model_client.close()
-asyncio.run(main())
-```
+    asyncio.run(main())
+    ```
 For details, see the tutorial:

--- a/docs/deployment/frameworks/cerebrium.md
+++ b/docs/deployment/frameworks/cerebrium.md
@@ -34,25 +34,27 @@ vllm = "latest"
 Next, let us add our code to handle inference for the LLM of your choice (`mistralai/Mistral-7B-Instruct-v0.1` for this example), add the following code to your `main.py`:
-```python
+??? Code
-from vllm import LLM, SamplingParams
-llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.1")
+    ```python
+    from vllm import LLM, SamplingParams
-def run(prompts: list[str], temperature: float = 0.8, top_p: float = 0.95):
+    llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.1")
-    sampling_params = SamplingParams(temperature=temperature, top_p=top_p)
+    def run(prompts: list[str], temperature: float = 0.8, top_p: float = 0.95):
-    outputs = llm.generate(prompts, sampling_params)
-    # Print the outputs.
+        sampling_params = SamplingParams(temperature=temperature, top_p=top_p)
-    results = []
+        outputs = llm.generate(prompts, sampling_params)
-    for output in outputs:
-        prompt = output.prompt
-        generated_text = output.outputs[0].text
-        results.append({"prompt": prompt, "generated_text": generated_text})
-    return {"results": results}
+        # Print the outputs.
-```
+        results = []
+        for output in outputs:
+            prompt = output.prompt
+            generated_text = output.outputs[0].text
+            results.append({"prompt": prompt, "generated_text": generated_text})
+        return {"results": results}
+    ```
 Then, run the following code to deploy it to the cloud:
@@ -62,47 +64,51 @@ cerebrium deploy
 If successful, you should be returned a CURL command that you can call inference against. Just remember to end the url with the function name you are calling (in our case`/run`)
-```python
+??? Command
-curl -X POST https://api.cortex.cerebrium.ai/v4/p-xxxxxx/vllm/run \
- -H 'Content-Type: application/json' \
+    ```python
- -H 'Authorization: <JWT TOKEN>' \
+    curl -X POST https://api.cortex.cerebrium.ai/v4/p-xxxxxx/vllm/run \
- --data '{
+    -H 'Content-Type: application/json' \
-   "prompts": [
+    -H 'Authorization: <JWT TOKEN>' \
-     "Hello, my name is",
+    --data '{
-     "The president of the United States is",
+    "prompts": [
-     "The capital of France is",
+        "Hello, my name is",
-     "The future of AI is"
+        "The president of the United States is",
-   ]
+        "The capital of France is",
- }'
+        "The future of AI is"
-```
+    ]
+    }'
+    ```
 You should get a response like:
-```python
+??? Response
-{
-    "run_id": "52911756-3066-9ae8-bcc9-d9129d1bd262",
+    ```python
-    "result": {
+    {
-        "result": [
+        "run_id": "52911756-3066-9ae8-bcc9-d9129d1bd262",
-            {
+        "result": {
-                "prompt": "Hello, my name is",
+            "result": [
-                "generated_text": " Sarah, and I'm a teacher. I teach elementary school students. One of"
+                {
-            },
+                    "prompt": "Hello, my name is",
-            {
+                    "generated_text": " Sarah, and I'm a teacher. I teach elementary school students. One of"
-                "prompt": "The president of the United States is",
+                },
-                "generated_text": " elected every four years. This is a democratic system.\n\n5. What"
+                {
-            },
+                    "prompt": "The president of the United States is",
-            {
+                    "generated_text": " elected every four years. This is a democratic system.\n\n5. What"
-                "prompt": "The capital of France is",
+                },
-                "generated_text": " Paris.\n"
+                {
-            },
+                    "prompt": "The capital of France is",
-            {
+                    "generated_text": " Paris.\n"
-                "prompt": "The future of AI is",
+                },
-                "generated_text": " bright, but it's important to approach it with a balanced and nuanced perspective."
+                {
-            }
+                    "prompt": "The future of AI is",
-        ]
+                    "generated_text": " bright, but it's important to approach it with a balanced and nuanced perspective."
-    },
+                }
-    "run_time_ms": 152.53663063049316
+            ]
-}
+        },
-```
+        "run_time_ms": 152.53663063049316
+    }
+    ```
 You now have an autoscaling endpoint where you only pay for the compute you use!
--- a/docs/deployment/frameworks/dstack.md
+++ b/docs/deployment/frameworks/dstack.md
@@ -26,75 +26,81 @@ dstack init
 Next, to provision a VM instance with LLM of your choice (`NousResearch/Llama-2-7b-chat-hf` for this example), create the following `serve.dstack.yml` file for the dstack `Service`:
-```yaml
+??? Config
-type: service
+    ```yaml
-python: "3.11"
+    type: service
-env:
-    - MODEL=NousResearch/Llama-2-7b-chat-hf
+    python: "3.11"
-port: 8000
+    env:
-resources:
+        - MODEL=NousResearch/Llama-2-7b-chat-hf
-    gpu: 24GB
+    port: 8000
-commands:
+    resources:
-    - pip install vllm
+        gpu: 24GB
-    - vllm serve $MODEL --port 8000
+    commands:
-model:
+        - pip install vllm
-    format: openai
+        - vllm serve $MODEL --port 8000
-    type: chat
+    model:
-    name: NousResearch/Llama-2-7b-chat-hf
+        format: openai
-```
+        type: chat
+        name: NousResearch/Llama-2-7b-chat-hf
+    ```
 Then, run the following CLI for provisioning:
-```console
+??? Command
-$ dstack run . -f serve.dstack.yml
+    ```console
-⠸ Getting run plan...
+    $ dstack run . -f serve.dstack.yml
- Configuration  serve.dstack.yml
- Project        deep-diver-main
+    ⠸ Getting run plan...
- User           deep-diver
+    Configuration  serve.dstack.yml
- Min resources  2..xCPU, 8GB.., 1xGPU (24GB)
+    Project        deep-diver-main
- Max price      -
+    User           deep-diver
- Max duration   -
+    Min resources  2..xCPU, 8GB.., 1xGPU (24GB)
- Spot policy    auto
+    Max price      -
- Retry policy   no
+    Max duration   -
+    Spot policy    auto
- #  BACKEND  REGION       INSTANCE       RESOURCES                               SPOT  PRICE
+    Retry policy   no
- 1  gcp   us-central1  g2-standard-4  4xCPU, 16GB, 1xL4 (24GB), 100GB (disk)  yes   $0.223804
- 2  gcp   us-east1     g2-standard-4  4xCPU, 16GB, 1xL4 (24GB), 100GB (disk)  yes   $0.223804
+    #  BACKEND  REGION       INSTANCE       RESOURCES                               SPOT  PRICE
- 3  gcp   us-west1     g2-standard-4  4xCPU, 16GB, 1xL4 (24GB), 100GB (disk)  yes   $0.223804
+    1  gcp   us-central1  g2-standard-4  4xCPU, 16GB, 1xL4 (24GB), 100GB (disk)  yes   $0.223804
-    ...
+    2  gcp   us-east1     g2-standard-4  4xCPU, 16GB, 1xL4 (24GB), 100GB (disk)  yes   $0.223804
- Shown 3 of 193 offers, $5.876 max
+    3  gcp   us-west1     g2-standard-4  4xCPU, 16GB, 1xL4 (24GB), 100GB (disk)  yes   $0.223804
+        ...
-Continue? [y/n]: y
+    Shown 3 of 193 offers, $5.876 max
-⠙ Submitting run...
-⠏ Launching spicy-treefrog-1 (pulling)
+    Continue? [y/n]: y
-spicy-treefrog-1 provisioning completed (running)
+    ⠙ Submitting run...
-Service is published at ...
+    ⠏ Launching spicy-treefrog-1 (pulling)
-```
+    spicy-treefrog-1 provisioning completed (running)
+    Service is published at ...
+    ```
 After the provisioning, you can interact with the model by using the OpenAI SDK:
-```python
+??? Code
-from openai import OpenAI
+    ```python
-client = OpenAI(
+    from openai import OpenAI
-    base_url="https://gateway.<gateway domain>",
-    api_key="<YOUR-DSTACK-SERVER-ACCESS-TOKEN>"
+    client = OpenAI(
-)
+        base_url="https://gateway.<gateway domain>",
+        api_key="<YOUR-DSTACK-SERVER-ACCESS-TOKEN>"
-completion = client.chat.completions.create(
+    )
-    model="NousResearch/Llama-2-7b-chat-hf",
-    messages=[
+    completion = client.chat.completions.create(
-        {
+        model="NousResearch/Llama-2-7b-chat-hf",
-            "role": "user",
+        messages=[
-            "content": "Compose a poem that explains the concept of recursion in programming.",
+            {
-        }
+                "role": "user",
-    ]
+                "content": "Compose a poem that explains the concept of recursion in programming.",
-)
+            }
+        ]
-print(completion.choices[0].message.content)
+    )
-```
+    print(completion.choices[0].message.content)
+    ```
 !!! note
    dstack automatically handles authentication on the gateway using dstack's tokens. Meanwhile, if you don't want to configure a gateway, you can provision dstack `Task` instead of `Service`. The `Task` is for development purpose only. If you want to know more about hands-on materials how to serve vLLM using dstack, check out [this repository](https://github.com/dstackai/dstack-examples/tree/main/deployment/vllm)
--- a/docs/deployment/frameworks/haystack.md
+++ b/docs/deployment/frameworks/haystack.md
@@ -27,29 +27,29 @@ vllm serve mistralai/Mistral-7B-Instruct-v0.1
 - Use the `OpenAIGenerator` and `OpenAIChatGenerator` components in Haystack to query the vLLM server.
-```python
+??? Code
-from haystack.components.generators.chat import OpenAIChatGenerator
-from haystack.dataclasses import ChatMessage
+    ```python
-from haystack.utils import Secret
+    from haystack.components.generators.chat import OpenAIChatGenerator
+    from haystack.dataclasses import ChatMessage
-generator = OpenAIChatGenerator(
+    from haystack.utils import Secret
-    # for compatibility with the OpenAI API, a placeholder api_key is needed
-    api_key=Secret.from_token("VLLM-PLACEHOLDER-API-KEY"),
+    generator = OpenAIChatGenerator(
-    model="mistralai/Mistral-7B-Instruct-v0.1",
+        # for compatibility with the OpenAI API, a placeholder api_key is needed
-    api_base_url="http://{your-vLLM-host-ip}:{your-vLLM-host-port}/v1",
+        api_key=Secret.from_token("VLLM-PLACEHOLDER-API-KEY"),
-    generation_kwargs = {"max_tokens": 512}
+        model="mistralai/Mistral-7B-Instruct-v0.1",
-)
+        api_base_url="http://{your-vLLM-host-ip}:{your-vLLM-host-port}/v1",
+        generation_kwargs = {"max_tokens": 512}
-response = generator.run(
+    )
-  messages=[ChatMessage.from_user("Hi. Can you help me plan my next trip to Italy?")]
-)
+    response = generator.run(
+      messages=[ChatMessage.from_user("Hi. Can you help me plan my next trip to Italy?")]
-print("-"*30)
+    )
-print(response)
-print("-"*30)
+    print("-"*30)
-```
+    print(response)
+    print("-"*30)
-Output e.g.:
+    ```
 ```console
 ------------------------------

--- a/docs/deployment/frameworks/litellm.md
+++ b/docs/deployment/frameworks/litellm.md
@@ -34,21 +34,23 @@ vllm serve qwen/Qwen1.5-0.5B-Chat
 - Call it with litellm:
-```python
+??? Code
-import litellm 
-messages = [{ "content": "Hello, how are you?","role": "user"}]
+    ```python
+    import litellm 
-# hosted_vllm is prefix key word and necessary
+    messages = [{ "content": "Hello, how are you?","role": "user"}]
-response = litellm.completion(
-            model="hosted_vllm/qwen/Qwen1.5-0.5B-Chat", # pass the vllm model name
+    # hosted_vllm is prefix key word and necessary
-            messages=messages,
+    response = litellm.completion(
-            api_base="http://{your-vllm-server-host}:{your-vllm-server-port}/v1",
+                model="hosted_vllm/qwen/Qwen1.5-0.5B-Chat", # pass the vllm model name
-            temperature=0.2,
+                messages=messages,
-            max_tokens=80)
+                api_base="http://{your-vllm-server-host}:{your-vllm-server-port}/v1",
+                temperature=0.2,
-print(response)
+                max_tokens=80)
-```
+    print(response)
+    ```
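The folding pattern applied throughout this diff (a `??? Code` header, then the original fenced block indented by four spaces) can be sketched as a small helper. This is only an illustration of the transformation; `fold_code_block` is a hypothetical name, and nothing here suggests the commit was produced with such a script.

```python
FENCE = "`" * 3  # three backticks, built this way to keep the example readable


def fold_code_block(fenced_block, title="Code", indent=4):
    """Wrap a fenced code block in a collapsible `??? <title>` block."""
    pad = " " * indent
    lines = fenced_block.strip("\n").split("\n")
    # Indent non-empty lines only, so blank lines carry no trailing spaces.
    body = [pad + line if line.strip() else "" for line in lines]
    # A blank line separates the `???` header from the indented content.
    return "\n".join(["??? " + title, ""] + body)


original = "\n".join([FENCE + "python", "import litellm", "", 'print("hello")', FENCE])
folded = fold_code_block(original)
print(folded)
```

The same helper with `title="Yaml"` or `title="Output"` would give the other headers seen in this change.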
 ### Embeddings

--- a/docs/deployment/frameworks/lws.md
+++ b/docs/deployment/frameworks/lws.md
@@ -17,99 +17,101 @@ vLLM can be deployed with [LWS](https://github.com/kubernetes-sigs/lws) on Kuber
 Deploy the following yaml file `lws.yaml`
-```yaml
+??? Yaml
-apiVersion: leaderworkerset.x-k8s.io/v1
-kind: LeaderWorkerSet
+    ```yaml
-metadata:
+    apiVersion: leaderworkerset.x-k8s.io/v1
-  name: vllm
+    kind: LeaderWorkerSet
-spec:
+    metadata:
-  replicas: 2
+      name: vllm
-  leaderWorkerTemplate:
+    spec:
-    size: 2
+      replicas: 2
-    restartPolicy: RecreateGroupOnPodRestart
+      leaderWorkerTemplate:
-    leaderTemplate:
+        size: 2
-      metadata:
+        restartPolicy: RecreateGroupOnPodRestart
-        labels:
+        leaderTemplate:
-          role: leader
+          metadata:
-      spec:
+            labels:
-        containers:
+              role: leader
-          - name: vllm-leader
+          spec:
-            image: docker.io/vllm/vllm-openai:latest
+            containers:
-            env:
+              - name: vllm-leader
-              - name: HUGGING_FACE_HUB_TOKEN
+                image: docker.io/vllm/vllm-openai:latest
-                value: <your-hf-token>
+                env:
-            command:
+                  - name: HUGGING_FACE_HUB_TOKEN
-              - sh
+                    value: <your-hf-token>
-              - -c
+                command:
-              - "bash /vllm-workspace/examples/online_serving/multi-node-serving.sh leader --ray_cluster_size=$(LWS_GROUP_SIZE); 
+                  - sh
-                 python3 -m vllm.entrypoints.openai.api_server --port 8080 --model meta-llama/Meta-Llama-3.1-405B-Instruct --tensor-parallel-size 8 --pipeline_parallel_size 2"
+                  - -c
-            resources:
+                  - "bash /vllm-workspace/examples/online_serving/multi-node-serving.sh leader --ray_cluster_size=$(LWS_GROUP_SIZE); 
-              limits:
+                    python3 -m vllm.entrypoints.openai.api_server --port 8080 --model meta-llama/Meta-Llama-3.1-405B-Instruct --tensor-parallel-size 8 --pipeline_parallel_size 2"
-                nvidia.com/gpu: "8"
+                resources:
-                memory: 1124Gi
+                  limits:
-                ephemeral-storage: 800Gi
+                    nvidia.com/gpu: "8"
-              requests:
+                    memory: 1124Gi
-                ephemeral-storage: 800Gi
+                    ephemeral-storage: 800Gi
-                cpu: 125
+                  requests:
-            ports:
+                    ephemeral-storage: 800Gi
-              - containerPort: 8080
+                    cpu: 125
-            readinessProbe:
+                ports:
-              tcpSocket:
+                  - containerPort: 8080
-                port: 8080
+                readinessProbe:
-              initialDelaySeconds: 15
+                  tcpSocket:
-              periodSeconds: 10
+                    port: 8080
-            volumeMounts:
+                  initialDelaySeconds: 15
-              - mountPath: /dev/shm
+                  periodSeconds: 10
-                name: dshm
+                volumeMounts:
-        volumes:
+                  - mountPath: /dev/shm
-        - name: dshm
+                    name: dshm
-          emptyDir:
+            volumes:
-            medium: Memory
+            - name: dshm
-            sizeLimit: 15Gi
+              emptyDir:
-    workerTemplate:
+                medium: Memory
-      spec:
+                sizeLimit: 15Gi
-        containers:
+        workerTemplate:
-          - name: vllm-worker
+          spec:
-            image: docker.io/vllm/vllm-openai:latest
+            containers:
-            command:
+              - name: vllm-worker
-              - sh
+                image: docker.io/vllm/vllm-openai:latest
-              - -c
+                command:
-              - "bash /vllm-workspace/examples/online_serving/multi-node-serving.sh worker --ray_address=$(LWS_LEADER_ADDRESS)"
+                  - sh
-            resources:
+                  - -c
-              limits:
+                  - "bash /vllm-workspace/examples/online_serving/multi-node-serving.sh worker --ray_address=$(LWS_LEADER_ADDRESS)"
-                nvidia.com/gpu: "8"
+                resources:
-                memory: 1124Gi
+                  limits:
-                ephemeral-storage: 800Gi
+                    nvidia.com/gpu: "8"
-              requests:
+                    memory: 1124Gi
-                ephemeral-storage: 800Gi
+                    ephemeral-storage: 800Gi
-                cpu: 125
+                  requests:
-            env:
+                    ephemeral-storage: 800Gi
-              - name: HUGGING_FACE_HUB_TOKEN
+                    cpu: 125
-                value: <your-hf-token>
+                env:
-            volumeMounts:
+                  - name: HUGGING_FACE_HUB_TOKEN
-              - mountPath: /dev/shm
+                    value: <your-hf-token>
-                name: dshm   
+                volumeMounts:
-        volumes:
+                  - mountPath: /dev/shm
-        - name: dshm
+                    name: dshm   
-          emptyDir:
+            volumes:
-            medium: Memory
+            - name: dshm
-            sizeLimit: 15Gi
+              emptyDir:
---
+                medium: Memory
-apiVersion: v1
+                sizeLimit: 15Gi
-kind: Service
+    ---
-metadata:
+    apiVersion: v1
-  name: vllm-leader
+    kind: Service
-spec:
+    metadata:
-  ports:
+      name: vllm-leader
-    - name: http
+    spec:
-      port: 8080
+      ports:
-      protocol: TCP
+        - name: http
-      targetPort: 8080
+          port: 8080
-  selector:
+          protocol: TCP
-    leaderworkerset.sigs.k8s.io/name: vllm
+          targetPort: 8080
-    role: leader
+      selector:
-  type: ClusterIP
+        leaderworkerset.sigs.k8s.io/name: vllm
-```
+        role: leader
+      type: ClusterIP
+    ```
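The numbers in `lws.yaml` are tied together: the leader launches vLLM with `--tensor-parallel-size 8 --pipeline_parallel_size 2`, while each group has `size: 2` pods with `nvidia.com/gpu: "8"` each. A minimal sketch of that arithmetic follows; the helper names are hypothetical and exist only for illustration.

```python
def gpus_required(tensor_parallel_size, pipeline_parallel_size):
    """GPUs one model replica needs: one per (tensor, pipeline) rank pair."""
    return tensor_parallel_size * pipeline_parallel_size


def gpus_available(group_size, gpus_per_pod):
    """GPUs one leader/worker group provides."""
    return group_size * gpus_per_pod


# Values copied from the manifest above.
required = gpus_required(tensor_parallel_size=8, pipeline_parallel_size=2)
available = gpus_available(group_size=2, gpus_per_pod=8)

print("required:", required, "available:", available)
if required != available:
    raise SystemExit("parallelism settings do not match the group's GPU count")
```

If the group size or the per-pod GPU limit changes, the two parallelism flags would need to change with them.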
 ```bash
 kubectl apply -f lws.yaml
@@ -175,25 +177,27 @@ curl http://localhost:8080/v1/completions \
 The output should be similar to the following
-```text
+??? Output
-{
-  "id": "cmpl-1bb34faba88b43f9862cfbfb2200949d",
+    ```text
-  "object": "text_completion",
-  "created": 1715138766,
-  "model": "meta-llama/Meta-Llama-3.1-405B-Instruct",
-  "choices": [
    {
-      "index": 0,
+      "id": "cmpl-1bb34faba88b43f9862cfbfb2200949d",
-      "text": " top destination for foodies, with",
+      "object": "text_completion",
-      "logprobs": null,
+      "created": 1715138766,
-      "finish_reason": "length",
+      "model": "meta-llama/Meta-Llama-3.1-405B-Instruct",
-      "stop_reason": null
+      "choices": [
+        {
+          "index": 0,
+          "text": " top destination for foodies, with",
+          "logprobs": null,
+          "finish_reason": "length",
+          "stop_reason": null
+        }
+      ],
+      "usage": {
+        "prompt_tokens": 5,
+        "total_tokens": 12,
+        "completion_tokens": 7
+      }
    }
-  ],
+    ```
-  "usage": {
-    "prompt_tokens": 5,
-    "total_tokens": 12,
-    "completion_tokens": 7
-  }
-}
-```
--- a/docs/deployment/frameworks/skypilot.md
+++ b/docs/deployment/frameworks/skypilot.md
@@ -24,48 +24,50 @@ sky check
 See the vLLM SkyPilot YAML for serving, [serving.yaml](https://github.com/skypilot-org/skypilot/blob/master/llm/vllm/serve.yaml).
-```yaml
+??? Yaml
-resources:
-  accelerators: {L4, A10g, A10, L40, A40, A100, A100-80GB} # We can use cheaper accelerators for 8B model.
+    ```yaml
-  use_spot: True
+    resources:
-  disk_size: 512  # Ensure model checkpoints can fit.
+      accelerators: {L4, A10g, A10, L40, A40, A100, A100-80GB} # We can use cheaper accelerators for 8B model.
-  disk_tier: best
+      use_spot: True
-  ports: 8081  # Expose to internet traffic.
+      disk_size: 512  # Ensure model checkpoints can fit.
+      disk_tier: best
-envs:
+      ports: 8081  # Expose to internet traffic.
-  MODEL_NAME: meta-llama/Meta-Llama-3-8B-Instruct
-  HF_TOKEN: <your-huggingface-token>  # Change to your own huggingface token, or use --env to pass.
+    envs:
+      MODEL_NAME: meta-llama/Meta-Llama-3-8B-Instruct
-setup: |
+      HF_TOKEN: <your-huggingface-token>  # Change to your own huggingface token, or use --env to pass.
-  conda create -n vllm python=3.10 -y
-  conda activate vllm
+    setup: |
+      conda create -n vllm python=3.10 -y
-  pip install vllm==0.4.0.post1
+      conda activate vllm
-  # Install Gradio for web UI.
-  pip install gradio openai
+      pip install vllm==0.4.0.post1
-  pip install flash-attn==2.5.7
+      # Install Gradio for web UI.
+      pip install gradio openai
-run: |
+      pip install flash-attn==2.5.7
-  conda activate vllm
-  echo 'Starting vllm api server...'
+    run: |
-  python -u -m vllm.entrypoints.openai.api_server \
+      conda activate vllm
-    --port 8081 \
+      echo 'Starting vllm api server...'
-    --model $MODEL_NAME \
+      python -u -m vllm.entrypoints.openai.api_server \
-    --trust-remote-code \
+        --port 8081 \
-    --tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
+        --model $MODEL_NAME \
-    2>&1 | tee api_server.log &
+        --trust-remote-code \
+        --tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
-  echo 'Waiting for vllm api server to start...'
+        2>&1 | tee api_server.log &
-  while ! `cat api_server.log | grep -q 'Uvicorn running on'`; do sleep 1; done
+      echo 'Waiting for vllm api server to start...'
-  echo 'Starting gradio server...'
+      while ! `cat api_server.log | grep -q 'Uvicorn running on'`; do sleep 1; done
-  git clone https://github.com/vllm-project/vllm.git || true
-  python vllm/examples/online_serving/gradio_openai_chatbot_webserver.py \
+      echo 'Starting gradio server...'
-    -m $MODEL_NAME \
+      git clone https://github.com/vllm-project/vllm.git || true
-    --port 8811 \
+      python vllm/examples/online_serving/gradio_openai_chatbot_webserver.py \
-    --model-url http://localhost:8081/v1 \
+        -m $MODEL_NAME \
-    --stop-token-ids 128009,128001
+        --port 8811 \
-```
+        --model-url http://localhost:8081/v1 \
+        --stop-token-ids 128009,128001
+    ```
 Start serving the Llama-3 8B model on any of the candidate GPUs listed (L4, A10g, ...):
@@ -93,68 +95,67 @@ HF_TOKEN="your-huggingface-token" \
 SkyPilot can scale up the service to multiple service replicas with built-in autoscaling, load-balancing and fault-tolerance. You can do it by adding a services section to the YAML file.
-```yaml
+??? Yaml
-service:
-  replicas: 2
+    ```yaml
-  # An actual request for readiness probe.
+    service:
-  readiness_probe:
+      replicas: 2
-    path: /v1/chat/completions
+      # An actual request for readiness probe.
-    post_data:
+      readiness_probe:
-    model: $MODEL_NAME
+        path: /v1/chat/completions
-    messages:
+        post_data:
-      - role: user
+        model: $MODEL_NAME
-        content: Hello! What is your name?
+        messages:
-  max_completion_tokens: 1
+          - role: user
-```
+            content: Hello! What is your name?
-<details>
-<summary>Click to see the full recipe YAML</summary>
-```yaml
-service:
-  replicas: 2
-  # An actual request for readiness probe.
-  readiness_probe:
-    path: /v1/chat/completions
-    post_data:
-      model: $MODEL_NAME
-      messages:
-        - role: user
-          content: Hello! What is your name?
      max_completion_tokens: 1
+    ```
-resources:
+??? Yaml
-  accelerators: {L4, A10g, A10, L40, A40, A100, A100-80GB} # We can use cheaper accelerators for 8B model.
-  use_spot: True
+    ```yaml
-  disk_size: 512  # Ensure model checkpoints can fit.
+    service:
-  disk_tier: best
+      replicas: 2
-  ports: 8081  # Expose to internet traffic.
+      # An actual request for readiness probe.
+      readiness_probe:
-envs:
+        path: /v1/chat/completions
-  MODEL_NAME: meta-llama/Meta-Llama-3-8B-Instruct
+        post_data:
-  HF_TOKEN: <your-huggingface-token>  # Change to your own huggingface token, or use --env to pass.
+          model: $MODEL_NAME
+          messages:
-setup: |
+            - role: user
-  conda create -n vllm python=3.10 -y
+              content: Hello! What is your name?
-  conda activate vllm
+          max_completion_tokens: 1
-  pip install vllm==0.4.0.post1
+    resources:
-  # Install Gradio for web UI.
+      accelerators: {L4, A10g, A10, L40, A40, A100, A100-80GB} # We can use cheaper accelerators for 8B model.
-  pip install gradio openai
+      use_spot: True
-  pip install flash-attn==2.5.7
+      disk_size: 512  # Ensure model checkpoints can fit.
+      disk_tier: best
-run: |
+      ports: 8081  # Expose to internet traffic.
-  conda activate vllm
-  echo 'Starting vllm api server...'
+    envs:
-  python -u -m vllm.entrypoints.openai.api_server \
+      MODEL_NAME: meta-llama/Meta-Llama-3-8B-Instruct
-    --port 8081 \
+      HF_TOKEN: <your-huggingface-token>  # Change to your own huggingface token, or use --env to pass.
-    --model $MODEL_NAME \
-    --trust-remote-code \
+    setup: |
-    --tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
+      conda create -n vllm python=3.10 -y
-    2>&1 | tee api_server.log
+      conda activate vllm
-```
+      pip install vllm==0.4.0.post1
-</details>
+      # Install Gradio for web UI.
+      pip install gradio openai
+      pip install flash-attn==2.5.7
+    run: |
+      conda activate vllm
+      echo 'Starting vllm api server...'
+      python -u -m vllm.entrypoints.openai.api_server \
+        --port 8081 \
+        --model $MODEL_NAME \
+        --trust-remote-code \
+        --tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
+        2>&1 | tee api_server.log
+    ```
 Start serving the Llama-3 8B model on multiple replicas:
@@ -170,8 +171,7 @@ Wait until the service is ready:
 watch -n10 sky serve status vllm
 ```
-<details>
+Example outputs:
-<summary>Example outputs:</summary>
 ```console
 Services
@@ -184,29 +184,29 @@ vllm          1   1        xx.yy.zz.121  18 mins ago  1x GCP([Spot]{'L4': 1})  R
 vllm          2   1        xx.yy.zz.245  18 mins ago  1x GCP([Spot]{'L4': 1})  READY   us-east4
 ```
-</details>
 After the service is READY, you can find a single endpoint for the service and access the service with the endpoint:
-```console
+??? Commands
-ENDPOINT=$(sky serve status --endpoint 8081 vllm)
-curl -L http://$ENDPOINT/v1/chat/completions \
+    ```bash
-  -H "Content-Type: application/json" \
+    ENDPOINT=$(sky serve status --endpoint 8081 vllm)
-  -d '{
+    curl -L http://$ENDPOINT/v1/chat/completions \
-    "model": "meta-llama/Meta-Llama-3-8B-Instruct",
+      -H "Content-Type: application/json" \
-    "messages": [
+      -d '{
-    {
+        "model": "meta-llama/Meta-Llama-3-8B-Instruct",
-      "role": "system",
+        "messages": [
-      "content": "You are a helpful assistant."
+        {
-    },
+          "role": "system",
-    {
+          "content": "You are a helpful assistant."
-      "role": "user",
+        },
-      "content": "Who are you?"
+        {
-    }
+          "role": "user",
-    ],
+          "content": "Who are you?"
-    "stop_token_ids": [128009,  128001]
+        }
-  }'
+        ],
-```
+        "stop_token_ids": [128009,  128001]
+      }'
+    ```
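For readers who prefer Python over `curl`, the same request can be assembled with the standard library. This is a sketch under assumptions: `build_chat_request` is a hypothetical helper, `localhost:8081` stands in for the value of `$ENDPOINT`, and the request is only constructed here, not sent.

```python
import json
from urllib.request import Request


def build_chat_request(endpoint, model, user_prompt):
    """Build (but do not send) the POST request used in the curl example."""
    payload = {
        "model": model,
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": user_prompt},
        ],
        "stop_token_ids": [128009, 128001],
    }
    return Request(
        url="http://" + endpoint + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# "localhost:8081" is a placeholder for the endpoint reported by SkyPilot.
request = build_chat_request(
    "localhost:8081", "meta-llama/Meta-Llama-3-8B-Instruct", "Who are you?"
)
print(request.get_method(), request.full_url)
```

Passing `request` to `urllib.request.urlopen` would send it once a real endpoint is substituted.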
 To enable autoscaling, you could replace the `replicas` with the following configs in `service`:
@@ -220,57 +220,54 @@ service:
 This will scale the service up when the QPS exceeds 2 for each replica.
-<details>
+??? Yaml
-<summary>Click to see the full recipe YAML</summary>
+    ```yaml
-```yaml
+    service:
-service:
+      replica_policy:
-  replica_policy:
+        min_replicas: 2
-    min_replicas: 2
+        max_replicas: 4
-    max_replicas: 4
+        target_qps_per_replica: 2
-    target_qps_per_replica: 2
+      # An actual request for readiness probe.
-  # An actual request for readiness probe.
+      readiness_probe:
-  readiness_probe:
+        path: /v1/chat/completions
-    path: /v1/chat/completions
+        post_data:
-    post_data:
+          model: $MODEL_NAME
-      model: $MODEL_NAME
+          messages:
-      messages:
+            - role: user
-        - role: user
+              content: Hello! What is your name?
-          content: Hello! What is your name?
+          max_completion_tokens: 1
-      max_completion_tokens: 1
+    resources:
-resources:
+      accelerators: {L4, A10g, A10, L40, A40, A100, A100-80GB} # We can use cheaper accelerators for 8B model.
-  accelerators: {L4, A10g, A10, L40, A40, A100, A100-80GB} # We can use cheaper accelerators for 8B model.
+      use_spot: True
-  use_spot: True
+      disk_size: 512  # Ensure model checkpoints can fit.
-  disk_size: 512  # Ensure model checkpoints can fit.
+      disk_tier: best
-  disk_tier: best
+      ports: 8081  # Expose to internet traffic.
-  ports: 8081  # Expose to internet traffic.
+    envs:
-envs:
+      MODEL_NAME: meta-llama/Meta-Llama-3-8B-Instruct
-  MODEL_NAME: meta-llama/Meta-Llama-3-8B-Instruct
+      HF_TOKEN: <your-huggingface-token>  # Change to your own huggingface token, or use --env to pass.
-  HF_TOKEN: <your-huggingface-token>  # Change to your own huggingface token, or use --env to pass.
+    setup: |
-setup: |
+      conda create -n vllm python=3.10 -y
-  conda create -n vllm python=3.10 -y
+      conda activate vllm
-  conda activate vllm
+      pip install vllm==0.4.0.post1
-  pip install vllm==0.4.0.post1
+      # Install Gradio for web UI.
-  # Install Gradio for web UI.
+      pip install gradio openai
-  pip install gradio openai
+      pip install flash-attn==2.5.7
-  pip install flash-attn==2.5.7
+    run: |
-run: |
+      conda activate vllm
-  conda activate vllm
+      echo 'Starting vllm api server...'
-  echo 'Starting vllm api server...'
+      python -u -m vllm.entrypoints.openai.api_server \
-  python -u -m vllm.entrypoints.openai.api_server \
+        --port 8081 \
-    --port 8081 \
+        --model $MODEL_NAME \
-    --model $MODEL_NAME \
+        --trust-remote-code \
-    --trust-remote-code \
+        --tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
-    --tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
+        2>&1 | tee api_server.log
-    2>&1 | tee api_server.log
+    ```
-```
-</details>
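How the three `replica_policy` values interact can be pictured with a simplified model: divide the observed load by `target_qps_per_replica`, round up, and clamp the result between `min_replicas` and `max_replicas`. The function below is an assumption made for illustration, not SkyPilot's actual autoscaler, which may also apply delays or smoothing.

```python
import math


def desired_replicas(observed_qps, target_qps_per_replica, min_replicas, max_replicas):
    """Simplified target-tracking rule: enough replicas to keep each one at or below the target QPS."""
    needed = math.ceil(observed_qps / target_qps_per_replica)
    # Never go below the floor or above the ceiling configured in the YAML.
    return max(min_replicas, min(max_replicas, needed))


# Policy values copied from the recipe above.
for qps in (0, 3, 5, 7, 20):
    print(qps, "QPS ->", desired_replicas(qps, 2, 2, 4), "replicas")
```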
 To update the service with the new config:
@@ -288,38 +285,35 @@ sky serve down vllm
 It is also possible to access the Llama-3 service with a separate GUI frontend, so that user requests sent to the GUI are load-balanced across replicas.
-<details>
+??? Yaml
-<summary>Click to see the full GUI YAML</summary>
-```yaml
+    ```yaml
-envs:
+    envs:
-  MODEL_NAME: meta-llama/Meta-Llama-3-8B-Instruct
+      MODEL_NAME: meta-llama/Meta-Llama-3-8B-Instruct
-  ENDPOINT: x.x.x.x:3031 # Address of the API server running vllm.
+      ENDPOINT: x.x.x.x:3031 # Address of the API server running vllm.
-resources:
+    resources:
-  cpus: 2
+      cpus: 2
-setup: |
-  conda create -n vllm python=3.10 -y
-  conda activate vllm
-  # Install Gradio for web UI.
-  pip install gradio openai
-run: |
-  conda activate vllm
-  export PATH=$PATH:/sbin
-  echo 'Starting gradio server...'
-  git clone https://github.com/vllm-project/vllm.git || true
-  python vllm/examples/online_serving/gradio_openai_chatbot_webserver.py \
-    -m $MODEL_NAME \
-    --port 8811 \
-    --model-url http://$ENDPOINT/v1 \
-    --stop-token-ids 128009,128001 | tee ~/gradio.log
-```
-</details>
+    setup: |
+      conda create -n vllm python=3.10 -y
+      conda activate vllm
+      # Install Gradio for web UI.
+      pip install gradio openai
+    run: |
+      conda activate vllm
+      export PATH=$PATH:/sbin
+      echo 'Starting gradio server...'
+      git clone https://github.com/vllm-project/vllm.git || true
+      python vllm/examples/online_serving/gradio_openai_chatbot_webserver.py \
+        -m $MODEL_NAME \
+        --port 8811 \
+        --model-url http://$ENDPOINT/v1 \
+        --stop-token-ids 128009,128001 | tee ~/gradio.log
+    ```
 1. Start the chat web UI:

--- a/docs/deployment/integrations/production-stack.md
+++ b/docs/deployment/integrations/production-stack.md
@@ -60,22 +60,22 @@ And then you can send out a query to the OpenAI-compatible API to check the avai
 curl -o- http://localhost:30080/models
 ```
-Expected output:
+??? Output
-```json
+    ```json
-{
-  "object": "list",
-  "data": [
    {
-      "id": "facebook/opt-125m",
+      "object": "list",
-      "object": "model",
+      "data": [
-      "created": 1737428424,
+        {
-      "owned_by": "vllm",
+          "id": "facebook/opt-125m",
-      "root": null
+          "object": "model",
+          "created": 1737428424,
+          "owned_by": "vllm",
+          "root": null
+        }
+      ]
    }
-  ]
+    ```
-}
-```
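A response of this shape is easy to check programmatically. The sketch below parses the body shown above and lists the served model IDs; `model_ids` is a hypothetical helper, and the JSON is copied from the expected output rather than fetched from a live server.

```python
import json

# Body copied from the expected output above.
response_body = """
{
  "object": "list",
  "data": [
    {
      "id": "facebook/opt-125m",
      "object": "model",
      "created": 1737428424,
      "owned_by": "vllm",
      "root": null
    }
  ]
}
"""


def model_ids(body):
    """Return the `id` of every entry in a model-list response."""
    return [entry["id"] for entry in json.loads(body)["data"]]


print(model_ids(response_body))
```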
 To send an actual chat request, you can issue a curl request to the OpenAI `/completions` endpoint:
@@ -89,23 +89,23 @@ curl -X POST http://localhost:30080/completions \
  }'
 ```
-Expected output:
+??? Output
-```json
+    ```json
-{
-  "id": "completion-id",
-  "object": "text_completion",
-  "created": 1737428424,
-  "model": "facebook/opt-125m",
-  "choices": [
    {
-      "text": " there was a brave knight who...",
+      "id": "completion-id",
-      "index": 0,
+      "object": "text_completion",
-      "finish_reason": "length"
+      "created": 1737428424,
+      "model": "facebook/opt-125m",
+      "choices": [
+        {
+          "text": " there was a brave knight who...",
+          "index": 0,
+          "finish_reason": "length"
+        }
+      ]
    }
-  ]
+    ```
-}
-```
 ### Uninstall
@@ -121,23 +121,25 @@ sudo helm uninstall vllm
 The core vLLM production stack configuration is managed with YAML. Here is the example configuration used in the installation above:
-```yaml
+??? Yaml
-servingEngineSpec:
-  runtimeClassName: ""
-  modelSpec:
-  - name: "opt125m"
-    repository: "vllm/vllm-openai"
-    tag: "latest"
-    modelURL: "facebook/opt-125m"
-    replicaCount: 1
+    ```yaml
+    servingEngineSpec:
+      runtimeClassName: ""
+      modelSpec:
+      - name: "opt125m"
+        repository: "vllm/vllm-openai"
+        tag: "latest"
+        modelURL: "facebook/opt-125m"
-    requestCPU: 6
+        replicaCount: 1
-    requestMemory: "16Gi"
-    requestGPU: 1
-    pvcStorage: "10Gi"
+        requestCPU: 6
-```
+        requestMemory: "16Gi"
+        requestGPU: 1
+        pvcStorage: "10Gi"
+    ```
 In this YAML configuration:
 * **`modelSpec`** includes:

--- a/docs/deployment/k8s.md
+++ b/docs/deployment/k8s.md
@@ -29,85 +29,89 @@ Alternatively, you can deploy vLLM to Kubernetes using any of the following:
 First, create a Kubernetes PVC and Secret for downloading and storing the Hugging Face model:
-```bash
+??? Config
-cat <<EOF |kubectl apply -f -
-apiVersion: v1
+    ```bash
-kind: PersistentVolumeClaim
+    cat <<EOF |kubectl apply -f -
-metadata:
+    apiVersion: v1
-  name: vllm-models
+    kind: PersistentVolumeClaim
-spec:
+    metadata:
-  accessModes:
+      name: vllm-models
-    - ReadWriteOnce
+    spec:
-  volumeMode: Filesystem
+      accessModes:
-  resources:
+        - ReadWriteOnce
-    requests:
+      volumeMode: Filesystem
-      storage: 50Gi
+      resources:
---
+        requests:
-apiVersion: v1
+          storage: 50Gi
-kind: Secret
+    ---
-metadata:
+    apiVersion: v1
-  name: hf-token-secret
+    kind: Secret
-type: Opaque
+    metadata:
-data:
+      name: hf-token-secret
-  token: $(HF_TOKEN)
+    type: Opaque
-EOF
+    data:
-```
+      token: $(HF_TOKEN)
+    EOF
+    ```
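One detail worth keeping in mind for the Secret above: Kubernetes expects values under `data` to be base64-encoded, whereas `stringData` accepts plain text. A minimal sketch of producing such a value follows; `encode_secret_value` is a hypothetical helper and `hf_example_token` is a made-up placeholder, not a real token.

```python
import base64


def encode_secret_value(token):
    """Base64-encode a token so it can be placed under a Secret's `data` field."""
    return base64.b64encode(token.encode("utf-8")).decode("ascii")


# Placeholder value for illustration only.
encoded = encode_secret_value("hf_example_token")
print(encoded)

# Decoding gives the original token back, which is what Kubernetes does on mount.
print(base64.b64decode(encoded).decode("utf-8"))
```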
 Next, start the vLLM server as a Kubernetes Deployment and Service:
-```bash
+??? Config
-cat <<EOF |kubectl apply -f -
-apiVersion: apps/v1
+    ```bash
-kind: Deployment
+    cat <<EOF |kubectl apply -f -
-metadata:
+    apiVersion: apps/v1
-  name: vllm-server
+    kind: Deployment
-spec:
-  replicas: 1
-  selector:
-    matchLabels:
-      app.kubernetes.io/name: vllm
-  template:
    metadata:
-      labels:
+      name: vllm-server
-        app.kubernetes.io/name: vllm
    spec:
-      containers:
+      replicas: 1
-      - name: vllm
+      selector:
-        image: vllm/vllm-openai:latest
+        matchLabels:
-        command: ["/bin/sh", "-c"]
+          app.kubernetes.io/name: vllm
-        args: [
+      template:
-          "vllm serve meta-llama/Llama-3.2-1B-Instruct"
+        metadata:
-        ]
+          labels:
-        env:
+            app.kubernetes.io/name: vllm
-        - name: HUGGING_FACE_HUB_TOKEN
+        spec:
-          valueFrom:
+          containers:
-            secretKeyRef:
+          - name: vllm
-              name: hf-token-secret
+            image: vllm/vllm-openai:latest
-              key: token
+            command: ["/bin/sh", "-c"]
-        ports:
+            args: [
-          - containerPort: 8000
+              "vllm serve meta-llama/Llama-3.2-1B-Instruct"
-        volumeMounts:
+            ]
+            env:
+            - name: HUGGING_FACE_HUB_TOKEN
+              valueFrom:
+                secretKeyRef:
+                  name: hf-token-secret
+                  key: token
+            ports:
+              - containerPort: 8000
+            volumeMounts:
+              - name: llama-storage
+                mountPath: /root/.cache/huggingface
+          volumes:
          - name: llama-storage
-            mountPath: /root/.cache/huggingface
+            persistentVolumeClaim:
-      volumes:
+              claimName: vllm-models
-      - name: llama-storage
+    ---
-        persistentVolumeClaim:
+    apiVersion: v1
-          claimName: vllm-models
+    kind: Service
---
+    metadata:
-apiVersion: v1
+      name: vllm-server
-kind: Service
+    spec:
-metadata:
+      selector:
-  name: vllm-server
+        app.kubernetes.io/name: vllm
-spec:
+      ports:
-  selector:
+      - protocol: TCP
-    app.kubernetes.io/name: vllm
+        port: 8000
-  ports:
+        targetPort: 8000
-  - protocol: TCP
+      type: ClusterIP
-    port: 8000
+    EOF
-    targetPort: 8000
+    ```
-  type: ClusterIP
-EOF
-```
 We can verify that the vLLM server has started successfully via the logs (this might take a couple of minutes to download the model):
@@ -128,6 +132,9 @@ INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
      PVC is used to store the model cache and it is optional, you can use hostPath or other storage options
+      <details>
+      <summary>Yaml</summary>
      ```yaml
      apiVersion: v1
      kind: PersistentVolumeClaim
@@ -144,6 +151,8 @@ INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
        volumeMode: Filesystem
      ```
+      </details>
      Secret is optional and only required for accessing gated models, you can skip this step if you are not using gated models
      ```yaml
@@ -156,13 +165,16 @@ INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
      stringData:
        token: "REPLACE_WITH_TOKEN"
      ```
      Next, create the deployment file for vLLM to run the model server. The following example deploys the `Mistral-7B-Instruct-v0.3` model.
      Here are two examples for using NVIDIA GPU and AMD GPU.
      NVIDIA GPU:
+      <details>
+      <summary>Yaml</summary>
      ```yaml
      apiVersion: apps/v1
      kind: Deployment
@@ -233,10 +245,15 @@ INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
                periodSeconds: 5
      ```
+      </details>
      AMD GPU:
      You can refer to the `deployment.yaml` below if using AMD ROCm GPU like MI300X.
+      <details>
+      <summary>Yaml</summary>
      ```yaml
      apiVersion: apps/v1
      kind: Deployment
@@ -305,12 +322,17 @@ INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
                mountPath: /dev/shm
      ```
+      </details>
      You can get the full example with steps and sample yaml files from <https://github.com/ROCm/k8s-device-plugin/tree/master/example/vllm-serve>.
 2. Create a Kubernetes Service for vLLM
      Next, create a Kubernetes Service file to expose the `mistral-7b` deployment:
+      <details>
+      <summary>Yaml</summary>
      ```yaml
      apiVersion: v1
      kind: Service
@@ -330,6 +352,8 @@ INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
        type: ClusterIP
      ```
+      </details>
 3. Deploy and Test
      Apply the deployment and service configurations using `kubectl apply -f <filename>`:

--- a/docs/deployment/nginx.md
+++ b/docs/deployment/nginx.md
@@ -36,23 +36,25 @@ docker build . -f Dockerfile.nginx --tag nginx-lb
 Create a file named `nginx_conf/nginx.conf`. Note that you can add as many servers as you'd like. In the below example we'll start with two. To add more, add another `server vllmN:8000 max_fails=3 fail_timeout=10000s;` entry to `upstream backend`.
-```console
+??? Config
-upstream backend {
-    least_conn;
+    ```console
-    server vllm0:8000 max_fails=3 fail_timeout=10000s;
+    upstream backend {
-    server vllm1:8000 max_fails=3 fail_timeout=10000s;
+        least_conn;
-}
+        server vllm0:8000 max_fails=3 fail_timeout=10000s;
-server {
+        server vllm1:8000 max_fails=3 fail_timeout=10000s;
-    listen 80;
-    location / {
-        proxy_pass http://backend;
-        proxy_set_header Host $host;
-        proxy_set_header X-Real-IP $remote_addr;
-        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
-        proxy_set_header X-Forwarded-Proto $scheme;
    }
-}
+    server {
-```
+        listen 80;
+        location / {
+            proxy_pass http://backend;
+            proxy_set_header Host $host;
+            proxy_set_header X-Real-IP $remote_addr;
+            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
+            proxy_set_header X-Forwarded-Proto $scheme;
+        }
+    }
+    ```
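The `least_conn` directive in the `upstream backend` block tells nginx to prefer the server with the fewest active connections. The selection rule can be sketched as below; this is a simplified model with hypothetical names and made-up connection counts, and it ignores server weights and the `max_fails`/`fail_timeout` handling that nginx applies.

```python
def pick_least_conn(active_connections):
    """Return the backend with the fewest active connections (first one wins on ties)."""
    return min(active_connections, key=active_connections.get)


# Hypothetical snapshot of in-flight requests per backend.
active = {"vllm0:8000": 3, "vllm1:8000": 1}

chosen = pick_least_conn(active)
active[chosen] += 1  # the new request now counts against that backend
print(chosen, active)
```

For long-running generation requests, this kind of rule tends to spread load more evenly than plain round-robin, which is presumably why the example config uses it.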
 [](){ #nginxloadbalancer-nginx-vllm-container }
@@ -93,30 +95,32 @@ Notes:
 - The example below assumes a GPU backend is used. If you are using a CPU backend, remove `--gpus device=ID` and add the `VLLM_CPU_KVCACHE_SPACE` and `VLLM_CPU_OMP_THREADS_BIND` environment variables to the docker run command.
 - Adjust the model name that you want to use in your vLLM servers if you don't want to use `Llama-2-7b-chat-hf`.
-```console
+??? Commands
-mkdir -p ~/.cache/huggingface/hub/
-hf_cache_dir=~/.cache/huggingface/
+    ```console
-docker run \
+    mkdir -p ~/.cache/huggingface/hub/
-    -itd \
+    hf_cache_dir=~/.cache/huggingface/
-    --ipc host \
+    docker run \
-    --network vllm_nginx \
+        -itd \
-    --gpus device=0 \
+        --ipc host \
-    --shm-size=10.24gb \
+        --network vllm_nginx \
-    -v $hf_cache_dir:/root/.cache/huggingface/ \
+        --gpus device=0 \
-    -p 8081:8000 \
+        --shm-size=10.24gb \
-    --name vllm0 vllm \
+        -v $hf_cache_dir:/root/.cache/huggingface/ \
-    --model meta-llama/Llama-2-7b-chat-hf
+        -p 8081:8000 \
-docker run \
+        --name vllm0 vllm \
-    -itd \
+        --model meta-llama/Llama-2-7b-chat-hf
-    --ipc host \
+    docker run \
-    --network vllm_nginx \
+        -itd \
-    --gpus device=1 \
+        --ipc host \
-    --shm-size=10.24gb \
+        --network vllm_nginx \
-    -v $hf_cache_dir:/root/.cache/huggingface/ \
+        --gpus device=1 \
-    -p 8082:8000 \
+        --shm-size=10.24gb \
-    --name vllm1 vllm \
+        -v $hf_cache_dir:/root/.cache/huggingface/ \
-    --model meta-llama/Llama-2-7b-chat-hf
+        -p 8082:8000 \
-```
+        --name vllm1 vllm \
+        --model meta-llama/Llama-2-7b-chat-hf
+    ```
 !!! note
    If you are behind a proxy, you can pass the proxy settings to the `docker run` command via `-e http_proxy=$http_proxy -e https_proxy=$https_proxy`.
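
The two `docker run` invocations above differ only in the GPU index, the host port, and the container name. As a rough sketch (not part of the vLLM docs), the same argument lists could be generated in Python. The helper name `docker_run_args` is hypothetical. The `8081 + gpu_id` port mapping and the `vllm{gpu_id}` container names are read off the commands above, not a documented convention.

```python
import os
import shlex


def docker_run_args(
    gpu_id: int,
    host_port: int,
    hf_cache_dir: str = "~/.cache/huggingface/",
    model: str = "meta-llama/Llama-2-7b-chat-hf",
) -> list[str]:
    """Build the argument list for one vLLM container (hypothetical helper)."""
    # Expand "~" here because the shell will not expand it inside a quoted argument.
    cache = os.path.expanduser(hf_cache_dir)
    return [
        "docker", "run", "-itd",
        "--ipc", "host",
        "--network", "vllm_nginx",
        "--gpus", f"device={gpu_id}",
        "--shm-size=10.24gb",
        "-v", f"{cache}:/root/.cache/huggingface/",
        "-p", f"{host_port}:8000",
        "--name", f"vllm{gpu_id}", "vllm",
        "--model", model,
    ]


# Print one command per GPU: vllm0 on host port 8081, vllm1 on host port 8082.
for gpu_id in range(2):
    print(shlex.join(docker_run_args(gpu_id, 8081 + gpu_id)))
```

Printing the commands rather than executing them keeps the sketch side-effect free. The lists can be passed to `subprocess.run` unchanged if you do want to launch the containers.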

--- a/docs/design/arch_overview.md
+++ b/docs/design/arch_overview.md
@@ -22,31 +22,33 @@ server.
 Here is a sample of `LLM` class usage:
-```python
+??? Code
-from vllm import LLM, SamplingParams
+    ```python
-# Define a list of input prompts
+    from vllm import LLM, SamplingParams
-prompts = [
-    "Hello, my name is",
+    # Define a list of input prompts
-    "The capital of France is",
+    prompts = [
-    "The largest ocean is",
+        "Hello, my name is",
-]
+        "The capital of France is",
+        "The largest ocean is",
-# Define sampling parameters
+    ]
-sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
+    # Define sampling parameters
-# Initialize the LLM engine with the OPT-125M model
+    sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
-llm = LLM(model="facebook/opt-125m")
+    # Initialize the LLM engine with the OPT-125M model
-# Generate outputs for the input prompts
+    llm = LLM(model="facebook/opt-125m")
-outputs = llm.generate(prompts, sampling_params)
+    # Generate outputs for the input prompts
-# Print the generated outputs
+    outputs = llm.generate(prompts, sampling_params)
-for output in outputs:
-    prompt = output.prompt
+    # Print the generated outputs
-    generated_text = output.outputs[0].text
+    for output in outputs:
-    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
+        prompt = output.prompt
-```
+        generated_text = output.outputs[0].text
+        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
+    ```
 More API details can be found in the [Offline Inference](#offline-inference-api) section of the API docs.
@@ -178,32 +180,34 @@ vision-language model.
    To avoid accidentally passing incorrect arguments, the constructor is now keyword-only. This ensures that the constructor will raise an error if old configurations are passed. vLLM developers have already made this change for all models within vLLM. For out-of-tree registered models, developers need to update their models, for example by adding shim code to adapt the old constructor signature to the new one:
-    ```python
+    ??? Code
-    class MyOldModel(nn.Module):
-        def __init__(
+        ```python
-            self,
+        class MyOldModel(nn.Module):
-            config,
+            def __init__(
-            cache_config: Optional[CacheConfig] = None,
+                self,
-            quant_config: Optional[QuantizationConfig] = None,
+                config,
-            lora_config: Optional[LoRAConfig] = None,
+                cache_config: Optional[CacheConfig] = None,
-            prefix: str = "",
+                quant_config: Optional[QuantizationConfig] = None,
-        ) -> None:
+                lora_config: Optional[LoRAConfig] = None,
-            ...
+                prefix: str = "",
+            ) -> None:
-    from vllm.config import VllmConfig
+                ...
-    class MyNewModel(MyOldModel):
-        def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""):
+        from vllm.config import VllmConfig
-            config = vllm_config.model_config.hf_config
+        class MyNewModel(MyOldModel):
-            cache_config = vllm_config.cache_config
+            def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""):
-            quant_config = vllm_config.quant_config
+                config = vllm_config.model_config.hf_config
-            lora_config = vllm_config.lora_config
+                cache_config = vllm_config.cache_config
-            super().__init__(config, cache_config, quant_config, lora_config, prefix)
+                quant_config = vllm_config.quant_config
+                lora_config = vllm_config.lora_config
-    if __version__ >= "0.6.4":
+                super().__init__(config, cache_config, quant_config, lora_config, prefix)
-        MyModel = MyNewModel
-    else:
+        if __version__ >= "0.6.4":
-        MyModel = MyOldModel
+            MyModel = MyNewModel
-    ```
+        else:
+            MyModel = MyOldModel
+        ```
    This way, the model can work with both old and new versions of vLLM.
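
One caveat about the shim above: `__version__ >= "0.6.4"` compares strings lexicographically, so a version such as `"0.10.0"` sorts before `"0.6.4"`. A minimal standard-library sketch of a numeric comparison follows. The helper names `release_tuple` and `uses_new_constructor` are hypothetical, and the sketch treats pre-release and dev suffixes as equal to the plain release, which is a simplifying assumption. If the `packaging` library is available, `packaging.version.Version` handles those suffixes properly.

```python
import re


def release_tuple(version: str) -> tuple[int, ...]:
    """Return the leading numeric release segment, e.g. '0.10.0.post1' -> (0, 10, 0)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.group().split("."))


def uses_new_constructor(version: str, threshold: str = "0.6.4") -> bool:
    """Hypothetical check: True if `version` is at or after `threshold`, compared numerically."""
    return release_tuple(version) >= release_tuple(threshold)


# The plain string comparison gets this case wrong; the numeric one does not.
print("0.10.0" >= "0.6.4")             # False
print(uses_new_constructor("0.10.0"))  # True
```

In the shim, the condition would then read `if uses_new_constructor(__version__):`, assuming `__version__` has been imported from `vllm`.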