Merge tag 'v0.9.2' into v0.9.2-dev

Commit a40a133c authored Jul 18, 2025 by zhuwenwen
20 changed files
--- a/docs/features/quantization/README.md
+++ b/docs/features/quantization/README.md
@@ -7,16 +7,16 @@ Quantization trades off model precision for smaller memory footprint, allowing l
 Contents:
- [Supported_Hardware](supported_hardware.md)
+- [Supported Hardware](supported_hardware.md)
- [Auto_Awq](auto_awq.md)
+- [AutoAWQ](auto_awq.md)
- [Bnb](bnb.md)
+- [BitsAndBytes](bnb.md)
- [Bitblas](bitblas.md)
+- [BitBLAS](bitblas.md)
- [Gguf](gguf.md)
+- [GGUF](gguf.md)
- [Gptqmodel](gptqmodel.md)
+- [GPTQModel](gptqmodel.md)
- [Int4](int4.md)
+- [INT4 W4A16](int4.md)
- [Int8](int8.md)
+- [INT8 W8A8](int8.md)
- [Fp8](fp8.md)
+- [FP8 W8A8](fp8.md)
- [Modelopt](modelopt.md)
+- [NVIDIA TensorRT Model Optimizer](modelopt.md)
- [Quark](quark.md)
+- [AMD Quark](quark.md)
- [Quantized_Kvcache](quantized_kvcache.md)
+- [Quantized KV Cache](quantized_kvcache.md)
- [Torchao](torchao.md)
+- [TorchAO](torchao.md)
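Note on the README hunk: the removed labels are consistent with titles derived mechanically from the file names (`auto_awq.md` becoming `Auto_Awq`), while the added labels are hand-written product names. The sketch below checks that observation against the entries in this hunk. That the old labels were produced by something like Python's `str.title()` is an assumption; the commit does not say how they were generated.

```python
from pathlib import PurePosixPath

# (old label, new label, link target) triples copied from the README hunk.
ENTRIES = [
    ("Supported_Hardware", "Supported Hardware", "supported_hardware.md"),
    ("Auto_Awq", "AutoAWQ", "auto_awq.md"),
    ("Bnb", "BitsAndBytes", "bnb.md"),
    ("Bitblas", "BitBLAS", "bitblas.md"),
    ("Gguf", "GGUF", "gguf.md"),
    ("Gptqmodel", "GPTQModel", "gptqmodel.md"),
    ("Int4", "INT4 W4A16", "int4.md"),
    ("Int8", "INT8 W8A8", "int8.md"),
    ("Fp8", "FP8 W8A8", "fp8.md"),
    ("Modelopt", "NVIDIA TensorRT Model Optimizer", "modelopt.md"),
    ("Quark", "AMD Quark", "quark.md"),
    ("Quantized_Kvcache", "Quantized KV Cache", "quantized_kvcache.md"),
    ("Torchao", "TorchAO", "torchao.md"),
]


def derived_label(target: str) -> str:
    """Title-case the file stem (assumed rule behind the old labels)."""
    return PurePosixPath(target).stem.title()


def render(entries) -> str:
    """Render the new table of contents as a Markdown bullet list."""
    return "\n".join(f"- [{new}]({target})" for _old, new, target in entries)


# Old labels that do NOT follow the file-name rule; empty means all of them do.
mismatches = [(old, target) for old, _new, target in ENTRIES
              if derived_label(target) != old]
print(mismatches)
print(render(ENTRIES))
```

An explicit mapping like `ENTRIES` is what the new README effectively encodes: names such as `AutoAWQ` or `INT4 W4A16` cannot be recovered from the file name alone.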
--- a/docs/features/quantization/auto_awq.md
+++ b/docs/features/quantization/auto_awq.md
@@ -9,39 +9,41 @@ The main benefits are lower latency and memory usage.
 You can quantize your own models by installing AutoAWQ or picking one of the [6500+ models on Huggingface](https://huggingface.co/models?search=awq).
-```console
+```bash
 pip install autoawq
 ```
 After installing AutoAWQ, you are ready to quantize a model. Please refer to the [AutoAWQ documentation](https://casper-hansen.github.io/AutoAWQ/examples/#basic-quantization) for further details. Here is an example of how to quantize `mistralai/Mistral-7B-Instruct-v0.2`:
-```python
+??? Code
-from awq import AutoAWQForCausalLM
-from transformers import AutoTokenizer
-model_path = 'mistralai/Mistral-7B-Instruct-v0.2'
+    ```python
-quant_path = 'mistral-instruct-v0.2-awq'
+    from awq import AutoAWQForCausalLM
-quant_config = { "zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM" }
+    from transformers import AutoTokenizer
-# Load model
+    model_path = 'mistralai/Mistral-7B-Instruct-v0.2'
-model = AutoAWQForCausalLM.from_pretrained(
+    quant_path = 'mistral-instruct-v0.2-awq'
-    model_path, **{"low_cpu_mem_usage": True, "use_cache": False}
+    quant_config = { "zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM" }
-)
-tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
-# Quantize
+    # Load model
-model.quantize(tokenizer, quant_config=quant_config)
+    model = AutoAWQForCausalLM.from_pretrained(
+        model_path, **{"low_cpu_mem_usage": True, "use_cache": False}
+    )
+    tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
-# Save quantized model
+    # Quantize
-model.save_quantized(quant_path)
+    model.quantize(tokenizer, quant_config=quant_config)
-tokenizer.save_pretrained(quant_path)
-print(f'Model is quantized and saved at "{quant_path}"')
+    # Save quantized model
-```
+    model.save_quantized(quant_path)
+    tokenizer.save_pretrained(quant_path)
+    print(f'Model is quantized and saved at "{quant_path}"')
+    ```
 To run an AWQ model with vLLM, you can use [TheBloke/Llama-2-7b-Chat-AWQ](https://huggingface.co/TheBloke/Llama-2-7b-Chat-AWQ) with the following command:
-```console
+```bash
 python examples/offline_inference/llm_engine_example.py \
    --model TheBloke/Llama-2-7b-Chat-AWQ \
    --quantization awq
@@ -49,27 +51,29 @@ python examples/offline_inference/llm_engine_example.py \
 AWQ models are also supported directly through the LLM entrypoint:
-```python
+??? Code
-from vllm import LLM, SamplingParams
+    ```python
-# Sample prompts.
+    from vllm import LLM, SamplingParams
-prompts = [
-    "Hello, my name is",
+    # Sample prompts.
-    "The president of the United States is",
+    prompts = [
-    "The capital of France is",
+        "Hello, my name is",
-    "The future of AI is",
+        "The president of the United States is",
-]
+        "The capital of France is",
-# Create a sampling params object.
+        "The future of AI is",
-sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
+    ]
+    # Create a sampling params object.
-# Create an LLM.
+    sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
-llm = LLM(model="TheBloke/Llama-2-7b-Chat-AWQ", quantization="AWQ")
-# Generate texts from the prompts. The output is a list of RequestOutput objects
+    # Create an LLM.
-# that contain the prompt, generated text, and other information.
+    llm = LLM(model="TheBloke/Llama-2-7b-Chat-AWQ", quantization="AWQ")
-outputs = llm.generate(prompts, sampling_params)
+    # Generate texts from the prompts. The output is a list of RequestOutput objects
-# Print the outputs.
+    # that contain the prompt, generated text, and other information.
-for output in outputs:
+    outputs = llm.generate(prompts, sampling_params)
-    prompt = output.prompt
+    # Print the outputs.
-    generated_text = output.outputs[0].text
+    for output in outputs:
-    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
+        prompt = output.prompt
-```
+        generated_text = output.outputs[0].text
+        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
+    ```
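Note on the `??? Code` change: `???` is the collapsible-block syntax of the PyMdown "Details" extension commonly used with MkDocs Material, and its body has to be indented by four spaces. That is why every line of each Python example is re-added with extra indentation in these hunks. The sketch below shows the transformation on a small sample. It is a hypothetical helper, not the tool used for this commit (the commit does not say how the conversion was done), and it leaves the separate `console` to `bash` rename alone.

```python
FENCE = "`" * 3  # built at runtime so this example can itself live inside a fence


def wrap_python_fences(markdown: str, title: str = "Code", indent: str = "    ") -> str:
    """Wrap each unindented python fence in a collapsed `??? <title>` block."""
    out = []
    inside = False
    for line in markdown.splitlines():
        if not inside and line == FENCE + "python":
            out.append(f"??? {title}")
            out.append("")  # blank line between the marker and its body
            inside = True
        if inside:
            # Indent non-empty lines only, so blank lines stay truly empty.
            out.append(indent + line if line else line)
            if line.strip() == FENCE:
                inside = False  # closing fence reached
        else:
            out.append(line)
    return "\n".join(out)


sample = "\n".join([
    "Intro text.",
    FENCE + "python",
    "from vllm import LLM",
    "",
    'llm = LLM(model="TheBloke/Llama-2-7b-Chat-AWQ", quantization="AWQ")',
    FENCE,
    FENCE + "bash",
    "pip install autoawq",
    FENCE,
])
converted = wrap_python_fences(sample)
print(converted)
```

Only fences tagged `python` are wrapped; the `bash` fence in the sample passes through unchanged, matching what the hunks in this commit show.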
--- a/docs/features/quantization/bitblas.md
+++ b/docs/features/quantization/bitblas.md
@@ -12,7 +12,7 @@ vLLM now supports [BitBLAS](https://github.com/microsoft/BitBLAS) for more effic
 Below are the steps to utilize BitBLAS with vLLM.
-```console
+```bash
 pip install bitblas>=0.1.0
 ```
@@ -43,17 +43,19 @@ llm = LLM(
 ## Read gptq format checkpoint
-```python
+??? Code
-from vllm import LLM
-import torch
+    ```python
+    from vllm import LLM
-# "hxbgsyxh/llama-13b-4bit-g-1" is a pre-quantized checkpoint.
+    import torch
-model_id = "hxbgsyxh/llama-13b-4bit-g-1"
-llm = LLM(
+    # "hxbgsyxh/llama-13b-4bit-g-1" is a pre-quantized checkpoint.
-    model=model_id,
+    model_id = "hxbgsyxh/llama-13b-4bit-g-1"
-    dtype=torch.float16,
+    llm = LLM(
-    trust_remote_code=True,
+        model=model_id,
-    quantization="bitblas",
+        dtype=torch.float16,
-    max_model_len=1024
+        trust_remote_code=True,
-)
+        quantization="bitblas",
-```
+        max_model_len=1024
+    )
+    ```
--- a/docs/features/quantization/bnb.md
+++ b/docs/features/quantization/bnb.md
@@ -9,8 +9,8 @@ Compared to other quantization methods, BitsAndBytes eliminates the need for cal
 Below are the steps to utilize BitsAndBytes with vLLM.
-```console
+```bash
-pip install bitsandbytes>=0.45.3
+pip install bitsandbytes>=0.46.1
 ```
 vLLM reads the model's config file and supports both in-flight quantization and pre-quantized checkpoint.
@@ -54,6 +54,6 @@ llm = LLM(
 Append the following to your model arguments for 4bit inflight quantization:
-```console
+```bash
 --quantization bitsandbytes
 ```
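Note on the install lines: with the fences now tagged `bash`, readers are more likely to paste them into a shell as-is. In Bash and other POSIX shells an unquoted `>` starts an output redirection, so `pip install bitsandbytes>=0.46.1` installs `bitsandbytes` with no version bound and writes pip's output to a file named `=0.46.1`. The same applies to `pip install bitblas>=0.1.0` in the BitBLAS page. The sketch below uses Python's `shlex` as an approximation of shell tokenization (it is not a full shell parser) to show what quoting changes.

```python
import shlex


def shell_tokens(command: str) -> list[str]:
    """Split a command roughly as a POSIX shell would, keeping operators as tokens."""
    lexer = shlex.shlex(command, posix=True, punctuation_chars=True)
    return list(lexer)


unquoted = shell_tokens("pip install bitsandbytes>=0.46.1")
quoted = shell_tokens("pip install 'bitsandbytes>=0.46.1'")

print(unquoted)  # ">" appears as its own token, i.e. a redirection operator
print(quoted)    # the whole requirement survives as a single argument
```

Quoting the requirement, as in `pip install 'bitsandbytes>=0.46.1'`, keeps the version bound attached to the package name.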
--- a/docs/features/quantization/fp8.md
+++ b/docs/features/quantization/fp8.md
@@ -23,7 +23,7 @@ The FP8 types typically supported in hardware have two distinct representations,
 To produce performant FP8 quantized models with vLLM, you'll need to install the [llm-compressor](https://github.com/vllm-project/llm-compressor/) library:
-```console
+```bash
 pip install llmcompressor
 ```
@@ -58,28 +58,30 @@ For FP8 quantization, we can recover accuracy with simple RTN quantization. We r
 Since simple RTN does not require data for weight quantization and the activations are quantized dynamically, we do not need any calibration data for this quantization flow.
-```python
+??? Code
-from llmcompressor.transformers import oneshot
-from llmcompressor.modifiers.quantization import QuantizationModifier
-# Configure the simple PTQ quantization
+    ```python
-recipe = QuantizationModifier(
+    from llmcompressor.transformers import oneshot
-  targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])
+    from llmcompressor.modifiers.quantization import QuantizationModifier
-# Apply the quantization algorithm.
+    # Configure the simple PTQ quantization
-oneshot(model=model, recipe=recipe)
+    recipe = QuantizationModifier(
+      targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])
-# Save the model: Meta-Llama-3-8B-Instruct-FP8-Dynamic
+    # Apply the quantization algorithm.
-SAVE_DIR = MODEL_ID.split("/")[1] + "-FP8-Dynamic"
+    oneshot(model=model, recipe=recipe)
-model.save_pretrained(SAVE_DIR)
-tokenizer.save_pretrained(SAVE_DIR)
+    # Save the model: Meta-Llama-3-8B-Instruct-FP8-Dynamic
-```
+    SAVE_DIR = MODEL_ID.split("/")[1] + "-FP8-Dynamic"
+    model.save_pretrained(SAVE_DIR)
+    tokenizer.save_pretrained(SAVE_DIR)
+    ```
 ### 3. Evaluating Accuracy
 Install `vllm` and `lm-evaluation-harness` for evaluation:
-```console
+```bash
 pip install vllm lm-eval==0.4.4
 ```
@@ -97,9 +99,9 @@ Evaluate accuracy with `lm_eval` (for example on 250 samples of `gsm8k`):
 !!! note
    Quantized models can be sensitive to the presence of the `bos` token. `lm_eval` does not add a `bos` token by default, so make sure to include the `add_bos_token=True` argument when running your evaluations.
-```console
+```bash
-$ MODEL=$PWD/Meta-Llama-3-8B-Instruct-FP8-Dynamic
+MODEL=$PWD/Meta-Llama-3-8B-Instruct-FP8-Dynamic
-$ lm_eval \
+lm_eval \
  --model vllm \
  --model_args pretrained=$MODEL,add_bos_token=True \
  --tasks gsm8k  --num_fewshot 5 --batch_size auto --limit 250
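Note on the dropped `$ ` prompts: a `console` fence conventionally shows a terminal transcript, prompt included, whereas a `bash` fence reads as a script that can be pasted directly, where a leading `$ ` would be taken as part of the command. The sketch below is a hypothetical helper showing that cleanup on the command from this hunk; whether the commit was produced with a script or by hand is not stated.

```python
def strip_prompts(block: str, prompt: str = "$ ") -> str:
    """Remove a leading shell prompt from each line of a console transcript."""
    cleaned = []
    for line in block.splitlines():
        # Continuation lines have no prompt and are kept untouched.
        cleaned.append(line[len(prompt):] if line.startswith(prompt) else line)
    return "\n".join(cleaned)


console_block = "\n".join([
    "$ MODEL=$PWD/Meta-Llama-3-8B-Instruct-FP8-Dynamic",
    "$ lm_eval \\",
    "  --model vllm \\",
    "  --model_args pretrained=$MODEL,add_bos_token=True \\",
    "  --tasks gsm8k  --num_fewshot 5 --batch_size auto --limit 250",
])
bash_block = strip_prompts(console_block)
print(bash_block)
```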

--- a/docs/features/quantization/gguf.md
+++ b/docs/features/quantization/gguf.md
@@ -11,7 +11,7 @@ title: GGUF
 To run a GGUF model with vLLM, you can download and use the local GGUF model from [TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF](https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF) with the following command:
-```console
+```bash
 wget https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF/resolve/main/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf
 # We recommend using the tokenizer from base model to avoid long-time and buggy tokenizer conversion.
 vllm serve ./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf \
@@ -20,7 +20,7 @@ vllm serve ./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf \
 You can also add `--tensor-parallel-size 2` to enable tensor parallelism inference with 2 GPUs:
-```console
+```bash
 # We recommend using the tokenizer from base model to avoid long-time and buggy tokenizer conversion.
 vllm serve ./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf \
   --tokenizer TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
@@ -32,7 +32,7 @@ vllm serve ./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf \
 GGUF assumes that huggingface can convert the metadata to a config file. In case huggingface doesn't support your model you can manually create a config and pass it as hf-config-path
-```console
+```bash
 # If you model is not supported by huggingface you can manually provide a huggingface compatible config path
 vllm serve ./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf \
   --tokenizer TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
@@ -41,42 +41,44 @@ vllm serve ./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf \
 You can also use the GGUF model directly through the LLM entrypoint:
-```python
+??? Code
-from vllm import LLM, SamplingParams
+      ```python
-# In this script, we demonstrate how to pass input to the chat method:
+      from vllm import LLM, SamplingParams
-conversation = [
-   {
+      # In this script, we demonstrate how to pass input to the chat method:
-      "role": "system",
+      conversation = [
-      "content": "You are a helpful assistant"
+         {
-   },
+            "role": "system",
-   {
+            "content": "You are a helpful assistant"
-      "role": "user",
+         },
-      "content": "Hello"
+         {
-   },
+            "role": "user",
-   {
+            "content": "Hello"
-      "role": "assistant",
+         },
-      "content": "Hello! How can I assist you today?"
+         {
-   },
+            "role": "assistant",
-   {
+            "content": "Hello! How can I assist you today?"
-      "role": "user",
+         },
-      "content": "Write an essay about the importance of higher education.",
+         {
-   },
+            "role": "user",
-]
+            "content": "Write an essay about the importance of higher education.",
+         },
-# Create a sampling params object.
+      ]
-sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
+      # Create a sampling params object.
-# Create an LLM.
+      sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
-llm = LLM(model="./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf",
-         tokenizer="TinyLlama/TinyLlama-1.1B-Chat-v1.0")
+      # Create an LLM.
-# Generate texts from the prompts. The output is a list of RequestOutput objects
+      llm = LLM(model="./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf",
-# that contain the prompt, generated text, and other information.
+               tokenizer="TinyLlama/TinyLlama-1.1B-Chat-v1.0")
-outputs = llm.chat(conversation, sampling_params)
+      # Generate texts from the prompts. The output is a list of RequestOutput objects
+      # that contain the prompt, generated text, and other information.
-# Print the outputs.
+      outputs = llm.chat(conversation, sampling_params)
-for output in outputs:
-   prompt = output.prompt
+      # Print the outputs.
-   generated_text = output.outputs[0].text
+      for output in outputs:
-   print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
+         prompt = output.prompt
-```
+         generated_text = output.outputs[0].text
+         print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
+      ```
--- a/docs/features/quantization/gptqmodel.md
+++ b/docs/features/quantization/gptqmodel.md
@@ -21,7 +21,7 @@ for more details on this and other advanced features.
 You can quantize your own models by installing [GPTQModel](https://github.com/ModelCloud/GPTQModel) or picking one of the [5000+ models on Huggingface](https://huggingface.co/models?search=gptq).
-```console
+```bash
 pip install -U gptqmodel --no-build-isolation -v
 ```
@@ -31,34 +31,36 @@ After installing GPTQModel, you are ready to quantize a model. Please refer to t
 Here is an example of how to quantize `meta-llama/Llama-3.2-1B-Instruct`:
-```python
+??? Code
-from datasets import load_dataset
-from gptqmodel import GPTQModel, QuantizeConfig
-model_id = "meta-llama/Llama-3.2-1B-Instruct"
+    ```python
-quant_path = "Llama-3.2-1B-Instruct-gptqmodel-4bit"
+    from datasets import load_dataset
+    from gptqmodel import GPTQModel, QuantizeConfig
-calibration_dataset = load_dataset(
+    model_id = "meta-llama/Llama-3.2-1B-Instruct"
-    "allenai/c4",
+    quant_path = "Llama-3.2-1B-Instruct-gptqmodel-4bit"
-    data_files="en/c4-train.00001-of-01024.json.gz",
-    split="train"
-  ).select(range(1024))["text"]
-quant_config = QuantizeConfig(bits=4, group_size=128)
+    calibration_dataset = load_dataset(
+        "allenai/c4",
+        data_files="en/c4-train.00001-of-01024.json.gz",
+        split="train"
+    ).select(range(1024))["text"]
-model = GPTQModel.load(model_id, quant_config)
+    quant_config = QuantizeConfig(bits=4, group_size=128)
-# increase `batch_size` to match gpu/vram specs to speed up quantization
+    model = GPTQModel.load(model_id, quant_config)
-model.quantize(calibration_dataset, batch_size=2)
-model.save(quant_path)
+    # increase `batch_size` to match gpu/vram specs to speed up quantization
-```
+    model.quantize(calibration_dataset, batch_size=2)
+    model.save(quant_path)
+    ```
 ## Running a quantized model with vLLM
 To run an GPTQModel quantized model with vLLM, you can use [DeepSeek-R1-Distill-Qwen-7B-gptqmodel-4bit-vortex-v2](https://huggingface.co/ModelCloud/DeepSeek-R1-Distill-Qwen-7B-gptqmodel-4bit-vortex-v2) with the following command:
-```console
+```bash
 python examples/offline_inference/llm_engine_example.py \
    --model ModelCloud/DeepSeek-R1-Distill-Qwen-7B-gptqmodel-4bit-vortex-v2
 ```
@@ -67,32 +69,34 @@ python examples/offline_inference/llm_engine_example.py \
 GPTQModel quantized models are also supported directly through the LLM entrypoint:
-```python
+??? Code
-from vllm import LLM, SamplingParams
+    ```python
-# Sample prompts.
+    from vllm import LLM, SamplingParams
-prompts = [
-    "Hello, my name is",
+    # Sample prompts.
-    "The president of the United States is",
+    prompts = [
-    "The capital of France is",
+        "Hello, my name is",
-    "The future of AI is",
+        "The president of the United States is",
-]
+        "The capital of France is",
+        "The future of AI is",
-# Create a sampling params object.
+    ]
-sampling_params = SamplingParams(temperature=0.6, top_p=0.9)
+    # Create a sampling params object.
-# Create an LLM.
+    sampling_params = SamplingParams(temperature=0.6, top_p=0.9)
-llm = LLM(model="ModelCloud/DeepSeek-R1-Distill-Qwen-7B-gptqmodel-4bit-vortex-v2")
+    # Create an LLM.
-# Generate texts from the prompts. The output is a list of RequestOutput objects
+    llm = LLM(model="ModelCloud/DeepSeek-R1-Distill-Qwen-7B-gptqmodel-4bit-vortex-v2")
-# that contain the prompt, generated text, and other information.
-outputs = llm.generate(prompts, sampling_params)
+    # Generate texts from the prompts. The output is a list of RequestOutput objects
+    # that contain the prompt, generated text, and other information.
-# Print the outputs.
+    outputs = llm.generate(prompts, sampling_params)
-print("-"*50)
-for output in outputs:
+    # Print the outputs.
-    prompt = output.prompt
-    generated_text = output.outputs[0].text
-    print(f"Prompt: {prompt!r}\nGenerated text: {generated_text!r}")
    print("-"*50)
-```
+    for output in outputs:
+        prompt = output.prompt
+        generated_text = output.outputs[0].text
+        print(f"Prompt: {prompt!r}\nGenerated text: {generated_text!r}")
+        print("-"*50)
+    ```
--- a/docs/features/quantization/int4.md
+++ b/docs/features/quantization/int4.md
@@ -14,13 +14,13 @@ Please visit the HF collection of [quantized INT4 checkpoints of popular LLMs re
 To use INT4 quantization with vLLM, you'll need to install the [llm-compressor](https://github.com/vllm-project/llm-compressor/) library:
-```console
+```bash
 pip install llmcompressor
 ```
 Additionally, install `vllm` and `lm-evaluation-harness` for evaluation:
-```console
+```bash
 pip install vllm lm-eval==0.4.4
 ```
@@ -53,51 +53,55 @@ When quantizing weights to INT4, you need sample data to estimate the weight upd
 It's best to use calibration data that closely matches your deployment data.
 For a general-purpose instruction-tuned model, you can use a dataset like `ultrachat`:
-```python
+??? Code
-from datasets import load_dataset
-NUM_CALIBRATION_SAMPLES = 512
+    ```python
-MAX_SEQUENCE_LENGTH = 2048
+    from datasets import load_dataset
-# Load and preprocess the dataset
+    NUM_CALIBRATION_SAMPLES = 512
-ds = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft")
+    MAX_SEQUENCE_LENGTH = 2048
-ds = ds.shuffle(seed=42).select(range(NUM_CALIBRATION_SAMPLES))
-def preprocess(example):
+    # Load and preprocess the dataset
-    return {"text": tokenizer.apply_chat_template(example["messages"], tokenize=False)}
+    ds = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft")
-ds = ds.map(preprocess)
+    ds = ds.shuffle(seed=42).select(range(NUM_CALIBRATION_SAMPLES))
-def tokenize(sample):
+    def preprocess(example):
-    return tokenizer(sample["text"], padding=False, max_length=MAX_SEQUENCE_LENGTH, truncation=True, add_special_tokens=False)
+        return {"text": tokenizer.apply_chat_template(example["messages"], tokenize=False)}
-ds = ds.map(tokenize, remove_columns=ds.column_names)
+    ds = ds.map(preprocess)
-```
+    def tokenize(sample):
+        return tokenizer(sample["text"], padding=False, max_length=MAX_SEQUENCE_LENGTH, truncation=True, add_special_tokens=False)
+    ds = ds.map(tokenize, remove_columns=ds.column_names)
+    ```
 ### 3. Applying Quantization
 Now, apply the quantization algorithms:
-```python
+??? Code
-from llmcompressor.transformers import oneshot
-from llmcompressor.modifiers.quantization import GPTQModifier
-from llmcompressor.modifiers.smoothquant import SmoothQuantModifier
-# Configure the quantization algorithms
-recipe = GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"])
-# Apply quantization
-oneshot(
-    model=model,
-    dataset=ds,
-    recipe=recipe,
-    max_seq_length=MAX_SEQUENCE_LENGTH,
-    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
-)
-# Save the compressed model: Meta-Llama-3-8B-Instruct-W4A16-G128
+    ```python
-SAVE_DIR = MODEL_ID.split("/")[1] + "-W4A16-G128"
+    from llmcompressor.transformers import oneshot
-model.save_pretrained(SAVE_DIR, save_compressed=True)
+    from llmcompressor.modifiers.quantization import GPTQModifier
-tokenizer.save_pretrained(SAVE_DIR)
+    from llmcompressor.modifiers.smoothquant import SmoothQuantModifier
-```
+    # Configure the quantization algorithms
+    recipe = GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"])
+    # Apply quantization
+    oneshot(
+        model=model,
+        dataset=ds,
+        recipe=recipe,
+        max_seq_length=MAX_SEQUENCE_LENGTH,
+        num_calibration_samples=NUM_CALIBRATION_SAMPLES,
+    )
+    # Save the compressed model: Meta-Llama-3-8B-Instruct-W4A16-G128
+    SAVE_DIR = MODEL_ID.split("/")[1] + "-W4A16-G128"
+    model.save_pretrained(SAVE_DIR, save_compressed=True)
+    tokenizer.save_pretrained(SAVE_DIR)
+    ```
 This process creates a W4A16 model with weights quantized to 4-bit integers.
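To make the recipe parameters concrete, here is a minimal sketch of what `num_bits=4`, `group_size=128` and `symmetric=True` describe: each run of 128 weights shares one scale, and every weight is stored as an integer in the range -8 to 7. It uses plain round-to-nearest, which is a deliberate simplification. GPTQ additionally uses the calibration data to compensate for rounding error, and none of that is modeled here. The function names are illustrative and not part of llm-compressor.

```python
def quantize_group(weights, num_bits=4):
    """Symmetric round-to-nearest quantization of one group of weights."""
    qmin, qmax = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1  # -8..7 for INT4
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [min(qmax, max(qmin, round(w / scale))) for w in weights]
    return codes, scale


def quantize_row(row, group_size=128, num_bits=4):
    """Quantize a weight row in independent groups, one scale per group."""
    groups = [row[i:i + group_size] for i in range(0, len(row), group_size)]
    return [quantize_group(g, num_bits) for g in groups]


def dequantize_row(quantized):
    """Rebuild approximate float weights from integer codes and group scales."""
    return [code * scale for codes, scale in quantized for code in codes]


# Toy row: 256 weights, so two groups of 128 with very different magnitudes.
row = ([0.01 * ((i % 17) - 8) for i in range(128)]
       + [1.5 * ((i % 5) - 2) for i in range(128)])
quantized = quantize_row(row)
restored = dequantize_row(quantized)
worst_error = max(abs(a - b) for a, b in zip(row, restored))
print(len(quantized), [scale for _, scale in quantized], worst_error)
```

Because the small-valued group gets its own scale, its weights are not crushed to zero by the large-valued group, which is the motivation for grouping in the first place.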
@@ -112,8 +116,8 @@ model = LLM("./Meta-Llama-3-8B-Instruct-W4A16-G128")
 To evaluate accuracy, you can use `lm_eval`:
-```console
+```bash
-$ lm_eval --model vllm \
+lm_eval --model vllm \
  --model_args pretrained="./Meta-Llama-3-8B-Instruct-W4A16-G128",add_bos_token=true \
  --tasks gsm8k \
  --num_fewshot 5 \
@@ -137,34 +141,36 @@ $ lm_eval --model vllm \
 The following is an example of an expanded quantization recipe you can tune to your own use case:
-```python
+??? Code
-from compressed_tensors.quantization import (
-    QuantizationArgs,
+    ```python
-    QuantizationScheme,
+    from compressed_tensors.quantization import (
-    QuantizationStrategy,
+        QuantizationArgs,
-    QuantizationType,
+        QuantizationScheme,
-) 
+        QuantizationStrategy,
-recipe = GPTQModifier(
+        QuantizationType,
-    targets="Linear",
+    ) 
-    config_groups={
+    recipe = GPTQModifier(
-        "config_group": QuantizationScheme(
+        targets="Linear",
-            targets=["Linear"],
+        config_groups={
-            weights=QuantizationArgs(
+            "config_group": QuantizationScheme(
-                num_bits=4,
+                targets=["Linear"],
-                type=QuantizationType.INT,
+                weights=QuantizationArgs(
-                strategy=QuantizationStrategy.GROUP,
+                    num_bits=4,
-                group_size=128,
+                    type=QuantizationType.INT,
-                symmetric=True,
+                    strategy=QuantizationStrategy.GROUP,
-                dynamic=False,
+                    group_size=128,
-                actorder="weight",
+                    symmetric=True,
+                    dynamic=False,
+                    actorder="weight",
+                ),
            ),
-        ),
+        },
-    },
+        ignore=["lm_head"],
-    ignore=["lm_head"],
+        update_size=NUM_CALIBRATION_SAMPLES,
-    update_size=NUM_CALIBRATION_SAMPLES,
+        dampening_frac=0.01
-    dampening_frac=0.01
+    )
-)
+    ```
-```
 ## Troubleshooting and Support

--- a/docs/features/quantization/int8.md
+++ b/docs/features/quantization/int8.md
@@ -15,13 +15,13 @@ Please visit the HF collection of [quantized INT8 checkpoints of popular LLMs re
 To use INT8 quantization with vLLM, you'll need to install the [llm-compressor](https://github.com/vllm-project/llm-compressor/) library:
-```console
+```bash
 pip install llmcompressor
 ```
 Additionally, install `vllm` and `lm-evaluation-harness` for evaluation:
-```console
+```bash
 pip install vllm lm-eval==0.4.4
 ```
@@ -54,54 +54,60 @@ When quantizing activations to INT8, you need sample data to estimate the activa
 It's best to use calibration data that closely matches your deployment data.
 For a general-purpose instruction-tuned model, you can use a dataset like `ultrachat`:
-```python
+??? Code
-from datasets import load_dataset
-NUM_CALIBRATION_SAMPLES = 512
+    ```python
-MAX_SEQUENCE_LENGTH = 2048
+    from datasets import load_dataset
-# Load and preprocess the dataset
+    NUM_CALIBRATION_SAMPLES = 512
-ds = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft")
+    MAX_SEQUENCE_LENGTH = 2048
-ds = ds.shuffle(seed=42).select(range(NUM_CALIBRATION_SAMPLES))
-def preprocess(example):
+    # Load and preprocess the dataset
-    return {"text": tokenizer.apply_chat_template(example["messages"], tokenize=False)}
+    ds = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft")
-ds = ds.map(preprocess)
+    ds = ds.shuffle(seed=42).select(range(NUM_CALIBRATION_SAMPLES))
-def tokenize(sample):
+    def preprocess(example):
-    return tokenizer(sample["text"], padding=False, max_length=MAX_SEQUENCE_LENGTH, truncation=True, add_special_tokens=False)
+        return {"text": tokenizer.apply_chat_template(example["messages"], tokenize=False)}
-ds = ds.map(tokenize, remove_columns=ds.column_names)
+    ds = ds.map(preprocess)
-```
+    def tokenize(sample):
+        return tokenizer(sample["text"], padding=False, max_length=MAX_SEQUENCE_LENGTH, truncation=True, add_special_tokens=False)
+    ds = ds.map(tokenize, remove_columns=ds.column_names)
+    ```
+</details>
 ### 3. Applying Quantization
 Now, apply the quantization algorithms:
-```python
+??? Code
-from llmcompressor.transformers import oneshot
-from llmcompressor.modifiers.quantization import GPTQModifier
+    ```python
-from llmcompressor.modifiers.smoothquant import SmoothQuantModifier
+    from llmcompressor.transformers import oneshot
+    from llmcompressor.modifiers.quantization import GPTQModifier
-# Configure the quantization algorithms
+    from llmcompressor.modifiers.smoothquant import SmoothQuantModifier
-recipe = [
-    SmoothQuantModifier(smoothing_strength=0.8),
+    # Configure the quantization algorithms
-    GPTQModifier(targets="Linear", scheme="W8A8", ignore=["lm_head"]),
+    recipe = [
-]
+        SmoothQuantModifier(smoothing_strength=0.8),
+        GPTQModifier(targets="Linear", scheme="W8A8", ignore=["lm_head"]),
-# Apply quantization
+    ]
-oneshot(
-    model=model,
+    # Apply quantization
-    dataset=ds,
+    oneshot(
-    recipe=recipe,
+        model=model,
-    max_seq_length=MAX_SEQUENCE_LENGTH,
+        dataset=ds,
-    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
+        recipe=recipe,
-)
+        max_seq_length=MAX_SEQUENCE_LENGTH,
+        num_calibration_samples=NUM_CALIBRATION_SAMPLES,
-# Save the compressed model: Meta-Llama-3-8B-Instruct-W8A8-Dynamic-Per-Token
+    )
-SAVE_DIR = MODEL_ID.split("/")[1] + "-W8A8-Dynamic-Per-Token"
-model.save_pretrained(SAVE_DIR, save_compressed=True)
+    # Save the compressed model: Meta-Llama-3-8B-Instruct-W8A8-Dynamic-Per-Token
-tokenizer.save_pretrained(SAVE_DIR)
+    SAVE_DIR = MODEL_ID.split("/")[1] + "-W8A8-Dynamic-Per-Token"
-```
+    model.save_pretrained(SAVE_DIR, save_compressed=True)
+    tokenizer.save_pretrained(SAVE_DIR)
+    ```
 This process creates a W8A8 model with weights and activations quantized to 8-bit integers.
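The `SmoothQuantModifier(smoothing_strength=0.8)` step is easier to follow with numbers. SmoothQuant divides each activation channel by a scale and multiplies the matching weight row by the same scale, so the layer output is unchanged while activation outliers shrink. The sketch below uses the formula from the SmoothQuant method, s_j = max|X_j|^a / max|W_j|^(1-a), on toy data. Treating `smoothing_strength` as the exponent `a` is an assumption, and llm-compressor's actual implementation may differ in detail.

```python
def smoothing_scales(activations, weights, strength=0.8):
    """Per-input-channel scales s_j = max|X_j|**a / max|W_j|**(1 - a)."""
    scales = []
    for j in range(len(weights)):
        act_max = max(abs(row[j]) for row in activations)
        weight_max = max(abs(w) for w in weights[j])
        scales.append(act_max ** strength / weight_max ** (1 - strength))
    return scales


def matmul(a, b):
    """Plain nested-list matrix product, enough for this toy example."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


# Toy data: 2 tokens x 3 input channels; channel 1 carries an activation outlier.
X = [[0.5, 40.0, -1.0],
     [-0.3, -55.0, 0.8]]
# 3 input channels x 2 output channels.
W = [[0.2, -0.1],
     [0.05, 0.03],
     [-0.4, 0.3]]

s = smoothing_scales(X, W)
X_smooth = [[x / s_j for x, s_j in zip(row, s)] for row in X]
W_smooth = [[w * s_j for w in row] for row, s_j in zip(W, s)]

reference = matmul(X, W)
smoothed = matmul(X_smooth, W_smooth)
act_max_before = max(abs(v) for row in X for v in row)
act_max_after = max(abs(v) for row in X_smooth for v in row)
print(act_max_before, act_max_after)
```

A higher strength moves more of the quantization difficulty from activations into weights, which suits W8A8 because the weights are then handled by the GPTQ step that follows in the recipe.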
@@ -116,8 +122,8 @@ model = LLM("./Meta-Llama-3-8B-Instruct-W8A8-Dynamic-Per-Token")
 To evaluate accuracy, you can use `lm_eval`:
-```console
+```bash
-$ lm_eval --model vllm \
+lm_eval --model vllm \
  --model_args pretrained="./Meta-Llama-3-8B-Instruct-W8A8-Dynamic-Per-Token",add_bos_token=true \
  --tasks gsm8k \
  --num_fewshot 5 \

--- a/docs/features/quantization/modelopt.md
+++ b/docs/features/quantization/modelopt.md
@@ -4,7 +4,7 @@ The [NVIDIA TensorRT Model Optimizer](https://github.com/NVIDIA/TensorRT-Model-O
 We recommend installing the library with:
-```console
+```bash
 pip install nvidia-modelopt
 ```
@@ -14,24 +14,26 @@ You can quantize HuggingFace models using the example scripts provided in the Te
 Below is an example showing how to quantize a model using modelopt's PTQ API:
-```python
+??? Code
-import modelopt.torch.quantization as mtq
-from transformers import AutoModelForCausalLM
-# Load the model from HuggingFace
+    ```python
-model = AutoModelForCausalLM.from_pretrained("<path_or_model_id>")
+    import modelopt.torch.quantization as mtq
+    from transformers import AutoModelForCausalLM
-# Select the quantization config, for example, FP8
+    # Load the model from HuggingFace
-config = mtq.FP8_DEFAULT_CFG
+    model = AutoModelForCausalLM.from_pretrained("<path_or_model_id>")
-# Define a forward loop function for calibration
+    # Select the quantization config, for example, FP8
-def forward_loop(model):
+    config = mtq.FP8_DEFAULT_CFG
-    for data in calib_set:
-        model(data)
-# PTQ with in-place replacement of quantized modules
+    # Define a forward loop function for calibration
-model = mtq.quantize(model, config, forward_loop)
+    def forward_loop(model):
-```
+        for data in calib_set:
+            model(data)
+    # PTQ with in-place replacement of quantized modules
+    model = mtq.quantize(model, config, forward_loop)
+    ```
 After the model is quantized, you can export it to a quantized checkpoint using the export API:
@@ -48,31 +50,33 @@ with torch.inference_mode():
 The quantized checkpoint can then be deployed with vLLM. As an example, the following code shows how to deploy `nvidia/Llama-3.1-8B-Instruct-FP8`, which is the FP8 quantized checkpoint derived from `meta-llama/Llama-3.1-8B-Instruct`, using vLLM:
-```python
+??? Code
-from vllm import LLM, SamplingParams
-def main():
+    ```python
+    from vllm import LLM, SamplingParams
-    model_id = "nvidia/Llama-3.1-8B-Instruct-FP8"
+    def main():
-    # Ensure you specify quantization='modelopt' when loading the modelopt checkpoint
-    llm = LLM(model=model_id, quantization="modelopt", trust_remote_code=True)
-    sampling_params = SamplingParams(temperature=0.8, top_p=0.9)
+        model_id = "nvidia/Llama-3.1-8B-Instruct-FP8"
+        # Ensure you specify quantization='modelopt' when loading the modelopt checkpoint
+        llm = LLM(model=model_id, quantization="modelopt", trust_remote_code=True)
-    prompts = [
+        sampling_params = SamplingParams(temperature=0.8, top_p=0.9)
-        "Hello, my name is",
-        "The president of the United States is",
-        "The capital of France is",
-        "The future of AI is",
-    ]
-    outputs = llm.generate(prompts, sampling_params)
+        prompts = [
+            "Hello, my name is",
+            "The president of the United States is",
+            "The capital of France is",
+            "The future of AI is",
+        ]
-    for output in outputs:
+        outputs = llm.generate(prompts, sampling_params)
-        prompt = output.prompt
-        generated_text = output.outputs[0].text
-        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
-if __name__ == "__main__":
+        for output in outputs:
-    main()
+            prompt = output.prompt
-```
+            generated_text = output.outputs[0].text
+            print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
+    if __name__ == "__main__":
+        main()
+    ```
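The PTQ example earlier in this file leaves `calib_set` undefined and does not show what the forward loop is for. The toy below illustrates the general calibration pattern only: the quantizer calls `forward_loop(model)`, observes value ranges while data flows through, and derives a scale from them. `ToyLayer` and `toy_quantize` are hypothetical stand-ins and not part of `modelopt`; the only outside fact used is that 448 is the largest finite FP8 E4M3 value. How `mtq.quantize` computes its scales internally is not shown in this commit.

```python
class ToyLayer:
    """Hypothetical stand-in for a model; records the largest |input| seen."""

    def __init__(self):
        self.amax = 0.0

    def __call__(self, batch):
        self.amax = max(self.amax, max(abs(x) for x in batch))
        return batch


def toy_quantize(model, forward_loop, fmt_max=448.0):
    """Toy PTQ entry point: run calibration, then derive a single scale."""
    forward_loop(model)  # the caller decides which data to feed
    return model.amax / fmt_max


# Stand-in calibration data; a real run would use tokenized text batches.
calib_set = [[0.1, -2.0, 0.7], [3.5, -0.4, 1.2], [-1.1, 0.9, 0.0]]


def forward_loop(model):
    for data in calib_set:
        model(data)


scale = toy_quantize(ToyLayer(), forward_loop)
print(scale)
```

Passing the loop as a callback keeps the quantizer independent of how the calibration data is loaded and batched.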
--- a/docs/features/quantization/quantized_kvcache.md
+++ b/docs/features/quantization/quantized_kvcache.md
@@ -35,20 +35,22 @@ Studies have shown that FP8 E4M3 quantization typically only minimally degrades
 Here is an example of how to enable FP8 quantization:
-```python
+??? Code
-# To calculate kv cache scales on the fly enable the calculate_kv_scales
-# parameter
-from vllm import LLM, SamplingParams
+    ```python
+    # To calculate kv cache scales on the fly enable the calculate_kv_scales
+    # parameter
-sampling_params = SamplingParams(temperature=0.7, top_p=0.8)
+    from vllm import LLM, SamplingParams
-llm = LLM(model="meta-llama/Llama-2-7b-chat-hf",
-          kv_cache_dtype="fp8",
+    sampling_params = SamplingParams(temperature=0.7, top_p=0.8)
-          calculate_kv_scales=True)
+    llm = LLM(model="meta-llama/Llama-2-7b-chat-hf",
-prompt = "London is the capital of"
+            kv_cache_dtype="fp8",
-out = llm.generate(prompt, sampling_params)[0].outputs[0].text
+            calculate_kv_scales=True)
-print(out)
+    prompt = "London is the capital of"
-```
+    out = llm.generate(prompt, sampling_params)[0].outputs[0].text
+    print(out)
+    ```
 The `kv_cache_dtype` argument specifies the data type for KV cache storage:
 - `"auto"`: Uses the model's default "unquantized" data type
@@ -63,7 +65,7 @@ For optimal model quality when using FP8 KV Cache, we recommend using calibrated
 First, install the required dependencies:
-```console
+```bash
 pip install llmcompressor
 ```
@@ -71,67 +73,69 @@ pip install llmcompressor
 Here's a complete example using `meta-llama/Llama-3.1-8B-Instruct` (most models can use this same pattern):
-```python
+??? Code
-from datasets import load_dataset
-from transformers import AutoModelForCausalLM, AutoTokenizer
+    ```python
-from llmcompressor.transformers import oneshot
+    from datasets import load_dataset
+    from transformers import AutoModelForCausalLM, AutoTokenizer
-# Select model and load it
+    from llmcompressor.transformers import oneshot
-MODEL_ID = "meta-llama/Llama-3.1-8B-Instruct"
-model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto", torch_dtype="auto")
+    # Select model and load it
-tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
+    MODEL_ID = "meta-llama/Llama-3.1-8B-Instruct"
+    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto", torch_dtype="auto")
-# Select calibration dataset
+    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
-DATASET_ID = "HuggingFaceH4/ultrachat_200k"
-DATASET_SPLIT = "train_sft"
+    # Select calibration dataset
+    DATASET_ID = "HuggingFaceH4/ultrachat_200k"
-# Configure calibration parameters
+    DATASET_SPLIT = "train_sft"
-NUM_CALIBRATION_SAMPLES = 512  # 512 samples is a good starting point
-MAX_SEQUENCE_LENGTH = 2048
+    # Configure calibration parameters
+    NUM_CALIBRATION_SAMPLES = 512  # 512 samples is a good starting point
-# Load and preprocess dataset
+    MAX_SEQUENCE_LENGTH = 2048
-ds = load_dataset(DATASET_ID, split=DATASET_SPLIT)
-ds = ds.shuffle(seed=42).select(range(NUM_CALIBRATION_SAMPLES))
+    # Load and preprocess dataset
+    ds = load_dataset(DATASET_ID, split=DATASET_SPLIT)
-def process_and_tokenize(example):
+    ds = ds.shuffle(seed=42).select(range(NUM_CALIBRATION_SAMPLES))
-    text = tokenizer.apply_chat_template(example["messages"], tokenize=False)
-    return tokenizer(
+    def process_and_tokenize(example):
-        text,
+        text = tokenizer.apply_chat_template(example["messages"], tokenize=False)
-        padding=False,
+        return tokenizer(
-        max_length=MAX_SEQUENCE_LENGTH,
+            text,
-        truncation=True,
+            padding=False,
-        add_special_tokens=False,
+            max_length=MAX_SEQUENCE_LENGTH,
+            truncation=True,
+            add_special_tokens=False,
+        )
+    ds = ds.map(process_and_tokenize, remove_columns=ds.column_names)
+    # Configure quantization settings
+    recipe = """
+    quant_stage:
+        quant_modifiers:
+            QuantizationModifier:
+                kv_cache_scheme:
+                    num_bits: 8
+                    type: float
+                    strategy: tensor
+                    dynamic: false
+                    symmetric: true
+    """
+    # Apply quantization
+    oneshot(
+        model=model,
+        dataset=ds,
+        recipe=recipe,
+        max_seq_length=MAX_SEQUENCE_LENGTH,
+        num_calibration_samples=NUM_CALIBRATION_SAMPLES,
    )
-ds = ds.map(process_and_tokenize, remove_columns=ds.column_names)
+    # Save quantized model: Llama-3.1-8B-Instruct-FP8-KV
+    SAVE_DIR = MODEL_ID.split("/")[1] + "-FP8-KV"
-# Configure quantization settings
+    model.save_pretrained(SAVE_DIR, save_compressed=True)
-recipe = """
+    tokenizer.save_pretrained(SAVE_DIR)
-quant_stage:
+    ```
-    quant_modifiers:
-        QuantizationModifier:
-            kv_cache_scheme:
-                num_bits: 8
-                type: float
-                strategy: tensor
-                dynamic: false
-                symmetric: true
-"""
-# Apply quantization
-oneshot(
-    model=model,
-    dataset=ds,
-    recipe=recipe,
-    max_seq_length=MAX_SEQUENCE_LENGTH,
-    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
-)
-# Save quantized model: Llama-3.1-8B-Instruct-FP8-KV
-SAVE_DIR = MODEL_ID.split("/")[1] + "-FP8-KV"
-model.save_pretrained(SAVE_DIR, save_compressed=True)
-tokenizer.save_pretrained(SAVE_DIR)
-```
 The above script will create a folder in your current directory containing your quantized model (e.g., `Llama-3.1-8B-Instruct-FP8-KV`) with calibrated scales.
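To make the `strategy: tensor` and `symmetric: true` settings in the recipe more concrete, here is a minimal sketch of how a single per-tensor FP8 scale could be derived from calibration data. It assumes a plain absolute-max observer and the FP8 E4M3 maximum of 448. The helper name `per_tensor_fp8_scale` is hypothetical, and llmcompressor's own observers may compute the scale differently.

```python
FP8_E4M3_MAX = 448.0  # largest finite value in FP8 E4M3 (the "fn" variant)


def per_tensor_fp8_scale(calibration_batches):
    """Return one symmetric scale for a whole tensor (hypothetical helper).

    calibration_batches: iterable of iterables of floats, standing in for
    the K or V activations observed during calibration.
    """
    amax = 0.0
    for batch in calibration_batches:
        for value in batch:
            amax = max(amax, abs(value))
    if amax == 0.0:
        return 1.0  # avoid a zero scale when nothing was observed
    # Symmetric quantization: map [-amax, amax] onto [-448, 448].
    return amax / FP8_E4M3_MAX


scale = per_tensor_fp8_scale([[0.5, -896.0], [12.0, 3.0]])
print(scale)
```

A single scale per tensor keeps the stored metadata small, at the cost of being sensitive to outliers in the calibration set.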

--- a/docs/features/quantization/quark.md
+++ b/docs/features/quantization/quark.md
 ---
-title: AMD QUARK
+title: AMD Quark
 ---
 [](){ #quark }
@@ -13,7 +13,7 @@ AWQ, GPTQ, Rotation and SmoothQuant.
 Before quantizing models, you need to install Quark. The latest release of Quark can be installed with pip:
-```console
+```bash
 pip install amd-quark
 ```
@@ -22,13 +22,13 @@ for more installation details.
 Additionally, install `vllm` and `lm-evaluation-harness` for evaluation:
-```console
+```bash
 pip install vllm lm-eval==0.4.4
 ```
 ## Quantization Process
-After installing Quark, we will use an example to illustrate how to use Quark.  
+After installing Quark, we will use an example to illustrate how to use Quark.
 The Quark quantization process consists of the following 5 steps:
 1. Load the model
@@ -42,20 +42,22 @@ The Quark quantization process can be listed for 5 steps as below:
 Quark uses [Transformers](https://huggingface.co/docs/transformers/en/index)
 to fetch the model and tokenizer.
-```python
+??? Code
-from transformers import AutoTokenizer, AutoModelForCausalLM
-MODEL_ID = "meta-llama/Llama-2-70b-chat-hf"
+    ```python
-MAX_SEQ_LEN = 512
+    from transformers import AutoTokenizer, AutoModelForCausalLM
-model = AutoModelForCausalLM.from_pretrained(
+    MODEL_ID = "meta-llama/Llama-2-70b-chat-hf"
-    MODEL_ID, device_map="auto", torch_dtype="auto",
+    MAX_SEQ_LEN = 512
-)
-model.eval()
-tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, model_max_length=MAX_SEQ_LEN)
+    model = AutoModelForCausalLM.from_pretrained(
-tokenizer.pad_token = tokenizer.eos_token
+        MODEL_ID, device_map="auto", torch_dtype="auto",
-```
+    )
+    model.eval()
+    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, model_max_length=MAX_SEQ_LEN)
+    tokenizer.pad_token = tokenizer.eos_token
+    ```
 ### 2. Prepare the Calibration Dataloader
@@ -63,22 +65,24 @@ Quark uses the [PyTorch Dataloader](https://pytorch.org/tutorials/beginner/basic
 to load calibration data. For more details about how to use calibration datasets efficiently, please refer
 to [Adding Calibration Datasets](https://quark.docs.amd.com/latest/pytorch/calibration_datasets.html).
-```python
+??? Code
-from datasets import load_dataset
-from torch.utils.data import DataLoader
-BATCH_SIZE = 1
+    ```python
-NUM_CALIBRATION_DATA = 512
+    from datasets import load_dataset
+    from torch.utils.data import DataLoader
-# Load the dataset and get calibration data.
+    BATCH_SIZE = 1
-dataset = load_dataset("mit-han-lab/pile-val-backup", split="validation")
+    NUM_CALIBRATION_DATA = 512
-text_data = dataset["text"][:NUM_CALIBRATION_DATA]
-tokenized_outputs = tokenizer(text_data, return_tensors="pt",
+    # Load the dataset and get calibration data.
-    padding=True, truncation=True, max_length=MAX_SEQ_LEN)
+    dataset = load_dataset("mit-han-lab/pile-val-backup", split="validation")
-calib_dataloader = DataLoader(tokenized_outputs['input_ids'],
+    text_data = dataset["text"][:NUM_CALIBRATION_DATA]
-    batch_size=BATCH_SIZE, drop_last=True)
-```
+    tokenized_outputs = tokenizer(text_data, return_tensors="pt",
+        padding=True, truncation=True, max_length=MAX_SEQ_LEN)
+    calib_dataloader = DataLoader(tokenized_outputs['input_ids'],
+        batch_size=BATCH_SIZE, drop_last=True)
+    ```
 ### 3. Set the Quantization Configuration
@@ -94,42 +98,44 @@ kv-cache and the quantization algorithm is AutoSmoothQuant.
    AutoSmoothQuant config file for Llama is
    `examples/torch/language_modeling/llm_ptq/models/llama/autosmoothquant_config.json`.
-```python
+??? Code
-from quark.torch.quantization import (Config, QuantizationConfig,
-                                     FP8E4M3PerTensorSpec,
+    ```python
-                                     load_quant_algo_config_from_file)
+    from quark.torch.quantization import (Config, QuantizationConfig,
+                                        FP8E4M3PerTensorSpec,
-# Define fp8/per-tensor/static spec.
+                                        load_quant_algo_config_from_file)
-FP8_PER_TENSOR_SPEC = FP8E4M3PerTensorSpec(observer_method="min_max",
-    is_dynamic=False).to_quantization_spec()
+    # Define fp8/per-tensor/static spec.
+    FP8_PER_TENSOR_SPEC = FP8E4M3PerTensorSpec(observer_method="min_max",
-# Define global quantization config, input tensors and weight apply FP8_PER_TENSOR_SPEC.
+        is_dynamic=False).to_quantization_spec()
-global_quant_config = QuantizationConfig(input_tensors=FP8_PER_TENSOR_SPEC,
-    weight=FP8_PER_TENSOR_SPEC)
+    # Define global quantization config, input tensors and weight apply FP8_PER_TENSOR_SPEC.
+    global_quant_config = QuantizationConfig(input_tensors=FP8_PER_TENSOR_SPEC,
-# Define quantization config for kv-cache layers, output tensors apply FP8_PER_TENSOR_SPEC.
+        weight=FP8_PER_TENSOR_SPEC)
-KV_CACHE_SPEC = FP8_PER_TENSOR_SPEC
-kv_cache_layer_names_for_llama = ["*k_proj", "*v_proj"]
+    # Define quantization config for kv-cache layers, output tensors apply FP8_PER_TENSOR_SPEC.
-kv_cache_quant_config = {name :
+    KV_CACHE_SPEC = FP8_PER_TENSOR_SPEC
-    QuantizationConfig(input_tensors=global_quant_config.input_tensors,
+    kv_cache_layer_names_for_llama = ["*k_proj", "*v_proj"]
-                       weight=global_quant_config.weight,
+    kv_cache_quant_config = {name :
-                       output_tensors=KV_CACHE_SPEC)
+        QuantizationConfig(input_tensors=global_quant_config.input_tensors,
-    for name in kv_cache_layer_names_for_llama}
+                        weight=global_quant_config.weight,
-layer_quant_config = kv_cache_quant_config.copy()
+                        output_tensors=KV_CACHE_SPEC)
+        for name in kv_cache_layer_names_for_llama}
-# Define algorithm config by config file.
+    layer_quant_config = kv_cache_quant_config.copy()
-LLAMA_AUTOSMOOTHQUANT_CONFIG_FILE =
-    'examples/torch/language_modeling/llm_ptq/models/llama/autosmoothquant_config.json'
+    # Define algorithm config by config file.
-algo_config = load_quant_algo_config_from_file(LLAMA_AUTOSMOOTHQUANT_CONFIG_FILE)
+    LLAMA_AUTOSMOOTHQUANT_CONFIG_FILE = (
+        'examples/torch/language_modeling/llm_ptq/models/llama/autosmoothquant_config.json')
-EXCLUDE_LAYERS = ["lm_head"]
+    algo_config = load_quant_algo_config_from_file(LLAMA_AUTOSMOOTHQUANT_CONFIG_FILE)
-quant_config = Config(
-    global_quant_config=global_quant_config,
+    EXCLUDE_LAYERS = ["lm_head"]
-    layer_quant_config=layer_quant_config,
+    quant_config = Config(
-    kv_cache_quant_config=kv_cache_quant_config,
+        global_quant_config=global_quant_config,
-    exclude=EXCLUDE_LAYERS,
+        layer_quant_config=layer_quant_config,
-    algo_config=algo_config)
+        kv_cache_quant_config=kv_cache_quant_config,
-```
+        exclude=EXCLUDE_LAYERS,
+        algo_config=algo_config)
+    ```
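The `"*k_proj"` and `"*v_proj"` entries in the configuration are wildcard patterns over layer names. As a rough illustration of which layers they would select, the sketch below uses Python's glob-style `fnmatchcase`. That Quark matches names this way is an assumption made for illustration, and the helper `matches_kv_cache` is hypothetical.

```python
from fnmatch import fnmatchcase

KV_CACHE_PATTERNS = ["*k_proj", "*v_proj"]


def matches_kv_cache(layer_name, patterns=KV_CACHE_PATTERNS):
    """Return True if the layer name matches any glob pattern (assumed semantics)."""
    return any(fnmatchcase(layer_name, pattern) for pattern in patterns)


# Typical module names from a Hugging Face Llama checkpoint.
layer_names = [
    "model.layers.0.self_attn.q_proj",
    "model.layers.0.self_attn.k_proj",
    "model.layers.0.self_attn.v_proj",
    "lm_head",
]
selected = [name for name in layer_names if matches_kv_cache(name)]
print(selected)
```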
 ### 4. Quantize the Model and Export
@@ -139,68 +145,72 @@ HuggingFace `safetensors`, you can refer to
 [HuggingFace format exporting](https://quark.docs.amd.com/latest/pytorch/export/quark_export_hf.html)
 for more exporting format details.
-```python
+??? Code
-import torch
-from quark.torch import ModelQuantizer, ModelExporter
+    ```python
-from quark.torch.export import ExporterConfig, JsonExporterConfig
+    import torch
+    from quark.torch import ModelQuantizer, ModelExporter
-# Apply quantization.
+    from quark.torch.export import ExporterConfig, JsonExporterConfig
-quantizer = ModelQuantizer(quant_config)
-quant_model = quantizer.quantize_model(model, calib_dataloader)
+    # Apply quantization.
+    quantizer = ModelQuantizer(quant_config)
-# Freeze quantized model to export.
+    quant_model = quantizer.quantize_model(model, calib_dataloader)
-freezed_model = quantizer.freeze(model)
+    # Freeze quantized model to export.
-# Define export config.
+    freezed_model = quantizer.freeze(model)
-LLAMA_KV_CACHE_GROUP = ["*k_proj", "*v_proj"]
-export_config = ExporterConfig(json_export_config=JsonExporterConfig())
+    # Define export config.
-export_config.json_export_config.kv_cache_group = LLAMA_KV_CACHE_GROUP
+    LLAMA_KV_CACHE_GROUP = ["*k_proj", "*v_proj"]
+    export_config = ExporterConfig(json_export_config=JsonExporterConfig())
-# Model: Llama-2-70b-chat-hf-w-fp8-a-fp8-kvcache-fp8-pertensor-autosmoothquant
+    export_config.json_export_config.kv_cache_group = LLAMA_KV_CACHE_GROUP
-EXPORT_DIR = MODEL_ID.split("/")[1] + "-w-fp8-a-fp8-kvcache-fp8-pertensor-autosmoothquant"
-exporter = ModelExporter(config=export_config, export_dir=EXPORT_DIR)
+    # Model: Llama-2-70b-chat-hf-w-fp8-a-fp8-kvcache-fp8-pertensor-autosmoothquant
-with torch.no_grad():
+    EXPORT_DIR = MODEL_ID.split("/")[1] + "-w-fp8-a-fp8-kvcache-fp8-pertensor-autosmoothquant"
-    exporter.export_safetensors_model(freezed_model,
+    exporter = ModelExporter(config=export_config, export_dir=EXPORT_DIR)
-        quant_config=quant_config, tokenizer=tokenizer)
+    with torch.no_grad():
-```
+        exporter.export_safetensors_model(freezed_model,
+            quant_config=quant_config, tokenizer=tokenizer)
+    ```
 ### 5. Evaluation in vLLM
 Now, you can load and run the Quark quantized model directly through the LLM entrypoint:
-```python
+??? Code
-from vllm import LLM, SamplingParams
+    ```python
-# Sample prompts.
+    from vllm import LLM, SamplingParams
-prompts = [
-    "Hello, my name is",
+    # Sample prompts.
-    "The president of the United States is",
+    prompts = [
-    "The capital of France is",
+        "Hello, my name is",
-    "The future of AI is",
+        "The president of the United States is",
-]
+        "The capital of France is",
-# Create a sampling params object.
+        "The future of AI is",
-sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
+    ]
+    # Create a sampling params object.
-# Create an LLM.
+    sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
-llm = LLM(model="Llama-2-70b-chat-hf-w-fp8-a-fp8-kvcache-fp8-pertensor-autosmoothquant",
-          kv_cache_dtype='fp8',quantization='quark')
+    # Create an LLM.
-# Generate texts from the prompts. The output is a list of RequestOutput objects
+    llm = LLM(model="Llama-2-70b-chat-hf-w-fp8-a-fp8-kvcache-fp8-pertensor-autosmoothquant",
-# that contain the prompt, generated text, and other information.
+            kv_cache_dtype='fp8', quantization='quark')
-outputs = llm.generate(prompts, sampling_params)
+    # Generate texts from the prompts. The output is a list of RequestOutput objects
-# Print the outputs.
+    # that contain the prompt, generated text, and other information.
-print("\nGenerated Outputs:\n" + "-" * 60)
+    outputs = llm.generate(prompts, sampling_params)
-for output in outputs:
+    # Print the outputs.
-    prompt = output.prompt
+    print("\nGenerated Outputs:\n" + "-" * 60)
-    generated_text = output.outputs[0].text
+    for output in outputs:
-    print(f"Prompt:    {prompt!r}")
+        prompt = output.prompt
-    print(f"Output:    {generated_text!r}")
+        generated_text = output.outputs[0].text
-    print("-" * 60)
+        print(f"Prompt:    {prompt!r}")
-```
+        print(f"Output:    {generated_text!r}")
+        print("-" * 60)
+    ```
 Or, you can use `lm_eval` to evaluate accuracy:
-```console
+```bash
-$ lm_eval --model vllm \
+lm_eval --model vllm \
  --model_args pretrained=Llama-2-70b-chat-hf-w-fp8-a-fp8-kvcache-fp8-pertensor-autosmoothquant,kv_cache_dtype='fp8',quantization='quark' \
  --tasks gsm8k
 ```
@@ -212,7 +222,7 @@ to quantize large language models more conveniently. It supports quantizing mode
 of different quantization schemes and optimization algorithms. It can export the quantized model
 and run evaluation tasks on the fly. With the script, the example above can be:
-```console
+```bash
 python3 quantize_quark.py --model_dir meta-llama/Llama-2-70b-chat-hf \
                          --output_dir /path/to/output \
                          --quant_scheme w_fp8_a_fp8 \

--- a/docs/features/quantization/torchao.md
+++ b/docs/features/quantization/torchao.md
@@ -4,7 +4,7 @@ TorchAO is an architecture optimization library for PyTorch, it provides high pe
 We recommend installing the latest torchao nightly with
-```console
+```bash
 # Install the latest TorchAO nightly build
 # Choose the CUDA version that matches your system (cu126, cu128, etc.)
 pip install \
@@ -15,26 +15,28 @@ pip install \
 ## Quantizing HuggingFace Models
 You can quantize your own Hugging Face models with torchao, e.g. [transformers](https://huggingface.co/docs/transformers/main/en/quantization/torchao) and [diffusers](https://huggingface.co/docs/diffusers/en/quantization/torchao) models, and save the checkpoint to the Hugging Face Hub, like [this one](https://huggingface.co/jerryzh168/llama3-8b-int8wo), with the following example code:
-```Python
+??? Code
-import torch
-from transformers import TorchAoConfig, AutoModelForCausalLM, AutoTokenizer
+    ```Python
-from torchao.quantization import Int8WeightOnlyConfig
+    import torch
+    from transformers import TorchAoConfig, AutoModelForCausalLM, AutoTokenizer
-model_name = "meta-llama/Meta-Llama-3-8B"
+    from torchao.quantization import Int8WeightOnlyConfig
-quantization_config = TorchAoConfig(Int8WeightOnlyConfig())
-quantized_model = AutoModelForCausalLM.from_pretrained(
+    model_name = "meta-llama/Meta-Llama-3-8B"
-    model_name,
+    quantization_config = TorchAoConfig(Int8WeightOnlyConfig())
-    torch_dtype="auto",
+    quantized_model = AutoModelForCausalLM.from_pretrained(
-    device_map="auto",
+        model_name,
-    quantization_config=quantization_config
+        torch_dtype="auto",
-)
+        device_map="auto",
-tokenizer = AutoTokenizer.from_pretrained(model_name)
+        quantization_config=quantization_config
-input_text = "What are we having for dinner?"
+    )
-input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")
+    tokenizer = AutoTokenizer.from_pretrained(model_name)
+    input_text = "What are we having for dinner?"
-hub_repo = # YOUR HUB REPO ID
+    input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")
-tokenizer.push_to_hub(hub_repo)
-quantized_model.push_to_hub(hub_repo, safe_serialization=False)
+    hub_repo = "YOUR_HUB_REPO_ID"  # placeholder: replace with your Hub repo ID
-```
+    tokenizer.push_to_hub(hub_repo)
+    quantized_model.push_to_hub(hub_repo, safe_serialization=False)
+    ```
 Alternatively, you can use the [TorchAO Quantization space](https://huggingface.co/spaces/medmekk/TorchAO_Quantization) for quantizing models with a simple UI.
--- a/docs/features/reasoning_outputs.md
+++ b/docs/features/reasoning_outputs.md
@@ -33,34 +33,36 @@ vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B \
 Next, make a request to the model that should return the reasoning content in the response.
-```python
+??? Code
-from openai import OpenAI
-# Modify OpenAI's API key and API base to use vLLM's API server.
+    ```python
-openai_api_key = "EMPTY"
+    from openai import OpenAI
-openai_api_base = "http://localhost:8000/v1"
-client = OpenAI(
+    # Modify OpenAI's API key and API base to use vLLM's API server.
-    api_key=openai_api_key,
+    openai_api_key = "EMPTY"
-    base_url=openai_api_base,
+    openai_api_base = "http://localhost:8000/v1"
-)
-models = client.models.list()
+    client = OpenAI(
-model = models.data[0].id
+        api_key=openai_api_key,
+        base_url=openai_api_base,
+    )
-# Round 1
+    models = client.models.list()
-messages = [{"role": "user", "content": "9.11 and 9.8, which is greater?"}]
+    model = models.data[0].id
-# For granite, add: `extra_body={"chat_template_kwargs": {"thinking": True}}`
-# For Qwen3 series, if you want to disable thinking in reasoning mode, add:
-# extra_body={"chat_template_kwargs": {"enable_thinking": False}}
-response = client.chat.completions.create(model=model, messages=messages)
-reasoning_content = response.choices[0].message.reasoning_content
+    # Round 1
-content = response.choices[0].message.content
+    messages = [{"role": "user", "content": "9.11 and 9.8, which is greater?"}]
+    # For granite, add: `extra_body={"chat_template_kwargs": {"thinking": True}}`
+    # For Qwen3 series, if you want to disable thinking in reasoning mode, add:
+    # extra_body={"chat_template_kwargs": {"enable_thinking": False}}
+    response = client.chat.completions.create(model=model, messages=messages)
-print("reasoning_content:", reasoning_content)
+    reasoning_content = response.choices[0].message.reasoning_content
-print("content:", content)
+    content = response.choices[0].message.content
-```
+    print("reasoning_content:", reasoning_content)
+    print("content:", content)
+    ```
 The `reasoning_content` field contains the reasoning steps that led to the final conclusion, while the `content` field contains the final conclusion.
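To illustrate how the two fields relate to the raw model output, here is a minimal sketch that splits a completed response into reasoning and final answer. It assumes the model wraps its reasoning in `<think>` and `</think>` tags, as DeepSeek-R1 style models do. The helper `split_reasoning` is hypothetical and much simpler than a real reasoning parser, which also has to handle token IDs and streaming.

```python
def split_reasoning(text, start_tag="<think>", end_tag="</think>"):
    """Split raw model output into (reasoning_content, content).

    Hypothetical helper: assumes reasoning is delimited by start_tag/end_tag.
    """
    if end_tag not in text:
        # No closing tag: treat the whole text as the final answer.
        return None, text
    reasoning, _, content = text.partition(end_tag)
    # Drop the opening tag if the model emitted it.
    if start_tag in reasoning:
        reasoning = reasoning.split(start_tag, 1)[1]
    return reasoning.strip(), content.strip()


raw = "<think>9.8 is 9.80, which is above 9.11.</think>9.8 is greater."
reasoning_content, content = split_reasoning(raw)
print("reasoning_content:", reasoning_content)
print("content:", content)
```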
@@ -68,164 +70,125 @@ The `reasoning_content` field contains the reasoning steps that led to the final
 Streaming chat completions are also supported for reasoning models. The `reasoning_content` field is available in the `delta` field in [chat completion response chunks](https://platform.openai.com/docs/api-reference/chat/streaming).
-```json
+??? Json
-{
-    "id": "chatcmpl-123",
+    ```json
-    "object": "chat.completion.chunk",
+    {
-    "created": 1694268190,
+        "id": "chatcmpl-123",
-    "model": "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",
+        "object": "chat.completion.chunk",
-    "system_fingerprint": "fp_44709d6fcb",
+        "created": 1694268190,
-    "choices": [
+        "model": "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",
-        {
+        "system_fingerprint": "fp_44709d6fcb",
-            "index": 0,
+        "choices": [
-            "delta": {
+            {
-                "role": "assistant",
+                "index": 0,
-                "reasoning_content": "is",
+                "delta": {
-            },
+                    "role": "assistant",
-            "logprobs": null,
+                    "reasoning_content": "is",
-            "finish_reason": null
+                },
-        }
+                "logprobs": null,
-    ]
+                "finish_reason": null
-}
+            }
-```
+        ]
+    }
+    ```
 The OpenAI Python client library does not officially support the `reasoning_content` attribute for streaming output, but the client does accept extra attributes in the response. You can use `hasattr` to check whether the `reasoning_content` attribute is present. For example:
-```python
+??? Code
-from openai import OpenAI
+    ```python
-# Modify OpenAI's API key and API base to use vLLM's API server.
+    from openai import OpenAI
-openai_api_key = "EMPTY"
-openai_api_base = "http://localhost:8000/v1"
+    # Modify OpenAI's API key and API base to use vLLM's API server.
+    openai_api_key = "EMPTY"
-client = OpenAI(
+    openai_api_base = "http://localhost:8000/v1"
-    api_key=openai_api_key,
-    base_url=openai_api_base,
+    client = OpenAI(
-)
+        api_key=openai_api_key,
+        base_url=openai_api_base,
-models = client.models.list()
+    )
-model = models.data[0].id
+    models = client.models.list()
-messages = [{"role": "user", "content": "9.11 and 9.8, which is greater?"}]
+    model = models.data[0].id
-# For granite, add: `extra_body={"chat_template_kwargs": {"thinking": True}}`
-# For Qwen3 series, if you want to disable thinking in reasoning mode, add:
+    messages = [{"role": "user", "content": "9.11 and 9.8, which is greater?"}]
-# extra_body={"chat_template_kwargs": {"enable_thinking": False}}
+    # For granite, add: `extra_body={"chat_template_kwargs": {"thinking": True}}`
-stream = client.chat.completions.create(model=model,
+    # For Qwen3 series, if you want to disable thinking in reasoning mode, add:
-                                        messages=messages,
+    # extra_body={"chat_template_kwargs": {"enable_thinking": False}}
-                                        stream=True)
+    stream = client.chat.completions.create(model=model,
+                                            messages=messages,
-print("client: Start streaming chat completions...")
+                                            stream=True)
-printed_reasoning_content = False
-printed_content = False
+    print("client: Start streaming chat completions...")
+    printed_reasoning_content = False
-for chunk in stream:
+    printed_content = False
-    reasoning_content = None
-    content = None
+    for chunk in stream:
-    # Check the content is reasoning_content or content
+        reasoning_content = None
-    if hasattr(chunk.choices[0].delta, "reasoning_content"):
+        content = None
-        reasoning_content = chunk.choices[0].delta.reasoning_content
+        # Check whether the chunk carries reasoning_content or content
-    elif hasattr(chunk.choices[0].delta, "content"):
+        if hasattr(chunk.choices[0].delta, "reasoning_content"):
-        content = chunk.choices[0].delta.content
+            reasoning_content = chunk.choices[0].delta.reasoning_content
+        elif hasattr(chunk.choices[0].delta, "content"):
-    if reasoning_content is not None:
+            content = chunk.choices[0].delta.content
-        if not printed_reasoning_content:
-            printed_reasoning_content = True
+        if reasoning_content is not None:
-            print("reasoning_content:", end="", flush=True)
+            if not printed_reasoning_content:
-        print(reasoning_content, end="", flush=True)
+                printed_reasoning_content = True
-    elif content is not None:
+                print("reasoning_content:", end="", flush=True)
-        if not printed_content:
+            print(reasoning_content, end="", flush=True)
-            printed_content = True
+        elif content is not None:
-            print("\ncontent:", end="", flush=True)
+            if not printed_content:
-        # Extract and print the content
+                printed_content = True
-        print(content, end="", flush=True)
+                print("\ncontent:", end="", flush=True)
-```
+            # Extract and print the content
+            print(content, end="", flush=True)
+    ```
 Remember to check whether `reasoning_content` exists in the response before accessing it. You can check out the [example](https://github.com/vllm-project/vllm/blob/main/examples/online_serving/openai_chat_completion_with_reasoning_streaming.py).
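The existence check can be exercised without a running server. The sketch below feeds stand-in chunk objects (built with `SimpleNamespace`, purely as test data) through a hypothetical `collect_stream` helper that uses `getattr` with a default, which covers both a missing attribute and a `None` value.

```python
from types import SimpleNamespace


def collect_stream(chunks):
    """Accumulate reasoning text and answer text from streamed chunks."""
    reasoning_parts, content_parts = [], []
    for chunk in chunks:
        delta = chunk.choices[0].delta
        # A default of None handles deltas that lack the attribute entirely.
        reasoning = getattr(delta, "reasoning_content", None)
        content = getattr(delta, "content", None)
        if reasoning is not None:
            reasoning_parts.append(reasoning)
        elif content is not None:
            content_parts.append(content)
    return "".join(reasoning_parts), "".join(content_parts)


def fake_chunk(**delta_fields):
    """Build a stand-in for a chat completion chunk (test data only)."""
    delta = SimpleNamespace(**delta_fields)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


stream = [
    fake_chunk(reasoning_content="9.8 is "),
    fake_chunk(reasoning_content="larger."),
    fake_chunk(content="9.8"),
]
reasoning, answer = collect_stream(stream)
print("reasoning_content:", reasoning)
print("content:", answer)
```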
-## Structured output
-The reasoning content is also available in the structured output. The structured output engine like `xgrammar` will use the reasoning content to generate structured output. It is only supported in v0 engine now.
-```bash
-vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B --reasoning-parser deepseek_r1
-```
-The following is an example client:
-```python
-from openai import OpenAI
-from pydantic import BaseModel
-# Modify OpenAI's API key and API base to use vLLM's API server.
-openai_api_key = "EMPTY"
-openai_api_base = "http://localhost:8000/v1"
-client = OpenAI(
-    api_key=openai_api_key,
-    base_url=openai_api_base,
-)
-models = client.models.list()
-model = models.data[0].id
-class People(BaseModel):
-    name: str
-    age: int
-json_schema = People.model_json_schema()
-prompt = ("Generate a JSON with the name and age of one random person.")
-completion = client.chat.completions.create(
-    model=model,
-    messages=[{
-        "role": "user",
-        "content": prompt,
-    }],
-    extra_body={"guided_json": json_schema},
-)
-print("reasoning_content: ", completion.choices[0].message.reasoning_content)
-print("content: ", completion.choices[0].message.content)
-```
 ## Tool Calling
 The reasoning content is also available when both tool calling and the reasoning parser are enabled. Additionally, tool calling only parses functions from the `content` field, not from the `reasoning_content`.
-```python
+??? Code
-from openai import OpenAI
+    ```python
-client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")
+    from openai import OpenAI
-tools = [{
+    client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")
-    "type": "function",
-    "function": {
+    tools = [{
-        "name": "get_weather",
+        "type": "function",
-        "description": "Get the current weather in a given location",
+        "function": {
-        "parameters": {
+            "name": "get_weather",
-            "type": "object",
+            "description": "Get the current weather in a given location",
-            "properties": {
+            "parameters": {
-                "location": {"type": "string", "description": "City and state, e.g., 'San Francisco, CA'"},
+                "type": "object",
-                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}
+                "properties": {
-            },
+                    "location": {"type": "string", "description": "City and state, e.g., 'San Francisco, CA'"},
-            "required": ["location", "unit"]
+                    "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}
+                },
+                "required": ["location", "unit"]
+            }
        }
-    }
+    }]
-}]
-response = client.chat.completions.create(
+    response = client.chat.completions.create(
-    model=client.models.list().data[0].id,
+        model=client.models.list().data[0].id,
-    messages=[{"role": "user", "content": "What's the weather like in San Francisco?"}],
+        messages=[{"role": "user", "content": "What's the weather like in San Francisco?"}],
-    tools=tools,
+        tools=tools,
-    tool_choice="auto"
+        tool_choice="auto"
-)
+    )
-print(response)
+    print(response)
-tool_call = response.choices[0].message.tool_calls[0].function
+    tool_call = response.choices[0].message.tool_calls[0].function
-print(f"reasoning_content: {response.choices[0].message.reasoning_content}")
+    print(f"reasoning_content: {response.choices[0].message.reasoning_content}")
-print(f"Function called: {tool_call.name}")
+    print(f"Function called: {tool_call.name}")
-print(f"Arguments: {tool_call.arguments}")
+    print(f"Arguments: {tool_call.arguments}")
-```
+    ```
 For more examples, please refer to <gh-file:examples/online_serving/openai_chat_completion_tool_calls_with_reasoning.py>.
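Once a tool call has been returned, the client still has to run the function itself. The following is a minimal sketch of that step, relying on the fact that `arguments` arrives as a JSON string. The local `get_weather` implementation, the `TOOL_REGISTRY` mapping, and `dispatch_tool_call` are all hypothetical names used for illustration.

```python
import json


def get_weather(location, unit):
    """Hypothetical local implementation of the `get_weather` tool."""
    return f"Weather in {location}: 20 degrees {unit}"


# Map tool names from the request's `tools` list to local callables.
TOOL_REGISTRY = {"get_weather": get_weather}


def dispatch_tool_call(name, arguments):
    """Look up a tool by name and call it with JSON-decoded arguments."""
    if name not in TOOL_REGISTRY:
        raise ValueError(f"Unknown tool: {name}")
    kwargs = json.loads(arguments)  # arguments arrive as a JSON string
    return TOOL_REGISTRY[name](**kwargs)


result = dispatch_tool_call(
    "get_weather", '{"location": "San Francisco, CA", "unit": "celsius"}'
)
print(result)
```

With a real response, `name` and `arguments` would come from `tool_call.name` and `tool_call.arguments`.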
@@ -237,85 +200,89 @@ For more examples, please refer to <gh-file:examples/online_serving/openai_chat_
 You can add a new `ReasoningParser` similar to <gh-file:vllm/reasoning/deepseek_r1_reasoning_parser.py>.
-```python
+??? Code
-# import the required packages
+    ```python
-from vllm.reasoning import ReasoningParser, ReasoningParserManager
+    # import the required packages
-from vllm.entrypoints.openai.protocol import (ChatCompletionRequest,
-                                              DeltaMessage)
+    from vllm.reasoning import ReasoningParser, ReasoningParserManager
+    from vllm.entrypoints.openai.protocol import (ChatCompletionRequest,
-# define a reasoning parser and register it to vllm
+                                                DeltaMessage)
-# the name list in register_module can be used
-# in --reasoning-parser.
+    # define a reasoning parser and register it to vllm
-@ReasoningParserManager.register_module(["example"])
+    # the name list in register_module can be used
-class ExampleParser(ReasoningParser):
+    # in --reasoning-parser.
-    def __init__(self, tokenizer: AnyTokenizer):
+    @ReasoningParserManager.register_module(["example"])
-        super().__init__(tokenizer)
+    class ExampleParser(ReasoningParser):
+        def __init__(self, tokenizer: AnyTokenizer):
-    def extract_reasoning_content_streaming(
+            super().__init__(tokenizer)
-        self,
-        previous_text: str,
+        def extract_reasoning_content_streaming(
-        current_text: str,
+            self,
-        delta_text: str,
+            previous_text: str,
-        previous_token_ids: Sequence[int],
+            current_text: str,
-        current_token_ids: Sequence[int],
+            delta_text: str,
-        delta_token_ids: Sequence[int],
+            previous_token_ids: Sequence[int],
-    ) -> Union[DeltaMessage, None]:
+            current_token_ids: Sequence[int],
-        """
+            delta_token_ids: Sequence[int],
-        Instance method that should be implemented for extracting reasoning
+        ) -> Union[DeltaMessage, None]:
-        from an incomplete response; for use when handling reasoning calls and
+            """
-        streaming. Has to be an instance method because  it requires state -
+            Instance method that should be implemented for extracting reasoning
-        the current tokens/diffs, but also the information about what has
+            from an incomplete response; for use when handling reasoning calls and
-        previously been parsed and extracted (see constructor)
+            streaming. Has to be an instance method because  it requires state -
-        """
+            the current tokens/diffs, but also the information about what has
+            previously been parsed and extracted (see constructor)
+            """
+        def extract_reasoning_content(
+                self, model_output: str, request: ChatCompletionRequest
+        ) -> tuple[Optional[str], Optional[str]]:
+            """
+            Extract reasoning content from a complete model-generated string.
+            Used for non-streaming responses where we have the entire model response
+            available before sending to the client.
+            Parameters:
+            model_output: str
+                The model-generated string to extract reasoning content from.
+            request: ChatCompletionRequest
+                The request object that was used to generate the model_output.
+            Returns:
+            tuple[Optional[str], Optional[str]]
+                A tuple containing the reasoning content and the content.
+            """
+    ```
-    def extract_reasoning_content(
+Additionally, to enable structured output, you'll need to create a new `Reasoner` similar to the one in <gh-file:vllm/reasoning/deepseek_r1_reasoning_parser.py>.
-            self, model_output: str, request: ChatCompletionRequest
-    ) -> tuple[Optional[str], Optional[str]]:
-        """
-        Extract reasoning content from a complete model-generated string.
-        Used for non-streaming responses where we have the entire model response
-        available before sending to the client.
-        Parameters:
-        model_output: str
-            The model-generated string to extract reasoning content from.
-        request: ChatCompletionRequest
+??? Code
-            The request object that was used to generate the model_output.
-        Returns:
+    ```python
-        tuple[Optional[str], Optional[str]]
+    @dataclass
-            A tuple containing the reasoning content and the content.
+    class DeepSeekReasoner(Reasoner):
        """
-```
+        Reasoner for DeepSeek R series models.
+        """
-Additionally, to enable structured output, you'll need to create a new `Reasoner` similar to the one in <gh-file:vllm/reasoning/deepseek_r1_reasoning_parser.py>.
+        start_token_id: int
+        end_token_id: int
-```python
-@dataclass
+        start_token: str = "<think>"
-class DeepSeekReasoner(Reasoner):
+        end_token: str = "</think>"
-    """
-    Reasoner for DeepSeek R series models.
+        @classmethod
-    """
+        def from_tokenizer(cls, tokenizer: PreTrainedTokenizer) -> Reasoner:
-    start_token_id: int
+            return cls(start_token_id=tokenizer.encode(
-    end_token_id: int
+                "<think>", add_special_tokens=False)[0],
+                    end_token_id=tokenizer.encode("</think>",
-    start_token: str = "<think>"
+                                                    add_special_tokens=False)[0])
-    end_token: str = "</think>"
+        def is_reasoning_end(self, input_ids: list[int]) -> bool:
-    @classmethod
+            return self.end_token_id in input_ids
-    def from_tokenizer(cls, tokenizer: PreTrainedTokenizer) -> Reasoner:
+        ...
-        return cls(start_token_id=tokenizer.encode(
+    ```
-            "<think>", add_special_tokens=False)[0],
-                   end_token_id=tokenizer.encode("</think>",
-                                                 add_special_tokens=False)[0])
-    def is_reasoning_end(self, input_ids: list[int]) -> bool:
-        return self.end_token_id in input_ids
-    ...
-```
 Structured output engines such as [xgrammar](https://github.com/mlc-ai/xgrammar) use `end_token_id` to check whether reasoning content is present in the model output and, if it is, skip structured output.
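To make the `end_token_id` check concrete, here is a minimal standalone sketch that mirrors the `is_reasoning_end` method without importing vLLM. The class name `ToyReasoner` and the token ids are invented for illustration; real ids come from the tokenizer.

```python
from dataclasses import dataclass


@dataclass
class ToyReasoner:
    """Hypothetical stand-in for a Reasoner; not part of vLLM."""

    start_token_id: int
    end_token_id: int

    def is_reasoning_end(self, input_ids: list[int]) -> bool:
        # Reasoning is considered finished once the end token has appeared.
        return self.end_token_id in input_ids


# Assumed ids for "<think>" and "</think>"; real values depend on the tokenizer.
reasoner = ToyReasoner(start_token_id=100, end_token_id=101)

still_thinking = [100, 7, 8, 9]
done_thinking = [100, 7, 8, 9, 101, 42]

print(reasoner.is_reasoning_end(still_thinking))  # False
print(reasoner.is_reasoning_end(done_thinking))   # True
```

A membership test like this is linear in the number of generated tokens, which is usually acceptable for a sketch; a production implementation might track the state incrementally instead.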

--- a/docs/features/spec_decode.md
+++ b/docs/features/spec_decode.md
@@ -18,29 +18,31 @@ Speculative decoding is a technique which improves inter-token latency in memory
 The following code configures vLLM in an offline mode to use speculative decoding with a draft model, speculating 5 tokens at a time.
-```python
+??? Code
-from vllm import LLM, SamplingParams
+    ```python
-prompts = [
+    from vllm import LLM, SamplingParams
-    "The future of AI is",
-]
+    prompts = [
-sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
+        "The future of AI is",
+    ]
-llm = LLM(
+    sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
-    model="facebook/opt-6.7b",
-    tensor_parallel_size=1,
+    llm = LLM(
-    speculative_config={
+        model="facebook/opt-6.7b",
-        "model": "facebook/opt-125m",
+        tensor_parallel_size=1,
-        "num_speculative_tokens": 5,
+        speculative_config={
-    },
+            "model": "facebook/opt-125m",
-)
+            "num_speculative_tokens": 5,
-outputs = llm.generate(prompts, sampling_params)
+        },
+    )
-for output in outputs:
+    outputs = llm.generate(prompts, sampling_params)
-    prompt = output.prompt
-    generated_text = output.outputs[0].text
+    for output in outputs:
-    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
+        prompt = output.prompt
-```
+        generated_text = output.outputs[0].text
+        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
+    ```
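For intuition about what `num_speculative_tokens` controls, the following toy sketch walks through a draft-and-verify loop with greedy decoding. It is not vLLM's implementation: the two "models" are made-up functions over integer tokens, and the target is called once per position here, whereas a real engine would verify all proposed tokens in one batched forward pass.

```python
from typing import Callable, List

# A toy "model" maps the tokens so far to the next token (greedy decoding).
NextToken = Callable[[List[int]], int]


def target_model(tokens: List[int]) -> int:
    # Hypothetical target: sum of the last two tokens, modulo 10.
    return (tokens[-1] + tokens[-2]) % 10


def draft_model(tokens: List[int]) -> int:
    # Hypothetical draft: agrees with the target unless the last token is 5.
    if tokens[-1] == 5:
        return 0
    return (tokens[-1] + tokens[-2]) % 10


def greedy_generate(prompt: List[int], model: NextToken, max_new_tokens: int) -> List[int]:
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(model(tokens))
    return tokens


def speculative_generate(
    prompt: List[int],
    target: NextToken,
    draft: NextToken,
    num_speculative_tokens: int,
    max_new_tokens: int,
) -> List[int]:
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_new_tokens:
        # 1. The draft model proposes a short continuation.
        proposal: List[int] = []
        for _ in range(num_speculative_tokens):
            proposal.append(draft(tokens + proposal))

        # 2. The target checks each proposed token in order.
        accepted: List[int] = []
        for tok in proposal:
            expected = target(tokens + accepted)
            if tok != expected:
                # First mismatch: keep the target's token and stop.
                accepted.append(expected)
                break
            accepted.append(tok)
        else:
            # Every proposal matched, so the target adds one extra token.
            accepted.append(target(tokens + accepted))

        tokens.extend(accepted)
    # Trim any overshoot from the last round.
    return tokens[: len(prompt) + max_new_tokens]


prompt = [1, 1]
baseline = greedy_generate(prompt, target_model, max_new_tokens=12)
speculative = speculative_generate(
    prompt, target_model, draft_model, num_speculative_tokens=5, max_new_tokens=12
)
print(speculative == baseline)  # True
```

Every token that ends up in the output is one the target would have chosen itself, so with greedy decoding the result matches target-only generation; the draft model only affects how many target steps can be validated per round.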
 To do the same in online mode, launch the server:
@@ -60,69 +62,73 @@ python -m vllm.entrypoints.openai.api_server \
 Then use a client:
-```python
+??? Code
-from openai import OpenAI
+    ```python
-# Modify OpenAI's API key and API base to use vLLM's API server.
+    from openai import OpenAI
-openai_api_key = "EMPTY"
-openai_api_base = "http://localhost:8000/v1"
+    # Modify OpenAI's API key and API base to use vLLM's API server.
+    openai_api_key = "EMPTY"
-client = OpenAI(
+    openai_api_base = "http://localhost:8000/v1"
-    # defaults to os.environ.get("OPENAI_API_KEY")
-    api_key=openai_api_key,
+    client = OpenAI(
-    base_url=openai_api_base,
+        # defaults to os.environ.get("OPENAI_API_KEY")
-)
+        api_key=openai_api_key,
+        base_url=openai_api_base,
-models = client.models.list()
+    )
-model = models.data[0].id
+    models = client.models.list()
-# Completion API
+    model = models.data[0].id
-stream = False
-completion = client.completions.create(
+    # Completion API
-    model=model,
+    stream = False
-    prompt="The future of AI is",
+    completion = client.completions.create(
-    echo=False,
+        model=model,
-    n=1,
+        prompt="The future of AI is",
-    stream=stream,
+        echo=False,
-)
+        n=1,
+        stream=stream,
-print("Completion results:")
+    )
-if stream:
-    for c in completion:
+    print("Completion results:")
-        print(c)
+    if stream:
-else:
+        for c in completion:
-    print(completion)
+            print(c)
-```
+    else:
+        print(completion)
+    ```
 ## Speculating by matching n-grams in the prompt
 The following code configures vLLM to use speculative decoding where proposals are generated by
 matching n-grams in the prompt. For more information, read [this thread](https://x.com/joao_gante/status/1747322413006643259).
-```python
+??? Code
-from vllm import LLM, SamplingParams
+    ```python
-prompts = [
+    from vllm import LLM, SamplingParams
-    "The future of AI is",
-]
+    prompts = [
-sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
+        "The future of AI is",
+    ]
-llm = LLM(
+    sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
-    model="facebook/opt-6.7b",
-    tensor_parallel_size=1,
+    llm = LLM(
-    speculative_config={
+        model="facebook/opt-6.7b",
-        "method": "ngram",
+        tensor_parallel_size=1,
-        "num_speculative_tokens": 5,
+        speculative_config={
-        "prompt_lookup_max": 4,
+            "method": "ngram",
-    },
+            "num_speculative_tokens": 5,
-)
+            "prompt_lookup_max": 4,
-outputs = llm.generate(prompts, sampling_params)
+        },
+    )
-for output in outputs:
+    outputs = llm.generate(prompts, sampling_params)
-    prompt = output.prompt
-    generated_text = output.outputs[0].text
+    for output in outputs:
-    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
+        prompt = output.prompt
-```
+        generated_text = output.outputs[0].text
+        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
+    ```
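The idea behind n-gram (prompt lookup) proposals can be sketched in a few lines of plain Python. This is an illustrative approximation only: the function name `propose_by_ngram` is hypothetical, and vLLM's actual proposer may choose matches differently. The parameter names are borrowed from the `speculative_config` keys above.

```python
from typing import List


def propose_by_ngram(
    tokens: List[int], prompt_lookup_max: int, num_speculative_tokens: int
) -> List[int]:
    """Propose draft tokens by matching the trailing n-gram earlier in the sequence."""
    # Try the longest n-gram first, then fall back to shorter ones.
    for n in range(min(prompt_lookup_max, len(tokens) - 1), 0, -1):
        suffix = tokens[-n:]
        # Scan earlier positions from right to left, skipping the suffix itself.
        for start in range(len(tokens) - n - 1, -1, -1):
            if tokens[start:start + n] == suffix:
                # Propose the tokens that followed the earlier occurrence.
                return tokens[start + n:start + n + num_speculative_tokens]
    return []  # no match: nothing to speculate


history = [1, 2, 3, 4, 1, 2, 3]
print(propose_by_ngram(history, prompt_lookup_max=4, num_speculative_tokens=2))  # [4, 1]
print(propose_by_ngram([1, 2, 3], prompt_lookup_max=4, num_speculative_tokens=5))  # []
```

Because proposals are copied from text that is already in the context, this approach tends to help most when the output repeats parts of the prompt, for example in summarization or code editing.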
 ## Speculating using MLP speculators
@@ -131,29 +137,31 @@ draft models that conditioning draft predictions on both context vectors and sam
 For more information see [this blog](https://pytorch.org/blog/hitchhikers-guide-speculative-decoding/) or
 [this technical report](https://arxiv.org/abs/2404.19124).
-```python
+??? Code
-from vllm import LLM, SamplingParams
+    ```python
-prompts = [
+    from vllm import LLM, SamplingParams
-    "The future of AI is",
-]
+    prompts = [
-sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
+        "The future of AI is",
+    ]
-llm = LLM(
+    sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
-    model="meta-llama/Meta-Llama-3.1-70B-Instruct",
-    tensor_parallel_size=4,
+    llm = LLM(
-    speculative_config={
+        model="meta-llama/Meta-Llama-3.1-70B-Instruct",
-        "model": "ibm-ai-platform/llama3-70b-accelerator",
+        tensor_parallel_size=4,
-        "draft_tensor_parallel_size": 1,
+        speculative_config={
-    },
+            "model": "ibm-ai-platform/llama3-70b-accelerator",
-)
+            "draft_tensor_parallel_size": 1,
-outputs = llm.generate(prompts, sampling_params)
+        },
+    )
-for output in outputs:
+    outputs = llm.generate(prompts, sampling_params)
-    prompt = output.prompt
-    generated_text = output.outputs[0].text
+    for output in outputs:
-    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
+        prompt = output.prompt
-```
+        generated_text = output.outputs[0].text
+        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
+    ```
 Note that these speculative models currently need to be run without tensor parallelism, although
 it is possible to run the main model using tensor parallelism (see example above). Since the
@@ -177,31 +185,34 @@ A variety of speculative models of this type are available on HF hub:
 The following code configures vLLM to use speculative decoding where proposals are generated by
 an [EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency)](https://arxiv.org/pdf/2401.15077) based draft model. A more detailed example for offline mode, including how to extract request level acceptance rate, can be found [here](gh-file:examples/offline_inference/eagle.py).
-```python
+??? Code
-from vllm import LLM, SamplingParams
-prompts = [
+    ```python
-    "The future of AI is",
+    from vllm import LLM, SamplingParams
-]
-sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
-llm = LLM(
+    prompts = [
-    model="meta-llama/Meta-Llama-3-8B-Instruct",
+        "The future of AI is",
-    tensor_parallel_size=4,
+    ]
-    speculative_config={
+    sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
-        "model": "yuhuili/EAGLE-LLaMA3-Instruct-8B",
-        "draft_tensor_parallel_size": 1,
-    },
-)
-outputs = llm.generate(prompts, sampling_params)
+    llm = LLM(
+        model="meta-llama/Meta-Llama-3-8B-Instruct",
+        tensor_parallel_size=4,
+        speculative_config={
+            "model": "yuhuili/EAGLE-LLaMA3-Instruct-8B",
+            "draft_tensor_parallel_size": 1,
+            "num_speculative_tokens": 2,
+        },
+    )
-for output in outputs:
+    outputs = llm.generate(prompts, sampling_params)
-    prompt = output.prompt
-    generated_text = output.outputs[0].text
-    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
-```
+    for output in outputs:
+        prompt = output.prompt
+        generated_text = output.outputs[0].text
+        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
+    ```
 A few important things to consider when using the EAGLE based draft models:

--- a/docs/features/structured_outputs.md
+++ b/docs/features/structured_outputs.md
@@ -21,7 +21,7 @@ The following parameters are supported, which must be added as extra parameters:
 - `guided_grammar`: the output will follow the context free grammar.
 - `structural_tag`: Follow a JSON schema within a set of specified tags within the generated text.
-You can see the complete list of supported parameters on the [OpenAI-Compatible Server][openai-compatible-server] page.
+You can see the complete list of supported parameters on the [OpenAI-Compatible Server][serving-openai-compatible-server] page.
 Structured outputs are supported by default in the OpenAI-Compatible Server. You
 may choose to specify the backend to use by setting the
@@ -33,38 +33,43 @@ text.
 Now let's see an example for each of the cases, starting with `guided_choice`, as it's the easiest one:
-```python
+??? Code
-from openai import OpenAI
-client = OpenAI(
+    ```python
-    base_url="http://localhost:8000/v1",
+    from openai import OpenAI
-    api_key="-",
+    client = OpenAI(
-)
+        base_url="http://localhost:8000/v1",
+        api_key="-",
-completion = client.chat.completions.create(
+    )
-    model="Qwen/Qwen2.5-3B-Instruct",
+    model = client.models.list().data[0].id
-    messages=[
-        {"role": "user", "content": "Classify this sentiment: vLLM is wonderful!"}
+    completion = client.chat.completions.create(
-    ],
+        model=model,
-    extra_body={"guided_choice": ["positive", "negative"]},
+        messages=[
-)
+            {"role": "user", "content": "Classify this sentiment: vLLM is wonderful!"}
-print(completion.choices[0].message.content)
+        ],
-```
+        extra_body={"guided_choice": ["positive", "negative"]},
+    )
+    print(completion.choices[0].message.content)
+    ```
 The next example shows how to use the `guided_regex`. The idea is to generate an email address, given a simple regex template:
-```python
+??? Code
-completion = client.chat.completions.create(
-    model="Qwen/Qwen2.5-3B-Instruct",
+    ```python
-    messages=[
+    completion = client.chat.completions.create(
-        {
+        model=model,
-            "role": "user",
+        messages=[
-            "content": "Generate an example email address for Alan Turing, who works in Enigma. End in .com and new line. Example result: alan.turing@enigma.com\n",
+            {
-        }
+                "role": "user",
-    ],
+                "content": "Generate an example email address for Alan Turing, who works in Enigma. End in .com and new line. Example result: alan.turing@enigma.com\n",
-    extra_body={"guided_regex": r"\w+@\w+\.com\n", "stop": ["\n"]},
+            }
-)
+        ],
-print(completion.choices[0].message.content)
+        extra_body={"guided_regex": r"\w+@\w+\.com\n", "stop": ["\n"]},
-```
+    )
+    print(completion.choices[0].message.content)
+    ```
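Before sending a pattern as `guided_regex`, it can help to test it locally. The sketch below uses Python's `re` module as an approximation; the structured output backend may interpret some regex features differently. The helper name `matches_guided_regex` is made up for this example.

```python
import re

# Same pattern as the `guided_regex` value in the request above.
EMAIL_PATTERN = r"\w+@\w+\.com\n"


def matches_guided_regex(text: str) -> bool:
    # fullmatch requires the whole string to satisfy the pattern.
    return re.fullmatch(EMAIL_PATTERN, text) is not None


print(matches_guided_regex("alanturing@enigma.com\n"))   # True
print(matches_guided_regex("alan.turing@enigma.com\n"))  # False: "." is not matched by \w
print(matches_guided_regex("alanturing@enigma.org\n"))   # False: must end in ".com"
```

Note that the address given in the prompt, `alan.turing@enigma.com`, does not satisfy this pattern because `\w` excludes the dot, so a constrained model cannot reproduce it verbatim. Since the request also sets `stop`, the returned text may lack the trailing newline; re-append it before checking if needed.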
 One of the most relevant features in structured text generation is the option to generate a valid JSON with pre-defined fields and formats.
 For this we can use the `guided_json` parameter in two different ways:
@@ -74,75 +79,128 @@ For this we can use the `guided_json` parameter in two different ways:
 The next example shows how to use the `guided_json` parameter with a Pydantic model:
-```python
+??? Code
-from pydantic import BaseModel
-from enum import Enum
+    ```python
+    from pydantic import BaseModel
-class CarType(str, Enum):
+    from enum import Enum
-    sedan = "sedan"
-    suv = "SUV"
+    class CarType(str, Enum):
-    truck = "Truck"
+        sedan = "sedan"
-    coupe = "Coupe"
+        suv = "SUV"
+        truck = "Truck"
-class CarDescription(BaseModel):
+        coupe = "Coupe"
-    brand: str
-    model: str
+    class CarDescription(BaseModel):
-    car_type: CarType
+        brand: str
+        model: str
-json_schema = CarDescription.model_json_schema()
+        car_type: CarType
-completion = client.chat.completions.create(
+    json_schema = CarDescription.model_json_schema()
-    model="Qwen/Qwen2.5-3B-Instruct",
-    messages=[
+    completion = client.chat.completions.create(
-        {
+        model=model,
-            "role": "user",
+        messages=[
-            "content": "Generate a JSON with the brand, model and car_type of the most iconic car from the 90's",
+            {
-        }
+                "role": "user",
-    ],
+                "content": "Generate a JSON with the brand, model and car_type of the most iconic car from the 90's",
-    extra_body={"guided_json": json_schema},
+            }
-)
+        ],
-print(completion.choices[0].message.content)
+        response_format={
-```
+            "type": "json_schema",
+            "json_schema": {
+                "name": "car-description",
+                "schema": CarDescription.model_json_schema()
+            },
+        },
+    )
+    print(completion.choices[0].message.content)
+    ```
 !!! tip
     While not strictly necessary, normally it's better to indicate in the prompt the
-    JSON schema and how the fields should be populated.  This can improve the
+    JSON schema and how the fields should be populated. This can improve the
    results notably in most cases.
 Finally we have the `guided_grammar` option, which is probably the most
 difficult to use, but it's really powerful. It allows us to define complete
-languages like SQL queries.  It works by using a context free EBNF grammar.
+languages like SQL queries. It works by using a context free EBNF grammar.
 As an example, we can use it to define a specific format of simplified SQL queries:
-```python
+??? Code
-simplified_sql_grammar = """
-    root ::= select_statement
+    ```python
+    simplified_sql_grammar = """
+        root ::= select_statement
+        select_statement ::= "SELECT " column " from " table " where " condition
+        column ::= "col_1 " | "col_2 "
-    select_statement ::= "SELECT " column " from " table " where " condition
+        table ::= "table_1 " | "table_2 "
-    column ::= "col_1 " | "col_2 "
+        condition ::= column "= " number
-    table ::= "table_1 " | "table_2 "
+        number ::= "1 " | "2 "
+    """
-    condition ::= column "= " number
+    completion = client.chat.completions.create(
+        model=model,
+        messages=[
+            {
+                "role": "user",
+                "content": "Generate an SQL query to show the 'username' and 'email' from the 'users' table.",
+            }
+        ],
+        extra_body={"guided_grammar": simplified_sql_grammar},
+    )
+    print(completion.choices[0].message.content)
+    ```
-    number ::= "1 " | "2 "
+See also: [full example](https://docs.vllm.ai/en/latest/examples/online_serving/structured_outputs.html)
-"""
-completion = client.chat.completions.create(
+## Reasoning Outputs
-    model="Qwen/Qwen2.5-3B-Instruct",
-    messages=[
+You can also use structured outputs with <project:#reasoning-outputs> for reasoning models.
-        {
-            "role": "user",
+```bash
-            "content": "Generate an SQL query to show the 'username' and 'email' from the 'users' table.",
+vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-7B --reasoning-parser deepseek_r1
-        }
-    ],
-    extra_body={"guided_grammar": simplified_sql_grammar},
-)
-print(completion.choices[0].message.content)
 ```
-Full example: <gh-file:examples/online_serving/openai_chat_completion_structured_outputs.py>
+Note that you can use reasoning with any provided structured outputs feature. The following uses one with JSON schema:
+??? Code
+    ```python
+    from pydantic import BaseModel
+    class People(BaseModel):
+        name: str
+        age: int
+    completion = client.chat.completions.create(
+        model=model,
+        messages=[
+            {
+                "role": "user",
+                "content": "Generate a JSON with the name and age of one random person.",
+            }
+        ],
+        response_format={
+            "type": "json_schema",
+            "json_schema": {
+                "name": "people",
+                "schema": People.model_json_schema()
+            }
+        },
+    )
+    print("reasoning_content: ", completion.choices[0].message.reasoning_content)
+    print("content: ", completion.choices[0].message.content)
+    ```
+See also: [full example](https://docs.vllm.ai/en/latest/examples/online_serving/structured_outputs.html)
 ## Experimental Automatic Parsing (OpenAI API)
@@ -154,33 +212,33 @@ For the following examples, vLLM was setup using `vllm serve meta-llama/Llama-3.
 Here is a simple example demonstrating how to get structured output using Pydantic models:
-```python
+??? Code
-from pydantic import BaseModel
-from openai import OpenAI
+    ```python
+    from pydantic import BaseModel
-class Info(BaseModel):
+    from openai import OpenAI
-    name: str
-    age: int
+    class Info(BaseModel):
+        name: str
-client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="dummy")
+        age: int
-completion = client.beta.chat.completions.parse(
-    model="meta-llama/Llama-3.1-8B-Instruct",
+    client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="dummy")
-    messages=[
+    model = client.models.list().data[0].id
-        {"role": "system", "content": "You are a helpful assistant."},
+    completion = client.beta.chat.completions.parse(
-        {"role": "user", "content": "My name is Cameron, I'm 28. What's my name and age?"},
+        model=model,
-    ],
+        messages=[
-    response_format=Info,
+            {"role": "system", "content": "You are a helpful assistant."},
-    extra_body=dict(guided_decoding_backend="outlines"),
+            {"role": "user", "content": "My name is Cameron, I'm 28. What's my name and age?"},
-)
+        ],
+        response_format=Info,
-message = completion.choices[0].message
+    )
-print(message)
-assert message.parsed
+    message = completion.choices[0].message
-print("Name:", message.parsed.name)
+    print(message)
-print("Age:", message.parsed.age)
+    assert message.parsed
-```
+    print("Name:", message.parsed.name)
+    print("Age:", message.parsed.age)
-Output:
+    ```
 ```console
 ParsedChatCompletionMessage[Testing](content='{"name": "Cameron", "age": 28}', refusal=None, role='assistant', audio=None, function_call=None, tool_calls=[], parsed=Testing(name='Cameron', age=28))
@@ -190,37 +248,37 @@ Age: 28
 Here is a more complex example using nested Pydantic models to handle a step-by-step math solution:
-```python
+??? Code
-from typing import List
-from pydantic import BaseModel
+    ```python
-from openai import OpenAI
+    from typing import List
+    from pydantic import BaseModel
-class Step(BaseModel):
+    from openai import OpenAI
-    explanation: str
-    output: str
+    class Step(BaseModel):
+        explanation: str
-class MathResponse(BaseModel):
+        output: str
-    steps: list[Step]
-    final_answer: str
+    class MathResponse(BaseModel):
+        steps: list[Step]
-client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="dummy")
+        final_answer: str
-completion = client.beta.chat.completions.parse(
-    model="meta-llama/Llama-3.1-8B-Instruct",
+    completion = client.beta.chat.completions.parse(
-    messages=[
+        model=model,
-        {"role": "system", "content": "You are a helpful expert math tutor."},
+        messages=[
-        {"role": "user", "content": "Solve 8x + 31 = 2."},
+            {"role": "system", "content": "You are a helpful expert math tutor."},
-    ],
+            {"role": "user", "content": "Solve 8x + 31 = 2."},
-    response_format=MathResponse,
+        ],
-    extra_body=dict(guided_decoding_backend="outlines"),
+        response_format=MathResponse,
-)
+    )
-message = completion.choices[0].message
+    message = completion.choices[0].message
-print(message)
+    print(message)
-assert message.parsed
+    assert message.parsed
-for i, step in enumerate(message.parsed.steps):
+    for i, step in enumerate(message.parsed.steps):
-    print(f"Step #{i}:", step)
+        print(f"Step #{i}:", step)
-print("Answer:", message.parsed.final_answer)
+    print("Answer:", message.parsed.final_answer)
-```
+    ```
 Output:
@@ -232,11 +290,11 @@ Step #2: explanation="Next, let's isolate 'x' by dividing both sides of the equa
 Answer: x = -29/8
 ```
-An example of using `structural_tag` can be found here: <gh-file:examples/online_serving/openai_chat_completion_structured_outputs_structural_tag.py>
+An example of using `structural_tag` can be found here: <gh-file:examples/online_serving/structured_outputs>
 ## Offline Inference
-Offline inference allows for the same types of guided decoding.
+Offline inference allows for the same types of structured outputs.
 To use it, we'll need to configure the guided decoding using the class `GuidedDecodingParams` inside `SamplingParams`.
 The main available options inside `GuidedDecodingParams` are:
@@ -247,22 +305,24 @@ The main available options inside `GuidedDecodingParams` are:
 - `structural_tag`
 These parameters can be used in the same way as the parameters from the Online
-Serving examples above.  One example for the usage of the `choice` parameter is
+Serving examples above. One example for the usage of the `choice` parameter is
 shown below:
-```python
+??? Code
-from vllm import LLM, SamplingParams
-from vllm.sampling_params import GuidedDecodingParams
-llm = LLM(model="HuggingFaceTB/SmolLM2-1.7B-Instruct")
+    ```python
+    from vllm import LLM, SamplingParams
+    from vllm.sampling_params import GuidedDecodingParams
-guided_decoding_params = GuidedDecodingParams(choice=["Positive", "Negative"])
+    llm = LLM(model="HuggingFaceTB/SmolLM2-1.7B-Instruct")
-sampling_params = SamplingParams(guided_decoding=guided_decoding_params)
-outputs = llm.generate(
+    guided_decoding_params = GuidedDecodingParams(choice=["Positive", "Negative"])
-    prompts="Classify this sentiment: vLLM is wonderful!",
+    sampling_params = SamplingParams(guided_decoding=guided_decoding_params)
-    sampling_params=sampling_params,
+    outputs = llm.generate(
-)
+        prompts="Classify this sentiment: vLLM is wonderful!",
-print(outputs[0].outputs[0].text)
+        sampling_params=sampling_params,
-```
+    )
+    print(outputs[0].outputs[0].text)
+    ```
-Full example: <gh-file:examples/offline_inference/structured_outputs.py>
+See also: [full example](https://docs.vllm.ai/en/latest/examples/online_serving/structured_outputs.html)
--- a/docs/features/tool_calling.md
+++ b/docs/features/tool_calling.md
@@ -15,44 +15,46 @@ vllm serve meta-llama/Llama-3.1-8B-Instruct \
 Next, make a request to the model that should result in it using the available tools:
-```python
+??? Code
-from openai import OpenAI
-import json
+    ```python
+    from openai import OpenAI
-client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")
+    import json
-def get_weather(location: str, unit: str):
+    client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")
-    return f"Getting the weather for {location} in {unit}..."
-tool_functions = {"get_weather": get_weather}
+    def get_weather(location: str, unit: str):
+        return f"Getting the weather for {location} in {unit}..."
-tools = [{
+    tool_functions = {"get_weather": get_weather}
-    "type": "function",
-    "function": {
+    tools = [{
-        "name": "get_weather",
+        "type": "function",
-        "description": "Get the current weather in a given location",
+        "function": {
-        "parameters": {
+            "name": "get_weather",
-            "type": "object",
+            "description": "Get the current weather in a given location",
-            "properties": {
+            "parameters": {
-                "location": {"type": "string", "description": "City and state, e.g., 'San Francisco, CA'"},
+                "type": "object",
-                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}
+                "properties": {
-            },
+                    "location": {"type": "string", "description": "City and state, e.g., 'San Francisco, CA'"},
-            "required": ["location", "unit"]
+                    "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}
+                },
+                "required": ["location", "unit"]
+            }
        }
-    }
+    }]
-}]
+    response = client.chat.completions.create(
-response = client.chat.completions.create(
+        model=client.models.list().data[0].id,
-    model=client.models.list().data[0].id,
+        messages=[{"role": "user", "content": "What's the weather like in San Francisco?"}],
-    messages=[{"role": "user", "content": "What's the weather like in San Francisco?"}],
+        tools=tools,
-    tools=tools,
+        tool_choice="auto"
-    tool_choice="auto"
+    )
-)
+    tool_call = response.choices[0].message.tool_calls[0].function
-tool_call = response.choices[0].message.tool_calls[0].function
+    print(f"Function called: {tool_call.name}")
-print(f"Function called: {tool_call.name}")
+    print(f"Arguments: {tool_call.arguments}")
-print(f"Arguments: {tool_call.arguments}")
+    print(f"Result: {tool_functions[tool_call.name](**json.loads(tool_call.arguments))}")
-print(f"Result: {get_weather(**json.loads(tool_call.arguments))}")
+    ```
-```
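The lookup through `tool_functions` can be tried without a running server. The sketch below fakes the `function` object of a tool call with `SimpleNamespace` (an assumption standing in for `response.choices[0].message.tool_calls[0].function`) and adds a guard for tool names that are not registered; the helper `dispatch_tool_call` is hypothetical.

```python
import json
from types import SimpleNamespace


def get_weather(location: str, unit: str) -> str:
    return f"Getting the weather for {location} in {unit}..."


tool_functions = {"get_weather": get_weather}


def dispatch_tool_call(tool_call, registry):
    """Look up the requested function by name and call it with decoded arguments."""
    func = registry.get(tool_call.name)
    if func is None:
        raise KeyError(f"Model requested unknown tool: {tool_call.name!r}")
    # `arguments` arrives as a JSON-encoded string, so decode it first.
    kwargs = json.loads(tool_call.arguments)
    return func(**kwargs)


# Hypothetical stand-in for the object returned by the OpenAI client.
fake_call = SimpleNamespace(
    name="get_weather",
    arguments='{"location": "San Francisco, CA", "unit": "fahrenheit"}',
)

result = dispatch_tool_call(fake_call, tool_functions)
print(result)  # Getting the weather for San Francisco, CA in fahrenheit...
```

Dispatching by name keeps the calling code unchanged when more tools are added to the registry, and the explicit check avoids calling anything the model names that was never registered.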
 Example output:
@@ -97,6 +99,14 @@ vLLM supports the `tool_choice='required'` option in the chat completion API. Si
 When tool_choice='required' is set, the model is guaranteed to generate one or more tool calls based on the specified tool list in the `tools` parameter. The number of tool calls depends on the user's query. The output format strictly follows the schema defined in the `tools` parameter.
+## None Function Calling
+vLLM supports the `tool_choice='none'` option in the chat completion API. When this option is set, the model will not generate any tool calls and will respond with regular text content only, even if tools are defined in the request.
+By default, when `tool_choice='none'` is specified, vLLM excludes tool definitions from the prompt to optimize context usage. To include tool definitions even with `tool_choice='none'`, use the `--expand-tools-even-if-tool-choice-none` option.
+Note: This behavior will change in v0.10.0, where tool definitions will be included by default even with `tool_choice='none'`.
 ## Automatic Function Calling
 To enable this feature, you should set the following flags:
@@ -226,6 +236,25 @@ AI21's Jamba-1.5 models are supported.
 Flags: `--tool-call-parser jamba`
+### xLAM Models (`xlam`)
+The xLAM tool parser is designed to support models that generate tool calls in various JSON formats. It detects function calls in several different output styles:
+1. Direct JSON arrays: Output strings that are JSON arrays starting with `[` and ending with `]`
+2. Thinking tags: Using `<think>...</think>` tags containing JSON arrays
+3. Code blocks: JSON in code blocks (```json ...```)
+4. Tool calls tags: Using `[TOOL_CALLS]` or `<tool_call>...</tool_call>` tags
+Parallel function calls are supported, and the parser can effectively separate text content from tool calls.
+Supported models:
+* Salesforce Llama-xLAM models: `Salesforce/Llama-xLAM-2-8B-fc-r`, `Salesforce/Llama-xLAM-2-70B-fc-r`
+* Qwen-xLAM models: `Salesforce/xLAM-1B-fc-r`, `Salesforce/xLAM-3B-fc-r`, `Salesforce/Qwen-xLAM-32B-fc-r`
+Flags:
+* For Llama-based xLAM models: `--tool-call-parser xlam --chat-template examples/tool_chat_template_xlam_llama.jinja`
+* For Qwen-based xLAM models: `--tool-call-parser xlam --chat-template examples/tool_chat_template_xlam_qwen.jinja`
 ### Qwen Models
 For Qwen2.5, the chat template in tokenizer_config.json has already included support for the Hermes-style tool use. Therefore, you can use the `hermes` parser to enable tool calls for Qwen models. For more detailed information, please refer to the official [Qwen documentation](https://qwen.readthedocs.io/en/latest/framework/function_call.html#vllm)
@@ -235,6 +264,15 @@ For Qwen2.5, the chat template in tokenizer_config.json has already included sup
 Flags: `--tool-call-parser hermes`
+### MiniMax Models (`minimax_m1`)
+Supported models:
+* `MiniMaxAi/MiniMax-M1-40k` (use with <gh-file:examples/tool_chat_template_minimax.jinja>)
+* `MiniMaxAi/MiniMax-M1-80k` (use with <gh-file:examples/tool_chat_template_minimax.jinja>)
+Flags: `--tool-call-parser minimax --chat-template examples/tool_chat_template_minimax.jinja`
 ### DeepSeek-V3 Models (`deepseek_v3`)
 Supported models:
@@ -282,53 +320,55 @@ A tool parser plugin is a Python file containing one or more ToolParser implemen
 Here is a summary of a plugin file:
-```python
+??? Code
-# import the required packages
+    ```python
-# define a tool parser and register it to vllm
+    # import the required packages
-# the name list in register_module can be used
-# in --tool-call-parser. you can define as many
+    # define a tool parser and register it to vllm
-# tool parsers as you want here.
+    # the name list in register_module can be used
-@ToolParserManager.register_module(["example"])
+    # in --tool-call-parser. you can define as many
-class ExampleToolParser(ToolParser):
+    # tool parsers as you want here.
-    def __init__(self, tokenizer: AnyTokenizer):
+    @ToolParserManager.register_module(["example"])
-        super().__init__(tokenizer)
+    class ExampleToolParser(ToolParser):
+        def __init__(self, tokenizer: AnyTokenizer):
-    # adjust request. e.g.: set skip special tokens
+            super().__init__(tokenizer)
-    # to False for tool call output.
-    def adjust_request(
+        # adjust request. e.g.: set skip special tokens
-            self, request: ChatCompletionRequest) -> ChatCompletionRequest:
+        # to False for tool call output.
-        return request
+        def adjust_request(
+                self, request: ChatCompletionRequest) -> ChatCompletionRequest:
-    # implement the tool call parse for stream call
+            return request
-    def extract_tool_calls_streaming(
-        self,
+        # implement the tool call parse for stream call
-        previous_text: str,
+        def extract_tool_calls_streaming(
-        current_text: str,
+            self,
-        delta_text: str,
+            previous_text: str,
-        previous_token_ids: Sequence[int],
+            current_text: str,
-        current_token_ids: Sequence[int],
+            delta_text: str,
-        delta_token_ids: Sequence[int],
+            previous_token_ids: Sequence[int],
-        request: ChatCompletionRequest,
+            current_token_ids: Sequence[int],
-    ) -> Union[DeltaMessage, None]:
+            delta_token_ids: Sequence[int],
-        return delta
+            request: ChatCompletionRequest,
+        ) -> Union[DeltaMessage, None]:
-    # implement the tool parse for non-stream call
+            return delta
-    def extract_tool_calls(
-        self,
+        # implement the tool parse for non-stream call
-        model_output: str,
+        def extract_tool_calls(
-        request: ChatCompletionRequest,
+            self,
-    ) -> ExtractedToolCallInformation:
+            model_output: str,
-        return ExtractedToolCallInformation(tools_called=False,
+            request: ChatCompletionRequest,
-                                            tool_calls=[],
+        ) -> ExtractedToolCallInformation:
-                                            content=text)
+            return ExtractedToolCallInformation(tools_called=False,
+                                                tool_calls=[],
-```
+                                                content=text)
+    ```
 Then you can use this plugin in the command line like this.
-```console
+```bash
    --enable-auto-tool-choice \
    --tool-parser-plugin <absolute path of the plugin file>
    --tool-call-parser example \

--- a/docs/getting_started/installation/.nav.yml
+++ b/docs/getting_started/installation/.nav.yml
@@ -2,4 +2,6 @@ nav:
  - README.md
  - gpu.md
  - cpu.md
-  - ai_accelerator.md
\ No newline at end of file
+  - google_tpu.md
+  - intel_gaudi.md
+  - aws_neuron.md
--- a/docs/getting_started/installation/README.md
+++ b/docs/getting_started/installation/README.md
@@ -14,7 +14,6 @@ vLLM supports the following hardware platforms:
    - [ARM AArch64](cpu.md#arm-aarch64)
    - [Apple silicon](cpu.md#apple-silicon)
    - [IBM Z (S390X)](cpu.md#ibm-z-s390x)
- [Other AI accelerators](ai_accelerator.md)
+- [Google TPU](google_tpu.md)
-    - [Google TPU](ai_accelerator.md#google-tpu)
+- [Intel Gaudi](intel_gaudi.md)
-    - [Intel Gaudi](ai_accelerator.md#intel-gaudi)
+- [AWS Neuron](aws_neuron.md)
-    - [AWS Neuron](ai_accelerator.md#aws-neuron)
--- a/docs/getting_started/installation/ai_accelerator.md
+++ b/docs/getting_started/installation/ai_accelerator.md
-# Other AI accelerators
-vLLM is a Python library that supports the following AI accelerators. Select your AI accelerator type to see vendor specific instructions:
-=== "Google TPU"
-    --8<-- "docs/getting_started/installation/ai_accelerator/tpu.inc.md:installation"
-=== "Intel Gaudi"
-    --8<-- "docs/getting_started/installation/ai_accelerator/hpu-gaudi.inc.md:installation"
-=== "AWS Neuron"
-    --8<-- "docs/getting_started/installation/ai_accelerator/neuron.inc.md:installation"
-## Requirements
-=== "Google TPU"
-    --8<-- "docs/getting_started/installation/ai_accelerator/tpu.inc.md:requirements"
-=== "Intel Gaudi"
-    --8<-- "docs/getting_started/installation/ai_accelerator/hpu-gaudi.inc.md:requirements"
-=== "AWS Neuron"
-    --8<-- "docs/getting_started/installation/ai_accelerator/neuron.inc.md:requirements"
-## Configure a new environment
-=== "Google TPU"
-    --8<-- "docs/getting_started/installation/ai_accelerator/tpu.inc.md:configure-a-new-environment"
-=== "Intel Gaudi"
-    --8<-- "docs/getting_started/installation/ai_accelerator/hpu-gaudi.inc.md:configure-a-new-environment"
-=== "AWS Neuron"
-    --8<-- "docs/getting_started/installation/ai_accelerator/neuron.inc.md:configure-a-new-environment"
-## Set up using Python
-### Pre-built wheels
-=== "Google TPU"
-    --8<-- "docs/getting_started/installation/ai_accelerator/tpu.inc.md:pre-built-wheels"
-=== "Intel Gaudi"
-    --8<-- "docs/getting_started/installation/ai_accelerator/hpu-gaudi.inc.md:pre-built-wheels"
-=== "AWS Neuron"
-    --8<-- "docs/getting_started/installation/ai_accelerator/neuron.inc.md:pre-built-wheels"
-### Build wheel from source
-=== "Google TPU"
-    --8<-- "docs/getting_started/installation/ai_accelerator/tpu.inc.md:build-wheel-from-source"
-=== "Intel Gaudi"
-    --8<-- "docs/getting_started/installation/ai_accelerator/hpu-gaudi.inc.md:build-wheel-from-source"
-=== "AWS Neuron"
-    --8<-- "docs/getting_started/installation/ai_accelerator/neuron.inc.md:build-wheel-from-source"
-## Set up using Docker
-### Pre-built images
-=== "Google TPU"
-    --8<-- "docs/getting_started/installation/ai_accelerator/tpu.inc.md:pre-built-images"
-=== "Intel Gaudi"
-    --8<-- "docs/getting_started/installation/ai_accelerator/hpu-gaudi.inc.md:pre-built-images"
-=== "AWS Neuron"
-    --8<-- "docs/getting_started/installation/ai_accelerator/neuron.inc.md:pre-built-images"
-### Build image from source
-=== "Google TPU"
-    --8<-- "docs/getting_started/installation/ai_accelerator/tpu.inc.md:build-image-from-source"
-=== "Intel Gaudi"
-    --8<-- "docs/getting_started/installation/ai_accelerator/hpu-gaudi.inc.md:build-image-from-source"
-=== "AWS Neuron"
-    --8<-- "docs/getting_started/installation/ai_accelerator/neuron.inc.md:build-image-from-source"
-## Extra information
-=== "Google TPU"
-    --8<-- "docs/getting_started/installation/ai_accelerator/tpu.inc.md:extra-information"
-=== "Intel Gaudi"
-    --8<-- "docs/getting_started/installation/ai_accelerator/hpu-gaudi.inc.md:extra-information"
-=== "AWS Neuron"
-    --8<-- "docs/getting_started/installation/ai_accelerator/neuron.inc.md:extra-information"