Merge tag 'v0.18.0' into v0.18.0-ori

Commit 3fb4b5fa authored Mar 23, 2026 by zhuwenwen
20 changed files
--- a/docs/features/batch_invariance.md
+++ b/docs/features/batch_invariance.md
@@ -109,6 +109,7 @@ Batch invariance has been tested and verified on the following models:
 - **Qwen2.5**: `Qwen/Qwen2.5-0.5B-Instruct`, `Qwen/Qwen2.5-1.5B-Instruct`, `Qwen/Qwen2.5-3B-Instruct`, `Qwen/Qwen2.5-7B-Instruct`, `Qwen/Qwen2.5-14B-Instruct`, `Qwen/Qwen2.5-32B-Instruct`
 - **Llama 3**: `meta-llama/Llama-3.1-8B-Instruct`, `meta-llama/Llama-3.2-1B-Instruct`
 - **GPT-OSS**: `openai/gpt-oss-20b`, `openai/gpt-oss-120b`
+- **Mistral**: `mistralai/Mistral-7B-v0.3`

 Other models may also work, but these have been explicitly validated. If you encounter issues with a specific model, please report them on the [GitHub issue tracker](https://github.com/vllm-project/vllm/issues/new/choose).


--- a/docs/features/disagg_prefill.md
+++ b/docs/features/disagg_prefill.md
@@ -44,6 +44,12 @@ For NixlConnector, you may also specify one or multiple NIXL_Backend. Such as:
  --kv-transfer-config '{"kv_connector":"OffloadingConnector","kv_role":"kv_both","kv_connector_extra_config":{"block_size": 64, "cpu_bytes_to_use": 1000000000}}'
  ```

+- **FlexKVConnectorV1**: see [examples/offline_inference/prefix_caching_flexkv.py](../../examples/offline_inference/prefix_caching_flexkv.py) for example usage. FlexKV is a distributed KV store and multi-level cache management system for ultra-large-scale LLM inference.
+
+  ```bash
+  --kv-transfer-config '{"kv_connector":"FlexKVConnectorV1","kv_role":"kv_both"}'
+  ```
+
 ## Benchmarks

 Please refer to [benchmarks/disagg_benchmarks](../../benchmarks/disagg_benchmarks) for disaggregated prefilling benchmarks.

--- a/docs/features/interleaved_thinking.md
+++ b/docs/features/interleaved_thinking.md
@@ -20,9 +20,9 @@ With interleaved thinking, the model can:
 vLLM currently supports the following interleaved thinking models:

 | Model Series | Reasoning Parser Name |
-|--------------|-----------------------|
-| moonshotai/Kimi-K2-Thinking    |  kimi_k2  |
-| MiniMaxAI/MiniMax-M2           |  minimax_m2  |
+| ------------ | --------------------- |
+| moonshotai/Kimi-K2-Thinking | kimi_k2 |
+| MiniMaxAI/MiniMax-M2 | minimax_m2 |

 ## Example Usage


--- a/docs/features/lora.md
+++ b/docs/features/lora.md
@@ -106,7 +106,8 @@ curl http://localhost:8000/v1/completions \

 In addition to serving LoRA adapters at server startup, the vLLM server supports dynamically configuring LoRA adapters at runtime through dedicated API endpoints and plugins. This feature can be particularly useful when the flexibility to change models on-the-fly is needed.

-Note: Enabling this feature in production environments is risky as users may participate in model adapter management.
+!!! warning
+    This feature carries security risks. Do not use it in production unless the environment is isolated and fully trusted.

 To enable dynamic LoRA configuration, ensure that the environment variable `VLLM_ALLOW_RUNTIME_LORA_UPDATING`
 is set to `True`.

--- a/docs/features/multimodal_inputs.md
+++ b/docs/features/multimodal_inputs.md
@@ -295,6 +295,51 @@ You can pass a tuple `(array, sampling_rate)` to the `'audio'` field of the mult

 Full example: [examples/offline_inference/audio_language.py](../../examples/offline_inference/audio_language.py)

+#### Chunking Long Audio for Transcription
+
+Speech-to-text models like Whisper have a maximum audio length they can process (typically 30 seconds). For longer audio files, vLLM provides a utility to intelligently split audio into chunks at quiet points to minimize cutting through speech.
+
+```python
+import librosa
+from vllm import LLM, SamplingParams
+from vllm.multimodal.audio import split_audio
+
+# Load long audio file
+audio, sr = librosa.load("long_audio.wav", sr=16000)
+
+# Split into chunks at low-energy (quiet) regions
+chunks = split_audio(
+    audio_data=audio,
+    sample_rate=sr,
+    max_clip_duration_s=30.0,      # Maximum chunk length in seconds
+    overlap_duration_s=1.0,         # Search window for finding quiet split points
+    min_energy_window_size=1600,    # Window size for energy calculation (~100ms at 16kHz)
+)
+
+# Initialize Whisper model
+llm = LLM(model="openai/whisper-large-v3-turbo")
+sampling_params = SamplingParams(temperature=0, max_tokens=256)
+
+# Transcribe each chunk
+transcriptions = []
+for chunk in chunks:
+    outputs = llm.generate({
+        "prompt": "<|startoftranscript|><|en|><|transcribe|><|notimestamps|>",
+        "multi_modal_data": {"audio": (chunk, sr)},
+    }, sampling_params)
+    transcriptions.append(outputs[0].outputs[0].text)
+
+# Combine results
+full_transcription = " ".join(transcriptions)
+```
+
+The `split_audio` function:
+
+- Splits audio at quiet points to avoid cutting through speech
+- Uses RMS energy to find low-amplitude regions within the overlap window
+- Preserves all audio samples (no data loss)
+- Supports any sample rate
+
 #### Automatic Audio Channel Normalization

 vLLM automatically normalizes audio channels for models that require specific audio formats. When loading audio with libraries like `torchaudio`, stereo files return shape `[channels, time]`, but many audio models (particularly Whisper-based models) expect mono audio with shape `[time]`.
@@ -521,7 +566,7 @@ First, launch the OpenAI-compatible server:

 ```bash
 vllm serve microsoft/Phi-3.5-vision-instruct --runner generate \
-  --trust-remote-code --max-model-len 4096 --limit-mm-per-prompt '{"image":2}'
+  --trust-remote-code --max-model-len 4096 --limit-mm-per-prompt.image 2
 ```

 Then, you can use the OpenAI client as follows:

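Note on the `split_audio` addition in `docs/features/multimodal_inputs.md`: the new bullets describe cutting at the lowest-RMS-energy point inside the overlap window while keeping every sample. The sketch below illustrates that idea in plain Python. It is not vLLM's implementation: `rms`, `find_quiet_split`, and `split_at_quiet_points` are hypothetical names, and the non-overlapping window stepping is an assumption made for brevity.

```python
import math


def rms(window):
    """Root-mean-square energy of a non-empty sequence of samples."""
    return math.sqrt(sum(x * x for x in window) / len(window))


def find_quiet_split(samples, start, end, window_size):
    """Return the start index of the lowest-energy window inside [start, end)."""
    best_idx, best_energy = start, float("inf")
    # Step through non-overlapping windows; always evaluate at least one.
    stop = max(start + 1, end - window_size + 1)
    for i in range(start, stop, window_size):
        energy = rms(samples[i:i + window_size])
        if energy < best_energy:
            best_idx, best_energy = i, energy
    return best_idx


def split_at_quiet_points(samples, sample_rate, max_clip_s, overlap_s, window_size):
    """Split samples into chunks of at most max_clip_s seconds, cutting at quiet points."""
    max_len = int(max_clip_s * sample_rate)
    overlap = int(overlap_s * sample_rate)
    if not 0 < overlap < max_len:
        raise ValueError("overlap must be positive and shorter than the clip length")

    chunks, pos = [], 0
    while len(samples) - pos > max_len:
        # Search only the last `overlap` samples of the current clip.
        search_start = pos + max_len - overlap
        cut = find_quiet_split(samples, search_start, pos + max_len, window_size)
        chunks.append(samples[pos:cut])
        pos = cut
    chunks.append(samples[pos:])  # remainder, so no samples are dropped
    return chunks
```

Because each chunk starts exactly where the previous one ended, concatenating the chunks reproduces the input, which is the "no data loss" property the bullets mention.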
--- a/docs/features/nixl_connector_usage.md
+++ b/docs/features/nixl_connector_usage.md
@@ -197,8 +197,8 @@ For multi-host DP deployment, only need to provide the host/port of the head ins

 The `kv_load_failure_policy` setting controls how the system handles failures when the decoder instance loads KV cache blocks from the prefiller instance:

- **fail** (recommended): Immediately fail the request with an error when KV load fails. This prevents performance degradation by avoiding recomputation of prefill work on the decode instance.
- **recompute** (default): Recompute failed blocks locally on the decode instance. This may cause performance _jitter_ on decode instances as the scheduled prefill will delay and interfere with other decodes. Furthermore, decode instances are typically configured with low-latency optimizations.
+- **fail** (default): Immediately fail the request with an error when KV load fails. This prevents performance degradation by avoiding recomputation of prefill work on the decode instance.
+- **recompute**: Recompute failed blocks locally on the decode instance. This may cause performance _jitter_ on decode instances as the scheduled prefill will delay and interfere with other decodes. Furthermore, decode instances are typically configured with low-latency optimizations.

 !!! warning
    Using `kv_load_failure_policy="recompute"` can lead to performance degradation in production deployments. When KV loads fail, the decode instance will execute prefill work with decode-optimized configurations, which is inefficient and defeats the purpose of disaggregated prefilling. This also increases tail latency for other ongoing decode requests.

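Note on the `kv_load_failure_policy` default change in `docs/features/nixl_connector_usage.md`: deployments that relied on the old default may want to set the policy explicitly. The sketch below builds a `--kv-transfer-config` value in Python. It assumes the policy is a top-level key of that JSON object and reuses the `NixlConnector` and `kv_both` values seen elsewhere in these docs; `kv_transfer_config` is a hypothetical helper, so check the key placement against your vLLM version.

```python
import json

VALID_POLICIES = ("fail", "recompute")


def kv_transfer_config(policy="fail"):
    """Return a JSON string for --kv-transfer-config with an explicit failure policy."""
    if policy not in VALID_POLICIES:
        raise ValueError(f"policy must be one of {VALID_POLICIES}, got {policy!r}")
    return json.dumps(
        {
            "kv_connector": "NixlConnector",
            "kv_role": "kv_both",
            # Assumption: the policy is a top-level key of the transfer config.
            "kv_load_failure_policy": policy,
        }
    )


# Print the flag with the policy pinned, so the behavior does not depend on the default.
print("--kv-transfer-config '" + kv_transfer_config("fail") + "'")
```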
--- a/docs/features/quantization/README.md
+++ b/docs/features/quantization/README.md
@@ -44,21 +44,22 @@ th:not(:first-child) {
 }
 </style>

-| Implementation        | Volta   | Turing   | Ampere   | Ada   | Hopper   | AMD GPU   | Intel GPU   | x86 CPU   |
-|-----------------------|---------|----------|----------|-------|----------|-----------|-------------|-----------|
-| AWQ                   | ❌      | ✅︎       | ✅︎       | ✅︎    | ✅︎       | ❌         | ✅︎          | ✅︎        |
-| GPTQ                  | ✅︎      | ✅︎       | ✅︎       | ✅︎    | ✅︎       | ❌         | ✅︎          | ✅︎        |
-| Marlin (GPTQ/AWQ/FP8) | ❌      | ❌       | ✅︎       | ✅︎    | ✅︎       | ❌         | ❌          | ❌        |
-| INT8 (W8A8)           | ❌      | ✅︎       | ✅︎       | ✅︎    | ✅︎       | ❌         | ❌          | ✅︎        |
-| FP8 (W8A8)            | ❌      | ❌       | ❌       | ✅︎    | ✅︎       | ✅︎         | ❌          | ❌        |
-| bitsandbytes          | ✅︎      | ✅︎       | ✅︎       | ✅︎    | ✅︎       | ❌         | ❌          | ❌        |
-| DeepSpeedFP           | ✅︎      | ✅︎       | ✅︎       | ✅︎    | ✅︎       | ❌         | ❌          | ❌        |
-| GGUF                  | ✅︎      | ✅︎       | ✅︎       | ✅︎    | ✅︎       | ✅︎         | ❌          | ❌        |
+| Implementation            | Volta | Turing | Ampere | Ada | Hopper | AMD GPU | Intel GPU | x86 CPU |
+| ------------------------- | ----- | ------ | ------ | --- | ------ | ------- | --------- | ------- |
+| AWQ                       | ❌    | ✅︎     | ✅︎     | ✅︎  | ✅︎     | ❌      | ✅︎        | ✅︎      |
+| GPTQ                      | ✅︎    | ✅︎     | ✅︎     | ✅︎  | ✅︎     | ❌      | ✅︎        | ✅︎      |
+| Marlin (GPTQ/AWQ/FP8/FP4) | ❌    | ✅︎*    | ✅︎     | ✅︎  | ✅︎     | ❌      | ❌        | ❌      |
+| INT8 (W8A8)               | ❌    | ✅︎     | ✅︎     | ✅︎  | ✅︎     | ❌      | ❌        | ✅︎      |
+| FP8 (W8A8)                | ❌    | ❌     | ❌     | ✅︎  | ✅︎     | ✅︎      | ❌        | ❌      |
+| bitsandbytes              | ✅︎    | ✅︎     | ✅︎     | ✅︎  | ✅︎     | ❌      | ❌        | ❌      |
+| DeepSpeedFP               | ✅︎    | ✅︎     | ✅︎     | ✅︎  | ✅︎     | ❌      | ❌        | ❌      |
+| GGUF                      | ✅︎    | ✅︎     | ✅︎     | ✅︎  | ✅︎     | ✅︎      | ❌        | ❌      |

 - Volta refers to SM 7.0, Turing to SM 7.5, Ampere to SM 8.0/8.6, Ada to SM 8.9, and Hopper to SM 9.0.
 - ✅︎ indicates that the quantization method is supported on the specified hardware.
 - ❌ indicates that the quantization method is not supported on the specified hardware.
 - All Intel Gaudi quantization support has been migrated to [vLLM-Gaudi](https://github.com/vllm-project/vllm-gaudi).
+- *Turing does not support Marlin MXFP4.

 !!! note
    For information on quantization support on Google TPU, please refer to the [TPU-Inference Recommended Models and Features](https://docs.vllm.ai/projects/tpu/en/latest/recommended_models_features/) documentation.
@@ -130,7 +131,7 @@ class MyQuantConfig(QuantizationConfig):
 Your custom `QuantizationConfig` subclass must implement these abstract methods:

 | Method | Description |
-|--------|-------------|
+| ------ | ----------- |
 | `get_name()` | Returns the name of the quantization method |
 | `get_supported_act_dtypes()` | Returns list of supported activation dtypes (e.g., `torch.float16`) |
 | `get_min_capability()` | Returns minimum GPU compute capability (e.g., 80 for Ampere, -1 for no restriction) |

--- a/docs/features/quantization/bnb.md
+++ b/docs/features/quantization/bnb.md
@@ -7,7 +7,7 @@ Compared to other quantization methods, BitsAndBytes eliminates the need for cal
 Below are the steps to utilize BitsAndBytes with vLLM.

 ```bash
-pip install bitsandbytes>=0.46.1
+pip install "bitsandbytes>=0.49.2"
 ```

 vLLM reads the model's config file and supports both in-flight quantization and pre-quantized checkpoint.

--- a/docs/features/quantization/fp8.md
+++ b/docs/features/quantization/fp8.md
@@ -2,7 +2,7 @@

 vLLM supports FP8 (8-bit floating point) weight and activation quantization using hardware acceleration on GPUs such as Nvidia H100 and AMD MI300x.
 Currently, only Hopper and Ada Lovelace GPUs are officially supported for W8A8.
-Ampere GPUs are supported for W8A16 (weight-only FP8) utilizing Marlin kernels.
+Turing/Ampere GPUs are supported for W8A16 (weight-only FP8) utilizing Marlin kernels.
 Quantization of models with FP8 allows for a 2x reduction in model memory requirements and up to a 1.6x improvement in throughput with minimal impact on accuracy.

 Please visit the HF collection of [quantized FP8 checkpoints of popular LLMs ready to use with vLLM](https://huggingface.co/collections/neuralmagic/fp8-llms-for-vllm-666742ed2b78b7ac8df13127).
@@ -13,8 +13,8 @@ The FP8 types typically supported in hardware have two distinct representations,
 - **E5M2**: Consists of 1 sign bit, 5 exponent bits, and 2 bits of mantissa. It can store values up to +/-57344, +/- `inf`, and `nan`. The tradeoff for the increased dynamic range is lower precision of the stored values.

 !!! note
-    FP8 computation is supported on NVIDIA GPUs with compute capability > 8.9 (Ada Lovelace, Hopper).
-    FP8 models will run on compute capability > 8.0 (Ampere) as weight-only W8A16, utilizing FP8 Marlin.
+    FP8 computation is supported on NVIDIA GPUs with compute capability >= 8.9 (Ada Lovelace, Hopper).
+    FP8 models will run on compute capability >= 7.5 (Turing) as weight-only W8A16, utilizing FP8 Marlin.

 ## Installation

@@ -84,7 +84,7 @@ Since simple RTN does not require data for weight quantization and the activatio
 Install `vllm` and `lm-evaluation-harness` for evaluation:

 ```bash
-pip install vllm "lm-eval[api]>=0.4.9.2"
+pip install vllm "lm-eval[api]>=0.4.11"
 ```

 Load and run the model in `vllm`:
@@ -114,7 +114,7 @@ Here's an example of the resulting scores:

 ```text
 |Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
-|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
+| --- |------:| -------------- |-----:| --------- | - |----:| - |-----:|
 |gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.768|±  |0.0268|
 |     |       |strict-match    |     5|exact_match|↑  |0.768|±  |0.0268|
 ```
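Note on the compute-capability thresholds changed in `docs/features/quantization/fp8.md`: the corrected note reads as FP8 computation (W8A8) at compute capability >= 8.9 and weight-only W8A16 via FP8 Marlin at >= 7.5. The sketch below encodes only those two thresholds for NVIDIA GPUs. `fp8_mode` and `to_int_capability` are hypothetical helpers, not vLLM APIs. Comparing `(major, minor)` tuples avoids mixing up forms such as `100` and `10.0`.

```python
def fp8_mode(major, minor):
    """Map an NVIDIA compute capability to the FP8 mode described in the note."""
    capability = (major, minor)
    if capability >= (8, 9):  # Ada Lovelace, Hopper
        return "W8A8"
    if capability >= (7, 5):  # Turing, Ampere: weight-only via FP8 Marlin
        return "W8A16"
    return "unsupported"


def to_int_capability(major, minor):
    """Integer form used by get_min_capability(), e.g. 80 for Ampere (SM 8.0)."""
    return major * 10 + minor


for name, cc in [("Volta", (7, 0)), ("Turing", (7, 5)), ("Ampere", (8, 0)),
                 ("Ada", (8, 9)), ("Hopper", (9, 0))]:
    print(f"{name}: SM {cc[0]}.{cc[1]} -> {fp8_mode(*cc)}")
```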

--- a/docs/features/quantization/int4.md
+++ b/docs/features/quantization/int4.md
@@ -18,7 +18,7 @@ pip install llmcompressor
 Additionally, install `vllm` and `lm-evaluation-harness` for evaluation:

 ```bash
-pip install vllm "lm-eval[api]>=0.4.9.2"
+pip install vllm "lm-eval[api]>=0.4.11"
 ```

 ## Quantization Process

--- a/docs/features/quantization/int8.md
+++ b/docs/features/quantization/int8.md
@@ -9,7 +9,7 @@ Please visit the HF collection of [quantized INT8 checkpoints of popular LLMs re
    INT8 computation is supported on NVIDIA GPUs with compute capability > 7.5 (Turing, Ampere, Ada Lovelace, Hopper).

 !!! warning
-    **Blackwell GPU Limitation**: INT8 is not supported on compute capability >= 100 (e.g., RTX 6000 Blackwell).
+    **Blackwell GPU Limitation**: INT8 is not supported on compute capability >= 10.0 (e.g., RTX 6000 Blackwell).
    Use [FP8 quantization](fp8.md) instead, or run on Hopper/Ada/Ampere architectures.

 ## Prerequisites
@@ -23,7 +23,7 @@ pip install llmcompressor
 Additionally, install `vllm` and `lm-evaluation-harness` for evaluation:

 ```bash
-pip install vllm "lm-eval[api]>=0.4.9.2"
+pip install vllm "lm-eval[api]>=0.4.11"
 ```

 ## Quantization Process

--- a/docs/features/quantization/quark.md
+++ b/docs/features/quantization/quark.md
@@ -20,7 +20,7 @@ for more installation details.
 Additionally, install `vllm` and `lm-evaluation-harness` for evaluation:

 ```bash
-pip install vllm "lm-eval[api]>=0.4.9.2"
+pip install vllm "lm-eval[api]>=0.4.11"
 ```

 ## Quantization Process

--- a/docs/features/reasoning_outputs.md
+++ b/docs/features/reasoning_outputs.md
@@ -12,7 +12,7 @@ Reasoning models return an additional `reasoning` field in their outputs, which
 vLLM currently supports the following reasoning models:

 | Model Series | Parser Name | Structured Output Support | Tool Calling |
-|--------------|-------------|------------------|-------------|
+| ------------ | ----------- | ---------------- | ----------- |
 | [DeepSeek R1 series](https://huggingface.co/collections/deepseek-ai/deepseek-r1-678e1e131c0169c0bc89728d) | `deepseek_r1` | `json`, `regex` | ❌ |
 | [DeepSeek-V3.1](https://huggingface.co/collections/deepseek-ai/deepseek-v31-68a491bed32bd77e7fca048f) | `deepseek_v3` | `json`, `regex` | ❌ |
 | [ERNIE-4.5-VL series](https://huggingface.co/baidu/ERNIE-4.5-VL-28B-A3B-PT) | `ernie45` | `json`, `regex` | ❌ |

--- a/docs/features/spec_decode/README.md
+++ b/docs/features/spec_decode/README.md
-# Speculative Decoding
-
-!!! warning
-    Please note that speculative decoding in vLLM is not yet optimized and does
-    not usually yield inter-token latency reductions for all prompt datasets or sampling parameters.
-    The work to optimize it is ongoing and can be followed here: <https://github.com/vllm-project/vllm/issues/4630>
-
-!!! warning
-    Currently, speculative decoding in vLLM is not compatible with pipeline parallelism.
-
-This document shows how to use [Speculative Decoding](https://x.com/karpathy/status/1697318534555336961) with vLLM.
-Speculative decoding is a technique which improves inter-token latency in memory-bound LLM inference.
-
-!!! tip
-    To train your own draft models for speculative decoding, see [Speculators](speculators.md), a library for training draft models that integrates seamlessly with vLLM.
-
-## Speculating with a draft model
-
-The following code configures vLLM in an offline mode to use speculative decoding with a draft model, speculating 5 tokens at a time.
-
-!!! warning
-    In vllm v0.10.0, speculative decoding with a draft model is not supported.
-    If you use the following code, you will get a `NotImplementedError`.
-
-??? code
-
-    ```python
-    from vllm import LLM, SamplingParams
-
-    prompts = [
-        "The future of AI is",
-    ]
-    sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
-
-    llm = LLM(
-        model="facebook/opt-6.7b",
-        tensor_parallel_size=1,
-        speculative_config={
-            "model": "facebook/opt-125m",
-            "num_speculative_tokens": 5,
-        },
-    )
-    outputs = llm.generate(prompts, sampling_params)
-
-    for output in outputs:
-        prompt = output.prompt
-        generated_text = output.outputs[0].text
-        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
-    ```
-
-To perform the same with an online mode launch the server:
-
-```bash
-vllm serve facebook/opt-6.7b \
-    --host 0.0.0.0 \
-    --port 8000 \
-    --seed 42 \
-    -tp 1 \
-    --gpu_memory_utilization 0.8 \
-    --speculative_config '{"model": "facebook/opt-125m", "num_speculative_tokens": 5}'
-```
-
-!!! warning
-    Note: Please use `--speculative_config` to set all configurations related to speculative decoding. The previous method of specifying the model through `--speculative_model` and adding related parameters (e.g., `--num_speculative_tokens`) separately has been deprecated now.
-
-Then use a client:
-
-??? code
-
-    ```python
-    from openai import OpenAI
-
-    # Modify OpenAI's API key and API base to use vLLM's API server.
-    openai_api_key = "EMPTY"
-    openai_api_base = "http://localhost:8000/v1"
-
-    client = OpenAI(
-        # defaults to os.environ.get("OPENAI_API_KEY")
-        api_key=openai_api_key,
-        base_url=openai_api_base,
-    )
-
-    models = client.models.list()
-    model = models.data[0].id
-
-    # Completion API
-    stream = False
-    completion = client.completions.create(
-        model=model,
-        prompt="The future of AI is",
-        echo=False,
-        n=1,
-        stream=stream,
-    )
-
-    print("Completion results:")
-    if stream:
-        for c in completion:
-            print(c)
-    else:
-        print(completion)
-    ```
-
-## Speculating by matching n-grams in the prompt
-
-The following code configures vLLM to use speculative decoding where proposals are generated by
-matching n-grams in the prompt. For more information read [this thread.](https://x.com/joao_gante/status/1747322413006643259)
-
-??? code
-
-    ```python
-    from vllm import LLM, SamplingParams
-
-    prompts = [
-        "The future of AI is",
-    ]
-    sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
-
-    llm = LLM(
-        model="facebook/opt-6.7b",
-        tensor_parallel_size=1,
-        speculative_config={
-            "method": "ngram",
-            "num_speculative_tokens": 5,
-            "prompt_lookup_max": 4,
-        },
-    )
-    outputs = llm.generate(prompts, sampling_params)
-
-    for output in outputs:
-        prompt = output.prompt
-        generated_text = output.outputs[0].text
-        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
-    ```
-
-## Speculating using Suffix Decoding
-
-The following code configures vLLM to use speculative decoding where proposals are generated using Suffix Decoding ([technical report](https://arxiv.org/abs/2411.04975)).
-
-Like n-gram, Suffix Decoding can generate draft tokens by pattern-matching using the last `n` generated tokens. Unlike n-gram, Suffix Decoding (1) can pattern-match against both the prompt and previous generations, (2) uses frequency counts to propose the most likely continuations, and (3) speculates an adaptive number of tokens for each request at each iteration to get better acceptance rates.
-
-Suffix Decoding can achieve better performance for tasks with high repetition, such as code-editing, agentic loops (e.g. self-reflection, self-consistency), and RL rollouts.
-
-!!! tip "Install Arctic Inference"
-    Suffix Decoding requires [Arctic Inference](https://github.com/snowflakedb/ArcticInference). You can install it with `pip install arctic-inference`.
-
-!!! tip "Suffix Decoding Speculative Tokens"
-    Suffix Decoding will speculate a dynamic number of tokens for each request at each decoding step, so the `num_speculative_tokens` configuration specifies the *maximum* number of speculative tokens. It is suggested to use a high number such as `16` or `32` (default).
-
-??? code
-
-    ```python
-    from vllm import LLM, SamplingParams
-
-    prompts = [
-        "The future of AI is",
-    ]
-    sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
-
-    llm = LLM(
-        model="facebook/opt-6.7b",
-        tensor_parallel_size=1,
-        speculative_config={
-            "method": "suffix",
-            "num_speculative_tokens": 32,
-        },
-    )
-    outputs = llm.generate(prompts, sampling_params)
-
-    for output in outputs:
-        prompt = output.prompt
-        generated_text = output.outputs[0].text
-        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
-    ```
-
-## Speculating using MLP speculators
-
-The following code configures vLLM to use speculative decoding where proposals are generated by
-draft models that condition draft predictions on both context vectors and sampled tokens.
-For more information see [this blog](https://pytorch.org/blog/hitchhikers-guide-speculative-decoding/) or
-[this technical report](https://arxiv.org/abs/2404.19124).
-
-??? code
-
-    ```python
-    from vllm import LLM, SamplingParams
-
-    prompts = [
-        "The future of AI is",
-    ]
-    sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
-
-    llm = LLM(
-        model="meta-llama/Meta-Llama-3.1-70B-Instruct",
-        tensor_parallel_size=4,
-        speculative_config={
-            "model": "ibm-ai-platform/llama3-70b-accelerator",
-            "draft_tensor_parallel_size": 1,
-        },
-    )
-    outputs = llm.generate(prompts, sampling_params)
-
-    for output in outputs:
-        prompt = output.prompt
-        generated_text = output.outputs[0].text
-        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
-    ```
-
-Note that these speculative models currently need to be run without tensor parallelism, although
-it is possible to run the main model using tensor parallelism (see example above). Since the
-speculative models are relatively small, we still see significant speedups. However, this
-limitation will be fixed in a future release.
-
-A variety of speculative models of this type are available on HF hub:
-
- [llama-13b-accelerator](https://huggingface.co/ibm-ai-platform/llama-13b-accelerator)
- [llama3-8b-accelerator](https://huggingface.co/ibm-ai-platform/llama3-8b-accelerator)
- [codellama-34b-accelerator](https://huggingface.co/ibm-ai-platform/codellama-34b-accelerator)
- [llama2-70b-accelerator](https://huggingface.co/ibm-ai-platform/llama2-70b-accelerator)
- [llama3-70b-accelerator](https://huggingface.co/ibm-ai-platform/llama3-70b-accelerator)
- [granite-3b-code-instruct-accelerator](https://huggingface.co/ibm-granite/granite-3b-code-instruct-accelerator)
- [granite-8b-code-instruct-accelerator](https://huggingface.co/ibm-granite/granite-8b-code-instruct-accelerator)
- [granite-7b-instruct-accelerator](https://huggingface.co/ibm-granite/granite-7b-instruct-accelerator)
- [granite-20b-code-instruct-accelerator](https://huggingface.co/ibm-granite/granite-20b-code-instruct-accelerator)
-
-## Speculating using EAGLE based draft models
-
-The following code configures vLLM to use speculative decoding where proposals are generated by
-an [EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency)](https://arxiv.org/pdf/2401.15077) based draft model. A more detailed example for offline mode, including how to extract request level acceptance rate, can be found in [examples/offline_inference/spec_decode.py](../../../examples/offline_inference/spec_decode.py)
-
-??? code
-
-    ```python
-    from vllm import LLM, SamplingParams
-
-    prompts = [
-        "The future of AI is",
-    ]
-    sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
-
-    llm = LLM(
-        model="meta-llama/Meta-Llama-3-8B-Instruct",
-        tensor_parallel_size=4,
-        speculative_config={
-            "model": "yuhuili/EAGLE-LLaMA3-Instruct-8B",
-            "draft_tensor_parallel_size": 1,
-            "num_speculative_tokens": 2,
-            "method": "eagle",
-        },
-    )
-
-    outputs = llm.generate(prompts, sampling_params)
-
-    for output in outputs:
-        prompt = output.prompt
-        generated_text = output.outputs[0].text
-        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
-
-    ```
-
-A few important things to consider when using the EAGLE based draft models:
-
-1. The EAGLE draft models available in the [HF repository for EAGLE models](https://huggingface.co/yuhuili) should
-   be able to be loaded and used directly by vLLM after <https://github.com/vllm-project/vllm/pull/12304>.
-   If you are using vllm version before <https://github.com/vllm-project/vllm/pull/12304>, please use the
-   [script](https://gist.github.com/abhigoyal1997/1e7a4109ccb7704fbc67f625e86b2d6d) to convert the speculative model,
-   and specify `"model": "path/to/modified/eagle/model"` in `speculative_config`. If weight-loading problems still occur when using the latest version of vLLM, please leave a comment or raise an issue.
-
-2. The EAGLE based draft models need to be run without tensor parallelism
-   (i.e. draft_tensor_parallel_size is set to 1 in `speculative_config`), although
-   it is possible to run the main model using tensor parallelism (see example above).
-
-3. When using EAGLE-based speculators with vLLM, the observed speedup is lower than what is
-   reported in the reference implementation [here](https://github.com/SafeAILab/EAGLE). This issue is under
-   investigation and tracked here: <https://github.com/vllm-project/vllm/issues/9565>.
-
-4. When using EAGLE-3 based draft model, option "method" must be set to "eagle3".
-   That is, to specify `"method": "eagle3"` in `speculative_config`.
-
-A variety of EAGLE draft models are available on the Hugging Face hub:
-
-| Base Model                                                           | EAGLE on Hugging Face                     | # EAGLE Parameters |
-|---------------------------------------------------------------------|-------------------------------------------|--------------------|
-| Vicuna-7B-v1.3                                                       | yuhuili/EAGLE-Vicuna-7B-v1.3             | 0.24B              |
-| Vicuna-13B-v1.3                                                      | yuhuili/EAGLE-Vicuna-13B-v1.3            | 0.37B              |
-| Vicuna-33B-v1.3                                                      | yuhuili/EAGLE-Vicuna-33B-v1.3            | 0.56B              |
-| LLaMA2-Chat 7B                                                       | yuhuili/EAGLE-llama2-chat-7B             | 0.24B              |
-| LLaMA2-Chat 13B                                                      | yuhuili/EAGLE-llama2-chat-13B            | 0.37B              |
-| LLaMA2-Chat 70B                                                      | yuhuili/EAGLE-llama2-chat-70B            | 0.99B              |
-| Mixtral-8x7B-Instruct-v0.1                                           | yuhuili/EAGLE-mixtral-instruct-8x7B      | 0.28B              |
-| LLaMA3-Instruct 8B                                                   | yuhuili/EAGLE-LLaMA3-Instruct-8B         | 0.25B              |
-| LLaMA3-Instruct 70B                                                  | yuhuili/EAGLE-LLaMA3-Instruct-70B        | 0.99B              |
-| Qwen2-7B-Instruct                                                    | yuhuili/EAGLE-Qwen2-7B-Instruct          | 0.26B              |
-| Qwen2-72B-Instruct                                                   | yuhuili/EAGLE-Qwen2-72B-Instruct         | 1.05B              |
-
-## Lossless guarantees of Speculative Decoding
-
-In vLLM, speculative decoding aims to enhance inference efficiency while maintaining accuracy. This section addresses the lossless guarantees of
-speculative decoding, breaking down the guarantees into three key areas:
-
-1. **Theoretical Losslessness**
-   \- Speculative decoding sampling is theoretically lossless up to the precision limits of hardware numerics. Floating-point errors might
-   cause slight variations in output distributions, as discussed
-   in [Accelerating Large Language Model Decoding with Speculative Sampling](https://arxiv.org/pdf/2302.01318)
-
-2. **Algorithmic Losslessness**
-   \- vLLM’s implementation of speculative decoding is algorithmically validated to be lossless. Key validation tests include:
-
-    > - **Rejection Sampler Convergence**: Ensures that samples from vLLM’s rejection sampler align with the target
-    >   distribution. [View Test Code](https://github.com/vllm-project/vllm/blob/47b65a550866c7ffbd076ecb74106714838ce7da/tests/samplers/test_rejection_sampler.py#L252)
-    > - **Greedy Sampling Equality**: Confirms that greedy sampling with speculative decoding matches greedy sampling
-    >   without it. This verifies that vLLM's speculative decoding framework, when integrated with the vLLM forward pass and the vLLM rejection sampler,
-    >   provides a lossless guarantee. Almost all of the tests in [tests/spec_decode/e2e](../../tests/spec_decode/e2e).
-    >   verify this property using [this assertion implementation](https://github.com/vllm-project/vllm/blob/b67ae00cdbbe1a58ffc8ff170f0c8d79044a684a/tests/spec_decode/e2e/conftest.py#L291)
-
-3. **vLLM Logprob Stability**
-   \- vLLM does not currently guarantee stable token log probabilities (logprobs). This can result in different outputs for the
-   same request across runs. For more details, see the FAQ section
-   titled *Can the output of a prompt vary across runs in vLLM?* in the [FAQs](../../usage/faq.md).
-
-While vLLM strives to ensure losslessness in speculative decoding, variations in generated outputs with and without speculative decoding
-can occur due to following factors:
-
- **Floating-Point Precision**: Differences in hardware numerical precision may lead to slight discrepancies in the output distribution.
- **Batch Size and Numerical Stability**: Changes in batch size may cause variations in logprobs and output probabilities, potentially
-  due to non-deterministic behavior in batched operations or numerical instability.
-
-For mitigation strategies, please refer to the FAQ entry *Can the output of a prompt vary across runs in vLLM?* in the [FAQs](../../usage/faq.md).
-
-## Resources for vLLM contributors
-
- [A Hacker's Guide to Speculative Decoding in vLLM](https://www.youtube.com/watch?v=9wNAgpX6z_4)
- [What is Lookahead Scheduling in vLLM?](https://docs.google.com/document/d/1Z9TvqzzBPnh5WHcRwjvK2UEeFeq5zMZb5mFE8jR0HCs/edit#heading=h.1fjfb0donq5a)
- [Information on batch expansion](https://docs.google.com/document/d/1T-JaS2T1NRfdP51qzqpyakoCXxSXTtORppiwaj5asxA/edit#heading=h.kk7dq05lc6q8)
- [Dynamic speculative decoding](https://github.com/vllm-project/vllm/issues/4565)
--- a/docs/features/speculative_decoding/README.md
+++ b/docs/features/speculative_decoding/README.md
+# Speculative Decoding
+
+This document shows how to use [Speculative Decoding](https://arxiv.org/pdf/2302.01318) with vLLM to reduce inter-token latency under medium-to-low QPS (queries per second), memory-bound workloads.
+
+To train your own draft models for optimized speculative decoding, see [vllm-project/speculators](speculators.md) for seamless training and integration with vLLM.
+
+## vLLM Speculation Methods
+
+vLLM supports a variety of speculative decoding methods. Model-based methods such as EAGLE, MTP, draft models, PARD and MLP provide the best latency reduction, while simpler methods such as n-gram and suffix decoding provide modest speedups without increasing workload during peak traffic.
+
+- [EAGLE](eagle.md)
+- [Multi-Token Prediction (MTP)](mtp.md)
+- [Draft Model](draft_model.md)
+- [Parallel Draft Model (PARD)](parallel_draft_model.md)
+- [Multi-Layer Perceptron](mlp.md)
+- [N-Gram](n_gram.md)
+- [Suffix Decoding](suffix.md)
+
+## Method Selection at a Glance
+
+Use this qualitative table as a starting point for method selection. Real gains
+depend on your model family, traffic pattern, hardware, and sampling settings.
+
+| Method | Low QPS (latency focused) | High QPS (throughput focused) | Notes |
+| --- | --- | --- | --- |
+| EAGLE | High gain | Medium to high gain | Strong general-purpose model-based method. |
+| MTP | High gain | Medium to high gain | Best when the target model has native MTP support. |
+| Draft model | High gain | Medium gain | Needs a separate draft model. |
+| Parallel Draft Model | High gain | Medium to high gain | Low draft model latency. |
+| MLP speculator | Medium to high gain | Medium gain | Good when compatible MLP speculators are available. |
+| N-gram | Low to medium gain | Medium gain | Lightweight and easy to enable. |
+| Suffix decoding | Low to medium gain | Medium gain | No extra draft model; dynamic speculation depth. |
+
+For reproducible measurements in your environment, use
+[`examples/offline_inference/spec_decode.py`](../../../examples/offline_inference/spec_decode.py)
+or the [benchmark CLI guide](../../benchmarking/cli.md).
+
+## Lossless guarantees of Speculative Decoding
+
+In vLLM, speculative decoding aims to enhance inference efficiency while maintaining accuracy. This section addresses the lossless guarantees of
+speculative decoding, breaking down the guarantees into three key areas:
+
+1. **Theoretical Losslessness**
+   \- Speculative decoding sampling is theoretically lossless up to the precision limits of hardware numerics. Floating-point errors might
+   cause slight variations in output distributions, as discussed
+   in [Accelerating Large Language Model Decoding with Speculative Sampling](https://arxiv.org/pdf/2302.01318).
+
+2. **Algorithmic Losslessness**
+   \- vLLM’s implementation of speculative decoding is algorithmically validated to be lossless. Key validation tests include:
+
+    > - **Rejection Sampler Convergence**: Ensures that samples from vLLM’s rejection sampler align with the target
+    >   distribution. [View Test Code](https://github.com/vllm-project/vllm/blob/47b65a550866c7ffbd076ecb74106714838ce7da/tests/samplers/test_rejection_sampler.py#L252)
+    > - **Greedy Sampling Equality**: Confirms that greedy sampling with speculative decoding matches greedy sampling
+    >   without it. This verifies that vLLM's speculative decoding framework, when integrated with the vLLM forward pass and the vLLM rejection sampler,
+    >   provides a lossless guarantee. Almost all of the tests in [tests/spec_decode/e2e](/tests/v1/spec_decode)
+    >   verify this property using [this assertion implementation](https://github.com/vllm-project/vllm/blob/b67ae00cdbbe1a58ffc8ff170f0c8d79044a684a/tests/spec_decode/e2e/conftest.py#L291).
+
+3. **vLLM Logprob Stability**
+   \- vLLM does not currently guarantee stable token log probabilities (logprobs). This can result in different outputs for the
+   same request across runs. For more details, see the FAQ section
+   titled *Can the output of a prompt vary across runs in vLLM?* in the [FAQs](../../usage/faq.md).
+
+While vLLM strives to ensure losslessness in speculative decoding, variations in generated outputs with and without speculative decoding
+can occur due to the following factors:
+
+- **Floating-Point Precision**: Differences in hardware numerical precision may lead to slight discrepancies in the output distribution.
+- **Batch Size and Numerical Stability**: Changes in batch size may cause variations in logprobs and output probabilities, potentially
+  due to non-deterministic behavior in batched operations or numerical instability.
+
+For mitigation strategies, please refer to the FAQ entry *Can the output of a prompt vary across runs in vLLM?* in the [FAQs](../../usage/faq.md).
+
+## Known Feature Incompatibility
+
+1. Pipeline parallelism is not composable with speculative decoding as of `vllm<=0.15.0`.
+2. Speculative decoding with draft models is not supported in `vllm<=0.10.0`.
+
+## Resources for vLLM contributors
+
+- [[vLLM Office Hours #40] Intro to Speculators](https://www.youtube.com/watch?v=2ISAr_JVGLs)
+- [A Hacker's Guide to Speculative Decoding in vLLM](https://www.youtube.com/watch?v=9wNAgpX6z_4)
+- [What is Lookahead Scheduling in vLLM?](https://docs.google.com/document/d/1Z9TvqzzBPnh5WHcRwjvK2UEeFeq5zMZb5mFE8jR0HCs/edit#heading=h.1fjfb0donq5a)
+- [Information on batch expansion](https://docs.google.com/document/d/1T-JaS2T1NRfdP51qzqpyakoCXxSXTtORppiwaj5asxA/edit#heading=h.kk7dq05lc6q8)
+- [Dynamic speculative decoding](https://github.com/vllm-project/vllm/issues/4565)
--- a/docs/features/speculative_decoding/draft_model.md
+++ b/docs/features/speculative_decoding/draft_model.md
+# Draft Models
+
+The following code configures vLLM in an offline mode to use speculative decoding with a draft model, speculating 5 tokens at a time.
+
+```python
+from vllm import LLM, SamplingParams
+
+prompts = ["The future of AI is"]
+sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
+
+llm = LLM(
+    model="Qwen/Qwen3-8B",
+    tensor_parallel_size=1,
+    speculative_config={
+        "model": "Qwen/Qwen3-0.6B",
+        "num_speculative_tokens": 5,
+        "method": "draft_model",
+    },
+)
+outputs = llm.generate(prompts, sampling_params)
+
+for output in outputs:
+    prompt = output.prompt
+    generated_text = output.outputs[0].text
+    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
+```
+
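+To launch a similar configuration in online mode, use the following server-side command (this example serves a different target model, `Qwen/Qwen3-4B-Thinking-2507`, with the same draft model):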
+
+```bash
+vllm serve Qwen/Qwen3-4B-Thinking-2507 \
+    --host 0.0.0.0 \
+    --port 8000 \
+    --seed 42 \
+    -tp 1 \
+    --max_model_len 2048 \
+    --gpu_memory_utilization 0.8 \
+    --speculative_config '{"model": "Qwen/Qwen3-0.6B", "num_speculative_tokens": 5, "method": "draft_model"}'
+```
+
+The code used to request completions as a client remains unchanged:
+
+??? code
+
+    ```python
+    from openai import OpenAI
+
+    # Modify OpenAI's API key and API base to use vLLM's API server.
+    openai_api_key = "EMPTY"
+    openai_api_base = "http://localhost:8000/v1"
+
+    client = OpenAI(
+        # defaults to os.environ.get("OPENAI_API_KEY")
+        api_key=openai_api_key,
+        base_url=openai_api_base,
+    )
+
+    models = client.models.list()
+    model = models.data[0].id
+
+    # Completion API
+    stream = False
+    completion = client.completions.create(
+        model=model,
+        prompt="The future of AI is",
+        echo=False,
+        n=1,
+        stream=stream,
+    )
+
+    print("Completion results:")
+    if stream:
+        for c in completion:
+            print(c)
+    else:
+        print(completion)
+    ```
+
+!!! warning
+    Use `--speculative_config` to set all configuration related to speculative decoding. The previous method of specifying the model through `--speculative_model` and adding related parameters (e.g., `--num_speculative_tokens`) separately has been deprecated.
--- a/docs/features/speculative_decoding/eagle.md
+++ b/docs/features/speculative_decoding/eagle.md
+# EAGLE Draft Models
+
+The following code configures vLLM to use speculative decoding where proposals are generated by an [EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency)](https://arxiv.org/pdf/2401.15077) based draft model. A more detailed example for offline mode, including how to extract the request-level acceptance rate, can be found in [examples/offline_inference/spec_decode.py](../../../examples/offline_inference/spec_decode.py).
+
+## Eagle Drafter Example
+
+```python
+from vllm import LLM, SamplingParams
+
+prompts = ["The future of AI is"]
+sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
+
+llm = LLM(
+    model="meta-llama/Meta-Llama-3-8B-Instruct",
+    tensor_parallel_size=4,
+    speculative_config={
+        "model": "yuhuili/EAGLE-LLaMA3-Instruct-8B",
+        "draft_tensor_parallel_size": 1,
+        "num_speculative_tokens": 2,
+        "method": "eagle",
+    },
+)
+
+outputs = llm.generate(prompts, sampling_params)
+
+for output in outputs:
+    prompt = output.prompt
+    generated_text = output.outputs[0].text
+    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
+```
+
+## Eagle3 Drafter Example
+
+```python
+from vllm import LLM, SamplingParams
+
+prompts = ["The future of AI is"]
+sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
+
+llm = LLM(
+    model="meta-llama/Meta-Llama-3-8B-Instruct",
+    tensor_parallel_size=2,
+    speculative_config={
+        "model": "RedHatAI/Llama-3.1-8B-Instruct-speculator.eagle3",
+        "draft_tensor_parallel_size": 2,
+        "num_speculative_tokens": 2,
+        "method": "eagle3",
+    },
+)
+
+outputs = llm.generate(prompts, sampling_params)
+
+for output in outputs:
+    prompt = output.prompt
+    generated_text = output.outputs[0].text
+    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
+```
+
+## Pre-Trained Eagle Draft Models
+
+A variety of EAGLE draft models are available on the Hugging Face hub:
+
+* [RedHatAI/speculator-models](https://huggingface.co/collections/RedHatAI/speculator-models)
+* [yuhuili/models](https://huggingface.co/yuhuili/models?search=eagle)
+
+!!! warning
+    If you are using `vllm<0.7.0`, please use [this script](https://gist.github.com/abhigoyal1997/1e7a4109ccb7704fbc67f625e86b2d6d) to convert the speculative model and specify `"model": "path/to/modified/eagle/model"` in `speculative_config`.
--- a/docs/features/speculative_decoding/mlp.md
+++ b/docs/features/speculative_decoding/mlp.md
+# MLP Draft Models
+
+The following code configures vLLM to use speculative decoding where proposals are generated by draft models that condition draft predictions on both context vectors and sampled tokens. For more information, see [The Hitchhiker's Guide to Speculative Decoding](https://pytorch.org/blog/hitchhikers-guide-speculative-decoding/) and [IBM Research's Technical Report](https://arxiv.org/abs/2404.19124).
+
+## MLP Drafter Example
+
+```python
+from vllm import LLM, SamplingParams
+
+prompts = ["The future of AI is"]
+sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
+
+llm = LLM(
+    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
+    tensor_parallel_size=1,
+    speculative_config={
+        "model": "ibm-ai-platform/llama3-8b-accelerator",
+        "draft_tensor_parallel_size": 1,
+        "method": "mlp_speculator",
+    },
+)
+outputs = llm.generate(prompts, sampling_params)
+
+for output in outputs:
+    prompt = output.prompt
+    generated_text = output.outputs[0].text
+    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
+```
+
+!!! warning "Known issue"
+    `ibm-ai-platform/llama3-70b-accelerator` can fail with:
+    `AttributeError: 'MLPSpeculatorConfig' object has no attribute 'num_attention_heads'`.
+    Track status in [#34106](https://github.com/vllm-project/vllm/issues/34106)
+    and [#34163](https://github.com/vllm-project/vllm/pull/34163).
+
+## Pre-Trained MLP Drafter Models
+
+A variety of speculative models of this type are available on the Hugging Face hub:
+
+- [llama-13b-accelerator](https://huggingface.co/ibm-ai-platform/llama-13b-accelerator)
+- [llama3-8b-accelerator](https://huggingface.co/ibm-ai-platform/llama3-8b-accelerator)
+- [codellama-34b-accelerator](https://huggingface.co/ibm-ai-platform/codellama-34b-accelerator)
+- [llama2-70b-accelerator](https://huggingface.co/ibm-ai-platform/llama2-70b-accelerator)
+- [llama3-70b-accelerator](https://huggingface.co/ibm-ai-platform/llama3-70b-accelerator)
+- [granite-3b-code-instruct-accelerator](https://huggingface.co/ibm-granite/granite-3b-code-instruct-accelerator)
+- [granite-8b-code-instruct-accelerator](https://huggingface.co/ibm-granite/granite-8b-code-instruct-accelerator)
+- [granite-7b-instruct-accelerator](https://huggingface.co/ibm-granite/granite-7b-instruct-accelerator)
+- [granite-20b-code-instruct-accelerator](https://huggingface.co/ibm-granite/granite-20b-code-instruct-accelerator)
--- a/docs/features/speculative_decoding/mtp.md
+++ b/docs/features/speculative_decoding/mtp.md
+# MTP (Multi-Token Prediction)
+
+MTP is a speculative decoding method where the target model includes native
+multi-token prediction capability. Unlike draft-model-based methods, you do not
+need to provide a separate draft model.
+
+MTP is useful when:
+
+- Your model natively supports MTP.
+- You want model-based speculative decoding with minimal extra configuration.
+
+## Offline Example
+
+```python
+from vllm import LLM, SamplingParams
+
+prompts = ["The future of AI is"]
+sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
+
+llm = LLM(
+    model="XiaomiMiMo/MiMo-7B-Base",
+    tensor_parallel_size=1,
+    speculative_config={
+        "method": "mtp",
+        "num_speculative_tokens": 1,
+    },
+)
+outputs = llm.generate(prompts, sampling_params)
+
+for output in outputs:
+    prompt = output.prompt
+    generated_text = output.outputs[0].text
+    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
+```
+
+## Online Example
+
+```bash
+vllm serve XiaomiMiMo/MiMo-7B-Base \
+    --tensor-parallel-size 1 \
+    --speculative_config '{"method":"mtp","num_speculative_tokens":1}'
+```
+
+## Notes
+
+- MTP only works for model families that support MTP in vLLM.
+- `num_speculative_tokens` controls speculative depth. A small value like `1`
+  is a good default to start with.
+- If your model does not support MTP, use another method such as EAGLE or draft
+  model speculation.
--- a/docs/features/speculative_decoding/n_gram.md
+++ b/docs/features/speculative_decoding/n_gram.md
+# N-Gram Speculation
+
+The following code configures vLLM to use speculative decoding where proposals are generated by
+matching n-grams in the prompt. For more information, read [this thread](https://x.com/joao_gante/status/1747322413006643259).
+
+```python
+from vllm import LLM, SamplingParams
+
+prompts = ["The future of AI is"]
+sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
+
+llm = LLM(
+    model="Qwen/Qwen3-8B",
+    tensor_parallel_size=1,
+    speculative_config={
+        "method": "ngram",
+        "num_speculative_tokens": 5,
+        "prompt_lookup_max": 4,
+    },
+)
+outputs = llm.generate(prompts, sampling_params)
+
+for output in outputs:
+    prompt = output.prompt
+    generated_text = output.outputs[0].text
+    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
+```