Commit 7cc6058a authored by Xing Liu, committed by GitHub

[Doc] Add MTP docs and update speculative decoding guidance (#35197)


Signed-off-by: liuxing <945764858@qq.com>
parent 28028dff
@@ -6,14 +6,33 @@ To train your own draft models for optimized speculative decoding, see [vllm-pro
## vLLM Speculation Methods
vLLM supports several speculative decoding methods. Model-based methods such as EAGLE, MTP, draft models, and MLP provide the best latency reduction, while simpler methods such as n-gram and suffix decoding provide modest speedups without increasing workload during peak traffic.
- [EAGLE](eagle.md)
- [Multi-Token Prediction (MTP)](mtp.md)
- [Draft Model](draft_model.md)
- [Multi-Layer Perceptron](mlp.md)
- [N-Gram](n_gram.md)
- [Suffix Decoding](suffix.md)
## Method Selection at a Glance
Use this qualitative table as a starting point for method selection. Real gains
depend on your model family, traffic pattern, hardware, and sampling settings.
| Method | Low QPS (latency focused) | High QPS (throughput focused) | Notes |
| --- | --- | --- | --- |
| EAGLE | High gain | Medium to high gain | Strong general-purpose model-based method. |
| MTP | High gain | Medium to high gain | Best when the target model has native MTP support. |
| Draft model | High gain | Medium gain | Needs a separate draft model. |
| MLP speculator | Medium to high gain | Medium gain | Good when compatible MLP speculators are available. |
| N-gram | Low to medium gain | Medium gain | Lightweight and easy to enable. |
| Suffix decoding | Low to medium gain | Medium gain | No extra draft model; dynamic speculation depth. |
For reproducible measurements in your environment, use
[`examples/offline_inference/spec_decode.py`](../../../examples/offline_inference/spec_decode.py)
or the [benchmark CLI guide](../../benchmarking/cli.md).
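As a rough illustration of what such a measurement reports, the sketch below times a generation callable and computes generated tokens per second, so that a baseline run and a speculative run can be compared. The helper names `tokens_per_second` and `speedup` are hypothetical and not part of vLLM. The sketch assumes only that each returned item exposes `outputs[0].token_ids`, as vLLM's `RequestOutput` objects do.

```python
import time
from typing import Callable, Sequence


def tokens_per_second(generate: Callable[[], Sequence], clock=time.perf_counter) -> float:
    """Time one call to `generate` and return generated tokens per second.

    `generate` is a hypothetical zero-argument callable, for example
    `lambda: llm.generate(prompts, sampling_params)`.
    """
    start = clock()
    outputs = generate()
    elapsed = clock() - start
    # Count only the generated tokens of the first completion of each request.
    total_tokens = sum(len(out.outputs[0].token_ids) for out in outputs)
    return total_tokens / elapsed if elapsed > 0 else float("inf")


def speedup(baseline_tps: float, speculative_tps: float) -> float:
    """Return the throughput ratio; a value above 1.0 means the speculative run was faster."""
    return speculative_tps / baseline_tps
```

For a fair comparison, run it once without `speculative_config` and once with it, using the same prompts and sampling parameters, and discard a warm-up run first.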
## Lossless guarantees of Speculative Decoding
In vLLM, speculative decoding aims to enhance inference efficiency while maintaining accuracy. This section addresses the lossless guarantees of
......
@@ -11,10 +11,10 @@ prompts = ["The future of AI is"]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
llm = LLM(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    tensor_parallel_size=1,
    speculative_config={
        "model": "ibm-ai-platform/llama3-8b-accelerator",
        "draft_tensor_parallel_size": 1,
        "method": "mlp_speculator",
    },
@@ -27,6 +27,12 @@ for output in outputs:
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```
!!! warning "Known issue"
    `ibm-ai-platform/llama3-70b-accelerator` can fail with:
    `AttributeError: 'MLPSpeculatorConfig' object has no attribute 'num_attention_heads'`.
    Track status in [#34106](https://github.com/vllm-project/vllm/issues/34106)
    and [#34163](https://github.com/vllm-project/vllm/pull/34163).
## Pre-Trained MLP Drafter Models
A variety of speculative models of this type are available on HF hub:
......
# MTP (Multi-Token Prediction)
MTP is a speculative decoding method where the target model includes native
multi-token prediction capability. Unlike draft-model-based methods, you do not
need to provide a separate draft model.
MTP is useful when:
- Your model natively supports MTP.
- You want model-based speculative decoding with minimal extra configuration.
## Offline Example
```python
from vllm import LLM, SamplingParams

prompts = ["The future of AI is"]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# No separate draft model is configured: the target model provides the MTP capability.
llm = LLM(
    model="XiaomiMiMo/MiMo-7B-Base",
    tensor_parallel_size=1,
    speculative_config={
        "method": "mtp",
        "num_speculative_tokens": 1,
    },
)
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```
## Online Example
```bash
vllm serve XiaomiMiMo/MiMo-7B-Base \
    --tensor-parallel-size 1 \
    --speculative_config '{"method":"mtp","num_speculative_tokens":1}'
```
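Once the server is up, clients send requests exactly as they would without speculative decoding. The sketch below builds and sends a completion request using only the standard library. It assumes the server listens on vLLM's default address `http://localhost:8000` and that the OpenAI-compatible `/v1/completions` route is used. The helper names `build_completion_request` and `complete` are hypothetical.

```python
import json
import urllib.request


def build_completion_request(
    prompt: str,
    model: str = "XiaomiMiMo/MiMo-7B-Base",
    base_url: str = "http://localhost:8000",  # assumed default host and port
    max_tokens: int = 64,
) -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible completions route."""
    payload = {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": 0.8,
        "top_p": 0.95,
    }
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt: str) -> str:
    """Send the request to a running server and return the generated text."""
    with urllib.request.urlopen(build_completion_request(prompt)) as response:
        body = json.load(response)
    return body["choices"][0]["text"]
```

With the server from the command above running, `print(complete("The future of AI is"))` prints the generated continuation.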
## Notes
- MTP only works for model families that support MTP in vLLM.
- `num_speculative_tokens` controls speculative depth. A small value like `1`
is a good default to start with.
- If your model does not support MTP, use another method such as EAGLE or draft
model speculation.
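To build intuition for choosing `num_speculative_tokens`, the sketch below applies the standard simplified analysis of speculative decoding. It assumes every drafted token is accepted independently with the same probability `alpha`, in which case a step that drafts `k` tokens yields `(1 - alpha**(k + 1)) / (1 - alpha)` tokens in expectation. Both the independence assumption and the function name `expected_tokens_per_step` are illustrative; real acceptance rates vary with position, model, and workload, so this does not predict vLLM's measured behavior.

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens produced per target-model step.

    alpha: assumed per-token acceptance probability (illustrative, 0 to 1).
    k: number of drafted tokens per step (the role of num_speculative_tokens).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if k < 0:
        raise ValueError("k must be non-negative")
    if alpha == 1.0:
        # Every draft is accepted, plus the one token the target always contributes.
        return float(k + 1)
    # Geometric series 1 + alpha + alpha**2 + ... + alpha**k.
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


# Print the estimate for a few depths at an assumed acceptance rate of 0.7.
for k in (1, 2, 4, 8):
    print(k, round(expected_tokens_per_step(0.7, k), 3))
```

Under this model each additional drafted token adds less than the previous one, while drafting and verifying it still costs compute, which is consistent with starting from a small value.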