Commit ecbe42e9, authored by xiao, committed by GitHub

[Doc] Clarify supported keys for --speculative-config (#40455)


Signed-off-by: Wangxiaoxiaoa <Wangxiaoxiaoa@users.noreply.github.com>
Co-authored-by: Wangxiaoxiaoa <Wangxiaoxiaoa@users.noreply.github.com>
parent a250f1bd
......@@ -35,6 +35,99 @@ For reproducible measurements in your environment, use
[`examples/offline_inference/spec_decode.py`](../../../examples/offline_inference/spec_decode.py)
or the [benchmark CLI guide](../../benchmarking/cli.md).
## `--speculative-config` schema
Use `--speculative-config` to pass speculative decoding settings as a JSON
object on the CLI:
```bash
vllm serve <target-model> \
--speculative-config '{
"method": "draft_model",
"model": "<draft-model>",
"num_speculative_tokens": 5
}'
```
The same keys are accepted from Python via `LLM(..., speculative_config={...})`.
The tables below highlight common user-facing keys accepted in this JSON
object; they are not an exhaustive schema reference.
For more details, see the generated [engine arguments reference](../../configuration/engine_args.md)
and the API docs for [vllm.config.SpeculativeConfig][].
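As a minimal sketch of the equivalence described above, the same settings can be kept in a Python dict and serialized into the CLI form. The helper name `to_cli_flag` is hypothetical and not part of vLLM; `<draft-model>` and `<target-model>` are the placeholders from the example above.

```python
import json
import shlex

# Same keys as the CLI example above; "<draft-model>" is a placeholder.
speculative_config = {
    "method": "draft_model",
    "model": "<draft-model>",
    "num_speculative_tokens": 5,
}


def to_cli_flag(config: dict) -> str:
    """Render a dict as a shell-quoted --speculative-config argument.

    Hypothetical helper for illustration only; not part of vLLM.
    """
    return "--speculative-config " + shlex.quote(json.dumps(config))


print(to_cli_flag(speculative_config))

# The same dict can be passed from Python, as the text above notes:
#   from vllm import LLM
#   llm = LLM(model="<target-model>", speculative_config=speculative_config)
```

Keeping one dict as the source of truth avoids hand-escaping JSON when switching between the CLI and the Python entry point.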
### Common keys
These keys are commonly used across speculative decoding setups, though some
only apply to model-based methods such as `draft_model`, `mtp`, `eagle3`, and
`dflash`.
| Key | Type | Default | Allowed values / meaning |
| --- | --- | --- | --- |
| `method` | `string` | `None` | Speculation method. Common values include `draft_model`, `ngram`, `suffix`, `mtp`, `eagle3`, and `dflash`. If omitted, vLLM infers the method from the provided configuration when possible. |
| `model` | `string` | `None` | Draft model, EAGLE head, or auxiliary model identifier. For `ngram`, `ngram_gpu`, `suffix`, and `mtp`, this can often be omitted. |
| `num_speculative_tokens` | `integer > 0` | `None` | Number of speculative tokens to propose per step. Required for methods that do not infer it from model metadata. |
| `draft_tensor_parallel_size` | `integer >= 1` | `None` | Tensor parallel size for the draft model. |
| `max_model_len` | `integer >= 1` | `None` | Maximum context length for the draft model. |
| `parallel_drafting` | `boolean` | `false` | Enable parallel draft token generation. Only compatible with EAGLE and draft-model methods. |
| `rejection_sample_method` | `string` | `strict` | `strict`, `probabilistic`, or `synthetic`. |
| `synthetic_acceptance_rate` | `float` | `None` | Average acceptance rate to target when `rejection_sample_method` is `synthetic`. Valid range is `[0, 1]`. |
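The constraints in the table above can be checked before launching a server. The following is a sketch only: `check_common_keys` is a hypothetical helper that mirrors the table, not vLLM's own validation, which may differ in detail.

```python
ALLOWED_REJECTION_METHODS = {"strict", "probabilistic", "synthetic"}


def _is_int(value) -> bool:
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, int) and not isinstance(value, bool)


def check_common_keys(config: dict) -> list[str]:
    """Return a list of problems; an empty list means none were detected.

    Hypothetical pre-flight check based on the table above.
    """
    problems = []

    n = config.get("num_speculative_tokens")
    if n is not None and (not _is_int(n) or n <= 0):
        problems.append("num_speculative_tokens must be an integer > 0")

    for key in ("draft_tensor_parallel_size", "max_model_len"):
        value = config.get(key)
        if value is not None and (not _is_int(value) or value < 1):
            problems.append(f"{key} must be an integer >= 1")

    method = config.get("rejection_sample_method", "strict")
    if method not in ALLOWED_REJECTION_METHODS:
        problems.append("rejection_sample_method must be one of "
                        + ", ".join(sorted(ALLOWED_REJECTION_METHODS)))

    rate = config.get("synthetic_acceptance_rate")
    if rate is not None and not 0 <= rate <= 1:
        problems.append("synthetic_acceptance_rate must be in [0, 1]")

    return problems


print(check_common_keys({"method": "ngram", "num_speculative_tokens": 4}))
```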
### Method-specific keys
#### N-gram
| Key | Type | Default | Meaning |
| --- | --- | --- | --- |
| `prompt_lookup_max` | `integer >= 1` | `5` if both lookup bounds are omitted; otherwise mirrors `prompt_lookup_min` when omitted | Maximum n-gram window size. |
| `prompt_lookup_min` | `integer >= 1` | `5` if both lookup bounds are omitted; otherwise mirrors `prompt_lookup_max` when omitted | Minimum n-gram window size. |
Example:
```bash
vllm serve <target-model> \
--speculative-config '{
"method": "ngram",
"num_speculative_tokens": 4,
"prompt_lookup_min": 2,
"prompt_lookup_max": 5
}'
```
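The defaulting rule stated in the N-gram table can be sketched as a small function. `resolve_lookup_bounds` is a hypothetical name used for illustration; it restates the documented defaults and is not vLLM's implementation.

```python
from typing import Optional, Tuple


def resolve_lookup_bounds(
    lookup_min: Optional[int] = None,
    lookup_max: Optional[int] = None,
) -> Tuple[int, int]:
    """Return (prompt_lookup_min, prompt_lookup_max) after applying defaults.

    Hypothetical helper mirroring the table: both default to 5 when both are
    omitted; otherwise an omitted bound mirrors the one that was given.
    """
    if lookup_min is None and lookup_max is None:
        return 5, 5
    if lookup_min is None:
        lookup_min = lookup_max
    elif lookup_max is None:
        lookup_max = lookup_min
    return lookup_min, lookup_max


print(resolve_lookup_bounds())              # both omitted
print(resolve_lookup_bounds(lookup_min=2))  # max mirrors min
print(resolve_lookup_bounds(2, 5))          # both given, as in the example above
```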
#### Suffix decoding
| Key | Type | Default | Meaning |
| --- | --- | --- | --- |
| `suffix_decoding_max_tree_depth` | `integer` | `24` | Maximum combined prefix-match and speculation tree depth. |
| `suffix_decoding_max_cached_requests` | `integer` | `10000` | Maximum number of requests cached in the global suffix tree. Set `0` to disable the global cache. |
| `suffix_decoding_max_spec_factor` | `float` | `1.0` | Caps speculative length as a multiple of prefix-match length. |
| `suffix_decoding_min_token_prob` | `float` | `0.1` | Minimum estimated token probability required to speculate a token. |
Example:
```bash
vllm serve <target-model> \
--speculative-config '{
"method": "suffix",
"num_speculative_tokens": 8,
"suffix_decoding_max_tree_depth": 24,
"suffix_decoding_max_cached_requests": 10000,
"suffix_decoding_max_spec_factor": 1.0,
"suffix_decoding_min_token_prob": 0.1
}'
```
### Notes
- `--speculative-config` expects a JSON object on the CLI. In YAML config
files, use a nested mapping instead of an escaped JSON string.
- `tensor_parallel_size` is not a valid key in `speculative_config`. Use
`draft_tensor_parallel_size` instead.
- Keys such as `temperature` and `top_p` are sampling parameters, not
`--speculative-config` fields.
- Internal fields such as `target_model_config`, `draft_model_config`,
`target_parallel_config`, `draft_parallel_config`, and `draft_load_config`
are populated by vLLM and are not intended to be set by users.
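The key-name pitfalls listed in these notes can be caught with a simple lookup. This is a sketch under the assumption that only the keys named above are checked; `explain_misplaced_keys` is a hypothetical helper, not a vLLM function.

```python
# Field names taken from the notes above.
INTERNAL_FIELDS = {
    "target_model_config",
    "draft_model_config",
    "target_parallel_config",
    "draft_parallel_config",
    "draft_load_config",
}
SAMPLING_PARAMS = {"temperature", "top_p"}


def explain_misplaced_keys(config: dict) -> dict[str, str]:
    """Map each misplaced key in `config` to a short hint.

    Hypothetical helper; it only knows about the keys named in the notes.
    """
    hints = {}
    for key in config:
        if key == "tensor_parallel_size":
            hints[key] = "use draft_tensor_parallel_size instead"
        elif key in SAMPLING_PARAMS:
            hints[key] = "sampling parameter, not a speculative_config field"
        elif key in INTERNAL_FIELDS:
            hints[key] = "populated by vLLM; not intended to be set by users"
    return hints


print(explain_misplaced_keys({"method": "draft_model", "tensor_parallel_size": 2}))
```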
## Lossless guarantees of Speculative Decoding
In vLLM, speculative decoding aims to enhance inference efficiency while maintaining accuracy. This section addresses the lossless guarantees of
......
......@@ -33,9 +33,9 @@ vllm serve Qwen/Qwen3-4B-Thinking-2507 \
--port 8000 \
--seed 42 \
-tp 1 \
- --max_model_len 2048 \
- --gpu_memory_utilization 0.8 \
- --speculative_config '{"model": "Qwen/Qwen3-0.6B", "num_speculative_tokens": 5, "method": "draft_model"}'
+ --max-model-len 2048 \
+ --gpu-memory-utilization 0.8 \
+ --speculative-config '{"model": "Qwen/Qwen3-0.6B", "num_speculative_tokens": 5, "method": "draft_model"}'
```
The code used to request completions as a client remains unchanged:
......@@ -77,4 +77,8 @@ The code used to request completions as a client remains unchanged:
```
!!! warning
- Note: Please use `--speculative_config` to set all configurations related to speculative decoding. The previous method of specifying the model through `--speculative_model` and adding related parameters (e.g., `--num_speculative_tokens`) separately has been deprecated.
+ Note: Please use `--speculative-config` to set all configurations related
+ to speculative decoding. The previous method of specifying the model
+ through `--speculative-model` and adding related parameters such as
+ `--num-speculative-tokens` separately has been deprecated. For supported
+ keys and examples, see the [`--speculative-config` schema](README.md#--speculative-config-schema).
......@@ -38,7 +38,7 @@ for output in outputs:
```bash
vllm serve XiaomiMiMo/MiMo-7B-Base \
--tensor-parallel-size 1 \
- --speculative_config '{"method":"mtp","num_speculative_tokens":1}'
+ --speculative-config '{"method":"mtp","num_speculative_tokens":1}'
```
## Notes
......
......@@ -36,9 +36,9 @@ vllm serve Qwen/Qwen3-4B \
--port 8000 \
--seed 42 \
-tp 1 \
- --max_model_len 2048 \
- --gpu_memory_utilization 0.8 \
- --speculative_config '{"model": "amd/PARD-Qwen3-0.6B", "num_speculative_tokens": 12, "method": "draft_model", "parallel_drafting": true}'
+ --max-model-len 2048 \
+ --gpu-memory-utilization 0.8 \
+ --speculative-config '{"model": "amd/PARD-Qwen3-0.6B", "num_speculative_tokens": 12, "method": "draft_model", "parallel_drafting": true}'
```
## Pre-trained PARD weights
......