Commit ecbe42e9, authored by xiao and committed by GitHub

[Doc] Clarify supported keys for --speculative-config (#40455)


Signed-off-by: Wangxiaoxiaoa <Wangxiaoxiaoa@users.noreply.github.com>
Co-authored-by: Wangxiaoxiaoa <Wangxiaoxiaoa@users.noreply.github.com>
parent a250f1bd
@@ -35,6 +35,99 @@ For reproducible measurements in your environment, use
[`examples/offline_inference/spec_decode.py`](../../../examples/offline_inference/spec_decode.py)
or the [benchmark CLI guide](../../benchmarking/cli.md).
## `--speculative-config` schema
Use `--speculative-config` to pass speculative decoding settings as a JSON
object on the CLI:
```bash
vllm serve <target-model> \
  --speculative-config '{
    "method": "draft_model",
    "model": "<draft-model>",
    "num_speculative_tokens": 5
  }'
```
The same keys are accepted from Python via `LLM(..., speculative_config={...})`.
The tables below highlight common user-facing keys accepted in this JSON
object; they are not an exhaustive schema reference.
For more details, see the generated [engine arguments reference](../../configuration/engine_args.md)
and the API docs for [vllm.config.SpeculativeConfig][].
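As a minimal sketch of the CLI form above, the snippet below builds the `--speculative-config` argument from a Python dictionary so that the JSON is serialized and shell-quoted consistently. The helper name `build_serve_command` is hypothetical and is not part of vLLM; it only assembles a command string.

```python
import json
import shlex


def build_serve_command(target_model: str, speculative_config: dict) -> str:
    """Return a `vllm serve` command line with a shell-quoted JSON config.

    Hypothetical helper for illustration only; vLLM does not ship it.
    """
    # Compact JSON keeps the argument on a single line.
    payload = json.dumps(speculative_config, separators=(",", ":"))
    parts = [
        "vllm",
        "serve",
        shlex.quote(target_model),
        "--speculative-config",
        shlex.quote(payload),  # single-quotes the JSON for POSIX shells
    ]
    return " ".join(parts)


if __name__ == "__main__":
    config = {
        "method": "draft_model",
        "model": "<draft-model>",
        "num_speculative_tokens": 5,
    }
    print(build_serve_command("<target-model>", config))
```

When calling vLLM from Python instead, the same dictionary can be passed directly as `speculative_config`, with no JSON serialization step.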
### Common keys
These keys are commonly used across speculative decoding setups, though some
only apply to model-based methods such as `draft_model`, `mtp`, `eagle3`, and
`dflash`.
| Key | Type | Default | Allowed values / meaning |
| --- | --- | --- | --- |
| `method` | `string` | `None` | Speculation method. Common values include `draft_model`, `ngram`, `suffix`, `mtp`, `eagle3`, and `dflash`. If omitted, vLLM infers the method from the provided configuration when possible. |
| `model` | `string` | `None` | Draft model, EAGLE head, or auxiliary model identifier. For `ngram`, `ngram_gpu`, `suffix`, and `mtp`, this can often be omitted. |
| `num_speculative_tokens` | `integer > 0` | `None` | Number of speculative tokens to propose per step. Required for methods that do not infer it from model metadata. |
| `draft_tensor_parallel_size` | `integer >= 1` | `None` | Tensor parallel size for the draft model. |
| `max_model_len` | `integer >= 1` | `None` | Maximum context length for the draft model. |
| `parallel_drafting` | `boolean` | `false` | Enable parallel draft token generation. Only compatible with EAGLE and draft-model methods. |
| `rejection_sample_method` | `string` | `strict` | `strict`, `probabilistic`, or `synthetic`. |
| `synthetic_acceptance_rate` | `float` | `None` | Average acceptance rate to target when `rejection_sample_method` is `synthetic`. Valid range is `[0, 1]`. |
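The constraints in the table can be checked before launching a server. The following is a minimal sketch that mirrors only the ranges and allowed values listed above; it is not vLLM's own validation logic, and the function name `check_common_keys` is an assumption made for illustration.

```python
ALLOWED_REJECTION_METHODS = {"strict", "probabilistic", "synthetic"}


def _is_int(value) -> bool:
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, int) and not isinstance(value, bool)


def check_common_keys(cfg: dict) -> list[str]:
    """Return a list of problems found; an empty list means no issue detected.

    Illustrative pre-flight check based on the table above (hypothetical helper).
    """
    problems = []

    n = cfg.get("num_speculative_tokens")
    if n is not None and (not _is_int(n) or n <= 0):
        problems.append("num_speculative_tokens must be an integer > 0")

    for key in ("draft_tensor_parallel_size", "max_model_len"):
        value = cfg.get(key)
        if value is not None and (not _is_int(value) or value < 1):
            problems.append(f"{key} must be an integer >= 1")

    # The table lists `strict` as the default when the key is omitted.
    method = cfg.get("rejection_sample_method", "strict")
    if method not in ALLOWED_REJECTION_METHODS:
        problems.append("rejection_sample_method must be strict, probabilistic, or synthetic")

    rate = cfg.get("synthetic_acceptance_rate")
    if rate is not None:
        is_number = isinstance(rate, (int, float)) and not isinstance(rate, bool)
        if not is_number or not 0 <= rate <= 1:
            problems.append("synthetic_acceptance_rate must be in [0, 1]")

    return problems
```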
### Method-specific keys
#### N-gram
| Key | Type | Default | Meaning |
| --- | --- | --- | --- |
| `prompt_lookup_max` | `integer >= 1` | `5` if both lookup bounds are omitted; otherwise mirrors `prompt_lookup_min` when omitted | Maximum n-gram window size. |
| `prompt_lookup_min` | `integer >= 1` | `5` if both lookup bounds are omitted; otherwise mirrors `prompt_lookup_max` when omitted | Minimum n-gram window size. |
Example:
```bash
vllm serve <target-model> \
  --speculative-config '{
    "method": "ngram",
    "num_speculative_tokens": 4,
    "prompt_lookup_min": 2,
    "prompt_lookup_max": 5
  }'
```
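The defaulting rule for the two lookup bounds in the N-gram table can be written out as a short sketch. This only restates the table (both omitted gives `5`; one omitted mirrors the other) and is not taken from vLLM's source; `resolve_prompt_lookup_bounds` is a hypothetical name.

```python
from typing import Optional, Tuple


def resolve_prompt_lookup_bounds(
    prompt_lookup_min: Optional[int] = None,
    prompt_lookup_max: Optional[int] = None,
) -> Tuple[int, int]:
    """Apply the documented defaults and return (min, max).

    Illustrative only: mirrors the table, not vLLM's implementation.
    """
    if prompt_lookup_min is None and prompt_lookup_max is None:
        # Both bounds omitted: each defaults to 5.
        return 5, 5
    if prompt_lookup_min is None:
        # Only the maximum was given: the minimum mirrors it.
        prompt_lookup_min = prompt_lookup_max
    elif prompt_lookup_max is None:
        # Only the minimum was given: the maximum mirrors it.
        prompt_lookup_max = prompt_lookup_min
    return prompt_lookup_min, prompt_lookup_max
```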
#### Suffix decoding
| Key | Type | Default | Meaning |
| --- | --- | --- | --- |
| `suffix_decoding_max_tree_depth` | `integer` | `24` | Maximum combined prefix-match and speculation tree depth. |
| `suffix_decoding_max_cached_requests` | `integer` | `10000` | Maximum number of requests cached in the global suffix tree. Set `0` to disable the global cache. |
| `suffix_decoding_max_spec_factor` | `float` | `1.0` | Caps speculative length as a multiple of prefix-match length. |
| `suffix_decoding_min_token_prob` | `float` | `0.1` | Minimum estimated token probability required to speculate a token. |
Example:
```bash
vllm serve <target-model> \
  --speculative-config '{
    "method": "suffix",
    "num_speculative_tokens": 8,
    "suffix_decoding_max_tree_depth": 24,
    "suffix_decoding_max_cached_requests": 10000,
    "suffix_decoding_max_spec_factor": 1.0,
    "suffix_decoding_min_token_prob": 0.1
  }'
```
### Notes
- `--speculative-config` expects a JSON object on the CLI. In YAML config
files, use a nested mapping instead of an escaped JSON string.
- `tensor_parallel_size` is not a valid key in `speculative_config`. Use
`draft_tensor_parallel_size` instead.
- Keys such as `temperature` and `top_p` are sampling parameters, not
`--speculative-config` fields.
- Internal fields such as `target_model_config`, `draft_model_config`,
`target_parallel_config`, `draft_parallel_config`, and `draft_load_config`
are populated by vLLM and are not intended to be set by users.
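The misplaced keys called out in these notes can be caught with a small lint step before a config is passed to vLLM. The sketch below is an assumption-laden illustration: the helper `lint_speculative_config` is hypothetical, and it covers only the keys named in the notes above.

```python
# Keys that the notes say do not belong in the speculative config.
MISPLACED_KEYS = {
    "tensor_parallel_size": "use draft_tensor_parallel_size instead",
    "temperature": "sampling parameter, not a --speculative-config field",
    "top_p": "sampling parameter, not a --speculative-config field",
}

# Fields the notes describe as populated by vLLM itself.
INTERNAL_KEYS = {
    "target_model_config",
    "draft_model_config",
    "target_parallel_config",
    "draft_parallel_config",
    "draft_load_config",
}


def lint_speculative_config(cfg: dict) -> dict[str, str]:
    """Map each suspicious key in `cfg` to a short hint (hypothetical helper)."""
    findings = {}
    for key in cfg:
        if key in MISPLACED_KEYS:
            findings[key] = MISPLACED_KEYS[key]
        elif key in INTERNAL_KEYS:
            findings[key] = "internal field populated by vLLM; do not set it"
    return findings


if __name__ == "__main__":
    example = {"method": "ngram", "tensor_parallel_size": 2, "top_p": 0.9}
    for key, hint in lint_speculative_config(example).items():
        print(f"{key}: {hint}")
```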
## Lossless guarantees of Speculative Decoding
In vLLM, speculative decoding aims to enhance inference efficiency while maintaining accuracy. This section addresses the lossless guarantees of
......
@@ -33,9 +33,9 @@ vllm serve Qwen/Qwen3-4B-Thinking-2507 \
--port 8000 \
--seed 42 \
-tp 1 \
- --max_model_len 2048 \
+ --max-model-len 2048 \
- --gpu_memory_utilization 0.8 \
+ --gpu-memory-utilization 0.8 \
- --speculative_config '{"model": "Qwen/Qwen3-0.6B", "num_speculative_tokens": 5, "method": "draft_model"}'
+ --speculative-config '{"model": "Qwen/Qwen3-0.6B", "num_speculative_tokens": 5, "method": "draft_model"}'
```
The code used to request as completions as a client remains unchanged:
@@ -77,4 +77,8 @@ The code used to request as completions as a client remains unchanged:
```
!!! warning
- Note: Please use `--speculative_config` to set all configurations related to speculative decoding. The previous method of specifying the model through `--speculative_model` and adding related parameters (e.g., `--num_speculative_tokens`) separately has been deprecated.
+ Note: Please use `--speculative-config` to set all configurations related
+ to speculative decoding. The previous method of specifying the model
+ through `--speculative-model` and adding related parameters such as
+ `--num-speculative-tokens` separately has been deprecated. For supported
+ keys and examples, see the [`--speculative-config` schema](README.md#--speculative-config-schema).
@@ -38,7 +38,7 @@ for output in outputs:
```bash
vllm serve XiaomiMiMo/MiMo-7B-Base \
--tensor-parallel-size 1 \
- --speculative_config '{"method":"mtp","num_speculative_tokens":1}'
+ --speculative-config '{"method":"mtp","num_speculative_tokens":1}'
```
## Notes
......
@@ -36,9 +36,9 @@ vllm serve Qwen/Qwen3-4B \
--port 8000 \
--seed 42 \
-tp 1 \
- --max_model_len 2048 \
+ --max-model-len 2048 \
- --gpu_memory_utilization 0.8 \
+ --gpu-memory-utilization 0.8 \
- --speculative_config '{"model": "amd/PARD-Qwen3-0.6B", "num_speculative_tokens": 12, "method": "draft_model", "parallel_drafting": true}'
+ --speculative-config '{"model": "amd/PARD-Qwen3-0.6B", "num_speculative_tokens": 12, "method": "draft_model", "parallel_drafting": true}'
```
## Pre-trained PARD weights
......