[Doc] Clarify supported keys for --speculative-config (#40455)

Signed-off-by: Wangxiaoxiaoa <Wangxiaoxiaoa@users.noreply.github.com>
Co-authored-by: Wangxiaoxiaoa <Wangxiaoxiaoa@users.noreply.github.com>

Commit ecbe42e9, authored Apr 22, 2026 by xiao; committed by GitHub Apr 22, 2026
4 changed files
--- a/docs/features/speculative_decoding/README.md
+++ b/docs/features/speculative_decoding/README.md
@@ -35,6 +35,99 @@ For reproducible measurements in your environment, use
 [`examples/offline_inference/spec_decode.py`](../../../examples/offline_inference/spec_decode.py)
 or the [benchmark CLI guide](../../benchmarking/cli.md).

+## `--speculative-config` schema
+
+Use `--speculative-config` to pass speculative decoding settings as a JSON
+object on the CLI:
+
+```bash
+vllm serve <target-model> \
+  --speculative-config '{
+    "method": "draft_model",
+    "model": "<draft-model>",
+    "num_speculative_tokens": 5
+  }'
+```
+
+The same keys are accepted from Python via `LLM(..., speculative_config={...})`.
+The tables below highlight common user-facing keys accepted in this JSON
+object; they are not an exhaustive schema reference.
+For more details, see the generated [engine arguments reference](../../configuration/engine_args.md)
+and the API docs for [vllm.config.SpeculativeConfig][].
+
+### Common keys
+
+These keys are commonly used across speculative decoding setups, though some
+only apply to model-based methods such as `draft_model`, `mtp`, `eagle3`, and
+`dflash`.
+
+| Key | Type | Default | Allowed values / meaning |
+| --- | --- | --- | --- |
+| `method` | `string` | `None` | Speculation method. Common values include `draft_model`, `ngram`, `suffix`, `mtp`, `eagle3`, and `dflash`. If omitted, vLLM infers the method from the provided configuration when possible. |
+| `model` | `string` | `None` | Draft model, EAGLE head, or auxiliary model identifier. For `ngram`, `ngram_gpu`, `suffix`, and `mtp`, this can often be omitted. |
+| `num_speculative_tokens` | `integer > 0` | `None` | Number of speculative tokens to propose per step. Required for methods that do not infer it from model metadata. |
+| `draft_tensor_parallel_size` | `integer >= 1` | `None` | Tensor parallel size for the draft model. |
+| `max_model_len` | `integer >= 1` | `None` | Maximum context length for the draft model. |
+| `parallel_drafting` | `boolean` | `false` | Enable parallel draft token generation. Only compatible with EAGLE and draft-model methods. |
+| `rejection_sample_method` | `string` | `strict` | `strict`, `probabilistic`, or `synthetic`. |
+| `synthetic_acceptance_rate` | `float` | `None` | Average acceptance rate to target when `rejection_sample_method` is `synthetic`. Valid range is `[0, 1]`. |
+
+### Method-specific keys
+
+#### N-gram
+
+| Key | Type | Default | Meaning |
+| --- | --- | --- | --- |
+| `prompt_lookup_max` | `integer >= 1` | `5` if both lookup bounds are omitted; otherwise mirrors `prompt_lookup_min` when omitted | Maximum n-gram window size. |
+| `prompt_lookup_min` | `integer >= 1` | `5` if both lookup bounds are omitted; otherwise mirrors `prompt_lookup_max` when omitted | Minimum n-gram window size. |
+
+Example:
+
+```bash
+vllm serve <target-model> \
+  --speculative-config '{
+    "method": "ngram",
+    "num_speculative_tokens": 4,
+    "prompt_lookup_min": 2,
+    "prompt_lookup_max": 5
+  }'
+```
+
+#### Suffix decoding
+
+| Key | Type | Default | Meaning |
+| --- | --- | --- | --- |
+| `suffix_decoding_max_tree_depth` | `integer` | `24` | Maximum combined prefix-match and speculation tree depth. |
+| `suffix_decoding_max_cached_requests` | `integer` | `10000` | Maximum number of requests cached in the global suffix tree. Set `0` to disable the global cache. |
+| `suffix_decoding_max_spec_factor` | `float` | `1.0` | Caps speculative length as a multiple of prefix-match length. |
+| `suffix_decoding_min_token_prob` | `float` | `0.1` | Minimum estimated token probability required to speculate a token. |
+
+Example:
+
+```bash
+vllm serve <target-model> \
+  --speculative-config '{
+    "method": "suffix",
+    "num_speculative_tokens": 8,
+    "suffix_decoding_max_tree_depth": 24,
+    "suffix_decoding_max_cached_requests": 10000,
+    "suffix_decoding_max_spec_factor": 1.0,
+    "suffix_decoding_min_token_prob": 0.1
+  }'
+```
+
+### Notes
+
+- `--speculative-config` expects a JSON object on the CLI. In YAML config
+  files, use a nested mapping instead of an escaped JSON string.
+- `tensor_parallel_size` is not a valid key in `speculative_config`. Use
+  `draft_tensor_parallel_size` instead.
+- Keys such as `temperature` and `top_p` are sampling parameters, not
+  `--speculative-config` fields.
+- Internal fields such as `target_model_config`, `draft_model_config`,
+  `target_parallel_config`, `draft_parallel_config`, and `draft_load_config`
+  are populated by vLLM and are not intended to be set by users.
+
 ## Lossless guarantees of Speculative Decoding

 In vLLM, speculative decoding aims to enhance inference efficiency while maintaining accuracy. This section addresses the lossless guarantees of
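
The notes and the N-gram defaults in the `--speculative-config` schema section of the README diff above can be sketched as a small pre-flight check. This is only an illustration of the documented rules, not vLLM's own validation: `check_speculative_config` and `resolve_ngram_bounds` are hypothetical helpers, and the key lists are copied from the notes rather than from vLLM source.

```python
# Hypothetical pre-flight check mirroring the README notes; not part of vLLM.

# Keys the notes say do not belong in speculative_config, with the reason.
REJECTED_KEYS = {
    "tensor_parallel_size": "use draft_tensor_parallel_size instead",
    "temperature": "sampling parameter, not a speculative-config field",
    "top_p": "sampling parameter, not a speculative-config field",
}

# Internal fields the notes say are populated by vLLM.
INTERNAL_KEYS = {
    "target_model_config",
    "draft_model_config",
    "target_parallel_config",
    "draft_parallel_config",
    "draft_load_config",
}


def check_speculative_config(cfg: dict) -> list[str]:
    """Return one message per key that the README notes warn against."""
    problems = []
    for key in cfg:
        if key in REJECTED_KEYS:
            problems.append(f"{key}: {REJECTED_KEYS[key]}")
        elif key in INTERNAL_KEYS:
            problems.append(f"{key}: populated by vLLM, not user-settable")
    return problems


def resolve_ngram_bounds(cfg: dict) -> tuple[int, int]:
    """Apply the documented defaults for prompt_lookup_min/max.

    Both omitted -> (5, 5); one omitted -> it mirrors the other.
    """
    lo = cfg.get("prompt_lookup_min")
    hi = cfg.get("prompt_lookup_max")
    if lo is None and hi is None:
        return 5, 5
    if lo is None:
        lo = hi
    if hi is None:
        hi = lo
    return lo, hi
```

Running the check before launching a server surfaces a misplaced key such as `tensor_parallel_size` early, without relying on how a given vLLM version reports it.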

--- a/docs/features/speculative_decoding/draft_model.md
+++ b/docs/features/speculative_decoding/draft_model.md
@@ -33,9 +33,9 @@ vllm serve Qwen/Qwen3-4B-Thinking-2507 \
    --port 8000 \
    --seed 42 \
    -tp 1 \
-    --max_model_len 2048 \
-    --gpu_memory_utilization 0.8 \
-    --speculative_config '{"model": "Qwen/Qwen3-0.6B", "num_speculative_tokens": 5, "method": "draft_model"}'
+    --max-model-len 2048 \
+    --gpu-memory-utilization 0.8 \
+    --speculative-config '{"model": "Qwen/Qwen3-0.6B", "num_speculative_tokens": 5, "method": "draft_model"}'
 ```

 The code used to request completions as a client remains unchanged:
@@ -77,4 +77,8 @@ The code used to request completions as a client remains unchanged:
    ```

 !!! warning
-    Note: Please use `--speculative_config` to set all configurations related to speculative decoding. The previous method of specifying the model through `--speculative_model` and adding related parameters (e.g., `--num_speculative_tokens`) separately has been deprecated.
+    Note: Please use `--speculative-config` to set all configurations related
+    to speculative decoding. The previous method of specifying the model
+    through `--speculative-model` and adding related parameters such as
+    `--num-speculative-tokens` separately has been deprecated. For supported
+    keys and examples, see the [`--speculative-config` schema](README.md#--speculative-config-schema).
--- a/docs/features/speculative_decoding/mtp.md
+++ b/docs/features/speculative_decoding/mtp.md
@@ -38,7 +38,7 @@ for output in outputs:
 ```bash
 vllm serve XiaomiMiMo/MiMo-7B-Base \
    --tensor-parallel-size 1 \
-    --speculative_config '{"method":"mtp","num_speculative_tokens":1}'
+    --speculative-config '{"method":"mtp","num_speculative_tokens":1}'
 ```

 ## Notes

--- a/docs/features/speculative_decoding/parallel_draft_model.md
+++ b/docs/features/speculative_decoding/parallel_draft_model.md
@@ -36,9 +36,9 @@ vllm serve Qwen/Qwen3-4B \
    --port 8000 \
    --seed 42 \
    -tp 1 \
-    --max_model_len 2048 \
-    --gpu_memory_utilization 0.8 \
-    --speculative_config '{"model": "amd/PARD-Qwen3-0.6B", "num_speculative_tokens": 12, "method": "draft_model", "parallel_drafting": true}'
+    --max-model-len 2048 \
+    --gpu-memory-utilization 0.8 \
+    --speculative-config '{"model": "amd/PARD-Qwen3-0.6B", "num_speculative_tokens": 12, "method": "draft_model", "parallel_drafting": true}'
 ```

 ## Pre-trained PARD weights
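
The `vllm serve` commands in these diffs pass the speculative settings as a single-quoted JSON object. When such a command is assembled from Python, one way to avoid hand-escaping is to serialize the dict with `json.dumps` and let `shlex` handle the shell quoting. `build_serve_command` below is a hypothetical helper for illustration; only the `vllm serve <model> --speculative-config <json>` shape comes from the diffs above.

```python
import json
import shlex


def build_serve_command(target_model: str, speculative_config: dict) -> str:
    """Build a shell-safe `vllm serve` command line (hypothetical helper)."""
    # Serialize the settings as one JSON object, as the CLI flag expects.
    spec_json = json.dumps(speculative_config)
    args = ["vllm", "serve", target_model, "--speculative-config", spec_json]
    # shlex.join quotes each argument so the JSON survives the shell intact.
    return shlex.join(args)


# Values taken from the draft_model.md example in the diff above.
spec = {
    "method": "draft_model",
    "model": "Qwen/Qwen3-0.6B",
    "num_speculative_tokens": 5,
}
cmd = build_serve_command("Qwen/Qwen3-4B-Thinking-2507", spec)
print(cmd)
```

Splitting the result with `shlex.split` and parsing the last argument with `json.loads` returns the original dict, which is a quick way to confirm the quoting round-trips.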