The `/generate` endpoint accepts the following parameters in JSON format.
The object is defined in `sampling_params.py::SamplingParams`. You can also read the source code for additional arguments and documentation.
### Note on defaults
By default, SGLang initializes several sampling parameters from the model's `generation_config.json`; this corresponds to launching the server with `--sampling-defaults model`. To use SGLang/OpenAI constant defaults instead, start the server with `--sampling-defaults openai`. In either mode, any parameter can still be overridden per request via `sampling_params`, as shown after the launch commands below.
```bash
# Use model-provided defaults from generation_config.json (default behavior)
python -m sglang.launch_server --model-path <MODEL> --sampling-defaults model

# Use SGLang/OpenAI constant defaults instead
python -m sglang.launch_server --model-path <MODEL> --sampling-defaults openai
```
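With the server running, any parameter in the table below can be overridden per request by placing it in the `sampling_params` object of a `/generate` call. A minimal sketch, assuming the server listens on the default port `30000`:

```bash
# Per-request override: values in sampling_params take precedence
# over the server-side defaults (model-provided or constant).
curl -s http://localhost:30000/generate \
  -H "Content-Type: application/json" \
  -d '{
    "text": "The capital of France is",
    "sampling_params": {
      "temperature": 0.7,
      "top_p": 0.9,
      "max_new_tokens": 64
    }
  }'
```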
| Argument | Type / Default | Description |
| --- | --- | --- |
| max_new_tokens | `int = 128` | The maximum output length, measured in tokens. |
| stop | `Optional[Union[str, List[str]]] = None` | One or more [stop words](https://platform.openai.com/docs/api-reference/chat/create#chat-create-stop). Generation stops if one of these words is sampled. |
| stop_token_ids | `Optional[List[int]] = None` | Stop conditions given as token IDs. Generation stops if one of these token IDs is sampled. |
| temperature | `float (model default; fallback 1.0)` | [Temperature](https://platform.openai.com/docs/api-reference/chat/create#chat-create-temperature) used when sampling the next token. `temperature = 0` corresponds to greedy sampling; higher temperatures lead to more diversity. |
| top_p | `float (model default; fallback 1.0)` | [Top-p](https://platform.openai.com/docs/api-reference/chat/create#chat-create-top_p) selects tokens from the smallest sorted set whose cumulative probability exceeds `top_p`. When `top_p = 1`, this reduces to unrestricted sampling from all tokens. |
| top_k | `int (model default; fallback -1)` | [Top-k](https://developer.nvidia.com/blog/how-to-get-better-outputs-from-your-large-language-model/#predictability_vs_creativity) samples from the `k` highest-probability tokens. `top_k = -1` disables the restriction. |
| min_p | `float (model default; fallback 0.0)` | [Min-p](https://github.com/huggingface/transformers/issues/27670) samples only from tokens whose probability is larger than `min_p * highest_token_probability`. |
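
A second sketch under the same assumptions (default port `30000`), combining several parameters from the table: `temperature = 0` forces greedy decoding, `stop` truncates the output at the first blank line, and `max_new_tokens` caps the length:

```bash
# Greedy decoding with a stop string and a short length cap.
curl -s http://localhost:30000/generate \
  -H "Content-Type: application/json" \
  -d '{
    "text": "List three colors:",
    "sampling_params": {
      "temperature": 0,
      "max_new_tokens": 32,
      "stop": ["\n\n"]
    }
  }'
```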