Clean up docs for server args and sampling parameters (generated by grok) (#7076)

Commit dbdf76ca (unverified) · authored Jun 10, 2025 by Lianmin Zheng · committed by GitHub Jun 10, 2025
4 changed files
--- a/docs/backend/sampling_params.md
+++ b/docs/backend/sampling_params.md
@@ -6,23 +6,31 @@ If you want a high-level endpoint that can automatically handle chat templates,

 ## `/generate` Endpoint

-The `/generate` endpoint accepts the following parameters in JSON format. For detailed usage, see the [native API doc](./native_api.ipynb).
+The `/generate` endpoint accepts the following parameters in JSON format. For detailed usage, see the [native API doc](./native_api.ipynb). The object is defined at `io_struct.py::GenerateReqInput`. You can also read the source code to find more arguments and docs.

 | Argument                   | Type/Default                                                                 | Description                                                                                                                                                     |
-|------------------------|---------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------|
+|----------------------------|------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------|
 | text                       | `Optional[Union[List[str], str]] = None`                                     | The input prompt. Can be a single prompt or a batch of prompts.                                                                                                 |
-| input_ids              | `Optional[Union[List[List[int]], List[int]]] = None`    | Alternative to `text`. Specify the input as token IDs instead of text.                                                                         |
+| input_ids                  | `Optional[Union[List[List[int]], List[int]]] = None`                         | The token IDs for text; one can specify either text or input_ids.                                                                                               |
+| input_embeds               | `Optional[Union[List[List[List[float]]], List[List[float]]]] = None`         | The embeddings for input_ids; one can specify either text, input_ids, or input_embeds.                                                                          |
+| image_data                 | `Optional[Union[List[List[ImageDataItem]], List[ImageDataItem], ImageDataItem]] = None` | The image input. Can be an image instance, file name, URL, or base64 encoded string. Can be a single image, list of images, or list of lists of images. |
+| audio_data                 | `Optional[Union[List[AudioDataItem], AudioDataItem]] = None`                 | The audio input. Can be a file name, URL, or base64 encoded string.                                                                                             |
 | sampling_params            | `Optional[Union[List[Dict], Dict]] = None`                                   | The sampling parameters as described in the sections below.                                                                                                     |
+| rid                        | `Optional[Union[List[str], str]] = None`                                     | The request ID.                                                                                                                                                 |
 | return_logprob             | `Optional[Union[List[bool], bool]] = None`                                   | Whether to return log probabilities for tokens.                                                                                                                 |
-| logprob_start_len      | `Optional[Union[List[int], int]] = None`                | If returning log probabilities, specifies the start position in the prompt. Default is "-1", which returns logprobs only for output tokens.   |
-| top_logprobs_num       | `Optional[Union[List[int], int]] = None`                | If returning log probabilities, specifies the number of top logprobs to return at each position.                                               |
-| stream                 | `bool = False`                                          | Whether to stream the output.                                                                                                                  |
-| lora_path              | `Optional[Union[List[Optional[str]], Optional[str]]] = None`| Path to LoRA weights.                                                                                                                          |
-| custom_logit_processor | `Optional[Union[List[Optional[str]], str]] = None`      | Custom logit processor for advanced sampling control. For usage see below.                                                                     |
-| return_hidden_states   | `bool = False`                                          | Whether to return hidden states of the model. Note that each time it changes, the CUDA graph will be recaptured, which might lead to a performance hit. See the [examples](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states) for more information. |
+| logprob_start_len          | `Optional[Union[List[int], int]] = None`                                     | If return_logprob, the start location in the prompt for returning logprobs. Default is "-1", which returns logprobs for output tokens only.                     |
+| top_logprobs_num           | `Optional[Union[List[int], int]] = None`                                     | If return_logprob, the number of top logprobs to return at each position.                                                                                       |
+| token_ids_logprob          | `Optional[Union[List[List[int]], List[int]]] = None`                         | If return_logprob, the token IDs to return logprob for.                                                                                                         |
+| return_text_in_logprobs    | `bool = False`                                                               | Whether to detokenize tokens into text in the returned logprobs.                                                                                                |
+| stream                     | `bool = False`                                                               | Whether to stream output.                                                                                                                                       |
+| lora_path                  | `Optional[Union[List[Optional[str]], Optional[str]]] = None`                 | The path to the LoRA.                                                                                                                                           |
+| custom_logit_processor     | `Optional[Union[List[Optional[str]], str]] = None`                           | Custom logit processor for advanced sampling control. Must be a serialized instance of `CustomLogitProcessor` using its `to_str()` method. For usage see below. |
+| return_hidden_states       | `Union[List[bool], bool] = False`                                            | Whether to return hidden states.                                                                                                                                |

 ## Sampling parameters

+The object is defined at `sampling_params.py::SamplingParams`. You can also read the source code to find more arguments and docs.
+
 ### Core parameters

 | Argument        | Type/Default                                 | Description                                                                                                                                    |
@@ -48,21 +56,21 @@ The `/generate` endpoint accepts the following parameters in JSON format. For de
 Please refer to our dedicated guide on [constrained decoding](./structured_outputs.ipynb) for the following parameters.

 | Argument        | Type/Default                    | Description                                                                                                                                    |
-|--------------|---------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------|
+|-----------------|---------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------|
 | json_schema     | `Optional[str] = None`          | JSON schema for structured outputs.                                                                                                            |
 | regex           | `Optional[str] = None`          | Regex for structured outputs.                                                                                                                  |
 | ebnf            | `Optional[str] = None`          | EBNF for structured outputs.                                                                                                                   |
+| structural_tag  | `Optional[str] = None`          | The structural tag for structured outputs.                                                                                                     |

 ### Other options

 | Argument                      | Type/Default                    | Description                                                                                                                                    |
 |-------------------------------|---------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------|
 | n                             | `int = 1`                       | Specifies the number of output sequences to generate per request. (Generating multiple outputs in one request (n > 1) is discouraged; repeating the same prompts several times offers better control and efficiency.) |
-| spaces_between_special_tokens | `bool = True`                   | Whether or not to add spaces between special tokens during detokenization.                                                                     |
-| no_stop_trim                  | `bool = False`                  | Don't trim stop words or EOS token from the generated text.                                                                                    |
-| continue_final_message        | `bool = False`                  | When enabled, the final assistant message is removed and its content is used as a prefill so that the model continues that message instead of starting a new turn. See [openai_chat_with_response_prefill.py](https://github.com/sgl-project/sglang/blob/main/examples/runtime/openai_chat_with_response_prefill.py) for examples. |
 | ignore_eos                    | `bool = False`                  | Don't stop generation when EOS token is sampled.                                                                                               |
 | skip_special_tokens           | `bool = True`                   | Remove special tokens during decoding.                                                                                                         |
+| spaces_between_special_tokens | `bool = True`                   | Whether or not to add spaces between special tokens during detokenization.                                                                     |
+| no_stop_trim                  | `bool = False`                  | Don't trim stop words or EOS token from the generated text.                                                                                    |
 | custom_params                 | `Optional[List[Optional[Dict[str, Any]]]] = None` | Used when employing `CustomLogitProcessor`. For usage, see below.                                                                              |

 ## Examples
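
As a rough illustration of the `/generate` request shape documented in the tables above, the sketch below assembles a JSON body using only field names that appear in this diff (`text`, `sampling_params`, `return_logprob`, `logprob_start_len`, `top_logprobs_num`, `return_text_in_logprobs`, `stream`, plus the sampling options `n`, `ignore_eos`, and `skip_special_tokens`). The helper `build_generate_payload`, the prompts, and the chosen values are hypothetical and not part of SGLang; the code only serializes a request and does not contact a server.

```python
import json
from typing import Any, Dict, List, Union


def build_generate_payload(
    text: Union[str, List[str]],
    sampling_params: Dict[str, Any],
    return_logprob: bool = False,
) -> Dict[str, Any]:
    """Assemble a request body for `/generate` (hypothetical helper)."""
    payload: Dict[str, Any] = {
        "text": text,  # a single prompt or a batch of prompts
        "sampling_params": sampling_params,
        "stream": False,
    }
    if return_logprob:
        # The logprob-related fields only matter when return_logprob is set.
        payload.update(
            {
                "return_logprob": True,
                "logprob_start_len": -1,  # documented default: output tokens only
                "top_logprobs_num": 2,  # illustrative value
                "return_text_in_logprobs": True,
            }
        )
    return payload


payload = build_generate_payload(
    text=["The capital of France is", "The capital of Japan is"],
    sampling_params={"n": 1, "ignore_eos": False, "skip_special_tokens": True},
    return_logprob=True,
)
body = json.dumps(payload)
print(body)
```

The resulting string could then be sent as the body of a POST request to `/generate` on a running server; the host and port depend on how the server was launched.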

--- a/docs/backend/server_arguments.md
+++ b/docs/backend/server_arguments.md
--- a/python/sglang/srt/managers/io_struct.py
+++ b/python/sglang/srt/managers/io_struct.py
@@ -87,7 +87,7 @@ class GenerateReqInput:

    # The modalities of the image data [image, multi-images, video]
    modalities: Optional[List[str]] = None
-    # LoRA related
+    # The path to the LoRA
    lora_path: Optional[Union[List[Optional[str]], Optional[str]]] = None

    # Session info for continual prompting
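
Several `GenerateReqInput` fields, `lora_path` among them, are typed to accept either one value or a list with one entry per prompt. The sketch below shows one way such a field could be expanded to a per-request list. `broadcast_to_batch` is a hypothetical helper written for this note and is not SGLang's actual normalization logic; it is only suitable for fields whose single-request form is not itself a list (so not `input_ids`).

```python
from typing import List, Optional, TypeVar, Union

T = TypeVar("T")


def broadcast_to_batch(
    value: Union[List[Optional[T]], Optional[T]], batch_size: int
) -> List[Optional[T]]:
    """Expand a single-or-batched field to exactly one entry per request."""
    if isinstance(value, list):
        # Already batched: the length must match the number of prompts.
        if len(value) != batch_size:
            raise ValueError(
                f"expected {batch_size} entries, got {len(value)}"
            )
        return list(value)
    # A single value (or None) applies to every request in the batch.
    return [value] * batch_size


# One LoRA path shared by three prompts, then an explicit per-prompt list.
print(broadcast_to_batch("adapters/demo", 3))
print(broadcast_to_batch(["adapters/a", None], 2))
```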

--- a/python/sglang/srt/server_args.py
+++ b/python/sglang/srt/server_args.py
@@ -28,7 +28,6 @@ from sglang.srt.utils import (
    configure_ipv6,
    get_device,
    get_device_memory_capacity,
-    is_cuda,
    is_flashinfer_available,
    is_hip,
    is_port_available,
@@ -214,8 +213,8 @@ class ServerArgs:
    disable_shared_experts_fusion: bool = False
    disable_chunked_prefix_cache: bool = False
    disable_fast_image_processor: bool = False
-    warmups: Optional[str] = None
    enable_return_hidden_states: bool = False
+    warmups: Optional[str] = None

    # Debug tensor dumps
    debug_tensor_dump_output_folder: Optional[str] = None
@@ -536,10 +535,16 @@ class ServerArgs:
            help="The path of the tokenizer.",
        )
        parser.add_argument(
-            "--host", type=str, default=ServerArgs.host, help="The host of the server."
+            "--host",
+            type=str,
+            default=ServerArgs.host,
+            help="The host of the HTTP server.",
        )
        parser.add_argument(
-            "--port", type=int, default=ServerArgs.port, help="The port of the server."
+            "--port",
+            type=int,
+            default=ServerArgs.port,
+            help="The port of the HTTP server.",
        )
        parser.add_argument(
            "--tokenizer-mode",
@@ -694,6 +699,18 @@ class ServerArgs:
            "name, a tag name, or a commit id. If unspecified, will use "
            "the default version.",
        )
+        parser.add_argument(
+            "--impl",
+            type=str,
+            default=ServerArgs.impl,
+            help="Which implementation of the model to use.\n\n"
+            '* "auto" will try to use the SGLang implementation if it exists '
+            "and fall back to the Transformers implementation if no SGLang "
+            "implementation is available.\n"
+            '* "sglang" will use the SGLang model implementation.\n'
+            '* "transformers" will use the Transformers model '
+            "implementation.\n",
+        )

        # Memory and scheduling
        parser.add_argument(
@@ -752,18 +769,6 @@ class ServerArgs:
            default=ServerArgs.page_size,
            help="The number of tokens in a page.",
        )
-        parser.add_argument(
-            "--impl",
-            type=str,
-            default=ServerArgs.impl,
-            help="Which implementation of the model to use.\n\n"
-            '* "auto" will try to use the SGLang implementation if it exists '
-            "and fall back to the Transformers implementation if no SGLang "
-            "implementation is available.\n"
-            '* "sglang" will use the SGLang model implementation.\n'
-            '* "transformers" will use the Transformers model '
-            "implementation.\n",
-        )

        # Other runtime options
        parser.add_argument(
@@ -1442,6 +1447,11 @@ class ServerArgs:
            action="store_true",
            help="Adopt base image processor instead of fast image processor.",
        )
+        parser.add_argument(
+            "--enable-return-hidden-states",
+            action="store_true",
+            help="Enable returning hidden states with responses.",
+        )
        parser.add_argument(
            "--warmups",
            type=str,
@@ -1469,12 +1479,6 @@ class ServerArgs:
            default=ServerArgs.debug_tensor_dump_inject,
            help="Inject the outputs from jax as the input of every layer.",
        )
-
-        parser.add_argument(
-            "--enable-return-hidden-states",
-            action="store_true",
-            help="Enable returning hidden states with responses.",
-        )
        parser.add_argument(
            "--debug-tensor-dump-prefill-only",
            action="store_true",
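
The hunks above follow a pattern in which each CLI flag takes its default from a dataclass attribute, and boolean switches use `action="store_true"`. A minimal, self-contained sketch of that pattern follows. `DemoArgs` and `from_cli` are hypothetical stand-ins for illustration, and the default values shown are assumptions, not values read from the real `ServerArgs`.

```python
import argparse
from dataclasses import dataclass
from typing import List


@dataclass
class DemoArgs:
    # Assumed defaults for illustration only.
    host: str = "127.0.0.1"
    port: int = 30000
    impl: str = "auto"
    enable_return_hidden_states: bool = False

    @staticmethod
    def add_cli_args(parser: argparse.ArgumentParser) -> None:
        # Defaults are read from the dataclass so they live in one place.
        parser.add_argument(
            "--host",
            type=str,
            default=DemoArgs.host,
            help="The host of the HTTP server.",
        )
        parser.add_argument(
            "--port",
            type=int,
            default=DemoArgs.port,
            help="The port of the HTTP server.",
        )
        parser.add_argument(
            "--impl",
            type=str,
            default=DemoArgs.impl,
            help="Which implementation of the model to use.",
        )
        # store_true flags default to False and take no value.
        parser.add_argument(
            "--enable-return-hidden-states",
            action="store_true",
            help="Enable returning hidden states with responses.",
        )

    @classmethod
    def from_cli(cls, argv: List[str]) -> "DemoArgs":
        parser = argparse.ArgumentParser()
        cls.add_cli_args(parser)
        # argparse maps --enable-return-hidden-states to the
        # attribute name enable_return_hidden_states.
        return cls(**vars(parser.parse_args(argv)))


args = DemoArgs.from_cli(["--port", "8000", "--enable-return-hidden-states"])
print(args)
```

Because argument order in the parser only affects `--help` output, moving a block such as `--impl` or `--enable-return-hidden-states` to a different group, as this commit does, should not change parsing behavior.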