Unverified Commit dbdf76ca authored by Lianmin Zheng, committed by GitHub

Clean up docs for server args and sampling parameters (generated by grok) (#7076)

parent f2a75a66
@@ -6,23 +6,31 @@ If you want a high-level endpoint that can automatically handle chat templates,
 ## `/generate` Endpoint
 
-The `/generate` endpoint accepts the following parameters in JSON format. For detailed usage, see the [native API doc](./native_api.ipynb).
+The `/generate` endpoint accepts the following parameters in JSON format. For detailed usage, see the [native API doc](./native_api.ipynb). The object is defined at `io_struct.py::GenerateReqInput`. You can also read the source code to find more arguments and documentation.
 
 | Argument | Type/Default | Description |
 |---|---|---|
 | text | `Optional[Union[List[str], str]] = None` | The input prompt. Can be a single prompt or a batch of prompts. |
-| input_ids | `Optional[Union[List[List[int]], List[int]]] = None` | Alternative to `text`. Specify the input as token IDs instead of text. |
+| input_ids | `Optional[Union[List[List[int]], List[int]]] = None` | The token IDs of the prompt; specify either `text` or `input_ids`, not both. |
+| input_embeds | `Optional[Union[List[List[List[float]]], List[List[float]]]] = None` | The input embeddings; specify only one of `text`, `input_ids`, or `input_embeds`. |
+| image_data | `Optional[Union[List[List[ImageDataItem]], List[ImageDataItem], ImageDataItem]] = None` | The image input. Can be an image instance, file name, URL, or base64-encoded string. Accepts a single image, a list of images, or a list of lists of images. |
+| audio_data | `Optional[Union[List[AudioDataItem], AudioDataItem]] = None` | The audio input. Can be a file name, URL, or base64-encoded string. |
 | sampling_params | `Optional[Union[List[Dict], Dict]] = None` | The sampling parameters as described in the sections below. |
+| rid | `Optional[Union[List[str], str]] = None` | The request ID. |
 | return_logprob | `Optional[Union[List[bool], bool]] = None` | Whether to return log probabilities for tokens. |
-| logprob_start_len | `Optional[Union[List[int], int]] = None` | If returning log probabilities, specifies the start position in the prompt. Default is "-1", which returns logprobs only for output tokens. |
+| logprob_start_len | `Optional[Union[List[int], int]] = None` | If `return_logprob` is set, the start position in the prompt from which to return logprobs. Defaults to `-1`, which returns logprobs for output tokens only. |
-| top_logprobs_num | `Optional[Union[List[int], int]] = None` | If returning log probabilities, specifies the number of top logprobs to return at each position. |
+| top_logprobs_num | `Optional[Union[List[int], int]] = None` | If `return_logprob` is set, the number of top logprobs to return at each position. |
+| token_ids_logprob | `Optional[Union[List[List[int]], List[int]]] = None` | If `return_logprob` is set, the token IDs for which to return logprobs. |
+| return_text_in_logprobs | `bool = False` | Whether to detokenize tokens into text in the returned logprobs. |
 | stream | `bool = False` | Whether to stream the output. |
-| lora_path | `Optional[Union[List[Optional[str]], Optional[str]]] = None` | Path to LoRA weights. |
+| lora_path | `Optional[Union[List[Optional[str]], Optional[str]]] = None` | The path to the LoRA weights. |
-| custom_logit_processor | `Optional[Union[List[Optional[str]], str]] = None` | Custom logit processor for advanced sampling control. For usage see below. |
+| custom_logit_processor | `Optional[Union[List[Optional[str]], str]] = None` | Custom logit processor for advanced sampling control. Must be a serialized instance of `CustomLogitProcessor` using its `to_str()` method. For usage, see below. |
-| return_hidden_states | `bool = False` | Whether to return hidden states of the model. Note that each time it changes, the CUDA graph will be recaptured, which might lead to a performance hit. See the [examples](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states) for more information. |
+| return_hidden_states | `Union[List[bool], bool] = False` | Whether to return hidden states. Note that whenever this flag changes, the CUDA graph is recaptured, which may hurt performance. See the [examples](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states) for more information. |
 
 ## Sampling parameters
 
+The object is defined at `sampling_params.py::SamplingParams`. You can also read the source code to find more arguments and documentation.
+
 ### Core parameters
 
 | Argument | Type/Default | Description |
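The table above maps directly onto the JSON body of a `/generate` request. Here is a minimal sketch of a call exercising a few of these fields; the server address, model, and launch command are illustrative assumptions, not part of this diff.

```python
import requests

# Assumes a server is already running locally, e.g. launched with:
#   python -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --port 30000
response = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "The capital of France is",
        "sampling_params": {"temperature": 0.7, "max_new_tokens": 32},
        "return_logprob": True,  # return per-token log probabilities
        "top_logprobs_num": 2,   # include the top-2 alternatives at each position
    },
)
out = response.json()
print(out["text"])       # the generated completion
print(out["meta_info"])  # logprobs, finish reason, token counts, etc.
```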
@@ -47,22 +55,22 @@ The `/generate` endpoint accepts the following parameters in JSON format. For de
 Please refer to our dedicated guide on [constrained decoding](./structured_outputs.ipynb) for the following parameters.
 
 | Argument | Type/Default | Description |
 |---|---|---|
 | json_schema | `Optional[str] = None` | JSON schema for structured outputs. |
 | regex | `Optional[str] = None` | Regex for structured outputs. |
 | ebnf | `Optional[str] = None` | EBNF for structured outputs. |
+| structural_tag | `Optional[str] = None` | The structural tag for structured outputs. |
 
 ### Other options
 
 | Argument | Type/Default | Description |
 |---|---|---|
 | n | `int = 1` | Specifies the number of output sequences to generate per request. (Generating multiple outputs in one request (n > 1) is discouraged; repeating the same prompt several times offers better control and efficiency.) |
+| spaces_between_special_tokens | `bool = True` | Whether or not to add spaces between special tokens during detokenization. |
+| no_stop_trim | `bool = False` | Don't trim stop words or the EOS token from the generated text. |
+| continue_final_message | `bool = False` | When enabled, the final assistant message is removed and its content is used as a prefill so that the model continues that message instead of starting a new turn. See [openai_chat_with_response_prefill.py](https://github.com/sgl-project/sglang/blob/main/examples/runtime/openai_chat_with_response_prefill.py) for examples. |
 | ignore_eos | `bool = False` | Don't stop generation when the EOS token is sampled. |
 | skip_special_tokens | `bool = True` | Remove special tokens during decoding. |
-| spaces_between_special_tokens | `bool = True` | Whether or not to add spaces between special tokens during detokenization. |
-| no_stop_trim | `bool = False` | Don't trim stop words or EOS token from the generated text. |
 | custom_params | `Optional[List[Optional[Dict[str, Any]]]] = None` | Used when employing `CustomLogitProcessor`. For usage, see below. |
 
 ## Examples
...
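To make the constrained-decoding rows above concrete, here is a hedged sketch of supplying a `json_schema` through `sampling_params`. The schema and server address are illustrative assumptions; the [structured outputs guide](./structured_outputs.ipynb) is the authoritative reference.

```python
import json
import requests

# An illustrative schema; any valid JSON Schema string should work here.
schema = json.dumps({
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "population": {"type": "integer"},
    },
    "required": ["name", "population"],
})

response = requests.post(
    "http://localhost:30000/generate",  # assumed local server
    json={
        "text": "Give me information about Paris in JSON format:",
        "sampling_params": {
            "max_new_tokens": 64,
            "json_schema": schema,  # constrains decoding to schema-conformant output
        },
    },
)
print(response.json()["text"])
```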
@@ -87,7 +87,7 @@ class GenerateReqInput:
     # The modalities of the image data [image, multi-images, video]
     modalities: Optional[List[str]] = None
 
-    # LoRA related
+    # The path to the LoRA
     lora_path: Optional[Union[List[Optional[str]], Optional[str]]] = None
 
     # Session info for continual prompting
...
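Since the hunk above only rewords the `lora_path` comment, a short hedged sketch of what that field does in practice may help. The adapter name, path, and launch-flag usage are assumptions; LoRA must be enabled when the server is launched for the per-request field to take effect.

```python
import requests

# Assumes the server was launched with a registered adapter, e.g.:
#   python -m sglang.launch_server --model-path <base-model> \
#       --lora-paths my_adapter=/path/to/adapter
response = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "Summarize the following paragraph: ...",
        "lora_path": "my_adapter",  # hypothetical adapter name registered at launch
        "sampling_params": {"max_new_tokens": 64},
    },
)
print(response.json()["text"])
```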
@@ -28,7 +28,6 @@ from sglang.srt.utils import (
     configure_ipv6,
     get_device,
     get_device_memory_capacity,
-    is_cuda,
     is_flashinfer_available,
     is_hip,
     is_port_available,
@@ -214,8 +213,8 @@ class ServerArgs:
     disable_shared_experts_fusion: bool = False
     disable_chunked_prefix_cache: bool = False
     disable_fast_image_processor: bool = False
-    warmups: Optional[str] = None
     enable_return_hidden_states: bool = False
+    warmups: Optional[str] = None
 
     # Debug tensor dumps
     debug_tensor_dump_output_folder: Optional[str] = None
@@ -536,10 +535,16 @@ class ServerArgs:
             help="The path of the tokenizer.",
         )
         parser.add_argument(
-            "--host", type=str, default=ServerArgs.host, help="The host of the server."
+            "--host",
+            type=str,
+            default=ServerArgs.host,
+            help="The host of the HTTP server.",
         )
         parser.add_argument(
-            "--port", type=int, default=ServerArgs.port, help="The port of the server."
+            "--port",
+            type=int,
+            default=ServerArgs.port,
+            help="The port of the HTTP server.",
         )
         parser.add_argument(
             "--tokenizer-mode",
@@ -694,6 +699,18 @@ class ServerArgs:
             "name, a tag name, or a commit id. If unspecified, will use "
             "the default version.",
         )
+        parser.add_argument(
+            "--impl",
+            type=str,
+            default=ServerArgs.impl,
+            help="Which implementation of the model to use.\n\n"
+            '* "auto" will try to use the SGLang implementation if it exists '
+            "and fall back to the Transformers implementation if no SGLang "
+            "implementation is available.\n"
+            '* "sglang" will use the SGLang model implementation.\n'
+            '* "transformers" will use the Transformers model '
+            "implementation.\n",
+        )
 
         # Memory and scheduling
         parser.add_argument(
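The relocated `--impl` flag, together with the `--host` and `--port` options reworded earlier, are all launch-time arguments. Here is a hedged sketch of starting the HTTP server with them from Python; the model path is an assumption, not part of this diff.

```python
import subprocess

# Equivalent to running the launch command in a shell; the flags correspond
# to the parser arguments shown above.
subprocess.run([
    "python", "-m", "sglang.launch_server",
    "--model-path", "meta-llama/Llama-3.1-8B-Instruct",  # assumed model
    "--host", "0.0.0.0",  # the host of the HTTP server
    "--port", "30000",    # the port of the HTTP server
    "--impl", "auto",     # prefer the SGLang implementation, fall back to Transformers
])
```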
@@ -752,18 +769,6 @@ class ServerArgs:
             default=ServerArgs.page_size,
             help="The number of tokens in a page.",
         )
-        parser.add_argument(
-            "--impl",
-            type=str,
-            default=ServerArgs.impl,
-            help="Which implementation of the model to use.\n\n"
-            '* "auto" will try to use the SGLang implementation if it exists '
-            "and fall back to the Transformers implementation if no SGLang "
-            "implementation is available.\n"
-            '* "sglang" will use the SGLang model implementation.\n'
-            '* "transformers" will use the Transformers model '
-            "implementation.\n",
-        )
 
         # Other runtime options
         parser.add_argument(
@@ -1442,6 +1447,11 @@ class ServerArgs:
             action="store_true",
             help="Adopt base image processor instead of fast image processor.",
         )
+        parser.add_argument(
+            "--enable-return-hidden-states",
+            action="store_true",
+            help="Enable returning hidden states with responses.",
+        )
         parser.add_argument(
             "--warmups",
             type=str,
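The relocated `--enable-return-hidden-states` flag pairs with the per-request `return_hidden_states` field documented in the `/generate` table above. A hedged sketch follows; the server address is assumed, and the exact response layout may vary by version, so see the hidden_states examples linked earlier for the authoritative usage.

```python
import requests

# The server must be started with --enable-return-hidden-states for the
# per-request flag below to take effect.
response = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "Hello",
        "return_hidden_states": True,
        "sampling_params": {"max_new_tokens": 8},
    },
)
out = response.json()
# Hidden states are returned alongside the usual metadata; inspect the keys,
# since the exact field name may differ across versions.
print(out["meta_info"].keys())
```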
@@ -1469,12 +1479,6 @@ class ServerArgs:
             default=ServerArgs.debug_tensor_dump_inject,
             help="Inject the outputs from jax as the input of every layer.",
         )
-        parser.add_argument(
-            "--enable-return-hidden-states",
-            action="store_true",
-            help="Enable returning hidden states with responses.",
-        )
         parser.add_argument(
             "--debug-tensor-dump-prefill-only",
             action="store_true",
...