Unverified Commit dbdf76ca authored by Lianmin Zheng, committed by GitHub

Clean up docs for server args and sampling parameters (generated by grok) (#7076)

parent f2a75a66
......@@ -6,23 +6,31 @@ If you want a high-level endpoint that can automatically handle chat templates,
## `/generate` Endpoint
The `/generate` endpoint accepts the following parameters in JSON format. For detailed usage, see the [native API doc](./native_api.ipynb).
The `/generate` endpoint accepts the following parameters in JSON format. For detailed usage, see the [native API doc](./native_api.ipynb). The object is defined at `io_struct.py::GenerateReqInput`. You can also read the source code to find more arguments and docs.
| Argument | Type/Default | Description |
|------------------------|---------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------|
|----------------------------|------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------|
| text | `Optional[Union[List[str], str]] = None` | The input prompt. Can be a single prompt or a batch of prompts. |
| input_ids | `Optional[Union[List[List[int]], List[int]]] = None` | Alternative to `text`. Specify the input as token IDs instead of text. |
| input_ids | `Optional[Union[List[List[int]], List[int]]] = None` | The token IDs for text; one can specify either text or input_ids. |
| input_embeds | `Optional[Union[List[List[List[float]]], List[List[float]]]] = None` | The embeddings for input_ids; one can specify either text, input_ids, or input_embeds. |
| image_data | `Optional[Union[List[List[ImageDataItem]], List[ImageDataItem], ImageDataItem]] = None` | The image input. Can be an image instance, file name, URL, or base64 encoded string. Can be a single image, list of images, or list of lists of images. |
| audio_data | `Optional[Union[List[AudioDataItem], AudioDataItem]] = None` | The audio input. Can be a file name, URL, or base64 encoded string. |
| sampling_params | `Optional[Union[List[Dict], Dict]] = None` | The sampling parameters as described in the sections below. |
| rid | `Optional[Union[List[str], str]] = None` | The request ID. |
| return_logprob | `Optional[Union[List[bool], bool]] = None` | Whether to return log probabilities for tokens. |
| logprob_start_len | `Optional[Union[List[int], int]] = None` | If returning log probabilities, specifies the start position in the prompt. Default is "-1", which returns logprobs only for output tokens. |
| top_logprobs_num | `Optional[Union[List[int], int]] = None` | If returning log probabilities, specifies the number of top logprobs to return at each position. |
| stream | `bool = False` | Whether to stream the output. |
| lora_path | `Optional[Union[List[Optional[str]], Optional[str]]] = None`| Path to LoRA weights. |
| custom_logit_processor | `Optional[Union[List[Optional[str]], str]] = None` | Custom logit processor for advanced sampling control. For usage see below. |
| return_hidden_states | `bool = False` | Whether to return hidden states of the model. Note that each time it changes, the CUDA graph will be recaptured, which might lead to a performance hit. See the [examples](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states) for more information. |
| logprob_start_len | `Optional[Union[List[int], int]] = None` | If return_logprob, the start location in the prompt for returning logprobs. Default is "-1", which returns logprobs for output tokens only. |
| top_logprobs_num | `Optional[Union[List[int], int]] = None` | If return_logprob, the number of top logprobs to return at each position. |
| token_ids_logprob | `Optional[Union[List[List[int]], List[int]]] = None` | If return_logprob, the token IDs to return logprob for. |
| return_text_in_logprobs | `bool = False` | Whether to detokenize tokens in text in the returned logprobs. |
| stream | `bool = False` | Whether to stream output. |
| lora_path | `Optional[Union[List[Optional[str]], Optional[str]]] = None` | The path to the LoRA. |
| custom_logit_processor | `Optional[Union[List[Optional[str]], str]] = None` | Custom logit processor for advanced sampling control. Must be a serialized instance of `CustomLogitProcessor` using its `to_str()` method. For usage see below. |
| return_hidden_states | `Union[List[bool], bool] = False` | Whether to return hidden states. |
## Sampling parameters
The object is defined at `sampling_params.py::SamplingParams`. You can also read the source code to find more arguments and docs.
### Core parameters
| Argument | Type/Default | Description |
......@@ -48,21 +56,21 @@ The `/generate` endpoint accepts the following parameters in JSON format. For de
Please refer to our dedicated guide on [constrained decoding](./structured_outputs.ipynb) for the following parameters.
| Argument | Type/Default | Description |
|--------------|---------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------|
|-----------------|---------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------|
| json_schema | `Optional[str] = None` | JSON schema for structured outputs. |
| regex | `Optional[str] = None` | Regex for structured outputs. |
| ebnf | `Optional[str] = None` | EBNF for structured outputs. |
| structural_tag | `Optional[str] = None` | The structural tag for structured outputs. |
### Other options
| Argument | Type/Default | Description |
|-------------------------------|---------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------|
| n | `int = 1` | Specifies the number of output sequences to generate per request. (Generating multiple outputs in one request (n > 1) is discouraged; repeating the same prompts several times offers better control and efficiency.) |
| spaces_between_special_tokens | `bool = True` | Whether or not to add spaces between special tokens during detokenization. |
| no_stop_trim | `bool = False` | Don't trim stop words or EOS token from the generated text. |
| continue_final_message | `bool = False` | When enabled, the final assistant message is removed and its content is used as a prefill so that the model continues that message instead of starting a new turn. See [openai_chat_with_response_prefill.py](https://github.com/sgl-project/sglang/blob/main/examples/runtime/openai_chat_with_response_prefill.py) for examples. |
| ignore_eos | `bool = False` | Don't stop generation when EOS token is sampled. |
| skip_special_tokens | `bool = True` | Remove special tokens during decoding. |
| spaces_between_special_tokens | `bool = True` | Whether or not to add spaces between special tokens during detokenization. |
| no_stop_trim | `bool = False` | Don't trim stop words or EOS token from the generated text. |
| custom_params | `Optional[List[Optional[Dict[str, Any]]]] = None` | Used when employing `CustomLogitProcessor`. For usage, see below. |
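
As an illustration of how the arguments in the tables above fit together, the following minimal sketch builds the JSON body for a `/generate` request without sending it. The helper name `build_generate_payload`, the prompt, and the regex value are hypothetical and not part of SGLang; the field names (`text`, `sampling_params`, `return_logprob`, `top_logprobs_num`, `stream`, `regex`, `n`, `skip_special_tokens`) are the ones listed in the tables.

```python
import json
from typing import Any, Dict, List, Optional, Union


def build_generate_payload(
    text: Union[str, List[str]],
    sampling_params: Dict[str, Any],
    return_logprob: bool = False,
    top_logprobs_num: Optional[int] = None,
    stream: bool = False,
) -> Dict[str, Any]:
    """Assemble a request body from field names in the tables above.

    Hypothetical helper for illustration only; it is not an SGLang API.
    """
    payload: Dict[str, Any] = {
        "text": text,
        "sampling_params": sampling_params,
        "stream": stream,
    }
    # Logprob-related fields are only added when logprobs are requested.
    if return_logprob:
        payload["return_logprob"] = True
        if top_logprobs_num is not None:
            payload["top_logprobs_num"] = top_logprobs_num
    return payload


payload = build_generate_payload(
    text="The capital of France is",
    sampling_params={"regex": "(Paris|London)", "n": 1, "skip_special_tokens": True},
    return_logprob=True,
    top_logprobs_num=2,
)
body = json.dumps(payload)  # JSON string to use as the POST body
```

The resulting `body` can be sent to the server's `/generate` route with any HTTP client; the host and port depend on how the server was launched.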
## Examples
......
This diff is collapsed.
......@@ -87,7 +87,7 @@ class GenerateReqInput:
# The modalities of the image data [image, multi-images, video]
modalities: Optional[List[str]] = None
# LoRA related
# The path to the LoRA
lora_path: Optional[Union[List[Optional[str]], Optional[str]]] = None
# Session info for continual prompting
......
......@@ -28,7 +28,6 @@ from sglang.srt.utils import (
configure_ipv6,
get_device,
get_device_memory_capacity,
is_cuda,
is_flashinfer_available,
is_hip,
is_port_available,
......@@ -214,8 +213,8 @@ class ServerArgs:
disable_shared_experts_fusion: bool = False
disable_chunked_prefix_cache: bool = False
disable_fast_image_processor: bool = False
warmups: Optional[str] = None
enable_return_hidden_states: bool = False
warmups: Optional[str] = None
# Debug tensor dumps
debug_tensor_dump_output_folder: Optional[str] = None
......@@ -536,10 +535,16 @@ class ServerArgs:
help="The path of the tokenizer.",
)
parser.add_argument(
"--host", type=str, default=ServerArgs.host, help="The host of the server."
"--host",
type=str,
default=ServerArgs.host,
help="The host of the HTTP server.",
)
parser.add_argument(
"--port", type=int, default=ServerArgs.port, help="The port of the server."
"--port",
type=int,
default=ServerArgs.port,
help="The port of the HTTP server.",
)
parser.add_argument(
"--tokenizer-mode",
......@@ -694,6 +699,18 @@ class ServerArgs:
"name, a tag name, or a commit id. If unspecified, will use "
"the default version.",
)
parser.add_argument(
"--impl",
type=str,
default=ServerArgs.impl,
help="Which implementation of the model to use.\n\n"
'* "auto" will try to use the SGLang implementation if it exists '
"and fall back to the Transformers implementation if no SGLang "
"implementation is available.\n"
'* "sglang" will use the SGLang model implementation.\n'
'* "transformers" will use the Transformers model '
"implementation.\n",
)
# Memory and scheduling
parser.add_argument(
......@@ -752,18 +769,6 @@ class ServerArgs:
default=ServerArgs.page_size,
help="The number of tokens in a page.",
)
parser.add_argument(
"--impl",
type=str,
default=ServerArgs.impl,
help="Which implementation of the model to use.\n\n"
'* "auto" will try to use the SGLang implementation if it exists '
"and fall back to the Transformers implementation if no SGLang "
"implementation is available.\n"
'* "sglang" will use the SGLang model implementation.\n'
'* "transformers" will use the Transformers model '
"implementation.\n",
)
# Other runtime options
parser.add_argument(
......@@ -1442,6 +1447,11 @@ class ServerArgs:
action="store_true",
help="Adopt base image processor instead of fast image processor.",
)
parser.add_argument(
"--enable-return-hidden-states",
action="store_true",
help="Enable returning hidden states with responses.",
)
parser.add_argument(
"--warmups",
type=str,
......@@ -1469,12 +1479,6 @@ class ServerArgs:
default=ServerArgs.debug_tensor_dump_inject,
help="Inject the outputs from jax as the input of every layer.",
)
parser.add_argument(
"--enable-return-hidden-states",
action="store_true",
help="Enable returning hidden states with responses.",
)
parser.add_argument(
"--debug-tensor-dump-prefill-only",
action="store_true",
......
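
The argument registrations touched in the hunks above follow the standard `argparse` pattern. The sketch below is a standalone approximation, not the actual `ServerArgs` code: `build_parser` and `parse` are hypothetical helpers, and the default values for `--host`, `--port`, and `--impl` are placeholders assumed for illustration (the real defaults come from the `ServerArgs` dataclass fields).

```python
import argparse
from typing import List, Optional


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring a few flags from the hunks above."""
    parser = argparse.ArgumentParser(description="Illustrative subset of server flags")
    # Placeholder defaults (assumptions); the real ones live on ServerArgs.
    parser.add_argument(
        "--host", type=str, default="127.0.0.1", help="The host of the HTTP server."
    )
    parser.add_argument(
        "--port", type=int, default=30000, help="The port of the HTTP server."
    )
    parser.add_argument(
        "--impl",
        type=str,
        default="auto",
        help="Which implementation of the model to use.",
    )
    # A store_true flag defaults to False and becomes True when passed.
    parser.add_argument(
        "--enable-return-hidden-states",
        action="store_true",
        help="Enable returning hidden states with responses.",
    )
    return parser


def parse(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse an explicit argument list, or sys.argv when argv is None."""
    return build_parser().parse_args(argv)


args = parse(["--port", "8000", "--enable-return-hidden-states"])
```

Because moving an `add_argument` call only changes where the flag appears in `--help` output, reordering like the `--impl` and `--enable-return-hidden-states` moves above should not alter parsing behavior.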