This doc describes the sampling parameters of the SGLang Runtime. It is the low-level endpoint of the runtime. If you want a high-level endpoint that can automatically handle chat templates, consider using the [OpenAI Compatible API](https://docs.sglang.ai/backend/openai_api_completions.html).
## `/generate` Endpoint
The `/generate` endpoint accepts the following parameters in JSON format. For detailed usage, see the [native API doc](https://docs.sglang.ai/backend/native_api.html). You can find code examples below.
* `text`: The input prompt. It can be a single prompt or a batch of prompts. `Optional[Union[List[str], str]] = None`
* `input_ids`: The token ids of the prompt; one can specify either `text` or `input_ids`, but not both. `Optional[Union[List[List[int]], List[int]]] = None`
* `sampling_params`: The sampling parameters as described in the sections below.
* `return_logprob`: Whether to return log probabilities for tokens (off by default).
* `logprob_start_len`: If returning log probabilities, specifies the start position in the prompt. Default is `-1`, which returns logprobs only for output tokens. `Optional[Union[List[int], int]] = None`
* `top_logprobs_num`: If returning log probabilities, specifies the number of top logprobs to return at each position. `Optional[Union[List[int], int]] = None`
* `stream`: Whether to stream the output. `bool = False`
* `lora_path`: Path to LoRA weights. `Optional[Union[List[Optional[str]], Optional[str]]] = None`
* `custom_logit_processor`: Custom logit processor for advanced sampling control. Must be a serialized instance of `CustomLogitProcessor` in `python/sglang/srt/sampling/custom_logit_processor.py`, produced with the processor's `to_str()` method. For usage see below. `Optional[Union[List[Optional[str]], str]] = None`
* `return_hidden_states`: Whether to return hidden states of the model. Note that each time it changes, the cuda graph will be recaptured, which might lead to a performance hit. See the [examples](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/hidden_states.py) for more information. `bool = False`
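For example, a minimal request can be sent with Python's `requests` package. This is a sketch that assumes a server is already running at the default `http://localhost:30000`; the prompt and parameter values are placeholders.

```python
import requests

# Minimal /generate request: a single prompt plus a few request-level options.
# Assumes an SGLang server is running at http://localhost:30000.
response = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "The capital of France is",
        "sampling_params": {"temperature": 0, "max_new_tokens": 32},
        "return_logprob": True,   # also return log probabilities
        "logprob_start_len": -1,  # only for output tokens (default)
    },
)
print(response.json())
```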
## Sampling params

The `sampling_params` field accepts the options below.

### Core parameters

* `max_new_tokens`: The maximum output length measured in tokens. `int = 128`
* `stop`: One or multiple [stop words](https://platform.openai.com/docs/api-reference/chat/create#chat-create-stop). Generation will stop if one of these words is sampled. `Optional[Union[str, List[str]]] = None`
* `stop_token_ids`: Provide stop words in the form of token ids. Generation will stop if one of these token ids is sampled. `Optional[List[int]] = []`
* `temperature`: [Temperature](https://platform.openai.com/docs/api-reference/chat/create#chat-create-temperature) when sampling the next token. `temperature = 0` corresponds to greedy sampling; a higher temperature leads to more diversity. `float = 1.0`
* `top_p`: [Top-p](https://platform.openai.com/docs/api-reference/chat/create#chat-create-top_p) selects tokens from the smallest sorted set whose cumulative probability exceeds `top_p`. When `top_p = 1`, this reduces to unrestricted sampling from all tokens. `float = 1.0`
* `top_k`: [Top-k](https://developer.nvidia.com/blog/how-to-get-better-outputs-from-your-large-language-model/#predictability_vs_creativity) randomly selects from the `k` highest-probability tokens. `int = -1`
* `min_p`: [Min-p](https://github.com/huggingface/transformers/issues/27670) samples from tokens with probability larger than `min_p * highest_token_probability`. `float = 0.0`
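To build intuition for how `top_k`, `top_p`, and `min_p` prune the candidate set, here is a small illustrative sketch on a hand-written toy distribution. It follows the descriptions above and is not SGLang's actual sampling implementation; the filter order and tie-breaking in the runtime may differ.

```python
# Illustrative only: prune a toy next-token distribution with top-k, top-p, and min-p.
probs = {"the": 0.40, "a": 0.25, "Paris": 0.20, "banana": 0.10, "qux": 0.05}

def filter_candidates(probs, top_k=-1, top_p=1.0, min_p=0.0):
    items = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    if top_k > 0:
        items = items[:top_k]                      # keep the k most likely tokens
    if min_p > 0.0:
        threshold = min_p * items[0][1]            # relative to the highest probability
        items = [kv for kv in items if kv[1] >= threshold]
    kept, cumulative = [], 0.0
    for token, p in items:                         # smallest set whose cumulative prob reaches top_p
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}  # renormalize before sampling

print(filter_candidates(probs, top_k=4, top_p=0.9, min_p=0.1))
```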
### Penalizers
To use penalizers, you need to launch the server with `--disable-overlap`. Please note that this might degrade performance.
* `frequency_penalty`: Penalizes tokens based on their frequency in the generation so far. Must be between `-2` and `2`, where negative values encourage repetition and positive values encourage sampling new tokens. The penalization grows linearly with each appearance of a token. `float = 0.0`
* `presence_penalty`: Penalizes tokens that have already appeared in the generation so far. Must be between `-2` and `2`, where negative values encourage repetition and positive values encourage sampling new tokens. The penalization is a constant offset once a token has occurred. `float = 0.0`
* `repetition_penalty`: Penalizes tokens that appeared in the prompt or the generation so far. Must be between `0` and `2`, where values smaller than `1` encourage repetition and values larger than `1` encourage sampling new tokens. The penalization scales multiplicatively. `float = 1.0`
* `min_new_tokens`: Forces the model to generate at least `min_new_tokens` tokens before a stop word or the EOS token can be sampled. Must satisfy `0 <= min_new_tokens < max_new_tokens`. Only the EOS token and `stop_token_ids` are suppressed, so a `stop` string may still be generated earlier. Note that this might lead to unintended behavior, for example, if the distribution is highly skewed towards these tokens. `int = 0`
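As a rough illustration of how the two additive penalties differ, the sketch below applies the common OpenAI-style formulation to a toy logit table: the frequency penalty scales with how often a token has appeared, while the presence penalty is a fixed offset. This is illustrative only; SGLang's penalizer kernels may differ in detail.

```python
from collections import Counter

# Illustrative OpenAI-style additive penalties; not SGLang's exact kernels.
def apply_additive_penalties(logits, generated_token_ids,
                             frequency_penalty=0.0, presence_penalty=0.0):
    counts = Counter(generated_token_ids)
    penalized = dict(logits)
    for token_id, count in counts.items():
        if token_id in penalized:
            penalized[token_id] -= frequency_penalty * count  # grows linearly with each appearance
            penalized[token_id] -= presence_penalty           # constant offset once a token appeared
    return penalized

logits = {0: 2.0, 1: 1.5, 2: 0.5}
print(apply_additive_penalties(logits, [0, 0, 1], frequency_penalty=0.5, presence_penalty=0.2))
# token 0 (seen twice): 2.0 - 0.5*2 - 0.2 = 0.8; token 1 (seen once): 1.5 - 0.5 - 0.2 = 0.8
```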
### Constrained decoding

Please refer to our dedicated guide on [constrained decoding](https://docs.sglang.ai/backend/structured_outputs.html#Native-API-and-SGLang-Runtime-(SRT)) for the following parameters. Only one of the three can be set for a request.
* `json_schema`: Constrain the output to follow a given JSON schema. `Optional[str] = None`
* `regex`: Constrain the output to follow a given regular expression. `Optional[str] = None`
* `ebnf`: Constrain the output to follow a given EBNF grammar. `Optional[str] = None`
### Other options
* `n`: Specifies the number of output sequences to generate per request. Generating multiple outputs in one request (`n > 1`) is discouraged; repeating the same prompt several times offers better control and efficiency. `int = 1`
* `spaces_between_special_tokens`: Whether or not to add spaces between special tokens during detokenization. `bool = True`
* `no_stop_trim`: Don't trim stop words or the EOS token from the generated text. `bool = False`
* `ignore_eos`: Don't stop generation when the EOS token is sampled. `bool = False`
* `skip_special_tokens`: Remove special tokens during decoding. `bool = True`
* `custom_params`: Custom parameters passed to the `CustomLogitProcessor`; when batching, provide one dictionary per request. See also `python/sglang/srt/sampling/custom_logit_processor.py`. For usage see below. `Optional[List[Optional[Dict[str, Any]]]] = None`
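A rough sketch of the `custom_logit_processor` / `custom_params` flow is shown below. The exact interface lives in `python/sglang/srt/sampling/custom_logit_processor.py`; the processor class, its `__call__` signature, and the `--enable-custom-logit-processor` launch flag used here are assumptions that may need adjusting to your SGLang version.

```python
# Sketch only: the processor subclass and the request wiring are illustrative.
# Assumes the server was launched with custom logit processors enabled
# (e.g. `--enable-custom-logit-processor`, if available in your version).
import requests
from sglang.srt.sampling.custom_logit_processor import CustomLogitProcessor

class ForceTokenLogitProcessor(CustomLogitProcessor):
    """Illustrative processor: force sampling of the token id given in custom_params."""
    def __call__(self, logits, custom_param_list):
        # One dict of custom params per request in the batch (assumed interface).
        for i, params in enumerate(custom_param_list):
            logits[i, :] = -float("inf")
            logits[i, params["token_id"]] = 0.0
        return logits

response = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "The capital of France is",
        "custom_logit_processor": ForceTokenLogitProcessor().to_str(),  # serialized processor
        "sampling_params": {
            "temperature": 0,
            "max_new_tokens": 8,
            "custom_params": {"token_id": 5},  # consumed by the processor above
        },
    },
)
print(response.json())
```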
"text":"<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n"
"<|im_start|>user\n<image>\nDescribe this image in a very short sentence.<|im_end|>\n"
"<|im_start|>assistant\n",
"image_data":"example_image.png",
"sampling_params":{
"temperature":0,
"max_new_tokens":32,
},
},
)
print(response.json())
```
The `image_data` can be a file name, a URL, or a base64 encoded string. See also `python/sglang/srt/utils.py:load_image`.
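For instance, a base64-encoded payload can be produced like this (illustrative snippet):

```python
import base64

# Encode a local image file as base64 and pass the result as "image_data".
with open("example_image.png", "rb") as f:
    encoded_image = base64.b64encode(f.read()).decode("utf-8")
```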
Streaming is supported in a similar manner; a sketch of consuming a streamed response is shown below.
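This is a minimal sketch that assumes the server emits server-sent-event style `data: ...` lines terminated by `data: [DONE]`, with each chunk carrying the cumulative text so far; check the native API doc for the exact wire format.

```python
import json
import requests

# Sketch: stream tokens from /generate.
response = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "The capital of France is",
        "sampling_params": {"temperature": 0, "max_new_tokens": 32},
        "stream": True,
    },
    stream=True,
)

printed = 0
for line in response.iter_lines(decode_unicode=True):
    if not line or not line.startswith("data:"):
        continue
    if line == "data: [DONE]":
        break
    data = json.loads(line[len("data:"):].strip())
    text = data["text"]
    print(text[printed:], end="", flush=True)  # print only the newly generated part
    printed = len(text)
print()
```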
### Structured Outputs (JSON, Regex, EBNF)
You can specify a JSON schema, regular expression or [EBNF](https://en.wikipedia.org/wiki/Extended_Backus%E2%80%93Naur_form) to constrain the model output. The model output will be guaranteed to follow the given constraints. Only one constraint parameter (`json_schema`, `regex`, or `ebnf`) can be specified for a request.
SGLang supports two grammar backends:
- [Outlines](https://github.com/dottxt-ai/outlines) (default): Supports JSON schema and regular expression constraints.
- [XGrammar](https://github.com/mlc-ai/xgrammar): Supports JSON schema, regular expression, and EBNF constraints.
  - XGrammar currently uses the [GGML BNF format](https://github.com/ggerganov/llama.cpp/blob/master/grammars/README.md).

Initialize the XGrammar backend using the `--grammar-backend xgrammar` flag.
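For example, a JSON-schema-constrained request can look like the sketch below; the schema and prompt are placeholders, and the [constrained decoding guide](https://docs.sglang.ai/backend/structured_outputs.html) has complete examples.

```python
import json
import requests

# Constrain the output to a JSON object with "name" and "population" fields.
# The schema and prompt here are illustrative placeholders.
json_schema = json.dumps({
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "population": {"type": "integer"},
    },
    "required": ["name", "population"],
})

response = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "Here is the information of the capital of France in the JSON format.\n",
        "sampling_params": {
            "temperature": 0,
            "max_new_tokens": 64,
            "json_schema": json_schema,  # only one of json_schema / regex / ebnf may be set
        },
    },
)
print(response.json())
```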