# Sampling Parameters in SGLang Runtime
This doc describes the sampling parameters of the SGLang Runtime.
It covers the low-level `/generate` endpoint of the runtime.
If you want a high-level endpoint that can automatically handle chat templates, consider using the [OpenAI Compatible API](../backend/openai_api_completions.ipynb).

The `/generate` endpoint accepts the following arguments in JSON format.

```python
@dataclass
class GenerateReqInput:
    # The input prompt. It can be a single prompt or a batch of prompts.
    text: Optional[Union[List[str], str]] = None
    # The token ids for text; one can specify either text or input_ids
    input_ids: Optional[Union[List[List[int]], List[int]]] = None
    # The embeddings for input_ids; one can specify either text or input_ids or input_embeds.
    input_embeds: Optional[Union[List[List[List[float]]], List[List[float]]]] = None
    # The image input. It can be a file name, a url, or base64 encoded string.
    # See also python/sglang/srt/utils.py:load_image.
    image_data: Optional[Union[List[str], str]] = None
    # The sampling_params. See descriptions below.
    sampling_params: Optional[Union[List[Dict], Dict]] = None
    # The request id.
    rid: Optional[Union[List[str], str]] = None
    # Whether to return logprobs.
    return_logprob: Optional[Union[List[bool], bool]] = None
    # If returning logprobs, the start location in the prompt for returning logprobs.
    # By default, this value is -1, which means it will only return logprobs for output tokens.
    logprob_start_len: Optional[Union[List[int], int]] = None
    # If returning logprobs, the number of top logprobs to return at each position.
    top_logprobs_num: Optional[Union[List[int], int]] = None
    # Whether to detokenize tokens in text in the returned logprobs.
    return_text_in_logprobs: bool = False
    # Whether to stream output.
    stream: bool = False
    # Whether to log metrics for this request (e.g. health_generate calls do not log metrics)
    log_metrics: bool = True

    # The modalities of the image data [image, multi-images, video]
    modalities: Optional[List[str]] = None
    # LoRA related
    lora_path: Optional[Union[List[Optional[str]], Optional[str]]] = None

    # Session info for continual prompting
    session_params: Optional[Union[List[Dict], Dict]] = None
    # Custom logit processor for advanced sampling control. Must be a serialized instance
    # of `CustomLogitProcessor` in python/sglang/srt/sampling/custom_logit_processor.py
    # Use the processor's `to_str()` method to generate the serialized string.
    custom_logit_processor: Optional[Union[List[Optional[str]], str]] = None
```
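
For example, to also receive log probabilities for the generated tokens, set the logprob-related fields above. This is a minimal sketch, assuming a server is already running on port 30000 (see the Examples section below); the prompt and values are illustrative:

```python
import requests

response = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "The capital of France is",
        "sampling_params": {"temperature": 0, "max_new_tokens": 8},
        # Return logprobs for the output tokens ...
        "return_logprob": True,
        # ... including the top 3 alternatives at each position ...
        "top_logprobs_num": 3,
        # ... and detokenize the token ids in the logprob output.
        "return_text_in_logprobs": True,
    },
)
print(response.json())
```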

The `sampling_params` follows this format:

```python
# The maximum number of output tokens
max_new_tokens: int = 128,
# Stop when hitting any of the strings in this list
stop: Optional[Union[str, List[str]]] = None,
# Stop when hitting any of the token_ids in this list
stop_token_ids: Optional[List[int]] = [],
# Sampling temperature
temperature: float = 1.0,
# Top-p sampling
top_p: float = 1.0,
# Top-k sampling
top_k: int = -1,
# Min-p sampling
min_p: float = 0.0,
# Whether to ignore EOS token
ignore_eos: bool = False,
# Whether to skip the special tokens during detokenization
skip_special_tokens: bool = True,
# Whether to add spaces between special tokens during detokenization
spaces_between_special_tokens: bool = True,
# Do parallel sampling and return `n` outputs.
n: int = 1,

## Structured Outputs
# Only one of the below three can be set for a request.

# Constrain the output to follow a given JSON schema.
json_schema: Optional[str] = None,
# Constrain the output to follow a given regular expression.
regex: Optional[str] = None,
# Constrain the output to follow a given EBNF grammar.
ebnf: Optional[str] = None,

## Penalties.

# Float that penalizes new tokens based on their frequency in the generated text so far.
# Values > 0 encourage the model to use new tokens, while values < 0 encourage the model to
# repeat tokens. Must be -2 <= value <= 2. Setting to 0 (default) will disable this penalty.
frequency_penalty: float = 0.0,
# Float that penalizes new tokens based on whether they appear in the generated text so far.
# Values > 0 encourage the model to use new tokens, while values < 0 encourage the model to repeat
# tokens. Must be -2 <= value <= 2. Setting to 0 (default) will disable this penalty.
presence_penalty: float = 0.0,
# Float that penalizes new tokens based on whether they appear in the prompt and the generated text
# so far. Values > 1 encourage the model to use new tokens, while values < 1 encourage the model to
# repeat tokens. Must be 0 <= value <= 2. Setting to 1 (default) will disable this penalty.
repetition_penalty: float = 1.0,
# Guides inference to generate at least this number of tokens by penalizing logits of tokenizer's
# EOS token and `stop_token_ids` to -inf, until the output token reaches given length.
# Note that any of the `stop` strings can be generated before reaching `min_new_tokens`, as it is
# difficult to infer the correct token IDs from the given `stop` strings.
# Must be 0 <= value < max_new_tokens. Setting to 0 (default) will disable this penalty.
min_new_tokens: int = 0,


## Custom Parameters for Custom Logit Processor.
# A dictionary of custom parameters for the custom logit processor.
# The custom logit processor takes a list of dictionaries as input, where each
# dictionary is the custom parameters for one token in a batch of the input.
# See also python/sglang/srt/sampling/custom_logit_processor.py
custom_params: Optional[Dict[str, Any]] = None,
```
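
As an illustrative sketch of how these parameters combine in a request (the values are arbitrary, not a recommended configuration, and a server is assumed to be running on port 30000 as in the examples below):

```python
import requests

response = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "Write a one-line slogan for a coffee shop:",
        "sampling_params": {
            "max_new_tokens": 64,
            "temperature": 0.8,        # sample from a softened distribution
            "top_p": 0.95,             # nucleus sampling over the top 95% of probability mass
            "top_k": 50,               # consider only the 50 most likely tokens
            "frequency_penalty": 0.5,  # discourage tokens that already appeared often
            "min_new_tokens": 8,       # suppress EOS until at least 8 tokens are generated
            "stop": ["\n"],            # stop at the first newline
        },
    },
)
print(response.json())
```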

## Examples

### Normal
Launch a server
```
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --port 30000
```

Send a request
```python
import requests

response = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "The capital of France is",
        "sampling_params": {
            "temperature": 0,
            "max_new_tokens": 32,
        },
    },
)
print(response.json())
```
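
Because `text` and `sampling_params` also accept lists (see `GenerateReqInput` above), you can batch several prompts into one request. A minimal sketch:

```python
import requests

response = requests.post(
    "http://localhost:30000/generate",
    json={
        # A batch of prompts; the server returns one output per prompt.
        "text": ["The capital of France is", "The capital of Germany is"],
        "sampling_params": {
            "temperature": 0,
            "max_new_tokens": 32,
        },
    },
)
print(response.json())
```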

### Streaming
Send a request and stream the output
```python
import requests, json

response = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "The capital of France is",
        "sampling_params": {
            "temperature": 0,
            "max_new_tokens": 32,
        },
        "stream": True,
    },
    stream=True,
)

prev = 0
for chunk in response.iter_lines(decode_unicode=False):
    chunk = chunk.decode("utf-8")
    if chunk and chunk.startswith("data:"):
        if chunk == "data: [DONE]":
            break
        data = json.loads(chunk[5:].strip("\n"))
        output = data["text"].strip()
        print(output[prev:], end="", flush=True)
        prev = len(output)
print("")
```

### Multi-modal

Launch a server
```
python3 -m sglang.launch_server --model-path lmms-lab/llava-onevision-qwen2-7b-ov --chat-template chatml-llava
```

Download an image
```
curl -o example_image.png -L https://github.com/sgl-project/sglang/blob/main/test/lang/example_image.png?raw=true
```

Send a request
```python
import requests

response = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n"
                "<|im_start|>user\n<image>\nDescribe this image in a very short sentence.<|im_end|>\n"
                "<|im_start|>assistant\n",
        "image_data": "example_image.png",
        "sampling_params": {
            "temperature": 0,
            "max_new_tokens": 32,
        },
    },
)
print(response.json())
```

The `image_data` can be a file name, a URL, or a base64 encoded string. See also `python/sglang/srt/utils.py:load_image`.
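
For instance, to send the image as a base64 encoded string instead of a file name, here is a minimal sketch (assuming `example_image.png` was downloaded as above):

```python
import base64
import requests

# Read the downloaded image and encode it as a base64 string.
with open("example_image.png", "rb") as f:
    image_base64 = base64.b64encode(f.read()).decode("utf-8")

response = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n"
                "<|im_start|>user\n<image>\nDescribe this image in a very short sentence.<|im_end|>\n"
                "<|im_start|>assistant\n",
        "image_data": image_base64,
        "sampling_params": {"temperature": 0, "max_new_tokens": 32},
    },
)
print(response.json())
```
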
Streaming is supported in a similar manner as [above](#streaming).

### Structured Outputs (JSON, Regex, EBNF)
You can specify a JSON schema, a regular expression, or an [EBNF](https://en.wikipedia.org/wiki/Extended_Backus%E2%80%93Naur_form) grammar to constrain the model output. The model output is guaranteed to follow the given constraints. Only one constraint parameter (`json_schema`, `regex`, or `ebnf`) can be specified per request.

SGLang supports two grammar backends:

- [Outlines](https://github.com/dottxt-ai/outlines) (default): Supports JSON schema and regular expression constraints.
- [XGrammar](https://github.com/mlc-ai/xgrammar): Supports JSON schema, regular expression, and EBNF constraints.
  - XGrammar currently uses the [GGML BNF format](https://github.com/ggerganov/llama.cpp/blob/master/grammars/README.md)

Initialize the XGrammar backend using the `--grammar-backend xgrammar` flag:
```bash
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
--port 30000 --host 0.0.0.0 --grammar-backend [xgrammar|outlines] # xgrammar or outlines (default: outlines)
```

```python
import json
import requests

json_schema = json.dumps({
    "type": "object",
    "properties": {
        "name": {"type": "string", "pattern": "^[\\w]+$"},
        "population": {"type": "integer"},
    },
    "required": ["name", "population"],
})

# JSON (works with both Outlines and XGrammar)
response = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "Here is the information of the capital of France in the JSON format.\n",
        "sampling_params": {
            "temperature": 0,
            "max_new_tokens": 64,
            "json_schema": json_schema,
        },
    },
)
print(response.json())

# Regular expression (Outlines backend only)
response = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "Paris is the capital of",
        "sampling_params": {
            "temperature": 0,
            "max_new_tokens": 64,
            "regex": "(France|England)",
        },
    },
)
print(response.json())

# EBNF (XGrammar backend only)
response = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "Write a greeting.",
        "sampling_params": {
            "temperature": 0,
            "max_new_tokens": 64,
            "ebnf": 'root ::= "Hello" | "Hi" | "Hey"',
        },
    },
)
print(response.json())
```
### Custom Logit Processor
Launch a server with the `--enable-custom-logit-processor` flag.
```
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --port 30000 --enable-custom-logit-processor
```

Define a custom logit processor that will always sample a specific token id.
```python
from sglang.srt.sampling.custom_logit_processor import CustomLogitProcessor

class DeterministicLogitProcessor(CustomLogitProcessor):
    """A dummy logit processor that changes the logits to always
    sample the given token id.
    """

    def __call__(self, logits, custom_param_list):
        # Check that the number of logits matches the number of custom parameters
        assert logits.shape[0] == len(custom_param_list)
        key = "token_id"

        for i, param_dict in enumerate(custom_param_list):
            # Mask all other tokens
            logits[i, :] = -float("inf")
            # Assign highest probability to the specified token
            logits[i, param_dict[key]] = 0.0
        return logits
```

Send a request
```python
import requests

response = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "The capital of France is",
        "custom_logit_processor": DeterministicLogitProcessor().to_str(),
        "sampling_params": {
            "temperature": 0.0,
            "max_new_tokens": 32,
            "custom_params": {"token_id": 5},
        },
    },
)
print(response.json())
```