# Sampling Parameters in SGLang Runtime
This doc describes the sampling parameters of the SGLang Runtime.
It is the low-level endpoint of the runtime.
If you want a high-level endpoint that can automatically handle chat templates, consider using the [OpenAI Compatible API](https://github.com/sgl-project/sglang?tab=readme-ov-file#openai-compatible-api).

The `/generate` endpoint accepts the following arguments in JSON format.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Union


@dataclass
class GenerateReqInput:
    # The input prompt. It can be a single prompt or a batch of prompts.
    text: Optional[Union[List[str], str]] = None
    # The token ids for text; one can specify either text or input_ids.
    input_ids: Optional[Union[List[List[int]], List[int]]] = None
    # The embeddings for input_ids; one can specify either text, input_ids, or input_embeds.
    input_embeds: Optional[Union[List[List[List[float]]], List[List[float]]]] = None
    # The image input. It can be a file name, a URL, or a base64 encoded string.
    # See also python/sglang/srt/utils.py:load_image.
    image_data: Optional[Union[List[str], str]] = None
    # The sampling_params. See descriptions below.
    sampling_params: Optional[Union[List[Dict], Dict]] = None
    # The request id.
    rid: Optional[Union[List[str], str]] = None
    # Whether to return logprobs.
    return_logprob: Optional[Union[List[bool], bool]] = None
    # If returning logprobs, the start location in the prompt for returning logprobs.
    # By default, this value is "-1", which means it will only return logprobs for output tokens.
    logprob_start_len: Optional[Union[List[int], int]] = None
    # If returning logprobs, the number of top logprobs to return at each position.
    top_logprobs_num: Optional[Union[List[int], int]] = None
    # Whether to detokenize tokens in text in the returned logprobs.
    return_text_in_logprobs: bool = False
    # Whether to stream output.
    stream: bool = False
```
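
For instance, the logprob-related fields are set at the top level of the request, next to `text` and `sampling_params`. The sketch below is illustrative, assuming a server running on port 30000 as in the examples later in this doc.

```python
import requests

# A minimal sketch of a request that also asks for logprobs.
response = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "The capital of France is",
        "sampling_params": {"temperature": 0, "max_new_tokens": 8},
        "return_logprob": True,            # return logprobs for the output tokens
        "top_logprobs_num": 5,             # also return the top-5 logprobs at each position
        "return_text_in_logprobs": True,   # detokenize the tokens in the returned logprobs
    },
)
print(response.json())
```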

The `sampling_params` follows this format:

```python
# The maximum number of output tokens.
max_new_tokens: int = 128,
# Stop when hitting any of the strings in this list.
stop: Optional[Union[str, List[str]]] = None,
# Stop when hitting any of the token ids in this list. Can be useful when mixed with
# `min_new_tokens`.
stop_token_ids: Optional[List[int]] = [],
# Sampling temperature.
temperature: float = 1.0,
# Top-p sampling.
top_p: float = 1.0,
# Top-k sampling.
top_k: int = -1,
# Min-p sampling.
min_p: float = 0.0,
# Whether to ignore the EOS token.
ignore_eos: bool = False,
# Whether to skip special tokens during detokenization.
skip_special_tokens: bool = True,
# Whether to add spaces between special tokens during detokenization.
spaces_between_special_tokens: bool = True,
# Constrains the output to follow a given regular expression.
regex: Optional[str] = None,
# Do parallel sampling and return `n` outputs.
n: int = 1,
# Constrains the output to follow a given JSON schema.
# `regex` and `json_schema` cannot be set at the same time.
json_schema: Optional[str] = None,

## Penalties. See the [Performance Implications on Penalties] section below for more information.

# Float that penalizes new tokens based on their frequency in the generated text so far.
# Values > 0 encourage the model to use new tokens, while values < 0 encourage the model to
# repeat tokens. Must be -2 <= value <= 2. Setting to 0 (default) will disable this penalty.
frequency_penalty: float = 0.0,
# Float that penalizes new tokens based on whether they appear in the generated text so far.
# Values > 0 encourage the model to use new tokens, while values < 0 encourage the model to
# repeat tokens. Must be -2 <= value <= 2. Setting to 0 (default) will disable this penalty.
presence_penalty: float = 0.0,
# Float that penalizes new tokens based on whether they appear in the prompt and the generated
# text so far. Values > 1 encourage the model to use new tokens, while values < 1 encourage the
# model to repeat tokens. Must be 0 <= value <= 2. Setting to 1 (default) will disable this penalty.
repetition_penalty: float = 1.0,
# Guides inference to generate at least this number of tokens by penalizing the logits of the
# tokenizer's EOS token and `stop_token_ids` to -inf, until the output reaches the given length.
# Note that any of the `stop` strings can still be generated before reaching `min_new_tokens`,
# as it is difficult to infer the correct token ids from the given `stop` strings.
# Must be 0 <= value < max_new_tokens. Setting to 0 (default) will disable this penalty.
min_new_tokens: int = 0,
```
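
As an illustration, the penalty and length parameters go inside `sampling_params` like any other field. A minimal sketch, assuming a server on port 30000 as in the examples below:

```python
import requests

# A minimal sketch combining penalties with a minimum output length.
response = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "List a few fun facts about Paris:",
        "sampling_params": {
            "max_new_tokens": 64,
            "frequency_penalty": 0.5,  # discourage frequently repeated tokens
            "presence_penalty": 0.2,   # discourage tokens that already appeared
            "min_new_tokens": 16,      # keep generating for at least 16 tokens
        },
    },
)
print(response.json())
```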

## Examples

### Normal
Launch a server
```
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --port 30000
```

Send a request
```python
import requests

response = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "The capital of France is",
        "sampling_params": {
            "temperature": 0,
            "max_new_tokens": 32,
        },
    },
)
print(response.json())
```
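
Since `text` also accepts a list of prompts, a batch can be served with one request. A minimal sketch; the response should then contain one result per prompt:

```python
import requests

# A minimal sketch of a batch request; `text` accepts a list of prompts.
response = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": ["The capital of France is", "The capital of Japan is"],
        "sampling_params": {"temperature": 0, "max_new_tokens": 32},
    },
)
print(response.json())  # one result per prompt
```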

### Streaming
Send a request and stream the output
```python
import requests, json

response = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "The capital of France is",
        "sampling_params": {
            "temperature": 0,
            "max_new_tokens": 32,
        },
        "stream": True,
    },
    stream=True,
)

prev = 0
for chunk in response.iter_lines(decode_unicode=False):
    chunk = chunk.decode("utf-8")
    if chunk and chunk.startswith("data:"):
        if chunk == "data: [DONE]":
            break
        data = json.loads(chunk[5:].strip("\n"))
        # Each chunk carries the full output so far, so print only the new suffix.
        output = data["text"].strip()
        print(output[prev:], end="", flush=True)
        prev = len(output)
print("")
```

### Multi-modal

Launch a server
```
python3 -m sglang.launch_server --model-path lmms-lab/llava-onevision-qwen2-7b-ov --chat-template chatml-llava
```

Download an image
```
curl -o example_image.png -L https://github.com/sgl-project/sglang/blob/main/test/lang/example_image.png?raw=true
```

Send a request
```python
import requests

response = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n"
                "<|im_start|>user\n<image>\nDescribe this image in a very short sentence.<|im_end|>\n"
                "<|im_start|>assistant\n",
        "image_data": "example_image.png",
        "sampling_params": {
            "temperature": 0,
            "max_new_tokens": 32,
        },
    },
)
print(response.json())
```

The `image_data` can be a file name, a URL, or a base64 encoded string. See also `python/sglang/srt/utils.py:load_image`.
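
For example, to send the image as a base64 encoded string instead of a file name, you can encode the downloaded file yourself. A minimal sketch; passing the raw base64 payload is an assumption based on the accepted formats listed above:

```python
import base64
import requests

# Encode the image as a base64 string (assumed accepted format for `image_data`).
with open("example_image.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n"
                "<|im_start|>user\n<image>\nDescribe this image in a very short sentence.<|im_end|>\n"
                "<|im_start|>assistant\n",
        "image_data": image_b64,
        "sampling_params": {"temperature": 0, "max_new_tokens": 32},
    },
)
print(response.json())
```
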
Streaming is supported in a similar manner as [above](#streaming).

### Structured decoding (JSON, Regex)
You can specify a JSON schema or a regular expression to constrain the model output. The output is guaranteed to follow the given constraint.

```python
import json
import requests

json_schema = json.dumps(
    {
        "type": "object",
        "properties": {
            "name": {"type": "string", "pattern": "^[\\w]+$"},
            "population": {"type": "integer"},
        },
        "required": ["name", "population"],
    }
)

# JSON
response = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "Here is the information of the capital of France in the JSON format.\n",
        "sampling_params": {
            "temperature": 0,
            "max_new_tokens": 64,
            "json_schema": json_schema,
        },
    },
)
print(response.json())

# Regular expression
response = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "Paris is the capital of",
        "sampling_params": {
            "temperature": 0,
            "max_new_tokens": 64,
            "regex": "(France|England)",
        },
    },
)
print(response.json())
```
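
Since the JSON-constrained output is guaranteed to match the schema, it can be parsed directly. A minimal sketch, assuming the generated text is returned under a `"text"` key as in the streaming example above:

```python
# Re-issue the JSON-constrained request and parse its output directly.
# Assumes the generated text is returned under the "text" key.
response = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "Here is the information of the capital of France in the JSON format.\n",
        "sampling_params": {
            "temperature": 0,
            "max_new_tokens": 64,
            "json_schema": json_schema,
        },
    },
)
info = json.loads(response.json()["text"])
print(info["name"], info["population"])
```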