# Sampling Parameters in SGLang Runtime
This doc describes the sampling parameters of the SGLang Runtime.
These sampling parameters are used by the runtime's low-level `/generate` endpoint.
If you want a high-level endpoint that can automatically handle chat templates, consider using the [OpenAI Compatible API](https://github.com/sgl-project/sglang?tab=readme-ov-file#openai-compatible-api).

The `/generate` endpoint accepts the following arguments in JSON format.

```python
@dataclass
class GenerateReqInput:
    # The input prompt. It can be a single prompt or a batch of prompts.
    text: Optional[Union[List[str], str]] = None
    # The token ids for text; one can either specify text or input_ids.
    input_ids: Optional[Union[List[List[int]], List[int]]] = None
    # The image input. It can be a file name, a URL, or a base64-encoded string.
    # See also python/sglang/srt/utils.py:load_image.
    image_data: Optional[Union[List[str], str]] = None
    # The sampling_params. See descriptions below.
    sampling_params: Union[List[Dict], Dict] = None
    # The request id.
    rid: Optional[Union[List[str], str]] = None
    # Whether to return logprobs.
    return_logprob: Optional[Union[List[bool], bool]] = None
    # The start location of the prompt for return_logprob.
    # By default, this value is -1, which means logprobs are only returned for output tokens.
    logprob_start_len: Optional[Union[List[int], int]] = None
    # The number of top logprobs to return.
    top_logprobs_num: Optional[Union[List[int], int]] = None
    # Whether to detokenize tokens in text in the returned logprobs.
    return_text_in_logprobs: bool = False
    # Whether to stream output.
    stream: bool = False
```
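
Since every field accepts either a single value or a list, one request can carry a batch of prompts and request logprobs at the same time. Below is a minimal sketch, assuming a server is already running on port 30000 as in the examples later in this doc; the second prompt and the logprob settings are illustrative, and the response for a batch request is expected to be a list with one entry per prompt.

```python
import requests

# A minimal sketch: `text` is a list, so the server treats this as a batch
# of two prompts. `return_logprob` and `top_logprobs_num` additionally ask
# for the top-2 logprobs of every output token.
response = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": ["The capital of France is", "The capital of Japan is"],
        "sampling_params": {"temperature": 0, "max_new_tokens": 8},
        "return_logprob": True,
        "top_logprobs_num": 2,
    },
)
# For a batch request, the response is a list with one entry per prompt.
for item in response.json():
    print(item["text"])
```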

The `sampling_params` follows this format:

```python
# The maximum number of output tokens
max_new_tokens: int = 128,
# Stop when hitting any of the strings in this list.
stop: Optional[Union[str, List[str]]] = None,
# Stop when hitting any of the token_ids in this list. Could be useful when mixed with
# `min_new_tokens`.
stop_token_ids: Optional[List[int]] = [],
# Sampling temperature
temperature: float = 1.0,
# Top-p sampling
top_p: float = 1.0,
# Top-k sampling
top_k: int = -1,
# Min-p sampling
min_p: float = 0.0,
# Whether to ignore EOS token.
ignore_eos: bool = False,
# Whether to skip the special tokens during detokenization.
skip_special_tokens: bool = True,
# Whether to add spaces between special tokens during detokenization.
spaces_between_special_tokens: bool = True,
# Constrains the output to follow a given regular expression.
regex: Optional[str] = None,
# Do parallel sampling and return `n` outputs.
n: int = 1,
# Constrains the output to follow a given JSON schema.
# `regex` and `json_schema` cannot be set at the same time.
json_schema: Optional[str] = None,

## Penalties. See the [Performance Implications on Penalties] section below for more information.

# Float that penalizes new tokens based on their frequency in the generated text so far.
# Values > 0 encourage the model to use new tokens, while values < 0 encourage the model to
# repeat tokens. Must be -2 <= value <= 2. Setting to 0 (default) will disable this penalty.
frequency_penalty: float = 0.0,
# Float that penalizes new tokens based on whether they appear in the generated text so far.
# Values > 0 encourage the model to use new tokens, while values < 0 encourage the model to repeat
# tokens. Must be -2 <= value <= 2. Setting to 0 (default) will disable this penalty.
presence_penalty: float = 0.0,
# Float that penalizes new tokens based on whether they appear in the prompt and the generated text
# so far. Values > 1 encourage the model to use new tokens, while values < 1 encourage the model to
# repeat tokens. Must be 0 <= value <= 2. Setting to 1 (default) will disable this penalty.
repetition_penalty: float = 1.0,
# Guides inference to generate at least this number of tokens by penalizing the logits of the
# tokenizer's EOS token and the `stop_token_ids` to -inf, until the output reaches the given length.
# Note that any of the `stop` strings can still be generated before reaching `min_new_tokens`,
# as it is difficult to infer the correct token IDs from the given `stop` strings.
# Must be 0 <= value < max_new_tokens. Setting to 0 (default) will disable this penalty.
min_new_tokens: int = 0,
```
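
As a quick illustration of how these fields compose, here is a sketch of a `sampling_params` dict that combines the penalties with `min_new_tokens`. The values are illustrative, not tuned recommendations; reasonable settings depend on the model and task.

```python
# Illustrative values only.
sampling_params = {
    "temperature": 0.7,
    "max_new_tokens": 256,
    # Penalize tokens proportionally to how often they have already appeared,
    # discouraging verbatim repetition (-2 <= value <= 2).
    "frequency_penalty": 0.4,
    # Penalize any token that has appeared at all (-2 <= value <= 2).
    "presence_penalty": 0.2,
    # Mask EOS and `stop_token_ids` until at least 32 tokens are generated;
    # a `stop` string can still end the output earlier.
    "min_new_tokens": 32,
}
```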

## Examples

### Normal
Launch a server
```
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --port 30000
```

Send a request
```python
import requests

response = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "The capital of France is",
        "sampling_params": {
            "temperature": 0,
            "max_new_tokens": 32,
        },
    },
)
print(response.json())
```

### Streaming
Send a request and stream the output
```python
import requests, json

response = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "The capital of France is",
        "sampling_params": {
            "temperature": 0,
            "max_new_tokens": 32,
        },
        "stream": True,
    },
    stream=True,
)

prev = 0
for chunk in response.iter_lines(decode_unicode=False):
    chunk = chunk.decode("utf-8")
    if chunk and chunk.startswith("data:"):
        if chunk == "data: [DONE]":
            break
        data = json.loads(chunk[5:].strip("\n"))
        output = data["text"].strip()
        print(output[prev:], end="", flush=True)
        prev = len(output)
print("")
```

### Multimodal

Launch a server
```
python3 -m sglang.launch_server --model-path lmms-lab/llava-onevision-qwen2-7b-ov --chat-template chatml-llava
```

Download an image
```
curl -o example_image.png -L https://github.com/sgl-project/sglang/blob/main/test/lang/example_image.png?raw=true
```

Send a request
```python
import requests

response = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n"
                "<|im_start|>user\n<image>\nDescribe this image in a very short sentence.<|im_end|>\n"
                "<|im_start|>assistant\n",
        "image_data": "example_image.png",
        "sampling_params": {
            "temperature": 0,
            "max_new_tokens": 32,
        },
    },
)
print(response.json())
```

The `image_data` can be a file name, a URL, or a base64 encoded string. See also `python/sglang/srt/utils.py:load_image`.
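
For instance, to pass a base64-encoded string instead of a file name, you can encode the image yourself. This is a sketch assuming the server accepts the raw base64 payload as described above, and that `example_image.png` was downloaded as in the previous step:

```python
import base64
import requests

# Read the downloaded image and encode it as a base64 string.
with open("example_image.png", "rb") as f:
    encoded_image = base64.b64encode(f.read()).decode("utf-8")

response = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n"
                "<|im_start|>user\n<image>\nDescribe this image in a very short sentence.<|im_end|>\n"
                "<|im_start|>assistant\n",
        # Raw base64 string in place of a file name or URL.
        "image_data": encoded_image,
        "sampling_params": {"temperature": 0, "max_new_tokens": 32},
    },
)
print(response.json())
```
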
Streaming is supported in a similar manner as [above](#streaming).

### Structured decoding (JSON, Regex)
You can specify a JSON schema or a regular expression to constrain the model output; the output is then guaranteed to follow the given constraint.

```python
import json
import requests

json_schema = json.dumps(
    {
        "type": "object",
        "properties": {
            "name": {"type": "string", "pattern": "^[\\w]+$"},
            "population": {"type": "integer"},
        },
        "required": ["name", "population"],
    }
)

# JSON
response = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "Here is the information of the capital of France in the JSON format.\n",
        "sampling_params": {
            "temperature": 0,
            "max_new_tokens": 64,
            "json_schema": json_schema,
        },
    },
)
print(response.json())

# Regular expression
response = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "Paris is the capital of",
        "sampling_params": {
            "temperature": 0,
            "max_new_tokens": 64,
            "regex": "(France|England)",
        },
    },
)
print(response.json())
```