# Sampling Parameters in SGLang Runtime
This doc describes the sampling parameters of the SGLang Runtime.
It is the low-level endpoint of the runtime.
If you want a high-level endpoint that can automatically handle chat templates, consider using the [OpenAI Compatible API](../backend/openai_api_completions.ipynb).

The `/generate` endpoint accepts the following arguments in JSON format.

```python
@dataclass
class GenerateReqInput:
    # The input prompt. It can be a single prompt or a batch of prompts.
    text: Optional[Union[List[str], str]] = None
    # The token ids for text; one can specify either text or input_ids.
    input_ids: Optional[Union[List[List[int]], List[int]]] = None
    # The embeddings for input_ids; one can specify either text, input_ids, or input_embeds.
    input_embeds: Optional[Union[List[List[List[float]]], List[List[float]]]] = None
    # The image input. It can be a file name, a URL, or a base64 encoded string.
    # See also python/sglang/srt/utils.py:load_image.
    image_data: Optional[Union[List[str], str]] = None
    # The sampling_params. See descriptions below.
    sampling_params: Optional[Union[List[Dict], Dict]] = None
    # The request id.
    rid: Optional[Union[List[str], str]] = None
    # Whether to return logprobs.
    return_logprob: Optional[Union[List[bool], bool]] = None
    # If return_logprob, the start location in the prompt for returning logprobs.
    # By default, this value is "-1", which means it will only return logprobs for output tokens.
    logprob_start_len: Optional[Union[List[int], int]] = None
    # If return_logprob, the number of top logprobs to return at each position.
    top_logprobs_num: Optional[Union[List[int], int]] = None
    # Whether to detokenize tokens in text in the returned logprobs.
    return_text_in_logprobs: bool = False
    # Whether to stream output.
    stream: bool = False
```
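
For illustration, here is a minimal sketch of a request that also asks for logprobs, using the logprob-related fields from the dataclass above (it assumes a server is already running on `localhost:30000`, as in the examples below):

```python
import requests

response = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "The capital of France is",
        "sampling_params": {"temperature": 0, "max_new_tokens": 8},
        # Logprob-related fields from GenerateReqInput above.
        "return_logprob": True,
        "top_logprobs_num": 2,  # return the top-2 logprobs at each position
        "return_text_in_logprobs": True,  # also detokenize the tokens in the result
    },
)
print(response.json())
```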

The `sampling_params` follows this format

```python
# The maximum number of output tokens
max_new_tokens: int = 128,
# Stop when hitting any of the strings in this list.
stop: Optional[Union[str, List[str]]] = None,
# Stop when hitting any of the token_ids in this list. Could be useful when mixed with
# `min_new_tokens`.
stop_token_ids: Optional[List[int]] = [],
# Sampling temperature
temperature: float = 1.0,
# Top-p sampling
top_p: float = 1.0,
# Top-k sampling
top_k: int = -1,
# Min-p sampling
min_p: float = 0.0,
# Whether to ignore EOS token.
ignore_eos: bool = False,
# Whether to skip the special tokens during detokenization.
skip_special_tokens: bool = True,
# Whether to add spaces between special tokens during detokenization.
spaces_between_special_tokens: bool = True,
# Constrains the output to follow a given regular expression.
regex: Optional[str] = None,
# Do parallel sampling and return `n` outputs.
n: int = 1,
# Constrains the output to follow a given JSON schema.
# `regex` and `json_schema` cannot be set at the same time.
json_schema: Optional[str] = None,

## Penalties. See the [Performance Implications on Penalties] section below for more information.

# Float that penalizes new tokens based on their frequency in the generated text so far.
# Values > 0 encourage the model to use new tokens, while values < 0 encourage the model to
# repeat tokens. Must be -2 <= value <= 2. Setting to 0 (default) will disable this penalty.
frequency_penalty: float = 0.0,
# Float that penalizes new tokens based on whether they appear in the generated text so far.
# Values > 0 encourage the model to use new tokens, while values < 0 encourage the model to
# repeat tokens. Must be -2 <= value <= 2. Setting to 0 (default) will disable this penalty.
presence_penalty: float = 0.0,
# Float that penalizes new tokens based on whether they appear in the prompt and the generated
# text so far. Values > 1 encourage the model to use new tokens, while values < 1 encourage the
# model to repeat tokens. Must be 0 <= value <= 2. Setting to 1 (default) will disable this penalty.
repetition_penalty: float = 1.0,
# Guides inference to generate at least this number of tokens by penalizing the logits of the
# tokenizer's EOS token and `stop_token_ids` to -inf, until the output reaches the given length.
# Note that any of the `stop` strings can still be generated before reaching `min_new_tokens`,
# as it is difficult to infer the correct token IDs from the given `stop` strings.
# Must be 0 <= value < max_new_tokens. Setting to 0 (default) will disable this penalty.
min_new_tokens: int = 0,
```
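
As an example of the penalty parameters, the sketch below discourages token repetition and suppresses the EOS token for the first few positions (hypothetical prompt and values; it assumes a server running on `localhost:30000` as in the examples below):

```python
import requests

response = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "List some fruits:",
        "sampling_params": {
            "max_new_tokens": 64,
            "frequency_penalty": 0.5,  # penalize tokens proportionally to their frequency so far
            "min_new_tokens": 8,       # penalize EOS and stop_token_ids until 8 tokens are generated
        },
    },
)
print(response.json())
```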

## Examples

### Normal
Launch a server
```
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --port 30000
```

Send a request
```python
import requests

response = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "The capital of France is",
        "sampling_params": {
            "temperature": 0,
            "max_new_tokens": 32,
        },
    },
)
print(response.json())
```
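
Since `text` can also be a batch of prompts (see `GenerateReqInput` above), you can send several prompts in a single request; a sketch under the same server assumptions:

```python
import requests

response = requests.post(
    "http://localhost:30000/generate",
    json={
        # A batch of prompts; the response should contain one result per prompt.
        "text": ["The capital of France is", "The capital of Japan is"],
        "sampling_params": {
            "temperature": 0,
            "max_new_tokens": 32,
        },
    },
)
print(response.json())
```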

### Streaming
Send a request and stream the output
```python
import requests, json

response = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "The capital of France is",
        "sampling_params": {
            "temperature": 0,
            "max_new_tokens": 32,
        },
        "stream": True,
    },
    stream=True,
)

prev = 0
for chunk in response.iter_lines(decode_unicode=False):
    chunk = chunk.decode("utf-8")
    if chunk and chunk.startswith("data:"):
        if chunk == "data: [DONE]":
            break
        data = json.loads(chunk[5:].strip("\n"))
        output = data["text"].strip()
        print(output[prev:], end="", flush=True)
        prev = len(output)
print("")
```

### Multi-modal

Launch a server
```
python3 -m sglang.launch_server --model-path lmms-lab/llava-onevision-qwen2-7b-ov --chat-template chatml-llava
```

Download an image
```
curl -o example_image.png -L https://github.com/sgl-project/sglang/blob/main/test/lang/example_image.png?raw=true
```

Send a request
```python
import requests

response = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n"
                "<|im_start|>user\n<image>\nDescribe this image in a very short sentence.<|im_end|>\n"
                "<|im_start|>assistant\n",
        "image_data": "example_image.png",
        "sampling_params": {
            "temperature": 0,
            "max_new_tokens": 32,
        },
    },
)
print(response.json())
```

The `image_data` can be a file name, a URL, or a base64 encoded string. See also `python/sglang/srt/utils.py:load_image`.
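
For instance, to send the image as a base64 encoded string instead of a file name, you could encode the downloaded file first (a sketch; assumes `example_image.png` exists locally):

```python
import base64

# Read the local image and encode it as base64 so it can be sent inline in the JSON payload.
with open("example_image.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")
# Pass `image_b64` as the "image_data" field of the request shown above.
```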
Streaming is supported in a similar manner to [above](#streaming).

### Structured decoding (JSON, Regex)
You can specify a JSON schema or a regular expression to constrain the model output. The model output is guaranteed to follow the given constraints.

```python
import json
import requests

json_schema = json.dumps(
    {
        "type": "object",
        "properties": {
            "name": {"type": "string", "pattern": "^[\\w]+$"},
            "population": {"type": "integer"},
        },
        "required": ["name", "population"],
    }
)

# JSON
response = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "Here is the information of the capital of France in the JSON format.\n",
        "sampling_params": {
            "temperature": 0,
            "max_new_tokens": 64,
            "json_schema": json_schema,
        },
    },
)
print(response.json())

# Regular expression
response = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "Paris is the capital of",
        "sampling_params": {
            "temperature": 0,
            "max_new_tokens": 64,
            "regex": "(France|England)",
        },
    },
)
print(response.json())
```