# Sampling Parameters in SGLang Runtime
This doc describes the sampling parameters of the SGLang Runtime.

The `/generate` endpoint accepts a JSON body with the following fields.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Union


@dataclass
class GenerateReqInput:
    # The input prompt. It can be a single prompt or a batch of prompts.
    text: Union[List[str], str]
    # The token ids for text; one can either specify text or input_ids.
    input_ids: Optional[Union[List[List[int]], List[int]]] = None
    # The image input. It can be a file name, a URL, or a base64-encoded string.
    # See also python/sglang/srt/utils.py:load_image.
    image_data: Optional[Union[List[str], str]] = None
    # The sampling_params.
    sampling_params: Union[List[Dict], Dict] = None
    # The request id.
    rid: Optional[Union[List[str], str]] = None
    # Whether to return logprobs.
    return_logprob: Optional[Union[List[bool], bool]] = None
    # The start location of the prompt for return_logprob.
    logprob_start_len: Optional[Union[List[int], int]] = None
    # The number of top logprobs to return.
    top_logprobs_num: Optional[Union[List[int], int]] = None
    # Whether to detokenize tokens in logprobs.
    return_text_in_logprobs: bool = False
    # Whether to stream output.
    stream: bool = False
```
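
As a rough illustration of how these fields combine, the sketch below assembles a batched request body. The helper name `build_generate_payload` is hypothetical and not part of SGLang. It also assumes that list-valued fields are matched to prompts by position, which the `Union[List[...], ...]` types suggest but this doc does not state.

```python
import json
from typing import Any, Dict, List, Union


def build_generate_payload(
    text: Union[str, List[str]],
    sampling_params: Union[Dict[str, Any], List[Dict[str, Any]], None] = None,
    rid: Union[str, List[str], None] = None,
    return_logprob: bool = False,
    stream: bool = False,
) -> Dict[str, Any]:
    """Hypothetical helper: assemble a JSON body for POST /generate."""
    payload: Dict[str, Any] = {"text": text, "stream": stream}
    # Only include optional fields that were set, so server defaults apply.
    if sampling_params is not None:
        payload["sampling_params"] = sampling_params
    if rid is not None:
        payload["rid"] = rid
    if return_logprob:
        payload["return_logprob"] = True
    # Assumption: per-prompt lists must have one entry per prompt in the batch.
    if isinstance(text, list):
        for name in ("sampling_params", "rid"):
            value = payload.get(name)
            if isinstance(value, list) and len(value) != len(text):
                raise ValueError(
                    f"{name} has {len(value)} entries for {len(text)} prompts"
                )
    return payload


payload = build_generate_payload(
    text=["The capital of France is", "The capital of Japan is"],
    sampling_params={"temperature": 0, "max_new_tokens": 16},
    rid=["req-0", "req-1"],
    return_logprob=True,
)
print(json.dumps(payload, indent=2))
```

The resulting dict can be passed as `json=payload` to `requests.post`, in the same way as the request bodies in the Examples section.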

The `sampling_params` field follows this format:

```python
class SamplingParams:
    def __init__(
        self,
        max_new_tokens: int = 16,
        stop: Optional[Union[str, List[str]]] = None,
        temperature: float = 1.0,
        top_p: float = 1.0,
        top_k: int = -1,
        frequency_penalty: float = 0.0,
        presence_penalty: float = 0.0,
        ignore_eos: bool = False,
        skip_special_tokens: bool = True,
        dtype: Optional[str] = None,
        regex: Optional[str] = None,
    ) -> None:
        ...  # Body omitted.
```

- `max_new_tokens`, `stop`, `temperature`, `top_p`, `top_k` are common sampling parameters.
- `ignore_eos` makes the runtime ignore the EOS token and continue decoding, which is useful for benchmarking.
- `regex` constrains the output to follow a given regular expression.
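
The following sketch shows one way a client might build `sampling_params` dicts for the two less common options above. The helper `make_sampling_params` is hypothetical; it only checks key names against the `SamplingParams` signature and does not call SGLang. The local `re.compile` check assumes the pattern is also valid Python regex syntax, which may not match the server's regex dialect exactly.

```python
import re
from typing import Any, Dict

# Parameter names copied from the SamplingParams signature above.
KNOWN_PARAMS = frozenset({
    "max_new_tokens", "stop", "temperature", "top_p", "top_k",
    "frequency_penalty", "presence_penalty", "ignore_eos",
    "skip_special_tokens", "dtype", "regex",
})


def make_sampling_params(**overrides: Any) -> Dict[str, Any]:
    """Hypothetical client-side helper: catch misspelled keys before sending."""
    unknown = set(overrides) - KNOWN_PARAMS
    if unknown:
        raise KeyError(f"unknown sampling parameter(s): {sorted(unknown)}")
    if overrides.get("regex") is not None:
        re.compile(overrides["regex"])  # Fail early on a malformed pattern.
    return dict(overrides)


# Benchmarking: always generate exactly max_new_tokens tokens.
benchmark_params = make_sampling_params(max_new_tokens=128, ignore_eos=True)

# Constrained decoding: the output must match one of three city names.
city_params = make_sampling_params(temperature=0, regex=r"(Paris|London|Berlin)")

print(benchmark_params)
print(city_params)
```

Either dict can be used as the `sampling_params` value of a `/generate` request.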

## Examples

### Normal
Launch a server
```
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --port 30000
```

Send a request
```python
import requests

response = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "The capital of France is",
        "sampling_params": {
            "temperature": 0,
            "max_new_tokens": 32,
        },
    },
)
print(response.json())
```

### Streaming
Send a request and stream the output
```python
import requests, json

response = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "The capital of France is",
        "sampling_params": {
            "temperature": 0,
            "max_new_tokens": 256,
        },
        "stream": True,
    },
    stream=True,
)

prev = 0
for chunk in response.iter_lines(decode_unicode=False):
    chunk = chunk.decode("utf-8")
    if chunk and chunk.startswith("data:"):
        if chunk == "data: [DONE]":
            break
        data = json.loads(chunk[5:].strip("\n"))
        output = data["text"].strip()
        print(output[prev:], end="", flush=True)
        prev = len(output)
print("")
```
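
The parsing loop above can also be factored into a generator, which makes it testable without a running server. This is a sketch under one assumption taken from the `prev` bookkeeping in that loop: each `data:` event carries the full text generated so far, not only the new tokens. The name `iter_text_deltas` and the sample events are illustrative.

```python
import json
from typing import Iterable, Iterator


def iter_text_deltas(lines: Iterable[bytes]) -> Iterator[str]:
    """Yield only the newly generated text from each server-sent event line."""
    prev = 0
    for raw in lines:
        line = raw.decode("utf-8")
        if not line.startswith("data:"):
            continue  # Skip blank keep-alive lines between events.
        body = line[len("data:"):].strip()
        if body == "[DONE]":
            break
        text = json.loads(body)["text"]
        yield text[prev:]
        prev = len(text)


# Hand-written sample events standing in for response.iter_lines().
fake_stream = [
    b'data: {"text": " Paris"}',
    b"",
    b'data: {"text": " Paris, a city"}',
    b"",
    b"data: [DONE]",
]
deltas = list(iter_text_deltas(fake_stream))
print("".join(deltas))
```

With a live server, `iter_text_deltas(response.iter_lines(decode_unicode=False))` would take the place of `fake_stream`.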

### Multi modal

Launch a server
```
python3 -m sglang.launch_server --model-path liuhaotian/llava-v1.6-vicuna-7b --tokenizer-path llava-hf/llava-1.5-7b-hf --chat-template vicuna_v1.1 --port 30000
```

Download an image
```
curl -o example_image.png -L "https://github.com/sgl-project/sglang/blob/main/test/lang/example_image.png?raw=true"
```

```python
import requests

response = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions. USER: <image>\nDescribe this picture ASSISTANT:",
        "image_data": "example_image.png",
        "sampling_params": {
            "temperature": 0,
            "max_new_tokens": 32,
        },
    },
)
print(response.json())
```

The `image_data` can be a file name, a URL, or a base64 encoded string. See also `python/sglang/srt/utils.py:load_image`.
Streaming is supported in the same way as shown [above](#streaming).
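
When the image is not reachable by file name or URL, the base64 form can be sent instead. The sketch below is a minimal illustration: the helper `image_payload` is hypothetical, and it assumes the server accepts plain base64 text without a `data:` URI prefix. Check `load_image` for the exact format expected.

```python
import base64
import tempfile
from pathlib import Path
from typing import Any, Dict


def image_payload(prompt: str, image_path: str) -> Dict[str, Any]:
    """Hypothetical helper: embed a local image as base64 in a /generate body."""
    encoded = base64.b64encode(Path(image_path).read_bytes()).decode("ascii")
    return {
        "text": prompt,
        "image_data": encoded,
        "sampling_params": {"temperature": 0, "max_new_tokens": 32},
    }


with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "example_image.png"
    # Placeholder bytes (the PNG signature only), not a real image.
    sample.write_bytes(b"\x89PNG\r\n\x1a\n")
    payload = image_payload(
        "USER: <image>\nDescribe this picture ASSISTANT:", str(sample)
    )

print(payload["image_data"])
```

In practice `image_path` would point at a real file such as the `example_image.png` downloaded above.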