{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Quick Start: Sending Requests\n",
    "This notebook provides a quick-start guide to use SGLang in chat completions after installation.\n",
    "\n",
    "- For Vision Language Models, see [OpenAI APIs - Vision](../backend/openai_api_vision.ipynb).\n",
    "- For Embedding Models, see [OpenAI APIs - Embedding](../backend/openai_api_embeddings.ipynb) and [Encode (embedding model)](../backend/native_api.html#Encode-(embedding-model)).\n",
    "- For Reward Models, see [Classify (reward model)](../backend/native_api.html#Classify-(reward-model))."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Launch A Server\n",
    "\n",
    "This code block is equivalent to executing \n",
    "\n",
    "```bash\n",
    "python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \\\n",
    "--port 30000 --host 0.0.0.0\n",
    "```\n",
    "\n",
    "in your terminal and wait for the server to be ready. Once the server is running, you can send test requests using curl or requests. The server implements the [OpenAI-compatible APIs](https://platform.openai.com/docs/api-reference/chat)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sglang.utils import (\n",
    "    execute_shell_command,\n",
    "    wait_for_server,\n",
    "    terminate_process,\n",
    "    print_highlight,\n",
    ")\n",
    "\n",
    "server_process = execute_shell_command(\n",
    "    \"\"\"\n",
    "python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \\\n",
    "--port 30000 --host 0.0.0.0\n",
    "\"\"\"\n",
    ")\n",
    "\n",
    "wait_for_server(\"http://localhost:30000\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Using cURL\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import subprocess, json\n",
    "\n",
    "curl_command = \"\"\"\n",
    "curl -s http://localhost:30000/v1/chat/completions \\\n",
    "  -d '{\"model\": \"meta-llama/Meta-Llama-3.1-8B-Instruct\", \"messages\": [{\"role\": \"user\", \"content\": \"What is the capital of France?\"}]}'\n",
    "\"\"\"\n",
    "\n",
    "response = json.loads(subprocess.check_output(curl_command, shell=True))\n",
    "print_highlight(response)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Using Python Requests"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import requests\n",
    "\n",
    "url = \"http://localhost:30000/v1/chat/completions\"\n",
    "\n",
    "data = {\n",
    "    \"model\": \"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n",
    "    \"messages\": [{\"role\": \"user\", \"content\": \"What is the capital of France?\"}],\n",
    "}\n",
    "\n",
    "response = requests.post(url, json=data)\n",
    "print_highlight(response.json())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Using OpenAI Python Client"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import openai\n",
    "\n",
    "client = openai.Client(base_url=\"http://127.0.0.1:30000/v1\", api_key=\"None\")\n",
    "\n",
    "response = client.chat.completions.create(\n",
    "    model=\"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n",
    "    messages=[\n",
    "        {\"role\": \"user\", \"content\": \"List 3 countries and their capitals.\"},\n",
    "    ],\n",
    "    temperature=0,\n",
    "    max_tokens=64,\n",
    ")\n",
    "print_highlight(response)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Streaming"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import openai\n",
    "\n",
    "client = openai.Client(base_url=\"http://127.0.0.1:30000/v1\", api_key=\"None\")\n",
    "\n",
    "# Use stream=True for streaming responses\n",
    "response = client.chat.completions.create(\n",
    "    model=\"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n",
    "    messages=[\n",
    "        {\"role\": \"user\", \"content\": \"List 3 countries and their capitals.\"},\n",
    "    ],\n",
    "    temperature=0,\n",
    "    max_tokens=64,\n",
    "    stream=True,\n",
    ")\n",
    "\n",
    "# Handle the streaming output\n",
    "for chunk in response:\n",
    "    if chunk.choices[0].delta.content:\n",
    "        print(chunk.choices[0].delta.content, end=\"\", flush=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Using Native Generation APIs\n",
    "\n",
    "You can also use the native `/generate` endpoint with requests, which provides more flexiblity. An API reference is available at [Sampling Parameters](../references/sampling_params.md)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import requests\n",
    "\n",
    "response = requests.post(\n",
    "    \"http://localhost:30000/generate\",\n",
    "    json={\n",
    "        \"text\": \"The capital of France is\",\n",
    "        \"sampling_params\": {\n",
    "            \"temperature\": 0,\n",
    "            \"max_new_tokens\": 32,\n",
    "        },\n",
    "    },\n",
    ")\n",
    "\n",
    "print_highlight(response.json())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Streaming"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import requests, json\n",
    "\n",
    "response = requests.post(\n",
    "    \"http://localhost:30000/generate\",\n",
    "    json={\n",
    "        \"text\": \"The capital of France is\",\n",
    "        \"sampling_params\": {\n",
    "            \"temperature\": 0,\n",
    "            \"max_new_tokens\": 32,\n",
    "        },\n",
    "        \"stream\": True,\n",
    "    },\n",
    "    stream=True,\n",
    ")\n",
    "\n",
    "prev = 0\n",
    "for chunk in response.iter_lines(decode_unicode=False):\n",
    "    chunk = chunk.decode(\"utf-8\")\n",
    "    if chunk and chunk.startswith(\"data:\"):\n",
    "        if chunk == \"data: [DONE]\":\n",
    "            break\n",
    "        data = json.loads(chunk[5:].strip(\"\\n\"))\n",
    "        output = data[\"text\"]\n",
    "        print(output[prev:], end=\"\", flush=True)\n",
    "        prev = len(output)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "terminate_process(server_process)"
   ]
  }
 ],
 "metadata": {
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}