{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Quick Start: Sending Requests\n",
    "This notebook provides a quick-start guide to sending chat completion requests to SGLang after installation.\n",
    "\n",
    "- For Vision Language Models, see [OpenAI APIs - Vision](../backend/openai_api_vision.ipynb).\n",
    "- For Embedding Models, see [OpenAI APIs - Embedding](../backend/openai_api_embeddings.ipynb) and [Encode (embedding model)](../backend/native_api.html#Encode-(embedding-model)).\n",
    "- For Reward Models, see [Classify (reward model)](../backend/native_api.html#Classify-(reward-model))."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Launch A Server\n",
    "\n",
    "This code block is equivalent to executing \n",
    "\n",
    "```bash\n",
    "python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \\\n",
    " --host 0.0.0.0\n",
    "```\n",
    "\n",
    "in your terminal and waiting for the server to be ready. Once the server is running, you can send test requests using cURL or Python `requests`. The server implements the [OpenAI-compatible APIs](https://platform.openai.com/docs/api-reference/chat)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sglang.test.test_utils import is_in_ci\n",
    "from sglang.utils import wait_for_server, print_highlight, terminate_process\n",
    "\n",
    "if is_in_ci():\n",
    "    from patch import launch_server_cmd\n",
    "else:\n",
    "    from sglang.utils import launch_server_cmd\n",
    "\n",
    "\n",
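    "# launch_server_cmd starts the server as a subprocess and returns the process handle and the port it listens on\n",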
    "server_process, port = launch_server_cmd(\n",
    "    \"\"\"\n",
    "python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \\\n",
    " --host 0.0.0.0\n",
    "\"\"\"\n",
    ")\n",
    "\n",
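    "# Block until the server is ready to accept requests\n",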
    "wait_for_server(f\"http://localhost:{port}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Using cURL\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import subprocess, json\n",
    "\n",
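    "# Build an OpenAI-compatible chat completion request as a shell command\n",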
    "curl_command = f\"\"\"\n",
    "curl -s http://localhost:{port}/v1/chat/completions \\\n",
    "  -H \"Content-Type: application/json\" \\\n",
    "  -d '{{\"model\": \"meta-llama/Meta-Llama-3.1-8B-Instruct\", \"messages\": [{{\"role\": \"user\", \"content\": \"What is the capital of France?\"}}]}}'\n",
    "\"\"\"\n",
    "\n",
    "response = json.loads(subprocess.check_output(curl_command, shell=True))\n",
    "print_highlight(response)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Using Python Requests"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import requests\n",
    "\n",
    "url = f\"http://localhost:{port}/v1/chat/completions\"\n",
    "\n",
    "data = {\n",
    "    \"model\": \"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n",
    "    \"messages\": [{\"role\": \"user\", \"content\": \"What is the capital of France?\"}],\n",
    "}\n",
    "\n",
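    "# POST the OpenAI-style request body and print the parsed JSON response\n",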
    "response = requests.post(url, json=data)\n",
    "print_highlight(response.json())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Using OpenAI Python Client"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import openai\n",
    "\n",
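    "# The server is OpenAI-compatible, so the official client works with a local base_url; the api_key is a placeholder\n",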
    "client = openai.Client(base_url=f\"http://127.0.0.1:{port}/v1\", api_key=\"None\")\n",
    "\n",
    "response = client.chat.completions.create(\n",
    "    model=\"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n",
    "    messages=[\n",
    "        {\"role\": \"user\", \"content\": \"List 3 countries and their capitals.\"},\n",
    "    ],\n",
    "    temperature=0,\n",
    "    max_tokens=64,\n",
    ")\n",
    "print_highlight(response)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Streaming"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import openai\n",
    "\n",
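    "# Same client as above; stream=True below yields incremental chunks instead of a single response\n",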
    "client = openai.Client(base_url=f\"http://127.0.0.1:{port}/v1\", api_key=\"None\")\n",
    "\n",
    "# Use stream=True for streaming responses\n",
    "response = client.chat.completions.create(\n",
    "    model=\"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n",
    "    messages=[\n",
    "        {\"role\": \"user\", \"content\": \"List 3 countries and their capitals.\"},\n",
    "    ],\n",
    "    temperature=0,\n",
    "    max_tokens=64,\n",
    "    stream=True,\n",
    ")\n",
    "\n",
    "# Handle the streaming output\n",
    "for chunk in response:\n",
    "    if chunk.choices[0].delta.content:\n",
    "        print(chunk.choices[0].delta.content, end=\"\", flush=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Using Native Generation APIs\n",
    "\n",
    "You can also use the native `/generate` endpoint with requests, which provides more flexibility. An API reference is available at [Sampling Parameters](../references/sampling_params.md)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import requests\n",
    "\n",
    "response = requests.post(\n",
    "    f\"http://localhost:{port}/generate\",\n",
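    "    # The native endpoint takes a raw prompt plus sampling_params such as temperature and max_new_tokens\n",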
    "    json={\n",
    "        \"text\": \"The capital of France is\",\n",
    "        \"sampling_params\": {\n",
    "            \"temperature\": 0,\n",
    "            \"max_new_tokens\": 32,\n",
    "        },\n",
    "    },\n",
    ")\n",
    "\n",
    "print_highlight(response.json())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Streaming"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import requests, json\n",
    "\n",
    "response = requests.post(\n",
    "    f\"http://localhost:{port}/generate\",\n",
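    "    # \"stream\": True in the payload makes the server stream results; stream=True on requests.post lets us read the response line by line\n",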
    "    json={\n",
    "        \"text\": \"The capital of France is\",\n",
    "        \"sampling_params\": {\n",
    "            \"temperature\": 0,\n",
    "            \"max_new_tokens\": 32,\n",
    "        },\n",
    "        \"stream\": True,\n",
    "    },\n",
    "    stream=True,\n",
    ")\n",
    "\n",
    "prev = 0\n",
    "for chunk in response.iter_lines(decode_unicode=False):\n",
    "    chunk = chunk.decode(\"utf-8\")\n",
    "    if chunk and chunk.startswith(\"data:\"):\n",
    "        if chunk == \"data: [DONE]\":\n",
    "            break\n",
    "        data = json.loads(chunk[5:].strip(\"\\n\"))\n",
    "        output = data[\"text\"]\n",
    "        print(output[prev:], end=\"\", flush=True)\n",
    "        prev = len(output)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
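    "# Shut down the server process launched at the beginning of this notebook\n",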
    "terminate_process(server_process)"
   ]
  }
 ],
 "metadata": {
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}