{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Offline Engine API\n",
    "\n",
    "SGLang provides a direct inference engine without the need for an HTTP server, especially for use cases where additional HTTP server adds unnecessary complexity or overhead. Here are two general use cases:\n",
    "\n",
    "- Offline Batch Inference\n",
    "- Custom Server on Top of the Engine\n",
    "\n",
    "This document focuses on the offline batch inference, demonstrating four different inference modes:\n",
    "\n",
    "- Non-streaming synchronous generation\n",
    "- Streaming synchronous generation\n",
    "- Non-streaming asynchronous generation\n",
    "- Streaming asynchronous generation\n",
    "\n",
    "Additionally, you can easily build a custom server on top of the SGLang offline engine. A detailed example working in a python script can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).\n",
    "\n"
   ]
  },
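  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The snippet below is a minimal sketch of the custom-server pattern, not the official `custom_server.py`. It assumes FastAPI and uvicorn are installed and reuses the `async_generate` API demonstrated later in this notebook; consult the linked script for a complete version.\n",
    "```python\n",
    "# Minimal sketch: expose the offline engine over HTTP with FastAPI.\n",
    "# Assumptions: fastapi/uvicorn are installed; async_generate with a single\n",
    "# prompt returns one output dict containing a \"text\" field.\n",
    "import sglang as sgl\n",
    "from fastapi import FastAPI\n",
    "\n",
    "app = FastAPI()\n",
    "llm = sgl.Engine(model_path=\"qwen/qwen2.5-0.5b-instruct\")\n",
    "\n",
    "\n",
    "@app.post(\"/generate\")\n",
    "async def generate(request: dict):\n",
    "    output = await llm.async_generate(\n",
    "        request[\"prompt\"], request.get(\"sampling_params\", {})\n",
    "    )\n",
    "    return {\"text\": output[\"text\"]}\n",
    "\n",
    "\n",
    "# Run with, e.g.: uvicorn my_server:app --port 8000\n",
    "# (assuming this sketch is saved as my_server.py)\n",
    "```"
   ]
  },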
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Nest Asyncio\n",
    "Note that if you want to use **Offline Engine** in ipython or some other nested loop code, you need to add the following code:\n",
    "```python\n",
    "import nest_asyncio\n",
    "\n",
    "nest_asyncio.apply()\n",
    "\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Advanced Usage\n",
    "\n",
    "The engine supports [vlm inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states). \n",
    "\n",
    "Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases."
   ]
  },
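  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Below is a hedged sketch of VLM inference with the offline engine. The model name, the image placeholder tokens (taken from Qwen2-VL's chat template), the placeholder image path, and the `image_data` argument mirror the linked example but are assumptions here; treat this as a sketch rather than a drop-in script.\n",
    "```python\n",
    "# Hedged sketch: VLM inference with the offline engine.\n",
    "# Assumptions: a Qwen2-VL checkpoint, its chat-template image tokens,\n",
    "# and a local image path you supply yourself.\n",
    "import sglang as sgl\n",
    "\n",
    "vlm = sgl.Engine(model_path=\"Qwen/Qwen2-VL-7B-Instruct\")\n",
    "\n",
    "prompt = (\n",
    "    \"<|im_start|>user\\n<|vision_start|><|image_pad|><|vision_end|>\"\n",
    "    \"Describe this image.<|im_end|>\\n<|im_start|>assistant\\n\"\n",
    ")\n",
    "\n",
    "output = vlm.generate(\n",
    "    prompt=prompt,\n",
    "    sampling_params={\"temperature\": 0.8, \"top_p\": 0.95},\n",
    "    image_data=\"/path/to/your/image.png\",  # placeholder path\n",
    ")\n",
    "print(output[\"text\"])\n",
    "\n",
    "vlm.shutdown()\n",
    "```"
   ]
  },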
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Offline Batch Inference\n",
    "\n",
    "SGLang offline engine supports batch inference with efficient scheduling."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# launch the offline engine\n",
    "import asyncio\n",
    "\n",
    "import sglang as sgl\n",
    "import sglang.test.doc_patch\n",
    "from sglang.utils import async_stream_and_merge, stream_and_merge\n",
    "\n",
    "llm = sgl.Engine(model_path=\"qwen/qwen2.5-0.5b-instruct\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Non-streaming Synchronous Generation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "prompts = [\n",
    "    \"Hello, my name is\",\n",
    "    \"The president of the United States is\",\n",
    "    \"The capital of France is\",\n",
    "    \"The future of AI is\",\n",
    "]\n",
    "\n",
    "sampling_params = {\"temperature\": 0.8, \"top_p\": 0.95}\n",
    "\n",
    "outputs = llm.generate(prompts, sampling_params)\n",
    "for prompt, output in zip(prompts, outputs):\n",
    "    print(\"===============================\")\n",
    "    print(f\"Prompt: {prompt}\\nGenerated text: {output['text']}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Streaming Synchronous Generation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "prompts = [\n",
    "    \"Write a short, neutral self-introduction for a fictional character. Hello, my name is\",\n",
    "    \"Provide a concise factual statement about France’s capital city. The capital of France is\",\n",
    "    \"Explain possible future trends in artificial intelligence. The future of AI is\",\n",
    "]\n",
    "\n",
    "sampling_params = {\n",
    "    \"temperature\": 0.2,\n",
    "    \"top_p\": 0.9,\n",
    "}\n",
    "\n",
    "print(\"\\n=== Testing synchronous streaming generation with overlap removal ===\\n\")\n",
    "\n",
    "for prompt in prompts:\n",
    "    print(f\"Prompt: {prompt}\")\n",
    "    merged_output = stream_and_merge(llm, prompt, sampling_params)\n",
    "    print(\"Generated text:\", merged_output)\n",
    "    print()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Non-streaming Asynchronous Generation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "prompts = [\n",
    "    \"Write a short, neutral self-introduction for a fictional character. Hello, my name is\",\n",
    "    \"Provide a concise factual statement about France’s capital city. The capital of France is\",\n",
    "    \"Explain possible future trends in artificial intelligence. The future of AI is\",\n",
    "]\n",
    "\n",
    "sampling_params = {\"temperature\": 0.8, \"top_p\": 0.95}\n",
    "\n",
    "print(\"\\n=== Testing asynchronous batch generation ===\")\n",
    "\n",
    "\n",
    "async def main():\n",
    "    outputs = await llm.async_generate(prompts, sampling_params)\n",
    "\n",
    "    for prompt, output in zip(prompts, outputs):\n",
    "        print(f\"\\nPrompt: {prompt}\")\n",
    "        print(f\"Generated text: {output['text']}\")\n",
    "\n",
    "\n",
    "asyncio.run(main())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Streaming Asynchronous Generation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "prompts = [\n",
    "    \"Write a short, neutral self-introduction for a fictional character. Hello, my name is\",\n",
    "    \"Provide a concise factual statement about France’s capital city. The capital of France is\",\n",
    "    \"Explain possible future trends in artificial intelligence. The future of AI is\",\n",
    "]\n",
    "\n",
    "sampling_params = {\"temperature\": 0.8, \"top_p\": 0.95}\n",
    "\n",
    "print(\"\\n=== Testing asynchronous streaming generation (no repeats) ===\")\n",
    "\n",
    "\n",
    "async def main():\n",
    "    for prompt in prompts:\n",
    "        print(f\"\\nPrompt: {prompt}\")\n",
    "        print(\"Generated text: \", end=\"\", flush=True)\n",
    "\n",
    "        # Replace direct calls to async_generate with our custom overlap-aware version\n",
    "        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):\n",
    "            print(cleaned_chunk, end=\"\", flush=True)\n",
    "\n",
    "        print()  # New line after each prompt\n",
    "\n",
    "\n",
    "asyncio.run(main())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "llm.shutdown()"
   ]
  }
 ],
 "metadata": {
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}