{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Offline Engine API\n",
    "\n",
    "SGLang provides a direct inference engine without the need for an HTTP server, especially for use cases where additional HTTP server adds unnecessary complexity or overhead. Here are two general use cases:\n",
    "\n",
    "- Offline Batch Inference\n",
    "- Custom Server on Top of the Engine\n",
    "\n",
    "This document focuses on the offline batch inference, demonstrating four different inference modes:\n",
    "\n",
    "- Non-streaming synchronous generation\n",
    "- Streaming synchronous generation\n",
    "- Non-streaming asynchronous generation\n",
    "- Streaming asynchronous generation\n",
    "\n",
    "Additionally, you can easily build a custom server on top of the SGLang offline engine. A detailed example working in a python script can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py)."
   ]
  },
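  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a rough illustration of that second use case, the sketch below wraps the engine in a small HTTP endpoint. It assumes FastAPI, pydantic, and uvicorn are installed; the route name and request schema are made up for this example, so treat it as a starting point rather than the exact interface of the linked script. It is kept as a plain snippet here (not an executable cell) because it starts a long-running server:\n",
    "\n",
    "```python\n",
    "# A minimal custom server sketch (assumes: pip install fastapi uvicorn).\n",
    "import sglang as sgl\n",
    "import uvicorn\n",
    "from fastapi import FastAPI\n",
    "from pydantic import BaseModel\n",
    "\n",
    "app = FastAPI()\n",
    "llm = sgl.Engine(model_path=\"meta-llama/Meta-Llama-3.1-8B-Instruct\")\n",
    "\n",
    "\n",
    "class GenerateRequest(BaseModel):\n",
    "    prompt: str\n",
    "    sampling_params: dict = {}\n",
    "\n",
    "\n",
    "@app.post(\"/generate\")\n",
    "async def generate(req: GenerateRequest):\n",
    "    # Delegate to the engine's async API so concurrent requests can be batched.\n",
    "    output = await llm.async_generate(req.prompt, req.sampling_params)\n",
    "    return {\"text\": output[\"text\"]}\n",
    "\n",
    "\n",
    "if __name__ == \"__main__\":\n",
    "    uvicorn.run(app, host=\"0.0.0.0\", port=8000)\n",
    "```"
   ]
  },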
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Offline Batch Inference\n",
    "\n",
    "SGLang offline engine supports batch inference with efficient scheduling."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# launch the offline engine\n",
    "\n",
    "import sglang as sgl\n",
    "import asyncio\n",
    "\n",
    "llm = sgl.Engine(model_path=\"meta-llama/Meta-Llama-3.1-8B-Instruct\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Non-streaming Synchronous Generation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "prompts = [\n",
    "    \"Hello, my name is\",\n",
    "    \"The president of the United States is\",\n",
    "    \"The capital of France is\",\n",
    "    \"The future of AI is\",\n",
    "]\n",
    "\n",
    "sampling_params = {\"temperature\": 0.8, \"top_p\": 0.95}\n",
    "\n",
    "outputs = llm.generate(prompts, sampling_params)\n",
    "for prompt, output in zip(prompts, outputs):\n",
    "    print(\"===============================\")\n",
    "    print(f\"Prompt: {prompt}\\nGenerated text: {output['text']}\")"
   ]
  },
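  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Each output is a dict. Besides `text`, it typically carries a `meta_info` dict with token counts and the finish reason; the exact field names below are assumptions about the current output format, so they are accessed defensively:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Inspect the metadata attached to the first output.\n",
    "# The `meta_info` keys used here are assumptions about the current format.\n",
    "meta = outputs[0].get(\"meta_info\", {})\n",
    "print(f\"Prompt tokens: {meta.get('prompt_tokens')}\")\n",
    "print(f\"Completion tokens: {meta.get('completion_tokens')}\")\n",
    "print(f\"Finish reason: {meta.get('finish_reason')}\")"
   ]
  },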
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Streaming Synchronous Generation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "prompts = [\n",
    "    \"Hello, my name is\",\n",
    "    \"The capital of France is\",\n",
    "    \"The future of AI is\",\n",
    "]\n",
    "sampling_params = {\"temperature\": 0.8, \"top_p\": 0.95}\n",
    "\n",
    "print(\"\\n=== Testing synchronous streaming generation ===\")\n",
    "\n",
    "for prompt in prompts:\n",
    "    print(f\"\\nPrompt: {prompt}\")\n",
    "    print(\"Generated text: \", end=\"\", flush=True)\n",
    "\n",
    "    for chunk in llm.generate(prompt, sampling_params, stream=True):\n",
    "        print(chunk[\"text\"], end=\"\", flush=True)\n",
    "    print()"
   ]
  },
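  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If you also need the complete text after streaming, you can accumulate the chunks as they arrive. This sketch assumes each chunk's `text` field holds only the newly generated piece, as the printing loop above does:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Collect the streamed pieces per prompt and join them at the end.\n",
    "full_texts = {}\n",
    "for prompt in prompts:\n",
    "    pieces = []\n",
    "    for chunk in llm.generate(prompt, sampling_params, stream=True):\n",
    "        pieces.append(chunk[\"text\"])\n",
    "    full_texts[prompt] = \"\".join(pieces)\n",
    "\n",
    "for prompt, text in full_texts.items():\n",
    "    print(f\"{prompt!r} -> {text[:60]!r}\")"
   ]
  },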
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Non-streaming Asynchronous Generation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "prompts = [\n",
    "    \"Hello, my name is\",\n",
    "    \"The capital of France is\",\n",
    "    \"The future of AI is\",\n",
    "]\n",
    "\n",
    "sampling_params = {\"temperature\": 0.8, \"top_p\": 0.95}\n",
    "\n",
    "print(\"\\n=== Testing asynchronous batch generation ===\")\n",
    "\n",
    "\n",
    "async def main():\n",
    "    outputs = await llm.async_generate(prompts, sampling_params)\n",
    "\n",
    "    for prompt, output in zip(prompts, outputs):\n",
    "        print(f\"\\nPrompt: {prompt}\")\n",
    "        print(f\"Generated text: {output['text']}\")\n",
    "\n",
    "\n",
    "asyncio.run(main())"
   ]
  },
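  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`async_generate` also accepts a single prompt, so instead of passing a batch you can issue one request per prompt and gather them concurrently; the engine can still batch them internally. A sketch:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "async def main_gather():\n",
    "    # One concurrent request per prompt; batching happens inside the engine.\n",
    "    tasks = [llm.async_generate(p, sampling_params) for p in prompts]\n",
    "    results = await asyncio.gather(*tasks)\n",
    "    for prompt, output in zip(prompts, results):\n",
    "        print(f\"\\nPrompt: {prompt}\\nGenerated text: {output['text']}\")\n",
    "\n",
    "\n",
    "await main_gather()"
   ]
  },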
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Streaming Asynchronous Generation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "prompts = [\n",
    "    \"Hello, my name is\",\n",
    "    \"The capital of France is\",\n",
    "    \"The future of AI is\",\n",
    "]\n",
    "sampling_params = {\"temperature\": 0.8, \"top_p\": 0.95}\n",
    "\n",
    "print(\"\\n=== Testing asynchronous streaming generation ===\")\n",
    "\n",
    "\n",
    "async def main():\n",
    "    for prompt in prompts:\n",
    "        print(f\"\\nPrompt: {prompt}\")\n",
    "        print(\"Generated text: \", end=\"\", flush=True)\n",
    "\n",
    "        generator = await llm.async_generate(prompt, sampling_params, stream=True)\n",
    "        async for chunk in generator:\n",
    "            print(chunk[\"text\"], end=\"\", flush=True)\n",
    "        print()\n",
    "\n",
    "\n",
    "asyncio.run(main())"
   ]
  },
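  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The prompts above are streamed one after another. To stream several prompts concurrently without interleaving the printed output, buffer each stream in its own task; a sketch:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "async def stream_one(prompt):\n",
    "    # Buffer one prompt's chunks so concurrent streams don't interleave on stdout.\n",
    "    pieces = []\n",
    "    generator = await llm.async_generate(prompt, sampling_params, stream=True)\n",
    "    async for chunk in generator:\n",
    "        pieces.append(chunk[\"text\"])\n",
    "    return \"\".join(pieces)\n",
    "\n",
    "\n",
    "async def main_concurrent():\n",
    "    results = await asyncio.gather(*(stream_one(p) for p in prompts))\n",
    "    for prompt, text in zip(prompts, results):\n",
    "        print(f\"\\nPrompt: {prompt}\\nGenerated text: {text}\")\n",
    "\n",
    "\n",
    "await main_concurrent()"
   ]
  },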
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "llm.shutdown()"
   ]
  }
 ],
 "metadata": {
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}