"SGLang provides a direct inference engine without the need for an HTTP server, especially for use cases where additional HTTP server adds unnecessary complexity or overhead. Here are two general use cases:\n",
"\n",
"- Offline Batch Inference\n",
"- Custom Server on Top of the Engine\n",
"\n",
"This document focuses on the offline batch inference, demonstrating four different inference modes:\n",
"\n",
"- Non-streaming synchronous generation\n",
"- Streaming synchronous generation\n",
"- Non-streaming asynchronous generation\n",
"- Streaming asynchronous generation\n",
"\n",
"Additionally, you can easily build a custom server on top of the SGLang offline engine. A detailed example working in a python script can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Offline Batch Inference\n",
"\n",
"SGLang offline engine supports batch inference with efficient scheduling to prevent OOM errors for large batches. For details on this cache-aware scheduling algorithm, see our [paper](https://arxiv.org/pdf/2312.07104)."