# SGLang Engine

## Introduction
SGLang provides a direct inference engine that can be used without launching an HTTP server. There are generally three use cases:

1. **Offline Batch Inference**
2. **Embedding Generation**
3. **Custom Server on Top of the Engine**

## Examples

### 1. [Offline Batch Inference](./offline_batch_inference.py)

In this example, we launch an SGLang engine and feed it a batch of inputs for inference. Even for a very large batch, the engine schedules the requests intelligently, processing them efficiently while avoiding OOM (out-of-memory) errors.
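
A minimal sketch of how this might look (the model path is an assumption; substitute any model SGLang supports):

```python
import sglang as sgl

# Launch the engine in-process; no HTTP server is started.
llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")

prompts = [
    "Hello, my name is",
    "The capital of France is",
]
sampling_params = {"temperature": 0.8, "top_p": 0.95}

# The engine batches and schedules these requests internally.
outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print(prompt, output["text"])

llm.shutdown()
```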

### 2. [Embedding Generation](./embedding.py)

In this example, we launch an SGLang engine and feed it a batch of inputs for embedding generation.
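
A minimal sketch, assuming an embedding-capable model (the model path below is an assumption) and an engine started in embedding mode:

```python
import sglang as sgl

# is_embedding switches the engine to embedding mode.
llm = sgl.Engine(
    model_path="Alibaba-NLP/gte-Qwen2-7B-instruct",  # assumed embedding model
    is_embedding=True,
)

prompts = [
    "SGLang is a fast serving framework.",
    "Embeddings map text to vectors.",
]

# encode() returns one result per input prompt.
outputs = llm.encode(prompts)
for output in outputs:
    print(len(output["embedding"]))

llm.shutdown()
```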

### 3. [Custom Server](./custom_server.py)

This example demonstrates how to create a custom server on top of the SGLang Engine. We use [Sanic](https://sanic.dev/en/) as an example. The server supports both non-streaming and streaming endpoints.
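
As a rough sketch of the shape of such a server (the model path is an assumption, and the exact semantics of streamed chunks may vary by SGLang version):

```python
import sglang as sgl
from sanic import Sanic
from sanic.response import json as json_response

app = Sanic("sglang_server")
llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")  # assumed model


@app.post("/generate")
async def generate(request):
    # Non-streaming: await the full result, then return it as JSON.
    prompt = request.json["prompt"]
    result = await llm.async_generate(prompt)
    return json_response({"text": result["text"]})


@app.post("/generate_stream")
async def generate_stream(request):
    # Streaming: send each chunk to the client as it is produced.
    prompt = request.json["prompt"]
    response = await request.respond(content_type="text/plain")
    generator = await llm.async_generate(prompt, stream=True)
    async for chunk in generator:
        # Depending on the SGLang version, chunks may carry incremental
        # or cumulative text.
        await response.send(chunk["text"])
    await response.eof()


if __name__ == "__main__":
    # single_process avoids spawning multiple workers that would each
    # load the model.
    app.run(host="0.0.0.0", port=8000, single_process=True)
```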

#### Steps:

1. Install Sanic:

```bash
pip install sanic
```

2. Run the server:

```bash
python custom_server.py
```

3. Send requests:

```bash
curl -X POST http://localhost:8000/generate -H "Content-Type: application/json" -d '{"prompt": "The Transformer architecture is..."}'
curl -X POST http://localhost:8000/generate_stream -H "Content-Type: application/json" -d '{"prompt": "The Transformer architecture is..."}' --no-buffer
```

This will send both non-streaming and streaming requests to the server.