# SGLang Router

SGLang router is a standalone module implemented in Rust to achieve data parallelism across SGLang instances.

## User docs

Please see the [user documentation](https://docs.sglang.ai/router/router.html).

## Developer docs

### Prerequisites

- Rust and Cargo installed

```bash
# Install rustup (Rust installer and version manager)
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

# Follow the installation prompts, then reload your shell
source $HOME/.cargo/env

# Verify installation
rustc --version
cargo --version
```

- Python with pip installed
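
To confirm the prerequisites are on your `PATH`, a small Python check is enough. This is a hypothetical helper, not part of the repo; it only looks up executables by name and does not verify their versions.

```python
import shutil

# Hypothetical helper (not part of the repo): report which prerequisite
# tools can be found on PATH. It does not check versions.
REQUIRED_TOOLS = ("rustc", "cargo", "pip")


def check_prerequisites(tools=REQUIRED_TOOLS):
    """Map each tool name to its resolved path, or None if it is missing."""
    return {tool: shutil.which(tool) for tool in tools}


if __name__ == "__main__":
    for tool, path in check_prerequisites().items():
        print(f"{tool}: {path or 'MISSING'}")
```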

### Build Process

#### 1. Build Rust Project

```bash
$ cargo build
```

#### 2. Build Python Binding

##### Option A: Build and Install Wheel

1. Build the wheel package:
   ```bash
   $ pip install setuptools-rust wheel build
   $ python -m build
   ```

2. Install the generated wheel (`python -m build` writes it to `dist/` by default):
   ```bash
   $ pip install <path-to-wheel>
   ```

To rebuild and reinstall with a single command after each change:

```bash
$ python -m build && pip install --force-reinstall dist/*.whl
```
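
`dist/*.whl` matches every wheel left over from earlier builds, and pip may refuse to install several versions of the same package in one command. A minimal sketch of a workaround, assuming the default `dist/` output directory: build, then reinstall only the most recently written wheel. The helper names are hypothetical and not part of the repo.

```python
import subprocess
import sys
from pathlib import Path


def newest_wheel(dist_dir="dist"):
    """Return the most recently modified .whl in dist_dir (hypothetical helper)."""
    wheels = list(Path(dist_dir).glob("*.whl"))
    if not wheels:
        raise FileNotFoundError(f"no wheels found in {dist_dir!r}")
    return max(wheels, key=lambda p: p.stat().st_mtime)


def rebuild_and_install(dist_dir="dist"):
    # Same steps as the one-liner, but only the newest wheel is reinstalled.
    subprocess.run([sys.executable, "-m", "build"], check=True)
    wheel = newest_wheel(dist_dir)
    subprocess.run(
        [sys.executable, "-m", "pip", "install", "--force-reinstall", str(wheel)],
        check=True,
    )
```

Call `rebuild_and_install()` from the project root (the directory containing `pyproject.toml`). Alternatively, delete `dist/` before each build so the glob only ever matches one wheel.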

##### Option B: Development Mode

For development, you can install the package in editable mode:

```bash
$ pip install -e .
```

**Warning:** An editable install of the Python binding can suffer from performance degradation. If you want to measure performance, build a fresh wheel after every change instead.

**Note:** When modifying Rust code, you must re-run the install (or rebuild the wheel) for the changes to take effect.

### Logging

The SGL Router uses structured logging and writes to the console by default. To write logs to files, set `log_dir`:

```python
from sglang_router import Router

# Enable file logging when creating a router
router = Router(
    worker_urls=["http://worker1:8000", "http://worker2:8000"],
    log_dir="./logs",  # daily log files are created in this directory
)
```

Use the `--log-level` flag with the CLI to set the [log level](https://docs.sglang.ai/backend/server_arguments.html#logging).

### Metrics

SGL Router exposes a Prometheus HTTP scrape endpoint for monitoring. By default it listens on `127.0.0.1:29000`.

To listen on all network interfaces and use port 9000 instead, pass the following options when launching the router:
```bash
python -m sglang_router.launch_router \
  --worker-urls http://localhost:8080 http://localhost:8081 \
  --prometheus-host 0.0.0.0 \
  --prometheus-port 9000
```
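
To inspect the endpoint by hand, you can fetch it and parse the Prometheus text exposition format. A minimal standard-library sketch; it assumes the metrics are served at the `/metrics` path on the configured host and port, and it makes no assumption about metric names. The parser only handles simple `name{labels} value [timestamp]` sample lines and skips `#` comment lines.

```python
import urllib.request


def fetch_metrics(host="127.0.0.1", port=29000, path="/metrics", timeout=5.0):
    """Return the raw text from the router's Prometheus endpoint.

    Assumption: the endpoint path is /metrics.
    """
    url = f"http://{host}:{port}{path}"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def parse_samples(text):
    """Parse simple Prometheus text-format samples into {series: value}.

    HELP/TYPE/comment lines (starting with '#') and blank lines are skipped;
    an optional trailing timestamp is ignored.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        if "{" in line:
            # The series name plus its {labels} ends at the last closing brace.
            end = line.rindex("}") + 1
            series, rest = line[:end], line[end:].split()
        else:
            series, *rest = line.split()
        samples[series] = float(rest[0])
    return samples
```

With the flags above, `parse_samples(fetch_metrics(port=9000))` returns the current value of every exported series.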

### Kubernetes Service Discovery

SGL Router supports automatic service discovery for worker nodes in Kubernetes environments. This feature works with both regular (single-server) routing and PD (Prefill-Decode) routing modes. When enabled, the router will automatically:

- Discover and add worker pods with matching labels
- Remove unhealthy or deleted worker pods
- Dynamically adjust the worker pool based on pod health and availability
- For PD mode: distinguish between prefill and decode servers based on labels
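
Conceptually, each discovery pass reconciles the router's worker set against the pods it observes. The sketch below is a hypothetical model of that loop, not the router's Rust implementation. It assumes a pod is reduced to its IP, readiness, and labels; that a worker URL has the form `http://<pod-ip>:<port>` with the port from `--service-discovery-port` (default 8000); and that a pod matches when every selector label is present with the same value.

```python
from dataclasses import dataclass, field


@dataclass
class Pod:
    # Hypothetical, simplified view of a Kubernetes pod.
    name: str
    ip: str
    ready: bool
    labels: dict = field(default_factory=dict)


def matches(labels, selector):
    """True if every selector key=value pair appears in the pod's labels."""
    return all(labels.get(k) == v for k, v in selector.items())


def reconcile(current_urls, pods, selector, port=8000):
    """Return (to_add, to_remove) so the worker set tracks ready, matching pods."""
    desired = {
        f"http://{pod.ip}:{port}"
        for pod in pods
        if pod.ready and matches(pod.labels, selector)
    }
    current = set(current_urls)
    return sorted(desired - current), sorted(current - desired)
```

Deleted pods simply stop appearing in `pods` and unhealthy ones have `ready=False`, so both end up in `to_remove`.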

#### Regular Mode Service Discovery

For traditional single-server routing:

```bash
python -m sglang_router.launch_router \
    --service-discovery \
    --selector app=sglang-worker role=inference \
    --service-discovery-namespace default
```
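
`--selector` takes space-separated `key=value` pairs. To preview which pods a selector would pick up, you can translate the same pairs into kubectl's comma-separated `-l` syntax, where kubectl requires every requirement to match. A small hypothetical helper, not part of the router; it assumes the router likewise requires all pairs to match.

```python
def parse_selector(pairs):
    """Turn ["app=sglang-worker", "role=inference"] into a label dict."""
    selector = {}
    for pair in pairs:
        key, sep, value = pair.partition("=")
        if not sep or not key:
            raise ValueError(f"expected key=value, got {pair!r}")
        selector[key] = value
    return selector


def kubectl_get_pods_cmd(selector, namespace="default"):
    """Build a kubectl command listing pods that match every selector label."""
    label_arg = ",".join(f"{k}={v}" for k, v in selector.items())
    return ["kubectl", "get", "pods", "-n", namespace, "-l", label_arg]
```

For the command above, `subprocess.run(kubectl_get_pods_cmd(parse_selector(["app=sglang-worker", "role=inference"])), check=True)` runs `kubectl get pods -n default -l app=sglang-worker,role=inference`.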

#### PD Mode Service Discovery

For PD (Prefill-Decode) disaggregated routing, service discovery can automatically discover and classify pods as either prefill or decode servers based on their labels:

```bash
python -m sglang_router.launch_router \
    --pd-disaggregation \
    --policy cache_aware \
    --service-discovery \
    --prefill-selector app=sglang component=prefill \
    --decode-selector app=sglang component=decode \
    --service-discovery-namespace sglang-system
```

You can also specify initial prefill and decode servers and let service discovery add more:

```bash
python -m sglang_router.launch_router \
    --pd-disaggregation \
    --policy cache_aware \
    --prefill http://prefill-1:8000 8001 \
    --decode http://decode-1:8000 \
    --service-discovery \
    --prefill-selector app=sglang component=prefill \
    --decode-selector app=sglang component=decode \
    --service-discovery-namespace sglang-system
```

#### Kubernetes Pod Configuration for PD Mode

When using PD service discovery, your Kubernetes pods need specific labels to be classified as prefill or decode servers:

**Prefill Server Pod:**
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: sglang-prefill-1
  labels:
    app: sglang
    component: prefill
  annotations:
    sglang.ai/bootstrap-port: "9001"  # Optional: Bootstrap port for Mooncake prefill coordination
spec:
  containers:
  - name: sglang
    image: lmsys/sglang:latest
    ports:
    - containerPort: 8000  # Main API port
    - containerPort: 9001  # Optional: Bootstrap coordination port
    # ... rest of configuration
```

**Decode Server Pod:**
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: sglang-decode-1
  labels:
    app: sglang
    component: decode
spec:
  containers:
  - name: sglang
    image: lmsys/sglang:latest
    ports:
    - containerPort: 8000  # Main API port
    # ... rest of configuration
```

**Key Requirements:**
- Prefill pods must have labels matching your `--prefill-selector`
- Decode pods must have labels matching your `--decode-selector`
- Prefill pods can optionally specify a bootstrap port with the `sglang.ai/bootstrap-port` annotation (defaults to `None` if not specified)
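
These classification rules can be modeled in a few lines. This is a hypothetical sketch, not the router's code. It treats a pod as matching a selector when all selector labels are present with equal values, checks the prefill selector first (an assumption for pods that would match both), ignores pods matching neither selector, and reads the optional `sglang.ai/bootstrap-port` annotation, falling back to `None`.

```python
BOOTSTRAP_ANNOTATION = "sglang.ai/bootstrap-port"


def _matches(labels, selector):
    # Every selector label must be present with the same value.
    return all(labels.get(k) == v for k, v in selector.items())


def classify_pod(labels, annotations, prefill_selector, decode_selector):
    """Return ("prefill", bootstrap_port), ("decode", None), or None if ignored."""
    if _matches(labels, prefill_selector):
        raw = annotations.get(BOOTSTRAP_ANNOTATION)
        # Annotation values are strings; a missing annotation means no bootstrap port.
        return ("prefill", int(raw) if raw is not None else None)
    if _matches(labels, decode_selector):
        return ("decode", None)
    return None
```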

#### Service Discovery Arguments

**General Arguments:**
- `--service-discovery`: Enable Kubernetes service discovery feature
- `--service-discovery-port`: Port to use when generating worker URLs (default: 8000)
- `--service-discovery-namespace`: Optional. Kubernetes namespace to watch for pods. If not provided, watches all namespaces (requires cluster-wide permissions)
- `--selector`: One or more label key-value pairs for pod selection in regular mode (format: key1=value1 key2=value2)

**PD Mode Arguments:**
- `--pd-disaggregation`: Enable PD (Prefill-Decode) disaggregated mode
- `--prefill`: Specify initial prefill server URL and bootstrap port (format: URL BOOTSTRAP_PORT, can be used multiple times)
- `--decode`: Specify initial decode server URL (can be used multiple times)
- `--prefill-selector`: Label selector for prefill server pods in PD mode (format: key1=value1 key2=value2)
- `--decode-selector`: Label selector for decode server pods in PD mode (format: key1=value1 key2=value2)
- `--policy`: Routing policy: `cache_aware`, `random`, or `power_of_two` (`power_of_two` only works in PD mode)
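
`--prefill` consumes two values (URL, then bootstrap port) and, like `--decode`, can be repeated. If you write your own launcher or wrapper script and want to accept the same shape, Python's `argparse` expresses it with `action="append"`. This illustrative sketch only mirrors the documented format; it is not the router's actual argument parser and declares just these two flags.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(description="PD flags (illustrative sketch)")
    # Each --prefill occurrence takes exactly two values: URL BOOTSTRAP_PORT.
    parser.add_argument("--prefill", action="append", nargs=2,
                        metavar=("URL", "BOOTSTRAP_PORT"), default=[])
    # Each --decode occurrence takes one URL.
    parser.add_argument("--decode", action="append", default=[])
    return parser


def parse_pd_args(argv):
    """Return ([(prefill_url, bootstrap_port), ...], [decode_url, ...])."""
    args = build_parser().parse_args(argv)
    prefill = [(url, int(port)) for url, port in args.prefill]
    return prefill, args.decode
```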

**Notes:**
- The bootstrap port annotation key is fixed to `sglang.ai/bootstrap-port` (used for Mooncake deployments)
- Advanced cache tuning parameters use sensible defaults and are not exposed via CLI

#### RBAC Requirements

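When using service discovery, the router's service account needs permission to `get`, `list`, and `watch` pods. Grant it with one of the RBAC configurations below and run the router pod under that account (for example, `serviceAccountName: sglang-router` in the pod spec):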

**Namespace-scoped (recommended):**
```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: sglang-router
  namespace: sglang-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: sglang-system
  name: sglang-router
rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: sglang-router
  namespace: sglang-system
subjects:
- kind: ServiceAccount
  name: sglang-router
  namespace: sglang-system
roleRef:
  kind: Role
  name: sglang-router
  apiGroup: rbac.authorization.k8s.io
```

**Cluster-wide (if watching all namespaces):**
```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: sglang-router
  namespace: sglang-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: sglang-router
rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: sglang-router
subjects:
- kind: ServiceAccount
  name: sglang-router
  namespace: sglang-system
roleRef:
  kind: ClusterRole
  name: sglang-router
  apiGroup: rbac.authorization.k8s.io
```
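
After applying either manifest, you can check that the service account really has the required verbs with `kubectl auth can-i`, impersonating it via `--as=system:serviceaccount:<namespace>:<name>`; the command exits with status 0 only when the action is allowed. A small Python wrapper, assuming the `sglang-system` namespace and `sglang-router` account names used above, and a kubectl context that is allowed to impersonate service accounts:

```python
import subprocess

VERBS = ("get", "list", "watch")


def can_i_commands(namespace="sglang-system", account="sglang-router", target_ns=None):
    """Build one `kubectl auth can-i <verb> pods` command per required verb.

    target_ns=None checks the account's own namespace; pass another
    namespace to spot-check cluster-wide access.
    """
    subject = f"system:serviceaccount:{namespace}:{account}"
    ns = target_ns or namespace
    return [
        ["kubectl", "auth", "can-i", verb, "pods", f"--as={subject}", "-n", ns]
        for verb in VERBS
    ]


def check_rbac(**kwargs):
    """Return {verb: allowed}, based on each command's exit status."""
    results = {}
    for verb, cmd in zip(VERBS, can_i_commands(**kwargs)):
        proc = subprocess.run(cmd, capture_output=True, text=True)
        results[verb] = proc.returncode == 0
    return results
```

`check_rbac()` should report `True` for all three verbs with either configuration; with the namespace-scoped Role, `check_rbac(target_ns="default")` is expected to report `False`.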

#### Complete Example: PD Mode with Service Discovery

Here's a complete example of running SGLang Router with PD mode and service discovery:

```bash
# Start the router with PD mode and automatic prefill/decode discovery
python -m sglang_router.launch_router \
    --pd-disaggregation \
    --policy cache_aware \
    --service-discovery \
    --prefill-selector app=sglang component=prefill environment=production \
    --decode-selector app=sglang component=decode environment=production \
    --service-discovery-namespace production \
    --host 0.0.0.0 \
    --port 8080 \
    --prometheus-host 0.0.0.0 \
    --prometheus-port 9090
```

This setup will:
1. Enable PD (Prefill-Decode) disaggregated routing mode with automatic pod classification
2. Watch for pods in the `production` namespace
3. Automatically add prefill servers with labels `app=sglang`, `component=prefill`, `environment=production`
4. Automatically add decode servers with labels `app=sglang`, `component=decode`, `environment=production`
5. Extract bootstrap ports from the `sglang.ai/bootstrap-port` annotation on prefill pods
6. Route requests with the cache-aware policy (`--policy cache_aware`)
7. Expose the router API on port 8080 and metrics on port 9090

**Note:** In PD mode with service discovery, pods MUST match either the prefill or decode selector to be added. Pods that don't match either selector are ignored.

### Troubleshooting

1. If rust-analyzer is not working in VS Code, set `rust-analyzer.linkedProjects` to the absolute path of the `Cargo.toml` in your repo. For example:

```json
{
  "rust-analyzer.linkedProjects":  ["/workspaces/sglang/sgl-router/Cargo.toml"]
}
```
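
To script this setting, a hypothetical helper can compute the absolute path and merge it into the workspace's `.vscode/settings.json`. It assumes the settings file is plain JSON (VS Code also accepts comments there, which Python's `json` module cannot parse) and that you pass the directory containing the router's `Cargo.toml`.

```python
import json
from pathlib import Path


def link_rust_analyzer(workspace, crate_dir):
    """Add crate_dir/Cargo.toml (absolute) to rust-analyzer.linkedProjects."""
    cargo_toml = str((Path(crate_dir) / "Cargo.toml").resolve())
    settings_path = Path(workspace) / ".vscode" / "settings.json"
    settings = json.loads(settings_path.read_text()) if settings_path.exists() else {}
    linked = settings.setdefault("rust-analyzer.linkedProjects", [])
    if cargo_toml not in linked:  # keep the list free of duplicates
        linked.append(cargo_toml)
    settings_path.parent.mkdir(parents=True, exist_ok=True)
    settings_path.write_text(json.dumps(settings, indent=2) + "\n")
    return settings_path
```

For the layout in the example, `link_rust_analyzer("/workspaces/sglang", "/workspaces/sglang/sgl-router")` produces the same setting.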

### CI/CD Setup

The continuous integration pipeline consists of three main steps:

#### 1. Build Wheels
- Uses `cibuildwheel` to create manylinux x86_64 packages
- Compatible with major Linux distributions (Ubuntu, CentOS, etc.)
- Additional configurations can be added to support other OS/architectures
- Reference: [cibuildwheel documentation](https://cibuildwheel.pypa.io/en/stable/)
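
To confirm that a built wheel targets manylinux x86_64, you can read its platform tag from the filename. Wheel filenames follow `{distribution}-{version}(-{build tag})?-{python tag}-{abi tag}-{platform tag}.whl` (PEP 427), and each tag field may hold several tags joined by `.`. A minimal sketch; the example filename in the test is hypothetical.

```python
def wheel_tags(filename):
    """Split a wheel filename into its PEP 427 components."""
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel: {filename!r}")
    parts = filename[: -len(".whl")].split("-")
    if len(parts) == 6:
        dist, version, build, py, abi, plat = parts
    elif len(parts) == 5:
        dist, version, py, abi, plat = parts
        build = None
    else:
        raise ValueError(f"unexpected wheel filename: {filename!r}")
    return {
        "distribution": dist,
        "version": version,
        "build": build,
        "python": py.split("."),
        "abi": abi.split("."),
        "platforms": plat.split("."),  # compressed tag sets use '.'
    }


def is_manylinux_x86_64(filename):
    """True if any platform tag is a manylinux x86_64 tag."""
    return any(
        p.startswith("manylinux") and p.endswith("_x86_64")
        for p in wheel_tags(filename)["platforms"]
    )
```

For anything beyond a quick check, `packaging.utils.parse_wheel_filename` from the `packaging` library handles normalization and edge cases more robustly.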

#### 2. Build Source Distribution
- Creates a source distribution containing the raw, unbuilt code
- Enables `pip` to build the package from source when prebuilt wheels are unavailable

#### 3. Publish to PyPI
- Uploads both wheels and source distribution to PyPI

The CI configuration is based on the [tiktoken workflow](https://github.com/openai/tiktoken/blob/63527649963def8c759b0f91f2eb69a40934e468/.github/workflows/build_wheels.yml#L1).