README.md 14.6 KB
Newer Older
Byron Hsu's avatar
Byron Hsu committed
1
# SGLang Router
2

Simo Lin's avatar
Simo Lin committed
3
SGLang router is a standalone Rust module that enables data parallelism across SGLang instances, providing high-performance request routing and advanced load balancing. The router supports multiple load balancing algorithms including cache-aware, power of two, random, and round robin, and acts as a specialized load balancer for prefill-decode disaggregated serving architectures.
4

Simo Lin's avatar
Simo Lin committed
5
## Documentation
Byron Hsu's avatar
Byron Hsu committed
6

7
- **User Guide**: [docs.sglang.ai/advanced_features/router.html](https://docs.sglang.ai/advanced_features/router.html)
8

Simo Lin's avatar
Simo Lin committed
9
## Quick Start
10

11
### Prerequisites
12

Simo Lin's avatar
Simo Lin committed
13
**Rust and Cargo:**
14
15
16
17
18
19
20
21
22
23
24
25
```bash
# Install rustup (Rust installer and version manager)
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

# Follow the installation prompts, then reload your shell
source $HOME/.cargo/env

# Verify installation
rustc --version
cargo --version
```

Simo Lin's avatar
Simo Lin committed
26
**Python with pip installed**
27

Simo Lin's avatar
Simo Lin committed
28
### Installation
29

Simo Lin's avatar
Simo Lin committed
30
#### Option A: Build and Install Wheel (Recommended)
31
```bash
Simo Lin's avatar
Simo Lin committed
32
33
# Install build dependencies
pip install setuptools-rust wheel build
34

Simo Lin's avatar
Simo Lin committed
35
36
# Build the wheel package
python -m build
37

Simo Lin's avatar
Simo Lin committed
38
39
# Install the generated wheel
pip install dist/*.whl
40

Simo Lin's avatar
Simo Lin committed
41
42
# One-liner for development (rebuild + install)
python -m build && pip install --force-reinstall dist/*.whl
43
44
```

Simo Lin's avatar
Simo Lin committed
45
#### Option B: Development Mode
46

47
```bash
48
# Currently broken
Simo Lin's avatar
Simo Lin committed
49
pip install -e .
50
51
```

Simo Lin's avatar
Simo Lin committed
52
⚠️ **Warning**: Editable installs may suffer performance degradation. Use wheel builds for performance testing.
53

Simo Lin's avatar
Simo Lin committed
54
### Basic Usage
55

56
```bash
Simo Lin's avatar
Simo Lin committed
57
58
# Build Rust components
cargo build
59
```
Simo Lin's avatar
Simo Lin committed
60

61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
#### Using the Rust Binary Directly (Alternative to Python)
```bash
# Build the Rust binary
cargo build --release

# Launch router with worker URLs in regular mode
./target/release/sglang-router \
    --worker-urls http://worker1:8000 http://worker2:8000

# Or use cargo run
cargo run --release -- \
    --worker-urls http://worker1:8000 http://worker2:8000
```

#### Launch Router with Python (Original Method)
76
```bash
Simo Lin's avatar
Simo Lin committed
77
78
79
# Launch router with worker URLs
python -m sglang_router.launch_router \
    --worker-urls http://worker1:8000 http://worker2:8000
80
81
```

82
83
84
85
86
#### Launch Router with Worker URLs in prefill-decode mode
```bash
# Note that the prefill and decode URLs must be provided in the following format:
# http://<ip>:<port> for  decode nodes
# http://<ip>:<port> bootstrap-port for  prefill nodes, where bootstrap-port is optional
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102

# Using Rust binary directly
./target/release/sglang-router \
    --pd-disaggregation \
    --policy cache_aware \
    --prefill http://127.0.0.1:30001 9001 \
    --prefill http://127.0.0.2:30002 9002 \
    --prefill http://127.0.0.3:30003 9003 \
    --prefill http://127.0.0.4:30004 9004 \
    --decode http://127.0.0.5:30005 \
    --decode http://127.0.0.6:30006 \
    --decode http://127.0.0.7:30007 \
    --host 0.0.0.0 \
    --port 8080

# Or using Python launcher
103
104
105
106
107
108
109
110
111
112
113
114
115
116
python -m sglang_router.launch_router \
    --pd-disaggregation \
    --policy cache_aware \
    --prefill http://127.0.0.1:30001 9001 \
    --prefill http://127.0.0.2:30002 9002 \
    --prefill http://127.0.0.3:30003 9003 \
    --prefill http://127.0.0.4:30004 9004 \
    --decode http://127.0.0.5:30005 \
    --decode http://127.0.0.6:30006 \
    --decode http://127.0.0.7:30007 \
    --host 0.0.0.0 \
    --port 8080
````

Simo Lin's avatar
Simo Lin committed
117
## Configuration
118

119

120
121
### Logging

Simo Lin's avatar
Simo Lin committed
122
Enable structured logging with optional file output:
123
124

```python
Simo Lin's avatar
Simo Lin committed
125
126
127
128
129
130
from sglang_router import Router

# Console logging (default)
router = Router(worker_urls=["http://worker1:8000", "http://worker2:8000"])

# File logging enabled
131
132
router = Router(
    worker_urls=["http://worker1:8000", "http://worker2:8000"],
Simo Lin's avatar
Simo Lin committed
133
    log_dir="./logs"  # Daily log files created here
134
135
136
)
```

Simo Lin's avatar
Simo Lin committed
137
Set log level with `--log-level` flag ([documentation](https://docs.sglang.ai/backend/server_arguments.html#logging)).
138

139
140
### Metrics

Simo Lin's avatar
Simo Lin committed
141
Prometheus metrics endpoint available at `127.0.0.1:29000` by default.
142

Simo Lin's avatar
Simo Lin committed
143
144
```bash
# Custom metrics configuration
145
python -m sglang_router.launch_router \
Simo Lin's avatar
Simo Lin committed
146
147
148
    --worker-urls http://localhost:8080 http://localhost:8081 \
    --prometheus-host 0.0.0.0 \
    --prometheus-port 9000
149
150
```

151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
### Retries and Circuit Breakers

- Retries (regular router) are enabled by default with exponential backoff and jitter. You can tune them via CLI:

```bash
python -m sglang_router.launch_router \
  --worker-urls http://localhost:8080 http://localhost:8081 \
  --retry-max-retries 3 \
  --retry-initial-backoff-ms 100 \
  --retry-max-backoff-ms 10000 \
  --retry-backoff-multiplier 2.0 \
  --retry-jitter-factor 0.1
```

- Circuit Breaker defaults protect workers and auto-recover. Tune thresholds/timeouts:

```bash
python -m sglang_router.launch_router \
  --worker-urls http://localhost:8080 http://localhost:8081 \
  --cb-failure-threshold 5 \
  --cb-success-threshold 2 \
  --cb-timeout-duration-secs 30 \
  --cb-window-duration-secs 60
```

Behavior summary:
- Closed → Open after N consecutive failures (failure-threshold)
- Open → HalfOpen after timeout (timeout-duration-secs)
- HalfOpen → Closed after M consecutive successes (success-threshold)
- Any failure in HalfOpen reopens immediately

Retry predicate (regular router): retry on 408/429/500/502/503/504, otherwise return immediately. Backoff/jitter observed between attempts.

184
185
186
187
188
189
190
191
192
193
194
195
196
### Request ID Tracking

Track requests across distributed systems with configurable headers:

```bash
# Use custom request ID headers
python -m sglang_router.launch_router \
    --worker-urls http://localhost:8080 \
    --request-id-headers x-trace-id x-request-id
```

Default headers: `x-request-id`, `x-correlation-id`, `x-trace-id`, `request-id`

Simo Lin's avatar
Simo Lin committed
197
## Advanced Features
198

Simo Lin's avatar
Simo Lin committed
199
### Kubernetes Service Discovery
200

Simo Lin's avatar
Simo Lin committed
201
Automatic worker discovery and management in Kubernetes environments.
202

Simo Lin's avatar
Simo Lin committed
203
#### Basic Service Discovery
204
205
206
207
208
209
210
211

```bash
python -m sglang_router.launch_router \
    --service-discovery \
    --selector app=sglang-worker role=inference \
    --service-discovery-namespace default
```

Simo Lin's avatar
Simo Lin committed
212
#### PD (Prefill-Decode) Mode
213

Simo Lin's avatar
Simo Lin committed
214
For disaggregated prefill/decode routing:
215
216
217
218
219
220
221
222
223

```bash
python -m sglang_router.launch_router \
    --pd-disaggregation \
    --policy cache_aware \
    --service-discovery \
    --prefill-selector app=sglang component=prefill \
    --decode-selector app=sglang component=decode \
    --service-discovery-namespace sglang-system
224
225
226
227
228
229
230
231
232
233

# With separate routing policies:
python -m sglang_router.launch_router \
    --pd-disaggregation \
    --prefill-policy cache_aware \
    --decode-policy power_of_two \
    --service-discovery \
    --prefill-selector app=sglang component=prefill \
    --decode-selector app=sglang component=decode \
    --service-discovery-namespace sglang-system
234
235
236
237
238
239
240
241
242

# in lws case, such as tp16(1 leader pod, 1 worker pod)
python -m sglang_router.launch_router \
    --pd-disaggregation \
    --policy cache_aware \
    --service-discovery \
    --prefill-selector app=sglang component=prefill role=leader\
    --decode-selector app=sglang component=decode role=leader\
    --service-discovery-namespace sglang-system
243
244
```

Simo Lin's avatar
Simo Lin committed
245
#### Kubernetes Pod Configuration
246
247
248
249
250
251
252
253
254
255
256

**Prefill Server Pod:**
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: sglang-prefill-1
  labels:
    app: sglang
    component: prefill
  annotations:
Simo Lin's avatar
Simo Lin committed
257
    sglang.ai/bootstrap-port: "9001"  # Optional: Bootstrap port
258
259
260
261
262
263
spec:
  containers:
  - name: sglang
    image: lmsys/sglang:latest
    ports:
    - containerPort: 8000  # Main API port
Simo Lin's avatar
Simo Lin committed
264
    - containerPort: 9001  # Optional: Bootstrap port
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
```

**Decode Server Pod:**
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: sglang-decode-1
  labels:
    app: sglang
    component: decode
spec:
  containers:
  - name: sglang
    image: lmsys/sglang:latest
    ports:
Simo Lin's avatar
Simo Lin committed
281
    - containerPort: 8000
282
283
```

Simo Lin's avatar
Simo Lin committed
284
#### RBAC Configuration
285

286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
**Namespace-scoped (recommended):**
```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: sglang-router
  namespace: sglang-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: sglang-system
  name: sglang-router
rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: sglang-router
  namespace: sglang-system
subjects:
- kind: ServiceAccount
  name: sglang-router
  namespace: sglang-system
roleRef:
  kind: Role
  name: sglang-router
  apiGroup: rbac.authorization.k8s.io
```

Simo Lin's avatar
Simo Lin committed
319
#### Complete PD Example
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334

```bash
python -m sglang_router.launch_router \
    --pd-disaggregation \
    --policy cache_aware \
    --service-discovery \
    --prefill-selector app=sglang component=prefill environment=production \
    --decode-selector app=sglang component=decode environment=production \
    --service-discovery-namespace production \
    --host 0.0.0.0 \
    --port 8080 \
    --prometheus-host 0.0.0.0 \
    --prometheus-port 9090
```

335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
### API Key Authentication

The router supports multi-level API key authentication for both the router itself and individual workers:

#### Router API Key
Protect access to the router endpoints:

```bash
python -m sglang_router.launch_router \
    --api-key "your-router-api-key" \
    --worker-urls http://worker1:8000 http://worker2:8000
```

When router API key is set, clients must include the Bearer token:
```bash
curl -H "Authorization: Bearer your-router-api-key" http://localhost:8080/v1/chat/completions
```

#### Worker API Keys
Workers can have their own API keys for authentication:

```bash
# Workers specified in --worker-urls automatically inherit the router's API key
python -m sglang_router.launch_router \
    --api-key "shared-api-key" \
    --worker-urls http://worker1:8000 http://worker2:8000
# Both workers will use "shared-api-key" for authentication

# Adding workers dynamically WITHOUT inheriting router's key
curl -X POST http://localhost:8080/add_worker?url=http://worker3:8000
# WARNING: This worker has NO API key even though router has one!

# Adding workers with specific API keys dynamically
curl -X POST http://localhost:8080/add_worker?url=http://worker3:8000&api_key=worker3-specific-key
```

#### Security Configurations

1. **No Authentication** (default):
   - Router and workers accessible without keys
   - Suitable for trusted environments

2. **Router-only Authentication**:
   - Clients need key to access router
   - Router can access workers freely

3. **Worker-only Authentication**:
   - Router accessible without key
   - Each worker requires authentication
   ```bash
   # Add workers with their API keys
   curl -X POST http://localhost:8080/add_worker?url=http://worker:8000&api_key=worker-key
   ```

4. **Full Authentication**:
   - Router requires key from clients
   - Each worker requires its own key
   ```bash
   # Start router with its key
   python -m sglang_router.launch_router --api-key "router-key"

   # Add workers with their keys
   curl -H "Authorization: Bearer router-key" \
        -X POST http://localhost:8080/add_worker?url=http://worker:8000&api_key=worker-key
   ```

#### Important Notes

- **Initial Workers**: Workers specified in `--worker-urls` automatically inherit the router's API key
- **Dynamic Workers**: When adding workers via API, you must explicitly specify their API keys - they do NOT inherit the router's key
- **Security Warning**: When adding workers without API keys while the router has one configured, a warning will be logged
- **Common Pitfall**: If router and workers use the same API key, you must still specify the key when adding workers dynamically

Simo Lin's avatar
Simo Lin committed
408
### Command Line Arguments Reference
409

Simo Lin's avatar
Simo Lin committed
410
411
412
413
414
#### Service Discovery
- `--service-discovery`: Enable Kubernetes service discovery
- `--service-discovery-port`: Port for worker URLs (default: 8000)
- `--service-discovery-namespace`: Kubernetes namespace to watch
- `--selector`: Label selectors for regular mode (format: `key1=value1 key2=value2`)
415

Simo Lin's avatar
Simo Lin committed
416
417
418
419
420
421
#### PD Mode
- `--pd-disaggregation`: Enable Prefill-Decode disaggregated mode
- `--prefill`: Initial prefill server (format: `URL BOOTSTRAP_PORT`)
- `--decode`: Initial decode server URL
- `--prefill-selector`: Label selector for prefill pods
- `--decode-selector`: Label selector for decode pods
422
423
424
- `--policy`: Routing policy (`cache_aware`, `random`, `power_of_two`, `round_robin`)
- `--prefill-policy`: Separate routing policy for prefill nodes (optional, overrides `--policy` for prefill)
- `--decode-policy`: Separate routing policy for decode nodes (optional, overrides `--policy` for decode)
425

426
427
428
#### Authentication
- `--api-key`: API key for router authentication (clients must provide this as Bearer token)

429
430
431
432
433
#### Concurrency and Rate Limiting
- `--queue-size`: Size of the pending-request queue when concurrency limits are reached (default: 100; set to 0 to disable queuing)
- `--queue-timeout-secs`: Maximum time a request may wait in the queue before timing out (default: 60; must be > 0 when queue is enabled)
- `--rate-limit-tokens-per-second`: Override token bucket refill rate for rate limiting (defaults to `--max-concurrent-requests` when omitted)

Simo Lin's avatar
Simo Lin committed
434
## Development
435

Simo Lin's avatar
Simo Lin committed
436
### Build Process
437

Simo Lin's avatar
Simo Lin committed
438
439
440
441
442
443
```bash
# Build Rust project
cargo build

# Build Python binding (see Installation section above)
```
444

Simo Lin's avatar
Simo Lin committed
445
**Note**: When modifying Rust code, you must rebuild the wheel for changes to take effect.
446

Simo Lin's avatar
Simo Lin committed
447
### Troubleshooting
448

Simo Lin's avatar
Simo Lin committed
449
450
**VSCode Rust Analyzer Issues:**
Set `rust-analyzer.linkedProjects` to the absolute path of `Cargo.toml`:
451

Simo Lin's avatar
Simo Lin committed
452
453
454
455
456
```json
{
  "rust-analyzer.linkedProjects": ["/workspaces/sglang/sgl-router/Cargo.toml"]
}
```
457

Simo Lin's avatar
Simo Lin committed
458
459
460
461
462
### CI/CD Pipeline

The continuous integration pipeline includes comprehensive testing, benchmarking, and publishing:

#### Build & Test
463

Simo Lin's avatar
Simo Lin committed
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
1. **Build Wheels**: Uses `cibuildwheel` for manylinux x86_64 packages
2. **Build Source Distribution**: Creates source distribution for pip fallback
3. **Rust HTTP Server Benchmarking**: Performance testing of router overhead
4. **Basic Inference Testing**: End-to-end validation through the router
5. **PD Disaggregation Testing**: Benchmark and sanity checks for prefill-decode load balancing

#### Publishing
- **PyPI Publishing**: Wheels and source distributions are published only when the version changes in `pyproject.toml`
- **Container Images**: Docker images published using `/docker/Dockerfile.router`

## Features
- **High Performance**: Rust-based routing with connection pooling and optimized request handling
- **Advanced Load Balancing**: Multiple algorithms including:
  - **Cache-Aware**: Intelligent routing based on cache locality for optimal performance
  - **Power of Two**: Chooses the less loaded of two randomly selected workers
  - **Random**: Distributes requests randomly across available workers
  - **Round Robin**: Sequential distribution across workers in rotation
- **Prefill-Decode Disaggregation**: Specialized load balancing for separated prefill and decode servers
- **Service Discovery**: Automatic Kubernetes worker discovery and health management
- **Monitoring**: Comprehensive Prometheus metrics and structured logging
- **Scalability**: Handles thousands of concurrent connections with efficient resource utilization