# SGLang Router

The SGLang Router is a high-performance request distribution system that routes inference requests across multiple SGLang runtime instances. It features cache-aware load balancing, fault tolerance, and support for advanced deployment patterns including data parallelism and prefill-decode disaggregation.

## Key Features

- **Cache-Aware Load Balancing**: Optimizes cache utilization while maintaining balanced load distribution
- **Multiple Routing Policies**: Choose from random, round-robin, cache-aware, or power-of-two policies
- **Fault Tolerance**: Automatic retry and circuit breaker mechanisms for resilient operation
- **Dynamic Scaling**: Add or remove workers at runtime without service interruption
- **Kubernetes Integration**: Native service discovery and pod management
- **Prefill-Decode Disaggregation**: Load balancing across separate prefill and decode server pools
- **Prometheus Metrics**: Built-in observability and monitoring

## Installation

```bash
pip install sglang-router
```

## Quick Start

To see all available options:

```bash
python -m sglang_router.launch_server --help  # Co-launch router and workers
python -m sglang_router.launch_router --help  # Launch router only
```

## Deployment Modes

The router supports three primary deployment patterns:

1. **Co-launch Mode**: Router and workers launch together (simplest for single-node deployments)
2. **Separate Launch Mode**: Router and workers launch independently (best for multi-node setups)
3. **Prefill-Decode Disaggregation**: Specialized mode for disaggregated serving

### Mode 1: Co-launch Router and Workers

This mode launches the router and multiple worker instances with a single command. It is the simplest deployment option and replaces the `--dp-size` argument of the SGLang Runtime: here, `--dp-size` sets the number of worker instances launched behind the router.

```bash
# Launch router with 4 workers
python -m sglang_router.launch_server \
    --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
    --dp-size 4 \
    --host 0.0.0.0 \
    --port 30000
```

#### Sending Requests

Once the server is ready, send requests to the router endpoint:

```python
import requests

# Using the /generate endpoint
url = "http://localhost:30000/generate"
data = {
    "text": "What is the capital of France?",
    "sampling_params": {
        "temperature": 0.7,
        "max_new_tokens": 100
    }
}

response = requests.post(url, json=data)
print(response.json())

# OpenAI-compatible endpoint
url = "http://localhost:30000/v1/chat/completions"
data = {
    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "What is the capital of France?"}]
}

response = requests.post(url, json=data)
print(response.json())
```

### Mode 2: Separate Launch Mode

This mode is ideal for multi-node deployments where workers run on different machines.

#### Step 1: Launch Workers

On each worker node:

```bash
# Worker node 1
python -m sglang.launch_server \
    --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
    --host 0.0.0.0 \
    --port 8000

# Worker node 2
python -m sglang.launch_server \
    --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
    --host 0.0.0.0 \
    --port 8001
```

#### Step 2: Launch Router

On the router node:

```bash
python -m sglang_router.launch_router \
    --worker-urls http://worker1:8000 http://worker2:8001 \
    --host 0.0.0.0 \
    --port 30000 \
    --policy cache_aware  # or random, round_robin, power_of_two
```

### Mode 3: Prefill-Decode Disaggregation

This advanced mode runs the prefill and decode phases on separate server pools, and the router balances each pool with its own policy:

```bash
python -m sglang_router.launch_router \
    --pd-disaggregation \
    --prefill http://prefill1:8000 9000 \
    --prefill http://prefill2:8001 9001 \
    --decode http://decode1:8002 \
    --decode http://decode2:8003 \
    --prefill-policy cache_aware \
    --decode-policy round_robin
```

#### Understanding --prefill Arguments

The `--prefill` flag accepts URLs with optional bootstrap ports:
- `--prefill http://server:8000` - No bootstrap port
- `--prefill http://server:8000 9000` - Bootstrap port 9000
- `--prefill http://server:8000 none` - Explicitly no bootstrap port
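
The three forms above can be read as "a URL followed by an optional port or the literal `none`". A minimal sketch of that interpretation follows; `parse_prefill_args` is a hypothetical helper written for illustration and is not part of the router's code.

```python
from typing import List, Optional, Tuple


def parse_prefill_args(values: List[str]) -> Tuple[str, Optional[int]]:
    """Turn the values after one --prefill flag into (url, bootstrap_port).

    Hypothetical helper: it only mirrors the three documented forms.
    """
    if not values or len(values) > 2:
        raise ValueError("expected: URL [BOOTSTRAP_PORT|none]")
    url = values[0]
    # No second value and the literal "none" both mean "no bootstrap port".
    if len(values) == 1 or values[1] == "none":
        return url, None
    return url, int(values[1])


print(parse_prefill_args(["http://server:8000"]))
print(parse_prefill_args(["http://server:8000", "9000"]))
print(parse_prefill_args(["http://server:8000", "none"]))
```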

#### Policy Inheritance in PD Mode

The router resolves the routing policy for prefill and decode nodes as follows:

1. **Only `--policy` specified**: Both prefill and decode nodes use this policy
2. **`--policy` and `--prefill-policy` specified**: Prefill nodes use `--prefill-policy`, decode nodes use `--policy`
3. **`--policy` and `--decode-policy` specified**: Prefill nodes use `--policy`, decode nodes use `--decode-policy`
4. **All three specified**: Prefill nodes use `--prefill-policy`, decode nodes use `--decode-policy` (main `--policy` is ignored)
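
The four cases reduce to one rule: a role-specific flag wins, otherwise `--policy` applies. The sketch below expresses that rule; `resolve_pd_policies` is a hypothetical function name used for illustration, and the `cache_aware` default is taken from the configuration reference.

```python
from typing import Optional, Tuple


def resolve_pd_policies(
    policy: str = "cache_aware",
    prefill_policy: Optional[str] = None,
    decode_policy: Optional[str] = None,
) -> Tuple[str, str]:
    """Return (prefill_policy, decode_policy) after applying the fallback rule."""
    # A role-specific policy overrides --policy; otherwise fall back to it.
    return (prefill_policy or policy, decode_policy or policy)


# Case 2 from the list: --policy round_robin --prefill-policy cache_aware
print(resolve_pd_policies("round_robin", prefill_policy="cache_aware"))
```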

Example with mixed policies:
```bash
python -m sglang_router.launch_router \
    --pd-disaggregation \
    --prefill http://prefill1:8000 \
    --prefill http://prefill2:8000 \
    --decode http://decode1:8001 \
    --decode http://decode2:8001 \
    --policy round_robin \
    --prefill-policy cache_aware  # Prefill uses cache_aware and decode uses round_robin from --policy
```

#### PD Mode with Service Discovery

For Kubernetes deployments with separate prefill and decode server pools:

```bash
python -m sglang_router.launch_router \
    --pd-disaggregation \
    --service-discovery \
    --prefill-selector app=prefill-server tier=gpu \
    --decode-selector app=decode-server tier=cpu \
    --service-discovery-namespace production \
    --prefill-policy cache_aware \
    --decode-policy round_robin
```

## Dynamic Scaling

The router supports runtime scaling through REST APIs:

### Adding Workers

```bash
# Launch a new worker
python -m sglang.launch_server \
    --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
    --port 30001

# Add it to the router
curl -X POST "http://localhost:30000/add_worker?url=http://127.0.0.1:30001"
```

### Removing Workers

```bash
curl -X POST "http://localhost:30000/remove_worker?url=http://127.0.0.1:30001"
```

**Note**: When using cache-aware routing, removed workers are cleanly evicted from the routing tree and request queues.
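
The same two calls can be scripted, for example from an autoscaling hook. The sketch below uses only the Python standard library and the `/add_worker` and `/remove_worker` endpoints shown in the `curl` commands; the helper names `worker_endpoint` and `call_router` are illustrative, and the worker URL is percent-encoded in the query string, which is assumed to be equivalent to the raw form used with `curl`.

```python
import urllib.parse
import urllib.request

ROUTER_URL = "http://localhost:30000"  # router address used in the examples above


def worker_endpoint(action: str, worker_url: str, router_url: str = ROUTER_URL) -> str:
    """Build the /add_worker or /remove_worker URL with the worker as a query parameter."""
    if action not in ("add_worker", "remove_worker"):
        raise ValueError(f"unknown action: {action}")
    query = urllib.parse.urlencode({"url": worker_url})
    return f"{router_url}/{action}?{query}"


def call_router(action: str, worker_url: str) -> int:
    """Send the POST request to the router and return the HTTP status code."""
    req = urllib.request.Request(worker_endpoint(action, worker_url), method="POST")
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.status
```

With a router running, `call_router("add_worker", "http://127.0.0.1:30001")` registers the worker and `call_router("remove_worker", "http://127.0.0.1:30001")` removes it again.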

## Fault Tolerance

The router provides two fault tolerance mechanisms: automatic retries and a circuit breaker.

### Retry Configuration

```bash
python -m sglang_router.launch_router \
    --worker-urls http://worker1:8000 http://worker2:8001 \
    --retry-max-retries 3 \
    --retry-initial-backoff-ms 100 \
    --retry-max-backoff-ms 10000 \
    --retry-backoff-multiplier 2.0 \
    --retry-jitter-factor 0.1
```
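
To get a feel for how these values interact, the sketch below computes a retry delay schedule. It assumes the conventional formula (exponential growth by the multiplier, capped at the maximum, with symmetric random jitter); the router's exact formula is not stated in this document, so treat the output as an approximation. `backoff_schedule` is a hypothetical helper.

```python
import random
from typing import List, Optional


def backoff_schedule(
    max_retries: int = 3,
    initial_backoff_ms: int = 100,
    max_backoff_ms: int = 10000,
    multiplier: float = 2.0,
    jitter_factor: float = 0.1,
    rng: Optional[random.Random] = None,
) -> List[float]:
    """Return the delay in milliseconds before each retry attempt (assumed formula)."""
    rng = rng or random.Random()
    delays = []
    for attempt in range(max_retries):
        # Exponential growth, capped at the configured maximum.
        base = min(initial_backoff_ms * multiplier ** attempt, max_backoff_ms)
        # Symmetric jitter of up to +/- jitter_factor of the base delay.
        jitter = base * jitter_factor * rng.uniform(-1.0, 1.0)
        delays.append(base + jitter)
    return delays


# With jitter disabled the defaults give 100 ms, 200 ms, 400 ms.
print(backoff_schedule(jitter_factor=0.0))
```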

### Circuit Breaker

Protects against cascading failures:

```bash
python -m sglang_router.launch_router \
    --worker-urls http://worker1:8000 http://worker2:8001 \
    --cb-failure-threshold 5 \
    --cb-success-threshold 2 \
    --cb-timeout-duration-secs 30 \
    --cb-window-duration-secs 60
```

**Behavior**:
- A worker is marked unhealthy after `--cb-failure-threshold` consecutive failures
- It returns to service after `--cb-success-threshold` successful health checks
- Circuit breaker can be disabled with `--disable-circuit-breaker`
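
The behavior above follows the general circuit breaker pattern. The sketch below is a generic, simplified version of that pattern, not the router's implementation: the closed/open/half-open state names and the injected `clock` are assumptions, and the failure window (`--cb-window-duration-secs`) is left out for brevity.

```python
import time
from typing import Callable


class CircuitBreaker:
    """Generic circuit breaker sketch; state names are illustrative."""

    def __init__(self, failure_threshold: int = 5, success_threshold: int = 2,
                 timeout_secs: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.success_threshold = success_threshold
        self.timeout_secs = timeout_secs
        self.clock = clock
        self.state = "closed"
        self.failures = 0
        self.successes = 0
        self.opened_at = 0.0

    def allow_request(self) -> bool:
        # After the timeout, let trial traffic through (half-open).
        if self.state == "open" and self.clock() - self.opened_at >= self.timeout_secs:
            self.state = "half_open"
            self.successes = 0
        return self.state != "open"

    def record_success(self) -> None:
        if self.state == "half_open":
            self.successes += 1
            if self.successes >= self.success_threshold:
                self.state = "closed"
                self.failures = 0
        else:
            self.failures = 0  # a success breaks the run of consecutive failures

    def record_failure(self) -> None:
        if self.state == "half_open":
            self._open()  # a failed trial reopens the circuit immediately
            return
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self._open()

    def _open(self) -> None:
        self.state = "open"
        self.opened_at = self.clock()
        self.failures = 0
```

Injecting the clock keeps the sketch testable without sleeping; production code would normally rely on the default `time.monotonic`.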

## Routing Policies

The router supports multiple routing strategies:

### 1. Random Routing
Distributes requests randomly across workers.

```bash
--policy random
```

### 2. Round-Robin Routing
Cycles through workers in order.

```bash
--policy round_robin
```

### 3. Power of Two Choices
Samples two workers and routes to the less loaded one.

```bash
--policy power_of_two
```
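
As an illustration of the power-of-two idea, the sketch below samples two distinct workers and keeps the one with the lower load. The `loads` mapping (worker URL to in-flight request count) and the function name are assumptions made for the example; how the router measures load is not specified here.

```python
import random
from typing import Dict, Optional


def pick_power_of_two(loads: Dict[str, int], rng: Optional[random.Random] = None) -> str:
    """Sample two distinct workers at random and return the less loaded one."""
    rng = rng or random.Random()
    workers = list(loads)
    if not workers:
        raise ValueError("no workers available")
    if len(workers) == 1:
        return workers[0]
    first, second = rng.sample(workers, 2)
    return first if loads[first] <= loads[second] else second


print(pick_power_of_two({"http://worker1:8000": 5, "http://worker2:8001": 1}))
```

Because every sampled pair contains at least one worker that is not the busiest, the most loaded worker is never chosen when loads are distinct, at the cost of inspecting only two load values per request.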

### 4. Cache-Aware Load Balancing (Default)

This is the most sophisticated policy. It combines cache optimization with load balancing:

```bash
--policy cache_aware \
--cache-threshold 0.5 \
--balance-abs-threshold 32 \
--balance-rel-threshold 1.0001
```

#### How It Works

1. **Load Assessment**: Checks if the system is balanced
   - Imbalanced if: `(max_load - min_load) > balance_abs_threshold` AND `max_load > balance_rel_threshold * min_load`

2. **Routing Decision**:
   - **Balanced System**: Uses cache-aware routing
     - Routes to worker with highest prefix match if match > `cache_threshold`
     - Otherwise routes to worker with most available cache capacity
   - **Imbalanced System**: Uses shortest queue routing to the least busy worker

3. **Cache Management**:
   - Maintains approximate radix trees per worker
   - Periodically evicts LRU entries based on `--eviction-interval` and `--max-tree-size`
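
The load assessment and routing decision can be sketched as follows. This is a simplified model rather than the router's code: the per-worker inputs (`loads`, `match_ratio`, `tree_size`) are hypothetical, and "most available cache capacity" is approximated here by the smallest tracked tree.

```python
from typing import Dict


def is_imbalanced(loads: Dict[str, int], abs_threshold: int = 32,
                  rel_threshold: float = 1.0001) -> bool:
    """Both the absolute and the relative condition must hold."""
    max_load, min_load = max(loads.values()), min(loads.values())
    return (max_load - min_load) > abs_threshold and max_load > rel_threshold * min_load


def choose_worker(loads: Dict[str, int], match_ratio: Dict[str, float],
                  tree_size: Dict[str, int], cache_threshold: float = 0.5,
                  abs_threshold: int = 32, rel_threshold: float = 1.0001) -> str:
    if is_imbalanced(loads, abs_threshold, rel_threshold):
        # Imbalanced: shortest-queue routing to the least busy worker.
        return min(loads, key=loads.get)
    best = max(match_ratio, key=match_ratio.get)
    if match_ratio[best] > cache_threshold:
        # Balanced with a good prefix match: reuse that worker's cache.
        return best
    # Balanced without a good match: assumed proxy for "most available capacity".
    return min(tree_size, key=tree_size.get)


print(choose_worker({"a": 100, "b": 10}, {"a": 0.9, "b": 0.0}, {"a": 500, "b": 20}))
```

In the printed example worker `a` has the better prefix match, but the load gap of 90 exceeds both thresholds, so the request goes to `b`.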

### Data Parallelism Aware Routing

Enables fine-grained control over data parallel replicas:

```bash
--dp-aware \
--api-key your_api_key  # Required for worker authentication
```

This mode coordinates with SGLang's DP controller for optimized request distribution across data parallel ranks.

## Configuration Reference

### Core Settings

| Parameter                   | Type | Default     | Description                                                     |
|-----------------------------|------|-------------|-----------------------------------------------------------------|
| `--host`                    | str  | 127.0.0.1   | Router server host address                                      |
| `--port`                    | int  | 30000       | Router server port                                              |
| `--worker-urls`             | list | []          | Worker URLs for separate launch mode                            |
| `--policy`                  | str  | cache_aware | Routing policy (random, round_robin, cache_aware, power_of_two) |
| `--max-concurrent-requests` | int  | 64          | Maximum concurrent requests (rate limiting)                     |
| `--request-timeout-secs`    | int  | 600         | Request timeout in seconds                                      |
| `--max-payload-size`        | int  | 256MB       | Maximum request payload size                                    |

### Cache-Aware Routing Parameters

| Parameter                 | Type  | Default  | Description                                            |
|---------------------------|-------|----------|--------------------------------------------------------|
| `--cache-threshold`       | float | 0.5      | Minimum prefix match ratio for cache routing (0.0-1.0) |
| `--balance-abs-threshold` | int   | 32       | Absolute load difference threshold                     |
| `--balance-rel-threshold` | float | 1.0001   | Relative load ratio threshold                          |
| `--eviction-interval`     | int   | 60       | Seconds between cache eviction cycles                  |
| `--max-tree-size`         | int   | 16777216 | Maximum nodes in routing tree                          |

### Fault Tolerance Parameters

| Parameter                    | Type  | Default | Description                           |
|------------------------------|-------|---------|---------------------------------------|
| `--retry-max-retries`        | int   | 3       | Maximum retry attempts per request    |
| `--retry-initial-backoff-ms` | int   | 100     | Initial retry backoff in milliseconds |
| `--retry-max-backoff-ms`     | int   | 10000   | Maximum retry backoff in milliseconds |
| `--retry-backoff-multiplier` | float | 2.0     | Backoff multiplier between retries    |
| `--retry-jitter-factor`      | float | 0.1     | Random jitter factor for retries      |
| `--disable-retries`          | flag  | False   | Disable retry mechanism               |
| `--cb-failure-threshold`     | int   | 5       | Failures before circuit opens         |
| `--cb-success-threshold`     | int   | 2       | Successes to close circuit            |
| `--cb-timeout-duration-secs` | int   | 30      | Circuit breaker timeout duration      |
| `--cb-window-duration-secs`  | int   | 60      | Circuit breaker window duration       |
| `--disable-circuit-breaker`  | flag  | False   | Disable circuit breaker               |

### Prefill-Decode Disaggregation Parameters

| Parameter                         | Type | Default | Description                                           |
|-----------------------------------|------|---------|-------------------------------------------------------|
| `--pd-disaggregation`             | flag | False   | Enable PD disaggregated mode                          |
| `--prefill`                       | list | []      | Prefill server URLs with optional bootstrap ports     |
| `--decode`                        | list | []      | Decode server URLs                                    |
| `--prefill-policy`                | str  | None    | Routing policy for prefill nodes (overrides --policy) |
| `--decode-policy`                 | str  | None    | Routing policy for decode nodes (overrides --policy)  |
| `--worker-startup-timeout-secs`   | int  | 300     | Timeout for worker startup                            |
| `--worker-startup-check-interval` | int  | 10      | Interval between startup checks                       |

### Kubernetes Integration

| Parameter                       | Type | Default                  | Description                                          |
|---------------------------------|------|--------------------------|------------------------------------------------------|
| `--service-discovery`           | flag | False                    | Enable Kubernetes service discovery                  |
| `--selector`                    | list | []                       | Label selector for workers (key1=value1 key2=value2) |
| `--prefill-selector`            | list | []                       | Label selector for prefill servers in PD mode        |
| `--decode-selector`             | list | []                       | Label selector for decode servers in PD mode         |
| `--service-discovery-port`      | int  | 80                       | Port for discovered pods                             |
| `--service-discovery-namespace` | str  | None                     | Kubernetes namespace to watch                        |
| `--bootstrap-port-annotation`   | str  | sglang.ai/bootstrap-port | Annotation for bootstrap ports                       |

### Observability

| Parameter              | Type | Default   | Description                                           |
|------------------------|------|-----------|-------------------------------------------------------|
| `--prometheus-port`    | int  | 29000     | Prometheus metrics port                               |
| `--prometheus-host`    | str  | 127.0.0.1 | Prometheus metrics host                               |
| `--log-dir`            | str  | None      | Directory for log files                               |
| `--log-level`          | str  | info      | Logging level (debug, info, warning, error, critical) |
| `--request-id-headers` | list | None      | Custom headers for request tracing                    |

### CORS Configuration

| Parameter                | Type | Default | Description          |
|--------------------------|------|---------|----------------------|
| `--cors-allowed-origins` | list | []      | Allowed CORS origins |

## Advanced Features

### Kubernetes Service Discovery

Automatically discover and manage workers in Kubernetes:

#### Standard Mode
```bash
python -m sglang_router.launch_router \
    --service-discovery \
    --selector app=sglang-worker env=prod \
    --service-discovery-namespace production \
    --service-discovery-port 8000
```
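
The `--selector` values are `key=value` pairs. Under standard Kubernetes equality-based label matching, a pod is selected only when it carries every listed label. The sketch below illustrates that rule with two hypothetical helpers; it does not talk to a cluster and is not the router's discovery code.

```python
from typing import Dict, List


def parse_selector(pairs: List[str]) -> Dict[str, str]:
    """Turn ["app=sglang-worker", "env=prod"] into a label dictionary."""
    selector = {}
    for pair in pairs:
        key, sep, value = pair.partition("=")
        if not sep or not key:
            raise ValueError(f"expected key=value, got {pair!r}")
        selector[key] = value
    return selector


def matches(selector: Dict[str, str], pod_labels: Dict[str, str]) -> bool:
    """A pod matches when every selector label is present with the same value."""
    return all(pod_labels.get(key) == value for key, value in selector.items())


selector = parse_selector(["app=sglang-worker", "env=prod"])
print(matches(selector, {"app": "sglang-worker", "env": "prod", "zone": "a"}))
print(matches(selector, {"app": "sglang-worker", "env": "staging"}))
```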

#### Prefill-Decode Disaggregation Mode
```bash
python -m sglang_router.launch_router \
    --pd-disaggregation \
    --service-discovery \
    --prefill-selector app=prefill-server env=prod \
    --decode-selector app=decode-server env=prod \
    --service-discovery-namespace production
```

**Note**: The `--bootstrap-port-annotation` (default: `sglang.ai/bootstrap-port`) is used to discover bootstrap ports for prefill servers in PD mode. Prefill pods should have this annotation set to their bootstrap port value.

### Prometheus Metrics

Expose metrics for monitoring:

```bash
python -m sglang_router.launch_router \
    --worker-urls http://worker1:8000 http://worker2:8001 \
    --prometheus-port 29000 \
    --prometheus-host 0.0.0.0
```

Metrics are then available at `http://localhost:29000/metrics`.

### Request Tracing

Enable request ID tracking:

```bash
python -m sglang_router.launch_router \
    --worker-urls http://worker1:8000 http://worker2:8001 \
    --request-id-headers x-request-id x-trace-id
```
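
On the client side, a request can carry its own ID in one of the configured headers. The sketch below builds such a request with the standard library. It assumes the router picks up the ID from the `x-request-id` header configured above, which this document does not spell out; `build_traced_request` is a hypothetical helper, and the payload follows the `/generate` example from the Sending Requests section.

```python
import json
import urllib.request
import uuid
from typing import Optional, Tuple


def build_traced_request(
    prompt: str,
    request_id: Optional[str] = None,
    router_url: str = "http://localhost:30000",
) -> Tuple[urllib.request.Request, str]:
    """Build a /generate request that carries an x-request-id header."""
    request_id = request_id or str(uuid.uuid4())
    payload = {"text": prompt, "sampling_params": {"max_new_tokens": 32}}
    req = urllib.request.Request(
        f"{router_url}/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "x-request-id": request_id},
        method="POST",
    )
    return req, request_id


req, request_id = build_traced_request("What is the capital of France?")
print(request_id, req.full_url)
```

Passing `req` to `urllib.request.urlopen` sends it; keeping `request_id` lets the client correlate its own logs with the router's.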

## Troubleshooting

### Common Issues

1. **Workers not connecting**: Ensure workers are fully initialized before starting the router. Use `--worker-startup-timeout-secs` to increase wait time.

2. **High latency**: Check if cache-aware routing is causing imbalance. Try adjusting `--balance-abs-threshold` and `--balance-rel-threshold`.

3. **Memory growth**: Reduce `--max-tree-size` or decrease `--eviction-interval` for more aggressive cache cleanup.

4. **Circuit breaker triggering frequently**: Increase `--cb-failure-threshold` or extend `--cb-window-duration-secs`.

### Debug Mode

Enable detailed logging:

```bash
python -m sglang_router.launch_router \
    --worker-urls http://worker1:8000 http://worker2:8001 \
    --log-level debug \
    --log-dir ./router_logs
```