# SGLang Router

The SGLang Router is a high-performance request distribution system that routes inference requests across multiple SGLang runtime instances. It features cache-aware load balancing, fault tolerance, and support for advanced deployment patterns including data parallelism and prefill-decode disaggregation.

## Key Features

- **Cache-Aware Load Balancing**: Optimizes cache utilization while maintaining balanced load distribution
- **Multiple Routing Policies**: Choose from random, round-robin, cache-aware, or power-of-two policies
- **Fault Tolerance**: Automatic retry and circuit breaker mechanisms for resilient operation
- **Dynamic Scaling**: Add or remove workers at runtime without service interruption
- **Kubernetes Integration**: Native service discovery and pod management
- **Prefill-Decode Disaggregation**: Load balancing for disaggregated prefill/decode serving
- **Prometheus Metrics**: Built-in observability and monitoring
- **Rate Limiter**: Token-bucket rate limiter to shield workers from overload

## Installation

```bash
pip install sglang-router
```

## Quick Start

To see all available options:

```bash
python -m sglang_router.launch_server --help  # Co-launch router and workers
python -m sglang_router.launch_router --help  # Launch router only
```

## Deployment Modes

The router supports three primary deployment patterns:

1. **Co-launch Mode**: Router and workers launch together (simplest for single-node deployments)
2. **Separate Launch Mode**: Router and workers launch independently (best for multi-node setups)
3. **Prefill-Decode Disaggregation**: Specialized mode for disaggregated serving

### Mode 1: Co-launch Router and Workers

This mode launches both the router and multiple worker instances with a single command. It is the simplest deployment option and acts as a drop-in replacement for data parallelism via the SGLang Runtime's `--dp-size` argument.

```bash
# Launch router with 4 workers
python -m sglang_router.launch_server \
    --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
    --dp-size 4 \
    --host 0.0.0.0 \
    --port 30000
```

#### Sending Requests

Once the server is ready, send requests to the router endpoint:

```python
import requests

# Using the /generate endpoint
url = "http://localhost:30000/generate"
data = {
    "text": "What is the capital of France?",
    "sampling_params": {
        "temperature": 0.7,
        "max_new_tokens": 100
    }
}

response = requests.post(url, json=data)
print(response.json())

# OpenAI-compatible endpoint
url = "http://localhost:30000/v1/chat/completions"
data = {
    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "What is the capital of France?"}]
}

response = requests.post(url, json=data)
print(response.json())
```

### Mode 2: Separate Launch Mode

This mode is ideal for multi-node deployments where workers run on different machines.

#### Step 1: Launch Workers

On each worker node:

```bash
# Worker node 1
python -m sglang.launch_server \
    --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
    --host 0.0.0.0 \
    --port 8000

# Worker node 2
python -m sglang.launch_server \
    --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
    --host 0.0.0.0 \
    --port 8001
```

#### Step 2: Launch Router

On the router node:

```bash
python -m sglang_router.launch_router \
    --worker-urls http://worker1:8000 http://worker2:8001 \
    --host 0.0.0.0 \
    --port 30000 \
    --policy cache_aware  # or random, round_robin, power_of_two
```
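
Once the router is up, a quick smoke test confirms that requests are being forwarded to the workers. Below is a minimal sketch using the documented `/generate` endpoint; the host, port, and prompt are illustrative:

```python
import requests

# Send one request through the router; it is forwarded to one of the workers.
router_url = "http://localhost:30000/generate"  # adjust to your router address
payload = {
    "text": "Router smoke test: what is 2 + 2?",
    "sampling_params": {"max_new_tokens": 8},
}

response = requests.post(router_url, json=payload, timeout=60)
response.raise_for_status()
print(response.json())
```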

### Mode 3: Prefill-Decode Disaggregation

This advanced mode separates prefill and decode operations for optimized performance:

```bash
python -m sglang_router.launch_router \
    --pd-disaggregation \
    --prefill http://prefill1:8000 9000 \
    --prefill http://prefill2:8001 9001 \
    --decode http://decode1:8002 \
    --decode http://decode2:8003 \
    --prefill-policy cache_aware \
    --decode-policy round_robin
```

#### Understanding --prefill Arguments

The `--prefill` flag accepts URLs with optional bootstrap ports:
- `--prefill http://server:8000` - No bootstrap port
- `--prefill http://server:8000 9000` - Bootstrap port 9000
- `--prefill http://server:8000 none` - Explicitly no bootstrap port

#### Policy Inheritance in PD Mode

The router resolves the routing policies for prefill and decode nodes as follows:

1. **Only `--policy` specified**: Both prefill and decode nodes use this policy
2. **`--policy` and `--prefill-policy` specified**: Prefill nodes use `--prefill-policy`, decode nodes use `--policy`
3. **`--policy` and `--decode-policy` specified**: Prefill nodes use `--policy`, decode nodes use `--decode-policy`
4. **All three specified**: Prefill nodes use `--prefill-policy`, decode nodes use `--decode-policy` (main `--policy` is ignored)

Example with mixed policies:
```bash
python -m sglang_router.launch_router \
    --pd-disaggregation \
    --prefill http://prefill1:8000 \
    --prefill http://prefill2:8000 \
    --decode http://decode1:8001 \
    --decode http://decode2:8001 \
    --policy round_robin \
    --prefill-policy cache_aware  # Prefill uses cache_aware and decode uses round_robin from --policy
```

#### PD Mode with Service Discovery

For Kubernetes deployments with separate prefill and decode server pools:

```bash
python -m sglang_router.launch_router \
    --pd-disaggregation \
    --service-discovery \
    --prefill-selector app=prefill-server tier=gpu \
    --decode-selector app=decode-server tier=cpu \
    --service-discovery-namespace production \
    --prefill-policy cache_aware \
    --decode-policy round_robin
```

## Dynamic Scaling

The router supports runtime scaling through REST APIs:

### Adding Workers

```bash
# Launch a new worker
python -m sglang.launch_server \
    --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
    --port 30001

# Add it to the router
curl -X POST "http://localhost:30000/add_worker?url=http://127.0.0.1:30001"
```

### Removing Workers

```bash
curl -X POST "http://localhost:30000/remove_worker?url=http://127.0.0.1:30001"
```

**Note**: When using cache-aware routing, removed workers are cleanly evicted from the routing tree and request queues.
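
The scaling endpoints can also be driven programmatically, for example from an autoscaler hook. Here is a minimal sketch against the documented `/add_worker` and `/remove_worker` endpoints; the router and worker URLs are illustrative:

```python
import requests

ROUTER = "http://localhost:30000"
WORKER = "http://127.0.0.1:30001"  # a worker that was just launched

# Register the new worker with the router.
resp = requests.post(f"{ROUTER}/add_worker", params={"url": WORKER})
print("add_worker:", resp.status_code, resp.text)

# Later, remove it from the pool before shutting the worker down.
resp = requests.post(f"{ROUTER}/remove_worker", params={"url": WORKER})
print("remove_worker:", resp.status_code, resp.text)
```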

## Fault Tolerance

The router includes comprehensive fault tolerance mechanisms:

### Retry Configuration

```bash
python -m sglang_router.launch_router \
    --worker-urls http://worker1:8000 http://worker2:8001 \
    --retry-max-retries 3 \
    --retry-initial-backoff-ms 100 \
    --retry-max-backoff-ms 10000 \
    --retry-backoff-multiplier 2.0 \
    --retry-jitter-factor 0.1
```
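
For intuition, these flags describe an exponential backoff with jitter. The sketch below shows the delay schedule they imply, assuming the conventional `initial * multiplier**attempt` formula capped at the maximum and perturbed by the jitter factor (the router's exact jitter handling may differ):

```python
import random

def retry_backoff_ms(attempt: int,
                     initial_ms: int = 100,
                     max_ms: int = 10_000,
                     multiplier: float = 2.0,
                     jitter_factor: float = 0.1) -> float:
    """Approximate delay before retry `attempt` (0-based), in milliseconds."""
    base = min(initial_ms * multiplier ** attempt, max_ms)
    return base * (1 + jitter_factor * random.random())

# --retry-max-retries 3 means at most three retries per request.
for attempt in range(3):
    print(f"retry {attempt + 1}: ~{retry_backoff_ms(attempt):.0f} ms")
```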

### Circuit Breaker

Protects against cascading failures:

```bash
python -m sglang_router.launch_router \
    --worker-urls http://worker1:8000 http://worker2:8001 \
    --cb-failure-threshold 5 \
    --cb-success-threshold 2 \
    --cb-timeout-duration-secs 30 \
    --cb-window-duration-secs 60
```

```mermaid
flowchart TD
    Closed(["Closed"])
    Open(["Open"])
    HalfOpen(["HalfOpen"])

    Closed -- "Consecutive Failures >=<br/>cb-failure-threshold" --> Open;
    Closed --> HalfOpen;
    linkStyle 1 stroke:transparent;
    Open -- "After cb-timeout-duration-secs" --> HalfOpen;
    HalfOpen -- "Fail any test request" --> Open;
    HalfOpen -- "After cb-success-threshold<br/>test requests" --> Closed;
    Closed -- "Failures < cb-failure-threshold" --> Closed;
    style Closed fill:#00C853,color:#000000
    style Open fill:#D50000,color:#000000
    style HalfOpen fill:#FFD600,color:#000000
    linkStyle 1 stroke:transparent,fill:none
```

**Behavior**:
- Worker is marked unhealthy after `cb-failure-threshold` consecutive failures
- Returns to service after `cb-success-threshold` successful test requests in the half-open state
- Circuit breaker can be disabled with `--disable-circuit-breaker`
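
The transitions above can be captured in a few lines. This is a toy model of the Closed/Open/HalfOpen logic that the `--cb-*` flags configure, not the router's actual implementation:

```python
class ToyCircuitBreaker:
    """Simplified Closed -> Open -> HalfOpen -> Closed state machine."""

    def __init__(self, failure_threshold=5, success_threshold=2, timeout_secs=30):
        self.failure_threshold = failure_threshold
        self.success_threshold = success_threshold
        self.timeout_secs = timeout_secs
        self.state, self.failures, self.successes = "closed", 0, 0
        self.opened_at = 0.0

    def record(self, ok: bool, now: float) -> str:
        """Feed one request outcome observed at time `now` (seconds)."""
        if self.state == "open" and now - self.opened_at >= self.timeout_secs:
            self.state, self.successes = "half_open", 0      # allow test requests
        if self.state == "closed":
            self.failures = 0 if ok else self.failures + 1
            if self.failures >= self.failure_threshold:
                self.state, self.opened_at = "open", now     # worker marked unhealthy
        elif self.state == "half_open":
            if not ok:
                self.state, self.opened_at = "open", now     # failed a test request
            else:
                self.successes += 1
                if self.successes >= self.success_threshold:
                    self.state, self.failures = "closed", 0  # back in service
        return self.state
```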

### Rate Limiter

Use the token-bucket rate limiter to cap requests before they overwhelm downstream workers.

- Enable rate limiting by setting `--max-concurrent-requests` to a positive integer. A bucket with that many tokens (concurrent leases) is created; `-1` keeps it disabled.
- Optionally override the refill rate with `--rate-limit-tokens-per-second`. If omitted, the refill rate matches `max-concurrent-requests`.
- Overflow traffic can wait in a FIFO queue controlled by:
  - `--queue-size`: pending-request buffer (0 disables queuing; defaults to 100).
  - `--queue-timeout-secs`: maximum wait time for queued requests before returning `429` (defaults to 60 seconds).

Example:

```bash
python -m sglang_router.launch_router \
    --worker-urls http://worker1:8000 http://worker2:8001 \
    --max-concurrent-requests 256 \
    --rate-limit-tokens-per-second 512 \
    --queue-size 128 \
    --queue-timeout-secs 30
```

**Behavior**:

This configuration allows up to 256 concurrent requests, refills 512 tokens (requests) per second, and keeps up to 128 overflow requests queued for 30 seconds before timing out.

**Responses**:
- Returns **429** when the router cannot enqueue the request (queue disabled or full).
- Returns **408** when a queued request waits longer than `--queue-timeout-secs` or no token becomes available before the timeout.
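
Clients should treat both codes as back-pressure signals and retry after a delay. A minimal client-side sketch (the endpoint and payload follow the earlier examples; the retry timing is illustrative):

```python
import time
import requests

def generate_with_backoff(payload, url="http://localhost:30000/generate", attempts=5):
    """Retry on 429 (could not enqueue) and 408 (queue timeout) responses."""
    for attempt in range(attempts):
        resp = requests.post(url, json=payload, timeout=120)
        if resp.status_code not in (429, 408):
            resp.raise_for_status()
            return resp.json()
        time.sleep(min(2 ** attempt, 30))  # back off while the router sheds load
    raise RuntimeError("router kept rejecting the request")

print(generate_with_backoff({"text": "hello", "sampling_params": {"max_new_tokens": 16}}))
```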

## Routing Policies

The router supports multiple routing strategies:

### 1. Random Routing
Distributes requests randomly across workers.

```bash
--policy random
```

### 2. Round-Robin Routing
Cycles through workers in order.

```bash
--policy round_robin
```

### 3. Power of Two Choices
Samples two workers and routes to the less loaded one.

```bash
--policy power_of_two
```
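
Conceptually, the policy samples two workers uniformly at random and routes to whichever currently has the lower load. A minimal sketch of that selection step (the load numbers are illustrative; the router tracks load internally):

```python
import random

def power_of_two_choice(loads: dict) -> str:
    """Pick two random workers and return the less loaded one."""
    a, b = random.sample(list(loads), 2)
    return a if loads[a] <= loads[b] else b

loads = {"http://worker1:8000": 7, "http://worker2:8001": 3, "http://worker3:8002": 5}
print(power_of_two_choice(loads))
```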

### 4. Cache-Aware Load Balancing (Default)

The most sophisticated policy that combines cache optimization with load balancing:

```bash
--policy cache_aware \
--cache-threshold 0.5 \
--balance-abs-threshold 32 \
--balance-rel-threshold 1.0001
```

#### How It Works

1. **Load Assessment**: Checks if the system is balanced
   - Imbalanced if: `(max_load - min_load) > balance_abs_threshold` AND `max_load > balance_rel_threshold * min_load`

2. **Routing Decision**:
   - **Balanced System**: Uses cache-aware routing
     - Routes to worker with highest prefix match if match > `cache_threshold`
     - Otherwise routes to worker with most available cache capacity
   - **Imbalanced System**: Uses shortest-queue routing to the least busy worker (see the sketch after this list)

3. **Cache Management**:
   - Maintains approximate radix trees per worker
   - Periodically evicts LRU entries based on `--eviction-interval-secs` and `--max-tree-size`
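
Putting the load-assessment and routing steps together, the decision flow looks roughly like this. It is a simplified model of the documented behavior (prefix matching and load tracking are abstracted into plain dictionaries), not the router's actual code:

```python
def choose_worker(loads, prefix_match, cache_capacity,
                  cache_threshold=0.5,
                  balance_abs_threshold=32,
                  balance_rel_threshold=1.0001):
    """All three arguments are dicts keyed by worker URL."""
    max_load, min_load = max(loads.values()), min(loads.values())
    imbalanced = ((max_load - min_load) > balance_abs_threshold
                  and max_load > balance_rel_threshold * min_load)

    if imbalanced:
        # Imbalanced system: shortest-queue routing to the least busy worker.
        return min(loads, key=loads.get)

    # Balanced system: prefer the best prefix match if it clears the threshold ...
    best = max(prefix_match, key=prefix_match.get)
    if prefix_match[best] > cache_threshold:
        return best
    # ... otherwise pick the worker with the most available cache capacity.
    return max(cache_capacity, key=cache_capacity.get)
```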

### Data Parallelism Aware Routing

Enables fine-grained control over data parallel replicas:

```bash
--dp-aware \
--api-key your_api_key  # Required for worker authentication
```

This mode coordinates with SGLang's DP controller for optimized request distribution across data parallel ranks.

## Configuration Reference

### Core Settings

| Parameter                   | Type | Default     | Description                                                     |
| --------------------------- | ---- | ----------- | --------------------------------------------------------------- |
| `--host`                    | str  | 127.0.0.1   | Router server host address                                      |
| `--port`                    | int  | 30000       | Router server port                                              |
| `--worker-urls`             | list | []          | Worker URLs for separate launch mode                            |
| `--policy`                  | str  | cache_aware | Routing policy (random, round_robin, cache_aware, power_of_two) |
| `--max-concurrent-requests` | int  | 64          | Maximum concurrent requests (rate limiting)                     |
| `--request-timeout-secs`    | int  | 600         | Request timeout in seconds                                      |
| `--max-payload-size`        | int  | 256MB       | Maximum request payload size                                    |

### Cache-Aware Routing Parameters

| Parameter                  | Type  | Default  | Description                                            |
| -------------------------- | ----- | -------- | ------------------------------------------------------ |
| `--cache-threshold`        | float | 0.5      | Minimum prefix match ratio for cache routing (0.0-1.0) |
| `--balance-abs-threshold`  | int   | 32       | Absolute load difference threshold                     |
| `--balance-rel-threshold`  | float | 1.0001   | Relative load ratio threshold                          |
| `--eviction-interval-secs` | int   | 60       | Seconds between cache eviction cycles                  |
| `--max-tree-size`          | int   | 16777216 | Maximum nodes in routing tree                          |

### Fault Tolerance Parameters

| Parameter                    | Type  | Default | Description                           |
| ---------------------------- | ----- | ------- | ------------------------------------- |
| `--retry-max-retries`        | int   | 3       | Maximum retry attempts per request    |
| `--retry-initial-backoff-ms` | int   | 100     | Initial retry backoff in milliseconds |
| `--retry-max-backoff-ms`     | int   | 10000   | Maximum retry backoff in milliseconds |
| `--retry-backoff-multiplier` | float | 2.0     | Backoff multiplier between retries    |
| `--retry-jitter-factor`      | float | 0.1     | Random jitter factor for retries      |
| `--disable-retries`          | flag  | False   | Disable retry mechanism               |
| `--cb-failure-threshold`     | int   | 5       | Failures before circuit opens         |
| `--cb-success-threshold`     | int   | 2       | Successes to close circuit            |
| `--cb-timeout-duration-secs` | int   | 30      | Circuit breaker timeout duration      |
| `--cb-window-duration-secs`  | int   | 60      | Circuit breaker window duration       |
| `--disable-circuit-breaker`  | flag  | False   | Disable circuit breaker               |

### Prefill-Decode Disaggregation Parameters

| Parameter                         | Type | Default | Description                                           |
| --------------------------------- | ---- | ------- | ----------------------------------------------------- |
| `--pd-disaggregation`             | flag | False   | Enable PD disaggregated mode                          |
| `--prefill`                       | list | []      | Prefill server URLs with optional bootstrap ports     |
| `--decode`                        | list | []      | Decode server URLs                                    |
| `--prefill-policy`                | str  | None    | Routing policy for prefill nodes (overrides --policy) |
| `--decode-policy`                 | str  | None    | Routing policy for decode nodes (overrides --policy)  |
| `--worker-startup-timeout-secs`   | int  | 300     | Timeout for worker startup                            |
| `--worker-startup-check-interval` | int  | 10      | Interval between startup checks                       |

### Kubernetes Integration

| Parameter                       | Type | Default                  | Description                                          |
| ------------------------------- | ---- | ------------------------ | ---------------------------------------------------- |
| `--service-discovery`           | flag | False                    | Enable Kubernetes service discovery                  |
| `--selector`                    | list | []                       | Label selector for workers (key1=value1 key2=value2) |
| `--prefill-selector`            | list | []                       | Label selector for prefill servers in PD mode        |
| `--decode-selector`             | list | []                       | Label selector for decode servers in PD mode         |
| `--service-discovery-port`      | int  | 80                       | Port for discovered pods                             |
| `--service-discovery-namespace` | str  | None                     | Kubernetes namespace to watch                        |
| `--bootstrap-port-annotation`   | str  | sglang.ai/bootstrap-port | Annotation for bootstrap ports                       |

### Observability

| Parameter              | Type | Default   | Description                                           |
| ---------------------- | ---- | --------- | ----------------------------------------------------- |
| `--prometheus-port`    | int  | 29000     | Prometheus metrics port                               |
| `--prometheus-host`    | str  | 127.0.0.1 | Prometheus metrics host                               |
| `--log-dir`            | str  | None      | Directory for log files                               |
| `--log-level`          | str  | info      | Logging level (debug, info, warning, error, critical) |
| `--request-id-headers` | list | None      | Custom headers for request tracing                    |

### CORS Configuration

| Parameter                | Type | Default | Description          |
| ------------------------ | ---- | ------- | -------------------- |
| `--cors-allowed-origins` | list | []      | Allowed CORS origins |

## Advanced Features

### Kubernetes Service Discovery

Automatically discover and manage workers in Kubernetes:

#### Standard Mode
```bash
python -m sglang_router.launch_router \
    --service-discovery \
    --selector app=sglang-worker env=prod \
    --service-discovery-namespace production \
    --service-discovery-port 8000
```

#### Prefill-Decode Disaggregation Mode
```bash
python -m sglang_router.launch_router \
    --pd-disaggregation \
    --service-discovery \
    --prefill-selector app=prefill-server env=prod \
    --decode-selector app=decode-server env=prod \
    --service-discovery-namespace production
```

**Note**: The `--bootstrap-port-annotation` (default: `sglang.ai/bootstrap-port`) is used to discover bootstrap ports for prefill servers in PD mode. Prefill pods should have this annotation set to their bootstrap port value.

### Prometheus Metrics

Expose metrics for monitoring:

```bash
python -m sglang_router.launch_router \
    --worker-urls http://worker1:8000 http://worker2:8001 \
    --prometheus-port 29000 \
    --prometheus-host 0.0.0.0
```

Metrics are available at `http://localhost:29000/metrics`.

### Request Tracing

Enable request ID tracking:

```bash
python -m sglang_router.launch_router \
    --worker-urls http://worker1:8000 http://worker2:8001 \
    --request-id-headers x-request-id x-trace-id
```
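
With these headers configured, a client can attach its own ID to each request and correlate router logs with upstream traces. A minimal client-side sketch (the header name matches the launch flag above; the ID value is illustrative):

```python
import uuid
import requests

request_id = str(uuid.uuid4())
response = requests.post(
    "http://localhost:30000/generate",
    json={"text": "traced request", "sampling_params": {"max_new_tokens": 8}},
    headers={"x-request-id": request_id},  # one of the configured trace headers
)
print(request_id, response.status_code)
```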

## Observability

When Prometheus is enabled, the router provides several key metrics for observability.

| Metric Name                            | Type      | Description                                                                                          |
|:---------------------------------------|:----------|:-----------------------------------------------------------------------------------------------------|
| `sgl_router_requests_total`            | Counter   | Total number of requests received by the router's API endpoint. Useful for tracking overall traffic. |
| `sgl_router_processed_requests_total`  | Counter   | Total requests processed, labeled by `worker`. Critical for spotting load imbalances.                |
| `sgl_router_active_workers`            | Gauge     | The current number of healthy workers in the routing pool. Essential for alerting.                   |
| `sgl_router_running_requests`          | Gauge     | The number of currently in-flight requests, labeled by `worker`. For monitoring real-time load.      |
| `sgl_router_cache_hits_total`          | Counter   | Total requests routed to a worker with a matching prefix cache.                                      |
| `sgl_router_cache_misses_total`        | Counter   | Total requests that could not be routed based on cache locality.                                     |
| `sgl_router_generate_duration_seconds` | Histogram | Tracks end-to-end request latency. Use this to monitor performance (e.g., p95/p99).                  |
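
For a quick look without a full Prometheus stack, the metrics endpoint can be scraped directly. A minimal sketch that prints only the router series listed above (the URL assumes the Prometheus host/port shown earlier):

```python
import requests

metrics_text = requests.get("http://localhost:29000/metrics").text

# Print the sgl_router_* series; "# HELP" / "# TYPE" comment lines are skipped.
for line in metrics_text.splitlines():
    if line.startswith("sgl_router_"):
        print(line)
```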

## Troubleshooting

### Common Issues

1. **Workers not connecting**: Ensure workers are fully initialized before starting the router. Use `--worker-startup-timeout-secs` to increase wait time.

2. **High latency**:
   - A common cause is load imbalance: check the `sgl_router_processed_requests_total` metric grouped by `worker`.
   - Cache-aware routing might be prioritizing cache hits too aggressively.
   - Try adjusting `--balance-abs-threshold` and `--balance-rel-threshold`.

3. **Memory growth**: Reduce `--max-tree-size` or decrease `--eviction-interval-secs` for more aggressive cache cleanup.

4. **Circuit breaker triggering frequently**: Increase `--cb-failure-threshold` or extend `--cb-window-duration-secs`.

### Debug Mode

Enable detailed logging:

```bash
python -m sglang_router.launch_router \
    --worker-urls http://worker1:8000 http://worker2:8001 \
    --log-level debug \
    --log-dir ./router_logs
```