# Router for Data Parallelism

Given multiple GPUs running multiple SGLang Runtimes, SGLang Router distributes the requests to different Runtimes with its unique cache-aware load-balancing algorithm.

The router is an independent Python package, and it can be used as a drop-in replacement for the OpenAI API.

## Installation

```bash
pip install sglang-router
```


```bash
python -m sglang_router.launch_server --help
python -m sglang_router.launch_router --help
```

The router supports two working modes:

1. Co-launch Router and Runtimes
2. Launch Runtimes and Router separately

## Co-launch Router and Runtimes

This mode is a drop-in replacement for the existing `--dp-size` argument of SGLang Runtime. Under the hood, it uses multiple processes to launch the workers, waits for them to be ready, and then connects the router to all of them.

```bash
python -m sglang_router.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --dp-size 4
```

After the server is ready, you can send requests to the router in the same way as you would send them to a single worker.

Please adjust the batch size accordingly to achieve maximum throughput.

```python
import requests

url = "http://localhost:30000/generate"
data = {"text": "What is the capital of France?"}

response = requests.post(url, json=data)
print(response.json())
```

## Launch Runtimes and Router Separately

This is useful for multi-node DP. First, launch workers on multiple nodes, then launch a router on the main node, and connect the router to all workers.

```bash
python -m sglang_router.launch_router --worker-urls http://worker_url_1 http://worker_url_2
```

## Dynamic Scaling APIs

We offer `/add_worker` and `/remove_worker` APIs to dynamically add or remove workers from the router.

- `/add_worker`

Usage:

```bash
curl -X POST "http://localhost:30000/add_worker?url=http://worker_url_1"
```

Example:

```bash
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --port 30001

curl -X POST "http://localhost:30000/add_worker?url=http://127.0.0.1:30001"

# Successfully added worker: http://127.0.0.1:30001
```

- `/remove_worker`

Usage:

```bash
curl -X POST "http://localhost:30000/remove_worker?url=http://worker_url_1"
```

Example:

```bash
curl -X POST "http://localhost:30000/remove_worker?url=http://127.0.0.1:30001"

# Successfully removed worker: http://127.0.0.1:30001
```

Note:

- For the cache-aware router, the worker is also removed from the approximate tree and the request queues.

## Fault Tolerance

We provide retry-based fault tolerance.

1. If the request to a worker fails `max_worker_retries` times, the router removes that worker from its worker list and moves on to the next worker.
2. If the total number of retries exceeds `max_total_retries`, the router will return an error.

Note:

- `max_worker_retries` is 3 and `max_total_retries` is 6 by default.
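
The retry policy above can be sketched as follows. This is a minimal illustration, not the router's actual implementation; `route_with_retries`, `send_to_worker`, and the failure model are hypothetical names chosen for this sketch:

```python
MAX_WORKER_RETRIES = 3  # per-worker retry budget (default)
MAX_TOTAL_RETRIES = 6   # global retry budget (default)

def route_with_retries(workers, send_to_worker):
    """Try workers in order; drop a worker after MAX_WORKER_RETRIES
    failures, and give up once MAX_TOTAL_RETRIES failures accumulate."""
    workers = list(workers)  # copy so we can remove failed workers
    total_failures = 0
    while workers:
        worker = workers[0]
        for _ in range(MAX_WORKER_RETRIES):
            try:
                return send_to_worker(worker)
            except ConnectionError:
                total_failures += 1
                if total_failures >= MAX_TOTAL_RETRIES:
                    raise RuntimeError("router: max_total_retries exceeded")
        workers.pop(0)  # this worker exhausted its retries: remove it
    raise RuntimeError("router: no workers left")
```

With the defaults, two consecutively failing workers (3 failures each) exhaust the global budget of 6 and the request errors out.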

## Routing Strategies

### Cache-Aware Load-Balancing Router

The native router combines two strategies to optimize both cache utilization and request distribution:

1. Cache-Aware Routing (Approximate Tree)
2. Load-Balancing Routing (Shortest Queue with Balance Thresholds)

The router dynamically switches between these strategies based on load conditions:

- Uses load balancing when the system is imbalanced
- Uses cache-aware routing when the system is balanced

A system is considered imbalanced if both conditions are met:

1. (max_load - min_load) > balance_abs_threshold
2. max_load > balance_rel_threshold * min_load
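
With the default thresholds (32 and 1.0001, see Configuration Parameters below), the imbalance check can be written as a small sketch:

```python
BALANCE_ABS_THRESHOLD = 32      # default balance_abs_threshold
BALANCE_REL_THRESHOLD = 1.0001  # default balance_rel_threshold

def is_imbalanced(loads):
    """The system is imbalanced only if BOTH conditions hold."""
    max_load, min_load = max(loads), min(loads)
    return (max_load - min_load > BALANCE_ABS_THRESHOLD
            and max_load > BALANCE_REL_THRESHOLD * min_load)
```

For example, loads of `[100, 10]` are imbalanced (difference 90 > 32 and 100 > 10.001), while `[40, 20]` are not (difference 20 ≤ 32), so cache-aware routing stays active in the latter case.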

***Cache-Aware Routing (Approximate Tree)***

When the workers are considered to be balanced, the router maintains an approximate radix tree for each worker based on request history, eliminating the need for direct cache state queries on each worker. The tree stores raw text characters instead of token IDs to avoid tokenization overhead.

Process:

1. For each request, find the worker with the highest prefix match.

   - If match rate > cache_threshold, route the request to the worker with the highest match (it likely has the relevant data cached)
   - If match rate ≤ cache_threshold, route the request to the worker with the smallest tree size (most available cache capacity)

2. Background maintenance: Periodically evict least recently used leaf nodes on the approximate tree to prevent memory overflow.
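
The routing decision in step 1 can be sketched as below. For brevity, the per-worker approximate radix tree is replaced by a plain longest-common-prefix scan over each worker's request history; `pick_worker` and `worker_histories` are hypothetical names for this sketch:

```python
CACHE_THRESHOLD = 0.5  # default cache_threshold

def common_prefix_len(a, b):
    """Length of the shared character prefix of two strings."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def pick_worker(request_text, worker_histories):
    """worker_histories: {worker: list of previously routed texts},
    standing in for the per-worker approximate radix trees."""
    best_worker, best_match = None, 0
    for worker, texts in worker_histories.items():
        match = max((common_prefix_len(request_text, t) for t in texts),
                    default=0)
        if match > best_match:
            best_worker, best_match = worker, match
    if best_match / max(len(request_text), 1) > CACHE_THRESHOLD:
        return best_worker  # likely has the prefix cached
    # weak match: pick the worker with the least cached text
    return min(worker_histories,
               key=lambda w: sum(map(len, worker_histories[w])))
```

A request sharing a long prefix with one worker's history is routed there; an unrelated request goes to the worker with the smallest history, mirroring the two branches above.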

***Load-Balancing (Shortest Queue)***

For unbalanced systems, this strategy tracks pending request counts per worker and routes new requests to the least busy worker. This helps maintain optimal load distribution across workers.

## Configuration Parameters

1. `cache_threshold`: (float, 0.0 to 1.0, default: 0.5)
   - Minimum prefix match ratio to use highest-match routing.
   - Below this threshold, the request is routed to the worker with the most available cache capacity.

2. `balance_abs_threshold`: (integer, default: 32)
   - Absolute difference threshold for load imbalance detection.
   - The system is potentially imbalanced if (max_load - min_load) > abs_threshold.

3. `balance_rel_threshold`: (float, default: 1.0001)
   - Relative ratio threshold for load imbalance detection.
   - The system is potentially imbalanced if max_load > min_load * rel_threshold.
   - Used in conjunction with `balance_abs_threshold` to determine the final imbalance state.

4. `eviction_interval`: (integer, default: 60)
   - Interval in seconds between LRU eviction cycles for the approximate trees.
   - Background thread periodically evicts least recently used nodes to maintain tree size.

5. `max_tree_size`: (integer, default: 16777216)
   - Maximum number of nodes in each approximate tree.
   - When exceeded, LRU leaf nodes are evicted during the next eviction cycle.
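
The interaction between `eviction_interval` and `max_tree_size` can be sketched with a simplified model that treats the tree as a flat set of leaves with last-access timestamps (the real router evicts LRU leaf nodes of a radix tree; `evict_lru_leaves` is a hypothetical name, and sizes are counted in characters rather than tree nodes here):

```python
def evict_lru_leaves(leaves, max_tree_size):
    """leaves: {text: last_access_time}. Runs once per eviction cycle:
    evicts least recently used entries until total size fits again."""
    total = sum(len(text) for text in leaves)
    # sorted() materializes the key list, so deleting during the loop is safe
    for text in sorted(leaves, key=leaves.get):  # oldest first
        if total <= max_tree_size:
            break
        del leaves[text]
        total -= len(text)
    return leaves
```

A background thread would call this every `eviction_interval` seconds; between cycles the tree may temporarily exceed `max_tree_size`.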