README.md 6.95 KB
Newer Older
Byron Hsu's avatar
Byron Hsu committed
1
# SGLang Router
2
3
4

SGLang router is a standalone module implemented in Rust to achieve data parallelism across SGLang instances.

Byron Hsu's avatar
Byron Hsu committed
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
## Installation

```bash
pip install sglang-router
```

## Usage
The router offers two modes:

### 1. Co-launch workers and router
This will be a drop-in replacement for the existing `--dp-size`. This part of code will be moved into sglang core.
Under the hood, it uses multi-processes to launch multiple sglang workers, wait for them to be healthy, then launch the router.

```bash
$ python -m sglang_router.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --dp-size 8
```

### 2. Launch only router
23
This is useful for multi-node DP. You can launch workers on different nodes, then connect the router to them.
Byron Hsu's avatar
Byron Hsu committed
24
25
26
27
28
29

```bash
$ python -m sglang_router.launch_router --worker-urls http://worker1:8000 http://worker2:8000

$ python -m sglang_router.launch_router --help
usage: launch_router.py [-h] [--host HOST] [--port PORT] [--worker-urls WORKER_URLS [WORKER_URLS ...]]
30
31
32
                       [--policy {random,round_robin,cache_aware}] [--cache-threshold CACHE_THRESHOLD]
                       [--balance-abs-threshold BALANCE_ABS_THRESHOLD] [--balance-rel-threshold BALANCE_REL_THRESHOLD]
                       [--eviction-interval EVICTION_INTERVAL] [--max-tree-size MAX_TREE_SIZE]
Byron Hsu's avatar
Byron Hsu committed
33
34
35

options:
  -h, --help            show this help message and exit
36
37
  --host HOST          Host address to bind the router server (default: 127.0.0.1)
  --port PORT          Port number to bind the router server (default: 30000)
Byron Hsu's avatar
Byron Hsu committed
38
  --worker-urls WORKER_URLS [WORKER_URLS ...]
39
                       List of worker URLs (e.g., http://worker1:8000 http://worker2:8000) (default: None)
Byron Hsu's avatar
Byron Hsu committed
40
  --policy {random,round_robin,cache_aware}
41
                       Load balancing policy to use (default: cache_aware)
Byron Hsu's avatar
Byron Hsu committed
42
  --cache-threshold CACHE_THRESHOLD
43
44
45
46
47
                       Cache threshold (0.0-1.0) for cache-aware routing (default: 0.5)
  --balance-abs-threshold BALANCE_ABS_THRESHOLD
                       Load balancing is triggered when (max_load - min_load) > abs_threshold AND max_load > min_load * rel_threshold (default: 32)
  --balance-rel-threshold BALANCE_REL_THRESHOLD
                       Load balancing is triggered when (max_load - min_load) > abs_threshold AND max_load > min_load * rel_threshold (default: 1.0001)
Byron Hsu's avatar
Byron Hsu committed
48
  --eviction-interval EVICTION_INTERVAL
49
                       Interval in seconds between cache eviction operations (default: 60)
Byron Hsu's avatar
Byron Hsu committed
50
  --max-tree-size MAX_TREE_SIZE
51
                       Maximum size of the approximation tree for cache-aware routing (default: 16777216)
Byron Hsu's avatar
Byron Hsu committed
52
53
54
55
56
57
58
59
60
```

## Strategy

### Cache-Aware Load-Balancing Router

This router combines two strategies to optimize both cache utilization and request distribution:

1. Cache-Aware Routing (Approximate Tree)
61
62
63
64
65
66
67
68
69
2. Load-Balancing Routing (Shortest Queue with Balance Thresholds)

The router dynamically switches between these strategies based on load conditions:
- Uses load balancing when the system is imbalanced
- Uses cache-aware routing when the system is balanced

A system is considered imbalanced if both conditions are met:
1. (max_load - min_load) > balance_abs_threshold
2. max_load > balance_rel_threshold * min_load
Byron Hsu's avatar
Byron Hsu committed
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86

#### 1. Cache-Aware Routing (Approximate Tree)
This strategy maintains an approximate radix tree for each worker based on request history,
eliminating the need for direct cache state queries. The tree stores raw text characters
instead of token IDs to avoid tokenization overhead.

Process:
- For each request, find the worker with the highest prefix match
- If match rate > cache_threshold:
  - Route to the worker with highest match (likely has relevant data cached)
- If match rate ≤ cache_threshold:
  - Route to the worker with smallest tree size (most available cache capacity)
- Background maintenance:
  - Periodically evict least recently used leaf nodes to prevent memory overflow

#### 2. Load-Balancing (Shortest Queue)
This strategy tracks pending request counts per worker and routes new requests
87
88
to the least busy worker when the system is detected to be imbalanced. This helps
maintain optimal load distribution across workers.
Byron Hsu's avatar
Byron Hsu committed
89
90
91

### Configuration Parameters

92
1. `cache_threshold`: (float, 0.0 to 1.0, default: 0.5)
Byron Hsu's avatar
Byron Hsu committed
93
94
95
   - Minimum prefix match ratio to use highest-match routing
   - Below this threshold, routes to worker with most available cache space

96
97
98
2. `balance_abs_threshold`: (integer, default: 32)
   - Absolute difference threshold for load imbalance detection
   - System is potentially imbalanced if (max_load - min_load) > abs_threshold
Byron Hsu's avatar
Byron Hsu committed
99

100
101
102
103
104
105
106
107
108
109
3. `balance_rel_threshold`: (float, default: 1.0001)
   - Relative ratio threshold for load imbalance detection
   - System is potentially imbalanced if max_load > min_load * rel_threshold
   - Used in conjunction with abs_threshold to determine final imbalance state

4. `eviction_interval`: (integer, default: 60)
   - Interval in seconds between LRU eviction cycles for the approximate trees
   - Background thread periodically evicts least recently used nodes to maintain tree size

5. `max_tree_size`: (integer, default: 16777216)
Byron Hsu's avatar
Byron Hsu committed
110
111
112
113
   - Maximum nodes per tree
   - When exceeded, LRU leaf nodes are evicted during the next eviction cycle

## Development
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130

- Rust and Cargo installed

```bash
# Install rustup (Rust installer and version manager)
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

# Follow the installation prompts, then reload your shell
source $HOME/.cargo/env

# Verify installation
rustc --version
cargo --version
```

- Python with pip installed

131

Byron Hsu's avatar
Byron Hsu committed
132
### Build Process
133

Byron Hsu's avatar
Byron Hsu committed
134
#### 1. Build Rust Project
135
136
137
138
139

```bash
cargo build
```

Byron Hsu's avatar
Byron Hsu committed
140
#### 2. Build Python Binding
141

Byron Hsu's avatar
Byron Hsu committed
142
##### Option A: Build and Install Wheel
143
144
145
146
147
148
149
150
151
152
153
1. Build the wheel package:
```bash
pip install setuptools-rust wheel build
python -m build
```

2. Install the generated wheel:
```bash
pip install <path-to-wheel>
```

Byron Hsu's avatar
Byron Hsu committed
154
##### Option B: Development Mode
155
156

For development purposes, you can install the package in editable mode:
157
158
159

Warning: Using editable python binding can suffer from performance degradation!! Please build a fresh wheel for every update if you want to test performance.

160
161
162
163
164
165
```bash
pip install -e .
```

**Note:** When modifying Rust code, you must rebuild the wheel for changes to take effect.

Byron Hsu's avatar
Byron Hsu committed
166
### CI/CD Setup
167
168
169

The continuous integration pipeline consists of three main steps:

Byron Hsu's avatar
Byron Hsu committed
170
#### 1. Build Wheels
171
172
173
174
175
- Uses `cibuildwheel` to create manylinux x86_64 packages
- Compatible with major Linux distributions (Ubuntu, CentOS, etc.)
- Additional configurations can be added to support other OS/architectures
- Reference: [cibuildwheel documentation](https://cibuildwheel.pypa.io/en/stable/)

Byron Hsu's avatar
Byron Hsu committed
176
#### 2. Build Source Distribution
177
178
179
- Creates a source distribution containing the raw, unbuilt code
- Enables `pip` to build the package from source when prebuilt wheels are unavailable

Byron Hsu's avatar
Byron Hsu committed
180
#### 3. Publish to PyPI
181
182
183
- Uploads both wheels and source distribution to PyPI

The CI configuration is based on the [tiktoken workflow](https://github.com/openai/tiktoken/blob/63527649963def8c759b0f91f2eb69a40934e468/.github/workflows/build_wheels.yml#L1).