# SGLang Router

SGLang Router is a standalone module, implemented in Rust, that provides data parallelism across SGLang instances.

## Installation

```bash
pip install sglang-router
```

## Usage
The router offers two modes:

### 1. Co-launch workers and router
This mode is intended as a drop-in replacement for the existing `--dp-size` option. This part of the code will be moved into SGLang core.
Under the hood, it uses multiple processes to launch several SGLang workers, waits for them to become healthy, and then launches the router.

```bash
$ python -m sglang_router.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --dp-size 8
```
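
The "wait for workers to be healthy" step can be pictured as a simple polling loop. The sketch below is illustrative only and is not the router's actual launch code: the function name `wait_until_healthy`, the `is_healthy` probe callback, and the timeout values are all hypothetical. The probe is passed in by the caller, so no particular health endpoint is assumed.

```python
import time
from typing import Callable, Iterable


def wait_until_healthy(
    worker_urls: Iterable[str],
    is_healthy: Callable[[str], bool],  # hypothetical probe supplied by the caller
    timeout_secs: float = 300.0,
    poll_interval_secs: float = 1.0,
) -> None:
    """Block until every worker passes the probe, or raise TimeoutError."""
    pending = set(worker_urls)
    deadline = time.monotonic() + timeout_secs
    while pending:
        # Keep only the workers that still fail the probe.
        pending = {url for url in pending if not is_healthy(url)}
        if not pending:
            return
        if time.monotonic() >= deadline:
            raise TimeoutError(f"workers not healthy: {sorted(pending)}")
        time.sleep(poll_interval_secs)
```

Only after this call returns would the router be started and pointed at the worker URLs.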

### 2. Launch only router
This is useful for multi-node DP. You can launch workers on different nodes, then connect the router to them.

```bash
$ python -m sglang_router.launch_router --worker-urls http://worker1:8000 http://worker2:8000

$ python -m sglang_router.launch_router --help
usage: launch_router.py [-h] [--host HOST] [--port PORT] [--worker-urls WORKER_URLS [WORKER_URLS ...]]
                        [--policy {random,round_robin,cache_aware}] [--cache-threshold CACHE_THRESHOLD]
                        [--cache-routing-prob CACHE_ROUTING_PROB] [--eviction-interval EVICTION_INTERVAL]
                        [--max-tree-size MAX_TREE_SIZE]

options:
  -h, --help            show this help message and exit
  --host HOST           Host address to bind the router server (default: 127.0.0.1)
  --port PORT           Port number to bind the router server (default: 30000)
  --worker-urls WORKER_URLS [WORKER_URLS ...]
                        List of worker URLs (e.g., http://worker1:8000 http://worker2:8000) (default: None)
  --policy {random,round_robin,cache_aware}
                        Load balancing policy to use (default: cache_aware)
  --cache-threshold CACHE_THRESHOLD
                        Cache threshold (0.0-1.0) for cache-aware routing (default: 0.5)
  --cache-routing-prob CACHE_ROUTING_PROB
                        Probability of using cache-aware routing (0.0-1.0) (default: 1.0)
  --eviction-interval EVICTION_INTERVAL
                        Interval in seconds between cache eviction operations (default: 60)
  --max-tree-size MAX_TREE_SIZE
                        Maximum size of the approximation tree for cache-aware routing (default: 16777216)
```

## Strategy

### Cache-Aware Load-Balancing Router

This router combines two strategies to optimize both cache utilization and request distribution:

1. Cache-Aware Routing (Approximate Tree)
2. Load-Balancing Routing (Shortest Queue)

#### 1. Cache-Aware Routing (Approximate Tree)
This strategy maintains an approximate radix tree for each worker based on request history,
eliminating the need for direct cache state queries. The tree stores raw text characters
instead of token IDs to avoid tokenization overhead.

Process:
- For each request, find the worker with the highest prefix match
- If match rate > cache_threshold:
  - Route to the worker with highest match (likely has relevant data cached)
- If match rate ≤ cache_threshold:
  - Route to the worker with smallest tree size (most available cache capacity)
- Background maintenance:
  - Periodically evict least recently used leaf nodes to prevent memory overflow
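
The routing decision above can be sketched in a few lines of Python. This is a simplified illustration, not the router's Rust implementation: a plain character trie stands in for the approximate radix tree, eviction is omitted, and the names `PrefixTree` and `pick_worker` are hypothetical. It also assumes that "match rate" means matched prefix length divided by request length, and that the chosen worker's tree is updated with the request text afterwards.

```python
from typing import Dict


class PrefixTree:
    """Character-level trie standing in for the approximate radix tree."""

    def __init__(self) -> None:
        self.root: dict = {}
        self.size = 0  # number of character nodes stored

    def insert(self, text: str) -> None:
        node = self.root
        for ch in text:
            if ch not in node:
                node[ch] = {}
                self.size += 1
            node = node[ch]

    def match_len(self, text: str) -> int:
        """Length of the longest prefix of `text` already in the tree."""
        node, matched = self.root, 0
        for ch in text:
            if ch not in node:
                break
            node = node[ch]
            matched += 1
        return matched


def pick_worker(trees: Dict[str, PrefixTree], text: str, cache_threshold: float) -> str:
    # Worker whose tree shares the longest prefix with the request text.
    best = max(trees, key=lambda w: trees[w].match_len(text))
    # Assumption: match rate = matched characters / request length.
    match_rate = trees[best].match_len(text) / len(text) if text else 0.0
    if match_rate > cache_threshold:
        chosen = best  # likely has the prefix cached
    else:
        chosen = min(trees, key=lambda w: trees[w].size)  # smallest tree
    trees[chosen].insert(text)  # record the request in that worker's history
    return chosen
```

With two empty trees and `cache_threshold=0.5`, a request sharing more than half of its characters with an earlier one goes back to the same worker, while an unrelated request goes to the worker with the smaller tree.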

#### 2. Load-Balancing (Shortest Queue)
This strategy tracks pending request counts per worker and routes new requests
to the least busy worker for optimal load distribution.
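
As a rough illustration of shortest-queue routing, the sketch below keeps a pending-request counter per worker. The class and method names (`ShortestQueueBalancer`, `acquire`, `release`) are hypothetical and chosen for this example; the actual router is written in Rust and may track load differently.

```python
from typing import Dict, Iterable


class ShortestQueueBalancer:
    """Tracks pending requests per worker and picks the least busy one."""

    def __init__(self, worker_urls: Iterable[str]) -> None:
        self.pending: Dict[str, int] = {url: 0 for url in worker_urls}

    def acquire(self) -> str:
        # Choose the worker with the fewest in-flight requests.
        url = min(self.pending, key=self.pending.get)
        self.pending[url] += 1
        return url

    def release(self, url: str) -> None:
        # Call when the request to `url` has completed.
        self.pending[url] -= 1
```

A caller would pair each `acquire()` with a `release()` once the response has been returned, so the counters reflect only in-flight requests.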

### Configuration Parameters

1. `cache_routing_prob`: (float, 0.0 to 1.0)
   - 0.0: Exclusively use load balancing
   - 1.0: Exclusively use cache-aware routing
   - Between 0-1: Probability of using cache-aware routing vs load balancing

2. `cache_threshold`: (float, 0.0 to 1.0)
   - Minimum prefix match ratio to use highest-match routing
   - Below this threshold, routes to worker with most available cache space

3. `eviction_interval_secs`: (integer)
   - Interval between LRU eviction cycles for the approximate trees

4. `max_tree_size`: (integer)
   - Maximum nodes per tree
   - When exceeded, LRU leaf nodes are evicted during the next eviction cycle
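
One way to read `cache_routing_prob` is as a per-request coin flip between the two strategies. The sketch below shows that interpretation under stated assumptions: `choose_strategy` and the returned labels are hypothetical names, and the router's real selection logic may differ in detail.

```python
import random


def choose_strategy(cache_routing_prob: float, rng: random.Random) -> str:
    """Pick a strategy for one request according to cache_routing_prob."""
    if not 0.0 <= cache_routing_prob <= 1.0:
        raise ValueError("cache_routing_prob must be in [0.0, 1.0]")
    # rng.random() is in [0.0, 1.0), so 1.0 always selects cache-aware
    # routing and 0.0 always selects load balancing.
    if rng.random() < cache_routing_prob:
        return "cache_aware"
    return "load_balancing"
```

Passing an explicit `random.Random` instance keeps the choice reproducible in tests.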


## Development

- Rust and Cargo installed

```bash
# Install rustup (Rust installer and version manager)
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

# Follow the installation prompts, then reload your shell
source $HOME/.cargo/env

# Verify installation
rustc --version
cargo --version
```

- Python with pip installed

### Build Process

#### 1. Build Rust Project

```bash
cargo build
```

#### 2. Build Python Binding

##### Option A: Build and Install Wheel
1. Build the wheel package:
```bash
pip install setuptools-rust wheel build
python -m build
```

2. Install the generated wheel:
```bash
pip install <path-to-wheel>
```

##### Option B: Development Mode

For development purposes, you can install the package in editable mode:

**Warning:** An editable install of the Python binding can suffer from degraded performance. If you want to test performance, build and install a fresh wheel after every update.

```bash
pip install -e .
```

**Note:** When modifying Rust code, you must rebuild the wheel for changes to take effect.

### CI/CD Setup

The continuous integration pipeline consists of three main steps:

#### 1. Build Wheels
- Uses `cibuildwheel` to create manylinux x86_64 packages
- Compatible with major Linux distributions (Ubuntu, CentOS, etc.)
- Additional configurations can be added to support other OS/architectures
- Reference: [cibuildwheel documentation](https://cibuildwheel.pypa.io/en/stable/)

#### 2. Build Source Distribution
- Creates a source distribution containing the raw, unbuilt code
- Enables `pip` to build the package from source when prebuilt wheels are unavailable

#### 3. Publish to PyPI
- Uploads both wheels and source distribution to PyPI

The CI configuration is based on the [tiktoken workflow](https://github.com/openai/tiktoken/blob/63527649963def8c759b0f91f2eb69a40934e468/.github/workflows/build_wheels.yml#L1).