Commit c0ee46fe, authored by Byron Hsu, committed by GitHub

[router] Update doc for dynamic scaling and fault tolerance (#2454)

parent 9208618b
@@ -7,14 +7,14 @@ The router is an independent Python package, and it can be used as a drop-in repl
## Installation
```bash
$ pip install sglang-router
```
Detailed usage of the router can be found in [launch_router](https://github.com/sgl-project/sglang/blob/main/rust/py_src/sglang_router/launch_router.py) and [launch_server](https://github.com/sgl-project/sglang/blob/main/rust/py_src/sglang/launch_server.py). You can also run the following commands to see the router's usage.
```bash
$ python -m sglang_router.launch_server --help
$ python -m sglang_router.launch_router --help
```
The router supports two working modes:
@@ -27,7 +27,7 @@ The router supports two working modes:
This will be a drop-in replacement for the existing `--dp-size` argument of SGLang Runtime. Under the hood, it uses multiple processes to launch multiple workers, waits for them to be ready, then connects the router to all workers.
```bash
$ python -m sglang_router.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --dp-size 1
```
After the server is ready, you can send requests to the router in the same way as you would send them to a single worker.
@@ -47,12 +47,62 @@ print(response.json())
This is useful for multi-node DP. First, launch workers on multiple nodes, then launch a router on the main node, and connect the router to all workers.
```bash
$ python -m sglang_router.launch_router --worker-urls http://worker_url_1 http://worker_url_2
```
## Dynamic Scaling APIs
We offer `/add_worker` and `/remove_worker` APIs to dynamically add or remove workers from the router.
- `/add_worker`
Usage:
```bash
$ curl -X POST http://localhost:30000/add_worker?url=http://worker_url_1
```
Example:
```bash
$ python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --port 30001
$ curl -X POST http://localhost:30000/add_worker?url=http://127.0.0.1:30001
Successfully added worker: http://127.0.0.1:30001
```
- `/remove_worker`
Usage:
```bash
$ curl -X POST http://localhost:30000/remove_worker?url=http://worker_url_1
```
Example:
```bash
$ curl -X POST http://localhost:30000/remove_worker?url=http://127.0.0.1:30001
Successfully removed worker: http://127.0.0.1:30001
```
Note:
- For the cache-aware router, the worker is removed from the tree and the queues.
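The same calls can be issued from Python. The following is a minimal sketch, assuming the router listens on `http://localhost:30000` as in the curl examples above. The helper names `build_worker_request` and `post_worker_action` are hypothetical and not part of `sglang_router`. Like the curl commands, the sketch passes the worker URL unescaped in the `url` query parameter.

```python
import urllib.request

ROUTER_URL = "http://localhost:30000"  # assumed router address, taken from the curl examples


def build_worker_request(action: str, worker_url: str,
                         router_url: str = ROUTER_URL) -> urllib.request.Request:
    """Build the POST request for /add_worker or /remove_worker (hypothetical helper)."""
    if action not in ("add_worker", "remove_worker"):
        raise ValueError(f"unsupported action: {action}")
    # Mirror the curl usage: POST with the worker URL in the `url` query parameter.
    return urllib.request.Request(f"{router_url}/{action}?url={worker_url}", method="POST")


def post_worker_action(action: str, worker_url: str) -> str:
    """Send the request to a running router and return the plain-text response body."""
    with urllib.request.urlopen(build_worker_request(action, worker_url)) as response:
        return response.read().decode()
```

With a router running, `post_worker_action("add_worker", "http://127.0.0.1:30001")` would send the same request as the curl example for `/add_worker`.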
## Fault Tolerance
We provide retry-based fault tolerance.
1. If a request to a worker fails `max_worker_retries` times, the router removes that worker and moves on to the next worker.
2. If the total number of retries exceeds `max_total_retries`, the router returns an error.
Note:
- By default, `max_worker_retries` is 3 and `max_total_retries` is 6.
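The two rules above can be sketched as a small simulation. This is an illustration under stated assumptions, not the router's Rust implementation: the function `route_with_retries`, the `send` callback, and the use of `ConnectionError` to signal a failed request are all hypothetical, and the sketch treats `max_total_retries` as a cap on the total number of attempts.

```python
from typing import Callable, List


def route_with_retries(workers: List[str], send: Callable[[str], str],
                       max_worker_retries: int = 3, max_total_retries: int = 6) -> str:
    """Try workers in order; drop a worker after max_worker_retries failures."""
    total_attempts = 0
    for worker in list(workers):  # iterate over a copy because `workers` is mutated
        for _ in range(max_worker_retries):
            if total_attempts >= max_total_retries:
                raise RuntimeError("exceeded max_total_retries")
            total_attempts += 1
            try:
                return send(worker)
            except ConnectionError:
                continue  # retry the same worker
        # The worker failed max_worker_retries times: remove it and move on.
        workers.remove(worker)
    raise RuntimeError("no healthy worker left")
```

With the defaults, two consecutive unhealthy workers use up the whole budget (3 + 3 attempts), so a third worker is never tried in this sketch.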
## Routing Strategies
#### Cache-Aware Load-Balancing Router
The native router combines two strategies to optimize both cache utilization and request distribution:
......
@@ -2,115 +2,13 @@
SGLang router is a standalone module implemented in Rust to achieve data parallelism across SGLang instances.
## Installation ## User docs
```bash Please check https://sgl-project.github.io/router/router.html
pip install sglang-router
```
## Usage
The router offers two modes:
### 1. Co-launch workers and router
This will be a drop-in replacement for the existing `--dp-size`. This part of code will be moved into sglang core.
Under the hood, it uses multi-processes to launch multiple sglang workers, wait for them to be healthy, then launch the router.
```bash
$ python -m sglang_router.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --dp-size 8
```
### 2. Launch only router
This is useful for multi-node DP. You can launch workers on different nodes, then connect the router to them.
```bash
$ python -m sglang_router.launch_router --worker-urls http://worker1:8000 http://worker2:8000
$ python -m sglang_router.launch_router --help
usage: launch_router.py [-h] [--host HOST] [--port PORT] [--worker-urls WORKER_URLS [WORKER_URLS ...]]
                        [--policy {random,round_robin,cache_aware}] [--cache-threshold CACHE_THRESHOLD]
                        [--balance-abs-threshold BALANCE_ABS_THRESHOLD] [--balance-rel-threshold BALANCE_REL_THRESHOLD]
                        [--eviction-interval EVICTION_INTERVAL] [--max-tree-size MAX_TREE_SIZE]
options:
  -h, --help            show this help message and exit
  --host HOST           Host address to bind the router server (default: 127.0.0.1)
  --port PORT           Port number to bind the router server (default: 30000)
  --worker-urls WORKER_URLS [WORKER_URLS ...]
                        List of worker URLs (e.g., http://worker1:8000 http://worker2:8000) (default: None)
  --policy {random,round_robin,cache_aware}
                        Load balancing policy to use (default: cache_aware)
  --cache-threshold CACHE_THRESHOLD
                        Cache threshold (0.0-1.0) for cache-aware routing (default: 0.5)
  --balance-abs-threshold BALANCE_ABS_THRESHOLD
                        Load balancing is triggered when (max_load - min_load) > abs_threshold AND max_load > min_load * rel_threshold (default: 32)
  --balance-rel-threshold BALANCE_REL_THRESHOLD
                        Load balancing is triggered when (max_load - min_load) > abs_threshold AND max_load > min_load * rel_threshold (default: 1.0001)
  --eviction-interval EVICTION_INTERVAL
                        Interval in seconds between cache eviction operations (default: 60)
  --max-tree-size MAX_TREE_SIZE
                        Maximum size of the approximation tree for cache-aware routing (default: 16777216)
```
## Strategy
### Cache-Aware Load-Balancing Router
This router combines two strategies to optimize both cache utilization and request distribution:
1. Cache-Aware Routing (Approximate Tree)
2. Load-Balancing Routing (Shortest Queue with Balance Thresholds)
The router dynamically switches between these strategies based on load conditions: ## Developer docs
- Uses load balancing when the system is imbalanced
- Uses cache-aware routing when the system is balanced
A system is considered imbalanced if both conditions are met: ### Prerequisites
1. (max_load - min_load) > balance_abs_threshold
2. max_load > balance_rel_threshold * min_load
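The two conditions can be written as a single predicate. This is a minimal sketch: the function name `is_imbalanced` and the representation of load as a list of pending request counts are assumptions for illustration, while the default thresholds are the documented ones.

```python
from typing import Sequence


def is_imbalanced(loads: Sequence[int],
                  balance_abs_threshold: int = 32,
                  balance_rel_threshold: float = 1.0001) -> bool:
    """Return True only if both the absolute and the relative condition hold."""
    max_load, min_load = max(loads), min(loads)
    abs_condition = (max_load - min_load) > balance_abs_threshold
    rel_condition = max_load > balance_rel_threshold * min_load
    return abs_condition and rel_condition
```

For example, loads of 100 and 10 pending requests satisfy both conditions, while loads of 40 and 20 fail the absolute condition (the difference of 20 does not exceed 32), so the system counts as balanced.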
#### 1. Cache-Aware Routing (Approximate Tree)
This strategy maintains an approximate radix tree for each worker based on request history,
eliminating the need for direct cache state queries. The tree stores raw text characters
instead of token IDs to avoid tokenization overhead.
Process:
- For each request, find the worker with the highest prefix match
- If match rate > cache_threshold:
  - Route to the worker with highest match (likely has relevant data cached)
- If match rate ≤ cache_threshold:
  - Route to the worker with smallest tree size (most available cache capacity)
- Background maintenance:
  - Periodically evict least recently used leaf nodes to prevent memory overflow
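The routing decision in the process above can be sketched as follows. This is a simplified illustration, not the router's implementation: it replaces the approximate radix tree with a plain list of previously routed texts per worker, approximates tree size by the total number of stored characters, and omits eviction. The names `common_prefix_len` and `pick_worker` are hypothetical.

```python
from typing import Dict, List


def common_prefix_len(a: str, b: str) -> int:
    """Number of leading characters shared by a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def pick_worker(text: str, history: Dict[str, List[str]],
                cache_threshold: float = 0.5) -> str:
    """Choose a worker for `text` and record the request in that worker's history."""
    best_worker, best_len = None, -1
    for worker, texts in history.items():
        match = max((common_prefix_len(text, t) for t in texts), default=0)
        if match > best_len:
            best_worker, best_len = worker, match
    match_rate = best_len / len(text) if text else 0.0
    if match_rate > cache_threshold:
        chosen = best_worker  # likely has the prefix cached
    else:
        # Stand-in for "smallest tree size": fewest stored characters.
        chosen = min(history, key=lambda w: sum(len(t) for t in history[w]))
    history[chosen].append(text)
    return chosen
```

Working on raw characters, as the text describes, keeps tokenization out of the routing path. A real radix tree would share common prefixes between entries instead of storing each text in full.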
#### 2. Load-Balancing (Shortest Queue)
This strategy tracks pending request counts per worker and routes new requests
to the least busy worker when the system is detected to be imbalanced. This helps
maintain optimal load distribution across workers.
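A shortest-queue tracker can be sketched in a few lines. The class `PendingTracker` and its method names are hypothetical, chosen only to illustrate per-worker pending counts; the sketch is single-threaded and ignores the synchronization a concurrent router would need.

```python
from typing import Dict, Iterable


class PendingTracker:
    """Track pending request counts per worker (illustrative, not thread-safe)."""

    def __init__(self, worker_urls: Iterable[str]) -> None:
        self.pending: Dict[str, int] = {url: 0 for url in worker_urls}

    def start(self, url: str) -> None:
        """Record that a request was dispatched to `url`."""
        self.pending[url] += 1

    def finish(self, url: str) -> None:
        """Record that a request to `url` completed."""
        self.pending[url] -= 1

    def least_busy(self) -> str:
        """Return the worker with the fewest pending requests."""
        return min(self.pending, key=self.pending.get)
```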
### Configuration Parameters
1. `cache_threshold`: (float, 0.0 to 1.0, default: 0.5)
   - Minimum prefix match ratio to use highest-match routing
   - Below this threshold, routes to worker with most available cache space
2. `balance_abs_threshold`: (integer, default: 32)
   - Absolute difference threshold for load imbalance detection
   - System is potentially imbalanced if (max_load - min_load) > abs_threshold
3. `balance_rel_threshold`: (float, default: 1.0001)
   - Relative ratio threshold for load imbalance detection
   - System is potentially imbalanced if max_load > min_load * rel_threshold
   - Used in conjunction with abs_threshold to determine final imbalance state
4. `eviction_interval`: (integer, default: 60)
   - Interval in seconds between LRU eviction cycles for the approximate trees
   - Background thread periodically evicts least recently used nodes to maintain tree size
5. `max_tree_size`: (integer, default: 16777216)
   - Maximum nodes per tree
   - When exceeded, LRU leaf nodes are evicted during the next eviction cycle
## Development
- Rust and Cargo installed
@@ -134,7 +32,7 @@ cargo --version
#### 1. Build Rust Project
```bash
$ cargo build
```
#### 2. Build Python Binding
@@ -142,13 +40,19 @@ cargo build
##### Option A: Build and Install Wheel
1. Build the wheel package:
```bash
$ pip install setuptools-rust wheel build
$ python -m build
```
2. Install the generated wheel:
```bash
$ pip install <path-to-wheel>
```
To build and install in one command after every change you make:
```bash
$ python -m build && pip install --force-reinstall dist/*.whl
```
##### Option B: Development Mode
@@ -158,7 +62,7 @@ For development purposes, you can install the package in editable mode:
Warning: The editable Python binding can suffer from performance degradation. Build a fresh wheel for every update if you want to test performance.
```bash
$ pip install -e .
```
**Note:** When modifying Rust code, you must rebuild the wheel for changes to take effect.
......
@@ -118,7 +118,7 @@ async fn remove_worker(
         None => return HttpResponse::BadRequest().finish(),
     };
     data.router.remove_worker(&worker_url);
-    HttpResponse::Ok().finish()
+    HttpResponse::Ok().body(format!("Successfully removed worker: {}", worker_url))
 }
 pub struct ServerConfig {
......