For the design details, please refer to [link](https://docs.google.com/document/
Currently, we support Mooncake and NIXL as transfer engines.
## Router Integration
For deploying PD disaggregation at scale with load balancing and fault tolerance, SGLang provides a router. The router can distribute requests between prefill and decode instances using various routing policies. For detailed information on setting up routing with PD disaggregation, including configuration options and deployment patterns, see the [SGLang Router documentation](router.md#mode-3-prefill-decode-disaggregation).
Given multiple GPUs running multiple SGLang Runtimes, the SGLang Router is a high-performance request distribution system that routes inference requests across those Runtimes using a cache-aware load-balancing algorithm. It also provides fault tolerance and supports advanced deployment patterns, including data parallelism and prefill-decode disaggregation.
The router is an independent Python package, and it can be used as a drop-in replacement for the OpenAI API.
## Key Features
- **Cache-Aware Load Balancing**: Optimizes cache utilization while maintaining balanced load distribution
- **Multiple Routing Policies**: Choose from random, round-robin, cache-aware, or power-of-two policies
- **Fault Tolerance**: Automatic retry and circuit breaker mechanisms for resilient operation
- **Dynamic Scaling**: Add or remove workers at runtime without service interruption
- **Kubernetes Integration**: Native service discovery and pod management
- **Prefill-Decode Disaggregation**: Load balancing support for disaggregated serving
- **Prometheus Metrics**: Built-in observability and monitoring
## Installation
```bash
pip install sglang-router
```
Detailed usage of the router can be found in [launch_router](https://github.com/sgl-project/sglang/blob/main/sgl-router/py_src/sglang_router/launch_router.py) and [launch_server](https://github.com/sgl-project/sglang/blob/main/sgl-router/py_src/sglang_router/launch_server.py).
## Quick Start
To see all available options:
```bash
# Co-launch router and workers
python -m sglang_router.launch_server --help

# Launch router only
python -m sglang_router.launch_router --help
```
## Deployment Modes
The router supports three primary deployment patterns:
1. **Co-launch Mode**: Router and workers launch together (simplest for single-node deployments)
2. **Separate Launch Mode**: Router and workers launch independently (best for multi-node setups)
3. **Prefill-Decode Disaggregation**: Specialized mode for disaggregated serving
### Mode 1: Co-launch Router and Workers
This mode launches both the router and multiple worker instances in a single command. It's the simplest deployment option and a drop-in replacement for the existing `--dp-size` argument of SGLang Runtime: under the hood, it uses multiple processes to launch the workers, waits for them to be ready, then connects the router to all of them.
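For example, a co-launch invocation might look like the following sketch (the model path, port, and DP size are illustrative, not required values):

```bash
# Launch the router together with 4 data-parallel workers on one node
python -m sglang_router.launch_server \
    --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
    --dp-size 4 \
    --port 30000
```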
#### Sending Requests
Once the server is ready, send requests to the router endpoint in the same way you would send them to a single worker. Adjust the batch size accordingly to achieve maximum throughput.
```python
import requests

# Using the /generate endpoint
url = "http://localhost:30000/generate"
data = {
    "text": "What is the capital of France?",
    "sampling_params": {
        "temperature": 0.7,
        "max_new_tokens": 100,
    },
}
response = requests.post(url, json=data)
print(response.json())

# OpenAI-compatible endpoint
url = "http://localhost:30000/v1/chat/completions"
data = {
    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "What is the capital of France?"}],
}
response = requests.post(url, json=data)
print(response.json())
```
### Mode 2: Separate Launch Mode
This mode is ideal for multi-node deployments where workers run on different machines, such as multi-node data parallelism. First, launch workers on the worker nodes, then launch the router on the main node and connect it to all workers.
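A minimal sketch of this flow (hostnames, ports, and the model path are illustrative):

```bash
# On each worker node: launch an SGLang Runtime
python -m sglang.launch_server \
    --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
    --host 0.0.0.0 \
    --port 30001

# On the main node: launch the router and connect it to the workers
python -m sglang_router.launch_router \
    --worker-urls http://worker-node-1:30001 http://worker-node-2:30001
```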
## Routing Policies

The router supports multiple routing strategies:

### 1. Random Routing
Distributes requests randomly across workers.
```bash
--policy random
```
### 2. Round-Robin Routing
Cycles through workers in order.
```bash
--policy round_robin
```
### 3. Power of Two Choices
Samples two workers and routes to the less loaded one.
```bash
--policy power_of_two
```
### 4. Cache-Aware Load Balancing (Default)
The most sophisticated policy, which combines cache optimization with load balancing:
```bash
--policy cache_aware \
--cache-threshold 0.5 \
--balance-abs-threshold 32 \
--balance-rel-threshold 1.0001
```
#### How It Works
The policy combines two strategies and dynamically switches between them based on load conditions:

1. **Cache-Aware Routing (Approximate Tree)**: used when the system is balanced
2. **Load-Balancing Routing (Shortest Queue with Balance Thresholds)**: used when the system is imbalanced

A system is considered imbalanced if both conditions are met:

1. `(max_load - min_load) > balance_abs_threshold`
2. `max_load > balance_rel_threshold * min_load`

For example, with the default `balance_abs_threshold = 32` and `balance_rel_threshold = 1.0001`, a worker with 50 pending requests next to a worker with 10 makes the system imbalanced (50 - 10 = 40 > 32, and 50 > 1.0001 × 10), so the router switches to shortest-queue routing.

**Cache-Aware Routing (Approximate Tree)**

When the workers are balanced, the router maintains an approximate radix tree for each worker based on request history, eliminating the need to query each worker's cache state directly. The tree stores raw text characters instead of token IDs to avoid tokenization overhead.

Process:
1. For each request, find the worker with the highest prefix match.
   - If the match rate > `cache_threshold`, route the request to that worker (it likely has the relevant data cached).
   - If the match rate ≤ `cache_threshold`, route the request to the worker with the smallest tree (the most available cache capacity).
2. Background maintenance: periodically evict the least recently used leaf nodes from the approximate trees, based on `--eviction-interval` and `--max-tree-size`, to prevent memory overflow.

**Load-Balancing (Shortest Queue)**

For imbalanced systems, this strategy tracks the number of pending requests per worker and routes new requests to the least busy worker, keeping the load evenly distributed.
### Data Parallelism Aware Routing
An additional DP-aware routing strategy can be enabled on top of the router's hybrid cache-aware load-balancing strategy by setting the `--dp-aware` flag when starting the router. It enables fine-grained control over data parallel replicas:
```bash
--dp-aware \
--api-key your_api_key  # Required for worker authentication
```
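For example, a full invocation might look like this sketch (the worker URLs and API key are illustrative):

```bash
python -m sglang_router.launch_router \
    --worker-urls http://worker-node-1:30001 http://worker-node-2:30001 \
    --dp-aware \
    --api-key your_api_key
```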
When this flag is enabled, the router attempts to contact each worker to retrieve its `dp_size` and registers workers at the DP-rank level. The router then applies the cache-aware routing strategy in a more fine-grained manner, coordinating with SGLang's DP controller on the SRT side for optimized request distribution across data parallel ranks. By default (when the flag is not set), the SRT's DP controller distributes incoming requests across DP ranks in a round-robin fashion.

## Configuration Parameters

1. `cache_threshold` (float, 0.0 to 1.0, default: 0.5)
   - Minimum prefix match ratio to use highest-match routing.
   - Below this threshold, the request is routed to the worker with the most available cache space.
2. `balance_abs_threshold` (integer, default: 32)
   - Absolute difference threshold for load imbalance detection.
   - The system is potentially imbalanced if `(max_load - min_load) > balance_abs_threshold`.

Other options include `--cors-allowed-origins` (list, default: `[]`), which sets the allowed CORS origins.
## Advanced Features
### Kubernetes Service Discovery
Automatically discover and manage workers in Kubernetes:
#### Standard Mode
```bash
python -m sglang_router.launch_router \
--service-discovery \
--selector app=sglang-worker env=prod \
--service-discovery-namespace production \
--service-discovery-port 8000
```
#### Prefill-Decode Disaggregation Mode
```bash
python -m sglang_router.launch_router \
--pd-disaggregation \
--service-discovery \
--prefill-selector app=prefill-server env=prod \
--decode-selector app=decode-server env=prod \
--service-discovery-namespace production
```
**Note**: The `--bootstrap-port-annotation` (default: `sglang.ai/bootstrap-port`) is used to discover bootstrap ports for prefill servers in PD mode. Prefill pods should have this annotation set to their bootstrap port value.

### Fault Tolerance
The router provides automatic retry and circuit breaker mechanisms for resilient operation.

**Behavior**:
- A worker is marked unhealthy after `cb-failure-threshold` consecutive failures
- It returns to service after `cb-success-threshold` successful health checks
- The circuit breaker can be disabled with `--disable-circuit-breaker`
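These thresholds can be tuned at launch; a minimal sketch (the threshold values and worker URL are illustrative, not defaults):

```bash
python -m sglang_router.launch_router \
    --worker-urls http://worker-node-1:30001 \
    --cb-failure-threshold 5 \
    --cb-success-threshold 3
```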
## Troubleshooting

1. **Workers not connecting**: Ensure workers are fully initialized before starting the router. Use `--worker-startup-timeout-secs` to increase the wait time.
2. **High latency**: Check whether cache-aware routing is causing imbalance. Try adjusting `--balance-abs-threshold` and `--balance-rel-threshold`.
3. **Memory growth**: Reduce `--max-tree-size` or decrease `--eviction-interval` for more aggressive cache cleanup.
4. **Circuit breaker triggering frequently**: Increase `--cb-failure-threshold` or extend `--cb-window-duration-secs`.