Unverified commit 1ce30dd1 authored by Simo Lin, committed by GitHub

[router] update router documentation (#9121)

Currently, we support Mooncake and NIXL as the transfer engine.
## Router Integration
For deploying PD disaggregation at scale with load balancing and fault tolerance, SGLang provides a router. The router can distribute requests between prefill and decode instances using various routing policies. For detailed information on setting up routing with PD disaggregation, including configuration options and deployment patterns, see the [SGLang Router documentation](router.md#mode-3-prefill-decode-disaggregation).
## Mooncake
### Requirements
# SGLang Router
The SGLang Router is a high-performance request distribution system that routes inference requests across multiple SGLang runtime instances. It features cache-aware load balancing, fault tolerance, and support for advanced deployment patterns including data parallelism and prefill-decode disaggregation.
The router ships as an independent Python package and exposes an OpenAI-compatible API, so it can serve as a drop-in replacement for an OpenAI endpoint.
## Key Features
- **Cache-Aware Load Balancing**: Optimizes cache utilization while maintaining balanced load distribution
- **Multiple Routing Policies**: Choose from random, round-robin, cache-aware, or power-of-two policies
- **Fault Tolerance**: Automatic retry and circuit breaker mechanisms for resilient operation
- **Dynamic Scaling**: Add or remove workers at runtime without service interruption
- **Kubernetes Integration**: Native service discovery and pod management
- **Prefill-Decode Disaggregation**: Load balancing support for disaggregated serving
- **Prometheus Metrics**: Built-in observability and monitoring
## Installation
pip install sglang-router
```
Detailed usage of the router can be found in [launch_router](https://github.com/sgl-project/sglang/blob/main/sgl-router/py_src/sglang_router/launch_router.py) and [launch_server](https://github.com/sgl-project/sglang/blob/main/sgl-router/py_src/sglang_router/launch_server.py).
## Quick Start
To see all available options:
```bash
python -m sglang_router.launch_server --help  # Co-launch router and workers
python -m sglang_router.launch_router --help  # Launch router only
```
## Deployment Modes
The router supports three primary deployment patterns:
1. **Co-launch Mode**: Router and workers launch together (simplest for single-node deployments)
2. **Separate Launch Mode**: Router and workers launch independently (best for multi-node setups)
3. **Prefill-Decode Disaggregation**: Specialized mode for disaggregated serving
### Mode 1: Co-launch Router and Workers
This mode launches both the router and multiple worker instances in a single command. It is the simplest deployment option and replaces the `--dp-size` argument of SGLang Runtime: under the hood, it spawns multiple worker processes, waits for them to be ready, then connects the router to all of them.
```bash
# Launch router with 4 workers
python -m sglang_router.launch_server \
--model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
--dp-size 4 \
--host 0.0.0.0 \
--port 30000
```
#### Sending Requests
Once the server is ready, send requests to the router endpoint:
```python
import requests

# Using the /generate endpoint
url = "http://localhost:30000/generate"
data = {
    "text": "What is the capital of France?",
    "sampling_params": {
        "temperature": 0.7,
        "max_new_tokens": 100
    }
}
response = requests.post(url, json=data)
print(response.json())

# OpenAI-compatible endpoint
url = "http://localhost:30000/v1/chat/completions"
data = {
    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "What is the capital of France?"}]
}
response = requests.post(url, json=data)
print(response.json())
```
### Mode 2: Separate Launch Mode
This mode is ideal for multi-node deployments where workers run on different machines: first launch the workers on each node, then launch the router on the main node and connect it to all workers.
#### Step 1: Launch Workers
On each worker node:
```bash
# Worker node 1
python -m sglang.launch_server \
--model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
--host 0.0.0.0 \
--port 8000
# Worker node 2
python -m sglang.launch_server \
--model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
--host 0.0.0.0 \
--port 8001
```
#### Step 2: Launch Router
On the router node:
```bash
python -m sglang_router.launch_router \
--worker-urls http://worker1:8000 http://worker2:8001 \
--host 0.0.0.0 \
--port 30000 \
--policy cache_aware # or random, round_robin, power_of_two
```
### Mode 3: Prefill-Decode Disaggregation
This advanced mode separates prefill and decode operations for optimized performance:
```bash
python -m sglang_router.launch_router \
--pd-disaggregation \
--prefill http://prefill1:8000 9000 \
--prefill http://prefill2:8001 9001 \
--decode http://decode1:8002 \
--decode http://decode2:8003 \
--prefill-policy cache_aware \
--decode-policy round_robin
```
#### Understanding --prefill Arguments
The `--prefill` flag accepts URLs with optional bootstrap ports:
- `--prefill http://server:8000` - No bootstrap port
- `--prefill http://server:8000 9000` - Bootstrap port 9000
- `--prefill http://server:8000 none` - Explicitly no bootstrap port
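As an illustration, pairing URLs with their optional bootstrap ports could be modeled as follows. This is a hypothetical sketch of the semantics described above, not the router's actual argument parser:

```python
from typing import List, Optional, Tuple

def parse_prefill_args(tokens: List[str]) -> List[Tuple[str, Optional[int]]]:
    """Group tokens into (url, bootstrap_port) pairs.

    A token starting with "http" begins a new entry; a numeric token sets
    the bootstrap port of the preceding URL; the literal "none" explicitly
    leaves it unset. (Illustrative only.)
    """
    entries: List[Tuple[str, Optional[int]]] = []
    for tok in tokens:
        if tok.startswith("http"):
            entries.append((tok, None))
        elif tok.lower() == "none":
            pass  # explicitly no bootstrap port
        else:
            url, _ = entries[-1]
            entries[-1] = (url, int(tok))
    return entries
```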
#### Policy Inheritance in PD Mode
The router intelligently handles policy configuration for prefill and decode nodes:
1. **Only `--policy` specified**: Both prefill and decode nodes use this policy
2. **`--policy` and `--prefill-policy` specified**: Prefill nodes use `--prefill-policy`, decode nodes use `--policy`
3. **`--policy` and `--decode-policy` specified**: Prefill nodes use `--policy`, decode nodes use `--decode-policy`
4. **All three specified**: Prefill nodes use `--prefill-policy`, decode nodes use `--decode-policy` (main `--policy` is ignored)
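The four inheritance rules reduce to a simple fallback: each side uses its own policy if given, otherwise the main `--policy`. A minimal sketch (illustrative, not the router's source):

```python
def resolve_policies(policy="cache_aware", prefill_policy=None, decode_policy=None):
    """Return the (prefill, decode) policies actually applied.

    Each side falls back to the main policy when its override is unset.
    """
    return (prefill_policy or policy, decode_policy or policy)
```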
Example with mixed policies:
```bash
python -m sglang_router.launch_router \
--pd-disaggregation \
--prefill http://prefill1:8000 \
--prefill http://prefill2:8000 \
--decode http://decode1:8001 \
--decode http://decode2:8001 \
--policy round_robin \
--prefill-policy cache_aware # Prefill uses cache_aware and decode uses round_robin from --policy
```
#### PD Mode with Service Discovery
For Kubernetes deployments with separate prefill and decode server pools:
```bash
python -m sglang_router.launch_router \
--pd-disaggregation \
--service-discovery \
--prefill-selector app=prefill-server tier=gpu \
--decode-selector app=decode-server tier=cpu \
--service-discovery-namespace production \
--prefill-policy cache_aware \
--decode-policy round_robin
```
## Dynamic Scaling
The router supports runtime scaling through REST APIs:
### Adding Workers
```bash
# Launch a new worker
python -m sglang.launch_server \
--model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
--port 30001
# Add it to the router
curl -X POST "http://localhost:30000/add_worker?url=http://127.0.0.1:30001"
```
### Removing Workers
```bash
curl -X POST "http://localhost:30000/remove_worker?url=http://127.0.0.1:30001"
```
**Note**: When using cache-aware routing, removed workers are cleanly evicted from the routing tree and request queues.
## Fault Tolerance
The router includes comprehensive fault tolerance mechanisms: failed requests are retried with exponential backoff, and workers that keep failing are taken out of rotation by a circuit breaker.
### Retry Configuration
```bash
python -m sglang_router.launch_router \
--worker-urls http://worker1:8000 http://worker2:8001 \
--retry-max-retries 3 \
--retry-initial-backoff-ms 100 \
--retry-max-backoff-ms 10000 \
--retry-backoff-multiplier 2.0 \
--retry-jitter-factor 0.1
```
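Under these flags, the delay before each retry grows geometrically, is capped at the maximum, and gets random jitter added on top. A sketch of such a schedule, assuming conventional exponential-backoff semantics (the router's exact formula may differ):

```python
import random

def backoff_ms(attempt, initial_ms=100, multiplier=2.0, max_ms=10_000,
               jitter_factor=0.1, rng=random.random):
    """Backoff in ms before retry `attempt` (0-based): geometric growth
    capped at max_ms, plus up to jitter_factor of the base as jitter."""
    base = min(initial_ms * multiplier ** attempt, max_ms)
    return base + base * jitter_factor * rng()
```

With the defaults above this yields 100 ms, 200 ms, 400 ms, and so on up to the 10 s cap, before jitter.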
### Circuit Breaker
Protects against cascading failures:
```bash
python -m sglang_router.launch_router \
--worker-urls http://worker1:8000 http://worker2:8001 \
--cb-failure-threshold 5 \
--cb-success-threshold 2 \
--cb-timeout-duration-secs 30 \
--cb-window-duration-secs 60
```
**Behavior**:
- A worker is marked unhealthy after `cb-failure-threshold` consecutive failures
- It returns to service after `cb-success-threshold` successful health checks
- The circuit breaker can be disabled with `--disable-circuit-breaker`
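The behavior above amounts to a small state machine per worker. A minimal sketch (illustrative only; the router's internal logic may differ):

```python
class CircuitBreaker:
    """Toy per-worker circuit breaker: opens after N consecutive failures,
    closes again after M consecutive successes."""

    def __init__(self, failure_threshold=5, success_threshold=2):
        self.failure_threshold = failure_threshold
        self.success_threshold = success_threshold
        self.failures = 0
        self.successes = 0
        self.open = False  # open = worker removed from rotation

    def record_failure(self):
        self.failures += 1
        self.successes = 0
        if self.failures >= self.failure_threshold:
            self.open = True

    def record_success(self):
        self.failures = 0
        if self.open:
            self.successes += 1
            if self.successes >= self.success_threshold:
                self.open = False
                self.successes = 0
```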
## Routing Policies
The router supports multiple routing strategies:
### 1. Random Routing
Distributes requests randomly across workers.
```bash
--policy random
```
### 2. Round-Robin Routing
Cycles through workers in order.
```bash
--policy round_robin
```
### 3. Power of Two Choices
Samples two workers and routes to the less loaded one.
```bash
--policy power_of_two
```
### 4. Cache-Aware Load Balancing (Default)
The most sophisticated policy, combining cache optimization with load balancing:
```bash
--policy cache_aware \
--cache-threshold 0.5 \
--balance-abs-threshold 32 \
--balance-rel-threshold 1.0001
```
#### How It Works
1. **Load Assessment**: Checks whether the system is balanced
   - Imbalanced if: `(max_load - min_load) > balance_abs_threshold` AND `max_load > balance_rel_threshold * min_load`
2. **Routing Decision**:
   - **Balanced system**: Uses cache-aware routing
     - Routes to the worker with the highest prefix match if the match ratio exceeds `cache_threshold` (that worker likely has the relevant data cached)
     - Otherwise routes to the worker with the smallest tree, i.e. the most available cache capacity
   - **Imbalanced system**: Uses shortest-queue routing, tracking pending request counts and sending the request to the least busy worker
3. **Cache Management**:
   - Maintains an approximate radix tree per worker based on request history, so workers are never queried for their actual cache state; the tree stores raw text characters instead of token IDs to avoid tokenization overhead
   - A background task periodically evicts least recently used leaf nodes, controlled by `--eviction-interval` and `--max-tree-size`
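The decision logic above can be sketched in a few lines, using plain lists of recent request texts in place of the router's approximate radix trees. This is a toy model; only the parameter names follow the flags above:

```python
def common_prefix_len(a: str, b: str) -> int:
    """Length of the shared character prefix of two strings."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def pick_worker(request_text, loads, histories,
                cache_threshold=0.5, abs_threshold=32, rel_threshold=1.0001):
    """loads: {worker: pending request count}; histories: {worker: [recent texts]}."""
    max_load, min_load = max(loads.values()), min(loads.values())
    if (max_load - min_load) > abs_threshold and max_load > rel_threshold * min_load:
        # Imbalanced: shortest-queue routing to the least busy worker.
        return min(loads, key=loads.get)
    # Balanced: cache-aware routing by best prefix match.
    best, best_match = None, 0
    for worker, texts in histories.items():
        match = max((common_prefix_len(request_text, t) for t in texts), default=0)
        if match > best_match:
            best, best_match = worker, match
    if best is not None and best_match / max(len(request_text), 1) > cache_threshold:
        return best
    # Low match ratio: pick the worker with the least cached text ("smallest tree").
    return min(histories, key=lambda w: sum(len(t) for t in histories[w]))
```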
### Data Parallelism Aware Routing
Enables fine-grained control over data parallel replicas:
```bash
--dp-aware \
--api-key your_api_key  # Required for worker authentication
```
By default, SGLang's DP controller distributes incoming requests across DP ranks in round-robin fashion. With `--dp-aware`, the router contacts each worker to retrieve its `dp_size` and registers workers at the DP-rank level, applying the cache-aware routing strategy at this finer granularity in coordination with the DP controller on the SRT side.
## Configuration Reference
### Core Settings
| Parameter | Type | Default | Description |
|-----------------------------|------|-------------|-----------------------------------------------------------------|
| `--host` | str | 127.0.0.1 | Router server host address |
| `--port` | int | 30000 | Router server port |
| `--worker-urls` | list | [] | Worker URLs for separate launch mode |
| `--policy` | str | cache_aware | Routing policy (random, round_robin, cache_aware, power_of_two) |
| `--max-concurrent-requests` | int | 64 | Maximum concurrent requests (rate limiting) |
| `--request-timeout-secs` | int | 600 | Request timeout in seconds |
| `--max-payload-size` | int | 256MB | Maximum request payload size |
### Cache-Aware Routing Parameters
| Parameter | Type | Default | Description |
|---------------------------|-------|----------|--------------------------------------------------------|
| `--cache-threshold` | float | 0.5 | Minimum prefix match ratio for cache routing (0.0-1.0) |
| `--balance-abs-threshold` | int | 32 | Absolute load difference threshold |
| `--balance-rel-threshold` | float | 1.0001 | Relative load ratio threshold |
| `--eviction-interval` | int | 60 | Seconds between cache eviction cycles |
| `--max-tree-size` | int | 16777216 | Maximum nodes in routing tree |
### Fault Tolerance Parameters
| Parameter | Type | Default | Description |
|------------------------------|-------|---------|---------------------------------------|
| `--retry-max-retries` | int | 3 | Maximum retry attempts per request |
| `--retry-initial-backoff-ms` | int | 100 | Initial retry backoff in milliseconds |
| `--retry-max-backoff-ms` | int | 10000 | Maximum retry backoff in milliseconds |
| `--retry-backoff-multiplier` | float | 2.0 | Backoff multiplier between retries |
| `--retry-jitter-factor` | float | 0.1 | Random jitter factor for retries |
| `--disable-retries` | flag | False | Disable retry mechanism |
| `--cb-failure-threshold` | int | 5 | Failures before circuit opens |
| `--cb-success-threshold` | int | 2 | Successes to close circuit |
| `--cb-timeout-duration-secs` | int | 30 | Circuit breaker timeout duration |
| `--cb-window-duration-secs` | int | 60 | Circuit breaker window duration |
| `--disable-circuit-breaker` | flag | False | Disable circuit breaker |
### Prefill-Decode Disaggregation Parameters
| Parameter | Type | Default | Description |
|-----------------------------------|------|---------|-------------------------------------------------------|
| `--pd-disaggregation` | flag | False | Enable PD disaggregated mode |
| `--prefill` | list | [] | Prefill server URLs with optional bootstrap ports |
| `--decode` | list | [] | Decode server URLs |
| `--prefill-policy` | str | None | Routing policy for prefill nodes (overrides --policy) |
| `--decode-policy` | str | None | Routing policy for decode nodes (overrides --policy) |
| `--worker-startup-timeout-secs` | int | 300 | Timeout for worker startup |
| `--worker-startup-check-interval` | int | 10 | Interval between startup checks |
### Kubernetes Integration
| Parameter | Type | Default | Description |
|---------------------------------|------|--------------------------|------------------------------------------------------|
| `--service-discovery` | flag | False | Enable Kubernetes service discovery |
| `--selector` | list | [] | Label selector for workers (key1=value1 key2=value2) |
| `--prefill-selector` | list | [] | Label selector for prefill servers in PD mode |
| `--decode-selector` | list | [] | Label selector for decode servers in PD mode |
| `--service-discovery-port` | int | 80 | Port for discovered pods |
| `--service-discovery-namespace` | str | None | Kubernetes namespace to watch |
| `--bootstrap-port-annotation` | str | sglang.ai/bootstrap-port | Annotation for bootstrap ports |
### Observability
| Parameter | Type | Default | Description |
|------------------------|------|-----------|-------------------------------------------------------|
| `--prometheus-port` | int | 29000 | Prometheus metrics port |
| `--prometheus-host` | str | 127.0.0.1 | Prometheus metrics host |
| `--log-dir` | str | None | Directory for log files |
| `--log-level` | str | info | Logging level (debug, info, warning, error, critical) |
| `--request-id-headers` | list | None | Custom headers for request tracing |
### CORS Configuration
| Parameter | Type | Default | Description |
|--------------------------|------|---------|----------------------|
| `--cors-allowed-origins` | list | [] | Allowed CORS origins |
## Advanced Features
### Kubernetes Service Discovery
Automatically discover and manage workers in Kubernetes:
#### Standard Mode
```bash
python -m sglang_router.launch_router \
--service-discovery \
--selector app=sglang-worker env=prod \
--service-discovery-namespace production \
--service-discovery-port 8000
```
#### Prefill-Decode Disaggregation Mode
```bash
python -m sglang_router.launch_router \
--pd-disaggregation \
--service-discovery \
--prefill-selector app=prefill-server env=prod \
--decode-selector app=decode-server env=prod \
--service-discovery-namespace production
```
**Note**: The `--bootstrap-port-annotation` (default: `sglang.ai/bootstrap-port`) is used to discover bootstrap ports for prefill servers in PD mode. Prefill pods should have this annotation set to their bootstrap port value.
### Prometheus Metrics
Expose metrics for monitoring:
```bash
python -m sglang_router.launch_router \
--worker-urls http://worker1:8000 http://worker2:8001 \
--prometheus-port 29000 \
--prometheus-host 0.0.0.0
```
Metrics are available at `http://localhost:29000/metrics`.
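The endpoint serves the standard Prometheus text exposition format, which is easy to scrape or inspect ad hoc. A small parsing sketch; the metric name used in the example is hypothetical, not necessarily one the router exports:

```python
def parse_metrics(text: str) -> dict:
    """Map 'name{labels} value' exposition lines to float values,
    skipping comments (# HELP / # TYPE) and blank lines.
    Naively splits on the last space, so label values containing
    spaces are not handled."""
    out = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, value = line.rpartition(" ")
        try:
            out[name] = float(value)
        except ValueError:
            pass
    return out
```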
### Request Tracing
Enable request ID tracking:
```bash
python -m sglang_router.launch_router \
--worker-urls http://worker1:8000 http://worker2:8001 \
--request-id-headers x-request-id x-trace-id
```
## Troubleshooting
### Common Issues
1. **Workers not connecting**: Ensure workers are fully initialized before starting the router. Use `--worker-startup-timeout-secs` to increase wait time.
2. **High latency**: Check if cache-aware routing is causing imbalance. Try adjusting `--balance-abs-threshold` and `--balance-rel-threshold`.
3. **Memory growth**: Reduce `--max-tree-size` or decrease `--eviction-interval` for more aggressive cache cleanup.
4. **Circuit breaker triggering frequently**: Increase `--cb-failure-threshold` or extend `--cb-window-duration-secs`.
### Debug Mode
Enable detailed logging:
```bash
python -m sglang_router.launch_router \
--worker-urls http://worker1:8000 http://worker2:8001 \
--log-level debug \
--log-dir ./router_logs
```