SGLang router is a standalone Rust module that enables data parallelism across SGLang instances, providing high-performance request routing and advanced load balancing. The router supports multiple load balancing algorithms, including cache-aware, power of two, random, and round robin, and acts as a specialized load balancer for prefill-decode disaggregated serving architectures.
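As a quick illustration, the router can front multiple running SGLang workers and balance requests across them. A minimal sketch, assuming the workers are already serving at the listed URLs and that the standard `--worker-urls` and `--policy` options are used:

```bash
# Route requests across two existing SGLang workers using the
# cache-aware load balancing policy (worker URLs are illustrative).
python -m sglang_router.launch_router \
    --worker-urls http://localhost:30001 http://localhost:30002 \
    --policy cache_aware
```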
For development purposes, you can install the package in editable mode:
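A typical editable install from the router's source directory might look like the following; this assumes a standard `pip` editable build of the Python binding:

```bash
# Editable (development) install of the Python binding from the package root.
pip install -e .
```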
⚠️ **Warning**: Editable installs of the Python binding can suffer from performance degradation. Build a fresh wheel for every update if you want to test performance.
### Kubernetes Service Discovery

SGL Router supports automatic service discovery for worker nodes in Kubernetes environments. This feature works with both regular (single-server) routing and PD (Prefill-Decode) routing modes. When enabled, the router will automatically:

- Discover and add worker pods with matching labels
- Remove unhealthy or deleted worker pods
- Dynamically adjust the worker pool based on pod health and availability
- For PD mode: distinguish between prefill and decode servers based on labels
#### Regular Mode Service Discovery
Automatic worker discovery and management in Kubernetes environments.
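A sketch of launching the router with service discovery in regular mode, using the general arguments documented below; the selector labels and namespace are illustrative:

```bash
python -m sglang_router.launch_router \
    --policy cache_aware \
    --service-discovery \
    --selector app=sglang role=worker \
    --service-discovery-port 8000 \
    --service-discovery-namespace sglang-system
```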
#### PD Mode Service Discovery

For PD (Prefill-Decode) disaggregated routing, service discovery can automatically discover and classify pods as either prefill or decode servers based on their labels:
```bash
# For disaggregated prefill/decode routing:
python -m sglang_router.launch_router \
    --pd-disaggregation \
    --policy cache_aware \
    --service-discovery \
    --prefill-selector app=sglang component=prefill \
    --decode-selector app=sglang component=decode \
    --service-discovery-namespace sglang-system
```
You can also specify initial prefill and decode servers and let service discovery add more:
```bash
python -m sglang_router.launch_router \
    --pd-disaggregation \
    --policy cache_aware \
    --prefill http://prefill-1:8000 8001 \
    --decode http://decode-1:8000 \
    --service-discovery \
    --prefill-selector app=sglang component=prefill \
    --decode-selector app=sglang component=decode \
    --service-discovery-namespace sglang-system
```
#### Kubernetes Pod Configuration for PD Mode
When using PD service discovery, your Kubernetes pods need specific labels to be classified as prefill or decode servers:
**Prefill Server Pod:**

```yaml
metadata:
  labels:
    app: sglang
    component: prefill
  annotations:
    sglang.ai/bootstrap-port: "9001"  # Optional: Bootstrap port for Mooncake prefill coordination
spec:
  containers:
    - name: sglang
      image: lmsys/sglang:latest
      ports:
        - containerPort: 8000  # Main API port
        - containerPort: 9001  # Optional: Bootstrap coordination port
      # ... rest of configuration
```
**Decode Server Pod:**

```yaml
metadata:
  labels:
    app: sglang
    component: decode
spec:
  containers:
    - name: sglang
      image: lmsys/sglang:latest
      ports:
        - containerPort: 8000  # Main API port
      # ... rest of configuration
```
**Key Requirements:**
- Prefill pods must have labels matching your `--prefill-selector`
- Decode pods must have labels matching your `--decode-selector`
- Prefill pods can optionally include a bootstrap port in annotations using `sglang.ai/bootstrap-port` (defaults to None if not specified)

#### RBAC Configuration
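Because service discovery watches pods through the Kubernetes API, the router's service account needs `get`/`list`/`watch` permissions on pods in the watched namespace (or cluster-wide when no namespace is specified). A minimal sketch using `kubectl`; the role, binding, namespace, and service account names are illustrative:

```bash
# Allow the router's service account to watch pods in the watched namespace.
kubectl create role sglang-router-pod-reader \
    --verb=get,list,watch \
    --resource=pods \
    -n sglang-system

kubectl create rolebinding sglang-router-pod-reader \
    --role=sglang-router-pod-reader \
    --serviceaccount=sglang-system:sglang-router \
    -n sglang-system
```

If no `--service-discovery-namespace` is set, the router watches all namespaces and needs an equivalent `ClusterRole` and `ClusterRoleBinding` instead.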
#### Service Discovery Arguments

**General Arguments:**

- `--service-discovery`: Enable the Kubernetes service discovery feature
- `--service-discovery-port`: Port to use when generating worker URLs (default: 8000)
- `--service-discovery-namespace`: Optional. Kubernetes namespace to watch for pods. If not provided, the router watches all namespaces (requires cluster-wide permissions)
- `--selector`: One or more label key-value pairs for pod selection in regular mode (format: `key1=value1 key2=value2`)

**PD Mode Arguments:**

- `--prefill-selector`: Label key-value pairs that identify prefill server pods
- `--decode-selector`: Label key-value pairs that identify decode server pods
**Note:** In PD mode with service discovery, pods MUST match either the prefill or decode selector to be added. Pods that don't match either selector are ignored.
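To check which pods currently match the selectors used in the examples above:

```bash
# Pods that would be classified as prefill and decode servers, respectively.
kubectl get pods -n sglang-system -l app=sglang,component=prefill
kubectl get pods -n sglang-system -l app=sglang,component=decode
```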
**Note:** When modifying Rust code, you must rebuild the wheel for changes to take effect.
### Troubleshooting

**VSCode Rust Analyzer Issues:**

If rust-analyzer is not working in VSCode, set `rust-analyzer.linkedProjects` to the absolute path of `Cargo.toml` in your repo. For example:

```json
{
  "rust-analyzer.linkedProjects": ["/absolute/path/to/Cargo.toml"]
}
```
### CI/CD Pipeline

The continuous integration pipeline includes comprehensive testing, benchmarking, and publishing:

#### Build & Test

1. **Build Wheels**: Uses `cibuildwheel` to create manylinux x86_64 packages compatible with major Linux distributions (Ubuntu, CentOS, etc.); additional configurations can be added to support other OS/architectures
2. **Build Source Distribution**: Creates a source distribution containing the raw, unbuilt code, enabling `pip` to build the package from source when prebuilt wheels are unavailable
3. **Rust HTTP Server Benchmarking**: Performance testing of router overhead
4. **Basic Inference Testing**: End-to-end validation through the router
5. **PD Disaggregation Testing**: Benchmark and sanity checks for prefill-decode load balancing

#### Publishing

- **PyPI Publishing**: Wheels and source distributions are published only when the version changes in `pyproject.toml`
- **Container Images**: Docker images published using `/docker/Dockerfile.router`

The CI configuration is based on the [tiktoken workflow](https://github.com/openai/tiktoken/blob/63527649963def8c759b0f91f2eb69a40934e468/.github/workflows/build_wheels.yml#L1).

## Features

- **High Performance**: Rust-based routing with connection pooling and optimized request handling