SGLang Model Gateway (the SGLang Router) is a high-performance model-routing gateway for large-scale LLM deployments. It distributes inference requests across multiple SGLang runtime instances with cache-aware load balancing, fault tolerance, and support for advanced deployment patterns including data parallelism and prefill-decode disaggregation. It centralizes worker lifecycle management, balances traffic across heterogeneous protocols (HTTP, gRPC, OpenAI-compatible), and provides enterprise-ready control over history storage, MCP tooling, and privacy-sensitive workflows. The router is deeply optimized for the SGLang serving runtime, but can route to any OpenAI-compatible backend.
## Key Features
- **Cache-Aware Load Balancing**: Optimizes cache utilization while maintaining balanced load distribution
- **Multiple Routing Policies**: Choose from random, round-robin, cache-aware, or power-of-two policies
- **Fault Tolerance**: Automatic retry and circuit breaker mechanisms for resilient operation
- **Dynamic Scaling**: Add or remove workers at runtime without service interruption
- **Kubernetes Integration**: Native service discovery and pod management
- **Prefill-Decode Disaggregation**: Support for disaggregated serving load balancing
- **Prometheus Metrics**: Built-in observability and monitoring
- **Rate Limiter**: Token-bucket rate limiter to shield workers from overload
---
## Table of Contents
1. [Overview](#overview)
2. [Architecture](#architecture)
   - [Control Plane](#control-plane)
   - [Data Plane](#data-plane)
   - [Storage & Privacy](#storage--privacy)
3. [Deployment Modes](#deployment-modes)
4. [Worker Lifecycle & Dynamic Scaling](#worker-lifecycle--dynamic-scaling)
5. [Fault Tolerance](#fault-tolerance)
6. [Routing Policies](#routing-policies)
7. [Service Discovery (Kubernetes)](#service-discovery-kubernetes)
8. [Security & Authentication](#security--authentication)
9. [Observability](#observability)
10. [Troubleshooting](#troubleshooting)
---
## Overview
- **Unified control plane** for registering, monitoring, and orchestrating regular, prefill, and decode workers across heterogeneous model fleets.
- **Multi-protocol data plane** that routes traffic across HTTP, PD (prefill/decode), gRPC, and OpenAI-compatible backends with shared reliability primitives.
- **Industry-first gRPC pipeline** with native Rust tokenization, reasoning parsers, and tool-call execution for high-throughput, OpenAI-compatible serving; supports both single-stage and PD topologies.
- **Inference Gateway Mode (`--enable-igw`)** dynamically instantiates multiple router stacks (HTTP regular/PD, gRPC) and applies per-model policies for multi-tenant deployments.
- **Conversation & responses connectors** centralize chat history inside the router (memory, none, Oracle ATP) so the same context can be reused across models and MCP loops without leaking data to upstream vendors.
- **Enterprise privacy**: agentic multi-turn `/v1/responses`, native MCP client (STDIO/HTTP/SSE/Streamable), and history storage all operate within the router boundary.
- **Reliability core**: retries with jitter, worker-scoped circuit breakers, token-bucket rate limiting with queuing, background health checks, and cache-aware load monitoring.
- **Observability**: Prometheus metrics, structured tracing, request ID propagation, and detailed job queue stats.
---
## Architecture
### Control Plane
- **Worker Manager** discovers capabilities (`/get_server_info`, `/get_model_info`), tracks load, and registers/removes workers in the shared registry.
- **Job Queue** serializes add/remove requests and exposes status (`/workers/{url}`) so clients can track onboarding progress.
- **Load Monitor** feeds cache-aware and power-of-two policies with live worker load statistics.
- **Health Checker** continuously probes workers and updates readiness, circuit breaker state, and router metrics.
### Data Plane
- **gRPC router** streams tokenized requests directly to SRT gRPC workers, running fully in Rust: tokenizer, reasoning parser, and tool parser all reside in-process. Supports both single-stage and PD routing.
- **OpenAI router** proxies OpenAI-compatible endpoints to external vendors (OpenAI, xAI, etc.) while keeping chat history and multi-turn orchestration local.
### Storage & Privacy
- Conversation and response history is stored at the router tier (memory, none, or Oracle ATP). The same history can power multiple models or MCP loops without sending data to upstream vendors.
- `/v1/responses` agentic flows, MCP sessions, and conversation APIs share the same storage layer, enabling compliance for regulated workloads.
---
## Deployment Modes
The router supports three primary deployment patterns:
1. **Co-launch Mode**: Router and workers launch together (simplest for single-node deployments)
2. **Separate Launch Mode**: Router and workers launch independently (best for multi-node setups)
3. **Prefill-Decode Disaggregation**: Specialized mode for disaggregated serving
### Co-launch Router and Workers
This mode launches both the router and multiple worker instances in a single command (ideal for single-node deployments or quick starts). It is the simplest deployment option and replaces the `--dp-size` argument of SGLang Runtime. The CLI accepts two namespaces of arguments:
- **Worker arguments** (no prefix) configure the SGLang runtime (`--model`, `--tp-size`, `--dp-size`, `--grpc-mode`, etc.).
- **Router arguments** are prefixed with `--router-` and map directly to `launch_router` flags (`--router-policy`, `--router-model-path`, `--router-log-level`, ...).
> gRPC router supports both single-stage and PD serving. Provide `--tokenizer-path` or `--model-path` (HF repo or local directory) plus optional `--chat-template`.
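A minimal co-launch invocation might look like the following sketch. The `sglang_router.launch_server` entry point, model path, and sizes are illustrative assumptions, not taken from this document:

```shell
# Co-launch: unprefixed flags go to the SGLang workers, --router-* flags to the router
python -m sglang_router.launch_server \
--model-path meta-llama/Llama-3.1-8B-Instruct \
--dp-size 4 \
--router-policy cache_aware
```

This replaces launching the router and each worker separately; the router discovers the co-launched workers automatically.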
### Prefill/Decode Disaggregation
Split prefill and decode workers for PD-aware caching and balancing.
The `--prefill` flag accepts URLs with optional bootstrap ports:
- `--prefill http://server:8000` - No bootstrap port
- `--prefill http://server:8000 9000` - Bootstrap port 9000
- `--prefill http://server:8000 none` - Explicitly no bootstrap port
#### Policy Inheritance in PD Mode
The router resolves policy configuration for prefill and decode nodes as follows:
1. **Only `--policy` specified**: Both prefill and decode nodes use this policy
2. **`--policy` and `--prefill-policy` specified**: Prefill nodes use `--prefill-policy`, decode nodes use `--policy`
3. **`--policy` and `--decode-policy` specified**: Prefill nodes use `--policy`, decode nodes use `--decode-policy`
4. **All three specified**: Prefill nodes use `--prefill-policy`, decode nodes use `--decode-policy` (main `--policy` is ignored)
Example with mixed policies:
```bash
python -m sglang_router.launch_router \
--pd-disaggregation \
--prefill http://prefill1:8000 \
--prefill http://prefill2:8000 \
--decode http://decode1:8001 \
--decode http://decode2:8001 \
--policy round_robin \
--prefill-policy cache_aware  # Prefill uses cache_aware; decode uses round_robin from --policy
```
#### PD Mode with Service Discovery
For Kubernetes deployments with separate prefill and decode server pools:
```bash
python -m sglang_router.launch_router \
--pd-disaggregation \
--service-discovery \
--prefill-selector app=prefill-server tier=gpu \
--decode-selector app=decode-server tier=cpu \
--service-discovery-namespace production \
--prefill-policy cache_aware \
--decode-policy round_robin
```
### OpenAI Backend Proxy
Proxy OpenAI-compatible endpoints (OpenAI, xAI, etc.) while keeping history and MCP sessions local.
```bash
python -m sglang_router.launch_router \
--backend openai \
--worker-urls https://api.openai.com \
--api-key "$OPENAI_API_KEY" \
--history-backend memory
```
> OpenAI backend mode expects exactly one `--worker-urls` entry per router instance.
---
## Worker Lifecycle & Dynamic Scaling
Add or remove workers at runtime using the REST APIs. Jobs are queued and tracked for eventual consistency.
**Note**: When using cache-aware routing, removed workers are cleanly evicted from the routing tree and request queues.
Legacy endpoints (`/add_worker`, `/remove_worker`, `/list_workers`) remain available but will be deprecated. `/workers/{url}` returns both registry data and queued job status.
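As a sketch of this flow using the endpoints named above (the router address and worker URL are illustrative, and the worker address in `/workers/{url}` may need URL-encoding):

```shell
# Queue a new worker for onboarding (legacy endpoint, still available)
curl -X POST "http://localhost:30000/add_worker?url=http://worker-3:8000"

# Track onboarding progress: registry data plus queued job status
curl "http://localhost:30000/workers/http://worker-3:8000"

# Remove the worker; cache-aware routing state is evicted cleanly
curl -X POST "http://localhost:30000/remove_worker?url=http://worker-3:8000"
```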
## Fault Tolerance
The router includes comprehensive fault tolerance mechanisms, including automatic retries with jitter and per-worker circuit breakers:
- A worker is marked unhealthy after `cb-failure-threshold` consecutive failures
- It returns to service after `cb-success-threshold` successful health checks
- The circuit breaker can be disabled with `--disable-circuit-breaker`
### Rate Limiter
Use the token-bucket rate limiter to cap requests before they overwhelm downstream workers.
- Enable rate limiting by setting `--max-concurrent-requests` to a positive integer. A bucket with that many tokens (concurrent leases) is created; `-1` keeps it disabled.
- Optionally override the refill rate with `--rate-limit-tokens-per-second`. If omitted, the refill rate matches `max-concurrent-requests`.
- Overflow traffic can wait in a FIFO queue controlled by:
  - `--queue-size`: pending-request buffer (0 disables queuing; defaults to 100).
  - `--queue-timeout-secs`: maximum wait time for queued requests before they time out (defaults to 60 seconds).

Requests beyond the concurrency limit wait in a FIFO queue of up to `--queue-size` entries.
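For example, the flags above can be combined as follows (values are illustrative):

```shell
python -m sglang_router.launch_router \
--max-concurrent-requests 256 \
--rate-limit-tokens-per-second 512 \
--queue-size 128 \
--queue-timeout-secs 30
```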
This configuration allows up to 256 concurrent requests, refills 512 tokens (requests) per second, and keeps up to 128 overflow requests queued for 30 seconds before timing out.
**Responses**:
- Returns **429** when the router cannot enqueue the request (queue disabled or full).
- Returns **408** when a queued request waits longer than `--queue-timeout-secs` or no token becomes available before the timeout.
---
## Routing Policies
| Policy | Description | Usage |
| --- | --- | --- |
| `random` | Uniform random selection. | `--policy random` |
| `round_robin` | Cycles through workers in order. | `--policy round_robin` |
| `power_of_two` | Samples two workers and picks the lighter one (requires Load Monitor). | `--policy power_of_two` |
| `cache_aware` | Default policy; combines cache locality with load balancing, falling back to shortest queue. | `--policy cache_aware` + tuning flags |
Key tuning flags:
```bash
--cache-threshold 0.5 \
--balance-abs-threshold 32 \
--balance-rel-threshold 1.5 \
--eviction-interval-secs 120 \
--max-tree-size 67108864
```
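Combined into a full launch command (worker URLs are illustrative):

```shell
python -m sglang_router.launch_router \
--worker-urls http://worker-1:8000 http://worker-2:8000 \
--policy cache_aware \
--cache-threshold 0.5 \
--balance-abs-threshold 32 \
--balance-rel-threshold 1.5
```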
---
## Service Discovery (Kubernetes)
Enable automatic worker discovery via Kubernetes pod selectors.
#### Standard Mode
```bash
python -m sglang_router.launch_router \
--service-discovery \
--selector app=sglang-worker env=prod \
--service-discovery-namespace production \
--service-discovery-port 8000
```
#### Prefill-Decode Disaggregation Mode
```bash
python -m sglang_router.launch_router \
--pd-disaggregation \
--service-discovery \
--prefill-selector app=prefill-server env=prod \
--decode-selector app=decode-server env=prod \
--service-discovery-namespace production
```
**Note**: The `--bootstrap-port-annotation` (default: `sglang.ai/bootstrap-port`) is used to discover bootstrap ports for prefill servers in PD mode. Prefill pods should have this annotation set to their bootstrap port value. Ensure RBAC grants `get/list/watch` on pods.
---
## Security & Authentication
- **Router API key (`--api-key`)**: clients must supply `Authorization: Bearer <key>`.
- **Worker API keys**: when adding workers dynamically, include `api_key` in the payload; workers listed via CLI inherit the router key.
- **Full-stack auth**: start the router with `--api-key`, then add workers with their own keys.
- **Privacy**: All conversation history, `/v1/responses` state, and MCP sessions stay inside the router. Nothing is persisted at remote model vendors unless explicitly proxied.
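A sketch of full-stack authentication, combining a router-level API key with a per-worker key (the router port and the add-worker payload shape are assumptions for illustration):

```shell
# Start the router with its own API key
python -m sglang_router.launch_router \
--worker-urls http://worker-1:8000 \
--api-key "$ROUTER_API_KEY"

# Dynamically add a worker that carries its own key
curl -X POST "http://localhost:30000/add_worker?url=http://worker-2:8000" \
-H "Authorization: Bearer $ROUTER_API_KEY" \
-H "Content-Type: application/json" \
-d '{"api_key": "worker-2-secret"}'
```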
---
## Observability
The router exposes Prometheus metrics for monitoring and alerting:
| Metric | Type | Description |
| --- | --- | --- |
| `sgl_router_requests_total` | Counter | Total number of requests received by the router's API endpoint. Useful for tracking overall traffic. |
| `sgl_router_processed_requests_total` | Counter | Total requests processed, labeled by `worker`. Critical for spotting load imbalances. |
| `sgl_router_active_workers` | Gauge | The current number of healthy workers in the routing pool. Essential for alerting. |
| `sgl_router_running_requests` | Gauge | The number of currently in-flight requests, labeled by `worker`. For monitoring real-time load. |
| `sgl_router_cache_hits_total` | Counter | Total requests routed to a worker with a matching prefix cache. |
| `sgl_router_cache_misses_total` | Counter | Total requests that could not be routed based on cache locality. |
| `sgl_router_generate_duration_seconds` | Histogram | Tracks end-to-end request latency. Use this to monitor performance (e.g., p95/p99). |
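A quick sanity check against the metrics endpoint might look like this (the port is an assumption; adjust it to your Prometheus exporter configuration):

```shell
curl -s http://localhost:29000/metrics | grep sgl_router_active_workers
```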
## Troubleshooting
### Common Issues
1. **Workers not connecting**: Ensure workers are fully initialized before starting the router, or increase `--worker-startup-timeout-secs` so the router waits longer for health probes to respond.
2. **Load imbalance / hot workers**: Check the `sgl_router_processed_requests_total` metric grouped by `worker`. Cache-aware routing might be prioritizing cache hits too aggressively; try adjusting `--cache-threshold`, `--balance-abs-threshold`, and `--balance-rel-threshold`.
3. **Memory growth**: Reduce `--max-tree-size` or decrease `--eviction-interval-secs` for more aggressive cache cleanup.
4. **Circuit breaker flapping**: Increase `--cb-failure-threshold` or extend `--cb-window-duration-secs`. Consider temporarily disabling retries while diagnosing.
SGLang Model Gateway continues to evolve alongside the SGLang runtime. Keep CLI flags, integrations, and documentation aligned when adopting new features or contributing improvements.