SGLang router is a standalone Rust module that enables data parallelism across SGLang instances, providing high-performance request routing and advanced load balancing. It supports multiple load-balancing algorithms (cache-aware, power-of-two, random, and round-robin) and acts as a specialized load balancer for prefill-decode disaggregated serving architectures.
The gateway provides the model routing control and data plane for large-scale LLM deployments: it orchestrates fleets of workers, balances traffic across HTTP and gRPC backends, and exposes OpenAI-compatible APIs with pluggable history storage and tool integrations, while remaining deeply optimized for the SGLang serving runtime.
## Overview
- Unified control plane for registering, monitoring, and orchestrating prefill, decode, and regular workers across heterogeneous model fleets.
- Data plane that routes requests across HTTP, PD (prefill/decode), gRPC, and OpenAI-compatible backends with shared reliability features.
- Industry-first gRPC pipeline with native Rust tokenization, reasoning, and tool-call execution for high-throughput OpenAI-compatible serving.
- Multi-model inference gateway mode (`--enable-igw`) that runs several routers at once and applies per-model policies.
- Conversation, response, and chat-history connectors that centralize state at the router, enabling compliant sharing across models/MCP loops with in-memory, no-op, or Oracle ATP storage options.
- Built-in reliability primitives: retries with exponential backoff, circuit breakers, token-bucket rate limiting, and queuing.
- First-class observability with structured logging and Prometheus metrics.
### Architecture at a Glance
**Control Plane**
- Worker Manager validates workers, discovers capabilities, and keeps the registry in sync.
- Job Queue serializes background operations (add/remove) and exposes status via `/workers/{url}`.
- Background health checker and load monitor keep circuit breakers and policies informed.
- Optional Kubernetes service discovery keeps the registry aligned with pods.
**Data Plane**
- SGLang HTTP routers for regular and PD (prefill/decode) traffic with policy-aware selection.
- SGLang gRPC router and pipeline that stream tokenized requests through SRT gRPC workers, with pure-Rust tokenizer, reasoning-parser, and tool-parser implementations for maximal OpenAI API performance; supports both single-stage and PD serving topologies.
- OpenAI router that proxies OpenAI-style requests, responses, and conversations to remote vendors (OpenAI, xAI, Gemini, and other OpenAI-compatible providers) while preserving streaming/SSE semantics.
- Router Manager coordinates multiple router implementations when IGW is enabled.
- Resilience layer delivers token-bucket rate limiting, request queuing, retry executor, and per-worker circuit breakers to keep traffic flowing through failures.
- Advanced load balancing with cache-aware request reuse, load-aware (power-of-two) selection, and per-model policy overrides.
- Multi-model HTTP serving and inference gateway routing with model-specific policies.
- Prefill/decode disaggregation, including bootstrap port handling and cache-aware merging.
- gRPC routing with fully Rust tokenizer loading, reasoning parser selection, and tool parser integration for OpenAI-compatible endpoints—supporting streaming and non-streaming modes across DeepSeek, Llama, Kimi K2, Qwen, GPT-OSS, Mistral, Step-3, GLM4, and other reasoning-capable models.
- OpenAI-compatible `/v1/chat/completions`, `/v1/responses`, `/v1/conversations`, `/v1/embeddings`, and `/v1/rerank` endpoints.
- Native MCP client integration supporting all MCP transport protocols (STDIO, HTTP, SSE, and Streamable) for tool execution loops.
- Pluggable history connectors: in-memory, disabled, or Oracle ATP (with pooling and credential support).
- Reliability controls: retry with jitter, worker-scoped circuit breakers, token bucket limiter with optional queue, and cache flush APIs.
- Service discovery for regular and PD workloads with independent selectors.
- Prometheus metrics and structured tracing for every stage of routing.
Add more workers with the same API; include optional `labels` (for per-model policies) or `tokenizer_path` / `reasoning_parser` / `tool_parser` fields as needed. `/workers/{url}` exposes queued job status while background jobs finalize registration.
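As a sketch, dynamic registration might look like the following; the `POST /workers` route, the payload field names, and the URL-encoded status lookup are assumptions based on the fields and routes described above, so adjust them to your deployment:

```bash
# Hypothetical example: register a worker with labels and an explicit API key.
# Host, port, worker address, and payload schema are placeholders.
curl -X POST http://localhost:30000/workers \
    -H "Authorization: Bearer $ROUTER_API_KEY" \
    -H "Content-Type: application/json" \
    -d '{
          "url": "http://worker-3:8000",
          "api_key": "'"$WORKER_API_KEY"'",
          "labels": {"model": "llama-3.1-8b"},
          "tokenizer_path": "meta-llama/Llama-3.1-8B-Instruct"
        }'

# Poll queued job status while registration finalizes
# (assuming the worker URL is passed URL-encoded in the path)
curl -H "Authorization: Bearer $ROUTER_API_KEY" \
    "http://localhost:30000/workers/http%3A%2F%2Fworker-3%3A8000"
```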
The gRPC router tokenizes inputs locally, supports tool-call parsing, and streams responses. It supports both regular HTTP-equivalent serving and PD (prefill/decode) serving when the worker registry contains PD workers. Provide `--model-path` or `--tokenizer-path` (HuggingFace ID or local directory) whenever connection mode resolves to gRPC.
Use `--reasoning-parser` to select built-in reasoning pipelines (DeepSeek-R1, Qwen3, Step-3, GLM4, etc.) and `--tool-call-parser` for JSON/Pythonic/XML tool contracts in streaming or non-streaming modes.
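A minimal gRPC-mode launch sketch; the `grpc://` worker addressing, model path, and parser name below are illustrative assumptions, so consult `--help` for the exact values supported by your build:

```bash
# Launch the router against SRT gRPC workers (addresses and model are placeholders;
# the parser name is illustrative)
python3 -m sglang_router.launch_router \
    --worker-urls grpc://10.0.0.5:20000 grpc://10.0.0.6:20000 \
    --model-path deepseek-ai/DeepSeek-R1-Distill-Qwen-7B \
    --reasoning-parser deepseek-r1
```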
### OpenAI Backend Mode
Route requests to OpenAI or OpenAI-compatible endpoints:
```bash
# Route to OpenAI API
python3 -m sglang_router.launch_router \
    --backend openai \
    --worker-urls https://api.openai.com \
    --api-key "$OPENAI_API_KEY"

# Route to a custom OpenAI-compatible endpoint (Gemini, xAI, etc.)
python3 -m sglang_router.launch_router \
    --backend openai \
    --worker-urls <provider-base-url> \
    --api-key "$PROVIDER_API_KEY"
```
| Method | Route | Description |
|--------|-------|-------------|
| `GET` | `/list_workers` | Legacy list of worker URLs. **Deprecated soon**—use `GET /workers` instead. |
| `POST` | `/flush_cache` | Trigger cache flush across HTTP workers with success/failure breakdown. |
| `GET` | `/get_loads` | Sample current load reported by each worker. |
All administrative routes inherit router API-key protection when `--api-key` is supplied. Job status includes `pending`, `processing`, and `failed` phases with timestamps.
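For example, with the router listening on a placeholder address:

```bash
# Flush caches across HTTP workers (requires the router API key when --api-key is set)
curl -X POST -H "Authorization: Bearer $ROUTER_API_KEY" http://localhost:30000/flush_cache

# Sample the current load reported by each worker
curl -H "Authorization: Bearer $ROUTER_API_KEY" http://localhost:30000/get_loads
```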
### Service Discovery
Enable Kubernetes discovery to reconcile workers automatically:
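A launch sketch with Kubernetes discovery enabled; the discovery flags and selector follow the router's service-discovery options but are shown with placeholder values:

```bash
# Discover worker pods matching a label selector in the given namespace
python3 -m sglang_router.launch_router \
    --service-discovery \
    --service-discovery-namespace default \
    --selector app=sglang-worker \
    --policy cache_aware
```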
- Share load-balancing policies (random, round-robin, cache-aware, power-of-two) with DP-aware scheduling, retries, circuit breakers, and rate limiting.
- Record metrics per request, track running load, and integrate with the router-wide policy registry.
The HTTP router exposes the full OpenAI-compatible surface area (`/generate`, `/v1/chat/completions`, `/v1/completions`, `/v1/embeddings`, `/v1/responses`, `/v1/rerank`, etc.). The gRPC router delivers blazing-fast `/generate` and `/v1/chat/completions` today, with the remaining endpoints returning `501 Not Implemented` until their pipelines are finalized.
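A quick smoke test against the chat completions endpoint (host, port, and model name are placeholders):

```bash
curl http://localhost:30000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer $ROUTER_API_KEY" \
    -d '{
          "model": "meta-llama/Llama-3.1-8B-Instruct",
          "messages": [{"role": "user", "content": "Hello!"}],
          "stream": false
        }'
```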
#### HTTP Router specifics
- **Regular router** handles classic single-stage workers with per-model policy overrides.
- **Prefill/Decode router** coordinates disaggregated prefill and decode workers, merges metadata, and manages streaming fan-in.
#### gRPC Router specifics
- Industry-first fully Rust implementation of an OpenAI-compatible gRPC inference gateway, including tokenizer, reasoning parser, and tool parser execution in-process for maximum throughput.
- Supports both single-stage and PD (prefill/decode) worker topologies; the router automatically selects the appropriate pipeline per model.
- Provides the same `/v1/*` APIs as the HTTP router while streaming tokenized requests/responses directly to SRT gRPC workers.
- Built-in reasoning parsers for DeepSeek, Qwen, Llama, Mistral, GPT-OSS, Step-3, GLM4, Kimi K2, and other structured-thought models.
- Tool-call parsers for JSON, Pythonic, XML, and custom schemas with streaming and non-streaming execution loops.
- Tokenizer factory supporting HuggingFace models, local tokenizer.json files, and chat template overrides (see `src/tokenizer`).
- Explore the code paths in `src/reasoning_parser`, `src/tool_parser`, and `src/tokenizer` for the end-to-end Rust implementations that power gRPC mode.
### OpenAI Router
- Proxies OpenAI-compatible chat completions and responses APIs, preserving headers and SSE streams end-to-end.
- Supports `/v1/responses` background jobs with cancellation, deletion, and listing input items—enabling agentic, multi-turn orchestration without persisting data at remote vendor endpoints.
- Conversation APIs (`/v1/conversations` and `/items`) interact with the configured conversation storage backend for compliant chat-history management. Conversation state lives at the router tier, so the same history can drive different models or MCP loops without leaking data to upstream vendors.
- Chat history, agentic multi-turn `/v1/responses`, and the native MCP client (STDIO/HTTP/SSE/Streamable transports) are designed to satisfy enterprise data-privacy requirements by keeping sensitive state within the router.
Public health endpoints (`/liveness`, `/readiness`, `/health`, `/health_generate`) reflect registry state; readiness ensures PD workers are paired and IGW has at least one healthy route.
## Conversations, Responses, and Data Connectors
- `--history-backend memory` (default) stores responses and conversations in-process.
- `--history-backend none` disables persistence while keeping APIs.
- `--history-backend oracle` uses Oracle Autonomous Database; provide credentials via flags or environment variables.
- Conversation item storage mirrors the history backend (Oracle or memory). The same storage powers OpenAI `/responses` and conversation APIs.
### History Backend (OpenAI Router Mode)
Store conversation and response data for tracking, debugging, or analytics.
> **Note:** History backends are currently supported only when running with `--backend openai`. gRPC mode support for the `/v1/responses` API is planned.
#### Available storage options
- **Memory** (default): In-memory storage, fast but ephemeral.
- **None**: No storage, minimal overhead.
- **Oracle**: Persistent storage backed by Oracle Autonomous Database (ATP).
Only one of `--oracle-dsn` or `--oracle-tns-alias` should be supplied.
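A sketch of the two common configurations; endpoint, keys, and DSN are placeholders, and Oracle credential flags or environment variables beyond the DSN are left to your build's documentation:

```bash
# In-memory history (default)
python3 -m sglang_router.launch_router --backend openai \
    --worker-urls https://api.openai.com --api-key "$OPENAI_API_KEY" \
    --history-backend memory

# Oracle ATP-backed history; supply the remaining credentials via the
# documented Oracle flags or environment variables
python3 -m sglang_router.launch_router --backend openai \
    --worker-urls https://api.openai.com --api-key "$OPENAI_API_KEY" \
    --history-backend oracle \
    --oracle-dsn "$ORACLE_DSN"
```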
## Reliability & Flow Control
- **Retries**: Default max retries = 5 with exponential backoff (`--retry-max-retries`, `--retry-initial-backoff-ms`, `--retry-max-backoff-ms`, `--retry-backoff-multiplier`, `--retry-jitter-factor`). Retries trigger on 408/429/500/502/503/504 (see the combined example after this list).
- **Circuit Breakers**: Per-worker thresholds (`--cb-failure-threshold`, `--cb-success-threshold`, `--cb-timeout-duration-secs`, `--cb-window-duration-secs`). Disable via `--disable-circuit-breaker`.
- **Rate Limiting**: Token bucket driven by `--max-concurrent-requests`. Set `--rate-limit-tokens-per-second` to override the refill rate. Configure the request queue via `--queue-size` and `--queue-timeout-secs`; queued requests observe FIFO order and respect cancellation.
- **Health Checks**: Runtime probes via `--health-check-interval-secs`, `--health-check-timeout-secs`, failure/success thresholds, and `--health-check-endpoint`.
- **Cache Management**: `/flush_cache` ensures LRU eviction when redeploying PD workers.
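These controls can be combined at launch; the values below are illustrative:

```bash
python3 -m sglang_router.launch_router \
    --worker-urls http://worker1:8000 http://worker2:8000 \
    --retry-max-retries 5 --retry-initial-backoff-ms 100 --retry-max-backoff-ms 5000 \
    --cb-failure-threshold 5 --cb-timeout-duration-secs 30 \
    --max-concurrent-requests 512 --queue-size 128 --queue-timeout-secs 30
```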
## Load Balancing Policies
- `random`: uniform random worker selection.
- `round_robin`: sequential rotation with atomic counters.
- `cache_aware`: maintains a prefix tree of prompts to route repeat traffic and evens load with configurable thresholds (`--cache-threshold`, `--balance-abs-threshold`, `--balance-rel-threshold`, `--eviction-interval`, `--max-tree-size`); see the launch example below.
- `power_of_two`: chooses the lighter worker among two random candidates; integrates with `LoadMonitor`.
Per-model overrides are available in PD mode (`--prefill-policy`, `--decode-policy`) and IGW mode via the worker registry.
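For instance, a cache-aware launch might look like this, assuming the policy is selected with `--policy` (threshold values are illustrative):

```bash
python3 -m sglang_router.launch_router \
    --worker-urls http://worker1:8000 http://worker2:8000 \
    --policy cache_aware \
    --cache-threshold 0.5 \
    --balance-abs-threshold 32 \
    --balance-rel-threshold 1.5
```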
## Observability
- **Logging**: Structured tracing through `tracing` with an optional file sink (`--log-dir`) and `--log-level` (`debug`, `info`, `warn`, `error`).
- **Prometheus Metrics**: Enable with `--prometheus-host`/`--prometheus-port` (defaults to `0.0.0.0:29000`). Metrics cover request latency, retry behavior, circuit breaker states, worker health/load, queue depth, PD pipeline stats, tokenizer timings, and MCP activity (see the example after this list).
- **Request IDs**: Configurable headers via `--request-id-headers`; responses include `x-request-id`.
- **CORS**: Set `--cors-allowed-origins` for browser access.
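An example enabling metrics, file logging, request-ID headers, and CORS (values are placeholders):

```bash
python3 -m sglang_router.launch_router \
    --worker-urls http://worker1:8000 \
    --prometheus-host 0.0.0.0 --prometheus-port 29000 \
    --log-level info --log-dir ./router-logs \
    --request-id-headers x-request-id \
    --cors-allowed-origins http://localhost:3000
```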
## Security
### Router and Worker API Keys
- **Router API key (`--api-key`)** protects client access to router endpoints; all protected routes expect `Authorization: Bearer <key>`.
- Workers listed in `--worker-urls` inherit the router API key automatically.
- When adding workers dynamically, provide explicit API keys via payload or query string; they do **not** inherit automatically.
```bash
# Workers specified in --worker-urls automatically inherit the router's API key
# (worker addresses are placeholders)
python3 -m sglang_router.launch_router \
    --worker-urls http://worker1:8000 http://worker2:8000 \
    --api-key "$ROUTER_API_KEY"
```
When modifying runtime behavior, rebuild the wheel or run the binary directly. Use `python -m sglang_router.launch_server` to co-launch the router and SGLang workers in small clusters for local validation.
---
SGLang Model Gateway continues to evolve alongside the core SGLang runtime. Contributions should keep CLI flags, documentation, and Python bindings in sync with the Rust implementation.
### CI/CD Pipeline
The continuous integration pipeline includes comprehensive testing, benchmarking, and publishing:
#### Build & Test
1. **Build Wheels**: Uses `cibuildwheel` for manylinux x86_64 packages
2. **Build Source Distribution**: Creates source distribution for pip fallback
3. **Rust HTTP Server Benchmarking**: Performance testing of router overhead
4. **Basic Inference Testing**: End-to-end validation through the router
5. **PD Disaggregation Testing**: Benchmark and sanity checks for prefill-decode load balancing
#### Publishing
- **PyPI Publishing**: Wheels and source distributions are published only when the version changes in `pyproject.toml`
- **Container Images**: Docker images published using `/docker/Dockerfile.router`
## Features
- **High Performance**: Rust-based routing with connection pooling and optimized request handling