[router] update router doc to latest features (#11639)

Unverified Commit e0c2af2a authored Oct 14, 2025 by Simo Lin Committed by GitHub Oct 14, 2025

Showing with 350 additions and 401 deletions

docs/advanced_features/router.md +350 -401

--- a/docs/advanced_features/router.md
+++ b/docs/advanced_features/router.md
-# SGLang Router
-
-The SGLang Router is a high-performance request distribution system that routes inference requests across multiple SGLang runtime instances. It features cache-aware load balancing, fault tolerance, and support for advanced deployment patterns including data parallelism and prefill-decode disaggregation.
-
-## Key Features
-
-- **Cache-Aware Load Balancing**: Optimizes cache utilization while maintaining balanced load distribution
-- **Multiple Routing Policies**: Choose from random, round-robin, cache-aware, or power-of-two policies
-- **Fault Tolerance**: Automatic retry and circuit breaker mechanisms for resilient operation
-- **Dynamic Scaling**: Add or remove workers at runtime without service interruption
-- **Kubernetes Integration**: Native service discovery and pod management
-- **Prefill-Decode Disaggregation**: Support for disaggregated serving load balancing
-- **Prometheus Metrics**: Built-in observability and monitoring
-- **Rate Limiter**: Token-bucket rate limiter to shield workers from overload
-
-## Installation
-
-```bash
-pip install sglang-router
-```
-
-## Quick Start
-
-To see all available options:
-
-```bash
-python -m sglang_router.launch_server --help  # Co-launch router and workers
-python -m sglang_router.launch_router --help  # Launch router only
-```
+# SGLang Model Gateway (formerly SGLang Router)
+
+SGLang Model Gateway is a high-performance model-routing gateway for large-scale LLM deployments. It centralizes worker lifecycle management, balances traffic across heterogeneous protocols (HTTP, gRPC, OpenAI-compatible), and provides enterprise-ready control over history storage, MCP tooling, and privacy-sensitive workflows. The router is deeply optimized for the SGLang serving runtime, but can route to any OpenAI-compatible backend.
+
+---
+
+## Table of Contents
+1. [Overview](#overview)
+2. [Architecture](#architecture)
+   - [Control Plane](#control-plane)
+   - [Data Plane](#data-plane)
+   - [Storage & Privacy](#storage--privacy)
+3. [Deployment Modes](#deployment-modes)
+   - [Co-launch Router + Workers](#co-launch-router--workers)
+   - [Separate Launch (HTTP)](#separate-launch-http)
+   - [gRPC Launch](#grpc-launch)
+   - [Prefill/Decode Disaggregation](#prefilldecode-disaggregation)
+   - [OpenAI Backend Proxy](#openai-backend-proxy)
+4. [Worker Lifecycle & Dynamic Scaling](#worker-lifecycle--dynamic-scaling)
+5. [Reliability & Flow Control](#reliability--flow-control)
+6. [Load Balancing Policies](#load-balancing-policies)
+7. [Service Discovery (Kubernetes)](#service-discovery-kubernetes)
+8. [Security & Authentication](#security--authentication)
+9. [History & Data Connectors](#history--data-connectors)
+10. [MCP & Advanced Tooling](#mcp--advanced-tooling)
+11. [API Surface](#api-surface)
+12. [Configuration Reference](#configuration-reference)
+13. [Observability](#observability)
+14. [Troubleshooting](#troubleshooting)
+
+---
+
+## Overview
+- **Unified control plane** for registering, monitoring, and orchestrating regular, prefill, and decode workers across heterogeneous model fleets.
+- **Multi-protocol data plane** that routes traffic across HTTP, PD (prefill/decode), gRPC, and OpenAI-compatible backends with shared reliability primitives.
+- **Industry-first gRPC pipeline** with native Rust tokenization, reasoning parsers, and tool-call execution for high-throughput, OpenAI-compatible serving; supports both single-stage and PD topologies.
+- **Inference Gateway Mode (`--enable-igw`)** dynamically instantiates multiple router stacks (HTTP regular/PD, gRPC) and applies per-model policies for multi-tenant deployments.
+- **Conversation & responses connectors** (memory, none, or Oracle ATP) centralize chat history inside the router so the same context can be reused across models and MCP loops without leaking data to upstream vendors.
+- **Enterprise privacy**: agentic multi-turn `/v1/responses`, native MCP client (STDIO/HTTP/SSE/Streamable), and history storage all operate within the router boundary.
+- **Reliability core**: retries with jitter, worker-scoped circuit breakers, token-bucket rate limiting with queuing, background health checks, and cache-aware load monitoring.
+- **Observability**: Prometheus metrics, structured tracing, request ID propagation, and detailed job queue stats.
+
+---
+
+## Architecture
+
+### Control Plane
+- **Worker Manager** discovers capabilities (`/get_server_info`, `/get_model_info`), tracks load, and registers/removes workers in the shared registry.
+- **Job Queue** serializes add/remove requests and exposes status (`/workers/{url}`) so clients can track onboarding progress.
+- **Load Monitor** feeds cache-aware and power-of-two policies with live worker load statistics.
+- **Health Checker** continuously probes workers and updates readiness, circuit breaker state, and router metrics.
+
+### Data Plane
+- **HTTP routers** (regular & PD) implement `/generate`, `/v1/chat/completions`, `/v1/completions`, `/v1/responses`, `/v1/embeddings`, `/v1/rerank`, and associated admin endpoints.
+- **gRPC router** streams tokenized requests directly to SRT gRPC workers, running fully in Rust—tokenizer, reasoning parser, and tool parser all reside in-process. Supports both single-stage and PD routing.
+- **OpenAI router** proxies OpenAI-compatible endpoints to external vendors (OpenAI, xAI, etc.) while keeping chat history and multi-turn orchestration local.
+
+### Storage & Privacy
+- Conversation and response history is stored at the router tier (memory, none, or Oracle ATP). The same history can power multiple models or MCP loops without sending data to upstream vendors.
+- `/v1/responses` agentic flows, MCP sessions, and conversation APIs share the same storage layer, enabling compliance for regulated workloads.
+
+---

 ## Deployment Modes

-The router supports three primary deployment patterns:
-
-1. **Co-launch Mode**: Router and workers launch together (simplest for single-node deployments)
-2. **Separate Launch Mode**: Router and workers launch independently (best for multi-node setups)
-3. **Prefill-Decode Disaggregation**: Specialized mode for disaggregated serving
-
-### Mode 1: Co-launch Router and Workers
-
-This mode launches both the router and multiple worker instances in a single command. It's the simplest deployment option and replaces the `--dp-size` argument of SGLang Runtime.
+### Co-launch Router + Workers
+Launch the router and a fleet of SGLang workers in one process (ideal for single-node or quick starts). The CLI accepts two namespaces of arguments:
+- **Worker arguments** (no prefix) configure the SGLang runtime (`--model`, `--tp-size`, `--dp-size`, `--grpc-mode`, etc.).
+- **Router arguments** are prefixed with `--router-` and map directly to `launch_router` flags (`--router-policy`, `--router-model-path`, `--router-log-level`, ...).

 ```bash
-# Launch router with 4 workers
 python -m sglang_router.launch_server \
-    --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
+  --model meta-llama/Meta-Llama-3.1-8B-Instruct \
  --dp-size 4 \
  --host 0.0.0.0 \
  --port 30000
 ```

-#### Sending Requests
-
-Once the server is ready, send requests to the router endpoint:
-
-```python
-import requests
-
-# Using the /generate endpoint
-url = "http://localhost:30000/generate"
-data = {
-    "text": "What is the capital of France?",
-    "sampling_params": {
-        "temperature": 0.7,
-        "max_new_tokens": 100
-    }
-}
-
-response = requests.post(url, json=data)
-print(response.json())
-
-# OpenAI-compatible endpoint
-url = "http://localhost:30000/v1/chat/completions"
-data = {
-    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
-    "messages": [{"role": "user", "content": "What is the capital of France?"}]
-}
-
-response = requests.post(url, json=data)
-print(response.json())
-```
-
-### Mode 2: Separate Launch Mode
-
-This mode is ideal for multi-node deployments where workers run on different machines.
-
-#### Step 1: Launch Workers
-
-On each worker node:
-
+Comprehensive example:
 ```bash
-# Worker node 1
-python -m sglang.launch_server \
-    --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
-    --host 0.0.0.0 \
-    --port 8000
-
-# Worker node 2
-python -m sglang.launch_server \
-    --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
+python3 -m sglang_router.launch_server \
  --host 0.0.0.0 \
-    --port 8001
+  --port 8080 \
+  --model /raid/models/meta-llama/Llama-3.1-8B-Instruct \
+  --tp-size 1 \
+  --dp-size 8 \
+  --grpc-mode \
+  --log-level debug \
+  --router-prometheus-port 10001 \
+  --router-tool-call-parser llama \
+  --router-health-success-threshold 2 \
+  --router-health-check-timeout-secs 6000 \
+  --router-health-check-interval-secs 60 \
+  --router-model-path /raid/models/meta-llama/Llama-3.1-8B-Instruct \
+  --router-policy round_robin \
+  --router-log-level debug
 ```

-#### Step 2: Launch Router
-
-On the router node:
+### Separate Launch (HTTP)
+Run workers independently and point the router at their HTTP endpoints.

 ```bash
+# Worker nodes
+python -m sglang.launch_server --model meta-llama/Meta-Llama-3.1-8B-Instruct --port 8000
+python -m sglang.launch_server --model meta-llama/Meta-Llama-3.1-8B-Instruct --port 8001
+
+# Router node
 python -m sglang_router.launch_router \
  --worker-urls http://worker1:8000 http://worker2:8001 \
-    --host 0.0.0.0 \
-    --port 30000 \
-    --policy cache_aware  # or random, round_robin, power_of_two
+  --policy cache_aware \
+  --host 0.0.0.0 --port 30000
 ```
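Once the router is listening, clients talk to the router address rather than to individual workers. The following is a minimal sketch using only the Python standard library. It assumes the router from the example above is reachable at `http://localhost:30000` and that the payload follows the OpenAI chat-completions shape; `build_chat_request` and `send` are hypothetical helper names, not part of `sglang_router`.

```python
import json
import urllib.request

ROUTER_URL = "http://localhost:30000"  # assumption: address used in the launch example


def build_chat_request(base_url, model, prompt):
    """Build (but do not send) a POST for the router's /v1/chat/completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=60.0):
    """Send the request and decode the JSON body. Requires a running router."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With a router running, `send(build_chat_request(ROUTER_URL, "meta-llama/Meta-Llama-3.1-8B-Instruct", "What is the capital of France?"))` should return the decoded completion; the router chooses the worker according to `--policy`.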

-### Mode 3: Prefill-Decode Disaggregation
-
-This advanced mode separates prefill and decode operations for optimized performance:
+### gRPC Launch
+Use SRT gRPC workers to unlock the highest throughput and access native reasoning/tool pipelines.

 ```bash
+# Workers expose gRPC endpoints
+python -m sglang.launch_server \
+  --model /raid/models/meta-llama/Llama-3.1-8B-Instruct \
+  --grpc-mode \
+  --port 20000
+
+# Router
 python -m sglang_router.launch_router \
-    --pd-disaggregation \
-    --prefill http://prefill1:8000 9000 \
-    --prefill http://prefill2:8001 9001 \
-    --decode http://decode1:8002 \
-    --decode http://decode2:8003 \
-    --prefill-policy cache_aware \
-    --decode-policy round_robin
+  --worker-urls grpc://127.0.0.1:20000 \
+  --model-path meta-llama/Llama-3.1-8B-Instruct \
+  --reasoning-parser deepseek-r1 \
+  --tool-call-parser json \
+  --host 0.0.0.0 --port 8080
 ```

-#### Understanding --prefill Arguments
-
-The `--prefill` flag accepts URLs with optional bootstrap ports:
-- `--prefill http://server:8000` - No bootstrap port
-- `--prefill http://server:8000 9000` - Bootstrap port 9000
-- `--prefill http://server:8000 none` - Explicitly no bootstrap port
-
-#### Policy Inheritance in PD Mode
-
-The router intelligently handles policy configuration for prefill and decode nodes:
+> gRPC router supports both single-stage and PD serving. Provide `--tokenizer-path` or `--model-path` (HF repo or local directory) plus optional `--chat-template`.

-1. **Only `--policy` specified**: Both prefill and decode nodes use this policy
-2. **`--policy` and `--prefill-policy` specified**: Prefill nodes use `--prefill-policy`, decode nodes use `--policy`
-3. **`--policy` and `--decode-policy` specified**: Prefill nodes use `--policy`, decode nodes use `--decode-policy`
-4. **All three specified**: Prefill nodes use `--prefill-policy`, decode nodes use `--decode-policy` (main `--policy` is ignored)
+### Prefill/Decode Disaggregation
+Split prefill and decode workers for PD-aware caching and balancing.

-Example with mixed policies:
 ```bash
 python -m sglang_router.launch_router \
  --pd-disaggregation \
-    --prefill http://prefill1:8000
-    --prefill http://prefill2:8000 \
-    --decode http://decode1:8001
-    --decode http://decode2:8001 \
-    --policy round_robin \
-    --prefill-policy cache_aware  # Prefill uses cache_aware and decode uses round_robin from --policy
+  --prefill http://prefill1:30001 9001 \
+  --decode http://decode1:30011 \
+  --policy cache_aware \
+  --prefill-policy cache_aware \
+  --decode-policy power_of_two
 ```

-#### PD Mode with Service Discovery
-
-For Kubernetes deployments with separate prefill and decode server pools:
+### OpenAI Backend Proxy
+Proxy OpenAI-compatible endpoints (OpenAI, xAI, etc.) while keeping history and MCP sessions local.

 ```bash
 python -m sglang_router.launch_router \
-    --pd-disaggregation \
-    --service-discovery \
-    --prefill-selector app=prefill-server tier=gpu \
-    --decode-selector app=decode-server tier=cpu \
-    --service-discovery-namespace production \
-    --prefill-policy cache_aware \
-    --decode-policy round_robin
+  --backend openai \
+  --worker-urls https://api.openai.com \
+  --api-key "$OPENAI_API_KEY" \
+  --history-backend memory
 ```

-## Dynamic Scaling
+> OpenAI backend mode expects exactly one `--worker-urls` entry per router instance.

-The router supports runtime scaling through REST APIs:
+---

-### Adding Workers
+## Worker Lifecycle & Dynamic Scaling

-```bash
-# Launch a new worker
-python -m sglang.launch_server \
-    --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
-    --port 30001
+Add or remove workers at runtime using the REST APIs. Jobs are queued and tracked for eventual consistency.

-# Add it to the router
-curl -X POST "http://localhost:30000/add_worker?url=http://127.0.0.1:30001"
-```
+```bash
+# Add a worker (HTTP or gRPC)
+curl -X POST http://localhost:30000/workers \
+  -H "Content-Type: application/json" \
+  -d '{"url":"grpc://0.0.0.0:31000","worker_type":"regular"}'

-### Removing Workers
+# Inspect registry
+curl http://localhost:30000/workers | jq

-```bash
-curl -X POST "http://localhost:30000/remove_worker?url=http://127.0.0.1:30001"
+# Remove a worker
+curl -X DELETE http://localhost:30000/workers/grpc://0.0.0.0:31000
 ```

-**Note**: When using cache-aware routing, removed workers are cleanly evicted from the routing tree and request queues.
-
-## Fault Tolerance
+Legacy endpoints (`/add_worker`, `/remove_worker`, `/list_workers`) remain available but will be deprecated. `/workers/{url}` returns both registry data and queued job status.
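The same worker operations can be scripted. The sketch below builds the `POST /workers` and `DELETE /workers/{url}` requests with the standard library. One assumption is flagged loudly: the worker URL is percent-encoded before being placed in the path, which is a common precaution for URLs embedded in paths but is not confirmed by this document (the `curl` example passes the raw URL). The helper names are hypothetical.

```python
import json
import urllib.parse
import urllib.request


def add_worker_request(router, worker_url, worker_type="regular"):
    """Build the POST /workers request that queues a worker registration."""
    body = {"url": worker_url, "worker_type": worker_type}
    return urllib.request.Request(
        url=f"{router}/workers",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def remove_worker_request(router, worker_url):
    """Build the DELETE /workers/{url} request that queues a worker removal."""
    # Assumption: encode "://" and ":" so the worker URL stays a single path segment.
    encoded = urllib.parse.quote(worker_url, safe="")
    return urllib.request.Request(url=f"{router}/workers/{encoded}", method="DELETE")
```

If the router expects the raw form shown in the `curl` example, drop the `quote` call. Either way the requests are only queued, so poll `GET /workers` to confirm the change took effect.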

-The router includes comprehensive fault tolerance mechanisms:
+---

-### Retry Configuration
+## Reliability & Flow Control

+### Retries
 ```bash
 python -m sglang_router.launch_router \
  --worker-urls http://worker1:8000 http://worker2:8001 \
-    --retry-max-retries 3 \
-    --retry-initial-backoff-ms 100 \
-    --retry-max-backoff-ms 10000 \
-    --retry-backoff-multiplier 2.0 \
-    --retry-jitter-factor 0.1
+  --retry-max-retries 5 \
+  --retry-initial-backoff-ms 50 \
+  --retry-max-backoff-ms 30000 \
+  --retry-backoff-multiplier 1.5 \
+  --retry-jitter-factor 0.2
 ```
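To get a feel for what these values mean, the sketch below computes a retry schedule under the usual reading of exponential backoff with jitter: the delay grows by the multiplier on each attempt, is capped at the maximum, and is then perturbed by up to plus or minus the jitter factor. The router's exact formula is not stated here, so treat this as an illustration rather than its implementation; `backoff_delays_ms` is a hypothetical name.

```python
import random


def backoff_delays_ms(max_retries, initial_ms, max_ms, multiplier, jitter_factor, rng=None):
    """Return one delay (in ms) per retry attempt."""
    rng = rng or random.Random()
    delays = []
    for attempt in range(max_retries):
        # Exponential growth, capped at max_ms.
        base = min(initial_ms * multiplier ** attempt, max_ms)
        # Assumed jitter model: uniform in [-jitter_factor, +jitter_factor] of the base delay.
        jitter = base * jitter_factor * rng.uniform(-1.0, 1.0)
        delays.append(max(0.0, base + jitter))
    return delays


# Flags from the example above, with jitter disabled to show the underlying curve.
print(backoff_delays_ms(5, 50, 30000, 1.5, 0.0))
```

With jitter enabled, each delay lands within 20% of the values printed here, which helps keep many clients from retrying in lockstep.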

 ### Circuit Breaker
-
-Protects against cascading failures:
-
 ```bash
 python -m sglang_router.launch_router \
  --worker-urls http://worker1:8000 http://worker2:8001 \
@@ -225,42 +204,7 @@ python -m sglang_router.launch_router \
  --cb-window-duration-secs 60
 ```
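The state machine behind these flags (Closed, Open, HalfOpen, as drawn in the diagram this change removes) can be sketched in a few lines. This is a simplified model, not the router's code: it counts consecutive failures, ignores `--cb-window-duration-secs`, and uses illustrative defaults. The class and method names are hypothetical.

```python
import time
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class CircuitBreaker:
    def __init__(self, failure_threshold=5, success_threshold=2, timeout_secs=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.success_threshold = success_threshold
        self.timeout_secs = timeout_secs
        self._clock = clock
        self.state = State.CLOSED
        self._failures = 0
        self._successes = 0
        self._opened_at = 0.0

    def allow_request(self):
        """Open blocks traffic until the timeout elapses, then admits test requests."""
        if self.state is State.OPEN:
            if self._clock() - self._opened_at < self.timeout_secs:
                return False
            self.state = State.HALF_OPEN
            self._successes = 0
        return True

    def record_success(self):
        if self.state is State.HALF_OPEN:
            self._successes += 1
            if self._successes >= self.success_threshold:
                self.state = State.CLOSED
                self._failures = 0
        else:
            self._failures = 0  # a success breaks the run of consecutive failures

    def record_failure(self):
        if self.state is State.HALF_OPEN:
            self._trip()  # any failed test request reopens the circuit
        elif self.state is State.CLOSED:
            self._failures += 1
            if self._failures >= self.failure_threshold:
                self._trip()

    def _trip(self):
        self.state = State.OPEN
        self._opened_at = self._clock()
        self._failures = 0
```

Injecting `clock` keeps the sketch testable without sleeping; the breaker is worker-scoped, so a real deployment would hold one instance per worker.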

-```mermaid
-flowchart TD
-    Closed(["Closed"])
-    Open(["Open"])
-    HalfOpen(["HalfOpen"])
-
-    Closed -- "Consecutive Failures >=<br/>cb-failure-threshold" --> Open;
-    Closed --> HalfOpen;
-    linkStyle 1 stroke:transparent;
-    Open -- "After cb-timeout-duration-secs" --> HalfOpen;
-    HalfOpen -- "Fail any test request" --> Open;
-    HalfOpen -- "After cb-success-threshold<br/>test requests" --> Closed;
-    Closed -- "Failures < cb-failure-threshold" --> Closed;
-    style Closed fill:#00C853,color:#000000
-    style Open fill:#D50000,color:#000000
-    style HalfOpen fill:#FFD600,color:#000000
-    linkStyle 1 stroke:transparent,fill:none
-```
-
-**Behavior**:
-- Worker is marked unhealthy after `cb-failure-threshold` consecutive failures
-- Returns to service after `cb-success-threshold` successful health checks
-- Circuit breaker can be disabled with `--disable-circuit-breaker`
-
-### Rate Limiter
-
-Use the token-bucket rate limiter to cap requests before they overwhelm downstream workers.
-
-- Enable rate limiting by setting `--max-concurrent-requests` to a positive integer. A bucket with that many tokens (concurrent leases) is created; `-1` keeps it disabled.
-- Optionally override the refill rate with `--rate-limit-tokens-per-second`. If omitted, the refill rate matches `max-concurrent-requests`.
-- Overflow traffic can wait in a FIFO queue controlled by:
-  - `--queue-size`: pending-request buffer (0 disables queuing; defaults to 100).
-  - `--queue-timeout-secs`: maximum wait time for queued requests before returning `429` (defaults to 60 seconds).
-
-Example:
-
+### Rate Limiting & Queuing
 ```bash
 python -m sglang_router.launch_router \
  --worker-urls http://worker1:8000 http://worker2:8001 \
@@ -270,243 +214,248 @@ python -m sglang_router.launch_router \
  --queue-timeout-secs 30
 ```

-**Behavior**:
-
-This configuration allows up to 256 concurrent requests, refills 512 tokens (requests) per second, and keeps up to 128 overflow requests queued for 30 seconds before timing out.
-
-**Responses**:
-- Returns **429** when the router cannot enqueue the request (queue disabled or full).
-- Returns **408** when a queued request waits longer than `--queue-timeout-secs` or no token becomes available before the timeout.
+Requests beyond the concurrency limit wait in a FIFO queue of up to `--queue-size` entries. The router returns `429` when the queue is full (or disabled) and `408` when a queued request waits longer than `--queue-timeout-secs`.
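The admission behaviour can be modelled with a small toy class: run immediately while slots are free, queue while the FIFO has room, reject otherwise, and time out waiters that stay queued too long. This sketch deliberately leaves out the token refill rate and is not the router's implementation; `AdmissionGate` and its methods are hypothetical names.

```python
from collections import deque


class AdmissionGate:
    """Toy model of a concurrency limit with a bounded FIFO wait queue."""

    def __init__(self, max_concurrent, queue_size, queue_timeout_secs):
        self.max_concurrent = max_concurrent
        self.queue_size = queue_size
        self.queue_timeout_secs = queue_timeout_secs
        self.in_flight = 0
        self.waiting = deque()  # (request_id, enqueue_time), oldest first

    def submit(self, request_id, now):
        """Return "run", "queued", or "429" for a new request arriving at time `now`."""
        if self.in_flight < self.max_concurrent:
            self.in_flight += 1
            return "run"
        if len(self.waiting) < self.queue_size:
            self.waiting.append((request_id, now))
            return "queued"
        return "429"

    def expire(self, now):
        """Drop waiters queued longer than the timeout; each would receive 408."""
        timed_out = []
        while self.waiting and now - self.waiting[0][1] > self.queue_timeout_secs:
            timed_out.append(self.waiting.popleft()[0])
        return timed_out

    def release(self):
        """A running request finished; hand its slot to the oldest waiter, if any."""
        if self.waiting:
            return self.waiting.popleft()[0]
        self.in_flight -= 1
        return None
```

Because the queue is FIFO, the oldest waiter is always both the next to run and the first to time out, which is why `expire` only needs to inspect the head of the deque.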

-## Routing Policies
+---

-The router supports multiple routing strategies:
+## Load Balancing Policies

-### 1. Random Routing
-Distributes requests randomly across workers.
+| Policy             | Description                                                                                      | Usage                         |
+|--------------------|--------------------------------------------------------------------------------------------------|-------------------------------|
+| `random`           | Uniform random selection.                                                                        | `--policy random`             |
+| `round_robin`      | Cycles through workers in order.                                                                 | `--policy round_robin`        |
+| `power_of_two`     | Samples two workers and picks the lighter one (requires Load Monitor).                           | `--policy power_of_two`       |
+| `cache_aware`      | Default policy; combines cache locality with load balancing, falling back to shortest queue.     | `--policy cache_aware` + tuning flags |

+Key tuning flags:
 ```bash
---policy random
+--cache-threshold 0.5 \
+--balance-abs-threshold 32 \
+--balance-rel-threshold 1.5 \
+--eviction-interval-secs 120 \
+--max-tree-size 67108864
 ```
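How the thresholds interact is easier to see in code. The sketch below follows the decision rule spelled out in the earlier version of this document: the system counts as imbalanced when the load gap exceeds `balance-abs-threshold` and the maximum load exceeds `balance-rel-threshold` times the minimum; imbalanced traffic goes to the least-loaded worker, otherwise the best prefix match wins if it clears `cache-threshold`. The per-worker inputs (`loads`, `match_ratio`, `cached_size`) are assumed data structures for illustration, and all function names are hypothetical.

```python
import random


def is_imbalanced(loads, abs_threshold, rel_threshold):
    """Both the absolute gap and the relative ratio must be exceeded."""
    lo, hi = min(loads.values()), max(loads.values())
    return (hi - lo) > abs_threshold and hi > rel_threshold * lo


def pick_cache_aware(loads, match_ratio, cached_size,
                     cache_threshold=0.5, abs_threshold=32, rel_threshold=1.5):
    """Choose a worker name; every dict maps worker name to a number."""
    if is_imbalanced(loads, abs_threshold, rel_threshold):
        return min(loads, key=loads.get)  # shortest queue
    best = max(match_ratio, key=match_ratio.get)
    if match_ratio[best] > cache_threshold:
        return best  # reuse the worker that already holds the longest prefix
    # Assumption: the smallest cached footprint stands in for "most available cache capacity".
    return min(cached_size, key=cached_size.get)


def pick_power_of_two(loads, rng=random):
    """Sample two distinct workers and keep the lighter one (needs at least two workers)."""
    a, b = rng.sample(list(loads), 2)
    return a if loads[a] <= loads[b] else b
```

Raising `--balance-rel-threshold` makes the router tolerate more skew before it abandons cache locality, while lowering `--cache-threshold` makes it follow prefix matches more eagerly.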

-### 2. Round-Robin Routing
-Cycles through workers in order.
+---

-```bash
---policy round_robin
-```
+## Service Discovery (Kubernetes)

-### 3. Power of Two Choices
-Samples two workers and routes to the less loaded one.
+Enable automatic worker discovery via Kubernetes pod selectors.

 ```bash
---policy power_of_two
+python -m sglang_router.launch_router \
+  --service-discovery \
+  --selector app=sglang-worker role=inference \
+  --service-discovery-namespace production \
+  --service-discovery-port 8000
 ```

-### 4. Cache-Aware Load Balancing (Default)
+PD deployments can specify `--prefill-selector` and `--decode-selector` plus the `sglang.ai/bootstrap-port` annotation for prefill bootstrap ports. Ensure RBAC grants `get/list/watch` on pods.
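The `key=value` pairs passed to `--selector` behave like Kubernetes equality-based label selectors: a pod qualifies only when every pair matches its labels. The sketch below shows that matching rule in isolation. It does not call the Kubernetes API, and `parse_selector` and `pod_matches` are hypothetical helpers rather than router functions.

```python
def parse_selector(pairs):
    """Turn ["app=sglang-worker", "role=inference"] into a dict of required labels."""
    selector = {}
    for pair in pairs:
        key, sep, value = pair.partition("=")
        if not sep or not key:
            raise ValueError(f"expected key=value, got {pair!r}")
        selector[key] = value
    return selector


def pod_matches(labels, selector):
    """A pod matches when all selector pairs are present; extra pod labels are ignored."""
    return all(labels.get(key) == value for key, value in selector.items())


selector = parse_selector(["app=sglang-worker", "role=inference"])
print(pod_matches({"app": "sglang-worker", "role": "inference", "zone": "a"}, selector))
```

A pod that carries only `app=sglang-worker` would not be picked up under this selector, which is a common reason discovered worker counts come up short.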

-The most sophisticated policy that combines cache optimization with load balancing:
+---

-```bash
---policy cache_aware \
---cache-threshold 0.5 \
---balance-abs-threshold 32 \
---balance-rel-threshold 1.0001
-```
+## Security & Authentication

-#### How It Works
+- **Router API key (`--api-key`)**: clients must supply `Authorization: Bearer <key>`.
+- **Worker API keys**: when adding workers dynamically, include `api_key` in the payload; workers listed via CLI inherit the router key.
+- **Full-stack auth**: start router with `--api-key`, then add workers with their own keys:
+  ```bash
+  curl -H "Authorization: Bearer router-key" \
+    -X POST http://localhost:30000/workers \
+    -H "Content-Type: application/json" \
+    -d '{"url":"http://worker:8000","api_key":"worker-key"}'
+  ```
+- **Privacy**: All conversation history, `/v1/responses` state, and MCP sessions stay inside the router. Nothing is persisted at remote model vendors unless explicitly proxied.

-1. **Load Assessment**: Checks if the system is balanced
-   - Imbalanced if: `(max_load - min_load) > balance_abs_threshold` AND `max_load > balance_rel_threshold * min_load`
+---

-2. **Routing Decision**:
-   - **Balanced System**: Uses cache-aware routing
-     - Routes to worker with highest prefix match if match > `cache_threshold`
-     - Otherwise routes to worker with most available cache capacity
-   - **Imbalanced System**: Uses shortest queue routing to the least busy worker
+## History & Data Connectors

-3. **Cache Management**:
-   - Maintains approximate radix trees per worker
-   - Periodically evicts LRU entries based on `--eviction-interval-secs` and `--max-tree-size`
-
-### Data Parallelism Aware Routing
-
-Enables fine-grained control over data parallel replicas:
+| Backend | Description | Usage |
+|---------|-------------|-------|
+| `memory` (default) | In-memory storage for quick prototyping. | `--history-backend memory` |
+| `none` | No persistence; APIs operate but store nothing. | `--history-backend none` |
+| `oracle` | Oracle Autonomous Database-backed storage (pooled connections). | `--history-backend oracle` |

+Oracle configuration (choose DSN *or* TNS alias):
 ```bash
---dp-aware \
---api-key your_api_key  # Required for worker authentication
+export ATP_DSN="tcps://host:port/service"  # or use ATP_TNS_ALIAS + ATP_WALLET_PATH
+export ATP_USER="admin"
+export ATP_PASSWORD="secret"
+export ATP_POOL_MIN=4
+export ATP_POOL_MAX=32
+
+python -m sglang_router.launch_router \
+  --backend openai \
+  --worker-urls https://api.openai.com \
+  --history-backend oracle
 ```

-This mode coordinates with SGLang's DP controller for optimized request distribution across data parallel ranks.
+> History backends currently apply to OpenAI router mode. gRPC parity for `/v1/responses` is on the roadmap.

-## Configuration Reference
+---

-### Core Settings
+## MCP & Advanced Tooling

-| Parameter                   | Type | Default     | Description                                                     |
-| --------------------------- | ---- | ----------- | --------------------------------------------------------------- |
-| `--host`                    | str  | 127.0.0.1   | Router server host address                                      |
-| `--port`                    | int  | 30000       | Router server port                                              |
-| `--worker-urls`             | list | []          | Worker URLs for separate launch mode                            |
-| `--policy`                  | str  | cache_aware | Routing policy (random, round_robin, cache_aware, power_of_two) |
-| `--max-concurrent-requests` | int  | 64          | Maximum concurrent requests (rate limiting)                     |
-| `--request-timeout-secs`    | int  | 600         | Request timeout in seconds                                      |
-| `--max-payload-size`        | int  | 256MB       | Maximum request payload size                                    |
+- Native MCP client supports **STDIO**, **HTTP**, **SSE**, and **Streamable** transports—no external config files required.
+- Tool-call parsers cover JSON, Pythonic, XML, and custom schemas with streaming/non-streaming execution loops.
+- Reasoning parsers ship for DeepSeek-R1, Qwen3, Step-3, GLM4, Llama families, Kimi K2, GPT-OSS, Mistral, and more (`src/reasoning_parser`).
+- Tokenizer factory accepts HuggingFace IDs, local directories, and explicit `tokenizer.json` files with chat template overrides (`src/tokenizer`).

-### Cache-Aware Routing Parameters
+Use CLI flags to select parsers:
+```bash
+--reasoning-parser deepseek-r1 \
+--tool-call-parser json \
+--chat-template /path/to/template.json
+```

-| Parameter                  | Type  | Default  | Description                                            |
-| -------------------------- | ----- | -------- | ------------------------------------------------------ |
-| `--cache-threshold`        | float | 0.5      | Minimum prefix match ratio for cache routing (0.0-1.0) |
-| `--balance-abs-threshold`  | int   | 32       | Absolute load difference threshold                     |
-| `--balance-rel-threshold`  | float | 1.0001   | Relative load ratio threshold                          |
-| `--eviction-interval-secs` | int   | 60       | Seconds between cache eviction cycles                  |
-| `--max-tree-size`          | int   | 16777216 | Maximum nodes in routing tree                          |
+---
+
+## API Surface
+
+| Method                | Path                                     | Description                                    |
+|-----------------------|------------------------------------------|------------------------------------------------|
+| `POST`                | `/generate`                              | SGLang generate API.                           |
+| `POST`                | `/v1/chat/completions`                   | OpenAI-compatible chat (streaming/tool calls). |
+| `POST`                | `/v1/completions`                        | OpenAI-compatible text completions.            |
+| `POST`                | `/v1/responses`                          | Create background responses (agentic loops).   |
+| `GET`                 | `/v1/responses/{id}`                     | Retrieve stored responses.                     |
+| `GET`                 | `/v1/responses/{id}/input`               | List captured input items.                     |
+| `POST`                | `/v1/embeddings`                         | Forward embedding requests.                    |
+| `POST`                | `/v1/rerank`                             | Ranking endpoint (`/rerank` synonym).          |
+| `POST`                | `/v1/conversations`                      | Create conversation metadata.                  |
+| `GET`/`POST`/`DELETE` | `/v1/conversations/{id}`                 | Get/update/delete conversation.                |
+| `GET`/`POST`          | `/v1/conversations/{id}/items`           | List or append conversation items.             |
+| `GET`/`DELETE`        | `/v1/conversations/{id}/items/{item_id}` | Inspect/delete conversation item.              |
+| `GET`                 | `/workers`                               | List registered workers with health/load.      |
+| `POST`                | `/workers`                               | Queue worker registration.                     |
+| `DELETE`              | `/workers/{url}`                         | Queue worker removal.                          |
+| `POST`                | `/flush_cache`                           | Flush worker caches (HTTP workers).            |
+| `GET`                 | `/get_loads`                             | Retrieve worker load snapshot.                 |
+| `GET`                 | `/liveness` / `/readiness` / `/health`   | Health probes.                                 |
+
+---

-### Fault Tolerance Parameters
+## Configuration Reference

-| Parameter                    | Type  | Default | Description                           |
-| ---------------------------- | ----- | ------- | ------------------------------------- |
-| `--retry-max-retries`        | int   | 3       | Maximum retry attempts per request    |
-| `--retry-initial-backoff-ms` | int   | 100     | Initial retry backoff in milliseconds |
-| `--retry-max-backoff-ms`     | int   | 10000   | Maximum retry backoff in milliseconds |
-| `--retry-backoff-multiplier` | float | 2.0     | Backoff multiplier between retries    |
-| `--retry-jitter-factor`      | float | 0.1     | Random jitter factor for retries      |
-| `--disable-retries`          | flag  | False   | Disable retry mechanism               |
-| `--cb-failure-threshold`     | int   | 5       | Failures before circuit opens         |
-| `--cb-success-threshold`     | int   | 2       | Successes to close circuit            |
-| `--cb-timeout-duration-secs` | int   | 30      | Circuit breaker timeout duration      |
-| `--cb-window-duration-secs`  | int   | 60      | Circuit breaker window duration       |
-| `--disable-circuit-breaker`  | flag  | False   | Disable circuit breaker               |
-
-### Prefill-Decode Disaggregation Parameters
+### Core Settings

 | Parameter                   | Type | Default     | Description                                                              |
-| --------------------------------- | ---- | ------- | ----------------------------------------------------- |
-| `--pd-disaggregation`             | flag | False   | Enable PD disaggregated mode                          |
-| `--prefill`                       | list | []      | Prefill server URLs with optional bootstrap ports     |
-| `--decode`                        | list | []      | Decode server URLs                                    |
-| `--prefill-policy`                | str  | None    | Routing policy for prefill nodes (overrides --policy) |
-| `--decode-policy`                 | str  | None    | Routing policy for decode nodes (overrides --policy)  |
-| `--worker-startup-timeout-secs`   | int  | 300     | Timeout for worker startup                            |
-| `--worker-startup-check-interval` | int  | 10      | Interval between startup checks                       |
+|-----------------------------|------|-------------|--------------------------------------------------------------------------|
+| `--host`                    | str  | 127.0.0.1   | Router host.                                                             |
+| `--port`                    | int  | 30000       | Router port.                                                             |
+| `--worker-urls`             | list | []          | Worker URLs (HTTP or gRPC).                                              |
+| `--policy`                  | str  | cache_aware | Routing policy (`random`, `round_robin`, `cache_aware`, `power_of_two`). |
+| `--max-concurrent-requests` | int  | -1          | Concurrency limit (-1 disables rate limiting).                           |
+| `--request-timeout-secs`    | int  | 600         | Request timeout.                                                         |
+| `--max-payload-size`        | int  | 256MB       | Maximum request payload.                                                 |

-### Kubernetes Integration
+### Cache-Aware Tuning

 | Parameter                  | Type  | Default  | Description                 |
-| ------------------------------- | ---- | ------------------------ | ---------------------------------------------------- |
-| `--service-discovery`           | flag | False                    | Enable Kubernetes service discovery                  |
-| `--selector`                    | list | []                       | Label selector for workers (key1=value1 key2=value2) |
-| `--prefill-selector`            | list | []                       | Label selector for prefill servers in PD mode        |
-| `--decode-selector`             | list | []                       | Label selector for decode servers in PD mode         |
-| `--service-discovery-port`      | int  | 80                       | Port for discovered pods                             |
-| `--service-discovery-namespace` | str  | None                     | Kubernetes namespace to watch                        |
-| `--bootstrap-port-annotation`   | str  | sglang.ai/bootstrap-port | Annotation for bootstrap ports                       |
+|----------------------------|-------|----------|-----------------------------|
+| `--cache-threshold`        | float | 0.3      | Minimum prefix match ratio. |
+| `--balance-abs-threshold`  | int   | 64       | Absolute load threshold.    |
+| `--balance-rel-threshold`  | float | 1.5      | Relative load ratio.        |
+| `--eviction-interval-secs` | int   | 120      | Cache eviction cadence.     |
+| `--max-tree-size`          | int   | 67108864 | Max nodes in cache tree.    |

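The table does not say how the two balance thresholds combine. A minimal sketch, assuming the router treats the pool as imbalanced only when the load gap exceeds `--balance-abs-threshold` and the busiest worker exceeds `--balance-rel-threshold` times the least busy one. The helper name `is_imbalanced` and this exact rule are assumptions for illustration, not the router's code:

```bash
# Hypothetical helper: exits 0 when the pool counts as imbalanced.
# Args: max_load min_load [abs_threshold=64] [rel_threshold=1.5]
# Assumed rule: (max - min) > abs AND max > rel * min.
is_imbalanced() {
  awk -v max="$1" -v min="$2" -v abs="${3:-64}" -v rel="${4:-1.5}" \
    'BEGIN { exit !((max - min) > abs && max > rel * min) }'
}

# Gap of 100 exceeds 64 and 200 > 1.5 * 100, so both conditions hold.
is_imbalanced 200 100 && echo "imbalanced: route by load"

# Gap of 20 is under the absolute threshold, so cache affinity wins.
is_imbalanced 120 100 || echo "balanced: route by cache affinity"
```

Under this reading, raising either threshold makes the router tolerate more skew before it stops favoring cache hits.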
-### Observability
+### Fault Tolerance

 | Parameter                    | Type  | Default | Description                      |
-| ---------------------- | ---- | --------- | ----------------------------------------------------- |
-| `--prometheus-port`    | int  | 29000     | Prometheus metrics port                               |
-| `--prometheus-host`    | str  | 127.0.0.1 | Prometheus metrics host                               |
-| `--log-dir`            | str  | None      | Directory for log files                               |
-| `--log-level`          | str  | info      | Logging level (debug, info, warning, error, critical) |
-| `--request-id-headers` | list | None      | Custom headers for request tracing                    |
-
-### CORS Configuration
+|------------------------------|-------|---------|----------------------------------|
+| `--retry-max-retries`        | int   | 5       | Max retries.                     |
+| `--retry-initial-backoff-ms` | int   | 50      | Initial backoff (ms).            |
+| `--retry-max-backoff-ms`     | int   | 30000   | Max backoff (ms).                |
+| `--retry-backoff-multiplier` | float | 1.5     | Backoff multiplier.              |
+| `--retry-jitter-factor`      | float | 0.2     | Retry jitter (0.0-1.0).          |
+| `--disable-retries`          | flag  | False   | Disable retries.                 |
+| `--cb-failure-threshold`     | int   | 5       | Failures before opening circuit. |
+| `--cb-success-threshold`     | int   | 2       | Successes to close circuit.      |
+| `--cb-timeout-duration-secs` | int   | 30      | Cooldown period.                 |
+| `--cb-window-duration-secs`  | int   | 60      | Window size.                     |
+| `--disable-circuit-breaker`  | flag  | False   | Disable circuit breaker.         |
+
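To get a feel for the retry defaults, the sketch below prints the delay before each retry. It assumes plain exponential backoff: start at `--retry-initial-backoff-ms`, multiply by `--retry-backoff-multiplier` after each attempt, and cap at `--retry-max-backoff-ms`. Jitter is left out, and the helper name `backoff_schedule` is hypothetical; the router's actual formula may differ.

```bash
# Hypothetical helper: print the un-jittered delay before each retry.
# Args: initial_ms multiplier max_ms retries
backoff_schedule() {
  awk -v initial="$1" -v mult="$2" -v cap="$3" -v n="$4" 'BEGIN {
    delay = initial
    for (i = 1; i <= n; i++) {
      printf "retry %d: %d ms\n", i, int(delay)
      delay *= mult                  # grow exponentially
      if (delay > cap) delay = cap   # never exceed the max backoff
    }
  }'
}

# Defaults from the table above: 50 ms initial, 1.5x multiplier,
# 30000 ms cap, 5 retries.
backoff_schedule 50 1.5 30000 5
# retry 1: 50 ms
# retry 2: 75 ms
# retry 3: 112 ms
# retry 4: 168 ms
# retry 5: 253 ms
```

With these defaults the five retries add well under a second of waiting in total, so the 30000 ms cap only matters if the retry count or multiplier is raised substantially.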
+### Prefill/Decode

 | Parameter                         | Type | Default | Description                              |
-| ------------------------ | ---- | ------- | -------------------- |
-| `--cors-allowed-origins` | list | []      | Allowed CORS origins |
-
-## Advanced Features
+|-----------------------------------|------|---------|------------------------------------------|
+| `--pd-disaggregation`             | flag | False   | Enable PD mode.                          |
+| `--prefill`                       | list | []      | Prefill URLs + optional bootstrap ports. |
+| `--decode`                        | list | []      | Decode URLs.                             |
+| `--prefill-policy`                | str  | None    | Override policy for prefill nodes.       |
+| `--decode-policy`                 | str  | None    | Override policy for decode nodes.        |
+| `--worker-startup-timeout-secs`   | int  | 600     | Worker init timeout.                     |
+| `--worker-startup-check-interval` | int  | 30      | Polling interval.                        |
+
+### Kubernetes Discovery
+
+| Parameter                                  | Type | Description                                                        |
+|--------------------------------------------|------|--------------------------------------------------------------------|
+| `--service-discovery`                      | flag | Enable discovery.                                                  |
+| `--selector key=value ...`                 | list | Label selectors (regular mode).                                    |
+| `--prefill-selector` / `--decode-selector` | list | Label selectors for PD mode.                                       |
+| `--service-discovery-namespace`            | str  | Namespace to watch.                                                |
+| `--service-discovery-port`                 | int  | Worker port (default 80).                                          |
+| `--bootstrap-port-annotation`              | str  | Prefill bootstrap annotation (default `sglang.ai/bootstrap-port`). |
+
+---

-### Kubernetes Service Discovery
-
-Automatically discover and manage workers in Kubernetes:
-
-#### Standard Mode
-```bash
-python -m sglang_router.launch_router \
-    --service-discovery \
-    --selector app=sglang-worker env=prod \
-    --service-discovery-namespace production \
-    --service-discovery-port 8000
-```
-
-#### Prefill-Decode Disaggregation Mode
-```bash
-python -m sglang_router.launch_router \
-    --pd-disaggregation \
-    --service-discovery \
-    --prefill-selector app=prefill-server env=prod \
-    --decode-selector app=decode-server env=prod \
-    --service-discovery-namespace production
-```
-
-**Note**: The `--bootstrap-port-annotation` (default: `sglang.ai/bootstrap-port`) is used to discover bootstrap ports for prefill servers in PD mode. Prefill pods should have this annotation set to their bootstrap port value.
-
-### Prometheus Metrics
-
-Expose metrics for monitoring:
+## Observability

+Enable Prometheus metrics:
 ```bash
 python -m sglang_router.launch_router \
  --worker-urls http://worker1:8000 http://worker2:8001 \
-    --prometheus-port 29000 \
-    --prometheus-host 0.0.0.0
+  --prometheus-host 0.0.0.0 \
+  --prometheus-port 29000
 ```

-Metrics available at `http://localhost:29000/metrics`
+Key metrics:

-### Request Tracing
-
-Enable request ID tracking:
+| Metric | Type | Description |
+|--------|------|-------------|
+| `sgl_router_requests_total` | Counter | Total requests by endpoint/method. |
+| `sgl_router_processed_requests_total` | Counter | Requests processed per worker. |
+| `sgl_router_active_workers` | Gauge | Healthy worker count. |
+| `sgl_router_running_requests` | Gauge | In-flight requests per worker. |
+| `sgl_router_cache_hits_total` / `sgl_router_cache_misses_total` | Counter | Cache-aware routing hits/misses. |
+| `sgl_router_generate_duration_seconds` | Histogram | Request latency distribution. |

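For a quick check without a dashboard, a single metric can be filtered out of the scrape output. A minimal sketch, assuming the standard Prometheus text exposition format; the helper name `metric_samples` and the sample values are illustrative only:

```bash
# Hypothetical helper: keep only the sample lines for one metric name.
# Matches "name value" and "name{labels} value", skips # HELP / # TYPE lines.
metric_samples() {
  grep -E "^$1(\{| )"
}

# Illustrative scrape output (values are made up).
metric_samples sgl_router_active_workers <<'EOF'
# TYPE sgl_router_active_workers gauge
sgl_router_active_workers 2
sgl_router_running_requests{worker="http://worker1:8000"} 7
EOF
```

Against a running router, the same filter can be fed from the metrics endpoint, for example `curl -s http://localhost:29000/metrics | metric_samples sgl_router_active_workers`, assuming metrics are served at `/metrics` on the configured Prometheus port.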
+Enable request ID propagation:
 ```bash
 python -m sglang_router.launch_router \
-    --worker-urls http://worker1:8000 http://worker2:8001 \
+  --worker-urls http://worker1:8000 \
  --request-id-headers x-request-id x-trace-id
 ```

-## Observability
-
-When Prometheus is enabled, the router provides several key metrics for observability.
-
-| Metric Name                            | Type      | Description                                                                                          |
-|:---------------------------------------|:----------|:-----------------------------------------------------------------------------------------------------|
-| `sgl_router_requests_total`            | Counter   | Total number of requests received by the router's API endpoint. Useful for tracking overall traffic. |
-| `sgl_router_processed_requests_total`  | Counter   | Total requests processed, labeled by `worker`. Critical for spotting load imbalances.                |
-| `sgl_router_active_workers`            | Gauge     | The current number of healthy workers in the routing pool. Essential for alerting.                   |
-| `sgl_router_running_requests`          | Gauge     | The number of currently in-flight requests, labeled by `worker`. For monitoring real-time load.      |
-| `sgl_router_cache_hits_total`          | Counter   | Total requests routed to a worker with a matching prefix cache.                                      |
-| `sgl_router_cache_misses_total`        | Counter   | Total requests that could not be routed based on cache locality.                                     |
-| `sgl_router_generate_duration_seconds` | Histogram | Tracks end-to-end request latency. Use this to monitor performance (e.g., p95/p99).                  |
+---

 ## Troubleshooting

-### Common Issues
-
-1. **Workers not connecting**: Ensure workers are fully initialized before starting the router. Use `--worker-startup-timeout-secs` to increase wait time.
+1. **Workers never ready**
+   Increase `--worker-startup-timeout-secs` or ensure health probes respond before router startup.

-2. **High latency**:
-   - **A common cause**: Load Imbalanced.
-   - Check the `sgl_router_processed_requests_total` metric grouped by `worker`.
-   - Cache-aware routing might be prioritizing cache hits too aggressively.
-   - Try adjusting `--balance-abs-threshold` and `--balance-rel-threshold`.
+2. **Load imbalance / hot workers**
+   Inspect `sgl_router_processed_requests_total` and tune cache-aware thresholds (`--balance-*`, `--cache-threshold`).

-3. **Memory growth**: Reduce `--max-tree-size` or decrease `--eviction-interval-secs` for more aggressive cache cleanup.
+3. **Circuit breaker flapping**
+   Increase `--cb-failure-threshold` or extend the timeout/window durations. Consider temporarily disabling retries.

-4. **Circuit breaker triggering frequently**: Increase `--cb-failure-threshold` or extend `--cb-window-duration-secs`.
+4. **Queue overflow (429)**
+   Increase `--queue-size` or reduce client concurrency. Ensure `--max-concurrent-requests` matches downstream capacity.

-### Debug Mode
+5. **Memory growth**
+   Reduce `--max-tree-size` or lower `--eviction-interval-secs` for more aggressive cache pruning.

-Enable detailed logging:
-
-```bash
-python -m sglang_router.launch_router \
-    --worker-urls http://worker1:8000 http://worker2:8001 \
+6. **Debugging**
+   ```bash
+   python -m sglang_router.launch_router \
+     --worker-urls http://worker1:8000 \
     --log-level debug \
     --log-dir ./router_logs
-```
+   ```
+
+---
+
+SGLang Model Gateway continues to evolve alongside the SGLang runtime. Keep CLI flags, integrations, and documentation aligned when adopting new features or contributing improvements.