# SGLang Model Gateway (formerly SGLang Router)

SGLang Model Gateway is a high-performance model-routing gateway for large-scale LLM deployments. It centralizes worker lifecycle management, balances traffic across heterogeneous protocols (HTTP, gRPC, OpenAI-compatible), and provides enterprise-ready control over history storage, MCP tooling, and privacy-sensitive workflows. The router is deeply optimized for the SGLang serving runtime, but can route to any OpenAI-compatible backend.

---

## Table of Contents
1. [Overview](#overview)
2. [Architecture](#architecture)
   - [Control Plane](#control-plane)
   - [Data Plane](#data-plane)
   - [Storage & Privacy](#storage--privacy)
3. [Deployment Modes](#deployment-modes)
   - [Co-launch Router + Workers](#co-launch-router--workers)
   - [Separate Launch (HTTP)](#separate-launch-http)
   - [gRPC Launch](#grpc-launch)
   - [Prefill/Decode Disaggregation](#prefilldecode-disaggregation)
   - [OpenAI Backend Proxy](#openai-backend-proxy)
4. [Worker Lifecycle & Dynamic Scaling](#worker-lifecycle--dynamic-scaling)
5. [Reliability & Flow Control](#reliability--flow-control)
6. [Load Balancing Policies](#load-balancing-policies)
7. [Service Discovery (Kubernetes)](#service-discovery-kubernetes)
8. [Security & Authentication](#security--authentication)
9. [History & Data Connectors](#history--data-connectors)
10. [MCP & Advanced Tooling](#mcp--advanced-tooling)
11. [API Surface](#api-surface)
12. [Configuration Reference](#configuration-reference)
13. [Observability](#observability)
14. [Troubleshooting](#troubleshooting)

---

## Overview
- **Unified control plane** for registering, monitoring, and orchestrating regular, prefill, and decode workers across heterogeneous model fleets.
- **Multi-protocol data plane** that routes traffic across HTTP, PD (prefill/decode), gRPC, and OpenAI-compatible backends with shared reliability primitives.
- **Industry-first gRPC pipeline** with native Rust tokenization, reasoning parsers, and tool-call execution for high-throughput, OpenAI-compatible serving; supports both single-stage and PD topologies.
- **Inference Gateway Mode (`--enable-igw`)** dynamically instantiates multiple router stacks (HTTP regular/PD, gRPC) and applies per-model policies for multi-tenant deployments.
- **Conversation & responses connectors** centralize chat history inside the router so the same context can be reused across models and MCP loops without leaking data to upstream vendors (memory, none, Oracle ATP).
- **Enterprise privacy**: agentic multi-turn `/v1/responses`, native MCP client (STDIO/HTTP/SSE/Streamable), and history storage all operate within the router boundary.
- **Reliability core**: retries with jitter, worker-scoped circuit breakers, token-bucket rate limiting with queuing, background health checks, and cache-aware load monitoring.
- **Observability**: Prometheus metrics, structured tracing, request ID propagation, and detailed job queue stats.

---

## Architecture

### Control Plane
- **Worker Manager** discovers capabilities (`/get_server_info`, `/get_model_info`), tracks load, and registers/removes workers in the shared registry.
- **Job Queue** serializes add/remove requests and exposes status (`/workers/{url}`) so clients can track onboarding progress.
- **Load Monitor** feeds cache-aware and power-of-two policies with live worker load statistics.
- **Health Checker** continuously probes workers and updates readiness, circuit breaker state, and router metrics.

### Data Plane
- **HTTP routers** (regular & PD) implement `/generate`, `/v1/chat/completions`, `/v1/completions`, `/v1/responses`, `/v1/embeddings`, `/v1/rerank`, and associated admin endpoints.
- **gRPC router** streams tokenized requests directly to SRT gRPC workers, running fully in Rust—tokenizer, reasoning parser, and tool parser all reside in-process. Supports both single-stage and PD routing.
- **OpenAI router** proxies OpenAI-compatible endpoints to external vendors (OpenAI, xAI, etc.) while keeping chat history and multi-turn orchestration local.

### Storage & Privacy
- Conversation and response history is stored at the router tier (memory, none, or Oracle ATP). The same history can power multiple models or MCP loops without sending data to upstream vendors.
- `/v1/responses` agentic flows, MCP sessions, and conversation APIs share the same storage layer, enabling compliance for regulated workloads.

---

## Deployment Modes

### Co-launch Router + Workers
Launch the router and a fleet of SGLang workers in one process (ideal for single-node or quick starts). The CLI accepts two namespaces of arguments:
- **Worker arguments** (no prefix) configure the SGLang runtime (`--model`, `--tp-size`, `--dp-size`, `--grpc-mode`, etc.).
- **Router arguments** are prefixed with `--router-` and map directly to `launch_router` flags (`--router-policy`, `--router-model-path`, `--router-log-level`, ...).

```bash
python -m sglang_router.launch_server \
  --model meta-llama/Meta-Llama-3.1-8B-Instruct \
  --dp-size 4 \
  --host 0.0.0.0 \
  --port 30000
```

Comprehensive example:
```bash
python3 -m sglang_router.launch_server \
  --host 0.0.0.0 \
  --port 8080 \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --tp-size 1 \
  --dp-size 8 \
  --grpc-mode \
  --log-level debug \
  --router-prometheus-port 10001 \
  --router-tool-call-parser llama \
  --router-health-success-threshold 2 \
  --router-health-check-timeout-secs 6000 \
  --router-health-check-interval-secs 60 \
  --router-model-path meta-llama/Llama-3.1-8B-Instruct \
  --router-policy round_robin \
  --router-log-level debug
```

### Separate Launch (HTTP)
Run workers independently and point the router at their HTTP endpoints.

```bash
# Worker nodes
python -m sglang.launch_server --model meta-llama/Meta-Llama-3.1-8B-Instruct --port 8000
python -m sglang.launch_server --model meta-llama/Meta-Llama-3.1-8B-Instruct --port 8001

# Router node
python -m sglang_router.launch_router \
  --worker-urls http://worker1:8000 http://worker2:8001 \
  --policy cache_aware \
  --host 0.0.0.0 --port 30000
```

### gRPC Launch
Use SRT gRPC workers to unlock the highest throughput and access native reasoning/tool pipelines.

```bash
# Workers expose gRPC endpoints
python -m sglang.launch_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --grpc-mode \
  --port 20000

# Router
python -m sglang_router.launch_router \
  --worker-urls grpc://127.0.0.1:20000 \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --reasoning-parser deepseek-r1 \
  --tool-call-parser json \
  --host 0.0.0.0 --port 8080
```

> gRPC router supports both single-stage and PD serving. Provide `--tokenizer-path` or `--model-path` (HF repo or local directory) plus optional `--chat-template`.

### Prefill/Decode Disaggregation
Split prefill and decode workers for PD-aware caching and balancing.

```bash
# Each --prefill entry is a URL optionally followed by a bootstrap port.
python -m sglang_router.launch_router \
  --pd-disaggregation \
  --prefill http://prefill1:30001 9001 \
  --decode http://decode1:30011 \
  --policy cache_aware \
  --prefill-policy cache_aware \
  --decode-policy power_of_two
```

### OpenAI Backend Proxy
Proxy OpenAI-compatible endpoints (OpenAI, xAI, etc.) while keeping history and MCP sessions local.

```bash
python -m sglang_router.launch_router \
  --backend openai \
  --worker-urls https://api.openai.com \
  --history-backend memory
```

> OpenAI backend mode expects exactly one `--worker-urls` entry per router instance.

---

## Worker Lifecycle & Dynamic Scaling

Add or remove workers at runtime using the REST APIs. Jobs are queued and tracked for eventual consistency.

```bash
# Add a worker (HTTP or gRPC)
curl -X POST http://localhost:30000/workers \
  -H "Content-Type: application/json" \
  -d '{"url":"grpc://0.0.0.0:31000","worker_type":"regular"}'

# Inspect registry
curl http://localhost:30000/workers

# Remove a worker
curl -X DELETE http://localhost:30000/workers/grpc://0.0.0.0:31000
```
Legacy endpoints (`/add_worker`, `/remove_worker`, `/list_workers`) remain available but will be deprecated. `/workers/{url}` returns both registry data and queued job status.
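
Because registration is asynchronous, a client typically polls `/workers/{url}` until the job settles. A minimal polling sketch; the `job_status` field and its values are assumptions for illustration, not the router's documented schema:

```python
import json
import time
import urllib.parse
import urllib.request


def is_onboarded(status: dict) -> bool:
    """Decide whether a queued worker-add job has completed.

    The "job_status" key and its values are illustrative assumptions;
    check the actual /workers/{url} response shape in your deployment.
    """
    return status.get("job_status") not in ("queued", "running")


def wait_for_worker(router: str, worker_url: str, timeout_s: float = 120.0) -> dict:
    """Poll /workers/{url} until the add job settles or the timeout expires."""
    endpoint = f"{router}/workers/{urllib.parse.quote(worker_url, safe='')}"
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        with urllib.request.urlopen(endpoint) as resp:
            status = json.load(resp)
        if is_onboarded(status):
            return status
        time.sleep(2.0)
    raise TimeoutError(f"worker {worker_url} not onboarded within {timeout_s}s")
```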
---
## Reliability & Flow Control
### Retries
```bash
python -m sglang_router.launch_router \
  --worker-urls http://worker1:8000 http://worker2:8001 \
  --retry-max-retries 5 \
  --retry-initial-backoff-ms 50 \
  --retry-max-backoff-ms 30000 \
  --retry-backoff-multiplier 1.5 \
  --retry-jitter-factor 0.2
```
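
These flags describe an exponential schedule with jitter; a sketch of how the delay for attempt *n* could be derived from them (an illustration of the configured parameters, not necessarily the router's exact internal formula):

```python
import random


def retry_backoff_ms(attempt, initial_ms=50, max_ms=30_000,
                     multiplier=1.5, jitter_factor=0.2, rng=None):
    """Exponential backoff with jitter, mirroring the CLI flag names.

    Illustrative only: the router's internal schedule may differ.
    """
    rng = rng or random.Random()
    # Grow geometrically from the initial backoff, capped at the maximum.
    base = min(initial_ms * multiplier ** attempt, max_ms)
    # Jitter spreads retries by up to +/- jitter_factor of the base delay.
    return base * (1 + jitter_factor * (2 * rng.random() - 1))
```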
### Circuit Breaker
```bash
python -m sglang_router.launch_router \
  --worker-urls http://worker1:8000 http://worker2:8001 \
  --cb-failure-threshold 5 \
  --cb-success-threshold 2 \
  --cb-timeout-duration-secs 30 \
  --cb-window-duration-secs 60
```

### Rate Limiting & Queuing
```bash
python -m sglang_router.launch_router \
  --worker-urls http://worker1:8000 http://worker2:8001 \
  --max-concurrent-requests 256 \
  --rate-limit-tokens-per-second 512 \
  --queue-size 128 \
  --queue-timeout-secs 30
```
Requests beyond the concurrency limit wait in a FIFO queue (up to `queue-size`). A `429` is returned when the queue is full; `408` is returned when `queue-timeout-secs` expires.
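
On the client side, both responses are usually worth retrying after a pause: `429` means the FIFO queue was full, `408` means the queue wait timed out. A hedged sketch of that decision logic (status-code handling only; the endpoint, payload, and backoff are up to the caller):

```python
def should_retry(status_code: int, attempt: int, max_attempts: int = 5) -> bool:
    """Client-side policy for the router's flow-control responses.

    429 -> queue was full: back off and retry.
    408 -> queue-timeout-secs expired while waiting: retry a limited
           number of times, then surface the failure to the caller.
    """
    if attempt >= max_attempts:
        return False
    return status_code in (429, 408)
```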
---
## Load Balancing Policies
| Policy             | Description                                                                                      | Usage                         |
|--------------------|--------------------------------------------------------------------------------------------------|-------------------------------|
| `random`           | Uniform random selection.                                                                        | `--policy random`             |
| `round_robin`      | Cycles through workers in order.                                                                 | `--policy round_robin`        |
| `power_of_two`     | Samples two workers and picks the lighter one (requires Load Monitor).                           | `--policy power_of_two`       |
| `cache_aware`      | Default policy; combines cache locality with load balancing, falling back to shortest queue.     | `--policy cache_aware` + tuning flags |
Key tuning flags:
```bash
--cache-threshold 0.5 \
--balance-abs-threshold 32 \
--balance-rel-threshold 1.5 \
--eviction-interval-secs 120 \
--max-tree-size 67108864
```
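
Roughly, cache-aware routing prefers the worker with the best prefix match unless the fleet is imbalanced, in which case it falls back to shortest queue. A simplified sketch of that decision using the flag names above (the real implementation tracks prefixes in an approximate radix tree, not the plain dicts used here):

```python
def pick_worker(loads, prefix_match, cache_threshold=0.5,
                abs_threshold=32, rel_threshold=1.5):
    """Simplified cache-aware selection.

    loads:        {worker: outstanding request count}
    prefix_match: {worker: matched-prefix ratio in [0, 1]}
    Mirrors --cache-threshold / --balance-abs-threshold /
    --balance-rel-threshold; illustrative only.
    """
    max_load, min_load = max(loads.values()), min(loads.values())
    imbalanced = (max_load - min_load > abs_threshold and
                  max_load > rel_threshold * max(min_load, 1))
    if imbalanced:
        # Fall back to shortest queue when the fleet is skewed.
        return min(loads, key=loads.get)
    best = max(prefix_match, key=prefix_match.get)
    if prefix_match[best] >= cache_threshold:
        # Good cache hit: route for locality.
        return best
    return min(loads, key=loads.get)
```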

---
## Service Discovery (Kubernetes)
Enable automatic worker discovery via Kubernetes pod selectors.
```bash
python -m sglang_router.launch_router \
  --service-discovery \
  --selector app=sglang-worker role=inference \
  --service-discovery-namespace production \
  --service-discovery-port 8000
```
PD deployments can specify `--prefill-selector` and `--decode-selector` plus the `sglang.ai/bootstrap-port` annotation for prefill bootstrap ports. Ensure RBAC grants `get/list/watch` on pods.
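
The RBAC requirement can be met with a namespaced Role bound to the router's service account. A minimal sketch; the `sglang-router` service account name and `production` namespace are placeholders:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: sglang-router-discovery
  namespace: production
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: sglang-router-discovery
  namespace: production
subjects:
  - kind: ServiceAccount
    name: sglang-router   # placeholder service account
    namespace: production
roleRef:
  kind: Role
  name: sglang-router-discovery
  apiGroup: rbac.authorization.k8s.io
```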
---
## Security & Authentication
- **Router API key (`--api-key`)**: clients must supply `Authorization: Bearer <key>`.
- **Worker API keys**: when adding workers dynamically, include `api_key` in the payload; workers listed via CLI inherit the router key.
- **Full-stack auth**: start router with `--api-key`, then add workers with their own keys:
  ```bash
  curl -H "Authorization: Bearer router-key" \
    -X POST http://localhost:30000/workers \
    -H "Content-Type: application/json" \
    -d '{"url":"http://worker:8000","api_key":"worker-key"}'
  ```
- **Privacy**: All conversation history, `/v1/responses` state, and MCP sessions stay inside the router. Nothing is persisted at remote model vendors unless explicitly proxied.
---
## History & Data Connectors
| Backend | Description | Usage |
|---------|-------------|-------|
| `memory` (default) | In-memory storage for quick prototyping. | `--history-backend memory` |
| `none` | No persistence; APIs operate but store nothing. | `--history-backend none` |
| `oracle` | Oracle Autonomous Database-backed storage (pooled connections). | `--history-backend oracle` |
Oracle configuration (choose DSN *or* TNS alias):
Install the Oracle Instant Client and set `LD_LIBRARY_PATH` accordingly.
Choose **one** connection method:
```bash
# Option 1: Full connection descriptor
export ATP_DSN="(description=(address=(protocol=tcps)(port=1522)(host=adb.region.oraclecloud.com))(connect_data=(service_name=service_name)))"

# Option 2: TNS alias (requires wallet)
export ATP_TNS_ALIAS="sglroutertestatp_high"
export ATP_WALLET_PATH="/path/to/wallet"
```
Provide database credentials and optional pool sizing:
```bash
export ATP_USER="admin"
export ATP_PASSWORD="secret"
export ATP_POOL_MIN=4
export ATP_POOL_MAX=32
python -m sglang_router.launch_router \
  --backend openai \
  --worker-urls https://api.openai.com \
  --history-backend oracle
```
> History backends currently apply to OpenAI router mode. gRPC parity for `/v1/responses` is on the roadmap.
---
## MCP & Advanced Tooling
- Native MCP client supports **STDIO**, **HTTP**, **SSE**, and **Streamable** transports—no external config files required.
- Tool-call parsers cover JSON, Pythonic, XML, and custom schemas with streaming/non-streaming execution loops.
- Reasoning parsers ship for DeepSeek-R1, Qwen3, Step-3, GLM4, Llama families, Kimi K2, GPT-OSS, Mistral, and more (`src/reasoning_parser`).
- Tokenizer factory accepts HuggingFace IDs, local directories, and explicit `tokenizer.json` files with chat template overrides (`src/tokenizer`).
Use CLI flags to select parsers:
```bash
--reasoning-parser deepseek-r1 \
--tool-call-parser json \
--chat-template /path/to/template.json
```
---

## API Surface

| Method                | Path                                     | Description                                    |
|-----------------------|------------------------------------------|------------------------------------------------|
| `POST`                | `/generate`                              | SGLang generate API.                           |
| `POST`                | `/v1/chat/completions`                   | OpenAI-compatible chat (streaming/tool calls). |
| `POST`                | `/v1/completions`                        | OpenAI-compatible text completions.            |
| `POST`                | `/v1/responses`                          | Create background responses (agentic loops).   |
| `GET`                 | `/v1/responses/{id}`                     | Retrieve stored responses.                     |
| `POST`                | `/v1/embeddings`                         | Forward embedding requests.                    |
| `POST`                | `/v1/rerank`                             | Ranking endpoint (`/rerank` synonym).          |
| `POST`                | `/v1/conversations`                      | Create conversation metadata.                  |
| `GET`/`POST`/`DELETE` | `/v1/conversations/{id}`                 | Get/update/delete conversation.                |
| `GET`/`POST`          | `/v1/conversations/{id}/items`           | List or append conversation items.             |
| `GET`/`DELETE`        | `/v1/conversations/{id}/items/{item_id}` | Inspect/delete conversation item.              |
| `GET`                 | `/workers`                               | List registered workers with health/load.      |
| `POST`                | `/workers`                               | Queue worker registration.                     |
| `DELETE`              | `/workers/{url}`                         | Queue worker removal.                          |
| `POST`                | `/flush_cache`                           | Flush worker caches (HTTP workers).            |
| `GET`                 | `/get_loads`                             | Retrieve worker load snapshot.                 |
| `GET`                 | `/liveness` / `/readiness` / `/health`   | Health probes.                                 |
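
Client traffic uses the standard OpenAI wire format. A minimal non-streaming chat call against a locally running router; the router address and model name are placeholders:

```python
import json
import urllib.request


def build_chat_request(model: str, user_content: str, max_tokens: int = 16) -> dict:
    """Assemble an OpenAI-format payload for /v1/chat/completions."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_content}],
        "max_tokens": max_tokens,
    }


def chat(router: str, payload: dict) -> dict:
    """POST the payload through the router.

    Add an "Authorization: Bearer <key>" header when the router
    runs with --api-key.
    """
    req = urllib.request.Request(
        f"{router}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```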

---

## Configuration Reference

### Core Settings

| Parameter                   | Type | Default     | Description                                                              |
|-----------------------------|------|-------------|--------------------------------------------------------------------------|
| `--host`                    | str  | 127.0.0.1   | Router host.                                                             |
| `--port`                    | int  | 30000       | Router port.                                                             |
| `--worker-urls`             | list | []          | Worker URLs (HTTP or gRPC).                                              |
| `--policy`                  | str  | cache_aware | Routing policy (`random`, `round_robin`, `cache_aware`, `power_of_two`). |
| `--max-concurrent-requests` | int  | -1          | Concurrency limit (-1 disables rate limiting).                           |
| `--request-timeout-secs`    | int  | 600         | Request timeout.                                                         |
| `--max-payload-size`        | int  | 256MB       | Maximum request payload.                                                 |

### Cache-Aware Tuning

| Parameter                  | Type  | Default  | Description                 |
|----------------------------|-------|----------|-----------------------------|
| `--cache-threshold`        | float | 0.3      | Minimum prefix match ratio. |
| `--balance-abs-threshold`  | int   | 64       | Absolute load threshold.    |
| `--balance-rel-threshold`  | float | 1.5      | Relative load ratio.        |
| `--eviction-interval-secs` | int   | 120      | Cache eviction cadence.     |
| `--max-tree-size`          | int   | 67108864 | Max nodes in cache tree.    |

### Fault Tolerance

| Parameter                    | Type  | Default | Description                      |
|------------------------------|-------|---------|----------------------------------|
| `--retry-max-retries`        | int   | 5       | Max retries.                     |
| `--retry-initial-backoff-ms` | int   | 50      | Initial backoff (ms).            |
| `--retry-max-backoff-ms`     | int   | 30000   | Max backoff (ms).                |
| `--retry-backoff-multiplier` | float | 1.5     | Backoff multiplier.              |
| `--retry-jitter-factor`      | float | 0.2     | Retry jitter (0.0-1.0).          |
| `--disable-retries`          | flag  | False   | Disable retries.                 |
| `--cb-failure-threshold`     | int   | 5       | Failures before opening circuit. |
| `--cb-success-threshold`     | int   | 2       | Successes to close circuit.      |
| `--cb-timeout-duration-secs` | int   | 30      | Cooldown period.                 |
| `--cb-window-duration-secs`  | int   | 60      | Window size.                     |
| `--disable-circuit-breaker`  | flag  | False   | Disable circuit breaker.         |

### Prefill/Decode

| Parameter                         | Type | Default | Description                              |
|-----------------------------------|------|---------|------------------------------------------|
| `--pd-disaggregation`             | flag | False   | Enable PD mode.                          |
| `--prefill`                       | list | []      | Prefill URLs + optional bootstrap ports. |
| `--decode`                        | list | []      | Decode URLs.                             |
| `--prefill-policy`                | str  | None    | Override policy for prefill nodes.       |
| `--decode-policy`                 | str  | None    | Override policy for decode nodes.        |
| `--worker-startup-timeout-secs`   | int  | 600     | Worker init timeout.                     |
| `--worker-startup-check-interval` | int  | 30      | Polling interval.                        |

### Kubernetes Discovery

| Parameter                                  | Type | Description                                                        |
|--------------------------------------------|------|--------------------------------------------------------------------|
| `--service-discovery`                      | flag | Enable discovery.                                                  |
| `--selector key=value ...`                 | list | Label selectors (regular mode).                                    |
| `--prefill-selector` / `--decode-selector` | list | Label selectors for PD mode.                                       |
| `--service-discovery-namespace`            | str  | Namespace to watch.                                                |
| `--service-discovery-port`                 | int  | Worker port (default 80).                                          |
| `--bootstrap-port-annotation`              | str  | Prefill bootstrap annotation (default `sglang.ai/bootstrap-port`). |

---
## Observability
Enable Prometheus metrics:
```bash
python -m sglang_router.launch_router \
  --worker-urls http://worker1:8000 http://worker2:8001 \
  --prometheus-host 0.0.0.0 \
  --prometheus-port 29000
```
Key metrics:
| Metric | Type | Description |
|--------|------|-------------|
| `sgl_router_requests_total` | Counter | Total requests by endpoint/method. |
| `sgl_router_processed_requests_total` | Counter | Requests processed per worker. |
| `sgl_router_active_workers` | Gauge | Healthy worker count. |
| `sgl_router_running_requests` | Gauge | In-flight requests per worker. |
| `sgl_router_cache_hits_total` / `misses_total` | Counter | Cache-aware routing hits/misses. |
| `sgl_router_generate_duration_seconds` | Histogram | Request latency distribution. |
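
The scrape endpoint serves standard Prometheus text exposition, so values can be spot-checked without a full Prometheus stack. A small parser sketch; the sample scrape below is illustrative:

```python
def parse_gauge(exposition: str, metric: str) -> float:
    """Extract the first sample value for a metric from Prometheus text format.

    Skips HELP/TYPE comment lines; does not handle label matching.
    """
    for line in exposition.splitlines():
        if line.startswith(metric):
            return float(line.rsplit(" ", 1)[-1])
    raise KeyError(metric)


# Illustrative scrape output, not captured from a live router.
sample = "# TYPE sgl_router_active_workers gauge\nsgl_router_active_workers 4\n"
```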
Enable request ID propagation:
```bash
python -m sglang_router.launch_router \
  --worker-urls http://worker1:8000 \
  --request-id-headers x-request-id x-trace-id
```

---
## Troubleshooting

1. **Workers never ready**
   Increase `--worker-startup-timeout-secs` or ensure health probes respond before router startup.
2. **Load imbalance / hot workers**
   Inspect `sgl_router_processed_requests_total` and tune cache-aware thresholds (`--balance-*`, `--cache-threshold`).
3. **Circuit breaker flapping**
   Increase `--cb-failure-threshold` or extend the timeout/window durations. Consider temporarily disabling retries.
4. **Queue overflow (429)**
   Increase `--queue-size` or reduce client concurrency. Ensure `--max-concurrent-requests` matches downstream capacity.
5. **Memory growth**
   Reduce `--max-tree-size` or lower `--eviction-interval-secs` for more aggressive cache pruning.
6. **Debugging**
   ```bash
   python -m sglang_router.launch_router \
     --worker-urls http://worker1:8000 \
     --log-level debug \
     --log-dir ./router_logs
   ```
---
SGLang Model Gateway continues to evolve alongside the SGLang runtime. Keep CLI flags, integrations, and documentation aligned when adopting new features or contributing improvements.