Commit 8e776c78 authored by Jonah Bernard, committed by GitHub

docs(router): add token-bucket rate limiting to the docs (#11485)

parent 63e84352
@@ -11,6 +11,7 @@ The SGLang Router is a high-performance request distribution system that routes
- **Kubernetes Integration**: Native service discovery and pod management
- **Prefill-Decode Disaggregation**: Support for disaggregated serving load balancing
- **Prometheus Metrics**: Built-in observability and monitoring
- **Rate Limiter**: Token-bucket rate limiter to shield workers from overload
## Installation
@@ -229,6 +230,35 @@ python -m sglang_router.launch_router \
- Returns to service after `cb-success-threshold` successful health checks
- Circuit breaker can be disabled with `--disable-circuit-breaker`
### Rate Limiter
Use the token-bucket rate limiter to cap requests before they overwhelm downstream workers.
- Enable rate limiting by setting `--max-concurrent-requests` to a positive integer. A bucket with that many tokens (concurrent leases) is created; `-1` keeps it disabled.
- Optionally override the refill rate with `--rate-limit-tokens-per-second`. If omitted, the refill rate matches `max-concurrent-requests`.
- Overflow traffic can wait in a FIFO queue controlled by:
  - `--queue-size`: pending-request buffer (0 disables queuing; defaults to 100).
  - `--queue-timeout-secs`: maximum wait time for queued requests before returning `408` (defaults to 60 seconds).
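To make the capacity and refill settings concrete, here is a minimal token-bucket sketch in Python. It is a generic illustration, not the router's actual implementation: the `TokenBucket` class, its method names, and the injectable `clock` are hypothetical, and it does not model how the router hands tokens back when a request completes. The only behavior taken from the text above is that the refill rate falls back to the bucket capacity when no override is given.

```python
import time


class TokenBucket:
    """Illustrative token bucket (hypothetical; not the router's code)."""

    def __init__(self, capacity, refill_per_sec=None, clock=time.monotonic):
        self.capacity = float(capacity)
        # Documented default: the refill rate matches the capacity when omitted.
        self.refill_per_sec = float(
            capacity if refill_per_sec is None else refill_per_sec
        )
        self.tokens = self.capacity  # start with a full bucket
        self._clock = clock
        self._last = clock()

    def _refill(self):
        # Add tokens for the elapsed time, never exceeding the capacity.
        now = self._clock()
        elapsed = now - self._last
        self._last = now
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)

    def try_acquire(self, n=1):
        # Take n tokens if available; otherwise leave the bucket unchanged.
        self._refill()
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False
```

In this sketch, `TokenBucket(256, refill_per_sec=512)` would correspond to `--max-concurrent-requests 256 --rate-limit-tokens-per-second 512`; a caller whose `try_acquire()` returns `False` would be the overflow traffic that the queue settings govern.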
Example:
```bash
python -m sglang_router.launch_router \
--worker-urls http://worker1:8000 http://worker2:8001 \
--max-concurrent-requests 256 \
--rate-limit-tokens-per-second 512 \
--queue-size 128 \
--queue-timeout-secs 30
```
**Behavior**:
This configuration allows up to 256 concurrent requests, refills 512 tokens (requests) per second, and queues up to 128 overflow requests, each for at most 30 seconds before it times out.
**Responses**:
- Returns **429** when the router cannot enqueue the request (queue disabled or full).
- Returns **408** when a queued request waits longer than `--queue-timeout-secs` without a token becoming available.
## Routing Policies
The router supports multiple routing strategies:
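The response rules above can be read as two small decisions: one when a request arrives, one while it sits in the queue. The sketch below encodes that reading in Python. The function names, arguments, and the string results `"proceed"`, `"queue"`, and `"wait"` are hypothetical, and the order of the checks is an assumption, not something the router documents.

```python
def on_arrival(token_available, queued, queue_size):
    """Decide what happens to a request the moment it reaches the limiter."""
    if token_available:
        return "proceed"
    # Assumption: a queue size of 0 means queuing is disabled.
    if queue_size == 0 or queued >= queue_size:
        return 429  # cannot enqueue: queue disabled or full
    return "queue"


def on_queue_wait(token_available, waited_secs, timeout_secs):
    """Decide what happens to a request that is already queued."""
    if token_available:
        return "proceed"
    if waited_secs > timeout_secs:
        return 408  # waited longer than the queue timeout
    return "wait"
```

With the example configuration, `on_arrival(False, 128, 128)` yields `429` because the queue is full, and `on_queue_wait(False, 31, 30)` yields `408` because the request has waited past the 30-second timeout.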
......