Commit 002c6823 authored by Yan Ru Pei, committed by GitHub

docs: fix typos and wrong imports in router docs (#6654)


Signed-off-by: PeaBrane <yanrpei@gmail.com>
parent 64574a3e
@@ -31,7 +31,7 @@ Backend workers register themselves using the `register_model` API, after which
 | `--router-temperature <float>` | `0.0` | Controls routing randomness (0.0 = deterministic, higher = more random) |
 | `--kv-cache-block-size <size>` | Backend-specific | KV cache block size (should match backend config) |
 | `--router-kv-events` / `--no-router-kv-events` | `--router-kv-events` | Enable/disable real-time KV event tracking |
-| `--router-kv-overlap-score-weight <float>` | `1.0` | Balance prefill vs decode optimization (higher = better TTFT). |
+| `--router-kv-overlap-score-weight <float>` | `1.0` | Balance prefill vs decode optimization (higher = better TTFT) |
 
 For all available options: `python -m dynamo.frontend --help`
 
......
@@ -51,11 +51,13 @@ python -m dynamo.vllm --model meta-llama/Llama-2-7b-hf
 
 ```python
 import asyncio
-from dynamollm import DistributedRuntime, KvRouter, KvRouterConfig
+from dynamo.runtime import DistributedRuntime
+from dynamo.llm import KvRouter, KvRouterConfig
 
 async def main():
     # Get runtime and create endpoint
-    runtime = DistributedRuntime.detached()
+    loop = asyncio.get_running_loop()
+    runtime = DistributedRuntime(loop, "etcd", "nats")
     endpoint = runtime.endpoint("dynamo.backend.generate")
 
     # Create KV router
@@ -191,9 +193,12 @@ Query without state updates, then route through a chosen router:
 worker_id_1, dp_rank, overlap_1 = await router_1.best_worker(tokens)  # No request_id
 worker_id_2, dp_rank, overlap_2 = await router_2.best_worker(tokens)
 
-# Pick the best router based on results
-chosen_router = router_1 if overlap_1 > overlap_2 else router_2
-stream = await chosen_router.generate(tokens, model="model-name", worker_id=worker_id)
+# Pick the best router and corresponding worker based on results
+if overlap_1 > overlap_2:
+    chosen_router, chosen_worker = router_1, worker_id_1
+else:
+    chosen_router, chosen_worker = router_2, worker_id_2
+stream = await chosen_router.generate(tokens, model="model-name", worker_id=chosen_worker)
 ```
 - **Best for**: Multi-tier deployments (e.g., Envoy Gateway routing to multiple router groups)
 - **Advantage**: Query multiple routers before committing to one
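The corrected selection logic in this hunk extends naturally to more than two routers. The following is a minimal sketch, assuming each router exposes an async `best_worker(tokens)` that returns `(worker_id, dp_rank, overlap)` as in the snippet above. `StubRouter` and `choose_router` are hypothetical names introduced only for illustration and are not part of the Dynamo API; the stub stands in for a real `KvRouter` so the example runs without a cluster.

```python
import asyncio


class StubRouter:
    """Hypothetical stand-in for a KvRouter that returns a fixed answer."""

    def __init__(self, name, worker_id, overlap):
        self.name = name
        self._worker_id = worker_id
        self._overlap = overlap

    async def best_worker(self, tokens):
        # Mirrors the (worker_id, dp_rank, overlap) tuple used in the docs.
        return self._worker_id, 0, self._overlap


async def choose_router(routers, tokens):
    """Query every router concurrently and keep the one with the highest overlap.

    The worker ID is taken from the same result as the chosen router, which is
    the pairing the original two-router snippet got wrong.
    """
    results = await asyncio.gather(*(r.best_worker(tokens) for r in routers))
    best_index = max(range(len(routers)), key=lambda i: results[i][2])
    worker_id, _dp_rank, overlap = results[best_index]
    return routers[best_index], worker_id, overlap


routers = [StubRouter("r1", 11, 2), StubRouter("r2", 22, 5)]
chosen_router, chosen_worker, chosen_overlap = asyncio.run(
    choose_router(routers, [1, 2, 3])
)
```

Returning the router and worker together makes it harder to pass one router's worker ID to another router's `generate` call.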
@@ -218,11 +223,13 @@ Here's an example of using `get_potential_loads()` to implement custom routing t
 
 ```python
 import asyncio
-from dynamo.llm import DistributedRuntime, KvRouter, KvRouterConfig
+from dynamo.runtime import DistributedRuntime
+from dynamo.llm import KvRouter, KvRouterConfig
 
 async def minimize_ttft_routing():
     # Setup router
-    runtime = DistributedRuntime.detached()
+    loop = asyncio.get_running_loop()
+    runtime = DistributedRuntime(loop, "etcd", "nats")
     endpoint = runtime.endpoint("dynamo.backend.generate")
 
     router = KvRouter(
......
@@ -35,7 +35,7 @@ Backend workers register themselves using the `register_model` API, after which
 | `--router-temperature <float>` | `0.0` | Controls routing randomness (0.0 = deterministic, higher = more random) |
 | `--kv-cache-block-size <size>` | Backend-specific | KV cache block size (should match backend config) |
 | `--router-kv-events` / `--no-router-kv-events` | `--router-kv-events` | Enable/disable real-time KV event tracking |
-| `--router-kv-overlap-score-weight <float>` | `1.0` | Balance prefill vs decode optimization (higher = better TTFT). |
+| `--router-kv-overlap-score-weight <float>` | `1.0` | Balance prefill vs decode optimization (higher = better TTFT) |
 | `--router-queue-threshold <float>` | None (disabled) | Queue threshold fraction; enables priority scheduling via `latency_sensitivity` |
 
 For all available options: `python -m dynamo.frontend --help`
@@ -142,7 +142,7 @@ To evaluate the benefits of KV-aware routing, compare your workload's performanc
 
 The main KV-aware routing arguments (frontend uses the same `--router-*` flag names as the standalone router; legacy names without the prefix are obsolete):
 
-- `--router-kv-overlap-score-weight`: Controls the importance of prefix cache overlaps in prefill cost calculations. Higher values improve Time To First Token (TTFT) at the cost of Inter-Token Latency (ITL). When set to 0, the router ignores prefix caches and uses pure load balancing. Defaults to 1. .
+- `--router-kv-overlap-score-weight`: Controls the importance of prefix cache overlaps in prefill cost calculations. Higher values improve Time To First Token (TTFT) at the cost of Inter-Token Latency (ITL). When set to 0, the router ignores prefix caches and uses pure load balancing. Defaults to 1.
 
 - `--router-temperature`: Controls worker selection randomness through softmax sampling of router cost logits. A value of 0 (default) ensures deterministic selection of the lowest-cost worker, while higher values introduce more randomness.
 
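To make the `--router-temperature` description concrete, here is a minimal sketch of temperature-scaled softmax sampling over per-worker costs. The function name `pick_worker`, the cost dictionary, and the choice to use negated costs directly as logits are assumptions for illustration; the router's actual cost normalization may differ.

```python
import math
import random


def pick_worker(costs, temperature=0.0, rng=None):
    """Pick a worker ID from a {worker_id: cost} mapping (lower cost is better).

    temperature == 0 selects the lowest-cost worker deterministically.
    temperature > 0 samples from a softmax over the negated, scaled costs.
    """
    if not costs:
        raise ValueError("costs must not be empty")
    if temperature <= 0.0:
        return min(costs, key=costs.get)

    rng = rng or random.Random()
    # Negate costs so that cheaper workers receive larger logits.
    logits = {worker: -cost / temperature for worker, cost in costs.items()}
    # Subtract the maximum logit for numerical stability before exponentiating.
    max_logit = max(logits.values())
    workers = list(logits)
    weights = [math.exp(logits[worker] - max_logit) for worker in workers]
    return rng.choices(workers, weights=weights, k=1)[0]


costs = {"worker-a": 3.0, "worker-b": 1.0, "worker-c": 2.0}
deterministic_choice = pick_worker(costs)  # temperature defaults to 0.0
sampled_choice = pick_worker(costs, temperature=1.0, rng=random.Random(0))
```

As the temperature grows, the weights flatten toward a uniform distribution, so the lowest-cost worker is chosen less consistently.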
@@ -156,7 +156,7 @@ The main KV-aware routing arguments (frontend uses the same `--router-*` flag na
 
 - `--router-snapshot-threshold`: Only applies in JetStream mode (`--router-durable-kv-events`). Sets the number of messages in the JetStream before triggering a snapshot. When the message count exceeds this threshold, a router will attempt to purge acknowledged messages from the stream and create a snapshot of the current radix tree state in NATS object store. Defaults to 1000000. This helps manage stream size and provides faster initialization for routers that restart.
 
-- `--no-router-track-active-blocks`: Disables tracking of active blocks (blocks being used for ongoing generation/decode phases). By default, the router tracks active blocks for load balancing. Disable this when routing to workers that only perform prefill (no decode phase), as tracking decode load is not relevant. This reduces router overhead and simplifies state management. .
+- `--no-router-track-active-blocks`: Disables tracking of active blocks (blocks being used for ongoing generation/decode phases). By default, the router tracks active blocks for load balancing. Disable this when routing to workers that only perform prefill (no decode phase), as tracking decode load is not relevant. This reduces router overhead and simplifies state management.
 
 - `--router-track-output-blocks`: Enables tracking of output blocks during generation (default: disabled). When enabled, the router adds placeholder blocks as tokens are generated and applies fractional decay based on progress toward the expected output sequence length (`agent_hints.osl` in nvext). This improves load balancing accuracy for long-running generation requests by accounting for output-side KV cache growth.
 
@@ -389,7 +389,7 @@ Persistence behavior depends on which event transport mode is active:
 - Recovery depends on workers being available; if a worker is down, its blocks cannot be recovered
 - Simpler infrastructure (no JetStream required)
 
-**JetStream Mode** (`--router-durable-kv-events` on **both** frontend **and** workers):**
+**JetStream Mode** (`--router-durable-kv-events` on **both** frontend **and** workers):
 - Prefix blocks are stored in NATS JetStream with 1-hour retention
 - Snapshots saved to NATS object store at configurable thresholds
 - New replicas automatically restore this state on startup
......