<!--
SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
-->

# Router

The Dynamo KV Router routes each request by estimating its computational cost on each available worker. The cost combines decoding cost (from a worker's active blocks) and prefill cost (from blocks that would need to be newly computed), and the router uses KV cache overlap to minimize redundant computation. Tuning the KV Router is important for throughput and latency in distributed inference setups.

## Quick Start

### Python / CLI Deployment

To launch the Dynamo frontend with the KV Router:

```bash
python -m dynamo.frontend --router-mode kv --http-port 8000
```

This command:
- Launches the Dynamo frontend service with KV routing enabled
- Exposes the service on port 8000 (configurable)
- Automatically handles all backend workers registered to the Dynamo endpoint

Backend workers register themselves using the `register_llm` API, after which the KV Router automatically tracks worker state and makes routing decisions based on KV cache overlap.
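
As a rough illustration of overlap-based routing, the sketch below scores workers by how many of a request's KV blocks they would have to prefill from scratch, plus their current decode load. It is a simplified model under stated assumptions, not Dynamo's implementation: `WorkerState`, `block_hashes`, `routing_cost`, and `pick_worker` are hypothetical names, and the cost formula is only one plausible reading of the description above.

```python
from dataclasses import dataclass


@dataclass
class WorkerState:
    """Hypothetical snapshot of one worker as a router might see it."""
    name: str
    cached_block_hashes: set[int]  # hashes of blocks already in the worker's KV cache
    active_decode_blocks: int      # blocks held by requests that are currently decoding


def block_hashes(token_ids: list[int], block_size: int) -> list[int]:
    """Hash each full block of tokens, chaining in the previous block's hash."""
    hashes, prev = [], 0
    for start in range(0, len(token_ids) - block_size + 1, block_size):
        prev = hash((prev, tuple(token_ids[start:start + block_size])))
        hashes.append(prev)
    return hashes


def overlap_blocks(request_hashes: list[int], worker: WorkerState) -> int:
    """Count the leading request blocks already present in the worker's cache."""
    count = 0
    for h in request_hashes:
        if h not in worker.cached_block_hashes:
            break
        count += 1
    return count


def routing_cost(request_hashes, worker, overlap_score_weight=1.0) -> float:
    """Lower is better: weighted new prefill work plus existing decode load."""
    new_prefill_blocks = len(request_hashes) - overlap_blocks(request_hashes, worker)
    return overlap_score_weight * new_prefill_blocks + worker.active_decode_blocks


def pick_worker(token_ids, workers, block_size=4, overlap_score_weight=1.0):
    """Return the worker with the lowest estimated cost for this request."""
    hashes = block_hashes(token_ids, block_size)
    return min(workers, key=lambda w: routing_cost(hashes, w, overlap_score_weight))
```

In this model, a worker that already caches the request's prefix can win even while it is busier with decoding, because it avoids recomputing those blocks.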

#### CLI Arguments

| Argument | Default | Description |
|----------|---------|-------------|
| `--router-mode kv` | `round-robin` | Enable KV cache-aware routing |
| `--router-temperature <float>` | `0.0` | Controls routing randomness (0.0 = deterministic, higher = more random) |
| `--kv-cache-block-size <size>` | Backend-specific | KV cache block size (should match backend config) |
| `--kv-events` / `--no-kv-events` | `--kv-events` | Enable/disable real-time KV event tracking |
| `--kv-overlap-score-weight <float>` | `1.0` | Balance prefill vs decode optimization (higher = better TTFT) |

For all available options: `python -m dynamo.frontend --help`
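
To give a feel for what `--router-temperature` controls, here is a minimal sketch of temperature-based selection over per-worker costs. It assumes a softmax over negated costs, which is a common way to implement this kind of knob; the function name `choose_worker` and the exact formula are assumptions for illustration, not taken from the Dynamo source.

```python
import math
import random


def choose_worker(costs: dict[str, float], temperature: float = 0.0, rng=None) -> str:
    """Pick a worker name from a cost table (lower cost is better)."""
    if temperature <= 0.0:
        # Deterministic case: always take the cheapest worker.
        return min(costs, key=costs.get)
    rng = rng or random.Random()
    names = list(costs)
    # Softmax over negated costs; subtracting the minimum keeps exp() stable.
    lowest = min(costs.values())
    weights = [math.exp(-(costs[n] - lowest) / temperature) for n in names]
    return rng.choices(names, weights=weights, k=1)[0]
```

With `temperature=0.0` the cheapest worker is always chosen. As the temperature grows, the weights flatten and more expensive workers are picked more often, which spreads load at the expense of cache reuse.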

### Kubernetes Deployment

To enable the KV Router in Kubernetes, add the `DYN_ROUTER_MODE` environment variable to your frontend service:

```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: my-deployment
spec:
  services:
    Frontend:
      dynamoNamespace: my-namespace
      componentType: frontend
      replicas: 1
      envs:
        - name: DYN_ROUTER_MODE
          value: kv  # Enable KV Smart Router
```

**Key Points:**
- Set `DYN_ROUTER_MODE=kv` on the **Frontend** service only
- Workers automatically report KV cache events to the router
- No worker-side configuration changes needed

#### Environment Variables

All CLI arguments can be configured via environment variables using the `DYN_` prefix:

| CLI Argument | Environment Variable | Default |
|--------------|---------------------|---------|
| `--router-mode kv` | `DYN_ROUTER_MODE=kv` | `round-robin` |
| `--router-temperature` | `DYN_ROUTER_TEMPERATURE` | `0.0` |
| `--kv-cache-block-size` | `DYN_KV_CACHE_BLOCK_SIZE` | Backend-specific |
| `--no-kv-events` | `DYN_KV_EVENTS=false` | `true` |
| `--kv-overlap-score-weight` | `DYN_KV_OVERLAP_SCORE_WEIGHT` | `1.0` |
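
The naming pattern in the table can be sketched as a small helper. This is inferred only from the rows above (strip the leading dashes, replace `-` with `_`, uppercase, add the `DYN_` prefix, and express a `--no-...` flag as `false` on the positive variable). The helpers `env_var_for` and `env_for` are hypothetical, so check `python -m dynamo.frontend --help` for the authoritative names.

```python
def env_var_for(flag: str) -> str:
    """Map a CLI flag such as '--router-temperature' to its DYN_ variable name."""
    name = flag.lstrip("-")
    if name.startswith("no-"):
        # Negated boolean flags share the variable of the positive flag,
        # e.g. --no-kv-events corresponds to DYN_KV_EVENTS=false.
        name = name[len("no-"):]
    return "DYN_" + name.replace("-", "_").upper()


def env_for(options: dict[str, object]) -> dict[str, str]:
    """Translate {flag: value} into {environment variable: string value}."""
    env = {}
    for flag, value in options.items():
        if isinstance(value, bool):
            value = "true" if value else "false"
        env[env_var_for(flag)] = str(value)
    return env


if __name__ == "__main__":
    print(env_for({"--router-mode": "kv", "--kv-events": False}))
```

The resulting dictionary can be passed as `env` to a subprocess or copied into the `envs` list of a Kubernetes manifest.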

For complete K8s examples and advanced configuration, see [K8s Examples](router_examples.md#k8s-examples).

For A/B testing and advanced K8s setup, see the [KV Router A/B Benchmarking Guide](../../benchmarks/kv-router-ab-testing.md).

For more configuration options and tuning guidelines, see the [Router Guide](router_guide.md).

## Prerequisites and Limitations

**Requirements:**
- **Dynamic endpoints only**: Backend workers must call `register_llm()` with `model_input=ModelInput.Tokens` (see [Backend Guide](../../development/backend-guide.md)). Your backend handler then receives pre-tokenized requests with `token_ids` instead of raw text.
- You cannot use `--static-endpoint` mode with KV routing (use dynamic discovery instead)
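
To make the token-based contract concrete, the sketch below shows the general shape of a handler that consumes `token_ids` rather than raw text. It is a stand-in only: the `max_tokens` field and the response dictionary layout are assumptions made for illustration, the toy "engine" just repeats the last prompt token, and the `register_llm()` call itself is omitted (see the Backend Guide for the real request schema and registration).

```python
import asyncio
from typing import AsyncIterator


async def generate(request: dict) -> AsyncIterator[dict]:
    """Hypothetical handler: consumes token IDs and streams token IDs back."""
    token_ids = request["token_ids"]            # already tokenized upstream
    max_tokens = request.get("max_tokens", 4)   # assumed field, for illustration only
    # Stand-in for a real inference engine: repeat the prompt's last token.
    for _ in range(max_tokens):
        yield {"token_ids": [token_ids[-1]]}


async def main() -> list[int]:
    """Drive the handler once and collect the streamed token IDs."""
    request = {"token_ids": [101, 2023, 2003], "max_tokens": 2}
    out = []
    async for chunk in generate(request):
        out.extend(chunk["token_ids"])
    return out


if __name__ == "__main__":
    print(asyncio.run(main()))
```

The point of the sketch is that the handler never sees a prompt string, so tokenization and detokenization happen outside the backend worker.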

**Multimodal Support:**
- **vLLM and TRT-LLM**: Multimodal routing supported for images via multimodal hashes
- **SGLang**: Image routing not yet supported
- **Other modalities** (audio, video, etc.): Not yet supported

**Limitations:**
- Static endpoints are not supported: the KV Router requires dynamic model discovery via etcd to track worker instances and their KV cache states

For basic model registration without KV routing, use `--router-mode round-robin` or `--router-mode random` with both static and dynamic endpoints.

## Next Steps

- **[Router Guide](router_guide.md)**: Deep dive into KV cache routing, configuration, disaggregated serving, and tuning
- **[Router Examples](router_examples.md)**: Python API usage, K8s examples, and custom routing patterns
- **[Router Design](../../design_docs/router_design.md)**: Architecture details, algorithms, and event transport modes

```{toctree}
:hidden:

router_guide
router_examples
```