This example demonstrates running Dynamo across multiple nodes with **KV-aware routing** to distribute requests between two replicas of a disaggregated model. Each replica consists of dedicated prefill and decode workers, providing high availability and load distribution.
For more information about the core concepts, see:
KV-aware routing optimizes LLM inference by directing requests to workers that already have relevant data cached. Instead of random or round-robin distribution, the router:
- **Tracks cached data**: Monitors which token sequences are cached on each worker
- **Maximizes cache reuse**: Routes requests to workers with the best cache overlap, reducing redundant computation
- **Balances load**: Considers both cache efficiency and worker utilization when making routing decisions
This is particularly beneficial for:
- **Shared system prompts**: Cached across workers and reused efficiently
- **Multi-turn conversations**: Full conversation history benefits from caching
- **Similar queries**: Common prefixes are computed once and reused
- **Batch processing**: Related requests can be routed to workers with shared context
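The routing behavior described above can be illustrated with a minimal Python sketch. It is not Dynamo's implementation: the block size, the `Worker` class, the `pick_worker` function, and the cost formula (blocks still to compute plus current load) are all hypothetical and chosen only to show how cache overlap and load can be traded off.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 4  # hypothetical number of tokens per cache block


def to_blocks(tokens):
    """Split a token sequence into full, prefix-chained blocks.

    Each block is identified by the whole prefix up to its end, so two
    requests share a block only if they share everything before it too.
    """
    full = len(tokens) - len(tokens) % BLOCK_SIZE
    return [tuple(tokens[:end]) for end in range(BLOCK_SIZE, full + 1, BLOCK_SIZE)]


@dataclass
class Worker:
    name: str
    active_blocks: int = 0                     # rough measure of current load
    cached: set = field(default_factory=set)   # prefix blocks held in cache


def overlap(worker, blocks):
    """Count how many leading blocks of the request the worker has cached."""
    count = 0
    for block in blocks:
        if block not in worker.cached:
            break
        count += 1
    return count


def pick_worker(workers, tokens, overlap_weight=1.0):
    """Pick the worker with the lowest cost (ties go to the first worker)."""
    blocks = to_blocks(tokens)

    def cost(worker):
        new_blocks = len(blocks) - overlap(worker, blocks)
        return overlap_weight * new_blocks + worker.active_blocks

    return min(workers, key=cost)


system_prompt = list(range(8))              # 8 tokens -> 2 full blocks
a = Worker("replica-1")
b = Worker("replica-2", active_blocks=1)
b.cached.update(to_blocks(system_prompt))   # replica-2 has seen the prompt before

request = system_prompt + [100, 101, 102, 103]
chosen = pick_worker([a, b], request)
print(chosen.name)  # replica-2: slightly busier, but it already holds the prompt
```

In this toy setup the request goes to the busier worker because reusing two cached blocks outweighs its extra load. A request with no shared prefix would go to the idle worker instead.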
For detailed technical information about how KV routing works, see the [KV Cache Routing Architecture documentation](../../../docs/architecture/kv_cache_routing.md).
## Prerequisites
### 1. Infrastructure Services
Ensure etcd and NATS are running on a node accessible by all workers:
```bash
# On the infrastructure node (can be Node 1 or a dedicated node)
docker compose -f deploy/docker-compose.yml up -d
```
Note the IP address of this node; you'll need it for worker configuration.
### 2. Software Requirements
Install Dynamo with [SGLang](https://docs.sglang.ai/) support:
```bash
pip install "ai-dynamo[sglang]"
```
For more information about the SGLang backend and its integration with Dynamo, see the [SGLang Backend Documentation](../../components/backends/sglang/README.md).
### 3. Network Requirements
Ensure the following ports are accessible between nodes:
- **2379**: etcd client port
- **4222**: NATS client port
- **8000**: Frontend HTTP port (only needed on frontend node)
- **High-speed interconnect**: For optimal NIXL performance (InfiniBand, RoCE, or high-bandwidth Ethernet)
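Before starting the workers, it can help to confirm that the etcd and NATS ports are reachable from each node. The following is a small sketch using only the Python standard library. The `port_open` helper and the `INFRA_HOST` placeholder are illustrative assumptions, not part of Dynamo. A successful TCP connection shows only that the port is open, not that the service behind it is healthy.

```python
import socket


def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Placeholder address (assumption); replace with your infrastructure node's IP.
INFRA_HOST = "127.0.0.1"

for name, port in [("etcd", 2379), ("NATS", 4222)]:
    state = "reachable" if port_open(INFRA_HOST, port) else "NOT reachable"
    print(f"{name} ({INFRA_HOST}:{port}): {state}")
```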
### 4. Hardware Setup
This example assumes:
- **Node 1**: At least 2 GPUs (for Replica 1's decode and prefill workers)
- **Node 2**: At least 2 GPUs (for Replica 2's decode and prefill workers)
- **Frontend Node**: Can be on Node 1, Node 2, or a separate node (no GPU required)
> [!NOTE]
> You can run this example with minimal modifications on a single node with at least 4 GPUs.
> In step 3, set the environment variable `CUDA_VISIBLE_DEVICES=2` for the prefill component
> and `CUDA_VISIBLE_DEVICES=3` for the decode component.
## Setup Instructions
### Step 1: Set Environment Variables
On all nodes, set the etcd and NATS endpoints:
```bash
# Replace with your infrastructure node's IP
# To find your IP address, run the following on your infrastructure node:
```
While this example demonstrates KV-aware routing for optimal cache utilization, Dynamo also supports simpler routing strategies:
- **KV-Aware** (recommended): Routes based on cache overlap across all workers
- **Round-Robin**: Distributes requests evenly across workers in sequence
- **Random**: Randomly selects workers for each request
```bash
# Example: Use round-robin routing instead of KV routing
python -m dynamo.frontend \
--http-port 8000 \
--router-mode round-robin
```
However, for maximum performance with shared prefixes and multi-turn conversations, KV routing provides significant advantages by minimizing redundant computation.
For detailed router configuration and tuning options, see the [KV Router Documentation](../../../docs/components/router/README.md).
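For contrast with KV-aware routing, the two simpler strategies can be sketched in a few lines of Python. This is an illustration only, not Dynamo's router code, and the worker names are hypothetical. Neither strategy looks at what a worker has cached, so requests sharing a prefix can land on different workers.

```python
import itertools
import random

workers = ["replica-1", "replica-2"]  # hypothetical worker names

# Round-robin: cycle through the workers in a fixed order.
round_robin = itertools.cycle(workers)
rr_choices = [next(round_robin) for _ in range(4)]
print(rr_choices)  # ['replica-1', 'replica-2', 'replica-1', 'replica-2']

# Random: pick a worker independently for every request.
rng = random.Random(0)  # seeded only to make the sketch repeatable
random_choices = [rng.choice(workers) for _ in range(4)]
print(random_choices)
```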
## Monitoring and Debugging
### Check Worker Registration
Verify all workers are properly registered:
```bash
etcdctl --endpoints=$ETCD_ENDPOINTS get --prefix /dynamo/workers/
```
### Monitor Routing Decisions
With `DYN_LOG=debug`, the frontend logs show routing decisions:
1. Check etcd connectivity:
```bash
etcdctl --endpoints=$ETCD_ENDPOINTS endpoint health
```
2. Check NATS connectivity:
```bash
nats --server=$NATS_SERVER server check connection
```
### NIXL Transfer Failures
1. Ensure GPUs can communicate across nodes
2. Check InfiniBand/RoCE configuration if using high-speed interconnect
3. Verify CUDA IPC is enabled for optimal performance
### Routing Not Working
1. Confirm frontend is started with `--router-mode kv`
2. Check that all workers are properly registered in etcd
3. Verify workers are publishing KV events
4. Check logs for overlap scores - if all zeros, cache tracking may not be working
5. Ensure NATS is functioning for KV event distribution
## Advanced Configuration
For production deployments, you can fine-tune KV routing behavior:
```bash
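# --kv-overlap-score-weight: weight for cache overlap scoring
# --router-temperature: temperature for probabilistic routing (0 = deterministic)
# Comments must not follow a trailing backslash, or the line continuation breaks.
python -m dynamo.frontend \
--http-port 8000 \
--router-mode kv \
--kv-overlap-score-weight 1.0 \
--router-temperature 0.0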
```
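To give a feel for what a routing temperature does, here is a hedged Python sketch of temperature-controlled selection. It assumes a softmax over negated per-worker costs, which is a common scheme but is not confirmed here as Dynamo's exact formula. The `sample_worker` function and the cost values are hypothetical.

```python
import math
import random


def sample_worker(costs, temperature, rng=random):
    """Choose a worker name from a {name: cost} mapping.

    temperature == 0 always returns the lowest-cost worker; larger values
    spread choices more evenly across workers.
    """
    if temperature <= 0:
        return min(costs, key=costs.get)
    best = min(costs.values())
    # Subtract the best cost before exponentiating for numerical stability.
    weights = [math.exp(-(c - best) / temperature) for c in costs.values()]
    return rng.choices(list(costs), weights=weights, k=1)[0]


costs = {"replica-1": 3.0, "replica-2": 2.0}  # hypothetical routing costs
print(sample_worker(costs, temperature=0.0))  # replica-2
```

With a temperature of 0 the lowest-cost worker always wins. Raising the temperature sends some traffic to higher-cost workers, which can help avoid piling every request with a popular prefix onto a single worker.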
For more advanced configuration options including custom worker selection, block size tuning, and alternative indexing strategies, see the [KV Cache Routing documentation](../../../docs/architecture/kv_cache_routing.md).
## Cleanup
Stop all components in reverse order:
1. Stop Frontend (Ctrl+C in the frontend terminal)
2. Stop workers on each node:
   - On Node 1: Press Ctrl+C in the terminal (this stops the decode worker)
   - On Node 2: Press Ctrl+C in the terminal (this stops the decode worker)
   - To stop the background prefill workers, use one of these methods:
```bash
# Method 1: Kill background jobs in the same terminal
jobs # See background jobs
kill %1 # Kill the background prefill worker
# Method 2: Close the terminal entirely (sends SIGHUP to background processes)
exit
# Method 3: Kill by process name (from any terminal)
pkill -f "dynamo.sglang.worker.*prefill"
```
3. Stop infrastructure services:
```bash
docker compose -f deploy/docker-compose.yml down
```
## Next Steps
- **Scale Up**: Add more replicas by repeating Steps 2-3 on additional nodes
- **High Availability**: Run multiple frontend instances with a load balancer
- **Monitoring**: Deploy Prometheus and Grafana for production monitoring
- **Optimization**: Tune worker configurations based on workload patterns
- **Cache Analysis**: Use SGLang's built-in cache statistics to optimize your workloads