This folder contains Kubernetes manifests that deploy the Dynamo frontend component as a standalone DynamoGraphDeployment (DGD), along with two models.
The frontend is shared across the two models. It is deployed to the Dynamo namespace `dynamo`, a reserved namespace name that lets the frontend observe models deployed across all Dynamo namespaces.
A shared PVC is configured to store model checkpoint weights fetched from Hugging Face.
1. Install the Dynamo Kubernetes platform Helm chart.
2. Create a Kubernetes secret with your Hugging Face token, then render the Kubernetes manifests.

Use the following request to test one of the deployed models:
```sh
curl localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
"model": "Qwen/Qwen3-0.6B",
"messages": [
{
"role": "user",
"content": "In the heart of Eldoria, an ancient land of boundless magic and mysterious creatures, lies the long-forgotten city of Aeloria. Once a beacon of knowledge and power, Aeloria was buried beneath the shifting sands of time, lost to the world for centuries. You are an intrepid explorer, known for your unparalleled curiosity and courage, who has stumbled upon an ancient map hinting at ests that Aeloria holds a secret so profound that it has the potential to reshape the very fabric of reality. Your journey will take you through treacherous deserts, enchanted forests, and across perilous mountain ranges. Your Task: Character Background: Develop a detailed background for your character. Describe their motivations for seeking out Aeloria, their skills and weaknesses, and any personal connections to the ancient city or its legends. Are they driven by a quest for knowledge, a search for lost familt clue is hidden."
}
],
"stream": false,
"max_tokens": 30
}'
```
You can also benchmark the endpoint's performance with [AIPerf](https://github.com/ai-dynamo/aiperf/blob/main/README.md).
This example demonstrates running Dynamo across multiple nodes with **KV-aware routing** to distribute requests between two replicas of a disaggregated model. Each replica consists of dedicated prefill and decode workers, providing high availability and load distribution.
For more information about the core concepts, see:
KV-aware routing optimizes LLM inference by directing requests to workers that already have relevant data cached. Instead of random or round-robin distribution, the router:
- **Tracks cached data**: Monitors which token sequences are cached on each worker
- **Maximizes cache reuse**: Routes requests to workers with the best cache overlap, reducing redundant computation
- **Balances load**: Considers both cache efficiency and worker utilization when making routing decisions
This is particularly beneficial for:
- **Shared system prompts**: Cached across workers and reused efficiently
- **Multi-turn conversations**: Full conversation history benefits from caching
- **Similar queries**: Common prefixes are computed once and reused
- **Batch processing**: Related requests can be routed to workers with shared context
For detailed technical information about how KV routing works, see the [Router Guide](../../../docs/components/router/router-guide.md).
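As a rough illustration of the routing idea described above, the following toy sketch scores each worker by its best prefix overlap minus a load penalty. The `Worker` class, the `pick_worker` function, and the scoring rule are hypothetical and are not Dynamo's implementation; the Router Guide describes the real behavior.

```python
from dataclasses import dataclass, field


@dataclass
class Worker:
    """Hypothetical worker record for illustration; not a Dynamo type."""
    name: str
    active_requests: int = 0
    cached_prefixes: list[list[int]] = field(default_factory=list)


def prefix_overlap(tokens: list[int], cached: list[int]) -> int:
    """Count how many leading tokens of the request match a cached sequence."""
    matched = 0
    for a, b in zip(tokens, cached):
        if a != b:
            break
        matched += 1
    return matched


def pick_worker(tokens: list[int], workers: list[Worker], load_penalty: float = 1.0) -> Worker:
    """Pick the worker with the highest (best overlap - load_penalty * load) score."""
    def score(worker: Worker) -> float:
        best = max((prefix_overlap(tokens, c) for c in worker.cached_prefixes), default=0)
        return best - load_penalty * worker.active_requests

    return max(workers, key=score)


workers = [
    Worker("replica-1", active_requests=1, cached_prefixes=[[1, 2, 3, 4]]),
    Worker("replica-2", active_requests=0, cached_prefixes=[[9, 9]]),
]
chosen = pick_worker([1, 2, 3, 4, 5], workers)
print(chosen.name)  # replica-1: four cached tokens outweigh one active request
```

In this toy model, a request with no cached prefix anywhere falls back to the least-loaded worker, which mirrors the trade-off between cache reuse and load balancing.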
## Prerequisites
### 1. Infrastructure Services
Ensure etcd and NATS are running on a node accessible by all workers:
```bash
# On the infrastructure node (can be Node 1 or a dedicated node)
docker compose -f deploy/docker-compose.yml up -d
```
Note the IP address of this node - you'll need it for worker configuration.
### 2. Software Requirements
Install Dynamo with [SGLang](https://docs.sglang.io/) support:
```bash
# Quote the requirement so shells such as zsh do not expand the brackets
pip install "ai-dynamo[sglang]"
```
For more information about the SGLang backend and its integration with Dynamo, see the [SGLang Backend Documentation](../../../docs/backends/sglang/README.md).
### 3. Network Requirements
Ensure the following ports are accessible between nodes:
- **2379**: etcd client port
- **4222**: NATS client port
- **8000**: Frontend HTTP port (only needed on frontend node)
- **${DISAGG_BOOTSTRAP_PORT}**: SGLang disaggregation bootstrap port (set in Step 1; must be reachable across nodes)
- **High-speed interconnect**: For optimal NIXL performance (InfiniBand, RoCE, or high-bandwidth Ethernet)
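Before starting workers, it can help to confirm that the ports listed above are reachable from each node. The sketch below uses only the Python standard library; the `port_open` and `check_ports` helpers and the `127.0.0.1` placeholder host are illustrative assumptions, not part of Dynamo.

```python
import socket


def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_ports(host: str, ports: dict[str, int]) -> dict[str, bool]:
    """Map each service name to whether its port accepted a TCP connection."""
    return {name: port_open(host, port) for name, port in ports.items()}


if __name__ == "__main__":
    infra_host = "127.0.0.1"  # placeholder: replace with your infrastructure node's IP
    results = check_ports(infra_host, {"etcd": 2379, "NATS": 4222})
    for name, ok in results.items():
        print(f"{name}: {'reachable' if ok else 'unreachable'}")
```

A successful TCP connect only shows that the port is open; it does not verify that etcd or NATS is healthy.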
### 4. Hardware Setup
This example assumes:
- **Node 1**: At least 2 GPUs (for Replica 1's decode and prefill workers)
- **Node 2**: At least 2 GPUs (for Replica 2's decode and prefill workers)
- **Frontend Node**: Can be on Node 1, Node 2, or a separate node (no GPU required)
> [!NOTE]
> You can run this example with minimal modifications on a single node with at least 4 GPUs.
> In step 3, modify the `CUDA_VISIBLE_DEVICES` flags to `CUDA_VISIBLE_DEVICES=2`
> for the prefill component and `CUDA_VISIBLE_DEVICES=3` for the decode component.
## Setup Instructions
### Step 1: Set Environment Variables
On all nodes, set the etcd and NATS endpoints:
```bash
# Replace with your infrastructure node's IP
# To find your IP address, run the following on your infrastructure node:
```
While this example demonstrates KV-aware routing for optimal cache utilization, Dynamo also supports simpler routing strategies:
- **KV-Aware** (recommended): Routes based on cache overlap across all workers
- **Round-Robin**: Distributes requests evenly across workers in sequence
- **Random**: Randomly selects workers for each request
```bash
# Example: Use round-robin routing instead of KV routing
python -m dynamo.frontend \
--http-port 8000 \
--router-mode round-robin
```
However, for maximum performance with shared prefixes and multi-turn conversations, KV routing provides significant advantages by minimizing redundant computation.
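To make the difference between the two simpler strategies concrete, here is a minimal sketch of round-robin and random selection. The worker names and helper functions are hypothetical and only illustrate the selection patterns, not Dynamo's router code.

```python
import itertools
import random


def round_robin(workers):
    """Return an iterator that yields workers in a fixed, repeating order."""
    return itertools.cycle(workers)


def pick_random(workers, rng):
    """Pick one worker uniformly at random, ignoring load and cache state."""
    return rng.choice(workers)


workers = ["worker-a", "worker-b"]  # hypothetical worker names

rr = round_robin(workers)
print([next(rr) for _ in range(4)])  # ['worker-a', 'worker-b', 'worker-a', 'worker-b']

rng = random.Random(0)  # seeded so repeated runs pick the same sequence
print([pick_random(workers, rng) for _ in range(4)])
```

Neither strategy looks at cached prefixes, which is why repeated prompts may land on a worker that has to recompute them.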
## Monitoring and Debugging
### Check Worker Registration
Verify all workers are properly registered:
```bash
etcdctl --endpoints=$ETCD_ENDPOINTS get --prefix /dynamo/workers/
```
### Monitor Routing Decisions
With `DYN_LOG=debug`, the frontend logs show routing decisions:
1. Check etcd connectivity:
```bash
etcdctl --endpoints=$ETCD_ENDPOINTS endpoint health
```
2. Check NATS connectivity:
```bash
nats --server=$NATS_SERVER server check connection
```
### NIXL Transfer Failures
1. Ensure GPUs can communicate across nodes
2. Check InfiniBand/RoCE configuration if using high-speed interconnect
3. Verify CUDA IPC is enabled for optimal performance
### Routing Not Working
1. Confirm frontend is started with `--router-mode kv`
2. Check that all workers are properly registered in etcd
3. Verify workers are publishing KV events
4. Check logs for overlap scores - if all zeros, cache tracking may not be working
5. Ensure NATS is functioning for KV event distribution
## Advanced Configuration
For production deployments, you can fine-tune KV routing behavior:
```bash
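# --kv-overlap-score-weight: weight for cache overlap scoring
# --router-temperature: temperature for probabilistic routing (0 = deterministic)
# Comments are kept above the command because a trailing comment would break the line continuation.
python -m dynamo.frontend \
  --http-port 8000 \
  --router-mode kv \
  --kv-overlap-score-weight 1.0 \
  --router-temperature 0.0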
```
For more advanced configuration options including custom worker selection, block size tuning, and alternative indexing strategies, see the [Router Guide](../../../docs/components/router/router-guide.md).
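The following sketch shows one way an overlap weight and a temperature could interact when choosing a worker. The cost formula and the softmax sampling here are assumptions made for illustration; consult the Router Guide for the router's actual scoring. All function and worker names are hypothetical.

```python
import math
import random


def routing_cost(prefill_blocks: float, active_blocks: float, overlap_weight: float) -> float:
    """Assumed cost: weighted uncached prefill work plus current load (lower is better)."""
    return overlap_weight * prefill_blocks + active_blocks


def select_worker(costs: dict[str, float], temperature: float, rng: random.Random) -> str:
    """Pick the cheapest worker at temperature 0, otherwise sample via softmax over -cost."""
    if temperature <= 0.0:
        return min(costs, key=costs.get)
    lowest = min(costs.values())
    # Subtract the lowest cost before exponentiating for numerical stability.
    weights = {name: math.exp(-(cost - lowest) / temperature) for name, cost in costs.items()}
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


costs = {
    # worker-a has most of the prompt cached but is busier
    "worker-a": routing_cost(prefill_blocks=2, active_blocks=5, overlap_weight=1.0),
    # worker-b is idle but must prefill almost everything
    "worker-b": routing_cost(prefill_blocks=8, active_blocks=1, overlap_weight=1.0),
}
print(select_worker(costs, temperature=0.0, rng=random.Random(0)))  # worker-a
```

Under these assumptions, a higher temperature spreads requests across workers with similar costs, while a weight of 0 ignores cache overlap and routes purely on load.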
## Cleanup
Stop all components in reverse order:
1. Stop Frontend (Ctrl+C in the frontend terminal)
2. Stop workers on each node:
- On Node 1: Press Ctrl+C in the terminal (this stops the decode worker)
- On Node 2: Press Ctrl+C in the terminal (this stops the decode worker)
- To stop the background prefill workers, use one of these methods:
```bash
# Method 1: Kill background jobs in the same terminal
jobs # See background jobs
kill %1 # Kill the background prefill worker
# Method 2: Close the terminal entirely (sends SIGHUP to background processes)
exit
# Method 3: Kill by process name (from any terminal)
pkill -f "dynamo.sglang.*prefill"
```
3. Stop infrastructure services:
```bash
docker compose -f deploy/docker-compose.yml down
```
## Next Steps
- **Scale Up**: Add more replicas by repeating Steps 2-3 on additional nodes
- **High Availability**: Run multiple frontend instances with a load balancer
- **Monitoring**: Deploy Prometheus and Grafana for production monitoring
- **Optimization**: Tune worker configurations based on workload patterns
- **Cache Analysis**: Use SGLang's built-in cache statistics to optimize your workloads
This is a simple example showing how you can quickly get started deploying Large Language Models with Dynamo.
## Prerequisites
Before running this example, ensure you have the following services running:
- **etcd**: A distributed key-value store used for service discovery and metadata storage
- **NATS**: A high-performance message broker for inter-component communication
You can start these services using Docker Compose:
```bash
docker compose -f deploy/docker-compose.yml up -d
```
## Components
- [Frontend](/components/src/dynamo/frontend/README.md) - A built-in component that launches an OpenAI-compliant HTTP server, a pre-processor, and a router in a single process
- [vLLM Backend](/docs/backends/vllm/README.md) - A built-in component that runs vLLM within the Dynamo runtime
```mermaid
---
title: Request Flow
---
flowchart TD
A["Users/Clients<br/>(HTTP)"] --> B["Frontend<br/>HTTP API endpoint<br/>(OpenAI Style)"]
B --> C["NATS Message Broker<br/>(Inter-component communication)"]
C --> D["vLLM Backend<br/>(NATS subscriber)"]
D --> C
C --> B
B --> A
```
## Instructions
There are three steps to deploy and use an LLM with Dynamo.
### 1. Launch Engine
**Open a new terminal** and run:
```bash
python -m dynamo.vllm --model Qwen/Qwen3-0.6B
```
Leave this terminal running - it will show vLLM Backend logs.
### 2. Launch Frontend
**Open another terminal** and interact with the deployed engine using the built-in frontend component. You have two options:
1. Interactive Command Line Interface
```bash
python -m dynamo.frontend --interactive
```
2. HTTP Server
```bash
python -m dynamo.frontend --http-port 8000
```
Leave this terminal running as well - it will show Frontend logs.
### 3. Send Requests
If you launched the frontend in `interactive` mode, simply start typing and hit `Enter` to have an interactive chat with your LLM.
If you launched the frontend in HTTP mode, you can send requests via `curl`, or any OpenAI compatible client program or library.
```bash
curl -X POST http://localhost:8000/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{
"model": "Qwen/Qwen3-0.6B",
"messages": [
{ "role": "user", "content": "Tell me a story about a brave cat" }
],
"stream": false,
"max_tokens": 1028
}'
```
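The same request can be sent from Python. This is a minimal standard-library sketch that assumes the frontend is listening on `localhost:8000` and returns the usual OpenAI-style chat completion shape (`choices[0].message.content`); the `build_chat_request` and `send_chat` helpers are illustrative names, not part of Dynamo.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, prompt: str,
                       max_tokens: int = 64) -> urllib.request.Request:
    """Build a POST request for the OpenAI-style chat completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat(request: urllib.request.Request, timeout: float = 60.0) -> str:
    """Send the request and return the first choice's message text."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    # Assumes the OpenAI-style response layout.
    return body["choices"][0]["message"]["content"]


req = build_chat_request(
    "http://localhost:8000", "Qwen/Qwen3-0.6B", "Tell me a story about a brave cat"
)
print(req.full_url)  # http://localhost:8000/v1/chat/completions
# With the frontend running in HTTP mode, call: print(send_chat(req))
```

Building the request separately from sending it makes it easy to inspect the payload before the frontend is up.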
## Cleanup
When you're done with the quickstart example, follow these steps to clean up:
### 1. Stop Dynamo Components
In each terminal where you started Dynamo components, press `Ctrl+C` to stop them:
- Stop the vLLM Backend (terminal from step 1)
- Stop the Frontend (terminal from step 2)
### 2. Stop Infrastructure Services
If you don't plan to run any more examples, stop the etcd and NATS services that were started with Docker Compose:
```bash
docker compose -f deploy/docker-compose.yml down
```
This will stop and remove the containers for etcd and NATS.
## Understand
### What's Happening Under the Hood
When you run the two commands above, here's what Dynamo does to spin up the necessary processes and connect your HTTP requests to the vLLM Backend:
### 1. Service Registration and Discovery
#### DistributedRuntime Setup
At startup, each Dynamo component (vLLM Backend, Frontend) connects to the `DistributedRuntime`, which involves creating connections to two critical infrastructure services:
- **etcd**: A distributed key-value store used for service discovery and metadata storage
- **NATS**: A high-performance message broker for inter-component communication
#### Component Registration
When the vLLM Backend starts up, it registers itself as a `component` in etcd with one or more `endpoints`.
This registration includes each endpoint's [NATS subject](https://docs.nats.io/nats-concepts/subjects) for communication and is tied to a `lease` that automatically expires if the component goes offline.
<details>
<summary> Inspecting the Component Registry </summary>
If you want to find out more about the internal organization of components in Dynamo, you can inspect the contents of `etcd` using the [`etcdctl` command line tool](https://etcd.io/docs/latest/dev-guide/interacting_v3/). For this example, you can try running
```bash
etcdctl get "instances" --prefix
```
which shows each registered endpoint along with its associated NATS subject. Note that the specific etcd and NATS details are internal and subject to change; future examples will show how to use the `DistributedRuntime` itself to communicate across components.
</details>
#### Frontend Discovery
When the Frontend starts, it doesn't receive an explicit pointer to the vLLM Backend component. Instead, it constantly watches etcd for registered models, automatically discovering the vLLM Backend component and its endpoints when it becomes available.
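The registration and discovery pattern described in this section can be modeled with a small in-memory registry. This is a conceptual sketch only: `LeaseRegistry`, the key `instances/backend/generate`, and the subject string are hypothetical stand-ins, and the code does not talk to etcd or reflect Dynamo's internal key layout.

```python
import time


class LeaseRegistry:
    """In-memory stand-in for a lease-backed key-value store (not etcd itself)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._entries = {}  # key -> (value, expiry time)

    def register(self, key: str, value: str, ttl: float) -> None:
        """Store a value that disappears unless its lease is refreshed within ttl seconds."""
        self._entries[key] = (value, self._clock() + ttl)

    def keep_alive(self, key: str, ttl: float) -> None:
        """Extend the lease of an existing entry, as a live component would."""
        value, _ = self._entries[key]
        self._entries[key] = (value, self._clock() + ttl)

    def discover(self, prefix: str) -> dict[str, str]:
        """Drop expired entries, then return live entries whose key starts with prefix."""
        now = self._clock()
        self._entries = {k: v for k, v in self._entries.items() if v[1] > now}
        return {k: v[0] for k, v in self._entries.items() if k.startswith(prefix)}


# A fake clock makes lease expiry easy to demonstrate without sleeping.
now = [0.0]
registry = LeaseRegistry(clock=lambda: now[0])
registry.register("instances/backend/generate", "subject.backend.generate", ttl=10)
print(registry.discover("instances"))  # the backend endpoint is visible

now[0] = 11.0  # the backend went offline and never refreshed its lease
print(registry.discover("instances"))  # {}
```

The frontend's role corresponds to calling `discover` repeatedly, so a backend that stops refreshing its lease drops out of routing without any explicit deregistration.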
### 2. Request Flow and NATS Messaging
When you send an HTTP request to the Frontend:
1. **Request Packaging**: The Frontend wraps your HTTP request in a standardized internal format with routing metadata
2. **NATS Subject Resolution**: Using the endpoints discovered in etcd, the Frontend determines the appropriate NATS subject
3. **Message Dispatch**: The request is published to the discovered NATS subject, where the target vLLM Backend picks it up
4. **Response Streaming**: The vLLM Backend executes the request and streams responses back through NATS, which the Frontend converts back to HTTP
### 3. Network-Transparent Operation
One of Dynamo's key strengths is that this entire system works seamlessly whether components are:
- Running on the same machine (like in this quickstart)
- Distributed across multiple nodes in a cluster
- Deployed in different availability zones
The same two commands work in all scenarios, as long as all components can connect with the `DistributedRuntime` - Dynamo handles the networking complexity automatically.