"components/vscode:/vscode.git/clone" did not exist on "08a3763a8ce257151cfd16813a2fb5c51d0f2dda"
Unverified commit 7dfbe4fd, authored by Alec, committed by GitHub

chore: remove stale example assets (#7059)

parent 310f8ca9
# Shared Dynamo Frontend
This folder contains Kubernetes manifests that deploy the Dynamo frontend component as a standalone DynamoGraphDeployment (DGD), along with two models.
The frontend is shared across the two models. It is deployed to the Dynamo namespace `dynamo`, a reserved namespace name that lets the frontend observe models deployed across all Dynamo namespaces.
A shared PVC is configured to store model checkpoint weights fetched from Hugging Face.
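1. Install the Dynamo Kubernetes platform Helm chart.
2. Create a Kubernetes secret with your Hugging Face token, then apply the manifests (the commands assume `NAMESPACE` is already set to your target Kubernetes namespace):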
```sh
export HF_TOKEN=YOUR_HF_TOKEN
kubectl create secret generic hf-token-secret \
--from-literal=HF_TOKEN=${HF_TOKEN} \
--namespace ${NAMESPACE}
kubectl apply -f shared_frontend.yaml --namespace ${NAMESPACE}
```
3. Test the deployment and run benchmarks
After deployment, forward the frontend service to access the API:
```sh
kubectl port-forward svc/frontend-frontend 8000:8000 -n ${NAMESPACE}
```
Confirm that both deployed models appear in the model listing:
```sh
curl localhost:8000/v1/models
{"object":"list","data":[{"id":"Qwen/Qwen3-0.6B","object":"object","created":1759458713,"owned_by":"nvidia"},{"id":"Qwen/Qwen2.5-VL-7B-Instruct","object":"object","created":1759458718,"owned_by":"nvidia"}]}
```
Then use the following request to test one of the deployed models:
```sh
curl localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen3-0.6B",
"messages": [
{
"role": "user",
"content": "In the heart of Eldoria, an ancient land of boundless magic and mysterious creatures, lies the long-forgotten city of Aeloria. Once a beacon of knowledge and power, Aeloria was buried beneath the shifting sands of time, lost to the world for centuries. You are an intrepid explorer, known for your unparalleled curiosity and courage, who has stumbled upon an ancient map hinting at ests that Aeloria holds a secret so profound that it has the potential to reshape the very fabric of reality. Your journey will take you through treacherous deserts, enchanted forests, and across perilous mountain ranges. Your Task: Character Background: Develop a detailed background for your character. Describe their motivations for seeking out Aeloria, their skills and weaknesses, and any personal connections to the ancient city or its legends. Are they driven by a quest for knowledge, a search for lost familt clue is hidden."
}
],
"stream": false,
"max_tokens": 30
}'
```
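The model-listing check can also be scripted. The following is a minimal sketch, assuming the `/v1/models` response has the same shape as the sample output shown earlier; `missing_models` is a hypothetical helper name and not part of Dynamo.

```python
import json


def missing_models(listing_json: str, expected: list[str]) -> list[str]:
    """Return the expected model IDs that are absent from a /v1/models response."""
    listing = json.loads(listing_json)
    served = {entry["id"] for entry in listing.get("data", [])}
    return [model for model in expected if model not in served]


# Sample response copied from the model listing shown earlier.
sample = (
    '{"object":"list","data":['
    '{"id":"Qwen/Qwen3-0.6B","object":"object","created":1759458713,"owned_by":"nvidia"},'
    '{"id":"Qwen/Qwen2.5-VL-7B-Instruct","object":"object","created":1759458718,"owned_by":"nvidia"}]}'
)

# An empty list means every expected model is being served.
print(missing_models(sample, ["Qwen/Qwen3-0.6B", "Qwen/Qwen2.5-VL-7B-Instruct"]))
```

To run it against a live deployment, replace `sample` with the body returned by `curl localhost:8000/v1/models` while the port-forward is active.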
You can also benchmark the endpoint's performance with [AIPerf](https://github.com/ai-dynamo/aiperf/blob/main/README.md).
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: dynamo-model-cache
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 100Gi
---
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: frontend
spec:
  services:
    Frontend:
      componentType: frontend
      globalDynamoNamespace: true
      replicas: 1
      extraPodSpec:
        mainContainer:
          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.5.0
---
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: vllm-agg
spec:
  pvcs:
    - name: dynamo-model-cache
      create: false
  services:
    VllmDecodeWorker:
      volumeMounts:
        - name: dynamo-model-cache
          mountPoint: /root/.cache/huggingface
      envFromSecret: hf-token-secret
      componentType: worker
      replicas: 1
      resources:
        limits:
          gpu: "1"
      extraPodSpec:
        mainContainer:
          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.5.0
          workingDir: /workspace/examples/backends/vllm
          command:
            - /bin/sh
            - -c
          args:
            - python3 -m dynamo.vllm --model Qwen/Qwen3-0.6B
---
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: agg-qwen
spec:
  backendFramework: vllm
  services:
    EncodeWorker:
      volumeMounts:
        - name: dynamo-model-cache
          mountPoint: /root/.cache/huggingface
      envFromSecret: hf-token-secret
      componentType: worker
      replicas: 1
      resources:
        limits:
          gpu: "1"
      extraPodSpec:
        mainContainer:
          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.5.0
          workingDir: /workspace/examples/multimodal
          command:
            - /bin/sh
            - -c
          args:
            - python3 components/encode_worker.py --model Qwen/Qwen2.5-VL-7B-Instruct
    VLMWorker:
      volumeMounts:
        - name: dynamo-model-cache
          mountPoint: /root/.cache/huggingface
      envFromSecret: hf-token-secret
      componentType: worker
      replicas: 1
      resources:
        limits:
          gpu: "1"
      extraPodSpec:
        mainContainer:
          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.5.0
          workingDir: /workspace/examples/multimodal
          command:
            - /bin/sh
            - -c
          args:
            - python3 components/worker.py --model Qwen/Qwen2.5-VL-7B-Instruct --worker-type prefill
    Processor:
      volumeMounts:
        - name: dynamo-model-cache
          mountPoint: /root/.cache/huggingface
      envFromSecret: hf-token-secret
      componentType: worker
      replicas: 1
      resources:
        limits:
          gpu: "1"
      extraPodSpec:
        mainContainer:
          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.5.0
          workingDir: /workspace/examples/multimodal
          command:
            - /bin/sh
            - -c
          args:
            - 'python3 components/processor.py --model Qwen/Qwen2.5-VL-7B-Instruct --prompt-template "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n<|vision_start|><|image_pad|><|vision_end|><prompt><|im_end|>\n<|im_start|>assistant\n"'
# Multi-Node Dynamo with KV Routing
This example demonstrates running Dynamo across multiple nodes with **KV-aware routing** to distribute requests between two replicas of a disaggregated model. Each replica consists of dedicated prefill and decode workers, providing high availability and load distribution.
For more information about the core concepts, see:
- [Dynamo Disaggregated Serving](../../../docs/design-docs/disagg-serving.md)
- [KV Cache Routing](../../../docs/components/router/README.md)
## Architecture Overview
The multi-node setup consists of:
- **1 Frontend**: Receives HTTP requests and uses KV routing to distribute them
- **2 Model Replicas**: Each with dedicated prefill and decode workers
- **Smart KV-Aware Routing**: Intelligently routes requests based on KV cache locality across **all workers**
```mermaid
---
title: Multi-Node Architecture with Full KV Routing (SGLang)
---
flowchart TD
Client["Users/Clients<br/>(HTTP)"] --> Frontend["Frontend<br/>KV-Aware Router<br/>(Any Node)"]
Frontend --> Router{KV Routing<br/>Decision}
Router --> Prefill1["Prefill Worker 1"]
Router --> Prefill2["Prefill Worker 2"]
Prefill1 -->|NIXL Transfer| Decode1
Prefill2 -->|NIXL Transfer| Decode2
Prefill1 -.->|KV Events| Frontend
Prefill2 -.->|KV Events| Frontend
Decode1 --> |Response| Frontend
Decode2 --> |Response| Frontend
Frontend --> Client
subgraph Node1["Node 1"]
Decode1
Prefill1
end
subgraph Node2["Node 2"]
Decode2
Prefill2
end
```
## What is KV-Aware Routing?
KV-aware routing optimizes LLM inference by directing requests to workers that already have relevant data cached. Instead of random or round-robin distribution, the router:
- **Tracks cached data**: Monitors which token sequences are cached on each worker
- **Maximizes cache reuse**: Routes requests to workers with the best cache overlap, reducing redundant computation
- **Balances load**: Considers both cache efficiency and worker utilization when making routing decisions
This is particularly beneficial for:
- **Shared system prompts**: Cached across workers and reused efficiently
- **Multi-turn conversations**: Full conversation history benefits from caching
- **Similar queries**: Common prefixes are computed once and reused
- **Batch processing**: Related requests can be routed to workers with shared context
For detailed technical information about how KV routing works, see the [Router Guide](../../../docs/components/router/router-guide.md).
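The idea of trading cache overlap against worker load can be illustrated with a toy model. This is a sketch under stated assumptions, not Dynamo's router: the `Worker` class, the `route` function, and the cost formula (blocks still to compute, weighted, plus active requests) are all hypothetical and chosen only to make the behavior visible.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 16  # tokens per KV block (assumed; matches the --page-size 16 used in this example)

Block = tuple[int, ...]


def to_blocks(tokens: list[int]) -> list[Block]:
    """Split tokens into full blocks; a trailing partial block is ignored."""
    full = len(tokens) - len(tokens) % BLOCK_SIZE
    return [tuple(tokens[i:i + BLOCK_SIZE]) for i in range(0, full, BLOCK_SIZE)]


@dataclass
class Worker:
    name: str
    active_requests: int = 0
    cached: list[list[Block]] = field(default_factory=list)  # block sequences already served

    def overlap(self, blocks: list[Block]) -> int:
        """Longest block prefix shared between the request and any cached sequence."""
        best = 0
        for seq in self.cached:
            shared = 0
            for mine, theirs in zip(seq, blocks):
                if mine != theirs:
                    break
                shared += 1
            best = max(best, shared)
        return best


def route(workers: list[Worker], tokens: list[int], overlap_weight: float = 1.0) -> Worker:
    """Pick the worker with the lowest hypothetical cost and record the request in its cache."""
    blocks = to_blocks(tokens)

    def cost(worker: Worker) -> float:
        new_blocks = len(blocks) - worker.overlap(blocks)
        return overlap_weight * new_blocks + worker.active_requests

    chosen = min(workers, key=cost)
    chosen.cached.append(blocks)
    return chosen


workers = [Worker("worker-1"), Worker("worker-2")]
system_prompt = list(range(64))  # four full blocks of shared prefix

first = route(workers, system_prompt + [1000] * 16)   # nothing cached yet
workers[0].active_requests = 2                        # worker-1 is now busier
second = route(workers, system_prompt + [2000] * 16)  # shares the system prompt
third = route(workers, [3000] * 80)                   # unrelated prompt
print(first.name, second.name, third.name)
```

In this toy run the second request stays on the busier worker because four of its five blocks are already cached there, while the unrelated third request goes to the idle worker.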
## Prerequisites
### 1. Infrastructure Services
Ensure etcd and NATS are running on a node accessible by all workers:
```bash
# On the infrastructure node (can be Node 1 or a dedicated node)
docker compose -f deploy/docker-compose.yml up -d
```
Note the IP address of this node - you'll need it for worker configuration.
### 2. Software Requirements
Install Dynamo with [SGLang](https://docs.sglang.io/) support:
```bash
pip install ai-dynamo[sglang]
```
For more information about the SGLang backend and its integration with Dynamo, see the [SGLang Backend Documentation](../../../docs/backends/sglang/README.md).
### 3. Network Requirements
Ensure the following ports are accessible between nodes:
- **2379**: etcd client port
- **4222**: NATS client port
- **8000**: Frontend HTTP port (only needed on frontend node)
- **${DISAGG_BOOTSTRAP_PORT}**: SGLang disaggregation bootstrap port (set in Step 1; must be reachable across nodes)
- **High-speed interconnect**: For optimal NIXL performance (InfiniBand, RoCE, or high-bandwidth Ethernet)
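A quick way to confirm the TCP ports above are reachable is a small connection probe run from each worker node. This is a minimal sketch using only the Python standard library; `port_open` and `check_ports` are hypothetical helper names, and `INFRA_NODE_IP` is assumed to be set as described in Step 1 (it falls back to `127.0.0.1` otherwise).

```python
import os
import socket


def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_ports(host: str, ports: dict[str, int]) -> dict[str, bool]:
    """Probe each named port on the host and report whether it accepted a connection."""
    return {name: port_open(host, port) for name, port in ports.items()}


if __name__ == "__main__":
    infra_ip = os.environ.get("INFRA_NODE_IP", "127.0.0.1")
    results = check_ports(infra_ip, {"etcd": 2379, "NATS": 4222})
    for name, ok in results.items():
        print(f"{name}: {'reachable' if ok else 'NOT reachable'}")
```

The probe only succeeds if the service is already listening, so run it after the infrastructure services are up. The same `port_open` call can be pointed at a worker node and the bootstrap port once the workers are running.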
### 4. Hardware Setup
This example assumes:
- **Node 1**: At least 2 GPUs (for Replica 1's decode and prefill workers)
- **Node 2**: At least 2 GPUs (for Replica 2's decode and prefill workers)
- **Frontend Node**: Can be on Node 1, Node 2, or a separate node (no GPU required)
> [!NOTE]
> You can run this example with minimal modifications on a single node with at least 4 GPUs.
> In Step 3, set `CUDA_VISIBLE_DEVICES=2` for the prefill worker and
> `CUDA_VISIBLE_DEVICES=3` for the decode worker.
## Setup Instructions
### Step 1: Set Environment Variables
On all nodes, set the etcd and NATS endpoints:
```bash
# Replace with your infrastructure node's IP
# To find your IP address, run the following on your infrastructure node:
# hostname -I | awk '{print $1}'
export INFRA_NODE_IP=<INFRA_NODE_IP>
export ETCD_ENDPOINTS=http://${INFRA_NODE_IP}:2379
export NATS_SERVER=nats://${INFRA_NODE_IP}:4222
export DYN_LOG=debug # Enable debug logging to see routing decisions
# Use a fixed, reachable port for the disaggregation bootstrap server
# Pick any free port and ensure it's open between nodes
export DISAGG_BOOTSTRAP_PORT=32963
```
### Step 2: Launch Replica 1 (Node 1)
Open a terminal on Node 1 and launch both workers:
```bash
# Launch prefill worker in background
CUDA_VISIBLE_DEVICES=0 python3 -m dynamo.sglang \
--model-path Qwen/Qwen3-0.6B \
--served-model-name Qwen/Qwen3-0.6B \
--page-size 16 \
--tp 1 \
--host 0.0.0.0 \
--trust-remote-code \
--skip-tokenizer-init \
--disaggregation-bootstrap-port ${DISAGG_BOOTSTRAP_PORT} \
--disaggregation-mode prefill \
--disaggregation-transfer-backend nixl &
# Launch decode worker in foreground
CUDA_VISIBLE_DEVICES=1 python3 -m dynamo.sglang \
--model-path Qwen/Qwen3-0.6B \
--served-model-name Qwen/Qwen3-0.6B \
--page-size 16 \
--tp 1 \
--host 0.0.0.0 \
--trust-remote-code \
--skip-tokenizer-init \
--disaggregation-bootstrap-port ${DISAGG_BOOTSTRAP_PORT} \
--disaggregation-mode decode \
--disaggregation-transfer-backend nixl
```
> [!NOTE]
>
> - `CUDA_VISIBLE_DEVICES`: Controls which GPU each worker uses (0 and 1 select different GPUs)
> - `--page-size 16`: Sets the KV cache block size - must be identical across all workers
> - `--host 0.0.0.0`: Exposes the SGLang bootstrap server on all interfaces so other nodes can reach it
> - `--disaggregation-bootstrap-port`: Uses the fixed port you set in `DISAGG_BOOTSTRAP_PORT`; ensure this port is open between nodes
> - `--disaggregation-mode`: Separates prefill (prompt processing) from decode (token generation)
> - `--disaggregation-transfer-backend nixl`: Enables high-speed GPU-to-GPU transfers
> - `--skip-tokenizer-init`: Avoids duplicate tokenizer loading since the frontend handles tokenization
### Step 3: Launch Replica 2 (Node 2)
Open a terminal on Node 2 and launch both workers:
```bash
# Launch prefill worker in background
CUDA_VISIBLE_DEVICES=0 python3 -m dynamo.sglang \
--model-path Qwen/Qwen3-0.6B \
--served-model-name Qwen/Qwen3-0.6B \
--page-size 16 \
--tp 1 \
--host 0.0.0.0 \
--trust-remote-code \
--skip-tokenizer-init \
--disaggregation-bootstrap-port ${DISAGG_BOOTSTRAP_PORT} \
--disaggregation-mode prefill \
--disaggregation-transfer-backend nixl &
# Launch decode worker in foreground
CUDA_VISIBLE_DEVICES=1 python3 -m dynamo.sglang \
--model-path Qwen/Qwen3-0.6B \
--served-model-name Qwen/Qwen3-0.6B \
--page-size 16 \
--tp 1 \
--host 0.0.0.0 \
--trust-remote-code \
--skip-tokenizer-init \
--disaggregation-bootstrap-port ${DISAGG_BOOTSTRAP_PORT} \
--disaggregation-mode decode \
--disaggregation-transfer-backend nixl
```
### Step 4: Launch Frontend with KV Routing
Open a terminal on any node and launch the frontend:
```bash
# On any node (no GPU required)
python -m dynamo.frontend \
--http-port 8000 \
--router-mode kv
```
Take note of the frontend IP address:
```bash
# On the same node you launched dynamo.frontend
hostname -I | awk '{print $1}'
```
The frontend will:
- Discover all available decode workers via etcd
- Enable KV-aware routing for intelligent request distribution
- Monitor worker health and adjust routing accordingly
For more details about frontend configuration options, see the [Frontend Component Documentation](/components/src/dynamo/frontend/README.md).
## Testing the Setup
### Prerequisites
Install the [OpenAI Python client](https://github.com/openai/openai-python) library:
```bash
pip install openai
```
Paste in the Dynamo Frontend IP from step 4 (or use localhost if on the same node):
```bash
export DYN_FRONTEND_IP=<PASTE_FRONTEND_IP_HERE>
```
### 1. Simple Request (New Conversation)
Send a request to see it routed to one of the replicas:
```python
from openai import OpenAI
import os

if os.environ.get("DYN_FRONTEND_IP"):
    frontend_ip = os.environ.get("DYN_FRONTEND_IP")
else:
    raise Exception("DYN_FRONTEND_IP is not set")

client = OpenAI(
    base_url=f"http://{frontend_ip}:8000/v1",
    api_key="dummy"  # Not used by Dynamo, but required by OpenAI client
)

response = client.chat.completions.create(
    model="Qwen/Qwen3-0.6B",
    messages=[
        {"role": "user", "content": "What is the capital of France?"}
    ],
    stream=False,
    max_tokens=50
)

print(response.choices[0].message.content)
```
### 2. Multi-Turn Conversation (Tests KV Routing)
Create a conversation to observe how KV routing naturally benefits multi-turn interactions:
```python
from openai import OpenAI
import os

if os.environ.get("DYN_FRONTEND_IP"):
    frontend_ip = os.environ.get("DYN_FRONTEND_IP")
else:
    raise Exception("DYN_FRONTEND_IP is not set")

client = OpenAI(
    base_url=f"http://{frontend_ip}:8000/v1",
    api_key="dummy"  # Not used by Dynamo, but required by OpenAI client
)

# First turn - establishes context
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "My name is Alice. Please remember it."}
]

response1 = client.chat.completions.create(
    model="Qwen/Qwen3-0.6B",
    messages=messages,
    stream=False,
    max_tokens=50
)
print("First response:", response1.choices[0].message.content)

# Add the assistant's response to conversation history
messages.append({"role": "assistant", "content": response1.choices[0].message.content})

# Second turn - includes the full conversation history
# KV routing will likely route this to the same worker due to shared token prefix
messages.append({"role": "user", "content": "What is my name?"})

response2 = client.chat.completions.create(
    model="Qwen/Qwen3-0.6B",
    messages=messages,
    stream=False,
    max_tokens=50
)
print("Second response:", response2.choices[0].message.content)
```
### 3. Load Distribution Test
Send multiple new conversations to see them distributed across replicas:
```python
import asyncio
from openai import AsyncOpenAI
import os

if os.environ.get("DYN_FRONTEND_IP"):
    frontend_ip = os.environ.get("DYN_FRONTEND_IP")
else:
    raise Exception("DYN_FRONTEND_IP is not set")


async def send_request(client, i):
    """Send a single request and return the response"""
    try:
        response = await client.chat.completions.create(
            model="Qwen/Qwen3-0.6B",
            messages=[
                {"role": "user", "content": f"Count to {i}"}
            ],
            stream=False,
            max_tokens=20
        )
        return f"Request {i}: {response.choices[0].message.content}"
    except Exception as e:
        return f"Request {i} failed: {e}"


async def load_test():
    """Send 10 requests in parallel to test load distribution"""
    client = AsyncOpenAI(
        base_url=f"http://{frontend_ip}:8000/v1",
        api_key="dummy"
    )

    # Send 10 requests in parallel
    tasks = [send_request(client, i) for i in range(1, 11)]
    results = await asyncio.gather(*tasks)

    for result in results:
        print(result)


# Run the load test
if __name__ == "__main__":
    asyncio.run(load_test())
```
## Monitoring KV Routing
With `DYN_LOG=debug`, you can observe KV routing decisions in the logs:
```
[DEBUG] KV overlap scores: {worker-1: 15 blocks, worker-2: 8 blocks}
[DEBUG] Selected worker-1 (best overlap: 15 blocks)
[DEBUG] Cache hit rate: 75% for this request
```
### Alternative Routing Modes
While this example demonstrates KV-aware routing for optimal cache utilization, Dynamo also supports simpler routing strategies:
- **KV-Aware** (recommended): Routes based on cache overlap across all workers
- **Round-Robin**: Distributes requests evenly across workers in sequence
- **Random**: Randomly selects workers for each request
```bash
# Example: Use round-robin routing instead of KV routing
python -m dynamo.frontend \
--http-port 8000 \
--router-mode round-robin
```
However, for maximum performance with shared prefixes and multi-turn conversations, KV routing provides significant advantages by minimizing redundant computation.
## Monitoring and Debugging
### Check Worker Registration
Verify all workers are properly registered:
```bash
etcdctl --endpoints=$ETCD_ENDPOINTS get --prefix /dynamo/workers/
```
### Monitor Routing Decisions
With `DYN_LOG=debug`, the frontend logs show routing decisions:
```
[DEBUG] KV overlap scores: {prefill-worker-1: 15 blocks, prefill-worker-2: 8 blocks}
[DEBUG] Selected prefill-worker-1 (best overlap: 15 blocks)
[DEBUG] KV overlap scores: {decode-worker-1: 12 blocks, decode-worker-2: 18 blocks}
[DEBUG] Selected decode-worker-2 (best overlap: 18 blocks)
[DEBUG] Worker decode-worker-1 unhealthy, rerouting -> decode-worker-2
```
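When reviewing many requests, it can help to extract the overlap scores from the log programmatically. The sketch below assumes the log lines look exactly like the sample above, which may not hold for every Dynamo version; `parse_overlap_scores` and `best_worker` are hypothetical helper names.

```python
import re

# Patterns written against the sample log lines above; adjust them if your output differs.
SCORES_RE = re.compile(r"KV overlap scores: \{(?P<body>[^}]*)\}")
ENTRY_RE = re.compile(r"(?P<worker>[\w.-]+):\s*(?P<blocks>\d+)\s*blocks")


def parse_overlap_scores(line: str) -> dict[str, int]:
    """Return {worker: overlapping blocks} for a score line, or {} for any other line."""
    match = SCORES_RE.search(line)
    if not match:
        return {}
    return {m["worker"]: int(m["blocks"]) for m in ENTRY_RE.finditer(match["body"])}


def best_worker(scores: dict[str, int]):
    """Return the worker with the highest overlap, or None when there are no scores."""
    return max(scores, key=scores.get) if scores else None


line = "[DEBUG] KV overlap scores: {decode-worker-1: 12 blocks, decode-worker-2: 18 blocks}"
scores = parse_overlap_scores(line)
print(scores, best_worker(scores))
```

Comparing `best_worker(scores)` with the worker named in the following "Selected" line is a simple way to spot requests where load, rather than overlap, decided the route.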
### Health Checks
Check worker health status:
```bash
curl http://${DYN_FRONTEND_IP}:8000/health
```
## Troubleshooting
### Workers Not Discovering Each Other
1. Verify etcd connectivity from all nodes:
```bash
etcdctl --endpoints=$ETCD_ENDPOINTS endpoint health
```
2. Check NATS connectivity:
```bash
nats --server=$NATS_SERVER server check connection
```
### NIXL Transfer Failures
1. Ensure GPUs can communicate across nodes
2. Check InfiniBand/RoCE configuration if using high-speed interconnect
3. Verify CUDA IPC is enabled for optimal performance
### Routing Not Working
1. Confirm frontend is started with `--router-mode kv`
2. Check that all workers are properly registered in etcd
3. Verify workers are publishing KV events
4. Check logs for overlap scores - if all zeros, cache tracking may not be working
5. Ensure NATS is functioning for KV event distribution
## Advanced Configuration
For production deployments, you can fine-tune KV routing behavior:
```bash
# --kv-overlap-score-weight: weight for cache overlap scoring
# --router-temperature: temperature for probabilistic routing (0 = deterministic)
python -m dynamo.frontend \
    --http-port 8000 \
    --router-mode kv \
    --kv-overlap-score-weight 1.0 \
    --router-temperature 0.0
```
For more advanced configuration options including custom worker selection, block size tuning, and alternative indexing strategies, see the [Router Guide](../../../docs/components/router/router-guide.md).
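To build intuition for what a routing temperature does, the sketch below shows generic softmax sampling over per-worker costs. It is an illustration of the general technique only: the `select_worker` function and its cost inputs are hypothetical, and the exact formula Dynamo uses is not confirmed here.

```python
import math
import random


def select_worker(costs: dict[str, float], temperature: float = 0.0, rng=None) -> str:
    """Pick a worker from per-worker costs, where a lower cost is better."""
    if temperature <= 0.0:
        # Deterministic: always the cheapest worker.
        return min(costs, key=costs.get)
    rng = rng or random.Random()
    lowest = min(costs.values())
    # Softmax over negated costs; subtracting the minimum keeps exp() numerically stable.
    weights = [math.exp(-(cost - lowest) / temperature) for cost in costs.values()]
    return rng.choices(list(costs), weights=weights, k=1)[0]


costs = {"worker-1": 3.0, "worker-2": 5.0}
print(select_worker(costs))  # temperature 0: always the cheapest worker

rng = random.Random(0)
picks = [select_worker(costs, temperature=1.0, rng=rng) for _ in range(1000)]
print(picks.count("worker-1"), picks.count("worker-2"))
```

With a temperature above zero the cheaper worker is still chosen most of the time, but some traffic spreads to the other worker, which can help avoid piling every request with a popular prefix onto one worker.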
## Cleanup
Stop all components in reverse order:
1. Stop Frontend (Ctrl+C in the frontend terminal)
2. Stop workers on each node:
- On Node 1: Press Ctrl+C in the terminal (this stops the decode worker)
- On Node 2: Press Ctrl+C in the terminal (this stops the decode worker)
- To stop the background prefill workers, use one of these methods:
```bash
# Method 1: Kill background jobs in the same terminal
jobs # See background jobs
kill %1 # Kill the background prefill worker
# Method 2: Close the terminal entirely (sends SIGHUP to background processes)
exit
# Method 3: Kill by process name (from any terminal)
pkill -f "dynamo.sglang.*prefill"
```
3. Stop infrastructure services:
```bash
docker compose -f deploy/docker-compose.yml down
```
## Next Steps
- **Scale Up**: Add more replicas by repeating Steps 2-3 on additional nodes
- **High Availability**: Run multiple frontend instances with a load balancer
- **Monitoring**: Deploy Prometheus and Grafana for production monitoring
- **Optimization**: Tune worker configurations based on workload patterns
- **Cache Analysis**: Use SGLang's built-in cache statistics to optimize your workloads
<!--
SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
# Example: Multi-node TRTLLM Workers with Dynamo on Slurm
See [here](/docs/backends/trtllm/multinode) for how to set up this example.
#!/bin/bash
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
# This is one of the only variables that must be set currently, most of the rest may
# just work out of the box if following the steps in the README.
IMAGE="${IMAGE:-""}"
# Set to mount current host directory to /mnt inside the container as an example,
# but you may freely customize the mounts based on your cluster. A common practice
# is to mount paths to NFS storage for common scripts, model weights, etc.
# NOTE: This can be a comma separated list of multiple mounts as well.
DEFAULT_MOUNT="${PWD}/../../../../:/mnt"
MOUNTS="${MOUNTS:-${DEFAULT_MOUNT}}"
# Example values, assuming 4 nodes with 4 GPUs on each node, such as 4xGB200 nodes.
# For 8xH100 nodes as an example, you may set this to 2 nodes x 8 gpus/node instead.
NUM_NODES=${NUM_NODES:-4}
NUM_GPUS_PER_NODE=${NUM_GPUS_PER_NODE:-4}
export ENGINE_CONFIG="${ENGINE_CONFIG:-/mnt/examples/backends/trtllm/engine_configs/deepseek-r1/agg/wide_ep/wide_ep_agg.yaml}"
# Automate settings of certain variables for convenience, but you are free
# to manually set these for more control as well.
ACCOUNT="$(sacctmgr -nP show assoc where user=$(whoami) format=account)"
export HEAD_NODE="${SLURMD_NODENAME}"
export HEAD_NODE_IP="$(hostname -i)"
export ETCD_ENDPOINTS="${HEAD_NODE_IP}:2379"
export NATS_SERVER="nats://${HEAD_NODE_IP}:4222"
if [[ -z ${IMAGE} ]]; then
echo "ERROR: You need to set the IMAGE environment variable to the " \
"Dynamo+TRTLLM docker image or .sqsh file from 'enroot import' " \
"See how to build one from source here: " \
"https://github.com/ai-dynamo/dynamo/tree/main/docs/backends/trtllm/README.md#build-container"
exit 1
fi
# NOTE: Output streamed to stdout for ease of understanding the example, but
# in practice you would probably set `srun --output ... --error ...` to pipe
# the stdout/stderr to files.
echo "Launching frontend services in background."
srun \
--mpi pmix \
--overlap \
--container-image "${IMAGE}" \
--container-mounts "${MOUNTS}" \
--verbose \
--label \
-A "${ACCOUNT}" \
-J "${ACCOUNT}-dynamo.trtllm" \
--nodelist "${HEAD_NODE}" \
--nodes 1 \
--jobid "${SLURM_JOB_ID}" \
/mnt/examples/basics/multinode/trtllm/start_frontend_services.sh &
# NOTE: Output streamed to stdout for ease of understanding the example, but
# in practice you would probably set `srun --output ... --error ...` to pipe
# the stdout/stderr to files.
echo "Launching multi-node worker in background."
DISAGGREGATION_MODE="prefill_and_decode" \
srun \
--mpi pmix \
--oversubscribe \
--container-image "${IMAGE}" \
--container-mounts "${MOUNTS}" \
--container-env ETCD_ENDPOINTS,NATS_SERVER,HEAD_NODE_IP,HEAD_NODE,DISAGGREGATION_MODE,ENGINE_CONFIG \
--verbose \
--label \
-A "${ACCOUNT}" \
-J "${ACCOUNT}-dynamo.trtllm" \
--nodes "${NUM_NODES}" \
--ntasks-per-node "${NUM_GPUS_PER_NODE}" \
--jobid "${SLURM_JOB_ID}" \
/mnt/examples/basics/multinode/trtllm/start_trtllm_worker.sh &
#!/bin/bash
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
# This is one of the only variables that must be set currently, most of the rest may
# just work out of the box if following the steps in the README.
IMAGE="${IMAGE:-""}"
# Set to mount current host directory to /mnt inside the container as an example,
# but you may freely customize the mounts based on your cluster. A common practice
# is to mount paths to NFS storage for common scripts, model weights, etc.
# NOTE: This can be a comma separated list of multiple mounts as well.
DEFAULT_MOUNT="${PWD}/../../../../:/mnt"
MOUNTS="${MOUNTS:-${DEFAULT_MOUNT}}"
NUM_GPUS_PER_NODE=${NUM_GPUS_PER_NODE:-4}
NUM_PREFILL_NODES=${NUM_PREFILL_NODES:-4}
NUM_PREFILL_WORKERS=${NUM_PREFILL_WORKERS:-1}
PREFILL_ENGINE_CONFIG="${PREFILL_ENGINE_CONFIG:-/mnt/examples/backends/trtllm/engine_configs/deepseek-r1/disagg/wide_ep/wide_ep_prefill.yaml}"
NUM_DECODE_NODES=${NUM_DECODE_NODES:-4}
NUM_DECODE_WORKERS=${NUM_DECODE_WORKERS:-1}
DECODE_ENGINE_CONFIG="${DECODE_ENGINE_CONFIG:-/mnt/examples/backends/trtllm/engine_configs/deepseek-r1/disagg/wide_ep/wide_ep_decode.yaml}"
# Automate settings of certain variables for convenience, but you are free
# to manually set these for more control as well.
ACCOUNT="$(sacctmgr -nP show assoc where user=$(whoami) format=account)"
export HEAD_NODE="${SLURMD_NODENAME}"
export HEAD_NODE_IP="$(hostname -i)"
export ETCD_ENDPOINTS="${HEAD_NODE_IP}:2379"
export NATS_SERVER="nats://${HEAD_NODE_IP}:4222"
if [[ -z ${IMAGE} ]]; then
echo "ERROR: You need to set the IMAGE environment variable to the " \
"Dynamo+TRTLLM docker image or .sqsh file from 'enroot import' " \
"See how to build one from source here: " \
"https://github.com/ai-dynamo/dynamo/tree/main/docs/backends/trtllm/README.md#build-container"
exit 1
fi
# NOTE: Output streamed to stdout for ease of understanding the example, but
# in practice you would probably set `srun --output ... --error ...` to pipe
# the stdout/stderr to files.
echo "Launching frontend services in background."
srun \
--mpi pmix \
--overlap \
--container-image "${IMAGE}" \
--container-mounts "${MOUNTS}" \
--verbose \
--label \
-A "${ACCOUNT}" \
-J "${ACCOUNT}-dynamo.trtllm" \
--nodelist "${HEAD_NODE}" \
--nodes 1 \
--jobid "${SLURM_JOB_ID}" \
/mnt/examples/basics/multinode/trtllm/start_frontend_services.sh &
# NOTE: Output streamed to stdout for ease of understanding the example, but
# in practice you would probably set `srun --output ... --error ...` to pipe
# the stdout/stderr to files.
for ((i=1; i<=${NUM_PREFILL_WORKERS}; i++)); do
echo "Launching multi-node prefill worker in background."
DISAGGREGATION_MODE=prefill \
ENGINE_CONFIG=${PREFILL_ENGINE_CONFIG} \
srun \
--mpi pmix \
--oversubscribe \
--container-image "${IMAGE}" \
--container-mounts "${MOUNTS}" \
--container-env ETCD_ENDPOINTS,NATS_SERVER,HEAD_NODE_IP,HEAD_NODE,DISAGGREGATION_MODE,ENGINE_CONFIG \
--verbose \
--label \
-A "${ACCOUNT}" \
-J "${ACCOUNT}-dynamo.trtllm" \
--nodes "${NUM_PREFILL_NODES}" \
--ntasks-per-node "${NUM_GPUS_PER_NODE}" \
--jobid "${SLURM_JOB_ID}" \
/mnt/examples/basics/multinode/trtllm/start_trtllm_worker.sh &
done
for ((i=1; i<=${NUM_DECODE_WORKERS}; i++)); do
echo "Launching multi-node decode worker in background."
DISAGGREGATION_MODE=decode \
ENGINE_CONFIG=${DECODE_ENGINE_CONFIG} \
srun \
--mpi pmix \
--oversubscribe \
--container-image "${IMAGE}" \
--container-mounts "${MOUNTS}" \
--container-env ETCD_ENDPOINTS,NATS_SERVER,HEAD_NODE_IP,HEAD_NODE,DISAGGREGATION_MODE,ENGINE_CONFIG \
--verbose \
--label \
-A "${ACCOUNT}" \
-J "${ACCOUNT}-dynamo.trtllm" \
--nodes "${NUM_DECODE_NODES}" \
--ntasks-per-node "${NUM_GPUS_PER_NODE}" \
--jobid "${SLURM_JOB_ID}" \
/mnt/examples/basics/multinode/trtllm/start_trtllm_worker.sh &
done
#!/bin/bash
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
# Start NATS
nats-server -js &
# Start etcd
etcd --listen-client-urls http://0.0.0.0:2379 --advertise-client-urls http://0.0.0.0:2379 --data-dir /tmp/etcd &
# Wait for NATS/etcd to startup
sleep 3
# Start OpenAI Frontend which will dynamically discover workers when they startup
# dynamo.frontend accepts either --http-port flag or DYN_HTTP_PORT env var (defaults to 8000)
# NOTE: This is a blocking call.
python3 -m dynamo.frontend
#!/bin/bash
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
if [[ -z ${MODEL_PATH} ]]; then
echo "ERROR: MODEL_PATH was not set."
echo "ERROR: MODEL_PATH must be set to either the HuggingFace ID or locally " \
"downloaded path to the model weights. Since Deepseek R1 is large, it is " \
"recommended to pre-download them to a shared location and provide the path."
exit 1
fi
if [[ -z ${SERVED_MODEL_NAME} ]]; then
echo "WARNING: SERVED_MODEL_NAME was not set. It will be derived from MODEL_PATH."
fi
if [[ -z ${ENGINE_CONFIG} ]]; then
echo "ERROR: ENGINE_CONFIG was not set."
echo "ERROR: ENGINE_CONFIG must be set to a valid Dynamo+TRTLLM engine config file."
exit 1
fi
EXTRA_ARGS=""
if [[ -n ${DISAGGREGATION_MODE} ]]; then
EXTRA_ARGS+="--disaggregation-mode ${DISAGGREGATION_MODE} "
fi
# Only publish KV events if using KV-aware routing (not needed for round-robin)
if [[ -n ${PUBLISH_KV_EVENTS} ]] && [[ ${PUBLISH_KV_EVENTS} == "true" ]]; then
EXTRA_ARGS+="--publish-events-and-metrics "
fi
if [[ -n ${MODALITY} ]]; then
EXTRA_ARGS+="--modality ${MODALITY} "
fi
trtllm-llmapi-launch \
python3 -m dynamo.trtllm \
--model-path "${MODEL_PATH}" \
--served-model-name "${SERVED_MODEL_NAME}" \
--extra-engine-args "${ENGINE_CONFIG}" \
${EXTRA_ARGS}
# Quickstart
This is a simple example showing how you can quickly get started deploying Large Language Models with Dynamo.
## Prerequisites
Before running this example, ensure you have the following services running:
- **etcd**: A distributed key-value store used for service discovery and metadata storage
- **NATS**: A high-performance message broker for inter-component communication
You can start these services using Docker Compose:
```bash
docker compose -f deploy/docker-compose.yml up -d
```
## Components
- [Frontend](/components/src/dynamo/frontend/README.md) - A built-in component that launches an OpenAI compliant HTTP server, a pre-processor, and a router in a single process
- [vLLM Backend](/docs/backends/vllm/README.md) - A built-in component that runs vLLM within the Dynamo runtime
```mermaid
---
title: Request Flow
---
flowchart TD
A["Users/Clients<br/>(HTTP)"] --> B["Frontend<br/>HTTP API endpoint<br/>(OpenAI Style)"]
B --> C["NATS Message Broker<br/>(Inter-component communication)"]
C --> D["vLLM Backend<br/>(NATS subscriber)"]
D --> C
C --> B
B --> A
```
## Instructions
There are three steps to deploy and use an LLM with Dynamo.
### 1. Launch Engine
**Open a new terminal** and run:
```bash
python -m dynamo.vllm --model Qwen/Qwen3-0.6B
```
Leave this terminal running - it will show vLLM Backend logs.
### 2. Launch Frontend
**Open another terminal** and interact with the deployed engine using the built-in frontend component. You have two options:
1. Interactive Command Line Interface
```bash
python -m dynamo.frontend --interactive
```
2. HTTP Server
```bash
python -m dynamo.frontend --http-port 8000
```
Leave this terminal running as well - it will show Frontend logs.
### 3. Send Requests
If you launched the frontend in `interactive` mode, simply start typing and hit `Enter` to have an interactive chat with your LLM.
If you launched the frontend in HTTP mode, you can send requests with `curl` or any OpenAI-compatible client program or library:
```bash
curl -X POST http://localhost:8000/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{
"model": "Qwen/Qwen3-0.6B",
"messages": [
{ "role": "user", "content": "Tell me a story about a brave cat" }
],
"stream": false,
"max_tokens": 1028
}'
```
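The same request can also be sent from Python. The sketch below uses only the standard library and assumes the frontend from step 2 is listening on `localhost:8000`. The helper names `build_chat_request` and `send_chat_request` are illustrative and not part of Dynamo, and the reply is read assuming the usual OpenAI-style response layout (`choices[0].message.content`).

```python
import json
import urllib.request

# Assumed address of the frontend started with --http-port 8000.
BASE_URL = "http://localhost:8000"


def build_chat_request(prompt, model="Qwen/Qwen3-0.6B", max_tokens=1028):
    """Build (without sending) the same POST request as the curl example."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat_request(prompt):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt)) as response:
        body = json.load(response)
    # Assumes an OpenAI-style chat completion response.
    return body["choices"][0]["message"]["content"]
```

With both terminals from steps 1 and 2 still running, `print(send_chat_request("Tell me a story about a brave cat"))` should print the model's reply.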
## Cleanup
When you're done with the quickstart example, follow these steps to clean up:
### 1. Stop Dynamo Components
In each terminal where you started Dynamo components, press `Ctrl+C` to stop them:
- Stop the vLLM Backend (terminal from step 1)
- Stop the Frontend (terminal from step 2)
### 2. Stop Infrastructure Services
If you don't plan to run any more examples, stop the etcd and NATS services that were started with Docker Compose:
```bash
docker compose -f deploy/docker-compose.yml down
```
This will stop and remove the containers for etcd and NATS.
## Understand
### What's Happening Under the Hood
When you run the two commands above, here's what Dynamo does to spin up the necessary processes and connect your HTTP requests to the vLLM Backend:
### 1. Service Registration and Discovery
#### DistributedRuntime Setup
At startup, each Dynamo component (vLLM Backend, Frontend) connects to the `DistributedRuntime`, which involves creating connections to two critical infrastructure services:
- **etcd**: A distributed key-value store used for service discovery and metadata storage
- **NATS**: A high-performance message broker for inter-component communication
#### Component Registration
When the vLLM Backend starts up, it registers itself as a `component` in etcd with one or more `endpoints`.
This registration includes each endpoint's [NATS subject](https://docs.nats.io/nats-concepts/subjects) for communication and is tied to a `lease` that automatically expires if the component goes offline.
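To make the lease idea concrete, here is a minimal in-memory sketch of lease-bound registration. It is a toy model only: `LeaseRegistry`, its methods, and the key and subject strings in the usage note are hypothetical, and none of this is the etcd or Dynamo API. It shows that an entry stays visible only while its owner keeps renewing the lease.

```python
import time


class LeaseRegistry:
    """Toy in-memory stand-in for lease-bound registration (not the etcd API)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._entries = {}  # key -> (value, expiry time)

    def register(self, key, value, ttl):
        """Store an entry that disappears unless renewed within ttl seconds."""
        self._entries[key] = (value, self._clock() + ttl)

    def keep_alive(self, key, ttl):
        """Renew the lease, as a live component would do periodically."""
        value, _ = self._entries[key]
        self._entries[key] = (value, self._clock() + ttl)

    def get_prefix(self, prefix):
        """Return live entries under a prefix, dropping expired ones first."""
        now = self._clock()
        self._entries = {k: v for k, v in self._entries.items() if v[1] > now}
        return {k: v[0] for k, v in self._entries.items() if k.startswith(prefix)}
```

For example, after `register("instances/backend/generate", "subject.a", ttl=10)`, the entry is returned by `get_prefix("instances")` until ten seconds pass without a `keep_alive` call. This mirrors how a crashed component drops out of discovery without any explicit deregistration.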
<details>
<summary> Inspecting the Component Registry </summary>
If you want to find out more about the internal organization of components in Dynamo, you can inspect the contents of `etcd` using the [`etcdctl` command line tool](https://etcd.io/docs/latest/dev-guide/interacting_v3/). For this example, you can try running
```bash
etcdctl get "instances" --prefix
```
which will show you each registered endpoint along with its associated NATS subject. Note that the specific etcd and NATS details are internal and subject to change; future examples will show how to use the `DistributedRuntime` itself to communicate across components.
</details>
#### Frontend Discovery
When the Frontend starts, it doesn't receive an explicit pointer to the vLLM Backend component. Instead, it continuously watches etcd for registered models and automatically discovers the vLLM Backend component and its endpoints once they become available.
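The discovery behaviour can be pictured as comparing successive views of the registry. The sketch below is an assumption-laden toy: a real etcd watch delivers change events rather than full snapshots, and `diff_snapshots` and `DiscoveredModels` are hypothetical names, not Dynamo classes. It only illustrates how a watcher learns about backends appearing and disappearing without being told about them up front.

```python
def diff_snapshots(previous, current):
    """Compare two registry snapshots and report what a watcher would observe."""
    added = {k: current[k] for k in current.keys() - previous.keys()}
    removed = sorted(previous.keys() - current.keys())
    return added, removed


class DiscoveredModels:
    """Tracks the entries a frontend could currently route to (toy model)."""

    def __init__(self):
        self.snapshot = {}

    def on_update(self, current):
        """Record the new view and return (added entries, removed keys)."""
        added, removed = diff_snapshots(self.snapshot, current)
        self.snapshot = dict(current)
        return added, removed
```

Starting the frontend before the backend corresponds to the first `on_update({})` call returning nothing; the backend's later registration then shows up as an added entry.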
### 2. Request Flow and NATS Messaging
When you send an HTTP request to the Frontend:
1. **Request Packaging**: The Frontend wraps your HTTP request in a standardized internal format with routing metadata
2. **NATS Subject Resolution**: Using the endpoints discovered in etcd, the Frontend determines the appropriate NATS subject
3. **Message Dispatch**: The request is published to the discovered NATS subject, where the target vLLM Backend picks it up
4. **Response Streaming**: The vLLM Backend executes the request and streams responses back through NATS, which the Frontend converts back to HTTP
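The four steps above can be traced with a small in-process simulation. Everything here is a stand-in: `ToyBroker` is a synchronous publish/subscribe dictionary rather than the NATS client, and the envelope fields, the reply subject `reply.1`, and the fake backend are assumptions made for illustration, not Dynamo's internal message format.

```python
import json
from collections import defaultdict


class ToyBroker:
    """In-process stand-in for a message broker; not the NATS client API."""

    def __init__(self):
        self._handlers = defaultdict(list)

    def subscribe(self, subject, handler):
        self._handlers[subject].append(handler)

    def publish(self, subject, message):
        for handler in self._handlers[subject]:
            handler(message)


def make_backend(broker):
    """Return a fake backend handler that streams the prompt back word by word."""
    def handle(raw):
        request = json.loads(raw)
        for word in request["prompt"].split():
            broker.publish(request["reply_to"], word.upper() + " ")
    return handle


def run_request(broker, endpoints, model, prompt):
    """Walk one request through the four steps described above."""
    chunks = []
    reply_subject = "reply.1"  # hypothetical per-request reply subject
    broker.subscribe(reply_subject, chunks.append)

    # 1. Request packaging: wrap the payload with routing metadata.
    envelope = json.dumps({"reply_to": reply_subject, "model": model, "prompt": prompt})
    # 2. Subject resolution: look up the subject among the discovered endpoints.
    subject = endpoints[model]
    # 3. Message dispatch: publish to the backend's subject.
    broker.publish(subject, envelope)
    # 4. Response streaming: chunks arrived on the reply subject; join them for HTTP.
    return "".join(chunks)
```

To try it, subscribe the fake backend with `broker.subscribe("backend.generate", make_backend(broker))` and call `run_request` with an `endpoints` mapping from the model name to `"backend.generate"`. The frontend side never references the backend directly, only the subject it discovered.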
### 3. Network-Transparent Operation
One of Dynamo's key strengths is that this entire system works seamlessly whether components are:
- Running on the same machine (like in this quickstart)
- Distributed across multiple nodes in a cluster
- Deployed in different availability zones
The same two commands work in all scenarios, as long as all components can connect with the `DistributedRuntime` - Dynamo handles the networking complexity automatically.