chore: remove stale example assets (#7059)

7dfbe4fd · Alec · GitHub · 310f8ca9
Unverified Commit 7dfbe4fd authored Mar 27, 2026 by Alec Committed by GitHub Mar 28, 2026
10 changed files
--- a/examples/basics/kubernetes/Distributed_Inference/images/settings.png
+++ b/examples/basics/kubernetes/Distributed_Inference/images/settings.png
--- a/examples/basics/kubernetes/shared_frontend/README.md
+++ b/examples/basics/kubernetes/shared_frontend/README.md
-# Shared Dynamo Frontend
-This folder contains Kubernetes manifests that deploy the Dynamo frontend component as a standalone DynamoGraphDeployment (DGD)
-and two models.
-The frontend is shared across the two models. It is deployed to the Dynamo namespace `dynamo`, a reserved namespace name that lets the frontend observe deployed models across all Dynamo namespaces.
-A shared PVC is configured to store model checkpoint weights fetched from Hugging Face.
-1. Install the Dynamo k8s platform Helm chart
-2. Create a k8s secret with your Hugging Face token, then apply the k8s manifests
-```sh
-export HF_TOKEN=YOUR_HF_TOKEN
-kubectl create secret generic hf-token-secret \
-    --from-literal=HF_TOKEN=${HF_TOKEN} \
-    --namespace ${NAMESPACE}
-kubectl apply -f shared_frontend.yaml --namespace ${NAMESPACE}
-```
-3. Test the deployment and run benchmarks
-After deployment, forward the frontend service to access the API:
-```sh
-kubectl port-forward svc/frontend-frontend 8000:8000 -n ${NAMESPACE}
-```
-Confirm both deployed models are present in the model listing:
-```sh
-curl localhost:8000/v1/models
-{"object":"list","data":[{"id":"Qwen/Qwen3-0.6B","object":"object","created":1759458713,"owned_by":"nvidia"},{"id":"Qwen/Qwen2.5-VL-7B-Instruct","object":"object","created":1759458718,"owned_by":"nvidia"}]}
-```
-Use the following request to test one of the deployed models:
-```sh
-curl localhost:8000/v1/chat/completions \
-  -H "Content-Type: application/json" \
-  -d '{
-    "model": "Qwen/Qwen3-0.6B",
-    "messages": [
-    {
-        "role": "user",
-        "content": "In the heart of Eldoria, an ancient land of boundless magic and mysterious creatures, lies the long-forgotten city of Aeloria. Once a beacon of knowledge and power, Aeloria was buried beneath the shifting sands of time, lost to the world for centuries. You are an intrepid explorer, known for your unparalleled curiosity and courage, who has stumbled upon an ancient map hinting at ests that Aeloria holds a secret so profound that it has the potential to reshape the very fabric of reality. Your journey will take you through treacherous deserts, enchanted forests, and across perilous mountain ranges. Your Task: Character Background: Develop a detailed background for your character. Describe their motivations for seeking out Aeloria, their skills and weaknesses, and any personal connections to the ancient city or its legends. Are they driven by a quest for knowledge, a search for lost familt clue is hidden."
-    }
-    ],
-    "stream": false,
-    "max_tokens": 30
-  }'
-```
-You can also benchmark the endpoint's performance with [AIPerf](https://github.com/ai-dynamo/aiperf/blob/main/README.md).
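The model-listing check in the removed README can also be scripted. This is a minimal sketch, assuming the response body has the shape shown in the sample output (a `data` list of objects with an `id` field). The helper name `missing_models` is hypothetical and not part of Dynamo.

```python
import json


def missing_models(models_response: str, expected: set) -> set:
    """Return the expected model IDs that are absent from a /v1/models response body."""
    payload = json.loads(models_response)
    served = {entry["id"] for entry in payload.get("data", [])}
    return expected - served


# Sample body copied from the listing output in the README.
sample = (
    '{"object":"list","data":['
    '{"id":"Qwen/Qwen3-0.6B","object":"object","created":1759458713,"owned_by":"nvidia"},'
    '{"id":"Qwen/Qwen2.5-VL-7B-Instruct","object":"object","created":1759458718,"owned_by":"nvidia"}]}'
)

expected = {"Qwen/Qwen3-0.6B", "Qwen/Qwen2.5-VL-7B-Instruct"}
print(missing_models(sample, expected))  # set() when both models are listed
```

In practice the body would come from `curl localhost:8000/v1/models` or an HTTP client. A non-empty result names the models that have not registered with the frontend yet.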
--- a/examples/basics/kubernetes/shared_frontend/shared_frontend.yaml
+++ b/examples/basics/kubernetes/shared_frontend/shared_frontend.yaml
-# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: Apache-2.0
-apiVersion: v1
-kind: PersistentVolumeClaim
-metadata:
-  name: dynamo-model-cache
-spec:
-  accessModes:
-    - ReadWriteOnce
-  resources:
-    requests:
-      storage: 100Gi
---
-apiVersion: nvidia.com/v1alpha1
-kind: DynamoGraphDeployment
-metadata:
-  name: frontend
-spec:
-  services:
-    Frontend:
-      componentType: frontend
-      globalDynamoNamespace: true
-      replicas: 1
-      extraPodSpec:
-        mainContainer:
-          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.5.0
---
-apiVersion: nvidia.com/v1alpha1
-kind: DynamoGraphDeployment
-metadata:
-  name: vllm-agg
-spec:
-  pvcs:
-  - name: dynamo-model-cache
-    create: false
-  services:
-    VllmDecodeWorker:
-      volumeMounts:
-      - name: dynamo-model-cache
-        mountPoint: /root/.cache/huggingface
-      envFromSecret: hf-token-secret
-      componentType: worker
-      replicas: 1
-      resources:
-        limits:
-          gpu: "1"
-      extraPodSpec:
-        mainContainer:
-          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.5.0
-          workingDir: /workspace/examples/backends/vllm
-          command:
-            - /bin/sh
-            - -c
-          args:
-            - python3 -m dynamo.vllm --model Qwen/Qwen3-0.6B
---
-apiVersion: nvidia.com/v1alpha1
-kind: DynamoGraphDeployment
-metadata:
-  name: agg-qwen
-spec:
-  backendFramework: vllm
-  services:
-    EncodeWorker:
-      volumeMounts:
-      - name: dynamo-model-cache
-        mountPoint: /root/.cache/huggingface
-      envFromSecret: hf-token-secret
-      componentType: worker
-      replicas: 1
-      resources:
-        limits:
-          gpu: "1"
-      extraPodSpec:
-        mainContainer:
-          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.5.0
-          workingDir: /workspace/examples/multimodal
-          command:
-            - /bin/sh
-            - -c
-          args:
-            - python3 components/encode_worker.py --model Qwen/Qwen2.5-VL-7B-Instruct
-    VLMWorker:
-      volumeMounts:
-      - name: dynamo-model-cache
-        mountPoint: /root/.cache/huggingface
-      envFromSecret: hf-token-secret
-      componentType: worker
-      replicas: 1
-      resources:
-        limits:
-          gpu: "1"
-      extraPodSpec:
-        mainContainer:
-          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.5.0
-          workingDir: /workspace/examples/multimodal
-          command:
-            - /bin/sh
-            - -c
-          args:
-            - python3 components/worker.py --model Qwen/Qwen2.5-VL-7B-Instruct --worker-type prefill
-    Processor:
-      volumeMounts:
-      - name: dynamo-model-cache
-        mountPoint: /root/.cache/huggingface
-      envFromSecret: hf-token-secret
-      componentType: worker
-      replicas: 1
-      resources:
-        limits:
-          gpu: "1"
-      extraPodSpec:
-        mainContainer:
-          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.5.0
-          workingDir: /workspace/examples/multimodal
-          command:
-            - /bin/sh
-            - -c
-          args:
-            - 'python3 components/processor.py --model Qwen/Qwen2.5-VL-7B-Instruct --prompt-template "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n<|vision_start|><|image_pad|><|vision_end|><prompt><|im_end|>\n<|im_start|>assistant\n"'
--- a/examples/basics/multinode/README.md
+++ b/examples/basics/multinode/README.md
-# Multi-Node Dynamo with KV Routing
-This example demonstrates running Dynamo across multiple nodes with **KV-aware routing** to distribute requests between two replicas of a disaggregated model. Each replica consists of dedicated prefill and decode workers, providing high availability and load distribution.
-For more information about the core concepts, see:
-- [Dynamo Disaggregated Serving](../../../docs/design-docs/disagg-serving.md)
-- [KV Cache Routing](../../../docs/components/router/README.md)
-## Architecture Overview
-The multi-node setup consists of:
-- **1 Frontend**: Receives HTTP requests and uses KV routing to distribute them
-- **2 Model Replicas**: Each with dedicated prefill and decode workers
-- **Smart KV-Aware Routing**: Intelligently routes requests based on KV cache locality across **all workers**
-```mermaid
---
-title: Multi-Node Architecture with Full KV Routing (SGLang)
---
-flowchart TD
-    Client["Users/Clients<br/>(HTTP)"] --> Frontend["Frontend<br/>KV-Aware Router<br/>(Any Node)"]
-    Frontend --> Router{KV Routing<br/>Decision}
-    Router --> Prefill1["Prefill Worker 1"]
-    Router --> Prefill2["Prefill Worker 2"]
-    Prefill1 -->|NIXL Transfer| Decode1
-    Prefill2 -->|NIXL Transfer| Decode2
-    Prefill1 -.->|KV Events| Frontend
-    Prefill2 -.->|KV Events| Frontend
-    Decode1 --> |Response| Frontend
-    Decode2 --> |Response| Frontend
-    Frontend --> Client
-    subgraph Node1["Node 1"]
-        Decode1
-        Prefill1
-    end
-    subgraph Node2["Node 2"]
-        Decode2
-        Prefill2
-    end
-```
-## What is KV-Aware Routing?
-KV-aware routing optimizes LLM inference by directing requests to workers that already have relevant data cached. Instead of random or round-robin distribution, the router:
-- **Tracks cached data**: Monitors which token sequences are cached on each worker
-- **Maximizes cache reuse**: Routes requests to workers with the best cache overlap, reducing redundant computation
-- **Balances load**: Considers both cache efficiency and worker utilization when making routing decisions
-This is particularly beneficial for:
-- **Shared system prompts**: Cached across workers and reused efficiently
-- **Multi-turn conversations**: Full conversation history benefits from caching
-- **Similar queries**: Common prefixes are computed once and reused
-- **Batch processing**: Related requests can be routed to workers with shared context
-For detailed technical information about how KV routing works, see the [Router Guide](../../../docs/components/router/router-guide.md).
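The three behaviours described in this section (tracking cached blocks, preferring cache overlap, and weighing load) can be illustrated with a toy scorer. This is a sketch under stated assumptions, not Dynamo's router. The `Worker` class, the `weighted overlap - active requests` score, and the block size of 4 are all hypothetical choices made for readability.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 4  # toy value; the workers in this example use --page-size 16


def to_blocks(tokens, block_size=BLOCK_SIZE):
    """Split token IDs into full fixed-size blocks; a trailing partial block is ignored."""
    full = len(tokens) - len(tokens) % block_size
    return [tuple(tokens[i:i + block_size]) for i in range(0, full, block_size)]


def overlap_blocks(request_blocks, cached_blocks):
    """Count how many leading request blocks match the worker's cached prefix."""
    count = 0
    for wanted, cached in zip(request_blocks, cached_blocks):
        if wanted != cached:
            break
        count += 1
    return count


@dataclass
class Worker:
    name: str
    cached_blocks: list = field(default_factory=list)  # hypothetical view of the worker's cache
    active_requests: int = 0  # hypothetical load signal


def pick_worker(workers, tokens, overlap_weight=1.0):
    """Return the worker with the best (weighted overlap - load) score. Illustrative only."""
    request_blocks = to_blocks(tokens)

    def score(worker):
        overlap = overlap_blocks(request_blocks, worker.cached_blocks)
        return overlap_weight * overlap - worker.active_requests

    return max(workers, key=score)


system_prompt = list(range(8))  # 8 tokens -> 2 full blocks
warm = Worker("worker-1", cached_blocks=to_blocks(system_prompt))
cold = Worker("worker-2")
request = system_prompt + [100, 101, 102, 103]  # same prefix plus one new block

print(pick_worker([warm, cold], request).name)  # worker-1 (2 cached blocks, no load)
warm.active_requests = 5
print(pick_worker([warm, cold], request).name)  # worker-2 (load outweighs the overlap)
```

The second call shows why overlap alone is not enough: a worker with a warm cache can still lose once its load term outweighs the cache benefit.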
-## Prerequisites
-### 1. Infrastructure Services
-Ensure etcd and NATS are running on a node accessible by all workers:
-```bash
-# On the infrastructure node (can be Node 1 or a dedicated node)
-docker compose -f deploy/docker-compose.yml up -d
-```
-Note the IP address of this node - you'll need it for worker configuration.
-### 2. Software Requirements
-Install Dynamo with [SGLang](https://docs.sglang.io/) support:
-```bash
-pip install ai-dynamo[sglang]
-```
-For more information about the SGLang backend and its integration with Dynamo, see the [SGLang Backend Documentation](../../../docs/backends/sglang/README.md).
-### 3. Network Requirements
-Ensure the following ports are accessible between nodes:
- **2379**: etcd client port
- **4222**: NATS client port
- **8000**: Frontend HTTP port (only needed on frontend node)
- **${DISAGG_BOOTSTRAP_PORT}**: SGLang disaggregation bootstrap port (set in Step 1; must be reachable across nodes)
- **High-speed interconnect**: For optimal NIXL performance (InfiniBand, RoCE, or high-bandwidth Ethernet)
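The port requirements above can be checked from each worker node before launching anything. This is a minimal sketch using only the Python standard library. It assumes the `INFRA_NODE_IP` variable from Step 1 is exported, and the helper names `port_open` and `check_ports` are hypothetical. It only tests TCP reachability and says nothing about interconnect bandwidth.

```python
import os
import socket


def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_ports(host: str, ports) -> dict:
    """Map each port to whether it is reachable on the given host."""
    return {port: port_open(host, port) for port in ports}


if __name__ == "__main__":
    infra_ip = os.environ.get("INFRA_NODE_IP")
    if infra_ip:
        # 2379 = etcd client port, 4222 = NATS client port (from the list above).
        for port, ok in check_ports(infra_ip, [2379, 4222]).items():
            print(f"{infra_ip}:{port} {'reachable' if ok else 'NOT reachable'}")
```

The bootstrap port can be checked the same way once a worker is listening on it.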
-### 4. Hardware Setup
-This example assumes:
-- **Node 1**: At least 2 GPUs (for Replica 1's decode and prefill workers)
-- **Node 2**: At least 2 GPUs (for Replica 2's decode and prefill workers)
-- **Frontend Node**: Can be on Node 1, Node 2, or a separate node (no GPU required)
-> [!NOTE]
-> You can run this example with minimal modifications on a single node with at least 4 GPUs.
-> In step 3, modify the `CUDA_VISIBLE_DEVICES` flags to `CUDA_VISIBLE_DEVICES=2`
-> for the prefill component and `CUDA_VISIBLE_DEVICES=3` for the decode component.
-## Setup Instructions
-### Step 1: Set Environment Variables
-On all nodes, set the etcd and NATS endpoints:
-```bash
-# Replace with your infrastructure node's IP
-# To find your IP address, run the following on your infrastructure node:
-# hostname -I | awk '{print $1}'
-export INFRA_NODE_IP=<INFRA_NODE_IP>
-export ETCD_ENDPOINTS=http://${INFRA_NODE_IP}:2379
-export NATS_SERVER=nats://${INFRA_NODE_IP}:4222
-export DYN_LOG=debug  # Enable debug logging to see routing decisions
-# Use a fixed, reachable port for the disaggregation bootstrap server
-# Pick any free port and ensure it's open between nodes
-export DISAGG_BOOTSTRAP_PORT=32963
-```
-### Step 2: Launch Replica 1 (Node 1)
-Open a terminal on Node 1 and launch both workers:
-```bash
-# Launch prefill worker in background
-CUDA_VISIBLE_DEVICES=0 python3 -m dynamo.sglang \
-    --model-path Qwen/Qwen3-0.6B \
-    --served-model-name Qwen/Qwen3-0.6B \
-    --page-size 16 \
-    --tp 1 \
-    --host 0.0.0.0 \
-    --trust-remote-code \
-    --skip-tokenizer-init \
-    --disaggregation-bootstrap-port ${DISAGG_BOOTSTRAP_PORT} \
-    --disaggregation-mode prefill \
-    --disaggregation-transfer-backend nixl &
-# Launch decode worker in foreground
-CUDA_VISIBLE_DEVICES=1 python3 -m dynamo.sglang \
-    --model-path Qwen/Qwen3-0.6B \
-    --served-model-name Qwen/Qwen3-0.6B \
-    --page-size 16 \
-    --tp 1 \
-    --host 0.0.0.0 \
-    --trust-remote-code \
-    --skip-tokenizer-init \
-    --disaggregation-bootstrap-port ${DISAGG_BOOTSTRAP_PORT} \
-    --disaggregation-mode decode \
-    --disaggregation-transfer-backend nixl
-```
-> [!NOTE]
->
-> - `CUDA_VISIBLE_DEVICES`: Controls which GPU each worker uses (0 and 1 for different GPUs)
-> - `--page-size 16`: Sets the KV cache block size - must be identical across all workers
-> - `--host 0.0.0.0`: Exposes the SGLang bootstrap server on all interfaces so other nodes can reach it
-> - `--disaggregation-bootstrap-port`: Uses the fixed port you set in `DISAGG_BOOTSTRAP_PORT`; ensure this port is open between nodes
-> - `--disaggregation-mode`: Separates prefill (prompt processing) from decode (token generation)
-> - `--disaggregation-transfer-backend nixl`: Enables high-speed GPU-to-GPU transfers
-> - `--skip-tokenizer-init`: Avoids duplicate tokenizer loading since the frontend handles tokenization
-### Step 3: Launch Replica 2 (Node 2)
-Open a terminal on Node 2 and launch both workers:
-```bash
-# Launch prefill worker in background
-CUDA_VISIBLE_DEVICES=0 python3 -m dynamo.sglang \
-    --model-path Qwen/Qwen3-0.6B \
-    --served-model-name Qwen/Qwen3-0.6B \
-    --page-size 16 \
-    --tp 1 \
-    --host 0.0.0.0 \
-    --trust-remote-code \
-    --skip-tokenizer-init \
-    --disaggregation-bootstrap-port ${DISAGG_BOOTSTRAP_PORT} \
-    --disaggregation-mode prefill \
-    --disaggregation-transfer-backend nixl &
-# Launch decode worker in foreground
-CUDA_VISIBLE_DEVICES=1 python3 -m dynamo.sglang \
-    --model-path Qwen/Qwen3-0.6B \
-    --served-model-name Qwen/Qwen3-0.6B \
-    --page-size 16 \
-    --tp 1 \
-    --host 0.0.0.0 \
-    --trust-remote-code \
-    --skip-tokenizer-init \
-    --disaggregation-bootstrap-port ${DISAGG_BOOTSTRAP_PORT} \
-    --disaggregation-mode decode \
-    --disaggregation-transfer-backend nixl
-```
-### Step 4: Launch Frontend with KV Routing
-Open a terminal on any node and launch the frontend:
-```bash
-# On any node (no GPU required)
-python -m dynamo.frontend \
-    --http-port 8000 \
-    --router-mode kv
-```
-Take note of the frontend IP address:
-```bash
-# On the same node you launched dynamo.frontend
-hostname -I | awk '{print $1}'
-```
-The frontend will:
-- Discover all available decode workers via etcd
-- Enable KV-aware routing for intelligent request distribution
-- Monitor worker health and adjust routing accordingly
-For more details about frontend configuration options, see the [Frontend Component Documentation](/components/src/dynamo/frontend/README.md).
-## Testing the Setup
-### Prerequisites
-Install the [OpenAI Python client](https://github.com/openai/openai-python) library:
-```bash
-pip install openai
-```
-Paste in the Dynamo Frontend IP from step 4 (or use localhost if on the same node):
-```bash
-export DYN_FRONTEND_IP=<PASTE_FRONTEND_IP_HERE>
-```
-### 1. Simple Request (New Conversation)
-Send a request to see it routed to one of the replicas:
-```python
-from openai import OpenAI
-import os
-if os.environ.get("DYN_FRONTEND_IP"):
-    frontend_ip=os.environ.get("DYN_FRONTEND_IP")
-else:
-    raise Exception("DYN_FRONTEND_IP is not set")
-client = OpenAI(
-    base_url=f"http://{frontend_ip}:8000/v1",
-    api_key="dummy"  # Not used by Dynamo, but required by OpenAI client
-)
-response = client.chat.completions.create(
-    model="Qwen/Qwen3-0.6B",
-    messages=[
-        {"role": "user", "content": "What is the capital of France?"}
-    ],
-    stream=False,
-    max_tokens=50
-)
-print(response.choices[0].message.content)
-```
-### 2. Multi-Turn Conversation (Tests KV Routing)
-Create a conversation to observe how KV routing naturally benefits multi-turn interactions:
-```python
-from openai import OpenAI
-import os
-if os.environ.get("DYN_FRONTEND_IP"):
-    frontend_ip=os.environ.get("DYN_FRONTEND_IP")
-else:
-    raise Exception("DYN_FRONTEND_IP is not set")
-client = OpenAI(
-    base_url=f"http://{frontend_ip}:8000/v1",
-    api_key="dummy"  # Not used by Dynamo, but required by OpenAI client
-)
-# First turn - establishes context
-messages = [
-    {"role": "system", "content": "You are a helpful assistant."},
-    {"role": "user", "content": "My name is Alice. Please remember it."}
-]
-response1 = client.chat.completions.create(
-    model="Qwen/Qwen3-0.6B",
-    messages=messages,
-    stream=False,
-    max_tokens=50
-)
-print("First response:", response1.choices[0].message.content)
-# Add the assistant's response to conversation history
-messages.append({"role": "assistant", "content": response1.choices[0].message.content})
-# Second turn - includes the full conversation history
-# KV routing will likely route this to the same worker due to shared token prefix
-messages.append({"role": "user", "content": "What is my name?"})
-response2 = client.chat.completions.create(
-    model="Qwen/Qwen3-0.6B",
-    messages=messages,
-    stream=False,
-    max_tokens=50
-)
-print("Second response:", response2.choices[0].message.content)
-```
-### 3. Load Distribution Test
-Send multiple new conversations to see them distributed across replicas:
-```python
-import asyncio
-from openai import AsyncOpenAI
-import os
-if os.environ.get("DYN_FRONTEND_IP"):
-    frontend_ip=os.environ.get("DYN_FRONTEND_IP")
-else:
-    raise Exception("DYN_FRONTEND_IP is not set")
-async def send_request(client, i):
-    """Send a single request and return the response"""
-    try:
-        response = await client.chat.completions.create(
-            model="Qwen/Qwen3-0.6B",
-            messages=[
-                {"role": "user", "content": f"Count to {i}"}
-            ],
-            stream=False,
-            max_tokens=20
-        )
-        return f"Request {i}: {response.choices[0].message.content}"
-    except Exception as e:
-        return f"Request {i} failed: {e}"
-async def load_test():
-    """Send 10 requests in parallel to test load distribution"""
-    client = AsyncOpenAI(
-        base_url=f"http://{frontend_ip}:8000/v1",
-        api_key="dummy"
-    )
-    # Send 10 requests in parallel
-    tasks = [send_request(client, i) for i in range(1, 11)]
-    results = await asyncio.gather(*tasks)
-    for result in results:
-        print(result)
-# Run the load test
-if __name__ == "__main__":
-    asyncio.run(load_test())
-```
-## Monitoring KV Routing
-With `DYN_LOG=debug`, you can observe KV routing decisions in the logs:
-```
-[DEBUG] KV overlap scores: {worker-1: 15 blocks, worker-2: 8 blocks}
-[DEBUG] Selected worker-1 (best overlap: 15 blocks)
-[DEBUG] Cache hit rate: 75% for this request
-```
-### Alternative Routing Modes
-While this example demonstrates KV-aware routing for optimal cache utilization, Dynamo also supports simpler routing strategies:
-- **KV-Aware** (recommended): Routes based on cache overlap across all workers
-- **Round-Robin**: Distributes requests evenly across workers in sequence
-- **Random**: Randomly selects workers for each request
-```bash
-# Example: Use round-robin routing instead of KV routing
-python -m dynamo.frontend \
-    --http-port 8000 \
-    --router-mode round-robin
-```
-However, for maximum performance with shared prefixes and multi-turn conversations, KV routing provides significant advantages by minimizing redundant computation.
-## Monitoring and Debugging
-### Check Worker Registration
-Verify all workers are properly registered:
-```bash
-etcdctl --endpoints=$ETCD_ENDPOINTS get --prefix /dynamo/workers/
-```
-### Monitor Routing Decisions
-With `DYN_LOG=debug`, the frontend logs show routing decisions:
-```
-[DEBUG] KV overlap scores: {prefill-worker-1: 15 blocks, prefill-worker-2: 8 blocks}
-[DEBUG] Selected prefill-worker-1 (best overlap: 15 blocks)
-[DEBUG] KV overlap scores: {decode-worker-1: 12 blocks, decode-worker-2: 18 blocks}
-[DEBUG] Selected decode-worker-2 (best overlap: 18 blocks)
-[DEBUG] Worker decode-worker-1 unhealthy, rerouting -> decode-worker-2
-```
-### Health Checks
-Check worker health status:
-```bash
-curl http://${DYN_FRONTEND_IP}:8000/health
-```
-## Troubleshooting
-### Workers Not Discovering Each Other
-1. Verify etcd connectivity from all nodes:
-   ```bash
-   etcdctl --endpoints=$ETCD_ENDPOINTS endpoint health
-   ```
-2. Check NATS connectivity:
-   ```bash
-   nats --server=$NATS_SERVER server check connection
-   ```
-### NIXL Transfer Failures
-1. Ensure GPUs can communicate across nodes
-2. Check InfiniBand/RoCE configuration if using high-speed interconnect
-3. Verify CUDA IPC is enabled for optimal performance
-### Routing Not Working
-1. Confirm frontend is started with `--router-mode kv`
-2. Check that all workers are properly registered in etcd
-3. Verify workers are publishing KV events
-4. Check logs for overlap scores - if all zeros, cache tracking may not be working
-5. Ensure NATS is functioning for KV event distribution
-## Advanced Configuration
-For production deployments, you can fine-tune KV routing behavior:
-```bash
-# --kv-overlap-score-weight: weight for cache overlap scoring
-# --router-temperature: temperature for probabilistic routing (0 = deterministic)
-python -m dynamo.frontend \
-    --http-port 8000 \
-    --router-mode kv \
-    --kv-overlap-score-weight 1.0 \
-    --router-temperature 0.0
-```
-For more advanced configuration options including custom worker selection, block size tuning, and alternative indexing strategies, see the [Router Guide](../../../docs/components/router/router-guide.md).
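The note that a temperature of 0 makes routing deterministic can be illustrated with a generic softmax sampler. This is a sketch under assumptions and not Dynamo's implementation: the README does not state the formula the router uses, so the softmax over raw scores and the function name `choose_worker` are hypothetical.

```python
import math
import random


def choose_worker(scores, temperature, rng=random):
    """Pick a worker name from {name: score}; higher scores are better."""
    if temperature <= 0.0:
        # Deterministic: always take the highest-scoring worker.
        return max(scores, key=scores.get)
    best = max(scores.values())
    # Softmax weights; subtracting the best score keeps exp() numerically stable.
    weights = [math.exp((s - best) / temperature) for s in scores.values()]
    return rng.choices(list(scores), weights=weights, k=1)[0]


scores = {"worker-1": 15.0, "worker-2": 8.0}  # e.g. cache overlap in blocks
print(choose_worker(scores, temperature=0.0))  # worker-1

rng = random.Random(0)  # seeded so repeated runs sample the same sequence
picks = [choose_worker(scores, temperature=5.0, rng=rng) for _ in range(1000)]
print(picks.count("worker-1"), picks.count("worker-2"))
```

With a positive temperature the lower-scoring worker still receives a share of the traffic. That trades some cache reuse for a more even spread of requests.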
-## Cleanup
-Stop all components in reverse order:
-1. Stop Frontend (Ctrl+C in the frontend terminal)
-2. Stop workers on each node:
-   - On Node 1: Press Ctrl+C in the terminal (this stops the decode worker)
-   - On Node 2: Press Ctrl+C in the terminal (this stops the decode worker)
-   - To stop the background prefill workers, use one of these methods:
-     ```bash
-     # Method 1: Kill background jobs in the same terminal
-     jobs           # See background jobs
-     kill %1        # Kill the background prefill worker
-     # Method 2: Close the terminal entirely (sends SIGHUP to background processes)
-     exit
-     # Method 3: Kill by process name (from any terminal)
-     pkill -f "dynamo.sglang.*prefill"
-     ```
-3. Stop infrastructure services:
-   ```bash
-   docker compose -f deploy/docker-compose.yml down
-   ```
-## Next Steps
-- **Scale Up**: Add more replicas by repeating Steps 2-3 on additional nodes
-- **High Availability**: Run multiple frontend instances with a load balancer
-- **Monitoring**: Deploy Prometheus and Grafana for production monitoring
-- **Optimization**: Tune worker configurations based on workload patterns
-- **Cache Analysis**: Use SGLang's built-in cache statistics to optimize your workloads
--- a/examples/basics/multinode/trtllm/README.md
+++ b/examples/basics/multinode/trtllm/README.md
-<!--
-SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-SPDX-License-Identifier: Apache-2.0
-Licensed under the Apache License, Version 2.0 (the "License");
-you may not use this file except in compliance with the License.
-You may obtain a copy of the License at
-http://www.apache.org/licenses/LICENSE-2.0
-Unless required by applicable law or agreed to in writing, software
-distributed under the License is distributed on an "AS IS" BASIS,
-WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-See the License for the specific language governing permissions and
-limitations under the License.
-->
-# Example: Multi-node TRTLLM Workers with Dynamo on Slurm
-See [here](/docs/backends/trtllm/multinode) for how to set up this example.
--- a/examples/basics/multinode/trtllm/srun_aggregated.sh
+++ b/examples/basics/multinode/trtllm/srun_aggregated.sh
-#!/bin/bash
-# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: Apache-2.0
-# This is one of the only variables that must be set currently, most of the rest may
-# just work out of the box if following the steps in the README.
-IMAGE="${IMAGE:-""}"
-# Set to mount current host directory to /mnt inside the container as an example,
-# but you may freely customize the mounts based on your cluster. A common practice
-# is to mount paths to NFS storage for common scripts, model weights, etc.
-# NOTE: This can be a comma separated list of multiple mounts as well.
-DEFAULT_MOUNT="${PWD}/../../../../:/mnt"
-MOUNTS="${MOUNTS:-${DEFAULT_MOUNT}}"
-# Example values, assuming 4 nodes with 4 GPUs on each node, such as 4xGB200 nodes.
-# For 8xH100 nodes as an example, you may set this to 2 nodes x 8 gpus/node instead.
-NUM_NODES=${NUM_NODES:-4}
-NUM_GPUS_PER_NODE=${NUM_GPUS_PER_NODE:-4}
-export ENGINE_CONFIG="${ENGINE_CONFIG:-/mnt/examples/backends/trtllm/engine_configs/deepseek-r1/agg/wide_ep/wide_ep_agg.yaml}"
-# Automate settings of certain variables for convenience, but you are free
-# to manually set these for more control as well.
-ACCOUNT="$(sacctmgr -nP show assoc where user=$(whoami) format=account)"
-export HEAD_NODE="${SLURMD_NODENAME}"
-export HEAD_NODE_IP="$(hostname -i)"
-export ETCD_ENDPOINTS="${HEAD_NODE_IP}:2379"
-export NATS_SERVER="nats://${HEAD_NODE_IP}:4222"
-if [[ -z ${IMAGE} ]]; then
-  echo "ERROR: You need to set the IMAGE environment variable to the " \
-       "Dynamo+TRTLLM docker image or .sqsh file from 'enroot import'. " \
-       "See how to build one from source here: " \
-       "https://github.com/ai-dynamo/dynamo/tree/main/docs/backends/trtllm/README.md#build-container"
-  exit 1
-fi
-# NOTE: Output streamed to stdout for ease of understanding the example, but
-# in practice you would probably set `srun --output ... --error ...` to pipe
-# the stdout/stderr to files.
-echo "Launching frontend services in background."
-srun \
-  --mpi pmix \
-  --overlap \
-  --container-image "${IMAGE}" \
-  --container-mounts "${MOUNTS}" \
-  --verbose \
-  --label \
-  -A "${ACCOUNT}" \
-  -J "${ACCOUNT}-dynamo.trtllm" \
-  --nodelist "${HEAD_NODE}" \
-  --nodes 1 \
-  --jobid "${SLURM_JOB_ID}" \
-  /mnt/examples/basics/multinode/trtllm/start_frontend_services.sh &
-# NOTE: Output streamed to stdout for ease of understanding the example, but
-# in practice you would probably set `srun --output ... --error ...` to pipe
-# the stdout/stderr to files.
-echo "Launching multi-node worker in background."
-DISAGGREGATION_MODE="prefill_and_decode" \
-srun \
-  --mpi pmix \
-  --oversubscribe \
-  --container-image "${IMAGE}" \
-  --container-mounts "${MOUNTS}" \
-  --container-env ETCD_ENDPOINTS,NATS_SERVER,HEAD_NODE_IP,HEAD_NODE,DISAGGREGATION_MODE,ENGINE_CONFIG \
-  --verbose \
-  --label \
-  -A "${ACCOUNT}" \
-  -J "${ACCOUNT}-dynamo.trtllm" \
-  --nodes "${NUM_NODES}" \
-  --ntasks-per-node "${NUM_GPUS_PER_NODE}" \
-  --jobid "${SLURM_JOB_ID}" \
-  /mnt/examples/basics/multinode/trtllm/start_trtllm_worker.sh &
--- a/examples/basics/multinode/trtllm/srun_disaggregated.sh
+++ b/examples/basics/multinode/trtllm/srun_disaggregated.sh
-#!/bin/bash
-# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: Apache-2.0
-# This is one of the only variables that must be set currently, most of the rest may
-# just work out of the box if following the steps in the README.
-IMAGE="${IMAGE:-""}"
-# Set to mount current host directory to /mnt inside the container as an example,
-# but you may freely customize the mounts based on your cluster. A common practice
-# is to mount paths to NFS storage for common scripts, model weights, etc.
-# NOTE: This can be a comma separated list of multiple mounts as well.
-DEFAULT_MOUNT="${PWD}/../../../../:/mnt"
-MOUNTS="${MOUNTS:-${DEFAULT_MOUNT}}"
-NUM_GPUS_PER_NODE=${NUM_GPUS_PER_NODE:-4}
-NUM_PREFILL_NODES=${NUM_PREFILL_NODES:-4}
-NUM_PREFILL_WORKERS=${NUM_PREFILL_WORKERS:-1}
-PREFILL_ENGINE_CONFIG="${PREFILL_ENGINE_CONFIG:-/mnt/examples/backends/trtllm/engine_configs/deepseek-r1/disagg/wide_ep/wide_ep_prefill.yaml}"
-NUM_DECODE_NODES=${NUM_DECODE_NODES:-4}
-NUM_DECODE_WORKERS=${NUM_DECODE_WORKERS:-1}
-DECODE_ENGINE_CONFIG="${DECODE_ENGINE_CONFIG:-/mnt/examples/backends/trtllm/engine_configs/deepseek-r1/disagg/wide_ep/wide_ep_decode.yaml}"
-# Automate settings of certain variables for convenience, but you are free
-# to manually set these for more control as well.
-ACCOUNT="$(sacctmgr -nP show assoc where user=$(whoami) format=account)"
-export HEAD_NODE="${SLURMD_NODENAME}"
-export HEAD_NODE_IP="$(hostname -i)"
-export ETCD_ENDPOINTS="${HEAD_NODE_IP}:2379"
-export NATS_SERVER="nats://${HEAD_NODE_IP}:4222"
-if [[ -z ${IMAGE} ]]; then
-  echo "ERROR: You need to set the IMAGE environment variable to the " \
-       "Dynamo+TRTLLM docker image or .sqsh file from 'enroot import'. " \
-       "See how to build one from source here: " \
-       "https://github.com/ai-dynamo/dynamo/tree/main/docs/backends/trtllm/README.md#build-container"
-  exit 1
-fi
-# NOTE: Output streamed to stdout for ease of understanding the example, but
-# in practice you would probably set `srun --output ... --error ...` to pipe
-# the stdout/stderr to files.
-echo "Launching frontend services in background."
-srun \
-  --mpi pmix \
-  --overlap \
-  --container-image "${IMAGE}" \
-  --container-mounts "${MOUNTS}" \
-  --verbose \
-  --label \
-  -A "${ACCOUNT}" \
-  -J "${ACCOUNT}-dynamo.trtllm" \
-  --nodelist "${HEAD_NODE}" \
-  --nodes 1 \
-  --jobid "${SLURM_JOB_ID}" \
-  /mnt/examples/basics/multinode/trtllm/start_frontend_services.sh &
-# NOTE: Output streamed to stdout for ease of understanding the example, but
-# in practice you would probably set `srun --output ... --error ...` to pipe
-# the stdout/stderr to files.
-for ((i=1; i<=${NUM_PREFILL_WORKERS}; i++)); do
-  echo "Launching multi-node prefill worker in background."
-  DISAGGREGATION_MODE=prefill \
-  ENGINE_CONFIG=${PREFILL_ENGINE_CONFIG} \
-  srun \
-    --mpi pmix \
-    --oversubscribe \
-    --container-image "${IMAGE}" \
-    --container-mounts "${MOUNTS}" \
-    --container-env ETCD_ENDPOINTS,NATS_SERVER,HEAD_NODE_IP,HEAD_NODE,DISAGGREGATION_MODE,ENGINE_CONFIG \
-    --verbose \
-    --label \
-    -A "${ACCOUNT}" \
-    -J "${ACCOUNT}-dynamo.trtllm" \
-    --nodes "${NUM_PREFILL_NODES}" \
-    --ntasks-per-node "${NUM_GPUS_PER_NODE}" \
-    --jobid "${SLURM_JOB_ID}" \
-    /mnt/examples/basics/multinode/trtllm/start_trtllm_worker.sh &
-done
-for ((i=1; i<=${NUM_DECODE_WORKERS}; i++)); do
-  echo "Launching multi-node decode worker in background."
-  DISAGGREGATION_MODE=decode \
-  ENGINE_CONFIG=${DECODE_ENGINE_CONFIG} \
-  srun \
-    --mpi pmix \
-    --oversubscribe \
-    --container-image "${IMAGE}" \
-    --container-mounts "${MOUNTS}" \
-    --container-env ETCD_ENDPOINTS,NATS_SERVER,HEAD_NODE_IP,HEAD_NODE,DISAGGREGATION_MODE,ENGINE_CONFIG \
-    --verbose \
-    --label \
-    -A "${ACCOUNT}" \
-    -J "${ACCOUNT}-dynamo.trtllm" \
-    --nodes "${NUM_DECODE_NODES}" \
-    --ntasks-per-node "${NUM_GPUS_PER_NODE}" \
-    --jobid "${SLURM_JOB_ID}" \
-    /mnt/examples/basics/multinode/trtllm/start_trtllm_worker.sh &
-done
--- a/examples/basics/multinode/trtllm/start_frontend_services.sh
+++ b/examples/basics/multinode/trtllm/start_frontend_services.sh
-#!/bin/bash
-# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: Apache-2.0
-# Start NATS
-nats-server -js &
-# Start etcd
-etcd --listen-client-urls http://0.0.0.0:2379 --advertise-client-urls http://0.0.0.0:2379 --data-dir /tmp/etcd &
-# Wait for NATS/etcd to startup
-sleep 3
-# Start OpenAI Frontend which will dynamically discover workers when they startup
-# dynamo.frontend accepts either --http-port flag or DYN_HTTP_PORT env var (defaults to 8000)
-# NOTE: This is a blocking call.
-python3 -m dynamo.frontend
--- a/examples/basics/multinode/trtllm/start_trtllm_worker.sh
+++ b/examples/basics/multinode/trtllm/start_trtllm_worker.sh
-#!/bin/bash
-# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: Apache-2.0
-if [[ -z ${MODEL_PATH} ]]; then
-    echo "ERROR: MODEL_PATH was not set."
-    echo "ERROR: MODEL_PATH must be set to either the HuggingFace ID or locally" \
-         "downloaded path to the model weights. Since DeepSeek R1 is large, it is" \
-         "recommended to pre-download them to a shared location and provide the path."
-    exit 1
-fi
-if [[ -z ${SERVED_MODEL_NAME} ]]; then
-    echo "WARNING: SERVED_MODEL_NAME was not set. It will be derived from MODEL_PATH."
-fi
-if [[ -z ${ENGINE_CONFIG} ]]; then
-    echo "ERROR: ENGINE_CONFIG was not set."
-    echo "ERROR: ENGINE_CONFIG must be set to a valid Dynamo+TRTLLM engine config file."
-    exit 1
-fi
-EXTRA_ARGS=""
-if [[ -n ${DISAGGREGATION_MODE} ]]; then
-  EXTRA_ARGS+="--disaggregation-mode ${DISAGGREGATION_MODE} "
-fi
-# Only publish KV events if using KV-aware routing (not needed for round-robin)
-if [[ -n ${PUBLISH_KV_EVENTS} ]] && [[ ${PUBLISH_KV_EVENTS} == "true" ]]; then
-  EXTRA_ARGS+="--publish-events-and-metrics "
-fi
-if [[ -n ${MODALITY} ]]; then
-  EXTRA_ARGS+="--modality ${MODALITY} "
-fi
-trtllm-llmapi-launch \
-  python3 -m dynamo.trtllm \
-    --model-path "${MODEL_PATH}" \
-    --served-model-name "${SERVED_MODEL_NAME}" \
-    --extra-engine-args "${ENGINE_CONFIG}" \
-    ${EXTRA_ARGS}
\ No newline at end of file
--- a/examples/basics/quickstart/README.md
+++ b/examples/basics/quickstart/README.md
-# Quickstart
-This is a simple example showing how you can quickly get started deploying Large Language Models with Dynamo.
-## Prerequisites
-Before running this example, ensure you have the following services running:
-- **etcd**: A distributed key-value store used for service discovery and metadata storage
-- **NATS**: A high-performance message broker for inter-component communication
-You can start these services using Docker Compose:
-```bash
-docker compose -f deploy/docker-compose.yml up -d
-```
-## Components
-- [Frontend](/components/src/dynamo/frontend/README.md) - A built-in component that launches an OpenAI-compliant HTTP server, a pre-processor, and a router in a single process
-- [vLLM Backend](/docs/backends/vllm/README.md) - A built-in component that runs vLLM within the Dynamo runtime
-```mermaid
---
-title: Request Flow
---
-flowchart TD
-    A["Users/Clients<br/>(HTTP)"] --> B["Frontend<br/>HTTP API endpoint<br/>(OpenAI Style)"]
-    B --> C["NATS Message Broker<br/>(Inter-component communication)"]
-    C --> D["vLLM Backend<br/>(NATS subscriber)"]
-    D --> C
-    C --> B
-    B --> A
-```
-## Instructions
-There are three steps to deploy and use an LLM with Dynamo.
-### 1. Launch Engine
-**Open a new terminal** and run:
-```bash
-python -m dynamo.vllm --model Qwen/Qwen3-0.6B
-```
-Leave this terminal running - it will show vLLM Backend logs.
-### 2. Launch Frontend
-**Open another terminal** and interact with the deployed engine using the built-in frontend component. You have two options:
-1. Interactive Command Line Interface
-  ```bash
-  python -m dynamo.frontend --interactive
-  ```
-2. HTTP Server
-  ```bash
-  python -m dynamo.frontend --http-port 8000
-  ```
-Leave this terminal running as well - it will show Frontend logs.
-### 3. Send Requests
-If you launched the frontend in `interactive` mode, simply start typing and hit `Enter` to have an interactive chat with your LLM.
-If you launched the frontend in HTTP mode, you can send requests via `curl` or any OpenAI-compatible client program or library.
-```bash
-curl -X POST http://localhost:8000/v1/chat/completions \
-  -H 'Content-Type: application/json' \
-  -d '{
-    "model": "Qwen/Qwen3-0.6B",
-    "messages": [
-      { "role": "user", "content": "Tell me a story about a brave cat" }
-    ],
-    "stream": false,
-    "max_tokens": 1028
-  }'
-```
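The same request can also be built from Python using only the standard library. The following is a minimal sketch, assuming the frontend is listening on `localhost:8000` as in the `curl` example; the helper names `build_chat_request` and `send_chat_request` are hypothetical and not part of Dynamo, and the response is assumed to follow the OpenAI chat completions schema.

```python
import json
import urllib.request

# Values copied from the curl example; adjust them if your frontend differs.
FRONTEND_URL = "http://localhost:8000/v1/chat/completions"
MODEL = "Qwen/Qwen3-0.6B"


def build_chat_request(prompt: str, max_tokens: int = 1028) -> urllib.request.Request:
    """Build (but do not send) a chat completion request (hypothetical helper)."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        FRONTEND_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat_request(prompt: str) -> dict:
    """Send the request and return the decoded JSON body (hypothetical helper)."""
    with urllib.request.urlopen(build_chat_request(prompt)) as response:
        return json.load(response)
```

With the frontend running, `send_chat_request("Tell me a story about a brave cat")` should return the parsed response; under the assumed schema the generated text would be at `["choices"][0]["message"]["content"]`.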
-## Cleanup
-When you're done with the quickstart example, follow these steps to clean up:
-### 1. Stop Dynamo Components
-In each terminal where you started Dynamo components, press `Ctrl+C` to stop them:
-- Stop the vLLM Backend (terminal from step 1)
-- Stop the Frontend (terminal from step 2)
-### 2. Stop Infrastructure Services
-If you don't plan to run any more examples, stop the etcd and NATS services that were started with Docker Compose:
-```bash
-docker compose -f deploy/docker-compose.yml down
-```
-This will stop and remove the containers for etcd and NATS.
-## Understand
-### What's Happening Under the Hood
-When you run the two commands above, here's what Dynamo does to spin up the necessary processes and connect your HTTP requests to the vLLM Backend:
-### 1. Service Registration and Discovery
-#### DistributedRuntime Setup
-At startup, each Dynamo component (vLLM Backend, Frontend) connects to the `DistributedRuntime`, which involves creating connections to two critical infrastructure services:
-- **etcd**: A distributed key-value store used for service discovery and metadata storage
-- **NATS**: A high-performance message broker for inter-component communication
-#### Component Registration
-When the vLLM Backend starts up, it registers itself as a `component` in etcd with one or more `endpoints`.
-This registration includes each endpoint's [NATS subject](https://docs.nats.io/nats-concepts/subjects) for communication and is tied to a `lease` that automatically expires if the component goes offline.
-<details>
-<summary> Inspecting the Component Registry </summary>
-If you want to find out more about the internal organization of components in Dynamo, you can inspect the contents of `etcd` using the [`etcdctl` command line tool](https://etcd.io/docs/latest/dev-guide/interacting_v3/). For this example, you can try running
-```bash
-etcdctl get "instances" --prefix
-```
-which will show you each registered endpoint, along with its associated NATS subject. Note that the specific etcd and NATS info is internal and always subject to change -- in future examples we'll show how to use the `DistributedRuntime` itself to communicate across components.
-</details>
-#### Frontend Discovery
-When the Frontend starts, it doesn't receive an explicit pointer to the vLLM Backend component. Instead, it continuously watches etcd for registered models, automatically discovering the vLLM Backend component and its endpoints once the component becomes available.
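To make the lease idea concrete, here is a toy, in-memory model of lease-backed registration and discovery. It is only a sketch under stated assumptions: `LeaseRegistry`, its methods, the key layout, and the subject string are all hypothetical, and the code uses neither etcd nor Dynamo's actual API.

```python
import time
from typing import Callable, Dict


class LeaseRegistry:
    """Toy stand-in for lease-backed registration (not etcd's or Dynamo's API)."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._expires_at: Dict[str, float] = {}  # key -> lease expiry time
        self._subjects: Dict[str, str] = {}      # key -> hypothetical NATS subject

    def register(self, key: str, subject: str, ttl: float) -> None:
        """Register an endpoint under a lease that lasts `ttl` seconds."""
        self._subjects[key] = subject
        self._expires_at[key] = self._clock() + ttl

    def keep_alive(self, key: str, ttl: float) -> None:
        """Renew the lease; a live component would call this periodically."""
        if key in self._expires_at:
            self._expires_at[key] = self._clock() + ttl

    def live_endpoints(self, prefix: str) -> Dict[str, str]:
        """Return endpoints under `prefix` whose lease has not expired."""
        now = self._clock()
        return {
            key: subject
            for key, subject in self._subjects.items()
            if key.startswith(prefix) and self._expires_at[key] > now
        }
```

A component that stops calling `keep_alive` simply disappears from `live_endpoints` once its lease runs out, which is the behavior a watcher such as the Frontend relies on to notice that a backend has gone offline.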
-### 2. Request Flow and NATS Messaging
-When you send an HTTP request to the Frontend:
-1. **Request Packaging**: The Frontend wraps your HTTP request in a standardized internal format with routing metadata
-2. **NATS Subject Resolution**: Using the endpoints discovered in etcd, the Frontend determines the appropriate NATS subject
-3. **Message Dispatch**: The request is published to the discovered NATS subject, where the target vLLM Backend picks it up
-4. **Response Streaming**: The vLLM Backend executes the request and streams responses back through NATS, which the Frontend converts back to HTTP
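The four steps above can be sketched with a toy in-process broker. This is an illustration only: `ToyBroker`, `toy_backend`, `handle_http_request`, the message layout, and the subject name are hypothetical, and a real deployment uses NATS and a real inference engine rather than these stand-ins.

```python
from typing import Callable, Dict, Iterator

Handler = Callable[[dict], Iterator[str]]


class ToyBroker:
    """In-process stand-in for a message broker (not the NATS client API)."""

    def __init__(self) -> None:
        self._handlers: Dict[str, Handler] = {}

    def subscribe(self, subject: str, handler: Handler) -> None:
        self._handlers[subject] = handler

    def request(self, subject: str, message: dict) -> Iterator[str]:
        # Deliver the message to the subscriber and return its response stream.
        return self._handlers[subject](message)


def toy_backend(message: dict) -> Iterator[str]:
    """Pretend engine: streams the prompt back one upper-cased word at a time."""
    for word in message["payload"]["prompt"].split():
        yield word.upper()


def handle_http_request(broker: ToyBroker, endpoints: Dict[str, str],
                        model: str, prompt: str) -> str:
    # 1. Request packaging: wrap the request with routing metadata.
    message = {"routing": {"model": model}, "payload": {"prompt": prompt}}
    # 2. Subject resolution: look up the subject discovered for this model.
    subject = endpoints[model]
    # 3. Message dispatch: send the message to that subject.
    chunks = broker.request(subject, message)
    # 4. Response streaming: collect the streamed chunks into one HTTP body.
    return " ".join(chunks)
```

For example, after `broker.subscribe("toy.subject", toy_backend)`, calling `handle_http_request(broker, {"Qwen/Qwen3-0.6B": "toy.subject"}, "Qwen/Qwen3-0.6B", "brave cat")` returns `"BRAVE CAT"`.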
-### 3. Network-Transparent Operation
-One of Dynamo's key strengths is that this entire system works seamlessly whether components are:
-- Running on the same machine (like in this quickstart)
-- Distributed across multiple nodes in a cluster
-- Deployed in different availability zones
-The same two commands work in all scenarios, as long as all components can connect with the `DistributedRuntime` - Dynamo handles the networking complexity automatically.