> ⚠️ **Experimental Feature**: ChReK is currently in **beta/preview**. It requires privileged mode for restore operations, which may not be suitable for all production environments. Review the [security implications](#security-considerations) before deploying.
This guide explains how to use **ChReK** (Checkpoint/Restore for Kubernetes) as a standalone component without deploying the full Dynamo platform. This is useful if you want to add checkpoint/restore capabilities to your own GPU workloads.
ChReK provides a convenient `placeholder` target in its Dockerfile that automatically injects checkpoint/restore capabilities into your existing container images.
### Quick Start: Using the Placeholder Target (Recommended)
```bash
cd deploy/chrek
# Define your images
export BASE_IMAGE="your-app:latest"  # Your existing application image
```
The scripts in the `examples/<backend>/launch` folder like [agg.sh](../../../examples/backends/vllm/launch/agg.sh) demonstrate how you can serve your models locally.
The corresponding YAML files like [agg.yaml](../../../examples/backends/vllm/deploy/agg.yaml) show you how you could create a Kubernetes deployment for your inference graph.
This guide explains how to create your own deployment files.
## Step 1: Choose Your Architecture Pattern
Before choosing a template, understand the different architecture patterns:
### Aggregated Serving (agg.yaml)
**Pattern**: Prefill and decode on the same GPU in a single process.
**Recommended for**:
- Small to medium models (under 70B parameters)
- Development and testing
- Low to moderate traffic
- Deployments where simplicity matters more than maximum throughput
**Tradeoffs**:
- Simpler setup and debugging
- Lower operational complexity
- GPU utilization may not be optimal (prefill and decode compete for resources)
- Lower throughput ceiling compared to disaggregated
Choose the architecture pattern that best fits your use case and use the corresponding file as your template.
For example, when using the `vLLM` backend:
- **Development / Testing**: Use [`agg.yaml`](../../../examples/backends/vllm/deploy/agg.yaml) as the base configuration.
- **Production with Load Balancing**: Use [`agg_router.yaml`](../../../examples/backends/vllm/deploy/agg_router.yaml) to enable scalable, load-balanced inference.
- **High Performance / Disaggregated Deployment**: Use [`disagg_router.yaml`](../../../examples/backends/vllm/deploy/disagg_router.yaml) for maximum throughput and modular scalability.
## Step 2: Customize the Template
You can run the Frontend on one machine, for example a CPU node, and the worker on a different machine (a GPU node).
The Frontend serves as a framework-agnostic HTTP entry point and is unlikely to need many changes.
It serves the following roles:
1. OpenAI-Compatible HTTP Server
   * Provides `/v1/chat/completions` endpoint
   * Handles HTTP request/response formatting
   * Supports streaming responses
   * Validates incoming requests
2. Service Discovery and Routing
   * Auto-discovers backend workers via etcd
   * Routes requests to the appropriate Processor/Worker components
   * Handles load balancing between multiple workers
3. Request Preprocessing
   * Initial request validation
   * Model name verification
   * Request format standardization
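
To make the first role concrete, here is a minimal Python sketch that builds an OpenAI-style chat completion request for the `/v1/chat/completions` endpoint using only the standard library. The base URL `http://localhost:8000` and the helper name `build_chat_request` are assumptions for illustration; substitute the address where your Frontend is actually exposed.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, prompt: str,
                       stream: bool = False) -> urllib.request.Request:
    """Build (but do not send) a POST request for /v1/chat/completions."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": stream,
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical address; replace with your Frontend's host and port.
request = build_chat_request("http://localhost:8000", "Qwen/Qwen3-0.6B", "Hello!")
# To send it against a running Frontend: urllib.request.urlopen(request)
```

The request is only constructed here, so the snippet can be run without a live deployment.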
You should then pick a worker and specialize the config. For example:
```yaml
VllmWorker:  # vLLM-specific config
  enforce-eager: true
  enable-prefix-caching: true
SglangWorker:  # SGLang-specific config
  router-mode: kv
  disagg-mode: true
TrtllmWorker:  # TensorRT-LLM-specific config
  engine-config: ./engine.yaml
  kv-cache-transfer: ucx
```
Here's a template structure based on the examples:
```yaml
- --is-prefill-worker  # For disaggregated prefill workers
```
### Image Pull Secret Configuration
#### Automatic Discovery and Injection
By default, the Dynamo operator automatically discovers and injects image pull secrets based on container registry host matching. The operator scans Docker config secrets within the same namespace and matches their registry hostnames to the container image URLs, automatically injecting the appropriate secrets into the pod's `imagePullSecrets`.
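
The host-matching idea can be sketched in a few lines of Python. This is an illustration of the concept only, not the operator's implementation: the function names and the in-memory `secrets` mapping (secret name to parsed Docker config) are hypothetical, and special cases such as Docker Hub's legacy `https://index.docker.io/v1/` key are not handled.

```python
from urllib.parse import urlparse


def image_registry_host(image: str) -> str:
    """Return the registry host of an image reference, defaulting to docker.io."""
    first, _, rest = image.partition("/")
    # The first path component is a registry only if it looks like a host name.
    if rest and ("." in first or ":" in first or first == "localhost"):
        return first
    return "docker.io"


def normalize_registry(key: str) -> str:
    """Reduce a Docker config `auths` key (bare host or URL) to host[:port]."""
    parsed = urlparse(key if "://" in key else "//" + key)
    return parsed.netloc


def matching_secrets(image: str, secrets: dict) -> list:
    """Return the names of secrets whose Docker config lists the image's registry."""
    host = image_registry_host(image)
    return sorted(
        name
        for name, docker_config in secrets.items()
        if any(normalize_registry(k) == host for k in docker_config.get("auths", {}))
    )


# Hypothetical secrets found in the namespace.
secrets = {
    "nvcr-secret": {"auths": {"nvcr.io": {}}},
    "ghcr-secret": {"auths": {"https://ghcr.io": {}}},
}
print(matching_secrets("nvcr.io/nvidia/example:1.0", secrets))
```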
**Disabling Automatic Discovery:**
To disable this behavior for a component and manually control image pull secrets:
This automatic discovery eliminates the need to manually configure image pull secrets for each deployment.
## Step 6: Deploy LoRA Adapters (Optional)
After your base model deployment is running, you can deploy LoRA adapters using the `DynamoModel` custom resource. This allows you to fine-tune and extend your models without modifying the base deployment.
To add a LoRA adapter to your deployment, link it using `modelRef` in your worker configuration:
```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: my-deployment
spec:
  services:
    Worker:
      modelRef:
        name: Qwen/Qwen3-0.6B  # Base model identifier
      componentType: worker
      # ... rest of worker config
```
Then create a `DynamoModel` resource for your LoRA:
```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoModel
metadata:
  name: my-lora
spec:
  modelName: my-custom-lora
  baseModelName: Qwen/Qwen3-0.6B  # Must match modelRef.name above
  modelType: lora
  source:
    uri: s3://my-bucket/loras/my-lora
```
**For complete details on managing models and LoRA adapters, see:**
📖 **[Managing Models with DynamoModel Guide](./dynamomodel-guide.md)**
`DynamoModel` is a Kubernetes Custom Resource that represents a machine learning model deployed on Dynamo. It enables you to:
- **Deploy LoRA adapters** on top of running base models
- **Track model endpoints** and their readiness across your cluster
- **Manage model lifecycle** declaratively with Kubernetes
DynamoModel works alongside `DynamoGraphDeployment` (DGD) or `DynamoComponentDeployment` (DCD) resources. While DGD/DCD deploy the inference infrastructure (pods, services), DynamoModel handles model-specific operations like loading LoRA adapters.
## Quick Start
### Prerequisites
Before creating a DynamoModel, you need:
1. A running `DynamoGraphDeployment` or `DynamoComponentDeployment`
2. Components configured with `modelRef` pointing to your base model
3. Pods that are ready and serving your base model
For complete setup including DGD configuration, see [Integration with DynamoGraphDeployment](#integration-with-dynamographdeployment).
### Deploy a LoRA Adapter
**1. Create your DynamoModel:**
```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoModel
metadata:
  name: my-lora
  namespace: dynamo-system
spec:
  modelName: my-custom-lora
  baseModelName: Qwen/Qwen3-0.6B  # Must match modelRef.name in your DGD
  modelType: lora
  source:
    uri: s3://my-bucket/loras/my-lora
```
**2. Apply and verify:**
```bash
# Apply the DynamoModel
kubectl apply -f my-lora.yaml
# Check status
kubectl get dynamomodel my-lora
```
**Expected output:**
```
NAME      TOTAL   READY   AGE
my-lora   2       2       30s
```
That's it! The operator automatically discovers endpoints and loads the LoRA.
For detailed status monitoring, see [Monitoring & Operations](#monitoring--operations).
## Understanding DynamoModel
### Model Types
DynamoModel supports three model types:
| Type | Description | Use Case |
|------|-------------|----------|
| **`base`** | Reference to an existing base model | Tracking endpoints for a base model (default) |
| **`lora`** | LoRA adapter that extends a base model | Deploy fine-tuned adapters on existing models |
| **`adapter`** | Generic model adapter | Future extensibility for other adapter types |
Most users will use **`lora`** to deploy fine-tuned models on top of their base model deployments.
### How It Works
When you create a DynamoModel, the operator:
1. **Discovers endpoints**: Finds all pods running your `baseModelName` (by matching `modelRef.name` in DGD/DCD)
2. **Creates service**: Automatically creates a Kubernetes Service to track these pods
3. **Loads LoRA**: Calls the LoRA load API on each endpoint (for `lora` type)
4. **Updates status**: Reports which endpoints are ready
**Key linkage:**
```yaml
# DGD modelRef.name ↔ DynamoModel baseModelName must match
Worker:
  modelRef:
    name: Qwen/Qwen3-0.6B
---
spec:
  baseModelName: Qwen/Qwen3-0.6B
```
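
The matching rule can be illustrated with a small Python sketch that operates on manifests already parsed into dictionaries. It only mirrors the name comparison described above; the helper `services_for_base_model` is hypothetical and the operator's actual discovery logic is not shown in this guide.

```python
def services_for_base_model(dgd: dict, base_model_name: str) -> list:
    """Return DGD service names whose modelRef.name equals the base model name."""
    services = dgd.get("spec", {}).get("services", {})
    return [
        name
        for name, service in services.items()
        if service.get("modelRef", {}).get("name") == base_model_name
    ]


# A DGD and a DynamoModel as they might look after YAML parsing.
dgd = {
    "spec": {
        "services": {
            "Frontend": {"componentType": "frontend"},
            "Worker": {"componentType": "worker",
                       "modelRef": {"name": "Qwen/Qwen3-0.6B"}},
        }
    }
}
dynamo_model = {"spec": {"modelName": "my-custom-lora",
                         "baseModelName": "Qwen/Qwen3-0.6B"}}

print(services_for_base_model(dgd, dynamo_model["spec"]["baseModelName"]))
```

A mismatch in the two names, including a difference in case, yields an empty list, which corresponds to the DynamoModel finding no endpoints.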
## Configuration Overview
DynamoModel requires just a few key fields to deploy a model or adapter:
| Field | Required | Purpose | Example |
|-------|----------|---------|---------|
| `modelName` | Yes | Model identifier | `my-custom-lora` |
<!--
SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
# Minikube Setup Guide
Don't have a Kubernetes cluster? No problem! You can set up a local development environment using Minikube. This guide walks through setting up everything you need to run Dynamo Kubernetes Platform locally.
## 1. Install Minikube
First things first! Start by installing Minikube. Follow the official [Minikube installation guide](https://minikube.sigs.k8s.io/docs/start/) for your operating system.
## 2. Configure GPU Support (Optional)
Planning to use GPU-accelerated workloads? You'll need to configure GPU support in Minikube. Follow the [Minikube GPU guide](https://minikube.sigs.k8s.io/docs/tutorials/nvidia/) to set up NVIDIA GPU support before proceeding.
> [!TIP]
> Make sure to configure GPU support before starting Minikube if you plan to use GPU workloads!
## 3. Start Minikube
Time to launch your local cluster!
```bash
# Start Minikube with GPU support (if configured)
minikube start --driver docker --container-runtime docker --gpus all --memory=16000mb --cpus=8
```
Once your local environment is set up, you can proceed with the [Dynamo Kubernetes Platform installation guide](../installation_guide.md) to deploy the platform to your local cluster.
This guide explains how to deploy Dynamo workloads across multiple nodes. Multinode deployments enable you to scale compute-intensive LLM workloads across multiple physical machines, maximizing GPU utilization and supporting larger models.
## Overview
Dynamo supports multinode deployments through the `multinode` section in resource specifications. This allows you to:
- Distribute workloads across multiple physical nodes
- Scale GPU resources beyond a single machine
- Support large models requiring extensive tensor parallelism
- Achieve high availability and fault tolerance
## Basic requirements
- **Kubernetes Cluster**: Version 1.24 or later
- **GPU Nodes**: Multiple nodes with NVIDIA GPUs
- **High-Speed Networking**: InfiniBand, RoCE, or high-bandwidth Ethernet (recommended for optimal performance)
### Advanced Multinode Orchestration
#### Using Grove (default)
For sophisticated multinode deployments, Dynamo integrates with advanced Kubernetes orchestration systems:
- **[Grove](https://github.com/NVIDIA/grove)**: Network topology-aware gang scheduling and auto-scaling for AI workloads
- **[KAI-Scheduler](https://github.com/NVIDIA/KAI-Scheduler)**: Kubernetes native scheduler optimized for AI workloads at scale
These systems provide enhanced scheduling capabilities including topology-aware placement, gang scheduling, and coordinated auto-scaling across multiple nodes.
**Features Enabled with Grove:**
- Declarative composition of AI workloads
- Multi-level horizontal auto-scaling
- Custom startup ordering for components
- Resource-aware rolling updates
[KAI-Scheduler](https://github.com/NVIDIA/KAI-Scheduler) is a Kubernetes native scheduler optimized for AI workloads at large scale.
**Features Enabled with KAI-Scheduler:**
- Gang scheduling
- Network topology-aware pod placement
- AI workload-optimized scheduling algorithms
- GPU resource awareness and allocation
- Support for complex scheduling constraints
- Integration with Grove for enhanced capabilities
- Performance optimizations for large-scale deployments
##### Prerequisites
- [Grove](https://github.com/NVIDIA/grove/blob/main/docs/installation.md) installed on the cluster
- (Optional) [KAI-Scheduler](https://github.com/NVIDIA/KAI-Scheduler) installed on the cluster with the default queue name `dynamo` created. If no queue annotation is specified on the DGD resource, the operator uses the `dynamo` queue by default. Custom queue names can be specified via the `nvidia.com/kai-scheduler-queue` annotation, but the queue must exist in the cluster before deployment.
KAI-Scheduler is optional but recommended for advanced scheduling capabilities.
#### Using LWS and Volcano
LWS is a simple multinode deployment mechanism that allows you to deploy a workload across multiple nodes.
Volcano is a Kubernetes native scheduler optimized for AI workloads at scale. It is used in conjunction with LWS to provide gang scheduling support.
## Core Concepts
### Orchestrator Selection Algorithm
Dynamo automatically selects the best available orchestrator for multinode deployments using the following logic:
#### When Both Grove and LWS are Available:
- **Grove is selected by default** (recommended for advanced AI workloads)
- **LWS is selected** if you explicitly set the `nvidia.com/enable-grove: "false"` annotation on your DGD resource
#### When Only One Orchestrator is Available:
- The installed orchestrator (Grove or LWS) is automatically selected
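
The selection rules above can be summarized in a short Python sketch. It is a reading of this section rather than the operator's code: the function name is hypothetical, and the behavior when the annotation is set to `"false"` but only Grove is installed is an assumption (the installed orchestrator wins).

```python
GROVE_ANNOTATION = "nvidia.com/enable-grove"


def select_orchestrator(grove_available: bool, lws_available: bool,
                        annotations: dict) -> str:
    """Pick 'grove' or 'lws' following the documented selection rules."""
    # Grove counts as enabled unless the annotation is explicitly "false".
    grove_enabled = annotations.get(GROVE_ANNOTATION, "true").lower() != "false"

    if grove_available and lws_available:
        return "grove" if grove_enabled else "lws"
    if grove_available:
        return "grove"  # assumption: annotation is ignored when LWS is absent
    if lws_available:
        return "lws"
    raise RuntimeError("no multinode orchestrator (Grove or LWS) is installed")


print(select_orchestrator(True, True, {}))
print(select_orchestrator(True, True, {GROVE_ANNOTATION: "false"}))
```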
#### Scheduler Integration:
- **With Grove**: Automatically integrates with [KAI-Scheduler](https://github.com/NVIDIA/KAI-Scheduler) when available, providing:
  - Advanced queue management via `nvidia.com/kai-scheduler-queue` annotation
  - AI-optimized scheduling policies
  - Resource-aware workload placement
- **With LWS**: Uses Volcano scheduler for gang scheduling and resource coordination
#### Configuration Examples:
**Default (Grove with KAI-Scheduler):**
```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: my-multinode-deployment
  annotations:
    nvidia.com/kai-scheduler-queue: "dynamo"
spec:
  # ... your deployment spec
```
> **Note:** The `nvidia.com/kai-scheduler-queue` annotation defaults to `"dynamo"`. If you specify a custom queue name, ensure the queue exists in your cluster before deploying. You can verify available queues with `kubectl get queues`.
**Force LWS usage:**
```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: my-multinode-deployment
  annotations:
    nvidia.com/enable-grove: "false"
spec:
  # ... your deployment spec
```
### The `multinode` Section
The `multinode` section in a resource specification defines how many physical nodes the workload should span:
```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: my-multinode-deployment
spec:
  # ... your deployment spec
  services:
    my-service:
      ...
      multinode:
        nodeCount: 2
      resources:
        limits:
          gpu: "2"  # 2 GPUs per node
```
### GPU Distribution
The relationship between `multinode.nodeCount` and `gpu` is multiplicative:
- **`multinode.nodeCount`**: Number of physical nodes
- **`gpu`**: Number of GPUs per node
- **Total GPUs**: `multinode.nodeCount × gpu`
**Example:**
- `multinode.nodeCount: 2` × `gpu: "4"` = 8 total GPUs (4 GPUs per node across 2 nodes)
- `multinode.nodeCount: 4` × `gpu: "8"` = 32 total GPUs (8 GPUs per node across 4 nodes)
### Tensor Parallelism Alignment
The tensor parallelism (`tp-size` or `--tp`) in your command/args must match the total number of GPUs:
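
This arithmetic is easy to check before deploying. The following is a minimal Python sketch under the rule stated above (tensor parallel size equals `nodeCount × gpu`); the function names are hypothetical and are not part of Dynamo.

```python
def total_gpus(node_count: int, gpus_per_node: int) -> int:
    """Total GPUs for a multinode service: nodeCount multiplied by GPUs per node."""
    if node_count < 1 or gpus_per_node < 1:
        raise ValueError("node_count and gpus_per_node must be positive")
    return node_count * gpus_per_node


def check_tp_alignment(tp_size: int, node_count: int, gpus_per_node: int) -> None:
    """Raise ValueError if the tensor parallel size does not match the total GPUs."""
    expected = total_gpus(node_count, gpus_per_node)
    if tp_size != expected:
        raise ValueError(
            f"tp-size is {tp_size}, but {node_count} nodes x "
            f"{gpus_per_node} GPUs per node provide {expected} GPUs"
        )


print(total_gpus(2, 4))       # 8, matching the first example above
print(total_gpus(4, 8))       # 32, matching the second example above
check_tp_alignment(8, 2, 4)   # passes silently
```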
When you deploy a multinode workload, the Dynamo operator automatically applies backend-specific configurations to enable distributed execution. Understanding these automatic modifications helps troubleshoot issues and optimize your deployments.
### vLLM Backend
For vLLM multinode deployments, the operator automatically selects and configures the appropriate distributed execution mode based on your parallelism settings:
#### Deployment Modes
The operator automatically determines the deployment mode based on your parallelism configuration:
**1. Tensor/Pipeline Parallelism Mode (Single model across nodes)**
- **When used**: When `world_size > GPUs_per_node` where `world_size = tensor_parallel_size × pipeline_parallel_size`
- **Use case**: Distributing a single model instance across multiple nodes using tensor or pipeline parallelism
The operator uses Ray for multi-node tensor/pipeline parallel deployments. Ray provides automatic placement group management and worker spawning across nodes.
- **Behavior**: Joins Ray cluster and blocks; vLLM on leader spawns Ray actors to these workers
- **Probes**: All probes (liveness, readiness, startup) are automatically removed
> **Note**: vLLM's Ray executor automatically creates a placement group and spawns workers across the cluster. The `--nnodes` flag is NOT used with Ray - it's only compatible with the `mp` backend.
**2. Data Parallel Mode (Multiple model instances across nodes)**
- **When used**: When `world_size × data_parallel_size > GPUs_per_node`
- **Use case**: Running multiple independent model instances across nodes with data parallelism (e.g., MoE models with expert parallelism)
**All Nodes (Leader and Workers):**
- **Injected Flags**:
  - `--data-parallel-address <leader-hostname>` - Address of the coordination server
  - `--data-parallel-size-local <value>` - Number of data parallel workers per node
  - `--data-parallel-rpc-port 13445` - RPC port for data parallel coordination
  - `--data-parallel-start-rank <value>` - Starting rank for this node (calculated automatically)
- **Probes**: Worker probes are removed; leader probes remain active
**Note**: The operator intelligently injects these flags into your command regardless of command structure (direct Python commands or shell wrappers).
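
The two "When used" conditions can be expressed as a small decision function. This Python sketch follows the inequalities given above; the function name, the returned labels, and the order in which the conditions are checked (tensor/pipeline parallel first) are assumptions, since this guide does not state the operator's precedence.

```python
def vllm_multinode_mode(tensor_parallel_size: int, pipeline_parallel_size: int,
                        data_parallel_size: int, gpus_per_node: int) -> str:
    """Classify a vLLM deployment using the documented world_size conditions."""
    world_size = tensor_parallel_size * pipeline_parallel_size

    # One model instance does not fit on a single node: Ray-based TP/PP mode.
    if world_size > gpus_per_node:
        return "tensor-pipeline-parallel"
    # Each instance fits on a node, but all replicas together do not.
    if world_size * data_parallel_size > gpus_per_node:
        return "data-parallel"
    return "single-node"


print(vllm_multinode_mode(16, 1, 1, gpus_per_node=8))
print(vllm_multinode_mode(1, 1, 16, gpus_per_node=8))
```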
#### Why Ray for Multi-Node TP/PP?
vLLM supports two distributed executor backends: `ray` and `mp`. For multi-node deployments:
- **Ray executor**: vLLM creates a placement group and spawns Ray actors across the cluster. Workers don't run vLLM directly - the leader's vLLM process manages everything.
- **mp executor**: Each node must run its own vLLM process with `--nnodes`, `--node-rank`, `--master-addr`, `--master-port`. This approach is more complex to orchestrate.
The Dynamo operator uses Ray because:
1. It aligns with vLLM's official multi-node documentation (see `multi-node-serving.sh`)
2. Simpler orchestration - only the leader runs vLLM, workers just need Ray agents
3. vLLM automatically handles placement group creation and worker management
#### Compilation Cache Support
When a volume mount is configured with `useAsCompilationCache: true`, the operator automatically sets:
- **`VLLM_CACHE_ROOT`**: Environment variable pointing to the cache mount point
### SGLang Backend
For SGLang multinode deployments, the operator injects distributed training parameters:
- The `node-rank` is automatically determined from the pod's stateful identity
- **Probes**: All probes (liveness, readiness, startup) are automatically removed
**Note:** The operator intelligently injects these flags regardless of your command structure (direct Python commands or shell wrappers).
### TensorRT-LLM Backend
For TensorRT-LLM multinode deployments, the operator configures MPI-based communication:
#### Leader Node
- **SSH Configuration**: Automatically sets up SSH keys and configuration from a Kubernetes secret
- **MPI Command**: Wraps your command in an `mpirun` command with:
  - Proper host list including all worker nodes
  - SSH configuration for passwordless authentication on port 2222
  - Environment variable propagation to all nodes
  - Activation of the Dynamo virtual environment
- **Probes**: All health probes remain active
#### Worker Nodes
- **SSH Daemon**: Replaces your command with SSH daemon setup and execution
  - Generates host keys in user-writable directories (non-privileged)
  - Configures SSH daemon to listen on port 2222
  - Sets up authorized keys for leader access
- **Probes**:
  - **Liveness and Startup**: Removed (workers run SSH daemon, not the main application)
  - **Readiness**: Replaced with TCP socket check on SSH port 2222
    - Initial Delay: 20 seconds
    - Period: 20 seconds
    - Timeout: 5 seconds
    - Failure Threshold: 10
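
For reference, the readiness settings listed above map onto the standard Kubernetes probe fields roughly as in this Python sketch. The dictionary is an illustration assembled from the values in this section, not output captured from the operator, and the helper function is hypothetical.

```python
# Worker readiness probe expressed with standard Kubernetes Probe field names.
worker_readiness_probe = {
    "tcpSocket": {"port": 2222},   # TCP check against the SSH daemon
    "initialDelaySeconds": 20,
    "periodSeconds": 20,
    "timeoutSeconds": 5,
    "failureThreshold": 10,
}


def seconds_until_marked_unready(probe: dict) -> int:
    """Approximate time of continuous failures before the pod is marked unready.

    Ignores per-attempt timeouts; the kubelet needs `failureThreshold`
    consecutive failures, spaced `periodSeconds` apart.
    """
    return probe["periodSeconds"] * probe["failureThreshold"]


print(seconds_until_marked_unready(worker_readiness_probe))  # 200
```

With these values, a worker whose SSH daemon stops accepting connections stays marked ready for roughly 200 seconds.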
#### Additional Configuration
- **Environment Variable**: `OMPI_MCA_orte_keep_fqdn_hostnames=1` is added to all nodes
- **SSH Volume**: Automatically mounts the SSH keypair secret (typically named `mpirun-ssh-key-<deployment-name>`)
**Important:** TensorRT-LLM requires an SSH keypair secret to be created before deployment. The secret name follows the pattern `mpirun-ssh-key-<component-name>`.
### Compilation Cache Configuration
The operator supports compilation cache volumes for backend-specific optimization:
| Backend | Support Level | Environment Variables | Default Mount Point |
To enable compilation cache, add a volume mount with `useAsCompilationCache: true` in your component specification. For vLLM, the operator will automatically configure the necessary environment variables. For other backends, volume mounts are created, but additional environment configuration may be required until upstream support is added.
## Next Steps
For additional support and examples, see the working multinode configurations in:
The Dynamo operator is a Kubernetes operator that simplifies the deployment, configuration, and lifecycle management of DynamoGraphs. It automates the reconciliation of custom resources so that the cluster converges on your desired state. This operator is ideal for users who want to manage complex deployments using declarative YAML definitions and Kubernetes-native tooling.
## Architecture
- **Operator Deployment:**
  Deployed as a Kubernetes `Deployment` in a specific namespace.
- **Controllers:**
  - `DynamoGraphDeploymentController`: Watches `DynamoGraphDeployment` CRs and orchestrates graph deployments.
  - `DynamoComponentDeploymentController`: Watches `DynamoComponentDeployment` CRs and handles individual component deployments.
  - `DynamoModelController`: Watches `DynamoModel` CRs and manages model lifecycle (e.g., loading LoRA adapters).
- **Workflow:**
  1. A custom resource is created by the user or API server.
  2. The corresponding controller detects the change and runs reconciliation.
  3. Kubernetes resources (Deployments, Services, etc.) are created or updated to match the CR spec.
  4. Status fields are updated to reflect the current state.
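
Step 3 of the workflow is the usual controller pattern of comparing desired state with actual state. The Python sketch below illustrates that comparison in general terms; it is not the Dynamo operator's code, and the resource names and the `plan_reconcile` helper are hypothetical.

```python
def plan_reconcile(desired: dict, actual: dict) -> dict:
    """Compare desired child resources with what exists and plan the actions."""
    return {
        # In the desired state but missing from the cluster.
        "create": sorted(name for name in desired if name not in actual),
        # Present, but the spec differs from the desired one.
        "update": sorted(name for name in desired
                         if name in actual and desired[name] != actual[name]),
        # Already matching; nothing to do.
        "unchanged": sorted(name for name in desired
                            if name in actual and desired[name] == actual[name]),
    }


# Desired resources derived from a CR spec versus what the cluster reports.
desired = {
    "frontend-deployment": {"replicas": 1},
    "worker-deployment": {"replicas": 2},
    "frontend-service": {"port": 8000},
}
actual = {
    "frontend-deployment": {"replicas": 1},
    "worker-deployment": {"replicas": 1},
}
print(plan_reconcile(desired, actual))
```

Running the plan repeatedly until `create` and `update` are empty is what makes reconciliation idempotent.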
## Deployment Modes
The Dynamo operator supports three deployment modes to accommodate different cluster environments and use cases:
### 1. Cluster-Wide Mode (Default)
The operator monitors and manages DynamoGraph resources across **all namespaces** in the cluster.
**When to Use:**
- You have full cluster admin access
- You want centralized management of all Dynamo workloads
- Standard production deployment on a dedicated cluster
---
### 2. Namespace-Scoped Mode
The operator monitors and manages DynamoGraph resources **only in a specific namespace**. A lease marker is created to signal the operator's presence to any cluster-wide operators.
**When to Use:**
- You're on a shared/multi-tenant cluster
- You only have namespace-level permissions
- You want to test a new operator version in isolation
- You need to avoid conflicts with other operators
---
### 3. Hybrid Mode
A **cluster-wide operator** manages most namespaces, while **one or more namespace-scoped operators** run in specific namespaces (e.g., for testing new versions). The cluster-wide operator automatically detects and excludes namespaces with namespace-scoped operators using lease markers.
**When to Use:**
- Running production workloads with a stable operator version
- Testing new operator versions in isolated namespaces without affecting production
- Gradual rollout of operator updates
- Development/staging environments on production clusters
**How It Works:**
1. The namespace-scoped operator creates a lease named `dynamo-operator-namespace-scope` in its namespace.
2. The cluster-wide operator watches for these lease markers across all namespaces.
3. The cluster-wide operator automatically excludes any namespace with a lease marker.
4. If the namespace-scoped operator stops, its lease expires (TTL: 30s by default).
5. The cluster-wide operator automatically resumes managing that namespace.
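
The exclusion logic in these steps can be modeled with a few lines of Python. The lease name and the 30-second TTL come from the steps above; the dictionary layout for a lease (`namespace`, `name`, `renew_time`) and the function names are assumptions made for this sketch.

```python
from datetime import datetime, timedelta, timezone

LEASE_NAME = "dynamo-operator-namespace-scope"
DEFAULT_TTL = timedelta(seconds=30)


def excluded_namespaces(leases: list, now: datetime,
                        ttl: timedelta = DEFAULT_TTL) -> set:
    """Namespaces holding a lease marker that has been renewed within the TTL."""
    return {
        lease["namespace"]
        for lease in leases
        if lease["name"] == LEASE_NAME and now - lease["renew_time"] <= ttl
    }


def cluster_wide_namespaces(all_namespaces: list, leases: list,
                            now: datetime) -> list:
    """Namespaces the cluster-wide operator should manage at time `now`."""
    return sorted(set(all_namespaces) - excluded_namespaces(leases, now))


now = datetime(2025, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
leases = [
    # Renewed 10s ago: the namespace-scoped operator is alive.
    {"namespace": "team-a", "name": LEASE_NAME,
     "renew_time": now - timedelta(seconds=10)},
    # Renewed 45s ago: the lease has expired, so the namespace is reclaimed.
    {"namespace": "team-b", "name": LEASE_NAME,
     "renew_time": now - timedelta(seconds=45)},
]
print(cluster_wide_namespaces(["default", "team-a", "team-b"], leases, now))
```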
- **DynamoModel**: Manages model lifecycle (e.g., loading LoRA adapters)
For the complete technical API reference for Dynamo Custom Resource Definitions, see:
**📖 [Dynamo CRD API Reference](./api_reference.md)**
For a user-focused guide on deploying and managing models with DynamoModel, see:
**📖 [Managing Models with DynamoModel Guide](./deployment/dynamomodel-guide.md)**
## Webhooks
The Dynamo Operator uses **Kubernetes admission webhooks** for real-time validation of custom resources before they are persisted to the cluster. Webhooks are **enabled by default** and ensure that invalid configurations are rejected immediately at the API server level.
**Key Features:**
- ✅ Shared certificate infrastructure across all webhook types
- ✅ Multi-operator support with lease-based coordination
- ✅ Immutability enforcement for critical fields
For complete documentation on webhooks, certificate management, and troubleshooting, see:
**📖 [Webhooks Guide](./webhooks.md)**
## Observability
The Dynamo Operator provides comprehensive observability through Prometheus metrics and Grafana dashboards. This allows you to monitor:
- **Controller Performance**: Reconciliation loop duration, success rates, and error rates by resource type
- **Webhook Activity**: Validation performance, admission rates, and denial patterns
- **Resource Inventory**: Current count of managed resources by state and namespace
- **Operational Health**: Success rates and health indicators for controllers and webhooks
### Metrics Collection
Metrics are automatically exposed on the operator's `/metrics` endpoint (port 8443 by default) and collected by Prometheus via a ServiceMonitor. The ServiceMonitor is automatically created when you install the operator via Helm (controlled by `metricsService.enabled`, which defaults to `true`).
### Grafana Dashboard
A pre-built Grafana dashboard is available for visualizing operator metrics. The dashboard includes:
- **Reconciliation Metrics**: Rate, duration (P95), and errors by resource type
- **Webhook Metrics**: Request rate, duration (P95), and denials by resource type and operation
- **Resource Inventory**: Count of DynamoGraphDeployments by state and namespace
- **Operational Health**: Success rate gauges for controllers and webhooks
For complete setup instructions and metrics reference, see:
> **Note:** For shared/multi-tenant clusters or testing scenarios, see [Deployment Modes](#deployment-modes) above for namespace-scoped and hybrid configurations.
### Building from Source
```bash
# Set environment
export NAMESPACE=dynamo-system
export DOCKER_SERVER=your-registry.com/  # your container registry
```
This section describes how to use FluxCD for GitOps-based deployment of Dynamo inference graphs. GitOps enables you to manage your Dynamo deployments declaratively using Git as the source of truth. We'll use the [aggregated vLLM example](../backends/vllm/README.md) to demonstrate the workflow.
## Prerequisites
- A Kubernetes cluster with [Dynamo Kubernetes Platform](./installation_guide.md) installed
- [FluxCD](https://fluxcd.io/flux/installation/) installed in your cluster
- A Git repository to store your deployment configurations
## Workflow Overview
The GitOps workflow for Dynamo deployments consists of three main steps:
1. Build and push the Dynamo Operator
2. Create and commit a DynamoGraphDeployment custom resource for initial deployment
3. Update the graph by building a new version and updating the CR for subsequent updates
## Step 1: Build and Push Dynamo Operator
First, follow the steps in [Install Dynamo Kubernetes Platform](./installation_guide.md).
## Step 2: Create Initial Deployment
Create a new file in your Git repository (e.g., `deployments/llm-agg.yaml`) with the following content:
```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: llm-agg
spec:
  pvcs:
    - name: vllm-model-storage
      size: 100Gi
  services:
    Frontend:
      replicas: 1
      envs:
        - name: SPECIFIC_ENV_VAR
          value: some_specific_value
    Processor:
      replicas: 1
      envs:
        - name: SPECIFIC_ENV_VAR
          value: some_specific_value
    VllmWorker:
      replicas: 1
      envs:
        - name: SPECIFIC_ENV_VAR
          value: some_specific_value
      # Add PVC for model storage
      volumeMounts:
        - name: vllm-model-storage
          mountPoint: /models
```
Commit and push this file to your Git repository. FluxCD will detect the new CR and create the initial Dynamo deployment in your cluster.
## Step 3: Update Existing Deployment
To update your pipeline, update the associated DynamoGraphDeployment CR.
The Dynamo operator will automatically reconcile it.
Grove is a Kubernetes API specifically designed to address the orchestration challenges of modern AI workloads, particularly disaggregated inference systems. Grove provides seamless integration with NVIDIA Dynamo for comprehensive AI infrastructure management.
## Overview
Grove was originally motivated by the challenges of orchestrating multinode, disaggregated inference systems. It provides a consistent and unified API that allows users to define, configure, and scale prefill, decode, and any other components like routing within a single custom resource.
### How Grove Works for Disaggregated Serving
Grove enables disaggregated serving by breaking down large language model inference into separate, specialized components that can be independently scaled and managed. This architecture provides several advantages:
- **Component Specialization**: Separate prefill, decode, and routing components optimized for their specific tasks
- **Independent Scaling**: Each component can scale based on its individual resource requirements and workload patterns
- **Resource Optimization**: Better utilization of hardware resources through specialized workload placement
- **Fault Isolation**: Issues in one component don't necessarily affect others
## Core Components and API Resources
Grove implements disaggregated serving through several custom Kubernetes resources that provide declarative composition of role-based pod groups:
### PodCliqueSet
The top-level Grove object that defines a group of components managed and colocated together. Key features include:
- Support for autoscaling
- Topology-aware spread of replicas for availability
- Unified management of multiple disaggregated components
### PodClique
Represents a group of pods with a specific role (e.g., leader, worker, frontend). Each clique features:
- Independent configuration options
- Custom scaling logic support
- Role-specific resource allocation
### PodCliqueScalingGroup
A set of PodCliques that scale and are scheduled together, ideal for tightly coupled roles like prefill leader and worker components that need coordinated scaling behavior.
## Key Capabilities for Disaggregated Serving
Grove provides several specialized features that make it particularly well-suited for disaggregated serving:
### Flexible Gang Scheduling
PodCliques and PodCliqueScalingGroups allow users to specify flexible gang-scheduling requirements at multiple levels within a PodCliqueSet to prevent resource deadlocks and ensure all components of a disaggregated system start together.
### Multi-level Horizontal Auto-Scaling
Supports pluggable horizontal auto-scaling solutions to scale PodCliqueSet, PodClique, and PodCliqueScalingGroup custom resources independently based on their specific metrics and requirements.
### Network Topology-Aware Scheduling
Allows specifying network topology pack and spread constraints to optimize for both network performance and service availability, crucial for disaggregated systems where components need efficient inter-node communication.
### Custom Startup Dependencies
Prescribes the order in which PodCliques must start in a declarative specification, with pod startup decoupled from pod creation or scheduling. This ensures proper initialization order for disaggregated components.
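
Conceptually, declared startup dependencies form a directed acyclic graph, and a valid start order is a topological sort of that graph. The Python sketch below illustrates the idea with the standard library's `graphlib`; it does not use Grove's API, and the clique names and the `starts_after` mapping are hypothetical.

```python
from graphlib import TopologicalSorter


def startup_order(starts_after: dict) -> list:
    """Return an order in which every clique starts after its dependencies.

    `starts_after` maps a clique name to the set of cliques that must be
    started before it. A cycle raises graphlib.CycleError.
    """
    return list(TopologicalSorter(starts_after).static_order())


# Hypothetical roles: workers wait for the leader, the frontend for the workers.
dependencies = {
    "leader": set(),
    "worker": {"leader"},
    "frontend": {"worker"},
}
print(startup_order(dependencies))  # ['leader', 'worker', 'frontend']
```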
## Use Cases and Examples
Grove specifically supports:
- **Multi-node disaggregated inference** for large models such as DeepSeek-R1 and Llama-4-Maverick
- **Single-node disaggregated inference** for optimized resource utilization
- **Agentic pipelines of models** for complex AI workflows
- **Standard aggregated serving** patterns for single node or single GPU inference
## Integration with NVIDIA Dynamo
Grove is strategically aligned with NVIDIA Dynamo for seamless integration within the AI infrastructure stack:
### Complementary Roles
- **Grove**: Handles the Kubernetes orchestration layer for disaggregated AI workloads
- **Dynamo**: Provides comprehensive AI infrastructure capabilities including serving backends, routing, and resource management
### Release Coordination
Grove is aligning its release schedule with NVIDIA Dynamo to ensure seamless integration, with the finalized release cadence reflected in the project roadmap.
### Unified AI Platform
The integration creates a comprehensive platform where:
- Grove manages complex orchestration of disaggregated components
- Dynamo provides the serving infrastructure, routing capabilities, and backend integrations
- Together they enable sophisticated AI serving architectures with simplified management
## Architecture Benefits
Grove represents a significant advancement in Kubernetes-based orchestration for AI workloads by:
1. **Simplifying Complex Deployments**: Provides a unified API that can manage multiple components (prefill, decode, routing) within a single resource definition
2. **Enabling Sophisticated Architectures**: Supports advanced disaggregated inference patterns that were previously difficult to orchestrate
3. **Reducing Operational Complexity**: Abstracts away the complexity of coordinating multiple interdependent AI components
4. **Optimizing Resource Utilization**: Enables fine-grained control over component placement and scaling
## Getting Started
Grove relies on KAI Scheduler for resource allocation and scheduling.
For KAI Scheduler, see the [KAI Scheduler Deployment Guide](https://github.com/NVIDIA/KAI-Scheduler).
For installation instructions, see the [Grove Installation Guide](https://github.com/NVIDIA/grove/blob/main/docs/installation.md).
For practical examples of Grove-based multinode deployments in action, see the [Multinode Deployment Guide](./deployment/multinode-deployment.md), which demonstrates multi-node disaggregated serving scenarios.
For the latest updates on Grove, refer to the [official project on GitHub](https://github.com/NVIDIA/grove).
Dynamo Kubernetes Platform also allows you to install Grove and KAI Scheduler as part of the platform installation. See the [Dynamo Kubernetes Platform Deployment Installation Guide](./installation_guide.md) for more details.
- A cluster-wide Dynamo operator is likely already running
- **Do NOT install another operator** - use the existing cluster-wide operator
- Only install a namespace-restricted operator if you specifically need to prevent the cluster-wide operator from managing your namespace (e.g., testing operator features you're developing)
- **Kubernetes cluster v1.24+** with admin or namespace-scoped access
- **Cluster type determined** (shared vs dedicated); see [Before You Start](#before-you-start)
- **CRD status checked** if on a shared cluster
- **NGC credentials** (optional), required only if pulling NVIDIA images from NGC
### Verify Installation
Run the following to confirm your tools are correctly installed:
```bash
# Verify tools and versions
kubectl version --client  # Should show v1.24+
helm version # Should show v3.0+
docker version # Required for Path B only
# Set your release version
export RELEASE_VERSION=0.x.x # any version of Dynamo 0.3.2+ listed at https://github.com/ai-dynamo/dynamo/releases
```
### Pre-Deployment Checks
Before proceeding, run the pre-deployment check script to verify your cluster meets all requirements:
```bash
./deploy/pre-deployment/pre-deployment-check.sh
```
This script validates kubectl connectivity, default StorageClass configuration, and GPU node availability. See [Pre-Deployment Checks](../../deploy/pre-deployment/README.md) for details.
> **No cluster?** See [Minikube Setup](deployment/minikube.md) for local development.
**Estimated installation time:** 5-30 minutes depending on path
## Path A: Production Install
Install from [NGC published artifacts](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/collections/ai-dynamo/artifacts).
```bash
# 1. Set environment
export NAMESPACE=dynamo-system
export RELEASE_VERSION=0.x.x # any version of Dynamo 0.3.2+ listed at https://github.com/ai-dynamo/dynamo/releases
# 2. Install CRDs (skip if on shared cluster where CRDs already exist)
```
Note: Use the full path `dynamo-operator.namespaceRestriction.enabled=true` (not just `namespaceRestriction.enabled=true`).
If you see this validation error, you need namespace restriction:
```
VALIDATION ERROR: Cannot install cluster-wide Dynamo operator.
Found existing namespace-restricted Dynamo operators in namespaces: ...
```
> [!TIP]
> For multinode deployments, you need to install multinode orchestration components:
>
> **Option 1 (Recommended): Grove + KAI Scheduler**
> - Grove and KAI Scheduler can be installed manually or through the dynamo-platform helm install command.
> - When using the dynamo-platform helm install command, Grove and KAI Scheduler are NOT installed by default. You can enable their installation by setting the following flags:
>
> ```bash
> --set "grove.enabled=true"
> --set "kai-scheduler.enabled=true"
> ```
>
> **Option 2: LeaderWorkerSet (LWS) + Volcano**
> - If using LWS for multinode deployments, you must also install Volcano (required dependency):
>
> By default, the Dynamo Operator is installed cluster-wide and monitors all namespaces.
> To restrict the operator to a specific namespace (the Helm release namespace by default), set `namespaceRestriction.enabled` to `true`.
> You can also change the restricted namespace by setting the `targetNamespace` property.
```
ERROR: Original containers have been substituted for unrecognized ones. Deploying this chart with non-standard containers is likely to cause degraded security and performance, broken chart features, and missing environment variables.
```
This error, which you might encounter during `helm install`, is caused by Bitnami moving its Docker repository to a [secure one](https://github.com/bitnami/charts/tree/main/bitnami/etcd#%EF%B8%8F-important-notice-upcoming-changes-to-the-bitnami-catalog).
Just add the following to the helm install command:
# Model Caching with Fluid: Cloud-Native Data Orchestration and Acceleration
Fluid is an open-source, cloud-native data orchestration and acceleration platform for Kubernetes. It virtualizes and accelerates data access from various sources (object storage, distributed file systems, cloud storage), making it ideal for AI, machine learning, and big data workloads.
## Key Features
- **Data Caching and Acceleration:** Cache remote data close to compute workloads for faster access.
- **Unified Data Access:** Access data from S3, HDFS, NFS, and more through a single interface.
- **Kubernetes Native:** Integrates with Kubernetes using CRDs for data management.
- **Scalability:** Supports large-scale data and compute clusters.
## Installation
You can install Fluid on any Kubernetes cluster using Helm.
You can then use `s3://hf-models/deepseek-ai/DeepSeek-R1-Distill-Llama-8B` as your Dataset mount.
## Usage with Dynamo
Mount the Fluid-generated PVC in your DynamoGraphDeployment:
```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: model-caching
spec:
  pvcs:
    - name: s3-model
  envs:
    - name: HF_HOME
      value: /model
    - name: DYN_DEPLOYMENT_CONFIG
      value: '{"Common":{"model":"/model",...}}'
  services:
    VllmWorker:
      volumeMounts:
        - name: s3-model
          mountPoint: /model
    Processor:
      volumeMounts:
        - name: s3-model
          mountPoint: /model
```
## Full example with llama3.3 70B
### Performance
When deploying LLaMA 3.3 70B using Fluid as the caching layer, we observed the best performance by configuring a single-node cache that holds 100% of the model files locally. By ensuring that the vllm worker pod is scheduled on the same node as the Fluid cache, we were able to eliminate network I/O bottlenecks, which resulted in the fastest model startup time and the highest inference efficiency during our tests.
| Cache Configuration | vLLM Pod Placement | Startup Time |
This guide demonstrates how to set up logging for Dynamo in Kubernetes using Grafana Loki and Alloy. It provides a simple reference configuration that works on Kubernetes clusters, including Minikube and MicroK8s.
> [!Note]
> This setup is intended for development and testing purposes. For production environments, please refer to the official documentation for high-availability configurations.
## Components Overview
- **[Grafana Loki](https://grafana.com/oss/loki/)**: Fast and cost-effective Kubernetes-native log aggregation system.
- **[Grafana Alloy](https://grafana.com/oss/alloy/)**: OpenTelemetry collector that replaces Promtail, gathering logs, metrics and traces from Kubernetes pods.
- **[Grafana](https://grafana.com/grafana/)**: Visualization platform for querying and exploring logs.
## Prerequisites
### 1. Dynamo Kubernetes Platform
This guide assumes you have installed Dynamo Kubernetes Platform. For more information, see [Dynamo Kubernetes Platform](../README.md).
### 2. Kube-prometheus
While this guide does not use Prometheus, it assumes Grafana is pre-installed with the kube-prometheus. For more information, see [kube-prometheus](https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-prometheus-stack).
### 3. Environment Variables
#### Kubernetes Setup Variables
The following env variables are set:
- `MONITORING_NAMESPACE`: The namespace where Loki is installed
- `DYN_NAMESPACE`: The namespace where Dynamo Kubernetes Platform is installed
```bash
export MONITORING_NAMESPACE=monitoring
export DYN_NAMESPACE=dynamo-system
```
#### Dynamo Logging Variables
| Variable | Description | Example |
|----------|-------------|---------|
| `DYN_LOGGING_JSONL` | Enable JSONL logging format (required for Loki) | `true` |
Our configuration (`loki-values.yaml`) sets up Loki in a simple configuration that is suitable for testing and development. It uses a local MinIO for storage. The installation pods can be viewed with:
```bash
kubectl get pods -n $MONITORING_NAMESPACE -l app=loki
```
### 2. Install Grafana Alloy
Next, install the Grafana Alloy collector to gather logs from your Kubernetes cluster and forward them to Loki. Here we use the Helm chart `k8s-monitoring` provided by Grafana to install the collector:
```bash
# Generate a custom values file with the namespace information
- "nvidia_com_dynamo_component_type" # extract this label from the dynamo graph deployment
- "nvidia_com_dynamo_graph_deployment_name" # extract this label from the dynamo graph deployment
namespaces:
- $DYN_NAMESPACE
```
### 3. Configure Grafana with the Loki datasource and Dynamo Logs dashboard
We will be viewing the logs associated with our DynamoGraphDeployment in Grafana. To do this, we need to configure Grafana with the Loki datasource and Dynamo Logs dashboard.
Since we are using Grafana with the Prometheus Operator, we can simply apply the following ConfigMaps to quickly achieve this configuration.
> If using Grafana installed without the Prometheus Operator, you can manually import the Loki datasource and Dynamo Logs dashboard using the Grafana UI.
### 4. Deploy a DynamoGraphDeployment with JSONL Logging
At this point, we should have everything in place to collect and view logs in our Grafana instance. All that is left is to deploy a DynamoGraphDeployment to collect logs from.
To enable structured logs in a DynamoGraphDeployment, we need to set the `DYN_LOGGING_JSONL` environment variable to `1`. This is done for us in the `agg_logging.yaml` setup for the SGLang backend. We can now deploy the DynamoGraphDeployment with:
Send a few chat completion requests to generate structured logs from the frontend and worker pods of the DynamoGraphDeployment. We are now all set to view the logs in Grafana.
## Viewing Logs in Grafana
Port-forward the Grafana service to access the UI:
If everything is working, under Home > Dashboards > Dynamo Logs, you should see a dashboard for viewing the logs associated with your DynamoGraphDeployments.
The dashboard enables filtering by DynamoGraphDeployment, namespace, and component type (e.g., frontend, worker, etc.).
This guide provides a walkthrough for collecting and visualizing metrics from Dynamo components using the kube-prometheus-stack. The kube-prometheus-stack provides a powerful and flexible way to configure monitoring for Kubernetes applications through custom resources like PodMonitors, making it easy to automatically discover and scrape metrics from Dynamo components.
## Prerequisites
### Install kube-prometheus-stack
If you don't have an existing Prometheus setup, you'll likely want to install the kube-prometheus-stack. This is a collection of Kubernetes manifests that includes the Prometheus Operator, Prometheus, Grafana, and other monitoring components in a pre-configured setup. The stack introduces custom resources that make it easy to deploy and manage monitoring in Kubernetes:
- `PodMonitor`: Automatically discovers and scrapes metrics from pods based on label selectors
- `ServiceMonitor`: Similar to PodMonitor but works with Services
- `PrometheusRule`: Defines alerting and recording rules
> The commands below assume you installed the kube-prometheus-stack using the installation method listed above. Depending on how your monitoring stack is configured, you may need to adjust the `kubectl` commands that follow (e.g., by changing namespace or Service names).
### Install Dynamo Operator
Before setting up metrics collection, you'll need to have the Dynamo operator installed in your cluster. Follow our [Installation Guide](../installation_guide.md) for detailed instructions on deploying the Dynamo operator.
Make sure to set the `dynamo-operator.dynamo.metrics.prometheusEndpoint` to the Prometheus endpoint you installed in the previous step.
The Dynamo Grafana dashboard includes panels for node-level CPU utilization, system load, and container resource usage. These metrics are collected and exported to Prometheus via [node-exporter](https://github.com/prometheus/node_exporter), which exposes hardware and OS metrics from Linux systems.
> [!Note]
> The kube-prometheus-stack installation described above includes node-exporter by default. If you're using a custom Prometheus setup, you'll need to ensure node-exporter is deployed as a DaemonSet on your cluster nodes.
To verify node-exporter is running:
```bash
kubectl get daemonset -A | grep node-exporter
```
If node-exporter is not running, you can install it via the kube-prometheus-stack or deploy it separately. For more information, see the [node-exporter documentation](https://github.com/prometheus/node_exporter).
### DCGM Metrics Collection (Optional)
GPU utilization metrics are collected and exported to Prometheus via dcgm-exporter. The Dynamo Grafana dashboard includes a panel for GPU utilization related to your Dynamo deployment. For that panel to be populated, you need to ensure that the dcgm-exporter is running in your cluster. To check if the dcgm-exporter is running, please run the following command:
```bash
kubectl get daemonset -A | grep dcgm-exporter
```
If the output is empty, you need to install the dcgm-exporter. For more information, please consult the official [dcgm-exporter documentation](https://docs.nvidia.com/datacenter/cloud-native/gpu-telemetry/latest/dcgm-exporter.html).
## Deploy a DynamoGraphDeployment
Let's start by deploying a simple vLLM aggregated deployment:
```bash
export NAMESPACE=dynamo-system # namespace where dynamo operator is installed
pushd examples/backends/vllm/deploy
kubectl apply -f agg.yaml -n $NAMESPACE
popd
```
This will create two components:
- A Frontend component exposing metrics on its HTTP port
- A Worker component exposing metrics on its system port
Both components expose a `/metrics` endpoint following the OpenMetrics format, but with different metrics appropriate to their roles. For details about:
- Deployment configuration: See the [vLLM README](../../backends/vllm/README.md)
- Available metrics: See the [metrics guide](../../observability/metrics.md)
### Validate the Deployment
Let's send some test requests to populate metrics:
```bash
curl localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
"model": "Qwen/Qwen3-0.6B",
"messages": [
{
"role": "user",
"content": "In the heart of Eldoria, an ancient land of boundless magic and mysterious creatures, lies the long-forgotten city of Aeloria. Once a beacon of knowledge and power, Aeloria was buried beneath the shifting sands of time, lost to the world for centuries. You are an intrepid explorer, known for your unparalleled curiosity and courage, who has stumbled upon an ancient map hinting at ests that Aeloria holds a secret so profound that it has the potential to reshape the very fabric of reality. Your journey will take you through treacherous deserts, enchanted forests, and across perilous mountain ranges. Your Task: Character Background: Develop a detailed background for your character. Describe their motivations for seeking out Aeloria, their skills and weaknesses, and any personal connections to the ancient city or its legends. Are they driven by a quest for knowledge, a search for lost familt clue is hidden."
}
],
"stream": true,
"max_tokens": 30
}'
```
For more information about validating the deployment, see the [vLLM README](../../backends/vllm/README.md).
## Set Up Metrics Collection
### Create PodMonitors
The Prometheus Operator uses PodMonitor resources to automatically discover and scrape metrics from pods. To enable this discovery, the Dynamo operator automatically creates a PodMonitor resource and adds these labels to all pods:
The dashboard is embedded in the ConfigMap. Because it is labeled with `grafana_dashboard: "1"`, Grafana discovers it and adds it to its list of available dashboards. The dashboard includes panels for:
Visit http://localhost:3000 and log in with the credentials captured above.
Once logged in, find the Dynamo dashboard under General.

## Operator Metrics
> **Note:** The metrics described above are for Dynamo **applications** (frontends, workers). The Dynamo **Operator** itself also exposes metrics for monitoring controller reconciliation, webhook validation, and resource inventory.
>
> See the **[Operator Metrics Guide](operator-metrics.md)** for details on operator-specific metrics and the operator dashboard.
The Dynamo Operator exposes Prometheus metrics for monitoring its own health and performance. These metrics are separate from application metrics (frontend/worker) and provide visibility into:
- **Controller Reconciliation**: How efficiently controllers process DynamoGraphDeployments, DynamoComponentDeployments, and DynamoModels
- **Webhook Validation**: Performance and outcomes of admission webhook requests
- **Resource Inventory**: Current count of managed resources by state and namespace
## Prerequisites
The operator metrics feature requires the same monitoring infrastructure as application metrics. For detailed setup instructions, see the [Kubernetes Metrics Guide](./metrics.md#prerequisites).
Operator metrics are automatically collected via a ServiceMonitor, which is created by the Helm chart when `metricsService.enabled: true` (default).
**Unlike application metrics** (which use PodMonitor), the operator uses ServiceMonitor and requires no manual RBAC configuration. The operator's kube-rbac-proxy sidecar is configured with `--ignore-paths=/metrics` to allow Prometheus access.
The dashboard will automatically appear in Grafana (assuming you have the Grafana dashboard sidecar configured, which is included in kube-prometheus-stack).
<!--
SPDX-FileCopyrightText: Copyright (c) 2024-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
# Service Discovery
Dynamo components (frontends, workers, planner) need to discover each other and their capabilities at runtime. We refer to this as service discovery. Two service discovery backends are supported on Kubernetes.
## Discovery Backends
| Backend | Default | Dependencies | Use Case |
|---------|---------|--------------|----------|
| **Kubernetes** | ✅ Yes | None (native K8s) | Recommended for all Kubernetes deployments |
| **KV Store (etcd)** | No | etcd cluster | Legacy deployments |
## Kubernetes Discovery (Default)
Kubernetes discovery is the default and recommended backend when running on Kubernetes. It uses native Kubernetes primitives to facilitate discovery of components:
- **DynamoWorkerMetadata CRD**: Each worker stores its registered endpoints and model cards in a Custom Resource
- **EndpointSlices**: EndpointSlices signal each component's readiness status
### Implementation Details
Each pod runs a **discovery daemon** that watches both EndpointSlices and DynamoWorkerMetadata CRs. A pod is only discoverable when it appears as "ready" in an EndpointSlice AND has a corresponding `DynamoWorkerMetadata` CR. This correlation ensures pods aren't discoverable until they're ready, metadata is immediately available, and stale entries are cleaned up when pods terminate.
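The correlation rule above amounts to an intersection of two views: pods reported ready by EndpointSlices, and pods that have a metadata CR. The following Python sketch shows that rule on simplified, hypothetical data shapes (plain dictionaries with made-up field names such as `pod_name`); it is not the discovery daemon's implementation and does not call the Kubernetes API.

```python
def discoverable_pods(endpoint_slices, worker_metadata):
    """Return metadata for pods that are both ready and have a metadata CR."""
    ready_pods = {
        endpoint["pod_name"]
        for endpoint_slice in endpoint_slices
        for endpoint in endpoint_slice["endpoints"]
        if endpoint["ready"]
    }
    return {
        pod_name: worker_metadata[pod_name]
        for pod_name in sorted(ready_pods)
        if pod_name in worker_metadata
    }


# Hypothetical snapshot of what the two watches might report.
endpoint_slices = [
    {"endpoints": [
        {"pod_name": "worker-a", "ready": True},
        {"pod_name": "worker-b", "ready": True},   # ready, but no CR yet
        {"pod_name": "worker-c", "ready": False},  # has a CR, but not ready
    ]}
]
worker_metadata = {
    "worker-a": {"endpoints": ["dynamo/backend/generate"]},
    "worker-c": {"endpoints": ["dynamo/backend/generate"]},
}
visible = discoverable_pods(endpoint_slices, worker_metadata)
print(list(visible))  # ['worker-a']
```

Only `worker-a` satisfies both conditions, so it is the only pod other components would see.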
#### DynamoWorkerMetadata CRD
Each worker pod creates a `DynamoWorkerMetadata` CR that stores its discovery metadata:
```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoWorkerMetadata
metadata:
  name: my-worker-pod-abc123
  namespace: dynamo-system
  ownerReferences:
    - apiVersion: v1
      kind: Pod
      name: my-worker-pod-abc123
      uid: <pod-uid>
      controller: true
spec:
  data:
    endpoints:
      "dynamo/backend/generate":
        type: Endpoint
        namespace: dynamo
        component: backend
        endpoint: generate
        instance_id: 12345678901234567890
        transport:
          nats_tcp: "dynamo_backend.generate-abc123"
    model_cards: {}
```
The CR is named after the pod and includes an owner reference for automatic garbage collection when the pod is deleted.
#### EndpointSlices
While DynamoWorkerMetadata resources provide an up-to-date snapshot of a component's capabilities, EndpointSlices give a snapshot of health of the various Dynamo components.
The operator creates a Kubernetes Service targeting the Dynamo components. The Kubernetes controller in turn creates and maintains EndpointSlice resources that keep track of the readiness of the pods targeted by the Service. Watching these slices gives us an up-to-date snapshot of which Dynamo components are ready to serve traffic.
##### Readiness Probes
A pod is marked ready if the readiness probe succeeds. On Dynamo workers, this is when the `generate` endpoint is available and healthy. These probes are configured by the Dynamo operator for each pod/component.
#### RBAC
Each Dynamo component pod is automatically given a ServiceAccount that allows it to watch `EndpointSlice` and `DynamoWorkerMetadata` resources within its namespace.
#### Environment Variables
The following environment variables are automatically injected into pods by the operator to facilitate service discovery:
| Variable | Description |
|----------|-------------|
| `DYN_DISCOVERY_BACKEND` | Set to `kubernetes` |
| `POD_NAME` | Pod name (via downward API) |
| `POD_NAMESPACE` | Pod namespace (via downward API) |
| `POD_UID` | Pod UID (via downward API) |
The pod's instance ID is deterministically generated by hashing the pod name, ensuring consistent identity and correlation between EndpointSlices and CRs.
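As a rough illustration of deterministic ID generation, the sketch below derives a stable 64-bit integer from a pod name. The hash function (SHA-256 truncated to 8 bytes) and the function name are assumptions chosen for the example; the hash Dynamo actually uses is not specified in this guide.

```python
import hashlib


def instance_id_from_pod_name(pod_name: str) -> int:
    """Map a pod name to a stable 64-bit integer (illustrative hash choice)."""
    digest = hashlib.sha256(pod_name.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], byteorder="big")


first = instance_id_from_pod_name("my-worker-pod-abc123")
second = instance_id_from_pod_name("my-worker-pod-abc123")
other = instance_id_from_pod_name("my-worker-pod-def456")
print(first == second)  # the same pod name always yields the same ID
```

Because the ID depends only on the pod name, any component that knows the name can recompute the ID without coordination.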
## KV Store Discovery (etcd)
To use etcd-based discovery instead of Kubernetes-native discovery, add the annotation to your DynamoGraphDeployment:
```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: my-deployment
  annotations:
    nvidia.com/dynamo-discovery-backend: etcd
spec:
  services:
    # ...
```
This requires an etcd cluster to be available. The etcd connection is configured via the platform Helm chart.
The Dynamo Operator uses **Kubernetes admission webhooks** to provide real-time validation and mutation of custom resources. Currently, the operator implements **validation webhooks** that ensure invalid configurations are rejected immediately at the API server level, providing faster feedback to users compared to controller-based validation.
All webhook types (validating, mutating, conversion, etc.) share the same **webhook server** and **TLS certificate infrastructure**, making certificate management consistent across all webhook operations.
### Key Features
- ✅ **Enabled by default** - Zero-touch validation out of the box
- ✅ **Shared certificate infrastructure** - All webhook types use the same TLS certificates
- ✅ **Multi-operator support** - Lease-based coordination for cluster-wide and namespace-restricted deployments
- ✅ **Immutability enforcement** - Critical fields protected via CEL validation rules
### Current Webhook Types
- **Validating Webhooks**: Validate custom resource specifications before persistence
  - `DynamoComponentDeployment` validation
  - `DynamoGraphDeployment` validation
  - `DynamoModel` validation
**Note:** Future releases may add mutating webhooks (for defaults/transformations) and conversion webhooks (for CRD version migrations). All will use the same certificate infrastructure described in this document.
```yaml
failurePolicy: Fail # Fail (reject on error) or Ignore (allow on error)
timeoutSeconds: 10 # Webhook timeout
# Namespace filtering (advanced)
namespaceSelector: {} # Kubernetes label selector for namespaces
```
#### Failure Policy
```yaml
# Fail: Reject resources if webhook is unavailable (recommended for production)
webhook:
  failurePolicy: Fail
# Ignore: Allow resources if webhook is unavailable (use with caution)
webhook:
  failurePolicy: Ignore
```
**Recommendation:** Use `Fail` in production to ensure validation is always enforced. Only use `Ignore` if you need high availability and can tolerate occasional invalid resources.
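The difference between the two policies only shows up when the webhook call itself fails. The Python sketch below models that decision; the function name and the use of `None` to represent an unreachable or timed-out webhook are illustrative assumptions, not operator or API server code.

```python
def admission_allowed(webhook_verdict, failure_policy="Fail"):
    """Decide whether the API server persists a resource.

    webhook_verdict is True/False when the webhook answered, or None when the
    call failed (webhook unreachable or timed out).
    """
    if webhook_verdict is not None:
        return webhook_verdict
    if failure_policy == "Ignore":
        return True
    if failure_policy == "Fail":
        return False
    raise ValueError(f"unknown failurePolicy: {failure_policy!r}")


print(admission_allowed(False, "Ignore"))  # False: the webhook answered and rejected
print(admission_allowed(None, "Fail"))     # False: webhook down, request rejected
print(admission_allowed(None, "Ignore"))   # True: webhook down, request let through
```

The last case is the trade-off: with `Ignore`, a resource that would have failed validation can be persisted while the webhook is unavailable.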
#### Namespace Filtering
Control which namespaces are validated (applies to **cluster-wide operator** only):
```yaml
# Only validate resources in namespaces with specific labels
webhook:
  namespaceSelector:
    matchLabels:
      dynamo-validation: enabled
# Or exclude specific namespaces
webhook:
  namespaceSelector:
    matchExpressions:
      - key: dynamo-validation
        operator: NotIn
        values: ["disabled"]
```
**Note:** For **namespace-restricted operators**, the namespace selector is automatically set to validate only the operator's namespace. This configuration is ignored in namespace-restricted mode.
---
## Certificate Management
### Automatic Certificates (Default)
**Zero configuration required!** Certificates are automatically generated during `helm install` and `helm upgrade`.
The operator supports running both **cluster-wide** and **namespace-restricted** instances simultaneously using a **lease-based coordination mechanism**.
### Scenario
```
Cluster:
├─ Operator A (cluster-wide, namespace: platform-system)
│ └─ Validates all namespaces EXCEPT team-a
└─ Operator B (namespace-restricted, namespace: team-a)
└─ Validates only team-a namespace
```
### How It Works
1. **Namespace-restricted operator** creates a Lease in its namespace
2. **Cluster-wide operator** watches for Leases named `dynamo-operator-ns-lock`
3. **Cluster-wide operator** skips validation for namespaces with active Leases
4. **Namespace-restricted operator** validates resources in its namespace
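The steps above can be sketched as a small decision function on the cluster-wide side. This Python example is a simplified model, not the operator's code: the dictionary fields are hypothetical stand-ins for a Lease's namespace, renew time, and duration, and treating a lease as active until its renew time plus duration has passed is an assumption.

```python
from datetime import datetime, timedelta, timezone

LEASE_NAME = "dynamo-operator-ns-lock"


def lease_is_active(lease, now):
    """Treat a lease as active until renew_time + duration has passed (assumption)."""
    expires_at = lease["renew_time"] + timedelta(seconds=lease["duration_seconds"])
    return lease["name"] == LEASE_NAME and now < expires_at


def cluster_wide_should_validate(namespace, leases, now):
    """The cluster-wide operator skips namespaces that hold an active lease."""
    return not any(
        lease["namespace"] == namespace and lease_is_active(lease, now)
        for lease in leases
    )


now = datetime(2025, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
leases = [
    {"name": LEASE_NAME, "namespace": "team-a",
     "renew_time": now - timedelta(seconds=5), "duration_seconds": 30},
    {"name": LEASE_NAME, "namespace": "team-b",
     "renew_time": now - timedelta(seconds=120), "duration_seconds": 30},
]
print(cluster_wide_should_validate("team-a", leases, now))  # False: active lease
print(cluster_wide_should_validate("team-b", leases, now))  # True: lease expired
print(cluster_wide_should_validate("team-c", leases, now))  # True: no lease
```

In this model, if a namespace-restricted operator stops renewing its lease, the cluster-wide operator resumes validating that namespace once the lease expires.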
### Lease Configuration
The lease mechanism is **automatically configured** based on deployment mode:
```yaml
# Cluster-wide operator (default)
namespaceRestriction:
  enabled: false
  # → Watches for leases in all namespaces
  # → Skips validation for namespaces with active leases
# Namespace-restricted operator
namespaceRestriction:
  enabled: true
  namespace: team-a
  # → Creates lease in team-a namespace
  # → Does NOT check for leases (no cluster permissions)
```
### Deployment Example
```bash
# 1. Deploy cluster-wide operator
helm install platform-operator dynamo-platform \
  -n platform-system \
  --set namespaceRestriction.enabled=false
# 2. Deploy namespace-restricted operator for team-a
helm install team-a-operator dynamo-platform \
  -n team-a \
  --set namespaceRestriction.enabled=true \
  --set namespaceRestriction.namespace=team-a
```
### ValidatingWebhookConfiguration Naming
The webhook configuration name reflects the deployment mode:
Dynamo provides a Docker Compose-based observability stack that includes Prometheus, Grafana, Tempo, and various exporters for metrics, tracing, and visualization.
| [Metrics](metrics.md) | Available metrics reference | `DYN_SYSTEM_PORT`† |
| [Operator Metrics (Kubernetes)](../kubernetes/observability/operator-metrics.md) | Operator controller and webhook metrics for Kubernetes | N/A (configured via Helm) |
| [Health Checks](health-checks.md) | Component health monitoring and readiness probes | `DYN_SYSTEM_PORT`†, `DYN_SYSTEM_STARTING_HEALTH_STATUS`, `DYN_SYSTEM_HEALTH_PATH`, `DYN_SYSTEM_LIVE_PATH`, `DYN_SYSTEM_USE_ENDPOINT_HEALTH_STATUS` |
| [Tracing](tracing.md) | Distributed tracing with OpenTelemetry and Tempo | `DYN_LOGGING_JSONL`†, `OTEL_EXPORT_ENABLED`†, `OTEL_EXPORTER_OTLP_TRACES_ENDPOINT`†, `OTEL_SERVICE_NAME`† |
| [Metrics Developer Guide](metrics-developer-guide.md) | Creating custom metrics in Rust and Python | `DYN_SYSTEM_PORT`† |
## Kubernetes
For Kubernetes-specific setup and configuration, see [docs/kubernetes/observability/](../kubernetes/observability/).
**Operator Metrics**: The Dynamo Operator running in Kubernetes exposes its own set of metrics for monitoring controller reconciliation, webhook validation, and resource inventory. See the [Operator Metrics Guide](../kubernetes/observability/operator-metrics.md).
---
## Topology
This provides:
- **Prometheus** on `http://localhost:9090` - metrics collection and querying
- **Grafana** on `http://localhost:3000` - visualization dashboards (username: `dynamo`, password: `dynamo`)
- **Tempo** on `http://localhost:3200` - distributed tracing backend
- **DCGM Exporter** on `http://localhost:9401/metrics` - GPU metrics
- **NATS Exporter** on `http://localhost:7777/metrics` - NATS messaging metrics
The dcgm-exporter service in the Docker Compose network is configured to use port 9401 instead of the default port 9400. This adjustment is made to avoid port conflicts with other dcgm-exporter instances that may be running simultaneously. Such a configuration is typical in distributed systems like SLURM.
### Configuration Files
The following configuration files are located in the `deploy/observability/` directory:
- [docker-compose.yml](../../deploy/docker-compose.yml): Defines NATS and etcd services
- [docker-observability.yml](../../deploy/docker-observability.yml): Defines Prometheus, Grafana, Tempo, and exporters
For collecting and visualizing logs with Grafana Loki (Kubernetes), or viewing trace context in logs alongside Grafana Tempo, start the observability stack. See [Observability Getting Started](README.md#getting-started-quickly) for instructions.
### Enable Structured Logging
Enable structured JSONL logging:
```bash
export DYN_LOGGING_JSONL=true
export DYN_LOG=debug
# Start your Dynamo components (default port 8000, override with --http-port or DYN_HTTP_PORT env var)
```
Structured log output then looks like this:
```
{"time":"2025-09-02T15:53:31.943377Z","level":"INFO","target":"log","message":"VllmWorker for Qwen/Qwen3-0.6B has been initialized","log.file":"/opt/dynamo/venv/lib/python3.12/site-packages/dynamo/vllm/main.py","log.line":191,"log.target":"main.init"}
{"time":"2025-09-02T15:53:31.943550Z","level":"INFO","target":"log","message":"Reading Events from tcp://127.0.0.1:26771","log.file":"/opt/dynamo/venv/lib/python3.12/site-packages/dynamo/vllm/main.py","log.line":212,"log.target":"main.init"}
{"time":"2025-09-02T15:53:31.943636Z","level":"INFO","target":"log","message":"Getting engine runtime configuration metadata from vLLM engine...","log.file":"/opt/dynamo/venv/lib/python3.12/site-packages/dynamo/vllm/main.py","log.line":220,"log.target":"main.init"}
```
When `DYN_LOGGING_JSONL` is enabled, all logs include `trace_id` and `span_id` fields, and spans are automatically created for requests. This is useful for short debugging sessions where you want to examine trace context in logs without setting up a full tracing backend and for correlating log messages with traces.
The trace and span information uses the OpenTelemetry format and libraries, which means the IDs are compatible with OpenTelemetry-based tracing backends like Tempo or Jaeger if you later choose to enable trace export.
**Note:** This section has overlap with [Distributed Tracing with Tempo](tracing.md). For trace visualization in Grafana Tempo and persistent trace analysis, see [Distributed Tracing with Tempo](tracing.md).
### Configuration for Logging
To see trace information in logs:
```bash
export DYN_LOGGING_JSONL=true
export DYN_LOG=debug # Set to debug to see detailed trace logs
# Start your Dynamo components (e.g., frontend and worker) (default port 8000, override with --http-port or DYN_HTTP_PORT env var)
```
This enables JSONL logging with `trace_id` and `span_id` fields. Traces appear in logs but are not exported to any backend.
### Example Request
Send a request to generate logs with trace context:
```bash
curl -H 'Content-Type: application/json' \
  -H 'x-request-id: test-trace-001' \
  -d '{
    "model": "Qwen/Qwen3-0.6B",
    "max_completion_tokens": 100,
    "messages": [
      {"role": "user", "content": "What is the capital of France?"}
    ]
  }' \
  http://localhost:8000/v1/chat/completions
```
Check the logs (stderr) for JSONL output containing `trace_id`, `span_id`, and `x_request_id` fields.
## Trace and Span Information in Logs
This section shows how trace and span information appears in JSONL logs. These logs can be used to understand request flows even without a trace visualization backend.
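As a rough illustration of reading request flows straight from the logs, the sketch below groups JSONL records by their `trace_id`. The helper name `group_by_trace` and the shortened sample records are hypothetical; only the `trace_id` and `span_id` field names come from this guide.

```python
import json
from collections import defaultdict


def group_by_trace(lines):
    """Group parsed JSONL log records by their trace_id field."""
    traces = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            # Skip lines that are not JSON (for example, plain-text output).
            continue
        if not isinstance(record, dict):
            continue
        trace_id = record.get("trace_id")
        if trace_id is not None:
            traces[trace_id].append(record)
    return dict(traces)


# Shortened, hypothetical sample records; real log lines carry more fields.
sample = [
    '{"message": "first", "span_id": "f7e487a9d2a6bf38", "trace_id": "80196f3e3a6fdf06d23bb9ada3788518"}',
    '{"message": "second", "span_id": "f7e487a9d2a6bf38", "trace_id": "80196f3e3a6fdf06d23bb9ada3788518"}',
    '{"message": "no trace context"}',
    "not a JSON line",
]

for trace_id, records in group_by_trace(sample).items():
    print(trace_id, [r["message"] for r in records])
```

Records without a `trace_id` are dropped here; a real script might collect them separately instead.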
### Example Disaggregated Trace in Grafana
When viewing the corresponding trace in Grafana, you should see a span hierarchy similar to the one described in this section.
Dynamo creates distributed traces that span multiple services in a disaggregated serving setup. The following sections describe the key spans you'll see in Grafana when viewing traces for chat completion requests.
#### Available Spans in Disaggregated Mode
When running Dynamo in disaggregated mode, a typical request creates the following spans:
##### 1. `http-request` (Frontend - Root Span)
The root span for the entire request lifecycle, created in the **dynamo-frontend** service.
**Key Attributes:**
- **Service**: `dynamo-frontend`
- **Operation**: Handles the HTTP request from client to completion
- **Duration**: Total end-to-end request time (includes prefill + decode)
##### 2. Prefill Routing Span (Frontend - Child Span)
A child span of `http-request`, created in the **dynamo-frontend** service during the routing phase.
**Key Attributes:**
- **Service**: `dynamo-frontend`
- **Operation**: Routes the prefill request to an appropriate prefill worker
- **Duration**: Time spent selecting a prefill worker plus the duration of the prefill request
- **Parent**: `http-request` span
This span captures the routing decision and the request sent to the prefill worker.
##### 3. `handle_payload` (Prefill Worker Span)
A child span of `http-request`, created in the **dynamo-worker-vllm-prefill** service.
**Key Attributes:**
- **Service**: `dynamo-worker-vllm-prefill` (or `dynamo-worker-sglang-prefill` for SGLang)
- **Operation**: Processes the prefill phase of generation
- **Duration**: Time to compute prefill (typically milliseconds to seconds)
- **Component**: `prefill`
- **Endpoint**: `generate`
- **Parent**: `http-request` span
This span represents the actual prefill computation on a prefill-specialized worker, including prompt processing and initial KV cache generation.
##### 4. `handle_payload` (Decode Worker Span)
A child span of `http-request`, created in the **dynamo-worker-vllm-decode** service.
**Key Attributes:**
- **Service**: `dynamo-worker-vllm-decode` (or `dynamo-worker-sglang-decode` for SGLang)
- **Operation**: Processes the decode phase of generation
- **Duration**: Time to generate all output tokens (typically seconds)
- **Component**: `decode` or `backend`
- **Endpoint**: `generate`
- **Parent**: `http-request` span
This span represents the iterative token generation phase on a decode-specialized worker, which consumes the KV cache from prefill and produces output tokens.
#### Understanding Span Metrics
Each span provides several useful metrics:
| Metric | Description |
|--------|-------------|
| **Duration** | Total time from span start to end |
| **Busy Time** | Time actively processing (excluding waiting) |
| **Idle Time** | Time spent waiting (e.g., for network, other services) |
| **Start Time** | When the span began |
| **Child Count** | Number of direct child spans |
The relationship **Duration = Busy Time + Idle Time** helps identify where time is spent and potential bottlenecks.
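To make the relationship concrete, here is a minimal sketch that derives idle time from a span's duration and busy time and reports the idle share. The function name and the millisecond values are hypothetical and not taken from a real trace.

```python
def idle_fraction(duration_ms, busy_ms):
    """Return the share of a span's duration spent idle.

    Uses Duration = Busy Time + Idle Time, so Idle Time = Duration - Busy Time.
    """
    if duration_ms <= 0:
        raise ValueError("duration_ms must be positive")
    if not 0 <= busy_ms <= duration_ms:
        raise ValueError("busy_ms must be between 0 and duration_ms")
    idle_ms = duration_ms - busy_ms
    return idle_ms / duration_ms


# Hypothetical span: 1200 ms total, 300 ms actively processing.
print(idle_fraction(1200.0, 300.0))  # 0.75
```

A high idle share on a span suggests the time is going to waiting on the network or on another service rather than to local processing.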
## Custom Request IDs in Logs
You can provide a custom request ID using the `x-request-id` header. This ID will be attached to all spans and logs for that request, making it easier to correlate traces with application-level request tracking.
### Example Request with Custom Request ID
```sh
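curl -X POST http://localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -H 'x-request-id: 8372eac7-5f43-4d76-beca-0a94cfb311d0' \
  -d '{
    "model": "Qwen/Qwen3-0.6B",
    "messages": [
      {
        "role": "user",
        "content": "Explain why Roger Federer is considered one of the greatest tennis players of all time"
      }
    ],
    "stream": false,
    "max_tokens": 1000
  }'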
```
All spans and logs for this request will include the `x_request_id` attribute with value `8372eac7-5f43-4d76-beca-0a94cfb311d0`.
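The same kind of request can be built from Python. This is a sketch using only the standard library; the helper name `build_chat_request`, the default model name, and the default base URL are assumptions for illustration, and the request is only constructed here, not sent.

```python
import json
import urllib.request
import uuid


def build_chat_request(prompt, request_id=None,
                       base_url="http://localhost:8000",
                       model="Qwen/Qwen3-0.6B"):
    """Build a chat completion request carrying a custom x-request-id header."""
    # Generate an ID when the caller does not supply one.
    request_id = request_id or str(uuid.uuid4())
    body = {
        "model": model,  # assumed model name; use the one you are serving
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
        "max_tokens": 1000,
    }
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "x-request-id": request_id,
        },
        method="POST",
    )
    return req, request_id


req, request_id = build_chat_request(
    "What is the capital of France?",
    request_id="8372eac7-5f43-4d76-beca-0a94cfb311d0",
)
print(request_id)
# To send it against a running frontend: urllib.request.urlopen(req)
```

Returning the ID alongside the request lets the caller log it on the client side and later search for it in the server logs.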
### Frontend Logs with Custom Request ID
Notice how the `x_request_id` field appears in all log entries, alongside the `trace_id` (`80196f3e3a6fdf06d23bb9ada3788518`) and `span_id`:
```
{"time":"2025-10-31T21:06:45.397194Z","level":"DEBUG","file":"/opt/dynamo/lib/runtime/src/pipeline/network/tcp/server.rs","line":230,"target":"dynamo_runtime::pipeline::network::tcp::server","message":"Registering new TcpStream on 10.0.4.65:41959","method":"POST","span_id":"f7e487a9d2a6bf38","span_name":"http-request","trace_id":"80196f3e3a6fdf06d23bb9ada3788518","uri":"/v1/chat/completions","version":"HTTP/1.1","x_request_id":"8372eac7-5f43-4d76-beca-0a94cfb311d0"}
{"time":"2025-10-31T21:06:45.418584Z","level":"DEBUG","file":"/opt/dynamo/lib/llm/src/kv_router/prefill_router.rs","line":232,"target":"dynamo_llm::kv_router::prefill_router","message":"Prefill succeeded, using disaggregated params for decode","method":"POST","span_id":"f7e487a9d2a6bf38","span_name":"http-request","trace_id":"80196f3e3a6fdf06d23bb9ada3788518","uri":"/v1/chat/completions","version":"HTTP/1.1","x_request_id":"8372eac7-5f43-4d76-beca-0a94cfb311d0"}
{"time":"2025-10-31T21:06:45.418854Z","level":"DEBUG","file":"/opt/dynamo/lib/runtime/src/pipeline/network/tcp/server.rs","line":230,"target":"dynamo_runtime::pipeline::network::tcp::server","message":"Registering new TcpStream on 10.0.4.65:41959","method":"POST","span_id":"f7e487a9d2a6bf38","span_name":"http-request","trace_id":"80196f3e3a6fdf06d23bb9ada3788518","uri":"/v1/chat/completions","version":"HTTP/1.1","x_request_id":"8372eac7-5f43-4d76-beca-0a94cfb311d0"}
```
<!--
SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
-->
# Metrics Developer Guide
This guide explains how to create and use custom metrics in Dynamo components using the Dynamo metrics API.
## Metrics Exposure
All metrics created via the Dynamo metrics API are automatically exposed on the `/metrics` HTTP endpoint in Prometheus Exposition Format text when the following environment variable is set:
- `DYN_SYSTEM_PORT=<port>` - Port for the metrics endpoint (set to a positive value to enable; default: `-1`, disabled)
For example, with `DYN_SYSTEM_PORT=8081`, metrics are available in Prometheus Exposition Format at `http://localhost:8081/metrics`.
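The enable/disable rule can be sketched as a small helper that maps the environment to a metrics URL. The function name `metrics_url` and the `host` default are hypothetical; only the `DYN_SYSTEM_PORT` variable, its `-1` default, and the `/metrics` path come from this guide.

```python
import os


def metrics_url(env=None, host="localhost"):
    """Return the /metrics URL if DYN_SYSTEM_PORT enables it, otherwise None."""
    env = os.environ if env is None else env
    raw = env.get("DYN_SYSTEM_PORT", "-1")  # documented default: -1 (disabled)
    try:
        port = int(raw)
    except ValueError:
        # Treat an unparsable value as disabled in this sketch.
        return None
    if port <= 0:
        return None
    return f"http://{host}:{port}/metrics"


print(metrics_url({"DYN_SYSTEM_PORT": "8081"}))  # http://localhost:8081/metrics
print(metrics_url({}))  # None
```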
## Metric Name Constants
The [prometheus_names.rs](../../lib/runtime/src/metrics/prometheus_names.rs) module provides centralized metric name constants and sanitization functions to ensure consistency across all Dynamo components.
---
## Metrics API in Rust
The metrics API is accessible through the `.metrics()` method on runtime, namespace, component, and endpoint objects. See [Runtime Hierarchy](metrics.md#runtime-hierarchy) for details on the hierarchical structure.
### Available Methods
- `.metrics().create_counter()`: Create a counter metric
- `.metrics().create_gauge()`: Create a gauge metric
- `.metrics().create_histogram()`: Create a histogram metric
- `.metrics().create_countervec()`: Create a counter with labels
- `.metrics().create_gaugevec()`: Create a gauge with labels
- `.metrics().create_histogramvec()`: Create a histogram with labels