Commit 39d645e5 authored by Jonathan Tong, committed by GitHub

docs: migrate Fern docs from fern/ into docs/ (#6206)


Signed-off-by: Jont828 <jt572@cornell.edu>
parent d381e6ff
# ChReK Standalone Usage Guide
> ⚠️ **Experimental Feature**: ChReK is currently in **beta/preview**. It requires privileged mode for restore operations, which may not be suitable for all production environments. Review the [security implications](#security-considerations) before deploying.
This guide explains how to use **ChReK** (Checkpoint/Restore for Kubernetes) as a standalone component without deploying the full Dynamo platform. This is useful if you want to add checkpoint/restore capabilities to your own GPU workloads.
## Table of Contents
- [Overview](#overview)
- [Prerequisites](#prerequisites)
- [Step 1: Deploy ChReK](#step-1-deploy-chrek)
- [Step 2: Build Checkpoint-Enabled Images](#step-2-build-checkpoint-enabled-images)
- [Step 3: Create Checkpoint Jobs](#step-3-create-checkpoint-jobs)
- [Step 4: Restore from Checkpoints](#step-4-restore-from-checkpoints)
- [Environment Variables Reference](#environment-variables-reference)
- [Checkpoint Flow Explained](#checkpoint-flow-explained)
- [Troubleshooting](#troubleshooting)
---
## Overview
When using ChReK standalone, you are responsible for:
1. **Deploying the ChReK Helm chart** (DaemonSet + PVC)
2. **Building checkpoint-enabled container images** with the restore entrypoint
3. **Creating checkpoint jobs** with the correct environment variables
4. **Creating restore pods** that detect and use the checkpoints
The ChReK DaemonSet handles the actual CRIU checkpoint/restore operations automatically once your pods are configured correctly.
---
## Prerequisites
- Kubernetes cluster with:
  - NVIDIA GPUs with checkpoint support
  - **Privileged security context allowed** (⚠️ required for CRIU - see [Security Considerations](#security-considerations))
  - PVC storage (ReadWriteMany recommended for multi-node)
- Docker or compatible container runtime for building images
- Access to the ChReK source code: `deploy/chrek/`
### Security Considerations
⚠️ **Important**: ChReK restore operations **require privileged mode**, which has significant security implications:
- **Privileged containers** can access all host devices and bypass most security restrictions
- This may violate security policies in production environments
- Privileged containers, if compromised, can potentially compromise node security
**Recommended for:**
- ✅ Development and testing environments
- ✅ Research and experimentation
- ✅ Controlled production environments with appropriate security controls
**Not recommended for:**
- ❌ Multi-tenant clusters without proper isolation
- ❌ Security-sensitive production workloads without risk assessment
- ❌ Environments with strict security compliance requirements
### Technical Limitations
⚠️ **Current Restrictions:**
- **vLLM backend only**: Currently only the vLLM backend supports checkpoint/restore. SGLang and TensorRT-LLM support is planned.
- **Single-node only**: Checkpoints must be created and restored on the same node
- **Single-GPU only**: Multi-GPU configurations are not yet supported
- **Network state**: Active TCP connections are closed during restore
- **Storage**: Only PVC backend currently implemented (S3/OCI planned)
---
## Step 1: Deploy ChReK
### Install the Helm Chart
```bash
# Clone the repository
git clone https://github.com/ai-dynamo/dynamo.git
cd dynamo
# Install ChReK in your namespace
helm install chrek ./deploy/helm/charts/chrek \
  --namespace my-app \
  --create-namespace \
  --set storage.pvc.size=100Gi \
  --set storage.pvc.storageClass=your-storage-class
```
### Verify Installation
```bash
# Check the DaemonSet is running
kubectl get daemonset -n my-app
# NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE
# chrek-agent 3 3 3 3 3
# Check the PVC is bound
kubectl get pvc -n my-app
# NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS
# chrek-pvc Bound pvc-xyz 100Gi RWX your-storage-class
```
---
## Step 2: Build Checkpoint-Enabled Images
ChReK provides a convenient `placeholder` target in its Dockerfile that automatically injects checkpoint/restore capabilities into your existing container images.
### Quick Start: Using the Placeholder Target (Recommended)
```bash
cd deploy/chrek
# Define your images
export BASE_IMAGE="your-app:latest" # Your existing application image
export RESTORE_IMAGE="your-app:checkpoint-enabled" # Output checkpoint-enabled image
# Build using the placeholder target
docker build \
  --target placeholder \
  --build-arg BASE_IMAGE="$BASE_IMAGE" \
  -t "$RESTORE_IMAGE" \
  .
# Push to your registry
docker push "$RESTORE_IMAGE"
```
**Example with a Dynamo vLLM image:**
```bash
cd deploy/chrek
export DYNAMO_IMAGE="nvidia/dynamo-vllm:v1.2.0"
export RESTORE_IMAGE="nvidia/dynamo-vllm:v1.2.0-checkpoint"
docker build \
  --target placeholder \
  --build-arg BASE_IMAGE="$DYNAMO_IMAGE" \
  -t "$RESTORE_IMAGE" \
  .
```
### What the Placeholder Target Does
The ChReK Dockerfile's `placeholder` stage automatically:
- ✅ Builds the restore-entrypoint binary
- ✅ Injects it into `/usr/local/bin/restore-entrypoint`
- ✅ Adds `smart-entrypoint.sh` to `/usr/local/bin/`
- ✅ Sets executable permissions
- ✅ Configures the entrypoint to detect and restore checkpoints
- ✅ Preserves your original application CMD
### Alternative: Manual Multi-Stage Build
If you need more control, you can create your own Dockerfile:
```dockerfile
# Stage 1: Build restore-entrypoint
FROM golang:1.23-alpine AS restore-builder
WORKDIR /build
COPY deploy/chrek/cmd/restore-entrypoint ./cmd/restore-entrypoint
COPY deploy/chrek/pkg ./pkg
COPY deploy/chrek/go.mod deploy/chrek/go.sum ./
RUN go build -o /restore-entrypoint ./cmd/restore-entrypoint
# Stage 2: Your application image
FROM your-base-image:latest
# Copy restore-entrypoint
COPY --from=restore-builder /restore-entrypoint /usr/local/bin/restore-entrypoint
# Copy smart-entrypoint.sh
COPY deploy/chrek/scripts/smart-entrypoint.sh /usr/local/bin/smart-entrypoint.sh
RUN chmod +x /usr/local/bin/smart-entrypoint.sh /usr/local/bin/restore-entrypoint
# Set smart-entrypoint as the default entrypoint
ENTRYPOINT ["/usr/local/bin/smart-entrypoint.sh"]
# Your application command (becomes CMD, can be overridden)
CMD ["python", "your_app.py"]
```
> **💡 Tip**: Using the `placeholder` target is the recommended approach as it's maintained with the ChReK codebase and ensures compatibility.
---
## Step 3: Create Checkpoint Jobs
A checkpoint job loads your application, waits for the ChReK DaemonSet to checkpoint it, and then exits.
### Required Environment Variables
Your checkpoint job MUST set these environment variables:
| Variable | Description | Example |
|----------|-------------|---------|
| `DYN_CHECKPOINT_SIGNAL_FILE` | Path where DaemonSet writes completion signal | `/checkpoint-signal/my-checkpoint.done` |
| `DYN_READY_FOR_CHECKPOINT_FILE` | Path where your app signals it's ready | `/tmp/ready-for-checkpoint` |
| `DYN_CHECKPOINT_HASH` | Unique identifier for this checkpoint | `abc123def456` |
| `DYN_CHECKPOINT_LOCATION` | Directory where checkpoint is stored | `/checkpoints/abc123def456` |
| `DYN_CHECKPOINT_STORAGE_TYPE` | Storage backend type | `pvc` |
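These values are related: in the examples in this guide, `DYN_CHECKPOINT_LOCATION` is the storage mount path joined with `DYN_CHECKPOINT_HASH`. A minimal sketch that derives all five variables from one hash is shown below. The helper names `build_checkpoint_env` and `to_k8s_env`, the default paths, and the choice to name the signal file after the hash are illustrative assumptions, not part of ChReK.

```python
import posixpath


def build_checkpoint_env(checkpoint_hash,
                         storage_root="/checkpoints",
                         signal_dir="/checkpoint-signal",
                         ready_file="/tmp/ready-for-checkpoint"):
    """Return the five checkpoint-job variables as a dict (hypothetical helper)."""
    if not checkpoint_hash.isalnum():
        raise ValueError("checkpoint hash should be an alphanumeric string")
    return {
        # Assumption: the signal file is named after the hash.
        "DYN_CHECKPOINT_SIGNAL_FILE": posixpath.join(signal_dir, f"{checkpoint_hash}.done"),
        "DYN_READY_FOR_CHECKPOINT_FILE": ready_file,
        "DYN_CHECKPOINT_HASH": checkpoint_hash,
        "DYN_CHECKPOINT_LOCATION": posixpath.join(storage_root, checkpoint_hash),
        # PVC is the only backend this guide describes as implemented.
        "DYN_CHECKPOINT_STORAGE_TYPE": "pvc",
    }


def to_k8s_env(env):
    """Convert a dict into the name/value list used under a container's `env:` key."""
    return [{"name": name, "value": value} for name, value in env.items()]
```

Generating the values in one place keeps the hash, the location, and the signal file path from drifting apart between the Job manifest and the restore pod.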
### Required Labels
Add this label to enable DaemonSet checkpoint detection:
```yaml
labels:
  nvidia.com/checkpoint-source: "true"
```
### Example Checkpoint Job
```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: checkpoint-my-model
  namespace: my-app
spec:
  template:
    metadata:
      labels:
        nvidia.com/checkpoint-source: "true"  # Required for DaemonSet detection
    spec:
      restartPolicy: Never
      # Init container to clean up stale signal files
      initContainers:
        - name: cleanup-signal-file
          image: busybox:latest
          command:
            - sh
            - -c
            - |
              rm -f /checkpoint-signal/my-checkpoint.done || true
              echo "Signal file cleanup complete"
          volumeMounts:
            - name: checkpoint-signal
              mountPath: /checkpoint-signal
      containers:
        - name: main
          image: my-app:checkpoint-enabled
          # Security context required for CRIU
          securityContext:
            privileged: true
            capabilities:
              add: ["SYS_ADMIN", "SYS_PTRACE", "SYS_CHROOT"]
          # Readiness probe: Pod becomes Ready when model is loaded
          # This is what triggers the DaemonSet to start checkpointing
          readinessProbe:
            exec:
              command: ["sh", "-c", "cat ${DYN_READY_FOR_CHECKPOINT_FILE}"]
            initialDelaySeconds: 15
            periodSeconds: 2
          # Remove liveness/startup probes for checkpoint jobs
          # Model loading can take several minutes
          livenessProbe: null
          startupProbe: null
          # Checkpoint-related environment variables
          env:
            - name: DYN_CHECKPOINT_SIGNAL_FILE
              value: "/checkpoint-signal/my-checkpoint.done"
            - name: DYN_READY_FOR_CHECKPOINT_FILE
              value: "/tmp/ready-for-checkpoint"
            - name: DYN_CHECKPOINT_HASH
              value: "abc123def456"
            - name: DYN_CHECKPOINT_LOCATION
              value: "/checkpoints/abc123def456"
            - name: DYN_CHECKPOINT_STORAGE_TYPE
              value: "pvc"
          # GPU request
          resources:
            limits:
              nvidia.com/gpu: 1
          # Required volume mounts
          volumeMounts:
            - name: checkpoint-storage
              mountPath: /checkpoints
            - name: checkpoint-signal
              mountPath: /checkpoint-signal
            - name: tmp
              mountPath: /tmp
      volumes:
        - name: checkpoint-storage
          persistentVolumeClaim:
            claimName: chrek-pvc
        - name: checkpoint-signal
          hostPath:
            path: /var/lib/chrek/signals
            type: DirectoryOrCreate
        - name: tmp
          emptyDir: {}
```
### Application Code Requirements
Your application must implement the checkpoint flow. Here's the pattern used by Dynamo vLLM:
```python
import os
import time


def main():
    # 1. Check for checkpoint mode
    signal_file = os.environ.get("DYN_CHECKPOINT_SIGNAL_FILE")
    ready_file = os.environ.get("DYN_READY_FOR_CHECKPOINT_FILE")
    restore_marker = os.environ.get("DYN_RESTORE_MARKER_FILE")
    is_checkpoint_mode = signal_file is not None

    if is_checkpoint_mode:
        print("Checkpoint mode detected")

        # 2. Load your model/application (load_model is your own function)
        model = load_model()

        # 3. Optional: Put model to sleep to reduce memory footprint
        # model.sleep()

        # 4. Write ready file (the pod's readiness probe checks for it)
        if ready_file:
            with open(ready_file, "w") as f:
                f.write("ready")
            print(f"Wrote checkpoint ready file: {ready_file}")

        # 5. Log readiness messages (helps debugging)
        print("CHECKPOINT_READY: Model loaded, ready for container checkpoint")
        print(f"CHECKPOINT_READY: Waiting for signal file: {signal_file}")
        print(f"CHECKPOINT_READY: Or restore marker file: {restore_marker}")

        # 6. Wait for checkpoint completion OR restore detection
        while True:
            # Check if we've been restored (marker file created by restore entrypoint).
            # restore_marker is None when DYN_RESTORE_MARKER_FILE is unset, and
            # os.path.exists(None) raises TypeError, so guard first.
            if restore_marker and os.path.exists(restore_marker):
                print(f"Detected restore from checkpoint (marker: {restore_marker})")
                # Continue with normal application flow
                break
            # Check if checkpoint is complete (signal file created by DaemonSet)
            if os.path.exists(signal_file):
                print(f"Checkpoint signal file detected: {signal_file}")
                print("Checkpoint complete, exiting")
                return  # Exit gracefully
            time.sleep(1)

    # Normal application flow (or post-restore flow); run_application is your own function
    run_application()
```
**Important Notes:**
1. **Ready File & Readiness Probe**: The checkpoint job must have a readiness probe that checks for the ready file:
```yaml
readinessProbe:
  exec:
    command: ["sh", "-c", "cat ${DYN_READY_FOR_CHECKPOINT_FILE}"]
  initialDelaySeconds: 15
  periodSeconds: 2
```
The ChReK DaemonSet triggers checkpointing when:
- Pod has `nvidia.com/checkpoint-source: "true"` label
- Pod status is `Ready` (readiness probe passes = ready file exists)
2. **Restore Marker**: Created by `restore-entrypoint` before CRIU restore, allows the restored process to detect it was restored
3. **Two Exit Paths**:
- **Signal file found**: Checkpoint complete, exit gracefully
- **Restore marker found**: Process was restored, continue running
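The two exit paths can be factored into a small reusable function. The sketch below is one possible shape, not the Dynamo implementation: the function name, the return strings, and the optional timeout are assumptions added for illustration.

```python
import os
import time


def wait_for_checkpoint_or_restore(signal_file, restore_marker=None,
                                   poll_seconds=1.0, timeout_seconds=None):
    """Poll until the restore marker or the signal file exists.

    Returns "restored", "checkpointed", or "timeout" (hypothetical labels).
    """
    deadline = None
    if timeout_seconds is not None:
        deadline = time.monotonic() + timeout_seconds
    while True:
        # The restore marker is checked first: a restored process should keep running.
        if restore_marker and os.path.exists(restore_marker):
            return "restored"
        # The signal file means the checkpoint finished and the job can exit.
        if signal_file and os.path.exists(signal_file):
            return "checkpointed"
        if deadline is not None and time.monotonic() >= deadline:
            return "timeout"
        time.sleep(poll_seconds)
```

A caller would exit on `"checkpointed"`, continue into its normal serving loop on `"restored"`, and decide for itself how to treat `"timeout"`.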
---
## Step 4: Restore from Checkpoints
Restore pods automatically detect and restore from checkpoints if they exist.
### Example Restore Pod
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-app-restored
  namespace: my-app
spec:
  restartPolicy: Never
  containers:
    - name: main
      image: my-app:checkpoint-enabled
      # Security context required for CRIU restore
      securityContext:
        privileged: true
        capabilities:
          add: ["SYS_ADMIN", "SYS_PTRACE", "SYS_CHROOT"]
      # Set checkpoint environment variables
      env:
        - name: DYN_CHECKPOINT_HASH
          value: "abc123def456"  # Must match checkpoint job
        - name: DYN_CHECKPOINT_PATH
          value: "/checkpoints"  # Base path (hash appended automatically)
        - name: DYN_RESTORE_MARKER_FILE
          value: "/tmp/dynamo-restored"
      # GPU request
      resources:
        limits:
          nvidia.com/gpu: 1
      # Mount checkpoint storage (READ-ONLY is fine for restore)
      volumeMounts:
        - name: checkpoint-storage
          mountPath: /checkpoints
          readOnly: true
        - name: checkpoint-signal
          mountPath: /checkpoint-signal
  volumes:
    - name: checkpoint-storage
      persistentVolumeClaim:
        claimName: chrek-pvc
    - name: checkpoint-signal
      hostPath:
        path: /var/lib/chrek/signals
        type: DirectoryOrCreate
```
### How Restore Works
1. **Smart Entrypoint Detects Checkpoint**: The `smart-entrypoint.sh` checks if a checkpoint exists at `/checkpoints/${DYN_CHECKPOINT_HASH}/`
2. **Calls Restore Entrypoint**: If found, calls `/usr/local/bin/restore-entrypoint` which invokes CRIU
3. **CRIU Restores Process**: The entire process tree is restored from the checkpoint, including GPU state
4. **Application Continues**: Your application resumes exactly where it was checkpointed
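The decision made in steps 1 and 2 can be sketched in Python as follows. This is an illustration of the logic only: the real `smart-entrypoint.sh` is a shell script, and the function name `choose_startup`, the `/checkpoints` fallback, and the use of a `checkpoint.done` marker (taken from the restore flow diagram in this guide) are assumptions here.

```python
import os


def choose_startup(env, path_exists=os.path.exists):
    """Return ("restore", checkpoint_dir) or ("cold-start", None)."""
    checkpoint_hash = env.get("DYN_CHECKPOINT_HASH")
    # Assumption: fall back to /checkpoints when no base path is given.
    base_path = env.get("DYN_CHECKPOINT_PATH", "/checkpoints")
    if not checkpoint_hash:
        return ("cold-start", None)
    # The hash is appended to the base path to locate the checkpoint.
    checkpoint_dir = os.path.join(base_path, checkpoint_hash)
    marker = os.path.join(checkpoint_dir, "checkpoint.done")
    if path_exists(marker):
        return ("restore", checkpoint_dir)
    return ("cold-start", None)
```

On `"restore"` the entrypoint would hand the directory to `restore-entrypoint`; on `"cold-start"` it would run the image's original command.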
---
## Environment Variables Reference
### Checkpoint Jobs
| Variable | Required | Description |
|----------|----------|-------------|
| `DYN_CHECKPOINT_SIGNAL_FILE` | Yes | Full path to signal file (e.g., `/checkpoint-signal/my-checkpoint.done`) |
| `DYN_READY_FOR_CHECKPOINT_FILE` | Yes | Full path where app signals readiness (e.g., `/tmp/ready-for-checkpoint`) |
| `DYN_CHECKPOINT_HASH` | Yes | Unique checkpoint identifier (alphanumeric string) |
| `DYN_CHECKPOINT_LOCATION` | Yes | Directory where checkpoint is stored (e.g., `/checkpoints/abc123`) |
| `DYN_CHECKPOINT_STORAGE_TYPE` | Yes | Storage backend: `pvc` (the only backend currently implemented; `s3` and `oci` are planned) |
### Restore Pods
| Variable | Required | Description |
|----------|----------|-------------|
| `DYN_CHECKPOINT_HASH` | Yes | Checkpoint identifier (must match checkpoint job) |
| `DYN_CHECKPOINT_PATH` | Yes | Base checkpoint directory (hash appended automatically) |
| `DYN_RESTORE_MARKER_FILE` | Yes | Path for restore marker file |
### Optional CRIU Tuning (Advanced)
| Variable | Default | Description |
|----------|---------|-------------|
| `CRIU_TIMEOUT` | `0` (unlimited) | CRIU operation timeout in seconds |
| `CRIU_LOG_LEVEL` | `4` | CRIU log verbosity (0-4) |
| `CRIU_WORK_DIR` | `/tmp` | CRIU working directory |
| `CUDA_PLUGIN_DIR` | `/usr/local/lib/criu` | Path to CRIU CUDA plugin |
| `CRIU_SKIP_IN_FLIGHT` | `false` | Skip in-flight TCP connections |
| `CRIU_AUTO_DEDUP` | `false` | Enable auto-deduplication |
| `CRIU_LAZY_PAGES` | `false` | Enable lazy page migration (experimental) |
| `WAIT_FOR_CHECKPOINT` | `false` | Wait for checkpoint to appear before starting |
| `RESTORE_WAIT_TIMEOUT` | `300` | Max seconds to wait for checkpoint |
| `DEBUG` | `false` | Enable debug mode (sleeps 300s on error) |
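To make the defaults concrete, the sketch below reads a subset of these variables from an environment mapping and applies the defaults listed in the table. How `restore-entrypoint` actually parses them is not shown in this guide, so the `CriuSettings` class, the `load_criu_settings` function, and the accepted boolean spellings are assumptions.

```python
from dataclasses import dataclass


def _as_bool(value):
    """Interpret common truthy spellings; the accepted set is an assumption."""
    return str(value).strip().lower() in ("1", "true", "yes", "on")


@dataclass(frozen=True)
class CriuSettings:
    timeout_seconds: int
    log_level: int
    work_dir: str
    skip_in_flight: bool
    wait_for_checkpoint: bool
    restore_wait_timeout: int


def load_criu_settings(env):
    """Build settings from a mapping such as os.environ, using the table's defaults."""
    log_level = int(env.get("CRIU_LOG_LEVEL", "4"))
    if not 0 <= log_level <= 4:
        raise ValueError(f"CRIU_LOG_LEVEL must be 0-4, got {log_level}")
    return CriuSettings(
        timeout_seconds=int(env.get("CRIU_TIMEOUT", "0")),  # 0 means unlimited
        log_level=log_level,
        work_dir=env.get("CRIU_WORK_DIR", "/tmp"),
        skip_in_flight=_as_bool(env.get("CRIU_SKIP_IN_FLIGHT", "false")),
        wait_for_checkpoint=_as_bool(env.get("WAIT_FOR_CHECKPOINT", "false")),
        restore_wait_timeout=int(env.get("RESTORE_WAIT_TIMEOUT", "300")),
    )
```

Calling `load_criu_settings(os.environ)` in a debugging script is a quick way to see which values a pod would effectively run with.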
---
## Checkpoint Flow Explained
### 1. Checkpoint Creation Flow
```
┌─────────────────────────────────────────────────────────────┐
│ 1. Pod starts with nvidia.com/checkpoint-source=true label │
└──────────────────────┬──────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ 2. Application loads model and creates ready file │
│ /tmp/ready-for-checkpoint │
└──────────────────────┬──────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ 3. Pod becomes Ready (kubelet readiness probe passes) │
└──────────────────────┬──────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ 4. ChReK DaemonSet detects: │
│ - Pod is Ready │
│ - Has checkpoint-source label │
│ - Ready file exists: /tmp/ready-for-checkpoint │
└──────────────────────┬──────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ 5. DaemonSet executes CRIU checkpoint via runc: │
│ - Freezes container process │
│ - Dumps memory (CPU + GPU) │
│ - Saves to /checkpoints/${HASH}/ │
└──────────────────────┬──────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ 6. DaemonSet writes signal file: │
│ /checkpoint-signal/${HASH}.done │
└──────────────────────┬──────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ 7. Application detects signal file and exits gracefully │
└─────────────────────────────────────────────────────────────┘
```
### 2. Restore Flow
```
┌─────────────────────────────────────────────────────────────┐
│ 1. Pod starts with DYN_CHECKPOINT_HASH set │
└──────────────────────┬──────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ 2. smart-entrypoint.sh checks for checkpoint: │
│ /checkpoints/${DYN_CHECKPOINT_HASH}/checkpoint.done │
└──────────────────────┬──────────────────────────────────────┘
├─ Not Found ─────────────────┐
│ │
▼ ▼
┌───────────────────────┐ ┌──────────────────────┐
│ Checkpoint exists │ │ Cold start │
└──────────┬────────────┘ │ Run original CMD │
│ └──────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ 3. Call restore-entrypoint with checkpoint path │
└──────────────────────┬──────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ 4. restore-entrypoint extracts checkpoint and calls CRIU: │
│ criu restore --images-dir /checkpoints/${HASH}/images │
└──────────────────────┬──────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ 5. CRIU restores process from checkpoint │
│ - Restores memory (CPU + GPU) │
│ - Restores file descriptors │
│ - Resumes process execution │
└──────────────────────┬──────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ 6. Application continues from checkpointed state │
│ (Model already loaded, GPU memory initialized) │
└─────────────────────────────────────────────────────────────┘
```
---
## Troubleshooting
### Checkpoint Not Created
**Symptom**: Job runs but no checkpoint appears in `/checkpoints/`
**Checks**:
1. Verify the pod has the label:
```bash
kubectl get pod <pod-name> -o jsonpath='{.metadata.labels.nvidia\.com/checkpoint-source}'
```
2. Check pod readiness:
```bash
kubectl get pod <pod-name> -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}'
```
3. Check ready file was created:
```bash
kubectl exec <pod-name> -- ls -la /tmp/ready-for-checkpoint
```
4. Check DaemonSet logs:
```bash
kubectl logs -n my-app daemonset/chrek-agent --all-containers
```
### Restore Fails
**Symptom**: Pod fails to restore from checkpoint
**Checks**:
1. Verify checkpoint files exist:
```bash
# Single quotes make the variable expand inside the pod, not in your local shell
kubectl exec <pod-name> -- sh -c 'ls -la "/checkpoints/${DYN_CHECKPOINT_HASH}/"'
```
2. Check privileged mode is enabled:
```bash
kubectl get pod <pod-name> -o jsonpath='{.spec.containers[0].securityContext.privileged}'
```
3. Check CRIU logs in `/tmp/criu-restore.log`:
```bash
kubectl exec <pod-name> -- cat /tmp/criu-restore.log
```
4. Ensure checkpoint and restore have same:
- Container image
- GPU count
- Volume mounts
- Environment variables (except POD_NAME, POD_IP, etc.)
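A quick way to spot mismatches is to diff the two container specs. The sketch below compares plain dictionaries (for example, parsed from `kubectl get pod -o json`); the function name `spec_differences` and the exact set of ignored variables are assumptions. It reports differences to review rather than a verdict, since some variables legitimately differ between a checkpoint job and a restore pod.

```python
def spec_differences(checkpoint_container, restore_container,
                     ignored_env=("POD_NAME", "POD_IP")):
    """Return {field: (checkpoint_value, restore_value)} for fields that differ."""

    def summarize(container):
        limits = container.get("resources", {}).get("limits", {})
        return {
            "image": container.get("image"),
            # Compare as strings: the API may return "1" where a manifest had 1.
            "gpu_count": str(limits.get("nvidia.com/gpu", "0")),
            "mount_paths": sorted(m["mountPath"] for m in container.get("volumeMounts", [])),
            "env": {
                e["name"]: e.get("value")
                for e in container.get("env", [])
                if e["name"] not in ignored_env
            },
        }

    a = summarize(checkpoint_container)
    b = summarize(restore_container)
    return {key: (a[key], b[key]) for key in a if a[key] != b[key]}
```

An empty result means the four compared fields match; anything else points at the field to inspect first.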
### Permission Denied Errors
**Symptom**: `CRIU: Permission denied` or `Operation not permitted`
**Solution**: Ensure pod has:
```yaml
securityContext:
  privileged: true
  capabilities:
    add:
      - SYS_ADMIN
      - SYS_PTRACE
      - SYS_CHROOT
```
### Signal File Not Appearing
**Symptom**: Application waits forever for signal file
**Checks**:
1. Verify hostPath mount is correct:
```bash
kubectl get pod <pod-name> -o jsonpath='{.spec.volumes[?(@.name=="checkpoint-signal")]}'
```
2. Check DaemonSet has access to the same path:
```bash
kubectl get daemonset -n my-app chrek-agent -o jsonpath='{.spec.template.spec.volumes[?(@.name=="signal-dir")]}'
```
3. Verify paths match exactly:
- Pod: `/var/lib/chrek/signals`
- DaemonSet: `/var/lib/chrek/signals`
---
## Additional Resources
- [ChReK Helm Chart Values](../../deploy/helm/charts/chrek/values.yaml)
- [Smart Entrypoint Script](../../deploy/chrek/scripts/smart-entrypoint.sh)
- [CRIU Documentation](https://criu.org/Main_Page)
- [CUDA Checkpoint Plugin](https://docs.nvidia.com/cuda/cuda-checkpoint-plugin/)
---
## Getting Help
If you encounter issues:
1. Check the [Troubleshooting](#troubleshooting) section
2. Review DaemonSet logs: `kubectl logs -n <namespace> daemonset/chrek-agent`
3. Open an issue on [GitHub](https://github.com/ai-dynamo/dynamo/issues)
# Creating Kubernetes Deployments
The scripts in the `examples/backends/<backend>/launch` folder, like [agg.sh](../../../examples/backends/vllm/launch/agg.sh), demonstrate how you can serve your models locally.
The corresponding YAML files like [agg.yaml](../../../examples/backends/vllm/deploy/agg.yaml) show you how you could create a Kubernetes deployment for your inference graph.
This guide explains how to create your own deployment files.
## Step 1: Choose Your Architecture Pattern
Before choosing a template, understand the different architecture patterns:
### Aggregated Serving (agg.yaml)
**Pattern**: Prefill and decode on the same GPU in a single process.
**Suggested for**:
- Small to medium models (under 70B parameters)
- Development and testing
- Low to moderate traffic
- Simplicity is prioritized over maximum throughput
**Tradeoffs**:
- Simpler setup and debugging
- Lower operational complexity
- GPU utilization may not be optimal (prefill and decode compete for resources)
- Lower throughput ceiling compared to disaggregated
**Example**: [`agg.yaml`](../../../examples/backends/vllm/deploy/agg.yaml)
### Aggregated + Router (agg_router.yaml)
**Pattern**: Load balancer routing across multiple aggregated worker instances.
**Suggested for**:
- Medium traffic requiring high availability
- Need horizontal scaling
- Want some load balancing without disaggregation complexity
**Tradeoffs**:
- Better scalability than plain aggregated
- High availability through multiple replicas
- Still has GPU underutilization issues of aggregated serving
- More complex than plain aggregated but simpler than disaggregated
**Example**: [`agg_router.yaml`](../../../examples/backends/vllm/deploy/agg_router.yaml)
### Disaggregated Serving (disagg_router.yaml)
**Pattern**: Separate prefill and decode workers with specialized optimization.
**Suggested for**:
- Production-style deployments
- High throughput requirements
- Large models (70B+ parameters)
- Maximum GPU utilization needed
**Tradeoffs**:
- Maximum performance and throughput
- Better GPU utilization (prefill and decode specialized)
- Independent scaling of prefill and decode
- More complex setup and debugging
- Requires understanding of prefill/decode separation
**Example**: [`disagg_router.yaml`](../../../examples/backends/vllm/deploy/disagg_router.yaml)
### Quick Selection Guide
Select the architecture pattern that best fits your use case and use it as your template.
For example, when using the `vLLM` backend:
- **Development / Testing**: Use [`agg.yaml`](../../../examples/backends/vllm/deploy/agg.yaml) as the base configuration.
- **Production with Load Balancing**: Use [`agg_router.yaml`](../../../examples/backends/vllm/deploy/agg_router.yaml) to enable scalable, load-balanced inference.
- **High Performance / Disaggregated Deployment**: Use [`disagg_router.yaml`](../../../examples/backends/vllm/deploy/disagg_router.yaml) for maximum throughput and modular scalability.
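The selection guidance above can be condensed into a simple rule of thumb. The sketch below encodes only the criteria named in this guide (the 70B parameter threshold, the need for horizontal scaling, and the need for maximum throughput); the function name `pick_template` and its parameters are hypothetical, and real deployments will weigh more factors.

```python
def pick_template(model_params_billions, needs_horizontal_scaling=False,
                  needs_max_throughput=False):
    """Suggest a starting template file name for the vLLM backend."""
    # Large models or throughput-critical workloads: separate prefill and decode.
    if needs_max_throughput or model_params_billions >= 70:
        return "disagg_router.yaml"
    # Several aggregated replicas behind a router for scaling and availability.
    if needs_horizontal_scaling:
        return "agg_router.yaml"
    # Simplest option for development, testing, and modest traffic.
    return "agg.yaml"
```

For instance, a small model used for testing maps to `agg.yaml`, while the same model with a horizontal scaling requirement maps to `agg_router.yaml`.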
## Step 2: Customize the Template
You can run the Frontend on one machine, for example a CPU node, and the worker on a different machine (a GPU node).
The Frontend serves as a framework-agnostic HTTP entry point and is unlikely to need many changes.
It serves the following roles:
1. OpenAI-Compatible HTTP Server
   * Provides `/v1/chat/completions` endpoint
   * Handles HTTP request/response formatting
   * Supports streaming responses
   * Validates incoming requests
2. Service Discovery and Routing
   * Auto-discovers backend workers via etcd
   * Routes requests to the appropriate Processor/Worker components
   * Handles load balancing between multiple workers
3. Request Preprocessing
   * Initial request validation
   * Model name verification
   * Request format standardization
You should then pick a worker and specialize the config. For example,
```yaml
VllmWorker:         # vLLM-specific config
  enforce-eager: true
  enable-prefix-caching: true
SglangWorker:       # SGLang-specific config
  router-mode: kv
  disagg-mode: true
TrtllmWorker:       # TensorRT-LLM-specific config
  engine-config: ./engine.yaml
  kv-cache-transfer: ucx
```
Here's a template structure based on the examples:
```yaml
YourWorker:
  dynamoNamespace: your-namespace
  componentType: worker
  replicas: N
  envFromSecret: your-secrets  # e.g., hf-token-secret
  # Health checks for worker initialization
  readinessProbe:
    exec:
      command: ["/bin/sh", "-c", 'grep "Worker.*initialized" /tmp/worker.log']
  resources:
    requests:
      gpu: "1"  # GPU allocation
  extraPodSpec:
    mainContainer:
      image: your-image
      command:
        - /bin/sh
        - -c
      args:
        - python -m dynamo.YOUR_INFERENCE_ENGINE --model YOUR_MODEL --your-flags
```
Consult the corresponding `.sh` launch script. Each Python command that launches a component goes into your YAML spec under `extraPodSpec: -> mainContainer: -> args:`.
The frontend is launched with `python3 -m dynamo.frontend [--http-port 8000] [--router-mode kv]`.
Each worker is launched with a `python -m dynamo.YOUR_INFERENCE_BACKEND --model YOUR_MODEL --your-flags` command.
If you are a Dynamo contributor, see the [dynamo run guide](../../reference/cli.md) for details on how to run this command.
## Step 3: Key Customization Points
### Model Configuration
```yaml
args:
- "python -m dynamo.YOUR_INFERENCE_BACKEND --model YOUR_MODEL --your-flag"
```
### Resource Allocation
```yaml
resources:
  requests:
    cpu: "N"
    memory: "NGi"
    gpu: "N"
```
### Scaling
```yaml
replicas: N # Number of worker instances
```
### Routing Mode
```yaml
args:
- --router-mode
- kv # Enable KV-cache routing
```
### Worker Specialization
```yaml
args:
- --is-prefill-worker # For disaggregated prefill workers
```
### Image Pull Secret Configuration
#### Automatic Discovery and Injection
By default, the Dynamo operator automatically discovers and injects image pull secrets based on container registry host matching. The operator scans Docker config secrets within the same namespace and matches their registry hostnames to the container image URLs, automatically injecting the appropriate secrets into the pod's `imagePullSecrets`.
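The host-matching idea can be illustrated with a short sketch. This is not the operator's code: the function names, the rule for extracting a registry host from an image reference, and the input shape (secret name mapped to its `.dockerconfigjson` text) are assumptions, and Docker Hub's alias hostnames are not handled.

```python
import json


def registry_host(image):
    """Extract the registry host from an image reference (docker.io if none is given)."""
    first, sep, _rest = image.partition("/")
    # A leading component is treated as a registry only if it looks like a host.
    if sep and ("." in first or ":" in first or first == "localhost"):
        return first
    return "docker.io"


def _auth_host(key):
    """Reduce an `auths` key such as https://ghcr.io/v2/ to its host part."""
    without_scheme = key.split("://", 1)[-1]
    return without_scheme.split("/", 1)[0]


def matching_pull_secrets(image, docker_config_secrets):
    """Return the names of secrets whose `auths` entries cover the image's registry."""
    host = registry_host(image)
    matches = []
    for name, config_json in docker_config_secrets.items():
        auths = json.loads(config_json).get("auths", {})
        if host in {_auth_host(key) for key in auths}:
            matches.append(name)
    return sorted(matches)
```

With a secret whose `auths` contains `nvcr.io`, an image such as `nvcr.io/nvidia/app:1` would match it, while an unqualified image such as `busybox:latest` would not.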
**Disabling Automatic Discovery:**
To disable this behavior for a component and manually control image pull secrets:
```yaml
YourWorker:
  dynamoNamespace: your-namespace
  componentType: worker
  annotations:
    nvidia.com/disable-image-pull-secret-discovery: "true"
```
When disabled, you can manually specify secrets as you would for a normal pod spec via:
```yaml
YourWorker:
  dynamoNamespace: your-namespace
  componentType: worker
  annotations:
    nvidia.com/disable-image-pull-secret-discovery: "true"
  extraPodSpec:
    imagePullSecrets:
      - name: my-registry-secret
      - name: another-secret
    mainContainer:
      image: your-image
```
This automatic discovery eliminates the need to manually configure image pull secrets for each deployment.
## Step 4: Deploy LoRA Adapters (Optional)
After your base model deployment is running, you can deploy LoRA adapters using the `DynamoModel` custom resource. This lets you serve fine-tuned variants of your model without modifying the base deployment.
To add a LoRA adapter to your deployment, link it using `modelRef` in your worker configuration:
```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: my-deployment
spec:
  services:
    Worker:
      modelRef:
        name: Qwen/Qwen3-0.6B  # Base model identifier
      componentType: worker
      # ... rest of worker config
```
Then create a `DynamoModel` resource for your LoRA:
```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoModel
metadata:
  name: my-lora
spec:
  modelName: my-custom-lora
  baseModelName: Qwen/Qwen3-0.6B  # Must match modelRef.name above
  modelType: lora
  source:
    uri: s3://my-bucket/loras/my-lora
```
**For complete details on managing models and LoRA adapters, see:**
📖 **[Managing Models with DynamoModel Guide](./dynamomodel-guide.md)**
# Managing Models with DynamoModel
## Overview
`DynamoModel` is a Kubernetes Custom Resource that represents a machine learning model deployed on Dynamo. It enables you to:
- **Deploy LoRA adapters** on top of running base models
- **Track model endpoints** and their readiness across your cluster
- **Manage model lifecycle** declaratively with Kubernetes
DynamoModel works alongside `DynamoGraphDeployment` (DGD) or `DynamoComponentDeployment` (DCD) resources. While DGD/DCD deploy the inference infrastructure (pods, services), DynamoModel handles model-specific operations like loading LoRA adapters.
## Quick Start
### Prerequisites
Before creating a DynamoModel, you need:
1. A running `DynamoGraphDeployment` or `DynamoComponentDeployment`
2. Components configured with `modelRef` pointing to your base model
3. Pods that are ready and serving your base model
For complete setup including DGD configuration, see [Integration with DynamoGraphDeployment](#integration-with-dynamographdeployment).
### Deploy a LoRA Adapter
**1. Create your DynamoModel:**
```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoModel
metadata:
  name: my-lora
  namespace: dynamo-system
spec:
  modelName: my-custom-lora
  baseModelName: Qwen/Qwen3-0.6B  # Must match modelRef.name in your DGD
  modelType: lora
  source:
    uri: s3://my-bucket/loras/my-lora
```
**2. Apply and verify:**
```bash
# Apply the DynamoModel
kubectl apply -f my-lora.yaml
# Check status
kubectl get dynamomodel my-lora
```
**Expected output:**
```
NAME TOTAL READY AGE
my-lora 2 2 30s
```
That's it! The operator automatically discovers endpoints and loads the LoRA.
For detailed status monitoring, see [Monitoring & Operations](#monitoring--operations).
## Understanding DynamoModel
### Model Types
DynamoModel supports three model types:
| Type | Description | Use Case |
|------|-------------|----------|
| **`base`** | Reference to an existing base model | Tracking endpoints for a base model (default) |
| **`lora`** | LoRA adapter that extends a base model | Deploy fine-tuned adapters on existing models |
| **`adapter`** | Generic model adapter | Future extensibility for other adapter types |
Most users will use **`lora`** to deploy fine-tuned models on top of their base model deployments.
### How It Works
When you create a DynamoModel, the operator:
1. **Discovers endpoints**: Finds all pods running your `baseModelName` (by matching `modelRef.name` in DGD/DCD)
2. **Creates service**: Automatically creates a Kubernetes Service to track these pods
3. **Loads LoRA**: Calls the LoRA load API on each endpoint (for `lora` type)
4. **Updates status**: Reports which endpoints are ready
**Key linkage:**
```yaml
# DGD modelRef.name ↔ DynamoModel baseModelName must match
Worker:
  modelRef:
    name: Qwen/Qwen3-0.6B
---
spec:
  baseModelName: Qwen/Qwen3-0.6B
```
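A simple pre-flight check can catch a mismatched name before you apply the resource. The sketch below works on manifests already parsed into dictionaries; the function name `services_serving_base_model` is hypothetical, and the operator's own discovery logic may differ.

```python
def services_serving_base_model(dgd, dynamo_model):
    """Return the DGD service names whose modelRef.name equals the model's baseModelName."""
    base = dynamo_model["spec"]["baseModelName"]
    services = dgd.get("spec", {}).get("services", {})
    return sorted(
        name
        for name, service in services.items()
        # The comparison is an exact string match, including case.
        if service.get("modelRef", {}).get("name") == base
    )
```

An empty list means no service references the base model, so the DynamoModel would have no endpoints to attach to.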
## Configuration Overview
DynamoModel requires just a few key fields to deploy a model or adapter:
| Field | Required | Purpose | Example |
|-------|----------|---------|---------|
| `modelName` | Yes | Model identifier | `my-custom-lora` |
| `baseModelName` | Yes | Links to DGD modelRef | `Qwen/Qwen3-0.6B` |
| `modelType` | No | Type: base/lora/adapter | `lora` (default: `base`) |
| `source.uri` | For LoRA | Model location | `s3://bucket/path` or `hf://org/model` |
**Example minimal LoRA configuration:**
```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoModel
metadata:
  name: my-lora
spec:
  modelName: my-custom-lora
  baseModelName: Qwen/Qwen3-0.6B
  modelType: lora
  source:
    uri: s3://my-bucket/my-lora
```
**For complete field specifications, validation rules, and all options, see:**
📖 [DynamoModel API Reference](../api_reference.md#dynamomodel)
### Status Summary
The status shows discovered endpoints and their readiness:
```bash
kubectl get dynamomodel my-lora
```
**Key status fields:**
- `totalEndpoints` / `readyEndpoints`: Counts of discovered vs ready endpoints
- `endpoints[]`: List with addresses, pod names, and ready status
- `conditions`: Standard Kubernetes conditions (EndpointsReady, ServicesFound)
For detailed status usage, see the [Monitoring & Operations](#monitoring--operations) section below.
## Common Use Cases
### Use Case 1: S3-Hosted LoRA Adapter
Deploy a LoRA adapter stored in an S3 bucket.
```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoModel
metadata:
  name: customer-support-lora
  namespace: production
spec:
  modelName: customer-support-adapter-v1
  baseModelName: meta-llama/Llama-3.3-70B-Instruct
  modelType: lora
  source:
    uri: s3://my-models-bucket/loras/customer-support/v1
```
**Prerequisites:**
- S3 bucket accessible from your pods (IAM role or credentials)
- Base model `meta-llama/Llama-3.3-70B-Instruct` running via DGD/DCD
**Verification:**
```bash
# Check LoRA is loaded
kubectl get dynamomodel customer-support-lora -o jsonpath='{.status.readyEndpoints}'
# Should output: 2 (or your number of replicas)
# View which pods are serving
kubectl get dynamomodel customer-support-lora -o jsonpath='{.status.endpoints[*].podName}'
```
### Use Case 2: HuggingFace-Hosted LoRA
Deploy a LoRA adapter from HuggingFace Hub.
```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoModel
metadata:
  name: multilingual-lora
  namespace: dynamo-system
spec:
  modelName: multilingual-adapter
  baseModelName: Qwen/Qwen3-0.6B
  modelType: lora
  source:
    uri: hf://myorg/qwen-multilingual-lora@v1.0.0 # Optional: @revision
```
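To make the `hf://org/model@revision` form concrete, here is a small Python sketch that splits such a URI into its repository and optional revision. It only illustrates the URI shape shown above; how Dynamo parses the URI internally is not described here, and `parse_hf_uri` is a hypothetical helper.

```python
from typing import Optional, Tuple


def parse_hf_uri(uri: str) -> Tuple[str, Optional[str]]:
    """Split 'hf://org/model[@revision]' into (repo_id, revision)."""
    prefix = "hf://"
    if not uri.startswith(prefix):
        raise ValueError(f"not an hf:// URI: {uri}")
    # Everything after the first '@' is treated as the revision.
    repo_id, sep, revision = uri[len(prefix):].partition("@")
    return repo_id, (revision if sep else None)


print(parse_hf_uri("hf://myorg/qwen-multilingual-lora@v1.0.0"))
```

When no `@revision` suffix is present, the second element is `None`.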
**Prerequisites:**
- HuggingFace Hub accessible from your pods
- If private repo: HF token configured as secret and mounted in pods
- Base model `Qwen/Qwen3-0.6B` running via DGD/DCD
**With HuggingFace token:**
```yaml
# In your DGD/DCD
spec:
  services:
    worker:
      envFromSecret: hf-token-secret # Provides HF_TOKEN env var
      modelRef:
        name: Qwen/Qwen3-0.6B
      # ... rest of config
```
### Use Case 3: Multiple LoRAs on Same Base Model
Deploy multiple LoRA adapters on the same base model deployment.
```yaml
---
# LoRA for customer support
apiVersion: nvidia.com/v1alpha1
kind: DynamoModel
metadata:
  name: support-lora
spec:
  modelName: support-adapter
  baseModelName: Qwen/Qwen3-0.6B
  modelType: lora
  source:
    uri: s3://models/support-lora
---
# LoRA for code generation
apiVersion: nvidia.com/v1alpha1
kind: DynamoModel
metadata:
  name: code-lora
spec:
  modelName: code-adapter
  baseModelName: Qwen/Qwen3-0.6B # Same base model
  modelType: lora
  source:
    uri: s3://models/code-lora
```
Both LoRAs will be loaded on all pods serving `Qwen/Qwen3-0.6B`. Your application can then route requests to the appropriate adapter.
## Monitoring & Operations
### Checking Status
**Quick status check:**
```bash
kubectl get dynamomodel
```
**Example output:**
```
NAME            TOTAL   READY   AGE
my-lora         2       2       5m
customer-lora   4       3       2h
```
**Detailed status:**
```bash
kubectl describe dynamomodel my-lora
```
**Example output:**
```
Name:         my-lora
Namespace:    dynamo-system
Spec:
  Model Name:       my-custom-lora
  Base Model Name:  Qwen/Qwen3-0.6B
  Model Type:       lora
  Source:
    Uri:  s3://my-bucket/my-lora
Status:
  Ready Endpoints:  2
  Total Endpoints:  2
  Endpoints:
    Address:   http://10.0.1.5:9090
    Pod Name:  worker-0
    Ready:     true
    Address:   http://10.0.1.6:9090
    Pod Name:  worker-1
    Ready:     true
  Conditions:
    Type:    EndpointsReady
    Status:  True
    Reason:  EndpointsDiscovered
Events:
  Type    Reason          Message
  ----    ------          -------
  Normal  EndpointsReady  Discovered 2 ready endpoints for base model Qwen/Qwen3-0.6B
```
### Understanding Readiness
An endpoint is **ready** when:
1. The pod is running and healthy
2. The LoRA load API call succeeded
**Condition states:**
- `EndpointsReady=True`: All endpoints are ready (full availability)
- `EndpointsReady=False, Reason=NotReady`: Not all endpoints ready (check message for counts)
- `EndpointsReady=False, Reason=NoEndpoints`: No endpoints found
When `readyEndpoints < totalEndpoints`, the operator automatically retries loading every 30 seconds.
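The condition states above map directly onto the two endpoint counters. The sketch below restates that mapping in Python so it can be reused in scripts or alerts. It is derived from the states and example output in this guide, not from the operator source, and the function name is hypothetical.

```python
from typing import Tuple


def endpoints_ready_condition(total: int, ready: int) -> Tuple[str, str]:
    """Return (status, reason) for the EndpointsReady condition."""
    if total == 0:
        # Nothing was discovered for the base model.
        return "False", "NoEndpoints"
    if ready < total:
        # Some endpoints exist but have not loaded the LoRA yet.
        return "False", "NotReady"
    # Reason taken from the example `kubectl describe` output above.
    return "True", "EndpointsDiscovered"
```

For example, `endpoints_ready_condition(4, 3)` corresponds to the `customer-lora` row in the quick status output.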
### Viewing Endpoints
**Get endpoint addresses:**
```bash
kubectl get dynamomodel my-lora -o jsonpath='{.status.endpoints[*].address}' | tr ' ' '\n'
```
**Output:**
```
http://10.0.1.5:9090
http://10.0.1.6:9090
```
**Get endpoint pod names:**
```bash
kubectl get dynamomodel my-lora -o jsonpath='{.status.endpoints[*].podName}' | tr ' ' '\n'
```
**Check readiness of each endpoint:**
```bash
kubectl get dynamomodel my-lora -o json | jq '.status.endpoints[] | {podName, ready}'
```
**Output:**
```json
{
  "podName": "worker-0",
  "ready": true
}
{
  "podName": "worker-1",
  "ready": true
}
```
### Updating a Model
To update a LoRA (e.g., deploy a new version):
```bash
# Edit the source URI
kubectl edit dynamomodel my-lora
# Or apply an updated YAML
kubectl apply -f my-lora-v2.yaml
```
The operator will detect the change and reload the LoRA on all endpoints.
### Deleting a Model
```bash
kubectl delete dynamomodel my-lora
```
For LoRA models, the operator will:
1. Unload the LoRA from all endpoints
2. Clean up associated resources
3. Remove the DynamoModel CR
The base model deployment (DGD/DCD) continues running normally.
## Troubleshooting
### No Endpoints Found
**Symptom:**
```yaml
status:
  totalEndpoints: 0
  readyEndpoints: 0
  conditions:
  - type: EndpointsReady
    status: "False"
    reason: NoEndpoints
    message: "No endpoint slices found for base model Qwen/Qwen3-0.6B"
```
**Common Causes:**
1. **Base model deployment not running**
```bash
# Check if pods exist
kubectl get pods -l nvidia.com/dynamo-component-type=worker
```
**Solution:** Deploy your DGD/DCD first, wait for pods to be ready.
2. **`baseModelName` mismatch**
```bash
# Check modelRef in your DGD
kubectl get dynamographdeployment my-deployment -o yaml | grep -A2 modelRef
```
**Solution:** Ensure `baseModelName` in DynamoModel exactly matches `modelRef.name` in DGD.
3. **Pods not ready**
```bash
# Check pod status
kubectl get pods -l nvidia.com/dynamo-component-type=worker
```
**Solution:** Wait for pods to reach `Running` and `Ready` state.
4. **Wrong namespace**
**Solution:** Ensure DynamoModel is in the same namespace as your DGD/DCD.
### LoRA Load Failures
**Symptom:**
```yaml
status:
  totalEndpoints: 2
  readyEndpoints: 0 # ← No endpoints ready despite pods existing
  conditions:
  - type: EndpointsReady
    status: "False"
    reason: NoReadyEndpoints
```
**Common Causes:**
1. **Source URI not accessible**
```bash
# Check operator logs
kubectl logs -n dynamo-system deployment/dynamo-operator-controller-manager -f | grep "Failed to load"
```
**Solution:**
- For S3: Verify bucket permissions, IAM role, credentials
- For HuggingFace: Verify token is valid, repo exists and is accessible
2. **Invalid LoRA format**
**Solution:** Ensure your LoRA weights are in the format expected by your backend framework (vLLM, SGLang, etc.)
3. **Endpoint API errors**
```bash
# Check operator logs for HTTP errors
kubectl logs -n dynamo-system deployment/dynamo-operator-controller-manager | grep "error"
```
**Solution:** Check the backend framework's logs in the worker pods:
```bash
kubectl logs worker-0
```
4. **Out of memory**
**Solution:** LoRA adapters require additional memory. Increase memory limits in your DGD:
```yaml
resources:
  limits:
    memory: "32Gi" # Increase if needed
```
### Status Shows Not Ready
**Symptom:**
Some endpoints remain not ready for extended periods.
**Diagnosis:**
```bash
# Check which endpoints are not ready
kubectl get dynamomodel my-lora -o json | jq '.status.endpoints[] | select(.ready == false)'
# View operator logs for that specific pod
kubectl logs -n dynamo-system deployment/dynamo-operator-controller-manager | grep "worker-0"
# Check the worker pod logs
kubectl logs worker-0 | tail -50
```
**Common Causes:**
1. **Network issues**: Pod can't reach S3/HuggingFace
2. **Resource constraints**: Pod is OOMing or being throttled
3. **API endpoint not responding**: Backend framework isn't serving the LoRA API
**When to wait vs investigate:**
- **Wait**: If readyEndpoints is increasing over time (LoRAs loading progressively)
- **Investigate**: If stuck at same readyEndpoints for >5 minutes
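The wait-versus-investigate rule can be automated if you sample `readyEndpoints` periodically. Below is a rough Python sketch under stated assumptions: you collect `(timestamp_seconds, readyEndpoints)` samples yourself (for example from `kubectl get dynamomodel ... -o jsonpath`), and the 300-second threshold reflects the five-minute guidance above. The function name and sampling format are hypothetical.

```python
from typing import List, Tuple


def should_investigate(
    samples: List[Tuple[float, int]],
    total_endpoints: int,
    stall_seconds: float = 300.0,
) -> bool:
    """Return True if readyEndpoints has not increased for stall_seconds.

    samples must be ordered oldest first.
    """
    if not samples:
        return False
    latest_ts, latest_ready = samples[-1]
    if latest_ready >= total_endpoints:
        return False  # Everything is ready; nothing to investigate.
    # Track the most recent time the ready count went up.
    last_progress = samples[0][0]
    for (_, prev_ready), (ts, ready) in zip(samples, samples[1:]):
        if ready > prev_ready:
            last_progress = ts
    return latest_ts - last_progress > stall_seconds
```

A steadily rising count keeps the result `False`, while a count that stays flat for longer than the threshold flips it to `True`.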
### Viewing Events and Logs
**Check events:**
```bash
kubectl describe dynamomodel my-lora | tail -20
```
**View operator logs:**
```bash
# Follow logs
kubectl logs -n dynamo-system deployment/dynamo-operator-controller-manager -f
# Filter for specific model
kubectl logs -n dynamo-system deployment/dynamo-operator-controller-manager | grep "my-lora"
```
**Common events and messages:**
| Event/Message | Meaning | Action |
|---------------|---------|--------|
| `EndpointsReady` | All endpoints are ready | ✅ Good - full service availability |
| `NotReady` | Not all endpoints ready | ⚠️ Check readyEndpoints count - operator will retry |
| `PartialEndpointFailure` | Some endpoints failed to load | Check logs for errors |
| `NoEndpointsFound` | No pods discovered | Verify DGD running and modelRef matches |
| `EndpointDiscoveryFailed` | Can't query endpoints | Check operator RBAC permissions |
| `Successfully reconciled` | Reconciliation complete | ✅ Good |
## Integration with DynamoGraphDeployment
This section shows the complete end-to-end workflow for deploying base models and LoRA adapters together.
DynamoModel and DynamoGraphDeployment work together to provide complete model deployment:
- **DGD**: Deploys the infrastructure (pods, services, resources)
- **DynamoModel**: Manages model-specific operations (LoRA loading)
### Linking Models to Components
The connection is established through the `modelRef` field in your DGD:
**Complete example:**
```yaml
---
# 1. Deploy the base model infrastructure
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: my-deployment
spec:
  backendFramework: vllm
  services:
    Frontend:
      componentType: frontend
      replicas: 1
      dynamoNamespace: my-app
      extraPodSpec:
        mainContainer:
          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:latest
    Worker:
      # This modelRef creates the link to DynamoModel
      modelRef:
        name: Qwen/Qwen3-0.6B # ← Key linking field
      componentType: worker
      replicas: 2
      resources:
        limits:
          gpu: "1"
      extraPodSpec:
        mainContainer:
          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:latest
          args:
            - --model
            - Qwen/Qwen3-0.6B
            - --tensor-parallel-size
            - "1"
---
# 2. Deploy LoRA adapters on top
apiVersion: nvidia.com/v1alpha1
kind: DynamoModel
metadata:
  name: my-lora
spec:
  modelName: my-custom-lora
  baseModelName: Qwen/Qwen3-0.6B # ← Must match modelRef.name above
  modelType: lora
  source:
    uri: s3://my-bucket/loras/my-lora
```
### Deployment Workflow
**Recommended order:**
```bash
# 1. Deploy base model infrastructure
kubectl apply -f my-deployment.yaml
# 2. Wait for pods to be ready
kubectl wait --for=condition=ready pod -l nvidia.com/dynamo-component-type=worker --timeout=5m
# 3. Deploy LoRA adapters
kubectl apply -f my-lora.yaml
# 4. Verify LoRA is loaded
kubectl get dynamomodel my-lora
```
**What happens behind the scenes:**
| Step | DGD | DynamoModel |
|------|-----|-------------|
| 1 | Creates pods with modelRef | - |
| 2 | Pods become running and ready | - |
| 3 | - | CR created, discovers endpoints via auto-created Service |
| 4 | - | Calls LoRA load API on each endpoint |
| 5 | - | All endpoints ready ✓ |
The operator automatically handles all service discovery - you don't configure services, labels, or selectors manually.
## API Reference
For complete field specifications, validation rules, and detailed type definitions, see:
**📖 [Dynamo CRD API Reference](../api_reference.md#dynamomodel)**
## Summary
DynamoModel provides declarative model management for Dynamo deployments:
- ✅ **Simple**: 2-step deployment of LoRA adapters
- ✅ **Automatic**: Endpoint discovery and loading handled by operator
- ✅ **Observable**: Rich status reporting and conditions
- ✅ **Integrated**: Works seamlessly with DynamoGraphDeployment
**Next Steps:**
- Try the [Quick Start](#quick-start) example
- Explore [Common Use Cases](#common-use-cases)
- Check the [API Reference](../api_reference.md#dynamomodel) for advanced configuration
<!--
SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
# Minikube Setup Guide
Don't have a Kubernetes cluster? No problem! You can set up a local development environment using Minikube. This guide walks through the setup of everything you need to run Dynamo Kubernetes Platform locally.
## 1. Install Minikube
First things first! Start by installing Minikube. Follow the official [Minikube installation guide](https://minikube.sigs.k8s.io/docs/start/) for your operating system.
## 2. Configure GPU Support (Optional)
Planning to use GPU-accelerated workloads? You'll need to configure GPU support in Minikube. Follow the [Minikube GPU guide](https://minikube.sigs.k8s.io/docs/tutorials/nvidia/) to set up NVIDIA GPU support before proceeding.
> [!TIP]
> Make sure to configure GPU support before starting Minikube if you plan to use GPU workloads!
## 3. Start Minikube
Time to launch your local cluster!
```bash
# Start Minikube with GPU support (if configured)
minikube start --driver docker --container-runtime docker --gpus all --memory=16000mb --cpus=8
# Enable required addons
minikube addons enable istio-provisioner
minikube addons enable istio
minikube addons enable storage-provisioner-rancher
```
## 4. Verify Installation
Let's make sure everything is working correctly!
```bash
# Check Minikube status
minikube status
# Verify Istio installation
kubectl get pods -n istio-system
# Verify storage class
kubectl get storageclass
```
## Next Steps
Once your local environment is set up, you can proceed with the [Dynamo Kubernetes Platform installation guide](../installation_guide.md) to deploy the platform to your local cluster.
# Multinode Deployment Guide
This guide explains how to deploy Dynamo workloads across multiple nodes. Multinode deployments enable you to scale compute-intensive LLM workloads across multiple physical machines, maximizing GPU utilization and supporting larger models.
## Overview
Dynamo supports multinode deployments through the `multinode` section in resource specifications. This allows you to:
- Distribute workloads across multiple physical nodes
- Scale GPU resources beyond a single machine
- Support large models requiring extensive tensor parallelism
- Achieve high availability and fault tolerance
## Basic Requirements
- **Kubernetes Cluster**: Version 1.24 or later
- **GPU Nodes**: Multiple nodes with NVIDIA GPUs
- **High-Speed Networking**: InfiniBand, RoCE, or high-bandwidth Ethernet (recommended for optimal performance)
### Advanced Multinode Orchestration
#### Using Grove (default)
For sophisticated multinode deployments, Dynamo integrates with advanced Kubernetes orchestration systems:
- **[Grove](https://github.com/NVIDIA/grove)**: Network topology-aware gang scheduling and auto-scaling for AI workloads
- **[KAI-Scheduler](https://github.com/NVIDIA/KAI-Scheduler)**: Kubernetes native scheduler optimized for AI workloads at scale
These systems provide enhanced scheduling capabilities including topology-aware placement, gang scheduling, and coordinated auto-scaling across multiple nodes.
**Features Enabled with Grove:**
- Declarative composition of AI workloads
- Multi-level horizontal auto-scaling
- Custom startup ordering for components
- Resource-aware rolling updates
[KAI-Scheduler](https://github.com/NVIDIA/KAI-Scheduler) is a Kubernetes native scheduler optimized for AI workloads at large scale.
**Features Enabled with KAI-Scheduler:**
- Gang scheduling
- Network topology-aware pod placement
- AI workload-optimized scheduling algorithms
- GPU resource awareness and allocation
- Support for complex scheduling constraints
- Integration with Grove for enhanced capabilities
- Performance optimizations for large-scale deployments
##### Prerequisites
- [Grove](https://github.com/NVIDIA/grove/blob/main/docs/installation.md) installed on the cluster
- (Optional) [KAI-Scheduler](https://github.com/NVIDIA/KAI-Scheduler) installed on the cluster with the default queue name `dynamo` created. If no queue annotation is specified on the DGD resource, the operator uses the `dynamo` queue by default. Custom queue names can be specified via the `nvidia.com/kai-scheduler-queue` annotation, but the queue must exist in the cluster before deployment.
KAI-Scheduler is optional but recommended for advanced scheduling capabilities.
#### Using LWS and Volcano
LWS (LeaderWorkerSet) is a simple multinode deployment mechanism that deploys a workload as a group of leader and worker pods spread across multiple nodes.
- **LWS**: [LWS Installation](https://github.com/kubernetes-sigs/lws#installation)
- **Volcano**: [Volcano Installation](https://volcano.sh/en/docs/installation/)
Volcano is a Kubernetes native scheduler optimized for AI workloads at scale. It is used in conjunction with LWS to provide gang scheduling support.
## Core Concepts
### Orchestrator Selection Algorithm
Dynamo automatically selects the best available orchestrator for multinode deployments using the following logic:
#### When Both Grove and LWS are Available:
- **Grove is selected by default** (recommended for advanced AI workloads)
- **LWS is selected** if you explicitly set `nvidia.com/enable-grove: "false"` annotation on your DGD resource
#### When Only One Orchestrator is Available:
- The installed orchestrator (Grove or LWS) is automatically selected
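The selection rules above can be summarized in a few lines of Python. This is an illustrative sketch of the documented behavior only, not the operator's code. The function name is hypothetical, and raising an error when neither orchestrator is installed is an assumption, since this guide does not describe that case.

```python
from typing import Dict


def select_orchestrator(
    grove_available: bool,
    lws_available: bool,
    annotations: Dict[str, str],
) -> str:
    """Return 'grove' or 'lws' following the documented selection rules."""
    if grove_available and lws_available:
        # Grove wins unless the DGD explicitly opts out via annotation.
        if annotations.get("nvidia.com/enable-grove") == "false":
            return "lws"
        return "grove"
    if grove_available:
        return "grove"
    if lws_available:
        return "lws"
    # Assumption: the guide does not say what happens in this case.
    raise RuntimeError("no multinode orchestrator (Grove or LWS) is installed")


print(select_orchestrator(True, True, {"nvidia.com/enable-grove": "false"}))  # lws
```

Note that the annotation only matters in this sketch when both orchestrators are present, which matches the wording of the rules above.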
#### Scheduler Integration:
- **With Grove**: Automatically integrates with [KAI-Scheduler](https://github.com/NVIDIA/KAI-Scheduler) when available, providing:
- Advanced queue management via `nvidia.com/kai-scheduler-queue` annotation
- AI-optimized scheduling policies
- Resource-aware workload placement
- **With LWS**: Uses Volcano scheduler for gang scheduling and resource coordination
#### Configuration Examples:
**Default (Grove with KAI-Scheduler):**
```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: my-multinode-deployment
  annotations:
    nvidia.com/kai-scheduler-queue: "dynamo"
spec:
  # ... your deployment spec
```
> **Note:** The `nvidia.com/kai-scheduler-queue` annotation defaults to `"dynamo"`. If you specify a custom queue name, ensure the queue exists in your cluster before deploying. You can verify available queues with `kubectl get queues`.
**Force LWS usage:**
```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: my-multinode-deployment
  annotations:
    nvidia.com/enable-grove: "false"
spec:
  # ... your deployment spec
```
### The `multinode` Section
The `multinode` section in a resource specification defines how many physical nodes the workload should span:
```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: my-multinode-deployment
spec:
  # ... your deployment spec
  services:
    my-service:
      ...
      multinode:
        nodeCount: 2
      resources:
        limits:
          gpu: "2" # 2 GPUs per node
```
### GPU Distribution
The relationship between `multinode.nodeCount` and `gpu` is multiplicative:
- **`multinode.nodeCount`**: Number of physical nodes
- **`gpu`**: Number of GPUs per node
- **Total GPUs**: `multinode.nodeCount × gpu`
**Example:**
- `multinode.nodeCount: 2` + `gpu: "4"` = 8 total GPUs (4 GPUs per node across 2 nodes)
- `multinode.nodeCount: 4` + `gpu: "8"` = 32 total GPUs (8 GPUs per node across 4 nodes)
### Tensor Parallelism Alignment
The tensor parallelism (`tp-size` or `--tp`) in your command/args must match the total number of GPUs:
```yaml
# Example: 2 multinode.nodeCount × 4 GPUs = 8 total GPUs
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: my-multinode-deployment
spec:
  # ... your deployment spec
  services:
    my-service:
      ...
      multinode:
        nodeCount: 2
      resources:
        limits:
          gpu: "4"
      extraPodSpec:
        mainContainer:
          ...
          args:
            # Command args must use tp-size=8
            - "--tp-size"
            - "8" # Must equal multinode.nodeCount × gpu
```
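A mismatch between the tensor-parallel size and the total GPU count is easy to catch before deploying. The following Python sketch checks the alignment rule for a service's `args` list. It assumes the flag is passed as a separate list element followed by its value, as in the example above, and `check_tp_alignment` is a hypothetical helper rather than part of Dynamo.

```python
from typing import List


def check_tp_alignment(node_count: int, gpus_per_node: int, args: List[str]) -> bool:
    """Return True if the tensor-parallel size equals nodeCount × gpu."""
    total_gpus = node_count * gpus_per_node
    for flag in ("--tp-size", "--tp"):
        if flag in args:
            idx = args.index(flag)
            if idx + 1 >= len(args):
                raise ValueError(f"{flag} has no value")
            return int(args[idx + 1]) == total_gpus
    raise ValueError("no tensor-parallel flag found in args")


# Values from the example above: 2 nodes × 4 GPUs with --tp-size 8.
print(check_tp_alignment(2, 4, ["--tp-size", "8"]))  # True
```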
## Backend-Specific Operator Behavior
When you deploy a multinode workload, the Dynamo operator automatically applies backend-specific configurations to enable distributed execution. Understanding these automatic modifications helps troubleshoot issues and optimize your deployments.
### vLLM Backend
For vLLM multinode deployments, the operator automatically selects and configures the appropriate distributed execution mode based on your parallelism settings:
#### Deployment Modes
The operator automatically determines the deployment mode based on your parallelism configuration:
**1. Tensor/Pipeline Parallelism Mode (Single model across nodes)**
- **When used**: When `world_size > GPUs_per_node` where `world_size = tensor_parallel_size × pipeline_parallel_size`
- **Use case**: Distributing a single model instance across multiple nodes using tensor or pipeline parallelism
The operator uses Ray for multi-node tensor/pipeline parallel deployments. Ray provides automatic placement group management and worker spawning across nodes.
**Leader Node:**
- **Command**: `ray start --head --port=6379 && <original-vllm-command> --distributed-executor-backend ray`
- **Behavior**: Starts Ray head node, then runs vLLM which creates a placement group spanning all Ray workers
- **Probes**: All health probes remain active (liveness, readiness, startup)
**Worker Nodes:**
- **Command**: `ray start --address=<leader-hostname>:6379 --block`
- **Behavior**: Joins Ray cluster and blocks; vLLM on leader spawns Ray actors to these workers
- **Probes**: All probes (liveness, readiness, startup) are automatically removed
> **Note**: vLLM's Ray executor automatically creates a placement group and spawns workers across the cluster. The `--nnodes` flag is NOT used with Ray - it's only compatible with the `mp` backend.
**2. Data Parallel Mode (Multiple model instances across nodes)**
- **When used**: When `world_size × data_parallel_size > GPUs_per_node`
- **Use case**: Running multiple independent model instances across nodes with data parallelism (e.g., MoE models with expert parallelism)
**All Nodes (Leader and Workers):**
- **Injected Flags**:
- `--data-parallel-address <leader-hostname>` - Address of the coordination server
- `--data-parallel-size-local <value>` - Number of data parallel workers per node
- `--data-parallel-rpc-port 13445` - RPC port for data parallel coordination
- `--data-parallel-start-rank <value>` - Starting rank for this node (calculated automatically)
- **Probes**: Worker probes are removed; leader probes remain active
**Note**: The operator intelligently injects these flags into your command regardless of command structure (direct Python commands or shell wrappers).
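The two "When used" conditions can be read as a small decision function. The Python sketch below encodes them as written in this section. Checking the tensor/pipeline condition first and returning `"single-node"` when neither condition holds are assumptions, and the function name is hypothetical.

```python
def vllm_multinode_mode(
    tensor_parallel_size: int,
    pipeline_parallel_size: int,
    data_parallel_size: int,
    gpus_per_node: int,
) -> str:
    """Classify a vLLM deployment using the conditions described above."""
    world_size = tensor_parallel_size * pipeline_parallel_size
    if world_size > gpus_per_node:
        # One model instance spans nodes; the operator uses Ray here.
        return "tensor-pipeline-parallel"
    if world_size * data_parallel_size > gpus_per_node:
        # Several model instances spread across nodes.
        return "data-parallel"
    # Assumption: everything fits on one node, so no multinode mode applies.
    return "single-node"
```

For instance, `tensor_parallel_size=8` on nodes with 4 GPUs falls into the tensor/pipeline mode, while `tensor_parallel_size=2` with `data_parallel_size=4` on the same nodes falls into the data parallel mode.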
#### Why Ray for Multi-Node TP/PP?
vLLM supports two distributed executor backends: `ray` and `mp`. For multi-node deployments:
- **Ray executor**: vLLM creates a placement group and spawns Ray actors across the cluster. Workers don't run vLLM directly - the leader's vLLM process manages everything.
- **mp executor**: Each node must run its own vLLM process with `--nnodes`, `--node-rank`, `--master-addr`, `--master-port`. This approach is more complex to orchestrate.
The Dynamo operator uses Ray because:
1. It aligns with vLLM's official multi-node documentation (see `multi-node-serving.sh`)
2. Simpler orchestration - only the leader runs vLLM, workers just need Ray agents
3. vLLM automatically handles placement group creation and worker management
#### Compilation Cache Support
When a volume mount is configured with `useAsCompilationCache: true`, the operator automatically sets:
- **`VLLM_CACHE_ROOT`**: Environment variable pointing to the cache mount point
### SGLang Backend
For SGLang multinode deployments, the operator injects distributed initialization flags:
#### Leader Node
- **Distributed Flags**: Injects `--dist-init-addr <leader-hostname>:29500 --nnodes <count> --node-rank 0`
- **Probes**: All health probes remain active
#### Worker Nodes
- **Distributed Flags**: Injects `--dist-init-addr <leader-hostname>:29500 --nnodes <count> --node-rank <dynamic-rank>`
- The `node-rank` is automatically determined from the pod's stateful identity
- **Probes**: All probes (liveness, readiness, startup) are automatically removed
**Note:** The operator intelligently injects these flags regardless of your command structure (direct Python commands or shell wrappers).
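To see what the injected flags look like for a given node, here is a short Python sketch that builds them from the values described above. It mirrors the documented flag names and port only. How the operator derives the rank from the pod identity is not shown, and `sglang_dist_flags` is a hypothetical helper.

```python
from typing import List


def sglang_dist_flags(leader_hostname: str, node_count: int, node_rank: int) -> List[str]:
    """Build the distributed flags for one node (rank 0 is the leader)."""
    if not 0 <= node_rank < node_count:
        raise ValueError("node_rank must be in [0, node_count)")
    return [
        "--dist-init-addr", f"{leader_hostname}:29500",
        "--nnodes", str(node_count),
        "--node-rank", str(node_rank),
    ]


# "leader-0" is a placeholder hostname used only for this example.
print(" ".join(sglang_dist_flags("leader-0", 2, 1)))
```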
### TensorRT-LLM Backend
For TensorRT-LLM multinode deployments, the operator configures MPI-based communication:
#### Leader Node
- **SSH Configuration**: Automatically sets up SSH keys and configuration from a Kubernetes secret
- **MPI Command**: Wraps your command in an `mpirun` command with:
- Proper host list including all worker nodes
- SSH configuration for passwordless authentication on port 2222
- Environment variable propagation to all nodes
- Activation of the Dynamo virtual environment
- **Probes**: All health probes remain active
#### Worker Nodes
- **SSH Daemon**: Replaces your command with SSH daemon setup and execution
- Generates host keys in user-writable directories (non-privileged)
- Configures SSH daemon to listen on port 2222
- Sets up authorized keys for leader access
- **Probes**:
- **Liveness and Startup**: Removed (workers run SSH daemon, not the main application)
- **Readiness**: Replaced with TCP socket check on SSH port 2222
- Initial Delay: 20 seconds
- Period: 20 seconds
- Timeout: 5 seconds
- Failure Threshold: 10
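When debugging a worker that never becomes ready, it can help to reproduce the readiness check by hand. The sketch below performs a single TCP connection attempt, which is roughly what a TCP socket probe does. It is a standalone diagnostic, not the kubelet's implementation, and the host you pass in (for example a pod IP) is up to you.

```python
import socket

# Probe settings listed above, kept here for reference.
READINESS_PROBE = {
    "port": 2222,
    "initial_delay_s": 20,
    "period_s": 20,
    "timeout_s": 5,
    "failure_threshold": 10,
}


def tcp_port_open(host: str, port: int = 2222, timeout_s: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout_s."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

Run it from a pod that can reach the worker; a `False` result suggests the SSH daemon has not started or is not listening on port 2222.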
#### Additional Configuration
- **Environment Variable**: `OMPI_MCA_orte_keep_fqdn_hostnames=1` is added to all nodes
- **SSH Volume**: Automatically mounts the SSH keypair secret (typically named `mpirun-ssh-key-<deployment-name>`)
**Important:** TensorRT-LLM requires an SSH keypair secret to be created before deployment. The secret name follows the pattern `mpirun-ssh-key-<component-name>`.
### Compilation Cache Configuration
The operator supports compilation cache volumes for backend-specific optimization:
| Backend | Support Level | Environment Variables | Default Mount Point |
|---------|--------------|----------------------|---------------------|
| vLLM | Fully Supported | `VLLM_CACHE_ROOT` | User-specified |
| SGLang | Partial Support | _None (pending upstream)_ | User-specified |
| TensorRT-LLM | Partial Support | _None (pending upstream)_ | User-specified |
To enable compilation cache, add a volume mount with `useAsCompilationCache: true` in your component specification. For vLLM, the operator will automatically configure the necessary environment variables. For other backends, volume mounts are created, but additional environment configuration may be required until upstream support is added.
## Next Steps
For additional support and examples, see the working multinode configurations in:
- **SGLang**: [examples/backends/sglang/deploy/](../../../examples/backends/sglang/deploy/)
- **TensorRT-LLM**: [examples/backends/trtllm/deploy/](../../../examples/backends/trtllm/deploy/)
- **vLLM**: [examples/backends/vllm/deploy/](../../../examples/backends/vllm/deploy/)
These examples demonstrate proper usage of the `multinode` section with corresponding `gpu` limits and correct `tp-size` configuration.
```{toctree}
:hidden:
Grove <../grove>
```
# Working with Dynamo Kubernetes Operator
## Overview
The Dynamo operator is a Kubernetes operator that simplifies the deployment, configuration, and lifecycle management of DynamoGraphs. It automates the reconciliation of custom resources to ensure your desired state is always achieved. This operator is ideal for users who want to manage complex deployments using declarative YAML definitions and Kubernetes-native tooling.
## Architecture
- **Operator Deployment:**
Deployed as a Kubernetes `Deployment` in a specific namespace.
- **Controllers:**
- `DynamoGraphDeploymentController`: Watches `DynamoGraphDeployment` CRs and orchestrates graph deployments.
- `DynamoComponentDeploymentController`: Watches `DynamoComponentDeployment` CRs and handles individual component deployments.
- `DynamoModelController`: Watches `DynamoModel` CRs and manages model lifecycle (e.g., loading LoRA adapters).
- **Workflow:**
1. A custom resource is created by the user or API server.
2. The corresponding controller detects the change and runs reconciliation.
3. Kubernetes resources (Deployments, Services, etc.) are created or updated to match the CR spec.
4. Status fields are updated to reflect the current state.
## Deployment Modes
The Dynamo operator supports three deployment modes to accommodate different cluster environments and use cases:
### 1. Cluster-Wide Mode (Default)
The operator monitors and manages DynamoGraph resources across **all namespaces** in the cluster.
**When to Use:**
- You have full cluster admin access
- You want centralized management of all Dynamo workloads
- Standard production deployment on a dedicated cluster
---
### 2. Namespace-Scoped Mode
The operator monitors and manages DynamoGraph resources **only in a specific namespace**. A lease marker is created to signal the operator's presence to any cluster-wide operators.
**When to Use:**
- You're on a shared/multi-tenant cluster
- You only have namespace-level permissions
- You want to test a new operator version in isolation
- You need to avoid conflicts with other operators
**Installation:**
```bash
helm install dynamo-platform dynamo-platform-${RELEASE_VERSION}.tgz \
--namespace my-namespace \
--create-namespace \
--set dynamo-operator.namespaceRestriction.enabled=true
```
---
### 3. Hybrid Mode
A **cluster-wide operator** manages most namespaces, while **one or more namespace-scoped operators** run in specific namespaces (e.g., for testing new versions). The cluster-wide operator automatically detects and excludes namespaces with namespace-scoped operators using lease markers.
**When to Use:**
- Running production workloads with a stable operator version
- Testing new operator versions in isolated namespaces without affecting production
- Gradual rollout of operator updates
- Development/staging environments on production clusters
**How It Works:**
1. Namespace-scoped operator creates a lease named `dynamo-operator-namespace-scope` in its namespace
2. Cluster-wide operator watches for these lease markers across all namespaces
3. Cluster-wide operator automatically excludes any namespace with a lease marker
4. If namespace-scoped operator stops, its lease expires (TTL: 30s by default)
5. Cluster-wide operator automatically resumes managing that namespace
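The lease-based handoff can be pictured as a filter over namespaces. The following Python sketch models steps 3 to 5 under simplifying assumptions: each lease is represented only by its last renewal time in seconds, a lease counts as active for 30 seconds after renewal, and the function name is hypothetical. The real operator works with Kubernetes Lease objects rather than a dictionary.

```python
from typing import Dict, List


def managed_namespaces(
    all_namespaces: List[str],
    lease_renew_times: Dict[str, float],
    now: float,
    ttl_seconds: float = 30.0,
) -> List[str]:
    """Return the namespaces a cluster-wide operator would manage at time now."""
    managed = []
    for ns in all_namespaces:
        renewed = lease_renew_times.get(ns)
        lease_active = renewed is not None and (now - renewed) <= ttl_seconds
        if not lease_active:
            # No marker, or the marker expired: cluster-wide operator takes over.
            managed.append(ns)
    return managed
```

With a lease renewed at `t=100` in `test-namespace`, that namespace is excluded at `t=110` and managed again at `t=140`, once the lease has expired.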
**Setup Example:**
```bash
# 1. Install cluster-wide operator (production, v1.0.0)
helm install dynamo-platform dynamo-platform-${RELEASE_VERSION}.tgz \
--namespace dynamo-system \
--create-namespace
# 2. Install namespace-scoped operator (testing, v2.0.0-beta)
helm install dynamo-test dynamo-platform-${RELEASE_VERSION}.tgz \
--namespace test-namespace \
--create-namespace \
--set dynamo-operator.namespaceRestriction.enabled=true \
--set dynamo-operator.controllerManager.manager.image.tag=v2.0.0-beta
```
**Observability:**
```bash
# List all namespaces with local operators
kubectl get lease -A --field-selector metadata.name=dynamo-operator-namespace-scope
# Check which operator version is running in a namespace
kubectl get lease -n my-namespace dynamo-operator-namespace-scope \
-o jsonpath='{.spec.holderIdentity}'
```
## Custom Resource Definitions (CRDs)
Dynamo provides the following Custom Resources:
- **DynamoGraphDeployment (DGD)**: Deploys complete inference pipelines
- **DynamoComponentDeployment (DCD)**: Deploys individual components
- **DynamoModel**: Manages model lifecycle (e.g., loading LoRA adapters)
For the complete technical API reference for Dynamo Custom Resource Definitions, see:
**📖 [Dynamo CRD API Reference](./api_reference.md)**
For a user-focused guide on deploying and managing models with DynamoModel, see:
**📖 [Managing Models with DynamoModel Guide](./deployment/dynamomodel-guide.md)**
## Webhooks
The Dynamo Operator uses **Kubernetes admission webhooks** for real-time validation of custom resources before they are persisted to the cluster. Webhooks are **enabled by default** and ensure that invalid configurations are rejected immediately at the API server level.
**Key Features:**
- ✅ Shared certificate infrastructure across all webhook types
- ✅ Automatic certificate generation (for testing/development)
- ✅ cert-manager integration (for production)
- ✅ Multi-operator support with lease-based coordination
- ✅ Immutability enforcement for critical fields
For complete documentation on webhooks, certificate management, and troubleshooting, see:
**📖 [Webhooks Guide](./webhooks.md)**
## Observability
The Dynamo Operator provides comprehensive observability through Prometheus metrics and Grafana dashboards. This allows you to monitor:
- **Controller Performance**: Reconciliation loop duration, success rates, and error rates by resource type
- **Webhook Activity**: Validation performance, admission rates, and denial patterns
- **Resource Inventory**: Current count of managed resources by state and namespace
- **Operational Health**: Success rates and health indicators for controllers and webhooks
### Metrics Collection
Metrics are automatically exposed on the operator's `/metrics` endpoint (port 8443 by default) and collected by Prometheus via a ServiceMonitor. The ServiceMonitor is automatically created when you install the operator via Helm (controlled by `metricsService.enabled`, which defaults to `true`).
### Grafana Dashboard
A pre-built Grafana dashboard is available for visualizing operator metrics. The dashboard includes:
- **Reconciliation Metrics**: Rate, duration (P95), and errors by resource type
- **Webhook Metrics**: Request rate, duration (P95), and denials by resource type and operation
- **Resource Inventory**: Count of DynamoGraphDeployments by state and namespace
- **Operational Health**: Success rate gauges for controllers and webhooks
For complete setup instructions and metrics reference, see:
**📖 [Operator Metrics Guide](./observability/operator-metrics.md)**
## Installation
### Quick Install with Helm
```bash
# Set environment
export NAMESPACE=dynamo-system
export RELEASE_VERSION=0.x.x # any version of Dynamo 0.3.2+ listed at https://github.com/ai-dynamo/dynamo/releases
# Install Platform (includes operator)
helm fetch https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-platform-${RELEASE_VERSION}.tgz
helm install dynamo-platform dynamo-platform-${RELEASE_VERSION}.tgz --namespace ${NAMESPACE} --create-namespace
```
> **Note:** For shared/multi-tenant clusters or testing scenarios, see [Deployment Modes](#deployment-modes) above for namespace-scoped and hybrid configurations.
### Building from Source
```bash
# Set environment
export NAMESPACE=dynamo-system
export DOCKER_SERVER=your-registry.com # your container registry (no trailing slash)
export IMAGE_TAG=latest
# Build operator image
cd deploy/operator
docker build -t $DOCKER_SERVER/dynamo-operator:$IMAGE_TAG .
docker push $DOCKER_SERVER/dynamo-operator:$IMAGE_TAG
cd -
# Install CRDs
cd deploy/helm/charts
helm install dynamo-crds ./crds/ --namespace default
# Install platform with custom operator image
helm install dynamo-platform ./platform/ \
--namespace ${NAMESPACE} \
--create-namespace \
--set "dynamo-operator.controllerManager.manager.image.repository=${DOCKER_SERVER}/dynamo-operator" \
--set "dynamo-operator.controllerManager.manager.image.tag=${IMAGE_TAG}" \
--set etcd.enabled=false \
--set dynamo-operator.imagePullSecrets[0].name=docker-imagepullsecret
```
For detailed installation options, see the [Installation Guide](./installation_guide.md).
## Development
- **Code Structure:**
  The operator is built using Kubebuilder and the operator-sdk, with the following structure:
  - `controllers/`: Reconciliation logic
  - `api/v1alpha1/`: CRD types
  - `config/`: Manifests and Helm charts
## References
- [Kubernetes Operator Pattern](https://kubernetes.io/docs/concepts/extend-kubernetes/operator/)
- [Custom Resource Definitions](https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/)
- [Operator SDK](https://sdk.operatorframework.io/)
- [Helm Best Practices for CRDs](https://helm.sh/docs/chart_best_practices/custom_resource_definitions/)
# GitOps Deployment with FluxCD
This section describes how to use FluxCD for GitOps-based deployment of Dynamo inference graphs. GitOps enables you to manage your Dynamo deployments declaratively using Git as the source of truth. We'll use the [aggregated vLLM example](../backends/vllm/README.md) to demonstrate the workflow.
## Prerequisites
- A Kubernetes cluster with [Dynamo Kubernetes Platform](./installation_guide.md) installed
- [FluxCD](https://fluxcd.io/flux/installation/) installed in your cluster
- A Git repository to store your deployment configurations
## Workflow Overview
The GitOps workflow for Dynamo deployments consists of three main steps:
1. Build and push the Dynamo Operator
2. Create and commit a DynamoGraphDeployment custom resource for initial deployment
3. Update the graph by building a new version and updating the CR for subsequent updates
## Step 1: Build and Push Dynamo Operator
First, build and push the Dynamo Operator by following [Install Dynamo Kubernetes Platform](./installation_guide.md).
## Step 2: Create Initial Deployment
Create a new file in your Git repository (e.g., `deployments/llm-agg.yaml`) with the following content:
```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: llm-agg
spec:
  pvcs:
    - name: vllm-model-storage
      size: 100Gi
  services:
    Frontend:
      replicas: 1
      envs:
        - name: SPECIFIC_ENV_VAR
          value: some_specific_value
    Processor:
      replicas: 1
      envs:
        - name: SPECIFIC_ENV_VAR
          value: some_specific_value
    VllmWorker:
      replicas: 1
      envs:
        - name: SPECIFIC_ENV_VAR
          value: some_specific_value
      # Add PVC for model storage
      volumeMounts:
        - name: vllm-model-storage
          mountPoint: /models
```
Commit and push this file to your Git repository. FluxCD will detect the new CR and create the initial Dynamo deployment in your cluster.
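FluxCD only detects the CR once it has been pointed at the repository. The following is a minimal sketch of that wiring, assuming Flux v2 with the `source.toolkit.fluxcd.io/v1` and `kustomize.toolkit.fluxcd.io/v1` APIs installed in the default `flux-system` namespace. The helper name `render_flux_manifests`, the resource name `dynamo-deployments`, the `main` branch, and the sync intervals are illustrative placeholders, not part of Dynamo.

```bash
# Hypothetical helper: prints a Flux GitRepository and Kustomization that sync
# a directory of DynamoGraphDeployment manifests into one namespace.
# Usage: render_flux_manifests <git-url> <path-in-repo> <target-namespace>
render_flux_manifests() {
  local url="$1" path="$2" namespace="$3"
  cat <<EOF
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: dynamo-deployments
  namespace: flux-system
spec:
  interval: 1m
  url: ${url}
  ref:
    branch: main
---
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: dynamo-deployments
  namespace: flux-system
spec:
  interval: 5m
  sourceRef:
    kind: GitRepository
    name: dynamo-deployments
  path: ${path}
  prune: true
  targetNamespace: ${namespace}
EOF
}
```

To apply the result, pipe it to kubectl, for example `render_flux_manifests <your-repo-url> ./deployments dynamo-system | kubectl apply -f -`. A private repository would additionally need a `secretRef` on the GitRepository.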
## Step 3: Update Existing Deployment
To update your pipeline, update the associated DynamoGraphDeployment CR in your Git repository.
The Dynamo operator automatically reconciles the change.
## Monitoring the Deployment
You can monitor the deployment status using:
```bash
export NAMESPACE=<namespace-with-the-dynamo-operator>
# Check the DynamoGraphDeployment status
kubectl get dynamographdeployment llm-agg -n $NAMESPACE
```
# Grove Deployment Guide
Grove is a Kubernetes API specifically designed to address the orchestration challenges of modern AI workloads, particularly disaggregated inference systems. Grove provides seamless integration with NVIDIA Dynamo for comprehensive AI infrastructure management.
## Overview
Grove was originally motivated by the challenges of orchestrating multinode, disaggregated inference systems. It provides a consistent and unified API that allows users to define, configure, and scale prefill, decode, and any other components like routing within a single custom resource.
### How Grove Works for Disaggregated Serving
Grove enables disaggregated serving by breaking down large language model inference into separate, specialized components that can be independently scaled and managed. This architecture provides several advantages:
- **Component Specialization**: Separate prefill, decode, and routing components optimized for their specific tasks
- **Independent Scaling**: Each component can scale based on its individual resource requirements and workload patterns
- **Resource Optimization**: Better utilization of hardware resources through specialized workload placement
- **Fault Isolation**: Issues in one component don't necessarily affect others
## Core Components and API Resources
Grove implements disaggregated serving through several custom Kubernetes resources that provide declarative composition of role-based pod groups:
### PodCliqueSet
The top-level Grove object that defines a group of components managed and colocated together. Key features include:
- Support for autoscaling
- Topology-aware spread of replicas for availability
- Unified management of multiple disaggregated components
### PodClique
Represents a group of pods with a specific role (e.g., leader, worker, frontend). Each clique features:
- Independent configuration options
- Custom scaling logic support
- Role-specific resource allocation
### PodCliqueScalingGroup
A set of PodCliques that scale and are scheduled together, ideal for tightly coupled roles like prefill leader and worker components that need coordinated scaling behavior.
## Key Capabilities for Disaggregated Serving
Grove provides several specialized features that make it particularly well-suited for disaggregated serving:
### Flexible Gang Scheduling
PodCliques and PodCliqueScalingGroups allow users to specify flexible gang-scheduling requirements at multiple levels within a PodCliqueSet to prevent resource deadlocks and ensure all components of a disaggregated system start together.
### Multi-level Horizontal Auto-Scaling
Supports pluggable horizontal auto-scaling solutions to scale PodCliqueSet, PodClique, and PodCliqueScalingGroup custom resources independently based on their specific metrics and requirements.
### Network Topology-Aware Scheduling
Allows specifying network topology pack and spread constraints to optimize for both network performance and service availability, crucial for disaggregated systems where components need efficient inter-node communication.
### Custom Startup Dependencies
Prescribes the order in which PodCliques must start in a declarative specification, with pod startup decoupled from pod creation or scheduling. This ensures proper initialization order for disaggregated components.
## Use Cases and Examples
Grove specifically supports:
- **Multi-node disaggregated inference** for large models such as DeepSeek-R1 and Llama-4-Maverick
- **Single-node disaggregated inference** for optimized resource utilization
- **Agentic pipelines of models** for complex AI workflows
- **Standard aggregated serving** patterns for single node or single GPU inference
## Integration with NVIDIA Dynamo
Grove is strategically aligned with NVIDIA Dynamo for seamless integration within the AI infrastructure stack:
### Complementary Roles
- **Grove**: Handles the Kubernetes orchestration layer for disaggregated AI workloads
- **Dynamo**: Provides comprehensive AI infrastructure capabilities including serving backends, routing, and resource management
### Release Coordination
Grove is aligning its release schedule with NVIDIA Dynamo to ensure seamless integration, with the finalized release cadence reflected in the project roadmap.
### Unified AI Platform
The integration creates a comprehensive platform where:
- Grove manages complex orchestration of disaggregated components
- Dynamo provides the serving infrastructure, routing capabilities, and backend integrations
- Together they enable sophisticated AI serving architectures with simplified management
## Architecture Benefits
Grove represents a significant advancement in Kubernetes-based orchestration for AI workloads by:
1. **Simplifying Complex Deployments**: Provides a unified API that can manage multiple components (prefill, decode, routing) within a single resource definition
2. **Enabling Sophisticated Architectures**: Supports advanced disaggregated inference patterns that were previously difficult to orchestrate
3. **Reducing Operational Complexity**: Abstracts away the complexity of coordinating multiple interdependent AI components
4. **Optimizing Resource Utilization**: Enables fine-grained control over component placement and scaling
## Getting Started
Grove relies on KAI Scheduler for resource allocation and scheduling.
For KAI Scheduler, see the [KAI Scheduler Deployment Guide](https://github.com/NVIDIA/KAI-Scheduler).
For installation instructions, see the [Grove Installation Guide](https://github.com/NVIDIA/grove/blob/main/docs/installation.md).
For practical examples of Grove-based multinode deployments in action, see the [Multinode Deployment Guide](./deployment/multinode-deployment.md), which demonstrates multi-node disaggregated serving scenarios.
For the latest updates on Grove, refer to the [official project on GitHub](https://github.com/NVIDIA/grove).
The Dynamo Kubernetes Platform can also install Grove and KAI Scheduler as part of the platform installation. See the [Dynamo Kubernetes Platform Installation Guide](./installation_guide.md) for more details.
<!--
SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
# Installation Guide for Dynamo Kubernetes Platform
Deploy and manage Dynamo inference graphs on Kubernetes with automated orchestration and scaling, using the Dynamo Kubernetes Platform.
## Before You Start
Determine your cluster environment:
**Shared/Multi-Tenant Cluster** (K8s cluster with existing Dynamo artifacts):
- CRDs already installed cluster-wide - skip CRD installation step
- A cluster-wide Dynamo operator is likely already running
- **Do NOT install another operator** - use the existing cluster-wide operator
- Only install a namespace-restricted operator if you specifically need to prevent the cluster-wide operator from managing your namespace (e.g., testing operator features you're developing)
**Dedicated Cluster** (full cluster admin access):
- You install CRDs yourself
- Can use cluster-wide operator (default)
**Local Development** (Minikube, testing):
- See [Minikube Setup](deployment/minikube.md) first, then follow installation steps below
To check if CRDs already exist:
```bash
kubectl get crd | grep dynamo
# If you see dynamographdeployments, dynamocomponentdeployments, etc., CRDs are already installed
```
To check if a cluster-wide operator already exists:
```bash
# Check for cluster-wide operator and show its namespace
kubectl get clusterrolebinding -o json | \
jq -r '.items[] | select(.metadata.name | contains("dynamo-operator-manager")) |
"Cluster-wide operator found in namespace: \(.subjects[0].namespace)"'
# If a cluster-wide operator exists: Do NOT install another operator
# Only install namespace-restricted mode if you specifically need namespace isolation
```
## Installation Paths
The platform is installed using the Dynamo Kubernetes Platform [Helm chart](../../deploy/helm/charts/platform/README.md).
**Path A: Pre-built Artifacts**
- Use case: Production deployment, shared or dedicated clusters
- Source: NGC published Helm charts
- Time: ~10 minutes
- Jump to: [Path A](#path-a-production-install)
**Path B: Custom Build from Source**
- Use case: Contributing to Dynamo, using latest features from main branch, customization
- Requirements: Docker build environment
- Time: ~30 minutes
- Jump to: [Path B](#path-b-custom-build-from-source)
You can override the default values of any `helm install` command by passing your own values file:
```bash
helm install ... \
  -f your-values.yaml
```
and/or by setting individual values as flags:
```bash
helm install ... \
  --set "your-value=your-value"
```
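Each `--set` path corresponds to a nested key in a values file, so several flags can be collected in one place. The sketch below writes such a file using only keys that appear as `--set` flags in this guide; the helper name `write_values_file` and the file name are illustrative, and you should keep only the settings you actually need.

```bash
# Hypothetical helper: writes a values file equivalent to
#   --set dynamo-operator.namespaceRestriction.enabled=true
#   --set grove.enabled=true
#   --set kai-scheduler.enabled=true
write_values_file() {
  local file="$1"
  cat > "${file}" <<'EOF'
dynamo-operator:
  namespaceRestriction:
    enabled: true
grove:
  enabled: true
kai-scheduler:
  enabled: true
EOF
}
```

After calling `write_values_file your-values.yaml`, pass the file to Helm with `-f your-values.yaml`.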
## Prerequisites
Before installing the Dynamo Kubernetes Platform, ensure you have the following tools and access:
### Required Tools
| Tool | Minimum Version | Description | Installation |
|------|-----------------|-------------|--------------|
| **kubectl** | v1.24+ | Kubernetes command-line tool | [Install kubectl](https://kubernetes.io/docs/tasks/tools/#kubectl) |
| **Helm** | v3.0+ | Kubernetes package manager | [Install Helm](https://helm.sh/docs/intro/install/) |
| **Docker** | Latest | Container runtime (Path B only) | [Install Docker](https://docs.docker.com/get-docker/) |
### Cluster and Access Requirements
- **Kubernetes cluster v1.24+** with admin or namespace-scoped access
- **Cluster type determined** (shared vs dedicated) — see [Before You Start](#before-you-start)
- **CRD status checked** if on a shared cluster
- **NGC credentials** (optional) — required only if pulling NVIDIA images from NGC
### Verify Installation
Run the following to confirm your tools are correctly installed:
```bash
# Verify tools and versions
kubectl version --client # Should show v1.24+
helm version # Should show v3.0+
docker version # Required for Path B only
# Set your release version
export RELEASE_VERSION=0.x.x # any version of Dynamo 0.3.2+ listed at https://github.com/ai-dynamo/dynamo/releases
```
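The version output above still has to be compared against the minimums by eye. As a sketch, the comparison can be scripted with a small helper; `version_ge` is a hypothetical name, and the approach assumes GNU `sort -V` is available for version ordering.

```bash
# Returns success when version $1 is at least version $2.
# A leading "v" is ignored; relies on GNU `sort -V` for version ordering.
version_ge() {
  local have="${1#v}" want="${2#v}"
  [ "$(printf '%s\n%s\n' "${want}" "${have}" | sort -V | head -n 1)" = "${want}" ]
}

# Examples (not run here):
#   version_ge "$(helm version --template '{{.Version}}')" 3.0.0 || echo "Helm is too old"
#   version_ge "$(kubectl version --client -o json | jq -r '.clientVersion.gitVersion')" 1.24.0 \
#     || echo "kubectl is too old"
```

The helper only compares the strings it is given, so it can be tried without a cluster, for example `version_ge v1.28.3 1.24.0 && echo ok`.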
### Pre-Deployment Checks
Before proceeding, run the pre-deployment check script to verify your cluster meets all requirements:
```bash
./deploy/pre-deployment/pre-deployment-check.sh
```
This script validates kubectl connectivity, default StorageClass configuration, and GPU node availability. See [Pre-Deployment Checks](../../deploy/pre-deployment/README.md) for details.
> **No cluster?** See [Minikube Setup](deployment/minikube.md) for local development.
**Estimated installation time:** 5-30 minutes depending on path
## Path A: Production Install
Install from [NGC published artifacts](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/collections/ai-dynamo/artifacts).
```bash
# 1. Set environment
export NAMESPACE=dynamo-system
export RELEASE_VERSION=0.x.x # any version of Dynamo 0.3.2+ listed at https://github.com/ai-dynamo/dynamo/releases
# 2. Install CRDs (skip if on shared cluster where CRDs already exist)
helm fetch https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-crds-${RELEASE_VERSION}.tgz
helm install dynamo-crds dynamo-crds-${RELEASE_VERSION}.tgz --namespace default
# 3. Install Platform
helm fetch https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-platform-${RELEASE_VERSION}.tgz
helm install dynamo-platform dynamo-platform-${RELEASE_VERSION}.tgz --namespace ${NAMESPACE} --create-namespace
```
**For Shared/Multi-Tenant Clusters:**
If your cluster has namespace-restricted Dynamo operators, you MUST add namespace restriction to your installation:
```bash
# Add this flag to the helm install command above
--set dynamo-operator.namespaceRestriction.enabled=true
```
Note: Use the full path `dynamo-operator.namespaceRestriction.enabled=true` (not just `namespaceRestriction.enabled=true`).
If you see this validation error, you need namespace restriction:
```
VALIDATION ERROR: Cannot install cluster-wide Dynamo operator.
Found existing namespace-restricted Dynamo operators in namespaces: ...
```
> [!TIP]
> For multinode deployments, you need to install multinode orchestration components:
>
> **Option 1 (Recommended): Grove + KAI Scheduler**
> - Grove and KAI Scheduler can be installed manually or through the dynamo-platform helm install command.
> - When using the dynamo-platform helm install command, Grove and KAI Scheduler are NOT installed by default. You can enable their installation by setting the following flags:
>
> ```bash
> --set "grove.enabled=true"
> --set "kai-scheduler.enabled=true"
> ```
>
> **Option 2: LeaderWorkerSet (LWS) + Volcano**
> - If using LWS for multinode deployments, you must also install Volcano (required dependency):
>   - [LWS Installation](https://github.com/kubernetes-sigs/lws#installation)
>   - [Volcano Installation](https://volcano.sh/en/docs/installation/) (required for gang scheduling with LWS)
> - These must be installed manually before deploying multinode workloads with LWS.
>
> See the [Multinode Deployment Guide](./deployment/multinode-deployment.md) for details on orchestrator selection.
> [!TIP]
> By default, Model Express Server is not used.
> If you wish to use an existing Model Express Server, you can set the modelExpressURL to the existing server's URL in the helm install command:
```bash
--set "dynamo-operator.modelExpressURL=http://model-express-server.model-express.svc.cluster.local:8080"
```
> [!TIP]
> By default, Dynamo Operator is installed cluster-wide and will monitor all namespaces.
> If you wish to restrict the operator to monitor only a specific namespace (the helm release namespace by default), you can set the namespaceRestriction.enabled to true.
> You can also change the restricted namespace by setting the targetNamespace property.
```bash
--set "dynamo-operator.namespaceRestriction.enabled=true"
--set "dynamo-operator.namespaceRestriction.targetNamespace=dynamo-namespace" # optional
```
[Verify Installation](#verify-installation)
## Path B: Custom Build from Source
Build and deploy from source to customize the platform, contribute to Dynamo, or use the latest unreleased features and fixes from the main branch.
```bash
# 1. Set environment
export NAMESPACE=dynamo-system
export DOCKER_SERVER=nvcr.io/nvidia/ai-dynamo # or your registry (no trailing slash)
export DOCKER_USERNAME='$oauthtoken'
export DOCKER_PASSWORD=<YOUR_NGC_CLI_API_KEY>
export IMAGE_TAG=${RELEASE_VERSION}
# 2. Build operator
cd deploy/operator
# 2.1 Alternative 1 : Build and push the operator image for multiple platforms
docker buildx create --name multiplatform --driver docker-container --bootstrap
docker buildx use multiplatform
docker buildx build --platform linux/amd64,linux/arm64 -t $DOCKER_SERVER/dynamo-operator:$IMAGE_TAG --push .
# 2.2 Alternative 2 : Build and push the operator image for a single platform
docker build -t $DOCKER_SERVER/dynamo-operator:$IMAGE_TAG . && docker push $DOCKER_SERVER/dynamo-operator:$IMAGE_TAG
cd -
# 3. Create namespace and secrets to be able to pull the operator image (only needed if you pushed the operator image to a private registry)
kubectl create namespace ${NAMESPACE}
kubectl create secret docker-registry docker-imagepullsecret \
--docker-server=${DOCKER_SERVER} \
--docker-username=${DOCKER_USERNAME} \
--docker-password=${DOCKER_PASSWORD} \
--namespace=${NAMESPACE}
cd deploy/helm/charts
# 4. Install CRDs
helm upgrade --install dynamo-crds ./crds/ --namespace default
# 5. Install Platform
helm dep build ./platform/
# To install cluster-wide instead, set NS_RESTRICT_FLAGS="" (empty) or omit that line entirely.
NS_RESTRICT_FLAGS="--set dynamo-operator.namespaceRestriction.enabled=true"
helm install dynamo-platform ./platform/ \
--namespace "${NAMESPACE}" \
--set "dynamo-operator.controllerManager.manager.image.repository=${DOCKER_SERVER}/dynamo-operator" \
--set "dynamo-operator.controllerManager.manager.image.tag=${IMAGE_TAG}" \
--set "dynamo-operator.imagePullSecrets[0].name=docker-imagepullsecret" \
${NS_RESTRICT_FLAGS}
```
[Verify Installation](#verify-installation)
## Verify Installation
```bash
# Check CRDs
kubectl get crd | grep dynamo
# Check operator and platform pods
kubectl get pods -n ${NAMESPACE}
# Expected: dynamo-operator-* and etcd-* and nats-* pods Running
```
## Next Steps
1. **Deploy Model/Workflow**
```bash
# Example: Deploy a vLLM workflow with Qwen3-0.6B using aggregated serving
kubectl apply -f examples/backends/vllm/deploy/agg.yaml -n ${NAMESPACE}
# Port forward and test
kubectl port-forward svc/vllm-agg-frontend 8000:8000 -n ${NAMESPACE}
curl http://localhost:8000/v1/models
```
2. **Explore Backend Guides**
- [vLLM Deployments](../../examples/backends/vllm/deploy/README.md)
- [SGLang Deployments](../../examples/backends/sglang/deploy/README.md)
- [TensorRT-LLM Deployments](../../examples/backends/trtllm/deploy/README.md)
3. **Optional:**
- [Set up Prometheus & Grafana](./observability/metrics.md)
- [SLA Planner Guide](../components/planner/planner_guide.md) (for SLA-aware scheduling and autoscaling)
## Troubleshooting
**"VALIDATION ERROR: Cannot install cluster-wide Dynamo operator"**
```
VALIDATION ERROR: Cannot install cluster-wide Dynamo operator.
Found existing namespace-restricted Dynamo operators in namespaces: ...
```
Cause: Attempting cluster-wide install on a shared cluster with existing namespace-restricted operators.
Solution: Add namespace restriction to your installation:
```bash
--set dynamo-operator.namespaceRestriction.enabled=true
```
Note: Use the full path `dynamo-operator.namespaceRestriction.enabled=true` (not just `namespaceRestriction.enabled=true`).
**CRDs already exist**
Cause: Installing CRDs on a cluster where they're already present (common on shared clusters).
Solution: Skip step 2 (CRD installation) and proceed directly to platform installation.
To check if CRDs exist:
```bash
kubectl get crd | grep dynamo
```
**Pods not starting?**
```bash
kubectl describe pod <pod-name> -n ${NAMESPACE}
kubectl logs <pod-name> -n ${NAMESPACE}
```
**HuggingFace model access?**
```bash
kubectl create secret generic hf-token-secret \
--from-literal=HF_TOKEN=${HF_TOKEN} \
-n ${NAMESPACE}
```
**Bitnami etcd "unrecognized" image?**
```
ERROR: Original containers have been substituted for unrecognized ones. Deploying this chart with non-standard containers is likely to cause degraded security and performance, broken chart features, and missing environment variables.
```
This error, which you may encounter during `helm install`, occurs because Bitnami moved its Docker images to a [secure repository](https://github.com/bitnami/charts/tree/main/bitnami/etcd#%EF%B8%8F-important-notice-upcoming-changes-to-the-bitnami-catalog).
To work around it, add the following to the `helm install` command:
```bash
--set "etcd.image.repository=bitnamilegacy/etcd" --set "etcd.global.security.allowInsecureImages=true"
```
**Clean uninstall?**
To uninstall the platform, you can run the following command:
```bash
helm uninstall dynamo-platform --namespace ${NAMESPACE}
```
To uninstall the CRDs, follow these steps:
Get all of the dynamo CRDs installed in your cluster:
```bash
kubectl get crd | grep "dynamo.*nvidia.com"
```
You should see something like this:
```
dynamocomponentdeployments.nvidia.com 2025-10-21T14:49:52Z
dynamocomponents.nvidia.com 2025-10-25T05:16:10Z
dynamographdeploymentrequests.nvidia.com 2025-11-24T05:26:04Z
dynamographdeployments.nvidia.com 2025-09-04T20:56:40Z
dynamographdeploymentscalingadapters.nvidia.com 2025-12-09T21:05:59Z
dynamomodels.nvidia.com 2025-11-07T00:19:43Z
```
Delete each CRD one by one:
```bash
kubectl delete crd <crd-name>
```
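Deleting the CRDs one at a time can also be scripted. The sketch below filters the CRD list down to Dynamo entries; `select_dynamo_crds` is a hypothetical helper name, and the pattern assumes every Dynamo CRD name starts with `dynamo` and ends with `.nvidia.com`, as in the listing above.

```bash
# Hypothetical helper: keeps only Dynamo CRDs from `kubectl get crd -o name`
# output (one "customresourcedefinition.apiextensions.k8s.io/<name>" per line).
select_dynamo_crds() {
  grep -E '/dynamo[a-z]*\.nvidia\.com$'
}

# Example (not run here; `xargs -r` is the GNU flag that skips empty input):
#   kubectl get crd -o name | select_dynamo_crds | xargs -r kubectl delete
```

Deleting a CRD also deletes every custom resource of that kind, so review the filtered list before piping it to `kubectl delete`.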
## Advanced Options
- [Helm Chart Configuration](../../deploy/helm/charts/platform/README.md)
- [Create custom deployments](./deployment/create_deployment.md)
- [Dynamo Operator details](./dynamo_operator.md)
- [Model Express Server details](https://github.com/ai-dynamo/modelexpress)
# Model Caching with Fluid: Cloud-Native Data Orchestration and Acceleration
Fluid is an open-source, cloud-native data orchestration and acceleration platform for Kubernetes. It virtualizes and accelerates data access from various sources (object storage, distributed file systems, cloud storage), making it ideal for AI, machine learning, and big data workloads.
## Key Features
- **Data Caching and Acceleration:** Cache remote data close to compute workloads for faster access.
- **Unified Data Access:** Access data from S3, HDFS, NFS, and more through a single interface.
- **Kubernetes Native:** Integrates with Kubernetes using CRDs for data management.
- **Scalability:** Supports large-scale data and compute clusters.
## Installation
You can install Fluid on any Kubernetes cluster using Helm.
**Prerequisites:**
- Kubernetes >= 1.18
- `kubectl` >= 1.18
- `Helm` >= 3.5
**Quick Install:**
```sh
kubectl create ns fluid-system
helm repo add fluid https://fluid-cloudnative.github.io/charts
helm repo update
helm install fluid fluid/fluid -n fluid-system
```
For advanced configuration, see the [Fluid Installation Guide](https://fluid-cloudnative.github.io/docs/get-started/installation).
## Pre-deployment Steps
1. Install Fluid (see [Installation](#installation)).
2. Create a Dataset and Runtime (see [the following example](#webufs-example)).
3. Mount the resulting PVC in your workload.
## Mounting Data Sources
### WebUFS Example
WebUFS allows mounting HTTP/HTTPS sources as filesystems.
```yaml
# Mount a public HTTP directory as a Fluid Dataset
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: webufs-model
spec:
  mounts:
    - mountPoint: https://myhost.org/path_to_my_model # Replace with your HTTP source
      name: webufs-model
---
apiVersion: data.fluid.io/v1alpha1
kind: AlluxioRuntime
metadata:
  name: webufs-model
spec:
  replicas: 2
  tieredstore:
    levels:
      - mediumtype: MEM
        path: /dev/shm
        quota: 2Gi
        high: "0.95"
        low: "0.7"
```
After applying, Fluid creates a PersistentVolumeClaim (PVC) named `webufs-model` containing the files.
### S3 Example
Mount an S3 bucket as a Fluid Dataset.
```yaml
# Mount an S3 bucket as a Fluid Dataset
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: s3-model
spec:
  mounts:
    - mountPoint: s3://<your-bucket> # Replace with your bucket name
      options:
        alluxio.underfs.s3.endpoint: http://minio:9000 # S3 endpoint (e.g., MinIO)
        alluxio.underfs.s3.disable.dns.buckets: "true"
        aws.secretKey: "<your-secret>"
        aws.accessKeyId: "<your-access-key>"
---
apiVersion: data.fluid.io/v1alpha1
kind: AlluxioRuntime
metadata:
  name: s3-model
spec:
  replicas: 1
  tieredstore:
    levels:
      - mediumtype: MEM
        path: /dev/shm
        quota: 1Gi
        high: "0.95"
        low: "0.7"
---
apiVersion: data.fluid.io/v1alpha1
kind: DataLoad
metadata:
  name: s3-model-loader
spec:
  dataset:
    name: s3-model
    namespace: <your-namespace> # Replace with your namespace
  loadMetadata: true
  target:
    - path: "/"
      replicas: 1
```
The resulting PVC is named `s3-model`.
## Using HuggingFace Models with Fluid
**Limitations:**
- HuggingFace models are not exposed as simple filesystems or buckets.
- No native integration exists between Fluid and the HuggingFace Hub API.
**Workaround: Download and Upload to S3/MinIO**
1. Download the model using the HuggingFace CLI or SDK.
2. Upload the model files to a supported storage backend (S3, GCS, NFS).
3. Mount that backend using Fluid.
**Example Pod to Download and Upload:**
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: download-hf-to-minio
spec:
  restartPolicy: Never
  containers:
    - name: downloader
      image: python:3.10-slim
      command: ["sh", "-c"]
      args:
        - |
          set -eux
          pip install --no-cache-dir huggingface_hub awscli
          BUCKET_NAME=hf-models
          ENDPOINT_URL=http://minio:9000
          MODEL_NAME=deepseek-ai/DeepSeek-R1-Distill-Llama-8B
          LOCAL_DIR=/tmp/model
          if ! aws --endpoint-url $ENDPOINT_URL s3 ls "s3://$BUCKET_NAME" > /dev/null 2>&1; then
            aws --endpoint-url $ENDPOINT_URL s3 mb "s3://$BUCKET_NAME"
          fi
          huggingface-cli download $MODEL_NAME --local-dir $LOCAL_DIR --local-dir-use-symlinks False
          aws --endpoint-url $ENDPOINT_URL s3 cp $LOCAL_DIR s3://$BUCKET_NAME/$MODEL_NAME --recursive
      env:
        - name: AWS_ACCESS_KEY_ID
          value: "<your-access-key>"
        - name: AWS_SECRET_ACCESS_KEY
          value: "<your-secret>"
      volumeMounts:
        - name: tmp-volume
          mountPath: /tmp/model
  volumes:
    - name: tmp-volume
      emptyDir: {}
```
You can then use `s3://hf-models/deepseek-ai/DeepSeek-R1-Distill-Llama-8B` as your Dataset mount.
## Usage with Dynamo
Mount the Fluid-generated PVC in your DynamoGraphDeployment:
```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: model-caching
spec:
  pvcs:
    - name: s3-model
  envs:
    - name: HF_HOME
      value: /model
    - name: DYN_DEPLOYMENT_CONFIG
      value: '{"Common": {"model": "/model", ...}}'
  services:
    VllmWorker:
      volumeMounts:
        - name: s3-model
          mountPoint: /model
    Processor:
      volumeMounts:
        - name: s3-model
          mountPoint: /model
```
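Hand-writing the JSON value of `DYN_DEPLOYMENT_CONFIG` inside YAML is easy to get wrong. The following is a minimal sketch, assuming you generate the manifest from a script. The helper name `deployment_config_env` is hypothetical, and only the `Common.model` key is taken from the example above; any further keys are left for you to add.

```python
import json


def deployment_config_env(model_path: str) -> dict:
    """Return an env entry whose value is a JSON-encoded config string.

    Only the "Common.model" key comes from the example manifest; extend the
    dictionary with the settings your deployment actually needs.
    """
    config = {"Common": {"model": model_path}}
    return {"name": "DYN_DEPLOYMENT_CONFIG", "value": json.dumps(config)}


entry = deployment_config_env("/model")
print(entry["value"])
```

Serializing with `json.dumps` guarantees balanced quotes and braces, which a hand-edited single-quoted YAML string does not.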
## Full example with llama3.3 70B
### Performance
When deploying LLaMA 3.3 70B with Fluid as the caching layer, we observed the best performance with a single-node cache that holds 100% of the model files locally. Scheduling the vLLM worker pod on the same node as the Fluid cache eliminated network I/O bottlenecks, which resulted in the fastest model startup time and the highest inference efficiency in our tests.
| Cache Configuration | vLLM Pod Placement | Startup Time |
|----------------------------------------------|----------------------------------|-----------------|
| ❌ No Cache (Download from HuggingFace) | N/A | ~9 minutes |
| 🟡 Multi-Node Cache (100% Model Cached) | Not on Cache Node | ~18 minutes |
| 🟡 Multi-Node Cache (100% Model Cached) | On Cache Node | ~10 minutes |
| ✅ Single-Node Cache (100% Model Cached) | On Cache Node | ~80 seconds |
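To make the comparison concrete, the sketch below converts the approximate startup times from the table into speed-up factors relative to the no-cache baseline. The dictionary keys are illustrative names, and the inputs are the rounded figures from the table, so the results are rough.

```python
# Approximate startup times from the table, in seconds.
startup_seconds = {
    "no_cache": 9 * 60,
    "multi_node_off_cache_node": 18 * 60,
    "multi_node_on_cache_node": 10 * 60,
    "single_node_on_cache_node": 80,
}

baseline = startup_seconds["no_cache"]

# A factor above 1 means faster than downloading from HuggingFace directly.
speedup = {name: baseline / secs for name, secs in startup_seconds.items()}

for name, factor in speedup.items():
    print(f"{name}: {factor:.2f}x")
```

With these figures only the single-node cache beats the baseline; both multi-node configurations come out at or below 1x.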
### Resources
```yaml
# dataset.yaml
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: llama-3-3-70b-instruct-model
  namespace: my-namespace
spec:
  mounts:
    - mountPoint: s3://hf-models/meta-llama/Llama-3.3-70B-Instruct
      options:
        alluxio.underfs.s3.endpoint: http://minio:9000
        alluxio.underfs.s3.disable.dns.buckets: "true"
        aws.secretKey: "minioadmin"
        aws.accessKeyId: "minioadmin"
        alluxio.underfs.s3.streaming.upload.enabled: "true"
        alluxio.underfs.s3.multipart.upload.threads: "20"
        alluxio.underfs.s3.socket.timeout: "50s"
        alluxio.underfs.s3.request.timeout: "60s"
---
# runtime.yaml
apiVersion: data.fluid.io/v1alpha1
kind: AlluxioRuntime
metadata:
  name: llama-3-3-70b-instruct-model
  namespace: my-namespace
spec:
  replicas: 1
  properties:
    alluxio.user.file.readtype.default: CACHE_PROMOTE
    alluxio.user.file.write.type.default: CACHE_THROUGH
    alluxio.user.block.size.bytes.default: 128MB
  tieredstore:
    levels:
      - mediumtype: MEM
        path: /dev/shm
        quota: 300Gi
        high: "1.0"
        low: "0.7"
---
# DataLoad - Preloads the model into cache
apiVersion: data.fluid.io/v1alpha1
kind: DataLoad
metadata:
  name: llama-3-3-70b-instruct-model-loader
spec:
  dataset:
    name: llama-3-3-70b-instruct-model
    namespace: my-namespace
  loadMetadata: true
  target:
    - path: "/"
      replicas: 1
```
The associated DynamoGraphDeployment uses node affinity to schedule the vLLM worker on the same node as the Alluxio cache worker:
```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: my-hello-world
spec:
  envs:
    - name: DYN_LOG
      value: "debug"
    - name: DYN_DEPLOYMENT_CONFIG
      value: '{"Common": {"model": "/model", "block-size": 64, "max-model-len": 16384},
        "Frontend": {"served_model_name": "meta-llama/Llama-3.3-70B-Instruct", "endpoint":
        "dynamo.Processor.chat/completions", "port": 8000}, "Processor": {"router":
        "round-robin", "router-num-threads": 4, "common-configs": ["model", "block-size",
        "max-model-len"]}, "VllmWorker": {"tensor-parallel-size": 4, "enforce-eager": true, "max-num-batched-tokens":
        16384, "enable-prefix-caching": true, "ServiceArgs": {"workers": 1, "resources":
        {"gpu": "4", "memory": "40Gi"}}, "common-configs": ["model", "block-size", "max-model-len"]},
        "Planner": {"environment": "kubernetes", "no-operation": true}}'
  pvcs:
    - name: llama-3-3-70b-instruct-model
  services:
    Processor:
      volumeMounts:
        - name: llama-3-3-70b-instruct-model
          mountPoint: /model
    VllmWorker:
      volumeMounts:
        - name: llama-3-3-70b-instruct-model
          mountPoint: /model
      extraPodSpec:
        affinity:
          nodeAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              nodeSelectorTerms:
                - matchExpressions:
                    - key: fluid.io/s-alluxio-my-namespace-llama-3-3-70b-instruct-model
                      operator: In
                      values:
                        - "true"
```
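The node label key in the affinity rule appears to combine the runtime type, the namespace, and the Dataset name. The sketch below builds the key and the matching affinity term from those parts. The `fluid.io/s-<runtime>-<namespace>-<dataset>` pattern is inferred from the example above rather than taken from the Fluid documentation, and both function names are hypothetical, so check the labels on your cache nodes with `kubectl get nodes --show-labels` before relying on it.

```python
def fluid_cache_node_label(namespace: str, dataset: str, runtime: str = "alluxio") -> str:
    """Build the node label key, following the pattern seen in the example."""
    return f"fluid.io/s-{runtime}-{namespace}-{dataset}"


def cache_node_affinity(namespace: str, dataset: str) -> dict:
    """Return a nodeAffinity fragment that pins a pod to the cache nodes."""
    expression = {
        "key": fluid_cache_node_label(namespace, dataset),
        "operator": "In",
        "values": ["true"],
    }
    return {
        "nodeAffinity": {
            "requiredDuringSchedulingIgnoredDuringExecution": {
                "nodeSelectorTerms": [{"matchExpressions": [expression]}]
            }
        }
    }


print(fluid_cache_node_label("my-namespace", "llama-3-3-70b-instruct-model"))
```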
## Troubleshooting & FAQ
- **PVC not created?** Check Fluid and AlluxioRuntime pod logs.
- **Model not found?** Ensure the model was uploaded to the correct bucket/path.
- **Permission errors?** Verify S3/MinIO credentials and bucket policies.
## Resources
- [Fluid Documentation](https://fluid-cloudnative.github.io/)
- [Alluxio Documentation](https://docs.alluxio.io/)
- [MinIO Documentation](https://docs.min.io/)
- [Hugging Face Hub](https://huggingface.co/docs/hub/index)
- [Dynamo README](https://github.com/ai-dynamo/dynamo/blob/main/.devcontainer/README.md)
- [Dynamo Documentation](https://docs.nvidia.com/dynamo/latest/index.html)
# Log Aggregation in Dynamo on Kubernetes
This guide demonstrates how to set up logging for Dynamo in Kubernetes using Grafana Loki and Alloy. This setup provides a simple reference logging setup that can be followed in Kubernetes clusters including Minikube and MicroK8s.
> [!Note]
> This setup is intended for development and testing purposes. For production environments, please refer to the official documentation for high-availability configurations.
## Components Overview
- **[Grafana Loki](https://grafana.com/oss/loki/)**: Fast and cost-effective Kubernetes-native log aggregation system.
- **[Grafana Alloy](https://grafana.com/oss/alloy/)**: OpenTelemetry collector that replaces Promtail, gathering logs, metrics and traces from Kubernetes pods.
- **[Grafana](https://grafana.com/grafana/)**: Visualization platform for querying and exploring logs.
## Prerequisites
### 1. Dynamo Kubernetes Platform
This guide assumes you have installed Dynamo Kubernetes Platform. For more information, see [Dynamo Kubernetes Platform](../README.md).
### 2. Kube-prometheus
While this guide does not use Prometheus, it assumes Grafana was installed as part of the kube-prometheus-stack. For more information, see [kube-prometheus-stack](https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-prometheus-stack).
### 3. Environment Variables
#### Kubernetes Setup Variables
The following env variables are set:
- `MONITORING_NAMESPACE`: The namespace where Loki is installed
- `DYN_NAMESPACE`: The namespace where Dynamo Kubernetes Platform is installed
```bash
export MONITORING_NAMESPACE=monitoring
export DYN_NAMESPACE=dynamo-system
```
#### Dynamo Logging Variables
| Variable | Description | Example |
|----------|-------------|---------|
| `DYN_LOGGING_JSONL` | Enable JSONL logging format (required for Loki) | `true` |
| `DYN_LOG` | Log levels per target, in the form `<default_level>,<module_path>=<level>,<module_path>=<level>` | `info,dynamo_runtime::system_status_server=trace` |
| `DYN_LOG_USE_LOCAL_TZ` | Use local timezone for timestamps | `true` |
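To illustrate the `DYN_LOG` format described in the table, here is a small sketch that splits a filter string into a default level and per-module overrides. It is not Dynamo's own parser: the function name `parse_dyn_log` is hypothetical, and falling back to `info` when no default level is given is an assumption.

```python
def parse_dyn_log(spec: str) -> tuple[str, dict[str, str]]:
    """Split '<default_level>,<module_path>=<level>,...' into its parts."""
    default_level = "info"  # assumed fallback, not confirmed by the docs
    overrides: dict[str, str] = {}
    for part in (piece.strip() for piece in spec.split(",")):
        if not part:
            continue
        if "=" in part:
            # Module paths contain "::", so split on the last "=" only.
            module_path, level = part.rsplit("=", 1)
            overrides[module_path] = level
        else:
            default_level = part
    return default_level, overrides


print(parse_dyn_log("info,dynamo_runtime::system_status_server=trace"))
```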
## Installation Steps
### 1. Install Loki
First, we'll install Loki in single binary mode, which is ideal for testing and development:
```bash
# Add the Grafana Helm repository
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update
# Install Loki
helm install --values deploy/observability/k8s/logging/values/loki-values.yaml loki grafana/loki -n $MONITORING_NAMESPACE
```
Our configuration (`loki-values.yaml`) sets up Loki in a simple configuration that is suitable for testing and development. It uses a local MinIO for storage. The installation pods can be viewed with:
```bash
kubectl get pods -n $MONITORING_NAMESPACE -l app=loki
```
### 2. Install Grafana Alloy
Next, install the Grafana Alloy collector to gather logs from your Kubernetes cluster and forward them to Loki. Here we use the Helm chart `k8s-monitoring` provided by Grafana to install the collector:
```bash
# Generate a custom values file with the namespace information
envsubst < deploy/observability/k8s/logging/values/alloy-values.yaml > alloy-custom-values.yaml
# Install the collector
helm install --values alloy-custom-values.yaml alloy grafana/k8s-monitoring -n $MONITORING_NAMESPACE
```
The values file (`alloy-values.yaml`) includes the following configurations for the collector:
- Destination to forward logs to Loki
- Namespace to collect logs from
- Pod labels to be mapped to Loki labels
- Collection method (kubernetesApi or tailing `/var/log/containers/`)
```yaml
destinations:
  - name: loki
    type: loki
    url: http://loki-gateway.$MONITORING_NAMESPACE.svc.cluster.local/loki/api/v1/push
podLogs:
  enabled: true
  gatherMethod: kubernetesApi # collect logs from the kubernetes api, rather than /var/log/containers/; friendly for testing and development
  collector: alloy-logs
  labels:
    app_kubernetes_io_name: app.kubernetes.io/name
    nvidia_com_dynamo_component_type: nvidia.com/dynamo-component-type
    nvidia_com_dynamo_graph_deployment_name: nvidia.com/dynamo-graph-deployment-name
  labelsToKeep:
    - "app_kubernetes_io_name"
    - "container"
    - "instance"
    - "job"
    - "level"
    - "namespace"
    - "service_name"
    - "service_namespace"
    - "deployment_environment"
    - "deployment_environment_name"
    - "nvidia_com_dynamo_component_type" # extract this label from the dynamo graph deployment
    - "nvidia_com_dynamo_graph_deployment_name" # extract this label from the dynamo graph deployment
  namespaces:
    - $DYN_NAMESPACE
```
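The `labels` mapping above follows a visible naming convention: each Loki label is the Kubernetes label key with every character other than letters, digits, and underscores replaced by `_`. The sketch below reproduces that convention so you can derive names for additional labels. The function name `to_loki_label` is hypothetical, and the mapping in the values file is still written out explicitly; this only illustrates the pattern.

```python
import re


def to_loki_label(k8s_label_key: str) -> str:
    """Replace characters that are not letters, digits, or '_' with '_'."""
    return re.sub(r"[^A-Za-z0-9_]", "_", k8s_label_key)


for key in (
    "app.kubernetes.io/name",
    "nvidia.com/dynamo-component-type",
    "nvidia.com/dynamo-graph-deployment-name",
):
    print(f"{to_loki_label(key)}: {key}")
```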
### 3. Configure Grafana with the Loki datasource and Dynamo Logs dashboard
We will be viewing the logs associated with our DynamoGraphDeployment in Grafana. To do this, we need to configure Grafana with the Loki datasource and Dynamo Logs dashboard.
Since we are using Grafana with the Prometheus Operator, we can simply apply the following ConfigMaps to quickly achieve this configuration.
```bash
# Configure Grafana with the Loki datasource
envsubst < deploy/observability/k8s/logging/grafana/loki-datasource.yaml | kubectl apply -n $MONITORING_NAMESPACE -f -
# Configure Grafana with the Dynamo Logs dashboard
kubectl apply -f deploy/observability/k8s/logging/grafana/logging-dashboard.yaml -n $MONITORING_NAMESPACE
```
> [!Note]
> If using Grafana installed without the Prometheus Operator, you can manually import the Loki datasource and Dynamo Logs dashboard using the Grafana UI.
### 4. Deploy a DynamoGraphDeployment with JSONL Logging
At this point, we should have everything in place to collect and view logs in our Grafana instance. All that is left is to deploy a DynamoGraphDeployment to collect logs from.
To enable structured logs in a DynamoGraphDeployment, set the `DYN_LOGGING_JSONL` environment variable to `1`. The `agg_logging.yaml` example for the SGLang backend already does this, so we can deploy it directly:
```bash
kubectl apply -n $DYN_NAMESPACE -f examples/backends/sglang/deploy/agg_logging.yaml
```
Send a few chat completion requests to generate structured logs from the frontend and worker pods of the DynamoGraphDeployment. The logs are now ready to view in Grafana.
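Before opening Grafana, it can help to sanity-check the JSONL output locally, for example on lines captured with `kubectl logs`. The sketch below parses one JSON object per line and keeps records at or above a given severity. The field names `level` and `message` and the sample lines are assumptions for illustration; inspect a real log line to confirm the actual schema.

```python
import json

LEVELS = ["TRACE", "DEBUG", "INFO", "WARN", "ERROR"]


def filter_jsonl(lines, min_level="WARN"):
    """Yield parsed records whose (assumed) 'level' field is >= min_level."""
    threshold = LEVELS.index(min_level)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not valid JSON
        level = str(record.get("level", "")).upper()
        if level in LEVELS and LEVELS.index(level) >= threshold:
            yield record


sample = [
    '{"level": "INFO", "message": "request received"}',
    '{"level": "ERROR", "message": "worker unavailable"}',
    "plain text line that is not JSON",
]
kept = list(filter_jsonl(sample))
print(kept)
```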
## Viewing Logs in Grafana
Port-forward the Grafana service to access the UI:
```bash
kubectl port-forward svc/prometheus-grafana 3000:80 -n $MONITORING_NAMESPACE
```
If everything is working, under Home > Dashboards > Dynamo Logs, you should see a dashboard for viewing the logs associated with your DynamoGraphDeployments.
The dashboard enables filtering by DynamoGraphDeployment, namespace, and component type (e.g., frontend, worker, etc.).
# Dynamo Metrics Collection on Kubernetes
## Overview
This guide provides a walkthrough for collecting and visualizing metrics from Dynamo components using the kube-prometheus-stack. The kube-prometheus-stack provides a powerful and flexible way to configure monitoring for Kubernetes applications through custom resources like PodMonitors, making it easy to automatically discover and scrape metrics from Dynamo components.
## Prerequisites
### Install kube-prometheus-stack
If you don't have an existing Prometheus setup, you'll likely want to install the kube-prometheus-stack. This is a collection of Kubernetes manifests that includes the Prometheus Operator, Prometheus, Grafana, and other monitoring components in a pre-configured setup. The stack introduces custom resources that make it easy to deploy and manage monitoring in Kubernetes:
- `PodMonitor`: Automatically discovers and scrapes metrics from pods based on label selectors
- `ServiceMonitor`: Similar to PodMonitor but works with Services
- `PrometheusRule`: Defines alerting and recording rules
For a basic installation:
```bash
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
# Values allow PodMonitors to be picked up that are outside of the kube-prometheus-stack helm release
helm install prometheus -n monitoring --create-namespace prometheus-community/kube-prometheus-stack \
--set prometheus.prometheusSpec.podMonitorSelectorNilUsesHelmValues=false \
--set prometheus.prometheusSpec.podMonitorNamespaceSelector.matchLabels=null \
--set prometheus.prometheusSpec.probeNamespaceSelector.matchLabels=null
```
> [!Note]
> The commands below assume you installed the kube-prometheus-stack using the method above. If your monitoring stack is configured differently, adjust the `kubectl` commands in this document accordingly (e.g., change the namespace or Service names).
### Install Dynamo Operator
Before setting up metrics collection, you'll need to have the Dynamo operator installed in your cluster. Follow our [Installation Guide](../installation_guide.md) for detailed instructions on deploying the Dynamo operator.
Make sure to set the `dynamo-operator.dynamo.metrics.prometheusEndpoint` to the Prometheus endpoint you installed in the previous step.
```bash
helm install dynamo-platform ... \
--set dynamo-operator.dynamo.metrics.prometheusEndpoint=http://prometheus-kube-prometheus-prometheus.monitoring.svc.cluster.local:9090
```
### Node Exporter for CPU/Memory Metrics
The Dynamo Grafana dashboard includes panels for node-level CPU utilization, system load, and container resource usage. These metrics are collected and exported to Prometheus via [node-exporter](https://github.com/prometheus/node_exporter), which exposes hardware and OS metrics from Linux systems.
> [!Note]
> The kube-prometheus-stack installation described above includes node-exporter by default. If you're using a custom Prometheus setup, you'll need to ensure node-exporter is deployed as a DaemonSet on your cluster nodes.
To verify node-exporter is running:
```bash
kubectl get daemonset -A | grep node-exporter
```
If node-exporter is not running, you can install it via the kube-prometheus-stack or deploy it separately. For more information, see the [node-exporter documentation](https://github.com/prometheus/node_exporter).
### DCGM Metrics Collection (Optional)
GPU utilization metrics are collected and exported to Prometheus via dcgm-exporter. The Dynamo Grafana dashboard includes a panel for GPU utilization related to your Dynamo deployment. For that panel to be populated, you need to ensure that the dcgm-exporter is running in your cluster. To check if the dcgm-exporter is running, please run the following command:
```bash
kubectl get daemonset -A | grep dcgm-exporter
```
If the output is empty, you need to install the dcgm-exporter. For more information, please consult the official [dcgm-exporter documentation](https://docs.nvidia.com/datacenter/cloud-native/gpu-telemetry/latest/dcgm-exporter.html).
## Deploy a DynamoGraphDeployment
Let's start by deploying a simple vLLM aggregated deployment:
```bash
export NAMESPACE=dynamo-system # namespace where dynamo operator is installed
pushd examples/backends/vllm/deploy
kubectl apply -f agg.yaml -n $NAMESPACE
popd
```
This will create two components:
- A Frontend component exposing metrics on its HTTP port
- A Worker component exposing metrics on its system port
Both components expose a `/metrics` endpoint following the OpenMetrics format, but with different metrics appropriate to their roles. For details about:
- Deployment configuration: See the [vLLM README](../../backends/vllm/README.md)
- Available metrics: See the [metrics guide](../../observability/metrics.md)
### Validate the Deployment
Let's send some test requests to populate metrics:
```bash
curl localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen3-0.6B",
"messages": [
{
"role": "user",
"content": "In the heart of Eldoria, an ancient land of boundless magic and mysterious creatures, lies the long-forgotten city of Aeloria. Once a beacon of knowledge and power, Aeloria was buried beneath the shifting sands of time, lost to the world for centuries. You are an intrepid explorer, known for your unparalleled curiosity and courage, who has stumbled upon an ancient map hinting at ests that Aeloria holds a secret so profound that it has the potential to reshape the very fabric of reality. Your journey will take you through treacherous deserts, enchanted forests, and across perilous mountain ranges. Your Task: Character Background: Develop a detailed background for your character. Describe their motivations for seeking out Aeloria, their skills and weaknesses, and any personal connections to the ancient city or its legends. Are they driven by a quest for knowledge, a search for lost familt clue is hidden."
}
],
"stream": true,
"max_tokens": 30
}'
```
For more information about validating the deployment, see the [vLLM README](../../backends/vllm/README.md).
## Set Up Metrics Collection
### Create PodMonitors
The Prometheus Operator uses PodMonitor resources to automatically discover and scrape metrics from pods. To enable this discovery, the Dynamo operator automatically creates a PodMonitor resource and adds these labels to all pods:
- `nvidia.com/metrics-enabled: "true"` - Enables metrics collection
- `nvidia.com/dynamo-component-type: "frontend|worker"` - Identifies the component type
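As an illustration of label-based discovery, the sketch below applies the equality semantics of a Kubernetes `matchLabels` selector to a pod's labels. The selector shown is an assumption based on the labels listed above; the selector in the PodMonitor that the operator actually creates may differ.

```python
def matches_selector(pod_labels: dict, match_labels: dict) -> bool:
    """Return True when every selector key/value pair is present on the pod."""
    return all(pod_labels.get(key) == value for key, value in match_labels.items())


# Assumed selector: scrape only pods that have metrics enabled.
selector = {"nvidia.com/metrics-enabled": "true"}

frontend_pod = {
    "nvidia.com/metrics-enabled": "true",
    "nvidia.com/dynamo-component-type": "frontend",
}
unlabeled_pod = {"app": "something-else"}

print(matches_selector(frontend_pod, selector))
print(matches_selector(unlabeled_pod, selector))
```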
> **Note**: You can opt specific deployments out of metrics collection by adding this annotation to your DynamoGraphDeployment:
```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: my-deployment
  annotations:
    nvidia.com/enable-metrics: "false"
spec:
  # …
```
### Configure Grafana Dashboard
Apply the Dynamo dashboard configuration to populate Grafana with the Dynamo dashboard:
```bash
kubectl apply -n monitoring -f deploy/observability/k8s/grafana-dynamo-dashboard-configmap.yaml
```
The dashboard is embedded in the ConfigMap. Because the ConfigMap is labeled with `grafana_dashboard: "1"`, Grafana discovers it and adds it to its list of available dashboards. The dashboard includes panels for:
- Frontend request rates
- Time to first token
- Inter-token latency
- Request duration
- Input/Output sequence lengths
- GPU utilization via DCGM
- Node CPU utilization and system load
- Container CPU usage per pod
- Memory usage per pod
## Viewing the Metrics
### In Prometheus
```bash
kubectl port-forward svc/prometheus-kube-prometheus-prometheus 9090:9090 -n monitoring
```
Visit http://localhost:9090 and try these example queries:
- `dynamo_frontend_requests_total`
- `dynamo_frontend_time_to_first_token_seconds_bucket`
![Prometheus UI showing Dynamo metrics](../../images/prometheus-k8s.png)
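The same queries can be run without the UI through Prometheus's HTTP API. The sketch below builds the instant-query URL for `/api/v1/query` and includes a small fetch helper. It assumes the port-forward above is active on `localhost:9090`; the function names are illustrative, and `run_query` is defined but not called here because it needs a reachable Prometheus.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def instant_query_url(base_url: str, promql: str) -> str:
    """Build the URL for Prometheus's instant-query endpoint."""
    query_string = urlencode({"query": promql})
    return f"{base_url.rstrip('/')}/api/v1/query?{query_string}"


def run_query(base_url: str, promql: str) -> list:
    """Run the query and return the 'result' list from the JSON response."""
    with urlopen(instant_query_url(base_url, promql), timeout=10) as response:
        payload = json.load(response)
    return payload["data"]["result"]


print(instant_query_url("http://localhost:9090", "dynamo_frontend_requests_total"))
```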
### In Grafana
```bash
# Get Grafana credentials
export GRAFANA_USER=$(kubectl get secret -n monitoring prometheus-grafana -o jsonpath="{.data.admin-user}" | base64 --decode)
export GRAFANA_PASSWORD=$(kubectl get secret -n monitoring prometheus-grafana -o jsonpath="{.data.admin-password}" | base64 --decode)
echo "Grafana user: $GRAFANA_USER"
echo "Grafana password: $GRAFANA_PASSWORD"
# Port forward Grafana service
kubectl port-forward svc/prometheus-grafana 3000:80 -n monitoring
```
Visit http://localhost:3000 and log in with the credentials captured above.
Once logged in, find the Dynamo dashboard under General.
![Grafana dashboard showing Dynamo metrics](../../images/grafana-k8s.png)
## Operator Metrics
> **Note:** The metrics described above are for Dynamo **applications** (frontends, workers). The Dynamo **Operator** itself also exposes metrics for monitoring controller reconciliation, webhook validation, and resource inventory.
>
> See the **[Operator Metrics Guide](operator-metrics.md)** for details on operator-specific metrics and the operator dashboard.
```{toctree}
:hidden:
Logging <logging>
Operator Metrics <operator-metrics>
```
# Dynamo Operator Metrics
## Overview
The Dynamo Operator exposes Prometheus metrics for monitoring its own health and performance. These metrics are separate from application metrics (frontend/worker) and provide visibility into:
- **Controller Reconciliation**: How efficiently controllers process DynamoGraphDeployments, DynamoComponentDeployments, and DynamoModels
- **Webhook Validation**: Performance and outcomes of admission webhook requests
- **Resource Inventory**: Current count of managed resources by state and namespace
## Prerequisites
The operator metrics feature requires the same monitoring infrastructure as application metrics. For detailed setup instructions, see the [Kubernetes Metrics Guide](./metrics.md#prerequisites).
**Quick checklist:**
- ✅ kube-prometheus-stack installed (for ServiceMonitor support)
- ✅ Prometheus and Grafana running
- ✅ Dynamo Operator installed via Helm
## Metrics Collection
### ServiceMonitor
Operator metrics are automatically collected via a ServiceMonitor, which is created by the Helm chart when `metricsService.enabled: true` (default).
**Unlike application metrics** (which use PodMonitor), the operator uses ServiceMonitor and requires no manual RBAC configuration. The operator's kube-rbac-proxy sidecar is configured with `--ignore-paths=/metrics` to allow Prometheus access.
To verify the ServiceMonitor is created:
```bash
kubectl get servicemonitor -n dynamo-system
```
### Disabling Metrics Collection
To disable operator metrics collection:
```bash
helm upgrade dynamo-platform dynamo-platform-${RELEASE_VERSION}.tgz \
--namespace dynamo-system \
--set dynamo-operator.metricsService.enabled=false
```
## Available Metrics
All metrics use the `dynamo_operator` namespace prefix.
### Reconciliation Metrics
| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| `dynamo_operator_reconcile_duration_seconds` | Histogram | `resource_type`, `namespace`, `result` | Duration of reconciliation loops |
| `dynamo_operator_reconcile_total` | Counter | `resource_type`, `namespace`, `result` | Total number of reconciliations |
| `dynamo_operator_reconcile_errors_total` | Counter | `resource_type`, `namespace`, `error_type` | Total reconciliation errors by type |
**Labels:**
- `resource_type`: `DynamoGraphDeployment`, `DynamoComponentDeployment`, `DynamoModel`, `DynamoGraphDeploymentRequest`, `DynamoGraphDeploymentScalingAdapter`
- `namespace`: Target namespace of the resource
- `result`: `success`, `error`, `requeue`
- `error_type`: `not_found`, `already_exists`, `conflict`, `validation`, `bad_request`, `unauthorized`, `forbidden`, `timeout`, `server_timeout`, `unavailable`, `rate_limited`, `internal`
### Webhook Metrics
| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| `dynamo_operator_webhook_duration_seconds` | Histogram | `resource_type`, `operation` | Duration of webhook validation requests |
| `dynamo_operator_webhook_requests_total` | Counter | `resource_type`, `operation`, `result` | Total webhook admission requests |
| `dynamo_operator_webhook_denials_total` | Counter | `resource_type`, `operation`, `reason` | Total webhook denials with reasons |
**Labels:**
- `resource_type`: Same as reconciliation metrics
- `operation`: `CREATE`, `UPDATE`, `DELETE`
- `result`: `allowed`, `denied`
- `reason`: Validation failure reason (e.g., `immutable_field_changed`, `invalid_config`)
### Resource Inventory Metrics
| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| `dynamo_operator_resources_total` | Gauge | `resource_type`, `namespace`, `status` | Current count of resources by state |
**Labels:**
- `resource_type`: `DynamoGraphDeployment`, `DynamoComponentDeployment`, `DynamoModel`, `DynamoGraphDeploymentRequest`, `DynamoGraphDeploymentScalingAdapter`
- `namespace`: Resource namespace
- `status`: Resource state derived from each CRD's status. Common values:
  - `"ready"` - Resource is healthy and operational (DCD, DM, DGDSA)
  - `"not_ready"` - Resource exists but is not operational (DCD, DM, DGDSA)
  - `"unknown"` - State cannot be determined (default for empty status)
  - DGD uses: `"pending"`, `"successful"`, `"failed"` from `.status.state`
  - DGDR uses: `"Pending"`, `"Profiling"`, `"Deploying"`, `"Ready"`, `"DeploymentDeleted"`, `"Failed"` from `.status.state`
## Example Queries
### Reconciliation Performance
```promql
# P95 reconciliation duration by resource type
histogram_quantile(0.95,
sum by (resource_type, le) (
rate(dynamo_operator_reconcile_duration_seconds_bucket[5m])
)
)
# Reconciliation rate by result
sum by (resource_type, result) (
rate(dynamo_operator_reconcile_total[5m])
)
# Error rate by type
sum by (resource_type, error_type) (
rate(dynamo_operator_reconcile_errors_total[5m])
)
```
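The P95 queries rely on `histogram_quantile`, which estimates a quantile from cumulative bucket counts by linear interpolation inside the bucket that contains the target rank. The sketch below is a simplified re-implementation for intuition only: it ignores edge cases that Prometheus handles (such as negative bucket bounds), and the bucket boundaries in the example are made up.

```python
import math


def histogram_quantile(q: float, buckets: list) -> float:
    """Estimate quantile q from (upper_bound, cumulative_count) pairs.

    The list must include the +Inf bucket, as Prometheus histograms do.
    """
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return buckets[-2][0]
            if count == prev_count:
                return prev_bound
            # Linear interpolation within the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


# Hypothetical reconcile-duration buckets: (le in seconds, cumulative count).
example = [(0.1, 50), (0.5, 90), (1.0, 100), (math.inf, 100)]
print(histogram_quantile(0.95, example))
```

Because the estimate is interpolated, its accuracy depends on how finely the buckets are spaced around the quantile of interest.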
### Webhook Performance
```promql
# Webhook P95 latency
histogram_quantile(0.95,
sum by (resource_type, le) (
rate(dynamo_operator_webhook_duration_seconds_bucket[5m])
)
)
# Webhook denial rate
sum by (resource_type, operation, reason) (
rate(dynamo_operator_webhook_denials_total[5m])
)
```
### Resource Inventory
```promql
# Total resources by type and state
sum by (resource_type, status) (
dynamo_operator_resources_total
)
# DynamoGraphDeployments by state
sum by (status) (
dynamo_operator_resources_total{resource_type="DynamoGraphDeployment"}
)
# All resources by namespace and state
sum by (resource_type, namespace, status) (
dynamo_operator_resources_total
)
```
## Grafana Dashboard
A pre-built Grafana dashboard is available for visualizing operator metrics.
### Dashboard Sections
1. **Reconciliation Metrics** (3 panels)
   - Reconciliation rate by resource type and result
   - P95 reconciliation duration
   - Reconciliation errors by type
2. **Webhook Metrics** (3 panels)
   - Webhook request rate by operation
   - P95 webhook duration
   - Webhook denials by reason
3. **Resource Inventory** (2 panels)
   - Resource inventory timeline by state and namespace (filterable by resource type)
   - Current resource count by state (filterable by resource type)
4. **Operational Health** (2 panels)
   - Reconciliation success rate gauges
   - Webhook admission success rate gauges
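A success-rate gauge boils down to the share of outcomes carrying the successful `result` label. The sketch below shows that arithmetic on per-result counts. The exact expression used by the dashboard is not shown in this guide, so treating `requeue` as a non-success outcome is an assumption, and the numbers are invented.

```python
def success_rate(counts_by_result: dict, success_label: str = "success") -> float:
    """Return the fraction of outcomes labeled as successful."""
    total = sum(counts_by_result.values())
    if total == 0:
        return float("nan")  # no data yet
    return counts_by_result.get(success_label, 0) / total


# Hypothetical increases of dynamo_operator_reconcile_total over a window.
reconcile_counts = {"success": 90, "error": 5, "requeue": 5}
# Hypothetical increases of dynamo_operator_webhook_requests_total.
webhook_counts = {"allowed": 48, "denied": 2}

print(success_rate(reconcile_counts))
print(success_rate(webhook_counts, success_label="allowed"))
```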
### Deploying the Dashboard
```bash
kubectl apply -f deploy/observability/k8s/grafana-operator-dashboard-configmap.yaml
```
The dashboard will automatically appear in Grafana (assuming you have the Grafana dashboard sidecar configured, which is included in kube-prometheus-stack).
### Finding the Dashboard
1. Port-forward to Grafana (if needed):
   ```bash
   kubectl port-forward svc/prometheus-grafana 3000:80 -n monitoring
   ```
2. Log in to Grafana at http://localhost:3000
3. Navigate to **Dashboards** → Search for **"Dynamo Operator"**
### Dashboard Filters
The dashboard includes two filter variables:
- **Namespace**: View metrics across all namespaces or filter by specific ones (multi-select)
- **Resource Type**: Filter all panels by resource type or select "All" to see aggregated metrics across all CRDs (single select)
When "All" is selected for Resource Type, all panels will show data for all five managed CRDs with resource_type labels for differentiation.
## Accessing Metrics Directly
For instructions on accessing Prometheus and Grafana, see the [Kubernetes Metrics Guide](./metrics.md#viewing-the-metrics).
Once you have access to Prometheus, you can query operator metrics directly:
```bash
# Port-forward to Prometheus
kubectl port-forward svc/prometheus-kube-prometheus-prometheus 9090:9090 -n monitoring
# Visit http://localhost:9090 and try queries like:
# - dynamo_operator_reconcile_total
# - dynamo_operator_webhook_requests_total
# - dynamo_operator_resources_total
```
## Troubleshooting
### Metrics Not Appearing in Prometheus
1. **Check ServiceMonitor exists:**
   ```bash
   kubectl get servicemonitor -n dynamo-system | grep operator
   ```
2. **Check ServiceMonitor is discovered by Prometheus:**
   - Go to Prometheus UI → Status → Targets
   - Look for `serviceMonitor/dynamo-system/dynamo-platform-dynamo-operator-operator`
   - Should show state: `UP`
3. **Check Prometheus selector configuration:**
   ```bash
   kubectl get prometheus -n monitoring -o yaml | grep serviceMonitorSelector
   ```
   Ensure `serviceMonitorSelectorNilUsesHelmValues: false` was set during kube-prometheus-stack installation.
### Dashboard Not Appearing in Grafana
1. **Check ConfigMap is created:**
   ```bash
   kubectl get configmap -n monitoring grafana-operator-dashboard
   ```
2. **Check ConfigMap has the label:**
   ```bash
   kubectl get configmap -n monitoring grafana-operator-dashboard -o jsonpath='{.metadata.labels.grafana_dashboard}'
   ```
   The command should print `1`.
3. **Check Grafana dashboard sidecar configuration:**
   ```bash
   kubectl get deployment -n monitoring prometheus-grafana -o yaml | grep -A 5 sidecar
   ```
   The sidecar should be configured to watch for the `grafana_dashboard: "1"` label.
4. **Restart Grafana pod** to force dashboard refresh:
   ```bash
   kubectl rollout restart deployment/prometheus-grafana -n monitoring
   ```
## Related Documentation
- [Kubernetes Metrics Guide](./metrics.md) - Application metrics for frontends and workers
- [Dynamo Operator Guide](../dynamo_operator.md) - Operator architecture and deployment modes
- [Operator Webhooks](../webhooks.md) - Webhook validation details
<!--
SPDX-FileCopyrightText: Copyright (c) 2024-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
# Service Discovery
Dynamo components (frontends, workers, planner) need to discover each other and their capabilities at runtime. We refer to this as service discovery. Two service discovery backends are supported on Kubernetes.
## Discovery Backends
| Backend | Default | Dependencies | Use Case |
|---------|---------|--------------|----------|
| **Kubernetes** | ✅ Yes | None (native K8s) | Recommended for all Kubernetes deployments |
| **KV Store (etcd)** | No | etcd cluster | Legacy deployments |
## Kubernetes Discovery (Default)
Kubernetes discovery is the default and recommended backend when running on Kubernetes. It uses native Kubernetes primitives to facilitate discovery of components:
- **DynamoWorkerMetadata CRD**: Each worker stores its registered endpoints and model cards in a Custom Resource
- **EndpointSlices**: EndpointSlices signal each component's readiness status
### Implementation Details
Each pod runs a **discovery daemon** that watches both EndpointSlices and DynamoWorkerMetadata CRs. A pod is only discoverable when it appears as "ready" in an EndpointSlice AND has a corresponding `DynamoWorkerMetadata` CR. This correlation ensures pods aren't discoverable until they're ready, metadata is immediately available, and stale entries are cleaned up when pods terminate.
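The correlation rule can be pictured as a set intersection: a pod is discoverable only if it is ready in an EndpointSlice and also has a metadata CR. The sketch below models this with plain Python data instead of Kubernetes watches. The function name and data shapes are hypothetical and do not reflect the daemon's real implementation.

```python
def discoverable_pods(ready_pods: set, metadata_by_pod: dict) -> dict:
    """Return metadata only for pods that are ready AND have a metadata CR."""
    return {
        name: metadata_by_pod[name]
        for name in sorted(ready_pods)
        if name in metadata_by_pod
    }


# Pods reported as ready by EndpointSlices.
ready = {"worker-a", "worker-b"}
# DynamoWorkerMetadata CRs, keyed by pod name (the CR is named after the pod).
metadata = {
    "worker-a": {"endpoints": ["dynamo/backend/generate"]},
    "worker-c": {"endpoints": ["dynamo/backend/generate"]},  # not ready yet
}

print(discoverable_pods(ready, metadata))
```

In this example only `worker-a` is returned: `worker-b` is ready but has not published metadata, and `worker-c` has metadata but is not ready.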
#### DynamoWorkerMetadata CRD
Each worker pod creates a `DynamoWorkerMetadata` CR that stores its discovery metadata:
```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoWorkerMetadata
metadata:
  name: my-worker-pod-abc123
  namespace: dynamo-system
  ownerReferences:
    - apiVersion: v1
      kind: Pod
      name: my-worker-pod-abc123
      uid: <pod-uid>
      controller: true
spec:
  data:
    endpoints:
      "dynamo/backend/generate":
        type: Endpoint
        namespace: dynamo
        component: backend
        endpoint: generate
        instance_id: 12345678901234567890
        transport:
          nats_tcp: "dynamo_backend.generate-abc123"
    model_cards: {}
```
The CR is named after the pod and includes an owner reference for automatic garbage collection when the pod is deleted.
#### EndpointSlices
While DynamoWorkerMetadata resources provide an up-to-date snapshot of a component's capabilities, EndpointSlices provide a snapshot of the health of the various Dynamo components.
The operator creates a Kubernetes Service targeting the Dynamo components. The Kubernetes EndpointSlice controller in turn creates and maintains EndpointSlice resources that track the readiness of the pods targeted by the Service. Watching these slices gives an up-to-date view of which Dynamo components are ready to serve traffic.
##### Readiness Probes
A pod is marked ready if the readiness probe succeeds. On Dynamo workers, this is when the `generate` endpoint is available and healthy. These probes are configured by the Dynamo operator for each pod/component.
#### RBAC
Each Dynamo component pod is automatically given a ServiceAccount that allows it to watch `EndpointSlice` and `DynamoWorkerMetadata` resources within its namespace.
#### Environment Variables
The following environment variables are automatically injected into pods by the operator to facilitate service discovery:
| Variable | Description |
|----------|-------------|
| `DYN_DISCOVERY_BACKEND` | Set to `kubernetes` |
| `POD_NAME` | Pod name (via downward API) |
| `POD_NAMESPACE` | Pod namespace (via downward API) |
| `POD_UID` | Pod UID (via downward API) |
The pod's instance ID is deterministically generated by hashing the pod name, ensuring consistent identity and correlation between EndpointSlices and CRs.
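As a rough illustration of what "deterministically generated by hashing the pod name" means, the sketch below derives a 64-bit ID from a pod name. The hash function Dynamo actually uses is not stated here; SHA-256 truncated to 8 bytes and the function name `instance_id_from_pod_name` are assumptions chosen only for the example.

```python
import hashlib


def instance_id_from_pod_name(pod_name: str) -> int:
    """Map a pod name to a stable 64-bit integer (illustrative hash choice)."""
    digest = hashlib.sha256(pod_name.encode("utf-8")).digest()
    # Assumption: keep the first 8 bytes so the ID fits in an unsigned 64-bit value.
    return int.from_bytes(digest[:8], "big")


first = instance_id_from_pod_name("my-worker-pod-abc123")
second = instance_id_from_pod_name("my-worker-pod-abc123")
print(first == second)  # the same pod name always yields the same ID
```

Because the ID depends only on the pod name, any watcher can recompute it from an EndpointSlice entry and match it against the ID stored in the CR without extra coordination.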
## KV Store Discovery (etcd)
To use etcd-based discovery instead of Kubernetes-native discovery, add the `nvidia.com/dynamo-discovery-backend: etcd` annotation to your DynamoGraphDeployment:
```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: my-deployment
  annotations:
    nvidia.com/dynamo-discovery-backend: etcd
spec:
  services:
    # ...
```
This requires an etcd cluster to be available. The etcd connection is configured via the platform Helm chart.
# Webhooks
This document describes the webhook functionality in the Dynamo Operator, including validation webhooks, certificate management, and troubleshooting.
## Table of Contents
- [Overview](#overview)
- [Architecture](#architecture)
- [Configuration](#configuration)
  - [Enabling/Disabling Webhooks](#enablingdisabling-webhooks)
  - [Certificate Management Options](#certificate-management-options)
  - [Advanced Configuration](#advanced-configuration)
- [Certificate Management](#certificate-management)
  - [Automatic Certificates (Default)](#automatic-certificates-default)
  - [cert-manager Integration](#cert-manager-integration)
  - [External Certificates](#external-certificates)
- [Multi-Operator Deployments](#multi-operator-deployments)
- [Troubleshooting](#troubleshooting)
---
## Overview
The Dynamo Operator uses **Kubernetes admission webhooks** to provide real-time validation and mutation of custom resources. Currently, the operator implements **validation webhooks** that ensure invalid configurations are rejected immediately at the API server level, providing faster feedback to users compared to controller-based validation.
All webhook types (validating, mutating, conversion, etc.) share the same **webhook server** and **TLS certificate infrastructure**, making certificate management consistent across all webhook operations.
### Key Features
- **Enabled by default** - Zero-touch validation out of the box
- **Shared certificate infrastructure** - All webhook types use the same TLS certificates
- **Automatic certificate generation** - No manual certificate management required
- **Defense in depth** - Controllers validate when webhooks are disabled
- **cert-manager integration** - Optional integration for automated certificate lifecycle
- **Multi-operator support** - Lease-based coordination for cluster-wide and namespace-restricted deployments
- **Immutability enforcement** - Critical fields protected via CEL validation rules
### Current Webhook Types
- **Validating Webhooks**: Validate custom resource specifications before persistence
  - `DynamoComponentDeployment` validation
  - `DynamoGraphDeployment` validation
  - `DynamoModel` validation
**Note:** Future releases may add mutating webhooks (for defaults/transformations) and conversion webhooks (for CRD version migrations). All will use the same certificate infrastructure described in this document.
---
## Architecture
```
┌─────────────────────────────────────────────────────────────────┐
│ API Server │
│ 1. User submits CR (kubectl apply) │
│ 2. API server calls ValidatingWebhookConfiguration │
└────────────────────────┬────────────────────────────────────────┘
│ HTTPS (TLS required)
┌─────────────────────────────────────────────────────────────────┐
│ Webhook Server (in Operator Pod) │
│ 3. Validates CR against business rules │
│ 4. Returns admit/deny decision + warnings │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ API Server │
│ 5. If admitted: Persist CR to etcd │
│ 6. If denied: Return error to user │
└─────────────────────────────────────────────────────────────────┘
```
### Validation Flow
1. **Webhook validation** (if enabled): Validates at API server level
2. **CEL validation**: Kubernetes-native immutability checks (always active)
3. **Controller validation** (if webhooks disabled): Defense-in-depth validation during reconciliation
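The flow above can be summarized as a small function that reports which validation layers are active for a given configuration. This is only a sketch of the rules listed here; `active_validation_layers` is a hypothetical helper, not part of the operator.

```python
def active_validation_layers(webhooks_enabled: bool) -> list[str]:
    """Return the validation layers that run, in the order described above."""
    layers = []
    if webhooks_enabled:
        layers.append("webhook")      # validates at the API server level
    layers.append("cel")              # immutability checks, always active
    if not webhooks_enabled:
        layers.append("controller")   # defense in depth during reconciliation
    return layers


print(active_validation_layers(True))
print(active_validation_layers(False))
```

In both configurations two layers are active, so disabling webhooks moves validation later (to reconciliation) rather than removing it.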
---
## Configuration
### Enabling/Disabling Webhooks
Webhooks are **enabled by default**. To disable them:
```yaml
# Platform-level values.yaml
dynamo-operator:
  webhook:
    enabled: false
```
**When to disable webhooks:**
- During development/testing when rapid iteration is needed
- In environments where admission webhooks are not supported
- When troubleshooting validation issues
**Note:** When webhooks are disabled, controllers perform validation during reconciliation (defense in depth).
---
### Certificate Management Options
The operator supports three certificate management modes:
| Mode | Description | Use Case |
|------|-------------|----------|
| **Automatic (Default)** | Helm hooks generate self-signed certificates | Testing and development environments |
| **cert-manager** | Integrate with cert-manager for automated lifecycle | Production deployments with cert-manager |
| **External** | Bring your own certificates | Production deployments with custom PKI |
---
### Advanced Configuration
#### Complete Configuration Reference
```yaml
dynamo-operator:
  webhook:
    # Enable/disable validation webhooks
    enabled: true
    # Certificate management
    certManager:
      enabled: false
      issuerRef:
        kind: Issuer
        name: selfsigned-issuer
    # Certificate secret configuration
    certificateSecret:
      name: webhook-server-cert
      external: false
    # Certificate validity period (automatic generation only)
    certificateValidity: 3650  # 10 years
    # Certificate generator image (automatic generation only)
    certGenerator:
      image:
        repository: bitnami/kubectl
        tag: latest
    # Webhook behavior configuration
    failurePolicy: Fail  # Fail (reject on error) or Ignore (allow on error)
    timeoutSeconds: 10  # Webhook timeout
    # Namespace filtering (advanced)
    namespaceSelector: {}  # Kubernetes label selector for namespaces
```
#### Failure Policy
```yaml
# Fail: Reject resources if webhook is unavailable (recommended for production)
webhook:
  failurePolicy: Fail
# Ignore: Allow resources if webhook is unavailable (use with caution)
webhook:
  failurePolicy: Ignore
```
**Recommendation:** Use `Fail` in production to ensure validation is always enforced. Only use `Ignore` if you need high availability and can tolerate occasional invalid resources.
#### Namespace Filtering
Control which namespaces are validated (applies to **cluster-wide operator** only):
```yaml
# Only validate resources in namespaces with specific labels
webhook:
  namespaceSelector:
    matchLabels:
      dynamo-validation: enabled
# Or exclude specific namespaces
webhook:
  namespaceSelector:
    matchExpressions:
      - key: dynamo-validation
        operator: NotIn
        values: ["disabled"]
```
**Note:** For **namespace-restricted operators**, the namespace selector is automatically set to validate only the operator's namespace. This configuration is ignored in namespace-restricted mode.
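To make the two selector styles concrete, here is a minimal sketch of standard Kubernetes label-selector semantics applied to namespace labels. `matches_selector` is a hypothetical helper written for illustration; the real matching is done by the API server.

```python
def matches_selector(labels: dict, selector: dict) -> bool:
    """Evaluate a Kubernetes-style label selector against a set of labels."""
    for key, value in selector.get("matchLabels", {}).items():
        if labels.get(key) != value:
            return False
    for expr in selector.get("matchExpressions", []):
        key, op = expr["key"], expr["operator"]
        values = expr.get("values", [])
        if op == "In":
            ok = key in labels and labels[key] in values
        elif op == "NotIn":
            # NotIn also matches when the label key is absent.
            ok = key not in labels or labels[key] not in values
        elif op == "Exists":
            ok = key in labels
        elif op == "DoesNotExist":
            ok = key not in labels
        else:
            raise ValueError(f"unsupported operator: {op}")
        if not ok:
            return False
    return True  # an empty selector ({}) matches every namespace


opt_in = {"matchLabels": {"dynamo-validation": "enabled"}}
opt_out = {"matchExpressions": [
    {"key": "dynamo-validation", "operator": "NotIn", "values": ["disabled"]},
]}
print(matches_selector({}, opt_in), matches_selector({}, opt_out))
```

The practical difference is the default: with the `matchLabels` form an unlabeled namespace is not validated, while with the `NotIn` form an unlabeled namespace is validated unless it is explicitly labeled `disabled`.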
---
## Certificate Management
### Automatic Certificates (Default)
**Zero configuration required!** Certificates are automatically generated during `helm install` and `helm upgrade`.
#### How It Works
1. **Pre-install/pre-upgrade hook**: Generates self-signed TLS certificates
   - Root CA (valid 10 years)
   - Server certificate (valid 10 years)
   - Stores in Secret: `<release>-webhook-server-cert`
2. **Post-install/post-upgrade hook**: Injects CA bundle into `ValidatingWebhookConfiguration`
   - Reads `ca.crt` from Secret
   - Patches `ValidatingWebhookConfiguration` with base64-encoded CA bundle
3. **Operator pod**: Mounts certificate secret and serves webhook on port 9443
#### Certificate Validity
- **Root CA**: 10 years
- **Server Certificate**: 10 years (same as Root CA)
- **Automatic rotation**: The generation hook runs on every `helm upgrade` and regenerates certificates only when they are missing, expiring soon, or have incorrect SANs
#### Smart Certificate Generation
The certificate generation hook is intelligent:
- **Checks existing certificates** before generating new ones
- **Skips generation** if valid certificates exist (valid for 30+ days with correct SANs)
- **Regenerates** only when needed (missing, expiring soon, or incorrect SANs)
This means:
- Fast `helm upgrade` operations (no unnecessary cert generation)
- Safe to run `helm upgrade` frequently
- Certificates persist across reinstalls (stored in Secret)
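The skip-or-regenerate decision described above could look roughly like the following. This is a sketch under stated assumptions: `needs_regeneration` and its parameters are hypothetical, and the real hook inspects the Secret with command-line tools rather than Python.

```python
from datetime import datetime, timedelta, timezone

MIN_REMAINING_VALIDITY = timedelta(days=30)  # "valid for 30+ days"


def needs_regeneration(not_after, cert_sans, required_sans, now=None):
    """Decide whether a new certificate should be generated.

    not_after is the certificate expiry time, or None if no certificate exists.
    """
    if not_after is None:
        return True  # missing
    now = now or datetime.now(timezone.utc)
    if not_after - now < MIN_REMAINING_VALIDITY:
        return True  # expiring soon (or already expired)
    # Incorrect SANs: every required name must be present on the certificate.
    return not set(required_sans).issubset(cert_sans)


now = datetime(2025, 1, 1, tzinfo=timezone.utc)
sans = ["webhook-service.dynamo-system.svc"]  # illustrative SAN
print(needs_regeneration(now + timedelta(days=365), sans, sans, now))
print(needs_regeneration(now + timedelta(days=10), sans, sans, now))
```

The first call reports that no regeneration is needed, the second that the certificate is too close to expiry.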
#### Manual Certificate Rotation
If you need to rotate certificates manually:
```bash
# Delete the certificate secret
kubectl delete secret <release>-webhook-server-cert -n <namespace>
# Upgrade the release to regenerate certificates
helm upgrade <release> dynamo-platform -n <namespace>
```
---
### cert-manager Integration
For clusters with cert-manager installed, you can enable automated certificate lifecycle management.
#### Prerequisites
1. **cert-manager installed** (v1.0+)
2. **CA issuer configured** (e.g., `selfsigned-issuer`)
#### Configuration
```yaml
dynamo-operator:
  webhook:
    certManager:
      enabled: true
      issuerRef:
        kind: Issuer  # Or ClusterIssuer
        name: selfsigned-issuer  # Your issuer name
```
#### How It Works
1. **Helm creates Certificate resource**: Requests TLS certificate from cert-manager
2. **cert-manager generates certificate**: Based on configured issuer
3. **cert-manager stores in Secret**: `<release>-webhook-server-cert`
4. **cert-manager ca-injector**: Automatically injects CA bundle into `ValidatingWebhookConfiguration`
5. **Operator pod**: Mounts certificate secret and serves webhook
#### Benefits Over Automatic Mode
- **Automated rotation**: cert-manager renews certificates before expiration
- **Custom validity periods**: Configure certificate lifetime
- **CA rotation support**: ca-injector handles CA updates automatically
- **Integration with existing PKI**: Use your organization's certificate infrastructure
#### Certificate Rotation
With cert-manager, certificate rotation is **fully automated**:
1. **Leaf certificate rotation** (default: every year)
   - cert-manager auto-renews before expiration
   - controller-runtime auto-reloads new certificate
   - **No pod restart required**
   - **No caBundle update required** (same Root CA)
2. **Root CA rotation** (every 10 years)
   - cert-manager rotates Root CA
   - ca-injector auto-updates caBundle in `ValidatingWebhookConfiguration`
   - **No manual intervention required**
#### Example: Self-Signed Issuer
```yaml
apiVersion: cert-manager.io/v1
kind: Issuer
metadata:
  name: selfsigned-issuer
  namespace: dynamo-system
spec:
  selfSigned: {}
---
# Enable in platform values.yaml
dynamo-operator:
  webhook:
    certManager:
      enabled: true
      issuerRef:
        kind: Issuer
        name: selfsigned-issuer
```
---
### External Certificates
Bring your own certificates for custom PKI requirements.
#### Steps
1. **Create certificate secret manually**:
```bash
kubectl create secret tls <release>-webhook-server-cert \
--cert=tls.crt \
--key=tls.key \
-n <namespace>
# Also add ca.crt to the secret
kubectl patch secret <release>-webhook-server-cert -n <namespace> \
--type='json' \
-p='[{"op": "add", "path": "/data/ca.crt", "value": "'$(base64 -w0 < ca.crt)'"}]'
```
2. **Configure operator to use external secret**:
```yaml
dynamo-operator:
webhook:
certificateSecret:
external: true
caBundle: <base64-encoded-ca-cert> # Must manually specify
```
3. **Deploy operator**:
```bash
helm install dynamo-platform . -n <namespace> -f values.yaml
```
#### Certificate Requirements
- **Secret name**: Must match `webhook.certificateSecret.name` (default: `webhook-server-cert`)
- **Secret keys**: `tls.crt`, `tls.key`, `ca.crt`
- **Certificate SAN**: Must include `<service-name>.<namespace>.svc`
  - Example: `dynamo-platform-dynamo-operator-webhook-service.dynamo-system.svc`
---
## Multi-Operator Deployments
The operator supports running both **cluster-wide** and **namespace-restricted** instances simultaneously using a **lease-based coordination mechanism**.
### Scenario
```
Cluster:
├─ Operator A (cluster-wide, namespace: platform-system)
│ └─ Validates all namespaces EXCEPT team-a
└─ Operator B (namespace-restricted, namespace: team-a)
└─ Validates only team-a namespace
```
### How It Works
1. **Namespace-restricted operator** creates a Lease in its namespace
2. **Cluster-wide operator** watches for Leases named `dynamo-operator-ns-lock`
3. **Cluster-wide operator** skips validation for namespaces with active Leases
4. **Namespace-restricted operator** validates resources in its namespace
### Lease Configuration
The lease mechanism is **automatically configured** based on deployment mode:
```yaml
# Cluster-wide operator (default)
namespaceRestriction:
  enabled: false
# → Watches for leases in all namespaces
# → Skips validation for namespaces with active leases
# Namespace-restricted operator
namespaceRestriction:
  enabled: true
  namespace: team-a
# → Creates lease in team-a namespace
# → Does NOT check for leases (no cluster permissions)
```
### Deployment Example
```bash
# 1. Deploy cluster-wide operator
helm install platform-operator dynamo-platform \
-n platform-system \
--set namespaceRestriction.enabled=false
# 2. Deploy namespace-restricted operator for team-a
helm install team-a-operator dynamo-platform \
-n team-a \
--set namespaceRestriction.enabled=true \
--set namespaceRestriction.namespace=team-a
```
### ValidatingWebhookConfiguration Naming
The webhook configuration name reflects the deployment mode:
- **Cluster-wide**: `<release>-validating`
- **Namespace-restricted**: `<release>-validating-<namespace>`
Example:
```bash
# Cluster-wide
platform-operator-validating
# Namespace-restricted (team-a)
team-a-operator-validating-team-a
```
This allows multiple webhook configurations to coexist without conflicts.
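The naming rule can be expressed as a one-line helper, which is handy when scripting checks across several releases. `webhook_config_name` is a hypothetical function that only restates the pattern listed above.

```python
def webhook_config_name(release, restricted_namespace=None):
    """Build the ValidatingWebhookConfiguration name for a release."""
    if restricted_namespace is None:
        return f"{release}-validating"  # cluster-wide operator
    return f"{release}-validating-{restricted_namespace}"  # namespace-restricted


print(webhook_config_name("platform-operator"))
print(webhook_config_name("team-a-operator", "team-a"))
```

The two printed names match the cluster-wide and namespace-restricted examples shown above.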
### Lease Health
If the namespace-restricted operator is deleted or becomes unhealthy:
- Lease expires after `leaseDuration + gracePeriod` (default: ~30 seconds)
- Cluster-wide operator automatically resumes validation for that namespace
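A minimal sketch of how the cluster-wide operator's decision could depend on lease expiry is shown below. The `Lease` class, the function names, and the 15 s + 15 s split of the roughly 30-second window are all assumptions made for illustration; only the `leaseDuration + gracePeriod` rule comes from the text above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Lease:
    """Hypothetical, simplified view of a namespace lock Lease."""
    namespace: str
    renew_time: datetime
    lease_duration: timedelta


def lease_is_active(lease, now, grace_period):
    """A lease counts as active until leaseDuration + gracePeriod has elapsed."""
    return now < lease.renew_time + lease.lease_duration + grace_period


def cluster_wide_validates(namespace, leases, now, grace_period):
    """The cluster-wide operator skips namespaces that hold an active lease."""
    return not any(
        lease.namespace == namespace and lease_is_active(lease, now, grace_period)
        for lease in leases
    )


t0 = datetime(2025, 1, 1, tzinfo=timezone.utc)
grace = timedelta(seconds=15)  # assumed value
leases = [Lease("team-a", renew_time=t0, lease_duration=timedelta(seconds=15))]
print(cluster_wide_validates("team-a", leases, t0 + timedelta(seconds=10), grace))
print(cluster_wide_validates("team-a", leases, t0 + timedelta(seconds=31), grace))
```

Ten seconds after the last renewal the cluster-wide operator still defers to the namespace-restricted operator; once the lease has gone unrenewed past the window, it resumes validating `team-a`.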
---
## Troubleshooting
### Webhook Not Called
**Symptoms:**
- Invalid resources are accepted
- No validation errors in logs
**Checks:**
1. **Verify webhook is enabled**:
```bash
kubectl get validatingwebhookconfiguration | grep dynamo
```
2. **Check webhook configuration**:
```bash
kubectl get validatingwebhookconfiguration <name> -o yaml
# Verify:
# - caBundle is present and non-empty
# - clientConfig.service points to correct service
# - webhooks[].namespaceSelector matches your namespace
```
3. **Verify webhook service exists**:
```bash
kubectl get service -n <namespace> | grep webhook
```
4. **Check operator logs for webhook startup**:
```bash
kubectl logs -n <namespace> deployment/<release>-dynamo-operator | grep webhook
# Should see: "Webhooks are enabled - webhooks will validate, controllers will skip validation"
# Should see: "Starting webhook server"
```
---
### Connection Refused Errors
**Symptoms:**
```
Error from server (InternalError): Internal error occurred: failed calling webhook:
Post "https://...webhook-service...:443/validate-...": dial tcp ...:443: connect: connection refused
```
**Checks:**
1. **Verify operator pod is running**:
```bash
kubectl get pods -n <namespace> -l app.kubernetes.io/name=dynamo-operator
```
2. **Check webhook server is listening**:
```bash
# Port-forward to pod
kubectl port-forward -n <namespace> pod/<operator-pod> 9443:9443
# In another terminal, test connection
curl -k https://localhost:9443/validate-nvidia-com-v1alpha1-dynamocomponentdeployment
# Should NOT get "connection refused"
```
3. **Verify webhook port in deployment**:
```bash
kubectl get deployment -n <namespace> <release>-dynamo-operator -o yaml | grep -A5 "containerPort: 9443"
```
4. **Check for webhook initialization errors**:
```bash
kubectl logs -n <namespace> deployment/<release>-dynamo-operator | grep -i error
```
---
### Certificate Errors
**Symptoms:**
```
Error from server (InternalError): Internal error occurred: failed calling webhook:
x509: certificate signed by unknown authority
```
**Checks:**
1. **Verify caBundle is present**:
```bash
kubectl get validatingwebhookconfiguration <name> -o jsonpath='{.webhooks[0].clientConfig.caBundle}' | base64 -d
# Should output a valid PEM certificate
```
2. **Verify certificate secret exists**:
```bash
kubectl get secret -n <namespace> <release>-webhook-server-cert
```
3. **Check certificate validity**:
```bash
kubectl get secret -n <namespace> <release>-webhook-server-cert -o jsonpath='{.data.tls\.crt}' | base64 -d | openssl x509 -noout -text
# Check:
# - Not expired
# - SAN includes: <service-name>.<namespace>.svc
```
4. **Check CA injection job logs**:
```bash
kubectl logs -n <namespace> job/<release>-webhook-ca-inject-<revision>
```
---
### Helm Hook Job Failures
**Symptoms:**
- `helm install` or `helm upgrade` hangs or fails
- Certificate generation errors
**Checks:**
1. **List hook jobs**:
```bash
kubectl get jobs -n <namespace> | grep webhook
```
2. **Check job logs**:
```bash
# Certificate generation
kubectl logs -n <namespace> job/<release>-webhook-cert-gen-<revision>
# CA injection
kubectl logs -n <namespace> job/<release>-webhook-ca-inject-<revision>
```
3. **Check RBAC permissions**:
```bash
# Verify ServiceAccount exists
kubectl get sa -n <namespace> <release>-webhook-ca-inject
# Verify ClusterRole and ClusterRoleBinding exist
kubectl get clusterrole <release>-webhook-ca-inject
kubectl get clusterrolebinding <release>-webhook-ca-inject
```
4. **Manual cleanup**:
```bash
# Delete failed jobs
kubectl delete job -n <namespace> <release>-webhook-cert-gen-<revision>
kubectl delete job -n <namespace> <release>-webhook-ca-inject-<revision>
# Retry helm upgrade
helm upgrade <release> dynamo-platform -n <namespace>
```
---
### Validation Errors Not Clear
**Symptoms:**
- Webhook rejects resource but error message is unclear
**Solution:**
Check operator logs for detailed validation errors:
```bash
kubectl logs -n <namespace> deployment/<release>-dynamo-operator | grep "validate create\|validate update"
```
Webhook logs include:
- Resource name and namespace
- Validation errors with context
- Warnings for immutable field changes
---
### Stuck Deleting Resources
**Symptoms:**
- Resource stuck in "Terminating" state
- Webhook blocks finalizer removal
**Solution:**
The webhook automatically skips validation for resources being deleted. If stuck:
1. **Check if webhook is blocking**:
```bash
kubectl describe <resource-type> <name> -n <namespace>
# Look for events mentioning webhook errors
```
2. **Temporarily disable webhook**:
```bash
# Option 1: Delete ValidatingWebhookConfiguration
kubectl delete validatingwebhookconfiguration <name>
# Option 2: Set failurePolicy to Ignore
kubectl patch validatingwebhookconfiguration <name> \
--type='json' \
-p='[{"op": "replace", "path": "/webhooks/0/failurePolicy", "value": "Ignore"}]'
```
3. **Delete resource again**:
```bash
kubectl delete <resource-type> <name> -n <namespace>
```
4. **Re-enable webhook**:
```bash
helm upgrade <release> dynamo-platform -n <namespace>
```
---
## Best Practices
### Production Deployments
1. **Keep webhooks enabled** (default) for real-time validation
2. **Use `failurePolicy: Fail`** (default) to ensure validation is enforced
3. **Monitor webhook latency** - Validation adds ~10-50ms per resource operation
4. **Use cert-manager** for automated certificate lifecycle in large deployments
5. **Test webhook configuration** in staging before production
### Development Deployments
1. **Disable webhooks** for rapid iteration if needed
2. **Use `failurePolicy: Ignore`** if webhook availability is problematic
3. **Keep automatic certificates** (simpler than cert-manager for dev)
### Multi-Tenant Deployments
1. **Deploy one cluster-wide operator** for platform-wide validation
2. **Deploy namespace-restricted operators** for tenant-specific namespaces
3. **Monitor lease health** to ensure coordination works correctly
4. **Use unique release names** per namespace to avoid naming conflicts
---
## Additional Resources
- [Kubernetes Admission Webhooks](https://kubernetes.io/docs/reference/access-authn-authz/extensible-admission-controllers/)
- [cert-manager Documentation](https://cert-manager.io/docs/)
- [Kubebuilder Webhook Tutorial](https://book.kubebuilder.io/cronjob-tutorial/webhook-implementation.html)
- [CEL Validation Rules](https://kubernetes.io/docs/reference/using-api/cel/)
---
## Support
For issues or questions:
- Check [Troubleshooting](#troubleshooting) section
- Review operator logs: `kubectl logs -n <namespace> deployment/<release>-dynamo-operator`
- Open an issue on GitHub
<!--
SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
-->
# Dynamo Observability
## Getting Started Quickly
This is an example to get started quickly on a single machine.
### Prerequisites
Install these on your machine:
- [Docker](https://docs.docker.com/get-docker/)
- [Docker Compose](https://docs.docker.com/compose/install/)
### Starting the Observability Stack
Dynamo provides a Docker Compose-based observability stack that includes Prometheus, Grafana, Tempo, and various exporters for metrics, tracing, and visualization.
From the Dynamo root directory:
```bash
# Start infrastructure (NATS, etcd)
docker compose -f deploy/docker-compose.yml up -d
# Start observability stack (Prometheus, Grafana, Tempo, DCGM GPU exporter, NATS exporter)
docker compose -f deploy/docker-observability.yml up -d
```
For detailed setup instructions and configuration, see [Prometheus + Grafana Setup](prometheus-grafana.md).
## Observability Documentation
| Guide | Description | Environment Variables to Control |
|-------|-------------|----------------------------------|
| [Metrics](metrics.md) | Available metrics reference | `DYN_SYSTEM_PORT`† |
| [Operator Metrics (Kubernetes)](../kubernetes/observability/operator-metrics.md) | Operator controller and webhook metrics for Kubernetes | N/A (configured via Helm) |
| [Health Checks](health-checks.md) | Component health monitoring and readiness probes | `DYN_SYSTEM_PORT`†, `DYN_SYSTEM_STARTING_HEALTH_STATUS`, `DYN_SYSTEM_HEALTH_PATH`, `DYN_SYSTEM_LIVE_PATH`, `DYN_SYSTEM_USE_ENDPOINT_HEALTH_STATUS` |
| [Tracing](tracing.md) | Distributed tracing with OpenTelemetry and Tempo | `DYN_LOGGING_JSONL`†, `OTEL_EXPORT_ENABLED`†, `OTEL_EXPORTER_OTLP_TRACES_ENDPOINT`†, `OTEL_SERVICE_NAME`† |
| [Logging](logging.md) | Structured logging configuration | `DYN_LOGGING_JSONL`†, `DYN_LOG`, `DYN_LOG_USE_LOCAL_TZ`, `DYN_LOGGING_CONFIG_PATH`, `OTEL_SERVICE_NAME`†, `OTEL_EXPORT_ENABLED`†, `OTEL_EXPORTER_OTLP_TRACES_ENDPOINT`† |
**Variables marked with † are shared across multiple observability systems.**
## Developer Guides
| Guide | Description | Environment Variables to Control |
|-------|-------------|----------------------------------|
| [Metrics Developer Guide](metrics-developer-guide.md) | Creating custom metrics in Rust and Python | `DYN_SYSTEM_PORT`† |
## Kubernetes
For Kubernetes-specific setup and configuration, see [docs/kubernetes/observability/](../kubernetes/observability/).
**Operator Metrics**: The Dynamo Operator running in Kubernetes exposes its own set of metrics for monitoring controller reconciliation, webhook validation, and resource inventory. See the [Operator Metrics Guide](../kubernetes/observability/operator-metrics.md).
---
## Topology
The Docker Compose observability stack provides:
- **Prometheus** on `http://localhost:9090` - metrics collection and querying
- **Grafana** on `http://localhost:3000` - visualization dashboards (username: `dynamo`, password: `dynamo`)
- **Tempo** on `http://localhost:3200` - distributed tracing backend
- **DCGM Exporter** on `http://localhost:9401/metrics` - GPU metrics
- **NATS Exporter** on `http://localhost:7777/metrics` - NATS messaging metrics
### Service Relationship Diagram
```mermaid
graph TD
BROWSER[Browser] -->|:3000| GRAFANA[Grafana :3000]
subgraph DockerComposeNetwork [Network inside Docker Compose]
NATS_PROM_EXP[nats-prom-exp :7777 /metrics] -->|:8222/varz| NATS_SERVER[nats-server :4222, :6222, :8222]
PROMETHEUS[Prometheus server :9090] -->|:2379/metrics| ETCD_SERVER[etcd-server :2379, :2380]
PROMETHEUS -->|:9401/metrics| DCGM_EXPORTER[dcgm-exporter :9401]
PROMETHEUS -->|:7777/metrics| NATS_PROM_EXP
PROMETHEUS -->|:8000/metrics| DYNAMOFE[Dynamo HTTP FE :8000]
PROMETHEUS -->|:8081/metrics| DYNAMOBACKEND[Dynamo backend :8081]
DYNAMOFE --> DYNAMOBACKEND
GRAFANA -->|:9090/query API| PROMETHEUS
end
```
The dcgm-exporter service in the Docker Compose network is configured to use port 9401 instead of the default port 9400. This avoids port conflicts with other dcgm-exporter instances that may already be running on the same host, which is common on shared clusters such as those managed by SLURM.
### Configuration Files
The following configuration files are located under the `deploy/` directory, most of them in `deploy/observability/`:
- [docker-compose.yml](../../deploy/docker-compose.yml): Defines NATS and etcd services
- [docker-observability.yml](../../deploy/docker-observability.yml): Defines Prometheus, Grafana, Tempo, and exporters
- [prometheus.yml](../../deploy/observability/prometheus.yml): Contains Prometheus scraping configuration
- [grafana-datasources.yml](../../deploy/observability/grafana-datasources.yml): Contains Grafana datasource configuration
- [grafana_dashboards/dashboard-providers.yml](../../deploy/observability/grafana_dashboards/dashboard-providers.yml): Contains Grafana dashboard provider configuration
- [grafana_dashboards/dynamo.json](../../deploy/observability/grafana_dashboards/dynamo.json): A general Dynamo Dashboard for both SW and HW metrics
- [grafana_dashboards/dcgm-metrics.json](../../deploy/observability/grafana_dashboards/dcgm-metrics.json): Contains Grafana dashboard configuration for DCGM GPU metrics
- [grafana_dashboards/kvbm.json](../../deploy/observability/grafana_dashboards/kvbm.json): Contains Grafana dashboard configuration for KVBM metrics
```{toctree}
:hidden:
Prometheus + Grafana Setup <prometheus-grafana>
Metrics <metrics>
Metrics Developer Guide <metrics-developer-guide>
Health Checks <health-checks>
Tracing <tracing>
Logging <logging>
```
<!--
SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
-->
# Dynamo Health Checks
## Overview
Dynamo provides health check and liveness HTTP endpoints for each component which
can be used to configure startup, liveness and readiness probes in
orchestration frameworks such as Kubernetes.
## Environment Variables
| Variable | Description | Default | Example |
|----------|-------------|---------|---------|
| `DYN_SYSTEM_PORT` | System status server port | `8081` | `9090` |
| `DYN_SYSTEM_STARTING_HEALTH_STATUS` | Initial health status | `notready` | `ready`, `notready` |
| `DYN_SYSTEM_HEALTH_PATH` | Custom health endpoint path | `/health` | `/custom/health` |
| `DYN_SYSTEM_LIVE_PATH` | Custom liveness endpoint path | `/live` | `/custom/live` |
| `DYN_SYSTEM_USE_ENDPOINT_HEALTH_STATUS` | Endpoints required for ready state | none | `["generate"]` |
| `DYN_HEALTH_CHECK_ENABLED` | Enable canary health checks | `false` (K8s: `true`) | `true`, `false` |
| `DYN_CANARY_WAIT_TIME` | Seconds before sending canary health check | `10` | `5`, `30` |
| `DYN_HEALTH_CHECK_REQUEST_TIMEOUT` | Health check request timeout in seconds | `3` | `5`, `10` |
## Getting Started Quickly
Enable health checks and query endpoints:
```bash
# Start your Dynamo components (default port 8000, override with --http-port or DYN_HTTP_PORT env var)
python -m dynamo.frontend &
# Enable system status server on port 8081
DYN_SYSTEM_PORT=8081 python -m dynamo.vllm --model Qwen/Qwen3-0.6B --enforce-eager &
```
Check health status:
```bash
# Frontend health (port 8000)
curl -s localhost:8000/health | jq
# Worker health (port 8081)
curl -s localhost:8081/health | jq
```
## Frontend Liveness Check
The frontend liveness endpoint reports a status of `live` as long as
the service is running.
> **Note**: Frontend liveness doesn't depend on worker health or liveness, only on the Frontend service itself.
### Example Request
```
curl -s localhost:8000/live | jq
```
### Example Response
```
{
"message": "Service is live",
"status": "live"
}
```
## Frontend Health Check
The frontend health endpoint responds with HTTP `200 OK` as long as
the service is running. Before any workers have registered, the body
reports a status of `unhealthy` with an empty instance list. Once workers
have been registered, it reports `healthy` and also lists the registered
endpoints and instances.
> **Note**: Frontend liveness doesn't depend on worker health or liveness, only on the Frontend service itself.
### Example Request
```
curl -v localhost:8000/health | jq
```
### Example Response
Before workers are registered:
```
HTTP/1.1 200 OK
content-type: application/json
content-length: 72
date: Wed, 03 Sep 2025 13:31:44 GMT
{
"instances": [],
"message": "No endpoints available",
"status": "unhealthy"
}
```
After workers are registered:
```
HTTP/1.1 200 OK
content-type: application/json
content-length: 609
date: Wed, 03 Sep 2025 13:32:03 GMT
{
"endpoints": [
"dyn://dynamo.backend.generate"
],
"instances": [
{
"component": "backend",
"endpoint": "clear_kv_blocks",
"instance_id": 7587888160958628000,
"namespace": "dynamo",
"transport": {
"nats_tcp": "dynamo_backend.clear_kv_blocks-694d98147d54be25"
}
},
{
"component": "backend",
"endpoint": "generate",
"instance_id": 7587888160958628000,
"namespace": "dynamo",
"transport": {
"nats_tcp": "dynamo_backend.generate-694d98147d54be25"
}
},
{
"component": "backend",
"endpoint": "load_metrics",
"instance_id": 7587888160958628000,
"namespace": "dynamo",
"transport": {
"nats_tcp": "dynamo_backend.load_metrics-694d98147d54be25"
}
}
],
"status": "healthy"
}
```
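A client that consumes this response might group the registered endpoints by component. The sketch below parses a shortened copy of the body shown above; `summarize_health` is a hypothetical helper, and the sample omits fields such as `instance_id` and `transport` for brevity.

```python
import json


def summarize_health(body: str):
    """Return (status, {component: sorted endpoint names}) from a /health body."""
    data = json.loads(body)
    by_component = {}
    for instance in data.get("instances", []):
        by_component.setdefault(instance["component"], []).append(instance["endpoint"])
    return data["status"], {name: sorted(eps) for name, eps in by_component.items()}


# Shortened sample based on the response above.
sample = json.dumps({
    "endpoints": ["dyn://dynamo.backend.generate"],
    "instances": [
        {"component": "backend", "endpoint": "clear_kv_blocks", "namespace": "dynamo"},
        {"component": "backend", "endpoint": "generate", "namespace": "dynamo"},
        {"component": "backend", "endpoint": "load_metrics", "namespace": "dynamo"},
    ],
    "status": "healthy",
})
print(summarize_health(sample))
```

Because the HTTP status is `200 OK` in both the registered and unregistered cases shown above, a client that cares about worker availability needs to inspect the `status` field or the instance list rather than the status code alone.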
## Worker Liveness and Health Check
Health checks for components other than the frontend are enabled
selectively based on environment variables. If a health check for a
component is enabled, the starting status can be set along with the set
of endpoints that must be served before the component is
declared `ready`.
Once all endpoints declared in `DYN_SYSTEM_USE_ENDPOINT_HEALTH_STATUS`
are served, the component transitions to a `ready` state and stays there until the
component is shut down. The endpoints return `HTTP/1.1 503 Service Unavailable`
while initializing and `HTTP/1.1 200 OK` once ready.
> **Note**: Both `/live` and `/health` return the same information.
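The readiness rule described above can be sketched as follows. This is an illustration rather than the system status server's implementation: `required_endpoints` and `worker_health` are hypothetical names, and falling back to the starting status when no endpoints are required is an assumption.

```python
import json


def required_endpoints(env: dict) -> list:
    """Parse the JSON list stored in DYN_SYSTEM_USE_ENDPOINT_HEALTH_STATUS."""
    raw = env.get("DYN_SYSTEM_USE_ENDPOINT_HEALTH_STATUS", "")
    return json.loads(raw) if raw else []


def worker_health(env: dict, served_endpoints: set):
    """Return (http_status_code, status) for the worker health endpoint."""
    required = required_endpoints(env)
    if not required:
        # Assumption: with nothing required, report the configured starting status.
        status = env.get("DYN_SYSTEM_STARTING_HEALTH_STATUS", "notready")
    elif all(name in served_endpoints for name in required):
        status = "ready"
    else:
        status = "notready"
    return (200 if status == "ready" else 503), status


env = {
    "DYN_SYSTEM_STARTING_HEALTH_STATUS": "notready",
    "DYN_SYSTEM_USE_ENDPOINT_HEALTH_STATUS": '["generate"]',
}
print(worker_health(env, set()))
print(worker_health(env, {"generate", "load_metrics"}))
```

The first call corresponds to the initializing state (503), the second to the state after `generate` is being served (200).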
### Example Environment Setting
```
export DYN_SYSTEM_PORT=9090
export DYN_SYSTEM_STARTING_HEALTH_STATUS="notready"
export DYN_SYSTEM_USE_ENDPOINT_HEALTH_STATUS="[\"generate\"]"
```
#### Example Request
```
curl -v localhost:9090/health | jq
```
#### Example Response
Before endpoints are being served:
```
HTTP/1.1 503 Service Unavailable
content-type: text/plain; charset=utf-8
content-length: 96
date: Wed, 03 Sep 2025 13:42:39 GMT
{
"endpoints": {
"generate": "notready"
},
"status": "notready",
"uptime": {
"nanos": 313803539,
"secs": 12
}
}
```
After the endpoints are served:
```
HTTP/1.1 200 OK
content-type: text/plain; charset=utf-8
content-length: 139
date: Wed, 03 Sep 2025 13:42:45 GMT
{
"endpoints": {
"clear_kv_blocks": "ready",
"generate": "ready",
"load_metrics": "ready"
},
"status": "ready",
"uptime": {
"nanos": 356504530,
"secs": 18
}
}
```
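A startup script can poll this endpoint until the worker reports `ready`. The following is a minimal sketch using only the Python standard library. The helper names `http_fetch` and `wait_until_ready` are hypothetical, and the retry count and delay are arbitrary choices, not Dynamo defaults.

```python
import json
import time
import urllib.error
import urllib.request


def http_fetch(url):
    """GET a URL and return (status_code, body_text), including non-2xx replies."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status, resp.read().decode("utf-8")
    except urllib.error.HTTPError as err:
        # urllib raises on 503, but the body still carries the JSON status.
        return err.code, err.read().decode("utf-8")


def wait_until_ready(fetch, attempts=30, delay_s=1.0, sleep=time.sleep):
    """Call fetch() until it returns HTTP 200 with status "ready".

    fetch is any zero-argument callable returning (status_code, body_text).
    Connection errors and unparsable bodies count as "not ready yet".
    """
    for _ in range(attempts):
        try:
            code, body = fetch()
            status = json.loads(body).get("status")
        except (OSError, ValueError):
            code, status = None, None
        if code == 200 and status == "ready":
            return True
        sleep(delay_s)
    return False
```

With the environment settings above, a call might look like `wait_until_ready(lambda: http_fetch("http://localhost:9090/health"))`.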
## Canary Health Checks (Active Monitoring)
In addition to the HTTP endpoints described above, Dynamo includes a **canary health check** system that actively monitors worker endpoints.
### Overview
The canary health check system:
- **Monitors endpoint health** by sending periodic test requests to worker endpoints
- **Only activates during idle periods** - if there's ongoing traffic, health checks are skipped to avoid overhead
- **Automatically enabled in Kubernetes** deployments via the operator
- **Disabled by default** in local/development environments
### How It Works
1. **Idle Detection**: After no activity on an endpoint for a configurable wait time (default: 10 seconds), a canary health check is triggered
2. **Health Check Request**: A lightweight test request is sent to the endpoint with a minimal payload (generates 1 token)
3. **Activity Resets Timer**: If normal requests arrive, the canary timer resets and no health check is sent
4. **Timeout Handling**: If a health check doesn't respond within the timeout (default: 3 seconds), the endpoint is marked as unhealthy
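The idle-timer logic in the steps above can be illustrated with a small, deterministic model. This is a sketch for intuition only: `CanaryTimer` and `classify_check` are hypothetical names, timestamps are passed in explicitly instead of read from a clock, and the real scheduler lives in the Dynamo runtime.

```python
from dataclasses import dataclass


@dataclass
class CanaryTimer:
    """Tracks idle time for one endpoint (illustrative model only)."""

    wait_time_s: float = 10.0     # mirrors the documented default wait time
    last_activity_s: float = 0.0  # timestamp of the most recent request

    def record_activity(self, now_s):
        # Any normal request resets the timer, so no canary is sent.
        self.last_activity_s = now_s

    def is_due(self, now_s):
        # A canary check is due once the endpoint has been idle long enough.
        return now_s - self.last_activity_s >= self.wait_time_s


def classify_check(elapsed_s, timeout_s=3.0):
    """Mark the endpoint unhealthy if the canary reply exceeded the timeout."""
    return "healthy" if elapsed_s <= timeout_s else "unhealthy"


if __name__ == "__main__":
    timer = CanaryTimer()
    timer.record_activity(100.0)
    print(timer.is_due(105.0))  # still within the idle window
    print(timer.is_due(110.0))  # idle for the full wait time
    print(classify_check(3.5))  # reply arrived after the timeout
```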
### Configuration
#### In Kubernetes (Enabled by Default)
Health checks are automatically enabled by the Dynamo operator. No additional configuration is required.
```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
name: my-deployment
spec:
services:
VllmWorker:
componentType: worker
replicas: 2
# Health checks automatically enabled by operator
```
#### In Local/Development Environments (Disabled by Default)
To enable health checks locally:
```bash
# Enable health checks
export DYN_HEALTH_CHECK_ENABLED=true
# Optional: Customize timing
export DYN_CANARY_WAIT_TIME=5 # Wait 5 seconds before sending health check
export DYN_HEALTH_CHECK_REQUEST_TIMEOUT=5 # 5 second timeout
# Start worker
python -m dynamo.vllm --model Qwen/Qwen3-0.6B
```
#### Configuration Options
| Environment Variable | Description | Default | Notes |
|---------------------|-------------|---------|-------|
| `DYN_HEALTH_CHECK_ENABLED` | Enable/disable canary health checks | `false` (K8s: `true`) | Automatically set to `true` in K8s |
| `DYN_CANARY_WAIT_TIME` | Seconds to wait (during idle) before sending health check | `10` | Lower values = more frequent checks |
| `DYN_HEALTH_CHECK_REQUEST_TIMEOUT` | Max seconds to wait for health check response | `3` | Higher values = more tolerance for slow responses |
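For tooling that needs to mirror these settings (for example, a test harness), the table can be read as follows. This is a sketch under stated assumptions: `CanaryConfig` and `load_canary_config` are hypothetical names, and accepting `true` or `1` as the enabled value is an assumption about parsing, not documented Dynamo behavior.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class CanaryConfig:
    enabled: bool
    wait_time_s: float
    request_timeout_s: float


def load_canary_config(env=None):
    """Read the canary settings from a mapping, falling back to the documented defaults."""
    env = os.environ if env is None else env
    # Assumption: "true" or "1" (case-insensitive) enables the checks.
    raw_enabled = env.get("DYN_HEALTH_CHECK_ENABLED", "false").strip().lower()
    return CanaryConfig(
        enabled=raw_enabled in ("1", "true"),
        wait_time_s=float(env.get("DYN_CANARY_WAIT_TIME", "10")),
        request_timeout_s=float(env.get("DYN_HEALTH_CHECK_REQUEST_TIMEOUT", "3")),
    )


if __name__ == "__main__":
    print(load_canary_config({}))  # documented local defaults
    print(load_canary_config({"DYN_HEALTH_CHECK_ENABLED": "true", "DYN_CANARY_WAIT_TIME": "5"}))
```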
### Health Check Payloads
Each backend defines its own minimal health check payload:
- **vLLM**: Single token generation with minimal sampling options
- **TensorRT-LLM**: Single token with BOS token ID
- **SGLang**: Single token generation request
These payloads are designed to:
- Complete quickly (< 100ms typically)
- Minimize GPU overhead
- Verify the full inference stack is working
### Observing Health Checks
When health checks are enabled, you'll see logs like:
```
INFO Health check manager started (canary_wait_time: 10s, request_timeout: 3s)
INFO Spawned health check task for endpoint: generate
INFO Canary timer expired for generate, sending health check
INFO Health check successful for generate
```
If an endpoint fails:
```
WARN Health check timeout for generate
ERROR Health check request failed for generate: connection refused
```
### When to Use Canary Health Checks
**Enable in production (Kubernetes):**
- ✅ Detect unhealthy workers before they affect user traffic
- ✅ Enable faster failure detection and recovery
- ✅ Monitor worker availability continuously
**Disable in development:**
- ✅ Reduce log noise during debugging
- ✅ Avoid overhead when not needed
- ✅ Simplify local testing
### Troubleshooting
**Health checks timing out:**
- Increase `DYN_HEALTH_CHECK_REQUEST_TIMEOUT`
- Check worker logs for errors
- Verify network connectivity
**Too many health check logs:**
- Increase `DYN_CANARY_WAIT_TIME` to reduce frequency
- Or disable with `DYN_HEALTH_CHECK_ENABLED=false` in dev
**Health checks not running:**
- Verify `DYN_HEALTH_CHECK_ENABLED=true` is set
- Check that `DYN_SYSTEM_USE_ENDPOINT_HEALTH_STATUS` includes the endpoint
- Ensure the worker is serving the endpoint
## Related Documentation
- [Distributed Runtime Architecture](../design_docs/distributed_runtime.md)
- [Dynamo Architecture Overview](../design_docs/architecture.md)
- [Backend Guide](../development/backend-guide.md)
<!--
SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
-->
# Dynamo Logging
## Overview
Dynamo provides structured logging in both text and JSONL formats. When
JSONL is enabled, logs include `trace_id` and `span_id` fields for
distributed tracing. Span creation and exit events can optionally be
enabled via the `DYN_LOGGING_SPAN_EVENTS` environment variable.
## Environment Variables
| Variable | Description | Default | Example |
|----------|-------------|---------|---------|
| `DYN_LOGGING_JSONL` | Enable JSONL logging format | `false` | `true` |
| `DYN_LOGGING_SPAN_EVENTS` | Enable span entry/close event logging (`SPAN_FIRST_ENTRY`, `SPAN_CLOSED` messages) | `false` | `true` |
| `DYN_LOG` | Log levels per target `<default_level>,<module_path>=<level>,<module_path>=<level>` | `info` | `DYN_LOG=info,dynamo_runtime::system_status_server:trace` |
| `DYN_LOG_USE_LOCAL_TZ` | Use local timezone for timestamps (default is UTC) | `false` | `true` |
| `DYN_LOGGING_CONFIG_PATH` | Path to custom TOML logging configuration | none | `/path/to/config.toml` |
| `OTEL_SERVICE_NAME` | Service name for trace and span information | `dynamo` | `dynamo-frontend` |
| `OTEL_EXPORT_ENABLED` | Enable OTLP trace exporting | `false` | `true` |
| `OTEL_EXPORTER_OTLP_TRACES_ENDPOINT` | OTLP exporter endpoint | `http://localhost:4317` | `http://tempo:4317` |
## Getting Started Quickly
### Start Observability Stack
For collecting and visualizing logs with Grafana Loki (Kubernetes), or viewing trace context in logs alongside Grafana Tempo, start the observability stack. See [Observability Getting Started](README.md#getting-started-quickly) for instructions.
### Enable Structured Logging
Enable structured JSONL logging:
```bash
export DYN_LOGGING_JSONL=true
export DYN_LOG=debug
# Start your Dynamo components (default port 8000, override with --http-port or DYN_HTTP_PORT env var)
python -m dynamo.frontend &
python -m dynamo.vllm --model Qwen/Qwen3-0.6B --enforce-eager &
```
Logs will be written to stderr in JSONL format with trace context.
## Available Logging Levels
| **Logging Levels (Least to Most Verbose)** | **Description** |
|-------------------------------------------|---------------------------------------------------------------------------------|
| **ERROR** | Critical errors (e.g., unrecoverable failures, resource exhaustion) |
| **WARN** | Unexpected or degraded situations (e.g., retries, recoverable errors) |
| **INFO** | Operational information (e.g., startup/shutdown, major events) |
| **DEBUG** | General debugging information (e.g., variable values, flow control) |
| **TRACE** | Very low-level, detailed information (e.g., internal algorithm steps) |
## Example Readable Format
Environment Setting:
```
export DYN_LOG="info,dynamo_runtime::system_status_server:trace"
export DYN_LOGGING_JSONL="false"
```
Resulting Log format:
```
2025-09-02T15:50:01.770028Z INFO main.init: VllmWorker for Qwen/Qwen3-0.6B has been initialized
2025-09-02T15:50:01.770195Z INFO main.init: Reading Events from tcp://127.0.0.1:21555
2025-09-02T15:50:01.770265Z INFO main.init: Getting engine runtime configuration metadata from vLLM engine...
2025-09-02T15:50:01.770316Z INFO main.get_engine_cache_info: Cache config values: {'num_gpu_blocks': 24064}
2025-09-02T15:50:01.770358Z INFO main.get_engine_cache_info: Scheduler config values: {'max_num_seqs': 256, 'max_num_batched_tokens': 2048}
```
## Example JSONL Format
Environment Setting:
```
export DYN_LOG="info,dynamo_runtime::system_status_server:trace"
export DYN_LOGGING_JSONL="true"
```
Resulting Log format:
```
{"time":"2025-09-02T15:53:31.943377Z","level":"INFO","target":"log","message":"VllmWorker for Qwen/Qwen3-0.6B has been initialized","log.file":"/opt/dynamo/venv/lib/python3.12/site-packages/dynamo/vllm/main.py","log.line":191,"log.target":"main.init"}
{"time":"2025-09-02T15:53:31.943550Z","level":"INFO","target":"log","message":"Reading Events from tcp://127.0.0.1:26771","log.file":"/opt/dynamo/venv/lib/python3.12/site-packages/dynamo/vllm/main.py","log.line":212,"log.target":"main.init"}
{"time":"2025-09-02T15:53:31.943636Z","level":"INFO","target":"log","message":"Getting engine runtime configuration metadata from vLLM engine...","log.file":"/opt/dynamo/venv/lib/python3.12/site-packages/dynamo/vllm/main.py","log.line":220,"log.target":"main.init"}
{"time":"2025-09-02T15:53:31.943701Z","level":"INFO","target":"log","message":"Cache config values: {'num_gpu_blocks': 24064}","log.file":"/opt/dynamo/venv/lib/python3.12/site-packages/dynamo/vllm/main.py","log.line":267,"log.target":"main.get_engine_cache_info"}
{"time":"2025-09-02T15:53:31.943747Z","level":"INFO","target":"log","message":"Scheduler config values: {'max_num_seqs': 256, 'max_num_batched_tokens': 2048}","log.file":"/opt/dynamo/venv/lib/python3.12/site-packages/dynamo/vllm/main.py","log.line":268,"log.target":"main.get_engine_cache_info"}
```
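Because each JSONL record is a standalone JSON object, the output is easy to post-process. The sketch below filters captured log lines by level and prints them in a compact form. The helper names are hypothetical, and the field names (`time`, `level`, `message`, `log.target`) are taken from the example output above.

```python
import json

# Ordered from most to least verbose.
LEVELS = ["TRACE", "DEBUG", "INFO", "WARN", "ERROR"]


def parse_jsonl(lines):
    """Yield one dict per valid JSON object line, skipping anything else."""
    for line in lines:
        line = line.strip()
        if not line.startswith("{"):
            continue
        try:
            yield json.loads(line)
        except ValueError:
            continue


def at_least(records, min_level):
    """Keep records whose level is min_level or more severe."""
    threshold = LEVELS.index(min_level)
    return [
        r for r in records
        if r.get("level") in LEVELS and LEVELS.index(r["level"]) >= threshold
    ]


def format_record(record):
    """Render a record roughly like the readable text format."""
    target = record.get("log.target") or record.get("target", "")
    return f'{record.get("time", "")} {record.get("level", "")} {target}: {record.get("message", "")}'
```

If the worker's stderr was saved to a file, passing the open file object to `parse_jsonl` is enough, since it iterates line by line.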
## Logging of Trace and Span IDs
When `DYN_LOGGING_JSONL` is enabled, all logs include `trace_id` and `span_id` fields, and spans are automatically created for requests. This is useful for short debugging sessions where you want to examine trace context in logs without setting up a full tracing backend, and for correlating log messages with traces.
The trace and span information uses the OpenTelemetry format and libraries, which means the IDs are compatible with OpenTelemetry-based tracing backends like Tempo or Jaeger if you later choose to enable trace export.
**Note:** This section overlaps with [Distributed Tracing with Tempo](tracing.md). See that guide for trace visualization in Grafana Tempo and persistent trace analysis.
### Configuration for Logging
To see trace information in logs:
```bash
export DYN_LOGGING_JSONL=true
export DYN_LOG=debug # Set to debug to see detailed trace logs
# Start your Dynamo components (e.g., frontend and worker) (default port 8000, override with --http-port or DYN_HTTP_PORT env var)
python -m dynamo.frontend &
python -m dynamo.vllm --model Qwen/Qwen3-0.6B --enforce-eager &
```
This enables JSONL logging with `trace_id` and `span_id` fields. Traces appear in logs but are not exported to any backend.
### Example Request
Send a request to generate logs with trace context:
```bash
curl -H 'Content-Type: application/json' \
-H 'x-request-id: test-trace-001' \
-d '{
"model": "Qwen/Qwen3-0.6B",
"max_completion_tokens": 100,
"messages": [
{"role": "user", "content": "What is the capital of France?"}
]
}' \
http://localhost:8000/v1/chat/completions
```
Check the logs (stderr) for JSONL output containing `trace_id`, `span_id`, and `x_request_id` fields.
## Trace and Span Information in Logs
This section shows how trace and span information appears in JSONL logs. These logs can be used to understand request flows even without a trace visualization backend.
### Example Disaggregated Trace in Grafana
When viewing the corresponding trace in Grafana, you should be able to see something like the following:
![Disaggregated Trace Example](grafana-disagg-trace.png)
### Trace Overview
Dynamo creates distributed traces that span across multiple services in a disaggregated serving setup. The following sections describe the key spans you'll see in Grafana when viewing traces for chat completion requests.
#### Available Spans in Disaggregated Mode
When running Dynamo in disaggregated mode, a typical request creates the following spans:
##### 1. `http-request` (Frontend - Root Span)
The root span for the entire request lifecycle, created in the **dynamo-frontend** service.
**Key Attributes:**
- **Service**: `dynamo-frontend`
- **Operation**: Handles the HTTP request from client to completion
- **Duration**: Total end-to-end request time (includes prefill + decode)
- **Method**: HTTP method (typically `POST`)
- **URI**: Request endpoint (e.g., `/v1/chat/completions`)
- **Status**: Request completion status
- **Children**: Typically 2-3 child spans (routing span + worker spans)
This span represents the complete request flow from when the frontend receives the HTTP request until the final response is sent back to the client.
##### 2. `prefill_routing` (Frontend - Routing Span)
A child span of `http-request`, created in the **dynamo-frontend** service during the routing phase.
**Key Attributes:**
- **Service**: `dynamo-frontend`
- **Operation**: Routes the prefill request to an appropriate prefill worker
- **Duration**: Time spent selecting a prefill worker plus the time taken by the prefill request
- **Parent**: `http-request` span
This span captures the routing decision and the request sent to the selected prefill worker.
##### 3. `handle_payload` (Prefill Worker Span)
A child span of `http-request`, created in the **dynamo-worker-vllm-prefill** service.
**Key Attributes:**
- **Service**: `dynamo-worker-vllm-prefill` (or `dynamo-worker-sglang-prefill` for SGLang)
- **Operation**: Processes the prefill phase of generation
- **Duration**: Time to compute prefill (typically milliseconds to seconds)
- **Component**: `prefill`
- **Endpoint**: `generate`
- **Parent**: `http-request` span
This span represents the actual prefill computation on a prefill-specialized worker, including prompt processing and initial KV cache generation.
##### 4. `handle_payload` (Decode Worker Span)
A child span of `http-request`, created in the **dynamo-worker-vllm-decode** service.
**Key Attributes:**
- **Service**: `dynamo-worker-vllm-decode` (or `dynamo-worker-sglang-decode` for SGLang)
- **Operation**: Processes the decode phase of generation
- **Duration**: Time to generate all output tokens (typically seconds)
- **Component**: `decode` or `backend`
- **Endpoint**: `generate`
- **Parent**: `http-request` span
This span represents the iterative token generation phase on a decode-specialized worker, which consumes the KV cache from prefill and produces output tokens.
#### Understanding Span Metrics
Each span provides several useful metrics:
| Metric | Description |
|--------|-------------|
| **Duration** | Total time from span start to end |
| **Busy Time** | Time actively processing (excluding waiting) |
| **Idle Time** | Time spent waiting (e.g., for network, other services) |
| **Start Time** | When the span began |
| **Child Count** | Number of direct child spans |
The relationship **Duration = Busy Time + Idle Time** helps identify where time is spent and potential bottlenecks.
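As a worked example of this relationship, the sketch below derives idle time and the busy fraction for a span. The numbers are hypothetical and the function name is illustrative only.

```python
def span_breakdown(duration_ms, busy_ms):
    """Split a span's duration using Duration = Busy Time + Idle Time."""
    idle_ms = duration_ms - busy_ms
    busy_fraction = busy_ms / duration_ms
    return idle_ms, busy_fraction


if __name__ == "__main__":
    # Hypothetical span: 1200 ms total, of which 900 ms was active processing.
    idle_ms, busy_fraction = span_breakdown(1200, 900)
    print(f"idle: {idle_ms} ms, busy: {busy_fraction:.0%}")
```

A span with a low busy fraction is mostly waiting, which points toward network or downstream services instead of local computation.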
## Custom Request IDs in Logs
You can provide a custom request ID using the `x-request-id` header. This ID will be attached to all spans and logs for that request, making it easier to correlate traces with application-level request tracking.
### Example Request with Custom Request ID
```sh
curl -X POST http://localhost:8000/v1/chat/completions \
-H 'Content-Type: application/json' \
-H 'x-request-id: 8372eac7-5f43-4d76-beca-0a94cfb311d0' \
-d '{
"model": "Qwen/Qwen3-0.6B",
"messages": [
{
"role": "user",
"content": "Explain why Roger Federer is considered one of the greatest tennis players of all time"
}
],
"stream": false,
"max_tokens": 1000
}'
```
All spans and logs for this request will include the `x_request_id` attribute with value `8372eac7-5f43-4d76-beca-0a94cfb311d0`.
### Frontend Logs with Custom Request ID
Notice how the `x_request_id` field appears in all log entries, alongside the `trace_id` (`80196f3e3a6fdf06d23bb9ada3788518`) and `span_id`:
```
{"time":"2025-10-31T21:06:45.397194Z","level":"DEBUG","file":"/opt/dynamo/lib/runtime/src/pipeline/network/tcp/server.rs","line":230,"target":"dynamo_runtime::pipeline::network::tcp::server","message":"Registering new TcpStream on 10.0.4.65:41959","method":"POST","span_id":"f7e487a9d2a6bf38","span_name":"http-request","trace_id":"80196f3e3a6fdf06d23bb9ada3788518","uri":"/v1/chat/completions","version":"HTTP/1.1","x_request_id":"8372eac7-5f43-4d76-beca-0a94cfb311d0"}
{"time":"2025-10-31T21:06:45.418584Z","level":"DEBUG","file":"/opt/dynamo/lib/llm/src/kv_router/prefill_router.rs","line":232,"target":"dynamo_llm::kv_router::prefill_router","message":"Prefill succeeded, using disaggregated params for decode","method":"POST","span_id":"f7e487a9d2a6bf38","span_name":"http-request","trace_id":"80196f3e3a6fdf06d23bb9ada3788518","uri":"/v1/chat/completions","version":"HTTP/1.1","x_request_id":"8372eac7-5f43-4d76-beca-0a94cfb311d0"}
{"time":"2025-10-31T21:06:45.418854Z","level":"DEBUG","file":"/opt/dynamo/lib/runtime/src/pipeline/network/tcp/server.rs","line":230,"target":"dynamo_runtime::pipeline::network::tcp::server","message":"Registering new TcpStream on 10.0.4.65:41959","method":"POST","span_id":"f7e487a9d2a6bf38","span_name":"http-request","trace_id":"80196f3e3a6fdf06d23bb9ada3788518","uri":"/v1/chat/completions","version":"HTTP/1.1","x_request_id":"8372eac7-5f43-4d76-beca-0a94cfb311d0"}
```
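To pull every log line for one request out of a larger capture, the records can be filtered on `x_request_id` and grouped by `trace_id`. This is a minimal sketch with hypothetical helper names. It relies only on the field names visible in the log output above.

```python
import json
from collections import defaultdict


def records_for_request(lines, request_id):
    """Return parsed JSONL records whose x_request_id matches request_id."""
    matches = []
    for line in lines:
        try:
            record = json.loads(line)
        except ValueError:
            continue  # skip non-JSON lines mixed into the stream
        if isinstance(record, dict) and record.get("x_request_id") == request_id:
            matches.append(record)
    return matches


def group_by_trace(records):
    """Group records by trace_id so each request's log lines stay together."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record.get("trace_id")].append(record)
    return dict(grouped)
```

For the request above, `records_for_request(lines, "8372eac7-5f43-4d76-beca-0a94cfb311d0")` would return the three frontend records shown, all sharing a single `trace_id`.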
## Related Documentation
- [Distributed Runtime Architecture](../design_docs/distributed_runtime.md)
- [Dynamo Architecture Overview](../design_docs/architecture.md)
- [Backend Guide](../development/backend-guide.md)
- [Log Aggregation in Kubernetes](../kubernetes/observability/logging.md)
<!--
SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
-->
# Metrics Developer Guide
This guide explains how to create and use custom metrics in Dynamo components using the Dynamo metrics API.
## Metrics Exposure
All metrics created via the Dynamo metrics API are automatically exposed on the `/metrics` HTTP endpoint in Prometheus Exposition Format text when the following environment variable is set:
- `DYN_SYSTEM_PORT=<port>` - Port for the metrics endpoint (set to positive value to enable, default: `-1` disabled)
Example:
```bash
DYN_SYSTEM_PORT=8081 python -m dynamo.vllm --model <model>
```
Prometheus Exposition Format text metrics will be available at: `http://localhost:8081/metrics`
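To inspect the endpoint without a Prometheus server, the text output can be parsed directly. The sketch below is deliberately simplified: it assumes one sample per line with no trailing timestamp, and the metric names in the sample are placeholders, since the names Dynamo actually exposes may carry prefixes and labels.

```python
def parse_metrics(text):
    """Parse Prometheus text exposition into {series: value}.

    Simplified: skips comment lines (# HELP / # TYPE) and assumes each
    sample line is "<series> <value>" with no trailing timestamp.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        series, _, value = line.rpartition(" ")
        samples[series] = float(value)
    return samples


if __name__ == "__main__":
    # Placeholder payload for illustration only.
    sample = "\n".join([
        "# HELP requests_total Total requests",
        "# TYPE requests_total counter",
        "requests_total 42",
        'latency_seconds_bucket{le="0.1"} 7',
    ])
    for series, value in parse_metrics(sample).items():
        print(series, value)
```

The text to parse could be fetched with `urllib.request.urlopen("http://localhost:8081/metrics")` when the worker is started as in the example above.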
## Metric Name Constants
The [prometheus_names.rs](../../lib/runtime/src/metrics/prometheus_names.rs) module provides centralized metric name constants and sanitization functions to ensure consistency across all Dynamo components.
---
## Metrics API in Rust
The metrics API is accessible through the `.metrics()` method on runtime, namespace, component, and endpoint objects. See [Runtime Hierarchy](metrics.md#runtime-hierarchy) for details on the hierarchical structure.
### Available Methods
- `.metrics().create_counter()`: Create a counter metric
- `.metrics().create_gauge()`: Create a gauge metric
- `.metrics().create_histogram()`: Create a histogram metric
- `.metrics().create_countervec()`: Create a counter with labels
- `.metrics().create_gaugevec()`: Create a gauge with labels
- `.metrics().create_histogramvec()`: Create a histogram with labels
### Creating Metrics
```rust
use dynamo_runtime::DistributedRuntime;
let runtime = DistributedRuntime::new()?;
let endpoint = runtime.namespace("my_namespace").component("my_component").endpoint("my_endpoint");
// Simple metrics
let requests_total = endpoint.metrics().create_counter(
"requests_total",
"Total requests",
&[]
)?;
let active_connections = endpoint.metrics().create_gauge(
"active_connections",
"Active connections",
&[]
)?;
let latency = endpoint.metrics().create_histogram(
"latency_seconds",
"Request latency",
&[],
Some(vec![0.001, 0.01, 0.1, 1.0, 10.0])
)?;
```
### Using Metrics
```rust
// Counters
requests_total.inc();
// Gauges
active_connections.set(42.0);
active_connections.inc();
active_connections.dec();
// Histograms
latency.observe(0.023); // 23ms
```
### Vector Metrics with Labels
```rust
// Create vector metrics with label names
let requests_by_model = endpoint.metrics().create_countervec(
"requests_by_model",
"Requests by model",
&["model_type", "model_size"],
&[]
)?;
let memory_by_gpu = endpoint.metrics().create_gaugevec(
"gpu_memory_bytes",
"GPU memory by device",
&["gpu_id", "memory_type"],
&[]
)?;
// Use with specific label values
requests_by_model.with_label_values(&["llama", "7b"]).inc();
memory_by_gpu.with_label_values(&["0", "allocated"]).set(8192.0);
```
### Advanced Features
**Custom histogram buckets:**
```rust
let latency = endpoint.metrics().create_histogram(
"latency_seconds",
"Request latency",
&[],
Some(vec![0.001, 0.01, 0.1, 1.0, 10.0])
)?;
```
**Constant labels:**
```rust
let counter = endpoint.metrics().create_counter(
"requests_total",
"Total requests",
&[("region", "us-west"), ("env", "prod")]
)?;
```
---
## Related Documentation
- [Metrics Overview](metrics.md)
- [Prometheus and Grafana Setup](prometheus-grafana.md)
- [Distributed Runtime Architecture](../design_docs/distributed_runtime.md)