Commit f9050aae authored by Jonathan Tong, committed by GitHub

docs: migrate existing docs to fern (#5445)


Signed-off-by: Jont828 <jt572@cornell.edu>
Signed-off-by: Neal Vaidya <nealv@nvidia.com>
Co-authored-by: Neal Vaidya <nealv@nvidia.com>
parent f238d23a
---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: "Autoscaling"
---
This guide explains how to configure autoscaling for DynamoGraphDeployment (DGD) services using the `sglang-agg` example from `examples/backends/sglang/deploy/agg.yaml`.
## Example DGD
All examples in this guide use the following DGD:
```yaml
# examples/backends/sglang/deploy/agg.yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: sglang-agg
  namespace: default
spec:
  services:
    Frontend:
      dynamoNamespace: sglang-agg
      componentType: frontend
      replicas: 1
    decode:
      dynamoNamespace: sglang-agg
      componentType: worker
      replicas: 1
      resources:
        limits:
          gpu: "1"
```
**Key identifiers:**
- **DGD name**: `sglang-agg`
- **Namespace**: `default`
- **Services**: `Frontend`, `decode`
- **dynamo_namespace label**: `default-sglang-agg` (used for metric filtering)
## Overview
Dynamo provides flexible autoscaling through the `DynamoGraphDeploymentScalingAdapter` (DGDSA) resource. When you deploy a DGD, the operator automatically creates one adapter per service (unless explicitly disabled). These adapters implement the Kubernetes [Scale subresource](https://kubernetes.io/docs/tasks/extend-kubernetes/custom-resources/custom-resource-definitions/#scale-subresource), enabling integration with:
| Autoscaler | Description | Best For |
|------------|-------------|----------|
| **KEDA** | Event-driven autoscaling (recommended) | Most use cases |
| **Kubernetes HPA** | Native horizontal pod autoscaling | Simple CPU/memory-based scaling |
| **Dynamo Planner** | LLM-aware autoscaling with SLA optimization | Production LLM workloads |
| **Custom Controllers** | Any scale-subresource-compatible controller | Custom requirements |
<Warning>
**Deprecation Notice:** The `spec.services[X].autoscaling` field in DGD is **deprecated and ignored**. Use DGDSA with HPA, KEDA, or Planner instead. If you have existing DGDs with `autoscaling` configured, you'll see a warning. Remove the field to silence the warning.
</Warning>
## Architecture
```
┌──────────────────────────────────┐ ┌─────────────────────────────────────┐
│ DynamoGraphDeployment │ │ Scaling Adapters (auto-created) │
│ "sglang-agg" │ │ (one per service) │
├──────────────────────────────────┤ ├─────────────────────────────────────┤
│ │ │ │
│ spec.services: │ │ ┌─────────────────────────────┐ │ ┌──────────────────┐
│ │ │ │ sglang-agg-frontend │◄───┼──────│ Autoscalers │
│ ┌────────────────────────┐◄───┼──────────┼──│ spec.replicas: 1 │ │ │ │
│ │ Frontend: 1 replica │ │ │ └─────────────────────────────┘ │ │ • KEDA │
│ └────────────────────────┘ │ │ │ │ • HPA │
│ │ │ ┌─────────────────────────────┐ │ │ • Planner │
│ ┌────────────────────────┐◄───┼──────────┼──│ sglang-agg-decode │◄───┼──────│ • Custom │
│ │ decode: 1 replica │ │ │ │ spec.replicas: 1 │ │ │ │
│ └────────────────────────┘ │ │ └─────────────────────────────┘ │ └──────────────────┘
│ │ │ │
└──────────────────────────────────┘ └─────────────────────────────────────┘
```
**How it works:**
1. You deploy a DGD with services (Frontend, decode)
2. The operator auto-creates one DGDSA per service
3. Autoscalers (KEDA, HPA, Planner) target the adapters via the `/scale` subresource
4. The adapter controller syncs replica changes to the DGD
5. The DGD controller reconciles the underlying pods
## Viewing Scaling Adapters
After deploying the `sglang-agg` DGD, verify the auto-created adapters:
```bash
kubectl get dgdsa -n default
# Example output:
# NAME                  DGD          SERVICE    REPLICAS   AGE
# sglang-agg-frontend   sglang-agg   Frontend   1          5m
# sglang-agg-decode     sglang-agg   decode     1          5m
```
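The adapter names in the output above appear to follow a `<DGD name>-<service name, lowercased>` pattern. The Python sketch below only illustrates that pattern so you can predict which name to reference in `scaleTargetRef`. The helper `adapter_name` is hypothetical, and the operator may apply additional naming rules not shown here.

```python
def adapter_name(dgd_name: str, service_name: str) -> str:
    """Return the expected DGDSA name for one DGD service.

    Assumption: the name is "<dgd-name>-<service-name-lowercased>", as the
    example kubectl output suggests. This is not the operator's own code.
    """
    return f"{dgd_name}-{service_name.lower()}"


if __name__ == "__main__":
    # Services taken from the sglang-agg example DGD.
    for service in ("Frontend", "decode"):
        print(adapter_name("sglang-agg", service))
```

If the printed names differ from what `kubectl get dgdsa` reports in your cluster, trust the cluster.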
## Replica Ownership Model
When DGDSA is enabled (the default), it becomes the **source of truth** for replica counts. This follows the same pattern as Kubernetes Deployments owning ReplicaSets.
### How It Works
1. **DGDSA owns replicas**: Autoscalers (HPA, KEDA, Planner) update the DGDSA's `spec.replicas`
2. **DGDSA syncs to DGD**: The DGDSA controller writes the replica count to the DGD's service
3. **Direct DGD edits blocked**: A validating webhook prevents users from directly editing `spec.services[X].replicas` in the DGD
4. **Controllers allowed**: Only authorized controllers (operator, Planner) can modify DGD replicas
### Manual Scaling with DGDSA Enabled
When DGDSA is enabled, use `kubectl scale` on the adapter (not the DGD):
```bash
# ✅ Correct - scale via DGDSA
kubectl scale dgdsa sglang-agg-decode --replicas=3
# ❌ Blocked - direct DGD edit rejected by webhook
kubectl patch dgd sglang-agg --type=merge -p '{"spec":{"services":{"decode":{"replicas":3}}}}'
# Error: spec.services[decode].replicas cannot be modified directly when scaling adapter is enabled;
# use 'kubectl scale dgdsa/sglang-agg-decode --replicas=3' or update the DynamoGraphDeploymentScalingAdapter instead
```
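The ownership rules above can be pictured with a small in-memory model. This is a toy sketch for reasoning about who may write a replica count, not the operator's or the webhook's implementation. The class `ReplicaOwnership` and its methods are hypothetical names.

```python
class ReplicaOwnership:
    """Toy model of the DGDSA replica-ownership rules (illustrative only)."""

    def __init__(self, services: dict, adapter_enabled: set):
        # Mirrors spec.services[X].replicas in the DGD.
        self.dgd_replicas = dict(services)
        # Services that have a scaling adapter (DGDSA).
        self.adapter_enabled = set(adapter_enabled)

    def scale_adapter(self, service: str, replicas: int) -> None:
        """An autoscaler or `kubectl scale dgdsa` updates the adapter,
        and the new count is then synced to the DGD."""
        if service not in self.adapter_enabled:
            raise KeyError(f"no scaling adapter for service {service!r}")
        self.dgd_replicas[service] = replicas

    def patch_dgd(self, service: str, replicas: int) -> None:
        """A direct user edit of the DGD is rejected when an adapter
        owns the replica count for that service."""
        if service in self.adapter_enabled:
            raise PermissionError(
                f"spec.services[{service}].replicas cannot be modified directly"
            )
        self.dgd_replicas[service] = replicas


if __name__ == "__main__":
    model = ReplicaOwnership({"Frontend": 1, "decode": 1}, {"decode"})
    model.scale_adapter("decode", 3)   # allowed: goes through the adapter
    print(model.dgd_replicas)
```

Authorized controllers such as the operator or Planner are left out of the sketch. They would be a third write path that bypasses the rejection.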
## Enabling DGDSA for a Service
By default, no DGDSA is created for services, allowing direct replica management via the DGD. To enable autoscaling via HPA, KEDA, or Planner, explicitly enable the scaling adapter:
```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: sglang-agg
spec:
  services:
    Frontend:
      replicas: 2 # ← No DGDSA by default, direct edits allowed
    decode:
      replicas: 1
      scalingAdapter:
        enabled: true # ← DGDSA created, managed via adapter
```
**When to enable DGDSA:**
- You want to use HPA, KEDA, or Planner for autoscaling
- You want a clear separation between "desired scale" (adapter) and "deployment config" (DGD)
- You want protection against accidental direct replica edits
**When to keep DGDSA disabled (default):**
- You want simple, manual replica management
- You don't need autoscaling for that service
- You prefer direct DGD edits over adapter-based scaling
## Autoscaling with Dynamo Planner
The Dynamo Planner is an LLM-aware autoscaler that optimizes scaling decisions based on inference-specific metrics like Time To First Token (TTFT), Inter-Token Latency (ITL), and KV cache utilization.
**When to use Planner:**
- You want LLM-optimized autoscaling out of the box
- You need coordinated scaling across prefill/decode services
- You want SLA-driven scaling (e.g., target TTFT < 500ms)
**How Planner works:**
Planner is deployed as a service component within your DGD. It:
1. Queries Prometheus for frontend metrics (request rate, latency, etc.)
2. Uses profiling data to predict optimal replica counts
3. Scales prefill/decode workers to meet SLA targets
**Deployment:**
The recommended way to deploy Planner is via `DynamoGraphDeploymentRequest` (DGDR). See the [SLA Planner Quick Start](../planner/sla-planner-quickstart.md) for complete instructions.
Example configurations with Planner:
- `examples/backends/vllm/deploy/disagg_planner.yaml`
- `examples/backends/sglang/deploy/disagg_planner.yaml`
- `examples/backends/trtllm/deploy/disagg_planner.yaml`
For more details, see the [SLA Planner documentation](../planner/sla-planner.md).
## Autoscaling with Kubernetes HPA
The Horizontal Pod Autoscaler (HPA) is Kubernetes' native autoscaling solution.
**When to use HPA:**
- You have simple, predictable scaling requirements
- You want to use standard Kubernetes tooling
- You need CPU or memory-based scaling
> **Note**: For custom metrics (like TTFT or queue depth), consider using [KEDA](#autoscaling-with-keda-recommended) instead - it's simpler to configure.
### Basic HPA (CPU-based)
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: sglang-agg-frontend-hpa
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: nvidia.com/v1alpha1
    kind: DynamoGraphDeploymentScalingAdapter
    name: sglang-agg-frontend
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
    scaleUp:
      stabilizationWindowSeconds: 0
```
### HPA with Dynamo Metrics
Dynamo exports several metrics useful for autoscaling. These are available at the `/metrics` endpoint on each frontend pod.
> **See also**: For a complete list of all Dynamo metrics, see the [Metrics Reference](../observability/metrics.md). For Prometheus and Grafana setup, see the [Prometheus and Grafana Setup Guide](../observability/prometheus-grafana.md).
#### Available Dynamo Metrics
| Metric | Type | Description | Good for scaling |
|--------|------|-------------|------------------|
| `dynamo_frontend_queued_requests` | Gauge | Requests waiting in HTTP queue | ✅ Workers |
| `dynamo_frontend_inflight_requests` | Gauge | Concurrent requests to engine | ✅ All services |
| `dynamo_frontend_time_to_first_token_seconds` | Histogram | TTFT latency | ✅ Workers |
| `dynamo_frontend_inter_token_latency_seconds` | Histogram | ITL latency | ✅ Decode |
| `dynamo_frontend_request_duration_seconds` | Histogram | Total request duration | ⚠️ General |
| `kvstats_gpu_cache_usage_percent` | Gauge | GPU KV cache usage (0-1) | ✅ Decode |
#### Metric Labels
Dynamo metrics include these labels for filtering:
| Label | Description | Example |
|-------|-------------|---------|
| `dynamo_namespace` | Unique DGD identifier (`{k8s-namespace}-{dynamoNamespace}`) | `default-sglang-agg` |
| `model` | Model being served | `Qwen/Qwen3-0.6B` |
> **Note**: When you have multiple DGDs in the same namespace, use `dynamo_namespace` to filter metrics for a specific DGD.
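To make the label format concrete, here is a short Python sketch that builds the `dynamo_namespace` value and a PromQL selector from it. It assumes the `{k8s-namespace}-{dynamoNamespace}` format from the table above. Both helper names are hypothetical.

```python
def dynamo_namespace_label(k8s_namespace: str, dynamo_namespace: str) -> str:
    """Build the dynamo_namespace label value: "{k8s-namespace}-{dynamoNamespace}"."""
    return f"{k8s_namespace}-{dynamo_namespace}"


def promql_selector(metric: str, k8s_namespace: str, dynamo_namespace: str) -> str:
    """Return a PromQL series selector that filters one metric to a single DGD."""
    label = dynamo_namespace_label(k8s_namespace, dynamo_namespace)
    # Doubled braces emit literal "{" and "}" inside an f-string.
    return f'{metric}{{dynamo_namespace="{label}"}}'


if __name__ == "__main__":
    print(promql_selector("dynamo_frontend_queued_requests", "default", "sglang-agg"))
```

The resulting selector can be pasted into the Prometheus UI to confirm that series exist for the DGD before wiring up an autoscaler.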
#### Example: Scale Decode Service Based on TTFT
Using HPA with Prometheus Adapter requires configuring external metrics.
**Step 1: Configure Prometheus Adapter**
Add this to your Helm values file (e.g., `prometheus-adapter-values.yaml`):
```yaml
# prometheus-adapter-values.yaml
prometheus:
  url: http://prometheus-kube-prometheus-prometheus.monitoring.svc
  port: 9090
rules:
  external:
  # TTFT p95 from frontend - used to scale decode
  - seriesQuery: 'dynamo_frontend_time_to_first_token_seconds_bucket{namespace!=""}'
    resources:
      overrides:
        namespace: {resource: "namespace"}
    name:
      as: "dynamo_ttft_p95_seconds"
    metricsQuery: |
      histogram_quantile(0.95,
        sum(rate(dynamo_frontend_time_to_first_token_seconds_bucket{<<.LabelMatchers>>}[5m]))
        by (le, namespace, dynamo_namespace)
      )
```
**Step 2: Install Prometheus Adapter**
```bash
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm upgrade --install prometheus-adapter prometheus-community/prometheus-adapter \
-n monitoring --create-namespace \
-f prometheus-adapter-values.yaml
```
**Step 3: Verify the metric is available**
```bash
kubectl get --raw "/apis/external.metrics.k8s.io/v1beta1/namespaces/<your-namespace>/dynamo_ttft_p95_seconds" | jq
```
**Step 4: Create the HPA**
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: sglang-agg-decode-hpa
spec:
  scaleTargetRef:
    apiVersion: nvidia.com/v1alpha1
    kind: DynamoGraphDeploymentScalingAdapter
    name: sglang-agg-decode # ← DGD name + service name (lowercase)
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: External
    external:
      metric:
        name: dynamo_ttft_p95_seconds
        selector:
          matchLabels:
            dynamo_namespace: "default-sglang-agg" # ← {namespace}-{dynamoNamespace}
      target:
        type: Value
        value: "500m" # Scale up when TTFT p95 > 500ms
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 60 # Wait 1 min before scaling down
      policies:
      - type: Pods
        value: 1
        periodSeconds: 30
    scaleUp:
      stabilizationWindowSeconds: 0 # Scale up immediately
      policies:
      - type: Pods
        value: 2
        periodSeconds: 30
```
**How it works:**
1. Frontend pods export `dynamo_frontend_time_to_first_token_seconds` histogram
2. Prometheus Adapter calculates p95 TTFT per `dynamo_namespace`
3. HPA monitors this metric filtered by `dynamo_namespace: "default-sglang-agg"`
4. When TTFT p95 > 500ms, HPA scales up the `sglang-agg-decode` adapter
5. Adapter controller syncs the replica count to the DGD's `decode` service
6. More decode workers are created, reducing TTFT
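To see roughly how step 4 turns a TTFT reading into a replica count, the sketch below applies the scaling formula from the Kubernetes HPA documentation, `ceil(currentReplicas * currentMetric / targetMetric)`, with the default 10% tolerance. It is a simplified illustration. It ignores the `behavior` policies and stabilization windows, and the function name is hypothetical.

```python
import math


def desired_replicas(current_replicas: int, metric_value: float, target_value: float,
                     min_replicas: int = 1, max_replicas: int = 10,
                     tolerance: float = 0.1) -> int:
    """Approximate the HPA replica calculation for a `Value` target."""
    ratio = metric_value / target_value
    # Within tolerance of the target, the HPA leaves the replica count alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    # Clamp to the HPA's minReplicas / maxReplicas bounds.
    return max(min_replicas, min(max_replicas, desired))


if __name__ == "__main__":
    # 2 decode replicas, TTFT p95 of 0.75 s against the 0.5 s ("500m") target.
    print(desired_replicas(2, 0.75, 0.5))
```

In the HPA above, the `scaleUp` policy of 2 pods per 30 seconds would additionally limit how quickly the adapter reaches the computed value.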
#### Example: Scale Based on Queue Depth
Add this rule to your `prometheus-adapter-values.yaml` (alongside the TTFT rule):
```yaml
# Add to rules.external in prometheus-adapter-values.yaml
- seriesQuery: 'dynamo_frontend_queued_requests{namespace!=""}'
  resources:
    overrides:
      namespace: {resource: "namespace"}
  name:
    as: "dynamo_queued_requests"
  metricsQuery: |
    sum(<<.Series>>{<<.LabelMatchers>>}) by (namespace, dynamo_namespace)
```
Then create the HPA:
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: sglang-agg-decode-queue-hpa
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: nvidia.com/v1alpha1
    kind: DynamoGraphDeploymentScalingAdapter
    name: sglang-agg-decode
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: External
    external:
      metric:
        name: dynamo_queued_requests
        selector:
          matchLabels:
            dynamo_namespace: "default-sglang-agg"
      target:
        type: Value
        value: "10" # Scale up when queue > 10 requests
```
## Autoscaling with KEDA (Recommended)
KEDA (Kubernetes Event-driven Autoscaling) extends Kubernetes with event-driven autoscaling, supporting 50+ scalers including Prometheus.
**Advantages over HPA + Prometheus Adapter:**
- No Prometheus Adapter configuration needed
- PromQL queries are defined in the ScaledObject itself (declarative, per-deployment)
- Easy to update - just `kubectl apply` the ScaledObject
- Can scale to zero when idle
- Supports multiple triggers per object
**When to use KEDA:**
- You want simpler configuration (no Prometheus Adapter to manage)
- You need event-driven scaling (e.g., queue depth, Kafka, etc.)
- You want to scale to zero when idle
### Installing KEDA
```bash
# Add KEDA Helm repo
helm repo add kedacore https://kedacore.github.io/charts
helm repo update
# Install KEDA
helm install keda kedacore/keda \
--namespace keda \
--create-namespace
# Verify installation
kubectl get pods -n keda
```
> **Note**: If you have Prometheus Adapter installed, either uninstall it first (`helm uninstall prometheus-adapter -n monitoring`) or install KEDA with `--set metricsServer.enabled=false` to avoid API conflicts.
### Example: Scale Decode Based on TTFT
Using the `sglang-agg` DGD from `examples/backends/sglang/deploy/agg.yaml`:
```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: sglang-agg-decode-scaler
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: nvidia.com/v1alpha1
    kind: DynamoGraphDeploymentScalingAdapter
    name: sglang-agg-decode
  minReplicaCount: 1
  maxReplicaCount: 10
  pollingInterval: 15 # Check metrics every 15 seconds
  cooldownPeriod: 60 # Wait 60s before scaling down
  triggers:
  - type: prometheus
    metadata:
      # Update this URL to match your Prometheus service
      serverAddress: http://prometheus-kube-prometheus-prometheus.monitoring.svc:9090
      metricName: dynamo_ttft_p95
      query: |
        histogram_quantile(0.95,
          sum(rate(dynamo_frontend_time_to_first_token_seconds_bucket{dynamo_namespace="default-sglang-agg"}[5m]))
          by (le)
        )
      threshold: "0.5" # Scale up when TTFT p95 > 500ms (0.5 seconds)
      activationThreshold: "0.1" # Start scaling when TTFT > 100ms
```
Apply it:
```bash
kubectl apply -f sglang-agg-decode-scaler.yaml
```
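The PromQL query in the ScaledObject relies on `histogram_quantile`, which estimates a percentile from cumulative histogram buckets by linear interpolation. The Python sketch below mirrors that interpolation on made-up bucket counts, to show why a p95 can land between two bucket bounds. The bucket values are invented for illustration. Prometheus itself operates on per-second rates rather than raw counts, though the arithmetic is the same.

```python
import math


def histogram_quantile(q: float, buckets: list) -> float:
    """Estimate the q-quantile from (upper_bound, cumulative_count) pairs.

    The list must include the (math.inf, total) bucket, as Prometheus
    histograms do. Simplified sketch of the documented interpolation.
    """
    if not 0 < q <= 1 or len(buckets) < 2:
        raise ValueError("need 0 < q <= 1 and at least two buckets")
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Quantile falls in the +Inf bucket: report the highest finite bound.
                return buckets[-2][0]
            # Interpolate linearly inside the bucket that contains the rank.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return math.nan


if __name__ == "__main__":
    # Hypothetical TTFT buckets in seconds: (le, cumulative count).
    ttft = [(0.1, 10), (0.25, 50), (0.5, 90), (1.0, 100), (math.inf, 100)]
    print(histogram_quantile(0.95, ttft))
```

With these made-up counts the estimate exceeds the `0.5` threshold, so the trigger would ask for more replicas.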
### Verify KEDA Scaling
```bash
# Check ScaledObject status
kubectl get scaledobject -n default
# KEDA creates an HPA under the hood - you can see it
kubectl get hpa -n default
# Example output:
# NAME REFERENCE TARGETS MINPODS MAXPODS REPLICAS
# keda-hpa-sglang-agg-decode-scaler DynamoGraphDeploymentScalingAdapter/sglang-agg-decode 45m/500m 1 10 1
# Get detailed status
kubectl describe scaledobject sglang-agg-decode-scaler -n default
```
### Example: Scale Based on Queue Depth
```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: sglang-agg-decode-queue-scaler
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: nvidia.com/v1alpha1
    kind: DynamoGraphDeploymentScalingAdapter
    name: sglang-agg-decode
  minReplicaCount: 1
  maxReplicaCount: 10
  pollingInterval: 15
  cooldownPeriod: 60
  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus-kube-prometheus-prometheus.monitoring.svc:9090
      metricName: dynamo_queued_requests
      query: |
        sum(dynamo_frontend_queued_requests{dynamo_namespace="default-sglang-agg"})
      threshold: "10" # Scale up when queue > 10 requests
```
### How KEDA Works
KEDA creates and manages an HPA under the hood:
```
┌──────────────────────────────────────────────────────────────────────┐
│ You create: ScaledObject │
│ - scaleTargetRef: sglang-agg-decode │
│ - triggers: prometheus query │
└──────────────────────────────────────────────────────────────────────┘
┌──────────────────────────────────────────────────────────────────────┐
│ KEDA Operator automatically creates: HPA │
│ - name: keda-hpa-sglang-agg-decode-scaler │
│ - scaleTargetRef: sglang-agg-decode │
│ - metrics: External (from KEDA metrics server) │
└──────────────────────────────────────────────────────────────────────┘
┌──────────────────────────────────────────────────────────────────────┐
│ DynamoGraphDeploymentScalingAdapter: sglang-agg-decode │
│ - spec.replicas: updated by HPA │
└──────────────────────────────────────────────────────────────────────┘
┌──────────────────────────────────────────────────────────────────────┐
│ DynamoGraphDeployment: sglang-agg │
│ - spec.services.decode.replicas: synced from adapter │
└──────────────────────────────────────────────────────────────────────┘
```
## Mixed Autoscaling
For disaggregated deployments (prefill + decode), you can use different autoscaling strategies for different services:
```yaml
---
# HPA for Frontend (CPU-based)
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: sglang-agg-frontend-hpa
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: nvidia.com/v1alpha1
    kind: DynamoGraphDeploymentScalingAdapter
    name: sglang-agg-frontend
  minReplicas: 1
  maxReplicas: 5
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
---
# KEDA for Decode (TTFT-based)
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: sglang-agg-decode-scaler
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: nvidia.com/v1alpha1
    kind: DynamoGraphDeploymentScalingAdapter
    name: sglang-agg-decode
  minReplicaCount: 1
  maxReplicaCount: 10
  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus-kube-prometheus-prometheus.monitoring.svc:9090
      query: |
        histogram_quantile(0.95,
          sum(rate(dynamo_frontend_time_to_first_token_seconds_bucket{dynamo_namespace="default-sglang-agg"}[5m]))
          by (le)
        )
      threshold: "0.5"
```
## Manual Scaling
### With DGDSA Enabled (Default)
When DGDSA is enabled (the default), scale via the adapter:
```bash
kubectl scale dgdsa sglang-agg-decode -n default --replicas=3
```
Verify the scaling:
```bash
kubectl get dgdsa sglang-agg-decode -n default
# Output:
# NAME                DGD          SERVICE   REPLICAS   AGE
# sglang-agg-decode   sglang-agg   decode    3          10m
```
> **Note**: If an autoscaler (KEDA, HPA, Planner) is managing the adapter, your change will be overwritten on the next evaluation cycle.
### With DGDSA Disabled
If you've disabled the scaling adapter for a service, edit the DGD directly:
```bash
kubectl patch dgd sglang-agg --type=merge -p '{"spec":{"services":{"decode":{"replicas":3}}}}'
```
Or edit the YAML (no `scalingAdapter.enabled: true` means direct edits are allowed):
```yaml
spec:
services:
decode:
replicas: 3
# No scalingAdapter.enabled means replicas can be edited directly
```
## Best Practices
### 1. Choose One Autoscaler Per Service
Avoid configuring multiple autoscalers for the same service:
| Configuration | Status |
|---------------|--------|
| HPA for frontend, Planner for prefill/decode | ✅ Good |
| KEDA for all services | ✅ Good |
| Planner only (default) | ✅ Good |
| HPA + Planner both targeting decode | ❌ Bad - they will fight |
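One way to catch the "they will fight" case before applying manifests is to check that no two autoscaler objects name the same adapter in `scaleTargetRef`. The sketch below does this over manifests already loaded as Python dictionaries. The function name is hypothetical. It only covers HPA and KEDA `ScaledObject` manifests that you author, not Planner. On a live cluster, remember that KEDA creates its own `keda-hpa-*` HPA for each ScaledObject, and that HPA is not a conflict.

```python
from collections import defaultdict


def conflicting_targets(manifests: list) -> dict:
    """Map each DGDSA name to the autoscalers targeting it, keeping only conflicts."""
    by_target = defaultdict(list)
    for manifest in manifests:
        if manifest.get("kind") not in ("HorizontalPodAutoscaler", "ScaledObject"):
            continue
        ref = manifest.get("spec", {}).get("scaleTargetRef", {})
        if ref.get("kind") == "DynamoGraphDeploymentScalingAdapter":
            owner = f'{manifest["kind"]}/{manifest["metadata"]["name"]}'
            by_target[ref["name"]].append(owner)
    # Only adapters with more than one autoscaler are a problem.
    return {target: owners for target, owners in by_target.items() if len(owners) > 1}


if __name__ == "__main__":
    adapter = {"kind": "DynamoGraphDeploymentScalingAdapter", "name": "sglang-agg-decode"}
    manifests = [
        {"kind": "HorizontalPodAutoscaler",
         "metadata": {"name": "sglang-agg-decode-hpa"},
         "spec": {"scaleTargetRef": adapter}},
        {"kind": "ScaledObject",
         "metadata": {"name": "sglang-agg-decode-scaler"},
         "spec": {"scaleTargetRef": adapter}},
    ]
    print(conflicting_targets(manifests))
```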
### 2. Use Appropriate Metrics
| Service Type | Recommended Metrics | Dynamo Metric |
|--------------|---------------------|---------------|
| Frontend | CPU utilization, request rate | `dynamo_frontend_requests_total` |
| Prefill | Queue depth, TTFT | `dynamo_frontend_queued_requests`, `dynamo_frontend_time_to_first_token_seconds` |
| Decode | KV cache utilization, ITL | `kvstats_gpu_cache_usage_percent`, `dynamo_frontend_inter_token_latency_seconds` |
### 3. Configure Stabilization Windows
Prevent thrashing with appropriate stabilization:
```yaml
# HPA
behavior:
  scaleDown:
    stabilizationWindowSeconds: 300 # Wait 5 min before scaling down
  scaleUp:
    stabilizationWindowSeconds: 0 # Scale up immediately
# KEDA
spec:
  cooldownPeriod: 300
```
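As a rough picture of why a scale-down stabilization window prevents thrashing: per the Kubernetes HPA documentation, the controller looks at the recommendations computed during the window and uses the highest one, so a brief dip in load does not remove replicas. The Python sketch below models only that rule. The function name and the sample history are hypothetical.

```python
def scale_down_target(history: list, now: float, window_seconds: float) -> int:
    """Return the replica count a scale-down may settle on.

    history holds (timestamp_seconds, recommended_replicas) pairs and is
    assumed to include the most recent recommendation.
    """
    # Keep only recommendations made inside the stabilization window.
    recent = [replicas for ts, replicas in history if now - ts <= window_seconds]
    # The highest recent recommendation wins, which delays scale-down.
    return max(recent)


if __name__ == "__main__":
    history = [(0, 5), (100, 3), (200, 2), (290, 2)]
    print(scale_down_target(history, now=300, window_seconds=300))
    print(scale_down_target(history, now=300, window_seconds=60))
```

A longer window keeps the earlier, higher recommendation in play, which is the effect the 300-second setting above is after.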
### 4. Set Sensible Min/Max Replicas
Always configure minimum and maximum replicas in your HPA/KEDA to prevent:
- Scaling to zero (unless intentional)
- Unbounded scaling that exhausts cluster resources
## Troubleshooting
### Adapters Not Created
```bash
# Check DGD status
kubectl describe dgd sglang-agg -n default
# Check operator logs
kubectl logs -n dynamo-system deployment/dynamo-operator
```
### Scaling Not Working
```bash
# Check adapter status
kubectl describe dgdsa sglang-agg-decode -n default
# Check HPA/KEDA status
kubectl describe hpa sglang-agg-decode-hpa -n default
kubectl describe scaledobject sglang-agg-decode-scaler -n default
# Verify metrics are available in Kubernetes metrics API
kubectl get --raw /apis/external.metrics.k8s.io/v1beta1
```
### Metrics Not Available
If HPA/KEDA shows `<unknown>` for metrics:
```bash
# Check that the frontend is exporting Dynamo metrics
kubectl port-forward -n default svc/sglang-agg-frontend 8000:8000
curl http://localhost:8000/metrics | grep dynamo_frontend
# Example output:
# dynamo_frontend_queued_requests{model="Qwen/Qwen3-0.6B"} 2
# dynamo_frontend_inflight_requests{model="Qwen/Qwen3-0.6B"} 5
# Verify Prometheus is scraping the metrics
kubectl port-forward -n monitoring svc/prometheus-kube-prometheus-prometheus 9090:9090
# Then query: dynamo_frontend_time_to_first_token_seconds_bucket
# Check KEDA operator logs
kubectl logs -n keda deployment/keda-operator
```
### Rapid Scaling Up and Down
If you see unstable scaling:
1. Check if multiple autoscalers are targeting the same adapter
2. Increase `cooldownPeriod` in KEDA ScaledObject
3. Increase `stabilizationWindowSeconds` in HPA behavior
## References
- [Kubernetes HPA Documentation](https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/)
- [KEDA Documentation](https://keda.sh/)
- [Prometheus Adapter](https://github.com/kubernetes-sigs/prometheus-adapter)
- [Planner Documentation](../planner/sla-planner.md)
- [Dynamo Metrics Reference](../observability/metrics.md)
- [Prometheus and Grafana Setup](../observability/prometheus-grafana.md)
---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: "Creating Kubernetes Deployments"
---
The scripts in the `examples/<backend>/launch` folder like [agg.sh](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/vllm/launch/agg.sh) demonstrate how you can serve your models locally.
The corresponding YAML files like [agg.yaml](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/vllm/deploy/agg.yaml) show you how you could create a Kubernetes deployment for your inference graph.
This guide explains how to create your own deployment files.
## Step 1: Choose Your Architecture Pattern
Before choosing a template, understand the different architecture patterns:
### Aggregated Serving (agg.yaml)
**Pattern**: Prefill and decode on the same GPU in a single process.
**Recommended for**:
- Small to medium models (under 70B parameters)
- Development and testing
- Low to moderate traffic
- Cases where simplicity matters more than maximum throughput
**Tradeoffs**:
- Simpler setup and debugging
- Lower operational complexity
- GPU utilization may not be optimal (prefill and decode compete for resources)
- Lower throughput ceiling compared to disaggregated
**Example**: [`agg.yaml`](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/vllm/deploy/agg.yaml)
### Aggregated + Router (agg_router.yaml)
**Pattern**: Load balancer routing across multiple aggregated worker instances.
**Recommended for**:
- Medium traffic that requires high availability
- Workloads that need horizontal scaling
- Load balancing without the complexity of disaggregation
**Tradeoffs**:
- Better scalability than plain aggregated
- High availability through multiple replicas
- Still has GPU underutilization issues of aggregated serving
- More complex than plain aggregated but simpler than disaggregated
**Example**: [`agg_router.yaml`](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/vllm/deploy/agg_router.yaml)
### Disaggregated Serving (disagg_router.yaml)
**Pattern**: Separate prefill and decode workers with specialized optimization.
**Recommended for**:
- Production-style deployments
- High throughput requirements
- Large models (70B+ parameters)
- Workloads that need maximum GPU utilization
**Tradeoffs**:
- Maximum performance and throughput
- Better GPU utilization (prefill and decode specialized)
- Independent scaling of prefill and decode
- More complex setup and debugging
- Requires understanding of prefill/decode separation
**Example**: [`disagg_router.yaml`](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/vllm/deploy/disagg_router.yaml)
### Quick Selection Guide
Use the architecture pattern that best fits your use case as your template.
For example, when using the `vLLM` backend:
- **Development / Testing**: Use [`agg.yaml`](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/vllm/deploy/agg.yaml) as the base configuration.
- **Production with Load Balancing**: Use [`agg_router.yaml`](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/vllm/deploy/agg_router.yaml) to enable scalable, load-balanced inference.
- **High Performance / Disaggregated Deployment**: Use [`disagg_router.yaml`](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/vllm/deploy/disagg_router.yaml) for maximum throughput and modular scalability.
## Step 2: Customize the Template
You can run the Frontend on one machine, for example a CPU node, and the worker on a different machine (a GPU node).
The Frontend serves as a framework-agnostic HTTP entry point and is unlikely to need many changes.
It serves the following roles:
1. OpenAI-Compatible HTTP Server
   * Provides `/v1/chat/completions` endpoint
   * Handles HTTP request/response formatting
   * Supports streaming responses
   * Validates incoming requests
2. Service Discovery and Routing
   * Auto-discovers backend workers via etcd
   * Routes requests to the appropriate Processor/Worker components
   * Handles load balancing between multiple workers
3. Request Preprocessing
   * Initial request validation
   * Model name verification
   * Request format standardization

Next, pick a worker and specialize its config. For example:
```yaml
VllmWorker: # vLLM-specific config
  enforce-eager: true
  enable-prefix-caching: true
SglangWorker: # SGLang-specific config
  router-mode: kv
  disagg-mode: true
TrtllmWorker: # TensorRT-LLM-specific config
  engine-config: ./engine.yaml
  kv-cache-transfer: ucx
```
Here's a template structure based on the examples:
```yaml
YourWorker:
  dynamoNamespace: your-namespace
  componentType: worker
  replicas: N
  envFromSecret: your-secrets # e.g., hf-token-secret
  # Health checks for worker initialization
  readinessProbe:
    exec:
      command: ["/bin/sh", "-c", 'grep "Worker.*initialized" /tmp/worker.log']
  resources:
    requests:
      gpu: "1" # GPU allocation
  extraPodSpec:
    mainContainer:
      image: your-image
      command:
        - /bin/sh
        - -c
      args:
        - python -m dynamo.YOUR_INFERENCE_ENGINE --model YOUR_MODEL --your-flags
```
Consult the corresponding launch script (`.sh` file). Each Python command that launches a component goes into your YAML spec under
`extraPodSpec: -> mainContainer: -> args:`
The Frontend is launched with `python3 -m dynamo.frontend [--http-port 8000] [--router-mode kv]`.
Each worker is launched with a `python -m dynamo.YOUR_INFERENCE_BACKEND --model YOUR_MODEL --your-flags` command.
If you are a Dynamo contributor, see the [dynamo run guide](../../reference/cli.md) for details on how to run this command.
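As a rough illustration of how a launch command from a shell script maps onto the template, the Python sketch below assembles a worker service entry as a dictionary, with the whole command joined into a single `args` string for `/bin/sh -c`. The function `worker_service` is hypothetical, the field names are copied from the template above, and the placeholder values are not real module or image names.

```python
import shlex


def worker_service(dynamo_namespace: str, image: str, launch_command: list,
                   replicas: int = 1, gpus: int = 1) -> dict:
    """Build a worker service entry shaped like the template in this guide."""
    return {
        "dynamoNamespace": dynamo_namespace,
        "componentType": "worker",
        "replicas": replicas,
        "resources": {"requests": {"gpu": str(gpus)}},
        "extraPodSpec": {
            "mainContainer": {
                "image": image,
                "command": ["/bin/sh", "-c"],
                # The shell receives the full command as one quoted string.
                "args": [shlex.join(launch_command)],
            }
        },
    }


if __name__ == "__main__":
    service = worker_service(
        "your-namespace",
        "your-image",
        ["python", "-m", "dynamo.YOUR_INFERENCE_BACKEND", "--model", "YOUR_MODEL"],
    )
    print(service["extraPodSpec"]["mainContainer"]["args"])
```

Dumping the dictionary with a YAML library of your choice produces a block you can paste under `spec.services`.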
## Step 3: Key Customization Points
### Model Configuration
```yaml
args:
- "python -m dynamo.YOUR_INFERENCE_BACKEND --model YOUR_MODEL --your-flag"
```
### Resource Allocation
```yaml
resources:
  requests:
    cpu: "N"
    memory: "NGi"
    gpu: "N"
```
### Scaling
```yaml
replicas: N # Number of worker instances
```
### Routing Mode
```yaml
args:
- --router-mode
- kv # Enable KV-cache routing
```
### Worker Specialization
```yaml
args:
- --is-prefill-worker # For disaggregated prefill workers
```
### Image Pull Secret Configuration
#### Automatic Discovery and Injection
By default, the Dynamo operator automatically discovers and injects image pull secrets based on container registry host matching. The operator scans Docker config secrets within the same namespace and matches their registry hostnames to the container image URLs, automatically injecting the appropriate secrets into the pod's `imagePullSecrets`.
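The matching described above can be pictured as follows. This Python sketch is an approximation for intuition only, not the operator's code. It extracts the registry host from an image reference using the usual Docker convention (a first path component containing `.` or `:`, or equal to `localhost`, is a registry, otherwise Docker Hub is assumed) and selects the secrets whose registry list contains that host. The function names, secret names, and image name are hypothetical.

```python
def registry_host(image: str) -> str:
    """Extract the registry host from an image reference."""
    first, _, rest = image.partition("/")
    # Only treat the first component as a registry if it looks like a host.
    if rest and ("." in first or ":" in first or first == "localhost"):
        return first
    return "docker.io"


def matching_pull_secrets(image: str, docker_config_secrets: dict) -> list:
    """Return names of secrets whose registry hosts include the image's registry.

    docker_config_secrets maps secret name -> list of registry hosts
    found under `auths` in that secret's Docker config.
    """
    host = registry_host(image)
    return sorted(name for name, hosts in docker_config_secrets.items() if host in hosts)


if __name__ == "__main__":
    secrets = {"nvcr-secret": ["nvcr.io"], "hub-secret": ["docker.io"]}
    print(matching_pull_secrets("nvcr.io/example/worker:latest", secrets))
```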
**Disabling Automatic Discovery:**
To disable this behavior for a component and manually control image pull secrets:
```yaml
YourWorker:
  dynamoNamespace: your-namespace
  componentType: worker
  annotations:
    nvidia.com/disable-image-pull-secret-discovery: "true"
```
When disabled, you can manually specify secrets as you would for a normal pod spec via:
```yaml
YourWorker:
  dynamoNamespace: your-namespace
  componentType: worker
  annotations:
    nvidia.com/disable-image-pull-secret-discovery: "true"
  extraPodSpec:
    imagePullSecrets:
      - name: my-registry-secret
      - name: another-secret
    mainContainer:
      image: your-image
```
This automatic discovery eliminates the need to manually configure image pull secrets for each deployment.
## Step 4: Deploy LoRA Adapters (Optional)
After your base model deployment is running, you can deploy LoRA adapters using the `DynamoModel` custom resource. This lets you serve fine-tuned variants of your model without modifying the base deployment.
To add a LoRA adapter to your deployment, link it using `modelRef` in your worker configuration:
```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: my-deployment
spec:
  services:
    Worker:
      modelRef:
        name: Qwen/Qwen3-0.6B # Base model identifier
      componentType: worker
      # ... rest of worker config
```
Then create a `DynamoModel` resource for your LoRA:
```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoModel
metadata:
  name: my-lora
spec:
  modelName: my-custom-lora
  baseModelName: Qwen/Qwen3-0.6B # Must match modelRef.name above
  modelType: lora
  source:
    uri: s3://my-bucket/loras/my-lora
```
**For complete details on managing models and LoRA adapters, see:**
📖 **[Managing Models with DynamoModel Guide](dynamomodel-guide.md)**
---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: "Managing Models with DynamoModel"
---
## Overview
`DynamoModel` is a Kubernetes Custom Resource that represents a machine learning model deployed on Dynamo. It enables you to:
- **Deploy LoRA adapters** on top of running base models
- **Track model endpoints** and their readiness across your cluster
- **Manage model lifecycle** declaratively with Kubernetes
DynamoModel works alongside `DynamoGraphDeployment` (DGD) or `DynamoComponentDeployment` (DCD) resources. While DGD/DCD deploy the inference infrastructure (pods, services), DynamoModel handles model-specific operations like loading LoRA adapters.
## Quick Start
### Prerequisites
Before creating a DynamoModel, you need:
1. A running `DynamoGraphDeployment` or `DynamoComponentDeployment`
2. Components configured with `modelRef` pointing to your base model
3. Pods that are ready and serving your base model
For complete setup including DGD configuration, see [Integration with DynamoGraphDeployment](#integration-with-dynamographdeployment).
### Deploy a LoRA Adapter
**1. Create your DynamoModel:**
```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoModel
metadata:
  name: my-lora
  namespace: dynamo-system
spec:
  modelName: my-custom-lora
  baseModelName: Qwen/Qwen3-0.6B  # Must match modelRef.name in your DGD
  modelType: lora
  source:
    uri: s3://my-bucket/loras/my-lora
```
**2. Apply and verify:**
```bash
# Apply the DynamoModel
kubectl apply -f my-lora.yaml
# Check status
kubectl get dynamomodel my-lora
```
**Expected output:**
```
NAME      TOTAL   READY   AGE
my-lora   2       2       30s
```
That's it! The operator automatically discovers endpoints and loads the LoRA.
For detailed status monitoring, see [Monitoring & Operations](#monitoring--operations).
## Understanding DynamoModel
### Model Types
DynamoModel supports three model types:
| Type | Description | Use Case |
|------|-------------|----------|
| **`base`** | Reference to an existing base model | Tracking endpoints for a base model (default) |
| **`lora`** | LoRA adapter that extends a base model | Deploy fine-tuned adapters on existing models |
| **`adapter`** | Generic model adapter | Future extensibility for other adapter types |
Most users will use **`lora`** to deploy fine-tuned models on top of their base model deployments.
### How It Works
When you create a DynamoModel, the operator:
1. **Discovers endpoints**: Finds all pods running your `baseModelName` (by matching `modelRef.name` in DGD/DCD)
2. **Creates service**: Automatically creates a Kubernetes Service to track these pods
3. **Loads LoRA**: Calls the LoRA load API on each endpoint (for `lora` type)
4. **Updates status**: Reports which endpoints are ready
**Key linkage:**
```yaml
# DGD modelRef.name ↔ DynamoModel baseModelName must match
Worker:
  modelRef:
    name: Qwen/Qwen3-0.6B
---
spec:
  baseModelName: Qwen/Qwen3-0.6B
```
## Configuration Overview
DynamoModel requires just a few key fields to deploy a model or adapter:
| Field | Required | Purpose | Example |
|-------|----------|---------|---------|
| `modelName` | Yes | Model identifier | `my-custom-lora` |
| `baseModelName` | Yes | Links to DGD modelRef | `Qwen/Qwen3-0.6B` |
| `modelType` | No | Type: base/lora/adapter | `lora` (default: `base`) |
| `source.uri` | For LoRA | Model location | `s3://bucket/path` or `hf://org/model` |
**Example minimal LoRA configuration:**
```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoModel
metadata:
  name: my-lora
spec:
  modelName: my-custom-lora
  baseModelName: Qwen/Qwen3-0.6B
  modelType: lora
  source:
    uri: s3://my-bucket/my-lora
```
**For complete field specifications, validation rules, and all options, see:**
📖 [DynamoModel API Reference](../api-reference.md#dynamomodel)
### Status Summary
The status shows discovered endpoints and their readiness:
```bash
kubectl get dynamomodel my-lora
```
**Key status fields:**
- `totalEndpoints` / `readyEndpoints`: Counts of discovered vs ready endpoints
- `endpoints[]`: List with addresses, pod names, and ready status
- `conditions`: Standard Kubernetes conditions (EndpointsReady, ServicesFound)
For detailed status usage, see the [Monitoring & Operations](#monitoring--operations) section below.
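As a rough illustration, the status fields listed above can be summarized programmatically. The sketch below assumes the status has already been loaded into a dictionary (for example by parsing the output of `kubectl get dynamomodel my-lora -o json`). `summarize_status` is a hypothetical helper, not part of Dynamo, and it relies only on the field names listed above.

```python
from typing import Any


def summarize_status(status: dict[str, Any]) -> dict[str, Any]:
    """Summarize a DynamoModel status dict (hypothetical helper)."""
    total = status.get("totalEndpoints", 0)
    ready = status.get("readyEndpoints", 0)
    # Collect the pod names of endpoints that are not ready yet.
    not_ready = [
        ep.get("podName")
        for ep in status.get("endpoints", [])
        if not ep.get("ready", False)
    ]
    return {
        "total": total,
        "ready": ready,
        "all_ready": total > 0 and ready == total,
        "not_ready_pods": not_ready,
    }
```

Treating an empty status as "not all ready" is a choice made for this sketch: with zero discovered endpoints there is nothing serving the model.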
## Common Use Cases
### Use Case 1: S3-Hosted LoRA Adapter
Deploy a LoRA adapter stored in an S3 bucket.
```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoModel
metadata:
  name: customer-support-lora
  namespace: production
spec:
  modelName: customer-support-adapter-v1
  baseModelName: meta-llama/Llama-3.3-70B-Instruct
  modelType: lora
  source:
    uri: s3://my-models-bucket/loras/customer-support/v1
```
**Prerequisites:**
- S3 bucket accessible from your pods (IAM role or credentials)
- Base model `meta-llama/Llama-3.3-70B-Instruct` running via DGD/DCD
**Verification:**
```bash
# Check LoRA is loaded
kubectl get dynamomodel customer-support-lora -o jsonpath='{.status.readyEndpoints}'
# Should output: 2 (or your number of replicas)
# View which pods are serving
kubectl get dynamomodel customer-support-lora -o jsonpath='{.status.endpoints[*].podName}'
```
### Use Case 2: HuggingFace-Hosted LoRA
Deploy a LoRA adapter from HuggingFace Hub.
```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoModel
metadata:
  name: multilingual-lora
  namespace: dynamo-system
spec:
  modelName: multilingual-adapter
  baseModelName: Qwen/Qwen3-0.6B
  modelType: lora
  source:
    uri: hf://myorg/qwen-multilingual-lora@v1.0.0  # Optional: @revision
```
**Prerequisites:**
- HuggingFace Hub accessible from your pods
- If private repo: HF token configured as secret and mounted in pods
- Base model `Qwen/Qwen3-0.6B` running via DGD/DCD
**With HuggingFace token:**
```yaml
# In your DGD/DCD
spec:
  services:
    worker:
      envFromSecret: hf-token-secret  # Provides HF_TOKEN env var
      modelRef:
        name: Qwen/Qwen3-0.6B
      # ... rest of config
```
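To make the two `source.uri` forms used so far concrete (`s3://bucket/path` and `hf://org/model` with an optional `@revision`), here is a minimal sketch of how such a URI could be split into its parts. `parse_source_uri` and `SourceURI` are hypothetical names for illustration only. This is not how the operator parses URIs, and it treats `@revision` as meaningful only for `hf://`, since that is the only form shown with it.

```python
from typing import NamedTuple, Optional


class SourceURI(NamedTuple):
    scheme: str
    location: str
    revision: Optional[str]


def parse_source_uri(uri: str) -> SourceURI:
    """Split a source URI into scheme, location, and optional revision."""
    scheme, sep, rest = uri.partition("://")
    if not sep or not scheme or not rest:
        raise ValueError(f"not a valid source URI: {uri!r}")
    revision = None
    # Only the hf:// form is shown with an optional @revision suffix.
    if scheme == "hf" and "@" in rest:
        rest, _, revision = rest.rpartition("@")
    return SourceURI(scheme, rest, revision)
```

For the example above, `parse_source_uri("hf://myorg/qwen-multilingual-lora@v1.0.0")` yields the scheme `hf`, the location `myorg/qwen-multilingual-lora`, and the revision `v1.0.0`.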
### Use Case 3: Multiple LoRAs on Same Base Model
Deploy multiple LoRA adapters on the same base model deployment.
```yaml
---
# LoRA for customer support
apiVersion: nvidia.com/v1alpha1
kind: DynamoModel
metadata:
  name: support-lora
spec:
  modelName: support-adapter
  baseModelName: Qwen/Qwen3-0.6B
  modelType: lora
  source:
    uri: s3://models/support-lora
---
# LoRA for code generation
apiVersion: nvidia.com/v1alpha1
kind: DynamoModel
metadata:
  name: code-lora
spec:
  modelName: code-adapter
  baseModelName: Qwen/Qwen3-0.6B  # Same base model
  modelType: lora
  source:
    uri: s3://models/code-lora
```
Both LoRAs will be loaded on all pods serving `Qwen/Qwen3-0.6B`. Your application can then route requests to the appropriate adapter.
## Monitoring & Operations
### Checking Status
**Quick status check:**
```bash
kubectl get dynamomodel
```
**Example output:**
```
NAME            TOTAL   READY   AGE
my-lora         2       2       5m
customer-lora   4       3       2h
```
**Detailed status:**
```bash
kubectl describe dynamomodel my-lora
```
**Example output:**
```
Name:         my-lora
Namespace:    dynamo-system
Spec:
  Model Name:       my-custom-lora
  Base Model Name:  Qwen/Qwen3-0.6B
  Model Type:       lora
  Source:
    Uri:  s3://my-bucket/my-lora
Status:
  Ready Endpoints:  2
  Total Endpoints:  2
  Endpoints:
    Address:   http://10.0.1.5:9090
    Pod Name:  worker-0
    Ready:     true
    Address:   http://10.0.1.6:9090
    Pod Name:  worker-1
    Ready:     true
  Conditions:
    Type:    EndpointsReady
    Status:  True
    Reason:  EndpointsDiscovered
Events:
  Type    Reason          Message
  ----    ------          -------
  Normal  EndpointsReady  Discovered 2 ready endpoints for base model Qwen/Qwen3-0.6B
```
### Understanding Readiness
An endpoint is **ready** when:
1. The pod is running and healthy
2. The LoRA load API call succeeded
**Condition states:**
- `EndpointsReady=True`: All endpoints are ready (full availability)
- `EndpointsReady=False, Reason=NotReady`: Not all endpoints ready (check message for counts)
- `EndpointsReady=False, Reason=NoEndpoints`: No endpoints found
When `readyEndpoints < totalEndpoints`, the operator automatically retries loading every 30 seconds.
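The condition states above can be read as a small decision rule over the two endpoint counts. The following is a sketch of that rule only, not the operator's implementation. `endpoints_ready_condition` and `needs_retry` are hypothetical names, and the reason for the `True` state is left unset because the list above does not name one.

```python
from typing import Optional, Tuple


def endpoints_ready_condition(total: int, ready: int) -> Tuple[str, Optional[str]]:
    """Return (status, reason) for the EndpointsReady condition."""
    if total == 0:
        return ("False", "NoEndpoints")
    if ready < total:
        return ("False", "NotReady")
    # The list above does not name a reason for the True state.
    return ("True", None)


def needs_retry(total: int, ready: int) -> bool:
    """Loading is retried while readyEndpoints < totalEndpoints."""
    return ready < total
```

For the `customer-lora` row shown earlier (4 total, 3 ready), this gives `("False", "NotReady")` and `needs_retry` returns `True`.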
### Viewing Endpoints
**Get endpoint addresses:**
```bash
kubectl get dynamomodel my-lora -o jsonpath='{.status.endpoints[*].address}' | tr ' ' '\n'
```
**Output:**
```
http://10.0.1.5:9090
http://10.0.1.6:9090
```
**Get endpoint pod names:**
```bash
kubectl get dynamomodel my-lora -o jsonpath='{.status.endpoints[*].podName}' | tr ' ' '\n'
```
**Check readiness of each endpoint:**
```bash
kubectl get dynamomodel my-lora -o json | jq '.status.endpoints[] | {podName, ready}'
```
**Output:**
```json
{
  "podName": "worker-0",
  "ready": true
}
{
  "podName": "worker-1",
  "ready": true
}
```
### Updating a Model
To update a LoRA (e.g., deploy a new version):
```bash
# Edit the source URI
kubectl edit dynamomodel my-lora
# Or apply an updated YAML
kubectl apply -f my-lora-v2.yaml
```
The operator will detect the change and reload the LoRA on all endpoints.
### Deleting a Model
```bash
kubectl delete dynamomodel my-lora
```
For LoRA models, the operator will:
1. Unload the LoRA from all endpoints
2. Clean up associated resources
3. Remove the DynamoModel CR
The base model deployment (DGD/DCD) continues running normally.
## Troubleshooting
### No Endpoints Found
**Symptom:**
```yaml
status:
  totalEndpoints: 0
  readyEndpoints: 0
  conditions:
    - type: EndpointsReady
      status: "False"
      reason: NoEndpoints
      message: "No endpoint slices found for base model Qwen/Qwen3-0.6B"
```
**Common Causes:**
1. **Base model deployment not running**
```bash
# Check if pods exist
kubectl get pods -l nvidia.com/dynamo-component-type=worker
```
**Solution:** Deploy your DGD/DCD first, wait for pods to be ready.
2. **`baseModelName` mismatch**
```bash
# Check modelRef in your DGD
kubectl get dynamographdeployment my-deployment -o yaml | grep -A2 modelRef
```
**Solution:** Ensure `baseModelName` in DynamoModel exactly matches `modelRef.name` in DGD.
3. **Pods not ready**
```bash
# Check pod status
kubectl get pods -l nvidia.com/dynamo-component-type=worker
```
**Solution:** Wait for pods to reach `Running` and `Ready` state.
4. **Wrong namespace**
**Solution:** Ensure DynamoModel is in the same namespace as your DGD/DCD.
### LoRA Load Failures
**Symptom:**
```yaml
status:
  totalEndpoints: 2
  readyEndpoints: 0  # ← No endpoints ready despite pods existing
  conditions:
    - type: EndpointsReady
      status: "False"
      reason: NoReadyEndpoints
```
**Common Causes:**
1. **Source URI not accessible**
```bash
# Check operator logs
kubectl logs -n dynamo-system deployment/dynamo-operator-controller-manager -f | grep "Failed to load"
```
**Solution:**
- For S3: Verify bucket permissions, IAM role, credentials
- For HuggingFace: Verify token is valid, repo exists and is accessible
2. **Invalid LoRA format**
**Solution:** Ensure your LoRA weights are in the format expected by your backend framework (vLLM, SGLang, etc.)
3. **Endpoint API errors**
```bash
# Check operator logs for HTTP errors
kubectl logs -n dynamo-system deployment/dynamo-operator-controller-manager | grep "error"
```
**Solution:** Check the backend framework's logs in the worker pods:
```bash
kubectl logs worker-0
```
4. **Out of memory**
**Solution:** LoRA adapters require additional memory. Increase memory limits in your DGD:
```yaml
resources:
  limits:
    memory: "32Gi"  # Increase if needed
```
### Status Shows Not Ready
**Symptom:**
Some endpoints remain not ready for extended periods.
**Diagnosis:**
```bash
# Check which endpoints are not ready
kubectl get dynamomodel my-lora -o json | jq '.status.endpoints[] | select(.ready == false)'
# View operator logs for that specific pod
kubectl logs -n dynamo-system deployment/dynamo-operator-controller-manager | grep "worker-0"
# Check the worker pod logs
kubectl logs worker-0 | tail -50
```
**Common Causes:**
1. **Network issues**: Pod can't reach S3/HuggingFace
2. **Resource constraints**: Pod is OOMing or being throttled
3. **API endpoint not responding**: Backend framework isn't serving the LoRA API
**When to wait vs investigate:**
- **Wait**: If readyEndpoints is increasing over time (LoRAs loading progressively)
- **Investigate**: If stuck at same readyEndpoints for >5 minutes
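The wait-versus-investigate guidance can be expressed as a check over periodic samples of `readyEndpoints`. This is a minimal sketch under stated assumptions: you collect `(timestamp_seconds, readyEndpoints)` samples yourself (for example by polling `kubectl get dynamomodel`), `total` is greater than zero, and `wait_or_investigate` is a hypothetical helper rather than a Dynamo tool.

```python
from typing import Sequence, Tuple

STUCK_THRESHOLD_SECONDS = 5 * 60  # ">5 minutes" from the guidance above


def wait_or_investigate(samples: Sequence[Tuple[float, int]], total: int) -> str:
    """Classify (timestamp_seconds, readyEndpoints) samples, oldest first."""
    if not samples:
        raise ValueError("need at least one sample")
    last_time, last_ready = samples[-1]
    if last_ready >= total:
        return "ready"
    # Walk backwards to find when the current count was first observed
    # in the trailing run of identical values.
    stuck_since = last_time
    for timestamp, ready in reversed(samples):
        if ready != last_ready:
            break
        stuck_since = timestamp
    if last_time - stuck_since > STUCK_THRESHOLD_SECONDS:
        return "investigate"
    return "wait"
```

A count that changed within the last five minutes returns `"wait"`, while a count that has stayed the same for longer than that returns `"investigate"`.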
### Viewing Events and Logs
**Check events:**
```bash
kubectl describe dynamomodel my-lora | tail -20
```
**View operator logs:**
```bash
# Follow logs
kubectl logs -n dynamo-system deployment/dynamo-operator-controller-manager -f
# Filter for specific model
kubectl logs -n dynamo-system deployment/dynamo-operator-controller-manager | grep "my-lora"
```
**Common events and messages:**
| Event/Message | Meaning | Action |
|---------------|---------|--------|
| `EndpointsReady` | All endpoints are ready | ✅ Good - full service availability |
| `NotReady` | Not all endpoints ready | ⚠️ Check readyEndpoints count - operator will retry |
| `PartialEndpointFailure` | Some endpoints failed to load | Check logs for errors |
| `NoEndpointsFound` | No pods discovered | Verify DGD running and modelRef matches |
| `EndpointDiscoveryFailed` | Can't query endpoints | Check operator RBAC permissions |
| `Successfully reconciled` | Reconciliation complete | ✅ Good |
## Integration with DynamoGraphDeployment
This section shows the complete end-to-end workflow for deploying base models and LoRA adapters together.
DynamoModel and DynamoGraphDeployment work together to provide complete model deployment:
- **DGD**: Deploys the infrastructure (pods, services, resources)
- **DynamoModel**: Manages model-specific operations (LoRA loading)
### Linking Models to Components
The connection is established through the `modelRef` field in your DGD:
**Complete example:**
```yaml
---
# 1. Deploy the base model infrastructure
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: my-deployment
spec:
  backendFramework: vllm
  services:
    Frontend:
      componentType: frontend
      replicas: 1
      dynamoNamespace: my-app
      extraPodSpec:
        mainContainer:
          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:latest
    Worker:
      # This modelRef creates the link to DynamoModel
      modelRef:
        name: Qwen/Qwen3-0.6B  # ← Key linking field
      componentType: worker
      replicas: 2
      resources:
        limits:
          gpu: "1"
      extraPodSpec:
        mainContainer:
          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:latest
          args:
            - --model
            - Qwen/Qwen3-0.6B
            - --tensor-parallel-size
            - "1"
---
# 2. Deploy LoRA adapters on top
apiVersion: nvidia.com/v1alpha1
kind: DynamoModel
metadata:
  name: my-lora
spec:
  modelName: my-custom-lora
  baseModelName: Qwen/Qwen3-0.6B  # ← Must match modelRef.name above
  modelType: lora
  source:
    uri: s3://my-bucket/loras/my-lora
```
### Deployment Workflow
**Recommended order:**
```bash
# 1. Deploy base model infrastructure
kubectl apply -f my-deployment.yaml
# 2. Wait for pods to be ready
kubectl wait --for=condition=ready pod -l nvidia.com/dynamo-component-type=worker --timeout=5m
# 3. Deploy LoRA adapters
kubectl apply -f my-lora.yaml
# 4. Verify LoRA is loaded
kubectl get dynamomodel my-lora
```
**What happens behind the scenes:**
| Step | DGD | DynamoModel |
|------|-----|-------------|
| 1 | Creates pods with modelRef | - |
| 2 | Pods become running and ready | - |
| 3 | - | CR created, discovers endpoints via auto-created Service |
| 4 | - | Calls LoRA load API on each endpoint |
| 5 | - | All endpoints ready ✓ |
The operator automatically handles all service discovery - you don't configure services, labels, or selectors manually.
## API Reference
For complete field specifications, validation rules, and detailed type definitions, see:
**📖 [Dynamo CRD API Reference](../api-reference.md#dynamomodel)**
## Summary
DynamoModel provides declarative model management for Dynamo deployments:
✅ **Simple**: 2-step deployment of LoRA adapters
✅ **Automatic**: Endpoint discovery and loading handled by operator
✅ **Observable**: Rich status reporting and conditions
✅ **Integrated**: Works seamlessly with DynamoGraphDeployment
**Next Steps:**
- Try the [Quick Start](#quick-start) example
- Explore [Common Use Cases](#common-use-cases)
- Check the [API Reference](../api-reference.md#dynamomodel) for advanced configuration
---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: "Minikube Setup Guide"
---
Don't have a Kubernetes cluster? No problem! You can set up a local development environment using Minikube. This guide walks through the setup of everything you need to run Dynamo Kubernetes Platform locally.
## 1. Install Minikube
First things first! Start by installing Minikube. Follow the official [Minikube installation guide](https://minikube.sigs.k8s.io/docs/start/) for your operating system.
## 2. Configure GPU Support (Optional)
Planning to use GPU-accelerated workloads? You'll need to configure GPU support in Minikube. Follow the [Minikube GPU guide](https://minikube.sigs.k8s.io/docs/tutorials/nvidia/) to set up NVIDIA GPU support before proceeding.
<Tip>
Make sure to configure GPU support before starting Minikube if you plan to use GPU workloads!
</Tip>
## 3. Start Minikube
Time to launch your local cluster!
```bash
# Start Minikube with GPU support (if configured)
minikube start --driver docker --container-runtime docker --gpus all --memory=16000mb --cpus=8
# Enable required addons
minikube addons enable istio-provisioner
minikube addons enable istio
minikube addons enable storage-provisioner-rancher
```
## 4. Verify Installation
Let's make sure everything is working correctly!
```bash
# Check Minikube status
minikube status
# Verify Istio installation
kubectl get pods -n istio-system
# Verify storage class
kubectl get storageclass
```
## Next Steps
Once your local environment is set up, you can proceed with the [Dynamo Kubernetes Platform installation guide](../installation-guide.md) to deploy the platform to your local cluster.
---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: "Multinode Deployment Guide"
---
This guide explains how to deploy Dynamo workloads across multiple nodes. Multinode deployments enable you to scale compute-intensive LLM workloads across multiple physical machines, maximizing GPU utilization and supporting larger models.
## Overview
Dynamo supports multinode deployments through the `multinode` section in resource specifications. This allows you to:
- Distribute workloads across multiple physical nodes
- Scale GPU resources beyond a single machine
- Support large models requiring extensive tensor parallelism
- Achieve high availability and fault tolerance
## Basic Requirements
- **Kubernetes Cluster**: Version 1.24 or later
- **GPU Nodes**: Multiple nodes with NVIDIA GPUs
- **High-Speed Networking**: InfiniBand, RoCE, or high-bandwidth Ethernet (recommended for optimal performance)
### Advanced Multinode Orchestration
#### Using Grove (default)
For sophisticated multinode deployments, Dynamo integrates with advanced Kubernetes orchestration systems:
- **[Grove](https://github.com/NVIDIA/grove)**: Network topology-aware gang scheduling and auto-scaling for AI workloads
- **[KAI-Scheduler](https://github.com/NVIDIA/KAI-Scheduler)**: Kubernetes native scheduler optimized for AI workloads at scale
These systems provide enhanced scheduling capabilities including topology-aware placement, gang scheduling, and coordinated auto-scaling across multiple nodes.
**Features Enabled with Grove:**
- Declarative composition of AI workloads
- Multi-level horizontal auto-scaling
- Custom startup ordering for components
- Resource-aware rolling updates
[KAI-Scheduler](https://github.com/NVIDIA/KAI-Scheduler) is a Kubernetes native scheduler optimized for AI workloads at large scale.
**Features Enabled with KAI-Scheduler:**
- Gang scheduling
- Network topology-aware pod placement
- AI workload-optimized scheduling algorithms
- GPU resource awareness and allocation
- Support for complex scheduling constraints
- Integration with Grove for enhanced capabilities
- Performance optimizations for large-scale deployments
##### Prerequisites
- [Grove](https://github.com/NVIDIA/grove/blob/main/docs/installation.md) installed on the cluster
- (Optional) [KAI-Scheduler](https://github.com/NVIDIA/KAI-Scheduler) installed on the cluster with the default queue name `dynamo` created. If no queue annotation is specified on the DGD resource, the operator uses the `dynamo` queue by default. Custom queue names can be specified via the `nvidia.com/kai-scheduler-queue` annotation, but the queue must exist in the cluster before deployment.
KAI-Scheduler is optional but recommended for advanced scheduling capabilities.
#### Using LWS and Volcano
LeaderWorkerSet (LWS) is a simple multinode deployment mechanism that allows you to deploy a workload across multiple nodes.
- **LWS**: [LWS Installation](https://github.com/kubernetes-sigs/lws#installation)
- **Volcano**: [Volcano Installation](https://volcano.sh/en/docs/installation/)
Volcano is a Kubernetes native scheduler optimized for AI workloads at scale. It is used in conjunction with LWS to provide gang scheduling support.
## Core Concepts
### Orchestrator Selection Algorithm
Dynamo automatically selects the best available orchestrator for multinode deployments using the following logic:
#### When Both Grove and LWS are Available:
- **Grove is selected by default** (recommended for advanced AI workloads)
- **LWS is selected** if you explicitly set `nvidia.com/enable-grove: "false"` annotation on your DGD resource
#### When Only One Orchestrator is Available:
- The installed orchestrator (Grove or LWS) is automatically selected
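The selection rules above can be summarized as a small function. This is a sketch of the documented behavior only, not the operator's code. `select_orchestrator` is a hypothetical name, and raising an error when neither orchestrator is installed is an assumption, since the text does not cover that case.

```python
from typing import Mapping

ENABLE_GROVE_ANNOTATION = "nvidia.com/enable-grove"


def select_orchestrator(
    grove_available: bool,
    lws_available: bool,
    annotations: Mapping[str, str],
) -> str:
    """Return "grove" or "lws" following the selection rules above."""
    grove_disabled = annotations.get(ENABLE_GROVE_ANNOTATION) == "false"
    if grove_available and lws_available:
        # Grove is the default; the annotation opts into LWS.
        return "lws" if grove_disabled else "grove"
    if grove_available:
        return "grove"
    if lws_available:
        return "lws"
    # Assumption: behavior with no orchestrator installed is not documented.
    raise ValueError("no multinode orchestrator available")
```

With both orchestrators installed, `select_orchestrator(True, True, {"nvidia.com/enable-grove": "false"})` returns `"lws"`, and an empty annotation map returns `"grove"`.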
#### Scheduler Integration:
- **With Grove**: Automatically integrates with [KAI-Scheduler](https://github.com/NVIDIA/KAI-Scheduler) when available, providing:
- Advanced queue management via `nvidia.com/kai-scheduler-queue` annotation
- AI-optimized scheduling policies
- Resource-aware workload placement
- **With LWS**: Uses Volcano scheduler for gang scheduling and resource coordination
#### Configuration Examples:
**Default (Grove with KAI-Scheduler):**
```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: my-multinode-deployment
  annotations:
    nvidia.com/kai-scheduler-queue: "dynamo"
spec:
  # ... your deployment spec
```
> **Note:** The `nvidia.com/kai-scheduler-queue` annotation defaults to `"dynamo"`. If you specify a custom queue name, ensure the queue exists in your cluster before deploying. You can verify available queues with `kubectl get queues`.
**Force LWS usage:**
```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: my-multinode-deployment
  annotations:
    nvidia.com/enable-grove: "false"
spec:
  # ... your deployment spec
```
### The `multinode` Section
The `multinode` section in a resource specification defines how many physical nodes the workload should span:
```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: my-multinode-deployment
spec:
  # ... your deployment spec
  services:
    my-service:
      # ...
      multinode:
        nodeCount: 2
      resources:
        limits:
          gpu: "2"  # 2 GPUs per node
```
### GPU Distribution
The relationship between `multinode.nodeCount` and `gpu` is multiplicative:
- **`multinode.nodeCount`**: Number of physical nodes
- **`gpu`**: Number of GPUs per node
- **Total GPUs**: `multinode.nodeCount × gpu`
**Example:**
- `multinode.nodeCount: 2` + `gpu: "4"` = 8 total GPUs (4 GPUs per node across 2 nodes)
- `multinode.nodeCount: 4` + `gpu: "8"` = 32 total GPUs (8 GPUs per node across 4 nodes)
### Tensor Parallelism Alignment
The tensor parallelism size (`--tp-size` or `--tp`) in your command/args must equal the total number of GPUs:
```yaml
# Example: nodeCount 2 × 4 GPUs per node = 8 total GPUs
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: my-multinode-deployment
spec:
  # ... your deployment spec
  services:
    my-service:
      # ...
      multinode:
        nodeCount: 2
      resources:
        limits:
          gpu: "4"
      extraPodSpec:
        mainContainer:
          # ...
          args:
            # Command args must use tp-size=8
            - "--tp-size"
            - "8"  # Must equal multinode.nodeCount × gpu
```
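The alignment rule can be checked before you apply a manifest. The sketch below computes the total GPU count from `multinode.nodeCount` and `gpu`, then compares it with the tensor-parallel size found in the container args. All function names are hypothetical, and it only handles the two-token form (`--tp-size 8` or `--tp 8`), not `--tp-size=8`.

```python
from typing import Optional, Sequence

TP_FLAGS = ("--tp-size", "--tp")


def total_gpus(node_count: int, gpus_per_node: int) -> int:
    """Total GPUs = multinode.nodeCount × gpu."""
    return node_count * gpus_per_node


def tp_size_from_args(args: Sequence[str]) -> Optional[int]:
    """Return the value following --tp-size or --tp, if present."""
    for i, arg in enumerate(args[:-1]):
        if arg in TP_FLAGS:
            return int(args[i + 1])
    return None


def tp_matches_gpus(node_count: int, gpus_per_node: int, args: Sequence[str]) -> bool:
    """True when the tensor-parallel size equals the total GPU count."""
    return tp_size_from_args(args) == total_gpus(node_count, gpus_per_node)
```

For the manifest above, `tp_matches_gpus(2, 4, ["--tp-size", "8"])` is `True`. Changing the value to `"4"` makes it `False`.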
## Backend-Specific Operator Behavior
When you deploy a multinode workload, the Dynamo operator automatically applies backend-specific configurations to enable distributed execution. Understanding these automatic modifications helps troubleshoot issues and optimize your deployments.
### vLLM Backend
For vLLM multinode deployments, the operator automatically selects and configures the appropriate distributed execution mode based on your parallelism settings:
#### Deployment Modes
The operator automatically determines the deployment mode based on your parallelism configuration:
**1. Tensor/Pipeline Parallelism Mode (Single model across nodes)**
- **When used**: When `world_size > GPUs_per_node` where `world_size = tensor_parallel_size × pipeline_parallel_size`
- **Use case**: Distributing a single model instance across multiple nodes using tensor or pipeline parallelism
The operator uses Ray for multi-node tensor/pipeline parallel deployments. Ray provides automatic placement group management and worker spawning across nodes.
**Leader Node:**
- **Command**: `ray start --head --port=6379 && <original-vllm-command> --distributed-executor-backend ray`
- **Behavior**: Starts Ray head node, then runs vLLM which creates a placement group spanning all Ray workers
- **Probes**: All health probes remain active (liveness, readiness, startup)
**Worker Nodes:**
- **Command**: `ray start --address=<leader-hostname>:6379 --block`
- **Behavior**: Joins Ray cluster and blocks; vLLM on leader spawns Ray actors to these workers
- **Probes**: All probes (liveness, readiness, startup) are automatically removed
> **Note**: vLLM's Ray executor automatically creates a placement group and spawns workers across the cluster. The `--nnodes` flag is NOT used with Ray - it's only compatible with the `mp` backend.
**2. Data Parallel Mode (Multiple model instances across nodes)**
- **When used**: When `world_size × data_parallel_size > GPUs_per_node`
- **Use case**: Running multiple independent model instances across nodes with data parallelism (e.g., MoE models with expert parallelism)
**All Nodes (Leader and Workers):**
- **Injected Flags**:
- `--data-parallel-address <leader-hostname>` - Address of the coordination server
- `--data-parallel-size-local <value>` - Number of data parallel workers per node
- `--data-parallel-rpc-port 13445` - RPC port for data parallel coordination
- `--data-parallel-start-rank <value>` - Starting rank for this node (calculated automatically)
- **Probes**: Worker probes are removed; leader probes remain active
**Note**: The operator injects these flags into your command regardless of command structure (direct Python commands or shell wrappers).
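The two "When used" conditions can be combined into one decision function. This is a sketch based only on the inequalities stated above. `vllm_multinode_mode` is a hypothetical name, and checking the tensor/pipeline condition before the data-parallel one is an assumption about precedence that the text does not spell out.

```python
def vllm_multinode_mode(
    tensor_parallel_size: int,
    pipeline_parallel_size: int,
    data_parallel_size: int,
    gpus_per_node: int,
) -> str:
    """Pick a deployment mode from the parallelism settings."""
    world_size = tensor_parallel_size * pipeline_parallel_size
    if world_size > gpus_per_node:
        # A single model instance does not fit on one node: use Ray.
        return "ray"
    if world_size * data_parallel_size > gpus_per_node:
        # Each instance fits on a node, but all replicas together do not.
        return "data_parallel"
    return "single_node"
```

With 4 GPUs per node, TP=8 gives `"ray"`, TP=2 with DP=4 gives `"data_parallel"`, and TP=4 alone fits on one node.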
#### Why Ray for Multi-Node TP/PP?
vLLM supports two distributed executor backends: `ray` and `mp`. For multi-node deployments:
- **Ray executor**: vLLM creates a placement group and spawns Ray actors across the cluster. Workers don't run vLLM directly - the leader's vLLM process manages everything.
- **mp executor**: Each node must run its own vLLM process with `--nnodes`, `--node-rank`, `--master-addr`, `--master-port`. This approach is more complex to orchestrate.
The Dynamo operator uses Ray because:
1. It aligns with vLLM's official multi-node documentation (see `multi-node-serving.sh`)
2. Simpler orchestration - only the leader runs vLLM, workers just need Ray agents
3. vLLM automatically handles placement group creation and worker management
#### Compilation Cache Support
When a volume mount is configured with `useAsCompilationCache: true`, the operator automatically sets:
- **`VLLM_CACHE_ROOT`**: Environment variable pointing to the cache mount point
### SGLang Backend
For SGLang multinode deployments, the operator injects distributed initialization parameters:
#### Leader Node
- **Distributed Flags**: Injects `--dist-init-addr <leader-hostname>:29500 --nnodes <count> --node-rank 0`
- **Probes**: All health probes remain active
#### Worker Nodes
- **Distributed Flags**: Injects `--dist-init-addr <leader-hostname>:29500 --nnodes <count> --node-rank <dynamic-rank>`
- The `node-rank` is automatically determined from the pod's stateful identity
- **Probes**: All probes (liveness, readiness, startup) are automatically removed
**Note:** The operator intelligently injects these flags regardless of your command structure (direct Python commands or shell wrappers).
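To see what the injected flags look like for a given node, the following sketch builds the flag list described above. `sglang_dist_flags` is a hypothetical helper. The operator derives the leader hostname and the rank itself, so both are plain parameters here.

```python
from typing import List

DIST_INIT_PORT = 29500  # port shown in the injected flags above


def sglang_dist_flags(leader_hostname: str, node_count: int, node_rank: int) -> List[str]:
    """Build the distributed flags for one node (rank 0 is the leader)."""
    if not 0 <= node_rank < node_count:
        raise ValueError("node_rank must be in [0, node_count)")
    return [
        "--dist-init-addr", f"{leader_hostname}:{DIST_INIT_PORT}",
        "--nnodes", str(node_count),
        "--node-rank", str(node_rank),
    ]
```

Every node receives the same `--dist-init-addr` and `--nnodes` values, and only `--node-rank` differs between the leader and the workers.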
### TensorRT-LLM Backend
For TensorRT-LLM multinode deployments, the operator configures MPI-based communication:
#### Leader Node
- **SSH Configuration**: Automatically sets up SSH keys and configuration from a Kubernetes secret
- **MPI Command**: Wraps your command in an `mpirun` command with:
- Proper host list including all worker nodes
- SSH configuration for passwordless authentication on port 2222
- Environment variable propagation to all nodes
- Activation of the Dynamo virtual environment
- **Probes**: All health probes remain active
#### Worker Nodes
- **SSH Daemon**: Replaces your command with SSH daemon setup and execution
- Generates host keys in user-writable directories (non-privileged)
- Configures SSH daemon to listen on port 2222
- Sets up authorized keys for leader access
- **Probes**:
- **Liveness and Startup**: Removed (workers run SSH daemon, not the main application)
- **Readiness**: Replaced with TCP socket check on SSH port 2222
- Initial Delay: 20 seconds
- Period: 20 seconds
- Timeout: 5 seconds
- Failure Threshold: 10
#### Additional Configuration
- **Environment Variable**: `OMPI_MCA_orte_keep_fqdn_hostnames=1` is added to all nodes
- **SSH Volume**: Automatically mounts the SSH keypair secret (typically named `mpirun-ssh-key-<deployment-name>`)
**Important:** TensorRT-LLM requires an SSH keypair secret to be created before deployment. The secret name follows the pattern `mpirun-ssh-key-<component-name>`.
### Compilation Cache Configuration
The operator supports compilation cache volumes for backend-specific optimization:
| Backend | Support Level | Environment Variables | Default Mount Point |
|---------|--------------|----------------------|---------------------|
| vLLM | Fully Supported | `VLLM_CACHE_ROOT` | User-specified |
| SGLang | Partial Support | _None (pending upstream)_ | User-specified |
| TensorRT-LLM | Partial Support | _None (pending upstream)_ | User-specified |
To enable compilation cache, add a volume mount with `useAsCompilationCache: true` in your component specification. For vLLM, the operator will automatically configure the necessary environment variables. For other backends, volume mounts are created, but additional environment configuration may be required until upstream support is added.
## Next Steps
For additional support and examples, see the working multinode configurations in:
- **SGLang**: [examples/backends/sglang/deploy/](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/sglang/deploy/)
- **TensorRT-LLM**: [examples/backends/trtllm/deploy/](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/trtllm/deploy/)
- **vLLM**: [examples/backends/vllm/deploy/](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/vllm/deploy/)
These examples demonstrate proper usage of the `multinode` section with corresponding `gpu` limits and correct `tp-size` configuration.
---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: "Working with Dynamo Kubernetes Operator"
---
## Overview
The Dynamo operator is a Kubernetes operator that simplifies the deployment, configuration, and lifecycle management of DynamoGraphs. It automates the reconciliation of custom resources so that the cluster converges on your desired state. This operator is ideal for users who want to manage complex deployments using declarative YAML definitions and Kubernetes-native tooling.
## Architecture
- **Operator Deployment:**
Deployed as a Kubernetes `Deployment` in a specific namespace.
- **Controllers:**
- `DynamoGraphDeploymentController`: Watches `DynamoGraphDeployment` CRs and orchestrates graph deployments.
- `DynamoComponentDeploymentController`: Watches `DynamoComponentDeployment` CRs and handles individual component deployments.
- `DynamoModelController`: Watches `DynamoModel` CRs and manages model lifecycle (e.g., loading LoRA adapters).
- **Workflow:**
1. A custom resource is created by the user or API server.
2. The corresponding controller detects the change and runs reconciliation.
3. Kubernetes resources (Deployments, Services, etc.) are created or updated to match the CR spec.
4. Status fields are updated to reflect the current state.
## Deployment Modes
The Dynamo operator supports three deployment modes to accommodate different cluster environments and use cases:
### 1. Cluster-Wide Mode (Default)
The operator monitors and manages DynamoGraph resources across **all namespaces** in the cluster.
**When to Use:**
- You have full cluster admin access
- You want centralized management of all Dynamo workloads
- Standard production deployment on a dedicated cluster
---
### 2. Namespace-Scoped Mode
The operator monitors and manages DynamoGraph resources **only in a specific namespace**. A lease marker is created to signal the operator's presence to any cluster-wide operators.
**When to Use:**
- You're on a shared/multi-tenant cluster
- You only have namespace-level permissions
- You want to test a new operator version in isolation
- You need to avoid conflicts with other operators
**Installation:**
```bash
helm install dynamo-platform dynamo-platform-${RELEASE_VERSION}.tgz \
--namespace my-namespace \
--create-namespace \
--set dynamo-operator.namespaceRestriction.enabled=true
```
---
### 3. Hybrid Mode
A **cluster-wide operator** manages most namespaces, while **one or more namespace-scoped operators** run in specific namespaces (e.g., for testing new versions). The cluster-wide operator automatically detects and excludes namespaces with namespace-scoped operators using lease markers.
**When to Use:**
- Running production workloads with a stable operator version
- Testing new operator versions in isolated namespaces without affecting production
- Gradual rollout of operator updates
- Development/staging environments on production clusters
**How It Works:**
1. Namespace-scoped operator creates a lease named `dynamo-operator-namespace-scope` in its namespace
2. Cluster-wide operator watches for these lease markers across all namespaces
3. Cluster-wide operator automatically excludes any namespace with a lease marker
4. If namespace-scoped operator stops, its lease expires (TTL: 30s by default)
5. Cluster-wide operator automatically resumes managing that namespace
**Setup Example:**
```bash
# 1. Install cluster-wide operator (production, v1.0.0)
helm install dynamo-platform dynamo-platform-${RELEASE_VERSION}.tgz \
--namespace dynamo-system \
--create-namespace
# 2. Install namespace-scoped operator (testing, v2.0.0-beta)
helm install dynamo-test dynamo-platform-${RELEASE_VERSION}.tgz \
--namespace test-namespace \
--create-namespace \
--set dynamo-operator.namespaceRestriction.enabled=true \
--set dynamo-operator.controllerManager.manager.image.tag=v2.0.0-beta
```
**Observability:**
```bash
# List all namespaces with local operators
kubectl get lease -A --field-selector metadata.name=dynamo-operator-namespace-scope
# Check which operator version is running in a namespace
kubectl get lease -n my-namespace dynamo-operator-namespace-scope \
-o jsonpath='{.spec.holderIdentity}'
```
## Custom Resource Definitions (CRDs)
Dynamo provides the following Custom Resources:
- **DynamoGraphDeployment (DGD)**: Deploys complete inference pipelines
- **DynamoComponentDeployment (DCD)**: Deploys individual components
- **DynamoModel**: Manages model lifecycle (e.g., loading LoRA adapters)
For the complete technical API reference for Dynamo Custom Resource Definitions, see:
**📖 [Dynamo CRD API Reference](api-reference.md)**
For a user-focused guide on deploying and managing models with DynamoModel, see:
**📖 [Managing Models with DynamoModel Guide](deployment/dynamomodel-guide.md)**
## Webhooks
The Dynamo Operator uses **Kubernetes admission webhooks** for real-time validation of custom resources before they are persisted to the cluster. Webhooks are **enabled by default** and ensure that invalid configurations are rejected immediately at the API server level.
**Key Features:**
- ✅ Shared certificate infrastructure across all webhook types
- ✅ Automatic certificate generation (for testing/development)
- ✅ cert-manager integration (for production)
- ✅ Multi-operator support with lease-based coordination
- ✅ Immutability enforcement for critical fields
For complete documentation on webhooks, certificate management, and troubleshooting, see:
**📖 [Webhooks Guide](webhooks.md)**
## Installation
### Quick Install with Helm
```bash
# Set environment
export NAMESPACE=dynamo-system
export RELEASE_VERSION=0.x.x # any version of Dynamo 0.3.2+ listed at https://github.com/ai-dynamo/dynamo/releases
# Install Platform (includes operator)
helm fetch https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-platform-${RELEASE_VERSION}.tgz
helm install dynamo-platform dynamo-platform-${RELEASE_VERSION}.tgz --namespace ${NAMESPACE} --create-namespace
```
> **Note:** For shared/multi-tenant clusters or testing scenarios, see [Deployment Modes](#deployment-modes) above for namespace-scoped and hybrid configurations.
### Building from Source
```bash
# Set environment
export NAMESPACE=dynamo-system
export DOCKER_SERVER=your-registry.com # your container registry (no trailing slash)
export IMAGE_TAG=latest
# Build operator image
cd deploy/operator
docker build -t $DOCKER_SERVER/dynamo-operator:$IMAGE_TAG .
docker push $DOCKER_SERVER/dynamo-operator:$IMAGE_TAG
cd -
# Install CRDs
cd deploy/helm/charts
helm install dynamo-crds ./crds/ --namespace default
# Install platform with custom operator image
helm install dynamo-platform ./platform/ \
--namespace ${NAMESPACE} \
--create-namespace \
--set "dynamo-operator.controllerManager.manager.image.repository=${DOCKER_SERVER}/dynamo-operator" \
--set "dynamo-operator.controllerManager.manager.image.tag=${IMAGE_TAG}"
```
For detailed installation options, see the [Installation Guide](installation-guide.md)
## Development
- **Code Structure:**
The operator is built using Kubebuilder and the operator-sdk, with the following structure:
- `controllers/`: Reconciliation logic
- `api/v1alpha1/`: CRD types
- `config/`: Manifests and Helm charts
## References
- [Kubernetes Operator Pattern](https://kubernetes.io/docs/concepts/extend-kubernetes/operator/)
- [Custom Resource Definitions](https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/)
- [Operator SDK](https://sdk.operatorframework.io/)
- [Helm Best Practices for CRDs](https://helm.sh/docs/chart_best_practices/custom_resource_definitions/)
---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: "GitOps Deployment with FluxCD"
---
This section describes how to use FluxCD for GitOps-based deployment of Dynamo inference graphs. GitOps enables you to manage your Dynamo deployments declaratively using Git as the source of truth. We'll use the [aggregated vLLM example](../backends/vllm/README.md) to demonstrate the workflow.
## Prerequisites
- A Kubernetes cluster with [Dynamo Kubernetes Platform](installation-guide.md) installed
- [FluxCD](https://fluxcd.io/flux/installation/) installed in your cluster
- A Git repository to store your deployment configurations
## Workflow Overview
The GitOps workflow for Dynamo deployments consists of three main steps:
1. Build and push the Dynamo Operator
2. Create and commit a DynamoGraphDeployment custom resource for initial deployment
3. Update the graph by building a new version and updating the CR for subsequent updates
## Step 1: Build and Push Dynamo Operator
First, follow the instructions in [Install Dynamo Kubernetes Platform](installation-guide.md).
## Step 2: Create Initial Deployment
Create a new file in your Git repository (e.g., `deployments/llm-agg.yaml`) with the following content:
```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: llm-agg
spec:
  pvcs:
    - name: vllm-model-storage
      size: 100Gi
  services:
    Frontend:
      replicas: 1
      envs:
        - name: SPECIFIC_ENV_VAR
          value: some_specific_value
    Processor:
      replicas: 1
      envs:
        - name: SPECIFIC_ENV_VAR
          value: some_specific_value
    VllmWorker:
      replicas: 1
      envs:
        - name: SPECIFIC_ENV_VAR
          value: some_specific_value
      # Add PVC for model storage
      volumeMounts:
        - name: vllm-model-storage
          mountPoint: /models
```
Commit and push this file to your Git repository. FluxCD will detect the new CR and create the initial Dynamo deployment in your cluster.
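For FluxCD to detect the file, it must already be watching the repository. If it is not, the sketch below shows one way to register the repository with the `flux` CLI. The object name `dynamo-deployments`, the `main` branch, the `./deployments` path, and the function name `setup_flux_sync` are illustrative assumptions, not part of Dynamo; adjust them to your repository layout.

```bash
# Hypothetical helper: point Flux at the Git repository that holds the CRs.
# Usage: setup_flux_sync <repo-url> <target-namespace>
setup_flux_sync() {
  repo_url="$1"   # Git repository that stores your deployment configurations
  target_ns="$2"  # namespace in which the DynamoGraphDeployment is created

  # Register the Git repository as a Flux source (branch name is an assumption).
  flux create source git dynamo-deployments \
    --url="$repo_url" \
    --branch=main \
    --interval=1m

  # Apply every manifest under ./deployments and keep the cluster in sync with it.
  flux create kustomization dynamo-deployments \
    --source=GitRepository/dynamo-deployments \
    --path=./deployments \
    --prune=true \
    --target-namespace="$target_ns" \
    --interval=5m
}
```

Call it once per cluster, for example `setup_flux_sync <your-repo-url> <namespace>`. Private repositories need additional authentication flags that are not shown here.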
## Step 3: Update Existing Deployment
To update your pipeline, update the associated DynamoGraphDeployment CR in your Git repository.
The Dynamo operator will automatically reconcile the change.
## Monitoring the Deployment
You can monitor the deployment status using:
```bash
export NAMESPACE=<namespace-with-the-dynamo-operator>
# Check the DynamoGraphDeployment status
kubectl get dynamographdeployment llm-agg -n $NAMESPACE
```
---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: "Grove Deployment Guide"
---
Grove is a Kubernetes API specifically designed to address the orchestration challenges of modern AI workloads, particularly disaggregated inference systems. Grove provides seamless integration with NVIDIA Dynamo for comprehensive AI infrastructure management.
## Overview
Grove was originally motivated by the challenges of orchestrating multinode, disaggregated inference systems. It provides a consistent and unified API that allows users to define, configure, and scale prefill, decode, and any other components like routing within a single custom resource.
### How Grove Works for Disaggregated Serving
Grove enables disaggregated serving by breaking down large language model inference into separate, specialized components that can be independently scaled and managed. This architecture provides several advantages:
- **Component Specialization**: Separate prefill, decode, and routing components optimized for their specific tasks
- **Independent Scaling**: Each component can scale based on its individual resource requirements and workload patterns
- **Resource Optimization**: Better utilization of hardware resources through specialized workload placement
- **Fault Isolation**: Issues in one component don't necessarily affect others
## Core Components and API Resources
Grove implements disaggregated serving through several custom Kubernetes resources that provide declarative composition of role-based pod groups:
### PodCliqueSet
The top-level Grove object that defines a group of components managed and colocated together. Key features include:
- Support for autoscaling
- Topology-aware spread of replicas for availability
- Unified management of multiple disaggregated components
### PodClique
Represents a group of pods with a specific role (e.g., leader, worker, frontend). Each clique features:
- Independent configuration options
- Custom scaling logic support
- Role-specific resource allocation
### PodCliqueScalingGroup
A set of PodCliques that scale and are scheduled together, ideal for tightly coupled roles like prefill leader and worker components that need coordinated scaling behavior.
## Key Capabilities for Disaggregated Serving
Grove provides several specialized features that make it particularly well-suited for disaggregated serving:
### Flexible Gang Scheduling
PodCliques and PodCliqueScalingGroups allow users to specify flexible gang-scheduling requirements at multiple levels within a PodCliqueSet to prevent resource deadlocks and ensure all components of a disaggregated system start together.
### Multi-level Horizontal Auto-Scaling
Supports pluggable horizontal auto-scaling solutions to scale PodCliqueSet, PodClique, and PodCliqueScalingGroup custom resources independently based on their specific metrics and requirements.
### Network Topology-Aware Scheduling
Allows specifying network topology pack and spread constraints to optimize for both network performance and service availability, crucial for disaggregated systems where components need efficient inter-node communication.
### Custom Startup Dependencies
Prescribes the order in which PodCliques must start in a declarative specification, with pod startup decoupled from pod creation or scheduling. This ensures proper initialization order for disaggregated components.
## Use Cases and Examples
Grove specifically supports:
- **Multi-node disaggregated inference** for large models such as DeepSeek-R1 and Llama-4-Maverick
- **Single-node disaggregated inference** for optimized resource utilization
- **Agentic pipelines of models** for complex AI workflows
- **Standard aggregated serving** patterns for single node or single GPU inference
## Integration with NVIDIA Dynamo
Grove is strategically aligned with NVIDIA Dynamo for seamless integration within the AI infrastructure stack:
### Complementary Roles
- **Grove**: Handles the Kubernetes orchestration layer for disaggregated AI workloads
- **Dynamo**: Provides comprehensive AI infrastructure capabilities including serving backends, routing, and resource management
### Release Coordination
Grove is aligning its release schedule with NVIDIA Dynamo to ensure seamless integration, with the finalized release cadence reflected in the project roadmap.
### Unified AI Platform
The integration creates a comprehensive platform where:
- Grove manages complex orchestration of disaggregated components
- Dynamo provides the serving infrastructure, routing capabilities, and backend integrations
- Together they enable sophisticated AI serving architectures with simplified management
## Architecture Benefits
Grove represents a significant advancement in Kubernetes-based orchestration for AI workloads by:
1. **Simplifying Complex Deployments**: Provides a unified API that can manage multiple components (prefill, decode, routing) within a single resource definition
2. **Enabling Sophisticated Architectures**: Supports advanced disaggregated inference patterns that were previously difficult to orchestrate
3. **Reducing Operational Complexity**: Abstracts away the complexity of coordinating multiple interdependent AI components
4. **Optimizing Resource Utilization**: Enables fine-grained control over component placement and scaling
## Getting Started
Grove relies on KAI Scheduler for resource allocation and scheduling.
For KAI Scheduler, see the [KAI Scheduler Deployment Guide](https://github.com/NVIDIA/KAI-Scheduler).
For installation instructions, see the [Grove Installation Guide](https://github.com/NVIDIA/grove/blob/main/docs/installation.md).
For practical examples of Grove-based multinode deployments in action, see the [Multinode Deployment Guide](deployment/multinode-deployment.md), which demonstrates multi-node disaggregated serving scenarios.
For the latest updates on Grove, refer to the [official project on GitHub](https://github.com/NVIDIA/grove).
The Dynamo Kubernetes Platform can also install Grove and KAI Scheduler as part of the platform installation. See the [Dynamo Kubernetes Platform Installation Guide](installation-guide.md) for more details.
---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: "Installation Guide for Dynamo Kubernetes Platform"
---
Deploy and manage Dynamo inference graphs on Kubernetes with automated orchestration and scaling, using the Dynamo Kubernetes Platform.
## Before You Start
Determine your cluster environment:
**Shared/Multi-Tenant Cluster** (K8s cluster with existing Dynamo artifacts):
- CRDs already installed cluster-wide - skip CRD installation step
- A cluster-wide Dynamo operator is likely already running
- **Do NOT install another operator** - use the existing cluster-wide operator
- Only install a namespace-restricted operator if you specifically need to prevent the cluster-wide operator from managing your namespace (e.g., testing operator features you're developing)
**Dedicated Cluster** (full cluster admin access):
- You install CRDs yourself
- Can use cluster-wide operator (default)
**Local Development** (Minikube, testing):
- See [Minikube Setup](deployment/minikube-setup.md) first, then follow installation steps below
To check if CRDs already exist:
```bash
kubectl get crd | grep dynamo
# If you see dynamographdeployments, dynamocomponentdeployments, etc., CRDs are already installed
```
To check if a cluster-wide operator already exists:
```bash
# Check for cluster-wide operator and show its namespace
kubectl get clusterrolebinding -o json | \
jq -r '.items[] | select(.metadata.name | contains("dynamo-operator-manager")) |
"Cluster-wide operator found in namespace: \(.subjects[0].namespace)"'
# If a cluster-wide operator exists: Do NOT install another operator
# Only install namespace-restricted mode if you specifically need namespace isolation
```
## Installation Paths
The platform is installed using the Dynamo Kubernetes Platform [Helm chart](https://github.com/ai-dynamo/dynamo/tree/main/deploy/helm/charts/platform/README.md).
**Path A: Pre-built Artifacts**
- Use case: Production deployment, shared or dedicated clusters
- Source: NGC published Helm charts
- Time: ~10 minutes
- Jump to: [Path A](#path-a-production-install)
**Path B: Custom Build from Source**
- Use case: Contributing to Dynamo, using latest features from main branch, customization
- Requirements: Docker build environment
- Time: ~30 minutes
- Jump to: [Path B](#path-b-custom-build-from-source)
Any value used by the helm install commands can be overridden by passing in your own values file:
```bash
helm install ...
-f your-values.yaml
```
and/or setting values as flags to the helm install command, as follows:
```bash
helm install ...
--set "your-value=your-value"
```
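As a concrete illustration of the two override styles, the sketch below writes a values file that is equivalent to the flag `--set dynamo-operator.namespaceRestriction.enabled=true` used elsewhere in this guide. The file name `your-values.yaml` matches the placeholder above. Each dot-separated segment of a `--set` path becomes one level of YAML nesting.

```bash
# Write a values file equivalent to:
#   --set dynamo-operator.namespaceRestriction.enabled=true
cat > your-values.yaml <<'EOF'
dynamo-operator:
  namespaceRestriction:
    enabled: true
EOF

# Then pass it to helm install (chart arguments elided, as in the examples above):
#   helm install ... -f your-values.yaml
```

A values file is usually easier to review and keep in version control than a long list of `--set` flags. When both are given, values passed with `--set` take precedence over those from `-f`.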
## Prerequisites
Before installing the Dynamo Kubernetes Platform, ensure you have the following tools and access:
### Required Tools
| Tool | Minimum Version | Description | Installation |
|------|-----------------|-------------|--------------|
| **kubectl** | v1.24+ | Kubernetes command-line tool | [Install kubectl](https://kubernetes.io/docs/tasks/tools/#kubectl) |
| **Helm** | v3.0+ | Kubernetes package manager | [Install Helm](https://helm.sh/docs/intro/install/) |
| **Docker** | Latest | Container runtime (Path B only) | [Install Docker](https://docs.docker.com/get-docker/) |
### Cluster and Access Requirements
- **Kubernetes cluster v1.24+** with admin or namespace-scoped access
- **Cluster type determined** (shared vs dedicated) — see [Before You Start](#before-you-start)
- **CRD status checked** if on a shared cluster
- **NGC credentials** (optional) — required only if pulling NVIDIA images from NGC
### Verify Tool Installation
Run the following to confirm your tools are correctly installed:
```bash
# Verify tools and versions
kubectl version --client # Should show v1.24+
helm version # Should show v3.0+
docker version # Required for Path B only
# Set your release version
export RELEASE_VERSION=0.x.x # any version of Dynamo 0.3.2+ listed at https://github.com/ai-dynamo/dynamo/releases
```
### Pre-Deployment Checks
Before proceeding, run the pre-deployment check script to verify your cluster meets all requirements:
```bash
./deploy/pre-deployment/pre-deployment-check.sh
```
This script validates kubectl connectivity, default StorageClass configuration, and GPU node availability. See [Pre-Deployment Checks](https://github.com/ai-dynamo/dynamo/tree/main/deploy/pre-deployment/README.md) for details.
> **No cluster?** See [Minikube Setup](deployment/minikube-setup.md) for local development.
**Estimated installation time:** 5-30 minutes depending on path
## Path A: Production Install
Install from [NGC published artifacts](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/collections/ai-dynamo/artifacts).
```bash
# 1. Set environment
export NAMESPACE=dynamo-system
export RELEASE_VERSION=0.x.x # any version of Dynamo 0.3.2+ listed at https://github.com/ai-dynamo/dynamo/releases
# 2. Install CRDs (skip if on shared cluster where CRDs already exist)
helm fetch https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-crds-${RELEASE_VERSION}.tgz
helm install dynamo-crds dynamo-crds-${RELEASE_VERSION}.tgz --namespace default
# 3. Install Platform
helm fetch https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-platform-${RELEASE_VERSION}.tgz
helm install dynamo-platform dynamo-platform-${RELEASE_VERSION}.tgz --namespace ${NAMESPACE} --create-namespace
```
**For Shared/Multi-Tenant Clusters:**
If your cluster has namespace-restricted Dynamo operators, you MUST add namespace restriction to your installation:
```bash
# Add this flag to the helm install command above
--set dynamo-operator.namespaceRestriction.enabled=true
```
Note: Use the full path `dynamo-operator.namespaceRestriction.enabled=true` (not just `namespaceRestriction.enabled=true`).
If you see this validation error, you need namespace restriction:
```
VALIDATION ERROR: Cannot install cluster-wide Dynamo operator.
Found existing namespace-restricted Dynamo operators in namespaces: ...
```
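To make the placement of the flag explicit, the sketch below assembles the Path A platform install command with and without namespace restriction. It only prints the command and does not run it. The helper name `platform_install_cmd` and its `restricted` argument are hypothetical and exist only for this illustration.

```bash
# Print (rather than run) the Path A platform install command.
# Usage: platform_install_cmd <namespace> <release-version> [restricted]
platform_install_cmd() {
  cmd="helm install dynamo-platform dynamo-platform-$2.tgz --namespace $1 --create-namespace"
  # Append the namespace restriction flag only when requested.
  if [ "${3:-}" = "restricted" ]; then
    cmd="$cmd --set dynamo-operator.namespaceRestriction.enabled=true"
  fi
  printf '%s\n' "$cmd"
}

# Example: command for a shared cluster with namespace-restricted operators.
platform_install_cmd dynamo-system 0.x.x restricted
```

Review the printed command, replace `0.x.x` with your release version, and run it once it matches your environment.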
<Tip>
For multinode deployments, you need to install multinode orchestration components:
**Option 1 (Recommended): Grove + KAI Scheduler**
- Grove and KAI Scheduler can be installed manually or through the dynamo-platform helm install command.
- When using the dynamo-platform helm install command, Grove and KAI Scheduler are NOT installed by default. You can enable their installation by setting the following flags:
```bash
--set "grove.enabled=true"
--set "kai-scheduler.enabled=true"
```
**Option 2: LeaderWorkerSet (LWS) + Volcano**
- If using LWS for multinode deployments, you must also install Volcano (required dependency):
- [LWS Installation](https://github.com/kubernetes-sigs/lws#installation)
- [Volcano Installation](https://volcano.sh/en/docs/installation/) (required for gang scheduling with LWS)
- These must be installed manually before deploying multinode workloads with LWS.
See the [Multinode Deployment Guide](deployment/multinode-deployment.md) for details on orchestrator selection.
</Tip>
<Tip>
By default, Model Express Server is not used.
If you wish to use an existing Model Express Server, set `dynamo-operator.modelExpressURL` to the existing server's URL in the helm install command:
</Tip>
```bash
--set "dynamo-operator.modelExpressURL=http://model-express-server.model-express.svc.cluster.local:8080"
```
<Tip>
By default, Dynamo Operator is installed cluster-wide and will monitor all namespaces.
If you wish to restrict the operator to monitor only a specific namespace (the helm release namespace by default), set `dynamo-operator.namespaceRestriction.enabled` to `true`.
You can also change the restricted namespace by setting the `dynamo-operator.namespaceRestriction.targetNamespace` property.
</Tip>
```bash
--set "dynamo-operator.namespaceRestriction.enabled=true"
--set "dynamo-operator.namespaceRestriction.targetNamespace=dynamo-namespace" # optional
```
[Verify Installation](#verify-installation)
## Path B: Custom Build from Source
Build and deploy from source for customization, contributing to Dynamo, or using the latest features from the main branch.
Note: This gives you access to the latest unreleased features and fixes on the main branch.
```bash
# 1. Set environment
export NAMESPACE=dynamo-system
export DOCKER_SERVER=nvcr.io/nvidia/ai-dynamo # or your registry (no trailing slash)
export DOCKER_USERNAME='$oauthtoken'
export DOCKER_PASSWORD=<YOUR_NGC_CLI_API_KEY>
export IMAGE_TAG=${RELEASE_VERSION}
# 2. Build operator
cd deploy/operator
# 2.1 Alternative 1 : Build and push the operator image for multiple platforms
docker buildx create --name multiplatform --driver docker-container --bootstrap
docker buildx use multiplatform
docker buildx build --platform linux/amd64,linux/arm64 -t $DOCKER_SERVER/dynamo-operator:$IMAGE_TAG --push .
# 2.2 Alternative 2 : Build and push the operator image for a single platform
docker build -t $DOCKER_SERVER/dynamo-operator:$IMAGE_TAG . && docker push $DOCKER_SERVER/dynamo-operator:$IMAGE_TAG
cd -
# 3. Create namespace and secrets to be able to pull the operator image (only needed if you pushed the operator image to a private registry)
kubectl create namespace ${NAMESPACE}
kubectl create secret docker-registry docker-imagepullsecret \
--docker-server=${DOCKER_SERVER} \
--docker-username=${DOCKER_USERNAME} \
--docker-password=${DOCKER_PASSWORD} \
--namespace=${NAMESPACE}
cd deploy/helm/charts
# 4. Install CRDs
helm upgrade --install dynamo-crds ./crds/ --namespace default
# 5. Install Platform
helm dep build ./platform/
# To install cluster-wide instead, set NS_RESTRICT_FLAGS="" (empty) or omit that line entirely.
NS_RESTRICT_FLAGS="--set dynamo-operator.namespaceRestriction.enabled=true"
helm install dynamo-platform ./platform/ \
--namespace "${NAMESPACE}" \
--set "dynamo-operator.controllerManager.manager.image.repository=${DOCKER_SERVER}/dynamo-operator" \
--set "dynamo-operator.controllerManager.manager.image.tag=${IMAGE_TAG}" \
--set "dynamo-operator.imagePullSecrets[0].name=docker-imagepullsecret" \
${NS_RESTRICT_FLAGS}
```
[Verify Installation](#verify-installation)
## Verify Installation
```bash
# Check CRDs
kubectl get crd | grep dynamo
# Check operator and platform pods
kubectl get pods -n ${NAMESPACE}
# Expected: dynamo-operator-* and etcd-* and nats-* pods Running
```
## Next Steps
1. **Deploy Model/Workflow**
```bash
# Example: Deploy a vLLM workflow with Qwen3-0.6B using aggregated serving
kubectl apply -f examples/backends/vllm/deploy/agg.yaml -n ${NAMESPACE}
# Port forward and test
kubectl port-forward svc/agg-vllm-frontend 8000:8000 -n ${NAMESPACE}
curl http://localhost:8000/v1/models
```
2. **Explore Backend Guides**
- [vLLM Deployments](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/vllm/deploy/README.md)
- [SGLang Deployments](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/sglang/deploy/README.md)
- [TensorRT-LLM Deployments](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/trtllm/deploy/README.md)
3. **Optional:**
- [Set up Prometheus & Grafana](observability/metrics.md)
- [SLA Planner Quickstart Guide](../planner/sla-planner-quickstart.md) (for SLA-aware scheduling and autoscaling)
## Troubleshooting
**"VALIDATION ERROR: Cannot install cluster-wide Dynamo operator"**
```
VALIDATION ERROR: Cannot install cluster-wide Dynamo operator.
Found existing namespace-restricted Dynamo operators in namespaces: ...
```
Cause: Attempting cluster-wide install on a shared cluster with existing namespace-restricted operators.
Solution: Add namespace restriction to your installation:
```bash
--set dynamo-operator.namespaceRestriction.enabled=true
```
Note: Use the full path `dynamo-operator.namespaceRestriction.enabled=true` (not just `namespaceRestriction.enabled=true`).
**CRDs already exist**
Cause: Installing CRDs on a cluster where they're already present (common on shared clusters).
Solution: Skip step 2 (CRD installation) and proceed directly to platform installation.
To check if CRDs exist:
```bash
kubectl get crd | grep dynamo
```
**Pods not starting?**
```bash
kubectl describe pod <pod-name> -n ${NAMESPACE}
kubectl logs <pod-name> -n ${NAMESPACE}
```
**HuggingFace model access?**
```bash
kubectl create secret generic hf-token-secret \
--from-literal=HF_TOKEN=${HF_TOKEN} \
-n ${NAMESPACE}
```
**Bitnami etcd "unrecognized" image?**
```
ERROR: Original containers have been substituted for unrecognized ones. Deploying this chart with non-standard containers is likely to cause degraded security and performance, broken chart features, and missing environment variables.
```
You might encounter this error during helm install because Bitnami changed its Docker repository to a [secure one](https://github.com/bitnami/charts/tree/main/bitnami/etcd#%EF%B8%8F-important-notice-upcoming-changes-to-the-bitnami-catalog).
To work around it, add the following to the helm install command:
```bash
--set "etcd.image.repository=bitnamilegacy/etcd" --set "etcd.global.security.allowInsecureImages=true"
```
**Clean uninstall?**
To uninstall the platform, you can run the following command:
```bash
helm uninstall dynamo-platform --namespace ${NAMESPACE}
```
To uninstall the CRDs, follow these steps:
Get all of the dynamo CRDs installed in your cluster:
```bash
kubectl get crd | grep "dynamo.*nvidia.com"
```
You should see something like this:
```
dynamocomponentdeployments.nvidia.com 2025-10-21T14:49:52Z
dynamocomponents.nvidia.com 2025-10-25T05:16:10Z
dynamographdeploymentrequests.nvidia.com 2025-11-24T05:26:04Z
dynamographdeployments.nvidia.com 2025-09-04T20:56:40Z
dynamographdeploymentscalingadapters.nvidia.com 2025-12-09T21:05:59Z
dynamomodels.nvidia.com 2025-11-07T00:19:43Z
```
Delete each CRD one by one:
```bash
kubectl delete crd <crd-name>
```
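If you prefer not to delete the CRDs by hand, the loop can be scripted. The following is a minimal sketch, assuming every Dynamo CRD name starts with `dynamo` and ends with `.nvidia.com`, as in the listing above. The function names `filter_dynamo_crds` and `delete_dynamo_crds` are hypothetical helpers.

```bash
# Keep only Dynamo CRDs from `kubectl get crd -o name` output read on stdin.
filter_dynamo_crds() {
  grep -E '/dynamo[a-z]*\.nvidia\.com$'
}

# Delete every Dynamo CRD, one at a time.
# Note: deleting a CRD also deletes all custom resources of that kind.
delete_dynamo_crds() {
  kubectl get crd -o name | filter_dynamo_crds | while read -r crd; do
    kubectl delete "$crd"
  done
}
```

Run `kubectl get crd -o name | filter_dynamo_crds` first to review what would be removed, then call `delete_dynamo_crds` once no remaining workloads depend on those resources.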
## Advanced Options
- [Helm Chart Configuration](https://github.com/ai-dynamo/dynamo/tree/main/deploy/helm/charts/platform/README.md)
- [Create custom deployments](deployment/create-deployment.md)
- [Dynamo Operator details](dynamo-operator.md)
- [Model Express Server details](https://github.com/ai-dynamo/modelexpress)
---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: "Model Caching with Fluid: Cloud-Native Data Orchestration and Acceleration"
---
Fluid is an open-source, cloud-native data orchestration and acceleration platform for Kubernetes. It virtualizes and accelerates data access from various sources (object storage, distributed file systems, cloud storage), making it ideal for AI, machine learning, and big data workloads.
## Key Features
- **Data Caching and Acceleration:** Cache remote data close to compute workloads for faster access.
- **Unified Data Access:** Access data from S3, HDFS, NFS, and more through a single interface.
- **Kubernetes Native:** Integrates with Kubernetes using CRDs for data management.
- **Scalability:** Supports large-scale data and compute clusters.
## Installation
You can install Fluid on any Kubernetes cluster using Helm.
**Prerequisites:**
- Kubernetes >= 1.18
- `kubectl` >= 1.18
- `Helm` >= 3.5
**Quick Install:**
```sh
kubectl create ns fluid-system
helm repo add fluid https://fluid-cloudnative.github.io/charts
helm repo update
helm install fluid fluid/fluid -n fluid-system
```
For advanced configuration, see the [Fluid Installation Guide](https://fluid-cloudnative.github.io/docs/get-started/installation).
## Pre-deployment Steps
1. Install Fluid (see [Installation](#installation)).
2. Create a Dataset and Runtime (see [the following example](#webufs-example)).
3. Mount the resulting PVC in your workload.
## Mounting Data Sources
### WebUFS Example
WebUFS allows mounting HTTP/HTTPS sources as filesystems.
```yaml
# Mount a public HTTP directory as a Fluid Dataset
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: webufs-model
spec:
  mounts:
    - mountPoint: https://myhost.org/path_to_my_model # Replace with your HTTP source
      name: webufs-model
---
apiVersion: data.fluid.io/v1alpha1
kind: AlluxioRuntime
metadata:
  name: webufs-model
spec:
  replicas: 2
  tieredstore:
    levels:
      - mediumtype: MEM
        path: /dev/shm
        quota: 2Gi
        high: "0.95"
        low: "0.7"
```
After applying, Fluid creates a PersistentVolumeClaim (PVC) named `webufs-model` containing the files.
### S3 Example
Mount an S3 bucket as a Fluid Dataset.
```yaml
# Mount an S3 bucket as a Fluid Dataset
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: s3-model
spec:
  mounts:
    - mountPoint: s3://<your-bucket> # Replace with your bucket name
      options:
        alluxio.underfs.s3.endpoint: http://minio:9000 # S3 endpoint (e.g., MinIO)
        alluxio.underfs.s3.disable.dns.buckets: "true"
        aws.secretKey: "<your-secret>"
        aws.accessKeyId: "<your-access-key>"
---
apiVersion: data.fluid.io/v1alpha1
kind: AlluxioRuntime
metadata:
  name: s3-model
spec:
  replicas: 1
  tieredstore:
    levels:
      - mediumtype: MEM
        path: /dev/shm
        quota: 1Gi
        high: "0.95"
        low: "0.7"
---
apiVersion: data.fluid.io/v1alpha1
kind: DataLoad
metadata:
  name: s3-model-loader
spec:
  dataset:
    name: s3-model
    namespace: <your-namespace> # Replace with your namespace
  loadMetadata: true
  target:
    - path: "/"
      replicas: 1
```
The resulting PVC is named `s3-model`.
## Using HuggingFace Models with Fluid
**Limitations:**
- HuggingFace models are not exposed as simple filesystems or buckets.
- No native integration exists between Fluid and the HuggingFace Hub API.
**Workaround: Download and Upload to S3/MinIO**
1. Download the model using the HuggingFace CLI or SDK.
2. Upload the model files to a supported storage backend (S3, GCS, NFS).
3. Mount that backend using Fluid.
**Example Pod to Download and Upload:**
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: download-hf-to-minio
spec:
  restartPolicy: Never
  containers:
    - name: downloader
      image: python:3.10-slim
      command: ["sh", "-c"]
      args:
        - |
          set -eux
          pip install --no-cache-dir huggingface_hub awscli
          BUCKET_NAME=hf-models
          ENDPOINT_URL=http://minio:9000
          MODEL_NAME=deepseek-ai/DeepSeek-R1-Distill-Llama-8B
          LOCAL_DIR=/tmp/model
          if ! aws --endpoint-url $ENDPOINT_URL s3 ls "s3://$BUCKET_NAME" > /dev/null 2>&1; then
            aws --endpoint-url $ENDPOINT_URL s3 mb "s3://$BUCKET_NAME"
          fi
          huggingface-cli download $MODEL_NAME --local-dir $LOCAL_DIR --local-dir-use-symlinks False
          aws --endpoint-url $ENDPOINT_URL s3 cp $LOCAL_DIR s3://$BUCKET_NAME/$MODEL_NAME --recursive
      env:
        - name: AWS_ACCESS_KEY_ID
          value: "<your-access-key>"
        - name: AWS_SECRET_ACCESS_KEY
          value: "<your-secret>"
      volumeMounts:
        - name: tmp-volume
          mountPath: /tmp/model
  volumes:
    - name: tmp-volume
      emptyDir: {}
```
You can then use `s3://hf-models/deepseek-ai/DeepSeek-R1-Distill-Llama-8B` as your Dataset mount.
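As a minimal sketch, the mount point can be derived from the same `BUCKET_NAME` and `MODEL_NAME` values used in the download Pod, which keeps the upload path and the Dataset `mountPoint` in sync. The `dataset_mount_point` helper is hypothetical and not part of Fluid or Dynamo.

```bash
# Values mirror the download Pod above; adjust them to your environment.
BUCKET_NAME=hf-models
MODEL_NAME=deepseek-ai/DeepSeek-R1-Distill-Llama-8B

# dataset_mount_point is a hypothetical helper, not part of Fluid or Dynamo.
# $1 = bucket name, $2 = model name (used as the object prefix by the upload step)
dataset_mount_point() {
  printf 's3://%s/%s\n' "$1" "$2"
}

MOUNT_POINT=$(dataset_mount_point "$BUCKET_NAME" "$MODEL_NAME")
echo "$MOUNT_POINT"
```

The printed value is what goes into `spec.mounts[].mountPoint` of the Dataset.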
## Usage with Dynamo
Mount the Fluid-generated PVC in your DynamoGraphDeployment:
```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: model-caching
spec:
  pvcs:
    - name: s3-model
  envs:
    - name: HF_HOME
      value: /model
    - name: DYN_DEPLOYMENT_CONFIG
      value: '{"Common": {"model": "/model", ...}}'
  services:
    VllmWorker:
      volumeMounts:
        - name: s3-model
          mountPoint: /model
    Processor:
      volumeMounts:
        - name: s3-model
          mountPoint: /model
```
## Full example with llama3.3 70B
### Performance
When deploying Llama 3.3 70B with Fluid as the caching layer, we observed the best performance with a single-node cache that holds 100% of the model files locally. Scheduling the vLLM worker pod on the same node as the Fluid cache eliminated network I/O bottlenecks, which gave the fastest model startup time and the highest inference efficiency in our tests.
| Cache Configuration | vLLM Pod Placement | Startup Time |
|----------------------------------------------|----------------------------------|-----------------|
| ❌ No Cache (Download from HuggingFace) | N/A | ~9 minutes |
| 🟡 Multi-Node Cache (100% Model Cached) | Not on Cache Node | ~18 minutes |
| 🟡 Multi-Node Cache (100% Model Cached) | On Cache Node | ~10 minutes |
| ✅ Single-Node Cache (100% Model Cached) | On Cache Node | ~80 seconds |
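To put the table in relative terms, the sketch below converts the approximate startup times to seconds and computes how much faster the single-node cache is than each alternative. The `speedup` helper is hypothetical, and because the inputs are approximate the ratios are too.

```bash
# Startup times from the table above, converted to seconds.
NO_CACHE=$((9 * 60))
MULTI_NODE_REMOTE=$((18 * 60))   # worker not on a cache node
MULTI_NODE_LOCAL=$((10 * 60))    # worker on a cache node
SINGLE_NODE_LOCAL=80

# speedup is a hypothetical helper: prints baseline / best with two decimals.
speedup() {
  awk -v base="$1" -v best="$2" 'BEGIN { printf "%.2f\n", base / best }'
}

echo "vs no cache:                 $(speedup "$NO_CACHE" "$SINGLE_NODE_LOCAL")x"
echo "vs multi-node, remote pod:   $(speedup "$MULTI_NODE_REMOTE" "$SINGLE_NODE_LOCAL")x"
echo "vs multi-node, local pod:    $(speedup "$MULTI_NODE_LOCAL" "$SINGLE_NODE_LOCAL")x"
```

On these numbers the single-node cache starts roughly 7x faster than downloading from HuggingFace, while a multi-node cache was no faster than no cache at all.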
### Resources
```yaml
# dataset.yaml
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: llama-3-3-70b-instruct-model
  namespace: my-namespace
spec:
  mounts:
    - mountPoint: s3://hf-models/meta-llama/Llama-3.3-70B-Instruct
      options:
        alluxio.underfs.s3.endpoint: http://minio:9000
        alluxio.underfs.s3.disable.dns.buckets: "true"
        aws.secretKey: "minioadmin"
        aws.accessKeyId: "minioadmin"
        alluxio.underfs.s3.streaming.upload.enabled: "true"
        alluxio.underfs.s3.multipart.upload.threads: "20"
        alluxio.underfs.s3.socket.timeout: "50s"
        alluxio.underfs.s3.request.timeout: "60s"
---
# runtime.yaml
apiVersion: data.fluid.io/v1alpha1
kind: AlluxioRuntime
metadata:
  name: llama-3-3-70b-instruct-model
  namespace: my-namespace
spec:
  replicas: 1
  properties:
    alluxio.user.file.readtype.default: CACHE_PROMOTE
    alluxio.user.file.write.type.default: CACHE_THROUGH
    alluxio.user.block.size.bytes.default: 128MB
  tieredstore:
    levels:
      - mediumtype: MEM
        path: /dev/shm
        quota: 300Gi
        high: "1.0"
        low: "0.7"
---
# DataLoad - Preloads the model into cache
apiVersion: data.fluid.io/v1alpha1
kind: DataLoad
metadata:
  name: llama-3-3-70b-instruct-model-loader
spec:
  dataset:
    name: llama-3-3-70b-instruct-model
    namespace: my-namespace
  loadMetadata: true
  target:
    - path: "/"
      replicas: 1
```
The associated DynamoGraphDeployment uses node affinity to schedule the vLLM worker on the same node as the Alluxio cache worker:
```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: my-hello-world
spec:
  envs:
    - name: DYN_LOG
      value: "debug"
    - name: DYN_DEPLOYMENT_CONFIG
      value: '{"Common": {"model": "/model", "block-size": 64, "max-model-len": 16384},
        "Frontend": {"served_model_name": "meta-llama/Llama-3.3-70B-Instruct", "endpoint":
        "dynamo.Processor.chat/completions", "port": 8000}, "Processor": {"router":
        "round-robin", "router-num-threads": 4, "common-configs": ["model", "block-size",
        "max-model-len"]}, "VllmWorker": {"tensor-parallel-size": 4, "enforce-eager": true, "max-num-batched-tokens":
        16384, "enable-prefix-caching": true, "ServiceArgs": {"workers": 1, "resources":
        {"gpu": "4", "memory": "40Gi"}}, "common-configs": ["model", "block-size", "max-model-len"]},
        "Planner": {"environment": "kubernetes", "no-operation": true}}'
  pvcs:
    - name: llama-3-3-70b-instruct-model
  services:
    Processor:
      volumeMounts:
        - name: llama-3-3-70b-instruct-model
          mountPoint: /model
    VllmWorker:
      volumeMounts:
        - name: llama-3-3-70b-instruct-model
          mountPoint: /model
      extraPodSpec:
        affinity:
          nodeAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              nodeSelectorTerms:
                - matchExpressions:
                    - key: fluid.io/s-alluxio-my-namespace-llama-3-3-70b-instruct-model
                      operator: In
                      values:
                        - "true"
```
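The node label key in the affinity rule appears to follow the pattern `fluid.io/s-alluxio-<namespace>-<dataset-name>`. That pattern is inferred from the example above rather than taken from the Fluid documentation, so treat the sketch below as an assumption to verify against your cluster. The `fluid_cache_node_label` helper is hypothetical.

```bash
# fluid_cache_node_label is a hypothetical helper.
# The key pattern is inferred from the example above, not from a Fluid reference.
# $1 = Kubernetes namespace of the Dataset, $2 = Dataset name
fluid_cache_node_label() {
  printf 'fluid.io/s-alluxio-%s-%s\n' "$1" "$2"
}

LABEL_KEY=$(fluid_cache_node_label my-namespace llama-3-3-70b-instruct-model)
echo "$LABEL_KEY"

# With cluster access, the nodes holding the cache could then be listed with:
#   kubectl get nodes -l "${LABEL_KEY}=true"
```

Listing the labeled nodes before deploying is a quick way to confirm that the affinity rule can be satisfied.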
## Troubleshooting & FAQ
- **PVC not created?** Check Fluid and AlluxioRuntime pod logs.
- **Model not found?** Ensure the model was uploaded to the correct bucket/path.
- **Permission errors?** Verify S3/MinIO credentials and bucket policies.
## Resources
- [Fluid Documentation](https://fluid-cloudnative.github.io/)
- [Alluxio Documentation](https://docs.alluxio.io/)
- [MinIO Documentation](https://docs.min.io/)
- [Hugging Face Hub](https://huggingface.co/docs/hub/index)
- [Dynamo README](https://github.com/ai-dynamo/dynamo/blob/main/.devcontainer/README.md)
- [Dynamo Documentation](https://docs.nvidia.com/dynamo/latest/index.html)
---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: "Log Aggregation in Dynamo on Kubernetes"
---
This guide demonstrates how to set up logging for Dynamo in Kubernetes using Grafana Loki and Alloy. It provides a simple reference configuration that works on Kubernetes clusters, including Minikube and MicroK8s.
<Note>
This setup is intended for development and testing purposes. For production environments, please refer to the official documentation for high-availability configurations.
</Note>
## Components Overview
- **[Grafana Loki](https://grafana.com/oss/loki/)**: Fast and cost-effective Kubernetes-native log aggregation system.
- **[Grafana Alloy](https://grafana.com/oss/alloy/)**: OpenTelemetry collector that replaces Promtail, gathering logs, metrics and traces from Kubernetes pods.
- **[Grafana](https://grafana.com/grafana/)**: Visualization platform for querying and exploring logs.
## Prerequisites
### 1. Dynamo Kubernetes Platform
This guide assumes you have installed Dynamo Kubernetes Platform. For more information, see [Dynamo Kubernetes Platform](../README.md).
### 2. Kube-prometheus
While this guide does not use Prometheus, it assumes Grafana was installed as part of kube-prometheus-stack. For more information, see [kube-prometheus-stack](https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-prometheus-stack).
### 3. Environment Variables
#### Kubernetes Setup Variables
Set the following environment variables:
- `MONITORING_NAMESPACE`: The namespace where Loki is installed
- `DYN_NAMESPACE`: The namespace where Dynamo Kubernetes Platform is installed
```bash
export MONITORING_NAMESPACE=monitoring
export DYN_NAMESPACE=dynamo-system
```
#### Dynamo Logging Variables
| Variable | Description | Example |
|----------|-------------|---------|
| `DYN_LOGGING_JSONL` | Enable JSONL logging format (required for Loki) | `true` |
| `DYN_LOG` | Log levels per target, in the form `<default_level>,<module_path>=<level>,<module_path>=<level>` | `info,dynamo_runtime::system_status_server=trace` |
| `DYN_LOG_USE_LOCAL_TZ` | Use local timezone for timestamps | `true` |
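For a quick local check, the variables can be exported in a shell as in the sketch below. This assumes a local process; in a cluster they would normally be set in the environment of the deployed pods. The `DYN_LOG` value follows the `<default_level>,<module_path>=<level>` form described in the table.

```bash
# Enable JSONL output and per-target log levels for a locally started process.
export DYN_LOGGING_JSONL=true
export DYN_LOG="info,dynamo_runtime::system_status_server=trace"
export DYN_LOG_USE_LOCAL_TZ=true

# Print one DYN_LOG directive per line to make the filter easier to review.
printf '%s\n' "$DYN_LOG" | tr ',' '\n'
```

The first directive is the default level; each later directive overrides the level for one module path.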
## Installation Steps
### 1. Install Loki
First, we'll install Loki in single binary mode, which is ideal for testing and development:
```bash
# Add the Grafana Helm repository
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update
# Install Loki
helm install --values deploy/observability/k8s/logging/values/loki-values.yaml loki grafana/loki -n $MONITORING_NAMESPACE
```
Our configuration (`loki-values.yaml`) sets up Loki in a simple configuration that is suitable for testing and development. It uses a local MinIO for storage. The installation pods can be viewed with:
```bash
kubectl get pods -n $MONITORING_NAMESPACE -l app=loki
```
### 2. Install Grafana Alloy
Next, install the Grafana Alloy collector to gather logs from your Kubernetes cluster and forward them to Loki. Here we use the Helm chart `k8s-monitoring` provided by Grafana to install the collector:
```bash
# Generate a custom values file with the namespace information
envsubst < deploy/observability/k8s/logging/values/alloy-values.yaml > alloy-custom-values.yaml
# Install the collector
helm install --values alloy-custom-values.yaml alloy grafana/k8s-monitoring -n $MONITORING_NAMESPACE
```
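Note that `envsubst` silently replaces unset variables with empty strings, which would produce a values file with a broken Loki URL or an empty namespace list. A small guard such as the sketch below can catch that before the file is generated. The `require_exported` helper is hypothetical.

```bash
# require_exported is a hypothetical helper: it fails if any named variable
# is missing from the environment (or empty), since envsubst only sees
# exported variables.
require_exported() {
  for name in "$@"; do
    if [ -z "$(printenv "$name")" ]; then
      echo "missing or empty environment variable: $name" >&2
      return 1
    fi
  done
}

export MONITORING_NAMESPACE=monitoring
export DYN_NAMESPACE=dynamo-system

require_exported MONITORING_NAMESPACE DYN_NAMESPACE && echo "ok to run envsubst"
```

Running the guard immediately before the `envsubst` command keeps the check and the substitution in the same shell session.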
The values file (`alloy-values.yaml`) includes the following configurations for the collector:
- Destination to forward logs to Loki
- Namespace to collect logs from
- Pod labels to be mapped to Loki labels
- Collection method (kubernetesApi or tailing `/var/log/containers/`)
```yaml
destinations:
  - name: loki
    type: loki
    url: http://loki-gateway.$MONITORING_NAMESPACE.svc.cluster.local/loki/api/v1/push
podLogs:
  enabled: true
  gatherMethod: kubernetesApi # collect logs from the kubernetes api, rather than /var/log/containers/; friendly for testing and development
  collector: alloy-logs
  labels:
    app_kubernetes_io_name: app.kubernetes.io/name
    nvidia_com_dynamo_component_type: nvidia.com/dynamo-component-type
    nvidia_com_dynamo_graph_deployment_name: nvidia.com/dynamo-graph-deployment-name
  labelsToKeep:
    - "app_kubernetes_io_name"
    - "container"
    - "instance"
    - "job"
    - "level"
    - "namespace"
    - "service_name"
    - "service_namespace"
    - "deployment_environment"
    - "deployment_environment_name"
    - "nvidia_com_dynamo_component_type" # extract this label from the dynamo graph deployment
    - "nvidia_com_dynamo_graph_deployment_name" # extract this label from the dynamo graph deployment
  namespaces:
    - $DYN_NAMESPACE
```
### 3. Configure Grafana with the Loki datasource and Dynamo Logs dashboard
We will be viewing the logs associated with our DynamoGraphDeployment in Grafana. To do this, we need to configure Grafana with the Loki datasource and Dynamo Logs dashboard.
Since we are using Grafana with the Prometheus Operator, we can simply apply the following ConfigMaps to quickly achieve this configuration.
```bash
# Configure Grafana with the Loki datasource
envsubst < deploy/observability/k8s/logging/grafana/loki-datasource.yaml | kubectl apply -n $MONITORING_NAMESPACE -f -
# Configure Grafana with the Dynamo Logs dashboard
envsubst < deploy/observability/k8s/logging/grafana/logging-dashboard.yaml | kubectl apply -n $MONITORING_NAMESPACE -f -
```
<Note>
If using Grafana installed without the Prometheus Operator, you can manually import the Loki datasource and Dynamo Logs dashboard using the Grafana UI.
</Note>
### 4. Deploy a DynamoGraphDeployment with JSONL Logging
At this point, we should have everything in place to collect and view logs in our Grafana instance. All that is left is to deploy a DynamoGraphDeployment to collect logs from.
To enable structured logs in a DynamoGraphDeployment, we need to set the `DYN_LOGGING_JSONL` environment variable to `1`. This is already done in the `agg_logging.yaml` example for the SGLang backend. We can now deploy the DynamoGraphDeployment with:
```bash
kubectl apply -n $DYN_NAMESPACE -f examples/backends/sglang/deploy/agg_logging.yaml
```
Send a few chat completion requests to generate structured logs from the frontend and worker pods of the DynamoGraphDeployment. We are now all set to view the logs in Grafana.
## Viewing Logs in Grafana
Port-forward the Grafana service to access the UI:
```bash
kubectl port-forward svc/prometheus-grafana 3000:80 -n $MONITORING_NAMESPACE
```
If everything is working, you should see a dashboard under Home > Dashboards > Dynamo Logs that shows the logs associated with your DynamoGraphDeployments.
The dashboard enables filtering by DynamoGraphDeployment, namespace, and component type (e.g., frontend, worker, etc.).
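The same labels can also be used to query Loki directly, outside the dashboard. The sketch below assumes the `loki-gateway` Service from the Alloy destination URL has been port-forwarded to `localhost:3100`; the local port and Service port 80 are assumptions. It calls Loki's `query_range` HTTP endpoint, and `loki_query` is a hypothetical helper. If your Loki installation has multi-tenancy enabled, a tenant header may also be required.

```bash
# Assumed local address, e.g. after:
#   kubectl port-forward svc/loki-gateway 3100:80 -n "$MONITORING_NAMESPACE"
LOKI_URL=${LOKI_URL:-http://localhost:3100}

# LogQL selector built from a label kept by the Alloy configuration;
# "| json" parses each JSONL log line into fields.
WORKER_LOGS='{nvidia_com_dynamo_component_type="worker"} | json'

# loki_query is a hypothetical helper around Loki's query_range endpoint.
loki_query() {
  curl -sG "${LOKI_URL}/loki/api/v1/query_range" \
    --data-urlencode "query=$1" \
    --data-urlencode "limit=20"
}

# Usage (requires the port-forward above):
#   loki_query "$WORKER_LOGS"
```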
---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: "Dynamo Metrics Collection on Kubernetes"
---
## Overview
This guide provides a walkthrough for collecting and visualizing metrics from Dynamo components using the kube-prometheus-stack. The kube-prometheus-stack provides a powerful and flexible way to configure monitoring for Kubernetes applications through custom resources like PodMonitors, making it easy to automatically discover and scrape metrics from Dynamo components.
## Prerequisites
### Install kube-prometheus-stack
If you don't have an existing Prometheus setup, you'll likely want to install the kube-prometheus-stack. This is a collection of Kubernetes manifests that includes the Prometheus Operator, Prometheus, Grafana, and other monitoring components in a pre-configured setup. The stack introduces custom resources that make it easy to deploy and manage monitoring in Kubernetes:
- `PodMonitor`: Automatically discovers and scrapes metrics from pods based on label selectors
- `ServiceMonitor`: Similar to PodMonitor but works with Services
- `PrometheusRule`: Defines alerting and recording rules
For a basic installation:
```bash
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
# Values allow PodMonitors to be picked up that are outside of the kube-prometheus-stack helm release
helm install prometheus -n monitoring --create-namespace prometheus-community/kube-prometheus-stack \
--set prometheus.prometheusSpec.podMonitorSelectorNilUsesHelmValues=false \
--set prometheus.prometheusSpec.podMonitorNamespaceSelector="{}" \
--set prometheus.prometheusSpec.probeNamespaceSelector="{}"
```
<Note>
The commands below assume you installed the kube-prometheus-stack using the method above. If your monitoring stack is configured differently, you may need to adjust the `kubectl` commands in this document (e.g., the namespace or Service names).
</Note>
### Install Dynamo Operator
Before setting up metrics collection, you'll need to have the Dynamo operator installed in your cluster. Follow our [Installation Guide](../installation-guide.md) for detailed instructions on deploying the Dynamo operator.
Make sure to set the `prometheusEndpoint` to the Prometheus endpoint you installed in the previous step.
```bash
helm install dynamo-platform ... \
--set prometheusEndpoint=http://prometheus-kube-prometheus-prometheus.monitoring.svc.cluster.local:9090
```
### Node Exporter for CPU/Memory Metrics
The Dynamo Grafana dashboard includes panels for node-level CPU utilization, system load, and container resource usage. These metrics are collected and exported to Prometheus via [node-exporter](https://github.com/prometheus/node_exporter), which exposes hardware and OS metrics from Linux systems.
<Note>
The kube-prometheus-stack installation described above includes node-exporter by default. If you're using a custom Prometheus setup, you'll need to ensure node-exporter is deployed as a DaemonSet on your cluster nodes.
</Note>
To verify node-exporter is running:
```bash
kubectl get daemonset -A | grep node-exporter
```
If node-exporter is not running, you can install it via the kube-prometheus-stack or deploy it separately. For more information, see the [node-exporter documentation](https://github.com/prometheus/node_exporter).
### DCGM Metrics Collection (Optional)
GPU utilization metrics are collected and exported to Prometheus via dcgm-exporter. The Dynamo Grafana dashboard includes a panel for GPU utilization related to your Dynamo deployment. For that panel to be populated, you need to ensure that the dcgm-exporter is running in your cluster. To check if the dcgm-exporter is running, please run the following command:
```bash
kubectl get daemonset -A | grep dcgm-exporter
```
If the output is empty, you need to install the dcgm-exporter. For more information, please consult the official [dcgm-exporter documentation](https://docs.nvidia.com/datacenter/cloud-native/gpu-telemetry/latest/dcgm-exporter.html).
## Deploy a DynamoGraphDeployment
Let's start by deploying a simple vLLM aggregated deployment:
```bash
export NAMESPACE=dynamo-system # namespace where dynamo operator is installed
pushd examples/backends/vllm/deploy
kubectl apply -f agg.yaml -n $NAMESPACE
popd
```
This will create two components:
- A Frontend component exposing metrics on its HTTP port
- A Worker component exposing metrics on its system port
Both components expose a `/metrics` endpoint following the OpenMetrics format, but with different metrics appropriate to their roles. For details about:
- Deployment configuration: See the [vLLM README](../../backends/vllm/README.md)
- Available metrics: See the [metrics guide](../../observability/metrics.md)
### Validate the Deployment
Let's send some test requests to populate metrics:
```bash
curl localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen3-0.6B",
"messages": [
{
"role": "user",
"content": "In the heart of Eldoria, an ancient land of boundless magic and mysterious creatures, lies the long-forgotten city of Aeloria. Once a beacon of knowledge and power, Aeloria was buried beneath the shifting sands of time, lost to the world for centuries. You are an intrepid explorer, known for your unparalleled curiosity and courage, who has stumbled upon an ancient map hinting that Aeloria holds a secret so profound that it has the potential to reshape the very fabric of reality. Your journey will take you through treacherous deserts, enchanted forests, and across perilous mountain ranges. Your Task: Character Background: Develop a detailed background for your character. Describe their motivations for seeking out Aeloria, their skills and weaknesses, and any personal connections to the ancient city or its legends."
}
],
"stream": true,
"max_tokens": 30
}'
```
For more information about validating the deployment, see the [vLLM README](../../backends/vllm/README.md).
## Set Up Metrics Collection
### Create PodMonitors
The Prometheus Operator uses PodMonitor resources to automatically discover and scrape metrics from pods. To enable this discovery, the Dynamo operator automatically creates PodMonitor resources and adds these labels to all pods:
- `nvidia.com/metrics-enabled: "true"` - Enables metrics collection
- `nvidia.com/dynamo-component-type: "frontend|worker"` - Identifies the component type
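These labels also work as a `kubectl` label selector, which is a convenient way to confirm which pods a PodMonitor is expected to pick up. The sketch below only builds the selector string; `metrics_selector` is a hypothetical helper, and the commented command assumes cluster access.

```bash
# metrics_selector is a hypothetical helper that builds a label selector
# from the two labels listed above.
# $1 = component type (frontend or worker)
metrics_selector() {
  printf 'nvidia.com/metrics-enabled=true,nvidia.com/dynamo-component-type=%s\n' "$1"
}

SELECTOR=$(metrics_selector worker)
echo "$SELECTOR"

# With cluster access, list the matching pods:
#   kubectl get pods -n "$NAMESPACE" -l "$SELECTOR"
```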
> **Note**: You can opt-out specific deployments from metrics collection by adding this annotation to your DynamoGraphDeployment:
```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: my-deployment
  annotations:
    nvidia.com/enable-metrics: "false"
spec:
  # …
```
### Configure Grafana Dashboard
Apply the Dynamo dashboard configuration to populate Grafana with the Dynamo dashboard:
```bash
kubectl apply -n monitoring -f deploy/observability/k8s/grafana-dynamo-dashboard-configmap.yaml
```
The dashboard is embedded in the ConfigMap. Because the ConfigMap is labeled with `grafana_dashboard: "1"`, Grafana discovers it and adds it to its list of available dashboards. The dashboard includes panels for:
- Frontend request rates
- Time to first token
- Inter-token latency
- Request duration
- Input/Output sequence lengths
- GPU utilization via DCGM
- Node CPU utilization and system load
- Container CPU usage per pod
- Memory usage per pod
## Viewing the Metrics
### In Prometheus
```bash
kubectl port-forward svc/prometheus-kube-prometheus-prometheus 9090:9090 -n monitoring
```
Visit http://localhost:9090 and try these example queries:
- `dynamo_frontend_requests_total`
- `dynamo_frontend_time_to_first_token_seconds_bucket`
![Prometheus UI showing Dynamo metrics](../../../assets/img/prometheus-k8s.png)
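The raw metric names are most useful inside PromQL expressions. The sketch below shows two such expressions, a request rate and a p95 time to first token, and a way to run them through the Prometheus HTTP API instead of the UI. The 5-minute window, the 0.95 quantile, and the `prom_query` helper are illustrative choices rather than values from the Dynamo dashboards.

```bash
# Assumes the port-forward above is running.
PROM_URL=${PROM_URL:-http://localhost:9090}

# Requests per second across all frontends, averaged over 5 minutes.
REQUEST_RATE='sum(rate(dynamo_frontend_requests_total[5m]))'

# 95th percentile time to first token, computed from the histogram buckets.
TTFT_P95='histogram_quantile(0.95, sum by (le) (rate(dynamo_frontend_time_to_first_token_seconds_bucket[5m])))'

# prom_query is a hypothetical helper around the /api/v1/query endpoint.
prom_query() {
  curl -sG "${PROM_URL}/api/v1/query" --data-urlencode "query=$1"
}

# Usage:
#   prom_query "$REQUEST_RATE"
#   prom_query "$TTFT_P95"
```

The same expressions can be pasted directly into the Prometheus query box.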
### In Grafana
```bash
# Get Grafana credentials
export GRAFANA_USER=$(kubectl get secret -n monitoring prometheus-grafana -o jsonpath="{.data.admin-user}" | base64 --decode)
export GRAFANA_PASSWORD=$(kubectl get secret -n monitoring prometheus-grafana -o jsonpath="{.data.admin-password}" | base64 --decode)
echo "Grafana user: $GRAFANA_USER"
echo "Grafana password: $GRAFANA_PASSWORD"
# Port forward Grafana service
kubectl port-forward svc/prometheus-grafana 3000:80 -n monitoring
```
Visit http://localhost:3000 and log in with the credentials captured above.
Once logged in, find the Dynamo dashboard under General.
![Grafana dashboard showing Dynamo metrics](../../../assets/img/grafana-k8s.png)
---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: "Deploying Dynamo on Kubernetes"
---
High-level guide to Dynamo Kubernetes deployments. Start here, then dive into specific guides.
## Important Terminology
**Kubernetes Namespace**: The K8s namespace where your DynamoGraphDeployment resource is created.
- Used for: Resource isolation, RBAC, organizing deployments
- Example: `dynamo-system`, `team-a-namespace`
**Dynamo Namespace**: The logical namespace used by Dynamo components for [service discovery](service-discovery.md).
- Used for: Runtime component communication, service discovery
- Specified in: `.spec.services.<ServiceName>.dynamoNamespace` field
- Example: `my-llm`, `production-model`, `dynamo-dev`
These are independent. A single Kubernetes namespace can host multiple Dynamo namespaces, and vice versa.
## Prerequisites
Before you begin, ensure you have the following tools installed:
| Tool | Minimum Version | Installation Guide |
|------|-----------------|-------------------|
| **kubectl** | v1.24+ | [Install kubectl](https://kubernetes.io/docs/tasks/tools/#kubectl) |
| **Helm** | v3.0+ | [Install Helm](https://helm.sh/docs/intro/install/) |
Verify your installation:
```bash
kubectl version --client # Should show v1.24+
helm version # Should show v3.0+
```
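If you want to script the check rather than read the output by eye, a version comparison can be sketched as follows. `version_ge` is a hypothetical helper that relies on `sort -V` (version sort), and the version strings passed to it here are example values.

```bash
# version_ge is a hypothetical helper: succeeds when version $1 >= version $2.
# A leading "v" is stripped so "v1.28.3" and "1.28.3" compare the same way.
version_ge() {
  have=${1#v}
  want=${2#v}
  # After a version sort, the smaller version comes first.
  [ "$(printf '%s\n%s\n' "$have" "$want" | sort -V | head -n 1)" = "$want" ]
}

# Example values; in practice, use the versions printed by the commands above.
version_ge v1.28.3 v1.24.0 && echo "kubectl version is new enough"
version_ge v3.14.0 v3.0.0 && echo "helm version is new enough"
```

A plain string comparison would get cases such as `1.9.0` versus `1.24.0` wrong, which is why the sketch uses a version-aware sort.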
For detailed installation instructions, see the [Prerequisites section](installation-guide.md#prerequisites) in the Installation Guide.
## Pre-deployment Checks
Before deploying the platform, run the pre-deployment checks to ensure the cluster is ready:
```bash
./deploy/pre-deployment/pre-deployment-check.sh
```
This validates kubectl connectivity, StorageClass configuration, and GPU availability. See [pre-deployment checks](https://github.com/ai-dynamo/dynamo/tree/main/deploy/pre-deployment/README.md) for more details.
## 1. Install Platform First
```bash
# 1. Set environment
export NAMESPACE=dynamo-system
export RELEASE_VERSION=0.x.x # any version of Dynamo 0.3.2+ listed at https://github.com/ai-dynamo/dynamo/releases
# 2. Install CRDs (skip if on shared cluster where CRDs already exist)
helm fetch https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-crds-${RELEASE_VERSION}.tgz
helm install dynamo-crds dynamo-crds-${RELEASE_VERSION}.tgz --namespace default
# 3. Install Platform
helm fetch https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-platform-${RELEASE_VERSION}.tgz
helm install dynamo-platform dynamo-platform-${RELEASE_VERSION}.tgz --namespace ${NAMESPACE} --create-namespace
```
**For Shared/Multi-Tenant Clusters:**
If your cluster has namespace-restricted Dynamo operators, add this flag to step 3:
```bash
--set dynamo-operator.namespaceRestriction.enabled=true
```
For more details or customization options (including multinode deployments), see **[Installation Guide for Dynamo Kubernetes Platform](installation-guide.md)**.
## 2. Choose Your Backend
Each backend has deployment examples and configuration options:
| Backend | Aggregated | Aggregated + Router | Disaggregated | Disaggregated + Router | Disaggregated + Planner | Disaggregated Multi-node |
|--------------|:----------:|:-------------------:|:-------------:|:----------------------:|:-----------------------:|:------------------------:|
| **[SGLang](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/sglang/deploy/README.md)** | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| **[TensorRT-LLM](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/trtllm/deploy/README.md)** | ✅ | ✅ | ✅ | ✅ | 🚧 | ✅ |
| **[vLLM](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/vllm/deploy/README.md)** | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
## 3. Deploy Your First Model
```bash
export NAMESPACE=dynamo-system
kubectl create namespace ${NAMESPACE}
# to pull model from HF
export HF_TOKEN=<Token-Here>
kubectl create secret generic hf-token-secret \
--from-literal=HF_TOKEN="$HF_TOKEN" \
-n ${NAMESPACE};
# Deploy any example (this uses vLLM with Qwen model using aggregated serving)
kubectl apply -f examples/backends/vllm/deploy/agg.yaml -n ${NAMESPACE}
# Check status
kubectl get dynamoGraphDeployment -n ${NAMESPACE}
# Test it
kubectl port-forward svc/vllm-agg-frontend 8000:8000 -n ${NAMESPACE}
curl http://localhost:8000/v1/models
```
For SLA-based autoscaling, see [SLA Planner Quick Start Guide](../planner/sla-planner-quickstart.md).
## Understanding Dynamo's Custom Resources
Dynamo provides two main Kubernetes Custom Resources for deploying models:
### DynamoGraphDeploymentRequest (DGDR) - Simplified SLA-Driven Configuration
The **recommended approach** for generating optimal configurations. DGDR provides a high-level interface where you specify:
- Model name and backend framework
- SLA targets (latency requirements)
- GPU type (optional)
Dynamo automatically handles profiling and generates an optimized DGD spec in the status. Perfect for:
- SLA-driven configuration generation
- Automated resource optimization
- Users who want simplicity over control
**Note**: DGDR generates a DGD spec which you can then use to deploy.
### DynamoGraphDeployment (DGD) - Direct Configuration
A lower-level interface that defines your complete inference pipeline:
- Model configuration
- Resource allocation (GPUs, memory)
- Scaling policies
- Frontend/backend connections
Use this when you need fine-grained control or have already completed profiling.
Refer to the [API Reference and Documentation](api-reference.md) for more details.
## 📖 API Reference & Documentation
For detailed technical specifications of Dynamo's Kubernetes resources:
- **[API Reference](api-reference.md)** - Complete CRD field specifications for all Dynamo resources
- **[Create Deployment](deployment/create-deployment.md)** - Step-by-step deployment creation with DynamoGraphDeployment
- **[Operator Guide](dynamo-operator.md)** - Dynamo operator configuration and management
### Choosing Your Architecture Pattern
When creating a deployment, select the architecture pattern that best fits your use case:
- **Development / Testing** - Use `agg.yaml` as the base configuration
- **Production with Load Balancing** - Use `agg_router.yaml` to enable scalable, load-balanced inference
- **High Performance / Disaggregated** - Use `disagg_router.yaml` for maximum throughput and modular scalability
### Frontend and Worker Components
You can run the Frontend on one machine (e.g., a CPU node) and workers on different machines (GPU nodes). The Frontend serves as a framework-agnostic HTTP entry point that:
- Provides OpenAI-compatible `/v1/chat/completions` endpoint
- Auto-discovers backend workers via [service discovery](service-discovery.md) (Kubernetes-native by default)
- Routes requests and handles load balancing
- Validates and preprocesses requests
### Customizing Your Deployment
Example structure:
```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: my-llm
spec:
  services:
    Frontend:
      dynamoNamespace: my-llm
      componentType: frontend
      replicas: 1
      extraPodSpec:
        mainContainer:
          image: your-image
    VllmDecodeWorker: # or SGLangDecodeWorker, TrtllmDecodeWorker
      dynamoNamespace: my-llm # same Dynamo namespace as the Frontend
      componentType: worker
      replicas: 1
      envFromSecret: hf-token-secret # for HuggingFace models
      resources:
        limits:
          gpu: "1"
      extraPodSpec:
        mainContainer:
          image: your-image
          command: ["/bin/sh", "-c"]
          args:
            - python3 -m dynamo.vllm --model YOUR_MODEL [--your-flags]
```
Worker command examples per backend:
```yaml
# vLLM worker
args:
  - python3 -m dynamo.vllm --model Qwen/Qwen3-0.6B
# SGLang worker
args:
  - >-
    python3 -m dynamo.sglang
    --model-path deepseek-ai/DeepSeek-R1-Distill-Llama-8B
    --tp 1
    --trust-remote-code
# TensorRT-LLM worker
args:
  - python3 -m dynamo.trtllm
    --model-path deepseek-ai/DeepSeek-R1-Distill-Llama-8B
    --served-model-name deepseek-ai/DeepSeek-R1-Distill-Llama-8B
    --extra-engine-args /workspace/examples/backends/trtllm/engine_configs/deepseek-r1-distill-llama-8b/agg.yaml
```
Key customization points include:
- **Model Configuration**: Specify model in the args command
- **Resource Allocation**: Configure GPU requirements under `resources.limits`
- **Scaling**: Set `replicas` for number of worker instances
- **Routing Mode**: Enable KV-cache routing by setting `DYN_ROUTER_MODE=kv` in Frontend envs
- **Worker Specialization**: Add `--is-prefill-worker` flag for disaggregated prefill workers
## Additional Resources
- **[Examples](../getting-started/examples.md)** - Complete working examples
- **[Create Custom Deployments](deployment/create-deployment.md)** - Build your own CRDs
- **[Managing Models with DynamoModel](deployment/dynamomodel-guide.md)** - Deploy LoRA adapters and manage models
- **[Operator Documentation](dynamo-operator.md)** - How the platform works
- **[Service Discovery](service-discovery.md)** - Discovery backends and configuration
- **[Helm Charts](https://github.com/ai-dynamo/dynamo/tree/main/deploy/helm/README.md)** - For advanced users
- **[GitOps Deployment with FluxCD](fluxcd.md)** - For advanced users
- **[Logging](observability/logging.md)** - For logging setup
- **[Multinode Deployment](deployment/multinode-deployment.md)** - For multinode deployment
- **[Grove](grove.md)** - For grove details and custom installation
- **[Monitoring](observability/metrics.md)** - For monitoring setup
- **[Model Caching with Fluid](model-caching-with-fluid.md)** - For model caching with Fluid
---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: "Service Discovery"
---
Dynamo components (frontends, workers, planner) need to discover each other and their capabilities at runtime. We refer to this as service discovery. Two service discovery backends are supported on Kubernetes.
## Discovery Backends
| Backend | Default | Dependencies | Use Case |
|---------|---------|--------------|----------|
| **Kubernetes** | ✅ Yes | None (native K8s) | Recommended for all Kubernetes deployments |
| **KV Store (etcd)** | No | etcd cluster | Legacy deployments |
## Kubernetes Discovery (Default)
Kubernetes discovery is the default and recommended backend when running on Kubernetes. It uses native Kubernetes primitives to facilitate discovery of components:
- **DynamoWorkerMetadata CRD**: Each worker stores its registered endpoints and model cards in a Custom Resource
- **EndpointSlices**: EndpointSlices signal each component's readiness status
### Implementation Details
Each pod runs a **discovery daemon** that watches both EndpointSlices and DynamoWorkerMetadata CRs. A pod is only discoverable when it appears as "ready" in an EndpointSlice AND has a corresponding `DynamoWorkerMetadata` CR. This correlation ensures pods aren't discoverable until they're ready, metadata is immediately available, and stale entries are cleaned up when pods terminate.
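The correlation rule amounts to a set intersection: a pod is discoverable only if it appears in both the "ready" set and the "has metadata" set. The sketch below illustrates the rule with hypothetical pod names and plain text files. It is not how the discovery daemon is implemented.

```bash
# Hypothetical pod names standing in for what the discovery daemon observes.
workdir=$(mktemp -d)

# Pods reported as ready in an EndpointSlice.
printf '%s\n' worker-a worker-b worker-d | sort > "$workdir/ready"
# Pods that have a DynamoWorkerMetadata CR.
printf '%s\n' worker-b worker-c worker-d | sort > "$workdir/metadata"

# comm -12 prints only the lines present in both sorted files.
discoverable=$(comm -12 "$workdir/ready" "$workdir/metadata")
echo "$discoverable"

rm -rf "$workdir"
```

Here `worker-a` is ready but has no metadata yet, and `worker-c` has metadata but is not ready, so neither is discoverable.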
#### DynamoWorkerMetadata CRD
Each worker pod creates a `DynamoWorkerMetadata` CR that stores its discovery metadata:
```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoWorkerMetadata
metadata:
  name: my-worker-pod-abc123
  namespace: dynamo-system
  ownerReferences:
    - apiVersion: v1
      kind: Pod
      name: my-worker-pod-abc123
      uid: <pod-uid>
      controller: true
spec:
  data:
    endpoints:
      "dynamo/backend/generate":
        type: Endpoint
        namespace: dynamo
        component: backend
        endpoint: generate
        instance_id: 12345678901234567890
        transport:
          nats_tcp: "dynamo_backend.generate-abc123"
    model_cards: {}
```
The CR is named after the pod and includes an owner reference for automatic garbage collection when the pod is deleted.
#### EndpointSlices
While DynamoWorkerMetadata resources provide an up-to-date snapshot of a component's capabilities, EndpointSlices provide a snapshot of the health of the various Dynamo components.
The operator creates a Kubernetes Service targeting the Dynamo components. The Kubernetes controller in turn creates and maintains EndpointSlice resources that keep track of the readiness of the pods targeted by the Service. Watching these slices gives us an up-to-date snapshot of which Dynamo components are ready to serve traffic.
##### Readiness Probes
A pod is marked ready if the readiness probe succeeds. On Dynamo workers, this is when the `generate` endpoint is available and healthy. These probes are configured by the Dynamo operator for each pod/component.
#### RBAC
Each Dynamo component pod is automatically given a ServiceAccount that allows it to watch `EndpointSlice` and `DynamoWorkerMetadata` resources within its namespace.
#### Environment Variables
The following environment variables are automatically injected into pods by the operator to facilitate service discovery:
| Variable | Description |
|----------|-------------|
| `DYN_DISCOVERY_BACKEND` | Set to `kubernetes` |
| `POD_NAME` | Pod name (via downward API) |
| `POD_NAMESPACE` | Pod namespace (via downward API) |
| `POD_UID` | Pod UID (via downward API) |
The pod's instance ID is deterministically generated by hashing the pod name, ensuring consistent identity and correlation between EndpointSlices and CRs.
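As a rough sketch of what "deterministically generated by hashing the pod name" could look like: the source does not say which hash function or ID width is used, so SHA-256 truncated to an unsigned 64-bit integer is purely an assumption here, and `instance_id_from_pod_name` is a hypothetical helper.

```python
import hashlib


def instance_id_from_pod_name(pod_name: str) -> int:
    """Derive a stable ID from a pod name.

    Assumption: SHA-256 truncated to 64 bits. The real hash may differ.
    """
    digest = hashlib.sha256(pod_name.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


first = instance_id_from_pod_name("my-worker-pod-abc123")
second = instance_id_from_pod_name("my-worker-pod-abc123")
print(first == second)
```

Because the ID depends only on the pod name, any component that knows the name (for example, from an EndpointSlice) can recompute the same ID without extra coordination.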
## KV Store Discovery (etcd)
To use etcd-based discovery instead of Kubernetes-native discovery, add the annotation to your DynamoGraphDeployment:
```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: my-deployment
  annotations:
    nvidia.com/dynamo-discovery-backend: etcd
spec:
  services:
    # ...
```
This requires an etcd cluster to be available. The etcd connection is configured via the platform Helm chart.
---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: "Webhooks"
---
This document describes the webhook functionality in the Dynamo Operator, including validation webhooks, certificate management, and troubleshooting.
## Table of Contents
- [Overview](#overview)
- [Architecture](#architecture)
- [Configuration](#configuration)
- [Enabling/Disabling Webhooks](#enablingdisabling-webhooks)
- [Certificate Management Options](#certificate-management-options)
- [Advanced Configuration](#advanced-configuration)
- [Certificate Management](#certificate-management)
- [Automatic Certificates (Default)](#automatic-certificates-default)
- [cert-manager Integration](#cert-manager-integration)
- [External Certificates](#external-certificates)
- [Multi-Operator Deployments](#multi-operator-deployments)
- [Troubleshooting](#troubleshooting)
---
## Overview
The Dynamo Operator uses **Kubernetes admission webhooks** to provide real-time validation and mutation of custom resources. Currently, the operator implements **validation webhooks** that ensure invalid configurations are rejected immediately at the API server level, providing faster feedback to users compared to controller-based validation.
All webhook types (validating, mutating, conversion, etc.) share the same **webhook server** and **TLS certificate infrastructure**, making certificate management consistent across all webhook operations.
### Key Features
- **Enabled by default** - Zero-touch validation out of the box
- **Shared certificate infrastructure** - All webhook types use the same TLS certificates
- **Automatic certificate generation** - No manual certificate management required
- **Defense in depth** - Controllers validate when webhooks are disabled
- **cert-manager integration** - Optional integration for automated certificate lifecycle
- **Multi-operator support** - Lease-based coordination for cluster-wide and namespace-restricted deployments
- **Immutability enforcement** - Critical fields protected via CEL validation rules
### Current Webhook Types
- **Validating Webhooks**: Validate custom resource specifications before persistence
- `DynamoComponentDeployment` validation
- `DynamoGraphDeployment` validation
- `DynamoModel` validation
**Note:** Future releases may add mutating webhooks (for defaults/transformations) and conversion webhooks (for CRD version migrations). All will use the same certificate infrastructure described in this document.
---
## Architecture
```
┌─────────────────────────────────────────────────────────────────┐
│ API Server │
│ 1. User submits CR (kubectl apply) │
│ 2. API server calls ValidatingWebhookConfiguration │
└────────────────────────┬────────────────────────────────────────┘
│ HTTPS (TLS required)
┌─────────────────────────────────────────────────────────────────┐
│ Webhook Server (in Operator Pod) │
│ 3. Validates CR against business rules │
│ 4. Returns admit/deny decision + warnings │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ API Server │
│ 5. If admitted: Persist CR to etcd │
│ 6. If denied: Return error to user │
└─────────────────────────────────────────────────────────────────┘
```
### Validation Flow
1. **Webhook validation** (if enabled): Validates at API server level
2. **CEL validation**: Kubernetes-native immutability checks (always active)
3. **Controller validation** (if webhooks disabled): Defense-in-depth validation during reconciliation
---
## Configuration
### Enabling/Disabling Webhooks
Webhooks are **enabled by default**. To disable them:
```yaml
# Platform-level values.yaml
dynamo-operator:
  webhook:
    enabled: false
```
**When to disable webhooks:**
- During development/testing when rapid iteration is needed
- In environments where admission webhooks are not supported
- When troubleshooting validation issues
**Note:** When webhooks are disabled, controllers perform validation during reconciliation (defense in depth).
---
### Certificate Management Options
The operator supports three certificate management modes:
| Mode | Description | Use Case |
|------|-------------|----------|
| **Automatic (Default)** | Helm hooks generate self-signed certificates | Testing and development environments |
| **cert-manager** | Integrate with cert-manager for automated lifecycle | Production deployments with cert-manager |
| **External** | Bring your own certificates | Production deployments with custom PKI |
---
### Advanced Configuration
#### Complete Configuration Reference
```yaml
dynamo-operator:
  webhook:
    # Enable/disable validation webhooks
    enabled: true
    # Certificate management
    certManager:
      enabled: false
      issuerRef:
        kind: Issuer
        name: selfsigned-issuer
    # Certificate secret configuration
    certificateSecret:
      name: webhook-server-cert
      external: false
    # Certificate validity period (automatic generation only)
    certificateValidity: 3650  # 10 years
    # Certificate generator image (automatic generation only)
    certGenerator:
      image:
        repository: bitnami/kubectl
        tag: latest
    # Webhook behavior configuration
    failurePolicy: Fail  # Fail (reject on error) or Ignore (allow on error)
    timeoutSeconds: 10   # Webhook timeout
    # Namespace filtering (advanced)
    namespaceSelector: {}  # Kubernetes label selector for namespaces
```
#### Failure Policy
```yaml
# Fail: Reject resources if webhook is unavailable (recommended for production)
webhook:
  failurePolicy: Fail
# Ignore: Allow resources if webhook is unavailable (use with caution)
webhook:
  failurePolicy: Ignore
```
**Recommendation:** Use `Fail` in production to ensure validation is always enforced. Only use `Ignore` if you need high availability and can tolerate occasional invalid resources.
#### Namespace Filtering
Control which namespaces are validated (applies to **cluster-wide operator** only):
```yaml
# Only validate resources in namespaces with specific labels
webhook:
  namespaceSelector:
    matchLabels:
      dynamo-validation: enabled
# Or exclude specific namespaces
webhook:
  namespaceSelector:
    matchExpressions:
      - key: dynamo-validation
        operator: NotIn
        values: ["disabled"]
```
**Note:** For **namespace-restricted operators**, the namespace selector is automatically set to validate only the operator's namespace. This configuration is ignored in namespace-restricted mode.
---
## Certificate Management
### Automatic Certificates (Default)
**Zero configuration required!** Certificates are automatically generated during `helm install` and `helm upgrade`.
#### How It Works
1. **Pre-install/pre-upgrade hook**: Generates self-signed TLS certificates
- Root CA (valid 10 years)
- Server certificate (valid 10 years)
- Stores in Secret: `<release>-webhook-server-cert`
2. **Post-install/post-upgrade hook**: Injects CA bundle into `ValidatingWebhookConfiguration`
- Reads `ca.crt` from Secret
- Patches `ValidatingWebhookConfiguration` with base64-encoded CA bundle
3. **Operator pod**: Mounts certificate secret and serves webhook on port 9443
#### Certificate Validity
- **Root CA**: 10 years
- **Server Certificate**: 10 years (same as Root CA)
- **Automatic rotation**: Certificates are checked on every `helm upgrade` and regenerated only when they are missing, expiring soon, or have incorrect SANs
#### Smart Certificate Generation
The certificate generation hook is intelligent:
- **Checks existing certificates** before generating new ones
- **Skips generation** if valid certificates exist (valid for 30+ days with correct SANs)
- **Regenerates** only when needed (missing, expiring soon, or incorrect SANs)
This means:
- Fast `helm upgrade` operations (no unnecessary cert generation)
- Safe to run `helm upgrade` frequently
- Certificates persist across reinstalls (stored in Secret)
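The skip-or-regenerate decision described above can be sketched as follows. This is a minimal illustration, not the hook's actual script: `needs_regeneration` and the dictionary shape used for a certificate are hypothetical, and the SAN value is a placeholder.

```python
from datetime import datetime, timedelta, timezone

# The 30-day threshold comes from the text above.
MIN_REMAINING = timedelta(days=30)


def needs_regeneration(cert, required_sans, now, min_remaining=MIN_REMAINING):
    """Decide whether a certificate must be regenerated.

    cert is None when the Secret is missing; otherwise it is a dict with
    'not_after' (datetime) and 'sans' (set of str). Hypothetical shape.
    """
    if cert is None:
        return True  # missing
    if cert["not_after"] - now < min_remaining:
        return True  # expiring soon (or already expired)
    return not set(required_sans).issubset(cert["sans"])  # incorrect SANs


now = datetime(2026, 1, 1, tzinfo=timezone.utc)
san = "webhook-service.dynamo-system.svc"  # placeholder SAN for illustration
healthy = {"not_after": now + timedelta(days=365), "sans": {san}}
expiring = {"not_after": now + timedelta(days=7), "sans": {san}}
print(needs_regeneration(healthy, [san], now))
print(needs_regeneration(expiring, [san], now))
```

Checking before generating is what keeps repeated `helm upgrade` runs cheap: a healthy certificate takes the early "skip" path.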
#### Manual Certificate Rotation
If you need to rotate certificates manually:
```bash
# Delete the certificate secret
kubectl delete secret <release>-webhook-server-cert -n <namespace>
# Upgrade the release to regenerate certificates
helm upgrade <release> dynamo-platform -n <namespace>
```
---
### cert-manager Integration
For clusters with cert-manager installed, you can enable automated certificate lifecycle management.
#### Prerequisites
1. **cert-manager installed** (v1.0+)
2. **CA issuer configured** (e.g., `selfsigned-issuer`)
#### Configuration
```yaml
dynamo-operator:
  webhook:
    certManager:
      enabled: true
      issuerRef:
        kind: Issuer             # Or ClusterIssuer
        name: selfsigned-issuer  # Your issuer name
```
#### How It Works
1. **Helm creates Certificate resource**: Requests TLS certificate from cert-manager
2. **cert-manager generates certificate**: Based on configured issuer
3. **cert-manager stores in Secret**: `<release>-webhook-server-cert`
4. **cert-manager ca-injector**: Automatically injects CA bundle into `ValidatingWebhookConfiguration`
5. **Operator pod**: Mounts certificate secret and serves webhook
#### Benefits Over Automatic Mode
- **Automated rotation**: cert-manager renews certificates before expiration
- **Custom validity periods**: Configure certificate lifetime
- **CA rotation support**: ca-injector handles CA updates automatically
- **Integration with existing PKI**: Use your organization's certificate infrastructure
#### Certificate Rotation
With cert-manager, certificate rotation is **fully automated**:
1. **Leaf certificate rotation** (default: every year)
- cert-manager auto-renews before expiration
- controller-runtime auto-reloads new certificate
- **No pod restart required**
- **No caBundle update required** (same Root CA)
2. **Root CA rotation** (every 10 years)
- cert-manager rotates Root CA
- ca-injector auto-updates caBundle in `ValidatingWebhookConfiguration`
- **No manual intervention required**
#### Example: Self-Signed Issuer
```yaml
apiVersion: cert-manager.io/v1
kind: Issuer
metadata:
  name: selfsigned-issuer
  namespace: dynamo-system
spec:
  selfSigned: {}
---
# Enable in platform values.yaml
dynamo-operator:
  webhook:
    certManager:
      enabled: true
      issuerRef:
        kind: Issuer
        name: selfsigned-issuer
```
---
### External Certificates
Bring your own certificates for custom PKI requirements.
#### Steps
1. **Create certificate secret manually**:
```bash
kubectl create secret tls <release>-webhook-server-cert \
--cert=tls.crt \
--key=tls.key \
-n <namespace>
# Also add ca.crt to the secret
kubectl patch secret <release>-webhook-server-cert -n <namespace> \
--type='json' \
-p='[{"op": "add", "path": "/data/ca.crt", "value": "'$(base64 -w0 < ca.crt)'"}]'
```
2. **Configure operator to use external secret**:
```yaml
dynamo-operator:
  webhook:
    certificateSecret:
      external: true
    caBundle: <base64-encoded-ca-cert>  # Must manually specify
```
3. **Deploy operator**:
```bash
helm install dynamo-platform . -n <namespace> -f values.yaml
```
#### Certificate Requirements
- **Secret name**: Must match `webhook.certificateSecret.name` (default: `webhook-server-cert`)
- **Secret keys**: `tls.crt`, `tls.key`, `ca.crt`
- **Certificate SAN**: Must include `<service-name>.<namespace>.svc`
- Example: `dynamo-platform-dynamo-operator-webhook-service.dynamo-system.svc`
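Before installing an external certificate, it can help to confirm that the SAN requirement is met. The sketch below only builds the expected DNS name and checks for an exact match in a list of SANs you have already extracted (for example, with `openssl`); `required_san` and `san_is_covered` are hypothetical helpers, and wildcard SANs are deliberately not handled.

```python
def required_san(service_name: str, namespace: str) -> str:
    """Build the DNS name the webhook certificate must include."""
    return f"{service_name}.{namespace}.svc"


def san_is_covered(cert_sans, service_name: str, namespace: str) -> bool:
    """Exact-match check only; wildcard SANs are not evaluated here."""
    return required_san(service_name, namespace) in set(cert_sans)


name = required_san("dynamo-platform-dynamo-operator-webhook-service", "dynamo-system")
print(name)
```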
---
## Multi-Operator Deployments
The operator supports running both **cluster-wide** and **namespace-restricted** instances simultaneously using a **lease-based coordination mechanism**.
### Scenario
```
Cluster:
├─ Operator A (cluster-wide, namespace: platform-system)
│ └─ Validates all namespaces EXCEPT team-a
└─ Operator B (namespace-restricted, namespace: team-a)
└─ Validates only team-a namespace
```
### How It Works
1. **Namespace-restricted operator** creates a Lease in its namespace
2. **Cluster-wide operator** watches for Leases named `dynamo-operator-ns-lock`
3. **Cluster-wide operator** skips validation for namespaces with active Leases
4. **Namespace-restricted operator** validates resources in its namespace
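The division of responsibility in the steps above can be sketched as a single predicate. This is an illustration of the rule only; `should_validate` is a hypothetical function, and the set of namespaces with active Leases is passed in rather than watched from the API server.

```python
def should_validate(operator_namespace, resource_namespace, active_lease_namespaces):
    """Decide whether an operator instance validates a resource.

    operator_namespace is None for the cluster-wide operator, or the
    namespace a namespace-restricted operator is bound to.
    """
    if operator_namespace is not None:
        # Namespace-restricted: validates only its own namespace.
        return resource_namespace == operator_namespace
    # Cluster-wide: skips namespaces that hold an active Lease.
    return resource_namespace not in active_lease_namespaces


leases = {"team-a"}
print(should_validate(None, "team-a", leases))      # cluster-wide, team-a
print(should_validate("team-a", "team-a", leases))  # restricted, team-a
print(should_validate(None, "team-b", leases))      # cluster-wide, team-b
```

With the scenario above, each resource is validated by exactly one operator: `team-a` resources by Operator B, everything else by Operator A.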
### Lease Configuration
The lease mechanism is **automatically configured** based on deployment mode:
```yaml
# Cluster-wide operator (default)
namespaceRestriction:
  enabled: false
# → Watches for leases in all namespaces
# → Skips validation for namespaces with active leases
# Namespace-restricted operator
namespaceRestriction:
  enabled: true
  namespace: team-a
# → Creates lease in team-a namespace
# → Does NOT check for leases (no cluster permissions)
```
### Deployment Example
```bash
# 1. Deploy cluster-wide operator
helm install platform-operator dynamo-platform \
-n platform-system \
--set namespaceRestriction.enabled=false
# 2. Deploy namespace-restricted operator for team-a
helm install team-a-operator dynamo-platform \
-n team-a \
--set namespaceRestriction.enabled=true \
--set namespaceRestriction.namespace=team-a
```
### ValidatingWebhookConfiguration Naming
The webhook configuration name reflects the deployment mode:
- **Cluster-wide**: `<release>-validating`
- **Namespace-restricted**: `<release>-validating-<namespace>`
Example:
```bash
# Cluster-wide
platform-operator-validating
# Namespace-restricted (team-a)
team-a-operator-validating-team-a
```
This allows multiple webhook configurations to coexist without conflicts.
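The naming rule can be written as a small helper, which is handy when scripting `kubectl get validatingwebhookconfiguration <name>` checks. `webhook_config_name` is a hypothetical function that simply encodes the two patterns listed above.

```python
def webhook_config_name(release, restricted_namespace=None):
    """Return the ValidatingWebhookConfiguration name for a release.

    Pass restricted_namespace only for namespace-restricted operators.
    """
    base = f"{release}-validating"
    if restricted_namespace is None:
        return base
    return f"{base}-{restricted_namespace}"


print(webhook_config_name("platform-operator"))
print(webhook_config_name("team-a-operator", "team-a"))
```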
### Lease Health
If the namespace-restricted operator is deleted or becomes unhealthy:
- Lease expires after `leaseDuration + gracePeriod` (default: ~30 seconds)
- Cluster-wide operator automatically resumes validation for that namespace
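A minimal sketch of the expiry check follows. The source only gives the combined default of roughly 30 seconds; the 15 s + 15 s split used as default arguments below is an assumption for illustration, and `lease_is_active` is a hypothetical helper.

```python
from datetime import datetime, timedelta, timezone


def lease_is_active(
    renew_time,
    now,
    lease_duration=timedelta(seconds=15),  # assumed split of the ~30 s total
    grace_period=timedelta(seconds=15),    # assumed split of the ~30 s total
):
    """A lease counts as active until leaseDuration + gracePeriod has elapsed."""
    return now - renew_time <= lease_duration + grace_period


renewed = datetime(2026, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
print(lease_is_active(renewed, renewed + timedelta(seconds=10)))
print(lease_is_active(renewed, renewed + timedelta(seconds=45)))
```

Once the check returns `False` for a namespace, the cluster-wide operator treats that namespace like any other and validates it again.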
---
## Troubleshooting
### Webhook Not Called
**Symptoms:**
- Invalid resources are accepted
- No validation errors in logs
**Checks:**
1. **Verify webhook is enabled**:
```bash
kubectl get validatingwebhookconfiguration | grep dynamo
```
2. **Check webhook configuration**:
```bash
kubectl get validatingwebhookconfiguration <name> -o yaml
# Verify:
# - caBundle is present and non-empty
# - clientConfig.service points to correct service
# - webhooks[].namespaceSelector matches your namespace
```
3. **Verify webhook service exists**:
```bash
kubectl get service -n <namespace> | grep webhook
```
4. **Check operator logs for webhook startup**:
```bash
kubectl logs -n <namespace> deployment/<release>-dynamo-operator | grep webhook
# Should see: "Webhooks are enabled - webhooks will validate, controllers will skip validation"
# Should see: "Starting webhook server"
```
---
### Connection Refused Errors
**Symptoms:**
```
Error from server (InternalError): Internal error occurred: failed calling webhook:
Post "https://...webhook-service...:443/validate-...": dial tcp ...:443: connect: connection refused
```
**Checks:**
1. **Verify operator pod is running**:
```bash
kubectl get pods -n <namespace> -l app.kubernetes.io/name=dynamo-operator
```
2. **Check webhook server is listening**:
```bash
# Port-forward to pod
kubectl port-forward -n <namespace> pod/<operator-pod> 9443:9443
# In another terminal, test connection
curl -k https://localhost:9443/validate-nvidia-com-v1alpha1-dynamocomponentdeployment
# Should NOT get "connection refused"
```
3. **Verify webhook port in deployment**:
```bash
kubectl get deployment -n <namespace> <release>-dynamo-operator -o yaml | grep -A5 "containerPort: 9443"
```
4. **Check for webhook initialization errors**:
```bash
kubectl logs -n <namespace> deployment/<release>-dynamo-operator | grep -i error
```
---
### Certificate Errors
**Symptoms:**
```
Error from server (InternalError): Internal error occurred: failed calling webhook:
x509: certificate signed by unknown authority
```
**Checks:**
1. **Verify caBundle is present**:
```bash
kubectl get validatingwebhookconfiguration <name> -o jsonpath='{.webhooks[0].clientConfig.caBundle}' | base64 -d
# Should output a valid PEM certificate
```
2. **Verify certificate secret exists**:
```bash
kubectl get secret -n <namespace> <release>-webhook-server-cert
```
3. **Check certificate validity**:
```bash
kubectl get secret -n <namespace> <release>-webhook-server-cert -o jsonpath='{.data.tls\.crt}' | base64 -d | openssl x509 -noout -text
# Check:
# - Not expired
# - SAN includes: <service-name>.<namespace>.svc
```
4. **Check CA injection job logs**:
```bash
kubectl logs -n <namespace> job/<release>-webhook-ca-inject-<revision>
```
---
### Helm Hook Job Failures
**Symptoms:**
- `helm install` or `helm upgrade` hangs or fails
- Certificate generation errors
**Checks:**
1. **List hook jobs**:
```bash
kubectl get jobs -n <namespace> | grep webhook
```
2. **Check job logs**:
```bash
# Certificate generation
kubectl logs -n <namespace> job/<release>-webhook-cert-gen-<revision>
# CA injection
kubectl logs -n <namespace> job/<release>-webhook-ca-inject-<revision>
```
3. **Check RBAC permissions**:
```bash
# Verify ServiceAccount exists
kubectl get sa -n <namespace> <release>-webhook-ca-inject
# Verify ClusterRole and ClusterRoleBinding exist
kubectl get clusterrole <release>-webhook-ca-inject
kubectl get clusterrolebinding <release>-webhook-ca-inject
```
4. **Manual cleanup**:
```bash
# Delete failed jobs
kubectl delete job -n <namespace> <release>-webhook-cert-gen-<revision>
kubectl delete job -n <namespace> <release>-webhook-ca-inject-<revision>
# Retry helm upgrade
helm upgrade <release> dynamo-platform -n <namespace>
```
---
### Validation Errors Not Clear
**Symptoms:**
- Webhook rejects resource but error message is unclear
**Solution:**
Check operator logs for detailed validation errors:
```bash
kubectl logs -n <namespace> deployment/<release>-dynamo-operator | grep "validate create\|validate update"
```
Webhook logs include:
- Resource name and namespace
- Validation errors with context
- Warnings for immutable field changes
---
### Stuck Deleting Resources
**Symptoms:**
- Resource stuck in "Terminating" state
- Webhook blocks finalizer removal
**Solution:**
The webhook automatically skips validation for resources being deleted. If stuck:
1. **Check if webhook is blocking**:
```bash
kubectl describe <resource-type> <name> -n <namespace>
# Look for events mentioning webhook errors
```
2. **Temporarily disable webhook**:
```bash
# Option 1: Delete ValidatingWebhookConfiguration
kubectl delete validatingwebhookconfiguration <name>
# Option 2: Set failurePolicy to Ignore
kubectl patch validatingwebhookconfiguration <name> \
--type='json' \
-p='[{"op": "replace", "path": "/webhooks/0/failurePolicy", "value": "Ignore"}]'
```
3. **Delete resource again**:
```bash
kubectl delete <resource-type> <name> -n <namespace>
```
4. **Re-enable webhook**:
```bash
helm upgrade <release> dynamo-platform -n <namespace>
```
---
## Best Practices
### Production Deployments
1. **Keep webhooks enabled** (default) for real-time validation
2. **Use `failurePolicy: Fail`** (default) to ensure validation is enforced
3. **Monitor webhook latency** - Validation adds ~10-50ms per resource operation
4. **Use cert-manager** for automated certificate lifecycle in large deployments
5. **Test webhook configuration** in staging before production
### Development Deployments
1. **Disable webhooks** for rapid iteration if needed
2. **Use `failurePolicy: Ignore`** if webhook availability is problematic
3. **Keep automatic certificates** (simpler than cert-manager for dev)
### Multi-Tenant Deployments
1. **Deploy one cluster-wide operator** for platform-wide validation
2. **Deploy namespace-restricted operators** for tenant-specific namespaces
3. **Monitor lease health** to ensure coordination works correctly
4. **Use unique release names** per namespace to avoid naming conflicts
---
## Additional Resources
- [Kubernetes Admission Webhooks](https://kubernetes.io/docs/reference/access-authn-authz/extensible-admission-controllers/)
- [cert-manager Documentation](https://cert-manager.io/docs/)
- [Kubebuilder Webhook Tutorial](https://book.kubebuilder.io/cronjob-tutorial/webhook-implementation.html)
- [CEL Validation Rules](https://kubernetes.io/docs/reference/using-api/cel/)
---
## Support
For issues or questions:
- Check [Troubleshooting](#troubleshooting) section
- Review operator logs: `kubectl logs -n <namespace> deployment/<release>-dynamo-operator`
- Open an issue on GitHub
---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: "KVBM Architecture"
---
The KVBM serves as a critical infrastructure component for scaling LLM inference workloads efficiently. By cleanly separating runtime logic from memory management, and by enabling distributed block sharing, KVBM lays the foundation for high-throughput, multi-node, and memory-disaggregated AI systems.
![A block diagram showing a layered architecture view of Dynamo KV Block manager.](../../assets/img/kvbm-architecture.png)
**High level layered architecture view of Dynamo KV Block manager and how it interfaces with different components of LLM inference ecosystem**
The KVBM has three primary logical layers. The top layer, the LLM inference runtimes (TRTLLM, vLLM, and SGLang), integrates with the Dynamo KVBM module through a dedicated connector module. These connectors act as translation layers, mapping runtime-specific operations and events into the KVBM’s block-oriented memory interface. This decouples memory management from the inference runtime, enabling backend portability and providing memory tiering.
The middle layer, the KVBM layer, encapsulates the core logic of the KV block manager and serves as the runtime substrate for managing block memory. The KVBM adapter layer normalizes the representations and data layout of incoming requests across runtimes and forwards them to the core memory manager. The KVBM and the core modules implement the required internal functionality, such as table lookups, memory allocation, block layout management, block lifecycle and state transitions, and policy-based block reuse or eviction. The KVBM layer also provides abstractions that let external components override or augment its behavior.
The last layer, the NIXL layer, provides unified support for all data and storage transactions. NIXL enables P2P GPU transfers, RDMA and NVLink remote memory sharing, and dynamic block registration and metadata exchange, and it provides a plugin interface for storage backends.
NIXL integrates with several backends:
- Block memory (for example, GPU HBM, host DRAM, remote DRAM, or local SSD when exposed as a block device)
- Local file system (for example, POSIX)
- Remote file system (for example, NFS)
- Object stores (for example, S3-compatible)
- Cloud storage (for example, blob storage APIs)
**[NIXL](https://github.com/ai-dynamo/nixl/blob/main/docs/nixl.md)** abstracts away the registration and integration complexity of each backend through a customizable plugin architecture. It enables memory blocks to be published, serialized, and accessed remotely, allowing compute and memory to be disaggregated across nodes. Combined with the Dynamo KV Block Manager (KVBM), storage providers no longer need to retrofit or optimize individual LLM inference engines. Instead, they can focus on tuning their own stack and providing optimized endpoints, knowing that integration is smooth, standardized, and efficient. And for those who *do* want to go further, Dynamo KVBM offers a clean separation of concerns, making custom optimization not only possible, but simple.
---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: "Understanding KVBM components"
---
The KVBM design takes inspiration from the KV block managers used in vLLM and SGLang, with an added influence from memory tiering strategies long common in general GPU programming. For more details, see [KVBM Reading](kvbm-reading.md). The figure below illustrates the internal components of KVBM.
![Internal Components of Dynamo KVBM. ](../../assets/img/kvbm-components.png)
**Internal Components of Dynamo KVBM**
## KVBM Components
### Core
- **KvBlockManager**: Public facade. Constructs/owns the internal state and exposes the pools and onboarding APIs.
- **Scheduler**: Gates transfer execution relative to model progress (iteration/layer completion) when integrated with a framework connector (e.g., vLLM V1).
- **Config (config.rs)**: Describes model dims, page size, layout choices, and runtime flags used to build pools and layouts.
- **KvBlockManagerState**: Central object wiring together layouts, storage backends, and pools; owns the OffloadManager, metrics, and events hooks.
- **Events/Metrics**: Observability components emitting counters/gauges and event hooks for integration/testing.
### Layouts and Blocks
- **LayoutConfig & LayoutType**: Translate tensor shapes into KV cache layouts (layer-separated or fully-contiguous), including block counts and geometry.
- **Blocks & Metadata**: Typed block handles (mutable/immutable), metadata (e.g., priority), and views by layer/outer dims; used to allocate, register, and match by `sequence_hash`.
### Transfer Manager
- **TransferManager**: Asynchronous transfer orchestrator with per-path queues (Device→Host, Host→Disk, Host→Device, Disk→Device).
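To make the "per-path queues" idea concrete, here is a deliberately simplified, synchronous sketch. The real TransferManager is asynchronous and lives in the Rust codebase; `TransferQueues` and its methods are hypothetical names, and only the four paths named above are modeled.

```python
from collections import deque

# The four transfer paths listed above, as (source, destination) pairs.
PATHS = (
    ("Device", "Host"),
    ("Host", "Disk"),
    ("Host", "Device"),
    ("Disk", "Device"),
)


class TransferQueues:
    """One FIFO queue per supported transfer path (illustrative only)."""

    def __init__(self):
        self.queues = {path: deque() for path in PATHS}

    def enqueue(self, src, dst, block_id):
        path = (src, dst)
        if path not in self.queues:
            raise ValueError(f"unsupported transfer path: {src} -> {dst}")
        self.queues[path].append(block_id)

    def next_for(self, src, dst):
        """Pop the oldest pending block for a path, or None if it is empty."""
        queue = self.queues[(src, dst)]
        return queue.popleft() if queue else None


tq = TransferQueues()
tq.enqueue("Device", "Host", "block-1")
tq.enqueue("Device", "Host", "block-2")
print(tq.next_for("Device", "Host"))
```

Keeping a separate queue per path means a backlog of slow Host→Disk writes does not delay Host→Device onboarding in this model.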
### Storage & Pools
- **Device Pool(G1)**: GPU-resident KV block pool. Allocates mutable GPU blocks, registers completed blocks (immutable), serves lookups by sequence hash, and is the target for onboarding (Host→Device, Disk→Device).
- **Host Pool(G2)**: CPU pinned-memory KV block pool. Receives Device offloads (Device→Host), can onboard to Device (Host→Device), and offloads to Disk. Uses pinned (page-locked) memory for efficient CUDA transfers and NIXL I/O.
- **Disk Pool(G3)**: Local SSD NVMe-backed KV block pool. Receives Host offloads (Host→Disk) and provides blocks for onboarding to Device (Disk→Device). NIXL descriptors expose file offsets/regions for zero-copy I/O and optional GDS.
## KVBM DataFlows
![KVBM Data Flows. ](../../assets/img/kvbm-data-flows.png)
**KVBM Data Flows from device to other memory hierarchies**
**Device → Host (Offload)**
* Triggered when the connector scheduler explicitly requests an offload.
* Worker allocates a Host block and performs CUDA D2H/Custom Kernel copy.
* Host pool registers the new immutable block (dedup by sequence hash).
**Host → Disk (Offload)**
* Local Disk: NIXL Write via POSIX; GDS when available.
* Remote Disk (network file systems such as NFS, Lustre, or GPFS): NIXL Write via POSIX to the mounted file system; batching and concurrency are identical to the local case.
* Triggered on registered host blocks or explicit offload requests.
* Worker allocates a Disk block and performs NIXL Write (Host→Disk).
* Disk pool registers the new immutable block (dedup by sequence hash).
**Host → Device (Onboard)**
* Called to bring a host block into GPU memory.
* Worker uses provided Device targets and performs CUDA H2D/Custom Kernel copy.
* Device pool registers the new immutable block.
**Disk → Device (Onboard)**
* Called to bring a disk block directly into GPU memory.
* Worker uses provided Device targets and performs NIXL Read (Disk→Device), possibly via GDS.
* Device pool registers the new immutable block.
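Several of the flows above end with a pool registering an immutable block, deduplicated by sequence hash. A minimal sketch of that registration step, assuming a plain dictionary keyed by hash (`ImmutableBlockRegistry` is a hypothetical name, and real blocks are typed handles rather than strings):

```python
class ImmutableBlockRegistry:
    """Registers completed blocks, keeping one block per sequence hash."""

    def __init__(self):
        self._by_hash = {}

    def register(self, sequence_hash, block):
        """Register a block and return the canonical block for its hash.

        If a block with the same hash is already registered, the existing
        one is returned and the new one is not stored (deduplication).
        """
        return self._by_hash.setdefault(sequence_hash, block)

    def lookup(self, sequence_hash):
        return self._by_hash.get(sequence_hash)


registry = ImmutableBlockRegistry()
first = registry.register(0xABCD, "host-block-1")
duplicate = registry.register(0xABCD, "host-block-2")
print(first, duplicate)
```

Both calls return `host-block-1`: the second block carries the same sequence hash, so the registry keeps the original and the duplicate can be recycled.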
---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: "KVBM components"
---
The design of the KVBM is inspired by the vLLM and SGLang KV block managers, with a twist drawn from memory tiering designs long used in general GPU programming. See [KVBM Reading](kvbm-reading.md). The following figure shows the internal architecture of KVBM and how it works across workers using NIXL.
![Internal architecture and key modules in the Dynamo KVBM. ](../../assets/img/kvbm-internal-arch.png)
**Internal architecture and key modules in the Dynamo KVBM**
## KvBlockManager as Orchestration Layer
The `KvBlockManager<H, D>` acts as a coordinator across the host (CPU), device (GPU), and remote memory tiers by managing per-backend block pools and exposing consistent block lifecycle APIs. It tracks KV block locations across device memory (G1), CPU memory within and across nodes (G2), local/pooled SSDs (G3), and remote storage (G4). G1-G4 are the key tiers enabled by KVBM. Note that KVBM treats G4 storage as an opaque blob store and is unaware of its internal layout optimizations.
`KvBlockManager<H, D>` owns:
* A device-side `BlockPool<Device>`
* A host-side `BlockPool<Host>`
* A remote NIXL agent that supports communication and memory sharing across nodes
* A block set registry for remote lookup and import/export of block metadata
Implementation-wise, `KvBlockManagerState` holds the logic: it's initialized by `KvBlockManagerConfig`, which merges runtime, model, and layout configurations. `NixlOptions` injects remote awareness.
## Block Layout and Memory Mapping
Each block is a 2D array `[num_layers][page_size × inner_dim]`. The `BlockLayout` trait abstracts the memory layout. The default implementation, `FullyContiguous`, stores all layers for all blocks in one region with alignment-aware stride computation:
```none
block_stride_in_bytes = align_up(num_layers × layer_stride, alignment);
```
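As a worked example of the stride formula, the sketch below computes an aligned block stride. It assumes `layer_stride = page_size × inner_dim × dtype_bytes`, which follows from the block shape above but is not spelled out in the source; the function names and the sample dimensions are illustrative.

```python
def align_up(value: int, alignment: int) -> int:
    """Round value up to the next multiple of alignment."""
    if alignment <= 0:
        raise ValueError("alignment must be positive")
    return (value + alignment - 1) // alignment * alignment


def block_stride_in_bytes(num_layers, page_size, inner_dim, dtype_bytes, alignment):
    """Aligned size of one block; layer_stride definition is an assumption."""
    layer_stride = page_size * inner_dim * dtype_bytes
    return align_up(num_layers * layer_stride, alignment)


# 3 layers x (16 x 100 elements x 2 bytes) = 9600 bytes, padded to 256-byte alignment.
print(block_stride_in_bytes(num_layers=3, page_size=16, inner_dim=100,
                            dtype_bytes=2, alignment=256))
```

Here the unaligned size is 9600 bytes, which is not a multiple of 256, so the stride is padded up to 9728 bytes. Block `i` then starts at offset `i × 9728`.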
Both CPU and GPU pools share this memory layout, but they use storage-specific backends:
* `DeviceStorage` → CUDA device buffer
* `PinnedStorage` → page-locked host memory
* `SystemStorage` → CPU heap memory (fallback/test)
* `NixlStorage` → remote memory through NIXL RDMA handles (includes storage)
Each layout is constructed using a `LayoutConfig`, and storage is either passed directly or allocated using a StorageAllocator.
## BlockPool and Memory Pools (Active and Inactive)
Each `BlockPool<T>` (where `T` is `DeviceStorage`, `PinnedStorage`, and so forth) tracks two sub-pools:
* `ActivePool`: Contains blocks currently in use by sequences
* `InactivePool`: Recycled blocks ready for allocation; think free list
When a token block is requested (for example, `get_mutable_block()`), the allocator pops from `InactivePool`, transitions its state, and returns a writable handle. On sequence commit or eviction, the system resets blocks and returns them to the inactive pool.
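The active/inactive split can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the real `BlockPool<T>`: the class name `ToyBlockPool`, the `release()` method, and the use of integer block IDs are hypothetical.

```python
from collections import deque


class ToyBlockPool:
    """Toy model of a pool with an active set and an inactive free list."""

    def __init__(self, num_blocks: int):
        self.inactive = deque(range(num_blocks))  # free list of block IDs
        self.active = set()                       # blocks in use by sequences

    def get_mutable_block(self) -> int:
        # Pop a free block and mark it as in use.
        if not self.inactive:
            raise RuntimeError("no free blocks available")
        block_id = self.inactive.popleft()
        self.active.add(block_id)
        return block_id

    def release(self, block_id: int) -> None:
        # Hypothetical: return a block to the free list on commit or eviction.
        self.active.remove(block_id)
        self.inactive.append(block_id)
```

A real pool would also reset block contents and state before reuse; the sketch tracks only which pool each block belongs to.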
The state machine (`BlockState`) that tracks the block lifecycle transitions includes:
| State | Description | Ownership | Valid Actions/Transitions |
| ----- | ----- | ----- | ----- |
| Reset | Block hasn't been initialized or was reset. No associated sequence. | Held in InactivePool, reusable | init_sequence(salt_hash) → Partial |
| Partial | Block is being filled with tokens for a new sequence. In-progress. | Owned by the sequence creator | `add_token()` / `add_tokens()` (accumulate)<br/>`commit()` → Complete<br/>`reset()` → Reset |
| Complete | Block is fully filled with token data but not yet visible to others. | Still owned by creator thread | `register()` → Registered<br/>`reset()` → Reset |
| Registered | Block is finalized and visible for reuse. Available in the deduplication cache and usable for lookups. | Shared ownership (global registry) | Automatic `drop()` → triggers Remove event and transitions to Reset |
This table lists the valid KVBM transitions:
| From → To | Trigger | Validation |
| ----- | ----- | ----- |
| Reset → Partial | init_sequence(salt_hash) | Must not be in use |
| Partial → Complete | commit() | Must be full |
| Complete → Registered | register() | Must be finalized |
| Registered → Reset | Drop of RegistrationHandle | Automatic |
| Partial → Reset | Aborted sequence | Explicit or drop |
| Complete → Reset | Invalidated | Explicit or drop |
Consider this example lifecycle of a block in the KVBM; in it, a sequence requests a new KV block:
1. Allocator pops from InactivePool → Block is in Reset
2. `init_sequence()` → Transitions to Partial
3. Tokens are appended → State remains Partial
4. On full → `commit()` → State becomes Complete
5. `register()` → Block is hashed and moved to Registered. The block can now be used for lookups.
6. On eviction or end-of-life → `drop()` of RAII handle returns block to Reset
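The lifecycle above can be modeled as a small state machine. The Python sketch below encodes the transitions from the tables and rejects anything else. It is illustrative only: `ToyBlock`, `capacity`, and the use of Python's built-in `hash()` in place of the real sequence hash are assumptions, and the automatic drop is modeled as an explicit `reset()`.

```python
from enum import Enum


class BlockState(Enum):
    RESET = "Reset"
    PARTIAL = "Partial"
    COMPLETE = "Complete"
    REGISTERED = "Registered"


# Valid (from, to) pairs, taken from the transition table above.
ALLOWED = {
    (BlockState.RESET, BlockState.PARTIAL),
    (BlockState.PARTIAL, BlockState.COMPLETE),
    (BlockState.COMPLETE, BlockState.REGISTERED),
    (BlockState.REGISTERED, BlockState.RESET),
    (BlockState.PARTIAL, BlockState.RESET),
    (BlockState.COMPLETE, BlockState.RESET),
}


class ToyBlock:
    def __init__(self, capacity: int):
        self.capacity = capacity  # hypothetical: tokens per block
        self.state = BlockState.RESET
        self.tokens = []
        self.salt_hash = None

    def _move(self, target: BlockState) -> None:
        if (self.state, target) not in ALLOWED:
            raise ValueError(f"invalid transition {self.state.value} -> {target.value}")
        self.state = target

    def init_sequence(self, salt_hash: int) -> None:
        self._move(BlockState.PARTIAL)
        self.salt_hash = salt_hash

    def add_token(self, token: int) -> None:
        if self.state is not BlockState.PARTIAL or len(self.tokens) >= self.capacity:
            raise ValueError("block is not accepting tokens")
        self.tokens.append(token)

    def commit(self) -> None:
        # Validation from the table: a Partial block must be full.
        if self.state is BlockState.PARTIAL and len(self.tokens) != self.capacity:
            raise ValueError("block must be full before commit")
        self._move(BlockState.COMPLETE)

    def register(self) -> int:
        self._move(BlockState.REGISTERED)
        # Stand-in for the real sequence hash (assumption).
        return hash((self.salt_hash, tuple(self.tokens)))

    def reset(self) -> None:
        # Models both the explicit reset() and the drop of a registered block.
        self._move(BlockState.RESET)
        self.tokens.clear()
        self.salt_hash = None
```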
## Lifecycle Management using RAII and Event Plane
The system uses RAII for memory lifecycle management. Every block holds metadata and registration state, and registration is coupled with an `EventManager`. On registration and drop:
* Creating a `PublishHandle` triggers a Register event
* Dropping it triggers Remove events
This pattern ensures consistency for shared memory tracking across workers without requiring explicit deallocation logic. The events are propagated in the Dynamo Events plane. Any Dynamo component subscribed to the events plane can listen to these changes. Note that even the storage provider can subscribe to the events plane and create an internal prefix tree representation that is tailored and optimized for the specific platform.
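The pairing of Register and Remove events with handle lifetime can be approximated in Python. Python has no deterministic destructor equivalent to Rust's `Drop`, so this sketch uses a context manager as a stand-in. `ToyEventLog` and `ToyPublishHandle` are hypothetical names, and the in-memory list replaces the real events plane.

```python
class ToyEventLog:
    """Stand-in for the events plane: records published events in order."""

    def __init__(self):
        self.events = []

    def publish(self, kind: str, block_hash: int) -> None:
        self.events.append((kind, block_hash))


class ToyPublishHandle:
    """Publishes Register on creation and Remove exactly once on release."""

    def __init__(self, log: ToyEventLog, block_hash: int):
        self._log = log
        self.block_hash = block_hash
        self._released = False
        log.publish("Register", block_hash)

    def release(self) -> None:
        # Idempotent, so a double release cannot publish a second Remove.
        if not self._released:
            self._released = True
            self._log.publish("Remove", self.block_hash)

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.release()
        return False  # never swallow exceptions


log = ToyEventLog()
with ToyPublishHandle(log, block_hash=42) as handle:
    events_while_held = list(log.events)
```

Because the Remove event is tied to leaving the scope, callers cannot forget to publish it, which is the property the RAII design provides.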
## Remote Memory Integration using NIXL
The NIXL agent exposes remote memory buffers using `NixlBlockSet`, `RemoteBlocks`, and layout descriptors. Key operations include:
* `nixl_register()`: Registers memory region with NIXL runtime
* `serialize() / deserialize()`: Converts layout and memory into transferable descriptors
* `import_remote_blockset()`: Loads remote node's block layouts into the manager
* `get_remote_blocks_mutable()`: Fetches transferable memory views from another node
`RemoteBlocks` is a lightweight abstraction over shared memory for cross-node block usage (through UCX or other backends).
The left side of the figure in [KVBM Components](kvbm-components.md) illustrates a bidirectional remote memory registration and layout synchronization protocol between workers (for example, Worker 1 and Worker 2) using NIXL. The following steps break down the process:
1. *Agent Creation & Memory Registration:*
Each worker independently sets up a NixlAgent:
* Registers its memory regions (that is, device memory) through `nixl_register()`.
* These regions correspond to blocks managed in the local BlockPool.
Once the worker registers the memory, NIXL creates remote-accessible descriptors, which it binds to the memory layout.
2. *Metadata exchange:*
After memory registration, workers exchange serialized layout metadata, encapsulated in a `SerializedNixlBlockLayout`.
Why is this step critical?
* LLM inference workloads often differ in *tensor parallel (TP)* configurations:
  * Worker 1 might have TP=4, while Worker 2 has TP=8.
  * Thus, even if both systems use similar `FullyContiguous` layouts, their internal slicing and alignment assumptions differ.
* The metadata exchange bridges this semantic mismatch by sharing:
  * `LayoutConfig` (`num_layers`, `page_size`, `inner_dim`, `dtype`)
  * `BlockSetID`
  * Base address + stride information (including alignment)
  * Device ID + memory type (host/device)
* Once the workers share metadata, each can reconstruct the layout on its side using `deserialize()`.
This enables NIXL to:
* Understand where each layer/block resides
* Perform correct gather-scatter operations during RDMA-like transfers
Without this step, remote fetches would result in data corruption or misaligned tokens.
3. *Serialization & Deserialization: Making Layouts Portable*
In the serialization stage, KVBM exports the layout: `FullyContiguous::serialize()` encodes:
* `FullyContiguousConfig`
* `base_offset`
* Physical memory descriptors (`NixlStorage`), including:
  * Memory type (VRAM, DRAM)
  * Address & size
  * Device ID
The system sends this using NIXL transfer and then injects it into a KVBM scheduler state. In the deserialization stage, `SerializedNixlBlockLayout::deserialize()` rehydrates this into:
* A fully reconstructed memory layout view
* Local representation of a remote memory slice with correct offsets and size semantics
It also enables direct access to remote memory with consistent logical semantics.
This guarantees that even across different system configurations (hardware or LLM shape), both parties agree on the memory view for each KV block.
4. *Ownership handles and lifetime tracking*
Memory ownership in NIXL is tightly coupled with RAII-based handles:
* Registering a block returns a `PublishHandle`, which wraps a `RegistrationHandle`.
* On drop of this handle, an automatic Remove event is published, which:
  * Deregisters the block from the NIXL layer
  * Removes it from the remote block registry
* This ensures that, once the block is evicted from the cache or no longer used in inference, all references are invalidated cleanly across nodes.
This mechanism avoids:
* Stale memory access
* Dangling pointers on GPU or host
* Manual deregistration bugs
The system can batch and publish registration events using a Publisher, optimizing performance under high concurrency.
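The metadata exchange in steps 2 and 3 above can be sketched as a serialize/deserialize round trip. The Python below is a minimal illustration, not the actual `SerializedNixlBlockLayout` wire format: JSON encoding, the `ToySerializedLayout` class, the `dtype_bytes` field, and `layer_address()` are all assumptions chosen to show why both sides need the same layout description.

```python
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class ToyLayoutConfig:
    num_layers: int
    page_size: int
    inner_dim: int
    dtype_bytes: int  # assumption: dtype reduced to its element size


@dataclass(frozen=True)
class ToySerializedLayout:
    config: ToyLayoutConfig
    block_set_id: int
    base_address: int
    block_stride: int
    device_id: int
    memory_type: str  # for example "VRAM" or "DRAM"

    def serialize(self) -> bytes:
        # asdict() recurses into the nested config dataclass.
        return json.dumps(asdict(self)).encode("utf-8")

    @staticmethod
    def deserialize(payload: bytes) -> "ToySerializedLayout":
        raw = json.loads(payload.decode("utf-8"))
        raw["config"] = ToyLayoutConfig(**raw["config"])
        return ToySerializedLayout(**raw)

    def layer_address(self, block_idx: int, layer_idx: int) -> int:
        # Hypothetical helper: where a given layer of a given block lives.
        c = self.config
        layer_stride = c.page_size * c.inner_dim * c.dtype_bytes
        return self.base_address + block_idx * self.block_stride + layer_idx * layer_stride


local = ToySerializedLayout(
    config=ToyLayoutConfig(num_layers=2, page_size=16, inner_dim=8, dtype_bytes=2),
    block_set_id=1,
    base_address=4096,
    block_stride=512,
    device_id=0,
    memory_type="VRAM",
)
remote_view = ToySerializedLayout.deserialize(local.serialize())
```

After the round trip, the receiving side computes the same address for any block and layer as the owner does, which is what lets a transfer target the right bytes.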
## Storage backends and pluggability
You can integrate KVBM with a storage backend by extending or wrapping `NixlEnabledStorage` to support cross-node RDMA registration. All layouts and block pools are generic over these backends, allowing for fine-grained control over memory tiers. We defer detailed integration guidance, since we collaborate with storage partners to simplify and standardize these integration paths.
```mermaid
---
title: Example KVBM System Architecture
---
flowchart TD
A["Distributed Inference Engine"] --> B["Dynamo KV Block Manager"]
B --> C["NIXL Storage Agent<br/>- Volume registration<br/>- get()/put() abstraction"]
B --> D["Event Plane<br/>- NATS-based Pub/Sub<br/>- StoreEvent / RemoveEvent"]
C --> E["G4 Storage Infrastructure<br/>(SSD, Object store, etc.)<br/>- Store KV blocks"]
D --> F["Storage Provider Subscriber<br/>- Parse Events<br/>- Build fast tree/index<br/>- Optimize G4 tiering"]
```
For now, the following breakdown provides a high-level understanding of how KVBM interacts with external storage using the NIXL storage interface and the Dynamo Event Plane:
### NIXL Storage Interface (for Backend Integration)
The NIXL interface abstracts volume interaction and decouples it from mounting, metadata tracking, or direct system I/O. It provides:
* `registerVolume(descriptor)`: Register a logical volume for KV cache data.
* `unregisterVolume()`: Cleanly deregister and release volume mappings.
* `get()` / `put()`: Block-level APIs used by KVBM to fetch and store token blocks.
These abstractions allow backends to be integrated without tying into the host's file system stack, enabling safe interaction with block devices, local filesystems, and RDMA-capable volumes. Please note that these APIs are still being finalized.
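Because these APIs are not final, the following Python sketch is only a mental model of the register/put/get flow, backed by an in-memory dictionary. Nothing here is the NIXL API: the snake_case method names, every parameter, and the `ToyVolumeStore` class are assumptions made for illustration.

```python
class ToyVolumeStore:
    """In-memory mock of a volume-oriented block store (not the NIXL API)."""

    def __init__(self):
        self._volumes = {}  # descriptor -> {sequence_hash: bytes}

    def register_volume(self, descriptor: str) -> None:
        # Registering twice keeps existing contents.
        self._volumes.setdefault(descriptor, {})

    def unregister_volume(self, descriptor: str) -> None:
        # Drop the mapping; later get()/put() on it will fail.
        self._volumes.pop(descriptor, None)

    def put(self, descriptor: str, sequence_hash: int, data: bytes) -> None:
        self._volumes[descriptor][sequence_hash] = bytes(data)

    def get(self, descriptor: str, sequence_hash: int):
        # Returns None when the block is not present in the volume.
        return self._volumes[descriptor].get(sequence_hash)


store = ToyVolumeStore()
store.register_volume("vol-a")
store.put("vol-a", 0xBEEF, b"kv-block-bytes")
fetched = store.get("vol-a", 0xBEEF)
```

The point of the sketch is the separation of concerns: the caller addresses blocks by logical volume and hash, and never sees mounts or file paths.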
### Dynamo Event Plane (Pub/Sub Coordination Layer)
To support external storage optimizations without modifying KVBM logic, we provide an **event plane** built on NATS.io that emits lifecycle events for all block operations. Two events are emitted:
* StoreEvent: Emitted when a KV block is registered.
* RemoveEvent: Emitted when a KV block is released or evicted.
Each KVEvent (\~100 bytes) contains:
* sequence_hash: Unique identifier of the KV block
* prefix_hash: Prefix grouping for query-level aggregation
* block_size: Size in bytes
* storage_location: Logical volume identifier
* event_type: Store or Remove
* extra_metadata: Reserved fields for partner-specific optimization
For scalability, the system batches and publishes these events periodically (for example, every \~10s, or dynamically based on system load).
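The event fields and the batching behavior can be sketched as follows. This is a hedged illustration rather than the real publisher: the field types, the `ToyBatchingPublisher` class, its `max_batch` and `max_age_s` parameters, and the explicit `now` argument (used instead of a wall clock so the behavior is deterministic) are all assumptions.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class KVEvent:
    sequence_hash: int
    prefix_hash: int
    block_size: int          # size in bytes
    storage_location: str    # logical volume identifier
    event_type: str          # "Store" or "Remove"
    extra_metadata: dict = field(default_factory=dict)


class ToyBatchingPublisher:
    """Buffers events and flushes when the batch is full or too old."""

    def __init__(self, send, max_batch: int = 64, max_age_s: float = 10.0):
        self._send = send            # callable that receives a list of events
        self._max_batch = max_batch
        self._max_age_s = max_age_s
        self._pending = []
        self._oldest = None          # timestamp of the first buffered event

    def publish(self, event: KVEvent, now: float) -> None:
        if not self._pending:
            self._oldest = now
        self._pending.append(event)
        if len(self._pending) >= self._max_batch or now - self._oldest >= self._max_age_s:
            self.flush()

    def flush(self) -> None:
        if self._pending:
            self._send(list(self._pending))
            self._pending.clear()
            self._oldest = None


def make_event(seq: int, kind: str = "Store") -> KVEvent:
    # Hypothetical helper for building example events.
    return KVEvent(seq, prefix_hash=1, block_size=4096, storage_location="vol-a", event_type=kind)
```

In a deployment, `send` would publish the batch to the pub/sub channel; here it can be any callable, such as a list's `append`.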
### A conceptual design of a storage advisor
This section provides an overview for the storage provider who is interested in integrating as a custom backend to KVBM and providing optimized performance. **Please note, this is optional for KVBM integration with a backend.**
External storage systems are not tightly coupled with Dynamo's execution pipeline. Instead, they passively observe KV block lifecycle events through a subscription model:
* Storage volumes are pre-provisioned and mounted by the storage provider.
* These volumes are then registered with Dynamo through the NIXL Storage Agent using registerVolume() APIs. Dynamo itself does not manage mounts or provisioning.
* The Dynamo KV Block Manager interacts only with logical block-level APIs (that is, get() and put()).
* In parallel, the Event Plane asynchronously broadcasts KV lifecycle events using a NATS-based pub/sub channel.
* Storage vendors implement a lightweight subscriber process that listens to these events without interfering with the KV Manager's runtime behavior.
* This decoupling ensures that external storage systems can optimize block placement and lifecycle tracking without modifying or instrumenting the core Dynamo codebase.
To enable fast lookup and dynamic tiering, storage vendors may build internal data structures from the received event stream. Here is a high-level conceptual design:
* On receiving a StoreEvent, the storage system:
  * Inserts a record into an internal prefix tree, hash map, or LRU index.
  * This record includes the prefix_hash and sequence_hash, which logically identify the token block and its grouping.
  * Associated metadata (for example, block_size, storage_location) is also captured.
* On receiving a RemoveEvent, the system:
  * Deletes or prunes the corresponding record from its index.
  * Optionally triggers cleanup or tier migration workflows.
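A subscriber following this design could keep a hash-map index like the one sketched below. This is one possible shape under stated assumptions, not a prescribed implementation: events are modeled as plain dictionaries with the field names listed earlier, and `ToyBlockIndex` is a hypothetical name.

```python
class ToyBlockIndex:
    """Hash-map index over live KV blocks, updated from Store/Remove events."""

    def __init__(self):
        self.by_sequence = {}  # sequence_hash -> record
        self.by_prefix = {}    # prefix_hash -> set of sequence_hash

    def apply(self, event: dict) -> None:
        seq = event["sequence_hash"]
        kind = event["event_type"]
        if kind == "Store":
            self.by_sequence[seq] = {
                "prefix_hash": event["prefix_hash"],
                "block_size": event["block_size"],
                "storage_location": event["storage_location"],
            }
            self.by_prefix.setdefault(event["prefix_hash"], set()).add(seq)
        elif kind == "Remove":
            # Tolerate a Remove for a block that was never indexed.
            record = self.by_sequence.pop(seq, None)
            if record is not None:
                group = self.by_prefix[record["prefix_hash"]]
                group.discard(seq)
                if not group:
                    del self.by_prefix[record["prefix_hash"]]
        else:
            raise ValueError(f"unknown event type: {kind}")

    def live_bytes(self) -> int:
        # Hypothetical helper: total size of blocks currently tracked as live.
        return sum(r["block_size"] for r in self.by_sequence.values())


index = ToyBlockIndex()
index.apply({"sequence_hash": 10, "prefix_hash": 1, "block_size": 4096,
             "storage_location": "vol-a", "event_type": "Store"})
index.apply({"sequence_hash": 11, "prefix_hash": 1, "block_size": 4096,
             "storage_location": "vol-a", "event_type": "Store"})
index.apply({"sequence_hash": 10, "event_type": "Remove"})
```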
This event-driven indexing allows the storage system to track which KV blocks are live and where they belong—enabling low-latency lookup, efficient space reclamation, and multi-tier coordination. With real-time visibility into KV block usage patterns, the storage system can implement smart tiering policies, such as:
* Hot block promotion: Frequently accessed KV blocks can be migrated to fast SSD volumes.
* Cold block demotion: Infrequently used blocks can be demoted to slower storage (for example, HDDs, cloud object storage).
* Proactive compaction: If block sizes or prefix patterns indicate fragmentation, the storage backend can coalesce or rewrite blocks.
These optimizations are performed entirely outside of Dynamo, with the assumption that storage providers adhere to SLA guarantees and volume availability.
Critically, this entire system is designed to be non-intrusive:
* The Dynamo KV Block Manager remains agnostic to how data is stored or optimized.
* The Event Plane doesn't block or intercept any critical path of inference.
* Storage vendors are given the freedom to innovate and optimize without requiring changes to the inference runtime.
This design ensures that performance, resilience, and extensibility scale independently across the KV layer and the storage backend layer.
---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: "KVBM Integrations"
---
KVBM integrates with inference frameworks (vLLM, TRT-LLM, SGLang) through Connector APIs to influence KV caching behavior, scheduling, and forward pass execution.
The interface has two components: Scheduler and Worker. The Scheduler (leader) orchestrates KV block offload and onboard, builds metadata that specifies the transfers for the workers to perform, and maintains hooks for handling asynchronous transfer completion. The Worker reads the metadata built by the Scheduler (leader) and performs asynchronous onboarding and offloading at the end of the forward pass.
## Typical KVBM Integrations
The following figure shows the typical integration of KVBM with inference frameworks, using vLLM as an example.
![vLLM KVBM Integration ](../../assets/img/kvbm-integrations.png)
**vLLM KVBM Integration**
## How to run KVBM with Frameworks
* Instructions to [run KVBM in vLLM](vllm-setup.md)
* Instructions to [run KVBM with TRTLLM](trtllm-setup.md)
## Onboarding
![Onboarding blocks from Host to Device](../../assets/img/kvbm-onboard-host2device.png)
**Onboarding blocks from Host to Device**
![Onboarding blocks from Disk to Device](../../assets/img/kvbm-onboard-disk2device.png)
**Onboarding blocks from Disk to Device**
## Offloading
![Offloading blocks from Device to Host&Disk](../../assets/img/kvbm-offload.png)
**Offloading blocks from Device to Host&Disk**
---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: "KV Block Manager"
---
The Dynamo KV Block Manager (KVBM) is a scalable runtime component
designed to handle memory allocation, management, and remote sharing of
Key-Value (KV) blocks for inference tasks across heterogeneous and
distributed environments. It acts as a unified memory layer for
frameworks like vLLM, SGLang, and TRT-LLM.
It offers:
- A **unified memory API** that spans GPU memory (in the future), pinned
host memory, remote RDMA-accessible memory, local or distributed pools
of SSDs, and remote file/object/cloud storage systems.
- Support for evolving **block lifecycles** (allocate → register →
match) with event-based state transitions that storage can subscribe
to.
- Integration with **NIXL**, a dynamic memory exchange layer used for
remote registration, sharing, and access of memory blocks over
RDMA/NVLink.
The Dynamo KV Block Manager serves as a reference implementation that
emphasizes modularity and extensibility. Its pluggable design enables
developers to customize components and optimize for specific
performance, memory, and deployment needs.