> ⚠️ **Experimental Feature**: Dynamo Snapshot is currently in **beta/preview**. The DaemonSet runs in privileged mode to perform CRIU operations. See [Prerequisites](#prerequisites) for security considerations.
This Helm chart installs the namespace-scoped checkpoint/restore infrastructure used by NVIDIA Dynamo:
- `snapshot-agent` DaemonSet on GPU nodes
- `snapshot-pvc` checkpoint storage, or wiring to an existing PVC
- namespace-scoped RBAC
- the seccomp profile required by CRIU

**Note:**
- Each namespace gets its own isolated checkpoint infrastructure with namespace-scoped RBAC
- **Supports vLLM and SGLang backends** (TensorRT-LLM support planned)
Snapshot storage is namespace-local. Install this chart in every namespace where you want checkpoint and restore.
⚠️ **Security Warning**: The Dynamo Snapshot DaemonSet runs in **privileged mode** with `hostPID`, `hostIPC`, and `hostNetwork` to perform CRIU checkpoint/restore operations. Workload pods do not need privileged mode. Only deploy in environments where a privileged DaemonSet is acceptable.
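For reviewers assessing that risk, the settings named above map onto standard Kubernetes pod-spec fields. The fragment below is a minimal sketch of how those fields appear in a DaemonSet pod template. It is not copied from the chart's templates; the container name is a hypothetical placeholder, and the chart may set additional fields.

```yaml
# Illustrative fragment only: standard Kubernetes fields that correspond to
# the privileges described above. Not taken from the chart's templates.
spec:
  template:
    spec:
      hostPID: true       # share the node's process ID namespace
      hostIPC: true       # share the node's IPC namespace
      hostNetwork: true   # share the node's network namespace
      containers:
        - name: agent     # hypothetical container name
          securityContext:
            privileged: true
```

Clusters that enforce pod security policies restricting these fields would likely need an explicit exception for each namespace where the chart is installed.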
## Prerequisites
- Kubernetes 1.21+
- **x86_64 (amd64) nodes only** for the snapshot agent and placeholder images
- GPU nodes with NVIDIA runtime (`nvidia` runtime class)
- NVIDIA driver 580.xx or newer on the target GPU nodes
- containerd runtime (for container inspection; CRIU is bundled in Dynamo Snapshot images)
- NVIDIA Dynamo operator installed (cluster-wide or namespace-scoped)
- RWX (ReadWriteMany) storage class for multi-node deployments
- **Security clearance for privileged DaemonSet** (the Dynamo Snapshot agent runs privileged with hostPID/hostIPC/hostNetwork)
## Installation
> **Note:** The Dynamo Snapshot Helm chart is not yet published to a public Helm repository. For now, you must build and deploy from source.
### Building from Source
```bash
# Set environment
export NAMESPACE=my-team # Your target namespace
export DOCKER_SERVER=your-registry.com/ # Your container registry
```
Each namespace will have its own isolated checkpoint storage.
If your cluster does not use a default storage class, also set `storage.pvc.storageClass`.
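As a rough sketch, the storage settings for this chart could be kept in a values file instead of being passed as individual `--set` flags. The key names (`storage.pvc.storageClass`, `storage.pvc.accessMode`, `storage.pvc.size`) are the ones used in these docs; the file name and storage class name are placeholders, and the size is only an example.

```yaml
# values-storage.yaml (hypothetical file name)
storage:
  pvc:
    storageClass: my-rwx-storage-class  # placeholder: a class that supports ReadWriteMany
    accessMode: ReadWriteMany
    size: 100Gi                         # example size
```

Check the chart's `values.yaml` for the authoritative defaults before relying on any of these values.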
Keep `storage.pvc.accessMode=ReadWriteMany` for this chart layout. The DaemonSet mounts the same PVC on each eligible node, so a shared `ReadWriteOnce` claim only works when the agent runs on one node.
If you already have a PVC, keep the chart in "use existing PVC" mode:
Do not set `storage.pvc.create=true` when reusing an existing checkpoint PVC.
## Verification
```bash
# Check PVC
kubectl get pvc snapshot-pvc -n my-team
# Check DaemonSet
kubectl get daemonset -n my-team
# Check DaemonSet pods are running
kubectl get pods -n my-team -l app.kubernetes.io/name=snapshot
```
| Parameter | Description | Default |
|-----------|-------------|---------|
| `daemonset.image.tag` | Snapshot agent image tag | `1.0.0` |
| `daemonset.imagePullSecrets` | Image pull secrets for the agent | `[{name: ngc-secret}]` |

See [values.yaml](./values.yaml) for the complete configuration surface.
## Troubleshooting
### DaemonSet pods not starting
Check if GPU nodes have the correct labels and runtime class:
## End To End
Once the chart is installed, use the snapshot guide to deploy a snapshot-capable `DynamoGraphDeployment`, wait for the checkpoint to become ready, and then scale the worker to verify restore:
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: Snapshot
---
> ⚠️ **Experimental Feature**: Dynamo Snapshot is currently in **preview** and may only be functional in some k8s cluster setups. The Dynamo Snapshot DaemonSet runs in privileged mode to perform CRIU operations. See [Limitations](#limitations) for details.
**Dynamo Snapshot** is an experimental infrastructure for fast-starting GPU applications in Kubernetes using CRIU (Checkpoint/Restore in User-space) and NVIDIA's cuda-checkpoint utility. Dynamo Snapshot dramatically reduces cold-start times for large models from minutes to seconds by capturing initialized application state and restoring it on-demand.
| Startup Type | Time | What Happens |
|--------------|------|--------------|
| **Cold Start** | ~1 min | Download model, load to GPU, initialize engine |
| **Warm Start** (restore from checkpoint) | ~10 sec | Restore from checkpoint tar |
> ⚠️ Restore time may vary depending on cluster configuration (storage bandwidth, GPU model, etc.)
## Prerequisites
- Dynamo Platform/Operator installed on a k8s cluster with **x86_64 (amd64)** GPU nodes
- NVIDIA driver 580.xx or newer on the target GPU nodes
- `ReadWriteMany` storage if you need cross-node restore
- vLLM or SGLang backend (TensorRT-LLM is not supported yet)
- Security clearance to run a privileged DaemonSet
## Quick Start
This guide assumes a normal Dynamo deployment workflow is already present on your Kubernetes cluster.
### 1. Build and push a placeholder image
Snapshot-enabled workers must use a placeholder image that wraps the normal runtime image with the restore tooling. If you do not already have one, build it with the snapshot placeholder target and push it to a registry your cluster can pull from:
This flow is defined in [deploy/snapshot/Makefile](../../deploy/snapshot/Makefile) and [deploy/snapshot/Dockerfile](../../deploy/snapshot/Dockerfile). The placeholder image preserves the base runtime entrypoint and command contract, and adds the CRIU, `cuda-checkpoint`, and `nsrestore` tooling needed for restore.
### 2. Enable checkpointing in the platform and verify it
Whether you are installing or upgrading `dynamo-platform`, the operator must have checkpointing enabled and must point at the same storage that the snapshot chart will use:
```yaml
dynamo-operator:
  checkpoint:
    enabled: true
    storage:
      type: pvc
      pvc:
        pvcName: snapshot-pvc
        basePath: /checkpoints
```
If the platform is already installed, verify that the operator config contains the checkpoint block:
```bash
OPERATOR_CONFIG=$(kubectl get deploy -n "${PLATFORM_NAMESPACE}" \
```
Verify that the rendered config includes `enabled: true` and the same PVC name and base path you plan to use for the snapshot chart.
For the full platform/operator configuration surface, see [deploy/helm/charts/platform/README.md](../../deploy/helm/charts/platform/README.md) and [deploy/helm/charts/platform/components/operator/values.yaml](../../deploy/helm/charts/platform/components/operator/values.yaml).
Cross-node restore requires `ReadWriteMany` storage. The chart defaults to that mode.
For better restore times, use a fast `ReadWriteMany` StorageClass for the checkpoint PVC. If you are reusing an existing checkpoint PVC, do not set `storage.pvc.create=true`; install the chart with `storage.pvc.create=false` and point `storage.pvc.name` at the existing PVC instead.
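A minimal sketch of the values for reusing an existing PVC, based on the two keys named above. The PVC name is a placeholder; since the operator must point at the same storage as the snapshot chart, it would normally match the `pvcName` configured for the operator.

```yaml
# Reuse an existing checkpoint PVC instead of letting the chart create one.
storage:
  pvc:
    create: false
    name: my-existing-checkpoint-pvc  # placeholder: name of the PVC that already exists
```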
Verify that the PVC and DaemonSet are ready:
```bash
kubectl get pvc snapshot-pvc -n ${NAMESPACE}
kubectl rollout status daemonset/snapshot-agent -n ${NAMESPACE}
```
For the full snapshot chart configuration surface, see [deploy/helm/charts/snapshot/README.md](../../deploy/helm/charts/snapshot/README.md) and [deploy/helm/charts/snapshot/values.yaml](../../deploy/helm/charts/snapshot/values.yaml).
### 4. Apply a snapshot-compatible `DynamoGraphDeployment`
This example is adapted from [examples/backends/vllm/deploy/agg.yaml](../../examples/backends/vllm/deploy/agg.yaml). The worker must use the placeholder image from step 1, and the checkpoint identity must describe the runtime state you want to reuse.
New worker pods for `VllmDecodeWorker` will restore from the ready checkpoint automatically.
## Checkpoint Configuration
### Auto Mode (Recommended)
The operator computes the checkpoint identity hash, looks for an existing `DynamoCheckpoint` with a matching `nvidia.com/snapshot-checkpoint-hash` label, and creates one if it does not find one:
```yaml
checkpoint:
  enabled: true
  mode: Auto
  identity:
    model: "meta-llama/Llama-3-8B"
    backendFramework: "vllm"  # or "sglang"
    tensorParallelSize: 1
    dtype: "bfloat16"
    maxModelLen: 4096
```
When a service uses checkpointing, DGD status reports the resolved `checkpointName`, `identityHash`, and `ready` fields under `.status.checkpoints.<service-name>`.
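For orientation, the status block for a service named `VllmDecodeWorker` might look roughly like the fragment below. Only the three field names and the `.status.checkpoints.<service-name>` path come from the text above; the values are illustrative rather than captured output, and the real status may carry additional fields.

```yaml
# Illustrative DGD status fragment; values are examples only.
status:
  checkpoints:
    VllmDecodeWorker:
      checkpointName: e5962d34ba272638
      identityHash: e5962d34ba272638
      ready: true
```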
### Manual Management and `checkpointRef`
Use `checkpointRef` when you want a service to restore from a specific `DynamoCheckpoint` CR:
```yaml
checkpoint:
  enabled: true
  checkpointRef: "qwen3-06b-vllm-prewarm"
```
This is useful when:
- You want to **pre-warm checkpoints** before creating DGDs
- You want **explicit control** over which checkpoint to use
`checkpointRef` resolves by `DynamoCheckpoint.metadata.name`, not by `status.identityHash`. A manual checkpoint can use any valid Kubernetes resource name.
If you are managing checkpoint CRs yourself, set `mode: Manual` on the service to prevent the operator from creating a new `DynamoCheckpoint` when identity-based lookup does not find one.
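A sketch of that setting, reusing the identity fields from the Auto mode example. The placement of `mode` alongside `identity` is assumed to mirror Auto mode.

```yaml
checkpoint:
  enabled: true
  mode: Manual  # identity lookup only; the operator does not create a DynamoCheckpoint
  identity:
    model: "meta-llama/Llama-3-8B"
    backendFramework: "vllm"
    tensorParallelSize: 1
    dtype: "bfloat16"
    maxModelLen: 4096
```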
```bash
# Check checkpoint status by CR name
kubectl get dynamocheckpoint qwen3-06b-vllm-prewarm -n ${NAMESPACE}
# Now create DGD referencing it
kubectl apply -f my-dgd.yaml -n ${NAMESPACE}
```
If you want `mode: Auto` DGDs to discover a manually created checkpoint by identity, add the label `nvidia.com/snapshot-checkpoint-hash=<identity-hash>` to that `DynamoCheckpoint`. Auto-created checkpoints already use that label, and currently use the same hash as the CR name.
## Checkpoint Identity
Checkpoints are uniquely identified by a **16-character SHA256 hash** (64 bits) of configuration that affects runtime state:
**Not included in hash** (don't invalidate checkpoint):
- `replicas`
- `nodeSelector`, `affinity`, `tolerations`
- `resources` (requests/limits)
- Logging/observability config
**Example with all fields:**
```yaml
checkpoint:
  enabled: true
  mode: Auto
  identity:
    model: "meta-llama/Llama-3-8B"
    backendFramework: "vllm"
    dynamoVersion: "0.9.0"
    tensorParallelSize: 1
    pipelineParallelSize: 1
    dtype: "bfloat16"
    maxModelLen: 8192
    extraParameters:
      enableChunkedPrefill: "true"
      quantization: "awq"
```
## DynamoCheckpoint CRD
The `DynamoCheckpoint` (shortname: `dckpt`) is a Kubernetes Custom Resource that manages checkpoint lifecycle.
**When to create a DynamoCheckpoint directly:**
- **Pre-warming:** Create checkpoints before deploying DGDs for instant startup
- **Explicit control:** Manage checkpoint lifecycle independently from DGDs
The operator requires `spec.identity` and `spec.job.podTemplateSpec`. The pod template should match the worker container you want checkpointed, including image, command, args, secrets, volumes, and resource limits. You do not need to set the checkpoint environment variables manually; the operator injects them for checkpoint jobs and restored pods.
**Create a checkpoint:**
```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoCheckpoint
metadata:
  name: qwen3-06b-vllm-prewarm
  labels:
    nvidia.com/snapshot-checkpoint-hash: "e5962d34ba272638"  # Add this if Auto-mode identity lookup should find the CR
```
You can name the CR however you want if you plan to use `checkpointRef`. If you want `mode: Auto` identity lookup to find a manual CR, set the `nvidia.com/snapshot-checkpoint-hash` label to the computed 16-character identity hash. Using the hash as the CR name is a convenient convention, but it is not required.
Once the checkpoint is `Ready`, you can reference it by CR name:
```yaml
spec:
  services:
    VllmDecodeWorker:
      checkpoint:
        enabled: true
        checkpointRef: "qwen3-06b-vllm-prewarm"
```
Or use `mode: Auto` with the same identity and snapshot-hash label, and the operator will reuse it automatically.
## Limitations
- **LLM workers only**: Checkpoint/restore supports LLM decode and prefill workers. Specialized workers (multimodal, embedding, diffusion) are not supported.
- **Single-GPU only**: Multi-GPU configurations may work in very basic hardware configurations, but are not officially supported yet.
- **Network state**: Active TCP connections cannot be checkpointed.
- **Security**: Dynamo Snapshot runs as a **privileged DaemonSet**, which is required to run CRIU and cuda-checkpoint. Workload pods do not need to be privileged.
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: Checkpointing
---
> ⚠️ **Experimental Feature**: Dynamo Snapshot is currently in **beta/preview**. The Dynamo Snapshot DaemonSet runs in privileged mode to perform CRIU operations. See [Limitations](#limitations) for details.
**Dynamo Snapshot** (Checkpoint/Restore in Kubernetes) is an experimental infrastructure for fast-starting GPU applications using CRIU (Checkpoint/Restore in User-space). Dynamo Snapshot dramatically reduces cold-start times for large models from minutes to seconds by capturing initialized application state and restoring it on-demand.
## What is Dynamo Snapshot?
Dynamo Snapshot provides:
- **Fast cold starts**: Restore GPU-accelerated applications in seconds instead of minutes
- **CUDA state preservation**: Checkpoint and restore GPU memory and CUDA contexts
- **Kubernetes-native**: Integrates seamlessly with Kubernetes primitives
- **Storage flexibility**: PVC-based storage (S3/OCI planned for future releases)
- **Namespace isolation**: Each namespace gets its own checkpoint infrastructure
## Use Cases
### 1. With NVIDIA Dynamo Platform (Recommended)
Use Dynamo Snapshot as part of the Dynamo platform for automatic checkpoint management:
- Automatic checkpoint creation and lifecycle management
- Seamless integration with DynamoGraphDeployment CRDs
- Built-in autoscaling with fast restore
📖 **[Read the Dynamo Integration Guide →](dynamo.md)**
## Architecture
Dynamo Snapshot consists of two main components:
### 1. Dynamo Snapshot Helm Chart
Deploys the checkpoint/restore infrastructure:
- **DaemonSet**: Runs on GPU nodes to perform CRIU checkpoint operations
- **PVC**: Stores checkpoint data (rootfs diffs, CUDA memory state)
- **RBAC**: Namespace-scoped or cluster-wide permissions
- **Seccomp Profile**: Security policies for CRIU syscalls (needs to be injected into workload pods)
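
How the seccomp profile is injected is not shown here. As a hedged illustration, Kubernetes lets a pod reference a node-local seccomp profile through `securityContext.seccompProfile`. The profile path below is a hypothetical placeholder, not the file name the chart installs.

```yaml
# Illustrative pod-spec fragment using standard Kubernetes fields.
spec:
  securityContext:
    seccompProfile:
      type: Localhost
      # Hypothetical path, relative to the kubelet's seccomp profile directory.
      localhostProfile: profiles/example-criu-profile.json
```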
### 2. External Restore via DaemonSet
The DaemonSet performs checkpoint/restore externally using `nsenter` to enter pod namespaces:
- **Checkpoint**: Freezes the running process and dumps state (CPU + GPU) to storage
- **Restore**: Enters a placeholder pod's namespaces and restores the checkpointed process via `nsrestore`
## Quick Start
To install the Dynamo Snapshot DaemonSet in your cluster, run the following:
```bash
helm install snapshot nvidia/snapshot \
--namespace my-team \
  --create-namespace \
--set storage.pvc.size=100Gi
```
## Key Features
### ✅ Currently Supported
- ✅ **vLLM and SGLang backends** (TensorRT-LLM planned)
- ✅ **LLM decode/prefill workers only** (multimodal, embedding, and diffusion workers are not supported)
⚠️ **Important**: Dynamo Snapshot has significant limitations that may impact production readiness:
### Security Considerations
- **🔴 Privileged DaemonSet**: The Dynamo Snapshot DaemonSet runs in privileged mode with `hostPID`, `hostIPC`, and `hostNetwork` to perform CRIU operations. Workload pods do **not** need privileged mode; all CRIU privilege lives in the DaemonSet.
- **Security Impact**: The privileged DaemonSet can:
  - Access all host devices and processes
  - Bypass most security restrictions
  - Potentially compromise node security if exploited
### Technical Limitations
- **x86_64 (amd64) only**: `cuda-checkpoint` does not support ARM64. The snapshot agent and placeholder images are built for x86_64 only.
- **NVIDIA driver 580.xx or newer required**: Dynamo Snapshot depends on `cuda-checkpoint`, which requires R580+ drivers.
- **vLLM and SGLang backends only**: TensorRT-LLM is not supported.
- **LLM workers only**: Checkpoint/restore supports LLM decode and prefill workers. Specialized workers (multimodal, embedding, diffusion) are not supported.
- **Single-GPU only**: Multi-GPU configurations are not yet supported
- **Network state limitations**: Active TCP connections are closed during restore (use `tcp-close` CRIU option)
- **Storage**: Only PVC storage is currently implemented (S3/OCI planned)
### Recommendation
Dynamo Snapshot is best suited for:
- ✅ Development and testing environments
- ✅ Research and experimentation
- ✅ Controlled production environments with appropriate security controls
- ❌ Security-sensitive production workloads without proper risk assessment
## Documentation
### Getting Started
- [Dynamo Integration Guide](dynamo.md) - Using Dynamo Snapshot with Dynamo Platform
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: Integration with Dynamo
---
> ⚠️ **Experimental Feature**: Dynamo Snapshot is currently in **beta/preview**. The Dynamo Snapshot DaemonSet runs in privileged mode to perform CRIU operations. See [Limitations](#limitations) for details.
Checkpointing captures the complete state of a running worker pod (including GPU memory) and saves it to storage. New pods can restore from this checkpoint instead of performing a full cold start.
| Startup Type | Time | What Happens |
|--------------|------|--------------|
| **Cold Start** | ~1 min | Download model, load to GPU, initialize engine |
| **Warm Start** (checkpoint) | < 10 sec | Restore from checkpoint tar |
## Prerequisites
- Dynamo Platform installed on a k8s cluster with **x86_64 (amd64)** GPU nodes
- Dynamo Snapshot Helm chart installed (separate from platform)
- RWX PVC storage (PVC is currently the only supported backend)
- NVIDIA driver 580.xx or newer on the target GPU nodes
- vLLM or SGLang backend (TensorRT-LLM is not supported)
## Quick Start
### 1. Install Dynamo Snapshot Infrastructure
First, install the Dynamo Snapshot Helm chart in each namespace where you need checkpointing:
```bash
# Install Dynamo Snapshot infrastructure
helm install snapshot nvidia/snapshot \
--namespace my-team \
  --create-namespace \
--set storage.pvc.size=100Gi
```
This creates:
- A PVC for checkpoint storage (`snapshot-pvc`)
- A DaemonSet for CRIU operations (`snapshot-agent`)
### 2. Configure Operator Values
Update your Helm values to point to the Dynamo Snapshot infrastructure:
```yaml
# values.yaml
dynamo-operator:
  checkpoint:
    enabled: true
    storage:
      type: pvc  # Only PVC is currently supported (S3/OCI planned)
      pvc:
        pvcName: "snapshot-pvc"  # Must match Dynamo Snapshot chart
        basePath: "/checkpoints"
      signalHostPath: "/var/lib/snapshot/signals"  # Must match Dynamo Snapshot chart
```
### 3. Configure Your DGD
Add checkpoint configuration to your worker service. Both vLLM and SGLang are supported; use the appropriate `backendFramework`, command, and CLI flags.
> **Note:** Do **not** set `DYN_READY_FOR_CHECKPOINT_FILE` or `DYN_CHECKPOINT_READY_FILE` in the DGD worker env vars. These are injected automatically by the operator's checkpoint controller into checkpoint job pods only. Setting them on worker pods causes all workers to enter checkpoint mode instead of cold-starting normally.
### 4. Deploy
```bash
kubectl apply -f my-llm.yaml -n dynamo-system
```
On first deployment:
1. A checkpoint job runs to create the checkpoint
2. Worker pods start with cold start (checkpoint not ready yet)
3. Once checkpoint is ready, new pods (scale-up, restarts) restore from checkpoint
## Checkpoint Modes
### Auto Mode (Recommended)
The operator automatically creates a `DynamoCheckpoint` CR if one doesn't exist:
```yaml
checkpoint:
  enabled: true
  mode: auto
  identity:
    model: "meta-llama/Llama-3-8B"
    backendFramework: "vllm"  # or "sglang"
    tensorParallelSize: 1
    dtype: "bfloat16"
    maxModelLen: 4096
```
### Reference Mode
Reference an existing `DynamoCheckpoint` CR by its 16-character hash using `checkpointRef`:
```yaml
checkpoint:
  enabled: true
  checkpointRef: "e5962d34ba272638"  # 16-char hash of DynamoCheckpoint CR
```
This is useful when:
- You want to **pre-warm checkpoints** before creating DGDs
- You want **explicit control** over which checkpoint to use
**Flow:**
1. Create a `DynamoCheckpoint` CR (see [DynamoCheckpoint CRD](#dynamocheckpoint-crd) section)
2. Wait for it to become `Ready`
3. Reference it in your DGD using `checkpointRef` with the hash
```bash
# Check checkpoint status (using 16-char hash name)
kubectl get dynamocheckpoint e5962d34ba272638 -n dynamo-system
```
**Not included in hash** (don't invalidate checkpoint):
- `replicas`
- `nodeSelector`, `affinity`, `tolerations`
- `resources` (requests/limits)
- Logging/observability config
**Example with all fields:**
```yaml
checkpoint:
  enabled: true
  mode: auto
  identity:
    model: "meta-llama/Llama-3-8B"
    backendFramework: "vllm"
    dynamoVersion: "0.9.0"
    tensorParallelSize: 1
    pipelineParallelSize: 1
    dtype: "bfloat16"
    maxModelLen: 8192
    extraParameters:
      enableChunkedPrefill: "true"
      quantization: "awq"
```
**Checkpoint Naming:** The `DynamoCheckpoint` CR is automatically named using the 16-character identity hash (e.g., `e5962d34ba272638`).
**Checkpoint Sharing:** Multiple DGDs with the same identity automatically share the same checkpoint.
## DynamoCheckpoint CRD
The `DynamoCheckpoint` (shortname: `dckpt`) is a Kubernetes Custom Resource that manages checkpoint lifecycle.
**When to create a DynamoCheckpoint directly:**
- **Pre-warming:** Create checkpoints before deploying DGDs for instant startup
- **Explicit control:** Manage checkpoint lifecycle independently from DGDs
**Note:** With the new hash-based naming, checkpoint names are automatically generated (16-character hash). The operator handles checkpoint discovery and reuse automatically in `auto` mode.
**Create a checkpoint:**
```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoCheckpoint
metadata:
  name: e5962d34ba272638  # Use the computed 16-char hash
spec:
  identity:
    model: "meta-llama/Llama-3-8B"
    backendFramework: "vllm"
    tensorParallelSize: 1
    dtype: "bfloat16"
  job:
    activeDeadlineSeconds: 3600
    podTemplateSpec:
      spec:
        containers:
          - name: main
            image: nvcr.io/nvidia/ai-dynamo/dynamo-vllm:latest
            command: ["python3", "-m", "dynamo.vllm"]
            args: ["--model", "meta-llama/Llama-3-8B"]
            resources:
              limits:
                nvidia.com/gpu: "1"
            env:
              - name: HF_TOKEN
                valueFrom:
                  secretKeyRef:
                    name: hf-token-secret
                    key: HF_TOKEN
```
**Note:** You can compute the hash yourself, or use `auto` mode to let the operator create it.