Commit 92b341f3 authored by Schwinn Saereesitthipitak, committed by GitHub

docs: update snapshot checkpointing docs (#7244)

parent 4bd6299b
# Dynamo Snapshot Helm Chart # Dynamo Snapshot Helm Chart
> ⚠️ **Experimental Feature**: Dynamo Snapshot is currently in **beta/preview**. The DaemonSet runs in privileged mode to perform CRIU operations. See [Prerequisites](#prerequisites) for security considerations. > ⚠️ **Experimental Feature**: Dynamo Snapshot is currently in beta/preview. The DaemonSet runs in privileged mode to perform CRIU checkpoint and restore operations.
This Helm chart deploys the checkpoint/restore infrastructure for NVIDIA Dynamo, including: This chart installs the namespace-scoped checkpoint/restore infrastructure used by Dynamo:
- Persistent Volume Claim (PVC) for checkpoint storage
- DaemonSet running the CRIU checkpoint agent
- RBAC resources (ServiceAccount, Role, RoleBinding)
- Seccomp profile for blocking io_uring syscalls
**Note:** - `snapshot-agent` DaemonSet on GPU nodes
- Each namespace gets its own isolated checkpoint infrastructure with namespace-scoped RBAC - `snapshot-pvc` checkpoint storage, or wiring to an existing PVC
- **Supports vLLM and SGLang backends** (TensorRT-LLM support planned) - namespace-scoped RBAC
- the seccomp profile required by CRIU
## Prerequisites Snapshot storage is namespace-local. Install this chart in every namespace where you want checkpoint and restore.
⚠️ **Security Warning**: The Dynamo Snapshot DaemonSet runs in **privileged mode** with `hostPID`, `hostIPC`, and `hostNetwork` to perform CRIU checkpoint/restore operations. Workload pods do not need privileged mode. Only deploy in environments where a privileged DaemonSet is acceptable. ## Prerequisites
- Kubernetes 1.21+ - Kubernetes 1.21+
- **x86_64 (amd64) nodes only** for the snapshot agent and placeholder images - x86_64 GPU nodes
- GPU nodes with NVIDIA runtime (`nvidia` runtime class) - NVIDIA driver 580.xx or newer
- NVIDIA driver 580.xx or newer on the target GPU nodes - containerd runtime
- containerd runtime (for container inspection; CRIU is bundled in Dynamo Snapshot images) - a cluster where a privileged DaemonSet with `hostPID`, `hostIPC`, and `hostNetwork` is acceptable
- NVIDIA Dynamo operator installed (cluster-wide or namespace-scoped) - Dynamo Platform already installed, with operator checkpointing enabled
- RWX (ReadWriteMany) storage class for multi-node deployments
- **Security clearance for privileged DaemonSet** (the Dynamo Snapshot agent runs privileged with hostPID/hostIPC/hostNetwork)
## Installation
> **Note:** The Dynamo Snapshot Helm chart is not yet published to a public Helm repository. For now, you must build and deploy from source. The platform/operator configuration must point at the same checkpoint storage that this chart installs:
### Building from Source ```yaml
dynamo-operator:
```bash checkpoint:
# Set environment enabled: true
export NAMESPACE=my-team # Your target namespace storage:
export DOCKER_SERVER=your-registry.com/ # Your container registry type: pvc
export IMAGE_TAG=latest pvc:
pvcName: snapshot-pvc
# Build Dynamo Snapshot agent image (amd64 only) basePath: /checkpoints
cd deploy/snapshot
docker build --platform linux/amd64 --target agent -t $DOCKER_SERVER/snapshot-agent:$IMAGE_TAG .
docker push $DOCKER_SERVER/snapshot-agent:$IMAGE_TAG
cd -
# Install Dynamo Snapshot chart with custom image
helm install snapshot ./deploy/helm/charts/snapshot/ \
--namespace ${NAMESPACE} \
--create-namespace \
--set daemonset.image.repository=${DOCKER_SERVER}/snapshot-agent \
--set daemonset.image.tag=${IMAGE_TAG} \
--set daemonset.imagePullSecrets[0].name=your-registry-secret
``` ```
## Configuration Cross-node restore requires a shared `ReadWriteMany` storage class. The chart defaults to `storage.pvc.accessMode=ReadWriteMany`.
See `values.yaml` for all configuration options.
### Key Configuration Options For better restore times, use a fast `ReadWriteMany` StorageClass for the checkpoint PVC.
| Parameter | Description | Default | ## Minimal Install
|-----------|-------------|---------|
| `storage.type` | Storage type: `pvc` (only supported), `s3` and `oci` planned | `pvc` |
| `storage.pvc.create` | Create a new PVC | `true` |
| `storage.pvc.name` | PVC name (must match operator config) | `snapshot-pvc` |
| `storage.pvc.size` | PVC size | `100Gi` |
| `storage.pvc.storageClass` | Storage class name | `""` (default) |
| `daemonset.image.repository` | DaemonSet image repository | `nvcr.io/nvidian/dynamo-dev/snapshot-agent` |
| `daemonset.snapshotLogLevel` | Snapshot agent and nsrestore log level (`trace`, `debug`, `info`, `warn`, `error`) | `info` |
| `daemonset.nodeSelector` | Node selector for GPU nodes | `nvidia.com/gpu.present: "true"` |
| `config.checkpoint.criu.ghostLimit` | CRIU ghost file size limit in bytes | `536870912` (512MB) |
| `config.checkpoint.criu.logLevel` | CRIU logging verbosity (0-4) | `4` |
| `rbac.namespaceRestricted` | Use namespace-scoped RBAC | `true` |
## Usage This is the smallest Helm install that creates the checkpoint PVC and the DaemonSet:
After installing this chart, enable checkpointing in your DynamoGraphDeployment:
```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
name: my-model
namespace: my-team
spec:
services:
worker:
checkpoint:
enabled: true
mode: auto
identity:
model: Qwen/Qwen3-0.6B
backendFramework: vllm
```
## Multi-Namespace Deployment
To enable checkpointing in multiple namespaces, install this chart in each namespace:
```bash ```bash
# Namespace A helm upgrade --install snapshot ./deploy/helm/charts/snapshot \
helm install snapshot nvidia/snapshot -n team-a --namespace ${NAMESPACE} \
--create-namespace \
# Namespace B --set storage.pvc.create=true
helm install snapshot nvidia/snapshot -n team-b
``` ```
Each namespace will have its own isolated checkpoint storage. If your cluster does not use a default storage class, also set `storage.pvc.storageClass`.
## Verification Keep `storage.pvc.accessMode=ReadWriteMany` for this chart layout. The DaemonSet mounts the same PVC on each eligible node, so a shared `ReadWriteOnce` claim only works when the agent runs on one node.
```bash If you already have a PVC, keep the chart in "use existing PVC" mode:
# Check PVC
kubectl get pvc snapshot-pvc -n my-team
# Check DaemonSet Do not set `storage.pvc.create=true` when reusing an existing checkpoint PVC.
kubectl get daemonset -n my-team
# Check DaemonSet pods are running ```bash
kubectl get pods -n my-team -l app.kubernetes.io/name=snapshot helm upgrade --install snapshot ./deploy/helm/charts/snapshot \
--namespace ${NAMESPACE} \
--create-namespace \
--set storage.pvc.create=false \
--set storage.pvc.name=my-snapshot-pvc
``` ```
## Uninstallation ## Verify
```bash ```bash
helm uninstall snapshot -n my-team kubectl get pvc snapshot-pvc -n ${NAMESPACE}
kubectl rollout status daemonset/snapshot-agent -n ${NAMESPACE}
kubectl get pods -n ${NAMESPACE} -l app.kubernetes.io/name=snapshot -o wide
``` ```
**Note:** This will NOT delete the PVC by default. To delete the PVC: ## Important Values
```bash | Parameter | Meaning | Default |
kubectl delete pvc snapshot-pvc -n my-team |-----------|---------|---------|
``` | `storage.pvc.create` | Create `snapshot-pvc` instead of using an existing PVC | `true` |
| `storage.pvc.name` | PVC name used by the agent and by the operator config | `snapshot-pvc` |
| `storage.pvc.size` | Requested PVC size | `1Ti` |
| `storage.pvc.storageClass` | Storage class name | `""` |
| `storage.pvc.accessMode` | Access mode for the checkpoint PVC | `ReadWriteMany` |
| `storage.pvc.basePath` | Checkpoint root inside the PVC | `/checkpoints` |
| `daemonset.image.repository` | Snapshot agent image repository | `nvcr.io/nvidia/ai-dynamo/snapshot-agent` |
| `daemonset.image.tag` | Snapshot agent image tag | `1.0.0` |
| `daemonset.imagePullSecrets` | Image pull secrets for the agent | `[{name: ngc-secret}]` |
## Troubleshooting See [values.yaml](./values.yaml) for the complete configuration surface.
### DaemonSet pods not starting ## End To End
Check if GPU nodes have the correct labels and runtime class: Once the chart is installed, use the snapshot guide to deploy a snapshot-capable `DynamoGraphDeployment`, wait for the checkpoint to become ready, and then scale the worker to verify restore:
```bash - [Snapshot](../../../../docs/kubernetes/snapshot.md)
kubectl get nodes -l nvidia.com/gpu.present=true
kubectl describe node <node-name> | grep -A 5 "Runtime Class"
```
If nodes don't have the `nvidia.com/gpu.present` label, you can add it: ## Uninstall
```bash ```bash
kubectl label node <node-name> nvidia.com/gpu.present=true helm uninstall snapshot -n ${NAMESPACE}
``` ```
### Checkpoint job fails The chart does not remove checkpoint data automatically. Delete the PVC yourself if you want to remove stored checkpoints:
Check DaemonSet logs:
```bash ```bash
kubectl logs -n my-team -l app.kubernetes.io/name=snapshot kubectl delete pvc snapshot-pvc -n ${NAMESPACE}
``` ```
### PVC not mounting ## Troubleshooting
Check PVC status and events: If `snapshot-agent` does not schedule:
```bash ```bash
kubectl describe pvc snapshot-pvc -n my-team kubectl get nodes -l nvidia.com/gpu.present=true
kubectl describe daemonset snapshot-agent -n ${NAMESPACE}
kubectl logs -n ${NAMESPACE} -l app.kubernetes.io/name=snapshot --all-containers
``` ```
Ensure your storage class supports `ReadWriteMany` access mode for multi-node deployments. If checkpoint creation never becomes ready, verify all three pieces line up:
## Related Documentation
- [Dynamo Snapshot Overview](../../../../docs/kubernetes/snapshot/README.md) - Dynamo Snapshot architecture and use cases
- [Dynamo Snapshot with Dynamo Platform](../../../../docs/kubernetes/snapshot/dynamo.md) - Integration guide
## License
Apache License 2.0 - the operator has `dynamo-operator.checkpoint.enabled=true`
- the operator PVC name and base path match the snapshot chart values
- the workload uses a snapshot-capable worker image and command
...@@ -29,7 +29,7 @@ storage: ...@@ -29,7 +29,7 @@ storage:
# PVC name - must match operator configuration # PVC name - must match operator configuration
name: snapshot-pvc name: snapshot-pvc
# PVC size # PVC size
size: 100Gi size: 1Ti
# Storage class (leave empty for default) # Storage class (leave empty for default)
storageClass: "" storageClass: ""
# Access mode - ReadWriteMany required for multi-pod access # Access mode - ReadWriteMany required for multi-pod access
......
...@@ -55,11 +55,8 @@ navigation: ...@@ -55,11 +55,8 @@ navigation:
path: kubernetes/rolling-update.md path: kubernetes/rolling-update.md
- page: Inference Gateway (GAIE) - page: Inference Gateway (GAIE)
path: kubernetes/inference-gateway.md path: kubernetes/inference-gateway.md
- section: Checkpointing - page: Snapshot
path: kubernetes/snapshot/README.md path: kubernetes/snapshot.md
contents:
- page: Integration with Dynamo
path: kubernetes/snapshot/dynamo.md
- section: Observability (K8s) - section: Observability (K8s)
contents: contents:
- page: Metrics - page: Metrics
......
...@@ -230,7 +230,7 @@ Key customization points include: ...@@ -230,7 +230,7 @@ Key customization points include:
- **[Operator Documentation](dynamo-operator.md)** - How the platform works - **[Operator Documentation](dynamo-operator.md)** - How the platform works
- **[Service Discovery](service-discovery.md)** - Discovery backends and configuration - **[Service Discovery](service-discovery.md)** - Discovery backends and configuration
- **[Helm Charts](https://github.com/ai-dynamo/dynamo/tree/main/deploy/helm/README.md)** - For advanced users - **[Helm Charts](https://github.com/ai-dynamo/dynamo/tree/main/deploy/helm/README.md)** - For advanced users
- **[Checkpointing](snapshot/README.md)** - Fast pod startup with checkpoint/restore - **[Snapshot](snapshot.md)** - Fast pod startup with checkpoint/restore
- **[GitOps Deployment with FluxCD](fluxcd.md)** - For advanced users - **[GitOps Deployment with FluxCD](fluxcd.md)** - For advanced users
- **[Logging](observability/logging.md)** - For logging setup - **[Logging](observability/logging.md)** - For logging setup
- **[Multinode Deployment](deployment/multinode-deployment.md)** - For multinode deployment - **[Multinode Deployment](deployment/multinode-deployment.md)** - For multinode deployment
......
---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: Snapshot
---
> ⚠️ **Experimental Feature**: Dynamo Snapshot is currently in **preview** and may only be functional in some k8s cluster setups. The Dynamo Snapshot DaemonSet runs in privileged mode to perform CRIU operations. See [Limitations](#limitations) for details.
**Dynamo Snapshot** is experimental infrastructure for fast-starting GPU applications in Kubernetes using CRIU (Checkpoint/Restore in Userspace) and NVIDIA's `cuda-checkpoint` utility. It reduces cold-start times for large models from minutes to seconds by capturing initialized application state and restoring it on demand.
| Startup Type | Time | What Happens |
|--------------|------|--------------|
| **Cold Start** | ~1 min | Download model, load to GPU, initialize engine |
| **Warm Start** (restore from checkpoint) | ~10 sec | Restore from checkpoint tar |
> ⚠️ Restore time may vary depending on cluster configuration (storage bandwidth, GPU model, etc.)
## Prerequisites
- Dynamo Platform/Operator installed on a k8s cluster with **x86_64 (amd64)** GPU nodes
- NVIDIA driver 580.xx or newer on the target GPU nodes
- `ReadWriteMany` storage if you need cross-node restore
- vLLM or SGLang backend (TensorRT-LLM is not supported yet)
- Security clearance to run a privileged DaemonSet
## Quick Start
This guide assumes a normal Dynamo deployment workflow is already present on your Kubernetes cluster.
### 1. Build and push a placeholder image
Snapshot-enabled workers must use a placeholder image that wraps the normal runtime image with the restore tooling. If you do not already have one, build it with the snapshot placeholder target and push it to a registry your cluster can pull from:
```bash
export RUNTIME_IMAGE=registry.example.com/dynamo/vllm-runtime:1.0.0
export PLACEHOLDER_IMAGE=registry.example.com/dynamo/vllm-placeholder:1.0.0
cd deploy/snapshot
make docker-build-placeholder \
PLACEHOLDER_BASE_IMG="${RUNTIME_IMAGE}" \
PLACEHOLDER_IMG="${PLACEHOLDER_IMAGE}"
make docker-push-placeholder \
PLACEHOLDER_IMG="${PLACEHOLDER_IMAGE}"
```
This flow is defined in [deploy/snapshot/Makefile](../../deploy/snapshot/Makefile) and [deploy/snapshot/Dockerfile](../../deploy/snapshot/Dockerfile). The placeholder image preserves the base runtime entrypoint and command contract, and adds the CRIU, `cuda-checkpoint`, and `nsrestore` tooling needed for restore.
### 2. Enable checkpointing in the platform and verify it
Whether you are installing or upgrading `dynamo-platform`, the operator must have checkpointing enabled and must point at the same storage that the snapshot chart will use:
```yaml
dynamo-operator:
  checkpoint:
    enabled: true
    storage:
      type: pvc
      pvc:
        pvcName: snapshot-pvc
        basePath: /checkpoints
```
If the platform is already installed, verify that the operator config contains the checkpoint block:
```bash
OPERATOR_CONFIG=$(kubectl get deploy -n "${PLATFORM_NAMESPACE}" \
-l app.kubernetes.io/name=dynamo-operator,app.kubernetes.io/component=manager \
-o jsonpath='{.items[0].spec.template.spec.volumes[?(@.name=="operator-config")].configMap.name}')
kubectl get configmap "${OPERATOR_CONFIG}" -n "${PLATFORM_NAMESPACE}" \
-o jsonpath='{.data.config\.yaml}' | sed -n '/^checkpoint:/,/^[^[:space:]]/p'
```
Verify that the rendered config includes `enabled: true` and the same PVC name and base path you plan to use for the snapshot chart.
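The same comparison can be scripted. The following is a minimal sketch and not part of the Dynamo tooling: `first_value` and `check_checkpoint_block` are hypothetical helpers, and they assume the rendered config uses the same key names as the Helm values (`enabled`, `pvcName`, `basePath`). The block is read from stdin.

```bash
# Print the first scalar value for a key in "key: value" text on stdin.
# Hypothetical helper; it matches lines textually and is not a YAML parser.
first_value() {
  sed -n "s/^[[:space:]]*$1:[[:space:]]*//p" | head -n 1 | tr -d "\"'"
}

# Check a checkpoint config block (stdin) against the expected PVC name and base path.
# Assumes the rendered keys mirror the Helm values: enabled, pvcName, basePath.
check_checkpoint_block() {
  local expected_pvc="$1" expected_base="$2" block
  block="$(cat)"
  [ "$(printf '%s\n' "$block" | first_value enabled)" = "true" ] \
    || { echo "checkpoint.enabled is not true" >&2; return 1; }
  [ "$(printf '%s\n' "$block" | first_value pvcName)" = "$expected_pvc" ] \
    || { echo "pvcName does not match $expected_pvc" >&2; return 1; }
  [ "$(printf '%s\n' "$block" | first_value basePath)" = "$expected_base" ] \
    || { echo "basePath does not match $expected_base" >&2; return 1; }
  echo "checkpoint config matches"
}
```

To use it, pipe the output of the `sed` extraction into `check_checkpoint_block snapshot-pvc /checkpoints`; a non-zero exit status indicates a mismatch.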
For the full platform/operator configuration surface, see [deploy/helm/charts/platform/README.md](../../deploy/helm/charts/platform/README.md) and [deploy/helm/charts/platform/components/operator/values.yaml](../../deploy/helm/charts/platform/components/operator/values.yaml).
### 3. Install the snapshot chart
```bash
helm upgrade --install snapshot ./deploy/helm/charts/snapshot \
--namespace ${NAMESPACE} \
--create-namespace \
--set storage.pvc.create=true
```
Cross-node restore requires `ReadWriteMany` storage. The chart defaults to that mode.
For better restore times, use a fast `ReadWriteMany` StorageClass for the checkpoint PVC. If you are reusing an existing checkpoint PVC, do not set `storage.pvc.create=true`; install the chart with `storage.pvc.create=false` and point `storage.pvc.name` at the existing PVC instead.
Verify that the PVC and DaemonSet are ready:
```bash
kubectl get pvc snapshot-pvc -n ${NAMESPACE}
kubectl rollout status daemonset/snapshot-agent -n ${NAMESPACE}
```
For the full snapshot chart configuration surface, see [deploy/helm/charts/snapshot/README.md](../../deploy/helm/charts/snapshot/README.md) and [deploy/helm/charts/snapshot/values.yaml](../../deploy/helm/charts/snapshot/values.yaml).
### 4. Apply a snapshot-compatible `DynamoGraphDeployment`
This example is adapted from [examples/backends/vllm/deploy/agg.yaml](../../examples/backends/vllm/deploy/agg.yaml). The worker must use the placeholder image from step 1, and the checkpoint identity must describe the runtime state you want to reuse.
```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: vllm-snapshot-demo
spec:
  services:
    Frontend:
      componentType: frontend
      replicas: 1
      extraPodSpec:
        mainContainer:
          image: registry.example.com/dynamo/vllm-runtime:1.0.0
    VllmDecodeWorker:
      componentType: worker
      replicas: 1
      resources:
        limits:
          gpu: "1"
      readinessProbe:
        httpGet:
          path: /live
          port: system
        periodSeconds: 1
        timeoutSeconds: 4
        failureThreshold: 3
      checkpoint:
        enabled: true
        mode: Auto
        identity:
          model: Qwen/Qwen3-0.6B
          backendFramework: vllm
      extraPodSpec:
        mainContainer:
          image: registry.example.com/dynamo/vllm-placeholder:1.0.0
          command:
            - python3
            - -m
            - dynamo.vllm
          args:
            - --model
            - Qwen/Qwen3-0.6B
            - --disable-custom-all-reduce
          env:
            - name: GLOO_SOCKET_IFNAME
              value: lo
            - name: NCCL_SOCKET_IFNAME
              value: lo
            - name: NCCL_DEBUG
              value: ERROR
            - name: TORCH_CPP_LOG_LEVEL
              value: ERROR
            - name: TORCH_DISTRIBUTED_DEBUG
              value: "OFF"
            - name: CUDA_ERROR_LEVEL
              value: "10"
            - name: NCCL_CUMEM_ENABLE
              value: "0"
            - name: NCCL_CUMEM_HOST_ENABLE
              value: "0"
            - name: NCCL_NVLS_ENABLE
              value: "0"
            - name: NCCL_P2P_DISABLE
              value: "0"
            - name: NCCL_SHM_DISABLE
              value: "1"
            - name: NCCL_IB_DISABLE
              value: "1"
            - name: TORCH_NCCL_ENABLE_MONITORING
              value: "0"
```
For SGLang, use `dynamo.sglang`, an SGLang placeholder image, `backendFramework: sglang`, and the matching CLI flags.
Apply the manifest:
```bash
kubectl apply -f vllm-snapshot-demo.yaml -n ${NAMESPACE}
```
On the first rollout, the worker cold-starts, the operator creates a `DynamoCheckpoint`, and the checkpoint Job writes data into `snapshot-pvc`.
### 5. Wait for the checkpoint to become ready
Capture the checkpoint name from DGD status, then wait for the `DynamoCheckpoint` phase to become `Ready`:
```bash
CHECKPOINT_NAME=$(kubectl get dgd vllm-snapshot-demo -n ${NAMESPACE} \
-o jsonpath='{.status.checkpoints.VllmDecodeWorker.checkpointName}')
kubectl wait \
--for=jsonpath='{.status.phase}'=Ready \
"dynamocheckpoint/${CHECKPOINT_NAME}" \
-n ${NAMESPACE} \
--timeout=30m
```
The DGD status also reports the computed checkpoint hash at `.status.checkpoints.VllmDecodeWorker.identityHash`.
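`kubectl wait` with `--for=jsonpath` returns only when the phase becomes `Ready` or the timeout expires, so a checkpoint that moves to `Failed` still blocks until the timeout. A polling loop that stops on either terminal phase is one alternative. This is a sketch: `phase_action` and `wait_for_checkpoint` are hypothetical names, and the 5-second interval and default of 360 attempts (about 30 minutes) are arbitrary choices.

```bash
# Map a DynamoCheckpoint phase to an action: done, fail, or wait.
phase_action() {
  case "$1" in
    Ready)  echo done ;;
    Failed) echo fail ;;
    *)      echo wait ;;  # Pending, Creating, or status not populated yet
  esac
}

# Poll .status.phase until it is Ready or Failed (needs kubectl and cluster access).
wait_for_checkpoint() {
  local name="$1" namespace="$2" tries="${3:-360}" phase
  while [ "$tries" -gt 0 ]; do
    phase="$(kubectl get dynamocheckpoint "$name" -n "$namespace" \
      -o jsonpath='{.status.phase}' 2>/dev/null)"
    case "$(phase_action "$phase")" in
      done) echo "checkpoint $name is Ready"; return 0 ;;
      fail) echo "checkpoint $name Failed" >&2; return 1 ;;
    esac
    tries=$((tries - 1))
    sleep 5
  done
  echo "timed out waiting for checkpoint $name" >&2
  return 1
}
```

Usage: `wait_for_checkpoint "${CHECKPOINT_NAME}" "${NAMESPACE}"`. On failure, `status.message` on the `DynamoCheckpoint` may carry more detail.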
### 6. Trigger restore
Once the checkpoint is ready, scale the worker replicas from `1` to `2`:
```bash
kubectl patch dgd vllm-snapshot-demo -n ${NAMESPACE} --type=merge \
-p '{"spec":{"services":{"VllmDecodeWorker":{"replicas":2}}}}'
```
New worker pods for `VllmDecodeWorker` will restore from the ready checkpoint automatically.
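To reuse the patch for other services or replica counts, the JSON merge body can be generated rather than typed by hand. A small sketch; `replicas_patch` and `scale_service` are hypothetical helper names, not part of Dynamo:

```bash
# Build the JSON merge patch that sets replicas for one DGD service.
replicas_patch() {
  printf '{"spec":{"services":{"%s":{"replicas":%d}}}}' "$1" "$2"
}

# Apply the patch to a DynamoGraphDeployment (needs kubectl and cluster access).
scale_service() {
  local dgd="$1" namespace="$2" service="$3" replicas="$4"
  kubectl patch dgd "$dgd" -n "$namespace" --type=merge \
    -p "$(replicas_patch "$service" "$replicas")"
}
```

Usage: `scale_service vllm-snapshot-demo "${NAMESPACE}" VllmDecodeWorker 2` issues the same patch as the command in this step.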
## Checkpoint Configuration
### Auto Mode (Recommended)
The operator computes the checkpoint identity hash, looks for an existing `DynamoCheckpoint` with a matching `nvidia.com/snapshot-checkpoint-hash` label, and creates one if it does not find one:
```yaml
checkpoint:
  enabled: true
  mode: Auto
  identity:
    model: "meta-llama/Llama-3-8B"
    backendFramework: "vllm" # or "sglang"
    tensorParallelSize: 1
    dtype: "bfloat16"
    maxModelLen: 4096
```
When a service uses checkpointing, DGD status reports the resolved `checkpointName`, `identityHash`, and `ready` fields under `.status.checkpoints.<service-name>`.
### Manual Management and `checkpointRef`
Use `checkpointRef` when you want a service to restore from a specific `DynamoCheckpoint` CR:
```yaml
checkpoint:
  enabled: true
  checkpointRef: "qwen3-06b-vllm-prewarm"
```
This is useful when:
- You want to **pre-warm checkpoints** before creating DGDs
- You want **explicit control** over which checkpoint to use
`checkpointRef` resolves by `DynamoCheckpoint.metadata.name`, not by `status.identityHash`. A manual checkpoint can use any valid Kubernetes resource name.
If you are managing checkpoint CRs yourself, set `mode: Manual` on the service to prevent the operator from creating a new `DynamoCheckpoint` when identity-based lookup does not find one.
```bash
# Check checkpoint status by CR name
kubectl get dynamocheckpoint qwen3-06b-vllm-prewarm -n ${NAMESPACE}
# Now create DGD referencing it
kubectl apply -f my-dgd.yaml -n ${NAMESPACE}
```
If you want `mode: Auto` DGDs to discover a manually created checkpoint by identity, add the label `nvidia.com/snapshot-checkpoint-hash=<identity-hash>` to that `DynamoCheckpoint`. Auto-created checkpoints already use that label, and currently use the same hash as the CR name.
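Adding that label to an existing checkpoint can be done with `kubectl label`. The sketch below reads the hash from `status.identityHash` and applies it as the label value. The helper names are hypothetical, and two things are assumptions: that the hash is 16 lowercase hexadecimal characters (as in the examples in this guide) and that `status.identityHash` is already populated when you run it.

```bash
# Succeeds if the argument looks like a 16-character lowercase hex hash.
# Assumption: identity hashes have this shape, as in the examples in this guide.
is_identity_hash() {
  printf '%s\n' "$1" | grep -Eq '^[0-9a-f]{16}$'
}

# Copy status.identityHash into the label that Auto-mode lookup matches on.
# Needs kubectl and cluster access.
label_checkpoint_for_auto() {
  local name="$1" namespace="$2" hash
  hash="$(kubectl get dynamocheckpoint "$name" -n "$namespace" \
    -o jsonpath='{.status.identityHash}')"
  is_identity_hash "$hash" \
    || { echo "no usable identity hash on $name yet" >&2; return 1; }
  kubectl label dynamocheckpoint "$name" -n "$namespace" \
    "nvidia.com/snapshot-checkpoint-hash=${hash}" --overwrite
}
```

Usage: `label_checkpoint_for_auto qwen3-06b-vllm-prewarm "${NAMESPACE}"`.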
## Checkpoint Identity
Each checkpoint is identified by a **16-character hash** (64 bits, derived from SHA-256) of the configuration that affects runtime state:
| Field | Required | Affects Hash | Example |
|-------|----------|-------------|---------|
| `model` | ✓ | ✓ | `meta-llama/Llama-3-8B` |
| `backendFramework` | ✓ | ✓ | `sglang`, `vllm` |
| `dynamoVersion` | | ✓ | `0.9.0`, `1.0.0` |
| `tensorParallelSize` | | ✓ | `1`, `2`, `4`, `8` (default: 1) |
| `pipelineParallelSize` | | ✓ | `1`, `2` (default: 1) |
| `dtype` | | ✓ | `float16`, `bfloat16`, `fp8` |
| `maxModelLen` | | ✓ | `4096`, `8192` |
| `extraParameters` | | ✓ | Custom key-value pairs |
**Not included in hash** (don't invalidate checkpoint):
- `replicas`
- `nodeSelector`, `affinity`, `tolerations`
- `resources` (requests/limits)
- Logging/observability config
**Example with all fields:**
```yaml
checkpoint:
  enabled: true
  mode: Auto
  identity:
    model: "meta-llama/Llama-3-8B"
    backendFramework: "vllm"
    dynamoVersion: "0.9.0"
    tensorParallelSize: 1
    pipelineParallelSize: 1
    dtype: "bfloat16"
    maxModelLen: 8192
    extraParameters:
      enableChunkedPrefill: "true"
      quantization: "awq"
```
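To build intuition for how an identity hash behaves, the sketch below derives a 16-character value from a set of `key=value` pairs. It is illustrative only: the operator's actual serialization of the identity fields is not described here, so values produced this way will not match `status.identityHash`. `identity_hash` is a hypothetical helper, and sorting the pairs before hashing is an assumption made so that argument order does not matter.

```bash
# Illustrative only: NOT the operator's algorithm, so the output will not
# match status.identityHash. Arguments are key=value pairs for the
# hash-affecting identity fields.
identity_hash() {
  # Sort for a stable order, hash with SHA-256, keep the first 16 hex characters.
  printf '%s\n' "$@" | LC_ALL=C sort | sha256sum | cut -c1-16
}
```

For example, `identity_hash model=Qwen/Qwen3-0.6B backendFramework=vllm dtype=bfloat16` and the same call with `dtype=float16` produce different values, while fields such as `replicas` never enter the input and so cannot change the result.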
## DynamoCheckpoint CRD
The `DynamoCheckpoint` (shortname: `dckpt`) is a Kubernetes Custom Resource that manages checkpoint lifecycle.
**When to create a DynamoCheckpoint directly:**
- **Pre-warming:** Create checkpoints before deploying DGDs for instant startup
- **Explicit control:** Manage checkpoint lifecycle independently from DGDs
The operator requires `spec.identity` and `spec.job.podTemplateSpec`. The pod template should match the worker container you want checkpointed, including image, command, args, secrets, volumes, and resource limits. You do not need to set the checkpoint environment variables manually; the operator injects them for checkpoint jobs and restored pods.
**Create a checkpoint:**
```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoCheckpoint
metadata:
  name: qwen3-06b-vllm-prewarm
  labels:
    nvidia.com/snapshot-checkpoint-hash: "e5962d34ba272638" # Add this if Auto-mode identity lookup should find the CR
spec:
  identity:
    model: Qwen/Qwen3-0.6B
    backendFramework: vllm
    tensorParallelSize: 1
    dtype: bfloat16
    maxModelLen: 4096
  job:
    activeDeadlineSeconds: 3600
    backoffLimit: 3
    ttlSecondsAfterFinished: 300
    podTemplateSpec:
      spec:
        restartPolicy: Never
        containers:
          - name: main
            image: registry.example.com/dynamo/vllm-placeholder:1.0.0
            command:
              - python3
              - -m
              - dynamo.vllm
            args:
              - --model
              - Qwen/Qwen3-0.6B
              - --disable-custom-all-reduce
            env:
              - name: GLOO_SOCKET_IFNAME
                value: lo
              - name: NCCL_SOCKET_IFNAME
                value: lo
            resources:
              limits:
                nvidia.com/gpu: "1"
```
You can name the CR however you want if you plan to use `checkpointRef`. If you want `mode: Auto` identity lookup to find a manual CR, set the `nvidia.com/snapshot-checkpoint-hash` label to the computed 16-character identity hash. Using the hash as the CR name is a convenient convention, but it is not required.
**Check status:**
```bash
# List all checkpoints
kubectl get dynamocheckpoint -n ${NAMESPACE}
# Or use shortname
kubectl get dckpt -n ${NAMESPACE}
# Example output:
# NAME                     MODEL                   BACKEND   PHASE      HASH               AGE
# qwen3-06b-vllm-prewarm   Qwen/Qwen3-0.6B         vllm      Ready      e5962d34ba272638   5m
# llama3-8b-vllm-prewarm   meta-llama/Llama-3-8B   vllm      Creating   7ab4f89c12de3456   2m
```
**Phases:**
| Phase | Description |
|-------|-------------|
| `Pending` | CR created, waiting for job to start |
| `Creating` | Checkpoint job is running |
| `Ready` | Checkpoint available for use |
| `Failed` | Checkpoint creation failed |
`Ready` is a value in `status.phase`, not a Kubernetes condition. The `conditions` array tracks job lifecycle events:
| Condition Type | Meaning |
|----------------|---------|
| `JobCreated` | The checkpoint Job has been created |
| `JobCompleted` | The checkpoint Job has finished, either successfully or with a failure |
Other useful status fields are:
| Field | Meaning |
|-------|---------|
| `status.jobName` | Name of the checkpoint Job |
| `status.identityHash` | Computed 16-character hash for the checkpoint identity |
| `status.location` | Checkpoint location in the configured storage backend |
| `status.storageType` | Storage backend type (`pvc`, `s3`, or `oci`) |
| `status.createdAt` | Timestamp recorded when the checkpoint becomes ready |
| `status.message` | Failure or progress message when available |
**Detailed status:**
```bash
kubectl describe dckpt qwen3-06b-vllm-prewarm -n ${NAMESPACE}
```
```yaml
Status:
  Phase: Ready
  IdentityHash: e5962d34ba272638
  JobName: checkpoint-qwen3-06b-vllm-prewarm
  Location: /checkpoints/e5962d34ba272638.tar
  StorageType: pvc
  CreatedAt: 2026-01-29T10:05:00Z
  Conditions:
    - Type: JobCreated
      Status: "True"
      Reason: JobCreated
    - Type: JobCompleted
      Status: "True"
      Reason: JobSucceeded
```
**Reference from DGD:**
Once the checkpoint is `Ready`, you can reference it by CR name:
```yaml
spec:
  services:
    VllmDecodeWorker:
      checkpoint:
        enabled: true
        checkpointRef: "qwen3-06b-vllm-prewarm"
```
Or use `mode: Auto` with the same identity and snapshot-hash label, and the operator will reuse it automatically.
## Limitations
- **LLM workers only**: Checkpoint/restore supports LLM decode and prefill workers. Specialized workers (multimodal, embedding, diffusion) are not supported.
- **Single-GPU only**: Multi-GPU configurations may work on some simple hardware setups, but they are not officially supported yet.
- **Network state**: Active TCP connections cannot be checkpointed.
- **Security**: Dynamo Snapshot runs as a **privileged DaemonSet**, which is required to run CRIU and `cuda-checkpoint`. Workload pods do not need to be privileged.
## Troubleshooting
### Checkpoint Not Ready
1. Check the checkpoint job:
```bash
kubectl get dckpt -n ${NAMESPACE}
kubectl describe dckpt <checkpoint-name> -n ${NAMESPACE}
kubectl logs job/$(kubectl get dckpt <checkpoint-name> -n ${NAMESPACE} -o jsonpath='{.status.jobName}') -n ${NAMESPACE}
```
2. Check the DaemonSet:
```bash
kubectl logs daemonset/snapshot-agent -n ${NAMESPACE} --all-containers
```
3. Verify that platform and chart storage settings match:
```bash
kubectl get dckpt <checkpoint-name> -n ${NAMESPACE} -o yaml
```
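One thing to look for in that YAML is whether `status.location` sits under the base path configured for the operator and the chart. The sketch below automates the comparison. The helper names are hypothetical, and the check assumes PVC-backed checkpoints report a location prefixed by the base path, as in the example output `/checkpoints/e5962d34ba272638.tar`.

```bash
# Succeeds if a checkpoint location lies under the given base path.
location_under_base() {
  local location="$1" base="${2%/}"  # tolerate a trailing slash on the base path
  case "$location" in
    "$base"/*) return 0 ;;
    *)         return 1 ;;
  esac
}

# Compare a checkpoint's reported location with the expected base path.
# Needs kubectl and cluster access.
check_checkpoint_location() {
  local name="$1" namespace="$2" base="$3" location
  location="$(kubectl get dckpt "$name" -n "$namespace" \
    -o jsonpath='{.status.location}')"
  if location_under_base "$location" "$base"; then
    echo "location $location is under $base"
  else
    echo "location '$location' is not under $base" >&2
    return 1
  fi
}
```

Usage: `check_checkpoint_location <checkpoint-name> "${NAMESPACE}" /checkpoints`. An empty location usually just means the checkpoint has not been written yet.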
### Restore Failing
1. Check pod logs:
```bash
kubectl logs <worker-pod> -n ${NAMESPACE}
```
2. Describe the restore target pod:
```bash
kubectl describe pod <worker-pod> -n ${NAMESPACE}
```
3. Confirm the referenced checkpoint is still `Ready`:
```bash
kubectl get dckpt <checkpoint-name> -n ${NAMESPACE}
```
## Planned Features
- TensorRT-LLM backend support
- S3/MinIO storage backend
- OCI registry storage backend
- Multi-GPU checkpoints
## Related Documentation
- [Dynamo Snapshot Helm Chart README](../../deploy/helm/charts/snapshot/README.md) - Chart configuration
- [Installation Guide](installation-guide.md) - Platform installation
- [API Reference](api-reference.md) - Complete CRD specifications
---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: Checkpointing
---
> ⚠️ **Experimental Feature**: Dynamo Snapshot is currently in **beta/preview**. The Dynamo Snapshot DaemonSet runs in privileged mode to perform CRIU operations. See [Limitations](#limitations) for details.
**Dynamo Snapshot** (Checkpoint/Restore in Kubernetes) is an experimental infrastructure for fast-starting GPU applications using CRIU (Checkpoint/Restore in User-space). Dynamo Snapshot dramatically reduces cold-start times for large models from minutes to seconds by capturing initialized application state and restoring it on-demand.
## What is Dynamo Snapshot?
Dynamo Snapshot provides:
- **Fast cold starts**: Restore GPU-accelerated applications in seconds instead of minutes
- **CUDA state preservation**: Checkpoint and restore GPU memory and CUDA contexts
- **Kubernetes-native**: Integrates seamlessly with Kubernetes primitives
- **Storage flexibility**: PVC-based storage (S3/OCI planned for future releases)
- **Namespace isolation**: Each namespace gets its own checkpoint infrastructure
## Use Cases
### 1. With NVIDIA Dynamo Platform (Recommended)
Use Dynamo Snapshot as part of the Dynamo platform for automatic checkpoint management:
- Automatic checkpoint creation and lifecycle management
- Seamless integration with DynamoGraphDeployment CRDs
- Built-in autoscaling with fast restore
📖 **[Read the Dynamo Integration Guide →](dynamo.md)**
## Architecture
Dynamo Snapshot consists of two main components:
### 1. Dynamo Snapshot Helm Chart
Deploys the checkpoint/restore infrastructure:
- **DaemonSet**: Runs on GPU nodes to perform CRIU checkpoint operations
- **PVC**: Stores checkpoint data (rootfs diffs, CUDA memory state)
- **RBAC**: Namespace-scoped or cluster-wide permissions
- **Seccomp Profile**: Security policies for CRIU syscalls (needs to be injected into workload pods)
### 2. External Restore via DaemonSet
The DaemonSet performs checkpoint/restore externally using `nsenter` to enter pod namespaces:
- **Checkpoint**: Freezes the running process and dumps state (CPU + GPU) to storage
- **Restore**: Enters a placeholder pod's namespaces and restores the checkpointed process via `nsrestore`
## Quick Start
To install the Dynamo Snapshot DaemonSet in your cluster, run the following:
```bash
helm install snapshot nvidia/snapshot \
--namespace my-team \
--create-namespace \
--set storage.pvc.size=100Gi
```
## Key Features
### ✅ Currently Supported
- ✅ **vLLM and SGLang backends** (TensorRT-LLM planned)
- ✅ **LLM decode/prefill workers only** (multimodal, embedding, and diffusion workers are not supported)
- ✅ Cross-node, single-GPU checkpoints (requires RWX storage)
- ✅ PVC storage backend (RWX for multi-node)
- ✅ CUDA checkpoint/restore
- ✅ PyTorch distributed state (with `GLOO_SOCKET_IFNAME=lo`)
- ✅ Namespace-scoped and cluster-wide RBAC
- ✅ Idempotent checkpoint creation
- ✅ Automatic signal-based checkpoint coordination
### 🚧 Planned Features
- 🚧 TensorRT-LLM backend support
- 🚧 S3/MinIO storage backend
- 🚧 OCI registry storage backend
- 🚧 Multi-GPU checkpoints
- 🚧 Multi-node distributed checkpoints
## Limitations
⚠️ **Important**: Dynamo Snapshot has significant limitations that may impact production readiness:
### Security Considerations
- **🔴 Privileged DaemonSet**: The Dynamo Snapshot DaemonSet runs in privileged mode with `hostPID`, `hostIPC`, and `hostNetwork` to perform CRIU operations. Workload pods do **not** need privileged mode — all CRIU privilege lives in the DaemonSet.
- **Security Impact**: The privileged DaemonSet can:
  - Access all host devices and processes
  - Bypass most security restrictions
  - Potentially compromise node security if exploited
### Technical Limitations
- **x86_64 (amd64) only**: `cuda-checkpoint` does not support ARM64. The snapshot agent and placeholder images are built for x86_64 only.
- **NVIDIA driver 580.xx or newer required**: Dynamo Snapshot depends on `cuda-checkpoint`, which requires R580+ drivers.
- **vLLM and SGLang backends only**: TensorRT-LLM is not supported.
- **LLM workers only**: Checkpoint/restore supports LLM decode and prefill workers. Specialized workers (multimodal, embedding, diffusion) are not supported.
- **Single-GPU only**: Multi-GPU configurations are not yet supported
- **Network state limitations**: Active TCP connections are closed during restore (use `tcp-close` CRIU option)
- **Storage**: Only PVC storage is currently implemented (S3/OCI planned)
### Recommendation
Dynamo Snapshot is best suited for:
- ✅ Development and testing environments
- ✅ Research and experimentation
- ✅ Controlled production environments with appropriate security controls
- ❌ Security-sensitive production workloads without proper risk assessment
## Documentation
### Getting Started
- [Dynamo Integration Guide](dynamo.md) - Using Dynamo Snapshot with Dynamo Platform
- [Dynamo Snapshot Helm Chart README](https://github.com/ai-dynamo/dynamo/tree/main/deploy/helm/charts/snapshot/README.md) - Helm chart configuration
### Related Documentation
- [CRIU Documentation](https://criu.org/Main_Page) - Upstream CRIU docs
## Prerequisites
- Kubernetes 1.21+
- GPU nodes with NVIDIA runtime (`nvidia` runtime class)
- NVIDIA driver 580.xx or newer on the target GPU nodes
- containerd runtime (for container inspection; CRIU is bundled in Dynamo Snapshot images)
- RWX storage class (for multi-node deployments)
- **Security clearance for privileged DaemonSet** (the Dynamo Snapshot agent runs privileged with hostPID/hostIPC/hostNetwork)
## Contributing
Dynamo Snapshot is part of the NVIDIA Dynamo project. Contributions are welcome!
## License
Apache License 2.0
---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: Integration with Dynamo
---
> ⚠️ **Experimental Feature**: Dynamo Snapshot is currently in **beta/preview**. The Dynamo Snapshot DaemonSet runs in privileged mode to perform CRIU operations. See [Limitations](#limitations) for details.
Checkpointing captures the complete state of a running worker pod (including GPU memory) and saves it to storage. New pods can restore from this checkpoint instead of performing a full cold start.
| Startup Type | Time | What Happens |
|--------------|------|--------------|
| **Cold Start** | ~1 min | Download model, load to GPU, initialize engine |
| **Warm Start** (checkpoint) | < 10 sec | Restore from checkpoint tar |
## Prerequisites
- Dynamo Platform installed on a Kubernetes cluster with **x86_64 (amd64)** GPU nodes
- Dynamo Snapshot Helm chart installed (separate from platform)
- RWX PVC storage (PVC is currently the only supported backend)
- NVIDIA driver 580.xx or newer on the target GPU nodes
- vLLM or SGLang backend (TensorRT-LLM is not supported)
## Quick Start
### 1. Install Dynamo Snapshot Infrastructure
First, install the Dynamo Snapshot Helm chart in each namespace where you need checkpointing:
```bash
# Install Dynamo Snapshot infrastructure
helm install snapshot nvidia/snapshot \
--namespace my-team \
--create-namespace \
--set storage.pvc.size=100Gi
```
This creates:
- A PVC for checkpoint storage (`snapshot-pvc`)
- A DaemonSet for CRIU operations (`snapshot-agent`)
### 2. Configure Operator Values
Update your Helm values to point to the Dynamo Snapshot infrastructure:
```yaml
# values.yaml
dynamo-operator:
  checkpoint:
    enabled: true
    storage:
      type: pvc  # Only PVC is currently supported (S3/OCI planned)
      pvc:
        pvcName: "snapshot-pvc"  # Must match Dynamo Snapshot chart
        basePath: "/checkpoints"
      signalHostPath: "/var/lib/snapshot/signals"  # Must match Dynamo Snapshot chart
```
### 3. Configure Your DGD
Add checkpoint configuration to the worker service in your `DynamoGraphDeployment` (DGD). Both vLLM and SGLang are supported; use the appropriate `backendFramework`, command, and CLI flags for your backend.
#### vLLM Example
```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: my-llm
spec:
  services:
    worker:
      replicas: 1
      extraPodSpec:
        mainContainer:
          image: nvcr.io/nvidia/ai-dynamo/dynamo-vllm-placeholder:latest
          command: ["python3"]
          args:
            - "-m"
            - "dynamo.vllm"
            - "--model"
            - "meta-llama/Llama-3-8B"
            - "--max-model-len"
            - "4096"
            - "--gpu-memory-utilization"
            - "0.90"
          env:
            # Required for cross-node checkpoint/restore
            - name: GLOO_SOCKET_IFNAME
              value: "lo"
            - name: NCCL_SOCKET_IFNAME
              value: "lo"
          resources:
            limits:
              nvidia.com/gpu: "1"
      checkpoint:
        enabled: true
        mode: auto
        identity:
          model: "meta-llama/Llama-3-8B"
          backendFramework: "vllm"
          tensorParallelSize: 1
          dtype: "bfloat16"
          maxModelLen: 4096
```
#### SGLang Example
```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: my-sglang-llm
spec:
  services:
    worker:
      replicas: 1
      extraPodSpec:
        mainContainer:
          image: nvcr.io/nvidia/ai-dynamo/dynamo-sglang-placeholder:latest
          command: ["python3"]
          args:
            - "-m"
            - "dynamo.sglang"
            - "--model"
            - "meta-llama/Llama-3-8B"
            - "--mem-fraction-static"
            - "0.90"
          env:
            # Required for cross-node checkpoint/restore
            - name: GLOO_SOCKET_IFNAME
              value: "lo"
            - name: NCCL_SOCKET_IFNAME
              value: "lo"
          resources:
            limits:
              nvidia.com/gpu: "1"
      checkpoint:
        enabled: true
        mode: auto
        identity:
          model: "meta-llama/Llama-3-8B"
          backendFramework: "sglang"
          tensorParallelSize: 1
          dtype: "bfloat16"
          maxModelLen: 4096
```
**Key differences between backends:**
| Setting | vLLM | SGLang |
|---------|------|--------|
| Module | `dynamo.vllm` | `dynamo.sglang` |
| Max context (optional) | `--max-model-len` | `--context-length` |
| GPU memory | `--gpu-memory-utilization` | `--mem-fraction-static` |
| Placeholder image | `dynamo-vllm-placeholder` | `dynamo-sglang-placeholder` |
| Identity `backendFramework` | `"vllm"` | `"sglang"` |
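
As a rough illustration of the table above, the sketch below assembles the container `args` list for either backend. The module and flag names are taken from the table; the `BACKEND_FLAGS` mapping and the `build_worker_args` helper are hypothetical and are not part of Dynamo.

```python
# Hypothetical helper: only the module and flag names come from the table above.
BACKEND_FLAGS = {
    "vllm": {
        "module": "dynamo.vllm",
        "max_context": "--max-model-len",
        "gpu_memory": "--gpu-memory-utilization",
    },
    "sglang": {
        "module": "dynamo.sglang",
        "max_context": "--context-length",
        "gpu_memory": "--mem-fraction-static",
    },
}


def build_worker_args(backend, model, gpu_memory_fraction, max_context=None):
    """Return the args list that follows `python3` for the chosen backend."""
    flags = BACKEND_FLAGS[backend]  # raises KeyError for an unsupported backend
    args = ["-m", flags["module"], "--model", model]
    if max_context is not None:  # the max-context flag is optional
        args += [flags["max_context"], str(max_context)]
    args += [flags["gpu_memory"], str(gpu_memory_fraction)]
    return args
```

Calling `build_worker_args("sglang", "meta-llama/Llama-3-8B", "0.90")` yields the same `args` as the SGLang example, which omits the optional context-length flag.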
> **Note:** Do **not** set `DYN_READY_FOR_CHECKPOINT_FILE` or `DYN_CHECKPOINT_READY_FILE` in the DGD worker env vars. These are injected automatically by the operator's checkpoint controller into checkpoint job pods only. Setting them on worker pods causes all workers to enter checkpoint mode instead of cold-starting normally.
### 4. Deploy
```bash
kubectl apply -f my-llm.yaml -n dynamo-system
```
On first deployment:
1. A checkpoint job runs to create the checkpoint
2. Worker pods start with cold start (checkpoint not ready yet)
3. Once checkpoint is ready, new pods (scale-up, restarts) restore from checkpoint
## Checkpoint Modes
### Auto Mode (Recommended)
The operator automatically creates a `DynamoCheckpoint` CR if one doesn't exist:
```yaml
checkpoint:
  enabled: true
  mode: auto
  identity:
    model: "meta-llama/Llama-3-8B"
    backendFramework: "vllm"  # or "sglang"
    tensorParallelSize: 1
    dtype: "bfloat16"
    maxModelLen: 4096
```
### Reference Mode
Reference an existing `DynamoCheckpoint` CR by its 16-character hash using `checkpointRef`:
```yaml
checkpoint:
  enabled: true
  checkpointRef: "e5962d34ba272638"  # 16-char hash of DynamoCheckpoint CR
```
This is useful when:
- You want to **pre-warm checkpoints** before creating DGDs
- You want **explicit control** over which checkpoint to use
**Flow:**
1. Create a `DynamoCheckpoint` CR (see [DynamoCheckpoint CRD](#dynamocheckpoint-crd) section)
2. Wait for it to become `Ready`
3. Reference it in your DGD using `checkpointRef` with the hash
```bash
# Check checkpoint status (using 16-char hash name)
kubectl get dynamocheckpoint e5962d34ba272638 -n dynamo-system
NAME               MODEL                   BACKEND   PHASE   HASH               AGE
e5962d34ba272638   meta-llama/Llama-3-8B   vllm      Ready   e5962d34ba272638   5m
# Now create DGD referencing it
kubectl apply -f my-dgd.yaml
```
## Checkpoint Identity
Checkpoints are uniquely identified by a **16-character SHA256 hash** (64 bits) of configuration that affects runtime state:
| Field | Required | Affects Hash | Example |
|-------|----------|-------------|---------|
| `model` | ✓ | ✓ | `meta-llama/Llama-3-8B` |
| `backendFramework` | ✓ | ✓ | `sglang`, `vllm` |
| `dynamoVersion` | | ✓ | `0.9.0`, `1.0.0` |
| `tensorParallelSize` | | ✓ | `1`, `2`, `4`, `8` (default: 1) |
| `pipelineParallelSize` | | ✓ | `1`, `2` (default: 1) |
| `dtype` | | ✓ | `float16`, `bfloat16`, `fp8` |
| `maxModelLen` | | ✓ | `4096`, `8192` |
| `extraParameters` | | ✓ | Custom key-value pairs |
**Not included in hash** (don't invalidate checkpoint):
- `replicas`
- `nodeSelector`, `affinity`, `tolerations`
- `resources` (requests/limits)
- Logging/observability config
**Example with all fields:**
```yaml
checkpoint:
  enabled: true
  mode: auto
  identity:
    model: "meta-llama/Llama-3-8B"
    backendFramework: "vllm"
    dynamoVersion: "0.9.0"
    tensorParallelSize: 1
    pipelineParallelSize: 1
    dtype: "bfloat16"
    maxModelLen: 8192
    extraParameters:
      enableChunkedPrefill: "true"
      quantization: "awq"
```
**Checkpoint Naming:** The `DynamoCheckpoint` CR is automatically named using the 16-character identity hash (e.g., `e5962d34ba272638`).
**Checkpoint Sharing:** Multiple DGDs with the same identity automatically share the same checkpoint.
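
To make the idea of an identity hash concrete, here is a minimal sketch. It assumes the identity is serialized as JSON with sorted keys and that the name is the first 16 hex characters of the SHA256 digest. Both choices are assumptions for illustration: the operator's real canonicalization is not described here, so this function will not reproduce the hashes the operator generates. The `identity_hash` name and `HASHED_FIELDS` tuple are hypothetical.

```python
import hashlib
import json

# Fields marked "Affects Hash" in the table above.
HASHED_FIELDS = (
    "model",
    "backendFramework",
    "dynamoVersion",
    "tensorParallelSize",
    "pipelineParallelSize",
    "dtype",
    "maxModelLen",
    "extraParameters",
)


def identity_hash(identity):
    """Illustrative 16-hex-char (64-bit) hash over the identity fields only.

    Assumption: canonical JSON plus truncation. Defaults such as
    tensorParallelSize=1 are NOT filled in here; the operator may differ.
    """
    hashed = {k: identity[k] for k in HASHED_FIELDS if k in identity}
    canonical = json.dumps(hashed, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]
```

Because fields such as `replicas` are dropped before hashing, two deployments that differ only in those fields map to the same value, which is the property that lets them share one checkpoint.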
## DynamoCheckpoint CRD
The `DynamoCheckpoint` (shortname: `dckpt`) is a Kubernetes Custom Resource that manages checkpoint lifecycle.
**When to create a DynamoCheckpoint directly:**
- **Pre-warming:** Create checkpoints before deploying DGDs for instant startup
- **Explicit control:** Manage checkpoint lifecycle independently from DGDs
**Note:** With the new hash-based naming, checkpoint names are automatically generated (16-character hash). The operator handles checkpoint discovery and reuse automatically in `auto` mode.
**Create a checkpoint:**
```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoCheckpoint
metadata:
  name: e5962d34ba272638  # Use the computed 16-char hash
spec:
  identity:
    model: "meta-llama/Llama-3-8B"
    backendFramework: "vllm"
    tensorParallelSize: 1
    dtype: "bfloat16"
  job:
    activeDeadlineSeconds: 3600
    podTemplateSpec:
      spec:
        containers:
          - name: main
            image: nvcr.io/nvidia/ai-dynamo/dynamo-vllm:latest
            command: ["python3", "-m", "dynamo.vllm"]
            args: ["--model", "meta-llama/Llama-3-8B"]
            resources:
              limits:
                nvidia.com/gpu: "1"
            env:
              - name: HF_TOKEN
                valueFrom:
                  secretKeyRef:
                    name: hf-token-secret
                    key: HF_TOKEN
```
**Note:** You can compute the hash yourself, or use `auto` mode to let the operator create it.
**Check status:**
```bash
# List all checkpoints
kubectl get dynamocheckpoint -n dynamo-system
# Or use shortname
kubectl get dckpt -n dynamo-system
NAME               MODEL                    BACKEND   PHASE      HASH               AGE
e5962d34ba272638   meta-llama/Llama-3-8B    vllm      Ready      e5962d34ba272638   5m
a7b4f89c12de3456   meta-llama/Llama-3-70B   vllm      Creating   a7b4f89c12de3456   2m
```
**Phases:**
| Phase | Description |
|-------|-------------|
| `Pending` | CR created, waiting for job to start |
| `Creating` | Checkpoint job is running |
| `Ready` | Checkpoint available for use |
| `Failed` | Checkpoint creation failed |
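
The phases above form a simple progression that ends in either `Ready` or `Failed`. A script that needs to block until a checkpoint is usable could poll along the lines of the sketch below. The `wait_for_checkpoint` helper is hypothetical; `get_phase` is any callable you supply that returns the current phase string.

```python
import time

# Phases after which no further progress is expected (from the table above).
TERMINAL_PHASES = {"Ready", "Failed"}


def wait_for_checkpoint(get_phase, timeout_s=3600.0, poll_s=5.0,
                        sleep=time.sleep, clock=time.monotonic):
    """Poll get_phase() until it reports Ready or Failed, or the timeout expires.

    sleep and clock are injectable so the loop can be exercised without waiting.
    """
    deadline = clock() + timeout_s
    while True:
        phase = get_phase()
        if phase in TERMINAL_PHASES:
            return phase
        if clock() >= deadline:
            raise TimeoutError(
                f"checkpoint still in phase {phase!r} after {timeout_s}s"
            )
        sleep(poll_s)
```

One way to implement `get_phase` is to shell out to `kubectl get dckpt <hash> -o jsonpath='{.status.phase}'`. That field path is an assumption inferred from the `kubectl describe` output, so confirm it against the CRD before relying on it.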
**Detailed status:**
```bash
kubectl describe dckpt e5962d34ba272638 -n dynamo-system
```
```yaml
Status:
  Phase: Ready
  IdentityHash: e5962d34ba272638
  Location: /checkpoints/e5962d34ba272638
  StorageType: pvc
  CreatedAt: 2026-01-29T10:05:00Z
```
**Reference from DGD:**
Once the checkpoint is `Ready`, you can reference it by hash:
```yaml
spec:
  services:
    VllmWorker:
      checkpoint:
        enabled: true
        checkpointRef: "e5962d34ba272638"  # 16-char hash
```
Or use `auto` mode and the operator will find/create it automatically.
## Limitations
- **x86_64 (amd64) only**: `cuda-checkpoint` does not support ARM64. The snapshot agent and placeholder images are built for x86_64 only.
- **NVIDIA driver 580.xx or newer required**: Dynamo Snapshot depends on `cuda-checkpoint`, which requires R580+ drivers.
- **vLLM and SGLang backends only**: TensorRT-LLM is not supported.
- **LLM workers only**: Checkpoint/restore supports LLM decode and prefill workers. Specialized workers (multimodal, embedding, diffusion) are not supported.
- **Single-GPU only**: Multi-GPU configurations are not yet supported (planned)
- **Network state**: Active TCP connections are closed during restore (handled with `tcp-close` CRIU option)
- **Storage**: Only PVC backend currently implemented (S3/OCI planned)
- **Security**: Dynamo Snapshot runs a **privileged DaemonSet**, which is required for CRIU operations
## Troubleshooting
### Checkpoint Not Creating
1. Check the checkpoint job:
```bash
kubectl get jobs -l nvidia.com/snapshot-is-checkpoint-source=true -n dynamo-system
kubectl logs job/checkpoint-<name> -n dynamo-system
```
2. Check the DaemonSet:
```bash
kubectl logs daemonset/snapshot-agent -n dynamo-system
```
3. Verify storage access:
```bash
kubectl exec -it <checkpoint-agent-pod> -n dynamo-system -- ls -la /checkpoints
```
### Restore Failing
1. Check pod logs:
```bash
kubectl logs <worker-pod> -n dynamo-system
```
2. Verify checkpoint file exists:
```bash
# For PVC
kubectl exec -it <any-pod-with-pvc> -n dynamo-system -- ls -la /checkpoints/
```
3. Check environment variables:
```bash
kubectl exec <worker-pod> -n dynamo-system -- env | grep DYN_CHECKPOINT
```
### Cold Start Despite Checkpoint
Pods fall back to cold start if:
- Checkpoint file doesn't exist yet (still being created)
- Checkpoint file is corrupted
- CRIU restore fails
Check the worker pod logs for the "Falling back to cold start" message.
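
If you have saved logs from several workers, a small script can flag the ones that fell back. The sketch below relies only on the message text quoted above; the `find_fallback_lines` helper is hypothetical, and the actual log format around the message is not assumed.

```python
# The only grounded detail here is the marker string quoted in this section.
FALLBACK_MARKER = "Falling back to cold start"


def find_fallback_lines(log_text):
    """Return (line_number, line) pairs for every line containing the marker."""
    return [
        (number, line)
        for number, line in enumerate(log_text.splitlines(), start=1)
        if FALLBACK_MARKER in line
    ]
```

Feed it the output of `kubectl logs <worker-pod> -n dynamo-system`; an empty result means the marker was not found in those logs.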
## Environment Variables
| Variable | Description |
|----------|-------------|
| `DYN_CHECKPOINT_STORAGE_TYPE` | Backend: `pvc`, `s3`, `oci` (`s3` and `oci` are currently no-ops) |
| `DYN_CHECKPOINT_LOCATION` | Full checkpoint location (checkpoint jobs) |
| `DYN_CHECKPOINT_PATH` | Base checkpoint directory (restore pods, PVC) |
| `DYN_CHECKPOINT_HASH` | Identity hash |
| `DYN_READY_FOR_CHECKPOINT_FILE` | Ready-for-checkpoint file path (checkpoint jobs) |
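
When debugging, it can help to see which checkpoint directory a pod would use given these variables. The sketch below assumes that a restore pod's checkpoint lives at `DYN_CHECKPOINT_PATH` joined with `DYN_CHECKPOINT_HASH`; this is inferred from the `Location: /checkpoints/e5962d34ba272638` status shown earlier and is not a documented contract. The `resolve_checkpoint_dir` helper is hypothetical.

```python
import os
import posixpath


def resolve_checkpoint_dir(env=None):
    """Best-effort guess at the checkpoint directory from DYN_CHECKPOINT_* vars.

    Returns None when the variables needed for either case are missing.
    """
    env = os.environ if env is None else env
    # Checkpoint jobs: the full location is provided directly.
    location = env.get("DYN_CHECKPOINT_LOCATION")
    if location:
        return location
    # Restore pods (PVC): assumed layout is <base path>/<identity hash>.
    base = env.get("DYN_CHECKPOINT_PATH")
    digest = env.get("DYN_CHECKPOINT_HASH")
    if base and digest:
        return posixpath.join(base, digest)
    return None
```

Passing a plain dict instead of relying on `os.environ` makes it easy to paste in the output of the `env | grep DYN_CHECKPOINT` command from the troubleshooting steps.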
## Complete Example
Create a checkpoint and use it in a DGD:
```yaml
# 1. Create the DynamoCheckpoint CR
apiVersion: nvidia.com/v1alpha1
kind: DynamoCheckpoint
metadata:
  name: e5962d34ba272638  # 16-char hash (computed from identity)
  namespace: dynamo-system
spec:
  identity:
    model: "meta-llama/Meta-Llama-3-8B-Instruct"
    backendFramework: "vllm"
    tensorParallelSize: 1
    dtype: "bfloat16"
  job:
    activeDeadlineSeconds: 3600
    backoffLimit: 3
    podTemplateSpec:
      spec:
        containers:
          - name: main
            image: nvcr.io/nvidia/ai-dynamo/dynamo-vllm-placeholder:latest
            command: ["python3"]
            args:
              - "-m"
              - "dynamo.vllm"
              - "--model"
              - "meta-llama/Meta-Llama-3-8B-Instruct"
              - "--max-model-len"
              - "4096"
              - "--gpu-memory-utilization"
              - "0.90"
            env:
              - name: HF_TOKEN
                valueFrom:
                  secretKeyRef:
                    name: hf-token-secret
                    key: HF_TOKEN
              - name: GLOO_SOCKET_IFNAME
                value: "lo"
              - name: NCCL_SOCKET_IFNAME
                value: "lo"
            resources:
              limits:
                nvidia.com/gpu: "1"
        restartPolicy: Never
---
# 2. Wait for Ready: kubectl get dckpt e5962d34ba272638 -n dynamo-system -w
---
# 3. Reference the checkpoint in your DGD
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: my-llm
  namespace: dynamo-system
spec:
  services:
    worker:
      replicas: 2
      extraPodSpec:
        mainContainer:
          image: nvcr.io/nvidia/ai-dynamo/dynamo-vllm-placeholder:latest
          command: ["python3"]
          args:
            - "-m"
            - "dynamo.vllm"
            - "--model"
            - "meta-llama/Meta-Llama-3-8B-Instruct"
            - "--max-model-len"
            - "4096"
            - "--gpu-memory-utilization"
            - "0.90"
          env:
            - name: GLOO_SOCKET_IFNAME
              value: "lo"
            - name: NCCL_SOCKET_IFNAME
              value: "lo"
          resources:
            limits:
              nvidia.com/gpu: "1"
      checkpoint:
        enabled: true
        checkpointRef: "e5962d34ba272638"  # Reference by hash
```
## Related Documentation
- [Dynamo Snapshot Overview](README.md) - Dynamo Snapshot architecture and use cases
- [Dynamo Snapshot Helm Chart README](https://github.com/ai-dynamo/dynamo/tree/main/deploy/helm/charts/snapshot/README.md) - Chart configuration
- [Installation Guide](../installation-guide.md) - Platform installation
- [API Reference](../api-reference.md) - Complete CRD specifications