---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: Integration with Dynamo
---

> ⚠️ **Experimental Feature**: Dynamo Snapshot is currently in **beta/preview**. The Dynamo Snapshot DaemonSet runs in privileged mode to perform CRIU operations. See [Limitations](#limitations) for details.

Checkpointing captures the complete state of a running worker pod (including GPU memory) and saves it to storage. New pods can restore from this checkpoint instead of performing a full cold start.

| Startup Type | Time | What Happens |
|--------------|------|--------------|
| **Cold Start** | ~1 min | Download model, load to GPU, initialize engine |
| **Warm Start** (checkpoint) | < 10 sec | Restore from checkpoint tar |

## Prerequisites

- Dynamo Platform installed on a Kubernetes cluster with **x86_64 (amd64)** GPU nodes
- Dynamo Snapshot Helm chart installed (separate from platform)
- RWX PVC storage (PVC is currently the only supported backend)
- NVIDIA driver 580.xx or newer on the target GPU nodes
- vLLM or SGLang backend (TensorRT-LLM is not supported)

## Quick Start

### 1. Install Dynamo Snapshot Infrastructure

First, install the Dynamo Snapshot Helm chart in each namespace where you need checkpointing:

```bash
# Install Dynamo Snapshot infrastructure
helm install snapshot nvidia/snapshot \
  --namespace my-team \
  --create-namespace \
  --set storage.pvc.size=100Gi
```

This creates:

- A PVC for checkpoint storage (`snapshot-pvc`)
- A DaemonSet for CRIU operations (`snapshot-agent`)

### 2. Configure Operator Values

Update your Helm values to point to the Dynamo Snapshot infrastructure:

```yaml
# values.yaml
dynamo-operator:
  checkpoint:
    enabled: true
    storage:
      type: pvc  # Only PVC is currently supported (S3/OCI planned)
      pvc:
        pvcName: "snapshot-pvc"  # Must match Dynamo Snapshot chart
        basePath: "/checkpoints"
      signalHostPath: "/var/lib/snapshot/signals"  # Must match Dynamo Snapshot chart
```

### 3. Configure Your DGD

Add checkpoint configuration to your worker service. Both vLLM and SGLang are supported; use the appropriate `backendFramework`, command, and CLI flags for your backend.

#### vLLM Example

```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: my-llm
spec:
  services:
    worker:
      replicas: 1
      extraPodSpec:
        mainContainer:
          image: nvcr.io/nvidia/ai-dynamo/dynamo-vllm-placeholder:latest
          command: ["python3"]
          args:
            - "-m"
            - "dynamo.vllm"
            - "--model"
            - "meta-llama/Llama-3-8B"
            - "--max-model-len"
            - "4096"
            - "--gpu-memory-utilization"
            - "0.90"
          env:
            # Required for cross-node checkpoint/restore
            - name: GLOO_SOCKET_IFNAME
              value: "lo"
            - name: NCCL_SOCKET_IFNAME
              value: "lo"
      resources:
        limits:
          nvidia.com/gpu: "1"
      checkpoint:
        enabled: true
        mode: auto
        identity:
          model: "meta-llama/Llama-3-8B"
          backendFramework: "vllm"
          tensorParallelSize: 1
          dtype: "bfloat16"
          maxModelLen: 4096
```

#### SGLang Example

```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: my-sglang-llm
spec:
  services:
    worker:
      replicas: 1
      extraPodSpec:
        mainContainer:
          image: nvcr.io/nvidia/ai-dynamo/dynamo-sglang-placeholder:latest
          command: ["python3"]
          args:
            - "-m"
            - "dynamo.sglang"
            - "--model"
            - "meta-llama/Llama-3-8B"
            - "--mem-fraction-static"
            - "0.90"
          env:
            # Required for cross-node checkpoint/restore
            - name: GLOO_SOCKET_IFNAME
              value: "lo"
            - name: NCCL_SOCKET_IFNAME
              value: "lo"
      resources:
        limits:
          nvidia.com/gpu: "1"
      checkpoint:
        enabled: true
        mode: auto
        identity:
          model: "meta-llama/Llama-3-8B"
          backendFramework: "sglang"
          tensorParallelSize: 1
          dtype: "bfloat16"
          maxModelLen: 4096
```

**Key differences between backends:**

| Setting | vLLM | SGLang |
|---------|------|--------|
| Module | `dynamo.vllm` | `dynamo.sglang` |
| Max context (optional) | `--max-model-len` | `--context-length` |
| GPU memory | `--gpu-memory-utilization` | `--mem-fraction-static` |
| Placeholder image | `dynamo-vllm-placeholder` | `dynamo-sglang-placeholder` |
| Identity `backendFramework` | `"vllm"` | `"sglang"` |

> **Note:** Do **not** set `DYN_READY_FOR_CHECKPOINT_FILE` or `DYN_CHECKPOINT_READY_FILE` in the DGD worker env vars. The operator's checkpoint controller injects them automatically into checkpoint job pods only. Setting them on worker pods causes all workers to enter checkpoint mode instead of cold-starting normally.

### 4. Deploy

```bash
kubectl apply -f my-llm.yaml -n dynamo-system
```

On first deployment:

1. A checkpoint job runs to create the checkpoint
2. Worker pods start with a cold start (the checkpoint is not ready yet)
3. Once the checkpoint is ready, new pods (scale-up, restarts) restore from it

## Checkpoint Modes

### Auto Mode (Recommended)

The operator automatically creates a `DynamoCheckpoint` CR if one doesn't exist:

```yaml
checkpoint:
  enabled: true
  mode: auto
  identity:
    model: "meta-llama/Llama-3-8B"
    backendFramework: "vllm"  # or "sglang"
    tensorParallelSize: 1
    dtype: "bfloat16"
    maxModelLen: 4096
```

### Reference Mode

Reference an existing `DynamoCheckpoint` CR by its 16-character hash using `checkpointRef`:

```yaml
checkpoint:
  enabled: true
  checkpointRef: "e5962d34ba272638"  # 16-char hash of DynamoCheckpoint CR
```

This is useful when:

- You want to **pre-warm checkpoints** before creating DGDs
- You want **explicit control** over which checkpoint to use

**Flow:**

1. Create a `DynamoCheckpoint` CR (see the [DynamoCheckpoint CRD](#dynamocheckpoint-crd) section)
2. Wait for it to become `Ready`
3. Reference it in your DGD using `checkpointRef` with the hash

```bash
# Check checkpoint status (using 16-char hash name)
kubectl get dynamocheckpoint e5962d34ba272638 -n dynamo-system
NAME               MODEL                   BACKEND   PHASE   HASH               AGE
e5962d34ba272638   meta-llama/Llama-3-8B   vllm      Ready   e5962d34ba272638   5m

# Now create DGD referencing it
kubectl apply -f my-dgd.yaml
```

## Checkpoint Identity

Checkpoints are uniquely identified by a **16-character SHA256 hash** (64 bits) of the configuration that affects runtime state:

| Field | Required | Affects Hash | Example |
|-------|----------|-------------|---------|
| `model` | ✓ | ✓ | `meta-llama/Llama-3-8B` |
| `backendFramework` | ✓ | ✓ | `sglang`, `vllm` |
| `dynamoVersion` | | ✓ | `0.9.0`, `1.0.0` |
| `tensorParallelSize` | | ✓ | `1`, `2`, `4`, `8` (default: 1) |
| `pipelineParallelSize` | | ✓ | `1`, `2` (default: 1) |
| `dtype` | | ✓ | `float16`, `bfloat16`, `fp8` |
| `maxModelLen` | | ✓ | `4096`, `8192` |
| `extraParameters` | | ✓ | Custom key-value pairs |

**Not included in hash** (changing these does not invalidate the checkpoint):

- `replicas`
- `nodeSelector`, `affinity`, `tolerations`
- `resources` (requests/limits)
- Logging/observability config

**Example with all fields:**

```yaml
checkpoint:
  enabled: true
  mode: auto
  identity:
    model: "meta-llama/Llama-3-8B"
    backendFramework: "vllm"
    dynamoVersion: "0.9.0"
    tensorParallelSize: 1
    pipelineParallelSize: 1
    dtype: "bfloat16"
    maxModelLen: 8192
    extraParameters:
      enableChunkedPrefill: "true"
      quantization: "awq"
```

**Checkpoint Naming:** The `DynamoCheckpoint` CR is automatically named using the 16-character identity hash (e.g., `e5962d34ba272638`).

**Checkpoint Sharing:** Multiple DGDs with the same identity automatically share the same checkpoint.

## DynamoCheckpoint CRD

The `DynamoCheckpoint` (shortname: `dckpt`) is a Kubernetes Custom Resource that manages checkpoint lifecycle.

**When to create a DynamoCheckpoint directly:**

- **Pre-warming:** Create checkpoints before deploying DGDs for instant startup
- **Explicit control:** Manage checkpoint lifecycle independently from DGDs

**Note:** With the new hash-based naming, checkpoint names are automatically generated (16-character hash). The operator handles checkpoint discovery and reuse automatically in `auto` mode.

**Create a checkpoint:**

```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoCheckpoint
metadata:
  name: e5962d34ba272638  # Use the computed 16-char hash
spec:
  identity:
    model: "meta-llama/Llama-3-8B"
    backendFramework: "vllm"
    tensorParallelSize: 1
    dtype: "bfloat16"
  job:
    activeDeadlineSeconds: 3600
    podTemplateSpec:
      spec:
        containers:
          - name: main
            image: nvcr.io/nvidia/ai-dynamo/dynamo-vllm:latest
            command: ["python3", "-m", "dynamo.vllm"]
            args: ["--model", "meta-llama/Llama-3-8B"]
            resources:
              limits:
                nvidia.com/gpu: "1"
            env:
              - name: HF_TOKEN
                valueFrom:
                  secretKeyRef:
                    name: hf-token-secret
                    key: HF_TOKEN
```

**Note:** You can compute the hash yourself, or use `auto` mode to let the operator create it.
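
This page does not specify the operator's exact hashing scheme, so the Python sketch below only illustrates the general idea: a deterministic, truncated SHA256 over the identity fields that affect runtime state. The names `HASHED_FIELDS` and `identity_hash` and the canonical JSON encoding are assumptions made for illustration. A hash produced this way should not be expected to match the name the operator generates.

```python
import hashlib
import json

# Identity fields this page lists as affecting the hash.
HASHED_FIELDS = (
    "model",
    "backendFramework",
    "dynamoVersion",
    "tensorParallelSize",
    "pipelineParallelSize",
    "dtype",
    "maxModelLen",
    "extraParameters",
)


def identity_hash(identity: dict) -> str:
    """Illustrative only: NOT the operator's real algorithm."""
    # Drop anything that is not part of the identity (e.g. replicas).
    relevant = {key: identity[key] for key in HASHED_FIELDS if key in identity}
    # Hypothetical canonical form: JSON with sorted keys and fixed separators,
    # so that field order never changes the result.
    canonical = json.dumps(relevant, sort_keys=True, separators=(",", ":"))
    # Keep the first 16 hex characters (64 bits) of the SHA256 digest.
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


base = identity_hash({
    "model": "meta-llama/Llama-3-8B",
    "backendFramework": "vllm",
    "tensorParallelSize": 1,
    "dtype": "bfloat16",
})

# Same identity, different field order, plus a non-identity setting.
reordered = identity_hash({
    "dtype": "bfloat16",
    "tensorParallelSize": 1,
    "backendFramework": "vllm",
    "model": "meta-llama/Llama-3-8B",
    "replicas": 4,
})

# A changed identity field produces a different checkpoint name.
changed = identity_hash({
    "model": "meta-llama/Llama-3-8B",
    "backendFramework": "vllm",
    "tensorParallelSize": 1,
    "dtype": "float16",
})

print(base == reordered)  # identical identities share one hash
print(base != changed)    # changing dtype changes the hash
```

The sketch shows why DGDs with the same identity can share a checkpoint while settings such as `replicas` do not invalidate it. For a name that is guaranteed to match, let `auto` mode create the CR and read the name back with `kubectl get dckpt`.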

**Check status:**

```bash
# List all checkpoints
kubectl get dynamocheckpoint -n dynamo-system

# Or use shortname
kubectl get dckpt -n dynamo-system
NAME               MODEL                    BACKEND   PHASE      HASH               AGE
e5962d34ba272638   meta-llama/Llama-3-8B    vllm      Ready      e5962d34ba272638   5m
a7b4f89c12de3456   meta-llama/Llama-3-70B   vllm      Creating   a7b4f89c12de3456   2m
```

**Phases:**

| Phase | Description |
|-------|-------------|
| `Pending` | CR created, waiting for job to start |
| `Creating` | Checkpoint job is running |
| `Ready` | Checkpoint available for use |
| `Failed` | Checkpoint creation failed |

**Detailed status:**

```bash
kubectl describe dckpt e5962d34ba272638 -n dynamo-system
```

```yaml
Status:
  Phase: Ready
  IdentityHash: e5962d34ba272638
  Location: /checkpoints/e5962d34ba272638
  StorageType: pvc
  CreatedAt: 2026-01-29T10:05:00Z
```

**Reference from DGD:**

Once the checkpoint is `Ready`, you can reference it by hash:

```yaml
spec:
  services:
    VllmWorker:
      checkpoint:
        enabled: true
        checkpointRef: "e5962d34ba272638"  # 16-char hash
```

Or use `auto` mode and the operator will find or create it automatically.

## Limitations

- **x86_64 (amd64) only**: `cuda-checkpoint` does not support ARM64. The snapshot agent and placeholder images are built for x86_64 only.
- **NVIDIA driver 580.xx or newer required**: Dynamo Snapshot depends on `cuda-checkpoint`, which requires R580+ drivers.
- **vLLM and SGLang backends only**: TensorRT-LLM is not supported.
- **LLM workers only**: Checkpoint/restore supports LLM decode and prefill workers. Specialized workers (multimodal, embedding, diffusion) are not supported.
- **Single-GPU only**: Multi-GPU configurations are not yet supported (planned).
- **Network state**: Active TCP connections are closed during restore (handled with the `tcp-close` CRIU option).
- **Storage**: Only the PVC backend is currently implemented (S3/OCI planned).
- **Security**: Dynamo Snapshot runs as a **privileged DaemonSet**, which is required to run CRIU.

## Troubleshooting

### Checkpoint Not Creating

1. Check the checkpoint job:

   ```bash
   kubectl get jobs -l nvidia.com/snapshot-is-checkpoint-source=true -n dynamo-system
   kubectl logs job/checkpoint-<name> -n dynamo-system
   ```

2. Check the DaemonSet:

   ```bash
   kubectl logs daemonset/snapshot-agent -n dynamo-system
   ```

3. Verify storage access from a pod that mounts the checkpoint PVC:

   ```bash
   kubectl exec -it <pod-name> -- ls -la /checkpoints
   ```

### Restore Failing

1. Check pod logs:

   ```bash
   kubectl logs <pod-name> -n dynamo-system
   ```

2. Verify that the checkpoint file exists:

   ```bash
   # For PVC
   kubectl exec -it <pod-name> -- ls -la /checkpoints/
   ```

3. Check environment variables:

   ```bash
   kubectl exec <pod-name> -- env | grep DYN_CHECKPOINT
   ```

### Cold Start Despite Checkpoint

Pods fall back to cold start if:

- The checkpoint file doesn't exist yet (still being created)
- The checkpoint file is corrupted
- The CRIU restore fails

Check the logs for a "Falling back to cold start" message.

## Environment Variables

| Variable | Description |
|----------|-------------|
| `DYN_CHECKPOINT_STORAGE_TYPE` | Backend: `pvc`, `s3`, `oci` (`s3` and `oci` are currently no-ops) |
| `DYN_CHECKPOINT_LOCATION` | Full checkpoint location (checkpoint jobs) |
| `DYN_CHECKPOINT_PATH` | Base checkpoint directory (restore pods, PVC) |
| `DYN_CHECKPOINT_HASH` | Identity hash |
| `DYN_READY_FOR_CHECKPOINT_FILE` | Ready-for-checkpoint file path (checkpoint jobs) |

## Complete Example

Create a checkpoint and use it in a DGD:

```yaml
# 1. Create the DynamoCheckpoint CR
apiVersion: nvidia.com/v1alpha1
kind: DynamoCheckpoint
metadata:
  name: e5962d34ba272638  # 16-char hash (computed from identity)
  namespace: dynamo-system
spec:
  identity:
    model: "meta-llama/Meta-Llama-3-8B-Instruct"
    backendFramework: "vllm"
    tensorParallelSize: 1
    dtype: "bfloat16"
  job:
    activeDeadlineSeconds: 3600
    backoffLimit: 3
    podTemplateSpec:
      spec:
        containers:
          - name: main
            image: nvcr.io/nvidia/ai-dynamo/dynamo-vllm-placeholder:latest
            command: ["python3"]
            args:
              - "-m"
              - "dynamo.vllm"
              - "--model"
              - "meta-llama/Meta-Llama-3-8B-Instruct"
              - "--max-model-len"
              - "4096"
              - "--gpu-memory-utilization"
              - "0.90"
            env:
              - name: HF_TOKEN
                valueFrom:
                  secretKeyRef:
                    name: hf-token-secret
                    key: HF_TOKEN
              - name: GLOO_SOCKET_IFNAME
                value: "lo"
              - name: NCCL_SOCKET_IFNAME
                value: "lo"
            resources:
              limits:
                nvidia.com/gpu: "1"
        restartPolicy: Never
---
# 2. Wait for Ready: kubectl get dckpt e5962d34ba272638 -n dynamo-system -w
---
# 3. Reference the checkpoint in your DGD
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: my-llm
  namespace: dynamo-system
spec:
  services:
    worker:
      replicas: 2
      extraPodSpec:
        mainContainer:
          image: nvcr.io/nvidia/ai-dynamo/dynamo-vllm-placeholder:latest
          command: ["python3"]
          args:
            - "-m"
            - "dynamo.vllm"
            - "--model"
            - "meta-llama/Meta-Llama-3-8B-Instruct"
            - "--max-model-len"
            - "4096"
            - "--gpu-memory-utilization"
            - "0.90"
          env:
            - name: GLOO_SOCKET_IFNAME
              value: "lo"
            - name: NCCL_SOCKET_IFNAME
              value: "lo"
      resources:
        limits:
          nvidia.com/gpu: "1"
      checkpoint:
        enabled: true
        checkpointRef: "e5962d34ba272638"  # Reference by hash
```

## Related Documentation

- [Dynamo Snapshot Overview](README.md) - Dynamo Snapshot architecture and use cases
- [Dynamo Snapshot Helm Chart README](https://github.com/ai-dynamo/dynamo/tree/main/deploy/helm/charts/snapshot/README.md) - Chart configuration
- [Installation Guide](../installation-guide.md) - Platform installation
- [API Reference](../api-reference.md) - Complete CRD specifications