docs: update snapshot checkpointing docs (#7244)

Unverified commit 92b341f3 · authored Mar 11, 2026 by Schwinn Saereesitthipitak · committed by GitHub Mar 11, 2026
7 changed files
--- a/deploy/helm/charts/snapshot/README.md
+++ b/deploy/helm/charts/snapshot/README.md
 # Dynamo Snapshot Helm Chart

-> ⚠️ **Experimental Feature**: Dynamo Snapshot is currently in **beta/preview**. The DaemonSet runs in privileged mode to perform CRIU operations. See [Prerequisites](#prerequisites) for security considerations.
+> ⚠️ **Experimental Feature**: Dynamo Snapshot is currently in beta/preview. The DaemonSet runs in privileged mode to perform CRIU checkpoint and restore operations.

-This Helm chart deploys the checkpoint/restore infrastructure for NVIDIA Dynamo, including:
-- Persistent Volume Claim (PVC) for checkpoint storage
-- DaemonSet running the CRIU checkpoint agent
-- RBAC resources (ServiceAccount, Role, RoleBinding)
-- Seccomp profile for blocking io_uring syscalls
+This chart installs the namespace-scoped checkpoint/restore infrastructure used by Dynamo:

-**Note:**
-- Each namespace gets its own isolated checkpoint infrastructure with namespace-scoped RBAC
-- **Supports vLLM and SGLang backends** (TensorRT-LLM support planned)
+- `snapshot-agent` DaemonSet on GPU nodes
+- `snapshot-pvc` checkpoint storage, or wiring to an existing PVC
+- namespace-scoped RBAC
+- the seccomp profile required by CRIU

-## Prerequisites
+Snapshot storage is namespace-local. Install this chart in every namespace where you want checkpoint and restore.

-⚠️ **Security Warning**: The Dynamo Snapshot DaemonSet runs in **privileged mode** with `hostPID`, `hostIPC`, and `hostNetwork` to perform CRIU checkpoint/restore operations. Workload pods do not need privileged mode. Only deploy in environments where a privileged DaemonSet is acceptable.
+## Prerequisites

 - Kubernetes 1.21+
-- **x86_64 (amd64) nodes only** for the snapshot agent and placeholder images
-- GPU nodes with NVIDIA runtime (`nvidia` runtime class)
-- NVIDIA driver 580.xx or newer on the target GPU nodes
-- containerd runtime (for container inspection; CRIU is bundled in Dynamo Snapshot images)
-- NVIDIA Dynamo operator installed (cluster-wide or namespace-scoped)
-- RWX (ReadWriteMany) storage class for multi-node deployments
-- **Security clearance for privileged DaemonSet** (the Dynamo Snapshot agent runs privileged with hostPID/hostIPC/hostNetwork)
-
-## Installation
-
-> **Note:** The Dynamo Snapshot Helm chart is not yet published to a public Helm repository. For now, you must build and deploy from source.
-
-### Building from Source
-
-```bash
-# Set environment
-export NAMESPACE=my-team  # Your target namespace
-export DOCKER_SERVER=your-registry.com/  # Your container registry
-export IMAGE_TAG=latest
-
-# Build Dynamo Snapshot agent image (amd64 only)
-cd deploy/snapshot
-docker build --platform linux/amd64 --target agent -t $DOCKER_SERVER/snapshot-agent:$IMAGE_TAG .
-docker push $DOCKER_SERVER/snapshot-agent:$IMAGE_TAG
-cd -
-
-# Install Dynamo Snapshot chart with custom image
-helm install snapshot ./deploy/helm/charts/snapshot/ \
-  --namespace ${NAMESPACE} \
-  --create-namespace \
-  --set daemonset.image.repository=${DOCKER_SERVER}/snapshot-agent \
-  --set daemonset.image.tag=${IMAGE_TAG} \
-  --set daemonset.imagePullSecrets[0].name=your-registry-secret
-```
-
-## Configuration
-
-See `values.yaml` for all configuration options.
-
-### Key Configuration Options
-
-| Parameter | Description | Default |
-|-----------|-------------|---------|
-| `storage.type` | Storage type: `pvc` (only supported), `s3` and `oci` planned | `pvc` |
-| `storage.pvc.create` | Create a new PVC | `true` |
-| `storage.pvc.name` | PVC name (must match operator config) | `snapshot-pvc` |
-| `storage.pvc.size` | PVC size | `100Gi` |
-| `storage.pvc.storageClass` | Storage class name | `""` (default) |
-| `daemonset.image.repository` | DaemonSet image repository | `nvcr.io/nvidian/dynamo-dev/snapshot-agent` |
-| `daemonset.snapshotLogLevel` | Snapshot agent and nsrestore log level (`trace`, `debug`, `info`, `warn`, `error`) | `info` |
-| `daemonset.nodeSelector` | Node selector for GPU nodes | `nvidia.com/gpu.present: "true"` |
-| `config.checkpoint.criu.ghostLimit` | CRIU ghost file size limit in bytes | `536870912` (512MB) |
-| `config.checkpoint.criu.logLevel` | CRIU logging verbosity (0-4) | `4` |
-| `rbac.namespaceRestricted` | Use namespace-scoped RBAC | `true` |
+- x86_64 GPU nodes
+- NVIDIA driver 580.xx or newer
+- containerd runtime
+- a cluster where a privileged DaemonSet with `hostPID`, `hostIPC`, and `hostNetwork` is acceptable
+- Dynamo Platform already installed, with operator checkpointing enabled

-## Usage
-
-After installing this chart, enable checkpointing in your DynamoGraphDeployment:
+The platform/operator configuration must point at the same checkpoint storage that this chart installs:

 ```yaml
-apiVersion: nvidia.com/v1alpha1
-kind: DynamoGraphDeployment
-metadata:
-  name: my-model
-  namespace: my-team
-spec:
-  services:
-    worker:
+dynamo-operator:
  checkpoint:
    enabled: true
-        mode: auto
-        identity:
-          model: Qwen/Qwen3-0.6B
-          backendFramework: vllm
+    storage:
+      type: pvc
+      pvc:
+        pvcName: snapshot-pvc
+        basePath: /checkpoints
 ```

-## Multi-Namespace Deployment
+Cross-node restore requires a shared `ReadWriteMany` storage class. The chart defaults to `storage.pvc.accessMode=ReadWriteMany`.

-To enable checkpointing in multiple namespaces, install this chart in each namespace:
+For better restore times, use a fast `ReadWriteMany` StorageClass for the checkpoint PVC.

-```bash
-# Namespace A
-helm install snapshot nvidia/snapshot -n team-a
+## Minimal Install

-# Namespace B
-helm install snapshot nvidia/snapshot -n team-b
+This is the smallest Helm install that creates the checkpoint PVC and the DaemonSet:
+
+```bash
+helm upgrade --install snapshot ./deploy/helm/charts/snapshot \
+  --namespace ${NAMESPACE} \
+  --create-namespace \
+  --set storage.pvc.create=true
 ```

-Each namespace will have its own isolated checkpoint storage.
+If your cluster does not use a default storage class, also set `storage.pvc.storageClass`.

-## Verification
+Keep `storage.pvc.accessMode=ReadWriteMany` for this chart layout. The DaemonSet mounts the same PVC on each eligible node, so a shared `ReadWriteOnce` claim only works when the agent runs on one node.

-```bash
-# Check PVC
-kubectl get pvc snapshot-pvc -n my-team
+If you already have a PVC, keep the chart in "use existing PVC" mode:

-# Check DaemonSet
-kubectl get daemonset -n my-team
+Do not set `storage.pvc.create=true` when reusing an existing checkpoint PVC. Set it to `false` and point `storage.pvc.name` at the existing claim:

-# Check DaemonSet pods are running
-kubectl get pods -n my-team -l app.kubernetes.io/name=snapshot
+```bash
+helm upgrade --install snapshot ./deploy/helm/charts/snapshot \
+  --namespace ${NAMESPACE} \
+  --create-namespace \
+  --set storage.pvc.create=false \
+  --set storage.pvc.name=my-snapshot-pvc
 ```

-## Uninstallation
+## Verify

 ```bash
-helm uninstall snapshot -n my-team
+kubectl get pvc snapshot-pvc -n ${NAMESPACE}
+kubectl rollout status daemonset/snapshot-agent -n ${NAMESPACE}
+kubectl get pods -n ${NAMESPACE} -l app.kubernetes.io/name=snapshot -o wide
 ```

-**Note:** This will NOT delete the PVC by default. To delete the PVC:
+## Important Values

-```bash
-kubectl delete pvc snapshot-pvc -n my-team
-```
+| Parameter | Meaning | Default |
+|-----------|---------|---------|
+| `storage.pvc.create` | Create `snapshot-pvc` instead of using an existing PVC | `true` |
+| `storage.pvc.name` | PVC name used by the agent and by the operator config | `snapshot-pvc` |
+| `storage.pvc.size` | Requested PVC size | `1Ti` |
+| `storage.pvc.storageClass` | Storage class name | `""` |
+| `storage.pvc.accessMode` | Access mode for the checkpoint PVC | `ReadWriteMany` |
+| `storage.pvc.basePath` | Checkpoint root inside the PVC | `/checkpoints` |
+| `daemonset.image.repository` | Snapshot agent image repository | `nvcr.io/nvidia/ai-dynamo/snapshot-agent` |
+| `daemonset.image.tag` | Snapshot agent image tag | `1.0.0` |
+| `daemonset.imagePullSecrets` | Image pull secrets for the agent | `[{name: ngc-secret}]` |

-## Troubleshooting
+See [values.yaml](./values.yaml) for the complete configuration surface.

-### DaemonSet pods not starting
+## End To End

-Check if GPU nodes have the correct labels and runtime class:
+Once the chart is installed, use the snapshot guide to deploy a snapshot-capable `DynamoGraphDeployment`, wait for the checkpoint to become ready, and then scale the worker to verify restore:

-```bash
-kubectl get nodes -l nvidia.com/gpu.present=true
-kubectl describe node <node-name> | grep -A 5 "Runtime Class"
-```
+- [Snapshot](../../../../docs/kubernetes/snapshot.md)

-If nodes don't have the `nvidia.com/gpu.present` label, you can add it:
+## Uninstall

 ```bash
-kubectl label node <node-name> nvidia.com/gpu.present=true
+helm uninstall snapshot -n ${NAMESPACE}
 ```

-### Checkpoint job fails
-
-Check DaemonSet logs:
+The chart does not remove checkpoint data automatically. Delete the PVC yourself if you want to remove stored checkpoints:

 ```bash
-kubectl logs -n my-team -l app.kubernetes.io/name=snapshot
+kubectl delete pvc snapshot-pvc -n ${NAMESPACE}
 ```

-### PVC not mounting
+## Troubleshooting

-Check PVC status and events:
+If `snapshot-agent` does not schedule:

 ```bash
-kubectl describe pvc snapshot-pvc -n my-team
+kubectl get nodes -l nvidia.com/gpu.present=true
+kubectl describe daemonset snapshot-agent -n ${NAMESPACE}
+kubectl logs -n ${NAMESPACE} -l app.kubernetes.io/name=snapshot --all-containers
 ```

-Ensure your storage class supports `ReadWriteMany` access mode for multi-node deployments.
-
-## Related Documentation
-
-- [Dynamo Snapshot Overview](../../../../docs/kubernetes/snapshot/README.md) - Dynamo Snapshot architecture and use cases
-- [Dynamo Snapshot with Dynamo Platform](../../../../docs/kubernetes/snapshot/dynamo.md) - Integration guide
-
-## License
+If checkpoint creation never becomes ready, verify all three pieces line up:

-Apache License 2.0
+- the operator has `dynamo-operator.checkpoint.enabled=true`
+- the operator PVC name and base path match the snapshot chart values
+- the workload uses a snapshot-capable worker image and command
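
The storage part of that checklist can be sketched as a small shell helper. This is only an illustration: `storage_settings_match` is a hypothetical function name, and the four values are assumed to have been read separately from the operator config and from the snapshot chart values.

```bash
# Hypothetical helper: compare the PVC name and base path configured for the
# operator with the ones used by the snapshot chart.
# Arguments: operator PVC name, operator base path, chart PVC name, chart base path.
storage_settings_match() {
  status=0
  if [ "$1" != "$3" ]; then
    echo "PVC name mismatch: operator=$1 chart=$3" >&2
    status=1
  fi
  # Ignore a trailing slash so /checkpoints and /checkpoints/ compare equal.
  if [ "${2%/}" != "${4%/}" ]; then
    echo "base path mismatch: operator=$2 chart=$4" >&2
    status=1
  fi
  return "$status"
}
```

For example, `storage_settings_match snapshot-pvc /checkpoints snapshot-pvc /checkpoints` returns success, while a different PVC name on either side prints the mismatch and returns a non-zero status.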
--- a/deploy/helm/charts/snapshot/values.yaml
+++ b/deploy/helm/charts/snapshot/values.yaml
@@ -29,7 +29,7 @@ storage:
    # PVC name - must match operator configuration
    name: snapshot-pvc
    # PVC size
-    size: 100Gi
+    size: 1Ti
    # Storage class (leave empty for default)
    storageClass: ""
    # Access mode - ReadWriteMany required for multi-pod access

--- a/docs/index.yml
+++ b/docs/index.yml
@@ -55,11 +55,8 @@ navigation:
            path: kubernetes/rolling-update.md
          - page: Inference Gateway (GAIE)
            path: kubernetes/inference-gateway.md
-          - section: Checkpointing
-            path: kubernetes/snapshot/README.md
-            contents:
-              - page: Integration with Dynamo
-                path: kubernetes/snapshot/dynamo.md
+          - page: Snapshot
+            path: kubernetes/snapshot.md
      - section: Observability (K8s)
        contents:
          - page: Metrics

--- a/docs/kubernetes/README.md
+++ b/docs/kubernetes/README.md
@@ -230,7 +230,7 @@ Key customization points include:
 - **[Operator Documentation](dynamo-operator.md)** - How the platform works
 - **[Service Discovery](service-discovery.md)** - Discovery backends and configuration
 - **[Helm Charts](https://github.com/ai-dynamo/dynamo/tree/main/deploy/helm/README.md)** - For advanced users
-- **[Checkpointing](snapshot/README.md)** - Fast pod startup with checkpoint/restore
+- **[Snapshot](snapshot.md)** - Fast pod startup with checkpoint/restore
 - **[GitOps Deployment with FluxCD](fluxcd.md)** - For advanced users
 - **[Logging](observability/logging.md)** - For logging setup
 - **[Multinode Deployment](deployment/multinode-deployment.md)** - For multinode deployment

--- a/docs/kubernetes/snapshot.md
+++ b/docs/kubernetes/snapshot.md
+---
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+title: Snapshot
+---
+
+> ⚠️ **Experimental Feature**: Dynamo Snapshot is currently in **preview** and may work only in some Kubernetes cluster setups. The Dynamo Snapshot DaemonSet runs in privileged mode to perform CRIU operations. See [Limitations](#limitations) for details.
+
+**Dynamo Snapshot** is experimental infrastructure for fast-starting GPU applications in Kubernetes using CRIU (Checkpoint/Restore In Userspace) and NVIDIA's cuda-checkpoint utility. By capturing initialized application state and restoring it on demand, it reduces cold-start times for large models from minutes to seconds.
+
+| Startup Type | Time | What Happens |
+|--------------|------|--------------|
+| **Cold Start** | ~1 min | Download model, load to GPU, initialize engine |
+| **Warm Start** (restore from checkpoint) | ~10 sec | Restore from checkpoint tar |
+
+> ⚠️ Restore time may vary depending on cluster configuration (storage bandwidth, GPU model, etc.)
+
+## Prerequisites
+
+- Dynamo Platform/Operator installed on a Kubernetes cluster with **x86_64 (amd64)** GPU nodes
+- NVIDIA driver 580.xx or newer on the target GPU nodes
+- `ReadWriteMany` storage if you need cross-node restore
+- vLLM or SGLang backend (TensorRT-LLM is not supported yet)
+- Security clearance to run a privileged DaemonSet
+
+## Quick Start
+
+This guide assumes you already have a working Dynamo deployment workflow on your Kubernetes cluster.
+
+### 1. Build and push a placeholder image
+
+Snapshot-enabled workers must use a placeholder image that wraps the normal runtime image with the restore tooling. If you do not already have one, build it with the snapshot placeholder target and push it to a registry your cluster can pull from:
+
+```bash
+export RUNTIME_IMAGE=registry.example.com/dynamo/vllm-runtime:1.0.0
+export PLACEHOLDER_IMAGE=registry.example.com/dynamo/vllm-placeholder:1.0.0
+
+cd deploy/snapshot
+
+make docker-build-placeholder \
+  PLACEHOLDER_BASE_IMG="${RUNTIME_IMAGE}" \
+  PLACEHOLDER_IMG="${PLACEHOLDER_IMAGE}"
+
+make docker-push-placeholder \
+  PLACEHOLDER_IMG="${PLACEHOLDER_IMAGE}"
+```
+
+This flow is defined in [deploy/snapshot/Makefile](../../deploy/snapshot/Makefile) and [deploy/snapshot/Dockerfile](../../deploy/snapshot/Dockerfile). The placeholder image preserves the base runtime entrypoint and command contract, and adds the CRIU, `cuda-checkpoint`, and `nsrestore` tooling needed for restore.
+
+### 2. Enable checkpointing in the platform and verify it
+
+Whether you are installing or upgrading `dynamo-platform`, the operator must have checkpointing enabled and must point at the same storage that the snapshot chart will use:
+
+```yaml
+dynamo-operator:
+  checkpoint:
+    enabled: true
+    storage:
+      type: pvc
+      pvc:
+        pvcName: snapshot-pvc
+        basePath: /checkpoints
+```
+
+If the platform is already installed, verify that the operator config contains the checkpoint block. Set `PLATFORM_NAMESPACE` to the namespace where `dynamo-platform` is installed:
+
+```bash
+OPERATOR_CONFIG=$(kubectl get deploy -n "${PLATFORM_NAMESPACE}" \
+  -l app.kubernetes.io/name=dynamo-operator,app.kubernetes.io/component=manager \
+  -o jsonpath='{.items[0].spec.template.spec.volumes[?(@.name=="operator-config")].configMap.name}')
+
+kubectl get configmap "${OPERATOR_CONFIG}" -n "${PLATFORM_NAMESPACE}" \
+  -o jsonpath='{.data.config\.yaml}' | sed -n '/^checkpoint:/,/^[^[:space:]]/p'
+```
+
+Verify that the rendered config includes `enabled: true` and the same PVC name and base path you plan to use for the snapshot chart.
+
+For the full platform/operator configuration surface, see [deploy/helm/charts/platform/README.md](../../deploy/helm/charts/platform/README.md) and [deploy/helm/charts/platform/components/operator/values.yaml](../../deploy/helm/charts/platform/components/operator/values.yaml).
+
+### 3. Install the snapshot chart
+
+```bash
+helm upgrade --install snapshot ./deploy/helm/charts/snapshot \
+  --namespace ${NAMESPACE} \
+  --create-namespace \
+  --set storage.pvc.create=true
+```
+
+Cross-node restore requires `ReadWriteMany` storage. The chart defaults to that mode.
+
+For better restore times, use a fast `ReadWriteMany` StorageClass for the checkpoint PVC. If you are reusing an existing checkpoint PVC, do not set `storage.pvc.create=true`; install the chart with `storage.pvc.create=false` and point `storage.pvc.name` at the existing PVC instead.
+
+Verify that the PVC and DaemonSet are ready:
+
+```bash
+kubectl get pvc snapshot-pvc -n ${NAMESPACE}
+kubectl rollout status daemonset/snapshot-agent -n ${NAMESPACE}
+```
+
+For the full snapshot chart configuration surface, see [deploy/helm/charts/snapshot/README.md](../../deploy/helm/charts/snapshot/README.md) and [deploy/helm/charts/snapshot/values.yaml](../../deploy/helm/charts/snapshot/values.yaml).
+
+### 4. Apply a snapshot-compatible `DynamoGraphDeployment`
+
+This example is adapted from [examples/backends/vllm/deploy/agg.yaml](../../examples/backends/vllm/deploy/agg.yaml). The worker must use the placeholder image from step 1, and the checkpoint identity must describe the runtime state you want to reuse.
+
+```yaml
+apiVersion: nvidia.com/v1alpha1
+kind: DynamoGraphDeployment
+metadata:
+  name: vllm-snapshot-demo
+spec:
+  services:
+    Frontend:
+      componentType: frontend
+      replicas: 1
+      extraPodSpec:
+        mainContainer:
+          image: registry.example.com/dynamo/vllm-runtime:1.0.0
+
+    VllmDecodeWorker:
+      componentType: worker
+      replicas: 1
+      resources:
+        limits:
+          gpu: "1"
+      readinessProbe:
+        httpGet:
+          path: /live
+          port: system
+        periodSeconds: 1
+        timeoutSeconds: 4
+        failureThreshold: 3
+      checkpoint:
+        enabled: true
+        mode: Auto
+        identity:
+          model: Qwen/Qwen3-0.6B
+          backendFramework: vllm
+      extraPodSpec:
+        mainContainer:
+          image: registry.example.com/dynamo/vllm-placeholder:1.0.0
+          command:
+            - python3
+            - -m
+            - dynamo.vllm
+          args:
+            - --model
+            - Qwen/Qwen3-0.6B
+            - --disable-custom-all-reduce
+          env:
+            - name: GLOO_SOCKET_IFNAME
+              value: lo
+            - name: NCCL_SOCKET_IFNAME
+              value: lo
+            - name: NCCL_DEBUG
+              value: ERROR
+            - name: TORCH_CPP_LOG_LEVEL
+              value: ERROR
+            - name: TORCH_DISTRIBUTED_DEBUG
+              value: "OFF"
+            - name: CUDA_ERROR_LEVEL
+              value: "10"
+            - name: NCCL_CUMEM_ENABLE
+              value: "0"
+            - name: NCCL_CUMEM_HOST_ENABLE
+              value: "0"
+            - name: NCCL_NVLS_ENABLE
+              value: "0"
+            - name: NCCL_P2P_DISABLE
+              value: "0"
+            - name: NCCL_SHM_DISABLE
+              value: "1"
+            - name: NCCL_IB_DISABLE
+              value: "1"
+            - name: TORCH_NCCL_ENABLE_MONITORING
+              value: "0"
+```
+
+For SGLang, use `dynamo.sglang`, an SGLang placeholder image, `backendFramework: sglang`, and the matching CLI flags.
+
+Save the manifest as `vllm-snapshot-demo.yaml`, then apply it:
+
+```bash
+kubectl apply -f vllm-snapshot-demo.yaml -n ${NAMESPACE}
+```
+
+On the first rollout, the worker cold-starts, the operator creates a `DynamoCheckpoint`, and the checkpoint Job writes data into `snapshot-pvc`.
+
+### 5. Wait for the checkpoint to become ready
+
+Capture the checkpoint name from DGD status, then wait for the `DynamoCheckpoint` phase to become `Ready`:
+
+```bash
+CHECKPOINT_NAME=$(kubectl get dgd vllm-snapshot-demo -n ${NAMESPACE} \
+  -o jsonpath='{.status.checkpoints.VllmDecodeWorker.checkpointName}')
+
+kubectl wait \
+  --for=jsonpath='{.status.phase}'=Ready \
+  "dynamocheckpoint/${CHECKPOINT_NAME}" \
+  -n ${NAMESPACE} \
+  --timeout=30m
+```
+
+The DGD status also reports the computed checkpoint hash at `.status.checkpoints.VllmDecodeWorker.identityHash`.
+
+### 6. Trigger restore
+
+Once the checkpoint is ready, scale the worker replicas from `1` to `2`:
+
+```bash
+kubectl patch dgd vllm-snapshot-demo -n ${NAMESPACE} --type=merge \
+  -p '{"spec":{"services":{"VllmDecodeWorker":{"replicas":2}}}}'
+```
+
+New worker pods for `VllmDecodeWorker` will restore from the ready checkpoint automatically.
+
+## Checkpoint Configuration
+
+### Auto Mode (Recommended)
+
+The operator computes the checkpoint identity hash, looks for an existing `DynamoCheckpoint` with a matching `nvidia.com/snapshot-checkpoint-hash` label, and creates one if it does not find one:
+
+```yaml
+checkpoint:
+  enabled: true
+  mode: Auto
+  identity:
+    model: "meta-llama/Llama-3-8B"
+    backendFramework: "vllm"  # or "sglang"
+    tensorParallelSize: 1
+    dtype: "bfloat16"
+    maxModelLen: 4096
+```
+
+When a service uses checkpointing, DGD status reports the resolved `checkpointName`, `identityHash`, and `ready` fields under `.status.checkpoints.<service-name>`.
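
Reading those fields can be scripted with `kubectl` and a jsonpath expression. The sketch below assumes the status layout described above; `checkpoint_jsonpath` and `dgd_checkpoint_field` are hypothetical helper names, not part of Dynamo.

```bash
# Build the jsonpath for one field under .status.checkpoints.<service-name>.
# Arguments: service name, field (checkpointName, identityHash, or ready).
checkpoint_jsonpath() {
  printf '{.status.checkpoints.%s.%s}' "$1" "$2"
}

# Print one checkpoint status field for a service of a DGD.
# Arguments: DGD name, service name, field, namespace.
dgd_checkpoint_field() {
  kubectl get dgd "$1" -n "$4" -o jsonpath="$(checkpoint_jsonpath "$2" "$3")"
}
```

With the quick-start deployment, `dgd_checkpoint_field vllm-snapshot-demo VllmDecodeWorker ready "${NAMESPACE}"` would print the `ready` field for the worker service.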
+
+### Manual Management and `checkpointRef`
+
+Use `checkpointRef` when you want a service to restore from a specific `DynamoCheckpoint` CR:
+
+```yaml
+checkpoint:
+  enabled: true
+  checkpointRef: "qwen3-06b-vllm-prewarm"
+```
+
+This is useful when:
+- You want to **pre-warm checkpoints** before creating DGDs
+- You want **explicit control** over which checkpoint to use
+
+`checkpointRef` resolves by `DynamoCheckpoint.metadata.name`, not by `status.identityHash`. A manual checkpoint can use any valid Kubernetes resource name.
+
+If you are managing checkpoint CRs yourself, set `mode: Manual` on the service to prevent the operator from creating a new `DynamoCheckpoint` when identity-based lookup does not find one.
+
+```bash
+# Check checkpoint status by CR name
+kubectl get dynamocheckpoint qwen3-06b-vllm-prewarm -n ${NAMESPACE}
+
+# Now create DGD referencing it
+kubectl apply -f my-dgd.yaml -n ${NAMESPACE}
+```
+
+If you want `mode: Auto` DGDs to discover a manually created checkpoint by identity, add the label `nvidia.com/snapshot-checkpoint-hash=<identity-hash>` to that `DynamoCheckpoint`. Auto-created checkpoints already use that label, and currently use the same hash as the CR name.
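
Adding that label can be sketched with `kubectl label`. The helper names below are hypothetical, and the check that the hash is 16 lowercase hexadecimal characters is an assumption based on the example hashes in this guide.

```bash
# Return success only for a 16-character lowercase hexadecimal string.
is_identity_hash() {
  case "$1" in
    *[!0-9a-f]*) return 1 ;;
  esac
  [ "${#1}" -eq 16 ]
}

# Label an existing DynamoCheckpoint so Auto-mode lookup can find it.
# Arguments: checkpoint CR name, identity hash, namespace.
label_checkpoint_hash() {
  if ! is_identity_hash "$2"; then
    echo "not a 16-character hex hash: $2" >&2
    return 1
  fi
  kubectl label dynamocheckpoint "$1" -n "$3" \
    "nvidia.com/snapshot-checkpoint-hash=$2" --overwrite
}
```

The validation runs before any cluster call, so a malformed hash is rejected without touching the `DynamoCheckpoint`.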
+
+## Checkpoint Identity
+
+Checkpoints are identified by a **16-character hexadecimal hash** (64 bits, derived from SHA-256) of the configuration that affects runtime state:
+
+| Field | Required | Affects Hash | Example |
+|-------|----------|-------------|---------|
+| `model` | ✓ | ✓ | `meta-llama/Llama-3-8B` |
+| `backendFramework` | ✓ | ✓ | `sglang`, `vllm` |
+| `dynamoVersion` | | ✓ | `0.9.0`, `1.0.0` |
+| `tensorParallelSize` | | ✓ | `1`, `2`, `4`, `8` (default: 1) |
+| `pipelineParallelSize` | | ✓ | `1`, `2` (default: 1) |
+| `dtype` | | ✓ | `float16`, `bfloat16`, `fp8` |
+| `maxModelLen` | | ✓ | `4096`, `8192` |
+| `extraParameters` | | ✓ | Custom key-value pairs |
+
+**Not included in the hash** (changing these does not invalidate a checkpoint):
+- `replicas`
+- `nodeSelector`, `affinity`, `tolerations`
+- `resources` (requests/limits)
+- Logging/observability config
+
+**Example with all fields:**
+```yaml
+checkpoint:
+  enabled: true
+  mode: Auto
+  identity:
+    model: "meta-llama/Llama-3-8B"
+    backendFramework: "vllm"
+    dynamoVersion: "0.9.0"
+    tensorParallelSize: 1
+    pipelineParallelSize: 1
+    dtype: "bfloat16"
+    maxModelLen: 8192
+    extraParameters:
+      enableChunkedPrefill: "true"
+      quantization: "awq"
+```
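
To make the idea of a short identity hash concrete, here is a rough sketch that hashes a list of `key=value` strings with `sha256sum` and keeps 16 hexadecimal characters. The serialization format and the function name `identity_hash_sketch` are assumptions for illustration only; the operator's actual canonical encoding is not described here, so this will not reproduce real `identityHash` values.

```bash
# Illustrative only: the "key=value" line format is an assumption, not the
# operator's real serialization of the checkpoint identity.
identity_hash_sketch() {
  # Write one argument per line, hash the result, and keep 16 hex characters.
  printf '%s\n' "$@" | sha256sum | cut -c1-16
}
```

Calling `identity_hash_sketch model=Qwen/Qwen3-0.6B backendFramework=vllm` twice gives the same 16-character string, while changing any identity field gives a different one. Fields such as `replicas` are simply never passed in, which mirrors why they do not invalidate a checkpoint.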
+
+## DynamoCheckpoint CRD
+
+The `DynamoCheckpoint` (shortname: `dckpt`) is a Kubernetes Custom Resource that manages checkpoint lifecycle.
+
+**When to create a DynamoCheckpoint directly:**
+- **Pre-warming:** Create checkpoints before deploying DGDs for instant startup
+- **Explicit control:** Manage checkpoint lifecycle independently from DGDs
+
+The operator requires `spec.identity` and `spec.job.podTemplateSpec`. The pod template should match the worker container you want checkpointed, including image, command, args, secrets, volumes, and resource limits. You do not need to set the checkpoint environment variables manually; the operator injects them for checkpoint jobs and restored pods.
+
+**Create a checkpoint:**
+
+```yaml
+apiVersion: nvidia.com/v1alpha1
+kind: DynamoCheckpoint
+metadata:
+  name: qwen3-06b-vllm-prewarm
+  labels:
+    nvidia.com/snapshot-checkpoint-hash: "e5962d34ba272638"  # Add this if Auto-mode identity lookup should find the CR
+spec:
+  identity:
+    model: Qwen/Qwen3-0.6B
+    backendFramework: vllm
+    tensorParallelSize: 1
+    dtype: bfloat16
+    maxModelLen: 4096
+
+  job:
+    activeDeadlineSeconds: 3600
+    backoffLimit: 3
+    ttlSecondsAfterFinished: 300
+    podTemplateSpec:
+      spec:
+        restartPolicy: Never
+        containers:
+          - name: main
+            image: registry.example.com/dynamo/vllm-placeholder:1.0.0
+            command:
+              - python3
+              - -m
+              - dynamo.vllm
+            args:
+              - --model
+              - Qwen/Qwen3-0.6B
+              - --disable-custom-all-reduce
+            env:
+              - name: GLOO_SOCKET_IFNAME
+                value: lo
+              - name: NCCL_SOCKET_IFNAME
+                value: lo
+            resources:
+              limits:
+                nvidia.com/gpu: "1"
+```
+
+You can name the CR however you want if you plan to use `checkpointRef`. If you want `mode: Auto` identity lookup to find a manual CR, set the `nvidia.com/snapshot-checkpoint-hash` label to the computed 16-character identity hash. Using the hash as the CR name is a convenient convention, but it is not required.
+
+**Check status:**
+
+```bash
+# List all checkpoints
+kubectl get dynamocheckpoint -n ${NAMESPACE}
+# Or use shortname
+kubectl get dckpt -n ${NAMESPACE}
+
+NAME                     MODEL                   BACKEND  PHASE     HASH              AGE
+qwen3-06b-vllm-prewarm   Qwen/Qwen3-0.6B         vllm     Ready     e5962d34ba272638  5m
+llama3-8b-vllm-prewarm   meta-llama/Llama-3-8B   vllm     Creating  7ab4f89c12de3456  2m
+```
+
+**Phases:**
+
+| Phase | Description |
+|-------|-------------|
+| `Pending` | CR created, waiting for job to start |
+| `Creating` | Checkpoint job is running |
+| `Ready` | Checkpoint available for use |
+| `Failed` | Checkpoint creation failed |
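
A polling script can use these phases to stop early on `Failed` instead of waiting for a timeout. The following is a minimal sketch, assuming `status.phase` holds exactly the values in the table above; `phase_action` and `wait_for_checkpoint` are hypothetical helper names.

```bash
# Map a DynamoCheckpoint phase to the action a polling script could take.
phase_action() {
  case "$1" in
    Ready)            echo "done" ;;
    Failed)           echo "fail" ;;
    Pending|Creating) echo "wait" ;;
    *)                echo "unknown" ;;
  esac
}

# Poll status.phase until the checkpoint is Ready or Failed.
# Arguments: checkpoint CR name, namespace, poll interval in seconds (default 10).
wait_for_checkpoint() {
  while :; do
    # Stop if the CR cannot be read at all (for example, it does not exist).
    phase=$(kubectl get dynamocheckpoint "$1" -n "$2" -o jsonpath='{.status.phase}') || return 1
    case "$(phase_action "$phase")" in
      done) return 0 ;;
      fail) echo "checkpoint $1 failed" >&2; return 1 ;;
      *)    sleep "${3:-10}" ;;
    esac
  done
}
```

An empty or unrecognized phase is treated like `Pending`, so the loop keeps polling until the operator populates the status. The sketch has no overall timeout, so wrap it in one if the caller needs a bound.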
+
+`Ready` is a value in `status.phase`, not a Kubernetes condition. The `conditions` array tracks job lifecycle events:
+
+| Condition Type | Meaning |
+|----------------|---------|
+| `JobCreated` | The checkpoint Job has been created |
+| `JobCompleted` | The checkpoint Job has finished, either successfully or with a failure |
+
+Other useful status fields are:
+
+| Field | Meaning |
+|-------|---------|
+| `status.jobName` | Name of the checkpoint Job |
+| `status.identityHash` | Computed 16-character hash for the checkpoint identity |
+| `status.location` | Checkpoint location in the configured storage backend |
+| `status.storageType` | Storage backend type (`pvc`, `s3`, or `oci`) |
+| `status.createdAt` | Timestamp recorded when the checkpoint becomes ready |
+| `status.message` | Failure or progress message when available |
+
+**Detailed status:**
+
+```bash
+kubectl describe dckpt qwen3-06b-vllm-prewarm -n ${NAMESPACE}
+```
+
+```yaml
+Status:
+  Phase: Ready
+  IdentityHash: e5962d34ba272638
+  JobName: checkpoint-qwen3-06b-vllm-prewarm
+  Location: /checkpoints/e5962d34ba272638.tar
+  StorageType: pvc
+  CreatedAt: 2026-01-29T10:05:00Z
+  Conditions:
+    - Type: JobCreated
+      Status: "True"
+      Reason: JobCreated
+    - Type: JobCompleted
+      Status: "True"
+      Reason: JobSucceeded
+```
+
+**Reference from DGD:**
+
+Once the checkpoint is `Ready`, you can reference it by CR name:
+
+```yaml
+spec:
+  services:
+    VllmDecodeWorker:
+      checkpoint:
+        enabled: true
+        checkpointRef: "qwen3-06b-vllm-prewarm"
+```
+
+Or use `mode: Auto` with the same identity and snapshot-hash label, and the operator will reuse it automatically.
+
+## Limitations
+
+- **LLM workers only**: Checkpoint/restore supports LLM decode and prefill workers. Specialized workers (multimodal, embedding, diffusion) are not supported.
+- **Single-GPU only**: Multi-GPU configurations may work in very basic hardware configurations, but are not officially supported yet.
+- **Network state**: Active TCP connections cannot be checkpointed.
+- **Security**: Dynamo Snapshot runs as a **privileged DaemonSet**, which is required to run CRIU and cuda-checkpoint. Workload pods do not need to be privileged.
+
+## Troubleshooting
+
+### Checkpoint Not Ready
+
+1. Check the checkpoint job:
+   ```bash
+   kubectl get dckpt -n ${NAMESPACE}
+   kubectl describe dckpt <checkpoint-name> -n ${NAMESPACE}
+   kubectl logs job/$(kubectl get dckpt <checkpoint-name> -n ${NAMESPACE} -o jsonpath='{.status.jobName}') -n ${NAMESPACE}
+   ```
+
+2. Check the DaemonSet:
+   ```bash
+   kubectl logs daemonset/snapshot-agent -n ${NAMESPACE} --all-containers
+   ```
+
+3. Verify that platform and chart storage settings match:
+   ```bash
+   kubectl get dckpt <checkpoint-name> -n ${NAMESPACE} -o yaml
+   ```
+
+### Restore Failing
+
+1. Check pod logs:
+   ```bash
+   kubectl logs <worker-pod> -n ${NAMESPACE}
+   ```
+
+2. Describe the restore target pod:
+   ```bash
+   kubectl describe pod <worker-pod> -n ${NAMESPACE}
+   ```
+
+3. Confirm the referenced checkpoint is still `Ready`:
+   ```bash
+   kubectl get dckpt <checkpoint-name> -n ${NAMESPACE}
+   ```
+
+## Planned Features
+
+- TensorRT-LLM backend support
+- S3/MinIO storage backend
+- OCI registry storage backend
+- Multi-GPU checkpoints
+
+## Related Documentation
+
+- [Dynamo Snapshot Helm Chart README](../../deploy/helm/charts/snapshot/README.md) - Chart configuration
+- [Installation Guide](installation-guide.md) - Platform installation
+- [API Reference](api-reference.md) - Complete CRD specifications
--- a/docs/kubernetes/snapshot/README.md
+++ b/docs/kubernetes/snapshot/README.md
----
-# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: Apache-2.0
-title: Checkpointing
----
-
-> ⚠️ **Experimental Feature**: Dynamo Snapshot is currently in **beta/preview**. The Dynamo Snapshot DaemonSet runs in privileged mode to perform CRIU operations. See [Limitations](#limitations) for details.
-
-**Dynamo Snapshot** (Checkpoint/Restore in Kubernetes) is an experimental infrastructure for fast-starting GPU applications using CRIU (Checkpoint/Restore in User-space). Dynamo Snapshot dramatically reduces cold-start times for large models from minutes to seconds by capturing initialized application state and restoring it on-demand.
-
-## What is Dynamo Snapshot?
-
-Dynamo Snapshot provides:
-- **Fast cold starts**: Restore GPU-accelerated applications in seconds instead of minutes
-- **CUDA state preservation**: Checkpoint and restore GPU memory and CUDA contexts
-- **Kubernetes-native**: Integrates seamlessly with Kubernetes primitives
-- **Storage flexibility**: PVC-based storage (S3/OCI planned for future releases)
-- **Namespace isolation**: Each namespace gets its own checkpoint infrastructure
-
-## Use Cases
-
-### 1. With NVIDIA Dynamo Platform (Recommended)
-
-Use Dynamo Snapshot as part of the Dynamo platform for automatic checkpoint management:
-- Automatic checkpoint creation and lifecycle management
-- Seamless integration with DynamoGraphDeployment CRDs
-- Built-in autoscaling with fast restore
-
-📖 **[Read the Dynamo Integration Guide →](dynamo.md)**
-
-## Architecture
-
-Dynamo Snapshot consists of two main components:
-
-### 1. Dynamo Snapshot Helm Chart
-Deploys the checkpoint/restore infrastructure:
- **DaemonSet**: Runs on GPU nodes to perform CRIU checkpoint operations
- **PVC**: Stores checkpoint data (rootfs diffs, CUDA memory state)
- **RBAC**: Namespace-scoped or cluster-wide permissions
- **Seccomp Profile**: Security policies for CRIU syscalls (needs to be injected into workload pods)
-
-### 2. External Restore via DaemonSet
-The DaemonSet performs checkpoint/restore externally using `nsenter` to enter pod namespaces:
- **Checkpoint**: Freezes the running process and dumps state (CPU + GPU) to storage
- **Restore**: Enters a placeholder pod's namespaces and restores the checkpointed process via `nsrestore`
-
-## Quick Start
-
-To install the Dynamo Snapshot DaemonSet in your cluster, run the following:
-
-```bash
-helm install snapshot nvidia/snapshot \
-  --namespace my-team \
-  --create-namespace \
-  --set storage.pvc.size=100Gi
-```
-
-## Key Features
-
-### ✅ Currently Supported
- ✅ **vLLM and SGLang backends** (TensorRT-LLM planned)
- ✅ **LLM decode/prefill workers only** (multimodal, embedding, and diffusion workers are not supported)
- ✅ Cross-node, single-GPU checkpoints (requires RWX storage)
- ✅ PVC storage backend (RWX for multi-node)
- ✅ CUDA checkpoint/restore
- ✅ PyTorch distributed state (with `GLOO_SOCKET_IFNAME=lo`)
- ✅ Namespace-scoped and cluster-wide RBAC
- ✅ Idempotent checkpoint creation
- ✅ Automatic signal-based checkpoint coordination
-
-### 🚧 Planned Features
- 🚧 TensorRT-LLM backend support
- 🚧 S3/MinIO storage backend
- 🚧 OCI registry storage backend
- 🚧 Multi-GPU checkpoints
- 🚧 Multi-node distributed checkpoints
-
-## Limitations
-
-⚠️ **Important**: Dynamo Snapshot has significant limitations that may impact production readiness:
-
-### Security Considerations
- **🔴 Privileged DaemonSet**: The Dynamo Snapshot DaemonSet runs in privileged mode with `hostPID`, `hostIPC`, and `hostNetwork` to perform CRIU operations. Workload pods do **not** need privileged mode — all CRIU privilege lives in the DaemonSet.
- **Security Impact**: The privileged DaemonSet can:
-  - Access all host devices and processes
-  - Bypass most security restrictions
-  - Potentially compromise node security if exploited
-
-### Technical Limitations
- **x86_64 (amd64) only**: `cuda-checkpoint` does not support ARM64. The snapshot agent and placeholder images are built for x86_64 only.
- **NVIDIA driver 580.xx or newer required**: Dynamo Snapshot depends on `cuda-checkpoint`, which requires R580+ drivers.
- **vLLM and SGLang backends only**: TensorRT-LLM is not supported.
- **LLM workers only**: Checkpoint/restore supports LLM decode and prefill workers. Specialized workers (multimodal, embedding, diffusion) are not supported.
- **Single-GPU only**: Multi-GPU configurations are not yet supported
- **Network state limitations**: Active TCP connections are closed during restore (use `tcp-close` CRIU option)
- **Storage**: Only PVC storage is currently implemented (S3/OCI planned)
-
-### Recommendation
-Dynamo Snapshot is best suited for:
- ✅ Development and testing environments
- ✅ Research and experimentation
- ✅ Controlled production environments with appropriate security controls
- ❌ Security-sensitive production workloads without proper risk assessment
-
-## Documentation
-
-### Getting Started
- [Dynamo Integration Guide](dynamo.md) - Using Dynamo Snapshot with Dynamo Platform
- [Dynamo Snapshot Helm Chart README](https://github.com/ai-dynamo/dynamo/tree/main/deploy/helm/charts/snapshot/README.md) - Helm chart configuration
-
-### Related Documentation
- [CRIU Documentation](https://criu.org/Main_Page) - Upstream CRIU docs
-
-## Prerequisites
-
- Kubernetes 1.21+
- GPU nodes with NVIDIA runtime (`nvidia` runtime class)
- NVIDIA driver 580.xx or newer on the target GPU nodes
- containerd runtime (for container inspection; CRIU is bundled in Dynamo Snapshot images)
- RWX storage class (for multi-node deployments)
- **Security clearance for privileged DaemonSet** (the Dynamo Snapshot agent runs privileged with hostPID/hostIPC/hostNetwork)
-
-## Contributing
-
-Dynamo Snapshot is part of the NVIDIA Dynamo project. Contributions are welcome!
-
-## License
-
-Apache License 2.0
--- a/docs/kubernetes/snapshot/dynamo.md
+++ b/docs/kubernetes/snapshot/dynamo.md
---
-# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: Apache-2.0
-title: Integration with Dynamo
---
-
-> ⚠️ **Experimental Feature**: Dynamo Snapshot is currently in **beta/preview**. The Dynamo Snapshot DaemonSet runs in privileged mode to perform CRIU operations. See [Limitations](#limitations) for details.
-
-Checkpointing captures the complete state of a running worker pod (including GPU memory) and saves it to storage. New pods can restore from this checkpoint instead of performing a full cold start.
-
-| Startup Type | Time | What Happens |
-|--------------|------|--------------|
-| **Cold Start** | ~1 min | Download model, load to GPU, initialize engine |
-| **Warm Start** (checkpoint) | < 10 sec | Restore from checkpoint tar |
-
-## Prerequisites
-
- Dynamo Platform installed on a k8s cluster with **x86_64 (amd64)** GPU nodes
- Dynamo Snapshot Helm chart installed (separate from platform)
- RWX PVC storage (PVC is currently the only supported backend)
- NVIDIA driver 580.xx or newer on the target GPU nodes
- vLLM or SGLang backend (TensorRT-LLM is not supported)
-
-## Quick Start
-
-### 1. Install Dynamo Snapshot Infrastructure
-
-First, install the Dynamo Snapshot Helm chart in each namespace where you need checkpointing:
-
-```bash
-# Install Dynamo Snapshot infrastructure
-helm install snapshot nvidia/snapshot \
-  --namespace my-team \
-  --create-namespace \
-  --set storage.pvc.size=100Gi
-```
-
-This creates:
- A PVC for checkpoint storage (`snapshot-pvc`)
- A DaemonSet for CRIU operations (`snapshot-agent`)
-
-### 2. Configure Operator Values
-
-Update your Helm values to point to the Dynamo Snapshot infrastructure:
-
-```yaml
-# values.yaml
-dynamo-operator:
-  checkpoint:
-    enabled: true
-    storage:
-      type: pvc  # Only PVC is currently supported (S3/OCI planned)
-      pvc:
-        pvcName: "snapshot-pvc"  # Must match Dynamo Snapshot chart
-        basePath: "/checkpoints"
-      signalHostPath: "/var/lib/snapshot/signals"  # Must match Dynamo Snapshot chart
-```
-
-### 3. Configure Your DGD
-
-Add checkpoint configuration to your worker service. Both vLLM and SGLang are supported — use the appropriate `backendFramework`, command, and CLI flags.
-
-#### vLLM Example
-
-```yaml
-apiVersion: nvidia.com/v1alpha1
-kind: DynamoGraphDeployment
-metadata:
-  name: my-llm
-spec:
-  services:
-    worker:
-      replicas: 1
-      extraPodSpec:
-        mainContainer:
-          image: nvcr.io/nvidia/ai-dynamo/dynamo-vllm-placeholder:latest
-          command: ["python3"]
-          args:
-            - "-m"
-            - "dynamo.vllm"
-            - "--model"
-            - "meta-llama/Llama-3-8B"
-            - "--max-model-len"
-            - "4096"
-            - "--gpu-memory-utilization"
-            - "0.90"
-          env:
-            # Required for cross-node checkpoint/restore
-            - name: GLOO_SOCKET_IFNAME
-              value: "lo"
-            - name: NCCL_SOCKET_IFNAME
-              value: "lo"
-      resources:
-        limits:
-          nvidia.com/gpu: "1"
-      checkpoint:
-        enabled: true
-        mode: auto
-        identity:
-          model: "meta-llama/Llama-3-8B"
-          backendFramework: "vllm"
-          tensorParallelSize: 1
-          dtype: "bfloat16"
-          maxModelLen: 4096
-```
-
-#### SGLang Example
-
-```yaml
-apiVersion: nvidia.com/v1alpha1
-kind: DynamoGraphDeployment
-metadata:
-  name: my-sglang-llm
-spec:
-  services:
-    worker:
-      replicas: 1
-      extraPodSpec:
-        mainContainer:
-          image: nvcr.io/nvidia/ai-dynamo/dynamo-sglang-placeholder:latest
-          command: ["python3"]
-          args:
-            - "-m"
-            - "dynamo.sglang"
-            - "--model"
-            - "meta-llama/Llama-3-8B"
-            - "--mem-fraction-static"
-            - "0.90"
-          env:
-            # Required for cross-node checkpoint/restore
-            - name: GLOO_SOCKET_IFNAME
-              value: "lo"
-            - name: NCCL_SOCKET_IFNAME
-              value: "lo"
-      resources:
-        limits:
-          nvidia.com/gpu: "1"
-      checkpoint:
-        enabled: true
-        mode: auto
-        identity:
-          model: "meta-llama/Llama-3-8B"
-          backendFramework: "sglang"
-          tensorParallelSize: 1
-          dtype: "bfloat16"
-          maxModelLen: 4096
-```
-
-**Key differences between backends:**
-
-| Setting | vLLM | SGLang |
-|---------|------|--------|
-| Module | `dynamo.vllm` | `dynamo.sglang` |
-| Max context (optional) | `--max-model-len` | `--context-length` |
-| GPU memory | `--gpu-memory-utilization` | `--mem-fraction-static` |
-| Placeholder image | `dynamo-vllm-placeholder` | `dynamo-sglang-placeholder` |
-| Identity `backendFramework` | `"vllm"` | `"sglang"` |
-
-> **Note:** Do **not** set `DYN_READY_FOR_CHECKPOINT_FILE` or `DYN_CHECKPOINT_READY_FILE` in the DGD worker env vars. These are injected automatically by the operator's checkpoint controller into checkpoint job pods only. Setting them on worker pods causes all workers to enter checkpoint mode instead of cold-starting normally.
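The backend differences in the table can be captured in a small lookup. This is a minimal sketch: the module names and flags come from the table, while `BACKEND_FLAGS` and `worker_args` are hypothetical helpers that are not part of Dynamo.

```python
# Flag names are copied from the comparison table; the helper is illustrative only.
BACKEND_FLAGS = {
    "vllm": {
        "module": "dynamo.vllm",
        "max_context": "--max-model-len",
        "gpu_memory": "--gpu-memory-utilization",
    },
    "sglang": {
        "module": "dynamo.sglang",
        "max_context": "--context-length",
        "gpu_memory": "--mem-fraction-static",
    },
}


def worker_args(backend, model, gpu_memory_fraction, max_context=None):
    """Build the `python3` argument list for a worker container.

    `gpu_memory_fraction` is passed as a string so that values such as
    "0.90" are emitted exactly as written.
    """
    flags = BACKEND_FLAGS[backend]  # raises KeyError for unsupported backends
    args = ["-m", flags["module"], "--model", model]
    if max_context is not None:
        # The max-context flag is optional for both backends.
        args += [flags["max_context"], str(max_context)]
    args += [flags["gpu_memory"], gpu_memory_fraction]
    return args


vllm_args = worker_args("vllm", "meta-llama/Llama-3-8B", "0.90", max_context=4096)
sglang_args = worker_args("sglang", "meta-llama/Llama-3-8B", "0.90")
print(vllm_args)
print(sglang_args)
```

With these inputs, the generated lists match the `args` entries in the vLLM and SGLang examples.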
-
-### 4. Deploy
-
-```bash
-kubectl apply -f my-llm.yaml -n dynamo-system
-```
-
-On first deployment:
-1. A checkpoint job runs to create the checkpoint
-2. Worker pods start with cold start (checkpoint not ready yet)
-3. Once checkpoint is ready, new pods (scale-up, restarts) restore from checkpoint
-
-## Checkpoint Modes
-
-### Auto Mode (Recommended)
-
-The operator automatically creates a `DynamoCheckpoint` CR if one doesn't exist:
-
-```yaml
-checkpoint:
-  enabled: true
-  mode: auto
-  identity:
-    model: "meta-llama/Llama-3-8B"
-    backendFramework: "vllm"  # or "sglang"
-    tensorParallelSize: 1
-    dtype: "bfloat16"
-    maxModelLen: 4096
-```
-
-### Reference Mode
-
-Reference an existing `DynamoCheckpoint` CR by its 16-character hash using `checkpointRef`:
-
-```yaml
-checkpoint:
-  enabled: true
-  checkpointRef: "e5962d34ba272638"  # 16-char hash of DynamoCheckpoint CR
-```
-
-This is useful when:
- You want to **pre-warm checkpoints** before creating DGDs
- You want **explicit control** over which checkpoint to use
-
-**Flow:**
-1. Create a `DynamoCheckpoint` CR (see [DynamoCheckpoint CRD](#dynamocheckpoint-crd) section)
-2. Wait for it to become `Ready`
-3. Reference it in your DGD using `checkpointRef` with the hash
-
-```bash
-# Check checkpoint status (using 16-char hash name)
-kubectl get dynamocheckpoint e5962d34ba272638 -n dynamo-system
-NAME                MODEL                   BACKEND  PHASE  HASH              AGE
-e5962d34ba272638    meta-llama/Llama-3-8B  vllm     Ready  e5962d34ba272638  5m
-
-# Now create DGD referencing it
-kubectl apply -f my-dgd.yaml
-```
-
-## Checkpoint Identity
-
-Checkpoints are uniquely identified by a **16-character SHA256 hash** (64 bits) of configuration that affects runtime state:
-
-| Field | Required | Affects Hash | Example |
-|-------|----------|-------------|---------|
-| `model` | ✓ | ✓ | `meta-llama/Llama-3-8B` |
-| `backendFramework` | ✓ | ✓ | `sglang`, `vllm` |
-| `dynamoVersion` | | ✓ | `0.9.0`, `1.0.0` |
-| `tensorParallelSize` | | ✓ | `1`, `2`, `4`, `8` (default: 1) |
-| `pipelineParallelSize` | | ✓ | `1`, `2` (default: 1) |
-| `dtype` | | ✓ | `float16`, `bfloat16`, `fp8` |
-| `maxModelLen` | | ✓ | `4096`, `8192` |
-| `extraParameters` | | ✓ | Custom key-value pairs |
-
-**Not included in hash** (don't invalidate checkpoint):
- `replicas`
- `nodeSelector`, `affinity`, `tolerations`
- `resources` (requests/limits)
- Logging/observability config
-
-**Example with all fields:**
-```yaml
-checkpoint:
-  enabled: true
-  mode: auto
-  identity:
-    model: "meta-llama/Llama-3-8B"
-    backendFramework: "vllm"
-    dynamoVersion: "0.9.0"
-    tensorParallelSize: 1
-    pipelineParallelSize: 1
-    dtype: "bfloat16"
-    maxModelLen: 8192
-    extraParameters:
-      enableChunkedPrefill: "true"
-      quantization: "awq"
-```
-
-**Checkpoint Naming:** The `DynamoCheckpoint` CR is automatically named using the 16-character identity hash (e.g., `e5962d34ba272638`).
-
-**Checkpoint Sharing:** Multiple DGDs with the same identity automatically share the same checkpoint.
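To illustrate why identical identities share a checkpoint, here is a minimal sketch of a 16-character identity hash. The canonical form used here (JSON with sorted keys) and the `identity_hash` helper are assumptions for illustration. The operator's actual canonicalization is not described in this document, so these digests are not expected to match real `DynamoCheckpoint` names.

```python
import hashlib
import json


def identity_hash(identity: dict) -> str:
    """Return a 16-hex-character (64-bit) digest of a checkpoint identity.

    Assumption: the identity is serialized as compact JSON with sorted keys
    before hashing. The real operator may canonicalize differently.
    """
    # Drop unset optional fields so omitting a field and setting it to None
    # produce the same digest.
    present = {k: v for k, v in identity.items() if v is not None}
    canonical = json.dumps(present, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


base = {
    "model": "meta-llama/Llama-3-8B",
    "backendFramework": "vllm",
    "tensorParallelSize": 1,
    "dtype": "bfloat16",
    "maxModelLen": 4096,
}
# Same fields in a different order: same digest, so the checkpoint is shared.
reordered = dict(reversed(list(base.items())))
# Changing a hashed field (maxModelLen) selects a different checkpoint.
longer = {**base, "maxModelLen": 8192}

print(identity_hash(base) == identity_hash(reordered))  # True
print(identity_hash(base) == identity_hash(longer))     # False
```

Fields that are not part of the identity, such as `replicas` or `resources`, never enter the dictionary, which is why changing them does not invalidate a checkpoint.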
-
-## DynamoCheckpoint CRD
-
-The `DynamoCheckpoint` (shortname: `dckpt`) is a Kubernetes Custom Resource that manages checkpoint lifecycle.
-
-**When to create a DynamoCheckpoint directly:**
- **Pre-warming:** Create checkpoints before deploying DGDs for instant startup
- **Explicit control:** Manage checkpoint lifecycle independently from DGDs
-
-**Note:** With the new hash-based naming, checkpoint names are automatically generated (16-character hash). The operator handles checkpoint discovery and reuse automatically in `auto` mode.
-
-**Create a checkpoint:**
-
-```yaml
-apiVersion: nvidia.com/v1alpha1
-kind: DynamoCheckpoint
-metadata:
-  name: e5962d34ba272638  # Use the computed 16-char hash
-spec:
-  identity:
-    model: "meta-llama/Llama-3-8B"
-    backendFramework: "vllm"
-    tensorParallelSize: 1
-    dtype: "bfloat16"
-
-  job:
-    activeDeadlineSeconds: 3600
-    podTemplateSpec:
-      spec:
-        containers:
-          - name: main
-            image: nvcr.io/nvidia/ai-dynamo/dynamo-vllm:latest
-            command: ["python3", "-m", "dynamo.vllm"]
-            args: ["--model", "meta-llama/Llama-3-8B"]
-            resources:
-              limits:
-                nvidia.com/gpu: "1"
-            env:
-              - name: HF_TOKEN
-                valueFrom:
-                  secretKeyRef:
-                    name: hf-token-secret
-                    key: HF_TOKEN
-```
-
-**Note:** You can compute the hash yourself, or use `auto` mode to let the operator create it.
-
-**Check status:**
-
-```bash
-# List all checkpoints
-kubectl get dynamocheckpoint -n dynamo-system
-# Or use shortname
-kubectl get dckpt -n dynamo-system
-
-NAME                MODEL                          BACKEND  PHASE    HASH              AGE
-e5962d34ba272638    meta-llama/Llama-3-8B         vllm     Ready    e5962d34ba272638  5m
-a7b4f89c12de3456    meta-llama/Llama-3-70B        vllm     Creating a7b4f89c12de3456  2m
-```
-
-**Phases:**
-| Phase | Description |
-|-------|-------------|
-| `Pending` | CR created, waiting for job to start |
-| `Creating` | Checkpoint job is running |
-| `Ready` | Checkpoint available for use |
-| `Failed` | Checkpoint creation failed |
-
-**Detailed status:**
-
-```bash
-kubectl describe dckpt e5962d34ba272638 -n dynamo-system
-```
-
-```yaml
-Status:
-  Phase: Ready
-  IdentityHash: e5962d34ba272638
-  Location: /checkpoints/e5962d34ba272638
-  StorageType: pvc
-  CreatedAt: 2026-01-29T10:05:00Z
-```
-
-**Reference from DGD:**
-
-Once the checkpoint is `Ready`, you can reference it by hash:
-
-```yaml
-spec:
-  services:
-    VllmWorker:
-      checkpoint:
-        enabled: true
-        checkpointRef: "e5962d34ba272638"  # 16-char hash
-```
-
-Or use `auto` mode and the operator will find/create it automatically.
-
-## Limitations
-
- **x86_64 (amd64) only**: `cuda-checkpoint` does not support ARM64. The snapshot agent and placeholder images are built for x86_64 only.
- **NVIDIA driver 580.xx or newer required**: Dynamo Snapshot depends on `cuda-checkpoint`, which requires R580+ drivers.
- **vLLM and SGLang backends only**: TensorRT-LLM is not supported.
- **LLM workers only**: Checkpoint/restore supports LLM decode and prefill workers. Specialized workers (multimodal, embedding, diffusion) are not supported.
- **Single-GPU only**: Multi-GPU configurations are not yet supported (planned)
- **Network state**: Active TCP connections are closed during restore (handled with `tcp-close` CRIU option)
- **Storage**: Only PVC backend currently implemented (S3/OCI planned)
- **Security**: Dynamo Snapshot runs as a **privileged DaemonSet** which is required to run CRIU
-
-## Troubleshooting
-
-### Checkpoint Not Creating
-
-1. Check the checkpoint job:
-   ```bash
-   kubectl get jobs -l nvidia.com/snapshot-is-checkpoint-source=true -n dynamo-system
-   kubectl logs job/checkpoint-<name> -n dynamo-system
-   ```
-
-2. Check the DaemonSet:
-   ```bash
-   kubectl logs daemonset/snapshot-agent -n dynamo-system
-   ```
-
-3. Verify storage access:
-   ```bash
-   kubectl exec -it <checkpoint-agent-pod> -- ls -la /checkpoints
-   ```
-
-### Restore Failing
-
-1. Check pod logs:
-   ```bash
-   kubectl logs <worker-pod> -n dynamo-system
-   ```
-
-2. Verify checkpoint file exists:
-   ```bash
-   # For PVC
-   kubectl exec -it <any-pod-with-pvc> -- ls -la /checkpoints/
-   ```
-
-3. Check environment variables:
-   ```bash
-   kubectl exec <worker-pod> -- env | grep DYN_CHECKPOINT
-   ```
-
-### Cold Start Despite Checkpoint
-
-Pods fall back to cold start if:
- Checkpoint file doesn't exist yet (still being created)
- Checkpoint file is corrupted
- CRIU restore fails
-
-Check logs for "Falling back to cold start" message.
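The fallback rules above can be summarized as a small decision function. This is a sketch under stated assumptions: it treats the `Ready` phase of the `DynamoCheckpoint` as the signal that a usable checkpoint exists, and `choose_startup` is a hypothetical helper, not the operator's implementation.

```python
from enum import Enum


class Phase(str, Enum):
    """Phases reported in the DynamoCheckpoint status."""

    PENDING = "Pending"
    CREATING = "Creating"
    READY = "Ready"
    FAILED = "Failed"


def choose_startup(phase: str, restore_ok: bool = True) -> str:
    """Return "restore" or "cold-start" for a new worker pod.

    `restore_ok` stands in for the outcome of the CRIU restore; a corrupted
    checkpoint or a failed restore is modeled as restore_ok=False.
    """
    if Phase(phase) is not Phase.READY:
        # Checkpoint is missing, still being created, or failed.
        return "cold-start"
    return "restore" if restore_ok else "cold-start"


print(choose_startup("Creating"))                 # cold-start
print(choose_startup("Ready"))                    # restore
print(choose_startup("Ready", restore_ok=False))  # cold-start
```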
-
-## Environment Variables
-
-| Variable | Description |
-|----------|-------------|
-| `DYN_CHECKPOINT_STORAGE_TYPE` | Backend: `pvc`, `s3`, `oci` (`s3` and `oci` are currently no-ops) |
-| `DYN_CHECKPOINT_LOCATION` | Full checkpoint location (checkpoint jobs) |
-| `DYN_CHECKPOINT_PATH` | Base checkpoint directory (restore pods, PVC) |
-| `DYN_CHECKPOINT_HASH` | Identity hash |
-| `DYN_READY_FOR_CHECKPOINT_FILE` | Ready-for-checkpoint file path (checkpoint jobs) |
-
-## Complete Example
-
-Create a checkpoint and use it in a DGD:
-
-```yaml
-# 1. Create the DynamoCheckpoint CR
-apiVersion: nvidia.com/v1alpha1
-kind: DynamoCheckpoint
-metadata:
-  name: e5962d34ba272638  # 16-char hash (computed from identity)
-  namespace: dynamo-system
-spec:
-  identity:
-    model: "meta-llama/Meta-Llama-3-8B-Instruct"
-    backendFramework: "vllm"
-    tensorParallelSize: 1
-    dtype: "bfloat16"
-  job:
-    activeDeadlineSeconds: 3600
-    backoffLimit: 3
-    podTemplateSpec:
-      spec:
-        containers:
-          - name: main
-            image: nvcr.io/nvidia/ai-dynamo/dynamo-vllm-placeholder:latest
-            command: ["python3"]
-            args:
-              - "-m"
-              - "dynamo.vllm"
-              - "--model"
-              - "meta-llama/Meta-Llama-3-8B-Instruct"
-              - "--max-model-len"
-              - "4096"
-              - "--gpu-memory-utilization"
-              - "0.90"
-            env:
-              - name: HF_TOKEN
-                valueFrom:
-                  secretKeyRef:
-                    name: hf-token-secret
-                    key: HF_TOKEN
-              - name: GLOO_SOCKET_IFNAME
-                value: "lo"
-              - name: NCCL_SOCKET_IFNAME
-                value: "lo"
-            resources:
-              limits:
-                nvidia.com/gpu: "1"
-        restartPolicy: Never
---
-# 2. Wait for Ready: kubectl get dckpt e5962d34ba272638 -n dynamo-system -w
---
-# 3. Reference the checkpoint in your DGD
-apiVersion: nvidia.com/v1alpha1
-kind: DynamoGraphDeployment
-metadata:
-  name: my-llm
-  namespace: dynamo-system
-spec:
-  services:
-    worker:
-      replicas: 2
-      extraPodSpec:
-        mainContainer:
-          image: nvcr.io/nvidia/ai-dynamo/dynamo-vllm-placeholder:latest
-          command: ["python3"]
-          args:
-            - "-m"
-            - "dynamo.vllm"
-            - "--model"
-            - "meta-llama/Meta-Llama-3-8B-Instruct"
-            - "--max-model-len"
-            - "4096"
-            - "--gpu-memory-utilization"
-            - "0.90"
-          env:
-            - name: GLOO_SOCKET_IFNAME
-              value: "lo"
-            - name: NCCL_SOCKET_IFNAME
-              value: "lo"
-      resources:
-        limits:
-          nvidia.com/gpu: "1"
-      checkpoint:
-        enabled: true
-        checkpointRef: "e5962d34ba272638"  # Reference by hash
-```
-
-## Related Documentation
-
- [Dynamo Snapshot Overview](README.md) - Dynamo Snapshot architecture and use cases
- [Dynamo Snapshot Helm Chart README](https://github.com/ai-dynamo/dynamo/tree/main/deploy/helm/charts/snapshot/README.md) - Chart configuration
- [Installation Guide](../installation-guide.md) - Platform installation
- [API Reference](../api-reference.md) - Complete CRD specifications