chore: restrict Dynamo Snapshot to x86_64, document backend support (#7031)

d54af4f8 · Schwinn Saereesitthipitak · GitHub · b5fddbd0
Unverified Commit d54af4f8 authored Mar 09, 2026 by Schwinn Saereesitthipitak Committed by GitHub Mar 09, 2026
7 changed files
--- a/deploy/helm/charts/snapshot/README.md
+++ b/deploy/helm/charts/snapshot/README.md
@@ -17,7 +17,9 @@ This Helm chart deploys the checkpoint/restore infrastructure for NVIDIA Dynamo,
 ⚠️ **Security Warning**: The Dynamo Snapshot DaemonSet runs in **privileged mode** with `hostPID`, `hostIPC`, and `hostNetwork` to perform CRIU checkpoint/restore operations. Workload pods do not need privileged mode. Only deploy in environments where a privileged DaemonSet is acceptable.

 - Kubernetes 1.21+
+- **x86_64 (amd64) nodes only** for the snapshot agent and placeholder images
 - GPU nodes with NVIDIA runtime (`nvidia` runtime class)
+- NVIDIA driver 580.xx or newer on the target GPU nodes
 - containerd runtime (for container inspection; CRIU is bundled in Dynamo Snapshot images)
 - NVIDIA Dynamo operator installed (cluster-wide or namespace-scoped)
 - RWX (ReadWriteMany) storage class for multi-node deployments
@@ -35,9 +37,9 @@ export NAMESPACE=my-team  # Your target namespace
 export DOCKER_SERVER=your-registry.com/  # Your container registry
 export IMAGE_TAG=latest

-# Build Dynamo Snapshot agent image
+# Build Dynamo Snapshot agent image (amd64 only)
 cd deploy/snapshot
-docker build --target agent -t $DOCKER_SERVER/snapshot-agent:$IMAGE_TAG .
+docker build --platform linux/amd64 --target agent -t $DOCKER_SERVER/snapshot-agent:$IMAGE_TAG .
 docker push $DOCKER_SERVER/snapshot-agent:$IMAGE_TAG
 cd -


--- a/deploy/helm/charts/snapshot/templates/daemonset.yaml
+++ b/deploy/helm/charts/snapshot/templates/daemonset.yaml
@@ -38,10 +38,22 @@ spec:
        {{- with .Values.daemonset.tolerations }}
        {{- toYaml . | nindent 8 }}
        {{- end }}
-      {{- with .Values.daemonset.affinity }}
+      {{- if and .Values.daemonset.affinity (hasKey .Values.daemonset.affinity "nodeAffinity") }}
+      {{- fail "daemonset.affinity.nodeAffinity is not supported because the chart already enforces kubernetes.io/arch=amd64; use daemonset.nodeSelector or daemonset.affinity.podAffinity/podAntiAffinity instead" }}
+      {{- end }}
      affinity:
+        # cuda-checkpoint only supports x86_64 — never schedule on arm64 nodes
+        nodeAffinity:
+          requiredDuringSchedulingIgnoredDuringExecution:
+            nodeSelectorTerms:
+              - matchExpressions:
+                  - key: kubernetes.io/arch
+                    operator: In
+                    values:
+                      - amd64
+        {{- with .Values.daemonset.affinity }}
        {{- toYaml . | nindent 8 }}
-      {{- end }}
+        {{- end }}
      # CUDA checkpoint/restore requires the nvidia container runtime
      runtimeClassName: nvidia
      {{- if .Values.seccomp.deploy }}
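
Reviewer note: the required `nodeAffinity` above means the DaemonSet is never scheduled onto non-amd64 nodes. A minimal sketch for finding which nodes in a cluster that affects. The helper name `list_non_amd64_nodes` is illustrative and not part of this chart; it only filters `NAME ARCH` lines fed to it on stdin.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the chart): read "NAME ARCH" lines on
# stdin and print the names of nodes whose architecture is not amd64.
list_non_amd64_nodes() {
  # Skip blank or malformed lines; compare the second column to "amd64".
  awk 'NF >= 2 && $2 != "amd64" { print $1 }'
}

# Example wiring against a live cluster (requires kubectl access):
#   kubectl get nodes --no-headers \
#     -o custom-columns=NAME:.metadata.name,ARCH:.status.nodeInfo.architecture \
#     | list_non_amd64_nodes
```

Any node printed by this pipeline will not run the snapshot agent, so checkpoint/restore is unavailable for workloads placed there.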

--- a/deploy/snapshot/Dockerfile
+++ b/deploy/snapshot/Dockerfile
@@ -4,8 +4,8 @@
 # Unified Dockerfile for snapshot-agent and placeholder images.
 #
 # Build targets:
-#   docker build --target agent -t snapshot-agent:latest .
-#   docker build --target placeholder --build-arg BASE_IMAGE=<app-image> -t placeholder:latest .
+#   docker build --platform linux/amd64 --target agent -t snapshot-agent:latest .
+#   docker build --platform linux/amd64 --target placeholder --build-arg BASE_IMAGE=<app-image> -t placeholder:latest .
 #
 # Optional targets for CI:
 #   docker build --target linter .   # Run linting
@@ -109,6 +109,12 @@ RUN git clone https://github.com/NVIDIA/cuda-checkpoint.git /tmp/cuda-checkpoint
 # =============================================================================
 FROM ${AGENT_BASE_IMAGE} AS agent

+ARG TARGETARCH=amd64
+
+RUN if [ "${TARGETARCH}" != "amd64" ]; then \
+      echo "ERROR: Dynamo Snapshot requires x86_64 (cuda-checkpoint has no ${TARGETARCH} binary)" >&2; exit 1; \
+    fi
+
 # Install CRIU runtime dependencies
 RUN apt-get update && apt-get install -y --no-install-recommends \
    libbsd0 \
@@ -156,10 +162,15 @@ ENTRYPOINT ["/usr/local/bin/snapshot-agent"]
 FROM ${BASE_IMAGE} AS placeholder

 ARG BASE_IMAGE
+ARG TARGETARCH=amd64
 ENV ORIGINAL_BASE_IMAGE=${BASE_IMAGE}

 USER root

+RUN if [ "${TARGETARCH}" != "amd64" ]; then \
+      echo "ERROR: Dynamo Snapshot requires x86_64 (cuda-checkpoint has no ${TARGETARCH} binary)" >&2; exit 1; \
+    fi
+
 # Install minimal runtime dependencies for CRIU restore (nsrestore runs here via nsenter)
 RUN apt-get update && apt-get install -y --no-install-recommends \
    libbsd0 \
@@ -192,4 +203,3 @@ RUN chmod +x /usr/local/bin/nsrestore

 # Create directories
 RUN mkdir -p /checkpoints /var/run/criu /var/criu-work
-
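Reviewer note: both stages now fail fast when `TARGETARCH` is not `amd64`. The same check can be sketched as a standalone shell function, for example to run in CI before invoking the build. The function name `require_amd64` is hypothetical; the default value and error message mirror the `RUN` guard in this diff.

```shell
#!/usr/bin/env bash
# Hypothetical pre-build check mirroring the Dockerfile guard.
# require_amd64 is an illustrative name, not something the repository defines.
require_amd64() {
  # Default matches `ARG TARGETARCH=amd64` in the Dockerfile.
  local target_arch="${1:-amd64}"
  if [ "${target_arch}" != "amd64" ]; then
    echo "ERROR: Dynamo Snapshot requires x86_64 (cuda-checkpoint has no ${target_arch} binary)" >&2
    return 1
  fi
}
```

Calling `require_amd64 arm64` prints the error to stderr and returns a non-zero status; `require_amd64` with no argument succeeds.
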
--- a/deploy/snapshot/Makefile
+++ b/deploy/snapshot/Makefile
@@ -15,6 +15,8 @@ endif

 # CONTAINER_TOOL defines the container tool to be used for building images.
 CONTAINER_TOOL ?= docker
+# Snapshot runtime images ship cuda-checkpoint and are amd64-only.
+RUNTIME_IMAGE_PLATFORM ?= linux/amd64

 # Setting SHELL to bash allows bash commands to be executed by recipes.
 SHELL = /usr/bin/env bash -o pipefail
@@ -69,8 +71,8 @@ clean: ## Remove build artifacts.
 ##@ Docker

 .PHONY: docker-build-agent
-docker-build-agent: ## Build snapshot-agent docker image.
-	$(CONTAINER_TOOL) build --target agent -t ${IMG} .
+docker-build-agent: ## Build snapshot-agent docker image (linux/amd64 only).
+	$(CONTAINER_TOOL) build --platform ${RUNTIME_IMAGE_PLATFORM} --target agent -t ${IMG} .

 .PHONY: docker-build-agent-lint
 docker-build-agent-lint: ## Build snapshot-agent docker image up to lint stage.
@@ -81,7 +83,7 @@ docker-build-agent-test: ## Build snapshot-agent docker image up to test stage.
 	$(CONTAINER_TOOL) build --target tester -t ${IMG}-test .

 .PHONY: docker-build-placeholder
-docker-build-placeholder: ## Build placeholder image for checkpoint restore. Requires PLACEHOLDER_BASE_IMG.
+docker-build-placeholder: ## Build placeholder image for checkpoint restore (linux/amd64 only). Requires PLACEHOLDER_BASE_IMG.
 ifndef PLACEHOLDER_BASE_IMG
 	$(error PLACEHOLDER_BASE_IMG is required. Example: make docker-build-placeholder PLACEHOLDER_BASE_IMG=nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.8.1-cuda13)
 endif
@@ -91,7 +93,7 @@ endif
 		BASE_IMAGE_USER="$$( $(CONTAINER_TOOL) image inspect --format '{{.Config.User}}' ${PLACEHOLDER_BASE_IMG} 2>/dev/null || true )"; \
 	fi; \
 	if [ -z "$$BASE_IMAGE_USER" ]; then BASE_IMAGE_USER=root; fi; \
-	$(CONTAINER_TOOL) build --target placeholder \
+	$(CONTAINER_TOOL) build --platform ${RUNTIME_IMAGE_PLATFORM} --target placeholder \
 		--build-arg BASE_IMAGE=${PLACEHOLDER_BASE_IMG} \
 		--build-arg BASE_IMAGE_USER=$$BASE_IMAGE_USER \
 		-t ${PLACEHOLDER_IMG} .

--- a/docs/kubernetes/snapshot/README.md
+++ b/docs/kubernetes/snapshot/README.md
@@ -60,7 +60,7 @@ helm install snapshot nvidia/snapshot \
 ### ✅ Currently Supported
 - ✅ **vLLM and SGLang backends** (TensorRT-LLM planned)
 - ✅ **LLM decode/prefill workers only** (multimodal, embedding, and diffusion workers are not supported)
-- ✅ Cross-node, single-GPU checkpoints
+- ✅ Cross-node, single-GPU checkpoints (requires RWX storage)
 - ✅ PVC storage backend (RWX for multi-node)
 - ✅ CUDA checkpoint/restore
 - ✅ PyTorch distributed state (with `GLOO_SOCKET_IFNAME=lo`)
@@ -87,9 +87,10 @@ helm install snapshot nvidia/snapshot \
  - Potentially compromise node security if exploited

 ### Technical Limitations
-- **vLLM and SGLang backends only**: TensorRT-LLM support is planned.
+- **x86_64 (amd64) only**: `cuda-checkpoint` does not support ARM64. The snapshot agent and placeholder images are built for x86_64 only.
+- **NVIDIA driver 580.xx or newer required**: Dynamo Snapshot depends on `cuda-checkpoint`, which requires R580+ drivers.
+- **vLLM and SGLang backends only**: TensorRT-LLM is not supported.
 - **LLM workers only**: Checkpoint/restore supports LLM decode and prefill workers. Specialized workers (multimodal, embedding, diffusion) are not supported.
-- **Single-node only**: Checkpoints must be created and restored on the same node
 - **Single-GPU only**: Multi-GPU configurations not yet supported
 - **Network state limitations**: Active TCP connections are closed during restore (use `tcp-close` CRIU option)
 - **Storage**: Only PVC storage is currently implemented (S3/OCI planned)
@@ -114,6 +115,7 @@ Dynamo Snapshot is best suited for:

 - Kubernetes 1.21+
 - GPU nodes with NVIDIA runtime (`nvidia` runtime class)
+- NVIDIA driver 580.xx or newer on the target GPU nodes
 - containerd runtime (for container inspection; CRIU is bundled in Dynamo Snapshot images)
 - RWX storage class (for multi-node deployments)
 - **Security clearance for privileged DaemonSet** (the Dynamo Snapshot agent runs privileged with hostPID/hostIPC/hostNetwork)
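
Reviewer note: the new driver prerequisite (580.xx or newer) can be checked on a GPU node before installing the chart. A minimal sketch, assuming the driver version string has the usual `MAJOR.MINOR[.PATCH]` form. The function name `driver_major_ok` is hypothetical and only compares the major component against 580.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the repository): succeed when the major
# component of an NVIDIA driver version string is 580 or higher.
driver_major_ok() {
  local version="$1"
  local major="${version%%.*}"   # text before the first dot
  case "${major}" in
    ''|*[!0-9]*)
      echo "unrecognized driver version: ${version}" >&2
      return 2
      ;;
  esac
  [ "${major}" -ge 580 ]
}

# Example wiring on a GPU node (requires nvidia-smi):
#   driver_major_ok "$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n1)"
```

A non-zero exit status indicates that the node does not meet the R580+ requirement stated in the limitations above, or that the version string could not be parsed.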

--- a/docs/kubernetes/snapshot/dynamo.md
+++ b/docs/kubernetes/snapshot/dynamo.md
@@ -15,9 +15,11 @@ Checkpointing captures the complete state of a running worker pod (including GPU

 ## Prerequisites

-- Dynamo Platform installed (v0.4.0+) on k8s cluster with GPU nodes
+- Dynamo Platform installed on a k8s cluster with **x86_64 (amd64)** GPU nodes
 - Dynamo Snapshot Helm chart installed (separate from platform)
 - RWX PVC storage (PVC is currently the only supported backend)
+- NVIDIA driver 580.xx or newer on the target GPU nodes
+- vLLM or SGLang backend (TensorRT-LLM is not supported)

 ## Quick Start

@@ -221,7 +223,7 @@ Checkpoints are uniquely identified by a **16-character SHA256 hash** (64 bits)
 | Field | Required | Affects Hash | Example |
 |-------|----------|-------------|---------|
 | `model` | ✓ | ✓ | `meta-llama/Llama-3-8B` |
-| `framework` | ✓ | ✓ | `sglang`, `trtllm`, `vllm` |
+| `backendFramework` | ✓ | ✓ | `sglang`, `vllm` |
 | `dynamoVersion` | | ✓ | `0.9.0`, `1.0.0` |
 | `tensorParallelSize` | | ✓ | `1`, `2`, `4`, `8` (default: 1) |
 | `pipelineParallelSize` | | ✓ | `1`, `2` (default: 1) |
@@ -356,7 +358,9 @@ Or use `auto` mode and the operator will find/create it automatically.

 ## Limitations

-- **vLLM and SGLang backends only**: TensorRT-LLM support is planned.
+- **x86_64 (amd64) only**: `cuda-checkpoint` does not support ARM64. The snapshot agent and placeholder images are built for x86_64 only.
+- **NVIDIA driver 580.xx or newer required**: Dynamo Snapshot depends on `cuda-checkpoint`, which requires R580+ drivers.
+- **vLLM and SGLang backends only**: TensorRT-LLM is not supported.
 - **LLM workers only**: Checkpoint/restore supports LLM decode and prefill workers. Specialized workers (multimodal, embedding, diffusion) are not supported.
 - **Single-GPU only**: Multi-GPU configurations are not yet supported (planned)
 - **Network state**: Active TCP connections are closed during restore (handled with `tcp-close` CRIU option)
@@ -514,4 +518,3 @@ spec:
 - [Dynamo Snapshot Helm Chart README](https://github.com/ai-dynamo/dynamo/tree/main/deploy/helm/charts/snapshot/README.md) - Chart configuration
 - [Installation Guide](../installation-guide.md) - Platform installation
 - [API Reference](../api-reference.md) - Complete CRD specifications
-
--- a/docs/reference/feature-matrix.md
+++ b/docs/reference/feature-matrix.md
@@ -28,6 +28,7 @@ This document provides a comprehensive compatibility matrix for key Dynamo featu
 | **LoRA** | | | ✅ | [K8s Guide][lora] |
 | **Tool Calling** | ✅ | ✅ | ✅ | [Tool Calling Doc][tools] |
 | **Speculative Decoding** | 🚧 | ✅ | ✅ | Backend READMEs |
+| **Dynamo Snapshot** | ✅ | | ✅ | [Snapshot Docs][snapshot] |

 ## 1. vLLM Backend

@@ -130,3 +131,6 @@ TensorRT-LLM delivers maximum inference performance and optimization, with full
 [lora]: ../kubernetes-deployment/deployment-guide/managing-models-with-dynamo-model
 [vllm-spec]: ../additional-resources/speculative-decoding/speculative-decoding-with-v-llm
 [trtllm-eagle]: ../additional-resources/tensor-rt-llm-details/llama-4-eagle
+
+{/* Dynamo Snapshot */}
+[snapshot]: ../kubernetes/snapshot/README.md