Commit d54af4f8 authored by Schwinn Saereesitthipitak, committed by GitHub

chore: restrict Dynamo Snapshot to x86_64, document backend support (#7031)

parent b5fddbd0
......@@ -17,7 +17,9 @@ This Helm chart deploys the checkpoint/restore infrastructure for NVIDIA Dynamo,
⚠️ **Security Warning**: The Dynamo Snapshot DaemonSet runs in **privileged mode** with `hostPID`, `hostIPC`, and `hostNetwork` to perform CRIU checkpoint/restore operations. Workload pods do not need privileged mode. Only deploy in environments where a privileged DaemonSet is acceptable.
- Kubernetes 1.21+
- **x86_64 (amd64) nodes only** for the snapshot agent and placeholder images
- GPU nodes with NVIDIA runtime (`nvidia` runtime class)
- NVIDIA driver 580.xx or newer on the target GPU nodes
- containerd runtime (for container inspection; CRIU is bundled in Dynamo Snapshot images)
- NVIDIA Dynamo operator installed (cluster-wide or namespace-scoped)
- RWX (ReadWriteMany) storage class for multi-node deployments
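
The architecture prerequisite above can be checked before installing the chart. The following is a minimal sketch, not part of this commit: the helper names `to_k8s_arch` and `snapshot_arch_supported` are hypothetical, and the mapping covers only the two architectures discussed here.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of the chart or this commit).

# Map a `uname -m` machine name to the value Kubernetes reports in the
# kubernetes.io/arch node label. Only the two cases relevant here are handled.
to_k8s_arch() {
  case "$1" in
    x86_64|amd64)  echo "amd64" ;;
    aarch64|arm64) echo "arm64" ;;
    *)             echo "unknown" ;;
  esac
}

# Succeeds (exit status 0) only for the architecture the prerequisites allow.
snapshot_arch_supported() {
  [ "$(to_k8s_arch "$1")" = "amd64" ]
}

# Report on the machine this script runs on.
if snapshot_arch_supported "$(uname -m)"; then
  echo "this machine: supported ($(uname -m))"
else
  echo "this machine: not supported ($(uname -m))"
fi
```

For a whole cluster, `kubectl get nodes -L kubernetes.io/arch` lists the same label for every node, which is the label the chart's node affinity matches on.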
......@@ -35,9 +37,9 @@ export NAMESPACE=my-team # Your target namespace
export DOCKER_SERVER=your-registry.com/ # Your container registry
export IMAGE_TAG=latest
# Build Dynamo Snapshot agent image
# Build Dynamo Snapshot agent image (amd64 only)
cd deploy/snapshot
docker build --target agent -t $DOCKER_SERVER/snapshot-agent:$IMAGE_TAG .
docker build --platform linux/amd64 --target agent -t $DOCKER_SERVER/snapshot-agent:$IMAGE_TAG .
docker push $DOCKER_SERVER/snapshot-agent:$IMAGE_TAG
cd -
......
......@@ -38,10 +38,22 @@ spec:
{{- with .Values.daemonset.tolerations }}
{{- toYaml . | nindent 8 }}
{{- end }}
{{- with .Values.daemonset.affinity }}
{{- if and .Values.daemonset.affinity (hasKey .Values.daemonset.affinity "nodeAffinity") }}
{{- fail "daemonset.affinity.nodeAffinity is not supported because the chart already enforces kubernetes.io/arch=amd64; use daemonset.nodeSelector or daemonset.affinity.podAffinity/podAntiAffinity instead" }}
{{- end }}
affinity:
# cuda-checkpoint only supports x86_64 — never schedule on arm64 nodes
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: kubernetes.io/arch
operator: In
values:
- amd64
{{- with .Values.daemonset.affinity }}
{{- toYaml . | nindent 8 }}
{{- end }}
{{- end }}
# CUDA checkpoint/restore requires the nvidia container runtime
runtimeClassName: nvidia
{{- if .Values.seccomp.deploy }}
......
......@@ -4,8 +4,8 @@
# Unified Dockerfile for snapshot-agent and placeholder images.
#
# Build targets:
# docker build --target agent -t snapshot-agent:latest .
# docker build --target placeholder --build-arg BASE_IMAGE=<app-image> -t placeholder:latest .
# docker build --platform linux/amd64 --target agent -t snapshot-agent:latest .
# docker build --platform linux/amd64 --target placeholder --build-arg BASE_IMAGE=<app-image> -t placeholder:latest .
#
# Optional targets for CI:
# docker build --target linter . # Run linting
......@@ -109,6 +109,12 @@ RUN git clone https://github.com/NVIDIA/cuda-checkpoint.git /tmp/cuda-checkpoint
# =============================================================================
FROM ${AGENT_BASE_IMAGE} AS agent
ARG TARGETARCH=amd64
RUN if [ "${TARGETARCH}" != "amd64" ]; then \
echo "ERROR: Dynamo Snapshot requires x86_64 (cuda-checkpoint has no ${TARGETARCH} binary)" >&2; exit 1; \
fi
# Install CRIU runtime dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
libbsd0 \
......@@ -156,10 +162,15 @@ ENTRYPOINT ["/usr/local/bin/snapshot-agent"]
FROM ${BASE_IMAGE} AS placeholder
ARG BASE_IMAGE
ARG TARGETARCH=amd64
ENV ORIGINAL_BASE_IMAGE=${BASE_IMAGE}
USER root
RUN if [ "${TARGETARCH}" != "amd64" ]; then \
echo "ERROR: Dynamo Snapshot requires x86_64 (cuda-checkpoint has no ${TARGETARCH} binary)" >&2; exit 1; \
fi
# Install minimal runtime dependencies for CRIU restore (nsrestore runs here via nsenter)
RUN apt-get update && apt-get install -y --no-install-recommends \
libbsd0 \
......@@ -192,4 +203,3 @@ RUN chmod +x /usr/local/bin/nsrestore
# Create directories
RUN mkdir -p /checkpoints /var/run/criu /var/criu-work
......@@ -15,6 +15,8 @@ endif
# CONTAINER_TOOL defines the container tool to be used for building images.
CONTAINER_TOOL ?= docker
# Snapshot runtime images ship cuda-checkpoint and are amd64-only.
RUNTIME_IMAGE_PLATFORM ?= linux/amd64
# Setting SHELL to bash allows bash commands to be executed by recipes.
SHELL = /usr/bin/env bash -o pipefail
......@@ -69,8 +71,8 @@ clean: ## Remove build artifacts.
##@ Docker
.PHONY: docker-build-agent
docker-build-agent: ## Build snapshot-agent docker image.
$(CONTAINER_TOOL) build --target agent -t ${IMG} .
docker-build-agent: ## Build snapshot-agent docker image (linux/amd64 only).
$(CONTAINER_TOOL) build --platform ${RUNTIME_IMAGE_PLATFORM} --target agent -t ${IMG} .
.PHONY: docker-build-agent-lint
docker-build-agent-lint: ## Build snapshot-agent docker image up to lint stage.
......@@ -81,7 +83,7 @@ docker-build-agent-test: ## Build snapshot-agent docker image up to test stage.
$(CONTAINER_TOOL) build --target tester -t ${IMG}-test .
.PHONY: docker-build-placeholder
docker-build-placeholder: ## Build placeholder image for checkpoint restore. Requires PLACEHOLDER_BASE_IMG.
docker-build-placeholder: ## Build placeholder image for checkpoint restore (linux/amd64 only). Requires PLACEHOLDER_BASE_IMG.
ifndef PLACEHOLDER_BASE_IMG
$(error PLACEHOLDER_BASE_IMG is required. Example: make docker-build-placeholder PLACEHOLDER_BASE_IMG=nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.8.1-cuda13)
endif
......@@ -91,7 +93,7 @@ endif
BASE_IMAGE_USER="$$( $(CONTAINER_TOOL) image inspect --format '{{.Config.User}}' ${PLACEHOLDER_BASE_IMG} 2>/dev/null || true )"; \
fi; \
if [ -z "$$BASE_IMAGE_USER" ]; then BASE_IMAGE_USER=root; fi; \
$(CONTAINER_TOOL) build --target placeholder \
$(CONTAINER_TOOL) build --platform ${RUNTIME_IMAGE_PLATFORM} --target placeholder \
--build-arg BASE_IMAGE=${PLACEHOLDER_BASE_IMG} \
--build-arg BASE_IMAGE_USER=$$BASE_IMAGE_USER \
-t ${PLACEHOLDER_IMG} .
......
......@@ -60,7 +60,7 @@ helm install snapshot nvidia/snapshot \
### ✅ Currently Supported
- ✅ **vLLM and SGLang backends** (TensorRT-LLM planned)
-**LLM decode/prefill workers only** (multimodal, embedding, and diffusion workers are not supported)
- ✅ Cross-node, single-GPU checkpoints
- ✅ Cross-node, single-GPU checkpoints (requires RWX storage)
- ✅ PVC storage backend (RWX for multi-node)
- ✅ CUDA checkpoint/restore
- ✅ PyTorch distributed state (with `GLOO_SOCKET_IFNAME=lo`)
......@@ -87,9 +87,10 @@ helm install snapshot nvidia/snapshot \
- Potentially compromise node security if exploited
### Technical Limitations
- **vLLM and SGLang backends only**: TensorRT-LLM support is planned.
- **x86_64 (amd64) only**: `cuda-checkpoint` does not support ARM64. The snapshot agent and placeholder images are built for x86_64 only.
- **NVIDIA driver 580.xx or newer required**: Dynamo Snapshot depends on `cuda-checkpoint`, which requires R580+ drivers.
- **vLLM and SGLang backends only**: TensorRT-LLM is not supported.
- **LLM workers only**: Checkpoint/restore supports LLM decode and prefill workers. Specialized workers (multimodal, embedding, diffusion) are not supported.
- **Single-node only**: Checkpoints must be created and restored on the same node
- **Single-GPU only**: Multi-GPU configurations not yet supported
- **Network state limitations**: Active TCP connections are closed during restore (use `tcp-close` CRIU option)
- **Storage**: Only PVC storage is currently implemented (S3/OCI planned)
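
The driver requirement in the list above (580.xx or newer) can be turned into a simple pre-flight check. This is an illustrative sketch only: `driver_supported` is a hypothetical helper, and the version strings in the loop are sample inputs rather than values taken from this commit.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the chart or this commit).

# Succeeds when the major component of an NVIDIA driver version string
# is 580 or newer. Returns 2 when the string cannot be parsed.
driver_supported() {
  major="${1%%.*}"                  # "580.65.06" -> "580"
  case "$major" in
    ''|*[!0-9]*) return 2 ;;        # empty or non-numeric: cannot decide
  esac
  [ "$major" -ge 580 ]
}

# Sample inputs for illustration.
for v in 570.124.06 580.65.06; do
  if driver_supported "$v"; then
    echo "$v: ok"
  else
    echo "$v: too old"
  fi
done
```

On a GPU node, the installed version can be read with `nvidia-smi --query-gpu=driver_version --format=csv,noheader` and passed to the function.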
......@@ -114,6 +115,7 @@ Dynamo Snapshot is best suited for:
- Kubernetes 1.21+
- GPU nodes with NVIDIA runtime (`nvidia` runtime class)
- NVIDIA driver 580.xx or newer on the target GPU nodes
- containerd runtime (for container inspection; CRIU is bundled in Dynamo Snapshot images)
- RWX storage class (for multi-node deployments)
- **Security clearance for privileged DaemonSet** (the Dynamo Snapshot agent runs privileged with hostPID/hostIPC/hostNetwork)
......
......@@ -15,9 +15,11 @@ Checkpointing captures the complete state of a running worker pod (including GPU
## Prerequisites
- Dynamo Platform installed (v0.4.0+) on k8s cluster with GPU nodes
- Dynamo Platform installed on a k8s cluster with **x86_64 (amd64)** GPU nodes
- Dynamo Snapshot Helm chart installed (separate from platform)
- RWX PVC storage (PVC is currently the only supported backend)
- NVIDIA driver 580.xx or newer on the target GPU nodes
- vLLM or SGLang backend (TensorRT-LLM is not supported)
## Quick Start
......@@ -221,7 +223,7 @@ Checkpoints are uniquely identified by a **16-character SHA256 hash** (64 bits)
| Field | Required | Affects Hash | Example |
|-------|----------|-------------|---------|
| `model` | ✓ | ✓ | `meta-llama/Llama-3-8B` |
| `framework` | ✓ | ✓ | `sglang`, `trtllm`, `vllm` |
| `backendFramework` | ✓ | ✓ | `sglang`, `vllm` |
| `dynamoVersion` | | ✓ | `0.9.0`, `1.0.0` |
| `tensorParallelSize` | | ✓ | `1`, `2`, `4`, `8` (default: 1) |
| `pipelineParallelSize` | | ✓ | `1`, `2` (default: 1) |
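
To illustrate how a 16-character identifier can be derived from the hash-affecting fields in the table, here is a rough sketch. The canonical string format (ordered `key=value` pairs joined with `|`) and the function name `checkpoint_id` are assumptions made for this example. The operator's real encoding is not shown in this diff, so the IDs produced here will not match real checkpoint IDs.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: the operator's actual canonical form is not shown
# in this diff. This only demonstrates truncating a SHA-256 digest to its
# first 16 hex characters (64 bits).
#
# Arguments: model, backendFramework, dynamoVersion,
#            tensorParallelSize (default 1), pipelineParallelSize (default 1)
checkpoint_id() {
  printf 'model=%s|backendFramework=%s|dynamoVersion=%s|tensorParallelSize=%s|pipelineParallelSize=%s' \
    "$1" "$2" "$3" "${4:-1}" "${5:-1}" \
    | sha256sum \
    | cut -c1-16
}

# Same fields give the same ID; changing any hash-affecting field changes it.
checkpoint_id meta-llama/Llama-3-8B vllm 1.0.0
checkpoint_id meta-llama/Llama-3-8B sglang 1.0.0
```

Because the optional parallelism fields default to 1 in this sketch, omitting them and passing `1 1` explicitly yield the same ID, which mirrors the defaults listed in the table.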
......@@ -356,7 +358,9 @@ Or use `auto` mode and the operator will find/create it automatically.
## Limitations
- **vLLM and SGLang backends only**: TensorRT-LLM support is planned.
- **x86_64 (amd64) only**: `cuda-checkpoint` does not support ARM64. The snapshot agent and placeholder images are built for x86_64 only.
- **NVIDIA driver 580.xx or newer required**: Dynamo Snapshot depends on `cuda-checkpoint`, which requires R580+ drivers.
- **vLLM and SGLang backends only**: TensorRT-LLM is not supported.
- **LLM workers only**: Checkpoint/restore supports LLM decode and prefill workers. Specialized workers (multimodal, embedding, diffusion) are not supported.
- **Single-GPU only**: Multi-GPU configurations are not yet supported (planned)
- **Network state**: Active TCP connections are closed during restore (handled with `tcp-close` CRIU option)
......@@ -514,4 +518,3 @@ spec:
- [Dynamo Snapshot Helm Chart README](https://github.com/ai-dynamo/dynamo/tree/main/deploy/helm/charts/snapshot/README.md) - Chart configuration
- [Installation Guide](../installation-guide.md) - Platform installation
- [API Reference](../api-reference.md) - Complete CRD specifications
......@@ -28,6 +28,7 @@ This document provides a comprehensive compatibility matrix for key Dynamo featu
| **LoRA** | | | ✅ | [K8s Guide][lora] |
| **Tool Calling** | ✅ | ✅ | ✅ | [Tool Calling Doc][tools] |
| **Speculative Decoding** | 🚧 | ✅ | ✅ | Backend READMEs |
| **Dynamo Snapshot** | ✅ | | ✅ | [Snapshot Docs][snapshot] |
## 1. vLLM Backend
......@@ -130,3 +131,6 @@ TensorRT-LLM delivers maximum inference performance and optimization, with full
[lora]: ../kubernetes-deployment/deployment-guide/managing-models-with-dynamo-model
[vllm-spec]: ../additional-resources/speculative-decoding/speculative-decoding-with-v-llm
[trtllm-eagle]: ../additional-resources/tensor-rt-llm-details/llama-4-eagle
{/* Dynamo Snapshot */}
[snapshot]: ../kubernetes/snapshot/README.md