---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: Checkpointing
---

> ⚠️ **Experimental Feature**: Dynamo Snapshot is currently in **beta/preview**. The Dynamo Snapshot DaemonSet runs in privileged mode to perform CRIU operations. See [Limitations](#limitations) for details.

**Dynamo Snapshot** (Checkpoint/Restore in Kubernetes) is experimental infrastructure for fast-starting GPU applications using CRIU (Checkpoint/Restore In Userspace). It reduces cold-start times for large models from minutes to seconds by capturing initialized application state and restoring it on demand.

## What is Dynamo Snapshot?

Dynamo Snapshot provides:
- **Fast cold starts**: Restore GPU-accelerated applications in seconds instead of minutes
- **CUDA state preservation**: Checkpoint and restore GPU memory and CUDA contexts
- **Kubernetes-native**: Integrates seamlessly with Kubernetes primitives
- **Storage flexibility**: PVC-based storage (S3/OCI planned for future releases)
- **Namespace isolation**: Each namespace gets its own checkpoint infrastructure

## Use Cases

### 1. With NVIDIA Dynamo Platform (Recommended)

Use Dynamo Snapshot as part of the Dynamo platform for automatic checkpoint management:
- Automatic checkpoint creation and lifecycle management
- Seamless integration with DynamoGraphDeployment CRDs
- Built-in autoscaling with fast restore

📖 **[Read the Dynamo Integration Guide →](dynamo.md)**

## Architecture

Dynamo Snapshot consists of two main components:

### 1. Dynamo Snapshot Helm Chart
Deploys the checkpoint/restore infrastructure:
- **DaemonSet**: Runs on GPU nodes to perform CRIU checkpoint operations
- **PVC**: Stores checkpoint data (rootfs diffs, CUDA memory state)
- **RBAC**: Namespace-scoped or cluster-wide permissions
- **Seccomp Profile**: Security policies for CRIU syscalls (needs to be injected into workload pods)

### 2. External Restore via DaemonSet
The DaemonSet performs checkpoint/restore externally using `nsenter` to enter pod namespaces:
- **Checkpoint**: Freezes the running process and dumps state (CPU + GPU) to storage
- **Restore**: Enters a placeholder pod's namespaces and restores the checkpointed process via `nsrestore`
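
To illustrate the `nsenter` mechanism described above, here is a minimal sketch of running a command inside another process's namespaces. The function name `run_in_namespaces` is hypothetical, and the set of namespaces joined here is an assumption; the namespaces the Snapshot agent actually enters, and how it invokes `nsrestore`, may differ.

```bash
# Hypothetical sketch: run a command inside another process's namespaces.
# Assumption: the mount, network, PID, IPC, and UTS namespaces are joined;
# the real agent may use a different set.
run_in_namespaces() {
  local pid="$1"
  shift
  # --target selects the process; each following flag joins one namespace type.
  nsenter --target "$pid" --mount --net --pid --ipc --uts -- "$@"
}

# Usage (as root on the node): run_in_namespaces 12345 ps aux
```

Because the calling process needs root on the node and visibility of host PIDs, this pattern fits a privileged DaemonSet with `hostPID` rather than the workload pod itself.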

## Quick Start

To install the Dynamo Snapshot DaemonSet in your cluster, run the following:

```bash
helm install snapshot nvidia/snapshot \
  --namespace my-team \
  --create-namespace \
  --set storage.pvc.size=100Gi
```
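
After installing, it can help to confirm that the release and its resources exist. The sketch below assumes the release name `snapshot` and the namespace `my-team` from the command above; the helper name `verify_snapshot_install` is hypothetical and not part of the chart.

```bash
# Hypothetical helper: show the release status and the resources it created.
# Assumes the release is named "snapshot", as in the install command above.
verify_snapshot_install() {
  local ns="${1:-my-team}"
  helm status snapshot --namespace "$ns" || return 1
  # The chart deploys a DaemonSet and a PVC (see Architecture).
  kubectl get daemonset,pvc --namespace "$ns"
}

# Usage: verify_snapshot_install my-team
```

The DaemonSet should report ready pods on the GPU nodes, and the PVC should be `Bound` before any checkpoint is attempted.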

## Key Features

### ✅ Currently Supported
- ✅ **vLLM and SGLang backends** (TensorRT-LLM planned)
- ✅ **LLM decode/prefill workers only** (multimodal, embedding, and diffusion workers are not supported)
- ✅ Cross-node, single-GPU checkpoints (requires RWX storage)
- ✅ PVC storage backend (RWX for multi-node)
- ✅ CUDA checkpoint/restore
- ✅ PyTorch distributed state (with `GLOO_SOCKET_IFNAME=lo`)
- ✅ Namespace-scoped and cluster-wide RBAC
- ✅ Idempotent checkpoint creation
- ✅ Automatic signal-based checkpoint coordination

### 🚧 Planned Features
- 🚧 TensorRT-LLM backend support
- 🚧 S3/MinIO storage backend
- 🚧 OCI registry storage backend
- 🚧 Multi-GPU checkpoints
- 🚧 Multi-node distributed checkpoints

## Limitations

⚠️ **Important**: Dynamo Snapshot has significant limitations that may impact production readiness:

### Security Considerations
- **🔴 Privileged DaemonSet**: The Dynamo Snapshot DaemonSet runs in privileged mode with `hostPID`, `hostIPC`, and `hostNetwork` to perform CRIU operations. Workload pods do **not** need privileged mode — all CRIU privilege lives in the DaemonSet.
- **Security Impact**: The privileged DaemonSet can:
  - Access all host devices and processes
  - Bypass most security restrictions
  - Potentially compromise node security if exploited
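
For a security review, the host-level settings listed above can be read back from the deployed DaemonSet. This is a sketch only: the helper name `audit_daemonset_privileges` is hypothetical, and the queried fields are the standard Kubernetes pod-spec fields rather than anything specific to Dynamo Snapshot.

```bash
# Hypothetical audit helper. Prints one line per DaemonSet in the namespace:
# name, hostPID, hostIPC, hostNetwork, and each container's privileged flag.
audit_daemonset_privileges() {
  local ns="${1:-my-team}"
  local spec='.spec.template.spec'
  local fields='{range .items[*]}{.metadata.name}{"\t"}'
  fields+="{${spec}.hostPID}{\"\\t\"}{${spec}.hostIPC}{\"\\t\"}{${spec}.hostNetwork}{\"\\t\"}"
  fields+="{${spec}.containers[*].securityContext.privileged}{\"\\n\"}{end}"
  kubectl get daemonset --namespace "$ns" -o "jsonpath=${fields}"
}

# Usage: audit_daemonset_privileges my-team
```

Fields that are unset in the pod spec print as empty columns, which Kubernetes treats as `false`.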

### Technical Limitations
- **x86_64 (amd64) only**: `cuda-checkpoint` does not support ARM64. The snapshot agent and placeholder images are built for x86_64 only.
- **NVIDIA driver 580.xx or newer required**: Dynamo Snapshot depends on `cuda-checkpoint`, which requires R580+ drivers.
- **vLLM and SGLang backends only**: TensorRT-LLM is not yet supported.
- **LLM workers only**: Checkpoint/restore supports LLM decode and prefill workers. Specialized workers (multimodal, embedding, diffusion) are not supported.
- **Single-GPU only**: Multi-GPU configurations are not yet supported
- **Network state limitations**: Active TCP connections are closed during restore (use `tcp-close` CRIU option)
- **Storage**: Only PVC storage is currently implemented (S3/OCI planned)
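
The architecture and driver requirements above can be checked on a GPU node before installing. The following is a minimal sketch, not a tool shipped with Dynamo Snapshot; the function names are hypothetical, and it assumes `nvidia-smi` is available on the node.

```bash
# Hypothetical node preflight helpers; not part of Dynamo Snapshot.
MIN_DRIVER_MAJOR=580   # from the requirement above: driver 580.xx or newer

# Return 0 if a driver version string (e.g. "580.65.06") meets the minimum.
driver_supported() {
  local major="${1%%.*}"                  # text before the first dot
  [[ "$major" =~ ^[0-9]+$ ]] || return 1  # reject empty or non-numeric input
  (( 10#$major >= MIN_DRIVER_MAJOR ))     # force base 10 to avoid octal parsing
}

# Return 0 if the architecture string is x86_64 (amd64).
arch_supported() {
  [[ "$1" == "x86_64" || "$1" == "amd64" ]]
}

# Check the current node using nvidia-smi and uname.
check_gpu_node() {
  local version
  version="$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n1)"
  if driver_supported "$version" && arch_supported "$(uname -m)"; then
    echo "OK: driver $version on $(uname -m)"
  else
    echo "Unsupported: driver '$version' on $(uname -m)" >&2
    return 1
  fi
}

# Usage (on a GPU node): check_gpu_node
```

Only the major version is compared, since the requirement is stated as R580 or newer.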

### Recommendation
Dynamo Snapshot is best suited for:
- ✅ Development and testing environments
- ✅ Research and experimentation
- ✅ Controlled production environments with appropriate security controls
- ❌ Security-sensitive production workloads without proper risk assessment

## Documentation

### Getting Started
- [Dynamo Integration Guide](dynamo.md) - Using Dynamo Snapshot with Dynamo Platform
- [Dynamo Snapshot Helm Chart README](https://github.com/ai-dynamo/dynamo/tree/main/deploy/helm/charts/snapshot/README.md) - Helm chart configuration

### Related Documentation
- [CRIU Documentation](https://criu.org/Main_Page) - Upstream CRIU docs

## Prerequisites

- Kubernetes 1.21+
- GPU nodes with NVIDIA runtime (`nvidia` runtime class)
- NVIDIA driver 580.xx or newer on the target GPU nodes
- containerd runtime (for container inspection; CRIU is bundled in Dynamo Snapshot images)
- RWX storage class (for multi-node deployments)
- **Security clearance for privileged DaemonSet** (the Dynamo Snapshot agent runs privileged with hostPID/hostIPC/hostNetwork)
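
Some of these prerequisites can be checked from the command line. The sketch below covers the cluster-level items only; the helper name `check_cluster_prereqs` is hypothetical, and because RWX support depends on the storage provisioner, the storage classes are listed for manual review rather than validated.

```bash
# Hypothetical cluster preflight helper; not part of Dynamo Snapshot.
check_cluster_prereqs() {
  # RuntimeClass is cluster-scoped, so no namespace flag is needed.
  if ! kubectl get runtimeclass nvidia >/dev/null 2>&1; then
    echo "RuntimeClass 'nvidia' not found" >&2
    return 1
  fi
  # Access modes depend on the provisioner; review this list for one
  # that supports ReadWriteMany if you deploy across nodes.
  kubectl get storageclass
}

# Usage: check_cluster_prereqs
```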

## Contributing

Dynamo Snapshot is part of the NVIDIA Dynamo project. Contributions are welcome!

## License

Apache License 2.0