# Dynamo Snapshot Helm Chart

> Experimental feature. `snapshot-agent` runs as a privileged DaemonSet to
> perform CRIU checkpoint and restore operations.

This chart installs the namespace-scoped snapshot infrastructure used by Dynamo:

- `snapshot-agent` DaemonSet on eligible GPU nodes
- `snapshot-pvc`, or wiring to an existing PVC
- namespace-scoped RBAC
- the seccomp profile CRIU needs

Install the chart in each namespace where you want checkpoint and restore.

## Prerequisites

- Kubernetes cluster with x86_64 GPU nodes
- NVIDIA driver 580.xx or newer
- containerd runtime
- Dynamo Platform already installed with `dynamo-operator.checkpoint.enabled=true`
- a cluster where a privileged DaemonSet with `hostPID`, `hostIPC`, and `hostNetwork` is acceptable

Cross-node restore requires shared `ReadWriteMany` storage. The chart defaults to
that access mode (`storage.pvc.accessMode=ReadWriteMany`).
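
The driver prerequisite can be checked before installing. The following is a minimal sketch, not something the chart ships: `driver_ok` is a hypothetical helper that compares the major component of a driver version string against 580.

```bash
# Hypothetical helper (not part of the chart): succeeds when the major
# component of an NVIDIA driver version string is 580 or newer.
driver_ok() {
  local major="${1%%.*}"          # keep everything before the first dot
  case "${major}" in
    ''|*[!0-9]*) return 1 ;;      # empty or non-numeric input fails the check
  esac
  [ "${major}" -ge 580 ]
}
```

On a GPU node, the installed version can be read with `nvidia-smi --query-gpu=driver_version --format=csv,noheader` and passed to the helper as its only argument.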

## Minimal install

Create the checkpoint PVC and the agent:

```bash
helm upgrade --install snapshot ./deploy/helm/charts/snapshot \
  --namespace "${NAMESPACE}" \
  --create-namespace \
  --set storage.pvc.create=true
```

If your cluster does not use a default storage class, also set
`storage.pvc.storageClass`.
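
For more than one or two overrides, a values file can be easier to review than a chain of `--set` flags. The sketch below writes one using parameter names documented in this README; the file name `snapshot-values.yaml` and the storage class name `my-rwx-storage-class` are placeholders, not names the chart defines.

```bash
# Write a values override file. The file name and the storage class name
# are placeholders; substitute a ReadWriteMany-capable class from your cluster.
cat > snapshot-values.yaml <<'EOF'
storage:
  pvc:
    create: true
    storageClass: my-rwx-storage-class
    accessMode: ReadWriteMany
EOF
```

Pass the file to the install command with `-f snapshot-values.yaml` in place of the `--set` flags.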

Reuse an existing PVC instead:

```bash
helm upgrade --install snapshot ./deploy/helm/charts/snapshot \
  --namespace "${NAMESPACE}" \
  --create-namespace \
  --set storage.pvc.create=false \
  --set storage.pvc.name=my-snapshot-pvc
```
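
Because cross-node restore depends on `ReadWriteMany` storage, it may be worth confirming the access mode of a PVC before reusing it. A minimal sketch follows; `has_rwx` is a hypothetical helper, not part of the chart or of `kubectl`.

```bash
# Hypothetical helper: succeeds when a space-separated list of PVC access
# modes includes ReadWriteMany.
has_rwx() {
  case " $1 " in
    *" ReadWriteMany "*) return 0 ;;
    *) return 1 ;;
  esac
}
```

Against a live cluster, the list can come from `kubectl get pvc my-snapshot-pvc -n "${NAMESPACE}" -o jsonpath='{.spec.accessModes[*]}'`, with the output passed to the helper as a single quoted argument.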

## Verify

```bash
kubectl get pvc snapshot-pvc -n "${NAMESPACE}"
kubectl rollout status daemonset/snapshot-agent -n "${NAMESPACE}"
kubectl get pods -n "${NAMESPACE}" -l app.kubernetes.io/name=snapshot -o wide
```

## Important values

| Parameter | Meaning | Default |
|-----------|---------|---------|
| `storage.type` | Snapshot-owned storage backend | `pvc` |
| `storage.pvc.create` | Create `snapshot-pvc` instead of using an existing PVC | `true` |
| `storage.pvc.name` | PVC mounted by the snapshot-agent | `snapshot-pvc` |
| `storage.pvc.size` | Requested PVC size | `1Ti` |
| `storage.pvc.storageClass` | Storage class name | `""` |
| `storage.pvc.accessMode` | Access mode for the checkpoint PVC | `ReadWriteMany` |
| `storage.pvc.basePath` | Mount path inside the snapshot-agent pod | `/checkpoints` |
| `daemonset.image.repository` | Snapshot-agent image repository | `nvcr.io/nvidia/ai-dynamo/snapshot-agent` |
| `daemonset.image.tag` | Snapshot-agent image tag | `1.0.0` |
| `daemonset.imagePullSecrets` | Image pull secrets for the agent | `[{name: ngc-secret}]` |

Reserved `s3` and `oci` values remain chart-owned placeholders for future
snapshot backends, but only `pvc` is implemented today.

See [values.yaml](./values.yaml) for the full configuration surface.

## Next steps

Once the chart is installed, use the snapshot guide to create a checkpoint or
exercise the lower-level `snapshotctl` flow:

- [Snapshot guide](../../../../docs/kubernetes/snapshot.md)

## Uninstall

```bash
helm uninstall snapshot -n "${NAMESPACE}"
```

The chart does not delete checkpoint data automatically. Remove the PVC
yourself if you want to clear stored checkpoints:

```bash
kubectl delete pvc snapshot-pvc -n "${NAMESPACE}"
```