# Dynamo Snapshot Helm Chart

> ⚠️ **Experimental Feature**: Dynamo Snapshot is currently in beta/preview. The DaemonSet runs in privileged mode to perform CRIU checkpoint and restore operations.

This chart installs the namespace-scoped checkpoint/restore infrastructure used by Dynamo:

- `snapshot-agent` DaemonSet on GPU nodes
- `snapshot-pvc` checkpoint storage, or wiring to an existing PVC
- namespace-scoped RBAC
- the seccomp profile required by CRIU

Snapshot storage is namespace-local. Install this chart in every namespace where you want to use checkpoint and restore.

## Prerequisites

- Kubernetes 1.21+
- x86_64 GPU nodes
- NVIDIA driver 580.xx or newer
- containerd runtime
- a cluster where a privileged DaemonSet with `hostPID`, `hostIPC`, and `hostNetwork` is acceptable
- Dynamo Platform already installed, with operator checkpointing enabled

The platform/operator configuration must point at the same checkpoint storage that this chart installs:

```yaml
dynamo-operator:
  checkpoint:
    enabled: true
    storage:
      type: pvc
      pvc:
        pvcName: snapshot-pvc
        basePath: /checkpoints
```

Cross-node restore requires shared storage that supports the `ReadWriteMany` access mode. The chart defaults to `storage.pvc.accessMode=ReadWriteMany`.

For better restore times, use a fast `ReadWriteMany` StorageClass for the checkpoint PVC.

## Minimal Install

This is the smallest Helm install that creates the checkpoint PVC and the DaemonSet:

```bash
helm upgrade --install snapshot ./deploy/helm/charts/snapshot \
  --namespace ${NAMESPACE} \
  --create-namespace \
  --set storage.pvc.create=true
```

If your cluster does not use a default storage class, also set `storage.pvc.storageClass`.

Keep `storage.pvc.accessMode=ReadWriteMany` for this chart layout. The DaemonSet mounts the same PVC on each eligible node, so a shared `ReadWriteOnce` claim only works when the agent runs on one node.
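
The same install can also be written as a values file instead of `--set` flags. The fragment below is a minimal sketch that uses only the parameters listed under Important Values; the file name `snapshot-values.yaml` and the storage class name `my-rwx-storage-class` are placeholders, not names shipped with the chart.

```yaml
# snapshot-values.yaml (hypothetical file name)
storage:
  pvc:
    # Let the chart create the checkpoint PVC.
    create: true
    # Placeholder: replace with a StorageClass in your cluster
    # that supports ReadWriteMany.
    storageClass: my-rwx-storage-class
    # Chart default, repeated here to make the requirement explicit.
    accessMode: ReadWriteMany
```

Pass the file with Helm's standard `-f` flag, for example `helm upgrade --install snapshot ./deploy/helm/charts/snapshot --namespace ${NAMESPACE} --create-namespace -f snapshot-values.yaml`.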

If you already have a PVC, run the chart in "use existing PVC" mode by setting `storage.pvc.create=false` and pointing `storage.pvc.name` at that claim. Do not set `storage.pvc.create=true` when reusing an existing checkpoint PVC:

```bash
helm upgrade --install snapshot ./deploy/helm/charts/snapshot \
  --namespace ${NAMESPACE} \
  --create-namespace \
  --set storage.pvc.create=false \
  --set storage.pvc.name=my-snapshot-pvc
```
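
Because the operator must point at the same checkpoint storage, reusing a PVC under a different name means the platform values change as well. The fragment below is a sketch of the matching operator configuration, assuming the `my-snapshot-pvc` name from the command above and the default `/checkpoints` base path; adjust both if your chart values differ.

```yaml
dynamo-operator:
  checkpoint:
    enabled: true
    storage:
      type: pvc
      pvc:
        # Must match storage.pvc.name in the snapshot chart.
        pvcName: my-snapshot-pvc
        # Must match storage.pvc.basePath in the snapshot chart.
        basePath: /checkpoints
```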

## Verify

```bash
kubectl get pvc snapshot-pvc -n ${NAMESPACE}
kubectl rollout status daemonset/snapshot-agent -n ${NAMESPACE}
kubectl get pods -n ${NAMESPACE} -l app.kubernetes.io/name=snapshot -o wide
```

## Important Values

| Parameter | Meaning | Default |
|-----------|---------|---------|
| `storage.pvc.create` | Create `snapshot-pvc` instead of using an existing PVC | `true` |
| `storage.pvc.name` | PVC name used by the agent and by the operator config | `snapshot-pvc` |
| `storage.pvc.size` | Requested PVC size | `1Ti` |
| `storage.pvc.storageClass` | Storage class name | `""` |
| `storage.pvc.accessMode` | Access mode for the checkpoint PVC | `ReadWriteMany` |
| `storage.pvc.basePath` | Checkpoint root inside the PVC | `/checkpoints` |
| `daemonset.image.repository` | Snapshot agent image repository | `nvcr.io/nvidia/ai-dynamo/snapshot-agent` |
| `daemonset.image.tag` | Snapshot agent image tag | `1.0.0` |
| `daemonset.imagePullSecrets` | Image pull secrets for the agent | `[{name: ngc-secret}]` |

See [values.yaml](./values.yaml) for the complete configuration surface.

## End To End

Once the chart is installed, use the snapshot guide to deploy a snapshot-capable `DynamoGraphDeployment`, wait for the checkpoint to become ready, and then scale the worker to verify restore:

- [Snapshot](../../../../docs/kubernetes/snapshot.md)

## Uninstall

```bash
helm uninstall snapshot -n ${NAMESPACE}
```

The chart does not remove checkpoint data automatically. Delete the PVC yourself if you want to remove stored checkpoints:

```bash
kubectl delete pvc snapshot-pvc -n ${NAMESPACE}
```

## Troubleshooting

If `snapshot-agent` does not schedule:

```bash
kubectl get nodes -l nvidia.com/gpu.present=true
kubectl describe daemonset snapshot-agent -n ${NAMESPACE}
kubectl logs -n ${NAMESPACE} -l app.kubernetes.io/name=snapshot --all-containers
```

If a checkpoint never becomes ready, verify that all three pieces line up:

- the operator has `dynamo-operator.checkpoint.enabled=true`
- the operator PVC name and base path match the snapshot chart values
- the workload uses a snapshot-capable worker image and command