# Dynamo Snapshot Helm Chart

> ⚠️ **Experimental Feature**: Dynamo Snapshot is currently in beta/preview. The DaemonSet runs in privileged mode to perform CRIU checkpoint and restore operations.

This chart installs the namespace-scoped checkpoint/restore infrastructure used by Dynamo:

- `snapshot-agent` DaemonSet on GPU nodes
- `snapshot-pvc` checkpoint storage, or wiring to an existing PVC
- namespace-scoped RBAC
- the seccomp profile required by CRIU

Snapshot storage is namespace-local. Install this chart in every namespace where you want checkpoint and restore.

## Prerequisites

- Kubernetes 1.21+
- x86_64 GPU nodes
- NVIDIA driver 580.xx or newer
- containerd runtime
- a cluster where a privileged DaemonSet with `hostPID`, `hostIPC`, and `hostNetwork` is acceptable
- Dynamo Platform already installed, with operator checkpointing enabled

The platform/operator configuration must point at the same checkpoint storage that this chart installs:

```yaml
dynamo-operator:
  checkpoint:
    enabled: true
    storage:
      type: pvc
      pvc:
        pvcName: snapshot-pvc
        basePath: /checkpoints
```

The snapshot-agent no longer reads `basePath` from its ConfigMap, but the operator still uses its configured PVC base path when it annotates checkpoint and restore pods. That path must match `storage.pvc.basePath` here so the mounted checkpoint location is valid inside the agent pod.
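
The alignment described above can be sketched from the chart side. This is a minimal values fragment, assuming the documented chart defaults; the comments name the operator keys from the platform configuration shown earlier that each value has to match.

```yaml
# Snapshot chart values (sketch; values shown are the documented defaults)
storage:
  pvc:
    name: snapshot-pvc       # same as dynamo-operator.checkpoint.storage.pvc.pvcName
    basePath: /checkpoints   # same as dynamo-operator.checkpoint.storage.pvc.basePath
```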

Cross-node restore requires a shared `ReadWriteMany` storage class. The chart defaults to `storage.pvc.accessMode=ReadWriteMany`.

For better restore times, use a fast `ReadWriteMany` StorageClass for the checkpoint PVC.

## Minimal Install

This is the smallest Helm install that creates the checkpoint PVC and the DaemonSet:

```bash
helm upgrade --install snapshot ./deploy/helm/charts/snapshot \
  --namespace ${NAMESPACE} \
  --create-namespace \
  --set storage.pvc.create=true
```

If your cluster does not use a default storage class, also set `storage.pvc.storageClass`.

Keep `storage.pvc.accessMode=ReadWriteMany` for this chart layout. The DaemonSet mounts the same PVC on each eligible node, so a shared `ReadWriteOnce` claim only works when the agent runs on one node.
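
The storage settings can also be kept in a values file rather than `--set` flags. The fragment below is a sketch only: the file name `snapshot-values.yaml` and the storage class name `my-rwx-storage-class` are hypothetical placeholders, so substitute a `ReadWriteMany`-capable class that exists in your cluster.

```yaml
# snapshot-values.yaml (hypothetical file name)
storage:
  pvc:
    create: true
    storageClass: my-rwx-storage-class  # placeholder, not a real class name
    accessMode: ReadWriteMany           # keeps the PVC mountable from every agent node
```

Pass the file to the install command with Helm's `-f snapshot-values.yaml` option in place of the individual `--set` flags.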

If you already have a PVC, keep the chart in "use existing PVC" mode by setting `storage.pvc.create=false` and pointing `storage.pvc.name` at it. Do not set `storage.pvc.create=true` when reusing an existing checkpoint PVC:

```bash
helm upgrade --install snapshot ./deploy/helm/charts/snapshot \
  --namespace ${NAMESPACE} \
  --create-namespace \
  --set storage.pvc.create=false \
  --set storage.pvc.name=my-snapshot-pvc
```
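
Because `storage.pvc.name` is used by both the agent and the operator config, the platform values would need to name the same PVC. A sketch of the operator side for this case, assuming the key layout from the Prerequisites example and the chart's default base path:

```yaml
# Platform values (operator side), matching storage.pvc.name=my-snapshot-pvc
dynamo-operator:
  checkpoint:
    enabled: true
    storage:
      type: pvc
      pvc:
        pvcName: my-snapshot-pvc
        basePath: /checkpoints   # assumes the chart default storage.pvc.basePath
```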

## Verify

```bash
kubectl get pvc snapshot-pvc -n ${NAMESPACE}
kubectl rollout status daemonset/snapshot-agent -n ${NAMESPACE}
kubectl get pods -n ${NAMESPACE} -l app.kubernetes.io/name=snapshot -o wide
```

## Important Values

| Parameter | Meaning | Default |
|-----------|---------|---------|
| `storage.pvc.create` | Create `snapshot-pvc` instead of using an existing PVC | `true` |
| `storage.pvc.name` | PVC name used by the agent and by the operator config | `snapshot-pvc` |
| `storage.pvc.size` | Requested PVC size | `1Ti` |
| `storage.pvc.storageClass` | Storage class name | `""` |
| `storage.pvc.accessMode` | Access mode for the checkpoint PVC | `ReadWriteMany` |
| `storage.pvc.basePath` | PVC mount path inside the snapshot-agent pod | `/checkpoints` |
| `daemonset.image.repository` | Snapshot agent image repository | `nvcr.io/nvidia/ai-dynamo/snapshot-agent` |
| `daemonset.image.tag` | Snapshot agent image tag | `1.0.0` |
| `daemonset.imagePullSecrets` | Image pull secrets for the agent | `[{name: ngc-secret}]` |

See [values.yaml](./values.yaml) for the complete configuration surface.
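
As an illustration of overriding these values, the fragment below points the agent at a mirrored image. The registry host and the secret name are hypothetical placeholders; only the key names come from the table above.

```yaml
daemonset:
  image:
    repository: registry.example.com/ai-dynamo/snapshot-agent  # hypothetical mirror
    tag: "1.0.0"
  imagePullSecrets:
    - name: my-registry-secret  # hypothetical secret in the install namespace
```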

## End To End

Once the chart is installed, use the snapshot guide to deploy a snapshot-capable `DynamoGraphDeployment`, wait for the checkpoint to become ready, and then scale the worker to verify restore:

- [Snapshot](../../../../docs/kubernetes/snapshot.md)

## Uninstall

```bash
helm uninstall snapshot -n ${NAMESPACE}
```

The chart does not remove checkpoint data automatically. Delete the PVC yourself if you want to remove stored checkpoints:

```bash
kubectl delete pvc snapshot-pvc -n ${NAMESPACE}
```

## Troubleshooting

If `snapshot-agent` does not schedule:

```bash
kubectl get nodes -l nvidia.com/gpu.present=true
kubectl describe daemonset snapshot-agent -n ${NAMESPACE}
kubectl logs -n ${NAMESPACE} -l app.kubernetes.io/name=snapshot --all-containers
```

If checkpoint creation never becomes ready, verify all three pieces line up:

- the operator has `dynamo-operator.checkpoint.enabled=true`
- the operator PVC name and base path match the snapshot chart values
- the workload uses a snapshot-capable worker image and command