# Dynamo Snapshot Helm Chart

> ⚠️ **Experimental Feature**: Dynamo Snapshot is currently in **beta/preview**. The DaemonSet runs in privileged mode to perform CRIU operations. See [Prerequisites](#prerequisites) for security considerations.

This Helm chart deploys the checkpoint/restore infrastructure for NVIDIA Dynamo, including:
- Persistent Volume Claim (PVC) for checkpoint storage
- DaemonSet running the CRIU checkpoint agent
- RBAC resources (ServiceAccount, Role, RoleBinding)
- Seccomp profile for blocking io_uring syscalls

**Note:**
- Each namespace gets its own isolated checkpoint infrastructure with namespace-scoped RBAC
- **Supports vLLM and SGLang backends** (TensorRT-LLM support planned)

## Prerequisites

⚠️ **Security Warning**: The Dynamo Snapshot DaemonSet runs in **privileged mode** with `hostPID`, `hostIPC`, and `hostNetwork` to perform CRIU checkpoint/restore operations. Workload pods do not need privileged mode. Only deploy in environments where a privileged DaemonSet is acceptable.

- Kubernetes 1.21+
- **x86_64 (amd64) nodes only** for the snapshot agent and placeholder images
- GPU nodes with NVIDIA runtime (`nvidia` runtime class)
- NVIDIA driver 580.xx or newer on the target GPU nodes
- containerd runtime (for container inspection; CRIU is bundled in Dynamo Snapshot images)
- NVIDIA Dynamo operator installed (cluster-wide or namespace-scoped)
- RWX (ReadWriteMany) storage class for multi-node deployments
- **Security clearance for privileged DaemonSet** (the Dynamo Snapshot agent runs privileged with hostPID/hostIPC/hostNetwork)
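
The architecture requirement can be checked before installing. The sketch below assumes GPU nodes carry the `nvidia.com/gpu.present=true` label (the chart's default node selector) and reads the architecture each node reports. The function names `gpu_node_archs` and `non_amd64_nodes` are illustrative and not part of the chart.

```bash
# Print the names of nodes whose reported architecture is not amd64.
# Input: lines of "<node-name> <architecture>" on stdin.
non_amd64_nodes() {
  awk 'NF >= 2 && $2 != "amd64" { print $1 }'
}

# List GPU nodes as "<name><TAB><arch>" pairs, one per line.
# Assumes the nvidia.com/gpu.present=true label is set on GPU nodes.
gpu_node_archs() {
  kubectl get nodes -l nvidia.com/gpu.present=true \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.nodeInfo.architecture}{"\n"}{end}'
}

# Usage (requires cluster access); no output means every GPU node is amd64:
#   gpu_node_archs | non_amd64_nodes
```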

## Installation

> **Note:** The Dynamo Snapshot Helm chart is not yet published to a public Helm repository. For now, you must build and deploy from source.

### Building from Source

```bash
# Set environment
export NAMESPACE=my-team  # Your target namespace
export DOCKER_SERVER=your-registry.com  # Your container registry (no trailing slash)
export IMAGE_TAG=latest

# Build and push the Dynamo Snapshot agent image (amd64 only)
cd deploy/snapshot
docker build --platform linux/amd64 --target agent -t "$DOCKER_SERVER/snapshot-agent:$IMAGE_TAG" .
docker push "$DOCKER_SERVER/snapshot-agent:$IMAGE_TAG"
cd -

# Install the Dynamo Snapshot chart with the custom image
helm install snapshot ./deploy/helm/charts/snapshot/ \
  --namespace "${NAMESPACE}" \
  --create-namespace \
  --set daemonset.image.repository="${DOCKER_SERVER}/snapshot-agent" \
  --set daemonset.image.tag="${IMAGE_TAG}" \
  --set 'daemonset.imagePullSecrets[0].name=your-registry-secret'
```
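
The same overrides can be kept in a values file instead of repeated `--set` flags. This is a minimal sketch that assumes the key paths used in the command above and in the Key Configuration Options table. The file name `my-values.yaml`, the registry, and the storage class `my-rwx-class` are placeholders.

```bash
# Write a values file with the overrides (all values are placeholders).
cat > my-values.yaml <<'EOF'
daemonset:
  image:
    repository: your-registry.com/snapshot-agent
    tag: latest
  imagePullSecrets:
    - name: your-registry-secret
storage:
  pvc:
    size: 100Gi
    storageClass: my-rwx-class
EOF

# Then install with the file instead of --set flags (requires cluster access):
#   helm install snapshot ./deploy/helm/charts/snapshot/ \
#     --namespace my-team --create-namespace -f my-values.yaml
```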

## Configuration

See `values.yaml` for all configuration options.

### Key Configuration Options

| Parameter | Description | Default |
|-----------|-------------|---------|
| `storage.type` | Storage type. Only `pvc` is supported; `s3` and `oci` are planned | `pvc` |
| `storage.pvc.create` | Create a new PVC | `true` |
| `storage.pvc.name` | PVC name (must match operator config) | `snapshot-pvc` |
| `storage.pvc.size` | PVC size | `100Gi` |
| `storage.pvc.storageClass` | Storage class name | `""` (default) |
| `daemonset.image.repository` | DaemonSet image repository | `nvcr.io/nvidian/dynamo-dev/snapshot-agent` |
| `daemonset.snapshotLogLevel` | Snapshot agent and nsrestore log level (`trace`, `debug`, `info`, `warn`, `error`) | `info` |
| `daemonset.nodeSelector` | Node selector for GPU nodes | `nvidia.com/gpu.present: "true"` |
| `config.checkpoint.criu.ghostLimit` | CRIU ghost file size limit in bytes | `536870912` (512 MiB) |
| `config.checkpoint.criu.logLevel` | CRIU logging verbosity (0-4) | `4` |
| `rbac.namespaceRestricted` | Use namespace-scoped RBAC | `true` |
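
`config.checkpoint.criu.ghostLimit` is given in bytes, so it is easy to get wrong by a factor of 1024. As a sketch, the helper below (a hypothetical name, not part of the chart) converts MiB to bytes so the result can be passed with `--set`:

```bash
# Convert a size in MiB to bytes using shell integer arithmetic.
mib_to_bytes() {
  echo $(( $1 * 1024 * 1024 ))
}

# The chart default of 512 MiB:
mib_to_bytes 512    # prints 536870912

# Example: raise the limit to 1 GiB on upgrade (requires cluster access):
#   helm upgrade snapshot ./deploy/helm/charts/snapshot/ -n my-team --reuse-values \
#     --set config.checkpoint.criu.ghostLimit="$(mib_to_bytes 1024)"
```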

## Usage

After installing this chart, enable checkpointing in your DynamoGraphDeployment:

```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: my-model
  namespace: my-team
spec:
  services:
    worker:
      checkpoint:
        enabled: true
        mode: auto
        identity:
          model: Qwen/Qwen3-0.6B
          backendFramework: vllm
```

## Multi-Namespace Deployment

To enable checkpointing in multiple namespaces, install this chart in each namespace:

```bash
# Namespace A (chart installed from the local source tree)
helm install snapshot ./deploy/helm/charts/snapshot/ -n team-a

# Namespace B
helm install snapshot ./deploy/helm/charts/snapshot/ -n team-b
```

Each namespace will have its own isolated checkpoint storage.
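
With more than a couple of namespaces, a loop keeps the installs consistent. This is a sketch that assumes the chart is installed from the local source path used under Building from Source and that the repository root is the working directory. `install_snapshot` is an illustrative helper name.

```bash
# Path to the chart, relative to the repository root (assumption).
CHART_PATH=./deploy/helm/charts/snapshot/

# Install the chart into each namespace given as an argument.
# Stops at the first failed install and returns non-zero.
install_snapshot() {
  local ns
  for ns in "$@"; do
    helm install snapshot "$CHART_PATH" --namespace "$ns" --create-namespace || return 1
  done
}

# Usage (requires helm and cluster access):
#   install_snapshot team-a team-b
```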

## Verification

```bash
# Check PVC
kubectl get pvc snapshot-pvc -n my-team

# Check DaemonSet
kubectl get daemonset -n my-team

# Check DaemonSet pods are running
kubectl get pods -n my-team -l app.kubernetes.io/name=snapshot
```
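
For scripted checks, `kubectl wait` can block until the agent pods are ready instead of polling by hand. This is a sketch that assumes the pods carry the `app.kubernetes.io/name=snapshot` label used above. The helper name and the 120-second default timeout are arbitrary choices.

```bash
# Block until all snapshot agent pods in a namespace report Ready, or time out.
# $1: namespace, $2: optional timeout (default 120s)
wait_for_snapshot_agent() {
  local ns="$1" timeout="${2:-120s}"
  kubectl wait --for=condition=Ready pod \
    -l app.kubernetes.io/name=snapshot \
    -n "$ns" --timeout="$timeout"
}

# Usage (requires cluster access):
#   wait_for_snapshot_agent my-team 300s
```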

## Uninstallation

```bash
helm uninstall snapshot -n my-team
```

**Note:** This will NOT delete the PVC by default. To delete the PVC:

```bash
kubectl delete pvc snapshot-pvc -n my-team
```

## Troubleshooting

### DaemonSet pods not starting

Check if GPU nodes have the correct labels and runtime class:

```bash
kubectl get nodes -l nvidia.com/gpu.present=true
kubectl get runtimeclass nvidia
```

If nodes don't have the `nvidia.com/gpu.present` label, you can add it:

```bash
kubectl label node <node-name> nvidia.com/gpu.present=true
```

### Checkpoint job fails

Check DaemonSet logs:

```bash
kubectl logs -n my-team -l app.kubernetes.io/name=snapshot
```

### PVC not mounting

Check PVC status and events:

```bash
kubectl describe pvc snapshot-pvc -n my-team
```

Ensure your storage class supports `ReadWriteMany` access mode for multi-node deployments.
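
The access modes on the claim can also be read directly. This is a sketch using `kubectl`'s JSONPath output. `has_rwx` and `pvc_access_modes` are illustrative helpers, and `snapshot-pvc` is the chart's default PVC name.

```bash
# Succeed if the given text (a PVC's accessModes) includes ReadWriteMany.
has_rwx() {
  case "$1" in
    *ReadWriteMany*) return 0 ;;
    *) return 1 ;;
  esac
}

# Print the access modes of a PVC, e.g. ["ReadWriteMany"].
# $1: PVC name, $2: namespace
pvc_access_modes() {
  kubectl get pvc "$1" -n "$2" -o jsonpath='{.spec.accessModes}'
}

# Usage (requires cluster access):
#   has_rwx "$(pvc_access_modes snapshot-pvc my-team)" || echo "PVC is not ReadWriteMany"
```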

## Related Documentation

- [Dynamo Snapshot Overview](../../../../docs/kubernetes/snapshot/README.md) - Dynamo Snapshot architecture and use cases
- [Dynamo Snapshot with Dynamo Platform](../../../../docs/kubernetes/snapshot/dynamo.md) - Integration guide

## License

Apache License 2.0