# Dynamo Snapshot Helm Chart

> ⚠️ **Experimental Feature**: Dynamo Snapshot is currently in **beta/preview**. The DaemonSet runs in privileged mode to perform CRIU operations. See [Prerequisites](#prerequisites) for security considerations.

This Helm chart deploys the checkpoint/restore infrastructure for NVIDIA Dynamo, including:
- Persistent Volume Claim (PVC) for checkpoint storage
- DaemonSet running the CRIU checkpoint agent
- RBAC resources (ServiceAccount, Role, RoleBinding)
- Seccomp profile for blocking io_uring syscalls

**Note:**
- Each namespace gets its own isolated checkpoint infrastructure with namespace-scoped RBAC
- **Supports vLLM and SGLang backends** (TensorRT-LLM support planned)

## Prerequisites

⚠️ **Security Warning**: The Dynamo Snapshot DaemonSet runs in **privileged mode** with `hostPID`, `hostIPC`, and `hostNetwork` to perform CRIU checkpoint/restore operations. Workload pods do not need privileged mode. Only deploy in environments where a privileged DaemonSet is acceptable.

- Kubernetes 1.21+
- **x86_64 (amd64) nodes only** for the snapshot agent and placeholder images
- GPU nodes with NVIDIA runtime (`nvidia` runtime class)
- NVIDIA driver 580.xx or newer on the target GPU nodes
- containerd runtime (for container inspection; CRIU is bundled in Dynamo Snapshot images)
- NVIDIA Dynamo operator installed (cluster-wide or namespace-scoped)
- RWX (ReadWriteMany) storage class for multi-node deployments
- **Security clearance for privileged DaemonSet** (the Dynamo Snapshot agent runs privileged with hostPID/hostIPC/hostNetwork)

## Installation

> **Note:** The Dynamo Snapshot Helm chart is not yet published to a public Helm repository. For now, you must build and deploy from source.

### Building from Source

```bash
# Set environment
export NAMESPACE=my-team  # Your target namespace
export DOCKER_SERVER=your-registry.com/  # Your container registry
export IMAGE_TAG=latest

# Build Dynamo Snapshot agent image (amd64 only)
cd deploy/snapshot
docker build --platform linux/amd64 --target agent -t $DOCKER_SERVER/snapshot-agent:$IMAGE_TAG .
docker push $DOCKER_SERVER/snapshot-agent:$IMAGE_TAG
cd -

# Install Dynamo Snapshot chart with custom image
helm install snapshot ./deploy/helm/charts/snapshot/ \
  --namespace ${NAMESPACE} \
  --create-namespace \
  --set daemonset.image.repository=${DOCKER_SERVER}/snapshot-agent \
  --set daemonset.image.tag=${IMAGE_TAG} \
  --set daemonset.imagePullSecrets[0].name=your-registry-secret
```

## Configuration

See `values.yaml` for all configuration options.

### Key Configuration Options

| Parameter | Description | Default |
|-----------|-------------|---------|
| `storage.type` | Storage type: `pvc` (only supported), `s3` and `oci` planned | `pvc` |
| `storage.pvc.create` | Create a new PVC | `true` |
| `storage.pvc.name` | PVC name (must match operator config) | `snapshot-pvc` |
| `storage.pvc.size` | PVC size | `100Gi` |
| `storage.pvc.storageClass` | Storage class name | `""` (default) |
| `daemonset.image.repository` | DaemonSet image repository | `nvcr.io/nvidian/dynamo-dev/snapshot-agent` |
| `daemonset.snapshotLogLevel` | Snapshot agent and nsrestore log level (`trace`, `debug`, `info`, `warn`, `error`) | `info` |
| `daemonset.nodeSelector` | Node selector for GPU nodes | `nvidia.com/gpu.present: "true"` |
| `config.checkpoint.criu.ghostLimit` | CRIU ghost file size limit in bytes | `536870912` (512MB) |
| `config.checkpoint.criu.logLevel` | CRIU logging verbosity (0-4) | `4` |
| `rbac.namespaceRestricted` | Use namespace-scoped RBAC | `true` |

## Usage

After installing this chart, enable checkpointing in your DynamoGraphDeployment:

```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: my-model
  namespace: my-team
spec:
  services:
    worker:
      checkpoint:
        enabled: true
        mode: auto
        identity:
          model: Qwen/Qwen3-0.6B
          backendFramework: vllm
```

## Multi-Namespace Deployment

To enable checkpointing in multiple namespaces, install this chart in each namespace:

```bash
# Namespace A
helm install snapshot nvidia/snapshot -n team-a

# Namespace B
helm install snapshot nvidia/snapshot -n team-b
```

Each namespace will have its own isolated checkpoint storage.

## Verification

```bash
# Check PVC
kubectl get pvc snapshot-pvc -n my-team

# Check DaemonSet
kubectl get daemonset -n my-team

# Check DaemonSet pods are running
kubectl get pods -n my-team -l app.kubernetes.io/name=snapshot
```

## Uninstallation

```bash
helm uninstall snapshot -n my-team
```

**Note:** This will NOT delete the PVC by default. To delete the PVC:

```bash
kubectl delete pvc snapshot-pvc -n my-team
```

## Troubleshooting

### DaemonSet pods not starting

Check if GPU nodes have the correct labels and runtime class:

```bash
kubectl get nodes -l nvidia.com/gpu.present=true
kubectl describe node <node-name> | grep -A 5 "Runtime Class"
```

If nodes don't have the `nvidia.com/gpu.present` label, you can add it:

```bash
kubectl label node <node-name> nvidia.com/gpu.present=true
```

### Checkpoint job fails

Check DaemonSet logs:

```bash
kubectl logs -n my-team -l app.kubernetes.io/name=snapshot
```

### PVC not mounting

Check PVC status and events:

```bash
kubectl describe pvc snapshot-pvc -n my-team
```

Ensure your storage class supports `ReadWriteMany` access mode for multi-node deployments.

## Related Documentation

- [Dynamo Snapshot Overview](../../../../docs/kubernetes/snapshot/README.md) - Dynamo Snapshot architecture and use cases
- [Dynamo Snapshot with Dynamo Platform](../../../../docs/kubernetes/snapshot/dynamo.md) - Integration guide

## License

Apache License 2.0