README.md 5.13 KB
Newer Older
1
2
# Chrek Helm Chart

3
> ⚠️ **Experimental Feature**: ChReK is currently in **beta/preview**. The DaemonSet runs in privileged mode to perform CRIU operations. See [Prerequisites](#prerequisites) for security considerations.
4
5
6
7
8
9
10
11
12

This Helm chart deploys the checkpoint/restore infrastructure for NVIDIA Dynamo, including:
- Persistent Volume Claim (PVC) for checkpoint storage
- DaemonSet running the CRIU checkpoint agent
- RBAC resources (ServiceAccount, Role, RoleBinding)
- Seccomp profile for blocking io_uring syscalls

**Note:**
- Each namespace gets its own isolated checkpoint infrastructure with namespace-scoped RBAC
13
- **Supports vLLM and SGLang backends** (TensorRT-LLM support planned)
14
15
16

## Prerequisites

17
⚠️ **Security Warning**: The ChReK DaemonSet runs in **privileged mode** with `hostPID`, `hostIPC`, and `hostNetwork` to perform CRIU checkpoint/restore operations. Workload pods do not need privileged mode. Only deploy in environments where a privileged DaemonSet is acceptable.
18
19
20

- Kubernetes 1.21+
- GPU nodes with NVIDIA runtime (`nvidia` runtime class)
21
- containerd runtime (for container inspection; CRIU is bundled in ChReK images)
22
- NVIDIA Dynamo operator installed (cluster-wide or namespace-scoped)
23
- RWX (ReadWriteMany) storage class for multi-node deployments
24
- **Security clearance for privileged DaemonSet** (the ChReK agent runs privileged with hostPID/hostIPC/hostNetwork)
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65

## Installation

> **Note:** The ChReK Helm chart is not yet published to a public Helm repository. For now, you must build and deploy from source.

### Building from Source

```bash
# Set environment
export NAMESPACE=my-team  # Your target namespace
export DOCKER_SERVER=your-registry.com/  # Your container registry
export IMAGE_TAG=latest

# Build ChReK agent image
cd deploy/chrek
docker build --target agent -t $DOCKER_SERVER/chrek-agent:$IMAGE_TAG .
docker push $DOCKER_SERVER/chrek-agent:$IMAGE_TAG
cd -

# Install ChReK chart with custom image
helm install chrek ./deploy/helm/charts/chrek/ \
  --namespace ${NAMESPACE} \
  --create-namespace \
  --set daemonset.image.repository=${DOCKER_SERVER}/chrek-agent \
  --set daemonset.image.tag=${IMAGE_TAG} \
  --set daemonset.imagePullSecrets[0].name=your-registry-secret
```

## Configuration

See `values.yaml` for all configuration options.

### Key Configuration Options

| Parameter | Description | Default |
|-----------|-------------|---------|
| `storage.type` | Storage type: `pvc` (only supported), `s3` and `oci` planned | `pvc` |
| `storage.pvc.create` | Create a new PVC | `true` |
| `storage.pvc.name` | PVC name (must match operator config) | `chrek-pvc` |
| `storage.pvc.size` | PVC size | `100Gi` |
| `storage.pvc.storageClass` | Storage class name | `""` (default) |
66
| `daemonset.image.repository` | DaemonSet image repository | `nvcr.io/nvidian/dynamo-dev/chrek-agent` |
67
| `daemonset.nodeSelector` | Node selector for GPU nodes | `nvidia.com/gpu.present: "true"` |
68
69
| `config.checkpoint.criu.ghostLimit` | CRIU ghost file size limit in bytes | `536870912` (512MB) |
| `config.checkpoint.criu.logLevel` | CRIU logging verbosity (0-4) | `4` |
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
| `rbac.namespaceRestricted` | Use namespace-scoped RBAC | `true` |

## Usage

After installing this chart, enable checkpointing in your DynamoGraphDeployment:

```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: my-model
  namespace: my-team
spec:
  services:
    worker:
      checkpoint:
        enabled: true
        mode: auto
        identity:
          model: Qwen/Qwen3-0.6B
          backendFramework: vllm
```

## Multi-Namespace Deployment

To enable checkpointing in multiple namespaces, install this chart in each namespace:

```bash
# Namespace A
helm install chrek nvidia/chrek -n team-a

# Namespace B
helm install chrek nvidia/chrek -n team-b
```

Each namespace will have its own isolated checkpoint storage.

## Verification

```bash
# Check PVC
kubectl get pvc chrek-pvc -n my-team

# Check DaemonSet
kubectl get daemonset -n my-team

# Check DaemonSet pods are running
kubectl get pods -n my-team -l app.kubernetes.io/name=chrek
```

## Uninstallation

```bash
helm uninstall chrek -n my-team
```

**Note:** This will NOT delete the PVC by default. To delete the PVC:

```bash
kubectl delete pvc chrek-pvc -n my-team
```

## Troubleshooting

### DaemonSet pods not starting

Check if GPU nodes have the correct labels and runtime class:

```bash
kubectl get nodes -l nvidia.com/gpu.present=true
kubectl describe node <node-name> | grep -A 5 "Runtime Class"
```

If nodes don't have the `nvidia.com/gpu.present` label, you can add it:

```bash
kubectl label node <node-name> nvidia.com/gpu.present=true
```

### Checkpoint job fails

Check DaemonSet logs:

```bash
kubectl logs -n my-team -l app.kubernetes.io/name=chrek
```

### PVC not mounting

Check PVC status and events:

```bash
kubectl describe pvc chrek-pvc -n my-team
```

Ensure your storage class supports `ReadWriteMany` access mode for multi-node deployments.

## Related Documentation

169
170
- [ChReK Overview](../../../../docs/kubernetes/chrek/README.md) - ChReK architecture and use cases
- [ChReK with Dynamo Platform](../../../../docs/kubernetes/chrek/dynamo.md) - Integration guide
171
172
173
174

## License

Apache License 2.0