# ChReK Standalone Usage Guide

> ⚠️ **Experimental Feature**: ChReK is currently in **beta/preview**. It requires privileged mode for restore operations, which may not be suitable for all production environments. Review the [security implications](#security-considerations) before deploying.

This guide explains how to use **ChReK** (Checkpoint/Restore for Kubernetes) as a standalone component without deploying the full Dynamo platform. This is useful if you want to add checkpoint/restore capabilities to your own GPU workloads.

## Table of Contents

- [Overview](#overview)
- [Prerequisites](#prerequisites)
- [Step 1: Deploy ChReK](#step-1-deploy-chrek)
- [Step 2: Build Checkpoint-Enabled Images](#step-2-build-checkpoint-enabled-images)
- [Step 3: Create Checkpoint Jobs](#step-3-create-checkpoint-jobs)
- [Step 4: Restore from Checkpoints](#step-4-restore-from-checkpoints)
- [Environment Variables Reference](#environment-variables-reference)
- [Checkpoint Flow Explained](#checkpoint-flow-explained)
- [Troubleshooting](#troubleshooting)
- [Additional Resources](#additional-resources)
- [Getting Help](#getting-help)

---

## Overview

When using ChReK standalone, you are responsible for:

1. **Deploying the ChReK Helm chart** (DaemonSet + PVC)
2. **Building checkpoint-enabled container images** with the restore entrypoint
3. **Creating checkpoint jobs** with the correct environment variables
4. **Creating restore pods** that detect and use the checkpoints

The ChReK DaemonSet handles the actual CRIU checkpoint/restore operations automatically once your pods are configured correctly.

---

## Prerequisites

- Kubernetes cluster with:
  - NVIDIA GPUs with checkpoint support
  - **Privileged security context allowed** (⚠️ required for CRIU - see [Security Considerations](#security-considerations))
  - PVC storage (ReadWriteMany recommended for multi-node)
- Docker or compatible container runtime for building images
- Access to the ChReK source code: `deploy/chrek/`

### Security Considerations

⚠️ **Important**: ChReK restore operations **require privileged mode**, which has significant security implications:

- **Privileged containers** can access all host devices and bypass most security restrictions
- This may violate security policies in production environments
- Privileged containers, if compromised, can potentially compromise node security

**Recommended for:**
- ✅ Development and testing environments
- ✅ Research and experimentation
- ✅ Controlled production environments with appropriate security controls

**Not recommended for:**
- ❌ Multi-tenant clusters without proper isolation
- ❌ Security-sensitive production workloads without risk assessment
- ❌ Environments with strict security compliance requirements

### Technical Limitations

⚠️ **Current Restrictions:**
- **vLLM backend only**: Currently only the vLLM backend supports checkpoint/restore. SGLang and TensorRT-LLM support is planned.
- **Single-node only**: Checkpoints must be created and restored on the same node
- **Single-GPU only**: Multi-GPU configurations are not yet supported
- **Network state**: Active TCP connections are closed during restore
- **Storage**: Only PVC backend currently implemented (S3/OCI planned)

---

## Step 1: Deploy ChReK

### Install the Helm Chart

```bash
# Clone the repository
git clone https://github.com/ai-dynamo/dynamo.git
cd dynamo

# Install ChReK in your namespace
helm install chrek ./deploy/helm/charts/chrek \
  --namespace my-app \
  --create-namespace \
  --set storage.pvc.size=100Gi \
  --set storage.pvc.storageClass=your-storage-class
```

### Verify Installation

```bash
# Check the DaemonSet is running
kubectl get daemonset -n my-app
# NAME          DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE
# chrek-agent   3         3         3       3            3

# Check the PVC is bound
kubectl get pvc -n my-app
# NAME        STATUS   VOLUME     CAPACITY   ACCESS MODES   STORAGECLASS
# chrek-pvc   Bound    pvc-xyz    100Gi      RWX            your-storage-class
```

---

## Step 2: Build Checkpoint-Enabled Images

ChReK provides a convenient `placeholder` target in its Dockerfile that automatically injects checkpoint/restore capabilities into your existing container images.

### Quick Start: Using the Placeholder Target (Recommended)

```bash
cd deploy/chrek

# Define your images
export BASE_IMAGE="your-app:latest"           # Your existing application image
export RESTORE_IMAGE="your-app:checkpoint-enabled"  # Output checkpoint-enabled image

# Build using the placeholder target
docker build \
  --target placeholder \
  --build-arg BASE_IMAGE="$BASE_IMAGE" \
  -t "$RESTORE_IMAGE" \
  .

# Push to your registry
docker push "$RESTORE_IMAGE"
```

**Example with a Dynamo vLLM image:**

```bash
cd deploy/chrek

export DYNAMO_IMAGE="nvidia/dynamo-vllm:v1.2.0"
export RESTORE_IMAGE="nvidia/dynamo-vllm:v1.2.0-checkpoint"

docker build \
  --target placeholder \
  --build-arg BASE_IMAGE="$DYNAMO_IMAGE" \
  -t "$RESTORE_IMAGE" \
  .
```

### What the Placeholder Target Does

The ChReK Dockerfile's `placeholder` stage automatically:

- ✅ Builds the restore-entrypoint binary
- ✅ Injects it into `/usr/local/bin/restore-entrypoint`
- ✅ Adds `smart-entrypoint.sh` to `/usr/local/bin/`
- ✅ Sets executable permissions
- ✅ Configures the entrypoint to detect and restore checkpoints
- ✅ Preserves your original application CMD

### Alternative: Manual Multi-Stage Build

If you need more control, you can create your own Dockerfile:

```dockerfile
# Stage 1: Build restore-entrypoint
FROM golang:1.23-alpine AS restore-builder
WORKDIR /build
COPY deploy/chrek/cmd/restore-entrypoint ./cmd/restore-entrypoint
COPY deploy/chrek/pkg ./pkg
COPY deploy/chrek/go.mod deploy/chrek/go.sum ./

RUN go build -o /restore-entrypoint ./cmd/restore-entrypoint

# Stage 2: Your application image
FROM your-base-image:latest

# Copy restore-entrypoint
COPY --from=restore-builder /restore-entrypoint /usr/local/bin/restore-entrypoint

# Copy smart-entrypoint.sh
COPY deploy/chrek/scripts/smart-entrypoint.sh /usr/local/bin/smart-entrypoint.sh
RUN chmod +x /usr/local/bin/smart-entrypoint.sh /usr/local/bin/restore-entrypoint

# Set smart-entrypoint as the default entrypoint
ENTRYPOINT ["/usr/local/bin/smart-entrypoint.sh"]

# Your application command (becomes CMD, can be overridden)
CMD ["python", "your_app.py"]
```

> **💡 Tip**: Using the `placeholder` target is the recommended approach as it's maintained with the ChReK codebase and ensures compatibility.

---

## Step 3: Create Checkpoint Jobs

A checkpoint job loads your application, waits for the ChReK DaemonSet to checkpoint it, and then exits.

### Required Environment Variables

Your checkpoint job MUST set these environment variables:

| Variable | Description | Example |
|----------|-------------|---------|
| `DYN_CHECKPOINT_SIGNAL_FILE` | Path where DaemonSet writes completion signal | `/checkpoint-signal/my-checkpoint.done` |
| `DYN_CHECKPOINT_READY_FILE` | Path where your app signals it's ready | `/tmp/checkpoint-ready` |
| `DYN_CHECKPOINT_HASH` | Unique identifier for this checkpoint | `abc123def456` |
| `DYN_CHECKPOINT_LOCATION` | Directory where checkpoint is stored | `/checkpoints/abc123def456` |
| `DYN_CHECKPOINT_STORAGE_TYPE` | Storage backend type | `pvc` |

### Required Labels

Add this label to enable DaemonSet checkpoint detection:

```yaml
labels:
  nvidia.com/checkpoint-source: "true"
```

### Example Checkpoint Job

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: checkpoint-my-model
  namespace: my-app
spec:
  template:
    metadata:
      labels:
        nvidia.com/checkpoint-source: "true"  # Required for DaemonSet detection
    spec:
      restartPolicy: Never

      # Init container to clean up stale signal files
      initContainers:
      - name: cleanup-signal-file
        image: busybox:latest
        command:
        - sh
        - -c
        - |
          rm -f /checkpoint-signal/my-checkpoint.done || true
          echo "Signal file cleanup complete"
        volumeMounts:
        - name: checkpoint-signal
          mountPath: /checkpoint-signal

      containers:
      - name: main
        image: my-app:checkpoint-enabled

        # Security context required for CRIU
        securityContext:
          privileged: true
          capabilities:
            add: ["SYS_ADMIN", "SYS_PTRACE", "SYS_CHROOT"]

        # Readiness probe: Pod becomes Ready when model is loaded
        # This is what triggers the DaemonSet to start checkpointing
        readinessProbe:
          exec:
            command: ["sh", "-c", "cat ${DYN_CHECKPOINT_READY_FILE}"]
          initialDelaySeconds: 15
          periodSeconds: 2

        # Remove liveness/startup probes for checkpoint jobs
        # Model loading can take several minutes
        livenessProbe: null
        startupProbe: null

        # Checkpoint-related environment variables
        env:
        - name: DYN_CHECKPOINT_SIGNAL_FILE
          value: "/checkpoint-signal/my-checkpoint.done"
        - name: DYN_CHECKPOINT_READY_FILE
          value: "/tmp/checkpoint-ready"
        - name: DYN_CHECKPOINT_HASH
          value: "abc123def456"
        - name: DYN_CHECKPOINT_LOCATION
          value: "/checkpoints/abc123def456"
        - name: DYN_CHECKPOINT_STORAGE_TYPE
          value: "pvc"

        # GPU request
        resources:
          limits:
            nvidia.com/gpu: 1

        # Required volume mounts
        volumeMounts:
        - name: checkpoint-storage
          mountPath: /checkpoints
        - name: checkpoint-signal
          mountPath: /checkpoint-signal
        - name: tmp
          mountPath: /tmp

      volumes:
      - name: checkpoint-storage
        persistentVolumeClaim:
          claimName: chrek-pvc
      - name: checkpoint-signal
        hostPath:
          path: /var/lib/chrek/signals
          type: DirectoryOrCreate
      - name: tmp
        emptyDir: {}
```

### Application Code Requirements

Your application must implement the checkpoint flow. Here's the pattern used by Dynamo vLLM:

```python
import os
import time

def main():
    # 1. Check for checkpoint mode
    signal_file = os.environ.get("DYN_CHECKPOINT_SIGNAL_FILE")
    ready_file = os.environ.get("DYN_CHECKPOINT_READY_FILE")
    restore_marker = os.environ.get("DYN_RESTORE_MARKER_FILE", "/tmp/dynamo-restored")

    is_checkpoint_mode = signal_file is not None

    if is_checkpoint_mode:
        print("Checkpoint mode detected")

        # 2. Load your model/application
        model = load_model()

        # 3. Optional: Put model to sleep to reduce memory footprint
        # model.sleep()

        # 4. Write ready file (checked by the pod's readiness probe)
        if ready_file:
            with open(ready_file, "w") as f:
                f.write("ready")
            print(f"Wrote checkpoint ready file: {ready_file}")

        # 5. Log readiness messages (helps debugging)
        print("CHECKPOINT_READY: Model loaded, ready for container checkpoint")
        print(f"CHECKPOINT_READY: Waiting for signal file: {signal_file}")
        print(f"CHECKPOINT_READY: Or restore marker file: {restore_marker}")

        # 6. Wait for checkpoint completion OR restore detection
        while True:
            # Check if we've been restored (marker file created by restore entrypoint)
            if os.path.exists(restore_marker):
                print(f"Detected restore from checkpoint (marker: {restore_marker})")
                # Continue with normal application flow
                break

            # Check if checkpoint is complete (signal file created by DaemonSet)
            if os.path.exists(signal_file):
                print(f"Checkpoint signal file detected: {signal_file}")
                print("Checkpoint complete, exiting")
                return  # Exit gracefully

            time.sleep(1)

    # Normal application flow (or post-restore flow)
    run_application()
```

**Important Notes:**

1. **Ready File & Readiness Probe**: The checkpoint job must have a readiness probe that checks for the ready file:
   ```yaml
   readinessProbe:
     exec:
       command: ["sh", "-c", "cat ${DYN_CHECKPOINT_READY_FILE}"]
     initialDelaySeconds: 15
     periodSeconds: 2
   ```
   The ChReK DaemonSet triggers checkpointing when:
   - Pod has `nvidia.com/checkpoint-source: "true"` label
   - Pod status is `Ready` (readiness probe passes = ready file exists)

2. **Restore Marker**: Created by `restore-entrypoint` before the CRIU restore so that the restored process can detect that it was restored.

3. **Two Exit Paths**:
   - **Signal file found**: Checkpoint complete, exit gracefully
   - **Restore marker found**: Process was restored, continue running


---

## Step 4: Restore from Checkpoints

Restore pods automatically detect and restore from checkpoints if they exist.

### Example Restore Pod

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-app-restored
  namespace: my-app
spec:
  restartPolicy: Never

  containers:
  - name: main
    image: my-app:checkpoint-enabled

    # Security context required for CRIU restore
    securityContext:
      privileged: true
      capabilities:
        add: ["SYS_ADMIN", "SYS_PTRACE", "SYS_CHROOT"]

    # Set checkpoint environment variables
    env:
    - name: DYN_CHECKPOINT_HASH
      value: "abc123def456"  # Must match checkpoint job
    - name: DYN_CHECKPOINT_PATH
      value: "/checkpoints"  # Base path (hash appended automatically)

    # Optional: Customize restore marker file path
    # - name: DYN_RESTORE_MARKER_FILE
    #   value: "/tmp/dynamo-restored"

    # GPU request
    resources:
      limits:
        nvidia.com/gpu: 1

    # Mount checkpoint storage (READ-ONLY is fine for restore)
    volumeMounts:
    - name: checkpoint-storage
      mountPath: /checkpoints
      readOnly: true
    - name: checkpoint-signal
      mountPath: /checkpoint-signal

  volumes:
  - name: checkpoint-storage
    persistentVolumeClaim:
      claimName: chrek-pvc
  - name: checkpoint-signal
    hostPath:
      path: /var/lib/chrek/signals
      type: DirectoryOrCreate
```

### How Restore Works

1. **Smart Entrypoint Detects Checkpoint**: The `smart-entrypoint.sh` checks if a checkpoint exists at `/checkpoints/${DYN_CHECKPOINT_HASH}/`
2. **Calls Restore Entrypoint**: If found, calls `/usr/local/bin/restore-entrypoint` which invokes CRIU
3. **CRIU Restores Process**: The entire process tree is restored from the checkpoint, including GPU state
4. **Application Continues**: Your application resumes exactly where it was checkpointed
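
The decision made in step 1 can be pictured with the sketch below. It is an illustration in Python only: the real `smart-entrypoint.sh` is a shell script and may differ in detail. The name `choose_startup_mode` is hypothetical, and treating an unset hash or base path as a cold start is an assumption rather than documented behavior.

```python
import os


def choose_startup_mode(env, dir_exists=os.path.isdir):
    """Return ("restore", checkpoint_dir) or ("cold-start", None).

    Hypothetical helper that mirrors the documented behaviour, not the script.
    """
    checkpoint_hash = env.get("DYN_CHECKPOINT_HASH")
    base_path = env.get("DYN_CHECKPOINT_PATH")
    # Assumption: without both variables there is nothing to look up.
    if not checkpoint_hash or not base_path:
        return ("cold-start", None)
    # The hash is appended to the base path, e.g. /checkpoints/abc123def456
    checkpoint_dir = f"{base_path.rstrip('/')}/{checkpoint_hash}"
    if dir_exists(checkpoint_dir):
        return ("restore", checkpoint_dir)
    return ("cold-start", None)
```

In the restore case the entrypoint would hand `checkpoint_dir` to `restore-entrypoint`; in the cold-start case it would run the image's original command.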

---

## Environment Variables Reference

### Checkpoint Jobs

| Variable | Required | Description |
|----------|----------|-------------|
| `DYN_CHECKPOINT_SIGNAL_FILE` | Yes | Full path to signal file (e.g., `/checkpoint-signal/my-checkpoint.done`) |
| `DYN_CHECKPOINT_READY_FILE` | Yes | Full path where app signals readiness (e.g., `/tmp/checkpoint-ready`) |
| `DYN_CHECKPOINT_HASH` | Yes | Unique checkpoint identifier (alphanumeric string) |
| `DYN_CHECKPOINT_LOCATION` | Yes | Directory where checkpoint is stored (e.g., `/checkpoints/abc123`) |
| `DYN_CHECKPOINT_STORAGE_TYPE` | Yes | Storage backend. Only `pvc` is currently implemented; `s3` and `oci` are planned |

### Restore Pods

| Variable | Required | Description |
|----------|----------|-------------|
| `DYN_CHECKPOINT_HASH` | Yes | Checkpoint identifier (must match checkpoint job) |
| `DYN_CHECKPOINT_PATH` | Yes | Base checkpoint directory (hash appended automatically) |
| `DYN_RESTORE_MARKER_FILE` | No | Path for restore marker file (default: `/tmp/dynamo-restored`) |
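
Since the hash must match the checkpoint job and is appended to `DYN_CHECKPOINT_PATH`, a quick pre-flight comparison of the two environments can catch mismatches before a pod is scheduled. The sketch below assumes, as in the examples in this guide, that the restore pod's base path plus hash should equal the job's `DYN_CHECKPOINT_LOCATION`. `restore_env_problems` is a hypothetical helper, not part of ChReK.

```python
def restore_env_problems(job_env, restore_env):
    """Return a list of mismatches; an empty list means the two agree."""
    problems = []
    job_hash = job_env.get("DYN_CHECKPOINT_HASH")
    restore_hash = restore_env.get("DYN_CHECKPOINT_HASH")
    if not restore_hash:
        problems.append("DYN_CHECKPOINT_HASH is not set on the restore pod")
    elif restore_hash != job_hash:
        problems.append(f"hash mismatch: job={job_hash!r} restore={restore_hash!r}")

    base = restore_env.get("DYN_CHECKPOINT_PATH")
    if not base:
        problems.append("DYN_CHECKPOINT_PATH is not set on the restore pod")
    elif restore_hash:
        # Assumption: the restore side looks in <base>/<hash>.
        expected = f"{base.rstrip('/')}/{restore_hash}"
        location = job_env.get("DYN_CHECKPOINT_LOCATION")
        if location is not None and location.rstrip("/") != expected:
            problems.append(
                f"job wrote to {location!r} but restore will look in {expected!r}"
            )
    return problems
```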

### Optional CRIU Tuning (Advanced)

| Variable | Default | Description |
|----------|---------|-------------|
| `CRIU_TIMEOUT` | `0` (unlimited) | CRIU operation timeout in seconds |
| `CRIU_LOG_LEVEL` | `4` | CRIU log verbosity (0-4) |
| `CRIU_WORK_DIR` | `/tmp` | CRIU working directory |
| `CUDA_PLUGIN_DIR` | `/usr/local/lib/criu` | Path to CRIU CUDA plugin |
| `CRIU_SKIP_IN_FLIGHT` | `false` | Skip in-flight TCP connections |
| `CRIU_AUTO_DEDUP` | `false` | Enable auto-deduplication |
| `CRIU_LAZY_PAGES` | `false` | Enable lazy page migration (experimental) |
| `WAIT_FOR_CHECKPOINT` | `false` | Wait for checkpoint to appear before starting |
| `RESTORE_WAIT_TIMEOUT` | `300` | Max seconds to wait for checkpoint |
| `DEBUG` | `false` | Enable debug mode (sleeps 300s on error) |
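
To make the defaults concrete, here is a small sketch of how a subset of these variables could be read with the defaults from the table. It is not the code ChReK uses: `CriuTuning` and `load_criu_tuning` are hypothetical names, and the set of strings accepted as "true" is an assumption.

```python
import os
from dataclasses import dataclass


def _as_bool(value):
    # Assumption: these spellings count as true; everything else is false.
    return str(value).strip().lower() in ("1", "true", "yes")


@dataclass
class CriuTuning:
    """Defaults copied from the table above."""
    timeout: int = 0            # CRIU_TIMEOUT, 0 means unlimited
    log_level: int = 4          # CRIU_LOG_LEVEL
    work_dir: str = "/tmp"      # CRIU_WORK_DIR
    cuda_plugin_dir: str = "/usr/local/lib/criu"  # CUDA_PLUGIN_DIR
    skip_in_flight: bool = False  # CRIU_SKIP_IN_FLIGHT
    auto_dedup: bool = False      # CRIU_AUTO_DEDUP
    lazy_pages: bool = False      # CRIU_LAZY_PAGES


def load_criu_tuning(env=None):
    """Build a CriuTuning from `env`, falling back to the documented defaults."""
    env = os.environ if env is None else env
    d = CriuTuning()
    return CriuTuning(
        timeout=int(env.get("CRIU_TIMEOUT", d.timeout)),
        log_level=int(env.get("CRIU_LOG_LEVEL", d.log_level)),
        work_dir=env.get("CRIU_WORK_DIR", d.work_dir),
        cuda_plugin_dir=env.get("CUDA_PLUGIN_DIR", d.cuda_plugin_dir),
        skip_in_flight=_as_bool(env.get("CRIU_SKIP_IN_FLIGHT", d.skip_in_flight)),
        auto_dedup=_as_bool(env.get("CRIU_AUTO_DEDUP", d.auto_dedup)),
        lazy_pages=_as_bool(env.get("CRIU_LAZY_PAGES", d.lazy_pages)),
    )
```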

---

## Checkpoint Flow Explained

### 1. Checkpoint Creation Flow

```
┌─────────────────────────────────────────────────────────────┐
│ 1. Pod starts with nvidia.com/checkpoint-source=true label  │
└──────────────────────┬──────────────────────────────────────┘
                       │
                       ▼
┌─────────────────────────────────────────────────────────────┐
│ 2. Application loads model and creates ready file           │
│    /tmp/checkpoint-ready                                    │
└──────────────────────┬──────────────────────────────────────┘
                       │
                       ▼
┌─────────────────────────────────────────────────────────────┐
│ 3. Pod becomes Ready (kubelet readiness probe passes)       │
└──────────────────────┬──────────────────────────────────────┘
                       │
                       ▼
┌─────────────────────────────────────────────────────────────┐
│ 4. ChReK DaemonSet detects:                                 │
│    - Pod is Ready                                           │
│    - Has checkpoint-source label                            │
│    - Ready file exists: /tmp/checkpoint-ready               │
└──────────────────────┬──────────────────────────────────────┘
                       │
                       ▼
┌─────────────────────────────────────────────────────────────┐
│ 5. DaemonSet executes CRIU checkpoint via runc:             │
│    - Freezes container process                              │
│    - Dumps memory (CPU + GPU)                               │
│    - Saves to /checkpoints/${HASH}/                         │
└──────────────────────┬──────────────────────────────────────┘
                       │
                       ▼
┌─────────────────────────────────────────────────────────────┐
│ 6. DaemonSet writes signal file:                            │
│    /checkpoint-signal/${HASH}.done                          │
└──────────────────────┬──────────────────────────────────────┘
                       │
                       ▼
┌─────────────────────────────────────────────────────────────┐
│ 7. Application detects signal file and exits gracefully     │
└─────────────────────────────────────────────────────────────┘
```
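
The trigger condition in step 4 comes down to a label check plus the pod's `Ready` condition. The sketch below expresses that as a predicate over a pod object shaped like the output of `kubectl get pod <pod-name> -o json`. It is an illustration only: this guide does not show the agent's actual implementation, and `is_checkpoint_candidate` is a hypothetical name.

```python
CHECKPOINT_LABEL = "nvidia.com/checkpoint-source"


def is_checkpoint_candidate(pod):
    """True if the pod carries the checkpoint label and reports Ready."""
    labels = pod.get("metadata", {}).get("labels") or {}
    if labels.get(CHECKPOINT_LABEL) != "true":
        return False
    conditions = pod.get("status", {}).get("conditions") or []
    # Kubernetes reports condition status as the strings "True"/"False"/"Unknown".
    return any(
        c.get("type") == "Ready" and c.get("status") == "True"
        for c in conditions
    )
```

Running the same predicate against a stuck checkpoint job is a quick way to see which of the two conditions is not yet satisfied.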

### 2. Restore Flow

```
┌─────────────────────────────────────────────────────────────┐
│ 1. Pod starts with DYN_CHECKPOINT_HASH set                  │
└──────────────────────┬──────────────────────────────────────┘
                       │
                       ▼
┌─────────────────────────────────────────────────────────────┐
│ 2. smart-entrypoint.sh checks for checkpoint:               │
│    /checkpoints/${DYN_CHECKPOINT_HASH}/checkpoint.done      │
└──────────────────────┬──────────────────────────────────────┘
                       │
                       ├─ Not Found ──────────────────┐
                       │                              │
                       ▼                              ▼
           ┌───────────────────────┐    ┌──────────────────────┐
           │ Checkpoint exists     │    │ Cold start           │
           └───────────┬───────────┘    │ Run original CMD     │
                       │                └──────────────────────┘
                       ▼
┌─────────────────────────────────────────────────────────────┐
│ 3. Call restore-entrypoint with checkpoint path             │
└──────────────────────┬──────────────────────────────────────┘
                       │
                       ▼
┌─────────────────────────────────────────────────────────────┐
│ 4. restore-entrypoint extracts checkpoint and calls CRIU:   │
│    criu restore --images-dir /checkpoints/${HASH}/images    │
└──────────────────────┬──────────────────────────────────────┘
                       │
                       ▼
┌─────────────────────────────────────────────────────────────┐
│ 5. CRIU restores process from checkpoint                    │
│    - Restores memory (CPU + GPU)                            │
│    - Restores file descriptors                              │
│    - Resumes process execution                              │
└──────────────────────┬──────────────────────────────────────┘
                       │
                       ▼
┌─────────────────────────────────────────────────────────────┐
│ 6. Application continues from checkpointed state            │
│    (Model already loaded, GPU memory initialized)           │
└─────────────────────────────────────────────────────────────┘
```

---

## Troubleshooting

### Checkpoint Not Created

**Symptom**: Job runs but no checkpoint appears in `/checkpoints/`

**Checks**:
1. Verify the pod has the label:
   ```bash
   kubectl get pod <pod-name> -o jsonpath='{.metadata.labels.nvidia\.com/checkpoint-source}'
   ```

2. Check pod readiness:
   ```bash
   kubectl get pod <pod-name> -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}'
   ```

3. Check ready file was created:
   ```bash
   kubectl exec <pod-name> -- ls -la /tmp/checkpoint-ready
   ```

4. Check DaemonSet logs:
   ```bash
   kubectl logs -n my-app daemonset/chrek-agent --all-containers
   ```

### Restore Fails

**Symptom**: Pod fails to restore from checkpoint

**Checks**:
1. Verify checkpoint files exist:
   ```bash
   kubectl exec <pod-name> -- ls -la /checkpoints/${DYN_CHECKPOINT_HASH}/
   ```

2. Check privileged mode is enabled:
   ```bash
   kubectl get pod <pod-name> -o jsonpath='{.spec.containers[0].securityContext.privileged}'
   ```

3. Check CRIU logs in `/tmp/criu-restore.log`:
   ```bash
   kubectl exec <pod-name> -- cat /tmp/criu-restore.log
   ```

4. Ensure the checkpoint job and the restore pod use the same:
   - Container image
   - GPU count
   - Volume mounts
   - Environment variables (except POD_NAME, POD_IP, etc.)
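
Part of check 4 can be automated by comparing the two manifests after loading them as dictionaries (for example from `kubectl get ... -o json`). The sketch below compares only image, GPU limit, and mount paths for one container; environment variables are left out because some are expected to differ. `spec_differences` is a hypothetical helper, and the default container name `main` is taken from the examples in this guide.

```python
def _summarize(manifest, container):
    spec = manifest.get("spec", {})
    # Job manifests nest the pod spec under spec.template.spec; Pods do not.
    spec = spec.get("template", {}).get("spec", spec)
    for c in spec.get("containers", []):
        if c.get("name") == container:
            limits = c.get("resources", {}).get("limits", {})
            return {
                "image": c.get("image"),
                "gpu_count": str(limits.get("nvidia.com/gpu", 0)),
                "mount_paths": sorted(
                    m["mountPath"] for m in c.get("volumeMounts", [])
                ),
            }
    raise ValueError(f"container {container!r} not found")


def spec_differences(checkpoint_manifest, restore_manifest, container="main"):
    """Return {field: (checkpoint_value, restore_value)} for fields that differ."""
    a = _summarize(checkpoint_manifest, container)
    b = _summarize(restore_manifest, container)
    return {key: (a[key], b[key]) for key in a if a[key] != b[key]}
```

An empty result means the three compared fields agree; any entry shows the checkpoint-side value first and the restore-side value second.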

### Permission Denied Errors

**Symptom**: `CRIU: Permission denied` or `Operation not permitted`

**Solution**: Ensure pod has:
```yaml
securityContext:
  privileged: true
  capabilities:
    add:
    - SYS_ADMIN
    - SYS_PTRACE
    - SYS_CHROOT
```

### Signal File Not Appearing

**Symptom**: Application waits forever for signal file

**Checks**:
1. Verify hostPath mount is correct:
   ```bash
   kubectl get pod <pod-name> -o jsonpath='{.spec.volumes[?(@.name=="checkpoint-signal")]}'
   ```

2. Check DaemonSet has access to the same path:
   ```bash
   kubectl get daemonset -n my-app chrek-agent -o jsonpath='{.spec.template.spec.volumes[?(@.name=="signal-dir")]}'
   ```

3. Verify paths match exactly:
   - Pod: `/var/lib/chrek/signals`
   - DaemonSet: `/var/lib/chrek/signals`
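
While debugging, it can be easier to bound the wait than to let the application poll indefinitely. The sketch below is a variant of the wait loop from Step 3 with a deadline. `wait_for_signal` is a hypothetical helper, and the default timeout is an arbitrary placeholder that should be set comfortably above the expected checkpoint duration.

```python
import os
import time


def wait_for_signal(signal_file, restore_marker, timeout_s=600.0, poll_s=1.0):
    """Poll for either file; return "restored", "checkpointed", or "timeout"."""
    deadline = time.monotonic() + timeout_s
    while True:
        # Check the restore marker first so a restored process continues running.
        if os.path.exists(restore_marker):
            return "restored"
        if os.path.exists(signal_file):
            return "checkpointed"
        if time.monotonic() >= deadline:
            return "timeout"
        time.sleep(poll_s)
```

On `"timeout"` the application can log the paths it was watching and exit with a non-zero status, which makes a mismatched hostPath show up as a failed Job instead of a hung one.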

---

## Additional Resources

- [ChReK Helm Chart Values](../../deploy/helm/charts/chrek/values.yaml)
- [Smart Entrypoint Script](../../deploy/chrek/scripts/smart-entrypoint.sh)
- [CRIU Documentation](https://criu.org/Main_Page)
- [CUDA Checkpoint Plugin](https://docs.nvidia.com/cuda/cuda-checkpoint-plugin/)

---

## Getting Help

If you encounter issues:

1. Check the [Troubleshooting](#troubleshooting) section
2. Review DaemonSet logs: `kubectl logs -n <namespace> daemonset/chrek-agent`
3. Open an issue on [GitHub](https://github.com/ai-dynamo/dynamo/issues)