# ChReK Standalone Usage Guide > ⚠️ **Experimental Feature**: ChReK is currently in **beta/preview**. It requires privileged mode for restore operations, which may not be suitable for all production environments. Review the [security implications](#security-considerations) before deploying. This guide explains how to use **ChReK** (Checkpoint/Restore for Kubernetes) as a standalone component without deploying the full Dynamo platform. This is useful if you want to add checkpoint/restore capabilities to your own GPU workloads. ## Table of Contents - [Overview](#overview) - [Prerequisites](#prerequisites) - [Step 1: Deploy ChReK](#step-1-deploy-chrek) - [Step 2: Build Checkpoint-Enabled Images](#step-2-build-checkpoint-enabled-images) - [Step 3: Create Checkpoint Jobs](#step-3-create-checkpoint-jobs) - [Step 4: Restore from Checkpoints](#step-4-restore-from-checkpoints) - [Environment Variables Reference](#environment-variables-reference) - [Checkpoint Flow Explained](#checkpoint-flow-explained) - [Troubleshooting](#troubleshooting) --- ## Overview When using ChReK standalone, you are responsible for: 1. **Deploying the ChReK Helm chart** (DaemonSet + PVC) 2. **Building checkpoint-enabled container images** with the restore entrypoint 3. **Creating checkpoint jobs** with the correct environment variables 4. **Creating restore pods** that detect and use the checkpoints The ChReK DaemonSet handles the actual CRIU checkpoint/restore operations automatically once your pods are configured correctly. --- ## Prerequisites - Kubernetes cluster with: - NVIDIA GPUs with checkpoint support - **Privileged security context allowed** (⚠️ required for CRIU - see [Security Considerations](#security-considerations)) - PVC storage (ReadWriteMany recommended for multi-node) - Docker or compatible container runtime for building images - Access to the ChReK source code: `deploy/chrek/` ### Security Considerations ⚠️ **Important**: ChReK restore operations **require privileged mode**, which has significant security implications: - **Privileged containers** can access all host devices and bypass most security restrictions - This may violate security policies in production environments - Privileged containers, if compromised, can potentially compromise node security **Recommended for:** - ✅ Development and testing environments - ✅ Research and experimentation - ✅ Controlled production environments with appropriate security controls **Not recommended for:** - ❌ Multi-tenant clusters without proper isolation - ❌ Security-sensitive production workloads without risk assessment - ❌ Environments with strict security compliance requirements ### Technical Limitations ⚠️ **Current Restrictions:** - **vLLM backend only**: Currently only the vLLM backend supports checkpoint/restore. SGLang and TensorRT-LLM support is planned. - **Single-node only**: Checkpoints must be created and restored on the same node - **Single-GPU only**: Multi-GPU configurations are not yet supported - **Network state**: Active TCP connections are closed during restore - **Storage**: Only PVC backend currently implemented (S3/OCI planned) --- ## Step 1: Deploy ChReK ### Install the Helm Chart ```bash # Clone the repository git clone https://github.com/ai-dynamo/dynamo.git cd dynamo # Install ChReK in your namespace helm install chrek ./deploy/helm/charts/chrek \ --namespace my-app \ --create-namespace \ --set storage.pvc.size=100Gi \ --set storage.pvc.storageClass=your-storage-class ``` ### Verify Installation ```bash # Check the DaemonSet is running kubectl get daemonset -n my-app # NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE # chrek-agent 3 3 3 3 3 # Check the PVC is bound kubectl get pvc -n my-app # NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS # chrek-pvc Bound pvc-xyz 100Gi RWX your-storage-class ``` --- ## Step 2: Build Checkpoint-Enabled Images ChReK provides a convenient `placeholder` target in its Dockerfile that automatically injects checkpoint/restore capabilities into your existing container images. ### Quick Start: Using the Placeholder Target (Recommended) ```bash cd deploy/chrek # Define your images export BASE_IMAGE="your-app:latest" # Your existing application image export RESTORE_IMAGE="your-app:checkpoint-enabled" # Output checkpoint-enabled image # Build using the placeholder target docker build \ --target placeholder \ --build-arg BASE_IMAGE="$BASE_IMAGE" \ -t "$RESTORE_IMAGE" \ . # Push to your registry docker push "$RESTORE_IMAGE" ``` **Example with a Dynamo vLLM image:** ```bash cd deploy/chrek export DYNAMO_IMAGE="nvidia/dynamo-vllm:v1.2.0" export RESTORE_IMAGE="nvidia/dynamo-vllm:v1.2.0-checkpoint" docker build \ --target placeholder \ --build-arg BASE_IMAGE="$DYNAMO_IMAGE" \ -t "$RESTORE_IMAGE" \ . ``` ### What the Placeholder Target Does The ChReK Dockerfile's `placeholder` stage automatically: - ✅ Builds the restore-entrypoint binary - ✅ Injects it into `/usr/local/bin/restore-entrypoint` - ✅ Adds `smart-entrypoint.sh` to `/usr/local/bin/` - ✅ Sets executable permissions - ✅ Configures the entrypoint to detect and restore checkpoints - ✅ Preserves your original application CMD ### Alternative: Manual Multi-Stage Build If you need more control, you can create your own Dockerfile: ```dockerfile # Stage 1: Build restore-entrypoint FROM golang:1.23-alpine AS restore-builder WORKDIR /build COPY deploy/chrek/cmd/restore-entrypoint ./cmd/restore-entrypoint COPY deploy/chrek/pkg ./pkg COPY deploy/chrek/go.mod deploy/chrek/go.sum ./ RUN go build -o /restore-entrypoint ./cmd/restore-entrypoint # Stage 2: Your application image FROM your-base-image:latest # Copy restore-entrypoint COPY --from=restore-builder /restore-entrypoint /usr/local/bin/restore-entrypoint # Copy smart-entrypoint.sh COPY deploy/chrek/scripts/smart-entrypoint.sh /usr/local/bin/smart-entrypoint.sh RUN chmod +x /usr/local/bin/smart-entrypoint.sh /usr/local/bin/restore-entrypoint # Set smart-entrypoint as the default entrypoint ENTRYPOINT ["/usr/local/bin/smart-entrypoint.sh"] # Your application command (becomes CMD, can be overridden) CMD ["python", "your_app.py"] ``` > **💡 Tip**: Using the `placeholder` target is the recommended approach as it's maintained with the ChReK codebase and ensures compatibility. --- ## Step 3: Create Checkpoint Jobs A checkpoint job loads your application, waits for the ChReK DaemonSet to checkpoint it, and then exits. ### Required Environment Variables Your checkpoint job MUST set these environment variables: | Variable | Description | Example | |----------|-------------|---------| | `DYN_CHECKPOINT_SIGNAL_FILE` | Path where DaemonSet writes completion signal | `/checkpoint-signal/my-checkpoint.done` | | `DYN_CHECKPOINT_READY_FILE` | Path where your app signals it's ready | `/tmp/checkpoint-ready` | | `DYN_CHECKPOINT_HASH` | Unique identifier for this checkpoint | `abc123def456` | | `DYN_CHECKPOINT_LOCATION` | Directory where checkpoint is stored | `/checkpoints/abc123def456` | | `DYN_CHECKPOINT_STORAGE_TYPE` | Storage backend type | `pvc` | ### Required Labels Add this label to enable DaemonSet checkpoint detection: ```yaml labels: nvidia.com/checkpoint-source: "true" ``` ### Example Checkpoint Job ```yaml apiVersion: batch/v1 kind: Job metadata: name: checkpoint-my-model namespace: my-app spec: template: metadata: labels: nvidia.com/checkpoint-source: "true" # Required for DaemonSet detection spec: restartPolicy: Never # Init container to clean up stale signal files initContainers: - name: cleanup-signal-file image: busybox:latest command: - sh - -c - | rm -f /checkpoint-signal/my-checkpoint.done || true echo "Signal file cleanup complete" volumeMounts: - name: checkpoint-signal mountPath: /checkpoint-signal containers: - name: main image: my-app:checkpoint-enabled # Security context required for CRIU securityContext: privileged: true capabilities: add: ["SYS_ADMIN", "SYS_PTRACE", "SYS_CHROOT"] # Readiness probe: Pod becomes Ready when model is loaded # This is what triggers the DaemonSet to start checkpointing readinessProbe: exec: command: ["sh", "-c", "cat ${DYN_CHECKPOINT_READY_FILE}"] initialDelaySeconds: 15 periodSeconds: 2 # Remove liveness/startup probes for checkpoint jobs # Model loading can take several minutes livenessProbe: null startupProbe: null # Checkpoint-related environment variables env: - name: DYN_CHECKPOINT_SIGNAL_FILE value: "/checkpoint-signal/my-checkpoint.done" - name: DYN_CHECKPOINT_READY_FILE value: "/tmp/checkpoint-ready" - name: DYN_CHECKPOINT_HASH value: "abc123def456" - name: DYN_CHECKPOINT_LOCATION value: "/checkpoints/abc123def456" - name: DYN_CHECKPOINT_STORAGE_TYPE value: "pvc" # GPU request resources: limits: nvidia.com/gpu: 1 # Required volume mounts volumeMounts: - name: checkpoint-storage mountPath: /checkpoints - name: checkpoint-signal mountPath: /checkpoint-signal - name: tmp mountPath: /tmp volumes: - name: checkpoint-storage persistentVolumeClaim: claimName: chrek-pvc - name: checkpoint-signal hostPath: path: /var/lib/chrek/signals type: DirectoryOrCreate - name: tmp emptyDir: {} ``` ### Application Code Requirements Your application must implement the checkpoint flow. Here's the pattern used by Dynamo vLLM: ```python import os import time def main(): # 1. Check for checkpoint mode signal_file = os.environ.get("DYN_CHECKPOINT_SIGNAL_FILE") ready_file = os.environ.get("DYN_CHECKPOINT_READY_FILE") restore_marker = os.environ.get("DYN_RESTORE_MARKER_FILE", "/tmp/dynamo-restored") is_checkpoint_mode = signal_file is not None if is_checkpoint_mode: print("Checkpoint mode detected") # 2. Load your model/application model = load_model() # 3. Optional: Put model to sleep to reduce memory footprint # model.sleep() # 4. Write ready file (for application use, not DaemonSet) if ready_file: with open(ready_file, "w") as f: f.write("ready") print(f"Wrote checkpoint ready file: {ready_file}") # 5. Log readiness messages (helps debugging) print("CHECKPOINT_READY: Model loaded, ready for container checkpoint") print(f"CHECKPOINT_READY: Waiting for signal file: {signal_file}") print(f"CHECKPOINT_READY: Or restore marker file: {restore_marker}") # 6. Wait for checkpoint completion OR restore detection while True: # Check if we've been restored (marker file created by restore entrypoint) if os.path.exists(restore_marker): print(f"Detected restore from checkpoint (marker: {restore_marker})") # Continue with normal application flow break # Check if checkpoint is complete (signal file created by DaemonSet) if os.path.exists(signal_file): print(f"Checkpoint signal file detected: {signal_file}") print("Checkpoint complete, exiting") return # Exit gracefully time.sleep(1) # Normal application flow (or post-restore flow) run_application() ``` **Important Notes:** 1. **Ready File & Readiness Probe**: The checkpoint job must have a readiness probe that checks for the ready file: ```yaml readinessProbe: exec: command: ["sh", "-c", "cat ${DYN_CHECKPOINT_READY_FILE}"] initialDelaySeconds: 15 periodSeconds: 2 ``` The ChReK DaemonSet triggers checkpointing when: - Pod has `nvidia.com/checkpoint-source: "true"` label - Pod status is `Ready` (readiness probe passes = ready file exists) 2. **Restore Marker**: Created by `restore-entrypoint` before CRIU restore, allows the restored process to detect it was restored 3. **Two Exit Paths**: - **Signal file found**: Checkpoint complete, exit gracefully - **Restore marker found**: Process was restored, continue running --- ## Step 4: Restore from Checkpoints Restore pods automatically detect and restore from checkpoints if they exist. ### Example Restore Pod ```yaml apiVersion: v1 kind: Pod metadata: name: my-app-restored namespace: my-app spec: restartPolicy: Never containers: - name: main image: my-app:checkpoint-enabled # Security context required for CRIU restore securityContext: privileged: true capabilities: add: ["SYS_ADMIN", "SYS_PTRACE", "SYS_CHROOT"] # Set checkpoint environment variables env: - name: DYN_CHECKPOINT_HASH value: "abc123def456" # Must match checkpoint job - name: DYN_CHECKPOINT_PATH value: "/checkpoints" # Base path (hash appended automatically) # Optional: Customize restore marker file path # - name: DYN_RESTORE_MARKER_FILE # value: "/tmp/dynamo-restored" # GPU request resources: limits: nvidia.com/gpu: 1 # Mount checkpoint storage (READ-ONLY is fine for restore) volumeMounts: - name: checkpoint-storage mountPath: /checkpoints readOnly: true - name: checkpoint-signal mountPath: /checkpoint-signal volumes: - name: checkpoint-storage persistentVolumeClaim: claimName: chrek-pvc - name: checkpoint-signal hostPath: path: /var/lib/chrek/signals type: DirectoryOrCreate ``` ### How Restore Works 1. **Smart Entrypoint Detects Checkpoint**: The `smart-entrypoint.sh` checks if a checkpoint exists at `/checkpoints/${DYN_CHECKPOINT_HASH}/` 2. **Calls Restore Entrypoint**: If found, calls `/usr/local/bin/restore-entrypoint` which invokes CRIU 3. **CRIU Restores Process**: The entire process tree is restored from the checkpoint, including GPU state 4. **Application Continues**: Your application resumes exactly where it was checkpointed --- ## Environment Variables Reference ### Checkpoint Jobs | Variable | Required | Description | |----------|----------|-------------| | `DYN_CHECKPOINT_SIGNAL_FILE` | Yes | Full path to signal file (e.g., `/checkpoint-signal/my-checkpoint.done`) | | `DYN_CHECKPOINT_READY_FILE` | Yes | Full path where app signals readiness (e.g., `/tmp/checkpoint-ready`) | | `DYN_CHECKPOINT_HASH` | Yes | Unique checkpoint identifier (alphanumeric string) | | `DYN_CHECKPOINT_LOCATION` | Yes | Directory where checkpoint is stored (e.g., `/checkpoints/abc123`) | | `DYN_CHECKPOINT_STORAGE_TYPE` | Yes | Storage backend: `pvc`, `s3`, or `oci` | ### Restore Pods | Variable | Required | Description | |----------|----------|-------------| | `DYN_CHECKPOINT_HASH` | Yes | Checkpoint identifier (must match checkpoint job) | | `DYN_CHECKPOINT_PATH` | Yes | Base checkpoint directory (hash appended automatically) | | `DYN_RESTORE_MARKER_FILE` | No | Path for restore marker file (default: `/tmp/dynamo-restored`) | ### Optional CRIU Tuning (Advanced) | Variable | Default | Description | |----------|---------|-------------| | `CRIU_TIMEOUT` | `0` (unlimited) | CRIU operation timeout in seconds | | `CRIU_LOG_LEVEL` | `4` | CRIU log verbosity (0-4) | | `CRIU_WORK_DIR` | `/tmp` | CRIU working directory | | `CUDA_PLUGIN_DIR` | `/usr/local/lib/criu` | Path to CRIU CUDA plugin | | `CRIU_SKIP_IN_FLIGHT` | `false` | Skip in-flight TCP connections | | `CRIU_AUTO_DEDUP` | `false` | Enable auto-deduplication | | `CRIU_LAZY_PAGES` | `false` | Enable lazy page migration (experimental) | | `WAIT_FOR_CHECKPOINT` | `false` | Wait for checkpoint to appear before starting | | `RESTORE_WAIT_TIMEOUT` | `300` | Max seconds to wait for checkpoint | | `DEBUG` | `false` | Enable debug mode (sleeps 300s on error) | --- ## Checkpoint Flow Explained ### 1. Checkpoint Creation Flow ``` ┌─────────────────────────────────────────────────────────────┐ │ 1. Pod starts with nvidia.com/checkpoint-source=true label │ └──────────────────────┬──────────────────────────────────────┘ │ ▼ ┌─────────────────────────────────────────────────────────────┐ │ 2. Application loads model and creates ready file │ │ /tmp/checkpoint-ready │ └──────────────────────┬──────────────────────────────────────┘ │ ▼ ┌─────────────────────────────────────────────────────────────┐ │ 3. Pod becomes Ready (kubelet readiness probe passes) │ └──────────────────────┬──────────────────────────────────────┘ │ ▼ ┌─────────────────────────────────────────────────────────────┐ │ 4. ChReK DaemonSet detects: │ │ - Pod is Ready │ │ - Has checkpoint-source label │ │ - Ready file exists: /tmp/checkpoint-ready │ └──────────────────────┬──────────────────────────────────────┘ │ ▼ ┌─────────────────────────────────────────────────────────────┐ │ 5. DaemonSet executes CRIU checkpoint via runc: │ │ - Freezes container process │ │ - Dumps memory (CPU + GPU) │ │ - Saves to /checkpoints/${HASH}/ │ └──────────────────────┬──────────────────────────────────────┘ │ ▼ ┌─────────────────────────────────────────────────────────────┐ │ 6. DaemonSet writes signal file: │ │ /checkpoint-signal/${HASH}.done │ └──────────────────────┬──────────────────────────────────────┘ │ ▼ ┌─────────────────────────────────────────────────────────────┐ │ 7. Application detects signal file and exits gracefully │ └─────────────────────────────────────────────────────────────┘ ``` ### 2. Restore Flow ``` ┌─────────────────────────────────────────────────────────────┐ │ 1. Pod starts with DYN_CHECKPOINT_HASH set │ └──────────────────────┬──────────────────────────────────────┘ │ ▼ ┌─────────────────────────────────────────────────────────────┐ │ 2. smart-entrypoint.sh checks for checkpoint: │ │ /checkpoints/${DYN_CHECKPOINT_HASH}/checkpoint.done │ └──────────────────────┬──────────────────────────────────────┘ │ ├─ Not Found ─────────────────┐ │ │ ▼ ▼ ┌───────────────────────┐ ┌──────────────────────┐ │ Checkpoint exists │ │ Cold start │ └──────────┬────────────┘ │ Run original CMD │ │ └──────────────────────┘ ▼ ┌─────────────────────────────────────────────────────────────┐ │ 3. Call restore-entrypoint with checkpoint path │ └──────────────────────┬──────────────────────────────────────┘ │ ▼ ┌─────────────────────────────────────────────────────────────┐ │ 4. restore-entrypoint extracts checkpoint and calls CRIU: │ │ criu restore --images-dir /checkpoints/${HASH}/images │ └──────────────────────┬──────────────────────────────────────┘ │ ▼ ┌─────────────────────────────────────────────────────────────┐ │ 5. CRIU restores process from checkpoint │ │ - Restores memory (CPU + GPU) │ │ - Restores file descriptors │ │ - Resumes process execution │ └──────────────────────┬──────────────────────────────────────┘ │ ▼ ┌─────────────────────────────────────────────────────────────┐ │ 6. Application continues from checkpointed state │ │ (Model already loaded, GPU memory initialized) │ └─────────────────────────────────────────────────────────────┘ ``` --- ## Troubleshooting ### Checkpoint Not Created **Symptom**: Job runs but no checkpoint appears in `/checkpoints/` **Checks**: 1. Verify the pod has the label: ```bash kubectl get pod -o jsonpath='{.metadata.labels.nvidia\.com/checkpoint-source}' ``` 2. Check pod readiness: ```bash kubectl get pod -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}' ``` 3. Check ready file was created: ```bash kubectl exec -- ls -la /tmp/checkpoint-ready ``` 4. Check DaemonSet logs: ```bash kubectl logs -n my-app daemonset/chrek-agent --all-containers ``` ### Restore Fails **Symptom**: Pod fails to restore from checkpoint **Checks**: 1. Verify checkpoint files exist: ```bash kubectl exec -- ls -la /checkpoints/${DYN_CHECKPOINT_HASH}/ ``` 2. Check privileged mode is enabled: ```bash kubectl get pod -o jsonpath='{.spec.containers[0].securityContext.privileged}' ``` 3. Check CRIU logs in `/tmp/criu-restore.log`: ```bash kubectl exec -- cat /tmp/criu-restore.log ``` 4. Ensure checkpoint and restore have same: - Container image - GPU count - Volume mounts - Environment variables (except POD_NAME, POD_IP, etc.) ### Permission Denied Errors **Symptom**: `CRIU: Permission denied` or `Operation not permitted` **Solution**: Ensure pod has: ```yaml securityContext: privileged: true capabilities: add: - SYS_ADMIN - SYS_PTRACE - SYS_CHROOT ``` ### Signal File Not Appearing **Symptom**: Application waits forever for signal file **Checks**: 1. Verify hostPath mount is correct: ```bash kubectl get pod -o jsonpath='{.spec.volumes[?(@.name=="checkpoint-signal")]}' ``` 2. Check DaemonSet has access to the same path: ```bash kubectl get daemonset -n my-app chrek-agent -o jsonpath='{.spec.template.spec.volumes[?(@.name=="signal-dir")]}' ``` 3. Verify paths match exactly: - Pod: `/var/lib/chrek/signals` - DaemonSet: `/var/lib/chrek/signals` --- ## Additional Resources - [ChReK Helm Chart Values](../../deploy/helm/charts/chrek/values.yaml) - [Smart Entrypoint Script](../../deploy/chrek/scripts/smart-entrypoint.sh) - [CRIU Documentation](https://criu.org/Main_Page) - [CUDA Checkpoint Plugin](https://docs.nvidia.com/cuda/cuda-checkpoint-plugin/) --- ## Getting Help If you encounter issues: 1. Check the [Troubleshooting](#troubleshooting) section 2. Review DaemonSet logs: `kubectl logs -n daemonset/chrek-agent` 3. Open an issue on [GitHub](https://github.com/ai-dynamo/dynamo/issues)