> ⚠️ **Experimental Feature**: ChReK is currently in **beta/preview**. It requires privileged mode for restore operations, which may not be suitable for all production environments. See [Limitations](#limitations) for details.
> ⚠️ **Experimental Feature**: ChReK is currently in **beta/preview**. The ChReK DaemonSet runs in privileged mode to perform CRIU operations. See [Limitations](#limitations) for details.
**ChReK** (Checkpoint/Restore in Kubernetes) is an experimental infrastructure for fast-starting GPU applications using CRIU (Checkpoint/Restore in User-space). ChReK dramatically reduces cold-start times for large models from minutes to seconds by capturing initialized application state and restoring it on-demand.
...
...
@@ -48,10 +48,10 @@ Deploys the checkpoint/restore infrastructure:
- **RBAC**: Namespace-scoped or cluster-wide permissions
- **Seccomp Profile**: Security policies for CRIU syscalls
### 2. Smart Entrypoint
A wrapper script that intelligently decides between:
- **Cold start**: Normal application startup (when no checkpoint exists)
- **Restore**: CRIU restore from checkpoint (when checkpoint available)
### 2. External Restore via DaemonSet
The DaemonSet performs checkpoint/restore externally using `nsenter` to enter pod namespaces:
- **Checkpoint**: Freezes the running process and dumps state (CPU + GPU) to storage
- **Restore**: Enters a placeholder pod's namespaces and restores the checkpointed process via `nsrestore`
⚠️ **Important**: ChReK has significant limitations that may impact production readiness:
### Security Considerations
- **🔴 Privileged mode required**: Restore pods **must run in privileged mode** for CRIU to function. This grants containers elevated host access and may violate security policies in many production environments.
- **Security Impact**: Privileged containers can:
  - Access all host devices
- **🔴 Privileged DaemonSet**: The ChReK DaemonSet runs in privileged mode with `hostPID`, `hostIPC`, and `hostNetwork` to perform CRIU operations. Workload pods do **not** need privileged mode; all CRIU privilege lives in the DaemonSet.
- **Security Impact**: The privileged DaemonSet can:
  - Access all host devices and processes
  - Bypass most security restrictions
  - Potentially compromise node security if the container is exploited
  - Potentially compromise node security if exploited
### Technical Limitations
- **vLLM backend only**: Currently only the vLLM backend supports checkpoint/restore. SGLang and TensorRT-LLM support is planned.
...
...
@@ -128,9 +128,9 @@ ChReK is best suited for:
- Kubernetes 1.21+
- GPU nodes with NVIDIA runtime (`nvidia` runtime class)
- CRIU support in container runtime (containerd with CRIU plugin)
- containerd runtime (for container inspection; CRIU is bundled in ChReK images)
- RWX storage class (for multi-node deployments)
- **Security clearance for privileged pods** (required for restore operations)
- **Security clearance for privileged DaemonSet** (the ChReK agent runs privileged with hostPID/hostIPC/hostNetwork)
## Troubleshooting
...
...
@@ -146,9 +146,9 @@ ChReK is best suited for:
- Verify CRIU is installed in the runtime
**Restore fails?**
- Ensure restore pod uses the same volumes as checkpoint job
- Verify `hostIPC: true` is set (required for CUDA)
- Check for `PSM3_DISABLED=1` and `GLOO_SOCKET_IFNAME=lo` environment variables
- Ensure restore pod uses the same image (built with `placeholder` target) and volume mounts as checkpoint job
- Verify the DaemonSet is running on the same node as the restore pod
> ⚠️ **Experimental Feature**: ChReK is currently in **beta/preview**. It requires privileged mode for restore operations. See [Limitations](#limitations) for details.
> ⚠️ **Experimental Feature**: ChReK is currently in **beta/preview**. The ChReK DaemonSet runs in privileged mode to perform CRIU operations. See [Limitations](#limitations) for details.
Reduce cold start times for LLM inference workers from ~3 minutes to ~30 seconds using container checkpointing.
...
...
@@ -23,7 +23,7 @@ Checkpointing captures the complete state of a running worker pod (including GPU
- Dynamo Platform installed (v0.4.0+)
- ChReK Helm chart installed (separate from platform)
- GPU nodes with CRIU support
- GPU nodes with containerd runtime (CRIU is bundled in ChReK images)
- RWX PVC storage (PVC is currently the only supported backend)
## Quick Start
...
...
@@ -350,9 +350,9 @@ Or use `auto` mode and the operator will find/create it automatically.
⚠️ **Important**: ChReK has significant limitations that impact production readiness:
### Security Considerations
- **🔴 Privileged mode required**: Restore pods **must run in privileged mode** for CRIU to function
- Privileged containers have elevated host access, which may violate security policies in many production environments
- This requirement applies to all worker pods that restore from checkpoints
- **🔴 Privileged DaemonSet**: The ChReK DaemonSet runs in privileged mode with `hostPID`, `hostIPC`, and `hostNetwork` to perform CRIU operations externally
- Workload pods (checkpoint jobs, restore pods) do **not** need privileged mode; all CRIU privilege lives in the DaemonSet
- The privileged DaemonSet has elevated host access, which may violate security policies in many production environments
### Technical Limitations
- **vLLM backend only**: Currently only the vLLM backend supports checkpoint/restore. SGLang and TensorRT-LLM support is planned.
...
...
@@ -374,7 +374,7 @@ ChReK is **experimental/beta** and best suited for:
1. Check the checkpoint job:
```bash
kubectl get jobs -l nvidia.com/checkpoint-source=true -n dynamo-system
kubectl get jobs -l nvidia.com/chrek-is-checkpoint-source=true -n dynamo-system
> ⚠️ **Experimental Feature**: ChReK is currently in **beta/preview**. It requires privileged mode for restore operations, which may not be suitable for all production environments. Review the [security implications](#security-considerations) before deploying.
> ⚠️ **Experimental Feature**: ChReK is currently in **beta/preview**. The ChReK DaemonSet runs in privileged mode to perform CRIU operations. Review the [security implications](#security-considerations) before deploying.
This guide explains how to use **ChReK** (Checkpoint/Restore for Kubernetes) as a standalone component without deploying the full Dynamo platform. This is useful if you want to add checkpoint/restore capabilities to your own GPU workloads.
## Table of Contents
- [Overview](#overview)
- [Using ChReK Without the Dynamo Operator](#using-chrek-without-the-dynamo-operator)
@@ -27,7 +28,7 @@ This guide explains how to use **ChReK** (Checkpoint/Restore for Kubernetes) as
When using ChReK standalone, you are responsible for:
1. **Deploying the ChReK Helm chart** (DaemonSet + PVC)
2. **Building checkpoint-enabled container images** with the restore entrypoint
2. **Building checkpoint-enabled container images** with the CRIU runtime dependencies
3. **Creating checkpoint jobs** with the correct environment variables
4. **Creating restore pods** that detect and use the checkpoints
...
...
@@ -35,22 +36,89 @@ The ChReK DaemonSet handles the actual CRIU checkpoint/restore operations automa
---
## Using ChReK Without the Dynamo Operator
When using ChReK with the Dynamo operator, the operator automatically configures workload pods for checkpoint/restore. Without the operator, you must handle this configuration manually. This section documents what the operator normally injects and how to replicate it.
### Container Naming
The ChReK DaemonSet needs to identify which container in your pod is the model-serving workload (as opposed to sidecars like istio-proxy or log collectors). It resolves the target container by name:
1. If a container is named `main`, it is selected
2. Otherwise, the first container in the pod spec is selected
When using the Dynamo operator, the model container is always named `main`. In standalone mode, you must either name your model container `main` or ensure it is the first container listed in your pod spec. All YAML examples in this guide use `name: main`.
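The two-step resolution rule can be expressed as a small helper, which is handy for checking a pod spec before applying it. This is a minimal sketch: `resolve_target_container` is a hypothetical name, not a function from the ChReK codebase, and the pod spec is assumed to be a plain dictionary shaped like the Kubernetes `PodSpec`.

```python
from typing import Optional


def resolve_target_container(pod_spec: dict) -> Optional[str]:
    """Return the container name the two rules above would select.

    Hypothetical helper for illustration; not part of ChReK.
    """
    containers = pod_spec.get("containers") or []
    if not containers:
        return None
    names = [c["name"] for c in containers]
    # Rule 1: a container named "main" is selected wherever it appears.
    if "main" in names:
        return "main"
    # Rule 2: otherwise the first container in the pod spec is selected.
    return names[0]
```

With a sidecar listed first, `resolve_target_container({"containers": [{"name": "istio-proxy"}, {"name": "main"}]})` still returns `"main"`; if no container is named `main`, the sidecar would be picked, which is why the ordering matters in standalone mode.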
### Seccomp Profile
The operator sets a seccomp profile on all checkpoint/restore workload pods to block `io_uring` syscalls. The chrek DaemonSet deploys the profile file (`profiles/block-iouring.json`) to each node, but you must reference it in your pod specs:
```yaml
spec:
  securityContext:
    seccompProfile:
      type: Localhost
      localhostProfile: profiles/block-iouring.json
```
Without this profile, `io_uring` syscalls during restore can cause CRIU failures.
### Sleep Infinity Command for Restore Pods
The operator overrides the container command to `["sleep", "infinity"]` on restore-target pods. This produces a Running-but-not-Ready placeholder pod that the chrek DaemonSet watcher detects and restores externally via `nsenter`. Without this override, the container runs its normal entrypoint (cold-starting instead of waiting for restore).
```yaml
containers:
- name: main
  image: my-app:checkpoint-enabled
  command: ["sleep", "infinity"]
```
### Recreate Deployment Strategy
The operator forces `Recreate` strategy when restore labels are present. This prevents the old and new pods from running simultaneously, which would cause failures — two pods competing for the same GPU checkpoint data. If you are using a Deployment, set this manually:
```yaml
apiVersion: apps/v1
kind: Deployment
spec:
  strategy:
    type: Recreate
```
### PVC Volume Mount Consistency
CRIU requires identical mount layouts between checkpoint and restore. The operator ensures the checkpoint PVC is mounted at the same path in both the checkpoint job and restore pod. When configuring manually, make sure your checkpoint job and restore pod use the exact same `mountPath` for the checkpoint PVC (e.g., `/checkpoints`).
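When writing both specs by hand, a quick pre-flight comparison can catch a mismatched mount path before a restore fails. The sketch below is illustrative only: `pvc_mount_paths` and `mounts_match` are hypothetical helpers, the specs are assumed to be plain dictionaries shaped like the Kubernetes `PodSpec`, and the check covers only the checkpoint PVC, not the full mount layout.

```python
def pvc_mount_paths(pod_spec: dict, claim_name: str) -> set:
    """Collect every mountPath at which the given PVC is mounted."""
    # Volumes in this pod that are backed by the named claim.
    volume_names = {
        v["name"]
        for v in pod_spec.get("volumes", [])
        if v.get("persistentVolumeClaim", {}).get("claimName") == claim_name
    }
    # Mount paths, across all containers, that reference those volumes.
    return {
        m["mountPath"]
        for c in pod_spec.get("containers", [])
        for m in c.get("volumeMounts", [])
        if m["name"] in volume_names
    }


def mounts_match(checkpoint_spec: dict, restore_spec: dict, claim_name: str) -> bool:
    """True only if both specs mount the PVC, and at identical paths."""
    checkpoint_paths = pvc_mount_paths(checkpoint_spec, claim_name)
    restore_paths = pvc_mount_paths(restore_spec, claim_name)
    return bool(checkpoint_paths) and checkpoint_paths == restore_paths
```

For example, `mounts_match(job_pod_spec, restore_pod_spec, "chrek-pvc")` would return `False` if one spec mounts the claim at `/checkpoints` and the other at `/data/checkpoints`.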
### Downward API Volume (Currently Unused)
The operator injects a Downward API volume at `/etc/podinfo` for post-restore identity discovery (pod name, namespace, UID). This is not currently consumed by any component — you can skip it for now.
### Environment Variables
The following environment variables are normally injected by the operator. They are already documented in the [Environment Variables Reference](#environment-variables-reference) below, but note that without the operator you must set them manually:
- **Privileged security context allowed** (⚠️ required for CRIU - see [Security Considerations](#security-considerations))
- **Privileged DaemonSet allowed** (⚠️ the ChReK DaemonSet runs privileged - see [Security Considerations](#security-considerations))
- PVC storage (ReadWriteMany recommended for multi-node)
- Docker or compatible container runtime for building images
- Access to the ChReK source code: `deploy/chrek/`
### Security Considerations
⚠️ **Important**: ChReK restore operations **require privileged mode**, which has significant security implications:
⚠️ **Important**: The ChReK **DaemonSet** runs in privileged mode to perform CRIU checkpoint/restore operations. Your workload pods (checkpoint jobs, restore pods) do **not** need privileged mode — all CRIU privilege lives in the DaemonSet, which performs external restore via `nsenter`.
- **Privileged containers** can access all host devices and bypass most security restrictions
- **The DaemonSet** has `privileged: true`, `hostPID`, `hostIPC`, and `hostNetwork`
- This may violate security policies in production environments
- Privileged containers, if compromised, can potentially compromise node security
- If the DaemonSet is compromised, it could potentially compromise node security
**Recommended for:**
- ✅ Development and testing environments
...
...
@@ -108,7 +176,7 @@ kubectl get pvc -n my-app
## Step 2: Build Checkpoint-Enabled Images
ChReK provides a convenient `placeholder` target in its Dockerfile that automatically injects checkpoint/restore capabilities into your existing container images.
ChReK provides a `placeholder` target in its Dockerfile that layers CRIU runtime dependencies onto your existing container images. The DaemonSet performs restore externally via `nsenter`, so these dependencies must be present in the image.
### Quick Start: Using the Placeholder Target (Recommended)
...
...
@@ -149,43 +217,14 @@ docker build \
The ChReK Dockerfile's `placeholder` stage automatically:
- ✅ Builds the restore-entrypoint binary
- ✅ Injects it into `/usr/local/bin/restore-entrypoint`
- ✅ Adds `smart-entrypoint.sh` to `/usr/local/bin/`
- ✅ Sets executable permissions
- ✅ Configures the entrypoint to detect and restore checkpoints
- ✅ Preserves your original application CMD
### Alternative: Manual Multi-Stage Build
If you need more control, you can create your own Dockerfile:
RUN chmod +x /usr/local/bin/smart-entrypoint.sh /usr/local/bin/restore-entrypoint
# Set smart-entrypoint as the default entrypoint
ENTRYPOINT ["/usr/local/bin/smart-entrypoint.sh"]
# Your application command (becomes CMD, can be overridden)
CMD ["python", "your_app.py"]
```
The placeholder image does **not** override the entrypoint or CMD. For restore pods, the operator (or you, in standalone mode) overrides the command to `sleep infinity`.
> **💡 Tip**: Using the `placeholder` target is the recommended approach as it's maintained with the ChReK codebase and ensures compatibility.
...
...
@@ -201,7 +240,6 @@ Your checkpoint job MUST set these environment variables:
| Variable | Description | Example |
|----------|-------------|---------|
| `DYN_CHECKPOINT_SIGNAL_FILE` | Path where DaemonSet writes completion signal | `/checkpoint-signal/my-checkpoint.done` |
| `DYN_READY_FOR_CHECKPOINT_FILE` | Path where your app signals it's ready | `/tmp/ready-for-checkpoint` |
| `DYN_CHECKPOINT_HASH` | Unique identifier for this checkpoint | `abc123def456` |
| `DYN_CHECKPOINT_LOCATION` | Directory where checkpoint is stored | `/checkpoints/abc123def456` |
...
...
@@ -213,7 +251,7 @@ Add this label to enable DaemonSet checkpoint detection:
```yaml
labels:
  nvidia.com/checkpoint-source: "true"
  nvidia.com/chrek-is-checkpoint-source: "true"
```
### Example Checkpoint Job
...
...
@@ -228,39 +266,26 @@ spec:
  template:
    metadata:
      labels:
        nvidia.com/checkpoint-source: "true"  # Required for DaemonSet detection
        nvidia.com/chrek-is-checkpoint-source: "true"  # Required for DaemonSet detection
        nvidia.com/chrek-checkpoint-hash: "abc123def456"  # Must match DYN_CHECKPOINT_HASH
- Pod has `nvidia.com/checkpoint-source: "true"` label
1. **Ready File & Readiness Probe**: The checkpoint job must have a readiness probe that checks for the ready file. The ChReK DaemonSet triggers checkpointing when:
- Pod has `nvidia.com/chrek-is-checkpoint-source: "true"` label
- Pod status is `Ready` (readiness probe passes = ready file exists)
2. **Restore Marker**: Created by `restore-entrypoint` before CRIU restore, allows the restored process to detect it was restored
2. **Signal-based coordination**: The DaemonSet sends `SIGUSR1` after checkpoint completes and `SIGCONT` after restore completes. Your application must handle these signals (not poll for files).
- **SIGCONT received**: Process was restored, wake model and continue
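A minimal sketch of the application side of this coordination, assuming a Python workload: the handlers only flip flags, and the main loop decides what to do next. `ChrekSignalState` and its method names are hypothetical and not part of ChReK; only the signal meanings come from this guide.

```python
import signal
import time


class ChrekSignalState:
    """Records which ChReK signal arrived; handlers only set flags."""

    def __init__(self) -> None:
        self.checkpoint_done = False  # set when SIGUSR1 arrives
        self.restored = False         # set when SIGCONT arrives

    def install(self) -> None:
        # Python only allows installing handlers from the main thread.
        signal.signal(signal.SIGUSR1, self._on_checkpoint_done)
        signal.signal(signal.SIGCONT, self._on_restored)

    def _on_checkpoint_done(self, signum, frame) -> None:
        self.checkpoint_done = True

    def _on_restored(self, signum, frame) -> None:
        self.restored = True

    def wait(self, poll_seconds: float = 0.1) -> str:
        """Block until one of the signals has arrived and report which."""
        while not (self.checkpoint_done or self.restored):
            time.sleep(poll_seconds)
        return "restored" if self.restored else "checkpoint_done"
```

A checkpoint job would call `install()` once the model is loaded and then `wait()`: on `"checkpoint_done"` the original process can exit, and on `"restored"` the restored copy can wake the model and start serving.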
---
## Step 4: Restore from Checkpoints
Restore pods automatically detect and restore from checkpoints if they exist.
The DaemonSet performs restore externally — your restore pod just needs to be a placeholder that sleeps until the DaemonSet restores the checkpointed process into it.
### Example Restore Pod
...
...
@@ -399,18 +411,26 @@ kind: Pod
metadata:
  name: my-app-restored
  namespace: my-app
  labels:
    nvidia.com/chrek-is-restore-target: "true"  # Required: watcher detects restore pods by this label
    nvidia.com/chrek-checkpoint-hash: "abc123def456"  # Required: watcher uses this to locate the checkpoint
spec:
  restartPolicy: Never
  # Seccomp profile to block io_uring syscalls (deployed by the chrek DaemonSet)
  # Without this, io_uring syscalls may cause CRIU restore failures
  securityContext:
    seccompProfile:
      type: Localhost
      localhostProfile: profiles/block-iouring.json
  containers:
  - name: main
    image: my-app:checkpoint-enabled
    # Security context required for CRIU restore
    securityContext:
      privileged: true
      capabilities:
        add: ["SYS_ADMIN", "SYS_PTRACE", "SYS_CHROOT"]
    # Override command to sleep: the chrek DaemonSet performs external restore
    # on Running-but-not-Ready pods. Without this, the container would cold-start.
    command: ["sleep", "infinity"]
    # Set checkpoint environment variables
    env:
...
...
@@ -419,38 +439,28 @@ spec:
    - name: DYN_CHECKPOINT_PATH
      value: "/checkpoints"  # Base path (hash appended automatically)
    - name: DYN_RESTORE_MARKER_FILE
      value: "/tmp/dynamo-restored"
    # GPU request
    resources:
      limits:
        nvidia.com/gpu: 1
    # Mount checkpoint storage (READ-ONLY is fine for restore)
    # CRIU needs write access for restore.log; do NOT set readOnly
    volumeMounts:
    - name: checkpoint-storage
      mountPath: /checkpoints
      readOnly: true
    - name: checkpoint-signal
      mountPath: /checkpoint-signal
  volumes:
  - name: checkpoint-storage
    persistentVolumeClaim:
      claimName: chrek-pvc
  - name: checkpoint-signal
    hostPath:
      path: /var/lib/chrek/signals
      type: DirectoryOrCreate
```
### How Restore Works
1. **Smart Entrypoint Detects Checkpoint**: The `smart-entrypoint.sh` checks if a checkpoint exists at `/checkpoints/${DYN_CHECKPOINT_HASH}/`
2. **Calls Restore Entrypoint**: If found, calls `/usr/local/bin/restore-entrypoint` which invokes CRIU
3. **CRIU Restores Process**: The entire process tree is restored from the checkpoint, including GPU state
4. **Application Continues**: Your application resumes exactly where it was checkpointed
1. **Pod starts as placeholder**: The `sleep infinity` command keeps the pod Running but not Ready
2. **DaemonSet detects restore pod**: The watcher finds pods with `nvidia.com/chrek-is-restore-target=true` that are Running but not Ready
3. **External restore via nsenter**: The DaemonSet enters the pod's namespaces and performs CRIU restore, including GPU state
4. **Application continues**: Your application resumes exactly where it was checkpointed
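The detection condition in step 2 of the DaemonSet-based flow can be illustrated with a small predicate over a pod object. This is a sketch of the stated condition only, not the watcher's actual code: `is_restore_candidate` is a hypothetical name, and the pod is assumed to be a dictionary shaped like the Kubernetes API's Pod object (`metadata.labels`, `status.phase`, `status.conditions`).

```python
RESTORE_LABEL = "nvidia.com/chrek-is-restore-target"


def is_restore_candidate(pod: dict) -> bool:
    """True for pods labeled as restore targets that are Running but not Ready."""
    labels = (pod.get("metadata") or {}).get("labels") or {}
    status = pod.get("status") or {}
    if labels.get(RESTORE_LABEL) != "true":
        return False
    if status.get("phase") != "Running":
        return False
    # A pod is Ready when its "Ready" condition has status "True".
    ready = any(
        c.get("type") == "Ready" and c.get("status") == "True"
        for c in status.get("conditions") or []
    )
    return not ready
```

This is also a useful mental model when debugging: a placeholder pod that has already become Ready, or that lacks the label, would not match this condition.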
---
...
...
@@ -460,10 +470,9 @@ spec:
| Variable | Required | Description |
|----------|----------|-------------|
| `DYN_CHECKPOINT_SIGNAL_FILE` | Yes | Full path to signal file (e.g., `/checkpoint-signal/my-checkpoint.done`) |
| `DYN_READY_FOR_CHECKPOINT_FILE` | Yes | Full path where app signals readiness (e.g., `/tmp/ready-for-checkpoint`) |
The DaemonSet communicates checkpoint/restore completion via Unix signals, not files:
| Signal | Direction | Meaning |
|--------|-----------|---------|
| `SIGUSR1` | DaemonSet → checkpoint pod | Checkpoint completed, process should exit |
| `SIGCONT` | DaemonSet → restored pod | Restore completed, process should wake up |
| `SIGUSR2` | DaemonSet → checkpoint pod | Checkpoint failed (wake process to continue) |
CRIU tuning options are configured via the ChReK Helm chart's `config.checkpoint.criu` values, not environment variables. See the [Helm Chart Values](https://github.com/ai-dynamo/dynamo/tree/main/deploy/helm/charts/chrek/values.yaml) for available options.