> ⚠️ **Experimental Feature**: ChReK is currently in **beta/preview**. It requires privileged mode for restore operations, which may not be suitable for all production environments. See [Limitations](#limitations) for details.
**ChReK** (Checkpoint/Restore in Kubernetes) is an experimental infrastructure for fast-starting GPU applications using CRIU (Checkpoint/Restore In Userspace). It reduces cold-start times for large models from minutes to seconds by capturing initialized application state and restoring it on demand.
## What is ChReK?
ChReK provides:
- **Fast cold starts**: Restore GPU-accelerated applications in seconds instead of minutes
- **CUDA state preservation**: Checkpoint and restore GPU memory and CUDA contexts
- **Kubernetes-native**: Integrates with standard Kubernetes primitives
- **Storage flexibility**: PVC-based storage (S3/OCI planned for future releases)
- **Namespace isolation**: Each namespace gets its own checkpoint infrastructure
## Use Cases
### 1. With NVIDIA Dynamo Platform (Recommended)
Use ChReK as part of the Dynamo platform for automatic checkpoint management:
- Automatic checkpoint creation and lifecycle management
- Seamless integration with DynamoGraphDeployment CRDs
- Built-in autoscaling with fast restore
📖 **[Read the Dynamo Integration Guide →](dynamo.md)**
### 2. Standalone (Without Dynamo)
Use ChReK independently in your own Kubernetes applications:
- Manual checkpoint job creation
- Build your own restore-enabled container images
- Full control over checkpoint lifecycle
📖 **[Read the Standalone Usage Guide →](standalone.md)**
## Architecture
ChReK consists of two main components:
### 1. ChReK Helm Chart
Deploys the checkpoint/restore infrastructure:
- **DaemonSet**: Runs on GPU nodes to perform CRIU checkpoint operations
- **PVC**: Stores checkpoint data (rootfs diffs, CUDA memory state)
- **RBAC**: Namespace-scoped or cluster-wide permissions
- **Seccomp Profile**: Security policies for CRIU syscalls
### 2. Smart Entrypoint
A wrapper script that chooses between two startup paths:
- **Cold start**: Normal application startup (when no checkpoint exists)
- **Restore**: CRIU restore from checkpoint (when a checkpoint is available)
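The decision described above can be pictured with a minimal sketch. This is not the actual ChReK entrypoint: the `CHECKPOINT_DIR` variable, the `/checkpoints/example` default, and the `ready` marker file are hypothetical names chosen for illustration only.

```bash
# Hypothetical sketch of the cold-start vs. restore decision.
# CHECKPOINT_DIR and the "ready" marker file are assumptions,
# not the paths or layout that ChReK actually uses.

# Print "restore" when a completed checkpoint is present, else "cold-start".
choose_start_mode() {
  checkpoint_dir="$1"
  if [ -d "$checkpoint_dir" ] && [ -f "$checkpoint_dir/ready" ]; then
    echo "restore"
  else
    echo "cold-start"
  fi
}

# Example: report the decision for the configured directory.
mode="$(choose_start_mode "${CHECKPOINT_DIR:-/checkpoints/example}")"
echo "start mode: $mode"
```

A real entrypoint would go on to launch either the application or a CRIU restore based on this result; that part is omitted here because it depends on details this page does not specify.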
## Quick Start
### Install ChReK Infrastructure
```bash
helm install chrek nvidia/chrek \
  --namespace my-team \
  --create-namespace \
  --set storage.pvc.size=100Gi
```
### Choose Your Integration Path
- **Using Dynamo Platform?** → Follow the [Dynamo Integration Guide](dynamo.md)
- **Using standalone?** → Follow the [Standalone Usage Guide](standalone.md)
## Key Features
### ✅ Currently Supported
- ✅ **vLLM backend only** (SGLang and TensorRT-LLM planned)
- ✅ Single-node, single-GPU checkpoints
- ✅ PVC storage backend (RWX for multi-node)
- ✅ CUDA checkpoint/restore
- ✅ PyTorch distributed state (with `GLOO_SOCKET_IFNAME=lo`)
## Limitations
⚠️ **Important**: ChReK has significant limitations that may impact production readiness:
### Security Considerations
- **🔴 Privileged mode required**: Restore pods **must run in privileged mode** for CRIU to function. This grants containers elevated host access and may violate security policies in many production environments.
- **Security Impact**: Privileged containers can:
  - Access all host devices
  - Bypass most security restrictions
  - Potentially compromise node security if the container is exploited
### Technical Limitations
- **vLLM backend only**: Currently only the vLLM backend supports checkpoint/restore. SGLang and TensorRT-LLM support is planned.
- **Single-node only**: Checkpoints must be created and restored on the same node
- **Single-GPU only**: Multi-GPU configurations are not yet supported
- **Network state limitations**: Active TCP connections are closed during restore (use the `tcp-close` CRIU option)
- **Storage**: Only PVC storage is currently implemented (S3/OCI planned)
### Recommendation
ChReK is best suited for:
- ✅ Development and testing environments
- ✅ Research and experimentation
- ✅ Controlled production environments with appropriate security controls
- ❌ Security-sensitive production workloads without proper risk assessment
## Documentation
### Getting Started
- [Dynamo Integration Guide](dynamo.md) - Using ChReK with Dynamo Platform
- [Standalone Usage Guide](standalone.md) - Using ChReK independently
- ChReK Helm Chart README - See `deploy/helm/charts/chrek/README.md` in the repository for Helm chart configuration
<!--
SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
# Checkpoint/Restore for Fast Pod Startup
> ⚠️ **Experimental Feature**: ChReK is currently in **beta/preview**. It requires privileged mode for restore operations. See [Limitations](#limitations) for details.
Reduce cold start times for LLM inference workers from ~3 minutes to ~30 seconds using container checkpointing.
## Overview
Checkpointing captures the complete state of a running worker pod (including GPU memory) and saves it to storage. New pods can restore from this checkpoint instead of performing a full cold start.
| Startup Type | Time | What Happens |
|--------------|------|--------------|
| **Cold Start** | ~3 min | Download model, load to GPU, initialize engine |
| **Warm Start** (checkpoint) | ~30 sec | Restore from checkpoint tar |
## Prerequisites
- Dynamo Platform installed (v0.4.0+)
- ChReK Helm chart installed (separate from platform)
- GPU nodes with CRIU support
- RWX PVC storage (PVC is currently the only supported backend)
## Quick Start
### 1. Install ChReK Infrastructure
First, install the ChReK Helm chart in each namespace where you need checkpointing:
```bash
# Install ChReK infrastructure
helm install chrek nvidia/chrek \
  --namespace my-team \
  --create-namespace \
  --set storage.pvc.size=100Gi
```
This creates:
- A PVC for checkpoint storage (`chrek-pvc`)
- A DaemonSet for CRIU operations (`chrek-agent`)
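One way to confirm that both resources exist is to query them with `kubectl`. The sketch below assumes the resource names listed above (`chrek-pvc`, `chrek-agent`) and the `my-team` namespace from the install command; the function name `verify_chrek_install` is made up for this example.

```bash
# Hypothetical helper: check that the ChReK PVC and DaemonSet exist.
# Resource names are taken from the list above; adjust them if your
# chart values override the defaults.
verify_chrek_install() {
  namespace="${1:-my-team}"
  # Each command exits non-zero if the resource is missing.
  kubectl get pvc chrek-pvc --namespace "$namespace" &&
    kubectl get daemonset chrek-agent --namespace "$namespace"
}

# Usage (requires kubectl and cluster access):
#   verify_chrek_install my-team
```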
### 2. Configure Operator Values
Update your Helm values to point to the ChReK infrastructure:
```yaml
# values.yaml
dynamo-operator:
  checkpoint:
    enabled: true
    storage:
      type: pvc  # Only PVC is currently supported (S3/OCI planned)
      pvc:
        pvcName: "chrek-pvc"  # Must match ChReK chart
        basePath: "/checkpoints"
      signalHostPath: "/var/lib/chrek/signals"  # Must match ChReK chart
```
**Not included in hash** (changing these does not invalidate the checkpoint):
- `replicas`
- `nodeSelector`, `affinity`, `tolerations`
- `resources` (requests/limits)
- Logging/observability config
**Example with all fields:**
```yaml
checkpoint:
  enabled: true
  mode: auto
  identity:
    model: "meta-llama/Llama-3-8B"
    backendFramework: "vllm"
    dynamoVersion: "0.9.0"
    tensorParallelSize: 1
    pipelineParallelSize: 1
    dtype: "bfloat16"
    maxModelLen: 8192
    extraParameters:
      enableChunkedPrefill: "true"
      quantization: "awq"
```
**Checkpoint Naming:** The `DynamoCheckpoint` CR is automatically named using the 16-character identity hash (e.g., `e5962d34ba272638`).
**Checkpoint Sharing:** Multiple DGDs with the same identity automatically share the same checkpoint.
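To illustrate why identical identities map to the same checkpoint name, here is a minimal sketch of a deterministic 16-character hash. The field order, the `key=value` encoding, and the use of SHA-256 are assumptions made for this example; the operator's real algorithm is not described here and may differ, so the output will not match the names the operator generates.

```bash
# Hypothetical sketch: derive a 16-character name from identity fields.
# The encoding and the choice of SHA-256 are assumptions, not the
# operator's actual algorithm. Requires sha256sum (GNU coreutils).
identity_hash() {
  # One key=value pair per line, hashed, truncated to 16 hex characters.
  printf '%s\n' "$@" | sha256sum | cut -c1-16
}

# Same identity fields always give the same name.
name="$(identity_hash \
  model=meta-llama/Llama-3-8B \
  backendFramework=vllm \
  tensorParallelSize=1 \
  dtype=bfloat16)"
echo "$name"
```

Because only identity fields are passed in, settings such as `replicas` never reach the hash, so changing them leaves the name unchanged.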
## DynamoCheckpoint CRD
The `DynamoCheckpoint` (shortname: `dckpt`) is a Kubernetes Custom Resource that manages checkpoint lifecycle.
**When to create a DynamoCheckpoint directly:**
- **Pre-warming:** Create checkpoints before deploying DGDs for instant startup
- **Explicit control:** Manage checkpoint lifecycle independently from DGDs
**Note:** With the new hash-based naming, checkpoint names are automatically generated (16-character hash). The operator handles checkpoint discovery and reuse automatically in `auto` mode.
**Create a checkpoint:**
```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoCheckpoint
metadata:
  name: e5962d34ba272638  # Use the computed 16-char hash
spec:
  identity:
    model: "meta-llama/Llama-3-8B"
    backendFramework: "vllm"
    tensorParallelSize: 1
    dtype: "bfloat16"
  job:
    activeDeadlineSeconds: 3600
    podTemplateSpec:
      spec:
        containers:
          - name: main
            image: nvcr.io/nvidia/ai-dynamo/dynamo-vllm:latest
            command: ["python3", "-m", "dynamo.vllm"]
            args: ["--model", "meta-llama/Llama-3-8B"]
            resources:
              limits:
                nvidia.com/gpu: "1"
            env:
              - name: HF_TOKEN
                valueFrom:
                  secretKeyRef:
                    name: hf-token-secret
                    key: HF_TOKEN
```
**Note:** You can compute the hash yourself, or use `auto` mode to let the operator create it.
> ⚠️ **Experimental Feature**: ChReK is currently in **beta/preview**. It requires privileged mode for restore operations, which may not be suitable for all production environments. Review the [security implications](#security-considerations) before deploying.
This guide explains how to use **ChReK** (Checkpoint/Restore for Kubernetes) as a standalone component without deploying the full Dynamo platform. This is useful if you want to add checkpoint/restore capabilities to your own GPU workloads.
ChReK provides a convenient `placeholder` target in its Dockerfile that automatically injects checkpoint/restore capabilities into your existing container images.
### Quick Start: Using the Placeholder Target (Recommended)
```bash
cd deploy/chrek
# Define your images
export BASE_IMAGE="your-app:latest"  # Your existing application image