Commit f3aa1e01 authored by Julien Mancuso, committed by GitHub

feat: introducing ChReK (Checkpoint Restore in K8s) (#4978)


Signed-off-by: Julien Mancuso <jmancuso@nvidia.com>
parent 44986bf5
@@ -234,6 +234,7 @@ Key customization points include:
- **[Operator Documentation](dynamo-operator.md)** - How the platform works
- **[Service Discovery](service-discovery.md)** - Discovery backends and configuration
- **[Helm Charts](https://github.com/ai-dynamo/dynamo/tree/main/deploy/helm/README.md)** - For advanced users
- **[Checkpointing](/docs/kubernetes/chrek/README.md)** - Fast pod startup with checkpoint/restore
- **[GitOps Deployment with FluxCD](fluxcd.md)** - For advanced users
- **[Logging](observability/logging.md)** - For logging setup
- **[Multinode Deployment](deployment/multinode-deployment.md)** - For multinode deployment
# ChReK: Checkpoint/Restore in Kubernetes
> ⚠️ **Experimental Feature**: ChReK is currently in **beta/preview**. It requires privileged mode for restore operations, which may not be suitable for all production environments. See [Limitations](#limitations) for details.
**ChReK** (Checkpoint/Restore in Kubernetes) is experimental infrastructure for fast-starting GPU applications using CRIU (Checkpoint/Restore In Userspace). ChReK reduces cold-start times for large models from minutes to seconds by capturing initialized application state and restoring it on demand.
## What is ChReK?
ChReK provides:
- **Fast cold starts**: Restore GPU-accelerated applications in seconds instead of minutes
- **CUDA state preservation**: Checkpoint and restore GPU memory and CUDA contexts
- **Kubernetes-native**: Integrates seamlessly with Kubernetes primitives
- **Storage flexibility**: PVC-based storage (S3/OCI planned for future releases)
- **Namespace isolation**: Each namespace gets its own checkpoint infrastructure
## Use Cases
### 1. With NVIDIA Dynamo Platform (Recommended)
Use ChReK as part of the Dynamo platform for automatic checkpoint management:
- Automatic checkpoint creation and lifecycle management
- Seamless integration with DynamoGraphDeployment CRDs
- Built-in autoscaling with fast restore
📖 **[Read the Dynamo Integration Guide →](dynamo.md)**
### 2. Standalone (Without Dynamo)
Use ChReK independently in your own Kubernetes applications:
- Manual checkpoint job creation
- Build your own restore-enabled container images
- Full control over checkpoint lifecycle
📖 **[Read the Standalone Usage Guide →](standalone.md)**
## Architecture
ChReK consists of two main components:
### 1. ChReK Helm Chart
Deploys the checkpoint/restore infrastructure:
- **DaemonSet**: Runs on GPU nodes to perform CRIU checkpoint operations
- **PVC**: Stores checkpoint data (rootfs diffs, CUDA memory state)
- **RBAC**: Namespace-scoped or cluster-wide permissions
- **Seccomp Profile**: Security policies for CRIU syscalls
### 2. Smart Entrypoint
A wrapper script that intelligently decides between:
- **Cold start**: Normal application startup (when no checkpoint exists)
- **Restore**: CRIU restore from checkpoint (when checkpoint available)
## Quick Start
### Install ChReK Infrastructure
```bash
helm install chrek nvidia/chrek \
--namespace my-team \
--create-namespace \
--set storage.pvc.size=100Gi
```
### Choose Your Integration Path
- **Using Dynamo Platform?** → Follow the [Dynamo Integration Guide](dynamo.md)
- **Using standalone?** → Follow the [Standalone Usage Guide](standalone.md)
## Key Features
### ✅ Currently Supported
- **vLLM backend only** (SGLang and TensorRT-LLM planned)
- ✅ Single-node, single-GPU checkpoints
- ✅ PVC storage backend (RWX for multi-node)
- ✅ CUDA checkpoint/restore
- ✅ PyTorch distributed state (with `GLOO_SOCKET_IFNAME=lo`)
- ✅ Namespace-scoped and cluster-wide RBAC
- ✅ Idempotent checkpoint creation
- ✅ Automatic signal-based checkpoint coordination
### 🚧 Planned Features
- 🚧 SGLang backend support
- 🚧 TensorRT-LLM backend support
- 🚧 S3/MinIO storage backend
- 🚧 OCI registry storage backend
- 🚧 Multi-GPU checkpoints
- 🚧 Multi-node distributed checkpoints
## Limitations
⚠️ **Important**: ChReK has significant limitations that may impact production readiness:
### Security Considerations
- **🔴 Privileged mode required**: Restore pods **must run in privileged mode** for CRIU to function. This grants containers elevated host access and may violate security policies in many production environments.
- **Security Impact**: Privileged containers can:
  - Access all host devices
  - Bypass most security restrictions
  - Potentially compromise node security if the container is exploited
### Technical Limitations
- **vLLM backend only**: Currently only the vLLM backend supports checkpoint/restore. SGLang and TensorRT-LLM support is planned.
- **Single-node only**: Checkpoints must be created and restored on the same node
- **Single-GPU only**: Multi-GPU configurations not yet supported
- **Network state limitations**: Active TCP connections are closed during restore (use `tcp-close` CRIU option)
- **Storage**: Only PVC storage is currently implemented (S3/OCI planned)
### Recommendation
ChReK is best suited for:
- ✅ Development and testing environments
- ✅ Research and experimentation
- ✅ Controlled production environments with appropriate security controls
- ❌ Security-sensitive production workloads without proper risk assessment
## Documentation
### Getting Started
- [Dynamo Integration Guide](dynamo.md) - Using ChReK with Dynamo Platform
- [Standalone Usage Guide](standalone.md) - Using ChReK independently
- ChReK Helm Chart README - See `deploy/helm/charts/chrek/README.md` in the repository for Helm chart configuration
### Related Documentation
- [CRIU Documentation](https://criu.org/Main_Page) - Upstream CRIU docs
## Prerequisites
- Kubernetes 1.21+
- GPU nodes with NVIDIA runtime (`nvidia` runtime class)
- CRIU support in container runtime (containerd with CRIU plugin)
- RWX storage class (for multi-node deployments)
- **Permission to run privileged pods** (required for restore operations)
## Troubleshooting
### Common Issues
**DaemonSet not starting?**
- Check GPU node labels: `kubectl get nodes -l nvidia.com/gpu.present=true`
- Verify NVIDIA runtime is available
**Checkpoint fails?**
- Check DaemonSet logs: `kubectl logs -l app.kubernetes.io/name=chrek -n <namespace>`
- Ensure application properly signals readiness
- Verify CRIU is installed in the runtime
**Restore fails?**
- Ensure restore pod uses the same volumes as checkpoint job
- Verify `hostIPC: true` is set (required for CUDA)
- Check for `PSM3_DISABLED=1` and `GLOO_SOCKET_IFNAME=lo` environment variables
For detailed troubleshooting, see:
- [Dynamo Integration Guide - Troubleshooting](dynamo.md#troubleshooting)
- [Standalone Guide - Troubleshooting](standalone.md#troubleshooting)
## Contributing
ChReK is part of the NVIDIA Dynamo project. Contributions are welcome!
## License
Apache License 2.0
<!--
SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
# Checkpoint/Restore for Fast Pod Startup
> ⚠️ **Experimental Feature**: ChReK is currently in **beta/preview**. It requires privileged mode for restore operations. See [Limitations](#limitations) for details.
Reduce cold start times for LLM inference workers from ~3 minutes to ~30 seconds using container checkpointing.
## Overview
Checkpointing captures the complete state of a running worker pod (including GPU memory) and saves it to storage. New pods can restore from this checkpoint instead of performing a full cold start.
| Startup Type | Time | What Happens |
|--------------|------|--------------|
| **Cold Start** | ~3 min | Download model, load to GPU, initialize engine |
| **Warm Start** (checkpoint) | ~30 sec | Restore from checkpoint tar |
## Prerequisites
- Dynamo Platform installed (v0.4.0+)
- ChReK Helm chart installed (separate from platform)
- GPU nodes with CRIU support
- RWX PVC storage (PVC is currently the only supported backend)
## Quick Start
### 1. Install ChReK Infrastructure
First, install the ChReK Helm chart in each namespace where you need checkpointing:
```bash
# Install ChReK infrastructure
helm install chrek nvidia/chrek \
--namespace my-team \
--create-namespace \
--set storage.pvc.size=100Gi
```
This creates:
- A PVC for checkpoint storage (`chrek-pvc`)
- A DaemonSet for CRIU operations (`chrek-agent`)
### 2. Configure Operator Values
Update your Helm values to point to the ChReK infrastructure:
```yaml
# values.yaml
dynamo-operator:
  checkpoint:
    enabled: true
    storage:
      type: pvc # Only PVC is currently supported (S3/OCI planned)
      pvc:
        pvcName: "chrek-pvc" # Must match ChReK chart
        basePath: "/checkpoints"
      signalHostPath: "/var/lib/chrek/signals" # Must match ChReK chart
```
### 3. Configure Your DGD
Add checkpoint configuration to your service:
```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: my-llm
spec:
  services:
    VllmWorker:
      replicas: 1
      extraPodSpec:
        mainContainer:
          image: nvcr.io/nvidia/ai-dynamo/dynamo-vllm:latest
          args:
            - python3 -m dynamo.vllm --model meta-llama/Llama-3-8B
          resources:
            limits:
              nvidia.com/gpu: "1"
      # Checkpoint configuration
      checkpoint:
        enabled: true
        mode: auto # Automatically create checkpoint if not found
        identity:
          model: "meta-llama/Llama-3-8B"
          backendFramework: "vllm"
          tensorParallelSize: 1
          dtype: "bfloat16"
```
### 4. Deploy
```bash
kubectl apply -f my-llm.yaml -n dynamo-system
```
On first deployment:
1. A checkpoint job runs to create the checkpoint
2. Worker pods start with cold start (checkpoint not ready yet)
3. Once checkpoint is ready, new pods (scale-up, restarts) restore from checkpoint
## Storage Backends
### PVC (Currently Supported)
Use when you have RWX storage available (e.g., NFS, EFS, Filestore).
```yaml
checkpoint:
  storage:
    type: pvc
    pvc:
      pvcName: "chrek-pvc"
      basePath: "/checkpoints"
```
**Requirements:**
- RWX (ReadWriteMany) PVC for multi-node access
- Sufficient storage (checkpoints are ~10-50GB per model)
### S3 / MinIO (Planned - Not Yet Implemented)
> ⚠️ **Note:** S3 storage backend is defined in the API but not yet fully implemented.
Object storage support is planned for a future release. The configuration will look like:
```yaml
checkpoint:
  storage:
    type: s3 # Not yet supported
    s3:
      # AWS S3
      uri: "s3://my-bucket/checkpoints"
      # Or MinIO / custom S3 (use instead of the AWS URI above)
      # uri: "s3://minio.example.com/my-bucket/checkpoints"
      # Optional: credentials secret
      credentialsSecretRef: "s3-creds"
```
### OCI Registry (Planned - Not Yet Implemented)
> ⚠️ **Note:** OCI registry storage backend is defined in the API but not yet fully implemented.
Container registry storage support is planned for a future release. The configuration will look like:
```yaml
checkpoint:
  storage:
    type: oci # Not yet supported
    oci:
      uri: "oci://myregistry.io/checkpoints"
      credentialsSecretRef: "registry-creds" # Docker config secret
```
## Checkpoint Modes
### Auto Mode (Recommended)
The operator automatically creates a `DynamoCheckpoint` CR if one doesn't exist:
```yaml
checkpoint:
  enabled: true
  mode: auto
  identity:
    model: "meta-llama/Llama-3-8B"
    backendFramework: "vllm"
    tensorParallelSize: 1
```
### Reference Mode
Reference an existing `DynamoCheckpoint` CR by its 16-character hash using `checkpointRef`:
```yaml
checkpoint:
  enabled: true
  checkpointRef: "e5962d34ba272638" # 16-char hash of DynamoCheckpoint CR
```
This is useful when:
- You want to **pre-warm checkpoints** before creating DGDs
- You want **explicit control** over which checkpoint to use
**Flow:**
1. Create a `DynamoCheckpoint` CR (see [DynamoCheckpoint CRD](#dynamocheckpoint-crd) section)
2. Wait for it to become `Ready`
3. Reference it in your DGD using `checkpointRef` with the hash
```bash
# Check checkpoint status (using 16-char hash name)
kubectl get dynamocheckpoint e5962d34ba272638 -n dynamo-system
# NAME               MODEL                   BACKEND   PHASE   HASH               AGE
# e5962d34ba272638   meta-llama/Llama-3-8B   vllm      Ready   e5962d34ba272638   5m
# Now create DGD referencing it
kubectl apply -f my-dgd.yaml
```
## Checkpoint Identity
Checkpoints are uniquely identified by a **16-character hash** (64 bits, derived from SHA256) of the configuration that affects runtime state:
| Field | Required | Affects Hash | Example |
|-------|----------|-------------|---------|
| `model` | ✓ | ✓ | `meta-llama/Llama-3-8B` |
| `backendFramework` | ✓ | ✓ | `vllm`, `sglang`, `trtllm` |
| `dynamoVersion` | | ✓ | `0.9.0`, `1.0.0` |
| `tensorParallelSize` | | ✓ | `1`, `2`, `4`, `8` (default: 1) |
| `pipelineParallelSize` | | ✓ | `1`, `2` (default: 1) |
| `dtype` | | ✓ | `float16`, `bfloat16`, `fp8` |
| `maxModelLen` | | ✓ | `4096`, `8192` |
| `extraParameters` | | ✓ | Custom key-value pairs |
**Not included in hash** (don't invalidate checkpoint):
- `replicas`
- `nodeSelector`, `affinity`, `tolerations`
- `resources` (requests/limits)
- Logging/observability config
**Example with all fields:**
```yaml
checkpoint:
  enabled: true
  mode: auto
  identity:
    model: "meta-llama/Llama-3-8B"
    backendFramework: "vllm"
    dynamoVersion: "0.9.0"
    tensorParallelSize: 1
    pipelineParallelSize: 1
    dtype: "bfloat16"
    maxModelLen: 8192
    extraParameters:
      enableChunkedPrefill: "true"
      quantization: "awq"
```
**Checkpoint Naming:** The `DynamoCheckpoint` CR is automatically named using the 16-character identity hash (e.g., `e5962d34ba272638`).
**Checkpoint Sharing:** Multiple DGDs with the same identity automatically share the same checkpoint.
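
To illustrate why DGDs with the same identity end up sharing one checkpoint, here is a minimal Python sketch of a deterministic identity hash. It rests on assumptions not stated in this guide: the canonical JSON encoding and the choice of the first 16 hex characters of the SHA256 digest are hypothetical, as is the function name `identity_hash`, so the values it produces should not be expected to match the hash the operator computes.

```python
import hashlib
import json


def identity_hash(identity: dict) -> str:
    """Return a 16-hex-character hash of a checkpoint identity.

    Assumption: the identity is serialized as canonical JSON (sorted keys,
    no extra whitespace), hashed with SHA256, and truncated to 16 hex
    characters. The operator's real encoding may differ.
    """
    canonical = json.dumps(identity, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


base = {"model": "meta-llama/Llama-3-8B", "backendFramework": "vllm", "dtype": "bfloat16"}

# Same fields in a different order: the hash is unchanged, so the checkpoint is shared.
reordered = {"dtype": "bfloat16", "backendFramework": "vllm", "model": "meta-llama/Llama-3-8B"}

# Changing a hashed field (dtype) produces a different checkpoint name.
changed = dict(base, dtype="float16")

print(identity_hash(base) == identity_hash(reordered))
print(identity_hash(base) == identity_hash(changed))
```

Fields such as `replicas` or `resources` are simply left out of the dictionary, which is why changing them does not invalidate a checkpoint.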
## DynamoCheckpoint CRD
The `DynamoCheckpoint` (shortname: `dckpt`) is a Kubernetes Custom Resource that manages checkpoint lifecycle.
**When to create a DynamoCheckpoint directly:**
- **Pre-warming:** Create checkpoints before deploying DGDs for instant startup
- **Explicit control:** Manage checkpoint lifecycle independently from DGDs
**Note:** With the new hash-based naming, checkpoint names are automatically generated (16-character hash). The operator handles checkpoint discovery and reuse automatically in `auto` mode.
**Create a checkpoint:**
```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoCheckpoint
metadata:
  name: e5962d34ba272638 # Use the computed 16-char hash
spec:
  identity:
    model: "meta-llama/Llama-3-8B"
    backendFramework: "vllm"
    tensorParallelSize: 1
    dtype: "bfloat16"
  job:
    activeDeadlineSeconds: 3600
    podTemplateSpec:
      spec:
        containers:
          - name: main
            image: nvcr.io/nvidia/ai-dynamo/dynamo-vllm:latest
            command: ["python3", "-m", "dynamo.vllm"]
            args: ["--model", "meta-llama/Llama-3-8B"]
            resources:
              limits:
                nvidia.com/gpu: "1"
            env:
              - name: HF_TOKEN
                valueFrom:
                  secretKeyRef:
                    name: hf-token-secret
                    key: HF_TOKEN
```
**Note:** You can compute the hash yourself, or use `auto` mode to let the operator create it.
**Check status:**
```bash
# List all checkpoints
kubectl get dynamocheckpoint -n dynamo-system
# Or use shortname
kubectl get dckpt -n dynamo-system
# NAME               MODEL                    BACKEND   PHASE      HASH               AGE
# e5962d34ba272638   meta-llama/Llama-3-8B    vllm      Ready      e5962d34ba272638   5m
# a7b4f89c12de3456   meta-llama/Llama-3-70B   vllm      Creating   a7b4f89c12de3456   2m
```
**Phases:**
| Phase | Description |
|-------|-------------|
| `Pending` | CR created, waiting for job to start |
| `Creating` | Checkpoint job is running |
| `Ready` | Checkpoint available for use |
| `Failed` | Checkpoint creation failed |
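
When scripting around these phases, a small polling helper can wait until a checkpoint reaches a terminal phase (`Ready` or `Failed`). The following is a sketch under stated assumptions: `wait_for_phase` and its parameters are hypothetical names, and `get_phase` is a caller-supplied function (for example, one that wraps a `kubectl get dckpt` call) rather than part of ChReK.

```python
import time
from typing import Callable

# Phases after which the checkpoint job will not change state on its own.
TERMINAL_PHASES = {"Ready", "Failed"}


def wait_for_phase(
    get_phase: Callable[[], str],
    timeout_s: float = 3600.0,
    poll_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll get_phase() until it returns a terminal phase, then return it.

    Raises TimeoutError if no terminal phase is seen within timeout_s.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        phase = get_phase()
        if phase in TERMINAL_PHASES:
            return phase
        if time.monotonic() >= deadline:
            raise TimeoutError(f"checkpoint still in phase {phase!r}")
        sleep(poll_s)


# Usage with a canned sequence of phases instead of a live cluster.
phases = iter(["Pending", "Creating", "Ready"])
print(wait_for_phase(lambda: next(phases), poll_s=0, sleep=lambda _: None))
```

Passing `sleep` as a parameter keeps the helper easy to exercise without real delays.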
**Detailed status:**
```bash
kubectl describe dckpt e5962d34ba272638 -n dynamo-system
```
```yaml
Status:
  Phase: Ready
  IdentityHash: e5962d34ba272638
  Location: /checkpoints/e5962d34ba272638
  StorageType: pvc
  CreatedAt: 2026-01-29T10:05:00Z
```
**Reference from DGD:**
Once the checkpoint is `Ready`, you can reference it by hash:
```yaml
spec:
  services:
    VllmWorker:
      checkpoint:
        enabled: true
        checkpointRef: "e5962d34ba272638" # 16-char hash
```
Or use `auto` mode and the operator will find/create it automatically.
## Limitations
⚠️ **Important**: ChReK has significant limitations that impact production readiness:
### Security Considerations
- **🔴 Privileged mode required**: Restore pods **must run in privileged mode** for CRIU to function
  - Privileged containers have elevated host access, which may violate security policies in many production environments
  - This requirement applies to all worker pods that restore from checkpoints
### Technical Limitations
- **vLLM backend only**: Currently only the vLLM backend supports checkpoint/restore. SGLang and TensorRT-LLM support is planned.
- **Single-node only**: Checkpoints must be created and restored on the same node
- **Single-GPU only**: Multi-GPU configurations are not yet supported
- **Network state**: Active TCP connections are closed during restore (handled with `tcp-close` CRIU option)
- **Storage**: Only PVC backend currently implemented (S3/OCI planned)
### Recommendation
ChReK is **experimental/beta** and best suited for:
- ✅ Development and testing environments
- ✅ Research and experimentation
- ✅ Controlled production environments with appropriate security controls
- ❌ Security-sensitive production workloads without proper risk assessment
## Troubleshooting
### Checkpoint Not Creating
1. Check the checkpoint job:
```bash
kubectl get jobs -l nvidia.com/checkpoint-source=true -n dynamo-system
kubectl logs job/checkpoint-<name> -n dynamo-system
```
2. Check the DaemonSet:
```bash
kubectl logs daemonset/chrek-agent -n dynamo-system
```
3. Verify storage access:
```bash
kubectl exec -it <checkpoint-agent-pod> -n dynamo-system -- ls -la /checkpoints
```
### Restore Failing
1. Check pod logs:
```bash
kubectl logs <worker-pod> -n dynamo-system
```
2. Verify checkpoint file exists:
```bash
# For PVC
kubectl exec -it <any-pod-with-pvc> -n dynamo-system -- ls -la /checkpoints/
# For S3 (planned; not yet supported)
aws s3 ls s3://my-bucket/checkpoints/
```
3. Check environment variables:
```bash
kubectl exec <worker-pod> -n dynamo-system -- env | grep DYN_CHECKPOINT
```
### Cold Start Despite Checkpoint
Pods fall back to cold start if:
- Checkpoint file doesn't exist yet (still being created)
- Checkpoint file is corrupted
- CRIU restore fails
Check logs for "Falling back to cold start" message.
## Best Practices
1. **Use RWX PVCs** for multi-node deployments (currently the only supported backend)
2. **Pre-warm checkpoints** before scaling up
3. **Monitor checkpoint size** - large models create large checkpoints
4. **Clean up old checkpoints** to save storage
## Environment Variables
| Variable | Description |
|----------|-------------|
| `DYN_CHECKPOINT_STORAGE_TYPE` | Backend: `pvc` (`s3` and `oci` are planned) |
| `DYN_CHECKPOINT_LOCATION` | Source location (URI) |
| `DYN_CHECKPOINT_PATH` | Local path to tar file |
| `DYN_CHECKPOINT_HASH` | Identity hash (debugging) |
| `DYN_CHECKPOINT_SIGNAL_FILE` | Signal file (creation mode only) |
## Complete Example
Create a checkpoint and use it in a DGD:
```yaml
# 1. Create the DynamoCheckpoint CR
apiVersion: nvidia.com/v1alpha1
kind: DynamoCheckpoint
metadata:
  name: e5962d34ba272638 # 16-char hash (computed from identity)
  namespace: dynamo-system
spec:
  identity:
    model: "meta-llama/Meta-Llama-3-8B-Instruct"
    backendFramework: "vllm"
    tensorParallelSize: 1
    dtype: "bfloat16"
  job:
    activeDeadlineSeconds: 3600
    backoffLimit: 3
    podTemplateSpec:
      spec:
        containers:
          - name: main
            image: nvcr.io/nvidia/ai-dynamo/dynamo-vllm:latest
            command: ["python3", "-m", "dynamo.vllm"]
            args:
              - "--model"
              - "meta-llama/Meta-Llama-3-8B-Instruct"
              - "--tensor-parallel-size"
              - "1"
              - "--dtype"
              - "bfloat16"
            env:
              - name: HF_TOKEN
                valueFrom:
                  secretKeyRef:
                    name: hf-token-secret
                    key: HF_TOKEN
            resources:
              limits:
                nvidia.com/gpu: "1"
        restartPolicy: Never
---
# 2. Wait for Ready: kubectl get dckpt e5962d34ba272638 -n dynamo-system -w
---
# 3. Reference the checkpoint in your DGD
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: my-llm
  namespace: dynamo-system
spec:
  services:
    VllmWorker:
      replicas: 2
      extraPodSpec:
        mainContainer:
          image: nvcr.io/nvidia/ai-dynamo/dynamo-vllm:latest
          resources:
            limits:
              nvidia.com/gpu: "1"
      checkpoint:
        enabled: true
        checkpointRef: "e5962d34ba272638" # Reference by hash
```
## Related Documentation
- [ChReK Overview](README.md) - ChReK architecture and use cases
- [ChReK Standalone Usage Guide](standalone.md) - Use ChReK without Dynamo Platform
- ChReK Helm Chart README - See `deploy/helm/charts/chrek/README.md` in the repository for chart configuration
- [Installation Guide](../installation-guide.md) - Platform installation
- [API Reference](../api-reference.md) - Complete CRD specifications
# ChReK Standalone Usage Guide
> ⚠️ **Experimental Feature**: ChReK is currently in **beta/preview**. It requires privileged mode for restore operations, which may not be suitable for all production environments. Review the [security implications](#security-considerations) before deploying.
This guide explains how to use **ChReK** (Checkpoint/Restore in Kubernetes) as a standalone component without deploying the full Dynamo platform. This is useful if you want to add checkpoint/restore capabilities to your own GPU workloads.
## Table of Contents
- [Overview](#overview)
- [Prerequisites](#prerequisites)
- [Step 1: Deploy ChReK](#step-1-deploy-chrek)
- [Step 2: Build Checkpoint-Enabled Images](#step-2-build-checkpoint-enabled-images)
- [Step 3: Create Checkpoint Jobs](#step-3-create-checkpoint-jobs)
- [Step 4: Restore from Checkpoints](#step-4-restore-from-checkpoints)
- [Environment Variables Reference](#environment-variables-reference)
- [Checkpoint Flow Explained](#checkpoint-flow-explained)
- [Troubleshooting](#troubleshooting)
---
## Overview
When using ChReK standalone, you are responsible for:
1. **Deploying the ChReK Helm chart** (DaemonSet + PVC)
2. **Building checkpoint-enabled container images** with the restore entrypoint
3. **Creating checkpoint jobs** with the correct environment variables
4. **Creating restore pods** that detect and use the checkpoints
The ChReK DaemonSet handles the actual CRIU checkpoint/restore operations automatically once your pods are configured correctly.
---
## Prerequisites
- Kubernetes cluster with:
  - NVIDIA GPUs with checkpoint support
  - **Privileged security context allowed** (⚠️ required for CRIU - see [Security Considerations](#security-considerations))
  - PVC storage (ReadWriteMany recommended for multi-node)
- Docker or compatible container runtime for building images
- Access to the ChReK source code: `deploy/chrek/`
### Security Considerations
⚠️ **Important**: ChReK restore operations **require privileged mode**, which has significant security implications:
- **Privileged containers** can access all host devices and bypass most security restrictions
- This may violate security policies in production environments
- Privileged containers, if compromised, can potentially compromise node security
**Recommended for:**
- ✅ Development and testing environments
- ✅ Research and experimentation
- ✅ Controlled production environments with appropriate security controls
**Not recommended for:**
- ❌ Multi-tenant clusters without proper isolation
- ❌ Security-sensitive production workloads without risk assessment
- ❌ Environments with strict security compliance requirements
### Technical Limitations
⚠️ **Current Restrictions:**
- **vLLM backend only**: Currently only the vLLM backend supports checkpoint/restore. SGLang and TensorRT-LLM support is planned.
- **Single-node only**: Checkpoints must be created and restored on the same node
- **Single-GPU only**: Multi-GPU configurations are not yet supported
- **Network state**: Active TCP connections are closed during restore
- **Storage**: Only PVC backend currently implemented (S3/OCI planned)
---
## Step 1: Deploy ChReK
### Install the Helm Chart
```bash
# Clone the repository
git clone https://github.com/ai-dynamo/dynamo.git
cd dynamo
# Install ChReK in your namespace
helm install chrek ./deploy/helm/charts/chrek \
--namespace my-app \
--create-namespace \
--set storage.pvc.size=100Gi \
--set storage.pvc.storageClass=your-storage-class
```
### Verify Installation
```bash
# Check the DaemonSet is running
kubectl get daemonset -n my-app
# NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE
# chrek-agent 3 3 3 3 3
# Check the PVC is bound
kubectl get pvc -n my-app
# NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS
# chrek-pvc Bound pvc-xyz 100Gi RWX your-storage-class
```
---
## Step 2: Build Checkpoint-Enabled Images
ChReK provides a convenient `placeholder` target in its Dockerfile that automatically injects checkpoint/restore capabilities into your existing container images.
### Quick Start: Using the Placeholder Target (Recommended)
```bash
cd deploy/chrek
# Define your images
export BASE_IMAGE="your-app:latest" # Your existing application image
export RESTORE_IMAGE="your-app:checkpoint-enabled" # Output checkpoint-enabled image
# Build using the placeholder target
docker build \
--target placeholder \
--build-arg BASE_IMAGE="$BASE_IMAGE" \
-t "$RESTORE_IMAGE" \
.
# Push to your registry
docker push "$RESTORE_IMAGE"
```
**Example with a Dynamo vLLM image:**
```bash
cd deploy/chrek
export DYNAMO_IMAGE="nvidia/dynamo-vllm:v1.2.0"
export RESTORE_IMAGE="nvidia/dynamo-vllm:v1.2.0-checkpoint"
docker build \
--target placeholder \
--build-arg BASE_IMAGE="$DYNAMO_IMAGE" \
-t "$RESTORE_IMAGE" \
.
```
### What the Placeholder Target Does
The ChReK Dockerfile's `placeholder` stage automatically:
- ✅ Builds the restore-entrypoint binary
- ✅ Injects it into `/usr/local/bin/restore-entrypoint`
- ✅ Adds `smart-entrypoint.sh` to `/usr/local/bin/`
- ✅ Sets executable permissions
- ✅ Configures the entrypoint to detect and restore checkpoints
- ✅ Preserves your original application CMD
### Alternative: Manual Multi-Stage Build
If you need more control, you can create your own Dockerfile:
```dockerfile
# Stage 1: Build restore-entrypoint
FROM golang:1.23-alpine AS restore-builder
WORKDIR /build
COPY deploy/chrek/cmd/restore-entrypoint ./cmd/restore-entrypoint
COPY deploy/chrek/pkg ./pkg
COPY deploy/chrek/go.mod deploy/chrek/go.sum ./
RUN go build -o /restore-entrypoint ./cmd/restore-entrypoint
# Stage 2: Your application image
FROM your-base-image:latest
# Copy restore-entrypoint
COPY --from=restore-builder /restore-entrypoint /usr/local/bin/restore-entrypoint
# Copy smart-entrypoint.sh
COPY deploy/chrek/scripts/smart-entrypoint.sh /usr/local/bin/smart-entrypoint.sh
RUN chmod +x /usr/local/bin/smart-entrypoint.sh /usr/local/bin/restore-entrypoint
# Set smart-entrypoint as the default entrypoint
ENTRYPOINT ["/usr/local/bin/smart-entrypoint.sh"]
# Your application command (becomes CMD, can be overridden)
CMD ["python", "your_app.py"]
```
> **💡 Tip**: Using the `placeholder` target is the recommended approach as it's maintained with the ChReK codebase and ensures compatibility.
---
## Step 3: Create Checkpoint Jobs
A checkpoint job loads your application, waits for the ChReK DaemonSet to checkpoint it, and then exits.
### Required Environment Variables
Your checkpoint job MUST set these environment variables:
| Variable | Description | Example |
|----------|-------------|---------|
| `DYN_CHECKPOINT_SIGNAL_FILE` | Path where DaemonSet writes completion signal | `/checkpoint-signal/my-checkpoint.done` |
| `DYN_CHECKPOINT_READY_FILE` | Path where your app signals it's ready | `/tmp/checkpoint-ready` |
| `DYN_CHECKPOINT_HASH` | Unique identifier for this checkpoint | `abc123def456` |
| `DYN_CHECKPOINT_LOCATION` | Directory where checkpoint is stored | `/checkpoints/abc123def456` |
| `DYN_CHECKPOINT_STORAGE_TYPE` | Storage backend type | `pvc` |
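
Because a missing variable can leave a checkpoint job waiting indefinitely, it may help to validate the environment at startup. The following is a minimal sketch; the helper name `missing_checkpoint_vars` is hypothetical and not part of ChReK, and it only checks that the variables in the table above are set to non-empty values.

```python
import os

# Variables the checkpoint job must set, taken from the table above.
REQUIRED_CHECKPOINT_VARS = (
    "DYN_CHECKPOINT_SIGNAL_FILE",
    "DYN_CHECKPOINT_READY_FILE",
    "DYN_CHECKPOINT_HASH",
    "DYN_CHECKPOINT_LOCATION",
    "DYN_CHECKPOINT_STORAGE_TYPE",
)


def missing_checkpoint_vars(env=None) -> list:
    """Return the required variables that are unset or empty, in table order."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_CHECKPOINT_VARS if not env.get(name)]


# Usage: report problems before loading the model.
example_env = {
    "DYN_CHECKPOINT_SIGNAL_FILE": "/checkpoint-signal/my-checkpoint.done",
    "DYN_CHECKPOINT_READY_FILE": "/tmp/checkpoint-ready",
    "DYN_CHECKPOINT_HASH": "abc123def456",
}
print(missing_checkpoint_vars(example_env))
```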
### Required Labels
Add this label to enable DaemonSet checkpoint detection:
```yaml
labels:
  nvidia.com/checkpoint-source: "true"
```
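
According to this guide, the DaemonSet acts on a pod only when it carries this label and reports `Ready`. The sketch below expresses that condition over a pod object in the shape returned by `kubectl get pod -o json`. It is an illustration only: the function name `is_checkpoint_candidate` is hypothetical, and the DaemonSet's actual selection logic is not shown in this document.

```python
CHECKPOINT_LABEL = "nvidia.com/checkpoint-source"


def is_checkpoint_candidate(pod: dict) -> bool:
    """Return True if the pod has the checkpoint label and a Ready=True condition."""
    labels = pod.get("metadata", {}).get("labels") or {}
    if labels.get(CHECKPOINT_LABEL) != "true":
        return False
    conditions = pod.get("status", {}).get("conditions") or []
    # Kubernetes reports readiness as a condition with type "Ready" and status "True".
    return any(
        c.get("type") == "Ready" and c.get("status") == "True" for c in conditions
    )


# Usage with a hand-written pod object.
pod = {
    "metadata": {"labels": {CHECKPOINT_LABEL: "true"}},
    "status": {"conditions": [{"type": "Ready", "status": "True"}]},
}
print(is_checkpoint_candidate(pod))
```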
### Example Checkpoint Job
```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: checkpoint-my-model
  namespace: my-app
spec:
  template:
    metadata:
      labels:
        nvidia.com/checkpoint-source: "true" # Required for DaemonSet detection
    spec:
      restartPolicy: Never
      # Init container to clean up stale signal files
      initContainers:
        - name: cleanup-signal-file
          image: busybox:latest
          command:
            - sh
            - -c
            - |
              rm -f /checkpoint-signal/my-checkpoint.done || true
              echo "Signal file cleanup complete"
          volumeMounts:
            - name: checkpoint-signal
              mountPath: /checkpoint-signal
      containers:
        - name: main
          image: my-app:checkpoint-enabled
          # Security context required for CRIU
          securityContext:
            privileged: true
            capabilities:
              add: ["SYS_ADMIN", "SYS_PTRACE", "SYS_CHROOT"]
          # Readiness probe: Pod becomes Ready when model is loaded
          # This is what triggers the DaemonSet to start checkpointing
          readinessProbe:
            exec:
              command: ["sh", "-c", "cat ${DYN_CHECKPOINT_READY_FILE}"]
            initialDelaySeconds: 15
            periodSeconds: 2
          # Remove liveness/startup probes for checkpoint jobs
          # Model loading can take several minutes
          livenessProbe: null
          startupProbe: null
          # Checkpoint-related environment variables
          env:
            - name: DYN_CHECKPOINT_SIGNAL_FILE
              value: "/checkpoint-signal/my-checkpoint.done"
            - name: DYN_CHECKPOINT_READY_FILE
              value: "/tmp/checkpoint-ready"
            - name: DYN_CHECKPOINT_HASH
              value: "abc123def456"
            - name: DYN_CHECKPOINT_LOCATION
              value: "/checkpoints/abc123def456"
            - name: DYN_CHECKPOINT_STORAGE_TYPE
              value: "pvc"
          # GPU request
          resources:
            limits:
              nvidia.com/gpu: 1
          # Required volume mounts
          volumeMounts:
            - name: checkpoint-storage
              mountPath: /checkpoints
            - name: checkpoint-signal
              mountPath: /checkpoint-signal
            - name: tmp
              mountPath: /tmp
      volumes:
        - name: checkpoint-storage
          persistentVolumeClaim:
            claimName: chrek-pvc
        - name: checkpoint-signal
          hostPath:
            path: /var/lib/chrek/signals
            type: DirectoryOrCreate
        - name: tmp
          emptyDir: {}
```
### Application Code Requirements
Your application must implement the checkpoint flow. Here's the pattern used by Dynamo vLLM:
```python
import os
import time
def main():
    # 1. Check for checkpoint mode
    signal_file = os.environ.get("DYN_CHECKPOINT_SIGNAL_FILE")
    ready_file = os.environ.get("DYN_CHECKPOINT_READY_FILE")
    restore_marker = os.environ.get("DYN_RESTORE_MARKER_FILE", "/tmp/dynamo-restored")
    is_checkpoint_mode = signal_file is not None
    if is_checkpoint_mode:
        print("Checkpoint mode detected")
        # 2. Load your model/application
        model = load_model()
        # 3. Optional: Put model to sleep to reduce memory footprint
        # model.sleep()
        # 4. Write ready file (for application use, not DaemonSet)
        if ready_file:
            with open(ready_file, "w") as f:
                f.write("ready")
            print(f"Wrote checkpoint ready file: {ready_file}")
        # 5. Log readiness messages (helps debugging)
        print("CHECKPOINT_READY: Model loaded, ready for container checkpoint")
        print(f"CHECKPOINT_READY: Waiting for signal file: {signal_file}")
        print(f"CHECKPOINT_READY: Or restore marker file: {restore_marker}")
        # 6. Wait for checkpoint completion OR restore detection
        while True:
            # Check if we've been restored (marker file created by restore entrypoint)
            if os.path.exists(restore_marker):
                print(f"Detected restore from checkpoint (marker: {restore_marker})")
                # Continue with normal application flow
                break
            # Check if checkpoint is complete (signal file created by DaemonSet)
            if os.path.exists(signal_file):
                print(f"Checkpoint signal file detected: {signal_file}")
                print("Checkpoint complete, exiting")
                return  # Exit gracefully
            time.sleep(1)
    # Normal application flow (or post-restore flow)
    run_application()
```
**Important Notes:**
1. **Ready File & Readiness Probe**: The checkpoint job must have a readiness probe that checks for the ready file:
   ```yaml
   readinessProbe:
     exec:
       command: ["sh", "-c", "cat ${DYN_CHECKPOINT_READY_FILE}"]
     initialDelaySeconds: 15
     periodSeconds: 2
   ```

   The ChReK DaemonSet triggers checkpointing when both of the following hold:
   - The pod has the `nvidia.com/checkpoint-source: "true"` label
   - The pod status is `Ready` (the readiness probe passes, which means the ready file exists)
2. **Restore Marker**: `restore-entrypoint` creates this file before the CRIU restore so that the restored process can detect that it was restored.
3. **Two Exit Paths**:
   - **Signal file found**: Checkpoint complete, exit gracefully
   - **Restore marker found**: Process was restored, continue running
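
The two exit paths can also be expressed as a small polling helper that returns which path was taken. The sketch below is illustrative and not part of Dynamo: `WaitOutcome`, `wait_for_checkpoint_or_restore`, and the optional timeout are hypothetical names and behavior added for the example. It checks the restore marker first, in the same order as the loop in the pattern above, so a process that resumes inside the loop after a restore reports `RESTORED` before any timeout is considered.

```python
import os
import time
from enum import Enum
from typing import Optional


class WaitOutcome(Enum):
    CHECKPOINTED = "checkpointed"  # signal file appeared: exit gracefully
    RESTORED = "restored"          # restore marker appeared: keep running
    TIMED_OUT = "timed_out"        # neither file appeared (hypothetical extra path)


def wait_for_checkpoint_or_restore(
    signal_file: str,
    restore_marker: str,
    timeout_s: Optional[float] = None,
    poll_interval_s: float = 1.0,
) -> WaitOutcome:
    """Poll until the signal file or the restore marker exists."""
    deadline = None if timeout_s is None else time.monotonic() + timeout_s
    while True:
        # The restore marker is checked first, as in the original loop.
        if os.path.exists(restore_marker):
            return WaitOutcome.RESTORED
        if os.path.exists(signal_file):
            return WaitOutcome.CHECKPOINTED
        if deadline is not None and time.monotonic() >= deadline:
            return WaitOutcome.TIMED_OUT
        time.sleep(poll_interval_s)
```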
---
## Step 4: Restore from Checkpoints
Restore pods automatically detect and restore from checkpoints if they exist.
### Example Restore Pod
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-app-restored
  namespace: my-app
spec:
  restartPolicy: Never
  containers:
  - name: main
    image: my-app:checkpoint-enabled

    # Security context required for CRIU restore
    securityContext:
      privileged: true
      capabilities:
        add: ["SYS_ADMIN", "SYS_PTRACE", "SYS_CHROOT"]

    # Set checkpoint environment variables
    env:
    - name: DYN_CHECKPOINT_HASH
      value: "abc123def456"  # Must match checkpoint job
    - name: DYN_CHECKPOINT_PATH
      value: "/checkpoints"  # Base path (hash appended automatically)
    # Optional: Customize restore marker file path
    # - name: DYN_RESTORE_MARKER_FILE
    #   value: "/tmp/dynamo-restored"

    # GPU request
    resources:
      limits:
        nvidia.com/gpu: 1

    # Mount checkpoint storage (READ-ONLY is fine for restore)
    volumeMounts:
    - name: checkpoint-storage
      mountPath: /checkpoints
      readOnly: true
    - name: checkpoint-signal
      mountPath: /checkpoint-signal

  volumes:
  - name: checkpoint-storage
    persistentVolumeClaim:
      claimName: chrek-pvc
  - name: checkpoint-signal
    hostPath:
      path: /var/lib/chrek/signals
      type: DirectoryOrCreate
```
### How Restore Works
1. **Smart Entrypoint Detects Checkpoint**: The `smart-entrypoint.sh` checks if a checkpoint exists at `/checkpoints/${DYN_CHECKPOINT_HASH}/`
2. **Calls Restore Entrypoint**: If found, calls `/usr/local/bin/restore-entrypoint` which invokes CRIU
3. **CRIU Restores Process**: The entire process tree is restored from the checkpoint, including GPU state
4. **Application Continues**: Your application resumes exactly where it was checkpointed
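
The decision made in step 1 can be sketched in Python as follows. This is not the actual `smart-entrypoint.sh` logic, which is a shell script; `find_checkpoint` is a hypothetical helper, and treating a `checkpoint.done` file inside the checkpoint directory as the completion marker is an assumption taken from the restore flow diagram in this guide. A return value of `None` corresponds to a cold start with the original command.

```python
from pathlib import Path
from typing import Mapping, Optional


def find_checkpoint(env: Mapping[str, str]) -> Optional[Path]:
    """Return the checkpoint directory to restore from, or None for a cold start."""
    checkpoint_hash = env.get("DYN_CHECKPOINT_HASH")
    base_path = env.get("DYN_CHECKPOINT_PATH")
    if not checkpoint_hash or not base_path:
        return None  # not configured for restore

    # The hash is appended to the base path, e.g. /checkpoints/abc123def456
    checkpoint_dir = Path(base_path) / checkpoint_hash

    # Assumption: a finished checkpoint contains a `checkpoint.done` marker file.
    if (checkpoint_dir / "checkpoint.done").is_file():
        return checkpoint_dir
    return None
```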
---
## Environment Variables Reference
### Checkpoint Jobs
| Variable | Required | Description |
|----------|----------|-------------|
| `DYN_CHECKPOINT_SIGNAL_FILE` | Yes | Full path to signal file (e.g., `/checkpoint-signal/my-checkpoint.done`) |
| `DYN_CHECKPOINT_READY_FILE` | Yes | Full path where app signals readiness (e.g., `/tmp/checkpoint-ready`) |
| `DYN_CHECKPOINT_HASH` | Yes | Unique checkpoint identifier (alphanumeric string) |
| `DYN_CHECKPOINT_LOCATION` | Yes | Directory where checkpoint is stored (e.g., `/checkpoints/abc123`) |
| `DYN_CHECKPOINT_STORAGE_TYPE` | Yes | Storage backend: `pvc`, `s3`, or `oci` |
### Restore Pods
| Variable | Required | Description |
|----------|----------|-------------|
| `DYN_CHECKPOINT_HASH` | Yes | Checkpoint identifier (must match checkpoint job) |
| `DYN_CHECKPOINT_PATH` | Yes | Base checkpoint directory (hash appended automatically) |
| `DYN_RESTORE_MARKER_FILE` | No | Path for restore marker file (default: `/tmp/dynamo-restored`) |
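
A fail-fast check for the required variables in the two tables above might look like the sketch below. The function names are hypothetical and ChReK itself may validate these differently; only the variable names and the three storage types come from the tables.

```python
from typing import List, Mapping, Sequence

CHECKPOINT_JOB_REQUIRED = (
    "DYN_CHECKPOINT_SIGNAL_FILE",
    "DYN_CHECKPOINT_READY_FILE",
    "DYN_CHECKPOINT_HASH",
    "DYN_CHECKPOINT_LOCATION",
    "DYN_CHECKPOINT_STORAGE_TYPE",
)
RESTORE_POD_REQUIRED = ("DYN_CHECKPOINT_HASH", "DYN_CHECKPOINT_PATH")
VALID_STORAGE_TYPES = {"pvc", "s3", "oci"}


def missing_variables(env: Mapping[str, str], required: Sequence[str]) -> List[str]:
    """Names from `required` that are unset or empty in `env`."""
    return [name for name in required if not env.get(name)]


def validate_checkpoint_job_env(env: Mapping[str, str]) -> List[str]:
    """Return human-readable problems; an empty list means the env looks complete."""
    problems = [f"{name} is not set" for name in missing_variables(env, CHECKPOINT_JOB_REQUIRED)]
    storage_type = env.get("DYN_CHECKPOINT_STORAGE_TYPE")
    if storage_type and storage_type not in VALID_STORAGE_TYPES:
        problems.append(f"DYN_CHECKPOINT_STORAGE_TYPE has unknown value {storage_type!r}")
    return problems
```

Calling `validate_checkpoint_job_env(os.environ)` at the top of `main()` would surface a misconfigured job before the model is loaded.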
### Optional CRIU Tuning (Advanced)
| Variable | Default | Description |
|----------|---------|-------------|
| `CRIU_TIMEOUT` | `0` (unlimited) | CRIU operation timeout in seconds |
| `CRIU_LOG_LEVEL` | `4` | CRIU log verbosity (0-4) |
| `CRIU_WORK_DIR` | `/tmp` | CRIU working directory |
| `CUDA_PLUGIN_DIR` | `/usr/local/lib/criu` | Path to CRIU CUDA plugin |
| `CRIU_SKIP_IN_FLIGHT` | `false` | Skip in-flight TCP connections |
| `CRIU_AUTO_DEDUP` | `false` | Enable auto-deduplication |
| `CRIU_LAZY_PAGES` | `false` | Enable lazy page migration (experimental) |
| `WAIT_FOR_CHECKPOINT` | `false` | Wait for checkpoint to appear before starting |
| `RESTORE_WAIT_TIMEOUT` | `300` | Max seconds to wait for checkpoint |
| `DEBUG` | `false` | Enable debug mode (sleeps 300s on error) |
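
To make the defaults concrete, the sketch below collects the tuning variables into a typed structure. `CriuTuning` and `parse_bool` are hypothetical names, and the boolean parsing convention is an assumption: the restore entrypoint may accept different spellings. Only the variable names and default values are taken from the table above.

```python
from dataclasses import dataclass
from typing import Mapping


def parse_bool(value: str) -> bool:
    """Assumed convention: "true", "1", or "yes" (case-insensitive) enable a flag."""
    return value.strip().lower() in {"true", "1", "yes"}


@dataclass(frozen=True)
class CriuTuning:
    # Defaults copied from the table above.
    timeout_s: int = 0  # 0 means unlimited
    log_level: int = 4
    work_dir: str = "/tmp"
    cuda_plugin_dir: str = "/usr/local/lib/criu"
    skip_in_flight: bool = False
    auto_dedup: bool = False
    lazy_pages: bool = False
    wait_for_checkpoint: bool = False
    restore_wait_timeout_s: int = 300
    debug: bool = False

    @classmethod
    def from_env(cls, env: Mapping[str, str]) -> "CriuTuning":
        def flag(name: str) -> bool:
            return parse_bool(env.get(name, "false"))

        tuning = cls(
            timeout_s=int(env.get("CRIU_TIMEOUT", "0")),
            log_level=int(env.get("CRIU_LOG_LEVEL", "4")),
            work_dir=env.get("CRIU_WORK_DIR", "/tmp"),
            cuda_plugin_dir=env.get("CUDA_PLUGIN_DIR", "/usr/local/lib/criu"),
            skip_in_flight=flag("CRIU_SKIP_IN_FLIGHT"),
            auto_dedup=flag("CRIU_AUTO_DEDUP"),
            lazy_pages=flag("CRIU_LAZY_PAGES"),
            wait_for_checkpoint=flag("WAIT_FOR_CHECKPOINT"),
            restore_wait_timeout_s=int(env.get("RESTORE_WAIT_TIMEOUT", "300")),
            debug=flag("DEBUG"),
        )
        # The table documents log levels 0 through 4.
        if not 0 <= tuning.log_level <= 4:
            raise ValueError("CRIU_LOG_LEVEL must be between 0 and 4")
        return tuning
```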
---
## Checkpoint Flow Explained
### 1. Checkpoint Creation Flow
```
┌─────────────────────────────────────────────────────────────┐
│  1. Pod starts with nvidia.com/checkpoint-source=true label │
└──────────────────────┬──────────────────────────────────────┘
                       │
                       ▼
┌─────────────────────────────────────────────────────────────┐
│  2. Application loads model and creates ready file          │
│     /tmp/checkpoint-ready                                   │
└──────────────────────┬──────────────────────────────────────┘
                       │
                       ▼
┌─────────────────────────────────────────────────────────────┐
│  3. Pod becomes Ready (kubelet readiness probe passes)      │
└──────────────────────┬──────────────────────────────────────┘
                       │
                       ▼
┌─────────────────────────────────────────────────────────────┐
│  4. ChReK DaemonSet detects:                                │
│     - Pod is Ready                                          │
│     - Has checkpoint-source label                           │
│     - Ready file exists: /tmp/checkpoint-ready              │
└──────────────────────┬──────────────────────────────────────┘
                       │
                       ▼
┌─────────────────────────────────────────────────────────────┐
│  5. DaemonSet executes CRIU checkpoint via runc:            │
│     - Freezes container process                             │
│     - Dumps memory (CPU + GPU)                              │
│     - Saves to /checkpoints/${HASH}/                        │
└──────────────────────┬──────────────────────────────────────┘
                       │
                       ▼
┌─────────────────────────────────────────────────────────────┐
│  6. DaemonSet writes signal file:                           │
│     /checkpoint-signal/${HASH}.done                         │
└──────────────────────┬──────────────────────────────────────┘
                       │
                       ▼
┌─────────────────────────────────────────────────────────────┐
│  7. Application detects signal file and exits gracefully    │
└─────────────────────────────────────────────────────────────┘
```
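
Step 4 of the flow above amounts to a predicate over the pod object. The sketch below evaluates it on a pod in the JSON shape returned by `kubectl get pod <pod-name> -o json`. `is_checkpoint_candidate` is a hypothetical helper, not the DaemonSet's actual code; it relies on pod readiness as a proxy for the ready file, since the readiness probe is what checks that file.

```python
from typing import Any, Mapping

CHECKPOINT_SOURCE_LABEL = "nvidia.com/checkpoint-source"


def is_checkpoint_candidate(pod: Mapping[str, Any]) -> bool:
    """True if the pod carries the checkpoint-source label and is Ready."""
    labels = pod.get("metadata", {}).get("labels") or {}
    if labels.get(CHECKPOINT_SOURCE_LABEL) != "true":
        return False

    # Kubernetes reports condition status as the strings "True", "False", or "Unknown".
    conditions = pod.get("status", {}).get("conditions") or []
    return any(
        c.get("type") == "Ready" and c.get("status") == "True" for c in conditions
    )
```

A pod that has the label but is not yet Ready is skipped, which is why the ready file must only be written after the model has finished loading.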
### 2. Restore Flow
```
┌─────────────────────────────────────────────────────────────┐
│  1. Pod starts with DYN_CHECKPOINT_HASH set                 │
└──────────────────────┬──────────────────────────────────────┘
                       │
                       ▼
┌─────────────────────────────────────────────────────────────┐
│  2. smart-entrypoint.sh checks for checkpoint:              │
│     /checkpoints/${DYN_CHECKPOINT_HASH}/checkpoint.done     │
└──────────────────────┬──────────────────────────────────────┘
                       │
                       ├─ Not Found ─────────────────┐
                       │                             │
                       ▼                             ▼
            ┌───────────────────────┐     ┌──────────────────────┐
            │ Checkpoint exists     │     │ Cold start           │
            └──────────┬────────────┘     │ Run original CMD     │
                       │                  └──────────────────────┘
                       ▼
┌─────────────────────────────────────────────────────────────┐
│  3. Call restore-entrypoint with checkpoint path            │
└──────────────────────┬──────────────────────────────────────┘
                       │
                       ▼
┌─────────────────────────────────────────────────────────────┐
│  4. restore-entrypoint extracts checkpoint and calls CRIU:  │
│     criu restore --images-dir /checkpoints/${HASH}/images   │
└──────────────────────┬──────────────────────────────────────┘
                       │
                       ▼
┌─────────────────────────────────────────────────────────────┐
│  5. CRIU restores process from checkpoint                   │
│     - Restores memory (CPU + GPU)                           │
│     - Restores file descriptors                             │
│     - Resumes process execution                             │
└──────────────────────┬──────────────────────────────────────┘
                       │
                       ▼
┌─────────────────────────────────────────────────────────────┐
│  6. Application continues from checkpointed state           │
│     (Model already loaded, GPU memory initialized)          │
└─────────────────────────────────────────────────────────────┘
```
---
## Troubleshooting
### Checkpoint Not Created
**Symptom**: Job runs but no checkpoint appears in `/checkpoints/`
**Checks**:
1. Verify the pod has the label:
```bash
kubectl get pod <pod-name> -o jsonpath='{.metadata.labels.nvidia\.com/checkpoint-source}'
```
2. Check pod readiness:
```bash
kubectl get pod <pod-name> -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}'
```
3. Check ready file was created:
```bash
kubectl exec <pod-name> -- ls -la /tmp/checkpoint-ready
```
4. Check DaemonSet logs:
```bash
kubectl logs -n my-app daemonset/chrek-agent --all-containers
```
### Restore Fails
**Symptom**: Pod fails to restore from checkpoint
**Checks**:
1. Verify checkpoint files exist:
```bash
kubectl exec <pod-name> -- ls -la /checkpoints/${DYN_CHECKPOINT_HASH}/
```
2. Check privileged mode is enabled:
```bash
kubectl get pod <pod-name> -o jsonpath='{.spec.containers[0].securityContext.privileged}'
```
3. Check CRIU logs in `/tmp/criu-restore.log`:
```bash
kubectl exec <pod-name> -- cat /tmp/criu-restore.log
```
4. Ensure the checkpoint job and the restore pod use the same:
   - Container image
   - GPU count
   - Volume mounts
   - Environment variables (except POD_NAME, POD_IP, etc.)
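
Check 4 can be partly automated by diffing the two container specs. The sketch below is a hypothetical helper, not a ChReK tool. It takes the `spec.containers[0]` objects of the checkpoint and restore pods as dictionaries (for example from `kubectl get pod <pod-name> -o json`), and the default ignore list contains only the two variables named above. In practice more names probably need to be ignored, such as the `DYN_*` variables that legitimately differ between checkpoint jobs and restore pods.

```python
from typing import Any, Dict, FrozenSet, List, Mapping, Optional, Set

DEFAULT_IGNORED_ENV = frozenset({"POD_NAME", "POD_IP"})


def _env_map(container: Mapping[str, Any], ignored: FrozenSet[str]) -> Dict[str, Optional[str]]:
    # Entries that use `valueFrom` have no literal value and compare as None.
    return {
        e["name"]: e.get("value")
        for e in container.get("env", [])
        if e["name"] not in ignored
    }


def _gpu_count(container: Mapping[str, Any]) -> str:
    limits = container.get("resources", {}).get("limits", {})
    return str(limits.get("nvidia.com/gpu", "0"))


def _mount_paths(container: Mapping[str, Any]) -> Set[str]:
    # Only mount paths are compared; readOnly may differ between the two pods.
    return {m["mountPath"] for m in container.get("volumeMounts", [])}


def container_mismatches(
    checkpoint: Mapping[str, Any],
    restore: Mapping[str, Any],
    ignored_env: FrozenSet[str] = DEFAULT_IGNORED_ENV,
) -> List[str]:
    """List the categories in which the two container specs differ."""
    problems = []
    if checkpoint.get("image") != restore.get("image"):
        problems.append("container image differs")
    if _gpu_count(checkpoint) != _gpu_count(restore):
        problems.append("GPU count differs")
    if _mount_paths(checkpoint) != _mount_paths(restore):
        problems.append("volume mounts differ")
    if _env_map(checkpoint, ignored_env) != _env_map(restore, ignored_env):
        problems.append("environment variables differ")
    return problems
```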
### Permission Denied Errors
**Symptom**: `CRIU: Permission denied` or `Operation not permitted`
**Solution**: Ensure pod has:
```yaml
securityContext:
  privileged: true
  capabilities:
    add:
    - SYS_ADMIN
    - SYS_PTRACE
    - SYS_CHROOT
```
### Signal File Not Appearing
**Symptom**: Application waits forever for signal file
**Checks**:
1. Verify hostPath mount is correct:
```bash
kubectl get pod <pod-name> -o jsonpath='{.spec.volumes[?(@.name=="checkpoint-signal")]}'
```
2. Check DaemonSet has access to the same path:
```bash
kubectl get daemonset -n my-app chrek-agent -o jsonpath='{.spec.template.spec.volumes[?(@.name=="signal-dir")]}'
```
3. Verify paths match exactly:
   - Pod: `/var/lib/chrek/signals`
   - DaemonSet: `/var/lib/chrek/signals`
---
## Additional Resources
- [ChReK Helm Chart Values](../../deploy/helm/charts/chrek/values.yaml)
- [Smart Entrypoint Script](../../deploy/chrek/scripts/smart-entrypoint.sh)
- [CRIU Documentation](https://criu.org/Main_Page)
- [CUDA Checkpoint Plugin](https://docs.nvidia.com/cuda/cuda-checkpoint-plugin/)
---
## Getting Help
If you encounter issues:
1. Check the [Troubleshooting](#troubleshooting) section
2. Review DaemonSet logs: `kubectl logs -n <namespace> daemonset/chrek-agent`
3. Open an issue on [GitHub](https://github.com/ai-dynamo/dynamo/issues)
......@@ -54,6 +54,14 @@ navigation:
path: ../pages/kubernetes/webhooks.md
- page: Autoscaling
path: ../pages/kubernetes/autoscaling.md
- section: Checkpointing (ChReK)
contents:
- page: Overview
path: ../pages/kubernetes/chrek/README.md
- page: Integration with Dynamo
path: ../pages/kubernetes/chrek/dynamo.md
- page: Standalone Usage
path: ../pages/kubernetes/chrek/standalone.md
- section: Observability (K8s)
contents:
- page: Metrics
......
......@@ -4,7 +4,9 @@
 use anyhow::Result;
 use k8s_openapi::api::discovery::v1::EndpointSlice;
 use std::collections::hash_map::DefaultHasher;
+use std::fs;
 use std::hash::{Hash, Hasher};
+use std::path::Path;
 /// Hash a pod name to get a consistent instance ID
 pub fn hash_pod_name(pod_name: &str) -> u64 {
......@@ -57,24 +59,61 @@ pub(super) struct PodInfo {
     pub system_port: u16,
 }
+/// Default path for Kubernetes Downward API volume mount
+const DEFAULT_PODINFO_PATH: &str = "/etc/podinfo";
 impl PodInfo {
-    /// Discover pod information from environment variables
+    /// Read a value from a Downward API file, falling back to environment variable
+    fn read_from_file_or_env(file_path: &Path, env_var: &str) -> Option<String> {
+        // First try reading from file (Downward API volume mount)
+        // This is preferred after CRIU restore since env vars contain stale values
+        if let Ok(content) = fs::read_to_string(file_path) {
+            let value = content.trim().to_string();
+            if !value.is_empty() {
+                return Some(value);
+            }
+        }
+        // Fall back to environment variable
+        std::env::var(env_var).ok()
+    }
+    /// Discover pod information from Kubernetes Downward API volume mounts or environment variables
     ///
-    /// Required environment variables:
+    /// This function first attempts to read pod identity from Downward API volume mounts
+    /// at /etc/podinfo/{pod_name, pod_uid, pod_namespace}. This is critical for CRIU
+    /// checkpoint/restore scenarios where environment variables contain stale values
+    /// from the checkpoint source pod.
+    ///
+    /// If the Downward API files are not available, falls back to environment variables:
     /// - `POD_NAME`: Name of the pod (required)
     /// - `POD_UID`: UID of the pod (required for CR owner reference)
     /// - `POD_NAMESPACE`: Namespace of the pod (defaults to "default")
     pub fn from_env() -> Result<Self> {
-        let pod_name = std::env::var("POD_NAME")
-            .map_err(|_| anyhow::anyhow!("POD_NAME environment variable not set"))?;
-        let pod_uid = std::env::var("POD_UID")
-            .map_err(|_| anyhow::anyhow!("POD_UID environment variable not set"))?;
-        let pod_namespace = std::env::var("POD_NAMESPACE").unwrap_or_else(|_| {
-            tracing::warn!("POD_NAMESPACE not set, defaulting to 'default'");
-            "default".to_string()
-        });
+        let podinfo_path = Path::new(DEFAULT_PODINFO_PATH);
+        let pod_name = Self::read_from_file_or_env(&podinfo_path.join("pod_name"), "POD_NAME")
+            .ok_or_else(|| anyhow::anyhow!("POD_NAME not available from file or environment"))?;
+        let pod_uid = Self::read_from_file_or_env(&podinfo_path.join("pod_uid"), "POD_UID")
+            .ok_or_else(|| anyhow::anyhow!("POD_UID not available from file or environment"))?;
+        let pod_namespace =
+            Self::read_from_file_or_env(&podinfo_path.join("pod_namespace"), "POD_NAMESPACE")
+                .unwrap_or_else(|| {
+                    tracing::warn!("POD_NAMESPACE not set, defaulting to 'default'");
+                    "default".to_string()
+                });
+        // Log where we got the pod info from for debugging
+        if podinfo_path.join("pod_name").exists() {
+            tracing::info!(
+                "Pod identity loaded from Downward API volume mount at {}",
+                DEFAULT_PODINFO_PATH
+            );
+        } else {
+            tracing::info!("Pod identity loaded from environment variables");
+        }
         // Read system server port from config
         let config = crate::config::RuntimeConfig::from_settings().unwrap_or_default();
......