feat: introducing ChReK (Checkpoint Restore in K8s) (#4978)

Signed-off-by: Julien Mancuso <jmancuso@nvidia.com>

Commit f3aa1e01 (unverified), authored Feb 03, 2026 by Julien Mancuso; committed by GitHub Feb 03, 2026
6 changed files
--- a/fern/pages/kubernetes/README.md
+++ b/fern/pages/kubernetes/README.md
@@ -234,6 +234,7 @@ Key customization points include:
 - **[Operator Documentation](dynamo-operator.md)** - How the platform works
 - **[Service Discovery](service-discovery.md)** - Discovery backends and configuration
 - **[Helm Charts](https://github.com/ai-dynamo/dynamo/tree/main/deploy/helm/README.md)** - For advanced users
+- **[Checkpointing](/docs/kubernetes/chrek/README.md)** - Fast pod startup with checkpoint/restore
 - **[GitOps Deployment with FluxCD](fluxcd.md)** - For advanced users
 - **[Logging](observability/logging.md)** - For logging setup
 - **[Multinode Deployment](deployment/multinode-deployment.md)** - For multinode deployment

--- a/fern/pages/kubernetes/chrek/README.md
+++ b/fern/pages/kubernetes/chrek/README.md
+# ChReK: Checkpoint/Restore in Kubernetes
+
+> ⚠️ **Experimental Feature**: ChReK is currently in **beta/preview**. It requires privileged mode for restore operations, which may not be suitable for all production environments. See [Limitations](#limitations) for details.
+
+**ChReK** (Checkpoint/Restore in Kubernetes) is experimental infrastructure for fast-starting GPU applications using CRIU (Checkpoint/Restore In Userspace). By capturing initialized application state and restoring it on demand, ChReK reduces cold-start times for large models from minutes to seconds.
+
+## What is ChReK?
+
+ChReK provides:
+- **Fast cold starts**: Restore GPU-accelerated applications in seconds instead of minutes
+- **CUDA state preservation**: Checkpoint and restore GPU memory and CUDA contexts
+- **Kubernetes-native**: Integrates seamlessly with Kubernetes primitives
+- **Storage flexibility**: PVC-based storage (S3/OCI planned for future releases)
+- **Namespace isolation**: Each namespace gets its own checkpoint infrastructure
+
+## Use Cases
+
+### 1. With NVIDIA Dynamo Platform (Recommended)
+
+Use ChReK as part of the Dynamo platform for automatic checkpoint management:
+- Automatic checkpoint creation and lifecycle management
+- Seamless integration with DynamoGraphDeployment CRDs
+- Built-in autoscaling with fast restore
+
+📖 **[Read the Dynamo Integration Guide →](dynamo.md)**
+
+### 2. Standalone (Without Dynamo)
+
+Use ChReK independently in your own Kubernetes applications:
+- Manual checkpoint job creation
+- Build your own restore-enabled container images
+- Full control over checkpoint lifecycle
+
+📖 **[Read the Standalone Usage Guide →](standalone.md)**
+
+## Architecture
+
+ChReK consists of two main components:
+
+### 1. ChReK Helm Chart
+Deploys the checkpoint/restore infrastructure:
+- **DaemonSet**: Runs on GPU nodes to perform CRIU checkpoint operations
+- **PVC**: Stores checkpoint data (rootfs diffs, CUDA memory state)
+- **RBAC**: Namespace-scoped or cluster-wide permissions
+- **Seccomp Profile**: Security policies for CRIU syscalls
+
+### 2. Smart Entrypoint
+A wrapper script that chooses between:
+- **Cold start**: Normal application startup (when no checkpoint exists)
+- **Restore**: CRIU restore from checkpoint (when checkpoint available)
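The decision described above can be sketched in Python. This is an illustration only: the shipped entrypoint is a shell script whose logic is not shown here, and both the function name `choose_startup_mode` and the use of `DYN_CHECKPOINT_PATH` as the trigger are assumptions.

```python
import os


def choose_startup_mode(env=None):
    """Return "restore" if a checkpoint path is set and exists, else "cold-start"."""
    env = os.environ if env is None else env
    # Assumption: DYN_CHECKPOINT_PATH points at the checkpoint on local storage.
    path = env.get("DYN_CHECKPOINT_PATH", "")
    if path and os.path.exists(path):
        return "restore"
    # No checkpoint configured, or it is not on disk yet: start normally.
    return "cold-start"
```

Calling `choose_startup_mode()` with no arguments reads the process environment; passing a dict makes the decision easy to exercise without touching real environment variables.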
+
+## Quick Start
+
+### Install ChReK Infrastructure
+
+```bash
+helm install chrek nvidia/chrek \
+  --namespace my-team \
+  --create-namespace \
+  --set storage.pvc.size=100Gi
+```
+
+### Choose Your Integration Path
+
+- **Using Dynamo Platform?** → Follow the [Dynamo Integration Guide](dynamo.md)
+- **Using standalone?** → Follow the [Standalone Usage Guide](standalone.md)
+
+## Key Features
+
+### ✅ Currently Supported
+- ✅ **vLLM backend only** (SGLang and TensorRT-LLM planned)
+- ✅ Single-node, single-GPU checkpoints
+- ✅ PVC storage backend (RWX for multi-node)
+- ✅ CUDA checkpoint/restore
+- ✅ PyTorch distributed state (with `GLOO_SOCKET_IFNAME=lo`)
+- ✅ Namespace-scoped and cluster-wide RBAC
+- ✅ Idempotent checkpoint creation
+- ✅ Automatic signal-based checkpoint coordination
+
+### 🚧 Planned Features
+- 🚧 SGLang backend support
+- 🚧 TensorRT-LLM backend support
+- 🚧 S3/MinIO storage backend
+- 🚧 OCI registry storage backend
+- 🚧 Multi-GPU checkpoints
+- 🚧 Multi-node distributed checkpoints
+
+## Limitations
+
+⚠️ **Important**: ChReK has significant limitations that may impact production readiness:
+
+### Security Considerations
+- **🔴 Privileged mode required**: Restore pods **must run in privileged mode** for CRIU to function. This grants containers elevated host access and may violate security policies in many production environments.
+- **Security Impact**: Privileged containers can:
+  - Access all host devices
+  - Bypass most security restrictions
+  - Potentially compromise node security if the container is exploited
+
+### Technical Limitations
+- **vLLM backend only**: Currently only the vLLM backend supports checkpoint/restore. SGLang and TensorRT-LLM support is planned.
+- **Single-node only**: Checkpoints must be created and restored on the same node
+- **Single-GPU only**: Multi-GPU configurations not yet supported
+- **Network state limitations**: Active TCP connections are closed during restore (use `tcp-close` CRIU option)
+- **Storage**: Only PVC storage is currently implemented (S3/OCI planned)
+
+### Recommendation
+ChReK is best suited for:
+- ✅ Development and testing environments
+- ✅ Research and experimentation
+- ✅ Controlled production environments with appropriate security controls
+- ❌ Security-sensitive production workloads without proper risk assessment
+
+## Documentation
+
+### Getting Started
+- [Dynamo Integration Guide](dynamo.md) - Using ChReK with Dynamo Platform
+- [Standalone Usage Guide](standalone.md) - Using ChReK independently
+- ChReK Helm Chart README - See `deploy/helm/charts/chrek/README.md` in the repository for Helm chart configuration
+
+### Related Documentation
+- [CRIU Documentation](https://criu.org/Main_Page) - Upstream CRIU docs
+
+## Prerequisites
+
+- Kubernetes 1.21+
+- GPU nodes with NVIDIA runtime (`nvidia` runtime class)
+- CRIU support in container runtime (containerd with CRIU plugin)
+- RWX storage class (for multi-node deployments)
+- **Security clearance for privileged pods** (required for restore operations)
+
+## Troubleshooting
+
+### Common Issues
+
+**DaemonSet not starting?**
+- Check GPU node labels: `kubectl get nodes -l nvidia.com/gpu.present=true`
+- Verify NVIDIA runtime is available
+
+**Checkpoint fails?**
+- Check DaemonSet logs: `kubectl logs -l app.kubernetes.io/name=chrek -n <namespace>`
+- Ensure application properly signals readiness
+- Verify CRIU is installed in the runtime
+
+**Restore fails?**
+- Ensure restore pod uses the same volumes as checkpoint job
+- Verify `hostIPC: true` is set (required for CUDA)
+- Check for `PSM3_DISABLED=1` and `GLOO_SOCKET_IFNAME=lo` environment variables
+
+For detailed troubleshooting, see:
+- [Dynamo Integration Guide - Troubleshooting](dynamo.md#troubleshooting)
+- [Standalone Guide - Troubleshooting](standalone.md#troubleshooting)
+
+## Contributing
+
+ChReK is part of the NVIDIA Dynamo project. Contributions are welcome!
+
+## License
+
+Apache License 2.0
--- a/fern/pages/kubernetes/chrek/dynamo.md
+++ b/fern/pages/kubernetes/chrek/dynamo.md
+<!--
+SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+SPDX-License-Identifier: Apache-2.0
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+# Checkpoint/Restore for Fast Pod Startup
+
+> ⚠️ **Experimental Feature**: ChReK is currently in **beta/preview**. It requires privileged mode for restore operations. See [Limitations](#limitations) for details.
+
+Reduce cold start times for LLM inference workers from ~3 minutes to ~30 seconds using container checkpointing.
+
+## Overview
+
+Checkpointing captures the complete state of a running worker pod (including GPU memory) and saves it to storage. New pods can restore from this checkpoint instead of performing a full cold start.
+
+| Startup Type | Time | What Happens |
+|--------------|------|--------------|
+| **Cold Start** | ~3 min | Download model, load to GPU, initialize engine |
+| **Warm Start** (checkpoint) | ~30 sec | Restore from checkpoint tar |
+
+## Prerequisites
+
+- Dynamo Platform installed (v0.4.0+)
+- ChReK Helm chart installed (separate from platform)
+- GPU nodes with CRIU support
+- RWX PVC storage (PVC is currently the only supported backend)
+
+## Quick Start
+
+### 1. Install ChReK Infrastructure
+
+First, install the ChReK Helm chart in each namespace where you need checkpointing:
+
+```bash
+# Install ChReK infrastructure
+helm install chrek nvidia/chrek \
+  --namespace my-team \
+  --create-namespace \
+  --set storage.pvc.size=100Gi
+```
+
+This creates:
+- A PVC for checkpoint storage (`chrek-pvc`)
+- A DaemonSet for CRIU operations (`chrek-agent`)
+
+### 2. Configure Operator Values
+
+Update your Helm values to point to the ChReK infrastructure:
+
+```yaml
+# values.yaml
+dynamo-operator:
+  checkpoint:
+    enabled: true
+    storage:
+      type: pvc  # Only PVC is currently supported (S3/OCI planned)
+      pvc:
+        pvcName: "chrek-pvc"  # Must match ChReK chart
+        basePath: "/checkpoints"
+      signalHostPath: "/var/lib/chrek/signals"  # Must match ChReK chart
+```
+
+### 3. Configure Your DGD
+
+Add checkpoint configuration to your service:
+
+```yaml
+apiVersion: nvidia.com/v1alpha1
+kind: DynamoGraphDeployment
+metadata:
+  name: my-llm
+spec:
+  services:
+    VllmWorker:
+      replicas: 1
+      extraPodSpec:
+        mainContainer:
+          image: nvcr.io/nvidia/ai-dynamo/dynamo-vllm:latest
+          args:
+            - python3 -m dynamo.vllm --model meta-llama/Llama-3-8B
+      resources:
+        limits:
+          nvidia.com/gpu: "1"
+
+      # Checkpoint configuration
+      checkpoint:
+        enabled: true
+        mode: auto  # Automatically create checkpoint if not found
+        identity:
+          model: "meta-llama/Llama-3-8B"
+          backendFramework: "vllm"
+          tensorParallelSize: 1
+          dtype: "bfloat16"
+```
+
+### 4. Deploy
+
+```bash
+kubectl apply -f my-llm.yaml -n dynamo-system
+```
+
+On first deployment:
+1. A checkpoint job runs to create the checkpoint
+2. Worker pods start with cold start (checkpoint not ready yet)
+3. Once checkpoint is ready, new pods (scale-up, restarts) restore from checkpoint
+
+## Storage Backends
+
+### PVC (Currently Supported)
+
+Use when you have RWX storage available (e.g., NFS, EFS, Filestore).
+
+```yaml
+checkpoint:
+  storage:
+    type: pvc
+    pvc:
+      pvcName: "chrek-pvc"
+      basePath: "/checkpoints"
+```
+
+**Requirements:**
+- RWX (ReadWriteMany) PVC for multi-node access
+- Sufficient storage (checkpoints are ~10-50GB per model)
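As a rough sizing aid, the arithmetic behind the storage requirement can be sketched as follows. It assumes the "GB" figures above are decimal gigabytes and that the PVC size is given in Kubernetes `Gi` units, and it ignores filesystem overhead; the function name is illustrative.

```python
GIB = 2**30   # bytes in one gibibyte (Kubernetes "Gi")
GB = 10**9    # bytes in one decimal gigabyte


def checkpoints_that_fit(pvc_size_gib, checkpoint_size_gb):
    """Return how many checkpoints of a given size fit in the PVC."""
    if checkpoint_size_gb <= 0:
        raise ValueError("checkpoint_size_gb must be positive")
    return (pvc_size_gib * GIB) // (checkpoint_size_gb * GB)


# Bounds for the 100Gi PVC used in the install command.
for size_gb in (10, 50):
    print(f"{size_gb} GB checkpoints in a 100Gi PVC: {checkpoints_that_fit(100, size_gb)}")
```

Under these assumptions a 100Gi PVC holds between 2 and 10 checkpoints, which is worth keeping in mind when several models or identities share one namespace.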
+
+### S3 / MinIO (Planned - Not Yet Implemented)
+
+> ⚠️ **Note:** S3 storage backend is defined in the API but not yet fully implemented.
+
+Object storage support is planned for a future release. The configuration will look like:
+
+```yaml
+checkpoint:
+  storage:
+    type: s3  # Not yet supported
+    s3:
+      # AWS S3
+      uri: "s3://my-bucket/checkpoints"
+
+      # Or MinIO / custom S3 (use in place of the AWS URI above)
+      # uri: "s3://minio.example.com/my-bucket/checkpoints"
+
+      # Optional: credentials secret
+      credentialsSecretRef: "s3-creds"
+```
+
+### OCI Registry (Planned - Not Yet Implemented)
+
+> ⚠️ **Note:** OCI registry storage backend is defined in the API but not yet fully implemented.
+
+Container registry storage support is planned for a future release. The configuration will look like:
+
+```yaml
+checkpoint:
+  storage:
+    type: oci  # Not yet supported
+    oci:
+      uri: "oci://myregistry.io/checkpoints"
+      credentialsSecretRef: "registry-creds"  # Docker config secret
+```
+
+## Checkpoint Modes
+
+### Auto Mode (Recommended)
+
+The operator automatically creates a `DynamoCheckpoint` CR if one doesn't exist:
+
+```yaml
+checkpoint:
+  enabled: true
+  mode: auto
+  identity:
+    model: "meta-llama/Llama-3-8B"
+    backendFramework: "vllm"
+    tensorParallelSize: 1
+```
+
+### Reference Mode
+
+Reference an existing `DynamoCheckpoint` CR by its 16-character hash using `checkpointRef`:
+
+```yaml
+checkpoint:
+  enabled: true
+  checkpointRef: "e5962d34ba272638"  # 16-char hash of DynamoCheckpoint CR
+```
+
+This is useful when:
+- You want to **pre-warm checkpoints** before creating DGDs
+- You want **explicit control** over which checkpoint to use
+
+**Flow:**
+1. Create a `DynamoCheckpoint` CR (see [DynamoCheckpoint CRD](#dynamocheckpoint-crd) section)
+2. Wait for it to become `Ready`
+3. Reference it in your DGD using `checkpointRef` with the hash
+
+```bash
+# Check checkpoint status (using 16-char hash name)
+kubectl get dynamocheckpoint e5962d34ba272638 -n dynamo-system
+NAME                MODEL                   BACKEND  PHASE  HASH              AGE
+e5962d34ba272638    meta-llama/Llama-3-8B  vllm     Ready  e5962d34ba272638  5m
+
+# Now create DGD referencing it
+kubectl apply -f my-dgd.yaml
+```
+
+## Checkpoint Identity
+
+Checkpoints are uniquely identified by a **16-character hexadecimal hash** (64 bits, truncated from SHA-256) of the configuration that affects runtime state:
+
+| Field | Required | Affects Hash | Example |
+|-------|----------|-------------|---------|
+| `model` | ✓ | ✓ | `meta-llama/Llama-3-8B` |
+| `backendFramework` | ✓ | ✓ | `vllm`, `sglang`, `trtllm` |
+| `dynamoVersion` | | ✓ | `0.9.0`, `1.0.0` |
+| `tensorParallelSize` | | ✓ | `1`, `2`, `4`, `8` (default: 1) |
+| `pipelineParallelSize` | | ✓ | `1`, `2` (default: 1) |
+| `dtype` | | ✓ | `float16`, `bfloat16`, `fp8` |
+| `maxModelLen` | | ✓ | `4096`, `8192` |
+| `extraParameters` | | ✓ | Custom key-value pairs |
+
+**Not included in hash** (don't invalidate checkpoint):
+- `replicas`
+- `nodeSelector`, `affinity`, `tolerations`
+- `resources` (requests/limits)
+- Logging/observability config
+
+**Example with all fields:**
+```yaml
+checkpoint:
+  enabled: true
+  mode: auto
+  identity:
+    model: "meta-llama/Llama-3-8B"
+    backendFramework: "vllm"
+    dynamoVersion: "0.9.0"
+    tensorParallelSize: 1
+    pipelineParallelSize: 1
+    dtype: "bfloat16"
+    maxModelLen: 8192
+    extraParameters:
+      enableChunkedPrefill: "true"
+      quantization: "awq"
+```
+
+**Checkpoint Naming:** The `DynamoCheckpoint` CR is automatically named using the 16-character identity hash (e.g., `e5962d34ba272638`).
+
+**Checkpoint Sharing:** Multiple DGDs with the same identity automatically share the same checkpoint.
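To make the idea of an identity hash concrete, here is a minimal sketch. The canonicalization (JSON with sorted keys) is an assumption: the operator's real serialization is not documented in this guide, so the value produced here will generally not match the operator's hash and should not be used to name a `DynamoCheckpoint` CR. The function name `identity_hash` is hypothetical.

```python
import hashlib
import json


def identity_hash(identity):
    """Return a 16-hex-character digest of the identity fields.

    Assumption: canonical JSON with sorted keys. This illustrates the
    technique only; it does not reproduce the operator's hash.
    """
    canonical = json.dumps(identity, sort_keys=True, separators=(",", ":"))
    # SHA-256 yields 64 hex characters; keep the first 16 (64 bits).
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


identity = {
    "model": "meta-llama/Llama-3-8B",
    "backendFramework": "vllm",
    "tensorParallelSize": 1,
    "dtype": "bfloat16",
}
print(identity_hash(identity))
```

Because only identity fields are hashed, two deployments that differ in `replicas` or `resources` map to the same digest, while changing `dtype` or `model` produces a different one.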
+
+## DynamoCheckpoint CRD
+
+The `DynamoCheckpoint` (shortname: `dckpt`) is a Kubernetes Custom Resource that manages checkpoint lifecycle.
+
+**When to create a DynamoCheckpoint directly:**
+- **Pre-warming:** Create checkpoints before deploying DGDs for instant startup
+- **Explicit control:** Manage checkpoint lifecycle independently from DGDs
+
+**Note:** With hash-based naming, checkpoint names are generated automatically from the 16-character identity hash. In `auto` mode, the operator handles checkpoint discovery and reuse.
+
+**Create a checkpoint:**
+
+```yaml
+apiVersion: nvidia.com/v1alpha1
+kind: DynamoCheckpoint
+metadata:
+  name: e5962d34ba272638  # Use the computed 16-char hash
+spec:
+  identity:
+    model: "meta-llama/Llama-3-8B"
+    backendFramework: "vllm"
+    tensorParallelSize: 1
+    dtype: "bfloat16"
+
+  job:
+    activeDeadlineSeconds: 3600
+    podTemplateSpec:
+      spec:
+        containers:
+          - name: main
+            image: nvcr.io/nvidia/ai-dynamo/dynamo-vllm:latest
+            command: ["python3", "-m", "dynamo.vllm"]
+            args: ["--model", "meta-llama/Llama-3-8B"]
+            resources:
+              limits:
+                nvidia.com/gpu: "1"
+            env:
+              - name: HF_TOKEN
+                valueFrom:
+                  secretKeyRef:
+                    name: hf-token-secret
+                    key: HF_TOKEN
+```
+
+**Note:** You can compute the hash yourself, or use `auto` mode to let the operator create it.
+
+**Check status:**
+
+```bash
+# List all checkpoints
+kubectl get dynamocheckpoint -n dynamo-system
+# Or use shortname
+kubectl get dckpt -n dynamo-system
+
+NAME                MODEL                          BACKEND  PHASE    HASH              AGE
+e5962d34ba272638    meta-llama/Llama-3-8B         vllm     Ready    e5962d34ba272638  5m
+a7b4f89c12de3456    meta-llama/Llama-3-70B        vllm     Creating a7b4f89c12de3456  2m
+```
+
+**Phases:**
+| Phase | Description |
+|-------|-------------|
+| `Pending` | CR created, waiting for job to start |
+| `Creating` | Checkpoint job is running |
+| `Ready` | Checkpoint available for use |
+| `Failed` | Checkpoint creation failed |
+
+**Detailed status:**
+
+```bash
+kubectl describe dckpt e5962d34ba272638 -n dynamo-system
+```
+
+```yaml
+Status:
+  Phase: Ready
+  IdentityHash: e5962d34ba272638
+  Location: /checkpoints/e5962d34ba272638
+  StorageType: pvc
+  CreatedAt: 2026-01-29T10:05:00Z
+```
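Waiting for the `Ready` phase can also be scripted. The sketch below assumes `kubectl` is on the PATH and that the phase shown by `kubectl describe` is exposed at `.status.phase` (an assumption this guide does not confirm); the function names are illustrative.

```python
import subprocess
import time


def get_phase(name, namespace):
    """Read the checkpoint phase via kubectl (assumes the field is .status.phase)."""
    result = subprocess.run(
        ["kubectl", "get", "dckpt", name, "-n", namespace,
         "-o", "jsonpath={.status.phase}"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def wait_for_ready(name, namespace, timeout_s=3600, poll_s=10,
                   phase_fn=get_phase, sleep_fn=time.sleep):
    """Poll until the checkpoint is Ready; raise on Failed or on timeout."""
    deadline = time.monotonic() + timeout_s
    while True:
        phase = phase_fn(name, namespace)
        if phase == "Ready":
            return phase
        if phase == "Failed":
            raise RuntimeError(f"checkpoint {name} failed")
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"checkpoint {name} still {phase or 'unknown'} after {timeout_s}s"
            )
        sleep_fn(poll_s)
```

For example, `wait_for_ready("e5962d34ba272638", "dynamo-system")` blocks until the checkpoint can be referenced from a DGD. The `phase_fn` and `sleep_fn` parameters exist so the loop can be exercised without a cluster.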
+
+**Reference from DGD:**
+
+Once the checkpoint is `Ready`, you can reference it by hash:
+
+```yaml
+spec:
+  services:
+    VllmWorker:
+      checkpoint:
+        enabled: true
+        checkpointRef: "e5962d34ba272638"  # 16-char hash
+```
+
+Or use `auto` mode and the operator will find/create it automatically.
+
+## Limitations
+
+⚠️ **Important**: ChReK has significant limitations that impact production readiness:
+
+### Security Considerations
+- **🔴 Privileged mode required**: Restore pods **must run in privileged mode** for CRIU to function
+- Privileged containers have elevated host access, which may violate security policies in many production environments
+- This requirement applies to all worker pods that restore from checkpoints
+
+### Technical Limitations
+- **vLLM backend only**: Currently only the vLLM backend supports checkpoint/restore. SGLang and TensorRT-LLM support is planned.
+- **Single-node only**: Checkpoints must be created and restored on the same node
+- **Single-GPU only**: Multi-GPU configurations are not yet supported
+- **Network state**: Active TCP connections are closed during restore (handled with `tcp-close` CRIU option)
+- **Storage**: Only PVC backend currently implemented (S3/OCI planned)
+
+### Recommendation
+ChReK is **experimental/beta** and best suited for:
+- ✅ Development and testing environments
+- ✅ Research and experimentation
+- ✅ Controlled production environments with appropriate security controls
+- ❌ Security-sensitive production workloads without proper risk assessment
+
+## Troubleshooting
+
+### Checkpoint Not Creating
+
+1. Check the checkpoint job:
+   ```bash
+   kubectl get jobs -l nvidia.com/checkpoint-source=true -n dynamo-system
+   kubectl logs job/checkpoint-<name> -n dynamo-system
+   ```
+
+2. Check the DaemonSet:
+   ```bash
+   kubectl logs daemonset/chrek-agent -n dynamo-system
+   ```
+
+3. Verify storage access:
+   ```bash
+   kubectl exec -it <checkpoint-agent-pod> -- ls -la /checkpoints
+   ```
+
+### Restore Failing
+
+1. Check pod logs:
+   ```bash
+   kubectl logs <worker-pod> -n dynamo-system
+   ```
+
+2. Verify checkpoint file exists:
+   ```bash
+   # For PVC
+   kubectl exec -it <any-pod-with-pvc> -- ls -la /checkpoints/
+
+   # For S3 (planned backend, not yet implemented)
+   aws s3 ls s3://my-bucket/checkpoints/
+   ```
+
+3. Check environment variables:
+   ```bash
+   kubectl exec <worker-pod> -- env | grep DYN_CHECKPOINT
+   ```
+
+### Cold Start Despite Checkpoint
+
+Pods fall back to cold start if:
+- Checkpoint file doesn't exist yet (still being created)
+- Checkpoint file is corrupted
+- CRIU restore fails
+
+Check logs for "Falling back to cold start" message.
+
+## Best Practices
+
+1. **Use RWX PVCs** for multi-node deployments (currently the only supported backend)
+2. **Pre-warm checkpoints** before scaling up
+3. **Monitor checkpoint size** - large models create large checkpoints
+4. **Clean up old checkpoints** to save storage
+
+## Environment Variables
+
+| Variable | Description |
+|----------|-------------|
+| `DYN_CHECKPOINT_STORAGE_TYPE` | Backend: `pvc` (`s3` and `oci` are planned, not yet implemented) |
+| `DYN_CHECKPOINT_LOCATION` | Source location (URI) |
+| `DYN_CHECKPOINT_PATH` | Local path to tar file |
+| `DYN_CHECKPOINT_HASH` | Identity hash (debugging) |
+| `DYN_CHECKPOINT_SIGNAL_FILE` | Signal file (creation mode only) |
+
+## Complete Example
+
+Create a checkpoint and use it in a DGD:
+
+```yaml
+# 1. Create the DynamoCheckpoint CR
+apiVersion: nvidia.com/v1alpha1
+kind: DynamoCheckpoint
+metadata:
+  name: e5962d34ba272638  # 16-char hash (computed from identity)
+  namespace: dynamo-system
+spec:
+  identity:
+    model: "meta-llama/Meta-Llama-3-8B-Instruct"
+    backendFramework: "vllm"
+    tensorParallelSize: 1
+    dtype: "bfloat16"
+  job:
+    activeDeadlineSeconds: 3600
+    backoffLimit: 3
+    podTemplateSpec:
+      spec:
+        containers:
+          - name: main
+            image: nvcr.io/nvidia/ai-dynamo/dynamo-vllm:latest
+            command: ["python3", "-m", "dynamo.vllm"]
+            args:
+              - "--model"
+              - "meta-llama/Meta-Llama-3-8B-Instruct"
+              - "--tensor-parallel-size"
+              - "1"
+              - "--dtype"
+              - "bfloat16"
+            env:
+              - name: HF_TOKEN
+                valueFrom:
+                  secretKeyRef:
+                    name: hf-token-secret
+                    key: HF_TOKEN
+            resources:
+              limits:
+                nvidia.com/gpu: "1"
+        restartPolicy: Never
+---
+# 2. Wait for Ready: kubectl get dckpt e5962d34ba272638 -n dynamo-system -w
+---
+# 3. Reference the checkpoint in your DGD
+apiVersion: nvidia.com/v1alpha1
+kind: DynamoGraphDeployment
+metadata:
+  name: my-llm
+  namespace: dynamo-system
+spec:
+  services:
+    VllmWorker:
+      replicas: 2
+      extraPodSpec:
+        mainContainer:
+          image: nvcr.io/nvidia/ai-dynamo/dynamo-vllm:latest
+      resources:
+        limits:
+          nvidia.com/gpu: "1"
+      checkpoint:
+        enabled: true
+        checkpointRef: "e5962d34ba272638"  # Reference by hash
+```
+
+## Related Documentation
+
+- [ChReK Overview](README.md) - ChReK architecture and use cases
+- [ChReK Standalone Usage Guide](standalone.md) - Use ChReK without Dynamo Platform
+- ChReK Helm Chart README - See `deploy/helm/charts/chrek/README.md` in the repository for chart configuration
+- [Installation Guide](../installation-guide.md) - Platform installation
+- [API Reference](../api-reference.md) - Complete CRD specifications
+
--- a/fern/pages/kubernetes/chrek/standalone.md
+++ b/fern/pages/kubernetes/chrek/standalone.md
+# ChReK Standalone Usage Guide
+
+> ⚠️ **Experimental Feature**: ChReK is currently in **beta/preview**. It requires privileged mode for restore operations, which may not be suitable for all production environments. Review the [security implications](#security-considerations) before deploying.
+
+This guide explains how to use **ChReK** (Checkpoint/Restore in Kubernetes) as a standalone component, without deploying the full Dynamo platform. This is useful if you want to add checkpoint/restore capabilities to your own GPU workloads.
+
+## Table of Contents
+
+- [Overview](#overview)
+- [Prerequisites](#prerequisites)
+- [Step 1: Deploy ChReK](#step-1-deploy-chrek)
+- [Step 2: Build Checkpoint-Enabled Images](#step-2-build-checkpoint-enabled-images)
+- [Step 3: Create Checkpoint Jobs](#step-3-create-checkpoint-jobs)
+- [Step 4: Restore from Checkpoints](#step-4-restore-from-checkpoints)
+- [Environment Variables Reference](#environment-variables-reference)
+- [Checkpoint Flow Explained](#checkpoint-flow-explained)
+- [Troubleshooting](#troubleshooting)
+
+---
+
+## Overview
+
+When using ChReK standalone, you are responsible for:
+
+1. **Deploying the ChReK Helm chart** (DaemonSet + PVC)
+2. **Building checkpoint-enabled container images** with the restore entrypoint
+3. **Creating checkpoint jobs** with the correct environment variables
+4. **Creating restore pods** that detect and use the checkpoints
+
+The ChReK DaemonSet handles the actual CRIU checkpoint/restore operations automatically once your pods are configured correctly.
+
+---
+
+## Prerequisites
+
+- Kubernetes cluster with:
+  - NVIDIA GPUs with checkpoint support
+  - **Privileged security context allowed** (⚠️ required for CRIU - see [Security Considerations](#security-considerations))
+  - PVC storage (ReadWriteMany recommended for multi-node)
+- Docker or compatible container runtime for building images
+- Access to the ChReK source code: `deploy/chrek/`
+
+### Security Considerations
+
+⚠️ **Important**: ChReK restore operations **require privileged mode**, which has significant security implications:
+
+- **Privileged containers** can access all host devices and bypass most security restrictions
+- This may violate security policies in production environments
+- Privileged containers, if compromised, can potentially compromise node security
+
+**Recommended for:**
+- ✅ Development and testing environments
+- ✅ Research and experimentation
+- ✅ Controlled production environments with appropriate security controls
+
+**Not recommended for:**
+- ❌ Multi-tenant clusters without proper isolation
+- ❌ Security-sensitive production workloads without risk assessment
+- ❌ Environments with strict security compliance requirements
+
+### Technical Limitations
+
+⚠️ **Current Restrictions:**
+- **vLLM backend only**: Currently only the vLLM backend supports checkpoint/restore. SGLang and TensorRT-LLM support is planned.
+- **Single-node only**: Checkpoints must be created and restored on the same node
+- **Single-GPU only**: Multi-GPU configurations are not yet supported
+- **Network state**: Active TCP connections are closed during restore
+- **Storage**: Only PVC backend currently implemented (S3/OCI planned)
+
+---
+
+## Step 1: Deploy ChReK
+
+### Install the Helm Chart
+
+```bash
+# Clone the repository
+git clone https://github.com/ai-dynamo/dynamo.git
+cd dynamo
+
+# Install ChReK in your namespace
+helm install chrek ./deploy/helm/charts/chrek \
+  --namespace my-app \
+  --create-namespace \
+  --set storage.pvc.size=100Gi \
+  --set storage.pvc.storageClass=your-storage-class
+```
+
+### Verify Installation
+
+```bash
+# Check the DaemonSet is running
+kubectl get daemonset -n my-app
+# NAME          DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE
+# chrek-agent   3         3         3       3            3
+
+# Check the PVC is bound
+kubectl get pvc -n my-app
+# NAME        STATUS   VOLUME     CAPACITY   ACCESS MODES   STORAGECLASS
+# chrek-pvc   Bound    pvc-xyz    100Gi      RWX            your-storage-class
+```
+
+---
+
+## Step 2: Build Checkpoint-Enabled Images
+
+ChReK provides a convenient `placeholder` target in its Dockerfile that automatically injects checkpoint/restore capabilities into your existing container images.
+
+### Quick Start: Using the Placeholder Target (Recommended)
+
+```bash
+cd deploy/chrek
+
+# Define your images
+export BASE_IMAGE="your-app:latest"           # Your existing application image
+export RESTORE_IMAGE="your-app:checkpoint-enabled"  # Output checkpoint-enabled image
+
+# Build using the placeholder target
+docker build \
+  --target placeholder \
+  --build-arg BASE_IMAGE="$BASE_IMAGE" \
+  -t "$RESTORE_IMAGE" \
+  .
+
+# Push to your registry
+docker push "$RESTORE_IMAGE"
+```
+
+**Example with a Dynamo vLLM image:**
+
+```bash
+cd deploy/chrek
+
+export DYNAMO_IMAGE="nvidia/dynamo-vllm:v1.2.0"
+export RESTORE_IMAGE="nvidia/dynamo-vllm:v1.2.0-checkpoint"
+
+docker build \
+  --target placeholder \
+  --build-arg BASE_IMAGE="$DYNAMO_IMAGE" \
+  -t "$RESTORE_IMAGE" \
+  .
+```
+
+### What the Placeholder Target Does
+
+The ChReK Dockerfile's `placeholder` stage automatically:
+
+- ✅ Builds the restore-entrypoint binary
+- ✅ Injects it into `/usr/local/bin/restore-entrypoint`
+- ✅ Adds `smart-entrypoint.sh` to `/usr/local/bin/`
+- ✅ Sets executable permissions
+- ✅ Configures the entrypoint to detect and restore checkpoints
+- ✅ Preserves your original application CMD
+
+### Alternative: Manual Multi-Stage Build
+
+If you need more control, you can create your own Dockerfile:
+
+```dockerfile
+# Stage 1: Build restore-entrypoint
+FROM golang:1.23-alpine AS restore-builder
+WORKDIR /build
+COPY deploy/chrek/cmd/restore-entrypoint ./cmd/restore-entrypoint
+COPY deploy/chrek/pkg ./pkg
+COPY deploy/chrek/go.mod deploy/chrek/go.sum ./
+
+RUN go build -o /restore-entrypoint ./cmd/restore-entrypoint
+
+# Stage 2: Your application image
+FROM your-base-image:latest
+
+# Copy restore-entrypoint
+COPY --from=restore-builder /restore-entrypoint /usr/local/bin/restore-entrypoint
+
+# Copy smart-entrypoint.sh
+COPY deploy/chrek/scripts/smart-entrypoint.sh /usr/local/bin/smart-entrypoint.sh
+RUN chmod +x /usr/local/bin/smart-entrypoint.sh /usr/local/bin/restore-entrypoint
+
+# Set smart-entrypoint as the default entrypoint
+ENTRYPOINT ["/usr/local/bin/smart-entrypoint.sh"]
+
+# Your application command (becomes CMD, can be overridden)
+CMD ["python", "your_app.py"]
+```
+
+> **💡 Tip**: The `placeholder` target is the recommended approach because it is maintained alongside the ChReK codebase, which keeps the injected entrypoint compatible.
+
+---
+
+## Step 3: Create Checkpoint Jobs
+
+A checkpoint job loads your application, waits for the ChReK DaemonSet to checkpoint it, and then exits.
+
+### Required Environment Variables
+
+Your checkpoint job MUST set these environment variables:
+
+| Variable | Description | Example |
+|----------|-------------|---------|
+| `DYN_CHECKPOINT_SIGNAL_FILE` | Path where DaemonSet writes completion signal | `/checkpoint-signal/my-checkpoint.done` |
+| `DYN_CHECKPOINT_READY_FILE` | Path where your app signals it's ready | `/tmp/checkpoint-ready` |
+| `DYN_CHECKPOINT_HASH` | Unique identifier for this checkpoint | `abc123def456` |
+| `DYN_CHECKPOINT_LOCATION` | Directory where checkpoint is stored | `/checkpoints/abc123def456` |
+| `DYN_CHECKPOINT_STORAGE_TYPE` | Storage backend type | `pvc` |
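A checkpoint job that starts with one of these variables missing is hard to debug after the fact, so an early check can help. The following is a minimal sketch, not part of ChReK; the helper name `missing_checkpoint_env` is hypothetical.

```python
import os

# Variables the table above lists as required for checkpoint jobs.
REQUIRED_CHECKPOINT_ENV = (
    "DYN_CHECKPOINT_SIGNAL_FILE",
    "DYN_CHECKPOINT_READY_FILE",
    "DYN_CHECKPOINT_HASH",
    "DYN_CHECKPOINT_LOCATION",
    "DYN_CHECKPOINT_STORAGE_TYPE",
)


def missing_checkpoint_env(env=None):
    """Return the required checkpoint variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_CHECKPOINT_ENV if not env.get(name)]
```

An application could call `missing_checkpoint_env()` at startup and exit with a clear message if the returned list is non-empty.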
+
+### Required Labels
+
+Add this label to enable DaemonSet checkpoint detection:
+
+```yaml
+labels:
+  nvidia.com/checkpoint-source: "true"
+```
+
+### Example Checkpoint Job
+
+```yaml
+apiVersion: batch/v1
+kind: Job
+metadata:
+  name: checkpoint-my-model
+  namespace: my-app
+spec:
+  template:
+    metadata:
+      labels:
+        nvidia.com/checkpoint-source: "true"  # Required for DaemonSet detection
+    spec:
+      restartPolicy: Never
+
+      # Init container to clean up stale signal files
+      initContainers:
+      - name: cleanup-signal-file
+        image: busybox:latest
+        command:
+        - sh
+        - -c
+        - |
+          rm -f /checkpoint-signal/my-checkpoint.done || true
+          echo "Signal file cleanup complete"
+        volumeMounts:
+        - name: checkpoint-signal
+          mountPath: /checkpoint-signal
+
+      containers:
+      - name: main
+        image: my-app:checkpoint-enabled
+
+        # Security context required for CRIU
+        securityContext:
+          privileged: true
+          capabilities:
+            add: ["SYS_ADMIN", "SYS_PTRACE", "SYS_CHROOT"]
+
+        # Readiness probe: Pod becomes Ready when model is loaded
+        # This is what triggers the DaemonSet to start checkpointing
+        readinessProbe:
+          exec:
+            command: ["sh", "-c", "cat ${DYN_CHECKPOINT_READY_FILE}"]
+          initialDelaySeconds: 15
+          periodSeconds: 2
+
+        # Remove liveness/startup probes for checkpoint jobs
+        # Model loading can take several minutes
+        livenessProbe: null
+        startupProbe: null
+
+        # Checkpoint-related environment variables
+        env:
+        - name: DYN_CHECKPOINT_SIGNAL_FILE
+          value: "/checkpoint-signal/my-checkpoint.done"
+        - name: DYN_CHECKPOINT_READY_FILE
+          value: "/tmp/checkpoint-ready"
+        - name: DYN_CHECKPOINT_HASH
+          value: "abc123def456"
+        - name: DYN_CHECKPOINT_LOCATION
+          value: "/checkpoints/abc123def456"
+        - name: DYN_CHECKPOINT_STORAGE_TYPE
+          value: "pvc"
+
+        # GPU request
+        resources:
+          limits:
+            nvidia.com/gpu: 1
+
+        # Required volume mounts
+        volumeMounts:
+        - name: checkpoint-storage
+          mountPath: /checkpoints
+        - name: checkpoint-signal
+          mountPath: /checkpoint-signal
+        - name: tmp
+          mountPath: /tmp
+
+      volumes:
+      - name: checkpoint-storage
+        persistentVolumeClaim:
+          claimName: chrek-pvc
+      - name: checkpoint-signal
+        hostPath:
+          path: /var/lib/chrek/signals
+          type: DirectoryOrCreate
+      - name: tmp
+        emptyDir: {}
+```
+
+### Application Code Requirements
+
+Your application must implement the checkpoint flow. Here's the pattern used by Dynamo vLLM:
+
+```python
+import os
+import time
+
+def main():
+    # 1. Check for checkpoint mode
+    signal_file = os.environ.get("DYN_CHECKPOINT_SIGNAL_FILE")
+    ready_file = os.environ.get("DYN_CHECKPOINT_READY_FILE")
+    restore_marker = os.environ.get("DYN_RESTORE_MARKER_FILE", "/tmp/dynamo-restored")
+
+    is_checkpoint_mode = signal_file is not None
+
+    if is_checkpoint_mode:
+        print("Checkpoint mode detected")
+
+        # 2. Load your model/application
+        model = load_model()
+
+        # 3. Optional: Put model to sleep to reduce memory footprint
+        # model.sleep()
+
+        # 4. Write ready file (the readiness probe checks it, which marks the pod Ready)
+        if ready_file:
+            with open(ready_file, "w") as f:
+                f.write("ready")
+            print(f"Wrote checkpoint ready file: {ready_file}")
+
+        # 5. Log readiness messages (helps debugging)
+        print("CHECKPOINT_READY: Model loaded, ready for container checkpoint")
+        print(f"CHECKPOINT_READY: Waiting for signal file: {signal_file}")
+        print(f"CHECKPOINT_READY: Or restore marker file: {restore_marker}")
+
+        # 6. Wait for checkpoint completion OR restore detection
+        while True:
+            # Check if we've been restored (marker file created by restore entrypoint)
+            if os.path.exists(restore_marker):
+                print(f"Detected restore from checkpoint (marker: {restore_marker})")
+                # Continue with normal application flow
+                break
+
+            # Check if checkpoint is complete (signal file created by DaemonSet)
+            if os.path.exists(signal_file):
+                print(f"Checkpoint signal file detected: {signal_file}")
+                print("Checkpoint complete, exiting")
+                return  # Exit gracefully
+
+            time.sleep(1)
+
+    # Normal application flow (or post-restore flow)
+    run_application()
+```
+
+**Important Notes:**
+
+1. **Ready File & Readiness Probe**: The checkpoint job must have a readiness probe that checks for the ready file:
+   ```yaml
+   readinessProbe:
+     exec:
+       command: ["sh", "-c", "cat ${DYN_CHECKPOINT_READY_FILE}"]
+     initialDelaySeconds: 15
+     periodSeconds: 2
+   ```
+   The ChReK DaemonSet triggers checkpointing when:
+   - Pod has `nvidia.com/checkpoint-source: "true"` label
+   - Pod status is `Ready` (readiness probe passes = ready file exists)
+
+2. **Restore Marker**: `restore-entrypoint` creates this file before running the CRIU restore, so the restored process can detect that it was restored.
+
+3. **Two Ways Out of the Wait Loop**:
+   - **Signal file found**: The checkpoint is complete, so the process exits gracefully
+   - **Restore marker found**: The process was restored, so it leaves the loop and continues running
+
+
+---
+
+## Step 4: Restore from Checkpoints
+
+A restore pod checks for a matching checkpoint at startup and restores from it if one exists; otherwise it cold-starts with the original command.
+
+### Example Restore Pod
+
+```yaml
+apiVersion: v1
+kind: Pod
+metadata:
+  name: my-app-restored
+  namespace: my-app
+spec:
+  restartPolicy: Never
+
+  containers:
+  - name: main
+    image: my-app:checkpoint-enabled
+
+    # Security context required for CRIU restore
+    securityContext:
+      privileged: true
+      capabilities:
+        add: ["SYS_ADMIN", "SYS_PTRACE", "SYS_CHROOT"]
+
+    # Set checkpoint environment variables
+    env:
+    - name: DYN_CHECKPOINT_HASH
+      value: "abc123def456"  # Must match checkpoint job
+    - name: DYN_CHECKPOINT_PATH
+      value: "/checkpoints"  # Base path (hash appended automatically)
+
+    # Optional: Customize restore marker file path
+    # - name: DYN_RESTORE_MARKER_FILE
+    #   value: "/tmp/dynamo-restored"
+
+    # GPU request
+    resources:
+      limits:
+        nvidia.com/gpu: 1
+
+    # Mount checkpoint storage (READ-ONLY is fine for restore)
+    volumeMounts:
+    - name: checkpoint-storage
+      mountPath: /checkpoints
+      readOnly: true
+    - name: checkpoint-signal
+      mountPath: /checkpoint-signal
+
+  volumes:
+  - name: checkpoint-storage
+    persistentVolumeClaim:
+      claimName: chrek-pvc
+  - name: checkpoint-signal
+    hostPath:
+      path: /var/lib/chrek/signals
+      type: DirectoryOrCreate
+```
+
+### How Restore Works
+
+1. **Smart Entrypoint Detects Checkpoint**: The `smart-entrypoint.sh` checks if a checkpoint exists at `/checkpoints/${DYN_CHECKPOINT_HASH}/`
+2. **Calls Restore Entrypoint**: If found, calls `/usr/local/bin/restore-entrypoint` which invokes CRIU
+3. **CRIU Restores Process**: The entire process tree is restored from the checkpoint, including GPU state
+4. **Application Continues**: Your application resumes exactly where it was checkpointed
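+
+Because the process resumes with the environment captured at checkpoint time, identity variables such as `POD_NAME` still hold the checkpoint source pod's values after restore. The Dynamo runtime therefore prefers the files `pod_name`, `pod_uid`, and `pod_namespace` under `/etc/podinfo` when they exist, and falls back to the environment variables otherwise. Below is a minimal sketch of a Downward API volume that would supply those files, assuming you add it to the pod spec yourself (the volume name `podinfo` is hypothetical). Since checkpoint and restore are expected to use the same volume mounts, the same volume would presumably be needed on the checkpoint job as well.
+
+```yaml
+# Fragment of a pod spec; merge into the restore pod shown above.
+spec:
+  containers:
+  - name: main
+    volumeMounts:
+    - name: podinfo              # hypothetical volume name
+      mountPath: /etc/podinfo    # default path the runtime reads
+      readOnly: true
+  volumes:
+  - name: podinfo
+    downwardAPI:
+      items:
+      - path: pod_name
+        fieldRef:
+          fieldPath: metadata.name
+      - path: pod_uid
+        fieldRef:
+          fieldPath: metadata.uid
+      - path: pod_namespace
+        fieldRef:
+          fieldPath: metadata.namespace
+```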
+
+---
+
+## Environment Variables Reference
+
+### Checkpoint Jobs
+
+| Variable | Required | Description |
+|----------|----------|-------------|
+| `DYN_CHECKPOINT_SIGNAL_FILE` | Yes | Full path to signal file (e.g., `/checkpoint-signal/my-checkpoint.done`) |
+| `DYN_CHECKPOINT_READY_FILE` | Yes | Full path where app signals readiness (e.g., `/tmp/checkpoint-ready`) |
+| `DYN_CHECKPOINT_HASH` | Yes | Unique checkpoint identifier (alphanumeric string) |
+| `DYN_CHECKPOINT_LOCATION` | Yes | Directory where checkpoint is stored (e.g., `/checkpoints/abc123`) |
+| `DYN_CHECKPOINT_STORAGE_TYPE` | Yes | Storage backend: `pvc`, `s3`, or `oci` |
+
+### Restore Pods
+
+| Variable | Required | Description |
+|----------|----------|-------------|
+| `DYN_CHECKPOINT_HASH` | Yes | Checkpoint identifier (must match checkpoint job) |
+| `DYN_CHECKPOINT_PATH` | Yes | Base checkpoint directory (hash appended automatically) |
+| `DYN_RESTORE_MARKER_FILE` | No | Path for restore marker file (default: `/tmp/dynamo-restored`) |
+
+### Optional CRIU Tuning (Advanced)
+
+| Variable | Default | Description |
+|----------|---------|-------------|
+| `CRIU_TIMEOUT` | `0` (unlimited) | CRIU operation timeout in seconds |
+| `CRIU_LOG_LEVEL` | `4` | CRIU log verbosity (0-4) |
+| `CRIU_WORK_DIR` | `/tmp` | CRIU working directory |
+| `CUDA_PLUGIN_DIR` | `/usr/local/lib/criu` | Path to CRIU CUDA plugin |
+| `CRIU_SKIP_IN_FLIGHT` | `false` | Skip in-flight TCP connections |
+| `CRIU_AUTO_DEDUP` | `false` | Enable auto-deduplication |
+| `CRIU_LAZY_PAGES` | `false` | Enable lazy page migration (experimental) |
+| `WAIT_FOR_CHECKPOINT` | `false` | Wait for checkpoint to appear before starting |
+| `RESTORE_WAIT_TIMEOUT` | `300` | Max seconds to wait for checkpoint |
+| `DEBUG` | `false` | Enable debug mode (sleeps 300s on error) |
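+
+As a rough illustration of how the tuning variables above might be set, the fragment below adds a few of them to a container's `env` list. The values are illustrative rather than recommendations, and placing them on the restore container is an assumption based on the restore pod example.
+
+```yaml
+# Illustrative values only; defaults are listed in the table above.
+env:
+- name: WAIT_FOR_CHECKPOINT      # wait for the checkpoint to appear before starting
+  value: "true"
+- name: RESTORE_WAIT_TIMEOUT     # stop waiting after 600 seconds (default: 300)
+  value: "600"
+- name: CRIU_LOG_LEVEL           # less verbose than the default of 4
+  value: "2"
+- name: CRIU_TIMEOUT             # bound CRIU operations to 600 seconds (default: unlimited)
+  value: "600"
+```
+
+Kubernetes requires `env` values to be strings, so numeric and boolean settings are quoted.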
+
+---
+
+## Checkpoint Flow Explained
+
+### 1. Checkpoint Creation Flow
+
+```
+┌─────────────────────────────────────────────────────────────┐
+│ 1. Pod starts with nvidia.com/checkpoint-source=true label  │
+└──────────────────────┬──────────────────────────────────────┘
+                       │
+                       ▼
+┌─────────────────────────────────────────────────────────────┐
+│ 2. Application loads model and creates ready file           │
+│    /tmp/checkpoint-ready                                     │
+└──────────────────────┬──────────────────────────────────────┘
+                       │
+                       ▼
+┌─────────────────────────────────────────────────────────────┐
+│ 3. Pod becomes Ready (kubelet readiness probe passes)       │
+└──────────────────────┬──────────────────────────────────────┘
+                       │
+                       ▼
+┌─────────────────────────────────────────────────────────────┐
+│ 4. ChReK DaemonSet detects:                                 │
+│    - Pod is Ready                                            │
+│    - Has checkpoint-source label                             │
+│    - Ready file exists: /tmp/checkpoint-ready               │
+└──────────────────────┬──────────────────────────────────────┘
+                       │
+                       ▼
+┌─────────────────────────────────────────────────────────────┐
+│ 5. DaemonSet executes CRIU checkpoint via runc:             │
+│    - Freezes container process                               │
+│    - Dumps memory (CPU + GPU)                                │
+│    - Saves to /checkpoints/${HASH}/                          │
+└──────────────────────┬──────────────────────────────────────┘
+                       │
+                       ▼
+┌─────────────────────────────────────────────────────────────┐
+│ 6. DaemonSet writes signal file:                            │
+│    /checkpoint-signal/${HASH}.done                           │
+└──────────────────────┬──────────────────────────────────────┘
+                       │
+                       ▼
+┌─────────────────────────────────────────────────────────────┐
+│ 7. Application detects signal file and exits gracefully     │
+└─────────────────────────────────────────────────────────────┘
+```
+
+### 2. Restore Flow
+
+```
+┌─────────────────────────────────────────────────────────────┐
+│ 1. Pod starts with DYN_CHECKPOINT_HASH set                  │
+└──────────────────────┬──────────────────────────────────────┘
+                       │
+                       ▼
+┌─────────────────────────────────────────────────────────────┐
+│ 2. smart-entrypoint.sh checks for checkpoint:               │
+│    /checkpoints/${DYN_CHECKPOINT_HASH}/checkpoint.done      │
+└──────────────────────┬──────────────────────────────────────┘
+                       │
+                       ├─ Not Found ─────────────────┐
+                       │                              │
+                       ▼                              ▼
+           ┌───────────────────────┐    ┌──────────────────────┐
+           │ Checkpoint exists     │    │ Cold start           │
+           └──────────┬────────────┘    │ Run original CMD     │
+                      │                 └──────────────────────┘
+                      ▼
+┌─────────────────────────────────────────────────────────────┐
+│ 3. Call restore-entrypoint with checkpoint path             │
+└──────────────────────┬──────────────────────────────────────┘
+                       │
+                       ▼
+┌─────────────────────────────────────────────────────────────┐
+│ 4. restore-entrypoint extracts checkpoint and calls CRIU:   │
+│    criu restore --images-dir /checkpoints/${HASH}/images    │
+└──────────────────────┬──────────────────────────────────────┘
+                       │
+                       ▼
+┌─────────────────────────────────────────────────────────────┐
+│ 5. CRIU restores process from checkpoint                    │
+│    - Restores memory (CPU + GPU)                             │
+│    - Restores file descriptors                               │
+│    - Resumes process execution                               │
+└──────────────────────┬──────────────────────────────────────┘
+                       │
+                       ▼
+┌─────────────────────────────────────────────────────────────┐
+│ 6. Application continues from checkpointed state            │
+│    (Model already loaded, GPU memory initialized)           │
+└─────────────────────────────────────────────────────────────┘
+```
+
+---
+
+## Troubleshooting
+
+### Checkpoint Not Created
+
+**Symptom**: Job runs but no checkpoint appears in `/checkpoints/`
+
+**Checks**:
+1. Verify the pod has the label:
+   ```bash
+   kubectl get pod <pod-name> -o jsonpath='{.metadata.labels.nvidia\.com/checkpoint-source}'
+   ```
+
+2. Check pod readiness:
+   ```bash
+   kubectl get pod <pod-name> -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}'
+   ```
+
+3. Check that the ready file was created:
+   ```bash
+   kubectl exec <pod-name> -- ls -la /tmp/checkpoint-ready
+   ```
+
+4. Check DaemonSet logs:
+   ```bash
+   kubectl logs -n my-app daemonset/chrek-agent --all-containers
+   ```
+
+### Restore Fails
+
+**Symptom**: Pod fails to restore from checkpoint
+
+**Checks**:
+1. Verify checkpoint files exist:
+   ```bash
+   # Single quotes make the variable expand inside the container, not in your local shell
+   kubectl exec <pod-name> -- sh -c 'ls -la "/checkpoints/${DYN_CHECKPOINT_HASH}/"'
+   ```
+
+2. Check that privileged mode is enabled:
+   ```bash
+   kubectl get pod <pod-name> -o jsonpath='{.spec.containers[0].securityContext.privileged}'
+   ```
+
+3. Check CRIU logs in `/tmp/criu-restore.log`:
+   ```bash
+   kubectl exec <pod-name> -- cat /tmp/criu-restore.log
+   ```
+
+4. Ensure the checkpoint job and the restore pod use the same:
+   - Container image
+   - GPU count
+   - Volume mounts
+   - Environment variables (except POD_NAME, POD_IP, etc.)
+
+### Permission Denied Errors
+
+**Symptom**: `CRIU: Permission denied` or `Operation not permitted`
+
+**Solution**: Ensure the container's security context includes:
+```yaml
+securityContext:
+  privileged: true
+  capabilities:
+    add:
+    - SYS_ADMIN
+    - SYS_PTRACE
+    - SYS_CHROOT
+```
+
+### Signal File Not Appearing
+
+**Symptom**: Application waits forever for signal file
+
+**Checks**:
+1. Verify hostPath mount is correct:
+   ```bash
+   kubectl get pod <pod-name> -o jsonpath='{.spec.volumes[?(@.name=="checkpoint-signal")]}'
+   ```
+
+2. Check that the DaemonSet has access to the same path:
+   ```bash
+   kubectl get daemonset -n my-app chrek-agent -o jsonpath='{.spec.template.spec.volumes[?(@.name=="signal-dir")]}'
+   ```
+
+3. Verify paths match exactly:
+   - Pod: `/var/lib/chrek/signals`
+   - DaemonSet: `/var/lib/chrek/signals`
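+
+If the signal file never arrives, the wait loop in the application never exits. One generic Kubernetes safeguard, which ChReK itself does not require, is to bound the Job's lifetime so that a stuck checkpoint job fails instead of holding a GPU indefinitely. The sketch below uses standard `batch/v1` Job fields; the 30-minute deadline is an arbitrary placeholder.
+
+```yaml
+apiVersion: batch/v1
+kind: Job
+metadata:
+  name: checkpoint-my-model
+  namespace: my-app
+spec:
+  activeDeadlineSeconds: 1800   # placeholder: fail the Job after 30 minutes
+  backoffLimit: 0               # do not retry a failed checkpoint attempt
+  template:
+    # ... pod template as in the example checkpoint job ...
+```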
+
+---
+
+## Additional Resources
+
+- [ChReK Helm Chart Values](../../deploy/helm/charts/chrek/values.yaml)
+- [Smart Entrypoint Script](../../deploy/chrek/scripts/smart-entrypoint.sh)
+- [CRIU Documentation](https://criu.org/Main_Page)
+- [CUDA Checkpoint Plugin](https://docs.nvidia.com/cuda/cuda-checkpoint-plugin/)
+
+---
+
+## Getting Help
+
+If you encounter issues:
+
+1. Check the [Troubleshooting](#troubleshooting) section
+2. Review DaemonSet logs: `kubectl logs -n <namespace> daemonset/chrek-agent`
+3. Open an issue on [GitHub](https://github.com/ai-dynamo/dynamo/issues)
--- a/fern/versions/next.yml
+++ b/fern/versions/next.yml
@@ -54,6 +54,14 @@ navigation:
            path: ../pages/kubernetes/webhooks.md
          - page: Autoscaling
            path: ../pages/kubernetes/autoscaling.md
+      - section: Checkpointing (ChReK)
+        contents:
+          - page: Overview
+            path: ../pages/kubernetes/chrek/README.md
+          - page: Integration with Dynamo
+            path: ../pages/kubernetes/chrek/dynamo.md
+          - page: Standalone Usage
+            path: ../pages/kubernetes/chrek/standalone.md
      - section: Observability (K8s)
        contents:
          - page: Metrics

--- a/lib/runtime/src/discovery/kube/utils.rs
+++ b/lib/runtime/src/discovery/kube/utils.rs
@@ -4,7 +4,9 @@
 use anyhow::Result;
 use k8s_openapi::api::discovery::v1::EndpointSlice;
 use std::collections::hash_map::DefaultHasher;
+use std::fs;
 use std::hash::{Hash, Hasher};
+use std::path::Path;

 /// Hash a pod name to get a consistent instance ID
 pub fn hash_pod_name(pod_name: &str) -> u64 {
@@ -57,24 +59,61 @@ pub(super) struct PodInfo {
    pub system_port: u16,
 }

+/// Default path for Kubernetes Downward API volume mount
+const DEFAULT_PODINFO_PATH: &str = "/etc/podinfo";
+
 impl PodInfo {
-    /// Discover pod information from environment variables
+    /// Read a value from a Downward API file, falling back to environment variable
+    fn read_from_file_or_env(file_path: &Path, env_var: &str) -> Option<String> {
+        // First try reading from file (Downward API volume mount)
+        // This is preferred after CRIU restore since env vars contain stale values
+        if let Ok(content) = fs::read_to_string(file_path) {
+            let value = content.trim().to_string();
+            if !value.is_empty() {
+                return Some(value);
+            }
+        }
+
+        // Fall back to environment variable
+        std::env::var(env_var).ok()
+    }
+
+    /// Discover pod information from Kubernetes Downward API volume mounts or environment variables
    ///
-    /// Required environment variables:
+    /// This function first attempts to read pod identity from Downward API volume mounts
+    /// at /etc/podinfo/{pod_name, pod_uid, pod_namespace}. This is critical for CRIU
+    /// checkpoint/restore scenarios where environment variables contain stale values
+    /// from the checkpoint source pod.
+    ///
+    /// If the Downward API files are not available, falls back to environment variables:
    /// - `POD_NAME`: Name of the pod (required)
    /// - `POD_UID`: UID of the pod (required for CR owner reference)
    /// - `POD_NAMESPACE`: Namespace of the pod (defaults to "default")
    pub fn from_env() -> Result<Self> {
-        let pod_name = std::env::var("POD_NAME")
-            .map_err(|_| anyhow::anyhow!("POD_NAME environment variable not set"))?;
-
-        let pod_uid = std::env::var("POD_UID")
-            .map_err(|_| anyhow::anyhow!("POD_UID environment variable not set"))?;
-
-        let pod_namespace = std::env::var("POD_NAMESPACE").unwrap_or_else(|_| {
-            tracing::warn!("POD_NAMESPACE not set, defaulting to 'default'");
-            "default".to_string()
-        });
+        let podinfo_path = Path::new(DEFAULT_PODINFO_PATH);
+
+        let pod_name = Self::read_from_file_or_env(&podinfo_path.join("pod_name"), "POD_NAME")
+            .ok_or_else(|| anyhow::anyhow!("POD_NAME not available from file or environment"))?;
+
+        let pod_uid = Self::read_from_file_or_env(&podinfo_path.join("pod_uid"), "POD_UID")
+            .ok_or_else(|| anyhow::anyhow!("POD_UID not available from file or environment"))?;
+
+        let pod_namespace =
+            Self::read_from_file_or_env(&podinfo_path.join("pod_namespace"), "POD_NAMESPACE")
+                .unwrap_or_else(|| {
+                    tracing::warn!("POD_NAMESPACE not set, defaulting to 'default'");
+                    "default".to_string()
+                });
+
+        // Log where we got the pod info from for debugging
+        if podinfo_path.join("pod_name").exists() {
+            tracing::info!(
+                "Pod identity loaded from Downward API volume mount at {}",
+                DEFAULT_PODINFO_PATH
+            );
+        } else {
+            tracing::info!("Pod identity loaded from environment variables");
+        }

        // Read system server port from config
        let config = crate::config::RuntimeConfig::from_settings().unwrap_or_default();