@@ -10,7 +10,7 @@ This Helm chart deploys the checkpoint/restore infrastructure for NVIDIA Dynamo,
**Note:**
- Each namespace gets its own isolated checkpoint infrastructure with namespace-scoped RBAC
- **Currently only supports vLLM backend** (SGLang and TensorRT-LLM support planned)
- **Supports vLLM and SGLang backends** (TensorRT-LLM support planned)
## Prerequisites
...
...
@@ -19,7 +19,7 @@ This Helm chart deploys the checkpoint/restore infrastructure for NVIDIA Dynamo,
- Kubernetes 1.21+
- GPU nodes with NVIDIA runtime (`nvidia` runtime class)
- containerd runtime (for container inspection; CRIU is bundled in ChReK images)
- NVIDIA Dynamo operator installed (cluster-wide or namespace-scoped), **or** manual pod configuration. When running without the operator, see [Standalone Usage](../../../../docs/pages/kubernetes/chrek/standalone.md#using-chrek-without-the-dynamo-operator) for the required labels, seccomp profiles, command overrides, and deployment strategy
- NVIDIA Dynamo operator installed (cluster-wide or namespace-scoped)
- RWX (ReadWriteMany) storage class for multi-node deployments
- **Security clearance for privileged DaemonSet** (the ChReK agent runs privileged with hostPID/hostIPC/hostNetwork)
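
To illustrate the RWX storage prerequisite, a claim requesting `ReadWriteMany` might look like the sketch below, assuming you provision the claim yourself. The claim name, namespace, storage class name, and size are placeholders and are not defined by this chart; substitute the values your cluster and the chart's configuration expect.

```yaml
# Hypothetical PVC for checkpoint storage; every name and size is a placeholder.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: chrek-checkpoints          # placeholder claim name
  namespace: my-team               # placeholder namespace
spec:
  accessModes:
    - ReadWriteMany                # RWX so pods on different nodes can mount it
  storageClassName: my-rwx-class   # placeholder; must be an RWX-capable class
  resources:
    requests:
      storage: 100Gi               # placeholder size
```

A claim against a class that only supports `ReadWriteOnce` would typically stay `Pending` or fail to mount from a second node, so it is worth confirming the class's capabilities before installing the chart.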
...
...
@@ -168,7 +168,6 @@ Ensure your storage class supports `ReadWriteMany` access mode for multi-node de
- [ChReK Overview](../../../../docs/pages/kubernetes/chrek/README.md) - ChReK architecture and use cases
- [ChReK with Dynamo Platform](../../../../docs/pages/kubernetes/chrek/dynamo.md) - Integration guide
- [ChReK Standalone Usage](../../../../docs/pages/kubernetes/chrek/standalone.md) - Use ChReK without Dynamo Platform
> ⚠️ **Experimental Feature**: ChReK is currently in **beta/preview**. The ChReK DaemonSet runs in privileged mode to perform CRIU operations. See [Limitations](#limitations) for details.
Reduce cold start times for LLM inference workers from ~3 minutes to ~30 seconds using container checkpointing.
## Overview
Checkpointing captures the complete state of a running worker pod (including GPU memory) and saves it to storage. New pods can restore from this checkpoint instead of performing a full cold start.
| Startup Type | Time | What Happens |
|--------------|------|--------------|
| **Cold Start** | ~3 min | Download model, load to GPU, initialize engine |
| **Warm Start** (checkpoint) | ~30 sec | Restore from checkpoint tar |
| **Cold Start** | ~1 min | Download model, load to GPU, initialize engine |
| **Warm Start** (checkpoint) | < 10 sec | Restore from checkpoint tar |
## Prerequisites
- Dynamo Platform installed (v0.4.0+)
- Dynamo Platform installed (v0.4.0+) on a Kubernetes cluster with GPU nodes
- ChReK Helm chart installed (separate from platform)
- GPU nodes with containerd runtime (CRIU is bundled in ChReK images)
- RWX PVC storage (PVC is currently the only supported backend)
## Quick Start
...
...
@@ -63,7 +58,9 @@ dynamo-operator:
### 2. Configure Your DGD
Add checkpoint configuration to your service:
Add checkpoint configuration to your worker service. Both vLLM and SGLang are supported; use the `backendFramework`, command, and CLI flags that match your backend.
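
As a sketch, the checkpoint block for an SGLang worker might look like the following. The field names mirror the `identity` example under Checkpoint Modes; the model and sizing values are illustrative, and it is an assumption that every identity field applies unchanged to SGLang. The surrounding DGD service fields, command, and CLI flags are omitted.

```yaml
# Sketch only: values are illustrative; surrounding DGD fields are omitted.
checkpoint:
  mode: auto
  identity:
    model: "meta-llama/Llama-3-8B"
    backendFramework: "sglang"   # use "vllm" for a vLLM worker
    tensorParallelSize: 1
    dtype: "bfloat16"
    maxModelLen: 4096
```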
> ⚠️ **Note:** The S3 storage backend is defined in the API but not yet fully implemented.
Object storage support is planned for a future release. The configuration will look like:
```yaml
checkpoint:
  storage:
    type: s3  # Not yet supported
    s3:
      # AWS S3
      uri: "s3://my-bucket/checkpoints"
> **Note:** Do **not** set `DYN_READY_FOR_CHECKPOINT_FILE` or `DYN_CHECKPOINT_READY_FILE` in the DGD worker env vars. These are injected automatically by the operator's checkpoint controller into checkpoint job pods only. Setting them on worker pods causes all workers to enter checkpoint mode instead of cold-starting normally.
2. Worker pods start with cold start (checkpoint not ready yet)
3. Once checkpoint is ready, new pods (scale-up, restarts) restore from checkpoint
## Checkpoint Modes
...
...
@@ -172,8 +181,10 @@ checkpoint:
  mode: auto
  identity:
    model: "meta-llama/Llama-3-8B"
    backendFramework: "vllm"
    backendFramework: "vllm"  # or "sglang"
    tensorParallelSize: 1
    dtype: "bfloat16"
    maxModelLen: 4096
```
### Reference Mode
...
...
@@ -347,26 +358,12 @@ Or use `auto` mode and the operator will find/create it automatically.
## Limitations
⚠️ **Important**: ChReK has significant limitations that impact production readiness:
### Security Considerations
- **🔴 Privileged DaemonSet**: The ChReK DaemonSet runs in privileged mode with `hostPID`, `hostIPC`, and `hostNetwork` to perform CRIU operations externally
  - Workload pods (checkpoint jobs, restore pods) do **not** need privileged mode; all CRIU privilege lives in the DaemonSet
  - The privileged DaemonSet has elevated host access, which may violate security policies in many production environments
### Technical Limitations
- **vLLM backend only**: Currently only the vLLM backend supports checkpoint/restore. SGLang and TensorRT-LLM support is planned.
- **Single-node only**: Checkpoints must be created and restored on the same node
- **Single-GPU only**: Multi-GPU configurations are not yet supported
- **vLLM and SGLang backends only**: TensorRT-LLM support is planned.
- **LLM workers only**: Checkpoint/restore supports LLM decode and prefill workers. Specialized workers (multimodal, embedding, diffusion) are not supported.
- **Single-GPU only**: Multi-GPU configurations are not yet supported (planned)
- **Network state**: Active TCP connections are closed during restore (handled with `tcp-close` CRIU option)
- **Storage**: Only PVC backend currently implemented (S3/OCI planned)
### Recommendation
ChReK is **experimental/beta** and best suited for:
- ✅ Development and testing environments
- ✅ Research and experimentation
- ✅ Controlled production environments with appropriate security controls
- ❌ Security-sensitive production workloads without proper risk assessment
- **Security**: ChReK runs as a **privileged DaemonSet**, which is required to run CRIU
## Troubleshooting
...
...
@@ -399,9 +396,6 @@ ChReK is **experimental/beta** and best suited for: