@@ -10,7 +10,7 @@ This Helm chart deploys the checkpoint/restore infrastructure for NVIDIA Dynamo,
**Note:**
- Each namespace gets its own isolated checkpoint infrastructure with namespace-scoped RBAC
- **Currently only supports vLLM backend** (SGLang and TensorRT-LLM support planned)
- **Supports vLLM and SGLang backends** (TensorRT-LLM support planned)
## Prerequisites
@@ -19,7 +19,7 @@ This Helm chart deploys the checkpoint/restore infrastructure for NVIDIA Dynamo,
- Kubernetes 1.21+
- GPU nodes with NVIDIA runtime (`nvidia` runtime class)
- containerd runtime (for container inspection; CRIU is bundled in ChReK images)
- NVIDIA Dynamo operator installed (cluster-wide or namespace-scoped), **or** manual pod configuration — see [Standalone Usage](../../../../docs/pages/kubernetes/chrek/standalone.md#using-chrek-without-the-dynamo-operator) for required labels, seccomp profiles, command overrides, and deployment strategy when running without the operator
- NVIDIA Dynamo operator installed (cluster-wide or namespace-scoped)
- RWX (ReadWriteMany) storage class for multi-node deployments
- **Security clearance for privileged DaemonSet** (the ChReK agent runs privileged with hostPID/hostIPC/hostNetwork)
@@ -168,7 +168,6 @@ Ensure your storage class supports `ReadWriteMany` access mode for multi-node de
- [ChReK Overview](../../../../docs/pages/kubernetes/chrek/README.md) - ChReK architecture and use cases
- [ChReK with Dynamo Platform](../../../../docs/pages/kubernetes/chrek/dynamo.md) - Integration guide
- [ChReK Standalone Usage](../../../../docs/pages/kubernetes/chrek/standalone.md) - Use ChReK without Dynamo Platform
> ⚠️ **Experimental Feature**: ChReK is currently in **beta/preview**. The ChReK DaemonSet runs in privileged mode to perform CRIU operations. See [Limitations](#limitations) for details.
Reduce cold start times for LLM inference workers from ~3 minutes to ~30 seconds using container checkpointing.
## Overview
Checkpointing captures the complete state of a running worker pod (including GPU memory) and saves it to storage. New pods can restore from this checkpoint instead of performing a full cold start.
| Startup Type | Time | What Happens |
|--------------|------|--------------|
| **Cold Start** | ~3 min | Download model, load to GPU, initialize engine |
| **Cold Start** | ~1 min | Download model, load to GPU, initialize engine |
| **Warm Start** (checkpoint) | ~30 sec | Restore from checkpoint tar |
| **Warm Start** (checkpoint) | < 10 sec | Restore from checkpoint tar |
## Prerequisites
- Dynamo Platform installed (v0.4.0+)
- Dynamo Platform installed (v0.4.0+) on k8s cluster with GPU nodes
- ChReK Helm chart installed (separate from platform)
- GPU nodes with containerd runtime (CRIU is bundled in ChReK images)
- RWX PVC storage (PVC is currently the only supported backend)
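
The RWX storage prerequisite can be checked before installing the chart. The following is a minimal sketch, assuming the PVC manifest has already been loaded into a Python dict (for example from YAML). The function name `supports_rwx`, the claim name `chrek-pvc`, and the `100Gi` size are hypothetical and are not taken from the chart:

```python
def supports_rwx(pvc: dict) -> bool:
    """Return True if a PersistentVolumeClaim manifest requests ReadWriteMany."""
    # accessModes sits under spec in the standard Kubernetes PVC schema.
    modes = pvc.get("spec", {}).get("accessModes", [])
    return "ReadWriteMany" in modes


# Hypothetical claim used only for illustration; name and size are assumptions.
example_pvc = {
    "apiVersion": "v1",
    "kind": "PersistentVolumeClaim",
    "metadata": {"name": "chrek-pvc"},
    "spec": {
        "accessModes": ["ReadWriteMany"],
        "resources": {"requests": {"storage": "100Gi"}},
    },
}

print(supports_rwx(example_pvc))
```

This only inspects what the claim requests. Whether the underlying storage class can actually provision an RWX volume still has to be confirmed on the cluster.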
## Quick Start
@@ -63,7 +58,9 @@ dynamo-operator:
### 2. Configure Your DGD
Add checkpoint configuration to your service:
Add checkpoint configuration to your worker service. Both vLLM and SGLang are supported — use the appropriate `backendFramework`, command, and CLI flags.
> ⚠️ **Note:** S3 storage backend is defined in the API but not yet fully implemented.
> **Note:** Do **not** set `DYN_READY_FOR_CHECKPOINT_FILE` or `DYN_CHECKPOINT_READY_FILE` in the DGD worker env vars. These are injected automatically by the operator's checkpoint controller into checkpoint job pods only. Setting them on worker pods causes all workers to enter checkpoint mode instead of cold-starting normally.
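
The env var rule in the note can be enforced with a small pre-deployment check. This is a sketch under assumptions: it takes a DGD manifest already parsed into a Python dict and walks it generically, so it does not depend on the exact DGD schema. The helper name `find_forbidden_env` and the example manifest layout are hypothetical:

```python
# Variables the note says are injected by the operator into checkpoint job pods only.
FORBIDDEN_ENV = {"DYN_READY_FOR_CHECKPOINT_FILE", "DYN_CHECKPOINT_READY_FILE"}


def find_forbidden_env(node, path="$"):
    """Recursively collect the paths of `env` entries that set a forbidden variable."""
    hits = []
    if isinstance(node, dict):
        for key, value in node.items():
            child = f"{path}.{key}"
            if key == "env" and isinstance(value, list):
                for i, entry in enumerate(value):
                    if isinstance(entry, dict) and entry.get("name") in FORBIDDEN_ENV:
                        hits.append(f"{child}[{i}] ({entry['name']})")
            hits.extend(find_forbidden_env(value, child))
    elif isinstance(node, list):
        for i, item in enumerate(node):
            hits.extend(find_forbidden_env(item, f"{path}[{i}]"))
    return hits


# Hypothetical manifest layout, for illustration only.
manifest = {
    "spec": {
        "services": {
            "worker": {
                "env": [
                    {"name": "DYN_CHECKPOINT_READY_FILE", "value": "/tmp/ready"},
                    {"name": "HF_HOME", "value": "/models"},
                ]
            }
        }
    }
}

for hit in find_forbidden_env(manifest):
    print("remove:", hit)
```

An empty result means no worker sets either variable, so workers cold-start normally and only the operator-created checkpoint job pods enter checkpoint mode.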
Object storage support is planned for a future release. The configuration will look like:
@@ -347,26 +358,12 @@ Or use `auto` mode and the operator will find/create it automatically.
## Limitations
⚠️ **Important**: ChReK has significant limitations that impact production readiness:
- **vLLM and SGLang backends only**: TensorRT-LLM support is planned.
- **LLM workers only**: Checkpoint/restore supports LLM decode and prefill workers. Specialized workers (multimodal, embedding, diffusion) are not supported.
### Security Considerations
- **Single-GPU only**: Multi-GPU configurations are not yet supported (planned)
- **🔴 Privileged DaemonSet**: The ChReK DaemonSet runs in privileged mode with `hostPID`, `hostIPC`, and `hostNetwork` to perform CRIU operations externally
- Workload pods (checkpoint jobs, restore pods) do **not** need privileged mode — all CRIU privilege lives in the DaemonSet
- The privileged DaemonSet has elevated host access, which may violate security policies in many production environments
### Technical Limitations
- **vLLM backend only**: Currently only the vLLM backend supports checkpoint/restore. SGLang and TensorRT-LLM support is planned.
- **Single-node only**: Checkpoints must be created and restored on the same node
- **Single-GPU only**: Multi-GPU configurations are not yet supported
- **Network state**: Active TCP connections are closed during restore (handled with `tcp-close` CRIU option)
- **Storage**: Only PVC backend currently implemented (S3/OCI planned)
- **Security**: ChReK runs as a **privileged DaemonSet** which is required to run CRIU
### Recommendation
ChReK is **experimental/beta** and best suited for:
- ✅ Development and testing environments
- ✅ Research and experimentation
- ✅ Controlled production environments with appropriate security controls
- ❌ Security-sensitive production workloads without proper risk assessment
## Troubleshooting
@@ -399,9 +396,6 @@ ChReK is **experimental/beta** and best suited for: