"docs/features/vscode:/vscode.git/clone" did not exist on "1da0ecd46ffbc27abb646d31b3f8b863346f2cf9"
Commit 6dbe9f6a authored by Schwinn Saereesitthipitak, committed by GitHub

docs(chrek): clean up docs + recipe for ChReK (vllm + sglang) (#6603)

parent 5a768786
...@@ -10,7 +10,7 @@ This Helm chart deploys the checkpoint/restore infrastructure for NVIDIA Dynamo, ...@@ -10,7 +10,7 @@ This Helm chart deploys the checkpoint/restore infrastructure for NVIDIA Dynamo,
**Note:** **Note:**
- Each namespace gets its own isolated checkpoint infrastructure with namespace-scoped RBAC - Each namespace gets its own isolated checkpoint infrastructure with namespace-scoped RBAC
- **Currently only supports vLLM backend** (SGLang and TensorRT-LLM support planned) - **Supports vLLM and SGLang backends** (TensorRT-LLM support planned)
## Prerequisites ## Prerequisites
...@@ -19,7 +19,7 @@ This Helm chart deploys the checkpoint/restore infrastructure for NVIDIA Dynamo, ...@@ -19,7 +19,7 @@ This Helm chart deploys the checkpoint/restore infrastructure for NVIDIA Dynamo,
- Kubernetes 1.21+ - Kubernetes 1.21+
- GPU nodes with NVIDIA runtime (`nvidia` runtime class) - GPU nodes with NVIDIA runtime (`nvidia` runtime class)
- containerd runtime (for container inspection; CRIU is bundled in ChReK images) - containerd runtime (for container inspection; CRIU is bundled in ChReK images)
- NVIDIA Dynamo operator installed (cluster-wide or namespace-scoped), **or** manual pod configuration — see [Standalone Usage](../../../../docs/pages/kubernetes/chrek/standalone.md#using-chrek-without-the-dynamo-operator) for required labels, seccomp profiles, command overrides, and deployment strategy when running without the operator - NVIDIA Dynamo operator installed (cluster-wide or namespace-scoped)
- RWX (ReadWriteMany) storage class for multi-node deployments - RWX (ReadWriteMany) storage class for multi-node deployments
- **Security clearance for privileged DaemonSet** (the ChReK agent runs privileged with hostPID/hostIPC/hostNetwork) - **Security clearance for privileged DaemonSet** (the ChReK agent runs privileged with hostPID/hostIPC/hostNetwork)
...@@ -168,7 +168,6 @@ Ensure your storage class supports `ReadWriteMany` access mode for multi-node de ...@@ -168,7 +168,6 @@ Ensure your storage class supports `ReadWriteMany` access mode for multi-node de
- [ChReK Overview](../../../../docs/pages/kubernetes/chrek/README.md) - ChReK architecture and use cases - [ChReK Overview](../../../../docs/pages/kubernetes/chrek/README.md) - ChReK architecture and use cases
- [ChReK with Dynamo Platform](../../../../docs/pages/kubernetes/chrek/dynamo.md) - Integration guide - [ChReK with Dynamo Platform](../../../../docs/pages/kubernetes/chrek/dynamo.md) - Integration guide
- [ChReK Standalone Usage](../../../../docs/pages/kubernetes/chrek/standalone.md) - Use ChReK without Dynamo Platform
## License ## License
......
...@@ -28,15 +28,6 @@ Use ChReK as part of the Dynamo platform for automatic checkpoint management: ...@@ -28,15 +28,6 @@ Use ChReK as part of the Dynamo platform for automatic checkpoint management:
📖 **[Read the Dynamo Integration Guide →](dynamo.md)** 📖 **[Read the Dynamo Integration Guide →](dynamo.md)**
### 2. Standalone (Without Dynamo)
Use ChReK independently in your own Kubernetes applications:
- Manual checkpoint job creation
- Build your own restore-enabled container images
- Full control over checkpoint lifecycle
📖 **[Read the Standalone Usage Guide →](standalone.md)**
## Architecture ## Architecture
ChReK consists of two main components: ChReK consists of two main components:
...@@ -46,7 +37,7 @@ Deploys the checkpoint/restore infrastructure: ...@@ -46,7 +37,7 @@ Deploys the checkpoint/restore infrastructure:
- **DaemonSet**: Runs on GPU nodes to perform CRIU checkpoint operations - **DaemonSet**: Runs on GPU nodes to perform CRIU checkpoint operations
- **PVC**: Stores checkpoint data (rootfs diffs, CUDA memory state) - **PVC**: Stores checkpoint data (rootfs diffs, CUDA memory state)
- **RBAC**: Namespace-scoped or cluster-wide permissions - **RBAC**: Namespace-scoped or cluster-wide permissions
- **Seccomp Profile**: Security policies for CRIU syscalls - **Seccomp Profile**: Security policies for CRIU syscalls (needs to be injected into workload pods)
### 2. External Restore via DaemonSet ### 2. External Restore via DaemonSet
The DaemonSet performs checkpoint/restore externally using `nsenter` to enter pod namespaces: The DaemonSet performs checkpoint/restore externally using `nsenter` to enter pod namespaces:
...@@ -55,7 +46,7 @@ The DaemonSet performs checkpoint/restore externally using `nsenter` to enter po ...@@ -55,7 +46,7 @@ The DaemonSet performs checkpoint/restore externally using `nsenter` to enter po
## Quick Start ## Quick Start
### Install ChReK Infrastructure To install the ChReK DaemonSet in your cluster, run the following:
```bash ```bash
helm install chrek nvidia/chrek \ helm install chrek nvidia/chrek \
...@@ -64,16 +55,12 @@ helm install chrek nvidia/chrek \ ...@@ -64,16 +55,12 @@ helm install chrek nvidia/chrek \
--set storage.pvc.size=100Gi --set storage.pvc.size=100Gi
``` ```
### Choose Your Integration Path
- **Using Dynamo Platform?** → Follow the [Dynamo Integration Guide](dynamo.md)
- **Using standalone?** → Follow the [Standalone Usage Guide](standalone.md)
## Key Features ## Key Features
### ✅ Currently Supported ### ✅ Currently Supported
- ✅ **vLLM backend only** (SGLang and TensorRT-LLM planned) - ✅ **vLLM and SGLang backends** (TensorRT-LLM planned)
- ✅ Single-node, single-GPU checkpoints - ✅ **LLM decode/prefill workers only** (multimodal, embedding, and diffusion workers are not supported)
- ✅ Cross-node, single-GPU checkpoints
- ✅ PVC storage backend (RWX for multi-node) - ✅ PVC storage backend (RWX for multi-node)
- ✅ CUDA checkpoint/restore - ✅ CUDA checkpoint/restore
- ✅ PyTorch distributed state (with `GLOO_SOCKET_IFNAME=lo`) - ✅ PyTorch distributed state (with `GLOO_SOCKET_IFNAME=lo`)
...@@ -82,7 +69,6 @@ helm install chrek nvidia/chrek \ ...@@ -82,7 +69,6 @@ helm install chrek nvidia/chrek \
- ✅ Automatic signal-based checkpoint coordination - ✅ Automatic signal-based checkpoint coordination
### 🚧 Planned Features ### 🚧 Planned Features
- 🚧 SGLang backend support
- 🚧 TensorRT-LLM backend support - 🚧 TensorRT-LLM backend support
- 🚧 S3/MinIO storage backend - 🚧 S3/MinIO storage backend
- 🚧 OCI registry storage backend - 🚧 OCI registry storage backend
...@@ -101,7 +87,8 @@ helm install chrek nvidia/chrek \ ...@@ -101,7 +87,8 @@ helm install chrek nvidia/chrek \
- Potentially compromise node security if exploited - Potentially compromise node security if exploited
### Technical Limitations ### Technical Limitations
- **vLLM backend only**: Currently only the vLLM backend supports checkpoint/restore. SGLang and TensorRT-LLM support is planned. - **vLLM and SGLang backends only**: TensorRT-LLM support is planned.
- **LLM workers only**: Checkpoint/restore supports LLM decode and prefill workers. Specialized workers (multimodal, embedding, diffusion) are not supported.
- **Single-node only**: Checkpoints must be created and restored on the same node - **Single-node only**: Checkpoints must be created and restored on the same node
- **Single-GPU only**: Multi-GPU configurations not yet supported - **Single-GPU only**: Multi-GPU configurations not yet supported
- **Network state limitations**: Active TCP connections are closed during restore (use `tcp-close` CRIU option) - **Network state limitations**: Active TCP connections are closed during restore (use `tcp-close` CRIU option)
...@@ -118,7 +105,6 @@ ChReK is best suited for: ...@@ -118,7 +105,6 @@ ChReK is best suited for:
### Getting Started ### Getting Started
- [Dynamo Integration Guide](dynamo.md) - Using ChReK with Dynamo Platform - [Dynamo Integration Guide](dynamo.md) - Using ChReK with Dynamo Platform
- [Standalone Usage Guide](standalone.md) - Using ChReK independently
- [ChReK Helm Chart README](https://github.com/ai-dynamo/dynamo/tree/main/deploy/helm/charts/chrek/README.md) - Helm chart configuration - [ChReK Helm Chart README](https://github.com/ai-dynamo/dynamo/tree/main/deploy/helm/charts/chrek/README.md) - Helm chart configuration
### Related Documentation ### Related Documentation
...@@ -132,28 +118,6 @@ ChReK is best suited for: ...@@ -132,28 +118,6 @@ ChReK is best suited for:
- RWX storage class (for multi-node deployments) - RWX storage class (for multi-node deployments)
- **Security clearance for privileged DaemonSet** (the ChReK agent runs privileged with hostPID/hostIPC/hostNetwork) - **Security clearance for privileged DaemonSet** (the ChReK agent runs privileged with hostPID/hostIPC/hostNetwork)
## Troubleshooting
### Common Issues
**DaemonSet not starting?**
- Check GPU node labels: `kubectl get nodes -l nvidia.com/gpu.present=true`
- Verify NVIDIA runtime is available
**Checkpoint fails?**
- Check DaemonSet logs: `kubectl logs -l app.kubernetes.io/name=chrek -n <namespace>`
- Ensure application properly signals readiness
- Verify CRIU is installed in the runtime
**Restore fails?**
- Ensure restore pod uses the same image (built with `placeholder` target) and volume mounts as checkpoint job
- Verify the DaemonSet is running on the same node as the restore pod
- Check DaemonSet logs for CRIU errors: `kubectl logs -l app.kubernetes.io/name=chrek`
For detailed troubleshooting, see:
- [Dynamo Integration Guide - Troubleshooting](dynamo.md#troubleshooting)
- [Standalone Guide - Troubleshooting](standalone.md#troubleshooting)
## Contributing ## Contributing
ChReK is part of the NVIDIA Dynamo project. Contributions are welcome! ChReK is part of the NVIDIA Dynamo project. Contributions are welcome!
......
...@@ -8,22 +8,17 @@ title: Integration with Dynamo ...@@ -8,22 +8,17 @@ title: Integration with Dynamo
> ⚠️ **Experimental Feature**: ChReK is currently in **beta/preview**. The ChReK DaemonSet runs in privileged mode to perform CRIU operations. See [Limitations](#limitations) for details. > ⚠️ **Experimental Feature**: ChReK is currently in **beta/preview**. The ChReK DaemonSet runs in privileged mode to perform CRIU operations. See [Limitations](#limitations) for details.
Reduce cold start times for LLM inference workers from ~3 minutes to ~30 seconds using container checkpointing.
## Overview
Checkpointing captures the complete state of a running worker pod (including GPU memory) and saves it to storage. New pods can restore from this checkpoint instead of performing a full cold start. Checkpointing captures the complete state of a running worker pod (including GPU memory) and saves it to storage. New pods can restore from this checkpoint instead of performing a full cold start.
| Startup Type | Time | What Happens | | Startup Type | Time | What Happens |
|--------------|------|--------------| |--------------|------|--------------|
| **Cold Start** | ~3 min | Download model, load to GPU, initialize engine | | **Cold Start** | ~1 min | Download model, load to GPU, initialize engine |
| **Warm Start** (checkpoint) | ~30 sec | Restore from checkpoint tar | | **Warm Start** (checkpoint) | < 10 sec | Restore from checkpoint tar |
## Prerequisites ## Prerequisites
- Dynamo Platform installed (v0.4.0+) - Dynamo Platform installed (v0.4.0+) on k8s cluster with GPU nodes
- ChReK Helm chart installed (separate from platform) - ChReK Helm chart installed (separate from platform)
- GPU nodes with containerd runtime (CRIU is bundled in ChReK images)
- RWX PVC storage (PVC is currently the only supported backend) - RWX PVC storage (PVC is currently the only supported backend)
## Quick Start ## Quick Start
...@@ -63,7 +58,9 @@ dynamo-operator: ...@@ -63,7 +58,9 @@ dynamo-operator:
### 2. Configure Your DGD ### 2. Configure Your DGD
Add checkpoint configuration to your service: Add checkpoint configuration to your worker service. Both vLLM and SGLang are supported — use the appropriate `backendFramework`, command, and CLI flags.
#### vLLM Example
```yaml ```yaml
apiVersion: nvidia.com/v1alpha1 apiVersion: nvidia.com/v1alpha1
...@@ -72,93 +69,105 @@ metadata: ...@@ -72,93 +69,105 @@ metadata:
name: my-llm name: my-llm
spec: spec:
services: services:
VllmWorker: worker:
replicas: 1 replicas: 1
extraPodSpec: extraPodSpec:
mainContainer: mainContainer:
image: nvcr.io/nvidia/ai-dynamo/dynamo-vllm:latest image: nvcr.io/nvidia/ai-dynamo/dynamo-vllm-placeholder:latest
command: ["python3"]
args: args:
- python3 -m dynamo.vllm --model meta-llama/Llama-3-8B - "-m"
- "dynamo.vllm"
- "--model"
- "meta-llama/Llama-3-8B"
- "--max-model-len"
- "4096"
- "--gpu-memory-utilization"
- "0.90"
env:
# Required for cross-node checkpoint/restore
- name: GLOO_SOCKET_IFNAME
value: "lo"
- name: NCCL_SOCKET_IFNAME
value: "lo"
resources: resources:
limits: limits:
nvidia.com/gpu: "1" nvidia.com/gpu: "1"
# Checkpoint configuration
checkpoint: checkpoint:
enabled: true enabled: true
mode: auto # Automatically create checkpoint if not found mode: auto
identity: identity:
model: "meta-llama/Llama-3-8B" model: "meta-llama/Llama-3-8B"
backendFramework: "vllm" backendFramework: "vllm"
tensorParallelSize: 1 tensorParallelSize: 1
dtype: "bfloat16" dtype: "bfloat16"
maxModelLen: 4096
``` ```
### 3. Deploy #### SGLang Example
```bash
kubectl apply -f my-llm.yaml -n dynamo-system
```
On first deployment:
1. A checkpoint job runs to create the checkpoint
2. Worker pods start with cold start (checkpoint not ready yet)
3. Once checkpoint is ready, new pods (scale-up, restarts) restore from checkpoint
## Storage Backends
### PVC (Currently Supported)
Use when you have RWX storage available (e.g., NFS, EFS, Filestore).
```yaml ```yaml
checkpoint: apiVersion: nvidia.com/v1alpha1
storage: kind: DynamoGraphDeployment
type: pvc metadata:
pvc: name: my-sglang-llm
pvcName: "chrek-pvc" spec:
basePath: "/checkpoints" services:
worker:
replicas: 1
extraPodSpec:
mainContainer:
image: nvcr.io/nvidia/ai-dynamo/dynamo-sglang-placeholder:latest
command: ["python3"]
args:
- "-m"
- "dynamo.sglang"
- "--model"
- "meta-llama/Llama-3-8B"
- "--mem-fraction-static"
- "0.90"
env:
# Required for cross-node checkpoint/restore
- name: GLOO_SOCKET_IFNAME
value: "lo"
- name: NCCL_SOCKET_IFNAME
value: "lo"
resources:
limits:
nvidia.com/gpu: "1"
checkpoint:
enabled: true
mode: auto
identity:
model: "meta-llama/Llama-3-8B"
backendFramework: "sglang"
tensorParallelSize: 1
dtype: "bfloat16"
maxModelLen: 4096
``` ```
**Requirements:** **Key differences between backends:**
- RWX (ReadWriteMany) PVC for multi-node access
- Sufficient storage (checkpoints are ~10-50GB per model)
### S3 / MinIO (Planned - Not Yet Implemented)
> ⚠️ **Note:** S3 storage backend is defined in the API but not yet fully implemented. | Setting | vLLM | SGLang |
|---------|------|--------|
| Module | `dynamo.vllm` | `dynamo.sglang` |
| Max context (optional) | `--max-model-len` | `--context-length` |
| GPU memory | `--gpu-memory-utilization` | `--mem-fraction-static` |
| Placeholder image | `dynamo-vllm-placeholder` | `dynamo-sglang-placeholder` |
| Identity `backendFramework` | `"vllm"` | `"sglang"` |
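
The per-backend differences in the table above can be captured in a small lookup. The following is a minimal sketch, not part of Dynamo or ChReK: `BACKEND_FLAGS` and `build_worker_args` are hypothetical names, and only the module names and CLI flags are taken from the table.

```python
# Hypothetical helper: flag and module names come from the table above;
# the function itself is illustrative and not a Dynamo or ChReK API.
BACKEND_FLAGS = {
    "vllm": {
        "module": "dynamo.vllm",
        "max_context": "--max-model-len",
        "gpu_memory": "--gpu-memory-utilization",
    },
    "sglang": {
        "module": "dynamo.sglang",
        "max_context": "--context-length",
        "gpu_memory": "--mem-fraction-static",
    },
}


def build_worker_args(backend, model, gpu_memory_fraction, max_context=None):
    """Return the `args` list for a container whose command is ["python3"]."""
    flags = BACKEND_FLAGS[backend]  # KeyError for backends not in the table
    args = ["-m", flags["module"], "--model", model]
    if max_context is not None:  # the max-context flag is optional
        args += [flags["max_context"], str(max_context)]
    args += [flags["gpu_memory"], f"{gpu_memory_fraction:.2f}"]
    return args
```

Calling `build_worker_args("vllm", "meta-llama/Llama-3-8B", 0.90, 4096)` yields the same argument list as the vLLM example above.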
Object storage support is planned for a future release. The configuration will look like: > **Note:** Do **not** set `DYN_READY_FOR_CHECKPOINT_FILE` or `DYN_CHECKPOINT_READY_FILE` in the DGD worker env vars. These are injected automatically by the operator's checkpoint controller into checkpoint job pods only. Setting them on worker pods causes all workers to enter checkpoint mode instead of cold-starting normally.
```yaml ### 3. Deploy
checkpoint:
storage:
type: s3 # Not yet supported
s3:
# AWS S3
uri: "s3://my-bucket/checkpoints"
# Or MinIO / custom S3
uri: "s3://minio.example.com/my-bucket/checkpoints"
# Optional: credentials secret ```bash
credentialsSecretRef: "s3-creds" kubectl apply -f my-llm.yaml -n dynamo-system
``` ```
### OCI Registry (Planned - Not Yet Implemented) On first deployment:
1. A checkpoint job runs to create the checkpoint
> ⚠️ **Note:** OCI registry storage backend is defined in the API but not yet fully implemented. 2. Worker pods start with cold start (checkpoint not ready yet)
3. Once checkpoint is ready, new pods (scale-up, restarts) restore from checkpoint
Container registry storage support is planned for a future release. The configuration will look like:
```yaml
checkpoint:
storage:
type: oci # Not yet supported
oci:
uri: "oci://myregistry.io/checkpoints"
credentialsSecretRef: "registry-creds" # Docker config secret
```
## Checkpoint Modes ## Checkpoint Modes
...@@ -172,8 +181,10 @@ checkpoint: ...@@ -172,8 +181,10 @@ checkpoint:
mode: auto mode: auto
identity: identity:
model: "meta-llama/Llama-3-8B" model: "meta-llama/Llama-3-8B"
backendFramework: "vllm" backendFramework: "vllm" # or "sglang"
tensorParallelSize: 1 tensorParallelSize: 1
dtype: "bfloat16"
maxModelLen: 4096
``` ```
### Reference Mode ### Reference Mode
...@@ -347,26 +358,12 @@ Or use `auto` mode and the operator will find/create it automatically. ...@@ -347,26 +358,12 @@ Or use `auto` mode and the operator will find/create it automatically.
## Limitations ## Limitations
⚠️ **Important**: ChReK has significant limitations that impact production readiness: - **vLLM and SGLang backends only**: TensorRT-LLM support is planned.
- **LLM workers only**: Checkpoint/restore supports LLM decode and prefill workers. Specialized workers (multimodal, embedding, diffusion) are not supported.
### Security Considerations - **Single-GPU only**: Multi-GPU configurations are not yet supported (planned)
- **🔴 Privileged DaemonSet**: The ChReK DaemonSet runs in privileged mode with `hostPID`, `hostIPC`, and `hostNetwork` to perform CRIU operations externally
- Workload pods (checkpoint jobs, restore pods) do **not** need privileged mode — all CRIU privilege lives in the DaemonSet
- The privileged DaemonSet has elevated host access, which may violate security policies in many production environments
### Technical Limitations
- **vLLM backend only**: Currently only the vLLM backend supports checkpoint/restore. SGLang and TensorRT-LLM support is planned.
- **Single-node only**: Checkpoints must be created and restored on the same node
- **Single-GPU only**: Multi-GPU configurations are not yet supported
- **Network state**: Active TCP connections are closed during restore (handled with `tcp-close` CRIU option) - **Network state**: Active TCP connections are closed during restore (handled with `tcp-close` CRIU option)
- **Storage**: Only PVC backend currently implemented (S3/OCI planned) - **Storage**: Only PVC backend currently implemented (S3/OCI planned)
- **Security**: ChReK runs as a **privileged DaemonSet** which is required to run CRIU
### Recommendation
ChReK is **experimental/beta** and best suited for:
- ✅ Development and testing environments
- ✅ Research and experimentation
- ✅ Controlled production environments with appropriate security controls
- ❌ Security-sensitive production workloads without proper risk assessment
## Troubleshooting ## Troubleshooting
...@@ -399,9 +396,6 @@ ChReK is **experimental/beta** and best suited for: ...@@ -399,9 +396,6 @@ ChReK is **experimental/beta** and best suited for:
```bash ```bash
# For PVC # For PVC
kubectl exec -it <any-pod-with-pvc> -- ls -la /checkpoints/ kubectl exec -it <any-pod-with-pvc> -- ls -la /checkpoints/
# For S3
aws s3 ls s3://my-bucket/checkpoints/
``` ```
3. Check environment variables: 3. Check environment variables:
...@@ -418,18 +412,11 @@ Pods fall back to cold start if: ...@@ -418,18 +412,11 @@ Pods fall back to cold start if:
Check logs for "Falling back to cold start" message. Check logs for "Falling back to cold start" message.
## Best Practices
1. **Use RWX PVCs** for multi-node deployments (currently the only supported backend)
2. **Pre-warm checkpoints** before scaling up
3. **Monitor checkpoint size** - large models create large checkpoints
4. **Clean up old checkpoints** to save storage
## Environment Variables ## Environment Variables
| Variable | Description | | Variable | Description |
|----------|-------------| |----------|-------------|
| `DYN_CHECKPOINT_STORAGE_TYPE` | Backend: `pvc`, `s3`, `oci` | | `DYN_CHECKPOINT_STORAGE_TYPE` | Backend: `pvc`, `s3`, `oci` (`s3` and `oci` are currently no-ops) |
| `DYN_CHECKPOINT_LOCATION` | Full checkpoint location (checkpoint jobs) | | `DYN_CHECKPOINT_LOCATION` | Full checkpoint location (checkpoint jobs) |
| `DYN_CHECKPOINT_PATH` | Base checkpoint directory (restore pods, PVC) | | `DYN_CHECKPOINT_PATH` | Base checkpoint directory (restore pods, PVC) |
| `DYN_CHECKPOINT_HASH` | Identity hash | | `DYN_CHECKPOINT_HASH` | Identity hash |
...@@ -459,21 +446,27 @@ spec: ...@@ -459,21 +446,27 @@ spec:
spec: spec:
containers: containers:
- name: main - name: main
image: nvcr.io/nvidia/ai-dynamo/dynamo-vllm:latest image: nvcr.io/nvidia/ai-dynamo/dynamo-vllm-placeholder:latest
command: ["python3", "-m", "dynamo.vllm"] command: ["python3"]
args: args:
- "-m"
- "dynamo.vllm"
- "--model" - "--model"
- "meta-llama/Meta-Llama-3-8B-Instruct" - "meta-llama/Meta-Llama-3-8B-Instruct"
- "--tensor-parallel-size" - "--max-model-len"
- "1" - "4096"
- "--dtype" - "--gpu-memory-utilization"
- "bfloat16" - "0.90"
env: env:
- name: HF_TOKEN - name: HF_TOKEN
valueFrom: valueFrom:
secretKeyRef: secretKeyRef:
name: hf-token-secret name: hf-token-secret
key: HF_TOKEN key: HF_TOKEN
- name: GLOO_SOCKET_IFNAME
value: "lo"
- name: NCCL_SOCKET_IFNAME
value: "lo"
resources: resources:
limits: limits:
nvidia.com/gpu: "1" nvidia.com/gpu: "1"
...@@ -489,11 +482,26 @@ metadata: ...@@ -489,11 +482,26 @@ metadata:
namespace: dynamo-system namespace: dynamo-system
spec: spec:
services: services:
VllmWorker: worker:
replicas: 2 replicas: 2
extraPodSpec: extraPodSpec:
mainContainer: mainContainer:
image: nvcr.io/nvidia/ai-dynamo/dynamo-vllm:latest image: nvcr.io/nvidia/ai-dynamo/dynamo-vllm-placeholder:latest
command: ["python3"]
args:
- "-m"
- "dynamo.vllm"
- "--model"
- "meta-llama/Meta-Llama-3-8B-Instruct"
- "--max-model-len"
- "4096"
- "--gpu-memory-utilization"
- "0.90"
env:
- name: GLOO_SOCKET_IFNAME
value: "lo"
- name: NCCL_SOCKET_IFNAME
value: "lo"
resources: resources:
limits: limits:
nvidia.com/gpu: "1" nvidia.com/gpu: "1"
...@@ -505,7 +513,6 @@ spec: ...@@ -505,7 +513,6 @@ spec:
## Related Documentation ## Related Documentation
- [ChReK Overview](README.md) - ChReK architecture and use cases - [ChReK Overview](README.md) - ChReK architecture and use cases
- [ChReK Standalone Usage Guide](standalone.md) - Use ChReK without Dynamo Platform
- [ChReK Helm Chart README](https://github.com/ai-dynamo/dynamo/tree/main/deploy/helm/charts/chrek/README.md) - Chart configuration - [ChReK Helm Chart README](https://github.com/ai-dynamo/dynamo/tree/main/deploy/helm/charts/chrek/README.md) - Chart configuration
- [Installation Guide](../installation-guide.md) - Platform installation - [Installation Guide](../installation-guide.md) - Platform installation
- [API Reference](../api-reference.md) - Complete CRD specifications - [API Reference](../api-reference.md) - Complete CRD specifications
......
---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: Standalone Usage
---
> ⚠️ **Experimental Feature**: ChReK is currently in **beta/preview**. The ChReK DaemonSet runs in privileged mode to perform CRIU operations. Review the [security implications](#security-considerations) before deploying.
This guide explains how to use **ChReK** (Checkpoint/Restore for Kubernetes) as a standalone component without deploying the full Dynamo platform. This is useful if you want to add checkpoint/restore capabilities to your own GPU workloads.
## Table of Contents
- [Overview](#overview)
- [Using ChReK Without the Dynamo Operator](#using-chrek-without-the-dynamo-operator)
- [Prerequisites](#prerequisites)
- [Step 1: Deploy ChReK](#step-1-deploy-chrek)
- [Step 2: Build Checkpoint-Enabled Images](#step-2-build-checkpoint-enabled-images)
- [Step 3: Create Checkpoint Jobs](#step-3-create-checkpoint-jobs)
- [Step 4: Restore from Checkpoints](#step-4-restore-from-checkpoints)
- [Environment Variables Reference](#environment-variables-reference)
- [Checkpoint Flow Explained](#checkpoint-flow-explained)
- [Troubleshooting](#troubleshooting)
---
## Overview
When using ChReK standalone, you are responsible for:
1. **Deploying the ChReK Helm chart** (DaemonSet + PVC)
2. **Building checkpoint-enabled container images** with the CRIU runtime dependencies
3. **Creating checkpoint jobs** with the correct environment variables
4. **Creating restore pods** that detect and use the checkpoints
The ChReK DaemonSet handles the actual CRIU checkpoint/restore operations automatically once your pods are configured correctly.
---
## Using ChReK Without the Dynamo Operator
When using ChReK with the Dynamo operator, the operator automatically configures workload pods for checkpoint/restore. Without the operator, you must handle this configuration manually. This section documents what the operator normally injects and how to replicate it.
### Container Naming
The ChReK DaemonSet needs to identify which container in your pod is the model-serving workload (as opposed to sidecars like istio-proxy or log collectors). It resolves the target container by name:
1. If a container is named `main`, it is selected
2. Otherwise, the first container in the pod spec is selected
When using the Dynamo operator, the model container is always named `main`. In standalone mode, you must either name your model container `main` or ensure it is the first container listed in your pod spec. All YAML examples in this guide use `name: main`.
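
The selection rule above can be sketched in a few lines of Python. This is an illustration of the documented rule only, not the DaemonSet's actual code; `resolve_target_container` is a hypothetical name and the pod spec is assumed to be a plain dictionary parsed from YAML.

```python
def resolve_target_container(pod_spec):
    """Pick the model-serving container using the rule described above.

    Illustrative only: prefers a container named "main", otherwise
    falls back to the first container in the pod spec.
    """
    containers = pod_spec.get("containers", [])
    if not containers:
        raise ValueError("pod spec has no containers")
    for container in containers:
        if container.get("name") == "main":
            return container
    return containers[0]
```

With this rule, a pod that lists a sidecar first and has no container named `main` would have the sidecar selected, which is why naming the model container `main` is the safer choice.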
### Seccomp Profile
The operator sets a seccomp profile on all checkpoint/restore workload pods to block `io_uring` syscalls. The ChReK DaemonSet deploys the profile file (`profiles/block-iouring.json`) to each node, but you must reference it in your pod specs:
```yaml
spec:
securityContext:
seccompProfile:
type: Localhost
localhostProfile: profiles/block-iouring.json
```
Without this profile, `io_uring` syscalls during restore can cause CRIU failures.
### Sleep Infinity Command for Restore Pods
The operator overrides the container command to `["sleep", "infinity"]` on restore-target pods. This produces a Running-but-not-Ready placeholder pod that the ChReK DaemonSet watcher detects and restores externally via `nsenter`. Without this override, the container runs its normal entrypoint and cold-starts instead of waiting for restore.
```yaml
containers:
- name: main
image: my-app:checkpoint-enabled
command: ["sleep", "infinity"]
```
### Recreate Deployment Strategy
The operator forces the `Recreate` strategy when restore labels are present. This prevents the old and new pods from running simultaneously, which would cause failures because two pods would compete for the same GPU checkpoint data. If you are using a Deployment, set this manually:
```yaml
apiVersion: apps/v1
kind: Deployment
spec:
strategy:
type: Recreate
```
### PVC Volume Mount Consistency
CRIU requires identical mount layouts between checkpoint and restore. The operator ensures the checkpoint PVC is mounted at the same path in both the checkpoint job and restore pod. When configuring manually, make sure your checkpoint job and restore pod use the exact same `mountPath` for the checkpoint PVC (e.g., `/checkpoints`).
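
One way to catch a mismatch before deploying is to compare the two manifests programmatically. The sketch below assumes both pod specs are already loaded as dictionaries and that the checkpoint PVC is named `chrek-pvc`; the function names are hypothetical, and the check covers only the checkpoint PVC, not the full mount layout that CRIU compares.

```python
def checkpoint_mount_paths(pod_spec, pvc_name):
    """Return every mountPath at which the PVC `pvc_name` is mounted."""
    volume_names = {
        volume["name"]
        for volume in pod_spec.get("volumes", [])
        if volume.get("persistentVolumeClaim", {}).get("claimName") == pvc_name
    }
    return {
        mount["mountPath"]
        for container in pod_spec.get("containers", [])
        for mount in container.get("volumeMounts", [])
        if mount["name"] in volume_names
    }


def mounts_match(checkpoint_spec, restore_spec, pvc_name="chrek-pvc"):
    """True only if both specs mount the PVC, and at the same path(s)."""
    checkpoint_paths = checkpoint_mount_paths(checkpoint_spec, pvc_name)
    restore_paths = checkpoint_mount_paths(restore_spec, pvc_name)
    return bool(checkpoint_paths) and checkpoint_paths == restore_paths
```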
### Downward API Volume (Currently Unused)
The operator injects a Downward API volume at `/etc/podinfo` for post-restore identity discovery (pod name, namespace, UID). This is not currently consumed by any component — you can skip it for now.
### Environment Variables
The operator normally injects the following environment variables. They are documented in the [Environment Variables Reference](#environment-variables-reference) below; without the operator, you must set them manually:
- **Checkpoint jobs:** `DYN_READY_FOR_CHECKPOINT_FILE`, `DYN_CHECKPOINT_LOCATION`, `DYN_CHECKPOINT_STORAGE_TYPE`, `DYN_CHECKPOINT_HASH`
- **Restore pods:** `DYN_CHECKPOINT_PATH`, `DYN_CHECKPOINT_HASH`
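
Because these variables are easy to forget when writing manifests by hand, a quick pre-flight check can help. The following is a minimal sketch under the assumption that the container spec is a dictionary with a standard Kubernetes `env` list; `REQUIRED_ENV` and `missing_env` are hypothetical names, and the variable names are the ones listed above.

```python
# Variable names are taken from the list above; the helper is illustrative.
REQUIRED_ENV = {
    "checkpoint": {
        "DYN_READY_FOR_CHECKPOINT_FILE",
        "DYN_CHECKPOINT_LOCATION",
        "DYN_CHECKPOINT_STORAGE_TYPE",
        "DYN_CHECKPOINT_HASH",
    },
    "restore": {
        "DYN_CHECKPOINT_PATH",
        "DYN_CHECKPOINT_HASH",
    },
}


def missing_env(container, role):
    """Return the sorted names of required variables the container lacks.

    `role` is "checkpoint" for checkpoint jobs or "restore" for restore pods.
    """
    present = {entry["name"] for entry in container.get("env", [])}
    return sorted(REQUIRED_ENV[role] - present)
```

An empty list means every variable required for that role is present; it does not validate the values themselves.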
---
## Prerequisites
- Kubernetes cluster with:
- NVIDIA GPUs with checkpoint support
- **Privileged DaemonSet allowed** (⚠️ the ChReK DaemonSet runs privileged - see [Security Considerations](#security-considerations))
- PVC storage (ReadWriteMany recommended for multi-node)
- Docker or compatible container runtime for building images
- Access to the ChReK source code: `deploy/chrek/`
### Security Considerations
⚠️ **Important**: The ChReK **DaemonSet** runs in privileged mode to perform CRIU checkpoint/restore operations. Your workload pods (checkpoint jobs, restore pods) do **not** need privileged mode — all CRIU privilege lives in the DaemonSet, which performs external restore via `nsenter`.
- **The DaemonSet** has `privileged: true`, `hostPID`, `hostIPC`, and `hostNetwork`
- This may violate security policies in production environments
- If the DaemonSet is compromised, it could potentially compromise node security
**Recommended for:**
- ✅ Development and testing environments
- ✅ Research and experimentation
- ✅ Controlled production environments with appropriate security controls
**Not recommended for:**
- ❌ Multi-tenant clusters without proper isolation
- ❌ Security-sensitive production workloads without risk assessment
- ❌ Environments with strict security compliance requirements
### Technical Limitations
⚠️ **Current Restrictions:**
- **vLLM backend only**: Currently only the vLLM backend supports checkpoint/restore. SGLang and TensorRT-LLM support is planned.
- **Single-node only**: Checkpoints must be created and restored on the same node
- **Single-GPU only**: Multi-GPU configurations are not yet supported
- **Network state**: Active TCP connections are closed during restore
- **Storage**: Only PVC backend currently implemented (S3/OCI planned)
---
## Step 1: Deploy ChReK
### Install the Helm Chart
```bash
# Clone the repository
git clone https://github.com/ai-dynamo/dynamo.git
cd dynamo
# Install ChReK in your namespace
helm install chrek ./deploy/helm/charts/chrek \
--namespace my-app \
--create-namespace \
--set storage.pvc.size=100Gi \
--set storage.pvc.storageClass=your-storage-class
```
### Verify Installation
```bash
# Check the DaemonSet is running
kubectl get daemonset -n my-app
# NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE
# chrek-agent 3 3 3 3 3
# Check the PVC is bound
kubectl get pvc -n my-app
# NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS
# chrek-pvc Bound pvc-xyz 100Gi RWX your-storage-class
```
---
## Step 2: Build Checkpoint-Enabled Images
ChReK provides a `placeholder` target in its Dockerfile that layers CRIU runtime dependencies onto your existing container images. The DaemonSet performs restore externally via `nsenter`, so these dependencies must be present in the image.
### Quick Start: Using the Placeholder Target (Recommended)
```bash
cd deploy/chrek
# Define your images
export BASE_IMAGE="your-app:latest" # Your existing application image
export RESTORE_IMAGE="your-app:checkpoint-enabled" # Output checkpoint-enabled image
# Build using the placeholder target
docker build \
--target placeholder \
--build-arg BASE_IMAGE="$BASE_IMAGE" \
-t "$RESTORE_IMAGE" \
.
# Push to your registry
docker push "$RESTORE_IMAGE"
```
**Example with a Dynamo vLLM image:**
```bash
cd deploy/chrek
export DYNAMO_IMAGE="nvidia/dynamo-vllm:v1.2.0"
export RESTORE_IMAGE="nvidia/dynamo-vllm:v1.2.0-checkpoint"
docker build \
--target placeholder \
--build-arg BASE_IMAGE="$DYNAMO_IMAGE" \
-t "$RESTORE_IMAGE" \
.
```
### What the Placeholder Target Does
The ChReK Dockerfile's `placeholder` stage automatically:
- ✅ Installs CRIU runtime libraries (required by `nsrestore` running inside the pod's namespaces)
- ✅ Copies the `criu` binary to `/usr/local/sbin/criu`
- ✅ Copies `cuda-checkpoint` to `/usr/local/sbin/cuda-checkpoint` (used for CUDA state checkpoint/restore)
- ✅ Copies `nsrestore` to `/usr/local/bin/nsrestore` (invoked by DaemonSet via `nsenter`)
- ✅ Creates checkpoint directories (`/checkpoints`, `/var/run/criu`, `/var/criu-work`)
- ✅ Preserves your original application image contents
The placeholder image does **not** override the entrypoint or CMD. For restore pods, the operator (or you, in standalone mode) overrides the command to `sleep infinity`.
> **💡 Tip**: Using the `placeholder` target is the recommended approach as it's maintained with the ChReK codebase and ensures compatibility.
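To sanity-check a built image, the paths listed above can be probed with a short script. This is a minimal sketch, not part of ChReK: the function name `missing_chrek_artifacts` and the `root` parameter are hypothetical, and only the paths themselves come from the list above.

```python
import os

# Paths taken from the list above; the checker itself is illustrative.
EXPECTED_FILES = [
    "/usr/local/sbin/criu",
    "/usr/local/sbin/cuda-checkpoint",
    "/usr/local/bin/nsrestore",
]
EXPECTED_DIRS = ["/checkpoints", "/var/run/criu", "/var/criu-work"]


def missing_chrek_artifacts(root: str = "/") -> list[str]:
    """Return the expected paths that are absent under `root`."""
    missing = []
    for path in EXPECTED_FILES:
        if not os.path.isfile(os.path.join(root, path.lstrip("/"))):
            missing.append(path)
    for path in EXPECTED_DIRS:
        if not os.path.isdir(os.path.join(root, path.lstrip("/"))):
            missing.append(path)
    return missing


if __name__ == "__main__":
    print(missing_chrek_artifacts() or "all expected ChReK paths are present")
```

Running it inside a container started from the checkpoint-enabled image (assuming the image ships a Python interpreter) should report no missing paths. The `root` argument exists only so the check can be pointed at an unpacked image filesystem.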
---
## Step 3: Create Checkpoint Jobs
A checkpoint job loads your application, waits for the ChReK DaemonSet to checkpoint it, and then exits.
### Required Environment Variables
Your checkpoint job MUST set these environment variables:
| Variable | Description | Example |
|----------|-------------|---------|
| `DYN_READY_FOR_CHECKPOINT_FILE` | Path where your app signals it's ready | `/tmp/ready-for-checkpoint` |
| `DYN_CHECKPOINT_HASH` | Unique identifier for this checkpoint | `abc123def456` |
| `DYN_CHECKPOINT_LOCATION` | Directory where checkpoint is stored | `/checkpoints/abc123def456` |
| `DYN_CHECKPOINT_STORAGE_TYPE` | Storage backend type | `pvc` |
### Required Labels
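Add these labels so the DaemonSet can detect the pod and identify its checkpoint:
```yaml
labels:
  nvidia.com/chrek-is-checkpoint-source: "true"     # Required for DaemonSet detection
  nvidia.com/chrek-checkpoint-hash: "abc123def456"  # Must match DYN_CHECKPOINT_HASH
```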
### Example Checkpoint Job
```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: checkpoint-my-model
  namespace: my-app
spec:
  template:
    metadata:
      labels:
        nvidia.com/chrek-is-checkpoint-source: "true"     # Required for DaemonSet detection
        nvidia.com/chrek-checkpoint-hash: "abc123def456"  # Must match DYN_CHECKPOINT_HASH
    spec:
      restartPolicy: Never
      # Seccomp profile to block io_uring syscalls (deployed by the chrek DaemonSet)
      securityContext:
        seccompProfile:
          type: Localhost
          localhostProfile: profiles/block-iouring.json
      containers:
        - name: main
          image: my-app:checkpoint-enabled
          # Readiness probe: Pod becomes Ready when model is loaded
          # This is what triggers the DaemonSet to start checkpointing
          readinessProbe:
            exec:
              command: ["cat", "/tmp/ready-for-checkpoint"]
            initialDelaySeconds: 15
            periodSeconds: 2
          # Remove liveness/startup probes for checkpoint jobs
          # Model loading can take several minutes
          livenessProbe: null
          startupProbe: null
          # Checkpoint-related environment variables
          env:
            - name: DYN_READY_FOR_CHECKPOINT_FILE
              value: "/tmp/ready-for-checkpoint"
            - name: DYN_CHECKPOINT_HASH
              value: "abc123def456"
            - name: DYN_CHECKPOINT_LOCATION
              value: "/checkpoints/abc123def456"
            - name: DYN_CHECKPOINT_STORAGE_TYPE
              value: "pvc"
          # GPU request
          resources:
            limits:
              nvidia.com/gpu: 1
          # Required volume mounts
          volumeMounts:
            - name: checkpoint-storage
              mountPath: /checkpoints
      volumes:
        - name: checkpoint-storage
          persistentVolumeClaim:
            claimName: chrek-pvc
```
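Because the hash appears in both a label and an environment variable, a mismatch is easy to introduce when editing the manifest by hand. The following is a minimal sketch of a pre-submit check over the Job's pod template, expressed as a plain dictionary. The function name `check_checkpoint_template` and its messages are hypothetical; the label and variable names are the ones documented above.

```python
REQUIRED_ENV = (
    "DYN_READY_FOR_CHECKPOINT_FILE",
    "DYN_CHECKPOINT_HASH",
    "DYN_CHECKPOINT_LOCATION",
    "DYN_CHECKPOINT_STORAGE_TYPE",
)
SOURCE_LABEL = "nvidia.com/chrek-is-checkpoint-source"
HASH_LABEL = "nvidia.com/chrek-checkpoint-hash"


def check_checkpoint_template(template: dict) -> list[str]:
    """Return human-readable problems found in a Job's pod template."""
    problems = []
    labels = template.get("metadata", {}).get("labels", {})
    containers = template.get("spec", {}).get("containers", [])
    main = containers[0] if containers else {}
    env = {e["name"]: e.get("value") for e in main.get("env", [])}

    if labels.get(SOURCE_LABEL) != "true":
        problems.append(f'label {SOURCE_LABEL} must be "true"')
    for name in REQUIRED_ENV:
        if not env.get(name):
            problems.append(f"missing env var {name}")
    if labels.get(HASH_LABEL) != env.get("DYN_CHECKPOINT_HASH"):
        problems.append(f"label {HASH_LABEL} must match DYN_CHECKPOINT_HASH")
    if "readinessProbe" not in main:
        problems.append("container needs a readiness probe on the ready file")
    return problems


# Pod template equivalent to the example Job above, reduced to the checked fields.
template = {
    "metadata": {"labels": {SOURCE_LABEL: "true", HASH_LABEL: "abc123def456"}},
    "spec": {"containers": [{
        "name": "main",
        "readinessProbe": {"exec": {"command": ["cat", "/tmp/ready-for-checkpoint"]}},
        "env": [
            {"name": "DYN_READY_FOR_CHECKPOINT_FILE", "value": "/tmp/ready-for-checkpoint"},
            {"name": "DYN_CHECKPOINT_HASH", "value": "abc123def456"},
            {"name": "DYN_CHECKPOINT_LOCATION", "value": "/checkpoints/abc123def456"},
            {"name": "DYN_CHECKPOINT_STORAGE_TYPE", "value": "pvc"},
        ],
    }]},
}
print(check_checkpoint_template(template))
```

An empty list means the template satisfies the requirements described in this step. The check only inspects the first container, which is an assumption that matches the single-container example.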
### Application Code Requirements
Your application must implement the checkpoint flow. The DaemonSet communicates with your application via Unix signals (not files):
- **`SIGUSR1`**: Checkpoint completed — your process should exit gracefully
- **`SIGCONT`**: Restore completed — your process should wake up and continue
- **`SIGKILL`**: Checkpoint failed — process is terminated immediately (unhandleable)
Here's the pattern used by Dynamo vLLM (see `components/src/dynamo/vllm/checkpoint_restore.py`):
```python
import asyncio
import os
import signal


async def main():
    ready_file = os.environ.get("DYN_READY_FOR_CHECKPOINT_FILE")
    if not ready_file:
        # Not in checkpoint mode, run normally
        await run_application()
        return

    print("Checkpoint mode detected")

    # 1. Load your model/application
    model = await load_model()

    # 2. Put model to sleep for CRIU-friendly GPU state
    await model.sleep()

    # 3. Install signal handlers BEFORE writing the ready file to avoid a race
    #    where the DaemonSet sends a signal while default disposition (terminate)
    #    is still in effect. No handler needed for checkpoint failure: the
    #    watcher sends SIGKILL which terminates the process immediately.
    checkpoint_done = asyncio.Event()
    restore_done = asyncio.Event()
    loop = asyncio.get_running_loop()
    loop.add_signal_handler(signal.SIGUSR1, checkpoint_done.set)
    loop.add_signal_handler(signal.SIGCONT, restore_done.set)

    # 4. Write ready file; this triggers the DaemonSet checkpoint via the readiness probe
    with open(ready_file, "w") as f:
        f.write("ready")
    print("Ready for checkpoint. Waiting for watcher signal...")

    # Wait for whichever signal comes first (SIGKILL on failure kills us
    # immediately, so only success/restore signals reach this point)
    done, pending = await asyncio.wait(
        [asyncio.create_task(checkpoint_done.wait()),
         asyncio.create_task(restore_done.wait())],
        return_when=asyncio.FIRST_COMPLETED,
    )
    for task in pending:
        task.cancel()

    if restore_done.is_set():
        # SIGCONT: Process was restored from checkpoint
        print("Restore complete, waking model")
        await model.wake_up()
        await run_application()
    else:
        # SIGUSR1: Checkpoint complete, exit
        print("Checkpoint complete, exiting")
```
**Important Notes:**
1. **Ready File & Readiness Probe**: The checkpoint job must have a readiness probe that checks for the ready file. The ChReK DaemonSet triggers checkpointing when:
   - Pod has `nvidia.com/chrek-is-checkpoint-source: "true"` label
   - Pod status is `Ready` (readiness probe passes = ready file exists)
2. **Signal handler ordering**: Install signal handlers **before** writing the ready file. Otherwise there is a race window in which the DaemonSet sends `SIGUSR1` while its default disposition (terminate) is still in effect, or sends `SIGCONT` before any handler exists to observe it.
3. **Signal-based coordination**: The DaemonSet sends `SIGUSR1` after checkpoint completes, `SIGCONT` after restore completes, and `SIGKILL` if checkpoint fails. Your application must handle `SIGUSR1` and `SIGCONT` (not poll for files). `SIGKILL` cannot be caught; the kernel terminates the process immediately.
4. **Three outcomes**:
   - **SIGUSR1 received**: Checkpoint complete, exit gracefully
   - **SIGCONT received**: Process was restored, wake model and continue
   - **SIGKILL received**: Checkpoint failed, process terminated immediately (no handler needed)
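The ordering rule in note 2 can be exercised locally without a cluster. The sketch below is a self-contained illustration, not the Dynamo implementation: `wait_for_watcher_signal` and `demo` are hypothetical names, and the demo plays the DaemonSet's role by sending `SIGUSR1` to its own process once the ready file exists. It requires a Unix platform and must run in the main thread, since `loop.add_signal_handler` is only available there.

```python
import asyncio
import os
import signal
import tempfile


async def wait_for_watcher_signal(ready_file: str) -> str:
    """Install handlers, then write the ready file, then report which signal arrived."""
    loop = asyncio.get_running_loop()
    outcome = loop.create_future()

    def resolve(name: str) -> None:
        if not outcome.done():
            outcome.set_result(name)

    # Handlers go in first, so no signal can arrive with the default disposition.
    loop.add_signal_handler(signal.SIGUSR1, resolve, "checkpoint")
    loop.add_signal_handler(signal.SIGCONT, resolve, "restore")
    try:
        with open(ready_file, "w") as f:
            f.write("ready")
        return await outcome
    finally:
        loop.remove_signal_handler(signal.SIGUSR1)
        loop.remove_signal_handler(signal.SIGCONT)


async def demo() -> str:
    """Stand in for the DaemonSet: wait for the ready file, then send SIGUSR1."""
    with tempfile.TemporaryDirectory() as tmp:
        ready_file = os.path.join(tmp, "ready-for-checkpoint")
        waiter = asyncio.create_task(wait_for_watcher_signal(ready_file))
        while not os.path.exists(ready_file):
            if waiter.done():
                return await waiter  # surfaces a setup error instead of signalling
            await asyncio.sleep(0.01)
        os.kill(os.getpid(), signal.SIGUSR1)
        return await asyncio.wait_for(waiter, timeout=5)


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Because the ready file is written in the same synchronous stretch that follows handler installation, its existence implies the handlers are already in place. That is the property the readiness probe relies on.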
---
## Step 4: Restore from Checkpoints
The DaemonSet performs restore externally — your restore pod just needs to be a placeholder that sleeps until the DaemonSet restores the checkpointed process into it.
### Example Restore Pod
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-app-restored
  namespace: my-app
  labels:
    nvidia.com/chrek-is-restore-target: "true"        # Required: watcher detects restore pods by this label
    nvidia.com/chrek-checkpoint-hash: "abc123def456"  # Required: watcher uses this to locate the checkpoint
spec:
  restartPolicy: Never
  # Seccomp profile to block io_uring syscalls (deployed by the chrek DaemonSet)
  # Without this, io_uring syscalls may cause CRIU restore failures
  securityContext:
    seccompProfile:
      type: Localhost
      localhostProfile: profiles/block-iouring.json
  containers:
    - name: main
      image: my-app:checkpoint-enabled
      # Override command to sleep. The chrek DaemonSet performs external restore
      # on Running-but-not-Ready pods. Without this, the container would cold-start.
      command: ["sleep", "infinity"]
      # Set checkpoint environment variables
      env:
        - name: DYN_CHECKPOINT_HASH
          value: "abc123def456"  # Must match checkpoint job
        - name: DYN_CHECKPOINT_PATH
          value: "/checkpoints"  # Base path (hash appended automatically)
      # GPU request
      resources:
        limits:
          nvidia.com/gpu: 1
      # CRIU needs write access for restore.log, so do NOT set readOnly
      volumeMounts:
        - name: checkpoint-storage
          mountPath: /checkpoints
  volumes:
    - name: checkpoint-storage
      persistentVolumeClaim:
        claimName: chrek-pvc
```
### How Restore Works
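1. **Pod starts as placeholder**: The `sleep infinity` command keeps the container running without cold-starting the application. The pod must stay Running but not Ready in this state; Kubernetes treats a container that has no readiness probe as Ready once it starts, so this presumes a readiness probe that fails until the application has been restored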
2. **DaemonSet detects restore pod**: The watcher finds pods with `nvidia.com/chrek-is-restore-target=true` that are Running but not Ready
3. **External restore via nsenter**: The DaemonSet enters the pod's namespaces and performs CRIU restore, including GPU state
4. **Application continues**: Your application resumes exactly where it was checkpointed
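The detection rule in step 2 can be written as a predicate over a pod object in the shape returned by `kubectl get pod -o json`. This is a minimal sketch for reasoning about why a pod is or is not picked up. It is not the watcher's actual code, and `is_restore_candidate` is a hypothetical name.

```python
TARGET_LABEL = "nvidia.com/chrek-is-restore-target"
HASH_LABEL = "nvidia.com/chrek-checkpoint-hash"


def is_ready(pod: dict) -> bool:
    """True when the pod's Ready condition has status "True"."""
    conditions = pod.get("status", {}).get("conditions", [])
    return any(
        c.get("type") == "Ready" and c.get("status") == "True" for c in conditions
    )


def is_restore_candidate(pod: dict) -> bool:
    """Restore label set, checkpoint hash present, Running, and not Ready."""
    labels = pod.get("metadata", {}).get("labels", {})
    running = pod.get("status", {}).get("phase") == "Running"
    return (
        labels.get(TARGET_LABEL) == "true"
        and bool(labels.get(HASH_LABEL))
        and running
        and not is_ready(pod)
    )


# Trimmed pod object with only the fields the predicate reads.
pod = {
    "metadata": {"labels": {TARGET_LABEL: "true", HASH_LABEL: "abc123def456"}},
    "status": {
        "phase": "Running",
        "conditions": [{"type": "Ready", "status": "False"}],
    },
}
print(is_restore_candidate(pod))
```

If the same pod reports `Ready` as `"True"`, or lacks either label, the predicate returns `False`, which corresponds to the "Restore Pod Not Detected" case in Troubleshooting.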
---
## Environment Variables Reference
### Checkpoint Jobs
| Variable | Required | Description |
|----------|----------|-------------|
| `DYN_READY_FOR_CHECKPOINT_FILE` | Yes | Full path where app signals readiness (e.g., `/tmp/ready-for-checkpoint`) |
| `DYN_CHECKPOINT_HASH` | Yes | Unique checkpoint identifier (16-char hex string; the `abc123def456` value in this guide's examples is a shortened placeholder) |
| `DYN_CHECKPOINT_LOCATION` | Yes | Directory where checkpoint is stored (e.g., `/checkpoints/abc123def456`) |
| `DYN_CHECKPOINT_STORAGE_TYPE` | Yes | Storage backend: `pvc`, `s3`, or `oci` |
### Restore Pods
| Variable | Required | Description |
|----------|----------|-------------|
| `DYN_CHECKPOINT_HASH` | Yes | Checkpoint identifier (must match checkpoint job) |
| `DYN_CHECKPOINT_PATH` | Yes | Base checkpoint directory (hash appended automatically) |
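The relationship between the restore-side and checkpoint-side variables can be illustrated as follows. This sketch assumes that "hash appended automatically" means a plain path join of `DYN_CHECKPOINT_PATH` and `DYN_CHECKPOINT_HASH`, which is consistent with the `DYN_CHECKPOINT_LOCATION` example above but is not confirmed by this guide. The helper name `checkpoint_dir` is hypothetical.

```python
import posixpath


def checkpoint_dir(base_path: str, checkpoint_hash: str) -> str:
    """Join the base checkpoint path and the hash into one directory path."""
    if not checkpoint_hash or "/" in checkpoint_hash:
        raise ValueError("checkpoint hash must be a non-empty single path segment")
    return posixpath.join(base_path, checkpoint_hash)


print(checkpoint_dir("/checkpoints", "abc123def456"))
```

Under that assumption, a restore pod with `DYN_CHECKPOINT_PATH=/checkpoints` reads from the same directory that the checkpoint job named in `DYN_CHECKPOINT_LOCATION`.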
### Signals (DaemonSet → Application)
The DaemonSet communicates checkpoint/restore completion via Unix signals, not files:
| Signal | Direction | Meaning |
|--------|-----------|---------|
| `SIGUSR1` | DaemonSet → checkpoint pod | Checkpoint completed, process should exit |
| `SIGCONT` | DaemonSet → restored pod | Restore completed, process should wake up |
| `SIGKILL` | DaemonSet → checkpoint pod | Checkpoint failed — process terminated immediately |
CRIU tuning options are configured via the ChReK Helm chart's `config.checkpoint.criu` values, not environment variables. See the [Helm Chart Values](https://github.com/ai-dynamo/dynamo/tree/main/deploy/helm/charts/chrek/values.yaml) for available options.
---
## Checkpoint Flow Explained
### 1. Checkpoint Creation Flow
```
┌─────────────────────────────────────────────────────────────┐
│ 1. Pod starts with nvidia.com/chrek-is-checkpoint-source=true label │
└──────────────────────┬──────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ 2. Application loads model and creates ready file │
│ /tmp/ready-for-checkpoint │
└──────────────────────┬──────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ 3. Pod becomes Ready (kubelet readiness probe passes) │
└──────────────────────┬──────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ 4. ChReK DaemonSet detects: │
│ - Pod is Ready │
│ - Has chrek-is-checkpoint-source label │
│ - Has chrek-checkpoint-hash label │
└──────────────────────┬──────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ 5. DaemonSet executes CRIU checkpoint: │
│ - Freezes container process │
│ - Dumps memory (CPU + GPU) │
│ - Saves to /checkpoints/${HASH}/ │
└──────────────────────┬──────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ 6. DaemonSet sends SIGUSR1 to the application process │
└──────────────────────┬──────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ 7. Application receives SIGUSR1 and exits gracefully │
└─────────────────────────────────────────────────────────────┘
```
### 2. Restore Flow
```
┌─────────────────────────────────────────────────────────────┐
│ 1. Pod starts with restore labels and sleep infinity │
│ (Running but not Ready) │
└──────────────────────┬──────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ 2. ChReK DaemonSet detects: │
│ - Pod is Running but not Ready │
│ - Has chrek-is-restore-target label │
│ - Has chrek-checkpoint-hash label │
└──────────────────────┬──────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ 3. DaemonSet performs external restore via nsenter: │
│ - Enters pod's namespaces (mount, net, pid, ipc) │
│ - Runs nsrestore with CRIU inside the pod's context │
│ - Restores memory (CPU + GPU via cuda-checkpoint) │
└──────────────────────┬──────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ 4. DaemonSet sends SIGCONT to the restored process │
└──────────────────────┬──────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ 5. Application receives SIGCONT, wakes model, continues │
│ (Model already loaded, GPU memory initialized) │
└─────────────────────────────────────────────────────────────┘
```
---
## Troubleshooting
### Checkpoint Not Created
**Symptom**: Job runs but no checkpoint appears in `/checkpoints/`
**Checks**:
1. Verify the pod has the label:
```bash
kubectl get pod <pod-name> -n my-app -o jsonpath='{.metadata.labels.nvidia\.com/chrek-is-checkpoint-source}'
```
2. Check pod readiness:
```bash
kubectl get pod <pod-name> -n my-app -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}'
```
3. Check that the ready file was created:
```bash
kubectl exec -n my-app <pod-name> -- ls -la /tmp/ready-for-checkpoint
```
4. Check DaemonSet logs:
```bash
kubectl logs -n my-app daemonset/chrek-agent --all-containers
```
### Restore Fails
**Symptom**: Pod fails to restore from checkpoint
**Checks**:
1. Verify checkpoint files exist (the single quotes make `DYN_CHECKPOINT_HASH` expand inside the pod rather than in your local shell):
```bash
kubectl exec -n my-app <pod-name> -- sh -c 'ls -la "/checkpoints/${DYN_CHECKPOINT_HASH}/"'
```
2. Check DaemonSet logs for restore errors:
```bash
kubectl logs -n my-app daemonset/chrek-agent --all-containers
```
3. Check pod events for restore status annotations:
```bash
kubectl describe pod <pod-name> -n my-app
```
4. Ensure the checkpoint job and the restore pod use the same:
- Container image (built with `placeholder` target)
- GPU count
- Volume mounts (same `mountPath` for checkpoint PVC)
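The three items in check 4 can be compared mechanically from the container specs of the two pods, for example the `.spec.containers[0]` objects from `kubectl get pod -o json`. This is a minimal sketch. `restore_mismatches` and its helpers are hypothetical names, and it compares only the fields listed above.

```python
def gpu_count(container: dict) -> str:
    """GPU limit as a string; Kubernetes quantities may be ints or strings."""
    limits = container.get("resources", {}).get("limits", {})
    return str(limits.get("nvidia.com/gpu", "0"))


def mount_paths(container: dict) -> list[str]:
    """Sorted mount paths, so ordering differences are ignored."""
    return sorted(m["mountPath"] for m in container.get("volumeMounts", []))


def restore_mismatches(checkpoint: dict, restore: dict) -> list[str]:
    """Name each documented property that differs between the two containers."""
    checks = {
        "image": lambda c: c.get("image"),
        "GPU count": gpu_count,
        "volume mount paths": mount_paths,
    }
    return [name for name, get in checks.items() if get(checkpoint) != get(restore)]


checkpoint_container = {
    "image": "my-app:checkpoint-enabled",
    "resources": {"limits": {"nvidia.com/gpu": 1}},
    "volumeMounts": [{"name": "checkpoint-storage", "mountPath": "/checkpoints"}],
}
restore_container = {
    "image": "my-app:checkpoint-enabled",
    "resources": {"limits": {"nvidia.com/gpu": "1"}},
    "volumeMounts": [{"name": "checkpoint-storage", "mountPath": "/checkpoints"}],
}
print(restore_mismatches(checkpoint_container, restore_container))
```

An empty list means the two containers agree on all three properties; any returned name points at the property to fix before retrying the restore.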
### Restore Pod Not Detected
**Symptom**: Pod runs `sleep infinity` but DaemonSet never restores it
**Checks**:
1. Verify the pod has the required labels:
```bash
kubectl get pod <pod-name> -n my-app -o jsonpath='{.metadata.labels}'
```
Must have both `nvidia.com/chrek-is-restore-target: "true"` and `nvidia.com/chrek-checkpoint-hash: "<hash>"`.
2. Verify the pod is Running but not Ready (this is the trigger):
```bash
kubectl get pod <pod-name> -n my-app -o jsonpath='{.status.phase}'
kubectl get pod <pod-name> -n my-app -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}'
```
3. Verify the DaemonSet is running on the same node:
```bash
kubectl get pods -n my-app -l app.kubernetes.io/name=chrek -o wide
```
---
## Additional Resources
- [ChReK Helm Chart Values](https://github.com/ai-dynamo/dynamo/tree/main/deploy/helm/charts/chrek/values.yaml)
- [Dynamo vLLM ChReK Integration](https://github.com/ai-dynamo/dynamo/tree/main/components/src/dynamo/vllm/checkpoint_restore.py) - Reference signal handler implementation
- [ChReK Dockerfile](https://github.com/ai-dynamo/dynamo/tree/main/deploy/chrek/Dockerfile)
- [CRIU Documentation](https://criu.org/Main_Page)
- [CUDA Checkpoint Utility](https://github.com/NVIDIA/cuda-checkpoint)
---
## Getting Help
If you encounter issues:
1. Check the [Troubleshooting](#troubleshooting) section
2. Review DaemonSet logs: `kubectl logs -n <namespace> daemonset/chrek-agent`
3. Open an issue on [GitHub](https://github.com/ai-dynamo/dynamo/issues)
...@@ -58,8 +58,6 @@ navigation:
contents:
- page: Integration with Dynamo
path: ../pages/kubernetes/chrek/dynamo.md
- page: Standalone Usage
path: ../pages/kubernetes/chrek/standalone.md
- section: Observability (K8s)
contents:
- page: Metrics