---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: Checkpointing
---

> ⚠️ **Experimental Feature**: ChReK is currently in **beta/preview**. The ChReK DaemonSet runs in privileged mode to perform CRIU operations. See [Limitations](#limitations) for details.

**ChReK** (Checkpoint/Restore in Kubernetes) is experimental infrastructure for fast-starting GPU applications using CRIU (Checkpoint/Restore in Userspace). ChReK reduces cold-start times for large models from minutes to seconds by capturing initialized application state and restoring it on demand.

## What is ChReK?

ChReK provides:
- **Fast cold starts**: Restore GPU-accelerated applications in seconds instead of minutes
- **CUDA state preservation**: Checkpoint and restore GPU memory and CUDA contexts
- **Kubernetes-native**: Integrates seamlessly with Kubernetes primitives
- **Storage flexibility**: PVC-based storage (S3/OCI planned for future releases)
- **Namespace isolation**: Each namespace gets its own checkpoint infrastructure

## Use Cases

### 1. With NVIDIA Dynamo Platform (Recommended)

Use ChReK as part of the Dynamo platform for automatic checkpoint management:
- Automatic checkpoint creation and lifecycle management
- Seamless integration with DynamoGraphDeployment CRDs
- Built-in autoscaling with fast restore

📖 **[Read the Dynamo Integration Guide →](dynamo.md)**

## Architecture

ChReK consists of two main components:

### 1. ChReK Helm Chart
Deploys the checkpoint/restore infrastructure:
- **DaemonSet**: Runs on GPU nodes to perform CRIU checkpoint operations
- **PVC**: Stores checkpoint data (rootfs diffs, CUDA memory state)
- **RBAC**: Namespace-scoped or cluster-wide permissions
- **Seccomp Profile**: Security policies for CRIU syscalls (needs to be injected into workload pods)
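
If you manage workload pods yourself, the seccomp requirement above might be met with a pod-level `seccompProfile` of type `Localhost`. The following is a minimal sketch, not the chart's documented procedure: the profile path `profiles/chrek-criu.json` and the output file name are hypothetical placeholders, so check the Helm chart README for the profile the chart actually installs.

```bash
# Hypothetical value: the real profile path depends on how the chart installs it.
# The path is relative to the kubelet's seccomp profile directory on the node.
SECCOMP_PROFILE="profiles/chrek-criu.json"

# Write a strategic-merge patch that sets the pod-level seccomp profile.
cat > workload-seccomp-patch.yaml <<EOF
spec:
  template:
    spec:
      securityContext:
        seccompProfile:
          type: Localhost
          localhostProfile: ${SECCOMP_PROFILE}
EOF

# Review the patch, then apply it to your own Deployment, for example:
#   kubectl patch deployment <name> -n my-team --patch-file workload-seccomp-patch.yaml
cat workload-seccomp-patch.yaml
```

Keeping the patch in a file makes it easy to review and to reuse across workloads in the same namespace.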

### 2. External Restore via DaemonSet
The DaemonSet performs checkpoint/restore externally using `nsenter` to enter pod namespaces:
- **Checkpoint**: Freezes the running process and dumps state (CPU + GPU) to storage
- **Restore**: Enters a placeholder pod's namespaces and restores the checkpointed process via `nsrestore`
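
To illustrate what "entering pod namespaces" means, here is a rough sketch of the shape of an `nsenter` invocation. It is not the DaemonSet's actual code: the helper name `build_nsenter_cmd` and the PID `12345` are made up for illustration, and the helper only prints the command because a real call needs root on the node.

```bash
# Illustrative only: prints an nsenter command line instead of running it.
# Usage: build_nsenter_cmd <host-pid> <command...>
build_nsenter_cmd() {
  target_pid="$1"
  shift
  # Join the mount, network, PID, and IPC namespaces of the target process,
  # then run the given command inside them.
  echo "nsenter --target ${target_pid} --mount --net --pid --ipc -- $*"
}

# 12345 stands in for the host PID of a container's main process.
build_nsenter_cmd 12345 ps aux
```

Because the DaemonSet runs with `hostPID`, it can see container processes by their host PIDs, which is what makes this kind of external entry possible without granting privileges to the workload pod.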

## Quick Start

To install the ChReK DaemonSet in your cluster, run the following:

```bash
helm install chrek nvidia/chrek \
  --namespace my-team \
  --create-namespace \
  --set storage.pvc.size=100Gi
```
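
If you prefer to keep settings in version control, the same override can live in a values file. This sketch mirrors the `--set storage.pvc.size=100Gi` flag above; the file name `chrek-values.yaml` is an arbitrary choice, and no other chart keys are assumed.

```bash
# Equivalent of `--set storage.pvc.size=100Gi`, kept in a file.
cat > chrek-values.yaml <<'EOF'
storage:
  pvc:
    size: 100Gi
EOF

# Then install with the file instead of --set:
#   helm install chrek nvidia/chrek --namespace my-team --create-namespace -f chrek-values.yaml
cat chrek-values.yaml
```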

## Key Features

### ✅ Currently Supported
- ✅ **vLLM and SGLang backends** (TensorRT-LLM planned)
- ✅ **LLM decode/prefill workers only** (multimodal, embedding, and diffusion workers are not supported)
- ✅ Cross-node, single-GPU checkpoints
- ✅ PVC storage backend (RWX for multi-node)
- ✅ CUDA checkpoint/restore
- ✅ PyTorch distributed state (with `GLOO_SOCKET_IFNAME=lo`)
- ✅ Namespace-scoped and cluster-wide RBAC
- ✅ Idempotent checkpoint creation
- ✅ Automatic signal-based checkpoint coordination
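
For the PyTorch distributed item above, the `GLOO_SOCKET_IFNAME=lo` setting has to be in the worker's environment before the process starts. A minimal sketch for a shell entrypoint follows; where you set it in practice (pod `env`, entrypoint script, or elsewhere) is an assumption that depends on your deployment.

```bash
# Pin Gloo to the loopback interface before launching the worker process.
export GLOO_SOCKET_IFNAME=lo

# Confirm the value the worker will inherit.
echo "GLOO_SOCKET_IFNAME=${GLOO_SOCKET_IFNAME}"
```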

### 🚧 Planned Features
- 🚧 TensorRT-LLM backend support
- 🚧 S3/MinIO storage backend
- 🚧 OCI registry storage backend
- 🚧 Multi-GPU checkpoints
- 🚧 Multi-node distributed checkpoints

## Limitations

⚠️ **Important**: ChReK has significant limitations that may impact production readiness:

### Security Considerations
- **🔴 Privileged DaemonSet**: The ChReK DaemonSet runs in privileged mode with `hostPID`, `hostIPC`, and `hostNetwork` to perform CRIU operations. Workload pods do **not** need privileged mode — all CRIU privilege lives in the DaemonSet.
- **Security Impact**: The privileged DaemonSet can:
  - Access all host devices and processes
  - Bypass most security restrictions
  - Potentially compromise node security if exploited

### Technical Limitations
- **vLLM and SGLang backends only**: TensorRT-LLM support is planned.
- **LLM workers only**: Checkpoint/restore supports LLM decode and prefill workers. Specialized workers (multimodal, embedding, diffusion) are not supported.
- **Single-node only**: Checkpoints must be created and restored on the same node
- **Single-GPU only**: Multi-GPU configurations are not yet supported
- **Network state limitations**: Active TCP connections are closed during restore (use the `tcp-close` CRIU option)
- **Storage**: Only PVC storage is currently implemented (S3/OCI planned)

### Recommendation
ChReK is best suited for:
- ✅ Development and testing environments
- ✅ Research and experimentation
- ✅ Controlled production environments with appropriate security controls
- ❌ Security-sensitive production workloads without proper risk assessment

## Documentation

### Getting Started
- [Dynamo Integration Guide](dynamo.md) - Using ChReK with Dynamo Platform
- [ChReK Helm Chart README](https://github.com/ai-dynamo/dynamo/tree/main/deploy/helm/charts/chrek/README.md) - Helm chart configuration

### Related Documentation
- [CRIU Documentation](https://criu.org/Main_Page) - Upstream CRIU docs

## Prerequisites

- Kubernetes 1.21+
- GPU nodes with NVIDIA runtime (`nvidia` runtime class)
- containerd runtime (for container inspection; CRIU is bundled in ChReK images)
- RWX storage class (for multi-node deployments)
- **Security clearance for privileged DaemonSet** (the ChReK agent runs privileged with hostPID/hostIPC/hostNetwork)
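
The prerequisites above can be spot-checked with standard `kubectl` commands. This is a sketch rather than an official preflight script: the function name `check_chrek_prereqs` is hypothetical, and the block only defines the function so that nothing runs until you call it against your cluster.

```bash
# Hypothetical helper: prints the cluster facts relevant to the prerequisites.
check_chrek_prereqs() {
  kubectl version                   # server version should be v1.21 or newer
  kubectl get runtimeclass nvidia   # the `nvidia` runtime class must exist
  kubectl get nodes -o wide         # CONTAINER-RUNTIME column should show containerd
  kubectl get storageclass          # pick a class that supports ReadWriteMany (RWX)
}

# Run it once your kubeconfig points at the target cluster:
#   check_chrek_prereqs
```

The security prerequisite is not something a command can verify; confirm separately that your cluster policy allows a privileged DaemonSet with `hostPID`, `hostIPC`, and `hostNetwork`.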

## Contributing

ChReK is part of the NVIDIA Dynamo project. Contributions are welcome!

## License

Apache License 2.0