Unverified commit 6dbe9f6a, authored by Schwinn Saereesitthipitak, committed by GitHub

docs(chrek): clean up docs + recipe for ChReK (vllm + sglang) (#6603)

parent 5a768786
......@@ -10,7 +10,7 @@ This Helm chart deploys the checkpoint/restore infrastructure for NVIDIA Dynamo,
**Note:**
- Each namespace gets its own isolated checkpoint infrastructure with namespace-scoped RBAC
- **Currently only supports vLLM backend** (SGLang and TensorRT-LLM support planned)
- **Supports vLLM and SGLang backends** (TensorRT-LLM support planned)
## Prerequisites
......@@ -19,7 +19,7 @@ This Helm chart deploys the checkpoint/restore infrastructure for NVIDIA Dynamo,
- Kubernetes 1.21+
- GPU nodes with NVIDIA runtime (`nvidia` runtime class)
- containerd runtime (for container inspection; CRIU is bundled in ChReK images)
- NVIDIA Dynamo operator installed (cluster-wide or namespace-scoped), **or** manual pod configuration — see [Standalone Usage](../../../../docs/pages/kubernetes/chrek/standalone.md#using-chrek-without-the-dynamo-operator) for required labels, seccomp profiles, command overrides, and deployment strategy when running without the operator
- NVIDIA Dynamo operator installed (cluster-wide or namespace-scoped)
- RWX (ReadWriteMany) storage class for multi-node deployments
- **Security clearance for privileged DaemonSet** (the ChReK agent runs privileged with hostPID/hostIPC/hostNetwork)
......@@ -168,7 +168,6 @@ Ensure your storage class supports `ReadWriteMany` access mode for multi-node de
- [ChReK Overview](../../../../docs/pages/kubernetes/chrek/README.md) - ChReK architecture and use cases
- [ChReK with Dynamo Platform](../../../../docs/pages/kubernetes/chrek/dynamo.md) - Integration guide
- [ChReK Standalone Usage](../../../../docs/pages/kubernetes/chrek/standalone.md) - Use ChReK without Dynamo Platform
## License
......
......@@ -28,15 +28,6 @@ Use ChReK as part of the Dynamo platform for automatic checkpoint management:
📖 **[Read the Dynamo Integration Guide →](dynamo.md)**
### 2. Standalone (Without Dynamo)
Use ChReK independently in your own Kubernetes applications:
- Manual checkpoint job creation
- Build your own restore-enabled container images
- Full control over checkpoint lifecycle
📖 **[Read the Standalone Usage Guide →](standalone.md)**
## Architecture
ChReK consists of two main components:
......@@ -46,7 +37,7 @@ Deploys the checkpoint/restore infrastructure:
- **DaemonSet**: Runs on GPU nodes to perform CRIU checkpoint operations
- **PVC**: Stores checkpoint data (rootfs diffs, CUDA memory state)
- **RBAC**: Namespace-scoped or cluster-wide permissions
- **Seccomp Profile**: Security policies for CRIU syscalls
- **Seccomp Profile**: Security policies for CRIU syscalls (needs to be injected into workload pods)
### 2. External Restore via DaemonSet
The DaemonSet performs checkpoint/restore externally using `nsenter` to enter pod namespaces:
......@@ -55,7 +46,7 @@ The DaemonSet performs checkpoint/restore externally using `nsenter` to enter po
## Quick Start
### Install ChReK Infrastructure
To install the ChReK DaemonSet in your cluster, run the following:
```bash
helm install chrek nvidia/chrek \
......@@ -64,16 +55,12 @@ helm install chrek nvidia/chrek \
--set storage.pvc.size=100Gi
```
### Choose Your Integration Path
- **Using Dynamo Platform?** → Follow the [Dynamo Integration Guide](dynamo.md)
- **Using standalone?** → Follow the [Standalone Usage Guide](standalone.md)
## Key Features
### ✅ Currently Supported
- ✅ **vLLM backend only** (SGLang and TensorRT-LLM planned)
- ✅ Single-node, single-GPU checkpoints
- ✅ **vLLM and SGLang backends** (TensorRT-LLM planned)
- ✅ **LLM decode/prefill workers only** (multimodal, embedding, and diffusion workers are not supported)
- ✅ Cross-node, single-GPU checkpoints
- ✅ PVC storage backend (RWX for multi-node)
- ✅ CUDA checkpoint/restore
- ✅ PyTorch distributed state (with `GLOO_SOCKET_IFNAME=lo`)
......@@ -82,7 +69,6 @@ helm install chrek nvidia/chrek \
- ✅ Automatic signal-based checkpoint coordination
### 🚧 Planned Features
- 🚧 SGLang backend support
- 🚧 TensorRT-LLM backend support
- 🚧 S3/MinIO storage backend
- 🚧 OCI registry storage backend
......@@ -101,7 +87,8 @@ helm install chrek nvidia/chrek \
- Potentially compromise node security if exploited
### Technical Limitations
- **vLLM backend only**: Currently only the vLLM backend supports checkpoint/restore. SGLang and TensorRT-LLM support is planned.
- **vLLM and SGLang backends only**: TensorRT-LLM support is planned.
- **LLM workers only**: Checkpoint/restore supports LLM decode and prefill workers. Specialized workers (multimodal, embedding, diffusion) are not supported.
- **Single-node only**: Checkpoints must be created and restored on the same node
- **Single-GPU only**: Multi-GPU configurations not yet supported
- **Network state limitations**: Active TCP connections are closed during restore (use `tcp-close` CRIU option)
......@@ -118,7 +105,6 @@ ChReK is best suited for:
### Getting Started
- [Dynamo Integration Guide](dynamo.md) - Using ChReK with Dynamo Platform
- [Standalone Usage Guide](standalone.md) - Using ChReK independently
- [ChReK Helm Chart README](https://github.com/ai-dynamo/dynamo/tree/main/deploy/helm/charts/chrek/README.md) - Helm chart configuration
### Related Documentation
......@@ -132,28 +118,6 @@ ChReK is best suited for:
- RWX storage class (for multi-node deployments)
- **Security clearance for privileged DaemonSet** (the ChReK agent runs privileged with hostPID/hostIPC/hostNetwork)
## Troubleshooting
### Common Issues
**DaemonSet not starting?**
- Check GPU node labels: `kubectl get nodes -l nvidia.com/gpu.present=true`
- Verify NVIDIA runtime is available
**Checkpoint fails?**
- Check DaemonSet logs: `kubectl logs -l app.kubernetes.io/name=chrek -n <namespace>`
- Ensure application properly signals readiness
- Verify CRIU is installed in the runtime
**Restore fails?**
- Ensure restore pod uses the same image (built with `placeholder` target) and volume mounts as checkpoint job
- Verify the DaemonSet is running on the same node as the restore pod
- Check DaemonSet logs for CRIU errors: `kubectl logs -l app.kubernetes.io/name=chrek`
For detailed troubleshooting, see:
- [Dynamo Integration Guide - Troubleshooting](dynamo.md#troubleshooting)
- [Standalone Guide - Troubleshooting](standalone.md#troubleshooting)
## Contributing
ChReK is part of the NVIDIA Dynamo project. Contributions are welcome!
......
......@@ -8,22 +8,17 @@ title: Integration with Dynamo
> ⚠️ **Experimental Feature**: ChReK is currently in **beta/preview**. The ChReK DaemonSet runs in privileged mode to perform CRIU operations. See [Limitations](#limitations) for details.
Reduce cold start times for LLM inference workers from ~3 minutes to ~30 seconds using container checkpointing.
## Overview
Checkpointing captures the complete state of a running worker pod (including GPU memory) and saves it to storage. New pods can restore from this checkpoint instead of performing a full cold start.
| Startup Type | Time | What Happens |
|--------------|------|--------------|
| **Cold Start** | ~3 min | Download model, load to GPU, initialize engine |
| **Warm Start** (checkpoint) | ~30 sec | Restore from checkpoint tar |
| **Cold Start** | ~1 min | Download model, load to GPU, initialize engine |
| **Warm Start** (checkpoint) | < 10 sec | Restore from checkpoint tar |
## Prerequisites
- Dynamo Platform installed (v0.4.0+)
- Dynamo Platform installed (v0.4.0+) on k8s cluster with GPU nodes
- ChReK Helm chart installed (separate from platform)
- GPU nodes with containerd runtime (CRIU is bundled in ChReK images)
- RWX PVC storage (PVC is currently the only supported backend)
## Quick Start
......@@ -63,7 +58,9 @@ dynamo-operator:
### 2. Configure Your DGD
Add checkpoint configuration to your service:
Add checkpoint configuration to your worker service. Both vLLM and SGLang are supported — use the appropriate `backendFramework`, command, and CLI flags.
#### vLLM Example
```yaml
apiVersion: nvidia.com/v1alpha1
......@@ -72,93 +69,105 @@ metadata:
name: my-llm
spec:
services:
VllmWorker:
worker:
replicas: 1
extraPodSpec:
mainContainer:
image: nvcr.io/nvidia/ai-dynamo/dynamo-vllm:latest
image: nvcr.io/nvidia/ai-dynamo/dynamo-vllm-placeholder:latest
command: ["python3"]
args:
- python3 -m dynamo.vllm --model meta-llama/Llama-3-8B
- "-m"
- "dynamo.vllm"
- "--model"
- "meta-llama/Llama-3-8B"
- "--max-model-len"
- "4096"
- "--gpu-memory-utilization"
- "0.90"
env:
# Required for cross-node checkpoint/restore
- name: GLOO_SOCKET_IFNAME
value: "lo"
- name: NCCL_SOCKET_IFNAME
value: "lo"
resources:
limits:
nvidia.com/gpu: "1"
# Checkpoint configuration
checkpoint:
enabled: true
mode: auto # Automatically create checkpoint if not found
mode: auto
identity:
model: "meta-llama/Llama-3-8B"
backendFramework: "vllm"
tensorParallelSize: 1
dtype: "bfloat16"
maxModelLen: 4096
```
### 3. Deploy
```bash
kubectl apply -f my-llm.yaml -n dynamo-system
```
On first deployment:
1. A checkpoint job runs to create the checkpoint
2. Worker pods start with cold start (checkpoint not ready yet)
3. Once checkpoint is ready, new pods (scale-up, restarts) restore from checkpoint
## Storage Backends
### PVC (Currently Supported)
Use when you have RWX storage available (e.g., NFS, EFS, Filestore).
#### SGLang Example
```yaml
checkpoint:
storage:
type: pvc
pvc:
pvcName: "chrek-pvc"
basePath: "/checkpoints"
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
name: my-sglang-llm
spec:
services:
worker:
replicas: 1
extraPodSpec:
mainContainer:
image: nvcr.io/nvidia/ai-dynamo/dynamo-sglang-placeholder:latest
command: ["python3"]
args:
- "-m"
- "dynamo.sglang"
- "--model"
- "meta-llama/Llama-3-8B"
- "--mem-fraction-static"
- "0.90"
env:
# Required for cross-node checkpoint/restore
- name: GLOO_SOCKET_IFNAME
value: "lo"
- name: NCCL_SOCKET_IFNAME
value: "lo"
resources:
limits:
nvidia.com/gpu: "1"
checkpoint:
enabled: true
mode: auto
identity:
model: "meta-llama/Llama-3-8B"
backendFramework: "sglang"
tensorParallelSize: 1
dtype: "bfloat16"
maxModelLen: 4096
```
**Requirements:**
- RWX (ReadWriteMany) PVC for multi-node access
- Sufficient storage (checkpoints are ~10-50GB per model)
**Key differences between backends:**
### S3 / MinIO (Planned - Not Yet Implemented)
| Setting | vLLM | SGLang |
|---------|------|--------|
| Module | `dynamo.vllm` | `dynamo.sglang` |
| Max context (optional) | `--max-model-len` | `--context-length` |
| GPU memory | `--gpu-memory-utilization` | `--mem-fraction-static` |
| Placeholder image | `dynamo-vllm-placeholder` | `dynamo-sglang-placeholder` |
| Identity `backendFramework` | `"vllm"` | `"sglang"` |
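
The flag differences in the table above are easy to mix up when generating worker specs for both backends. The following is a minimal sketch of a helper that builds the container `args` list for either backend. The function name `build_worker_args` and the `BACKEND_FLAGS` mapping are hypothetical and not part of Dynamo; only the module names and CLI flags come from the table and the examples.

```python
from typing import List, Optional

# Module names and flags are copied from the comparison table.
# The mapping and the helper below are illustrative, not a Dynamo API.
BACKEND_FLAGS = {
    "vllm": {
        "module": "dynamo.vllm",
        "max_context": "--max-model-len",
        "gpu_memory": "--gpu-memory-utilization",
    },
    "sglang": {
        "module": "dynamo.sglang",
        "max_context": "--context-length",
        "gpu_memory": "--mem-fraction-static",
    },
}


def build_worker_args(
    backend: str,
    model: str,
    max_context: Optional[int] = None,
    gpu_memory_fraction: Optional[float] = None,
) -> List[str]:
    """Return the args list to pass after `command: ["python3"]`."""
    if backend not in BACKEND_FLAGS:
        raise ValueError(f"unsupported backend: {backend!r}")
    flags = BACKEND_FLAGS[backend]
    args = ["-m", flags["module"], "--model", model]
    # Both settings are optional, so emit a flag only when a value is given.
    if max_context is not None:
        args += [flags["max_context"], str(max_context)]
    if gpu_memory_fraction is not None:
        args += [flags["gpu_memory"], f"{gpu_memory_fraction:.2f}"]
    return args
```

Calling `build_worker_args("vllm", "meta-llama/Llama-3-8B", 4096, 0.90)` yields the same argument list as the vLLM example, and switching the first argument to `"sglang"` swaps in the SGLang flag names.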
> ⚠️ **Note:** S3 storage backend is defined in the API but not yet fully implemented.
Object storage support is planned for a future release. The configuration will look like:
```yaml
checkpoint:
storage:
type: s3 # Not yet supported
s3:
# AWS S3
uri: "s3://my-bucket/checkpoints"
> **Note:** Do **not** set `DYN_READY_FOR_CHECKPOINT_FILE` or `DYN_CHECKPOINT_READY_FILE` in the DGD worker env vars. These are injected automatically by the operator's checkpoint controller into checkpoint job pods only. Setting them on worker pods causes all workers to enter checkpoint mode instead of cold-starting normally.
# Or MinIO / custom S3
uri: "s3://minio.example.com/my-bucket/checkpoints"
### 3. Deploy
# Optional: credentials secret
credentialsSecretRef: "s3-creds"
```bash
kubectl apply -f my-llm.yaml -n dynamo-system
```
### OCI Registry (Planned - Not Yet Implemented)
> ⚠️ **Note:** OCI registry storage backend is defined in the API but not yet fully implemented.
Container registry storage support is planned for a future release. The configuration will look like:
```yaml
checkpoint:
storage:
type: oci # Not yet supported
oci:
uri: "oci://myregistry.io/checkpoints"
credentialsSecretRef: "registry-creds" # Docker config secret
```
On first deployment:
1. A checkpoint job runs to create the checkpoint
2. Worker pods start with cold start (checkpoint not ready yet)
3. Once checkpoint is ready, new pods (scale-up, restarts) restore from checkpoint
## Checkpoint Modes
......@@ -172,8 +181,10 @@ checkpoint:
mode: auto
identity:
model: "meta-llama/Llama-3-8B"
backendFramework: "vllm"
backendFramework: "vllm" # or "sglang"
tensorParallelSize: 1
dtype: "bfloat16"
maxModelLen: 4096
```
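
In `auto` mode the operator finds or creates a checkpoint for the given identity, and the environment variable table lists `DYN_CHECKPOINT_HASH` as an identity hash. The sketch below illustrates the general idea of deriving a stable key from the identity fields. It is an assumption for illustration only: the operator's actual hashing scheme is not shown in this document, and the choice of SHA-256 over canonical JSON, the 16-character truncation, and the name `identity_hash` are all hypothetical.

```python
import hashlib
import json

# Field names match the `identity` block in the YAML examples.
IDENTITY_FIELDS = (
    "model",
    "backendFramework",
    "tensorParallelSize",
    "dtype",
    "maxModelLen",
)


def identity_hash(identity: dict) -> str:
    """Derive a deterministic short key from checkpoint identity fields.

    Hypothetical scheme: serialize the known fields as canonical JSON
    (sorted keys, no extra whitespace) and take a truncated SHA-256 digest.
    """
    selected = {k: identity[k] for k in IDENTITY_FIELDS if k in identity}
    canonical = json.dumps(selected, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]
```

Under this scheme, two deployments with identical identity fields map to the same key regardless of field order, while changing any field, such as `backendFramework` from `"vllm"` to `"sglang"`, produces a different key and therefore a separate checkpoint.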
### Reference Mode
......@@ -347,26 +358,12 @@ Or use `auto` mode and the operator will find/create it automatically.
## Limitations
⚠️ **Important**: ChReK has significant limitations that impact production readiness:
### Security Considerations
- **🔴 Privileged DaemonSet**: The ChReK DaemonSet runs in privileged mode with `hostPID`, `hostIPC`, and `hostNetwork` to perform CRIU operations externally
- Workload pods (checkpoint jobs, restore pods) do **not** need privileged mode — all CRIU privilege lives in the DaemonSet
- The privileged DaemonSet has elevated host access, which may violate security policies in many production environments
### Technical Limitations
- **vLLM backend only**: Currently only the vLLM backend supports checkpoint/restore. SGLang and TensorRT-LLM support is planned.
- **Single-node only**: Checkpoints must be created and restored on the same node
- **Single-GPU only**: Multi-GPU configurations are not yet supported
- **vLLM and SGLang backends only**: TensorRT-LLM support is planned.
- **LLM workers only**: Checkpoint/restore supports LLM decode and prefill workers. Specialized workers (multimodal, embedding, diffusion) are not supported.
- **Single-GPU only**: Multi-GPU configurations are not yet supported (planned)
- **Network state**: Active TCP connections are closed during restore (handled with `tcp-close` CRIU option)
- **Storage**: Only PVC backend currently implemented (S3/OCI planned)
### Recommendation
ChReK is **experimental/beta** and best suited for:
- ✅ Development and testing environments
- ✅ Research and experimentation
- ✅ Controlled production environments with appropriate security controls
- ❌ Security-sensitive production workloads without proper risk assessment
- **Security**: ChReK runs as a **privileged DaemonSet** which is required to run CRIU
## Troubleshooting
......@@ -399,9 +396,6 @@ ChReK is **experimental/beta** and best suited for:
```bash
# For PVC
kubectl exec -it <any-pod-with-pvc> -- ls -la /checkpoints/
# For S3
aws s3 ls s3://my-bucket/checkpoints/
```
3. Check environment variables:
......@@ -418,18 +412,11 @@ Pods fall back to cold start if:
Check logs for "Falling back to cold start" message.
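
When many pods are involved, scanning logs by hand for that message gets tedious. As a minimal sketch, the helper below filters captured log lines for the fallback message quoted above. The function name `find_cold_start_fallbacks` is hypothetical, and the exact format of the surrounding log line is not specified here, so the match is a plain substring check.

```python
from typing import Iterable, List

# Message text as quoted in this guide; the rest of the log line format is unknown.
FALLBACK_MARKER = "Falling back to cold start"


def find_cold_start_fallbacks(log_lines: Iterable[str]) -> List[str]:
    """Return the log lines that report a fallback to cold start."""
    return [
        line.rstrip("\n")
        for line in log_lines
        if FALLBACK_MARKER in line
    ]
```

The function accepts any iterable of lines, for example the saved output of `kubectl logs <pod-name>` opened as a file.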
## Best Practices
1. **Use RWX PVCs** for multi-node deployments (currently the only supported backend)
2. **Pre-warm checkpoints** before scaling up
3. **Monitor checkpoint size** - large models create large checkpoints
4. **Clean up old checkpoints** to save storage
## Environment Variables
| Variable | Description |
|----------|-------------|
| `DYN_CHECKPOINT_STORAGE_TYPE` | Backend: `pvc`, `s3`, `oci` |
| `DYN_CHECKPOINT_STORAGE_TYPE` | Backend: `pvc`, `s3`, `oci` (`s3` and `oci` are currently no-ops) |
| `DYN_CHECKPOINT_LOCATION` | Full checkpoint location (checkpoint jobs) |
| `DYN_CHECKPOINT_PATH` | Base checkpoint directory (restore pods, PVC) |
| `DYN_CHECKPOINT_HASH` | Identity hash |
......@@ -459,21 +446,27 @@ spec:
spec:
containers:
- name: main
image: nvcr.io/nvidia/ai-dynamo/dynamo-vllm:latest
command: ["python3", "-m", "dynamo.vllm"]
image: nvcr.io/nvidia/ai-dynamo/dynamo-vllm-placeholder:latest
command: ["python3"]
args:
- "-m"
- "dynamo.vllm"
- "--model"
- "meta-llama/Meta-Llama-3-8B-Instruct"
- "--tensor-parallel-size"
- "1"
- "--dtype"
- "bfloat16"
- "--max-model-len"
- "4096"
- "--gpu-memory-utilization"
- "0.90"
env:
- name: HF_TOKEN
valueFrom:
secretKeyRef:
name: hf-token-secret
key: HF_TOKEN
- name: GLOO_SOCKET_IFNAME
value: "lo"
- name: NCCL_SOCKET_IFNAME
value: "lo"
resources:
limits:
nvidia.com/gpu: "1"
......@@ -489,11 +482,26 @@ metadata:
namespace: dynamo-system
spec:
services:
VllmWorker:
worker:
replicas: 2
extraPodSpec:
mainContainer:
image: nvcr.io/nvidia/ai-dynamo/dynamo-vllm:latest
image: nvcr.io/nvidia/ai-dynamo/dynamo-vllm-placeholder:latest
command: ["python3"]
args:
- "-m"
- "dynamo.vllm"
- "--model"
- "meta-llama/Meta-Llama-3-8B-Instruct"
- "--max-model-len"
- "4096"
- "--gpu-memory-utilization"
- "0.90"
env:
- name: GLOO_SOCKET_IFNAME
value: "lo"
- name: NCCL_SOCKET_IFNAME
value: "lo"
resources:
limits:
nvidia.com/gpu: "1"
......@@ -505,7 +513,6 @@ spec:
## Related Documentation
- [ChReK Overview](README.md) - ChReK architecture and use cases
- [ChReK Standalone Usage Guide](standalone.md) - Use ChReK without Dynamo Platform
- [ChReK Helm Chart README](https://github.com/ai-dynamo/dynamo/tree/main/deploy/helm/charts/chrek/README.md) - Chart configuration
- [Installation Guide](../installation-guide.md) - Platform installation
- [API Reference](../api-reference.md) - Complete CRD specifications
......
This diff is collapsed.
......@@ -58,8 +58,6 @@ navigation:
contents:
- page: Integration with Dynamo
path: ../pages/kubernetes/chrek/dynamo.md
- page: Standalone Usage
path: ../pages/kubernetes/chrek/standalone.md
- section: Observability (K8s)
contents:
- page: Metrics
......