---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: Snapshot
---

> ⚠️ **Experimental Feature**: Dynamo Snapshot is currently in **preview** and may work only in some Kubernetes cluster setups. The Dynamo Snapshot DaemonSet runs in privileged mode to perform CRIU operations. See [Limitations](#limitations) for details.

**Dynamo Snapshot** is experimental infrastructure for fast-starting GPU applications in Kubernetes using CRIU (Checkpoint/Restore In Userspace) and NVIDIA's cuda-checkpoint utility. It reduces cold-start times for large models from minutes to seconds by capturing initialized application state and restoring it on demand.

| Startup Type | Time | What Happens |
|--------------|------|--------------|
| **Cold Start** | ~1 min | Download model, load to GPU, initialize engine |
| **Warm Start** (restore from checkpoint) | ~10 sec | Restore from a ready checkpoint directory |

> ⚠️ Restore time may vary depending on cluster configuration (storage bandwidth, GPU model, etc.)

## Prerequisites

- Dynamo Platform/Operator installed on a k8s cluster with **x86_64 (amd64)** GPU nodes
- NVIDIA driver 580.xx or newer on the target GPU nodes
- `ReadWriteMany` storage if you need cross-node restore
- vLLM or SGLang backend (TensorRT-LLM is not supported yet)
- Security clearance to run a privileged DaemonSet
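The driver requirement above can be checked with a small helper. This is a minimal sketch: `driver_meets_minimum` is a hypothetical function name, and it only compares the major component of a version string against the 580 minimum stated in the prerequisites.

```bash
# Hypothetical helper: succeed when a driver version string such as
# "580.65.06" has a major version at or above the minimum (default 580).
driver_meets_minimum() {
  local version="$1" minimum="${2:-580}"
  local major="${version%%.*}"   # keep everything before the first dot
  case "$major" in
    ''|*[!0-9]*)
      echo "unrecognized driver version: ${version}" >&2
      return 2 ;;
  esac
  [ "$major" -ge "$minimum" ]
}
```

On a GPU node, one way to feed it is `driver_meets_minimum "$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n1)"`. Node architecture can be listed with `kubectl get nodes -L kubernetes.io/arch`.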

## Quick Start

This guide assumes a normal Dynamo deployment workflow is already present on your Kubernetes cluster.

### 1. Build and push a placeholder image

Snapshot-enabled workers must use a placeholder image that wraps the normal runtime image with the restore tooling. If you do not already have one, build it with the snapshot placeholder target and push it to a registry your cluster can pull from:

```bash
export RUNTIME_IMAGE=registry.example.com/dynamo/vllm-runtime:1.0.0
export PLACEHOLDER_IMAGE=registry.example.com/dynamo/vllm-placeholder:1.0.0

cd deploy/snapshot

make docker-build-placeholder \
  PLACEHOLDER_BASE_IMG="${RUNTIME_IMAGE}" \
  PLACEHOLDER_IMG="${PLACEHOLDER_IMAGE}"

make docker-push-placeholder \
  PLACEHOLDER_IMG="${PLACEHOLDER_IMAGE}"
```

This flow is defined in [deploy/snapshot/Makefile](https://github.com/ai-dynamo/dynamo/blob/main/deploy/snapshot/Makefile) and [deploy/snapshot/Dockerfile](https://github.com/ai-dynamo/dynamo/blob/main/deploy/snapshot/Dockerfile). The placeholder image preserves the base runtime entrypoint and command contract, and adds the CRIU, `cuda-checkpoint`, and `nsrestore` tooling needed for restore.
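Before pushing, it can be useful to confirm that the restore tooling is present in the built image. The sketch below is an assumption-heavy sanity check, not part of the documented flow: `check_placeholder_tools` is a hypothetical helper, and it assumes the tools are installed on `PATH` under the names `criu`, `cuda-checkpoint`, and `nsrestore` and that the image contains `sh`.

```bash
# Hypothetical helper: report which restore tools are on PATH in an image.
# Assumes the binaries are named criu, cuda-checkpoint, and nsrestore.
check_placeholder_tools() {
  local image="$1" tool missing=0
  for tool in criu cuda-checkpoint nsrestore; do
    # Override the entrypoint so the runtime command does not start.
    if docker run --rm --entrypoint sh "$image" -c "command -v $tool" >/dev/null 2>&1; then
      echo "found: $tool"
    else
      echo "missing: $tool" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

For example, `check_placeholder_tools "${PLACEHOLDER_IMAGE}"` returns non-zero if any of the assumed binaries cannot be found.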

To build either snapshot image against a custom CRIU fork or ref, pass
`CRIU_REPO` and `CRIU_REF` through `make`. If they are unset, the Dockerfile
defaults are used.

```bash
make docker-build-agent \
  IMG=registry.example.com/dynamo/snapshot-agent:1.0.0 \
  CRIU_REPO="${YOUR_CRIU_REPO}" \
  CRIU_REF="branch-or-sha"

make docker-build-placeholder \
  PLACEHOLDER_BASE_IMG="${RUNTIME_IMAGE}" \
  PLACEHOLDER_IMG="${PLACEHOLDER_IMAGE}" \
  CRIU_REPO="${YOUR_CRIU_REPO}" \
  CRIU_REF="branch-or-sha"
```

### 2. Enable checkpointing in the platform and verify it

Whether you are installing or upgrading `dynamo-platform`, the operator must have checkpointing enabled and must point at the same storage that the snapshot chart will use:

```yaml
dynamo-operator:
  checkpoint:
    enabled: true
    storage:
      type: pvc
      pvc:
        pvcName: snapshot-pvc
        basePath: /checkpoints
```

If the platform is already installed, verify that the operator config contains the checkpoint block:

```bash
OPERATOR_CONFIG=$(kubectl get deploy -n "${PLATFORM_NAMESPACE}" \
  -l app.kubernetes.io/name=dynamo-operator,app.kubernetes.io/component=manager \
  -o jsonpath='{.items[0].spec.template.spec.volumes[?(@.name=="operator-config")].configMap.name}')

kubectl get configmap "${OPERATOR_CONFIG}" -n "${PLATFORM_NAMESPACE}" \
  -o jsonpath='{.data.config\.yaml}' | sed -n '/^checkpoint:/,/^[^[:space:]]/p'
```

Verify that the rendered config includes `enabled: true` and the same PVC name and base path you plan to use for the snapshot chart.
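The comparison can also be scripted. The following is a minimal sketch that reads the operator config on stdin; `yaml_value` and `check_checkpoint_block` are hypothetical helpers, and the sketch assumes the rendered block uses the same key names as the Helm values shown above (`enabled`, `pvcName`, `basePath`).

```bash
# Print the first scalar value for a YAML key read from stdin (quotes stripped).
yaml_value() {
  awk -v key="$1:" '$1 == key { print $2; exit }' | tr -d "\"'"
}

# Hypothetical helper: compare the checkpoint block on stdin with expected values.
# Assumes the key names enabled, pvcName, and basePath appear in the block.
check_checkpoint_block() {
  local want_pvc="$1" want_base="$2" block enabled pvc base
  # Same extraction as the sed command above: from "checkpoint:" to the
  # next top-level key.
  block=$(sed -n '/^checkpoint:/,/^[^[:space:]]/p')
  enabled=$(printf '%s\n' "$block" | yaml_value enabled)
  pvc=$(printf '%s\n' "$block" | yaml_value pvcName)
  base=$(printf '%s\n' "$block" | yaml_value basePath)
  if [ "$enabled" != "true" ]; then
    echo "checkpointing is not enabled" >&2; return 1
  fi
  if [ "$pvc" != "$want_pvc" ]; then
    echo "PVC mismatch: got '${pvc}', want '${want_pvc}'" >&2; return 1
  fi
  if [ "$base" != "$want_base" ]; then
    echo "base path mismatch: got '${base}', want '${want_base}'" >&2; return 1
  fi
  echo "checkpoint config OK: ${pvc}:${base}"
}
```

To use it, pipe the `kubectl get configmap ... -o jsonpath='{.data.config\.yaml}'` output from the command above into `check_checkpoint_block snapshot-pvc /checkpoints`.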

For the full platform/operator configuration surface, see [deploy/helm/charts/platform/README.md](https://github.com/ai-dynamo/dynamo/blob/main/deploy/helm/charts/platform/README.md) and [deploy/helm/charts/platform/components/operator/values.yaml](https://github.com/ai-dynamo/dynamo/blob/main/deploy/helm/charts/platform/components/operator/values.yaml).

### 3. Install the snapshot chart

```bash
helm upgrade --install snapshot ./deploy/helm/charts/snapshot \
  --namespace ${NAMESPACE} \
  --create-namespace \
  --set storage.pvc.create=true
```

Cross-node restore requires `ReadWriteMany` storage. The chart defaults to that mode.

For better restore times, use a fast `ReadWriteMany` StorageClass for the checkpoint PVC. If you are reusing an existing checkpoint PVC, do not set `storage.pvc.create=true`; install the chart with `storage.pvc.create=false` and point `storage.pvc.name` at the existing PVC instead.
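The two install variants described above can be wrapped in one function. This is a sketch only: `install_snapshot_chart` is a hypothetical wrapper, and it uses just the `storage.pvc.create` and `storage.pvc.name` values mentioned in this section.

```bash
# Hypothetical wrapper: install the snapshot chart, either creating the PVC
# or reusing an existing one when a PVC name is passed as the second argument.
install_snapshot_chart() {
  local namespace="$1" existing_pvc="${2:-}"
  local -a args=(upgrade --install snapshot ./deploy/helm/charts/snapshot
                 --namespace "$namespace" --create-namespace)
  if [ -n "$existing_pvc" ]; then
    # Reuse: do not create a PVC, point the chart at the existing one.
    args+=(--set storage.pvc.create=false --set "storage.pvc.name=${existing_pvc}")
  else
    args+=(--set storage.pvc.create=true)
  fi
  helm "${args[@]}"
}
```

`install_snapshot_chart "${NAMESPACE}"` matches the command at the top of this step, while `install_snapshot_chart "${NAMESPACE}" my-existing-pvc` reuses a PVC (the PVC name here is a placeholder). The access modes of an existing PVC can be inspected with `kubectl get pvc <name> -n ${NAMESPACE} -o jsonpath='{.spec.accessModes}'`.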

Verify that the PVC and DaemonSet are ready:

```bash
kubectl get pvc snapshot-pvc -n ${NAMESPACE}
kubectl rollout status daemonset/snapshot-agent -n ${NAMESPACE}
```

For the full snapshot chart configuration surface, see [deploy/helm/charts/snapshot/README.md](https://github.com/ai-dynamo/dynamo/blob/main/deploy/helm/charts/snapshot/README.md) and [deploy/helm/charts/snapshot/values.yaml](https://github.com/ai-dynamo/dynamo/blob/main/deploy/helm/charts/snapshot/values.yaml).

### 4. Apply a snapshot-compatible `DynamoGraphDeployment`

This example is adapted from [examples/backends/vllm/deploy/agg.yaml](https://github.com/ai-dynamo/dynamo/blob/main/examples/backends/vllm/deploy/agg.yaml). The worker must use the placeholder image from step 1, and the checkpoint identity must describe the runtime state you want to reuse.

```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: vllm-snapshot-demo
spec:
  services:
    Frontend:
      componentType: frontend
      replicas: 1
      extraPodSpec:
        mainContainer:
          image: registry.example.com/dynamo/vllm-runtime:1.0.0

    VllmDecodeWorker:
      componentType: worker
      replicas: 1
      resources:
        limits:
          gpu: "1"
      readinessProbe:
        httpGet:
          path: /live
          port: system
        periodSeconds: 1
        timeoutSeconds: 4
        failureThreshold: 3
      checkpoint:
        enabled: true
        mode: Auto
        identity:
          model: Qwen/Qwen3-0.6B
          backendFramework: vllm
      extraPodSpec:
        mainContainer:
          image: registry.example.com/dynamo/vllm-placeholder:1.0.0
          command:
            - python3
            - -m
            - dynamo.vllm
          args:
            - --model
            - Qwen/Qwen3-0.6B
          env:
            - name: NCCL_DEBUG
              value: ERROR
            - name: TORCH_CPP_LOG_LEVEL
              value: ERROR
            - name: TORCH_DISTRIBUTED_DEBUG
              value: "OFF"
```

For SGLang, use `dynamo.sglang`, an SGLang placeholder image, `backendFramework: sglang`, and the matching CLI flags.

Apply the manifest:

```bash
kubectl apply -f vllm-snapshot-demo.yaml -n ${NAMESPACE}
```

On the first rollout, the worker cold-starts, the operator resolves the checkpoint identity hash, and the checkpoint Job writes a new checkpoint directory into `snapshot-pvc`.

### 5. Wait for the checkpoint to become ready

Auto mode resolves checkpoints by identity hash. It may create `checkpoint-<hash>` or reuse an existing checkpoint with a different CR name. For the sample identity above, the hash is `73e74442beb109ed`:

```bash
kubectl get dckpt -n ${NAMESPACE}

CKPT_NAME=$(kubectl get dckpt -n ${NAMESPACE} \
  -l nvidia.com/snapshot-checkpoint-hash=73e74442beb109ed \
  -o jsonpath='{.items[0].metadata.name}')

kubectl wait \
  --for=jsonpath='{.status.phase}'=Ready \
  "dynamocheckpoint/${CKPT_NAME}" \
  -n ${NAMESPACE} \
  --timeout=5m
```

If you change the checkpoint identity, the hash changes and so does the checkpoint selected by Auto mode.

### 6. Trigger restore

Once the checkpoint is ready, scale the worker replicas from `1` to `2`:

```bash
kubectl patch dgd vllm-snapshot-demo -n ${NAMESPACE} --type=merge \
  -p '{"spec":{"services":{"VllmDecodeWorker":{"replicas":2}}}}'
```

New worker pods for `VllmDecodeWorker` will restore from the ready checkpoint automatically.
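When scaling is scripted for different services or replica counts, the merge patch can be built by a helper rather than typed by hand. The sketch below is one possible shape; `replica_patch` and `scale_service` are hypothetical names, and the patch structure is the one used in the command above.

```bash
# Hypothetical helper: print the merge patch that sets replicas for one service.
replica_patch() {
  local service="$1" replicas="$2"
  case "$replicas" in
    ''|*[!0-9]*)
      echo "replicas must be a non-negative integer" >&2
      return 1 ;;
  esac
  printf '{"spec":{"services":{"%s":{"replicas":%s}}}}\n' "$service" "$replicas"
}

# Hypothetical helper: apply the patch to a DynamoGraphDeployment.
scale_service() {
  local dgd="$1" service="$2" replicas="$3" namespace="$4" patch
  patch=$(replica_patch "$service" "$replicas") || return 1
  kubectl patch dgd "$dgd" -n "$namespace" --type=merge -p "$patch"
}
```

`scale_service vllm-snapshot-demo VllmDecodeWorker 2 "${NAMESPACE}"` is equivalent to the `kubectl patch` command shown in this step.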

## Checkpoint Configuration

### Auto Mode (Recommended)

The operator computes the checkpoint identity hash, looks up an existing `DynamoCheckpoint` by that hash, and creates a new `DynamoCheckpoint` only when no matching checkpoint already exists:

```yaml
checkpoint:
  enabled: true
  mode: Auto
  identity:
    model: "meta-llama/Llama-3-8B"
    backendFramework: "vllm"  # or "sglang"
    tensorParallelSize: 1
    dtype: "bfloat16"
    maxModelLen: 4096
```

The `DynamoGraphDeployment` mirrors checkpoint resolution state under `.status.checkpoints`, including the resolved checkpoint CR name, identity hash, and whether the checkpoint was visible to the worker when it started:

```bash
kubectl get dgd vllm-snapshot-demo -n ${NAMESPACE} \
  -o jsonpath='{.status.checkpoints.VllmDecodeWorker.checkpointName}{"\n"}{.status.checkpoints.VllmDecodeWorker.identityHash}{"\n"}'
```

### Manual Management and `checkpointRef`

Use `checkpointRef` when you want a service to restore from a specific `DynamoCheckpoint` CR:

```yaml
checkpoint:
  enabled: true
  checkpointRef: "qwen3-06b-bf16"
```

This is useful when:
- You want to **pre-warm checkpoints** before creating DGDs
- You want **explicit control** over which checkpoint to use

`checkpointRef` resolves by `DynamoCheckpoint.metadata.name`. Use a readable CR name when you want an explicit checkpoint that operators can reference directly.

If you are managing checkpoint CRs yourself, set `mode: Manual` on the service to prevent the operator from creating a new `DynamoCheckpoint` when identity-based lookup does not find one.

```bash
# Check checkpoint status by CR name
kubectl get dynamocheckpoint qwen3-06b-bf16 -n ${NAMESPACE}

# Now create DGD referencing it
kubectl apply -f my-dgd.yaml -n ${NAMESPACE}
```

`mode: Auto` still resolves checkpoints by identity hash. The operator backfills `status.identityHash` and the `nvidia.com/snapshot-checkpoint-hash` label on each `DynamoCheckpoint` so auto lookup and uniqueness checks do not depend on the CR name.

## Checkpoint Identity

Each checkpoint is identified by a **16-character hexadecimal hash** (64 bits, derived from SHA-256) of the configuration that affects runtime state:

| Field | Required | Affects Hash | Example |
|-------|----------|-------------|---------|
| `model` | ✓ | ✓ | `meta-llama/Llama-3-8B` |
| `backendFramework` | ✓ | ✓ | `sglang`, `vllm` |
| `dynamoVersion` | | ✓ | `0.9.0`, `1.0.0` |
| `tensorParallelSize` | | ✓ | `1`, `2`, `4`, `8` (default: 1) |
| `pipelineParallelSize` | | ✓ | `1`, `2` (default: 1) |
| `dtype` | | ✓ | `float16`, `bfloat16`, `fp8` |
| `maxModelLen` | | ✓ | `4096`, `8192` |
| `extraParameters` | | ✓ | Custom key-value pairs |

**Not included in hash** (changing these does not invalidate a checkpoint):
- `replicas`
- `nodeSelector`, `affinity`, `tolerations`
- `resources` (requests/limits)
- Logging/observability config

**Example with all fields:**
```yaml
checkpoint:
  enabled: true
  mode: Auto
  identity:
    model: "meta-llama/Llama-3-8B"
    backendFramework: "vllm"
    dynamoVersion: "1.0.0"
    tensorParallelSize: 1
    pipelineParallelSize: 1
    dtype: "bfloat16"
    maxModelLen: 8192
    extraParameters:
      enableChunkedPrefill: "true"
      quantization: "awq"
```
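To build intuition for why identical identities resolve to the same checkpoint, here is an illustrative sketch of a deterministic 16-character hash over identity fields. It is **not** the operator's algorithm, and its output will not match `status.identityHash`; `identity_hash` and the `key=value` encoding are assumptions made only for this illustration.

```bash
# Illustration only: NOT the operator's hashing scheme. Sorting the
# key=value pairs first makes the result independent of argument order.
identity_hash() {
  printf '%s\n' "$@" | LC_ALL=C sort | sha256sum | cut -c1-16
}
```

Calling `identity_hash model=meta-llama/Llama-3-8B backendFramework=vllm dtype=bfloat16` prints the same 16 hexadecimal characters regardless of argument order, and changing any value (for example `dtype=float16`) produces a different result. Fields such as `replicas` are simply never passed in, which mirrors how excluded fields cannot affect the hash.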

## DynamoCheckpoint CRD

`DynamoCheckpoint` (short name: `dckpt`) is a Kubernetes custom resource that manages the checkpoint lifecycle.

**When to create a DynamoCheckpoint directly:**
- **Pre-warming:** Create checkpoints before deploying DGDs for instant startup
- **Explicit control:** Manage checkpoint lifecycle independently from DGDs

The operator requires `spec.identity` and `spec.job.podTemplateSpec`. The pod template should match the worker container you want checkpointed, including image, command, args, secrets, volumes, and resource limits. You do not need to set checkpoint-control plumbing manually; the operator injects the checkpoint-ready signal path for checkpoint Jobs and adds the restore metadata consumed by restored pods and the node-local controller inside the `snapshot-agent` DaemonSet.
`spec.job.backoffLimit` is deprecated and ignored. Checkpoint Jobs are always single-attempt.

**Create a checkpoint:**

```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoCheckpoint
metadata:
  name: qwen3-06b-bf16
spec:
  identity:
    model: Qwen/Qwen3-0.6B
    backendFramework: vllm
    tensorParallelSize: 1
    dtype: bfloat16
    maxModelLen: 4096

  job:
    activeDeadlineSeconds: 3600
    ttlSecondsAfterFinished: 300
    podTemplateSpec:
      spec:
        restartPolicy: Never
        containers:
          - name: main
            image: registry.example.com/dynamo/vllm-placeholder:1.0.0
            command:
              - python3
              - -m
              - dynamo.vllm
            args:
              - --model
              - Qwen/Qwen3-0.6B
            env:
              - name: NCCL_DEBUG
                value: ERROR
              - name: TORCH_CPP_LOG_LEVEL
                value: ERROR
              - name: TORCH_DISTRIBUTED_DEBUG
                value: "OFF"
            resources:
              limits:
                nvidia.com/gpu: "1"
```

For this example identity, the operator computes a deterministic identity hash and stores it in `status.identityHash`. Auto mode uses that hash, not the CR name, when it decides whether to reuse or create a checkpoint.

**Check status:**

```bash
# List all checkpoints
kubectl get dynamocheckpoint -n ${NAMESPACE}
# Or use shortname
kubectl get dckpt -n ${NAMESPACE}

# Example output:
NAME               MODEL                                BACKEND  PHASE     HASH              AGE
qwen3-06b-bf16     Qwen/Qwen3-0.6B                      vllm     Ready     3bff874d069f0ed5  5m
llama3-8b-bf16     meta-llama/Meta-Llama-3-8B-Instruct  vllm     Creating  9be4f5574b5a285d  2m
```

**Phases:**

| Phase | Description |
|-------|-------------|
| `Pending` | CR created, waiting for job to start |
| `Creating` | Checkpoint job is running |
| `Ready` | Checkpoint available for use |
| `Failed` | Checkpoint creation failed |

Other useful status fields are:

| Field | Meaning |
|-------|---------|
| `status.identityHash` | Deterministic hash of `spec.identity` used for auto lookup and reuse |
| `status.jobName` | Name of the checkpoint Job |
| `status.location` | Checkpoint location in the configured storage backend |
| `status.storageType` | Storage backend type (`pvc`, `s3`, or `oci`) |
| `status.createdAt` | Timestamp recorded when the checkpoint becomes ready |
| `status.message` | Failure or progress message when available |

`status.conditions` is deprecated for `DynamoCheckpoint`. The legacy condition types `JobCreated` and `JobCompleted` are kept for compatibility only. Prefer `status.phase`, `status.jobName`, and `status.message` when checking checkpoint progress.
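A polling loop built on those fields can fail fast when a checkpoint enters `Failed`, instead of waiting for a timeout. The following is a minimal sketch under the phase names listed above; `wait_for_checkpoint` is a hypothetical helper and the default timeout and interval are arbitrary choices.

```bash
# Hypothetical helper: poll status.phase until Ready (0), Failed (1),
# or the timeout in seconds expires (2).
wait_for_checkpoint() {
  local name="$1" namespace="$2" timeout="${3:-300}" interval="${4:-5}"
  local deadline phase message
  deadline=$(( $(date +%s) + timeout ))
  while :; do
    phase=$(kubectl get dckpt "$name" -n "$namespace" -o jsonpath='{.status.phase}')
    case "$phase" in
      Ready)
        echo "checkpoint ${name} is Ready"
        return 0 ;;
      Failed)
        # Surface the operator's message to explain the failure.
        message=$(kubectl get dckpt "$name" -n "$namespace" -o jsonpath='{.status.message}')
        echo "checkpoint ${name} failed: ${message}" >&2
        return 1 ;;
    esac
    if [ "$(date +%s)" -ge "$deadline" ]; then
      echo "timed out waiting for ${name} (last phase: ${phase:-unknown})" >&2
      return 2
    fi
    sleep "$interval"
  done
}
```

For example, `wait_for_checkpoint qwen3-06b-bf16 "${NAMESPACE}" 600` waits up to ten minutes and returns a distinct exit code for failure versus timeout.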

**Detailed status:**

```bash
kubectl describe dckpt qwen3-06b-bf16 -n ${NAMESPACE}
```

```yaml
Status:
  Phase: Ready
  IdentityHash: 3bff874d069f0ed5
  JobName: checkpoint-job-3bff874d069f0ed5
  Location: /checkpoints/3bff874d069f0ed5
  StorageType: pvc
  CreatedAt: 2026-01-29T10:05:00Z
```

**Reference from DGD:**

Once the checkpoint is `Ready`, you can reference it by CR name:

```yaml
spec:
  services:
    VllmDecodeWorker:
      checkpoint:
        enabled: true
        checkpointRef: "qwen3-06b-bf16"
```

Or use `mode: Auto` with the same identity, and the operator will reuse the same deterministic checkpoint object automatically.

## Limitations

- **LLM workers only**: Checkpoint/restore supports LLM decode and prefill workers. Specialized workers (multimodal, embedding, diffusion) are not supported.
- **Single-GPU only**: Multi-GPU setups may work on very simple hardware configurations, but they are not officially supported yet.
- **Network state**: Restore is sensitive to live TCP socket state. Loopback bootstrap/control sockets can work with the supported CRIU TCP policies, but non-loopback or pod-IP-bound connections can still break restore.
- **Security**: Dynamo Snapshot runs as a **privileged DaemonSet**, which is required to run CRIU and cuda-checkpoint. Workload pods do not need to be privileged.

## Troubleshooting

### Checkpoint Not Ready

1. Check the checkpoint job:
   ```bash
   kubectl get dckpt -n ${NAMESPACE}
   kubectl describe dckpt <checkpoint-name> -n ${NAMESPACE}

   JOB_NAME=$(kubectl get dckpt <checkpoint-name> -n ${NAMESPACE} -o jsonpath='{.status.jobName}')
   if [ -n "${JOB_NAME}" ]; then
     kubectl logs job/"${JOB_NAME}" -n ${NAMESPACE}
   fi
   ```

2. Check the DaemonSet:
   ```bash
   kubectl logs daemonset/snapshot-agent -n ${NAMESPACE} --all-containers
   ```

3. Verify that platform and chart storage settings match:
   ```bash
   kubectl get dckpt <checkpoint-name> -n ${NAMESPACE} -o yaml
   ```

### Restore Failing

1. Check pod logs:
   ```bash
   kubectl logs <worker-pod> -n ${NAMESPACE}
   ```

2. Describe the restore target pod:
   ```bash
   kubectl describe pod <worker-pod> -n ${NAMESPACE}
   ```

3. Confirm the referenced checkpoint is still `Ready`:
   ```bash
   kubectl get dckpt <checkpoint-name> -n ${NAMESPACE}
   ```

## Planned Features

- TensorRT-LLM backend support
- S3/MinIO storage backend
- OCI registry storage backend
- Multi-GPU checkpoints

## Related Documentation

- [Dynamo Snapshot Helm Chart README](https://github.com/ai-dynamo/dynamo/blob/main/deploy/helm/charts/snapshot/README.md) - Chart configuration
- [Installation Guide](installation-guide.md) - Platform installation
- [API Reference](api-reference.md) - Complete CRD specifications