"vllm/model_executor/models/registry.py" did not exist on "be0b3af9e068418726fa2b1dccef966e39024fd5"
snapshot.md 17.2 KB
Newer Older
1
2
3
4
5
6
7
8
9
10
11
12
13
---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: Snapshot
---

> ⚠️ **Experimental Feature**: Dynamo Snapshot is currently in **preview** and may only be functional in some k8s cluster setups. The Dynamo Snapshot DaemonSet runs in privileged mode to perform CRIU operations. See [Limitations](#limitations) for details.

**Dynamo Snapshot** is an experimental infrastructure for fast-starting GPU applications in Kubernetes using CRIU (Checkpoint/Restore in User-space) and NVIDIA's cuda-checkpoint utility. Dynamo Snapshot dramatically reduces cold-start times for large models from minutes to seconds by capturing initialized application state and restoring it on-demand.

| Startup Type | Time | What Happens |
|--------------|------|--------------|
| **Cold Start** | ~1 min | Download model, load to GPU, initialize engine |
14
| **Warm Start** (restore from checkpoint) | ~ 10 sec | Restore from a ready checkpoint directory |
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47

> ⚠️ Restore time may vary depending on cluster configuration (storage bandwidth, GPU model, etc.)

## Prerequisites

- Dynamo Platform/Operator installed on a k8s cluster with **x86_64 (amd64)** GPU nodes
- NVIDIA driver 580.xx or newer on the target GPU nodes
- `ReadWriteMany` storage if you need cross-node restore
- vLLM or SGLang backend (TensorRT-LLM is not supported yet)
- Security clearance to run a privileged DaemonSet

## Quick Start

This guide assumes a normal Dynamo deployment workflow is already present on your Kubernetes cluster.

### 1. Build and push a placeholder image

Snapshot-enabled workers must use a placeholder image that wraps the normal runtime image with the restore tooling. If you do not already have one, build it with the snapshot placeholder target and push it to a registry your cluster can pull from:

```bash
export RUNTIME_IMAGE=registry.example.com/dynamo/vllm-runtime:1.0.0
export PLACEHOLDER_IMAGE=registry.example.com/dynamo/vllm-placeholder:1.0.0

cd deploy/snapshot

make docker-build-placeholder \
  PLACEHOLDER_BASE_IMG="${RUNTIME_IMAGE}" \
  PLACEHOLDER_IMG="${PLACEHOLDER_IMAGE}"

make docker-push-placeholder \
  PLACEHOLDER_IMG="${PLACEHOLDER_IMAGE}"
```

48
This flow is defined in [deploy/snapshot/Makefile](https://github.com/ai-dynamo/dynamo/blob/main/deploy/snapshot/Makefile) and [deploy/snapshot/Dockerfile](https://github.com/ai-dynamo/dynamo/blob/main/deploy/snapshot/Dockerfile). The placeholder image preserves the base runtime entrypoint and command contract, and adds the CRIU, `cuda-checkpoint`, and `nsrestore` tooling needed for restore.
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77

### 2. Enable checkpointing in the platform and verify it

Whether you are installing or upgrading `dynamo-platform`, the operator must have checkpointing enabled and must point at the same storage that the snapshot chart will use:

```yaml
dynamo-operator:
  checkpoint:
    enabled: true
    storage:
      type: pvc
      pvc:
        pvcName: snapshot-pvc
        basePath: /checkpoints
```

If the platform is already installed, verify that the operator config contains the checkpoint block:

```bash
OPERATOR_CONFIG=$(kubectl get deploy -n "${PLATFORM_NAMESPACE}" \
  -l app.kubernetes.io/name=dynamo-operator,app.kubernetes.io/component=manager \
  -o jsonpath='{.items[0].spec.template.spec.volumes[?(@.name=="operator-config")].configMap.name}')

kubectl get configmap "${OPERATOR_CONFIG}" -n "${PLATFORM_NAMESPACE}" \
  -o jsonpath='{.data.config\.yaml}' | sed -n '/^checkpoint:/,/^[^[:space:]]/p'
```

Verify that the rendered config includes `enabled: true` and the same PVC name and base path you plan to use for the snapshot chart.

78
For the full platform/operator configuration surface, see [deploy/helm/charts/platform/README.md](https://github.com/ai-dynamo/dynamo/blob/main/deploy/helm/charts/platform/README.md) and [deploy/helm/charts/platform/components/operator/values.yaml](https://github.com/ai-dynamo/dynamo/blob/main/deploy/helm/charts/platform/components/operator/values.yaml).
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99

### 3. Install the snapshot chart

```bash
helm upgrade --install snapshot ./deploy/helm/charts/snapshot \
  --namespace ${NAMESPACE} \
  --create-namespace \
  --set storage.pvc.create=true
```

Cross-node restore requires `ReadWriteMany` storage. The chart defaults to that mode.

For better restore times, use a fast `ReadWriteMany` StorageClass for the checkpoint PVC. If you are reusing an existing checkpoint PVC, do not set `storage.pvc.create=true`; install the chart with `storage.pvc.create=false` and point `storage.pvc.name` at the existing PVC instead.

Verify that the PVC and DaemonSet are ready:

```bash
kubectl get pvc snapshot-pvc -n ${NAMESPACE}
kubectl rollout status daemonset/snapshot-agent -n ${NAMESPACE}
```

100
For the full snapshot chart configuration surface, see [deploy/helm/charts/snapshot/README.md](https://github.com/ai-dynamo/dynamo/blob/main/deploy/helm/charts/snapshot/README.md) and [deploy/helm/charts/snapshot/values.yaml](https://github.com/ai-dynamo/dynamo/blob/main/deploy/helm/charts/snapshot/values.yaml).
101
102
103

### 4. Apply a snapshot-compatible `DynamoGraphDeployment`

104
This example is adapted from [examples/backends/vllm/deploy/agg.yaml](https://github.com/ai-dynamo/dynamo/blob/main/examples/backends/vllm/deploy/agg.yaml). The worker must use the placeholder image from step 1, and the checkpoint identity must describe the runtime state you want to reuse.
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165

```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: vllm-snapshot-demo
spec:
  services:
    Frontend:
      componentType: frontend
      replicas: 1
      extraPodSpec:
        mainContainer:
          image: registry.example.com/dynamo/vllm-runtime:1.0.0

    VllmDecodeWorker:
      componentType: worker
      replicas: 1
      resources:
        limits:
          gpu: "1"
      readinessProbe:
        httpGet:
          path: /live
          port: system
        periodSeconds: 1
        timeoutSeconds: 4
        failureThreshold: 3
      checkpoint:
        enabled: true
        mode: Auto
        identity:
          model: Qwen/Qwen3-0.6B
          backendFramework: vllm
      extraPodSpec:
        mainContainer:
          image: registry.example.com/dynamo/vllm-placeholder:1.0.0
          command:
            - python3
            - -m
            - dynamo.vllm
          args:
            - --model
            - Qwen/Qwen3-0.6B
          env:
            - name: NCCL_DEBUG
              value: ERROR
            - name: TORCH_CPP_LOG_LEVEL
              value: ERROR
            - name: TORCH_DISTRIBUTED_DEBUG
              value: "OFF"
```

For SGLang, use `dynamo.sglang`, an SGLang placeholder image, `backendFramework: sglang`, and the matching CLI flags.

Apply the manifest:

```bash
kubectl apply -f vllm-snapshot-demo.yaml -n ${NAMESPACE}
```

166
On the first rollout, the worker cold-starts, the operator resolves the checkpoint identity hash, and the checkpoint Job writes a new checkpoint directory into `snapshot-pvc`.
167
168
169

### 5. Wait for the checkpoint to become ready

170
Auto mode resolves checkpoints by identity hash. It may create `checkpoint-<hash>` or reuse an existing checkpoint with a different CR name. For the sample identity above, the hash is `73e74442beb109ed`:
171
172

```bash
173
kubectl get dckpt -n ${NAMESPACE}
174

175
176
177
CKPT_NAME=$(kubectl get dckpt -n ${NAMESPACE} \
  -l nvidia.com/snapshot-checkpoint-hash=73e74442beb109ed \
  -o jsonpath='{.items[0].metadata.name}')
178
179
kubectl wait \
  --for=jsonpath='{.status.phase}'=Ready \
180
  "dynamocheckpoint/${CKPT_NAME}" \
181
  -n ${NAMESPACE} \
182
  --timeout=5m
183
184
```

185
If you change the checkpoint identity, the hash changes and so does the checkpoint selected by Auto mode.
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201

### 6. Trigger restore

Once the checkpoint is ready, scale the worker replicas from `1` to `2`:

```bash
kubectl patch dgd vllm-snapshot-demo -n ${NAMESPACE} --type=merge \
  -p '{"spec":{"services":{"VllmDecodeWorker":{"replicas":2}}}}'
```

New worker pods for `VllmDecodeWorker` will restore from the ready checkpoint automatically.

## Checkpoint Configuration

### Auto Mode (Recommended)

202
The operator computes the checkpoint identity hash, looks up an existing `DynamoCheckpoint` by that hash, and creates a new `DynamoCheckpoint` only when no matching checkpoint already exists:
203
204
205
206
207
208
209
210
211
212
213
214
215

```yaml
checkpoint:
  enabled: true
  mode: Auto
  identity:
    model: "meta-llama/Llama-3-8B"
    backendFramework: "vllm"  # or "sglang"
    tensorParallelSize: 1
    dtype: "bfloat16"
    maxModelLen: 4096
```

216
217
218
219
220
221
The `DynamoGraphDeployment` mirrors checkpoint resolution state under `.status.checkpoints`, including the resolved checkpoint CR name, identity hash, and whether the checkpoint was visible to the worker when it started:

```bash
kubectl get dgd vllm-snapshot-demo -n ${NAMESPACE} \
  -o jsonpath='{.status.checkpoints.VllmDecodeWorker.checkpointName}{"\n"}{.status.checkpoints.VllmDecodeWorker.identityHash}{"\n"}'
```
222
223
224
225
226
227
228
229

### Manual Management and `checkpointRef`

Use `checkpointRef` when you want a service to restore from a specific `DynamoCheckpoint` CR:

```yaml
checkpoint:
  enabled: true
230
  checkpointRef: "qwen3-06b-bf16"
231
232
233
234
235
236
```

This is useful when:
- You want to **pre-warm checkpoints** before creating DGDs
- You want **explicit control** over which checkpoint to use

237
`checkpointRef` resolves by `DynamoCheckpoint.metadata.name`. Use a readable CR name when you want an explicit checkpoint that operators can reference directly.
238
239
240
241
242

If you are managing checkpoint CRs yourself, set `mode: Manual` on the service to prevent the operator from creating a new `DynamoCheckpoint` when identity-based lookup does not find one.

```bash
# Check checkpoint status by CR name
243
kubectl get dynamocheckpoint qwen3-06b-bf16 -n ${NAMESPACE}
244
245
246
247
248

# Now create DGD referencing it
kubectl apply -f my-dgd.yaml -n ${NAMESPACE}
```

249
`mode: Auto` still resolves checkpoints by identity hash. The operator backfills `status.identityHash` and the `nvidia.com/snapshot-checkpoint-hash` label on each `DynamoCheckpoint` so auto lookup and uniqueness checks do not depend on the CR name.
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279

## Checkpoint Identity

Checkpoints are uniquely identified by a **16-character SHA256 hash** (64 bits) of configuration that affects runtime state:

| Field | Required | Affects Hash | Example |
|-------|----------|-------------|---------|
| `model` | ✓ | ✓ | `meta-llama/Llama-3-8B` |
| `backendFramework` | ✓ | ✓ | `sglang`, `vllm` |
| `dynamoVersion` | | ✓ | `0.9.0`, `1.0.0` |
| `tensorParallelSize` | | ✓ | `1`, `2`, `4`, `8` (default: 1) |
| `pipelineParallelSize` | | ✓ | `1`, `2` (default: 1) |
| `dtype` | | ✓ | `float16`, `bfloat16`, `fp8` |
| `maxModelLen` | | ✓ | `4096`, `8192` |
| `extraParameters` | | ✓ | Custom key-value pairs |

**Not included in hash** (don't invalidate checkpoint):
- `replicas`
- `nodeSelector`, `affinity`, `tolerations`
- `resources` (requests/limits)
- Logging/observability config

**Example with all fields:**
```yaml
checkpoint:
  enabled: true
  mode: Auto
  identity:
    model: "meta-llama/Llama-3-8B"
    backendFramework: "vllm"
280
    dynamoVersion: "1.0.0"
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
    tensorParallelSize: 1
    pipelineParallelSize: 1
    dtype: "bfloat16"
    maxModelLen: 8192
    extraParameters:
      enableChunkedPrefill: "true"
      quantization: "awq"
```

## DynamoCheckpoint CRD

The `DynamoCheckpoint` (shortname: `dckpt`) is a Kubernetes Custom Resource that manages checkpoint lifecycle.

**When to create a DynamoCheckpoint directly:**
- **Pre-warming:** Create checkpoints before deploying DGDs for instant startup
- **Explicit control:** Manage checkpoint lifecycle independently from DGDs

298
299
The operator requires `spec.identity` and `spec.job.podTemplateSpec`. The pod template should match the worker container you want checkpointed, including image, command, args, secrets, volumes, and resource limits. You do not need to set checkpoint-control plumbing manually; the operator injects the checkpoint-ready signal path for checkpoint Jobs and adds the restore metadata consumed by restored pods and the node-local controller inside the `snapshot-agent` DaemonSet.
`spec.job.backoffLimit` is deprecated and ignored. Checkpoint Jobs are always single-attempt.
300
301
302
303
304
305
306

**Create a checkpoint:**

```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoCheckpoint
metadata:
307
  name: qwen3-06b-bf16
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
spec:
  identity:
    model: Qwen/Qwen3-0.6B
    backendFramework: vllm
    tensorParallelSize: 1
    dtype: bfloat16
    maxModelLen: 4096

  job:
    activeDeadlineSeconds: 3600
    ttlSecondsAfterFinished: 300
    podTemplateSpec:
      spec:
        restartPolicy: Never
        containers:
          - name: main
            image: registry.example.com/dynamo/vllm-placeholder:1.0.0
            command:
              - python3
              - -m
              - dynamo.vllm
            args:
              - --model
              - Qwen/Qwen3-0.6B
            env:
333
334
335
336
337
338
              - name: NCCL_DEBUG
                value: ERROR
              - name: TORCH_CPP_LOG_LEVEL
                value: ERROR
              - name: TORCH_DISTRIBUTED_DEBUG
                value: "OFF"
339
340
341
342
343
            resources:
              limits:
                nvidia.com/gpu: "1"
```

344
For this example identity, the operator computes a deterministic identity hash and stores it in `status.identityHash`. Auto mode uses that hash, not the CR name, when it decides whether to reuse or create a checkpoint.
345
346
347
348
349
350
351
352
353

**Check status:**

```bash
# List all checkpoints
kubectl get dynamocheckpoint -n ${NAMESPACE}
# Or use shortname
kubectl get dckpt -n ${NAMESPACE}

354
355
356
NAME               MODEL                                BACKEND  PHASE     HASH              AGE
qwen3-06b-bf16     Qwen/Qwen3-0.6B                      vllm     Ready     3bff874d069f0ed5  5m
llama3-8b-bf16     meta-llama/Meta-Llama-3-8B-Instruct  vllm     Creating  9be4f5574b5a285d  2m
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
```

**Phases:**

| Phase | Description |
|-------|-------------|
| `Pending` | CR created, waiting for job to start |
| `Creating` | Checkpoint job is running |
| `Ready` | Checkpoint available for use |
| `Failed` | Checkpoint creation failed |

Other useful status fields are:

| Field | Meaning |
|-------|---------|
372
| `status.identityHash` | Deterministic hash of `spec.identity` used for auto lookup and reuse |
373
374
375
376
377
378
| `status.jobName` | Name of the checkpoint Job |
| `status.location` | Checkpoint location in the configured storage backend |
| `status.storageType` | Storage backend type (`pvc`, `s3`, or `oci`) |
| `status.createdAt` | Timestamp recorded when the checkpoint becomes ready |
| `status.message` | Failure or progress message when available |

379
380
`status.conditions` is deprecated for `DynamoCheckpoint`. The legacy condition types `JobCreated` and `JobCompleted` are kept for compatibility only. Prefer `status.phase`, `status.jobName`, and `status.message` when checking checkpoint progress.

381
382
383
**Detailed status:**

```bash
384
kubectl describe dckpt qwen3-06b-bf16 -n ${NAMESPACE}
385
386
387
388
389
```

```yaml
Status:
  Phase: Ready
390
391
392
  IdentityHash: 3bff874d069f0ed5
  JobName: checkpoint-job-3bff874d069f0ed5
  Location: /checkpoints/3bff874d069f0ed5
393
394
395
396
397
398
399
400
401
402
403
404
405
406
  StorageType: pvc
  CreatedAt: 2026-01-29T10:05:00Z
```

**Reference from DGD:**

Once the checkpoint is `Ready`, you can reference it by CR name:

```yaml
spec:
  services:
    VllmDecodeWorker:
      checkpoint:
        enabled: true
407
        checkpointRef: "qwen3-06b-bf16"
408
409
```

410
Or use `mode: Auto` with the same identity, and the operator will reuse the same deterministic checkpoint object automatically.
411
412
413
414
415

## Limitations

- **LLM workers only**: Checkpoint/restore supports LLM decode and prefill workers. Specialized workers (multimodal, embedding, diffusion) are not supported.
- **Single-GPU only**: Multi-GPU configurations may work in very basic hardware configurations, but are not officially supported yet.
416
- **Network state**: Restore is sensitive to live TCP socket state. Loopback bootstrap/control sockets can work with the supported CRIU TCP policies, but non-loopback or pod-IP-bound connections can still break restore.
417
418
419
420
421
422
423
424
425
426
- **Security**: Dynamo Snapshot runs as a **privileged DaemonSet** which is required to run CRIU and cuda-checkpoint. However, workload pods do not need to be privileged.

## Troubleshooting

### Checkpoint Not Ready

1. Check the checkpoint job:
   ```bash
   kubectl get dckpt -n ${NAMESPACE}
   kubectl describe dckpt <checkpoint-name> -n ${NAMESPACE}
427
428
429
430
   JOB_NAME=$(kubectl get dckpt <checkpoint-name> -n ${NAMESPACE} -o jsonpath='{.status.jobName}')
   if [ -n "${JOB_NAME}" ]; then
     kubectl logs job/"${JOB_NAME}" -n ${NAMESPACE}
   fi
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
   ```

2. Check the DaemonSet:
   ```bash
   kubectl logs daemonset/snapshot-agent -n ${NAMESPACE} --all-containers
   ```

3. Verify that platform and chart storage settings match:
   ```bash
   kubectl get dckpt <checkpoint-name> -n ${NAMESPACE} -o yaml
   ```

### Restore Failing

1. Check pod logs:
   ```bash
   kubectl logs <worker-pod> -n ${NAMESPACE}
   ```

2. Describe the restore target pod:
   ```bash
   kubectl describe pod <worker-pod> -n ${NAMESPACE}
   ```

3. Confirm the referenced checkpoint is still `Ready`:
   ```bash
   kubectl get dckpt <checkpoint-name> -n ${NAMESPACE}
   ```

## Planned Features

- TensorRT-LLM backend support
- S3/MinIO storage backend
- OCI registry storage backend
- Multi-GPU checkpoints

## Related Documentation

469
- [Dynamo Snapshot Helm Chart README](https://github.com/ai-dynamo/dynamo/blob/main/deploy/helm/charts/snapshot/README.md) - Chart configuration
470
471
- [Installation Guide](installation-guide.md) - Platform installation
- [API Reference](api-reference.md) - Complete CRD specifications