model-caching.md 4.9 KB
Newer Older
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: Model Caching
subtitle: Download models once and share across all pods in a Kubernetes cluster
---

Large language models can take minutes to download. Without caching, every pod downloads the full model independently, wasting bandwidth and delaying startup. Dynamo supports two approaches to ensure models are downloaded once and shared across the cluster.

## Option 1: PVC + Download Job (Recommended)

The simplest approach: create a shared PVC, run a one-time Job to download the model, then mount the PVC in your DynamoGraphDeployment.

This is the pattern used by all Dynamo recipes today.

### Step 1: Create a Shared PVC

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-cache
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 100Gi
```

<Note>
`ReadWriteMany` access mode is required so multiple pods can mount the PVC simultaneously. Ensure your storage class supports RWX (e.g., NFS, CephFS, or cloud-provider shared file systems).
</Note>

### Step 2: Download the model

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: model-download
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: downloader
          image: python:3.12-slim
          command: ["sh", "-c"]
          args:
            - |
              pip install huggingface_hub hf_transfer
              HF_HUB_ENABLE_HF_TRANSFER=1 huggingface-cli download \
                $MODEL_NAME --revision $MODEL_REVISION
          env:
            - name: MODEL_NAME
              value: "Qwen/Qwen3-0.6B"
            - name: MODEL_REVISION
              value: "main"
            - name: HF_HOME
              value: /cache/huggingface
          envFrom:
            - secretRef:
                name: hf-token-secret
          volumeMounts:
            - name: model-cache
              mountPath: /cache/huggingface
      volumes:
        - name: model-cache
          persistentVolumeClaim:
            claimName: model-cache
```

### Step 3: Mount in DynamoGraphDeployment

```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: my-deployment
spec:
  pvcs:
    - create: false
      name: model-cache
  services:
    VllmWorker:
      volumeMounts:
        - name: model-cache
          mountPoint: /home/dynamo/.cache/huggingface
```

All `VllmWorker` pods that mount `model-cache` now read from the shared cache, avoiding per-pod worker downloads. If you also want the frontend to reuse tokenizer and config files, mount the same PVC there too.

### Compilation Cache

For vLLM, you can also cache compiled artifacts (CUDA graphs, etc.) with a second PVC:

```yaml
spec:
  pvcs:
    - create: false
      name: model-cache
    - create: false
      name: compilation-cache
  services:
    VllmWorker:
      volumeMounts:
        - name: model-cache
          mountPoint: /home/dynamo/.cache/huggingface
        - name: compilation-cache
          mountPoint: /home/dynamo/.cache/vllm
```

## Option 2: Model Express (P2P Distribution)

[Model Express](https://github.com/ai-dynamo/modelexpress) is a P2P model distribution server that downloads a model once and serves it to all pods over the network. It integrates directly with vLLM's weight loading pipeline via custom load formats.

### How It Works

1. A Model Express server runs in the cluster and caches model weights
2. Workers use `--load-format=mx-source` or `--load-format=mx-target` to load from the server
3. The K8s operator injects `MODEL_EXPRESS_URL` into all pods automatically

### Setup

**Install with Dynamo Platform:**

```bash
helm install dynamo-platform dynamo-platform-${RELEASE_VERSION}.tgz \
  --namespace ${NAMESPACE} \
  --set "dynamo-operator.modelExpressURL=http://model-express-server.model-express.svc.cluster.local:8080"
```

**Configure workers to use Model Express:**

```yaml
services:
  VllmWorker:
    envs:
      - name: VLLM_LOAD_FORMAT
        value: mx-target
```

When `MODEL_EXPRESS_URL` is configured in the operator, it is automatically injected as an environment variable into all component pods. Workers using `mx-source` or `mx-target` load formats will connect to the server for model weight distribution.

### When to Use Model Express

| Scenario | Recommended Approach |
|----------|---------------------|
| Small cluster, simple setup | PVC + Download Job |
| Large cluster, many nodes | Model Express |
| Models already on shared storage (NFS) | PVC |
| Frequent model updates across fleet | Model Express |

## See Also

- [Managing Models with DynamoModel](deployment/dynamomodel-guide.md) — declarative model management CRD
- [Detailed Installation Guide](installation-guide.md) — Helm chart configuration including Model Express
- [LoRA Adapters](../features/lora/README.md) — dynamic adapter loading (separate from base model caching)