README.md 10.1 KB
Newer Older
1
2
3
4
5
6
7
8
9
10
11
12
13
# Multimodal LoRA Deployment with MinIO on Kubernetes

This guide explains how to deploy multimodal (vision-language) LoRA-enabled vLLM inference with S3-compatible storage backend on Kubernetes.

## Overview

This deployment pattern enables dynamic LoRA adapter loading from S3-compatible storage (MinIO) for vision-language models in a Kubernetes environment. It uses the aggregated single-worker architecture where the Rust OpenAIPreprocessor in the Frontend handles image URLs directly.

## Prerequisites

- Kubernetes cluster with GPU support
- Helm 3.x installed
- `kubectl` configured to access your cluster
14
- Dynamo Kubernetes Platform installed ([Installation Guide](../../../../../../docs/kubernetes/installation-guide.md))
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
- HuggingFace token for downloading base and LoRA adapters

## Files in This Directory

| File | Description |
|------|-------------|
| `agg_qwen_lora.yaml` | DynamoGraphDeployment for multimodal vLLM with LoRA support |
| `minio-secret.yaml` | Kubernetes secret for MinIO credentials |
| `sync-lora-job.yaml` | Job to download LoRA from HuggingFace and upload to MinIO |
| `lora-model.yaml` | DynamoModel CRD for registering LoRA adapters |

---

## Step 1: Set Up Environment Variables

```bash
export NAMESPACE=dynamo  # Your Dynamo namespace
export HF_TOKEN=your_hf_token  # Your HuggingFace token
```

---

## Step 2: Create Secrets

### Create HuggingFace Token Secret

```bash
kubectl create secret generic hf-token-secret \
  --from-literal=HF_TOKEN=${HF_TOKEN} \
  -n ${NAMESPACE}
```

### Create MinIO Credentials Secret

In this example, we are using the default credentials for MinIO.
You can change the credentials to point to your own S3-compatible storage.

```bash
kubectl apply -f minio-secret.yaml -n ${NAMESPACE}
```

---

## Step 3: Install MinIO

### Add MinIO Helm Repository

```bash
helm repo add minio https://charts.min.io/
helm repo update
```

### Deploy MinIO

```bash
helm install minio minio/minio \
  --namespace ${NAMESPACE} \
  --set rootUser=minioadmin \
  --set rootPassword=minioadmin \
  --set mode=standalone \
  --set replicas=1 \
  --set persistence.enabled=true \
  --set persistence.size=10Gi \
  --set resources.requests.memory=512Mi \
  --set service.type=ClusterIP \
  --set consoleService.type=ClusterIP
```

### Verify MinIO Installation

```bash
kubectl get pods -n ${NAMESPACE} | grep minio
kubectl get svc -n ${NAMESPACE} | grep minio
```

Expected output:
```text
minio-xxxx-xxxx   1/1     Running   0          1m
```

### (Optional) Access MinIO Console

```bash
kubectl port-forward svc/minio-console -n ${NAMESPACE} 9001:9001 9000:9000
```

Open http://localhost:9001 in your browser:
- Username: `minioadmin`
- Password: `minioadmin`

---

## Step 4: Upload LoRA Adapters to MinIO

Use the provided Kubernetes Job to download a vision LoRA adapter from HuggingFace and upload it to MinIO:

```bash
kubectl apply -f sync-lora-job.yaml -n ${NAMESPACE}
```

The default job syncs `Chhagan005/Chhagan-DocVL-Qwen3`, a document-understanding LoRA for Qwen3-VL-2B.

### Monitor the Job

```bash
# Watch job progress
kubectl get jobs -n ${NAMESPACE} -w

# Check job logs
kubectl logs job/sync-hf-lora-to-minio -n ${NAMESPACE} -f
```

Wait for the job to complete successfully.

### Verify Upload (Optional)

```bash
# Port-forward MinIO API
kubectl port-forward svc/minio -n ${NAMESPACE} 9000:9000 &

# Check uploaded files
export AWS_ACCESS_KEY_ID=minioadmin
export AWS_SECRET_ACCESS_KEY=minioadmin
export AWS_ENDPOINT_URL=http://localhost:9000
aws s3 ls s3://my-loras/ --recursive
```

### Customizing the LoRA Adapter

To upload a different LoRA adapter, edit `sync-lora-job.yaml` and change the `MODEL_NAME` environment variable:

```yaml
env:
- name: MODEL_NAME
  value: your-org/your-vision-lora-adapter
```

---

## Step 5: Deploy Multimodal vLLM with LoRA Support

### Update the Image (if needed)

Edit `agg_qwen_lora.yaml` to use your container image:

```bash
# Using yq to update the image
export FRAMEWORK_RUNTIME_IMAGE=your-registry/your-image:tag
yq '.spec.services[].extraPodSpec.mainContainer.image = env(FRAMEWORK_RUNTIME_IMAGE)' agg_qwen_lora.yaml > agg_qwen_lora_updated.yaml
```

### Deploy the LoRA-enabled Multimodal Graph

```bash
kubectl apply -f agg_qwen_lora.yaml -n ${NAMESPACE}
```

### Verify Deployment

```bash
# Check pods
kubectl get pods -n ${NAMESPACE}

# Watch worker logs
kubectl logs -f deployment/agg-qwen-multimodal-lora-vllmworker -n ${NAMESPACE}
```

Wait for the worker to show "Application startup complete".

### Test the Deployment

```bash
# Port-forward the frontend
kubectl port-forward svc/agg-qwen-multimodal-lora-frontend -n ${NAMESPACE} 8000:8000 &

# List available models
curl http://localhost:8000/v1/models | jq .
```

---

## Step 6: Using DynamoModel CRD

The `lora-model.yaml` file demonstrates how to register a LoRA adapter using the DynamoModel Custom Resource:

```bash
kubectl apply -f lora-model.yaml -n ${NAMESPACE}
```

This creates a declarative way to manage LoRA adapters in your cluster. The model CRD references:
- **modelName**: `Chhagan005/Chhagan-DocVL-Qwen3` (the adapter identity)
- **baseModelName**: `Qwen/Qwen3-VL-2B-Instruct` (the base VLM)
- **source.uri**: `s3://my-loras/Chhagan005/Chhagan-DocVL-Qwen3` (MinIO location)

---

## Step 7: Run Inference

### Inference with the LoRA Adapter

```bash
curl -X POST http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" \
  -d '{
    "model": "Chhagan005/Chhagan-DocVL-Qwen3",
    "messages": [{"role": "user", "content": [
      {"type": "text", "text": "Describe this image in detail"},
      {"type": "image_url", "image_url": {"url": "https://vllm-public-assets.s3.us-west-2.amazonaws.com/vision_model_images/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"}}
    ]}],
    "max_tokens": 300,
    "temperature": 0.0
  }' | jq .
```

### Inference with the Base Model

```bash
curl -X POST http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-VL-2B-Instruct",
    "messages": [{"role": "user", "content": [
      {"type": "text", "text": "Describe this image in detail"},
      {"type": "image_url", "image_url": {"url": "https://vllm-public-assets.s3.us-west-2.amazonaws.com/vision_model_images/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"}}
    ]}],
    "max_tokens": 300,
    "temperature": 0.0
  }' | jq .
```

---

## Configuration Reference

### Environment Variables

| Variable | Description | Default |
|----------|-------------|---------|
| `DYN_REQUEST_PLANE` | Transport plane (TCP for multimodal to avoid NATS 1MB limit) | `tcp` |
| `DYN_LORA_ENABLED` | Enable LoRA support | `true` |
| `DYN_LORA_PATH` | Local cache path for LoRA files | `/tmp/dynamo_loras_multimodal` |
| `DYN_SYSTEM_ENABLED` | Enable system management API | `true` |
| `DYN_SYSTEM_PORT` | Port for LoRA management API | `9090` |
| `AWS_ENDPOINT` | MinIO/S3 endpoint URL | `http://minio:9000` |
| `AWS_ACCESS_KEY_ID` | MinIO access key | From secret |
| `AWS_SECRET_ACCESS_KEY` | MinIO secret key | From secret |
| `AWS_REGION` | AWS region (required for S3 SDK) | `us-east-1` |
| `AWS_ALLOW_HTTP` | Allow HTTP connections | `true` |
| `BUCKET_NAME` | MinIO bucket name | `my-loras` |

### vLLM Arguments

| Argument | Description |
|----------|-------------|
| `--enable-multimodal` | Enable multimodal (vision) support |
| `--enable-lora` | Enable LoRA adapter support |
| `--max-lora-rank` | Maximum LoRA rank (must be >= your adapter's rank) |
| `--max-loras` | Maximum number of LoRAs to load simultaneously |
| `--gpu-memory-utilization` | Fraction of GPU memory to use (default 0.85) |
| `--max-model-len` | Maximum sequence length (default 8192) |
| `--max-num-batched-tokens` | Maximum batched tokens (default 8192) |

---

## Cleanup

### Remove vLLM Deployment

```bash
kubectl delete -f agg_qwen_lora.yaml -n ${NAMESPACE}
```

### Remove DynamoModel CRD

```bash
kubectl delete -f lora-model.yaml -n ${NAMESPACE}
```

### Remove Sync Job

```bash
kubectl delete -f sync-lora-job.yaml -n ${NAMESPACE}
```

### Remove MinIO

```bash
helm uninstall minio -n ${NAMESPACE}
```

### Remove Secrets

```bash
kubectl delete -f minio-secret.yaml -n ${NAMESPACE}
kubectl delete secret hf-token-secret -n ${NAMESPACE}
```

---

## Troubleshooting

### LoRA Fails to Load

1. **Check MinIO connectivity from worker**:
   ```bash
   kubectl exec -it deployment/agg-qwen-multimodal-lora-vllmworker -n ${NAMESPACE} -- \
     curl http://minio:9000/minio/health/live
   ```

2. **Verify LoRA exists in MinIO**:
   ```bash
   kubectl port-forward svc/minio -n ${NAMESPACE} 9000:9000 &
   aws --endpoint-url=http://localhost:9000 s3 ls s3://my-loras/ --recursive
   ```

3. **Check worker logs**:
   ```bash
   kubectl logs deployment/agg-qwen-multimodal-lora-vllmworker -n ${NAMESPACE}
   ```

4. **Verify adapter compatibility**: Ensure the LoRA adapter was trained for the same base model architecture (Qwen3-VL-2B) and that `max-lora-rank` (default 64) is >= the adapter's rank.

### Sync Job Fails

1. **Check job logs**:
   ```bash
   kubectl logs job/sync-hf-lora-to-minio -n ${NAMESPACE}
   ```

2. **Verify HuggingFace token**:
   ```bash
   kubectl get secret hf-token-secret -n ${NAMESPACE} -o yaml
   ```

3. **Check MinIO is accessible**:
   ```bash
   kubectl get svc minio -n ${NAMESPACE}
   ```

### OOM During Inference

- Qwen VL models use dynamic resolution: a 2560px image can produce 5000+ tokens
- Reduce `--max-model-len` in `agg_qwen_lora.yaml` args
- Add `--mm-processor-kwargs '{"max_pixels": 1003520}'` to cap image resolution
- Lower `--gpu-memory-utilization` to `0.80`

### MinIO Connection Refused

- Ensure MinIO pods are running: `kubectl get pods -n ${NAMESPACE} | grep minio`
- Check MinIO service: `kubectl get svc minio -n ${NAMESPACE}`
- Verify the `AWS_ENDPOINT` URL matches the service name

## Further Reading

367
368
369
370
- [Multimodal LoRA Launch Guide](../../../launch/lora/multimodal/README.md) - Local launch with shell scripts
- [LLM LoRA Deployment](../README.md) - Text-only LoRA deployment pattern
- [Dynamo Kubernetes Guide](../../../../../../docs/kubernetes/README.md) - Platform setup
- [Installation Guide](../../../../../../docs/kubernetes/installation-guide.md) - Platform installation