Unverified Commit 04a532ed authored by Biswa Panda's avatar Biswa Panda Committed by GitHub
Browse files

feat: add multimodal lora docs and deployment example for k8s (#6452)

parent b230980f
......@@ -481,6 +481,82 @@ await register_model(
)
```
## LoRA Adapters on Multimodal Workers
Multimodal workers support dynamic loading and unloading of LoRA adapters at runtime via the management API. This enables serving fine-tuned multimodal models alongside the base model.
### Loading a LoRA Adapter
Load an adapter on a running multimodal worker via the `load_lora` endpoint:
```bash
# For components workers (URI-based, requires DYN_LORA_ENABLED=true)
curl -X POST http://<worker-host>:<port>/load_lora \
-H "Content-Type: application/json" \
-d '{
"lora_name": "my-vlm-adapter",
"source": {"uri": "s3://my-bucket/adapters/my-vlm-adapter"}
}'
# For example workers (path-based)
curl -X POST http://<worker-host>:<port>/load_lora \
-H "Content-Type: application/json" \
-d '{
"lora_name": "my-vlm-adapter",
"lora_path": "/path/to/adapter"
}'
```
### Sending Requests with a LoRA
Set the `model` field in the request to the LoRA adapter name:
```bash
curl -X POST http://<frontend-host>:<port>/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "my-vlm-adapter",
"messages": [
{"role": "user", "content": [
{"type": "text", "text": "Describe this image"},
{"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}}
]}
]
}'
```
Requests without a LoRA name (or with the base model name) will use the base model.
### Unloading a LoRA Adapter
```bash
curl -X POST http://<worker-host>:<port>/unload_lora \
-H "Content-Type: application/json" \
-d '{"lora_name": "my-vlm-adapter"}'
```
### Listing Loaded Adapters
```bash
curl -X POST http://<worker-host>:<port>/list_loras
```
### Disaggregated Mode
In disaggregated (prefill/decode) deployments, the **same LoRA adapter must be loaded on both the prefill and decode workers**. The LoRA identity (`model` field) is automatically propagated from the prefill worker to the decode worker in the forwarded request.
```bash
# Load on prefill worker
curl -X POST http://<prefill-worker>/load_lora \
-d '{"lora_name": "my-adapter", "source": {"uri": "s3://bucket/adapter"}}'
# Load on decode worker (same adapter)
curl -X POST http://<decode-worker>/load_lora \
-d '{"lora_name": "my-adapter", "source": {"uri": "s3://bucket/adapter"}}'
```
If a LoRA is loaded on the prefill worker but not on the decode worker, the decode worker will fall back to the base model for that request.
## Known Limitations
- **Disaggregated flows require Python Processor** - All multimodal disaggregation requires the Python Processor component (`ModelInput.Text`).
......
# Multimodal LoRA Deployment with MinIO on Kubernetes
This guide explains how to deploy multimodal (vision-language) LoRA-enabled vLLM inference with S3-compatible storage backend on Kubernetes.
## Overview
This deployment pattern enables dynamic LoRA adapter loading from S3-compatible storage (MinIO) for vision-language models in a Kubernetes environment. It uses the aggregated single-worker architecture where the Rust OpenAIPreprocessor in the Frontend handles image URLs directly.
## Prerequisites
- Kubernetes cluster with GPU support
- Helm 3.x installed
- `kubectl` configured to access your cluster
- Dynamo Kubernetes Platform installed ([Installation Guide](../../../../docs/pages/kubernetes/installation-guide.md))
- HuggingFace token for downloading base and LoRA adapters
## Files in This Directory
| File | Description |
|------|-------------|
| `agg_qwen_lora.yaml` | DynamoGraphDeployment for multimodal vLLM with LoRA support |
| `minio-secret.yaml` | Kubernetes secret for MinIO credentials |
| `sync-lora-job.yaml` | Job to download LoRA from HuggingFace and upload to MinIO |
| `lora-model.yaml` | DynamoModel CRD for registering LoRA adapters |
---
## Step 1: Set Up Environment Variables
```bash
export NAMESPACE=dynamo # Your Dynamo namespace
export HF_TOKEN=your_hf_token # Your HuggingFace token
```
---
## Step 2: Create Secrets
### Create HuggingFace Token Secret
```bash
kubectl create secret generic hf-token-secret \
--from-literal=HF_TOKEN=${HF_TOKEN} \
-n ${NAMESPACE}
```
### Create MinIO Credentials Secret
In this example, we are using the default credentials for MinIO.
You can change the credentials to point to your own S3-compatible storage.
```bash
kubectl apply -f minio-secret.yaml -n ${NAMESPACE}
```
---
## Step 3: Install MinIO
### Add MinIO Helm Repository
```bash
helm repo add minio https://charts.min.io/
helm repo update
```
### Deploy MinIO
```bash
helm install minio minio/minio \
--namespace ${NAMESPACE} \
--set rootUser=minioadmin \
--set rootPassword=minioadmin \
--set mode=standalone \
--set replicas=1 \
--set persistence.enabled=true \
--set persistence.size=10Gi \
--set resources.requests.memory=512Mi \
--set service.type=ClusterIP \
--set consoleService.type=ClusterIP
```
### Verify MinIO Installation
```bash
kubectl get pods -n ${NAMESPACE} | grep minio
kubectl get svc -n ${NAMESPACE} | grep minio
```
Expected output:
```text
minio-xxxx-xxxx 1/1 Running 0 1m
```
### (Optional) Access MinIO Console
```bash
kubectl port-forward svc/minio-console -n ${NAMESPACE} 9001:9001 9000:9000
```
Open http://localhost:9001 in your browser:
- Username: `minioadmin`
- Password: `minioadmin`
---
## Step 4: Upload LoRA Adapters to MinIO
Use the provided Kubernetes Job to download a vision LoRA adapter from HuggingFace and upload it to MinIO:
```bash
kubectl apply -f sync-lora-job.yaml -n ${NAMESPACE}
```
The default job syncs `Chhagan005/Chhagan-DocVL-Qwen3`, a document-understanding LoRA for Qwen3-VL-2B.
### Monitor the Job
```bash
# Watch job progress
kubectl get jobs -n ${NAMESPACE} -w
# Check job logs
kubectl logs job/sync-hf-lora-to-minio -n ${NAMESPACE} -f
```
Wait for the job to complete successfully.
### Verify Upload (Optional)
```bash
# Port-forward MinIO API
kubectl port-forward svc/minio -n ${NAMESPACE} 9000:9000 &
# Check uploaded files
export AWS_ACCESS_KEY_ID=minioadmin
export AWS_SECRET_ACCESS_KEY=minioadmin
export AWS_ENDPOINT_URL=http://localhost:9000
aws s3 ls s3://my-loras/ --recursive
```
### Customizing the LoRA Adapter
To upload a different LoRA adapter, edit `sync-lora-job.yaml` and change the `MODEL_NAME` environment variable:
```yaml
env:
- name: MODEL_NAME
value: your-org/your-vision-lora-adapter
```
---
## Step 5: Deploy Multimodal vLLM with LoRA Support
### Update the Image (if needed)
Edit `agg_qwen_lora.yaml` to use your container image:
```bash
# Using yq to update the image
export FRAMEWORK_RUNTIME_IMAGE=your-registry/your-image:tag
yq '.spec.services[].extraPodSpec.mainContainer.image = env(FRAMEWORK_RUNTIME_IMAGE)' agg_qwen_lora.yaml > agg_qwen_lora_updated.yaml
```
### Deploy the LoRA-enabled Multimodal Graph
```bash
kubectl apply -f agg_qwen_lora.yaml -n ${NAMESPACE}
```
### Verify Deployment
```bash
# Check pods
kubectl get pods -n ${NAMESPACE}
# Watch worker logs
kubectl logs -f deployment/agg-qwen-multimodal-lora-vllmworker -n ${NAMESPACE}
```
Wait for the worker to show "Application startup complete".
### Test the Deployment
```bash
# Port-forward the frontend
kubectl port-forward svc/agg-qwen-multimodal-lora-frontend -n ${NAMESPACE} 8000:8000 &
# List available models
curl http://localhost:8000/v1/models | jq .
```
---
## Step 6: Using DynamoModel CRD
The `lora-model.yaml` file demonstrates how to register a LoRA adapter using the DynamoModel Custom Resource:
```bash
kubectl apply -f lora-model.yaml -n ${NAMESPACE}
```
This creates a declarative way to manage LoRA adapters in your cluster. The model CRD references:
- **modelName**: `Chhagan005/Chhagan-DocVL-Qwen3` (the adapter identity)
- **baseModelName**: `Qwen/Qwen3-VL-2B-Instruct` (the base VLM)
- **source.uri**: `s3://my-loras/Chhagan005/Chhagan-DocVL-Qwen3` (MinIO location)
---
## Step 7: Run Inference
### Inference with the LoRA Adapter
```bash
curl -X POST http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" \
-d '{
"model": "Chhagan005/Chhagan-DocVL-Qwen3",
"messages": [{"role": "user", "content": [
{"type": "text", "text": "Describe this image in detail"},
{"type": "image_url", "image_url": {"url": "https://vllm-public-assets.s3.us-west-2.amazonaws.com/vision_model_images/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"}}
]}],
"max_tokens": 300,
"temperature": 0.0
}' | jq .
```
### Inference with the Base Model
```bash
curl -X POST http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen3-VL-2B-Instruct",
"messages": [{"role": "user", "content": [
{"type": "text", "text": "Describe this image in detail"},
{"type": "image_url", "image_url": {"url": "https://vllm-public-assets.s3.us-west-2.amazonaws.com/vision_model_images/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"}}
]}],
"max_tokens": 300,
"temperature": 0.0
}' | jq .
```
---
## Configuration Reference
### Environment Variables
| Variable | Description | Default |
|----------|-------------|---------|
| `DYN_REQUEST_PLANE` | Transport plane (TCP for multimodal to avoid NATS 1MB limit) | `tcp` |
| `DYN_LORA_ENABLED` | Enable LoRA support | `true` |
| `DYN_LORA_PATH` | Local cache path for LoRA files | `/tmp/dynamo_loras_multimodal` |
| `DYN_SYSTEM_ENABLED` | Enable system management API | `true` |
| `DYN_SYSTEM_PORT` | Port for LoRA management API | `9090` |
| `AWS_ENDPOINT` | MinIO/S3 endpoint URL | `http://minio:9000` |
| `AWS_ACCESS_KEY_ID` | MinIO access key | From secret |
| `AWS_SECRET_ACCESS_KEY` | MinIO secret key | From secret |
| `AWS_REGION` | AWS region (required for S3 SDK) | `us-east-1` |
| `AWS_ALLOW_HTTP` | Allow HTTP connections | `true` |
| `BUCKET_NAME` | MinIO bucket name | `my-loras` |
### vLLM Arguments
| Argument | Description |
|----------|-------------|
| `--enable-multimodal` | Enable multimodal (vision) support |
| `--enable-lora` | Enable LoRA adapter support |
| `--max-lora-rank` | Maximum LoRA rank (must be >= your adapter's rank) |
| `--max-loras` | Maximum number of LoRAs to load simultaneously |
| `--gpu-memory-utilization` | Fraction of GPU memory to use (default 0.85) |
| `--max-model-len` | Maximum sequence length (default 8192) |
| `--max-num-batched-tokens` | Maximum batched tokens (default 8192) |
---
## Cleanup
### Remove vLLM Deployment
```bash
kubectl delete -f agg_qwen_lora.yaml -n ${NAMESPACE}
```
### Remove DynamoModel CRD
```bash
kubectl delete -f lora-model.yaml -n ${NAMESPACE}
```
### Remove Sync Job
```bash
kubectl delete -f sync-lora-job.yaml -n ${NAMESPACE}
```
### Remove MinIO
```bash
helm uninstall minio -n ${NAMESPACE}
```
### Remove Secrets
```bash
kubectl delete -f minio-secret.yaml -n ${NAMESPACE}
kubectl delete secret hf-token-secret -n ${NAMESPACE}
```
---
## Troubleshooting
### LoRA Fails to Load
1. **Check MinIO connectivity from worker**:
```bash
kubectl exec -it deployment/agg-qwen-multimodal-lora-vllmworker -n ${NAMESPACE} -- \
curl http://minio:9000/minio/health/live
```
2. **Verify LoRA exists in MinIO**:
```bash
kubectl port-forward svc/minio -n ${NAMESPACE} 9000:9000 &
aws --endpoint-url=http://localhost:9000 s3 ls s3://my-loras/ --recursive
```
3. **Check worker logs**:
```bash
kubectl logs deployment/agg-qwen-multimodal-lora-vllmworker -n ${NAMESPACE}
```
4. **Verify adapter compatibility**: Ensure the LoRA adapter was trained for the same base model architecture (Qwen3-VL-2B) and that `max-lora-rank` (default 64) is >= the adapter's rank.
### Sync Job Fails
1. **Check job logs**:
```bash
kubectl logs job/sync-hf-lora-to-minio -n ${NAMESPACE}
```
2. **Verify HuggingFace token**:
```bash
kubectl get secret hf-token-secret -n ${NAMESPACE} -o yaml
```
3. **Check MinIO is accessible**:
```bash
kubectl get svc minio -n ${NAMESPACE}
```
### OOM During Inference
- Qwen VL models use dynamic resolution: a 2560px image can produce 5000+ tokens
- Reduce `--max-model-len` in `agg_qwen_lora.yaml` args
- Add `--mm-processor-kwargs '{"max_pixels": 1003520}'` to cap image resolution
- Lower `--gpu-memory-utilization` to `0.80`
### MinIO Connection Refused
- Ensure MinIO pods are running: `kubectl get pods -n ${NAMESPACE} | grep minio`
- Check MinIO service: `kubectl get svc minio -n ${NAMESPACE}`
- Verify the `AWS_ENDPOINT` URL matches the service name
## Further Reading
- [Multimodal LoRA Launch Guide](../../launch/lora/README.md) - Local launch with shell scripts
- [LLM LoRA Deployment](../../../backends/vllm/deploy/lora/README.md) - Text-only LoRA deployment pattern
- [Dynamo Kubernetes Guide](../../../../docs/pages/kubernetes/README.md) - Platform setup
- [Installation Guide](../../../../docs/pages/kubernetes/installation-guide.md) - Platform installation
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Aggregated multimodal serving for Qwen VL with dynamic LoRA adapter support.
# Uses dynamo.vllm --enable-multimodal --enable-lora in a single-worker PD architecture
# where the Rust OpenAIPreprocessor in the Frontend handles image URLs directly.
#
# LoRA adapters are managed via the system API on DYN_SYSTEM_PORT:
# Load: curl -X POST http://<worker>:9090/v1/loras \
# -H "Content-Type: application/json" \
# -d '{"lora_name": "my-adapter", "source": {"uri": "s3://my-loras/adapter"}}'
# List: curl http://<worker>:9090/v1/loras
# Unload: curl -X DELETE http://<worker>:9090/v1/loras/my-adapter
#
# Matches the pattern in: examples/multimodal/launch/lora/lora_agg.sh
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
name: agg-qwen-multimodal-lora
spec:
services:
Frontend:
envFromSecret: hf-token-secret
componentType: frontend
replicas: 1
extraPodSpec:
mainContainer:
image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:my-tag
env:
- name: DYN_REQUEST_PLANE
value: tcp
- name: MODEL_PATH
value: Qwen/Qwen3-VL-2B-Instruct
VllmWorker:
envFromSecret: hf-token-secret
componentType: worker
replicas: 1
resources:
limits:
gpu: "1"
modelRef:
name: Qwen/Qwen3-VL-2B-Instruct
extraPodSpec:
mainContainer:
image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:my-tag
env:
- name: DYN_REQUEST_PLANE
value: tcp
- name: MODEL_PATH
value: Qwen/Qwen3-VL-2B-Instruct
- name: DYN_LORA_ENABLED
value: "true"
- name: DYN_LORA_PATH
value: "/tmp/dynamo_loras_multimodal"
- name: DYN_SYSTEM_ENABLED
value: "true"
- name: DYN_SYSTEM_PORT
value: "9090"
- name: AWS_ENDPOINT
value: "http://minio:9000"
- name: AWS_ACCESS_KEY_ID
valueFrom:
secretKeyRef:
name: minio-secret
key: AWS_ACCESS_KEY_ID
- name: AWS_SECRET_ACCESS_KEY
valueFrom:
secretKeyRef:
name: minio-secret
key: AWS_SECRET_ACCESS_KEY
- name: AWS_REGION
value: "us-east-1"
- name: AWS_ALLOW_HTTP
value: "true"
- name: BUCKET_NAME
value: "my-loras"
command:
- python3
- -m
- dynamo.vllm
args:
- --enable-multimodal
- --model
- Qwen/Qwen3-VL-2B-Instruct
- --connector
- none
- --enable-lora
- --max-lora-rank
- "64"
- --gpu-memory-utilization
- "0.85"
- --max-model-len
- "8192"
- --max-num-batched-tokens
- "8192"
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
apiVersion: nvidia.com/v1alpha1
kind: DynamoModel
metadata:
name: chhagan-docvl-qwen
spec:
modelName: Chhagan005/Chhagan-DocVL-Qwen3
baseModelName: Qwen/Qwen3-VL-2B-Instruct
modelType: lora
source:
uri: s3://my-loras/Chhagan005/Chhagan-DocVL-Qwen3
\ No newline at end of file
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
apiVersion: v1
kind: Secret
type: Opaque
metadata:
name: minio-secret
stringData:
AWS_ACCESS_KEY_ID: minioadmin
AWS_SECRET_ACCESS_KEY: minioadmin
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
apiVersion: batch/v1
kind: Job
metadata:
name: sync-hf-lora-to-minio
spec:
template:
spec:
containers:
- name: uploader
image: python:3.10-slim
command:
- /bin/sh
- -c
- |
set -eux
pip install --no-cache-dir huggingface-hub awscli
hf download $MODEL_NAME --local-dir /tmp/lora
rm -rf /tmp/lora/.cache
aws --endpoint-url=http://minio:9000 s3 mb s3://$LORA_ROOT_PATH || true
aws --endpoint-url=http://minio:9000 s3 sync /tmp/lora s3://$LORA_ROOT_PATH/$MODEL_NAME
envFrom:
- secretRef:
name: hf-token-secret
- secretRef:
name: minio-secret
env:
- name: AWS_REGION # set this to your aws region
value: us-east-1
- name: AWS_ALLOW_HTTP # remove/disable this if you are using a S3 endpoint or secure MinIO
value: "true"
- name: LORA_ROOT_PATH
value: "my-loras"
- name: MODEL_NAME
value: Chhagan005/Chhagan-DocVL-Qwen3
restartPolicy: Never
backoffLimit: 3
\ No newline at end of file
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment