# Multimodal LoRA Deployment with MinIO on Kubernetes

This guide explains how to deploy multimodal (vision-language) LoRA-enabled vLLM inference with an S3-compatible storage backend on Kubernetes.

## Overview

This deployment pattern enables dynamic LoRA adapter loading from S3-compatible storage (MinIO) for vision-language models in a Kubernetes environment. It uses the aggregated single-worker architecture, in which the Rust OpenAIPreprocessor in the Frontend handles image URLs directly.

## Prerequisites

- Kubernetes cluster with GPU support
- Helm 3.x installed
- `kubectl` configured to access your cluster
- Dynamo Kubernetes Platform installed ([Installation Guide](../../../../../../docs/kubernetes/installation-guide.md))
- HuggingFace token for downloading the base model and LoRA adapters

## Files in This Directory

| File | Description |
|------|-------------|
| `agg_qwen_lora.yaml` | DynamoGraphDeployment for multimodal vLLM with LoRA support |
| `minio-secret.yaml` | Kubernetes secret for MinIO credentials |
| `sync-lora-job.yaml` | Job to download LoRA from HuggingFace and upload to MinIO |
| `lora-model.yaml` | DynamoModel CRD for registering LoRA adapters |

---

## Step 1: Set Up Environment Variables

```bash
export NAMESPACE=dynamo  # Your Dynamo namespace
export HF_TOKEN=your_hf_token  # Your HuggingFace token
```

---

## Step 2: Create Secrets

### Create HuggingFace Token Secret

```bash
kubectl create secret generic hf-token-secret \
  --from-literal=HF_TOKEN=${HF_TOKEN} \
  -n ${NAMESPACE}
```

### Create MinIO Credentials Secret

This example uses the default MinIO credentials. You can change them to point to your own S3-compatible storage.
```bash
kubectl apply -f minio-secret.yaml -n ${NAMESPACE}
```

---

## Step 3: Install MinIO

### Add MinIO Helm Repository

```bash
helm repo add minio https://charts.min.io/
helm repo update
```

### Deploy MinIO

```bash
helm install minio minio/minio \
  --namespace ${NAMESPACE} \
  --set rootUser=minioadmin \
  --set rootPassword=minioadmin \
  --set mode=standalone \
  --set replicas=1 \
  --set persistence.enabled=true \
  --set persistence.size=10Gi \
  --set resources.requests.memory=512Mi \
  --set service.type=ClusterIP \
  --set consoleService.type=ClusterIP
```

### Verify MinIO Installation

```bash
kubectl get pods -n ${NAMESPACE} | grep minio
kubectl get svc -n ${NAMESPACE} | grep minio
```

Expected output:

```text
minio-xxxx-xxxx   1/1   Running   0   1m
```

### (Optional) Access MinIO Console

```bash
# The console service exposes only the console port; the S3 API (9000) is served by svc/minio
kubectl port-forward svc/minio-console -n ${NAMESPACE} 9001:9001
```

Open http://localhost:9001 in your browser:

- Username: `minioadmin`
- Password: `minioadmin`

---

## Step 4: Upload LoRA Adapters to MinIO

Use the provided Kubernetes Job to download a vision LoRA adapter from HuggingFace and upload it to MinIO:

```bash
kubectl apply -f sync-lora-job.yaml -n ${NAMESPACE}
```

The default job syncs `Chhagan005/Chhagan-DocVL-Qwen3`, a document-understanding LoRA for Qwen3-VL-2B.

### Monitor the Job

```bash
# Watch job progress
kubectl get jobs -n ${NAMESPACE} -w

# Check job logs
kubectl logs job/sync-hf-lora-to-minio -n ${NAMESPACE} -f
```

Wait for the job to complete successfully.
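
To block until the Job finishes instead of watching it by hand, a small wrapper around `kubectl wait` may help. This is a minimal sketch, not part of the provided files: the function name `wait_for_sync_job` and the 600s default timeout are illustrative assumptions, and the Job name is the one used in the log command above.

```bash
# Sketch: block until the sync Job reports the Complete condition,
# or print its recent logs if that does not happen before the timeout.
# Assumes the Job name from this guide; the default timeout is arbitrary.
wait_for_sync_job() {
  local namespace="$1"
  local timeout="${2:-600s}"
  local job="job/sync-hf-lora-to-minio"

  if kubectl wait --for=condition=complete "${job}" \
      -n "${namespace}" --timeout="${timeout}"; then
    echo "sync job completed"
  else
    echo "sync job did not complete within ${timeout}; recent logs:" >&2
    kubectl logs "${job}" -n "${namespace}" --tail=50 >&2
    return 1
  fi
}

# Usage: wait_for_sync_job "${NAMESPACE}"
```

Note that `kubectl wait --for=condition=complete` does not return early when the Job fails, so a failed Job shows up here as a timeout.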
### Verify Upload (Optional)

```bash
# Port-forward MinIO API
kubectl port-forward svc/minio -n ${NAMESPACE} 9000:9000 &

# Check uploaded files
export AWS_ACCESS_KEY_ID=minioadmin
export AWS_SECRET_ACCESS_KEY=minioadmin
export AWS_ENDPOINT_URL=http://localhost:9000
aws s3 ls s3://my-loras/ --recursive
```

### Customizing the LoRA Adapter

To upload a different LoRA adapter, edit `sync-lora-job.yaml` and change the `MODEL_NAME` environment variable:

```yaml
env:
  - name: MODEL_NAME
    value: your-org/your-vision-lora-adapter
```

---

## Step 5: Deploy Multimodal vLLM with LoRA Support

### Update the Image (if needed)

To use your own container image, either edit `agg_qwen_lora.yaml` directly or generate an updated copy with `yq`:

```bash
# Using yq to update the image
export FRAMEWORK_RUNTIME_IMAGE=your-registry/your-image:tag
yq '.spec.services[].extraPodSpec.mainContainer.image = env(FRAMEWORK_RUNTIME_IMAGE)' agg_qwen_lora.yaml > agg_qwen_lora_updated.yaml
```

### Deploy the LoRA-enabled Multimodal Graph

If you generated `agg_qwen_lora_updated.yaml` in the previous step, apply that file instead.

```bash
kubectl apply -f agg_qwen_lora.yaml -n ${NAMESPACE}
```

### Verify Deployment

```bash
# Check pods
kubectl get pods -n ${NAMESPACE}

# Watch worker logs
kubectl logs -f deployment/agg-qwen-multimodal-lora-vllmworker -n ${NAMESPACE}
```

Wait until the worker logs show "Application startup complete".

### Test the Deployment

```bash
# Port-forward the frontend
kubectl port-forward svc/agg-qwen-multimodal-lora-frontend -n ${NAMESPACE} 8000:8000 &

# List available models
curl http://localhost:8000/v1/models | jq .
```

---

## Step 6: Using DynamoModel CRD

The `lora-model.yaml` file demonstrates how to register a LoRA adapter using the DynamoModel Custom Resource:

```bash
kubectl apply -f lora-model.yaml -n ${NAMESPACE}
```

This creates a declarative way to manage LoRA adapters in your cluster.
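
One way to confirm that the adapter was registered is to look for its id in the frontend's model list. The helper below is a rough sketch: `model_is_listed` is a hypothetical name, and it assumes the `/v1/models` response follows the OpenAI-style shape `{"data": [{"id": "..."}]}`. It does a plain text match on the `id` field rather than a full JSON parse.

```bash
# Sketch: succeed if a model id appears in a /v1/models response read from stdin.
# Assumes an OpenAI-style list body; this is a text match, not a JSON parser.
model_is_listed() {
  local model_id="$1"
  # Drop whitespace so compact and pretty-printed JSON both match.
  tr -d ' \t\r\n' | grep -qF "\"id\":\"${model_id}\""
}

# Usage (with the frontend port-forward from Step 5 still running):
#   curl -s http://localhost:8000/v1/models \
#     | model_is_listed "Chhagan005/Chhagan-DocVL-Qwen3" && echo "adapter registered"
```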
The model CRD references:

- **modelName**: `Chhagan005/Chhagan-DocVL-Qwen3` (the adapter identity)
- **baseModelName**: `Qwen/Qwen3-VL-2B-Instruct` (the base VLM)
- **source.uri**: `s3://my-loras/Chhagan005/Chhagan-DocVL-Qwen3` (MinIO location)

---

## Step 7: Run Inference

### Inference with the LoRA Adapter

```bash
curl -X POST http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" \
  -d '{
    "model": "Chhagan005/Chhagan-DocVL-Qwen3",
    "messages": [{"role": "user", "content": [
      {"type": "text", "text": "Describe this image in detail"},
      {"type": "image_url", "image_url": {"url": "https://vllm-public-assets.s3.us-west-2.amazonaws.com/vision_model_images/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"}}
    ]}],
    "max_tokens": 300,
    "temperature": 0.0
  }' | jq .
```

### Inference with the Base Model

```bash
curl -X POST http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-VL-2B-Instruct",
    "messages": [{"role": "user", "content": [
      {"type": "text", "text": "Describe this image in detail"},
      {"type": "image_url", "image_url": {"url": "https://vllm-public-assets.s3.us-west-2.amazonaws.com/vision_model_images/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"}}
    ]}],
    "max_tokens": 300,
    "temperature": 0.0
  }' | jq .
```

---

## Configuration Reference

### Environment Variables

| Variable | Description | Default |
|----------|-------------|---------|
| `DYN_REQUEST_PLANE` | Transport plane (TCP for multimodal to avoid NATS 1MB limit) | `tcp` |
| `DYN_LORA_ENABLED` | Enable LoRA support | `true` |
| `DYN_LORA_PATH` | Local cache path for LoRA files | `/tmp/dynamo_loras_multimodal` |
| `DYN_SYSTEM_ENABLED` | Enable system management API | `true` |
| `DYN_SYSTEM_PORT` | Port for LoRA management API | `9090` |
| `AWS_ENDPOINT` | MinIO/S3 endpoint URL | `http://minio:9000` |
| `AWS_ACCESS_KEY_ID` | MinIO access key | From secret |
| `AWS_SECRET_ACCESS_KEY` | MinIO secret key | From secret |
| `AWS_REGION` | AWS region (required for S3 SDK) | `us-east-1` |
| `AWS_ALLOW_HTTP` | Allow HTTP connections | `true` |
| `BUCKET_NAME` | MinIO bucket name | `my-loras` |

### vLLM Arguments

| Argument | Description |
|----------|-------------|
| `--enable-multimodal` | Enable multimodal (vision) support |
| `--enable-lora` | Enable LoRA adapter support |
| `--max-lora-rank` | Maximum LoRA rank (must be >= your adapter's rank) |
| `--max-loras` | Maximum number of LoRAs to load simultaneously |
| `--gpu-memory-utilization` | Fraction of GPU memory to use (default 0.85) |
| `--max-model-len` | Maximum sequence length (default 8192) |
| `--max-num-batched-tokens` | Maximum batched tokens (default 8192) |

---

## Cleanup

### Remove vLLM Deployment

```bash
kubectl delete -f agg_qwen_lora.yaml -n ${NAMESPACE}
```

### Remove DynamoModel CRD

```bash
kubectl delete -f lora-model.yaml -n ${NAMESPACE}
```

### Remove Sync Job

```bash
kubectl delete -f sync-lora-job.yaml -n ${NAMESPACE}
```

### Remove MinIO

```bash
helm uninstall minio -n ${NAMESPACE}
```

### Remove Secrets

```bash
kubectl delete -f minio-secret.yaml -n ${NAMESPACE}
kubectl delete secret hf-token-secret -n ${NAMESPACE}
```

---

## Troubleshooting

### LoRA Fails to Load

1. **Check MinIO connectivity from worker**:

   ```bash
   kubectl exec -it deployment/agg-qwen-multimodal-lora-vllmworker -n ${NAMESPACE} -- \
     curl http://minio:9000/minio/health/live
   ```

2. **Verify LoRA exists in MinIO**:

   ```bash
   kubectl port-forward svc/minio -n ${NAMESPACE} 9000:9000 &
   aws --endpoint-url=http://localhost:9000 s3 ls s3://my-loras/ --recursive
   ```

3. **Check worker logs**:

   ```bash
   kubectl logs deployment/agg-qwen-multimodal-lora-vllmworker -n ${NAMESPACE}
   ```

4. **Verify adapter compatibility**: Ensure the LoRA adapter was trained for the same base model architecture (Qwen3-VL-2B) and that `--max-lora-rank` (default 64) is >= the adapter's rank.

### Sync Job Fails

1. **Check job logs**:

   ```bash
   kubectl logs job/sync-hf-lora-to-minio -n ${NAMESPACE}
   ```

2. **Verify HuggingFace token**:

   ```bash
   kubectl get secret hf-token-secret -n ${NAMESPACE} -o yaml
   ```

3. **Check MinIO is accessible**:

   ```bash
   kubectl get svc minio -n ${NAMESPACE}
   ```

### OOM During Inference

- Qwen VL models use dynamic resolution: a 2560px image can produce 5000+ tokens
- Reduce `--max-model-len` in `agg_qwen_lora.yaml` args
- Add `--mm-processor-kwargs '{"max_pixels": 1003520}'` to cap image resolution
- Lower `--gpu-memory-utilization` to `0.80`

### MinIO Connection Refused

- Ensure MinIO pods are running: `kubectl get pods -n ${NAMESPACE} | grep minio`
- Check MinIO service: `kubectl get svc minio -n ${NAMESPACE}`
- Verify the `AWS_ENDPOINT` URL matches the service name

## Further Reading

- [Multimodal LoRA Launch Guide](../../../launch/lora/multimodal/README.md) - Local launch with shell scripts
- [LLM LoRA Deployment](../README.md) - Text-only LoRA deployment pattern
- [Dynamo Kubernetes Guide](../../../../../../docs/kubernetes/README.md) - Platform setup
- [Installation Guide](../../../../../../docs/kubernetes/installation-guide.md) - Platform installation