feat: add multimodal lora docs and deployment example for k8s (#6452)

Unverified Commit 04a532ed authored Feb 23, 2026 by Biswa Panda Committed by GitHub Feb 23, 2026
6 changed files
--- a/docs/pages/features/multimodal/multimodal-vllm.md
+++ b/docs/pages/features/multimodal/multimodal-vllm.md
@@ -481,6 +481,82 @@ await register_model(
 )
 ```

+## LoRA Adapters on Multimodal Workers
+
+Multimodal workers support dynamic loading and unloading of LoRA adapters at runtime via the management API. This enables serving fine-tuned multimodal models alongside the base model.
+
+### Loading a LoRA Adapter
+
+Load an adapter on a running multimodal worker via the `load_lora` endpoint:
+
+```bash
+# For components workers (URI-based, requires DYN_LORA_ENABLED=true)
+curl -X POST http://<worker-host>:<port>/load_lora \
+  -H "Content-Type: application/json" \
+  -d '{
+    "lora_name": "my-vlm-adapter",
+    "source": {"uri": "s3://my-bucket/adapters/my-vlm-adapter"}
+  }'
+
+# For example workers (path-based)
+curl -X POST http://<worker-host>:<port>/load_lora \
+  -H "Content-Type: application/json" \
+  -d '{
+    "lora_name": "my-vlm-adapter",
+    "lora_path": "/path/to/adapter"
+  }'
+```
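
The two request bodies above differ only in how the adapter location is expressed. As a minimal sketch, the shell function below builds either body from its arguments. `lora_load_payload` is a hypothetical helper name, not part of Dynamo, and it assumes the adapter name and location contain no characters that need JSON escaping.

```bash
# Hypothetical helper: print the JSON body for a load_lora request.
# Usage: lora_load_payload <lora_name> uri|path <location>
lora_load_payload() {
  case "$2" in
    # URI-based form (field names taken from the example above)
    uri)  printf '{"lora_name": "%s", "source": {"uri": "%s"}}\n' "$1" "$3" ;;
    # Path-based form
    path) printf '{"lora_name": "%s", "lora_path": "%s"}\n' "$1" "$3" ;;
    *)    echo "unknown kind: $2 (expected uri or path)" >&2; return 1 ;;
  esac
}
```

The output could then be passed to `curl` with `-d "$(lora_load_payload my-vlm-adapter uri s3://my-bucket/adapters/my-vlm-adapter)"`.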
+
+### Sending Requests with a LoRA
+
+Set the `model` field in the request to the LoRA adapter name:
+
+```bash
+curl -X POST http://<frontend-host>:<port>/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "my-vlm-adapter",
+    "messages": [
+      {"role": "user", "content": [
+        {"type": "text", "text": "Describe this image"},
+        {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}}
+      ]}
+    ]
+  }'
+```
+
+Requests without a LoRA name (or with the base model name) will use the base model.
+
+### Unloading a LoRA Adapter
+
+```bash
+curl -X POST http://<worker-host>:<port>/unload_lora \
+  -H "Content-Type: application/json" \
+  -d '{"lora_name": "my-vlm-adapter"}'
+```
+
+### Listing Loaded Adapters
+
+```bash
+curl -X POST http://<worker-host>:<port>/list_loras
+```
+
+### Disaggregated Mode
+
+In disaggregated (prefill/decode) deployments, the **same LoRA adapter must be loaded on both the prefill and decode workers**. The LoRA identity (`model` field) is automatically propagated from the prefill worker to the decode worker in the forwarded request.
+
+```bash
+# Load on prefill worker
+curl -X POST http://<prefill-worker>/load_lora \
+  -H "Content-Type: application/json" \
+  -d '{"lora_name": "my-adapter", "source": {"uri": "s3://bucket/adapter"}}'
+
+# Load on decode worker (same adapter)
+curl -X POST http://<decode-worker>/load_lora \
+  -H "Content-Type: application/json" \
+  -d '{"lora_name": "my-adapter", "source": {"uri": "s3://bucket/adapter"}}'
+```
+
+If a LoRA is loaded on the prefill worker but not on the decode worker, the decode worker will fall back to the base model for that request.
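
Because a worker that lacks the adapter falls back to the base model, it can help to drive both loads from one place. The sketch below assumes the worker addresses are passed as `host:port` arguments. `load_lora_everywhere` and the `DRY_RUN` variable are hypothetical names introduced here for illustration.

```bash
# Hypothetical helper: send the same load_lora request to every worker given.
# Usage: load_lora_everywhere <lora_name> <uri> <worker host:port>...
# With DRY_RUN=1 (the default in this sketch) it only prints the curl commands.
load_lora_everywhere() {
  name=$1; uri=$2; shift 2
  body=$(printf '{"lora_name": "%s", "source": {"uri": "%s"}}' "$name" "$uri")
  for worker in "$@"; do
    if [ "${DRY_RUN:-1}" = "1" ]; then
      printf "curl -X POST http://%s/load_lora -H 'Content-Type: application/json' -d '%s'\n" \
        "$worker" "$body"
    else
      # -f makes curl return non-zero on HTTP errors, so a failed load stops the loop
      curl -fsS -X POST "http://$worker/load_lora" \
        -H "Content-Type: application/json" -d "$body" || return 1
    fi
  done
}
```

Running it with `DRY_RUN=0` and the prefill and decode worker addresses would issue both requests in sequence and stop at the first failure.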
+
 ## Known Limitations

 - **Disaggregated flows require Python Processor** - All multimodal disaggregation requires the Python Processor component (`ModelInput.Text`).

--- a/examples/multimodal/deploy/lora/README.md
+++ b/examples/multimodal/deploy/lora/README.md
+# Multimodal LoRA Deployment with MinIO on Kubernetes
+
+This guide explains how to deploy multimodal (vision-language) LoRA-enabled vLLM inference with an S3-compatible storage backend on Kubernetes.
+
+## Overview
+
+This deployment pattern enables dynamic LoRA adapter loading from S3-compatible storage (MinIO) for vision-language models in a Kubernetes environment. It uses the aggregated single-worker architecture where the Rust OpenAIPreprocessor in the Frontend handles image URLs directly.
+
+## Prerequisites
+
+- Kubernetes cluster with GPU support
+- Helm 3.x installed
+- `kubectl` configured to access your cluster
+- Dynamo Kubernetes Platform installed ([Installation Guide](../../../../docs/pages/kubernetes/installation-guide.md))
+- HuggingFace token for downloading the base model and LoRA adapters
+
+## Files in This Directory
+
+| File | Description |
+|------|-------------|
+| `agg_qwen_lora.yaml` | DynamoGraphDeployment for multimodal vLLM with LoRA support |
+| `minio-secret.yaml` | Kubernetes secret for MinIO credentials |
+| `sync-lora-job.yaml` | Job to download LoRA from HuggingFace and upload to MinIO |
+| `lora-model.yaml` | DynamoModel CRD for registering LoRA adapters |
+
+---
+
+## Step 1: Set Up Environment Variables
+
+```bash
+export NAMESPACE=dynamo  # Your Dynamo namespace
+export HF_TOKEN=your_hf_token  # Your HuggingFace token
+```
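
Later steps interpolate these variables into `kubectl` commands, so an unset value tends to surface as a confusing error further down. A small guard, sketched here with the hypothetical helper `require_env`, can catch that early.

```bash
# Hypothetical helper: fail if any of the named variables is unset or empty.
# Usage: require_env NAMESPACE HF_TOKEN
require_env() {
  missing=0
  for var in "$@"; do
    # Indirect lookup of the variable whose name is stored in $var
    eval "val=\${$var:-}"
    if [ -z "$val" ]; then
      echo "missing required variable: $var" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

Calling `require_env NAMESPACE HF_TOKEN` before Step 2 returns a non-zero status and names each variable that is still missing.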
+
+---
+
+## Step 2: Create Secrets
+
+### Create HuggingFace Token Secret
+
+```bash
+kubectl create secret generic hf-token-secret \
+  --from-literal=HF_TOKEN="${HF_TOKEN}" \
+  -n ${NAMESPACE}
+```
+
+### Create MinIO Credentials Secret
+
+This example uses MinIO's default credentials (`minioadmin` / `minioadmin`), which are defined in `minio-secret.yaml`.
+To use your own S3-compatible storage, edit the values in that file before applying it.
+
+```bash
+kubectl apply -f minio-secret.yaml -n ${NAMESPACE}
+```
+
+---
+
+## Step 3: Install MinIO
+
+### Add MinIO Helm Repository
+
+```bash
+helm repo add minio https://charts.min.io/
+helm repo update
+```
+
+### Deploy MinIO
+
+```bash
+helm install minio minio/minio \
+  --namespace ${NAMESPACE} \
+  --set rootUser=minioadmin \
+  --set rootPassword=minioadmin \
+  --set mode=standalone \
+  --set replicas=1 \
+  --set persistence.enabled=true \
+  --set persistence.size=10Gi \
+  --set resources.requests.memory=512Mi \
+  --set service.type=ClusterIP \
+  --set consoleService.type=ClusterIP
+```
+
+### Verify MinIO Installation
+
+```bash
+kubectl get pods -n ${NAMESPACE} | grep minio
+kubectl get svc -n ${NAMESPACE} | grep minio
+```
+
+Expected output:
+```text
+minio-xxxx-xxxx   1/1     Running   0          1m
+```
+
+### (Optional) Access MinIO Console
+
+```bash
+kubectl port-forward svc/minio-console -n ${NAMESPACE} 9001:9001
+```
+
+Open http://localhost:9001 in your browser:
+- Username: `minioadmin`
+- Password: `minioadmin`
+
+---
+
+## Step 4: Upload LoRA Adapters to MinIO
+
+Use the provided Kubernetes Job to download a vision LoRA adapter from HuggingFace and upload it to MinIO:
+
+```bash
+kubectl apply -f sync-lora-job.yaml -n ${NAMESPACE}
+```
+
+The default job syncs `Chhagan005/Chhagan-DocVL-Qwen3`, a document-understanding LoRA for Qwen3-VL-2B.
+
+### Monitor the Job
+
+```bash
+# Watch job progress
+kubectl get jobs -n ${NAMESPACE} -w
+
+# Check job logs
+kubectl logs job/sync-hf-lora-to-minio -n ${NAMESPACE} -f
+```
+
+Wait for the job to complete successfully.
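
Instead of watching the job by hand, `kubectl wait` can block until the job reports the `complete` condition. The wrapper below is a sketch: the function name `wait_for_sync_job` and the 600s default timeout are assumptions, while the job name comes from `sync-lora-job.yaml`.

```bash
# Hypothetical wrapper around `kubectl wait` for the sync job.
# Usage: wait_for_sync_job [job-name] [timeout]
wait_for_sync_job() {
  # NAMESPACE falls back to "dynamo", the value used in Step 1
  kubectl wait --for=condition=complete "job/${1:-sync-hf-lora-to-minio}" \
    -n "${NAMESPACE:-dynamo}" --timeout="${2:-600s}"
}
```

If the job fails rather than completes, `kubectl wait` keeps waiting until the timeout expires, so the job logs remain the place to look for errors.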
+
+### Verify Upload (Optional)
+
+```bash
+# Port-forward MinIO API
+kubectl port-forward svc/minio -n ${NAMESPACE} 9000:9000 &
+
+# Check uploaded files
+export AWS_ACCESS_KEY_ID=minioadmin
+export AWS_SECRET_ACCESS_KEY=minioadmin
+export AWS_ENDPOINT_URL=http://localhost:9000
+aws s3 ls s3://my-loras/ --recursive
+```
+
+### Customizing the LoRA Adapter
+
+To upload a different LoRA adapter, edit `sync-lora-job.yaml` and change the `MODEL_NAME` environment variable:
+
+```yaml
+env:
+- name: MODEL_NAME
+  value: your-org/your-vision-lora-adapter
+```
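
Changing `MODEL_NAME` also changes where the adapter lands, because the job syncs to `s3://$LORA_ROOT_PATH/$MODEL_NAME`. The `source.uri` in `lora-model.yaml` has to follow. A tiny sketch of that relationship, using a hypothetical `lora_s3_uri` helper:

```bash
# Hypothetical helper: derive the S3 URI that the sync job uploads to.
# Usage: lora_s3_uri <LORA_ROOT_PATH> <MODEL_NAME>
lora_s3_uri() {
  printf 's3://%s/%s\n' "$1" "$2"
}

# With the defaults from sync-lora-job.yaml this prints the URI used in lora-model.yaml:
lora_s3_uri my-loras Chhagan005/Chhagan-DocVL-Qwen3
# s3://my-loras/Chhagan005/Chhagan-DocVL-Qwen3
```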
+
+---
+
+## Step 5: Deploy Multimodal vLLM with LoRA Support
+
+### Update the Image (if needed)
+
+Edit `agg_qwen_lora.yaml` to use your container image:
+
+```bash
+# Using yq (v4) to update the image in place, so the file applied below carries the change
+export FRAMEWORK_RUNTIME_IMAGE=your-registry/your-image:tag
+yq -i '.spec.services[].extraPodSpec.mainContainer.image = env(FRAMEWORK_RUNTIME_IMAGE)' agg_qwen_lora.yaml
+```
+
+### Deploy the LoRA-enabled Multimodal Graph
+
+```bash
+kubectl apply -f agg_qwen_lora.yaml -n ${NAMESPACE}
+```
+
+### Verify Deployment
+
+```bash
+# Check pods
+kubectl get pods -n ${NAMESPACE}
+
+# Watch worker logs
+kubectl logs -f deployment/agg-qwen-multimodal-lora-vllmworker -n ${NAMESPACE}
+```
+
+Wait for the worker to show "Application startup complete".
+
+### Test the Deployment
+
+```bash
+# Port-forward the frontend
+kubectl port-forward svc/agg-qwen-multimodal-lora-frontend -n ${NAMESPACE} 8000:8000 &
+
+# List available models
+curl http://localhost:8000/v1/models | jq .
+```
+
+---
+
+## Step 6: Using DynamoModel CRD
+
+The `lora-model.yaml` file demonstrates how to register a LoRA adapter using the DynamoModel Custom Resource:
+
+```bash
+kubectl apply -f lora-model.yaml -n ${NAMESPACE}
+```
+
+This creates a declarative way to manage LoRA adapters in your cluster. The model CRD references:
+- **modelName**: `Chhagan005/Chhagan-DocVL-Qwen3` (the adapter identity)
+- **baseModelName**: `Qwen/Qwen3-VL-2B-Instruct` (the base VLM)
+- **source.uri**: `s3://my-loras/Chhagan005/Chhagan-DocVL-Qwen3` (MinIO location)
+
+---
+
+## Step 7: Run Inference
+
+### Inference with the LoRA Adapter
+
+```bash
+curl -X POST http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" \
+  -d '{
+    "model": "Chhagan005/Chhagan-DocVL-Qwen3",
+    "messages": [{"role": "user", "content": [
+      {"type": "text", "text": "Describe this image in detail"},
+      {"type": "image_url", "image_url": {"url": "https://vllm-public-assets.s3.us-west-2.amazonaws.com/vision_model_images/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"}}
+    ]}],
+    "max_tokens": 300,
+    "temperature": 0.0
+  }' | jq .
+```
+
+### Inference with the Base Model
+
+```bash
+curl -X POST http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" \
+  -d '{
+    "model": "Qwen/Qwen3-VL-2B-Instruct",
+    "messages": [{"role": "user", "content": [
+      {"type": "text", "text": "Describe this image in detail"},
+      {"type": "image_url", "image_url": {"url": "https://vllm-public-assets.s3.us-west-2.amazonaws.com/vision_model_images/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"}}
+    ]}],
+    "max_tokens": 300,
+    "temperature": 0.0
+  }' | jq .
+```
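
The two requests above differ only in the `model` field, which makes a side-by-side comparison easy to script. The sketch below assumes the frontend is still port-forwarded to `localhost:8000`. `vlm_chat_payload` and `IMAGE_URL` are hypothetical names, and the helper does no JSON escaping, so the prompt must not contain double quotes or backslashes.

```bash
# Hypothetical helper: print a chat completion body with one text part and one image part.
# Usage: vlm_chat_payload <model> <prompt> <image-url>
vlm_chat_payload() {
  printf '{"model": "%s", "messages": [{"role": "user", "content": [' "$1"
  printf '{"type": "text", "text": "%s"}, ' "$2"
  printf '{"type": "image_url", "image_url": {"url": "%s"}}' "$3"
  printf ']}], "max_tokens": 300, "temperature": 0.0}\n'
}

# Usage (not run here): send the same prompt to the adapter and to the base model.
#   for m in "Chhagan005/Chhagan-DocVL-Qwen3" "Qwen/Qwen3-VL-2B-Instruct"; do
#     vlm_chat_payload "$m" "Describe this image in detail" "$IMAGE_URL" |
#       curl -s http://localhost:8000/v1/chat/completions \
#         -H "Content-Type: application/json" -d @- |
#       jq -r '.choices[0].message.content'
#   done
```

Keeping `temperature` at `0.0`, as in the requests above, makes the two outputs easier to compare.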
+
+---
+
+## Configuration Reference
+
+### Environment Variables
+
+| Variable | Description | Default |
+|----------|-------------|---------|
+| `DYN_REQUEST_PLANE` | Transport plane (TCP for multimodal to avoid NATS 1MB limit) | `tcp` |
+| `DYN_LORA_ENABLED` | Enable LoRA support | `true` |
+| `DYN_LORA_PATH` | Local cache path for LoRA files | `/tmp/dynamo_loras_multimodal` |
+| `DYN_SYSTEM_ENABLED` | Enable system management API | `true` |
+| `DYN_SYSTEM_PORT` | Port for LoRA management API | `9090` |
+| `AWS_ENDPOINT` | MinIO/S3 endpoint URL | `http://minio:9000` |
+| `AWS_ACCESS_KEY_ID` | MinIO access key | From secret |
+| `AWS_SECRET_ACCESS_KEY` | MinIO secret key | From secret |
+| `AWS_REGION` | AWS region (required for S3 SDK) | `us-east-1` |
+| `AWS_ALLOW_HTTP` | Allow HTTP connections | `true` |
+| `BUCKET_NAME` | MinIO bucket name | `my-loras` |
+
+### vLLM Arguments
+
+| Argument | Description |
+|----------|-------------|
+| `--enable-multimodal` | Enable multimodal (vision) support |
+| `--enable-lora` | Enable LoRA adapter support |
+| `--max-lora-rank` | Maximum LoRA rank (must be >= your adapter's rank) |
+| `--max-loras` | Maximum number of LoRAs to load simultaneously |
+| `--gpu-memory-utilization` | Fraction of GPU memory to use (set to `0.85` in this example) |
+| `--max-model-len` | Maximum sequence length (set to `8192` in this example) |
+| `--max-num-batched-tokens` | Maximum batched tokens (set to `8192` in this example) |
+
+---
+
+## Cleanup
+
+### Remove vLLM Deployment
+
+```bash
+kubectl delete -f agg_qwen_lora.yaml -n ${NAMESPACE}
+```
+
+### Remove DynamoModel CRD
+
+```bash
+kubectl delete -f lora-model.yaml -n ${NAMESPACE}
+```
+
+### Remove Sync Job
+
+```bash
+kubectl delete -f sync-lora-job.yaml -n ${NAMESPACE}
+```
+
+### Remove MinIO
+
+```bash
+helm uninstall minio -n ${NAMESPACE}
+```
+
+### Remove Secrets
+
+```bash
+kubectl delete -f minio-secret.yaml -n ${NAMESPACE}
+kubectl delete secret hf-token-secret -n ${NAMESPACE}
+```
+
+---
+
+## Troubleshooting
+
+### LoRA Fails to Load
+
+1. **Check MinIO connectivity from worker**:
+   ```bash
+   kubectl exec -it deployment/agg-qwen-multimodal-lora-vllmworker -n ${NAMESPACE} -- \
+     curl http://minio:9000/minio/health/live
+   ```
+
+2. **Verify LoRA exists in MinIO**:
+   ```bash
+   kubectl port-forward svc/minio -n ${NAMESPACE} 9000:9000 &
+   aws --endpoint-url=http://localhost:9000 s3 ls s3://my-loras/ --recursive
+   ```
+
+3. **Check worker logs**:
+   ```bash
+   kubectl logs deployment/agg-qwen-multimodal-lora-vllmworker -n ${NAMESPACE}
+   ```
+
+4. **Verify adapter compatibility**: Ensure the LoRA adapter was trained for the same base model architecture (Qwen3-VL-2B) and that `--max-lora-rank` (set to `64` in `agg_qwen_lora.yaml`) is >= the adapter's rank.
+
+### Sync Job Fails
+
+1. **Check job logs**:
+   ```bash
+   kubectl logs job/sync-hf-lora-to-minio -n ${NAMESPACE}
+   ```
+
+2. **Verify HuggingFace token**:
+   ```bash
+   kubectl get secret hf-token-secret -n ${NAMESPACE} -o yaml
+   ```
+
+3. **Check MinIO is accessible**:
+   ```bash
+   kubectl get svc minio -n ${NAMESPACE}
+   ```
+
+### OOM During Inference
+
+- Qwen VL models use dynamic resolution: a 2560px image can produce 5000+ tokens
+- Reduce `--max-model-len` in `agg_qwen_lora.yaml` args
+- Add `--mm-processor-kwargs '{"max_pixels": 1003520}'` to cap image resolution
+- Lower `--gpu-memory-utilization` to `0.80`
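
To get a feel for the numbers in the first bullet, the visual token count can be estimated from the image area. The sketch below assumes each visual token covers a 28x28 pixel region (under that assumption a `max_pixels` of 1003520 works out to exactly 1280 tokens). The real figure depends on the model's image processor, and the 2560x1706 size in the example call is an illustrative guess, not the measured size of the sample image.

```bash
# Rough estimate: visual tokens ~= image area / area covered by one token.
# Usage: estimate_image_tokens <width> <height> [pixels-per-token-side]
# The default side of 28 is an assumption; adjust it for your model.
estimate_image_tokens() {
  width=$1; height=$2; side=${3:-28}
  echo $(( (width * height) / (side * side) ))
}

estimate_image_tokens 2560 1706   # prints 5570
estimate_image_tokens 1120 896    # prints 1280 (1120 * 896 = 1003520 pixels)
```

Comparing the estimate against `--max-model-len` gives a quick sense of how much of the context a single image may consume.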
+
+### MinIO Connection Refused
+
+- Ensure MinIO pods are running: `kubectl get pods -n ${NAMESPACE} | grep minio`
+- Check MinIO service: `kubectl get svc minio -n ${NAMESPACE}`
+- Verify the `AWS_ENDPOINT` URL matches the service name
+
+## Further Reading
+
+- [Multimodal LoRA Launch Guide](../../launch/lora/README.md) - Local launch with shell scripts
+- [LLM LoRA Deployment](../../../backends/vllm/deploy/lora/README.md) - Text-only LoRA deployment pattern
+- [Dynamo Kubernetes Guide](../../../../docs/pages/kubernetes/README.md) - Platform setup
+- [Installation Guide](../../../../docs/pages/kubernetes/installation-guide.md) - Platform installation
--- a/examples/multimodal/deploy/lora/agg_qwen_lora.yaml
+++ b/examples/multimodal/deploy/lora/agg_qwen_lora.yaml
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Aggregated multimodal serving for Qwen VL with dynamic LoRA adapter support.
+# Uses dynamo.vllm --enable-multimodal --enable-lora in a single-worker PD architecture
+# where the Rust OpenAIPreprocessor in the Frontend handles image URLs directly.
+#
+# LoRA adapters are managed via the system API on DYN_SYSTEM_PORT:
+#   Load:   curl -X POST http://<worker>:9090/v1/loras \
+#             -H "Content-Type: application/json" \
+#             -d '{"lora_name": "my-adapter", "source": {"uri": "s3://my-loras/adapter"}}'
+#   List:   curl http://<worker>:9090/v1/loras
+#   Unload: curl -X DELETE http://<worker>:9090/v1/loras/my-adapter
+#
+# Matches the pattern in: examples/multimodal/launch/lora/lora_agg.sh
+
+apiVersion: nvidia.com/v1alpha1
+kind: DynamoGraphDeployment
+metadata:
+  name: agg-qwen-multimodal-lora
+spec:
+  services:
+    Frontend:
+      envFromSecret: hf-token-secret
+      componentType: frontend
+      replicas: 1
+      extraPodSpec:
+        mainContainer:
+          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:my-tag
+          env:
+            - name: DYN_REQUEST_PLANE
+              value: tcp
+            - name: MODEL_PATH
+              value: Qwen/Qwen3-VL-2B-Instruct
+    VllmWorker:
+      envFromSecret: hf-token-secret
+      componentType: worker
+      replicas: 1
+      resources:
+        limits:
+          gpu: "1"
+      modelRef:
+        name: Qwen/Qwen3-VL-2B-Instruct
+      extraPodSpec:
+        mainContainer:
+          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:my-tag
+          env:
+            - name: DYN_REQUEST_PLANE
+              value: tcp
+            - name: MODEL_PATH
+              value: Qwen/Qwen3-VL-2B-Instruct
+            - name: DYN_LORA_ENABLED
+              value: "true"
+            - name: DYN_LORA_PATH
+              value: "/tmp/dynamo_loras_multimodal"
+            - name: DYN_SYSTEM_ENABLED
+              value: "true"
+            - name: DYN_SYSTEM_PORT
+              value: "9090"
+            - name: AWS_ENDPOINT
+              value: "http://minio:9000"
+            - name: AWS_ACCESS_KEY_ID
+              valueFrom:
+                secretKeyRef:
+                  name: minio-secret
+                  key: AWS_ACCESS_KEY_ID
+            - name: AWS_SECRET_ACCESS_KEY
+              valueFrom:
+                secretKeyRef:
+                  name: minio-secret
+                  key: AWS_SECRET_ACCESS_KEY
+            - name: AWS_REGION
+              value: "us-east-1"
+            - name: AWS_ALLOW_HTTP
+              value: "true"
+            - name: BUCKET_NAME
+              value: "my-loras"
+          command:
+            - python3
+            - -m
+            - dynamo.vllm
+          args:
+            - --enable-multimodal
+            - --model
+            - Qwen/Qwen3-VL-2B-Instruct
+            - --connector
+            - none
+            - --enable-lora
+            - --max-lora-rank
+            - "64"
+            - --gpu-memory-utilization
+            - "0.85"
+            - --max-model-len
+            - "8192"
+            - --max-num-batched-tokens
+            - "8192"
--- a/examples/multimodal/deploy/lora/lora-model.yaml
+++ b/examples/multimodal/deploy/lora/lora-model.yaml
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+apiVersion: nvidia.com/v1alpha1
+kind: DynamoModel
+metadata:
+  name: chhagan-docvl-qwen
+spec:
+  modelName: Chhagan005/Chhagan-DocVL-Qwen3
+  baseModelName: Qwen/Qwen3-VL-2B-Instruct
+  modelType: lora
+  source:
+    uri: s3://my-loras/Chhagan005/Chhagan-DocVL-Qwen3
\ No newline at end of file
--- a/examples/multimodal/deploy/lora/minio-secret.yaml
+++ b/examples/multimodal/deploy/lora/minio-secret.yaml
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+apiVersion: v1
+kind: Secret
+type: Opaque
+metadata:
+  name: minio-secret
+stringData:
+  AWS_ACCESS_KEY_ID: minioadmin
+  AWS_SECRET_ACCESS_KEY: minioadmin
--- a/examples/multimodal/deploy/lora/sync-lora-job.yaml
+++ b/examples/multimodal/deploy/lora/sync-lora-job.yaml
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+apiVersion: batch/v1
+kind: Job
+metadata:
+  name: sync-hf-lora-to-minio
+spec:
+  template:
+    spec:
+      containers:
+      - name: uploader
+        image: python:3.10-slim
+        command:
+        - /bin/sh
+        - -c
+        - |
+          set -eux
+          pip install --no-cache-dir huggingface-hub awscli
+          hf download $MODEL_NAME --local-dir /tmp/lora
+          rm -rf /tmp/lora/.cache
+          aws --endpoint-url=http://minio:9000 s3 mb s3://$LORA_ROOT_PATH || true
+          aws --endpoint-url=http://minio:9000 s3 sync /tmp/lora s3://$LORA_ROOT_PATH/$MODEL_NAME
+        envFrom:
+        - secretRef:
+            name: hf-token-secret
+        - secretRef:
+            name: minio-secret
+        env:
+        - name: AWS_REGION   # set this to your AWS region
+          value: us-east-1
+        - name: AWS_ALLOW_HTTP # remove/disable this if you are using an S3 endpoint or secure MinIO
+          value: "true"
+        - name: LORA_ROOT_PATH
+          value: "my-loras"
+        - name: MODEL_NAME
+          value: Chhagan005/Chhagan-DocVL-Qwen3
+      restartPolicy: Never
+  backoffLimit: 3
\ No newline at end of file