"lib/vscode:/vscode.git/clone" did not exist on "1f9b69b0f32d1f013a7a2d8b3fa96d5aa5697657"
Commit a28c5f3a, authored by Biswa Panda, committed by GitHub

feat: add examples for multimodal loras (#6400)

parent 026f361d
# Multimodal LoRA Serving Guide
Serve vision-language models (VLMs) with dynamically loadable LoRA adapters using Dynamo's aggregated architecture.
## Prerequisites
- **GPU**: NVIDIA GPU with sufficient VRAM (8 GB+ for 2B models, 24 GB+ for 7B models)
- **Dynamo**: Installed with vLLM support (`pip install dynamo[vllm]`)
- **jq** (optional, for pretty JSON output): `sudo apt install jq`
- **hf CLI** (optional, for downloading adapters): `pip install huggingface-hub`
## Quick Start
### 1. Launch the server
```bash
cd examples/multimodal/launch/lora
./lora_agg.sh
```
This starts the frontend (HTTP port 8000) and the vLLM worker (system/admin port 8081), with `Qwen/Qwen3-VL-2B-Instruct` as the base model.
Wait for both services to report ready in the logs (look for `Application startup complete`).
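
If you would rather poll than watch the logs, a small retry helper can wrap the readiness check. The sketch below is an illustration only: `wait_until` is a hypothetical name, not part of the example scripts, and it treats any command that exits with status 0 as "ready".

```bash
# wait_until TRIES DELAY CMD [ARGS...]
# Runs CMD up to TRIES times, sleeping DELAY seconds between attempts.
# Returns 0 as soon as CMD succeeds, 1 if every attempt fails.
wait_until() {
  tries=$1
  delay=$2
  shift 2
  attempt=1
  while [ "$attempt" -le "$tries" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    # Only sleep when another attempt is still coming
    if [ "$attempt" -lt "$tries" ]; then
      sleep "$delay"
    fi
    attempt=$((attempt + 1))
  done
  return 1
}

# Example (not executed here): poll the frontend for up to 60 seconds
# wait_until 60 1 curl -sf http://localhost:8000/v1/models
```

The helper only reports whether the probe ever succeeded; deciding what to do on timeout is left to the caller.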
### 2. Verify the server is running
```bash
curl http://localhost:8000/v1/models | jq .
```
You should see the base model listed.
### 3. Download a LoRA adapter
Download a compatible vision LoRA to your local filesystem:
```bash
export HF_TOKEN=<your-huggingface-token>
hf download Chhagan005/Chhagan-DocVL-Qwen3 --local-dir /tmp/my-vlm-lora
```
### 4. Load the LoRA adapter
```bash
curl -s -X POST http://localhost:8081/v1/loras \
-H "Content-Type: application/json" \
-d '{
"lora_name": "my-vlm-lora",
"source": {"uri": "file:///tmp/my-vlm-lora"}
}' | jq .
```
Expected response:
```json
{
"status": "success",
"message": "LoRA adapter 'my-vlm-lora' loaded successfully",
"lora_name": "my-vlm-lora",
"lora_id": 1207343256
}
```
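When the adapter directory comes from a variable, it can help to build the request body in one place. The following is a minimal sketch, assuming the adapter name and path contain no characters that need JSON escaping; `lora_load_payload` is a hypothetical helper name.

```bash
# lora_load_payload NAME DIR
# Prints the JSON body for POST /v1/loras with an absolute file:// URI.
# Assumption: NAME and DIR need no JSON escaping (no quotes or backslashes).
lora_load_payload() {
  name=$1
  dir=$2
  case "$dir" in
    /*) abs=$dir ;;             # already an absolute path
    *)  abs="$(pwd)/$dir" ;;    # resolve relative to the current directory
  esac
  printf '{"lora_name": "%s", "source": {"uri": "file://%s"}}\n' "$name" "$abs"
}

# Example (not executed here):
# curl -s -X POST http://localhost:8081/v1/loras \
#   -H "Content-Type: application/json" \
#   -d "$(lora_load_payload my-vlm-lora /tmp/my-vlm-lora)"
```

Resolving relative paths up front avoids sending a `file://` URI that the worker would interpret relative to its own working directory.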
### 5. Run inference with the LoRA adapter
```bash
curl -X POST http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "my-vlm-lora",
"messages": [{"role": "user", "content": [
{"type": "text", "text": "Describe this image in detail"},
{"type": "image_url", "image_url": {"url": "https://vllm-public-assets.s3.us-west-2.amazonaws.com/vision_model_images/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"}}
]}],
"max_tokens": 300,
"temperature": 0.0
}' | jq .
```
### 6. Compare with the base model
```bash
curl -X POST http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen3-VL-2B-Instruct",
"messages": [{"role": "user", "content": [
{"type": "text", "text": "Describe this image in detail"},
{"type": "image_url", "image_url": {"url": "https://vllm-public-assets.s3.us-west-2.amazonaws.com/vision_model_images/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"}}
]}],
"max_tokens": 300,
"temperature": 0.0
}' | jq .
```
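Steps 5 and 6 send the same request with only the `model` field changed, so the body can be generated by a small function. This is a sketch under the assumption that the prompt and URL need no JSON escaping; `chat_payload` is a hypothetical name.

```bash
# chat_payload MODEL PROMPT IMAGE_URL
# Prints a chat-completions request body with one text part and one image part.
# Assumption: none of the arguments contain characters that need JSON escaping.
chat_payload() {
  printf '{"model": "%s", "messages": [{"role": "user", "content": [{"type": "text", "text": "%s"}, {"type": "image_url", "image_url": {"url": "%s"}}]}], "max_tokens": 300, "temperature": 0.0}\n' \
    "$1" "$2" "$3"
}

# Example (not executed here): send the identical prompt to both models
# for model in my-vlm-lora Qwen/Qwen3-VL-2B-Instruct; do
#   curl -s -X POST http://localhost:8000/v1/chat/completions \
#     -H "Content-Type: application/json" \
#     -d "$(chat_payload "$model" "Describe this image in detail" "$IMAGE_URL")"
# done
```

Keeping `temperature` at 0.0 in both requests makes the two outputs easier to compare side by side.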
### 7. Unload the LoRA adapter
```bash
curl -X DELETE http://localhost:8081/v1/loras/my-vlm-lora | jq .
```
### 8. Stop the server
Press `Ctrl+C` in the terminal running `lora_agg.sh`. The trap handler will clean up child processes.
## Configuration
### Command-line options
```bash
./lora_agg.sh --model llava-hf/llava-1.5-7b-hf # Use a different base model
./lora_agg.sh -- --enforce-eager # Pass extra vLLM args
./lora_agg.sh -- --mm-processor-kwargs '{"max_pixels": 1003520}' # Cap image resolution
```
### Environment variables
| Variable | Default | Description |
|---|---|---|
| `DYN_MODEL_NAME` | `Qwen/Qwen3-VL-2B-Instruct` | Base VLM model |
| `DYN_HTTP_PORT` | `8000` | Frontend HTTP port |
| `DYN_SYSTEM_PORT` | `8081` | Worker system/admin port |
| `DYN_LORA_PATH` | `/tmp/dynamo_loras_multimodal` | Local LoRA adapter cache |
| `DYN_MAX_LORA_RANK` | `64` | Maximum LoRA rank supported |
| `CUDA_VISIBLE_DEVICES` | `0` | GPU device index |
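
The defaults in the table come from the `${VAR:-default}` expansion used in `lora_agg.sh`, which falls back to the default when the variable is unset or empty. The sketch below prints the values the script would resolve in the current environment; `print_lora_agg_config` is a hypothetical helper, not part of the script.

```bash
# Prints the configuration lora_agg.sh would resolve from the environment.
# Each line mirrors one ${VAR:-default} assignment in the script.
print_lora_agg_config() {
  echo "model=${DYN_MODEL_NAME:-Qwen/Qwen3-VL-2B-Instruct}"
  echo "http_port=${DYN_HTTP_PORT:-8000}"
  echo "system_port=${DYN_SYSTEM_PORT:-8081}"
  echo "lora_path=${DYN_LORA_PATH:-/tmp/dynamo_loras_multimodal}"
  echo "max_lora_rank=${DYN_MAX_LORA_RANK:-64}"
  echo "gpu=${CUDA_VISIBLE_DEVICES:-0}"
}

print_lora_agg_config
```

Note that the `--model` command-line option is applied after these defaults, so it takes precedence over `DYN_MODEL_NAME`.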
### Base models supported by this script
| Model | Notes |
|---|---|
| `Qwen/Qwen3-VL-2B-Instruct` | Default. Good for single-GPU testing. |
| `Qwen/Qwen2.5-VL-7B-Instruct` | Higher quality, needs 24 GB+ VRAM. |
| `llava-hf/llava-1.5-7b-hf` | LLaVA architecture, 4096 max context. |
## LoRA Management API
All management endpoints are served on the system port (default 8081).
### Load a LoRA
```
POST /v1/loras
```
```json
{
"lora_name": "my-adapter",
"source": {
"uri": "file:///path/to/adapter"
}
}
```
Supported URI schemes:
- `file://` — local filesystem path
- `s3://` — S3-compatible storage (requires `AWS_ENDPOINT`, `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`)
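
A client-side check of the URI scheme can catch typos before a request reaches the worker. This is a minimal sketch based only on the two schemes listed above; `lora_uri_scheme_ok` is a hypothetical name, and the worker remains the authority on what it accepts.

```bash
# lora_uri_scheme_ok URI
# Returns 0 if URI starts with one of the schemes listed in this guide.
lora_uri_scheme_ok() {
  case "$1" in
    file://*|s3://*) return 0 ;;
    *) return 1 ;;
  esac
}

# Example: report on a few candidate URIs
for uri in file:///tmp/my-vlm-lora s3://my-bucket/adapters/a https://example.com/a; do
  if lora_uri_scheme_ok "$uri"; then
    echo "supported:   $uri"
  else
    echo "unsupported: $uri"
  fi
done
```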
### List loaded LoRAs
```
GET /v1/loras
```
### Unload a LoRA
```
DELETE /v1/loras/{lora_name}
```
### List all models (base + LoRAs)
```
GET /v1/models (on the frontend port, default 8000)
```
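The management endpoints above can be summarized as a lookup from action to HTTP method and URL. The sketch below only prints the method and URL and sends nothing; `lora_endpoint` is a hypothetical helper, and it assumes the worker is reachable on `localhost`.

```bash
# lora_endpoint ACTION [LORA_NAME]
# Prints "METHOD URL" for a management action on the system port.
lora_endpoint() {
  base="http://localhost:${DYN_SYSTEM_PORT:-8081}"
  case "$1" in
    load)   echo "POST $base/v1/loras" ;;
    list)   echo "GET $base/v1/loras" ;;
    unload)
      if [ -z "${2:-}" ]; then
        echo "unload requires an adapter name" >&2
        return 1
      fi
      echo "DELETE $base/v1/loras/$2"
      ;;
    *)
      echo "unknown action: $1" >&2
      return 1
      ;;
  esac
}

lora_endpoint list
lora_endpoint unload my-adapter
```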
## Running the validation script
A validation script is provided to test the LoRA endpoints against a running server:
```bash
# Start the server in one terminal
./lora_agg.sh
# In another terminal, download a LoRA adapter
hf download Chhagan005/Chhagan-DocVL-Qwen3 --local-dir /tmp/my-vlm-lora
# Run the full test suite (with end-to-end LoRA load/infer/unload)
./validate_lora_agg.sh --lora-path /tmp/my-vlm-lora
# Or run only the error-handling and base-model tests (no adapter needed)
./validate_lora_agg.sh
```
The validation script covers:
- Frontend health and base model discovery
- LoRA load/unload error handling (missing fields, non-existent adapter)
- End-to-end LoRA lifecycle: load, verify in `/v1/models`, infer, unload (only when `--lora-path` is provided)
- Base model multimodal inference
## Troubleshooting
### Frontend fails to start
- Check if port 8000 is already in use: `lsof -i :8000`
- Set a different port: `DYN_HTTP_PORT=8001 ./lora_agg.sh`
### OOM during inference
- Reduce `--max-model-len` via extra args: `./lora_agg.sh -- --max-model-len 4096`
- Cap image resolution: `./lora_agg.sh -- --mm-processor-kwargs '{"max_pixels": 1003520}'`
- Lower GPU memory utilization: `./lora_agg.sh -- --gpu-memory-utilization 0.80`
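
To see why capping `max_pixels` helps, a rough estimate of image tokens is useful. The sketch below assumes one visual token per 28×28 pixel block, which is the convention of Qwen2-VL-style processors; other models, including the default base model, may use a different block size, so treat the numbers as an order-of-magnitude guide. `estimate_image_tokens` is a hypothetical helper and the 2560×1707 image size is an illustrative assumption.

```bash
# estimate_image_tokens WIDTH HEIGHT [PIXELS_PER_TOKEN]
# Rough visual token count: total pixels divided by pixels per token.
# Assumption: 784 (28 * 28) pixels per token unless overridden.
estimate_image_tokens() {
  ppt=${3:-784}
  echo $(( ($1 * $2) / ppt ))
}

estimate_image_tokens 2560 1707   # uncapped, illustrative image size
echo $(( 1003520 / 784 ))         # token budget implied by max_pixels=1003520
```

Under that assumption the uncapped image needs roughly 5,500 tokens, while `max_pixels=1003520` limits each image to about 1,280 tokens, which leaves far more of the 8192-token `--max-model-len` for text.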
### LoRA fails to load
- Verify the adapter path exists and contains `adapter_config.json` and `adapter_model.safetensors`
- Ensure the adapter is compatible with the base model architecture
- Check that the worker's `--max-lora-rank` (set from `DYN_MAX_LORA_RANK`, default 64) is greater than or equal to the adapter's rank
- Review worker logs for detailed error messages
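
The first and third checks can be scripted before calling the load endpoint. This is a minimal sketch, assuming a PEFT-style `adapter_config.json` in which the rank is stored under the `"r"` key; `check_adapter_dir` is a hypothetical helper and it does not verify compatibility with the base model.

```bash
# check_adapter_dir DIR [MAX_RANK]
# Verifies the two expected adapter files exist and the rank fits MAX_RANK.
check_adapter_dir() {
  dir=$1
  max_rank=${2:-64}
  for f in adapter_config.json adapter_model.safetensors; do
    if [ ! -f "$dir/$f" ]; then
      echo "missing: $dir/$f" >&2
      return 1
    fi
  done
  # Extract the integer value of the "r" key (assumed PEFT config layout)
  rank=$(sed -n 's/.*"r"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p' \
    "$dir/adapter_config.json" | head -n 1)
  if [ -z "$rank" ]; then
    echo "could not read rank from $dir/adapter_config.json" >&2
    return 1
  fi
  if [ "$rank" -gt "$max_rank" ]; then
    echo "rank $rank exceeds max rank $max_rank" >&2
    return 1
  fi
  echo "ok: rank $rank <= $max_rank"
}

# Example (not executed here):
# check_adapter_dir /tmp/my-vlm-lora "${DYN_MAX_LORA_RANK:-64}"
```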
### Inference returns errors after loading LoRA
- Verify the LoRA is loaded: `curl http://localhost:8081/v1/loras | jq .`
- Confirm the model name in the request matches the `lora_name` exactly (case-sensitive)
- Check that the adapter was trained for the same base model
### Cache issues
- Inspect the cache: `ls -la /tmp/dynamo_loras_multimodal/`
- Clear the cache: `rm -rf /tmp/dynamo_loras_multimodal/*`
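
If the cache location is taken from `DYN_LORA_PATH` rather than typed by hand, it is worth guarding the delete against an unexpected value. The sketch below is one cautious way to do that; `clear_lora_cache` is a hypothetical helper, and it relies on the `-mindepth` and `-maxdepth` options available in GNU and BSD `find`.

```bash
# Empties the LoRA cache directory but keeps the directory itself.
# Refuses to run against "/" and does nothing if the directory is absent.
clear_lora_cache() {
  cache=${DYN_LORA_PATH:-/tmp/dynamo_loras_multimodal}
  case "$cache" in
    ""|/)
      echo "refusing to clear '$cache'" >&2
      return 1
      ;;
  esac
  [ -d "$cache" ] || return 0
  # Remove every top-level entry, including hidden files
  find "$cache" -mindepth 1 -maxdepth 1 -exec rm -rf {} +
}

# Example (not executed here):
# DYN_LORA_PATH=/tmp/dynamo_loras_multimodal clear_lora_cache
```

Unload any adapters served from the cache before clearing it, since the worker may still reference those files.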
## Cleanup
```bash
# Remove LoRA cache
rm -rf /tmp/dynamo_loras_multimodal
# Remove downloaded adapter
rm -rf /tmp/my-vlm-lora
```
#!/bin/bash
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Aggregated multimodal serving with LoRA adapter support
#
# Architecture: Single-worker PD (Prefill-Decode) with dynamic LoRA loading
# - Frontend: Rust OpenAIPreprocessor handles image URLs (HTTP and base64 data: URLs)
# - Worker: Standard vLLM worker with vision model + LoRA support
#
# Usage:
# ./lora_agg.sh # Qwen3-VL-2B (default)
# ./lora_agg.sh --model llava-hf/llava-1.5-7b-hf # LLaVA 1.5
# ./lora_agg.sh -- --enforce-eager # Pass extra args to vLLM
set -euo pipefail
trap 'echo "Cleaning up..."; kill 0' EXIT
# ── Configuration ────────────────────────────────────────────────────────
MODEL_NAME="${DYN_MODEL_NAME:-Qwen/Qwen3-VL-2B-Instruct}"
HTTP_PORT="${DYN_HTTP_PORT:-8000}"
SYSTEM_PORT="${DYN_SYSTEM_PORT:-8081}"
LORA_PATH="${DYN_LORA_PATH:-/tmp/dynamo_loras_multimodal}"
MAX_LORA_RANK="${DYN_MAX_LORA_RANK:-64}"
GPU_DEVICE="${CUDA_VISIBLE_DEVICES:-0}"
# ── Parse command-line arguments ─────────────────────────────────────────
EXTRA_ARGS=()
while [[ $# -gt 0 ]]; do
case $1 in
--model)
MODEL_NAME=$2
shift 2
;;
-h|--help)
cat <<USAGE
Usage: $0 [OPTIONS] [-- EXTRA_VLLM_ARGS]
Options:
--model <model_name> Vision-language model to serve (default: $MODEL_NAME)
-h, --help Show this help message
Environment variables:
DYN_MODEL_NAME Base model name (default: Qwen/Qwen3-VL-2B-Instruct)
DYN_HTTP_PORT Frontend HTTP port (default: 8000)
DYN_SYSTEM_PORT Worker system/admin port (default: 8081)
DYN_LORA_PATH Local cache directory for LoRA adapters (default: /tmp/dynamo_loras_multimodal)
DYN_MAX_LORA_RANK Maximum LoRA rank supported (default: 64)
CUDA_VISIBLE_DEVICES GPU device index (default: 0)
Any arguments after '--' are passed through to the vLLM worker.
After launch, manage LoRA adapters via the system API on port \$DYN_SYSTEM_PORT:
Load: curl -X POST http://localhost:${SYSTEM_PORT}/v1/loras \\
-H "Content-Type: application/json" \\
-d '{"lora_name": "my-adapter", "source": {"uri": "file:///path/to/adapter"}}'
List: curl http://localhost:${SYSTEM_PORT}/v1/loras
Unload: curl -X DELETE http://localhost:${SYSTEM_PORT}/v1/loras/my-adapter
USAGE
exit 0
;;
--)
shift
EXTRA_ARGS+=("$@")
break
;;
*)
EXTRA_ARGS+=("$1")
shift
;;
esac
done
# ── Banner ───────────────────────────────────────────────────────────────
echo "=================================================="
echo "Aggregated Multimodal Serving with LoRA Support"
echo "=================================================="
echo "Model: $MODEL_NAME"
echo "Frontend: http://localhost:$HTTP_PORT"
echo "System API: http://localhost:$SYSTEM_PORT"
echo "LoRA cache: $LORA_PATH"
echo "GPU device: $GPU_DEVICE"
echo "=================================================="
# ── Environment setup ────────────────────────────────────────────────────
# Use TCP transport for multimodal workloads (base64 images can exceed NATS 1MB limit)
export DYN_REQUEST_PLANE=tcp
# Enable dynamic LoRA loading
export DYN_LORA_ENABLED=true
export DYN_LORA_PATH="$LORA_PATH"
mkdir -p "$DYN_LORA_PATH"
# ── Model-specific vLLM settings ────────────────────────────────────────
#
# Qwen VL models use dynamic resolution: a 2560px image can produce 5000+ tokens.
# max-model-len must exceed (text tokens + image tokens).
# Use --mm-processor-kwargs to cap image pixels and reduce token count if OOM.
MODEL_SPECIFIC_ARGS=("--gpu-memory-utilization" "0.85" "--max-model-len" "8192")
case "$MODEL_NAME" in
Qwen/Qwen2.5-VL-7B-Instruct)
MODEL_SPECIFIC_ARGS=("--gpu-memory-utilization" "0.85" "--max-model-len" "8192" "--max-num-seqs" "8192")
;;
Qwen/Qwen3-VL-2B-Instruct)
MODEL_SPECIFIC_ARGS=("--gpu-memory-utilization" "0.85" "--max-model-len" "8192" "--max-num-batched-tokens" "8192")
;;
llava-hf/llava-1.5-7b-hf)
MODEL_SPECIFIC_ARGS=("--gpu-memory-utilization" "0.85" "--max-model-len" "4096")
;;
esac
# ── Start services ──────────────────────────────────────────────────────
echo ""
echo "Starting frontend..."
python -m dynamo.frontend &
FRONTEND_PID=$!
# Wait for frontend to become ready
echo "Waiting for frontend on port $HTTP_PORT..."
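FRONTEND_READY=false
for i in $(seq 1 60); do
if curl -sf "http://localhost:$HTTP_PORT/v1/models" > /dev/null 2>&1; then
echo "Frontend is ready."
FRONTEND_READY=true
break
fi
if ! kill -0 "$FRONTEND_PID" 2>/dev/null; then
echo "ERROR: Frontend process exited unexpectedly."
exit 1
fi
sleep 1
done
# Fail fast instead of starting the worker when the frontend never came up
if [[ "$FRONTEND_READY" != "true" ]]; then
echo "ERROR: Frontend did not become ready within 60 seconds."
exit 1
fi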
echo "Starting vLLM worker..."
# --enable-lora: Enable LoRA adapter support in vLLM engine
# --max-lora-rank: Maximum LoRA rank (increase if your adapters have higher rank)
# --connector none: No KV transfer needed for aggregated serving
CUDA_VISIBLE_DEVICES="$GPU_DEVICE" \
DYN_SYSTEM_PORT="$SYSTEM_PORT" \
python -m dynamo.vllm \
--enable-multimodal \
--model "$MODEL_NAME" \
--connector none \
--enable-lora \
--max-lora-rank "$MAX_LORA_RANK" \
"${MODEL_SPECIFIC_ARGS[@]}" \
${EXTRA_ARGS[@]+"${EXTRA_ARGS[@]}"} &
# Wait for all background processes
wait
#!/bin/bash
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Validation script for multimodal LoRA endpoints.
#
# Tests the full LoRA lifecycle (list, load, infer, unload) and error handling
# against a running multimodal worker.
#
# Prerequisites:
# A running multimodal worker via lora_agg.sh
#
# Usage:
# ./validate_lora_agg.sh # defaults: frontend=8000, system=8081
# ./validate_lora_agg.sh --lora-path /tmp/my-vlm-lora # with a real LoRA adapter
set -euo pipefail
# ── Defaults ─────────────────────────────────────────────────────────────
FRONTEND_PORT="${DYN_HTTP_PORT:-8000}"
SYSTEM_PORT="${DYN_SYSTEM_PORT:-8081}"
IMAGE_URL="https://vllm-public-assets.s3.us-west-2.amazonaws.com/vision_model_images/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
LORA_PATH=""
CURL_TIMEOUT=60
PASS=0
FAIL=0
SKIP=0
TOTAL=0
# ── Parse args ───────────────────────────────────────────────────────────
while [[ $# -gt 0 ]]; do
case $1 in
--frontend-port) FRONTEND_PORT=$2; shift 2 ;;
--system-port) SYSTEM_PORT=$2; shift 2 ;;
--image-url) IMAGE_URL=$2; shift 2 ;;
--lora-path) LORA_PATH=$2; shift 2 ;;
--timeout) CURL_TIMEOUT=$2; shift 2 ;;
-h|--help)
cat <<USAGE
Usage: $0 [OPTIONS]
Options:
--frontend-port <port> Frontend HTTP port (default: 8000)
--system-port <port> Worker system port (default: 8081)
--image-url <url> Image URL for multimodal test
--lora-path <path> Path to a real LoRA adapter for end-to-end tests
(skip load/infer tests if not provided)
--timeout <seconds> Curl timeout per request (default: 60)
-h, --help Show this help message
USAGE
exit 0
;;
*) echo "Unknown option: $1"; exit 1 ;;
esac
done
FRONTEND="http://localhost:$FRONTEND_PORT"
SYSTEM="http://localhost:$SYSTEM_PORT"
# ── Helpers ──────────────────────────────────────────────────────────────
pass() { PASS=$((PASS + 1)); TOTAL=$((TOTAL + 1)); echo " PASS: $1"; }
fail() { FAIL=$((FAIL + 1)); TOTAL=$((TOTAL + 1)); echo " FAIL: $1"; }
skip() { SKIP=$((SKIP + 1)); TOTAL=$((TOTAL + 1)); echo " SKIP: $1"; }
check_json_field() {
local json=$1 field=$2 expected=$3 name=$4
local actual
actual=$(echo "$json" | python3 -c "import sys,json; print(json.load(sys.stdin).get('$field',''))" 2>/dev/null || echo "PARSE_ERROR")
if [[ "$actual" == "$expected" ]]; then
pass "$name"
else
fail "$name (expected '$expected', got '$actual')"
fi
}
# Curl wrapper with timeout
api() {
curl -sf --max-time "$CURL_TIMEOUT" "$@" 2>/dev/null
}
# ── Banner ───────────────────────────────────────────────────────────────
echo "=================================================="
echo "Multimodal LoRA Endpoint Validation"
echo "=================================================="
echo "Frontend: $FRONTEND"
echo "System: $SYSTEM"
echo "LoRA path: ${LORA_PATH:-<not set — load/infer tests will be skipped>}"
echo "=================================================="
# ── 1. Frontend health ──────────────────────────────────────────────────
echo ""
echo "[1/9] Checking frontend health..."
if api "$FRONTEND/v1/models" > /dev/null; then
pass "Frontend is reachable"
else
fail "Frontend is NOT reachable at $FRONTEND"
echo "Ensure lora_agg.sh is running. Aborting."
exit 1
fi
# Discover the base model name from the running server
BASE_MODEL=$(api "$FRONTEND/v1/models" | python3 -c "import sys,json; print(json.load(sys.stdin)['data'][0]['id'])" 2>/dev/null || echo "")
if [[ -n "$BASE_MODEL" ]]; then
echo " Detected base model: $BASE_MODEL"
else
fail "Could not detect base model name"
exit 1
fi
# ── 2. List LoRAs (initially empty) ─────────────────────────────────────
echo ""
echo "[2/9] Testing list_loras (GET)..."
RESP=$(api "$SYSTEM/v1/loras" || echo '{"status":"error"}')
check_json_field "$RESP" "status" "success" "list_loras returns success"
LORA_COUNT=$(echo "$RESP" | python3 -c "import sys,json; print(json.load(sys.stdin).get('count','-1'))" 2>/dev/null || echo "-1")
echo " Currently loaded LoRAs: $LORA_COUNT"
# ── 3. Load LoRA — missing lora_name ────────────────────────────────────
echo ""
echo "[3/9] Testing load_lora error handling (missing lora_name)..."
RESP=$(api -X POST "$SYSTEM/v1/loras" \
-H "Content-Type: application/json" \
-d '{"source": {"uri": "file:///fake/path"}}' || echo '{"status":"error"}')
check_json_field "$RESP" "status" "error" "load_lora rejects missing lora_name"
# ── 4. Load LoRA — missing source ──────────────────────────────────────
echo ""
echo "[4/9] Testing load_lora error handling (missing source)..."
RESP=$(api -X POST "$SYSTEM/v1/loras" \
-H "Content-Type: application/json" \
-d '{"lora_name": "test-lora"}' || echo '{"status":"error"}')
check_json_field "$RESP" "status" "error" "load_lora rejects missing source"
# ── 5. Unload non-existent LoRA ─────────────────────────────────────────
echo ""
echo "[5/9] Testing unload_lora for non-existent adapter..."
RESP=$(api -X DELETE "$SYSTEM/v1/loras/non-existent-lora" || echo '{"status":"error"}')
check_json_field "$RESP" "status" "error" "unload_lora rejects non-existent adapter"
# ── 6. Load a real LoRA adapter ─────────────────────────────────────────
echo ""
echo "[6/9] Loading a real LoRA adapter..."
if [[ -n "$LORA_PATH" && -d "$LORA_PATH" ]]; then
RESP=$(api -X POST "$SYSTEM/v1/loras" \
-H "Content-Type: application/json" \
-d "{\"lora_name\": \"test-vlm-lora\", \"source\": {\"uri\": \"file://$LORA_PATH\"}}" || echo '{"status":"error"}')
check_json_field "$RESP" "status" "success" "load_lora with real adapter"
# Wait for LoRA to propagate to the frontend (discovery takes ~1-2s)
LORA_VISIBLE=false
for _wait in $(seq 1 10); do
MODELS=$(api "$FRONTEND/v1/models" || echo '{}')
if echo "$MODELS" | python3 -c "import sys,json; ids=[m['id'] for m in json.load(sys.stdin).get('data',[])]; assert 'test-vlm-lora' in ids" 2>/dev/null; then
LORA_VISIBLE=true
break
fi
sleep 1
done
if [[ "$LORA_VISIBLE" == "true" ]]; then
pass "LoRA appears in /v1/models"
else
fail "LoRA does NOT appear in /v1/models after 10s"
fi
else
skip "No usable --lora-path directory provided, skipping real LoRA load"
fi
# ── 7. Inference with LoRA adapter ──────────────────────────────────────
echo ""
echo "[7/9] Testing inference with LoRA adapter..."
if [[ -n "$LORA_PATH" && -d "$LORA_PATH" ]]; then
RESP=$(api -X POST "$FRONTEND/v1/chat/completions" \
--max-time 120 \
-H "Content-Type: application/json" \
-d "{
\"model\": \"test-vlm-lora\",
\"messages\": [{\"role\": \"user\", \"content\": [
{\"type\": \"text\", \"text\": \"Describe this image briefly.\"},
{\"type\": \"image_url\", \"image_url\": {\"url\": \"$IMAGE_URL\"}}
]}],
\"max_tokens\": 50,
\"temperature\": 0.0
}" || echo '{"error":"request_failed"}')
if echo "$RESP" | python3 -c "import sys,json; d=json.load(sys.stdin); assert 'choices' in d and len(d['choices'])>0" 2>/dev/null; then
pass "LoRA multimodal inference returned choices"
else
fail "LoRA multimodal inference failed: $RESP"
fi
else
skip "No usable --lora-path directory provided, skipping LoRA inference"
fi
# ── 8. Inference with base model ────────────────────────────────────────
echo ""
echo "[8/9] Testing base model multimodal inference..."
RESP=$(api -X POST "$FRONTEND/v1/chat/completions" \
--max-time 120 \
-H "Content-Type: application/json" \
-d "{
\"model\": \"$BASE_MODEL\",
\"messages\": [{\"role\": \"user\", \"content\": [
{\"type\": \"text\", \"text\": \"Describe this image briefly.\"},
{\"type\": \"image_url\", \"image_url\": {\"url\": \"$IMAGE_URL\"}}
]}],
\"max_tokens\": 50,
\"temperature\": 0.0
}" || echo '{"error":"request_failed"}')
if echo "$RESP" | python3 -c "import sys,json; d=json.load(sys.stdin); assert 'choices' in d and len(d['choices'])>0" 2>/dev/null; then
pass "Base model multimodal inference returned choices"
else
fail "Base model multimodal inference failed: $RESP"
fi
# ── 9. Unload LoRA adapter ──────────────────────────────────────────────
echo ""
echo "[9/9] Unloading LoRA adapter..."
if [[ -n "$LORA_PATH" && -d "$LORA_PATH" ]]; then
RESP=$(api -X DELETE "$SYSTEM/v1/loras/test-vlm-lora" || echo '{"status":"error"}')
check_json_field "$RESP" "status" "success" "unload_lora succeeds"
# Verify it's gone from models list
MODELS=$(api "$FRONTEND/v1/models" || echo '{}')
if echo "$MODELS" | python3 -c "import sys,json; ids=[m['id'] for m in json.load(sys.stdin).get('data',[])]; assert 'test-vlm-lora' not in ids" 2>/dev/null; then
pass "LoRA removed from /v1/models"
else
fail "LoRA still present in /v1/models after unload"
fi
else
skip "No usable --lora-path directory provided, skipping LoRA unload"
fi
# ── Summary ──────────────────────────────────────────────────────────────
echo ""
echo "=================================================="
echo "Results: $PASS passed, $FAIL failed, $SKIP skipped (out of $TOTAL)"
echo "=================================================="
if [[ $FAIL -gt 0 ]]; then
exit 1
fi