feat: add examples for multimodal loras (#6400)

Unverified commit a28c5f3a, authored Feb 19, 2026 by Biswa Panda; committed by GitHub Feb 19, 2026
3 changed files
--- a/examples/multimodal/launch/lora/README.md
+++ b/examples/multimodal/launch/lora/README.md
+# Multimodal LoRA Serving Guide
+
+Serve vision-language models (VLMs) with dynamically loadable LoRA adapters using Dynamo's aggregated architecture.
+
+## Prerequisites
+
+- **GPU**: NVIDIA GPU with sufficient VRAM (8 GB+ for 2B models, 24 GB+ for 7B models)
+- **Dynamo**: Installed with vLLM support (`pip install dynamo[vllm]`)
+- **jq** (optional, for pretty JSON output): `sudo apt install jq`
+- **hf CLI** (optional, for downloading adapters): `pip install huggingface-hub`
+
+## Quick Start
+
+### 1. Launch the server
+
+```bash
+cd examples/multimodal/launch/lora
+./lora_agg.sh
+```
+
+This starts the frontend (port 8000) and a vLLM worker whose system/admin API listens on port 8081, with `Qwen/Qwen3-VL-2B-Instruct` as the base model.
+
+Wait for both services to report ready in the logs (look for `Application startup complete`).
+
+### 2. Verify the server is running
+
+```bash
+curl http://localhost:8000/v1/models | jq .
+```
+
+You should see the base model listed.
+
+### 3. Download a LoRA adapter
+
+Download a compatible vision LoRA to your local filesystem:
+
+```bash
+export HF_TOKEN=<your-huggingface-token>
+
+hf download Chhagan005/Chhagan-DocVL-Qwen3 --local-dir /tmp/my-vlm-lora
+```
+
+### 4. Load the LoRA adapter
+
+```bash
+curl -s -X POST http://localhost:8081/v1/loras \
+  -H "Content-Type: application/json" \
+  -d '{
+    "lora_name": "my-vlm-lora",
+    "source": {"uri": "file:///tmp/my-vlm-lora"}
+  }' | jq .
+```
+
+Expected response:
+```json
+{
+  "status": "success",
+  "message": "LoRA adapter 'my-vlm-lora' loaded successfully",
+  "lora_name": "my-vlm-lora",
+  "lora_id": 1207343256
+}
+```
+
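The request body in step 4 can also be assembled from a script. The following is a minimal sketch, not part of this commit: `lora_load_payload` is a hypothetical helper, and it assumes the adapter name and path contain no characters that need JSON escaping.

```shell
# Hypothetical helper: build the JSON body for POST /v1/loras from an
# adapter name and an absolute local path.
lora_load_payload() {
    local name=$1 path=$2
    # A file:// URI needs an absolute path, so reject relative ones early.
    case $path in
        /*) ;;
        *)  echo "adapter path must be absolute: $path" >&2
            return 1 ;;
    esac
    printf '{"lora_name": "%s", "source": {"uri": "file://%s"}}\n' "$name" "$path"
}

# Usage against the worker's system port (default 8081):
#   curl -s -X POST http://localhost:8081/v1/loras \
#     -H "Content-Type: application/json" \
#     -d "$(lora_load_payload my-vlm-lora /tmp/my-vlm-lora)"
```

Rejecting relative paths up front avoids sending a malformed URI such as `file://relative/dir` to the worker.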
+### 5. Run inference with the LoRA adapter
+
+```bash
+curl -X POST http://localhost:8000/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "my-vlm-lora",
+    "messages": [{"role": "user", "content": [
+      {"type": "text", "text": "Describe this image in detail"},
+      {"type": "image_url", "image_url": {"url": "https://vllm-public-assets.s3.us-west-2.amazonaws.com/vision_model_images/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"}}
+    ]}],
+    "max_tokens": 300,
+    "temperature": 0.0
+  }' | jq .
+```
+
+### 6. Compare with the base model
+
+```bash
+curl -X POST http://localhost:8000/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "Qwen/Qwen3-VL-2B-Instruct",
+    "messages": [{"role": "user", "content": [
+      {"type": "text", "text": "Describe this image in detail"},
+      {"type": "image_url", "image_url": {"url": "https://vllm-public-assets.s3.us-west-2.amazonaws.com/vision_model_images/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"}}
+    ]}],
+    "max_tokens": 300,
+    "temperature": 0.0
+  }' | jq .
+```
+
+### 7. Unload the LoRA adapter
+
+```bash
+curl -X DELETE http://localhost:8081/v1/loras/my-vlm-lora | jq .
+```
+
+### 8. Stop the server
+
+Press `Ctrl+C` in the terminal running `lora_agg.sh`. The trap handler will clean up child processes.
+
+## Configuration
+
+### Command-line options
+
+```bash
+./lora_agg.sh --model llava-hf/llava-1.5-7b-hf            # Use a different base model
+./lora_agg.sh -- --enforce-eager                            # Pass extra vLLM args
+./lora_agg.sh -- --mm-processor-kwargs '{"max_pixels": 1003520}'  # Cap image resolution
+```
+
+### Environment variables
+
+| Variable | Default | Description |
+|---|---|---|
+| `DYN_MODEL_NAME` | `Qwen/Qwen3-VL-2B-Instruct` | Base VLM model |
+| `DYN_HTTP_PORT` | `8000` | Frontend HTTP port |
+| `DYN_SYSTEM_PORT` | `8081` | Worker system/admin port |
+| `DYN_LORA_PATH` | `/tmp/dynamo_loras_multimodal` | Local LoRA adapter cache |
+| `DYN_MAX_LORA_RANK` | `64` | Maximum LoRA rank supported |
+| `CUDA_VISIBLE_DEVICES` | `0` | GPU device index |
+
+### Base models supported by this script
+
+| Model | Notes |
+|---|---|
+| `Qwen/Qwen3-VL-2B-Instruct` | Default. Good for single-GPU testing. |
+| `Qwen/Qwen2.5-VL-7B-Instruct` | Higher quality, needs 24 GB+ VRAM. |
+| `llava-hf/llava-1.5-7b-hf` | LLaVA architecture, 4096 max context. |
+
+## LoRA Management API
+
+All management endpoints are served on the system port (default 8081).
+
+### Load a LoRA
+
+```
+POST /v1/loras
+```
+
+```json
+{
+  "lora_name": "my-adapter",
+  "source": {
+    "uri": "file:///path/to/adapter"
+  }
+}
+```
+
+Supported URI schemes:
+- `file://` — local filesystem path
+- `s3://` — S3-compatible storage (requires `AWS_ENDPOINT`, `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`)
+
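A pre-flight check for the two schemes listed above might look like the sketch below. `check_lora_source` is a hypothetical helper, not something the commit provides, and it assumes local paths are absolute and that an empty variable counts as missing.

```shell
# Hypothetical pre-flight check: confirm a URI uses one of the schemes listed
# above and, for s3://, that the required environment variables are set.
check_lora_source() {
    case $1 in
        file:///*)
            return 0 ;;
        s3://?*)
            if [ -z "${AWS_ENDPOINT:-}" ] ||
               [ -z "${AWS_ACCESS_KEY_ID:-}" ] ||
               [ -z "${AWS_SECRET_ACCESS_KEY:-}" ]; then
                echo "s3:// sources need AWS_ENDPOINT, AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY" >&2
                return 1
            fi
            return 0 ;;
        *)
            echo "unsupported URI scheme: $1" >&2
            return 1 ;;
    esac
}

# Usage:
#   check_lora_source "file:///tmp/my-vlm-lora" && echo "source looks usable"
```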
+### List loaded LoRAs
+
+```
+GET /v1/loras
+```
+
+### Unload a LoRA
+
+```
+DELETE /v1/loras/{lora_name}
+```
+
+### List all models (base + LoRAs)
+
+```
+GET /v1/models        (on the frontend port, default 8000)
+```
+
+## Running the validation script
+
+A validation script is provided to test the LoRA endpoints against a running server:
+
+```bash
+# Start the server in one terminal
+./lora_agg.sh
+
+# In another terminal, download a LoRA adapter
+hf download Chhagan005/Chhagan-DocVL-Qwen3 --local-dir /tmp/my-vlm-lora
+
+# Run the full test suite (with end-to-end LoRA load/infer/unload)
+./validate_lora_agg.sh --lora-path /tmp/my-vlm-lora
+
+# Or run only the error-handling and base-model tests (no adapter needed)
+./validate_lora_agg.sh
+```
+
+The validation script covers:
+- Frontend health and base model discovery
+- LoRA load/unload error handling (missing fields, non-existent adapter)
+- End-to-end LoRA lifecycle: load, verify in `/v1/models`, infer, unload (only when `--lora-path` is provided)
+- Base model multimodal inference
+
+## Troubleshooting
+
+### Frontend fails to start
+- Check if port 8000 is already in use: `lsof -i :8000`
+- Set a different port: `DYN_HTTP_PORT=8001 ./lora_agg.sh`
+
+### OOM during inference
+- Reduce `--max-model-len` via extra args: `./lora_agg.sh -- --max-model-len 4096`
+- Cap image resolution: `./lora_agg.sh -- --mm-processor-kwargs '{"max_pixels": 1003520}'`
+- Lower GPU memory utilization: `./lora_agg.sh -- --gpu-memory-utilization 0.80`
+
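To see why capping `max_pixels` helps, here is a rough worked estimate. It assumes one visual token per 28x28 pixel block, which matches Qwen2.5-VL-style processors (14-pixel patches merged 2x2). Other models, possibly including the default, use a different block size, so treat the second argument as something to adjust. `estimate_image_tokens` is a hypothetical helper, and the 2560x1707 image size is chosen only for illustration.

```shell
# Rough estimate only: image tokens = pixel count / pixels per visual token.
# The 784 (28 * 28) default is an assumption; pass another value as needed.
estimate_image_tokens() {
    local pixels=$1 pixels_per_token=${2:-784}
    echo $(( pixels / pixels_per_token ))
}

estimate_image_tokens 1003520              # prints 1280
estimate_image_tokens $(( 2560 * 1707 ))   # prints 5573
```

Under these assumptions, the capped image leaves most of an 8192-token context for text, while the uncapped one consumes over two thirds of it.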
+### LoRA fails to load
+- Verify the adapter path exists and contains `adapter_config.json` and `adapter_model.safetensors`
+- Ensure the adapter is compatible with the base model architecture
+- Check that `--max-lora-rank` (set through `DYN_MAX_LORA_RANK`, default 64) is at least the adapter's rank
+- Review worker logs for detailed error messages
+
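The file and rank checks above can be scripted before calling the load endpoint. This is a sketch under stated assumptions: `check_adapter_dir` is a hypothetical helper, the adapter is assumed to follow the PEFT layout where `adapter_config.json` stores the rank under the key `"r"`, and the `sed` pattern is a rough extraction rather than a JSON parser.

```shell
# Hypothetical pre-flight check for an adapter directory.
# Usage: check_adapter_dir DIR [MAX_RANK]   (MAX_RANK defaults to 64)
check_adapter_dir() {
    local dir=$1 max_rank=${2:-64} f rank
    for f in adapter_config.json adapter_model.safetensors; do
        if [ ! -f "$dir/$f" ]; then
            echo "missing $dir/$f" >&2
            return 1
        fi
    done
    # Pull the first integer stored under "r" (assumed PEFT-style config).
    rank=$(sed -n 's/.*"r"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p' \
        "$dir/adapter_config.json" | head -n 1)
    if [ -n "$rank" ] && [ "$rank" -gt "$max_rank" ]; then
        echo "adapter rank $rank exceeds max LoRA rank $max_rank" >&2
        return 1
    fi
}

# Usage:
#   check_adapter_dir /tmp/my-vlm-lora "${DYN_MAX_LORA_RANK:-64}"
```

If the rank cannot be found, the helper only verifies that both files exist; the worker logs remain the authoritative source for load failures.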
+### Inference returns errors after loading LoRA
+- Verify the LoRA is loaded: `curl http://localhost:8081/v1/loras | jq .`
+- Confirm the model name in the request matches the `lora_name` exactly (case-sensitive)
+- Check that the adapter was trained for the same base model
+
+### Cache issues
+- Inspect the cache: `ls -la /tmp/dynamo_loras_multimodal/`
+- Clear the cache: `rm -rf /tmp/dynamo_loras_multimodal/*`
+
+## Cleanup
+
+```bash
+# Remove LoRA cache
+rm -rf /tmp/dynamo_loras_multimodal
+
+# Remove downloaded adapter
+rm -rf /tmp/my-vlm-lora
+```
--- a/examples/multimodal/launch/lora/lora_agg.sh
+++ b/examples/multimodal/launch/lora/lora_agg.sh
+#!/bin/bash
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Aggregated multimodal serving with LoRA adapter support
+#
+# Architecture: Single-worker PD (Prefill-Decode) with dynamic LoRA loading
+# - Frontend: Rust OpenAIPreprocessor handles image URLs (HTTP and base64 data: URLs)
+# - Worker: Standard vLLM worker with vision model + LoRA support
+#
+# Usage:
+#   ./lora_agg.sh                                             # Qwen3-VL-2B (default)
+#   ./lora_agg.sh --model llava-hf/llava-1.5-7b-hf           # LLaVA 1.5
+#   ./lora_agg.sh -- --enforce-eager                          # Pass extra args to vLLM
+
+set -euo pipefail
+trap 'echo "Cleaning up..."; kill 0' EXIT
+
+# ── Configuration ────────────────────────────────────────────────────────
+
+MODEL_NAME="${DYN_MODEL_NAME:-Qwen/Qwen3-VL-2B-Instruct}"
+HTTP_PORT="${DYN_HTTP_PORT:-8000}"
+SYSTEM_PORT="${DYN_SYSTEM_PORT:-8081}"
+LORA_PATH="${DYN_LORA_PATH:-/tmp/dynamo_loras_multimodal}"
+MAX_LORA_RANK="${DYN_MAX_LORA_RANK:-64}"
+GPU_DEVICE="${CUDA_VISIBLE_DEVICES:-0}"
+
+# ── Parse command-line arguments ─────────────────────────────────────────
+
+EXTRA_ARGS=()
+while [[ $# -gt 0 ]]; do
+    case $1 in
+        --model)
+            MODEL_NAME=$2
+            shift 2
+            ;;
+        -h|--help)
+            cat <<USAGE
+Usage: $0 [OPTIONS] [-- EXTRA_VLLM_ARGS]
+
+Options:
+  --model <model_name>   Vision-language model to serve (default: $MODEL_NAME)
+  -h, --help             Show this help message
+
+Environment variables:
+  DYN_MODEL_NAME         Base model name (default: Qwen/Qwen3-VL-2B-Instruct)
+  DYN_HTTP_PORT          Frontend HTTP port (default: 8000)
+  DYN_SYSTEM_PORT        Worker system/admin port (default: 8081)
+  DYN_LORA_PATH          Local cache directory for LoRA adapters (default: /tmp/dynamo_loras_multimodal)
+  DYN_MAX_LORA_RANK      Maximum LoRA rank supported (default: 64)
+  CUDA_VISIBLE_DEVICES   GPU device index (default: 0)
+
+Any arguments after '--' are passed through to the vLLM worker.
+
+After launch, manage LoRA adapters via the system API on port \$DYN_SYSTEM_PORT:
+  Load:   curl -X POST http://localhost:${SYSTEM_PORT}/v1/loras \\
+            -H "Content-Type: application/json" \\
+            -d '{"lora_name": "my-adapter", "source": {"uri": "file:///path/to/adapter"}}'
+  List:   curl http://localhost:${SYSTEM_PORT}/v1/loras
+  Unload: curl -X DELETE http://localhost:${SYSTEM_PORT}/v1/loras/my-adapter
+USAGE
+            exit 0
+            ;;
+        --)
+            shift
+            EXTRA_ARGS+=("$@")
+            break
+            ;;
+        *)
+            EXTRA_ARGS+=("$1")
+            shift
+            ;;
+    esac
+done
+
+# ── Banner ───────────────────────────────────────────────────────────────
+
+echo "=================================================="
+echo "Aggregated Multimodal Serving with LoRA Support"
+echo "=================================================="
+echo "Model:        $MODEL_NAME"
+echo "Frontend:     http://localhost:$HTTP_PORT"
+echo "System API:   http://localhost:$SYSTEM_PORT"
+echo "LoRA cache:   $LORA_PATH"
+echo "GPU device:   $GPU_DEVICE"
+echo "=================================================="
+
+# ── Environment setup ────────────────────────────────────────────────────
+
+# Use TCP transport for multimodal workloads (base64 images can exceed NATS 1MB limit)
+export DYN_REQUEST_PLANE=tcp
+
+# Enable dynamic LoRA loading
+export DYN_LORA_ENABLED=true
+export DYN_LORA_PATH="$LORA_PATH"
+mkdir -p "$DYN_LORA_PATH"
+
+# ── Model-specific vLLM settings ────────────────────────────────────────
+#
+# Qwen VL models use dynamic resolution: a 2560px image can produce 5000+ tokens.
+# max-model-len must exceed (text tokens + image tokens).
+# Use --mm-processor-kwargs to cap image pixels and reduce token count if OOM.
+
+MODEL_SPECIFIC_ARGS=("--gpu-memory-utilization" "0.85" "--max-model-len" "8192")
+
+case "$MODEL_NAME" in
+    Qwen/Qwen2.5-VL-7B-Instruct)
+        MODEL_SPECIFIC_ARGS=("--gpu-memory-utilization" "0.85" "--max-model-len" "8192" "--max-num-seqs" "8192")
+        ;;
+    Qwen/Qwen3-VL-2B-Instruct)
+        MODEL_SPECIFIC_ARGS=("--gpu-memory-utilization" "0.85" "--max-model-len" "8192" "--max-num-batched-tokens" "8192")
+        ;;
+    llava-hf/llava-1.5-7b-hf)
+        MODEL_SPECIFIC_ARGS=("--gpu-memory-utilization" "0.85" "--max-model-len" "4096")
+        ;;
+esac
+
+# ── Start services ──────────────────────────────────────────────────────
+
+echo ""
+echo "Starting frontend..."
+python -m dynamo.frontend &
+FRONTEND_PID=$!
+
+# Wait for frontend to become ready
+echo "Waiting for frontend on port $HTTP_PORT..."
+for i in $(seq 1 60); do
+    if curl -sf "http://localhost:$HTTP_PORT/v1/models" > /dev/null 2>&1; then
+        echo "Frontend is ready."
+        break
+    fi
+    if ! kill -0 "$FRONTEND_PID" 2>/dev/null; then
+        echo "ERROR: Frontend process exited unexpectedly."
+        exit 1
+    fi
+    sleep 1
+done
+
+echo "Starting vLLM worker..."
+
+# --enable-lora: Enable LoRA adapter support in vLLM engine
+# --max-lora-rank: Maximum LoRA rank (increase if your adapters have higher rank)
+# --connector none: No KV transfer needed for aggregated serving
+CUDA_VISIBLE_DEVICES="$GPU_DEVICE" \
+DYN_SYSTEM_PORT="$SYSTEM_PORT" \
+    python -m dynamo.vllm \
+        --enable-multimodal \
+        --model "$MODEL_NAME" \
+        --connector none \
+        --enable-lora \
+        --max-lora-rank "$MAX_LORA_RANK" \
+        "${MODEL_SPECIFIC_ARGS[@]}" \
+        ${EXTRA_ARGS[@]+"${EXTRA_ARGS[@]}"} &
+
+# Wait for all background processes
+wait
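In `lora_agg.sh`, the frontend readiness loop falls through after 60 attempts without reporting a timeout, so the worker is launched even if the frontend never answered. One way to surface that case is a small retry helper. This is a sketch only: `wait_until` and `RETRY_INTERVAL` are hypothetical names, not part of this commit.

```shell
# Hypothetical helper: run a probe command up to N times, sleeping between
# attempts, and return non-zero if it never succeeds.
wait_until() {
    local attempts=$1 i=1
    shift
    while [ "$i" -le "$attempts" ]; do
        if "$@" > /dev/null 2>&1; then
            return 0
        fi
        sleep "${RETRY_INTERVAL:-1}"
        i=$((i + 1))
    done
    return 1
}

# Usage in a launch script might look like:
#   wait_until 60 curl -sf "http://localhost:$HTTP_PORT/v1/models" \
#     || { echo "ERROR: frontend not ready after 60 attempts"; exit 1; }
```

Taking the probe as arguments keeps the helper independent of curl, so the same loop can wait on the worker's system port or any other check.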
--- a/examples/multimodal/launch/lora/validate_lora_agg.sh
+++ b/examples/multimodal/launch/lora/validate_lora_agg.sh
+#!/bin/bash
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Validation script for multimodal LoRA endpoints.
+#
+# Tests the full LoRA lifecycle (list, load, infer, unload) and error handling
+# against a running multimodal worker.
+#
+# Prerequisites:
+#   A running multimodal worker via lora_agg.sh
+#
+# Usage:
+#   ./validate_lora_agg.sh                            # defaults: frontend=8000, system=8081
+#   ./validate_lora_agg.sh --lora-path /tmp/my-vlm-lora  # with a real LoRA adapter
+
+set -euo pipefail
+
+# ── Defaults ─────────────────────────────────────────────────────────────
+
+FRONTEND_PORT="${DYN_HTTP_PORT:-8000}"
+SYSTEM_PORT="${DYN_SYSTEM_PORT:-8081}"
+IMAGE_URL="https://vllm-public-assets.s3.us-west-2.amazonaws.com/vision_model_images/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
+LORA_PATH=""
+CURL_TIMEOUT=60
+PASS=0
+FAIL=0
+SKIP=0
+TOTAL=0
+
+# ── Parse args ───────────────────────────────────────────────────────────
+
+while [[ $# -gt 0 ]]; do
+    case $1 in
+        --frontend-port) FRONTEND_PORT=$2; shift 2 ;;
+        --system-port)   SYSTEM_PORT=$2; shift 2 ;;
+        --image-url)     IMAGE_URL=$2; shift 2 ;;
+        --lora-path)     LORA_PATH=$2; shift 2 ;;
+        --timeout)       CURL_TIMEOUT=$2; shift 2 ;;
+        -h|--help)
+            cat <<USAGE
+Usage: $0 [OPTIONS]
+
+Options:
+  --frontend-port <port>  Frontend HTTP port (default: 8000)
+  --system-port <port>    Worker system port (default: 8081)
+  --image-url <url>       Image URL for multimodal test
+  --lora-path <path>      Path to a real LoRA adapter for end-to-end tests
+                          (skip load/infer tests if not provided)
+  --timeout <seconds>     Curl timeout per request (default: 60)
+  -h, --help              Show this help message
+USAGE
+            exit 0
+            ;;
+        *) echo "Unknown option: $1"; exit 1 ;;
+    esac
+done
+
+FRONTEND="http://localhost:$FRONTEND_PORT"
+SYSTEM="http://localhost:$SYSTEM_PORT"
+
+# ── Helpers ──────────────────────────────────────────────────────────────
+
+pass() { PASS=$((PASS + 1)); TOTAL=$((TOTAL + 1)); echo "  PASS: $1"; }
+fail() { FAIL=$((FAIL + 1)); TOTAL=$((TOTAL + 1)); echo "  FAIL: $1"; }
+skip() { SKIP=$((SKIP + 1)); TOTAL=$((TOTAL + 1)); echo "  SKIP: $1"; }
+
+check_json_field() {
+    local json=$1 field=$2 expected=$3 name=$4
+    local actual
+    actual=$(echo "$json" | python3 -c "import sys,json; print(json.load(sys.stdin).get('$field',''))" 2>/dev/null || echo "PARSE_ERROR")
+    if [[ "$actual" == "$expected" ]]; then
+        pass "$name"
+    else
+        fail "$name (expected '$expected', got '$actual')"
+    fi
+}
+
+# Curl wrapper with timeout
+api() {
+    curl -sf --max-time "$CURL_TIMEOUT" "$@" 2>/dev/null
+}
+
+# ── Banner ───────────────────────────────────────────────────────────────
+
+echo "=================================================="
+echo "Multimodal LoRA Endpoint Validation"
+echo "=================================================="
+echo "Frontend:  $FRONTEND"
+echo "System:    $SYSTEM"
+echo "LoRA path: ${LORA_PATH:-<not set — load/infer tests will be skipped>}"
+echo "=================================================="
+
+# ── 1. Frontend health ──────────────────────────────────────────────────
+
+echo ""
+echo "[1/9] Checking frontend health..."
+if api "$FRONTEND/v1/models" > /dev/null; then
+    pass "Frontend is reachable"
+else
+    fail "Frontend is NOT reachable at $FRONTEND"
+    echo "Ensure lora_agg.sh is running. Aborting."
+    exit 1
+fi
+
+# Discover the base model name from the running server
+BASE_MODEL=$(api "$FRONTEND/v1/models" | python3 -c "import sys,json; print(json.load(sys.stdin)['data'][0]['id'])" 2>/dev/null || echo "")
+if [[ -n "$BASE_MODEL" ]]; then
+    echo "  Detected base model: $BASE_MODEL"
+else
+    fail "Could not detect base model name"
+    exit 1
+fi
+
+# ── 2. List LoRAs (initially empty) ─────────────────────────────────────
+
+echo ""
+echo "[2/9] Testing list_loras (GET)..."
+RESP=$(api "$SYSTEM/v1/loras" || echo '{"status":"error"}')
+check_json_field "$RESP" "status" "success" "list_loras returns success"
+
+LORA_COUNT=$(echo "$RESP" | python3 -c "import sys,json; print(json.load(sys.stdin).get('count','-1'))" 2>/dev/null || echo "-1")
+echo "  Currently loaded LoRAs: $LORA_COUNT"
+
+# ── 3. Load LoRA — missing lora_name ────────────────────────────────────
+
+echo ""
+echo "[3/9] Testing load_lora error handling (missing lora_name)..."
+RESP=$(api -X POST "$SYSTEM/v1/loras" \
+    -H "Content-Type: application/json" \
+    -d '{"source": {"uri": "file:///fake/path"}}' || echo '{"status":"error"}')
+check_json_field "$RESP" "status" "error" "load_lora rejects missing lora_name"
+
+# ── 4. Load LoRA — missing source ──────────────────────────────────────
+
+echo ""
+echo "[4/9] Testing load_lora error handling (missing source)..."
+RESP=$(api -X POST "$SYSTEM/v1/loras" \
+    -H "Content-Type: application/json" \
+    -d '{"lora_name": "test-lora"}' || echo '{"status":"error"}')
+check_json_field "$RESP" "status" "error" "load_lora rejects missing source"
+
+# ── 5. Unload non-existent LoRA ─────────────────────────────────────────
+
+echo ""
+echo "[5/9] Testing unload_lora for non-existent adapter..."
+RESP=$(api -X DELETE "$SYSTEM/v1/loras/non-existent-lora" || echo '{"status":"error"}')
+check_json_field "$RESP" "status" "error" "unload_lora rejects non-existent adapter"
+
+# ── 6. Load a real LoRA adapter ─────────────────────────────────────────
+
+echo ""
+echo "[6/9] Loading a real LoRA adapter..."
+if [[ -n "$LORA_PATH" && -d "$LORA_PATH" ]]; then
+    RESP=$(api -X POST "$SYSTEM/v1/loras" \
+        -H "Content-Type: application/json" \
+        -d "{\"lora_name\": \"test-vlm-lora\", \"source\": {\"uri\": \"file://$LORA_PATH\"}}" || echo '{"status":"error"}')
+    check_json_field "$RESP" "status" "success" "load_lora with real adapter"
+
+    # Wait for LoRA to propagate to the frontend (discovery takes ~1-2s)
+    LORA_VISIBLE=false
+    for _wait in $(seq 1 10); do
+        MODELS=$(api "$FRONTEND/v1/models" || echo '{}')
+        if echo "$MODELS" | python3 -c "import sys,json; ids=[m['id'] for m in json.load(sys.stdin).get('data',[])]; assert 'test-vlm-lora' in ids" 2>/dev/null; then
+            LORA_VISIBLE=true
+            break
+        fi
+        sleep 1
+    done
+
+    if [[ "$LORA_VISIBLE" == "true" ]]; then
+        pass "LoRA appears in /v1/models"
+    else
+        fail "LoRA does NOT appear in /v1/models after 10s"
+    fi
+else
+    skip "No usable --lora-path (unset or not a directory), skipping real LoRA load"
+fi
+
+# ── 7. Inference with LoRA adapter ──────────────────────────────────────
+
+echo ""
+echo "[7/9] Testing inference with LoRA adapter..."
+if [[ -n "$LORA_PATH" && -d "$LORA_PATH" ]]; then
+    RESP=$(api -X POST "$FRONTEND/v1/chat/completions" \
+        --max-time 120 \
+        -H "Content-Type: application/json" \
+        -d "{
+          \"model\": \"test-vlm-lora\",
+          \"messages\": [{\"role\": \"user\", \"content\": [
+            {\"type\": \"text\", \"text\": \"Describe this image briefly.\"},
+            {\"type\": \"image_url\", \"image_url\": {\"url\": \"$IMAGE_URL\"}}
+          ]}],
+          \"max_tokens\": 50,
+          \"temperature\": 0.0
+        }" || echo '{"error":"request_failed"}')
+
+    if echo "$RESP" | python3 -c "import sys,json; d=json.load(sys.stdin); assert 'choices' in d and len(d['choices'])>0" 2>/dev/null; then
+        pass "LoRA multimodal inference returned choices"
+    else
+        fail "LoRA multimodal inference failed: $RESP"
+    fi
+else
+    skip "No usable --lora-path (unset or not a directory), skipping LoRA inference"
+fi
+
+# ── 8. Inference with base model ────────────────────────────────────────
+
+echo ""
+echo "[8/9] Testing base model multimodal inference..."
+RESP=$(api -X POST "$FRONTEND/v1/chat/completions" \
+    --max-time 120 \
+    -H "Content-Type: application/json" \
+    -d "{
+      \"model\": \"$BASE_MODEL\",
+      \"messages\": [{\"role\": \"user\", \"content\": [
+        {\"type\": \"text\", \"text\": \"Describe this image briefly.\"},
+        {\"type\": \"image_url\", \"image_url\": {\"url\": \"$IMAGE_URL\"}}
+      ]}],
+      \"max_tokens\": 50,
+      \"temperature\": 0.0
+    }" || echo '{"error":"request_failed"}')
+
+if echo "$RESP" | python3 -c "import sys,json; d=json.load(sys.stdin); assert 'choices' in d and len(d['choices'])>0" 2>/dev/null; then
+    pass "Base model multimodal inference returned choices"
+else
+    fail "Base model multimodal inference failed: $RESP"
+fi
+
+# ── 9. Unload LoRA adapter ──────────────────────────────────────────────
+
+echo ""
+echo "[9/9] Unloading LoRA adapter..."
+if [[ -n "$LORA_PATH" && -d "$LORA_PATH" ]]; then
+    RESP=$(api -X DELETE "$SYSTEM/v1/loras/test-vlm-lora" || echo '{"status":"error"}')
+    check_json_field "$RESP" "status" "success" "unload_lora succeeds"
+
+    # Verify it's gone from models list
+    MODELS=$(api "$FRONTEND/v1/models" || echo '{}')
+    if echo "$MODELS" | python3 -c "import sys,json; ids=[m['id'] for m in json.load(sys.stdin).get('data',[])]; assert 'test-vlm-lora' not in ids" 2>/dev/null; then
+        pass "LoRA removed from /v1/models"
+    else
+        fail "LoRA still present in /v1/models after unload"
+    fi
+else
+    skip "No usable --lora-path (unset or not a directory), skipping LoRA unload"
+fi
+
+# ── Summary ──────────────────────────────────────────────────────────────
+
+echo ""
+echo "=================================================="
+echo "Results: $PASS passed, $FAIL failed, $SKIP skipped (out of $TOTAL)"
+echo "=================================================="
+
+if [[ $FAIL -gt 0 ]]; then
+    exit 1
+fi
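One caveat about the `api` wrapper in `validate_lora_agg.sh`: with `curl -f`, an HTTP error status makes curl exit non-zero without printing the body, so the error-handling checks compare against the fallback `{"status":"error"}` string rather than the server's own error payload. A minimal sketch of keeping both the body and the status code follows. `response_status` and `response_body` are hypothetical helpers, and they assume the response was captured with the status appended as a final line.

```shell
# Hypothetical helpers: split a response captured with
#   curl -s --max-time 60 -w '\n%{http_code}' URL
# into the HTTP status (last line) and the body (everything before it).
response_status() { printf '%s\n' "$1" | tail -n 1; }
response_body()   { printf '%s\n' "$1" | sed '$d'; }

# Usage:
#   resp=$(curl -s --max-time 60 -w '\n%{http_code}' -X DELETE \
#       "http://localhost:8081/v1/loras/non-existent-lora")
#   echo "status: $(response_status "$resp")"
#   echo "body:   $(response_body "$resp")"
```

With the status available, a test can assert on the actual error code and payload instead of treating any failure as a pass.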