# Online LoRA Inference (Diffusion)

This example shows how to use **per-request LoRA** with vLLM-Omni diffusion models via the OpenAI-compatible Chat Completions API.

> Note: The LoRA adapter path must be readable on the **server** machine (usually a local path or a mounted directory).
>
> Note: This example uses `/v1/chat/completions`. LoRA payloads for other OpenAI endpoints are not implemented here.

## Start Server

```bash
# Pick a diffusion model (examples)
# export MODEL=stabilityai/stable-diffusion-3.5-medium
# export MODEL=Qwen/Qwen-Image

bash run_server.sh
```

## Call API (curl)

```bash
# Required: local LoRA folder on the server
export LORA_PATH=/path/to/lora_adapter

# Optional
export SERVER=http://localhost:8091
export PROMPT="A piece of cheesecake"
export LORA_NAME=my_lora
export LORA_SCALE=1.0
# Optional: if omitted, the server derives a stable id from LORA_PATH.
# export LORA_INT_ID=123

bash run_curl_lora_inference.sh
```
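
The comment on `LORA_INT_ID` says the server derives a stable id from `LORA_PATH` when none is given. The sketch below only illustrates what "stable" means: the same path always maps to the same integer, across processes and restarts. The helper name `stable_lora_id` and the hashing scheme are assumptions made for illustration. The derivation the server actually uses is not shown in this README and may differ.

```python
import hashlib
import os


def stable_lora_id(lora_path: str) -> int:
    """Hypothetical helper: map a path to a deterministic positive integer.

    This is NOT the server's implementation, only an illustration.
    """
    # Normalize so that equivalent spellings of a path give the same id.
    normalized = os.path.abspath(lora_path)
    # Use a cryptographic hash rather than built-in hash(), which is
    # randomized per process for strings and is therefore not stable.
    digest = hashlib.sha256(normalized.encode("utf-8")).digest()
    # Fold the first 4 bytes into the range 1 .. 2**31 - 1.
    return int.from_bytes(digest[:4], "big") % (2**31 - 1) + 1


if __name__ == "__main__":
    print(stable_lora_id("/path/to/lora_adapter"))
```

If you prefer to pin the id yourself rather than rely on the server, a value produced this way could be exported as `LORA_INT_ID` before running the script.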

## Call API (Python)

```bash
python openai_chat_client.py \
  --prompt "A piece of cheesecake" \
  --lora-path /path/to/lora_adapter \
  --lora-name my_lora \
  --lora-scale 1.0 \
  --output output.png
```

## LoRA Format

LoRA adapters should be in PEFT format, for example:

```
lora_adapter/
├── adapter_config.json
└── adapter_model.safetensors
```
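
Before sending a request, it can help to confirm that the adapter folder matches the layout above. The following is a minimal sketch, not part of this example's scripts: the helper name `missing_adapter_files` is hypothetical, and it checks only that the two files exist and that the config parses as JSON. Run it on the server machine, since that is where the path must be readable.

```python
import json
import sys
from pathlib import Path
import tempfile

# File names taken from the layout shown above.
REQUIRED_FILES = ("adapter_config.json", "adapter_model.safetensors")


def missing_adapter_files(lora_dir):
    """Return the required file names that are absent from lora_dir."""
    root = Path(lora_dir)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]


def config_is_valid_json(lora_dir):
    """Return True if adapter_config.json exists and parses as JSON."""
    try:
        json.loads((Path(lora_dir) / "adapter_config.json").read_text())
        return True
    except (OSError, json.JSONDecodeError):
        return False


if __name__ == "__main__":
    # Usage: python check_lora.py /path/to/lora_adapter
    # With no argument, an empty temporary directory is checked as a demo.
    target = sys.argv[1] if len(sys.argv) > 1 else tempfile.mkdtemp()
    print("missing files:", missing_adapter_files(target))
    print("config parses:", config_is_valid_json(target))
```

This does not verify that the adapter weights are compatible with the chosen diffusion model. It only catches a wrong or incomplete path early.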