@@ -300,6 +300,12 @@ If MM-aware routing and prefix reuse are working, after sending the same request
| `--downstream-component` | `backend` | Downstream component name (use `backend` for current `dynamo.vllm` defaults) |
| `--downstream-endpoint` | `generate` | Downstream vLLM endpoint name |
### Environment Variables
| Variable | Default | Description |
|----------|---------|-------------|
| `DYN_MM_IMAGE_CACHE_SIZE` | `8` | Maximum number of images to keep in the MM router's in-memory image cache. Increase this if your workload has high image reuse across a larger unique image set. |
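To make the effect of `DYN_MM_IMAGE_CACHE_SIZE` concrete, here is a minimal sketch of a size-bounded image cache. It assumes least-recently-used eviction, which the table above does not state. `ImageLRUCache` and `cache_size_from_env` are hypothetical names for illustration and are not the router's actual classes.

```python
import os
from collections import OrderedDict
from typing import Optional


class ImageLRUCache:
    """Toy size-bounded image cache keyed by URL (hypothetical, LRU assumed)."""

    def __init__(self, max_size: int) -> None:
        if max_size < 1:
            raise ValueError("max_size must be at least 1")
        self.max_size = max_size
        self._entries = OrderedDict()  # url -> image bytes, oldest first

    def get(self, url: str) -> Optional[bytes]:
        if url not in self._entries:
            return None
        # A hit marks the entry as most recently used.
        self._entries.move_to_end(url)
        return self._entries[url]

    def put(self, url: str, image: bytes) -> None:
        self._entries[url] = image
        self._entries.move_to_end(url)
        # Evict least recently used entries until the cache fits again.
        while len(self._entries) > self.max_size:
            self._entries.popitem(last=False)

    def __len__(self) -> int:
        return len(self._entries)


def cache_size_from_env(default: int = 8) -> int:
    """Read the cache size from the environment, falling back to the default."""
    return int(os.environ.get("DYN_MM_IMAGE_CACHE_SIZE", default))
```

Under this assumption, a workload that cycles through more unique images than the configured size would evict entries before they are reused, which is the case where raising the value helps.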
## How It Works
### MM Hash Computation
...
...
@@ -354,6 +360,39 @@ For repeated identical images, multiple entries may appear in the same block whe
- `requests` for `http(s)` image URLs
- vLLM-capable backend worker via `python -m dynamo.vllm`
## Performance
### 8× B200, Qwen3-VL-30B-A3B-FP8, HTTP Image Transport
On an 8-GPU B200 node serving 8 replicas of `Qwen/Qwen3-VL-30B-A3B-Instruct-FP8` with concurrent HTTP image requests and moderate (~50%) image reuse across workers, MM-aware routing improves throughput and average and median latency over round-robin (the default router mode). The benchmark uses a fixed text prompt and `--osl 1` so that the workload is dominated by image tokens, which isolates the effect of the MM router's image-aware KV cache routing on prefill:
- **~1.6× higher throughput** — repeated image requests are steered to the worker that already holds the relevant KV cache blocks, avoiding redundant image downloads and prefill recomputation
- **~1.6× lower average latency** and **~3× lower median (p50) latency** — cache-warm requests complete substantially faster
- **p99 trade-off** — tail latency can increase under skewed workloads due to load imbalance when hot KV blocks are concentrated on a small number of workers
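The trade-off in the bullets above can be illustrated with a toy simulation. This is a deliberately simplified stand-in, not the router's actual scoring logic: it assumes each worker has an unbounded cache, and that the "affinity" policy sends a previously seen image to the first worker that served it. The `simulate` function and the policy names are hypothetical.

```python
from collections import Counter
from itertools import cycle


def simulate(image_ids, num_workers, policy):
    """Return (cache_hits, per-worker request counts) for a routing policy.

    policy is "round_robin" or "affinity" (both names are illustrative).
    """
    cached = [set() for _ in range(num_workers)]  # images each worker has seen
    owner = {}                                    # image -> first worker to serve it
    load = Counter()
    round_robin = cycle(range(num_workers))
    hits = 0
    for image in image_ids:
        if policy == "affinity" and image in owner:
            worker = owner[image]       # steer to the worker that is already warm
        else:
            worker = next(round_robin)  # new image, or plain round-robin
        if image in cached[worker]:
            hits += 1
        cached[worker].add(image)
        owner.setdefault(image, worker)
        load[worker] += 1
    return hits, [load[w] for w in range(num_workers)]


stream = ["cat", "dog", "cat", "cat", "bird", "cat", "dog", "cat"]
for policy in ("round_robin", "affinity"):
    print(policy, simulate(stream, num_workers=4, policy=policy))
```

On this small stream, round-robin gets 1 cache hit with an even load of `[2, 2, 2, 2]`, while the affinity policy gets 5 hits with a load of `[5, 2, 1, 0]`. More hits means less repeated prefill, but the hot image concentrates requests on one worker, which is the kind of imbalance that can raise tail latency.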
To reproduce, prepare an `aiperf`-compatible JSONL dataset with ~50% image reuse. For simplicity, each line contains a text prompt and one image URL, with some URLs repeated across requests. The dataset used in the benchmarks above was generated using the [multimodal JSONL generator](../../../../benchmarks/multimodal/jsonl/README.md).
Example dataset format:
```jsonl
{"text": "Please describe this image.", "images": ["https://example.com/cat.jpg"]}
{"text": "Please describe this image.", "images": ["https://example.com/dog.jpg"]}
{"text": "Please describe this image.", "images": ["https://example.com/bird.jpg"]}
{"text": "Please describe this image.", "images": ["https://example.com/cat.jpg"]}
```
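For quick experiments, a file in the format shown above could be produced with a short script like the sketch below. This is not the multimodal JSONL generator referenced earlier. The function names, the `base_url` placeholder, and the per-request reuse probability are assumptions for illustration, and the generated URLs would need to point at images that actually exist.

```python
import json
import random


def build_dataset(num_requests, reuse_ratio=0.5, seed=0,
                  base_url="https://example.com/images"):
    """Build rows where roughly `reuse_ratio` of requests repeat an earlier URL."""
    rng = random.Random(seed)  # seeded so the dataset is reproducible
    seen = []                  # URLs issued so far
    rows = []
    for _ in range(num_requests):
        if seen and rng.random() < reuse_ratio:
            url = rng.choice(seen)                   # reuse an earlier image
        else:
            url = f"{base_url}/{len(seen):06d}.jpg"  # placeholder new image URL
            seen.append(url)
        rows.append({"text": "Please describe this image.", "images": [url]})
    return rows


def write_jsonl(rows, path):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")


# Example usage (writes a file, so it is left commented out):
# write_jsonl(build_dataset(1000), "example.jsonl")
```

Fixing the seed keeps the reuse pattern identical between a round-robin run and an MM-aware run, so the two results stay comparable.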
Then benchmark against a running stack:
```bash
aiperf profile \
  --model Qwen/Qwen3-VL-30B-A3B-Instruct-FP8 \
  --input-file example.jsonl \
  --custom-dataset-type single_turn \
  --osl 1 \
  --concurrency 5 \
  --artifact-dir ./logs/mm_router_run
```
## Known Limitations
- `mm_processor.py` currently supports only Qwen-style multimodal processors (`Qwen2-VL`, `Qwen2.5-VL`, `Qwen3-VL`) for per-image visual token counting.