docs: Add numbers to showcase MM router perf (#7166)

Unverified Commit c619ce36 authored Mar 10, 2026 by Kris Hung Committed by GitHub Mar 10, 2026

Showing with 39 additions and 0 deletions

examples/backends/vllm/mm_router_worker/README.md +39 -0

--- a/examples/backends/vllm/mm_router_worker/README.md
+++ b/examples/backends/vllm/mm_router_worker/README.md
@@ -300,6 +300,12 @@ If MM-aware routing and prefix reuse are working, after sending the same request
 | `--downstream-component` | `backend` | Downstream component name (use `backend` for current `dynamo.vllm` defaults) |
 | `--downstream-endpoint` | `generate` | Downstream vLLM endpoint name |

+### Environment Variables
+
+| Variable | Default | Description |
+|----------|---------|-------------|
+| `DYN_MM_IMAGE_CACHE_SIZE` | `8` | Maximum number of images to keep in the MM router's in-memory image cache. Increase this if your workload has high image reuse across a larger unique image set. |
+
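As a rough illustration of what a bounded image cache sized by `DYN_MM_IMAGE_CACHE_SIZE` might look like, the sketch below reads the variable and caps an in-memory cache at that many entries. The README does not state the eviction policy or how invalid values are handled, so the LRU eviction, the fallback to the default `8`, and the names `cache_size_from_env` and `LRUImageCache` are all assumptions made for illustration, not the router's actual code.

```python
import os
from collections import OrderedDict


def cache_size_from_env(default: int = 8) -> int:
    """Read DYN_MM_IMAGE_CACHE_SIZE; fall back to the default on bad input (assumed behavior)."""
    raw = os.environ.get("DYN_MM_IMAGE_CACHE_SIZE", "")
    try:
        size = int(raw)
    except ValueError:
        return default
    return size if size > 0 else default


class LRUImageCache:
    """Hypothetical bounded cache keyed by image URL, evicting the least recently used entry."""

    def __init__(self, max_size: int):
        self.max_size = max_size
        self._items = OrderedDict()

    def get(self, url):
        if url not in self._items:
            return None
        self._items.move_to_end(url)  # mark as most recently used
        return self._items[url]

    def put(self, url, image):
        self._items[url] = image
        self._items.move_to_end(url)
        while len(self._items) > self.max_size:
            self._items.popitem(last=False)  # drop the least recently used entry


if __name__ == "__main__":
    cache = LRUImageCache(cache_size_from_env())
    cache.put("https://example.com/cat.jpg", b"...image bytes...")
    print(cache.max_size, cache.get("https://example.com/cat.jpg") is not None)
```

Under this model, a workload that cycles through more unique images than the cache holds would evict entries before they are reused, which is the situation where raising the variable helps.
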
 ## How It Works

 ### MM Hash Computation
@@ -354,6 +360,39 @@ For repeated identical images, multiple entries may appear in the same block whe
 - `requests` for `http(s)` image URLs
 - vLLM-capable backend worker via `python -m dynamo.vllm`

+## Performance
+
+### 8× B200, Qwen3-VL-30B-A3B-FP8, HTTP Image Transport
+
+On an 8-GPU B200 node serving 8 replicas of `Qwen/Qwen3-VL-30B-A3B-Instruct-FP8`, with concurrent HTTP image requests and moderate (~50%) image reuse across workers, MM-aware routing delivers higher throughput and lower average and median latency than round-robin (the default router mode). The benchmark uses a fixed text prompt and `--osl 1` so that image tokens dominate the workload, which isolates the effect of the MM router's image-aware KV cache routing on prefill:
+
+- **~1.6× higher throughput**: repeated image requests are steered to the worker that already holds the relevant KV cache blocks, avoiding redundant image downloads and prefill recomputation
+- **~1.6× lower average latency** and **~3× lower median (p50) latency**: cache-warm requests complete substantially faster
+- **p99 trade-off**: under skewed workloads, tail latency can increase because hot KV blocks concentrate on a small number of workers and unbalance the load
+
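The mechanism behind the bullets above can be sketched with a toy simulation. This is not the MM router's scoring logic: it assumes unbounded per-worker caches, treats a "hit" as the chosen worker having seen the image before, and uses a hypothetical `simulate` function and a synthetic request stream. It only shows why cache-aware placement raises the hit rate compared with round-robin, and why the busiest worker can end up with more load.

```python
import random


def simulate(requests, num_workers, cache_aware):
    """Return (cache hit rate, load of the busiest worker) for a toy routing policy."""
    caches = [set() for _ in range(num_workers)]
    load = [0] * num_workers
    hits = 0
    for i, image in enumerate(requests):
        worker = i % num_workers  # round-robin choice
        if cache_aware:
            # Prefer a worker that already holds this image; break ties by lowest load.
            holders = [w for w in range(num_workers) if image in caches[w]]
            if holders:
                worker = min(holders, key=lambda w: load[w])
        if image in caches[worker]:
            hits += 1
        caches[worker].add(image)
        load[worker] += 1
    return hits / len(requests), max(load)


if __name__ == "__main__":
    rng = random.Random(0)
    # Synthetic stream: image IDs drawn with replacement, so many requests repeat an earlier image.
    stream = [rng.randrange(500) for _ in range(1000)]
    for name, aware in (("round-robin", False), ("cache-aware", True)):
        hit_rate, busiest = simulate(stream, num_workers=8, cache_aware=aware)
        print(f"{name}: hit rate {hit_rate:.2f}, busiest worker handled {busiest} requests")
```

In this model, round-robin spreads load evenly but only hits the cache when a repeat happens to land on a worker that saw the image before, while the cache-aware policy hits on every repeat at the cost of possibly uneven load.
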
+To reproduce, prepare an `aiperf`-compatible JSONL dataset with ~50% image reuse. For simplicity, each line contains a text prompt and one image URL, and some URLs repeat across requests. The dataset used in the benchmarks above was generated using the [multimodal JSONL generator](../../../../benchmarks/multimodal/jsonl/README.md).
+
+Example dataset format:
+
+```jsonl
+{"text": "Please describe this image.", "images": ["https://example.com/cat.jpg"]}
+{"text": "Please describe this image.", "images": ["https://example.com/dog.jpg"]}
+{"text": "Please describe this image.", "images": ["https://example.com/bird.jpg"]}
+{"text": "Please describe this image.", "images": ["https://example.com/cat.jpg"]}
+```
+
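For a quick experiment without the generator linked above, a dataset in this format can be produced with a few lines of Python. The sketch below is a stand-in, not the multimodal JSONL generator itself: `build_dataset` and `to_jsonl` are hypothetical helpers, the `example.com` URLs are placeholders, and the reuse model (each request repeats a previously used URL with probability `reuse_fraction`) is an assumption about what "~50% image reuse" means.

```python
import json
import random


def build_dataset(unique_urls, num_requests, reuse_fraction=0.5,
                  prompt="Please describe this image.", seed=0):
    """Build rows where roughly `reuse_fraction` of requests repeat an earlier image URL."""
    if not unique_urls:
        raise ValueError("need at least one image URL")
    rng = random.Random(seed)
    fresh = list(unique_urls)  # URLs not used yet
    seen = []                  # URLs already used at least once
    rows = []
    for _ in range(num_requests):
        # Reuse an earlier URL with the given probability, or when no fresh URLs remain.
        if seen and (not fresh or rng.random() < reuse_fraction):
            url = rng.choice(seen)
        else:
            url = fresh.pop(0)
            seen.append(url)
        rows.append({"text": prompt, "images": [url]})
    return rows


def to_jsonl(rows):
    """Serialize rows as one JSON object per line."""
    return "".join(json.dumps(row) + "\n" for row in rows)


if __name__ == "__main__":
    urls = [f"https://example.com/img_{i}.jpg" for i in range(100)]  # placeholder URLs
    with open("example.jsonl", "w", encoding="utf-8") as f:
        f.write(to_jsonl(build_dataset(urls, num_requests=100)))
```

The URLs must point to images the workers can actually fetch, so replace the placeholders before benchmarking.
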
+Then benchmark against a running stack:
+
+```bash
+aiperf profile \
+    --model Qwen/Qwen3-VL-30B-A3B-Instruct-FP8 \
+    --input-file example.jsonl \
+    --custom-dataset-type single_turn \
+    --osl 1 \
+    --concurrency 5 \
+    --artifact-dir ./logs/mm_router_run
+```
+
 ## Known Limitations

 - `mm_processor.py` currently only supports Qwen-style multimodal processors for per-image visual token counting (`Qwen2-VL`, `Qwen2.5-VL`, `Qwen3-VL` style processors).