@@ -846,8 +864,8 @@ Generate synthetic image inputs alongside random text prompts to stress-test vis
Notes:
- Works only with online benchmark via the OpenAI backend (`--backend openai-chat`) and endpoint `/v1/chat/completions`.
- Video sampling is not yet implemented.
- For online benchmarks, use `--backend openai-chat` with endpoint `/v1/chat/completions`.
- For offline benchmarks, use `--backend vllm-chat` (see [Offline Throughput Benchmark](#-offline-throughput-benchmark) for an example).
Start the server (example):
...
...
@@ -913,6 +931,74 @@ This should be seen as an edge case, and if this behavior can be avoided by sett
</details>
### 🔬 Multimodal Processor Benchmark
Benchmark per-stage latency of the multimodal (MM) input processor pipeline, including the encoder forward pass. This is useful for profiling preprocessing bottlenecks in vision-language models.
<details class="admonition abstract" markdown="1">
<summary>Show more</summary>
The benchmark measures the following stages for each request:
| Stage | Description |
|-------|-------------|
| `get_mm_hashes_secs` | Time spent hashing multimodal inputs |
| `get_cache_missing_items_secs` | Time spent looking up the processor cache |
| `apply_hf_processor_secs` | Time spent in the Hugging Face processor |
| `merge_mm_kwargs_secs` | Time spent merging multimodal kwargs |
| `apply_prompt_updates_secs` | Time spent updating prompt tokens |
| `preprocessor_total_secs` | Total preprocessing time |
| `encoder_forward_secs` | Time spent in the encoder model forward pass |
| `num_encoder_calls` | Number of encoder invocations per request |
The benchmark also reports end-to-end latency (TTFT + decode time) per
request. Use `--metric-percentiles` to select which percentiles to report
(default: p99) and `--output-json` to save results.
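
To illustrate how a percentile summary relates to per-request stage timings, here is a minimal sketch. The record layout, the sample values, and the helper names `percentile` and `summarize` are assumptions made for illustration; only the stage names come from the table above. It uses a simple nearest-rank percentile, which may differ from how the benchmark itself computes percentiles.

```python
import math

# Hypothetical per-request timings in seconds. The values are invented for
# illustration and are not real benchmark results or the real output schema.
records = [
    {"preprocessor_total_secs": 0.012, "encoder_forward_secs": 0.030},
    {"preprocessor_total_secs": 0.015, "encoder_forward_secs": 0.028},
    {"preprocessor_total_secs": 0.011, "encoder_forward_secs": 0.035},
    {"preprocessor_total_secs": 0.020, "encoder_forward_secs": 0.031},
]


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(records, stage, pcts=(50, 99)):
    """Return the mean and the requested percentiles for one stage."""
    values = [r[stage] for r in records]
    summary = {"mean": sum(values) / len(values)}
    for pct in pcts:
        summary[f"p{pct}"] = percentile(values, pct)
    return summary


for stage in ("preprocessor_total_secs", "encoder_forward_secs"):
    print(stage, summarize(records, stage))
```

With only a handful of requests, a high percentile such as p99 collapses to the slowest sample, so a larger number of prompts gives a more meaningful tail estimate.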
#### Basic Example with Synthetic Data (random-mm)