# Multimodal Benchmark Sweep

YAML-driven benchmark orchestrator that launches serving backends, runs
[aiperf](https://github.com/triton-inference-server/perf_analyzer) concurrency
sweeps, and optionally generates comparison plots.

## Quick Start

```bash
# from the repo root
python -m benchmarks.multimodal.sweep \
  --config benchmarks/multimodal/sweep/experiments/embedding_cache/vllm_serve.yaml
```

## How It Works

1. Parse the YAML experiment config.
2. For each **input file** × each **benchmark config**:
   - Launch the serving backend via the workflow script.
   - Run `aiperf profile` at every concurrency level.
   - Stop the server. By default the server is restarted between concurrency
     levels to avoid warm-cache bias; this is controlled by
     `restart_server_every_benchmark`.
3. Generate comparison plots across configs for each input file.
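
The steps above can be sketched as a side-effect-free planner that lists the actions a sweep would take. This is an illustrative sketch only, not the orchestrator's code: `plan_sweep` and the action tuples are hypothetical names, and the placement of the plot step after each input file's configs is an assumption based on step 3.

```python
def plan_sweep(input_files, configs, concurrencies,
               restart_server_every_benchmark=True, skip_plots=False):
    """Return the ordered actions for a sweep without launching anything."""
    actions = []
    for input_file in input_files:
        for config in configs:
            label = config["label"]
            server_up = False
            for concurrency in concurrencies:
                if not server_up:
                    actions.append(("launch", input_file, label))
                    server_up = True
                actions.append(("profile", input_file, label, concurrency))
                if restart_server_every_benchmark:
                    # Fresh server per concurrency level avoids warm-cache bias.
                    actions.append(("stop", input_file, label))
                    server_up = False
            if server_up:
                actions.append(("stop", input_file, label))
        if not skip_plots:
            # Assumed: one comparison plot step per input file, across configs.
            actions.append(("plot", input_file))
    return actions


if __name__ == "__main__":
    for action in plan_sweep(["a.jsonl"], [{"label": "cache-off"}], [16, 32]):
        print(action)
```

With restarts enabled, each concurrency level gets its own launch/stop pair; with restarts disabled, one server instance serves every level of a config.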

## YAML Config Reference

```yaml
model: Qwen/Qwen3-VL-30B-A3B-Instruct-FP8
concurrencies: [16, 32, 64, 128, 256]
osl: 150                    # output sequence length
conversation_num: 10        # sessions per sweep value (optional; derived from
                            # input JSONL's unique session_id count if unset;
                            # flat JSONLs count each row as a 1-turn conversation)
warmup_count: 5
port: 8000
timeout: 900                # seconds to wait for server readiness
output_dir: benchmarks/multimodal/sweep/results/vllm_serve

# Optional env vars injected into the server process
env:
  ENABLE_ENCODER_CACHE: "0"

# JSONL files produced by benchmarks/multimodal/jsonl/
input_files:
  - benchmarks/multimodal/jsonl/1000req_1img_200pool_400word_http.jsonl
  - benchmarks/multimodal/jsonl/1000req_4img_200pool_400word_http.jsonl

# Each config launches the workflow with its own extra_args
configs:
  - label: cache-off
    workflow: benchmarks/multimodal/sweep/workflows/vllm_serve.sh
    extra_args: [--no-enable-prefix-caching, --multimodal-embedding-cache-capacity-gb, "0"]

  - label: cache-on
    workflow: benchmarks/multimodal/sweep/workflows/vllm_serve.sh
    extra_args: [--no-enable-prefix-caching, --multimodal-embedding-cache-capacity-gb, "10"]
```
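
The `conversation_num` comment describes a fallback when the field is unset. A minimal sketch of that rule follows, assuming each JSONL row is a JSON object and that a row without a `session_id` key is what "flat" means. `derive_conversation_num` is a hypothetical helper, and its handling of files that mix both row kinds is an assumption.

```python
import json


def derive_conversation_num(jsonl_lines):
    """Count conversations: unique session_id values, plus one per flat row."""
    session_ids = set()
    flat_rows = 0
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        row = json.loads(line)
        if "session_id" in row:
            session_ids.add(row["session_id"])
        else:
            # Flat JSONL: each row counts as a 1-turn conversation.
            flat_rows += 1
    return len(session_ids) + flat_rows


if __name__ == "__main__":
    multi_turn = [
        '{"session_id": "user_0", "text": "hi"}',
        '{"session_id": "user_0", "text": "again"}',
        '{"session_id": "user_1", "text": "hello"}',
    ]
    print(derive_conversation_num(multi_turn))
```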

## CLI Overrides

Any top-level YAML field can be overridden from the command line:

```bash
python -m benchmarks.multimodal.sweep \
  --config experiments/embedding_cache/vllm_serve.yaml \
  --concurrencies 1,2,4 \
  --osl 200 \
  --conversation-num 10 \
  --skip-plots
```
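
One way such overrides could be layered on top of the parsed YAML is sketched below with `argparse`. Only the flags shown in the command above are modelled. `parse_overrides` and `apply_overrides` are hypothetical names, and treating `--skip-plots` as a CLI-only switch is an assumption.

```python
import argparse


def int_list(text):
    """Parse '1,2,4' into [1, 2, 4]."""
    return [int(part) for part in text.split(",")]


def parse_overrides(argv):
    parser = argparse.ArgumentParser(prog="sweep")
    parser.add_argument("--config", required=True)
    parser.add_argument("--concurrencies", type=int_list, default=None)
    parser.add_argument("--osl", type=int, default=None)
    parser.add_argument("--conversation-num", type=int, default=None)
    parser.add_argument("--skip-plots", action="store_true")
    return parser.parse_args(argv)


def apply_overrides(config, args):
    """Return a copy of config where any flag given on the CLI wins."""
    merged = dict(config)
    for key in ("concurrencies", "osl", "conversation_num"):
        value = getattr(args, key)
        if value is not None:
            merged[key] = value
    return merged


if __name__ == "__main__":
    args = parse_overrides(["--config", "x.yaml", "--concurrencies", "1,2,4"])
    print(apply_overrides({"concurrencies": [16, 32], "osl": 150}, args))
```

Defaulting every override to `None` lets the merge tell "flag not given" apart from a legitimate value, so YAML values survive unless explicitly replaced.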

## Warmup Semantics

`warmup_count: N` is a **request (turn) budget**, not a session budget. For a
10×10 JSONL (10 sessions of 10 turns each) with `warmup_count: 2`, warmup
issues 2 requests in total. Both go to `user_0` (turns 0 and 1), because
aiperf's continuation-turn priority keeps feeding the in-flight session until
the warmup budget runs out. Warmup does NOT consume 2 full sessions
(20 requests). Profiling then starts at `user_1`, runs `user_1..user_9` to
completion, and wraps to a fresh `user_0` instance for the 10th session. Keep
`warmup_count` small (≤ turns-per-session) so warmup stays within a single
session's prefix.
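
The budget rule can be illustrated with a toy model. This is not aiperf's scheduler: it assumes a single in-flight session during warmup and assumes that a budget larger than one session spills into the next session, which the text above does not confirm. `warmup_requests` is a hypothetical helper.

```python
def warmup_requests(num_sessions, turns_per_session, warmup_count):
    """List the (session, turn) pairs a request-budgeted warmup would hit."""
    issued = []
    session, turn = 0, 0
    while len(issued) < warmup_count and session < num_sessions:
        issued.append((f"user_{session}", turn))
        turn += 1
        if turn == turns_per_session:
            # Assumed: once a session is exhausted, move on to the next one.
            session, turn = session + 1, 0
    return issued


if __name__ == "__main__":
    # 10 sessions x 10 turns, warmup_count: 2
    print(warmup_requests(10, 10, 2))
```

Under this model, `warmup_count: 2` touches only `user_0`, while any value above the turns-per-session count starts eating into a second session.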

## Output Directory Structure

Given the config above (two input files and two configs, `cache-off` and
`cache-on`), with concurrencies shortened to `[16, 32]` for brevity, the
output tree looks like:

```
<output_dir>/
├── 1000req_1img_200pool_400word_http/      # ← derived from input filename
│   ├── cache-off/                          # ← config label
│   │   ├── c16/                            # ← concurrency level
│   │   │   ├── profile_export.jsonl
│   │   │   ├── profile_export_aiperf.json
│   │   │   ├── profile_export_aiperf.csv
│   │   │   ├── gpu_telemetry_export.jsonl
│   │   │   ├── inputs.json
│   │   │   └── logs/
│   │   │       └── aiperf.log
│   │   └── c32/
│   │       └── ...
│   ├── cache-on/
│   │   ├── c16/
│   │   │   └── ...
│   │   └── c32/
│   │       └── ...
│   └── plots/                              # ← comparison plots across configs
│       └── ...
└── 1000req_4img_200pool_400word_http/
    ├── cache-off/
    │   └── ...
    ├── cache-on/
    │   └── ...
    └── plots/
        └── ...
```
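
The layout above maps onto a simple path rule: input filename stem, then config label, then `c<concurrency>`. A minimal sketch with `pathlib` follows; `result_dir` and `plots_dir` are hypothetical helpers for locating results, not functions from the sweep package.

```python
from pathlib import Path


def result_dir(output_dir, input_file, label, concurrency):
    """Directory holding the aiperf exports for one benchmark run."""
    return Path(output_dir) / Path(input_file).stem / label / f"c{concurrency}"


def plots_dir(output_dir, input_file):
    """Directory holding comparison plots for one input file."""
    return Path(output_dir) / Path(input_file).stem / "plots"


if __name__ == "__main__":
    run = result_dir(
        "benchmarks/multimodal/sweep/results/vllm_serve",
        "benchmarks/multimodal/jsonl/1000req_1img_200pool_400word_http.jsonl",
        "cache-off",
        16,
    )
    print(run / "profile_export_aiperf.json")
```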

## Existing Experiments

| Experiment | Config | Backend |
|---|---|---|
| Embedding cache (vLLM serve) | `experiments/embedding_cache/vllm_serve.yaml` | Single-node vLLM |
| Embedding cache (vLLM E+PD) | `experiments/embedding_cache/vllm_e_pd.yaml` | Disaggregated vLLM E+PD |
| Embedding cache (TRT-LLM E+PD) | `experiments/embedding_cache/trtllm_e_pd.yaml` | Disaggregated TRT-LLM E+PD |