# Benchmark KV Cache Offloading with Multi-Turn Conversations

The pip requirements for `benchmark_serving_multi_turn.py` are listed in `requirements.txt`.

First, start serving your model:

```bash
export MODEL_PATH=/models/meta-llama/Meta-Llama-3.1-8B-Instruct/

vllm serve $MODEL_PATH --served-model-name Llama
```

The variable `MODEL_PATH` should be the path to the model files (e.g., downloaded from Hugging Face).

## Synthetic Multi-Turn Conversations

Download the following text file (used to generate the synthetic conversations):

```bash
wget https://www.gutenberg.org/ebooks/1184.txt.utf-8
mv 1184.txt.utf-8 pg1184.txt
```

The filename `pg1184.txt` is referenced in `generate_multi_turn.json` (see the `"text_files"` field).

You may use other text files if you prefer; this specific file is not required.

Then run the benchmarking script:

```bash
export MODEL_PATH=/models/meta-llama/Meta-Llama-3.1-8B-Instruct/

python benchmark_serving_multi_turn.py --model $MODEL_PATH --served-model-name Llama \
--input-file generate_multi_turn.json --num-clients 2 --max-active-conversations 6
```

You can edit the file `generate_multi_turn.json` to change the conversation parameters (number of turns, etc.).

If successful, you will see output similar to the following (exact numbers will vary):

```text
----------------------------------------------------------------------------------------------------
Statistics summary:
runtime_sec = 215.810
requests_per_sec = 0.769
----------------------------------------------------------------------------------------------------
                   count     mean     std      min      25%      50%      75%      90%      99%      max
ttft_ms            166.0    78.22   67.63    45.91    59.94    62.26    64.43    69.66   353.18   567.54
tpot_ms            166.0    25.37    0.57    24.40    25.07    25.31    25.50    25.84    27.50    28.05
latency_ms         166.0  2591.07  326.90  1998.53  2341.62  2573.01  2860.10  3003.50  3268.46  3862.94
input_num_turns    166.0     7.43    4.57     1.00     3.00     7.00    11.00    13.00    17.00    17.00
input_num_tokens   166.0  2006.20  893.56   522.00  1247.75  2019.00  2718.00  3233.00  3736.45  3899.00
output_num_tokens  166.0   100.01   11.80    80.00    91.00    99.00   109.75   116.00   120.00   120.00
output_num_chunks  166.0    99.01   11.80    79.00    90.00    98.00   108.75   115.00   119.00   119.00
----------------------------------------------------------------------------------------------------
```
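
Note that `requests_per_sec` is simply the number of completed requests divided by the runtime: 166 / 215.810 ≈ 0.769.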

If you run with `--warmup-step`, the summary will also include `warmup_runtime_sec`
and `total_runtime_incl_warmup_sec` (while `runtime_sec` continues to reflect the
benchmark-only runtime so the reported throughput stays comparable).
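
For example, a sketch of a warmup run (check `python benchmark_serving_multi_turn.py --help` for the exact form of `--warmup-step`, e.g. whether it expects a value):

```bash
python benchmark_serving_multi_turn.py --model $MODEL_PATH --served-model-name Llama \
--input-file generate_multi_turn.json --num-clients 2 --max-active-conversations 6 \
--warmup-step
```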

### JSON configuration file for synthetic conversation generation

The `--input-file` flag determines the input conversations for the benchmark.<br/>
When the input is a JSON file with the field `"filetype": "generate_conversations"`, the tool will generate synthetic multi-turn conversations (questions and answers).

The file `generate_multi_turn.json` is an example file.

The file must contain the sections `prompt_input` and `prompt_output`.

The `prompt_input` section must contain `num_turns`, `prefix_num_tokens` and `num_tokens`:

* `num_turns` - Total number of turns in the conversation (both user & assistant).<br/>
The final value will always be rounded to an even number so each user turn has a reply.
* `prefix_num_tokens` - Tokens added at the start of only the **first user turn** in a conversation (unique per conversation).
* `num_tokens` - Total token length of each **user** message (one turn).

The `prompt_output` section must contain `num_tokens`:

* `num_tokens` - Total token length of each **assistant** message (one turn).
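
Putting these together, here is a minimal sketch of such a configuration file (all values are illustrative, and the distribution objects are explained in the next section; see the bundled `generate_multi_turn.json` for the authoritative format):

```json
{
    "filetype": "generate_conversations",
    "text_files": ["pg1184.txt"],
    "prompt_input": {
        "num_turns": {"distribution": "uniform", "min": 12, "max": 18},
        "prefix_num_tokens": {"distribution": "lognormal", "average": 1000, "max": 5000},
        "num_tokens": {"distribution": "uniform", "min": 120, "max": 160}
    },
    "prompt_output": {
        "num_tokens": {"distribution": "constant", "value": 100}
    }
}
```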

### Random distributions for synthetic conversation generation

When creating an input JSON file (such as `generate_multi_turn.json`),<br/>
every numeric field (such as `num_turns` or `num_tokens`) requires a distribution.<br/>
The distribution determines how to randomly sample values for the field.

The available distributions are listed below.

**Note:** The optional `max` field (for lognormal, zipf, and poisson) can be used to cap sampled values at an upper bound.<br/>
This can be used to make sure that the total number of tokens in every request does not exceed `--max-model-len`.

#### constant

```json
{
    "distribution": "constant",
    "value": 500
}
```

* `value` - the fixed integer value (always returns the same number).

#### uniform

```json
{
    "distribution": "uniform",
    "min": 12,
    "max": 18
}
```

* `min` - minimum value (inclusive).
* `max` - maximum value (inclusive), must be greater than or equal to `min`.

#### lognormal

```json
{
    "distribution": "lognormal",
    "average": 1000,
    "max": 5000
}
```

You can parameterize the lognormal distribution in one of two ways:

Using the average and optional median ratio:

* `average` - target average value of the distribution.
* `median_ratio` - the ratio of the median to the average; controls the skewness. Must be in the range (0, 1).

Using the parameters of the underlying normal distribution:

* `mean` - mean of the underlying normal distribution.
* `sigma` - standard deviation of the underlying normal distribution.
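
For example, a sketch of the second parameterization (values are illustrative):

```json
{
    "distribution": "lognormal",
    "mean": 6.5,
    "sigma": 0.8,
    "max": 5000
}
```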

#### zipf

```json
{
    "distribution": "zipf",
    "alpha": 1.2,
    "max": 100
}
```

* `alpha` - skew parameter (> 1). Larger values produce stronger skew toward smaller integers.

#### poisson

```json
{
    "distribution": "poisson",
    "alpha": 10,
    "max": 50
}
```

* `alpha` - expected value (λ). Also the variance of the distribution.

## ShareGPT Conversations

To run with the ShareGPT data, download the following ShareGPT dataset:
`https://huggingface.co/datasets/philschmid/sharegpt-raw/blob/main/sharegpt_20230401_clean_lang_split.json`

Use the `convert_sharegpt_to_openai.py` script to convert the dataset to a format supported by `benchmark_serving_multi_turn.py`:

```bash
python convert_sharegpt_to_openai.py sharegpt_20230401_clean_lang_split.json sharegpt_conv_128.json --seed=99 --max-items=128
```

The script converts the ShareGPT dataset to a dataset with the standard user/assistant roles.

The flag `--max-items=128` samples 128 conversations from the original dataset (change as needed).

Use the output JSON file `sharegpt_conv_128.json` as the `--input-file` for `benchmark_serving_multi_turn.py`.
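
For example, reusing the serving setup and flags from the synthetic benchmark above:

```bash
python benchmark_serving_multi_turn.py --model $MODEL_PATH --served-model-name Llama \
--input-file sharegpt_conv_128.json --num-clients 2 --max-active-conversations 6
```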