# Profile Request Throughput

In real-world applications, the length of users' input prompts and the number of generated tokens are dynamic. Static inference performance alone is insufficient to reflect the inference engine's ability to handle these dynamic workloads.

Therefore, real dialogue data is needed to evaluate the dynamic inference capabilities of the inference engine. This article introduces how to benchmark the dynamic inference performance of LMDeploy on localhost.

The profiling script is `profile_throughput.py`. Before running it, please install the precompiled lmdeploy package, then download the profiling script and the test dataset:

```shell
pip install lmdeploy
git clone --depth=1 https://github.com/InternLM/lmdeploy
cd lmdeploy/benchmark
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
```
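The downloaded file is a JSON list of dialogues, each carrying a `conversations` list of alternating human/assistant turns (the ShareGPT_V3 schema). A minimal sketch of extracting the human-side prompts that serve as test inputs, using an inline illustrative sample rather than the real dataset:

```python
import json

# A tiny in-memory sample mimicking the ShareGPT_V3 schema:
# a list of dialogues, each with a "conversations" turn list.
# The text below is illustrative, not taken from the dataset.
sample = [
    {
        "id": "demo-0",
        "conversations": [
            {"from": "human", "value": "What is LMDeploy?"},
            {"from": "gpt", "value": "An LLM serving toolkit."},
        ],
    }
]

def extract_prompts(dataset):
    """Collect the human-side turns, which act as the benchmark prompts."""
    return [
        turn["value"]
        for dialogue in dataset
        for turn in dialogue["conversations"]
        if turn["from"] == "human"
    ]

print(extract_prompts(sample))  # ['What is LMDeploy?']
```

With the real file, the same pattern applies after `json.load(open("ShareGPT_V3_unfiltered_cleaned_split.json"))`.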

## Metrics

LMDeploy records performance metrics such as first token latency, token throughput (tokens/s), and request throughput (RPM).

`first_token_latency` is only reported in the case of streaming inference.

The formula for calculating `token throughput` is:

$$
TokenThroughput = Number\\ of\\ generated\\ tokens/TotalTime
$$

And the formula for calculating `request throughput` is:

$$
RPM\\ (requests\\ per\\ minute) = Number\\ of\\ prompts/TotalTime \times 60
$$

Total time includes prefill time.
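Plugging hypothetical numbers into the two formulas above (the measurements are made up for illustration, not real benchmark output):

```python
# Hypothetical measurements from a profiling run (illustrative only):
total_time_s = 100.0        # wall-clock time, prefill included
generated_tokens = 42_000   # total tokens generated across all requests
num_prompts = 300           # number of prompts processed

token_throughput = generated_tokens / total_time_s  # tokens/s
rpm = num_prompts / total_time_s * 60               # requests per minute

print(f"token throughput: {token_throughput:.1f} tokens/s")  # 420.0 tokens/s
print(f"request throughput: {rpm:.1f} RPM")                  # 180.0 RPM
```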

## Profile

In this section, we take [internlm/internlm-7b](https://huggingface.co/internlm/internlm-7b) as an example to show how to profile the inference engines of LMDeploy.

### Profile turbomind engine

```shell
python3 profile_throughput.py ./ShareGPT_V3_unfiltered_cleaned_split.json internlm/internlm-7b
```

### Profile pytorch engine

```shell
python3 profile_throughput.py ./ShareGPT_V3_unfiltered_cleaned_split.json internlm/internlm-7b --backend pytorch
```

For the detailed argument specification of `profile_throughput.py`, such as request concurrency, sampling parameters, k/v cache memory percentage and so on, please run the help command `python3 profile_throughput.py -h`.