# Profile API Server

The way to profile `api_server` performance is similar to the method for [profiling throughput](./profile_throughput.md). The difference is that the `api_server` must be launched successfully before testing.

The profiling script is `profile_restful_api.py`. Before running it, please install the lmdeploy precompiled package, then download the script and the test dataset:

```shell
pip install lmdeploy
git clone --depth=1 https://github.com/InternLM/lmdeploy
cd lmdeploy/benchmark
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
```

## Metrics

LMDeploy records performance metrics such as first token latency, token throughput (tokens/s), and request throughput (RPM).

`first_token_latency` is only reported in the case of streaming inference.
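
As an illustration, a streaming request to the server might look like the sketch below. It assumes the server exposes an OpenAI-compatible `/v1/chat/completions` route on the default port; verify the endpoint against your lmdeploy version's API docs:

```shell
# Minimal streaming request (assumes an OpenAI-compatible
# /v1/chat/completions endpoint on the default port 23333).
# With "stream": true, tokens arrive incrementally, so the time
# until the first chunk corresponds to first_token_latency.
curl -N http://0.0.0.0:23333/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "internlm/internlm-7b",
        "messages": [{"role": "user", "content": "Hello!"}],
        "stream": true
      }'
```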

The formula for calculating `token throughput` is:

$$
\text{TokenThroughput} = \frac{\text{Number of generated tokens}}{\text{TotalTime}}
$$

And the formula for calculating `request throughput` is:

$$
\text{RPM (requests per minute)} = \frac{\text{Number of prompts}}{\text{TotalTime}} \times 60
$$

Total time includes prefill time.
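
For example, if a benchmark run completes 100 prompts in 120 seconds and generates 150,000 tokens in total, the two metrics work out to:

$$
\text{TokenThroughput} = \frac{150000}{120} = 1250\ \text{tokens/s}, \qquad \text{RPM} = \frac{100}{120} \times 60 = 50
$$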

## Profile

In this section, we take [internlm/internlm-7b](https://huggingface.co/internlm/internlm-7b) as an example to show the benchmark procedure.

### Launch api_server

```shell
lmdeploy serve api_server internlm/internlm-7b
```
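
Before profiling, it may help to confirm the server is accepting requests. Assuming your lmdeploy version exposes the OpenAI-compatible `/v1/models` route (worth verifying against your version), a quick check looks like:

```shell
# Query the model list to verify the server is up.
# The default port is 23333; adjust if you changed it at launch.
curl http://0.0.0.0:23333/v1/models
```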

If you would like to change the server's port or other parameters, such as the inference engine, max batch size, etc., please run `lmdeploy serve api_server -h` or read [this](../serving/api_server.md) guide for a detailed explanation.
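
For instance, a launch command with a custom port and tensor parallelism might look like the sketch below; flag names can differ between lmdeploy versions, so confirm them against the `-h` output:

```shell
# Hypothetical example: serve on port 8000 with 2-way tensor parallelism.
lmdeploy serve api_server internlm/internlm-7b --server-port 8000 --tp 2
```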

### Profile

```shell
python3 profile_restful_api.py http://0.0.0.0:23333 internlm/internlm-7b ./ShareGPT_V3_unfiltered_cleaned_split.json
```

For a detailed argument specification of `profile_restful_api.py`, such as request concurrency, sampling parameters, and so on, please run the help command `python3 profile_restful_api.py -h`.
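
As an illustration, a run that pins the request concurrency and the number of sampled prompts might look like the following; the exact flag names (`--concurrency` and `--num_prompts` here) are assumptions, so check the `-h` output of your copy of the script:

```shell
# Hypothetical flags: 64 concurrent requests over 2000 sampled prompts.
python3 profile_restful_api.py http://0.0.0.0:23333 internlm/internlm-7b \
    ./ShareGPT_V3_unfiltered_cleaned_split.json \
    --concurrency 64 --num_prompts 2000
```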