# Profile Triton Inference Server

Triton Inference Server (TIS) is another serving method supported by LMDeploy besides `api_server`. Its performance testing methods and metrics are similar to those of [api_server](./profile_api_server.md).

The profiling script is `profile_serving.py`. Before running it, install the precompiled lmdeploy package, then download the script and the test dataset:

```shell
pip install 'lmdeploy[serve]'
git clone --depth=1 https://github.com/InternLM/lmdeploy
cd lmdeploy/benchmark
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
```

## Metrics

LMDeploy records performance metrics such as first token latency, token throughput (tokens/s), and request throughput (RPM).

`first_token_latency` is only reported in the case of streaming inference.

The formula for calculating `token throughput` is:

$$
TokenThroughput = \frac{Number\ of\ generated\ tokens}{TotalTime}
$$

And the formula for calculating `request throughput` is:

$$
RPM\ (request\ per\ minute) = \frac{Number\ of\ prompts}{TotalTime} \times 60
$$

Total time includes prefill time.
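
To make the two formulas concrete, the snippet below computes both metrics from made-up benchmark numbers; the prompt count, generated-token count, and total time are illustrative values only, not output of `profile_serving.py`:

```python
# A minimal sketch of the two formulas above, using made-up benchmark numbers.
num_prompts = 1000          # number of prompts sent during the benchmark
generated_tokens = 245_000  # total number of generated tokens (hypothetical)
total_time_s = 520.0        # total wall-clock time in seconds, prefill included

token_throughput = generated_tokens / total_time_s  # tokens/s
rpm = num_prompts / total_time_s * 60                # requests per minute (RPM)

print(f"token throughput:   {token_throughput:.2f} tokens/s")
print(f"request throughput: {rpm:.2f} RPM")
```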

## Profile

In this section, we take [internlm/internlm-7b](https://huggingface.co/internlm/internlm-7b) as an example to show the benchmark procedure.

### Launch Triton Inference Server

Before launching the server, the model must first be converted to the turbomind format:

```shell
lmdeploy convert internlm internlm/internlm-7b --dst-path ./internlm-7b
```

Then, the Triton Inference Server can be launched by:

```shell
bash ./internlm-7b/service_docker_up.sh
```

### Profile

```shell
python3 profile_serving.py 0.0.0.0:33337 ./internlm-7b/triton_models/tokenizer ./ShareGPT_V3_unfiltered_cleaned_split.json
```

For a detailed description of the arguments of `profile_serving.py`, such as request concurrency, sampling parameters, and so on, please run the help command `python3 profile_serving.py -h`.
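
If you want to compare several request-concurrency levels, a simple driver like the one below can re-run the benchmark in a loop. This is only a sketch: the flag names `--concurrency` and `--num_prompts` are assumptions about the script's interface, so confirm the actual argument names with the help command first.

```python
# Hedged sketch: sweep request concurrency by re-invoking profile_serving.py.
# The flag names below are assumptions; verify them via `python3 profile_serving.py -h`.
import subprocess

for concurrency in (16, 32, 64, 128):
    subprocess.run(
        [
            "python3", "profile_serving.py",
            "0.0.0.0:33337",
            "./internlm-7b/triton_models/tokenizer",
            "./ShareGPT_V3_unfiltered_cleaned_split.json",
            "--concurrency", str(concurrency),
            "--num_prompts", "2000",
        ],
        check=True,
    )
```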