Commit d3e2cee4 authored by Lyu Han, committed by GitHub

Update benchmark user guide (#763)

* user guide of benchmark generation

* update benchmark generation guide

* update profiling throughput guide

* update profiling api_server guide

* rename file names

* update profile tis user guide

* update

* fix according to review comments

* update

* update according to review comments

* updaste

* add an example

* update
# API Server Performance Test Method
The way to profile the api_server's performance is similar to the method for [profiling throughput](./profile_throughput.md). The difference is that the api_server must be launched successfully before testing.
The evaluation script is `profile_restful_api.py`. Before running it, please install the lmdeploy precompiled package, and download the evaluation script and the test dataset:
```shell
pip install 'lmdeploy[serve]>=0.1.0a1'
git clone --depth=1 https://github.com/InternLM/lmdeploy
cd lmdeploy/benchmark
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
```
During the performance test, a specific model must be provided. We recommend converting the model into turbomind format via `lmdeploy convert` before testing.
This makes it convenient to adjust the inference engine parameters, such as the batch size (max_batch_size) and the K/V cache size (max_cache_entry_count), in order to achieve better performance. For detailed explanations of these parameters, please refer to [here](../turbomind_config.md).
In the following sections, we assume the model is in turbomind format.
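As a hedged sketch of what this tuning looks like in practice (the exact config location and key names may differ across lmdeploy versions; the path below assumes the `./internlm-7b` output directory used later in this guide):

```shell
# inspect the engine parameters of the converted model
cat ./internlm-7b/triton_models/weights/config.ini
# example tweak before re-running the benchmark (key name assumed; verify it in your config.ini)
sed -i 's/^max_batch_size = .*/max_batch_size = 64/' ./internlm-7b/triton_models/weights/config.ini
```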
## Metrics
LMDeploy records performance metrics such as first token latency, token throughput (tokens/s), and request throughput (RPM).
`first_token_latency` is only reported in the case of streaming inference.
The formula for calculating `token throughput` is:
$$
TokenThroughput = Number\ of\ generated\ tokens / TotalTime
$$
And the formula for calculating `request throughput` is:
$$
RPM(request\ per\ minute) = Number\ of\ prompts / TotalTime * 60
$$
Total time includes prefill time.
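As a purely illustrative example, a hypothetical run that serves 2000 prompts and generates 600000 tokens in a total of 400 seconds (prefill included) would yield:
$$
TokenThroughput = 600000 / 400 = 1500\ tokens/s,\quad RPM = 2000 / 400 * 60 = 300
$$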
## Example
We take `internlm-7b` as an example. The entire benchmark procedure is:
```shell
pip install 'lmdeploy[serve]>=0.1.0a1'
git clone --depth=1 https://github.com/InternLM/lmdeploy
cd lmdeploy/benchmark
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
# get internlm-7b from huggingface and convert it to turbomind format
lmdeploy convert internlm internlm/internlm-7b --dst-path ./internlm-7b
# launch server
lmdeploy serve api_server ./internlm-7b --server-port 23333
# open another terminal and run the following command in the directory `lmdeploy/benchmark`
python3 ./profile_restful_api.py http://0.0.0.0:23333 ./internlm-7b/triton_models/tokenizer ./ShareGPT_V3_unfiltered_cleaned_split.json
```
## Methods
Please refer to [this](../restful_api.md) guide to start `api_server`.
The argument `--instance-num` specifies the number of inference instances. When more requests than `--instance-num` arrive at the `api_server` at the same time, the excess requests will wait in the inference queue.
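For example, the server can be launched with an explicit instance count; the value below is illustrative, not a recommendation:

```shell
# launch api_server with 64 inference instances (illustrative value)
lmdeploy serve api_server ./internlm-7b --server-port 23333 --instance-num 64
```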
```shell
python3 profile_restful_api.py <server_addr> <tokenizer_path> <dataset> <optional arguments>
```
The required parameters are:
- `server_addr`
The address of api_server with format `http://{server_ip}:{server_port}`
- `tokenizer_path`
The path of the tokenizer model, which is used to encode the dataset to get the token size of prompts and responses
- `dataset`
The path of the downloaded dataset
Optional arguments are listed as below:
- `--concurrency`
The number of client request threads, with a default value of 64. Requests from concurrent threads will be batched by the inference engine. Its value should not exceed the number of inference instances in the api_server; otherwise, the excess requests will wait in the inference queue.
- `--num-prompts`
The number of sampled prompts from dataset to process. The default is 2000.
- `--top_p` and `--temperature`
They are used to sample the generated token_id.
- `--stream_output`
Indicator for streaming output. The default is `False`.
- `--csv`
The path of a csv file to save the result with default value `../profile_api_server.csv`
- `--seed`
It is the seed used in sampling prompts from dataset with default value 0.
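Putting the options together, a sketch of a full invocation might look like this (option values are illustrative, and `./my_api_server.csv` is a hypothetical output path):

```shell
python3 profile_restful_api.py http://0.0.0.0:23333 ./internlm-7b/triton_models/tokenizer \
    ./ShareGPT_V3_unfiltered_cleaned_split.json \
    --concurrency 64 --num-prompts 2000 --seed 0 --csv ./my_api_server.csv
```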
# Static Inference Performance Test Method
We refer to the performance of the inference engine under a fixed batch size and fixed input/output token lengths as static inference performance.
The evaluation script is `profile_generation.py`. Before running it, please install the lmdeploy precompiled package and download the evaluation script:
```shell
pip install 'lmdeploy>=0.1.0a1'
git clone --depth=1 https://github.com/InternLM/lmdeploy
```
During the performance test, a specific model must be provided. We recommend converting the model into turbomind format via `lmdeploy convert` before testing.
This makes it convenient to adjust the inference engine parameters, such as the batch size (max_batch_size) and the K/V cache size (max_cache_entry_count), in order to achieve better performance. For detailed explanations of these parameters, please refer to [here](../turbomind_config.md).
In the following sections, we assume the model is in turbomind format.
## Metrics
LMDeploy records test results such as first token latency, token throughput (tokens/s), percentile data of per-token latency (P50, P75, P95, P99), GPU memory usage, etc.
`first_token_latency` is only reported in the case of streaming inference.
The formula for calculating `throughput` is:
$$
TokenThroughput = Number\ of\ generated\ tokens / TotalTime
$$
Total time includes prefill time.
During the test, no other programs should be running on any of the node's GPUs; otherwise, the GPU memory statistics will be inaccurate.
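As a quick sanity check (assuming `nvidia-smi` is available on the node), you can confirm that no other compute processes occupy the GPUs before starting the benchmark:

```shell
# list compute processes and their GPU memory usage; ideally nothing but the benchmark shows up
nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv
```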
## Example
We take `internlm-7b` as an example. The entire benchmark procedure is:
```shell
pip install 'lmdeploy>=0.1.0a1'
git clone --depth=1 https://github.com/InternLM/lmdeploy
cd lmdeploy/benchmark
# get internlm-7b from huggingface and convert it to turbomind format
lmdeploy convert internlm internlm/internlm-7b --dst-path ./internlm-7b
# benchmark
python3 profile_generation.py ./internlm-7b
```
## Command details
```shell
python3 profile_generation.py <model_path> <optional arguments>
```
`model_path` refers to the path on localhost where the model in turbomind format is located.
Optional arguments are listed as below:
- `--concurrency`
The number of request threads. Requests from concurrent threads will be batched by the inference engine. It is a list with default value `[1, 16, 32, 64]`, which means the performance under 4 different levels of concurrency is tested. The level of concurrency should not exceed `max_batch_size` in the [turbomind config](../turbomind_config.md#turbomind-20-config); otherwise, there will be `concurrency - max_batch_size` threads randomly waiting at almost any time during the test.
- `--prompt-tokens` and `--completion-tokens`
The numbers of input and output tokens. They are lists of the same length, and their elements correspond one-to-one, i.e. the pair `(prompt_tokens[i], completion_tokens[i])` is a test case. With the default lists `[1, 128, 128, 2048, 2048]` and `[128, 128, 2048, 128, 2048]`, the test cases are `(1, 128)`, `(128, 128)`, `(128, 2048)`, `(2048, 128)` and `(2048, 2048)`.
- `--tp`
The number of GPUs used when the inference is in tensor parallel mode. It must be a power of 2. The default is 1.
- `--top_k`, `--top_p` and `--temperature`
They are used to sample the generated token_id.
- `--csv`
A csv file path used to store test results. The default is `./profile_generation.csv`
- `--log-level`
The log level. The default is 'ERROR'.
- `--test-round`
The number of test rounds is set to 10 by default. This means that each case will undergo 10 rounds of testing, and the average result will be calculated.
We refer to a tuple of `(#concurrency, #prompt_token, #completion_token)` as a test case. Therefore, the total number of test cases (`#test_cases`) executed by the script is `len(concurrency) * len(prompt-tokens)`, and the total test rounds are `#test_cases * #test_round`. Users can flexibly adjust test parameters according to their actual situation.
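For example, assuming the list-valued options accept space-separated values as their defaults suggest, the following sketch tests 2 concurrency levels against 2 `(prompt, completion)` pairs, i.e. 4 test cases, each repeated 5 rounds (all values are illustrative):

```shell
python3 profile_generation.py ./internlm-7b \
    --concurrency 1 16 \
    --prompt-tokens 128 2048 --completion-tokens 128 2048 \
    --test-round 5 --csv ./my_generation.csv
```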
# Request Throughput Test Method
In real-world applications, the length of the user's input prompt and the number of generated tokens are dynamic. Static inference performance is insufficient to reflect the inference engine's ability to handle these dynamic characteristics.
Therefore, real dialogue data is needed to evaluate the dynamic inference capabilities of the inference engine. This article introduces how to test the dynamic inference performance of LMDeploy on localhost.
The evaluation script is `profile_throughput.py`. Before running it, please install the lmdeploy precompiled package, download the evaluation script and the test dataset:
```shell
pip install 'lmdeploy>=0.1.0a1'
git clone --depth=1 https://github.com/InternLM/lmdeploy
cd lmdeploy/benchmark
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
```
During the performance test, a specific model must be provided. We recommend converting the model into turbomind format via `lmdeploy convert` before testing.
This makes it convenient to adjust the inference engine parameters, such as the batch size (max_batch_size) and the K/V cache size (max_cache_entry_count), in order to achieve better performance. For detailed explanations of these parameters, please refer to [here](../turbomind_config.md).
In the following sections, we assume the model is in turbomind format.
## Metrics
LMDeploy records performance metrics such as first token latency, token throughput (tokens/s), and request throughput (RPM).
`first_token_latency` is only reported in the case of streaming inference.
The formula for calculating `token throughput` is:
$$
TokenThroughput = Number\ of\ generated\ tokens / TotalTime
$$
And the formula for calculating `request throughput` is:
$$
RPM(request\ per\ minute) = Number\ of\ prompts / TotalTime * 60
$$
Total time includes prefill time.
## Example
We take `internlm-7b` as an example. The entire benchmark procedure is:
```shell
pip install 'lmdeploy>=0.1.0a1'
git clone --depth=1 https://github.com/InternLM/lmdeploy
cd lmdeploy/benchmark
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
# get internlm-7b from huggingface and convert it to turbomind format
lmdeploy convert internlm internlm/internlm-7b --dst-path ./internlm-7b
python3 profile_throughput.py ./ShareGPT_V3_unfiltered_cleaned_split.json ./internlm-7b
```
## Command details
```shell
python3 profile_throughput.py <dataset> <model_path> <optional arguments>
```
The required parameters are:
- `dataset`
The path of the downloaded dataset
- `model_path`
The path on localhost where the model in turbomind format is located.
Optional arguments are listed as below:
- `--concurrency`
It represents the number of request threads with default value 64. Requests of concurrent threads will be batched by the inference engine. Its value should not exceed `max_batch_size` in `config.ini`. Otherwise, the excess requests will wait in the inference queue.
- `--num-prompts`
The number of sampled prompts from dataset to process. The default is 2000.
- `--tp`
The number of GPUs used when the inference is in tensor parallel mode. It must be a power of 2. The default is 1.
- `--top_k`, `--top_p` and `--temperature`
They are used to sample the generated token_id.
- `--stream_output`
Indicator for streaming output. The default is `True`.
- `--csv`
The path of a csv file to save the result with default value `./profile_throughput.csv`
- `--log-level`
The log level. The default is `ERROR`.
- `--seed`
It is the seed used in sampling prompts from dataset with default value 0.
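As a sketch that combines the options above (option values are illustrative, and `./my_throughput.csv` is a hypothetical output path):

```shell
python3 profile_throughput.py ./ShareGPT_V3_unfiltered_cleaned_split.json ./internlm-7b \
    --concurrency 64 --num-prompts 2000 --tp 1 --csv ./my_throughput.csv
```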
# Triton Inference Server Performance Test Method
Triton Inference Server (TIS) is another serving method supported by LMDeploy besides the api_server. Its performance test method and metrics are similar to those of the [api_server](./profile_api_server.md).
The evaluation script is `profile_serving.py`. Before running it, please install the lmdeploy precompiled package, download the evaluation script and the test dataset:
```shell
pip install 'lmdeploy[serve]>=0.1.0a1'
git clone --depth=1 https://github.com/InternLM/lmdeploy
cd lmdeploy/benchmark
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
```
During the performance test, a specific model must be provided. We recommend converting the model into turbomind format via `lmdeploy convert` before testing.
This makes it convenient to adjust the inference engine parameters, such as the batch size (max_batch_size) and the K/V cache size (max_cache_entry_count), in order to achieve better performance. For detailed explanations of these parameters, please refer to [here](../turbomind_config.md).
In the following sections, we assume the model is in turbomind format.
## Metrics
LMDeploy records performance metrics such as first token latency, token throughput (tokens/s), and request throughput (RPM).
`first_token_latency` is only reported in the case of streaming inference.
The formula for calculating `token throughput` is:
$$
TokenThroughput = Number\ of\ generated\ tokens / TotalTime
$$
And the formula for calculating `request throughput` is:
$$
RPM(request\ per\ minute) = Number\ of\ prompts / TotalTime * 60
$$
Total time includes prefill time.
## Example
We take `internlm-7b` as an example. The entire benchmark procedure is:
```shell
pip install 'lmdeploy[serve]>=0.1.0a1'
git clone --depth=1 https://github.com/InternLM/lmdeploy
cd lmdeploy/benchmark
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
# get internlm-7b from huggingface and convert it to turbomind format
lmdeploy convert internlm internlm/internlm-7b --dst-path ./internlm-7b
# launch server
bash ./internlm-7b/service_docker_up.sh
# open another terminal and run the following command in the directory `lmdeploy/benchmark`
python3 ./profile_serving.py 0.0.0.0:33337 ./internlm-7b/triton_models/tokenizer ./ShareGPT_V3_unfiltered_cleaned_split.json
```
## Command details
```shell
python3 profile_serving.py <server_addr> <tokenizer_path> <dataset> <optional arguments>
```
The required parameters are:
- `server_addr`
The address of the Triton Inference Server, in the format `{server_ip}:{server_port}`
- `tokenizer_path`
The path of the tokenizer model, which is used to encode the dataset to get the token size of prompts and responses
- `dataset`
The path of the downloaded dataset
Optional arguments are listed as below:
- `--concurrency`
The number of client request threads, with a default value of 32. Requests from concurrent threads will be batched by the inference engine.
It is recommended that `concurrency` not exceed `max_batch_size` in `config.ini`, nor the number of inference instances in `triton_models`; otherwise, the excess requests will wait in the inference queue.
The number of inference instances is configured by `instance_group` in the file `{model_path}/triton_models/interactive/config.pbtxt`, and the default is 48.
- `--num-prompts`
The number of prompts sampled from the dataset to process. The default is 1000. A value of 2000 is suggested when `concurrency >= 64`.
- `--top_k`, `--top_p` and `--temperature`
They are used to sample the generated token_id.
- `--stream_output`
Indicator for streaming output. The default is `True`.
- `--csv`
The path of a csv file to save the result with default value `../profile_tis.csv`
- `--seed`
It is the seed used in sampling prompts from dataset with default value 0.
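As a sketch under the defaults described above (the `grep` is just one optional way to inspect the configured instance count; option values are illustrative):

```shell
# optional: inspect the number of inference instances configured for TIS
grep -A 3 instance_group ./internlm-7b/triton_models/interactive/config.pbtxt
# run the benchmark against the server started by service_docker_up.sh
python3 ./profile_serving.py 0.0.0.0:33337 ./internlm-7b/triton_models/tokenizer \
    ./ShareGPT_V3_unfiltered_cleaned_split.json \
    --concurrency 32 --num-prompts 1000 --csv ./my_tis.csv
```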
# API Server Performance Test Method
The method for testing the api_server is similar to the [request throughput test method](./profile_throughput.md). The difference is that the api_server must be launched first, and requests are then sent to it by the test script.
The test script is `profile_restful_api.py`. Before testing, please install the lmdeploy precompiled package and download the evaluation script and the test dataset:
```shell
pip install 'lmdeploy[serve]>=0.1.0a1'
git clone --depth=1 https://github.com/InternLM/lmdeploy
cd lmdeploy/benchmark
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
```
For the benchmark, a specific model must be provided. We recommend downloading the model to localhost and converting it into turbomind format via `lmdeploy convert` before testing.
This makes it convenient to adjust the inference engine parameters, such as the batch size (max_batch_size) and the K/V cache size (max_cache_entry_count), in order to achieve better performance. For detailed explanations of these parameters, please refer to [here](../turbomind_config.md).
In the following sections, we assume the model is in turbomind format.
## Metrics
LMDeploy records performance metrics such as first token latency (first_token_latency), token throughput (tokens/s), and request throughput (RPM).
`first_token_latency` is only reported in the case of streaming inference.
The formula for calculating token throughput is:
$$
TokenThroughput = Number\ of\ generated\ tokens / TotalTime
$$
The formula for calculating request throughput is:
$$
RPM(request\ per\ minute) = Number\ of\ prompts / TotalTime * 60
$$
Total time includes prefill time.
## Example
Taking `internlm-7b` as an example, the entire api_server benchmark procedure is as follows:
```shell
pip install 'lmdeploy[serve]>=0.1.0a1'
git clone --depth=1 https://github.com/InternLM/lmdeploy
cd lmdeploy/benchmark
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
# get internlm-7b from huggingface and convert it to turbomind format
lmdeploy convert internlm internlm/internlm-7b --dst-path ./internlm-7b
# launch server
lmdeploy serve api_server ./internlm-7b --server-port 23333
# open another terminal and run the benchmark script in the directory `lmdeploy/benchmark`
python3 ./profile_restful_api.py http://0.0.0.0:23333 ./internlm-7b/triton_models/tokenizer ./ShareGPT_V3_unfiltered_cleaned_split.json
```
## Test Method
Please refer to [this guide](../restful_api.md) to launch the inference service. The launch argument `--instance-num` specifies the number of inference instances in the service. When the number of requests arriving at the api_server at the same time exceeds it, the excess requests will wait in the inference queue.
```shell
python3 profile_restful_api.py <server_addr> <tokenizer_path> <dataset> <optional arguments>
```
The required parameters are:
- `server_addr`
The address of the api_server, in the format `http://{server_ip}:{server_port}`
- `tokenizer_path`
The path of the tokenizer model, which is used to encode the dataset in advance to obtain the token lengths of prompts and responses
- `dataset`
The path of the downloaded test dataset
Optional arguments are listed below:
- `--concurrency`
The number of client request threads, with a default value of 64. Concurrent requests will be batched by the inference engine. The concurrency should not exceed the api_server's `--instance-num`; otherwise, the excess requests will wait in the inference queue.
- `--num-prompts`
The number of prompts sampled from the dataset. The default is 2000
- `--top_p` and `--temperature`
They are used to sample the generated token_id
- `--stream_output`
The switch for streaming inference. The default is `False`
- `--csv`
The path of a csv file used to store the test results. The default is `./profile_api_server.csv`
- `--seed`
The seed used when randomly sampling prompts from the dataset. The default is 0
# Static Inference Performance Test Method
We refer to inference performed by the engine with a fixed batch size and fixed numbers of input and output tokens as static inference.
The evaluation script is `profile_generation.py`. Before running it, please install the lmdeploy precompiled package and download the evaluation script:
```shell
pip install 'lmdeploy>=0.1.0a1'
git clone --depth=1 https://github.com/InternLM/lmdeploy
```
For the benchmark, a specific model must be provided. We recommend downloading the model to localhost and converting it into turbomind format via `lmdeploy convert` before testing.
This makes it convenient to adjust the inference engine parameters, such as the batch size (max_batch_size) and the K/V cache size (max_cache_entry_count), in order to achieve better performance. For detailed explanations of these parameters, please refer to [here](../turbomind_config.md).
In the following sections, we assume the model is in turbomind format.
## Metrics
LMDeploy records test results such as first token latency (first_token_latency), token throughput (tokens/s), percentile data of per-token latency (P50, P75, P95, P99), GPU memory usage, etc.
`first_token_latency` is only reported in the case of streaming inference.
The formula for calculating throughput is:
$$
TokenThroughput = Number\ of\ generated\ tokens / TotalTime
$$
Total time includes prefill time.
During the test, no other programs should be running on any of the node's GPUs; otherwise, the GPU memory statistics will be inaccurate.
## Example
Taking `internlm-7b` as an example, the entire benchmark procedure is as follows:
```shell
pip install 'lmdeploy>=0.1.0a1'
git clone --depth=1 https://github.com/InternLM/lmdeploy
cd lmdeploy/benchmark
# get internlm-7b from huggingface and convert it to turbomind format
lmdeploy convert internlm internlm/internlm-7b --dst-path ./internlm-7b
# run the benchmark script
python3 profile_generation.py ./internlm-7b
```
## Test Method
```shell
python3 profile_generation.py <model_path> <optional arguments>
```
`model_path` is the path on localhost of the model in turbomind format.
Optional arguments are listed below:
- `--concurrency`
The number of request threads. Concurrent requests will be batched by the inference engine. The default is `[1, 16, 32, 64]`, which means the performance under 4 different levels of concurrency is tested by default. The concurrency should not exceed `max_batch_size` in `config.ini`; otherwise, the excess requests will wait in the inference queue.
- `--prompt-tokens` and `--completion-tokens`
The numbers of input and output tokens. They are lists of the same length, and their elements correspond one-to-one, i.e. `(prompt_tokens[i], completion_tokens[i])` forms a test pair. With the default lists `[1, 128, 128, 2048, 2048]` and `[128, 128, 2048, 128, 2048]`, the test pairs are `(1, 128)`, `(128, 128)`, `(128, 2048)`, `(2048, 128)` and `(2048, 2048)`
- `--tp`
The number of GPUs used for tensor-parallel inference. It must be a power of 2. The default is 1.
- `--top_k`, `--top_p` and `--temperature`
These three parameters are used to sample the generated token_id.
- `--csv`
The path of a csv file used to store the test results. The default is `./profile_generation.csv`
- `--log-level`
The log level. The default is `ERROR`
- `--test-round`
The number of test rounds, 10 by default, which means each test case is run 10 times and the average result is reported.
We refer to a tuple of `(concurrency, prompt_tokens, completion_tokens)` as a test case. Therefore, the total number of test cases executed by the script is `len(concurrency) * len(prompt-tokens)`, and the total scale of the test is `total number of test cases * number of test rounds`. Users can flexibly adjust the test parameters according to their actual situation.
# Request Throughput Test Method
In real-world applications, the length of the user's input prompt and the number of tokens in the model's reply vary dynamically, and static inference performance is insufficient to reflect the inference engine's ability to handle such dynamic inputs and outputs.
Therefore, real dialogue data is needed to evaluate the dynamic inference capability of the inference engine. This article introduces how to test the dynamic inference performance of LMDeploy on localhost.
The test script is `profile_throughput.py`. Before testing, please install the lmdeploy precompiled package and download the evaluation script and the test dataset:
```shell
pip install 'lmdeploy>=0.1.0a1'
git clone --depth=1 https://github.com/InternLM/lmdeploy
cd lmdeploy/benchmark
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
```
For the benchmark, a specific model must be provided. We recommend downloading the model to localhost and converting it into turbomind format via `lmdeploy convert` before testing.
This makes it convenient to adjust the inference engine parameters, such as the batch size (max_batch_size) and the K/V cache size (max_cache_entry_count), in order to achieve better performance. For detailed explanations of these parameters, please refer to [here](../turbomind_config.md).
In the following sections, we assume the model is in turbomind format.
## Metrics
LMDeploy records performance metrics such as first token latency (first_token_latency), token throughput (tokens/s), and request throughput (RPM).
`first_token_latency` is only reported in the case of streaming inference.
The formula for calculating token throughput is:
$$
TokenThroughput = Number\ of\ generated\ tokens / TotalTime
$$
The formula for calculating request throughput is:
$$
RPM(request\ per\ minute) = Number\ of\ prompts / TotalTime * 60
$$
Total time includes prefill time.
## Example
Taking `internlm-7b` as an example, the entire benchmark procedure is as follows:
```shell
pip install 'lmdeploy>=0.1.0a1'
git clone --depth=1 https://github.com/InternLM/lmdeploy
cd lmdeploy/benchmark
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
# get internlm-7b from huggingface and convert it to turbomind format
lmdeploy convert internlm internlm/internlm-7b --dst-path ./internlm-7b
# run the benchmark script
python3 profile_throughput.py ./ShareGPT_V3_unfiltered_cleaned_split.json ./internlm-7b
```
## Test Method
```shell
python3 profile_throughput.py <dataset> <model_path> <optional arguments>
```
The required parameters are:
- `dataset`
The path of the test dataset
- `model_path`
The path on localhost of the model in turbomind format
Optional arguments are listed below:
- `--concurrency`
The number of request threads, with a default value of 64. Concurrent requests will be batched by the inference engine. The concurrency should not exceed `max_batch_size` in `config.ini`; otherwise, the excess requests will wait in the inference queue.
- `--num-prompts`
The number of prompts sampled from the dataset. The default is 2000
- `--tp`
The number of GPUs used for tensor-parallel inference. It must be a power of 2. The default is 1
- `--top_k`, `--top_p` and `--temperature`
These three parameters are used to sample the generated token_id
- `--stream_output`
The switch for streaming inference. The default is `True`
- `--csv`
The path of a csv file used to store the test results. The default is `./profile_throughput.csv`
- `--log-level`
The log level. The default is `ERROR`
- `--seed`
The seed used when randomly sampling prompts from the dataset. The default is 0
# Triton Inference Server Performance Test Method
Triton Inference Server (TIS) is another serving method supported by LMDeploy besides the api_server. Its performance test method and metrics are similar to those of the [api_server](./profile_api_server.md).
```{note}
LMDeploy has not yet implemented the ensemble inference mode of Triton Inference Server, so its inference performance is weaker than that of the api_server. For users who prioritize performance, we recommend deploying the service with the api_server.
```
The TIS performance test script is `profile_serving.py`. Before testing, please install the lmdeploy precompiled package and download the evaluation script and the test dataset:
```shell
pip install 'lmdeploy[serve]>=0.1.0a1'
git clone --depth=1 https://github.com/InternLM/lmdeploy
cd lmdeploy/benchmark
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
```
For the benchmark, a specific model must be provided. We recommend downloading the model to localhost and converting it into turbomind format via `lmdeploy convert` before testing.
This makes it convenient to adjust the inference engine parameters, such as the batch size (max_batch_size) and the K/V cache size (max_cache_entry_count), in order to achieve better performance. For detailed explanations of these parameters, please refer to [here](../turbomind_config.md).
In the following sections, we assume the model is in turbomind format.
## Metrics
LMDeploy records performance metrics such as first token latency (first_token_latency), token throughput (tokens/s), and request throughput (RPM).
`first_token_latency` is only reported in the case of streaming inference.
The formula for calculating token throughput is:
$$
TokenThroughput = Number\ of\ generated\ tokens / TotalTime
$$
The formula for calculating request throughput is:
$$
RPM(request\ per\ minute) = Number\ of\ prompts / TotalTime * 60
$$
Total time includes prefill time.
## Example
Taking `internlm-7b` as an example, the entire benchmark procedure is as follows:
```shell
pip install 'lmdeploy[serve]>=0.1.0a1'
git clone --depth=1 https://github.com/InternLM/lmdeploy
cd lmdeploy/benchmark
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
# get internlm-7b from huggingface and convert it to turbomind format
lmdeploy convert internlm internlm/internlm-7b --dst-path ./internlm-7b
# launch server
bash ./internlm-7b/service_docker_up.sh
# open another terminal and run the benchmark script in the directory `lmdeploy/benchmark`
python3 ./profile_serving.py 0.0.0.0:33337 ./internlm-7b/triton_models/tokenizer ./ShareGPT_V3_unfiltered_cleaned_split.json
```
## Test Method
After the service has been launched, run the test script:
```shell
python3 profile_serving.py <server_addr> <tokenizer_path> <dataset> <optional arguments>
```
The required parameters are:
- `server_addr`
The address of the Triton Inference Server, in the format `{server_ip}:{server_port}`
- `tokenizer_path`
The path of the tokenizer model, which is used to encode the dataset in advance to obtain the token lengths of prompts and responses
- `dataset`
The path of the downloaded test dataset
Optional arguments are listed below:
- `--concurrency`
The number of client request threads, with a default value of 32. Concurrent requests will be batched by the inference engine. It is recommended that the concurrency not exceed the inference engine's `max_batch_size`, nor the number of inference instances in `triton_models`.
The number of inference instances is configured by `instance_group` in the file `{model_path}/triton_models/interactive/config.pbtxt`, and the default is 48.
- `--num-prompts`
The number of prompts sampled from the dataset. The default is 1000
- `--top_k`, `--top_p` and `--temperature`
These three parameters are used to sample the generated token_id
- `--stream_output`
The switch for streaming inference. The default is `False`
- `--csv`
The path of a csv file used to store the test results. The default is `./profile_tis.csv`
- `--seed`
The seed used when randomly sampling prompts from the dataset. The default is 0