Unverified Commit a5b67b95 authored by Lyu Han, committed by GitHub

simplify the header of the benchmark table (#820)

* simplify the header of the benchmark table

* miss comma

* fix lint
parent 72869ef8
@@ -117,15 +117,15 @@ pip install lmdeploy
To use the TurboMind inference engine, you first need to convert the model into TurboMind format. Currently, both online and offline conversion are supported. With online conversion, TurboMind can load the Huggingface model directly, while with offline conversion, you need to save the converted model before using it.
The following uses [internlm/internlm-chat-7b-v1_1](https://huggingface.co/internlm/internlm-chat-7b-v1_1) as an example to show how to use TurboMind with online conversion. You can refer to [load_hf.md](docs/en/load_hf.md) for other methods.
The following uses [internlm/internlm-chat-7b](https://huggingface.co/internlm/internlm-chat-7b) as an example to show how to use TurboMind with online conversion. You can refer to [load_hf.md](docs/en/load_hf.md) for other methods.
#### Inference by TurboMind
```shell
lmdeploy chat turbomind internlm/internlm-chat-7b-v1_1 --model-name internlm-chat-7b
lmdeploy chat turbomind internlm/internlm-chat-7b --model-name internlm-chat-7b
```
> **Note**<br /> The internlm/internlm-chat-7b-v1_1 model will be downloaded to the `.cache` folder. You can also use a local path here.
> **Note**<br /> The internlm/internlm-chat-7b model will be downloaded to the `.cache` folder. You can also use a local path here.
> **Note**<br />
> When running inference with FP16 precision, the InternLM-7B model requires at least 15.7 GB of GPU memory on TurboMind. <br />
@@ -141,7 +141,7 @@ lmdeploy chat turbomind internlm/internlm-chat-7b-v1_1 --model-name internlm-cha
# install lmdeploy with extra dependencies
pip install lmdeploy[serve]
lmdeploy serve gradio internlm/internlm-chat-7b-v1_1 --model-name internlm-chat-7b
lmdeploy serve gradio internlm/internlm-chat-7b --model-name internlm-chat-7b
```
![](https://github.com/InternLM/lmdeploy/assets/67539920/08d1e6f2-3767-44d5-8654-c85767cec2ab)
@@ -154,7 +154,7 @@ Launch inference server by:
# install lmdeploy with extra dependencies
pip install lmdeploy[serve]
lmdeploy serve api_server internlm/internlm-chat-7b-v1_1 --model-name internlm-chat-7b --instance_num 32 --tp 1
lmdeploy serve api_server internlm/internlm-chat-7b --model-name internlm-chat-7b --instance_num 32 --tp 1
```
Then, you can communicate with it by command line,
......
@@ -117,15 +117,15 @@ pip install lmdeploy
To run inference with TurboMind, the model must first be converted into TurboMind format. Currently, both online and offline conversion are supported. Online conversion can load the Huggingface model directly, while offline conversion requires saving the converted model before loading it.
The following uses [internlm/internlm-chat-7b-v1_1](https://huggingface.co/internlm/internlm-chat-7b-v1_1) as an example to show how to use online conversion. For other methods, refer to [load_hf.md](docs/zh_cn/load_hf.md).
The following uses [internlm/internlm-chat-7b](https://huggingface.co/internlm/internlm-chat-7b) as an example to show how to use online conversion. For other methods, refer to [load_hf.md](docs/zh_cn/load_hf.md).
#### Inference with TurboMind
```shell
lmdeploy chat turbomind internlm/internlm-chat-7b-v1_1 --model-name internlm-chat-7b
lmdeploy chat turbomind internlm/internlm-chat-7b --model-name internlm-chat-7b
```
> **Note**<br /> The internlm/internlm-chat-7b-v1_1 model will be downloaded automatically to the `.cache` folder. A local path to an already downloaded model can also be passed here.
> **Note**<br /> The internlm/internlm-chat-7b model will be downloaded automatically to the `.cache` folder. A local path to an already downloaded model can also be passed here.
> **Note**<br />
> When running inference on the InternLM-7B model with FP16 precision, TurboMind requires at least 15.7 GB of GPU memory. GPUs such as the 3090, V100, or A100 are recommended.<br />
@@ -140,7 +140,7 @@ lmdeploy chat turbomind internlm/internlm-chat-7b-v1_1 --model-name internlm-cha
# install lmdeploy with extra dependencies
pip install lmdeploy[serve]
lmdeploy serve gradio internlm/internlm-chat-7b-v1_1 --model-name internlm-chat-7b
lmdeploy serve gradio internlm/internlm-chat-7b --model-name internlm-chat-7b
```
![](https://github.com/InternLM/lmdeploy/assets/67539920/08d1e6f2-3767-44d5-8654-c85767cec2ab)
@@ -153,7 +153,7 @@ lmdeploy serve gradio internlm/internlm-chat-7b-v1_1 --model-name internlm-chat-
# install lmdeploy with extra dependencies
pip install lmdeploy[serve]
lmdeploy serve api_server internlm/internlm-chat-7b-v1_1 --model-name internlm-chat-7b --instance_num 32 --tp 1
lmdeploy serve api_server internlm/internlm-chat-7b --model-name internlm-chat-7b --instance_num 32 --tp 1
```
Then, you can communicate with the inference server from the command line:
......
@@ -381,20 +381,26 @@ def main():
with open(args.csv, 'w') as csvfile:
writer = csv.writer(csvfile)
writer.writerow([
'batch', 'prompt_tokens', 'completion_tokens',
'1st_token_latency(min)(s)', '1st_token_latency(max)(s)',
'1st_token_latency(ave)(s)', 'percentile50(s)',
'percentile75(s)', 'percentile95(s)', 'percentile99(s)',
'throughput(token/s)', 'mem_per_proc(GB)', 'mem_per_gpu(GB)'
'batch',
'prompt_tokens',
'completion_tokens',
'throughput(out tok/s)',
'mem(GB)',
'FTL(ave)(s)',
'FTL(min)(s)',
'FTL(max)(s)',
'50%(s)',
'75%(s)',
'95%(s)',
'99%(s)',
])
for re in results:
writer.writerow([
re.batch, re.prompt_tokens, re.completion_tokens,
re.first_token_latency[0], re.first_token_latency[1],
re.first_token_latency[2], re.percentiles[0],
re.percentiles[1], re.percentiles[2], re.percentiles[3],
f'{re.throughput_per_proc:.2f}', f'{re.mem_per_proc:.2f}',
f'{re.mem_per_gpu:.2f}'
f'{re.throughput_per_proc:.2f}', f'{re.mem_per_gpu:.2f}',
re.first_token_latency[2], re.first_token_latency[0],
re.first_token_latency[1], re.percentiles[0],
re.percentiles[1], re.percentiles[2], re.percentiles[3]
])
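The simplified header above makes the generation benchmark CSV easy to consume with standard tooling. As a hedged illustration only (the file name `generation.csv` and the snippet below are not part of this change; they simply read back the columns written above):

```python
import csv

# 'generation.csv' is a placeholder; use whatever path was passed via --csv.
with open('generation.csv', newline='') as f:
    for row in csv.DictReader(f):
        # Keys correspond to the simplified header written above.
        print(f"batch={row['batch']}, "
              f"out tok/s={row['throughput(out tok/s)']}, "
              f"FTL(ave)={row['FTL(ave)(s)']}s, "
              f"p99={row['99%(s)']}s")
```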
......
@@ -186,20 +186,18 @@ class Engine:
with open(self.csv, 'w') as csvfile:
writer = csv.writer(csvfile)
writer.writerow([
'batch', 'num_prompts', 'prompt_tokens',
'completion_tokens', '1st_token_latency(min)(s)',
'1st_token_latency(max)(s)', '1st_token_latency(ave)(s)',
'output token thr(tokens/s', 'total token thr(token/s)',
'RPS', 'RPM'
'batch', 'num_prompts', 'RPS', 'RPM', 'FTL(ave)(s)',
'FTL(min)(s)', 'FTL(max)(s)', 'throughput(out tok/s)',
'throughput(total tok/s)'
])
writer.writerow([
concurrency,
len(requests), prompt_tokens, completion_tokens,
len(requests), f'{rps:.3f}', f'{rpm:.3f}',
f'{first_token_latency_ave:.3f}' if stream_output else '-',
f'{first_token_latency_min:.3f}' if stream_output else '-',
f'{first_token_latency_max:.3f}' if stream_output else '-',
f'{first_token_latency_ave:.3f}' if stream_output else '-',
f'{completion_token_throughput:.3f}',
f'{total_token_throughput:.3f}', f'{rps:.3f}', f'{rpm:.3f}'
f'{total_token_throughput:.3f}'
])
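RPS and RPM now lead the metric columns. They are simple derived quantities; a minimal sketch of the relationship, using illustrative values rather than the script's actual variables:

```python
# Illustrative only: how RPS/RPM relate to request count and wall-clock time.
num_requests = 1000      # corresponds to len(requests) above
elapsed_time = 120.0     # benchmark duration in seconds (assumed, measured elsewhere)
rps = num_requests / elapsed_time   # requests per second
rpm = rps * 60                      # requests per minute
print(f'RPS={rps:.3f}, RPM={rpm:.3f}')
```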
......
@@ -191,20 +191,18 @@ class Engine:
with open(self.csv, 'w') as csvfile:
writer = csv.writer(csvfile)
writer.writerow([
'batch', 'num_prompts', 'prompt_tokens',
'completion_tokens', '1st_token_latency(min)(s)',
'1st_token_latency(max)(s)', '1st_token_latency(ave)(s)',
'output token thr(tokens/s', 'total token thr(token/s)',
'RPS', 'RPM'
'batch', 'num_prompts', 'RPS', 'RPM', 'FTL(ave)(s)',
'FTL(min)(s)', 'FTL(max)(s)', 'throughput(out tok/s)',
'throughput(total tok/s)'
])
writer.writerow([
concurrency,
len(requests), prompt_tokens, completion_tokens,
len(requests), f'{rps:.3f}', f'{rpm:.3f}',
f'{first_token_latency_ave:.3f}' if stream_output else '-',
f'{first_token_latency_min:.3f}' if stream_output else '-',
f'{first_token_latency_max:.3f}' if stream_output else '-',
f'{first_token_latency_ave:.3f}' if stream_output else '-',
f'{completion_token_throughput:.3f}',
f'{total_token_throughput:.3f}', f'{rps:.3f}', f'{rpm:.3f}'
f'{total_token_throughput:.3f}'
])
......
@@ -205,25 +205,23 @@ class Engine:
with open(self.csv, 'w') as csvfile:
writer = csv.writer(csvfile)
writer.writerow([
'batch', 'num_prompts', 'prompt_tokens',
'completion_tokens', '1st_token_latency(min)(s)',
'1st_token_latency(max)(s)', '1st_token_latency(ave)(s)',
'percentile50(s)', 'percentile75(s)', 'percentile95(s)',
'percentile99(s)', 'output token thr(tokens/s)',
'total token thr(token/s)', 'RPS', 'RPM'
'batch', 'num_prompts', 'RPS', 'RPM', 'FTL(ave)(s)',
'FTL(min)(s)', 'FTL(max)(s)', '50%(s)', '75%(s)', '95%(s)',
'99%(s)', 'throughput(out tok/s)',
'throughput(total tok/s)'
])
writer.writerow([
concurrency,
len(requests), prompt_tokens, completion_tokens,
len(requests), f'{rps:.3f}', f'{rpm:.3f}',
f'{first_token_latency_ave:.3f}' if stream_output else '-',
f'{first_token_latency_min:.3f}' if stream_output else '-',
f'{first_token_latency_max:.3f}' if stream_output else '-',
f'{first_token_latency_ave:.3f}' if stream_output else '-',
f'{percentiles[0]:.3f}' if stream_output else '-',
f'{percentiles[1]:.3f}' if stream_output else '-',
f'{percentiles[2]:.3f}' if stream_output else '-',
f'{percentiles[3]:.3f}' if stream_output else '-',
f'{completion_token_throughput:.3f}',
f'{total_token_throughput:.3f}', f'{rps:.3f}', f'{rpm:.3f}'
f'{total_token_throughput:.3f}'
])
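The FTL and percentile columns are summary statistics over per-request measurements collected during the run. A minimal numpy sketch of how such values can be derived (variable names and the exact latency definition are illustrative; the script's implementation may differ):

```python
import numpy as np

# One first-token latency per request (seconds); only meaningful with stream_output.
first_token_latencies = np.array([0.12, 0.15, 0.11, 0.30, 0.14])

ftl_ave = first_token_latencies.mean()
ftl_min = first_token_latencies.min()
ftl_max = first_token_latencies.max()
# 50%/75%/95%/99% columns: latency percentiles across the collected samples.
p50, p75, p95, p99 = np.percentile(first_token_latencies, [50, 75, 95, 99])
print(f'FTL ave/min/max = {ftl_ave:.3f}/{ftl_min:.3f}/{ftl_max:.3f} s, p99 = {p99:.3f} s')
```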
......
@@ -64,7 +64,7 @@ lmdeploy chat turbomind ./workspace
## GPU Memory Test
The test object is the [internlm-chat-7b](https://huggingface.co/internlm/internlm-chat-7b-v1_1) model.
The test object is the [internlm-chat-7b](https://huggingface.co/internlm/internlm-chat-7b) model.
Testing method:
1. Use `deploy.py` to convert the model, modify the maximum concurrency in the `workspace` configuration; adjust the number of requests in `llama_config.ini`.
@@ -88,7 +88,7 @@ As can be seen, the fp16 version requires 1030MB of GPU memory for each concurre
## Accuracy Test
The test object is the [internlm-chat-7b](https://huggingface.co/internlm/internlm-chat-7b-v1_1) instruction model.
The test object is the [internlm-chat-7b](https://huggingface.co/internlm/internlm-chat-7b) instruction model.
Below are the results of PTQ quantization with the `kCacheKVInt8` method, calibrated on only 128 samples randomly selected from the c4 dataset. Accuracy before and after quantization was evaluated with [opencompass](https://github.com/InternLM/opencompass).
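For background, `kCacheKVInt8` quantizes the KV cache to int8 using scales calibrated from sample data (here, 128 c4 samples). The sketch below shows generic symmetric int8 quantization and dequantization purely to illustrate the idea; it is not LMDeploy's actual implementation, whose scheme may differ.

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Symmetric per-tensor int8 quantization (illustrative only)."""
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

kv = np.random.randn(4, 128).astype(np.float32)  # stand-in for a KV-cache slice
q, scale = quantize_int8(kv)
max_err = np.abs(dequantize_int8(q, scale) - kv).max()
print(f'max abs quantization error: {max_err:.4f}')
```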
......
@@ -64,7 +64,7 @@ lmdeploy chat turbomind ./workspace
## GPU Memory Test
The test object is the [internlm-chat-7b](https://huggingface.co/internlm/internlm-chat-7b-v1_1) model.
The test object is the [internlm-chat-7b](https://huggingface.co/internlm/internlm-chat-7b) model.
Testing method:
1. Use `deploy.py` to convert the model, modify the maximum concurrency in the `workspace` configuration, and adjust the number of requests in `llama_config.ini`.
@@ -88,7 +88,7 @@ lmdeploy chat turbomind ./workspace
## Accuracy Test
The test object is the [internlm-chat-7b](https://huggingface.co/internlm/internlm-chat-7b-v1_1) instruction model.
The test object is the [internlm-chat-7b](https://huggingface.co/internlm/internlm-chat-7b) instruction model.
Below are results for the `kCacheKVInt8` method with PTQ calibration on only 128 samples randomly selected from the c4 dataset. Accuracy before and after quantization was evaluated with [opencompass](https://github.com/InternLM/opencompass).
......