Unverified Commit 197b3ee1 authored by tpoisonooo, committed by GitHub

docs(project): add quantization test results (#46)

* docs(README): update description

* docs(project): add quantization test results

* docs(README): reorder

* docs(quantization): add more description

* docs(README): remove openmmlab badge

* docs(README): scale up image

* docs(dir): add zh_cn subdir
parent 9d8949bf
<div align="center">
<img src="resources/lmdeploy-logo.png" width="450"/>
<div>&nbsp;</div>
<div align="center">
<b><font size="5">OpenMMLab website</font></b>
<sup>
<a href="https://openmmlab.com">
<i><font size="4">HOT</font></i>
</a>
</sup>
&nbsp;&nbsp;&nbsp;&nbsp;
<b><font size="5">OpenMMLab platform</font></b>
<sup>
<a href="https://platform.openmmlab.com">
<i><font size="4">TRY IT OUT</font></i>
</a>
</sup>
</div>
<div>&nbsp;</div>
[![docs](https://img.shields.io/badge/docs-latest-blue)](https://lmdeploy.readthedocs.io/en/latest/)
[![codecov](https://codecov.io/gh/open-mmlab/lmdeploy/branch/main/graph/badge.svg)](https://codecov.io/gh/open-mmlab/lmdeploy)
@@ -52,15 +35,17 @@ English | [简体中文](README_zh-CN.md)
LMDeploy is a toolkit for compressing, deploying, and serving LLMs, developed by the [MMRazor](https://github.com/open-mmlab/mmrazor) and [MMDeploy](https://github.com/open-mmlab/mmdeploy) teams. It has the following core features:
- **Efficient Inference Engine TurboMind**: Based on [FasterTransformer](https://github.com/NVIDIA/FasterTransformer), we have implemented an efficient inference engine, TurboMind, which supports inference of LLaMA and its variant models on NVIDIA GPUs.
- **Interactive Inference Mode**: By caching the attention k/v during multi-turn dialogues, it remembers the dialogue history and thus avoids repeatedly decoding historical sessions (see the sketch after the feature list).
<div align="center">
<img src="https://github.com/NVIDIA/FasterTransformer/blob/main/docs/images/gpt/gpt_interactive_generation.2.png?raw=true" width="600"/>
<img src="https://github.com/NVIDIA/FasterTransformer/blob/main/docs/images/gpt/gpt_interactive_generation.2.png?raw=true"/>
</div>
- **Multi-GPU Model Deployment and Quantization**: We provide comprehensive support for model deployment and quantization, and have successfully validated it on models ranging from 7B to 100B parameters.
- **Persistent Batch Inference**: Further optimizes model execution efficiency.
![PersistentBatchInference](https://github.com/open-mmlab/lmdeploy/assets/25839884/8f8b57b8-42af-4b71-ad74-e75f39b10694)
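The interactive mode above is easiest to see in code. Below is a minimal sketch of reusing the attention k/v cache across dialogue turns, written against a generic HuggingFace-style causal LM as a stand-in; it illustrates the caching idea only and is not LMDeploy's or TurboMind's actual API, and the model name and greedy decoding loop are assumptions.

```python
# Minimal sketch: interactive generation with a reused attention k/v cache.
# Stand-in model and greedy loop; NOT LMDeploy's actual API.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")            # stand-in model (assumption)
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

past_key_values = None  # the attention k/v cache, carried across turns

def chat_turn(user_text: str, max_new_tokens: int = 32) -> str:
    """Decode a reply while feeding only NEW tokens; history lives in the cache."""
    global past_key_values
    input_ids = tokenizer(user_text, return_tensors="pt").input_ids
    generated = []
    with torch.no_grad():
        for _ in range(max_new_tokens):
            out = model(input_ids, past_key_values=past_key_values, use_cache=True)
            past_key_values = out.past_key_values  # cache grows; history is never re-decoded
            next_id = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
            if next_id.item() == tokenizer.eos_token_id:
                break
            generated.append(next_id.item())
            input_ids = next_id                    # the next step consumes one token only
    return tokenizer.decode(generated)

print(chat_turn("Hello, who are you?"))
print(chat_turn("And what can you do?"))  # the second turn reuses the cached k/v
```

Without the cache, every turn would have to re-decode the full concatenated history, which is exactly the repeated work this mode avoids.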
@@ -173,6 +158,8 @@ Then adjust `config.ini`
- `use_context_fmha` changed to 0, which turns it off
- `quant_policy` set to 4 (this parameter defaults to 0, which means the feature is disabled)
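As an illustration, here is a minimal sketch of these two edits, assuming `config.ini` is a standard INI file with a `[llama]` section; the file path and section name are assumptions, so adapt them to your deployed workspace.

```python
# Sketch: flip the two config.ini keys needed for KV int8 inference.
# Assumes an INI file with a [llama] section; path and section are assumptions.
import configparser

CONFIG_PATH = "workspace/config.ini"  # hypothetical location of the deployed config

config = configparser.ConfigParser()
config.read(CONFIG_PATH)

config["llama"]["use_context_fmha"] = "0"  # 0 turns context FMHA off
config["llama"]["quant_policy"] = "4"      # 4 enables the policy (default 0 = off)

with open(CONFIG_PATH, "w") as f:
    config.write(f)
```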
Here are the [quantization test results](./docs/zh_cn/quantization.md).
## Contributing
We appreciate all contributions to LMDeploy. Please refer to [CONTRIBUTING.md](.github/CONTRIBUTING.md) for the contributing guideline.
<div align="center">
<img src="resources/lmdeploy-logo.png" width="450"/>
<div>&nbsp;</div>
<div align="center">
<b><font size="5">OpenMMLab website</font></b>
<sup>
<a href="https://openmmlab.com">
<i><font size="4">HOT</font></i>
</a>
</sup>
&nbsp;&nbsp;&nbsp;&nbsp;
<b><font size="5">OpenMMLab platform</font></b>
<sup>
<a href="https://platform.openmmlab.com">
<i><font size="4">TRY IT OUT</font></i>
</a>
</sup>
</div>
<div>&nbsp;</div>
[![docs](https://img.shields.io/badge/docs-latest-blue)](https://lmdeploy.readthedocs.io/en/latest/)
[![codecov](https://codecov.io/gh/open-mmlab/lmdeploy/branch/main/graph/badge.svg)](https://codecov.io/gh/open-mmlab/lmdeploy)
@@ -50,19 +33,23 @@
## Introduction
LMDeploy, jointly developed by the [MMDeploy](https://github.com/open-mmlab/mmdeploy) and [MMRazor](https://github.com/open-mmlab/mmrazor) teams, is a full suite of lightweight compression, deployment, and serving solutions covering LLM tasks.
This powerful toolbox offers the following core features:
- **Efficient Inference Engine TurboMind**: Based on [FasterTransformer](https://github.com/NVIDIA/FasterTransformer), we have implemented the efficient inference engine TurboMind, which supports inference of LLaMA and its variant models on NVIDIA GPUs.
- **Interactive Inference Mode**: By caching the attention k/v during multi-turn dialogues, it remembers the dialogue history and thus avoids repeatedly processing historical sessions.
<div align="center">
<img src="https://github.com/NVIDIA/FasterTransformer/blob/main/docs/images/gpt/gpt_interactive_generation.2.png?raw=true"/>
</div>
- **Multi-GPU Model Deployment and Quantization**: We provide comprehensive model deployment and quantization support, validated on models ranging from 7B to 100B parameters.
- **Persistent Batch Inference**: Further optimizes model execution efficiency.
![PersistentBatchInference](https://github.com/open-mmlab/lmdeploy/assets/25839884/8f8b57b8-42af-4b71-ad74-e75f39b10694)
## Quick Start
@@ -169,6 +156,8 @@ python3 lmdeploy/app.py {server_ip_addresss}:33337 {model_name}
- change `use_context_fmha` to 0, which turns it off
- set `quant_policy` to 4 (this parameter defaults to 0, which means the feature is disabled)
Here are the [quantization test results](./docs/zh_cn/quantization.md).
## Contributing
We appreciate all the contributors for their efforts to improve LMDeploy. Please refer to the [contributing guide](.github/CONTRIBUTING.md) for guidance on participating in the project.
# PTQ Quantization Test Results
The test subject is an early internal 100B model. Although the model is not yet publicly released, the test data still shows the impact of the quantization method on it.
Test procedure:
1. Run `deploy.py` to split the 100B model across 8 GPUs
2. Run the quantization script to obtain the quantization parameters and place them in the weights directory (a minimal sketch of this step follows the list)
3. Modify the configuration file so that the [kCacheKVInt8](../src/turbomind/models/llama/llama_utils.h) option takes effect
4. Run the test datasets and compare accuracy and GPU memory usage against the fp16 version
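As a rough illustration of step 2, the sketch below derives symmetric per-tensor int8 scales from calibration activations and saves them as quantization parameters; the real quantization script, tensor shapes, and file names are assumptions and may differ.

```python
# Sketch of step 2: compute symmetric int8 absmax scales for cached k/v tensors
# and store them next to the weights. Shapes and file names are assumptions.
import numpy as np

def absmax_scale(tensor: np.ndarray) -> float:
    """Symmetric int8 scale mapping [-absmax, absmax] onto [-127, 127]."""
    return float(np.abs(tensor).max()) / 127.0

def quantize_kv(tensor: np.ndarray) -> tuple[np.ndarray, float]:
    scale = absmax_scale(tensor)
    q = np.clip(np.round(tensor / scale), -127, 127).astype(np.int8)
    return q, scale  # dequantize later with q.astype(np.float32) * scale

# Fake calibration data standing in for one layer's k activations.
rng = np.random.default_rng(0)
k_sample = rng.standard_normal((8, 128, 128)).astype(np.float32)
_, k_scale = quantize_kv(k_sample)
np.save("weights/layers.0.past_kv_scale.npy", np.array([k_scale]))  # hypothetical file name
```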
## GPU Memory Reduction
As batch_size increases, `kCacheKVInt8` saves more GPU memory, which lowers deployment cost.
| batch | int8 memory (GB/GPU) | fp16 memory (GB/GPU) |
| :-: | :-: | :-: |
| 16 | 40 | 43 |
| 32 | 48 | 60 |
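The saving grows with batch size because the k/v cache is allocated per sequence. A back-of-the-envelope estimator follows; every model dimension in it is a stand-in assumption, since the 100B model's real shapes are not public.

```python
# Back-of-the-envelope k/v cache size: 2 tensors (k and v) per layer, each of
# shape [batch, heads, seq_len, head_dim], sharded over tp GPUs.
# All dimensions below are stand-in assumptions.
def kv_cache_gib_per_gpu(batch: int, bytes_per_elem: int, layers: int = 80,
                         heads: int = 64, head_dim: int = 128,
                         seq_len: int = 2048, tp: int = 8) -> float:
    elems = 2 * layers * heads * head_dim * seq_len * batch
    return elems * bytes_per_elem / tp / 1024**3

for batch in (16, 32):
    fp16 = kv_cache_gib_per_gpu(batch, 2)  # fp16: 2 bytes per element
    int8 = kv_cache_gib_per_gpu(batch, 1)  # int8: 1 byte per element
    print(f"batch={batch}: fp16 ~{fp16:.1f} GiB/GPU, int8 ~{int8:.1f} GiB/GPU, "
          f"saved ~{fp16 - int8:.1f} GiB/GPU")
```

With these stand-in shapes, int8 halves the cache cost; the measured figures above also include weights and activations, so the absolute numbers differ.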
## Accuracy Impact
The following shows the accuracy impact of `kCacheKVInt8` on other datasets when calibrated only with the c4 dataset; the `diff` column is the metric change relative to the fp16 baseline, and the values are for reference only.
| task | dataset | version | metric | diff |
| :-: | :-: | :-: | :-: | :-: |
| Exam | ceval | - | avg_accuracy | -0.43 |
| Exam | ceval-hard | - | avg_accuracy | 2.24 |
| ChineseUniversal | CMRC_dev | v1-65aa5c | score | -2.99 |
| ChineseUniversal | DRCD_dev | v1-65aa5c | score | -1.14 |
| ChineseUniversal | afqmc-dev | v1-bbbabc | accuracy | 1.67 |
| ChineseUniversal | bustm-dev | v1-ecded6 | accuracy | 10.62 |
| ChineseUniversal | bustm-test | v1-ecded6 | accuracy | 14.90 |
| ChineseUniversal | chid-dev | v1-ffc5eb | accuracy | -5.94 |
| ChineseUniversal | chid-test | v1-ffc5eb | accuracy | -4.19 |
| ChineseUniversal | cluewsc-dev | v1-b88a63 | accuracy | -4.40 |
| ChineseUniversal | cluewsc-test | v1-b88a63 | accuracy | -2.56 |
| ChineseUniversal | eprstmt-dev | v1-99cf6f | accuracy | 1.87 |
| ChineseUniversal | eprstmt-test | v1-99cf6f | accuracy | 1.48 |
| Completion | lambada | v1-678ebd | accuracy | -1.65 |
| Completion | story_cloze | v1-f92a41 | accuracy | -0.11 |
| EnglishUniversal | AX_b | v1-78e4c2 | accuracy | -1.27 |
| EnglishUniversal | AX_g | v1-ccfc17 | accuracy | -2.81 |
| EnglishUniversal | BoolQ | v1-2c7cf3 | accuracy | -4.22 |
| EnglishUniversal | CB | v1-f60fbb | accuracy | 0.00 |
| EnglishUniversal | COPA | v1-d3a03c | accuracy | -2.00 |
| EnglishUniversal | MultiRC | v1-560d31 | accuracy | -8.79 |
| EnglishUniversal | ReCoRD | v1-5a2219 | score | -2.09 |
| EnglishUniversal | RTE | v1-ccfc17 | accuracy | -3.25 |
| EnglishUniversal | WiC | v1-019721 | accuracy | -6.74 |
| EnglishUniversal | WSC | v1-57571c | accuracy | -5.77 |
| EnglishUniversal | race-middle | v1-0c5c3c | accuracy | -1.19 |
| EnglishUniversal | race-high | v1-0c5c3c | accuracy | -1.06 |
| Reasoning | gsm8k_main | v1-3d5be1 | accuracy | -8.80 |
| QA | hellaswag | v1-3e134d | accuracy | -1.45 |
| QA | piqa | v1-362133 | accuracy | -1.53 |
| QA | winogrande | v1-a2f53f | accuracy | -0.79 |
| QA | openbookqa | v1-8587d7 | accuracy | -7.00 |
| QA | openbookqa_fact | v1-4e92f0 | accuracy | -14.00 |
| QA | nq | v1-d2370e | score | -2.16 |
| QA | triviaqa | v1-ead882 | score | -0.43 |
| Security | crows_pairs | v1-8fe12f | accuracy | 11.08 |