Unverified Commit 197b3ee1 authored by tpoisonooo, committed by GitHub

docs(project): add quantization test results (#46)

* docs(README): update description

* docs(project): add quantization test results

* docs(README): reorder

* docs(quantization): add more description

* docs(README): remove openmmlab badge

* docs(README): scale up image

* docs(dir): add zh_cn subdir
parent 9d8949bf
<div align="center">
<img src="resources/lmdeploy-logo.png" width="450"/>
<div>&nbsp;</div>
<div align="center">
<b><font size="5">OpenMMLab website</font></b>
<sup>
<a href="https://openmmlab.com">
<i><font size="4">HOT</font></i>
</a>
</sup>
&nbsp;&nbsp;&nbsp;&nbsp;
<b><font size="5">OpenMMLab platform</font></b>
<sup>
<a href="https://platform.openmmlab.com">
<i><font size="4">TRY IT OUT</font></i>
</a>
</sup>
</div>
<div>&nbsp;</div>
[![docs](https://img.shields.io/badge/docs-latest-blue)](https://lmdeploy.readthedocs.io/en/latest/)
[![codecov](https://codecov.io/gh/open-mmlab/lmdeploy/branch/main/graph/badge.svg)](https://codecov.io/gh/open-mmlab/lmdeploy)
@@ -52,15 +35,17 @@ English | [简体中文](README_zh-CN.md)
LMDeploy is a toolkit for compressing, deploying, and serving LLMs, developed by the [MMRazor](https://github.com/open-mmlab/mmrazor) and [MMDeploy](https://github.com/open-mmlab/mmdeploy) teams. It has the following core features:
- **Efficient Inference Engine TurboMind**: Based on [FasterTransformer](https://github.com/NVIDIA/FasterTransformer), we have implemented an efficient inference engine, TurboMind, which supports inference of LLaMA and its variant models on NVIDIA GPUs.
- **Interactive Inference Mode**: By caching the attention k/v during multi-turn dialogues, it remembers the dialogue history and thus avoids repeatedly decoding historical sessions (see the sketch after the feature list).
<div align="center">
<img src="https://github.com/NVIDIA/FasterTransformer/blob/main/docs/images/gpt/gpt_interactive_generation.2.png?raw=true" width="600"/>
<img src="https://github.com/NVIDIA/FasterTransformer/blob/main/docs/images/gpt/gpt_interactive_generation.2.png?raw=true"/>
</div>
- **Multi-GPU Model Deployment and Quantization**: We provide comprehensive support for model deployment and quantization, and have successfully validated it on models ranging from 7B to 100B parameters.
- **Persistent Batch Inference**: Further optimizes model execution efficiency.
![PersistentBatchInference](https://github.com/open-mmlab/lmdeploy/assets/25839884/8f8b57b8-42af-4b71-ad74-e75f39b10694)
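The interactive mode above is easiest to see in code. Below is a minimal sketch of reusing the attention k/v cache across dialogue turns, written against a generic HuggingFace-style causal LM as a stand-in; it illustrates the caching idea only and is not LMDeploy's or TurboMind's actual API, and the model name and greedy decoding loop are assumptions.

```python
# Minimal sketch: interactive generation with a reused attention k/v cache.
# Stand-in model and greedy loop; NOT LMDeploy's actual API.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")            # stand-in model (assumption)
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

past_key_values = None  # the attention k/v cache, carried across turns

def chat_turn(user_text: str, max_new_tokens: int = 32) -> str:
    """Decode a reply while feeding only NEW tokens; history lives in the cache."""
    global past_key_values
    input_ids = tokenizer(user_text, return_tensors="pt").input_ids
    generated = []
    with torch.no_grad():
        for _ in range(max_new_tokens):
            out = model(input_ids, past_key_values=past_key_values, use_cache=True)
            past_key_values = out.past_key_values  # cache grows; history is never re-decoded
            next_id = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
            if next_id.item() == tokenizer.eos_token_id:
                break
            generated.append(next_id.item())
            input_ids = next_id                    # the next step consumes one token only
    return tokenizer.decode(generated)

print(chat_turn("Hello, who are you?"))
print(chat_turn("And what can you do?"))  # the second turn reuses the cached k/v
```

Without the cache, every turn would have to re-decode the full concatenated history, which is exactly the repeated work this mode avoids.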
@@ -173,6 +158,8 @@ Then adjust `config.ini`
- `use_context_fmha` changed to 0, which turns it off
- `quant_policy` set to 4 (this parameter defaults to 0, which means the feature is disabled)
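As an illustration, here is a minimal sketch of these two edits, assuming `config.ini` is a standard INI file with a `[llama]` section; the file path and section name are assumptions, so adapt them to your deployed workspace.

```python
# Sketch: flip the two config.ini keys needed for KV int8 inference.
# Assumes an INI file with a [llama] section; path and section are assumptions.
import configparser

CONFIG_PATH = "workspace/config.ini"  # hypothetical location of the deployed config

config = configparser.ConfigParser()
config.read(CONFIG_PATH)

config["llama"]["use_context_fmha"] = "0"  # 0 turns context FMHA off
config["llama"]["quant_policy"] = "4"      # 4 enables the policy (default 0 = off)

with open(CONFIG_PATH, "w") as f:
    config.write(f)
```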
Here are the [quantization test results](./docs/zh_cn/quantization.md).
## Contributing
We appreciate all contributions to LMDeploy. Please refer to [CONTRIBUTING.md](.github/CONTRIBUTING.md) for the contributing guideline.
<div align="center">
<img src="resources/lmdeploy-logo.png" width="450"/>
<div>&nbsp;</div>
<div align="center">
<b><font size="5">OpenMMLab website</font></b>
<sup>
<a href="https://openmmlab.com">
<i><font size="4">HOT</font></i>
</a>
</sup>
&nbsp;&nbsp;&nbsp;&nbsp;
<b><font size="5">OpenMMLab platform</font></b>
<sup>
<a href="https://platform.openmmlab.com">
<i><font size="4">TRY IT OUT</font></i>
</a>
</sup>
</div>
<div>&nbsp;</div>
[![docs](https://img.shields.io/badge/docs-latest-blue)](https://lmdeploy.readthedocs.io/en/latest/)
[![codecov](https://codecov.io/gh/open-mmlab/lmdeploy/branch/main/graph/badge.svg)](https://codecov.io/gh/open-mmlab/lmdeploy)
@@ -50,19 +33,23 @@
## Introduction
LMDeploy, jointly developed by the [MMDeploy](https://github.com/open-mmlab/mmdeploy) and [MMRazor](https://github.com/open-mmlab/mmrazor) teams, is a full suite of lightweight compression, deployment, and serving solutions covering LLM tasks.
This powerful toolbox offers the following core features:
- **Efficient Inference Engine TurboMind**: Based on [FasterTransformer](https://github.com/NVIDIA/FasterTransformer), we have implemented the efficient inference engine TurboMind, which supports inference of LLaMA and its variant models on NVIDIA GPUs.
- **Interactive Inference Mode**: By caching the attention k/v during multi-turn dialogues, it remembers the dialogue history and thus avoids repeatedly processing historical sessions.
<div align="center">
<img src="https://github.com/NVIDIA/FasterTransformer/blob/main/docs/images/gpt/gpt_interactive_generation.2.png?raw=true"/>
</div>
- **Multi-GPU Model Deployment and Quantization**: We provide comprehensive model deployment and quantization support, validated on models ranging from 7B to 100B parameters.
- **Persistent Batch Inference**: Further optimizes model execution efficiency.
![PersistentBatchInference](https://github.com/open-mmlab/lmdeploy/assets/25839884/8f8b57b8-42af-4b71-ad74-e75f39b10694)
## Quick Start
@@ -169,6 +156,8 @@ python3 lmdeploy/app.py {server_ip_addresss}:33337 {model_name}
- change `use_context_fmha` to 0, which turns it off
- set `quant_policy` to 4 (this parameter defaults to 0, which means the feature is disabled)
Here are the [quantization test results](./docs/zh_cn/quantization.md).
## Contributing
We appreciate all the contributors for their efforts to improve LMDeploy. Please refer to the [contributing guide](.github/CONTRIBUTING.md) for guidance on participating in the project.
# PTQ Quantization Test Results
The test subject is an early internal 100B model. Although the model is not yet publicly released, the test data still shows the impact of the quantization method on it.
Test procedure:
1. Run `deploy.py` to split the 100B model across 8 GPUs
2. Run the quantization script to obtain the quantization parameters and place them in the weights directory (a minimal sketch of this step follows the list)
3. Modify the configuration file so that the [kCacheKVInt8](../src/turbomind/models/llama/llama_utils.h) option takes effect
4. Run the test datasets and compare accuracy and GPU memory usage against the fp16 version
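As a rough illustration of step 2, the sketch below derives symmetric per-tensor int8 scales from calibration activations and saves them as quantization parameters; the real quantization script, tensor shapes, and file names are assumptions and may differ.

```python
# Sketch of step 2: compute symmetric int8 absmax scales for cached k/v tensors
# and store them next to the weights. Shapes and file names are assumptions.
import numpy as np

def absmax_scale(tensor: np.ndarray) -> float:
    """Symmetric int8 scale mapping [-absmax, absmax] onto [-127, 127]."""
    return float(np.abs(tensor).max()) / 127.0

def quantize_kv(tensor: np.ndarray) -> tuple[np.ndarray, float]:
    scale = absmax_scale(tensor)
    q = np.clip(np.round(tensor / scale), -127, 127).astype(np.int8)
    return q, scale  # dequantize later with q.astype(np.float32) * scale

# Fake calibration data standing in for one layer's k activations.
rng = np.random.default_rng(0)
k_sample = rng.standard_normal((8, 128, 128)).astype(np.float32)
_, k_scale = quantize_kv(k_sample)
np.save("weights/layers.0.past_kv_scale.npy", np.array([k_scale]))  # hypothetical file name
```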
## GPU Memory Reduction
As batch_size increases, `kCacheKVInt8` saves more GPU memory, which lowers deployment cost.
| batch | int8 memory (GB/GPU) | fp16 memory (GB/GPU) |
| :-: | :-: | :-: |
| 16 | 40 | 43 |
| 32 | 48 | 60 |
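The saving grows with batch size because the k/v cache is allocated per sequence. A back-of-the-envelope estimator follows; every model dimension in it is a stand-in assumption, since the 100B model's real shapes are not public.

```python
# Back-of-the-envelope k/v cache size: 2 tensors (k and v) per layer, each of
# shape [batch, heads, seq_len, head_dim], sharded over tp GPUs.
# All dimensions below are stand-in assumptions.
def kv_cache_gib_per_gpu(batch: int, bytes_per_elem: int, layers: int = 80,
                         heads: int = 64, head_dim: int = 128,
                         seq_len: int = 2048, tp: int = 8) -> float:
    elems = 2 * layers * heads * head_dim * seq_len * batch
    return elems * bytes_per_elem / tp / 1024**3

for batch in (16, 32):
    fp16 = kv_cache_gib_per_gpu(batch, 2)  # fp16: 2 bytes per element
    int8 = kv_cache_gib_per_gpu(batch, 1)  # int8: 1 byte per element
    print(f"batch={batch}: fp16 ~{fp16:.1f} GiB/GPU, int8 ~{int8:.1f} GiB/GPU, "
          f"saved ~{fp16 - int8:.1f} GiB/GPU")
```

With these stand-in shapes, int8 halves the cache cost; the measured figures above also include weights and activations, so the absolute numbers differ.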
## Accuracy Impact
The following shows the accuracy impact of `kCacheKVInt8` on other datasets when calibrated only with the c4 dataset; the `diff` column is the metric change relative to the fp16 baseline, and the values are for reference only.
| task | dataset | version | metric | diff |
| :-: | :-: | :-: | :-: | :-: |
| Exam | ceval | - | avg_accuracy | -0.43 |
| Exam | ceval-hard | - | avg_accuracy | 2.24 |
| ChineseUniversal | CMRC_dev | v1-65aa5c | score | -2.99 |
| ChineseUniversal | DRCD_dev | v1-65aa5c | score | -1.14 |
| ChineseUniversal | afqmc-dev | v1-bbbabc | accuracy | 1.67 |
| ChineseUniversal | bustm-dev | v1-ecded6 | accuracy | 10.62 |
| ChineseUniversal | bustm-test | v1-ecded6 | accuracy | 14.90 |
| ChineseUniversal | chid-dev | v1-ffc5eb | accuracy | -5.94 |
| ChineseUniversal | chid-test | v1-ffc5eb | accuracy | -4.19 |
| ChineseUniversal | cluewsc-dev | v1-b88a63 | accuracy | -4.40 |
| ChineseUniversal | cluewsc-test | v1-b88a63 | accuracy | -2.56 |
| ChineseUniversal | eprstmt-dev | v1-99cf6f | accuracy | 1.87 |
| ChineseUniversal | eprstmt-test | v1-99cf6f | accuracy | 1.48 |
| Completion | lambada | v1-678ebd | accuracy | -1.65 |
| Completion | story_cloze | v1-f92a41 | accuracy | -0.11 |
| EnglishUniversal | AX_b | v1-78e4c2 | accuracy | -1.27 |
| EnglishUniversal | AX_g | v1-ccfc17 | accuracy | -2.81 |
| EnglishUniversal | BoolQ | v1-2c7cf3 | accuracy | -4.22 |
| EnglishUniversal | CB | v1-f60fbb | accuracy | 0.00 |
| EnglishUniversal | COPA | v1-d3a03c | accuracy | -2.00 |
| EnglishUniversal | MultiRC | v1-560d31 | accuracy | -8.79 |
| EnglishUniversal | ReCoRD | v1-5a2219 | score | -2.09 |
| EnglishUniversal | RTE | v1-ccfc17 | accuracy | -3.25 |
| EnglishUniversal | WiC | v1-019721 | accuracy | -6.74 |
| EnglishUniversal | WSC | v1-57571c | accuracy | -5.77 |
| EnglishUniversal | race-middle | v1-0c5c3c | accuracy | -1.19 |
| EnglishUniversal | race-high | v1-0c5c3c | accuracy | -1.06 |
| Reasoning | gsm8k_main | v1-3d5be1 | accuracy | -8.80 |
| QA | hellaswag | v1-3e134d | accuracy | -1.45 |
| QA | piqa | v1-362133 | accuracy | -1.53 |
| QA | winogrande | v1-a2f53f | accuracy | -0.79 |
| QA | openbookqa | v1-8587d7 | accuracy | -7.00 |
| QA | openbookqa_fact | v1-4e92f0 | accuracy | -14.00 |
| QA | nq | v1-d2370e | score | -2.16 |
| QA | triviaqa | v1-ead882 | score | -0.43 |
| Security | crows_pairs | v1-8fe12f | accuracy | 11.08 |