Unverified Commit f44ef17c authored by tpoisonooo, committed by GitHub

docs(quantization): update description (#272)

parent c238f1cd
@@ -222,25 +222,11 @@ python3 -m lmdeploy.lite.apis.auto_awq \
--work_dir $WORK_DIR \ # Directory saving quantization parameters from Step 1
```
#### KV Cache INT8 Quantization
In fp16 mode, kv_cache int8 quantization can be enabled so that a single GPU can serve more users.
First, run the quantization script; the quantization parameters are stored in the `workspace/triton_models/weights` directory produced by the `deploy.py` conversion.
```
python3 -m lmdeploy.lite.apis.kv_qparams \
--work_dir $WORK_DIR \ # Directory saving quantization parameters from Step 1
--turbomind_dir $TURBOMIND_DIR \
--kv_sym False \ # Whether to use symmetric or asymmetric quantization.
--num_tp 1 \ # The number of GPUs used for tensor parallelism
```
[Click here](./docs/zh_cn/w4a16.md) to view the usage and test results for weight int4.
Then adjust `workspace/triton_models/weights/config.ini`:
- Set `use_context_fmha` to 0, which disables it
- Set `quant_policy` to 4 (this parameter defaults to 0, which means disabled)
Here are the [quantization test results](./docs/en/kv_int8.md).
[Click here](./docs/zh_cn/kv_int8.md) to view the usage method, implementation formula, and test results for kv int8.
> **Warning**<br />
> Runtime Tensor Parallelism is not available for the quantized model. Please set `--tp` during `deploy` to enable static TP.
@@ -220,26 +220,11 @@ python3 -m lmdeploy.lite.apis.auto_awq \
--work_dir $WORK_DIR \ # Directory saving quantization parameters from Step 1
```
#### KV Cache INT8 Quantization
First, export the quantization parameters in TurboMind format (KV Cache INT8 quantization requires `TurboMind`).
> `$TURBOMIND_DIR` is the `workspace/triton_models/weights` directory produced by the `deploy.py` conversion.
```
python3 -m lmdeploy.lite.apis.kv_qparams \
--work_dir $WORK_DIR \ # Directory saving quantization parameters from Step 1
--turbomind_dir $TURBOMIND_DIR \
--kv_sym False \ # Symmetric or asymmetric quantization, default is False
--num_tp 1 \ # Number of GPUs used for tensor parallelism, consistent with deploy.py
```
[Click here](./docs/zh_cn/w4a16.md) to view the usage and test results for weight int4.
Then adjust `workspace/triton_models/weights/config.ini`:
- Set `use_context_fmha` to 0, which disables it
- Set `quant_policy` to 4 (this parameter defaults to 0, which means disabled)
Here are the [quantization test results](./docs/zh_cn/kv_int8.md).
[Click here](./docs/zh_cn/kv_int8.md) to view the usage method, implementation formula, and test results for kv int8.
> **Warning**<br />
> Quantized deployment does not support runtime Tensor Parallelism. To use Tensor Parallelism, configure the `tp` parameter at deploy time.
# KV Cache Quantization and Test Results
## Benchmark the Graphics Memory Usage
For the LLaMa-7B fp16 model with a maximum sequence length of 2048, the server needs approximately 1030 MB of GPU memory to store the kv_cache for each concurrent session it creates. This means that even an A100 80G can serve only a limited number of users.
We take the [internlm-chat-7b](https://huggingface.co/internlm/internlm-chat-7b) instruction model as the benchmark target. The benchmark process is as follows:
1. Convert the model with `deploy.py`, set the maximum concurrency in `workspace`, and adjust the number of requests in `llama_config.ini`.
2. Compile and run `bin/llama_triton_example` to record the GPU memory usage of the fp16 model under different batch_size settings.
3. Run the quantization script to obtain the quantization parameters, then modify the config file so that [kCacheKVInt8](../../src/turbomind/models/llama/llama_utils.h) takes effect.
4. Re-run `bin/llama_triton_example` to record the GPU memory usage of the int8 model under different batch_size settings.
To reduce runtime GPU memory usage, we apply PTQ quantization to the kv cache, using the following formula:
```bash
zp = (min+max) / 2
scale = (max-min) / 255
quant: q = round( (f-zp) / scale)
dequant: f = q * scale + zp
```
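As a rough illustration, here is a minimal NumPy sketch of the formula above (the tensor and its min/max are made up for demonstration; this is not the library's internal code):

```python
import numpy as np

# Toy kv_cache slice; in a real run, min/max come from calibration statistics.
kv = np.random.randn(8, 128).astype(np.float32)
kv_min, kv_max = float(kv.min()), float(kv.max())

zp = (kv_min + kv_max) / 2         # zero point (asymmetric quantization)
scale = (kv_max - kv_min) / 255    # spread the observed range over 256 int8 levels

q = np.clip(np.round((kv - zp) / scale), -128, 127).astype(np.int8)   # quant
f = q.astype(np.float32) * scale + zp                                 # dequant

print("max abs reconstruction error:", np.abs(kv - f).max())
```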
## How to Enable KV Cache INT8
### **Step One**
Convert the Hugging Face model format to the TurboMind inference format to create a workspace directory.
```bash
python3 -m lmdeploy.serve.turbomind.deploy internlm-chat-7b /path/to/internlm-chat-7b
```
If you already have a workspace directory, skip this step.
### **Step Two**
Get the quantization parameters.
```bash
python3 -m lmdeploy.lite.apis.kv_qparams \
--work_dir /path/to/internlm-chat-7b \ # Directory of the Hugging Face model
--turbomind_dir workspace/triton_models/weights/ \ # Directory to save the quantization parameters
--kv_sym False \ # Symmetric or asymmetric quantization, default is False
--num_tp 1 \ # Number of GPUs used for Tensor parallelization, keep it consistent with deploy.py
```
`kv_qparams` will generate fp32 scaling factors in the `weights` directory. The file format is a binary produced by `numpy.tofile`.
You can also set `turbomind_dir` to a private directory first, then copy the scaling factors into `workspace/triton_models/weights/`.
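To inspect one of the generated scale files, here is a minimal sketch using `numpy.fromfile` (the file name below is only a hypothetical example; list the `weights` directory to see the actual names):

```python
import numpy as np

# Hypothetical file name for illustration; check the weights directory for the real ones.
path = "workspace/triton_models/weights/layers.0.past_kv_scale.0.weight"
scales = np.fromfile(path, dtype=np.float32)  # raw binary written by numpy.tofile
print(scales.shape, scales)
```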
### **Step Three**
Modify `workspace/triton_models/weights/config.ini`:
- Set `use_context_fmha` to 0, which turns off flashattention
- Set `quant_policy` to 4, which enables kv_cache int8
Flashattention is turned off because it has two versions, v1 and v2, and kv_cache int8 also previously had a symmetric implementation. Covering every combination would require four kernel variants, and premature optimization while the algorithm is still unsettled would be disastrous for the software. The comparison between the fp16 and int8 versions of the model is shown in the GPU memory test below.
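If you prefer to script this edit, here is a minimal sketch using Python's `configparser` (the section name `llama` is an assumption; open the file first to confirm how it is laid out):

```python
import configparser

path = "workspace/triton_models/weights/config.ini"
cfg = configparser.ConfigParser()
cfg.read(path)

section = "llama"                        # assumed section name; adjust if yours differs
cfg[section]["use_context_fmha"] = "0"   # turn off flashattention
cfg[section]["quant_policy"] = "4"       # enable kv_cache int8

with open(path, "w") as f:
    cfg.write(f)
```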
### **Step Four**
Test the chat performance.
```bash
python3 -m lmdeploy.turbomind.chat ./workspace
```
## GPU Memory Test
The test object is the [internlm-chat-7b](https://huggingface.co/internlm/internlm-chat-7b) model.
Testing method:
1. Use `deploy.py` to convert the model; set the maximum concurrency in the `workspace` configuration and adjust the number of requests in `llama_config.ini`.
2. Compile and run `bin/llama_triton_example` to obtain the GPU memory usage of the fp16 version under different batch_size settings.
3. Enable quantization and re-run `bin/llama_triton_example` to obtain the GPU memory usage of the int8 version under different batch_size settings.
Below shows the comparison of GPU memory between the two versions:
| batch_size | fp16 memory(MiB) | int8 memory(MiB) | diff(MiB) |
| :--------: | :--------------: | :--------------: | :-------: |
@@ -18,33 +76,17 @@ Here is the benchmark result between the two versions of the model:
| 32 | 47073 | 30625 | -16448 |
| 48 | 63553 | 38881 | -24672 |
To compare with weight-only quantization methods such as [GPTQ-for-LLaMa](https://github.com/qwopqwop200/GPTQ-for-LLaMa/), we also estimated the memory growth of the 7B model under both approaches, with part of the data taken from [llama.cpp](https://github.com/ggerganov/llama.cpp). Here is the result:
![](../../resources/batch_memory.png)
As can be seen, each concurrent session of the fp16 version requires 1030 MB of GPU memory to hold the kv_cache for 2048 tokens, so quantizing the kv_cache significantly slows the growth of runtime memory.
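As a back-of-the-envelope check of the ~1030 MB figure, assuming the standard LLaMA-7B shape (32 layers, hidden size 4096) and fp16 values:

```python
# K and V caches in fp16 (2 bytes per value) for a full 2048-token context.
layers, hidden, bytes_per_val, max_len = 32, 4096, 2, 2048

per_token = 2 * layers * hidden * bytes_per_val   # K + V for a single token
per_session = per_token * max_len                 # a full 2048-token context
print(per_token / 2**20, "MiB per token")         # 0.5 MiB
print(per_session / 2**20, "MiB per session")     # 1024 MiB, close to the measured ~1030 MB
```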
## Accuracy Test
The test object is the [internlm-chat-7b](https://huggingface.co/internlm/internlm-chat-7b) instruction model.
The quantization method is PTQ; the related formula is as follows:
```
zp = (min+max) / 2
scale = (max-min) / 255
quant: q = round( (f-zp) / scale)
dequant: f = q * scale + zp
```
The benchmark process is as follows:
1. Convert the model with `deploy.py` and launch the docker service.
2. Test the fp16 accuracy on the dataset with `client.py`.
3. Run the quantization script to obtain the quantization parameters and place them in the weights directory; then modify the configuration file so that the [kCacheKVInt8](../../src/turbomind/models/llama/llama_utils.h) option takes effect.
4. Run `client.py` again to obtain the int8 accuracy.
The table below shows the results of the `kCacheKVInt8` method, calibrated with only 128 samples randomly selected from the c4 dataset; accuracy before and after quantization was measured with [opencompass](https://github.com/InternLM/opencompass).
| task | dataset | metric | int8 | fp16 | diff |
| :-----------: | :-------------: | :-----------: | :---: | :---: | :---: |
@@ -55,3 +97,5 @@ The following table is the precision result obtained by the `kCacheKVInt8` metho
| Understanding | openbookqa_fact | accuracy | 82.40 | 82.20 | +0.20 |
| Understanding | eprstmt-dev | accuracy | 90.62 | 88.75 | +1.87 |
| Safety | crows_pairs | accuracy | 32.56 | 31.43 | +1.13 |
Note that both `kCacheKVInt8` and `WeightInt4` methods can be enabled at the same time.
# KV Cache Quantization and Test Results
For the LLaMa-7B fp16 model with a maximum length of 2048, the server needs about 1030 MB of GPU memory to store the kv_cache for each concurrent session it creates, so even an A100 80G can serve only a very limited number of users.
To reduce runtime GPU memory usage, we implemented PTQ quantization for the kv cache, using the following formula:
```bash
zp = (min+max) / 2
scale = (max-min) / 255
quant: q = round( (f-zp) / scale)
dequant: f = q * scale + zp
```
## How to Enable KV Cache INT8
### **Step One**
Convert the huggingface-format model to the turbomind inference format to obtain a workspace directory:
```bash
python3 -m lmdeploy.serve.turbomind.deploy internlm-chat-7b /path/to/internlm-chat-7b
```
If you already have a workspace directory, skip this step.
### **Step Two**
Obtain the quantization parameters:
```bash
python3 -m lmdeploy.lite.apis.kv_qparams \
--work_dir /path/to/internlm-chat-7b \ # Directory of the huggingface model
--turbomind_dir workspace/triton_models/weights/ \ # Directory to save the quantization parameters
--kv_sym False \ # Symmetric or asymmetric quantization, default is False
--num_tp 1 \ # Number of GPUs used for tensor parallelism, consistent with deploy.py
```
`kv_qparams` will generate fp32 scaling factors in the `weights` directory; the file format is a binary produced by `numpy.tofile`.
You can also set `turbomind_dir` to a private directory first, and then copy the scaling factors into `workspace/triton_models/weights/`.
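Since `--kv_sym` chooses between symmetric and asymmetric quantization, here is a small sketch of the difference under the common convention (the asymmetric branch matches the PTQ formula above; the symmetric variant shown is an assumption, not necessarily the exact in-library formula):

```python
import numpy as np

x = np.random.randn(1024).astype(np.float32)
x_min, x_max = float(x.min()), float(x.max())

# Asymmetric (kv_sym=False): zero point centred on the observed range.
zp = (x_min + x_max) / 2
scale_a = (x_max - x_min) / 255
q_a = np.clip(np.round((x - zp) / scale_a), -128, 127).astype(np.int8)

# Symmetric (kv_sym=True): zero point fixed at 0, range set by the larger magnitude.
scale_s = max(abs(x_min), abs(x_max)) / 127
q_s = np.clip(np.round(x / scale_s), -127, 127).astype(np.int8)

print("asymmetric max error:", np.abs(x - (q_a.astype(np.float32) * scale_a + zp)).max())
print("symmetric  max error:", np.abs(x - q_s.astype(np.float32) * scale_s).max())
```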
### **Step Three**
Modify `workspace/triton_models/weights/config.ini`:
- Set use_context_fmha to 0, which turns off flashattention
- Set quant_policy to 4, which enables kv_cache int8
This is because flashattention has two versions, v1 and v2, and kv cache int8 also previously had a symmetric implementation. Covering every combination would require four kernel variants, and premature optimization while the algorithm is still unsettled would be disastrous for the software.
### **Step Four**
Test the chat performance:
```bash
python3 -m lmdeploy.turbomind.chat ./workspace
```
## GPU Memory Test
@@ -7,8 +65,7 @@
1. Use `deploy.py` to convert the model; set the maximum concurrency in the `workspace` configuration and adjust the number of requests in `llama_config.ini`.
2. Compile and run `bin/llama_triton_example` to obtain the GPU memory usage of the fp16 version under different batch_size settings.
3. Run the quantization script to obtain the quantization parameters, then modify the configuration file so that the [kCacheKVInt8](../../src/turbomind/models/llama/llama_utils.h) option takes effect.
4. Re-run `bin/llama_triton_example` to obtain the GPU memory usage of the int8 version under different batch_size settings.
Below is the GPU memory comparison between the two versions:
@@ -23,30 +80,13 @@
![](../../resources/batch_memory.png)
As can be seen, each concurrent session of the fp16 version requires 1030 MB of GPU memory to hold the kv_cache for 2048 tokens, so quantizing the kv_cache significantly slows the growth of runtime memory.
## Accuracy Test
The quantization method is PTQ; the related formula is as follows:
```
zp = (min+max) / 2
scale = (max-min) / 255
quant: q = round( (f-zp) / scale)
dequant: f = q * scale + zp
```
The test object is the [internlm-chat-7b](https://huggingface.co/internlm/internlm-chat-7b) instruction model.
Testing method:
1. Use `deploy.py` to convert the model and launch the docker service.
2. Test the dataset with `client.py` to obtain the fp16 accuracy.
3. Run the quantization script to obtain the quantization parameters and place them in the weights directory; modify the configuration file so that the [kCacheKVInt8](../../src/turbomind/models/llama/llama_utils.h) option takes effect.
4. Run `client.py` again to obtain the int8 accuracy.
The table below shows the results of the `kCacheKVInt8` method, calibrated with only 128 samples randomly selected from the c4 dataset; accuracy before and after quantization was measured with [opencompass](https://github.com/InternLM/opencompass).
| task | dataset | metric | int8 | fp16 | diff |
| :-----------: | :-------------: | :-----------: | :---: | :---: | :---: |
@@ -57,3 +97,5 @@ dequant: f = q * scale + zp
| Understanding | openbookqa_fact | accuracy | 82.40 | 82.20 | +0.20 |
| Understanding | eprstmt-dev | accuracy | 90.62 | 88.75 | +1.87 |
| Safety | crows_pairs | accuracy | 32.56 | 31.43 | +1.13 |
Note that the `kCacheKVInt8` and `WeightInt4` schemes can be enabled at the same time.