Unverified Commit f44ef17c authored by tpoisonooo, committed by GitHub

docs(quantization): update description (#272)

parent c238f1cd
@@ -222,25 +222,11 @@ python3 -m lmdeploy.lite.apis.auto_awq \
--work_dir $WORK_DIR \ # Directory saving quantization parameters from Step 1
```
#### KV Cache INT8 Quantization
In fp16 mode, kv_cache int8 quantization can be enabled so that a single GPU can serve more users.
First, run the quantization script; the quantization parameters are stored in the `workspace/triton_models/weights` directory produced by the `deploy.py` conversion.
```
python3 -m lmdeploy.lite.apis.kv_qparams \
--work_dir $WORK_DIR \ # Directory saving quantization parameters from Step 1
--turbomind_dir $TURBOMIND_DIR \
--kv_sym False \ # Whether to use symmetric or asymmetric quantization.
--num_tp 1 \ # The number of GPUs used for tensor parallelism
```
[Click here](./docs/zh_cn/w4a16.md) to view the usage and test results for weight int4.
Then adjust `workspace/triton_models/weights/config.ini`:
- Set `use_context_fmha` to 0, which disables it
- Set `quant_policy` to 4 (this parameter defaults to 0, which means disabled)
Here are the [quantization test results](./docs/en/kv_int8.md).
[Click here](./docs/zh_cn/kv_int8.md) to view the usage method, implementation formula, and test results for kv int8.
> **Warning**<br />
> Runtime Tensor Parallelism is not available for the quantized model. Please set `--tp` during `deploy` to enable static TP.
@@ -220,26 +220,11 @@ python3 -m lmdeploy.lite.apis.auto_awq \
--work_dir $WORK_DIR \ # Directory saving quantization parameters from Step 1
```
#### KV Cache INT8 Quantization
First, export the quantization parameters in TurboMind format (KV Cache INT8 quantization requires `TurboMind`).
> `$TURBOMIND_DIR` is the `workspace/triton_models/weights` directory produced by the `deploy.py` conversion.
```
python3 -m lmdeploy.lite.apis.kv_qparams \
--work_dir $WORK_DIR \ # Directory saving quantization parameters from Step 1
--turbomind_dir $TURBOMIND_DIR \
--kv_sym False \ # Symmetric or asymmetric quantization, default is False
--num_tp 1 \ # Number of GPUs used for tensor parallelism, consistent with deploy.py
```
[Click here](./docs/zh_cn/w4a16.md) to view the usage and test results for weight int4.
Then adjust `workspace/triton_models/weights/config.ini`:
- Set `use_context_fmha` to 0, which disables it
- Set `quant_policy` to 4 (this parameter defaults to 0, which means disabled)
Here are the [quantization test results](./docs/zh_cn/kv_int8.md).
[Click here](./docs/zh_cn/kv_int8.md) to view the usage method, implementation formula, and test results for kv int8.
> **Warning**<br />
> Quantized deployment does not support runtime Tensor Parallelism. To use Tensor Parallelism, configure the `tp` parameter at deploy time.
# KV Cache Quantization and Test Results
## Benchmark the Graphics Memory Usage
For the LLaMa-7B fp16 model with a maximum sequence length of 2048, the server needs approximately 1030 MB of GPU memory to store the kv_cache for each concurrent session it creates. This means that even an A100 80G can serve only a limited number of users.
We take the [internlm-chat-7b](https://huggingface.co/internlm/internlm-chat-7b) instruction model as the benchmark target. The benchmark process is as follows:
1. Convert the model with `deploy.py`, set the maximum concurrency in `workspace`, and adjust the number of requests in `llama_config.ini`.
2. Compile and run `bin/llama_triton_example` to record the GPU memory usage of the fp16 model under different batch_size settings.
3. Run the quantization script to obtain the quantization parameters, then modify the config file so that [kCacheKVInt8](../../src/turbomind/models/llama/llama_utils.h) takes effect.
4. Re-run `bin/llama_triton_example` to record the GPU memory usage of the int8 model under different batch_size settings.
To reduce runtime GPU memory usage, we apply PTQ quantization to the kv cache, using the following formula:
```bash
zp = (min+max) / 2
scale = (max-min) / 255
quant: q = round( (f-zp) / scale)
dequant: f = q * scale + zp
```
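As a rough illustration, here is a minimal NumPy sketch of the formula above (the tensor and its min/max are made up for demonstration; this is not the library's internal code):

```python
import numpy as np

# Toy kv_cache slice; in a real run, min/max come from calibration statistics.
kv = np.random.randn(8, 128).astype(np.float32)
kv_min, kv_max = float(kv.min()), float(kv.max())

zp = (kv_min + kv_max) / 2         # zero point (asymmetric quantization)
scale = (kv_max - kv_min) / 255    # spread the observed range over 256 int8 levels

q = np.clip(np.round((kv - zp) / scale), -128, 127).astype(np.int8)   # quant
f = q.astype(np.float32) * scale + zp                                 # dequant

print("max abs reconstruction error:", np.abs(kv - f).max())
```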
## How to Enable KV Cache INT8
### **Step One**
Convert the Hugging Face model format to the TurboMind inference format to create a workspace directory.
```bash
python3 -m lmdeploy.serve.turbomind.deploy internlm-chat-7b /path/to/internlm-chat-7b
```
If you already have a workspace directory, skip this step.
### **Step Two**
Get the quantization parameters.
```bash
python3 -m lmdeploy.lite.apis.kv_qparams \
--work_dir /path/to/internlm-chat-7b \ # Directory of the Hugging Face model
--turbomind_dir workspace/triton_models/weights/ \ # Directory to save the quantization parameters
--kv_sym False \ # Symmetric or asymmetric quantization, default is False
--num_tp 1 \ # Number of GPUs used for Tensor parallelization, keep it consistent with deploy.py
```
`kv_qparams` will generate fp32 scaling factors in the `weights` directory. The file format is a binary produced by `numpy.tofile`.
You can also set `turbomind_dir` to a private directory first, then copy the scaling factors into `workspace/triton_models/weights/`.
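To inspect one of the generated scale files, here is a minimal sketch using `numpy.fromfile` (the file name below is only a hypothetical example; list the `weights` directory to see the actual names):

```python
import numpy as np

# Hypothetical file name for illustration; check the weights directory for the real ones.
path = "workspace/triton_models/weights/layers.0.past_kv_scale.0.weight"
scales = np.fromfile(path, dtype=np.float32)  # raw binary written by numpy.tofile
print(scales.shape, scales)
```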
### **Step Three**
Modify `workspace/triton_models/weights/config.ini`:
- Set `use_context_fmha` to 0, which turns off flashattention
- Set `quant_policy` to 4, which enables kv_cache int8
Flashattention is turned off because it has two versions, v1 and v2, and kv_cache int8 also previously had a symmetric implementation. Covering every combination would require four kernel variants, and premature optimization while the algorithm is still unsettled would be disastrous for the software. The comparison between the fp16 and int8 versions of the model is shown in the GPU memory test below.
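If you prefer to script this edit, here is a minimal sketch using Python's `configparser` (the section name `llama` is an assumption; open the file first to confirm how it is laid out):

```python
import configparser

path = "workspace/triton_models/weights/config.ini"
cfg = configparser.ConfigParser()
cfg.read(path)

section = "llama"                        # assumed section name; adjust if yours differs
cfg[section]["use_context_fmha"] = "0"   # turn off flashattention
cfg[section]["quant_policy"] = "4"       # enable kv_cache int8

with open(path, "w") as f:
    cfg.write(f)
```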
### **Step Four**
Test the chat performance.
```bash
python3 -m lmdeploy.turbomind.chat ./workspace
```
## GPU Memory Test
The test object is the [internlm-chat-7b](https://huggingface.co/internlm/internlm-chat-7b) model.
Testing method:
1. Use `deploy.py` to convert the model; set the maximum concurrency in the `workspace` configuration and adjust the number of requests in `llama_config.ini`.
2. Compile and run `bin/llama_triton_example` to obtain the GPU memory usage of the fp16 version under different batch_size settings.
3. Enable quantization and re-run `bin/llama_triton_example` to obtain the GPU memory usage of the int8 version under different batch_size settings.
Below shows the comparison of GPU memory between the two versions:
| batch_size | fp16 memory(MiB) | int8 memory(MiB) | diff(MiB) |
| :--------: | :--------------: | :--------------: | :-------: |
@@ -18,33 +76,17 @@ Here is the benchmark result between the two versions of the model:
| 32 | 47073 | 30625 | -16448 |
| 48 | 63553 | 38881 | -24672 |
To compare with weight-only quantization methods such as [GPTQ-for-LLaMa](https://github.com/qwopqwop200/GPTQ-for-LLaMa/), we also estimated the memory growth of the 7B model under both approaches, with part of the data taken from [llama.cpp](https://github.com/ggerganov/llama.cpp). Here is the result:
![](../../resources/batch_memory.png)
As can be seen, each concurrent session of the fp16 version requires 1030 MB of GPU memory to hold the kv_cache for 2048 tokens, so quantizing the kv_cache significantly slows the growth of runtime memory.
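As a back-of-the-envelope check of the ~1030 MB figure, assuming the standard LLaMA-7B shape (32 layers, hidden size 4096) and fp16 values:

```python
# K and V caches in fp16 (2 bytes per value) for a full 2048-token context.
layers, hidden, bytes_per_val, max_len = 32, 4096, 2, 2048

per_token = 2 * layers * hidden * bytes_per_val   # K + V for a single token
per_session = per_token * max_len                 # a full 2048-token context
print(per_token / 2**20, "MiB per token")         # 0.5 MiB
print(per_session / 2**20, "MiB per session")     # 1024 MiB, close to the measured ~1030 MB
```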
## Accuracy Test
The test object is the [internlm-chat-7b](https://huggingface.co/internlm/internlm-chat-7b) instruction model.
The quantization method is PTQ; the related formula is as follows:
```
zp = (min+max) / 2
scale = (max-min) / 255
quant: q = round( (f-zp) / scale)
dequant: f = q * scale + zp
```
The benchmark process is as follows:
1. Convert the model with `deploy.py` and launch the docker service.
2. Test the fp16 accuracy on the dataset with `client.py`.
3. Run the quantization script to obtain the quantization parameters and place them in the weights directory; then modify the configuration file so that the [kCacheKVInt8](../../src/turbomind/models/llama/llama_utils.h) option takes effect.
4. Run `client.py` again to obtain the int8 accuracy.
The table below shows the results of the `kCacheKVInt8` method, calibrated with only 128 samples randomly selected from the c4 dataset; accuracy before and after quantization was measured with [opencompass](https://github.com/InternLM/opencompass).
| task | dataset | metric | int8 | fp16 | diff |
| :-----------: | :-------------: | :-----------: | :---: | :---: | :---: |
@@ -55,3 +97,5 @@ The following table is the precision result obtained by the `kCacheKVInt8` metho
| Understanding | openbookqa_fact | accuracy | 82.40 | 82.20 | +0.20 |
| Understanding | eprstmt-dev | accuracy | 90.62 | 88.75 | +1.87 |
| Safety | crows_pairs | accuracy | 32.56 | 31.43 | +1.13 |
Note that both `kCacheKVInt8` and `WeightInt4` methods can be enabled at the same time.
# KV Cache Quantization and Test Results
For the LLaMa-7B fp16 model with a maximum length of 2048, the server needs about 1030 MB of GPU memory to store the kv_cache for each concurrent session it creates, so even an A100 80G can serve only a very limited number of users.
To reduce runtime GPU memory usage, we implemented PTQ quantization for the kv cache, using the following formula:
```bash
zp = (min+max) / 2
scale = (max-min) / 255
quant: q = round( (f-zp) / scale)
dequant: f = q * scale + zp
```
## How to Enable KV Cache INT8
### **Step One**
Convert the huggingface-format model to the turbomind inference format to obtain a workspace directory:
```bash
python3 -m lmdeploy.serve.turbomind.deploy internlm-chat-7b /path/to/internlm-chat-7b
```
If you already have a workspace directory, skip this step.
### **Step Two**
Obtain the quantization parameters:
```bash
python3 -m lmdeploy.lite.apis.kv_qparams \
--work_dir /path/to/internlm-chat-7b \ # Directory of the huggingface model
--turbomind_dir workspace/triton_models/weights/ \ # Directory to save the quantization parameters
--kv_sym False \ # Symmetric or asymmetric quantization, default is False
--num_tp 1 \ # Number of GPUs used for tensor parallelism, consistent with deploy.py
```
`kv_qparams` will generate fp32 scaling factors in the `weights` directory; the file format is a binary produced by `numpy.tofile`.
You can also set `turbomind_dir` to a private directory first, and then copy the scaling factors into `workspace/triton_models/weights/`.
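Since `--kv_sym` chooses between symmetric and asymmetric quantization, here is a small sketch of the difference under the common convention (the asymmetric branch matches the PTQ formula above; the symmetric variant shown is an assumption, not necessarily the exact in-library formula):

```python
import numpy as np

x = np.random.randn(1024).astype(np.float32)
x_min, x_max = float(x.min()), float(x.max())

# Asymmetric (kv_sym=False): zero point centred on the observed range.
zp = (x_min + x_max) / 2
scale_a = (x_max - x_min) / 255
q_a = np.clip(np.round((x - zp) / scale_a), -128, 127).astype(np.int8)

# Symmetric (kv_sym=True): zero point fixed at 0, range set by the larger magnitude.
scale_s = max(abs(x_min), abs(x_max)) / 127
q_s = np.clip(np.round(x / scale_s), -127, 127).astype(np.int8)

print("asymmetric max error:", np.abs(x - (q_a.astype(np.float32) * scale_a + zp)).max())
print("symmetric  max error:", np.abs(x - q_s.astype(np.float32) * scale_s).max())
```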
### **Step Three**
Modify `workspace/triton_models/weights/config.ini`:
- Set use_context_fmha to 0, which turns off flashattention
- Set quant_policy to 4, which enables kv_cache int8
This is because flashattention has two versions, v1 and v2, and kv cache int8 also previously had a symmetric implementation. Covering every combination would require four kernel variants, and premature optimization while the algorithm is still unsettled would be disastrous for the software.
### **Step Four**
Test the chat performance:
```bash
python3 -m lmdeploy.turbomind.chat ./workspace
```
## GPU Memory Test
@@ -7,8 +65,7 @@
1. Use `deploy.py` to convert the model; set the maximum concurrency in the `workspace` configuration and adjust the number of requests in `llama_config.ini`.
2. Compile and run `bin/llama_triton_example` to obtain the GPU memory usage of the fp16 version under different batch_size settings.
3. Run the quantization script to obtain the quantization parameters, then modify the configuration file so that the [kCacheKVInt8](../../src/turbomind/models/llama/llama_utils.h) option takes effect.
4. Re-run `bin/llama_triton_example` to obtain the GPU memory usage of the int8 version under different batch_size settings.
Below is the GPU memory comparison between the two versions:
@@ -23,30 +80,13 @@
![](../../resources/batch_memory.png)
As can be seen, each concurrent session of the fp16 version requires 1030 MB of GPU memory to hold the kv_cache for 2048 tokens, so quantizing the kv_cache significantly slows the growth of runtime memory.
## Accuracy Test
The quantization method is PTQ; the related formula is as follows:
```
zp = (min+max) / 2
scale = (max-min) / 255
quant: q = round( (f-zp) / scale)
dequant: f = q * scale + zp
```
The test object is the [internlm-chat-7b](https://huggingface.co/internlm/internlm-chat-7b) instruction model.
Testing method:
1. Use `deploy.py` to convert the model and launch the docker service.
2. Test the dataset with `client.py` to obtain the fp16 accuracy.
3. Run the quantization script to obtain the quantization parameters and place them in the weights directory; modify the configuration file so that the [kCacheKVInt8](../../src/turbomind/models/llama/llama_utils.h) option takes effect.
4. Run `client.py` again to obtain the int8 accuracy.
The table below shows the results of the `kCacheKVInt8` method, calibrated with only 128 samples randomly selected from the c4 dataset; accuracy before and after quantization was measured with [opencompass](https://github.com/InternLM/opencompass).
| task | dataset | metric | int8 | fp16 | diff |
| :-----------: | :-------------: | :-----------: | :---: | :---: | :---: |
@@ -57,3 +97,5 @@ dequant: f = q * scale + zp
| Understanding | openbookqa_fact | accuracy | 82.40 | 82.20 | +0.20 |
| Understanding | eprstmt-dev | accuracy | 90.62 | 88.75 | +1.87 |
| Safety | crows_pairs | accuracy | 32.56 | 31.43 | +1.13 |
Note that the `kCacheKVInt8` and `WeightInt4` schemes can be enabled at the same time.