Unverified Commit 3df43e8c authored by Xin Li, committed by GitHub

[Docs] Translate the quantization.md (#166)

* translate quantization doc

* revise
parent 63bd5916
@@ -164,7 +164,7 @@
Then adjust `workspace/triton_models/weights/config.ini`
- `use_context_fmha` is changed to 0, which means it is disabled
- `quant_policy` is set to 4. This parameter defaults to 0, which means it is not enabled (a short script applying both settings follows this list)
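A minimal sketch of applying these two settings programmatically, assuming the standard INI layout of `workspace/triton_models/weights/config.ini` and patching whichever section already defines the keys (the section name itself is not assumed):

```python
import configparser

# Hedged sketch: flip the two settings listed above in
# workspace/triton_models/weights/config.ini, patching whichever
# section already defines them rather than assuming a section name.
cfg_path = "workspace/triton_models/weights/config.ini"

cfg = configparser.ConfigParser()
cfg.read(cfg_path)

for section in cfg.sections():
    if cfg.has_option(section, "use_context_fmha"):
        cfg.set(section, "use_context_fmha", "0")  # turn context FMHA off
    if cfg.has_option(section, "quant_policy"):
        cfg.set(section, "quant_policy", "4")      # enable kCacheKVInt8

with open(cfg_path, "w") as f:
    cfg.write(f)
```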
Here are the [quantization test results](./docs/en/quantization.md).
## Contributing
......
# PTQ Quantization Benchmark Results
## GPU Memory Usage Benchmark
We take the [Chinese-LLaMA-Alpaca 7B](https://github.com/ymcui/Chinese-LLaMA-Alpaca) instruction model as the benchmark target. The benchmark process is as follows:
1. Convert the model with `deploy.py`, modify the maximum concurrency in `workspace`, and adjust the number of requests in `llama_config.ini`
2. Compile `bin/llama_triton_example` and measure the GPU memory usage of the fp16 model under different batch_size settings
3. Run the quantization script to obtain the quantization parameters, then modify the config file to enable the [kCacheKVInt8](../../src/turbomind/models/llama/llama_utils.h) option
4. Re-run `bin/llama_triton_example` and measure the GPU memory usage of the int8 model under the same batch_size settings (a measurement sketch follows this list)
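The per-batch numbers below can be collected with any GPU monitoring tool; the following is a minimal sketch (not the project's own benchmark script) that samples the used GPU memory via `nvidia-smi` while `bin/llama_triton_example` is running:

```python
import subprocess

def gpu_memory_used_mib(device_index: int = 0) -> int:
    """Query the currently used memory (MiB) of one GPU via nvidia-smi."""
    out = subprocess.check_output(
        [
            "nvidia-smi",
            f"--id={device_index}",
            "--query-gpu=memory.used",
            "--format=csv,noheader,nounits",
        ],
        text=True,
    )
    return int(out.strip().splitlines()[0])

# Example: sample the usage while bin/llama_triton_example is running
# and keep the peak value for the current batch_size setting.
print(f"GPU memory used: {gpu_memory_used_mib(0)} MiB")
```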
Here is the memory usage comparison between the fp16 and int8 versions of the model:
| batch_size | fp16 memory(MiB) | int8 memory(MiB) | diff(MiB) |
| :--------: | :--------------: | :--------------: | :-------: |
| 8 | 22337 | 18241 | -4096 |
| 16 | 30593 | 22369 | -8224 |
| 32 | 47073 | 30625 | -16448 |
| 48 | 63553 | 38881 | -24672 |
To compare with weight quantization methods such as [GPTQ-for-LLaMa](https://github.com/qwopqwop200/GPTQ-for-LLaMa/), we also benchmarked the memory usage of the 7B model under both approaches, with part of the data taken from [llama.cpp](https://github.com/ggerganov/llama.cpp). Here is the result:
![](../../resources/batch_memory.png)
Since each concurrent sequence requires about 1030 MB of GPU memory to hold the kv_cache for 2048 tokens, and the server side has to account for the cost of high-concurrency scenarios, quantizing the kv_cache is more appropriate than directly quantizing the weights.
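The ~1030 MB figure is consistent with a back-of-the-envelope estimate for a LLaMA-7B-shaped model (assuming 32 layers and hidden size 4096, with fp16 K/V values); storing the cache in int8 halves it, which matches the roughly 512 MiB saved per sequence in the table above:

```python
# Rough kv_cache size per sequence, assuming LLaMA-7B shapes
# (32 layers, hidden size 4096) and a 2048-token context.
num_layers = 32
hidden_size = 4096
seq_len = 2048

def kv_cache_mib(bytes_per_value: int) -> float:
    # 2 tensors per layer (K and V), one hidden-size vector per token
    return 2 * num_layers * seq_len * hidden_size * bytes_per_value / 2**20

print(f"fp16 kv_cache per sequence: {kv_cache_mib(2):.0f} MiB")  # ~1024 MiB
print(f"int8 kv_cache per sequence: {kv_cache_mib(1):.0f} MiB")  # ~512 MiB
# e.g. at batch_size 32 the table reports 16448 MiB saved, i.e. ~514 MiB per sequence
```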
Note that `kCacheKVInt8` and `WeightInt4` can be used simultaneously, and we will provide relevant implementations later.
## Accuracy Benchmark
Here we take the [Chinese-LLaMA-Alpaca 7B](https://github.com/ymcui/Chinese-LLaMA-Alpaca) instruction model as the benchmark target again. The benchmark process is as follows:
1. Convert the model with `deploy.py` and start the Docker service
2. Test the accuracy of the fp16 version on the dataset with `client.py`
3. Run the quantization script to obtain the quantization parameters and put them into the weights directory, then modify the configuration file to enable the [kCacheKVInt8](../../src/turbomind/models/llama/llama_utils.h) option
4. Run `client.py` again to get the accuracy of the int8 version (a scoring sketch follows this list)
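As a minimal sketch of how each run can be scored offline, assuming the predicted options of the fp16 and int8 runs have been dumped to lists (the variable names are hypothetical and not part of `client.py`):

```python
from typing import List

def score(predictions: List[str], references: List[str]) -> float:
    """Percentage of multiple-choice questions answered correctly."""
    assert len(predictions) == len(references)
    correct = sum(p == r for p, r in zip(predictions, references))
    return 100.0 * correct / len(references)

# Hypothetical answer dumps from the fp16 and int8 runs of client.py.
fp16_answers = ["A", "C", "B", "D"]
int8_answers = ["A", "B", "B", "D"]
references = ["A", "B", "B", "C"]

print(f"fp16 score: {score(fp16_answers, references):.2f}")
print(f"int8 score: {score(int8_answers, references):.2f}")
```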
The following table shows the accuracy obtained with the `kCacheKVInt8` method, calibrated on 128 randomly selected samples from the c4 dataset and evaluated on the mmlu-social-science dataset, which contains a total of 3065 multiple-choice questions:
| task | dataset | metric | fp16 | int8 | diff |
| :--: | :-----------------: | :----: | :---: | :---: | :---: |
| Exam | mmlu-social-science | score | 31.81 | 32.00 | +0.19 |
We noticed a slight improvement in accuracy; the differences break down as follows (a quick sanity check follows the table):
| Type | Number |
| :--------------------------------------------: | :----: |
| fp16 version failed but int8 version got it right | 72 |
| fp16 version got it right but int8 version failed | 66 |
| both fp16 and int8 versions failed | 118 |
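As a quick sanity check, the score difference follows directly from these counts: only the questions whose outcome flipped contribute, giving a net gain of 72 - 66 = 6 questions out of 3065:

```python
flipped_to_right = 72   # wrong in fp16, right in int8
flipped_to_wrong = 66   # right in fp16, wrong in int8
total_questions = 3065

diff = 100.0 * (flipped_to_right - flipped_to_wrong) / total_questions
print(f"score diff: {diff:+.2f}")  # +0.20, matching the table's +0.19 up to rounding
```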
We have validated the quantization implementation on more datasets and larger models and will keep updating the results.