Unverified Commit 903707b5 authored by tpoisonooo, committed by GitHub

docs(quantization): update description (#253)

* Update quantization.md

* docs(quantization): update description

* docs(README): rename quantization files
parent 4c9959f6
@@ -232,7 +232,7 @@ Then adjust `workspace/triton_models/weights/config.ini`
 - `use_context_fmha` changed to 0, means off
 - `quant_policy` is set to 4. This parameter defaults to 0, which means it is not enabled
-Here is [quantization test results](./docs/en/quantization.md).
+Here is [quantization test results](./docs/en/kv_int8.md).
 > **Warning**<br />
 > runtime Tensor Parallel for quantilized model is not available. Please setup `--tp` on `deploy` to enable static TP.
...
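The hunk above refers to the `config.ini` adjustment described earlier in the README: set `use_context_fmha` to 0 and `quant_policy` to 4. Below is a minimal sketch of making those two edits with Python's `configparser`; the `llama` section name is an assumption about the layout of `workspace/triton_models/weights/config.ini`, so verify it against your generated file (editing by hand works just as well).

```python
# Minimal sketch: apply the two config.ini edits described above.
# Assumption: the keys live in a section named "llama"; check your file.
import configparser

CONFIG_PATH = "workspace/triton_models/weights/config.ini"

config = configparser.ConfigParser()
config.read(CONFIG_PATH)

section = "llama"  # assumed section name
config[section]["use_context_fmha"] = "0"  # turn context FMHA off
config[section]["quant_policy"] = "4"      # enable the INT8 kv_cache path (default 0 = off)

with open(CONFIG_PATH, "w") as f:
    config.write(f)
```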
@@ -231,7 +231,7 @@ python3 -m lmdeploy.lite.apis.kv_qparams \
 - `use_context_fmha` changed to 0, meaning it is disabled
 - `quant_policy` is set to 4. This parameter defaults to 0, meaning it is not enabled
-Here are the [quantization test results](./docs/zh_cn/quantization.md)
+Here are the [quantization test results](./docs/zh_cn/kv_int8.md)
 > **Warning**<br />
 > Quantized deployment does not support runtime Tensor Parallelism. To use Tensor Parallelism, set the tp parameter at deploy time.
...
-# PTQ Quantization Benchmark Results
+# KV Cache Quantization Benchmark Results
 ## Benchmark the Graphics Memory Usage
@@ -22,9 +22,9 @@ To compare with the weight quantization method such as [GPTQ-for-LLaMa](https://
 ![](../../resources/batch_memory.png)
-Since each concurrency requires 1030MB of the graphics memory to save the kv_cache for 2048 tokens, and the server side needs to consider the cost of high concurrency scenarios, it is more appropriate to run kv_cache quantization rather than directly quantize the weights.
+It can be seen that each concurrency requires 1030MB of GPU memory to save kv_cache for 2048 tokens, so quantizing kv_cache can significantly reduce the speed of the memory growth at runtime.
-Note that `kCacheKVInt8` and `WeightInt4` can be used simultaneously, and we will provide relevant implementations later.
+Note that `kCacheKVInt8` and `WeightInt4` can be used simultaneously.
 ## Benchmark the Accuracy
...
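As a rough sanity check on the 1030MB figure above, here is a back-of-the-envelope kv_cache sizing for a LLaMA-7B-like configuration (32 layers, hidden size 4096, fp16 keys and values); the model dimensions are assumptions for illustration, not numbers taken from the benchmark.

```python
# Back-of-the-envelope kv_cache sizing for one concurrent request.
# Assumed LLaMA-7B-like dimensions; the 1030MB above is the measured value.
num_layers = 32
hidden_size = 4096
seq_len = 2048      # tokens cached per request
bytes_fp16 = 2
bytes_int8 = 1

def kv_cache_bytes(bytes_per_value: int) -> int:
    # factor of 2: both K and V are cached per layer
    return 2 * num_layers * hidden_size * seq_len * bytes_per_value

print(f"fp16 kv_cache: {kv_cache_bytes(bytes_fp16) / 2**20:.0f} MiB")  # ~1024 MiB
print(f"int8 kv_cache: {kv_cache_bytes(bytes_int8) / 2**20:.0f} MiB")  # ~512 MiB
```

Under these assumptions, INT8 roughly halves the per-request kv_cache footprint, which is consistent with the slower memory growth claimed above.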
-# PTQ Quantization Test Results
+# KV Cache Quantization Test Results
 ## GPU Memory Test
@@ -23,9 +23,9 @@
 ![](../../resources/batch_memory.png)
-Because each concurrent request needs 1030MB of GPU memory to hold the kv_cache for 2048 tokens, while the server side has to weigh the cost of high-concurrency scenarios, quantizing the kv_cache is a better fit than directly quantizing the weights
+As can be seen, each concurrent request needs 1030MB of GPU memory to hold the kv_cache for 2048 tokens, so quantizing the kv_cache significantly slows memory growth at runtime
-Note that the `kCacheKVInt8` and `WeightInt4` schemes can run at the same time; we will provide the relevant implementation later
+Note that the `kCacheKVInt8` and `WeightInt4` schemes can be enabled at the same time
 ## Accuracy Test
...