Unverified Commit 903707b5 authored by tpoisonooo, committed by GitHub

docs(quantization): update description (#253)

* Update quantization.md

* docs(quantization): update description

* docs(README): rename quantization files
parent 4c9959f6
@@ -232,7 +232,7 @@ Then adjust `workspace/triton_models/weights/config.ini`
 - `use_context_fmha` changed to 0, means off
 - `quant_policy` is set to 4. This parameter defaults to 0, which means it is not enabled
-Here is [quantization test results](./docs/en/quantization.md).
+Here is [quantization test results](./docs/en/kv_int8.md).
 > **Warning**<br />
 > runtime Tensor Parallel for quantilized model is not available. Please setup `--tp` on `deploy` to enable static TP.
...
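The hunk above refers to the `config.ini` adjustment described earlier in the README: set `use_context_fmha` to 0 and `quant_policy` to 4. Below is a minimal sketch of making those two edits with Python's `configparser`; the `llama` section name is an assumption about the layout of `workspace/triton_models/weights/config.ini`, so verify it against your generated file (editing by hand works just as well).

```python
# Minimal sketch: apply the two config.ini edits described above.
# Assumption: the keys live in a section named "llama"; check your file.
import configparser

CONFIG_PATH = "workspace/triton_models/weights/config.ini"

config = configparser.ConfigParser()
config.read(CONFIG_PATH)

section = "llama"  # assumed section name
config[section]["use_context_fmha"] = "0"  # turn context FMHA off
config[section]["quant_policy"] = "4"      # enable the INT8 kv_cache path (default 0 = off)

with open(CONFIG_PATH, "w") as f:
    config.write(f)
```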
@@ -231,7 +231,7 @@ python3 -m lmdeploy.lite.apis.kv_qparams \
 - `use_context_fmha` changed to 0, meaning it is disabled
 - `quant_policy` is set to 4. This parameter defaults to 0, meaning it is not enabled
-Here are the [quantization test results](./docs/zh_cn/quantization.md)
+Here are the [quantization test results](./docs/zh_cn/kv_int8.md)
 > **Warning**<br />
 > Quantized deployment does not support runtime Tensor Parallelism. To use Tensor Parallelism, set the tp parameter at deploy time.
...
-# PTQ Quantization Benchmark Results
+# KV Cache Quantization Benchmark Results
 ## Benchmark the Graphics Memory Usage
@@ -22,9 +22,9 @@ To compare with the weight quantization method such as [GPTQ-for-LLaMa](https://
 ![](../../resources/batch_memory.png)
-Since each concurrency requires 1030MB of the graphics memory to save the kv_cache for 2048 tokens, and the server side needs to consider the cost of high concurrency scenarios, it is more appropriate to run kv_cache quantization rather than directly quantize the weights.
+It can be seen that each concurrency requires 1030MB of GPU memory to save kv_cache for 2048 tokens, so quantizing kv_cache can significantly reduce the speed of the memory growth at runtime.
-Note that `kCacheKVInt8` and `WeightInt4` can be used simultaneously, and we will provide relevant implementations later.
+Note that `kCacheKVInt8` and `WeightInt4` can be used simultaneously.
 ## Benchmark the Accuracy
...
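As a rough sanity check on the 1030MB figure above, here is a back-of-the-envelope kv_cache sizing for a LLaMA-7B-like configuration (32 layers, hidden size 4096, fp16 keys and values); the model dimensions are assumptions for illustration, not numbers taken from the benchmark.

```python
# Back-of-the-envelope kv_cache sizing for one concurrent request.
# Assumed LLaMA-7B-like dimensions; the 1030MB above is the measured value.
num_layers = 32
hidden_size = 4096
seq_len = 2048      # tokens cached per request
bytes_fp16 = 2
bytes_int8 = 1

def kv_cache_bytes(bytes_per_value: int) -> int:
    # factor of 2: both K and V are cached per layer
    return 2 * num_layers * hidden_size * seq_len * bytes_per_value

print(f"fp16 kv_cache: {kv_cache_bytes(bytes_fp16) / 2**20:.0f} MiB")  # ~1024 MiB
print(f"int8 kv_cache: {kv_cache_bytes(bytes_int8) / 2**20:.0f} MiB")  # ~512 MiB
```

Under these assumptions, INT8 roughly halves the per-request kv_cache footprint, which is consistent with the slower memory growth claimed above.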
-# PTQ Quantization Test Results
+# KV Cache Quantization Test Results
 ## GPU Memory Test
@@ -23,9 +23,9 @@
 ![](../../resources/batch_memory.png)
-Because each concurrent request needs 1030MB of GPU memory to hold the kv_cache for 2048 tokens, while the server side has to weigh the cost of high-concurrency scenarios, quantizing the kv_cache is a better fit than directly quantizing the weights
+As can be seen, each concurrent request needs 1030MB of GPU memory to hold the kv_cache for 2048 tokens, so quantizing the kv_cache significantly slows memory growth at runtime
-Note that the `kCacheKVInt8` and `WeightInt4` schemes can run at the same time; we will provide the relevant implementation later
+Note that the `kCacheKVInt8` and `WeightInt4` schemes can be enabled at the same time
 ## Accuracy Test
...