Unverified Commit b7c88ca8 authored by pppppM, committed by GitHub

update kv8 docs (#681)

parent e641dd86
@@ -52,13 +52,8 @@ You can also first set `turbomind_dir` to a private directory, then copy the sca
Modify `workspace/triton_models/weights/config.ini`:
- Set `use_context_fmha` to 0, which turns off flashattention
- Set `quant_policy` to 4, which turns on kv_cache int8

This is necessary because flashattention has two versions, v1 and v2, and kv_cache int8 also had a symmetric implementation at one point. Covering every combination would require four sets of kernels, and optimizing prematurely while the algorithm is still uncertain would be disastrous for the software.
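
For reference, here is a minimal Python sketch of the same edit. It assumes `config.ini` is a standard INI file that `configparser` can parse and that both keys live in the file's first section; the path and section handling below are illustrative, not part of lmdeploy.

```python
from configparser import ConfigParser

# Sketch: apply the two settings above to config.ini.
# Assumes a standard INI layout; the section name is read from the
# file rather than hard-coded, since it may vary by model.
path = "workspace/triton_models/weights/config.ini"

cfg = ConfigParser()
cfg.read(path)
section = cfg.sections()[0]

cfg.set(section, "use_context_fmha", "0")  # turn off flashattention
cfg.set(section, "quant_policy", "4")      # turn on kv_cache int8

with open(path, "w") as f:
    cfg.write(f)
```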
### **Step Four**
Test the chat performance.
@@ -52,13 +52,8 @@ lmdeploy lite kv_qparams \
Modify `workspace/triton_models/weights/config.ini`:
- Set `use_context_fmha` to 0, which turns off flashattention
- Set `quant_policy` to 4, which turns on kv_cache int8

This is necessary because flashattention has two versions, v1 and v2, and kv_cache int8 also had a symmetric implementation at one point. Covering every combination would require four sets of kernels, and optimizing prematurely while the algorithm is still uncertain would be disastrous for the software.
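
A quick sanity check of the edit, again assuming the standard INI layout described above (an illustrative sketch, not an lmdeploy API):

```python
from configparser import ConfigParser

# Sketch: verify that the two settings took effect.
cfg = ConfigParser()
cfg.read("workspace/triton_models/weights/config.ini")
section = cfg.sections()[0]

assert cfg.getint(section, "use_context_fmha") == 0  # flashattention off
assert cfg.getint(section, "quant_policy") == 4      # kv_cache int8 on
```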
### **Step Four**
Test the chat performance.