Unverified Commit b7c88ca8 authored by pppppM, committed by GitHub

update kv8 docs (#681)

parent e641dd86
@@ -52,13 +52,8 @@ You can also first set `turbomind_dir` to a private directory, then copy the sca
Modify `workspace/triton_models/weights/config.ini`:
- Set `use_context_fmha` to 0, which turns off flashattention
- Set `quant_policy` to 4, which turns on kv_cache int8

This is necessary because flashattention has two versions, v1 and v2, and kv_cache int8 also had a symmetric implementation at one point. Covering every combination would require four sets of kernels, and optimizing prematurely while the algorithm is still uncertain would be disastrous for the software.
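
For reference, here is a minimal Python sketch of the same edit. It assumes `config.ini` is a standard INI file that `configparser` can parse and that both keys live in the file's first section; the path and section handling below are illustrative, not part of lmdeploy.

```python
from configparser import ConfigParser

# Sketch: apply the two settings above to config.ini.
# Assumes a standard INI layout; the section name is read from the
# file rather than hard-coded, since it may vary by model.
path = "workspace/triton_models/weights/config.ini"

cfg = ConfigParser()
cfg.read(path)
section = cfg.sections()[0]

cfg.set(section, "use_context_fmha", "0")  # turn off flashattention
cfg.set(section, "quant_policy", "4")      # turn on kv_cache int8

with open(path, "w") as f:
    cfg.write(f)
```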
### **Step Four**
Test the chat performance.
@@ -52,13 +52,8 @@ lmdeploy lite kv_qparams \
Modify `workspace/triton_models/weights/config.ini`:
- Set `use_context_fmha` to 0, which turns off flashattention
- Set `quant_policy` to 4, which turns on kv_cache int8

This is necessary because flashattention has two versions, v1 and v2, and kv_cache int8 also had a symmetric implementation at one point. Covering every combination would require four sets of kernels, and optimizing prematurely while the algorithm is still uncertain would be disastrous for the software.
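
A quick sanity check of the edit, again assuming the standard INI layout described above (an illustrative sketch, not an lmdeploy API):

```python
from configparser import ConfigParser

# Sketch: verify that the two settings took effect.
cfg = ConfigParser()
cfg.read("workspace/triton_models/weights/config.ini")
section = cfg.sections()[0]

assert cfg.getint(section, "use_context_fmha") == 0  # flashattention off
assert cfg.getint(section, "quant_policy") == 4      # kv_cache int8 on
```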
### **Step Four**
Test the chat performance.