Unverified commit 7396d8f6 authored by tpoisonooo, committed by GitHub

docs(README): typo (#56)

parent 3fff964d
@@ -43,7 +43,7 @@ LMDeploy is a toolkit for compressing, deploying, and serving LLM, developed by
<img src="https://github.com/NVIDIA/FasterTransformer/blob/main/docs/images/gpt/gpt_interactive_generation.2.png?raw=true"/>
</div>
- **Multi-GPU Model Deployment and Quantization**: We provide comprehensive support for model deployment and quantization, and have successfully validated it on models ranging from 7B to 100B parameters.
- **Persistent Batch Inference**: Further optimization of model execution efficiency.
@@ -155,16 +155,12 @@ In fp16 mode, kv_cache int8 quantization can be enabled, and a single card can s
First, run the quantization script; the resulting quantization parameters are stored in the weight directory produced by `deploy.py`.
```
# --symmetry: whether to use symmetric or asymmetric quantization.
# --offload:  whether to offload some modules to CPU to save GPU memory.
# --num_tp:   the number of GPUs used for tensor parallelism.
python3 -m lmdeploy.lite.apis.kv_qparams \
    --model $HF_MODEL \
    --output_dir $DEPLOY_WEIGHT_DIR \
    --symmetry True \
    --offload False \
    --num_tp 1
```
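For example, with placeholder values (the paths below are illustrative, not from this commit; the weight-directory layout is assumed from `deploy.py`):

```
# Placeholder paths for illustration only.
export HF_MODEL=./llama-7b-hf
export DEPLOY_WEIGHT_DIR=./workspace/triton_models/weights  # assumed deploy.py output layout

python3 -m lmdeploy.lite.apis.kv_qparams \
    --model $HF_MODEL \
    --output_dir $DEPLOY_WEIGHT_DIR \
    --symmetry True --offload False --num_tp 1
```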
Then adjust `config.ini`.
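A minimal sketch of that adjustment, assuming the kv_cache int8 switch is the `quant_policy` key in the `config.ini` under the deployed weight directory (both the key and its default value of 0 are assumptions, not shown in this diff):

```
# Flip the assumed quant_policy key from 0 to 4 to enable int8 kv_cache.
sed -i 's/quant_policy = 0/quant_policy = 4/' $DEPLOY_WEIGHT_DIR/config.ini
```

Value 4 is the quantization policy commonly associated with int8 kv_cache in turbomind; check your own `config.ini` before editing.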
@@ -37,7 +37,7 @@
3. Run the quantization script to obtain the quantization parameters and place them in the weights directory; then modify the config file so that the [kCacheKVInt8](../../src/turbomind/models/llama/llama_utils.h) option takes effect
4. Run `client.py` again to read the accuracy of the int8 version (a rough sketch of these two steps follows)
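A shell sketch of steps 3 and 4, assuming the same `kv_qparams` entry point and `quant_policy` switch as the English section above; the `client.py` invocation is a placeholder, since this diff does not show its arguments:

```
# Step 3: write quantization parameters into the weights directory,
# then enable kCacheKVInt8 through the config file (assumed quant_policy key).
python3 -m lmdeploy.lite.apis.kv_qparams --model $HF_MODEL --output_dir ./weights
sed -i 's/quant_policy = 0/quant_policy = 4/' ./weights/config.ini

# Step 4: run the client again to read the int8 accuracy
# (placeholder invocation; real arguments depend on your setup).
python3 client.py
```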
The following shows the accuracy loss of the `kCacheKVInt8` method on the mmlu-social-science dataset, quantized using only the c4 dataset with 128 randomly selected samples.
| task | dataset | metric | fp16 | int8 | diff |
| :--: | :-----------------: | :----: | :---: | :---: | :---: |