Unverified Commit 4db08045 authored by tpoisonooo, committed by GitHub

docs(serving.md): typo (#92)

* docs(serving.md): typo

* docs(README): quantization
parent ac638b37
@@ -148,7 +148,7 @@ deepspeed --module --num_gpus 2 lmdeploy.pytorch.chat \
 ## Quantization
 In fp16 mode, kv_cache int8 quantization can be enabled, and a single card can serve more users.
-First execute the quantization script, and the quantization parameters are stored in the weight directory transformed by `deploy.py`.
+First execute the quantization script, and the quantization parameters are stored in the `workspace/triton_models/weights` directory generated by `deploy.py`.
 ```
 python3 -m lmdeploy.lite.apis.kv_qparams \
@@ -159,7 +159,7 @@ python3 -m lmdeploy.lite.apis.kv_qparams \
   --num_tp 1 \  # The number of GPUs used for tensor parallelism
 ```
-Then adjust `config.ini`
+Then adjust `workspace/triton_models/weights/config.ini`
 - `use_context_fmha` changed to 0, meaning it is turned off
 - `quant_policy` set to 4. This parameter defaults to 0, which means it is not enabled
......
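For convenience, the two `config.ini` switches described in the hunk above can also be flipped with a small script rather than edited by hand. This is only a sketch, assuming the generated `workspace/triton_models/weights/config.ini` keeps these keys in an INI section named `llama`; the section name is an assumption, so check the file that `deploy.py` actually produced.

```python
# Sketch: flip the kv_cache int8 switches in the generated config.ini.
# Assumption: the keys live in a section named "llama"; adjust as needed.
# Note: configparser does not preserve comments when rewriting the file.
import configparser

cfg_path = "workspace/triton_models/weights/config.ini"
cfg = configparser.ConfigParser()
cfg.read(cfg_path)

section = "llama"  # assumption: use the section your config.ini actually has
cfg[section]["use_context_fmha"] = "0"  # 0 turns context FMHA off
cfg[section]["quant_policy"] = "4"      # default 0 (disabled); 4 enables kv_cache int8

with open(cfg_path, "w") as f:
    cfg.write(f)
```

Editing the file by hand works just as well; a script mainly avoids typos when the same change has to be applied to several workspaces.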
@@ -147,7 +147,7 @@ deepspeed --module --num_gpus 2 lmdeploy.pytorch.chat \
 ## Quantized Deployment
 In fp16 mode, kv_cache int8 quantization can be enabled so that a single card can serve more users.
-First run the quantization script; the quantization parameters are stored in the weight directory converted by `deploy.py`.
+First run the quantization script; the quantization parameters are stored in the `workspace/triton_models/weights` directory converted by `deploy.py`.
 ```
 python3 -m lmdeploy.lite.apis.kv_qparams \
@@ -158,7 +158,7 @@ python3 -m lmdeploy.lite.apis.kv_qparams \
   --num_tp 1 \  # Number of GPUs used for tensor parallelism; keep consistent with deploy.py
 ```
-Then adjust `config.ini`
+Then adjust `workspace/triton_models/weights/config.ini`
 - `use_context_fmha` changed to 0, meaning it is turned off
 - `quant_policy` set to 4. This parameter defaults to 0, meaning it is not enabled
......
@@ -41,7 +41,7 @@ bash workspace/service_docker_up.sh
 <summary><b>65B</b></summary>
 ```shell
-python3 -m lmdeploy.serve.turbomind.deploy llama-13B /path/to/llama-13b llama \
+python3 -m lmdeploy.serve.turbomind.deploy llama-65B /path/to/llama-65b llama \
   --tokenizer_path /path/to/tokenizer/model --tp 8
 bash workspace/service_docker_up.sh
 ```
......
@@ -41,7 +41,7 @@ bash workspace/service_docker_up.sh
 <summary><b>65B</b></summary>
 ```shell
-python3 -m lmdeploy.serve.turbomind.deploy llama-13B /path/to/llama-13b llama \
+python3 -m lmdeploy.serve.turbomind.deploy llama-65B /path/to/llama-65b llama \
   --tokenizer_path /path/to/tokenizer/model --tp 8
 bash workspace/service_docker_up.sh
 ```
......