Unverified commit 4db08045, authored by tpoisonooo, committed by GitHub

docs(serving.md): typo (#92)

* docs(serving.md): typo

* docs(README): quantization
parent ac638b37
@@ -148,7 +148,7 @@ deepspeed --module --num_gpus 2 lmdeploy.pytorch.chat \
## Quantization
In fp16 mode, kv_cache int8 quantization can be enabled, and a single card can serve more users.
-First execute the quantization script, and the quantization parameters are stored in the weight directory transformed by `deploy.py`.
+First execute the quantization script; the quantization parameters are stored in the `workspace/triton_models/weights` directory created by `deploy.py`.
```
python3 -m lmdeploy.lite.apis.kv_qparams \
@@ -159,7 +159,7 @@ python3 -m lmdeploy.lite.apis.kv_qparams \
--num_tp 1 \ # The number of GPUs used for tensor parallelism
```
-Then adjust `config.ini`
+Then adjust `workspace/triton_models/weights/config.ini`
- Set `use_context_fmha` to 0, which disables it
- Set `quant_policy` to 4. This parameter defaults to 0, which means it is disabled
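For illustration, a minimal sketch of that `config.ini` adjustment, assuming the file sits at `workspace/triton_models/weights/config.ini` and already contains both keys; the key names come from the list above, while the exact INI layout is an assumption:

```shell
# Sketch only: flip the two kv_cache int8 settings in place with GNU sed.
# Assumes the keys already exist in workspace/triton_models/weights/config.ini;
# adjust the path if deploy.py wrote the weights somewhere else.
sed -i 's/^use_context_fmha *=.*/use_context_fmha = 0/' workspace/triton_models/weights/config.ini
sed -i 's/^quant_policy *=.*/quant_policy = 4/' workspace/triton_models/weights/config.ini
```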
@@ -147,7 +147,7 @@ deepspeed --module --num_gpus 2 lmdeploy.pytorch.chat \
## Quantization
In fp16 mode, kv_cache int8 quantization can be enabled, and a single card can serve more users.
-First execute the quantization script; the quantization parameters are stored in the weight directory converted by `deploy.py`.
+First execute the quantization script; the quantization parameters are stored in the `workspace/triton_models/weights` directory created by `deploy.py`.
```
python3 -m lmdeploy.lite.apis.kv_qparams \
@@ -158,7 +158,7 @@ python3 -m lmdeploy.lite.apis.kv_qparams \
--num_tp 1 \ # The number of GPUs used for tensor parallelism; keep it consistent with deploy.py
```
-Then adjust `config.ini`
+Then adjust `workspace/triton_models/weights/config.ini`
- Set `use_context_fmha` to 0, which disables it
- Set `quant_policy` to 4. This parameter defaults to 0, which means it is disabled
@@ -41,7 +41,7 @@ bash workspace/service_docker_up.sh
<summary><b>65B</b></summary>
```shell
-python3 lmdeploy.serve.turbomind.deploy llama-13B /path/to/llama-13b llama \
+python3 lmdeploy.serve.turbomind.deploy llama-65B /path/to/llama-65b llama \
--tokenizer_path /path/to/tokenizer/model --tp 8
bash workspace/service_docker_up.sh
```
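The other model sizes in this README follow the same deploy pattern; only the model name, the weight path, and the `--tp` value change. A hypothetical single-GPU 7B variant (the `llama-7B` name, path, and `--tp 1` are illustrative assumptions, not part of this commit):

```shell
# Hypothetical 7B example mirroring the command above; --tp 1 assumes one GPU.
python3 lmdeploy.serve.turbomind.deploy llama-7B /path/to/llama-7b llama \
    --tokenizer_path /path/to/tokenizer/model --tp 1
bash workspace/service_docker_up.sh
```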
@@ -41,7 +41,7 @@ bash workspace/service_docker_up.sh
<summary><b>65B</b></summary>
```shell
-python3 lmdeploy.serve.turbomind.deploy llama-13B /path/to/llama-13b llama \
+python3 lmdeploy.serve.turbomind.deploy llama-65B /path/to/llama-65b llama \
--tokenizer_path /path/to/tokenizer/model --tp 8
bash workspace/service_docker_up.sh
```