Unverified commit 4db08045, authored by tpoisonooo, committed by GitHub

docs(serving.md): typo (#92)

* docs(serving.md): typo

* docs(README): quantization
parent ac638b37
@@ -148,7 +148,7 @@ deepspeed --module --num_gpus 2 lmdeploy.pytorch.chat \
## Quantization
In fp16 mode, kv_cache int8 quantization can be enabled, and a single card can serve more users.
-First execute the quantization script, and the quantization parameters are stored in the weight directory transformed by `deploy.py`.
+First execute the quantization script; the quantization parameters are stored in the `workspace/triton_models/weights` directory created by `deploy.py`.
```
python3 -m lmdeploy.lite.apis.kv_qparams \
@@ -159,7 +159,7 @@ python3 -m lmdeploy.lite.apis.kv_qparams \
--num_tp 1 \ # The number of GPUs used for tensor parallelism
```
-Then adjust `config.ini`
+Then adjust `workspace/triton_models/weights/config.ini`
- Set `use_context_fmha` to 0, which disables it
- Set `quant_policy` to 4. This parameter defaults to 0, which means it is disabled
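For illustration, a minimal sketch of that `config.ini` adjustment, assuming the file sits at `workspace/triton_models/weights/config.ini` and already contains both keys; the key names come from the list above, while the exact INI layout is an assumption:

```shell
# Sketch only: flip the two kv_cache int8 settings in place with GNU sed.
# Assumes the keys already exist in workspace/triton_models/weights/config.ini;
# adjust the path if deploy.py wrote the weights somewhere else.
sed -i 's/^use_context_fmha *=.*/use_context_fmha = 0/' workspace/triton_models/weights/config.ini
sed -i 's/^quant_policy *=.*/quant_policy = 4/' workspace/triton_models/weights/config.ini
```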
@@ -147,7 +147,7 @@ deepspeed --module --num_gpus 2 lmdeploy.pytorch.chat \
## Quantization
In fp16 mode, kv_cache int8 quantization can be enabled, and a single card can serve more users.
-First execute the quantization script; the quantization parameters are stored in the weight directory converted by `deploy.py`.
+First execute the quantization script; the quantization parameters are stored in the `workspace/triton_models/weights` directory created by `deploy.py`.
```
python3 -m lmdeploy.lite.apis.kv_qparams \
@@ -158,7 +158,7 @@ python3 -m lmdeploy.lite.apis.kv_qparams \
--num_tp 1 \ # The number of GPUs used for tensor parallelism; keep it consistent with deploy.py
```
-Then adjust `config.ini`
+Then adjust `workspace/triton_models/weights/config.ini`
- Set `use_context_fmha` to 0, which disables it
- Set `quant_policy` to 4. This parameter defaults to 0, which means it is disabled
@@ -41,7 +41,7 @@ bash workspace/service_docker_up.sh
<summary><b>65B</b></summary>
```shell
-python3 lmdeploy.serve.turbomind.deploy llama-13B /path/to/llama-13b llama \
+python3 lmdeploy.serve.turbomind.deploy llama-65B /path/to/llama-65b llama \
--tokenizer_path /path/to/tokenizer/model --tp 8
bash workspace/service_docker_up.sh
```
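The other model sizes in this README follow the same deploy pattern; only the model name, the weight path, and the `--tp` value change. A hypothetical single-GPU 7B variant (the `llama-7B` name, path, and `--tp 1` are illustrative assumptions, not part of this commit):

```shell
# Hypothetical 7B example mirroring the command above; --tp 1 assumes one GPU.
python3 lmdeploy.serve.turbomind.deploy llama-7B /path/to/llama-7b llama \
    --tokenizer_path /path/to/tokenizer/model --tp 1
bash workspace/service_docker_up.sh
```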
@@ -41,7 +41,7 @@ bash workspace/service_docker_up.sh
<summary><b>65B</b></summary>
```shell
-python3 lmdeploy.serve.turbomind.deploy llama-13B /path/to/llama-13b llama \
+python3 lmdeploy.serve.turbomind.deploy llama-65B /path/to/llama-65b llama \
--tokenizer_path /path/to/tokenizer/model --tp 8
bash workspace/service_docker_up.sh
```