OpenDAS / Lmdeploy · Commits

Commit 4db08045 (unverified)
Authored Jul 11, 2023 by tpoisonooo; committed by GitHub on Jul 11, 2023

docs(serving.md): typo (#92)

* docs(serving.md): typo
* docs(README): quantization

Parent: ac638b37
Showing 4 changed files with 6 additions and 6 deletions (+6, -6):

* README.md (+2, -2)
* README_zh-CN.md (+2, -2)
* docs/en/serving.md (+1, -1)
* docs/zh_cn/serving.md (+1, -1)
README.md
@@ -148,7 +148,7 @@ deepspeed --module --num_gpus 2 lmdeploy.pytorch.chat \
 ## Quantization
 In fp16 mode, kv_cache int8 quantization can be enabled, and a single card can serve more users.
-First execute the quantization script, and the quantization parameters are stored in the weight directory transformed by `deploy.py`.
+First execute the quantization script, and the quantization parameters are stored in the `workspace/triton_models/weights` transformed by `deploy.py`.
 ```
 python3 -m lmdeploy.lite.apis.kv_qparams \
@@ -159,7 +159,7 @@ python3 -m lmdeploy.lite.apis.kv_qparams \
   --num_tp 1 \ # The number of GPUs used for tensor parallelism
 ```
-Then adjust `config.ini`
+Then adjust `workspace/triton_models/weights/config.ini`
 - `use_context_fmha` changed to 0, means off
 - `quant_policy` is set to 4. This parameter defaults to 0, which means it is not enabled
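For readers applying the corrected instructions, a minimal sketch of the `config.ini` adjustment described above, assuming the default workspace layout produced by `deploy.py`; the `sed` calls and the exact key/value spacing are illustrative, not part of the commit:

```shell
# Illustrative only: set the two keys the README describes in the
# TurboMind weight directory produced by deploy.py.
CONFIG=workspace/triton_models/weights/config.ini

# Turn off context FMHA, as the README instructs when enabling kv_cache int8.
sed -i 's/^use_context_fmha.*/use_context_fmha = 0/' "$CONFIG"

# quant_policy defaults to 0 (disabled); 4 enables kv_cache int8 quantization.
sed -i 's/^quant_policy.*/quant_policy = 4/' "$CONFIG"
```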
README_zh-CN.md
@@ -147,7 +147,7 @@ deepspeed --module --num_gpus 2 lmdeploy.pytorch.chat \
 ## Quantization Deployment
 In fp16 mode, kv_cache int8 quantization can be enabled, and a single card can serve more users.
-First run the quantization script; the quantization parameters are stored in the weight directory converted by `deploy.py`.
+First run the quantization script; the quantization parameters are stored in the `workspace/triton_models/weights` directory converted by `deploy.py`.
 ```
 python3 -m lmdeploy.lite.apis.kv_qparams \
@@ -158,7 +158,7 @@ python3 -m lmdeploy.lite.apis.kv_qparams \
   --num_tp 1 \ # Number of GPUs used for tensor parallelism; keep consistent with deploy.py
 ```
-Then adjust `config.ini`
+Then adjust `workspace/triton_models/weights/config.ini`
 - `use_context_fmha` changed to 0, meaning it is turned off
 - `quant_policy` set to 4. This parameter defaults to 0, meaning it is not enabled
docs/en/serving.md
@@ -41,7 +41,7 @@ bash workspace/service_docker_up.sh
 <summary><b>65B</b></summary>
 ```shell
-python3 lmdeploy.serve.turbomind.deploy llama-13B /path/to/llama-13b llama \
+python3 lmdeploy.serve.turbomind.deploy llama-65B /path/to/llama-65b llama \
 	--tokenizer_path /path/to/tokenizer/model --tp 8
 bash workspace/service_docker_up.sh
 ```
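The fix simply points the 65B example at the right model name and path. A short sketch of the same pattern with the model-specific values pulled out as shell variables (the variable names are illustrative, not from the docs), which makes it easier to see what changes between the 13B and 65B variants:

```shell
# Illustrative parameterization of the documented deploy command.
# MODEL_NAME/MODEL_PATH/TP are placeholders; --tp should match the number
# of GPUs used for tensor parallelism (8 in the 65B example above).
MODEL_NAME=llama-65B
MODEL_PATH=/path/to/llama-65b
TP=8

python3 lmdeploy.serve.turbomind.deploy "$MODEL_NAME" "$MODEL_PATH" llama \
	--tokenizer_path /path/to/tokenizer/model --tp "$TP"
bash workspace/service_docker_up.sh
```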
docs/zh_cn/serving.md
@@ -41,7 +41,7 @@ bash workspace/service_docker_up.sh
 <summary><b>65B</b></summary>
 ```shell
-python3 lmdeploy.serve.turbomind.deploy llama-13B /path/to/llama-13b llama \
+python3 lmdeploy.serve.turbomind.deploy llama-65B /path/to/llama-65b llama \
 	--tokenizer_path /path/to/tokenizer/model --tp 8
 bash workspace/service_docker_up.sh
 ```