update readme

52b41767 · laibao · 8228a79e · 52b41767
Commit 52b41767 authored Dec 17, 2024 by laibao
Show whitespace changes
Inline Side-by-side

Showing with 55 additions and 25 deletions

README.md README.md +55 -25

No files found.
--- a/README.md
+++ b/README.md
@@ -4,6 +4,7 @@
 * @Date: 2024-06-13 14:38:07
 * @LastEditTime: 2024-09-30 09:16:01
 -->
 ## 论文
 `GLM: General Language Model Pretraining with Autoregressive Blank Infilling`
@@ -34,23 +35,25 @@ ChatGLM系列模型基于GLM架构开发。GLM是一种基于Transformer的语
 <img src="docs/GLM.png" width="550" height="200">
 </div>
 ## 环境配置
 ### Docker（方法一）
 提供[光源](https://www.sourcefind.cn/#/image/dcu/custom)拉取推理的docker镜像：
 ```
-docker pull image.sourcefind.cn:5000/dcu/admin/base/pytorch:2.1.0-ubuntu20.04-dtk24.04.2-py3.10
+docker pull image.sourcefind.cn:5000/dcu/admin/base/pytorch:2.3.0-py3.10-dtk24.04.3-ubuntu20.04-vllm0.6
 # <Image ID>用上面拉取docker镜像的ID替换
 # <Host Path>主机端路径
 # <Container Path>容器映射路径
 # 若要在主机端和容器端映射端口需要删除--network host参数
 docker run -it --name chatglm_vllm --privileged --shm-size=64G  --device=/dev/kfd --device=/dev/dri/ --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --ulimit memlock=-1:-1 --ipc=host --network host --group-add video -v /opt/hyhal:/opt/hyhal -v <Host Path>:<Container Path> <Image ID> /bin/bash
 ```
 `Tips：若在K100/Z100L上使用，使用定制镜像docker pull image.sourcefind.cn:5000/dcu/admin/base/custom:vllm0.5.0-dtk24.04.1-ubuntu20.04-py310-zk-v1,K100/Z100L不支持awq量化`
 ### Dockerfile（方法二）
 ```
 # <Host Path>主机端路径
 # <Container Path>容器映射路径
@@ -59,22 +62,25 @@ docker run -it --name chatglm_vllm --privileged --shm-size=64G  --device=/dev/kf
 ```
 ### Anaconda（方法三）
 ```
 conda create -n chatglm_vllm python=3.10
 ```
 关于本项目DCU显卡所需的特殊深度学习库可从[光合](https://developer.hpccube.com/tool/)开发者社区下载安装。
-* DTK驱动：dtk24.04.2
-* Pytorch: 2.1.0
+* DTK驱动：dtk24.04.3
-* triton:2.1.0
+* Pytorch: 2.3.0
-* lmslim: 0.1.0
+* triton: 2.1.0
-* xformers: 0.0.25
+* lmslim: 0.1.2
-* flash_attn: 2.0.4
+* flash_attn: 2.6.1
-* vllm: 0.5.0
+* vllm: 0.6.2
 * python: python3.10
 `Tips：需先安装相关依赖，最后安装vllm包`
 ## 数据集
 无
 ## 推理
@@ -82,28 +88,33 @@ conda create -n chatglm_vllm python=3.10
 ### 模型下载
 | 基座模型                                                             | 长文本模型                                                                     |
-| ------- | ------- |
+| -------------------------------------------------------------------- | ------------------------------------------------------------------------------ |
 | [chatglm2-6b](http://113.200.138.88:18080/aimodels/chatglm2-6b)         | [chatglm2-6b-32k](http://113.200.138.88:18080/aimodels/thudm/chatglm2-6b-32k.git) |
 | [chatglm3-6b](http://113.200.138.88:18080/aimodels/chatglm3-6b)         | [chatglm3-6b-32k](http://113.200.138.88:18080/aimodels/chatglm3-6b-32k)           |
-| [glm-4-9b-chat](http://113.200.138.88:18080/aimodels/glm-4-9b-chat.git) | 
+| [glm-4-9b-chat](http://113.200.138.88:18080/aimodels/glm-4-9b-chat.git) |                                                                                |
 ### 离线批量推理
 ```bash
 python examples/offline_inference.py
 ```
 其中，`prompts`为提示词；`temperature`为控制采样随机性的值，值越小模型生成越确定，值变高模型生成更随机，0表示贪婪采样，默认为1；`max_tokens=16`为生成长度，默认为1；
 `model`为模型路径；`tensor_parallel_size=1`为使用卡数，默认为1；`dtype="float16"`为推理数据类型，如果模型权重是bfloat16,需要修改为float16推理,`quantization="gptq"`为使用gptq量化进行推理,需下载以上GPTQ模型。
 ### 离线批量推理性能测试
 1、指定输入输出
 ```bash
 python benchmarks/benchmark_throughput.py --num-prompts 1 --input-len 32 --output-len 128 --model THUDM/glm-4-9b-chat -tp 1 --trust-remote-code --enforce-eager --dtype float16
 ```
-其中`--num-prompts`是batch数，`--input-len`是输入seqlen，`--output-len`是输出token长度，`--model`为模型路径，`-tp`为使用卡数，`dtype="float16"`为推理数据类型，如果模型权重是bfloat16,需要修改为float16推理。若指定`--output-len  1`即为首字延迟。`-q gptq`为使用gptq量化模型进行推理。
+其中 `--num-prompts`是batch数，`--input-len`是输入seqlen，`--output-len`是输出token长度，`--model`为模型路径，`-tp`为使用卡数，`dtype="float16"`为推理数据类型，如果模型权重是bfloat16,需要修改为float16推理。若指定 `--output-len  1`即为首字延迟。`-q gptq`为使用gptq量化模型进行推理。
 2、使用数据集
 下载数据集：
 ```bash
 wget http://113.200.138.88:18080/aidatasets/vllm_data/-/raw/main/ShareGPT_V3_unfiltered_cleaned_split.json
 ```
@@ -111,35 +122,43 @@ wget http://113.200.138.88:18080/aidatasets/vllm_data/-/raw/main/ShareGPT_V3_unf
 ```bash
 python benchmarks/benchmark_throughput.py --num-prompts 1 --model THUDM/glm-4-9b-chat --dataset ShareGPT_V3_unfiltered_cleaned_split.json -tp 1 --trust-remote-code --enforce-eager --dtype float16
 ```
-其中`--num-prompts`是batch数，`--model`为模型路径，`--dataset`为使用的数据集，`-tp`为使用卡数，`dtype="float16"`为推理数据类型，如果模型权重是bfloat16,需要修改为float16推理。`-q gptq`为使用gptq量化模型进行推理。
+其中 `--num-prompts`是batch数，`--model`为模型路径，`--dataset`为使用的数据集，`-tp`为使用卡数，`dtype="float16"`为推理数据类型，如果模型权重是bfloat16,需要修改为float16推理。`-q gptq`为使用gptq量化模型进行推理。
+### OpenAI api服务推理性能测试
-### api服务推理性能测试
 1、启动服务端：
 ```bash
 python -m vllm.entrypoints.openai.api_server  --model THUDM/glm-4-9b-chat  --dtype float16 --enforce-eager -tp 1 
 ```
 2、启动客户端：
 ```bash
 python benchmarks/benchmark_serving.py --model THUDM/glm-4-9b-chat --dataset ShareGPT_V3_unfiltered_cleaned_split.json  --num-prompts 1 --trust-remote-code
 ```
-参数同使用数据集，离线批量推理性能测试，具体参考[benchmarks/benchmark_serving.py]（benchmarks/benchmark_serving.py）
+参数同使用数据集，离线批量推理性能测试，具体参考[benchmarks/benchmark_serving.py]（benchmarks/benchmark_serving.py）
 ### OpenAI兼容服务
 启动服务：
 ```bash
-python -m vllm.entrypoints.openai.api_server --model THUDM/glm-4-9b-chat --enforce-eager --dtype float16 --trust-remote-code
+vllm serve THUDM/glm-4-9b-chat --enforce-eager --dtype float16 --trust-remote-code --chat-template template_chatglm2.jinja --port 8000
 ```
-这里`--model`为加载模型路径，`--dtype`为数据类型：float16，默认情况使用tokenizer中的预定义聊天模板，`--chat-template`可以添加新模板覆盖默认模板,`-q gptq`为使用gptq量化模型进行推理。
+这里serve之后 为加载模型路径，`--dtype`为数据类型：float16，默认情况使用tokenizer中的预定义聊天模板，`--chat-template`可以添加新模板覆盖默认模板,`-q gptq`为使用gptq量化模型进行推理。
 列出模型型号：
 ```bash
 curl http://localhost:8000/v1/models
 ```
 ### OpenAI Completions API和vllm结合使用
 ```bash
 curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
@@ -150,10 +169,11 @@ curl http://localhost:8000/v1/completions \
        "temperature": 0
    }'
 ```
-或者使用[examples/openai_completion_client.py](examples/openai_completion_client.py)
+或者使用[examples/openai_completion_client.py](examples/openai_completion_client.py)
 ### OpenAI Chat API和vllm结合使用
 ```bash
 curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
@@ -165,7 +185,9 @@ curl http://localhost:8000/v1/chat/completions \
        ]
    }'
 ```
 或者使用[examples/openai_chatcompletion_client.py](examples/openai_chatcompletion_client.py)
 ### **gradio和vllm结合使用**
 1.安装gradio
@@ -189,6 +211,7 @@ python  gradio_openai_chatbot_webserver.py --model "THUDM/glm-4-9b-chat" --model
 ```
 chmod +x frpc_linux_amd64_v0.*
 ```
    2.3端口映射
 ```
@@ -198,7 +221,7 @@ ssh -L 8000:计算节点IP:8000 -L 8001:计算节点IP:8001 用户名@登录节
 3.启动OpenAI兼容服务
 ```
-python -m vllm.entrypoints.openai.api_server --model THUDM/glm-4-9b-chatt --enforce-eager --dtype float16 --trust-remote-code --port 8000
+vllm serve THUDM/glm-4-9b-chat  --enforce-eager --dtype float16 --trust-remote-code --chat-template template_chatglm2.jinja --port 8000
 ```
 4.启动gradio服务
@@ -212,25 +235,32 @@ python  gradio_openai_chatbot_webserver.py --model "THUDM/glm-4-9b-chat" --model
 在浏览器中输入本地 URL，可以使用 Gradio 提供的对话服务。
 ## result
 使用的加速卡:1张 DCU-K100_AI-64G
 ```
 Prompt: '晚上睡不着怎么办', Generated text: '？\n晚上睡不着可以尝试以下方法来改善睡眠质量：\n\n1. **调整作息时间**：尽量每天同一时间上床睡觉和起床，建立规律的生物钟。\n\n2. **放松身心**：睡前进行深呼吸、冥想或瑜伽等放松活动，有助于减轻压力和焦虑。\n\n3. **避免咖啡因和酒精**：晚上避免摄入咖啡因和酒精，因为它们可能会干扰睡眠。\n\n'
 ```
 ### 精度
 无
 ## 应用场景
 ### 算法类别
 对话问答
 ### 热点应用行业
 医疗,金融,科研,教育
 ## 源码仓库及问题反馈
 * [https://developer.hpccube.com/codes/modelzoo/llama_vllm](https://developer.hpccube.com/codes/modelzoo/chatglm_vllm)
 ## 参考资料
 * [https://github.com/vllm-project/vllm](https://github.com/vllm-project/vllm)
 * [https://github.com/THUDM/ChatGLM3](https://github.com/THUDM/ChatGLM3)