update readme

52b41767 · laibao · 8228a79e · 52b41767
Commit 52b41767 authored Dec 17, 2024 by laibao
Hide whitespace changes
Inline Side-by-side

Showing with 55 additions and 25 deletions

README.md README.md +55 -25

No files found.
--- a/README.md
+++ b/README.md
@@ -4,6 +4,7 @@
 * @Date: 2024-06-13 14:38:07
 * @LastEditTime: 2024-09-30 09:16:01
 -->
+
 ## 论文

 `GLM: General Language Model Pretraining with Autoregressive Blank Infilling`
@@ -24,7 +25,7 @@ ChatGLM-6B 是清华大学开源的开源的、支持中英双语的对话语言
 | ----------- | ---------- | ---- | ---- | -------- | -------- | ------------ |
 | ChatGLM2-6B | 4096       | 28   | 32   | 65024    | RoPE     | 8192         |
 | ChatGLM3-6B | 4096       | 28   | 32   | 65024    | RoPE     | 8192         |
-| glm-4-9b | 4096       | 40   | 32   | 151552    | RoPE     | 131072         |
+| glm-4-9b    | 4096       | 40   | 32   | 151552   | RoPE     | 131072       |

 ## 算法原理

@@ -34,23 +35,25 @@ ChatGLM系列模型基于GLM架构开发。GLM是一种基于Transformer的语
 <img src="docs/GLM.png" width="550" height="200">
 </div>

-
 ## 环境配置

 ### Docker（方法一）
+
 提供[光源](https://www.sourcefind.cn/#/image/dcu/custom)拉取推理的docker镜像：

 ```
-docker pull image.sourcefind.cn:5000/dcu/admin/base/pytorch:2.1.0-ubuntu20.04-dtk24.04.2-py3.10
+docker pull image.sourcefind.cn:5000/dcu/admin/base/pytorch:2.3.0-py3.10-dtk24.04.3-ubuntu20.04-vllm0.6
 # <Image ID>用上面拉取docker镜像的ID替换
 # <Host Path>主机端路径
 # <Container Path>容器映射路径
 # 若要在主机端和容器端映射端口需要删除--network host参数
 docker run -it --name chatglm_vllm --privileged --shm-size=64G  --device=/dev/kfd --device=/dev/dri/ --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --ulimit memlock=-1:-1 --ipc=host --network host --group-add video -v /opt/hyhal:/opt/hyhal -v <Host Path>:<Container Path> <Image ID> /bin/bash
 ```
+
 `Tips：若在K100/Z100L上使用，使用定制镜像docker pull image.sourcefind.cn:5000/dcu/admin/base/custom:vllm0.5.0-dtk24.04.1-ubuntu20.04-py310-zk-v1,K100/Z100L不支持awq量化`

 ### Dockerfile（方法二）
+
 ```
 # <Host Path>主机端路径
 # <Container Path>容器映射路径
@@ -59,51 +62,59 @@ docker run -it --name chatglm_vllm --privileged --shm-size=64G  --device=/dev/kf
 ```

 ### Anaconda（方法三）
+
 ```
 conda create -n chatglm_vllm python=3.10
 ```
+
 关于本项目DCU显卡所需的特殊深度学习库可从[光合](https://developer.hpccube.com/tool/)开发者社区下载安装。
-* DTK驱动：dtk24.04.2
-* Pytorch: 2.1.0
-* triton:2.1.0
-* lmslim: 0.1.0
-* xformers: 0.0.25
-* flash_attn: 2.0.4
-* vllm: 0.5.0
+
+* DTK驱动：dtk24.04.3
+* Pytorch: 2.3.0
+* triton: 2.1.0
+* lmslim: 0.1.2
+* flash_attn: 2.6.1
+* vllm: 0.6.2
 * python: python3.10

 `Tips：需先安装相关依赖，最后安装vllm包`

 ## 数据集
+
 无

 ## 推理

 ### 模型下载

-| 基座模型 | 长文本模型 |
-| ------- | ------- |
-| [chatglm2-6b](http://113.200.138.88:18080/aimodels/chatglm2-6b) | [chatglm2-6b-32k](http://113.200.138.88:18080/aimodels/thudm/chatglm2-6b-32k.git) | 
-| [chatglm3-6b](http://113.200.138.88:18080/aimodels/chatglm3-6b)  | [chatglm3-6b-32k](http://113.200.138.88:18080/aimodels/chatglm3-6b-32k) | 
-| [glm-4-9b-chat](http://113.200.138.88:18080/aimodels/glm-4-9b-chat.git) | 
+| 基座模型                                                             | 长文本模型                                                                     |
+| -------------------------------------------------------------------- | ------------------------------------------------------------------------------ |
+| [chatglm2-6b](http://113.200.138.88:18080/aimodels/chatglm2-6b)         | [chatglm2-6b-32k](http://113.200.138.88:18080/aimodels/thudm/chatglm2-6b-32k.git) |
+| [chatglm3-6b](http://113.200.138.88:18080/aimodels/chatglm3-6b)         | [chatglm3-6b-32k](http://113.200.138.88:18080/aimodels/chatglm3-6b-32k)           |
+| [glm-4-9b-chat](http://113.200.138.88:18080/aimodels/glm-4-9b-chat.git) |                                                                                |

 ### 离线批量推理
+
 ```bash
 python examples/offline_inference.py
 ```
+
 其中，`prompts`为提示词；`temperature`为控制采样随机性的值，值越小模型生成越确定，值变高模型生成更随机，0表示贪婪采样，默认为1；`max_tokens=16`为生成长度，默认为1；
 `model`为模型路径；`tensor_parallel_size=1`为使用卡数，默认为1；`dtype="float16"`为推理数据类型，如果模型权重是bfloat16,需要修改为float16推理,`quantization="gptq"`为使用gptq量化进行推理,需下载以上GPTQ模型。

-
 ### 离线批量推理性能测试
+
 1、指定输入输出
+
 ```bash
 python benchmarks/benchmark_throughput.py --num-prompts 1 --input-len 32 --output-len 128 --model THUDM/glm-4-9b-chat -tp 1 --trust-remote-code --enforce-eager --dtype float16
 ```
-其中`--num-prompts`是batch数，`--input-len`是输入seqlen，`--output-len`是输出token长度，`--model`为模型路径，`-tp`为使用卡数，`dtype="float16"`为推理数据类型，如果模型权重是bfloat16,需要修改为float16推理。若指定`--output-len  1`即为首字延迟。`-q gptq`为使用gptq量化模型进行推理。
+
+其中 `--num-prompts`是batch数，`--input-len`是输入seqlen，`--output-len`是输出token长度，`--model`为模型路径，`-tp`为使用卡数，`dtype="float16"`为推理数据类型，如果模型权重是bfloat16,需要修改为float16推理。若指定 `--output-len  1`即为首字延迟。`-q gptq`为使用gptq量化模型进行推理。

 2、使用数据集
 下载数据集：
+
 ```bash
 wget http://113.200.138.88:18080/aidatasets/vllm_data/-/raw/main/ShareGPT_V3_unfiltered_cleaned_split.json
 ```
@@ -111,35 +122,43 @@ wget http://113.200.138.88:18080/aidatasets/vllm_data/-/raw/main/ShareGPT_V3_unf
 ```bash
 python benchmarks/benchmark_throughput.py --num-prompts 1 --model THUDM/glm-4-9b-chat --dataset ShareGPT_V3_unfiltered_cleaned_split.json -tp 1 --trust-remote-code --enforce-eager --dtype float16
 ```
-其中`--num-prompts`是batch数，`--model`为模型路径，`--dataset`为使用的数据集，`-tp`为使用卡数，`dtype="float16"`为推理数据类型，如果模型权重是bfloat16,需要修改为float16推理。`-q gptq`为使用gptq量化模型进行推理。

+其中 `--num-prompts`是batch数，`--model`为模型路径，`--dataset`为使用的数据集，`-tp`为使用卡数，`dtype="float16"`为推理数据类型，如果模型权重是bfloat16,需要修改为float16推理。`-q gptq`为使用gptq量化模型进行推理。
+
+### OpenAI api服务推理性能测试

-### api服务推理性能测试
 1、启动服务端：
+
 ```bash
 python -m vllm.entrypoints.openai.api_server  --model THUDM/glm-4-9b-chat  --dtype float16 --enforce-eager -tp 1 
 ```

 2、启动客户端：
+
 ```bash
 python benchmarks/benchmark_serving.py --model THUDM/glm-4-9b-chat --dataset ShareGPT_V3_unfiltered_cleaned_split.json  --num-prompts 1 --trust-remote-code
 ```
-参数同使用数据集，离线批量推理性能测试，具体参考[benchmarks/benchmark_serving.py]（benchmarks/benchmark_serving.py）

+参数同使用数据集，离线批量推理性能测试，具体参考[benchmarks/benchmark_serving.py]（benchmarks/benchmark_serving.py）

 ### OpenAI兼容服务
+
 启动服务：
+
 ```bash
-python -m vllm.entrypoints.openai.api_server --model THUDM/glm-4-9b-chat --enforce-eager --dtype float16 --trust-remote-code
+vllm serve THUDM/glm-4-9b-chat --enforce-eager --dtype float16 --trust-remote-code --chat-template template_chatglm2.jinja --port 8000
 ```
-这里`--model`为加载模型路径，`--dtype`为数据类型：float16，默认情况使用tokenizer中的预定义聊天模板，`--chat-template`可以添加新模板覆盖默认模板,`-q gptq`为使用gptq量化模型进行推理。
+
+这里serve之后 为加载模型路径，`--dtype`为数据类型：float16，默认情况使用tokenizer中的预定义聊天模板，`--chat-template`可以添加新模板覆盖默认模板,`-q gptq`为使用gptq量化模型进行推理。

 列出模型型号：
+
 ```bash
 curl http://localhost:8000/v1/models
 ```

 ### OpenAI Completions API和vllm结合使用
+
 ```bash
 curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
@@ -150,10 +169,11 @@ curl http://localhost:8000/v1/completions \
        "temperature": 0
    }'
 ```
-或者使用[examples/openai_completion_client.py](examples/openai_completion_client.py)

+或者使用[examples/openai_completion_client.py](examples/openai_completion_client.py)

 ### OpenAI Chat API和vllm结合使用
+
 ```bash
 curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
@@ -165,7 +185,9 @@ curl http://localhost:8000/v1/chat/completions \
        ]
    }'
 ```
+
 或者使用[examples/openai_chatcompletion_client.py](examples/openai_chatcompletion_client.py)
+
 ### **gradio和vllm结合使用**

 1.安装gradio
@@ -189,16 +211,17 @@ python  gradio_openai_chatbot_webserver.py --model "THUDM/glm-4-9b-chat" --model
 ```
 chmod +x frpc_linux_amd64_v0.*
 ```
+
    2.3端口映射

 ```
 ssh -L 8000:计算节点IP:8000 -L 8001:计算节点IP:8001 用户名@登录节点 -p 登录节点端口
-```    
+```

 3.启动OpenAI兼容服务

 ```
-python -m vllm.entrypoints.openai.api_server --model THUDM/glm-4-9b-chatt --enforce-eager --dtype float16 --trust-remote-code --port 8000
+vllm serve THUDM/glm-4-9b-chat  --enforce-eager --dtype float16 --trust-remote-code --chat-template template_chatglm2.jinja --port 8000
 ```

 4.启动gradio服务
@@ -212,25 +235,32 @@ python  gradio_openai_chatbot_webserver.py --model "THUDM/glm-4-9b-chat" --model
 在浏览器中输入本地 URL，可以使用 Gradio 提供的对话服务。

 ## result
+
 使用的加速卡:1张 DCU-K100_AI-64G
+
 ```
 Prompt: '晚上睡不着怎么办', Generated text: '？\n晚上睡不着可以尝试以下方法来改善睡眠质量：\n\n1. **调整作息时间**：尽量每天同一时间上床睡觉和起床，建立规律的生物钟。\n\n2. **放松身心**：睡前进行深呼吸、冥想或瑜伽等放松活动，有助于减轻压力和焦虑。\n\n3. **避免咖啡因和酒精**：晚上避免摄入咖啡因和酒精，因为它们可能会干扰睡眠。\n\n'
 ```

 ### 精度
+
 无

 ## 应用场景

 ### 算法类别
+
 对话问答

 ### 热点应用行业
+
 医疗,金融,科研,教育

 ## 源码仓库及问题反馈
+
 * [https://developer.hpccube.com/codes/modelzoo/llama_vllm](https://developer.hpccube.com/codes/modelzoo/chatglm_vllm)

 ## 参考资料
+
 * [https://github.com/vllm-project/vllm](https://github.com/vllm-project/vllm)
 * [https://github.com/THUDM/ChatGLM3](https://github.com/THUDM/ChatGLM3)