<!--
 * @Author: zhuww
 * @email: zhuww@sugon.com
 * @Date: 2024-06-13 14:38:07
 * @LastEditTime: 2024-08-30 09:35:01
-->
## 论文

`GLM: General Language Model Pretraining with Autoregressive Blank Infilling`

- [https://arxiv.org/abs/2103.10360](https://arxiv.org/abs/2103.10360)

## 模型结构

ChatGLM-6B 是清华大学开源的开源的、支持中英双语的对话语言模型，基于 [General Language Model (GLM)](https://github.com/THUDM/GLM) 架构，具有 62 亿参数。ChatGLM-6B 使用了和 ChatGPT 相似的技术，针对中文问答和对话进行了优化。经过约 1T 标识符的中英双语训练，辅以监督微调、反馈自助、人类反馈强化学习等技术的加持，62 亿参数的 ChatGLM-6B 已经能生成相当符合人类偏好的回答。ChatGLM2-6B 是开源中英双语对话模型 ChatGLM-6B 的第二代版本，ChatGLM3 是智谱AI和清华大学 KEG 实验室联合发布的新一代对话预训练模型。ChatGLM3-6B 是 ChatGLM3 系列中的开源模型，在保留了前两代模型对话流畅、部署门槛低等众多优秀特性的基础上，ChatGLM3-6B 具有更强大的基础模型、更完整的功能支持、更全面的开源序列。

<div align="center">
<img src="docs/transformers.jpg" width="300" height="400">
</div>

以下是ChatGLM系列模型的主要网络参数配置：

| 模型名称    | 隐含层维度 | 层数 | 头数 | 词表大小 | 位置编码 | 最大序列长度 |
| ----------- | ---------- | ---- | ---- | -------- | -------- | ------------ |
| ChatGLM2-6B | 4096       | 28   | 32   | 65024    | RoPE     | 8192         |
| ChatGLM3-6B | 4096       | 28   | 32   | 65024    | RoPE     | 8192         |
| glm-4-9b | 4096       | 40   | 32   | 151552    | RoPE     | 131072         |

## 算法原理

ChatGLM系列模型基于GLM架构开发。GLM是一种基于Transformer的语言模型，以自回归空白填充为训练目标， 同时具备自回归和自编码能力。

<div align="center">
<img src="docs/GLM.png" width="550" height="200">
</div>


## 环境配置

### Docker（方法一）
提供[光源](https://www.sourcefind.cn/#/image/dcu/custom)拉取推理的docker镜像：

```
docker pull image.sourcefind.cn:5000/dcu/admin/base/pytorch:2.1.0-ubuntu20.04-dtk24.04.1-py3.10
# <Image ID>用上面拉取docker镜像的ID替换
# <Host Path>主机端路径
# <Container Path>容器映射路径
docker run -it --name chatglm_vllm --privileged --shm-size=64G  --device=/dev/kfd --device=/dev/dri/ --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --ulimit memlock=-1:-1 --ipc=host --network host --group-add video -v /opt/hyhal:/opt/hyhal -v <Host Path>:<Container Path> <Image ID> /bin/bash

pip install aiohttp==3.9.1 outlines==0.0.37 openai==1.23.3 -i http://mirrors.aliyun.com/pypi/simple/ --trusted-host mirrors.aliyun.com
```
`Tips：若在K100/Z100L上使用，需要替换flash_attn，下载链接：https://forum.hpccube.com/thread/515`

### Dockerfile（方法二）
```
# <Host Path>主机端路径
# <Container Path>容器映射路径
docker build -t chatglm:latest .
docker run -it --name chatglm_vllm --privileged --shm-size=64G  --device=/dev/kfd --device=/dev/dri/ --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --ulimit memlock=-1:-1 --ipc=host --network host --group-add video -v /opt/hyhal:/opt/hyhal:ro -v <Host Path>:<Container Path> llama:latest /bin/bash
```
`Tips：若在K100/Z100L上使用，需要替换flash_attn，下载链接：https://forum.hpccube.com/thread/515`

### Anaconda（方法三）
```
conda create -n chatglm_vllm python=3.10
pip install aiohttp==3.9.1 outlines==0.0.37 openai==1.23.3
```
关于本项目DCU显卡所需的特殊深度学习库可从[光合](https://developer.hpccube.com/tool/)开发者社区下载安装。
* DTK驱动：dtk24.04.1
* Pytorch: 2.1.0
* triton:2.1.0
* vllm: 0.3.3
* xformers: 0.0.25
* flash_attn: 2.0.4
* python: python3.10

`Tips：若在K100/Z100L上使用，需要替换flash_attn，下载链接：https://forum.hpccube.com/thread/515`

## 数据集
无

## 推理
### 源码编译安装
```
# 若使用光源的镜像，可以跳过源码编译安装，镜像中已安装vllm。
git clone http://developer.hpccube.com/codes/modelzoo/chatglm_vllm.git
cd llama_vllm
git submodule init && git submodule update
cd vllm
pip install wheel
python setup.py bdist_wheel
cd dist && pip install vllm*
```

### 模型下载

| chat模型                                                                        | 长文本模型                                                                                | 
| ------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------- | 
| [chatglm2-6b](http://113.200.138.88:18080/aimodels/chatglm2-6b) | [chatglm2-6b-32k](https://huggingface.co/THUDM/chatglm2-6b-32k) | 
| [chatglm3-6b](http://113.200.138.88:18080/aimodels/chatglm3-6b)  | [chatglm3-6b-32k](http://113.200.138.88:18080/aimodels/chatglm3-6b-32k) | 
| [glm-4-9b-chat](https://huggingface.co/THUDM/glm-4-9b-chat) | 


### 离线批量推理
```bash
python vllm/examples/offline_inference.py
```
其中，`prompts`为提示词；`temperature`为控制采样随机性的值，值越小模型生成越确定，值变高模型生成更随机，0表示贪婪采样，默认为1；`max_tokens=16`为生成长度，默认为1；
`model`为模型路径；`tensor_parallel_size=1`为使用卡数，默认为1；`dtype="float16"`为推理数据类型，如果模型权重是bfloat16,需要修改为float16推理,`quantization="gptq"`为使用gptq量化进行推理,需下载以上GPTQ模型。


### 离线批量推理性能测试
1、指定输入输出
```bash
python vllm/benchmarks/benchmark_throughput.py --num-prompts 1 --input-len 32 --output-len 128 --model THUDM/glm-4-9b-chat -tp 1 --trust-remote-code --enforce-eager --dtype float16
```
其中`--num-prompts`是batch数，`--input-len`是输入seqlen，`--output-len`是输出token长度，`--model`为模型路径，`-tp`为使用卡数，`dtype="float16"`为推理数据类型，如果模型权重是bfloat16,需要修改为float16推理。若指定`--output-len  1`即为首字延迟。`-q gptq`为使用gptq量化模型进行推理。

2、使用数据集
下载数据集：
```bash
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
```

```bash
python vllm/benchmarks/benchmark_throughput.py --num-prompts 1 --model THUDM/glm-4-9b-chat --dataset ShareGPT_V3_unfiltered_cleaned_split.json -tp 1 --trust-remote-code --enforce-eager --dtype float16
```
其中`--num-prompts`是batch数，`--model`为模型路径，`--dataset`为使用的数据集，`-tp`为使用卡数，`dtype="float16"`为推理数据类型，如果模型权重是bfloat16,需要修改为float16推理。`-q gptq`为使用gptq量化模型进行推理。


### api服务推理性能测试
1、启动服务端：
```bash
python -m vllm.entrypoints.api_server  --model THUDM/glm-4-9b-chat  --dtype float16 --enforce-eager -tp 1 
```

2、启动客户端：
```bash
python vllm/benchmarks/benchmark_serving.py --model THUDM/glm-4-9b-chat --dataset ShareGPT_V3_unfiltered_cleaned_split.json  --num-prompts 1 --trust-remote-code
```
参数同使用数据集，离线批量推理性能测试，具体参考[vllm/benchmarks/benchmark_serving.py]


### OpenAI兼容服务
启动服务：
```bash
python -m vllm.entrypoints.openai.api_server --model THUDM/glm-4-9b-chat --enforce-eager --dtype float16 --trust-remote-code
```
这里`--model`为加载模型路径，`--dtype`为数据类型：float16，默认情况使用tokenizer中的预定义聊天模板，`--chat-template`可以添加新模板覆盖默认模板,`-q gptq`为使用gptq量化模型进行推理。

列出模型型号：
```bash
curl http://localhost:8000/v1/models
```

### OpenAI Completions API和vllm结合使用
```bash
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "THUDM/glm-4-9b-chat",
        "prompt": "晚上睡不着怎么办",
        "max_tokens": 7,
        "temperature": 0
    }'
```
或者使用[vllm/examples/openai_completion_client.py](https://developer.hpccube.com/codes/OpenDAS/vllm/-/blob/df6349c78b49a5b8f6f600d0d9490791cd1d32ee/examples/openai_completion_client.py)


### OpenAI Chat API和vllm结合使用
```bash
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "THUDM/glm-4-9b-chat",
        "messages": [
            {"role": "system", "content": "晚上睡不着怎么办"},
            {"role": "user", "content": "晚上睡不着怎么办"}
        ]
    }'
```
或者使用[vllm/examples/openai_chatcompletion_client.py](https://developer.hpccube.com/codes/OpenDAS/vllm/-/blob/df6349c78b49a5b8f6f600d0d9490791cd1d32ee/examples/openai_chatcompletion_client.py)


## result
使用的加速卡:1张 DCU-K100_AI-64G
```
Prompt: '晚上睡不着怎么办', Generated text: '？\n晚上睡不着可以尝试以下方法来改善睡眠质量：\n\n1. **调整作息时间**：尽量每天同一时间上床睡觉和起床，建立规律的生物钟。\n\n2. **放松身心**：睡前进行深呼吸、冥想或瑜伽等放松活动，有助于减轻压力和焦虑。\n\n3. **避免咖啡因和酒精**：晚上避免摄入咖啡因和酒精，因为它们可能会干扰睡眠。\n\n'
```

### 精度
无

## 应用场景

### 算法类别
对话问答

### 热点应用行业
医疗,金融,科研,教育

## 源码仓库及问题反馈
* [https://developer.hpccube.com/codes/modelzoo/llama_vllm](https://developer.hpccube.com/codes/modelzoo/chatglm_vllm)

## 参考资料
* [https://github.com/vllm-project/vllm](https://github.com/vllm-project/vllm)
* [https://github.com/THUDM/ChatGLM3](https://github.com/THUDM/ChatGLM3)