Commit 40e0d5cd authored by Rayyyyy

Update codes

parent 9eb7f37f
# GLM-4-9b
## Paper
`GLM: General Language Model Pretraining with Autoregressive Blank Infilling`
- https://arxiv.org/abs/2103.10360
## Model Architecture
...@@ -10,7 +8,7 @@
</div>
## Algorithm
GLM-4-9B is the open-source model in GLM-4, the latest generation of pre-trained models released by Zhipu AI. On benchmark datasets covering semantics, mathematics, reasoning, code, and knowledge, GLM-4-9B and its human-preference-aligned variant GLM-4-9B-Chat both outperform Llama-3-8B.
<div align=center>
<img src="./doc/xx.png" width=800 height=300/>
</div>
...@@ -62,13 +60,31 @@ pip install -r requirements.txt
## Dataset
### Prepare the Dataset
This repository uses the [ADGEN](https://aclanthology.org/D19-1321.pdf) (advertisement generation) dataset as an example of how to use the code. The processed ADGEN dataset can be downloaded from [Google Drive](https://drive.google.com/file/d/13_vf0xRTQsyneRKdD1bZIr93vBGOczrk/view?usp=sharing) or [Tsinghua Cloud](https://cloud.tsinghua.edu.cn/f/b3f119a008264b1cabd1/?dl=1). After downloading, extract the data into the [data](./data) directory.
Once the data is in place, run the conversion script below; the generated `dev.jsonl` and `train.jsonl` files are saved to the `AdvertiseGen/saves` directory by default:
```
python gen_messages_data.py --data_path /path/to/AdvertiseGen
```
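For reference, each line of the raw ADGEN files is expected to be a JSON object with `content` and `summary` fields, since these are the fields `gen_messages_data.py` reads; an illustrative record (reusing the sample shown further below) would look like:
```
{"content": "类型#裤*材质#牛仔布*风格#性感", "summary": "3x1的这款牛仔裤采用浅白的牛仔面料为裤身材质,其柔然的手感和细腻的质地,在穿着舒适的同时,透露着清纯甜美的个性气质。除此之外,流畅的裤身剪裁将性感的腿部曲线彰显的淋漓尽致,不失为一款随性出街的必备单品。"}
```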
The resulting dataset directory structure is:
```
├── data
│   ├── AdvertiseGen
│   │   ├── saves        # generated dev.jsonl and train.jsonl
│   │   ├── dev.json
│   │   └── train.json
```
To generate your own data files, you can adapt [gen_messages_data.py](./gen_messages_data.py). Samples use the following format.
- Here is an example without tools:
```
{"messages": [{"role": "user", "content": "类型#裤*材质#牛仔布*风格#性感"}, {"role": "assistant", "content": "3x1的这款牛仔裤采用浅白的牛仔面料为裤身材质,其柔然的手感和细腻的质地,在穿着舒适的同时,透露着清纯甜美的个性气质。除此之外,流畅的裤身剪裁将性感的腿部曲线彰显的淋漓尽致,不失为一款随性出街的必备单品。"}]} {"messages": [{"role": "user", "content": "类型#裤*材质#牛仔布*风格#性感"}, {"role": "assistant", "content": "3x1的这款牛仔裤采用浅白的牛仔面料为裤身材质,其柔然的手感和细腻的质地,在穿着舒适的同时,透露着清纯甜美的个性气质。除此之外,流畅的裤身剪裁将性感的腿部曲线彰显的淋漓尽致,不失为一款随性出街的必备单品。"}]}
```
- Here is an example with a tool call:
```
{"messages": [{"role": "system", "content": "", "tools": [{"type": "function", "function": {"name": "get_recommended_books", "description": "Get recommended books based on user's interests", "parameters": {"type": "object", "properties": {"interests": {"type": "array", "items": {"type": "string"}, "description": "The interests to recommend books for"}}, "required": ["interests"]}}}]}, {"role": "user", "content": "Hi, I am looking for some book recommendations. I am interested in history and science fiction."}, {"role": "assistant", "content": "{\"name\": \"get_recommended_books\", \"arguments\": {\"interests\": [\"history\", \"science fiction\"]}}"}, {"role": "observation", "content": "{\"books\": [\"Sapiens: A Brief History of Humankind by Yuval Noah Harari\", \"A Brief History of Time by Stephen Hawking\", \"Dune by Frank Herbert\", \"The Martian by Andy Weir\"]}"}, {"role": "assistant", "content": "Based on your interests in history and science fiction, I would recommend the following books: \"Sapiens: A Brief History of Humankind\" by Yuval Noah Harari, \"A Brief History of Time\" by Stephen Hawking, \"Dune\" by Frank Herbert, and \"The Martian\" by Andy Weir."}]} {"messages": [{"role": "system", "content": "", "tools": [{"type": "function", "function": {"name": "get_recommended_books", "description": "Get recommended books based on user's interests", "parameters": {"type": "object", "properties": {"interests": {"type": "array", "items": {"type": "string"}, "description": "The interests to recommend books for"}}, "required": ["interests"]}}}]}, {"role": "user", "content": "Hi, I am looking for some book recommendations. I am interested in history and science fiction."}, {"role": "assistant", "content": "{\"name\": \"get_recommended_books\", \"arguments\": {\"interests\": [\"history\", \"science fiction\"]}}"}, {"role": "observation", "content": "{\"books\": [\"Sapiens: A Brief History of Humankind by Yuval Noah Harari\", \"A Brief History of Time by Stephen Hawking\", \"Dune by Frank Herbert\", \"The Martian by Andy Weir\"]}"}, {"role": "assistant", "content": "Based on your interests in history and science fiction, I would recommend the following books: \"Sapiens: A Brief History of Humankind\" by Yuval Noah Harari, \"A Brief History of Time\" by Stephen Hawking, \"Dune\" by Frank Herbert, and \"The Martian\" by Andy Weir."}]}
...@@ -87,7 +103,7 @@ pip install -r requirements.txt
```
2. The configuration files are located in the configs directory and include the following:
  - `deepspeed configuration files`: [ds_zereo_2](./finetune_demo/configs/ds_zereo_2.json), [ds_zereo_3](./finetune_demo/configs/ds_zereo_3.json)
  - `lora.yaml / ptuning_v2.yaml / sft.yaml`: configuration files for the different fine-tuning methods, covering model, optimizer, and training parameters. Some of the important parameters are explained below:
    + data_config section
      + train_file: file path of the training dataset.
...@@ -125,54 +141,59 @@ pip install -r requirements.txt
    + num_attention_heads: 2: number of attention heads for P-Tuning v2 (do not change).
    + token_dim: 256: token dimension for P-Tuning v2 (do not change).
3. `data/AdvertiseGen/saves/` is the path of the `.jsonl` data, `THUDM/glm-4-9b-chat` is the model path, and `configs/lora.yaml` is the configuration file path; all of these arguments can be replaced with your own paths (see the sketch after this list for one way to point the config at your own data).
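The snippet below is a minimal sketch of how you might edit `configs/lora.yaml` programmatically before training, using `ruamel.yaml` (which is pinned in requirements.txt). It assumes a top-level `data_config` block containing `train_file`, as described above; any other key names or paths are illustrative, not taken from the repository.
```python
# Minimal sketch: point data_config.train_file in configs/lora.yaml at your converted data.
# Assumes a top-level `data_config` mapping with a `train_file` key, as described above.
from ruamel.yaml import YAML

yaml = YAML()  # round-trip mode keeps comments and key order
config_path = "configs/lora.yaml"

with open(config_path, "r", encoding="utf-8") as f:
    cfg = yaml.load(f)

cfg["data_config"]["train_file"] = "train.jsonl"  # adjust to where your converted data lives

with open(config_path, "w", encoding="utf-8") as f:
    yaml.dump(cfg, f)
```
Editing the YAML by hand works just as well; the point is only that `train_file` should name the `.jsonl` file generated in the data-preparation step.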
### Single Node, Single GPU
```shell
bash train.sh
```
### Single-Node Multi-GPU / Multi-Node Multi-GPU
To use `deepspeed` for acceleration, make sure `deepspeed` has already been installed in the current environment as described in the [environment setup](#环境配置) section.
```shell
bash train_dp.sh
```
### Fine-tuning from a Saved Checkpoint
With the commands above, every fine-tuning run starts from scratch. If you want to resume from a partially trained model, add a fourth argument, which can be passed in two ways:
1. `yes`: automatically resume from the **last saved checkpoint**, for example:
```shell
python finetune.py ../data/AdvertiseGen/saves/ ../checkpoints/glm-4-9b-chat/ configs/lora.yaml yes
```
2. `XX`: a checkpoint number, e.g. `600` resumes from **checkpoint 600**, for example:
```shell
python finetune.py ../data/AdvertiseGen/saves/ ../checkpoints/glm-4-9b-chat/ configs/lora.yaml 600
```
## Inference
Go to the [basic_demo](./basic_demo/) directory.
### Quick Start
**Argument description**
- --model_name_or_path: name or path of the model to test; defaults to "THUDM/glm-4-9b-chat"
- --device: device to run on; defaults to "cuda"
- --query: the input query to test; defaults to "你好"
```
python inference.py
```
### Chat with the GLM-4-9B Model from the Command Line
```
python trans_cli_demo.py --model_name_or_path ../checkpoints/GLM-4-9B-Chat
python trans_cli_vision_demo.py --model_name_or_path ../checkpoints/GLM-4V-9B
```
### Chat with the GLM-4-9B-Chat Model via the Gradio Web UI
```
python trans_web_demo.py --model_name_or_path ../checkpoints/GLM-4-9B-Chat
```
### Batch Inference
```
python cli_batch_request_demo.py
``` ```
## Result
<div align=center>
<img src="./doc/result.png" width=1500 height=400/>
</div>
### Accuracy
......
import torch
import argparse
from transformers import AutoModelForCausalLM, AutoTokenizer

parse = argparse.ArgumentParser()
parse.add_argument('--model_name_or_path', default="THUDM/glm-4-9b-chat")
parse.add_argument('--device', default="cuda")
parse.add_argument('--query', type=str, default="你好")
args = parse.parse_args()

device = args.device
model_name_or_path = args.model_name_or_path
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, trust_remote_code=True)

query = args.query
inputs = tokenizer.apply_chat_template([{"role": "user", "content": query}],
                                       add_generation_prompt=True,
                                       tokenize=True,
...@@ -14,7 +24,7 @@ inputs = tokenizer.apply_chat_template([{"role": "user", "content": query}],
inputs = inputs.to(device)
model = AutoModelForCausalLM.from_pretrained(
    model_name_or_path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True
...@@ -24,4 +34,4 @@ gen_kwargs = {"max_length": 2500, "do_sample": True, "top_k": 1}
with torch.no_grad():
    outputs = model.generate(**inputs, **gen_kwargs)
    outputs = outputs[:, inputs['input_ids'].shape[1]:]
    print('Result', tokenizer.decode(outputs[0], skip_special_tokens=True))
...@@ -11,6 +11,7 @@ ensuring that the CLI interface displays formatted text correctly.
"""
import os
import argparse
import torch
from threading import Thread
from typing import Union
...@@ -27,11 +28,16 @@ from transformers import (
    TextIteratorStreamer
)
# add model path
parser = argparse.ArgumentParser()
parser.add_argument('--model_name_or_path', default='THUDM/glm-4-9b-chat')
args = parser.parse_args()
ModelType = Union[PreTrainedModel, PeftModelForCausalLM]
TokenizerType = Union[PreTrainedTokenizer, PreTrainedTokenizerFast]

# MODEL_PATH = os.environ.get('MODEL_PATH', 'THUDM/glm-4-9b-chat')
MODEL_PATH = args.model_name_or_path

def load_model_and_tokenizer(
        model_dir: Union[str, Path], trust_remote_code: bool = True
......
...@@ -11,6 +11,8 @@ ensuring that the CLI interface displays formatted text correctly.
"""
import os
import argparse
import torch
from threading import Thread
from transformers import (
...@@ -22,13 +24,20 @@ from transformers import (
from PIL import Image

# add model path
parser = argparse.ArgumentParser()
parser.add_argument('--model_name_or_path', default='THUDM/glm-4v-9b')
args = parser.parse_args()
# MODEL_PATH = os.environ.get('MODEL_PATH', 'THUDM/glm-4v-9b')
MODEL_PATH = args.model_name_or_path
tokenizer = AutoTokenizer.from_pretrained(
    MODEL_PATH,
    trust_remote_code=True,
    encode_special_tokens=True
)
model = AutoModel.from_pretrained(
    MODEL_PATH,
    trust_remote_code=True,
......
...@@ -6,10 +6,12 @@ allowing users to interact with the model through a chat-like interface.
"""
import os
import argparse
import torch
import gradio as gr
from threading import Thread
from typing import Union
from pathlib import Path
from peft import AutoPeftModelForCausalLM, PeftModelForCausalLM
...@@ -26,8 +28,14 @@ from transformers import (
ModelType = Union[PreTrainedModel, PeftModelForCausalLM]
TokenizerType = Union[PreTrainedTokenizer, PreTrainedTokenizerFast]
# add model path
parser = argparse.ArgumentParser()
parser.add_argument('--model_name_or_path', default='THUDM/glm-4-9b-chat')
args = parser.parse_args()
# MODEL_PATH = os.environ.get('MODEL_PATH', 'THUDM/glm-4-9b-chat')
MODEL_PATH = args.model_name_or_path
TOKENIZER_PATH = os.environ.get("TOKENIZER_PATH", MODEL_PATH)
......
...@@ -9,14 +9,22 @@ Usage:
Note: The script includes a modification to handle markdown to plain text conversion,
ensuring that the CLI interface displays formatted text correctly.
"""
import time
import asyncio
import argparse
from transformers import AutoTokenizer
from vllm import SamplingParams, AsyncEngineArgs, AsyncLLMEngine
from typing import List, Dict

# add model path
parser = argparse.ArgumentParser()
parser.add_argument('--model_name_or_path', default='THUDM/glm-4-9b')
args = parser.parse_args()
# MODEL_PATH = 'THUDM/glm-4-9b'
MODEL_PATH = args.model_name_or_path
def load_model_and_tokenizer(model_dir: str):
    engine_args = AsyncEngineArgs(
......
jieba>=0.42.1
datasets>=2.19.1
peft>=0.11.0
nltk==3.8.1
ruamel.yaml==0.18.6
rouge_chinese==1.0.3
#!/bin/bash
export HIP_VISIBLE_DEVICES=1 # change this to the GPU ID you want to use
export HSA_FORCE_FINE_GRAIN_PCIE=1
export USE_MIOPEN_BATCHNORM=1
python finetune.py ../data/AdvertiseGen/saves/ ../checkpoints/glm-4-9b-chat/ configs/lora.yaml
#!/bin/bash
export HIP_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 # change this to the GPU IDs you want to use
export HSA_FORCE_FINE_GRAIN_PCIE=1
export USE_MIOPEN_BATCHNORM=1
OMP_NUM_THREADS=1 torchrun --standalone --nnodes=1 --nproc_per_node=8 finetune.py ../data/AdvertiseGen/saves/ ../checkpoints/glm-4-9b-chat/ configs/lora.yaml
import os
import json
import argparse

# command-line configuration
parse = argparse.ArgumentParser()
parse.add_argument('--data_path', default='./data/AdvertiseGen')
args = parse.parse_args()

# default save path for the generated files
save_root_path = os.path.join(args.data_path, 'saves')
if not os.path.exists(save_root_path):
    os.mkdir(save_root_path)


def save_to_jsonl(train_infos, save_path):
    '''Write the converted records to a .jsonl file, one JSON object per line.'''
    with open(save_path, 'w', encoding='utf-8') as file:
        for info in train_infos:
            file.write(json.dumps(info, ensure_ascii=False) + '\n')


def load_json_infos(file_path):
    '''Read the raw ADGEN records, convert them to the messages format, and save them as .jsonl.'''
    all_data = []
    with open(file_path, 'r', encoding='utf-8') as ofile:
        for info in ofile.readlines():
            json_info = json.loads(info)
            output = {"messages": []}
            content = {"role": "user", "content": json_info.get("content")}
            summary = {"role": "assistant", "content": json_info.get("summary")}
            output["messages"].extend([content, summary])
            all_data.append(output)
    # 'train.json' -> 'train.jsonl', 'dev.json' -> 'dev.jsonl'
    save_file_path = os.path.join(save_root_path, os.path.basename(file_path) + 'l')
    save_to_jsonl(all_data, save_file_path)


if __name__ == "__main__":
    files = ['train.json', 'dev.json']
    for file in files:
        file_path = os.path.join(args.data_path, file)
        load_json_infos(file_path)
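If you want to double-check the conversion, a small sketch like the following (the path is illustrative and depends on where you extracted the dataset) prints the first converted record:
```python
# Quick sanity check: print the first converted record (illustrative path).
import json

with open('./data/AdvertiseGen/saves/train.jsonl', 'r', encoding='utf-8') as f:
    first = json.loads(f.readline())
print(json.dumps(first, ensure_ascii=False, indent=2))
```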
...@@ -3,8 +3,8 @@ modelCode=684
# Model name
modelName=glm4-9b_pytorch
# Model description
modelDescription=GLM-4-9B is the open-source model in GLM-4, the latest generation of pre-trained models released by Zhipu AI. On benchmark datasets covering semantics, mathematics, reasoning, code, and knowledge, GLM-4-9B and its human-preference-aligned variant GLM-4-9B-Chat both outperform Llama-3-8B.
# Application scenarios
appScenario=inference,training,multi-turn dialogue,smart home,education,research
# Framework type
frameType=pytorch