Commit 67ca83cf authored by Rayyyyy's avatar Rayyyyy

Support GLM-4-0414

parent 78ba9d16
from langchain_community.document_loaders import PyMuPDFLoader
import docx
from pptx import Presentation


def extract_text(path):
    # Plain-text files can be read directly.
    return open(path, "r").read()


def extract_pdf(path):
    # Load the PDF with PyMuPDF and join the page contents.
    loader = PyMuPDFLoader(path)
    data = loader.load()
    data = [x.page_content for x in data]
    content = "\n\n".join(data)
    return content


def extract_docx(path):
    # Collect the text of every paragraph in the Word document.
    doc = docx.Document(path)
    data = []
    for paragraph in doc.paragraphs:
        data.append(paragraph.text)
    content = "\n\n".join(data)
    return content


def extract_pptx(path):
    prs = Presentation(path)
......
# 使用 Intel® Extension for Transformers 推理 GLM-4-9B-Chat 模型
本示例介绍如何使用 Intel® Extension for Transformers 推理 GLM-4-9B-Chat 模型。
## 设备和依赖检查
### 相关推理测试数据
**本文档的数据均在以下硬件环境测试,实际运行环境需求和运行占用的显存略有不同,请以实际运行环境为准。**
测试硬件信息:
+ OS: Ubuntu 22.04 (本教程一定需要在Linux环境下执行)
+ Memory: 512GB
+ Python: 3.10.12
+ CPU: Intel(R) Xeon(R) Platinum 8358 CPU / 12th Gen Intel i5-12400
## 安装依赖
在开始推理之前,请先安装 `inference` 中的依赖,同时安装本目录下的依赖项:
```shell
pip install -r requirements.txt
```
## 运行模型推理
```shell
python itrex_cli_demo.py
```
如果您是第一次推理,会有一次模型权重转换的过程,转换后的模型权重存放在 `runtime_outs` 文件夹下,这大概会消耗 `60G` 的硬盘空间。
转换完成后,文件夹下有两个文件:
+ ne_chatglm2_f32.bin 52G(如果您不使用FP32进行推理,可以删掉这个文件)
+ ne_chatglm2_q_nf4_bestla_cfp32_sym_sfp32_g32.bin 8.1G
如果您不是第一次推理,则会跳过这个步骤,直接开始对话,推理效果如下:
```shell
Welcome to the CLI chat. Type your messages below.
User: 你好
AVX:1 AVX2:1 AVX512F:1 AVX512BW:1 AVX_VNNI:0 AVX512_VNNI:1 AMX_INT8:0 AMX_BF16:0 AVX512_BF16:0 AVX512_FP16:0
beam_size: 1, do_sample: 1, top_k: 40, top_p: 0.900, continuous_batching: 0, max_request_num: 1, early_stopping: 0, scratch_size_ratio: 1.000
model_file_loader: loading model from runtime_outs/ne_chatglm2_q_nf4_bestla_cfp32_sym_sfp32_g32.bin
Loading the bin file with NE format...
load_ne_hparams 0.hparams.n_vocab = 151552
load_ne_hparams 1.hparams.n_embd = 4096
load_ne_hparams 2.hparams.n_mult = 0
load_ne_hparams 3.hparams.n_head = 32
load_ne_hparams 4.hparams.n_head_kv = 0
load_ne_hparams 5.hparams.n_layer = 40
load_ne_hparams 6.hparams.n_rot = 0
load_ne_hparams 7.hparams.ftype = 0
load_ne_hparams 8.hparams.max_seq_len = 131072
load_ne_hparams 9.hparams.alibi_bias_max = 0.000
load_ne_hparams 10.hparams.clip_qkv = 0.000
load_ne_hparams 11.hparams.par_res = 0
load_ne_hparams 12.hparams.word_embed_proj_dim = 0
load_ne_hparams 13.hparams.do_layer_norm_before = 0
load_ne_hparams 14.hparams.multi_query_group_num = 2
load_ne_hparams 15.hparams.ffn_hidden_size = 13696
load_ne_hparams 16.hparams.inner_hidden_size = 0
load_ne_hparams 17.hparams.n_experts = 0
load_ne_hparams 18.hparams.n_experts_used = 0
load_ne_hparams 19.hparams.n_embd_head_k = 0
load_ne_hparams 20.hparams.norm_eps = 0.000000
load_ne_hparams 21.hparams.freq_base = 5000000.000
load_ne_hparams 22.hparams.freq_scale = 1.000
load_ne_hparams 23.hparams.rope_scaling_factor = 0.000
load_ne_hparams 24.hparams.original_max_position_embeddings = 0
load_ne_hparams 25.hparams.use_yarn = 0
load_ne_vocab 26.vocab.bos_token_id = 1
load_ne_vocab 27.vocab.eos_token_id = 151329
load_ne_vocab 28.vocab.pad_token_id = 151329
load_ne_vocab 29.vocab.sep_token_id = -1
init: hparams.n_vocab = 151552
init: hparams.n_embd = 4096
init: hparams.n_mult = 0
init: hparams.n_head = 32
init: hparams.n_layer = 40
init: hparams.n_rot = 0
init: hparams.ffn_hidden_size = 13696
init: n_parts = 1
load: ctx size = 16528.38 MB
load: layers[0].ffn_fusion = 1
load: scratch0 = 4096.00 MB
load: scratch1 = 2048.00 MB
load: scratch2 = 4096.00 MB
load: mem required = 26768.38 MB (+ memory per state)
.............................................................................................
model_init_from_file: support_bestla_kv = 1
kv_cache_init: run_mha_reordered = 1
model_init_from_file: kv self size = 690.00 MB
Assistant:
你好👋!我是人工智能助手,很高兴为你服务。有什么可以帮助你的吗?
```
# Using Intel® Extension for Transformers to Run Inference with the GLM-4-9B-Chat Model
This example shows how to run inference on the GLM-4-9B-Chat model with Intel® Extension for Transformers.
## Device and Dependency Check
### Relevant Inference Test Data
**The data in this document was measured on the following hardware environment. Requirements and memory usage in your environment may differ slightly; please refer to your actual running environment.**
Test hardware information:
+ OS: Ubuntu 22.04 (This tutorial must be executed in a Linux environment)
+ Memory: 512GB
+ Python: 3.10.12
+ CPU: Intel(R) Xeon(R) Platinum 8358 CPU / 12th Gen Intel i5-12400
## Installing Dependencies
Before starting inference, please install the dependencies in `inference` as well as the dependencies in this directory:
```shell
pip install -r requirements.txt
```
## Running Model Inference
```shell
python itrex_cli_demo.py
```
On the first run, the model weights are converted once. The converted weights are stored in the `runtime_outs` folder and consume about `60G` of disk space.
After the conversion is completed, the folder contains two files:
+ ne_chatglm2_f32.bin 52G (if you do not run inference in FP32, you can delete this file)
+ ne_chatglm2_q_nf4_bestla_cfp32_sym_sfp32_g32.bin 8.1G
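If you only run low-bit inference, the FP32 intermediate can be removed to reclaim disk space. A minimal sketch, assuming the default `runtime_outs` output directory and the file names listed above:

```python
from pathlib import Path

# Sketch: free roughly 52 GB by removing the FP32 weights once the NF4 file exists.
# Paths are assumptions based on the file list above; adjust if your output dir differs.
runtime_dir = Path("runtime_outs")
fp32_weights = runtime_dir / "ne_chatglm2_f32.bin"
nf4_weights = runtime_dir / "ne_chatglm2_q_nf4_bestla_cfp32_sym_sfp32_g32.bin"

if nf4_weights.exists() and fp32_weights.exists():
    fp32_weights.unlink()
    print(f"Removed {fp32_weights}, kept {nf4_weights}")
```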
On subsequent runs this step is skipped and the conversation starts directly. A sample session looks like this:
```shell
Welcome to the CLI chat. Type your messages below.
User: Hello
AVX:1 AVX2:1 AVX512F:1 AVX512BW:1 AVX_VNNI:0 AVX512_VNNI:1 AMX_INT8:0 AMX_BF16:0 AVX512_BF16:0 AVX512_FP16:0
beam_size: 1, do_sample: 1, top_k: 40, top_p: 0.900, continuous_batching: 0, max_request_num: 1, early_stopping: 0, scratch_size_ratio: 1.000
model_file_loader: loading model from runtime_outs/ne_chatglm2_q_nf4_bestla_cfp32_sym_sfp32_g32.bin
Loading the bin file with NE format...
load_ne_hparams 0.hparams.n_vocab = 151552
load_ne_hparams 1.hparams.n_embd = 4096
load_ne_hparams 2.hparams.n_mult = 0
load_ne_hparams 3.hparams.n_head = 32
load_ne_hparams 4.hparams.n_head_kv = 0
load_ne_hparams 5.hparams.n_layer = 40
load_ne_hparams 6.hparams.n_rot = 0
load_ne_hparams 7.hparams.ftype = 0
load_ne_hparams 8.hparams.max_seq_len = 131072
load_ne_hparams 9.hparams.alibi_bias_max = 0.000
load_ne_hparams 10.hparams.clip_qkv = 0.000
load_ne_hparams 11.hparams.multi_query_group_num = 2
load_ne_hparams 12.hparams.ffn_hidden_size = 13696
load_ne_hparams 13.hparams.inner_hidden_size = 0
load_ne_hparams 14.hparams.n_experts = 0
load_ne_hparams 15.hparams.n_experts_used = 0
load_ne_hparams 16.hparams.n_embd_head_k = 0
load_ne_hparams 17.hparams.norm_eps = 0.000000
load_ne_hparams 18.hparams.freq_base = 5000000.000
load_ne_hparams 19.hparams.freq_scale = 1.000
load_ne_hparams 20.hparams.rope_scaling_factor = 0.000
load_ne_hparams 21.hparams.original_max_position_embeddings = 0
load_ne_hparams 22.hparams.use_yarn = 0
load_ne_vocab 23.vocab.bos_token_id = 1
load_ne_vocab 24.vocab.eos_token_id = 151329
load_ne_vocab 25.vocab.pad_token_id = 151329
load_ne_vocab 26.vocab.sep_token_id = -1
init: hparams.n_vocab = 151552
init: hparams.n_embd = 4096
init: hparams.n_mult = 0
init: hparams.n_head = 32
init: hparams.n_layer = 40
init: hparams.n_rot = 0
init: hparams.ffn_hidden_size = 13696
init: n_parts = 1
load: ctx size = 16528.38 MB
load: layers[0].ffn_fusion = 1
load: scratch0 = 4096.00 MB
load: scratch1 = 2048.00 MB
load: scratch2 = 4096.00 MB
load: mem required = 26768.38 MB (+ memory per state)
.............................................................................................
model_init_from_file: support_bestla_kv = 1
kv_cache_init: run_mha_reordered = 1
model_init_from_file: kv self size = 690.00 MB
Assistant:
Hello👋! I am an AI assistant. How can I help you today?
```
"""
This script creates a CLI demo with transformers backend for the glm-4-9b model with Intel® Extension for Transformers
"""
import os
MODEL_PATH = os.environ.get("MODEL_PATH", "THUDM/GLM-4-9B-0414")
from threading import Thread
import torch
from intel_extension_for_transformers.transformers import AutoModelForCausalLM
from transformers import AutoTokenizer, StoppingCriteria, StoppingCriteriaList, TextIteratorStreamer
class StopOnTokens(StoppingCriteria):
def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool:
stop_ids = [151329, 151336, 151338]
for stop_id in stop_ids:
if input_ids[0][-1] == stop_id:
return True
return False
def initialize_model_and_tokenizer():
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForCausalLM.from_pretrained(
MODEL_PATH,
device_map="cpu", # Use Intel CPU for inference
trust_remote_code=True,
load_in_4bit=True,
)
return tokenizer, model
def get_user_input():
return input("\nUser: ")
def main():
tokenizer, model = initialize_model_and_tokenizer()
history = []
max_length = 100
top_p = 0.9
temperature = 0.8
stop = StopOnTokens()
print("Welcome to the CLI chat. Type your messages below.")
while True:
user_input = get_user_input()
if user_input.lower() in ["exit", "quit"]:
break
history.append([user_input, ""])
messages = []
for idx, (user_msg, model_msg) in enumerate(history):
if idx == len(history) - 1 and not model_msg:
messages.append({"role": "user", "content": user_msg})
break
if user_msg:
messages.append({"role": "user", "content": user_msg})
if model_msg:
messages.append({"role": "assistant", "content": model_msg})
model_inputs = tokenizer.apply_chat_template(
messages, add_generation_prompt=True, tokenize=True, return_tensors="pt"
)
streamer = TextIteratorStreamer(tokenizer=tokenizer, timeout=60, skip_prompt=True, skip_special_tokens=True)
generate_kwargs = {
"input_ids": model_inputs,
"streamer": streamer,
"max_new_tokens": max_length,
"do_sample": True,
"top_p": top_p,
"temperature": temperature,
"stopping_criteria": StoppingCriteriaList([stop]),
"repetition_penalty": 1.2,
"eos_token_id": model.config.eos_token_id,
}
t = Thread(target=model.generate, kwargs=generate_kwargs)
t.start()
print("Assistant:", end="", flush=True)
for new_token in streamer:
if new_token:
print(new_token, end="", flush=True)
history[-1][1] += new_token
history[-1][1] = history[-1][1].strip()
if __name__ == "__main__":
main()
cmake>=3.29.5.1
huggingface-hub>=0.23.4
git+https://github.com/intel/neural-speed.git@main#egg=neural-speed
intel-extension-for-transformers>=1.4.2
# 使用 OpenVINO 部署 GLM-4-9B-Chat 模型
Read this in [English](README_en.md).
[OpenVINO](https://www.intel.com/content/www/us/en/developer/tools/openvino-toolkit/overview.html)
是 Intel 为深度学习推理而设计的开源工具包。它可以帮助开发者优化模型,提高推理性能,减少模型的内存占用。
本示例将展示如何使用 OpenVINO 部署 GLM-4-9B-Chat 模型。
## 1. 环境配置
首先,你需要安装依赖
```bash
pip install -r requirements.txt
```
## 2. 转换模型
由于需要将Huggingface模型转换为OpenVINO IR模型,因此您需要下载模型并转换。
```
python3 convert.py --model_id THUDM/glm-4-9b-chat --output {your_path}/glm-4-9b-chat-ov
```
### 可以选择的参数
* `--model_id` - 模型所在目录的路径(绝对路径)。
* `--output` - 转换后模型保存的地址。
* `--precision` - 转换的精度。
转换过程如下:
```
====Exporting IR=====
Framework not specified. Using pt to export the model.
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:04<00:00, 2.14it/s]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Using framework PyTorch: 2.3.1+cu121
Mixed-Precision assignment ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 160/160 • 0:01:45 • 0:00:00
INFO:nncf:Statistics of the bitwidth distribution:
┍━━━━━━━━━━━━━━━━┯━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┯━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┑
│ Num bits (N) │ % all parameters (layers) │ % ratio-defining parameters (layers) │
┝━━━━━━━━━━━━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┥
│ 8 │ 31% (76 / 163) │ 20% (73 / 160) │
├────────────────┼─────────────────────────────┼────────────────────────────────────────┤
│ 4 │ 69% (87 / 163) │ 80% (87 / 160) │
┕━━━━━━━━━━━━━━━━┷━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┷━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┙
Applying Weight Compression ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 163/163 • 0:03:46 • 0:00:00
Configuration saved in glm-4-9b-ov/openvino_config.json
====Exporting tokenizer=====
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
```
## 3. 运行 GLM-4-9B-Chat 模型
```
python3 chat.py --model_path {your_path}/glm-4-9b-chat-ov --max_sequence_length 4096 --device CPU
```
### 可以选择的参数
* `--model_path` - OpenVINO IR 模型所在目录的路径。
* `--max_sequence_length` - 输出 token 的最大数量。
* `--device` - 运行推理的设备。
### 参考代码
本代码参考 [OpenVINO 官方示例](https://github.com/OpenVINO-dev-contest/chatglm3.openvino) 进行修改。
# Deploy the GLM-4-9B-Chat model using OpenVINO
[OpenVINO](https://www.intel.com/content/www/us/en/developer/tools/openvino-toolkit/overview.html)
is an open source toolkit designed by Intel for deep learning inference. It can help developers optimize models, improve inference performance, and reduce model memory usage.
This example will show how to deploy the GLM-4-9B-Chat model using OpenVINO.
## 1. Environment configuration
First, you need to install the dependencies
```bash
pip install -r requirements.txt
```
## 2. Convert the model
Since the Huggingface model needs to be converted to an OpenVINO IR model, you need to download the model and convert it.
```
python3 convert.py --model_id THUDM/glm-4-9b-chat --output {your_path}/glm-4-9b-chat-ov
```
The conversion process is as follows:
```
====Exporting IR=====
Framework not specified. Using pt to export the model.
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:04<00:00, 2.14it/s]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Using framework PyTorch: 2.3.1+cu121
Mixed-Precision assignment ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 160/160 • 0:01:45 • 0:00:00
INFO:nncf:Statistics of the bitwidth distribution:
┍━━━━━━━━━━━━━━━━┯━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┯━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┑
│ Num bits (N) │ % all parameters (layers) │ % ratio-defining parameters (layers) │
┝━━━━━━━━━━━━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┥
│ 8 │ 31% (76 / 163) │ 20% (73 / 160) │
├────────────────┼─────────────────────────────┼────────────────────────────────────────┤
│ 4 │ 69% (87 / 163) │ 80% (87 / 160) │
┕━━━━━━━━━━━━━━━━┷━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┷━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┙
Applying Weight Compression ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 163/163 • 0:03:46 • 0:00:00
Configuration saved in glm-4-9b-ov/openvino_config.json
====Exporting tokenizer=====
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
```
### Optional parameters
* `--model_id` - Path to the directory where the model is located (absolute path).
* `--output` - Path to where the converted model is saved.
* `--precision` - Precision of the conversion.
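For reference, `convert.py` in this directory performs the export with `optimum-intel`. A minimal sketch of the int4 path, using the same compression settings as the script (the model id and output path below are placeholders):

```python
from optimum.intel import OVWeightQuantizationConfig
from optimum.intel.openvino import OVModelForCausalLM
from transformers import AutoTokenizer

# Sketch of the int4 export; settings mirror convert.py shown later in this directory.
model_id = "THUDM/glm-4-9b-chat"      # placeholder: your --model_id
output_dir = "glm-4-9b-chat-ov"       # placeholder: your --output path

ov_model = OVModelForCausalLM.from_pretrained(
    model_id,
    export=True,    # convert the PyTorch checkpoint to OpenVINO IR
    compile=False,
    quantization_config=OVWeightQuantizationConfig(bits=4, sym=False, group_size=128, ratio=0.8),
    trust_remote_code=True,
)
ov_model.save_pretrained(output_dir)
AutoTokenizer.from_pretrained(model_id, trust_remote_code=True).save_pretrained(output_dir)
```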
## 3. Run the GLM-4-9B-Chat model
```
python3 chat.py --model_path {your_path}/glm-4-9b-chat-ov --max_sequence_length 4096 --device CPU
```
### Optional parameters
* `--model_path` - Path to the directory where the OpenVINO IR model is located.
* `--max_sequence_length` - Maximum number of output tokens.
* `--device` - The device to run inference on.
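For a quick non-interactive smoke test, the exported IR model can be loaded and queried directly with the same `optimum-intel` calls that `chat.py` uses; the model directory and prompt below are placeholders:

```python
from optimum.intel.openvino import OVModelForCausalLM
from transformers import AutoConfig, AutoTokenizer

# Minimal sketch mirroring chat.py: load the IR model on CPU and generate once.
model_dir = "glm-4-9b-chat-ov"  # placeholder: the --output path used during conversion
tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
ov_model = OVModelForCausalLM.from_pretrained(
    model_dir,
    device="CPU",
    config=AutoConfig.from_pretrained(model_dir, trust_remote_code=True),
    trust_remote_code=True,
)

inputs = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Hello"}], add_generation_prompt=True, tokenize=True, return_tensors="pt"
)
outputs = ov_model.generate(input_ids=inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```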
### Reference code
This code is modified based on the [OpenVINO official example](https://github.com/OpenVINO-dev-contest/chatglm3.openvino).
"""
This script is used to convert the original model to OpenVINO IR format.
The Origin Code can check https://github.com/OpenVINO-dev-contest/chatglm3.openvino/blob/main/convert.py
"""
import argparse
import os
from pathlib import Path
from optimum.intel import OVWeightQuantizationConfig
from optimum.intel.openvino import OVModelForCausalLM
from transformers import AutoConfig, AutoTokenizer
if __name__ == "__main__":
parser = argparse.ArgumentParser(add_help=False)
parser.add_argument("-h", "--help", action="help", help="Show this help message and exit.")
parser.add_argument(
"-m", "--model_id", default="THUDM/GLM-4-9B-0414", required=False, type=str, help="orignal model path"
)
parser.add_argument(
"-p",
"--precision",
required=False,
default="int4",
type=str,
choices=["fp16", "int8", "int4"],
help="fp16, int8 or int4",
)
parser.add_argument(
"-o", "--output", default="./glm-4-9b-ov", required=False, type=str, help="Required. path to save the ir model"
)
args = parser.parse_args()
ir_model_path = Path(args.output)
if ir_model_path.exists() == False:
os.mkdir(ir_model_path)
model_kwargs = {
"trust_remote_code": True,
"config": AutoConfig.from_pretrained(args.model_id, trust_remote_code=True),
}
compression_configs = {
"sym": False,
"group_size": 128,
"ratio": 0.8,
}
print("====Exporting IR=====")
if args.precision == "int4":
ov_model = OVModelForCausalLM.from_pretrained(
args.model_id,
export=True,
compile=False,
quantization_config=OVWeightQuantizationConfig(bits=4, **compression_configs),
**model_kwargs,
)
elif args.precision == "int8":
ov_model = OVModelForCausalLM.from_pretrained(
args.model_id, export=True, compile=False, load_in_8bit=True, **model_kwargs
)
else:
ov_model = OVModelForCausalLM.from_pretrained(
args.model_id, export=True, compile=False, load_in_8bit=False, **model_kwargs
)
ov_model.save_pretrained(ir_model_path)
print("====Exporting tokenizer=====")
tokenizer = AutoTokenizer.from_pretrained(args.model_id, trust_remote_code=True)
tokenizer.save_pretrained(ir_model_path)
import argparse
from threading import Thread
from typing import List, Tuple

import torch
from optimum.intel.openvino import OVModelForCausalLM
from transformers import AutoConfig, AutoTokenizer, StoppingCriteria, StoppingCriteriaList, TextIteratorStreamer


class StopOnTokens(StoppingCriteria):
    def __init__(self, token_ids):
        self.token_ids = token_ids

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool:
        # Stop as soon as the last generated token is one of the configured stop tokens.
        for stop_id in self.token_ids:
            if input_ids[0][-1] == stop_id:
                return True
        return False


if __name__ == "__main__":
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument("-h", "--help", action="help", help="Show this help message and exit.")
    parser.add_argument("-m", "--model_path", required=True, type=str, help="Required. model path")
    parser.add_argument(
        "-l", "--max_sequence_length", default=256, required=False, type=int, help="Maximum length of output"
    )
    parser.add_argument(
        "-d", "--device", default="CPU", required=False, type=str, help="Device for inference"
    )
    args = parser.parse_args()
    model_dir = args.model_path

    ov_config = {"PERFORMANCE_HINT": "LATENCY", "NUM_STREAMS": "1", "CACHE_DIR": ""}

    tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)

    print("====Compiling model====")
    ov_model = OVModelForCausalLM.from_pretrained(
        model_dir,
        device=args.device,
        ov_config=ov_config,
        config=AutoConfig.from_pretrained(model_dir, trust_remote_code=True),
        trust_remote_code=True,
    )

    streamer = TextIteratorStreamer(tokenizer, timeout=60.0, skip_prompt=True, skip_special_tokens=True)
    stop_tokens = [StopOnTokens([151329, 151336, 151338])]

    def convert_history_to_token(history: List[Tuple[str, str]]):
        # Turn the (user, assistant) history into tokenized chat-template input ids.
        messages = []
        for idx, (user_msg, model_msg) in enumerate(history):
            if idx == len(history) - 1 and not model_msg:
                messages.append({"role": "user", "content": user_msg})
                break
            if user_msg:
                messages.append({"role": "user", "content": user_msg})
            if model_msg:
                messages.append({"role": "assistant", "content": model_msg})
        model_inputs = tokenizer.apply_chat_template(
            messages, add_generation_prompt=True, tokenize=True, return_tensors="pt"
        )
        return model_inputs

    history = []
    print("====Starting conversation====")
    while True:
        input_text = input("用户: ")
        if input_text.lower() == "stop":
            break
        if input_text.lower() == "clear":
            history = []
            print("AI助手: 对话历史已清空")
            continue
        print("GLM-4-9B-OpenVINO:", end=" ")
        history = history + [[input_text, ""]]
        model_inputs = convert_history_to_token(history)
        generate_kwargs = dict(
            input_ids=model_inputs,
            max_new_tokens=args.max_sequence_length,
            temperature=0.1,
            do_sample=True,
            top_p=1.0,
            top_k=50,
            repetition_penalty=1.1,
            streamer=streamer,
            stopping_criteria=StoppingCriteriaList(stop_tokens),
        )
        # Stream tokens from a background generation thread.
        t1 = Thread(target=ov_model.generate, kwargs=generate_kwargs)
        t1.start()
        partial_text = ""
        for new_text in streamer:
            print(new_text, end="", flush=True)
            partial_text += new_text
        print("\n")
        history[-1][1] = partial_text
optimum>=1.20.0
optimum-intel @ git+https://github.com/huggingface/optimum-intel.git@c1ee8ac0864e25e22ea56b5a37a35451531da0e6
FROM image.sourcefind.cn:5000/dcu/admin/base/pytorch:2.1.0-centos7.6-dtk24.04-py310
FROM image.sourcefind.cn:5000/dcu/admin/base/pytorch:2.4.1-ubuntu22.04-dtk25.04-py3.10
# GLM-4-9B Chat Fine-tuning
In this demo, you will experience how to fine-tune the GLM-4-9B-Chat open source model (visual understanding model is
not supported). Please strictly follow the steps in the document to avoid unnecessary errors.
[中文阅读](README_zh.md)
## Hardware Check
**The data in this document was tested in the following hardware environment. Actual environment requirements and the GPU memory used at runtime may differ slightly; please refer to your actual environment.**
All fine-tuning tests were performed in the following environment:
> OS: Ubuntu 22.04
>
> Memory: 512GB
>
> Python: 3.12.3
>
> CUDA Version: 12.4
>
> GPU Driver: 535.104.05
>
> GPU: NVIDIA H100 80GB HBM3 (hereafter referred to as GPU)
+ Fine-tuning based on Llama-Factory
| Fine-tuning Model | Fine-tuning solution | GPU memory usage |
|-----------------------|----------------------|------------------------------|
| GLM-4-9B-0414 | lora | 22G (Each GPU, Need 1 GPU) |
| GLM-4-9B-0414 | SFT (Zero3 method) | 55G (Each GPU, Need 4 GPUs) |
| GLM-4-9B-0414 | lora | 80G (Each GPU, Need 8 GPUs) |
| GLM-4-32B-0414 | SFT (Zero3 method) | 80G (Each GPU, Need 16 GPUs) |
+ Fine-tuning based on this repository
| Fine-tuning Model | Fine-tuning solution | GPU memory usage |
|--------------------------|------------------------------------|-------------------------------|
| GLM-4V-9B | lora (PEFT), Include EVA2CLIPModel | 75G (Each GPU, Need 1 GPU) |
| GLM-4-9B-Chat | lora (PEFT) | 22G (Each GPU, Need 1 GPU) |
| GLM-4-9B-Chat | SFT (Zero3 method) | 80G (Each GPU, Need 8 GPUs) |
## Preparation
Before starting fine-tuning, please install the dependencies in `inference`, ensure you have cloned the latest version of the model repository, and install the dependencies in this directory:
```bash
pip install -r requirements.txt
......@@ -95,21 +109,107 @@ For data files, the sample uses the following format:
This is a sample without tools:
```json
{
"messages": [
{
"role": "user",
"content": "类型#裤*材质#牛仔布*风格#性感"
},
{
"role": "assistant",
"content": "3x1的这款牛仔裤采用浅白的牛仔面料为裤身材质,其柔然的手感和细腻的质地,在穿着舒适的同时,透露着清纯甜美的个性气质。除此之外,流畅的裤身剪裁将性感的腿部曲线彰显的淋漓尽致,不失为一款随性出街的必备单品。"
}
]
}
```
This is a sample with tools:
```json
{
"messages": [
{
"role": "system",
"content": "",
"tools": [
{
"type": "function",
"function": {
"name": "get_recommended_books",
"description": "Get recommended books based on user's interests",
"parameters": {
"type": "object",
"properties": {
"interests": {
"type": "array",
"items": {
"type": "string"
},
"description": "The interests to recommend books for"
}
},
"required": [
"interests"
]
}
}
}
]
},
{
"role": "user",
"content": "Hi, I am looking for some book recommendations. I am interested in history and science fiction."
},
{
"role": "assistant",
"content": "{\"name\": \"get_recommended_books\", \"arguments\": {\"interests\": [\"history\", \"science fiction\"]}}"
},
{
"role": "observation",
"content": "{\"books\": [\"Sapiens: A Brief History of Humankind by Yuval Noah Harari\", \"A Brief History of Time by Stephen Hawking\", \"Dune by Frank Herbert\", \"The Martian by Andy Weir\"]}"
},
{
"role": "assistant",
"content": "Based on your interests in history and science fiction, I would recommend the following books: \"Sapiens: A Brief History of Humankind\" by Yuval Noah Harari, \"A Brief History of Time\" by Stephen Hawking, \"Dune\" by Frank Herbert, and \"The Martian\" by Andy Weir."
}
]
}
```
{"messages": [{"role": "system", "content": "", "tools": [{"type": "function", "function": {"name": "get_recommended_books", "description": "Get recommended books based on user's interests", "parameters": {"type": "object", "properties": {"interests": {"type": "array", "items": {"type": "string"}, "description": "The interests to recommend books for"}}, "required": ["interests"]}}}]}, {"role": "user", "content": "Hi, I am looking for some book recommendations. I am interested in history and science fiction."}, {"role": "assistant", "content": "{\"name\": \"get_recommended_books\", \"arguments\": {\"interests\": [\"history\", \"science fiction\"]}}"}, {"role": "observation", "content": "{\"books\": [\"Sapiens: A Brief History of Humankind by Yuval Noah Harari\", \"A Brief History of Time by Stephen Hawking\", \"Dune by Frank Herbert\", \"The Martian by Andy Weir\"]}"}, {"role": "assistant", "content": "Based on your interests in history and science fiction, I would recommend the following books: \"Sapiens: A Brief History of Humankind\" by Yuval Noah Harari, \"A Brief History of Time\" by Stephen Hawking, \"Dune\" by Frank Herbert, and \"The Martian\" by Andy Weir."}]}
This is a sample with VQA Task:
```json
{
"messages": [
{
"role": "user",
"content": "图片中的动物是什么?",
"image": "/root/images/0001.jpg"
},
{
"role": "assistant",
"content": "图片中有一只猫。"
},
{
"role": "user",
"content": "图片中的猫在做什么?"
},
{
"role": "assistant",
"content": "这只猫坐在或站在桌子上,桌上有很多食物。"
}
]
}
```
- The `system` role is optional, but if it exists, it must appear before the `user` role, and the `system` role can only
appear once in a complete conversation (whether it is a single round or a multi-round conversation).
- The `tools` field is optional, but if it exists, it must appear after the `system` role, and the `tools` field can
only appear once in a complete conversation (whether it is a single round or a multi-round conversation). When
the `tools` field exists, the `system` role must exist and the `content` field is empty.
- `GLM-4V-9B` does not support the `tools` field and the `system` field. And `image` must be placed in the first
message. The `image` field needs to contain the `absolute path` of the image.
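Training data is expected as JSONL, one conversation object per line. A minimal sketch that appends a record in the format above (the `train.jsonl` path is a placeholder; point it at the `train_file` from your data_config):

```python
import json

# Sketch: write one conversation per line (JSONL), matching the samples above.
sample = {
    "messages": [
        {"role": "user", "content": "类型#裤*材质#牛仔布*风格#性感"},
        {"role": "assistant", "content": "3x1的这款牛仔裤采用浅白的牛仔面料为裤身材质……"},
    ]
}

# "train.jsonl" is a hypothetical output path for illustration.
with open("train.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(sample, ensure_ascii=False) + "\n")
```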
## Configuration file
......@@ -119,9 +219,8 @@ The fine-tuning configuration file is located in the `config` directory, includi
2. `lora.yaml / sft.yaml`: Configuration files for the different fine-tuning modes, including model parameters, optimizer
parameters, training parameters, etc. Some important parameters are explained as follows:
+ data_config section
+ train_file: File path of training dataset.
+ val_file: File path of validation dataset.
+ test_file: File path of test dataset.
......@@ -152,8 +251,7 @@ The fine-tuning configuration file is located in the `config` directory, includi
+ r: rank of LoRA.
+ lora_alpha: scaling factor of LoRA.
+ lora_dropout: dropout probability to use in LoRA layer (see the `LoraConfig` sketch after this list).
+ P-TuningV2 parameters:
+ num_virtual_tokens: the number of virtual tokens.
+ num_attention_heads: 2: the number of attention heads of P-TuningV2 (do not change).
+ token_dim: 256: the token dimension of P-TuningV2 (do not change).
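For orientation, the LoRA entries above map directly onto a PEFT `LoraConfig`; a minimal sketch using the values from the `lora.yaml` fragment later in this diff:

```python
from peft import LoraConfig, TaskType

# Sketch: the lora.yaml values expressed as a PEFT config.
# In this repository the config is built from the YAML; this is only an illustration.
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                # rank of LoRA
    lora_alpha=32,      # scaling factor
    lora_dropout=0.1,   # dropout inside LoRA layers
    target_modules=["q_proj", "k_proj", "v_proj"],
)
# It would be applied to a base model with peft.get_peft_model(base_model, lora_config).
```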
......@@ -163,15 +261,31 @@ Execute **single machine multi-card/multi-machine multi-card** run through the f
the acceleration solution, and you need to install `deepspeed`.
```shell
OMP_NUM_THREADS=1 torchrun --standalone --nnodes=1 --nproc_per_node=8 finetune.py data/AdvertiseGen/ THUDM/GLM-4-9b-0414 configs/lora.yaml # For Chat Fine-tune
OMP_NUM_THREADS=1 torchrun --standalone --nnodes=1 --nproc_per_node=8 finetune_vision.py data/CogVLM-311K/ THUDM/glm-4v-9b configs/lora.yaml # For VQA Fine-tune
```
Execute **single machine single card** run through the following code.
```shell
python finetune.py data/AdvertiseGen/ THUDM/GLM-4-9B-0414 configs/lora.yaml # For Chat Fine-tune
python finetune_vision.py data/CogVLM-311K/ THUDM/glm-4v-9b configs/lora.yaml # For VQA Fine-tune
```
## Log Visualization Support
The fine-tuning code supports using SwanLab to visualize and track training metrics. You can enable tracking by installing SwanLab:
```shell
pip install swanlab
```
You can visit the [SwanLab Visualization Dashboard](https://swanlab.cn/@ShaohonChen/GLM4-Finetune) to view the training logs of example fine-tuning scripts.
If prompted to log in, you can obtain an API Key by visiting [https://swanlab.cn/space/~/settings](https://swanlab.cn/space/~/settings).
If you only want to use the local dashboard, set `swanlab: local` in the configuration parameters and use the `swanlab watch` command to start the offline dashboard.
## Fine-tune from a saved point
If you train as described above, each fine-tuning will start from the beginning. If you want to fine-tune from a
......@@ -184,21 +298,11 @@ half-trained model, you can add a fourth parameter, which can be passed in two w
For example, this is an example code to continue fine-tuning from the last saved point
```shell
python finetune.py data/AdvertiseGen/ THUDM/GLM-4-9B-0414 configs/lora.yaml yes
```
## Use the fine-tuned model
### Verify the fine-tuned model in inference.py
You can use the fine-tuned model in `finetune_demo/inference.py` and easily test it with just one line of code.
```shell
python inference.py your_finetune_path
```
In this way, the answer you get is the fine-tuned answer.
### Use the fine-tuned model in other demos in this repository or external repositories
You can use our `LORA` and fully fine-tuned models in any demo. This requires you to modify the code yourself according
......@@ -212,26 +316,16 @@ to the following tutorial.
> in `adapter_config.json`.
```python
def load_model_and_tokenizer(model_dir: Union[str, Path]) -> tuple[ModelType, TokenizerType]:
    model_dir = _resolve_path(model_dir)
    if (model_dir / "adapter_config.json").exists():
        # A LoRA/PEFT checkpoint: load the adapter and read its base model path for the tokenizer.
        model = AutoPeftModelForCausalLM.from_pretrained(model_dir, device_map="auto")
        tokenizer_dir = model.peft_config["default"].base_model_name_or_path
    else:
        model = AutoModelForCausalLM.from_pretrained(model_dir, device_map="auto")
        tokenizer_dir = model_dir
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_dir)
    return model, tokenizer
```
2. Read the fine-tuned model. Please note that you should use the location of the fine-tuned model. For example, if your
......@@ -240,11 +334,12 @@ return model, tokenizer
as `model_dir`.
3. After completing the above operations, you can use the fine-tuned model normally; other calling methods remain
unchanged (see the usage sketch below).
4. This fine-tuning script has not been tested on long texts of 128K or 1M tokens. Fine-tuning long texts requires GPU
devices with larger memory and more efficient fine-tuning solutions, which developers need to handle on their own.
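For example, a minimal sketch of steps 2 and 3, assuming the `load_model_and_tokenizer` helper above is in scope (the adapter path is a placeholder):

```python
# Sketch: load a LoRA fine-tune with the helper above and run one query.
# "/path/to/finetune_adapter_model" is a placeholder for your checkpoint directory.
model, tokenizer = load_model_and_tokenizer("/path/to/finetune_adapter_model")

inputs = tokenizer.apply_chat_template(
    [{"role": "user", "content": "类型#裤*材质#牛仔布*风格#性感"}],
    add_generation_prompt=True, tokenize=True, return_tensors="pt",
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```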
## Reference
```
@inproceedings{liu2022p,
title={P-tuning: Prompt tuning can be comparable to fine-tuning across scales and tasks},
author={Liu, Xiao and Ji, Kaixuan and Fu, Yicheng and Tam, Weng and Du, Zhengxiao and Yang, Zhilin and Tang, Jie},
......@@ -262,5 +357,4 @@ eprint={2306.05301},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```
# GLM-4-9B Chat 对话模型微调
Read this in [English](README_en.md)
本 demo 中,你将体验到如何微调 GLM-4-9B-Chat 对话开源模型(不支持视觉理解模型)。 请严格按照文档的步骤进行操作,以避免不必要的错误。
## 硬件检查
**本文档的数据均在以下硬件环境测试,实际运行环境需求和运行占用的显存略有不同,请以实际运行环境为准。**
所有微调测试均在以下环境和硬件下测试:
> OS: Ubuntu 22.04
>
> Memory: 512GB
>
> Python: 3.12.3
>
> CUDA Version: 12.4
>
> GPU Driver: 535.104.05
>
> GPU: NVIDIA H100 80GB HBM3 (以下简称 GPU)
+ 基于 Llama-Factory 进行微调
| Fine-tuning Model | Fine-tuning solution | GPU memory usage |
|-----------------------|----------------------|------------------------------|
| GLM-4-9B-0414 | lora | 22G (Each GPU, Need 1 GPU) |
| GLM-4-9B-0414 | SFT (Zero3 method) | 55G (Each GPU, Need 4 GPUs) |
| GLM-4-9B-0414 | lora | 80G (Each GPU, Need 8 GPUs) |
| GLM-4-32B-0414 | SFT (Zero3 method) | 80G (Each GPU, Need 16 GPUs) |
+ 基于本仓库代码微调
| Fine-tuning Model | Fine-tuning solution | GPU memory usage |
|--------------------------|------------------------------------|-------------------------------|
| GLM-4V-9B | lora (PEFT), Include EVA2CLIPModel | 75G (Each GPU, Need 1 GPU) |
| GLM-4-9B-Chat | lora (PEFT) | 22G (Each GPU, Need 1 GPU) |
| GLM-4-9B-Chat | SFT (Zero3 method) | 80G (Each GPU, Need 8 GPUs) |
## 准备工作
在开始微调之前,请先安装 `inference` 中的依赖,并确保已克隆最新版本的模型仓库,同时安装本目录下的依赖项:
```bash
pip install -r requirements.txt
......@@ -50,7 +67,7 @@ pip install -r requirements.txt
"<arg name>": "<arg value>"
}
}
// Add more tools if needed
]
},
{
......@@ -94,14 +111,98 @@ pip install -r requirements.txt
这里是一个不带有工具的例子:
```json
{
"messages": [
{
"role": "user",
"content": "类型#裤*材质#牛仔布*风格#性感"
},
{
"role": "assistant",
"content": "3x1的这款牛仔裤采用浅白的牛仔面料为裤身材质,其柔然的手感和细腻的质地,在穿着舒适的同时,透露着清纯甜美的个性气质。除此之外,流畅的裤身剪裁将性感的腿部曲线彰显的淋漓尽致,不失为一款随性出街的必备单品。"
}
]
}
```
这是一个带有工具调用的例子:
```json
{
"messages": [
{
"role": "system",
"content": "",
"tools": [
{
"type": "function",
"function": {
"name": "get_recommended_books",
"description": "Get recommended books based on user's interests",
"parameters": {
"type": "object",
"properties": {
"interests": {
"type": "array",
"items": {
"type": "string"
},
"description": "The interests to recommend books for"
}
},
"required": [
"interests"
]
}
}
}
]
},
{
"role": "user",
"content": "Hi, I am looking for some book recommendations. I am interested in history and science fiction."
},
{
"role": "assistant",
"content": "{\"name\": \"get_recommended_books\", \"arguments\": {\"interests\": [\"history\", \"science fiction\"]}}"
},
{
"role": "observation",
"content": "{\"books\": [\"Sapiens: A Brief History of Humankind by Yuval Noah Harari\", \"A Brief History of Time by Stephen Hawking\", \"Dune by Frank Herbert\", \"The Martian by Andy Weir\"]}"
},
{
"role": "assistant",
"content": "Based on your interests in history and science fiction, I would recommend the following books: \"Sapiens: A Brief History of Humankind\" by Yuval Noah Harari, \"A Brief History of Time\" by Stephen Hawking, \"Dune\" by Frank Herbert, and \"The Martian\" by Andy Weir."
}
]
}
```
{"messages": [{"role": "system", "content": "", "tools": [{"type": "function", "function": {"name": "get_recommended_books", "description": "Get recommended books based on user's interests", "parameters": {"type": "object", "properties": {"interests": {"type": "array", "items": {"type": "string"}, "description": "The interests to recommend books for"}}, "required": ["interests"]}}}]}, {"role": "user", "content": "Hi, I am looking for some book recommendations. I am interested in history and science fiction."}, {"role": "assistant", "content": "{\"name\": \"get_recommended_books\", \"arguments\": {\"interests\": [\"history\", \"science fiction\"]}}"}, {"role": "observation", "content": "{\"books\": [\"Sapiens: A Brief History of Humankind by Yuval Noah Harari\", \"A Brief History of Time by Stephen Hawking\", \"Dune by Frank Herbert\", \"The Martian by Andy Weir\"]}"}, {"role": "assistant", "content": "Based on your interests in history and science fiction, I would recommend the following books: \"Sapiens: A Brief History of Humankind\" by Yuval Noah Harari, \"A Brief History of Time\" by Stephen Hawking, \"Dune\" by Frank Herbert, and \"The Martian\" by Andy Weir."}]}
这是一个视觉VQA微调的例子:
```json
{
"messages": [
{
"role": "user",
"content": "图片中的动物是什么?",
"image": "/root/images/0001.jpg"
},
{
"role": "assistant",
"content": "图片中有一只猫。"
},
{
"role": "user",
"content": "图片中的猫在做什么?"
},
{
"role": "assistant",
"content": "这只猫坐在或站在桌子上,桌上有很多食物。"
}
]
}
```
- `system` 角色为可选角色,但若存在 `system` 角色,其必须出现在 `user`
......@@ -109,13 +210,15 @@ pip install -r requirements.txt
- `tools` 字段为可选字段,若存在 `tools` 字段,其必须出现在 `system`
角色之后,且一个完整的对话数据(无论单轮或者多轮对话)只能出现一次 `tools` 字段。当 `tools` 字段存在时,`system`
角色必须存在并且 `content` 字段为空。
- `GLM-4V-9B` 不支持 `tools` 字段和 `system` 字段。并且 `image` 必须放在第一条消息中,`image`
  字段需要放置图片的 `绝对路径`。
## 配置文件
微调配置文件位于 `config` 目录下,包括以下文件:
1. `ds_zero_2.json / ds_zero_3.json`: deepspeed 配置文件。
2. `lora.yaml / sft.yaml`: 模型不同方式的配置文件,包括模型参数、优化器参数、训练参数等。 部分重要参数解释如下:
+ data_config 部分
+ train_file: 训练数据集的文件路径。
+ val_file: 验证数据集的文件路径。
......@@ -154,18 +257,34 @@ pip install -r requirements.txt
## 开始微调
通过以下代码执行 **单机多卡/多机多卡** 运行,这是使用 `deepspeed` 作为加速方案的,您需要安装 `deepspeed`。接着,按照此命令运行:
```shell
OMP_NUM_THREADS=1 torchrun --standalone --nnodes=1 --nproc_per_node=8 finetune.py data/AdvertiseGen/ THUDM/GLM-4-9B-0414 configs/lora.yaml # For Chat Fine-tune
OMP_NUM_THREADS=1 torchrun --standalone --nnodes=1 --nproc_per_node=8 finetune_vision.py data/CogVLM-311K/ THUDM/glm-4v-9b configs/lora.yaml # For VQA Fine-tune
```
通过以下代码执行 **单机单卡** 运行。
```shell
python finetune.py data/AdvertiseGen/ THUDM/GLM-4-9B-0414 configs/lora.yaml # For Chat Fine-tune
python finetune_vision.py data/CogVLM-311K/ THUDM/glm-4v-9b configs/lora.yaml # For VQA Fine-tune
```
## 日志可视化支持
微调代码支持使用SwanLab对训练指标进行可视化跟踪。可通过安装SwanLab开启跟踪:
```shell
pip install swanlab
```
可以访问[SwanLab可视化看板](https://swanlab.cn/@ShaohonChen/GLM4-Finetune)获得案例微调脚本的训练日志。
如果提示登录,可以通过访问[https://swanlab.cn/space/~/settings](https://swanlab.cn/space/~/settings)获取API Key。
如果仅使用本地看板,可在配置参数中设置 `swanlab: local`,并使用 `swanlab watch` 命令开启离线看板。
## 从保存点进行微调
如果按照上述方式进行训练,每次微调都会从头开始,如果你想从训练一半的模型开始微调,你可以加入第四个参数,这个参数有两种传入方式:
......@@ -176,23 +295,11 @@ python finetune_hf.py data/AdvertiseGen/ THUDM/glm-4-9b-chat configs/lora.yam
例如,这就是一个从最后一个保存点继续微调的示例代码
```shell
python finetune.py data/AdvertiseGen/ THUDM/GLM-4-9B-0414 configs/lora.yaml yes
```
## 使用微调后的模型
### 在 inference.py 中验证微调后的模型
您可以在 `finetune_demo/inference.py` 中使用微调后的模型,仅需要一行代码就能简单地进行测试。
```shell
python inference.py your_finetune_path
```
这样,得到的回答就是微调后的回答了。
### 在本仓库的其他 demo 或者外部仓库使用微调后的模型
您可以在任何一个 demo 内使用我们的 `LORA` 和 全参微调的模型。这需要你自己按照以下教程进行修改代码。
1. 使用`finetune_demo/inference.py`中读入模型的方式替换 demo 中读入模型的方式。
......@@ -201,34 +308,26 @@ python inference.py your_finetune_path
> 中记录了微调模型的路径,如果你的原始模型位置发生更改,则你应该修改 `adapter_config.json` 中 `base_model_name_or_path` 的路径。
```python
def load_model_and_tokenizer(model_dir: Union[str, Path]) -> tuple[ModelType, TokenizerType]:
    model_dir = _resolve_path(model_dir)
    if (model_dir / "adapter_config.json").exists():
        model = AutoPeftModelForCausalLM.from_pretrained(model_dir, device_map="auto")
        tokenizer_dir = model.peft_config["default"].base_model_name_or_path
    else:
        model = AutoModelForCausalLM.from_pretrained(model_dir, device_map="auto")
        tokenizer_dir = model_dir
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_dir)
    return model, tokenizer
```
2. 读取微调的模型,请注意,你应该使用微调模型的位置,例如,若你的模型位置为`/path/to/finetune_adapter_model`
,原始模型地址为`path/to/base_model`,则你应该使用`/path/to/finetune_adapter_model`作为`model_dir`
3. 完成上述操作后,就能正常使用微调的模型了,其他的调用方式没有变化。
4. 本微调脚本没有测试过 128K、1M 等长文本的微调,长文本的微调需要更大显存的 GPU 设备,并且需要更高效的微调方案,需要开发者自行解决。
## 参考文献
```
@inproceedings{liu2022p,
title={P-tuning: Prompt tuning can be comparable to fine-tuning across scales and tasks},
author={Liu, Xiao and Ji, Kaixuan and Fu, Yicheng and Tam, Weng and Du, Zhengxiao and Yang, Zhilin and Tang, Jie},
......@@ -246,5 +345,4 @@ eprint={2306.05301},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```
......@@ -26,4 +26,4 @@
"train_batch_size": "auto",
"train_micro_batch_size_per_gpu": "auto",
"wall_clock_breakdown": false
}
......@@ -28,4 +28,4 @@
"stage3_max_reuse_distance": 1e9,
"stage3_gather_16bit_weights_on_model_save": true
}
}
......@@ -3,8 +3,13 @@ data_config:
val_file: dev.jsonl
test_file: dev.jsonl
num_proc: 1
combine: True
freezeV: True
max_input_length: 512
max_output_length: 512
# swanlab: "local" # set to local if you don't use the cloud dashboard
training_args:
# see `transformers.Seq2SeqTrainingArguments`
output_dir: ./output
......@@ -22,9 +27,10 @@ training_args:
log_level: info
logging_strategy: steps
logging_steps: 10
run_name: "glm4-lora-finetune"
# settings for evaluation
per_device_eval_batch_size: 4
eval_strategy: steps
eval_steps: 500
# settings for optimizer
# adam_epsilon: 1e-6
......@@ -35,10 +41,13 @@ training_args:
generation_config:
max_new_tokens: 512
# set your absolute deepspeed path here
# deepspeed: configs/ds_zero_3.json
deepspeed: configs/ds_zero_2.json
peft_config:
peft_type: LORA
task_type: CAUSAL_LM
r: 8
lora_alpha: 32
lora_dropout: 0.1
target_modules: ["q_proj", "k_proj", "v_proj"]
......@@ -3,8 +3,13 @@ data_config:
val_file: dev.jsonl
test_file: dev.jsonl
num_proc: 1
combine: True
freezeV: True
max_input_length: 512
max_output_length: 512
# swanlab: "local" # set to local if you don't use the cloud dashboard
training_args:
# see `transformers.Seq2SeqTrainingArguments`
output_dir: ./output
......@@ -22,9 +27,10 @@ training_args:
log_level: info
logging_strategy: steps
logging_steps: 10
run_name: "glm4-sft-finetune"
# settings for evaluation
per_device_eval_batch_size: 16
eval_strategy: steps
eval_steps: 500
# settings for optimizer
# adam_epsilon: 1e-6
......