Commit d3e0fa63 authored by chenzk

v1.0.3

# Llama-3
Unsloth QLoRA training of Llama3-8B needs as little as 7.75 GB of VRAM, which means Llama3-8B can be trained on a single 1080Ti-class card. The other models in the Firefly and unsloth libraries can be used analogously, following the Llama3 workflow described here.
## Paper
[`The Llama 3 Herd of Models`](https://scontent-lax3-1.xx.fbcdn.net/v/t39.2365-6/452387774_1036916434819166_4173978747091533306_n.pdf?_nc_cat=104&ccb=1-7&_nc_sid=3c67a6&_nc_ohc=DTS7hDTcxZoQ7kNvgG4RrkU&_nc_ht=scontent-lax3-1.xx&gid=A3dKZbFlHdljWrPNA8TkhWm&oh=00_AYCbABVzvTwp7wvKJmAN-2IZeSwABLVkUK0nSbEDvuOaog&oe=66AF7D4D)
## Model Architecture
Llama3 uses a standard decoder-only architecture, with a few small improvements over Llama2: 1. a tokenizer with a 128K-token vocabulary, which greatly improves model performance; 2. grouped query attention (GQA) in both the 8B and 70B models, improving inference performance; 3. training on 8,192-token sequences, so that self-attention generally does not cross document boundaries. The figure below, from the Llama3 paper, shows the multimodal version of the architecture:
<div align=center>
<img src="./doc/structure.png"/>
</div>
## Algorithm
Llama3 embeds the input and feeds it through attention and FFN layers to extract features. Finally, a Softmax converts the unnormalized score vector (logits) produced by the decoder's last layer into a probability distribution, in which each element is the probability of generating the corresponding vocabulary token; the model can then pick the most probable token from this distribution as its prediction.
<div align=center>
<img src="./doc/algorithm.png"/>
</div>
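The logits-to-probabilities step described above can be sketched in a few lines of plain Python (a toy stand-in for the decoder's final Softmax; the four-entry vocabulary is illustrative):

```python
import math

def softmax(logits):
    # Subtract the max logit for numerical stability before exponentiating
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy logits for a 4-token vocabulary
logits = [2.0, 1.0, 0.1, -1.0]
probs = softmax(logits)
next_token = probs.index(max(probs))  # greedy pick of the most probable token
```

In practice the model samples from this distribution (or applies top-k/top-p filtering); greedy argmax is the simplest decoding strategy.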
## Environment Setup
```
mv Firefly-llama3_unsloth Firefly # drop the framework-name suffix
```
### Docker (Method 1)
```
docker pull image.sourcefind.cn:5000/dcu/admin/base/pytorch:2.1.0-ubuntu20.04-dtk24.04.1-py3.10
# Replace <your IMAGE ID> with the ID of the image pulled above; for this image it is a4dd5be0ca23
docker run -it --shm-size=32G -v $PWD/Firefly:/home/Firefly -v /opt/hyhal:/opt/hyhal:ro --privileged=true --device=/dev/kfd --device=/dev/dri/ --group-add video --name firefly <your IMAGE ID> bash
cd /home/Firefly
pip install -r requirements.txt
```
### Dockerfile (Method 2)
```
cd Firefly/docker
docker build --no-cache -t firefly:latest .
docker run --shm-size=32G --name firefly -v /opt/hyhal:/opt/hyhal:ro --privileged=true --device=/dev/kfd --device=/dev/dri/ --group-add video -v $PWD/../../Firefly:/home/Firefly -it firefly bash
# If installing the environment via the Dockerfile takes too long, comment out the pip install steps in the Dockerfile and install the Python packages after the container starts: pip install -r requirements.txt.
```
### Anaconda (Method 3)
1. The DCU-specific deep learning libraries required by this project can be downloaded from the developer community (光合开发者社区):
- https://developer.hpccube.com/tool/
```
DTK驱动:dtk24.04.1
python:python3.10
torch:2.1.0
torchvision:0.16.0
triton:2.1.0
torchaudio:2.1.2
deepspeed:0.12.3
bitsandbytes:0.42.0
flash-attn:2.0.4
xformers:0.0.25
```
`Tips: the DTK driver, python, torch and other DCU-related tool versions above must be matched exactly, one to one.`
2. Install the remaining, non-DCU-specific packages from requirements.txt
```
pip install -r requirements.txt
```
Note: after the base environment above is configured, you still need to install [`unsloth`](./unsloth.zip). You can either unzip the pre-patched unsloth archive and install it directly with pip, or download the unsloth source from GitHub and patch it yourself as shown below:
```
# Install unsloth
cd unsloth
pip install .
```
`If you are working from the unmodified upstream unsloth source on GitHub (the unsloth directory shipped with this project is the unmodified source):`
```
# Patch the installed unsloth
vim /usr/local/lib/python3.10/site-packages/unsloth/kernels/cross_entropy_loss.py
MAX_FUSED_SIZE = 65536 -> MAX_FUSED_SIZE = 16384
num_warps = 32 -> num_warps = 8 # inside _chunked_cross_entropy_forward[(n_rows, n_chunks,)] in the Fast_CrossEntropyLoss class
vim /usr/local/lib/python3.10/site-packages/unsloth/kernels/utils.py
if BLOCK_SIZE >= 32768: num_warps = 32 -> if BLOCK_SIZE >= 32768: num_warps = 8
elif BLOCK_SIZE >= 8192: num_warps = 16 -> elif BLOCK_SIZE >= 8192: num_warps = 8
# inside the calculate_settings function
vim /usr/local/lib/python3.10/site-packages/unsloth/models/_utils.py
model_architectures = ["llama", "mistral", "gemma", "gemma2", "qwen2",] -> model_architectures = ["llama", "mistral", "qwen2",]
vim /usr/local/lib/python3.10/site-packages/unsloth/models/llama.py
Q = Q.transpose(1, 2) -> Q = Q.transpose(1, 2).half()
K = K.transpose(1, 2) -> K = K.transpose(1, 2).half()
V = V.transpose(1, 2) -> V = V.transpose(1, 2).half()
# under the elif HAS_FLASH_ATTENTION and attention_mask is None branch of LlamaAttention_fast_forward
```
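The manual vim edits above can also be applied as string replacements over the installed package. A hedged sketch (the relative paths and search patterns mirror the notes above, but verify them against your installed unsloth version before running; `apply_patches` is an illustrative helper, not part of the project):

```python
from pathlib import Path

# (relative file, old text, new text) pairs taken from the DCU adaptation notes above
PATCHES = [
    ("kernels/cross_entropy_loss.py", "MAX_FUSED_SIZE = 65536", "MAX_FUSED_SIZE = 16384"),
]

def apply_patches(pkg_root, patches):
    # For each (file, old, new) pair, rewrite the file with `old` replaced by `new`.
    for rel, old, new in patches:
        path = Path(pkg_root) / rel
        text = path.read_text(encoding="utf8")
        if old not in text:
            # Pattern not found: the installed version may differ; skip rather than corrupt
            print(f"skip {rel}: pattern not found")
            continue
        path.write_text(text.replace(old, new), encoding="utf8")
```

Point `pkg_root` at the installed package directory (e.g. `/usr/local/lib/python3.10/site-packages/unsloth`) and extend `PATCHES` with the remaining edits listed above.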
`Tips: chip version >= K100AI and dtk version >= 24.04.1 are recommended.`
## Dataset
For training a Chinese model, the moss dataset is recommended; for English, try the ultrachat and WizardLM datasets.
A mini dataset is included in the project for quick experiments; custom datasets can be built in the same format. The training-data directory layout is:
```
data/
├── dummy_data.jsonl
```
The training data is in jsonl format, with one sample per line in the format below. The conversation field is required; other fields can be added or removed as needed:
```
{
"conversation_id":44,
"category":"Brainstorming",
"conversation":[
{
"human":"怎么能够有效地提升个人能力?",
"assistant":"提升个人能力可以通过以下几个方面实现:\n\n1. 学习新知识和技能:阅读相关书籍、参加课程和工作坊、观看教育视频等学习方式,可以帮助你掌握新的知识和技能,从而提升个人能力。\n\n2. 持续练习和实践:不断地练习和实践所学的知识和技能,可以帮助你更好地掌握和应用这些技能,提高你的表现和效率。\n\n3. 寻求反馈和改进:寻求他人的反馈和建议可以帮助你发现自己的不足之处,进一步完善和改进自己的能力。\n\n4. 建立良好的工作和生活习惯:建立良好的工作和生活习惯,如合理安排时间、注意健康、保持积极态度等,可以提高你的工作效率和生活品质,从而提升个人能力。\n\n5. 社交和交流:与他人交流和合作,可以帮助你更好地了解他人的观点和经验,从而提升自己的能力和见识。"
},
{
"human":"对于初学者来说,如何快速掌握编程技能?",
"assistant":"以下是几个建议,可以帮助初学者快速掌握编程技能:\n\n1. 学习基础知识:建议先学习编程语言的基础知识和语法,例如变量、数据类型、控制结构、函数等。可以通过在线课程、编程书籍、视频教程等方式学习。\n\n2. 实践编程:进行实际的编程练习是非常重要的,可以通过参加编程挑战、编写小项目等方式来进行实践。\n\n3. 寻求反馈:可以寻求他人的反馈和建议,例如在线编程社区、编程导师等,他们可以帮助你发现问题并及时纠正。\n\n4. 阅读优秀代码:阅读和分析优秀的代码可以帮助你学习其他程序员的编程思路和技巧,从而提高自己的编程能力。\n\n5. 不断学习和更新:编程技术和工具在不断更新和发展,因此需要不断学习新的知识和技能,例如学习新的编程语言、框架、工具等,以跟上技术的发展和变化。\n\n总之,学习编程需要耐心和坚持,并且需要不断地实践和学习。通过以上建议,可以帮助初学者快速掌握编程技能。"
},
{
"human":"除了参加编程挑战和阅读代码,还有哪些方式可以提升自己的编程能力呢?",
"assistant":"除了参加编程挑战和阅读代码,还有以下几种方式可以提升自己的编程能力:\n\n1. 参与开源项目:加入一个开源项目,可以让你了解到实际的软件开发流程和团队合作方式,并且可以与其他开发人员合作,学习和分享编程技巧和经验。\n\n2. 进行代码重构:对已经编写的代码进行重构,可以帮助你了解更好的编程实践和设计模式,从而提高代码的可读性和可维护性。\n\n3. 练习算法和数据结构:算法和数据结构是编程中的重要基础,练习算法和数据结构可以帮助你了解不同的数据结构和算法,从而更好地解决实际问题。\n\n4. 学习软件工程和架构:学习软件工程和架构的知识,可以帮助你了解如何设计和构建可靠、可扩展和可维护的软件系统。\n\n5. 参加编程社区和论坛:参加编程社区和论坛可以让你与其他开发人员分享和讨论编程经验和技巧,了解最新的技术趋势和解决方案。\n\n总之,提升编程能力需要不断地学习和实践,并与其他开发人员和编程专家进行交流和合作,以获取更多的经验和知识。"
},
{
"human":"你有没有推荐的编程挑战网站?",
"assistant":"当然,以下是一些值得推荐的编程挑战网站:\n\n1. HackerRank:HackerRank 是一个全球领先的技术招聘平台,提供数百种编程挑战和比赛,包括算法、数据结构、数学、SQL、AI 等多个技术领域。\n\n2. LeetCode:LeetCode 是一个在线编程学习和实践平台,提供数千道编程题目,主要涵盖算法和数据结构。\n\n3. Codeforces:Codeforces 是一个全球知名的编程竞赛网站,提供数百个比赛和练习题目,主要涵盖算法和数据结构。\n\n4. Codewars:Codewars 是一个在线编程学习和实践平台,提供数千道编程题目,主要涵盖算法、数据结构、面向对象编程等多个技术领域。\n\n5. Project Euler:Project Euler 是一个非营利性的在线编程挑战平台,提供一系列数学和计算机科学问题,需要用编程语言进行解答。\n\n以上这些编程挑战网站可以帮助你提高编程能力,并且可以与其他编程爱好者进行交流和竞争。"
},
{
"human":"在学习编程的过程中,如何避免陷入困境和放弃呢?",
"assistant":"学习编程是一个需要长期坚持和不断学习的过程,以下是一些避免陷入困境和放弃的方法:\n\n1. 制定合理的学习计划:制定合理的学习计划,包括学习时间、学习内容、目标等,可以帮助你更好地管理时间和精力,避免学习过程中的松懈和分心。\n\n2. 寻找合适的学习资源:选择适合自己的学习资源,例如在线课程、编程书籍、视频教程等,可以让你更好地了解和掌握编程知识和技能。\n\n3. 寻求帮助和支持:在学习过程中,遇到问题和困难是很正常的,可以寻求他人的帮助和支持,例如参加编程社区、找到编程导师等。\n\n4. 进行实践和项目:实践和项目是学习编程的重要组成部分,可以帮助你更好地了解和掌握编程技能,同时也可以提高学习的兴趣和动力。\n\n5. 坚持并保持兴趣:坚持学习和保持兴趣是学习编程的关键。可以通过参加编程社区、参加编程竞赛、与其他编程爱好者交流等方式来保持兴趣和动力。\n\n总之,学习编程需要耐心和坚持,并需要不断学习和实践。通过以上方法可以帮助你避免陷入困境和放弃。"
}
]
}
```
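Since `conversation` is the only required field, a minimal loader that validates each line can look like this (`load_sft_jsonl` is an illustrative helper, not part of the project):

```python
import json

def load_sft_jsonl(path):
    # Each line is one JSON object; only the 'conversation' field is required
    samples = []
    with open(path, encoding="utf8") as f:
        for lineno, line in enumerate(f, 1):
            line = line.strip()
            if not line:
                continue
            record = json.loads(line)
            if "conversation" not in record:
                raise ValueError(f"line {lineno}: missing required field 'conversation'")
            samples.append(record)
    return samples
```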
For more details, see the original project's [`README_origin`](./README_origin.md)
## Training
### Single Node, Single GPU
```
# Place the pretrained weights at: NousResearch/Meta-Llama-3-8B-Instruct
export HIP_VISIBLE_DEVICES=0
python train.py --train_args_file train_args/sft/qlora/llama3-8b-sft-qlora.json # to use unsloth, set "use_unsloth" to true in the json
```
For more details, see the original project's [`README_origin`](./README_origin.md)
## Result
A sample Llama3 question-answering exchange:
```
User: How often are the Olympic Games held?
Llama-3: The Olympic Games are held every four years. The Summer and Winter Olympics are held separately, as the "Summer Olympic Games" and the "Winter Olympic Games"; each is held every four years, with the Winter Games offset from the Summer Games by two years.
```
### Accuracy
Training accuracy on DCU K100AI matches GPU A800; training framework: unsloth.
## Application Scenarios
### Algorithm Category
`Conversational QA`
### Target Industries
`Manufacturing, media, finance, energy, healthcare, home, education`
## Pretrained Weights
Pretrained-weight quick download center: [SCNet AIModels](http://113.200.138.88:18080/aimodels); the weights used in this project are available from the quick-download channel: [Meta-Llama-3-8B-Instruct](http://113.200.138.88:18080/aimodels/Meta-Llama-3-8B-Instruct)
Hugging Face pretrained weights: [Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct)
## Source Repository and Issue Reporting
- http://developer.hpccube.com/codes/modelzoo/retrieval-based-voice-conversion-webui_pytorch.git
## References
- https://github.com/yangjianxin1/Firefly.git
- https://github.com/unslothai/unsloth.git
from dataclasses import dataclass, field
from typing import Optional
@dataclass
class CustomizedArguments:
"""
Custom training arguments
"""
max_seq_length: int = field(metadata={"help": "maximum input length"})
train_file: str = field(metadata={"help": "training set; if task_type=pretrain, pass a directory and every jsonl file under it will be scanned"})
model_name_or_path: str = field(metadata={"help": "path to the pretrained weights"})
template_name: str = field(default="", metadata={"help": "data template used for sft"})
eval_file: Optional[str] = field(default="", metadata={"help": "validation set"})
max_prompt_length: int = field(default=512, metadata={"help": "maximum prompt length for dpo"})
beta: float = field(default=0.1, metadata={"help": "The beta factor in DPO loss"})
tokenize_num_workers: int = field(default=10, metadata={"help": "number of tokenization workers for pretraining"})
task_type: str = field(default="sft", metadata={"help": "training task: [pretrain, sft]"})
train_mode: str = field(default="qlora", metadata={"help": "training mode: [full, qlora]"})
lora_rank: Optional[int] = field(default=64, metadata={"help": "lora rank"})
lora_alpha: Optional[int] = field(default=16, metadata={"help": "lora alpha"})
lora_dropout: Optional[float] = field(default=0.05, metadata={"help": "lora dropout"})
use_unsloth: Optional[bool] = field(default=False, metadata={"help": "use unsloth or not"})
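These arguments are populated from the JSON file passed via `--train_args_file`. A minimal sketch of that mapping, using an illustrative subset of the fields above (`MiniArgs` and `build_args` are hypothetical helpers for demonstration only):

```python
import json
from dataclasses import dataclass, fields

@dataclass
class MiniArgs:  # illustrative subset of CustomizedArguments
    max_seq_length: int
    train_file: str
    model_name_or_path: str
    train_mode: str = "qlora"
    use_unsloth: bool = False

def build_args(cls, config):
    # Keep only keys that are declared fields of the dataclass
    names = {f.name for f in fields(cls)}
    return cls(**{k: v for k, v in config.items() if k in names})

config = json.loads('{"max_seq_length": 1024, "train_file": "data/dummy_data.jsonl", '
                    '"model_name_or_path": "NousResearch/Meta-Llama-3-8B-Instruct", "use_unsloth": true}')
args = build_args(MiniArgs, config)
```

The real project uses transformers' HfArgumentParser for this; the sketch just shows how JSON keys map onto the dataclass fields and their defaults.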
from typing import Any, Dict, List
import torch
from loguru import logger
class SFTDataCollator(object):
def __init__(self, tokenizer, max_seq_length):
self.tokenizer = tokenizer
self.max_seq_length = max_seq_length
self.pad_token_id = tokenizer.pad_token_id
def __call__(self, batch: List[Dict[str, Any]]) -> Dict[str, Any]:
# Find the longest sequence in the batch
lengths = [len(x['input_ids']) for x in batch if x['input_ids'] is not None]
# Use the batch max length, capped at max_seq_length
batch_max_len = min(max(lengths), self.max_seq_length)
# batch_max_len = self.max_seq_length
input_ids_batch, attention_mask_batch, target_mask_batch = [], [], []
# truncate and padding
for x in batch:
input_ids = x['input_ids']
attention_mask = x['attention_mask']
target_mask = x['target_mask']
if input_ids is None:
logger.info('some input_ids is None')
continue
padding_len = batch_max_len - len(input_ids)
# padding
input_ids = input_ids + [self.pad_token_id] * padding_len
attention_mask = attention_mask + [0] * padding_len
target_mask = target_mask + [0] * padding_len
# truncate
input_ids = input_ids[:self.max_seq_length]
attention_mask = attention_mask[:self.max_seq_length]
target_mask = target_mask[:self.max_seq_length]
input_ids_batch.append(input_ids)
attention_mask_batch.append(attention_mask)
target_mask_batch.append(target_mask)
# Convert lists to tensors to obtain the final model inputs
input_ids_batch = torch.tensor(input_ids_batch, dtype=torch.long)
attention_mask_batch = torch.tensor(attention_mask_batch, dtype=torch.long)
target_mask_batch = torch.tensor(target_mask_batch, dtype=torch.long)
labels = torch.where(target_mask_batch == 1, input_ids_batch, -100)
inputs = {
'input_ids': input_ids_batch,
'attention_mask': attention_mask_batch,
'labels': labels
}
return inputs
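The dynamic padding performed by SFTDataCollator (pad each batch to its own longest sequence, capped at max_seq_length) can be illustrated torch-free; `pad_batch` is an illustrative helper, not part of the project:

```python
def pad_batch(sequences, max_seq_length, pad_token_id=0):
    # Pad to the longest sequence in the batch, but never beyond max_seq_length
    batch_max_len = min(max(len(s) for s in sequences), max_seq_length)
    padded, masks = [], []
    for seq in sequences:
        seq = seq[:batch_max_len]                      # truncate
        pad_len = batch_max_len - len(seq)
        padded.append(seq + [pad_token_id] * pad_len)  # pad input ids
        masks.append([1] * len(seq) + [0] * pad_len)   # 1 = real token, 0 = padding
    return padded, masks
```

Padding to the batch maximum rather than to a fixed max_seq_length is what keeps short batches cheap.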
class PretrainCollator(object):
def __init__(self, tokenizer, max_seq_length):
self.tokenizer = tokenizer
self.max_seq_length = max_seq_length
self.pad_token_id = tokenizer.pad_token_id
def __call__(self, batch: List[Dict[str, Any]]) -> Dict[str, Any]:
batch = [x['input_ids'] for x in batch if x['input_ids'] is not None]
# Find the longest sequence in the batch
lengths = [len(x) for x in batch]
# Use the batch max length, capped at max_seq_length
batch_max_len = min(max(lengths), self.max_seq_length)
# batch_max_len = self.max_seq_length
input_ids_batch, attention_mask_batch, labels_batch = [], [], []
for x in batch:
input_ids = x
attention_mask = [1] * len(input_ids)
padding_len = batch_max_len - len(input_ids)
# padding
labels = input_ids + [-100] * padding_len
input_ids = input_ids + [self.pad_token_id] * padding_len
attention_mask = attention_mask + [0] * padding_len
# truncate
input_ids = input_ids[:self.max_seq_length]
labels = labels[:self.max_seq_length]
attention_mask = attention_mask[:self.max_seq_length]
input_ids_batch.append(input_ids)
labels_batch.append(labels)
attention_mask_batch.append(attention_mask)
# Convert lists to tensors to obtain the final model inputs
input_ids_batch = torch.tensor(input_ids_batch, dtype=torch.long)
labels_batch = torch.tensor(labels_batch, dtype=torch.long)
attention_mask_batch = torch.tensor(attention_mask_batch, dtype=torch.long)
inputs = {
'input_ids': input_ids_batch,
'attention_mask': attention_mask_batch,
'labels': labels_batch
}
return inputs
import json
from loguru import logger
from torch.utils.data import Dataset
class UnifiedSFTDataset(Dataset):
"""
Unified SFT dataset
"""
def __init__(self, file, tokenizer, max_seq_length, template):
self.tokenizer = tokenizer
self.template_name = template.template_name
self.system_format = template.system_format
self.user_format = template.user_format
self.assistant_format = template.assistant_format
self.system = template.system
self.max_seq_length = max_seq_length
logger.info('Loading data: {}'.format(file))
with open(file, 'r', encoding='utf8') as f:
data_list = f.readlines()
logger.info(f'Use template "{self.template_name}" for training')
logger.info("There are {} data in dataset".format(len(data_list)))
self.data_list = data_list
def __len__(self):
return len(self.data_list)
def __getitem__(self, index):
# Each sample is concatenated as: {system_format}{user_format}{assistant_format}{user_format}{assistant_format}...
data = self.data_list[index]
data = json.loads(data)
input_ids, target_mask = [], []
# setting system information
if self.system_format is not None:
system = data['system'].strip() if 'system' in data.keys() else self.system
# system prompt is not empty
if system is not None:
system_text = self.system_format.format(content=system)
input_ids = self.tokenizer.encode(system_text, add_special_tokens=False)
target_mask = [0] * len(input_ids)
conversations = data['conversation']
# Concatenate the multi-turn conversation
for i, conv in enumerate(conversations):
human = conv['human'].strip()
assistant = conv['assistant'].strip()
human = self.user_format.format(content=human, stop_token=self.tokenizer.eos_token)
assistant = self.assistant_format.format(content=assistant, stop_token=self.tokenizer.eos_token)
input_tokens = self.tokenizer.encode(human, add_special_tokens=False)
output_tokens = self.tokenizer.encode(assistant, add_special_tokens=False)
input_ids += input_tokens + output_tokens
target_mask += [0] * len(input_tokens) + [1] * len(output_tokens)
assert len(input_ids) == len(target_mask)
# Truncate to max_seq_length
input_ids = input_ids[:self.max_seq_length]
target_mask = target_mask[:self.max_seq_length]
attention_mask = [1] * len(input_ids)
assert len(input_ids) == len(target_mask) == len(attention_mask)
inputs = {
'input_ids': input_ids,
'attention_mask': attention_mask,
'target_mask': target_mask
}
return inputs
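The prompt/response masking above (mask 0 for user tokens, 1 for assistant tokens, so the loss is computed only on replies) can be sketched with a toy whitespace tokenizer (`build_sample` and the toy vocab are illustrative, not project code):

```python
def build_sample(turns, encode, max_seq_length=512):
    # turns: list of (human_text, assistant_text); loss is computed only on assistant tokens
    input_ids, target_mask = [], []
    for human, assistant in turns:
        h, a = encode(human), encode(assistant)
        input_ids += h + a
        target_mask += [0] * len(h) + [1] * len(a)
    input_ids = input_ids[:max_seq_length]
    target_mask = target_mask[:max_seq_length]
    return {"input_ids": input_ids,
            "attention_mask": [1] * len(input_ids),
            "target_mask": target_mask}

# Toy tokenizer: one id per whitespace-separated token (illustrative only)
vocab = {}
encode = lambda text: [vocab.setdefault(w, len(vocab)) for w in text.split()]
sample = build_sample([("User: hi", "Assistant: hello there")], encode)
```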
class ChatGLM2SFTDataset(UnifiedSFTDataset):
def __getitem__(self, index):
# Each sample is formatted as: [gMASK]sop [Round 1]\n\n问:{input1}\n\n答:{target1}</s>[Round 2]\n\n问:{input2}\n\n答:{target2}</s>...
data = self.data_list[index]
data = json.loads(data)
input_ids = self.tokenizer.get_prefix_tokens()
target_mask = [0] * len(input_ids)
conversations = data['conversation']
# Concatenate the multi-turn conversation
for i, conv in enumerate(conversations):
human = conv['human'].strip()
assistant = conv['assistant'].strip()
human = self.user_format.format(content=human, idx=i + 1)
assistant = self.assistant_format.format(content=assistant)
input_tokens = self.tokenizer.encode(human, add_special_tokens=False)
output_tokens = self.tokenizer.encode(assistant, add_special_tokens=False) + [self.tokenizer.eos_token_id]
input_ids += input_tokens + output_tokens
target_mask += [0] * len(input_tokens) + [1] * len(output_tokens)
assert len(input_ids) == len(target_mask)
# Truncate to max_seq_length
input_ids = input_ids[:self.max_seq_length]
target_mask = target_mask[:self.max_seq_length]
attention_mask = [1] * len(input_ids)
assert len(input_ids) == len(target_mask) == len(attention_mask)
inputs = {
'input_ids': input_ids,
'attention_mask': attention_mask,
'target_mask': target_mask
}
return inputs
class ChatGLM3SFTDataset(UnifiedSFTDataset):
def __getitem__(self, index):
# [gMASK]sop <|system|>xxx<|user|>xxx<|assistant|>xxx<eos>
data = self.data_list[index]
data = json.loads(data)
system = data['system'].strip() if 'system' in data.keys() else self.system
input_ids = self.tokenizer.get_prefix_tokens() + \
[self.tokenizer.get_command(f"<|system|>")] + \
self.tokenizer.encode(system, add_special_tokens=False)
target_mask = [0] * len(input_ids)
conversations = data['conversation']
# Concatenate the multi-turn conversation
for i, conv in enumerate(conversations):
human = conv['human'].strip()
assistant = conv['assistant'].strip()
input_tokens = [self.tokenizer.get_command(f"<|user|>")] + \
self.tokenizer.encode(human, add_special_tokens=False) + \
[self.tokenizer.get_command(f"<|assistant|>")]
output_tokens = self.tokenizer.encode(assistant, add_special_tokens=False) + [self.tokenizer.eos_token_id]
input_ids += input_tokens + output_tokens
target_mask += [0] * len(input_tokens) + [1] * len(output_tokens)
assert len(input_ids) == len(target_mask)
# Truncate to max_seq_length
input_ids = input_ids[:self.max_seq_length]
target_mask = target_mask[:self.max_seq_length]
attention_mask = [1] * len(input_ids)
assert len(input_ids) == len(target_mask) == len(attention_mask)
inputs = {
'input_ids': input_ids,
'attention_mask': attention_mask,
'target_mask': target_mask
}
return inputs
class UnifiedDPODataset(Dataset):
"""
Unified DPO dataset
"""
def __init__(self, file, tokenizer, max_seq_length, max_prompt_length, template):
self.tokenizer = tokenizer
self.template_name = template.template_name
self.system_format = template.system_format
self.user_format = template.user_format
self.assistant_format = template.assistant_format
self.system = template.system
self.max_seq_length = max_seq_length
self.max_prompt_length = max_prompt_length
logger.info('Loading data: {}'.format(file))
with open(file, 'r', encoding='utf8') as f:
data_list = f.readlines()
logger.info(f'Use template "{self.template_name}" for training')
logger.info("There are {} data in dataset".format(len(data_list)))
self.data_list = data_list
def __len__(self):
return len(self.data_list)
def build_prompt_input_ids(self, system, history):
"""
chatglm2: [gMASK]sop [Round 1]\n\n问:{input1}\n\n答:{target1}</s>[Round 2]\n\n问:{input2}\n\n答:{target2}</s>...
chatglm3: [gMASK]sop <|system|>xxx<|user|>xxx<|assistant|>xxx<eos>
others: {system_format}{user_format}{assistant_format}{user_format}{assistant_format}...
"""
# chatglm models have special prefix tokens
if self.template_name in ['chatglm2', 'chatglm3']:
prompt_input_ids = self.tokenizer.get_prefix_tokens()
else:
prompt_input_ids = []
# collect system information
if self.system_format is not None:
system = system if system is not None else self.system
# system prompt is not empty
if system is not None:
if self.template_name == 'chatglm3':
prompt_input_ids += [self.tokenizer.get_command(f"<|system|>")] + self.tokenizer.encode(system, add_special_tokens=False)
else:
system_text = self.system_format.format(content=system)
prompt_input_ids += self.tokenizer.encode(system_text, add_special_tokens=False)
# collect history
for i, conv in enumerate(history):
role = conv['role'].strip()
content = conv['content'].strip()
assert role != 'system', 'there should not be more than one system information'
if role == 'user':
if self.template_name == 'chatglm2':
human = self.user_format.format(content=content, idx=i//2 + 1)
input_ids = self.tokenizer.encode(human, add_special_tokens=False)
elif self.template_name == 'chatglm3':
input_ids = [self.tokenizer.get_command(f"<|user|>")] + \
self.tokenizer.encode(content, add_special_tokens=False) + \
[self.tokenizer.get_command(f"<|assistant|>")]
else:
human = self.user_format.format(content=content, stop_token=self.tokenizer.eos_token)
input_ids = self.tokenizer.encode(human, add_special_tokens=False)
elif role == 'assistant':
if self.template_name in ['chatglm2', 'chatglm3']:
input_ids = self.tokenizer.encode(content, add_special_tokens=False) + [self.tokenizer.eos_token_id]
else:
assistant = self.assistant_format.format(content=content, stop_token=self.tokenizer.eos_token)
input_ids = self.tokenizer.encode(assistant, add_special_tokens=False)
else:
raise Exception('role error')
prompt_input_ids += input_ids
return prompt_input_ids
def __getitem__(self, index):
data = self.data_list[index]
data = json.loads(data)
chosen = data['chosen']
rejected = data['rejected']
assert len(chosen) == len(rejected)
# Check whether the first message is a system message
if chosen[0]['role'] == 'system':
system = chosen[0]['content'].strip()
history = chosen[1:-1] # conversation history
chosen, rejected = chosen[-1], rejected[-1]
else:
system = None
history = chosen[:-1] # conversation history
chosen, rejected = chosen[-1], rejected[-1]
# build prompt
prompt_input_ids = self.build_prompt_input_ids(system, history)
# build response
if self.template_name in ['chatglm2', 'chatglm3']:
chosen_input_ids = self.tokenizer.encode(chosen['content'], add_special_tokens=False) + [self.tokenizer.eos_token_id]
rejected_input_ids = self.tokenizer.encode(rejected['content'], add_special_tokens=False) + [self.tokenizer.eos_token_id]
else:
chosen = self.assistant_format.format(content=chosen['content'], stop_token=self.tokenizer.eos_token)
rejected = self.assistant_format.format(content=rejected['content'], stop_token=self.tokenizer.eos_token)
chosen_input_ids = self.tokenizer.encode(chosen, add_special_tokens=False)
rejected_input_ids = self.tokenizer.encode(rejected, add_special_tokens=False)
# truncate by max_seq_length
longer_response_length = max(len(chosen_input_ids), len(rejected_input_ids))
# if combined sequence is too long, truncate the prompt
if len(prompt_input_ids) + longer_response_length > self.max_seq_length:
max_prompt_length = max(self.max_prompt_length, self.max_seq_length - longer_response_length)
prompt_input_ids = prompt_input_ids[-max_prompt_length:]
# if that's still too long, truncate the response
if len(prompt_input_ids) + longer_response_length > self.max_seq_length:
chosen_input_ids = chosen_input_ids[: self.max_seq_length - len(prompt_input_ids)]
rejected_input_ids = rejected_input_ids[: self.max_seq_length - len(prompt_input_ids)]
chosen_labels = [-100] * len(prompt_input_ids) + chosen_input_ids
chosen_input_ids = prompt_input_ids + chosen_input_ids
rejected_labels = [-100] * len(prompt_input_ids) + rejected_input_ids
rejected_input_ids = prompt_input_ids + rejected_input_ids
assert len(chosen_labels) == len(chosen_input_ids)
assert len(rejected_labels) == len(rejected_input_ids)
inputs = dict(
prompt_input_ids=prompt_input_ids,
prompt_attention_mask=[1]*len(prompt_input_ids),
chosen_input_ids=chosen_input_ids,
chosen_attention_mask=[1]*len(chosen_input_ids),
chosen_labels=chosen_labels,
rejected_input_ids=rejected_input_ids,
rejected_attention_mask=[1]*len(rejected_input_ids),
rejected_labels=rejected_labels,
)
return inputs
# To match the DPOTrainer interface
def map(self, func, **kwargs):
return self
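The two-stage truncation in __getitem__ above (first trim the prompt from the left, then trim both responses if the pair still overflows max_seq_length) can be isolated as a torch-free sketch (`truncate_dpo` is an illustrative helper, not project code):

```python
def truncate_dpo(prompt_ids, chosen_ids, rejected_ids, max_seq_length, max_prompt_length):
    longer = max(len(chosen_ids), len(rejected_ids))
    # Stage 1: if prompt + longest response overflows, keep only the prompt tail
    if len(prompt_ids) + longer > max_seq_length:
        keep = max(max_prompt_length, max_seq_length - longer)
        prompt_ids = prompt_ids[-keep:]
    # Stage 2: if it still overflows, truncate both responses to fit
    if len(prompt_ids) + longer > max_seq_length:
        room = max_seq_length - len(prompt_ids)
        chosen_ids, rejected_ids = chosen_ids[:room], rejected_ids[:room]
    return prompt_ids, chosen_ids, rejected_ids
```

Trimming the prompt from the left keeps the most recent conversation context, which usually matters more for preference modeling than the earliest turns.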
import torch
import torch.nn as nn
class Loss(object):
"""
Base parent class for all losses
"""
def __call__(self, model, inputs, training_args, return_outputs=False):
"""
todo: label smoothing
Computes the loss.
Per the HF Trainer source, return_outputs=True is used during training; return_outputs=False during eval and predict
:param model: the model
:param inputs: model inputs, dict
:param training_args: training configuration
:param return_outputs: whether to also return the model outputs
:return:
"""
raise NotImplementedError
class TargetLMLoss(Loss):
def __init__(self, ignore_index):
super().__init__()
self.ignore_index = ignore_index
self.loss_fn = nn.CrossEntropyLoss(ignore_index=ignore_index)
def __call__(self, model, inputs, training_args, return_outputs=False):
input_ids = inputs['input_ids']
attention_mask = inputs['attention_mask']
target_mask = inputs['target_mask']
# Forward pass
outputs = model(input_ids=input_ids, attention_mask=attention_mask, return_dict=True)
logits = outputs["logits"] if isinstance(outputs, dict) else outputs[0]
# Set non-target positions in labels to ignore_index, so the loss is computed only on target tokens
labels = torch.where(target_mask == 1, input_ids, self.ignore_index)
shift_logits = logits[..., :-1, :].contiguous()
shift_labels = labels[..., 1:].contiguous()
# Flatten the tokens
loss = self.loss_fn(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1))
return (loss, outputs) if return_outputs else loss
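The shift performed in TargetLMLoss (position t's logits predict token t+1, with ignore_index positions skipped) can be checked torch-free on toy data; `shifted_nll` is an illustrative re-implementation, not project code:

```python
import math

def shifted_nll(logits, labels, ignore_index=-100):
    # logits: [seq_len][vocab_size]; the logits at position t predict labels[t + 1]
    total, count = 0.0, 0
    for t in range(len(labels) - 1):
        target = labels[t + 1]
        if target == ignore_index:
            continue  # masked positions contribute no loss
        row = logits[t]
        m = max(row)
        log_z = m + math.log(sum(math.exp(x - m) for x in row))  # log-sum-exp
        total += log_z - row[target]                             # -log softmax(row)[target]
        count += 1
    return total / count if count else 0.0
```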
import transformers
from typing import Tuple, Union
import torch
from transformers.modeling_outputs import CausalLMOutputWithCrossAttentions, CausalLMOutputWithPast
from component.loss import TargetLMLoss
from transformers.utils import logging
logger = logging.get_logger(__name__)
class BloomForCausalLM(transformers.BloomForCausalLM):
"""
Subclass of BloomForCausalLM that only computes the loss on target tokens
"""
def forward(
self,
input_ids=None,
past_key_values=None,
attention_mask=None,
labels=None,
target_mask=None,
position_ids=None,
head_mask=None,
inputs_embeds=None,
return_loss=False,
use_cache=None,
output_attentions=None,
output_hidden_states=None,
return_dict=None,
) -> Union[Tuple[torch.Tensor], CausalLMOutputWithCrossAttentions]:
r"""
labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
Labels for language modeling. Note that the labels **are shifted** inside the model, i.e. you can set
`labels = input_ids` Indices are selected in `[-100, 0, ..., config.vocab_size]` All labels set to `-100`
are ignored (masked), the loss is only computed for labels in `[0, ..., config.vocab_size]`
"""
return_dict = return_dict if return_dict is not None else self.config.use_return_dict
transformer_outputs = self.transformer(
input_ids,
past_key_values=past_key_values,
attention_mask=attention_mask,
position_ids=position_ids,
head_mask=head_mask,
inputs_embeds=inputs_embeds,
use_cache=use_cache,
output_attentions=output_attentions,
output_hidden_states=output_hidden_states,
return_dict=return_dict,
)
hidden_states = transformer_outputs[0]
lm_logits = self.lm_head(hidden_states)
loss = None
if return_loss:
# Compute the loss only on target tokens: non-target positions are masked to -100.
# (TargetLMLoss expects (model, inputs, training_args), so the masked loss is computed inline here.)
labels = torch.where(target_mask == 1, input_ids, -100)
shift_logits = lm_logits[..., :-1, :].contiguous()
shift_labels = labels[..., 1:].contiguous()
loss = torch.nn.functional.cross_entropy(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1), ignore_index=-100)
if not return_dict:
output = (lm_logits,) + transformer_outputs[1:]
return ((loss,) + output) if loss is not None else output
return CausalLMOutputWithCrossAttentions(
loss=loss,
logits=lm_logits,
past_key_values=transformer_outputs.past_key_values,
hidden_states=transformer_outputs.hidden_states,
attentions=transformer_outputs.attentions,
)
from dataclasses import dataclass
from typing import Dict
@dataclass
class Template:
template_name:str
system_format: str
user_format: str
assistant_format: str
system: str
stop_word: str
# stop_token_id: int
template_dict: Dict[str, Template] = dict()
def register_template(template_name, system_format, user_format, assistant_format, system, stop_word=None):
template_dict[template_name] = Template(
template_name=template_name,
system_format=system_format,
user_format=user_format,
assistant_format=assistant_format,
system=system,
stop_word=stop_word,
# stop_token_id=stop_token_id
)
# Register templates
register_template(
template_name='default',
system_format='System: {content}\n\n',
user_format='User: {content}\nAssistant: ',
assistant_format='{content} {stop_token}',
system=None,
stop_word=None
)
register_template(
template_name='internlm',
system_format="<|System|>:{content}\n",
user_format='<|User|>:{content}\n<|Bot|>:',
assistant_format='{content}</s>\n',
system="You are an AI assistant whose name is InternLM (书生·浦语).\n"
"- InternLM (书生·浦语) is a conversational language model that is developed by Shanghai AI Laboratory (上海人工智能实验室). It is designed to be helpful, honest, and harmless.\n"
"- InternLM (书生·浦语) can understand and communicate fluently in the language chosen by the user such as English and 中文.",
stop_word='</s>'
)
register_template(
template_name='internlm2',
system_format='<|im_start|>system\n{content}<|im_end|>\n',
user_format='<|im_start|>user\n{content}<|im_end|>\n<|im_start|>assistant\n',
assistant_format='{content}<|im_end|>\n',
system="You are an AI assistant whose name is InternLM (书生·浦语).\n"
"- InternLM (书生·浦语) is a conversational language model that is developed by Shanghai AI Laboratory (上海人工智能实验室). It is designed to be helpful, honest, and harmless.\n"
"- InternLM (书生·浦语) can understand and communicate fluently in the language chosen by the user such as English and 中文.",
stop_word='<|im_end|>'
)
register_template(
template_name='qwen',
system_format='<|im_start|>system\n{content}<|im_end|>\n',
user_format='<|im_start|>user\n{content}<|im_end|>\n<|im_start|>assistant\n',
assistant_format='{content}<|im_end|>\n',
system="You are a helpful assistant.",
stop_word='<|im_end|>'
)
register_template(
template_name='yi',
system_format='<|im_start|>system\n{content}<|im_end|>\n',
user_format='<|im_start|>user\n{content}<|im_end|>\n<|im_start|>assistant\n',
assistant_format='{content}<|im_end|>\n',
system=None,
stop_word='<|im_end|>'
)
register_template(
template_name="orion",
system_format='<s>',
user_format='Human: {content}\n\nAssistant: </s>',
assistant_format='{content}</s>',
system='',
stop_word='</s>',
)
register_template(
template_name='deepseek',
system_format=None,
user_format='User: {content}\n\nAssistant: ',
assistant_format='{content}<|end▁of▁sentence|>',
system=None,
stop_word='<|end▁of▁sentence|>'
)
# todo: a more elegant implementation
register_template(
template_name='chatglm2',
system_format=None,
user_format='[Round {idx}]\n\n问:{content}\n\n答:',
assistant_format='{content}',
system=None,
stop_word='</s>',
)
register_template(
template_name='chatglm3',
system_format='{content}',
user_format='{content}',
assistant_format='{content}',
system="You are ChatGLM3, a large language model trained by Zhipu.AI. Follow the user's instructions carefully. Respond using markdown.",
stop_word='</s>',
)
register_template(
template_name='ziya2',
system_format=None,
user_format='<human>:{content} <bot>:',
assistant_format='{content}</s>',
system=None,
stop_word='</s>',
)
register_template(
template_name="xverse",
system_format=None,
user_format='Human: {content}\n\nAssistant: ',
assistant_format='{content}<|endoftext|>',
system=None,
stop_word='<|endoftext|>',
)
register_template(
template_name='minicpm',
system_format=None,
user_format='<用户>{content}<AI>',
assistant_format='{content}</s>',
system=None,
stop_word='</s>'
)
register_template(
template_name='zephyr',
system_format='<|system|>\n{content}</s>',
user_format='<|user|>\n{content}</s>\n<|assistant|>\n',
assistant_format='{content}</s>\n',
system=None,
stop_word='</s>'
)
register_template(
template_name='mistral',
system_format='<s>',
user_format='[INST]{content}[/INST]',
assistant_format='{content}</s>',
system='',
stop_word='</s>'
)
register_template(
template_name='mixtral',
system_format='<s>',
user_format='[INST]{content}[/INST]',
assistant_format='{content}</s>',
system='',
stop_word='</s>'
)
register_template(
template_name='baichuan',
system_format=None,
user_format='<reserved_102>{content}<reserved_103>',
assistant_format='{content}</s>',
system=None,
stop_word='</s>'
)
register_template(
template_name='baichuan2',
system_format=None,
user_format='<reserved_106>{content}<reserved_107>',
assistant_format='{content}</s>',
system=None,
stop_word='</s>'
)
register_template(
template_name='vicuna',
system_format='{content}\n',
user_format='USER: {content} ASSISTANT:',
assistant_format='{content}</s>',
system="A chat between a curious user and an artificial intelligence assistant. "
"The assistant gives helpful, detailed, and polite answers to the user's questions.",
stop_word='</s>'
)
register_template(
template_name='llama2',
system_format='<<SYS>>\n{content}\n<</SYS>>\n\n',
user_format='[INST]{content}[/INST]',
assistant_format='{content} </s>',
system="You are a helpful, respectful and honest assistant. "
"Always answer as helpfully as possible, while being safe. "
"Your answers should not include any harmful, unethical, "
"racist, sexist, toxic, dangerous, or illegal content. "
"Please ensure that your responses are socially unbiased and positive in nature.\n\n"
"If a question does not make any sense, or is not factually coherent, "
"explain why instead of answering something not correct. "
"If you don't know the answer to a question, please don't share false information.",
stop_word='</s>'
)
register_template(
template_name='llama3',
system_format='<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n{content}<|eot_id|>',
user_format='<|start_header_id|>user<|end_header_id|>\n\n{content}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n',
assistant_format='{content}<|eot_id|>',
system=None,
stop_word='<|eot_id|>'
)
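As a sanity check of the llama3 template registered above, plain string formatting assembles a prompt as follows (the registered template has system=None, so the system block appears only when a sample supplies one; the contents here are illustrative):

```python
# Format strings copied from the llama3 template registration above
system_format = '<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n{content}<|eot_id|>'
user_format = ('<|start_header_id|>user<|end_header_id|>\n\n{content}<|eot_id|>'
               '<|start_header_id|>assistant<|end_header_id|>\n\n')
assistant_format = '{content}<|eot_id|>'

prompt = (system_format.format(content="You are a helpful assistant.")
          + user_format.format(content="Hi")
          + assistant_format.format(content="Hello!"))
```

Each assistant reply ends with the registered stop_word `<|eot_id|>`, which is what terminates generation at inference time.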
register_template(
template_name='gemma',
system_format='<bos>',
user_format='<start_of_turn>user\n{content}<end_of_turn>\n<start_of_turn>model\n',
assistant_format='{content}<eos>\n',
system='',
stop_word='<eos>'
)
register_template(
template_name='phi3',
system_format=None,
user_format='<|user|>\n{content}<|end|>\n<|assistant|>',
assistant_format='{content}<|end|>\n',
system=None,
stop_word='<|end|>'
)
import os
import torch
import transformers
from torch import nn
from torch.utils.data import Dataset
from typing import Callable, Dict, List, Optional, Tuple, Union
from transformers import (
PreTrainedModel,
TrainingArguments,
DataCollator,
PreTrainedTokenizerBase,
EvalPrediction,
TrainerCallback,
)
from transformers.utils import logging
logger = logging.get_logger(__name__)
# Name of the files used for checkpointing
TRAINING_ARGS_NAME = "training_args.bin"
TRAINER_STATE_NAME = "trainer_state.json"
OPTIMIZER_NAME = "optimizer.pt"
SCHEDULER_NAME = "scheduler.pt"
SCALER_NAME = "scaler.pt"
class Trainer(transformers.Trainer):
"""
主要修改逻辑:通过传入compute_loss,支持自定义loss计算方式
"""
def __init__(
self,
model: Union[PreTrainedModel, nn.Module] = None,
args: TrainingArguments = None,
data_collator: Optional[DataCollator] = None,
train_dataset: Optional[Dataset] = None,
eval_dataset: Optional[Dataset] = None,
tokenizer: Optional[PreTrainedTokenizerBase] = None,
model_init: Callable[[], PreTrainedModel] = None,
compute_metrics: Optional[Callable[[EvalPrediction], Dict]] = None,
callbacks: Optional[List[TrainerCallback]] = None,
optimizers: Tuple[torch.optim.Optimizer, torch.optim.lr_scheduler.LambdaLR] = (None, None),
preprocess_logits_for_metrics: Callable[[torch.Tensor, torch.Tensor], torch.Tensor] = None,
compute_loss=None,
):
super().__init__(
model=model,
args=args,
data_collator=data_collator,
train_dataset=train_dataset,
eval_dataset=eval_dataset,
tokenizer=tokenizer,
model_init=model_init,
compute_metrics=compute_metrics,
callbacks=callbacks,
optimizers=optimizers,
preprocess_logits_for_metrics=preprocess_logits_for_metrics,
)
self.loss_func = compute_loss
def compute_loss(self, model, inputs, return_outputs=False):
"""
重写loss的计算方式
How the loss is computed by Trainer. By default, all models return the loss in the first element.
Subclass and override for custom behavior.
"""
if self.loss_func is None:
loss = super().compute_loss(model, inputs, return_outputs)
else:
loss = self.loss_func(model, inputs, self.args, return_outputs)
return loss
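A custom loss passed to this `Trainer` must accept `(model, inputs, args, return_outputs)`, matching how `self.loss_func` is invoked above. The sketch below is illustrative only (not part of the repo): a plain causal-LM cross-entropy computed manually, ignoring positions labelled `-100` (the usual prompt/padding mask):

```python
import torch

def masked_causal_lm_loss(model, inputs, args, return_outputs=False):
    """Example custom loss: next-token cross-entropy, skipping labels of -100."""
    labels = inputs.pop('labels')
    outputs = model(**inputs)
    logits = outputs.logits
    # Shift so that each position predicts the next token.
    shift_logits = logits[..., :-1, :].contiguous()
    shift_labels = labels[..., 1:].contiguous()
    loss = torch.nn.functional.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        ignore_index=-100,
    )
    return (loss, outputs) if return_outputs else loss

# Usage: Trainer(model=..., args=..., compute_loss=masked_causal_lm_loss)
```

Returning `(loss, outputs)` when `return_outputs=True` keeps the callable compatible with the parent class's evaluation path.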
class LoRATrainer(Trainer):
"""
修改checkkpoint的保存逻辑,只保存lora
"""
def _save(self, output_dir: Optional[str] = None, state_dict=None):
# If we are executing this function, we are the process zero, so we don't check for that.
output_dir = output_dir if output_dir is not None else self.args.output_dir
os.makedirs(output_dir, exist_ok=True)
logger.info(f"Saving model checkpoint to {output_dir}")
# Save the LoRA weights and adapter config
self.model.save_pretrained(
output_dir, state_dict=state_dict, safe_serialization=self.args.save_safetensors
)
if self.tokenizer is not None:
self.tokenizer.save_pretrained(output_dir)
# Good practice: save your training arguments together with the trained model
torch.save(self.args, os.path.join(output_dir, TRAINING_ARGS_NAME))
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch
from peft import PeftModel
class ModelUtils(object):
@classmethod
def load_model(cls, model_name_or_path, load_in_4bit=False, adapter_name_or_path=None):
# Whether to run inference with 4-bit quantization
if load_in_4bit:
quantization_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type="nf4",
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
)
else:
quantization_config = None
# Load the base model
model = AutoModelForCausalLM.from_pretrained(
model_name_or_path,
load_in_4bit=load_in_4bit,
trust_remote_code=True,
low_cpu_mem_usage=True,
torch_dtype=torch.float16,
device_map='auto',
quantization_config=quantization_config
)
# Load the LoRA adapter, if one was provided
if adapter_name_or_path is not None:
model = PeftModel.from_pretrained(model, adapter_name_or_path)
return model