# for MiniCPM-2B hf inference
torch>=2.0.0
transformers==4.53.2
gradio>=4.26.0
# for vllm inference
# vllm>=0.4.0.post1
# for openai api inference
openai>=1.17.1
tiktoken>=0.6.0
loguru>=0.7.2
sentence_transformers>=2.6.1
sse_starlette>=2.1.0
# for MiniCPM-V hf inference
Pillow>=10.3.0
timm>=0.9.16
sentencepiece>=0.2.0
# MiniCPM Fine-tuning
(Some of the function documentation in this code was generated automatically by [RepoAgent](https://github.com/OpenBMB/RepoAgent).)
[English Version](https://github.com/OpenBMB/MiniCPM/blob/main/finetune/README_en.md)
This directory provides fine-tuning examples for the MiniCPM-2B model, covering both full-parameter fine-tuning and PEFT. In terms of format, we provide examples for multi-turn dialogue fine-tuning and for input-output format fine-tuning.
If you have downloaded the model locally, replace every occurrence of the `OpenBMB/MiniCPM-2B` field in this document and in the code with the corresponding local path so the model is loaded from disk.
Running the examples requires `python>=3.10`; besides the basic `torch` dependency, install the remaining requirements as follows.
**We provide an [example notebook](lora_finetune.ipynb) that demonstrates, with AdvertiseGen as the example, how to process the data and use the fine-tuning scripts.**
```bash
pip install -r requirements.txt
```
## Hardware Requirements for Testing
We only provide single-node multi-GPU / multi-node multi-GPU run examples, so you need at least one machine with multiple GPUs. With the **default configuration files** in this repository, we observed the following memory usage:
+ Full-parameter SFT: evenly distributed across 4 GPUs, each using `30245MiB` of memory.
+ LoRA fine-tuning: 1 GPU, using `10619MiB` of memory.
+ QLoRA fine-tuning with CPU offload: 1 GPU, using `5500MiB` of memory.
> Note that these numbers are for reference only; memory usage may differ with different parameters. Please adjust for your own hardware.
## Multi-Turn Dialogue Format
The multi-turn dialogue fine-tuning example follows the ChatGLM3 dialogue format convention, applying a different `loss_mask` to each role so that the `loss` for all assistant replies in a conversation is computed in a single pass.
For the data file, the example uses the following format:
```json
[
{
"messages": [
{
"role": "system",
"content": "<system prompt text>"
},
{
"role": "user",
"content": "<user prompt text>"
},
{
"role": "assistant",
"content": "<assistant response text>"
},
// ... Multi Turn
{
"role": "user",
"content": "<user prompt text>"
},
{
"role": "assistant",
"content": "<assistant response text>"
}
]
}
// ...
]
```
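The sketch below is a simplified version of the masking logic in this directory's fine-tuning script (assuming a Hugging Face tokenizer with `apply_chat_template`); it shows how labels are built so that only assistant tokens contribute to the loss, with `-100` being the label value that PyTorch's `CrossEntropyLoss` ignores.
```python
# Simplified per-role loss masking; the full logic (including model-version
# special cases) lives in this directory's fine-tuning script.
IGNORE_INDEX = -100  # label value skipped by PyTorch's CrossEntropyLoss

def build_labels(tokenizer, messages):
    input_ids, label_ids = [tokenizer.bos_token_id], [IGNORE_INDEX]
    for msg in messages:
        ids = tokenizer.apply_chat_template([msg])
        input_ids += ids
        if msg["role"] == "assistant":
            label_ids += ids  # supervised: these tokens receive loss
        else:
            label_ids += [IGNORE_INDEX] * len(ids)  # masked: no loss
    input_ids.append(tokenizer.eos_token_id)
    label_ids.append(tokenizer.eos_token_id)
    return input_ids, label_ids
```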
## Dataset Format Example
> Note that the fine-tuning code now includes a validation set, so a complete fine-tuning dataset must contain a training set and a validation set; the test set is optional, or the validation set can be reused in its place.
```json
{"messages": [{"role": "user", "content": "类型#裙*裙长#半身裙"}, {"role": "assistant", "content": "这款百搭时尚的仙女半身裙,整体设计非常的飘逸随性,穿上之后每个女孩子都能瞬间变成小仙女啦。料子非常的轻盈,透气性也很好,穿到夏天也很舒适。"}]}
```
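Since both a training file and a validation file are required, one simple option is to split a single file of records. Below is a minimal sketch under the assumption that your records live in one JSON array file (paths are illustrative); note that the `SupervisedDataset` in this directory's fine-tuning script reads each file with `json.load`, so the sketch writes JSON arrays:
```python
import json
import random

# Split one JSON array of {"messages": [...]} records into train/dev files.
with open("data/all.json", encoding="utf-8") as f:
    records = json.load(f)
random.seed(0)
random.shuffle(records)
n_dev = max(1, len(records) // 10)  # hold out roughly 10% as the validation set
splits = {"train.json": records[n_dev:], "dev.json": records[:n_dev]}
for name, part in splits.items():
    with open(f"data/AdvertiseGenChatML/{name}", "w", encoding="utf-8") as out:
        json.dump(part, out, ensure_ascii=False, indent=2)
```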
## Start Fine-tuning
Run **single-node multi-GPU / multi-node multi-GPU** training with the following commands.
```bash
cd finetune
bash sft_finetune.sh
```
Run **single-node single-GPU** training with the following commands.
```bash
cd finetune
bash lora_finetune.sh
```
# MiniCPM Fine-tuning
(Some of the documentation in this demo code was generated automatically by [RepoAgent](https://github.com/OpenBMB/RepoAgent).)
[Chinese Version](https://github.com/OpenBMB/MiniCPM/blob/main/finetune/README.md)
This directory provides examples of fine-tuning the MiniCPM-2B model, including full model fine-tuning and PEFT. In terms of format, we offer examples for multi-turn dialogue fine-tuning and input-output format fine-tuning.
If you have downloaded the model to your local system, the `OpenBMB/MiniCPM-2B` field mentioned in this document and in the code should be replaced with the corresponding address to load the model from your local system.
Running the examples requires `python>=3.10`; besides the basic `torch` dependency, install the remaining requirements as follows.
**We have provided an [example notebook](lora_finetune.ipynb) to demonstrate how to process data and use the fine-tuning script with AdvertiseGen as an example.**
```bash
pip install -r requirements.txt
```
## Hardware Requirements for Testing
We only provide examples for single-node multi-GPU / multi-node multi-GPU setups, so you will need at least one machine with multiple GPUs. With the **default configuration files** in this repository, we observed the following memory usage:
+ Full-parameter SFT: evenly distributed across 4 GPUs, each consuming `30245MiB` of memory.
+ LoRA fine-tuning: one GPU, consuming `10619MiB` of memory.
+ QLoRA fine-tuning with CPU offload: one GPU, consuming `5500MiB` of memory.
> Please note that these results are for reference only, and memory consumption may vary with different parameters. Please adjust according to your hardware situation.
## Multi-Turn Dialogue Format
The multi-turn dialogue fine-tuning example adopts the ChatGLM3 dialogue format convention, applying a different `loss_mask` to each role so that the `loss` for all assistant replies in a conversation is computed in a single pass.
For the data file, the example uses the following format:
```json
[
{
"messages": [
{
"role": "system",
"content": "<system prompt text>"
},
{
"role": "user",
"content": "<user prompt text>"
},
{
"role": "assistant",
"content": "<assistant response text>"
},
// ... Multi Turn
{
"role": "user",
"content": "<user prompt text>"
},
{
"role": "assistant",
"content": "<assistant response text>"
}
]
}
// ...
]
```
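To sanity-check the masking on a real record, the minimal sketch below (assuming the fine-tuning script in this directory is named `finetune.py` and that you run from the `finetune` directory) loads one example through `SupervisedDataset` and decodes only the tokens that receive loss; the output should contain just the assistant replies:
```python
from transformers import AutoTokenizer

from finetune import SupervisedDataset  # the dataset class defined in this directory

tokenizer = AutoTokenizer.from_pretrained(
    "openbmb/MiniCPM-2B-sft-bf16", trust_remote_code=True
)
ds = SupervisedDataset("data/AdvertiseGenChatML/dev.json", tokenizer, model_max_length=512)
item = ds[0]
supervised = [t for t in item["label_ids"].tolist() if t != -100]
print(tokenizer.decode(supervised))  # expect only the assistant response text
```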
## Dataset Format Example
> Please note that the fine-tuning code now includes a validation set, so a complete fine-tuning dataset must contain a training set and a validation set; the test set is optional, or the validation set can be reused in its place.
```json
{"messages": [{"role": "user", "content": "类型#裙*裙长#半身裙"}, {"role": "assistant", "content": "这款百搭时尚的仙女半身裙,整体设计非常的飘逸随性,穿上之后每个女孩子都能瞬间变成小仙女啦。料子非常的轻盈,透气性也很好,穿到夏天也很舒适。"}]}
```
## Start Fine-tuning
Execute **single-node multi-GPU/multi-node multi-GPU** runs with the following commands.
```bash
cd finetune
bash sft_finetune.sh
```
Execute **single-node single-GPU** runs with the following commands.
```bash
cd finetune
bash lora_finetune.sh
```
{
"fp16": {
"enabled": "auto",
"loss_scale": 0,
"loss_scale_window": 1000,
"initial_scale_power": 16,
"hysteresis": 2,
"min_loss_scale": 1
},
"bf16": {
"enabled": "auto"
},
"zero_optimization": {
"stage": 2,
"allgather_partitions": true,
"overlap_comm": true,
"reduce_scatter": true,
"contiguous_gradients": true
},
"train_batch_size": "auto",
"train_micro_batch_size_per_gpu": "auto",
"gradient_accumulation_steps": "auto",
"gradient_clipping": 1.0,
"wall_clock_breakdown": false,
"flops_profiler": {
"enabled": false,
"profile_step": 1,
"module_depth": -1,
"top_modules": 1,
"detailed": true,
"output_file": null
}
}
{
"fp16": {
"enabled": "auto",
"loss_scale": 0,
"loss_scale_window": 1000,
"initial_scale_power": 16,
"hysteresis": 2,
"min_loss_scale": 1
},
"bf16": {
"enabled": "auto"
},
"zero_optimization": {
"stage": 2,
"allgather_partitions": true,
"overlap_comm": true,
"reduce_scatter": true,
"contiguous_gradients": true,
"offload_optimizer": {
"device": "cpu",
"pin_memory": true
}
},
"train_batch_size": "auto",
"train_micro_batch_size_per_gpu": "auto",
"gradient_accumulation_steps": "auto",
"gradient_clipping": 1.0,
"wall_clock_breakdown": false,
"flops_profiler": {
"enabled": false,
"profile_step": 1,
"module_depth": -1,
"top_modules": 1,
"detailed": true,
"output_file": null
}
}
{
"fp16": {
"enabled": "auto",
"loss_scale": 0,
"loss_scale_window": 1000,
"initial_scale_power": 16,
"hysteresis": 2,
"min_loss_scale": 1
},
"bf16": {
"enabled": "auto"
},
"zero_optimization": {
"stage": 3,
"allgather_partitions": true,
"allgather_bucket_size": 5e8,
"reduce_scatter": true,
"contiguous_gradients": true,
"overlap_comm": true,
"reduce_bucket_size": "auto",
"stage3_prefetch_bucket_size": "auto",
"stage3_param_persistence_threshold": "auto",
"stage3_gather_16bit_weights_on_model_save": true
},
"train_batch_size": "auto",
"train_micro_batch_size_per_gpu": "auto",
"gradient_accumulation_steps": "auto",
"gradient_clipping": 1.0,
"wall_clock_breakdown": false,
"flops_profiler": {
"enabled": false,
"profile_step": 1,
"module_depth": -1,
"top_modules": 1,
"detailed": true,
"output_file": null
}
}
{
"fp16": {
"enabled": "auto",
"loss_scale": 0,
"loss_scale_window": 1000,
"initial_scale_power": 16,
"hysteresis": 2,
"min_loss_scale": 1
},
"bf16": {
"enabled": "auto"
},
"zero_optimization": {
"stage": 3,
"allgather_partitions": true,
"allgather_bucket_size": 5e8,
"reduce_scatter": true,
"contiguous_gradients": true,
"overlap_comm": true,
"reduce_bucket_size": "auto",
"stage3_prefetch_bucket_size": "auto",
"stage3_param_persistence_threshold": "auto",
"stage3_gather_16bit_weights_on_model_save": true,
"offload_optimizer": {
"device": "cpu",
"pin_memory": true
},
"offload_param": {
"device": "cpu",
"pin_memory": true
}
},
"train_batch_size": "auto",
"train_micro_batch_size_per_gpu": "auto",
"gradient_accumulation_steps": "auto",
"gradient_clipping": 1.0,
"wall_clock_breakdown": false,
"flops_profiler": {
"enabled": false,
"profile_step": 1,
"module_depth": -1,
"top_modules": 1,
"detailed": true,
"output_file": null
}
}
[
{
"messages": [
{
"role": "system",
"content": "你是一个广告生成助手,你的任务是根据用户提供的产品信息和广告类型,生成相应的广告文案。"
},
{
"role": "user",
"content": "类型#上衣*材质#牛仔布*颜色#白色*风格#简约*图案#刺绣*衣样式#外套*衣款式#破洞"
},
{
"role": "assistant",
"content": "简约而不简单的牛仔外套,白色的衣身十分百搭。衣身多处有做旧破洞设计,打破单调乏味,增加一丝造型看点。衣身后背处有趣味刺绣装饰,丰富层次感,彰显别样时尚。"
}
]
},
{
"messages": [
{
"role": "system",
"content": "你是一个广告生成助手,你的任务是根据用户提供的产品信息和广告类型,生成相应的广告文案。"
},
{
"role": "user",
"content": "类型#裙*材质#针织*颜色#纯色*风格#复古*风格#文艺*风格#简约*图案#格子*图案#纯色*图案#复古*裙型#背带裙*裙长#连衣裙*裙领型#半高领"
},
{
"role": "assistant",
"content": "这款BRAND针织两件套连衣裙,简约的纯色半高领针织上衣,修饰着颈部线,尽显优雅气质。同时搭配叠穿起一条背带式的复古格纹裙,整体散发着一股怀旧的时髦魅力,很是文艺范。"
}
]
},
{
"messages": [
{
"role": "system",
"content": "你是一个广告生成助手,你的任务是根据用户提供的产品信息和广告类型,生成相应的广告文案。"
},
{
"role": "user",
"content": "类型#上衣*风格#嘻哈*图案#卡通*图案#印花*图案#撞色*衣样式#卫衣*衣款式#连帽"
},
{
"role": "assistant",
"content": "嘻哈玩转童年,随时<UNK>,没错,出街还是要靠卫衣来装酷哦!时尚个性的连帽设计,率性有范还防风保暖。还有胸前撞色的卡通印花设计,靓丽抢眼更富有趣味性,加上前幅大容量又时尚美观的袋鼠兜,简直就是孩子耍帅装酷必备的利器。"
}
]
}
]
[
{
"messages": [
{
"role": "system",
"content": "你是一个广告生成助手,你的任务是根据用户提供的产品信息和广告类型,生成相应的广告文案。"
},
{
"role": "user",
"content": "类型#裤*版型#宽松*风格#性感*图案#线条*裤型#阔腿裤"
},
{
"role": "assistant",
"content": "宽松的阔腿裤这两年真的吸粉不少,明星时尚达人的心头爱。毕竟好穿时尚,谁都能穿出腿长2米的效果宽松的裤腿,当然是遮肉小能手啊。上身随性自然不拘束,面料亲肤舒适贴身体验感棒棒哒。系带部分增加设计看点,还让单品的设计感更强。腿部线条若隐若现的,性感撩人。颜色敲温柔的,与裤子本身所呈现的风格有点反差萌。"
}
]
},
{
"messages": [
{
"role": "system",
"content": "你是一个广告生成助手,你的任务是根据用户提供的产品信息和广告类型,生成相应的广告文案。"
},
{
"role": "user",
"content": "类型#裙*风格#简约*图案#条纹*图案#线条*图案#撞色*裙型#鱼尾裙*裙袖长#无袖"
},
{
"role": "assistant",
"content": "圆形领口修饰脖颈线条,适合各种脸型,耐看有气质。无袖设计,尤显清凉,简约横条纹装饰,使得整身人鱼造型更为生动立体。加之撞色的鱼尾下摆,深邃富有诗意。收腰包臀,修饰女性身体曲线,结合别出心裁的鱼尾裙摆设计,勾勒出自然流畅的身体轮廓,展现了婀娜多姿的迷人姿态。"
}
]
},
{
"messages": [
{
"role": "system",
"content": "你是一个广告生成助手,你的任务是根据用户提供的产品信息和广告类型,生成相应的广告文案。"
},
{
"role": "user",
"content": "类型#上衣*版型#宽松*颜色#粉红色*图案#字母*图案#文字*图案#线条*衣样式#卫衣*衣款式#不规则"
},
{
"role": "assistant",
"content": "宽松的卫衣版型包裹着整个身材,宽大的衣身与身材形成鲜明的对比描绘出纤瘦的身形。下摆与袖口的不规则剪裁设计,彰显出时尚前卫的形态。被剪裁过的样式呈现出布条状自然地垂坠下来,别具有一番设计感。线条分明的字母样式有着花式的外观,棱角分明加上具有少女元气的枣红色十分有年轻活力感。粉红色的衣身把肌肤衬托得很白嫩又健康。"
}
]
},
{
"messages": [
{
"role": "system",
"content": "你是一个广告生成助手,你的任务是根据用户提供的产品信息和广告类型,生成相应的广告文案。"
},
{
"role": "user",
"content": "类型#裙*版型#宽松*材质#雪纺*风格#清新*裙型#a字*裙长#连衣裙"
},
{
"role": "assistant",
"content": "踩着轻盈的步伐享受在午后的和煦风中,让放松与惬意感为你免去一身的压力与束缚,仿佛要将灵魂也寄托在随风摇曳的雪纺连衣裙上,吐露出<UNK>微妙而又浪漫的清新之意。宽松的a字版型除了能够带来足够的空间,也能以上窄下宽的方式强化立体层次,携带出自然优雅的曼妙体验。"
}
]
},
{
"messages": [
{
"role": "system",
"content": "你是一个广告生成助手,你的任务是根据用户提供的产品信息和广告类型,生成相应的广告文案。"
},
{
"role": "user",
"content": "类型#上衣*材质#棉*颜色#蓝色*风格#潮*衣样式#polo*衣领型#polo领*衣袖长#短袖*衣款式#拼接"
},
{
"role": "assistant",
"content": "想要在人群中脱颖而出吗?那么最适合您的莫过于这款polo衫短袖,采用了经典的polo领口和柔软纯棉面料,让您紧跟时尚潮流。再配合上潮流的蓝色拼接设计,使您的风格更加出众。就算单从选料上来说,这款polo衫的颜色沉稳经典,是这个季度十分受大众喜爱的风格了,而且兼具舒适感和时尚感。"
}
]
},
{
"messages": [
{
"role": "system",
"content": "你是一个广告生成助手,你的任务是根据用户提供的产品信息和广告类型,生成相应的广告文案。"
},
{
"role": "user",
"content": "类型#上衣*版型#h*材质#蚕丝*风格#复古*图案#条纹*图案#复古*图案#撞色*衣样式#衬衫*衣领型#小立领"
},
{
"role": "assistant",
"content": "小女人十足的条纹衬衣,缎面一点点的复古,还有蓝绿色这种高级气质复古色,真丝材质,撞色竖条纹特别的现代感味道,直h型的裁剪和特别的衣长款式,更加独立性格。双层小立领,更显脸型。"
}
]
},
{
"messages": [
{
"role": "system",
"content": "你是一个广告生成助手,你的任务是根据用户提供的产品信息和广告类型,生成相应的广告文案。"
},
{
"role": "user",
"content": "类型#裙*材质#网纱*颜色#粉红色*图案#线条*图案#刺绣*裙腰型#高腰*裙长#连衣裙*裙袖长#短袖*裙领型#圆领"
},
{
"role": "assistant",
"content": "这款连衣裙,由上到下都透出一丝迷人诱惑的女性魅力,经典圆领型,开口度恰好,露出你的迷人修长的脖颈线条,很是优雅气质,短袖设计,在这款上竟是撩人美貌,高腰线,散开的裙摆,到小腿的长度,遮住了腿部粗的部分,对身材有很好的修饰作用,穿起来很女神;裙身粉红色花枝重工刺绣,让人一眼难忘!而且在这种网纱面料上做繁复图案的绣花,是很考验工艺的,对机器的要求会更高,更加凸显我们的高品质做工;"
}
]
},
{
"messages": [
{
"role": "system",
"content": "你是一个广告生成助手,你的任务是根据用户提供的产品信息和广告类型,生成相应的广告文案。"
},
{
"role": "user",
"content": "类型#上衣*颜色#纯色*图案#纯色*图案#文字*图案#印花*衣样式#卫衣"
},
{
"role": "assistant",
"content": "一款非常简洁大方的纯色卫衣,设计点在于胸前的“<UNK><UNK>”的中文字印花,新颖特别,让人眼前一亮。简单又吸睛的款式,而且不失时髦感,很适合个性年轻人。"
}
]
},
{
"messages": [
{
"role": "system",
"content": "你是一个广告生成助手,你的任务是根据用户提供的产品信息和广告类型,生成相应的广告文案。"
},
{
"role": "user",
"content": "类型#上衣*版型#宽松*颜色#黑色*颜色#灰色*颜色#姜黄色*风格#休闲*图案#线条*图案#撞色*衣样式#毛衣*衣袖型#落肩袖"
},
{
"role": "assistant",
"content": "看惯了灰色的冷淡和黑色的沉闷感,来一点醒目的彩色增添点活力吧。亮眼又吸睛的姜黄色色调,嫩肤显白非常的有设计感。趣味的撞色和宽松的版型相交辉映,修饰身形小缺点的同时,时尚又百搭。优雅的落肩袖,轻松修饰肩部线条,让毛衣上身凸显出一丝慵懒随性的休闲感,时尚魅力尽显。"
}
]
}
]
[
{
"messages": [
{
"role": "user",
"content": "请判断下边两个句子的关系属于 [entailment, neutral, contradiction]中的哪一种?\n句子1: 身上裹一件工厂发的棉大衣,手插在袖筒里\n句子2:身上至少一件衣服\n"
},
{
"role": "assistant",
"content": "entailment"
}
]
},
{
"messages": [
{
"role": "user",
"content": "请判断下边两个句子的关系属于 [entailment, neutral, contradiction]中的哪一种?\n句子1: 东、中、西部地区要加强多种形式的合作,在协调发展中逐步实现共同富裕\n句子2:东、中、西部地区发展存在不协调\n"
},
{
"role": "assistant",
"content": "entailment"
}
]
},
{
"messages": [
{
"role": "user",
"content": "请判断下边两个句子的关系属于 [entailment, neutral, contradiction]中的哪一种?\n句子1: 外贸经营权进一步放开\n句子2:外贸经营权经历了先收缩再放开的过程。\n"
},
{
"role": "assistant",
"content": "neutral"
}
]
},
{
"messages": [
{
"role": "user",
"content": "请判断下边两个句子的关系属于 [entailment, neutral, contradiction]中的哪一种?\n句子1: 一些地方财政收支矛盾较大\n句子2:地方经历了经济危机\n"
},
{
"role": "assistant",
"content": "neutral"
}
]
},
{
"messages": [
{
"role": "user",
"content": "请判断下边两个句子的关系属于 [entailment, neutral, contradiction]中的哪一种?\n句子1: 等他回来,我们就出去吃啊.\n句子2:我们在等他\n"
},
{
"role": "assistant",
"content": "entailment"
}
]
},
{
"messages": [
{
"role": "user",
"content": "请判断下边两个句子的关系属于 [entailment, neutral, contradiction]中的哪一种?\n句子1: 我觉得你看吧,咱们从专业角度,咱们刚刚说了,物理学咱们都没有办法去判断他到底怎么排名\n句子2:他学习物理学\n"
},
{
"role": "assistant",
"content": "entailment"
}
]
},
{
"messages": [
{
"role": "user",
"content": "请判断下边两个句子的关系属于 [entailment, neutral, contradiction]中的哪一种?\n句子1: 否则,我们的高要求得不到落实,也影响了我们的低目标的实现\n句子2:我们的要求落实与否无所谓。\n"
},
{
"role": "assistant",
"content": "contradiction"
}
]
},
{
"messages": [
{
"role": "user",
"content": "请判断下边两个句子的关系属于 [entailment, neutral, contradiction]中的哪一种?\n句子1: 因此,日本舆论曾盛传海部内阁是过渡性政权\n句子2:此舆论在中国传播范围同样广泛\n"
},
{
"role": "assistant",
"content": "neutral"
}
]
},
{
"messages": [
{
"role": "user",
"content": "请判断下边两个句子的关系属于 [entailment, neutral, contradiction]中的哪一种?\n句子1: 阿嗲说,谢谢你们,生日还送了东西.\n句子2:阿嗲收到生日礼物\n"
},
{
"role": "assistant",
"content": "entailment"
}
]
},
{
"messages": [
{
"role": "user",
"content": "请判断下边两个句子的关系属于 [entailment, neutral, contradiction]中的哪一种?\n句子1: 他当时每次喝酒喝大,有一次喝酒喝得太大我就把他拽到我们家,因为我知道他一定最后以吐出来为终点\n句子2:我去过他家\n"
},
{
"role": "assistant",
"content": "neutral"
}
]
},
{
"messages": [
{
"role": "user",
"content": "请判断下边两个句子的关系属于 [entailment, neutral, contradiction]中的哪一种?\n句子1: 我我后来又给你写了一封信啊.\n句子2:其实收信人收到了第一封信,但是可能忘记了\n"
},
{
"role": "assistant",
"content": "neutral"
}
]
},
{
"messages": [
{
"role": "user",
"content": "请判断下边两个句子的关系属于 [entailment, neutral, contradiction]中的哪一种?\n句子1: 每天都在打高尔夫球的.\n句子2:主人公生活得很规律。\n"
},
{
"role": "assistant",
"content": "neutral"
}
]
},
{
"messages": [
{
"role": "user",
"content": "请判断下边两个句子的关系属于 [entailment, neutral, contradiction]中的哪一种?\n句子1: 要狠抓造林质量不放松\n句子2:种的树越多越好,至于成活率和质量并不需要考虑\n"
},
{
"role": "assistant",
"content": "contradiction"
}
]
}
]
[
{
"messages": [
{
"role": "user",
"content": "请判断下边两个句子的关系属于 [entailment, neutral, contradiction]中的哪一种?\n句子1: 一月份跟二月份肯定有一个月份有.\n句子2:肯定有一个月份有\n"
},
{
"role": "assistant",
"content": "entailment"
}
]
},
{
"messages": [
{
"role": "user",
"content": "请判断下边两个句子的关系属于 [entailment, neutral, contradiction]中的哪一种?\n句子1: 一月份跟二月份肯定有一个月份有.\n句子2:一月份有\n"
},
{
"role": "assistant",
"content": "neutral"
}
]
},
{
"messages": [
{
"role": "user",
"content": "请判断下边两个句子的关系属于 [entailment, neutral, contradiction]中的哪一种?\n句子1: 一月份跟二月份肯定有一个月份有.\n句子2:一月二月都没有\n"
},
{
"role": "assistant",
"content": "contradiction"
}
]
},
{
"messages": [
{
"role": "user",
"content": "请判断下边两个句子的关系属于 [entailment, neutral, contradiction]中的哪一种?\n句子1: 一点来钟时,张永红却来了\n句子2:一点多钟,张永红来了\n"
},
{
"role": "assistant",
"content": "entailment"
}
]
},
{
"messages": [
{
"role": "user",
"content": "请判断下边两个句子的关系属于 [entailment, neutral, contradiction]中的哪一种?\n句子1: 不讲社会效果,信口开河,对任何事情都随意发议论,甚至信谣传谣,以讹传讹,那是会涣散队伍、贻误事业的\n句子2:以讹传讹是有害的\n"
},
{
"role": "assistant",
"content": "entailment"
}
]
},
{
"messages": [
{
"role": "user",
"content": "请判断下边两个句子的关系属于 [entailment, neutral, contradiction]中的哪一种?\n句子1: 不讲社会效果,信口开河,对任何事情都随意发议论,甚至信谣传谣,以讹传讹,那是会涣散队伍、贻误事业的\n句子2:以讹传讹会被处罚\n"
},
{
"role": "assistant",
"content": "neutral"
}
]
},
{
"messages": [
{
"role": "user",
"content": "请判断下边两个句子的关系属于 [entailment, neutral, contradiction]中的哪一种?\n句子1: 不讲社会效果,信口开河,对任何事情都随意发议论,甚至信谣传谣,以讹传讹,那是会涣散队伍、贻误事业的\n句子2:信口开河不会贻误事业\n"
},
{
"role": "assistant",
"content": "contradiction"
}
]
},
{
"messages": [
{
"role": "user",
"content": "请判断下边两个句子的关系属于 [entailment, neutral, contradiction]中的哪一种?\n句子1: 严师母又哼了一声:你保证你没有别的心,却不能保证旁人没有\n句子2:你保证过你没有别的心\n"
},
{
"role": "assistant",
"content": "entailment"
}
]
},
{
"messages": [
{
"role": "user",
"content": "请判断下边两个句子的关系属于 [entailment, neutral, contradiction]中的哪一种?\n句子1: 严师母又哼了一声:你保证你没有别的心,却不能保证旁人没有\n句子2:旁人有别的心\n"
},
{
"role": "assistant",
"content": "neutral"
}
]
},
{
"messages": [
{
"role": "user",
"content": "请判断下边两个句子的关系属于 [entailment, neutral, contradiction]中的哪一种?\n句子1: 严师母又哼了一声:你保证你没有别的心,却不能保证旁人没有\n句子2:你一定能够保证旁人没有别的心\n"
},
{
"role": "assistant",
"content": "contradiction"
}
]
},
{
"messages": [
{
"role": "user",
"content": "请判断下边两个句子的关系属于 [entailment, neutral, contradiction]中的哪一种?\n句子1: 中国人民勤劳智慧,具有无限的创新创造潜能,只要充分释放出来,中国的发展就一定会有更为广阔空间\n句子2:中国人民的创造潜能完全没有被释放出来\n"
},
{
"role": "assistant",
"content": "neutral"
}
]
},
{
"messages": [
{
"role": "user",
"content": "请判断下边两个句子的关系属于 [entailment, neutral, contradiction]中的哪一种?\n句子1: 中国人民勤劳智慧,具有无限的创新创造潜能,只要充分释放出来,中国的发展就一定会有更为广阔空间\n句子2:中国人民没有创造潜能\n"
},
{
"role": "assistant",
"content": "contradiction"
}
]
},
{
"messages": [
{
"role": "user",
"content": "请判断下边两个句子的关系属于 [entailment, neutral, contradiction]中的哪一种?\n句子1: 事实表明,美国侵犯别国国权威性,遑论侵犯人权了\n句子2:美国侵犯了别国国权威性\n"
},
{
"role": "assistant",
"content": "entailment"
}
]
}
]
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. 准备数据集\n",
"\n",
"将数据集转换为更通用的格式\n"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"# 转换为 ChatML 格式\n",
"import os\n",
"import shutil\n",
"import json\n",
"\n",
"input_dir = \"data/AdvertiseGen\"\n",
"output_dir = \"data/mlx_AdvertiseGen\"\n",
"if os.path.exists(output_dir):\n",
" shutil.rmtree(output_dir)\n",
"os.makedirs(output_dir, exist_ok=True)\n",
"\n",
"for fn in [\"train.json\", \"dev.json\"]:\n",
" data_out_list = []\n",
" with open(os.path.join(input_dir, fn), \"r\") as f, open(os.path.join(output_dir, fn), \"w\") as fo:\n",
" for line in f:\n",
" if len(line.strip()) > 0:\n",
" data = json.loads(line)\n",
" data_out = {\"input\":data['content'],'prompt':\"/n请为以下关键词生成一条广告语。\",'output':data['summary']}\n",
" data_out_list.append(data_out)\n",
"\n",
" for d in data_out_list:\n",
" json_str = json.dumps(d,ensure_ascii=False) # 将字典转换为JSON字符串\n",
" fo.write(json_str + '\\n') # 写入字符串并添加换行符\n",
"\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "base",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.0"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
# -*- coding: utf-8 -*-
import json
from dataclasses import dataclass, field
from typing import Dict, Optional
import torch
import transformers
from torch.utils.data import Dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    Trainer,
    BitsAndBytesConfig,
)
@dataclass
class ModelArguments:
model_name_or_path: Optional[str] = field(default="openbmb/MiniCPM-2B-sft-bf16")
@dataclass
class DataArguments:
train_data_path: str = field(
default="data/AdvertiseGenChatML/train.json",
metadata={"help": "Path to the training data."},
)
eval_data_path: str = field(
default="data/AdvertiseGenChatML/dev.json",
metadata={"help": "Path to the test data."},
)
@dataclass
class TrainingArguments(transformers.TrainingArguments):
cache_dir: Optional[str] = field(default=None)
optim: str = field(default="adamw_torch")
model_max_length: int = field(
default=512,
metadata={
"help": "Maximum sequence length. Sequences will be right padded (and possibly truncated)."
},
)
use_lora: bool = field(default=False)
qlora: bool = field(default=False)
class SupervisedDataset(Dataset):
    """Dataset for supervised fine-tuning.

    Loads the dataset from a JSON file and preprocesses it. Example:
    data/AdvertiseGenChatML/train.json
    """
def __init__(
self,
data_path,
tokenizer,
model_max_length=4096,
):
super(SupervisedDataset, self).__init__()
self.data = json.load(open(data_path))
self.tokenizer = tokenizer
self.model_max_length = model_max_length
self.ignore_index = -100
item = self.preprocessing(self.data[0])
print("input:", self.tokenizer.decode(item["input_ids"]))
labels = []
for id_ in item["label_ids"]:
if id_ == -100:
continue
labels.append(id_)
print("label:", self.tokenizer.decode(labels))
def __len__(self):
return len(self.data)
def preprocessing(self, example):
input_ids = [self.tokenizer.bos_token_id]
label_ids = [self.ignore_index]
for message in example["messages"]:
role = message["role"]
content = message["content"]
content_ids = self.tokenizer.apply_chat_template([message])
if role == "user":
if self.tokenizer.eos_token_id == 73440: # minicpm3.0 and minicpm4.0
input_ids += self.tokenizer.apply_chat_template(
[message], add_generation_prompt=True
)
label_ids += [self.ignore_index] * len(
self.tokenizer.apply_chat_template(
[message], add_generation_prompt=True
)
)
else: # minicpm2.0
input_ids += content_ids
label_ids += [self.ignore_index] * len(content_ids)
elif role == "system":
input_ids += content_ids
label_ids += [self.ignore_index] * len(content_ids)
elif role == "assistant":
if self.tokenizer.eos_token_id == 73440: # minicpm3.0 and minicpm4.0
input_ids += tokenizer.encode(content, add_special_tokens=False)
label_ids += tokenizer.encode(content, add_special_tokens=False)
else: # minicpm2.0
input_ids += content_ids
label_ids += content_ids
input_ids.append(self.tokenizer.eos_token_id)
label_ids.append(self.tokenizer.eos_token_id)
# truncate to max len
input_ids = input_ids[: self.model_max_length]
label_ids = label_ids[: self.model_max_length]
attention_mask = [1] * len(input_ids)
# pad to max len
input_ids += [self.tokenizer.eos_token_id] * (
self.model_max_length - len(input_ids)
)
label_ids += [self.ignore_index] * (self.model_max_length - len(label_ids))
attention_mask += [0] * (self.model_max_length - len(attention_mask))
# convert to pt tensor
input_ids = torch.LongTensor(input_ids)
label_ids = torch.LongTensor(label_ids)
attention_mask = torch.LongTensor(attention_mask)
return {
"input_ids": input_ids,
"label_ids": label_ids,
"attention_mask": attention_mask,
}
def __getitem__(self, idx) -> Dict[str, torch.Tensor]:
return self.preprocessing(self.data[idx])
def load_model_and_tokenizer(
model_path: str,
max_length: int = 4096,
use_lora: bool = True,
qlora: bool = False,
bf16: bool = False,
fp16: bool = False,
):
"""load model and tokenizer"""
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
assert not (bf16 and fp16), "bf16 or fp16, not both"
if bf16:
dtype = torch.bfloat16
elif fp16:
dtype = torch.float16
else:
dtype = torch.float32
if qlora:
assert use_lora, "use_lora must be True when use_qlora is True"
        quantization_config = BitsAndBytesConfig(
            load_in_4bit=True,  # quantize the weights to 4 bits
            load_in_8bit=False,  # do not use 8-bit quantization
            bnb_4bit_compute_dtype=torch.float16,  # compute dtype used for matmuls
            bnb_4bit_quant_storage=torch.uint8,  # storage dtype for the quantized weights
            bnb_4bit_quant_type="nf4",  # 4-bit NormalFloat (normal-distribution int4) quantization
            bnb_4bit_use_double_quant=True,  # double quantization: also quantize the zero-point and scaling parameters
            llm_int8_enable_fp32_cpu_offload=False,  # keep CPU-resident parameters in fp32 while the LLM uses int8
            llm_int8_has_fp16_weight=False,  # whether to enable mixed-precision int8 weights
            # llm_int8_skip_modules=["out_proj", "kv_proj", "lm_head"],  # modules to skip when quantizing
            llm_int8_threshold=6.0,  # outlier threshold for llm.int8(); values above it stay unquantized
        )
model = AutoModelForCausalLM.from_pretrained(
model_path,
torch_dtype=dtype,
trust_remote_code=True,
quantization_config=quantization_config,
)
else:
model = AutoModelForCausalLM.from_pretrained(
model_path,
torch_dtype=dtype,
trust_remote_code=True,
)
if use_lora:
from peft import LoraConfig, TaskType, get_peft_model
lora_config = LoraConfig(
init_lora_weights="gaussian",
task_type=TaskType.CAUSAL_LM,
target_modules=(
["q_a_proj", "kv_a_proj_with_mqa", "q_b_proj", "kv_b_proj"]
if model.config.architectures == ["MiniCPM3ForCausalLM"]
else ["q_proj", "v_proj"]
),
r=64,
lora_alpha=32,
lora_dropout=0.1,
inference_mode=False,
)
model = get_peft_model(model, lora_config)
# trainable params: 2,949,120 || all params: 3,010,652,928 || trainable%: 0.09795616002669305
model.print_trainable_parameters()
# model.enable_input_require_grads() # need when using adapter
return model, tokenizer
if __name__ == "__main__":
model_path = "/mnt/data/user/tc_agi/yh/models/MiniCPM"
parser = transformers.HfArgumentParser(
(ModelArguments, DataArguments, TrainingArguments)
)
model_args, data_args, training_args = parser.parse_args_into_dataclasses()
model, tokenizer = load_model_and_tokenizer(
model_path=model_args.model_name_or_path,
max_length=training_args.model_max_length,
use_lora=training_args.use_lora,
qlora=training_args.qlora,
bf16=training_args.bf16,
fp16=training_args.fp16,
)
train_dataset = SupervisedDataset(
data_path=data_args.train_data_path,
tokenizer=tokenizer,
model_max_length=training_args.model_max_length,
)
eval_dataset = SupervisedDataset(
data_path=data_args.eval_data_path,
tokenizer=tokenizer,
model_max_length=training_args.model_max_length,
)
trainer = Trainer(
model=model,
args=training_args,
train_dataset=train_dataset,
eval_dataset=eval_dataset,
tokenizer=tokenizer,
)
trainer.train()
# save the incremental PEFT weights, more details can be found in https://huggingface.co/blog/peft
# model.save_pretrained("output_dir")
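# A minimal sketch (assumption: training ran with use_lora=True): the commented
# call above writes only the LoRA adapter; to obtain a standalone model, the
# adapter can be merged back into the base weights with PEFT, e.g.:
# from peft import AutoPeftModelForCausalLM
# merged = AutoPeftModelForCausalLM.from_pretrained(
#     "output_dir", trust_remote_code=True
# ).merge_and_unload()  # fold the LoRA deltas into the base weights
# merged.save_pretrained("output_dir_merged")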
# MiniCPM Fine-tuning with LLaMA-Factory
MiniCPM is supported by LLaMA-Factory, which provides continued pre-training, SFT, PPO, DPO, KTO, ORPO, and other fine-tuning methods.
Since LLaMA-Factory is powerful but can be hard for beginners to pick up, we have recorded a fine-tuning tutorial.
**We provide the llama_factory_example folder for fine-tuning the MiniCPM-1B and MiniCPM-2B models.**
1. First, install the LLaMA-Factory dependencies.
```bash
git clone https://github.com/hiyouga/LLaMA-Factory
cd LLaMA-Factory
pip install -r requirements.txt
```
2. Process your dataset into the format shown in the MiniCPM/finetune/llama_factory_example/llama_factory_data folder (the examples cover the dpo, kto, and sft fine-tuning methods) and place it in the LLaMA-Factory/data directory. Taking dpo as an example:
```json
[
{
"conversations": [
{
"from": "human",
"value": "Hi! I'd like to create a new language game simulating the first person perspective of a character named Angela."
}
],
"chosen": {
"from": "gpt",
"value": "That sounds like a fun and engaging idea! Here are some tips to help you create the game:\n1. ......"
},
"rejected": {
"from": "gpt",
"value": "Hello! I'd be happy to help you create a language game simulating the first-person perspective ....."
}
}
]
```
3. Add your dataset's information to LLaMA-Factory/data/dataset_info.json so that your dataset can be found there, as in the following example:
```json
{
"identity": {
"file_name": "identity.json"
},
"sft_zh_demo": {
"file_name": "alpaca_zh_demo.json"
},
"kto_en_demo": {
"file_name": "kto_en_demo.json",
"formatting": "sharegpt",
"columns": {
"messages": "messages",
"kto_tag": "label"
},
"tags": {
"role_tag": "role",
"content_tag": "content",
"user_tag": "user",
"assistant_tag": "assistant"
}
},
"dpo_en_demo": {
"file_name": "dpo_en_demo.json",
"ranking": true,
"formatting": "sharegpt",
"columns": {
"messages": "conversations",
"chosen": "chosen",
"rejected": "rejected"
}
}
}
```
4. Copy the files from MiniCPM/finetune/llama_factory_example into the LLaMA-Factory/examples directory.
```bash
cd LLaMA-Factory/examples
mkdir minicpm
# Replace /your/path in the commands below with your MiniCPM and LLaMA-Factory paths
cp -r /your/path/MiniCPM/finetune/llama_factory_example/* /your/path/LLaMA-Factory/examples/minicpm
```
5. Taking dpo as an example, first modify minicpm_dpo.yaml; the fields to change are:
```yaml
model_name_or_path: openbmb/MiniCPM-2B-sft-bf16 # or your local save path
dataset: dpo_en_demo # use the key name from dataset_info.json
output_dir: your/finetune_minicpm/save/path
bf16: true # set to true if your device supports bf16, otherwise false
deepspeed: examples/deepspeed/ds_z2_config.json # if GPU memory is insufficient, switch to ds_z3_config.json
```
6. In the single_node.sh file:
- 1. If you are using A100 or higher-end servers, delete the following two lines:
```bash
export NCCL_P2P_DISABLE=1
export NCCL_IB_DISABLE=1
```
- 2. Set the GPUs that should participate in fine-tuning; the following example uses all eight cards (GPUs 0 through 7):
```bash
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
```
- 3. Change the argument after src/train.py in the line below to the absolute path of minicpm_dpo.yaml inside LLaMA-Factory:
```bash
src/train.py /root/ld/ld_project/LLaMA-Factory/examples/minicpm/minicpm_sft.yaml
```
7. Run:
```bash
cd LLaMA-Factory
bash single_node.sh
```
### model
model_name_or_path: /root/ld/ld_model_pretrained/MiniCPM4/
### method
stage: dpo
do_train: true
finetuning_type: full
### ddp
ddp_timeout: 180000000
deepspeed: examples/deepspeed/ds_z2_config.json
### dataset
dataset: dpo_en_demo
template: cpm3
cutoff_len: 1200
max_samples: 50000000
overwrite_cache: true
preprocessing_num_workers: 16
### output
output_dir: saves/minicpm/dpo
logging_steps: 10
save_steps: 500
plot_loss: true
overwrite_output_dir: true
save_strategy: epoch
### train
per_device_train_batch_size: 2
gradient_accumulation_steps: 4
learning_rate: 0.00001
num_train_epochs: 2.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
### eval
val_size: 0.1
per_device_eval_batch_size: 4
evaluation_strategy: steps
eval_steps: 500
### model
model_name_or_path: /root/ld/ld_model_pretrained/MiniCPM4/
### method
stage: kto
do_train: true
finetuning_type: full
kto_ftx: 0.1
### ddp
ddp_timeout: 180000000
deepspeed: examples/deepspeed/ds_z2_config.json
### dataset
dataset: kto_harmless
template: cpm3
cutoff_len: 1200
max_samples: 500000
overwrite_cache: true
preprocessing_num_workers: 16
### output
output_dir: saves/minicpm/kto
logging_steps: 10
save_steps: 500
plot_loss: true
overwrite_output_dir: true
### train
per_device_train_batch_size: 4
gradient_accumulation_steps: 4
learning_rate: 0.000005
num_train_epochs: 1.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
### eval
val_size: 0.1
per_device_eval_batch_size: 16
evaluation_strategy: steps
eval_steps: 500
### model
model_name_or_path: /root/ld/ld_model_pretrained/MiniCPM4/
### method
stage: sft
do_train: true
finetuning_type: full
### ddp
ddp_timeout: 180000000
deepspeed: examples/deepspeed/ds_z2_config.json
### dataset
dataset: glaive_toolcall_en,glaive_toolcall_zh
template: cpm3
cutoff_len: 1800
max_samples: 500000
overwrite_cache: true
preprocessing_num_workers: 16
### output
output_dir: saves/minicpm/function_call
logging_steps: 10
save_strategy: epoch
plot_loss: true
overwrite_output_dir: true
### train
per_device_train_batch_size: 2
gradient_accumulation_steps: 4
learning_rate: 0.0001
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
### eval
val_size: 0.1
per_device_eval_batch_size: 4
evaluation_strategy: steps
eval_steps: 500