Commit ea2d13c2 authored by zhaoying1

added llama_tencentpretrain_pytorch
# Base image: DCU PyTorch 1.13.1 (CentOS 7.6, DTK 23.04, Python 3.7)
FROM image.sourcefind.cn:5000/dcu/admin/base/pytorch:1.13.1-centos7.6-dtk-23.04-py37-latest
# Copy the Python dependency list into the image
COPY requirements.txt requirements.txt
# Load the DTK 23.04 environment for this build step
RUN source /opt/dtk-23.04/env.sh
# Set the container timezone to Asia/Shanghai
RUN cp /usr/share/zoneinfo/Asia/Shanghai /etc/localtime && echo 'Asia/Shanghai' >/etc/timezone
ENV LANG C.UTF-8
# Install Python dependencies from the Aliyun PyPI mirror
RUN pip install -r requirements.txt -i http://mirrors.aliyun.com/pypi/simple/ --trusted-host mirrors.aliyun.com
## LLaMA Fine-Tuning Based on the TencentPretrain Framework
## Model Introduction
LLaMA is a collection of foundation language models ranging from 7B to 65B parameters. The models are trained on trillions of tokens and demonstrate that state-of-the-art models can be trained exclusively on publicly available datasets, without relying on proprietary or inaccessible data. In particular, LLaMA-13B outperforms GPT-3 (175B) on most benchmarks, and LLaMA-65B is competitive with the best models, Chinchilla-70B and PaLM-540B.
The LLaMA network is based on the Transformer architecture and incorporates various improvements that were proposed for other models such as PaLM. The main differences from the original architecture are:
**Pre-normalization.** To improve training stability, the input of each Transformer sub-layer is normalized instead of the output, using the RMSNorm normalization function.
**SwiGLU activation function [PaLM].** The ReLU non-linearity is replaced with the SwiGLU activation function to improve performance, with a hidden dimension of 2/3·4d instead of the 4d used in PaLM.
**Rotary embeddings.** Absolute positional embeddings are removed; instead, rotary positional embeddings (RoPE) are added at every layer of the network.
LLaMA 2 is the next-generation version of LLaMA and comes with a commercially friendly license. LLaMA 2 is available in three sizes: 7B, 13B, and 70B. Its training corpus is about 40% larger than LLaMA's, and the context length is increased from 2048 to 4096 tokens, so it can understand and generate longer text. Llama 2 adopts most of Llama 1's pre-training setup and model architecture: a standard Transformer with RMSNorm pre-normalization, the SwiGLU activation function, and rotary positional embeddings (RoPE). For details, see the papers:
[Llama 2: Open Foundation and Fine-Tuned Chat Models](https://ai.meta.com/research/publications/llama-2-open-foundation-and-fine-tuned-chat-models/)
[LLaMA: Open and Efficient Foundation Language Models](https://arxiv.org/pdf/2302.13971v1.pdf)
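To make the pre-normalization and SwiGLU changes above concrete, here is a minimal PyTorch sketch of an RMSNorm layer and a SwiGLU feed-forward block. It illustrates the ideas only and is not the TencentPretrain implementation; the class and parameter names (RMSNorm, SwiGLUFeedForward, w_gate, w_up, w_down) are our own.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Root-mean-square norm: rescale by the RMS of the features, no mean subtraction."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.weight * x * torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)

class SwiGLUFeedForward(nn.Module):
    """Feed-forward block with SwiGLU gating; hidden width is 2/3 * 4d rather than 4d."""
    def __init__(self, dim: int):
        super().__init__()
        hidden = int(2 * 4 * dim / 3)
        self.w_gate = nn.Linear(dim, hidden, bias=False)
        self.w_up = nn.Linear(dim, hidden, bias=False)
        self.w_down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

# Pre-normalization: the sub-layer input (not its output) is normalized before the residual add.
x = torch.randn(2, 16, 512)
y = x + SwiGLUFeedForward(512)(RMSNorm(512)(x))
```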
### Dependencies
* Python >= 3.6
* [torch >= 1.1](https://cancon.hpccube.com:65024/4/main/pytorch/dtk23.04)
* six >= 1.12.0
* argparse
* packaging
* regex
* [DeepSpeed](https://cancon.hpccube.com:65024/4/main/deepspeed/dtk23.04)
Running in Docker is recommended. A pre-built image can be pulled from [SourceFind (光源)](https://www.sourcefind.cn/#/service-list): image.sourcefind.cn:5000/dcu/admin/base/pytorch:1.13.1-centos7.6-dtk-23.04-py37-latest
#### Docker Setup
```commandline
docker pull image.sourcefind.cn:5000/dcu/admin/base/pytorch:1.13.1-centos7.6-dtk-23.04-py37-latest
docker run -dit --network=host --name=llama-tencentpretrain --privileged --device=/dev/kfd --device=/dev/dri --ipc=host --shm-size=16G --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -u root --ulimit stack=-1:-1 --ulimit memlock=-1:-1 image.sourcefind.cn:5000/dcu/admin/base/pytorch:1.13.1-centos7.6-dtk-23.04-py37-latest
docker exec -it llama-tencentpretrain /bin/bash
pip install -r requirements.txt -i http://mirrors.aliyun.com/pypi/simple/ --trusted-host mirrors.aliyun.com
```
Alternatively, you can build an image from the Dockerfile in this directory:
```commandline
docker build -t llama:latest .
docker run -dit --network=host --name=llama-tencentpretrain --privileged --device=/dev/kfd --device=/dev/dri --ipc=host --shm-size=16G --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -u root --ulimit stack=-1:-1 --ulimit memlock=-1:-1 llama:latest
```
### Downloading Model Weights
1. Option 1: download the model in Hugging Face format. Taking the 7B model as an example, first download the pre-trained [LLaMA weights](https://huggingface.co/decapoda-research/llama-7b-hf), then convert them to TencentPretrain format:
```commandline
python3 scripts/convert_llama_from_huggingface_to_tencentpretrain.py --input_model_path $LLaMA_HF_PATH \
--output_model_path models/llama-7b.bin --type 7B
```
2. Option 2: alternatively, download a model that is already in [TencentPretrain format](https://huggingface.co/Linly-AI/) for fine-tuning; no format conversion is needed.
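As an optional sanity check after the conversion in option 1, the converted checkpoint can be loaded on CPU and inspected. This is an illustrative snippet, assuming the checkpoint is a plain PyTorch state dict of tensors (as TencentPretrain checkpoints typically are); the path is an example.
```python
import torch

ckpt_path = "models/llama-7b.bin"  # example path from the conversion command above

# Load on CPU so inspection does not allocate GPU memory.
state_dict = torch.load(ckpt_path, map_location="cpu")

# Assumes a flat mapping from parameter names to tensors.
num_params = sum(t.numel() for t in state_dict.values())
print(f"tensors: {len(state_dict)}, total parameters: {num_params / 1e9:.2f}B")
print("first keys:", list(state_dict.keys())[:5])
```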
### Full-Parameter Continued Pre-training
##### Data Preprocessing
1. Build the pre-training dataset
txt pre-training corpus: multiple txt files should be merged into a single .txt file and randomly shuffled by line. The corpus format is as follows:
```commandline
doc1
doc2
doc3
```
jsonl pre-training corpus: to support data that contains newlines (such as code), the pre-training data can also be organized in jsonl format (a small script for producing this format is sketched after the preprocessing command below). The format is as follows:
```commandline
{"text": "doc1"}
{"text": "doc2"}
{"text": "doc3"}
```
2. Preprocess as follows
```commandline
python3 preprocess.py --corpus_path $CORPUS_PATH --spm_model_path $LLaMA_PATH/tokenizer.model \
--dataset_path $OUTPUT_DATASET_PATH --data_processor lm --seq_length 1024
```
Optional arguments:
--json_format_corpus: use jsonl-format data;
--full_sentences: fill samples shorter than the sequence length with content from other samples (no pad token is used).
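As referenced in step 1, the jsonl corpus can be produced with a few lines of Python. The helper below is purely illustrative (the function and file names are our own); it writes one {"text": ...} object per line and shuffles documents before writing.
```python
import json
import random

def build_jsonl_corpus(documents, output_path, seed=42):
    """Write one {"text": ...} JSON object per line, shuffled at the document level."""
    random.seed(seed)
    docs = list(documents)
    random.shuffle(docs)
    with open(output_path, "w", encoding="utf-8") as f:
        for doc in docs:
            f.write(json.dumps({"text": doc}, ensure_ascii=False) + "\n")

# Example: merge several plain-text files that hold one document per line (file names are examples).
docs = []
for path in ["corpus_part1.txt", "corpus_part2.txt"]:
    with open(path, encoding="utf-8") as f:
        docs.extend(line.strip() for line in f if line.strip())

build_jsonl_corpus(docs, "pretrain_corpus.jsonl")
```
The resulting file can then be passed to preprocess.py as $CORPUS_PATH together with --json_format_corpus.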
##### Continued Pre-training
1. Single node
```commandline
deepspeed pretrain.py --deepspeed --deepspeed_config models/deepspeed_zero3_config.json --enable_zero3 \
--pretrained_model_path models/llama-7b.bin \
--dataset_path $OUTPUT_DATASET_PATH --spm_model_path $LLaMA_PATH/tokenizer.model \
--config_path models/llama/7b_config.json \
--output_model_path models/llama_zh_7b \
--world_size 8 --data_processor lm --deepspeed_checkpoint_activations \
--total_steps 300000 --save_checkpoint_steps 5000 --batch_size 24
```
2. Cluster
```commandline
cd slurm_scripts
bash run-pt.sh
```
### Full-Parameter Instruction Fine-Tuning
##### Data Preprocessing
1. Build the instruction dataset: the instruction data is in json format with three fields, instruction, input, and output (each may be empty), one sample per line (a small builder script is sketched after the preprocessing command below).
Example:
```commandline
{"instruction": "在以下文本中提取所有的日期。", "input": "6月21日是夏至,这是一年中白天最长的一天。", "output": "6月21日"}
{"instruction": "", "input": "请生成一个新闻标题,描述一场正在发生的大型自然灾害。\\n\n", "output": "\"强烈飓风肆虐,数百万人疏散!\""}
```
2. Preprocess as follows
```commandline
python3 preprocess.py --corpus_path $INSTRUCTION_PATH --spm_model_path $LLaMA_PATH/tokenizer.model \
--dataset_path $OUTPUT_DATASET_PATH --data_processor alpaca --seq_length 1024
```
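As referenced in step 1, the instruction file can be assembled with a small script such as the hypothetical helper below; the field names match the format above, while the function name and sample contents are illustrative only.
```python
import json

def write_instruction_dataset(samples, output_path):
    """samples: iterable of (instruction, input, output) tuples; any field may be an empty string."""
    with open(output_path, "w", encoding="utf-8") as f:
        for instruction, inp, out in samples:
            record = {"instruction": instruction, "input": inp, "output": out}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")

samples = [
    ("Extract all dates from the following text.",
     "June 21 is the summer solstice, the longest day of the year.",
     "June 21"),
    ("",
     "Please generate a news headline describing a major natural disaster in progress.",
     "\"Powerful hurricane wreaks havoc, millions evacuated!\""),
]
write_instruction_dataset(samples, "instructions.json")  # pass as $INSTRUCTION_PATH above
```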
##### Fine-Tuning
1. Single node
```commandline
deepspeed pretrain.py --deepspeed --deepspeed_config models/deepspeed_zero3_config.json --enable_zero3 \
--pretrained_model_path models/llama_zh_7b.bin \
--dataset_path $OUTPUT_DATASET_PATH --spm_model_path $LLaMA_PATH/tokenizer.model \
--config_path models/llama/7b_config.json \
--output_model_path models/chatflow_7b \
--world_size 8 --data_processor alpaca --prefix_lm_loss --deepspeed_checkpoint_activations \
--total_steps 20000 --save_checkpoint_steps 2000 --batch_size 24
```
2. Cluster
```commandline
cd slurm_scripts
bash run-ift.sh
```
### Splitting the Model into Blocks
During training initialization, each card loads a copy of the model, so host memory usage is roughly model size × number of GPUs (for example, a 7B model stored in fp16 takes about 14 GB per copy, so 8 copies need on the order of 112 GB). If memory is insufficient, the model can be split into blocks as follows and then loaded block by block.
```commandline
python3 scripts/convert_model_into_blocks.py \
--input_model_path path/to/chinese_llama_13b.bin \
--output_model_path path/to/chinese_llama_13b \
--block_size 10
```
Here, --input_model_path is the input model path, --output_model_path is the output model directory, and --block_size is the block size. When loading the model for training, set pretrained_model_path to the output directory above.
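Conceptually, splitting a checkpoint into blocks just partitions the state dict across several smaller files that can be loaded one at a time. The sketch below illustrates that idea only; it is not the convert_model_into_blocks.py implementation, the file naming is made up, and block_size is taken here to be the number of output files, which may differ from the script's actual semantics.
```python
import os
import torch

def split_state_dict_into_blocks(input_model_path, output_model_path, block_size):
    """Illustrative splitter: distribute the checkpoint's tensors across block_size files."""
    os.makedirs(output_model_path, exist_ok=True)
    state_dict = torch.load(input_model_path, map_location="cpu")
    items = list(state_dict.items())
    per_block = (len(items) + block_size - 1) // block_size  # ceil division
    for b in range(block_size):
        block = dict(items[b * per_block:(b + 1) * per_block])
        torch.save(block, os.path.join(output_model_path, f"block_{b}.bin"))

split_state_dict_into_blocks("path/to/chinese_llama_13b.bin", "path/to/chinese_llama_13b", 10)
```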
### Model Inference
For inference with TencentPretrain-format models, see [LLAMA_pytorch](https://developer.hpccube.com/codes/modelzoo/llama_pytorch)
### Training Results
- Using the public instruction dataset [alpaca_gpt4_data_zh.json](https://huggingface.co/datasets/shibing624/alpaca-zh) and the Chinese-adapted ChineseLLaMA 7B and 13B base models, we ran instruction fine-tuning experiments; the training loss curves are shown below:
<div align="center">
<figure class="half">
<img width = '300' height ='250' src="./data/media/ift_7B_bs2_32node_128cards.jpg">
<img width = '300' height ='250' src="./data/media/ift_13B_bs2_32node_128cards.jpg">
</figure>
</div>
- Using the public instruction dataset [alpaca_gpt4_data_zh.json](https://huggingface.co/datasets/shibing624/alpaca-zh) and Meta's open-source [meta-llama/Llama-2-7b-chat-hf](https://pan.xunlei.com/s/VN_kQa1_HBvV-X9QVI6jV2kOA1?pwd=xmra), we ran Chinese instruction fine-tuning experiments; the training loss is shown below:
<div align="center">
<img src="./data/media/ift_llama2_7B_bs2_32node_128cards.jpg" width="300" height="250">
</div>
## Source Repository and Issue Reporting
- https://developer.hpccube.com/codes/modelzoo/llama1-2
## References
* https://github.com/CVI-SZU/Linly
* https://github.com/Tencent/TencentPretrain/
* https://github.com/ProjectD-AI/llama_inference
# Model name
modelName=LLaMA(TencentPretrain)_Pytorch
# Model description
modelDescription=LLaMA fine-tuning based on the PyTorch framework
# Application scenarios (separate multiple tags with commas)
appScenario=训练,train,nlp,智能聊天助手
# Framework type (separate multiple tags with commas)
frameType=Pytorch
{
"emb_size": 128,
"feedforward_size": 3072,
"hidden_size": 768,
"hidden_act": "relu",
"heads_num": 12,
"layers_num": 12,
"max_seq_length": 512,
"dropout": 0.0,
"data_processor": "albert",
"embedding": ["word", "pos", "seg"],
"encoder": "transformer",
"mask": "fully_visible",
"factorized_embedding_parameterization": true,
"parameter_sharing": true,
"target": ["mlm", "sp"]
}
{
"emb_size": 128,
"feedforward_size": 4096,
"hidden_size": 1024,
"hidden_act": "relu",
"heads_num": 16,
"layers_num": 24,
"max_seq_length": 512,
"dropout": 0.0,
"data_processor": "albert",
"embedding": ["word", "pos", "seg"],
"encoder": "transformer",
"mask": "fully_visible",
"factorized_embedding_parameterization": true,
"parameter_sharing": true,
"target": ["mlm", "sp"]
}
{
"emb_size": 128,
"feedforward_size": 8192,
"hidden_size": 2048,
"hidden_act": "relu",
"heads_num": 16,
"layers_num": 24,
"max_seq_length": 512,
"dropout": 0.0,
"data_processor": "albert",
"embedding": ["word", "pos", "seg"],
"encoder": "transformer",
"mask": "fully_visible",
"factorized_embedding_parameterization": true,
"parameter_sharing": true,
"target": ["mlm", "sp"]
}
{
"emb_size": 128,
"feedforward_size": 16384,
"hidden_size": 4096,
"hidden_act": "relu",
"heads_num": 16,
"layers_num": 12,
"max_seq_length": 512,
"dropout": 0.0,
"data_processor": "albert",
"embedding": ["word", "pos", "seg"],
"encoder": "transformer",
"mask": "fully_visible",
"factorized_embedding_parameterization": true,
"parameter_sharing": true,
"target": ["mlm", "sp"]
}
{
"emb_size": 768,
"feedforward_size": 3072,
"hidden_size": 768,
"hidden_act": "gelu",
"heads_num": 12,
"layers_num": 6,
"decoder_layers_num": 6,
"max_seq_length": 1024,
"dropout": 0.1,
"data_processor": "bart",
"embedding": ["word", "pos"],
"tgt_embedding": ["word", "pos"],
"share_embedding": true,
"encoder": "transformer",
"mask": "fully_visible",
"decoder": "transformer",
"target": ["lm"],
"tie_weights": true,
"has_lmtarget_bias": true
}
{
"emb_size": 1024,
"feedforward_size": 4096,
"hidden_size": 1024,
"hidden_act": "gelu",
"heads_num": 16,
"layers_num": 12,
"decoder_layers_num": 12,
"max_seq_length": 1024,
"dropout": 0.1,
"data_processor": "bart",
"embedding": ["word", "pos"],
"tgt_embedding": ["word", "pos"],
"share_embedding": true,
"encoder": "transformer",
"mask": "fully_visible",
"decoder": "transformer",
"target": ["lm"],
"tie_weights": true,
"has_lmtarget_bias": true
}
{
"emb_size": 768,
"feedforward_size": 3072,
"hidden_size": 768,
"hidden_act": "gelu",
"heads_num": 12,
"layers_num": 12,
"dropout": 0.1,
"data_processor": "beit",
"embedding": ["masked_patch", "pos"],
"encoder": "transformer",
"mask": "fully_visible",
"target": ["mlm"],
"image_height": 256,
"image_width": 256,
"patch_size": 16,
"image_preprocess": ["crop"],
"tokenizer": "vqgan",
"image_tokenizer": {
"is_gumbel": false,
"is_transformer": false,
"image_vocab_size": 16384,
"frame_size": 16
}
}
{
"emb_size": 768,
"feedforward_size": 3072,
"hidden_size": 768,
"hidden_act": "gelu",
"heads_num": 12,
"layers_num": 12,
"max_seq_length": 512,
"dropout": 0.1,
"data_processor": "bert",
"embedding": ["word", "pos", "seg"],
"encoder": "transformer",
"mask": "fully_visible",
"target": ["mlm", "sp"],
"tie_weights": true
}
{
"emb_size": 1024,
"feedforward_size": 4096,
"hidden_size": 1024,
"hidden_act": "gelu",
"heads_num": 16,
"layers_num": 24,
"max_seq_length": 512,
"dropout": 0.1,
"data_processor": "bert",
"embedding": ["word", "pos", "seg"],
"encoder": "transformer",
"mask": "fully_visible",
"target": ["mlm", "sp"],
"tie_weights": true
}
{
"emb_size": 512,
"feedforward_size": 2048,
"hidden_size": 512,
"hidden_act": "gelu",
"heads_num": 8,
"layers_num": 8,
"max_seq_length": 512,
"dropout": 0.1,
"data_processor": "bert",
"embedding": ["word", "pos", "seg"],
"encoder": "transformer",
"mask": "fully_visible",
"target": ["mlm", "sp"],
"tie_weights": true
}
{
"emb_size": 256,
"feedforward_size": 1024,
"hidden_size": 256,
"hidden_act": "gelu",
"heads_num": 4,
"layers_num": 4,
"max_seq_length": 512,
"dropout": 0.1,
"data_processor": "bert",
"embedding": ["word", "pos", "seg"],
"encoder": "transformer",
"mask": "fully_visible",
"target": ["mlm", "sp"],
"tie_weights": true
}
{
"emb_size": 512,
"feedforward_size": 2048,
"hidden_size": 512,
"hidden_act": "gelu",
"heads_num": 8,
"layers_num": 4,
"max_seq_length": 512,
"dropout": 0.1,
"data_processor": "bert",
"embedding": ["word", "pos", "seg"],
"encoder": "transformer",
"mask": "fully_visible",
"target": ["mlm", "sp"],
"tie_weights": true
}