Update README and inference codes

Commit e2fa7e61 authored Jun 26, 2024 by Rayyyyy
7 changed files
--- a/README.md
+++ b/README.md
@@ -63,40 +63,54 @@ pip install -r requirements.txt
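 ## Dataset
 ### Preparing the dataset
-For the data preprocessing script, see the [data preprocessing guide](./docs/data_process.md).
+Before training, the text corpus must be converted to token ids and stored in binary files.
+1. Samples in the dataset are separated by '\n', and any '\n' inside a sample is replaced with '\<n>' (the program converts '\<n>' back to '\n' during preprocessing), so each line of the dataset is one sample. For datasets used to fine-tune Yuan2.0, put a '\<sep>' between the `instruction` and the `response` as a separator. An example of a fine-tuning dataset: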
+```text
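+John bought 3 shirts at 20 dollars each. He also has to pay a 10% tax on all items. How much did he pay in total?<sep>The total price of the 3 shirts John bought is 3 \times 20 = 60 dollars.<n>The tax on all items is 10% of the total price, that is 60 \times 0.1 = 6 dollars.<n>Therefore, John paid 60 + 6 = 66 dollars in total.
+Each year, as a reward for winning Amazon's best buyer of the season, Dani gets 4 pairs of pants (2 pants per pair). If he initially had 50 pants, calculate how many pants he will have after 5 years.<sep>Each year Dani gets 4 \times 2 = 8 pants, so after 5 years he will have received 8 \times 5 = 40 pants.<n>The total number of pants he has after 5 years is then the initial 50 plus the 40 received over the 5 years, that is 50 + 40 = 90 pants.<n>Therefore, he will have 90 pants after 5 years.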
+```
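+2. The data preprocessing script is [preprocess_data.py](./tools/preprocess_data.py). Its main arguments are:
-The dataset directory structure is as follows:
+| Argument | Description |
+| ------------------ | ------------------ |
+| `--input` | Path to the folder that stores the training datasets; each dataset should be stored in a `.txt` file. Note: even if there is only one `.txt` file to process, the path must be the folder containing it, not the path of the `.txt` file itself. |
+| `--data-idx` | Sets the index of the training datasets. If there is only one dataset to convert, set `--data-idx` to '0'. If there are multiple training datasets, set it to '0-n', where n is the number of training datasets. |
+| `--tokenizer_path` | Path to the tokenizer files. |
+| `--output_path` | Where to store the preprocessed data; one `.idx` file and one `.bin` file are created for each dataset. |
+For more details, see the [data preprocessing guide](./docs/data_process.md).
+3. If a dataset has already been processed, that is, its `.idx` and `.bin` files already exist under `--output_path`, the program skips it automatically. Run the preprocessing script as follows: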
 ```bash
+# Replace <Specify path> with your actual paths
+python ./tools/preprocess_data_yuan.py --input '<Specify path>' --data-idx '0-42' --tokenizer_path './tokenizer' --output_path '<Specify path>'
+```
+The dataset directory structure is as follows:
+```
 ├── datasets
-│   ├── cnn_dm
+│   ├── xxx.idx
-│       ├── test.source
+│   └── xxx.bin
-│       ├── test.target
-│       ├── train.source
-│       ├── train.target
-│       ├── val.source
-│       └── val.target
-│   └── blank_yahoo
-│       ├── blank
-│       ├── test.txt
-│       ├── train.txt
-│       └── valid.txt
 ```
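 ## Training
-1. Documentation and [example](./examples) scripts for pretraining are provided; for detailed usage, see the [pretraining guide](./docs/pretrain.md). The `Yuan-2.1B-M32` model is used as the example here.
+1. Documentation and [example](./examples) scripts for pretraining are provided; the `Yuan-2.1B-M32` model is used as the example here. For other usage, see the [pretraining guide](./docs/pretrain.md).
-2. For the pretrained model, see the `Model Downloads` section under [Yuan2.0-M32](https://github.com/IEIT-Yuan/Yuan2.0-M32).
+2. For where to download the pretrained model, see the `Model Downloads` section of [Yuan2.0-M32](https://github.com/IEIT-Yuan/Yuan2.0-M32).
 3. Before starting training, modify the following parameters to match your setup:
 - `GPUS_PER_NODE`: the number of accelerator cards to use
 - `CHECKPOINT_PATH`: path to the pretrained model
 - `DATA_PATH`: path to the training data
 - `TOKENIZER_MODEL_PATH`: path to the tokenizer model
-- `TENSORBOARD_PATH`
+- `TENSORBOARD_PATH`: TensorBoard path
+- `--use-lf-gate`: enables Localized Filtering-based Attention (LFA)
+- `--lf-conv2d-num-pad`: set to `1` for training and `0` for inference
+- `--use-distributed-optimizer` and `--recompute-method`: control memory usage during training
 In particular, if the dataset path looks like this: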
 ```bash
@@ -110,24 +124,14 @@ DATA_PATH='1 /path/dataset'
 4. Run:
 ```bash
-bash examples/pretrain_yuan2.0_moe_2x32B.sh
+bash examples/pretrain_yuan2.0_2.1B.sh
 ```
 ## Inference
-1. Change the `--model_name_or_path` model path as needed;
-2. Add the environment variables
-```bash
-pip install -U huggingface_hub hf_transfer
-export export HF_ENDPOINT=https://hf-mirror.com
-```
-3. Run:
 ```bash
-python inference.py --model_name_or_path THUDM/glm-10b
+python vllm/yuan_inference.py --model_path /path/of/Yuan2-M32-hf
 ```
-### Benchmarks
-Evaluation scripts for [**HumanEval**](./docs/eval_humaneval_cn.md), [**GSM8K**](./docs/eval_gsm8k_cn.md), [**MMLU**](./docs/eval_mmlu_cn.md), [**Math**](./docs/eval_math_cn.md), and [**ARC-C**](./docs/eval_arc_cn.md) are provided so that you can reproduce our evaluation results.
 ## result
 <div align=center>
@@ -138,12 +142,12 @@ python inference.py --model_name_or_path THUDM/glm-10b
 Test data: the BAAI glm_trian_data dataset; accelerator used: K100.
 | device | dtype | params | acc |
 | :------: | :------: | :------: | :------: |
-| A100 | fp16 | bs=8, lr=1e-05 | 0.808 |
+| A800 | fp16 |     |     |
-| K100 | fp16 | bs=8, lr=1e-05 | 0.804 |
+| K100 | fp16 |     |     |
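 ## Application scenarios
 ### Algorithm category
-Multi-turn dialogue
+Conversational question answering
 ### Key application industries
 Home, education, scientific research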
@@ -152,4 +156,4 @@ python inference.py --model_name_or_path THUDM/glm-10b
 - https://developer.hpccube.com/codes/modelzoo/yuan2.0-m32_pytorch
 ## References
-- https://github.com/THUDM/GLM
+- https://github.com/IEIT-Yuan/Yuan2.0-M32
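Note on step 3 of the new README text: it says a dataset is skipped when its `.idx` and `.bin` files already exist under `--output_path`. The check can be pictured roughly as below. This is only a sketch, not the code in `tools/`: the helper name `is_already_processed` and the `<dataset_name>.idx` / `<dataset_name>.bin` file naming are assumptions made for illustration.

```python
from pathlib import Path


def is_already_processed(output_path: str, dataset_name: str) -> bool:
    """Return True when both <dataset_name>.idx and <dataset_name>.bin exist.

    The file naming used here is an assumption for illustration; the real
    preprocessing script may name its output files differently.
    """
    out_dir = Path(output_path)
    return all(
        (out_dir / f"{dataset_name}{ext}").is_file() for ext in (".idx", ".bin")
    )
```

A dataset with only one of the two files present counts as unprocessed in this sketch, so a run interrupted between the two writes would be redone.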
--- a/doc/.coveragerc
+++ b/doc/.coveragerc
-[html]
-directory = coverage
-[run]
-data_file = .coverage_$LOCAL_RANK
--- a/Attention Router.png
+++ b/Attention Router.png
--- a/docs/data_process.md
+++ b/docs/data_process.md
@@ -14,7 +14,6 @@ The main variables in the code should be set as follows:
 | `--output_path`    | The path where to store the preprocessed dataset, one .idx file and one .bin file would be created for each dataset.                                                                                                                                |
 ## Dataset
 Samples in the dataset should be separated with '\n', and within each sample '\n' should be replaced with '\<n>', so each line in the dataset is a single sample. The program replaces '\<n>' back with '\n' during preprocessing.
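Note: the one-sample-per-line format described here, together with the `<sep>` separator the README uses for fine-tuning data, could be produced along the following lines. This is a minimal sketch; `format_sample` and `write_dataset` are hypothetical helpers that are not part of this repository, and normalizing Windows line endings first is an extra assumption.

```python
def format_sample(instruction: str, response: str) -> str:
    """Build one dataset line: instruction<sep>response, newlines escaped as <n>."""

    def escape(text: str) -> str:
        # Normalize Windows line endings first (assumption), then escape newlines.
        return text.replace("\r\n", "\n").replace("\n", "<n>")

    return f"{escape(instruction)}<sep>{escape(response)}"


def write_dataset(samples, path: str) -> None:
    """Write (instruction, response) pairs to a .txt file, one sample per line."""
    with open(path, "w", encoding="utf-8") as f:
        for instruction, response in samples:
            f.write(format_sample(instruction, response) + "\n")
```

For example, `format_sample("a\nb", "c")` yields the single line `a<n>b<sep>c`.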

--- a/icon.png
+++ b/icon.png
--- a/model.properties
+++ b/model.properties
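+# Unique model identifier
+modelCode=710
+# Model name
+modelName=yuan2.0-m32_pytorch
+# Model description
+modelDescription=Yuan 2.0-M32 greatly improves the model's compute efficiency: while matching LLaMA3-70B in performance across the board, it significantly reduces the compute cost of training, fine-tuning, and inference, consuming only 1/19 of the compute of LLaMA3.
+# Application scenarios
+appScenario=推理,训练,对话问答,家居,教育,科研
+# Framework type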
+frameType=pytorch
--- a/vllm/yuan_inference.py
+++ b/vllm/yuan_inference.py
-from vllm import LLM, SamplingParams
-import time
 import os
+import time
+import argparse
 from transformers import LlamaTokenizer
+from vllm import LLM, SamplingParams
+## params
+parser = argparse.ArgumentParser()
+parser.add_argument('--model_path', default='', help='model path')
+args = parser.parse_args()
-tokenizer = LlamaTokenizer.from_pretrained('/mnt/beegfs2/Yuan2-M32-HF/', add_eos_token=False, add_bos_token=False, eos_token='<eod>')
+model_path = args.model_path
+tokenizer = LlamaTokenizer.from_pretrained(model_path, add_eos_token=False, add_bos_token=False, eos_token='<eod>')
 tokenizer.add_tokens(['<sep>', '<pad>', '<mask>', '<predict>', '<FIM_SUFFIX>', '<FIM_PREFIX>', '<FIM_MIDDLE>','<commit_before>','<commit_msg>','<commit_after>','<jupyter_start>','<jupyter_text>','<jupyter_code>','<jupyter_output>','<empty_output>'], special_tokens=True)
 prompts = ["写一篇春游作文"]
 sampling_params = SamplingParams(max_tokens=300, temperature=1, top_p=0, top_k=1, min_p=0.0, length_penalty=1.0, repetition_penalty=1.0, stop="<eod>", )
-llm = LLM(model="/mnt/beegfs2/Yuan2-M32-HF/", trust_remote_code=True, tensor_parallel_size=8, gpu_memory_utilization=0.8, disable_custom_all_reduce=True, max_num_seqs=1)
+## init model
+llm = LLM(model=model_path, trust_remote_code=True, tensor_parallel_size=8, gpu_memory_utilization=0.8, disable_custom_all_reduce=True, max_num_seqs=1)
+## inference
 start_time = time.time()
 outputs = llm.generate(prompts, sampling_params)
 end_time = time.time()
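Note: with `default=''`, running the script without `--model_path` passes an empty string to the tokenizer and to `LLM(...)`, and the failure only shows up later. One possible tightening is sketched below; it is not part of this commit. It assumes the model is always a local directory (as in the README's `/path/of/Yuan2-M32-hf` example), and `parse_args` is a hypothetical helper name.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse CLI arguments and fail early on a missing or invalid model path."""
    parser = argparse.ArgumentParser(description="Yuan2-M32 vLLM inference (sketch)")
    # required=True replaces the empty-string default used in the commit.
    parser.add_argument("--model_path", required=True,
                        help="local directory of the converted HF checkpoint")
    args = parser.parse_args(argv)
    # Assumption: the checkpoint is a local directory, not a hub repository id.
    if not os.path.isdir(args.model_path):
        parser.error(f"--model_path is not a directory: {args.model_path}")
    return args
```

`parser.error` prints the usage message and exits with status 2, so a wrong path is reported before any model weights are loaded.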