
### Environment Setup

1. Pull the image: `docker pull image.sourcefind.cn:5000/dcu/admin/base/pytorch:2.4.1-ubuntu22.04-dtk25.04.1-py3.10`
2. Install the base dependencies:
<pre>
pip install -r requirements.txt
</pre>

If `pip install` downloads too slowly, add a mirror index: `-i https://pypi.tuna.tsinghua.edu.cn/simple/`


### Download the openwebtext Training Data
https://huggingface.co/datasets/Skylion007/openwebtext
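
As one possible way to fetch the archives programmatically, the sketch below uses `snapshot_download` from the `huggingface_hub` package. The helper names `openwebtext_download_args` and `download_openwebtext` are hypothetical, and the `subsets/*.tar` pattern is an assumption about the repository layout; check the dataset page before relying on it.

```python
def openwebtext_download_args(local_dir):
    """Build the keyword arguments passed to huggingface_hub.snapshot_download."""
    return {
        "repo_id": "Skylion007/openwebtext",
        "repo_type": "dataset",
        "local_dir": local_dir,
        # Assumption: the tar archives are stored under subsets/ in the dataset repo.
        "allow_patterns": ["subsets/*.tar"],
    }


def download_openwebtext(local_dir):
    """Download the tar archives into local_dir and return the local path."""
    # Imported lazily so the helper above works without the package installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(**openwebtext_download_args(local_dir))
```

Under that layout assumption, calling `download_openwebtext("/models/datasets/openwebtext")` would place the archives in `/models/datasets/openwebtext/subsets`.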

### Data Preprocessing

#### 1. Convert the raw tar archives to openwebtext.jsonl

Set the paths to match your environment:

<pre>
SUBSETS_DIR = "/models/datasets/openwebtext/subsets"  # directory containing the tar archives
OUTPUT_JSONL = "/models/datasets/openwebtext/openwebtext.jsonl"  # output jsonl file
</pre>

```
python convert_openwebtext_jsonl.py
```

This produces `openwebtext.jsonl`, 68 GB in size.
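
The contents of `convert_openwebtext_jsonl.py` are not shown here, so the following is only a minimal sketch of what such a conversion might look like. It assumes each outer `.tar` holds `.xz`-compressed inner tar archives of plain-text documents, and that one JSON object with a `text` key is written per document. The function names `iter_documents` and `convert` are hypothetical.

```python
import io
import json
import tarfile
from pathlib import Path


def iter_documents(outer_tar_path):
    """Yield the text of every file inside the nested .xz archives of one outer tar."""
    with tarfile.open(outer_tar_path, "r") as outer:
        for member in outer:
            # Assumption: the documents sit in inner archives named *.xz.
            if not (member.isfile() and member.name.endswith(".xz")):
                continue
            inner_bytes = outer.extractfile(member).read()
            with tarfile.open(fileobj=io.BytesIO(inner_bytes), mode="r:xz") as inner:
                for doc in inner:
                    if doc.isfile():
                        raw = inner.extractfile(doc).read()
                        yield raw.decode("utf-8", errors="replace")


def convert(subsets_dir, output_jsonl):
    """Write one {"text": ...} JSON line per non-empty document; return the count."""
    count = 0
    with open(output_jsonl, "w", encoding="utf-8") as out:
        for tar_path in sorted(Path(subsets_dir).glob("*.tar")):
            for text in iter_documents(tar_path):
                text = text.strip()
                if text:
                    out.write(json.dumps({"text": text}, ensure_ascii=False) + "\n")
                    count += 1
    return count
```

With the paths configured above, the call would be `convert(SUBSETS_DIR, OUTPUT_JSONL)`. Documents are streamed one at a time, so the full corpus never has to fit in memory.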

#### 2. Preprocess the data into Llama-2 format

```
python tools/preprocess_data.py \
  --input openwebtext.jsonl \
  --output-prefix /models/datasets/openwebtext/openwebtext-llama-7b \
  --tokenizer-type Llama2Tokenizer \
  --tokenizer-model /path/to/llama2_7b_hf/tokenizer.model \
  --workers 16 \
  --append-eod
```
This produces `openwebtext-llama-7b_text_document.bin` and `openwebtext-llama-7b_text_document.idx`. The `.bin` file is 31 GB in size.
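
Before launching training, it can help to confirm that both output files exist and to derive the extension-free prefix that the training script expects as its data path. The sketch below assumes the `<output-prefix>_text_document` naming seen in the output file names above. The helper name `dataset_prefix` is hypothetical.

```python
from pathlib import Path


def dataset_prefix(output_prefix, json_key="text", level="document"):
    """Return the extension-free dataset prefix after checking the .bin/.idx pair exists."""
    prefix = f"{output_prefix}_{json_key}_{level}"
    missing = [ext for ext in (".bin", ".idx") if not Path(prefix + ext).is_file()]
    if missing:
        raise FileNotFoundError(f"missing {', '.join(missing)} for prefix {prefix}")
    return prefix
```

For example, `dataset_prefix("/models/datasets/openwebtext/openwebtext-llama-7b")` returns the path without the `.bin` or `.idx` extension, or raises if either file is absent.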

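### Download the Tokenizer Files

Link: https://www.modelscope.cn/models/shakechen/Llama-2-7b-hf/files

Download the `tokenizer*` files from that page. The `--tokenizer-model` argument in the preprocessing step needs `tokenizer.model`, so fetch these files before running it.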

### Llama Pretraining
Script: `Llama_pretraining.sh`

Edit the dataset and tokenizer paths in the script:
```shell
DATA_PATH="/models/datasets/openwebtext/openwebtext-llama-7b/openwebtext-llama-7b_text_document"
--tokenizer-model /models1/Llama-2-7b-chat-hf/tokenizer.model
```
- Single-node training on 8 cards
  
  ```shell
  mpirun --allow-run-as-root -np 8 Llama_pretraining.sh localhost >& Llama_pretraining.log
  ```
  Check the training log in `Llama_pretraining.log`.