
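# Contents
- [Contents](#contents)
- [Environment Setup](#environment-setup)
- [Pretraining](#pretraining)
  - [GPT](#gpt)
    - [Download the vocabulary files](#download-the-vocabulary-files)
    - [Download the training data](#download-the-training-data)
    - [Data preprocessing](#data-preprocessing)
    - [GPT pretraining](#gpt-pretraining)
  - [Llama](#llama)
    - [Download the tokenizer files](#download-the-tokenizer-files)
    - [Download the training data](#download-the-training-data-1)
    - [Data preprocessing](#data-preprocessing-1)
    - [Llama pretraining](#llama-pretraining)
- [References](#references)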

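# Changelog

2025.3.14: Adapted to the latest code. The shell launch scripts are in the corresponding model directories under `examples`.

2024.12.16: Added support for the torch profiler.

Usage: add the following arguments when launching the script to collect the corresponding profiling data.

```shell
# Collect a torch profile
mpirun -np 8 --allow-run-as-root train_mixtral_8x7B_1nodes.sh localhost --profiling=torch
```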

```bash
# Profiler arguments
TORCH_PROFIE_ARGS=(
    --profile # enable profiling
    --profile-step-start 4 # skip the first 3 iterations; iteration 4 is the warm-up step
    --profile-step-end 5 # profile iteration 5
    --use-pytorch-profiler # use the torch profiler
    --profile-ranks 0 3 # profile global ranks 0 and 3
    --profile-dir ./prof_data # directory where the profile files are saved
)
```
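
As a rough illustration of how a launch script could turn `--profiling=torch` into the arguments above, here is a minimal sketch. The `build_extra_args` helper is hypothetical and is not taken from the scripts under `examples`, which may wire this up differently.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: map --profiling=torch to extra training arguments.
# The array mirrors the one shown above; build_extra_args is an assumed helper.

TORCH_PROFIE_ARGS=(
    --profile
    --profile-step-start 4
    --profile-step-end 5
    --use-pytorch-profiler
    --profile-ranks 0 3
    --profile-dir ./prof_data
)

# build_extra_args ARGS...: print one extra training argument per line.
# Prints nothing unless --profiling=torch is among the arguments.
build_extra_args() {
    local profiling="" arg
    for arg in "$@"; do
        case "$arg" in
            --profiling=*) profiling="${arg#--profiling=}" ;;
        esac
    done
    if [ "$profiling" = "torch" ]; then
        printf '%s\n' "${TORCH_PROFIE_ARGS[@]}"
    fi
}

# Example: the same arguments the launch command above passes to the script
build_extra_args localhost --profiling=torch
```

Keeping the profiler flags in one array means they can be appended to the training command only when profiling is requested, so a normal run is unaffected.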


# Environment Setup
1. Install the base dependencies
<pre>
pip install -r requirements.txt
</pre>
2. Install the HCU-related whl packages

Download directory for the HCU-related packages: [https://cancon.hpccube.com:65024/4/main](https://cancon.hpccube.com:65024/4/main)

PyTorch whl package: pytorch ---> dtk-24.04.1
Download the PyTorch whl package that matches your Python version.

<pre>
pip install torch*  # the downloaded torch whl package
</pre>
torchvision whl package: vision ---> dtk-24.04.1
Download the torchvision whl package that matches your Python version.

<pre>
pip install torchvision*  # the downloaded torchvision whl package
</pre>
apex whl package: apex ---> dtk-24.04.1
Download the apex whl package that matches your Python version.

<pre>
pip install apex*  # the downloaded apex whl package
</pre>

If `pip install` is slow, add a mirror index: -i https://pypi.tuna.tsinghua.edu.cn/simple/
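
Wheel filenames normally carry a CPython tag such as `cp310` for Python 3.10, which is what "matches your Python version" refers to. The following is a minimal sketch for printing that tag, assuming `python3` on the `PATH` is the interpreter you will train with; the `python_wheel_tag` helper name is made up for this example.

```shell
#!/usr/bin/env bash
# python_wheel_tag: print the CPython tag of the active interpreter,
# in the cpXY form that appears in wheel filenames.
python_wheel_tag() {
    python3 -c 'import sys; print("cp%d%d" % sys.version_info[:2])'
}

python_wheel_tag
```

Pick the torch, torchvision, and apex wheels whose filenames contain the printed tag.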

# Pretraining
## GPT
### Download the vocabulary files

<pre>
wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-vocab.json
wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-merges.txt
</pre>

### Download the training data
Use the 1 GB JSONL dataset (79K records).
<pre>
wget https://huggingface.co/bigscience/misc-test-data/resolve/main/stas/oscar-1GB.jsonl.xz
xz -d oscar-1GB.jsonl.xz
</pre>
Decompression produces a single file, `oscar-1GB.jsonl`.
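
Before preprocessing, it can help to confirm that the download and decompression worked. The sketch below counts the records and shows the start of the first one; `summarize_jsonl` is a hypothetical helper, and it assumes one JSON record per line.

```shell
#!/usr/bin/env bash
# summarize_jsonl FILE: print the record count (one record per line)
# and the first 200 characters of the first record.
summarize_jsonl() {
    local file="$1"
    [ -f "$file" ] || { echo "missing: $file" >&2; return 1; }
    echo "records: $(wc -l < "$file" | tr -d ' ')"
    head -n 1 "$file" | cut -c 1-200
}
```

Run it as `summarize_jsonl oscar-1GB.jsonl`; the record count should be on the order of 79K.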

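### Data preprocessing

```shell
mkdir -p ./dataset  # the output directory must exist before preprocessing
python tools/preprocess_data.py \
    --input oscar-1GB.jsonl \
    --output-prefix ./dataset/oscar-1GB-gpt \
    --vocab-file gpt2-vocab.json \
    --tokenizer-type GPT2BPETokenizer \
    --merge-file gpt2-merges.txt \
    --append-eod \
    --workers 8

# Arguments
# --input           path to the input dataset, i.e. the file extracted from oscar-1GB.jsonl.xz
# --output-prefix   output path prefix (the output directory must already exist); the _text_document suffix is appended automatically
# --vocab-file      path to the downloaded gpt2-vocab.json vocabulary file
# --tokenizer-type  tokenizer type
# --merge-file      path to the downloaded gpt2-merges.txt file
# --append-eod      append an end-of-document token
# --workers         number of worker processes
```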
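
Preprocessing writes an indexed dataset as a `.bin`/`.idx` pair under the output prefix. A minimal sketch for checking that both files exist and are non-empty before starting training follows; `check_indexed_dataset` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# check_indexed_dataset PREFIX: verify that PREFIX.bin and PREFIX.idx exist
# and are non-empty. Returns non-zero if either file is missing or empty.
check_indexed_dataset() {
    local prefix="$1" ext status=0
    for ext in bin idx; do
        if [ -s "${prefix}.${ext}" ]; then
            echo "ok: ${prefix}.${ext}"
        else
            echo "missing or empty: ${prefix}.${ext}" >&2
            status=1
        fi
    done
    return "$status"
}
```

For the command above, the call would be `check_indexed_dataset ./dataset/oscar-1GB-gpt_text_document`, which is the same prefix later used as `DATA_PATH`.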


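### GPT pretraining
Script directory: `examples/gpt3/`

Update the dataset and vocabulary file paths:
```shell
VOCAB_FILE=gpt2-vocab.json
MERGE_FILE=gpt2-merges.txt
DATA_PATH="./dataset/oscar-1GB-gpt_text_document"
```
- Single-node multi-card training
  ```shell
  # Adjust the distributed launch arguments in the script
  # On a single node, localhost can be used as the communication address
  # -np 8 launches 8 parallel processes (8 cards)
  # --allow-run-as-root allows mpirun to run as root
  mpirun --allow-run-as-root -np 8 GPT_pretraining.sh localhost >& GPT_pretraining.log
  ```
  Note: the `localhost` argument here is passed to `--dist-url` in the script.

  The training log is written to `GPT_pretraining.log`.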

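- Multi-node multi-card training

  Docker setup for multiple nodes:
  1. Inside each container, run `/usr/sbin/sshd -p 12345` to start sshd listening on port 12345.
  2. Containers can then log in to each other over SSH through that port: `ssh ip -p 12345`.
  3. For passwordless login, mount the `.ssh` directory when starting the container with `docker run`, using the `-v` option on `/root/.ssh`.
  4. Run mpirun across containers: `mpirun -np .. --hostfile hosts -mca plm_rsh_args "-p 12345" ./xx.sh master_ip`


  **Example**: two nodes, 192.168.1.1 and 192.168.1.2, with 8 cards each; 192.168.1.1 is the master node.

  hosts file:
  ```txt
  192.168.1.1 slots=8
  192.168.1.2 slots=8
  ```

  Run the following command on the master node:

  ```shell
  mpirun --allow-run-as-root -np 16 --hostfile hosts -mca plm_rsh_no_tree_spawn 1 -mca plm_rsh_args "-p 12345" --bind-to none ./GPT_pretraining.sh 192.168.1.1 >& GPT_pretraining.log
  ```
  The training log is written to `GPT_pretraining.log`.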
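
For larger clusters, the hosts file and the `-np` value can be derived from the node list rather than written by hand. The sketch below does this for the two-node example above; `make_hostfile` and `world_size` are hypothetical helper names.

```shell
#!/usr/bin/env bash
# make_hostfile SLOTS IP [IP...]: print one "IP slots=SLOTS" line per node.
make_hostfile() {
    local slots="$1" ip
    shift
    for ip in "$@"; do
        printf '%s slots=%s\n' "$ip" "$slots"
    done
}

# world_size SLOTS IP [IP...]: total process count to pass to mpirun -np
# (cards per node multiplied by the number of nodes).
world_size() {
    local slots="$1"
    shift
    echo $(( slots * $# ))
}

make_hostfile 8 192.168.1.1 192.168.1.2   # contents of the hosts file
world_size 8 192.168.1.1 192.168.1.2      # prints 16
```

Redirect the first command to `hosts` (for example `make_hostfile 8 192.168.1.1 192.168.1.2 > hosts`) to produce the file used by `--hostfile`.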

## Llama
### Download the tokenizer files

Link: https://www.modelscope.cn/models/shakechen/Llama-2-7b-hf/files
Download the `tokenizer*` files from that page.

### Download the training data
Use the 1 GB JSONL dataset (79K records).
<pre>
wget https://huggingface.co/bigscience/misc-test-data/resolve/main/stas/oscar-1GB.jsonl.xz
xz -d oscar-1GB.jsonl.xz
</pre>
Decompression produces a single file, `oscar-1GB.jsonl`.

### Data preprocessing

```shell
python tools/preprocess_data.py \
  --input oscar-1GB.jsonl \
  --output-prefix /datasets/oscar-1GB-llama \
  --tokenizer-type Llama2Tokenizer \
  --tokenizer-model /path/to/llama2_7b_hf/tokenizer.model \
  --workers 16 \
  --append-eod
```

### Llama pretraining
Script directory: `examples/llama`

Update the dataset and tokenizer paths:
```shell
DATA_PATH="/datasets/oscar-1GB-llama_text_document"
--tokenizer-model /path/to/llama2_7b_hf/tokenizer.model
```
- Single-node multi-card training
  ```shell
  # See the GPT section above for an explanation of the arguments
  mpirun --allow-run-as-root -np 8 Llama_pretraining.sh localhost >& Llama_pretraining.log
  ```
  The training log is written to `Llama_pretraining.log`.

- Multi-node multi-card training

  **Example**: two nodes, 192.168.1.1 and 192.168.1.2, with 8 cards each; 192.168.1.1 is the master node.

  The hosts file is the same as in the GPT section above.

  Run the following command on the master node:

  ```shell
  mpirun --allow-run-as-root -np 16 --hostfile hosts -mca plm_rsh_no_tree_spawn 1 -mca plm_rsh_args "-p 12345" --bind-to none ./Llama_pretraining.sh 192.168.1.1 >& Llama_pretraining.log
  ```

  The training log is written to `Llama_pretraining.log`.

# References

- [README_ORIGIN](README_ORIGIN.md)