update

d6b60084 · “yuguo” · 429d3145 · d6b60084
Commit d6b60084 authored Oct 10, 2023 by “yuguo”

Showing with 7 additions and 4 deletions

README.md +7 -4

--- a/README.md
+++ b/README.md
@@ -44,13 +44,14 @@ LLaMA, a collection of foundation language models ranging from 7B to 65B parameters. In

 We have bundled an English conversation dataset in the FastChat directory so that users can quickly validate their setup:

-    ./FastChat-main/playground/data/alpaca-data-conversation.json
+    $ tree ./FastChat-main/playground/data
+      ── alpaca-data-conversation.json

 ## LLAMA-13B Fine-tuning (with MPI)

 ### Environment Setup

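-Two bare-metal nodes with 16 Z00L cards in total are required, with a working dtk22.10.1 environment. The mpirun folder contains a precompiled OpenMPI library, mpi4.tar.gz, which can be used directly:
+Edit env.sh to match your node environment; for the environment variables, refer to dtk-22.10. Two bare-metal nodes with 16 Z00L cards in total are required, with a working dtk environment. The mpirun folder contains a precompiled OpenMPI library, mpi4.tar.gz, which can be used directly. The torch library and the other packages this project needs for DCU cards can be downloaded and installed from the [光合](https://developer.hpccube.com/tool/) developer community: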

 ```
 cp -r mpirun/* ./
@@ -67,7 +68,7 @@ pip3 install  apex-0.1+gitdb7007a.dtk2210-cp38-cp38-manylinux2014_x86_64.whl（

 ### Training

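-This training script requires 2 nodes, each with 8 DCU-Z100L-32G cards.
+This training script requires 2 nodes, each with 8 DCU-Z100L-32G cards. Change the model weights path in mpi_single.sh as needed.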

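 The parallel configuration uses zero3 with fp16 precision for fine-tuning. To enable the apex adamw_apex_fused optimizer, change the optimizer at ./FastChat-main/fastchat/train/train.py:55 to adamw_apex_fused. The deepspeed config.json is as follows: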

@@ -97,12 +98,14 @@ pip3 install  apex-0.1+gitdb7007a.dtk2210-cp38-cp38-manylinux2014_x86_64.whl（
 }
 ```

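-Log in to node 1 and edit hostfile for your environment, making sure both nodes use identical file paths and configuration. In mpi_job.sh, change enp97s0f1 in --mca btl_tcp_if_include enp97s0f1 to the name of the network interface that carries the node's IP, as reported by ip a. The numa binding can be adjusted to match the current node's topology. Fine-tuning command:
+Log in to node 1 and edit hostfile for your environment, making sure both nodes use identical file paths and configuration. As needed, in mpi_job.sh change enp97s0f1 in --mca btl_tcp_if_include enp97s0f1 to the name of the network interface that carries the node's IP, as reported by ip a. The numa binding can be adjusted to match the current node's topology. Fine-tuning command: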

 ```
 source mpi_job.sh
 ```

+If running the 7B model on a single node hits oom, reduce the batch size as appropriate.
+
 ## Accuracy

 Training data: [./FastChat-main/playground/data/alpaca-data-conversation.json](链接)
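
Review note on the `--mca btl_tcp_if_include` step in this diff: the README asks the reader to pick the interface name out of `ip a` output by eye. A minimal sketch of how that lookup could be scripted is shown below. The helper name `nic_for_ip` and the address `10.0.0.1` are hypothetical and are not part of the repository. The sketch assumes the one-line-per-address format that `ip -o -4 addr show` prints.

```shell
# Hypothetical helper (not part of the repository): reads the output of
# `ip -o -4 addr show` on stdin and prints the name of the interface whose
# IPv4 address equals the first argument.
nic_for_ip() {
    # In `ip -o` output, field 2 is the interface name and field 4 is
    # the address in CIDR form (for example 10.0.0.1/24).
    awk -v ip="$1" '{ split($4, a, "/"); if (a[1] == ip) print $2 }'
}

# Usage (10.0.0.1 is a placeholder for the node IP listed in hostfile):
#   ip -o -4 addr show | nic_for_ip 10.0.0.1
```

The printed name is the value to substitute for enp97s0f1 in mpi_job.sh. Run it on each node, since interface names can differ between machines.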