updated for dtk24.04.1

Commit 7f03ad53 authored Aug 23, 2024 by yinger_z
5 changed files
--- a/Dockerfile
+++ b/Dockerfile
-FROM image.sourcefind.cn:5000/dcu/admin/base/pytorch:1.13.1-centos7.6-dtk-23.04-py38-latest
+FROM image.sourcefind.cn:5000/dcu/admin/base/pytorch:2.1.0-ubuntu20.04-dtk24.04.1-py3.10
 COPY requirements.txt requirements.txt
-RUN source /opt/dtk-23.04/env.sh
+RUN source /opt/dtk/env.sh
 RUN cp /usr/share/zoneinfo/Asia/Shanghai /etc/localtime && echo 'Asia/Shanghai' >/etc/timezone 
 ENV LANG C.UTF-8
 RUN pip install -r requirements.txt -i http://mirrors.aliyun.com/pypi/simple/ --trusted-host mirrors.aliyun.com
--- a/README.md
+++ b/README.md
@@ -30,8 +30,8 @@ The overall Baichuan model is based on the standard Transformer architecture and, like LLaMA, uses the same
 ### Docker (Option 1)
 Running with Docker is recommended. Pull the provided Docker image:
 ```
-docker pull image.sourcefind.cn:5000/dcu/admin/base/pytorch:1.13.1-centos7.6-dtk-23.04-py38-latest
-docker run -dit --network=host --name=baichuan2 --privileged --device=/dev/kfd --device=/dev/dri --ipc=host --shm-size=16G  --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -u root --ulimit stack=-1:-1 --ulimit memlock=-1:-1 image.sourcefind.cn:5000/dcu/admin/base/pytorch:1.13.1-centos7.6-dtk-23.04-py38-latest
+docker pull image.sourcefind.cn:5000/dcu/admin/base/pytorch:2.1.0-ubuntu20.04-dtk24.04.1-py3.10
+docker run -dit --shm-size 80g --network=host --name=baichuan2 --privileged --device=/dev/kfd --device=/dev/dri --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -u root -v /opt/hyhal/:/opt/hyhal/:ro image.sourcefind.cn:5000/dcu/admin/base/pytorch:2.1.0-ubuntu20.04-dtk24.04.1-py3.10 /bin/bash
 docker exec -it baichuan2 /bin/bash
 ```
 Install the dependencies that are not included in the Docker image:
@@ -43,20 +43,20 @@ pip install -r requirements.txt  -i http://mirrors.aliyun.com/pypi/simple/ --tru
 ### Dockerfile (Option 2)
 ```
 docker build -t baichuan2:latest .
-docker run -dit --network=host --name=baichuan2 --privileged --device=/dev/kfd --device=/dev/dri --ipc=host --shm-size=16G  --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -u root --ulimit stack=-1:-1 --ulimit memlock=-1:-1 baichuan2:latest
+docker run -dit --shm-size 80g --network=host --name=baichuan2 --privileged --device=/dev/kfd --device=/dev/dri --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -u root -v /opt/hyhal/:/opt/hyhal/:ro baichuan2:latest /bin/bash
 docker exec -it baichuan2 /bin/bash
 ```

 ### Conda (Option 3)
 1. Create a conda virtual environment:
 ```
-conda create -n baichuan2 python=3.8
+conda create -n baichuan2 python=3.10
 ```

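 2. The toolkits and deep learning libraries that this project needs for DCU accelerators can all be downloaded and installed from the [光合](https://developer.hpccube.com/tool/) developer community.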
- [DTK 23.04](https://cancon.hpccube.com:65024/1/main/DTK-23.04.1)
- [Pytorch 1.13.1](https://cancon.hpccube.com:65024/4/main/pytorch/dtk23.04)
- [Deepspeed 0.9.2](https://cancon.hpccube.com:65024/4/main/deepspeed/dtk23.04)
+- [DTK 24.04](https://cancon.hpccube.com:65024/1/main/DTK-24.04.1)
+- [Pytorch 2.1.0](https://cancon.hpccube.com:65024/4/main/pytorch/DAS1.1.1)
+- [Deepspeed 0.12.3](https://cancon.hpccube.com:65024/4/main/deepspeed/DAS1.1)

    Tips: the versions of the DTK driver, Python, DeepSpeed, and the other tools listed above must match one another exactly.
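
The tip above asks that the tool versions line up exactly. One way to catch a mismatch early is a small check script; the following is a minimal sketch, not part of this commit. The helper names (`installed_version`, `matches`, `check_environment`) are hypothetical, the expected versions are copied from the `+` lines of this hunk, and the prefix match assumes vendor builds may append a local suffix after `+` to the version string. The DTK version itself is not checked here, since this diff does not show a Python-visible way to query it.

```python
import sys
from importlib import metadata

# Versions taken from the "+" lines of the README hunk above.
# The dictionary layout itself is illustrative, not from the repository.
EXPECTED = {"torch": "2.1.0", "deepspeed": "0.12.3"}
EXPECTED_PYTHON = (3, 10)


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def matches(found, wanted):
    """Accept an exact match, or the wanted version followed by a '+local' suffix.

    Assumption: vendor wheels may report versions such as '<wanted>+<suffix>'.
    """
    if found is None:
        return False
    return found == wanted or found.startswith(wanted + "+")


def check_environment(expected=EXPECTED, python=EXPECTED_PYTHON, lookup=installed_version):
    """Return a list of human-readable mismatch messages (empty if all match)."""
    problems = []
    if tuple(sys.version_info[:2]) != tuple(python):
        found_py = ".".join(str(part) for part in sys.version_info[:2])
        wanted_py = ".".join(str(part) for part in python)
        problems.append(f"python: expected {wanted_py}, found {found_py}")
    for package, wanted in expected.items():
        found = lookup(package)
        if not matches(found, wanted):
            problems.append(f"{package}: expected {wanted}, found {found}")
    return problems


if __name__ == "__main__":
    for line in check_environment() or ["environment matches the README versions"]:
        print(line)
```

Running the script inside the container or conda environment prints one line per mismatch, so it could be used as a quick sanity check before launching `ft_train.sh` or `lora_train.sh`.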

@@ -65,31 +65,10 @@ conda create -n baichuan2 python=3.8
 pip install -r requirements.txt -i http://mirrors.aliyun.com/pypi/simple/ --trusted-host mirrors.aliyun.com
 ```

-### Note 1
-
-```
-# In the virtual environment's python/site-packages directory, comment out some version checks
-File: site-packages/accelerate/accelerator.py
-
- 287             #if not is_deepspeed_available():
- 288             #    raise ImportError("DeepSpeed is not installed => run `pip install deepspeed` or build it from source.")
- 289             #if compare_versions("deepspeed", "<", "0.9.3"):
- 290             #    raise ImportError("DeepSpeed version must be >= 0.9.3. Please update DeepSpeed.")
- 
-File: site-packages/transformers/utils/versions.py
- 43     #if not ops[op](version.parse(got_ver), version.parse(want_ver)):
- 44     #    raise ImportError(
- 45     #        f"{requirement} is required for a normal functioning of this module, but found {pkg}=={got_ver}.{hint}"
- 46     #    )
-```
-
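-### Note 2
-
-Before training, refer to [modeling_baichuan.py](./modeling_baichuan.py) and modify the code of the `Attention` class in the model folder's modeling_baichuan.py, mainly to (temporarily) remove the dependency on torch 2.X.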


-### Note 3 
-If xformers is not supported, multi-node training may fail with an xformers-related error: "ImportError: This modeling file requires the following packages that were not found in your environment: xformers." To work around it, set xops to None directly in [modeling_baichuan.py](./modeling_baichuan.py): comment out the xformers import code and set `xops=None`.
+### Note 3
+If xformers is not supported, training may fail with an xformers-related error: "ImportError: This modeling file requires the following packages that were not found in your environment: xformers." To work around it, set xops to None directly in [modeling_baichuan.py](./modeling_baichuan.py): comment out the xformers import code and set `xops=None`.
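
The workaround in the note above amounts to making `xops` resolve to `None` when xformers is unavailable. As a rough illustration of that idea only, here is a minimal sketch of a guarded import. It is not the code in modeling_baichuan.py, whose contents are not shown in this diff; `load_xops` and `attention_backend` are hypothetical names, and the `"fallback"` label simply stands in for whatever non-xformers attention path the model uses.

```python
import importlib


def load_xops():
    """Return the xformers.ops module if it can be imported, otherwise None.

    Assumption: a missing or unsupported xformers install surfaces as
    ImportError; other failure modes are not handled in this sketch.
    """
    try:
        return importlib.import_module("xformers.ops")
    except ImportError:
        return None


# Equivalent in effect to the README's manual fix of setting `xops=None`
# on platforms where xformers is not available.
xops = load_xops()


def attention_backend(xops_module):
    """Name the attention path that would be chosen for a given xops value."""
    return "xformers" if xops_module is not None else "fallback"


if __name__ == "__main__":
    print(f"attention backend: {attention_backend(xops)}")
```

With this shape, code that branches on `xops is not None` keeps working on machines with and without xformers. Whether this alone avoids the quoted ImportError depends on how the installed transformers version scans custom modeling files for imports, so the manual edit described in the note remains the documented fix.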

 ## Dataset


--- a/fine-tune/ft_train.sh
+++ b/fine-tune/ft_train.sh
 hostfile=""
-HIP_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 deepspeed --hostfile=$hostfile fine-tune.py  \
+HIP_VISIBLE_DEVICES=0,1,2,3,4,5 deepspeed --hostfile=$hostfile fine-tune.py  \
    --report_to "none" \
    --data_path "data/belle_chat_ramdon_10k.json" \
-    --model_name_or_path "../../baichuan2-13b-chat-hf" \
+    --model_name_or_path "./baichuan2-13b-chat" \
    --output_dir "output" \
    --model_max_length 512 \
    --num_train_epochs 4 \

--- a/fine-tune/lora_train.sh
+++ b/fine-tune/lora_train.sh
 hostfile=""
-HIP_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 deepspeed --hostfile=$hostfile fine-tune.py  \
+HIP_VISIBLE_DEVICES=0,1,2,3 deepspeed --hostfile=$hostfile fine-tune.py  \
    --report_to "none" \
    --data_path "data/belle_chat_ramdon_10k.json" \
-    --model_name_or_path "../../baichuan2-13b-chat-hf" \
+    --model_name_or_path "./baichuan2-13b-chat" \
    --output_dir "output" \
    --model_max_length 64 \
    --num_train_epochs 4 \

--- a/modeling_baichuan.py
+++ b/modeling_baichuan.py