Commit 7f03ad53 authored by yinger_z's avatar yinger_z
Browse files

updated for dtk24.04.1

parent 94a879be
FROM image.sourcefind.cn:5000/dcu/admin/base/pytorch:1.13.1-centos7.6-dtk-23.04-py38-latest
FROM image.sourcefind.cn:5000/dcu/admin/base/pytorch:2.1.0-ubuntu20.04-dtk24.04.1-py3.10
COPY requirements.txt requirements.txt
RUN source /opt/dtk-23.04/env.sh
RUN source /opt/dtk/env.sh
RUN cp /usr/share/zoneinfo/Asia/Shanghai /etc/localtime && echo 'Asia/Shanghai' >/etc/timezone
ENV LANG C.UTF-8
RUN pip install -r requirements.txt -i http://mirrors.aliyun.com/pypi/simple/ --trusted-host mirrors.aliyun.com
......@@ -30,8 +30,8 @@ Baichuan整体模型基于标准的Transformer结构,采用了和LLaMA一样
### Docker(方式一)
推荐使用docker方式运行,提供拉取的docker镜像:
```
docker pull image.sourcefind.cn:5000/dcu/admin/base/pytorch:1.13.1-centos7.6-dtk-23.04-py38-latest
docker run -dit --network=host --name=baichuan2 --privileged --device=/dev/kfd --device=/dev/dri --ipc=host --shm-size=16G --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -u root --ulimit stack=-1:-1 --ulimit memlock=-1:-1 image.sourcefind.cn:5000/dcu/admin/base/pytorch:1.13.1-centos7.6-dtk-23.04-py38-latest
docker pull image.sourcefind.cn:5000/dcu/admin/base/pytorch:2.1.0-ubuntu20.04-dtk24.04.1-py3.10
docker run -dit --shm-size 80g --network=host --name=baichuan2 --privileged --device=/dev/kfd --device=/dev/dri --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -u root -v /opt/hyhal/:/opt/hyhal/:ro image.sourcefind.cn:5000/dcu/admin/base/pytorch:2.1.0-ubuntu20.04-dtk24.04.1-py3.10 /bin/bash
docker exec -it baichuan2 /bin/bash
```
安装docker中没有的依赖:
......@@ -43,20 +43,20 @@ pip install -r requirements.txt -i http://mirrors.aliyun.com/pypi/simple/ --tru
### Dockerfile(方式二)
```
docker build -t baichuan2:latest .
docker run -dit --network=host --name=baichuan2 --privileged --device=/dev/kfd --device=/dev/dri --ipc=host --shm-size=16G --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -u root --ulimit stack=-1:-1 --ulimit memlock=-1:-1 baichuan2:latest
docker run -dit --shm-size 80g --network=host --name=baichuan2 --privileged --device=/dev/kfd --device=/dev/dri --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -u root -v /opt/hyhal/:/opt/hyhal/:ro baichuan2:latest /bin/bash
docker exec -it baichuan2 /bin/bash
```
### Conda(方式三)
1. 创建conda虚拟环境:
```
conda create -n baichuan2 python=3.8
conda create -n baichuan2 python=3.10
```
2. 关于本项目DCU显卡所需的工具包、深度学习库等均可从[光合](https://developer.hpccube.com/tool/)开发者社区下载安装。
- [DTK 23.04](https://cancon.hpccube.com:65024/1/main/DTK-23.04.1)
- [Pytorch 1.13.1](https://cancon.hpccube.com:65024/4/main/pytorch/dtk23.04)
- [Deepspeed 0.9.2](https://cancon.hpccube.com:65024/4/main/deepspeed/dtk23.04)
- [DTK 24.04](https://cancon.hpccube.com:65024/1/main/DTK-24.04.1)
- [Pytorch 2.1.0](https://cancon.hpccube.com:65024/4/main/pytorch/DAS1.1.1)
- [Deepspeed 0.12.3](https://cancon.hpccube.com:65024/4/main/deepspeed/DAS1.1)
Tips:以上dtk驱动、python、deepspeed等工具版本需要严格一一对应。
......@@ -65,31 +65,10 @@ conda create -n baichuan2 python=3.8
pip install -r requirements.txt -i http://mirrors.aliyun.com/pypi/simple/ --trusted-host mirrors.aliyun.com
```
### 注意 1
```
#到虚拟环境下对应的python/site-packages注释掉一些版本判断
site-packages/accelerate/accelerator.py 文件
287 #if not is_deepspeed_available():
288 # raise ImportError("DeepSpeed is not installed => run `pip install deepspeed` or build it from source.")
289 #if compare_versions("deepspeed", "<", "0.9.3"):
290 # raise ImportError("DeepSpeed version must be >= 0.9.3. Please update DeepSpeed.")
site-packages/transformers/utils/versions.py 文件
43 #if not ops[op](version.parse(got_ver), version.parse(want_ver)):
44 # raise ImportError(
45 # f"{requirement} is required for a normal functioning of this module, but found {pkg}=={got_ver}.{hint}"
46 # )
```
### 注意 2
训练前请参考[modeling_baichuan.py](./modeling_baichuan.py)修改模型文件夹中modeling_baichuan.py的`Attention`类的代码,主要(暂时)去除去torch2.X的依赖。
### 注意3
若不支持xformers,在多节点训练中可能会出现xformers相关报错:"ImportError: This modeling file reguires the following packages that were not found in your environment: xformers." ,您可通过直接将[modeling_baichuan.py](./modeling_baichuan.py)中xpos设置为None来解决,即注释import xformers相关代码,并设置`xops=None`
### 注意3
若不支持xformers,在训练中可能会出现xformers相关报错:"ImportError: This modeling file reguires the following packages that were not found in your environment: xformers." ,您可通过直接将[modeling_baichuan.py](./modeling_baichuan.py)中xpos设置为None来解决,即注释import xformers相关代码,并设置`xops=None`
## 数据集
......
hostfile=""
HIP_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 deepspeed --hostfile=$hostfile fine-tune.py \
HIP_VISIBLE_DEVICES=0,1,2,3,4,5 deepspeed --hostfile=$hostfile fine-tune.py \
--report_to "none" \
--data_path "data/belle_chat_ramdon_10k.json" \
--model_name_or_path "../../baichuan2-13b-chat-hf" \
--model_name_or_path "./baichuan2-13b-chat" \
--output_dir "output" \
--model_max_length 512 \
--num_train_epochs 4 \
......
hostfile=""
HIP_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 deepspeed --hostfile=$hostfile fine-tune.py \
HIP_VISIBLE_DEVICES=0,1,2,3 deepspeed --hostfile=$hostfile fine-tune.py \
--report_to "none" \
--data_path "data/belle_chat_ramdon_10k.json" \
--model_name_or_path "../../baichuan2-13b-chat-hf" \
--model_name_or_path "./baichuan2-13b-chat" \
--output_dir "output" \
--model_max_length 64 \
--num_train_epochs 4 \
......
This diff is collapsed.
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment