Qwen_Megatron🤤

# Environment Setup

1. Pull the image, create the container, and install the base dependencies:

docker pull image.sourcefind.cn:5000/dcu/admin/base/pytorch:2.3.0-ubuntu20.04-dtk24.04.2-py3.10

docker run -it --network=host --name=llmtrain_libo --restart=on-failure:10 --privileged --device=/dev/kfd --device=/dev/dri --ipc=host --shm-size=32G --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -v /mnt/fs/catl:/data -v /root/.ssh:/root/.ssh -v /opt/hyhal:/opt/hyhal -u root --ulimit stack=-1:-1 --ulimit memlock=-1:-1 image.sourcefind.cn:5000/dcu/admin/base/pytorch:2.3.0-ubuntu20.04-dtk24.04.2-py3.10 bash

pip install -r requirements.txt
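2. Multi-node Docker setup:
   1. Inside each container, run `/usr/sbin/sshd -p 12345` to start sshd listening on port 12345.
   2. Containers can then log in to one another over SSH through that port: `ssh ip -p 12345`.
   3. For passwordless login, mount the `.ssh` directory when creating the container by passing `-v /root/.ssh:/root/.ssh` to `docker run`.
   4. To run `mpirun` across containers: `mpirun -np .... --hostfile hosts -mca plm_rsh_args "-p 12345" ./exe localhost`

### Distributed Training

- Run multi-card training, adjusting the parameters as needed:

```
bash run.sh > train_1_64.log 2>&1 &
```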
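
Before launching a multi-node job, it can help to generate the `hosts` file used by `mpirun` and to confirm that passwordless SSH works on the sshd port described in the multi-node setup steps. The following is a minimal sketch, not part of this repository: the helper names `make_hostfile` and `check_ssh`, the `SLOTS_PER_NODE` value, and the IP addresses are all placeholder assumptions. It relies only on the standard Open MPI hostfile format (`<host> slots=<n>`) and standard OpenSSH client options.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for the multi-node setup; names and values are assumptions.

SSH_PORT=12345       # port on which sshd was started inside each container
SLOTS_PER_NODE=8     # assumption: number of cards per node, adjust to your hardware

# Print one "<host> slots=<n>" line per argument (Open MPI hostfile format).
make_hostfile() {
    local host
    for host in "$@"; do
        printf '%s slots=%s\n' "$host" "$SLOTS_PER_NODE"
    done
}

# Return 0 if passwordless SSH to the given host succeeds on SSH_PORT.
# BatchMode=yes makes ssh fail instead of prompting for a password.
check_ssh() {
    ssh -p "$SSH_PORT" -o BatchMode=yes -o ConnectTimeout=5 "$1" true
}

# Example usage (placeholder IPs):
#   make_hostfile 10.0.0.1 10.0.0.2 > hosts
#   for h in 10.0.0.1 10.0.0.2; do
#       check_ssh "$h" || echo "cannot reach $h on port $SSH_PORT"
#   done
```

The resulting `hosts` file can then be passed to the `mpirun` command from the multi-node setup via `--hostfile hosts`. Checking connectivity first tends to surface a missing `.ssh` mount or a container where sshd was not started before the training run itself fails.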