Qwen_Megatron🤤


# Environment Setup
1. Pull the image, create the container, and install the base dependencies
<pre>
docker pull image.sourcefind.cn:5000/dcu/admin/base/pytorch:2.3.0-ubuntu20.04-dtk24.04.2-py3.10

docker run -it \
    --network=host \
    --name=llmtrain_libo \
    --restart=on-failure:10 \
    --privileged \
    --device=/dev/kfd --device=/dev/dri \
    --ipc=host --shm-size=32G --group-add video \
    --cap-add=SYS_PTRACE --security-opt seccomp=unconfined \
    -v /mnt/fs/catl:/data \
    -v /root/.ssh:/root/.ssh \
    -v /opt/hyhal:/opt/hyhal \
    -u root --ulimit stack=-1:-1 --ulimit memlock=-1:-1 \
    image.sourcefind.cn:5000/dcu/admin/base/pytorch:2.3.0-ubuntu20.04-dtk24.04.2-py3.10 \
    bash

pip install -r requirements.txt
</pre>

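2. Multi-node Docker setup:

   1. Inside each container, run `/usr/sbin/sshd -p 12345` to start an SSH daemon listening on port 12345.
   2. Containers can then log in to one another over that port: `ssh ip -p 12345`.
   3. For passwordless login, mount the `.ssh` directory when creating the container by passing `-v /root/.ssh:/root/.ssh` to `docker run`.
   4. To launch `mpirun` across containers, pass the SSH port through `plm_rsh_args`: `mpirun -np .... --hostfile hosts -mca plm_rsh_args "-p 12345" ./exe localhost`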
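
The `mpirun` command above reads a file named `hosts`, which this README does not show. The following is a minimal sketch of how such a file could be generated, assuming Open MPI (suggested by the `-mca plm_rsh_args` option) and its `hostname slots=N` hostfile format. The helper name `make_hostfile`, the node addresses, and the slot count are hypothetical placeholders, not values from this repository.

```shell
# Hypothetical helper: print one Open MPI hostfile line per node.
# Usage: make_hostfile SLOTS HOST [HOST ...]
make_hostfile() {
  slots="$1"
  shift
  for host in "$@"; do
    printf '%s slots=%s\n' "$host" "$slots"
  done
}

# Placeholder addresses and slot count (assumptions); replace them with
# the real node IPs and the number of processes to launch per node.
make_hostfile 8 192.0.2.11 192.0.2.12 > hosts
cat hosts
```

Each line names one node and how many processes `mpirun` may place on it, so the total across all lines should be at least the value passed to `-np`.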




### Distributed Training
- Run multi-card training; adjust the parameters as needed

  ```
  bash run.sh > train_1_64.log 2>&1 &
  ```