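# dcu_megatron_core_v0.15.0 Large-Model Pretraining

Pretraining steps for dcu_megatron.

## Qwen3-0.6B single-node pretraining

### 1. Environment setup

Pull the image, start the container, and fetch the sources: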

```shell
docker pull harbor.sourcefind.cn:5443/dcu/admin/base/pytorch:2.5.1-ubuntu22.04-dtk25.04.4-1230-py3.10-20251230
docker run --shm-size 500g --network=host --name=limeng_xxx --privileged \
    --device=/dev/kfd --device=/dev/dri --group-add video \
    --cap-add=SYS_PTRACE --security-opt seccomp=unconfined \
    -v ~/:/workspace/ \
    -v /public/opendas/DL_DATA/llm-models/:/home/models:ro \
    -v /opt/hyhal:/opt/hyhal:ro \
    -it harbor.sourcefind.cn:5443/dcu/admin/base/pytorch:2.5.1-ubuntu22.04-dtk25.04.4-1230-py3.10-20251230 bash

git clone http://112.11.119.99:10068/dcutoolkit/deeplearing/dcu_megatron.git
cd dcu_megatron
git switch --detach origin/core_v0.15.0
# If you hit network problems, you can also download the source archives directly
git clone https://github.com/NVIDIA/Megatron-Energon.git
cd Megatron-Energon
git switch --detach ea11c980
cd ..   # assumption: return to dcu_megatron so Megatron-LM is cloned next to Megatron-Energon, not inside it

git clone https://github.com/NVIDIA/Megatron-LM.git
cd Megatron-LM
git switch --detach 0d7e02bd

pip install pybind11
```
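
Before going further, it can help to confirm that the devices and mounts requested by the `docker run` flags above are actually visible inside the container. The following is a minimal sketch, not part of dcu_megatron: `missing_paths` is a hypothetical helper name, and the paths in the usage comment are simply the ones passed to `--device` and `-v` above.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not shipped with dcu_megatron).
# Prints every argument that does not exist as a path.
# Returns 0 when all paths exist, 1 otherwise.
missing_paths() {
    local rc=0 p
    for p in "$@"; do
        if [ ! -e "$p" ]; then
            echo "missing: $p"
            rc=1
        fi
    done
    return "$rc"
}

# Usage inside the container (paths taken from the docker run flags above):
# missing_paths /dev/kfd /dev/dri /opt/hyhal /home/models /workspace \
#     && echo "devices and mounts are visible"
```

If anything is reported as missing, the container was probably started without the corresponding `--device` or `-v` flag.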

### 2. Prepare the configuration

In `examples/qwen/run_qwen3_0.6B.sh`, set the following variables to match your machine:

```shell
DTK_ENV="/opt/dtk/env.sh"                                                # path to the DTK env.sh
DATA_PATH="/workspace/data/xxx"                                          # path to the dataset
TOKENIZER_MODEL_PATH="/home/models/qwen3/Qwen3-0.6B"                     # path to the tokenizer model
CHECKPOINT_PATH="/workspace/megatron_0210/test2/dcu_megatron/examples/qwen/checkpoints1"  # checkpoint directory
NCCL_ENV=${MEGATRON_PATH}/requirements/env.sh                            # adjust the variables in this file to match the network actually in use
LAUNCH_WITH_BINDING=${MEGATRON_PATH}/requirements/launch_with_binding.sh # adjust the variables in this file to match the network actually in use
GPUS=8
PORT="25900"

```
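
If you set up several machines, editing these variables by hand gets repetitive. The sketch below shows one way to script it. `set_var` is a hypothetical helper, not part of dcu_megatron. It assumes each variable is assigned at the start of a line, as in the listing above, and that the new value contains no whitespace, `|`, `&`, or backslash.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not shipped with dcu_megatron).
# set_var FILE NAME VALUE
# Rewrites the line "NAME=..." in FILE to NAME="VALUE".
# Only the value token is replaced, so a trailing "# comment" is kept.
# Assumption: VALUE has no whitespace, '|', '&', or backslash.
set_var() {
    local file="$1" name="$2" value="$3" tmp
    tmp="$(mktemp)" || return 1
    sed -E "s|^(${name}=)[^[:space:]]*|\1\"${value}\"|" "$file" > "$tmp" || return 1
    # Copy the contents back so the script keeps its original permissions.
    cat "$tmp" > "$file" && rm -f "$tmp"
}

# Usage (run from the dcu_megatron directory; the values are examples only):
# set_var examples/qwen/run_qwen3_0.6B.sh GPUS 4
# set_var examples/qwen/run_qwen3_0.6B.sh DATA_PATH /workspace/data/xxx
```

Review the file afterwards. The substitution is purely textual and does not check that the new value is valid.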

In `examples/qwen/train_qwen3_0.6B.sh`, adjust the training settings as needed.

### 3. Start training

```shell
bash run_qwen3_0.6B.sh
```
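
Training runs are long, so it is often useful to keep a copy of the console output. The sketch below is one possible wrapper, not something the repository provides. `run_logged` and the log file name are hypothetical. It relies on bash's `PIPESTATUS` so that the exit status of the training script, rather than that of `tee`, is returned.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not shipped with dcu_megatron).
# run_logged LOGFILE COMMAND [ARGS...]
# Runs COMMAND, shows stdout and stderr live, and saves a copy to LOGFILE.
# Returns the exit status of COMMAND, not of tee.
run_logged() {
    local log="$1"
    shift
    "$@" 2>&1 | tee "$log"
    return "${PIPESTATUS[0]}"
}

# Usage (the log file name is only an example):
# run_logged train_qwen3_0.6B.log bash run_qwen3_0.6B.sh
```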

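## Qwen3-8B single-node pretraining

The image and project setup are the same as for Qwen3-0.6B above.

For the training configuration, edit `examples/qwen/run_qwen3_8B.sh` and `train_qwen3_8B_1nodes.sh` to match your environment.

Start training with `bash run_qwen3_8B.sh`.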