# dcu_megatron_core_v0.15.0

Large-model pretraining

dcu_megatron pretraining steps

## Qwen3-0.6B single-node pretraining

### 1. Environment setup

Pull the image, start the container, and fetch the source code:

```shell
# Pull the base image
docker pull harbor.sourcefind.cn:5443/dcu/admin/base/pytorch:2.5.1-ubuntu22.04-dtk25.04.4-1230-py3.10-20251230

# Start the container with DCU devices, the model directory, and hyhal mounted
docker run --shm-size 500g --network=host --name=limeng_xxx --privileged \
    --device=/dev/kfd --device=/dev/dri --group-add video \
    --cap-add=SYS_PTRACE --security-opt seccomp=unconfined \
    -v ~/:/workspace/ \
    -v /public/opendas/DL_DATA/llm-models/:/home/models:ro \
    -v /opt/hyhal:/opt/hyhal:ro \
    -it harbor.sourcefind.cn:5443/dcu/admin/base/pytorch:2.5.1-ubuntu22.04-dtk25.04.4-1230-py3.10-20251230 bash

git clone http://112.11.119.99:10068/dcutoolkit/deeplearing/dcu_megatron.git
cd dcu_megatron
git switch --detach origin/core_v0.15.0
# If you run into network problems, you can also download the archive directly

git clone https://github.com/NVIDIA/Megatron-Energon.git
cd Megatron-Energon
git switch --detach ea11c980

git clone https://github.com/NVIDIA/Megatron-LM.git
cd Megatron-LM
git switch --detach 0d7e02bd

pip install pybind11
```

### 2. Configuration

Edit examples/qwen/run_qwen3_0.6B.sh so that the parameters match your machine:

```shell
DTK_ENV="/opt/dtk/env.sh" # location of the DTK env.sh
DATA_PATH="/workspace/data/xxx" # path to dataset
TOKENIZER_MODEL_PATH="/home/models/qwen3/Qwen3-0.6B" # path to tokenizer.model
CHECKPOINT_PATH="/workspace/megatron_0210/test2/dcu_megatron/examples/qwen/checkpoints1" # path to ckpt
NCCL_ENV=${MEGATRON_PATH}/requirements/env.sh # adjust the variables for the network actually in use
LAUNCH_WITH_BINDING=${MEGATRON_PATH}/requirements/launch_with_binding.sh # adjust the variables for the network actually in use
GPUS=8
PORT="25900"
```

Adjust the training configuration in examples/qwen/train_qwen3_0.6B.sh.

### 3. Start training

```shell
bash run_qwen3_0.6B.sh
```

## Qwen3-8B single-node pretraining

The image setup and project setup are the same as for the 0.6B model above.

For the training configuration, edit examples/qwen/run_qwen3_8B.sh and train_qwen3_8B_1nodes.sh to match your environment.

Start training with bash run_qwen3_8B.sh.
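
Before launching either run script, it can help to confirm that the paths set in the configuration step actually exist inside the container. The following is a minimal sketch and not part of dcu_megatron: the function name check_paths is hypothetical, and the example paths in the trailing comment are simply the values shown in the configuration snippet, so substitute your own. DATA_PATH is left out of the example because it may be a dataset prefix rather than a literal file or directory.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight helper (an assumption, not shipped with dcu_megatron).
# Prints each missing path to stderr and returns non-zero if any path is missing.
check_paths() {
    local missing=0
    local p
    for p in "$@"; do
        if [ ! -e "$p" ]; then
            echo "missing: $p" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example call, using the values from the configuration snippet:
#   check_paths "/opt/dtk/env.sh" "/home/models/qwen3/Qwen3-0.6B" \
#       || echo "fix the paths in the run script before training"
```

Checking all paths before reporting failure, rather than stopping at the first one, means a single pass lists everything that needs fixing.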