Model files are not saved under ./work_dirs/job/job_id/, plus other issues
Environment:
Sugon cluster at the National Supercomputing Center in Chengdu; CentOS 7, DCU No. 1 accelerators with 16 GB memory
Task:
Fine-tuning llama3-8B-instruct with xtuner
Parameters
# Partial settings from llama3_8b_instruct_qlora_alpaca_e3_M.py:
optim_type = AdamW
lr = 1e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 10
save_total_limit = 3 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
#########################
model = dict(
    type=SupervisedFinetune,
    use_varlen_attn=use_varlen_attn,
    llm=dict(
        type=AutoModelForCausalLM.from_pretrained,
        pretrained_model_name_or_path=pretrained_model_name_or_path,
        trust_remote_code=True,
        torch_dtype=torch.float16,
    ),
    lora=dict(
        type=LoraConfig,
        r=64,  # 32
        lora_alpha=32,  # 64
        lora_dropout=0.1,
        bias='none',
        task_type='CAUSAL_LM'))
################################
# Model TP sharding, from Meta-Llama-3-8B-Instruct/config.json
"pretraining_tp": 2
Problem description
- Training was interrupted at step 11780, and no model checkpoint had been saved.
- Cause of the interruption: an NCCL collective-operation timeout (30 minutes).
- Intermittent error: at training startup, the .arrow files cached by Hugging Face cannot be found in the directory; restarting the training sometimes resolves it.
The latter two problems also occur when fine-tuning other models and in other jobs (a sketch of the relevant settings follows).
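The sketch below shows the two knobs most directly related to these symptoms, using plain PyTorch and the Hugging Face `datasets` cache environment variable; how these would be threaded through an xtuner config is an assumption, and the cache path is purely hypothetical:

```python
# Minimal sketch (not from the actual job): standard PyTorch / HF `datasets` knobs.
import os
from datetime import timedelta
import torch.distributed as dist

# The missing .arrow files point at the Hugging Face `datasets` cache. Pinning it
# to a stable path before any dataset loading may help avoid races on a shared
# filesystem (the path here is a hypothetical example).
os.environ["HF_DATASETS_CACHE"] = "/path/to/stable/datasets_cache"

# The watchdog in the log below fires at Timeout(ms)=1800000, i.e. the default
# 30-minute collective timeout; it can be raised when the process group is
# created (xtuner/mmengine normally performs this call internally).
dist.init_process_group(backend="nccl", timeout=timedelta(hours=2))
```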
Partial log
05/12 10:32:33 - mmengine - INFO - Iter(train) [11780/42460] lr: 8.4787e-05 eta: 5 days, 3:07:47 time: 10.1190 data_time: 0.0534 memory: 5297 loss: nan
05/12 10:32:33 - mmengine - INFO - Iter(train) [11780/42480] lr: 8.4803e-05 eta: 5 days, 3:12:34 time: 10.1205 data_time: 0.0718 memory: 5297 loss: nan
05/12 10:32:33 - mmengine - INFO - Iter(train) [11780/42470] lr: 8.4796e-05 eta: 5 days, 3:10:10 time: 10.1196 data_time: 0.0650 memory: 5297 loss: nan
E0512 11:04:06.376451 27517 ProcessGroupNCCL.cpp:474] [Rank 116] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=10785040, OpType=_ALLGATHER_BASE, NumelIn=467968, NumelOut=59899904, Timeout(ms)=1800000) ran for 1800000 milliseconds before timing out.
I0512 11:04:06.376550 27517 ProcessGroupNCCL.cpp:874] [Rank 116] Destroyed 1communicators on CUDA device 0
E0512 11:04:06.376560 27517 ProcessGroupNCCL.cpp:488] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
E0512 11:04:06.376590 27517 ProcessGroupNCCL.cpp:494] To avoid data inconsistency, we are taking the entire process down.
E0512 11:04:06.380620 18526 ProcessGroupNCCL.cpp:474] [Rank 19] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=10785040, OpType=_ALLGATHER_BASE, NumelIn=467968, NumelOut=59899904, Timeout(ms)=1800000) ran for 1800006 milliseconds before timing out.
I0512 11:04:06.380704 18526 ProcessGroupNCCL.cpp:874] [Rank 19] Destroyed 1communicators on CUDA device 3
E0512 11:04:06.380713 18526 ProcessGroupNCCL.cpp:488] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
E0512 11:04:06.380740 18526 ProcessGroupNCCL.cpp:494] To avoid data inconsistency, we are taking the entire process down.
E0512 11:04:06.399801 5998 ProcessGroupNCCL.cpp:474] [Rank 31] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=10785040, OpType=_ALLGATHER_BASE, NumelIn=467968, NumelOut=59899904, Timeout(ms)=1800000) ran for 1800025 milliseconds before timing out.
I0512 11:04:06.399885 5998 ProcessGroupNCCL.cpp:874] [Rank 31] Destroyed 1communicators on CUDA device 3
E0512 11:04:06.399895 5998 ProcessGroupNCCL.cpp:488] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
E0512 11:04:06.399902 5998 ProcessGroupNCCL.cpp:494] To avoid data inconsistency, we are taking the entire process down.
E0512 11:04:06.404671 19867 ProcessGroupNCCL.cpp:474] [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=10785040, OpType=_ALLGATHER_BASE, NumelIn=467968, NumelOut=59899904, Timeout(ms)=1800000) ran for 1800029 milliseconds before timing out.
Slurm script
#!/bin/bash
#SBATCH -p normal
#SBATCH -N 32
#SBATCH --cpus-per-task=8
#SBATCH --ntasks-per-node=4
#SBATCH --gres=dcu:4
#SBATCH --mem=100GB
#SBATCH -J llm3Xtuner
#SBATCH -o ./log/llm3-%j.out
#SBATCH -e ./log/llm3-%j.out
source env.sh
rm -rf ./hostfile/*
echo "START TIME: $(date)"
hostfile=./hostfile/$SLURM_JOB_ID
scontrol show hostnames $SLURM_JOB_NODELIST > ${hostfile}
for i in `cat $hostfile`
do
echo ${i} slots=4 >> `pwd`/hostfile/hostfile-dl-$SLURM_JOB_ID
done
np=$(cat $hostfile|sort|uniq |wc -l)
np=$(($np*4))
echo $np
nodename=$(cat $hostfile |sed -n "1p")
dist_url=`echo $nodename | awk '{print $1}'`
mpirun -np $np --allow-run-as-root --hostfile hostfile/hostfile-dl-$SLURM_JOB_ID --bind-to none `pwd`/single_finetune.sh