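Question about training with Llama Factory
I am training with Llama Factory in a Docker environment. When training finishes and the checkpoint is being saved, the following error is raised: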
[INFO|trainer.py:3993] 2025-07-10 20:30:17,767 >> Saving model checkpoint to saves/glm4-9b/full/sft/checkpoint-3
[INFO|configuration_utils.py:424] 2025-07-10 20:30:17,772 >> Configuration saved in saves/glm4-9b/full/sft/checkpoint-3/config.json
[INFO|configuration_utils.py:904] 2025-07-10 20:30:17,773 >> Configuration saved in saves/glm4-9b/full/sft/checkpoint-3/generation_config.json
swanlab: Error happened while training
swanlab: 🌟 Run `swanlab watch /root/llama-factory/swanlog/run-20250710_202905-bhsahxthv3a3zmf8mdd4v` to view SwanLab Experiment Dashboard locally
File "/root/llama-factory/src/llamafactory/launcher.py", line 23, in <module>
launch()
File "/root/llama-factory/src/llamafactory/launcher.py", line 19, in launch
run_exp()
File "/root/llama-factory/src/llamafactory/train/tuner.py", line 110, in run_exp
_training_function(config={"args": args, "callbacks": callbacks})
File "/root/llama-factory/src/llamafactory/train/tuner.py", line 72, in _training_function
run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
File "/root/llama-factory/src/llamafactory/train/sft/workflow.py", line 96, in run_sft
train_result = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
File "/usr/local/lib/python3.10/site-packages/transformers/trainer.py", line 2240, in train
return inner_training_loop(
File "/usr/local/lib/python3.10/site-packages/transformers/trainer.py", line 2622, in _inner_training_loop
self._maybe_log_save_evaluate(
File "/usr/local/lib/python3.10/site-packages/transformers/trainer.py", line 3102, in _maybe_log_save_evaluate
self._save_checkpoint(model, trial)
File "/usr/local/lib/python3.10/site-packages/transformers/trainer.py", line 3199, in _save_checkpoint
self.save_model(output_dir, _internal_call=True)
File "/usr/local/lib/python3.10/site-packages/transformers/trainer.py", line 3898, in save_model
self._save(output_dir, state_dict=state_dict)
File "/usr/local/lib/python3.10/site-packages/transformers/trainer.py", line 4015, in _save
self.model.save_pretrained(
File "/usr/local/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3572, in save_pretrained
ptrs[id_tensor_storage(tensor)].append(name)
File "/usr/local/lib/python3.10/site-packages/transformers/pytorch_utils.py", line 300, in id_tensor_storage
from torch.distributed.tensor import DTensor
cannot import name 'DTensor' from 'torch.distributed.tensor' (/usr/local/lib/python3.10/site-packages/torch/distributed/tensor/__init__.py)
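The last frame fails on an import, so one quick check is whether the installed torch build exposes DTensor at the public path that transformers imports from. The sketch below uses a hypothetical helper, can_import_name, which is not part of torch or transformers. It also assumes that older torch builds keep DTensor under the private module torch.distributed._tensor, which should be verified against the actual build in the container.

```python
import importlib


def can_import_name(module_name: str, attr: str) -> bool:
    """Return True if the module imports cleanly and exposes `attr`."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        # Covers a missing module as well as a failing nested import.
        return False
    return hasattr(module, attr)


if __name__ == "__main__":
    # First the public path from the traceback, then the private path
    # that older torch releases are assumed to use (an assumption).
    for path in ("torch.distributed.tensor", "torch.distributed._tensor"):
        print(path, "->", can_import_name(path, "DTensor"))
```

If the first line prints False and the second prints True, the installed torch is likely older than what this transformers release expects at that call site.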
Package versions in the Docker environment:
torch 2.4.1+das.opt2.dtk2504
torchdata 0.11.0
torchvision 0.19.1+das.opt2.dtk2504
deepspeed 0.14.2+das.opt2.dtk2504
transformers 4.52.4
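For reference, a version list like the one above can be collected from inside the container with the standard library alone. This is a minimal sketch; installed_versions is a hypothetical helper name, and the package list is simply the one from this post.

```python
from importlib import metadata


def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    packages = ["torch", "torchvision", "torchdata", "deepspeed", "transformers"]
    for name, version in installed_versions(packages).items():
        print(f"{name:<14}{version}")
```

Running it with the same Python interpreter that launches training avoids reading versions from a different environment.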
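I searched online for a fix, and the suggestion was that the torch version does not match. After I upgraded with pip install torch --upgrade, Llama Factory training fails with other errors.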
I also found the file torch-2.5.1+das.opt1.dtk25041-cp310-cp310-manylinux_2_28_x86_64.whl in the community, but after installing it training still does not work. Could you please help?
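One possible reason the upgrades misbehave is that torch ends up on a different build than torchvision and deepspeed. The sketch below assumes that the suffix after "+" (for example das.opt2.dtk2504) identifies the vendor-specific build, which is an inference from the version strings and not something confirmed here. Both helper names are hypothetical, and the torch version in the example is a made-up stand-in for whatever a plain pip upgrade installs.

```python
def local_build_tag(version: str) -> str:
    """Return the part after '+' in a version string, or '' if there is none."""
    _, _, tag = version.partition("+")
    return tag


def packages_without_tag(versions: dict, expected_prefix: str) -> list:
    """List packages whose local build tag does not start with expected_prefix."""
    return [
        name
        for name, version in versions.items()
        if not local_build_tag(version).startswith(expected_prefix)
    ]


if __name__ == "__main__":
    # torchvision and deepspeed are copied from the post; the torch value
    # is hypothetical and only illustrates a build without a vendor tag.
    versions = {
        "torch": "2.7.1",
        "torchvision": "0.19.1+das.opt2.dtk2504",
        "deepspeed": "0.14.2+das.opt2.dtk2504",
    }
    print(packages_without_tag(versions, "das."))  # prints ['torch']
```

A package flagged this way is only a hint that the environment mixes builds; whether two differently tagged builds are compatible still has to be checked against the vendor's documentation.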