Created Jul 10, 2025 by t5y6jjj@t5y6jjj

Question about a training error with Llama Factory

I am training with Llama Factory in a Docker environment. When training finishes, it reports the following error:

[INFO|trainer.py:3993] 2025-07-10 20:30:17,767 >> Saving model checkpoint to saves/glm4-9b/full/sft/checkpoint-3
[INFO|configuration_utils.py:424] 2025-07-10 20:30:17,772 >> Configuration saved in saves/glm4-9b/full/sft/checkpoint-3/config.json
[INFO|configuration_utils.py:904] 2025-07-10 20:30:17,773 >> Configuration saved in saves/glm4-9b/full/sft/checkpoint-3/generation_config.json
swanlab: Error happened while training
swanlab: 🌟 Run `swanlab watch /root/llama-factory/swanlog/run-20250710_202905-bhsahxthv3a3zmf8mdd4v` to view SwanLab Experiment Dashboard locally
  File "/root/llama-factory/src/llamafactory/launcher.py", line 23, in <module>
    launch()
  File "/root/llama-factory/src/llamafactory/launcher.py", line 19, in launch
    run_exp()
  File "/root/llama-factory/src/llamafactory/train/tuner.py", line 110, in run_exp
    _training_function(config={"args": args, "callbacks": callbacks})
  File "/root/llama-factory/src/llamafactory/train/tuner.py", line 72, in _training_function
    run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
  File "/root/llama-factory/src/llamafactory/train/sft/workflow.py", line 96, in run_sft
    train_result = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
  File "/usr/local/lib/python3.10/site-packages/transformers/trainer.py", line 2240, in train
    return inner_training_loop(
  File "/usr/local/lib/python3.10/site-packages/transformers/trainer.py", line 2622, in _inner_training_loop
    self._maybe_log_save_evaluate(
  File "/usr/local/lib/python3.10/site-packages/transformers/trainer.py", line 3102, in _maybe_log_save_evaluate
    self._save_checkpoint(model, trial)
  File "/usr/local/lib/python3.10/site-packages/transformers/trainer.py", line 3199, in _save_checkpoint
    self.save_model(output_dir, _internal_call=True)
  File "/usr/local/lib/python3.10/site-packages/transformers/trainer.py", line 3898, in save_model
    self._save(output_dir, state_dict=state_dict)
  File "/usr/local/lib/python3.10/site-packages/transformers/trainer.py", line 4015, in _save
    self.model.save_pretrained(
  File "/usr/local/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3572, in save_pretrained
    ptrs[id_tensor_storage(tensor)].append(name)
  File "/usr/local/lib/python3.10/site-packages/transformers/pytorch_utils.py", line 300, in id_tensor_storage
    from torch.distributed.tensor import DTensor
cannot import name 'DTensor' from 'torch.distributed.tensor' (/usr/local/lib/python3.10/site-packages/torch/distributed/tensor/__init__.py)

Package versions in the Docker environment:

torch 2.4.1+das.opt2.dtk2504
torchdata 0.11.0
torchvision 0.19.1+das.opt2.dtk2504
deepspeed 0.14.2+das.opt2.dtk2504
transformers 4.52.4
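For reference, a version list like the one above can be collected with a short script instead of by hand. This is a minimal sketch using only the standard library; the helper name `installed_versions` is hypothetical, and the package list simply mirrors the packages named in this issue.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The distribution is not installed in this environment.
            versions[name] = None
    return versions


if __name__ == "__main__":
    # Package names taken from the issue text; adjust as needed.
    wanted = ["torch", "torchdata", "torchvision", "deepspeed", "transformers"]
    for name, version in installed_versions(wanted).items():
        print(name, version or "not installed")
```

Running it inside the container prints one line per package, which makes it easier to compare the environment before and after swapping the torch wheel.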

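I searched online for a fix, and the suggested cause is a torch version mismatch. After I upgraded with pip install torch --upgrade, Llama Factory training fails with other errors.

I also found the file torch-2.5.1+das.opt1.dtk25041-cp310-cp310-manylinux_2_28_x86_64.whl in the community, but after installing it, training still does not work. Could you please help?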
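The failing import at the bottom of the traceback can be checked in isolation, without starting a training run. The sketch below is an illustration, not part of Llama Factory or transformers: `locate_symbol` is a hypothetical helper, and the second candidate path, `torch.distributed._tensor`, is an assumption about where older torch builds keep `DTensor`; a vendor build may differ.

```python
import importlib


def locate_symbol(name, module_paths):
    """Return the first module path that exposes `name`, or None if none does."""
    for path in module_paths:
        try:
            module = importlib.import_module(path)
        except ImportError:
            # Module (or its parent package) is missing in this environment.
            continue
        if hasattr(module, name):
            return path
    return None


if __name__ == "__main__":
    # First path is the one transformers imports in the traceback;
    # the second is an assumed older, private location.
    candidates = ["torch.distributed.tensor", "torch.distributed._tensor"]
    print("DTensor found in:", locate_symbol("DTensor", candidates))
```

If the script reports only the private path, or `None`, the installed torch build does not provide what this transformers version imports, which is consistent with the version-mismatch explanation. It does not say which of the two packages should be changed.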
