ModelZoo / llama3_pytorch · Issue #4 (Closed)

pip conflict error and finetune.sh llama-3-8b error

Created Jun 17, 2024 by ncic_liuyao@ncic_liuyao

First step: pull the image and set up the environment.

```shell
docker pull image.sourcefind.cn:5000/dcu/admin/base/pytorch:2.1.0-centos7.6-dtk24.04-py310
docker run -it -v /path/your_code_data/:/path/your_code_data/ -v /opt/hyhal/:/opt/hyhal/:ro \
    --shm-size=80G --privileged=true --device=/dev/kfd --device=/dev/dri/ \
    --group-add video --name docker_name imageID bash
cd /your_code_path/llama3_pytorch
pip install -e .
pip uninstall flash-attn  # 2.0.4+82379d7.abi0.dtk2404.torch2.1
```
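After `pip uninstall flash-attn`, a quick sanity check can confirm the module is really gone before installing xtuner (a minimal sketch; `flash_attn` is the import name of the flash-attn package, which is an assumption worth checking against your wheel):

```python
from importlib.util import find_spec

# flash-attn installs the importable module "flash_attn";
# find_spec returns None once the package is fully uninstalled.
if find_spec("flash_attn") is None:
    print("flash-attn removed")
else:
    print("flash-attn still importable; uninstall again")
```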

If the Docker environment already contains deepspeed, this install can be skipped; just verify that the versions match.
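The version comparison can be done from the stdlib without reinstalling anything (a sketch using `importlib.metadata`; the target version string `0.12.3` is taken from the wheel name below):

```python
from importlib.metadata import version, PackageNotFoundError

def installed_version(pkg: str):
    """Return the installed version string of pkg, or None if absent."""
    try:
        return version(pkg)
    except PackageNotFoundError:
        return None

ds = installed_version("deepspeed")
if ds is None:
    print("deepspeed not installed -> install the DCU wheel")
elif ds.startswith("0.12.3"):
    print("deepspeed", ds, "already matches; skip the install")
else:
    print("deepspeed", ds, "differs; reinstall the DCU wheel")
```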

```shell
pip install deepspeed-0.12.3+das1.0+gita724046.abi0.dtk2404.torch2.1.0-cp310-cp310-manylinux2014_x86_64.whl
git clone -b v0.1.18 https://github.com/InternLM/xtuner.git
cd xtuner
pip install -e '.[all]'
```

`pip install -e '.[all]'` reports dependency conflicts:

```
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
lmdeploy 0.1.0-git782048c.abi0.dtk2404.torch2.1 requires transformers==4.33.2, but you have transformers 4.41.2 which is incompatible.
onnxruntime 1.15.0+gita9ca438.abi0.dtk2404 requires numpy>=1.26.4, but you have numpy 1.24.3 which is incompatible.
xformers 0.0.25+gitd11e899.abi0.dtk2404.torch2.1 requires numpy<=1.23.5, but you have numpy 1.24.3 which is incompatible.
```
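These warnings come from pip's post-install consistency scan; the same list can be regenerated at any time with `pip check` (the sketch below shells out to the current interpreter's pip). Whether the lmdeploy/xformers pins actually break fine-tuning depends on whether those packages are imported by the training path, which is worth verifying rather than assuming:

```python
import subprocess
import sys

# `pip check` re-runs the dependency consistency scan that produced the
# ERROR lines above, without installing or modifying anything.
proc = subprocess.run(
    [sys.executable, "-m", "pip", "check"],
    capture_output=True,
    text=True,
)
print(proc.stdout.strip() or "No broken requirements found.")
```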

Second step:

```shell
pip install mmengine==0.10.3
```

Note the bitsandbytes version: if the one already in the environment matches, no install is needed; otherwise it must be reinstalled.

```shell
pip install bitsandbytes-0.37.0+das1.0+gitd3d888f.abi0.dtk2404.torch2.1-py3-none-any.whl
```

Installing bitsandbytes reports:

```
xtuner 0.1.18 requires bitsandbytes>=0.40.0.post4, but you have bitsandbytes 0.37.0+gitd3d888f.abi0.dtk2404.torch2.1 which is incompatible.
```
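The conflict is purely a version-floor check: xtuner pins `bitsandbytes>=0.40.0.post4`, and the DCU wheel's local suffix (`+das1.0+git...`) does not raise its precedence above `0.37.0`. A small sketch (the `base_version` helper is a hypothetical simplification of real version parsing) shows why pip flags it:

```python
def base_version(v: str) -> tuple:
    """Parse the leading numeric parts of a version string, ignoring any
    +local suffix (e.g. '+das1.0+gitd3d888f.abi0.dtk2404.torch2.1')."""
    core = v.split("+", 1)[0]
    parts = []
    for piece in core.split("."):
        digits = "".join(ch for ch in piece if ch.isdigit())
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)

installed = "0.37.0+das1.0+gitd3d888f.abi0.dtk2404.torch2.1"
required_floor = "0.40.0.post4"
# 0.37.0 compares below the 0.40.0.post4 floor, hence pip's warning.
print(base_version(installed) < base_version(required_floor))  # -> True
```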

Third step:

```shell
bash finetune.sh
```

Traceback (an identical traceback is printed by the second rank):

```
Traceback (most recent call last):
  File "/lytest/code/algorithm/xtuner/xtuner/tools/train.py", line 342, in <module>
    main()
  File "/lytest/code/algorithm/xtuner/xtuner/tools/train.py", line 338, in main
    runner.train()
  File "/usr/local/lib/python3.10/site-packages/mmengine/runner/_flexible_runner.py", line 1182, in train
    self.strategy.prepare(
  File "/usr/local/lib/python3.10/site-packages/mmengine/_strategy/deepspeed.py", line 381, in prepare
    model = self.build_model(model)
  File "/usr/local/lib/python3.10/site-packages/mmengine/_strategy/base.py", line 306, in build_model
    model = MODELS.build(model)
  File "/usr/local/lib/python3.10/site-packages/mmengine/registry/registry.py", line 570, in build
    return self.build_func(cfg, *args, **kwargs, registry=self)
  File "/usr/local/lib/python3.10/site-packages/mmengine/registry/build_functions.py", line 232, in build_model_from_cfg
    return build_from_cfg(cfg, registry, default_args)
  File "/usr/local/lib/python3.10/site-packages/mmengine/registry/build_functions.py", line 121, in build_from_cfg
    obj = obj_cls(**args)  # type: ignore
  File "/lytest/code/algorithm/xtuner/xtuner/model/sft.py", line 85, in __init__
    self.llm = self._build_from_cfg_or_module(llm)
  File "/lytest/code/algorithm/xtuner/xtuner/model/sft.py", line 213, in _build_from_cfg_or_module
    return BUILDER.build(cfg_or_mod)
  File "/usr/local/lib/python3.10/site-packages/mmengine/registry/registry.py", line 570, in build
    return self.build_func(cfg, *args, **kwargs, registry=self)
  File "/usr/local/lib/python3.10/site-packages/mmengine/registry/build_functions.py", line 121, in build_from_cfg
    obj = obj_cls(**args)  # type: ignore
  File "/usr/local/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 563, in from_pretrained
    return model_class.from_pretrained(
  File "/usr/local/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3754, in from_pretrained
    ) = cls._load_pretrained_model(
  File "/usr/local/lib/python3.10/site-packages/transformers/modeling_utils.py", line 4194, in _load_pretrained_model
    state_dict = load_state_dict(shard_file, is_quantized=is_quantized)
  File "/usr/local/lib/python3.10/site-packages/transformers/modeling_utils.py", line 508, in load_state_dict
    with safe_open(checkpoint_file, framework="pt") as f:
safetensors_rust.SafetensorError: Error while deserializing header: HeaderTooLarge
Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]
I0617 20:42:42.845207 2913 ProcessGroupNCCL.cpp:874] [Rank 1] Destroyed 1communicators on CUDA device 1
I0617 20:42:42.875689 2912 ProcessGroupNCCL.cpp:874] [Rank 0] Destroyed 1communicators on CUDA device 0
[2024-06-17 20:42:49,678] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 2912) of binary: /usr/local/bin/python3.10
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/torch/distributed/run.py", line 806, in main
    run(args)
  File "/usr/local/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

/lytest/code/algorithm/xtuner/xtuner/tools/train.py FAILED
```
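`HeaderTooLarge` is raised when the first 8 bytes of a `.safetensors` shard, read as a little-endian u64 header length, decode to an implausibly large value. In practice that usually means the shard is not a real safetensors file at all, e.g. a git-lfs pointer stub or a truncated/incomplete download of the llama-3-8b checkpoint, rather than a bug in the training code. A small diagnostic sketch (the shard path and the ~100 MB threshold are illustrative assumptions):

```python
import struct
from pathlib import Path

def declared_header_len(path: str):
    """Return the header length declared in a .safetensors file's first
    8 bytes (little-endian u64), or None if the file is too small."""
    raw = Path(path).read_bytes()[:8]
    if len(raw) < 8:
        return None
    return struct.unpack("<Q", raw)[0]

def looks_corrupt(path: str, limit: int = 100_000_000) -> bool:
    """Heuristic: a real shard's JSON header is far below ~100 MB; a
    git-lfs pointer stub or truncated file decodes to a huge bogus value."""
    n = declared_header_len(path)
    return n is None or n > limit

# Example usage (hypothetical shard name from the checkpoint directory):
# print(looks_corrupt("model-00001-of-00004.safetensors"))
```

If a shard fails this check, re-downloading the checkpoint (and running `git lfs pull` if it was obtained via `git clone`) is the usual fix.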
