ModelZoo / ChatGLM-6B_pytorch / Issues / #2

Closed
Open
Created Sep 05, 2023 by Sugon_ldc (@Sugon_ldc, Developer)

Training fails: the 22.10 release recommended in the README has a DeepSpeed version mismatch, and 23.04 also raises multiple errors during training

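dtk 22.10.1 has the following problem: [image]

After switching to dtk 23.04 / torch 1.13.1 / deepspeed 0.9.*, the errors shown in the screenshots appear. Specifically, the problem shows up after the 4th epoch. The full error output is attached at the end. [image] [image]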

Traceback (most recent call last):
  File "/public/home/acq6p89156/mytest2/chatglm-v1.0/ptuning/main.py", line 432, in <module>
    main()
  File "/public/home/acq6p89156/mytest2/chatglm-v1.0/ptuning/main.py", line 371, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/public/home/acq6p89156/mytest2/chatglm-v1.0/ptuning/trainer.py", line 1635, in train
    return inner_training_loop(
  File "/public/home/acq6p89156/mytest2/chatglm-v1.0/ptuning/trainer.py", line 1704, in _inner_training_loop
    deepspeed_engine, optimizer, lr_scheduler = deepspeed_init(
  File "/public/home/acq6p89156/.local/lib/python3.9/site-packages/transformers/deepspeed.py", line 378, in deepspeed_init
    deepspeed_engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)
  File "/usr/local/lib/python3.9/site-packages/deepspeed/__init__.py", line 167, in initialize
    engine = DeepSpeedEngine(args=args,
  File "/usr/local/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 308, in __init__
    self._configure_optimizer(optimizer, model_parameters)
  File "/usr/local/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1160, in _configure_optimizer
    raise ZeRORuntimeException(msg)
deepspeed.runtime.zero.utils.ZeRORuntimeException: You are using ZeRO-Offload with a client provided optimizer (<class 'transformers.optimization.AdamW'>) which in most cases will yield poor performance. Please either use deepspeed.ops.adam.DeepSpeedCPUAdam or set an optimizer in your ds-config (https://www.deepspeed.ai/docs/config-json/#optimizer-parameters). If you really want to use a custom optimizer w. ZeRO-Offload and understand the performance impacts you can also set <"zero_force_ds_cpu_optimizer": false> in your configuration file.
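The exception text itself names one possible workaround: setting "zero_force_ds_cpu_optimizer": false in the DeepSpeed configuration file. Below is a minimal Python sketch of that change, not a confirmed fix for this issue. The contents of deepspeed.json are not shown in this report, so `example_config` is a hypothetical stand-in, and the helper name `allow_client_optimizer` is made up for illustration. Placing the flag at the top level of the config is an assumption about where DeepSpeed reads it.

```python
import json
from copy import deepcopy


def allow_client_optimizer(ds_config: dict) -> dict:
    """Return a copy of a DeepSpeed config with the override named in the error message.

    Assumption: the flag lives at the top level of the config dict.
    """
    patched = deepcopy(ds_config)  # leave the caller's dict untouched
    patched["zero_force_ds_cpu_optimizer"] = False
    return patched


# Hypothetical stand-in for deepspeed.json; the real file is not shown in the issue.
example_config = {
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {"device": "cpu"},
    },
    "fp16": {"enabled": True},
}

patched_config = allow_client_optimizer(example_config)
print(json.dumps(patched_config, indent=2))
```

To apply this to a real file, one would load it with `json.load`, pass the dict through the helper, and write the result back with `json.dump`. The error message also lists two alternatives that avoid the performance penalty it warns about: using deepspeed.ops.adam.DeepSpeedCPUAdam, or declaring the optimizer in the ds-config instead of passing transformers' AdamW from the client side.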

[2023-09-05 11:57:57,816] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 6974
[2023-09-05 11:57:57,818] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 6977
[2023-09-05 11:57:57,818] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 6980
[2023-09-05 11:57:57,819] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 6983
[2023-09-05 11:57:57,819] [ERROR] [launch.py:434:sigkill_handler] ['/usr/local/bin/python3.9', '-u', 'main.py', '--local_rank=3', '--deepspeed', 'deepspeed.json', '--do_train', '--train_file', 'AdvertiseGen/train.json', '--test_file', 'AdvertiseGen/dev.json', '--prompt_column', 'content', '--response_column', 'summary', '--overwrite_cache', '--model_name_or_path', '/public/home/acq6p89156/mytest2/THUDM/chatglm-6b', '--output_dir', './output_pt/adgen-chatglm-6b-pt-4c-5e-3', '--overwrite_output_dir', '--max_source_length', '64', '--max_target_length', '64', '--per_device_train_batch_size', '4', '--per_device_eval_batch_size', '1', '--gradient_accumulation_steps', '1', '--predict_with_generate', '--max_steps', '3000', '--logging_steps', '10', '--save_steps', '1000', '--learning_rate', '5e-3', '--pre_seq_len', '128', '--fp16'] exits with return code = 1

Edited Sep 05, 2023 by Sugon_ldc