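The program cannot be trained. The 22.10 version recommended in the README has a DeepSpeed version mismatch, and 23.04 also produces a variety of errors during training.
After switching to dtk 23.04 / torch 1.13.1 / deepspeed 0.9.*, the error shown in the screenshot appears; specifically, the problem shows up after the 4th epoch. The full error output is attached at the end.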
Traceback (most recent call last):
  File "/public/home/acq6p89156/mytest2/chatglm-v1.0/ptuning/main.py", line 432, in <module>
    main()
  File "/public/home/acq6p89156/mytest2/chatglm-v1.0/ptuning/main.py", line 371, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/public/home/acq6p89156/mytest2/chatglm-v1.0/ptuning/trainer.py", line 1635, in train
    return inner_training_loop(
  File "/public/home/acq6p89156/mytest2/chatglm-v1.0/ptuning/trainer.py", line 1704, in _inner_training_loop
    deepspeed_engine, optimizer, lr_scheduler = deepspeed_init(
  File "/public/home/acq6p89156/.local/lib/python3.9/site-packages/transformers/deepspeed.py", line 378, in deepspeed_init
    deepspeed_engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)
  File "/usr/local/lib/python3.9/site-packages/deepspeed/__init__.py", line 167, in initialize
    engine = DeepSpeedEngine(args=args,
  File "/usr/local/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 308, in __init__
    self._configure_optimizer(optimizer, model_parameters)
  File "/usr/local/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1160, in _configure_optimizer
    raise ZeRORuntimeException(msg)
deepspeed.runtime.zero.utils.ZeRORuntimeException: You are using ZeRO-Offload with a client provided optimizer (<class 'transformers.optimization.AdamW'>) which in most cases will yield poor performance. Please either use deepspeed.ops.adam.DeepSpeedCPUAdam or set an optimizer in your ds-config (https://www.deepspeed.ai/docs/config-json/#optimizer-parameters). If you really want to use a custom optimizer w. ZeRO-Offload and understand the performance impacts you can also set <"zero_force_ds_cpu_optimizer": false> in your configuration file.
[2023-09-05 11:57:57,816] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 6974
[2023-09-05 11:57:57,818] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 6977
[2023-09-05 11:57:57,818] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 6980
[2023-09-05 11:57:57,819] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 6983
[2023-09-05 11:57:57,819] [ERROR] [launch.py:434:sigkill_handler] ['/usr/local/bin/python3.9', '-u', 'main.py', '--local_rank=3', '--deepspeed', 'deepspeed.json', '--do_train', '--train_file', 'AdvertiseGen/train.json', '--test_file', 'AdvertiseGen/dev.json', '--prompt_column', 'content', '--response_column', 'summary', '--overwrite_cache', '--model_name_or_path', '/public/home/acq6p89156/mytest2/THUDM/chatglm-6b', '--output_dir', './output_pt/adgen-chatglm-6b-pt-4c-5e-3', '--overwrite_output_dir', '--max_source_length', '64', '--max_target_length', '64', '--per_device_train_batch_size', '4', '--per_device_eval_batch_size', '1', '--gradient_accumulation_steps', '1', '--predict_with_generate', '--max_steps', '3000', '--logging_steps', '10', '--save_steps', '1000', '--learning_rate', '5e-3', '--pre_seq_len', '128', '--fp16'] exits with return code = 1
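
Going by the hint in the exception message itself, one possible workaround might be to set "zero_force_ds_cpu_optimizer" to false in the deepspeed.json passed via --deepspeed. Below is a minimal sketch of that change, not a confirmed fix: it has not been tried on dtk 23.04, the helper names allow_client_optimizer and patch_config_file are hypothetical, the sample config is made up for illustration (the real deepspeed.json is not shown here), and placing the key at the top level of the config is an assumption based on the wording of the message.

```python
import json
from pathlib import Path


def allow_client_optimizer(config: dict) -> dict:
    """Return a copy of a DeepSpeed config dict with the flag named in the
    exception message set to False.

    Assumption: the key lives at the top level of the config, as the
    exception text suggests. The input dict is left unmodified.
    """
    patched = dict(config)
    patched["zero_force_ds_cpu_optimizer"] = False
    return patched


def patch_config_file(path: str) -> None:
    """Rewrite the JSON config at `path` in place with the flag added."""
    config_path = Path(path)
    config = json.loads(config_path.read_text(encoding="utf-8"))
    patched = allow_client_optimizer(config)
    config_path.write_text(json.dumps(patched, indent=2), encoding="utf-8")


# Hypothetical config for illustration only; the real deepspeed.json may differ.
example_config = {
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {"device": "cpu"},
    },
    "fp16": {"enabled": True},
}

print(json.dumps(allow_client_optimizer(example_config), indent=2))
```

To apply it to the actual file, patch_config_file("deepspeed.json") could be run once from the ptuning directory before relaunching. The exception message also lists the alternatives of using deepspeed.ops.adam.DeepSpeedCPUAdam or defining an optimizer in the ds-config, which it suggests should perform better than keeping the client-provided AdamW.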
