[root@test fine-tune]# ./lora_train.sh
Setting ds_accelerator to cuda (auto detect)
[2024-07-10 04:29:55,792] [WARNING] [runner.py:196:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2024-07-10 04:29:57,028] [INFO] [runner.py:555:main] cmd = /usr/local/bin/python3.8 -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgMywgNCwgNSwgNiwgN119 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None fine-tune.py --report_to none --data_path data/belle_chat_ramdon_10k.json --model_name_or_path ../../Baichuan2-7B-Chat --output_dir output --model_max_length 64 --num_train_epochs 4 --per_device_train_batch_size 2 --gradient_accumulation_steps 1 --save_strategy epoch --learning_rate 2e-5 --lr_scheduler_type constant --adam_beta1 0.9 --adam_beta2 0.98 --adam_epsilon 1e-8 --max_grad_norm 1.0 --weight_decay 1e-4 --warmup_ratio 0.0 --logging_steps 1 --gradient_checkpointing True --deepspeed ds_config.json --fp16 --use_lora True
Setting ds_accelerator to cuda (auto detect)
[2024-07-10 04:30:01,384] [INFO] [launch.py:145:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]}
[2024-07-10 04:30:01,384] [INFO] [launch.py:151:main] nnodes=1, num_local_procs=8, node_rank=0
[2024-07-10 04:30:01,384] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]})
[2024-07-10 04:30:01,385] [INFO] [launch.py:163:main] dist_world_size=8
[2024-07-10 04:30:01,385] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
Setting ds_accelerator to cuda (auto detect)
[... "Setting ds_accelerator" repeated once per rank, 7 duplicate lines elided ...]
[2024-07-10 04:30:07,701] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2024-07-10 04:30:07,701] [INFO] [comm.py:594:init_distributed] cdb=None
[2024-07-10 04:30:07,703] [INFO] [comm.py:625:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[... the same init_deepspeed_backend warning and cdb=None line from the remaining ranks elided ...]
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0710 04:30:07.736606 4046 ProcessGroupNCCL.cpp:669] [Rank 1] ProcessGroupNCCL initialized with following options: NCCL_ASYNC_ERROR_HANDLING: 0 NCCL_DESYNC_DEBUG: 0 NCCL_BLOCKING_WAIT: 0 TIMEOUT(ms): 1800000 USE_HIGH_PRIORITY_STREAM: 0
I0710 04:30:07.736632 4621 ProcessGroupNCCL.cpp:835] [Rank 1] NCCL watchdog thread started!
[... identical ProcessGroupNCCL initialization and watchdog messages for ranks 0 and 2-7 (04:30:07.737 through 04:30:08.710) elided ...]
Xformers is not installed correctly.
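A side note on the launch command above: the `--world_info` argument passed to `deepspeed.launcher.launch` is not opaque, it is base64-encoded JSON mapping each host to its local GPU slots, and decoding it with the standard library reproduces the `WORLD INFO DICT` line the launcher prints:

```python
import base64
import json

# --world_info exactly as it appears in the launcher command line above
world_info = "eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgMywgNCwgNSwgNiwgN119"

decoded = json.loads(base64.b64decode(world_info))
print(decoded)  # {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]}, matching WORLD INFO DICT
```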
If you want to use memory_efficient_attention to accelerate training use the following command to install Xformers
pip install xformers.
[... the same Xformers warning printed once per rank, duplicates elided ...]
I0710 04:30:17.775079 4043 ProcessGroupNCCL.cpp:1274] NCCL_DEBUG: N/A
[2024-07-10 04:30:40,706] [INFO] [partition_parameters.py:454:__exit__] finished initializing model with 7.51B parameters
trainable params: 524,288 || all params: 7,506,497,536 || trainable%: 0.006984455766295744
[... the trainable-params summary printed once per rank, duplicates elided ...]
Token indices sequence length is longer than the specified maximum sequence length for this model (159 > 64). Running this sequence through the model will result in indexing errors
input: 写一篇关于气候变化对海洋生态的影响的文章。 好的,以下是你要求的文章: 气候变化对海洋生态的影响 气候变化对海洋生态产生了广泛的影响。全球变暖导致海洋温度上升,这可能会对许多水生生物的生存、繁殖和迁移造成巨大的影响。另外,海洋酸化也是一个问题
label: 好的,以下是你要求的文章: 气候变化对海洋生态的影响 气候变化对海洋生态产生了广泛的影响。全球变暖导致海洋温度上升,这可能会对许多水生生物的生存、繁殖和迁移造成巨大的影响。另外,海洋酸化也是一个问题
[... the same length warning and input/label dump printed once per rank, duplicates elided ...]
I0710 04:31:25.425884 5748 ProcessGroupNCCL.cpp:835] [Rank 4] NCCL watchdog thread started!
I0710 04:31:25.425859 4055 ProcessGroupNCCL.cpp:669] [Rank 4] ProcessGroupNCCL initialized with following options: NCCL_ASYNC_ERROR_HANDLING: 0 NCCL_DESYNC_DEBUG: 0 NCCL_BLOCKING_WAIT: 0 TIMEOUT(ms): 1800000 USE_HIGH_PRIORITY_STREAM: 0
[... a second round of ProcessGroupNCCL initialization for the remaining ranks (04:31:25.443 through 04:31:27.211) elided ...]
Using /root/.cache/torch_extensions/py38_cpu as PyTorch extensions root...
[... the same extensions-root line printed once per rank, duplicates elided ...]
Creating extension directory /root/.cache/torch_extensions/py38_cpu/utils...
Emitting ninja build file /root/.cache/torch_extensions/py38_cpu/utils/build.ninja...
Building extension module utils...
Allowing ninja to set a default number of workers...
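Two numbers in this stretch of the log are worth sanity-checking. The `trainable params` line confirms LoRA is training only a tiny adapter (524,288 of 7.5B weights), and the tokenizer warning fires because the echoed sample (a Chinese prompt asking for, and an answer giving, a short article on how climate change affects marine ecosystems) is 159 tokens while the run was launched with `--model_max_length 64`. A quick check, using only values copied from the log:

```python
# Values copied verbatim from the trainable-params line above.
trainable, total = 524_288, 7_506_497_536
pct = 100 * trainable / total
print(pct)  # ~0.006984455766295744, the trainable% printed by peft

# The dataset sample tokenizes to 159 tokens, but fine-tune.py was
# launched with --model_max_length 64, hence the "(159 > 64)" warning:
# the sequence exceeds the configured maximum and must be truncated.
sample_len, model_max_length = 159, 64
print(sample_len > model_max_length)  # True
```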
(overridable by setting the environment variable MAX_JOBS=N)
[1/2] c++ -MMD -MF flatten_unflatten.o.d -DTORCH_EXTENSION_NAME=utils -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /usr/local/lib/python3.8/site-packages/torch/include -isystem /usr/local/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/lib/python3.8/site-packages/torch/include/TH -isystem /usr/local/lib/python3.8/site-packages/torch/include/THC -isystem /usr/local/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -c /usr/local/lib/python3.8/site-packages/deepspeed/ops/csrc/utils/flatten_unflatten.cpp -o flatten_unflatten.o -fPIC -D__HIP_PLATFORM_HCC__=1 -DUSE_ROCM=1
In file included from /opt/rh/devtoolset-7/root/usr/include/c++/7/ext/hash_set:60:0,
                 from /usr/local/lib/python3.8/site-packages/torch/include/glog/stl_logging.h:54,
                 from /usr/local/lib/python3.8/site-packages/torch/include/c10/util/logging_is_google_glog.h:20,
                 from /usr/local/lib/python3.8/site-packages/torch/include/c10/util/Logging.h:26,
                 from /usr/local/lib/python3.8/site-packages/torch/include/c10/core/TensorImpl.h:18,
                 from /usr/local/lib/python3.8/site-packages/torch/include/c10/core/GeneratorImpl.h:12,
                 from /usr/local/lib/python3.8/site-packages/torch/include/ATen/core/Generator.h:22,
                 from /usr/local/lib/python3.8/site-packages/torch/include/ATen/CPUGeneratorImpl.h:3,
                 from /usr/local/lib/python3.8/site-packages/torch/include/ATen/Context.h:3,
                 from /usr/local/lib/python3.8/site-packages/torch/include/ATen/ATen.h:7,
                 from /usr/local/lib/python3.8/site-packages/torch/include/torch/csrc/utils/tensor_flatten.h:3,
                 from /usr/local/lib/python3.8/site-packages/deepspeed/ops/csrc/utils/flatten_unflatten.cpp:11:
/opt/rh/devtoolset-7/root/usr/include/c++/7/backward/backward_warning.h:32:2: warning: #warning This file includes at least one deprecated or antiquated header which may be removed without further notice at a future date. Please use a non-deprecated interface with equivalent functionality instead. For a listing of replacement headers and interfaces, consult the file backward_warning.h. To disable this warning use -Wno-deprecated. [-Wcpp]
 #warning \
 ^~~~~~~
[2/2] c++ flatten_unflatten.o -shared -L/usr/local/lib/python3.8/site-packages/torch/lib -lc10 -ltorch_cpu -ltorch -ltorch_python -o utils.so
Loading extension module utils...
Time to load utils op: 20.42320728302002 seconds
[... all 8 ranks wait on the same build; seven more load times between 20.33 and 20.43 seconds elided ...]
Parameter Offload: Total persistent parameters: 790528 in 129 params
Using /root/.cache/torch_extensions/py38_cpu as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.0008893013000488281 seconds
[... cached reloads on the remaining ranks, each under 1 ms, elided ...]
Loading extension module utils...
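Only the first load actually compiles anything: ninja builds `utils.so` once (about 20 s), and every later `Time to load utils op` line reports a rank finding the unchanged artifact in the `/root/.cache/torch_extensions/py38_cpu` cache ("No modifications detected ... skipping build step"). The gap is easy to quantify from the logged timings (a sanity check on the log values, not part of the run):

```python
# Timings copied from the log: one rank compiles with ninja, the rest
# reload the cached utils.so without rebuilding.
first_build = 20.42320728302002        # seconds, includes the c++ build
cached_reload = 0.0008893013000488281  # seconds, cache hit on a later load

print(f"cached reload is ~{first_build / cached_reload:,.0f}x faster")
```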
Time to load utils op: 0.0005936622619628906 seconds
  0%|          | 0/2500 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "fine-tune.py", line 159, in <module>
    train()
  File "fine-tune.py", line 153, in train
    trainer.train()
  File "/usr/local/lib/python3.8/site-packages/transformers/trainer.py", line 1539, in train
    return inner_training_loop(
  File "/usr/local/lib/python3.8/site-packages/transformers/trainer.py", line 1809, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/usr/local/lib/python3.8/site-packages/transformers/trainer.py", line 2654, in training_step
    loss = self.compute_loss(model, inputs)
  File "/usr/local/lib/python3.8/site-packages/transformers/trainer.py", line 2679, in compute_loss
    outputs = model(**inputs)
  File "/usr/local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1736, in forward
    loss = self.module(*inputs, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl
    result = forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/peft/peft_model.py", line 922, in forward
    return self.base_model(
  File "/usr/local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl
    result = forward_call(*input, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/Baichuan2-7B-Chat/modeling_baichuan.py", line 686, in forward
    outputs = self.model(
  File "/usr/local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl
    result = forward_call(*input, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/Baichuan2-7B-Chat/modeling_baichuan.py", line 453, in forward
    layer_outputs = torch.utils.checkpoint.checkpoint(
  File "/usr/local/lib/python3.8/site-packages/torch/utils/checkpoint.py", line 249, in checkpoint
    return CheckpointFunction.apply(function, preserve, *args)
  File "/usr/local/lib/python3.8/site-packages/torch/utils/checkpoint.py", line 107, in forward
    outputs = run_function(*args)
  File "/root/.cache/huggingface/modules/transformers_modules/Baichuan2-7B-Chat/modeling_baichuan.py", line 449, in custom_forward
    return module(*inputs, output_attentions, None)
  File "/usr/local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl
    result = forward_call(*input, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/Baichuan2-7B-Chat/modeling_baichuan.py", line 273, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/usr/local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl
    result = forward_call(*input, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/Baichuan2-7B-Chat/modeling_baichuan.py", line 205, in forward
    proj = self.W_pack(hidden_states)
  File "/usr/local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl
    result = forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/peft/tuners/lora.py", line 817, in forward
    result = F.linear(x, transpose(self.weight, self.fan_in_fan_out), bias=self.bias)
  File "/usr/local/lib/python3.8/site-packages/deepspeed/runtime/zero/linear.py", line 106, in zero3_linear_wrap
    return LinearFunctionForZeroStage3.apply(input, weight)
  File "/usr/local/lib/python3.8/site-packages/torch/cuda/amp/autocast_mode.py", line 97, in decorate_fwd
    return fwd(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/deepspeed/runtime/zero/linear.py", line 57, in forward
    output = input.matmul(weight.t())
RuntimeError: CUDA error: rocblas_status_internal_error when calling `rocblas_gemm_ex( handle, opa, opb, m, n, k, &falpha, a, rocblas_datatype_f16_r, lda, b, rocblas_datatype_f16_r, ldb, &fbeta, c, rocblas_datatype_f16_r, ldc, c, rocblas_datatype_f16_r, ldc, rocblas_datatype_f16_r, rocblas_gemm_algo_standard, 0, flag)`
rocBLAS error: Tensile solution found, but exception thrown for { a_type: "f16_r", b_type: "f16_r", c_type: "f16_r", d_type: "f16_r", compute_type: "f16_r", transA: 'T', transB: 'N', M: 12288, N: 128, K: 4096, alpha: 1, row_stride_a: 1, col_stride_a: 4096, row_stride_b: 1, col_stride_b: 4096, row_stride_c: 1, col_stride_c: 12288, row_stride_d: 1, col_stride_d: 12288, beta: 0, batch_count: 1, strided_batch: true, stride_a: 1, stride_b: 1, stride_c: 1, stride_d: 1, atomics_mode: atomics_allowed }
Kernel Cijk_Alik_Bljk_HB_MT64x32x16_SN_APM1_AF0EM2_AF1EM1_AMAS3_ASAE01_ASCE01_ASEM2_BL1_DTL0_ETSP_EPS1_FL0_GRVW4_GSU1_GSUAMB_ISA906_IU1_K1_KLA_LPA0_LPB0_LDL1_LRVW4_MAC_MDA2_NLCA1_NLCB1_ONLL1_PK0_PGR1_PLR1_RK0_SU32_SUM0_SUS256_SVW4_SNLL0_TT4_4_USFGRO0_VAW2_VS1_VW4_WG16_8_1_WGM1 not found in any loaded module. This message will be only be displayed once, unless the ROCBLAS_VERBOSE_TENSILE_ERROR environment variable is set.
[... the identical rocBLAS error and traceback repeated by the remaining ranks; log truncated mid-traceback ...]
1736, in forward loss = self.module(*inputs, **kwargs) File "/usr/local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl result = forward_call(*input, **kwargs) File "/usr/local/lib/python3.8/site-packages/peft/peft_model.py", line 922, in forward return self.base_model( File "/usr/local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl result = forward_call(*input, **kwargs) File "/root/.cache/huggingface/modules/transformers_modules/Baichuan2-7B-Chat/modeling_baichuan.py", line 686, in forward outputs = self.model( File "/usr/local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl result = forward_call(*input, **kwargs) File "/root/.cache/huggingface/modules/transformers_modules/Baichuan2-7B-Chat/modeling_baichuan.py", line 453, in forward layer_outputs = torch.utils.checkpoint.checkpoint( File "/usr/local/lib/python3.8/site-packages/torch/utils/checkpoint.py", line 249, in checkpoint return CheckpointFunction.apply(function, preserve, *args) File "/usr/local/lib/python3.8/site-packages/torch/utils/checkpoint.py", line 107, in forward outputs = run_function(*args) File "/root/.cache/huggingface/modules/transformers_modules/Baichuan2-7B-Chat/modeling_baichuan.py", line 449, in custom_forward return module(*inputs, output_attentions, None) File "/usr/local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl result = forward_call(*input, **kwargs) File "/root/.cache/huggingface/modules/transformers_modules/Baichuan2-7B-Chat/modeling_baichuan.py", line 273, in forward hidden_states, self_attn_weights, present_key_value = self.self_attn( File "/usr/local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl result = forward_call(*input, **kwargs) File "/root/.cache/huggingface/modules/transformers_modules/Baichuan2-7B-Chat/modeling_baichuan.py", line 205, in forward proj = self.W_pack(hidden_states) File 
"/usr/local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl result = forward_call(*input, **kwargs) File "/usr/local/lib/python3.8/site-packages/peft/tuners/lora.py", line 817, in forward result = F.linear(x, transpose(self.weight, self.fan_in_fan_out), bias=self.bias) File "/usr/local/lib/python3.8/site-packages/deepspeed/runtime/zero/linear.py", line 106, in zero3_linear_wrap return LinearFunctionForZeroStage3.apply(input, weight) File "/usr/local/lib/python3.8/site-packages/torch/cuda/amp/autocast_mode.py", line 97, in decorate_fwd return fwd(*args, **kwargs) File "/usr/local/lib/python3.8/site-packages/deepspeed/runtime/zero/linear.py", line 57, in forward output = input.matmul(weight.t()) RuntimeError: CUDA error: rocblas_status_internal_error when calling `rocblas_gemm_ex( handle, opa, opb, m, n, k, &falpha, a, rocblas_datatype_f16_r, lda, b, rocblas_datatype_f16_r, ldb, &fbeta, c, rocblas_datatype_f16_r, ldc, c, rocblas_datatype_f16_r, ldc, rocblas_datatype_f16_r, rocblas_gemm_algo_standard, 0, flag)` rocBLAS error: Tensile solution found, but exception thrown for { a_type: "f16_r", b_type: "f16_r", c_type: "f16_r", d_type: "f16_r", compute_type: "f16_r", transA: 'T', transB: 'N', M: 12288, N: 128, K: 4096, alpha: 1, row_stride_a: 1, col_stride_a: 4096, row_stride_b: 1, col_stride_b: 4096, row_stride_c: 1, col_stride_c: 12288, row_stride_d: 1, col_stride_d: 12288, beta: 0, batch_count: 1, strided_batch: true, stride_a: 1, stride_b: 1, stride_c: 1, stride_d: 1, atomics_mode: atomics_allowed } Kernel Cijk_Alik_Bljk_HB_MT64x32x16_SN_APM1_AF0EM2_AF1EM1_AMAS3_ASAE01_ASCE01_ASEM2_BL1_DTL0_ETSP_EPS1_FL0_GRVW4_GSU1_GSUAMB_ISA906_IU1_K1_KLA_LPA0_LPB0_LDL1_LRVW4_MAC_MDA2_NLCA1_NLCB1_ONLL1_PK0_PGR1_PLR1_RK0_SU32_SUM0_SUS256_SVW4_SNLL0_TT4_4_USFGRO0_VAW2_VS1_VW4_WG16_8_1_WGM1 not found in any loaded module. This message will be only be displayed once, unless the ROCBLAS_VERBOSE_TENSILE_ERROR environment variable is set. 
Traceback (most recent call last): File "fine-tune.py", line 159, in train() File "fine-tune.py", line 153, in train trainer.train() File "/usr/local/lib/python3.8/site-packages/transformers/trainer.py", line 1539, in train return inner_training_loop( File "/usr/local/lib/python3.8/site-packages/transformers/trainer.py", line 1809, in _inner_training_loop tr_loss_step = self.training_step(model, inputs) File "/usr/local/lib/python3.8/site-packages/transformers/trainer.py", line 2654, in training_step loss = self.compute_loss(model, inputs) File "/usr/local/lib/python3.8/site-packages/transformers/trainer.py", line 2679, in compute_loss outputs = model(**inputs) File "/usr/local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl return forward_call(*input, **kwargs) File "/usr/local/lib/python3.8/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn ret_val = func(*args, **kwargs) File "/usr/local/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1736, in forward loss = self.module(*inputs, **kwargs) File "/usr/local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl result = forward_call(*input, **kwargs) File "/usr/local/lib/python3.8/site-packages/peft/peft_model.py", line 922, in forward return self.base_model( File "/usr/local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl result = forward_call(*input, **kwargs) File "/root/.cache/huggingface/modules/transformers_modules/Baichuan2-7B-Chat/modeling_baichuan.py", line 686, in forward outputs = self.model( File "/usr/local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl result = forward_call(*input, **kwargs) File "/root/.cache/huggingface/modules/transformers_modules/Baichuan2-7B-Chat/modeling_baichuan.py", line 453, in forward layer_outputs = torch.utils.checkpoint.checkpoint( File "/usr/local/lib/python3.8/site-packages/torch/utils/checkpoint.py", line 249, in 
checkpoint return CheckpointFunction.apply(function, preserve, *args) File "/usr/local/lib/python3.8/site-packages/torch/utils/checkpoint.py", line 107, in forward outputs = run_function(*args) File "/root/.cache/huggingface/modules/transformers_modules/Baichuan2-7B-Chat/modeling_baichuan.py", line 449, in custom_forward return module(*inputs, output_attentions, None) File "/usr/local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl result = forward_call(*input, **kwargs) File "/root/.cache/huggingface/modules/transformers_modules/Baichuan2-7B-Chat/modeling_baichuan.py", line 273, in forward hidden_states, self_attn_weights, present_key_value = self.self_attn( File "/usr/local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl result = forward_call(*input, **kwargs) File "/root/.cache/huggingface/modules/transformers_modules/Baichuan2-7B-Chat/modeling_baichuan.py", line 205, in forward proj = self.W_pack(hidden_states) File "/usr/local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl result = forward_call(*input, **kwargs) File "/usr/local/lib/python3.8/site-packages/peft/tuners/lora.py", line 817, in forward result = F.linear(x, transpose(self.weight, self.fan_in_fan_out), bias=self.bias) File "/usr/local/lib/python3.8/site-packages/deepspeed/runtime/zero/linear.py", line 106, in zero3_linear_wrap return LinearFunctionForZeroStage3.apply(input, weight) File "/usr/local/lib/python3.8/site-packages/torch/cuda/amp/autocast_mode.py", line 97, in decorate_fwd return fwd(*args, **kwargs) File "/usr/local/lib/python3.8/site-packages/deepspeed/runtime/zero/linear.py", line 57, in forward output = input.matmul(weight.t()) RuntimeError: CUDA error: rocblas_status_internal_error when calling `rocblas_gemm_ex( handle, opa, opb, m, n, k, &falpha, a, rocblas_datatype_f16_r, lda, b, rocblas_datatype_f16_r, ldb, &fbeta, c, rocblas_datatype_f16_r, ldc, c, rocblas_datatype_f16_r, ldc, 
rocblas_datatype_f16_r, rocblas_gemm_algo_standard, 0, flag)` rocBLAS error: Tensile solution found, but exception thrown for { a_type: "f16_r", b_type: "f16_r", c_type: "f16_r", d_type: "f16_r", compute_type: "f16_r", transA: 'T', transB: 'N', M: 12288, N: 128, K: 4096, alpha: 1, row_stride_a: 1, col_stride_a: 4096, row_stride_b: 1, col_stride_b: 4096, row_stride_c: 1, col_stride_c: 12288, row_stride_d: 1, col_stride_d: 12288, beta: 0, batch_count: 1, strided_batch: true, stride_a: 1, stride_b: 1, stride_c: 1, stride_d: 1, atomics_mode: atomics_allowed } Kernel Cijk_Alik_Bljk_HB_MT64x32x16_SN_APM1_AF0EM2_AF1EM1_AMAS3_ASAE01_ASCE01_ASEM2_BL1_DTL0_ETSP_EPS1_FL0_GRVW4_GSU1_GSUAMB_ISA906_IU1_K1_KLA_LPA0_LPB0_LDL1_LRVW4_MAC_MDA2_NLCA1_NLCB1_ONLL1_PK0_PGR1_PLR1_RK0_SU32_SUM0_SUS256_SVW4_SNLL0_TT4_4_USFGRO0_VAW2_VS1_VW4_WG16_8_1_WGM1 not found in any loaded module. This message will be only be displayed once, unless the ROCBLAS_VERBOSE_TENSILE_ERROR environment variable is set. Traceback (most recent call last): File "fine-tune.py", line 159, in train() File "fine-tune.py", line 153, in train trainer.train() File "/usr/local/lib/python3.8/site-packages/transformers/trainer.py", line 1539, in train return inner_training_loop( File "/usr/local/lib/python3.8/site-packages/transformers/trainer.py", line 1809, in _inner_training_loop tr_loss_step = self.training_step(model, inputs) File "/usr/local/lib/python3.8/site-packages/transformers/trainer.py", line 2654, in training_step rocBLAS error: Tensile solution found, but exception thrown for { a_type: "f16_r", b_type: "f16_r", c_type: "f16_r", d_type: "f16_r", compute_type: "f16_r", transA: 'T', transB: 'N', M: 12288, N: 128, K: 4096, alpha: 1, row_stride_a: 1, col_stride_a: 4096, row_stride_b: 1, col_stride_b: 4096, row_stride_c: 1, col_stride_c: 12288, row_stride_d: 1, col_stride_d: 12288, beta: 0, batch_count: 1, strided_batch: true, stride_a: 1, stride_b: 1, stride_c: 1, stride_d: 1, atomics_mode: atomics_allowed } Kernel 
Cijk_Alik_Bljk_HB_MT64x32x16_SN_APM1_AF0EM2_AF1EM1_AMAS3_ASAE01_ASCE01_ASEM2_BL1_DTL0_ETSP_EPS1_FL0_GRVW4_GSU1_GSUAMB_ISA906_IU1_K1_KLA_LPA0_LPB0_LDL1_LRVW4_MAC_MDA2_NLCA1_NLCB1_ONLL1_PK0_PGR1_PLR1_RK0_SU32_SUM0_SUS256_SVW4_SNLL0_TT4_4_USFGRO0_VAW2_VS1_VW4_WG16_8_1_WGM1 not found in any loaded module. This message will be only be displayed once, unless the ROCBLAS_VERBOSE_TENSILE_ERROR environment variable is set. loss = self.compute_loss(model, inputs) File "/usr/local/lib/python3.8/site-packages/transformers/trainer.py", line 2679, in compute_loss outputs = model(**inputs) File "/usr/local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl return forward_call(*input, **kwargs) File "/usr/local/lib/python3.8/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn ret_val = func(*args, **kwargs) File "/usr/local/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1736, in forward loss = self.module(*inputs, **kwargs) File "/usr/local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl result = forward_call(*input, **kwargs) File "/usr/local/lib/python3.8/site-packages/peft/peft_model.py", line 922, in forward return self.base_model( File "/usr/local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl result = forward_call(*input, **kwargs) File "/root/.cache/huggingface/modules/transformers_modules/Baichuan2-7B-Chat/modeling_baichuan.py", line 686, in forward outputs = self.model( File "/usr/local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl result = forward_call(*input, **kwargs) File "/root/.cache/huggingface/modules/transformers_modules/Baichuan2-7B-Chat/modeling_baichuan.py", line 453, in forward layer_outputs = torch.utils.checkpoint.checkpoint( File "/usr/local/lib/python3.8/site-packages/torch/utils/checkpoint.py", line 249, in checkpoint return CheckpointFunction.apply(function, preserve, *args) File 
"/usr/local/lib/python3.8/site-packages/torch/utils/checkpoint.py", line 107, in forward outputs = run_function(*args) File "/root/.cache/huggingface/modules/transformers_modules/Baichuan2-7B-Chat/modeling_baichuan.py", line 449, in custom_forward return module(*inputs, output_attentions, None) File "/usr/local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl result = forward_call(*input, **kwargs) File "/root/.cache/huggingface/modules/transformers_modules/Baichuan2-7B-Chat/modeling_baichuan.py", line 273, in forward hidden_states, self_attn_weights, present_key_value = self.self_attn( File "/usr/local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl result = forward_call(*input, **kwargs) File "/root/.cache/huggingface/modules/transformers_modules/Baichuan2-7B-Chat/modeling_baichuan.py", line 205, in forward proj = self.W_pack(hidden_states) File "/usr/local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl result = forward_call(*input, **kwargs) File "/usr/local/lib/python3.8/site-packages/peft/tuners/lora.py", line 817, in forward result = F.linear(x, transpose(self.weight, self.fan_in_fan_out), bias=self.bias) File "/usr/local/lib/python3.8/site-packages/deepspeed/runtime/zero/linear.py", line 106, in zero3_linear_wrap return LinearFunctionForZeroStage3.apply(input, weight) File "/usr/local/lib/python3.8/site-packages/torch/cuda/amp/autocast_mode.py", line 97, in decorate_fwd return fwd(*args, **kwargs) File "/usr/local/lib/python3.8/site-packages/deepspeed/runtime/zero/linear.py", line 57, in forward output = input.matmul(weight.t()) RuntimeError: CUDA error: rocblas_status_internal_error when calling `rocblas_gemm_ex( handle, opa, opb, m, n, k, &falpha, a, rocblas_datatype_f16_r, lda, b, rocblas_datatype_f16_r, ldb, &fbeta, c, rocblas_datatype_f16_r, ldc, c, rocblas_datatype_f16_r, ldc, rocblas_datatype_f16_r, rocblas_gemm_algo_standard, 0, flag)` Traceback 
(most recent call last): File "fine-tune.py", line 159, in train() File "fine-tune.py", line 153, in train trainer.train() File "/usr/local/lib/python3.8/site-packages/transformers/trainer.py", line 1539, in train return inner_training_loop( File "/usr/local/lib/python3.8/site-packages/transformers/trainer.py", line 1809, in _inner_training_loop rocBLAS error: Tensile solution found, but exception thrown for { a_type: "f16_r", b_type: "f16_r", c_type: "f16_r", d_type: "f16_r", compute_type: "f16_r", transA: 'T', transB: 'N', M: 12288, N: 128, K: 4096, alpha: 1, row_stride_a: 1, col_stride_a: 4096, row_stride_b: 1, col_stride_b: 4096, row_stride_c: 1, col_stride_c: 12288, row_stride_d: 1, col_stride_d: 12288, beta: 0, batch_count: 1, strided_batch: true, stride_a: 1, stride_b: 1, stride_c: 1, stride_d: 1, atomics_mode: atomics_allowed } Kernel Cijk_Alik_Bljk_HB_MT64x32x16_SN_APM1_AF0EM2_AF1EM1_AMAS3_ASAE01_ASCE01_ASEM2_BL1_DTL0_ETSP_EPS1_FL0_GRVW4_GSU1_GSUAMB_ISA906_IU1_K1_KLA_LPA0_LPB0_LDL1_LRVW4_MAC_MDA2_NLCA1_NLCB1_ONLL1_PK0_PGR1_PLR1_RK0_SU32_SUM0_SUS256_SVW4_SNLL0_TT4_4_USFGRO0_VAW2_VS1_VW4_WG16_8_1_WGM1 not found in any loaded module. This message will be only be displayed once, unless the ROCBLAS_VERBOSE_TENSILE_ERROR environment variable is set. 
tr_loss_step = self.training_step(model, inputs) File "/usr/local/lib/python3.8/site-packages/transformers/trainer.py", line 2654, in training_step loss = self.compute_loss(model, inputs) File "/usr/local/lib/python3.8/site-packages/transformers/trainer.py", line 2679, in compute_loss outputs = model(**inputs) File "/usr/local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl return forward_call(*input, **kwargs) File "/usr/local/lib/python3.8/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn ret_val = func(*args, **kwargs) File "/usr/local/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1736, in forward loss = self.module(*inputs, **kwargs) File "/usr/local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl result = forward_call(*input, **kwargs) File "/usr/local/lib/python3.8/site-packages/peft/peft_model.py", line 922, in forward return self.base_model( File "/usr/local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl result = forward_call(*input, **kwargs) File "/root/.cache/huggingface/modules/transformers_modules/Baichuan2-7B-Chat/modeling_baichuan.py", line 686, in forward outputs = self.model( File "/usr/local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl result = forward_call(*input, **kwargs) File "/root/.cache/huggingface/modules/transformers_modules/Baichuan2-7B-Chat/modeling_baichuan.py", line 453, in forward layer_outputs = torch.utils.checkpoint.checkpoint( File "/usr/local/lib/python3.8/site-packages/torch/utils/checkpoint.py", line 249, in checkpoint return CheckpointFunction.apply(function, preserve, *args) File "/usr/local/lib/python3.8/site-packages/torch/utils/checkpoint.py", line 107, in forward outputs = run_function(*args) File "/root/.cache/huggingface/modules/transformers_modules/Baichuan2-7B-Chat/modeling_baichuan.py", line 449, in custom_forward return module(*inputs, 
output_attentions, None) File "/usr/local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl result = forward_call(*input, **kwargs) File "/root/.cache/huggingface/modules/transformers_modules/Baichuan2-7B-Chat/modeling_baichuan.py", line 273, in forward hidden_states, self_attn_weights, present_key_value = self.self_attn( File "/usr/local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl result = forward_call(*input, **kwargs) File "/root/.cache/huggingface/modules/transformers_modules/Baichuan2-7B-Chat/modeling_baichuan.py", line 205, in forward proj = self.W_pack(hidden_states) File "/usr/local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl result = forward_call(*input, **kwargs) File "/usr/local/lib/python3.8/site-packages/peft/tuners/lora.py", line 817, in forward result = F.linear(x, transpose(self.weight, self.fan_in_fan_out), bias=self.bias) File "/usr/local/lib/python3.8/site-packages/deepspeed/runtime/zero/linear.py", line 106, in zero3_linear_wrap return LinearFunctionForZeroStage3.apply(input, weight) File "/usr/local/lib/python3.8/site-packages/torch/cuda/amp/autocast_mode.py", line 97, in decorate_fwd return fwd(*args, **kwargs) File "/usr/local/lib/python3.8/site-packages/deepspeed/runtime/zero/linear.py", line 57, in forward output = input.matmul(weight.t()) RuntimeError: CUDA error: rocblas_status_internal_error when calling `rocblas_gemm_ex( handle, opa, opb, m, n, k, &falpha, a, rocblas_datatype_f16_r, lda, b, rocblas_datatype_f16_r, ldb, &fbeta, c, rocblas_datatype_f16_r, ldc, c, rocblas_datatype_f16_r, ldc, rocblas_datatype_f16_r, rocblas_gemm_algo_standard, 0, flag)` Traceback (most recent call last): File "fine-tune.py", line 159, in train() File "fine-tune.py", line 153, in train trainer.train() File "/usr/local/lib/python3.8/site-packages/transformers/trainer.py", line 1539, in train return inner_training_loop( File 
"/usr/local/lib/python3.8/site-packages/transformers/trainer.py", line 1809, in _inner_training_loop tr_loss_step = self.training_step(model, inputs) File "/usr/local/lib/python3.8/site-packages/transformers/trainer.py", line 2654, in training_step loss = self.compute_loss(model, inputs) File "/usr/local/lib/python3.8/site-packages/transformers/trainer.py", line 2679, in compute_loss outputs = model(**inputs) File "/usr/local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl return forward_call(*input, **kwargs) File "/usr/local/lib/python3.8/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn ret_val = func(*args, **kwargs) File "/usr/local/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1736, in forward loss = self.module(*inputs, **kwargs) File "/usr/local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl result = forward_call(*input, **kwargs) File "/usr/local/lib/python3.8/site-packages/peft/peft_model.py", line 922, in forward return self.base_model( File "/usr/local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl result = forward_call(*input, **kwargs) File "/root/.cache/huggingface/modules/transformers_modules/Baichuan2-7B-Chat/modeling_baichuan.py", line 686, in forward outputs = self.model( File "/usr/local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl result = forward_call(*input, **kwargs) File "/root/.cache/huggingface/modules/transformers_modules/Baichuan2-7B-Chat/modeling_baichuan.py", line 453, in forward layer_outputs = torch.utils.checkpoint.checkpoint( File "/usr/local/lib/python3.8/site-packages/torch/utils/checkpoint.py", line 249, in checkpoint return CheckpointFunction.apply(function, preserve, *args) File "/usr/local/lib/python3.8/site-packages/torch/utils/checkpoint.py", line 107, in forward outputs = run_function(*args) File 
"/root/.cache/huggingface/modules/transformers_modules/Baichuan2-7B-Chat/modeling_baichuan.py", line 449, in custom_forward return module(*inputs, output_attentions, None) File "/usr/local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl result = forward_call(*input, **kwargs) File "/root/.cache/huggingface/modules/transformers_modules/Baichuan2-7B-Chat/modeling_baichuan.py", line 273, in forward hidden_states, self_attn_weights, present_key_value = self.self_attn( File "/usr/local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl result = forward_call(*input, **kwargs) File "/root/.cache/huggingface/modules/transformers_modules/Baichuan2-7B-Chat/modeling_baichuan.py", line 205, in forward proj = self.W_pack(hidden_states) File "/usr/local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl result = forward_call(*input, **kwargs) File "/usr/local/lib/python3.8/site-packages/peft/tuners/lora.py", line 817, in forward result = F.linear(x, transpose(self.weight, self.fan_in_fan_out), bias=self.bias) File "/usr/local/lib/python3.8/site-packages/deepspeed/runtime/zero/linear.py", line 106, in zero3_linear_wrap return LinearFunctionForZeroStage3.apply(input, weight) File "/usr/local/lib/python3.8/site-packages/torch/cuda/amp/autocast_mode.py", line 97, in decorate_fwd return fwd(*args, **kwargs) File "/usr/local/lib/python3.8/site-packages/deepspeed/runtime/zero/linear.py", line 57, in forward output = input.matmul(weight.t()) RuntimeError: CUDA error: rocblas_status_internal_error when calling `rocblas_gemm_ex( handle, opa, opb, m, n, k, &falpha, a, rocblas_datatype_f16_r, lda, b, rocblas_datatype_f16_r, ldb, &fbeta, c, rocblas_datatype_f16_r, ldc, c, rocblas_datatype_f16_r, ldc, rocblas_datatype_f16_r, rocblas_gemm_algo_standard, 0, flag)` 0%| | 0/2500 [00:04
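The failing call can be isolated from the whole Trainer/DeepSpeed/PEFT stack with a standalone matmul of the same shape and dtype as the GEMM in the rocBLAS error (M=12288, N=128, K=4096, fp16). A minimal sketch, assuming a PyTorch ROCm build is installed; on a machine without a GPU it falls back to fp32 on CPU so the shapes can still be sanity-checked:

```python
import torch

# Repro of the GEMM that rocBLAS rejected, using the shapes from the error
# above: activations (128, 4096) times W_pack weight (12288, 4096) transposed.
# On ROCm this exercises the same fp16 path as deepspeed's
# runtime/zero/linear.py line 57 (output = input.matmul(weight.t())).
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32  # CPU fallback in fp32

x = torch.randn(128, 4096, dtype=dtype, device=device)    # hidden_states batch
w = torch.randn(12288, 4096, dtype=dtype, device=device)  # packed QKV weight
out = x.matmul(w.t())
if device == "cuda":
    torch.cuda.synchronize()  # force any async rocBLAS error to surface here

print(out.shape)  # torch.Size([128, 12288])
```

If this snippet crashes the same way outside the training script, the problem is the PyTorch/rocBLAS build for this GPU architecture (the kernel name embeds ISA906, i.e. gfx906) rather than anything in fine-tune.py.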