[root@test fine-tune]# ./lora_train.sh 
Setting ds_accelerator to cuda (auto detect)
[2024-07-10 04:29:55,792] [WARNING] [runner.py:196:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2024-07-10 04:29:57,028] [INFO] [runner.py:555:main] cmd = /usr/local/bin/python3.8 -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgMywgNCwgNSwgNiwgN119 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None fine-tune.py --report_to none --data_path data/belle_chat_ramdon_10k.json --model_name_or_path ../../Baichuan2-7B-Chat --output_dir output --model_max_length 64 --num_train_epochs 4 --per_device_train_batch_size 2 --gradient_accumulation_steps 1 --save_strategy epoch --learning_rate 2e-5 --lr_scheduler_type constant --adam_beta1 0.9 --adam_beta2 0.98 --adam_epsilon 1e-8 --max_grad_norm 1.0 --weight_decay 1e-4 --warmup_ratio 0.0 --logging_steps 1 --gradient_checkpointing True --deepspeed ds_config.json --fp16 --use_lora True
Setting ds_accelerator to cuda (auto detect)
[2024-07-10 04:30:01,384] [INFO] [launch.py:145:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]}
[2024-07-10 04:30:01,384] [INFO] [launch.py:151:main] nnodes=1, num_local_procs=8, node_rank=0
[2024-07-10 04:30:01,384] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]})
[2024-07-10 04:30:01,385] [INFO] [launch.py:163:main] dist_world_size=8
[2024-07-10 04:30:01,385] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
Setting ds_accelerator to cuda (auto detect)
Setting ds_accelerator to cuda (auto detect)
Setting ds_accelerator to cuda (auto detect)
Setting ds_accelerator to cuda (auto detect)
Setting ds_accelerator to cuda (auto detect)
Setting ds_accelerator to cuda (auto detect)
Setting ds_accelerator to cuda (auto detect)
Setting ds_accelerator to cuda (auto detect)
[2024-07-10 04:30:07,701] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2024-07-10 04:30:07,701] [INFO] [comm.py:594:init_distributed] cdb=None
[2024-07-10 04:30:07,703] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2024-07-10 04:30:07,703] [INFO] [comm.py:594:init_distributed] cdb=None
[2024-07-10 04:30:07,703] [INFO] [comm.py:625:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2024-07-10 04:30:07,735] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2024-07-10 04:30:07,735] [INFO] [comm.py:594:init_distributed] cdb=None
[2024-07-10 04:30:07,736] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0710 04:30:07.736606  4046 ProcessGroupNCCL.cpp:669] [Rank 1] ProcessGroupNCCL initialized with following options:
NCCL_ASYNC_ERROR_HANDLING: 0
NCCL_DESYNC_DEBUG: 0
NCCL_BLOCKING_WAIT: 0
TIMEOUT(ms): 1800000
USE_HIGH_PRIORITY_STREAM: 0
[2024-07-10 04:30:07,736] [INFO] [comm.py:594:init_distributed] cdb=None
I0710 04:30:07.736632  4621 ProcessGroupNCCL.cpp:835] [Rank 1] NCCL watchdog thread started!
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0710 04:30:07.737826  4055 ProcessGroupNCCL.cpp:669] [Rank 4] ProcessGroupNCCL initialized with following options:
NCCL_ASYNC_ERROR_HANDLING: 0
NCCL_DESYNC_DEBUG: 0
NCCL_BLOCKING_WAIT: 0
TIMEOUT(ms): 1800000
USE_HIGH_PRIORITY_STREAM: 0
I0710 04:30:07.737860  4623 ProcessGroupNCCL.cpp:835] [Rank 4] NCCL watchdog thread started!
[2024-07-10 04:30:07,756] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2024-07-10 04:30:07,756] [INFO] [comm.py:594:init_distributed] cdb=None
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0710 04:30:07.757907  4052 ProcessGroupNCCL.cpp:669] [Rank 3] ProcessGroupNCCL initialized with following options:
NCCL_ASYNC_ERROR_HANDLING: 0
NCCL_DESYNC_DEBUG: 0
NCCL_BLOCKING_WAIT: 0
TIMEOUT(ms): 1800000
USE_HIGH_PRIORITY_STREAM: 0
I0710 04:30:07.757953  4625 ProcessGroupNCCL.cpp:835] [Rank 3] NCCL watchdog thread started!
[2024-07-10 04:30:07,773] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2024-07-10 04:30:07,773] [INFO] [comm.py:594:init_distributed] cdb=None
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0710 04:30:07.775177  4627 ProcessGroupNCCL.cpp:835] [Rank 2] NCCL watchdog thread started!
I0710 04:30:07.775156  4049 ProcessGroupNCCL.cpp:669] [Rank 2] ProcessGroupNCCL initialized with following options:
NCCL_ASYNC_ERROR_HANDLING: 0
NCCL_DESYNC_DEBUG: 0
NCCL_BLOCKING_WAIT: 0
TIMEOUT(ms): 1800000
USE_HIGH_PRIORITY_STREAM: 0
[2024-07-10 04:30:07,796] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2024-07-10 04:30:07,796] [INFO] [comm.py:594:init_distributed] cdb=None
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0710 04:30:07.797539  4058 ProcessGroupNCCL.cpp:669] [Rank 5] ProcessGroupNCCL initialized with following options:
NCCL_ASYNC_ERROR_HANDLING: 0
NCCL_DESYNC_DEBUG: 0
NCCL_BLOCKING_WAIT: 0
TIMEOUT(ms): 1800000
USE_HIGH_PRIORITY_STREAM: 0
I0710 04:30:07.797577  4629 ProcessGroupNCCL.cpp:835] [Rank 5] NCCL watchdog thread started!
[2024-07-10 04:30:07,935] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2024-07-10 04:30:07,936] [INFO] [comm.py:594:init_distributed] cdb=None
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0710 04:30:07.937068  4631 ProcessGroupNCCL.cpp:835] [Rank 7] NCCL watchdog thread started!
I0710 04:30:07.937042  4064 ProcessGroupNCCL.cpp:669] [Rank 7] ProcessGroupNCCL initialized with following options:
NCCL_ASYNC_ERROR_HANDLING: 0
NCCL_DESYNC_DEBUG: 0
NCCL_BLOCKING_WAIT: 0
TIMEOUT(ms): 1800000
USE_HIGH_PRIORITY_STREAM: 0
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0710 04:30:08.702939  4633 ProcessGroupNCCL.cpp:835] [Rank 6] NCCL watchdog thread started!
I0710 04:30:08.702908  4061 ProcessGroupNCCL.cpp:669] [Rank 6] ProcessGroupNCCL initialized with following options:
NCCL_ASYNC_ERROR_HANDLING: 0
NCCL_DESYNC_DEBUG: 0
NCCL_BLOCKING_WAIT: 0
TIMEOUT(ms): 1800000
USE_HIGH_PRIORITY_STREAM: 0
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0710 04:30:08.709986  4635 ProcessGroupNCCL.cpp:835] [Rank 0] NCCL watchdog thread started!
I0710 04:30:08.709964  4043 ProcessGroupNCCL.cpp:669] [Rank 0] ProcessGroupNCCL initialized with following options:
NCCL_ASYNC_ERROR_HANDLING: 0
NCCL_DESYNC_DEBUG: 0
NCCL_BLOCKING_WAIT: 0
TIMEOUT(ms): 1800000
USE_HIGH_PRIORITY_STREAM: 0
Xformers is not installed correctly. If you want to use memory_efficient_attention to accelerate training use the following command to install Xformers
pip install xformers.
Xformers is not installed correctly. If you want to use memory_efficient_attention to accelerate training use the following command to install Xformers
pip install xformers.
Xformers is not installed correctly. If you want to use memory_efficient_attention to accelerate training use the following command to install Xformers
pip install xformers.
Xformers is not installed correctly. If you want to use memory_efficient_attention to accelerate training use the following command to install Xformers
pip install xformers.
Xformers is not installed correctly. If you want to use memory_efficient_attention to accelerate training use the following command to install Xformers
pip install xformers.
Xformers is not installed correctly. If you want to use memory_efficient_attention to accelerate training use the following command to install Xformers
pip install xformers.
Xformers is not installed correctly. If you want to use memory_efficient_attention to accelerate training use the following command to install Xformers
pip install xformers.
Xformers is not installed correctly. If you want to use memory_efficient_attention to accelerate training use the following command to install Xformers
pip install xformers.
I0710 04:30:17.775079  4043 ProcessGroupNCCL.cpp:1274] NCCL_DEBUG: N/A
[2024-07-10 04:30:40,706] [INFO] [partition_parameters.py:454:__exit__] finished initializing model with 7.51B parameters
trainable params: 524,288 || all params: 7,506,497,536 || trainable%: 0.006984455766295744
trainable params: 524,288 || all params: 7,506,497,536 || trainable%: 0.006984455766295744
Token indices sequence length is longer than the specified maximum sequence length for this model (159 > 64). Running this sequence through the model will result in indexing errors
input: <reserved_106>写一篇关于气候变化对海洋生态的影响的文章。
<reserved_107>好的，以下是你要求的文章：
气候变化对海洋生态的影响
气候变化对海洋生态产生了广泛的影响。全球变暖导致海洋温度上升，这可能会对许多水生生物的生存、繁殖和迁移造成巨大的影响。另外，海洋酸化也是一个问题
label: </s>好的，以下是你要求的文章：
气候变化对海洋生态的影响
气候变化对海洋生态产生了广泛的影响。全球变暖导致海洋温度上升，这可能会对许多水生生物的生存、繁殖和迁移造成巨大的影响。另外，海洋酸化也是一个问题
Token indices sequence length is longer than the specified maximum sequence length for this model (159 > 64). Running this sequence through the model will result in indexing errors
input: <reserved_106>写一篇关于气候变化对海洋生态的影响的文章。
<reserved_107>好的，以下是你要求的文章：
气候变化对海洋生态的影响
气候变化对海洋生态产生了广泛的影响。全球变暖导致海洋温度上升，这可能会对许多水生生物的生存、繁殖和迁移造成巨大的影响。另外，海洋酸化也是一个问题
trainable params: 524,288 || all params: 7,506,497,536 || trainable%: 0.006984455766295744
label: </s>好的，以下是你要求的文章：
气候变化对海洋生态的影响
气候变化对海洋生态产生了广泛的影响。全球变暖导致海洋温度上升，这可能会对许多水生生物的生存、繁殖和迁移造成巨大的影响。另外，海洋酸化也是一个问题
trainable params: 524,288 || all params: 7,506,497,536 || trainable%: 0.006984455766295744
Token indices sequence length is longer than the specified maximum sequence length for this model (159 > 64). Running this sequence through the model will result in indexing errors
input: <reserved_106>写一篇关于气候变化对海洋生态的影响的文章。
<reserved_107>好的，以下是你要求的文章：
气候变化对海洋生态的影响
气候变化对海洋生态产生了广泛的影响。全球变暖导致海洋温度上升，这可能会对许多水生生物的生存、繁殖和迁移造成巨大的影响。另外，海洋酸化也是一个问题
label: </s>好的，以下是你要求的文章：
气候变化对海洋生态的影响
气候变化对海洋生态产生了广泛的影响。全球变暖导致海洋温度上升，这可能会对许多水生生物的生存、繁殖和迁移造成巨大的影响。另外，海洋酸化也是一个问题
Token indices sequence length is longer than the specified maximum sequence length for this model (159 > 64). Running this sequence through the model will result in indexing errors
input: <reserved_106>写一篇关于气候变化对海洋生态的影响的文章。
<reserved_107>好的，以下是你要求的文章：
气候变化对海洋生态的影响
气候变化对海洋生态产生了广泛的影响。全球变暖导致海洋温度上升，这可能会对许多水生生物的生存、繁殖和迁移造成巨大的影响。另外，海洋酸化也是一个问题
label: </s>好的，以下是你要求的文章：
气候变化对海洋生态的影响
气候变化对海洋生态产生了广泛的影响。全球变暖导致海洋温度上升，这可能会对许多水生生物的生存、繁殖和迁移造成巨大的影响。另外，海洋酸化也是一个问题
trainable params: 524,288 || all params: 7,506,497,536 || trainable%: 0.006984455766295744
trainable params: 524,288 || all params: 7,506,497,536 || trainable%: 0.006984455766295744
trainable params: 524,288 || all params: 7,506,497,536 || trainable%: 0.006984455766295744
Token indices sequence length is longer than the specified maximum sequence length for this model (159 > 64). Running this sequence through the model will result in indexing errors
input: <reserved_106>写一篇关于气候变化对海洋生态的影响的文章。
<reserved_107>好的，以下是你要求的文章：
气候变化对海洋生态的影响
气候变化对海洋生态产生了广泛的影响。全球变暖导致海洋温度上升，这可能会对许多水生生物的生存、繁殖和迁移造成巨大的影响。另外，海洋酸化也是一个问题
label: </s>好的，以下是你要求的文章：
气候变化对海洋生态的影响
气候变化对海洋生态产生了广泛的影响。全球变暖导致海洋温度上升，这可能会对许多水生生物的生存、繁殖和迁移造成巨大的影响。另外，海洋酸化也是一个问题
Token indices sequence length is longer than the specified maximum sequence length for this model (159 > 64). Running this sequence through the model will result in indexing errors
input: <reserved_106>写一篇关于气候变化对海洋生态的影响的文章。
<reserved_107>好的，以下是你要求的文章：
气候变化对海洋生态的影响
气候变化对海洋生态产生了广泛的影响。全球变暖导致海洋温度上升，这可能会对许多水生生物的生存、繁殖和迁移造成巨大的影响。另外，海洋酸化也是一个问题
label: </s>好的，以下是你要求的文章：
气候变化对海洋生态的影响
气候变化对海洋生态产生了广泛的影响。全球变暖导致海洋温度上升，这可能会对许多水生生物的生存、繁殖和迁移造成巨大的影响。另外，海洋酸化也是一个问题
Token indices sequence length is longer than the specified maximum sequence length for this model (159 > 64). Running this sequence through the model will result in indexing errors
input: <reserved_106>写一篇关于气候变化对海洋生态的影响的文章。
<reserved_107>好的，以下是你要求的文章：
气候变化对海洋生态的影响
气候变化对海洋生态产生了广泛的影响。全球变暖导致海洋温度上升，这可能会对许多水生生物的生存、繁殖和迁移造成巨大的影响。另外，海洋酸化也是一个问题
label: </s>好的，以下是你要求的文章：
气候变化对海洋生态的影响
气候变化对海洋生态产生了广泛的影响。全球变暖导致海洋温度上升，这可能会对许多水生生物的生存、繁殖和迁移造成巨大的影响。另外，海洋酸化也是一个问题
I0710 04:31:25.425884  5748 ProcessGroupNCCL.cpp:835] [Rank 4] NCCL watchdog thread started!
I0710 04:31:25.425859  4055 ProcessGroupNCCL.cpp:669] [Rank 4] ProcessGroupNCCL initialized with following options:
NCCL_ASYNC_ERROR_HANDLING: 0
NCCL_DESYNC_DEBUG: 0
NCCL_BLOCKING_WAIT: 0
TIMEOUT(ms): 1800000
USE_HIGH_PRIORITY_STREAM: 0
I0710 04:31:25.443286  4058 ProcessGroupNCCL.cpp:669] [Rank 5] ProcessGroupNCCL initialized with following options:
NCCL_ASYNC_ERROR_HANDLING: 0
NCCL_DESYNC_DEBUG: 0
NCCL_BLOCKING_WAIT: 0
TIMEOUT(ms): 1800000
USE_HIGH_PRIORITY_STREAM: 0
I0710 04:31:25.443320  5749 ProcessGroupNCCL.cpp:835] [Rank 5] NCCL watchdog thread started!
I0710 04:31:25.621404  5750 ProcessGroupNCCL.cpp:835] [Rank 6] NCCL watchdog thread started!
I0710 04:31:25.621384  4061 ProcessGroupNCCL.cpp:669] [Rank 6] ProcessGroupNCCL initialized with following options:
NCCL_ASYNC_ERROR_HANDLING: 0
NCCL_DESYNC_DEBUG: 0
NCCL_BLOCKING_WAIT: 0
TIMEOUT(ms): 1800000
USE_HIGH_PRIORITY_STREAM: 0
I0710 04:31:25.696574  5751 ProcessGroupNCCL.cpp:835] [Rank 1] NCCL watchdog thread started!
I0710 04:31:25.696537  4046 ProcessGroupNCCL.cpp:669] [Rank 1] ProcessGroupNCCL initialized with following options:
NCCL_ASYNC_ERROR_HANDLING: 0
NCCL_DESYNC_DEBUG: 0
NCCL_BLOCKING_WAIT: 0
TIMEOUT(ms): 1800000
USE_HIGH_PRIORITY_STREAM: 0
I0710 04:31:25.930004  4052 ProcessGroupNCCL.cpp:669] [Rank 3] ProcessGroupNCCL initialized with following options:
NCCL_ASYNC_ERROR_HANDLING: 0
NCCL_DESYNC_DEBUG: 0
NCCL_BLOCKING_WAIT: 0
TIMEOUT(ms): 1800000
USE_HIGH_PRIORITY_STREAM: 0
I0710 04:31:25.930064  5752 ProcessGroupNCCL.cpp:835] [Rank 3] NCCL watchdog thread started!
I0710 04:31:25.985500  4049 ProcessGroupNCCL.cpp:669] [Rank 2] ProcessGroupNCCL initialized with following options:
NCCL_ASYNC_ERROR_HANDLING: 0
NCCL_DESYNC_DEBUG: 0
NCCL_BLOCKING_WAIT: 0
TIMEOUT(ms): 1800000
USE_HIGH_PRIORITY_STREAM: 0
I0710 04:31:25.985546  5753 ProcessGroupNCCL.cpp:835] [Rank 2] NCCL watchdog thread started!
trainable params: 524,288 || all params: 7,506,497,536 || trainable%: 0.006984455766295744
I0710 04:31:26.139194  4064 ProcessGroupNCCL.cpp:669] [Rank 7] ProcessGroupNCCL initialized with following options:
NCCL_ASYNC_ERROR_HANDLING: 0
NCCL_DESYNC_DEBUG: 0
NCCL_BLOCKING_WAIT: 0
TIMEOUT(ms): 1800000
USE_HIGH_PRIORITY_STREAM: 0
I0710 04:31:26.139247  5754 ProcessGroupNCCL.cpp:835] [Rank 7] NCCL watchdog thread started!
Token indices sequence length is longer than the specified maximum sequence length for this model (159 > 64). Running this sequence through the model will result in indexing errors
input: <reserved_106>写一篇关于气候变化对海洋生态的影响的文章。
<reserved_107>好的，以下是你要求的文章：
气候变化对海洋生态的影响
气候变化对海洋生态产生了广泛的影响。全球变暖导致海洋温度上升，这可能会对许多水生生物的生存、繁殖和迁移造成巨大的影响。另外，海洋酸化也是一个问题
label: </s>好的，以下是你要求的文章：
气候变化对海洋生态的影响
气候变化对海洋生态产生了广泛的影响。全球变暖导致海洋温度上升，这可能会对许多水生生物的生存、繁殖和迁移造成巨大的影响。另外，海洋酸化也是一个问题
I0710 04:31:27.211406  4043 ProcessGroupNCCL.cpp:669] [Rank 0] ProcessGroupNCCL initialized with following options:
NCCL_ASYNC_ERROR_HANDLING: 0
NCCL_DESYNC_DEBUG: 0
NCCL_BLOCKING_WAIT: 0
TIMEOUT(ms): 1800000
USE_HIGH_PRIORITY_STREAM: 0
I0710 04:31:27.211434  5756 ProcessGroupNCCL.cpp:835] [Rank 0] NCCL watchdog thread started!
Using /root/.cache/torch_extensions/py38_cpu as PyTorch extensions root...
Creating extension directory /root/.cache/torch_extensions/py38_cpu/utils...
Using /root/.cache/torch_extensions/py38_cpu as PyTorch extensions root...
Using /root/.cache/torch_extensions/py38_cpu as PyTorch extensions root...
Using /root/.cache/torch_extensions/py38_cpu as PyTorch extensions root...
Using /root/.cache/torch_extensions/py38_cpu as PyTorch extensions root...
Using /root/.cache/torch_extensions/py38_cpu as PyTorch extensions root...
Using /root/.cache/torch_extensions/py38_cpu as PyTorch extensions root...
Using /root/.cache/torch_extensions/py38_cpu as PyTorch extensions root...
Emitting ninja build file /root/.cache/torch_extensions/py38_cpu/utils/build.ninja...
Building extension module utils...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/2] c++ -MMD -MF flatten_unflatten.o.d -DTORCH_EXTENSION_NAME=utils -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /usr/local/lib/python3.8/site-packages/torch/include -isystem /usr/local/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/lib/python3.8/site-packages/torch/include/TH -isystem /usr/local/lib/python3.8/site-packages/torch/include/THC -isystem /usr/local/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -c /usr/local/lib/python3.8/site-packages/deepspeed/ops/csrc/utils/flatten_unflatten.cpp -o flatten_unflatten.o -fPIC -D__HIP_PLATFORM_HCC__=1 -DUSE_ROCM=1
In file included from /opt/rh/devtoolset-7/root/usr/include/c++/7/ext/hash_set:60:0,
                 from /usr/local/lib/python3.8/site-packages/torch/include/glog/stl_logging.h:54,
                 from /usr/local/lib/python3.8/site-packages/torch/include/c10/util/logging_is_google_glog.h:20,
                 from /usr/local/lib/python3.8/site-packages/torch/include/c10/util/Logging.h:26,
                 from /usr/local/lib/python3.8/site-packages/torch/include/c10/core/TensorImpl.h:18,
                 from /usr/local/lib/python3.8/site-packages/torch/include/c10/core/GeneratorImpl.h:12,
                 from /usr/local/lib/python3.8/site-packages/torch/include/ATen/core/Generator.h:22,
                 from /usr/local/lib/python3.8/site-packages/torch/include/ATen/CPUGeneratorImpl.h:3,
                 from /usr/local/lib/python3.8/site-packages/torch/include/ATen/Context.h:3,
                 from /usr/local/lib/python3.8/site-packages/torch/include/ATen/ATen.h:7,
                 from /usr/local/lib/python3.8/site-packages/torch/include/torch/csrc/utils/tensor_flatten.h:3,
                 from /usr/local/lib/python3.8/site-packages/deepspeed/ops/csrc/utils/flatten_unflatten.cpp:11:
/opt/rh/devtoolset-7/root/usr/include/c++/7/backward/backward_warning.h:32:2: warning: #warning This file includes at least one deprecated or antiquated header which may be removed without further notice at a future date. Please use a non-deprecated interface with equivalent functionality instead. For a listing of replacement headers and interfaces, consult the file backward_warning.h. To disable this warning use -Wno-deprecated. [-Wcpp]
 #warning \
  ^~~~~~~
[2/2] c++ flatten_unflatten.o -shared -L/usr/local/lib/python3.8/site-packages/torch/lib -lc10 -ltorch_cpu -ltorch -ltorch_python -o utils.so
Loading extension module utils...
Time to load utils op: 20.42320728302002 seconds
Loading extension module utils...
Time to load utils op: 20.428962230682373 seconds
Loading extension module utils...
Time to load utils op: 20.42719841003418 seconds
Loading extension module utils...
Time to load utils op: 20.328219413757324 seconds
Loading extension module utils...
Time to load utils op: 20.42875051498413 seconds
Loading extension module utils...
Time to load utils op: 20.4275643825531 seconds
Loading extension module utils...
Time to load utils op: 20.42873740196228 seconds
Loading extension module utils...
Time to load utils op: 20.32988953590393 seconds
Parameter Offload: Total persistent parameters: 790528 in 129 params
Using /root/.cache/torch_extensions/py38_cpu as PyTorch extensions root...
Using /root/.cache/torch_extensions/py38_cpu as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Using /root/.cache/torch_extensions/py38_cpu as PyTorch extensions root...
Using /root/.cache/torch_extensions/py38_cpu as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...Time to load utils op: 0.0008893013000488281 seconds

Loading extension module utils...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.0009264945983886719 seconds
Using /root/.cache/torch_extensions/py38_cpu as PyTorch extensions root...No modifications detected for re-loaded extension module utils, skipping build step...

Loading extension module utils...
Time to load utils op: 0.0009138584136962891 seconds
Using /root/.cache/torch_extensions/py38_cpu as PyTorch extensions root...Time to load utils op: 0.0008387565612792969 seconds

No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.0009930133819580078 seconds
Using /root/.cache/torch_extensions/py38_cpu as PyTorch extensions root...
Time to load utils op: 0.0009253025054931641 seconds
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.0009849071502685547 seconds
Using /root/.cache/torch_extensions/py38_cpu as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.0005936622619628906 seconds
  0%|                                                                                                                                                                                                                                                                                                                                           | 0/2500 [00:00<?, ?it/s]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...

rocBLAS warning: No paths matched /opt/dtk-23.04/lib/rocblas/library/*gfx926*co. Make sure that ROCBLAS_TENSILE_LIBPATH is set correctly.

rocBLAS warning: No paths matched /opt/dtk-23.04/lib/rocblas/library/*gfx926*co. Make sure that ROCBLAS_TENSILE_LIBPATH is set correctly.

rocBLAS warning: No paths matched /opt/dtk-23.04/lib/rocblas/library/*gfx926*co. Make sure that ROCBLAS_TENSILE_LIBPATH is set correctly.

rocBLAS warning: No paths matched /opt/dtk-23.04/lib/rocblas/library/*gfx926*co. Make sure that ROCBLAS_TENSILE_LIBPATH is set correctly.

rocBLAS warning: No paths matched /opt/dtk-23.04/lib/rocblas/library/*gfx926*co. Make sure that ROCBLAS_TENSILE_LIBPATH is set correctly.

rocBLAS warning: No paths matched /opt/dtk-23.04/lib/rocblas/library/*gfx926*co. Make sure that ROCBLAS_TENSILE_LIBPATH is set correctly.

rocBLAS warning: No paths matched /opt/dtk-23.04/lib/rocblas/library/*gfx926*co. Make sure that ROCBLAS_TENSILE_LIBPATH is set correctly.

rocBLAS warning: No paths matched /opt/dtk-23.04/lib/rocblas/library/*gfx926*co. Make sure that ROCBLAS_TENSILE_LIBPATH is set correctly.

rocBLAS error: Tensile solution found, but exception thrown for { a_type: "f16_r", b_type: "f16_r", c_type: "f16_r", d_type: "f16_r", compute_type: "f16_r", transA: 'T', transB: 'N', M: 12288, N: 128, K: 4096, alpha: 1, row_stride_a: 1, col_stride_a: 4096, row_stride_b: 1, col_stride_b: 4096, row_stride_c: 1, col_stride_c: 12288, row_stride_d: 1, col_stride_d: 12288, beta: 0, batch_count: 1, strided_batch: true, stride_a: 1, stride_b: 1, stride_c: 1, stride_d: 1, atomics_mode: atomics_allowed }
Kernel Cijk_Alik_Bljk_HB_MT64x32x16_SN_APM1_AF0EM2_AF1EM1_AMAS3_ASAE01_ASCE01_ASEM2_BL1_DTL0_ETSP_EPS1_FL0_GRVW4_GSU1_GSUAMB_ISA906_IU1_K1_KLA_LPA0_LPB0_LDL1_LRVW4_MAC_MDA2_NLCA1_NLCB1_ONLL1_PK0_PGR1_PLR1_RK0_SU32_SUM0_SUS256_SVW4_SNLL0_TT4_4_USFGRO0_VAW2_VS1_VW4_WG16_8_1_WGM1 not found in any loaded module.
This message will be only be displayed once, unless the ROCBLAS_VERBOSE_TENSILE_ERROR environment variable is set.
Traceback (most recent call last):
  File "fine-tune.py", line 159, in <module>
    train()
  File "fine-tune.py", line 153, in train
    trainer.train()
  File "/usr/local/lib/python3.8/site-packages/transformers/trainer.py", line 1539, in train
    return inner_training_loop(
  File "/usr/local/lib/python3.8/site-packages/transformers/trainer.py", line 1809, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/usr/local/lib/python3.8/site-packages/transformers/trainer.py", line 2654, in training_step
    loss = self.compute_loss(model, inputs)
  File "/usr/local/lib/python3.8/site-packages/transformers/trainer.py", line 2679, in compute_loss
    outputs = model(**inputs)
  File "/usr/local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1736, in forward
    loss = self.module(*inputs, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl
    result = forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/peft/peft_model.py", line 922, in forward
    return self.base_model(
  File "/usr/local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl
    result = forward_call(*input, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/Baichuan2-7B-Chat/modeling_baichuan.py", line 686, in forward
    outputs = self.model(
  File "/usr/local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl
    result = forward_call(*input, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/Baichuan2-7B-Chat/modeling_baichuan.py", line 453, in forward
    layer_outputs = torch.utils.checkpoint.checkpoint(
  File "/usr/local/lib/python3.8/site-packages/torch/utils/checkpoint.py", line 249, in checkpoint
    return CheckpointFunction.apply(function, preserve, *args)
  File "/usr/local/lib/python3.8/site-packages/torch/utils/checkpoint.py", line 107, in forward
    outputs = run_function(*args)
  File "/root/.cache/huggingface/modules/transformers_modules/Baichuan2-7B-Chat/modeling_baichuan.py", line 449, in custom_forward
    return module(*inputs, output_attentions, None)
  File "/usr/local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl
    result = forward_call(*input, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/Baichuan2-7B-Chat/modeling_baichuan.py", line 273, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/usr/local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl
    result = forward_call(*input, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/Baichuan2-7B-Chat/modeling_baichuan.py", line 205, in forward
    proj = self.W_pack(hidden_states)
  File "/usr/local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl
    result = forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/peft/tuners/lora.py", line 817, in forward
    result = F.linear(x, transpose(self.weight, self.fan_in_fan_out), bias=self.bias)
  File "/usr/local/lib/python3.8/site-packages/deepspeed/runtime/zero/linear.py", line 106, in zero3_linear_wrap
    return LinearFunctionForZeroStage3.apply(input, weight)
  File "/usr/local/lib/python3.8/site-packages/torch/cuda/amp/autocast_mode.py", line 97, in decorate_fwd
    return fwd(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/deepspeed/runtime/zero/linear.py", line 57, in forward
    output = input.matmul(weight.t())
RuntimeError: CUDA error: rocblas_status_internal_error when calling `rocblas_gemm_ex( handle, opa, opb, m, n, k, &falpha, a, rocblas_datatype_f16_r, lda, b, rocblas_datatype_f16_r, ldb, &fbeta, c, rocblas_datatype_f16_r, ldc, c, rocblas_datatype_f16_r, ldc, rocblas_datatype_f16_r, rocblas_gemm_algo_standard, 0, flag)`

rocBLAS error: Tensile solution found, but exception thrown for { a_type: "f16_r", b_type: "f16_r", c_type: "f16_r", d_type: "f16_r", compute_type: "f16_r", transA: 'T', transB: 'N', M: 12288, N: 128, K: 4096, alpha: 1, row_stride_a: 1, col_stride_a: 4096, row_stride_b: 1, col_stride_b: 4096, row_stride_c: 1, col_stride_c: 12288, row_stride_d: 1, col_stride_d: 12288, beta: 0, batch_count: 1, strided_batch: true, stride_a: 1, stride_b: 1, stride_c: 1, stride_d: 1, atomics_mode: atomics_allowed }
Kernel Cijk_Alik_Bljk_HB_MT64x32x16_SN_APM1_AF0EM2_AF1EM1_AMAS3_ASAE01_ASCE01_ASEM2_BL1_DTL0_ETSP_EPS1_FL0_GRVW4_GSU1_GSUAMB_ISA906_IU1_K1_KLA_LPA0_LPB0_LDL1_LRVW4_MAC_MDA2_NLCA1_NLCB1_ONLL1_PK0_PGR1_PLR1_RK0_SU32_SUM0_SUS256_SVW4_SNLL0_TT4_4_USFGRO0_VAW2_VS1_VW4_WG16_8_1_WGM1 not found in any loaded module.
This message will be only be displayed once, unless the ROCBLAS_VERBOSE_TENSILE_ERROR environment variable is set.
Traceback (most recent call last):
  File "fine-tune.py", line 159, in <module>
    train()
  File "fine-tune.py", line 153, in train
    trainer.train()
  File "/usr/local/lib/python3.8/site-packages/transformers/trainer.py", line 1539, in train
    return inner_training_loop(
  File "/usr/local/lib/python3.8/site-packages/transformers/trainer.py", line 1809, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/usr/local/lib/python3.8/site-packages/transformers/trainer.py", line 2654, in training_step
    loss = self.compute_loss(model, inputs)
  File "/usr/local/lib/python3.8/site-packages/transformers/trainer.py", line 2679, in compute_loss
    outputs = model(**inputs)
  File "/usr/local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1736, in forward
    loss = self.module(*inputs, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl
    result = forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/peft/peft_model.py", line 922, in forward
    return self.base_model(
  File "/usr/local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl
    result = forward_call(*input, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/Baichuan2-7B-Chat/modeling_baichuan.py", line 686, in forward
    outputs = self.model(
  File "/usr/local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl
    result = forward_call(*input, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/Baichuan2-7B-Chat/modeling_baichuan.py", line 453, in forward
    layer_outputs = torch.utils.checkpoint.checkpoint(
  File "/usr/local/lib/python3.8/site-packages/torch/utils/checkpoint.py", line 249, in checkpoint
    return CheckpointFunction.apply(function, preserve, *args)
  File "/usr/local/lib/python3.8/site-packages/torch/utils/checkpoint.py", line 107, in forward
    outputs = run_function(*args)
  File "/root/.cache/huggingface/modules/transformers_modules/Baichuan2-7B-Chat/modeling_baichuan.py", line 449, in custom_forward
    return module(*inputs, output_attentions, None)
  File "/usr/local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl
    result = forward_call(*input, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/Baichuan2-7B-Chat/modeling_baichuan.py", line 273, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/usr/local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl
    result = forward_call(*input, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/Baichuan2-7B-Chat/modeling_baichuan.py", line 205, in forward
    proj = self.W_pack(hidden_states)
  File "/usr/local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl
    result = forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/peft/tuners/lora.py", line 817, in forward
    result = F.linear(x, transpose(self.weight, self.fan_in_fan_out), bias=self.bias)
  File "/usr/local/lib/python3.8/site-packages/deepspeed/runtime/zero/linear.py", line 106, in zero3_linear_wrap
    return LinearFunctionForZeroStage3.apply(input, weight)
  File "/usr/local/lib/python3.8/site-packages/torch/cuda/amp/autocast_mode.py", line 97, in decorate_fwd
    return fwd(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/deepspeed/runtime/zero/linear.py", line 57, in forward
    output = input.matmul(weight.t())
RuntimeError: CUDA error: rocblas_status_internal_error when calling `rocblas_gemm_ex( handle, opa, opb, m, n, k, &falpha, a, rocblas_datatype_f16_r, lda, b, rocblas_datatype_f16_r, ldb, &fbeta, c, rocblas_datatype_f16_r, ldc, c, rocblas_datatype_f16_r, ldc, rocblas_datatype_f16_r, rocblas_gemm_algo_standard, 0, flag)`

rocBLAS error: Tensile solution found, but exception thrown for { a_type: "f16_r", b_type: "f16_r", c_type: "f16_r", d_type: "f16_r", compute_type: "f16_r", transA: 'T', transB: 'N', M: 12288, N: 128, K: 4096, alpha: 1, row_stride_a: 1, col_stride_a: 4096, row_stride_b: 1, col_stride_b: 4096, row_stride_c: 1, col_stride_c: 12288, row_stride_d: 1, col_stride_d: 12288, beta: 0, batch_count: 1, strided_batch: true, stride_a: 1, stride_b: 1, stride_c: 1, stride_d: 1, atomics_mode: atomics_allowed }
Kernel Cijk_Alik_Bljk_HB_MT64x32x16_SN_APM1_AF0EM2_AF1EM1_AMAS3_ASAE01_ASCE01_ASEM2_BL1_DTL0_ETSP_EPS1_FL0_GRVW4_GSU1_GSUAMB_ISA906_IU1_K1_KLA_LPA0_LPB0_LDL1_LRVW4_MAC_MDA2_NLCA1_NLCB1_ONLL1_PK0_PGR1_PLR1_RK0_SU32_SUM0_SUS256_SVW4_SNLL0_TT4_4_USFGRO0_VAW2_VS1_VW4_WG16_8_1_WGM1 not found in any loaded module.
This message will be only be displayed once, unless the ROCBLAS_VERBOSE_TENSILE_ERROR environment variable is set.
Traceback (most recent call last):
  File "fine-tune.py", line 159, in <module>
    train()
  File "fine-tune.py", line 153, in train
    trainer.train()
  File "/usr/local/lib/python3.8/site-packages/transformers/trainer.py", line 1539, in train
    return inner_training_loop(
  File "/usr/local/lib/python3.8/site-packages/transformers/trainer.py", line 1809, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/usr/local/lib/python3.8/site-packages/transformers/trainer.py", line 2654, in training_step
    loss = self.compute_loss(model, inputs)
  File "/usr/local/lib/python3.8/site-packages/transformers/trainer.py", line 2679, in compute_loss
    outputs = model(**inputs)
  File "/usr/local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1736, in forward
    loss = self.module(*inputs, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl
    result = forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/peft/peft_model.py", line 922, in forward
    return self.base_model(
  File "/usr/local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl
    result = forward_call(*input, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/Baichuan2-7B-Chat/modeling_baichuan.py", line 686, in forward
    outputs = self.model(
  File "/usr/local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl
    result = forward_call(*input, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/Baichuan2-7B-Chat/modeling_baichuan.py", line 453, in forward
    layer_outputs = torch.utils.checkpoint.checkpoint(
  File "/usr/local/lib/python3.8/site-packages/torch/utils/checkpoint.py", line 249, in checkpoint
    return CheckpointFunction.apply(function, preserve, *args)
  File "/usr/local/lib/python3.8/site-packages/torch/utils/checkpoint.py", line 107, in forward
    outputs = run_function(*args)
  File "/root/.cache/huggingface/modules/transformers_modules/Baichuan2-7B-Chat/modeling_baichuan.py", line 449, in custom_forward
    return module(*inputs, output_attentions, None)
  File "/usr/local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl
    result = forward_call(*input, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/Baichuan2-7B-Chat/modeling_baichuan.py", line 273, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/usr/local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl
    result = forward_call(*input, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/Baichuan2-7B-Chat/modeling_baichuan.py", line 205, in forward
    proj = self.W_pack(hidden_states)
  File "/usr/local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl
    result = forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/peft/tuners/lora.py", line 817, in forward
    result = F.linear(x, transpose(self.weight, self.fan_in_fan_out), bias=self.bias)
  File "/usr/local/lib/python3.8/site-packages/deepspeed/runtime/zero/linear.py", line 106, in zero3_linear_wrap
    return LinearFunctionForZeroStage3.apply(input, weight)
  File "/usr/local/lib/python3.8/site-packages/torch/cuda/amp/autocast_mode.py", line 97, in decorate_fwd
    return fwd(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/deepspeed/runtime/zero/linear.py", line 57, in forward
    output = input.matmul(weight.t())
RuntimeError: CUDA error: rocblas_status_internal_error when calling `rocblas_gemm_ex( handle, opa, opb, m, n, k, &falpha, a, rocblas_datatype_f16_r, lda, b, rocblas_datatype_f16_r, ldb, &fbeta, c, rocblas_datatype_f16_r, ldc, c, rocblas_datatype_f16_r, ldc, rocblas_datatype_f16_r, rocblas_gemm_algo_standard, 0, flag)`

rocBLAS error: Tensile solution found, but exception thrown for { a_type: "f16_r", b_type: "f16_r", c_type: "f16_r", d_type: "f16_r", compute_type: "f16_r", transA: 'T', transB: 'N', M: 12288, N: 128, K: 4096, alpha: 1, row_stride_a: 1, col_stride_a: 4096, row_stride_b: 1, col_stride_b: 4096, row_stride_c: 1, col_stride_c: 12288, row_stride_d: 1, col_stride_d: 12288, beta: 0, batch_count: 1, strided_batch: true, stride_a: 1, stride_b: 1, stride_c: 1, stride_d: 1, atomics_mode: atomics_allowed }
Kernel Cijk_Alik_Bljk_HB_MT64x32x16_SN_APM1_AF0EM2_AF1EM1_AMAS3_ASAE01_ASCE01_ASEM2_BL1_DTL0_ETSP_EPS1_FL0_GRVW4_GSU1_GSUAMB_ISA906_IU1_K1_KLA_LPA0_LPB0_LDL1_LRVW4_MAC_MDA2_NLCA1_NLCB1_ONLL1_PK0_PGR1_PLR1_RK0_SU32_SUM0_SUS256_SVW4_SNLL0_TT4_4_USFGRO0_VAW2_VS1_VW4_WG16_8_1_WGM1 not found in any loaded module.
This message will be only be displayed once, unless the ROCBLAS_VERBOSE_TENSILE_ERROR environment variable is set.
Traceback (most recent call last):
  File "fine-tune.py", line 159, in <module>
    train()
  File "fine-tune.py", line 153, in train
    trainer.train()
  File "/usr/local/lib/python3.8/site-packages/transformers/trainer.py", line 1539, in train
    return inner_training_loop(
  File "/usr/local/lib/python3.8/site-packages/transformers/trainer.py", line 1809, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/usr/local/lib/python3.8/site-packages/transformers/trainer.py", line 2654, in training_step
    loss = self.compute_loss(model, inputs)
  File "/usr/local/lib/python3.8/site-packages/transformers/trainer.py", line 2679, in compute_loss
    outputs = model(**inputs)
  File "/usr/local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1736, in forward
    loss = self.module(*inputs, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl
    result = forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/peft/peft_model.py", line 922, in forward
    return self.base_model(
  File "/usr/local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl
    result = forward_call(*input, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/Baichuan2-7B-Chat/modeling_baichuan.py", line 686, in forward
    outputs = self.model(
  File "/usr/local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl
    result = forward_call(*input, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/Baichuan2-7B-Chat/modeling_baichuan.py", line 453, in forward
    layer_outputs = torch.utils.checkpoint.checkpoint(
  File "/usr/local/lib/python3.8/site-packages/torch/utils/checkpoint.py", line 249, in checkpoint
    return CheckpointFunction.apply(function, preserve, *args)
  File "/usr/local/lib/python3.8/site-packages/torch/utils/checkpoint.py", line 107, in forward
    outputs = run_function(*args)
  File "/root/.cache/huggingface/modules/transformers_modules/Baichuan2-7B-Chat/modeling_baichuan.py", line 449, in custom_forward
    return module(*inputs, output_attentions, None)
  File "/usr/local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl
    result = forward_call(*input, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/Baichuan2-7B-Chat/modeling_baichuan.py", line 273, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/usr/local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl
    result = forward_call(*input, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/Baichuan2-7B-Chat/modeling_baichuan.py", line 205, in forward
    proj = self.W_pack(hidden_states)
  File "/usr/local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl
    result = forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/peft/tuners/lora.py", line 817, in forward
    result = F.linear(x, transpose(self.weight, self.fan_in_fan_out), bias=self.bias)
  File "/usr/local/lib/python3.8/site-packages/deepspeed/runtime/zero/linear.py", line 106, in zero3_linear_wrap
    return LinearFunctionForZeroStage3.apply(input, weight)
  File "/usr/local/lib/python3.8/site-packages/torch/cuda/amp/autocast_mode.py", line 97, in decorate_fwd
    return fwd(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/deepspeed/runtime/zero/linear.py", line 57, in forward
    output = input.matmul(weight.t())
RuntimeError: CUDA error: rocblas_status_internal_error when calling `rocblas_gemm_ex( handle, opa, opb, m, n, k, &falpha, a, rocblas_datatype_f16_r, lda, b, rocblas_datatype_f16_r, ldb, &fbeta, c, rocblas_datatype_f16_r, ldc, c, rocblas_datatype_f16_r, ldc, rocblas_datatype_f16_r, rocblas_gemm_algo_standard, 0, flag)`

rocBLAS error: Tensile solution found, but exception thrown for { a_type: "f16_r", b_type: "f16_r", c_type: "f16_r", d_type: "f16_r", compute_type: "f16_r", transA: 'T', transB: 'N', M: 12288, N: 128, K: 4096, alpha: 1, row_stride_a: 1, col_stride_a: 4096, row_stride_b: 1, col_stride_b: 4096, row_stride_c: 1, col_stride_c: 12288, row_stride_d: 1, col_stride_d: 12288, beta: 0, batch_count: 1, strided_batch: true, stride_a: 1, stride_b: 1, stride_c: 1, stride_d: 1, atomics_mode: atomics_allowed }
Kernel Cijk_Alik_Bljk_HB_MT64x32x16_SN_APM1_AF0EM2_AF1EM1_AMAS3_ASAE01_ASCE01_ASEM2_BL1_DTL0_ETSP_EPS1_FL0_GRVW4_GSU1_GSUAMB_ISA906_IU1_K1_KLA_LPA0_LPB0_LDL1_LRVW4_MAC_MDA2_NLCA1_NLCB1_ONLL1_PK0_PGR1_PLR1_RK0_SU32_SUM0_SUS256_SVW4_SNLL0_TT4_4_USFGRO0_VAW2_VS1_VW4_WG16_8_1_WGM1 not found in any loaded module.
This message will be only be displayed once, unless the ROCBLAS_VERBOSE_TENSILE_ERROR environment variable is set.
Traceback (most recent call last):
  File "fine-tune.py", line 159, in <module>
    train()
  File "fine-tune.py", line 153, in train
    trainer.train()
  File "/usr/local/lib/python3.8/site-packages/transformers/trainer.py", line 1539, in train
    return inner_training_loop(
  File "/usr/local/lib/python3.8/site-packages/transformers/trainer.py", line 1809, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/usr/local/lib/python3.8/site-packages/transformers/trainer.py", line 2654, in training_step
    loss = self.compute_loss(model, inputs)
  File "/usr/local/lib/python3.8/site-packages/transformers/trainer.py", line 2679, in compute_loss
    outputs = model(**inputs)
  File "/usr/local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1736, in forward
    loss = self.module(*inputs, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl
    result = forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/peft/peft_model.py", line 922, in forward
    return self.base_model(
  File "/usr/local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl
    result = forward_call(*input, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/Baichuan2-7B-Chat/modeling_baichuan.py", line 686, in forward
    outputs = self.model(
  File "/usr/local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl
    result = forward_call(*input, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/Baichuan2-7B-Chat/modeling_baichuan.py", line 453, in forward
    layer_outputs = torch.utils.checkpoint.checkpoint(
  File "/usr/local/lib/python3.8/site-packages/torch/utils/checkpoint.py", line 249, in checkpoint
    return CheckpointFunction.apply(function, preserve, *args)
  File "/usr/local/lib/python3.8/site-packages/torch/utils/checkpoint.py", line 107, in forward
    outputs = run_function(*args)
  File "/root/.cache/huggingface/modules/transformers_modules/Baichuan2-7B-Chat/modeling_baichuan.py", line 449, in custom_forward
    return module(*inputs, output_attentions, None)
  File "/usr/local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl
    result = forward_call(*input, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/Baichuan2-7B-Chat/modeling_baichuan.py", line 273, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/usr/local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl
    result = forward_call(*input, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/Baichuan2-7B-Chat/modeling_baichuan.py", line 205, in forward
    proj = self.W_pack(hidden_states)
  File "/usr/local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl
    result = forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/peft/tuners/lora.py", line 817, in forward
    result = F.linear(x, transpose(self.weight, self.fan_in_fan_out), bias=self.bias)
  File "/usr/local/lib/python3.8/site-packages/deepspeed/runtime/zero/linear.py", line 106, in zero3_linear_wrap
    return LinearFunctionForZeroStage3.apply(input, weight)
  File "/usr/local/lib/python3.8/site-packages/torch/cuda/amp/autocast_mode.py", line 97, in decorate_fwd
    return fwd(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/deepspeed/runtime/zero/linear.py", line 57, in forward
    output = input.matmul(weight.t())
RuntimeError: CUDA error: rocblas_status_internal_error when calling `rocblas_gemm_ex( handle, opa, opb, m, n, k, &falpha, a, rocblas_datatype_f16_r, lda, b, rocblas_datatype_f16_r, ldb, &fbeta, c, rocblas_datatype_f16_r, ldc, c, rocblas_datatype_f16_r, ldc, rocblas_datatype_f16_r, rocblas_gemm_algo_standard, 0, flag)`

rocBLAS error: Tensile solution found, but exception thrown for { a_type: "f16_r", b_type: "f16_r", c_type: "f16_r", d_type: "f16_r", compute_type: "f16_r", transA: 'T', transB: 'N', M: 12288, N: 128, K: 4096, alpha: 1, row_stride_a: 1, col_stride_a: 4096, row_stride_b: 1, col_stride_b: 4096, row_stride_c: 1, col_stride_c: 12288, row_stride_d: 1, col_stride_d: 12288, beta: 0, batch_count: 1, strided_batch: true, stride_a: 1, stride_b: 1, stride_c: 1, stride_d: 1, atomics_mode: atomics_allowed }
Kernel Cijk_Alik_Bljk_HB_MT64x32x16_SN_APM1_AF0EM2_AF1EM1_AMAS3_ASAE01_ASCE01_ASEM2_BL1_DTL0_ETSP_EPS1_FL0_GRVW4_GSU1_GSUAMB_ISA906_IU1_K1_KLA_LPA0_LPB0_LDL1_LRVW4_MAC_MDA2_NLCA1_NLCB1_ONLL1_PK0_PGR1_PLR1_RK0_SU32_SUM0_SUS256_SVW4_SNLL0_TT4_4_USFGRO0_VAW2_VS1_VW4_WG16_8_1_WGM1 not found in any loaded module.
This message will be only be displayed once, unless the ROCBLAS_VERBOSE_TENSILE_ERROR environment variable is set.
Traceback (most recent call last):
  File "fine-tune.py", line 159, in <module>
    train()
  File "fine-tune.py", line 153, in train
    trainer.train()
  File "/usr/local/lib/python3.8/site-packages/transformers/trainer.py", line 1539, in train
    return inner_training_loop(
  File "/usr/local/lib/python3.8/site-packages/transformers/trainer.py", line 1809, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/usr/local/lib/python3.8/site-packages/transformers/trainer.py", line 2654, in training_step

rocBLAS error: Tensile solution found, but exception thrown for { a_type: "f16_r", b_type: "f16_r", c_type: "f16_r", d_type: "f16_r", compute_type: "f16_r", transA: 'T', transB: 'N', M: 12288, N: 128, K: 4096, alpha: 1, row_stride_a: 1, col_stride_a: 4096, row_stride_b: 1, col_stride_b: 4096, row_stride_c: 1, col_stride_c: 12288, row_stride_d: 1, col_stride_d: 12288, beta: 0, batch_count: 1, strided_batch: true, stride_a: 1, stride_b: 1, stride_c: 1, stride_d: 1, atomics_mode: atomics_allowed }
Kernel Cijk_Alik_Bljk_HB_MT64x32x16_SN_APM1_AF0EM2_AF1EM1_AMAS3_ASAE01_ASCE01_ASEM2_BL1_DTL0_ETSP_EPS1_FL0_GRVW4_GSU1_GSUAMB_ISA906_IU1_K1_KLA_LPA0_LPB0_LDL1_LRVW4_MAC_MDA2_NLCA1_NLCB1_ONLL1_PK0_PGR1_PLR1_RK0_SU32_SUM0_SUS256_SVW4_SNLL0_TT4_4_USFGRO0_VAW2_VS1_VW4_WG16_8_1_WGM1 not found in any loaded module.
This message will be only be displayed once, unless the ROCBLAS_VERBOSE_TENSILE_ERROR environment variable is set.
    loss = self.compute_loss(model, inputs)
  File "/usr/local/lib/python3.8/site-packages/transformers/trainer.py", line 2679, in compute_loss
    outputs = model(**inputs)
  File "/usr/local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1736, in forward
    loss = self.module(*inputs, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl
    result = forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/peft/peft_model.py", line 922, in forward
    return self.base_model(
  File "/usr/local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl
    result = forward_call(*input, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/Baichuan2-7B-Chat/modeling_baichuan.py", line 686, in forward
    outputs = self.model(
  File "/usr/local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl
    result = forward_call(*input, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/Baichuan2-7B-Chat/modeling_baichuan.py", line 453, in forward
    layer_outputs = torch.utils.checkpoint.checkpoint(
  File "/usr/local/lib/python3.8/site-packages/torch/utils/checkpoint.py", line 249, in checkpoint
    return CheckpointFunction.apply(function, preserve, *args)
  File "/usr/local/lib/python3.8/site-packages/torch/utils/checkpoint.py", line 107, in forward
    outputs = run_function(*args)
  File "/root/.cache/huggingface/modules/transformers_modules/Baichuan2-7B-Chat/modeling_baichuan.py", line 449, in custom_forward
    return module(*inputs, output_attentions, None)
  File "/usr/local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl
    result = forward_call(*input, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/Baichuan2-7B-Chat/modeling_baichuan.py", line 273, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/usr/local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl
    result = forward_call(*input, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/Baichuan2-7B-Chat/modeling_baichuan.py", line 205, in forward
    proj = self.W_pack(hidden_states)
  File "/usr/local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl
    result = forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/peft/tuners/lora.py", line 817, in forward
    result = F.linear(x, transpose(self.weight, self.fan_in_fan_out), bias=self.bias)
  File "/usr/local/lib/python3.8/site-packages/deepspeed/runtime/zero/linear.py", line 106, in zero3_linear_wrap
    return LinearFunctionForZeroStage3.apply(input, weight)
  File "/usr/local/lib/python3.8/site-packages/torch/cuda/amp/autocast_mode.py", line 97, in decorate_fwd
    return fwd(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/deepspeed/runtime/zero/linear.py", line 57, in forward
    output = input.matmul(weight.t())
RuntimeError: CUDA error: rocblas_status_internal_error when calling `rocblas_gemm_ex( handle, opa, opb, m, n, k, &falpha, a, rocblas_datatype_f16_r, lda, b, rocblas_datatype_f16_r, ldb, &fbeta, c, rocblas_datatype_f16_r, ldc, c, rocblas_datatype_f16_r, ldc, rocblas_datatype_f16_r, rocblas_gemm_algo_standard, 0, flag)`
Traceback (most recent call last):
  File "fine-tune.py", line 159, in <module>
    train()
  File "fine-tune.py", line 153, in train
    trainer.train()
  File "/usr/local/lib/python3.8/site-packages/transformers/trainer.py", line 1539, in train
    return inner_training_loop(
  File "/usr/local/lib/python3.8/site-packages/transformers/trainer.py", line 1809, in _inner_training_loop

rocBLAS error: Tensile solution found, but exception thrown for { a_type: "f16_r", b_type: "f16_r", c_type: "f16_r", d_type: "f16_r", compute_type: "f16_r", transA: 'T', transB: 'N', M: 12288, N: 128, K: 4096, alpha: 1, row_stride_a: 1, col_stride_a: 4096, row_stride_b: 1, col_stride_b: 4096, row_stride_c: 1, col_stride_c: 12288, row_stride_d: 1, col_stride_d: 12288, beta: 0, batch_count: 1, strided_batch: true, stride_a: 1, stride_b: 1, stride_c: 1, stride_d: 1, atomics_mode: atomics_allowed }
Kernel Cijk_Alik_Bljk_HB_MT64x32x16_SN_APM1_AF0EM2_AF1EM1_AMAS3_ASAE01_ASCE01_ASEM2_BL1_DTL0_ETSP_EPS1_FL0_GRVW4_GSU1_GSUAMB_ISA906_IU1_K1_KLA_LPA0_LPB0_LDL1_LRVW4_MAC_MDA2_NLCA1_NLCB1_ONLL1_PK0_PGR1_PLR1_RK0_SU32_SUM0_SUS256_SVW4_SNLL0_TT4_4_USFGRO0_VAW2_VS1_VW4_WG16_8_1_WGM1 not found in any loaded module.
This message will be only be displayed once, unless the ROCBLAS_VERBOSE_TENSILE_ERROR environment variable is set.
    tr_loss_step = self.training_step(model, inputs)
  File "/usr/local/lib/python3.8/site-packages/transformers/trainer.py", line 2654, in training_step
    loss = self.compute_loss(model, inputs)
  File "/usr/local/lib/python3.8/site-packages/transformers/trainer.py", line 2679, in compute_loss
    outputs = model(**inputs)
  File "/usr/local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1736, in forward
    loss = self.module(*inputs, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl
    result = forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/peft/peft_model.py", line 922, in forward
    return self.base_model(
  File "/usr/local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl
    result = forward_call(*input, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/Baichuan2-7B-Chat/modeling_baichuan.py", line 686, in forward
    outputs = self.model(
  File "/usr/local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl
    result = forward_call(*input, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/Baichuan2-7B-Chat/modeling_baichuan.py", line 453, in forward
    layer_outputs = torch.utils.checkpoint.checkpoint(
  File "/usr/local/lib/python3.8/site-packages/torch/utils/checkpoint.py", line 249, in checkpoint
    return CheckpointFunction.apply(function, preserve, *args)
  File "/usr/local/lib/python3.8/site-packages/torch/utils/checkpoint.py", line 107, in forward
    outputs = run_function(*args)
  File "/root/.cache/huggingface/modules/transformers_modules/Baichuan2-7B-Chat/modeling_baichuan.py", line 449, in custom_forward
    return module(*inputs, output_attentions, None)
  File "/usr/local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl
    result = forward_call(*input, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/Baichuan2-7B-Chat/modeling_baichuan.py", line 273, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/usr/local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl
    result = forward_call(*input, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/Baichuan2-7B-Chat/modeling_baichuan.py", line 205, in forward
    proj = self.W_pack(hidden_states)
  File "/usr/local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl
    result = forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/peft/tuners/lora.py", line 817, in forward
    result = F.linear(x, transpose(self.weight, self.fan_in_fan_out), bias=self.bias)
  File "/usr/local/lib/python3.8/site-packages/deepspeed/runtime/zero/linear.py", line 106, in zero3_linear_wrap
    return LinearFunctionForZeroStage3.apply(input, weight)
  File "/usr/local/lib/python3.8/site-packages/torch/cuda/amp/autocast_mode.py", line 97, in decorate_fwd
    return fwd(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/deepspeed/runtime/zero/linear.py", line 57, in forward
    output = input.matmul(weight.t())
RuntimeError: CUDA error: rocblas_status_internal_error when calling `rocblas_gemm_ex( handle, opa, opb, m, n, k, &falpha, a, rocblas_datatype_f16_r, lda, b, rocblas_datatype_f16_r, ldb, &fbeta, c, rocblas_datatype_f16_r, ldc, c, rocblas_datatype_f16_r, ldc, rocblas_datatype_f16_r, rocblas_gemm_algo_standard, 0, flag)`
Traceback (most recent call last):
  File "fine-tune.py", line 159, in <module>
    train()
  File "fine-tune.py", line 153, in train
    trainer.train()
  File "/usr/local/lib/python3.8/site-packages/transformers/trainer.py", line 1539, in train
    return inner_training_loop(
  File "/usr/local/lib/python3.8/site-packages/transformers/trainer.py", line 1809, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/usr/local/lib/python3.8/site-packages/transformers/trainer.py", line 2654, in training_step
    loss = self.compute_loss(model, inputs)
  File "/usr/local/lib/python3.8/site-packages/transformers/trainer.py", line 2679, in compute_loss
    outputs = model(**inputs)
  File "/usr/local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1736, in forward
    loss = self.module(*inputs, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl
    result = forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/peft/peft_model.py", line 922, in forward
    return self.base_model(
  File "/usr/local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl
    result = forward_call(*input, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/Baichuan2-7B-Chat/modeling_baichuan.py", line 686, in forward
    outputs = self.model(
  File "/usr/local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl
    result = forward_call(*input, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/Baichuan2-7B-Chat/modeling_baichuan.py", line 453, in forward
    layer_outputs = torch.utils.checkpoint.checkpoint(
  File "/usr/local/lib/python3.8/site-packages/torch/utils/checkpoint.py", line 249, in checkpoint
    return CheckpointFunction.apply(function, preserve, *args)
  File "/usr/local/lib/python3.8/site-packages/torch/utils/checkpoint.py", line 107, in forward
    outputs = run_function(*args)
  File "/root/.cache/huggingface/modules/transformers_modules/Baichuan2-7B-Chat/modeling_baichuan.py", line 449, in custom_forward
    return module(*inputs, output_attentions, None)
  File "/usr/local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl
    result = forward_call(*input, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/Baichuan2-7B-Chat/modeling_baichuan.py", line 273, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/usr/local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl
    result = forward_call(*input, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/Baichuan2-7B-Chat/modeling_baichuan.py", line 205, in forward
    proj = self.W_pack(hidden_states)
  File "/usr/local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl
    result = forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/peft/tuners/lora.py", line 817, in forward
    result = F.linear(x, transpose(self.weight, self.fan_in_fan_out), bias=self.bias)
  File "/usr/local/lib/python3.8/site-packages/deepspeed/runtime/zero/linear.py", line 106, in zero3_linear_wrap
    return LinearFunctionForZeroStage3.apply(input, weight)
  File "/usr/local/lib/python3.8/site-packages/torch/cuda/amp/autocast_mode.py", line 97, in decorate_fwd
    return fwd(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/deepspeed/runtime/zero/linear.py", line 57, in forward
    output = input.matmul(weight.t())
RuntimeError: CUDA error: rocblas_status_internal_error when calling `rocblas_gemm_ex( handle, opa, opb, m, n, k, &falpha, a, rocblas_datatype_f16_r, lda, b, rocblas_datatype_f16_r, ldb, &fbeta, c, rocblas_datatype_f16_r, ldc, c, rocblas_datatype_f16_r, ldc, rocblas_datatype_f16_r, rocblas_gemm_algo_standard, 0, flag)`
  0%|                                                                                                                                                                                                                                                                                                                                           | 0/2500 [00:04<?, ?it/s]
[2024-07-10 04:31:58,559] [INFO] [launch.py:314:sigkill_handler] Killing subprocess 4043
[2024-07-10 04:31:58,653] [INFO] [launch.py:314:sigkill_handler] Killing subprocess 4046
[2024-07-10 04:31:58,988] [INFO] [launch.py:314:sigkill_handler] Killing subprocess 4049
[2024-07-10 04:31:59,283] [INFO] [launch.py:314:sigkill_handler] Killing subprocess 4052
[2024-07-10 04:31:59,284] [INFO] [launch.py:314:sigkill_handler] Killing subprocess 4055
[2024-07-10 04:31:59,284] [INFO] [launch.py:314:sigkill_handler] Killing subprocess 4058
[2024-07-10 04:31:59,285] [INFO] [launch.py:314:sigkill_handler] Killing subprocess 4061
[2024-07-10 04:31:59,286] [INFO] [launch.py:314:sigkill_handler] Killing subprocess 4064
[2024-07-10 04:31:59,286] [ERROR] [launch.py:320:sigkill_handler] ['/usr/local/bin/python3.8', '-u', 'fine-tune.py', '--local_rank=7', '--report_to', 'none', '--data_path', 'data/belle_chat_ramdon_10k.json', '--model_name_or_path', '../../Baichuan2-7B-Chat', '--output_dir', 'output', '--model_max_length', '64', '--num_train_epochs', '4', '--per_device_train_batch_size', '2', '--gradient_accumulation_steps', '1', '--save_strategy', 'epoch', '--learning_rate', '2e-5', '--lr_scheduler_type', 'constant', '--adam_beta1', '0.9', '--adam_beta2', '0.98', '--adam_epsilon', '1e-8', '--max_grad_norm', '1.0', '--weight_decay', '1e-4', '--warmup_ratio', '0.0', '--logging_steps', '1', '--gradient_checkpointing', 'True', '--deepspeed', 'ds_config.json', '--fp16', '--use_lora', 'True'] exits with return code = 1
[root@test fine-tune]#