```
nohup: ignoring input
The following values were not passed to `accelerate launch` and had defaults used instead:
        `--num_processes` was set to a value of `4`
                More than one GPU was found, enabling multi-GPU training.
                If this was unintended please pass in `--num_processes=1`.
        `--num_machines` was set to a value of `1`
        `--mixed_precision` was set to a value of `'no'`
        `--dynamo_backend` was set to a value of `'no'`
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
[2024-06-20 19:37:09,141] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-06-20 19:37:09,189] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-06-20 19:37:09,195] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-06-20 19:37:09,241] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0620 19:37:09.644747 1901 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=308831104
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0620 19:37:09.711325 1904 ProcessGroupNCCL.cpp:686] [Rank 3] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=290381616
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0620 19:37:09.713822 1903 ProcessGroupNCCL.cpp:686] [Rank 2] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=313050000
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0620 19:37:09.763563 1902 ProcessGroupNCCL.cpp:686] [Rank 1] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=291112624
trainable params: 3030371328 || all params: 3030371328 || trainable%: 100.0
trainable params: 3030371328 || all params: 3030371328 || trainable%: 100.0
trainable params: 3030371328 || all params: 3030371328 || trainable%: 100.0
trainable params: 3030371328 || all params: 3030371328 || trainable%: 100.0
/usr/local/lib/python3.10/site-packages/huggingface_hub/utils/_deprecation.py:100: FutureWarning: Deprecated argument(s) used in '__init__': max_seq_length, dataset_text_field. Will not be supported from version '1.0.0'. Deprecated positional argument(s) used in SFTTrainer, please use the SFTConfig to set these arguments instead.
  warnings.warn(message, FutureWarning)
/usr/local/lib/python3.10/site-packages/transformers/training_args.py:1847: FutureWarning: `--push_to_hub_token` is deprecated and will be removed in version 5 of 🤗 Transformers. Use `--hub_token` instead.
  warnings.warn(
```
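Everything in the block above is noise rather than failure. The `accelerate launch` banner only reports defaults that were filled in; it disappears once the flags are passed explicitly, e.g. `accelerate launch --num_processes=4 --num_machines=1 --mixed_precision=no --dynamo_backend=no finetune.py`, or after running `accelerate config` once. The huggingface_hub FutureWarning says that `max_seq_length` and `dataset_text_field` should move from `SFTTrainer` keyword arguments into an `SFTConfig`. A minimal sketch of that migration, assuming a TRL version recent enough to ship `SFTConfig`; the dataset file and output path are placeholders, and the model id is only a guess from the ~3.0B parameter count in the log:

```python
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Placeholder dataset with a "text" column; substitute the real one.
train_dataset = load_dataset("json", data_files="train.jsonl", split="train")

# SFTConfig subclasses transformers.TrainingArguments, so the usual
# training arguments (learning rate, batch size, ...) belong here too.
config = SFTConfig(
    output_dir="./starcoder2-sft",   # placeholder output path
    max_seq_length=1024,             # was: SFTTrainer(max_seq_length=...)
    dataset_text_field="text",       # was: SFTTrainer(dataset_text_field=...)
)

trainer = SFTTrainer(
    model="bigcode/starcoder2-3b",   # placeholder guess, see above
    args=config,
    train_dataset=train_dataset,
)
trainer.train()
```

The `--push_to_hub_token` warning is the same kind of rename: pass `hub_token` instead of `push_to_hub_token` when building the config. The log continues with the same warnings from the remaining ranks: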
```
[the huggingface_hub and `--push_to_hub_token` FutureWarnings above repeat once per rank, 4x each in total; duplicates trimmed]
/usr/local/lib/python3.10/site-packages/trl/trainer/sft_trainer.py:269: UserWarning: You passed a `max_seq_length` argument to the SFTTrainer, the value you passed will override the one in the `SFTConfig`.
  warnings.warn(
/usr/local/lib/python3.10/site-packages/trl/trainer/sft_trainer.py:307: UserWarning: You passed a `dataset_text_field` argument to the SFTTrainer, the value you passed will override the one in the `SFTConfig`.
  warnings.warn(
[both UserWarnings likewise repeat once per rank; duplicates trimmed]
I0620 19:37:14.700358 1903 ProcessGroupNCCL.cpp:2780] Rank 2 using GPU 2 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
I0620 19:37:14.742319 1904 ProcessGroupNCCL.cpp:2780] Rank 3 using GPU 3 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
I0620 19:37:14.743152 1902 ProcessGroupNCCL.cpp:2780] Rank 1 using GPU 1 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
I0620 19:37:14.748099 1901 ProcessGroupNCCL.cpp:2780] Rank 0 using GPU 0 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
I0620 19:37:15.159780 1901 ProcessGroupNCCL.cpp:1340] NCCL_DEBUG: N/A
Detected kernel version 4.18.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
Training...
[printed once per rank, 4x in total]
/usr/local/lib/python3.10/site-packages/transformers/optimization.py:457: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
  warnings.warn(
[repeated once per rank, 4x in total]
```
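The per-rank barrier and `NCCL_DEBUG` lines are informational, and the kernel-version message (4.18.0 vs. a recommended 5.5.0) warns about possible hangs rather than reporting an error. The AdamW FutureWarning is silenced by requesting the PyTorch optimizer instead of the deprecated `transformers` implementation; `optim` is a standard `TrainingArguments` field, so the same keyword works on `SFTConfig`. A sketch, with the output path again a placeholder:

```python
from trl import SFTConfig

config = SFTConfig(
    output_dir="./starcoder2-sft",  # placeholder
    optim="adamw_torch",            # torch.optim.AdamW instead of the deprecated transformers.AdamW
)
```

Immediately after the optimizer warnings, rank 0 hits the actual error: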
```
wandb: ERROR api_key not configured (no-tty).
call wandb.login(key=[your_api_key])
Traceback (most recent call last):
  File "/home/starcoder2_pytorch/finetune.py", line 145, in <module>
    main(args)
  File "/home/starcoder2_pytorch/finetune.py", line 129, in main
    trainer.train()
  File "/usr/local/lib/python3.10/site-packages/trl/trainer/sft_trainer.py", line 440, in train
    output = super().train(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/transformers/trainer.py", line 1780, in train
    return inner_training_loop(
  File "/usr/local/lib/python3.10/site-packages/transformers/trainer.py", line 2036, in _inner_training_loop
    self.control = self.callback_handler.on_train_begin(args, self.state, self.control)
  File "/usr/local/lib/python3.10/site-packages/transformers/trainer_callback.py", line 370, in on_train_begin
    return self.call_event("on_train_begin", args, state, control)
  File "/usr/local/lib/python3.10/site-packages/transformers/trainer_callback.py", line 414, in call_event
    result = getattr(callback, event)(
  File "/usr/local/lib/python3.10/site-packages/transformers/integrations/integration_utils.py", line 768, in on_train_begin
    self.setup(args, state, model, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/transformers/integrations/integration_utils.py", line 741, in setup
    self._wandb.init(
  File "/usr/local/lib/python3.10/site-packages/wandb/sdk/wandb_init.py", line 1195, in init
    raise e
  File "/usr/local/lib/python3.10/site-packages/wandb/sdk/wandb_init.py", line 1172, in init
    wi.setup(kwargs)
  File "/usr/local/lib/python3.10/site-packages/wandb/sdk/wandb_init.py", line 306, in setup
    wandb_login._login(
  File "/usr/local/lib/python3.10/site-packages/wandb/sdk/wandb_login.py", line 317, in _login
    wlogin.prompt_api_key()
  File "/usr/local/lib/python3.10/site-packages/wandb/sdk/wandb_login.py", line 247, in prompt_api_key
    raise UsageError("api_key not configured (no-tty). call " + directive)
wandb.errors.UsageError: api_key not configured (no-tty).
call wandb.login(key=[your_api_key])
```
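This is the root cause. `report_to` in `TrainingArguments` defaults to every installed logging integration, so the W&B callback calls `wandb.init()` at `on_train_begin`; under `nohup` there is no TTY for the interactive login prompt, so rank 0 raises `UsageError` and exits. Any one of the following resolves it; the key strings are placeholders:

```python
import os
import wandb

# Option 1: provide the key through the environment. Equivalently, export it
# in the shell before launching: WANDB_API_KEY=... nohup accelerate launch ... &
os.environ["WANDB_API_KEY"] = "your-api-key"  # placeholder

# Option 2: log in programmatically before trainer.train() runs.
wandb.login(key="your-api-key")  # placeholder

# Option 3: skip W&B entirely by turning off reporting integrations.
from trl import SFTConfig
config = SFTConfig(output_dir="./starcoder2-sft", report_to="none")  # placeholder path
```

Once rank 0 dies, the launcher tears down the other ranks: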
```
[2024-06-20 19:37:23,288] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1902 closing signal SIGTERM
[2024-06-20 19:37:23,288] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1903 closing signal SIGTERM
[2024-06-20 19:37:23,288] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1904 closing signal SIGTERM
[2024-06-20 19:37:23,402] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 1901) of binary: /usr/local/bin/python3.10
Traceback (most recent call last):
  File "/usr/local/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main
    args.func(args)
  File "/usr/local/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1014, in launch_command
    multi_gpu_launcher(args)
  File "/usr/local/lib/python3.10/site-packages/accelerate/commands/launch.py", line 672, in multi_gpu_launcher
    distrib_run.run(args)
  File "/usr/local/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
finetune.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-06-20_19:37:23
  host      : c7312cf682cf
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 1901)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
```
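The closing `ChildFailedError` is only the launcher reporting that rank 0 exited with code 1; the other three ranks were healthy and simply received SIGTERM. The `error_file: <N/A>` line means the child's exception was never captured to disk. Wrapping the script's entry point with torchelastic's `record` decorator, as the linked page describes, writes a full traceback into the error file on future failures. A minimal sketch; `main` and `args` follow the structure implied by the traceback above, and the argument parser is hypothetical:

```python
import argparse

from torch.distributed.elastic.multiprocessing.errors import record


@record  # capture uncaught exceptions into the torchelastic error file
def main(args: argparse.Namespace) -> None:
    ...  # build the SFTTrainer and call trainer.train()


if __name__ == "__main__":
    parser = argparse.ArgumentParser()  # finetune.py's real flags would be added here
    main(parser.parse_args())
```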