Fix attention backward bug in non-causal models when context parallelism is enabled. (#1031)
This bug causes a crash: [ERROR] failed (exitcode: -11) local_rank: 0 (pid: 1761020) of binary: ~/megatron/bin/python.
The cause is that we were missing the rng_states required by attention recompute (for dropout), and no hint is given in the error output.
This made it extremely difficult to trace; it cost me two weeks.
```
before the start of training step] datetime: 2024-07-22 18:26:45
[2024-07-22 18:27:00,941] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -11) local_rank: 0 (pid: 1761020) of binary: /home//miniconda3/envs/megatron/bin/python
Traceback (most recent call last):
  File "/home//miniconda3/envs/megatron/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==2.2.1+cu121', 'console_scripts', 'torchrun')())
  File "/home//miniconda3/envs/megatron/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/home//miniconda3/envs/megatron/lib/python3.10/site-packages/torch/distributed/run.py", line 812, in main
    run(args)
  File "/home//miniconda3/envs/megatron/lib/python3.10/site-packages/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/home//miniconda3/envs/megatron/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home//miniconda3/envs/megatron/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
```
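To illustrate why the saved RNG state matters for dropout under recompute, here is a minimal CPU-only sketch (not the actual Megatron-LM fix; function names are hypothetical): if the RNG state captured before the forward pass is not restored before recomputing, the recomputed dropout mask differs from the original, corrupting the backward pass.

```python
import torch

def forward_with_dropout(x, p=0.5):
    # Dropout consumes RNG state; a recompute must replay that state
    # to reproduce the exact same mask.
    return torch.nn.functional.dropout(x, p=p, training=True)

def checkpointed_forward(x):
    # Save the RNG state before the forward pass. This is the
    # bookkeeping the missing rng_states provided in the real fix
    # (there it is the per-rank CUDA RNG state).
    rng_state = torch.get_rng_state()
    out = forward_with_dropout(x)
    return out, rng_state

def recompute(x, rng_state):
    # Restore the saved RNG state so the dropout mask is identical
    # to the one used in the original forward pass.
    torch.set_rng_state(rng_state)
    return forward_with_dropout(x)

torch.manual_seed(0)
x = torch.randn(4, 4)
out, rng_state = checkpointed_forward(x)
recomputed = recompute(x, rng_state)
assert torch.equal(out, recomputed)  # identical dropout mask
```

Without the `torch.set_rng_state(rng_state)` call, the assertion fails, because each dropout invocation advances the generator and samples a fresh mask.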
Signed-off-by: 李金梁 <975761915@qq.com>