• 李金梁's avatar
    fix bug of attn backward in non-casual model with context parallel open. (#1031) · 4cc220c9
    李金梁 authored
    
    
    This bug will cause bug [ERROR] failed (exitcode: -11) local_rank: 0 (pid: 1761020) of binary: ~/megatron/bin/python.
    
    That is because we miss the rng_states that is required in attention recompute (for dropout), but no hint is provided.  
    
    It is very very very difficult to trace and cost me two weeks.
    
    ```python
    before the start of training step] datetime: 2024-07-22 18:26:45 
    [2024-07-22 18:27:00,941] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -11) local_rank: 0 (pid: 1761020) of binary: /home//miniconda3/envs/megatron/bin/python
    Traceback (most recent call last):
      File "/home//miniconda3/envs/megatron/bin/torchrun", line 33, in <module>
        sys.exit(load_entry_point('torch==2.2.1+cu121', 'console_scripts', 'torchrun')())
      File "/home//miniconda3/envs/megatron/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
        return f(*args, **kwargs)
      File "/home//miniconda3/envs/megatron/lib/python3.10/site-packages/torch/distributed/run.py", line 812, in main
        run(args)
      File "/home//miniconda3/envs/megatron/lib/python3.10/site-packages/torch/distributed/run.py", line 803, in run
        elastic_launch(
      File "/home//miniconda3/envs/megatron/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
        return launch_agent(self._config, self._entrypoint, list(args))
      File "/home//miniconda3/envs/megatron/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
        raise ChildFailedError(
    torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
    ```
    Signed-off-by: default avatar李金梁 <975761915@qq.com>
    4cc220c9
attention.py 282 KB