1. 25 Jul, 2024 1 commit
    • Fix attention backward bug in non-causal models with context parallel enabled. (#1031) · 4cc220c9
      李金梁 authored
      
      
      This bug crashes the training process with exit code -11 (the process was killed by SIGSEGV): [ERROR] failed (exitcode: -11) local_rank: 0 (pid: 1761020) of binary: ~/megatron/bin/python.
      
      The cause is that the rng_states required for attention recomputation (for dropout) are missing, and nothing in the output hints at this.
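
As a rough illustration of why the RNG state matters here: when attention is recomputed during backward, dropout has to replay the same mask it used in forward, which is only possible if the generator state captured at forward time is restored first. The sketch below uses Python's standard `random` module instead of PyTorch or the real attention kernels, and `dropout`, `forward_with_saved_rng`, and `recompute` are hypothetical helper names for illustration, not functions from Megatron.

```python
import random


def dropout(values, p, rng):
    """Zero each element with probability p and rescale the survivors."""
    scale = 1.0 / (1.0 - p)
    return [0.0 if rng.random() < p else v * scale for v in values]


def forward_with_saved_rng(values, p, rng):
    """Run dropout and return the output plus the RNG state captured before it."""
    rng_state = rng.getstate()  # must be kept for the backward-time recompute
    return dropout(values, p, rng), rng_state


def recompute(values, p, rng, rng_state=None):
    """Re-run dropout, restoring the saved RNG state when one is available."""
    if rng_state is not None:
        rng.setstate(rng_state)
    return dropout(values, p, rng)


rng = random.Random(1234)
x = [float(i) for i in range(1, 17)]

out, saved_state = forward_with_saved_rng(x, p=0.5, rng=rng)

# With the saved state, the recomputed mask matches the forward pass.
restored = recompute(x, p=0.5, rng=rng, rng_state=saved_state)

# Without it, the generator has advanced, so the mask generally differs.
unrestored = recompute(x, p=0.5, rng=rng)
```

In this toy version a missing state only yields a mismatched mask; in the real training run described above, the missing rng_states surfaced as a crash with no diagnostic message.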
      
      It was very difficult to trace and cost me two weeks.
      
      ```text
      before the start of training step] datetime: 2024-07-22 18:26:45 
      [2024-07-22 18:27:00,941] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -11) local_rank: 0 (pid: 1761020) of binary: /home//miniconda3/envs/megatron/bin/python
      Traceback (most recent call last):
        File "/home//miniconda3/envs/megatron/bin/torchrun", line 33, in <module>
          sys.exit(load_entry_point('torch==2.2.1+cu121', 'console_scripts', 'torchrun')())
        File "/home//miniconda3/envs/megatron/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
          return f(*args, **kwargs)
        File "/home//miniconda3/envs/megatron/lib/python3.10/site-packages/torch/distributed/run.py", line 812, in main
          run(args)
        File "/home//miniconda3/envs/megatron/lib/python3.10/site-packages/torch/distributed/run.py", line 803, in run
          elastic_launch(
        File "/home//miniconda3/envs/megatron/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
          return launch_agent(self._config, self._entrypoint, list(args))
        File "/home//miniconda3/envs/megatron/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
          raise ChildFailedError(
      torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
      ```
      Signed-off-by: 李金梁 <975761915@qq.com>
  2. 19 Jul, 2024 1 commit
  3. 10 Jul, 2024 1 commit
  4. 03 Jul, 2024 1 commit
  5. 01 Jul, 2024 1 commit
  6. 18 Jun, 2024 2 commits
  7. 15 Jun, 2024 1 commit
  8. 14 Jun, 2024 3 commits
  9. 13 Jun, 2024 1 commit
  10. 10 Jun, 2024 1 commit
  11. 06 Jun, 2024 1 commit
  12. 30 May, 2024 1 commit
  13. 29 May, 2024 1 commit
  14. 25 May, 2024 1 commit
  15. 21 May, 2024 1 commit
  16. 17 May, 2024 1 commit
  17. 13 May, 2024 1 commit
  18. 09 May, 2024 1 commit
  19. 02 May, 2024 1 commit
  20. 30 Apr, 2024 1 commit
  21. 29 Apr, 2024 1 commit
  22. 26 Apr, 2024 1 commit
  23. 24 Apr, 2024 1 commit
  24. 16 Apr, 2024 1 commit
  25. 12 Apr, 2024 1 commit
  26. 06 Apr, 2024 1 commit
  27. 03 Apr, 2024 1 commit
  28. 21 Mar, 2024 2 commits
  29. 20 Mar, 2024 1 commit
  30. 06 Mar, 2024 1 commit
  31. 28 Feb, 2024 1 commit
  32. 24 Feb, 2024 1 commit
  33. 15 Feb, 2024 1 commit
  34. 08 Feb, 2024 1 commit
  35. 06 Feb, 2024 1 commit
  36. 03 Feb, 2024 1 commit