1. 13 Aug, 2024 1 commit
  2. 06 Aug, 2024 3 commits
  3. 02 Aug, 2024 1 commit
  4. 01 Aug, 2024 1 commit
  5. 26 Jul, 2024 1 commit
  6. 25 Jul, 2024 1 commit
      Fix attention backward bug in non-causal models when context parallelism is enabled. (#1031) · 4cc220c9
      李金梁 authored
      
      
      This bug crashes training with the error: [ERROR] failed (exitcode: -11) local_rank: 0 (pid: 1761020) of binary: ~/megatron/bin/python. Exit code -11 means the process was killed by signal 11 (a segmentation fault).
      
      The cause is that the rng_states required for attention recompute (for dropout) are missing, and the error gives no hint of this.
      
      This was very difficult to trace and cost me two weeks.
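
Why the recompute path needs the saved RNG state can be illustrated with a minimal sketch. This is not the patched Megatron code: it uses only Python's standard `random` module, and the names `dropout` and `RecomputedDropout` are hypothetical stand-ins for an attention block with dropout. In PyTorch the analogous calls would be `torch.get_rng_state()` / `torch.set_rng_state()` and their `torch.cuda` counterparts.

```python
import random


def dropout(values, p, rng):
    """Zero each element with probability p and rescale the survivors."""
    scale = 1.0 / (1.0 - p)
    return [0.0 if rng.random() < p else v * scale for v in values]


class RecomputedDropout:
    """Toy block whose forward pass is recomputed during backward (hypothetical)."""

    def __init__(self, p, rng):
        self.p = p
        self.rng = rng
        self.saved_state = None

    def forward(self, values):
        # Save the generator state before the dropout mask is sampled.
        self.saved_state = self.rng.getstate()
        return dropout(values, self.p, self.rng)

    def recompute(self, values, restore_rng=True):
        if not restore_rng:
            # Without the saved state, a different mask is sampled.
            return dropout(values, self.p, self.rng)
        # Rewind to the saved state, recompute, then put the generator back.
        current = self.rng.getstate()
        self.rng.setstate(self.saved_state)
        out = dropout(values, self.p, self.rng)
        self.rng.setstate(current)
        return out


rng = random.Random(0)
block = RecomputedDropout(p=0.5, rng=rng)
x = [float(i + 1) for i in range(64)]

out = block.forward(x)
recomputed_ok = block.recompute(x)                      # same mask as forward
recomputed_bad = block.recompute(x, restore_rng=False)  # different mask
```

With the state restored, the recomputed activations match the forward pass exactly, so gradients are computed against the same dropout mask. Without it, the recomputed output silently diverges from what forward produced.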
      
      ```text
      [before the start of training step] datetime: 2024-07-22 18:26:45
      [2024-07-22 18:27:00,941] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -11) local_rank: 0 (pid: 1761020) of binary: /home//miniconda3/envs/megatron/bin/python
      Traceback (most recent call last):
        File "/home//miniconda3/envs/megatron/bin/torchrun", line 33, in <module>
          sys.exit(load_entry_point('torch==2.2.1+cu121', 'console_scripts', 'torchrun')())
        File "/home//miniconda3/envs/megatron/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
          return f(*args, **kwargs)
        File "/home//miniconda3/envs/megatron/lib/python3.10/site-packages/torch/distributed/run.py", line 812, in main
          run(args)
        File "/home//miniconda3/envs/megatron/lib/python3.10/site-packages/torch/distributed/run.py", line 803, in run
          elastic_launch(
        File "/home//miniconda3/envs/megatron/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
          return launch_agent(self._config, self._entrypoint, list(args))
        File "/home//miniconda3/envs/megatron/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
          raise ChildFailedError(
      torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
      ```
      Signed-off-by: 李金梁 <975761915@qq.com>
  7. 19 Jul, 2024 1 commit
  8. 10 Jul, 2024 1 commit
  9. 03 Jul, 2024 1 commit
  10. 01 Jul, 2024 1 commit
  11. 18 Jun, 2024 2 commits
  12. 15 Jun, 2024 1 commit
  13. 14 Jun, 2024 3 commits
  14. 13 Jun, 2024 1 commit
  15. 10 Jun, 2024 1 commit
  16. 06 Jun, 2024 1 commit
  17. 30 May, 2024 1 commit
  18. 29 May, 2024 1 commit
  19. 25 May, 2024 1 commit
  20. 21 May, 2024 1 commit
  21. 17 May, 2024 1 commit
  22. 13 May, 2024 1 commit
  23. 09 May, 2024 1 commit
  24. 02 May, 2024 1 commit
  25. 30 Apr, 2024 1 commit
  26. 29 Apr, 2024 1 commit
  27. 26 Apr, 2024 1 commit
  28. 24 Apr, 2024 1 commit
  29. 16 Apr, 2024 1 commit
  30. 12 Apr, 2024 1 commit
  31. 06 Apr, 2024 1 commit
  32. 03 Apr, 2024 1 commit
  33. 21 Mar, 2024 2 commits
  34. 20 Mar, 2024 1 commit