  21. 25 Jul, 2024 2 commits
    • fix bug of attn backward in non-causal model with context parallel enabled. (#1031) · 4cc220c9
      李金梁 authored
      
      
      This bug causes the training job to crash with `[ERROR] failed (exitcode: -11) local_rank: 0 (pid: 1761020) of binary: ~/megatron/bin/python`.
      
      The root cause: the rng_states required for attention recompute (to reproduce the dropout mask) are missing, and the error output gives no hint of this.
      
      It was extremely difficult to trace and cost me two weeks.
      
      ```
      before the start of training step] datetime: 2024-07-22 18:26:45 
      [2024-07-22 18:27:00,941] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -11) local_rank: 0 (pid: 1761020) of binary: /home//miniconda3/envs/megatron/bin/python
      Traceback (most recent call last):
        File "/home//miniconda3/envs/megatron/bin/torchrun", line 33, in <module>
          sys.exit(load_entry_point('torch==2.2.1+cu121', 'console_scripts', 'torchrun')())
        File "/home//miniconda3/envs/megatron/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
          return f(*args, **kwargs)
        File "/home//miniconda3/envs/megatron/lib/python3.10/site-packages/torch/distributed/run.py", line 812, in main
          run(args)
        File "/home//miniconda3/envs/megatron/lib/python3.10/site-packages/torch/distributed/run.py", line 803, in run
          elastic_launch(
        File "/home//miniconda3/envs/megatron/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
          return launch_agent(self._config, self._entrypoint, list(args))
        File "/home//miniconda3/envs/megatron/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
          raise ChildFailedError(
      torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
      ```
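      The principle behind the missing rng_states can be sketched as follows. This is a minimal illustration using Python's `random` module in place of Megatron's per-rank CUDA RNG states; `dropout_mask` and the variable names are hypothetical, not the actual Megatron API:

      ```python
      import random

      def dropout_mask(n, p):
          """Draw a dropout keep-mask of length n with drop probability p."""
          return [random.random() >= p for _ in range(n)]

      # Forward pass: capture the RNG state *before* drawing the dropout mask.
      saved_state = random.getstate()
      mask_fwd = dropout_mask(8, 0.5)

      # Activation recompute: restoring the saved state reproduces the identical
      # mask, so the recomputed activations match the forward pass exactly.
      random.setstate(saved_state)
      mask_recomputed = dropout_mask(8, 0.5)
      assert mask_fwd == mask_recomputed

      # The bug: recomputing *without* restoring the state draws a fresh mask,
      # so the recomputed activations silently diverge from the forward pass.
      mask_wrong = dropout_mask(8, 0.5)
      ```

      In Megatron's case, the analogous per-rank CUDA RNG states must be tracked for each context-parallel rank so that dropout in the recomputed attention matches the forward pass.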
      Signed-off-by: 李金梁 <975761915@qq.com>
      4cc220c9
    • Fixes for pip wheels (#1042) · 1aaf1cc8
      Kirthi Shankar Sivamani authored
      
      
      * Fixes for wheels
      Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
      
      * Fix paddle wheel test
      Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
      
      ---------
      Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
      1aaf1cc8