1. 09 Sep, 2024 1 commit
  2. 05 Sep, 2024 2 commits
  3. 30 Aug, 2024 2 commits
  4. 23 Aug, 2024 1 commit
  5. 21 Aug, 2024 2 commits
  6. 20 Aug, 2024 1 commit
  7. 16 Aug, 2024 1 commit
  8. 15 Aug, 2024 2 commits
  9. 13 Aug, 2024 1 commit
  10. 06 Aug, 2024 3 commits
  11. 02 Aug, 2024 1 commit
  12. 01 Aug, 2024 1 commit
  13. 26 Jul, 2024 1 commit
  14. 25 Jul, 2024 1 commit
    • Fix attention backward bug in non-causal models when context parallelism is enabled. (#1031) · 4cc220c9
      李金梁 authored
      
      
      This bug crashes training with: [ERROR] failed (exitcode: -11) local_rank: 0 (pid: 1761020) of binary: ~/megatron/bin/python. Exit code -11 means the process was killed by signal 11, a segmentation fault.
      
      The cause is that the rng_states required for attention recompute (for dropout) are missing, and nothing in the error output hints at this.
      
      It was very difficult to trace and cost me two weeks.
      
      ```text
      [before the start of training step] datetime: 2024-07-22 18:26:45
      [2024-07-22 18:27:00,941] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -11) local_rank: 0 (pid: 1761020) of binary: /home//miniconda3/envs/megatron/bin/python
      Traceback (most recent call last):
        File "/home//miniconda3/envs/megatron/bin/torchrun", line 33, in <module>
          sys.exit(load_entry_point('torch==2.2.1+cu121', 'console_scripts', 'torchrun')())
        File "/home//miniconda3/envs/megatron/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
          return f(*args, **kwargs)
        File "/home//miniconda3/envs/megatron/lib/python3.10/site-packages/torch/distributed/run.py", line 812, in main
          run(args)
        File "/home//miniconda3/envs/megatron/lib/python3.10/site-packages/torch/distributed/run.py", line 803, in run
          elastic_launch(
        File "/home//miniconda3/envs/megatron/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
          return launch_agent(self._config, self._entrypoint, list(args))
        File "/home//miniconda3/envs/megatron/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
          raise ChildFailedError(
      torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
      ```
      Signed-off-by: 李金梁 <975761915@qq.com>
  15. 19 Jul, 2024 1 commit
  16. 10 Jul, 2024 1 commit
  17. 03 Jul, 2024 1 commit
  18. 01 Jul, 2024 1 commit
  19. 18 Jun, 2024 2 commits
  20. 15 Jun, 2024 1 commit
  21. 14 Jun, 2024 3 commits
  22. 13 Jun, 2024 1 commit
  23. 10 Jun, 2024 1 commit
  24. 06 Jun, 2024 1 commit
  25. 30 May, 2024 1 commit
  26. 29 May, 2024 1 commit
  27. 25 May, 2024 1 commit
  28. 21 May, 2024 1 commit
  29. 17 May, 2024 1 commit
  30. 13 May, 2024 1 commit
  31. 09 May, 2024 1 commit
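
Commit 4cc220c9 above attributes the crash to RNG states that were missing when attention was recomputed with dropout. The sketch below illustrates only the general idea and is not the code from that commit. It uses Python's standard `random` module as a stand-in for the framework's RNG, and the names `dropout` and `RecomputedDropout` are hypothetical. It shows why a recompute needs the RNG state captured before the original forward pass: without it, the recomputed dropout mask would differ from the one used in forward.

```python
import random


def dropout(values, p, rng):
    """Zero each element with probability p and rescale the survivors."""
    scale = 1.0 / (1.0 - p)
    return [0.0 if rng.random() < p else v * scale for v in values]


class RecomputedDropout:
    """Toy stand-in (hypothetical) for a recompute wrapper around dropout."""

    def __init__(self, p, seed=0):
        self.p = p
        self.rng = random.Random(seed)
        self.saved_state = None

    def forward(self, values):
        # Snapshot the RNG state *before* the mask is drawn.
        self.saved_state = self.rng.getstate()
        return dropout(values, self.p, self.rng)

    def recompute(self, values):
        # Fail loudly if the state is missing instead of crashing later.
        if self.saved_state is None:
            raise RuntimeError("no RNG state saved; call forward() first")
        current = self.rng.getstate()
        self.rng.setstate(self.saved_state)
        try:
            # Same state as in forward(), so the same mask is drawn.
            return dropout(values, self.p, self.rng)
        finally:
            # Restore the stream so later draws are unaffected.
            self.rng.setstate(current)


layer = RecomputedDropout(p=0.5, seed=123)
x = [1.0] * 16
out = layer.forward(x)
layer.rng.random()  # other code advances the RNG in between
assert layer.recompute(x) == out
```

Raising an explicit error when the saved state is absent is a design choice of this sketch; the commit message notes that the original failure gave no such hint.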