1. 19 May, 2022 2 commits
  2. 18 May, 2022 1 commit
    • [transformer] Allow for different backend for Pipeline Parallel ProcessGroups (#1380) · 3490b9e1
      Masaki Kozuki authored
      
      
      * NcclDistributedTestBase
      
      * fix stupid mistake
      
      * add UCC test
      
      * add UCC backend
      
      * torch ucc tests
      
      * allows for UCC backend
      
      * Set `UCX_TLS` to `tcp,cuda_copy` & Use DDP iff it makes sense
      
      * Apply 4 suggestion(s) to 1 file(s)
      
      * mix&match NCCL & UCC
      
      * use both ucc&nccl in gpt
      
      * UCC for Pipeline Parallel, NCCL for the others
      
      * conditionally use ucc
      
      * make ucc guards more friendly
      
      * test raises when torch_ucc isn't available
      
      * Change to member variable from class variable
      Co-authored-by: Aidyn Aitzhan <31858918+Aidyn-A@users.noreply.github.com>
      
      * pass async_comm to train, I mistakenly dropped it during the rebase
      
      * fix typo: functionality
      
      * Enable tensor parallel only when device count > 4
      
      I want pipeline model parallel world size to be >= 4 because
      previously I saw GPT/BERT failing when only UCC is used.
      So I'm speculating that there's some gotcha around pipeline size of 4.
      
      * Add nvidia driver version guard
      Co-authored-by: Aidyn Aitzhan <31858918+Aidyn-A@users.noreply.github.com>
      
      * move world_size as it was not correctly reflected
      
      * keep eye on the nvml api thing
      
      * import unittest
      Co-authored-by: Aidyn Aitzhan <31858918+Aidyn-A@users.noreply.github.com>
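The backend selection described in the commit above (UCC for pipeline parallel, NCCL for everything else, with a guard when `torch_ucc` is missing) might look roughly like the following. This is a minimal sketch, not the code from the commit: `ucc_available` and `choose_backends` are hypothetical names, and only the `torch_ucc` module name and the `UCX_TLS` value are taken from the commit message. The returned strings are the kind of value one would pass as the `backend` argument when creating a process group.

```python
import importlib.util
import os


def ucc_available() -> bool:
    """Return True if the optional `torch_ucc` plugin is importable."""
    return importlib.util.find_spec("torch_ucc") is not None


def choose_backends(use_ucc_for_pipeline: bool) -> dict:
    """Hypothetical helper: pick a backend name per process-group kind.

    Raises ImportError when UCC is requested but `torch_ucc` is absent,
    mirroring the "test raises when torch_ucc isn't available" bullet.
    """
    if use_ucc_for_pipeline:
        if not ucc_available():
            raise ImportError(
                "UCC backend requested but `torch_ucc` is not installed"
            )
        # The commit message sets UCX_TLS to `tcp,cuda_copy`; setdefault
        # keeps any value the user has already exported.
        os.environ.setdefault("UCX_TLS", "tcp,cuda_copy")
    return {
        "default": "nccl",
        "pipeline": "ucc" if use_ucc_for_pipeline else "nccl",
    }
```

Calling `choose_backends(False)` never touches `torch_ucc`, so NCCL-only setups are unaffected by the guard.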
  3. 13 May, 2022 1 commit
  4. 12 May, 2022 1 commit
    • Async pipeline parallel (#1373) · 3fe35211
      eqy authored
      * initial check in
      
      * fix
      
      * fix test
      
      * address some review comments and cleanup
      
      * fix
      
      * bookmark
      
      * fix sync placement to come before gather
      
      * similar fix for non-gather case
      
      * add async bert
      
      * update gpt minimal test
      
      * allow selection of default pp test
      
      * fix bert test
      
      * cleanup
      
      * cleanup
  5. 11 May, 2022 1 commit
  6. 29 Apr, 2022 3 commits
  7. 21 Apr, 2022 1 commit
  8. 20 Apr, 2022 1 commit
  9. 19 Apr, 2022 1 commit
  10. 14 Apr, 2022 1 commit
  11. 13 Apr, 2022 1 commit
  12. 08 Apr, 2022 3 commits
  13. 07 Apr, 2022 2 commits
    • Deprecation warning: `pyprof` & `reparameterization` (#1348) · 727a6452
      Masaki Kozuki authored
      * add warning to pyprof
      
      * add warning to reparameterization
      
      note: this module is already not import-able as follows:
      
      ```
      (base) root@c4bb3f161482:/vscode/apex# python -c 'import torch; import
      apex; from apex import reparameterization'
      /vscode/apex/apex/pyprof/__init__.py:5: FutureWarning: pyprof will be
      removed by the end of June, 2022
        warnings.warn("pyprof will be removed by the end of June, 2022",
      FutureWarning)
      /vscode/apex/apex/reparameterization/__init__.py:2: FutureWarning:
      reparameterization will be removed by the end of June, 2022
        warnings.warn("reparameterization will be removed by the end of June,
      2022", FutureWarning)
      Traceback (most recent call last):
        File "<string>", line 1, in <module>
        File "/vscode/apex/apex/reparameterization/__init__.py", line 4, in
      <module>
          from .weight_norm import WeightNorm
        File "/vscode/apex/apex/reparameterization/weight_norm.py", line 3, in
      <module>
          from ..fp16_utils import Fused_Weight_Norm
      ImportError: cannot import name 'Fused_Weight_Norm' from
      'apex.fp16_utils' (/vscode/apex/apex/fp16_utils/__init__.py)
      ```
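The warnings in the log above come from `warnings.warn(..., FutureWarning)` calls in each package's `__init__.py`. A minimal sketch of that pattern follows; `warn_deprecated` is a hypothetical helper used only for illustration, and the message format is copied from the log.

```python
import warnings


def warn_deprecated(module_name: str, deadline: str = "the end of June, 2022") -> None:
    """Emit a FutureWarning announcing the removal of `module_name`."""
    warnings.warn(
        f"{module_name} will be removed by {deadline}",
        FutureWarning,
        stacklevel=2,  # attribute the warning to the caller, not this helper
    )
```

`FutureWarning` is shown by default, whereas `DeprecationWarning` is hidden by the default filters outside `__main__`, which may be why it suits a message aimed at end users.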
    • [transformer] add microbatches test (#1349) · 7d903878
      Masaki Kozuki authored
      * add test
      
      * destroy model parallel was missing
  14. 05 Apr, 2022 2 commits
  15. 03 Apr, 2022 1 commit
  16. 02 Apr, 2022 4 commits
  17. 01 Apr, 2022 3 commits
  18. 31 Mar, 2022 3 commits
  19. 30 Mar, 2022 2 commits
  20. 29 Mar, 2022 2 commits
  21. 28 Mar, 2022 1 commit
  22. 25 Mar, 2022 3 commits