1. 23 Jun, 2022 1 commit
    • Move distributed Adam unit test to contrib dir (#1406) · 57f890a7
      Tim Moon authored
      * Increase default bucket size in distributed Adam
      
      * Move distributed Adam unit test to contrib tests
      
      Integrate into unit testing framework
      
      * Tweak hyperparameters for dist Adam optimizer test
      
      Improves numerical stability so we can keep tight tolerances, adopting suggestions from @crcrpar.
      
      * Use distributed test infrastructure in distributed Adam unit test
      
      Suggestion from @crcrpar.
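
      As context for the commits above, a minimal usage sketch of the contrib distributed Adam optimizer might look like the following; the import path matches apex's contrib optimizers, but the bucket-size argument name (`bucket_cap_mb`) and its value are assumptions for illustration, not taken from the commit.

      ```python
      # Hedged sketch of constructing apex's contrib distributed Adam optimizer.
      # The bucket_cap_mb name and the value 360 are assumptions; the commit
      # only says the default bucket size was increased.
      import torch
      import torch.distributed as dist
      from apex.contrib.optimizers.distributed_fused_adam import DistributedFusedAdam

      dist.init_process_group(backend="nccl")
      torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

      model = torch.nn.Linear(1024, 1024).cuda()
      optimizer = DistributedFusedAdam(
          model.parameters(),
          lr=1e-3,
          bucket_cap_mb=360,  # assumed knob; larger buckets amortize communication
      )
      ```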
  2. 22 Jun, 2022 2 commits
    • Temporary Solution to Let `FusedAdam` support BFloat16 (#1407) · 81f8ba79
      Masaki Kozuki authored
      * add temporary dispatch of double, float, half, bfloat16
      
      * fusedadam of bfloat16
      
      * Add bfloat16 path to FusedAdam
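
      For reference, exercising `FusedAdam` with bfloat16 parameters (the path this commit adds) could look roughly like the sketch below; the model and shapes are placeholders.

      ```python
      # Hedged sketch: FusedAdam stepping a bfloat16 model, the case the temporary
      # double/float/half/bfloat16 dispatch in this commit is meant to cover.
      import torch
      from apex.optimizers import FusedAdam

      model = torch.nn.Linear(256, 256).to(device="cuda", dtype=torch.bfloat16)
      optimizer = FusedAdam(model.parameters(), lr=1e-3)

      x = torch.randn(32, 256, device="cuda", dtype=torch.bfloat16)
      loss = model(x).float().sum()  # accumulate the loss in fp32 for stability
      loss.backward()
      optimizer.step()
      optimizer.zero_grad()
      ```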
    • Gradient clipping with fused kernels (#1405) · dcb02fcf
      Tim Moon authored
      * Gradient clipping routine with fused kernels
      
      Identical API to PyTorch's; falls back to the PyTorch implementation when not computing the L2 norm.
      
      * Add unit test for gradient clipping
      
      * Add fp16 case to gradient clipping unit test
      
      * Tweaks to grad clipping unit test
      
      Review suggestions from @crcrpar
      
      * Debug gradient clipping tests
      
      When checking that incorrect results produce assertion errors, make sure to generate a discrepancy outside the range of numerical error.
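
      Since the commit describes the routine as API-identical to PyTorch's gradient clipping, a call site looks like the standard PyTorch usage below; the exact apex import path for the fused routine is not given in this log, so only the PyTorch-side API is shown.

      ```python
      # Hedged sketch: the fused routine is described as a drop-in for
      # torch.nn.utils.clip_grad_norm_, so usage mirrors the PyTorch call here.
      import torch

      model = torch.nn.Linear(64, 64).cuda()
      loss = model(torch.randn(8, 64, device="cuda")).sum()
      loss.backward()

      # L2-norm clipping is the fused fast path; per the commit, other norm
      # types fall back to the PyTorch implementation.
      total_norm = torch.nn.utils.clip_grad_norm_(
          model.parameters(), max_norm=1.0, norm_type=2.0
      )
      ```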
  3. 16 Jun, 2022 1 commit
  4. 14 Jun, 2022 3 commits
  5. 13 Jun, 2022 1 commit
  6. 31 May, 2022 1 commit
  7. 20 May, 2022 1 commit
  8. 19 May, 2022 2 commits
  9. 18 May, 2022 1 commit
    • [transformer] Allow for different backend for Pipeline Parallel ProcessGroups (#1380) · 3490b9e1
      Masaki Kozuki authored
      * NcclDistributedTestBase
      
      * fix stupid mistake
      
      * add UCC test
      
      * add UCC backend
      
      * torch ucc tests
      
      * allows for UCC backend
      
      * Set `UCX_TLS` to `tcp,cuda_copy` & Use DDP iff it makes sense
      
      * Apply 4 suggestion(s) to 1 file(s)
      
      * mix&match NCCL & UCC
      
      * use both ucc&nccl in gpt
      
      * UCC for Pipeline Parallel, NCCL for the others
      
      * conditionally use ucc
      
      * make ucc guards more friendly
      
      * test raises when torch_ucc isn't available
      
      * Change from class variable to member variable
      Co-authored-by: Aidyn Aitzhan <31858918+Aidyn-A@users.noreply.github.com>
      
      * pass async_comm to train; I mistakenly dropped it during the rebase
      
      * fix typo: functionality
      
      * Enable tensor parallel only when device count > 4
      
      I want the pipeline model parallel world size to be >= 4 because I
      previously saw GPT/BERT failing when only UCC was used, so I suspect
      there is some gotcha around a pipeline size of 4.
      
      * Add nvidia driver version guard
      Co-authored-by: Aidyn Aitzhan <31858918+Aidyn-A@users.noreply.github.com>
      
      * move world_size as it was not correctly reflected
      
      * keep an eye on the nvml api thing
      
      * import unittest
      Co-authored-by: Aidyn Aitzhan <31858918+Aidyn-A@users.noreply.github.com>
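
      A rough sketch of the mix-and-match described above (NCCL for the default group, UCC for the pipeline-parallel group) is below; the group layout and environment setup are illustrative assumptions, not the apex test's actual topology.

      ```python
      # Hedged sketch: NCCL for the world/default group, UCC for a separate
      # pipeline-parallel group, as the commit messages describe. Requires a
      # UCC-capable PyTorch build (e.g. via torch_ucc); ranks are illustrative.
      import os
      import torch.distributed as dist

      os.environ.setdefault("UCX_TLS", "tcp,cuda_copy")  # as set in the commit

      dist.init_process_group(backend="nccl")  # tensor/data parallel stay on NCCL
      world_size = dist.get_world_size()

      # Point-to-point pipeline traffic goes through a UCC-backed group.
      pipeline_group = dist.new_group(ranks=list(range(world_size)), backend="ucc")
      ```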
  10. 13 May, 2022 1 commit
  11. 12 May, 2022 1 commit
    • Async pipeline parallel (#1373) · 3fe35211
      eqy authored
      * initial check in
      
      * fix
      
      * fix test
      
      * address some review comments and cleanup
      
      * fix
      
      * bookmark
      
      * fix sync placement to come before gather
      
      * similar fix for non-gather case
      
      * add async bert
      
      * update gpt minimal test
      
      * allow selection of default pp test
      
      * fix bert test
      
      * cleanup
      
      * cleanup
  12. 11 May, 2022 1 commit
  13. 29 Apr, 2022 3 commits
  14. 21 Apr, 2022 1 commit
  15. 20 Apr, 2022 1 commit
  16. 19 Apr, 2022 1 commit
  17. 14 Apr, 2022 1 commit
  18. 13 Apr, 2022 1 commit
  19. 08 Apr, 2022 3 commits
  20. 07 Apr, 2022 2 commits
    • Deprecation warning: `pyprof` & `reparameterization` (#1348) · 727a6452
      Masaki Kozuki authored
      * add warning to pyprof
      
      * add warning to reparameterization
      
      note: this module already fails to import, as shown below:
      
      ```
      (base) root@c4bb3f161482:/vscode/apex# python -c 'import torch; import apex; from apex import reparameterization'
      /vscode/apex/apex/pyprof/__init__.py:5: FutureWarning: pyprof will be removed by the end of June, 2022
        warnings.warn("pyprof will be removed by the end of June, 2022", FutureWarning)
      /vscode/apex/apex/reparameterization/__init__.py:2: FutureWarning: reparameterization will be removed by the end of June, 2022
        warnings.warn("reparameterization will be removed by the end of June, 2022", FutureWarning)
      Traceback (most recent call last):
        File "<string>", line 1, in <module>
        File "/vscode/apex/apex/reparameterization/__init__.py", line 4, in <module>
          from .weight_norm import WeightNorm
        File "/vscode/apex/apex/reparameterization/weight_norm.py", line 3, in <module>
          from ..fp16_utils import Fused_Weight_Norm
      ImportError: cannot import name 'Fused_Weight_Norm' from 'apex.fp16_utils' (/vscode/apex/apex/fp16_utils/__init__.py)
      ```
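
      The warning lines in the output above imply a module-level deprecation pattern along these lines (reconstructed from the printed warning text, not copied from the commit diff):

      ```python
      # Hedged sketch of the deprecation warning emitted at module import time,
      # e.g. in apex/pyprof/__init__.py, inferred from the FutureWarning output.
      import warnings

      warnings.warn("pyprof will be removed by the end of June, 2022", FutureWarning)
      ```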
    • [transformer] add microbatches test (#1349) · 7d903878
      Masaki Kozuki authored
      * add test
      
      * the destroy model parallel call was missing
  21. 05 Apr, 2022 2 commits
  22. 03 Apr, 2022 1 commit
  23. 02 Apr, 2022 4 commits
  24. 01 Apr, 2022 3 commits
  25. 31 Mar, 2022 1 commit