  1. 08 Aug, 2022 1 commit
  2. 29 Jul, 2022 5 commits
  3. 28 Jul, 2022 1 commit
  4. 26 Jul, 2022 2 commits
  5. 25 Jul, 2022 1 commit
  6. 21 Jul, 2022 2 commits
  7. 20 Jul, 2022 1 commit
    • [transformer] UCC async test (#1417) · a29a698f
      Aidyn-A authored
      * add test
      
      * update batch sizes
      
      * update batch sizes
      
      * small updates
      
      * delete comment
      
      * add async comm
      
      * add sync if needed
      
      * update tests
      
      * remove redundant imports
      
      * code cleanup
      
      * minor updates
      
      * update dtype for comparison
      
      * fix dtypes
      
      * fix typo
      
      * modify sizes and use common_utils.find_free_port
      
      * fix typo and use double precision
      
      * revert some changes, create test for profiling on L1
      
      * remove redundant line
      
      * revert UCC_TLS and add sync to fwd_bwd
      
      * code clean up
      
      * code clean up
      
      * modify BERT test
      
      * add comment
      a29a698f
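
The commit adds asynchronous communication to the test plus explicit synchronization where it is needed. A minimal sketch of that pattern, not the test itself: the `run` entry point, tensor shape, and launcher setup are illustrative, and the "ucc" backend additionally requires the torch_ucc plugin.

```python
import torch
import torch.distributed as dist

def run(rank: int, world_size: int, backend: str = "nccl") -> None:
    # Assumes MASTER_ADDR/MASTER_PORT are set by the launcher
    # (e.g. a port obtained via common_utils.find_free_port()).
    dist.init_process_group(backend=backend, rank=rank, world_size=world_size)
    device = torch.device("cuda", rank % torch.cuda.device_count())

    t = torch.ones(4, device=device, dtype=torch.float64)  # double precision for a tight comparison
    work = dist.all_reduce(t, op=dist.ReduceOp.SUM, async_op=True)  # async communication
    # ... independent compute can overlap with the collective here ...
    work.wait()  # sync before the result is consumed
    assert torch.allclose(t, torch.full_like(t, world_size))

    dist.destroy_process_group()
```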
  8. 14 Jul, 2022 2 commits
  9. 11 Jul, 2022 1 commit
  10. 07 Jul, 2022 1 commit
  11. 05 Jul, 2022 2 commits
  12. 23 Jun, 2022 2 commits
    • [transformer] Port Sequence Parallelism (takeover of #1396) (#1400) · 3ff1a10f
      Masaki Kozuki authored
      * it looks possible to remove this file
      
      * add communication collectives
      
      * update Column|RowParallelLinear
      
      * update checkpoint function
      
      * update function name
      
      * parity between public and private collectives
      
      * row parallel linear
      
      * column parallel linear
      
      * sequence parallel: p2p comm
      
      fix typo
      
      * sequence parallel: pipeline parallel
      
      * fix typo
      
      * add layernorm with sequence_parallel_enabled attr
      
      * class variable -> member variable
      
      * fix col parallel test with sequence parallel
      
      * Initial test of `forward_backward_pipelining_without_interleaving` with `model_type=ModelType.encoder_and_decoder`
      
      * add cases pretending to test sequence_parallel
      
      * Apply 2 suggestion(s) to 1 file(s)
      
      * update sequence_parallel_enabled docstring
      
      * update docstring: order of tensor dimensions, sequence_parallel_enabled behavior
      
      * Divide sequence_length if sequence parallel
      
      tensor shape should be updated if sequence parallel is enabled.
      
      * cherry-pick https://github.com/NVIDIA/Megatron-LM/commit/8474e6e54fcb9dfa37aea039352f9fb485fb6f61
      
      * type annotation
      
      * Fix matmul call in RowParallelLinear
      
      Pin `sequence_parallel_enabled` to `False`, as you can see in
      https://github.com/NVIDIA/Megatron-LM/blob/d898a8991d1a08d29074f87819d1bf41517e35f5/megatron/mpu/layers.py#L511-L514
      
      * update rowparallellinear test
      
      * fix: `loss_weight` is not defined in test_layers
      
      * @eqy's comment
      
      * mixed fused layer norm
      
      * fix typo
      
      * misc
      
      * test_layers cleanup
      
      * Skip Bert/GPT script
      
      Since these two models haven't been updated for sequence parallel yet, e.g. the change of dimension order from (batch, sequence, feature) to (sequence, batch, feature) and the global argument variables.
      
      * debug part 1/N: comment out `x.retain_grad`
      
      * debug part 2/N: [ColumnParallelLinear] comment out overriding of sequence_parallel_enabled
      
      * debug 3/N: add pipeline test with parallel mlp
      
      * Fix handling `self.input_tensor` and argument
      
      * tp2pp4 ModelType.encoder_or_decoder is failing, which may be my fault because the backward is complaining that the output and grad_output shapes don't match
      
      * revert debug 1/N
      
      * defer tensor model parallel size > 1
      
      * split tensor in sequence dim
      
      * cosmetic
      
      * cosmetic: remove archaic comment
      
      * enable TP>1 for encoder_and_decoder as well
      
      * set requires_grad=True always...
      
      * Set `scatter_gather_tensors_in_pipeline` to :obj:`False`
      
      so that NeMo Megatron's GPT works with sequence parallel enabled.
      
      * brush up comment of `requires_grad()`
      
      According to @ptrblck, PyTorch DistributedDataParallel can hang
      when some tensor (or parameter) doesn't require grad.
      As I understand it, this forced `requires_grad` is a different issue from that.
      
      * misc changes of scatter_gather_tensors_in_pipeline comment
      
      * guard for torch_ucc
      
      * cosmetic changes related to tests
      
      * update command line arguments
      
      * update TransformerLanguageModel
      
      * rename
      
      * move gpt to gpt.py
      
      * update bert
      
      * add all_gather for params in sequence parallel region
      
      * misc. some diffs were lost during rebasing...
      
      * updates for non-sequence-parallel execution
      
      * gpt with sequence parallel
      
      * Apply 2 suggestion(s) to 2 file(s)
      
      * update tensor&pipeline parallel size
      
      * why is `sequence_parallel_enabled` not supplied!? Did I mess something up when rebasing?
      
      * cosmetic fix
      
      * correct key is sequence_parallel_enabled
      3ff1a10f
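
For the "split tensor in sequence dim" and all-gather steps described above, here is a minimal sketch of the two collectives, assuming the (sequence, batch, hidden) layout; the function names mirror the commit's description rather than apex.transformer's exact API.

```python
import torch
import torch.distributed as dist

def scatter_to_sequence_parallel_region(x: torch.Tensor, group=None) -> torch.Tensor:
    # Keep only this rank's chunk of the sequence dimension (dim 0).
    # Assumes the sequence length divides evenly by the group's world size.
    world_size = dist.get_world_size(group)
    rank = dist.get_rank(group)
    chunk = x.size(0) // world_size
    return x[rank * chunk:(rank + 1) * chunk].contiguous()

def gather_from_sequence_parallel_region(x: torch.Tensor, group=None) -> torch.Tensor:
    # All-gather the local chunks back into the full sequence.
    world_size = dist.get_world_size(group)
    out = [torch.empty_like(x) for _ in range(world_size)]
    dist.all_gather(out, x.contiguous(), group=group)
    return torch.cat(out, dim=0)
```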
    • Move distributed Adam unit test to contrib dir (#1406) · 57f890a7
      Tim Moon authored
      * Increase default bucket size in distributed Adam
      
      * Move distributed Adam unit test to contrib tests
      
      Integrate into the unit testing framework
      
      * Tweak hyperparameters for dist Adam optimizer test
      
      Improves numerical stability so we can keep tight tolerances. Adopting suggestions from @crcrpar.
      
      * Use distributed test infrastructure in distributed Adam unit test
      
      Suggestion from @crcrpar.
      57f890a7
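
The tolerance tweaks above are about comparing the distributed optimizer against a plain Adam reference with tight bounds. A rough sketch of that checking pattern; `make_candidate_optimizer` is a hypothetical factory for the optimizer under test, and the process-group setup the real test needs is omitted.

```python
import torch

def compare_against_adam(make_candidate_optimizer, steps: int = 10) -> None:
    torch.manual_seed(0)
    ref_model = torch.nn.Linear(8, 8)
    test_model = torch.nn.Linear(8, 8)
    test_model.load_state_dict(ref_model.state_dict())

    ref_opt = torch.optim.Adam(ref_model.parameters(), lr=1e-3)
    test_opt = make_candidate_optimizer(test_model.parameters())

    for _ in range(steps):
        x = torch.randn(4, 8)
        for model, opt in ((ref_model, ref_opt), (test_model, test_opt)):
            opt.zero_grad()
            model(x).pow(2).mean().backward()
            opt.step()

    # Tight tolerances: the candidate should track Adam step for step.
    for p_ref, p_test in zip(ref_model.parameters(), test_model.parameters()):
        torch.testing.assert_close(p_test, p_ref, rtol=1e-5, atol=1e-5)
```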
  13. 22 Jun, 2022 2 commits
    • Temporary Solution to Let `FusedAdam` support BFloat16 (#1407) · 81f8ba79
      Masaki Kozuki authored
      * add temporary dispatch of double, float, half, bfloat16
      
      * fusedadam of bfloat16
      
      * Add bfloat16 path to FusedAdam
      81f8ba79
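
With the bfloat16 dispatch in place, `FusedAdam` can step a bf16 model directly. A minimal sketch, assuming apex is installed with its CUDA extensions; the model size, learning rate, and fp32 loss cast are illustrative.

```python
import torch
from apex.optimizers import FusedAdam

model = torch.nn.Linear(16, 16, device="cuda", dtype=torch.bfloat16)
opt = FusedAdam(model.parameters(), lr=1e-3)

x = torch.randn(8, 16, device="cuda", dtype=torch.bfloat16)
loss = model(x).float().pow(2).mean()  # reduce in fp32 for a stable loss
loss.backward()
opt.step()
opt.zero_grad()
```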
    • Gradient clipping with fused kernels (#1405) · dcb02fcf
      Tim Moon authored
      * Gradient clipping routine with fused kernels
      
      Identical API to PyTorch's. Falls back to the PyTorch impl when not computing the L2 norm.
      
      * Add unit test for gradient clipping
      
      * Add fp16 case to gradient clipping unit test
      
      * Tweaks to grad clipping unit test
      
      Review suggestions from @crcrpar
      
      * Debug gradient clipping tests
      
      When checking that incorrect results produce assertion errors, make sure to generate a discrepancy outside the range of numerical error.
      dcb02fcf
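
The routine keeps PyTorch's `clip_grad_norm_` signature and only uses the fused kernels for the L2 norm, deferring to PyTorch otherwise. A sketch of those semantics in plain PyTorch, as a hypothetical stand-in rather than the fused implementation:

```python
import torch
from typing import Iterable

def clip_grad_norm_(parameters: Iterable[torch.Tensor], max_norm: float,
                    norm_type: float = 2.0) -> torch.Tensor:
    params = [p for p in parameters if p.grad is not None]
    if not params:
        return torch.tensor(0.0)
    if norm_type != 2.0:
        # Non-L2 norms fall back to the stock PyTorch implementation.
        return torch.nn.utils.clip_grad_norm_(params, max_norm, norm_type)
    # L2 path (the fused case): one total norm over all grads, one in-place scale.
    total_norm = torch.norm(torch.stack([p.grad.detach().norm(2) for p in params]), 2)
    clip_coef = max_norm / (total_norm + 1e-6)
    if clip_coef < 1:
        for p in params:
            p.grad.detach().mul_(clip_coef)
    return total_norm
```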
  14. 16 Jun, 2022 1 commit
  15. 14 Jun, 2022 3 commits
  16. 13 Jun, 2022 1 commit
  17. 31 May, 2022 2 commits
  18. 20 May, 2022 1 commit
  19. 19 May, 2022 2 commits
  20. 18 May, 2022 1 commit
    • [transformer] Allow for different backend for Pipeline Parallel ProcessGroups (#1380) · 3490b9e1
      Masaki Kozuki authored
      
      
      * NcclDistributedTestBase
      
      * fix stupid mistake
      
      * add UCC test
      
      * add UCC backend
      
      * torch ucc tests
      
      * allows for UCC backend
      
      * Set `UCX_TLS` to `tcp,cuda_copy` & Use DDP iff it makes sense
      
      * Apply 4 suggestion(s) to 1 file(s)
      
      * mix&match NCCL & UCC
      
      * use both ucc&nccl in gpt
      
      * UCC for Pipeline Parallel, NCCL for the others
      
      * conditionally use ucc
      
      * make ucc guards more friendly
      
      * test raises when torch_ucc isn't available
      
      * Change to member variable from class variable
      Co-authored-by: Aidyn Aitzhan <31858918+Aidyn-A@users.noreply.github.com>
      
      * pass async_comm to train, I mistakenly dropped it during the rebase
      
      * fix typo: functionality
      
      * Enable tensor parallel only when device count > 4
      
      I want the pipeline model parallel world size to be >= 4 because
      I previously saw GPT/BERT failing when only UCC is used,
      so I'm speculating that there's some gotcha around a pipeline size of 4.
      
      * Add nvidia driver version guard
      Co-authored-by: Aidyn Aitzhan <31858918+Aidyn-A@users.noreply.github.com>
      
      * move world_size as it was not correctly reflected
      
      * keep an eye on the nvml api thing
      
      * import unittest
      Co-authored-by: Aidyn Aitzhan <31858918+Aidyn-A@users.noreply.github.com>
      3490b9e1
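
Mixing NCCL and UCC comes down to creating the pipeline-parallel group with a different backend. A minimal sketch, assuming a build where the "ucc" backend is available (e.g. via the torch_ucc plugin) and that the script is launched with torchrun; the rank layout is illustrative, not apex.transformer's actual pipeline grouping.

```python
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")  # default group for tensor/data parallel collectives
world_size = dist.get_world_size()
rank = dist.get_rank()
torch.cuda.set_device(rank % torch.cuda.device_count())

# Put every rank into a pipeline group that communicates over UCC instead of NCCL.
pp_group = dist.new_group(ranks=list(range(world_size)), backend="ucc")

t = torch.ones(1, device="cuda")
dist.all_reduce(t, group=pp_group)  # goes through the UCC backend
assert t.item() == world_size
```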
  21. 13 May, 2022 1 commit
  22. 12 May, 2022 1 commit
    • Async pipeline parallel (#1373) · 3fe35211
      eqy authored
      * initial check in
      
      * fix
      
      * fix test
      
      * address some review comments and cleanup
      
      * fix
      
      * bookmark
      
      * fix sync placement to come before gather
      
      * similar fix for non-gather case
      
      * add async bert
      
      * update gpt minimal test
      
      * allow selection of default pp test
      
      * fix bert test
      
      * cleanup
      
      * cleanup
      3fe35211
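
The sync-placement fixes above are about waiting on the asynchronous point-to-point handles before the received activations are (scatter-)gathered and used. A rough sketch of that pattern; `exchange_activations` and its rank arguments are illustrative, not the scheduler in apex.transformer.pipeline_parallel.

```python
import torch
import torch.distributed as dist

def exchange_activations(send_tensor, recv_tensor, next_rank, prev_rank):
    ops = []
    if recv_tensor is not None:
        ops.append(dist.P2POp(dist.irecv, recv_tensor, prev_rank))
    if send_tensor is not None:
        ops.append(dist.P2POp(dist.isend, send_tensor, next_rank))
    reqs = dist.batch_isend_irecv(ops) if ops else []
    # ... independent work can overlap with the transfers here ...
    for req in reqs:
        req.wait()  # sync must come before any gather/consumption of recv_tensor
    return recv_tensor
```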
  23. 11 May, 2022 1 commit
  24. 29 Apr, 2022 3 commits