1. 14 Jun, 2022 2 commits
  2. 31 May, 2022 1 commit
  3. 20 May, 2022 1 commit
  4. 19 May, 2022 2 commits
  5. 18 May, 2022 1 commit
    • [transformer] Allow for different backend for Pipeline Parallel ProcessGroups (#1380) · 3490b9e1
      Masaki Kozuki authored

      * NcclDistributedTestBase
      
      * fix stupid mistake
      
      * add UCC test
      
      * add UCC backend
      
      * torch ucc tests
      
      * allows for UCC backend
      
      * Set `UCX_TLS` to `tcp,cuda_copy` & Use DDP iff it makes sense
      
      * Apply 4 suggestion(s) to 1 file(s)
      
      * mix&match NCCL & UCC
      
      * use both ucc&nccl in gpt
      
      * UCC for Pipeline Parallel, NCCL for the others
      
      * conditionally use ucc
      
      * make ucc guards more friendly
      
      * test raises when torch_ucc isn't available
      
      * Change to member variable from class variable
      Co-authored-by: Aidyn Aitzhan <31858918+Aidyn-A@users.noreply.github.com>
      
      * pass async_comm to train, I mistakenly dropped it during the rebase
      
      * fix typo: functionality
      
      * Enable tensor parallel only when device count > 4
      
      I want the pipeline model parallel world size to be >= 4 because I
      previously saw GPT/BERT failing when only UCC is used, so I suspect
      there's some gotcha around a pipeline size of 4.
      
      * Add nvidia driver version guard
      Co-authored-by: Aidyn Aitzhan <31858918+Aidyn-A@users.noreply.github.com>
      
      * move world_size as it was not correctly reflected
      
      * keep eye on the nvml api thing
      
      * import unittest
      Co-authored-by: Aidyn Aitzhan <31858918+Aidyn-A@users.noreply.github.com>
      3490b9e1
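The mixed-backend setup this commit describes (UCC for the pipeline-parallel group, NCCL for everything else, with `UCX_TLS` set for the UCC path) can be sketched as a small helper. The function name and dict layout below are illustrative, not apex's actual API:

```python
import os

def pick_backends(use_ucc_for_pipeline: bool = True) -> dict:
    """Choose a ProcessGroup backend per parallel group (hypothetical helper,
    not part of apex): UCC for pipeline parallelism, NCCL elsewhere."""
    backends = {
        "data_parallel": "nccl",
        "tensor_parallel": "nccl",
        "pipeline_parallel": "ucc" if use_ucc_for_pipeline else "nccl",
    }
    if "ucc" in backends.values():
        # The commit pins the UCX transports for the UCC/UCX path.
        os.environ.setdefault("UCX_TLS", "tcp,cuda_copy")
    return backends
```

Each group's backend string would then be passed to `torch.distributed.new_group(..., backend=...)` when the process groups are created.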
  6. 12 May, 2022 1 commit
    • Async pipeline parallel (#1373) · 3fe35211
      eqy authored
      * initial check in
      
      * fix
      
      * fix test
      
      * address some review comments and cleanup
      
      * fix
      
      * bookmark
      
      * fix sync placement to come before gather
      
      * similar fix for non-gather case
      
      * add async bert
      
      * update gpt minimal test
      
      * allow selection of default pp test
      
      * fix bert test
      
      * cleanup
      
      * cleanup
      3fe35211
  7. 11 May, 2022 1 commit
  8. 29 Apr, 2022 1 commit
  9. 07 Apr, 2022 1 commit
  10. 25 Mar, 2022 1 commit
    • [transformer] Format & Test Refactoring (#1325) · a0ed4151
      Masaki Kozuki authored
      * try PyTorch custom TestCase class
      
      * revert
      
      * initial working example
      
      * update
      
      * data utils
      
      * fix imports
      
      * hardcode backend to nccl
      
      * fix signature
      
      * fix typo
      
      * mapping
      
      * set device
      
      * init
      
      * refactor x entropy
      
      * remove unused import & destroy model parallel
      
      * refactor random
      
      * fix test
      
      * remove migrated tests
      
      * refactor
      
      * init
      
      * separate affine weight init
      
      * init model parallel
      
      * split more
      
      * weight init fix part 1
      
      * use cpu init for consistency btwn native and tensor parallel
      
      * black
      
      * add col parallel
      
      * use a 3D tensor of square matrix for column parallel linear
      
      * skip the failing cases
      
      * migrate layers test
      
      * pipeline parallel forward/backward
      
      * fix typo
      
      * fix typo
      
      * fix
      
      * fix pipeline world size
      
      * black
      
      * rm `run_pipeline_parallel_test` in favor of test_pipeline_parallel_fwd_bwd.py
      
      * stop logging
      
      * set log level
      
      * black
      
      * license and format
      
      * fix
      
      * skip tf32 as matrices are small
      
      * remove potentially inappropriate license
      
      * Apply suggestions from code review
      
      * remove `TODO` comment
      
      * `torch.testing.assert_allclose` -> `torch.testing.assert_close`
      
      * remove comment-outs
      
      * remove unused import
      
      * minor fix
      a0ed4151
  11. 26 Feb, 2022 1 commit
  12. 25 Feb, 2022 1 commit
  13. 23 Feb, 2022 1 commit
  14. 04 Feb, 2022 1 commit
  15. 31 Jan, 2022 1 commit
  16. 28 Jan, 2022 2 commits
    • small changes in test and logger format (#1278) · b1c75f6f
      Masaki Kozuki authored
      * cosmetic refactor in test
      
      * log with PID
      
      * log more info: rank, pid, filename, lineNo
      b1c75f6f
    • allow for `None` batch (#1280) · a960fe8c
      Masaki Kozuki authored
      * have get_kth_microbatch deal with None batch
      
      * broadcast based on tensor parallel rank
      
      * dtype
      
      * remove unnecessary .cuda()
      
      Processes with tensor parallel rank != 0 don't need to prepare `torch.utils.data.DataLoader` instances, which means the `batch` argument of the `get_kth_microbatch` function can be `None`, but the current implementation doesn't allow for it.
      a960fe8c
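The `None`-batch handling this commit adds can be sketched as follows; the signature and the dict-of-sequences batch layout are illustrative, since apex's real function slices dicts of `torch.Tensor`s:

```python
def get_kth_microbatch(batch, k, micro_batch_size):
    """Return the k-th microbatch slice of `batch`, or None when `batch`
    is None (tensor-parallel ranks != 0 may not build a DataLoader at all).
    Sketch only, not apex's exact implementation."""
    if batch is None:
        return None
    start = k * micro_batch_size
    end = start + micro_batch_size
    return {key: data[start:end] for key, data in batch.items()}
```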
  17. 21 Jan, 2022 1 commit
  18. 17 Dec, 2021 1 commit
    • Add an argument of `dtype` to forward_backward functions to specify the dtype... · b88c507e
      Masaki Kozuki authored
      Add an argument of `dtype` to forward_backward functions to specify the dtype used in p2p comm (#1249)
      
      * let users specify the dtype for p2p comm, taking the possibility of O2-style AMP into account
      
      * add `dtype` argument to forward_backward functions
      
      * fix
      
      * better message
      
      * add docstring of dtype
      
      * add a link to dtype logic of p2p comm
      b88c507e
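The dtype-resolution logic this commit motivates can be sketched as below. The function name, parameter names, and precedence rules are assumptions for illustration, not apex's exact behavior; dtypes are strings here to keep the sketch dependency-free:

```python
def resolve_p2p_dtype(params_dtype, dtype=None, fp32_residual_connection=False):
    """Pick the dtype for pipeline p2p communication (illustrative logic).
    An explicit `dtype` wins, which matters for O2-style AMP where master
    weights are fp32 while activations travel as fp16."""
    if dtype is not None:
        return dtype
    if fp32_residual_connection:
        return "float32"
    return params_dtype
```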
  19. 16 Dec, 2021 1 commit
  20. 14 Dec, 2021 1 commit
  21. 10 Dec, 2021 2 commits
    • Cherry-pick Megatron-LM's changes in pipeline model parallel for T5 (#1232) · 0e25fcc4
      Masaki Kozuki authored
      * update parallel_state
      
      * update pipeline common funcs - forward_step and backward_step
      
      * update pipelining w/o interleaving
      
      * type hint
      
      * merge utils into without_interleaving
      
      Motivation: functions in utils are only used by
      forward_backward_pipelining_without_interleaving
      
      * fix handling of `model_type`
      
      * fix import of DDP
      
      * update set_input_tensor method
      
      * fix
      
      * cosmetic
      
      * update model
      
      * refactor pipeline test scripts
      0e25fcc4
    • Minimal gpt pipeline parallel (builds off of minimal_bert_pipeline_parallel)... · ab7af058
      Rishi Puri authored

      Minimal gpt pipeline parallel (builds off of minimal_bert_pipeline_parallel) including cpu-offloading (#1222)
      
      * minimal bert pipeline parallel test
      
      * first draft of gpt minimal test

      * framework to scale up the gpt2 test for variety of distributed setups
      
      * adding gpt_minimal_test to list of multigpu tests
      Co-authored-by: Eddie Yan <eddiey@nvidia.com>
      Co-authored-by: riship <riship@nvidia.com>
      ab7af058
  22. 09 Dec, 2021 1 commit
  23. 19 Nov, 2021 2 commits
    • minimal bert pipeline parallel test (#1216) · aa756cec
      eqy authored
      * minimal bert pipeline parallel test
      
      * fix global and cleanup
      
      * use get_forward_backward_func
      
      * cleanup and fix some tests
      aa756cec
    • [POC] Support Megatron-LM's `rampup_batch_size` argument (#1212) · 35336133
      Masaki Kozuki authored
      * init logging use
      
      * fix
      
      * clean up
      
      * fp32 p2p comm
      
      * init
      
      * Dynamic global batch size with `MegatronPretrainingSampler`
      
      I couldn't make this script work with `MegatronPretrainingRandomSampler` because the random sampler seems to have some requirements around
      global batch size, total number of samples, local minibatch size, etc. that I'm not yet familiar with.
      
      * revive original pipeline parallel test
      
      * update MULTIGPU_TEST: add dynamic batchsize test
      
      * run MegatronPretrainingRandomSampler
      
      * fix comment
      
      * fix
      
      * update
      
      * cosmetic
      
      * add note
      
      * Apply 2 suggestion(s) to 2 file(s)
      
      * change following https://github.com/NVIDIA/apex/pull/1210
      
      * fix
      35336133
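The dynamic global batch size the POC supports can be sketched as a linear rampup in the spirit of Megatron-LM's `rampup_batch_size`: the global batch size grows from a start value to the target in fixed increments as samples are consumed. The function and parameter names below are illustrative, not the actual apex/Megatron API:

```python
def rampup_global_batch_size(consumed_samples, start, increment,
                             ramp_samples, target):
    """Linear batch-size rampup sketch: grow from `start` to `target` in
    steps of `increment`, spread evenly over `ramp_samples` consumed samples.
    Assumes (target - start) is a multiple of `increment`."""
    assert (target - start) % increment == 0
    if consumed_samples >= ramp_samples:
        return target
    num_increments = (target - start) // increment
    samples_per_increment = ramp_samples / num_increments
    completed = int(consumed_samples / samples_per_increment)
    return start + completed * increment
```

For example, ramping 32 → 128 in steps of 32 over 300 samples bumps the batch size every 100 consumed samples.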
  24. 10 Nov, 2021 1 commit
  25. 27 Oct, 2021 1 commit
    • Pipeline Model Parallel (#1202) · 63d5dd63
      Masaki Kozuki authored

      * Init apex.ppu (pipeline model parallel utility)
      
      Reference commit:
      
      ```
      commit 5ab646376d67831601d5552c193241d017f1b35c (HEAD -> main, internal/main)
      Merge: 14f2c684 7b293d9b
      Author: Mohammad Shoeybi <mshoeybi@nvidia.com>
      Date:   Wed Sep 22 22:57:54 2021 -0700
      
          Merge branch 'add_BOS' into 'main'
      
          Add Beginning of Sentence token option and adding semaphore while multi-threading to prevent crashes and hangs due to connection keep-alives
      
          See merge request ADLR/megatron-lm!328
      ```
      
      * removing get_args and replace import - phase 1
      
      * removing get_args and replace import - phase 2
      
      * move ppu to apex.transformer.pipeline_parallel
      
      * update two __init__.py
      
      * update READMEs
      
      * mpu -> parallel_state & tensor_parallel
      
      * fix
      
      * remove not pipeline files
      
      * separate schedules.py - phase 1
      
      * dissect schedules.py
      
      * data_iterators -> batch
      
      * remove optimizer from forward_backward_step funcs
      
      * init test
      
      * Apply 2 suggestion(s) to 2 file(s)
      
      * fix cyclic import
      
      * fix syntax of Callable
      
      * fix - 1
      
      * move directory as testing used for pp test as well
      
      * add some functions for num microbatches calculator
      
      * model is a list in pipeline parallel
      
      * skip build num microbatch calculator
      
      * fix test
      
      * assert -> raise
      
      * skip args printing
      
      * specify tensor shape everywhere even if None - phase 1
      
      * private timers
      
      * passing tensor shape & dtype around
      
      * update dtype handling by introducing helper func
      
      * write helper func to reduce cyclomatic complexity
      
      * remove duplicate
      
      * update
      
      * move split_tensor_into_1d_equal_chunks to avoid cyclic import
      
      * tmp
      
      * cosmetic
      
      * move gather_split_1d_tensor to avoid cyclic imports
      
      * remove debug print
      
      * add outer loop
      
      * early return if possible
      
      * cosmetic
      
      * passing around tensor shape
      
      * refactor test
      
      * add script to learn batch sampler behavior
      
      * update
      
      * minibatch splitter
      
      * add minibatch splitter
      
      * split minibatch into microbatches
      
      * minor changes
      
      * uncomment split batch for test sake
      
      * set as attribute
      
      * study the behavior of no pipelining
      
      * debug 1
      
      * reflect test util namespace change
      
      * update readme
      
      * cosmetic in test
      
      * add model build helper func for interleaving shced
      
      * adding model builder from megatron
      
      * fix a possible cyclic import
      
      * fix
      
      * enable interleaving test, but failing even if forward only
      
      * fix batch preparation
      
      * add explanation
      
      * print data parallel size
      
      * fix typo
      
      * Add Megatron style GPT model by Rishi
      Co-authored-by: Rishi Puri <riship@nvidia.com>
      
      * update
      
      * type hint for jit
      
      * fix forward_backward_no_pipelining test
      
      * pipeline forward backward seem to hang if not forward only
      
      * fix typo
      
      * debug
      
      * add p2p test
      
      * simplify
      
      * fix
      
      * tentative
      
      * set both tmp and pmp to 1
      
      * init
      
      * fix typo
      
      * fix
      
      * fix path of divide
      
      * set seed for tmp
      
      * update upon Eddie comment
      
      * fix typo
      
      * adding failing data loader test
      
      * fix
      
      * megatron still failing
      
      * check in
      
      * with the nested loop of new order, interleaving seems fine
      
      * cosmetic change
      
      * make `forward_backward_pipelining_with_interleaving` private
      
      * warn users that interleaving sched is unstable
      
      * move noop handler to no pipelining
      
      * comment out rank_print
      
      * make `build_model` more flexible
      
      * skip megatron test tentatively
      
      * correctly comment out rank_print
      
      * correctly comment out rank_print
      
      * correctly comment out rank_print
      
      * skip appropriately
      
      * remove wip p2p comm test
      
      * update type hint of model_provider_func
      
      * disable tf32 in each test script
      
      * skip interleaving w/ backward
      
      * rename as mpu is the old name
      
      * remove broken case
      
      * expose build_model func
      
      * delete `dist.ring_exchange` func call and `use_ring_exchange` argument
      
      * nit fixes
      
      * check in
      
      * remove unused file
      
      * update the list
      
      * update tensor shape
      
      * remove mixed dtype case
      
      * use torch.distributed.run
      
      * 2020 -> 2021
      
      * another 2020 -> 2021
      
      * docstring & type hint
      
      * fix teardown
      
      * update
      
      * change to experimental
      
      * check if warned
      Co-authored-by: Rishi Puri <riship@nvidia.com>
      Co-authored-by: Eddie Yan <eddiey@nvidia.com>
      63d5dd63
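Among the moves this commit makes to break cyclic imports are `split_tensor_into_1d_equal_chunks` and `gather_split_1d_tensor`. Their behavior can be sketched with plain lists (the real versions operate on flattened `torch.Tensor`s and use an all-gather for the inverse):

```python
def split_into_1d_equal_chunks(flat, world_size, rank):
    """Keep this rank's contiguous slice of a flattened tensor (plain-list
    sketch of apex's split_tensor_into_1d_equal_chunks)."""
    assert len(flat) % world_size == 0
    chunk = len(flat) // world_size
    return flat[rank * chunk:(rank + 1) * chunk]

def gather_split_1d(chunks):
    """Inverse: concatenate the per-rank chunks back into the full buffer
    (sketch of gather_split_1d_tensor)."""
    out = []
    for c in chunks:
        out.extend(c)
    return out
```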
  26. 23 Oct, 2021 1 commit
  27. 08 Oct, 2021 1 commit
  28. 06 Oct, 2021 1 commit
  29. 02 Oct, 2021 1 commit
  30. 15 Apr, 2021 1 commit
    • Add unit tests for Fused NovoGrad (#1065) · 59d2f7ac
      Sudhakar Singh authored
      * Add unit tests for fused-novograd
      
      * Fix: tensors should reside on the same device
      
      * Fix: the CUDA stream should be created on the same device as the tensors it operates on. Found this while debugging the fused NovoGrad multi-device unit test
      
      * fixed issues mentioned in the comments
      59d2f7ac
  31. 01 Dec, 2020 1 commit
  32. 05 Aug, 2020 1 commit
  33. 06 Jul, 2020 1 commit
    • [sync BN] (#792) · 1ff54b8f
      jjsjann123 authored
      * [sync BN]
      
      support non-uniform batch sizes across the process group.
      
      TODO: test should be added once cleaned up.
      
      * updating unit tests
      
      * new unit tests for different inputs
      
      * cleaning
      1ff54b8f
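Supporting non-uniform batch sizes means the cross-process reduction must weight each process's statistics by its element count rather than averaging them uniformly. A plain-Python sketch of what the sync BN all-reduce computes (the real code reduces per-channel tensors on GPU):

```python
def combine_bn_stats(means, sq_means, counts):
    """Combine per-process mean / mean-of-squares statistics, weighted by
    each process's element count, so processes with different batch sizes
    contribute proportionally. Returns (mean, biased variance), as BN uses."""
    total = sum(counts)
    mean = sum(m * n for m, n in zip(means, counts)) / total
    sq_mean = sum(s * n for s, n in zip(sq_means, counts)) / total
    var = sq_mean - mean ** 2  # E[x^2] - E[x]^2
    return mean, var
```

With equal counts this degenerates to the uniform average; the weighting only matters when batch sizes differ across the group.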
  34. 23 Jun, 2020 2 commits