1. 25 Jul, 2022 1 commit
  2. 18 May, 2022 1 commit
    • Masaki Kozuki's avatar
      [transformer] Allow for different backend for Pipeline Parallel ProcessGroups (#1380) · 3490b9e1
      Masaki Kozuki authored
      
      
      * NcclDistributedTestBase
      
      * fix stupid mistake
      
      * add UCC test
      
      * add UCC backend
      
      * torch ucc tests
      
      * allows for UCC backend
      
      * Set `UCX_TLS` to `tcp,cuda_copy` & Use DDP iff it makes sense
      
      * Apply 4 suggestion(s) to 1 file(s)
      
      * mix&match NCCL & UCC
      
      * use both ucc&nccl in gpt
      
      * UCC for Pipeline Parallel, NCCL for the others
      
      * conditionally use ucc
      
      * make ucc guards more friendly
      
      * test raises when torch_ucc isn't available
      
      * Change to member variable from class variable
      Co-authored-by: default avatarAidyn Aitzhan <31858918+Aidyn-A@users.noreply.github.com>
      
      * pass async_comm to train, I mistakenly dropped it during the rebase
      
      * fix typo: functionality
      
      * Enable tensor parallel only when device count > 4
      
      I want pipeline model parallel world size to be >= 4 because
      previously I saw GPT/BERT failing when only UCC is used.
      So I'm speculating that there's some gotcha around pipeline size of 4.
      
      * Add nvidia driver version guard
      Co-authored-by: default avatarAidyn Aitzhan <31858918+Aidyn-A@users.noreply.github.com>
      
      * move world_size as it was not correctly reflected
      
      * keep eye on the nvml api thing
      
      * import unittest
      Co-authored-by: default avatarAidyn Aitzhan <31858918+Aidyn-A@users.noreply.github.com>
      3490b9e1
  3. 25 Mar, 2022 1 commit
    • Masaki Kozuki's avatar
      [transformer] Format & Test Refactoring (#1325) · a0ed4151
      Masaki Kozuki authored
      * try PyTorch custom TestCase class
      
      * revert
      
      * initial working example
      
      * update
      
      * data utils
      
      * fix imports
      
      * hardcode backend to nccl
      
      * fix signature
      
      * fix typo
      
      * mapping
      
      * set device
      
      * init
      
      * refactor x entropy
      
      * remove unused import & destroy model parallel
      
      * refactor random
      
      * fix test
      
      * remove migrated tests
      
      * refactor
      
      * init
      
      * separate affine weight init
      
      * init model parallel
      
      * split more
      
      * weight init fix part 1
      
      * use cpu init for consistency btwn native and tensor parallel
      
      * black
      
      * add col parallel
      
      * use a 3D tensor of square matrix for column parallel linear
      
      * skip the failing cases
      
      * migrate layers test
      
      * pipeline parallel forward/backward
      
      * fix typo
      
      * fix typo
      
      * fix
      
      * fix pipeline world size
      
      * black
      
      * rm `run_pipeline_parallel_test` in favor of test_pipeline_parallel_fwd_bwd.py
      
      * stop logging
      
      * set log level
      
      * black
      
      * license and format
      
      * fix
      
      * skip tf32 as matrices are small
      
      * remove potentially inappropriate license
      
      * Apply suggestions from code review
      
      * remove `TODO` comment
      
      * `torch.testing.assert_allclose` -> `torch.testing.assert_close`
      
      * remove comment-outs
      
      * remote unused import
      
      * minor fix
      a0ed4151