1. 18 May, 2022 1 commit
    • [transformer] Allow for different backend for Pipeline Parallel ProcessGroups (#1380) · 3490b9e1
      Masaki Kozuki authored
      
      
      * NcclDistributedTestBase
      
      * fix stupid mistake
      
      * add UCC test
      
      * add UCC backend
      
      * torch ucc tests
      
      * allows for UCC backend
      
      * Set `UCX_TLS` to `tcp,cuda_copy` & Use DDP iff it makes sense
      
      * Apply 4 suggestion(s) to 1 file(s)
      
      * mix&match NCCL & UCC
      
      * use both ucc&nccl in gpt
      
      * UCC for Pipeline Parallel, NCCL for the others
      
      * conditionally use ucc
      
      * make ucc guards more friendly
      
      * test raises when torch_ucc isn't available
      
      * Change to member variable from class variable
      Co-authored-by: Aidyn Aitzhan <31858918+Aidyn-A@users.noreply.github.com>
      
      * pass async_comm to train, I mistakenly dropped it during the rebase
      
      * fix typo: functionality
      
      * Enable tensor parallel only when device count > 4
      
      I want pipeline model parallel world size to be >= 4 because
      previously I saw GPT/BERT failing when only UCC is used.
      So I'm speculating that there's some gotcha around pipeline size of 4.
      
      * Add nvidia driver version guard
      Co-authored-by: Aidyn Aitzhan <31858918+Aidyn-A@users.noreply.github.com>
      
      * move world_size as it was not correctly reflected
      
      * keep eye on the nvml api thing
      
      * import unittest
      Co-authored-by: Aidyn Aitzhan <31858918+Aidyn-A@users.noreply.github.com>
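
The commit message above describes using UCC for the pipeline parallel process groups and NCCL for the others, raising when `torch_ucc` is not available, and setting `UCX_TLS` to `tcp,cuda_copy`. A minimal sketch of that selection logic follows. It is not the code from this commit: the function names `ucc_available`, `select_backends`, and `configure_ucx_env` are hypothetical, the choice of `ImportError` is an assumption, and the sketch only picks backend names rather than creating real process groups.

```python
import importlib.util
import os


def ucc_available() -> bool:
    # Hypothetical guard: treat UCC as usable only if the torch_ucc
    # plugin module can be found on the import path.
    return importlib.util.find_spec("torch_ucc") is not None


def select_backends(use_ucc_for_pipeline: bool, has_ucc: bool) -> dict:
    # Pick a backend name per kind of process group.
    # Assumption: only the pipeline parallel groups ever use UCC.
    if use_ucc_for_pipeline and not has_ucc:
        raise ImportError("UCC backend requested but torch_ucc is not available")
    return {
        "pipeline_parallel": "ucc" if use_ucc_for_pipeline else "nccl",
        "tensor_parallel": "nccl",
        "data_parallel": "nccl",
    }


def configure_ucx_env(env: dict) -> dict:
    # Assumption: keep a user-provided UCX_TLS and only fill in the
    # value named in the commit message when it is unset.
    env.setdefault("UCX_TLS", "tcp,cuda_copy")
    return env


if __name__ == "__main__":
    wants_ucc = ucc_available()
    if wants_ucc:
        configure_ucx_env(os.environ)
    print(select_backends(wants_ucc, wants_ucc))
```

In a real setup, the backend names chosen here would presumably be passed as the `backend` argument of `torch.distributed.new_group` when each group is created.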
  2. 07 Apr, 2022 1 commit