    [transformer] Allow for different backend for Pipeline Parallel ProcessGroups (#1380) · 3490b9e1
    Masaki Kozuki authored
    
    
    * NcclDistributedTestBase
    
    * fix stupid mistake
    
    * add UCC test
    
    * add UCC backend
    
    * torch ucc tests
    
    * allows for UCC backend
    
    * Set `UCX_TLS` to `tcp,cuda_copy` & Use DDP iff it makes sense
    
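The UCX transport selection above can be sketched as a small helper; this is an illustrative sketch, not apex's actual code, and the function name is hypothetical:

```python
import os

def configure_ucx_transports():
    """Restrict UCX to the TCP and CUDA-copy transports, as in the commit above.

    Uses setdefault so an explicit user-provided UCX_TLS is respected.
    (Hypothetical helper; apex may set the variable differently.)
    """
    os.environ.setdefault("UCX_TLS", "tcp,cuda_copy")
    return os.environ["UCX_TLS"]
```

UCX reads `UCX_TLS` at initialization, so a setting like this must happen before any UCC process group is created.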
    * Apply 4 suggestion(s) to 1 file(s)
    
    * mix&match NCCL & UCC
    
    * use both ucc&nccl in gpt
    
    * UCC for Pipeline Parallel, NCCL for the others
    
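Mixing backends per process group can be sketched as below. The helper and the ranks are illustrative; the key mechanism is that `torch.distributed.new_group` accepts a `backend` argument that may differ from the default group's backend:

```python
def select_backends(use_ucc_for_pp: bool):
    """Pick backends: UCC for pipeline parallel, NCCL for everything else.

    Hypothetical helper mirroring the commit's intent; returns
    (default_backend, pipeline_parallel_backend).
    """
    pp_backend = "ucc" if use_ucc_for_pp else "nccl"
    return "nccl", pp_backend

# Typical usage (requires torch.distributed built with UCC support):
#   import torch.distributed as dist
#   default_backend, pp_backend = select_backends(use_ucc_for_pp=True)
#   dist.init_process_group(backend=default_backend, ...)
#   pp_group = dist.new_group(ranks=pp_ranks, backend=pp_backend)
```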
    * conditionally use ucc
    
    * make ucc guards more friendly
    
    * test raises when torch_ucc isn't available
    
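A friendly guard for the UCC backend might look like the following sketch; the `torch_ucc` module name comes from the tests above, but the guard functions themselves are assumptions:

```python
import importlib.util

def has_torch_ucc() -> bool:
    # Check importability without actually importing the plugin.
    return importlib.util.find_spec("torch_ucc") is not None

def require_torch_ucc() -> None:
    """Raise a clear error when the UCC backend plugin is missing.

    (Hypothetical guard; the actual commit's error message may differ.)
    """
    if not has_torch_ucc():
        raise RuntimeError(
            "torch_ucc is not available; install it to use the UCC backend"
        )
```

A test can then assert that `require_torch_ucc()` raises exactly when the plugin is absent.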
    * Change to member variable from class variable
    Co-authored-by: Aidyn Aitzhan <31858918+Aidyn-A@users.noreply.github.com>
    
    * pass async_comm to train, I mistakenly dropped it during the rebase
    
    * fix typo: functionality
    
    * Enable tensor parallel only when device count > 4
    
    I want the pipeline model parallel world size to be >= 4 because
    I previously saw GPT/BERT failing when only UCC was used,
    so I suspect there's some gotcha around a pipeline size of 4.
    
    * Add nvidia driver version guard
    Co-authored-by: Aidyn Aitzhan <31858918+Aidyn-A@users.noreply.github.com>
    
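A driver-version guard of the kind mentioned above can be sketched as a version comparison used with `unittest.skipUnless`. The comparison helper and the minimum version shown are illustrative, not the commit's actual values:

```python
def driver_at_least(version_str: str, minimum: tuple) -> bool:
    """Compare a dotted NVIDIA driver version string (e.g. "470.82.01")
    against a minimum version tuple. Hypothetical helper."""
    parts = tuple(int(p) for p in version_str.split("."))
    return parts >= minimum

# Typical usage, assuming some get_driver_version() (e.g. via pynvml):
#   import unittest
#   @unittest.skipUnless(driver_at_least(get_driver_version(), (470,)),
#                        "NVIDIA driver too old for UCC tests")
#   class UccTest(...): ...
```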
    * move world_size as it was not correctly reflected
    
    * keep an eye on the nvml api thing
    
    * import unittest
    Co-authored-by: Aidyn Aitzhan <31858918+Aidyn-A@users.noreply.github.com>