Commits · e57d9e79ca6e75fbcdf76cfd950bdf33e9c9203f · OpenDAS / apex

25 Jul, 2022 1 commit

[transformer] update tests (#1428) · e57d9e79

Aidyn-A authored Jul 25, 2022

18 May, 2022 1 commit

[transformer] Allow for different backend for Pipeline Parallel ProcessGroups (#1380) · 3490b9e1

Masaki Kozuki authored May 18, 2022



* NcclDistributedTestBase

* fix stupid mistake

* add UCC test

* add UCC backend

* torch ucc tests

* allows for UCC backend

* Set `UCX_TLS` to `tcp,cuda_copy` & Use DDP iff it makes sense

* Apply 4 suggestion(s) to 1 file(s)

* mix&match NCCL & UCC

* use both ucc&nccl in gpt

* UCC for Pipeline Parallel, NCCL for the others

* conditionally use ucc

* make ucc guards more friendly

* test raises when torch_ucc isn't available

* Change to member variable from class variable
Co-authored-by: Aidyn Aitzhan <31858918+Aidyn-A@users.noreply.github.com>

* pass async_comm to train, I mistakenly dropped it during the rebase

* fix typo: functionality

* Enable tensor parallel only when device count > 4

I want pipeline model parallel world size to be >= 4 because
previously I saw GPT/BERT failing when only UCC is used.
So I'm speculating that there's some gotcha around pipeline size of 4.

* Add nvidia driver version guard
Co-authored-by: Aidyn Aitzhan <31858918+Aidyn-A@users.noreply.github.com>

* move world_size as it was not correctly reflected

* keep eye on the nvml api thing

* import unittest
Co-authored-by: Aidyn Aitzhan <31858918+Aidyn-A@users.noreply.github.com>
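
The items above describe three ideas: guard on `torch_ucc` being importable, use UCC only for the pipeline-parallel group while the other groups stay on NCCL, and enable tensor parallelism only when more than 4 devices are present. A minimal sketch of that decision logic is given below. It is not the code from this commit: the function names (`ucc_available`, `choose_backends`, `split_world`), the returned dictionary keys, and the tensor-parallel size of 2 are all assumptions made for illustration, and nothing here touches `torch.distributed`.

```python
import importlib.util
import os
from typing import Dict, Tuple


def ucc_available() -> bool:
    """Return True if a module named `torch_ucc` can be found (hypothetical helper)."""
    try:
        return importlib.util.find_spec("torch_ucc") is not None
    except (ImportError, ValueError):
        return False


def choose_backends(use_ucc_for_pipeline: bool, have_ucc: bool) -> Dict[str, str]:
    """Pick a backend name per process-group kind (hypothetical helper).

    Raises ImportError when UCC is requested but `torch_ucc` is missing,
    mirroring the "raises when torch_ucc isn't available" item.
    """
    if use_ucc_for_pipeline and not have_ucc:
        raise ImportError("UCC backend requested but `torch_ucc` is not importable")
    if use_ucc_for_pipeline:
        # Value taken from the commit message; setdefault keeps a user-provided value.
        os.environ.setdefault("UCX_TLS", "tcp,cuda_copy")
    return {
        "default": "nccl",
        "pipeline_parallel": "ucc" if use_ucc_for_pipeline else "nccl",
    }


def split_world(device_count: int) -> Tuple[int, int]:
    """Return (tensor_parallel_size, pipeline_parallel_size).

    Assumption: tensor parallelism of size 2 is enabled only when there are
    more than 4 devices, so that 8 devices still leave a pipeline size of 4.
    """
    tensor = 2 if device_count > 4 and device_count % 2 == 0 else 1
    return tensor, device_count // tensor
```

Under these assumptions, `split_world(8)` yields a tensor-parallel size of 2 and a pipeline-parallel size of 4, while `split_world(4)` keeps all four devices for the pipeline.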


25 Mar, 2022 1 commit

[transformer] Format & Test Refactoring (#1325) · a0ed4151

Masaki Kozuki authored Mar 24, 2022

* try PyTorch custom TestCase class

* revert

* initial working example

* update

* data utils

* fix imports

* hardcode backend to nccl

* fix signature

* fix typo

* mapping

* set device

* init

* refactor x entropy

* remove unused import & destroy model parallel

* refactor random

* fix test

* remove migrated tests

* refactor

* init

* separate affine weight init

* init model parallel

* split more

* weight init fix part 1

* use cpu init for consistency btwn native and tensor parallel

* black

* add col parallel

* use a 3D tensor of square matrix for column parallel linear

* skip the failing cases

* migrate layers test

* pipeline parallel forward/backward

* fix typo

* fix typo

* fix

* fix pipeline world size

* black

* rm `run_pipeline_parallel_test` in favor of test_pipeline_parallel_fwd_bwd.py

* stop logging

* set log level

* black

* license and format

* fix

* skip tf32 as matrices are small

* remove potentially inappropriate license

* Apply suggestions from code review

* remove `TODO` comment

* `torch.testing.assert_allclose` -> `torch.testing.assert_close`

* remove comment-outs

* remove unused import

* minor fix
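
Several items above concern a column-parallel linear layer and keeping its results consistent with a native (unsharded) layer. The idea behind column parallelism can be sketched in pure Python: the weight is split along its output dimension, each rank multiplies the same input by its own shard, and the partial outputs are concatenated. This is an illustration only, not apex's implementation. The helper names are hypothetical, the weight is assumed to be laid out as `(in_features, out_features)`, and the "ranks" are simulated in a single process.

```python
from typing import List

Matrix = List[List[float]]


def matmul(x: Matrix, w: Matrix) -> Matrix:
    """Plain matrix product x @ w for row-major nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]


def split_columns(w: Matrix, world_size: int) -> List[Matrix]:
    """Split w into `world_size` equal blocks along its output (column) dimension."""
    n_cols = len(w[0])
    assert n_cols % world_size == 0, "output dimension must divide evenly"
    step = n_cols // world_size
    return [[row[r * step:(r + 1) * step] for row in w] for r in range(world_size)]


def column_parallel_forward(x: Matrix, shards: List[Matrix]) -> Matrix:
    """Each simulated rank computes x @ shard; results are concatenated column-wise."""
    partials = [matmul(x, shard) for shard in shards]
    return [sum((p[i] for p in partials), []) for i in range(len(x))]


# The sharded result should match the unsharded product.
x = [[1.0, 2.0], [3.0, 4.0]]
w = [[1.0, 0.0, 2.0, 1.0], [0.0, 1.0, 1.0, 3.0]]
assert column_parallel_forward(x, split_columns(w, 2)) == matmul(x, w)
```

With real floating-point tensors the comparison would normally use a tolerance (for example `torch.testing.assert_close`, which the commit adopts) rather than exact equality; the small integer-valued inputs here make exact comparison safe.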
