- 14 Jul, 2022 1 commit
Sandeep Subramanian authored
* Time dimension shape check for fused scale mask softmax kernel
* Add shape test
* Fix mask shape
Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>
- 11 Jul, 2022 1 commit
Perkz Zheng authored
* update: mpu for t5 rpe
* update: add rpe mpu group test
* fix semicolon bugs
Co-authored-by: Masaki Kozuki <masaki.kozuki.2014@gmail.com>
- 07 Jul, 2022 1 commit
Masaki Kozuki authored
* remove pyprof
* remove reparameterization
* remove pyprof test
* clean up
- 05 Jul, 2022 2 commits
Tim Moon authored
* Add features to distributed Adam for Megatron support: gradient clipping, gradient scaling, FP32 grad accumulation, and multiple dtypes and devices
* Restore closure arg to distributed Adam (review suggestion from @crcrpar)
eqy authored
* Integer driver number comparison
* packaging
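The "Integer driver number comparison" title suggests comparing driver versions numerically rather than lexicographically; a generic, hedged illustration of the pitfall (not the commit's code):

```python
# String comparison of version numbers can order them incorrectly; parsing into
# integer tuples compares them numerically.
def driver_version_tuple(version: str) -> tuple:
    return tuple(int(part) for part in version.split("."))

assert "470.9" > "470.82"  # lexicographic comparison: misleading
assert driver_version_tuple("470.9") < driver_version_tuple("470.82")  # numeric: correct
```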
- 23 Jun, 2022 2 commits
Masaki Kozuki authored
* it looks possible to remove this file
* add communication collectives
* update Column|RowParallelLinear
* update checkpoint function
* update function name
* parity between public and private collectives
* row parallel linear
* column parallel linear
* sequence parallel: p2p comm, fix typo
* sequence parallel: pipeline parallel
* fix typo
* add layernorm with sequence_parallel_enabled attr
* class variable -> member variable
* fix col parallel test with sequence parallel
* Initial test of `forward_backward_pipelining_without_interleaving` with `model_type=ModelType.encoder_and_decoder`
* add cases pretending to test sequence_parallel
* Apply 2 suggestion(s) to 1 file(s)
* update sequence_parallel_enabled docstring
* update docstring: order of tensor dimensions, sequence_parallel_enabled behavior
* Divide sequence_length if sequence parallel: tensor shape should be updated if sequence parallel is enabled
* cherry-pick https://github.com/NVIDIA/Megatron-LM/commit/8474e6e54fcb9dfa37aea039352f9fb485fb6f61
* type annotation
* Fix matmul call in RowParallelLinear: fix `sequence_parallel_enabled` to `False`, as you can see in https://github.com/NVIDIA/Megatron-LM/blob/d898a8991d1a08d29074f87819d1bf41517e35f5/megatron/mpu/layers.py#L511-L514
* update rowparallellinear test
* fix `loss_weight` not defined in test_layers
* @eqy's comment
* mixed fused layer norm
* fix typo
* misc
* test_layers cleanup
* Skip Bert/GPT script, since these two models haven't been updated for sequence parallel, e.g. the change of dimension order from (batch, sequence, feature) to (sequence, batch, feature) and the global argument variables
* debug part 1/N: comment out `x.retain_grad`
* debug part 2/N: [ColumnParallelLinear] comment out overriding of sequence_parallel_enabled
* debug 3/N: add pipeline test with parallel mlp
* Fix handling of `self.input_tensor` and argument
* tp2pp4 ModelType.encoder_or_decoder is failing, possibly my fault: the backward pass complains that the output and grad_output shapes don't match
* revert debug 1/N
* defer tensor model parallel size > 1
* split tensor in sequence dim
* cosmetic
* cosmetic: remove archaic comment
* enable TP>1 for encoder_and_decoder as well
* set requires_grad=True always...
* Set `scatter_gather_tensors_in_pipeline` to :obj:`False` so that NeMo Megatron's GPT works with sequence parallel enabled
* brush up comment on `requires_grad()`: according to @ptrblck, PyTorch DistributedDataParallel can hang when some tensor (or parameter) doesn't require grad; this forced `requires_grad`, as I understand it, is different from that
* misc changes of scatter_gather_tensors_in_pipeline comment
* guard for torch_ucc
* cosmetic changes related to tests
* update command line arguments
* update TransformerLanguageModel
* rename
* move gpt to gpt.py
* update bert
* add all_gather for params in sequence parallel region
* misc. some diffs were lost during rebasing...
* updates for non sequence parallel execution
* gpt with sequence parallel
* Apply 2 suggestion(s) to 2 file(s)
* update tensor&pipeline parallel size
* why is `sequence_parallel_enabled` not supplied!? Did I mess up when rebasing?
* cosmetic fix
* correct key is sequence_parallel_enabled
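Since the entry above centers on `sequence_parallel_enabled`, here is an illustrative sketch, written with plain `torch.distributed` rather than the Apex API, of the communication pattern behind sequence parallelism: activations stay sharded along the sequence dimension, are all-gathered before a column-parallel matmul, and are reduce-scattered after a row-parallel matmul.

```python
# Illustrative sketch (not the Apex API). Tensors follow the (sequence, batch,
# hidden) ordering mentioned in the commit, sharded along dim 0 across the
# tensor-parallel ranks; assumes torch.distributed is already initialized.
import torch
import torch.distributed as dist

def gather_along_sequence(x: torch.Tensor) -> torch.Tensor:
    """All-gather sequence shards from every rank before a column-parallel matmul."""
    world_size = dist.get_world_size()
    shards = [torch.empty_like(x) for _ in range(world_size)]
    dist.all_gather(shards, x.contiguous())
    return torch.cat(shards, dim=0)  # concatenate along the sequence dimension

def reduce_scatter_along_sequence(x: torch.Tensor) -> torch.Tensor:
    """Sum partial results across ranks after a row-parallel matmul and keep the local shard."""
    world_size = dist.get_world_size()
    shards = list(x.contiguous().chunk(world_size, dim=0))
    out = torch.empty_like(shards[0])
    dist.reduce_scatter(out, shards)  # defaults to a SUM reduction
    return out
```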
Tim Moon authored
* Increase default bucket size in distributed Adam
* Move distributed Adam unit test to contrib tests; integrate into the unit testing framework
* Tweak hyperparameters for dist Adam optimizer test: improves numerical stability so we can keep tight tolerances (adopting suggestions from @crcrpar)
* Use distributed test infrastructure in distributed Adam unit test (suggestion from @crcrpar)
- 22 Jun, 2022 2 commits
Masaki Kozuki authored
* add temporary dispatch of double, float, half, bfloat16
* fusedadam of bfloat16
* Add bfloat16 path to FusedAdam
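A hedged sketch of exercising the new bfloat16 path in `FusedAdam`, assuming Apex was built with its CUDA extensions and the GPU supports bfloat16:

```python
# Hedged sketch: bfloat16 parameters and gradients driven through FusedAdam.
import torch
from apex.optimizers import FusedAdam

model = torch.nn.Linear(512, 512, device="cuda", dtype=torch.bfloat16)
optimizer = FusedAdam(model.parameters(), lr=1e-3)

x = torch.randn(16, 512, device="cuda", dtype=torch.bfloat16)
loss = model(x).float().pow(2).mean()
loss.backward()
optimizer.step()  # dispatches the fused kernel on bfloat16 params/grads
```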
Tim Moon authored
* Gradient clipping routine with fused kernels: identical API to PyTorch; falls back to the PyTorch impl when not computing the L2 norm
* Add unit test for gradient clipping
* Add fp16 case to gradient clipping unit test
* Tweaks to grad clipping unit test (review suggestions from @crcrpar)
* Debug gradient clipping tests: when checking that incorrect results produce assertion errors, make sure to generate a discrepancy outside the range of numerical error
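The entry does not show the new routine's module path, so the sketch below only illustrates the fused-kernel technique with Apex's `multi_tensor_applier` and the `amp_C` kernels, mirroring the `torch.nn.utils.clip_grad_norm_` signature; it is not the exact code added by the commit.

```python
# Hedged sketch of L2-norm gradient clipping built on Apex's fused multi-tensor kernels.
import torch
import amp_C
from apex.multi_tensor_apply import multi_tensor_applier

def clip_grad_norm_fused(parameters, max_norm: float) -> torch.Tensor:
    grads = [p.grad for p in parameters if p.grad is not None]
    overflow_buf = torch.zeros(1, dtype=torch.int, device="cuda")
    # Fused L2 norm over all gradient tensors at once.
    total_norm, _ = multi_tensor_applier(
        amp_C.multi_tensor_l2norm, overflow_buf, [grads], False
    )
    clip_coef = max_norm / (total_norm.item() + 1e-6)
    if clip_coef < 1.0:
        # Fused in-place scaling of every gradient tensor.
        multi_tensor_applier(
            amp_C.multi_tensor_scale, overflow_buf, [grads, grads], clip_coef
        )
    return total_norm
```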
- 16 Jun, 2022 1 commit
Kevin Stephano authored
Remove legacy fuser usage from multihead attention in contrib in favor of the default, which should be nvfuser. Modify test scripts to activate fusion. (#1403)
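For context, a hedged sketch of how a test script can activate nvfuser for TorchScript code; this is generic PyTorch usage, not the commit's diff:

```python
# Hedged sketch: select nvfuser ("fuser2") instead of the legacy fuser for a scripted op.
import torch

@torch.jit.script
def bias_gelu(bias: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
    y = x + bias
    return y * 0.5 * (1.0 + torch.tanh(0.79788456 * y * (1.0 + 0.044715 * y * y)))

x = torch.randn(128, 1024, device="cuda")
bias = torch.randn(1024, device="cuda")
with torch.jit.fuser("fuser2"):  # nvfuser
    for _ in range(3):           # warm-up iterations let the fuser compile the kernel
        out = bias_gelu(bias, x)
```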
- 14 Jun, 2022 3 commits
Thor Johnsen authored
ZeRO-2 support in DistributedFusedAdam
Tim Moon authored
Adjust test options to have tighter tolerances.
Tim Moon authored
- 13 Jun, 2022 1 commit
Tim Moon authored
- 31 May, 2022 1 commit
eqy authored
Do pipeline parallelism tests in double precision because TF32 environment variables can be painful to manage across test suites (#1391)
* check in
* skip interleaved with 2 GPU
* change type annotation
* address comments, thanks @crcrpar @Aidyn-A
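A small hedged sketch of the two options the entry weighs: double-precision test tensors versus disabling TF32 globally.

```python
# Hedged sketch of the trade-off described above.
import torch

# Option taken by the commit: build test tensors in double precision.
x = torch.randn(8, 8, device="cuda", dtype=torch.float64)

# Alternative the commit avoids: disable TF32 matmul/cuDNN globally, e.g. via
# these flags or the NVIDIA_TF32_OVERRIDE=0 environment variable.
torch.backends.cuda.matmul.allow_tf32 = False
torch.backends.cudnn.allow_tf32 = False
```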
- 20 May, 2022 1 commit
Aidyn-A authored
* add grad check
* change assert
* minor changes
* revert unnecessary changes
* suggested changes
* fix tensor comparison
* small changes
- 19 May, 2022 2 commits
eqy authored
* check in
* type
* cleanup
* cleanup
* fix function call
* Apply suggestions from code review
Co-authored-by: Masaki Kozuki <mkozuki@nvidia.com>
eqy authored
* check in
* fancy context style
Co-authored-by: Masaki Kozuki <mkozuki@nvidia.com>
- 18 May, 2022 1 commit
Masaki Kozuki authored
* NcclDistributedTestBase
* fix stupid mistake
* add UCC test
* add UCC backend
* torch ucc tests
* allows for UCC backend
* Set `UCX_TLS` to `tcp,cuda_copy` & use DDP iff it makes sense
* Apply 4 suggestion(s) to 1 file(s)
* mix&match NCCL & UCC
* use both ucc&nccl in gpt
* UCC for pipeline parallel, NCCL for the others
* conditionally use ucc
* make ucc guards more friendly
* test raises when torch_ucc isn't available
* Change to member variable from class variable
* pass async_comm to train; I mistakenly dropped it during the rebase
* fix typo: functionality
* Enable tensor parallel only when device count > 4: I want pipeline model parallel world size to be >= 4 because previously I saw GPT/BERT failing when only UCC is used, so I'm speculating that there's some gotcha around pipeline size of 4
* Add nvidia driver version guard
* move world_size as it was not correctly reflected
* keep an eye on the nvml api thing
* import unittest
Co-authored-by: Aidyn Aitzhan <31858918+Aidyn-A@users.noreply.github.com>
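A hedged sketch of mixing NCCL and UCC process groups as the entry above describes, assuming a UCC-enabled PyTorch (torch_ucc) build, GPUs, and a `torchrun` launch:

```python
# Hedged sketch: UCC (with UCX_TLS=tcp,cuda_copy) for one communicator, NCCL for the rest.
import os
import torch
import torch.distributed as dist

os.environ.setdefault("UCX_TLS", "tcp,cuda_copy")

dist.init_process_group(backend="nccl")  # default group uses NCCL
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

ranks = list(range(dist.get_world_size()))
ucc_group = dist.new_group(ranks=ranks, backend="ucc")  # e.g. pipeline-parallel traffic

t = torch.ones(1, device="cuda")
dist.all_reduce(t)                   # goes over NCCL
dist.all_reduce(t, group=ucc_group)  # goes over UCC
```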
- 13 May, 2022 1 commit
Masaki Kozuki authored
- 12 May, 2022 1 commit
eqy authored
* initial check in
* fix
* fix test
* address some review comments and cleanup
* fix
* bookmark
* fix sync placement to come before gather
* similar fix for non-gather case
* add async bert
* update gpt minimal test
* allow selection of default pp test
* fix bert test
* cleanup
* cleanup
- 11 May, 2022 1 commit
Aidyn-A authored
* add loss comparison to test_pipeline_parallel_fwd_bwd
* applied some suggested changes
* update test_pipeline_parallel_fwd_bwd.py
* update test_pipeline_parallel_fwd_bwd.py 2
* minor update
* update test_pipeline_parallel_fwd_bwd.py 3
- 29 Apr, 2022 3 commits
eqy authored
* fix typo
* Update test_pipeline_parallel_fwd_bwd.py
Masaki Kozuki authored
This is cherry-picked for easier comparison with megatron-lm.
yjk21 authored
- 21 Apr, 2022 1 commit
Masaki Kozuki authored
* guard
* update
* remove unnecessary version guard
* runtime version guard
* cosmetic
* skip tests appropriately
- 20 Apr, 2022 1 commit
Thor Johnsen authored
Peer memory halo exchange
- 19 Apr, 2022 1 commit
Masaki Kozuki authored
* bump version
* add guard
* fix the cond
- 14 Apr, 2022 1 commit
Thor Johnsen authored
- 13 Apr, 2022 1 commit
Thor Johnsen authored
- 08 Apr, 2022 3 commits
Thor Johnsen authored
Thor Johnsen authored
Thor Johnsen authored
- 07 Apr, 2022 2 commits
Masaki Kozuki authored
* add warning to pyprof
* add warning to reparameterization
  Note: this module is already not import-able, as follows:

```
(base) root@c4bb3f161482:/vscode/apex# python -c 'import torch; import apex; from apex import reparameterization'
/vscode/apex/apex/pyprof/__init__.py:5: FutureWarning: pyprof will be removed by the end of June, 2022
  warnings.warn("pyprof will be removed by the end of June, 2022", FutureWarning)
/vscode/apex/apex/reparameterization/__init__.py:2: FutureWarning: reparameterization will be removed by the end of June, 2022
  warnings.warn("reparameterization will be removed by the end of June, 2022", FutureWarning)
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/vscode/apex/apex/reparameterization/__init__.py", line 4, in <module>
    from .weight_norm import WeightNorm
  File "/vscode/apex/apex/reparameterization/weight_norm.py", line 3, in <module>
    from ..fp16_utils import Fused_Weight_Norm
ImportError: cannot import name 'Fused_Weight_Norm' from 'apex.fp16_utils' (/vscode/apex/apex/fp16_utils/__init__.py)
```
Masaki Kozuki authored
* add test
* destroy model parallel was missing
- 05 Apr, 2022 2 commits
Thor Johnsen authored
Thor Johnsen authored
- 03 Apr, 2022 1 commit
Thor Johnsen authored
- 02 Apr, 2022 2 commits
Thor Johnsen authored
Thor Johnsen authored