- 26 Feb, 2022 1 commit
-
-
Masaki Kozuki authored
* fuse grad accumulation w/ weight grad Co-authored-by:
Sangkug Lym <slym@nvidia.com> * fp32 training path * not using *args, **kwargs * backward: moved the tensor dimension cnversion Co-authored-by:
Sangkug Lym <slym@nvidia.com> * move files to csrc/megatron * fix fp32 path * fix typo * add to in order to select the correct custom extension * fix typo * comment on import guard * update test: enable gradient_accumulation_fusion * 86 * remove redundant call of `test_column_parallel_linear` Co-authored-by:
Sangkug Lym <slym@nvidia.com>
-
- 15 Feb, 2022 1 commit
-
-
Masaki Kozuki authored
-
- 12 Feb, 2022 1 commit
-
-
Masaki Kozuki authored
-
- 04 Feb, 2022 1 commit
-
-
eqy authored
* FusedRMSNorm based on FusedLayerNorm * refactor duplicated kernels * delete comments * delete comments * cleanup * cleanup * cleanup, fixed clobbering forward_affine_mixed_dtypes * fix pybind naming and add MixedFused test * undo skipping * check elementwise_affine * Update tests/L0/run_fused_layer_norm/test_fused_layer_norm.py Oof, nice catch, thanks Co-authored-by:
Masaki Kozuki <masaki.kozuki.2014@gmail.com> Co-authored-by:
Masaki Kozuki <masaki.kozuki.2014@gmail.com>
-
- 09 Dec, 2021 1 commit
-
-
Kevin Stephano authored
* Add fused mixed precision lamb optimizer. * Fix device usage in constructor. * Fix sending param_group tensor state to device. * Remove unneeded device set.
-
- 27 Oct, 2021 1 commit
-
-
Masaki Kozuki authored
* Init apex.ppu (pipeline model parallel utility) Reference commit: ``` commit 5ab646376d67831601d5552c193241d017f1b35c (HEAD -> main, internal/main) Merge: 14f2c684 7b293d9b Author: Mohammad Shoeybi <mshoeybi@nvidia.com> Date: Wed Sep 22 22:57:54 2021 -0700 Merge branch 'add_BOS' into 'main' Add Beginning of Sentence token option and adding semaphore while multi-threading to prevent crashes and hangs due to connection keep-alives See merge request ADLR/megatron-lm!328 ``` * removing get_args and replace import - phase 1 * removing get_args and replace import - phase 2 * move ppu to apex.transformer.pipeline_parallel * update two __init__.py * update READMEs * mpu -> parallel_state & tensor_parallel * fix * remove not pipeline files * separate schedules.py - phase 1 * dissect schedules.py * data_iterators -> batch * remove optimizer from forward_backward_step funcs * init test * Apply 2 suggestion(s...
-
- 08 Oct, 2021 1 commit
-
-
eqy authored
-
- 07 Oct, 2021 1 commit
-
-
eqy authored
-
- 02 Oct, 2021 1 commit
-
-
Masaki Kozuki authored
Co-authored-by:
Piotr Bialecki <pbialecki@nvidia.com> Co-authored-by:
Eddie Yan <eddiey@nvidia.com> Co-authored-by:
Rishi Puri <riship@nvidia.com> Co-authored-by:
Sangkug Lym <slym@nvidia.com>
-
- 24 Sep, 2021 1 commit
-
-
Masaki Kozuki authored
-
- 04 Sep, 2021 1 commit
-
-
Burc Eryilmaz authored
* support for fused dense layer with cublasLt, fusion in both fprop and bprop * fix typo causing syntax error * add fused GEMM+gelu+GEMM modue * fix typo for workspace size * update cublas check for 11600 * add tests for fused dense layer * fix CUDA 10.x path * safer guard around CUBLAS constants, remove unreferenced variable * more guard changes * guard against cublas version instead of cuda Co-authored-by:Sukru Eryilmaz <seryilmaz@computelab-dgx1v-32.nvidia.com>
-
- 01 Sep, 2021 2 commits
-
-
Burc Eryilmaz authored
* fuse norm into scale * add fused norm into dlamb Co-authored-by:Sukru Eryilmaz <seryilmaz@computelab-dgx1v-32.nvidia.com>
-
Burc Eryilmaz authored
* support for fused dense layer with cublasLt, fusion in both fprop and bprop * fix typo causing syntax error * add fused GEMM+gelu+GEMM modue * fix typo for workspace size * update cublas check for 11600 * add tests for fused dense layer * fix CUDA 10.x path Co-authored-by:Sukru Eryilmaz <seryilmaz@computelab-dgx1v-32.nvidia.com>
-
- 17 May, 2021 1 commit
-
-
Burc Eryilmaz authored
Co-authored-by:Sukru Eryilmaz <seryilmaz@computelab-dgx1v-32.nvidia.com>
-
- 19 Apr, 2021 1 commit
-
-
Burc Eryilmaz authored
* don't create cublasLt handle, fix zero block size case * cleanup
-
- 17 Apr, 2021 1 commit
-
-
Burc Eryilmaz authored
* initial cublaslt support * 64 bit input * add license headers * cleanup * remove license Co-authored-by:pbialecki <pbialecki@nvidia.com>
-
- 15 Apr, 2021 1 commit
-
-
Sudhakar Singh authored
* Add unit tests for fused-novograd * Fix: tensors should reside on the same device * Fix: Cudastream should be called on the same device on which the tensors reside on. Found this during debugging fused novograd multi-device unit test * fixed issues mentioned in the comments
-
- 19 Oct, 2020 1 commit
-
-
lly-zero-one authored
In this PR, we mainly tried to optimize the performance of Syncatchnorm and also fixed one potential issue in the welford_parallel kernel implementation. For performance improvement, we batched the mean/var/count all_gather communication together and sent it once in the forward path We also batch the all_reduce in backward path We add the contiguous call on the input of welford_parallel kernel. If there is any standard perf benchmark, I would be happy to run it.
-
- 05 Aug, 2020 1 commit
-
-
ngimel authored
* add device guards to the optimizers * add untracked file * set deviceGuard in multi_tensor_apply * address review comments; fix lamb * indent * typo
-
- 06 Jul, 2020 1 commit
-
-
jjsjann123 authored
* [sync BN] support non-uniform batch size across process group. TODO: test should be added once cleaned up. * updating unit tests * new unit tests for different inputs * cleaning
-
- 23 May, 2020 1 commit
-
-
Kexin Yu authored
-
- 22 May, 2020 5 commits
- 21 May, 2020 1 commit
-
-
Kexin Yu authored
-
- 14 May, 2020 1 commit
-
-
Andrew Tulloch authored
-
- 30 Apr, 2020 3 commits
-
-
Kexin Yu authored
-
Deyu Fu authored
* modify MTA axpby for wider load/store * Make scale/axpby/l2/adam/lamb multi_tensor uses wider load
-
Deyu Fu authored
* update fused bias relu backward kernel * adding support for not require first layer dgrad * fix bug: wrong layer in requires grad * add infrastructure for optional bias and activation, currently only support no bias and no relu * make bias and relu optional separately * add sigmoid activation option
-
- 28 Apr, 2020 1 commit
-
-
Kexin Yu authored
-
- 22 Apr, 2020 1 commit
-
-
Deyu Fu authored
-
- 10 Apr, 2020 1 commit
-
-
Thor Johnsen authored
-
- 27 Feb, 2020 1 commit
-
-
mcarilli authored
* NHWC support for multi tensor apply * compilation fix for version<=1.4
-
- 04 Oct, 2019 1 commit
-
-
Deyu Fu authored
* move previous fused_adam and fp16_optimizer to contrib * make build contrib.fused_adam optional * change build option name * remove unnecessary try import
-
- 06 Sep, 2019 1 commit
-
-
mcarilli authored
* Pushing for build tests * Contrib files * Removing deprecated checks
-
- 20 Aug, 2019 1 commit
-
-
Deyu Fu authored
-
- 17 Aug, 2019 1 commit
-
-
Deyu Fu authored
-
- 16 Aug, 2019 1 commit
-
-
Deyu Fu authored
correctly not apply bias correction to epsilon(same as recent upstream change) correctly not apply bias correction to weight decay(consistent with upstream AdamW) Make adam_w_mode for FusedAdam/LAMB, to do L2 or Weight Decay (Adam vs AdamW) Correct document reg_inside_moment differently from adam_w_mode in FusedNovoGrad Removed legacy eps_mode from FusedAdam Make internal math type float across fused optimizers
-