- 24 Mar, 2022 3 commits
  - Thor Johnsen authored
  - Thor Johnsen authored
  - Thor Johnsen authored
- 23 Mar, 2022 2 commits
  - Thor Johnsen authored
  - Thor Johnsen authored
- 18 Mar, 2022 1 commit
  - eqy authored
    * update ngc link and dockerhub container tag
    * update
    * update
    * update
    * Update README.md
    Co-authored-by: Masaki Kozuki <mkozuki@nvidia.com>
- 16 Mar, 2022 1 commit
  - Masaki Kozuki authored
    [transformer] Warn only when `gradient_accumulation_fusion` is `True` and `fused_weight_gradient_mlp_cuda` is missing (#1317)
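The guarded-warning logic described in the commit above can be summarized with a small sketch. The helper name and warning text are hypothetical; only the `gradient_accumulation_fusion` flag and the `fused_weight_gradient_mlp_cuda` extension come from the commit message itself.

```python
import warnings

try:
    import fused_weight_gradient_mlp_cuda
except ImportError:
    fused_weight_gradient_mlp_cuda = None

# Hypothetical helper: warn only if the fusion was actually requested
# but the CUDA extension is not importable; stay silent otherwise.
def resolve_gradient_accumulation_fusion(gradient_accumulation_fusion: bool) -> bool:
    if gradient_accumulation_fusion and fused_weight_gradient_mlp_cuda is None:
        warnings.warn(
            "gradient_accumulation_fusion was requested but "
            "fused_weight_gradient_mlp_cuda is unavailable; "
            "falling back to the unfused path."
        )
        return False
    return gradient_accumulation_fusion
```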
- 15 Mar, 2022 4 commits
  - Masaki Kozuki authored
    * initial issue_template -- bug
    * Apply suggestions from code review
    Co-authored-by: eqy <eqy@cs.washington.edu>
  - Yuanzhe Dong authored
    * Move forward cudnn-frontend
    * update throw_if to adapt cudnn frontend
  - Thor Johnsen authored
    Leave bottleneck masks as bool
  - Thor Johnsen authored
- 11 Mar, 2022 1 commit
  - chochowski authored
    * extend api to allow forced memory zeroing (empty() does not do it)
    * typo fix
    * ctx change
    * move zeroing flag to ctx
    * update test
    Co-authored-by: mchochowski <mchochowski@nvidia.com>
    Co-authored-by: Masaki Kozuki <mkozuki@nvidia.com>
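For context on the "empty() does not do it" remark: `torch.empty` returns uninitialized storage, so a caller that needs deterministic contents must zero it explicitly. A minimal illustration (shapes are arbitrary):

```python
import torch

buf = torch.empty(4, 1024)  # uninitialized: contents are whatever was in memory
buf.zero_()                 # explicit zeroing, which the new flag forces
# equivalent from the start: buf = torch.zeros(4, 1024)
```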
- 08 Mar, 2022 4 commits
  - Masaki Kozuki authored
    This reverts commit adbe075a.
  - Masaki Kozuki authored
    This reverts commit 74e04667.
  - Masaki Kozuki authored
  - Masaki Kozuki authored
- 01 Mar, 2022 1 commit
  - Masaki Kozuki authored
    * update build_model to support enc&dec model
    * fix typo: cur_sargs -> cur_args
    * enc&dec path: correctly update pre/post process
- 27 Feb, 2022 1 commit
  - Masaki Kozuki authored
- 26 Feb, 2022 1 commit
  - Masaki Kozuki authored
    * fuse grad accumulation w/ weight grad
    Co-authored-by: Sangkug Lym <slym@nvidia.com>
    * fp32 training path
    * not using *args, **kwargs
    * backward: moved the tensor dimension conversion
    Co-authored-by: Sangkug Lym <slym@nvidia.com>
    * move files to csrc/megatron
    * fix fp32 path
    * fix typo
    * add to in order to select the correct custom extension
    * fix typo
    * comment on import guard
    * update test: enable gradient_accumulation_fusion
    * 86
    * remove redundant call of `test_column_parallel_linear`
    Co-authored-by: Sangkug Lym <slym@nvidia.com>
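As a rough illustration of the fusion above: the weight gradient of a linear layer is accumulated directly into a preallocated fp32 `main_grad` buffer instead of being materialized as a separate tensor and added afterwards. The sketch below is a plain, non-fused Python rendering of that accumulation; names and shapes are illustrative and are not the apex CUDA API.

```python
import torch

def accumulate_weight_grad(inp: torch.Tensor,          # [tokens, in_features]
                           grad_output: torch.Tensor,  # [tokens, out_features]
                           main_grad: torch.Tensor     # [out_features, in_features], fp32
                           ) -> None:
    # dW = grad_output^T @ inp, accumulated in place into the fp32 buffer,
    # which is what the fused CUDA kernel does in a single GEMM.
    main_grad.addmm_(grad_output.t().float(), inp.float())
```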
- 25 Feb, 2022 3 commits
  - Masaki Kozuki authored
  - Masaki Kozuki authored
  - Masaki Kozuki authored
- 23 Feb, 2022 4 commits
  - Masaki Kozuki authored
  - Masaki Kozuki authored
  - Thor Johnsen authored
    Change data type for virtual tensors to float
  - Thor Johnsen authored
- 15 Feb, 2022 1 commit
  - Masaki Kozuki authored
- 12 Feb, 2022 1 commit
  - Masaki Kozuki authored
- 11 Feb, 2022 1 commit
  - Stas Bekman authored
    * [FusedRMSNorm doc] add epsilon to formula
    * correct
    * better wording
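For reference, the RMSNorm formula with the epsilon term written out (the standard formulation; the LaTeX below is a rendering of that formula, not text taken from the commit):

```latex
y_i = \frac{x_i}{\mathrm{RMS}(x)}\, g_i,
\qquad
\mathrm{RMS}(x) = \sqrt{\frac{1}{n}\sum_{j=1}^{n} x_j^{2} + \epsilon}
```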
- 10 Feb, 2022 1 commit
  - Masaki Kozuki authored
- 07 Feb, 2022 1 commit
  - eqy authored
- 04 Feb, 2022 1 commit
  - eqy authored
    * FusedRMSNorm based on FusedLayerNorm
    * refactor duplicated kernels
    * delete comments
    * delete comments
    * cleanup
    * cleanup
    * cleanup, fixed clobbering forward_affine_mixed_dtypes
    * fix pybind naming and add MixedFused test
    * undo skipping
    * check elementwise_affine
    * Update tests/L0/run_fused_layer_norm/test_fused_layer_norm.py
      Oof, nice catch, thanks
    Co-authored-by: Masaki Kozuki <masaki.kozuki.2014@gmail.com>
- 01 Feb, 2022 2 commits
  - ChongyuNVIDIA authored
    * Add the permutation related support as the extension for asp lib.
    * [Fix] Track the permutation sequence for progressive channel swap strategy.
    * Fix the corner case that one layer is not sparse but needs to apply permutation due to its siblings.
    * Fix the deprecated functions in ASP unit tests.
    * Fix the sparsity info typo in ASP lib.
    * [Enhancement] Set the identical random seed for all GPUs to make sure the same results are generated in permutation search.
    * Update the README.md with identical random seed setting and NeurIPS info.
    * Integrate the Pybind11 enhancement of permutation search into ASP lib.
  - Masaki Kozuki authored
    * add kwarg of `custom_sync_context_handler`
    * add kwargs to ignore `custom_sync_context_handler` when it is mistakenly passed to fwd/bwd funcs
- 31 Jan, 2022 2 commits
  - Masaki Kozuki authored
    * Free output tensor on each pipeline stage for smaller memory footprint, see: https://github.com/NVIDIA/Megatron-LM/commit/057b086c689b164864455430c223ab52fd86bbcb
    * ref: https://github.com/NVIDIA/Megatron-LM/commit/945ece943149b63511e9d0ec3df8effe7f3c13ff
    * ref: https://github.com/NVIDIA/Megatron-LM/commit/9a8b89acd8f6ba096860170d0e30ddc0bc2bacd4
    * remove position embedding group in destroy
    * pass deallocate_pipeline_outputs to backward_step
    * fix typo
    * missing deallocate_pipeline_outputs
    * fix typo: grad_ouptut -> grad_output
    * update tests
    * remove accessed todo
    * test with data parallel size of 2 if there are 8 or more GPUs
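A hedged sketch of the output-freeing trick referenced in the first bullet, modeled on the linked Megatron-LM commits (the helper name mirrors `deallocate_pipeline_outputs`, but the body here is an illustration, not apex's exact code): once a stage's output has been sent downstream, only its `grad_fn` is still needed, so the activation storage can be swapped for a tiny placeholder.

```python
import torch

def deallocate_output_tensor(out: torch.Tensor) -> None:
    # Keep the autograd node alive, drop the (potentially huge) activation storage.
    assert out._base is None, "freeing a view would not release the underlying storage"
    out.data = torch.empty((1,), device=out.device, dtype=out.dtype)
```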
  - chochowski authored
    * fix graph capture failure, fix norm computation with full_ar and clip_after
    * fix group range to compute l2_norm
    Co-authored-by: seryilmaz <seryilmaz@nvidia.com>
    Co-authored-by: mchochowski <mchochowski@nvidia.com>
- 29 Jan, 2022 1 commit
  - Burc Eryilmaz authored
- 28 Jan, 2022 2 commits
  - Masaki Kozuki authored
    * cosmetic refactor in test
    * log with PID
    * log more info: rank, pid, filename, lineNo
  - Masaki Kozuki authored
    * have get_kth_microbatch deal with None batch
    * broadcast based on tensor parallel rank
    * dtype
    * remove unnecessary .cuda()
    Processes with tensor parallel rank != 0 don't need to prepare one or more `torch.utils.data.DataLoader` instances, which means the `batch` argument of the `get_kth_microbatch` function can be `None`, but the current implementation doesn't allow for it.
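A simplified sketch of the `None`-tolerant lookup described above (not the exact apex signature; the `micro_batch_size` slicing is an assumption about how the k-th microbatch is taken):

```python
from typing import List, Optional
import torch

def get_kth_microbatch(batch: Optional[List[torch.Tensor]], k: int,
                       micro_batch_size: int) -> Optional[List[torch.Tensor]]:
    # Ranks other than tensor-parallel rank 0 may not build a DataLoader,
    # so `batch` can be None; return None and rely on the broadcast from rank 0.
    if batch is None:
        return None
    start, end = k * micro_batch_size, (k + 1) * micro_batch_size
    return [x[start:end] for x in batch]
```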
- 21 Jan, 2022 1 commit
  - Masaki Kozuki authored
    * add keyword argument of `grad_scaler`
    * update test
    * pass dtype to fwd_step_func
    * add log
    * calc loss in autocast as per https://pytorch.org/docs/stable/amp.html#autocasting
    * option to turn off autocast inside forward_step function, as some users activate `autocast` outside the fwd/bwd functions
    * add missing arg of disable_autocast
    * reorder args of no pipeline
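The `grad_scaler` keyword and the autocast-wrapped loss follow the pattern from the linked AMP documentation. A condensed sketch of that pattern (model, optimizer, and loss_fn are placeholders, and `use_autocast` mirrors the opt-out described in the later bullets):

```python
import torch

scaler = torch.cuda.amp.GradScaler()

def train_step(model, optimizer, loss_fn, inputs, targets, use_autocast=True):
    optimizer.zero_grad()
    # Loss is computed inside autocast, per the linked docs; autocast can be
    # disabled when the caller already wraps forward/backward externally.
    with torch.cuda.amp.autocast(enabled=use_autocast):
        outputs = model(inputs)
        loss = loss_fn(outputs, targets)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    return loss.detach()
```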