- 19 Apr, 2022 1 commit
Masaki Kozuki authored
* bump version
* add guard
* fix the cond
-
- 07 Apr, 2022 2 commits
Masaki Kozuki authored
* add warning to pyprof
* add warning to reparameterization

Note: this module is already not importable, as follows:

```
(base) root@c4bb3f161482:/vscode/apex# python -c 'import torch; import apex; from apex import reparameterization'
/vscode/apex/apex/pyprof/__init__.py:5: FutureWarning: pyprof will be removed by the end of June, 2022
  warnings.warn("pyprof will be removed by the end of June, 2022", FutureWarning)
/vscode/apex/apex/reparameterization/__init__.py:2: FutureWarning: reparameterization will be removed by the end of June, 2022
  warnings.warn("reparameterization will be removed by the end of June, 2022", FutureWarning)
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/vscode/apex/apex/reparameterization/__init__.py", line 4, in <module>
    from .weight_norm import WeightNorm
  File "/vscode/apex/apex/reparameterization/weight_norm.py", line 3, in <module>
    from ..fp16_utils import Fused_Weight_Norm
ImportError: cannot import name 'Fused_Weight_Norm' from 'apex.fp16_utils' (/vscode/apex/apex/fp16_utils/__init__.py)
```
-
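The deprecation mechanism referenced in the entry above is a module-level `FutureWarning` emitted at import time. A minimal, self-contained sketch of the same pattern (the helper name is mine, not apex's):

```python
import warnings

def deprecated_module_warning(name: str, removal: str) -> None:
    """Emit the kind of FutureWarning a deprecated package places at the
    top of its __init__.py (names here are illustrative)."""
    warnings.warn(f"{name} will be removed by {removal}", FutureWarning)

# Capture the warning the way a test would:
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    deprecated_module_warning("pyprof", "the end of June, 2022")

print(caught[0].category.__name__)  # FutureWarning
```

Because the warning is issued at module import, users see it once per process even if they never call into the deprecated package directly.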
Masaki Kozuki authored
* add test
* destroy model parallel was missing
-
- 30 Mar, 2022 1 commit
Gil Shomron authored
* Enabled Conv-Bias-ReLU fusion

  The following modules are enabled using cuDNN runtime fusion:
  1) Conv-Bias-ReLU (+backward)
  2) Conv-Bias (+backward)
  3) Conv-Bias-Mask-ReLU (+backward)
* Casts cleanup and autocast in unittest
  - Remove redundant dtype casts
  - Simulate the usage in the unittest by using torch.cuda.amp.autocast
* Fixed save_for_backward

Co-authored-by: Masaki Kozuki <mkozuki@nvidia.com>
Co-authored-by: root <root@luna-0277.selene.nvidia.com>
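The pattern fused above computes `relu(conv(x, w) + bias)` in a single cuDNN kernel instead of three separate launches. As a rough illustration of the math being fused (a plain-Python 1-D reference, not the cuDNN runtime-fusion API):

```python
def conv1d_bias_relu(x, w, b):
    """Reference for the Conv-Bias-ReLU pattern: a valid (no-padding)
    1-D convolution, a bias add, then ReLU -- the three ops that the
    cuDNN runtime fusion engine collapses into one kernel."""
    k = len(w)
    out = []
    for i in range(len(x) - k + 1):
        acc = sum(x[i + j] * w[j] for j in range(k)) + b  # conv + bias
        out.append(max(acc, 0.0))                         # ReLU
    return out

print(conv1d_bias_relu([1.0, 2.0, -1.0, 0.5], [1.0, -1.0], 0.5))
# [0.0, 3.5, 0.0]
```

Fusing these keeps the intermediate conv output in registers rather than writing it to global memory between kernels, which is where the speedup comes from.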
-
- 25 Mar, 2022 3 commits
yjk21 authored
-
Masaki Kozuki authored
* try PyTorch custom TestCase class
* revert
* initial working example
* update
* data utils
* fix imports
* hardcode backend to nccl
* fix signature
* fix typo
* mapping
* set device
* init
* refactor x entropy
* remove unused import & destroy model parallel
* refactor random
* fix test
* remove migrated tests
* refactor
* init
* separate affine weight init
* init model parallel
* split more
* weight init fix part 1
* use cpu init for consistency btwn native and tensor parallel
* black
* add col parallel
* use a 3D tensor of square matrix for column parallel linear
* skip the failing cases
* migrate layers test
* pipeline parallel forward/backward
* fix typo
* fix typo
* fix
* fix pipeline world size
* black
* rm `run_pipeline_parallel_test` in favor of test_pipeline_parallel_fwd_bwd.py
* stop logging
* set log level
* black
* license and format
* fix
* skip tf32 as matrices are small
* remove potentially inappropriate license
* Apply suggestions from code review
* remove `TODO` comment
* `torch.testing.assert_allclose` -> `torch.testing.assert_close`
* remove comment-outs
* remove unused import
* minor fix
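Column-parallel linear, one of the layers tested above, gives each rank a slice of the weight's columns; each rank computes its partial output, and gathering along the column dimension reproduces the full result. A tiny pure-Python equivalence check (illustrative, not the apex.transformer API):

```python
def matmul(a, b):
    """Plain-Python matrix multiply: (m x k) @ (k x n)."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def split_columns(w, parts):
    """Give each 'rank' a contiguous slice of W's columns."""
    n = len(w[0]) // parts
    return [[row[r * n:(r + 1) * n] for row in w] for r in range(parts)]

x = [[1.0, 2.0]]                 # one input row
w = [[1.0, 0.0, 2.0, 1.0],
     [0.0, 1.0, 1.0, 3.0]]       # 2 x 4 weight matrix

full = matmul(x, w)
# Each "rank" multiplies by its column shard; concatenating the
# partial outputs is the all-gather step of column parallelism.
shards = [matmul(x, w_r) for w_r in split_columns(w, 2)]
gathered = [sum((shard[0] for shard in shards), [])]
print(full == gathered)  # True
```

This equivalence is also why the tests above can initialize weights on CPU once and compare the native and tensor-parallel layers numerically.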
-
Masaki Kozuki authored
* update
* Add comment to `destroy_model_parallel`
-
- 24 Mar, 2022 1 commit
Masaki Kozuki authored
Take-over of #1097

* Add fast CUDA focal loss implementation.
* Enable fast math for CUDA focal loss.
* Correct typo.
* replace deprecated macros
* TORCH_CUDA_CHECK -> AT_CUDA_CHECK: the former is defined in torch/csrc/profiler/cuda.cpp, so it's not usually available; the latter is defined in ATen/cuda/Exceptions.h as an alias of C10_CUDA_CHECK.
* add test
* clean up
* guard for torchvision

Co-authored-by: Wil Kong <alpha0422@gmail.com>
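Focal loss downweights well-classified examples through the modulating factor (1 - p_t)^gamma, i.e. FL(p_t) = -alpha_t (1 - p_t)^gamma log(p_t). A plain-Python reference of the formula (not the CUDA kernel added here):

```python
import math

def focal_loss(p, target, alpha=0.25, gamma=2.0):
    """Binary focal loss for one predicted probability p of the
    positive class.  With gamma=0 and alpha=1 this reduces to
    ordinary cross-entropy."""
    p_t = p if target == 1 else 1.0 - p          # prob of the true class
    a_t = alpha if target == 1 else 1.0 - alpha  # class balancing weight
    return -a_t * (1.0 - p_t) ** gamma * math.log(p_t)

# An easy, well-classified example contributes almost nothing:
print(focal_loss(0.99, 1) < focal_loss(0.6, 1) < focal_loss(0.1, 1))  # True
```

The `--use_fast_math` style flags mentioned above trade a little precision in `log`/`pow` for speed; the formula itself is unchanged.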
-
- 18 Mar, 2022 1 commit
eqy authored
* update ngc link and dockerhub container tag
* update
* update
* update
* Update README.md

Co-authored-by: Masaki Kozuki <mkozuki@nvidia.com>
-
- 16 Mar, 2022 1 commit
Masaki Kozuki authored
[transformer] Warn only when `gradient_accumulation_fusion` is `True` and `fused_weight_gradient_mlp_cuda` is missing (#1317)
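The change above makes the warning conditional: it fires only when `gradient_accumulation_fusion` is requested *and* the compiled extension is absent, instead of warning every user unconditionally. A minimal sketch of that guard, with illustrative names (not apex's actual function):

```python
import warnings

def check_gradient_accumulation_fusion(requested: bool,
                                       extension_available: bool) -> bool:
    """Warn (and fall back) only when the user asked for the fused path
    but the compiled extension is missing.  Names are illustrative."""
    if requested and not extension_available:
        warnings.warn("fused weight-gradient extension not found; "
                      "disabling gradient_accumulation_fusion")
        return False
    return requested

# Users who never requested the fusion see no warning at all:
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    ok = check_gradient_accumulation_fusion(requested=False,
                                            extension_available=False)
print(len(caught))  # 0
```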
-
- 15 Mar, 2022 4 commits
Masaki Kozuki authored
* initial issue_template -- bug
* Apply suggestions from code review

Co-authored-by: eqy <eqy@cs.washington.edu>
-
Yuanzhe Dong authored
* Move forward cudnn-frontend
* update throw_if to adapt cudnn frontend
-
Thor Johnsen authored
Leave bottleneck masks as bool
-
Thor Johnsen authored
-
- 11 Mar, 2022 1 commit
chochowski authored
* extend api to allow forced memory zeroing (`empty()` does not do it)
* typo fix
* ctx change
* move zeroing flag to ctx
* update test

Co-authored-by: mchochowski <mchochowski@nvidia.com>
Co-authored-by: Masaki Kozuki <mkozuki@nvidia.com>
-
- 08 Mar, 2022 4 commits
Masaki Kozuki authored
This reverts commit adbe075a.
-
Masaki Kozuki authored
This reverts commit 74e04667.
-
Masaki Kozuki authored
-
Masaki Kozuki authored
-
- 01 Mar, 2022 1 commit
Masaki Kozuki authored
* update build_model to support enc&dec model
* fix typo: cur_sargs -> cur_args
* enc&dec path: correctly update pre/post process
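For context on the pre/post process flags: in pipeline parallelism only the first stage runs the embedding (`pre_process`) and only the last runs the output head (`post_process`); the encoder&decoder path gives each half of the pipeline its own first and last stage. A minimal sketch of that flag logic, with hypothetical helper names (this is not apex's actual `build_model` signature):

```python
def stage_flags(rank: int, world_size: int):
    """Single-model path: pre_process only on the first pipeline
    stage, post_process only on the last."""
    return rank == 0, rank == world_size - 1

def enc_dec_stage_flags(rank: int, world_size: int, split_rank: int):
    """Encoder&decoder path: ranks [0, split) hold the encoder and
    [split, world) the decoder; each half has its own first/last."""
    if rank < split_rank:
        return rank == 0, rank == split_rank - 1
    return rank == split_rank, rank == world_size - 1

print([stage_flags(r, 4) for r in range(4)])
# [(True, False), (False, False), (False, False), (False, True)]
```

Getting these flags wrong on the encoder/decoder boundary is exactly the kind of bug the third bullet above fixes.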
-
- 27 Feb, 2022 1 commit
Masaki Kozuki authored
-
- 26 Feb, 2022 1 commit
Masaki Kozuki authored
* fuse grad accumulation w/ weight grad
* fp32 training path
* not using *args, **kwargs
* backward: moved the tensor dimension conversion
* move files to csrc/megatron
* fix fp32 path
* fix typo
* add to in order to select the correct custom extension
* fix typo
* comment on import guard
* update test: enable gradient_accumulation_fusion
* 86
* remove redundant call of `test_column_parallel_linear`

Co-authored-by: Sangkug Lym <slym@nvidia.com>
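The fusion above has the weight-gradient GEMM accumulate directly into a persistent `main_grad` buffer rather than materializing a fresh gradient tensor and adding it afterwards. A plain-Python sketch of that accumulation semantics (the helper name is illustrative; the real work happens in a fused CUDA kernel):

```python
def weight_grad_accumulate(main_grad, x, grad_out):
    """Accumulate the weight gradient of a linear layer directly into
    main_grad, as the fused kernel does.  For a single sample:
    dW[i][j] += grad_out[i] * x[j] (an outer product)."""
    for i in range(len(grad_out)):
        for j in range(len(x)):
            main_grad[i][j] += grad_out[i] * x[j]
    return main_grad

# Two micro-batches accumulate into the same buffer -- no temporary
# gradient tensor is ever allocated between them:
main_grad = [[0.0, 0.0], [0.0, 0.0]]
weight_grad_accumulate(main_grad, x=[1.0, 2.0], grad_out=[3.0, 4.0])
weight_grad_accumulate(main_grad, x=[1.0, 2.0], grad_out=[1.0, 1.0])
print(main_grad)  # [[4.0, 8.0], [5.0, 10.0]]
```

Skipping the intermediate gradient tensor saves both memory traffic and an extra elementwise add per micro-batch, which matters under gradient accumulation.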
-
- 25 Feb, 2022 3 commits
Masaki Kozuki authored
-
Masaki Kozuki authored
-
Masaki Kozuki authored
-
- 23 Feb, 2022 4 commits
Masaki Kozuki authored
-
Masaki Kozuki authored
-
Thor Johnsen authored
Change data type for virtual tensors to float
-
Thor Johnsen authored
-
- 15 Feb, 2022 1 commit
Masaki Kozuki authored
-
- 12 Feb, 2022 1 commit
Masaki Kozuki authored
-
- 11 Feb, 2022 1 commit
Stas Bekman authored
* [FusedRMSNorm doc] add epsilon to formula
* correct
* better wording
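For reference, the formula in question is RMSNorm, y_i = g_i * x_i / sqrt(mean(x^2) + eps), where the epsilon inside the square root (for numerical stability) is the term the doc fix adds. A plain-Python sketch of the computation (not the fused kernel):

```python
import math

def rms_norm(x, weight, eps=1e-5):
    """RMSNorm: scale x by the reciprocal of its root-mean-square,
    with eps added inside the sqrt for numerical stability, then
    apply the learned elementwise gain."""
    ms = sum(v * v for v in x) / len(x)      # mean of squares
    inv = 1.0 / math.sqrt(ms + eps)          # 1 / RMS(x)
    return [v * inv * w for v, w in zip(x, weight)]

out = rms_norm([3.0, 4.0], [1.0, 1.0])
# RMS([3, 4]) = sqrt(12.5) ~= 3.5355, so out ~= [0.8485, 1.1314]
```

Unlike LayerNorm, there is no mean subtraction and no bias, which is what makes the fused kernel in apex a close cousin of FusedLayerNorm rather than a copy.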
-
- 10 Feb, 2022 1 commit
Masaki Kozuki authored
-
- 07 Feb, 2022 1 commit
eqy authored
-
- 04 Feb, 2022 1 commit
eqy authored
* FusedRMSNorm based on FusedLayerNorm
* refactor duplicated kernels
* delete comments
* delete comments
* cleanup
* cleanup
* cleanup, fixed clobbering forward_affine_mixed_dtypes
* fix pybind naming and add MixedFused test
* undo skipping
* check elementwise_affine
* Update tests/L0/run_fused_layer_norm/test_fused_layer_norm.py ("Oof, nice catch, thanks")

Co-authored-by: Masaki Kozuki <masaki.kozuki.2014@gmail.com>
-
- 01 Feb, 2022 2 commits
ChongyuNVIDIA authored
* Add the permutation-related support as an extension for the ASP lib.
* [Fix] Track the permutation sequence for the progressive channel swap strategy.
* Fix the corner case where one layer is not sparse but needs permutation applied due to its siblings.
* Fix the deprecated functions in ASP unit tests.
* Fix the sparsity info typo in the ASP lib.
* [Enhancement] Set an identical random seed on all GPUs so that permutation search generates the same results everywhere.
* Update the README.md with the identical random seed setting and NeurIPS info.
* Integrate the Pybind11 enhancement of permutation search into the ASP lib.
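ASP enforces 2:4 structured sparsity: in every group of four weights, the two smallest-magnitude entries are zeroed, and the permutation search above reorders channels so that this pruning discards as little magnitude as possible. A sketch of the 2:4 mask itself (not apex's ASP API):

```python
def mask_2_to_4(row):
    """Keep the two largest-magnitude entries in each group of four
    and zero the rest -- the 2:4 pattern that ASP enforces and that
    sparse tensor cores accelerate."""
    out = []
    for g in range(0, len(row), 4):
        group = row[g:g + 4]
        # indices of the two largest |values| in this group of four
        keep = sorted(range(len(group)), key=lambda i: abs(group[i]))[-2:]
        out.extend(v if i in keep else 0.0 for i, v in enumerate(group))
    return out

print(mask_2_to_4([0.1, -0.9, 0.5, 0.2, 4.0, 0.0, -3.0, 1.0]))
# [0.0, -0.9, 0.5, 0.0, 4.0, 0.0, -3.0, 0.0]
```

A channel permutation changes which weights land in the same group of four, which is why searching over permutations before masking can preserve more of the large-magnitude weights.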
-
Masaki Kozuki authored
* add kwarg `custom_sync_context_handler`
* add kwargs to ignore `custom_sync_context_handler` when it is mistakenly passed to fwd/bwd funcs
-
- 31 Jan, 2022 2 commits
Masaki Kozuki authored
* Free output tensor on each pipeline stage for a smaller memory footprint
  see: https://github.com/NVIDIA/Megatron-LM/commit/057b086c689b164864455430c223ab52fd86bbcb
* ref: https://github.com/NVIDIA/Megatron-LM/commit/945ece943149b63511e9d0ec3df8effe7f3c13ff
* ref: https://github.com/NVIDIA/Megatron-LM/commit/9a8b89acd8f6ba096860170d0e30ddc0bc2bacd4
* remove position embedding group in destroy
* pass deallocate_pipeline_outputs to backward_step
* fix typo
* missing deallocate_pipeline_outputs
* fix typo: grad_ouptut -> grad_output
* update tests
* remove accessed todo
* test with data parallel size of 2 if there are 8 or more GPUs
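The memory saving above comes from freeing each stage's output activation once it has been sent downstream, leaving only a small stub behind so autograd's graph bookkeeping survives. A toy Python sketch of the idea (the class and function are stand-ins, not Megatron's actual `deallocate_output_tensor`):

```python
class Activation:
    """Toy stand-in for an output tensor held by a pipeline stage."""
    def __init__(self, data):
        self.data = data

def deallocate_output(act, enabled=True):
    """After the activation has been sent to the next stage, shrink its
    storage to a one-element stub: the downstream stage already has the
    values, and the local autograd graph only needs the tensor object,
    not its contents."""
    if enabled:
        act.data = [0.0]
    return act

act = Activation([0.0] * 1000)   # a large activation buffer
deallocate_output(act)           # sent downstream, now reclaim memory
print(len(act.data))             # 1
```

The `deallocate_pipeline_outputs` flag threaded through `backward_step` in the bullets above is the switch that enables this behavior.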
-
chochowski authored
* fix graph capture failure; fix norm computation with full_ar and clip_after
* fix group range to compute l2_norm

Co-authored-by: seryilmaz <seryilmaz@nvidia.com>
Co-authored-by: mchochowski <mchochowski@nvidia.com>
-
- 29 Jan, 2022 1 commit
Burc Eryilmaz authored
-