- 25 Feb, 2022 3 commits
  - Masaki Kozuki authored
  - Masaki Kozuki authored
  - Masaki Kozuki authored
- 23 Feb, 2022 4 commits
  - Masaki Kozuki authored
  - Masaki Kozuki authored
  - Thor Johnsen authored: Change data type for virtual tensors to float
  - Thor Johnsen authored
- 15 Feb, 2022 1 commit
  - Masaki Kozuki authored
- 12 Feb, 2022 1 commit
  - Masaki Kozuki authored
- 11 Feb, 2022 1 commit
  - Stas Bekman authored:
    * [FusedRMSNorm doc] add epsilon to formula
    * correct
    * better wording
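The epsilon mentioned in the doc fix above sits under the square root of the RMS statistic. As a hedged illustration only (plain Python, not the apex CUDA implementation; the input `x` and weight `g` here are made up for the example):

```python
import math

def rms_norm(x, weight, eps=1e-5):
    """Root-mean-square normalization with epsilon added under the
    square root for numerical stability:
        y_i = x_i / sqrt(mean(x^2) + eps) * g_i
    """
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * g for v, g in zip(x, weight)]

# With eps=0 this reduces to plain RMS normalization:
# mean(x^2) = (9 + 16) / 2 = 12.5
out = rms_norm([3.0, 4.0], [1.0, 1.0], eps=0.0)
```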
- 10 Feb, 2022 1 commit
  - Masaki Kozuki authored
- 07 Feb, 2022 1 commit
  - eqy authored
- 04 Feb, 2022 1 commit
  - eqy authored:
    * FusedRMSNorm based on FusedLayerNorm
    * refactor duplicated kernels
    * delete comments
    * delete comments
    * cleanup
    * cleanup
    * cleanup, fixed clobbering forward_affine_mixed_dtypes
    * fix pybind naming and add MixedFused test
    * undo skipping
    * check elementwise_affine
    * Update tests/L0/run_fused_layer_norm/test_fused_layer_norm.py ("Oof, nice catch, thanks")
    Co-authored-by: Masaki Kozuki <masaki.kozuki.2014@gmail.com>
- 01 Feb, 2022 2 commits
  - ChongyuNVIDIA authored:
    * Add the permutation related support as the extension for the ASP lib.
    * [Fix] Track the permutation sequence for the progressive channel swap strategy.
    * Fix the corner case where one layer is not sparse but needs to apply permutation due to its siblings.
    * Fix the deprecated functions in ASP unit tests.
    * Fix the sparsity info typo in the ASP lib.
    * [Enhancement] Set the identical random seed for all GPUs to make sure the same results are generated in permutation search.
    * Update the README.md with identical random seed setting and NeurIPS info.
    * Integrate the Pybind11 enhancement of permutation search into the ASP lib.
  - Masaki Kozuki authored:
    * add kwarg of `custom_sync_context_handler`
    * add kwargs to ignore `custom_sync_context_handler` when it is mistakenly passed to fwd/bwd funcs
- 31 Jan, 2022 2 commits
  - Masaki Kozuki authored:
    * Free output tensor on each pipeline stage for a smaller memory footprint (see: https://github.com/NVIDIA/Megatron-LM/commit/057b086c689b164864455430c223ab52fd86bbcb)
    * ref: https://github.com/NVIDIA/Megatron-LM/commit/945ece943149b63511e9d0ec3df8effe7f3c13ff
    * ref: https://github.com/NVIDIA/Megatron-LM/commit/9a8b89acd8f6ba096860170d0e30ddc0bc2bacd4
    * remove position embedding group in destroy
    * pass deallocate_pipeline_outputs to backward_step
    * fix typo
    * missing deallocate_pipeline_outputs
    * fix typo: grad_ouptut -> grad_output
    * update tests
    * remove addressed todo
    * test with data parallel size of 2 if there are 8 or more GPUs
  - chochowski authored:
    * fix graph capture failure; fix norm computation with full_ar and clip_after
    * fix group range to compute l2_norm
    Co-authored-by: seryilmaz <seryilmaz@nvidia.com>
    Co-authored-by: mchochowski <mchochowski@nvidia.com>
- 29 Jan, 2022 1 commit
  - Burc Eryilmaz authored
- 28 Jan, 2022 2 commits
  - Masaki Kozuki authored:
    * cosmetic refactor in test
    * log with PID
    * log more info: rank, pid, filename, lineNo
  - Masaki Kozuki authored:
    * have get_kth_microbatch deal with None batch
    * broadcast based on tensor parallel rank
    * dtype
    * remove unnecessary .cuda()
    Processes with tensor parallel rank != 0 don't need to prepare `torch.utils.data.DataLoader` instances, so the `batch` argument of `get_kth_microbatch` can be `None`; the current implementation doesn't allow for that.
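The `None`-batch handling described above can be sketched framework-free; the function name mirrors the one in the commit, but this is a simplified stand-in for illustration, not the apex implementation:

```python
def get_kth_microbatch(batch, k, micro_batch_size):
    """Return the k-th microbatch slice of a global batch.

    Ranks that load no data (tensor parallel rank != 0 in the
    commit above) pass batch=None, which is simply propagated
    instead of raising.
    """
    if batch is None:
        return None
    start = k * micro_batch_size
    return batch[start:start + micro_batch_size]

first = get_kth_microbatch(list(range(8)), 1, 4)   # [4, 5, 6, 7]
none_case = get_kth_microbatch(None, 0, 4)         # None
```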
- 21 Jan, 2022 2 commits
  - Masaki Kozuki authored:
    * add keyword argument of `grad_scaler`
    * update test
    * pass dtype to fwd_step_func
    * add log
    * calc loss in autocast as per https://pytorch.org/docs/stable/amp.html#autocasting
    * option to turn off autocast inside forward_step function, as some users activate `autocast` outside the fwd/bwd functions
    * add missing arg of disable_autocast
    * reorder args of no pipeline
  - eqy authored: CC @crcrpar @ptrblck
- 19 Jan, 2022 1 commit
  - Masaki Kozuki authored
- 13 Jan, 2022 1 commit
  - Shintaro Iwasaki authored
- 17 Dec, 2021 1 commit
  - Masaki Kozuki authored: Add an argument of `dtype` to forward_backward functions to specify the dtype used in p2p comm (#1249)
    * let users specify dtype for p2p comm, taking the possibility of O2-style AMP into account
    * add `dtype` argument to forward_backward functions
    * fix
    * better message
    * add docstring of dtype
    * add a link to dtype logic of p2p comm
- 16 Dec, 2021 2 commits
  - Masaki Kozuki authored
  - eqy authored:
    * reduce bert memory usage, placeholder data for gpt
    * update gpt test
    * fix
    * Update tests/L0/run_transformer/run_bert_minimal_test.py (remove debugging indexing)
    * Update tests/L0/run_transformer/run_bert_minimal_test.py (cleanup)
    Co-authored-by: Masaki Kozuki <masaki.kozuki.2014@gmail.com>
- 15 Dec, 2021 2 commits
  - Masaki Kozuki authored:
    * apply formatter & remove duplicate func def
    * dry CUDA_HOME None check
    * `--threads 4`
  - Masaki Kozuki authored
- 14 Dec, 2021 2 commits
  - Masaki Kozuki authored:
    * merge .so files
    * odr
    * fix build
    * update import
    * apply psf/black with max line length of 120
    * update
    * fix
    * update
    * build fixed again but undefined symbol again
    * fix 2, still layer norm grad is undefined
    * remove unused cpp files
    * without layer_norm.cuh, import works
    * import fast_multihead_attn works... but why? Was the unnecessary `#include "layer_norm.cuh"` the culprit causing the shared objects to fail to link `HostApplyLayerNorm` and `HostLayerNormGradient`?
    * clean up layer norm
  - eqy authored
- 10 Dec, 2021 2 commits
  - Masaki Kozuki authored:
    * update parallel_state
    * update pipeline common funcs - forward_step and backward_step
    * update pipelining w/o interleaving
    * type hint
    * merge utils into without_interleaving (motivation: functions in utils are only used by forward_backward_pipelining_without_interleaving)
    * fix handling of `model_type`
    * fix import of DDP
    * update set_input_tensor method
    * fix
    * cosmetic
    * update model
    * refactor pipeline test scripts
  - Rishi Puri authored: Minimal gpt pipeline parallel (builds off of minimal_bert_pipeline_parallel) including cpu-offloading (#1222)
    * minimal bert pipeline parallel test
    * first draft of gpt minimal test
    * framework to scale up the gpt2 test for a variety of distributed setups
    * adding gpt_minimal_test to list of multigpu tests
    Co-authored-by: Eddie Yan <eddiey@nvidia.com>
    Co-authored-by: riship <riship@nvidia.com>
- 09 Dec, 2021 2 commits
  - Masaki Kozuki authored:
    * pass `self.mask_additive`
    * clang-format
    * removing THCState
  - Kevin Stephano authored:
    * Add fused mixed precision lamb optimizer.
    * Fix device usage in constructor.
    * Fix sending param_group tensor state to device.
    * Remove unneeded device set.
- 19 Nov, 2021 3 commits
  - eqy authored:
    * minimal bert pipeline parallel test
    * fix global and cleanup
    * use get_forward_backward_func
    * cleanup and fix some tests
  - Masaki Kozuki authored
    Co-authored-by: Sangkug Lym <slym@nvidia.com>
  - Masaki Kozuki authored:
    * init logging use
    * fix
    * clean up
    * fp32 p2p comm
    * init
    * Dynamic global batch size with `MegatronPretrainingSampler`. I couldn't make this script work with `MegatronPretrainingRandomSampler` because the random sampler seems to have some requirements for global batch size, total number of samples, local minibatch size, etc. that I'm not familiar with for now.
    * revive original pipeline parallel test
    * update MULTIGPU_TEST: add dynamic batchsize test
    * run MegatronPretrainingRandomSampler
    * fix comment
    * fix
    * update
    * cosmetic
    * add note
    * Apply 2 suggestion(s) to 2 file(s)
    * change following https://github.com/NVIDIA/apex/pull/1210
    * fix
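A pretraining sampler of the kind named above hands each data-parallel rank a contiguous slice of every global batch. As a simplified, framework-free sketch of that partitioning idea (an assumed behavior for illustration, not the Megatron/apex code):

```python
def pretraining_batch(consumed_samples, global_batch_size,
                      micro_batch_size, data_parallel_rank,
                      data_parallel_size):
    """Return this rank's sample indices for the next global batch.

    Each rank owns a contiguous chunk of the global batch; the chunk
    size must divide evenly into microbatches.
    """
    samples_per_rank = global_batch_size // data_parallel_size
    assert samples_per_rank % micro_batch_size == 0
    start = consumed_samples + data_parallel_rank * samples_per_rank
    return list(range(start, start + samples_per_rank))

# rank 1 of 2, global batch of 8: this rank gets the second half
idx = pretraining_batch(0, 8, 2, 1, 2)   # [4, 5, 6, 7]
```

Changing `global_batch_size` between calls is what makes the batch size dynamic; the consumed-sample offset keeps ranks aligned across batches.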
- 10 Nov, 2021 2 commits
  - Masaki Kozuki authored
  - eqy authored