- 16 Dec, 2021 2 commits

Masaki Kozuki authored

eqy authored
* reduce bert memory usage, placeholder data for gpt
* update gpt test
* fix
* Update tests/L0/run_transformer/run_bert_minimal_test.py: remove debugging indexing
* Update tests/L0/run_transformer/run_bert_minimal_test.py: cleanup
Co-authored-by: Masaki Kozuki <masaki.kozuki.2014@gmail.com>
- 15 Dec, 2021 2 commits

Masaki Kozuki authored
* apply formatter & remove duplicate func def
* DRY the CUDA_HOME None check
* `--threads 4`

Masaki Kozuki authored
- 14 Dec, 2021 2 commits

Masaki Kozuki authored
* merge .so files
* odr
* fix build
* update import
* apply psf/black with max line length of 120
* update
* fix
* update
* build fixed again but undefined symbol again
* fix 2, still layer norm grad is undefined
* remove unused cpp files
* without layer_norm.cuh, import works
* import fast_multihead_attn works... but why? The unnecessary `#include "layer_norm.cuh"` was the culprit preventing the shared objects from linking `HostApplyLayerNorm` and `HostLayerNormGradient`
* clean up layer norm

eqy authored
- 10 Dec, 2021 2 commits

Masaki Kozuki authored
* update parallel_state
* update pipeline common funcs - forward_step and backward_step
* update pipelining w/o interleaving
* type hint
* merge utils into without_interleaving. Motivation: functions in utils are only used by forward_backward_pipelining_without_interleaving
* fix handling of `model_type`
* fix import of DDP
* update set_input_tensor method
* fix
* cosmetic
* update model
* refactor pipeline test scripts

Rishi Puri authored
Minimal gpt pipeline parallel (builds off of minimal_bert_pipeline_parallel) including cpu-offloading (#1222)
* minimal bert pipeline parallel test
* first draft of gpt minimal test (iterated over many commits)
* framework to scale up the gpt2 test for a variety of distributed setups (iterated over many commits)
* adding gpt_minimal_test to list of multigpu tests
Co-authored-by: Eddie Yan <eddiey@nvidia.com>
Co-authored-by: riship <riship@nvidia.com>
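The pipelining-without-interleaving work above reorganizes a 1F1B (one-forward-one-backward) schedule. As a rough illustration of how such a schedule partitions microbatches per pipeline stage (a plain-Python sketch with a hypothetical helper name, not apex's actual API):

```python
def pipeline_1f1b_phases(num_microbatches, pipeline_size, rank):
    """Sketch of the 1F1B phase sizes for one pipeline stage.

    Hypothetical helper, not apex's API: returns (warmup, steady,
    cooldown) counts. Later stages need fewer warmup forwards before
    they can alternate one forward with one backward.
    """
    warmup = min(pipeline_size - rank - 1, num_microbatches)
    steady = num_microbatches - warmup  # paired forward+backward steps
    cooldown = warmup                   # leftover backwards at the end
    return warmup, steady, cooldown

# Example: 4 pipeline stages, 8 microbatches.
for rank in range(4):
    print(rank, pipeline_1f1b_phases(8, 4, rank))
```

The first stage does the most warmup forwards and the last stage none, which is why the last stage can start its backward passes earliest.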
- 09 Dec, 2021 2 commits

Masaki Kozuki authored
* pass `self.mask_additive`
* clang-format
* removing THCState

Kevin Stephano authored
* Add fused mixed precision LAMB optimizer.
* Fix device usage in constructor.
* Fix sending param_group tensor state to device.
* Remove unneeded device set.
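The fused mixed precision LAMB commit implements the LAMB update in a fused GPU kernel. The per-layer trust-ratio math being fused can be sketched in plain Python (a simplified single-parameter-group version for illustration, not the fused implementation):

```python
import math

def lamb_step(w, g, m, v, step, lr=1e-3, beta1=0.9, beta2=0.999,
              eps=1e-6, weight_decay=0.01):
    """Plain-Python sketch of one LAMB update for a parameter vector.

    The fused optimizer does this per layer, on GPU, in mixed precision;
    here everything is fp64 lists. Returns updated (w, m, v).
    """
    m = [beta1 * mi + (1 - beta1) * gi for mi, gi in zip(m, g)]
    v = [beta2 * vi + (1 - beta2) * gi * gi for vi, gi in zip(v, g)]
    m_hat = [mi / (1 - beta1 ** step) for mi in m]
    v_hat = [vi / (1 - beta2 ** step) for vi in v]
    update = [mh / (math.sqrt(vh) + eps) + weight_decay * wi
              for mh, vh, wi in zip(m_hat, v_hat, w)]
    # Layer-wise trust ratio: scale the step by ||w|| / ||update||.
    w_norm = math.sqrt(sum(wi * wi for wi in w))
    u_norm = math.sqrt(sum(ui * ui for ui in update))
    trust = w_norm / u_norm if w_norm > 0 and u_norm > 0 else 1.0
    w = [wi - lr * trust * ui for wi, ui in zip(w, update)]
    return w, m, v
```

The trust ratio is what distinguishes LAMB from Adam with weight decay: large layers take proportionally larger steps, which is what makes very large batch sizes trainable.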
- 19 Nov, 2021 3 commits

eqy authored
* minimal bert pipeline parallel test
* fix global and cleanup
* use get_forward_backward_func
* cleanup and fix some tests

Masaki Kozuki authored
Co-authored-by: Sangkug Lym <slym@nvidia.com>

Masaki Kozuki authored
* init logging use
* fix
* clean up
* fp32 p2p comm
* init
* Dynamic global batch size with `MegatronPretrainingSampler`. I couldn't make this script work with `MegatronPretrainingRandomSampler` because the random sampler seems to have some requirements on global batch size, total number of samples, local minibatch size, etc. that I'm not familiar with for now
* revive original pipeline parallel test
* update MULTIGPU_TEST: add dynamic batchsize test
* run MegatronPretrainingRandomSampler
* fix comment
* fix
* update
* cosmetic
* add note
* Apply 2 suggestion(s) to 2 file(s)
* change following https://github.com/NVIDIA/apex/pull/1210
* fix
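The dynamic-global-batch-size commit above relies on a sequential pretraining sampler handing each data-parallel rank a contiguous slice of every global batch. A plain-Python sketch of that idea (a simplification for illustration, not apex's actual `MegatronPretrainingSampler`):

```python
def pretraining_batches(total_samples, global_batch_size,
                        data_parallel_rank, data_parallel_size):
    """Sketch of a sequential pretraining sampler (not apex's code).

    Walks sample indices in order, and for every global batch yields
    only this data-parallel rank's contiguous slice. Because the slice
    is recomputed per batch, the global batch size could change between
    batches, which is what the dynamic-batch-size test exercises.
    """
    per_rank = global_batch_size // data_parallel_size
    start = data_parallel_rank * per_rank
    batch = []
    for idx in range(total_samples):
        batch.append(idx)
        if len(batch) == global_batch_size:
            yield batch[start:start + per_rank]
            batch = []

# Rank 1 of 2 sees the second half of each global batch of 4:
print(list(pretraining_batches(8, 4, 1, 2)))  # [[2, 3], [6, 7]]
```

A random sampler adds constraints (consistent shuffling across ranks, epoch boundaries divisible by the global batch size), which matches the commit author's note about `MegatronPretrainingRandomSampler` being harder to drive with a changing batch size.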
- 10 Nov, 2021 3 commits

Masaki Kozuki authored

eqy authored

eqy authored
- 27 Oct, 2021 2 commits

Masaki Kozuki authored
* Persistent LayerNorm: Multi-CTA Rewrite
* autocast support
Co-authored-by: Young-Jun Ko <youngjun.ko@gmail.com>

Masaki Kozuki authored
* Init apex.ppu (pipeline model parallel utility). Reference commit:
  ```
  commit 5ab646376d67831601d5552c193241d017f1b35c (HEAD -> main, internal/main)
  Merge: 14f2c684 7b293d9b
  Author: Mohammad Shoeybi <mshoeybi@nvidia.com>
  Date: Wed Sep 22 22:57:54 2021 -0700

  Merge branch 'add_BOS' into 'main'
  Add Beginning of Sentence token option and adding semaphore while multi-threading to prevent crashes and hangs due to connection keep-alives
  See merge request ADLR/megatron-lm!328
  ```
* removing get_args and replace import - phase 1
* removing get_args and replace import - phase 2
* move ppu to apex.transformer.pipeline_parallel
* update two __init__.py
* update READMEs
* mpu -> parallel_state & tensor_parallel
* fix
* remove not pipeline files
* separate schedules.py - phase 1
* dissect schedules.py
* data_iterators -> batch
* remove optimizer from forward_backward_step funcs
* init test
* Apply 2 suggestion(s...
- 23 Oct, 2021 1 commit

Masaki Kozuki authored
* switch from clone to out-of-place subtract
* Update apex/mpu/cross_entropy.py
* Apply 1 suggestion(s) to 1 file(s)
Co-authored-by: Eddie Yan <eddiey@nvidia.com>
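The clone-vs-out-of-place distinction above is about the max-subtraction step inside a numerically stable cross entropy: building a new tensor with `logits - max` rather than cloning and mutating in place. The underlying math, in plain Python for illustration (not apex's tensor-parallel implementation):

```python
import math

def stable_softmax_cross_entropy(logits, target):
    """Plain-Python illustration of max-subtraction in cross entropy.

    Subtracting the max out of place builds a new list instead of
    mutating a clone of the input; numerically it prevents exp()
    overflow for large logits.
    """
    m = max(logits)
    shifted = [x - m for x in logits]          # out-of-place subtract
    log_sum_exp = math.log(sum(math.exp(x) for x in shifted))
    return log_sum_exp - shifted[target]

# Huge logits would overflow exp() without the shift:
print(round(stable_softmax_cross_entropy([1000.0, 1000.0], 0), 6))  # 0.693147
```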
- 18 Oct, 2021 1 commit

Masaki Kozuki authored
Changes include:
- THC headers removal
- TH macros replacement
- fix some typos in comments
- 16 Oct, 2021 1 commit

Masaki Kozuki authored
- 14 Oct, 2021 2 commits

Burc Eryilmaz authored
Change the chunking scheme for the full-allreduce case and add a parameter-order argument, both to enable contiguous chunking of allgather (#1190)

Nan Zheng authored
1. remove the weight broadcast in the constructor
2. disable unnecessary allreduces for clip-after-ar
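The contiguous-chunking change above is about how flattened parameters are partitioned across ranks: if each rank owns one contiguous span, the subsequent allgather of updated shards is a single contiguous copy per rank. A hedged sketch of such a partitioning (a hypothetical helper, not apex's code):

```python
def contiguous_chunks(num_elements, world_size):
    """Sketch of contiguous chunking for a flattened parameter buffer.

    Each rank owns one contiguous [start, end) span, so an allgather of
    the updated shards writes each rank's contribution into one
    contiguous region. Hypothetical helper, not apex's implementation.
    """
    base, rem = divmod(num_elements, world_size)
    chunks, start = [], 0
    for rank in range(world_size):
        size = base + (1 if rank < rem else 0)  # spread the remainder
        chunks.append((start, start + size))
        start += size
    return chunks

print(contiguous_chunks(10, 4))  # [(0, 3), (3, 6), (6, 8), (8, 10)]
```

The alternative (interleaved or strided ownership) requires a gather into scattered offsets, which is the overhead contiguous chunking avoids.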
- 13 Oct, 2021 1 commit

eqy authored
- 08 Oct, 2021 2 commits

Masaki Kozuki authored
* run backward
* remove custom_fwd/custom_bwd

eqy authored
- 07 Oct, 2021 1 commit

eqy authored
- 06 Oct, 2021 1 commit

Masaki Kozuki authored
* [ColumnParallelLinear] Test behavior in autocast
* fix test
* cast manually to autocast dtype
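For context on what a `ColumnParallelLinear` shards: the weight's output features are split across tensor-parallel ranks, each rank computes its slice of the output, and concatenating the slices reproduces the full linear layer. A plain-Python sketch of that equivalence (illustration only, not apex's implementation):

```python
def matmul(x, w):
    # x: list of rows; w: in_features x out_features as a list of rows.
    return [[sum(xi * w[k][j] for k, xi in enumerate(row))
             for j in range(len(w[0]))] for row in x]

def split_columns(w, tp_size):
    """Shard a weight's output features across tensor-parallel ranks,
    as a column-parallel linear does (sketch, not apex's code)."""
    per = len(w[0]) // tp_size
    return [[row[r * per:(r + 1) * per] for row in w]
            for r in range(tp_size)]

# Each rank computes its slice; concatenating slices equals the full matmul.
x = [[1.0, 2.0]]
w = [[1.0, 0.0, 2.0, 0.0],
     [0.0, 1.0, 0.0, 2.0]]
shards = split_columns(w, 2)
partials = [matmul(x, ws) for ws in shards]
full = [sum((p[0] for p in partials), [])]
print(full)  # [[1.0, 2.0, 2.0, 4.0]]
```

The autocast issue in the commit is orthogonal to the sharding: under mixed precision each rank must cast its input to the autocast dtype consistently, which the commit does manually.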
- 02 Oct, 2021 1 commit

Masaki Kozuki authored
Co-authored-by: Piotr Bialecki <pbialecki@nvidia.com>
Co-authored-by: Eddie Yan <eddiey@nvidia.com>
Co-authored-by: Rishi Puri <riship@nvidia.com>
Co-authored-by: Sangkug Lym <slym@nvidia.com>
- 30 Sep, 2021 1 commit

X Wang authored
- 28 Sep, 2021 1 commit

X Wang authored
- 24 Sep, 2021 2 commits

romerojosh authored

Masaki Kozuki authored
- 08 Sep, 2021 1 commit

Masaki Kozuki authored
- passing include directories to `CUDAExtension`'s `include_dirs` argument
- removing `-I/path/to/dir` arguments from `extra_compile_args`
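The change above can be sketched as a `setup.py` fragment. `CUDAExtension` and its `include_dirs`/`extra_compile_args` parameters are real `torch.utils.cpp_extension` API; the extension name, source files, and paths below are hypothetical placeholders:

```python
# Sketch only: pass include directories via CUDAExtension's include_dirs
# instead of hand-rolled -I flags inside extra_compile_args.
from torch.utils.cpp_extension import CUDAExtension

ext = CUDAExtension(
    name="fused_example",  # hypothetical extension name
    sources=["csrc/fused_example.cpp", "csrc/fused_example_kernel.cu"],
    include_dirs=["csrc/includes"],  # replaces "-Icsrc/includes"
    extra_compile_args={"cxx": ["-O3"], "nvcc": ["-O3"]},
)
```

Letting the build helper own the include paths keeps them portable across compilers, since `-I` flags passed verbatim to both `cxx` and `nvcc` are easy to get out of sync.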
- 04 Sep, 2021 1 commit

Burc Eryilmaz authored
* support for fused dense layer with cublasLt, fusion in both fprop and bprop
* fix typo causing syntax error
* add fused GEMM+gelu+GEMM module
* fix typo for workspace size
* update cublas check for 11600
* add tests for fused dense layer
* fix CUDA 10.x path
* safer guard around CUBLAS constants, remove unreferenced variable
* more guard changes
* guard against cublas version instead of cuda
Co-authored-by: Sukru Eryilmaz <seryilmaz@computelab-dgx1v-32.nvidia.com>
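The fused GEMM+gelu+GEMM module computes a two-layer MLP in one pass, keeping the intermediate activation out of global memory. The math being fused, reduced to a scalar chain in plain Python with the tanh GeLU approximation (an illustration of the computation, not apex's cublasLt code):

```python
import math

def gelu(x):
    # tanh approximation of GeLU, the form commonly used in fused kernels
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi)
                                      * (x + 0.044715 * x ** 3)))

def fused_mlp(x, w1, b1, w2, b2):
    """What a fused GEMM+bias+gelu+GEMM chain computes, per element.

    The cublasLt version does this per tile via GEMM epilogues, without
    materializing the intermediate h. Illustration only, not apex code.
    """
    h = gelu(x * w1 + b1)   # first GEMM + bias + gelu epilogue
    return h * w2 + b2      # second GEMM + bias

print(round(fused_mlp(1.0, 2.0, 0.0, 1.0, 0.0), 4))
```

The win from fusion is bandwidth, not FLOPs: the activation between the two GEMMs never round-trips through global memory in forward or backward.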
- 02 Sep, 2021 5 commits

Thor Johnsen authored
Optional NCCL communicator argument to init method

Thor Johnsen authored

Thor Johnsen authored
Bug fix in wgrad

Thor Johnsen authored

Thor Johnsen authored
Bug fixes