- 16 May, 2022 (2 commits)
  - Vijay Korthikanti authored
  - Vijay Korthikanti authored
- 09 May, 2022 (1 commit)
  - Sangkug Lym authored: change dummy_handler to nullcontext (see the sketch below)
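The commit above swaps a hand-rolled no-op context manager for the standard-library equivalent. A minimal sketch, assuming `dummy_handler` was a no-op `with`-target; the class below is illustrative, not the repo's code:

```python
import contextlib

# Illustrative stand-in for the removed helper (assumed shape, not repo code):
class dummy_handler:
    def __enter__(self):
        return self
    def __exit__(self, *exc):
        return False

# The stdlib replacement, available since Python 3.7, behaves identically:
with contextlib.nullcontext():
    pass  # body runs unchanged; nullcontext does nothing on enter/exit
```

Using `contextlib.nullcontext` drops a one-off class definition and signals the intent (a disabled feature path) directly.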
- 25 Mar, 2022 (1 commit)
  - Sangkug Lym authored
- 19 Jan, 2022 (1 commit)
  - Vijay Korthikanti authored
- 12 Jan, 2022 (1 commit)
  - Vijay Korthikanti authored
- 24 Nov, 2021 (1 commit)
  - Szymon Migacz authored
- 07 Sep, 2021 (1 commit)
  - Sangkug Lym authored
- 02 Sep, 2021 (3 commits)
  - slym authored
  - slym authored
  - Sangkug Lym authored: allreduce overlap with wgrad GEMM; change custom delay to dummy add (see the sketch below)
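A minimal sketch of the overlap technique named in that commit: launch the input-gradient all-reduce asynchronously, then run the weight-gradient GEMM while the collective is in flight. The function name and 2D tensor shapes are illustrative assumptions, not the repo's API:

```python
import torch
import torch.distributed as dist

def linear_backward_with_overlap(grad_output, inputs, weight):
    # Sketch only: assumes an initialized process group and 2D tensors.
    grad_input = grad_output.matmul(weight)              # dgrad GEMM
    handle = dist.all_reduce(grad_input, async_op=True)  # async tensor-parallel all-reduce
    grad_weight = grad_output.t().matmul(inputs)         # wgrad GEMM overlaps the collective
    handle.wait()                                        # block only after the overlap window
    return grad_input, grad_weight
```

Because the wgrad GEMM has no data dependence on the reduced input gradient, it can hide most of the collective's latency.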
- 27 Aug, 2021 (2 commits)
  - Sangkug Lym authored
  - Ryan Prenger authored
- 26 Aug, 2021 (1 commit)
  - rprenger authored
- 19 Aug, 2021 (1 commit)
  - mshoeybi authored
- 16 Aug, 2021 (1 commit)
  - Mohammad Shoeybi authored
- 30 Jul, 2021 (1 commit)
  - Deepak Narayanan authored:
    - Accumulate encoder hidden state gradient to handle skip connection
    - Correctly compute the number of layers in encoder / decoder for T5 model
    - Ensure weights are initialized the same way in embeddings
    - Synchronize embedding gradients across encoder and decoder for T5 model (see the sketch after this list)
    - Support for checkpoint loading and saving
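When encoder and decoder embeddings are tied but held on different pipeline ranks, their gradients must be summed so both copies take the same update. A minimal sketch, where `embedding_group` is a hypothetical process group containing exactly the ranks that hold a copy of the embedding:

```python
import torch.distributed as dist

def allreduce_embedding_grads(embedding, embedding_group):
    # Sketch only: `embedding_group` is an assumed process group, not repo API.
    grad = embedding.weight.grad
    if grad is not None:
        # Sum-reduce so encoder and decoder copies see identical gradients.
        dist.all_reduce(grad, group=embedding_group)
```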
- 02 Jul, 2021 (1 commit)
  - rprenger authored
- 18 Mar, 2021 (1 commit)
  - Mohammad Shoeybi authored
- 09 Feb, 2021 (1 commit)
  - Deepak Narayanan authored:
    - Split a model's computation into multiple virtual stages as needed, and schedule communication correctly between these virtual stages (see the sketch after this list)
    - Move schedule code into `schedules.py` and communication code into `p2p_communication.py`
    - Use hyphens instead of spaces in all time logging for consistency
    - Factor out code in megatron/training.py into helper functions
    - Refactor evaluate() function: make it use forward_backward_schedule functions
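A minimal sketch of how layers might be assigned to virtual stages in an interleaved pipeline schedule. All names here are illustrative assumptions, not the repo's API: with P pipeline ranks and V virtual stages per rank, the model splits into P*V chunks and rank r owns chunks r, r+P, r+2P, ...:

```python
def layers_for_virtual_stage(num_layers, pipeline_size, virtual_size, rank, chunk):
    # Sketch only: assumes num_layers divides evenly into pipeline_size * virtual_size chunks.
    layers_per_chunk = num_layers // (pipeline_size * virtual_size)
    start = (chunk * pipeline_size + rank) * layers_per_chunk
    return list(range(start, start + layers_per_chunk))

# Example: 16 layers, 4 pipeline ranks, 2 virtual stages each.
print(layers_for_virtual_stage(16, 4, 2, rank=0, chunk=0))  # [0, 1]
print(layers_for_virtual_stage(16, 4, 2, rank=0, chunk=1))  # [8, 9]
```

Giving each rank several non-contiguous chunks shrinks the pipeline bubble, at the cost of more frequent inter-stage communication.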
- 26 Jan, 2021 (1 commit)
  - mohammad authored
- 30 Dec, 2020 (1 commit)
  - mshoeybi authored
- 19 Dec, 2020 (1 commit)
  - mohammad authored
- 12 Nov, 2020 (3 commits)
  - Deepak Narayanan authored
  - Deepak Narayanan authored
  - Deepak Narayanan authored: Also includes the following changes for the inter-layer model-parallel implementation:
    - Refactoring of model implementations
    - Training loop changes to support inter-layer communication using `ring_exchange` (see the sketch after this list)
    - New groups for inter-layer communication
    - Checkpoint changes
    - Command line arguments
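The commit names a `ring_exchange` primitive; standard PyTorch expresses the same inter-layer pattern with paired point-to-point operations. A minimal sketch, assuming an initialized process group (the function name is an assumption, not the repo's API):

```python
import torch
import torch.distributed as dist

def exchange_activations(send_tensor, prev_rank, next_rank):
    # Sketch only: each pipeline rank sends activations to the next stage
    # while receiving from the previous one, posted as a single batch to
    # avoid ordering deadlocks between the paired sends and receives.
    recv_tensor = torch.empty_like(send_tensor)
    ops = [
        dist.P2POp(dist.isend, send_tensor, next_rank),
        dist.P2POp(dist.irecv, recv_tensor, prev_rank),
    ]
    for req in dist.batch_isend_irecv(ops):
        req.wait()
    return recv_tensor
```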
- 01 Oct, 2020 (1 commit)
  - mohammad authored
- 02 Sep, 2020 (2 commits)
- 27 Aug, 2020 (1 commit)
  - Boris Fomitchev authored
    Signed-off-by: Boris Fomitchev <bfomitchev@nvidia.com>
- 07 Aug, 2020 (2 commits)
  - Boris Fomitchev authored
    Signed-off-by: Boris Fomitchev <bfomitchev@nvidia.com>
  - Boris Fomitchev authored
    Signed-off-by: Boris Fomitchev <bfomitchev@nvidia.com>
- 06 Aug, 2020 (3 commits)
  - Boris Fomitchev authored
    Signed-off-by: Boris Fomitchev <bfomitchev@nvidia.com>
  - Boris Fomitchev authored
    Signed-off-by: Boris Fomitchev <bfomitchev@nvidia.com>
  - Boris Fomitchev authored
    Signed-off-by: Boris Fomitchev <bfomitchev@nvidia.com>
- 30 Jul, 2020 (1 commit)
  - Boris Fomitchev authored
    Signed-off-by: Boris Fomitchev <bfomitchev@nvidia.com>
- 22 Jul, 2020 (1 commit)
  - Boris Fomitchev authored
    Signed-off-by: Boris Fomitchev <bfomitchev@nvidia.com>
- 05 Jun, 2020 (1 commit)
  - Neel Kant authored
- 21 May, 2020 (1 commit)
  - Neel Kant authored
- 19 May, 2020 (1 commit)
  - Neel Kant authored