Commits · fdcaeba044fcdfe8a3e247d7f842c475a6171f55 · OpenDAS / Megatron-LM

12 Nov, 2020 25 commits
- Remove timing labels that don't make sense · fdcaeba0
  Deepak Narayanan authored Nov 10, 2020
- Better communication timing · 7d367b1d
  Deepak Narayanan authored Nov 10, 2020
- Divide gradient by number of microbatches in minibatch · 3d7194c4
  Deepak Narayanan authored Nov 03, 2020
- Better 'forward' and 'backward' timing in megatron/training.py · a6756bf8
  Deepak Narayanan authored Nov 03, 2020
- Move training schedule to 1F1B for memory efficiency · 3e6898e6
  Deepak Narayanan authored Oct 24, 2020
- Only transpose hidden_states when necessary · 6abf39be
  Deepak Narayanan authored Nov 03, 2020
- Refactor word_embeddings_weight() logic into separate method, and other Mohammad comments · 57c3b364
  Deepak Narayanan authored Nov 03, 2020
- Log times for various sub-operations in forward and backward pass in main training loop · eed0062a
  Deepak Narayanan authored Oct 30, 2020
- Throw exception if ring_exchange is not available when pipeline_model_parallel_size > 1 · 2d8de296
  Deepak Narayanan authored Oct 30, 2020
- Bugfix in megatron/training.py: correct global_batch_size computation · 7ce373f3
  Deepak Narayanan authored Oct 29, 2020
```
Prevents data_loader from running out of training examples
```
- Improve names of identifiers used for timing in main training loop · 9d4c735a
  Deepak Narayanan authored Oct 29, 2020
- Clarifications in comments and minor refactoring to make main training loop more readable · 8fb2bc8c
  Deepak Narayanan authored Oct 28, 2020
- Remove unused parameter sharing logic · 1271fd73
  Deepak Narayanan authored Oct 28, 2020
- Bugfix in main training loop: Update master_grads only after grads are correctly accumulated · 9b558566
  Deepak Narayanan authored Oct 28, 2020
- Simplify logic in megatron/fp16/fp16.py · 767e6e92
  Deepak Narayanan authored Oct 28, 2020
- Small notes in comments in response to Jared's comments · aa9cae27
  Deepak Narayanan authored Oct 28, 2020
- Address Jared's comments in README and loss_scaler.py · dd079406
  Deepak Narayanan authored Oct 27, 2020
- Improve time logging when num_microbatches_in_minibatch > 1 · 63740223
  Deepak Narayanan authored Oct 27, 2020
```
Make sure all forward and backward operations are accounted for
```
- Back compatibility of checkpoints: use `model_parallel_size` when checking for equality of args · d5b526d5
  Deepak Narayanan authored Oct 26, 2020
- Refactor communication code in main training loop to helper method · 318d68c2
  Deepak Narayanan authored Oct 23, 2020
- Back compatibility of checkpoints: don't rename model_parallel_rng_tracker · e805f0bd
  Deepak Narayanan authored Oct 22, 2020
- Removal of unneeded changes so that diff is smaller · 275d4e64
  Deepak Narayanan authored Oct 20, 2020
- Intra-layer MP -> Tensor MP, Inter-layer MP -> Pipeline MP · 52a5f2f2
  Deepak Narayanan authored Oct 20, 2020
- Pipeline parallelism implementation with periodic full-pipeline syncs · 7abd3e90
  Deepak Narayanan authored Aug 29, 2020
```
Also includes following changes for inter-layer model-parallel implementation:
- Refactoring of model implementations
- Training loop changes to support inter-layer communication using `ring_exchange`
- New groups for inter-layer communication
- Checkpoint changes
- Command line arguments
```
- fp32 working · 28cd66e1
  mohammad authored Aug 28, 2020
03 Nov, 2020 2 commits
- Merge branch 'fix_logging' into 'main' · b4b0d739
  Deepak Narayanan authored Nov 03, 2020
```
fixed loss average when all but one value is skipped

See merge request ADLR/megatron-lm!164
```
- fixed loss average when all but one value is skipped · 664cd28b
  mohammad authored Nov 03, 2020
14 Oct, 2020 6 commits
- Merge branch 'main_evaluate_wiki' into 'main' · 79888e16
  Mohammad Shoeybi authored Oct 14, 2020
```
fixed wiki evaluation

See merge request ADLR/megatron-lm!157
```
- fixed wiki evaluation issue · ef2adb5d
  Mostofa Patwary authored Oct 14, 2020
- fixed wiki evaluation issue · 38c45de7
  Mostofa Patwary authored Oct 14, 2020
- Merge branch 'main_beta' into 'main' · 64cf3d98
  Mohammad Shoeybi authored Oct 13, 2020
```
Adam betas and eps

See merge request ADLR/megatron-lm!156
```
- added adam betas and eps as arguments · c55e154f
  Mostofa Patwary authored Oct 13, 2020
- added adam betas and eps as arguments · 48269d8d
  Mostofa Patwary authored Oct 13, 2020
13 Oct, 2020 5 commits
- Merge branch 'vijay/suppress_ninja_output' into 'main' · 5753e8f1
  Mohammad Shoeybi authored Oct 12, 2020
```
suppress kernel loading output

See merge request ADLR/megatron-lm!155
```
- suppress kernel loading output · 8fca9b49
  Vijay Korthikanti authored Oct 12, 2020
- Merge branch 'vijay/ordering_perf_fix' into 'main' · 538f0e05
  Mohammad Shoeybi authored Oct 12, 2020
```
reordering perf fix

See merge request ADLR/megatron-lm!154
```
- typo fix · 01dffdb5
  Vijay Korthikanti authored Oct 12, 2020
- address review comments · 0bc75448
  Vijay Korthikanti authored Oct 12, 2020
12 Oct, 2020 2 commits
- Incrementing checkpoint version to 2.0 · 9b0083ea
  Vijay Korthikanti authored Oct 12, 2020
- reordering perf fix · ee327acd
  Vijay Korthikanti authored Oct 12, 2020