Commits · b81cad6609c9d91efa5701a3bdc71fc2f007d65b · OpenDAS / Megatron-LM

19 Dec, 2020 15 commits
- Fix TensorBoard writes · b81cad66
  mohammad authored Dec 12, 2020
  
  b81cad66
- Fix loss addition in TensorBoard · 5a304ede
  mshoeybi authored Dec 12, 2020
  
  5a304ede
- Some bugfixes · 29a69547
  mshoeybi authored Dec 11, 2020
  
  29a69547
- Address Jared's comments · 56243e19
  mshoeybi authored Dec 11, 2020
  
  56243e19
- Fix some bugs, add exit-duration capability · a31833ce
  mshoeybi authored Dec 11, 2020
  
  a31833ce
- Change lr-warmup-percent to lr-warmup-fraction · 9321d5c6
  Jared Casper authored Dec 10, 2020
  
  9321d5c6
- Work batch-size name changes into task code · 3afcba6e
  Jared Casper authored Dec 09, 2020
  
  3afcba6e
- Add pipelining to GLUE and RACE tasks · caa9dca5
  Jared Casper authored Nov 30, 2020
  
  caa9dca5
- Better memory tracking across pipeline-parallel ranks · 3574b8e6
  Deepak Narayanan authored Dec 06, 2020
  
  3574b8e6
- Sample based learning rate computation · 22ab91bb
  mohammad authored Dec 08, 2020
  
  22ab91bb
- Minor refactoring · c30ba0f7
  mohammad authored Dec 08, 2020
  
  c30ba0f7
- Add constant num micro-batches calculator · feecd5d9
  mohammad authored Dec 07, 2020
  
  feecd5d9
- Add micro-batch size calculator · 6ea23928
  mohammad authored Dec 06, 2020
  
  6ea23928
- Rename --batch-size to --micro-batch-size and drop in-minibatch from... · 9019bbf4
  mohammad authored Dec 06, 2020
```
Rename --batch-size to --micro-batch-size and drop in-minibatch from --num-micro-batches-in-minibatch
```
  9019bbf4
- Make an eval iteration the same number of samples as a training iteration · a84a5fa0
  Jared Casper authored Dec 03, 2020
  
  a84a5fa0
02 Dec, 2020 1 commit
- addrressed jareds comments · cebd3b8b
  mohammad authored Dec 02, 2020
  
  cebd3b8b
30 Nov, 2020 1 commit
- refactored learning rate scheduler so addition of variable batch size is easier · ff12df6b
  mohammad authored Nov 29, 2020
  
  ff12df6b
28 Nov, 2020 1 commit
- added consumed tokens to checkpoints and some refactoring · f0a445fa
  mohammad authored Nov 27, 2020
  
  f0a445fa
26 Nov, 2020 1 commit
- simplified sampler · 4311b695
  mohammad authored Nov 25, 2020
  
  4311b695
12 Nov, 2020 21 commits
- Make sure dataloader state is the same after checkpoint is loaded · cd4822f1
  Deepak Narayanan authored Nov 12, 2020
  
  cd4822f1
- Move division of loss tensor by number of microbatches to training.py · c671de3e
  Deepak Narayanan authored Nov 12, 2020
  
  c671de3e
- Refactor code according to Jared's comments: move pipelining and... · 1979c242
  Deepak Narayanan authored Nov 12, 2020
```
Refactor code according to Jared's comments: move pipelining and non-pipelining training loops into separate methods

Also, use mpu.get_*_model_parallel_size() instead of args.*_model_parallel_size
```
  1979c242
- Allocate tensor in `communicate()` method directly on GPU (instead of... · 9ff6f473
  mshoeybi authored Nov 11, 2020
```
Allocate tensor in `communicate()` method directly on GPU (instead of allocating on CPU and then moving to GPU)
```
  9ff6f473
- Remove timing labels that don't make sense · fdcaeba0
  Deepak Narayanan authored Nov 10, 2020
  
  fdcaeba0
- Better communication timing · 7d367b1d
  Deepak Narayanan authored Nov 10, 2020
  
  7d367b1d
- Divide gradient by number of microbatches in minibatch · 3d7194c4
  Deepak Narayanan authored Nov 03, 2020
  
  3d7194c4
- Better 'forward' and 'backward' timing in megatron/training.py · a6756bf8
  Deepak Narayanan authored Nov 03, 2020
  
  a6756bf8
- Move training schedule to 1F1B for memory efficiency · 3e6898e6
  Deepak Narayanan authored Oct 24, 2020
  
  3e6898e6
- Only transpose hidden_states when necessary · 6abf39be
  Deepak Narayanan authored Nov 03, 2020
  
  6abf39be
- Log times for various sub-operations in forward and backward pass in main training loop · eed0062a
  Deepak Narayanan authored Oct 30, 2020
  
  eed0062a
- Bugfix in megatron/training.py: correct global_batch_size computation · 7ce373f3
  Deepak Narayanan authored Oct 29, 2020
```
Prevents data_loader from running out of training examples
```
  7ce373f3
- Improve names of identifiers used for timing in main training loop · 9d4c735a
  Deepak Narayanan authored Oct 29, 2020
  
  9d4c735a
- Clarifications in comments and minor refactoring to make main training loop more readable · 8fb2bc8c
  Deepak Narayanan authored Oct 28, 2020
  
  8fb2bc8c
- Bugfix in main training loop: Update master_grads only after grads are correctly accumulated · 9b558566
  Deepak Narayanan authored Oct 28, 2020
  
  9b558566
- Small notes in comments in response to Jared's comments · aa9cae27
  Deepak Narayanan authored Oct 28, 2020
  
  aa9cae27
- Improve time logging when num_microbatches_in_minibatch > 1 · 63740223
  Deepak Narayanan authored Oct 27, 2020
```
Make sure all forward and backward operations are accounted for
```
  63740223
- Refactor communication code in main training loop to helper method · 318d68c2
  Deepak Narayanan authored Oct 23, 2020
  
  318d68c2
- Intra-layer MP -> Tensor MP, Inter-layer MP -> Pipeline MP · 52a5f2f2
  Deepak Narayanan authored Oct 20, 2020
  
  52a5f2f2
- Pipeline parallelism implementation with periodic full-pipeline syncs · 7abd3e90
  Deepak Narayanan authored Aug 29, 2020
```
Also includes following changes for inter-layer model-parallel implementation:
- Refactoring of model implementations
- Training loop changes to support inter-layer communication using `ring_exchange`
- New groups for inter-layer communication
- Checkpoint changes
- Command line arguments
```
  7abd3e90
- fp32 working · 28cd66e1
  mohammad authored Aug 28, 2020
  
  28cd66e1