Commits · 63c340ec4a3111ed12e1657e63e242051cc20e9d · OpenDAS / Megatron-LM · GitLab

19 Nov, 2020 1 commit
- Merge branch 'main' into pipeline_parallel_main · 63c340ec
  Jared Casper authored Nov 19, 2020
  
18 Nov, 2020 9 commits
- Merge branch 'update-norm' into 'main' · ea81d62f
  Mohammad Shoeybi authored Nov 17, 2020
```
Replace deprecated torch.norm with torch.linalg.norm.

See merge request ADLR/megatron-lm!175
```
- Merge branch 'community-fixes' into 'main' · ac837a4e
  Mohammad Shoeybi authored Nov 17, 2020
```
Community fixes

See merge request ADLR/megatron-lm!176
```
- Merge branch 'fix/help-title-dist' of https://github.com/lazykyama/Megatron-LM into community-fixes · 356f8771
  Jared Casper authored Nov 17, 2020
  
- Merge branch 'patch-1' of https://github.com/sublee/Megatron-LM into community-fixes · 24ccd8aa
  Jared Casper authored Nov 17, 2020
  
- Merge branch 'patch-2' of https://github.com/nakosung/Megatron-LM into community-fixes · cec3dbe6
  Jared Casper authored Nov 17, 2020
  
- Merge branch 'patch-1' of https://github.com/nakosung/Megatron-LM into community-fixes · 8d0da202
  Jared Casper authored Nov 17, 2020
  
- Merge branch 'patch-1' of https://github.com/dweekly/Megatron-LM into community-fixes · 5e934043
  Jared Casper authored Nov 17, 2020
  
- Merge branch 'patch-1' of https://github.com/akhileshgotmare/Megatron-LM into community-fixes · 642c644a
  Jared Casper authored Nov 17, 2020
  
- Replace deprecated torch.norm with torch.linalg.norm. · 17035d6c
  Jared Casper authored Nov 17, 2020
  
17 Nov, 2020 3 commits
- Merge branch 'main' into pipeline_parallel_main · 6fae152a
  Jared Casper authored Nov 16, 2020
  
- Merge branch 'finetune_fix' into 'main' · e7c7a78f
  Mohammad Shoeybi authored Nov 16, 2020
```
Update code used for finetuning to latest API.

See merge request ADLR/megatron-lm!174
```
- Update code used for finetuning to latest API. · b219ff00
  Jared Casper authored Nov 16, 2020
  
13 Nov, 2020 1 commit
- New example scripts showing how to use MP, and some notes to main README · a5a41922
  Deepak Narayanan authored Nov 13, 2020
  
12 Nov, 2020 26 commits
- Make sure dataloader state is the same after checkpoint is loaded · cd4822f1
  Deepak Narayanan authored Nov 12, 2020
  
- Move division of loss tensor by number of microbatches to training.py · c671de3e
  Deepak Narayanan authored Nov 12, 2020
  
- Small bugfix in bert_model.py: make sure word_embeddings is initialized before... · 69a546be
  Deepak Narayanan authored Nov 12, 2020
```
Small bugfix in bert_model.py: make sure word_embeddings is initialized before instantiating lm_head
```
- Refactor code according to Jared's comments: move pipelining and... · 1979c242
  Deepak Narayanan authored Nov 12, 2020
```
Refactor code according to Jared's comments: move pipelining and non-pipelining training loops into separate methods

Also, use mpu.get_*_model_parallel_size() instead of args.*_model_parallel_size
```
- Allocate tensor in `communicate()` method directly on GPU (instead of... · 9ff6f473
  mshoeybi authored Nov 11, 2020
```
Allocate tensor in `communicate()` method directly on GPU (instead of allocating on CPU and then moving to GPU)
```
- Remove timing labels that don't make sense · fdcaeba0
  Deepak Narayanan authored Nov 10, 2020
  
- Better communication timing · 7d367b1d
  Deepak Narayanan authored Nov 10, 2020
  
- Divide gradient by number of microbatches in minibatch · 3d7194c4
  Deepak Narayanan authored Nov 03, 2020
  
- Better 'forward' and 'backward' timing in megatron/training.py · a6756bf8
  Deepak Narayanan authored Nov 03, 2020
  
- Move training schedule to 1F1B for memory efficiency · 3e6898e6
  Deepak Narayanan authored Oct 24, 2020
  
- Only transpose hidden_states when necessary · 6abf39be
  Deepak Narayanan authored Nov 03, 2020
  
- Refactor word_embeddings_weight() logic into separate method, and other Mohammad comments · 57c3b364
  Deepak Narayanan authored Nov 03, 2020
  
- Log times for various sub-operations in forward and backward pass in main training loop · eed0062a
  Deepak Narayanan authored Oct 30, 2020
  
- Throw exception if ring_exchange is not available when pipeline_model_parallel_size > 1 · 2d8de296
  Deepak Narayanan authored Oct 30, 2020
  
- Bugfix in megatron/training.py: correct global_batch_size computation · 7ce373f3
  Deepak Narayanan authored Oct 29, 2020
```
Prevents data_loader from running out of training examples
```
- Improve names of identifiers used for timing in main training loop · 9d4c735a
  Deepak Narayanan authored Oct 29, 2020
  
- Clarifications in comments and minor refactoring to make main training loop more readable · 8fb2bc8c
  Deepak Narayanan authored Oct 28, 2020
  
- Remove unused parameter sharing logic · 1271fd73
  Deepak Narayanan authored Oct 28, 2020
  
- Bugfix in main training loop: Update master_grads only after grads are correctly accumulated · 9b558566
  Deepak Narayanan authored Oct 28, 2020
  
- Simplify logic in megatron/fp16/fp16.py · 767e6e92
  Deepak Narayanan authored Oct 28, 2020
  
- Small notes in comments in response to Jared's comments · aa9cae27
  Deepak Narayanan authored Oct 28, 2020
  
- Address Jared's comments in README and loss_scaler.py · dd079406
  Deepak Narayanan authored Oct 27, 2020
  
- Improve time logging when num_microbatches_in_minibatch > 1 · 63740223
  Deepak Narayanan authored Oct 27, 2020
```
Make sure all forward and backward operations are accounted for
```
- Back compatibility of checkpoints: use `model_parallel_size` when checking for equality of args · d5b526d5
  Deepak Narayanan authored Oct 26, 2020
  
- Refactor communication code in main training loop to helper method · 318d68c2
  Deepak Narayanan authored Oct 23, 2020
  
- Back compatibility of checkpoints: don't rename model_parallel_rng_tracker · e805f0bd
  Deepak Narayanan authored Oct 22, 2020
  
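
Two of the commits listed above (c671de3e and 3d7194c4) concern dividing the loss or gradient by the number of microbatches. The sketch below is not taken from the Megatron-LM code; it is a minimal pure-Python illustration, assuming equal-sized microbatches and a mean-reduced loss, of why that scaling makes accumulated microbatch gradients match the full-minibatch gradient. The function names `grad_mse` and `accumulated_grad` and the toy model `y = w * x` are hypothetical.

```python
def grad_mse(w, xs, ys):
    """Gradient with respect to w of the mean squared error for the model y = w * x."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n


def accumulated_grad(w, xs, ys, num_microbatches):
    """Sum per-microbatch gradients, each scaled by 1 / num_microbatches."""
    assert len(xs) % num_microbatches == 0, "equal-sized microbatches assumed"
    size = len(xs) // num_microbatches
    total = 0.0
    for i in range(num_microbatches):
        mb_x = xs[i * size:(i + 1) * size]
        mb_y = ys[i * size:(i + 1) * size]
        # Dividing the microbatch loss by num_microbatches scales its
        # gradient by the same factor, so the sum equals the full-batch mean.
        total += grad_mse(w, mb_x, mb_y) / num_microbatches
    return total


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
full = grad_mse(0.5, xs, ys)
accum = accumulated_grad(0.5, xs, ys, num_microbatches=2)
print(full, accum)
```

Without the division, the accumulated gradient would be `num_microbatches` times larger than the full-minibatch gradient, which would amount to silently changing the effective learning rate.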