Commits · 9b174da873cf8e84f1bef1b6a88d98ff7791946e · OpenDAS / Megatron-LM

21 Dec, 2020 1 commit

- Merge branch 'pipeline_parallel_main' into 'main' · 9b174da8
  Deepak Narayanan authored Dec 21, 2020

  Pipeline parallelism and inter-layer model parallelism implementation

  See merge request ADLR/megatron-lm!159

19 Dec, 2020 28 commits
- Quick fix for pipeline tasks to get learning rate correct · 6e83649f
  Jared Casper authored Dec 18, 2020
  
- Added a comment to justify 80 percent · 25c07e14
  mohammad authored Dec 14, 2020
  
- Move args writer to the beginning of training · 6e9d5cb0
  mohammad authored Dec 12, 2020
  
- Cleaned up load/save checkpoint printing · 8a6e56b8
  mohammad authored Dec 12, 2020
  
- Fix TensorBoard writes · b81cad66
  mohammad authored Dec 12, 2020
  
- Fix loss addition in TensorBoard · 5a304ede
  mshoeybi authored Dec 12, 2020
  
- Some bugfixes · 29a69547
  mshoeybi authored Dec 11, 2020
  
- Last epoch should not be globally shuffled · 39181113
  mshoeybi authored Dec 11, 2020
  
- Address Jared's comments · 56243e19
  mshoeybi authored Dec 11, 2020
  
- Fix some bugs, add exit-duration capability · a31833ce
  mshoeybi authored Dec 11, 2020
  
- Add comment describing _PIPELINE_GLOBAL_RANKS · 51315905
  Jared Casper authored Dec 10, 2020
  
- Fix text generation without recompute · 1d4e8760
  Jared Casper authored Dec 10, 2020
  
- Nicer error messages for deprecated arguments · 2623551d
  Jared Casper authored Dec 10, 2020
  
- Change lr-warmup-percent to lr-warmup-fraction · 9321d5c6
  Jared Casper authored Dec 10, 2020
  
- Add implementation for pipelined zeroshot GPT-2 evaluation · 0c151638
  Jared Casper authored Dec 09, 2020
  
- Work batch-size name changes into task code · 3afcba6e
  Jared Casper authored Dec 09, 2020
  
- Initial implementation of pipelined text generation · 5c45db4a
  Jared Casper authored Dec 09, 2020
  
- Add pipelining to GLUE and RACE tasks · caa9dca5
  Jared Casper authored Nov 30, 2020
  
- Better memory tracking across pipeline-parallel ranks · 3574b8e6
  Deepak Narayanan authored Dec 06, 2020
  
- Address Jared's comments · 00ac56ab
  mohammad authored Dec 09, 2020
  
- Sample based learning rate computation · 22ab91bb
  mohammad authored Dec 08, 2020
  
- Minor fixes for batch size rampup · 6a68502d
  mohammad authored Dec 08, 2020
  
- Support for ramping up the batch size · de0b70a0
  mohammad authored Dec 08, 2020
  
- Minor refactoring · c30ba0f7
  mohammad authored Dec 08, 2020
  
- Add constant num micro-batches calculator · feecd5d9
  mohammad authored Dec 07, 2020
  
- Add micro-batch size calculator · 6ea23928
  mohammad authored Dec 06, 2020
  
- Rename --batch-size to --micro-batch-size and drop in-minibatch from --num-micro-batches-in-minibatch · 9019bbf4
  mohammad authored Dec 06, 2020
- Make an eval iteration the same number of samples as a training iteration · a84a5fa0
  Jared Casper authored Dec 03, 2020
  
03 Dec, 2020 4 commits
- Merge branch 'main' into pipeline_parallel_merge · 2cf1d6d0
  Jared Casper authored Dec 03, 2020
  
- Merge branch 'consumed_tokens_restart_fix' into 'main' · 3aacd955
  Jared Casper authored Dec 03, 2020
```
found a bug in consumed tokens initialization

See merge request ADLR/megatron-lm!182
```
- found a bug in consumed tokens initialization · e2a4d426
  mohammad authored Dec 02, 2020
  
- Merge branch 'main' into pipeline_parallel_main · 91d4a605
  Jared Casper authored Dec 02, 2020
  
02 Dec, 2020 7 commits
- Merge branch 'megatron_sampler' into 'main' · 75bd9b54
  Jared Casper authored Dec 02, 2020
```
Simplified sampler (will be needed later for batch size increase) and removed deprecated data stuff

See merge request ADLR/megatron-lm!177
```
- Merge branch 'blendable_dataset' into 'megatron_sampler' · fac6718a
  Jared Casper authored Dec 02, 2020
```
Blendable dataset

See merge request ADLR/megatron-lm!178
```
- Merge branch 'refactor_learning_rate' into 'blendable_dataset' · 1eda0a17
  Jared Casper authored Dec 02, 2020
```
Refactor learning rate so it is easier to make learning rate based on consumed samples

See merge request ADLR/megatron-lm!179
```
- addressed Jareds comments · fa80af26
  mohammad authored Dec 02, 2020
  
- Merge branch 'blendable_dataset' into refactor_learning_rate · 45504541
  mohammad authored Dec 02, 2020
  
- addressed Jareds comments · 98989693
  mohammad authored Dec 02, 2020
  
- Merge branch 'megatron_sampler' into blendable_dataset · bc56e4a5
  mohammad authored Dec 02, 2020
  