Commits · 5489bda9bcc9fed83fe45832de033e85bb31649f · OpenDAS / Megatron-LM

13 Feb, 2021 1 commit
- More comments and some cleanup (e.g., better variable names) · 5489bda9
  Deepak Narayanan authored Feb 13, 2021
  
10 Feb, 2021 2 commits
- Comments in megatron/schedules.py and address a few more comments · 626645c0
  Deepak Narayanan authored Feb 10, 2021
  
- Move unwrap to megatron/utils.py and clean up imports in megatron/schedules.py · cc691cbf
  Deepak Narayanan authored Feb 10, 2021
  
09 Feb, 2021 4 commits
- Compute tensor chunk size more cleanly, and add assertion for global batch size · e3e5ea89
  Deepak Narayanan authored Jan 20, 2021
  
- Break up tensors sent between pipeline stages into smaller chunks that can be all-gathered · 27fc4689
  Deepak Narayanan authored Jan 20, 2021
  
- Put in barriers in appropriate places to measure length of pipeline stall · 8e922d5b
  Deepak Narayanan authored Jan 09, 2021
  
- Interleaved pipeline execution and code refactoring · dd889062
  Deepak Narayanan authored Dec 12, 2020
```
- Split a model's computation into multiple virtual stages as needed,
and schedule communication correctly between these virtual stages
- Move schedule code into `schedules.py` and communication code into
`p2p_communication.py`
- Use hyphens instead of spaces in all time logging for consistency
- Factor out code in megatron/training.py into helper functions
- Refactor evaluate() function: make it use forward_backward_schedule
functions
```
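The 09 Feb commits above mention breaking tensors sent between pipeline stages into smaller chunks that can be all-gathered, and asserting on sizes when computing the chunk size. The following is a minimal sketch of that general idea in plain Python, not the code from these commits: the function names (`chunk_size`, `split_for_rank`, `all_gather`) are hypothetical, the "tensor" is a flat list, and the collective is simulated by concatenating chunks in rank order.

```python
from typing import List, Sequence


def chunk_size(numel: int, world_size: int) -> int:
    """Length of each rank's chunk; assumes the tensor splits evenly."""
    assert numel % world_size == 0, (
        "number of elements must be divisible by the number of ranks"
    )
    return numel // world_size


def split_for_rank(flat: Sequence[float], rank: int, world_size: int) -> List[float]:
    """Return the contiguous slice of the flattened tensor owned by `rank`."""
    size = chunk_size(len(flat), world_size)
    start = rank * size
    return list(flat[start:start + size])


def all_gather(chunks: List[List[float]]) -> List[float]:
    """Stand-in for a collective all-gather: concatenate chunks in rank order."""
    gathered: List[float] = []
    for chunk in chunks:
        gathered.extend(chunk)
    return gathered


# Hypothetical example: 8 elements shared by 4 ranks.
flat = [float(i) for i in range(8)]
world_size = 4
chunks = [split_for_rank(flat, rank, world_size) for rank in range(world_size)]
restored = all_gather(chunks)
```

In this sketch each rank would send only its own chunk to the next stage, and the receiving side rebuilds the full tensor by gathering; the divisibility assertion is what makes the chunk size well defined.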
08 Feb, 2021 2 commits
- Merge branch 'ckpt_rng' into 'main' · c1faa9fe
  Mohammad Shoeybi authored Feb 08, 2021
```
Improve handling of rng states in checkpoints.

See merge request ADLR/megatron-lm!231
```
- Improve handling of rng states in checkpoints. · 08a848c7
  Jared Casper authored Feb 08, 2021
  
06 Feb, 2021 1 commit

- Merge branch 'main_fix' into 'main' · 8863af8c
  Jared Casper authored Feb 05, 2021

  Use torch.cuda.synchronize() right after calling batch_isend_irecv() communication API

  See merge request ADLR/megatron-lm!230

05 Feb, 2021 3 commits
- Use torch.cuda.synchronize() right after calling batch_isend_irecv() communication API · 7ffea978
  Deepak Narayanan authored Feb 05, 2021
  
- Merge branch 'fused_kernel_cond' into 'main' · 2096d356
  Jared Casper authored Feb 05, 2021
```
conditioning fused kernels

See merge request ADLR/megatron-lm!228
```
- address review comments · 0cb36de2
  Vijay Korthikanti authored Feb 05, 2021
  
04 Feb, 2021 1 commit
- conditioning fused kernels · 4916bae6
  Vijay Korthikanti authored Feb 04, 2021
  
02 Feb, 2021 2 commits
- Merge branch 'merge_bugfix' into 'main' · 872e38ea
  Mohammad Shoeybi authored Feb 02, 2021
```
Fix bug in merge_mp_partitions for handling recent checkpoints.

See merge request ADLR/megatron-lm!226
```
- Fix bug in merge_mp_partitions for handling recent checkpoints. · 72105ef0
  Jared Casper authored Feb 02, 2021
  
01 Feb, 2021 2 commits
- Merge branch 'preprocess_fix' into 'main' · c601d751
  Mohammad Shoeybi authored Feb 01, 2021
```
Handle empty documents in preprocess_data.

See merge request ADLR/megatron-lm!225
```
- Handle empty documents in preprocess_data. · 09d220cf
  Jared Casper authored Feb 01, 2021
  
29 Jan, 2021 4 commits
- Merge branch 'ci' into 'main' · 1b8e2891
  Jared Casper authored Jan 29, 2021
```
Init CI tests with very basic import test.

See merge request ADLR/megatron-lm!224
```
- Init CI tests with very basic import test. · 2526c614
  Jared Casper authored Jan 29, 2021
  
- Merge branch 'tensorboard_queue_size_increase' into 'main' · f2a3a25c
  Jared Casper authored Jan 29, 2021
```
added option to change tensorboard queue size

See merge request ADLR/megatron-lm!223
```
- added option to change tensorboard queue size · e9b90500
  mohammad authored Jan 29, 2021
  
28 Jan, 2021 11 commits
- Merge branch 'logging_refactor' into 'main' · 0e5b64af
  Jared Casper authored Jan 28, 2021
```
added options for tensorboard logging

See merge request ADLR/megatron-lm!222
```
- changed validation loss name · 792a468d
  mohammad authored Jan 28, 2021
  
- added options for tensorboard logging · 3a26a168
  mohammad authored Jan 28, 2021
  
- Merge branch 'ckpt_fix' into 'main' · 16db4a2c
  Mohammad Shoeybi authored Jan 28, 2021
```
Typo fix.

See merge request ADLR/megatron-lm!221
```
- Typo fix. · e1f574cd
  Jared Casper authored Jan 28, 2021
  
- Merge branch 'ckpt_merge' into 'main' · 36c2674c
  Mohammad Shoeybi authored Jan 28, 2021
```
Teach merge_mp_partitions how to write out a pipelined model.

See merge request ADLR/megatron-lm!218
```
- Improve comments around layer regex replacement. · 98a5b9a0
  Jared Casper authored Jan 28, 2021
  
- Merge branch 'vision_transformer' into 'main' · de722164
  Jared Casper authored Jan 28, 2021
```
license text for autoaugmentation

See merge request ADLR/megatron-lm!220
```
- change to LICENSE file · dcea434a
  Vijay Korthikanti authored Jan 28, 2021
  
- Merge branch 'ckpt_transpose' into 'main' · 4468f3e4
  Jared Casper authored Jan 28, 2021
```
Rework handling of older checkpoint's attention weight/bias ordering.

See merge request ADLR/megatron-lm!219
```
- license text for autoaugmentation · 54bb3046
  Vijay Korthikanti authored Jan 28, 2021
  
27 Jan, 2021 7 commits
- Move rearranging query_key_value and key_value values in old checkpoints to... · 76960d7c
  Jared Casper authored Jan 27, 2021
```
Move rearranging query_key_value and key_value values in old checkpoints to when the checkpoint is loaded instead of at runtime.
```
- Teach merge_mp_partitions how to write out a pipelined model. · 7cabbe67
  Jared Casper authored Jan 27, 2021
  
- Merge branch 'gpt3_script_fix' into 'main' · c7444380
  Jared Casper authored Jan 27, 2021
```
added init method std to gpt3 example

See merge request ADLR/megatron-lm!217
```
- added init method std · a6bf1a04
  mshoeybi authored Jan 26, 2021
  
- Merge branch 'log_grad_norm' into 'main' · 60704e72
  Jared Casper authored Jan 26, 2021
```
added grad and params norm to logging and tensorboard

See merge request ADLR/megatron-lm!214
```
- added flag so we dont calculate params norm all the time · 3dcbaec9
  mohammad authored Jan 26, 2021
  
- Merge branch 'main' into log_grad_norm · 929c780c
  mohammad authored Jan 26, 2021
  