- 13 Feb, 2021 1 commit
Deepak Narayanan authored
- 09 Feb, 2021 1 commit
Deepak Narayanan authored
- Split a model's computation into multiple virtual stages as needed, and schedule communication correctly between these virtual stages
- Move schedule code into `schedules.py` and communication code into `p2p_communication.py`
- Use hyphens instead of spaces in all time logging for consistency
- Factor out code in `megatron/training.py` into helper functions
- Refactor `evaluate()` function: make it use forward_backward_schedule functions
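The virtual-stage split described above can be illustrated with a small pure-Python sketch (the function name and layout below are hypothetical illustrations, not Megatron-LM's actual code): with an interleaved split, each pipeline rank holds several non-contiguous chunks of the model, so layers are assigned to ranks round-robin per virtual stage.

```python
def assign_layers_to_virtual_stages(num_layers, pipeline_size, virtual_size):
    """Map each layer to a (pipeline_rank, virtual_stage) pair.

    Hypothetical sketch of an interleaved split: global stage s holds a
    contiguous chunk of layers, rank (s % pipeline_size) owns that chunk,
    and (s // pipeline_size) says which of the rank's chunks it is. Every
    rank therefore owns `virtual_size` non-contiguous pieces of the model.
    """
    assert num_layers % (pipeline_size * virtual_size) == 0
    chunk = num_layers // (pipeline_size * virtual_size)
    assignment = {}
    for layer in range(num_layers):
        stage = layer // chunk               # global stage index
        rank = stage % pipeline_size         # pipeline rank holding this chunk
        virtual = stage // pipeline_size     # which chunk on that rank
        assignment[layer] = (rank, virtual)
    return assignment

# 8 layers, 2 pipeline ranks, 2 virtual stages per rank:
# rank 0 holds layers 0-1 (chunk 0) and 4-5 (chunk 1);
# rank 1 holds layers 2-3 and 6-7.
mapping = assign_layers_to_virtual_stages(8, 2, 2)
```

Point-to-point communication then has to run between consecutive global stages, which is why a correct schedule must interleave sends and receives across virtual stages rather than just across ranks.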
- 04 Jan, 2021 1 commit
Deepak Narayanan authored
- 19 Dec, 2020 2 commits
Jared Casper authored
Jared Casper authored
- 12 Nov, 2020 2 commits
Deepak Narayanan authored
Deepak Narayanan authored
Also includes the following changes for the inter-layer model-parallel implementation:
- Refactoring of model implementations
- Training loop changes to support inter-layer communication using `ring_exchange`
- New groups for inter-layer communication
- Checkpoint changes
- Command line arguments
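The inter-layer (pipeline) communication pattern can be sketched in plain Python (a toy simulation only, assuming a linear chain of stages; the real implementation uses a `ring_exchange` distributed primitive, not this code): during the forward pass, stage i sends its output activation to stage i + 1 and each stage past the first receives its input from the previous one.

```python
def pipeline_forward_exchange(outputs):
    """Simulate one forward communication step between pipeline stages.

    Toy model: `outputs[i]` is the activation produced by stage i. Each
    stage i > 0 receives the activation sent by stage i - 1; the first
    stage has no upstream neighbor, so it receives None. Real code would
    issue point-to-point sends/recvs between adjacent ranks instead.
    """
    num_stages = len(outputs)
    received = [None] * num_stages
    for i in range(num_stages - 1):
        received[i + 1] = outputs[i]   # stage i -> stage i + 1
    return received

# Three stages: stage 1 receives stage 0's output, stage 2 receives stage 1's.
recv = pipeline_forward_exchange(["act0", "act1", "act2"])
```

The backward pass mirrors this in the opposite direction (gradients flow from stage i + 1 back to stage i), which is why separate communication groups for inter-layer exchange are needed alongside the intra-layer ones.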
- 06 Aug, 2020 1 commit
Boris Fomitchev authored
Signed-off-by: Boris Fomitchev <bfomitchev@nvidia.com>
- 30 Jul, 2020 1 commit
Boris Fomitchev authored
Signed-off-by: Boris Fomitchev <bfomitchev@nvidia.com>
- 26 Jul, 2020 1 commit
Neel Kant authored
- 05 Jun, 2020 1 commit
Neel Kant authored
- 26 May, 2020 1 commit
Neel Kant authored
- 21 May, 2020 1 commit
Neel Kant authored
- 20 May, 2020 1 commit
Neel Kant authored
- 19 May, 2020 2 commits
- 16 Apr, 2020 1 commit
Mohammad authored
- 10 Feb, 2020 1 commit
Mohammad Shoeybi authored
- 08 Oct, 2019 1 commit
Mohammad Shoeybi authored
- 30 Jul, 2019 1 commit
Raul Puri authored
Co-authored-by: shoeybi <shoeybim@gmail.com>
Co-authored-by: raulpuric <raulpuric@berkeley.edu>
Co-authored-by: jaredcasper <jaredcasper@gmail.com>
Co-authored-by: mpatwary <mostofa.patwary@gmail.com>
Co-authored-by: plegresl <plegresl@gmail.com>