- 12 May, 2022 1 commit
-
-
John Kamalu authored
tools/merge_datasets.py
  - tool to merge multiple dataset files into a single dataset
  - testing conducted and included in the megatron-testing repo https://gitlab-master.nvidia.com/ADLR/megatron-testing
tools/preprocess_data.py
  - magic numbers changed to required command line arguments
megatron/data/indexed_dataset.py
  - when merging, fix to properly update document index
  - testing conducted and included in the megatron-testing repo (see above)
  - fix follows this history https://github.com/bigscience-workshop/Megatron-DeepSpeed/pull/66
-
- 14 May, 2021 1 commit
-
-
Jared Casper authored
-
- 01 Feb, 2021 1 commit
-
-
Jared Casper authored
-
- 12 Nov, 2020 2 commits
-
-
Deepak Narayanan authored
-
Deepak Narayanan authored
Also includes the following changes for the inter-layer model-parallel implementation:
  - Refactoring of model implementations
  - Training loop changes to support inter-layer communication using `ring_exchange`
  - New groups for inter-layer communication
  - Checkpoint changes
  - Command line arguments
-
- 09 Jun, 2020 1 commit
-
-
Neel Kant authored
-
- 24 Apr, 2020 1 commit
-
-
Raul Puri authored
-
- 16 Apr, 2020 1 commit
-
-
Mohammad authored
-
- 13 Apr, 2020 1 commit
-
-
Mohammad authored
-
- 08 Apr, 2020 3 commits
-
-
Jared Casper authored
-
Jared Casper authored
-
Jared Casper authored
preprocess_data:
  - Adds ability to not split sentences. This is used for gpt2 datasets.
  - Adds ability to create multiple datasets from different json keys. This is currently untested.
indexed_dataset:
  - Adds new "get" function to get a portion of an entry.
-