Data preprocessing testing changes + fixes
tools/merge_datasets.py
- tool to merge multiple dataset files into a single dataset
- testing conducted and included in the megatron-testing repo https://gitlab-master.nvidia.com/ADLR/megatron-testing

tools/preprocess_data.py
- magic numbers changed to required command line arguments

megatron/data/indexed_dataset.py
- when merging, fix to properly update document index
- testing conducted and included in the megatron-testing repo (see above)
- fix follows this history https://github.com/bigscience-workshop/Megatron-DeepSpeed/pull/66
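
To illustrate why the document index needs updating on merge, here is a minimal sketch using plain Python lists. It is not the code in megatron/data/indexed_dataset.py; the function name `merge_index` and its arguments are hypothetical. It assumes each dataset carries a `sizes` list (one entry per sequence) and a `doc_idx` list of sequence positions marking document boundaries, starting at 0 and ending at `len(sizes)`.

```python
def merge_index(sizes_a, doc_idx_a, sizes_b, doc_idx_b):
    """Merge two (sizes, doc_idx) pairs into one.

    Hypothetical helper for illustration only. Assumes each doc_idx
    starts with 0 and ends with len(sizes) of its own dataset.
    """
    # Sequences from b are appended after those of a, so every boundary
    # in b must be shifted by the number of sequences already present.
    offset = len(sizes_a)
    merged_sizes = list(sizes_a) + list(sizes_b)
    # Skip b's leading 0: after shifting it would duplicate the final
    # boundary of a.
    merged_doc_idx = list(doc_idx_a) + [offset + i for i in doc_idx_b[1:]]
    return merged_sizes, merged_doc_idx


# Dataset a: 3 sequences in 2 documents; dataset b: 2 sequences in 2 documents.
sizes, doc_idx = merge_index([5, 3, 7], [0, 2, 3], [4, 6], [0, 1, 2])
assert sizes == [5, 3, 7, 4, 6]
assert doc_idx == [0, 2, 3, 4, 5]
```

Appending b's boundaries without the offset would give `[0, 2, 3, 1, 2]`, which points back into a's sequences instead of b's.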
Showing tools/merge_datasets.py (new file, mode 0 → 100644)