Data preprocessing testing changes + fixes
tools/merge_datasets.py
- tool to merge multiple dataset files into a single dataset
- testing conducted and included in the megatron-testing repo https://gitlab-master.nvidia.com/ADLR/megatron-testing

tools/preprocess_data.py
- magic numbers changed to required command line arguments

megatron/data/indexed_dataset.py
- when merging, fix to properly update document index
- testing conducted and included in the megatron-testing repo (see above)
- fix follows this history https://github.com/bigscience-workshop/Megatron-DeepSpeed/pull/66
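
For readers unfamiliar with the document index fix, here is a minimal sketch of the idea. It is not the code from this commit. It assumes that a dataset stores one size per sequence plus a document index whose entries are sequence positions marking document boundaries, starting with 0. The function name merge_document_index and the plain-list representation are hypothetical and chosen only for illustration.

```python
def merge_document_index(sizes_a, doc_idx_a, sizes_b, doc_idx_b):
    """Append dataset B to dataset A (hypothetical helper, illustration only).

    Assumption: doc_idx holds sequence positions of document boundaries
    and always begins with 0.
    """
    # Positions in B are relative to B, so shift them by the number of
    # sequences already present in A.
    offset = len(sizes_a)
    merged_sizes = list(sizes_a) + list(sizes_b)
    # Skip B's leading 0: after shifting it would duplicate A's last boundary.
    merged_doc_idx = list(doc_idx_a) + [offset + i for i in doc_idx_b[1:]]
    return merged_sizes, merged_doc_idx


# Dataset A: 3 sequences in 2 documents; dataset B: 2 sequences in 2 documents.
sizes, doc_idx = merge_document_index([3, 4, 5], [0, 2, 3], [1, 2], [0, 1, 2])
print(sizes)    # [3, 4, 5, 1, 2]
print(doc_idx)  # [0, 2, 3, 4, 5]
```

Without the offset, the boundaries copied from the second dataset would point back into the first dataset's sequences, which is the kind of error a document index update during merging has to avoid.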
Showing
tools/merge_datasets.py
0 → 100644