Updates to preprocess_data.py and indexed_dataset.
preprocess_data:
- Adds the ability to not split sentences. This is used for gpt2 datasets.
- Adds the ability to create multiple datasets from different json keys. This is currently untested.

indexed_dataset:
- Adds a new "get" function to get a portion of an entry.
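
To illustrate what getting "a portion of an entry" could look like, here is a minimal sketch, not the code from this commit. It assumes entries are stored back to back in one flat token buffer with a per-entry offset table. The class name `TinyIndexedDataset` and the `offset` and `length` parameters of `get` are hypothetical and chosen only for illustration.

```python
from array import array


class TinyIndexedDataset:
    """Hypothetical in-memory stand-in for an indexed dataset.

    All entries live in one flat integer buffer; offsets[i] and
    offsets[i + 1] mark where entry i starts and ends.
    """

    def __init__(self, entries):
        self.data = array("i")
        self.offsets = [0]
        for tokens in entries:
            self.data.extend(tokens)
            self.offsets.append(len(self.data))

    def __len__(self):
        return len(self.offsets) - 1

    def __getitem__(self, idx):
        # Whole-entry access.
        start, end = self.offsets[idx], self.offsets[idx + 1]
        return list(self.data[start:end])

    def get(self, idx, offset=0, length=None):
        """Return `length` tokens of entry `idx`, starting at `offset`.

        With length=None, everything from `offset` to the end of the
        entry is returned. (Parameter names are an assumption.)
        """
        if offset < 0:
            raise IndexError("offset must be non-negative")
        start = self.offsets[idx] + offset
        end = self.offsets[idx + 1]
        if start > end:
            raise IndexError("offset is past the end of the entry")
        if length is None:
            length = end - start
        if length < 0 or start + length > end:
            raise IndexError("requested span does not fit in the entry")
        return list(self.data[start:start + length])


ds = TinyIndexedDataset([[1, 2, 3, 4, 5], [6, 7], [8, 9, 10]])
middle = ds.get(0, offset=1, length=3)  # three tokens from the first entry
tail = ds.get(2, offset=1)              # rest of the last entry
```

Reading a slice this way avoids materializing a whole entry when only part of it is needed, which matters when sentences are not split and a single entry holds an entire document.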
scripts/preprocess_data.py
0 → 100644