Updates to preprocess_data.py and indexed_dataset.
preprocess_data:
- Adds the ability to not split sentences. This is used for gpt2 datasets.
- Adds the ability to create multiple datasets from different json keys; this is currently untested.

indexed_dataset:
- Adds a new "get" function to get a portion of an entry.
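To illustrate what getting "a portion of an entry" could look like, here is a minimal sketch. It is not the code from this commit: the class `ToyIndexedDataset`, its in-memory layout, and the `offset` and `length` parameter names are assumptions made for the example.

```python
import numpy as np


class ToyIndexedDataset:
    """Hypothetical stand-in for an indexed dataset (not the commit's class).

    All entries are stored back to back in one flat array, and a table of
    start positions and sizes says where each entry lives.
    """

    def __init__(self, entries):
        # Assumes at least one entry is given.
        self._sizes = np.array([len(e) for e in entries], dtype=np.int64)
        # _starts[i] is the position in the flat buffer where entry i begins.
        self._starts = np.concatenate(([0], np.cumsum(self._sizes)[:-1]))
        self._data = np.concatenate(
            [np.asarray(e, dtype=np.int32) for e in entries]
        )

    def __getitem__(self, idx):
        # Whole-entry access is the special case offset=0, length=None.
        return self.get(idx)

    def get(self, idx, offset=0, length=None):
        """Return `length` tokens of entry `idx`, starting at `offset`.

        The parameter names are assumed; with length=None the slice runs
        to the end of the entry.
        """
        size = int(self._sizes[idx])
        if length is None:
            length = size - offset
        if offset < 0 or length < 0 or offset + length > size:
            raise IndexError("requested slice falls outside the entry")
        start = int(self._starts[idx]) + offset
        return self._data[start:start + length]


ds = ToyIndexedDataset([[1, 2, 3, 4, 5], [6, 7], [8, 9, 10]])
part = ds.get(0, offset=1, length=3)  # middle of the first entry
tail = ds.get(2, offset=1)            # from offset 1 to the end of the last entry
```

Reading only a slice is useful when entries are long unsplit documents, as in the gpt2 case, and a training sample covers just part of one.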
scripts/preprocess_data.py
0 → 100644