    Updates to preprocess_data.py and indexed_dataset. · da0562fc
    Jared Casper authored
    preprocess_data:
    - Adds ability to not split sentences. This is used for gpt2 datasets.
    
    - Adds ability to create multiple datasets from different JSON keys.
    This is currently untested.
    
    indexed_dataset:
    - Adds a new "get" function to get a portion of an entry.
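
To illustrate what reading "a portion of an entry" could look like, here is a minimal sketch. It is not the code from this commit: `ToyIndexedDataset`, its in-memory layout, and the `offset`/`length` parameter names are assumptions made for the example.

```python
class ToyIndexedDataset:
    """Hypothetical in-memory stand-in for an indexed dataset."""

    def __init__(self, entries):
        # Keep all tokens in one flat list and remember where each entry
        # starts and how long it is, similar in spirit to an on-disk index.
        self._data = []
        self._starts = []
        self._sizes = []
        for entry in entries:
            self._starts.append(len(self._data))
            self._sizes.append(len(entry))
            self._data.extend(entry)

    def __len__(self):
        return len(self._sizes)

    def __getitem__(self, idx):
        # Indexing returns the whole entry.
        return self.get(idx)

    def get(self, idx, offset=0, length=None):
        """Return `length` items of entry `idx`, starting at `offset`.

        With the defaults, the whole entry is returned.
        """
        size = self._sizes[idx]
        if length is None:
            length = size - offset
        if offset < 0 or length < 0 or offset + length > size:
            raise IndexError("requested span falls outside the entry")
        start = self._starts[idx] + offset
        return self._data[start:start + length]


dataset = ToyIndexedDataset([[10, 11, 12, 13], [20, 21, 22]])
middle = dataset.get(0, offset=1, length=2)  # part of the first entry
tail = dataset.get(1, offset=1)              # rest of the second entry
```

A partial read like this lets a caller fetch a slice of a long entry without copying the whole entry first.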
preprocess_data.py 6.56 KB