    Cythonize token block dataset (#834) · 4fc39538
    Naman Goyal authored
    Summary:
    Cythonized the token block dataset code; it is `> 100x` faster. Building the token block index for the entire `bookwiki+CC+stories+openweb` corpus now takes only ~`39.9` seconds.
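
    For context, a Cython extension like this is typically compiled via `setup.py`. The sketch below is a minimal, hypothetical example of wiring such an extension into the build; the module name `fairseq.data.token_block_utils_fast` and the `.pyx` path are illustrative assumptions, not necessarily the exact names used in this commit.

    ```python
    # setup.py (illustrative sketch, not the exact file from this commit)
    from setuptools import setup, Extension
    from Cython.Build import cythonize
    import numpy as np

    # Hypothetical .pyx module holding the fast token-block index code.
    extensions = [
        Extension(
            "fairseq.data.token_block_utils_fast",
            ["fairseq/data/token_block_utils_fast.pyx"],
            include_dirs=[np.get_include()],  # fast path typically operates on numpy arrays
        )
    ]

    setup(
        name="fairseq",
        ext_modules=cythonize(extensions, compiler_directives={"language_level": "3"}),
    )
    ```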
    
    TODO:
    1) I think I can make it another 2x faster.
    2) Cleanup.
    
    EDIT History:
    ~~First pass at parallelizing `token_block_dataset`. The code feels somewhat complicated and cluttered.
    It is 2-3x faster, though, in my tests on the `bookwiki` dataset with both `complete` and `complete_doc` modes.
    myleott, can you take a look for correctness? I am still not 100% sure that I am not missing corner cases.~~
    Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/834
    
    Test Plan:
    Imported from GitHub, without a `Test Plan:` line.
    
    Test workflow: f133816198
    
    Reviewed By: myleott
    
    Differential Revision: D16970257
    
    Pulled By: myleott
    
    fbshipit-source-id: ec45a308193c9e9f3e7075336c15df4723228d6f