Cythonize token block dataset (#834)
Summary:
Cythonized the token block dataset code; it is now `> 100x` faster. Building token blocks for the entire `bookwiki+CC+stories+openweb` corpus takes only ~`39.9` seconds.

TODO:
1) I think I can make it another 2x faster.
2) Cleanup.

EDIT History:
~~First pass at parallelizing `token_block_dataset`. The code feels somewhat complicated and cluttered, but it is 2-3x faster in my tests on the `bookwiki` dataset with both `complete` and `complete_doc` modes. myleott, can you take a look for correctness? I am still not 100% sure I am not missing corner cases.~~

Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/834

Test Plan: Imported from GitHub, without a `Test Plan:` line.

Test workflow: f133816198

Reviewed By: myleott

Differential Revision: D16970257

Pulled By: myleott

fbshipit-source-id: ec45a308193c9e9f3e7075336c15df4723228d6f
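For context, a minimal sketch of the kind of loop being moved into Cython: grouping consecutive sentences into fixed-size token blocks without splitting sentences (roughly the `complete` break mode). The function name `complete_block_slices` and its signature are hypothetical illustrations, not the fairseq API; the real `token_block_dataset` handles more modes and edge cases.

```python
import numpy as np

def complete_block_slices(sizes, block_size):
    """Hypothetical sketch: pack consecutive sentences into blocks of at
    most `block_size` tokens without splitting any sentence. Returns
    (start, end) token offsets for each block.

    `sizes` is a 1-D array of per-sentence token counts. This is an
    illustration of the boundary-computation loop, not the actual
    fairseq implementation.
    """
    slices = []
    tok_start = 0   # token offset where the current block begins
    cur_len = 0     # tokens accumulated in the current block
    for size in sizes:
        # start a new block if adding this sentence would overflow it
        if cur_len and cur_len + size > block_size:
            slices.append((tok_start, tok_start + cur_len))
            tok_start += cur_len
            cur_len = 0
        cur_len += size
    if cur_len:
        slices.append((tok_start, tok_start + cur_len))
    return np.array(slices, dtype=np.int64)

# Example: sentences of 3, 5, 4 and 6 tokens packed into blocks of <= 8 tokens
print(complete_block_slices(np.array([3, 5, 4, 6]), block_size=8))
# -> [[ 0  8] [ 8 12] [12 18]]
```

In pure Python this loop is slow on corpora with hundreds of millions of sentences; compiling it with Cython (typed memoryviews over the `sizes` array, C-level integers for the counters) is what yields the reported speedup.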