tests/test_tokenization_fast.py · 9336086ab5d232cccd9512333518cf4299528882 · chenpangpang / transformers

Transformer-XL: Improved tokenization with sacremoses (#6322) · cb276b41

RafaelWO authored Aug 28, 2020



* Improved tokenization with sacremoses

 * The TransfoXLTokenizer now uses sacremoses for tokenization
 * Added tokenization of comma-separated and floating-point numbers
 * Removed prepare_for_tokenization() from tokenization_transfo_xl.py because punctuation is handled by sacremoses
 * Added corresponding tests
 * Removed test comparing TransfoXLTokenizer and TransfoXLTokenizerFast
 * Added deprecation warning to TransfoXLTokenizerFast

* isort change

Co-authored-by: Teven <teven.lescao@gmail.com>
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
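
The number handling mentioned in the commit message can be illustrated with a small sketch. It assumes the WikiText-103 convention of marking separators inside numbers as `@,@` and `@.@`; the helper names `split_number_tokens` and `join_number_tokens` are hypothetical and are not taken from the commit, whose actual implementation may differ.

```python
import re

# A "," or "." that sits directly between two digits (lookarounds keep the digits).
NUMBER_SEPARATOR = re.compile(r"(?<=\d)[,.](?=\d)")


def split_number_tokens(tokens):
    """Split tokens such as "5,000" or "1.73" around their separator.

    Hypothetical helper: the separator is wrapped as " @,@ " or " @.@ "
    (assumed WikiText-103 style) and the token is then split on whitespace.
    """
    result = []
    for token in tokens:
        marked = NUMBER_SEPARATOR.sub(r" @\g<0>@ ", token)
        result.extend(marked.split())
    return result


def join_number_tokens(text):
    """Undo the split on a space-joined string (hypothetical inverse helper)."""
    return text.replace(" @,@ ", ",").replace(" @.@ ", ".")


tokens = split_number_tokens(["$", "5,000", "1.73", "m"])
print(tokens)  # ['$', '5', '@,@', '000', '1', '@.@', '73', 'm']
print(join_number_tokens(" ".join(tokens)))  # $ 5,000 1.73 m
```

Punctuation that is not surrounded by digits, such as a sentence-final period, is left untouched by this sketch; in the commit that part is described as being handled by sacremoses.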


test_tokenization_fast.py 42.4 KB
