    Transformer-XL: Improved tokenization with sacremoses (#6322) · cb276b41
    RafaelWO authored
    
    
    * Improved tokenization with sacremoses
    
     * The TransfoXLTokenizer now uses sacremoses for tokenization
     * Added tokenization of comma-separated and floating-point numbers
     * Removed prepare_for_tokenization() from tokenization_transfo_xl.py because punctuation is handled by sacremoses
     * Added corresponding tests
     * Removed test comparing TransfoXLTokenizer and TransfoXLTokenizerFast
     * Added deprecation warning to TransfoXLTokenizerFast
    
    * isort change
    Co-authored-by: Teven <teven.lescao@gmail.com>
    Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
test_tokenization_transfo_xl.py 4.12 KB
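
The commit message mentions tokenization of comma-separated and floating-point numbers but does not show what that looks like. The sketch below is one plausible illustration using only the standard library, not the code from this commit. It assumes the WikiText-103 convention of marking separators inside numbers as `@,@` and `@.@`; the names `split_numbers` and `join_numbers` are hypothetical and chosen only for this example.

```python
import re

# Assumption: separators inside numbers are marked WikiText-103 style,
# i.e. "5,000" -> "5 @,@ 000" and "1.73" -> "1 @.@ 73".
# A comma or period counts only when it sits directly between two digits.
NUMBER_SEPARATOR = re.compile(r"(?<=\d)[,.](?=\d)")


def split_numbers(tokens):
    """Split number tokens at internal commas/periods (hypothetical helper)."""
    result = []
    for token in tokens:
        # Surround the separator with spaces and "@" markers, then split on whitespace.
        marked = NUMBER_SEPARATOR.sub(r" @\g<0>@ ", token)
        result.extend(marked.split())
    return result


def join_numbers(text):
    """Undo split_numbers on a space-joined string (hypothetical helper)."""
    return text.replace(" @,@ ", ",").replace(" @.@ ", ".")


if __name__ == "__main__":
    pieces = split_numbers(["$", "5,000", "1.73", "m"])
    print(pieces)
    print(join_numbers(" ".join(pieces)))
```

Punctuation that is not between two digits, such as the comma in `Hello,` or a sentence-final period after a number, is left untouched by this step; in the commit that kind of punctuation is said to be handled by sacremoses.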