    🚨 🚨 🚨 Fix Issue 15003: SentencePiece Tokenizers Not Adding Special Tokens in `convert_tokens_to_string` (#15775)
    Ben Eyal authored · 9f9ddcc2
    
    * Add test for SentencePiece not adding special tokens to strings
    
    * Add SentencePieceStringConversionMixin to fix issue 15003
    
    * Fix conversion from tokens to string for most SentencePiece tokenizers
    
    Tokenizers fixed:
    - AlbertTokenizer
    - BarthezTokenizer
    - CamembertTokenizer
    - FNetTokenizer
    - M2M100Tokenizer
    - MBart50Tokenizer
    - PegasusTokenizer
    - Speech2TextTokenizer
    
    * Fix MarianTokenizer, adjust SentencePiece test to accommodate vocab
    
    * Fix DebertaV2Tokenizer
    
    * Ignore LayoutXLMTokenizer in SentencePiece string conversion test
    
    * Run 'make style' and 'make quality'
    
    * Clean convert_tokens_to_string test
    
    Instead of explicitly ignoring LayoutXLMTokenizer in the test,
    override the test in LayoutXLMTokenizationTest and make it a no-op.
    
    * Remove commented out code
    
    * Improve robustness of convert_tokens_to_string test
    
    Instead of comparing the lengths of the re-tokenized text and the input_ids,
    check that converting the tokens back to a string yields a string that still
    contains all of the special tokens.
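    
    In rough terms, the strengthened check can be pictured with the sketch below; the helper name, sample text, and `tokenizer` handle are illustrative and not the exact code added to test_tokenization_common.py:
    
    ```python
    # Illustrative sketch only, not the exact test added in this PR.
    # `tokenizer` is assumed to be a slow SentencePiece-based tokenizer instance
    # (e.g. AlbertTokenizer) supplied by the surrounding test case.
    def check_special_tokens_survive_conversion(tokenizer):
        special_token = tokenizer.cls_token or tokenizer.bos_token
        text = special_token + " This is text to be tokenized."

        input_ids = tokenizer.encode(text, add_special_tokens=False)
        tokens = tokenizer.convert_ids_to_tokens(input_ids)
        reverse_text = tokenizer.convert_tokens_to_string(tokens)

        # The special token must appear verbatim in the reconstructed string.
        assert special_token in reverse_text
    ```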
    
    * Inline and remove SentencePieceStringConversionMixin
    
    The convert_tokens_to_string method is now implemented
    in each relevant SentencePiece tokenizer.
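    
    For context, the inlined override follows the same general pattern in each of these tokenizers. A minimal sketch of that pattern, assuming a slow tokenizer that exposes `self.sp_model` (the SentencePiece model) and `self.all_special_tokens`, looks like this; it is a sketch of the idea, not a verbatim copy of any one tokenizer's final code:
    
    ```python
    def convert_tokens_to_string(self, tokens):
        """Converts a sequence of tokens into a single string without dropping special tokens."""
        current_sub_tokens = []
        out_string = ""
        prev_is_special = False
        for token in tokens:
            # Special tokens are not part of the SentencePiece vocabulary, so decoding
            # them with sp_model would silently drop them; append them verbatim instead.
            if token in self.all_special_tokens:
                if not prev_is_special:
                    out_string += " "
                out_string += self.sp_model.decode(current_sub_tokens) + token
                prev_is_special = True
                current_sub_tokens = []
            else:
                current_sub_tokens.append(token)
                prev_is_special = False
        out_string += self.sp_model.decode(current_sub_tokens)
        return out_string.strip()
    ```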
    
    * Run 'make style' and 'make quality'
    
    * Revert removal of space in convert_tokens_to_string
    
    * Remove redundant import
    
    * Revert test text to original
    
    * Uncomment the lowercasing of the reverse_text variable
    
    * Mimic Rust tokenizer behavior in the following tokenizers:
    
    - Albert
    - Barthez
    - Camembert
    - MBart50
    - T5
    
    * Fix a test that was accidentally being skipped in the wrong tokenizer
    
    * Add test for equivalent Rust and slow tokenizer behavior
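    
    Roughly, the equivalence being tested can be pictured like this (a hedged sketch against the public Albert checkpoint, not the actual test code):
    
    ```python
    from transformers import AlbertTokenizer, AlbertTokenizerFast

    # The slow (Python/SentencePiece) and fast (Rust) tokenizers should decode
    # the same ids to the same string, special tokens included.
    slow = AlbertTokenizer.from_pretrained("albert-base-v2")
    fast = AlbertTokenizerFast.from_pretrained("albert-base-v2")

    ids = slow.encode("This is text to be tokenized.")  # includes [CLS] ... [SEP]
    assert slow.decode(ids) == fast.decode(ids)
    ```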
    
    * Override _decode in BigBirdTokenizer to mimic Rust behavior
    
    * Override _decode in FNetTokenizer to mimic Rust behavior
    
    * Override _decode in XLNetTokenizer to mimic Rust behavior
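    
    The three `_decode` overrides share one idea: build the output from special tokens and SentencePiece pieces separately, then join the parts with the spacing rules the Rust tokenizer applies. A rough sketch of such a method override follows; the regex targeting `[SEP]`/`[MASK]` is an assumption for illustration, not the exact rule used in every tokenizer:
    
    ```python
    import re
    from typing import List

    def _decode(
        self,
        token_ids: List[int],
        skip_special_tokens: bool = False,
        clean_up_tokenization_spaces: bool = True,
        spaces_between_special_tokens: bool = True,
        **kwargs,
    ) -> str:
        tokens = self.convert_ids_to_tokens(token_ids, skip_special_tokens=skip_special_tokens)

        # Decode runs of ordinary pieces with SentencePiece, but keep special tokens verbatim.
        sub_texts, current = [], []
        for token in tokens:
            if token in self.all_special_tokens:
                if current:
                    sub_texts.append(self.convert_tokens_to_string(current))
                    current = []
                sub_texts.append(token)
            else:
                current.append(token)
        if current:
            sub_texts.append(self.convert_tokens_to_string(current))

        if spaces_between_special_tokens:
            # Rust tokenizers do not put a space before tokens such as [SEP] and [MASK].
            text = re.sub(r" (\[(MASK|SEP)\])", r"\1", " ".join(sub_texts))
        else:
            text = "".join(sub_texts)

        return self.clean_up_tokenization(text) if clean_up_tokenization_spaces else text
    ```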
    
    * Remove unused 're' import
    
    * Update DebertaV2Tokenizer to mimic Rust tokenizer
    
    * The Deberta tokenizer now behaves like Albert, so its `convert_tokens_to_string` is not tested separately.
    
    * Ignore problematic tests in Deberta V2
    
    * Add comment on why the Deberta V2 tests are skipped