    Tokenizers v3.0.0 (#3185) · 96ab75b8
    Funtowicz Morgan authored
    
    
    * Renamed num_added_tokens to num_special_tokens_to_add
    Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
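
    A minimal sketch of the renamed helper (the BERT checkpoint name and
    network access are assumptions, not part of the commit):

        from transformers import BertTokenizerFast

        tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
        # [CLS] ... [SEP] around a single sequence -> 2 special tokens
        print(tokenizer.num_special_tokens_to_add(pair=False))
        # [CLS] a [SEP] b [SEP] around a pair -> 3 special tokens
        print(tokenizer.num_special_tokens_to_add(pair=True))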
    
    * Cherry-pick: partially fix space-only input without special tokens added to the output (#3091)
    Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
    
    * Added property is_fast on PreTrainedTokenizer and PreTrainedTokenizerFast
    Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
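
    A quick sketch of the new property (checkpoint name assumed):

        from transformers import BertTokenizer, BertTokenizerFast

        assert BertTokenizer.from_pretrained("bert-base-uncased").is_fast is False
        assert BertTokenizerFast.from_pretrained("bert-base-uncased").is_fast is True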
    
    * Make fast tokenizer unittests work on Windows.
    
    * Entirely refactored the unittests for fast tokenizers.
    
    * Remove ABC class for CommonFastTokenizerTest
    
    * Added embeded_special_tokens tests from allenai @dirkgr
    
    * Make embeded_special_tokens tests from allenai more generic
    
    * Uniformize vocab_size as a property for both Fast and normal tokenizers
    
    * Move special tokens handling out of PretrainedTokenizer (SpecialTokensMixin)
    
    * Ensure providing None input raises the same ValueError as the Python tokenizer, with tests.
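
    A sketch of the aligned behavior (checkpoint name assumed):

        from transformers import BertTokenizerFast

        tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
        try:
            tokenizer.encode(None)
        except ValueError as exc:
            print(exc)  # same ValueError the slow Python tokenizer raises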
    
    * Fix invalid input for assert_padding when testing batch_encode_plus
    
    * Move add_special_tokens from constructor to tokenize/encode/[batch_]encode_plus methods parameter.
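
    The flag now travels with each call instead of being fixed at
    construction time; a sketch (checkpoint name assumed):

        from transformers import BertTokenizerFast

        tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
        with_specials = tokenizer.encode("Hello world", add_special_tokens=True)
        without = tokenizer.encode("Hello world", add_special_tokens=False)
        assert len(with_specials) == len(without) + 2  # BERT adds [CLS] and [SEP]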
    
    * Ensure tokenize() correctly forwards add_special_tokens to Rust.
    
    * Add None checking on top of encode / encode_batch for TransfoXLTokenizerFast.
    Avoid stripping on None values.
    
    * Unittests ensure tokenize() also raises a ValueError if provided None
    
    * Added add_special_tokens unittest for all supported models.
    
    * Style
    
    * Make sure TransfoXL tests run only if PyTorch is available.
    
    * Split up tokenizers tests for each model type.
    
    * Fix invalid unittest with new tokenizers API.
    
    * Filter out Roberta openai detector models from unittests.
    
    * Introduce BatchEncoding on fast tokenizers path.
    
    This new structure exposes all the mappings retrieved from Rust.
    It also keeps the current behavior with model forward.
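
    A sketch of the dual nature of BatchEncoding (checkpoint name assumed;
    the mapping helpers mirror the Rust Encoding API):

        from transformers import BertTokenizerFast

        tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
        enc = tokenizer.encode_plus("Hello world", return_offsets_mapping=True)

        enc["input_ids"]       # dict-like access keeps model forward code working
        enc["offset_mapping"]  # (start, end) character span for every token
        enc.tokens()           # token strings, straight from the Rust Encoding
        enc.char_to_token(3)   # index of the token covering character 3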
    
    * Introduce BatchEncoding on slow tokenizers path.
    
    Backward compatibility.
    
    * Improve error message on BatchEncoding for slow path
    
    * Make add_prefix_space True by default on Roberta fast to match the Python tokenizer in the majority of cases.
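
    A sketch of the effect (checkpoint name assumed; the flag is passed
    explicitly here although this commit makes it the default):

        from transformers import RobertaTokenizerFast

        tokenizer = RobertaTokenizerFast.from_pretrained(
            "roberta-base", add_prefix_space=True
        )
        # A sequence-initial word now gets the same Ġ-prefixed token it
        # would get mid-sentence, matching the slow tokenizer's usual output.
        print(tokenizer.tokenize("Hello"))  # ['ĠHello'] rather than ['Hello']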
    
    * Style and format.
    
    * Added typing on all methods for PreTrainedTokenizerFast
    
    * Style and format
    
    * Added path for feeding pretokenized (List[str]) input to PreTrainedTokenizerFast.
    
    * Style and format
    
    * encode_plus now supports pretokenized inputs.
    
    * Remove user warning about add_special_tokens when working on pretokenized inputs.
    
    * Always go through the post processor.
    
    * Added support for pretokenized input pairs on encode_plus
    
    * Added is_pretokenized flag on encode_plus for clarity and improved error message on input TypeError.
    
    * Added pretokenized inputs support on batch_encode_plus
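
    The pretokenized path from the last few commits in one sketch
    (checkpoint name assumed):

        from transformers import BertTokenizerFast

        tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
        words = ["My", "dog", "is", "cute"]

        # a single pre-split sequence
        enc = tokenizer.encode_plus(words, is_pretokenized=True)
        # a batch of pre-split sequences
        batch = tokenizer.batch_encode_plus(
            [words, ["Hello", "world"]], is_pretokenized=True
        )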
    
    * Update BatchEncoding methods name to match Encoding.
    
    * Bump setup.py tokenizers dependency to 0.7.0rc1
    
    * Remove unused parameters in BertTokenizerFast
    
    * Make sure Roberta returns token_type_ids for unittests.
    
    * Added missing typings
    
    * Update add_tokens prototype to match tokenizers side and allow AddedToken
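
    A sketch of the widened prototype (checkpoint name assumed; AddedToken
    comes from the tokenizers package):

        from tokenizers import AddedToken
        from transformers import BertTokenizerFast

        tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
        # plain strings still work; AddedToken allows finer matching control
        tokenizer.add_tokens([
            "plain_new_token",
            AddedToken("<ent>", single_word=True, lstrip=False, rstrip=False),
        ])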
    
    * Bumping tokenizers to 0.7.0rc2
    
    * Added documentation for BatchEncoding
    
    * Added (unused) is_pretokenized parameter on PreTrainedTokenizer encode_plus/batch_encode_plus methods.
    
    * Added higher-level typing for tokenize / encode_plus / batch_encode_plus.
    
    * Fix unittests failing because add_special_tokens was defined as a constructor parameter on Rust Tokenizers.
    
    * Fix text-classification pipeline using the wrong tokenizer
    
    * Make pipelines work with BatchEncoding
    
    * Turn off add_special_tokens on tokenize by default.
    Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
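
    A sketch of the new default (checkpoint name assumed):

        from transformers import BertTokenizerFast

        tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
        tokenizer.tokenize("Hello world")                           # no [CLS]/[SEP]
        tokenizer.tokenize("Hello world", add_special_tokens=True)  # opt back in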
    
    * Remove add_prefix_space from tokenize call in unittest.
    Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
    
    * Style and quality
    Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
    
    * Correct message for batch_encode_plus None input exception.
    Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
    
    * Fix invalid list comprehension for offset_mapping that overrode its content on every iteration.
    Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
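
    A hypothetical illustration of the bug class, not the actual diff:
    rebinding the accumulator on every pass keeps only the last
    iteration's offsets.

        encodings = [[(0, 5), (6, 11)], [(0, 3)]]

        offset_mapping = []
        for enc in encodings:
            offset_mapping = [span for span in enc]   # BUG: rebinds, drops prior rows
        # offset_mapping == [(0, 3)]

        offset_mapping = []
        for enc in encodings:
            offset_mapping += [span for span in enc]  # fix: extend instead
        # offset_mapping == [(0, 5), (6, 11), (0, 3)]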
    
    * TransfoXL uses Strip normalizer.
    Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
    
    * Bump tokenizers dependency to 0.7.0rc3
    Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
    
    * Support AddedTokens for special_tokens and use left stripping on mask for Roberta.
    Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
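
    A sketch of the idea (checkpoint name assumed): declaring the mask as an
    AddedToken with lstrip=True lets "<mask>" consume the space before it
    rather than leaving a stray space token.

        from tokenizers import AddedToken
        from transformers import RobertaTokenizerFast

        mask = AddedToken("<mask>", lstrip=True, rstrip=False)
        tokenizer = RobertaTokenizerFast.from_pretrained(
            "roberta-base", mask_token=mask
        )
        print(tokenizer.tokenize("Hello <mask>"))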
    
    * SpecialTokensMixin can use slots for faster access to underlying attributes.
    Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
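
    A minimal sketch of the idea (the attribute names are assumptions; the
    log later notes this was reverted pending a deeper look at pickling):

        class SpecialTokensMixinSketch:
            # __slots__ replaces the per-instance __dict__ with fixed slots,
            # making attribute lookups faster and instances smaller
            __slots__ = ("_bos_token", "_eos_token", "_pad_token", "_mask_token")

            def __init__(self):
                self._bos_token = "<s>"
                self._eos_token = "</s>"
                self._pad_token = "<pad>"
                self._mask_token = "<mask>"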
    
    * Remove update_special_tokens from fast tokenizers.
    
    * Ensure TransfoXL unittests are run only when torch is available.
    
    * Style.
    Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
    
    * Style
    
    * Style 🙏🙏
    
    
    
    * Remove slots on SpecialTokensMixin; needs a deeper dive into the pickle protocol.
    
    * Remove Roberta warning on __init__.
    
    * Move documentation to Google style.
    Co-authored-by: LysandreJik <lysandre.debut@reseau.eseo.fr>