    Consider do_lower_case in PreTrainedTokenizer · 7246d3c2
    Michael Watkins authored
    As pointed out in #1545, when using an uncased model and adding
    a new uncased token, the tokenizer fails to match that token
    whenever the input text contains it in a cased form.
    
    For instance, if we load bert-base-uncased into BertTokenizer and
    then use .add_tokens() to add "cool-token", we get the expected
    result for .tokenize('this is a cool-token'). However,
    .tokenize('this is a cOOl-Token') gives a possibly unexpected
    result: the same output the first input produced before the new
    token was added.
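
    For reference, a minimal repro (assuming the transformers
    BertTokenizer API at the time of this commit; the exact sub-token
    splits shown in the comments may vary):

        from transformers import BertTokenizer

        tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
        tokenizer.add_tokens(['cool-token'])

        # Lowercase input: the added token is matched as one token.
        print(tokenizer.tokenize('this is a cool-token'))
        # e.g. ['this', 'is', 'a', 'cool-token']

        # Cased input: before this commit, the added token is missed
        # and the text is split as if it had never been added.
        print(tokenizer.tokenize('this is a cOOl-Token'))
        # e.g. ['this', 'is', 'a', 'cool', '-', 'token']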
    
    This commit adds
    - functionality to PreTrainedTokenizer to handle this situation
      whenever a tokenizer (currently Bert, DistilBert, and XLNet) is
      instantiated with the do_lower_case=True kwarg, by:
        1) lowercasing tokens added with .add_tokens()
        2) lowercasing text at the beginning of .tokenize()
      (see the sketch below)
    - a new common test case for tokenizers
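
    A minimal sketch of those two steps, using a hypothetical
    SimpleTokenizer stand-in rather than the real PreTrainedTokenizer
    (init_kwargs mirrors the attribute that records the kwargs a
    tokenizer was constructed with):

        class SimpleTokenizer:
            """Illustrative stand-in for the added-token handling."""

            def __init__(self, do_lower_case=False):
                self.init_kwargs = {'do_lower_case': do_lower_case}
                self.added_tokens = set()

            def add_tokens(self, tokens):
                for token in tokens:
                    # 1) Store added tokens lowercased for uncased
                    #    models, so they can match the text that is
                    #    lowercased in step 2.
                    if self.init_kwargs.get('do_lower_case', False):
                        token = token.lower()
                    self.added_tokens.add(token)

            def tokenize(self, text):
                # 2) Lowercase the text up front, so cased occurrences
                #    of added tokens (e.g. 'cOOl-Token') still match.
                if self.init_kwargs.get('do_lower_case', False):
                    text = text.lower()
                # (the real code then splits on added tokens before
                # running the model-specific tokenizer; a plain
                # whitespace split stands in for that here)
                return text.split()

        tokenizer = SimpleTokenizer(do_lower_case=True)
        tokenizer.add_tokens(['COOL-TOKEN'])  # stored as 'cool-token'
        print(tokenizer.tokenize('this is a cOOl-Token'))
        # ['this', 'is', 'a', 'cool-token']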
    
    https://github.com/huggingface/transformers/issues/1545