1. 12 Nov, 2019 2 commits
    • Consider do_lower_case in PreTrainedTokenizer · 7246d3c2
      Michael Watkins authored
      As pointed out in #1545, when a new token is added to an uncased
      model, the tokenizer fails to recognize that token if it appears in
      the input text with different casing.
      
      For instance, if we load bert-base-uncased into BertTokenizer and
      then use .add_tokens() to add "cool-token", we get the expected
      result for .tokenize('this is a cool-token'). However,
      .tokenize('this is a cOOl-Token') gives a possibly unexpected
      result: the same output that 'this is a cool-token' produced
      before the new token was added.
      
      This commit adds:
      - functionality to PreTrainedTokenizer to handle this situation
        when a tokenizer (currently Bert, DistilBert, and XLNet) has
        the do_lower_case=True kwarg, by:
          1) lowercasing tokens added with .add_tokens()
          2) lowercasing text at the beginning of .tokenize()
      - a new common test case for tokenizers
      
      https://github.com/huggingface/transformers/issues/1545
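
The two steps listed in the commit message can be illustrated with a small, self-contained sketch. This is not the code from the commit: `ToyTokenizer` and its word/punctuation splitting are hypothetical stand-ins, and only the names `do_lower_case`, `add_tokens` and `tokenize` are taken from the message above.

```python
import re


class ToyTokenizer:
    """Hypothetical stand-in for a tokenizer with a do_lower_case option."""

    def __init__(self, do_lower_case=False):
        self.do_lower_case = do_lower_case
        self.added_tokens = []

    def add_tokens(self, new_tokens):
        for token in new_tokens:
            # Step 1: lowercase added tokens when the tokenizer is uncased.
            if self.do_lower_case:
                token = token.lower()
            if token not in self.added_tokens:
                self.added_tokens.append(token)

    def _base_tokenize(self, text):
        # Assumption: a crude word/punctuation split stands in for the
        # model's real sub-word tokenization.
        return re.findall(r"\w+|[^\w\s]", text)

    def tokenize(self, text):
        # Step 2: lowercase the text before searching for added tokens.
        if self.do_lower_case:
            text = text.lower()
        if not self.added_tokens:
            return self._base_tokenize(text)
        # Try longer added tokens first so they win over shorter prefixes.
        ordered = sorted(self.added_tokens, key=len, reverse=True)
        pattern = "(" + "|".join(re.escape(t) for t in ordered) + ")"
        pieces = []
        for chunk in re.split(pattern, text):
            if chunk in self.added_tokens:
                pieces.append(chunk)  # keep the added token whole
            else:
                pieces.extend(self._base_tokenize(chunk))
        return pieces


uncased = ToyTokenizer(do_lower_case=True)
uncased.add_tokens(["Cool-Token"])
uncased_result = uncased.tokenize("this is a cOOl-Token")

cased = ToyTokenizer(do_lower_case=False)
cased.add_tokens(["cool-token"])
cased_result = cased.tokenize("this is a cOOl-Token")
```

With `do_lower_case=True` the added token is matched regardless of the casing in the input, while the cased toy tokenizer still splits `cOOl-Token` into separate pieces, which is the mismatch described in the message.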
    • fix #1789 · 8aba81a0
      thomwolf authored
  2. 11 Nov, 2019 1 commit
  3. 08 Nov, 2019 1 commit
  4. 06 Nov, 2019 7 commits
  5. 05 Nov, 2019 11 commits
  6. 04 Nov, 2019 12 commits
  7. 03 Nov, 2019 1 commit
  8. 01 Nov, 2019 2 commits
  9. 31 Oct, 2019 3 commits