1. 02 Mar, 2020 1 commit
  2. 24 Feb, 2020 1 commit
  3. 20 Feb, 2020 1 commit
  4. 13 Feb, 2020 1 commit
    • Preserve spaces in GPT-2 tokenizers (#2778) · f1e8a51f
      Joe Davison authored
      * Preserve spaces in GPT-2 tokenizers
      
      Preserves spaces after special tokens in GPT-2 and inherited (RoBERTa)
      tokenizers, enabling correct BPE encoding. Automatically inserts a space
      in front of the first token in the encode function when adding special tokens.
      
      * Add tokenization preprocessing method
      
      * Add framework argument to pipeline factory
      
      Also fixes a pipeline test issue: each test input is now treated as a
      distinct sequence.
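      A toy sketch of why the space preservation described above matters (the vocabulary, ids, and function names below are illustrative, not the actual transformers code): GPT-2's byte-level BPE vocabulary marks a leading space with 'Ġ', so " world" and "world" are distinct tokens, and dropping the space after a special token changes the encoding.

      ```python
      # Hypothetical sketch, not the transformers implementation.

      def prepare_for_tokenization(text, add_prefix_space=True):
          # Mirrors the described fix: insert a space before the first token so the
          # leading word receives the same space-prefixed BPE merge as later words.
          if add_prefix_space and text and not text[0].isspace():
              return " " + text
          return text

      # Toy vocabulary with made-up ids; space-prefixed entries are distinct.
      TOY_VOCAB = {"Ġworld": 1, "world": 2}

      def toy_encode(word):
          # 'Ġ' stands in for a leading space, as in real GPT-2 vocab files.
          key = "Ġ" + word[1:] if word.startswith(" ") else word
          return TOY_VOCAB[key]

      print(toy_encode(" world"))               # space-prefixed entry -> 1
      print(toy_encode("world"))                # bare entry -> 2
      print(prepare_for_tokenization("world"))  # " world"
      ```

      Losing the space thus silently maps a word to a different token id, which is why the fix inserts one ahead of the first token.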
  5. 29 Jan, 2020 2 commits
  6. 06 Jan, 2020 2 commits
  7. 24 Dec, 2019 1 commit
  8. 23 Dec, 2019 1 commit
  9. 22 Dec, 2019 7 commits
  10. 21 Dec, 2019 1 commit
    • Reformat source code with black. · fa84ae26
      Aymeric Augustin authored
      This is the result of:
      
          $ black --line-length 119 examples templates transformers utils hubconf.py setup.py
      
      There are a lot of fairly long lines in the project. As a consequence, I'm
      picking the longest widely accepted line length: 119 characters.
      
      This is also Thomas' preference, because it allows for explicit variable
      names, which make the code easier to understand.
  11. 20 Dec, 2019 2 commits
  12. 13 Dec, 2019 1 commit
  13. 06 Dec, 2019 2 commits
    • Fix bug which lowercases special tokens · 2670b0d6
      Michael Watkins authored
    • Remove dependency on pytest for running tests (#2055) · 35401fe5
      Aymeric Augustin authored
      * Switch to plain unittest for skipping slow tests.
      
      Add a RUN_SLOW environment variable for running them.
      
      * Switch to plain unittest for PyTorch dependency.
      
      * Switch to plain unittest for TensorFlow dependency.
      
      * Avoid leaking open files in the test suite.
      
      This prevents spurious warnings when running tests.
      
      * Fix unicode warning on Python 2 when running tests.
      
      The warning was:
      
          UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
      
      * Support running PyTorch tests on a GPU.
      
      Reverts 27e015bd.
      
      * Tests no longer require pytest.
      
      * Make tests pass on cuda
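      The RUN_SLOW pattern above can be sketched with plain unittest roughly as follows (the decorator name, message, and test class are illustrative, not the exact transformers helpers):

      ```python
      # Hedged sketch: gate slow tests behind a RUN_SLOW environment variable
      # using unittest.skip instead of pytest markers.
      import os
      import unittest

      def slow(test_case):
          # Skip the decorated test unless RUN_SLOW is set to a truthy value.
          if os.environ.get("RUN_SLOW", "").lower() not in ("1", "true", "yes"):
              return unittest.skip("test is slow; set RUN_SLOW=1 to run it")(test_case)
          return test_case

      class ExampleTest(unittest.TestCase):
          @slow
          def test_big_model(self):
              self.assertTrue(True)  # placeholder for an expensive model test
      ```

      Run normally, the test is reported as skipped; with RUN_SLOW=1 in the environment, it executes. Because this uses only unittest, no pytest installation is required.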
  14. 04 Dec, 2019 1 commit
  15. 22 Nov, 2019 2 commits
  16. 12 Nov, 2019 2 commits
    • Fix special tokens addition in decoder · 74d0bcb6
      Lysandre authored
    • Consider do_lower_case in PreTrainedTokenizer · 7246d3c2
      Michael Watkins authored
      As pointed out in #1545, when using an uncased model and adding a new
      uncased token, the tokenizer does not correctly identify the token when
      the input text contains it in a cased form.
      
      For instance, if we load bert-base-uncased into BertTokenizer, and
      then use .add_tokens() to add "cool-token", we get the expected
      result for .tokenize('this is a cool-token'). However, we get a
      possibly unexpected result for .tokenize('this is a cOOl-Token'),
      which in fact mirrors the result for the former from before the new
      token was added.
      
      This commit adds:
      - functionality to PreTrainedTokenizer to handle this situation in case
        a tokenizer (currently Bert, DistilBert, and XLNet) has the
        do_lower_case=True kwarg, by:
          1) lowercasing tokens added with .add_tokens()
          2) lowercasing text at the beginning of .tokenize()
      - a new common test case for tokenizers
      
      https://github.com/huggingface/transformers/issues/1545
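      The two-part fix can be illustrated with a toy tokenizer (the class and its fallback splitting rule are made up for the example, not the transformers implementation):

      ```python
      # Toy sketch of the do_lower_case fix: lowercase tokens on add_tokens()
      # and lowercase text on tokenize(), so "cOOl-Token" matches "cool-token".

      class ToyTokenizer:
          def __init__(self, do_lower_case=False):
              self.do_lower_case = do_lower_case
              self.added_tokens = set()

          def add_tokens(self, tokens):
              if self.do_lower_case:
                  tokens = [t.lower() for t in tokens]  # fix 1: lowercase on add
              self.added_tokens.update(tokens)

          def tokenize(self, text):
              if self.do_lower_case:
                  text = text.lower()                   # fix 2: lowercase input
              out = []
              for word in text.split():
                  if word in self.added_tokens:
                      out.append(word)                  # added token kept whole
                  else:
                      out.extend(word.split("-"))       # toy fallback splitting
              return out

      tok = ToyTokenizer(do_lower_case=True)
      tok.add_tokens(["cool-token"])
      print(tok.tokenize("this is a cOOl-Token"))
      # ['this', 'is', 'a', 'cool-token'] -- the added token survives intact
      ```

      Without the two lowercasing steps, "cOOl-Token" would miss the added-token lookup and be split by the fallback rule, mirroring the unexpected behavior described in #1545.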
  17. 04 Nov, 2019 1 commit
  18. 22 Oct, 2019 1 commit
  19. 04 Oct, 2019 2 commits
  20. 03 Oct, 2019 5 commits
  21. 26 Sep, 2019 1 commit
  22. 24 Sep, 2019 2 commits