  1. 25 Jun, 2020 1 commit
  2. 23 Jun, 2020 1 commit
    • Tokenizers API developments (#5103) · 11fdde02
      Thomas Wolf authored
      
      
      * Add return lengths
      
      * make pad a bit more flexible so it can be used as collate_fn
      
      * check all kwargs sent to encoding method are known
      
      * fixing kwargs in encodings
      
      * New AddedToken class in python
      
      This class lets you specify specific tokenization behaviors for some special tokens. It is used in particular for GPT2 and Roberta, to control how whitespace is stripped around special tokens.
      
      * style and quality
      
      * switched to huggingface tokenizers library for AddedTokens
      
      * up to tokenizer 0.8.0-rc3 - update API to use AddedToken state
      
      * style and quality
      
      * do not raise an error on additional or unused kwargs for tokenize() but only a warning
      
      * transfo-xl pretrained model requires torch
      
      * Update src/transformers/tokenization_utils.py
      Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
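The whitespace control described for AddedToken above can be illustrated with a minimal sketch. It does not use the tokenizers library: `SpecialTokenSketch` and `split_on_token` are hypothetical names introduced here, and the only assumption carried over from the commit message is that a special token can absorb the whitespace on its left (`lstrip`) or on its right (`rstrip`).

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SpecialTokenSketch:
    """Hypothetical stand-in for a special token with whitespace flags."""

    content: str
    lstrip: bool = False  # absorb whitespace to the left of the token
    rstrip: bool = False  # absorb whitespace to the right of the token


def split_on_token(text: str, token: SpecialTokenSketch) -> List[str]:
    """Split text around a special token, dropping whitespace per its flags."""
    pattern = re.escape(token.content)
    if token.lstrip:
        pattern = r"\s*" + pattern
    if token.rstrip:
        pattern = pattern + r"\s*"

    pieces = []
    # The capturing group keeps the matched token (plus absorbed spaces).
    for piece in re.split(f"({pattern})", text):
        if not piece:
            continue
        if re.fullmatch(pattern, piece):
            pieces.append(token.content)  # emit the bare token
        else:
            pieces.append(piece)  # ordinary text, left untouched
    return pieces


if __name__ == "__main__":
    mask = SpecialTokenSketch("<mask>", lstrip=True)
    print(split_on_token("Hello <mask> world", mask))
```

With `lstrip=True` the space before `<mask>` is absorbed while the space after it stays attached to the following text, which is the kind of behavior that matters for byte-level BPE vocabularies where a leading space is part of the token.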
  3. 15 Jun, 2020 1 commit
    • [HUGE] Refactoring tokenizers backend - padding - truncation - pre-tokenized... · 36434220
      Anthony MOI authored
      
      [HUGE] Refactoring tokenizers backend - padding - truncation - pre-tokenized pipeline - fast tokenizers - tests (#4510)
      
      * Use tokenizers pre-tokenized pipeline
      
      * failing pretokenized test
      
      * Fix is_pretokenized in python
      
      * add pretokenized tests
      
      * style and quality
      
      * better tests for batched pretokenized inputs
      
      * tokenizers clean up - new padding_strategy - split the files
      
      * [HUGE] refactoring tokenizers - padding - truncation - tests
      
      * style and quality
      
      * bump up required tokenizers version to 0.8.0-rc1
      
      * switched padding/truncation API - simpler better backward compat
      
      * updating tests for custom tokenizers
      
      * style and quality - tests on pad
      
      * fix QA pipeline
      
      * fix backward compatibility for max_length only
      
      * style and quality
      
      * Various cleanups - add verbose
      
      * fix tests
      
      * update docstrings
      
      * Fix tests
      
      * Docs reformatted
      
      * __call__ method documented
      Co-authored-by: Thomas Wolf <thomwolf@users.noreply.github.com>
      Co-authored-by: Lysandre <lysandre.debut@reseau.eseo.fr>
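To make the padding and truncation split in this commit concrete, here is a simplified sketch of the idea in plain Python. It is not the library's implementation: `pad_and_truncate` is a hypothetical helper, the strategy names ("longest", "max_length", "do_not_pad") are assumed for illustration, and only right-side padding of token id lists is handled.

```python
from typing import Dict, List, Optional


def pad_and_truncate(
    batch: List[List[int]],
    padding: str = "longest",
    truncation: bool = False,
    max_length: Optional[int] = None,
    pad_id: int = 0,
) -> Dict[str, List[List[int]]]:
    """Truncate first, then pad, treating the two as independent choices."""
    if truncation:
        if max_length is None:
            raise ValueError("truncation requires max_length")
        batch = [ids[:max_length] for ids in batch]

    if padding == "longest":
        target = max((len(ids) for ids in batch), default=0)
    elif padding == "max_length":
        if max_length is None:
            raise ValueError("padding='max_length' requires max_length")
        target = max_length
    elif padding == "do_not_pad":
        target = None
    else:
        raise ValueError(f"unknown padding strategy: {padding!r}")

    input_ids, attention_mask = [], []
    for ids in batch:
        missing = 0 if target is None else max(target - len(ids), 0)
        input_ids.append(list(ids) + [pad_id] * missing)
        # 1 marks real tokens, 0 marks padding positions.
        attention_mask.append([1] * len(ids) + [0] * missing)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


if __name__ == "__main__":
    print(pad_and_truncate([[1, 2, 3], [4]]))
    print(pad_and_truncate([[1, 2, 3], [4]], padding="max_length",
                           truncation=True, max_length=2))
```

Keeping the two decisions separate means a caller can pad to the longest sequence in a batch without truncating anything, or truncate to a limit without padding at all.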
  4. 20 May, 2020 1 commit
  5. 13 Feb, 2020 1 commit
    • Preserve spaces in GPT-2 tokenizers (#2778) · f1e8a51f
      Joe Davison authored
      * Preserve spaces in GPT-2 tokenizers
      
      Preserves spaces after special tokens in GPT-2 and inherited (RoBERTa)
      tokenizers, enabling correct BPE encoding. Automatically inserts a space
      in front of the first token in the encode function when adding special tokens.
      
      * Add tokenization preprocessing method
      
      * Add framework argument to pipeline factory
      
      Also fixes a pipeline test issue. Each test input is now treated as a
      distinct sequence.
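Why a leading space matters for this family of tokenizers can be shown with a small sketch. The regular expression below is a deliberately simplified stand-in for the real GPT-2 pre-tokenization pattern, and `pre_tokenize` with its `add_prefix_space` parameter is a hypothetical helper written for this illustration only.

```python
import re
from typing import List

# Simplified assumption: a word or punctuation run may carry one leading
# space, so " world" and "world" are different pieces for the BPE step.
_PIECE = re.compile(r" ?\w+| ?[^\w\s]+|\s+")


def pre_tokenize(text: str, add_prefix_space: bool = False) -> List[str]:
    """Split text into pieces, optionally inserting a space before the first."""
    if add_prefix_space and not text.startswith(" "):
        text = " " + text
    return _PIECE.findall(text)


if __name__ == "__main__":
    print(pre_tokenize("Hello world"))
    print(pre_tokenize("Hello world", add_prefix_space=True))
```

Without the prefix space the first word is split off as "Hello", while the same word in the middle of a sentence would appear as " Hello"; inserting the space makes the first word be encoded the same way as any other word.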
  6. 15 Jan, 2020 1 commit
  7. 06 Jan, 2020 2 commits
  8. 22 Dec, 2019 8 commits
  9. 21 Dec, 2019 1 commit
    • Reformat source code with black. · fa84ae26
      Aymeric Augustin authored
      This is the result of:
      
          $ black --line-length 119 examples templates transformers utils hubconf.py setup.py
      
      There are a lot of fairly long lines in the project. As a consequence, I'm
      picking the longest widely accepted line length, 119 characters.
      
      This is also Thomas' preference, because it allows for explicit variable
      names, to make the code easier to understand.
  10. 06 Dec, 2019 1 commit
    • Remove dependency on pytest for running tests (#2055) · 35401fe5
      Aymeric Augustin authored
      * Switch to plain unittest for skipping slow tests.
      
      Add a RUN_SLOW environment variable for running them.
      
      * Switch to plain unittest for PyTorch dependency.
      
      * Switch to plain unittest for TensorFlow dependency.
      
      * Avoid leaking open files in the test suite.
      
      This prevents spurious warnings when running tests.
      
      * Fix unicode warning on Python 2 when running tests.
      
      The warning was:
      
          UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
      
      * Support running PyTorch tests on a GPU.
      
      Reverts 27e015bd.
      
      * Tests no longer require pytest.
      
      * Make tests pass on cuda
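The switch to plain unittest for skipping slow tests can be sketched as follows. This is an illustration under assumptions, not the project's actual test utilities: the helper names `parse_flag_from_env` and `slow`, and the set of values accepted as "true", are chosen here for the example. Only the RUN_SLOW variable name comes from the commit message.

```python
import os
import unittest


def parse_flag_from_env(name: str, default: bool = False) -> bool:
    """Read a boolean flag from the environment (assumed truthy spellings)."""
    value = os.environ.get(name)
    if value is None:
        return default
    return value.strip().lower() in {"1", "true", "yes", "y", "on"}


def slow(test_case):
    """Skip the decorated test unless RUN_SLOW is set to a truthy value."""
    if not parse_flag_from_env("RUN_SLOW"):
        return unittest.skip("test is slow; set RUN_SLOW=1 to run it")(test_case)
    return test_case


class ExampleTest(unittest.TestCase):
    def test_fast(self):
        self.assertEqual(1 + 1, 2)

    @slow
    def test_slow(self):
        # Placeholder for an expensive check, e.g. loading a large model.
        self.assertTrue(True)


if __name__ == "__main__":
    unittest.main()
```

Running the file directly skips `test_slow`; running it with `RUN_SLOW=1` in the environment executes both tests. The flag is read when the decorator is applied, so it has to be set before the test module is imported.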
  11. 04 Nov, 2019 1 commit
  12. 22 Oct, 2019 1 commit
  13. 04 Oct, 2019 1 commit
  14. 26 Sep, 2019 2 commits
  15. 19 Sep, 2019 1 commit
  16. 30 Aug, 2019 5 commits
  17. 13 Aug, 2019 1 commit
  18. 12 Aug, 2019 1 commit
  19. 09 Aug, 2019 1 commit
  20. 08 Aug, 2019 1 commit
  21. 07 Aug, 2019 1 commit
  22. 05 Aug, 2019 2 commits