1. 15 Jun, 2020 1 commit
    • [HUGE] Refactoring tokenizers backend - padding - truncation - pre-tokenized... · 36434220
      Anthony MOI authored
      
      [HUGE] Refactoring tokenizers backend - padding - truncation - pre-tokenized pipeline - fast tokenizers - tests (#4510)
      
      * Use tokenizers pre-tokenized pipeline
      
      * failing pretokenized test
      
      * Fix is_pretokenized in python
      
      * add pretokenized tests
      
      * style and quality
      
      * better tests for batched pretokenized inputs
      
      * tokenizers clean up - new padding_strategy - split the files
      
      * [HUGE] refactoring tokenizers - padding - truncation - tests
      
      * style and quality
      
      * bump up required tokenizers version to 0.8.0-rc1
      
      * switched padding/truncation API - simpler, better backward compat (see the sketch after this entry)
      
      * updating tests for custom tokenizers
      
      * style and quality - tests on pad
      
      * fix QA pipeline
      
      * fix backward compatibility for max_length only
      
      * style and quality
      
      * Various clean-ups - add verbose
      
      * fix tests
      
      * update docstrings
      
      * Fix tests
      
      * Docs reformatted
      
      * __call__ method documented
      Co-authored-by: Thomas Wolf <thomwolf@users.noreply.github.com>
      Co-authored-by: Lysandre <lysandre.debut@reseau.eseo.fr>
      36434220
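      A minimal sketch of the padding/truncation API this PR introduces (assuming transformers v3.x; the model name is illustrative, and is_pretokenized was the flag name at the time of this PR):

          from transformers import AutoTokenizer

          tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

          # Unified __call__ entry point: padding and truncation are plain
          # keyword arguments rather than the older pad_to_max_length flag.
          batch = tokenizer(
              ["a short sentence", "a slightly longer sentence to pad against"],
              padding=True,        # pad to the longest sequence in the batch
              truncation=True,     # truncate down to max_length
              max_length=16,
              return_tensors="pt",
          )

          # Pre-tokenized input (lists of words) flows through the same pipeline.
          batch = tokenizer([["a", "short", "sentence"]], is_pretokenized=True)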
  2. 04 Jun, 2020 1 commit
    • Introduce a new tensor type for return_tensors on tokenizer for NumPy (#4585) · 5bf9afbf
      Funtowicz Morgan authored
      * Refactor tensor creation in tokenizers.
      
      * Make sure to convert string to TensorType
      
      * Refactor convert_to_tensors_
      
      * Introduce numpy tensor creation (sketched after this entry)
      
      * Format
      
      * Add unittest for TensorType creation from str
      
      * sorting imports
      
      * Added unittests for numpy tensor conversion.
      
      * Do not use the in-place version of squeeze, as numpy doesn't provide such a feature.
      
      * Added extra parameter prepend_batch_axis: bool on prepare_for_model.
      
      * Ensure test_np_encode_plus_sent_to_model is not executed for encoder/decoder models.
      
      * style.
      
      * numpy tests use require_torch for now while flax is not merged.
      
      * Hopefully will make flake8 happy.
      
      * One more time 🎶
      5bf9afbf
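      A short, hedged sketch of the NumPy tensor type added in #4585 (TensorType's import path may differ across versions):

          from transformers import BertTokenizer
          from transformers import TensorType  # import path may vary by version

          # TensorType can be constructed from its string alias ...
          assert TensorType("np") is TensorType.NUMPY

          # ... and return_tensors="np" yields numpy arrays instead of framework tensors.
          tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
          enc = tokenizer.encode_plus("Hello world", return_tensors="np")
          print(type(enc["input_ids"]))  # <class 'numpy.ndarray'>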
  3. 19 May, 2020 1 commit
  4. 14 May, 2020 1 commit
  5. 01 May, 2020 1 commit
  6. 09 Apr, 2020 1 commit
  7. 08 Apr, 2020 1 commit
  8. 06 Apr, 2020 1 commit
    • Tokenizers v3.0.0 (#3185) · 96ab75b8
      Funtowicz Morgan authored
      
      
      * Renamed num_added_tokens to num_special_tokens_to_add
      Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
      
      * Cherry-Pick: Partially fix space-only input without special tokens added to the output #3091
      Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
      
      * Added property is_fast on PretrainedTokenizer and PretrainedTokenizerFast
      Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
      
      * Make fast tokenizers unittests work on Windows.
      
      * Entirely refactored unittests for fast tokenizers.
      
      * Remove ABC class for CommonFastTokenizerTest
      
      * Added embeded_special_tokens tests from allenai @dirkgr
      
      * Make embeded_special_tokens tests from allenai more generic
      
      * Uniformize vocab_size as a property for both Fast and normal tokenizers
      
      * Move special tokens handling out of PretrainedTokenizer (SpecialTokensMixin)
      
      * Ensure providing None input raises the same ValueError as the Python tokenizer + tests.
      
      * Fix invalid input for assert_padding when testing batch_encode_plus
      
      * Move add_special_tokens from constructor to tokenize/encode/[batch_]encode_plus methods parameter.
      
      * Ensure tokenize() correctly forwards add_special_tokens to Rust.
      
      * Adding None checking on top of encode / encode_batch for TransfoXLTokenizerFast.
      Avoid stripping on None values.
      
      * unittests ensure tokenize() also throws a ValueError if provided None
      
      * Added add_special_tokens unittest for all supported models.
      
      * Style
      
      * Make sure TransfoXL tests run only if PyTorch is available.
      
      * Split up tokenizers tests for each model type.
      
      * Fix invalid unittest with new tokenizers API.
      
      * Filter out Roberta openai detector models from unittests.
      
      * Introduce BatchEncoding on fast tokenizers path (sketched after this entry).
      
      This new structure exposes all the mappings retrieved from Rust.
      It also keeps the current behavior with model forward.
      
      * Introduce BatchEncoding on slow tokenizers path.
      
      Backward compatibility.
      
      * Improve error message on BatchEncoding for slow path
      
      * Make add_prefix_space True by default on Roberta fast to match Python in the majority of cases.
      
      * Style and format.
      
      * Added typing on all methods for PretrainedTokenizerFast
      
      * Style and format
      
      * Added path for feeding pretokenized (List[str]) input to PretrainedTokenizerFast.
      
      * Style and format
      
      * encode_plus now supports pretokenized inputs.
      
      * Remove user warning about add_special_tokens when working on pretokenized inputs.
      
      * Always go through the post processor.
      
      * Added support for pretokenized input pairs on encode_plus
      
      * Added is_pretokenized flag on encode_plus for clarity and improved error message on input TypeError.
      
      * Added pretokenized inputs support on batch_encode_plus
      
      * Update BatchEncoding methods name to match Encoding.
      
      * Bump setup.py tokenizers dependency to 0.7.0rc1
      
      * Remove unused parameters in BertTokenizerFast
      
      * Make sure Roberta returns token_type_ids for unittests.
      
      * Added missing typings
      
      * Update add_tokens prototype to match tokenizers side and allow AddedToken
      
      * Bumping tokenizers to 0.7.0rc2
      
      * Added documentation for BatchEncoding
      
      * Added (unused) is_pretokenized parameter on PreTrainedTokenizer encode_plus/batch_encode_plus methods.
      
      * Added higher-level typing for tokenize / encode_plus / batch_encode_plus.
      
      * Fix unittests failing because add_special_tokens was defined as a constructor parameter on Rust Tokenizers.
      
      * Fix text-classification pipeline using the wrong tokenizer
      
      * Make pipelines work with BatchEncoding
      
      * Turn off add_special_tokens on tokenize by default.
      Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
      
      * Remove add_prefix_space from tokenize call in unittest.
      Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
      
      * Style and quality
      Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
      
      * Correct message for batch_encode_plus None input exception.
      Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
      
      * Fix invalid list comprehension for offset_mapping overriding content every iteration.
      Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
      
      * TransfoXL uses Strip normalizer.
      Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
      
      * Bump tokenizers dependency to 0.7.0rc3
      Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
      
      * Support AddedTokens for special_tokens and use left stripping on mask for Roberta.
      Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
      
      * SpecialTokensMixin can use slots for faster access to underlying attributes.
      Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
      
      * Remove update_special_tokens from fast tokenizers.
      
      * Ensure TransfoXL unittests are run only when torch is available.
      
      * Style.
      Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
      
      * Style
      
      * Style 🙏🙏
      
      
      
      * Remove slots on SpecialTokensMixin, needs a deep dive into the pickle protocol.
      
      * Remove Roberta warning on __init__.
      
      * Move documentation to Google style.
      Co-authored-by: LysandreJik <lysandre.debut@reseau.eseo.fr>
      96ab75b8
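      A brief sketch of the BatchEncoding behavior described above (hedged; the extra mappings shown require the Rust backend of a fast tokenizer):

          from transformers import BertTokenizerFast

          tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
          enc = tokenizer.encode_plus("Hello world", return_offsets_mapping=True)

          # BatchEncoding is dict-like, so existing model-forward code is unchanged:
          input_ids = enc["input_ids"]

          # It also exposes the extra mappings computed by the Rust backend,
          # e.g. per-token character spans (fast tokenizers only):
          print(enc["offset_mapping"])
          print(enc.tokens())  # token strings, mirroring Rust's Encoding methods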
  9. 09 Mar, 2020 1 commit
    • Skipping outputs (#3116) · 5164ea91
      Lysandre Debut authored
      * Minimal example
      
      * Proposal 2
      
      * Proposal 2 for fast tokenizers
      
      * Typings
      
      * Docs
      
      * Revert "Docs" for easier review
      
      This reverts commit eaf0f97062e809887704a542144c537f769d5223.
      
      * Remove unnecessary assignments
      
      * Tests
      
      * Fix faulty type
      
      * Remove prints
      
      * return_outputs -> model_input_names
      
      * Revert "Revert "Docs" for easier review"
      
      This reverts commit 6fdc69408102bf695797f2dfddbb6350c6b9e722.
      
      * code quality
      5164ea91
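      The return_outputs -> model_input_names rename above boils down to each tokenizer declaring which inputs its model consumes, so encoding emits only those keys; a hedged sketch (exact lists may vary by version):

          from transformers import BertTokenizer, DistilBertTokenizer

          bert = BertTokenizer.from_pretrained("bert-base-uncased")
          print(bert.model_input_names)
          # e.g. ['input_ids', 'token_type_ids', 'attention_mask']

          distilbert = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")
          print(distilbert.model_input_names)
          # e.g. ['input_ids', 'attention_mask']  (DistilBERT takes no token_type_ids)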
  10. 02 Mar, 2020 1 commit
  11. 24 Feb, 2020 1 commit
  12. 20 Feb, 2020 1 commit
  13. 13 Feb, 2020 1 commit
    • Preserve spaces in GPT-2 tokenizers (#2778) · f1e8a51f
      Joe Davison authored
      * Preserve spaces in GPT-2 tokenizers
      
      Preserves spaces after special tokens in GPT-2 and inherited (RoBERTa)
      tokenizers, enabling correct BPE encoding. Automatically inserts a space
      in front of the first token in the encode function when adding special tokens.
      
      * Add tokenization preprocessing method
      
      * Add framework argument to pipeline factory
      
      Also fixes a pipeline test issue. Each test input is now treated as a
      distinct sequence.
      f1e8a51f
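      A small sketch of why the leading space matters for GPT-2's byte-level BPE (token strings are illustrative; Ġ marks an encoded leading space):

          from transformers import GPT2Tokenizer

          tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

          # Byte-level BPE folds a leading space into the token itself,
          # so "world" and " world" encode to different tokens.
          print(tokenizer.tokenize("world"))   # ['world']
          print(tokenizer.tokenize(" world"))  # ['Ġworld']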
  14. 29 Jan, 2020 2 commits
  15. 06 Jan, 2020 2 commits
  16. 24 Dec, 2019 1 commit
  17. 23 Dec, 2019 1 commit
  18. 22 Dec, 2019 7 commits
  19. 21 Dec, 2019 1 commit
    • Reformat source code with black. · fa84ae26
      Aymeric Augustin authored
      This is the result of:
      
          $ black --line-length 119 examples templates transformers utils hubconf.py setup.py
      
      There are a lot of fairly long lines in the project. As a consequence, I'm
      picking the longest widely accepted line length, 119 characters.
      
      This is also Thomas' preference, because it allows for explicit variable
      names, to make the code easier to understand.
      fa84ae26
  20. 20 Dec, 2019 2 commits
  21. 13 Dec, 2019 1 commit
  22. 06 Dec, 2019 2 commits
    • Fix bug which lowercases special tokens · 2670b0d6
      Michael Watkins authored
      2670b0d6
    • Remove dependency on pytest for running tests (#2055) · 35401fe5
      Aymeric Augustin authored
      * Switch to plain unittest for skipping slow tests.
      
      Add a RUN_SLOW environment variable for running them (see the sketch after this entry).
      
      * Switch to plain unittest for PyTorch dependency.
      
      * Switch to plain unittest for TensorFlow dependency.
      
      * Avoid leaking open files in the test suite.
      
      This prevents spurious warnings when running tests.
      
      * Fix unicode warning on Python 2 when running tests.
      
      The warning was:
      
          UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
      
      * Support running PyTorch tests on a GPU.
      
      Reverts 27e015bd.
      
      * Tests no longer require pytest.
      
      * Make tests pass on CUDA
      35401fe5
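      In spirit, the pytest-free skipping reduces to plain unittest decorators like the following (a hedged sketch; helper names match the commit's description, not necessarily the exact implementation):

          import os
          import unittest

          def slow(test_case):
              """Skip a test unless the RUN_SLOW environment variable is set."""
              return unittest.skipUnless(os.getenv("RUN_SLOW"), "test is slow")(test_case)

          def require_torch(test_case):
              """Skip a test when PyTorch is not installed."""
              try:
                  import torch  # noqa: F401
                  torch_available = True
              except ImportError:
                  torch_available = False
              return unittest.skipUnless(torch_available, "test requires PyTorch")(test_case)

          class ExampleTest(unittest.TestCase):
              @slow
              def test_full_pretrained_round_trip(self):
                  ...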
  23. 04 Dec, 2019 1 commit
  24. 22 Nov, 2019 2 commits
  25. 12 Nov, 2019 2 commits
    • Fix special tokens addition in decoder · 74d0bcb6
      Lysandre authored
      74d0bcb6
    • Consider do_lower_case in PreTrainedTokenizer · 7246d3c2
      Michael Watkins authored
      As pointed out in #1545, when using an uncased model and adding
      a new uncased token, the tokenizer does not correctly match the token
      when the input text contains it in a cased format.
      
      For instance, if we load bert-base-uncased into BertTokenizer and
      then use .add_tokens() to add "cool-token", we get the expected
      result for .tokenize('this is a cool-token'). However, we get a
      possibly unexpected result for .tokenize('this is a cOOl-Token'),
      which in fact mirrors the result the former gave before the new
      token was added.
      
      This commit adds
      - functionality to PreTrainedTokenizer to handle this
      situation when a tokenizer (currently Bert, DistilBert,
      and XLNet) has the do_lower_case=True kwarg, by:
          1) lowercasing tokens added with .add_tokens()
          2) lowercasing text at the beginning of .tokenize()
      - new common test case for tokenizers
      
      https://github.com/huggingface/transformers/issues/1545
      7246d3c2
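      A hedged sketch of the post-fix behavior described above (this lowercasing behavior is specific to the era of this commit and may differ in later versions):

          from transformers import BertTokenizer

          tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")  # do_lower_case=True

          # Added tokens are lowercased for uncased models ...
          tokenizer.add_tokens(["cool-token"])

          # ... and input text is lowercased before matching, so cased
          # occurrences of the added token now resolve to it as well:
          print(tokenizer.tokenize("this is a cool-token"))  # [..., 'cool-token']
          print(tokenizer.tokenize("this is a cOOl-Token"))  # [..., 'cool-token']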
  26. 04 Nov, 2019 1 commit
  27. 22 Oct, 2019 1 commit
  28. 04 Oct, 2019 1 commit