  1. 23 Jun, 2022 1 commit
  2. 21 Jun, 2022 1 commit
  3. 31 May, 2022 1 commit
  4. 12 May, 2022 1 commit
  5. 13 Apr, 2022 1 commit
  6. 04 Apr, 2022 1 commit
  7. 23 Mar, 2022 1 commit
  8. 15 Feb, 2022 1 commit
  9. 02 Feb, 2022 2 commits
  10. 01 Feb, 2022 2 commits
    • fix the `tokenizer_config.json` file for the slow tokenizer when a fast version is available (#15319) · 7b8bdd86
      SaulLu authored
      
      * add new test
      
      * update test
      
      * remove `tokenizer_file` from `additional_files_names` in `tokenization_utils_base.py`
      
      * add `tokenizer_file` for the fast only tokenizer
      
      * change global variables layoutxlm
      
      * remove `"tokenizer_file"` from DPR tokenizer's Global variables
      
      * remove `tokenizer_file` from herbert slow tokenizer init
      
      * `"tokenizer_file"` from LED tokenizer's Global variables
      
      * remove `tokenizer_file` from mbart slow tokenizer init
      
      * remove `tokenizer_file` from slow tokenizer template
      
      * adapt to versioning
      
      * adapt the `test_tokenizer_mismatch_warning` test
      
      * clean test
      
      * clarify `VOCAB_FILES_NAMES` in tokenization_utils_fast.py
      
      * Revert "remove `tokenizer_file` from mbart slow tokenizer init"
      
      This reverts commit 0dbb723fa9c7599d4640fe30b3647a74eb4a64e1.
      
      * Revert "`"tokenizer_file"` from LED tokenizer's Global variables"
      
      This reverts commit 5a3f879bdd651233f3d74a3d1146c34cde82b0c2.
      
      * Revert "remove `tokenizer_file` from herbert slow tokenizer init"
      
      This reverts commit f5e10007b7b0ec5345e015b9de7ffec72c5407fd.
      
      * Revert "remove `"tokenizer_file"` from DPR tokenizer's Global variables"
      
      This reverts commit da0895330bedfafc81ae3073470a9348c669f032.
      
      * set `tokenizer_file` in super `__init__` of mbart
    • replace assert with exception for `padding_side` arg in `PreTrainedTokenizerBase` `__init__` (#15454) · 6d585fe0
      SaulLu authored
      * replace assert with exception for `padding_side` arg in `PreTrainedTokenizerBase` `__init__`
      
      * add test
      
      * fix kwargs
      
      * reformat test
      
      * format
      
      * format
      
      * fix typo to render the documentation
  11. 27 Jan, 2022 1 commit
    • improve saving strategy of sentencepiece tokenizer (#15328) · ade7371a
      SaulLu authored
      * add new test
      
      * add a feature to save the sentencepiece tokenizer model when the init file was deleted
      
      * update marian
      
      * update m2m_100
      
      * fix marian
      
      * update speech to text
      
      * override test for layoutxlm
      
      * fix saving bartpho
      
      * remove hardcoded values bartpho
      
      * special token string version
      
      * finish bartpho
      
      * override layoutxlm test
      
      * add mbart
      
      * move special tokens list
      
      * format
      
      * Revert "format"
      
      This reverts commit 37a40df37903a932c2f951cbd33acb684246bae7.
      
      * simplify list of string of special tokens
      
      * Re-write `self.fairseq_tokens_to_ids` initialization logic with special tokens
      Co-authored-by: Sylvain Gugger <sylvain.gugger@gmail.com>
  12. 06 Jan, 2022 1 commit
  13. 03 Jan, 2022 1 commit
  14. 30 Dec, 2021 1 commit
  15. 03 Dec, 2021 1 commit
    • Improve tokenizer tests (#13594) · 66ea7391
      Li-Huai (Allan) Lin authored
      * Use new method to acquire tokenizers
      
      * Resolve TODOs.
      
      * Style
      
      * Fix
      
      * Enable do_lower_case in test_tokenize_special_tokens
      
      * Apply suggestion from code review
      
      * Fix mask token handling
      
      * Revert "Fix mask token handling"
      
      This reverts commit daaa3f5291b1f71e5bc3604ca281c000000c4648.
      
      * Fix FNet mask token tokenization
      
      * Complete everything
      
      * Apply suggestions from code review
  16. 10 Nov, 2021 1 commit
  17. 08 Nov, 2021 1 commit
  18. 02 Nov, 2021 1 commit
  19. 11 Oct, 2021 1 commit
  20. 08 Oct, 2021 1 commit
  21. 05 Oct, 2021 1 commit
  22. 17 Sep, 2021 1 commit
  23. 09 Sep, 2021 1 commit
  24. 02 Sep, 2021 1 commit
    • Correct order of overflowing_tokens for slow tokenizer (#13179) · b91e65af
      Apoorv Garg authored
      * correct order of overflowing_tokens for slow tokenizer (issue fix #13148)
      
      * python 3.9 requires sentencepiece version 0.1.94 or above
      
      * slicing of ids fixed in truncated_sequence()
      
      * Update setup.py
      
      * Correct order of overflowing tokens for pair of sentences
      
      * code reformatted
      
      * Update tokenization_utils_base.py
      
      * reformatting file
      
      * test to check single_input added
      
      * missing function restored
      
      * test to check pair_input overflowing tokens order
      
      * test to check pair_input overflowing tokens order
      
      * test to check pair_input overflowing tokens order
      
      * added an error message for pair of seq and longest_first strategy
      
      * test for pair_input modified
      
      * variable name corrected
      
      * fixed a typo in error message
      
      * requested changes implemented
      
      * required test added
      
      * Corrected the message to match test message
      
      * added error message for Luke Tokenizer
      
      * lost test recovered
      
      * docstring for truncate_sequences and prepare_for_model updated
      
      * docstring for luke tokenizer updated
      
      * updated ENCODE_PLUS_ADDITIONAL_KWARGS_DOCSTRING
      
      * aligned text and fixed punctuation
      
      * improved style and quality of code
      
      * fixed error_msg in truncate_sequences
      
      * replaced encode_plus method with regular call method
      
      * clean up
      
      * rephrased the docstring
  25. 01 Sep, 2021 1 commit
  26. 23 Aug, 2021 1 commit
    • Change how "additional_special_tokens" argument in the ".from_pretrained" method of the tokenizer is taken into account (#13056) · 7223844d
      SaulLu authored
      
      * add test
      
      * add change in PreTrainedTokenizerBase
      
      * change Luke
      
      * deactivate
      
      * add the possibility to add additional special tokens for M2M100
      
      * format
      
      * add special test for canine
      
      * proposed changes for mbart
      
      * proposed changes for mbart50
      
      * proposed changes for byt5
      
      * proposed changes for canine
      
      * proposed changes for t5
      
      * test fast and slow
      
      * remove comment
      
      * remove comment
      
      * add fast version for all tests
      
      * replace break by continue
      
      * add more comments
      
      * add check to avoid duplicates
      
      * remove comment
      
      * format
      
      * proposed change for wav2vec2
      
      * reverse changes mbart
      
      * uncomment
      
      * format
  27. 17 Jul, 2021 1 commit
  28. 16 Jul, 2021 1 commit
  29. 01 Jul, 2021 1 commit
  30. 29 Jun, 2021 1 commit
  31. 23 Jun, 2021 1 commit
  32. 14 Jun, 2021 1 commit
  33. 07 Jun, 2021 1 commit
  34. 01 Jun, 2021 1 commit
    • Add regression tests for slow sentencepiece tokenizers. (#11737) · fcad8018
      Philip May authored
      * add test_vocab_size for sentencepiece tok.
      
      * add test_get_vocab for sentencepiece tok.
      
      * add test_convert_token_and_id for sentencepiece tok.
      
      * add test_tokenize_and_convert_tokens_to_string for all tok.
      
      * improve test_tokenize_and_convert_tokens_to_string for sp. tok.
      
      * add common tokenizer integration tests
      - for albert
      - for barthez
      
      * add tokenizer integration tests to bert gen.
      
      * add most tokenizer integration tests
      
      * fix camembert tokenizer integration test
      
      * add tokenizer integration test to marian
      
      * add tokenizer integration test to reformer
      
      * add typing and doc to tokenizer_integration_test_util
      
      * fix tokenizer integration test of reformer
      
      * improve test_sentencepiece_tokenize_and_convert_tokens_to_string
      
      * empty commit to trigger CI
      
      * fix tokenizer integration test of reformer
      
      * remove code not needed anymore
      
      * empty commit to trigger CI
      
      * empty commit to trigger CI
  35. 13 May, 2021 1 commit
    • Enable option for subword regularization in more tokenizers. (#11417) · 37ed3ab7
      Philip May authored
      * improve slow class tok usage at xlm rob
      
      * add subword regularization for barthez
      
      * improve barthez tok. test
      
      * fix tokenizer tests
      
      * add subword regularization for camembert
      
      * add subword regularization for deberta v2 tokenizer
      
      * add more doc to deberta v2 tokenizer
      
      * add subword regularization for speech to text tok.
      
      * fix sp_model_kwargs type in speech 2 text tok.
      
      * add subword regularization for M2M100 tok.
      
      * add more concrete type hints
      
      * fix tests for m2m100 and s2t tok.
      
      * add missing Any import
      
      * fix syntax error in m2m100 tok.
      
      * fix unpickle of m2m100 and s2t tok.
      
      * fix test of m2m100 and s2t tok.
      
      * improve unpickle of deberta v2 tok.
      
      * add test for pickle of barthez & camembert
      
      * fix pickle of barthez & camembert
      
      * add test for deberta v2 tok. pickle
      
      * fix m2m100 tok. pickle
      
      * fix s2t tok. pickle
      
      * add subword regularization to albert tok.
      
      * refactor subword reg. test into TokenizerTesterMixin
      
      improve albert tok. test
      
      remove sample argument from albert tok.
      
      check subword reg. using TokenizerTesterMixin
      
      improve tok. tests
      
      improve xlm roberta tok. tests
      
      improve xlm roberta tok. tests
      
      * add subword regularization for big bird t.
      
      * improve xlm roberta tok. test
      
      * add subword regularization for mbart50 tok.
      
      * add subword regularization for pegasus tok.
      
      * add subword regularization for reformer tok.
      
      * add subword regularization for T5 tok.
      
      * fix t5 tok. test formatting
      
      * add subword regularization for xlm_proph. tok.
      
      * add subword regularization for xlnet tok.
      
      * add subword regularization for bert_gen tok.
      
      * add typing to tokenizers
      
      * add typing to xlm rob. tok
      
      * add subword regularization for marian tok.
      
      * add reverse tok. test
      
      * fix marian tok test
      
      * fix marian tok test
      
      * fix casing in tok. tests
      
      * fix style of tok. common test
      
      * fix deberta v2 tok test
      
      * add type annotations to tok. tests
      
      * add type annotations to tok. __init__
      
      * add typing to tokenizer
      
      * add type annotations to tok. __init__
      
      * don't specify the default when it's None
      
      * fix barthez tok. doc
      
      * move sentencepiece tok. tests to TokenizerTesterMixin
      
      * fix unused imports
      
      * fix albert tok. test
      
      * add comment to sentencepiece test options
      
      * fix Any import at big bird tok.
      
      * fix Any import at xlm prophetnet tok.
      
      * empty commit to trigger CI
  36. 04 May, 2021 1 commit
  37. 26 Apr, 2021 2 commits