"examples/vscode:/vscode.git/clone" did not exist on "5011efbec81a7a1d094a2eda8bde2b74613ca8b8"
  1. 14 Oct, 2022 1 commit
  2. 27 Sep, 2022 1 commit
  3. 16 Sep, 2022 2 commits
  4. 15 Sep, 2022 1 commit
  5. 29 Aug, 2022 1 commit
  6. 24 Aug, 2022 1 commit
    • add warning to let the user know that the `__call__` method is faster than `encode` + `pad` for a fast tokenizer (#18693) · 6667b0d7
      SaulLu authored
      
      * add warning to let the user know that the `encode` + `pad` method is slower than `__call__` for a fast tokenizer
      
      * user warnings
      
      * fix layoutlmv2
      
      * fix layout*
      
      * change warnings into logger.warning
      6667b0d7
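      The behaviour this commit describes can be sketched as a small illustrative snippet. The function name and message below are hypothetical; the real warning lives in the `pad` path of `tokenization_utils_base.py` and its exact wording differs:

      ```python
      import logging

      logger = logging.getLogger("tokenization_sketch")

      def warn_if_pad_on_fast(is_fast: bool, verbose: bool = True) -> bool:
          """Sketch of the warning added in #18693: when `pad` is called on a
          fast tokenizer, tell the user that `__call__` is faster than
          `encode` + `pad`. Returns True if the warning was emitted."""
          if is_fast and verbose:
              logger.warning(
                  "You're using a fast tokenizer: calling it directly (`__call__`) "
                  "is faster than `encode` followed by `pad`."
              )
              return True
          return False
      ```

      Note the last bullet above: the commit deliberately uses `logger.warning` rather than `warnings.warn`, so the message goes through the library's logging configuration.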
  7. 05 Aug, 2022 1 commit
      Use new huggingface_hub tools for download models (#18438) · 5cd40323
      Sylvain Gugger authored
      * Draft new cached_file
      
      * Initial draft for config and model
      
      * Small fixes
      
      * Fix first batch of tests
      
      * Look in cache when internet is down
      
      * Fix last tests
      
      * Bad black, not fixing all quality errors
      
      * Make diff less
      
      * Implement change for TF and Flax models
      
      * Add tokenizer and feature extractor
      
      * For compatibility with main
      
      * Add utils to move the cache and auto-do it at first use.
      
      * Quality
      
      * Deal with empty commit shas
      
      * Deal with empty etag
      
      * Address review comments
      5cd40323
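      The "Look in cache when internet is down" step can be sketched with an in-memory stand-in. Everything here is hypothetical (the real `cached_file` works against an on-disk cache and the Hub API); `cache` is a dict standing in for the cache directory and `fetch` for the actual download:

      ```python
      def cached_file_sketch(cache, filename, fetch=None, offline=False):
          """Sketch of the offline-fallback behaviour described in #18438:
          serve a cached entry when offline, otherwise fetch and cache it."""
          if filename in cache:
              return cache[filename]
          if offline:
              raise OSError(f"{filename} is not cached and outgoing traffic is disabled")
          cache[filename] = fetch()  # download from the Hub in the real implementation
          return cache[filename]
      ```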
  8. 01 Aug, 2022 1 commit
  9. 11 Jul, 2022 1 commit
  10. 23 Jun, 2022 1 commit
  11. 21 Jun, 2022 1 commit
  12. 31 May, 2022 1 commit
  13. 12 May, 2022 1 commit
  14. 13 Apr, 2022 1 commit
  15. 04 Apr, 2022 1 commit
  16. 23 Mar, 2022 1 commit
  17. 15 Feb, 2022 1 commit
  18. 02 Feb, 2022 2 commits
  19. 01 Feb, 2022 2 commits
    • fix the `tokenizer_config.json` file for the slow tokenizer when a fast version is available (#15319) · 7b8bdd86
      SaulLu authored
      
      * add new test
      
      * update test
      
      * remove `tokenizer_file` from `additional_files_names` in `tokenization_utils_base.py`
      
      * add `tokenizer_file` for the fast only tokenizer
      
      * change global variables layoutxml
      
      * remove `"tokenizer_file"` from DPR tokenizer's Global variables
      
      * remove `tokenizer_file` from herbert slow tokenizer init
      
      * `"tokenizer_file"` from LED tokenizer's Global variables
      
      * remove `tokenizer_file` from mbart slow tokenizer init
      
      * remove `tokenizer_file` from slow tokenizer template
      
      * adapt to versioning
      
      * adapt the `test_tokenizer_mismatch_warning` test
      
      * clean test
      
      * clarify `VOCAB_FILES_NAMES` in tokenization_utils_fast.py
      
      * Revert "remove `tokenizer_file` from mbart slow tokenizer init"
      
      This reverts commit 0dbb723fa9c7599d4640fe30b3647a74eb4a64e1.
      
      * Revert "`"tokenizer_file"` from LED tokenizer's Global variables"
      
      This reverts commit 5a3f879bdd651233f3d74a3d1146c34cde82b0c2.
      
      * Revert "remove `tokenizer_file` from herbert slow tokenizer init"
      
      This reverts commit f5e10007b7b0ec5345e015b9de7ffec72c5407fd.
      
      * Revert "remove `"tokenizer_file"` from DPR tokenizer's Global variables"
      
      This reverts commit da0895330bedfafc81ae3073470a9348c669f032.
      
      * set `tokenizer_file` in super `__init__` of mbart
      7b8bdd86
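      The core of the fix can be sketched as follows. The function name is hypothetical; the point is that a slow tokenizer must not record a `tokenizer_file` entry in its saved `tokenizer_config.json`, since that file only exists for the fast version:

      ```python
      def tokenizer_config_to_save(config, is_fast):
          """Sketch of the fix in #15319: drop the `tokenizer_file` entry from
          the config a slow tokenizer saves, keeping the input dict untouched."""
          config = dict(config)
          if not is_fast:
              config.pop("tokenizer_file", None)
          return config
      ```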
      replace assert with exception for padding_side arg in `PreTrainedTokenizerBase` `__init__` (#15454) · 6d585fe0
      SaulLu authored
      * replace assert with exception for `padding_side` arg in `PreTrainedTokenizerBase` `__init__`
      
      * add test
      
      * fix kwargs
      
      * reformat test
      
      * format
      
      * format
      
      * fix typo to render the documentation
      6d585fe0
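      A minimal sketch of the change, with a hypothetical class name (the real code is in `PreTrainedTokenizerBase.__init__`). The motivation for replacing the `assert` is that asserts vanish under `python -O` and produce unhelpful errors:

      ```python
      class TokenizerInitSketch:
          """Sketch of #15454: validate `padding_side` with an explicit
          ValueError instead of a bare assert."""

          def __init__(self, padding_side="right"):
              if padding_side not in ("right", "left"):
                  raise ValueError(
                      f"Padding side should be 'right' or 'left', got: {padding_side}"
                  )
              self.padding_side = padding_side
      ```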
  20. 27 Jan, 2022 1 commit
      improve saving strategy of sentencepiece tokenizer (#15328) · ade7371a
      SaulLu authored
      * add new test
      
      * add a feature to save the sentencepiece tokenizer model when the init file was deleted
      
      * update marian
      
      * update m2m_100
      
      * fix marian
      
      * update speech to text
      
      * override test for layoutxlm
      
      * fix saving bartpho
      
      * remove hardcoded values bartpho
      
      * special token string version
      
      * finish bartpho
      
      * override layoutxml test
      
      * add mbart
      
      * move special tokens list
      
      * format
      
      * Revert "format"
      
      This reverts commit 37a40df37903a932c2f951cbd33acb684246bae7.
      
      * simplify list of string of special tokens
      
      * Re-write `self.fairseq_tokens_to_ids ` initialization logic with special tokens
      Co-authored-by: Sylvain Gugger <sylvain.gugger@gmail.com>
      ade7371a
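      The saving strategy this commit describes can be sketched like so. Names are hypothetical; in transformers the fallback uses the sentencepiece model's `serialized_model_proto()` held in memory, so saving still works after the original init file was deleted:

      ```python
      import os
      import shutil

      def save_sp_model_sketch(vocab_file, serialized_proto, out_path):
          """Sketch of #15328: copy the original sentencepiece file when it
          still exists, otherwise write the in-memory serialized model."""
          if vocab_file and os.path.isfile(vocab_file):
              shutil.copyfile(vocab_file, out_path)
          else:
              with open(out_path, "wb") as f:
                  f.write(serialized_proto)
          return out_path
      ```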
  21. 06 Jan, 2022 1 commit
  22. 03 Jan, 2022 1 commit
  23. 30 Dec, 2021 1 commit
  24. 03 Dec, 2021 1 commit
      Improve tokenizer tests (#13594) · 66ea7391
      Li-Huai (Allan) Lin authored
      * Use new method to acquire tokenizers
      
      * Resolve TODOs.
      
      * Style
      
      * Fix
      
      * Enable do_lower_case in test_tokenize_special_tokens
      
      * Apply suggestion from code review
      
      * Fix mask token handling
      
      * Revert "Fix mask token handling"
      
      This reverts commit daaa3f5291b1f71e5bc3604ca281c000000c4648.
      
      * Fix FNet mask token tokenization
      
      * Complete everything
      
      * Apply suggestions from code review
      66ea7391
  25. 10 Nov, 2021 1 commit
  26. 08 Nov, 2021 1 commit
  27. 02 Nov, 2021 1 commit
  28. 11 Oct, 2021 1 commit
  29. 08 Oct, 2021 1 commit
  30. 05 Oct, 2021 1 commit
  31. 17 Sep, 2021 1 commit
  32. 09 Sep, 2021 1 commit
  33. 02 Sep, 2021 1 commit
      Correct order of overflowing_tokens for slow tokenizer (#13179) · b91e65af
      Apoorv Garg authored
      * correct order of overflowing_tokens for slow tokenizer (issue fix #13148)
      
      * python 3.9 requires sentencepiece version 0.1.94 or above
      
      * slicing of ids fixed in truncated_sequence()
      
      * Update setup.py
      
      * Correct order of overflowing tokens for pair of sentences
      
      * code reformatted
      
      * Update tokenization_utils_base.py
      
      * reformatting file
      
      * test to check single_input added
      
      * missing function restored
      
      * test to check pair_input overflowing tokens order
      
      * test to check pair_input overflowing tokens order
      
      * test to check pair_input overflowing tokens order
      
      * added an error message for pair of seq and longest_first strategy
      
      * test for pair_input modified
      
      * variable name corrected
      
      * fixed a typo in error message
      
      * requested changes implemented
      
      * required test added
      
      * Corrected the message to match test message
      
      * added error message for Luke Tokenizer
      
      * lost test recovered
      
      * docstring for truncate_sequences and prepare_for_model updated
      
      * docstring for luke tokenizer updated
      
      * updated ENCODE_PLUS_ADDITIONAL_KWARGS_DOCSTRING
      
      * aligned text and fixed punctuation
      
      * improved style and quality of code
      
      * fixed error_msg in truncate_sequences
      
      * replaced encode_plus method with regular call method
      
      * clean up
      
      * rephrased the docstring
      b91e65af
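      The ordering fix can be sketched for a single sequence under an `only_first`-style truncation strategy. This is a hypothetical standalone helper, not the real `truncate_sequences`: the kept ids are the prefix, and the overflowing ids come back in their original reading order (including `stride` tokens of overlap) instead of reversed, which was the slow-tokenizer bug:

      ```python
      def truncate_sequence_sketch(ids, num_tokens_to_remove, stride=0):
          """Sketch of the #13179 fix: overflowing tokens preserve the
          original left-to-right order of the input ids."""
          if num_tokens_to_remove <= 0:
              return ids, []
          # overlap window: the removed tail plus `stride` tokens of context
          window_len = min(len(ids), stride + num_tokens_to_remove)
          overflowing = ids[-window_len:]
          truncated = ids[:-num_tokens_to_remove]
          return truncated, overflowing
      ```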
  34. 01 Sep, 2021 1 commit
  35. 23 Aug, 2021 1 commit
    • Change how "additional_special_tokens" argument in the ".from_pretrained" method of the tokenizer is taken into account (#13056) · 7223844d
      SaulLu authored
      
      * add test
      
      * add change in PretrainedTokenizerBase
      
      * change Luke
      
      * deactivate
      
      * add the possibility to add additional special tokens for M2M100
      
      * format
      
      * add special test for canine
      
      * proposed changes for mbart
      
      * proposed changes for mbart50
      
      * proposed changes for byt5
      
      * proposed changes for canine
      
      * proposed changes for t5
      
      * test fast and slow
      
      * remove comment
      
      * remove comment
      
      * add fast version for all tests
      
      * replace break by continue
      
      * add more comments
      
      * add check to avoid duplicates
      
      * remove comment
      
      * format
      
      * proposed change for wav2vec2
      
      * reverse changes mbart
      
      * uncomment
      
      * format
      7223844d
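      The merging behaviour described above, including the "check to avoid duplicates" bullet, can be sketched with a hypothetical helper (the real logic sits inside `from_pretrained` and works on `AddedToken` objects as well as strings):

      ```python
      def merge_additional_special_tokens(current, extra):
          """Sketch of #13056: tokens passed via
          `from_pretrained(..., additional_special_tokens=...)` are merged
          with the configured ones, skipping duplicates, order preserved."""
          merged = list(current)
          for token in extra:
              if token not in merged:
                  merged.append(token)
          return merged
      ```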
  36. 17 Jul, 2021 1 commit
  37. 16 Jul, 2021 1 commit