1. 10 May, 2023 1 commit
  2. 24 Apr, 2023 1 commit
  3. 03 Apr, 2023 1 commit
    • Fix llama tokenizer (#22402) · c0f99b4d
      Arthur authored
      * draft
      
      * update tokenization llama and conversion script
      
      * more updates
      
      * initial commit
      
      * style
      
      * default pad to None
      
      * draft tokenization tests
      
      * update test
      
      * update tokenization tests
      
      * nits
      
      * update
      
      * versioning test
      
      * major fix
      
      * fix more tests
      
      * finish fixing special masks
      
      * last nit
      
      * more nits
      
      * add encode decode tests
      
      * add more
      
      * fix token type ids
      
      * style
  4. 29 Mar, 2023 1 commit
  5. 09 Mar, 2023 1 commit
  6. 07 Feb, 2023 1 commit
    • Cleanup quality (#21493) · 67d07487
      Sylvain Gugger authored
      * Remove mentions of flake8/isort
      
      * Clean up inits
      
      * Deal with all other inits
      
      * Last special rule for dummy files
  7. 06 Feb, 2023 1 commit
    • Update quality tooling for formatting (#21480) · 6f79d264
      Sylvain Gugger authored
      * Result of black 23.1
      
      * Update target to Python 3.7
      
      * Switch flake8 to ruff
      
      * Configure isort
      
      * Configure isort
      
      * Apply isort with line limit
      
      * Put the right black version
      
      * adapt black in check copies
      
      * Fix copies
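The tooling switch in this commit (black 23.1, ruff replacing flake8, isort with a line limit) amounts to a few lines of packaging configuration. A hedged sketch of what such a `pyproject.toml` fragment can look like; the values here are illustrative, not the repo's exact settings:

```toml
[tool.black]
line-length = 119
target-version = ["py37"]

[tool.ruff]
# ruff takes over the lint checks previously run through flake8
line-length = 119

[tool.isort]
profile = "black"
line_length = 119
```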
  8. 02 Nov, 2022 1 commit
    • 🚨 🚨 🚨 Fix Issue 15003: SentencePiece Tokenizers Not Adding Special Tokens in... · 9f9ddcc2
      Ben Eyal authored
      🚨 🚨 🚨 Fix Issue 15003: SentencePiece Tokenizers Not Adding Special Tokens in `convert_tokens_to_string` (#15775)
      
      * Add test for SentencePiece not adding special tokens to strings
      
      * Add SentencePieceStringConversionMixin to fix issue 15003
      
      * Fix conversion from tokens to string for most SentencePiece tokenizers
      
      Tokenizers fixed:
      - AlbertTokenizer
      - BarthezTokenizer
      - CamembertTokenizer
      - FNetTokenizer
      - M2M100Tokenizer
      - MBart50Tokenizer
      - PegasusTokenizer
      - Speech2TextTokenizer
      
      * Fix MarianTokenizer, adjust SentencePiece test to accommodate vocab
      
      * Fix DebertaV2Tokenizer
      
      * Ignore LayoutXLMTokenizer in SentencePiece string conversion test
      
      * Run 'make style' and 'make quality'
      
      * Clean convert_tokens_to_string test
      
      Instead of explicitly ignoring LayoutXLMTokenizer in the test,
      override the test in LayoutLMTokenizationTest and do nothing in it.
      
      * Remove commented out code
      
      * Improve robustness of convert_tokens_to_string test
      
      Instead of comparing lengths of re-tokenized text and input_ids,
      check that converting all special tokens to string yields a string
      with all special tokens.
      
      * Inline and remove SentencePieceStringConversionMixin
      
      The convert_tokens_to_string method is now implemented
      in each relevant SentencePiece tokenizer.
      
      * Run 'make style' and 'make quality'
      
      * Revert removal of space in convert_tokens_to_string
      
      * Remove redundant import
      
      * Revert test text to original
      
      * Uncomment the lowercasing of the reverse_text variable
      
      * Mimic Rust tokenizer behavior for tokenizers
      
      - Albert
      - Barthez
      - Camembert
      - MBart50
      - T5
      
      * Fix accidentally skipping test in wrong tokenizer
      
      * Add test for equivalent Rust and slow tokenizer behavior
      
      * Override _decode in BigBirdTokenizer to mimic Rust behavior
      
      * Override _decode in FNetTokenizer to mimic Rust behavior
      
      * Override _decode in XLNetTokenizer to mimic Rust behavior
      
      * Remove unused 're' import
      
      * Update DebertaV2Tokenizer to mimic Rust tokenizer
      
      * Deberta tokenizer now behaves like Albert and its `convert_tokens_to_string` is not tested.
      
      * Ignore problematic tests in Deberta V2
      
      * Add comment on why the Deberta V2 tests are skipped
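The underlying issue is easy to state: a SentencePiece model only knows its own vocabulary, so decoding a token list naively silently drops special tokens such as `[CLS]` and `[SEP]`. A minimal, self-contained sketch of the splice-back pattern the fix applies; all names here are toy stand-ins, not the transformers implementation:

```python
# Toy stand-ins for a SentencePiece-backed tokenizer (illustrative only).
SPECIAL_TOKENS = {"[CLS]", "[SEP]", "[MASK]"}

def sp_decode(tokens):
    # stand-in for SentencePieceProcessor decoding: tokens outside the
    # model's vocabulary (here: all special tokens) are simply ignored,
    # and the "\u2581" word-boundary marker becomes a space
    return "".join(
        t for t in tokens if t not in SPECIAL_TOKENS
    ).replace("\u2581", " ").strip()

def convert_tokens_to_string(tokens):
    # the fixed pattern: decode runs of ordinary tokens with the
    # SentencePiece model, and splice special tokens back in verbatim
    out, current = [], []
    for tok in tokens:
        if tok in SPECIAL_TOKENS:
            if current:
                out.append(sp_decode(current))
                current = []
            out.append(tok)
        else:
            current.append(tok)
    if current:
        out.append(sp_decode(current))
    return " ".join(out)

tokens = ["[CLS]", "\u2581hello", "\u2581world", "[SEP]"]
print(sp_decode(tokens))                 # -> "hello world" (special tokens lost)
print(convert_tokens_to_string(tokens))  # -> "[CLS] hello world [SEP]"
```

The same splice logic is what lets the slow tokenizers match the Rust tokenizers' decode output in the tests above.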
  9. 25 Oct, 2022 1 commit
  10. 14 Oct, 2022 1 commit
  11. 27 Sep, 2022 1 commit
  12. 16 Sep, 2022 2 commits
  13. 15 Sep, 2022 1 commit
  14. 29 Aug, 2022 1 commit
  15. 24 Aug, 2022 1 commit
    • add warning to let the user know that the `__call__` method is faster than... · 6667b0d7
      SaulLu authored
      add warning to let the user know that the `__call__` method is faster than `encode` + `pad` for a fast tokenizer (#18693)
      
      * add warning to let the user know that the `encode` + `pad` method is slower than `__call__` for a fast tokenizer
      
      * user warnings
      
      * fix layoutlmv2
      
      * fix layout*
      
      * change warnings into logger.warning
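The behavioral point behind this warning: with a fast tokenizer, one batched `__call__` tokenizes and pads in a single pass, while `encode` followed by `pad` round-trips through Python per example. A toy illustration of the warn-and-nudge pattern; the class and its "tokenization" are stand-ins, not the transformers API:

```python
import logging

logger = logging.getLogger("tokenization_sketch")

class FastTokenizerSketch:
    # toy stand-in illustrating the advice added in #18693
    def __call__(self, texts, padding=False):
        # the recommended path: one call handles the whole batch
        encoded = [self.encode(t) for t in texts]
        return self._pad(encoded) if padding else encoded

    def encode(self, text):
        # placeholder "tokenization": one id per character
        return [ord(c) for c in text]

    def pad(self, encoded):
        # users calling encode() + pad() by hand get nudged toward
        # the single __call__, mirroring the logger.warning in the PR
        logger.warning(
            "You are calling encode() followed by pad(); for a fast "
            "tokenizer, calling the tokenizer on the whole batch "
            "(e.g. tokenizer(texts, padding=True)) is faster."
        )
        return self._pad(encoded)

    def _pad(self, encoded):
        width = max(len(ids) for ids in encoded)
        return [ids + [0] * (width - len(ids)) for ids in encoded]
```

Usage: `FastTokenizerSketch()(["ab", "a"], padding=True)` pads silently, while the manual `pad()` path logs the warning first.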
  16. 05 Aug, 2022 1 commit
    • Use new huggingface_hub tools for download models (#18438) · 5cd40323
      Sylvain Gugger authored
      * Draft new cached_file
      
      * Initial draft for config and model
      
      * Small fixes
      
      * Fix first batch of tests
      
      * Look in cache when internet is down
      
      * Fix last tests
      
      * Bad black, not fixing all quality errors
      
      * Make diff less
      
      * Implement change for TF and Flax models
      
      * Add tokenizer and feature extractor
      
      * For compatibility with main
      
      * Add utils to move the cache and auto-do it at first use.
      
      * Quality
      
      * Deal with empty commit shas
      
      * Deal with empty etag
      
      * Address review comments
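The "Look in cache when internet is down" bullet is the interesting design point: resolve from the network when possible, fall back to the local cache on connection errors, and raise only when neither source has the file. A minimal sketch of that fallback logic; `cached_file`, `fetch`, and the dict cache are hypothetical stand-ins, not the `huggingface_hub` API:

```python
def cached_file(filename, fetch, cache):
    """Return the file's content, preferring a fresh download.

    `fetch` is any callable that downloads and may raise OSError;
    `cache` is a dict-like local store (both hypothetical stand-ins).
    """
    try:
        content = fetch(filename)
    except OSError:
        # offline or the hub is unreachable: serve the cached copy, if any
        if filename in cache:
            return cache[filename]
        raise
    cache[filename] = content  # refresh the cache on a successful download
    return content
```

The same shape explains the "Deal with empty commit shas / empty etag" bullets: cache keys must stay valid even when the remote metadata used to build them is missing.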
  17. 01 Aug, 2022 1 commit
  18. 11 Jul, 2022 1 commit
  19. 23 Jun, 2022 1 commit
  20. 21 Jun, 2022 1 commit
  21. 31 May, 2022 1 commit
  22. 12 May, 2022 1 commit
  23. 13 Apr, 2022 1 commit
  24. 04 Apr, 2022 1 commit
  25. 23 Mar, 2022 1 commit
  26. 15 Feb, 2022 1 commit
  27. 02 Feb, 2022 2 commits
  28. 01 Feb, 2022 2 commits
    • fix the `tokenizer_config.json` file for the slow tokenizer when a fast... · 7b8bdd86
      SaulLu authored
      fix the `tokenizer_config.json` file for the slow tokenizer when a fast version is available (#15319)
      
      * add new test
      
      * update test
      
      * remove `tokenizer_file` from `additional_files_names` in `tokenization_utils_base.py`
      
      * add `tokenizer_file` for the fast only tokenizer
      
      * change global variables layoutxlm
      
      * remove `"tokenizer_file"` from DPR tokenizer's Global variables
      
      * remove `tokenizer_file` from herbert slow tokenizer init
      
      * `"tokenizer_file"` from LED tokenizer's Global variables
      
      * remove `tokenizer_file` from mbart slow tokenizer init
      
      * remove `tokenizer_file` from slow tokenizer template
      
      * adapt to versioning
      
      * adapt the `test_tokenizer_mismatch_warning` test
      
      * clean test
      
      * clarify `VOCAB_FILES_NAMES` in tokenization_utils_fast.py
      
      * Revert "remove `tokenizer_file` from mbart slow tokenizer init"
      
      This reverts commit 0dbb723fa9c7599d4640fe30b3647a74eb4a64e1.
      
      * Revert "`"tokenizer_file"` from LED tokenizer's Global variables"
      
      This reverts commit 5a3f879bdd651233f3d74a3d1146c34cde82b0c2.
      
      * Revert "remove `tokenizer_file` from herbert slow tokenizer init"
      
      This reverts commit f5e10007b7b0ec5345e015b9de7ffec72c5407fd.
      
      * Revert "remove `"tokenizer_file"` from DPR tokenizer's Global variables"
      
      This reverts commit da0895330bedfafc81ae3073470a9348c669f032.
      
      * set `tokenizer_file` in super `__init__` of mbart
    • replace assert with exception for padding_side arg in `PreTrainedTokenizerBase` `__init__` (#15454) · 6d585fe0
      SaulLu authored
      * replace assert with exception for `padding_side` arg in `PreTrainedTokenizerBase` `__init__`
      
      * add test
      
      * fix kwargs
      
      * reformat test
      
      * format
      
      * format
      
      * fix typo to render the documentation
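The motivation for this kind of change is that `assert` statements are stripped when Python runs with `-O`, so an invalid argument would slip through silently; a real `ValueError` survives optimization and gives the user an actionable message. A hedged sketch of the pattern as a free function, not the actual `__init__` code:

```python
def check_padding_side(padding_side):
    # before the fix (roughly): assert padding_side in ["right", "left"]
    # after: an explicit exception that survives `python -O`
    if padding_side not in ("right", "left"):
        raise ValueError(
            "Padding side should be selected between 'right' and 'left', "
            f"current value: {padding_side}"
        )
    return padding_side
```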
  29. 27 Jan, 2022 1 commit
    • improve saving strategy of sentencepiece tokenizer (#15328) · ade7371a
      SaulLu authored
      
      
      * add new test
      
      * add a feature to save the sentencepiece tokenizer model when the init file was deleted
      
      * update marian
      
      * update m2m_100
      
      * fix marian
      
      * update speech to text
      
      * override test for layoutxlm
      
      * fix saving bartpho
      
      * remove hardcoded values bartpho
      
      * special token string version
      
      * finish bartpho
      
      * override layoutxlm test
      
      * add mbart
      
      * move special tokens list
      
      * format
      
      * Revert "format"
      
      This reverts commit 37a40df37903a932c2f951cbd33acb684246bae7.
      
      * simplify list of string of special tokens
      
      * Re-write `self.fairseq_tokens_to_ids ` initialization logic with special tokens
      Co-authored-by: Sylvain Gugger <sylvain.gugger@gmail.com>
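The saving-strategy improvement can be pictured as a two-branch save: copy the original SentencePiece `.model` file when it is still on disk, otherwise rebuild it from the serialized model bytes held in memory (which is what "save the tokenizer model when the init file was deleted" amounts to). A self-contained sketch under that assumption; the file name and helper are illustrative:

```python
import os
import shutil

def save_sentencepiece_model(save_directory, vocab_file, serialized_proto):
    # copy the original file when it still exists; if it was deleted
    # (e.g. it lived in a temporary directory), rewrite it from the
    # in-memory serialized model bytes instead of failing
    out_path = os.path.join(save_directory, "spiece.model")
    if os.path.isfile(vocab_file):
        shutil.copyfile(vocab_file, out_path)
    else:
        with open(out_path, "wb") as f:
            f.write(serialized_proto)
    return out_path
```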
  30. 06 Jan, 2022 1 commit
  31. 03 Jan, 2022 1 commit
  32. 30 Dec, 2021 1 commit
  33. 03 Dec, 2021 1 commit
    • Improve tokenizer tests (#13594) · 66ea7391
      Li-Huai (Allan) Lin authored
      * Use new method to acquire tokenizers
      
      * Resolve TODOs.
      
      * Style
      
      * Fix
      
      * Enable do_lower_case in test_tokenize_special_tokens
      
      * Apply suggestion from code review
      
      * Fix mask token handling
      
      * Revert "Fix mask token handling"
      
      This reverts commit daaa3f5291b1f71e5bc3604ca281c000000c4648.
      
      * Fix FNet mask token tokenization
      
      * Complete everything
      
      * Apply suggestions from code review
  34. 10 Nov, 2021 1 commit
  35. 08 Nov, 2021 1 commit
  36. 02 Nov, 2021 1 commit
  37. 11 Oct, 2021 1 commit