  1. 23 Mar, 2022 1 commit
    • Reorganize file utils (#16264) · 4975002d
      Sylvain Gugger authored
      * Split file_utils into several submodules
      
      * Fixes
      
      * Add back more objects
      
      * More fixes
      
      * Who exactly decided to import that from there?
      
      * Second suggestion from code review
      
      * Revert wrong move
      
      * Fix imports
      
      * Adapt all imports
      
      * Adapt all imports everywhere
      
      * Revert this import, will fix in a separate commit
      4975002d
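      Such a split typically keeps old imports working by turning the original module into a re-export shim. A minimal sketch of that pattern, assuming (not asserting) this is what the PR's `file_utils.py` looks like after the move:

      ```python
      # transformers/file_utils.py (after the split) -- hypothetical sketch, not
      # the PR's actual contents: the module survives purely as a re-export shim
      # so that `from transformers.file_utils import X` keeps working.
      from .utils import (
          cached_property,     # helper moved to the new utils subpackage
          is_torch_available,  # availability checks live there too
      )
      ```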
  2. 01 Feb, 2022 1 commit
    • fix the `tokenizer_config.json` file for the slow tokenizer when a fast version is available (#15319) · 7b8bdd86
      SaulLu authored
      
      * add new test
      
      * update test
      
      * remove `tokenizer_file` from `additional_files_names` in `tokenization_utils_base.py`
      
      * add `tokenizer_file` for the fast only tokenizer
      
      * change global variables layoutxlm
      
      * remove `"tokenizer_file"` from DPR tokenizer's Global variables
      
      * remove `tokenizer_file` from herbert slow tokenizer init
      
      * `"tokenizer_file"` from LED tokenizer's Global variables
      
      * remove `tokenizer_file` from mbart slow tokenizer init
      
      * remove `tokenizer_file` from slow tokenizer template
      
      * adapt to versioning
      
      * adapt the `test_tokenizer_mismatch_warning` test
      
      * clean test
      
      * clarify `VOCAB_FILES_NAMES` in tokenization_utils_fast.py
      
      * Revert "remove `tokenizer_file` from mbart slow tokenizer init"
      
      This reverts commit 0dbb723fa9c7599d4640fe30b3647a74eb4a64e1.
      
      * Revert "`"tokenizer_file"` from LED tokenizer's Global variables"
      
      This reverts commit 5a3f879bdd651233f3d74a3d1146c34cde82b0c2.
      
      * Revert "remove `tokenizer_file` from herbert slow tokenizer init"
      
      This reverts commit f5e10007b7b0ec5345e015b9de7ffec72c5407fd.
      
      * Revert "remove `"tokenizer_file"` from DPR tokenizer's Global variables"
      
      This reverts commit da0895330bedfafc81ae3073470a9348c669f032.
      
      * set `tokenizer_file` in super `__init__` of mbart
      7b8bdd86
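      The gist of the fix above: `tokenizer_file` should be resolved only by fast tokenizers, not listed among the extra files every slow tokenizer tries to fetch. A rough sketch of the file-name mappings involved; the constant names follow the `VOCAB_FILES_NAMES` convention mentioned in the commit, but the values are illustrative:

      ```python
      # Illustrative mappings only -- each real tokenizer declares its own files.
      SLOW_VOCAB_FILES_NAMES = {
          "vocab_file": "vocab.txt",  # slow tokenizer: just the vocabulary
      }
      FAST_VOCAB_FILES_NAMES = {
          "vocab_file": "vocab.txt",
          "tokenizer_file": "tokenizer.json",  # fast-only: serialized backend
      }
      ```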
  3. 30 Dec, 2021 1 commit
    • Enabling `tokenizers` upgrade. (#14941) · 08cb5718
      Nicolas Patry authored
      * Enabling `tokenizers` upgrade.
      
      * Moved ugly comment.
      
      * Tokenizers==0.11.1 needs an update to keep the borrow checker happy in highly contiguous calls.
      
      * Support both 0.11.1 and 0.11.0
      08cb5718
  4. 27 Dec, 2021 1 commit
    • Doc styler v2 (#14950) · 87e6e4fe
      Sylvain Gugger authored
      * New doc styler
      
      * Fix issue with args at the start
      
      * Code sample fixes
      
      * Style code examples in MDX
      
      * Fix more patterns
      
      * Typo
      
      * Typo
      
      * More patterns
      
      * Do without black for now
      
      * Get more info in error
      
      * Docstring style
      
      * Re-enable check
      
      * Quality
      
      * Fix add_end_docstring decorator
      
      * Fix docstring
      87e6e4fe
  5. 21 Dec, 2021 1 commit
    • Mass conversion of documentation from rst to Markdown (#14866) · 27b3031d
      Sylvain Gugger authored
      * Convert docstrings of all configurations and tokenizers
      
      * Processors and fixes
      
      * Last modeling files and fixes to models
      
      * Pipeline modules
      
      * Utils files
      
      * Data submodule
      
      * All the other files
      
      * Style
      
      * Missing examples
      
      * Style again
      
      * Fix copies
      
      * Say bye bye to rst docstrings forever
      27b3031d
  6. 09 Jul, 2021 2 commits
    • Simplify unk token (#12582) · 0cc2dc24
      Sylvain Gugger authored
      * Base test
      
      * More test
      
      * Fix mistake
      
      * Add a docstring change
      
      * Add doc ignore
      
      * Simplify logic for unk token in Unigram tokenizers
      
      * Remove changes from other branch
      0cc2dc24
    • This will reduce "Already borrowed error": (#12550) · cc12e1db
      Nicolas Patry authored
      * This will reduce "Already borrowed error":
      
      Original issue: https://github.com/huggingface/tokenizers/issues/537
      
      The original issue is caused by transformers calling mutable functions on the
      Rust tokenizers many times. Rust needs to guarantee that only one agent has a
      mutable reference to memory at a given time (for many reasons that don't need
      explaining here). Usually, the Rust compiler can prove that this property
      holds at compile time.
      
      Unfortunately, Python cannot provide that proof, so PyO3, the bridge between
      Rust and Python used by `tokenizers`, trades the compile-time guarantee for a
      dynamic one: if multiple agents try to take mutable borrows at the same time,
      the runtime yells "Already borrowed".
      
      The fix proposed here in transformers is simply to reduce the number of calls
      that actually need mutable borrows. By reducing them, we reduce the risk of
      running into the "Already borrowed" error. The caveat is that we now add a
      call to read the current configuration of the `_tokenizer`, so in the worst
      case we make 2 calls instead of 1, and in the best case we make 1 call plus a
      Python comparison of a dict (which should be negligible). A sketch of the
      pattern follows this entry.
      
      * Adding a test.
      
      * trivial error :(.
      
      * Update tests/test_tokenization_fast.py
      Co-authored-by: SaulLu <55560583+SaulLu@users.noreply.github.com>
      
      * Adding reference to original issues in the tests.
      
      * Update the tests with fast tokenizer.
      Co-authored-by: SaulLu <55560583+SaulLu@users.noreply.github.com>
      cc12e1db
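      A minimal sketch of the pattern described in the commit body, assuming the `tokenizers` API of the time (`truncation` as a readable property, `enable_truncation` as the mutating call); the helper below is illustrative, not the actual transformers code:

      ```python
      def set_truncation(backend_tokenizer, max_length: int, stride: int = 0):
          """Only take a mutable borrow when the config actually changes."""
          current = backend_tokenizer.truncation  # read-only look at the config
          target = {"max_length": max_length, "stride": stride}
          if current is None or any(current.get(k) != v for k, v in target.items()):
              # Only this call needs a mutable borrow on the Rust side.
              backend_tokenizer.enable_truncation(max_length, stride=stride)
          # Otherwise: nothing to do, and no risk of "Already borrowed".
      ```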
  7. 25 Feb, 2021 1 commit
    • [PretrainedFeatureExtractor] + Wav2Vec2FeatureExtractor, Wav2Vec2Processor, Wav2Vec2Tokenizer (#10324) · cb38ffcc
      Patrick von Platen authored
      
      * push to show
      
      * small improvement
      
      * small improvement
      
      * Update src/transformers/feature_extraction_utils.py
      
      * Update src/transformers/feature_extraction_utils.py
      
      * implement base
      
      * add common tests
      
      * make all tests pass for wav2vec2
      
      * make padding work & add more tests
      
      * finalize feature extractor utils
      
      * add call method to feature extraction
      
      * finalize feature processor
      
      * finish tokenizer
      
      * finish general processor design
      
      * finish tests
      
      * typo
      
      * remove bogus file
      
      * finish docstring
      
      * add docs
      
      * finish docs
      
      * small fix
      
      * correct docs
      
      * save intermediate
      
      * load changes
      
      * apply changes
      
      * apply changes to doc
      
      * change tests
      
      * apply Suraj's recommendation
      
      * final changes
      
      * Apply suggestions from code review
      
      * fix typo
      
      * fix import
      
      * correct docstring
      cb38ffcc
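      A short usage sketch of the design this commit introduces: one `Wav2Vec2Processor` wrapping the feature extractor (raw audio in) and the tokenizer (token ids out). The checkpoint name and silent-audio stand-in are illustrative:

      ```python
      from transformers import Wav2Vec2Processor

      processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")

      raw_audio = [0.0] * 16000  # one second of silent 16 kHz audio as a stand-in
      inputs = processor(raw_audio, sampling_rate=16000, return_tensors="pt")

      # After a forward pass of the model, the same processor decodes ids to text:
      # transcription = processor.batch_decode(predicted_ids)
      ```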
  8. 02 Dec, 2020 1 commit
    • Warning about too long input for fast tokenizers too (#8799) · a8c3f9aa
      Nicolas Patry authored
      * Warning about too long input for fast tokenizers too
      
      If truncation is not set in tokenizers, but the tokenization is too long
      for the model (`model_max_length`), we used to trigger a warning that the
      input would probably fail (which it most likely will).
      
      This PR re-enables the warning for fast tokenizers too and uses common
      code for the trigger to make sure it's consistent across both. A sketch of
      the check follows this entry.
      
      * Checking for pair of inputs too.
      
      * Making the function private and adding its doc.
      
      * Remove formatting ?? in odd place.
      
      * Missed uppercase.
      a8c3f9aa
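      A sketch of the shared check described above; the helper name and exact wording are assumptions, but this is the shape of the warning (no truncation requested, encoding longer than `model_max_length`):

      ```python
      import logging

      logger = logging.getLogger(__name__)

      def _warn_if_too_long(ids, model_max_length, truncation_requested):
          # Hypothetical helper: the real private function lives in
          # tokenization_utils_base and may differ in name and signature.
          if not truncation_requested and len(ids) > model_max_length:
              logger.warning(
                  "Token indices sequence length is longer than the specified "
                  f"maximum sequence length for this model ({len(ids)} > "
                  f"{model_max_length}). Running this sequence through the model "
                  "will result in indexing errors."
              )
      ```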
  9. 27 Nov, 2020 1 commit
    • Extend typing to path-like objects in `PretrainedConfig` and `PreTrainedModel` (#8770) · f9a2a9e3
      Giovanni Compagnoni authored
      * update configuration_utils.py typing to allow pathlike objects when sensible
      
      * update modeling_utils.py typing to allow pathlike objects when sensible
      
      * black
      
      * update tokenization_utils_base.py typing to allow pathlike objects when sensible
      
      * update tokenization_utils_fast.py typing to allow pathlike objects when sensible
      
      * update configuration_auto.py typing to allow pathlike objects when sensible
      
      * update configuration_auto.py docstring to allow pathlike objects when sensible
      
      * update tokenization_auto.py docstring to allow pathlike objects when sensible
      
      * black
      f9a2a9e3
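      The typing change in a nutshell: signatures move from `str` to `Union[str, os.PathLike]`. A self-contained sketch (the function is hypothetical, not the PR's exact signature):

      ```python
      import os
      from typing import Union

      def resolve_pretrained_path(name_or_path: Union[str, os.PathLike]) -> str:
          # os.fspath accepts str and anything implementing __fspath__,
          # e.g. pathlib.Path, so callers can pass either.
          return os.fspath(name_or_path)
      ```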
  10. 15 Nov, 2020 1 commit
    • [breaking|pipelines|tokenizers] Adding slow-fast tokenizers equivalence tests pipelines - Removing sentencepiece as a required dependency (#8073) · f4e04cd2
      Thomas Wolf authored
      
      * Fixing roberta for slow-fast tests
      
      * WIP getting equivalence on pipelines
      
      * slow-to-fast equivalence - working on question-answering pipeline
      
      * optional FAISS tests
      
      * Pipeline Q&A
      
      * Move pipeline tests to their own test job again
      
      * update tokenizer to add sequence id methods
      
      * update to tokenizers 0.9.4
      
      * set sentencepiece as optional
      
      * clean up squad
      
      * clean up pipelines to use sequence_ids
      
      * style/quality
      
      * wording
      
      * Switch to use_fast = True by default
      
      * update tests for use_fast at True by default
      
      * fix rag tokenizer test
      
      * removing protobuf from required dependencies
      
      * fix NER test for use_fast = True by default
      
      * fixing example tests (Q&A examples use slow tokenizers for now)
      
      * protobuf in main deps extras["sentencepiece"] and example deps
      
      * fix protobuf install test
      
      * try to fix seq2seq by switching to slow tokenizers for now
      
      * Update src/transformers/tokenization_utils_base.py
      Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
      
      * Update src/transformers/tokenization_utils_base.py
      Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
      Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
      f4e04cd2
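      Two user-visible pieces of this commit, sketched below: fast tokenizers become the default (`use_fast=True`), and pipelines rely on `sequence_ids()` to tell question tokens from context tokens. The model name is illustrative:

      ```python
      from transformers import AutoTokenizer

      # After this commit, AutoTokenizer returns the fast implementation by default.
      tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

      enc = tokenizer("Who wrote it?", "The book was written by Ada.")
      # 0 = first sequence, 1 = second sequence, None = special tokens.
      print(enc.sequence_ids())
      ```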
  11. 26 Oct, 2020 2 commits
    • Doc styling (#8067) · 08f534d2
      Sylvain Gugger authored
      * Important files
      
      * Styling them all
      
      * Revert "Styling them all"
      
      This reverts commit 7d029395fdae8513b8281cbc2a6c239f8093503e.
      
      * Styling them for realsies
      
      * Fix syntax error
      
      * Fix benchmark_utils
      
      * More fixes
      
      * Fix modeling auto and script
      
      * Remove new line
      
      * Fixes
      
      * More fixes
      
      * Fix more files
      
      * Style
      
      * Add FSMT
      
      * More fixes
      
      * More fixes
      
      * More fixes
      
      * More fixes
      
      * Fixes
      
      * More fixes
      
      * More fixes
      
      * Last fixes
      
      * Make sphinx happy
      08f534d2
    • [tokenizers] Fixing #8001 - Adding tests on tokenizers serialization (#8006) · 79eb3915
      Thomas Wolf authored
      * fixing #8001
      
      * make T5 tokenizer serialization more robust - style
      79eb3915
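      In the spirit of the serialization tests added here, a minimal save/reload round-trip check (the tokenizer choice and assertion are illustrative, not the PR's actual test):

      ```python
      import tempfile
      from transformers import T5Tokenizer

      def test_tokenizer_serialization_roundtrip():
          tok = T5Tokenizer.from_pretrained("t5-small")
          with tempfile.TemporaryDirectory() as tmp:
              tok.save_pretrained(tmp)           # write all tokenizer files
              reloaded = T5Tokenizer.from_pretrained(tmp)  # read them back
          assert tok.get_vocab() == reloaded.get_vocab()
      ```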
  12. 18 Oct, 2020 1 commit
    • [Dependencies|tokenizers] Make both SentencePiece and Tokenizers optional dependencies (#7659) · ba8c4d0a
      Thomas Wolf authored
      * splitting fast and slow tokenizers [WIP]
      
      * [WIP] splitting sentencepiece and tokenizers dependencies
      
      * update dummy objects
      
      * add name_or_path to models and tokenizers
      
      * prefix added to file names
      
      * prefix
      
      * styling + quality
      
      * splitting all the tokenizer files - sorting sentencepiece based ones
      
      * update tokenizer version up to 0.9.0
      
      * remove hard dependency on sentencepiece 🎉
      
      * and removed hard dependency on tokenizers 🎉
      
      * update conversion script
      
      * update missing models
      
      * fixing tests
      
      * move test_tokenization_fast to main tokenization tests - fix bugs
      
      * bump up tokenizers
      
      * fix bert_generation
      
      * update and fix several tokenizers
      
      * keep sentencepiece in deps for now
      
      * fix funnel and deberta tests
      
      * fix fsmt
      
      * fix marian tests
      
      * fix layoutlm
      
      * fix squeezebert and gpt2
      
      * fix T5 tokenization
      
      * fix xlnet tests
      
      * style
      
      * fix mbart
      
      * bump up tokenizers to 0.9.2
      
      * fix model tests
      
      * fix tf models
      
      * fix seq2seq examples
      
      * fix tests without sentencepiece
      
      * fix slow => fast conversion without sentencepiece
      
      * update auto and bert generation tests
      
      * fix mbart tests
      
      * fix auto and common test without tokenizers
      
      * fix tests without tokenizers
      
      * clean up and lighten tests when tokenizers + sentencepiece are both off
      
      * style quality and tests fixing
      
      * add sentencepiece to doc/examples reqs
      
      * leave sentencepiece on for now
      
      * style/quality, split herbert, and fix pegasus
      
      * WIP Herbert fast
      
      * add sample_text_no_unicode and fix herbert tokenization
      
      * skip FSMT example test for now
      
      * fix style
      
      * fix fsmt in example tests
      
      * update following Lysandre and Sylvain's comments
      
      * Update src/transformers/testing_utils.py
      Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
      
      * Update src/transformers/testing_utils.py
      Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
      
      * Update src/transformers/tokenization_utils_base.py
      Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
      
      * Update src/transformers/tokenization_utils_base.py
      Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
      Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
      ba8c4d0a
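      The optional-dependency pattern this PR moves to, sketched as a standalone helper (transformers exposes similar `is_*_available()` checks; this version is illustrative):

      ```python
      import importlib.util

      def is_sentencepiece_available() -> bool:
          # Probe for the package without importing it.
          return importlib.util.find_spec("sentencepiece") is not None

      if is_sentencepiece_available():
          import sentencepiece  # only imported when the extra is installed
      ```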
  13. 08 Oct, 2020 1 commit
    • Adding Fast tokenizers for SentencePiece based tokenizers - Breaking: remove Transfo-XL fast tokenizer (#7141) · 9aeacb58
      Thomas Wolf authored
      
      * [WIP] SP tokenizers
      
      * fixing tests for T5
      
      * WIP tokenizers
      
      * serialization
      
      * update T5
      
      * WIP T5 tokenization
      
      * slow to fast conversion script
      
      * Refactoring to move tokenizer implementations inside transformers
      
      * Adding gpt - refactoring - quality
      
      * WIP adding several tokenizers to the fast world
      
      * WIP Roberta - moving implementations
      
      * update to dev4; switch file loading to in-memory loading
      
      * Updating and fixing
      
      * advancing on the tokenizers - updating do_lower_case
      
      * style and quality
      
      * moving forward with tokenizers conversion and tests
      
      * MBart, T5
      
      * dumping the fast version of transformer XL
      
      * Adding to autotokenizers + style/quality
      
      * update init and space_between_special_tokens
      
      * style and quality
      
      * bump up tokenizers version
      
      * add protobuf
      
      * fix pickle Bert JP with Mecab
      
      * fix newly added tokenizers
      
      * style and quality
      
      * fix bert japanese
      
      * fix funnel
      
      * limit tokenizer warning to one occurrence
      
      * clean up file
      
      * fix new tokenizers
      
      * fast tokenizers deep tests
      
      * WIP adding all the special fast tests on the new fast tokenizers
      
      * quick fix
      
      * adding more fast tokenizers in the fast tests
      
      * all tokenizers in fast version tested
      
      * Adding BertGenerationFast
      
      * bump up setup.py for CI
      
      * remove BertGenerationFast (too early)
      
      * bump up tokenizers version
      
      * Clean old docstrings
      
      * Typo
      
      * Update following Lysandre comments
      Co-authored-by: Sylvain Gugger <sylvain.gugger@gmail.com>
      9aeacb58
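      A usage sketch of the slow-to-fast conversion script this PR adds; `convert_slow_tokenizer` is the real entry point, though the exact call pattern below is a best guess:

      ```python
      from transformers import T5Tokenizer
      from transformers.convert_slow_tokenizer import convert_slow_tokenizer

      slow = T5Tokenizer.from_pretrained("t5-small")  # SentencePiece-backed slow tokenizer
      fast_backend = convert_slow_tokenizer(slow)     # returns a tokenizers.Tokenizer
      fast_backend.save("tokenizer.json")             # serialized fast tokenizer file
      ```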