1. 11 Dec, 2020 1 commit
  2. 10 Dec, 2020 1 commit
  3. 08 Dec, 2020 1 commit
  4. 03 Dec, 2020 1 commit
  5. 02 Dec, 2020 1 commit
    • Warning about too long input for fast tokenizers too (#8799) · a8c3f9aa
      Nicolas Patry authored
      * Warning about too long input for fast tokenizers too
      
      If truncation is not set in tokenizers, but the tokenization is too long
      for the model (`model_max_length`), we used to trigger a warning that
      the input would probably fail (which it most likely will).
      
      This PR re-enables the warning for fast tokenizers too and uses common
      code for the trigger to make sure it's consistent across both tokenizer types.
      
      * Checking for pair of inputs too.
      
      * Making the function private and adding its doc.
      
      * Remove formatting ?? in odd place.
      
      * Missed uppercase.
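The check described in the commit message above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the helper name `warn_if_too_long` and its parameters are hypothetical, and only `model_max_length` comes from the commit message. It assumes the shared trigger counts the tokens of both sequences when a pair is passed.

```python
import warnings


def warn_if_too_long(ids, pair_ids=None, model_max_length=512, truncation=False):
    """Hypothetical helper: warn when the encoded input exceeds model_max_length.

    Returns True if a warning was emitted, False otherwise.
    """
    # Count both sequences when a pair of inputs is given.
    total_len = len(ids) + (len(pair_ids) if pair_ids is not None else 0)

    # With truncation enabled the input is shortened elsewhere, so stay silent.
    if truncation or total_len <= model_max_length:
        return False

    warnings.warn(
        f"Token indices sequence length is longer than the specified maximum "
        f"sequence length for this model ({total_len} > {model_max_length}). "
        f"Running this sequence through the model will probably fail."
    )
    return True
```

Calling the same helper from both the slow and the fast code path is one way to keep the trigger condition identical for the two tokenizer types.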
  6. 01 Dec, 2020 1 commit
  7. 30 Nov, 2020 1 commit
  8. 27 Nov, 2020 1 commit
    • Extend typing to path-like objects in `PretrainedConfig` and `PreTrainedModel` (#8770) · f9a2a9e3
      Giovanni Compagnoni authored
      * update configuration_utils.py typing to allow pathlike objects when sensible
      
      * update modeling_utils.py typing to allow pathlike objects when sensible
      
      * black
      
      * update tokenization_utils_base.py typing to allow pathlike objects when sensible
      
      * update tokenization_utils_fast.py typing to allow pathlike objects when sensible
      
      * update configuration_auto.py typing to allow pathlike objects when sensible
      
      * update configuration_auto.py docstring to allow pathlike objects when sensible
      
      * update tokenization_auto.py docstring to allow pathlike objects when sensible
      
      * black
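As an illustration of what "allow pathlike objects" means for a signature, here is a small sketch. The function `resolve_config_file` and its default file name are hypothetical and are not taken from the PR; the only assumption carried over is that arguments previously typed as `str` are widened to `Union[str, os.PathLike]`.

```python
import os
from pathlib import Path
from typing import Union


def resolve_config_file(
    name_or_path: Union[str, os.PathLike], filename: str = "config.json"
) -> str:
    """Hypothetical helper: accept a str or any path-like object.

    If the argument is a local directory, return the path of `filename`
    inside it; otherwise return the argument unchanged as a string.
    """
    # os.fspath() converts pathlib.Path (and other os.PathLike objects) to str
    # and leaves plain strings untouched.
    name_or_path = os.fspath(name_or_path)
    if os.path.isdir(name_or_path):
        return os.path.join(name_or_path, filename)
    return name_or_path
```

With this signature, both `resolve_config_file("some-model-id")` and `resolve_config_file(Path("some-model-id"))` are accepted by a type checker and behave the same at runtime.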
  9. 19 Nov, 2020 1 commit
  10. 17 Nov, 2020 3 commits
  11. 15 Nov, 2020 1 commit
    • [breaking|pipelines|tokenizers] Adding slow-fast tokenizers equivalence tests pipelines - Removing sentencepiece as a required dependency (#8073) · f4e04cd2
      Thomas Wolf authored
      
      * Fixing roberta for slow-fast tests
      
      * WIP getting equivalence on pipelines
      
      * slow-to-fast equivalence - working on question-answering pipeline
      
      * optional FAISS tests
      
      * Pipeline Q&A
      
      * Move pipeline tests to their own test job again
      
      * update tokenizer to add sequence id methods
      
      * update to tokenizers 0.9.4
      
      * set sentencepiece as optional
      
      * clean up squad
      
      * clean up pipelines to use sequence_ids
      
      * style/quality
      
      * wording
      
      * Switch to use_fast = True by default
      
      * update tests for use_fast at True by default
      
      * fix rag tokenizer test
      
      * removing protobuf from required dependencies
      
      * fix NER test for use_fast = True by default
      
      * fixing example tests (Q&A examples use slow tokenizers for now)
      
      * protobuf in main deps extras["sentencepiece"] and example deps
      
      * fix protobuf install test
      
      * try to fix seq2seq by switching to slow tokenizers for now
      
      * Update src/transformers/tokenization_utils_base.py
      Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
      
      * Update src/transformers/tokenization_utils_base.py
      Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
  12. 10 Nov, 2020 1 commit
  13. 29 Oct, 2020 1 commit
  14. 26 Oct, 2020 2 commits
    • Doc styling (#8067) · 08f534d2
      Sylvain Gugger authored
      * Important files
      
      * Styling them all
      
      * Revert "Styling them all"
      
      This reverts commit 7d029395fdae8513b8281cbc2a6c239f8093503e.
      
      * Styling them for realsies
      
      * Fix syntax error
      
      * Fix benchmark_utils
      
      * More fixes
      
      * Fix modeling auto and script
      
      * Remove new line
      
      * Fixes
      
      * More fixes
      
      * Fix more files
      
      * Style
      
      * Add FSMT
      
      * More fixes
      
      * More fixes
      
      * More fixes
      
      * More fixes
      
      * Fixes
      
      * More fixes
      
      * More fixes
      
      * Last fixes
      
      * Make sphinx happy
    • [tokenizers] Fixing #8001 - Adding tests on tokenizers serialization (#8006) · 79eb3915
      Thomas Wolf authored
      * fixing #8001
      
      * make T5 tokenizer serialization more robust - style
  15. 24 Oct, 2020 1 commit
  16. 23 Oct, 2020 2 commits
  17. 21 Oct, 2020 1 commit
  18. 19 Oct, 2020 2 commits
  19. 18 Oct, 2020 1 commit
    • [Dependencies|tokenizers] Make both SentencePiece and Tokenizers optional dependencies (#7659) · ba8c4d0a
      Thomas Wolf authored
      * splitting fast and slow tokenizers [WIP]
      
      * [WIP] splitting sentencepiece and tokenizers dependencies
      
      * update dummy objects
      
      * add name_or_path to models and tokenizers
      
      * prefix added to file names
      
      * prefix
      
      * styling + quality
      
      * splitting all the tokenizer files - sorting sentencepiece based ones
      
      * update tokenizer version up to 0.9.0
      
      * remove hard dependency on sentencepiece 🎉
      
      * and removed hard dependency on tokenizers 🎉
      
      * update conversion script
      
      * update missing models
      
      * fixing tests
      
      * move test_tokenization_fast to main tokenization tests - fix bugs
      
      * bump up tokenizers
      
      * fix bert_generation
      
      * update and fix several tokenizers
      
      * keep sentencepiece in deps for now
      
      * fix funnel and deberta tests
      
      * fix fsmt
      
      * fix marian tests
      
      * fix layoutlm
      
      * fix squeezebert and gpt2
      
      * fix T5 tokenization
      
      * fix xlnet tests
      
      * style
      
      * fix mbart
      
      * bump up tokenizers to 0.9.2
      
      * fix model tests
      
      * fix tf models
      
      * fix seq2seq examples
      
      * fix tests without sentencepiece
      
      * fix slow => fast conversion without sentencepiece
      
      * update auto and bert generation tests
      
      * fix mbart tests
      
      * fix auto and common test without tokenizers
      
      * fix tests without tokenizers
      
      * clean up tests lighten up when tokenizers + sentencepiece are both off
      
      * style quality and tests fixing
      
      * add sentencepiece to doc/examples reqs
      
      * leave sentencepiece on for now
      
      * style quality split herbert and fix pegasus
      
      * WIP Herbert fast
      
      * add sample_text_no_unicode and fix herbert tokenization
      
      * skip FSMT example test for now
      
      * fix style
      
      * fix fsmt in example tests
      
      * update following Lysandre and Sylvain's comments
      
      * Update src/transformers/testing_utils.py
      Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
      
      * Update src/transformers/testing_utils.py
      Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
      
      * Update src/transformers/tokenization_utils_base.py
      Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
      
      * Update src/transformers/tokenization_utils_base.py
      Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
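One common way to make a package such as SentencePiece or Tokenizers an optional dependency is to probe for it at import time and fail with a clear message only when a feature that needs it is actually used. The sketch below shows that general pattern under stated assumptions; the names `is_package_available` and `require_backend` are hypothetical and are not the helpers introduced by this commit.

```python
import importlib.util


def is_package_available(name: str) -> bool:
    """Return True if a top-level package can be imported, without importing it."""
    return importlib.util.find_spec(name) is not None


def require_backend(feature: str, backend: str) -> None:
    """Hypothetical guard: raise a readable error if an optional backend is missing."""
    if not is_package_available(backend):
        raise ImportError(
            f"{feature} requires the optional dependency '{backend}', "
            f"which is not installed. Try: pip install {backend}"
        )


# Example: decide at runtime which tokenizer family could be used.
HAS_SENTENCEPIECE = is_package_available("sentencepiece")
HAS_TOKENIZERS = is_package_available("tokenizers")
```

Deferring the error to the point of use keeps `import` of the main package cheap and lets installations without either backend still work for models that do not need them.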
  20. 13 Oct, 2020 2 commits
  21. 12 Oct, 2020 1 commit
  22. 08 Oct, 2020 1 commit
    • Adding Fast tokenizers for SentencePiece based tokenizers - Breaking: remove Transfo-XL fast tokenizer (#7141) · 9aeacb58
      Thomas Wolf authored
      
      * [WIP] SP tokenizers
      
      * fixing tests for T5
      
      * WIP tokenizers
      
      * serialization
      
      * update T5
      
      * WIP T5 tokenization
      
      * slow to fast conversion script
      
      * Refactoring to move tokenizer implementations inside transformers
      
      * Adding gpt - refactoring - quality
      
      * WIP adding several tokenizers to the fast world
      
      * WIP Roberta - moving implementations
      
      * update to dev4 switch file loading to in-memory loading
      
      * Updating and fixing
      
      * advancing on the tokenizers - updating do_lower_case
      
      * style and quality
      
      * moving forward with tokenizers conversion and tests
      
      * MBart, T5
      
      * dumping the fast version of transformer XL
      
      * Adding to autotokenizers + style/quality
      
      * update init and space_between_special_tokens
      
      * style and quality
      
      * bump up tokenizers version
      
      * add protobuf
      
      * fix pickle Bert JP with Mecab
      
      * fix newly added tokenizers
      
      * style and quality
      
      * fix bert japanese
      
      * fix funnel
      
      * limit tokenizer warning to one occurrence
      
      * clean up file
      
      * fix new tokenizers
      
      * fast tokenizers deep tests
      
      * WIP adding all the special fast tests on the new fast tokenizers
      
      * quick fix
      
      * adding more fast tokenizers in the fast tests
      
      * all tokenizers in fast version tested
      
      * Adding BertGenerationFast
      
      * bump up setup.py for CI
      
      * remove BertGenerationFast (too early)
      
      * bump up tokenizers version
      
      * Clean old docstrings
      
      * Typo
      
      * Update following Lysandre's comments
      Co-authored-by: Sylvain Gugger <sylvain.gugger@gmail.com>
  23. 06 Oct, 2020 1 commit
  24. 23 Sep, 2020 1 commit
  25. 22 Sep, 2020 1 commit
  26. 16 Sep, 2020 1 commit
  27. 14 Sep, 2020 1 commit
  28. 09 Sep, 2020 1 commit
  29. 08 Sep, 2020 1 commit
    • typo (#7001) · c18f5916
      Stas Bekman authored
      apologies for the tiny PRs, just sending those as I find them.
  30. 04 Sep, 2020 1 commit
  31. 26 Aug, 2020 1 commit
  32. 24 Aug, 2020 1 commit
  33. 19 Aug, 2020 1 commit
  34. 12 Aug, 2020 1 commit