1. 23 Nov, 2022 1 commit
    • change the way sentinel tokens can be retrieved (#20373) · 03ae1f06
      raghavanone authored
      * change the way sentinel tokens can be retrieved
      
      * Fix line length for doc string
      
      * Fix line length for doc string
      
      * Add more stronger test for t5 tokenization
      
      * Format file changes
      
      * Make a stronger test for filtering sentinel tokens
      
      * fix file format issues
  2. 29 Jul, 2022 1 commit
  3. 03 May, 2022 1 commit
    • Move test model folders (#17034) · 19420fd9
      Yih-Dar authored
      
      
      * move test model folders (TODO: fix imports and others)
      
      * fix (potentially partially) imports (in model test modules)
      
      * fix (potentially partially) imports (in tokenization test modules)
      
      * fix (potentially partially) imports (in feature extraction test modules)
      
      * fix import utils.test_modeling_tf_core
      
      * fix path ../fixtures/
      
      * fix imports about generation.test_generation_flax_utils
      
      * fix more imports
      
      * fix fixture path
      
      * fix get_test_dir
      
      * update module_to_test_file
      
      * fix get_tests_dir from wrong transformers.utils
      
      * update config.yml (CircleCI)
      
      * fix style
      
      * remove missing imports
      
      * update new model script
      
      * update check_repo
      
      * update SPECIAL_MODULE_TO_TEST_MAP
      
      * fix style
      
      * add __init__
      
      * update self-scheduled
      
      * fix add_new_model scripts
      
      * check one way to get location back
      
      * python setup.py build install
      
      * fix import in test auto
      
      * update self-scheduled.yml
      
      * update slack notification script
      
      * Add comments about artifact names
      
      * fix for yolos
      Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
  4. 02 May, 2022 1 commit
  5. 23 Mar, 2022 1 commit
    • Reorganize file utils (#16264) · 4975002d
      Sylvain Gugger authored
      * Split file_utils in several submodules
      
      * Fixes
      
      * Add back more objects
      
      * More fixes
      
      * Who exactly decided to import that from there?
      
      * Second suggestion to code with code review
      
      * Revert wrong move
      
      * Fix imports
      
      * Adapt all imports
      
      * Adapt all imports everywhere
      
      * Revert this import, will fix in a separate commit
  6. 23 Feb, 2022 1 commit
  7. 23 Aug, 2021 1 commit
    • Change how "additional_special_tokens" argument in the ".from_pretrained" method of the tokenizer is taken into account (#13056) · 7223844d
      SaulLu authored
      
      * add test
      
      * add change in PretrainedTokenizerBase
      
      * change Luke
      
      * deactivate
      
      * add the possibility to add additional special tokens for M2M100
      
      * format
      
      * add special test for canine
      
      * proposed changes for mbart
      
      * proposed changes for mbart50
      
      * proposed changes for byt5
      
      * proposed changes for canine
      
      * proposed changes for t5
      
      * test fast and slow
      
      * remove comment
      
      * remove comment
      
      * add fast version for all tests
      
      * replace break by continue
      
      * add more comments
      
      * add check to avoid duplicates
      
      * remove comment
      
      * format
      
      * proposed change for wav2vec2
      
      * reverse changes mbart
      
      * uncomment
      
      * format
  8. 01 Jun, 2021 1 commit
    • Add regression tests for slow sentencepiece tokenizers. (#11737) · fcad8018
      Philip May authored
      * add test_vocab_size for sentencepiece tok.
      
      * add test_get_vocab for sentencepiece tok.
      
      * add test_convert_token_and_id for sentencepiece tok.
      
      * add test_tokenize_and_convert_tokens_to_string for all tok.
      
      * improve test_tokenize_and_convert_tokens_to_string for sp. tok.
      
      * add common tokenizer integration tests
      - for albert
      - for barthez
      
      * add tokenizer integration tests to bert gen.
      
      * add most tokenizer integration tests
      
      * fix camembert tokenizer integration test
      
      * add tokenizer integration test to marian
      
      * add tokenizer integration test to reformer
      
      * add typing and doc to tokenizer_integration_test_util
      
      * fix tokenizer integration test of reformer
      
      * improve test_sentencepiece_tokenize_and_convert_tokens_to_string
      
      * empty commit to trigger CI
      
      * fix tokenizer integration test of reformer
      
      * remove code not needed anymore
      
      * empty commit to trigger CI
      
      * empty commit to trigger CI
  9. 13 May, 2021 1 commit
    • Enable option for subword regularization in more tokenizers. (#11417) · 37ed3ab7
      Philip May authored
      * improve slow class tok usage at xlm rob
      
      * add subword regularization for barthez
      
      * improve barthez tok. test
      
      * fix tokenizer tests
      
      * add subword regularization for camembert
      
      * add subword regularization for deberta v2 tokenizer
      
      * add more doc to deberta v2 tokenizer
      
      * add subword regularization for speech to text tok.
      
      * fix sp_model_kwargs type in speech 2 text tok.
      
      * add subword regularization for M2M100 tok.
      
      * add more concrete type hints
      
      * fix tests for m2m100 and s2t tok.
      
      * add missing Any import
      
      * fix syntax error in m2m100 tok.
      
      * fix unpickle of m2m100 and s2t tok.
      
      * fix test of m2m100 and s2t tok.
      
      * improve unpickle of deberta v2 tok.
      
      * add test for pickle of barthez & camembert
      
      * fix pickle of barthez & camembert
      
      * add test for deberta v2 tok. pickle
      
      * fix m2m100 tok. pickle
      
      * fix s2t tok. pickle
      
      * add subword regularization to albert tok.
      
      * refactor subword reg. test into TokenizerTesterMixin
      
      improve albert tok. test
      
      remove sample argument from albert tok.
      
      check subword reg. using TokenizerTesterMixin
      
      improve tok. tests
      
      improve xlm roberta tok. tests
      
      improve xlm roberta tok. tests
      
      * add subword regularization for big bird t.
      
      * improve xlm roberta tok. test
      
      * add subword regularization for mbart50 tok.
      
      * add subword regularization for pegasus tok.
      
      * add subword regularization for reformer tok.
      
      * add subword regularization for T5 tok.
      
      * fix t5 tok. test formatting
      
      * add subword regularization for xlm_proph. tok.
      
      * add subword regularization for xlnet tok.
      
      * add subword regularization for bert_gen tok.
      
      * add typing to tokenizers
      
      * add typing to xlm rob. tok
      
      * add subword regularization for marian tok.
      
      * add reverse tok. test
      
      * fix marian tok test
      
      * fix marian tok test
      
      * fix casing in tok. tests
      
      * fix style of tok. common test
      
      * fix deberta v2 tok test
      
      * add type annotations to tok. tests
      
      * add type annotations to tok. __init__
      
      * add typing to tokenizer
      
      * add type annotations to tok. __init__
      
      * don't specify the default when it's None
      
      * fix barthez tok. doc
      
      * move sentencepiece tok. tests to TokenizerTesterMixin
      
      * fix unused imports
      
      * fix albert tok. test
      
      * add comment to sentencepiece test options
      
      * fix Any import at big bird tok.
      
      * fix Any import at xlm prophetnet tok.
      
      * empty commit to trigger CI
  10. 04 May, 2021 1 commit
  11. 16 Mar, 2021 1 commit
  12. 22 Feb, 2021 1 commit
  13. 06 Jan, 2021 1 commit
    • Fast transformers import part 1 (#9441) · 0c96262f
      Sylvain Gugger authored
      * Don't import libs to check they are available
      
      * Don't import integrations at init
      
      * Add importlib_metadata to deps
      
      * Remove old vars references
      
      * Avoid syntax error
      
      * Adapt testing utils
      
      * Try to appease torchhub
      
      * Add dependency
      
      * Remove more private variables
      
      * Fix typo
      
      * Another typo
      
      * Refine the tf availability test
  14. 10 Nov, 2020 2 commits
  15. 18 Oct, 2020 1 commit
    • [Dependencies|tokenizers] Make both SentencePiece and Tokenizers optional dependencies (#7659) · ba8c4d0a
      Thomas Wolf authored
      * splitting fast and slow tokenizers [WIP]
      
      * [WIP] splitting sentencepiece and tokenizers dependencies
      
      * update dummy objects
      
      * add name_or_path to models and tokenizers
      
      * prefix added to file names
      
      * prefix
      
      * styling + quality
      
      * splitting all the tokenizer files - sorting sentencepiece based ones
      
      * update tokenizer version up to 0.9.0
      
      * remove hard dependency on sentencepiece 🎉
      
      * and removed hard dependency on tokenizers 🎉
      
      
      
      * update conversion script
      
      * update missing models
      
      * fixing tests
      
      * move test_tokenization_fast to main tokenization tests - fix bugs
      
      * bump up tokenizers
      
      * fix bert_generation
      
      * update and fix several tokenizers
      
      * keep sentencepiece in deps for now
      
      * fix funnel and deberta tests
      
      * fix fsmt
      
      * fix marian tests
      
      * fix layoutlm
      
      * fix squeezebert and gpt2
      
      * fix T5 tokenization
      
      * fix xlnet tests
      
      * style
      
      * fix mbart
      
      * bump up tokenizers to 0.9.2
      
      * fix model tests
      
      * fix tf models
      
      * fix seq2seq examples
      
      * fix tests without sentencepiece
      
      * fix slow => fast  conversion without sentencepiece
      
      * update auto and bert generation tests
      
      * fix mbart tests
      
      * fix auto and common test without tokenizers
      
      * fix tests without tokenizers
      
      * clean up tests lighten up when tokenizers + sentencepiece are both off
      
      * style quality and tests fixing
      
      * add sentencepiece to doc/examples reqs
      
      * leave sentencepiece on for now
      
      * style quality split herbert and fix pegasus
      
      * WIP Herbert fast
      
      * add sample_text_no_unicode and fix herbert tokenization
      
      * skip FSMT example test for now
      
      * fix style
      
      * fix fsmt in example tests
      
      * update following Lysandre and Sylvain's comments
      
      * Update src/transformers/testing_utils.py
      Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
      
      * Update src/transformers/testing_utils.py
      Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
      
      * Update src/transformers/tokenization_utils_base.py
      Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
      
      * Update src/transformers/tokenization_utils_base.py
      Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
      Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
  16. 09 Oct, 2020 1 commit
  17. 08 Oct, 2020 1 commit
    • Adding Fast tokenizers for SentencePiece based tokenizers - Breaking: remove Transfo-XL fast tokenizer (#7141) · 9aeacb58
      Thomas Wolf authored
      
      * [WIP] SP tokenizers
      
      * fixing tests for T5
      
      * WIP tokenizers
      
      * serialization
      
      * update T5
      
      * WIP T5 tokenization
      
      * slow to fast conversion script
      
      * Refactoring to move tokenizer implementations inside transformers
      
      * Adding gpt - refactoring - quality
      
      * WIP adding several tokenizers to the fast world
      
      * WIP Roberta - moving implementations
      
      * update to dev4 switch file loading to in-memory loading
      
      * Updating and fixing
      
      * advancing on the tokenizers - updating do_lower_case
      
      * style and quality
      
      * moving forward with tokenizers conversion and tests
      
      * MBart, T5
      
      * dumping the fast version of transformer XL
      
      * Adding to autotokenizers + style/quality
      
      * update init and space_between_special_tokens
      
      * style and quality
      
      * bump up tokenizers version
      
      * add protobuf
      
      * fix pickle Bert JP with Mecab
      
      * fix newly added tokenizers
      
      * style and quality
      
      * fix bert japanese
      
      * fix funnel
      
      * limit tokenizer warning to one occurrence
      
      * clean up file
      
      * fix new tokenizers
      
      * fast tokenizers deep tests
      
      * WIP adding all the special fast tests on the new fast tokenizers
      
      * quick fix
      
      * adding more fast tokenizers in the fast tests
      
      * all tokenizers in fast version tested
      
      * Adding BertGenerationFast
      
      * bump up setup.py for CI
      
      * remove BertGenerationFast (too early)
      
      * bump up tokenizers version
      
      * Clean old docstrings
      
      * Typo
      
      * Update following Lysandre comments
      Co-authored-by: Sylvain Gugger <sylvain.gugger@gmail.com>
  18. 11 Sep, 2020 1 commit
  19. 10 Sep, 2020 1 commit
    • Add "Leveraging Pretrained Checkpoints for Generation" Seq2Seq models. (#6594) · 7fd1febf
      Patrick von Platen authored
      * add conversion script
      
      * improve conversion script
      
      * make style
      
      * add tryout files
      
      * fix
      
      * update
      
      * add causal bert
      
      * better names
      
      * add tokenizer file as well
      
      * finish causal_bert
      
      * fix small bugs
      
      * improve generate
      
      * change naming
      
      * renaming
      
      * renaming
      
      * renaming
      
      * remove leftover files
      
      * clean files
      
      * add fix tokenizer
      
      * finalize
      
      * correct slow test
      
      * update docs
      
      * small fixes
      
      * fix link
      
      * adapt check repo
      
      * apply sams and sylvains recommendations
      
      * fix import
      
      * implement Lysandres recommendations
      
      * fix logger warn
  20. 30 Aug, 2020 1 commit
  21. 28 Aug, 2020 2 commits
    • 3cac867f
    • prepare_seq2seq_batch makes labels/ decoder_input_ids made later. (#6654) · 9336086a
      Sam Shleifer authored
      * broken test
      
      * batch parity
      
      * tests pass
      
      * boom boom
      
      * boom boom
      
      * split out bart tokenizer tests
      
      * fix tests
      
      * boom boom
      
      * Fixed dataset bug
      
      * Fix marian
      
      * Undo extra
      
      * Get marian working
      
      * Fix t5 tok tests
      
      * Test passing
      
      * Cleanup
      
      * better assert msg
      
      * require torch
      
      * Fix mbart tests
      
      * undo extra decoder_attn_mask change
      
      * Fix import
      
      * pegasus tokenizer can ignore src_lang kwargs
      
      * unused kwarg test cov
      
      * boom boom
      
      * add todo for pegasus issue
      
      * cover one word translation edge case
      
      * Cleanup
      
      * doc
  22. 26 Aug, 2020 1 commit
  23. 25 Aug, 2020 1 commit
  24. 17 Aug, 2020 1 commit
  25. 19 May, 2020 1 commit
  26. 15 Jan, 2020 1 commit
  27. 06 Jan, 2020 2 commits
  28. 22 Dec, 2019 7 commits
  29. 21 Dec, 2019 1 commit
    • Reformat source code with black. · fa84ae26
      Aymeric Augustin authored
      This is the result of:
      
          $ black --line-length 119 examples templates transformers utils hubconf.py setup.py
      
      There's a lot of fairly long lines in the project. As a consequence, I'm
      picking the longest widely accepted line length, 119 characters.
      
      This is also Thomas' preference, because it allows for explicit variable
      names, to make the code easier to understand.
  30. 10 Dec, 2019 1 commit
  31. 07 Nov, 2019 1 commit