"tasks/git@developer.sourcefind.cn:sugon_wxj/megatron-lm.git" did not exist on "f64977fdfe325aefbdf09edb132fc1a77010ebe6"
  1. 18 Oct, 2020 1 commit
    • Thomas Wolf's avatar
      [Dependencies|tokenizers] Make both SentencePiece and Tokenizers optional dependencies (#7659) · ba8c4d0a
      Thomas Wolf authored
      * splitting fast and slow tokenizers [WIP]
      
      * [WIP] splitting sentencepiece and tokenizers dependencies
      
      * update dummy objects
      
      * add name_or_path to models and tokenizers
      
      * prefix added to file names
      
      * prefix
      
      * styling + quality
      
      * spliting all the tokenizer files - sorting sentencepiece based ones
      
      * update tokenizer version up to 0.9.0
      
      * remove hard dependency on sentencepiece 馃帀
      
      * and removed hard dependency on tokenizers 馃帀
      
      
      
      * update conversion script
      
      * update missing models
      
      * fixing tests
      
      * move test_tokenization_fast to main tokenization tests - fix bugs
      
      * bump up tokenizers
      
      * fix bert_generation
      
      * update ad fix several tokenizers
      
      * keep sentencepiece in deps for now
      
      * fix funnel and deberta tests
      
      * fix fsmt
      
      * fix marian tests
      
      * fix layoutlm
      
      * fix squeezebert and gpt2
      
      * fix T5 tokenization
      
      * fix xlnet tests
      
      * style
      
      * fix mbart
      
      * bump up tokenizers to 0.9.2
      
      * fix model tests
      
      * fix tf models
      
      * fix seq2seq examples
      
      * fix tests without sentencepiece
      
      * fix slow => fast  conversion without sentencepiece
      
      * update auto and bert generation tests
      
      * fix mbart tests
      
      * fix auto and common test without tokenizers
      
      * fix tests without tokenizers
      
      * clean up tests lighten up when tokenizers + sentencepiece are both off
      
      * style quality and tests fixing
      
      * add sentencepiece to doc/examples reqs
      
      * leave sentencepiece on for now
      
      * style quality split hebert and fix pegasus
      
      * WIP Herbert fast
      
      * add sample_text_no_unicode and fix hebert tokenization
      
      * skip FSMT example test for now
      
      * fix style
      
      * fix fsmt in example tests
      
      * update following Lysandre and Sylvain's comments
      
      * Update src/transformers/testing_utils.py
      Co-authored-by: default avatarSylvain Gugger <35901082+sgugger@users.noreply.github.com>
      
      * Update src/transformers/testing_utils.py
      Co-authored-by: default avatarSylvain Gugger <35901082+sgugger@users.noreply.github.com>
      
      * Update src/transformers/tokenization_utils_base.py
      Co-authored-by: default avatarSylvain Gugger <35901082+sgugger@users.noreply.github.com>
      
      * Update src/transformers/tokenization_utils_base.py
      Co-authored-by: default avatarSylvain Gugger <35901082+sgugger@users.noreply.github.com>
      Co-authored-by: default avatarSylvain Gugger <35901082+sgugger@users.noreply.github.com>
      ba8c4d0a
  2. 09 Oct, 2020 1 commit
  3. 08 Oct, 2020 1 commit
    • Thomas Wolf's avatar
      Adding Fast tokenizers for SentencePiece based tokenizers - Breaking: remove... · 9aeacb58
      Thomas Wolf authored
      
      Adding Fast tokenizers for SentencePiece based tokenizers - Breaking: remove Transfo-XL fast tokenizer (#7141)
      
      * [WIP] SP tokenizers
      
      * fixing tests for T5
      
      * WIP tokenizers
      
      * serialization
      
      * update T5
      
      * WIP T5 tokenization
      
      * slow to fast conversion script
      
      * Refactoring to move tokenzier implementations inside transformers
      
      * Adding gpt - refactoring - quality
      
      * WIP adding several tokenizers to the fast world
      
      * WIP Roberta - moving implementations
      
      * update to dev4 switch file loading to in-memory loading
      
      * Updating and fixing
      
      * advancing on the tokenizers - updating do_lower_case
      
      * style and quality
      
      * moving forward with tokenizers conversion and tests
      
      * MBart, T5
      
      * dumping the fast version of transformer XL
      
      * Adding to autotokenizers + style/quality
      
      * update init and space_between_special_tokens
      
      * style and quality
      
      * bump up tokenizers version
      
      * add protobuf
      
      * fix pickle Bert JP with Mecab
      
      * fix newly added tokenizers
      
      * style and quality
      
      * fix bert japanese
      
      * fix funnel
      
      * limite tokenizer warning to one occurence
      
      * clean up file
      
      * fix new tokenizers
      
      * fast tokenizers deep tests
      
      * WIP adding all the special fast tests on the new fast tokenizers
      
      * quick fix
      
      * adding more fast tokenizers in the fast tests
      
      * all tokenizers in fast version tested
      
      * Adding BertGenerationFast
      
      * bump up setup.py for CI
      
      * remove BertGenerationFast (too early)
      
      * bump up tokenizers version
      
      * Clean old docstrings
      
      * Typo
      
      * Update following Lysandre comments
      Co-authored-by: default avatarSylvain Gugger <sylvain.gugger@gmail.com>
      9aeacb58
  4. 11 Sep, 2020 1 commit
  5. 10 Sep, 2020 1 commit
    • Patrick von Platen's avatar
      Add "Leveraging Pretrained Checkpoints for Generation" Seq2Seq models. (#6594) · 7fd1febf
      Patrick von Platen authored
      * add conversion script
      
      * improve conversion script
      
      * make style
      
      * add tryout files
      
      * fix
      
      * update
      
      * add causal bert
      
      * better names
      
      * add tokenizer file as well
      
      * finish causal_bert
      
      * fix small bugs
      
      * improve generate
      
      * change naming
      
      * renaming
      
      * renaming
      
      * renaming
      
      * remove leftover files
      
      * clean files
      
      * add fix tokenizer
      
      * finalize
      
      * correct slow test
      
      * update docs
      
      * small fixes
      
      * fix link
      
      * adapt check repo
      
      * apply sams and sylvains recommendations
      
      * fix import
      
      * implement Lysandres recommendations
      
      * fix logger warn
      7fd1febf
  6. 30 Aug, 2020 1 commit
  7. 28 Aug, 2020 2 commits
    • Sam Shleifer's avatar
      3cac867f
    • Sam Shleifer's avatar
      prepare_seq2seq_batch makes labels/ decoder_input_ids made later. (#6654) · 9336086a
      Sam Shleifer authored
      * broken test
      
      * batch parity
      
      * tests pass
      
      * boom boom
      
      * boom boom
      
      * split out bart tokenizer tests
      
      * fix tests
      
      * boom boom
      
      * Fixed dataset bug
      
      * Fix marian
      
      * Undo extra
      
      * Get marian working
      
      * Fix t5 tok tests
      
      * Test passing
      
      * Cleanup
      
      * better assert msg
      
      * require torch
      
      * Fix mbart tests
      
      * undo extra decoder_attn_mask change
      
      * Fix import
      
      * pegasus tokenizer can ignore src_lang kwargs
      
      * unused kwarg test cov
      
      * boom boom
      
      * add todo for pegasus issue
      
      * cover one word translation edge case
      
      * Cleanup
      
      * doc
      9336086a
  8. 26 Aug, 2020 1 commit
  9. 25 Aug, 2020 1 commit
  10. 17 Aug, 2020 1 commit
  11. 19 May, 2020 1 commit
  12. 15 Jan, 2020 1 commit
  13. 06 Jan, 2020 2 commits
  14. 22 Dec, 2019 7 commits
  15. 21 Dec, 2019 1 commit
    • Aymeric Augustin's avatar
      Reformat source code with black. · fa84ae26
      Aymeric Augustin authored
      This is the result of:
      
          $ black --line-length 119 examples templates transformers utils hubconf.py setup.py
      
      There's a lot of fairly long lines in the project. As a consequence, I'm
      picking the longest widely accepted line length, 119 characters.
      
      This is also Thomas' preference, because it allows for explicit variable
      names, to make the code easier to understand.
      fa84ae26
  16. 10 Dec, 2019 1 commit
  17. 07 Nov, 2019 1 commit
  18. 06 Nov, 2019 1 commit
  19. 04 Nov, 2019 1 commit
  20. 22 Oct, 2019 1 commit
  21. 04 Oct, 2019 1 commit
  22. 26 Sep, 2019 1 commit
  23. 19 Sep, 2019 1 commit
  24. 30 Aug, 2019 1 commit
  25. 12 Aug, 2019 1 commit
  26. 05 Aug, 2019 1 commit
  27. 15 Jul, 2019 1 commit
  28. 09 Jul, 2019 2 commits
  29. 05 Jul, 2019 3 commits