  1. 26 Oct, 2020 1 commit
    • Doc styling (#8067) · 08f534d2
      Sylvain Gugger authored
      * Important files
      
      * Styling them all
      
      * Revert "Styling them all"
      
      This reverts commit 7d029395fdae8513b8281cbc2a6c239f8093503e.
      
      * Styling them for realsies
      
      * Fix syntax error
      
      * Fix benchmark_utils
      
      * More fixes
      
      * Fix modeling auto and script
      
      * Remove new line
      
      * Fixes
      
      * More fixes
      
      * Fix more files
      
      * Style
      
      * Add FSMT
      
      * More fixes
      
      * More fixes
      
      * More fixes
      
      * More fixes
      
      * Fixes
      
      * More fixes
      
      * More fixes
      
      * Last fixes
      
      * Make sphinx happy
  2. 18 Oct, 2020 1 commit
    • [Dependencies|tokenizers] Make both SentencePiece and Tokenizers optional dependencies (#7659) · ba8c4d0a
      Thomas Wolf authored
      * splitting fast and slow tokenizers [WIP]
      
      * [WIP] splitting sentencepiece and tokenizers dependencies
      
      * update dummy objects
      
      * add name_or_path to models and tokenizers
      
      * prefix added to file names
      
      * prefix
      
      * styling + quality
      
      * splitting all the tokenizer files - sorting sentencepiece based ones
      
      * update tokenizer version up to 0.9.0
      
      * remove hard dependency on sentencepiece 🎉
      
      * and removed hard dependency on tokenizers 🎉
      
      * update conversion script
      
      * update missing models
      
      * fixing tests
      
      * move test_tokenization_fast to main tokenization tests - fix bugs
      
      * bump up tokenizers
      
      * fix bert_generation
      
      * update and fix several tokenizers
      
      * keep sentencepiece in deps for now
      
      * fix funnel and deberta tests
      
      * fix fsmt
      
      * fix marian tests
      
      * fix layoutlm
      
      * fix squeezebert and gpt2
      
      * fix T5 tokenization
      
      * fix xlnet tests
      
      * style
      
      * fix mbart
      
      * bump up tokenizers to 0.9.2
      
      * fix model tests
      
      * fix tf models
      
      * fix seq2seq examples
      
      * fix tests without sentencepiece
      
      * fix slow => fast conversion without sentencepiece
      
      * update auto and bert generation tests
      
      * fix mbart tests
      
      * fix auto and common test without tokenizers
      
      * fix tests without tokenizers
      
      * clean up tests; lighten them up when tokenizers + sentencepiece are both off
      
      * style quality and tests fixing
      
      * add sentencepiece to doc/examples reqs
      
      * leave sentencepiece on for now
      
      * style quality, split herbert and fix pegasus
      
      * WIP Herbert fast
      
      * add sample_text_no_unicode and fix herbert tokenization
      
      * skip FSMT example test for now
      
      * fix style
      
      * fix fsmt in example tests
      
      * update following Lysandre and Sylvain's comments
      
      * Update src/transformers/testing_utils.py
      Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
      
      * Update src/transformers/testing_utils.py
      Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
      
      * Update src/transformers/tokenization_utils_base.py
      Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
      
      * Update src/transformers/tokenization_utils_base.py
      Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
      Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
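      
      As an illustration of what making SentencePiece and Tokenizers optional can look like, here is a minimal sketch that detects a package without importing it and fails with a clear message only when a feature actually needs it. The helper names `is_package_available` and `require_package` are hypothetical and are not taken from this commit.

```python
import importlib.util


def is_package_available(name):
    """Return True if the top-level package `name` is installed (hypothetical helper)."""
    # find_spec looks the package up without importing it.
    return importlib.util.find_spec(name) is not None


def require_package(name):
    """Raise a readable error when an optional dependency is missing (hypothetical helper)."""
    if not is_package_available(name):
        raise ImportError(
            f"This feature needs the optional package '{name}'. "
            f"Install it with `pip install {name}`."
        )


# Flags computed once, so importing this module never fails
# just because an optional package is absent.
SENTENCEPIECE_AVAILABLE = is_package_available("sentencepiece")
TOKENIZERS_AVAILABLE = is_package_available("tokenizers")
```

      Code that depends on one of the packages would call `require_package("sentencepiece")` at the point of use, so users who never touch that code path are not forced to install it.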
  3. 23 Sep, 2020 1 commit
  4. 26 Aug, 2020 2 commits
  5. 18 Apr, 2020 1 commit
    • Cleanup fast tokenizers integration (#3706) · 827d6d6e
      Thomas Wolf authored
      * First pass on utility classes and python tokenizers
      
      * finishing cleanup pass
      
      * style and quality
      
      * Fix tests
      
      * Updating following @mfuntowicz comment
      
      * style and quality
      
      * Fix Roberta
      
      * fix batch_size/seq_length in BatchEncoding
      
      * add alignment methods + tests
      
      * Fix OpenAI and Transfo-XL tokenizers
      
      * adding trim_offsets=True default for GPT2 and RoBERTa
      
      * style and quality
      
      * fix tests
      
      * add_prefix_space in roberta
      
      * bump up tokenizers to rc7
      
      * style
      
      * unfortunately tensorflow doesn't like these - removing shape/seq_len for now
      
      * Update src/transformers/tokenization_utils.py
      Co-Authored-By: Stefan Schweter <stefan@schweter.it>
      
      * Adding doc and docstrings
      
      * making flake8 happy
      Co-authored-by: Stefan Schweter <stefan@schweter.it>
  6. 25 Feb, 2020 1 commit
    • Documentation (#2989) · bb7c4685
      Lysandre Debut authored
      * All Tokenizers
      
      BertTokenizer + few fixes
      RobertaTokenizer
      OpenAIGPTTokenizer + Fixes
      GPT2Tokenizer + fixes
      TransfoXLTokenizer
      Correct rst for TransformerXL
      XLMTokenizer + fixes
      XLNet Tokenizer + Style
      DistilBERT + Fix XLNet RST
      CTRLTokenizer
      CamemBERT Tokenizer
      FlaubertTokenizer
      XLMRobertaTokenizer
      cleanup
      
      * cleanup
  7. 20 Feb, 2020 1 commit
  8. 15 Jan, 2020 1 commit
  9. 06 Jan, 2020 2 commits
  10. 22 Dec, 2019 7 commits
  11. 21 Dec, 2019 1 commit
    • Reformat source code with black. · fa84ae26
      Aymeric Augustin authored
      This is the result of:
      
          $ black --line-length 119 examples templates transformers utils hubconf.py setup.py
      
      There are a lot of fairly long lines in the project. As a consequence, I'm
      picking the longest widely accepted line length, 119 characters.
      
      This is also Thomas' preference, because it allows for explicit variable
      names, to make the code easier to understand.
  12. 06 Dec, 2019 1 commit
    • Remove dependency on pytest for running tests (#2055) · 35401fe5
      Aymeric Augustin authored
      * Switch to plain unittest for skipping slow tests.
      
      Add a RUN_SLOW environment variable for running them.
      
      * Switch to plain unittest for PyTorch dependency.
      
      * Switch to plain unittest for TensorFlow dependency.
      
      * Avoid leaking open files in the test suite.
      
      This prevents spurious warnings when running tests.
      
      * Fix unicode warning on Python 2 when running tests.
      
      The warning was:
      
          UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
      
      * Support running PyTorch tests on a GPU.
      
      Reverts 27e015bd.
      
      * Tests no longer require pytest.
      
      * Make tests pass on cuda
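      
      To illustrate the first item above, here is a minimal sketch of skipping slow tests with plain `unittest`, gated on the `RUN_SLOW` environment variable named in the commit. The helper `parse_flag_from_env`, the `slow` decorator, and the set of accepted truthy values are assumptions made for this sketch and are not necessarily what the commit implements.

```python
import os
import unittest


def parse_flag_from_env(key, default=False):
    """Read a boolean flag from the environment (hypothetical helper)."""
    value = os.environ.get(key)
    if value is None:
        return default
    # Assumed set of truthy spellings; anything else counts as False.
    return value.strip().lower() in {"1", "true", "yes", "y", "on"}


def slow(test_case):
    """Skip the decorated test unless RUN_SLOW is set to a truthy value."""
    # The environment is read when the decorator is applied, i.e. at import time.
    return unittest.skipUnless(parse_flag_from_env("RUN_SLOW"), "test is slow")(test_case)


class ExampleTest(unittest.TestCase):
    def test_fast(self):
        self.assertEqual(1 + 1, 2)

    @slow
    def test_slow(self):
        # Stands in for an expensive test, e.g. one that downloads a model.
        self.assertTrue(True)
```

      With this sketch, `python -m unittest` would report `test_slow` as skipped, while `RUN_SLOW=1 python -m unittest` would run it, and pytest is not needed in either case.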
  13. 05 Dec, 2019 1 commit
  14. 22 Oct, 2019 2 commits
  15. 10 Oct, 2019 2 commits
  16. 09 Oct, 2019 1 commit
  17. 08 Oct, 2019 2 commits
  18. 04 Oct, 2019 1 commit
    • Adding CTRL (squashed commit) · dbed1c5d
      keskarnitish authored
      adding conversion script
      
      adding first draft of modeling & tokenization
      
      adding placeholder for test files
      
      bunch of changes
      
      registering the tokenizer/model/etc
      
      tests
      
      change link; something is very VERY wrong here
      
      weird end-of-word thingy going on
      
      i think the tokenization works now ; wrote the unit tests
      
      overall structure works;load w next
      
      the monster is alive!
      
      works after some cleanup as well
      
      adding emacs autosave to gitignore
      
      currently only supporting the 48 layer one; seems to infer fine on my macbook
      
      cleanup
      
      fixing some documentation
      
      fixing some documentation
      
      tests passing?
      
      now works on CUDA also
      
      adding greedy?
      
      adding greedy sampling
      
      works well
  19. 03 Oct, 2019 1 commit
  20. 26 Sep, 2019 3 commits
  21. 30 Aug, 2019 3 commits
  22. 23 Aug, 2019 1 commit
  23. 21 Aug, 2019 2 commits
  24. 04 Aug, 2019 1 commit