1. 13 Jul, 2021 1 commit
  2. 01 Jun, 2021 1 commit
    • Add regression tests for slow sentencepiece tokenizers. (#11737) · fcad8018
      Philip May authored
      * add test_vocab_size for sentencepiece tok.
      
      * add test_get_vocab for sentencepiece tok.
      
      * add test_convert_token_and_id for sentencepiece tok.
      
      * add test_tokenize_and_convert_tokens_to_string for all tok.
      
      * improve test_tokenize_and_convert_tokens_to_string for sp. tok.
      
      * add common tokenizer integration tests
      - for albert
      - for barthez
      
      * add tokenizer integration tests to bert gen.
      
      * add most tokenizer integration tests
      
      * fix camembert tokenizer integration test
      
      * add tokenizer integration test to marian
      
      * add tokenizer integration test to reformer
      
      * add typing and doc to tokenizer_integration_test_util
      
      * fix tokenizer integration test of reformer
      
      * improve test_sentencepiece_tokenize_and_convert_tokens_to_string
      
      * empty commit to trigger CI
      
      * fix tokenizer integration test of reformer
      
      * remove code not needed anymore
      
      * empty commit to trigger CI
      
      * empty commit to trigger CI
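
A note on what these tests check: a slow sentencepiece tokenizer must report a vocabulary consistent with `get_vocab()`, round-trip tokens and ids, and detokenize its own output. A minimal sketch of those invariants, assuming `albert-base-v2` as a representative sentencepiece-backed checkpoint (ALBERT lowercases input by default):

```python
from transformers import AlbertTokenizer

tokenizer = AlbertTokenizer.from_pretrained("albert-base-v2")

# test_vocab_size / test_get_vocab: the two views of the vocabulary agree.
assert tokenizer.vocab_size == len(tokenizer.get_vocab())

# test_convert_token_and_id: token <-> id conversion round-trips.
token = tokenizer.tokenize("hello")[0]
assert tokenizer.convert_ids_to_tokens(tokenizer.convert_tokens_to_ids(token)) == token

# test_tokenize_and_convert_tokens_to_string: detokenization recovers the
# input up to normalization (here, lowercasing).
text = "This is a test."
tokens = tokenizer.tokenize(text)
assert tokenizer.convert_tokens_to_string(tokens) == text.lower()
```
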
  3. 13 May, 2021 1 commit
    • Enable option for subword regularization in more tokenizers. (#11417) · 37ed3ab7
      Philip May authored
      * improve slow class tok usage at xlm rob
      
      * add subword regularization for barthez
      
      * improve barthez tok. test
      
      * fix tokenizer tests
      
      * add subword regularization for camembert
      
      * add subword regularization for deberta v2 tokenizer
      
      * add more doc to deberta v2 tokenizer
      
      * add subword regularization for speech to text tok.
      
      * fix sp_model_kwargs type in speech 2 text tok.
      
      * add subword regularization for M2M100 tok.
      
      * add more concrete type hints
      
      * fix tests for m2m100 and s2t tok.
      
      * add missing Any import
      
      * fix syntax error in m2m100 tok.
      
      * fix unpickle of m2m100 and s2t tok.
      
      * fix test of m2m100 and s2t tok.
      
      * improve unpickle of deberta v2 tok.
      
      * add test for pickle of barthez & camembert
      
      * fix pickle of barthez & camembert
      
      * add test for deberta v2 tok. pickle
      
      * fix m2m100 tok. pickle
      
      * fix s2t tok. pickle
      
      * add subword regularization to albert tok.
      
      * refactor subword reg. test into TokenizerTesterMixin
      
      improve albert tok. test
      
      remove sample argument from albert tok.
      
      check subword reg. using TokenizerTesterMixin
      
      improve tok. tests
      
      improve xlm roberta tok. tests
      
      improve xlm roberta tok. tests
      
      * add subword regularization for big bird tok.
      
      * improve xlm roberta tok. test
      
      * add subword regularization for mbart50 tok.
      
      * add subword regularization for pegasus tok.
      
      * add subword regularization for reformer tok.
      
      * add subword regularization for T5 tok.
      
      * fix t5 tok. test formatting
      
      * add subword regularization for xlm_proph. tok.
      
      * add subword regularization for xlnet tok.
      
      * add subword regularization for bert_gen tok.
      
      * add typing to tokenizers
      
      * add typing to xlm rob. tok
      
      * add subword regularization for marian tok.
      
      * add reverse tok. test
      
      * fix marian tok test
      
      * fix marian tok test
      
      * fix casing in tok. tests
      
      * fix style of tok. common test
      
      * fix deberta v2 tok test
      
      * add type annotations to tok. tests
      
      * add type annotations to tok. __init__
      
      * add typing to tokenizer
      
      * add type annotations to tok. __init__
      
      * don't specify the default when it's None
      
      * fix barthez tok. doc
      
      * move sentencepiece tok. tests to TokenizerTesterMixin
      
      * fix unused imports
      
      * fix albert tok. test
      
      * add comment to sentencepiece test options
      
      * fix Any import at big bird tok.
      
      * fix Any import at xlm prophetnet tok.
      
      * empty commit to trigger CI
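
The subword-regularization option added across these tokenizers is exposed as an `sp_model_kwargs` dict forwarded to the underlying `sentencepiece.SentencePieceProcessor`. A usage sketch (the sampling parameters follow Kudo's subword-regularization paper; constructor-level sampling needs sentencepiece >= 0.1.91):

```python
from transformers import XLNetTokenizer

# nbest_size=-1 samples from all segmentation hypotheses; alpha smooths the
# sampling distribution.
tokenizer = XLNetTokenizer.from_pretrained(
    "xlnet-base-cased",
    sp_model_kwargs={"enable_sampling": True, "nbest_size": -1, "alpha": 0.1},
)

# The same sentence can now tokenize differently from call to call.
print(tokenizer.tokenize("The quick brown fox"))
print(tokenizer.tokenize("The quick brown fox"))
```

The pickle fixes in the same PR all follow one pattern: the C++-backed `SentencePieceProcessor` cannot be pickled, so it is dropped in `__getstate__` and rebuilt in `__setstate__` from the saved vocab file and kwargs. A sketch of that pattern (class name hypothetical):

```python
import sentencepiece as spm

class SpmBackedTokenizer:
    def __init__(self, vocab_file, sp_model_kwargs=None):
        self.vocab_file = vocab_file
        self.sp_model_kwargs = sp_model_kwargs or {}
        self.sp_model = spm.SentencePieceProcessor(**self.sp_model_kwargs)
        self.sp_model.Load(vocab_file)

    def __getstate__(self):
        state = self.__dict__.copy()
        state["sp_model"] = None  # not picklable; drop it
        return state

    def __setstate__(self, state):
        self.__dict__ = state
        self.sp_model = spm.SentencePieceProcessor(**self.sp_model_kwargs)
        self.sp_model.Load(self.vocab_file)
```
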
  4. 10 May, 2021 1 commit
    • Big Bird Fast Tokenizer implementation (#11075) · f7f87295
      Tanmay Laud authored
      * Added Big Bird Fast Tokenizer initial file
      
      * style fixes
      
      * flake fixes
      
      * Added big bird fast tokenizer to init files
      
      * Added big bird fast to Auto tokenization
      
      * fix styles
      
      * minor quality fixes
      
      * Added initial test code
      
      * Fix SpmConverter when precompiled_charsmap doesn't exist
      
      * fixed post processor
      
      * minor style fix
      
      * minor fix input names
      
      * Actually fix identity normalization
      
      * style
      
      * Added token type ids to fast tokenizer
      
      * style
      
      * flake fix
      
      * fix copies
      Co-authored-by: Anthony MOI <m.anthony.moi@gmail.com>
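
The `precompiled_charsmap` fix above covers sentencepiece models trained with the identity normalizer, whose normalizer spec carries an empty charsmap that `tokenizers.normalizers.Precompiled` cannot consume. A sketch of the guard, modeled on the `SpmConverter.normalizer` override in `convert_slow_tokenizer.py`:

```python
from tokenizers import Regex, normalizers

def build_normalizer(proto):
    # proto is the parsed sentencepiece model protobuf.
    precompiled_charsmap = proto.normalizer_spec.precompiled_charsmap
    if not precompiled_charsmap:
        # Identity normalization: nothing to precompile, only collapse spaces.
        return normalizers.Sequence([normalizers.Replace(Regex(" {2,}"), " ")])
    return normalizers.Sequence(
        [
            normalizers.Precompiled(precompiled_charsmap),
            normalizers.Replace(Regex(" {2,}"), " "),
        ]
    )
```
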
  5. 30 Mar, 2021 1 commit
    • BigBird (#10183) · 6dfd0272
      Vasudev Gupta authored
      * init bigbird
      
      * model.__init__ working, conversion script ready, config updated
      
      * add conversion script
      
      * BigBirdEmbeddings working :)
      
      * slightly update conversion script
      
      * BigBirdAttention working :) ; some bug in layer.output.dense
      
      * add debugger-notebook
      
      * forward() working for BigBirdModel :) ; replaced gelu with gelu_fast
      
      * tf code adapted to torch till rand_attn in bigbird_block_sparse_attention ; till now everything working :)
      
      * BigBirdModel working in block-sparse attention mode :)
      
      * add BigBirdForPreTraining
      
      * small fix
      
      * add tokenizer for BigBirdModel
      
      * fix config & hence modeling
      
      * fix base prefix
      
      * init testing
      
      * init tokenizer test
      
      * pos_embed must be absolute, attn_type=original_full when add_cross_attn=True , nsp loss is optional in BigBirdForPreTraining, add assert statements
      
      * remove position_embedding_type arg
      
      * complete normal tests
      
      * add comments to block sparse attention
      
      * add attn_probs for sliding & global tokens
      
      * create fn for block sparse attn mask creation
      
      * add special tests
      
      * restore pos embed arg
      
      * minor fix
      
      * attn probs update
      
      * make big bird fully gpu friendly
      
      * fix tests
      
      * remove pruning
      
      * correct tokenizer & minor fixes
      
      * update conversion script , remove norm_type
      
      * tokenizer-inference test add
      
      * remove extra comments
      
      * add docs
      
      * save intermediate
      
      * finish trivia_qa conversion
      
      * small update to forward
      
      * correct qa and layer
      
      * better error message
      
      * BigBird QA ready
      
      * fix rebase
      
      * add trivia-qa debugger notebook
      
      * qa setup
      
      * fixed till embeddings
      
      * some issue in q/k/v_layer
      
      * fix bug in conversion-script
      
      * fixed till self-attn
      
      * qa fixed except layer norm
      
      * add qa end2end test
      
      * fix gradient ckpting ; other qa test
      
      * speed-up big bird a bit
      
      * hub_id=google
      
      * clean up
      
      * make quality
      
      * speed up einsum with bmm
      
      * finish perf improvements for big bird
      
      * remove wav2vec2 tok
      
      * fix tokenizer
      
      * include docs
      
      * correct docs
      
      * add helper to auto pad block size
      
      * make style
      
      * remove fast tokenizer for now
      
      * fix some
      
      * add pad test
      
      * finish
      
      * fix some bugs
      
      * fix another bug
      
      * fix buffer tokens
      
      * fix comment and merge from master
      
      * add comments
      
      * make style
      
      * commit some suggestions
      Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
      
      * Fix typos
      
      * fix some more suggestions
      
      * add another patch
      Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
      
      * fix copies
      
      * another patch
      Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
      
      * update
      
      * update nit suggestions
      
      * make style
      Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
      Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
      Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
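
One self-contained piece of the work above is the "auto pad block size" helper: block-sparse attention only works when the sequence length is a multiple of the block size, so inputs are padded on the right and the padding is masked out. A simplified sketch (the real `_pad_to_block_size` in `modeling_big_bird.py` also pads embeddings and token type ids):

```python
import torch
import torch.nn.functional as F

def pad_to_block_size(input_ids, attention_mask, block_size, pad_token_id):
    seq_len = input_ids.shape[1]
    padding_len = (block_size - seq_len % block_size) % block_size
    if padding_len > 0:
        # Pad up to the next multiple of block_size ...
        input_ids = F.pad(input_ids, (0, padding_len), value=pad_token_id)
        # ... and keep attention from ever looking at the padding.
        attention_mask = F.pad(attention_mask, (0, padding_len), value=0)
    return padding_len, input_ids, attention_mask

# Example: block_size=64 pads a length-100 batch out to 128 tokens.
ids = torch.ones(2, 100, dtype=torch.long)
mask = torch.ones(2, 100, dtype=torch.long)
pad_len, ids, mask = pad_to_block_size(ids, mask, block_size=64, pad_token_id=0)
assert pad_len == 28 and ids.shape[1] == 128
```
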
  6. 07 Dec, 2020 1 commit
  7. 18 Oct, 2020 1 commit
    • [Dependencies|tokenizers] Make both SentencePiece and Tokenizers optional dependencies (#7659) · ba8c4d0a
      Thomas Wolf authored
      * splitting fast and slow tokenizers [WIP]
      
      * [WIP] splitting sentencepiece and tokenizers dependencies
      
      * update dummy objects
      
      * add name_or_path to models and tokenizers
      
      * prefix added to file names
      
      * prefix
      
      * styling + quality
      
      * splitting all the tokenizer files - sorting sentencepiece based ones
      
      * update tokenizers version up to 0.9.0
      
      * remove hard dependency on sentencepiece 🎉
      
      * and removed hard dependency on tokenizers 🎉

      * update conversion script
      
      * update missing models
      
      * fixing tests
      
      * move test_tokenization_fast to main tokenization tests - fix bugs
      
      * bump up tokenizers
      
      * fix bert_generation
      
      * update and fix several tokenizers
      
      * keep sentencepiece in deps for now
      
      * fix funnel and deberta tests
      
      * fix fsmt
      
      * fix marian tests
      
      * fix layoutlm
      
      * fix squeezebert and gpt2
      
      * fix T5 tokenization
      
      * fix xlnet tests
      
      * style
      
      * fix mbart
      
      * bump up tokenizers to 0.9.2
      
      * fix model tests
      
      * fix tf models
      
      * fix seq2seq examples
      
      * fix tests without sentencepiece
      
      * fix slow => fast conversion without sentencepiece
      
      * update auto and bert generation tests
      
      * fix mbart tests
      
      * fix auto and common test without tokenizers
      
      * fix tests without tokenizers
      
      * clean up tests; lighten up when tokenizers + sentencepiece are both off
      
      * style quality and tests fixing
      
      * add sentencepiece to doc/examples reqs
      
      * leave sentencepiece on for now
      
      * style quality, split herbert and fix pegasus
      
      * WIP Herbert fast
      
      * add sample_text_no_unicode and fix herbert tokenization
      
      * skip FSMT example test for now
      
      * fix style
      
      * fix fsmt in example tests
      
      * update following Lysandre and Sylvain's comments
      
      * Update src/transformers/testing_utils.py
      Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
      
      * Update src/transformers/testing_utils.py
      Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
      
      * Update src/transformers/tokenization_utils_base.py
      Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
      
      * Update src/transformers/tokenization_utils_base.py
      Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
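
The mechanics of the split: transformers probes for each package at import time and gates slow (sentencepiece-backed) and fast (tokenizers-backed) classes on the result, substituting dummy objects when a package is missing. A sketch of the availability checks, mirroring the library's `is_sentencepiece_available` / `is_tokenizers_available` helpers:

```python
import importlib.util

def is_sentencepiece_available() -> bool:
    return importlib.util.find_spec("sentencepiece") is not None

def is_tokenizers_available() -> bool:
    return importlib.util.find_spec("tokenizers") is not None

# Slow tokenizers need sentencepiece; fast ones need tokenizers.
if is_sentencepiece_available():
    from transformers import T5Tokenizer        # noqa: F401
if is_tokenizers_available():
    from transformers import T5TokenizerFast    # noqa: F401
```
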
  8. 24 Sep, 2020 1 commit
  9. 11 Sep, 2020 1 commit
  10. 10 Sep, 2020 1 commit
    • Add "Leveraging Pretrained Checkpoints for Generation" Seq2Seq models. (#6594) · 7fd1febf
      Patrick von Platen authored
      * add conversion script
      
      * improve conversion script
      
      * make style
      
      * add tryout files
      
      * fix
      
      * update
      
      * add causal bert
      
      * better names
      
      * add tokenizer file as well
      
      * finish causal_bert
      
      * fix small bugs
      
      * improve generate
      
      * change naming
      
      * renaming
      
      * renaming
      
      * renaming
      
      * remove leftover files
      
      * clean files
      
      * add fix tokenizer
      
      * finalize
      
      * correct slow test
      
      * update docs
      
      * small fixes
      
      * fix link
      
      * adapt check repo
      
      * apply Sam's and Sylvain's recommendations
      
      * fix import
      
      * implement Lysandre's recommendations
      
      * fix logger warn
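
The "causal bert" classes above plug into the encoder-decoder framework: a pretrained BERT checkpoint is loaded as the encoder and, with cross-attention added, as the decoder. A usage sketch along the lines of the BertGeneration documentation (ids 101/102 are BERT's [CLS]/[SEP] tokens):

```python
from transformers import (
    BertGenerationDecoder,
    BertGenerationEncoder,
    BertTokenizer,
    EncoderDecoderModel,
)

# Leverage one pretrained checkpoint on both sides of a seq2seq model.
encoder = BertGenerationEncoder.from_pretrained(
    "bert-large-uncased", bos_token_id=101, eos_token_id=102
)
decoder = BertGenerationDecoder.from_pretrained(
    "bert-large-uncased",
    add_cross_attention=True,
    is_decoder=True,
    bos_token_id=101,
    eos_token_id=102,
)
bert2bert = EncoderDecoderModel(encoder=encoder, decoder=decoder)

tokenizer = BertTokenizer.from_pretrained("bert-large-uncased")
inputs = tokenizer(
    "This is a long article to summarize", add_special_tokens=False, return_tensors="pt"
)
labels = tokenizer("This is a short summary", return_tensors="pt").input_ids
loss = bert2bert(
    input_ids=inputs.input_ids, decoder_input_ids=labels, labels=labels
).loss
```
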
  11. 26 Aug, 2020 1 commit
  12. 24 Aug, 2020 1 commit
  13. 20 Aug, 2020 1 commit
  14. 01 Jul, 2020 1 commit
  15. 19 May, 2020 1 commit
  16. 18 Mar, 2020 1 commit
  17. 25 Feb, 2020 1 commit