"vscode:/vscode.git/clone" did not exist on "335f57baf86094907a14de7ddc9f3e791ae3519b"
  1. 18 Oct, 2020 1 commit
    • [Dependencies|tokenizers] Make both SentencePiece and Tokenizers optional dependencies (#7659) · ba8c4d0a
      Thomas Wolf authored
      * splitting fast and slow tokenizers [WIP]
      
      * [WIP] splitting sentencepiece and tokenizers dependencies
      
      * update dummy objects
      
      * add name_or_path to models and tokenizers
      
      * prefix added to file names
      
      * prefix
      
      * styling + quality
      
      * splitting all the tokenizer files - sorting sentencepiece based ones
      
      * update tokenizer version up to 0.9.0
      
      * remove hard dependency on sentencepiece 🎉
      
      * and removed hard dependency on tokenizers 🎉

      * update conversion script
      
      * update missing models
      
      * fixing tests
      
      * move test_tokenization_fast to main tokenization tests - fix bugs
      
      * bump up tokenizers
      
      * fix bert_generation
      
      * update and fix several tokenizers
      
      * keep sentencepiece in deps for now
      
      * fix funnel and deberta tests
      
      * fix fsmt
      
      * fix marian tests
      
      * fix layoutlm
      
      * fix squeezebert and gpt2
      
      * fix T5 tokenization
      
      * fix xlnet tests
      
      * style
      
      * fix mbart
      
      * bump up tokenizers to 0.9.2
      
      * fix model tests
      
      * fix tf models
      
      * fix seq2seq examples
      
      * fix tests without sentencepiece
      
      * fix slow => fast conversion without sentencepiece
      
      * update auto and bert generation tests
      
      * fix mbart tests
      
      * fix auto and common test without tokenizers
      
      * fix tests without tokenizers
      
      * clean up tests, lighten up when tokenizers + sentencepiece are both off
      
      * style quality and tests fixing
      
      * add sentencepiece to doc/examples reqs
      
      * leave sentencepiece on for now
      
      * style and quality, split herbert and fix pegasus
      
      * WIP Herbert fast
      
      * add sample_text_no_unicode and fix herbert tokenization
      
      * skip FSMT example test for now
      
      * fix style
      
      * fix fsmt in example tests
      
      * update following Lysandre and Sylvain's comments
      
      * Update src/transformers/testing_utils.py
      Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
      
      * Update src/transformers/testing_utils.py
      Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
      
      * Update src/transformers/tokenization_utils_base.py
      Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
      
      * Update src/transformers/tokenization_utils_base.py
      Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
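
      For context on the availability checks and "dummy objects" referenced in the commits above, the sketch below shows the usual optional-dependency pattern in Python, assuming only the standard library. The helper names mirror transformers' `is_sentencepiece_available` / `is_tokenizers_available` style, but `DummySentencePieceTokenizer` is a hypothetical stand-in for the generated dummy objects, not the actual source:

      ```python
      # Minimal sketch of making a dependency optional: probe for the package
      # at import time instead of importing it unconditionally.
      import importlib.util


      def is_sentencepiece_available() -> bool:
          # True if the `sentencepiece` package is installed and importable.
          return importlib.util.find_spec("sentencepiece") is not None


      def is_tokenizers_available() -> bool:
          # True if the Rust-backed `tokenizers` package is installed and importable.
          return importlib.util.find_spec("tokenizers") is not None


      class DummySentencePieceTokenizer:
          # Hypothetical placeholder exported when sentencepiece is missing:
          # importing it succeeds, but instantiating it raises a clear error.
          def __init__(self, *args, **kwargs):
              raise ImportError(
                  "This tokenizer requires the SentencePiece library. "
                  "Install it with `pip install sentencepiece`."
              )
      ```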
  2. 08 Oct, 2020 1 commit
    • Adding Fast tokenizers for SentencePiece based tokenizers - Breaking: remove Transfo-XL fast tokenizer (#7141) · 9aeacb58
      Thomas Wolf authored
      
      * [WIP] SP tokenizers
      
      * fixing tests for T5
      
      * WIP tokenizers
      
      * serialization
      
      * update T5
      
      * WIP T5 tokenization
      
      * slow to fast conversion script
      
      * Refactoring to move tokenizer implementations inside transformers
      
      * Adding gpt - refactoring - quality
      
      * WIP adding several tokenizers to the fast world
      
      * WIP Roberta - moving implementations
      
      * update to dev4, switch file loading to in-memory loading
      
      * Updating and fixing
      
      * advancing on the tokenizers - updating do_lower_case
      
      * style and quality
      
      * moving forward with tokenizers conversion and tests
      
      * MBart, T5
      
      * dropping the fast version of Transfo-XL
      
      * Adding to autotokenizers + style/quality
      
      * update init and space_between_special_tokens
      
      * style and quality
      
      * bump up tokenizers version
      
      * add protobuf
      
      * fix pickle Bert JP with Mecab
      
      * fix newly added tokenizers
      
      * style and quality
      
      * fix bert japanese
      
      * fix funnel
      
      * limit tokenizer warning to one occurrence
      
      * clean up file
      
      * fix new tokenizers
      
      * fast tokenizers deep tests
      
      * WIP adding all the special fast tests on the new fast tokenizers
      
      * quick fix
      
      * adding more fast tokenizers in the fast tests
      
      * all tokenizers tested in their fast version
      
      * Adding BertGenerationFast
      
      * bump up setup.py for CI
      
      * remove BertGenerationFast (too early)
      
      * bump up tokenizers version
      
      * Clean old docstrings
      
      * Typo
      
      * Update following Lysandre's comments
      Co-authored-by: Sylvain Gugger <sylvain.gugger@gmail.com>
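
      As a usage note, here is a minimal sketch of what this PR enables: `AutoTokenizer` returning a Rust-backed fast tokenizer for a SentencePiece model such as T5. It assumes transformers >= 3.4 with the `sentencepiece`, `tokenizers`, and `protobuf` packages installed (per the commits above); the `t5-small` checkpoint downloads on first run:

      ```python
      # Sketch: load the fast (Rust-backed) T5 tokenizer made possible by the
      # slow->fast conversion work in this PR.
      from transformers import AutoTokenizer

      # use_fast=True requests the Rust implementation; before this PR,
      # SentencePiece-based models like T5 only had a slow Python tokenizer.
      tok = AutoTokenizer.from_pretrained("t5-small", use_fast=True)
      print(type(tok).__name__)  # expected: T5TokenizerFast

      enc = tok("translate English to German: Hello, world!")
      print(enc.tokens())      # fast tokenizers expose the underlying token strings
      print(enc["input_ids"])  # and the usual integer ids
      ```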
  3. 07 Jul, 2020 2 commits
    • Fix tests imports dpr (#5576) · 4fedc125
      Quentin Lhoest authored
      * fix test imports
      
      * fix max_length
      
      * style
      
      * fix tests
    • Add DPR model (#5279) · fbd87921
      Quentin Lhoest authored
      * beginning of dpr modeling
      
      * wip
      
      * implement forward
      
      * remove biencoder + better init weights
      
      * export dpr model to an embedding model for the nlp lib
      
      * add new api
      
      * remove old code
      
      * make style
      
      * fix dumb typo
      
      * don't load bert weights
      
      * docs
      
      * docs
      
      * style
      
      * move the `k` parameter
      
      * fix init_weights
      
      * add pretrained configs
      
      * minor
      
      * update config names
      
      * style
      
      * better config
      
      * style
      
      * clean code based on PR comments
      
      * change Dpr to DPR
      
      * fix config
      
      * switch encoder config to a dict
      
      * style
      
      * inheritance -> composition
      
      * add messages in assert statements
      
      * add dpr reader tokenizer
      
      * one tokenizer per model
      
      * fix base_model_prefix
      
      * fix imports
      
      * typo
      
      * add convert script
      
      * docs
      
      * change tokenizers conf names
      
      * style
      
      * change tokenizers conf names
      
      * minor
      
      * minor
      
      * fix wrong names
      
      * minor
      
      * remove unused convert functions
      
      * rename convert script
      
      * use return_tensors in tokenizers
      
      * remove n_questions dim
      
      * move generate logic to tokenizer
      
      * style
      
      * add docs
      
      * docs
      
      * quality
      
      * docs
      
      * add tests
      
      * style
      
      * add tokenization tests
      
      * DPR full tests
      
      * Stay true to the attention mask building
      
      * update docs
      
      * missing param in bert input docs
      
      * docs
      
      * style
      Co-authored-by: Lysandre <lysandre.debut@reseau.eseo.fr>
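
      To make the new API concrete, here is a minimal retrieval sketch using the classes this PR adds (one tokenizer per model, as the commits note). It assumes `transformers` and `torch` are installed, a recent enough transformers version that model outputs expose `.pooler_output`, and network access for the `facebook/dpr-*` checkpoints:

      ```python
      # Sketch: embed a question and a passage with the DPR bi-encoder and
      # score their relevance, as a dense retriever would.
      import torch
      from transformers import (
          DPRContextEncoder,
          DPRContextEncoderTokenizer,
          DPRQuestionEncoder,
          DPRQuestionEncoderTokenizer,
      )

      # One tokenizer per model: question and context encoders each have their own.
      q_tok = DPRQuestionEncoderTokenizer.from_pretrained(
          "facebook/dpr-question_encoder-single-nq-base"
      )
      q_enc = DPRQuestionEncoder.from_pretrained(
          "facebook/dpr-question_encoder-single-nq-base"
      )
      ctx_tok = DPRContextEncoderTokenizer.from_pretrained(
          "facebook/dpr-ctx_encoder-single-nq-base"
      )
      ctx_enc = DPRContextEncoder.from_pretrained(
          "facebook/dpr-ctx_encoder-single-nq-base"
      )

      question = "Who wrote Hamlet?"
      passage = "Hamlet is a tragedy written by William Shakespeare."

      with torch.no_grad():
          q_emb = q_enc(**q_tok(question, return_tensors="pt")).pooler_output
          p_emb = ctx_enc(**ctx_tok(passage, return_tensors="pt")).pooler_output

      # DPR ranks passages by the inner product of the two dense embeddings.
      score = (q_emb @ p_emb.T).item()
      print(f"relevance score: {score:.2f}")
      ```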