1. 07 Apr, 2020 7 commits
  2. 06 Apr, 2020 19 commits
    • Teven · 0a9d09b4
    • Funtowicz Morgan
      Tokenizers v3.0.0 (#3185) · 96ab75b8

      * Renamed num_added_tokens to num_special_tokens_to_add
      Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

      * Cherry-Pick: Partially fix space-only input without special tokens added to the output #3091
      Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

      * Added property is_fast on PretrainedTokenizer and PretrainedTokenizerFast
      Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
      
      * Make fast tokenizers unittests work on Windows.
      
      * Entirely refactored unittest for tokenizers fast.
      
      * Remove ABC class for CommonFastTokenizerTest
      
      * Added embeded_special_tokens tests from allenai @dirkgr
      
      * Make embeded_special_tokens tests from allenai more generic
      
      * Uniformize vocab_size as a property for both Fast and normal tokenizers
      
      * Move special tokens handling out of PretrainedTokenizer (SpecialTokensMixin)
      
      * Ensure providing None input raises the same ValueError as the Python tokenizer + tests.
      
      * Fix invalid input for assert_padding when testing batch_encode_plus
      
      * Move add_special_tokens from constructor to tokenize/encode/[batch_]encode_plus methods parameter.
      
      * Ensure tokenize() correctly forwards add_special_tokens to Rust.
      
      * Adding None checking on top of encode / encode_batch for TransfoXLTokenizerFast.
      Avoid stripping on None values.
      
      * unittests ensure tokenize() also throws a ValueError if provided None
      
      * Added add_special_tokens unittest for all supported models.
      
      * Style
      
      * Make sure TransfoXL tests run only if PyTorch is provided.
      
      * Split up tokenizers tests for each model type.
      
      * Fix invalid unittest with new tokenizers API.
      
      * Filter out Roberta openai detector models from unittests.
      
      * Introduce BatchEncoding on fast tokenizers path.
      
      This new structure exposes all the mappings retrieved from Rust.
      It also keeps the current behavior with model forward.
      
      * Introduce BatchEncoding on slow tokenizers path.
      
      Backward compatibility.
      
      * Improve error message on BatchEncoding for slow path
      
      * Make add_prefix_space True by default on Roberta fast to match Python in the majority of cases.
      
      * Style and format.
      
      * Added typing on all methods for PretrainedTokenizerFast
      
      * Style and format
      
      * Added path for feeding pretokenized (List[str]) input to PretrainedTokenizerFast.
      
      * Style and format
      
      * encode_plus now supports pretokenized inputs.
      
      * Remove user warning about add_special_tokens when working on pretokenized inputs.
      
      * Always go through the post processor.
      
      * Added support for pretokenized input pairs on encode_plus
      
      * Added is_pretokenized flag on encode_plus for clarity and improved error message on input TypeError.
      
      * Added pretokenized inputs support on batch_encode_plus
      
      * Update BatchEncoding methods name to match Encoding.
      
      * Bump setup.py tokenizers dependency to 0.7.0rc1
      
      * Remove unused parameters in BertTokenizerFast
      
      * Make sure Roberta returns token_type_ids for unittests.
      
      * Added missing typings
      
      * Update add_tokens prototype to match tokenizers side and allow AddedToken
      
      * Bumping tokenizers to 0.7.0rc2
      
      * Added documentation for BatchEncoding
      
      * Added (unused) is_pretokenized parameter on PreTrainedTokenizer encode_plus/batch_encode_plus methods.
      
      * Added higher-level typing for tokenize / encode_plus / batch_encode_plus.
      
      * Fix unittests failing because add_special_tokens was defined as a constructor parameter on Rust Tokenizers.
      
      * Fix text-classification pipeline using the wrong tokenizer
      
      * Make pipelines work with BatchEncoding
      
      * Turn off add_special_tokens on tokenize by default.
      Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

      * Remove add_prefix_space from tokenize call in unittest.
      Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

      * Style and quality
      Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

      * Correct message for batch_encode_plus None input exception.
      Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
      
      * Fix invalid list comprehension for offset_mapping overriding content every iteration.
      Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

      * TransfoXL uses Strip normalizer.
      Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

      * Bump tokenizers dependency to 0.7.0rc3
      Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

      * Support AddedTokens for special_tokens and use left stripping on mask for Roberta.
      Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

      * SpecialTokensMixin can use slots for faster access to underlying attributes.
      Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
      
      * Remove update_special_tokens from fast tokenizers.
      
      * Ensure TransfoXL unittests are run only when torch is available.
      
      * Style.
      Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
      
      * Style
      
      * Style 🙏🙏

      * Remove slots on SpecialTokensMixin, need deep dive into pickle protocol.
      
      * Remove Roberta warning on __init__.
      
      * Move documentation to Google style.
      Co-authored-by: LysandreJik <lysandre.debut@reseau.eseo.fr>
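The BatchEncoding structure introduced in this release behaves like the plain dict that slow tokenizers return (backward compatible), while also exposing the extra per-sequence mappings retrieved from Rust on the fast path. A minimal illustrative sketch of that design; the class name, constructor shape, and `tokens()` method here are simplified hypotheticals, not the library's actual implementation:

```python
# Hypothetical, simplified sketch of a BatchEncoding-style wrapper:
# dict access keeps working for model inputs, while the fast-tokenizer
# path can additionally expose the mappings computed in Rust.

class BatchEncodingSketch(dict):
    def __init__(self, data, encodings=None):
        super().__init__(data)       # keys like "input_ids", "attention_mask"
        self._encodings = encodings  # rich per-sequence info, or None on the slow path

    def tokens(self, index=0):
        # Extra mapping only available when a fast (Rust) tokenizer produced it.
        if self._encodings is None:
            raise ValueError("tokens() is only available on the fast-tokenizer path")
        return self._encodings[index]["tokens"]

enc = BatchEncodingSketch(
    {"input_ids": [[101, 7592, 102]], "attention_mask": [[1, 1, 1]]},
    encodings=[{"tokens": ["[CLS]", "hello", "[SEP]"]}],
)
enc["input_ids"]  # dict-style access still works, as before
enc.tokens(0)     # extra mapping exposed on the fast path
```

Constructed this way, existing code that feeds the dict straight into a model forward pass keeps working unchanged, which is the backward-compatibility goal the commit messages describe.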
    • Ethan Perez
      Fix RoBERTa/XLNet Pad Token in run_multiple_choice.py (#3631) · e52d1258
      * Fix RoBERTa/XLNet Pad Token in run_multiple_choice.py
      
      `convert_examples_to_features` sets `pad_token=0` by default, which is correct for BERT but incorrect for RoBERTa (`pad_token=1`) and XLNet (`pad_token=5`). I think the other arguments to `convert_examples_to_features` are correct, but it would be helpful if someone more familiar with this part of the codebase checked.
      
      * Simplifying change to match recent commits
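The fix above matters because each model family reserves a different id for its padding token, so a hardcoded `pad_token=0` silently corrupts RoBERTa and XLNet inputs. A hedged sketch of the lookup (the three ids come from the commit message above; the `pad_token_for` helper name is hypothetical, for illustration only):

```python
# Pad token ids per model family, as noted in the commit above.
PAD_TOKEN_IDS = {
    "bert": 0,     # [PAD] in BERT vocabularies
    "roberta": 1,  # <pad> in RoBERTa vocabularies
    "xlnet": 5,    # <pad> in XLNet vocabularies
}

def pad_token_for(model_type: str) -> int:
    """Hypothetical helper: look the pad id up instead of hardcoding 0."""
    try:
        return PAD_TOKEN_IDS[model_type]
    except KeyError:
        raise ValueError(f"Unknown model type: {model_type}")

pad_token_for("roberta")  # 1, not the BERT default of 0
```

In practice the safest route is to read the id from the loaded tokenizer rather than from any table, which is what querying the tokenizer's pad token achieves.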
    • ktrapeznikov
      Create README.md · 0ac33ddd
    • Manuel Romero
      Add model card · 326e6eba
    • Manuel Romero
      Add model card · 43eca3f8
    • Manuel Romero
      Create README.md · 6bec88ca
    • Manuel Romero
      Add model card (#3655) · 769b60f9
      * Add model card
      
      * Fix model name in fine-tuning script
    • Manuel Romero
      Create model card (#3654) · c4bcb019
      * Create model card
      
      * Fix model name in fine-tuning script
    • Manuel Romero
      Create README.md · 6903a987
    • MichalMalyska
      Create README.md (#3662) · 760872db
    • jjacampos
      Add model card for BERTeus (#3649) · 47e1334c
      * Add model card for BERTeus
      
      * Update README
    • Suchin
      BioMed Roberta-Base (AllenAI) (#3643) · 529534dc

      * added model card
      
      * updated README
      
      * updated README
      
      * updated README
      
      * added evals
      
      * removed pico eval
      
      * Tweaks
      Co-authored-by: Julien Chaumond <chaumond@gmail.com>
    • Lysandre Debut
      Update notebooks (#3620) · 261c4ff4
      * Update notebooks
      
      * From local to global link
      
      * from local links to *actual* global links
    • Julien Chaumond
    • LysandreJik
      Re-pin isort · ea6dba27
    • LysandreJik
      unpin isort for pypi · 11c3257a
    • LysandreJik
      Release: v2.8.0 · 36bffc81
    • Patrick von Platen
      [Generate, Test] Split generate test function into beam search, no beam search (#3601) · 2ee41056
      * split beam search and no beam search test
      
      * fix test
      
      * clean generate tests
  3. 05 Apr, 2020 2 commits
  4. 04 Apr, 2020 7 commits
  5. 03 Apr, 2020 5 commits
    • Max Ryabinin
      Speed up GELU computation with torch.jit (#2988) · c6acd246
      * Compile gelu_new with torchscript
      
      * Compile _gelu_python with torchscript
      
      * Wrap gelu_new with torch.jit for torch>=1.4
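For reference, the `gelu_new` activation compiled in this commit is the standard tanh approximation of GELU; the commit wraps the torch implementation with `torch.jit` for torch>=1.4. A plain-Python sketch of the underlying formula (assumed from the standard definition, not copied from the repo):

```python
import math

def gelu_new(x: float) -> float:
    """tanh approximation of GELU, the formula the commit compiles with torchscript:
    0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))
    """
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))

gelu_new(0.0)  # 0.0: GELU passes zero through unchanged
```

Scripting this elementwise expression fuses its chain of pointwise ops into one kernel, which is where the reported speedup comes from; in the actual model code the input is a tensor and `math` is replaced by `torch` ops.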
    • Lysandre Debut
      ELECTRA (#3257) · d5d7d886
      * Electra wip
      
      * helpers
      
      * Electra wip
      
      * Electra v1
      
      * ELECTRA may be saved/loaded
      
      * Generator & Discriminator
      
      * Embedding size instead of halving the hidden size
      
      * ELECTRA Tokenizer
      
      * Revert BERT helpers
      
      * ELECTRA Conversion script
      
      * Archive maps
      
      * PyTorch tests
      
      * Start fixing tests
      
      * Tests pass
      
      * Same configuration for both models
      
      * Compatible with base + large
      
      * Simplification + weight tying
      
      * Archives
      
      * Auto + Renaming to standard names
      
      * ELECTRA is uncased
      
      * Tests
      
      * Slight API changes
      
      * Update tests
      
      * wip
      
      * ElectraForTokenClassification
      
      * temp
      
      * Simpler arch + tests
      
      Removed ElectraForPreTraining which will be in a script
      
      * Conversion script
      
      * Auto model
      
      * Update links to S3
      
      * Split ElectraForPreTraining and ElectraForTokenClassification
      
      * Actually test PreTraining model
      
      * Remove num_labels from configuration
      
      * wip
      
      * wip
      
      * From discriminator and generator to electra
      
      * Slight API changes
      
      * Better naming
      
      * TensorFlow ELECTRA tests
      
      * Accurate conversion script
      
      * Added to conversion script
      
      * Fast ELECTRA tokenizer
      
      * Style
      
      * Add ELECTRA to README
      
      * Modeling Pytorch Doc + Real style
      
      * TF Docs
      
      * Docs
      
      * Correct links
      
      * Correct model initialized
      
      * random fixes
      
      * style
      
      * Addressing Patrick's and Sam's comments
      
      * Correct links in docs
    • Yohei Tamura
      BertJapaneseTokenizer accept options for mecab (#3566) · 8594dd80
      * BertJapaneseTokenizer accept options for mecab
      
      * black
      
      * fix mecab_option to Optional[str]
    • HUSEIN ZOLKEPLI
      Added albert-base-bahasa-cased README and fixed tiny-bert-bahasa-cased README (#3613) · 216e167c
      * add bert bahasa readme
      
      * update readme
      
      * update readme
      
      * added xlnet
      
      * added tiny-bert and fix xlnet readme
      
      * added albert base
    • ahotrod
      Update README.md (#3604) · 1ac6a246
      Update AutoModel & AutoTokenizer loading.