- 18 Sep, 2020 1 commit
-
-
Dat Quoc Nguyen authored
* Add BERTweet and PhoBERT models
* Update modeling_auto.py: re-add `bart` to LM_MAPPING
* Update tokenization_auto.py: re-add `from .configuration_mobilebert import MobileBertConfig` (not sure why it was replaced by `from transformers.configuration_mobilebert import MobileBertConfig`)
* Add BERTweet and PhoBERT to pretrained_models.rst
* Update tokenization_auto.py: remove BertweetTokenizer and PhobertTokenizer from tokenization_auto.py (they are currently not supported by AutoTokenizer)
* Update BertweetTokenizer - without nltk
* Update model card for BERTweet
* PhoBERT - with Auto mode - without import fastBPE
* PhoBERT - with Auto mode - without import fastBPE
* BERTweet - with Auto mode - without import fastBPE
* Add PhoBERT and BERTweet to TF modeling auto
* Improve Docstrings for PhobertTokenizer and BertweetTokenizer
* Update PhoBERT and BERTweet model cards
* Fixed a merge conflict in tokenization_auto
* Used black to reformat BERTweet- and PhoBERT-related files
* Used isort to reformat BERTweet- and PhoBERT-related files
* Reformatted BERTweet- and PhoBERT-related files based on flake8
* Updated test files
* Updated test files
* Updated tf test files
* Updated tf test files
* Updated tf test files
* Updated tf test files
* Update commits from huggingface
* Delete unnecessary files
* Add tokenizers to auto and init files
* Add test files for tokenizers
* Revised model cards
* Update save_vocabulary function in BertweetTokenizer and PhobertTokenizer and test files
* Revised test files
* Update orders of Phobert and Bertweet tokenizers in auto tokenization file
-
- 15 Jun, 2020 1 commit
-
-
Anthony MOI authored
[HUGE] Refactoring tokenizers backend - padding - truncation - pre-tokenized pipeline - fast tokenizers - tests (#4510)
* Use tokenizers pre-tokenized pipeline
* failing pretokenized test
* Fix is_pretokenized in python
* add pretokenized tests
* style and quality
* better tests for batched pretokenized inputs
* tokenizers clean up - new padding_strategy - split the files
* [HUGE] refactoring tokenizers - padding - truncation - tests
* style and quality
* bump up required tokenizers version to 0.8.0-rc1
* switched padding/truncation API - simpler better backward compat
* updating tests for custom tokenizers
* style and quality - tests on pad
* fix QA pipeline
* fix backward compatibility for max_length only
* style and quality
* Various cleans up - add verbose
* fix tests
* update docstrings
* Fix tests
* Docs reformatted
* __call__ method documented
Co-authored-by: Thomas Wolf <thomwolf@users.noreply.github.com>
Co-authored-by: Lysandre <lysandre.debut@reseau.eseo.fr>
-
- 15 Jan, 2020 1 commit
-
-
Julien Chaumond authored
-
- 06 Jan, 2020 2 commits
-
-
alberduris authored
-
alberduris authored
-
- 22 Dec, 2019 8 commits
-
-
Aymeric Augustin authored
On Python 3, `open is io.open`.
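The identity stated in this commit message can be checked directly. A minimal sketch, assuming a Python 3 interpreter; the function name `open_is_io_open` is hypothetical and only used for illustration:

```python
import io


def open_is_io_open() -> bool:
    """Return True when the builtin open and io.open are the same object."""
    # On Python 3 the builtin name `open` is bound to io.open itself,
    # so a separate `from io import open` adds nothing.
    return open is io.open


if __name__ == "__main__":
    print(open_is_io_open())
```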
-
Aymeric Augustin authored
-
Aymeric Augustin authored
This is the same change as for (TF)CommonTestCases for modeling.
-
Aymeric Augustin authored
-
Aymeric Augustin authored
This construct isn't used anymore these days. Running `python tests/test_foo.py` puts the `tests/` directory on PYTHONPATH, which isn't representative of how we run tests. Use `python -m unittest tests/test_foo.py` instead.
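The path difference described here can be observed with a small experiment. This is a sketch under stated assumptions, not part of the commit: the helper `first_sys_path_entries` and the file `show_path.py` are hypothetical, and the module is launched with `python -m tests.show_path` rather than through `unittest` so that only the effect on `sys.path[0]` is visible.

```python
import os
import subprocess
import sys
import tempfile


def first_sys_path_entries():
    """Return (root, script_run, module_run) for a throwaway project.

    script_run is sys.path[0] when running `python tests/show_path.py`;
    module_run is sys.path[0] when running `python -m tests.show_path`.
    """
    with tempfile.TemporaryDirectory() as tmp:
        root = os.path.realpath(tmp)
        tests_dir = os.path.join(root, "tests")
        os.mkdir(tests_dir)
        # Hypothetical package layout: tests/__init__.py and tests/show_path.py.
        open(os.path.join(tests_dir, "__init__.py"), "w").close()
        with open(os.path.join(tests_dir, "show_path.py"), "w") as f:
            f.write("import os, sys\nprint(os.path.realpath(sys.path[0]))\n")

        def run(*args):
            # Launch a fresh interpreter from the project root.
            result = subprocess.run(
                [sys.executable, *args],
                cwd=root,
                capture_output=True,
                text=True,
                check=True,
            )
            return result.stdout.strip()

        script_run = run(os.path.join("tests", "show_path.py"))
        module_run = run("-m", "tests.show_path")
        return root, script_run, module_run


if __name__ == "__main__":
    root, script_run, module_run = first_sys_path_entries()
    print("script:", os.path.relpath(script_run, root))
    print("module:", os.path.relpath(module_run, root))
```

Run as a script, the first import location is the `tests` directory itself; run with `-m`, it is the project root, which matches how the package is imported in a normal test run.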
-
Aymeric Augustin authored
-
Aymeric Augustin authored
-
Aymeric Augustin authored
This is the result of:
$ isort --recursive examples templates transformers utils hubconf.py setup.py
-
- 21 Dec, 2019 1 commit
-
-
Aymeric Augustin authored
This is the result of:
$ black --line-length 119 examples templates transformers utils hubconf.py setup.py
There are a lot of fairly long lines in the project. As a consequence, I'm picking the longest widely accepted line length, 119 characters. This is also Thomas' preference, because it allows for explicit variable names, which make the code easier to understand.
-
- 08 Oct, 2019 1 commit
-
-
thomwolf authored
-
- 04 Oct, 2019 1 commit
-
-
keskarnitish authored
adding conversion script
adding first draft of modeling & tokenization
adding placeholder for test files
bunch of changes
registering the tokenizer/model/etc
tests
change link; something is very VERY wrong here
weird end-of-word thingy going on
i think the tokenization works now ; wrote the unit tests
overall structure works;load w next
the monster is alive!
works after some cleanup as well
adding emacs autosave to gitignore
currently only supporting the 48 layer one; seems to infer fine on my macbook
cleanup
fixing some documentation
fixing some documentation
tests passing?
now works on CUDA also
adding greedy?
adding greedy sampling
works well
-
- 26 Sep, 2019 2 commits
- 30 Aug, 2019 5 commits
- 05 Aug, 2019 1 commit
-
-
thomwolf authored
-
- 15 Jul, 2019 1 commit
-
-
thomwolf authored
-
- 09 Jul, 2019 2 commits
- 05 Jul, 2019 3 commits
- 02 Jul, 2019 1 commit
-
-
thomwolf authored
-
- 17 Apr, 2019 4 commits
- 16 Apr, 2019 1 commit
-
-
thomwolf authored
-
- 15 Apr, 2019 2 commits
- 11 Feb, 2019 2 commits