- 25 Jun, 2020 1 commit
-
-
Thomas Wolf authored
* avoid recursion in id checks for fast tokenizers
* better typings and fix #5232
* align slow and fast tokenizers behaviors for Roberta and GPT2
* style and quality
* fix tests - improve typings
-
- 23 Jun, 2020 1 commit
-
-
Thomas Wolf authored
* Add return lengths
* make pad a bit more flexible so it can be used as collate_fn
* check all kwargs sent to encoding method are known
* fixing kwargs in encodings
* New AddedToken class in python
  This class lets you specify specific tokenization behaviors for some special tokens. Used in particular for GPT2 and Roberta, to control how white spaces are stripped around special tokens.
* style and quality
* switched to huggingface tokenizers library for AddedTokens
* up to tokenizer 0.8.0-rc3 - update API to use AddedToken state
* style and quality
* do not raise an error on additional or unused kwargs for tokenize() but only a warning
* transfo-xl pretrained model requires torch
* Update src/transformers/tokenization_utils.py

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
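
The "pad ... can be used as collate_fn" item above refers to padding a list of already-encoded examples into one rectangular batch. A minimal pure-Python sketch of that idea follows. It is not the library's `pad` implementation: `pad_batch`, its parameters, and the `input_ids` / `attention_mask` dictionary layout are assumptions made for illustration.

```python
from typing import Dict, List


def pad_batch(
    features: List[Dict[str, List[int]]], pad_token_id: int = 0
) -> Dict[str, List[List[int]]]:
    """Hypothetical collate helper: right-pad every example to the longest one."""
    max_len = max(len(f["input_ids"]) for f in features)
    batch: Dict[str, List[List[int]]] = {"input_ids": [], "attention_mask": []}
    for f in features:
        ids = f["input_ids"]
        n_pad = max_len - len(ids)
        # Real tokens get mask 1, padding positions get mask 0.
        batch["input_ids"].append(ids + [pad_token_id] * n_pad)
        batch["attention_mask"].append([1] * len(ids) + [0] * n_pad)
    return batch


if __name__ == "__main__":
    examples = [{"input_ids": [5, 6, 7]}, {"input_ids": [8]}]
    print(pad_batch(examples, pad_token_id=1))
```

A function with this shape (list of per-example dicts in, one batched dict out) is what a data loader's collate hook expects, which is presumably why the commit made `pad` accept such input.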
-
- 15 Jun, 2020 1 commit
-
-
Anthony MOI authored
[HUGE] Refactoring tokenizers backend - padding - truncation - pre-tokenized pipeline - fast tokenizers - tests (#4510)

* Use tokenizers pre-tokenized pipeline
* failing pretokenized test
* Fix is_pretokenized in python
* add pretokenized tests
* style and quality
* better tests for batched pretokenized inputs
* tokenizers clean up - new padding_strategy - split the files
* [HUGE] refactoring tokenizers - padding - truncation - tests
* style and quality
* bump up required tokenizers version to 0.8.0-rc1
* switched padding/truncation API - simpler better backward compat
* updating tests for custom tokenizers
* style and quality - tests on pad
* fix QA pipeline
* fix backward compatibility for max_length only
* style and quality
* Various clean-ups - add verbose
* fix tests
* update docstrings
* Fix tests
* Docs reformatted
* __call__ method documented

Co-authored-by: Thomas Wolf <thomwolf@users.noreply.github.com>
Co-authored-by: Lysandre <lysandre.debut@reseau.eseo.fr>
-
- 20 May, 2020 1 commit
-
-
Lysandre Debut authored
* There is one missing key in BERT
* Correct device for CamemBERT model
* RoBERTa tokenization adding prefix space
* Style
-
- 13 Feb, 2020 1 commit
-
-
Joe Davison authored
* Preserve spaces in GPT-2 tokenizers
  Preserves spaces after special tokens in GPT-2 and inherited (RoBERTa) tokenizers, enabling correct BPE encoding. Automatically inserts a space in front of first token in encode function when adding special tokens.
* Add tokenization preprocessing method
* Add framework argument to pipeline factory
  Also fixes pipeline test issue. Each test input now treated as a distinct sequence.
-
- 15 Jan, 2020 1 commit
-
-
Julien Chaumond authored
-
- 06 Jan, 2020 2 commits
-
-
alberduris authored
-
alberduris authored
-
- 22 Dec, 2019 8 commits
-
-
Aymeric Augustin authored
On Python 3, `open is io.open`.
-
Aymeric Augustin authored
-
Aymeric Augustin authored
This is the same change as for (TF)CommonTestCases for modeling.
-
Aymeric Augustin authored
-
Aymeric Augustin authored
This construct isn't used anymore these days. Running `python tests/test_foo.py` puts the tests/ directory on PYTHONPATH, which isn't representative of how we run tests. Use `python -m unittest tests/test_foo.py` instead.
-
Aymeric Augustin authored
-
Aymeric Augustin authored
-
Aymeric Augustin authored
This is the result of:

$ isort --recursive examples templates transformers utils hubconf.py setup.py
-
- 21 Dec, 2019 1 commit
-
-
Aymeric Augustin authored
This is the result of:

$ black --line-length 119 examples templates transformers utils hubconf.py setup.py

There's a lot of fairly long lines in the project. As a consequence, I'm picking the longest widely accepted line length, 119 characters. This is also Thomas' preference, because it allows for explicit variable names, to make the code easier to understand.
-
- 06 Dec, 2019 1 commit
-
-
Aymeric Augustin authored
* Switch to plain unittest for skipping slow tests. Add a RUN_SLOW environment variable for running them.
* Switch to plain unittest for PyTorch dependency.
* Switch to plain unittest for TensorFlow dependency.
* Avoid leaking open files in the test suite. This prevents spurious warnings when running tests.
* Fix unicode warning on Python 2 when running tests. The warning was: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
* Support running PyTorch tests on a GPU. Reverts 27e015bd.
* Tests no longer require pytest.
* Make tests pass on cuda
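
Skipping slow tests with plain unittest and a RUN_SLOW environment variable, as described in the commit above, can be sketched as follows. This is only an illustration of the approach, not the project's actual helper: the names `parse_flag_from_env` and `slow`, and the set of accepted truthy strings, are assumptions.

```python
import os
import unittest


def parse_flag_from_env(key, default=False):
    """Hypothetical helper: read a boolean flag from the environment."""
    value = os.environ.get(key)
    if value is None:
        return default
    return value.strip().lower() in {"1", "true", "yes", "on"}


def slow(test_case):
    """Skip the decorated test unless RUN_SLOW is truthy (checked at decoration time)."""
    return unittest.skipUnless(parse_flag_from_env("RUN_SLOW"), "test is slow")(test_case)


class ExampleTest(unittest.TestCase):
    def test_fast(self):
        self.assertEqual(1 + 1, 2)

    @slow
    def test_slow(self):
        # Stand-in for an expensive test, e.g. one that downloads a model.
        self.assertTrue(True)


if __name__ == "__main__":
    unittest.main()
```

Under these assumptions, running the file with `RUN_SLOW=1` set in the environment would include `test_slow`, while a run without it reports that test as skipped. Because only `unittest.skipUnless` is used, no pytest marker or plugin is needed.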
-
- 04 Nov, 2019 1 commit
-
-
thomwolf authored
-
- 22 Oct, 2019 1 commit
-
-
Lysandre authored
-
- 04 Oct, 2019 1 commit
-
-
thomwolf authored
-
- 26 Sep, 2019 2 commits
- 19 Sep, 2019 1 commit
-
-
LysandreJik authored
-
- 30 Aug, 2019 5 commits
- 13 Aug, 2019 1 commit
-
-
LysandreJik authored
-
- 12 Aug, 2019 1 commit
-
-
LysandreJik authored
-
- 09 Aug, 2019 1 commit
-
-
LysandreJik authored
-
- 08 Aug, 2019 1 commit
-
-
LysandreJik authored
-
- 07 Aug, 2019 1 commit
-
-
LysandreJik authored
-
- 05 Aug, 2019 2 commits
-
-
Julien Chaumond authored
-
Julien Chaumond authored
-