- 02 Mar, 2020 1 commit
-
-
Patrick von Platen authored
* force pad_token_id to be set before padding
* fix tests and forbid padding without having a padding_token_id set
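The rule this commit describes (no padding unless a pad token id is set) can be illustrated with a minimal sketch. The `pad_batch` helper below is hypothetical and is not the library's implementation; it only assumes that token ids are plain Python lists of integers.

```python
def pad_batch(sequences, pad_token_id=None):
    """Right-pad lists of token ids to a common length (hypothetical helper)."""
    # Refuse to pad when no pad token id is configured, rather than
    # silently padding with an arbitrary value.
    if pad_token_id is None:
        raise ValueError("pad_token_id must be set before padding")
    max_len = max((len(seq) for seq in sequences), default=0)
    return [seq + [pad_token_id] * (max_len - len(seq)) for seq in sequences]
```

Calling `pad_batch([[1, 2, 3], [4]], pad_token_id=0)` pads the shorter sequence on the right, while omitting `pad_token_id` raises a `ValueError` instead of guessing a value.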
-
- 24 Feb, 2020 1 commit
-
-
Lysandre Debut authored
* Testing that encode_plus and batch_encode_plus behave the same way
  Spoiler alert: they don't
* Testing rest of arguments in batch_encode_plus
* Test tensor return in batch_encode_plus
* Addressing Sam's comments
* flake8
* Simplified with `num_added_tokens`
-
- 20 Feb, 2020 1 commit
-
-
Joe Davison authored
-
- 13 Feb, 2020 1 commit
-
-
Joe Davison authored
* Preserve spaces in GPT-2 tokenizers
  Preserves spaces after special tokens in GPT-2 and inherited (RoBERTa) tokenizers, enabling correct BPE encoding. Automatically inserts a space in front of the first token in the encode function when adding special tokens.
* Add tokenization preprocessing method
* Add framework argument to pipeline factory
  Also fixes pipeline test issue. Each test input is now treated as a distinct sequence.
-
- 29 Jan, 2020 2 commits
- 06 Jan, 2020 2 commits
-
-
alberduris authored
-
alberduris authored
-
- 24 Dec, 2019 1 commit
-
-
Anthony MOI authored
-
- 23 Dec, 2019 1 commit
-
-
Aymeric Augustin authored
-
- 22 Dec, 2019 7 commits
-
-
Aymeric Augustin authored
On Python 3, `open is io.open`.
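The identity stated in this commit message can be checked directly. The snippet below is a small illustration; `read_text` is a hypothetical helper and is not taken from the repository.

```python
import io

# On Python 3 the built-in open is the very same object as io.open,
# so a compatibility import such as `from io import open` is redundant.
assert open is io.open


def read_text(path):
    """Read a UTF-8 text file with the built-in open (hypothetical helper)."""
    with open(path, encoding="utf-8") as handle:
        return handle.read()
```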
-
Aymeric Augustin authored
-
Aymeric Augustin authored
-
Aymeric Augustin authored
This is the same change as for (TF)CommonTestCases for modeling.
-
Aymeric Augustin authored
-
Aymeric Augustin authored
-
Aymeric Augustin authored
This is the result of:

    $ isort --recursive examples templates transformers utils hubconf.py setup.py
-
- 21 Dec, 2019 1 commit
-
-
Aymeric Augustin authored
This is the result of:

    $ black --line-length 119 examples templates transformers utils hubconf.py setup.py

There are a lot of fairly long lines in the project. As a consequence, I'm picking the longest widely accepted line length, 119 characters. This is also Thomas' preference, because it allows for explicit variable names, to make the code easier to understand.
-
- 20 Dec, 2019 2 commits
- 13 Dec, 2019 1 commit
-
-
LysandreJik authored
-
- 06 Dec, 2019 2 commits
-
-
Michael Watkins authored
-
Aymeric Augustin authored
* Switch to plain unittest for skipping slow tests.
  Add a RUN_SLOW environment variable for running them.
* Switch to plain unittest for PyTorch dependency.
* Switch to plain unittest for TensorFlow dependency.
* Avoid leaking open files in the test suite.
  This prevents spurious warnings when running tests.
* Fix unicode warning on Python 2 when running tests.
  The warning was: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
* Support running PyTorch tests on a GPU.
  Reverts 27e015bd.
* Tests no longer require pytest.
* Make tests pass on cuda
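Skipping slow tests with plain unittest and a RUN_SLOW environment variable might look roughly like the sketch below. The names `parse_flag_from_env`, `slow`, and `ExampleTest` are assumptions made for illustration; the commit message does not show the actual helpers.

```python
import os
import unittest


def parse_flag_from_env(key, default=False):
    """Interpret an environment variable as a boolean (hypothetical helper)."""
    value = os.environ.get(key)
    if value is None:
        return default
    return value.strip().lower() in {"1", "true", "yes", "y", "on"}


def slow(test_case):
    """Skip the decorated test unless RUN_SLOW is set to a truthy value.

    The flag is read when the decorator is applied, i.e. at import time.
    """
    return unittest.skipUnless(parse_flag_from_env("RUN_SLOW"), "test is slow")(test_case)


class ExampleTest(unittest.TestCase):
    def test_fast(self):
        self.assertEqual(1 + 1, 2)

    @slow
    def test_slow(self):
        # Stand-in for an expensive test, e.g. one that downloads a model.
        self.assertEqual(sum(range(10)), 45)
```

With this layout, `python -m unittest` reports `test_slow` as skipped, and `RUN_SLOW=1 python -m unittest` runs it; no pytest marker is needed.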
-
- 04 Dec, 2019 1 commit
-
-
LysandreJik authored
-
- 22 Nov, 2019 2 commits
-
-
LysandreJik authored
-
LysandreJik authored
-
- 12 Nov, 2019 2 commits
-
-
Lysandre authored
-
Michael Watkins authored
As pointed out in #1545, when using an uncased model and adding a new uncased token, the tokenizer does not correctly identify the token when the input text contains it in a cased format.

For instance, if we load bert-base-uncased into BertTokenizer, and then use .add_tokens() to add "cool-token", we get the expected result for .tokenize('this is a cool-token'). However, we get a possibly unexpected result for .tokenize('this is a cOOl-Token'), which in fact mirrors the result for the former from before the new token was added.

This commit adds
- functionality to PreTrainedTokenizer to handle this situation in case a tokenizer (currently Bert, DistilBert, and XLNet) has the do_lower_case=True kwarg by:
  1) lowercasing tokens added with .add_tokens()
  2) lowercasing text at the beginning of .tokenize()
- new common test case for tokenizers

https://github.com/huggingface/transformers/issues/1545
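The two lowercasing steps described in this commit message can be sketched with a toy tokenizer. `ToyTokenizer` is hypothetical and far simpler than PreTrainedTokenizer: it splits on whitespace and, as a stand-in for sub-word splitting, breaks unknown words on hyphens.

```python
class ToyTokenizer:
    """Hypothetical whitespace tokenizer; not the library's PreTrainedTokenizer."""

    def __init__(self, do_lower_case=False):
        self.do_lower_case = do_lower_case
        self.added_tokens = set()

    def add_tokens(self, new_tokens):
        """Register new tokens and return how many were actually added."""
        added = 0
        for token in new_tokens:
            # Step 1: lowercase added tokens for uncased tokenizers.
            if self.do_lower_case:
                token = token.lower()
            if token not in self.added_tokens:
                self.added_tokens.add(token)
                added += 1
        return added

    def tokenize(self, text):
        """Split text into tokens, keeping added tokens whole."""
        # Step 2: lowercase the text before matching added tokens.
        if self.do_lower_case:
            text = text.lower()
        pieces = []
        for word in text.split():
            if word in self.added_tokens:
                pieces.append(word)
            else:
                # Stand-in for sub-word splitting: break on hyphens.
                pieces.extend(word.replace("-", " - ").split())
        return pieces
```

In this sketch, once "cool-token" has been added to an uncased instance, both 'this is a cool-token' and 'this is a cOOl-Token' tokenize to the same pieces, with the added token kept whole.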
-
- 04 Nov, 2019 1 commit
-
-
thomwolf authored
-
- 22 Oct, 2019 1 commit
-
-
Lysandre authored
-
- 04 Oct, 2019 2 commits
- 03 Oct, 2019 5 commits
-
-
LysandreJik authored
-
LysandreJik authored
-
LysandreJik authored
-
LysandreJik authored
-
LysandreJik authored
-
- 26 Sep, 2019 1 commit
-
-
thomwolf authored
-
- 24 Sep, 2019 2 commits
-
-
thomwolf authored
-
LysandreJik authored
-