1. 02 Mar, 2020 1 commit
  2. 24 Feb, 2020 1 commit
  3. 20 Feb, 2020 1 commit
  4. 13 Feb, 2020 1 commit
    • Preserve spaces in GPT-2 tokenizers (#2778) · f1e8a51f
      Joe Davison authored
      * Preserve spaces in GPT-2 tokenizers
      
      Preserves spaces after special tokens in GPT-2 and inherited (RoBERTa)
      tokenizers, enabling correct BPE encoding. Automatically inserts a space
      in front of the first token in the encode function when adding special tokens.
      
      * Add tokenization preprocessing method
      
      * Add framework argument to pipeline factory
      
      Also fixes a pipeline test issue: each test input is now treated as a
      distinct sequence.
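      A toy sketch of why the space preservation described above matters (the vocabulary, ids, and function names below are illustrative, not the actual transformers code): GPT-2's byte-level BPE vocabulary marks a leading space with 'Ġ', so " world" and "world" are distinct tokens, and dropping the space after a special token changes the encoding.

      ```python
      # Hypothetical sketch, not the transformers implementation.

      def prepare_for_tokenization(text, add_prefix_space=True):
          # Mirrors the described fix: insert a space before the first token so the
          # leading word receives the same space-prefixed BPE merge as later words.
          if add_prefix_space and text and not text[0].isspace():
              return " " + text
          return text

      # Toy vocabulary with made-up ids; space-prefixed entries are distinct.
      TOY_VOCAB = {"Ġworld": 1, "world": 2}

      def toy_encode(word):
          # 'Ġ' stands in for a leading space, as in real GPT-2 vocab files.
          key = "Ġ" + word[1:] if word.startswith(" ") else word
          return TOY_VOCAB[key]

      print(toy_encode(" world"))               # space-prefixed entry -> 1
      print(toy_encode("world"))                # bare entry -> 2
      print(prepare_for_tokenization("world"))  # " world"
      ```

      Losing the space thus silently maps a word to a different token id, which is why the fix inserts one ahead of the first token.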
  5. 29 Jan, 2020 2 commits
  6. 06 Jan, 2020 2 commits
  7. 24 Dec, 2019 1 commit
  8. 23 Dec, 2019 1 commit
  9. 22 Dec, 2019 7 commits
  10. 21 Dec, 2019 1 commit
    • Reformat source code with black. · fa84ae26
      Aymeric Augustin authored
      This is the result of:
      
          $ black --line-length 119 examples templates transformers utils hubconf.py setup.py
      
      There are a lot of fairly long lines in the project. As a consequence, I'm
      picking the longest widely accepted line length: 119 characters.
      
      This is also Thomas' preference, because it allows for explicit variable
      names, which make the code easier to understand.
  11. 20 Dec, 2019 2 commits
  12. 13 Dec, 2019 1 commit
  13. 06 Dec, 2019 2 commits
    • Fix bug which lowercases special tokens · 2670b0d6
      Michael Watkins authored
    • Remove dependency on pytest for running tests (#2055) · 35401fe5
      Aymeric Augustin authored
      * Switch to plain unittest for skipping slow tests.
      
      Add a RUN_SLOW environment variable for running them.
      
      * Switch to plain unittest for PyTorch dependency.
      
      * Switch to plain unittest for TensorFlow dependency.
      
      * Avoid leaking open files in the test suite.
      
      This prevents spurious warnings when running tests.
      
      * Fix unicode warning on Python 2 when running tests.
      
      The warning was:
      
          UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
      
      * Support running PyTorch tests on a GPU.
      
      Reverts 27e015bd.
      
      * Tests no longer require pytest.
      
      * Make tests pass on cuda
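      The RUN_SLOW pattern above can be sketched with plain unittest roughly as follows (the decorator name, message, and test class are illustrative, not the exact transformers helpers):

      ```python
      # Hedged sketch: gate slow tests behind a RUN_SLOW environment variable
      # using unittest.skip instead of pytest markers.
      import os
      import unittest

      def slow(test_case):
          # Skip the decorated test unless RUN_SLOW is set to a truthy value.
          if os.environ.get("RUN_SLOW", "").lower() not in ("1", "true", "yes"):
              return unittest.skip("test is slow; set RUN_SLOW=1 to run it")(test_case)
          return test_case

      class ExampleTest(unittest.TestCase):
          @slow
          def test_big_model(self):
              self.assertTrue(True)  # placeholder for an expensive model test
      ```

      Run normally, the test is reported as skipped; with RUN_SLOW=1 in the environment, it executes. Because this uses only unittest, no pytest installation is required.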
  14. 04 Dec, 2019 1 commit
  15. 22 Nov, 2019 2 commits
  16. 12 Nov, 2019 2 commits
    • Fix special tokens addition in decoder · 74d0bcb6
      Lysandre authored
    • Consider do_lower_case in PreTrainedTokenizer · 7246d3c2
      Michael Watkins authored
      As pointed out in #1545, when using an uncased model and adding a new
      uncased token, the tokenizer does not correctly identify the token when
      the input text contains it in a cased form.
      
      For instance, if we load bert-base-uncased into BertTokenizer, and
      then use .add_tokens() to add "cool-token", we get the expected
      result for .tokenize('this is a cool-token'). However, we get a
      possibly unexpected result for .tokenize('this is a cOOl-Token'),
      which in fact mirrors the result for the former from before the new
      token was added.
      
      This commit adds:
      - functionality to PreTrainedTokenizer to handle this situation in case
        a tokenizer (currently Bert, DistilBert, and XLNet) has the
        do_lower_case=True kwarg, by:
          1) lowercasing tokens added with .add_tokens()
          2) lowercasing text at the beginning of .tokenize()
      - a new common test case for tokenizers
      
      https://github.com/huggingface/transformers/issues/1545
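      The two-part fix can be illustrated with a toy tokenizer (the class and its fallback splitting rule are made up for the example, not the transformers implementation):

      ```python
      # Toy sketch of the do_lower_case fix: lowercase tokens on add_tokens()
      # and lowercase text on tokenize(), so "cOOl-Token" matches "cool-token".

      class ToyTokenizer:
          def __init__(self, do_lower_case=False):
              self.do_lower_case = do_lower_case
              self.added_tokens = set()

          def add_tokens(self, tokens):
              if self.do_lower_case:
                  tokens = [t.lower() for t in tokens]  # fix 1: lowercase on add
              self.added_tokens.update(tokens)

          def tokenize(self, text):
              if self.do_lower_case:
                  text = text.lower()                   # fix 2: lowercase input
              out = []
              for word in text.split():
                  if word in self.added_tokens:
                      out.append(word)                  # added token kept whole
                  else:
                      out.extend(word.split("-"))       # toy fallback splitting
              return out

      tok = ToyTokenizer(do_lower_case=True)
      tok.add_tokens(["cool-token"])
      print(tok.tokenize("this is a cOOl-Token"))
      # ['this', 'is', 'a', 'cool-token'] -- the added token survives intact
      ```

      Without the two lowercasing steps, "cOOl-Token" would miss the added-token lookup and be split by the fallback rule, mirroring the unexpected behavior described in #1545.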
  17. 04 Nov, 2019 1 commit
  18. 22 Oct, 2019 1 commit
  19. 04 Oct, 2019 2 commits
  20. 03 Oct, 2019 5 commits
  21. 26 Sep, 2019 1 commit
  22. 24 Sep, 2019 2 commits