- 22 Dec, 2019 4 commits
  - Aymeric Augustin authored
    This is the same change as for (TF)CommonTestCases for modeling.
  - Aymeric Augustin authored
  - Aymeric Augustin authored
  - Aymeric Augustin authored
    This is the result of: $ isort --recursive examples templates transformers utils hubconf.py setup.py
- 21 Dec, 2019 1 commit
  - Aymeric Augustin authored
    This is the result of: $ black --line-length 119 examples templates transformers utils hubconf.py setup.py
    There are a lot of fairly long lines in the project, so I'm picking the longest widely accepted line length, 119 characters. This is also Thomas' preference, because it allows for explicit variable names, which make the code easier to understand.
- 20 Dec, 2019 2 commits
- 13 Dec, 2019 1 commit
  - LysandreJik authored
- 06 Dec, 2019 2 commits
  - Michael Watkins authored
  - Aymeric Augustin authored
    * Switch to plain unittest for skipping slow tests; add a RUN_SLOW environment variable for running them.
    * Switch to plain unittest for the PyTorch dependency.
    * Switch to plain unittest for the TensorFlow dependency.
    * Avoid leaking open files in the test suite. This prevents spurious warnings when running tests.
    * Fix a Unicode warning on Python 2 when running tests. The warning was: "UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal".
    * Support running PyTorch tests on a GPU. Reverts 27e015bd.
    * Tests no longer require pytest.
    * Make tests pass on CUDA.
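The RUN_SLOW mechanism described in the commit above can be sketched roughly as follows. This is a hypothetical illustration, not the repository's actual implementation: the `slow` decorator name and the skip message are assumptions.

```python
import os
import unittest

def slow(test_case):
    """Skip a test unless the RUN_SLOW environment variable is set.

    Hypothetical sketch of the pattern the commit message describes;
    the real decorator in the repository may differ.
    """
    if not os.environ.get("RUN_SLOW"):
        return unittest.skip("test is slow; set RUN_SLOW=1 to run it")(test_case)
    return test_case

class ExampleTest(unittest.TestCase):
    @slow
    def test_heavy_model(self):
        # Only runs when RUN_SLOW is set, e.g. RUN_SLOW=1 python -m unittest
        self.assertTrue(True)
```

Because the environment variable is checked at decoration time, the skipped tests still show up as "skipped" in the unittest report rather than silently disappearing.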
- 04 Dec, 2019 1 commit
  - LysandreJik authored
- 22 Nov, 2019 2 commits
  - LysandreJik authored
  - LysandreJik authored
- 12 Nov, 2019 2 commits
  - Lysandre authored
  - Michael Watkins authored
    As pointed out in #1545, when using an uncased model and adding a new uncased token, the tokenizer does not correctly identify the token when the input text contains it in a cased format. For instance, if we load bert-base-uncased into BertTokenizer and then use .add_tokens() to add "cool-token", we get the expected result for .tokenize('this is a cool-token'). However, we get a possibly unexpected result for .tokenize('this is a cOOl-Token'), which in fact mirrors the result for the former from before the new token was added. This commit adds:
    - functionality to PreTrainedTokenizer to handle this situation when a tokenizer (currently Bert, DistilBert, and XLNet) has the do_lower_case=True kwarg, by:
      1) lowercasing tokens added with .add_tokens()
      2) lowercasing text at the beginning of .tokenize()
    - a new common test case for tokenizers
    https://github.com/huggingface/transformers/issues/1545
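The behavior this commit describes can be illustrated with a toy tokenizer. `ToyTokenizer` and its character-level fallback are invented for illustration and are not the real PreTrainedTokenizer API:

```python
class ToyTokenizer:
    """Toy illustration of the do_lower_case fix: added tokens are
    lowercased when registered, and input text is lowercased before
    tokenizing, so 'cOOl-Token' still matches the added 'cool-token'.
    Not the real PreTrainedTokenizer API.
    """

    def __init__(self, do_lower_case=True):
        self.do_lower_case = do_lower_case
        self.added_tokens = set()

    def add_tokens(self, tokens):
        for token in tokens:
            # 1) lowercase tokens added with .add_tokens()
            self.added_tokens.add(token.lower() if self.do_lower_case else token)

    def tokenize(self, text):
        # 2) lowercase text at the beginning of .tokenize()
        if self.do_lower_case:
            text = text.lower()
        out = []
        for word in text.split():
            if word in self.added_tokens:
                out.append(word)        # added token survives as one piece
            else:
                out.extend(list(word))  # stand-in for real subword splitting
        return out

tok = ToyTokenizer(do_lower_case=True)
tok.add_tokens(["cool-token"])
print(tok.tokenize("this is a cOOl-Token"))  # 'cool-token' appears as a single token
```

Without both lowercasing steps, the cased input would fall through to the subword path, reproducing the pre-fix behavior from the issue.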
- 04 Nov, 2019 1 commit
  - thomwolf authored
- 22 Oct, 2019 1 commit
  - Lysandre authored
- 04 Oct, 2019 2 commits
- 03 Oct, 2019 5 commits
  - LysandreJik authored
  - LysandreJik authored
  - LysandreJik authored
  - LysandreJik authored
  - LysandreJik authored
- 26 Sep, 2019 1 commit
  - thomwolf authored
- 24 Sep, 2019 4 commits
  - thomwolf authored
  - LysandreJik authored
  - LysandreJik authored
  - LysandreJik authored
- 19 Sep, 2019 9 commits
  - LysandreJik authored
  - LysandreJik authored
    prepare_for_model and prepare_pair_for_model methods. Added an option to select which sequence will be truncated.
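The "select which sequence will be truncated" option mentioned above could look roughly like the sketch below. The function name and strategy strings are hypothetical and do not reflect the actual prepare_for_model signature:

```python
def truncate_pair(ids_a, ids_b, max_length, strategy="longest_first"):
    """Hypothetical sketch of selecting which sequence gets truncated.

    strategy: 'only_first' trims ids_a, 'only_second' trims ids_b,
    'longest_first' trims the currently longer sequence one token at a time.
    """
    ids_a, ids_b = list(ids_a), list(ids_b)
    while len(ids_a) + len(ids_b) > max_length:
        if strategy == "only_first":
            ids_a.pop()       # always shorten the first sequence
        elif strategy == "only_second":
            ids_b.pop()       # always shorten the second sequence
        else:                 # longest_first: balance the two lengths
            (ids_a if len(ids_a) > len(ids_b) else ids_b).pop()
    return ids_a, ids_b
```

Trimming one token at a time keeps the 'longest_first' strategy balanced: the two sequences end up within one token of each other in length when both exceed the budget.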
  - LysandreJik authored
  - LysandreJik authored
  - LysandreJik authored
  - LysandreJik authored
  - LysandreJik authored
  - LysandreJik authored
  - LysandreJik authored
- 05 Sep, 2019 1 commit
  - thomwolf authored
- 02 Sep, 2019 1 commit
  - thomwolf authored