- 20 Feb, 2020 3 commits
-
-
Scott Gigante authored
-
Funtowicz Morgan authored
* Remove warning when pad_to_max_length is not set.
* Move the RoBERTa warning to RoBERTa and not the GPT2 base tokenizer.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
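For context, a minimal sketch of the flag this warning concerned, using the encode_plus API of that era (the model name is illustrative):

```python
from transformers import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
# pad_to_max_length defaults to False; leaving it unset no longer warns.
encoded = tokenizer.encode_plus(
    "Hello world",
    max_length=16,
    pad_to_max_length=True,  # pad input_ids/attention_mask up to max_length
)
print(len(encoded["input_ids"]))  # 16
```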
-
Funtowicz Morgan authored
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
-
- 19 Feb, 2020 6 commits
-
-
Funtowicz Morgan authored
* Correctly return the tuple of generated file(s) when calling save_pretrained
* Quality and format.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
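A quick sketch of the corrected return value (paths are illustrative):

```python
import os

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
os.makedirs("./my-tokenizer", exist_ok=True)  # the target directory must exist
# save_pretrained now returns the tuple of file paths it actually wrote
# (vocabulary, special-tokens map, tokenizer config, ...).
saved_files = tokenizer.save_pretrained("./my-tokenizer")
print(saved_files)
```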
-
Lysandre authored
-
Lysandre authored
-
Funtowicz Morgan authored
* Override build_inputs_with_special_tokens for fast impl + unittest.
* Quality + format.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
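A short sketch of the method being overridden, assuming a BERT-style fast tokenizer:

```python
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
ids_a = tokenizer.encode("first sentence", add_special_tokens=False)
ids_b = tokenizer.encode("second one", add_special_tokens=False)
# For BERT-like models this produces: [CLS] ids_a [SEP] ids_b [SEP]
pair = tokenizer.build_inputs_with_special_tokens(ids_a, ids_b)
```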
-
Lysandre authored
Welcome Rust Tokenizers
-
Funtowicz Morgan authored
* Implemented fast version of tokenizers
* Bumped tokenizers version requirements to latest 0.2.1
* Added matching tests
* Matching OpenAI GPT tokenization!
* Matching GPT2 on tokenizers
* Expose add_prefix_space as constructor parameter for GPT2
* Matching Roberta tokenization!
* Removed fast implementation of CTRL.
* Binding TransformerXL tokenizers to Rust.
* Updating tests accordingly.
* Added tokenizers as top-level modules.
* Black & isort.
* Rename LookupTable to WordLevel to match Rust side.
* Black.
* Use "fast" suffix instead of "ru" for Rust tokenizer implementations.
* Introduce tokenize() method on fast tokenizers.
* encode_plus dispatches to batch_encode_plus
* batch_encode_plus now dispatches to encode if there is only one input element.
* Bind all the encode_plus parameters to the forwarded batch_encode_plus call.
* Bump tokenizers dependency to 0.3.0
* Formatting.
* Fix tokenization_auto with support for new (python, fast) mapping schema.
* Give correct fixtures path in test_tokenization_fast.py for the CLI.
* Expose max_len_ properties on BertTokenizerFast
* Move max_len_ properties to PreTrainedTokenizerFast and override in specific subclasses.
* _convert_encoding should keep the batch axis tensor if only one sample in the batch.
* Add warning message for RobertaTokenizerFast if used for MLM.
* Added use_fast (bool) parameter on AutoTokenizer.from_pretrained(). This allows easily enabling/disabling Rust-based tokenizer instantiation.
* Let tokenizers handle all the truncation and padding stuff.
* Allow providing tokenizer arguments during pipeline creation.
* Update test_fill_mask pipeline to not use fast tokenizers.
* Fix too many parameters for convert_encoding.
* When enabling padding, max_length should be set to None.
* Avoid returning nested tensors of length 1 when calling encode_plus
* Ensure output is padded when return_tensor is not None. Tensor creation requires the initial list input to be of the exact same size.
* Disable transfoxl unittest if PyTorch is not available (required to load the model)
* encode_plus should not remove the leading batch axis if return_tensor is set
* Temporarily disable fast tokenizers on QA pipelines.
* Fix formatting issues.
* Update tokenizers to 0.4.0
* Update style
* Enable truncation + stride unit test on fast tokenizers.
* Add unittest ensuring the special_tokens sets match between Python and Rust.
* Ensure special_tokens are correctly set during construction.
* Give more warning feedback to the user in case of padding without pad_token.
* Quality & format.
* Added the possibility to add a single token as str
* Added unittest for add_tokens and add_special_tokens on fast tokenizers.
* Fix rebase mismatch on pipelines QA default model. QA requires cased input while the tokenizers would be uncased.
* Addressing review comment: Using offset mapping relative to the original string + unittest.
* Addressing review comment: save_vocabulary requires folder and file name
* Addressing review comment: Simplify import for Bert.
* Addressing review comment: truncate_and_pad disables padding according to the same heuristic as the one enabling padding.
* Addressing review comment: Remove private member access in tokenize()
* Addressing review comment: Bump tokenizers dependency to 0.4.2
* Format & quality.
* Addressing review comment: Use named arguments when applicable.
* Addressing review comment: Add GitHub link to Roberta/GPT2 space issue on masked input.
* Addressing review comment: Move max_len_single_sentence / max_len_sentences_pair to PreTrainedTokenizerFast + tests.
* Addressing review comment: Relax type checking to include tuple and list objects.
* Addressing review comment: Document the truncate_and_pad manager behavior.
* Raise an exception if return_offsets_mapping is not available with the current tokenizer.
* Ensure padding is set on the tokenizers before setting any padding strategy + unittest.
* On PyTorch we need to stack tensors to get a proper new axis.
* Generalize tests to different frameworks, removing hard-written return_tensors="..."
* Bump tokenizer dependency for num_special_tokens_to_add
* Overflowing tokens in batch_encode_plus are now stacked over the batch axis.
* Improved error message for padding strategy without pad token.
* Bumping tokenizers dependency to 0.5.0 for release.
* Optimizing convert_encoding, around 4x improvement. 🚀
* Expose pad_to_max_length in encode_plus to avoid duplicating the parameters in kwargs
* Generate a proper overflow_to_sampling_mapping when return_overflowing_tokens is True.
* Fix unittests for overflow_to_sampling_mapping not being returned as tensor.
* Format & quality.
* Remove perfect alignment constraint for Roberta (allowing 1% difference max)
* Triggering final CI

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
Co-authored-by: MOI Anthony <xn1t0x@gmail.com>
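Taken together, the new surface looks roughly like the sketch below (API as of transformers ~2.5; the model name is illustrative):

```python
from transformers import AutoTokenizer

# use_fast toggles the Rust-backed implementation introduced here.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=True)

# Offset mappings are a fast-only feature; the Python tokenizers now raise
# if return_offsets_mapping is requested.
enc = tokenizer.encode_plus(
    "Hello, Rust tokenizers!",
    return_offsets_mapping=True,
)
print(enc["offset_mapping"])  # character offsets into the original string

# Batch encoding with truncation and padding handled by the Rust side.
batch = tokenizer.batch_encode_plus(
    ["first example", "a second, slightly longer example"],
    pad_to_max_length=True,
)
```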
-
- 14 Feb, 2020 1 commit
-
-
Julien Chaumond authored
-
- 13 Feb, 2020 2 commits
-
-
Joe Davison authored
* Preserve spaces in GPT-2 tokenizers. Preserves spaces after special tokens in GPT-2 and inherited (RoBERTa) tokenizers, enabling correct BPE encoding. Automatically inserts a space in front of the first token in the encode function when adding special tokens.
* Add tokenization preprocessing method
* Add framework argument to pipeline factory. Also fixes a pipeline test issue: each test input is now treated as a distinct sequence.
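A hedged sketch of the behavior change (the added special token is illustrative):

```python
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.add_special_tokens({"additional_special_tokens": ["<SPECIAL>"]})
# Previously the space after <SPECIAL> was dropped, so "world" lost the
# leading-space marker its BPE merges expect; the space now survives.
ids = tokenizer.encode("<SPECIAL> world")
print(tokenizer.convert_ids_to_tokens(ids))
```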
-
Sam Shleifer authored
* activations.py contains a mapping from string to activation function
* Resolves some `gelu` vs `gelu_new` ambiguity
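The pattern, in miniature (the mapping's exact contents here are an assumption; the library uses the name ACT2FN):

```python
import math

import torch
import torch.nn.functional as F

def gelu_new(x):
    # GPT-2-style tanh approximation of GELU
    return 0.5 * x * (1.0 + torch.tanh(
        math.sqrt(2.0 / math.pi) * (x + 0.044715 * torch.pow(x, 3.0))))

# String-keyed registry: configs can name their activation explicitly,
# removing the gelu / gelu_new ambiguity.
ACT2FN = {"relu": F.relu, "gelu": F.gelu, "gelu_new": gelu_new}

act = ACT2FN["gelu_new"]
```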
-
- 11 Feb, 2020 2 commits
-
-
Oleksiy Syvokon authored
PyTorch < 1.3 requires multiplication operands to be of the same type. This was violated when using the default attention mask (i.e., attention_mask=None in the arguments) with BERT in decoder mode. In particular, this broke Model2Model and caused the quickstart tutorial to fail.
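A sketch of the kind of cast involved (the function name is illustrative):

```python
import torch

def extended_attention_mask(attention_mask, dtype):
    # Cast the mask to the model's float dtype first: PyTorch < 1.3 refuses
    # to multiply float activations by a long/int mask.
    mask = attention_mask[:, None, None, :].to(dtype=dtype)
    return (1.0 - mask) * -10000.0  # large negative bias on padded positions

mask = extended_attention_mask(torch.ones(2, 5, dtype=torch.long), torch.float32)
```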
-
jiyeon authored
-
- 10 Feb, 2020 3 commits
- 07 Feb, 2020 8 commits
-
-
Lysandre authored
-
Lysandre authored
-
VictorSanh authored
-
VictorSanh authored
-
monologg authored
-
Ari authored
-
thomwolf authored
-
thomwolf authored
-
- 06 Feb, 2020 3 commits
-
-
thomwolf authored
-
thomwolf authored
-
dchurchwell authored
Changed the vocabulary save function. The variable name was inconsistent, causing an error to be thrown when passing a file name instead of a directory.
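A minimal sketch of the corrected logic (names are illustrative, not the library's exact identifiers):

```python
import os

VOCAB_FILE_NAME = "vocab.txt"  # illustrative default file name

def resolve_vocab_path(path):
    # Accept either a directory (write the default file name inside it)
    # or an explicit file path, instead of assuming one or the other.
    if os.path.isdir(path):
        return os.path.join(path, VOCAB_FILE_NAME)
    return path
```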
-
- 05 Feb, 2020 1 commit
-
-
James Betker authored
This prevents the model from being saved, and who knows what else.
-
- 04 Feb, 2020 2 commits
- 03 Feb, 2020 3 commits
-
-
Julien Chaumond authored
cc @mfuntowicz does this seem correct?
-
Lysandre authored
-
Lysandre authored
Masked indices should be -1 and not -100. Updating the documentation and scripts that had been forgotten.
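In code, the convention being documented looks roughly like this (the ignore index is passed explicitly to match the value in the message; shapes are illustrative):

```python
import torch
from torch.nn import CrossEntropyLoss

# Non-masked positions carry the ignore index, so they contribute nothing
# to the masked-LM loss.
labels = torch.tensor([[-1, 542, -1, -1, 87]])
logits = torch.randn(1, 5, 30522)  # (batch, seq_len, vocab_size)
loss = CrossEntropyLoss(ignore_index=-1)(
    logits.view(-1, logits.size(-1)), labels.view(-1))
```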
-
- 01 Feb, 2020 1 commit
-
-
Bram Vanroy authored
* add "info" command to CLI As a convenience, add the info directive to CLI. Running `python transformers-cli info` will return a string containing the transformers version, platform, python version, PT/TF version and GPU support * Swap f-strings for .format Still supporting 3.5 so can't use f-strings (sad face) * Add reference in issue to CLI * Add the expected fields to issue template This way, people can still add the information manually if they want. (Though I fear they'll just ignore it.) * Remove heading from output * black-ify * order of imports Should ensure isort test passes * use is_X_available over import..pass * style * fix copy-paste bug * Rename command info -> env Also adds the command to CONTRIBUTING.md in "Did you find a bug" section
-
- 31 Jan, 2020 5 commits
-
-
Lysandre authored
-
Lysandre authored
cc @julien-c
-
Lysandre authored
The FlauBERT configuration file inherits from XLMConfig and is recognized as such when loading from AutoModels, because XLMConfig is checked before FlaubertConfig. Changing the order solves this problem, but a test should be added.
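Why order matters here, in miniature: FlaubertConfig subclasses XLMConfig, so an isinstance check against XLMConfig also matches FlauBERT configs; the subclass has to be checked first. A toy sketch:

```python
class XLMConfig:
    pass

class FlaubertConfig(XLMConfig):
    pass

config = FlaubertConfig()
# Checking the subclass before its parent resolves to the right model.
for config_class, model_name in [(FlaubertConfig, "flaubert"), (XLMConfig, "xlm")]:
    if isinstance(config, config_class):
        print(model_name)  # "flaubert" -- the old order printed "xlm"
        break
```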
-
Lysandre authored
-
Julien Chaumond authored
* [Umberto] model shortcuts. cc @loretoparisi @simonefrancia, see #2485
* Ensure that tokenizers will be correctly configured
-