1. 06 Apr, 2020 1 commit
    • Tokenizers v3.0.0 (#3185) · 96ab75b8
      Funtowicz Morgan authored
      
      
      * Renamed num_added_tokens to num_special_tokens_to_add
      Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
      
      * Cherry-pick: partially fix space-only input without special tokens added to the output #3091
      Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
      
      * Added property is_fast on PretrainedTokenizer and PretrainedTokenizerFast
      Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
      
      * Make fast tokenizers unittests work on Windows.
      
      * Entirely refactored unittest for tokenizers fast.
      
      * Remove ABC class for CommonFastTokenizerTest
      
      * Added embeded_special_tokens tests from allenai @dirkgr
      
      * Make embeded_special_tokens tests from allenai more generic
      
      * Uniformize vocab_size as a property for both Fast and normal tokenizers
      
      * Move special tokens handling out of PretrainedTokenizer (SpecialTokensMixin)
      
      * Ensure providing None input raises the same ValueError as the Python tokenizer + tests.
      
      * Fix invalid input for assert_padding when testing batch_encode_plus
      
      * Move add_special_tokens from constructor to tokenize/encode/[batch_]encode_plus methods parameter.
      
      * Ensure tokenize() correctly forwards add_special_tokens to Rust.
      
      * Add None checking on top of encode / encode_batch for TransfoXLTokenizerFast.
      Avoid stripping on None values.
      
      * unittests ensure tokenize() also throws a ValueError if provided None
      
      * Added add_special_tokens unittest for all supported models.
      
      * Style
      
      * Make sure the TransfoXL test runs only if PyTorch is provided.
      
      * Split up tokenizers tests for each model type.
      
      * Fix invalid unittest with new tokenizers API.
      
      * Filter out Roberta openai detector models from unittests.
      
      * Introduce BatchEncoding on fast tokenizers path.
      
      This new structure exposes all the mappings retrieved from Rust.
      It also keeps the current behavior when fed to the model's forward pass.
      
      * Introduce BatchEncoding on slow tokenizers path.
      
      Backward compatibility.
      
      * Improve error message on BatchEncoding for slow path
      
      * Make add_prefix_space True by default on Roberta fast to match the Python tokenizer in the majority of cases.
      
      * Style and format.
      
      * Added typing on all methods for PretrainedTokenizerFast
      
      * Style and format
      
      * Added path for feeding pretokenized (List[str]) input to PretrainedTokenizerFast.
      
      * Style and format
      
      * encode_plus now supports pretokenized inputs.
      
      * Remove user warning about add_special_tokens when working on pretokenized inputs.
      
      * Always go through the post processor.
      
      * Added support for pretokenized input pairs on encode_plus
      
      * Added is_pretokenized flag on encode_plus for clarity and improved error message on input TypeError.
      
      * Added pretokenized inputs support on batch_encode_plus
      
      * Update BatchEncoding method names to match Encoding.
      
      * Bump setup.py tokenizers dependency to 0.7.0rc1
      
      * Remove unused parameters in BertTokenizerFast
      
      * Make sure Roberta returns token_type_ids for unittests.
      
      * Added missing typings
      
      * Update add_tokens prototype to match the tokenizers side and allow AddedToken
      
      * Bumping tokenizers to 0.7.0rc2
      
      * Added documentation for BatchEncoding
      
      * Added (unused) is_pretokenized parameter on PreTrainedTokenizer encode_plus/batch_encode_plus methods.
      
      * Added higher-level typing for tokenize / encode_plus / batch_encode_plus.
      
      * Fix unittests failing because add_special_tokens was defined as a constructor parameter on Rust Tokenizers.
      
      * Fix text-classification pipeline using the wrong tokenizer
      
      * Make pipelines work with BatchEncoding
      
      * Turn off add_special_tokens on tokenize by default.
      Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
      
      * Remove add_prefix_space from tokenize call in unittest.
      Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
      
      * Style and quality
      Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
      
      * Correct message for batch_encode_plus None input exception.
      Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
      
      * Fix invalid list comprehension for offset_mapping overriding content on every iteration.
      Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
      
      * TransfoXL uses Strip normalizer.
      Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
      
      * Bump tokenizers dependency to 0.7.0rc3
      Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
      
      * Support AddedTokens for special_tokens and use left stripping on mask for Roberta.
      Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
      
      * SpecialTokensMixin can use slots for faster access to underlying attributes.
      Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
      
      * Remove update_special_tokens from fast tokenizers.
      
      * Ensure TransfoXL unittests are run only when torch is available.
      
      * Style.
      Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
      
      * Style
      
      * Style 🙏🙏
      
      
      
      * Remove slots on SpecialTokensMixin; needs a deeper dive into the pickle protocol.
      
      * Remove Roberta warning on __init__.
      
      * Move documentation to Google style.
      Co-authored-by: LysandreJik <lysandre.debut@reseau.eseo.fr>
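      
      A minimal sketch of the API surface described above, assuming a transformers build containing this change and tokenizers 0.7.x; the checkpoint name and example strings are illustrative only:
      
          from transformers import BertTokenizerFast
          
          # Any fast (Rust-backed) tokenizer; the checkpoint is only an example.
          tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
          
          print(tokenizer.is_fast)                      # new is_fast property
          print(tokenizer.num_special_tokens_to_add())  # renamed from num_added_tokens
          
          # add_special_tokens is now passed to encode/encode_plus, not the constructor.
          enc = tokenizer.encode_plus("Hello world", add_special_tokens=True)
          
          # encode_plus returns a BatchEncoding, which still behaves like the old dict
          # when fed to a model's forward pass.
          print(enc["input_ids"], enc["attention_mask"])
          
          # Pre-tokenized (List[str]) input goes through the same methods via is_pretokenized.
          pre = tokenizer.encode_plus(["Hello", "world"], is_pretokenized=True)
          print(pre["input_ids"])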
  2. 26 Mar, 2020 2 commits
    • Adds translation pipeline (#3419) · 022e8fab
      Patrick von Platen authored
      * fix merge conflicts
      
      * add t5 summarization example
      
      * change parameters for t5 summarization
      
      * make style
      
      * add first code snippet for translation
      
      * only add prefixes
      
      * add prefix patterns
      
      * make style
      
      * renaming
      
      * fix conflicts
      
      * remove unused patterns
      
      * solve conflicts
      
      * fix merge conflicts
      
      * remove translation example
      
      * remove summarization example
      
      * make sure tensors are in numpy for float comparison
      
      * re-add t5 config
      
      * fix t5 import config typo
      
      * make style
      
      * remove unused numpy statements
      
      * update docstring
      
      * import translation pipeline
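      
      A sketch of what using the new translation pipeline could look like, assuming the en→de task is registered under "translation_en_to_de" and backed by a T5 checkpoint with its "translate English to German: ..." prefix; the generation parameter is illustrative:
      
          from transformers import pipeline
          
          # Task-prefix based translation (T5-style prefixes mentioned above).
          translator = pipeline("translation_en_to_de")
          print(translator("Hugging Face is a company based in New York.", max_length=40))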
    • Add t5 to pipeline(task='summarization') (#3413) · 9c683ef0
      Patrick von Platen authored
      * solve conflicts
      
      * move warnings below
      
      * incorporate changes
      
      * add pad_to_max_length to pipelines
      
      * add bug fix for T5 beam search
      
      * add prefix patterns
      
      * make style
      
      * fix conflicts
      
      * adapt pipelines for task specific parameters
      
      * improve docstring
      
      * remove unused patterns
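      
      A sketch of the T5 path added to the summarization pipeline, with task-specific generation parameters passed at call time; the checkpoint name, text, and kwargs are illustrative:
      
          from transformers import pipeline
          
          ARTICLE = (
              "The tower is 324 metres tall, about the same height as an 81-storey "
              "building, and the tallest structure in Paris."
          )
          
          # Explicit T5 model/tokenizer; min_length/max_length are task-specific parameters.
          summarizer = pipeline("summarization", model="t5-small", tokenizer="t5-small")
          print(summarizer(ARTICLE, min_length=5, max_length=30))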
  3. 17 Mar, 2020 1 commit
    • Add Summarization to Pipelines (#3128) · 38a555a8
      Sam Shleifer authored
      * passing
      
      * Undo stupid chg
      
      * docs
      
      * undo rename
      
      * delete-cruft
      
      * only import if you have torch
      
      * Don't rely on dict ordering
      
      * Fix dict ordering upstream
      
      * docstring link
      
      * docstring link
      
      * remove trailing comma for 3.5 compat
      
      * new name
      
      * delegate kwarging
      
      * Update kwargs
  4. 09 Mar, 2020 1 commit
  5. 02 Mar, 2020 1 commit
    • Pipeline doc (#3055) · d3eb7d23
      Lysandre Debut authored
      * Pipeline doc initial commit
      
      * pipeline abstraction
      
      * Remove modelcard argument from pipeline
      
      * Task-specific pipelines can be instantiated with no model or tokenizer
      
      * All pipelines doc
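      
      A sketch of the two instantiation paths the doc covers: a task-specific pipeline built with no model or tokenizer (library defaults), and one built from an explicit pair, with the modelcard argument gone; the checkpoint name is illustrative:
      
          from transformers import pipeline, AutoModelForSequenceClassification, AutoTokenizer
          
          # 1) Task-specific pipeline with no model or tokenizer: defaults are used.
          nlp = pipeline("sentiment-analysis")
          
          # 2) Explicit model/tokenizer pair (no modelcard argument anymore).
          name = "distilbert-base-uncased-finetuned-sst-2-english"
          model = AutoModelForSequenceClassification.from_pretrained(name)
          tok = AutoTokenizer.from_pretrained(name)
          nlp = pipeline("sentiment-analysis", model=model, tokenizer=tok)
          
          print(nlp("This library keeps getting better."))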
  6. 19 Feb, 2020 1 commit
  7. 18 Feb, 2020 1 commit
  8. 13 Feb, 2020 1 commit
    • Preserve spaces in GPT-2 tokenizers (#2778) · f1e8a51f
      Joe Davison authored
      * Preserve spaces in GPT-2 tokenizers
      
      Preserves spaces after special tokens in GPT-2 and inherited (RoBERTa)
      tokenizers, enabling correct BPE encoding. Automatically inserts a space
      in front of the first token in the encode function when adding special tokens.
      
      * Add tokenization preprocessing method
      
      * Add framework argument to pipeline factory
      
      Also fixes a pipeline test issue. Each test input is now treated as a
      distinct sequence.
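      
      A small sketch of why the space handling matters for byte-level BPE; the checkpoint name is illustrative and the exact token strings depend on the vocabulary:
      
          from transformers import GPT2Tokenizer
          
          tok = GPT2Tokenizer.from_pretrained("gpt2")
          
          # Byte-level BPE encodes a leading space into the token itself (the "Ġ" marker),
          # so dropping or adding a space around special tokens changes the resulting ids.
          print(tok.tokenize("world"))    # no leading space
          print(tok.tokenize(" world"))   # leading space preserved, typically "Ġworld"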
  9. 07 Feb, 2020 2 commits
  10. 30 Jan, 2020 1 commit
    • fill_mask helper (#2576) · 9fa836a7
      Julien Chaumond authored
      * fill_mask helper
      
      * [poc] FillMaskPipeline
      
      * Revert "[poc] FillMaskPipeline"
      
      This reverts commit 67eeea55b0f97b46c2b828de0f4ee97d87338335.
      
      * Revert "fill_mask helper"
      
      This reverts commit cacc17b884e14bb6b07989110ffe884ad9e36eaa.
      
      * README: clarify that Pipelines can also do text-classification
      
      cf. question at the AI&ML meetup last week, @mfuntowicz
      
      * Fix test: test feature-extraction pipeline
      
      * Test tweaks
      
      * Slight refactor of existing pipeline (in preparation for the new FillMaskPipeline)
      
      * Extraneous doc
      
      * More robust way of doing this
      
      @mfuntowicz as we don't rely on the model name anymore (see AutoConfig)
      
      * Also add RobertaConfig as a quickfix for wrong token_type_ids
      
      * cs
      
      * [BIG] FillMaskPipeline
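      
      A sketch of the resulting FillMaskPipeline usage, assuming it is registered under the "fill-mask" task; the model name and the RoBERTa-style <mask> token are illustrative:
      
          from transformers import pipeline
          
          fill_mask = pipeline("fill-mask", model="distilroberta-base")
          
          # Returns the top candidate tokens for the masked position with their scores.
          print(fill_mask("HuggingFace is creating a <mask> that the community uses."))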
  11. 15 Jan, 2020 2 commits
  12. 06 Jan, 2020 2 commits
  13. 22 Dec, 2019 4 commits
  14. 21 Dec, 2019 1 commit
    • Reformat source code with black. · fa84ae26
      Aymeric Augustin authored
      This is the result of:
      
          $ black --line-length 119 examples templates transformers utils hubconf.py setup.py
      
      There are a lot of fairly long lines in the project. As a consequence, I'm
      picking the longest widely accepted line length, 119 characters.
      
      This is also Thomas' preference, because it allows for explicit variable
      names that make the code easier to understand.
  15. 20 Dec, 2019 3 commits
  16. 10 Dec, 2019 2 commits