1. 13 Jul, 2020 1 commit
  2. 10 Jul, 2020 1 commit
    • Change model outputs types to self-document outputs (#5438) · edfd82f5
      Sylvain Gugger authored
      * [WIP] Proposal for model outputs
      
      * All Bert models
      
      * Make CI green maybe?
      
      * Fix ONNX test
      
      * Isolate ModelOutput from pt and tf
      
      * Formatting
      
      * Add Electra models
      
      * Auto-generate docstrings from outputs
      
      * Add TF outputs
      
      * Add some BERT models
      
      * Revert TF side
      
      * Remove last traces of TF changes
      
      * Fail with a clear error message
      
      * Add Albert and work through Bart
      
      * Add CTRL and DistilBert
      
      * Formatting
      
      * Progress on Bart
      
      * Renames and finish Bart
      
      * Formatting
      
      * Fix last test
      
      * Add DPR
      
      * Finish Electra and add FlauBERT
      
      * Add GPT2
      
      * Add Longformer
      
      * Add MMBT
      
      * Add MobileBert
      
      * Add GPT
      
      * Formatting
      
      * Add Reformer
      
      * Add Roberta
      
      * Add T5
      
      * Add Transformer XL
      
      * Fix test
      
      * Add XLM + fix XLMForTokenClassification
      
      * Style + XLMRoberta
      
      * Add XLNet
      
      * Formatting
      
      * Add doc of return_tuple arg
      edfd82f5
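The output-type change above replaces plain tuples with named, self-documenting objects that still support tuple-style indexing. A minimal sketch of the idea (an illustrative stand-in, not the actual `transformers.ModelOutput` implementation):

```python
from collections import OrderedDict
from dataclasses import dataclass, fields
from typing import Optional, Tuple


@dataclass
class ModelOutput(OrderedDict):
    """Outputs readable by attribute name and, for backward
    compatibility, by integer index (fields left as None are skipped)."""

    def __post_init__(self):
        # Register non-None fields as dict entries, in declaration order.
        for f in fields(self):
            value = getattr(self, f.name)
            if value is not None:
                self[f.name] = value

    def to_tuple(self):
        return tuple(self[k] for k in self.keys())

    def __getitem__(self, key):
        # Integer access falls back to the legacy tuple view.
        if isinstance(key, int):
            return self.to_tuple()[key]
        return super().__getitem__(key)


@dataclass
class SequenceClassifierOutput(ModelOutput):
    loss: Optional[float] = None
    logits: Optional[Tuple[float, ...]] = None


out = SequenceClassifierOutput(logits=(0.1, 0.9))
```

With no loss passed, `out[0]` is the logits, matching the old tuple behavior; once a loss is present it becomes element 0.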
  3. 08 Jul, 2020 2 commits
    • Fix Inconsistent NER Grouping (Pipeline) (#4987) · 0cc4eae0
      Lorenzo Ampil authored
      
      
      * Add B I handling to grouping
      
      * Add fix to include separate entity as last token
      
      * move last_idx definition outside loop
      
      * Use first entity in entity group as reference for entity type
      
      * Add test cases
      
      * Take out extra class accidentally added
      
      * Return tf ner grouped test to original
      
      * Take out redundant last entity
      
      * Get last_idx safely
      Co-authored-by: ColleterVi <36503688+ColleterVi@users.noreply.github.com>
      
      * Fix first entity comment
      
      * Create separate functions for group_sub_entities and group_entities (splitting call method to testable functions)
      
      * Take out unnecessary last_idx
      
      * Remove additional forward pass test
      
      * Move token classification basic tests to separate class
      
      * Move token classification basic tests back to monocolumninputtestcase
      
      * Move base ner tests to nerpipelinetests
      
      * Take out unused kwargs
      
      * Add back mandatory_keys argument
      
      * Add unitary tests for group_entities in _test_ner_pipeline
      
      * Fix last entity handling
      
      * Fix grouping function used
      
      * Add typing to group_sub_entities and group_entities
      Co-authored-by: ColleterVi <36503688+ColleterVi@users.noreply.github.com>
      0cc4eae0
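The B-/I- handling described above can be illustrated with a simplified sketch. `group_sub_entities` and `group_entities` here are toy versions assuming token dicts with `word`, `entity`, and `index` keys, not the pipeline's real signatures:

```python
def group_sub_entities(sub):
    """Merge a run of sub-token predictions into one entity group,
    taking the first token's tag as the reference entity type."""
    entity_type = sub[0]["entity"].split("-")[-1]  # strip the B-/I- prefix
    return {"entity_group": entity_type,
            "word": " ".join(t["word"] for t in sub)}


def group_entities(entities):
    """Group adjacent tokens: a token joins the open group only when it
    carries an I- tag of the same type and directly follows the
    previous token's index (last_idx is tracked safely across gaps)."""
    groups, current, last_idx = [], [], None
    for t in entities:
        tag, _, etype = t["entity"].partition("-")
        same_type = bool(current) and \
            current[0]["entity"].split("-")[-1] == (etype or tag)
        adjacent = last_idx is not None and t["index"] == last_idx + 1
        if current and tag == "I" and same_type and adjacent:
            current.append(t)
        else:
            if current:
                groups.append(group_sub_entities(current))
            current = [t]
        last_idx = t["index"]
    if current:
        groups.append(group_sub_entities(current))
    return groups
```

A B- tag, a type change, or a gap in token indices all start a new group.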
    • [Benchmark] Add benchmarks for TF Training (#5594) · f82a2a5e
      Patrick von Platen authored
      * tf_train
      
      * adapt timing for tpu
      
      * fix timing
      
      * fix timing
      
      * fix timing
      
      * fix timing
      
      * update notebook
      
      * add tests
      f82a2a5e
  4. 07 Jul, 2020 8 commits
    • Add mbart-large-cc25, support translation finetuning (#5129) · 353b8f1e
      Sam Shleifer authored
      improve unit tests for finetuning, especially w.r.t. testing frozen parameters
      fix freeze_embeds for T5
      add streamlit setup.cfg
      353b8f1e
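Testing frozen parameters, as mentioned above, boils down to flipping `requires_grad` off and asserting it stayed off. A hedged sketch using torch-free stand-in classes (the real code operates on `torch.nn.Module` parameters):

```python
class Param:
    """Stand-in for a framework parameter with a requires_grad flag."""
    def __init__(self):
        self.requires_grad = True


class Embedding:
    """Stand-in module holding a few parameters."""
    def __init__(self, n):
        self._params = [Param() for _ in range(n)]

    def parameters(self):
        return iter(self._params)


def freeze_params(module):
    """Disable gradient updates for every parameter of a (sub)module."""
    for p in module.parameters():
        p.requires_grad = False


def assert_all_frozen(module):
    """Unit-test helper: fail if any parameter would still get gradients."""
    assert not any(p.requires_grad for p in module.parameters())
```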
    • [Almost all TF models] TF clean up: add missing CLM / MLM loss; fix T5 naming... · 4dc65591
      Patrick von Platen authored
      
      [Almost all TF models] TF clean up: add missing CLM / MLM loss; fix T5 naming and keras compile (#5395)
      
      * add first version of clm tf
      
      * make style
      
      * add more tests for bert
      
      * update tf clm loss
      
      * fix tests
      
      * correct tf ner script
      
      * add mlm loss
      
      * delete bogus file
      
      * clean tf auto model + add tests
      
      * finish adding clm loss everywhere
      
      * fix training in distilbert
      
      * fix flake8
      
      * save intermediate
      
      * fix tf t5 naming
      
      * remove prints
      
      * finish up
      
      * up
      
      * fix tf gpt2
      
      * fix new test utils import
      
      * fix flake8
      
      * keep backward compatibility
      
      * Update src/transformers/modeling_tf_albert.py
      Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
      
      * Update src/transformers/modeling_tf_auto.py
      Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
      
      * Update src/transformers/modeling_tf_electra.py
      Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
      
      * Update src/transformers/modeling_tf_roberta.py
      Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
      
      * Update src/transformers/modeling_tf_mobilebert.py
      Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
      
      * Update src/transformers/modeling_tf_auto.py
      Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
      
      * Update src/transformers/modeling_tf_bert.py
      Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
      
      * Update src/transformers/modeling_tf_distilbert.py
      Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
      
      * apply Sylvain's suggestions
      Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
      4dc65591
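The CLM loss added across the TF models above aligns each position's logits with the *next* token. A framework-free sketch of the shift and the per-position cross-entropy (illustrative only; the commit implements this with TF tensors and Keras losses):

```python
import math


def clm_shift(logits, labels):
    """Causal LM alignment: position i predicts token i + 1, so drop the
    last logits row and the first label before computing the loss."""
    return logits[:-1], labels[1:]


def cross_entropy(row, label):
    """Softmax cross-entropy for one position, via log-sum-exp for
    numerical stability."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return log_z - row[label]
```

An MLM loss skips the shift and instead masks out positions whose label is the ignore index.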
    • Fix tests imports dpr (#5576) · 4fedc125
      Quentin Lhoest authored
      * fix test imports
      
      * fix max_length
      
      * style
      
      * fix tests
      4fedc125
    • [Bart] enable test_torchscript, update test_tie_weights (#5457) · d4886173
      Sam Shleifer authored
      * Passing all but one torchscript test
      
      * Style
      
      * move comment
      
      * remove unneeded assert
      d4886173
    • Add DPR model (#5279) · fbd87921
      Quentin Lhoest authored
      
      
      * beginning of dpr modeling
      
      * wip
      
      * implement forward
      
      * remove biencoder + better init weights
      
      * export dpr model to embed model for nlp lib
      
      * add new api
      
      * remove old code
      
      * make style
      
      * fix dumb typo
      
      * don't load bert weights
      
      * docs
      
      * docs
      
      * style
      
      * move the `k` parameter
      
      * fix init_weights
      
      * add pretrained configs
      
      * minor
      
      * update config names
      
      * style
      
      * better config
      
      * style
      
      * clean code based on PR comments
      
      * change Dpr to DPR
      
      * fix config
      
      * switch encoder config to a dict
      
      * style
      
      * inheritance -> composition
      
      * add messages in assert statements
      
      * add dpr reader tokenizer
      
      * one tokenizer per model
      
      * fix base_model_prefix
      
      * fix imports
      
      * typo
      
      * add convert script
      
      * docs
      
      * change tokenizers conf names
      
      * style
      
      * change tokenizers conf names
      
      * minor
      
      * minor
      
      * fix wrong names
      
      * minor
      
      * remove unused convert functions
      
      * rename convert script
      
      * use return_tensors in tokenizers
      
      * remove n_questions dim
      
      * move generate logic to tokenizer
      
      * style
      
      * add docs
      
      * docs
      
      * quality
      
      * docs
      
      * add tests
      
      * style
      
      * add tokenization tests
      
      * DPR full tests
      
      * Stay true to the attention mask building
      
      * update docs
      
      * missing param in bert input docs
      
      * docs
      
      * style
      Co-authored-by: Lysandre <lysandre.debut@reseau.eseo.fr>
      fbd87921
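DPR is a bi-encoder: questions and passages are embedded by separate encoders and ranked by inner product. A toy sketch of the retrieval step, with plain lists standing in for the encoders' output vectors:

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def retrieve(question_vec, passage_vecs, k=1):
    """Rank passages by inner product with the question embedding and
    return the indices of the top-k passages."""
    ranked = sorted(range(len(passage_vecs)),
                    key=lambda i: dot(question_vec, passage_vecs[i]),
                    reverse=True)
    return ranked[:k]
```

Because similarity is a plain dot product, passage embeddings can be precomputed and indexed offline.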
    • Make T5 compatible with ONNX (#5518) · 69122657
      Abel authored
      
      
      * Default decoder inputs to encoder ones for T5 if none are specified.
      
      * Fixing typo, now all tests are passing.
      
      * Changing einsum to operations supported by onnx
      
      * Adding a test to ensure T5 can be exported to ONNX with opset > 9
      
      * Modified test for onnx export to make it faster
      
      * Styling changes.
      
      * Styling changes.
      
      * Changing notation for matrix multiplication
      Co-authored-by: Abel Riboulot <tkai@protomail.com>
      69122657
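The einsum-to-matmul rewrite above works because an attention-score einsum such as `"qd,kd->qk"` is just a transpose plus a matrix multiply, both of which map to standard ONNX operators. A 2-D sketch:

```python
def matmul(a, b):
    """Plain 2-D matrix multiply: (n, d) x (d, m) -> (n, m)."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def transpose(m):
    """Swap rows and columns of a 2-D matrix."""
    return [list(row) for row in zip(*m)]


def scores_via_matmul(q, k):
    """einsum("qd,kd->qk", q, k) rewritten as q @ k^T using only
    transpose and matmul, which older ONNX opsets support."""
    return matmul(q, transpose(k))
```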
    • [Reformer] Adapt Reformer MaskedLM Attn mask (#5560) · 989ae326
      Patrick von Platen authored
      * fix attention mask
      
      * fix slow test
      
      * refactor attn masks
      
      * fix fp16 generate test
      989ae326
    • Added data collator for permutation (XLNet) language modeling and related calls (#5522) · 3dcb748e
      Shashank Gupta authored
      * Added data collator for XLNet language modeling and related calls
      
      Added DataCollatorForXLNetLanguageModeling in data/data_collator.py
      to generate necessary inputs for language modeling training with
      XLNetLMHeadModel. Also added related arguments, logic and calls in
      examples/language-modeling/run_language_modeling.py.
      
      Resolves: #4739, #2008 (partially)
      
      * Changed name to `DataCollatorForPermutationLanguageModeling`
      
      Changed the name of `DataCollatorForXLNetLanguageModeling` to the more general `DataCollatorForPermutationLanguageModeling`.
      Removed the `--mlm` flag requirement for the new collator and defined a separate `--plm_probability` flag for its use.
      CTRL uses a CLM loss just like GPT and GPT-2, so it should work out of the box with this script (provided `past` is taken care of
      similarly to `mems` for XLNet).
      Changed calls and imports appropriately.
      
      * Added detailed comments, changed variable names
      
      Added more detailed comments to `DataCollatorForPermutationLanguageModeling` in `data/data_collator.py` to explain working. Also cleaned up variable names and made them more informative.
      
      * Added tests for new data collator
      
      Added tests in `tests/test_trainer.py` for DataCollatorForPermutationLanguageModeling based on those in DataCollatorForLanguageModeling. A specific test has been added to check for odd-length sequences.
      
      * Fixed styling issues
      3dcb748e
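The permutation-LM collator masks whole spans so that roughly a `plm_probability` fraction of tokens ends up masked. A simplified sketch of that span-masking walk (the parameter names mirror the flags above, but this is not the collator's exact algorithm):

```python
import random


def plm_mask(seq_len, plm_probability=1 / 6, max_span_length=5, rng=None):
    """Walk the sequence left to right; in each context window, draw a
    span length, mask one span at a random offset, then jump ahead by
    span / plm_probability so the expected masked fraction is
    plm_probability."""
    rng = rng or random.Random(0)
    mask = [False] * seq_len
    cur = 0
    while cur < seq_len:
        span = rng.randint(1, max_span_length)
        context = int(span / plm_probability)  # window sized so span/context == plm_probability
        start = cur + rng.randint(0, max(context - span, 0))
        for i in range(start, min(start + span, seq_len)):
            mask[i] = True
        cur += context
    return mask
```

Odd-length sequences need no special casing here because the final span is clipped to `seq_len`, which is what the dedicated test in `tests/test_trainer.py` guards against.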
  5. 06 Jul, 2020 1 commit
    • Various tokenizers fixes (#5558) · 5787e4c1
      Anthony MOI authored
      * BertTokenizerFast - Do not specify strip_accents by default
      
      * Bump tokenizers to new version
      
      * Add test for AddedToken serialization
      5787e4c1
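The AddedToken serialization test above needs the token's stripping flags to survive a save/load round trip. A minimal sketch of such a spec (a hypothetical stand-in, not the `tokenizers` library's class):

```python
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class AddedToken:
    """Special-token spec whose matching/stripping behavior must
    round-trip through JSON serialization unchanged."""
    content: str
    single_word: bool = False
    lstrip: bool = False
    rstrip: bool = False

    def to_json(self):
        return json.dumps(asdict(self))

    @classmethod
    def from_json(cls, s):
        return cls(**json.loads(s))
```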
  6. 03 Jul, 2020 2 commits
  7. 02 Jul, 2020 1 commit
  8. 01 Jul, 2020 8 commits
  9. 30 Jun, 2020 1 commit
  10. 29 Jun, 2020 1 commit
  11. 28 Jun, 2020 1 commit
  12. 26 Jun, 2020 4 commits
  13. 25 Jun, 2020 3 commits
  14. 24 Jun, 2020 4 commits
  15. 23 Jun, 2020 2 commits
    • Sam Shleifer
    • Tokenizers API developments (#5103) · 11fdde02
      Thomas Wolf authored
      
      
      * Add return lengths
      
      * make pad a bit more flexible so it can be used as collate_fn
      
      * check all kwargs sent to encoding method are known
      
      * fixing kwargs in encodings
      
      * New AddedToken class in python
      
      This class lets you specify specific tokenization behaviors for some special tokens. Used in particular for GPT2 and RoBERTa, to control how whitespace is stripped around special tokens.
      
      * style and quality
      
      * switched to the huggingface tokenizers library for AddedTokens
      
      * up to tokenizer 0.8.0-rc3 - update API to use AddedToken state
      
      * style and quality
      
      * do not raise an error on additional or unused kwargs for tokenize() but only a warning
      
      * transfo-xl pretrained model requires torch
      
      * Update src/transformers/tokenization_utils.py
      Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
      Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
      11fdde02
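The "pad as collate_fn" change above means padding a batch of variable-length id lists to one length and emitting the matching attention mask, so the method can be handed straight to a DataLoader. A framework-free sketch:

```python
def pad(batch, pad_token_id=0):
    """Pad a batch of variable-length token-id lists to the longest
    sequence and build the matching attention mask (1 = real token,
    0 = padding) — usable directly as a DataLoader collate_fn."""
    max_len = max(len(x) for x in batch)
    input_ids = [x + [pad_token_id] * (max_len - len(x)) for x in batch]
    attention_mask = [[1] * len(x) + [0] * (max_len - len(x)) for x in batch]
    return {"input_ids": input_ids, "attention_mask": attention_mask}
```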