- 04 Nov, 2020 3 commits
-
-
Sylvain Gugger authored
* Clean up data collators and datasets * Apply suggestions from code review Co-authored-by:
Lysandre Debut <lysandre@huggingface.co> * Remove needless clone Co-authored-by:
Lysandre Debut <lysandre@huggingface.co>
-
Nicolas Patry authored
- The issue is that with the previous code we would have the following: ```python qa_pipeline = (...) qa_pipeline(question="Where was he born ?", context="") -> IndexError: Dimension out of range (expected to be in range of [-1, 0], but got 1) ``` The goal here is to improve this to actually raise a ValueError wherever possible. While at it, I tried to simplify QuestionArgumentHandler's code to make it smaller and more compact while keeping backward compatibility.
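A minimal sketch of the intended behaviour described above, assuming the public `pipeline` API; the validation shown is illustrative rather than the actual argument-handler code:
```python
from transformers import pipeline

qa_pipeline = pipeline("question-answering")

# An empty question or context should now surface as a clear ValueError
# instead of an opaque IndexError from deep inside the model code.
try:
    qa_pipeline(question="Where was he born?", context="")
except ValueError as err:
    print(f"Rejected early with: {err}")
```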
-
Patrick von Platen authored
* fix greedy generate test * delete ipdb
-
- 03 Nov, 2020 7 commits
-
-
Ceyda Cinarel authored
* Bug fix: NER pipeline shouldn't group separate entities of same type * style fix * [Bug Fix] Shouldn't group entities that are both 'B' even if they are the same type (B-type1 B-type1) != (B-type1 I-type1) [Bug Fix] add an option `ignore_subwords` to ignore subsequent ##wordpieces in predictions. Because some models train on only the first token of a word and not on the subsequent wordpieces (the BERT NER default), it makes sense to do the same thing at inference time. The simplest fix is to just group the subwords with the first wordpiece. [TODO] how to handle ignored scores? just set them to 0 and calculate a zero-invariant mean? [TODO] handle a different wordpiece_prefix ##? possible approaches: get it from the tokenizer? but currently most tokenizers don't have a wordpiece_prefix property? have an _is_subword(token) [Feature add] added option to `skip_special_tokens`, because it was harder to remove them after grouping. [Additional Changes] remove B/I prefix on returned grouped_entities [Feature Request/TODO] Return indexes? [Bug TODO] can't use fast tokenizer with grouped_entities ('BertTokenizerFast' object has no attribute 'convert_tokens_to_string') * use offset_mapping to fix [UNK] token problem * ignore score for subwords * modify ner_pipeline test * modify ner_pipeline test * modify ner_pipeline test * ner_pipeline change ignore_subwords default to true * add ner_pipeline ignore_subword=False test case * fix offset_mapping index * fix style again duh * change is_subword and convert_tokens_to_string logic * merge tests with new test structure * change test names * remove old tests * ner tests for fast tokenizer * fast tokenizers have convert_tokens_to_string * Fix the incorrect merge Co-authored-by: Ceyda Cinarel <snu-ceyda@users.noreply.github.com> Co-authored-by:
Lysandre Debut <lysandre@huggingface.co> Co-authored-by:
Lysandre <lysandre.debut@reseau.eseo.fr>
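A hedged usage sketch of the options named in the commit above (`grouped_entities` and the new `ignore_subwords` flag); the default model and the example sentence are illustrative:
```python
from transformers import pipeline

# grouped_entities merges B-/I- pieces of one entity into a single span;
# ignore_subwords drops the scores of trailing ##wordpieces, since many NER
# models are only trained on the first token of each word.
ner = pipeline("ner", grouped_entities=True, ignore_subwords=True)
print(ner("My name is Wolfgang and I live in Berlin."))
```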
-
Stas Bekman authored
* make it possible to invoke conftest.py in both test suites without crashing on having the same option added * perl -pi -e 's|--make_reports|--make-reports|' to be consistent with other opts * add `pytest --make-reports` to all CIs (and artifacts) * fix
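A minimal sketch of the idea of registering the same pytest option from two conftest.py files without crashing; the guard and the help text are assumptions, not the actual implementation:
```python
# conftest.py (sketch)
def pytest_addoption(parser):
    try:
        parser.addoption(
            "--make-reports",
            action="store",
            default=False,
            help="generate report files for the test run",  # help text is illustrative
        )
    except ValueError:
        # the option was already added by the other test suite's conftest.py
        pass
```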
-
Sylvain Gugger authored
* Add DataCollatorForTokenClassification and clean up tests * Make quality
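A hedged usage sketch of the new `DataCollatorForTokenClassification`; the toy features below are illustrative:
```python
from transformers import AutoTokenizer, DataCollatorForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
collator = DataCollatorForTokenClassification(tokenizer)

# Features of unequal length: input_ids are padded with the tokenizer's pad
# token, labels with -100 so padded positions are ignored by the loss.
features = [
    {"input_ids": [101, 7592, 102], "labels": [1, 2, 1]},
    {"input_ids": [101, 7592, 2088, 999, 102], "labels": [1, 2, 3, 1, 1]},
]
batch = collator(features)
print(batch["input_ids"].shape, batch["labels"].shape)
```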
-
Sylvain Gugger authored
-
guillaume-be authored
* Updated ConversationalPipeline to work with encoder-decoder models (e.g. BlenderBot) * Addition of integration test for EncoderDecoder conversation model Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
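A hedged sketch of the encoder-decoder conversational flow mentioned above; the BlenderBot checkpoint name is an assumption and may differ from the one used in the integration test:
```python
from transformers import Conversation, pipeline

# Checkpoint name is illustrative.
chatbot = pipeline("conversational", model="facebook/blenderbot-400M-distill")

conversation = Conversation("What is the best way to learn a new language?")
conversation = chatbot(conversation)
print(conversation.generated_responses[-1])
```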
-
Nicolas Patry authored
* [FIX] TextGenerationPipeline is currently broken, most likely due to #8180. What's missing is a multi vs. single string handler at the beginning of the pipeline. There was also no testing of this pipeline. * Fixing Conversational tests too.
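A hedged sketch of the single vs. multi string calls that the added handler covers; the model and prompts are illustrative:
```python
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

# A single string returns one list of generated sequences.
print(generator("Hello, I'm a language model,", max_length=30))

# A list of strings returns one list of results per prompt.
print(generator(["Once upon a time", "In a galaxy far away"], max_length=30))
```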
-
Patrick von Platen authored
* first draft * show design proposition for new generate method * up * make more readable * make first version * gpt2 tests pass * make beam search for gpt2 work * add first encoder-decoder code * delete typo * make t5 work * save intermediate * make bart work with beam search * finish beam search bart / t5 * add default kwargs * make more tests pass * fix no bad words sampler * some fixes and tests for all distribution processors * fix test * fix rag slow tests * merge to master * add nograd to generate * make all slow tests pass * speed up generate * fix edge case bug * small fix * correct typo * add type hints and docstrings * fix typos in tests * add beam search tests * add tests for beam scorer * fix test rag * finish beam search tests * move generation tests into a separate file * fix generation tests * more tests * add aggressive generation tests * fix tests * add gpt2 sample test * add more docstring * add more docs * finish doc strings * apply some more of Sylvain's and Sam's comments * fix some typos * make fix copies * apply Lysandre's and Sylvain's comments * final corrections on examples * small fix for reformer
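A hedged sketch of beam search through the refactored `generate()`; the checkpoint and generation arguments are standard keyword arguments chosen for illustration, not anything specific to this PR:
```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

inputs = tokenizer("translate English to German: The house is wonderful.", return_tensors="pt")

# Beam search with a no-repeat n-gram constraint, handled by the beam scorer
# and distribution-processor machinery under the hood.
outputs = model.generate(
    inputs["input_ids"],
    num_beams=4,
    no_repeat_ngram_size=2,
    max_length=40,
    early_stopping=True,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```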
-
- 02 Nov, 2020 3 commits
-
-
Stas Bekman authored
-
Santiago Castro authored
-
Nicolas Patry authored
* Some work to fix the behaviour of DefaultArgumentHandler by removing it. * Fixing specific pipelines argument checking.
-
- 30 Oct, 2020 3 commits
-
-
TFUsers authored
* Replace swish with silu * revert nn.silu to nn.swish due to older version * simplify optimized silu conditional and fix format * Update activations.py * Update activations_tf.py * Update modeling_flax_utils.py * Update modeling_openai.py * add swish testcase * add pytorch swish testcase * Add more robust python version check * more formatting fixes Co-authored-by: TFUsers <TFUsers@gmail.com>
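A minimal sketch of the version-guarded silu/swish conditional described above, assuming native `silu` landed in torch 1.7; the fallback is the standard `x * sigmoid(x)` definition:
```python
import torch
from packaging import version

if version.parse(torch.__version__) >= version.parse("1.7.0"):
    # Use the native (fused) implementation when available.
    silu = torch.nn.functional.silu
else:
    # Older torch: fall back to the mathematically equivalent definition.
    def silu(x):
        return x * torch.sigmoid(x)
```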
-
Sam Shleifer authored
* Start plumbing * Marian close * Small stubs for all children * Fixed bart * marian working * pegasus test is good, but failing * Checkin tests * More model files * Subtle marian, pegasus integration test failures * Works well * rm print * boom boom * Still failing model2doc * merge master * Equivalence test failing, all others fixed * cleanup * Fix embed_scale * Cleanup marian pipeline test * Undo extra changes * Smaller delta * Cleanup model testers * undo delta * fix tests import structure * cross test decorator * Cleaner set_weights * Respect authorized_unexpected_keys * No warnings * No warnings * style * Nest tf import * black * Apply suggestions from code review Co-authored-by:
Lysandre Debut <lysandre@huggingface.co> * functional dropout * fixup * Fixup * style_doc * embs * shape list * delete slow force_token_id_to_be_generated func * fixup Co-authored-by:
Lysandre Debut <lysandre@huggingface.co>
-
Lysandre Debut authored
* Test TF GPU CI * Change cache * Fix missing torch requirement * Fix some model tests * Style * LXMERT * MobileBERT * Longformer skip test * XLNet * The rest of the tests * RAG goes OOM in multi gpu setup * YAML test files * Last fixes * Skip doctests * Fill mask tests * Yaml files * Last test fix * Style * Update cache * Change ONNX tests to slow + use tiny model
-
- 29 Oct, 2020 2 commits
-
-
Sylvain Gugger authored
* Smarter prediction loop and no- -> no_ in console args * Fix test
-
Santiago Castro authored
* Fix doc errors and typos across the board * Fix a typo * Fix the CI * Fix more typos * Fix CI * More fixes * Fix CI * More fixes * More fixes
-
- 28 Oct, 2020 1 commit
-
-
Stas Bekman authored
* move the helper code into testing_utils * port test_trainer_distributed to work with pytest * improve docs * simplify notes * doc * doc * style * doc * further improvements * torch might not be available * real fix * Apply suggestions from code review Co-authored-by:
Sylvain Gugger <35901082+sgugger@users.noreply.github.com> Co-authored-by:
Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
-
- 27 Oct, 2020 3 commits
-
-
Joe Davison authored
* add entailment dim argument * rename dim -> id * fix last name change, style * rm arg, auto-infer only * typo * rm superfluous import
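A hypothetical helper mirroring the "rm arg, auto-infer only" step above: instead of taking the entailment dimension as an argument, infer it from the model config's labels. The function name and fallback value are assumptions for illustration:
```python
def infer_entailment_id(config):
    """Guess which output dimension corresponds to 'entailment' (illustrative)."""
    for label, idx in config.label2id.items():
        if label.lower().startswith("entail"):
            return idx
    # Fall back to the last dimension if the labels are uninformative.
    return -1
```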
-
Harutaka Kawamura authored
* Fix callback_list * Add test Signed-off-by:
harupy <17039389+harupy@users.noreply.github.com> * Fix test Signed-off-by:
harupy <17039389+harupy@users.noreply.github.com>
-
Stas Bekman authored
* better reports * a whole bunch of reports in their own files * clean up * improvements * github artifacts experiment * style * complete the report generator with multiple improvements/fixes * fix * save all reports under one dir for easy upload * can remove temp failing tests * doc fix * some cleanup
-
- 26 Oct, 2020 5 commits
-
-
Lysandre Debut authored
-
Sylvain Gugger authored
-
Sam Shleifer authored
-
Sam Shleifer authored
-
Thomas Wolf authored
* fixing #8001 * make T5 tokenizer serialization more robust - style
-
- 23 Oct, 2020 3 commits
-
-
Anthony MOI authored
-
Patrick von Platen authored
* remove reformer pad_token_id * fix pegasus
-
Thomas Wolf authored
[tests|tokenizers] Refactoring pipelines test backbone - Small tokenizers improvements - General tests speedups (#7970) * WIP refactoring pipeline tests - switching to fast tokenizers * fix dialog pipeline and fill-mask * refactoring pipeline tests backbone * make large tests slow * fix tests (tf Bart inactive for now) * fix doc... * clean up for merge * fixing tests - remove bart from summarization until there is TF * fix quality and RAG * Add new translation pipeline tests - fix JAX tests * only slow for dialog * Fixing the missing TF-BART imports in modeling_tf_auto * spin out pipeline tests in separate CI job * adding pipeline test to CI YAML * add slow pipeline tests * speed up tf and pt join test to avoid redoing all the standalone pt and tf tests * Update src/transformers/tokenization_utils_base.py Co-authored-by:
Sam Shleifer <sshleifer@gmail.com> * Update src/transformers/pipelines.py Co-authored-by:
Sylvain Gugger <35901082+sgugger@users.noreply.github.com> * Update src/transformers/pipelines.py Co-authored-by:
Lysandre Debut <lysandre@huggingface.co> * Update src/transformers/testing_utils.py Co-authored-by:
Sylvain Gugger <35901082+sgugger@users.noreply.github.com> * add require_torch and require_tf in is_pt_tf_cross_test Co-authored-by:
Sam Shleifer <sshleifer@gmail.com> Co-authored-by:
Sylvain Gugger <35901082+sgugger@users.noreply.github.com> Co-authored-by:
Lysandre Debut <lysandre@huggingface.co>
-
- 22 Oct, 2020 7 commits
-
-
Sylvain Gugger authored
* Only log total_flos at the end of training * Fix test
-
Julien Chaumond authored
* FillMaskPipeline: support passing top_k on __call__ Also move from topk to top_k * migrate to new param name in tests * Review from @sgugger
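A hedged usage sketch of the renamed call-time argument; the checkpoint and mask token are illustrative:
```python
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="distilroberta-base")

# top_k is now passed on __call__ (previously the pipeline-level `topk`).
print(fill_mask("Paris is the <mask> of France.", top_k=3))
```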
-
Sylvain Gugger authored
* Start simplification * More progress * Finished script * Address comments and update tests instructions * Wrong test * Accept files as inputs and fix test * Update src/transformers/trainer_utils.py Co-authored-by:
Julien Chaumond <chaumond@gmail.com> * Fix labels and add combined score * Add special labels * Update TPU command * Revert to old label strategy * Use model labels * Fix for STS-B * Styling * Apply suggestions from code review Co-authored-by:
Thomas Wolf <thomwolf@users.noreply.github.com> * Code styling * Fix review comments Co-authored-by:
Julien Chaumond <chaumond@gmail.com> Co-authored-by:
Thomas Wolf <thomwolf@users.noreply.github.com>
-
Nicolas Patry authored
* Actually make the "translation", "translation_XX_to_YY" task behave correctly. Background: - Currently "translation_cn_to_ar" does not work. (only 3 pairs are supported) - Some models contain in their config the correct values for the (src, tgt) pair they can translate. It's usually just one pair, and we can infer it automatically from the `model.config.task_specific_params`. If it's not defined we can still probably load the TranslationPipeline nevertheless. Proposed fix: - A simplified version of what could become more general, which is a `parametrized` task. "translation" + (src, tgt) is in this instance what we need in the general case. The way we go about it for now is simply parsing "translation_XX_to_YY". If more cases of parametrized tasks arise we should preferably go for something closer to what `datasets` proposes, which is having a secondary argument `task_options` that will be close to what that task requires. - Should be backward compatible in all cases, for instance `pipeline(task="translation_en_to_de")` should work out of the box. - Should provide a warning when a specific translation pair has been selected on behalf of the user using `model.config.task_specific_params`. * Update src/transformers/pipelines.py Co-authored-by:
Julien Chaumond <chaumond@gmail.com> Co-authored-by:
Julien Chaumond <chaumond@gmail.com>
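A hedged sketch of the parametrized translation task; the explicit pair form shown is the documented, backward-compatible path, and the comment about `task_specific_params` paraphrases the commit rather than showing its internals:
```python
from transformers import pipeline

# Explicit pair: parsed from the "translation_XX_to_YY" task name.
translator = pipeline("translation_en_to_de", model="t5-small")
print(translator("The weather is nice today."))

# For models whose config.task_specific_params describes a single (src, tgt)
# pair, the pair can be inferred automatically, with a warning when it is
# selected on the user's behalf.
```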
-
Patrick von Platen authored
* fix config save * add test * add config class variable and another test * line break * fix fsmt and typo * god am I making many errors today :-/ * Update src/transformers/configuration_utils.py Co-authored-by:
Sylvain Gugger <35901082+sgugger@users.noreply.github.com> Co-authored-by:
Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
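A minimal round-trip check in the spirit of the added tests (not the actual test code); the config class and values are illustrative:
```python
import tempfile

from transformers import BertConfig

config = BertConfig(hidden_size=128, num_hidden_layers=2, num_attention_heads=4)

with tempfile.TemporaryDirectory() as tmp_dir:
    # A config written with save_pretrained should load back to an identical object.
    config.save_pretrained(tmp_dir)
    reloaded = BertConfig.from_pretrained(tmp_dir)

assert config.to_dict() == reloaded.to_dict()
```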
-
Stas Bekman authored
* basic config test with online model * typo * style * better test
-
Stas Bekman authored
* slow tests should be slow * exception note * style * integrate LysandreJik's notes with some expansions * Apply suggestions from code review Co-authored-by:
Sylvain Gugger <35901082+sgugger@users.noreply.github.com> * another slow test * fix link, and prose * clarify. * note from Sam * typo Co-authored-by:
Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
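A hedged sketch of how a test is marked slow with the helpers from `transformers.testing_utils`; the test body is a placeholder:
```python
from transformers.testing_utils import require_torch, slow


@slow
@require_torch
def test_full_model_integration():
    # Only runs when RUN_SLOW=1 is set in the environment.
    ...
```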
-
- 21 Oct, 2020 3 commits
-
-
Patrick von Platen authored
-
François Lagunas authored
Improved TensorBoard and Wandb integration, as well as optuna and ray/tune support, with minor modifications to trainer core code.
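A hedged sketch of the optuna-backed search exposed through the Trainer; the checkpoint, the tiny toy dataset, and the trial count are assumptions made only to keep the example self-contained:
```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer, Trainer, TrainingArguments

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

# Tiny toy dataset, just enough to exercise the search loop.
texts, labels = ["great movie", "terrible movie"], [1, 0]
enc = tokenizer(texts, truncation=True, padding=True)
dataset = [
    {"input_ids": enc["input_ids"][i], "attention_mask": enc["attention_mask"][i], "labels": labels[i]}
    for i in range(len(texts))
]

def model_init():
    # Hyperparameter search needs a fresh model per trial.
    return AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased")

trainer = Trainer(
    model_init=model_init,
    args=TrainingArguments(output_dir="hpo-output"),
    train_dataset=dataset,
    eval_dataset=dataset,
)
best_run = trainer.hyperparameter_search(direction="minimize", backend="optuna", n_trials=2)
print(best_run.hyperparameters)
```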
-
Stas Bekman authored
* make the save_load special key tests common * handle mbart * cleaner solution * fix * move test_save_load_missing_keys back into fsmt for now * restore * style * add marian * add pegasus * blenderbot * revert - no static embed
-