Commits · 27b402cab0a27f2a57067ce8aa6b3e35fc48612e · chenpangpang / transformers

05 Nov, 2020 7 commits

Output global_attentions in Longformer models (#7562) · 27b402ca

Guillaume Filion authored Nov 05, 2020



* Output global_attentions in Longformer models

* make style

* small refactoring

* fix tests

* make fix-copies

* add for tf as well

* remove comments in test

* make fix-copies

* make style

* add docs

* make docstring pretty
Co-authored-by: patrickvonplaten <patrick.v.platen@gmail.com>

27b402ca

no warn (#8329) · 7abc1d96
Sam Shleifer authored Nov 05, 2020

7abc1d96

change TokenClassificationTask class methods to static methods (#7902) · 52f44dd6

Bobby Donchev authored Nov 05, 2020



* change TokenClassificationTask class methods to static methods

Since the methods of TokenClassificationTask do not use self, we should probably switch them to static methods. Also, because the class TokenClassificationTask does not define a constructor, it is currently unusable as is. Switching to static methods removes the need to document the intent of the broken class.

The get_labels and read_examples_from_file methods are still meant to be implemented by subclasses. Static methods are inherited like any other method, which means they can be overridden in the same way.

* Trigger Build
Co-authored-by: Lysandre <lysandre.debut@reseau.eseo.fr>

52f44dd6
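
The static-method pattern described in this commit message can be sketched as follows. This is a minimal illustration, not the repository's code: the simplified signatures and the `ToyNER` subclass with its label set are assumptions made for the example.

```python
from typing import List


class TokenClassificationTask:
    """Base task with no constructor and no instance state."""

    @staticmethod
    def read_examples_from_file(path: str) -> List[List[str]]:
        # Subclasses are expected to override this (simplified signature).
        raise NotImplementedError

    @staticmethod
    def get_labels(path: str) -> List[str]:
        # Subclasses are expected to override this.
        raise NotImplementedError


class ToyNER(TokenClassificationTask):
    """Hypothetical subclass used only for illustration."""

    @staticmethod
    def get_labels(path: str) -> List[str]:
        # Hypothetical fixed label set; a real task might read it from `path`.
        return ["O", "B-PER", "I-PER"]


# No instance is needed: the override is resolved on the subclass itself.
labels = ToyNER.get_labels("")
```

Calling `ToyNER.get_labels("")` works without instantiating anything, while the methods that were not overridden still raise `NotImplementedError`.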

Corrected typo in readme (#8320) · 77c8f6c6
Guillem García Subies authored Nov 05, 2020

77c8f6c6
Update PULL_REQUEST_TEMPLATE.md · 226b9deb
Patrick von Platen authored Nov 05, 2020

226b9deb
Update bug-report.md · 6f35c61f
Patrick von Platen authored Nov 05, 2020

6f35c61f

Create README.md (#8223) · 638c0b7c

Yifan Peng authored Nov 05, 2020



* Create README.md

* Update README.md

* Apply suggestions from code review
Co-authored-by: Kevin Canwen Xu <canwenxu@126.com>
Co-authored-by: Julien Chaumond <chaumond@gmail.com>

638c0b7c

04 Nov, 2020 13 commits
- Clean up data collators and datasets (#8308) · 9c4aa4ac
  Sylvain Gugger authored Nov 04, 2020
```
* Clean up data collators and datasets

* Apply suggestions from code review
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

* Remove needless clone
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
```
  9c4aa4ac
- Fix path to old run_language_modeling.py script (#8302) · b1d3e95e
  Manuel Romero authored Nov 04, 2020
  
  b1d3e95e
- Speedup doc build (#8301) · b6e58db2
  Sylvain Gugger authored Nov 04, 2020
```
* Try -j option

* Try other thing

* Bigger machine

* Test lower sphinx version

* Remove trailing space
```
  b6e58db2
- adding model cards for distilled models (#8300) · 969ccac2
  Victor SANH authored Nov 04, 2020
```
* adding model cards for distil models

* forgot the languages
```
  969ccac2
- Improve QA pipeline error handling (#8286) · 7342d9a5
  Nicolas Patry authored Nov 04, 2020
````
- The issue is that with the previous code we would have the following:

```python
  qa_pipeline = (...)
  qa_pipeline(question="Where was he born ?", context="")
  # -> IndexError: Dimension out of range (expected to be in range of [-1, 0], but got 1)
```

The goal here is to improve this to actually return a ValueError
wherever possible.

While at it, I tried to simplify QuestionArgumentHandler's code to
make it smaller and more compact while keeping backward compat.
````
  7342d9a5
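
The idea of failing early with a ValueError instead of a late IndexError can be sketched as below. `validate_qa_input` is a hypothetical helper written for this illustration; it is not the pipeline's actual argument handler, and the error messages are assumptions.

```python
from typing import Dict


def validate_qa_input(question: str, context: str) -> Dict[str, str]:
    """Hypothetical helper: reject empty inputs before they reach the model."""
    if not isinstance(question, str) or not question.strip():
        raise ValueError("`question` cannot be empty")
    if not isinstance(context, str) or not context.strip():
        raise ValueError("`context` cannot be empty")
    return {"question": question, "context": context}


try:
    validate_qa_input(question="Where was he born ?", context="")
except ValueError as err:
    message = str(err)
```

Validating up front means the caller sees which argument was wrong, rather than a tensor-dimension error raised deep inside the model.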
- Update model cards of deepset/roberta-base-squad2 v1 and v2 (#8241) · 38630e7a
  Branden Chan authored Nov 04, 2020
```
* update deepset/roberta-base-squad2 to v2

* Update model_cards/deepset/roberta-base-squad2/README.md
Co-authored-by: Julien Chaumond <chaumond@gmail.com>
```
  38630e7a
- Model card: T5-base fine-tuned on QASC (#8299) · 04561ecb
  Manuel Romero authored Nov 04, 2020
  
  04561ecb
- Revert size change as it doesn't change anything · 854b44aa
  Sylvain Gugger authored Nov 04, 2020
  
  854b44aa
- Upgrade resource for doc building · 414985c4
  Sylvain Gugger authored Nov 04, 2020
  
  414985c4
- Fix validation file loading in scripts (#8298) · cf897246
  Sylvain Gugger authored Nov 04, 2020
  
  cf897246
- [Generate Test] fix greedy generate test (#8293) · cb966e64
  Patrick von Platen authored Nov 04, 2020
```
* fix greedy generate test

* delete ipdb
```
  cb966e64
- Fix typo in language-modeling README.md (#8287) · 734afa37
  Pengzhi Gao authored Nov 04, 2020
  
  734afa37
- [blenderbot] regex fix (#8282) · 7a7e2c26
  Stas Bekman authored Nov 04, 2020
````
Fixing:

```
  src/transformers/tokenization_blenderbot.py:163: DeprecationWarning: invalid escape sequence \s
      token = re.sub("\s{2,}", " ", token)
```
````
  7a7e2c26
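
A common way to silence this kind of warning is to write the pattern as a raw string, so the backslash reaches the `re` module unchanged. The sketch below shows that approach; the `collapse_whitespace` wrapper is a hypothetical name, and whether the commit used exactly this form is an assumption.

```python
import re


def collapse_whitespace(token: str) -> str:
    # r"\s{2,}" is a raw string: Python does not try to interpret "\s"
    # as a string escape, so no DeprecationWarning is emitted.
    return re.sub(r"\s{2,}", " ", token)


cleaned = collapse_whitespace("hello   world")
```

The regular expression itself is unchanged: runs of two or more whitespace characters are still replaced by a single space.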
03 Nov, 2020 14 commits

[WIP] Ner pipeline grouped_entities fixes (#5970) · 29b536a7

Ceyda Cinarel authored Nov 04, 2020



* Bug fix: NER pipeline shouldn't group separate entities of same type

* style fix

* [Bug Fix] Shouldn't group entities that are both 'B' even if they are the same type
	(B-type1 B-type1) != (B-type1 I-type1)
[Bug Fix] Add an option `ignore_subwords` to ignore subsequent ##wordpieces in predictions, because some models train only on the first token of a word and not on the subsequent wordpieces (BERT NER default), so it makes sense to do the same thing at inference time.
	The simplest fix is to just group the subwords with the first wordpiece.
	[TODO] How to handle ignored scores? Just set them to 0 and calculate a zero-invariant mean?
	[TODO] Handle different wordpiece_prefix (##)? Possible approaches:
		get it from the tokenizer? But currently most tokenizers don't have a wordpiece_prefix property.
		have an _is_subword(token)
[Feature add] Added option to `skip_special_tokens`, because it was harder to remove them after grouping.
[Additional Changes] Remove B/I prefix on returned grouped_entities
[Feature Request/TODO] Return indexes?
[Bug TODO] Can't use fast tokenizer with grouped_entities ('BertTokenizerFast' object has no attribute 'convert_tokens_to_string')

* use offset_mapping to fix [UNK] token problem

* ignore score for subwords

* modify ner_pipeline test

* modify ner_pipeline test

* modify ner_pipeline test

* ner_pipeline change ignore_subwords default to true

* add ner_pipeline ignore_subword=False test case

* fix offset_mapping index

* fix style again duh

* change is_subword and convert_tokens_to_string logic

* merge tests with new test structure

* change test names

* remove old tests

* ner tests for fast tokenizer

* fast tokenizers have convert_tokens_to_string

* Fix the incorrect merge
Co-authored-by: Ceyda Cinarel <snu-ceyda@users.noreply.github.com>
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
Co-authored-by: Lysandre <lysandre.debut@reseau.eseo.fr>

29b536a7
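
The grouping rule from this commit message, (B-type1 B-type1) != (B-type1 I-type1), can be sketched on plain word-level tags as below. This is an illustration under simplifying assumptions: `group_entities` here is a hypothetical standalone function that works on (word, tag) pairs and ignores subwords, scores, and offsets, which the real pipeline has to handle.

```python
from typing import List, Tuple


def group_entities(tagged: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Group BIO-tagged words; a new 'B' tag always starts a new entity."""
    groups: List[Tuple[str, List[str]]] = []
    open_group = False
    for word, tag in tagged:
        if tag == "O":
            # Outside any entity: close the current group, if one is open.
            open_group = False
            continue
        prefix, _, entity_type = tag.partition("-")
        if prefix == "I" and open_group and groups[-1][0] == entity_type:
            # Continuation of the entity that is currently open.
            groups[-1][1].append(word)
        else:
            # 'B' tag (or an 'I' with nothing to continue): start a new entity.
            groups.append((entity_type, [word]))
            open_group = True
    # The returned type drops the B/I prefix.
    return [(etype, " ".join(words)) for etype, words in groups]


grouped = group_entities(
    [("New", "B-LOC"), ("York", "I-LOC"), ("Paris", "B-LOC"), ("is", "O"), ("nice", "O")]
)
```

Here "New York" and "Paris" stay separate even though both are LOC and adjacent, because "Paris" carries its own B tag.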

[CIs] Better reports everywhere (#8275) · 1bb4bba5

Stas Bekman authored Nov 03, 2020

* make it possible to invoke testconf.py in both test suites without crashing on having the same option added

* perl -pi -e 's|--make_reports|--make-reports|' to be consistent with other opts

* add `pytest --make-reports` to all CIs (and artifacts)

* fix

1bb4bba5

Data collator for token classification (#8274) · 7f556d2e
Sylvain Gugger authored Nov 03, 2020
```
* Add DataCollatorForTokenClassification and clean tests

* Make quality
```
7f556d2e

improve documentation of training_args.py (#8270) · 6a064447

Philip May authored Nov 03, 2020



* improve documentation of training_args.py

- do_train
- do_eval
- do_predict

* fix line too long

* fix style with black on training_args.py

* Update src/transformers/training_args.py
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update src/transformers/training_args.py
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update src/transformers/training_args.py
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* fix line length with utils/style_doc

* black reformatting
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

6a064447

Clean Trainer tests and datasets dep (#8268) · 4c19f3ba
Sylvain Gugger authored Nov 03, 2020

4c19f3ba
make files independent (#8267) · 068e6b5e
Patrick von Platen authored Nov 03, 2020

068e6b5e
[examples] minimal version requirement run-time check in PL (#8133) · cd360dcb
Stas Bekman authored Nov 03, 2020
```
Co-authored-by: Sam Shleifer <sshleifer@gmail.com>
```
cd360dcb
forward the worker stderr to the parent process (#8262) · 971c638e
Stas Bekman authored Nov 03, 2020

971c638e
Fix Tatoeba skip · eb6313e8
Lysandre authored Nov 03, 2020

eb6313e8

Updated ConversationalPipeline to work with encoder-decoder models (#8207) · 74f6f91a

guillaume-be authored Nov 03, 2020



* Updated ConversationalPipeline to work with encoder-decoder models (e.g. BlenderBot)

* Addition of integration test for EncoderDecoder conversation model
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

74f6f91a

[FIX] TextGenerationPipeline is currently broken. (#8256) · c66ffa3a

Nicolas Patry authored Nov 03, 2020

* [FIX] TextGenerationPipeline is currently broken.

It's most likely due to #8180.
What's missing is a multi vs single string handler at the beginning of
the pipe.
And also there was no testing of this pipeline.

* Fixing Conversational tests too.

c66ffa3a
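
The "multi vs single string handler" mentioned in this message can be sketched as a small normalization step at the start of the pipeline call. `ensure_list` is a hypothetical helper name for this illustration, not the function used in the pipeline.

```python
from typing import List, Union


def ensure_list(text_inputs: Union[str, List[str]]) -> List[str]:
    """Hypothetical helper: accept one prompt or many, always return a list."""
    if isinstance(text_inputs, str):
        # A single prompt becomes a one-element batch.
        return [text_inputs]
    return list(text_inputs)


single = ensure_list("Once upon a time")
batch = ensure_list(["Once upon a time", "In a galaxy far away"])
```

With the input normalized up front, the rest of the pipeline only has to handle the list case.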

Refactoring the generate() function (#6949) · a1bbcf3f

Patrick von Platen authored Nov 03, 2020

* first draft

* show design proposition for new generate method

* up

* make more readable

* make first version

* gpt2 tests pass

* make beam search for gpt2 work

* add first encoder-decoder code

* delete typo

* make t5 work

* save intermediate

* make bart work with beam search

* finish beam search bart / t5

* add default kwargs

* make more tests pass

* fix no bad words sampler

* some fixes and tests for all distribution processors

* fix test

* fix rag slow tests

* merge to master

* add nograd to generate

* make all slow tests pass

* speed up generate

* fix edge case bug

* small fix

* correct typo

* add type hints and docstrings

* fix typos in tests

* add beam search tests

* add tests for beam scorer

* fix test rag

* finish beam search tests

* move generation tests in separate file

* fix generation tests

* more tests

* add aggressive generation tests

* fix tests

* add gpt2 sample test

* add more docstring

* add more docs

* finish doc strings

* apply some more of Sylvain's and Sam's comments

* fix some typos

* make fix copies

* apply Lysandre's and Sylvain's comments

* final corrections on examples

* small fix for reformer

a1bbcf3f

Skip tatoeba tests if Tatoeba-Challenge not cloned (#8260) · b63beb74
Sam Shleifer authored Nov 03, 2020

b63beb74
[Seq2Seq] Correct import in Seq2Seq Trainer (#8254) · 9f1747f9
Patrick von Platen authored Nov 03, 2020

9f1747f9

02 Nov, 2020 6 commits
- 2 SinusoidalPositionalEmbedding fixes (#8226) · 504ff7bb
  Stas Bekman authored Nov 02, 2020
  
  504ff7bb
- add new notebooks (#8246) · f744b815
  Patrick von Platen authored Nov 02, 2020
  
  f744b815
- fix encoder decoder bug (#8243) · dc26726d
  Patrick von Platen authored Nov 02, 2020
  
  dc26726d
- Add XLMProphetNetTokenizer to tokenization auto (#8245) · 9a23af4a
  Lysandre Debut authored Nov 02, 2020
  
  9a23af4a
- Create README.md · 5b178f3c
  Patrick von Platen authored Nov 02, 2020
  
  5b178f3c
- Add line by line option to mlm/plm scripts (#8240) · e1b1b614
  Sylvain Gugger authored Nov 02, 2020
```
* Make line by line optional in run_mlm

* Add option to disable dynamic padding

* Add option to plm too and update README

* Typos

* More typos

* Even more typos

* Apply suggestions from code review
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
```
  e1b1b614