"docs/vscode:/vscode.git/clone" did not exist on "b90e29d52cfe94b1995cc5254f700e776b866d2d"
- 10 Nov, 2020 4 commits
-
-
Julien Chaumond authored
-
Lysandre Debut authored
* Patch token classification pipeline
* Some added tests for TokenClassificationArgumentHandler (#8366)

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
-
Julien Chaumond authored
* fix typo
* rm use_cdn & references, and implement new hf_bucket_url
* I'm pretty sure we don't need to `read` this file
* same here
* [BIG] file_utils.networking: do not gobble up errors anymore
* Fix CI
* Apply suggestions from code review
* Tiny doc tweak
* Add doc + pass kwarg everywhere
* Add more tests and explain (cc @sshleifer, let me know if better)
* Also implement revision in pipelines, in the case where we're passing a task name or a string model identifier
* Fix CI
* Fix CI
* [hf_api] new methods + command line implem
* make style
* Final endpoints post-migration
* Fix post-migration
* Py3.6 compat (cc @stefan-it; thank you @stas00)

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: Sam Shleifer <sshleifer@gmail.com>
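A minimal sketch of how the `revision` argument described in this entry might be used when building a pipeline from a string model identifier; the model name and revision value below are illustrative assumptions, not part of the commit:

```python
from transformers import pipeline

# Hypothetical usage: pin the underlying model to a specific revision
# (branch name, tag, or commit hash) instead of the default branch.
classifier = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",  # illustrative model id
    revision="main",  # could also be a tag or a full commit sha
)
print(classifier("Pinning a model revision makes results reproducible."))
```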
-
Teven authored
* Move XLNet memory length FutureWarning
* isort
* style
* Changed default XLNet memory length
-
- 09 Nov, 2020 10 commits
-
-
Stas Bekman authored
* add a multi-gpu job for all example tests
* run only ported tests
* rename
* explain why env is re-activated on each step
* mark all unported/checked tests with @require_torch_non_multigpu_but_fix_me
* style
* Apply suggestions from code review

Co-authored-by: Sam Shleifer <sshleifer@gmail.com>
-
Sylvain Gugger authored
-
Sam Shleifer authored
-
Patrick von Platen authored
* add training tests
* correct longformer
* fix docs
* fix some tests
* fix some more train tests
* remove ipdb
* fix multiple edge case model training
* fix funnel and prophetnet
* clean gpt models
* undo renaming of albert
-
Sylvain Gugger authored
-
Stas Bekman authored
* fairseq broke chkpt data - fixing that
* style
* support older bpecodes filenames - specifically "code" in iwslt14
-
Stas Bekman authored
* support lowercase tokenizer
* fix arg pos
-
Shashank Gupta authored
-
Philip May authored
* add evaluate doc
* fix style with utils/style_doc
* Update src/transformers/trainer.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
-
Stas Bekman authored
-
- 08 Nov, 2020 2 commits
-
-
Jonathan Chang authored
-
Manav Rathod authored
-
- 07 Nov, 2020 1 commit
-
-
Jonathan Chang authored
* Fix DataCollatorForWholeWordMask
* Replace all tensorize_batch in data_collator.py
-
- 06 Nov, 2020 2 commits
-
-
Patrick von Platen authored
-
Yossi Synett authored
[All Seq2Seq models + CLM models that can be used with EncoderDecoder] Add cross-attention weights to outputs (#8071)
* Output cross-attention with the decoder attention output
* Update src/transformers/modeling_bert.py
* add cross-attention for t5 and bart as well
* fix tests
* correct typo in docs
* apply Sylvain's and Sam's comments
* correct typo

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
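A sketch of how the new cross-attention outputs might be read back, assuming the output field is named `cross_attentions` as the entry above describes; the BART checkpoint is an illustrative choice:

```python
from transformers import BartTokenizer, BartForConditionalGeneration

# Illustrative checkpoint; any seq2seq model covered by the change should behave similarly.
tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

inputs = tokenizer("Cross-attention weights are now part of the outputs.", return_tensors="pt")
outputs = model(**inputs, output_attentions=True, return_dict=True)

# Expected: one tensor per decoder layer with shape
# (batch_size, num_heads, target_length, source_length).
print(len(outputs.cross_attentions), outputs.cross_attentions[0].shape)
```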
-
- 05 Nov, 2020 3 commits
-
-
Stas Bekman authored
Co-authored-by: Sam Shleifer <sshleifer@gmail.com>
-
Sylvain Gugger authored
* Make Trainer evaluation handle dynamic seq_length
* Document behavior.
* Fix test
* Better fix
* Fixes for realsies this time
* Address review comments
* Without forgetting to save...
-
Guillaume Filion authored
* Output global_attentions in Longformer models
* make style
* small refactoring
* fix tests
* make fix-copies
* add for tf as well
* remove comments in test
* make fix-copies
* make style
* add docs
* make docstring pretty

Co-authored-by: patrickvonplaten <patrick.v.platen@gmail.com>
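A usage sketch of the new output described above, assuming the field is named `global_attentions`; the checkpoint and the choice of which token gets global attention are illustrative:

```python
import torch
from transformers import LongformerModel, LongformerTokenizer

# Illustrative checkpoint.
tokenizer = LongformerTokenizer.from_pretrained("allenai/longformer-base-4096")
model = LongformerModel.from_pretrained("allenai/longformer-base-4096")

inputs = tokenizer("Longformer now also returns global attention weights.", return_tensors="pt")
global_attention_mask = torch.zeros_like(inputs["input_ids"])
global_attention_mask[:, 0] = 1  # give the first token global attention

outputs = model(
    **inputs,
    global_attention_mask=global_attention_mask,
    output_attentions=True,
    return_dict=True,
)
# Expected: one global-attention tensor per layer, alongside the usual local attentions.
print(len(outputs.global_attentions), len(outputs.attentions))
```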
-
- 04 Nov, 2020 3 commits
-
-
Sylvain Gugger authored
* Clean up data collators and datasets
* Apply suggestions from code review
* Remove needless clone

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
-
Nicolas Patry authored
The issue is that with the previous code we would have the following:

```python
qa_pipeline = (...)
qa_pipeline(question="Where was he born ?", context="")
-> IndexError: Dimension out of range (expected to be in range of [-1, 0], but got 1)
```

The goal here is to improve this to actually return a ValueError wherever possible. While at it, I tried to simplify QuestionArgumentHandler's code to make it smaller and more compact while keeping backward compat.
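A small sketch of the improved behaviour described above; relying on the pipeline's default model and the exact wording of the error message are assumptions:

```python
from transformers import pipeline

# With the fix, an empty context should surface a ValueError that callers can
# handle explicitly, instead of an opaque IndexError.
qa_pipeline = pipeline("question-answering")
try:
    qa_pipeline(question="Where was he born?", context="")
except ValueError as err:
    print(f"Rejected invalid input: {err}")
```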
-
Stas Bekman authored
Fixing:

```
src/transformers/tokenization_blenderbot.py:163: DeprecationWarning: invalid escape sequence \s
  token = re.sub("\s{2,}", " ", token)
```
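For reference, a minimal sketch of the kind of change that silences this warning: expressing the pattern as a raw string.

```python
import re

# "\s{2,}" in a normal string literal triggers the invalid-escape DeprecationWarning;
# a raw-string pattern describes the same regex without the warning.
token = "too   many    spaces"
token = re.sub(r"\s{2,}", " ", token)
print(token)  # -> "too many spaces"
```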
-
- 03 Nov, 2020 8 commits
-
-
Ceyda Cinarel authored
* Bug fix: the NER pipeline shouldn't group separate entities of the same type
* Style fix
* [Bug Fix] Shouldn't group entities that are both 'B' even if they are the same type: (B-type1 B-type1) != (B-type1 I-type1)
* [Bug Fix] Add an option `ignore_subwords` to ignore subsequent ##wordpieces in predictions, because some models train on only the first token of a word and not on the subsequent wordpieces (the BERT NER default), so it makes sense to do the same thing at inference time. The simplest fix is to just group the subwords with the first wordpiece.
  [TODO] How to handle ignored scores? Just set them to 0 and calculate a zero-invariant mean?
  [TODO] Handle a different wordpiece_prefix than ##? Possible approaches: get it from the tokenizer (but currently most tokenizers don't have a wordpiece_prefix property) or have an _is_subword(token) helper.
* [Feature add] Added an option for `skip_special_tokens`, because it was harder to remove them after grouping.
* [Additional Changes] Remove the B/I prefix on returned grouped_entities
* [Feature Request/TODO] Return indexes?
* [Bug TODO] Can't use a fast tokenizer with grouped_entities ('BertTokenizerFast' object has no attribute 'convert_tokens_to_string')
* Use offset_mapping to fix the [UNK] token problem
* Ignore the score for subwords
* Modify the ner_pipeline tests
* Change the ner_pipeline ignore_subwords default to True
* Add an ner_pipeline ignore_subwords=False test case
* Fix the offset_mapping index
* Fix style again
* Change the is_subword and convert_tokens_to_string logic
* Merge tests with the new test structure, change test names, remove old tests
* Add NER tests for fast tokenizers (fast tokenizers have convert_tokens_to_string)
* Fix the incorrect merge

Co-authored-by: Ceyda Cinarel <snu-ceyda@users.noreply.github.com>
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
Co-authored-by: Lysandre <lysandre.debut@reseau.eseo.fr>
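A usage sketch of the options discussed in this entry; the argument names follow the PR text, and the model identifier is an illustrative assumption:

```python
from transformers import pipeline

ner = pipeline(
    "ner",
    model="dbmdz/bert-large-cased-finetuned-conll03-english",  # illustrative model id
    grouped_entities=True,   # merge B-/I- pieces of the same entity
    ignore_subwords=True,    # keep only the first wordpiece of each word
)
print(ner("Hugging Face is based in New York City."))
```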
-
Stas Bekman authored
* make it possible to invoke conftest.py in both test suites without crashing on having the same option added
* perl -pi -e 's|--make_reports|--make-reports|' to be consistent with other opts
* add `pytest --make-reports` to all CIs (and artifacts)
* fix
-
Sylvain Gugger authored
* Add DataCollatorForTokenClassification and clean tests
* Make quality
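A minimal sketch of how the new collator might be used; the tokenizer choice and the toy features are illustrative assumptions:

```python
from transformers import AutoTokenizer, DataCollatorForTokenClassification

# The collator pads input_ids and labels to a common length within each batch
# (label positions added by padding are typically set to -100, i.e. ignored).
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
collator = DataCollatorForTokenClassification(tokenizer)

features = [
    {"input_ids": [101, 7592, 102], "labels": [0, 1, 0]},
    {"input_ids": [101, 7592, 2088, 999, 102], "labels": [0, 1, 2, 0, 0]},
]
batch = collator(features)
print(batch["input_ids"].shape, batch["labels"].shape)
```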
-
Philip May authored
* improve documentation of training_args.py: do_train, do_eval, do_predict
* fix line too long
* fix style with black on training_args.py
* Update src/transformers/training_args.py
* fix line length with utils/style_doc
* black reformatting

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
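A minimal sketch of the flags whose documentation this entry improves; the values are illustrative, and the comment reflects the usual reading that these booleans are consumed by training scripts rather than by the Trainer itself:

```python
from transformers import TrainingArguments

# do_train / do_eval / do_predict are plain booleans on TrainingArguments,
# typically read by example scripts to decide which phases to run.
args = TrainingArguments(
    output_dir="out",
    do_train=True,
    do_eval=True,
    do_predict=False,
)
print(args.do_train, args.do_eval, args.do_predict)
```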
-
Stas Bekman authored
-
guillaume-be authored
* Updated ConversationalPipeline to work with encoder-decoder models (e.g. BlenderBot)
* Addition of integration test for EncoderDecoder conversation model

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
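A usage sketch against the API as of the versions this change targets; the specific BlenderBot checkpoint is an illustrative assumption:

```python
from transformers import pipeline, Conversation

# Illustrative encoder-decoder checkpoint; after this change, such models can
# back the conversational pipeline.
chatbot = pipeline("conversational", model="facebook/blenderbot-400M-distill")

conversation = Conversation("What is the best way to learn a new language?")
conversation = chatbot(conversation)
print(conversation.generated_responses[-1])
```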
-
Nicolas Patry authored
* [FIX] TextGenerationPipeline is currently broken, most likely due to #8180. What's missing is a multi- vs. single-string handler at the beginning of the pipeline; there was also no testing of this pipeline.
* Fixing the Conversational tests too.
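A small sketch of the behaviour the fix restores: the pipeline should accept both a single prompt and a list of prompts (the model choice is illustrative):

```python
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
print(generator("The text generation pipeline", max_length=20))
print(generator(["First prompt", "Second prompt"], max_length=20))
```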
-
Patrick von Platen authored
* first draft
* show design proposition for new generate method
* up
* make better readable
* make first version
* gpt2 tests pass
* make beam search for gpt2 work
* add first encoder-decoder code
* delete typo
* make t5 work
* save intermediate
* make bart work with beam search
* finish beam search bart / t5
* add default kwargs
* make more tests pass
* fix no bad words sampler
* some fixes and tests for all distribution processors
* fix test
* fix rag slow tests
* merge to master
* add nograd to generate
* make all slow tests pass
* speed up generate
* fix edge case bug
* small fix
* correct typo
* add type hints and docstrings
* fix typos in tests
* add beam search tests
* add tests for beam scorer
* fix test rag
* finish beam search tests
* move generation tests into a separate file
* fix generation tests
* more tests
* add aggressive generation tests
* fix tests
* add gpt2 sample test
* add more docstrings
* add more docs
* finish doc strings
* apply some more of Sylvain's and Sam's comments
* fix some typos
* make fix-copies
* apply Lysandre's and Sylvain's comments
* final corrections on examples
* small fix for reformer
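A sketch of the public surface this refactor preserves: beam search is still requested through the usual `generate()` arguments (the T5 checkpoint and generation settings below are illustrative):

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

inputs = tokenizer(
    "translate English to German: The house is wonderful.", return_tensors="pt"
)
# num_beams > 1 selects beam search under the refactored generate().
output_ids = model.generate(
    inputs["input_ids"],
    num_beams=4,
    max_length=40,
    no_repeat_ngram_size=2,
    early_stopping=True,
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```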
-
- 02 Nov, 2020 7 commits
-
-
Stas Bekman authored
-
Patrick von Platen authored
-
Lysandre Debut authored
-
Sylvain Gugger authored
-
Sylvain Gugger authored
-
Santiago Castro authored
-
Nicolas Patry authored
* Some work to fix the behaviour of DefaultArgumentHandler by removing it.
* Fix argument checking in specific pipelines.
-