Commits · d0486c8bc238bc5bb4757871b0c5afc49c3c11cb · chenpangpang / transformers

"examples/run_mmimdb.py" did not exist on "912fdff899cf0fd674ed357e46a0209311aefad2"

15 Jul, 2020 1 commit
- [cleanup] T5 test, warnings (#5761) · d0486c8b
  Sam Shleifer authored Jul 15, 2020
  
  d0486c8b
14 Jul, 2020 2 commits

[fix] mbart_en_ro_generate test now identical to fairseq (#5731) · 838950ee
Sam Shleifer authored Jul 14, 2020

838950ee

[Reformer classification head] Implement the reformer model classification... · f867000f

as-stevens authored Jul 14, 2020


[Reformer classification head] Implement the reformer model classification head for text classification (#5198)

* Reformer model head classification implementation for text classification

* Reformat the reformer model classification code

* PR review comments, and test case implementation for reformer for classification head changes

* CI/CD reformer for classification head test import error fix

* CI/CD test case implementation  added ReformerForSequenceClassification to all_model_classes

* Code formatting- fixed

* Normal test cases added for reformer classification head

* Fix test cases implementation for the reformer classification head

* removed token_type_id parameter from the reformer classification head

* fixed the test case for reformer classification head

* merge conflict with master fixed

* merge conflict, changed reformer classification to accept the choice_label parameter added in latest code

* refactored the the reformer classification head test code

* reformer classification head, common transform test cases fixed

* final set of the review comment, rearranging the reformer classes and docstring add to classification forward method

* fixed the compilation error and text case fix for reformer classification head

* Apply suggestions from code review

Remove unnecessary dup
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

f867000f

13 Jul, 2020 2 commits
- FlaubertForTokenClassification (#5644) · 45addfe9
  Stas Bekman authored Jul 13, 2020
```
* implement FlaubertForTokenClassification as a subclass of XLMForTokenClassification

* fix mapping order

* add the doc

* add common tests
```
  45addfe9
- rename the function to match the rest of the test convention (#5692) · 443b0cad
  Stas Bekman authored Jul 13, 2020
  
  443b0cad
10 Jul, 2020 1 commit

Change model outputs types to self-document outputs (#5438) · edfd82f5

Sylvain Gugger authored Jul 10, 2020

* [WIP] Proposal for model outputs

* All Bert models

* Make CI green maybe?

* Fix ONNX test

* Isolate ModelOutput from pt and tf

* Formatting

* Add Electra models

* Auto-generate docstrings from outputs

* Add TF outputs

* Add some BERT models

* Revert TF side

* Remove last traces of TF changes

* Fail with a clear error message

* Add Albert and work through Bart

* Add CTRL and DistilBert

* Formatting

* Progress on Bart

* Renames and finish Bart

* Formatting

* Fix last test

* Add DPR

* Finish Electra and add FlauBERT

* Add GPT2

* Add Longformer

* Add MMBT

* Add MobileBert

* Add GPT

* Formatting

* Add Reformer

* Add Roberta

* Add T5

* Add Transformer XL

* Fix test

* Add XLM + fix XLMForTokenClassification

* Style + XLMRoberta

* Add XLNet

* Formatting

* Add doc of return_tuple arg

edfd82f5

08 Jul, 2020 2 commits

Fix Inconsistent NER Grouping (Pipeline) (#4987) · 0cc4eae0

Lorenzo Ampil authored Jul 09, 2020



* Add B I handling to grouping

* Add fix to include separate entity as last token

* move last_idx definition outside loop

* Use first entity in entity group as reference for entity type

* Add test cases

* Take out extra class accidentally added

* Return tf ner grouped test to original

* Take out redundant last entity

* Get last_idx safely
Co-authored-by: ColleterVi <36503688+ColleterVi@users.noreply.github.com>

* Fix first entity comment

* Create separate functions for group_sub_entities and group_entities (splitting call method to testable functions)

* Take out unnecessary last_idx

* Remove additional forward pass test

* Move token classification basic tests to separate class

* Move token classification basic tests back to monocolumninputtestcase

* Move base ner tests to nerpipelinetests

* Take out unused kwargs

* Add back mandatory_keys argument

* Add unitary tests for group_entities in _test_ner_pipeline

* Fix last entity handling

* Fix grouping fucntion used

* Add typing to group_sub_entities and group_entities
Co-authored-by: ColleterVi <36503688+ColleterVi@users.noreply.github.com>

0cc4eae0

[Benchmark] Add benchmarks for TF Training (#5594) · f82a2a5e

Patrick von Platen authored Jul 08, 2020

* tf_train

* adapt timing for tpu

* fix timing

* fix timing

* fix timing

* fix timing

* update notebook

* add tests

f82a2a5e

07 Jul, 2020 8 commits

Add mbart-large-cc25, support translation finetuning (#5129) · 353b8f1e

Sam Shleifer authored Jul 07, 2020

improve unittests for finetuning, especially w.r.t testing frozen parameters
fix freeze_embeds for T5
add streamlit setup.cfg

353b8f1e

[Almost all TF models] TF clean up: add missing CLM / MLM loss; fix T5 naming... · 4dc65591

Patrick von Platen authored Jul 07, 2020


[Almost all TF models] TF clean up: add missing CLM / MLM loss; fix T5 naming and keras compile (#5395)

* add first version of clm tf

* make style

* add more tests for bert

* update tf clm loss

* fix tests

* correct tf ner script

* add mlm loss

* delete bogus file

* clean tf auto model + add tests

* finish adding clm loss everywhere

* fix training in distilbert

* fix flake8

* save intermediate

* fix tf t5 naming

* remove prints

* finish up

* up

* fix tf gpt2

* fix new test utils import

* fix flake8

* keep backward compatibility

* Update src/transformers/modeling_tf_albert.py
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update src/transformers/modeling_tf_auto.py
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update src/transformers/modeling_tf_electra.py
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update src/transformers/modeling_tf_roberta.py
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update src/transformers/modeling_tf_mobilebert.py
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update src/transformers/modeling_tf_auto.py
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update src/transformers/modeling_tf_bert.py
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update src/transformers/modeling_tf_distilbert.py
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* apply sylvains suggestions
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

4dc65591

Fix tests imports dpr (#5576) · 4fedc125
Quentin Lhoest authored Jul 07, 2020
```
* fix test imports

* fix max_length

* style

* fix tests
```
4fedc125
[Bart] enable test_torchscript, update test_tie_weights (#5457) · d4886173
Sam Shleifer authored Jul 07, 2020
```
* Passing all but one torchscript test

* Style

* move comment

* remove unneeded assert
```
d4886173

Add DPR model (#5279) · fbd87921

Quentin Lhoest authored Jul 07, 2020



* beginning of dpr modeling

* wip

* implement forward

* remove biencoder + better init weights

* export dpr model to embed model for nlp lib

* add new api

* remove old code

* make style

* fix dumb typo

* don't load bert weights

* docs

* docs

* style

* move the `k` parameter

* fix init_weights

* add pretrained configs

* minor

* update config names

* style

* better config

* style

* clean code based on PR comments

* change Dpr to DPR

* fix config

* switch encoder config to a dict

* style

* inheritance -> composition

* add messages in assert startements

* add dpr reader tokenizer

* one tokenizer per model

* fix base_model_prefix

* fix imports

* typo

* add convert script

* docs

* change tokenizers conf names

* style

* change tokenizers conf names

* minor

* minor

* fix wrong names

* minor

* remove unused convert functions

* rename convert script

* use return_tensors in tokenizers

* remove n_questions dim

* move generate logic to tokenizer

* style

* add docs

* docs

* quality

* docs

* add tests

* style

* add tokenization tests

* DPR full tests

* Stay true to the attention mask building

* update docs

* missing param in bert input docs

* docs

* style
Co-authored-by: Lysandre <lysandre.debut@reseau.eseo.fr>

fbd87921

Make T5 compatible with ONNX (#5518) · 69122657

Abel authored Jul 07, 2020



* Default decoder inputs to encoder ones for T5 if neither are specified.

* Fixing typo, now all tests are passing.

* Changing einsum to operations supported by onnx

* Adding a test to ensure T5 can be exported to onnx op>9

* Modified test for onnx export to make it faster

* Styling changes.

* Styling changes.

* Changing notation for matrix multiplication
Co-authored-by: Abel Riboulot <tkai@protomail.com>

69122657

[Reformer] Adapt Reformer MaskedLM Attn mask (#5560) · 989ae326
Patrick von Platen authored Jul 07, 2020
```
* fix attention mask

* fix slow test

* refactor attn masks

* fix fp16 generate test
```
989ae326

Added data collator for permutation (XLNet) language modeling and related calls (#5522) · 3dcb748e

Shashank Gupta authored Jul 07, 2020

* Added data collator for XLNet language modeling and related calls

Added DataCollatorForXLNetLanguageModeling in data/data_collator.py
to generate necessary inputs for language modeling training with
XLNetLMHeadModel. Also added related arguments, logic and calls in
examples/language-modeling/run_language_modeling.py.

Resolves: #4739, #2008 (partially)

* Changed name to `DataCollatorForPermutationLanguageModeling`

Changed the name of `DataCollatorForXLNetLanguageModeling` to the more general `DataCollatorForPermutationLanguageModelling`.
Removed the `--mlm` flag requirement for the new collator and defined a separate `--plm_probability` flag for its use.
CTRL uses a CLM loss just like GPT and GPT-2, so should work out of the box with this script (provided `past` is taken care of
similar to `mems` for XLNet).
Changed calls and imports appropriately.

* Added detailed comments, changed variable names

Added more detailed comments to `DataCollatorForPermutationLanguageModeling` in `data/data_collator.py` to explain working. Also cleaned up variable names and made them more informative.

* Added tests for new data collator

Added tests in `tests/test_trainer.py` for DataCollatorForPermutationLanguageModeling based on those in DataCollatorForLanguageModeling. A specific test has been added to check for odd-length sequences.

* Fixed styling issues

3dcb748e

06 Jul, 2020 1 commit

Various tokenizers fixes (#5558) · 5787e4c1

Anthony MOI authored Jul 06, 2020

* BertTokenizerFast - Do not specify strip_accents by default

* Bump tokenizers to new version

* Add test for AddedToken serialization

5787e4c1

03 Jul, 2020 2 commits

[cleanup] TF T5 tests only init t5-base once. (#5410) · 58cca47c
Sam Shleifer authored Jul 03, 2020

58cca47c

Exposing prepare_for_model for both slow & fast tokenizers (#5479) · 17ade127

Lysandre Debut authored Jul 03, 2020



* Exposing prepare_for_model for both slow & fast tokenizers

* Update method signature

* The traditional style commit

* Hide the warnings behind the verbose flag

* update default truncation strategy and prepare_for_model

* fix tests and prepare_for_models methods
Co-authored-by: Thomas Wolf <thomwolf@users.noreply.github.com>

17ade127

02 Jul, 2020 1 commit

Changed expected_output_ids in TransfoXL generation test (#5462) · 6726416e

Teven authored Jul 02, 2020

* Changed expected_output_ids in TransfoXL generation test to match #4826 generation PR.

* making black happy

* making isort happy

6726416e

01 Jul, 2020 8 commits

[Reformer] Add Masked LM Reformer (#5426) · d16e36c7
Patrick von Platen authored Jul 01, 2020
```
* fix conflicts

* fix

* happy rebasing
```
d16e36c7
Fix tensor label type inference in default collator (#5250) · 35befd9c
Joe Davison authored Jul 01, 2020
```
* allow tensor label inputs to default collator

* replace try/except with type check
```
35befd9c
finish reformer qa head (#5433) · fe81f7d1
Patrick von Platen authored Jul 01, 2020

fe81f7d1

[Longformer] Major Refactor (#5219) · d697b6ca

Patrick von Platen authored Jul 01, 2020

* refactor naming

* add small slow test

* refactor

* refactor naming

* rename selected to extra

* big global attention refactor

* make style

* refactor naming

* save intermed

* refactor functions

* finish function refactor

* fix tests

* fix longformer

* fix longformer

* fix longformer

* fix all tests but one

* finish longformer

* address sams and izs comments

* fix transpose

d697b6ca

[fix] Marian tests import (#5442) · e0d58ddb
Sam Shleifer authored Jul 01, 2020

e0d58ddb

Raises PipelineException on FillMaskPipeline when there are != 1 mask_token in the input (#5389) · 608d5a7c

Funtowicz Morgan authored Jul 01, 2020



* Added PipelineException
Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* fill-mask pipeline raises exception when more than one mask_token detected.
Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Put everything in a function.
Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Added tests on pipeline fill-mask when input has != 1 mask_token
Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Fix numel() computation for TF
Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Addressing PR comments.
Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Remove function typing to avoid import on specific framework.
Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Quality.
Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Retry typing with @julien-c tip.
Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Quality².
Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Simplify fill-mask mask_token checking.
Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Trigger CI

608d5a7c

MarianTokenizer.prepare_translation_batch uses new tokenizer API (#5182) · 43cb03a9
Sam Shleifer authored Jul 01, 2020

43cb03a9
Move tests/utils.py -> transformers/testing_utils.py (#5350) · 13deb95a
Sam Shleifer authored Jul 01, 2020

13deb95a

30 Jun, 2020 1 commit
- [fix] slow fill_mask test failure (#5406) · 32d20314
  Sam Shleifer authored Jun 30, 2020
  
  32d20314
29 Jun, 2020 1 commit

[Docs] Benchmark docs (#5360) · 4bcc35cd

Patrick von Platen authored Jun 29, 2020



* first doc version

* add benchmark docs

* fix typos

* improve README

* Update docs/source/benchmarks.rst
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

* fix naming and docs
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

4bcc35cd

28 Jun, 2020 1 commit
- [mBART] skip broken forward pass test, stronger integration test (#5327) · 28a690a8
  Sam Shleifer authored Jun 28, 2020
  
  28a690a8
26 Jun, 2020 4 commits

examples/seq2seq/run_eval.py fixes and docs (#5322) · 393b8dc0
Sam Shleifer authored Jun 26, 2020

393b8dc0

[tokenizers] Updates data processors, docstring, examples and model cards to the new API (#5308) · 601d4d69

Thomas Wolf authored Jun 26, 2020

* remove references to old API in docstring - update data processors

* style

* fix tests - better type checking error messages

* better type checking

* include awesome fix by @LysandreJik for #5310

* updated doc and examples

601d4d69

[pipelines] Change summarization default to distilbart-cnn-12-6 (#5289) · 798dbff6
Sam Shleifer authored Jun 26, 2020

798dbff6

Add pad_to_multiple_of on tokenizers (reimport) (#5054) · 135791e8

Funtowicz Morgan authored Jun 26, 2020



* Add new parameter `pad_to_multiple_of` on tokenizers.

* unittest for pad_to_multiple_of

* Add .name when logging enum.

* Fix missing .items() on dict in tests.

* Add special check + warning if the tokenizer doesn't have proper pad_token.

* Use the correct logger format specifier.

* Ensure tokenizer with no pad_token do not modify the underlying padding strategy.

* Skip test if tokenizer doesn't have pad_token

* Fix RobertaTokenizer on empty input

* Format.
Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* fix and updating to simpler API
Co-authored-by: Thomas Wolf <thomwolf@users.noreply.github.com>

135791e8

25 Jun, 2020 3 commits

Refactor Code samples; Test code samples (#5036) · 364a5ae1

Lysandre Debut authored Jun 25, 2020



* Refactor code samples

* Test docstrings

* Style

* Tokenization examples

* Run rust of tests

* First step to testing source docs

* Style and BART comment

* Test the remainder of the code samples

* Style

* let to const

* Formatting fixes

* Ready for merge

* Fix fixture + Style

* Fix last tests

* Update docs/source/quicktour.rst
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Addressing @sgugger's comments + Fix MobileBERT in TF
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

364a5ae1

[tokenizers] Several small improvements and bug fixes (#5287) · 315f464b

Thomas Wolf authored Jun 25, 2020

* avoid recursion in id checks for fast tokenizers

* better typings and fix #5232

* align slow and fast tokenizers behaviors for Roberta and GPT2

* style and quality

* fix tests - improve typings

315f464b

[Tokenization] Fix #5181 - make #5155 more explicit - move back the default... · 27cf1d97

Thomas Wolf authored Jun 25, 2020

[Tokenization] Fix #5181 - make #5155 more explicit - move back the default logging level in tests to WARNING (#5252)

* fix-5181

Padding to max sequence length while truncation to another length was wrong on slow tokenizers

* clean up and fix #5155

* fix XLM test

* Fix tests for Transfo-XL

* logging only above WARNING in tests

* switch slow tokenizers tests in @slow

* fix Marian truncation tokenization test

* style and quality

* make the test a lot faster by limiting the sequence length used in tests

27cf1d97

24 Jun, 2020 2 commits

Add more tests on tokenizers serialization - fix bugs (#5056) · 7ac91107

Thomas Wolf authored Jun 24, 2020

* update tests for fast tokenizers + fix small bug in saving/loading

* better tests on serialization

* fixing serialization

* comment cleanup

7ac91107

Cleaning TensorFlow models (#5229) · cf10d4cf
Lysandre Debut authored Jun 24, 2020
```
* Cleaning TensorFlow models

Update all classes


stylr

* Don't average loss
```
cf10d4cf