Commits · 27b1516d32b691533fc497e7ee4ceb88c39cdfdf · chenpangpang / transformers

03 Nov, 2021 1 commit

minimal fixes to run DataCollatorForWholeWordMask with return_tensors="np" and return_tensors="tf" (#13891) · 27b1516d

Dean Wyatte authored Nov 03, 2021

* minimal fixes to run DataCollatorForWholeWordMask with return_tensors="np" and return_tensors="tf"

* more consistent implementation for numpy_mask_tokens


31 Aug, 2021 1 commit

TF/Numpy variants for all DataCollator classes (#13105) · 854260ca

Matt authored Aug 31, 2021



* Adding a TF variant of the DataCollatorForTokenClassification to get feedback

* Added a Numpy variant and a post_init check to fail early if a missing import is found

* Fixed call to Numpy variant

* Added a couple more of the collators

* Update src/transformers/data/data_collator.py
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Fixes, style pass, finished DataCollatorForSeqToSeq

* Added all the LanguageModeling DataCollators, except SOP and PermutationLanguageModeling

* Adding DataCollatorForPermutationLanguageModeling

* Style pass

* Add missing `__call__` for PLM

* Remove `post_init` checks for frameworks because the imports inside them were making us fail code quality checks

* Remove unused imports

* First attempt at some TF tests

* A second attempt to make any of those tests actually work

* TF tests, round three

* TF tests, round four

* TF tests, round five

* TF tests, all enabled!

* Style pass

* Merging tests into `test_data_collator.py`

* Merging tests into `test_data_collator.py`

* Fixing up test imports

* Fixing up test imports

* Trying shuffling the conditionals around

* Commenting out non-functional old tests

* Completed all tests for all three frameworks

* Style pass

* Fixed test typo

* Style pass

* Move standard `__call__` method to mixin

* Rearranged imports for `test_data_collator`

* Fix data collator typo "torch" -> "pt"

* Fixed the most embarrassingly obvious bug

* Update src/transformers/data/data_collator.py
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Renaming mixin

* Updating docs
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: Dalton Walker <dalton_walker@icloud.com>
Co-authored-by: Andrew Romans <andrew.romans@hotmail.com>


08 Apr, 2021 1 commit
- Run mlm pad to multiple for fp16 (#11128) · 6c40e497
  Andrea Cappelli authored Apr 08, 2021
```
* Add mlm collator pad to multiple option (#10627)

* Use padding to 8x in run mlm (#10627)
```
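The "pad to multiple" option named in this commit can be pictured with a minimal sketch. This is not the collator's actual implementation: `pad_batch_to_multiple` and its parameters are hypothetical names chosen for illustration, and plain Python lists stand in for tensors. The idea is that every sequence in a batch is right-padded to a shared length rounded up to a multiple of 8, a shape that fp16 hardware kernels tend to handle more efficiently.

```python
from typing import List


def pad_batch_to_multiple(
    batch: List[List[int]], multiple: int = 8, pad_id: int = 0
) -> List[List[int]]:
    """Right-pad every sequence to a shared length that is a multiple of `multiple`.

    Hypothetical helper for illustration only; not part of any library API.
    """
    if multiple < 1:
        raise ValueError("multiple must be a positive integer")
    longest = max((len(seq) for seq in batch), default=0)
    # Round the longest length up to the next multiple (ceiling division).
    target = -(-longest // multiple) * multiple
    return [list(seq) + [pad_id] * (target - len(seq)) for seq in batch]


padded = pad_batch_to_multiple([[5, 6, 7], [1, 2, 3, 4, 5, 6, 7, 8, 9]])
print([len(seq) for seq in padded])  # [16, 16]
```

Here the longest sequence has 9 tokens, so the batch is padded to 16 rather than 9.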
07 Dec, 2020 1 commit
- Copyright (#8970) · 00aa9dbc
  Sylvain Gugger authored Dec 07, 2020
```
* Add copyright everywhere missing

* Style
```
04 Nov, 2020 1 commit

Clean up data collators and datasets (#8308) · 9c4aa4ac

Sylvain Gugger authored Nov 04, 2020



* Clean up data collators and datasets

* Apply suggestions from code review
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

* Remove needless clone
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>


03 Nov, 2020 1 commit
- Data collator for token classification (#8274) · 7f556d2e
  Sylvain Gugger authored Nov 03, 2020
```
* Add DataCollatorForTokenClassification and clean tests

* Make quality
```
26 Oct, 2020 1 commit
- Fix label name in DataCollatorForNextSentencePrediction test (#8048) · 07747863
  Sylvain Gugger authored Oct 26, 2020
  
22 Sep, 2020 1 commit

Mark big downloads slow (#7325) · 1ee2194f

Sylvain Gugger authored Sep 22, 2020

* Mark big downloads as slow

* Add import

* Right order for slow decorator

* More slow tests


10 Sep, 2020 1 commit

Albert pretrain datasets/ datacollator (#6168) · 762cba3b

Yu Liu authored Sep 10, 2020



* add dataset for albert pretrain

* datacollator for albert pretrain

* naming, comprehension, file reading change

* data cleaning is not needed after this modification

* delete prints

* fix a bug

* file structure change

* add tests for albert datacollator

* remove random seed

* add back len and get item function

* sample file for testing and test code added

* format change for black

* more format change

* Style

* var assignment issue resolve

* add back wrongly deleted DataCollatorWithPadding in init file

* Style
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
Co-authored-by: Lysandre <lysandre.debut@reseau.eseo.fr>


31 Aug, 2020 1 commit

Dataset and DataCollator for BERT Next Sentence Prediction (NSP) task (#6644) · 2de7ee03

Huang Lianzhe authored Aug 31, 2020



* add datacollator and dataset for next sentence prediction task

* bug fix (numbers of special tokens & truncate sequences)

* bug fix (+ dict inputs support for data collator)

* add padding for nsp data collator; renamed cached files to avoid conflict.

* add test for nsp data collator

* Style
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
Co-authored-by: Lysandre <lysandre.debut@reseau.eseo.fr>


20 Aug, 2020 1 commit

Add tests to Trainer (#6605) · 573bdb0a

Sylvain Gugger authored Aug 20, 2020

* Add tests to Trainer

* Test if removing long breaks everything

* Remove ugly hack

* Fix distributed test

* Use float for number of epochs


20 Jul, 2020 1 commit

Trainer support for iterabledataset (#5834) · 290b6e18

Pradhy729 authored Jul 20, 2020

* Don't pass sampler for iterable dataset

* Added check for test and eval dataloaders.

* Formatting

* Don't pass sampler for iterable dataset

* Added check for test and eval dataloaders.

* Formatting

* Cleaner if nesting.

* Added test for trainer and iterable dataset

* Formatting for test

* Fixed import when torch is available only.

* Added require torch decorator to helper class

* Moved dataset class inside unittest

* Removed nested if and changed model in test

* Checking torch availability for IterableDataset


07 Jul, 2020 1 commit

Added data collator for permutation (XLNet) language modeling and related calls (#5522) · 3dcb748e

Shashank Gupta authored Jul 07, 2020

* Added data collator for XLNet language modeling and related calls

Added DataCollatorForXLNetLanguageModeling in data/data_collator.py
to generate necessary inputs for language modeling training with
XLNetLMHeadModel. Also added related arguments, logic and calls in
examples/language-modeling/run_language_modeling.py.

Resolves: #4739, #2008 (partially)

* Changed name to `DataCollatorForPermutationLanguageModeling`

Changed the name of `DataCollatorForXLNetLanguageModeling` to the more general `DataCollatorForPermutationLanguageModeling`.
Removed the `--mlm` flag requirement for the new collator and defined a separate `--plm_probability` flag for its use.
CTRL uses a CLM loss just like GPT and GPT-2, so it should work out of the box with this script (provided `past` is handled
similarly to `mems` for XLNet).
Changed calls and imports appropriately.

* Added detailed comments, changed variable names

Added more detailed comments to `DataCollatorForPermutationLanguageModeling` in `data/data_collator.py` to explain how it works. Also cleaned up variable names and made them more informative.

* Added tests for new data collator

Added tests in `tests/test_trainer.py` for DataCollatorForPermutationLanguageModeling based on those in DataCollatorForLanguageModeling. A specific test has been added to check for odd-length sequences.

* Fixed styling issues


01 Jul, 2020 2 commits
- Fix tensor label type inference in default collator (#5250) · 35befd9c
  Joe Davison authored Jul 01, 2020
```
* allow tensor label inputs to default collator

* replace try/except with type check
```
- Move tests/utils.py -> transformers/testing_utils.py (#5350) · 13deb95a
  Sam Shleifer authored Jul 01, 2020
  
18 Jun, 2020 1 commit
- Fix #5114 (#5122) · 5f721ad6
  Sylvain Gugger authored Jun 18, 2020
  
17 Jun, 2020 1 commit
- Make default_data_collator more flexible and deprecate old behavior (#5060) · 20fa8289
  Sylvain Gugger authored Jun 17, 2020
```
* Make default_data_collator more flexible

* Accept tensors for all features

* Document code

* Refactor

* Formatting
```
15 Jun, 2020 1 commit

Make DataCollator a callable (#5015) · 1affde2f

Sylvain Gugger authored Jun 15, 2020



* Make DataCollator a callable

* Update src/transformers/data/data_collator.py
Co-authored-by: Julien Chaumond <chaumond@gmail.com>


05 Jun, 2020 1 commit
- Fix argument label (#4792) · 4dd5cf22
  Sylvain Gugger authored Jun 05, 2020
```
* Fix argument label

* Fix test
```
21 May, 2020 1 commit

Adds predict stage for glue tasks, and generate result files which can be submitted to gluebenchmark.com (#4463) · 49296533

Zhangyx authored May 21, 2020

* Adds predict stage for glue tasks, and generate result files which could be submitted to gluebenchmark.com website.

* Use Split enum + always output the label name
Co-authored-by: Julien Chaumond <chaumond@gmail.com>


13 May, 2020 1 commit

(v2) Improvements to the wandb integration (#4324) · 24175910

Julien Chaumond authored May 12, 2020



* Improvements to the wandb integration

* small reorg + no global necessary

* feat(trainer): log epoch and final metrics

* Simplify logging a bit

* Fixup

* Fix crash when just running eval
Co-authored-by: Chris Van Pelt <vanpelt@gmail.com>
Co-authored-by: Boris Dayma <boris.dayma@gmail.com>


07 May, 2020 1 commit

BIG Reorganize examples (#4213) · 0ae96ff8

Julien Chaumond authored May 07, 2020

* Created using Colaboratory

* [examples] reorganize files

* remove run_tpu_glue.py as superseded by TPU support in Trainer

* Bugfix: int, not tuple

* move files around


22 Apr, 2020 1 commit

Trainer (#3800) · dd9d483d

Julien Chaumond authored Apr 21, 2020

* doc

* [tests] Add sample files for a regression task

* [HUGE] Trainer

* Feedback from @sshleifer

* Feedback from @thomwolf + logging tweak

* [file_utils] when downloading concurrently, get_from_cache will use the cached file for subsequent processes

* [glue] Use default max_seq_length of 128 like before

* [glue] move DataTrainingArguments around

* [ner] Change interface of InputExample, and align run_{tf,pl}

* Re-align the pl scripts a little bit

* ner

* [ner] Add integration test

* Fix language_modeling with API tweak

* [ci] Tweak loss target

* Don't break console output

* amp.initialize: model must be on right device before

* [multiple-choice] update for Trainer

* Re-align to 827d6d6e
