Commits · aeb2dac04de77ed28fbceee8d9fb7e6f54d4a230 · chenpangpang / transformers

20 Sep, 2021 1 commit

Gunjan Chhablani authored Sep 20, 2021



* Init FNet

* Update config

* Fix config

* Update model classes

* Update tokenizers to use sentencepiece

* Fix errors in model

* Fix defaults in config

* Remove position embedding type completely

* Fix typo and take only real numbers

* Fix type vocab size in configuration

* Add projection layer to embeddings

* Fix position ids bug in embeddings

* Add minor changes

* Add conversion script and remove CausalLM vestiges

* Fix conversion script

* Fix conversion script

* Remove CausalLM Test

* Update checkpoint names to dummy checkpoints

* Add tokenizer mapping

* Fix modeling file and corresponding tests

* Add tokenization test file

* Add PreTraining model test

* Make style and quality

* Make tokenization base tests work

* Update docs

* Add FastTokenizer tests

* Fix fast tokenizer special tokens

* Fix style and quality

* Remove load_tf_weights vestiges

* Add FNet to main README

* Fix configuration example indentation

* Comment tokenization slow test

* Fix style

* Add changes from review

* Fix style

* Remove bos and eos tokens from tokenizers

* Add tokenizer slow test, TPU transforms, NSP

* Add scipy check

* Add scipy availability check to test

* Fix tokenizer and use correct inputs

* Remove remaining TODOs

* Fix tests

* Fix tests

* Comment Fourier Test

* Uncomment Fourier Test

* Change to google checkpoint

* Add changes from review

* Fix activation function

* Fix model integration test

* Add more integration tests

* Add comparison steps to MLM integration test

* Fix style

* Add masked tokenization fix

* Improve mask tokenization fix

* Fix index docs

* Add changes from review

* Fix issue

* Fix failing import in test

* some more fixes

* correct fast tokenizer

* finalize

* make style

* Remove additional tokenization logic

* Set do_lower_case to False

* Allow keeping accents

* Fix tokenization test

* Fix FNet Tokenizer Fast

* fix tests

* make style

* Add tips to FNet docs
Co-authored-by: patrickvonplaten <patrick.v.platen@gmail.com>

d8049331

17 Sep, 2021 4 commits
- Fix GPT2Config parameters in GPT2ModelTester (#13630) · b518aaf1
  calpt authored Sep 17, 2021
  
  b518aaf1
- Updated tiny distilbert models (#13631) · 300ee0c7
  Lysandre Debut authored Sep 17, 2021
  
  300ee0c7
- Fix special tokens not correctly tokenized (#13489) · da8beaaf
  Li-Huai (Allan) Lin authored Sep 17, 2021
```
* Fix special tokens not correctly tokenized

* Add testing

* Fix

* Fix

* Use user workflows instead of directly assigning variables

* Enable test of fast tokenizers

* Update test of canine tokenizer
```
  da8beaaf
- [Trainer] Add nan/inf logging filter (#13619) · 1f9dcfc1
  Patrick von Platen authored Sep 17, 2021
```
* finish

* add test

* push

* remove unnecessary code

* up

* correct test

* Update src/transformers/training_args.py
```
  1f9dcfc1
16 Sep, 2021 4 commits
- XLMR tokenizer is fully picklable (#13577) · e02ed0ee
  Benjamin Davidson authored Sep 16, 2021
```
* made tokenizer fully picklable

* remove whitespace

* added testcase
```
  e02ed0ee
- Feature Extractor: Wav2Vec2 & Speech2Text - Allow truncation + padding=longest (#13600) · 4d5b4c78
  Patrick von Platen authored Sep 16, 2021
```
* correct

* add tests

* Update src/transformers/feature_extraction_sequence_utils.py
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
```
  4d5b4c78
- Fix test (#13608) · cec1c636
  Lysandre Debut authored Sep 16, 2021
  
  cec1c636
- correct (#13585) · b5bab710
  Patrick von Platen authored Sep 16, 2021
  
  b5bab710
15 Sep, 2021 1 commit
- [Pretrained Model] Add resize_position_embeddings (#13559) · 95f933ea
  Patrick von Platen authored Sep 15, 2021
```
* finish

* delete bogus file

* correct some stuff

* finish

* finish
```
  95f933ea
14 Sep, 2021 2 commits

[Flax] Addition of FlaxPegasus (#13420) · c1e47bf4

Bhadresh Savani authored Sep 14, 2021



* added initial files

* fixes pipeline

* fixes style and quality

* fixes doc issue and positional encoding

* fixes layer norm and test

* fixes quality issue

* fixes code quality

* removed extra layer norm

* added layer norm back in encoder and decoder

* added more code copy quality checks

* update tests

* Apply suggestions from code review

* fix import

* fix test
Co-authored-by: patil-suraj <surajp815@gmail.com>

c1e47bf4

Push to hub when saving checkpoints (#13503) · 3081d386

Sylvain Gugger authored Sep 14, 2021

* Push to hub when saving checkpoints

* Add model card

* Revert partial model card

* Small fix for checkpoint

* Add tests

* Add documentation

* Fix tests

* Bump huggingface_hub

* Fix test

3081d386

13 Sep, 2021 3 commits
- up (#13538) · d2904264
  Patrick von Platen authored Sep 13, 2021
  
  d2904264
- fixing BC in `fill-mask` (wasn't tested in these test suites apparently). (#13540) · 65ee1a43
  Nicolas Patry authored Sep 13, 2021
  
  65ee1a43
- up (#13536) · 9d60eebe
  Patrick von Platen authored Sep 13, 2021
  
  9d60eebe
10 Sep, 2021 3 commits

[GPT-Neo] Simplify local attention (#13491) · 010965dc
Suraj Patil authored Sep 10, 2021
```
* simplify local attention

* update tests

* add a comment and use torch.bitwise_xor
```
010965dc

[Wav2Vec2] Fix normalization for non-padded tensors (#13512) · d7b3b709

Patrick von Platen authored Sep 10, 2021

* finalize

* Apply suggestions from code review

* finish cleaner implementation

* more tests

* small fix

* finish

* up

d7b3b709

[Large PR] Entire rework of pipelines. (#13308) · c63fcabf

Nicolas Patry authored Sep 10, 2021



* Enabling dataset iteration on pipelines.

Enabling dataset iteration on pipelines.

Unifying parameters under `set_parameters` function.

Small fix.

Last fixes after rebase

Remove print.

Fixing text2text `generate_kwargs`

No more `self.max_length`.

Fixing tf only conversational.

Consistency in start/stop index over TF/PT.

Speeding up drastically on TF (nasty bug where max_length would increase
a ton.)

Adding test for support for non fast tokenizers.

Fixing GPU usage on zero-shot.

Fix working on Tf.

Update src/transformers/pipelines/base.py
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

Update src/transformers/pipelines/base.py
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

Small cleanup.

Remove all asserts + simple format.

* Fixing audio-classification for large PR.

* Overly explicit null checking.

* Encapsulating GPU/CPU pytorch manipulation directly within `base.py`.

* Removed internal state for parameters of the pipeline.

Instead of overriding implicitly internal state, we moved
to real named arguments on every `preprocess`, `_forward`,
`postprocess` function.

Instead `_sanitize_parameters` will be used to split all kwargs
of both __init__ and __call__ into the 3 kinds of named parameters.

* Move import warnings.

* Small fixes.

* Quality.

* Another small fix, using the CI to debug faster.

* Last fixes.

* Last fix.

* Small cleanup of tensor moving.

* is not None.

* Adding a bunch of docs + an iteration test.

* Fixing doc style.

* KeyDataset = None guard.

* Removing the Cuda test for pipelines (was testing).

* Even more simple iteration test.

* Correct import.

* Long day.

* Fixes in docs.

* [WIP] migrating object detection.

* Fixed the target_size bug.

* Fixup.

* Bad variable name.

* Fixing `ensure_on_device` respects original ModelOutput.

c63fcabf
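
The parameter handling described in this commit message (no internal state, named arguments on `preprocess`, `_forward` and `postprocess`, and `_sanitize_parameters` splitting the kwargs of both `__init__` and `__call__`) can be illustrated with a minimal, self-contained sketch. This is not the code from the PR: the class `SketchPipeline` and the parameters `lowercase`, `repeat` and `top_k` are hypothetical names chosen for illustration, and only the four method names come from the message above.

```python
class SketchPipeline:
    """Toy pipeline (hypothetical): parameters travel as named arguments only."""

    def __init__(self, **kwargs):
        # Construction-time kwargs are split once and kept as defaults.
        self._pre, self._fwd, self._post = self._sanitize_parameters(**kwargs)

    def _sanitize_parameters(self, **kwargs):
        # Route each known kwarg to exactly one of the three stages.
        pre, fwd, post = {}, {}, {}
        if "lowercase" in kwargs:  # hypothetical preprocess parameter
            pre["lowercase"] = kwargs["lowercase"]
        if "repeat" in kwargs:  # hypothetical forward parameter
            fwd["repeat"] = kwargs["repeat"]
        if "top_k" in kwargs:  # hypothetical postprocess parameter
            post["top_k"] = kwargs["top_k"]
        return pre, fwd, post

    def preprocess(self, text, lowercase=False):
        return text.lower().split() if lowercase else text.split()

    def _forward(self, tokens, repeat=1):
        # Stand-in for the model call.
        return tokens * repeat

    def postprocess(self, outputs, top_k=None):
        return outputs if top_k is None else outputs[:top_k]

    def __call__(self, text, **kwargs):
        # Call-time kwargs override the defaults without mutating them.
        pre, fwd, post = self._sanitize_parameters(**kwargs)
        pre = {**self._pre, **pre}
        fwd = {**self._fwd, **fwd}
        post = {**self._post, **post}
        model_inputs = self.preprocess(text, **pre)
        model_outputs = self._forward(model_inputs, **fwd)
        return self.postprocess(model_outputs, **post)
```

Under these assumptions, `SketchPipeline(lowercase=True)("Hello World", top_k=1)` applies the construction-time default to `preprocess` and the call-time override to `postprocess`, and a later call without `top_k` is unaffected because nothing is stored on the instance.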

09 Sep, 2021 4 commits

Fixing #13381 (#13400) · aacd2123
Nicolas Patry authored Sep 09, 2021
```
* Fixing #13381

* Enabling automatic LED models.
```
aacd2123
Fixing backward compatibility for non prefixed tokens (B-, I-). (#13493) · db514a75
Nicolas Patry authored Sep 09, 2021

db514a75
Refactor internals for Trainer push_to_hub (#13486) · e59d4d01
Sylvain Gugger authored Sep 09, 2021

e59d4d01

[Tentative] Moving slow tokenizer to the Trie world. (#13220) · 3dd538c4

Nicolas Patry authored Sep 09, 2021



* Moving slow tokenizer to the Trie world.

* Adding more docstrings to the Trie.

* Fixing doctest (incompatible with our format?)

* Update src/transformers/tokenization_utils.py
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Adding a lot more comment into the internals of this algorithm.

* Cleaner doc.

* Fixing the namings.

* Update src/transformers/tokenization_utils.py
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

* quality.

* Fixing longest first match.

* Small improvements to cuts + more test + canine resistant test.

* Fixing fast test.
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

3dd538c4
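
As a rough illustration of what splitting text on added tokens with a trie and longest-first matching can look like, here is a small sketch. It is a simplified greedy version written for this log, not the algorithm merged in the PR: the class name `SketchTrie`, its methods, and the example tokens are assumptions, and edge cases the real implementation handles may behave differently here.

```python
class SketchTrie:
    """Hypothetical character trie used to cut text around known tokens."""

    def __init__(self):
        self.root = {}

    def add(self, word):
        if not word:
            return
        node = self.root
        for ch in word:
            node = node.setdefault(ch, {})
        node[""] = True  # empty-string key marks the end of a stored token

    def split(self, text):
        pieces, start, i = [], 0, 0
        while i < len(text):
            node, j, end = self.root, i, None
            # Walk as far as the trie allows, remembering the longest full match.
            while j < len(text) and text[j] in node:
                node = node[text[j]]
                j += 1
                if "" in node:
                    end = j
            if end is None:
                i += 1  # no token starts at position i
                continue
            if start < i:
                pieces.append(text[start:i])  # plain text before the token
            pieces.append(text[i:end])  # the matched token itself
            start = i = end
        if start < len(text):
            pieces.append(text[start:])
        return pieces
```

With the tokens `[CLS]`, `extra_id_1` and `extra_id_100` added, splitting `"[CLS] This is a extra_id_100"` keeps `extra_id_100` whole rather than stopping at the shorter `extra_id_1`, which is the point of matching the longest token first.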

08 Sep, 2021 4 commits

Fix integration tests for TFWav2Vec2 and TFHubert · e1f6e490
Anton Lozhkov authored Sep 08, 2021

e1f6e490

Object detection pipeline (#12886) · 2a15e8cc

Mishig Davaadorj authored Sep 08, 2021



* Implement object-detection pipeline

* Define threshold const

* Add `threshold` argument

* Refactor

* Uncomment test inputs

* `rm
Co-authored-by: NielsRogge <48327001+NielsRogge@users.noreply.github.com>

* Fix typo
Co-authored-by: NielsRogge <48327001+NielsRogge@users.noreply.github.com>

* Fix typo
Co-authored-by: NielsRogge <48327001+NielsRogge@users.noreply.github.com>

* Chore better doc
Co-authored-by: NielsRogge <48327001+NielsRogge@users.noreply.github.com>

* Rm unnecessary lines
Co-authored-by: NielsRogge <48327001+NielsRogge@users.noreply.github.com>

* Chore better naming
Co-authored-by: NielsRogge <48327001+NielsRogge@users.noreply.github.com>

* Update src/transformers/pipelines/object_detection.py
Co-authored-by: NielsRogge <48327001+NielsRogge@users.noreply.github.com>

* Update src/transformers/pipelines/object_detection.py
Co-authored-by: NielsRogge <48327001+NielsRogge@users.noreply.github.com>

* Fix typo

* Add `detr-tiny` for tests

* Add `ObjectDetectionPipeline` to `trnsfrmrs/init`

* Implement new bbox format

* Update detr post_process

* Update `load_img` method obj det pipeline

* make style

* Implement new testing format for obj det pipeln

* Add guard pytorch specific code in pipeline

* Add doc

* Make pipeline_obj_tet tests deterministic

* Revert some changes to `post_process` COCO api

* Chore

* Update src/transformers/pipelines/object_detection.py
Co-authored-by: NielsRogge <48327001+NielsRogge@users.noreply.github.com>

* Update src/transformers/pipelines/object_detection.py
Co-authored-by: NielsRogge <48327001+NielsRogge@users.noreply.github.com>

* Update src/transformers/pipelines/object_detection.py
Co-authored-by: NielsRogge <48327001+NielsRogge@users.noreply.github.com>

* Update src/transformers/pipelines/object_detection.py
Co-authored-by: NielsRogge <48327001+NielsRogge@users.noreply.github.com>

* Update src/transformers/pipelines/object_detection.py
Co-authored-by: NielsRogge <48327001+NielsRogge@users.noreply.github.com>

* Update src/transformers/pipelines/object_detection.py
Co-authored-by: NielsRogge <48327001+NielsRogge@users.noreply.github.com>

* Rm timm requirement

* make fixup

* Add timm requirement to test

* Make fixup

* Guard torch.Tensor

* Chore

* Delete unnecessary comment
Co-authored-by: NielsRogge <48327001+NielsRogge@users.noreply.github.com>

2a15e8cc

Enable automated model list copying for localized READMEs (#13465) · 18447c20

Li-Huai (Allan) Lin authored Sep 08, 2021



* Complete basic mechanism

* Save

* Complete everything

* Style & Quality

* Update READMEs

* Add testing

* Fix README.md format

* Apply suggestions

* Fix format

* Update utils/check_copies.py
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

18447c20

[CLIP] fix logit_scale init (#13436) · c164c651
Suraj Patil authored Sep 08, 2021
```
* fix logit_scale init

* add logit_scale_init_value as config param
```
c164c651

07 Sep, 2021 2 commits
- Fixing by correctly raising UnicodeDecodeError. (#13449) · 5c7789d4
  Nicolas Patry authored Sep 07, 2021
  
  5c7789d4
- Fix img classification tests (#13456) · 79815090
  Nathan Raw authored Sep 07, 2021
```
* ✅ Update image-classification example's tests

* 🔥 remove cats_and_dogs test samples

* 💄 fix flake8
```
  79815090
06 Sep, 2021 5 commits

Update model configs - Allow setters for common properties (#13026) · c8be8a9a

Nils Reimers authored Sep 06, 2021

* refactor GPT Config to allow dyn. properties

* make attribute_map a class attribute

* remove old code

* update unit test to test config: Add test for common properties setter

* update unit test to test config: Add test for common properties passed as parameters to __init__

* update to black code format

* Allow that setters are not defined for certain config classes

* update config classes to implement attribute_map

* bugfix lxmert config - id2labels was not defined when num_labels was set

* update broken configs - add attribute_maps

* update bart config

* update black codestyle

* update documentation on common config attributes

* update GPTJ config to new attribute map

* update docs on common attributes

* gptj config: add max_position_embeddings

* gptj config: format with black

* update speech to text 2 config

* format doc file to max_len 119

* update config template

c8be8a9a
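
The idea of a class-level `attribute_map` that lets common property names be read and set on configs with model-specific attribute names can be sketched as follows. This is an illustrative reduction, not the code from the PR: `SketchConfig` and `GPTLikeConfig` are hypothetical classes, and the mapping of `hidden_size` to `n_embd` is used only as an example.

```python
class SketchConfig:
    """Hypothetical base config: aliases are redirected via attribute_map."""

    attribute_map = {}  # class attribute: common name -> model-specific name

    def __setattr__(self, key, value):
        # Setting an alias writes to the underlying attribute instead.
        key = type(self).attribute_map.get(key, key)
        super().__setattr__(key, value)

    def __getattribute__(self, key):
        # Reading an alias returns the underlying attribute.
        if key != "attribute_map":
            mapping = super().__getattribute__("attribute_map")
            key = mapping.get(key, key)
        return super().__getattribute__(key)


class GPTLikeConfig(SketchConfig):
    attribute_map = {"hidden_size": "n_embd"}

    def __init__(self, n_embd=768):
        self.n_embd = n_embd
```

In this sketch, `GPTLikeConfig().hidden_size` returns the value of `n_embd`, and assigning to `hidden_size` updates `n_embd` without creating a second attribute on the instance.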

Adding a test for multibytes unicode. (#13447) · cf4eb8b3

Nicolas Patry authored Sep 06, 2021

* Adding a test for multibytes unicode.

* Adding some accents.

* Making sure decoding works.

* Make tests passing by being cheesy.

cf4eb8b3

up (#13448) · 607611f2
Patrick von Platen authored Sep 06, 2021

607611f2
Fix scheduled tests for `SpeechEncoderDecoderModel` (#13422) · 26700a95
Anton Lozhkov authored Sep 06, 2021
```
* Add inputs to pretrained tests

* Make style
```
26700a95
Fix tests without any real effect (#13406) · 73ad2588
Yih-Dar authored Sep 06, 2021
```
Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
```
73ad2588

02 Sep, 2021 5 commits

✨ Add PyTorch image classification example (#13134) · 76c4d8bf

Nathan Raw authored Sep 02, 2021

* ✨ add pytorch image classification example

* 🔥 remove utils.py

* 💄 fix flake8 style issues

* 🔥 remove unnecessary line

* ✨ limit dataset sizes

* 📌 update reqs

* 🎨 restructure - use datasets lib

* 🎨 import transforms directly

* 📝 add comments

* 💄 style

* 🔥 remove flag

* 📌 update requirement warning

* 📝 add vision README.md

* 📝 update README.md

* 📝 update README.md

* 🎨 add image-classification tag to model card

* 🚚 rename vision ➡️ image-classification

* 📝 update image-classification README.md

76c4d8bf

up (#13396) · 9bd5d97c
Patrick von Platen authored Sep 02, 2021

9bd5d97c
fix (#13395) · efa4f5f0
Patrick von Platen authored Sep 02, 2021

efa4f5f0

Correct order of overflowing_tokens for slow tokenizer (#13179) · b91e65af

Apoorv Garg authored Sep 02, 2021

* correct order of overflowing_tokens for slow tokenizer (issue fix #13148)

* python 3.9 requires sentencepiece version 0.1.94 or above

* slicing of ids fixed in truncated_sequence()

* Update setup.py

* Correct order of overflowing tokens for pair of sentences

* code reformatted

* Update tokenization_utils_base.py

* reformatting file

* test to check single_input added

* missing function restored

* test to check pair_input overflowing tokens order

* test to check pair_input overflowing tokens order

* test to check pair_input overflowing tokens order

* added an error message for pair of seq and longest_first strategy

* test for pair_input modified

* variable name corrected

* fixed a typo in error message

* requested changes implemented

* required test added

* Corrected the message to match test message

* added error message for Luke Tokenizer

* lost test recovered

* docstring for truncate_sequences and prepare_for_model updated

* docstring for luke tokenizer updated

* updated ENCODE_PLUS_ADDITIONAL_KWARGS_DOCSTRING

* aligned text and fixed punctuation

* improved style and quality of code

* fixed error_msg in truncate_sequences

* replaced encode_plus method with regular call method

* clean up

* rephrased the docstring

b91e65af

Enabling automatic loading of tokenizer with `pipeline` for `audio-classification`. (#13376) · c9184a2e
Nicolas Patry authored Sep 02, 2021

c9184a2e

01 Sep, 2021 2 commits

fix (#13383) · a105c9b7
Patrick von Platen authored Sep 01, 2021

a105c9b7

Fix tokenizer saving during training with `Trainer` (#12806) · c4d78f01

SaulLu authored Sep 01, 2021



* add test in trainer and test tokenizer saving with trainer

* quality

* reverse trainer changes

* replace test in test_trainer by a test for all the tokenizers

* format

* add can_save_slow_tokenizer attribute to all tokenizers

* fix Herbert

* format

* Change comment in error

* add comments and a new assert

* Update src/transformers/models/albert/tokenization_albert_fast.py
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* change ValueError barthez

* change ValueError BigBird

* change ValueError Camembert

* change ValueError Mbart50

* change ValueError Pegasus

* change ValueError ReFormer

* change ValueError T5

* change ValueError RoBERTa

* XLNET fast

* Update tests/test_tokenization_common.py
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* change `assert` into `self.assertIn`

* format
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

c4d78f01