- 30 Jun, 2021 5 commits
-
-
NielsRogge authored
* First pass
* More progress
* Add support for local attention
* More improvements
* More improvements
* Conversion script working
* Add CanineTokenizer
* Make style & quality
* First draft of integration test
* Remove decoder test
* Improve tests
* Add documentation
* Mostly docs improvements
* Add CanineTokenizer tests
* Fix most tests on GPU, improve upsampling projection
* Address most comments by @dhgarrette
* Remove decoder logic
* Improve Canine tests, improve docs of CanineConfig
* All tokenizer tests passing
* Make fix-copies and fix tokenizer tests
* Fix test_model_outputs_equivalence test
* Apply suggestions from @sgugger's review
* Address some more comments
* Add support for hidden_states and attentions of shallow encoders
* Define custom CanineModelOutputWithPooling, tests pass
* Make conversion script work for Canine-c too
* Fix tokenizer tests
* Remove file

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
-
Jabin Huang authored
* fix ids_to_tokens naming error in tokenizer of deberta v2
* Update tokenization_deberta_v2.py: add bos_token and eos_token
* format code

Co-authored-by: Jipeng Huang <jihuan@microsoft.com>
-
Sylvain Gugger authored
* Fix default bool in argparser * Add more to test
-
Suzana Ilić authored
Added one more confirmed speaker, zoom links and gcal event links
-
Sylvain Gugger authored
* Add option to save on each training node
* Apply suggestions from code review
* Address review comments

Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
-
- 29 Jun, 2021 11 commits
-
-
Stas Bekman authored
This PR fixes an incorrect attribute; some tests are probably needed.
-
Sylvain Gugger authored
* [WIP] Easily train a new fast tokenizer from a given one
* Fix test
* Roll out to other tokenizers and add tests
* Fix bug with unk id and add emoji to test
* Really use something different in test
* Implement special tokens map
* Map special tokens in the Transformers tokenizers
* Fix test
* Make test more robust
* Fix test for BPE
* More robust map and test (co-authored by SaulLu)
* Test file
* Stronger tests
* Map unk token for Wordpiece and address review comment
* Fix lowercase test and address review comment
* Fix all tests
* Simplify test
* Fix tests for realsies
* Easily train a new fast tokenizer from a given one - tackle the special tokens format (str or AddedToken) (#12420)
* Propose change in tests regarding lower case
* add new test for special tokens types
* put back the test part about decoding
* add feature: the AddedToken is re-built with the different mapped content
* Address review comment: simplify AddedToken building
* Update src/transformers/tokenization_utils_fast.py

Co-authored-by: SaulLu <lucilesaul.com@gmail.com>
Co-authored-by: sgugger <sylvain.gugger@gmail.com>
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: SaulLu <55560583+SaulLu@users.noreply.github.com>
-
Suzana Ilić authored
-
Shamane Siri authored
-
Jabin Huang authored
Co-authored-by: Jipeng Huang <jihuan@microsoft.com>
-
Patrick von Platen authored
* fix_torch_device_generate_test * remove @ * finish * finish * correct style
-
Suraj Patil authored
* add readme
* update readme and add requirements
* Update examples/flax/summarization/README.md

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
-
Will Rice authored
* Fix TFWav2Vec2 SpecAugment * Invert masks * Feedback changes
-
Will Rice authored
* Add OOV error to ASR models * Feedback changes
-
NielsRogge authored
* Rename target to labels in DetrFeatureExtractor * Update DetrFeatureExtractor tests accordingly * Improve docs of DetrFeatureExtractor * Improve docs * Make style
-
Stas Bekman authored
* [models] respect dtype of the model when instantiating it
* cleanup
* cleanup
* rework to handle non-float dtype
* fix
* switch to fp32 tiny model
* improve
* use dtype.is_floating_point
* Apply suggestions from code review
* fix the doc
* recode to use explicit torch_dtype_auto_detect, torch_dtype args
* docs and tweaks
* docs and tweaks
* docs and tweaks
* merge 2 args, add docs
* fix
* fix
* better doc
* better doc

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
-
- 28 Jun, 2021 13 commits
-
-
Patrick von Platen authored
* fix_torch_device_generate_test
* remove @
* add length computation
* finish masking
* finish
* upload
* fix some bugs
* finish
* fix dependency table
* correct tensorboard
* Apply suggestions from code review
* correct processing
* slight change init
* correct some more mistakes
* apply suggestions
* improve readme
* fix indent
* correct tokenizer
* finish

Co-authored-by: SaulLu <55560583+SaulLu@users.noreply.github.com>
Co-authored-by: Patrick von Platen <patrick@huggingface.co>
-
Stas Bekman authored
-
Matt authored
* Tensorflow MLM example
* Add CLM example
* Style fixes, adding missing checkpoint code from the CLM example
* Fix TPU training, avoid massive dataset warnings
* Fix incorrect training length calculation for multi-GPU training
* Refactors and nitpicks from the review
* Style pass
* Adding README
-
Patrick von Platen authored
* fix_torch_device_generate_test
* remove @
* finish
* correct summary writer
* correct push to hub
* fix indent
* finish

Co-authored-by: Patrick von Platen <patrick@huggingface.co>
-
Funtowicz Morgan authored
* debug albert einsum * Fix matmul computation * Let's use torch linear layer. * Style.
-
Sylvain Gugger authored
-
Patrick von Platen authored
-
Patrick von Platen authored
* fix_torch_device_generate_test
* remove @
* boom boom
* correct typos
* Apply suggestions from code review

Co-authored-by: Suraj Patil <surajp815@gmail.com>
Co-authored-by: Suzana Ilić <io.suzanai@gmail.com>
-
Bhadresh Savani authored
* added context manager to datasets map
* fixed style and spaces
* fixed warning of deprecation
* changed desc
-
Stas Bekman authored
* add dependency table sync verification * improve the message * improve the message * revert * ready to merge
-
Sylvain Gugger authored
-
Taha ValizadehAslani authored
Previously the script could not be used for validation only because of this line: `extension = data_args.train_file.split(".")[-1]`. It assumed the extension must be extracted from the training file and ran regardless of whether the user requested training or validation, so it raised an error when the user wanted to run evaluation only (because no training file exists in that case). The extension is now taken from the training file when training is requested and from the validation file when only evaluation is requested, so the script can be used for training and validation separately.
-
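The extension-selection fix described in the commit above can be sketched in plain Python; the function and argument names here are illustrative, not the actual example-script code:

```python
def pick_extension(do_train, train_file=None, validation_file=None):
    """Pick the dataset file extension from whichever file will actually be used.

    Taking the extension unconditionally from the training file crashes in
    evaluation-only runs, where no training file is provided.
    """
    source = train_file if do_train else validation_file
    if source is None:
        raise ValueError("no data file was provided for the requested mode")
    return source.split(".")[-1]
```

For example, `pick_extension(False, None, "data/val.json")` returns `"json"` instead of raising on the missing training file.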
Kilian Kluge authored
[Documentation] Warn that DataCollatorForWholeWordMask is limited to BertTokenizer-like tokenizers (#12371)
* Notify users that DataCollatorForWholeWordMask is limited to BertTokenizer-like tokenizers
* Fix code formatting
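The limitation comes from the collator detecting word boundaries via WordPiece's `##` continuation prefix. A minimal sketch of that grouping heuristic (simplified, not the actual collator code):

```python
def group_whole_words(tokens):
    """Group WordPiece tokens into whole words using the '##' prefix.

    Tokenizers that mark subword continuations differently (e.g.
    SentencePiece-based ones) break this heuristic, which is why the
    collator only works with BertTokenizer-like vocabularies.
    """
    words = []
    for token in tokens:
        if token.startswith("##") and words:
            words[-1].append(token)  # continuation of the previous word
        else:
            words.append([token])    # start of a new word
    return words
```

With a non-WordPiece tokenizer no token carries the `##` prefix, so every subword is treated as its own word and whole-word masking silently degrades to per-token masking.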
-
- 26 Jun, 2021 2 commits
-
-
Bhadresh Savani authored
-
Bhadresh Savani authored
-
- 25 Jun, 2021 9 commits
-
-
Bhadresh Savani authored
* added log_level
* fix comment
* fixed log_level
* Trigger CI
* Unified logging
* simplified args for log_level
-
Stas Bekman authored
* main_process_first context manager * handle multi-node, add context description * sync desc
-
cronoik authored
* fixed multiple-choice tokenization: the model would have seen two sequences, 1. [CLS]prompt[SEP]prompt[SEP] and 2. [CLS]choice0[SEP]choice1[SEP], which is not correct since we want a contextualized embedding of prompt and choice
* removed outer brackets for proper sequence generation
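The pairing fix can be illustrated with plain lists (a simplified stand-in for the tokenizer call, not the actual example code): each candidate answer must be paired with the full prompt so the model sees `[CLS]prompt[SEP]choice[SEP]` for every choice.

```python
def build_choice_pairs(prompt, choices):
    """Pair the prompt with each candidate answer.

    Correct: one (prompt, choice) pair per candidate, giving
    [CLS]prompt[SEP]choice[SEP] per sequence. The buggy version passed
    [prompt] and [choices] as two parallel batches, producing
    [CLS]prompt[SEP]prompt[SEP] and [CLS]choice0[SEP]choice1[SEP].
    """
    return [(prompt, choice) for choice in choices]
```

Passing the resulting pairs to a tokenizer as (sequence A, sequence B) yields one contextualized prompt-choice encoding per candidate, which is what the multiple-choice head scores.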
-
Stas Bekman authored
-
Sylvain Gugger authored
-
Kai Fricke authored
* Replace NotebookProgressReporter by ProgressReporter in Ray Tune run * Move to local import
-
Vasudev Gupta authored
* port bigbird script
* adapt script a bit
* change location
* adapt more
* save progress
* init commit
* style
* dataset script tested
* readme add
-
jglaser authored
* fix distributed_concat for scalar outputs
* Update README.md
* fixed typo (#12356)
* simplify fix with terser syntax
* Trigger CI

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
Co-authored-by: michal pitr <21157924+MichalPitr@users.noreply.github.com>
-
michal pitr authored
-