- 23 Sep, 2020 3 commits
-
Sylvain Gugger authored
* Clean up model documentation * Formatting * Preparation work * Long lines * Main work on rst files * Cleanup all config files * Syntax fix * Clean all tokenizers * Work on first models * Models beginning * FlauBERT * All PyTorch models * All models * Long lines again * Fixes * More fixes * Update docs/source/model_doc/bert.rst Co-authored-by:
Lysandre Debut <lysandre@huggingface.co> * Update docs/source/model_doc/electra.rst Co-authored-by:
Lysandre Debut <lysandre@huggingface.co> * Last fixes Co-authored-by:
Lysandre Debut <lysandre@huggingface.co>
-
Wissam Antoun authored
* Fixed evaluation_strategy on epoch end bug: move the evaluation script outside the iteration loop * black formatting
-
Stas Bekman authored
* skip decorators: docs, tests, bugs * another important note * style * bloody style * add @pytest.mark.parametrize * add note * no idea what it wants :(
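For reference, a minimal sketch of the pattern the updated skip-decorator docs discuss: stacking a skip condition with `@pytest.mark.parametrize`. The test name, skip condition, and parameter values here are illustrative, not taken from the PR.

```python
import sys

import pytest


# Skip markers compose with parametrization: the condition is evaluated once
# and every parametrized case inherits it.
@pytest.mark.skipif(sys.platform == "win32", reason="feature not supported on Windows")
@pytest.mark.parametrize("batch_size", [1, 2, 4])
def test_forward_pass(batch_size):
    assert batch_size > 0
```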
-
- 22 Sep, 2020 16 commits
-
Sam Shleifer authored
-
Sylvain Gugger authored
-
Chady Kamar authored
* Add dataloader_num_workers to TrainingArguments This argument is meant to be used to set the number of workers for the PyTorch DataLoader. * Pass num_workers argument on DataLoader init
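A minimal sketch of the new argument in use; the output directory and worker count are placeholder values, and the surrounding `Trainer` setup is omitted.

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",     # placeholder
    dataloader_num_workers=4,   # forwarded to torch.utils.data.DataLoader(num_workers=...)
)
# training_args is then passed to Trainer(...) as usual.
```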
-
Huang Lianzhe authored
* [bug fix] fixed a bug where the actual batch_size was inconsistent with the parameter settings * reformat * reformat * reformat * add support for dict and BatchEncoding * add support for dict and BatchEncoding * add documentation for DataCollatorForNextSentencePrediction * Some more nits for the docstring Co-authored-by:
Sylvain Gugger <35901082+sgugger@users.noreply.github.com> * Some more nits for the docstring Co-authored-by:
Sylvain Gugger <35901082+sgugger@users.noreply.github.com> * Some more nits for the docstring Co-authored-by:
Sylvain Gugger <35901082+sgugger@users.noreply.github.com> * Some more nits for the docstring Co-authored-by:
Sylvain Gugger <35901082+sgugger@users.noreply.github.com> * Some more nits for the docstring Co-authored-by:
Sylvain Gugger <35901082+sgugger@users.noreply.github.com> * rename variables Co-authored-by:
Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
-
Ola Piktus authored
* added rag WIP * path fix * Formatting / renaming prior to actual work * added rag WIP * path fix * Formatting / renaming prior to actual work * added rag WIP * path fix * Formatting / renaming prior to actual work * added rag WIP * Formatting / renaming prior to actual work * First commit * improve comments * Retrieval evaluation scripts * refactor to include modeling outputs + MPI retriever * Fix rag-token model + refactor * Various fixes + finetuning logic * use_bos fix * Retrieval refactor * Finetuning refactoring and cleanup * Add documentation and cleanup * Remove set_up_rag_env.sh file * Fix retrieval with HF index * Fix import errors * Fix quality errors * Refactor as per suggestions in https://github.com/huggingface/transformers/pull/6813#issuecomment-687208867 * fix quality * Fix RAG Sequence generation * minor cleanup plus initial tests * fix test * fix tests 2 * Comments fix * post-merge fixes * Improve readme + post-rebase refactor * Extra dependencies for tests * Fix tests * Fix tests 2 * Refactor test requirements * Fix tests 3 * Post-rebase refactor * rename nlp->datasets * RAG integration tests * add tokenizer to slow integration test and allow retriever to run on cpu * add tests; fix position ids warning * change structure * change structure * add from encoder generator * save working solution * make all integration tests pass * add RagTokenizer.save/from_pretrained and RagRetriever.save/from_pretrained * don't save paths * delete unnecessary imports * pass config to AutoTokenizer.from_pretrained for Rag tokenizers * init wiki_dpr only once * hardcode legacy index and passages paths (todo: add the right urls) * finalize config * finalize retriever api and config api * LegacyIndex index download refactor * add dpr to autotokenizer * make from pretrained more flexible * fix ragfortokengeneration * small name changes in tokenizer * add labels to models * change default index name * add retrieval tests * finish token generate * align test with previous version and make all tests pass * add tests * finalize tests * implement Thom's suggestions * add first version of test * make first tests work * make retriever platform agnostic * naming * style * add legacy index URL * docstrings + simple retrieval test for distributed * clean model api * add doc_ids to retriever's outputs * fix retrieval tests * finish model outputs * finalize model api * fix generate problem for rag * fix generate for other models * fix some tests * save intermediate * set generate to default * big refactor generate * delete rag_api * correct pip faiss install * fix auto tokenization test * fix faiss install * fix test * move the distributed logic to examples * model page * docs * finish tests * fix dependencies * fix import in __init__ * Refactor eval_rag and finetune scripts * start docstring * add psutil to test * fix tf test * move require torch to top * fix retrieval test * align naming * finish automodel * fix repo consistency * test ragtokenizer save/load * add rag model output docs * fix ragtokenizer save/load from pretrained * fix tokenizer dir * remove torch in retrieval * fix docs * fix finetune scripts * finish model docs * finish docs * remove auto model for now * add require torch * remove solved todos * integrate Sylvain's suggestions * Sam's comments * correct mistake on purpose * improve README * Add generation test cases * fix rag token * clean token generate * fix test * add note to test * fix attention mask * add t5 test for rag * Fix handling prefix in finetune.py * don't overwrite index_name Co-authored-by:
Patrick Lewis <plewis@fb.com> Co-authored-by:
Aleksandra Piktus <piktus@devfair0141.h2.fair> Co-authored-by:
Aleksandra Piktus <piktus@learnfair5102.h2.fair> Co-authored-by:
Aleksandra Piktus <piktus@learnfair5067.h2.fair> Co-authored-by:
Your Name <you@example.com> Co-authored-by:
Patrick von Platen <patrick.v.platen@gmail.com> Co-authored-by:
Quentin Lhoest <lhoest.q@gmail.com>
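For orientation, a short usage sketch of the RAG classes this PR introduces, in the spirit of the later documented examples; the `facebook/rag-sequence-nq` checkpoint id and the retriever options (`index_name="exact"`, `use_dummy_dataset=True`) are assumptions borrowed from those examples rather than from this commit itself.

```python
from transformers import RagRetriever, RagSequenceForGeneration, RagTokenizer

# Checkpoint id and retriever options are assumptions; see note above.
tokenizer = RagTokenizer.from_pretrained("facebook/rag-sequence-nq")
retriever = RagRetriever.from_pretrained(
    "facebook/rag-sequence-nq", index_name="exact", use_dummy_dataset=True
)
model = RagSequenceForGeneration.from_pretrained("facebook/rag-sequence-nq", retriever=retriever)

inputs = tokenizer("who holds the record in 100m freestyle", return_tensors="pt")
generated = model.generate(input_ids=inputs["input_ids"])
print(tokenizer.batch_decode(generated, skip_special_tokens=True))
```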
-
Lysandre authored
-
Sylvain Gugger authored
-
Julien Plu authored
* Create an XLA parameter and fix mixed precision creation * Fix issue brought by intellisense * Complete docstring
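A sketch of how the new flag is meant to be set, assuming the parameter landed on `TFTrainingArguments` as `xla` alongside the existing mixed-precision flag; the values are placeholders.

```python
from transformers import TFTrainingArguments

# The xla parameter name is an assumption based on this PR's description.
training_args = TFTrainingArguments(
    output_dir="./results",  # placeholder
    xla=True,                # compile the training step with XLA
    fp16=True,               # mixed-precision policy whose creation this PR fixes
)
```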
-
Sylvain Gugger authored
* Add possibility to evaluate every epoch * Remove multitype arg * Remove needless import * Use a proper enum * Apply suggestions from @LysandreJik Co-authored-by:
Lysandre Debut <lysandre@huggingface.co> * One else and formatting Co-authored-by:
Lysandre Debut <lysandre@huggingface.co>
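A minimal sketch of the per-epoch evaluation this adds, assuming the option is exposed as `evaluation_strategy` (the "proper enum" mentioned above backs the accepted string values):

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",       # placeholder
    evaluation_strategy="epoch",  # evaluate at the end of every epoch instead of every N steps
)
```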
-
Sylvain Gugger authored
* is_pretokenized -> is_split_into_words * Fix tests
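A short sketch of the renamed tokenizer argument; the checkpoint id is a placeholder.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # placeholder checkpoint

# The input is already split into words, so the tokenizer should not re-split it.
words = ["Hello", "world", "!"]
encoding = tokenizer(words, is_split_into_words=True)  # formerly: is_pretokenized=True
```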
-
Julien Plu authored
* Fix #7277 * Apply style * Add a full training pipeline test * Apply style
-
Minghao Li authored
* first version * finish test docs readme model/config/tokenization class * apply make style and make quality * fix layoutlm GitHub link * fix conflict in index.rst and add layoutlm to pretrained_models.rst * fix bug in test_parents_and_children_in_mappings * reformat modeling_auto.py and tokenization_auto.py * fix bug in test_modeling_layoutlm.py * Update docs/source/model_doc/layoutlm.rst Co-authored-by:
Sylvain Gugger <35901082+sgugger@users.noreply.github.com> * Update docs/source/model_doc/layoutlm.rst Co-authored-by:
Sylvain Gugger <35901082+sgugger@users.noreply.github.com> * remove inh, add tokenizer fast, and update some doc * copy and rename necessary class from modeling_bert to modeling_layoutlm * Update src/transformers/configuration_layoutlm.py Co-authored-by:
Lysandre Debut <lysandre@huggingface.co> * Update src/transformers/configuration_layoutlm.py Co-authored-by:
Lysandre Debut <lysandre@huggingface.co> * Update src/transformers/configuration_layoutlm.py Co-authored-by:
Lysandre Debut <lysandre@huggingface.co> * Update src/transformers/configuration_layoutlm.py Co-authored-by:
Lysandre Debut <lysandre@huggingface.co> * Update src/transformers/modeling_layoutlm.py Co-authored-by:
Lysandre Debut <lysandre@huggingface.co> * Update src/transformers/modeling_layoutlm.py Co-authored-by:
Lysandre Debut <lysandre@huggingface.co> * Update src/transformers/modeling_layoutlm.py Co-authored-by:
Lysandre Debut <lysandre@huggingface.co> * add mish to activations.py, import ACT2FN and import logging from utils Co-authored-by:
Sylvain Gugger <35901082+sgugger@users.noreply.github.com> Co-authored-by:
Lysandre Debut <lysandre@huggingface.co>
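For orientation, a hedged usage sketch of the new LayoutLM classes; the checkpoint id and the way token-level bounding boxes are built follow the model's documented example and should be treated as assumptions as far as this commit is concerned.

```python
import torch

from transformers import LayoutLMModel, LayoutLMTokenizer

tokenizer = LayoutLMTokenizer.from_pretrained("microsoft/layoutlm-base-uncased")  # assumed id
model = LayoutLMModel.from_pretrained("microsoft/layoutlm-base-uncased")

words = ["Hello", "world"]
# LayoutLM expects one 0-1000 normalized box per token, so each word's box is
# repeated over its wordpieces, plus boxes for the special [CLS]/[SEP] tokens.
normalized_word_boxes = [[637, 773, 693, 782], [698, 773, 733, 782]]
token_boxes = []
for word, box in zip(words, normalized_word_boxes):
    token_boxes.extend([box] * len(tokenizer.tokenize(word)))
token_boxes = [[0, 0, 0, 0]] + token_boxes + [[1000, 1000, 1000, 1000]]

encoding = tokenizer(" ".join(words), return_tensors="pt")
outputs = model(
    input_ids=encoding["input_ids"],
    bbox=torch.tensor([token_boxes]),
    attention_mask=encoding["attention_mask"],
    token_type_ids=encoding["token_type_ids"],
)
sequence_output = outputs[0]  # indexing works for both tuple and ModelOutput returns
```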
-
Sylvain Gugger authored
-
Lysandre Debut authored
-
Stas Bekman authored
-
Sylvain Gugger authored
* Copy code from Bert to Roberta and add safeguard script * Fix docstring * Comment code * Formatting * Update src/transformers/modeling_roberta.py Co-authored-by:
Lysandre Debut <lysandre@huggingface.co> * Add test and fix bugs * Fix style and make new command Co-authored-by:
Lysandre Debut <lysandre@huggingface.co>
-
- 21 Sep, 2020 4 commits
-
Sylvain Gugger authored
-
Raphaël Bournhonesque authored
-
Stas Bekman authored
[fsmt] rewrite SinusoidalPositionalEmbedding + USE_CUDA test fixes + new TranslationPipeline test (#7224) * fix USE_CUDA, add pipeline * USE_CUDA fix * recode SinusoidalPositionalEmbedding into an nn.Embedding subclass, which was needed for torchscript to work - this is now part of the state_dict, so these keys will have to be removed during save_pretrained * back out (ci debug) * restore * slow last? * facilitate not saving certain keys and test * remove no longer used keys * style * fix logging import * cleanup * Update src/transformers/modeling_utils.py Co-authored-by:
Sam Shleifer <sshleifer@gmail.com> * fix bug in max_positional_embeddings * rename keys to keys_to_never_save per suggestion, improve the setup * Update src/transformers/modeling_utils.py Co-authored-by:
Sylvain Gugger <35901082+sgugger@users.noreply.github.com> Co-authored-by:
Sam Shleifer <sshleifer@gmail.com> Co-authored-by:
Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
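To illustrate the idea (not the exact code from the PR): a sinusoidal positional embedding written as an `nn.Embedding` subclass, so its table lives in the module's `state_dict` and torchscript sees an ordinary embedding lookup; the commit pairs this with a `keys_to_never_save`-style mechanism so the deterministic weights are dropped at save time.

```python
import torch
from torch import nn


class SinusoidalPositionalEmbedding(nn.Embedding):
    """Illustrative sketch: fixed sinusoidal positions stored as an ordinary embedding table."""

    def __init__(self, num_positions: int, embedding_dim: int):
        super().__init__(num_positions, embedding_dim)
        self._fill_with_sinusoids(self.weight)

    @staticmethod
    def _fill_with_sinusoids(out: torch.Tensor) -> None:
        n_pos, dim = out.shape
        position_enc = torch.tensor(
            [[pos / (10000 ** (2 * (j // 2) / dim)) for j in range(dim)] for pos in range(n_pos)]
        )
        out.requires_grad = False  # deterministic, never trained
        out[:, 0::2] = torch.sin(position_enc[:, 0::2])
        out[:, 1::2] = torch.cos(position_enc[:, 1::2])

    @torch.no_grad()
    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        # one embedding per position of the input sequence
        seq_len = input_ids.shape[1]
        positions = torch.arange(seq_len, dtype=torch.long, device=self.weight.device)
        return super().forward(positions)
```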
-
guillaume-be authored
-
- 18 Sep, 2020 3 commits
-
Dat Quoc Nguyen authored
* Add BERTweet and PhoBERT models * Update modeling_auto.py Re-add `bart` to LM_MAPPING * Update tokenization_auto.py Re-add `from .configuration_mobilebert import MobileBertConfig` not sure why it's replaced by `from transformers.configuration_mobilebert import MobileBertConfig` * Add BERTweet and PhoBERT to pretrained_models.rst * Update tokenization_auto.py Remove BertweetTokenizer and PhobertTokenizer from tokenization_auto.py (they are currently not supported by AutoTokenizer). * Update BertweetTokenizer - without nltk * Update model card for BERTweet * PhoBERT - with Auto mode - without import fastBPE * PhoBERT - with Auto mode - without import fastBPE * BERTweet - with Auto mode - without import fastBPE * Add PhoBERT and BERTweet to TF modeling auto * Improve Docstrings for PhobertTokenizer and BertweetTokenizer * Update PhoBERT and BERTweet model cards * Fixed a merge conflict in tokenization_auto * Used black to reformat BERTweet- and PhoBERT-related files * Used isort to reformat BERTweet- and PhoBERT-related files * Reformatted BERTweet- and PhoBERT-related files based on flake8 * Updated test files * Updated test files * Updated tf test files * Updated tf test files * Updated tf test files * Updated tf test files * Update commits from huggingface * Delete unnecessary files * Add tokenizers to auto and init files * Add test files for tokenizers * Revised model cards * Update save_vocabulary function in BertweetTokenizer and PhobertTokenizer and test files * Revised test files * Update orders of Phobert and Bertweet tokenizers in auto tokenization file
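A brief usage sketch; the `vinai/bertweet-base` model id comes from the model cards mentioned above (treat it as an assumption here), and the example tweet is made up.

```python
import torch

from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base")  # assumed model id
bertweet = AutoModel.from_pretrained("vinai/bertweet-base")

line = "SC has first two presumptive cases of coronavirus , DHEC confirms HTTPURL via @USER"
input_ids = torch.tensor([tokenizer.encode(line)])
with torch.no_grad():
    features = bertweet(input_ids)
```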
-
Yih-Dar authored
-
Yuta Hayashibe authored
-
- 17 Sep, 2020 6 commits
-
Stas Bekman authored
-
Sohee Yang authored
* Move 'from transformers' statements to relative imports in some files * Add Python prompt symbols in front of the example code * Reformat the code * Add one missing space Co-authored-by:
Sylvain Gugger <35901082+sgugger@users.noreply.github.com> Co-authored-by:
Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
-
Stas Bekman authored
* ready for PR * cleanup * correct FSMT_PRETRAINED_MODEL_ARCHIVE_LIST * fix * perfectionism * revert change from another PR * odd, already committed this one * non-interactive upload workaround * backup the failed experiment * store langs in config * workaround for localizing model path * doc clean up as in https://github.com/huggingface/transformers/pull/6956 * style * back out debug mode * document: run_eval.py --num_beams 10 * remove unneeded constant * typo * re-use bart's Attention * re-use EncoderLayer, DecoderLayer from bart * refactor * send to cuda and fp16 * cleanup * revert (moved to another PR) * better error message * document run_eval --num_beams * solve the problem of tokenizer finding the right files when model is local * polish, remove hardcoded config * add a note that the file is autogenerated to avoid losing changes * prep for org change, remove unneeded code * switch to model4.pt, update scores * s/python/bash/ * missing init (but doesn't impact the finetuned model) * cleanup * major refactor (reuse-bart) * new model, new expected weights * cleanup * cleanup * full link * fix model type * merge porting notes * style * cleanup * have to create a DecoderConfig object to handle vocab_size properly * doc fix * add note (not a public class) * parametrize * - add bleu scores integration tests * skip test if sacrebleu is not installed * cache heavy models/tokenizers * some tweaks * remove tokens that aren't used * more purging * simplify code * switch to using decoder_start_token_id * add doc * Revert "major refactor (reuse-bart)" This reverts commit 226dad15ca6a9ef4e26178526e878e8fc5c85874. * decouple from bart * remove unused code #1 * remove unused code #2 * remove unused code #3 * update instructions * clean up * move bleu eval to examples * check import only once * move data+gen script into files * reuse via import * take less space * add prepare_seq2seq_batch (auto-tested) * cleanup * recode test to use json instead of yaml * ignore keys not needed * use the new -y in transformers-cli upload -y * [xlm tok] config dict: fix str into int to match definition (#7034) * [s2s] --eval_max_generate_length (#7018) * Fix CI with change of name of nlp (#7054) * nlp -> datasets * More nlp -> datasets * Woopsie * More nlp -> datasets * One last * extending to support allen_nlp wmt models - allow a specific checkpoint file to be passed - more arg settings - scripts for allen_nlp models * sync with changes * s/fsmt-wmt/wmt/ in model names * s/fsmt-wmt/wmt/ in model names (p2) * s/fsmt-wmt/wmt/ in model names (p3) * switch to a better checkpoint * typo * make non-optional args such - adjust tests where possible or skip when there is no other choice * consistency * style * adjust header * cards moved (model rename) * use best custom hparams * update info * remove old cards * cleanup * s/stas/facebook/ * update scores * s/allen_nlp/allenai/ * url maps aren't needed * typo * move all the doc / build /eval generators to their own scripts * cleanup * Apply suggestions from code review Co-authored-by:
Lysandre Debut <lysandre@huggingface.co> * Apply suggestions from code review Co-authored-by:
Lysandre Debut <lysandre@huggingface.co> * fix indent * duplicated line * style * use the correct add_start_docstrings * oops * resizing can't be done with the core approach, due to 2 dicts * check that the arg is a list * style * style Co-authored-by:
Sam Shleifer <sshleifer@gmail.com> Co-authored-by:
Sylvain Gugger <35901082+sgugger@users.noreply.github.com> Co-authored-by:
Lysandre Debut <lysandre@huggingface.co>
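For orientation, a short translation sketch with the ported checkpoints; the model id reflects the `s/stas/facebook/` rename noted above and is an assumption here, and the beam count mirrors the `--num_beams 10` example from the commit.

```python
from transformers import FSMTForConditionalGeneration, FSMTTokenizer

mname = "facebook/wmt19-en-de"  # one of the ported WMT19 checkpoints (assumed id)
tokenizer = FSMTTokenizer.from_pretrained(mname)
model = FSMTForConditionalGeneration.from_pretrained(mname)

input_ids = tokenizer("Machine learning is great, isn't it?", return_tensors="pt").input_ids
outputs = model.generate(input_ids, num_beams=10)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```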
-
Sylvain Gugger authored
* Trainer accepts multiple labels * Missing import * Fix docstrings
-
RafaelWO authored
* Removed 'tgt_len' and 'ext_len' from Transformer-XL * Some changes are still to be done * Removed 'tgt_len' and 'ext_len' from Transformer-XL (2) * Removed comments * Fixed quality * Changed warning to info
-
Stas Bekman authored
-
- 16 Sep, 2020 4 commits
-
Stas Bekman authored
-
Donna Choi authored
-
Xi Ye authored
-
Julien Plu authored
* Refactoring the activation functions into a common file * Apply style * remove unused import * fix tests * Fix tests.
-
- 15 Sep, 2020 4 commits
-
Patrick von Platen authored
-
Yih-Dar authored
* fix ZeroDivisionError and epoch counting * Add test for num_train_epochs calculation in trainer.py * Remove @require_non_multigpu for test_num_train_epochs_in_training
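Schematically, the guarded epoch computation looks like the following; variable names and values are illustrative, not the Trainer's actual code.

```python
import math

# Illustrative values: a length-less or empty dataloader used to trigger ZeroDivisionError.
len_dataloader = 0
gradient_accumulation_steps = 2
max_steps = 1000

# Guard the per-epoch step count so the division below is always defined.
num_update_steps_per_epoch = max(len_dataloader // gradient_accumulation_steps, 1)
num_train_epochs = math.ceil(max_steps / num_update_steps_per_epoch)
print(num_train_epochs)  # 1000
```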
-
Sylvain Gugger authored
* Allow multiple outputs * Formatting * Move the unwrapping before metrics * Fix typo * Add test for non-supported config options
-
Siddharth Jain authored
-