1. 08 Feb, 2021 5 commits
    • Update tokenizers requirement (#10077) · f285e4c3
      Anthony MOI authored
    • Fix mlflow param overflow clean (#10071) · ddaafd78
      noise-field authored
      * Unify logging with f-strings
      
      * Get limits from MLflow rather than hardcode
      
      * Add a check for parameter length overflow (see the sketch after this entry)
      
      Also, the limit constants are marked as internal
      
      * Don't stop run in on_train_end
      
      This causes bad behaviour when there is a separate validation step:
      the validation gets recorded as a separate run.
      
      * Fix style
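      A minimal sketch of the overflow check described above, not the exact code
      merged in the PR: the limit constants are assumed to live in
      mlflow.utils.validation, and log_params_safely is an illustrative helper name.

      import logging

      import mlflow
      from mlflow.utils.validation import MAX_PARAM_VAL_LENGTH, MAX_PARAMS_TAGS_PER_BATCH

      logger = logging.getLogger(__name__)

      def log_params_safely(params: dict) -> None:
          safe_params = {}
          for name, value in params.items():
              if len(str(value)) > MAX_PARAM_VAL_LENGTH:
                  # MLflow rejects parameter values longer than its limit, which would
                  # abort logging, so warn and drop the offending value instead.
                  logger.warning(
                      f"Dropping parameter '{name}': value exceeds the "
                      f"{MAX_PARAM_VAL_LENGTH}-character limit MLflow accepts."
                  )
                  continue
              safe_params[name] = value

          # MLflow also caps how many params fit in one batch, so log them in chunks.
          items = list(safe_params.items())
          for i in range(0, len(items), MAX_PARAMS_TAGS_PER_BATCH):
              mlflow.log_params(dict(items[i : i + MAX_PARAMS_TAGS_PER_BATCH]))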
    • Restore TF embeddings and attention layers to their previous version (#9890) · 31563e05
      Julien Plu authored
      * Refactor BERT
      
      * Restore all the concerned models
      
      * Remove print
      
      * Update template
      
      * Apply Sylvain's and Morgan's comments
      
      * Fix cast
      
      * Put the cast inside call
      
      * Remove cond in embeddings
      
      * Fix funnel
      
      * Restore the previous dot product (attention_scores) computation (sketched after this entry)
      
      * Add ConvBERT and BART
      
      * Make all the S2S models ONNX compliant
      
      * Fix test
      
      * Fix check copies
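      As a rough illustration of the restored attention_scores computation (a
      standard scaled dot product; the function and variable names below are
      illustrative, not the PR's exact code):

      import math

      import tensorflow as tf

      def compute_attention_scores(query_layer, key_layer, attention_head_size):
          # (batch, heads, seq_q, head_size) x (batch, heads, seq_k, head_size)^T
          # -> (batch, heads, seq_q, seq_k)
          attention_scores = tf.matmul(query_layer, key_layer, transpose_b=True)
          # Scale by sqrt(head size), as in the original Transformer attention.
          return attention_scores / math.sqrt(float(attention_head_size))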
    • Cleaning up `ConversationalPipeline` to support more than DialoGPT. (#10002) · b1aa4982
      Nicolas Patry authored
      * Cleaning up `ConversationalPipeline` to support more than DialoGPT.
      
      Currently, `ConversationalPipeline` is heavily biased towards DialoGPT,
      which is the default model for this pipeline.
      
      This PR proposes to move the DialoGPT-specific modifications back into
      tokenizer-specific behavior wherever possible, by creating a
      `_build_conversation_input_ids` function that takes a conversation as
      input and returns a list of ints corresponding to the tokens (sketched
      after this entry). It feels natural to put this in the tokenizer because
      models probably all have different strategies for building input_ids from
      the full conversation, and it is the tokenizer's job to transform strings
      into tokens (and vice versa).
      
      If `_build_conversation_input_ids` is missing, the previous behavior is
      used, so nothing breaks so far (except for blenderbot, where this is a fix).
      
      This PR also contains a fix for overly long inputs. There used to be dead
      code that tried to limit the size of the incoming input. The introduced
      fix limits the length within `_build_conversation_input_ids` to
      `tokenizer.model_max_length`. This matches the intent of the removed dead
      code and is actually better, because it relies on `model_max_length`,
      which is different from `max_length` (a default parameter for `generate`).
      
      - Removed the `history` logic from the Conversation, as it is no longer
      relevant now that the tokenization logic has moved to the tokenizer: the
      tokenizer cannot save any cache, and the conversation cannot know what is
      relevant or not. It is also not usable for `blenderbot`, because its
      input_ids are not append-only (the EOS token is always at the end).
      
      - Added an `iter_texts` method on `Conversation`, because the code was
      littered with some form of this iteration over past/generated_responses.
      
      * Removing torch mention in types.
      
      * Adding type checking to `_build_conversation_input_ids`.
      
      * Fixing import in strings.
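      The tokenizer hook described above can be sketched roughly as follows (a
      DialoGPT-style example built from this entry's description; exact
      signatures in the released code may differ):

      def _build_conversation_input_ids(self, conversation) -> list:
          # Method on the tokenizer: turn the whole Conversation into input_ids.
          input_ids = []
          # iter_texts yields (is_user, text) pairs over past user inputs and
          # generated responses, replacing the ad-hoc iteration the PR removes.
          for is_user, text in conversation.iter_texts():
              # DialoGPT-style formatting: every turn ends with the EOS token.
              input_ids.extend(self.encode(text, add_special_tokens=False))
              input_ids.append(self.eos_token_id)

          if len(input_ids) > self.model_max_length:
              # Cap at tokenizer.model_max_length (not generate()'s max_length),
              # keeping the most recent part of the conversation.
              input_ids = input_ids[-self.model_max_length :]
          return input_ids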
    • A few fixes in the documentation (#10033) · 45aaf5f7
      Sylvain Gugger authored
  2. 05 Feb, 2021 3 commits
  3. 04 Feb, 2021 10 commits
  4. 03 Feb, 2021 4 commits
  5. 02 Feb, 2021 8 commits
  6. 01 Feb, 2021 8 commits
  7. 31 Jan, 2021 2 commits