Commits · e7f33e8cb38c446ce4004a742ebb386a4bef7174 · chenpangpang / transformers

09 Jul, 2021 3 commits

Pass `model_kwargs` when loading a model in `pipeline()` (#12449) · e7f33e8c

Alex Hedges authored Jul 09, 2021

* Pass model_kwargs when loading a model in pipeline

* Add test for model_kwargs parameter of pipeline()

* Rewrite test to not download model

* Fix failing style checks

e7f33e8c

[Flax] Add flax marian (#12595) · 65e27215

Patrick von Platen authored Jul 09, 2021



* fix_torch_device_generate_test

* remove @

* add marian

* finish make style

* add model

* add docs

* add test

* add integration tests

* up

* solve bug

* correct tests

* correct some tests

* Apply suggestions from code review
Co-authored-by: Suraj Patil <surajp815@gmail.com>
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* correct adapt marian

* finish
Co-authored-by: Patrick von Platen <patrick@huggingface.co>
Co-authored-by: Suraj Patil <surajp815@gmail.com>
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

65e27215

This will reduce "Already borrowed error": (#12550) · cc12e1db

Nicolas Patry authored Jul 09, 2021

* This will reduce "Already borrowed error":

Original issue https://github.com/huggingface/tokenizers/issues/537



The original issue is caused by transformers calling many times
mutable functions on the rust tokenizers.
Rust needs to guarantee that only 1 agent has a mutable reference
to memory at a given time (for many reasons which don't need explaining
here). Usually, the rust compiler can guarantee that this property is
true at compile time.

Unfortunately, this is impossible for Python to do that, so PyO3, the
bridge between rust and python used by `tokenizers`, will change the
compile guarantee for a dynamic guarantee, so if multiple agents try
to have multiple mutable borrows at the same time, then the runtime will
yell with "Already borrowed".

The proposed fix here in transformers, is simply to reduce the actual
number of calls that really need mutable borrows. By reducing them,
we reduce the risk of running into "Already borrowed" error.
The caveat is now we add a call to read the current configuration of the
`_tokenizer`, so worst case we have 2 calls instead of 1, and best case
we simply have 1 + a Python comparison of a dict (should be negligible).

* Adding a test.

* trivial error :(.

* Update tests/test_tokenization_fast.py
Co-authored-by: SaulLu <55560583+SaulLu@users.noreply.github.com>

* Adding reference to original issues in the tests.

* Update the tests with fast tokenizer.
Co-authored-by: SaulLu <55560583+SaulLu@users.noreply.github.com>

cc12e1db

08 Jul, 2021 2 commits

Fixing the pipeline optimization by reindexing targets (V2) (#12330) · 4da568c1

Nicolas Patry authored Jul 08, 2021



* Fixing the pipeline optimization by rescaling the logits first.

* Add test for target equivalence
Co-authored-by: Lysandre <lysandre.debut@reseau.eseo.fr>

4da568c1

[RFC] Laying down building stone for more flexible ONNX export capabilities (#11786) · 2aa3cd93

Funtowicz Morgan authored Jul 08, 2021

* Laying down building stone for more flexible ONNX export capabilities

* Ability to provide a map of config key to override before exporting.

* Makes it possible to export BART with/without past keys.

* Supports simple mathematical syntax for OnnxVariable.repeated

* Effectively apply value override from onnx config for model

* Supports export with additional features such as with-past for seq2seq

* Store the output path directly in the args for uniform usage across.

* Make BART_ONNX_CONFIG_* constants and fix imports.

* Support BERT model.

* Use tokenizer for more flexibility in defining the inputs of a model.

* Add TODO as remainder to provide the batch/sequence_length as CLI args

* Enable optimizations to be done on the model.

* Enable GPT2 + past

* Improve model validation with outputs containing nested structures

* Enable Roberta

* Enable Albert

* Albert requires opset >= 12

* BERT-like models requires opset >= 12

* Remove double printing.

* Enable XLM-Roberta

* Enable DistilBERT

* Disable optimization by default

* Fix missing setattr when applying optimizer_features

* Add value field to OnnxVariable to define constant input (not from tokenizers)

* Add T5 support.

* Simplify model type retrieval

* Example exporting token_classification pipeline for DistilBERT.

* Refactoring to package `transformers.onnx`

* Solve circular dependency & __main__

* Remove unnecessary imports in `__init__`

* Licences

* Use @Narsil's suggestion to forward the model's configuration to the ONNXConfig to avoid interpolation.

* Onnx export v2 fixes (#12388)

* Tiny fixes
Remove `convert_pytorch` from onnxruntime-less runtimes
Correct reference to model

* Style

* Fix Copied from

* LongFormer ONNX config.

* Removed optimizations

* Remvoe bad merge relicas.

* Remove unused constants.

* Remove some deleted constants from imports.

* Fix unittest to remove usage of PyTorch model for onnx.utils.

* Fix distilbert export

* Enable ONNX export test for supported model.

* Style.

* Fix lint.

* Enable all supported default models.

* GPT2 only has one output

* Fix bad property name when overriding config.

* Added unittests and docstrings.

* Disable with_past tests for now.

* Enable outputs validation for default export.

* Remove graph opt lvls.

* Last commit with on-going past commented.

* Style.

* Disabled `with_past` for now

* Remove unused imports.

* Remove framework argument

* Remove TFPreTrainedModel reference

* Add documentation

* Add onnxruntime tests to CircleCI

* Add test

* Rename `convert_pytorch` to `export`

* Use OrderedDict for dummy inputs

* WIP Wav2Vec2

* Revert "WIP Wav2Vec2"

This reverts commit f665efb04c92525c3530e589029f0ae7afdf603e.

* Style

* Use OrderedDict for I/O

* Style.

* Specify OrderedDict documentation.

* Style :)
Co-authored-by: Lysandre <lysandre.debut@reseau.eseo.fr>
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

2aa3cd93

07 Jul, 2021 2 commits

Adding support for `pipeline("automatic-speech-recognition")`. (#11525) · ebc69afc

Nicolas Patry authored Jul 07, 2021

* Adding support for `pipeline("automatic-speech-recognition")`.

- Ugly `"config"` choice for AutoModel. It would be great to have the
possibility to have something like `AutoModelFor` that would implement
the same logic (Load the config, check Architectures and load the first
one)

* Remove `model_id` was not needed in the end.

* Rebased !

* Remove old code.

* Rename `nlp`.

ebc69afc

[Flax] Add FlaxMBart (#12236) · 61400e1e

Daniel Stancl authored Jul 07, 2021



* Copy BART to MBart and rename some stuff

* Add copy statements pointing to FlaxBart

* Update/add some common files

* Update shift_tokens_rigth + fix imports

* Fix shift_tokens_right method according to MBart implementation

* Update shift_tokens_right in tests accordingly

* Fix the import issue and update docs file
* make style quality

* Do some minor changes according to patil-suraj suggestions

* Change the order of normalization layer and attention

* Add some copu statementes

* Update generate method and add integration test for mBart

* Make a few updates after a review

Besides, add `lang_code_to_id` to MBartTokenizeFast

* fix-copies; make style quality

* Apply suggestions from code review

* Apply suggestions from code review

* Apply suggestions from code review

* fix output type, style

* add copied from

* resolve conflicts
Co-authored-by: Suraj Patil <surajp815@gmail.com>

61400e1e

06 Jul, 2021 3 commits

implementing tflxmertmodel integration test (#12497) · 3fd85777
sadakmed authored Jul 06, 2021
```
* implementing tflxmertmodel integration test

* move import

* revert and fix
```
3fd85777

FlaxGPTNeo (#12493) · 7a259c19

Suraj Patil authored Jul 06, 2021

* flax gpt neo

* fix query scaling

* update generation test

* use flax model for test

7a259c19

[RoFormer] Fix some issues (#12397) · 626a0a01

yujun authored Jul 06, 2021



* add RoFormerTokenizerFast into AutoTokenizer

* fix typo in roformer docs

* make onnx export happy

* update RoFormerConfig embedding_size

* use jieba not rjieba

* fix 12244 and make test_alignement passed

* update ARCHIVE_MAP

* make style & quality & fixup

* update

* make style & quality & fixup

* make style quality fixup

* update

* suggestion from LysandreJik
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

* make style

* use rjieba
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

626a0a01

05 Jul, 2021 1 commit

create LxmertModelIntegrationTest Pytorch (#9989) · 0e1718af

sadakmed authored Jul 05, 2021

* create LxmertModelIntegrationTest

* implementation using numpy seeding to fix inputs params.

* fix code quality

* isort check

0e1718af

02 Jul, 2021 1 commit
- Fix TAPAS test uncovered by #12446 (#12480) · b889d3f6
  Lysandre Debut authored Jul 02, 2021
  
  b889d3f6
01 Jul, 2021 3 commits

[roberta] fix lm_head.decoder.weight ignore_key handling (#12446) · 2d1d9218

Stas Bekman authored Jul 01, 2021



* fix lm_head.decoder.weight ignore_key handling

* fix the mutable class variable

* Update src/transformers/models/roberta/modeling_roberta.py
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

* replicate the comment

* make deterministic
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

2d1d9218

[Wav2Vec2, Hubert] Fix ctc loss test (#12458) · 27d348f2
Patrick von Platen authored Jul 01, 2021
```
* fix_torch_device_generate_test

* remove @

* fix test
```
27d348f2
Add test for a WordLevel tokenizer model (#12437) · 3aa37b94
SaulLu authored Jul 01, 2021
```
* add a test for a WordLevel tokenizer

* adapt common test to new tokenizer
```
3aa37b94

30 Jun, 2021 3 commits

[Flax] Add wav2vec2 (#12271) · 0d1f67e6

Patrick von Platen authored Jun 30, 2021



* fix_torch_device_generate_test

* remove @

* start flax wav2vec2

* save intermediate

* forward pass has correct shape

* add weight norm

* add files

* finish ctc

* make style

* finish gumbel quantizer

* correct docstrings

* correct some more files

* fix vit

* finish quality

* correct tests

* correct docstring

* correct tests

* start wav2vec2 pretraining script

* save intermediate

* start pretraining script

* finalize pretraining script

* finish

* finish

* small typo

* finish

* correct

* Apply suggestions from code review
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: Suraj Patil <surajp815@gmail.com>

* make style

* push
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: Suraj Patil <surajp815@gmail.com>

0d1f67e6

Add CANINE (#12024) · 6e685978

NielsRogge authored Jun 30, 2021



* First pass

* More progress

* Add support for local attention

* More improvements

* More improvements

* Conversion script working

* Add CanineTokenizer

* Make style & quality

* First draft of integration test

* Remove decoder test

* Improve tests

* Add documentation

* Mostly docs improvements

* Add CanineTokenizer tests

* Fix most tests on GPU, improve upsampling projection

* Address most comments by @dhgarrette

* Remove decoder logic

* Improve Canine tests, improve docs of CanineConfig

* All tokenizer tests passing

* Make fix-copies and fix tokenizer tests

* Fix test_model_outputs_equivalence test

* Apply suggestions from @sgugger's review
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Address some more comments

* Add support for hidden_states and attentions of shallow encoders

* Define custom CanineModelOutputWithPooling, tests pass

* First pass

* More progress

* Add support for local attention

* More improvements

* More improvements

* Conversion script working

* Add CanineTokenizer

* Make style & quality

* First draft of integration test

* Remove decoder test

* Improve tests

* Add documentation

* Mostly docs improvements

* Add CanineTokenizer tests

* Fix most tests on GPU, improve upsampling projection

* Address most comments by @dhgarrette

* Remove decoder logic

* Improve Canine tests, improve docs of CanineConfig

* All tokenizer tests passing

* Make fix-copies and fix tokenizer tests

* Fix test_model_outputs_equivalence test

* Apply suggestions from @sgugger's review
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Address some more comments

* Make conversion script work for Canine-c too

* Fix tokenizer tests

* Remove file
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

6e685978

Fix default bool in argparser (#12424) · c9486fd0
Sylvain Gugger authored Jun 30, 2021
```
* Fix default bool in argparser

* Add more to test
```
c9486fd0

29 Jun, 2021 4 commits

Easily train a new fast tokenizer from a given one (#12361) · dc42e770

Sylvain Gugger authored Jun 29, 2021



* [WIP] Easily train a new fast tokenizer from a given one

* Fix test

* Roll out to other tokenizers and add tests

* Fix bug with unk id and add emoji to test

* Really use something different in test

* Implement special tokens map

* Map special tokens in the Transformers tokenizers

* Fix test

* Make test more robust

* Fix test for BPE

* More robust map and test

Co-authored-by SaulLu

* Test file

* Stronger tests
Co-authored-by: SaulLu <lucilesaul.com@gmail.com>

* Map unk token for Wordpiece and address review comment

* Fix lowercase test and address review comment

* Fix all tests

* Simplify test

* Fix tests for realsies

* Easily train a new fast tokenizer from a given one - tackle the special tokens format (str or AddedToken) (#12420)

* Propose change in tests regarding lower case

* add new test for special tokens types

* put back the test part about decoding

* add feature: the AddedToken is re-build with the different mapped content

* Address review comment: simplify AddedToken building
Co-authored-by: sgugger <sylvain.gugger@gmail.com>

* Update src/transformers/tokenization_utils_fast.py
Co-authored-by: sgugger <sylvain.gugger@gmail.com>
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: SaulLu <lucilesaul.com@gmail.com>
Co-authored-by: SaulLu <55560583+SaulLu@users.noreply.github.com>

dc42e770

Add out of vocabulary error to ASR models (#12288) · bc084938
Will Rice authored Jun 29, 2021
```
* Add OOV error to ASR models

* Feedback changes
```
bc084938

Rename detr targets to labels (#12280) · 1fc6817a

NielsRogge authored Jun 29, 2021

* Rename target to labels in DetrFeatureExtractor

* Update DetrFeatureExtractor tests accordingly

* Improve docs of DetrFeatureExtractor

* Improve docs

* Make style

1fc6817a

[models] respect dtype of the model when instantiating it (#12316) · 7682e977

Stas Bekman authored Jun 28, 2021



* [models] respect dtype of the model when instantiating it

* cleanup

* cleanup

* rework to handle non-float dtype

* fix

* switch to fp32 tiny model

* improve

* use dtype.is_floating_point

* Apply suggestions from code review
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* fix the doc

* recode to use explicit torch_dtype_auto_detect, torch_dtype args

* docs and tweaks

* docs and tweaks

* docs and tweaks

* merge 2 args, add docs

* fix

* fix

* better doc

* better doc
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

7682e977

28 Jun, 2021 1 commit

[Examples] Added context manager to datasets map (#12367) · 04dbea31

Bhadresh Savani authored Jun 28, 2021

* added cotext manager to datasets map

* fixed style and spaces

* fixed warning of deprecation

* changed desc

04dbea31

25 Jun, 2021 1 commit
- remove extra white space from log format (#12360) · 4a872cae
  Stas Bekman authored Jun 25, 2021
  
  4a872cae
24 Jun, 2021 1 commit
- Fix torchscript tests (#12336) · 8ef62ec9
  Lysandre Debut authored Jun 24, 2021
```
* Fix torchscript tests

* Better test

* Remove bogus print
```
  8ef62ec9
23 Jun, 2021 6 commits

changed modeling_fx_utils.py to utils/fx.py for clarity (#12326) · 986ac03e
Michael Benayoun authored Jun 23, 2021
```
Co-authored-by: Michael Benayoun <michael@huggingface.co>
```
986ac03e
Temporarily revert the `fill-mask` improvements. · 941b4442
Lysandre authored Jun 23, 2021

941b4442

Clean push to hub API (#12187) · 53c60bab

Sylvain Gugger authored Jun 23, 2021



* Clean push to hub API

* Create working dir if it does not exist

* Different tweak

* New API + all models + test Flax

* Adds the Trainer clean up

* Update src/transformers/file_utils.py
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

* Address review comments

* (nit) output types

* No need to set clone_from when folder exists

* Update src/transformers/trainer.py
Co-authored-by: Julien Chaumond <julien@huggingface.co>

* Add generated_from_trainer tag

* Update to new version

* Fixes
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
Co-authored-by: Julien Chaumond <julien@huggingface.co>
Co-authored-by: Lysandre <lysandre.debut@reseau.eseo.fr>

53c60bab

Flax T5 (#12150) · e98233dd

Vasudev Gupta authored Jun 23, 2021



* copy pytorch-t5

* init

* boom boom

* forward pass same

* make generation work

* add more tests

* make test work

* finish normal tests

* make fix-copies

* finish quality

* correct slow example

* correct slow test

* version table

* upload models

* Update tests/test_modeling_flax_t5.py

* correct incorrectly deleted line
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
Co-authored-by: Patrick von Platen <patrick@huggingface.co>

e98233dd

Add output in a dictionary for TF `generate` method (#12139) · 26a2e365

Daniel Stancl authored Jun 23, 2021

* Add output args to greedy search

* Fix critical typo + make style quality

* Handle generate_beam_search

* Add dict_specific tests and fix the placement of encoder outputs

* Add  specific outputs

* Update doc

* Fix typo

* Adjust handling encoder_outputs + Fix generating for T5

* Fix generate for RAG

* Fix handling ouptut_attentions when target_mapping is not None

Take care of situations when target_mapping is provided
as there are 2-tuple of attentions

Change from:
if inputs["output_attentions"]:
    attentions = tuple(tf.transpose(t, perm(2, 3, 0, 1)) for t in attentions)

to:
if inputs["output_attentions"]:
    if inputs["target_mapping"] is not None:
        # when target_mapping is provided, there are 2-tuple of attentions
         attentions = tuple(
             tuple(tf.transpose(attn_stream, perm=(2, 3, 0, 1)) for attn_stream in t) for t in attentions
        )
    else:
        attentions = tuple(tf.transpose(t, perm=(2, 3, 0, 1)) for t in attentions)

* Rename kwargs to model_kwargs

* make style quality

* Move imports in test_modeling_tf_common.py

Move ModelOutput-related imports in test_modeling_tf_common.py
into the `is_tf_available():` statement.

* Rewrite nested if-statements

* Fix added tests

26a2e365

Optimizing away the `fill-mask` pipeline. (#12113) · d4be4984

Nicolas Patry authored Jun 23, 2021



* Optimizing away the `fill-mask` pipeline.

- Don't send anything to the tokenizer unless needed. Vocab check is
much faster
- Keep BC by sending data to the tokenizer when needed. User handling warning messages will see performance benefits again
- Make `targets` and `top_k` work together better `top_k` cannot be
higher than `len(targets)` but can be smaller still.
- Actually simplify the `target_ids` in case of duplicate (it can happen
because we're parsing raw strings)
- Removed useless code to fail on empty strings. It works only if empty
string is in first position, moved to ignoring them instead.
- Changed the related tests as only the tests would fail correctly
(having incorrect value in first position)

* Make tests compatible for 2 different vocabs... (at the price of a
warning).

Co-authored-by: @EtaoinWu

* ValueError working globally

* Update src/transformers/pipelines/fill_mask.py
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

* `tokenizer.vocab` -> `tokenizer.get_vocab()` for more compatiblity +
fallback.
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

d4be4984

22 Jun, 2021 3 commits

[trainer] 2 bug fixes and a rename (#12309) · ebe54135
Stas Bekman authored Jun 22, 2021
```
* bug fixes and a rename

* add extended DDP test
```
ebe54135
[tests] multiple improvements (#12294) · 0d97ba8a
Stas Bekman authored Jun 21, 2021
```
* [tests] multiple improvements

* cleanup

* style

* todo to investigate

* fix
```
0d97ba8a

[trainer + examples] set log level from CLI (#12276) · dad414d5

Stas Bekman authored Jun 21, 2021



* set log level from CLI

* add log_level_replica + test + extended docs

* cleanup

* Apply suggestions from code review
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* rename datasets objects to allow datasets module

* improve the doc

* style

* doc improve
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

dad414d5

21 Jun, 2021 4 commits
- reset report_to to none, avoid deprecation warning (#12293) · a4ed074d
  Stas Bekman authored Jun 21, 2021
  
  a4ed074d
- [Flax] Fix flax test save pretrained (#12256) · 4e9a6796
  Patrick von Platen authored Jun 21, 2021
```
* fix_torch_device_generate_test

* remove @

* fix flax save pretrained test
```
  4e9a6796
- [Flax] [WIP] allow loading head model with base model weights (#12255) · eb881674
  Suraj Patil authored Jun 21, 2021
```
* boom boom

* remove flax clip example

* allow loading head model with base model weights

* add test

* fix imports

* disable save, load test for clip

* add test_save_load_to_base
```
  eb881674
- [FlaxClip] fix test from/save pretrained test (#12284) · 8d5b7f36
  Suraj Patil authored Jun 21, 2021
```
* boom boom

* remove flax clip example

* fix from_save_pretrained
```
  8d5b7f36
17 Jun, 2021 2 commits

AutoTokenizer: infer the class from the tokenizer config if possible (#12208) · adb70eda

Sylvain Gugger authored Jun 17, 2021



* AutoTokenizer: infer the class from the tokenizer config if possible

* Add tests

* Update src/transformers/models/auto/tokenization_auto.py
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

adb70eda

Pipeline update & tests (#12207) · b56848c8
Lysandre Debut authored Jun 17, 2021

b56848c8