Commits · 86a7845c0c16f24af7e665ff6a7bc3540024a6e4 · chenpangpang / transformers

15 Feb, 2022 2 commits

Fix ASR pipelines from local directories with wav2vec models that have... · 9eb7e9ba

Javier de la Rosa authored Feb 15, 2022

Fix ASR pipelines from local directories with wav2vec models that have language models attached (#15590)

* Fix loading pipelines with wav2vec models with lm when in local paths

* Adding tests

* Fix test

* Adding tests

* Flake8 fixes

* Removing conflict files :(

* Adding task type to test

* Remove unnecessary test and imports

9eb7e9ba

[SpeechEncoderDecoder] Make sure no EOS is generated in test (#15655) · 041fdc4a
Patrick von Platen authored Feb 15, 2022

041fdc4a

14 Feb, 2022 1 commit

Sylvain Gugger authored Feb 14, 2022

* Rework AutoFeatureExtractor.from_pretrained internal

* Custom feature extractor

* Add more tests

* Add support for custom feature extractor code

* Clean up

* Add register API to AutoFeatureExtractor

2e11a043

11 Feb, 2022 4 commits
- Add push to hub to feature extractor (#15632) · 52d2e6f6
  Sylvain Gugger authored Feb 11, 2022
```
* Add push to hub to feature extractor

* Quality

* Clean up
```
  52d2e6f6
- Custom feature extractor (#15630) · 7a32e472
  Sylvain Gugger authored Feb 11, 2022
```
* Rework AutoFeatureExtractor.from_pretrained internal

* Custom feature extractor

* Add more tests

* Add support for custom feature extractor code

* Clean up
```
  7a32e472
- Fix _configuration_file argument getting passed to model (#15629) · 2dce350b
  Sylvain Gugger authored Feb 11, 2022
  
  2dce350b
- TF MT5 embeddings resize (#15567) · 2f40c728
  Joao Gante authored Feb 11, 2022
```
* Fix TF MT5 vocab resize

* more assertive testing
```
  2f40c728
10 Feb, 2022 4 commits

Compute loss independent from decoder for TF EncDec models (as #14139) (#15175) · 724e51c6

Yih-Dar authored Feb 10, 2022



* Compute loss independent from decoder (as 14139)

* fix expected seq_len + style

* Apply the same change to TFVisionEncoderDecoderModel

* fix style

* Add case with labels in equivalence test

* uncomment

* Add case with labels in equivalence test

* add decoder_token_labels

* use hf_compute_loss

* Apply suggestions from code review
Co-authored-by: NielsRogge <48327001+NielsRogge@users.noreply.github.com>

* Add copied from
Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
Co-authored-by: NielsRogge <48327001+NielsRogge@users.noreply.github.com>

724e51c6

Add Tensorflow handling of ONNX conversion (#13831) · cb7ed6e0

Alberto Bégué authored Feb 10, 2022



* Add TensorFlow support for ONNX export

* Change documentation to mention conversion with TensorFlow

* Refactor export into export_pytorch and export_tensorflow

* Check model's type instead of framework installation to choose between TF and PyTorch
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
Co-authored-by: Alberto Bégué <alberto.begue@della.ai>
Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

cb7ed6e0

Reformat tokenization_fnet · e923917c
Lysandre authored Feb 09, 2022

e923917c
Make slow tests slow · 644ec052
Sylvain Gugger authored Feb 09, 2022

644ec052

09 Feb, 2022 8 commits

Fix tests hub failure (#15580) · 315e6740
Sylvain Gugger authored Feb 09, 2022
```
* Expose hub test problem

* Fix tests
```
315e6740
Fix quality · b1ba03e0
Sylvain Gugger authored Feb 09, 2022

b1ba03e0

Constrained Beam Search [without disjunctive decoding] (#15416) · 2b5603f6

Chan Woo Kim authored Feb 10, 2022



* added classes to get started with constrained beam search

* in progress, think i can directly force tokens now but not yet with the round robin

* think now i have total control, now need to code the bank selection

* technically works as desired, need to optimize and fix design choices leading to undesirable outputs

* complete PR #1 without disjunctive decoding

* removed incorrect tests

* Delete k.txt

* Delete test.py

* Delete test.sh

* revert changes to test scripts

* genutils

* full implementation with testing, no disjunctive yet

* shifted docs

* passing all tests realistically ran locally

* removing accidentally included print statements

* fixed source of error in initial PR test

* fixing the get_device() vs device trap

* fixed documentation docstrings about constrained_beam_search

* fixed tests failing for Speech2TextModel's floating point inputs

* fix cuda long tensor

* added examples and testing for them and found & fixed a bug in beam_search and constrained_beam_search

* deleted accidentally added test halting code with assert False

* code reformat

* Update tests/test_generation_utils.py
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* Update tests/test_generation_utils.py
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* Update tests/test_generation_utils.py
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* Update tests/test_generation_utils.py
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* Update tests/test_generation_utils.py

* fixing based on comments on PR

* took out the testing code that should work but fails without the beam search modification; style changes

* fixing comments issues

* docstrings for ConstraintListState

* typo in PhrasalConstraint docstring

* docstrings improvements
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

2b5603f6

Add implementation of typical sampling (#15504) · 0113aae5

Clara Meister authored Feb 09, 2022

* typical decoding

* changing arg name

* add test config params

* forgotten arg rename

* fix edge case where scores are same

* test for typical logits warper

* code quality fixes

0113aae5
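
The commit above only names the technique, so the following is a minimal sketch of the idea behind typical sampling rather than the code that was merged. It keeps the tokens whose surprisal (-log p) is closest to the entropy of the distribution until a target probability mass is covered. The function name `typical_filter`, the parameter name `mass`, and the tie-breaking by index are assumptions made for illustration.

```python
import math


def typical_filter(probs, mass=0.9):
    """Return the sorted indices kept by a typical-sampling filter (illustrative only)."""
    # Entropy of the distribution, skipping zero-probability tokens.
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    # Distance between each token's surprisal (-log p) and the entropy.
    scored = [(abs(-math.log(p) - entropy), i) for i, p in enumerate(probs) if p > 0]
    # Most "typical" tokens first; equal distances fall back to index order (an assumption).
    scored.sort()
    kept, total = [], 0.0
    for _, i in scored:
        kept.append(i)
        total += probs[i]
        if total >= mass:
            break
    return sorted(kept)


if __name__ == "__main__":
    probs = [0.5, 0.25, 0.125, 0.125]
    print(typical_filter(probs, mass=0.2))
    print(typical_filter(probs, mass=0.7))
```

Note that the most probable token is not necessarily the first one kept: the filter ranks tokens by how close their surprisal is to the entropy, not by probability.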

[Flax tests/FlaxBert] make from_pretrained test faster (#15561) · f588cf40
Suraj Patil authored Feb 09, 2022

f588cf40
Make sure custom configs work with Transformers (#15569) · 1f60bc46
Sylvain Gugger authored Feb 09, 2022
```
* Make sure custom configs work with Transformers

* Apply code review suggestions
```
1f60bc46
Upgrade black to version ~=22.0 (#15565) · 7732d0fe
Lysandre Debut authored Feb 09, 2022
```
* Upgrade black to version ~=22.0

* Check copies

* Fix code
```
7732d0fe
[Flax tests] fix test_model_outputs_equivalence (#15571) · a6885db9
Suraj Patil authored Feb 09, 2022
```
* fix test_model_outputs_equivalence

* fix tuple outputs for blenderbot
```
a6885db9

08 Feb, 2022 2 commits

Add TFSpeech2Text (#15113) · 8406fa6d

Joao Gante authored Feb 08, 2022

* Add wrapper classes

* convert inner layers to tf

* Add TF Encoder and Decoder layers

* TFSpeech2Text models

* Loadable model

* TF model with same outputs as PT model

* test skeleton

* correct tests and run the fixup

* correct attention expansion

* TFSpeech2Text past_key_values with TF format

8406fa6d

electra is added to onnx supported model (#15084) · 87d08afb

aaron authored Feb 08, 2022



* electra is added to onnx supported model

* add google/electra-base-generator for test onnx module
Co-authored-by: Lewis Tunstall <lewis.c.tunstall@gmail.com>

87d08afb

07 Feb, 2022 6 commits

FX tracing improvement (#14321) · 0fe17f37

Michael Benayoun authored Feb 07, 2022

* Change the way tracing happens, enabling dynamic axes out of the box

* Update the tests and modeling xlnet

* Skip recording for leaf modules to avoid recording more values for the methods than will be seen at tracing time (which would otherwise desynchronize the recorded values from the values that need to be given to the proxies during tracing, causing errors).

* Comments and making tracing work for gpt-j and xlnet

* Refactor things related to num_choices (and batch_size, sequence_length)

* Update fx to work on PyTorch 1.10

* Postpone autowrap_function feature usage for later

* Add copyrights

* Remove unnecessary file

* Fix issue with add_new_model_like

* Apply suggestions

0fe17f37

Fix TF T5/LED missing cross attn in return values (#15511) · 131e2584

Yih-Dar authored Feb 07, 2022



* add cross attn to outputs

* add cross attn to outputs for TFLED

* add undo padding

* remove unused import

* fix style
Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>

131e2584

Remove Longformers from ONNX-supported models (#15273) · 6775b211
lewtun authored Feb 07, 2022

6775b211

Wav2Vec2 models must either throw or deal with add_adapter (#15409) · 7a1412e1

François REMY authored Feb 07, 2022



* Wav2Vec2 models must either throw or deal with add_adapter
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* Add pre-add_adapter backwards compatibility

* Add pre-add_adapter backwards compatibility

* Fix issue in tests/test_modeling_wav2vec2.py
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

7a1412e1

Add ConvNeXT (#15277) · 84eec9e6

NielsRogge authored Feb 07, 2022



* First draft

* Add conversion script

* Improve conversion script

* Improve docs and implement tests

* Define model output class

* Fix tests

* Fix more tests

* Add model to README

* Apply suggestions from code review
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Apply more suggestions from code review

* Apply suggestions from code review

* Rename dims to hidden_sizes

* Fix equivalence test

* Rename gamma to gamma_parameter

* Clean up conversion script

* Add ConvNextFeatureExtractor

* Add corresponding tests

* Implement feature extractor correctly

* Make implementation cleaner

* Add ConvNextStem class

* Improve design

* Update design to also include encoder

* Fix gamma parameter

* Use sample docstrings

* Finish conversion, add center cropping

* Replace nielsr by facebook, make feature extractor tests smaller

* Fix integration test
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

84eec9e6

[ASR pipeline] correct asr pipeline for seq2seq models (#15541) · 5f1918a4
Patrick von Platen authored Feb 07, 2022

5f1918a4

04 Feb, 2022 2 commits

Standardize semantic segmentation models outputs (#15469) · ac6aa10f

Sylvain Gugger authored Feb 04, 2022



* Standardize instance segmentation models outputs

* Rename output

* Update src/transformers/modeling_outputs.py
Co-authored-by: NielsRogge <48327001+NielsRogge@users.noreply.github.com>

* Add legacy argument to the config and model forward

* Update src/transformers/models/beit/modeling_beit.py
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

* Copy fix in Segformer
Co-authored-by: NielsRogge <48327001+NielsRogge@users.noreply.github.com>
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

ac6aa10f

Fix TFRemBertEncoder all_hidden_states (#15510) · bbe9c698

Yih-Dar authored Feb 04, 2022



* fix

* fix test

* remove expected_num_hidden_layers
Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>

bbe9c698

03 Feb, 2022 3 commits
- [WIP] Add preprocess_logits_for_metrics Trainer param (#15473) · f1a4c4ea
  davidleonfdez authored Feb 03, 2022
```
* Add preprocess_logits_for_metrics Trainer param

* Compute accuracy in LM examples

* Improve comments
```
  f1a4c4ea
- [deepspeed] fix a bug in a test (#15493) · 4f5faaf0
  Stas Bekman authored Feb 03, 2022
```
* [deepspeed] fix a bug in a test

* consistency
```
  4f5faaf0
- fix load_weight_prefix (#15101) · f5d98da2
  Yih-Dar authored Feb 03, 2022
```
Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
```
  f5d98da2
02 Feb, 2022 7 commits

Correct eos_token_id settings in generate (#15403) · 5ec368d7

CHI LIU authored Feb 03, 2022

* Correct eos_token_id set in generate

* Set eos_token_id in test

* Correct eos_token_id set in generate

* Set eos_token_id in test

5ec368d7

fix set truncation attribute in `__init__` of `PreTrainedTokenizerBase` (#15456) · 39b5d1a6

SaulLu authored Feb 02, 2022



* change truncation_side in init of `PreTrainedTokenizerBase`
Co-authored-by: LSinev <LSinev@users.noreply.github.com>

* add test

* Revert "replace assert with exception for `padding_side` arg in `PreTrainedTokenizerBase` `__init__`"

This reverts commit 7a98b87962d2635c7e4d4f00db3948b694624843.

* fix kwargs

* Revert "fix kwargs"

This reverts commit 67b0a5270e8cf1dbf70e6b0232e94c0452b6946f.

* Update tests/test_tokenization_common.py
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>

* delete truncation_side variable

* reorganize test

* format

* complete doc

* Revert "Revert "replace assert with exception for `padding_side` arg in `PreTrainedTokenizerBase` `__init__`""

This reverts commit d5a10a7e2680539e5d9e98ae5d896c893d224b80.

* fix typo

* fix typos to render documentation

* Revert "Revert "Revert "replace assert with exception for `padding_side` arg in `PreTrainedTokenizerBase` `__init__`"""

This reverts commit 16cf58811943a08f43409a7c83eaa330686591d0.

* format
Co-authored-by: LSinev <LSinev@users.noreply.github.com>
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>

39b5d1a6
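
To make the effect of `truncation_side` concrete, here is a small sketch of what left versus right truncation means for a list of token ids. It is a standalone illustration, not the tokenizer's code: the helper name `truncate` and its signature are hypothetical.

```python
def truncate(ids, max_length, truncation_side="right"):
    """Cut a list of token ids down to max_length from the chosen side (illustrative only)."""
    if truncation_side not in ("left", "right"):
        raise ValueError(f"truncation_side must be 'left' or 'right', got {truncation_side!r}")
    if len(ids) <= max_length:
        return list(ids)
    if truncation_side == "right":
        # Keep the beginning of the sequence and drop the tail.
        return list(ids[:max_length])
    # Keep the end of the sequence and drop the head.
    return list(ids[len(ids) - max_length:])


if __name__ == "__main__":
    ids = [1, 2, 3, 4, 5]
    print(truncate(ids, 3))
    print(truncate(ids, 3, truncation_side="left"))
```

Left truncation is the usual choice when the end of the input matters most, for example the latest turns of a long dialogue.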

Add W&B backend for hyperparameter sweep (#14582) · c74f3d4c

Ayush Chaurasia authored Feb 03, 2022

# Add support for W&B hyperparameter sweep
This PR:
* Allows using wandb to run hyperparameter searches.
* Visualizes the runs on the W&B sweeps dashboard.
* Supports running sweeps on parallel devices, all reporting to the same central dashboard.

### Usage
**To run a new hyperparameter search:**
```
trainer.hyperparameter_search(
    backend="wandb",
    project="transformers_sweep",  # name of the project
    n_trials=5,
    metric="eval/loss",  # metric to optimize (default: "eval/loss"); a warning is raised if the metric is not found
)
```
This outputs a sweep id, e.g. `my_project/sweep_id`.

**To run sweeps on parallel devices:**
Pass the id of the sweep you want to run in parallel:
```
trainer.hyperparameter_search(
    backend="wandb",
    sweep_id="my_project/sweep_id",
)
```

c74f3d4c

Save code of registered custom models (#15379) · 44b21f11

Sylvain Gugger authored Feb 02, 2022



* Allow dynamic modules to use relative imports

* Work for configs

* Fix last merge conflict

* Save code of registered custom objects

* Map strings to strings

* Fix test

* Add tokenizer

* Rework tests

* Tests

* Ignore fixtures py files for tests

* Tokenizer test + fix collection

* With full path

* Rework integration

* Fix typo

* Remove changes in conftest

* Test for tokenizers

* Add documentation

* Update docs/source/custom_models.mdx
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

* Add file structure and file content

* Add more doc

* Style

* Update docs/source/custom_models.mdx
Co-authored-by: Suraj Patil <surajp815@gmail.com>

* Address review comments
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
Co-authored-by: Suraj Patil <surajp815@gmail.com>

44b21f11

Adding support for `microphone` streaming within pipeline. (#15046) · 623d8cb4

Nicolas Patry authored Feb 02, 2022



* Adding support for `microphone` streaming within pipeline.

- Uses `ffmpeg` to get microphone data.
- Makes sure alignment is made to `size_of_sample`.
- Works by sending `{"raw": ..data.., "stride": (n, left, right),
"partial": bool}`
directly to the pipeline, which makes it possible to stream partial results and still
get inference.
- Lets `partial` information flow through the pipeline so the caller
  can get it back and choose whether to display the text.

- The striding reconstitution is bound to have errors since CTC does not
keep previous state. Currently, most of the errors come from not knowing whether
there is a space between two chunks.
Since we have some left striding info, we could use that during decoding
to choose what to do with those spaces, and maybe even extra letters (if
the stride is long enough, it's bound to cover at least a few symbols).

Fixing tests.

Protecting with `require_torch`.

`raw_ctc` support for nicer demo.

Post rebase fixes.

Revamp to split raw_mic_data from its live chunking.

- Requires a refactor to make everything a bit cleaner.

Automatic resampling.

Small fix.

Small fix.

* Post rebase fix (need to let super handle more logic, reorder args.)

* Update docstrings

* Docstring format.

* Remove print.

* Prevent flow of `input_values`.

* Fixing `stride` too.

* Fixing the PR by removing `raw_ctc`.

* Better docstrings.

* Fixing init.

* Update src/transformers/pipelines/audio_utils.py
Co-authored-by: Anton Lozhkov <aglozhkov@gmail.com>

* Update tests/test_pipelines_automatic_speech_recognition.py
Co-authored-by: Anton Lozhkov <aglozhkov@gmail.com>

* Quality.
Co-authored-by: Anton Lozhkov <aglozhkov@gmail.com>

623d8cb4
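
The chunk format described in the commit message above, `{"raw": ..., "stride": (n, left, right)}`, can be sketched as follows. This is a simplified illustration under stated assumptions, not the code in `audio_utils.py`: the function name `chunk_with_stride` is hypothetical, all lengths are assumed to be in bytes, the whole recording is assumed to be available up front, and the `partial` flag used for live streaming is left out.

```python
def chunk_with_stride(data, chunk_len, stride_left, stride_right, size_of_sample=4):
    """Yield overlapping chunks of raw audio bytes with their stride info (illustrative only)."""
    for value in (chunk_len, stride_left, stride_right):
        if value % size_of_sample:
            raise ValueError("lengths must be multiples of size_of_sample")
    step = chunk_len - stride_left - stride_right
    if step <= 0:
        raise ValueError("the strides must be smaller than the chunk length")
    # Drop trailing bytes that do not form a whole sample.
    data = data[: len(data) - len(data) % size_of_sample]
    start, first = 0, True
    while start < len(data):
        chunk = data[start : start + chunk_len]
        is_last = start + chunk_len >= len(data)
        # The first chunk has no left context and the last chunk has no right context.
        left = 0 if first else stride_left
        right = 0 if is_last else stride_right
        yield {"raw": chunk, "stride": (len(chunk), left, right)}
        if is_last:
            break
        start += step
        first = False


if __name__ == "__main__":
    for item in chunk_with_stride(bytes(range(24)), chunk_len=12, stride_left=4, stride_right=4):
        print(item["stride"])
```

Dropping the `left` and `right` margins of every chunk and concatenating what remains gives back the aligned input, which is the property the decoder relies on when it stitches the per-chunk results together.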

[Wav2Vec2ProcessorWithLM] add alpha & beta to batch decode & decode (#15465) · d718c0c3
Patrick von Platen authored Feb 02, 2022

d718c0c3

Add option to resize like torchvision's Resize (#15419) · 1d94d575

NielsRogge authored Feb 02, 2022

* Add torchvision's resize

* Rename torch_resize to default_to_square

* Apply suggestions from code review

* Add support for default_to_square and tuple of length 1

1d94d575
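
As a rough illustration of the `default_to_square` option named above, the sketch below computes a target size the way torchvision's `Resize` treats an integer size: the shorter edge is matched to `size` and the aspect ratio is preserved. The helper name `output_size`, its signature, and the `(height, width)` return order are assumptions for this example and are not taken from the commit.

```python
def output_size(height, width, size, default_to_square=True):
    """Return the (height, width) an image would be resized to (illustrative only)."""
    if isinstance(size, (tuple, list)):
        if len(size) == 2:
            # An explicit (height, width) pair is used as is.
            return tuple(size)
        # A sequence of length 1 is treated like a single integer.
        size = size[0]
    if default_to_square:
        return (size, size)
    # torchvision-style: match the shorter edge to `size`, keep the aspect ratio.
    if width <= height:
        return (int(size * height / width), size)
    return (size, int(size * width / height))


if __name__ == "__main__":
    print(output_size(400, 200, 100))
    print(output_size(400, 200, 100, default_to_square=False))
```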

01 Feb, 2022 1 commit

fix the `tokenizer_config.json` file for the slow tokenizer when a fast... · 7b8bdd86

SaulLu authored Feb 01, 2022

fix the `tokenizer_config.json` file for the slow tokenizer when a fast version is available (#15319)

* add new test

* update test

* remove `tokenizer_file` from `additional_files_names` in `tokenization_utils_base.py`

* add `tokenizer_file` for the fast only tokenizer

* change global variables layoutxlm

* remove `"tokenizer_file"` from DPR tokenizer's Global variables

* remove `tokenizer_file` from herbert slow tokenizer init

* `"tokenizer_file"` from LED tokenizer's Global variables

* remove `tokenizer_file` from mbart slow tokenizer init

* remove `tokenizer_file` from slow tokenizer template

* adapt to versioning

* adapt the `test_tokenizer_mismatch_warning` test

* clean test

* clarify `VOCAB_FILES_NAMES` in tokenization_utils_fast.py

* Revert "remove `tokenizer_file` from mbart slow tokenizer init"

This reverts commit 0dbb723fa9c7599d4640fe30b3647a74eb4a64e1.

* Revert "`"tokenizer_file"` from LED tokenizer's Global variables"

This reverts commit 5a3f879bdd651233f3d74a3d1146c34cde82b0c2.

* Revert "remove `tokenizer_file` from herbert slow tokenizer init"

This reverts commit f5e10007b7b0ec5345e015b9de7ffec72c5407fd.

* Revert "remove `"tokenizer_file"` from DPR tokenizer's Global variables"

This reverts commit da0895330bedfafc81ae3073470a9348c669f032.

* set `tokenizer_file` in super `__init__` of mbart

7b8bdd86