- 02 Feb, 2022 5 commits
-
Nicolas Patry authored
* Adding support for `microphone` streaming within pipeline.
  - Uses `ffmpeg` to get microphone data.
  - Makes sure alignment is made to `size_of_sample`.
  - Works by sending `{"raw": ..data.., "stride": (n, left, right), "partial": bool}` directly to the pipeline, enabling streaming of partial results while still getting inference.
  - Lets `partial` information flow through the pipeline so the caller can get it back and choose whether or not to display the text.
  - The striding reconstitution is bound to have errors since CTC does not keep previous state; currently most of the errors come from not knowing whether there is a space between two chunks. Since we have some left-striding info, we could use that during decoding to choose what to do with those spaces, and maybe even with extra letters (if the stride is long enough, it is bound to cover at least a few symbols).
  Fixing tests. Protecting with `require_torch`. `raw_ctc` support for nicer demo. Post-rebase fixes. Revamp to split raw_mic_data from its live chunking; this requires a refactor to make everything a bit cleaner. Automatic resampling. Small fixes.
* Post-rebase fix (need to let super handle more logic, reorder args).
* Update docstrings
* Docstring format.
* Remove print.
* Prevent flow of `input_values`.
* Fixing `stride` too.
* Fixing the PR by removing `raw_ctc`.
* Better docstrings.
* Fixing init.
* Update src/transformers/pipelines/audio_utils.py
* Update tests/test_pipelines_automatic_speech_recognition.py
* Quality.

Co-authored-by: Anton Lozhkov <aglozhkov@gmail.com>
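A minimal sketch of the chunk format described above, fed to the ASR pipeline. The `raw`/`stride`/`partial` keys follow the commit message; the model choice, the explicit `sampling_rate` key, and the silent chunk are illustrative stand-ins, not part of this PR.

```python
import numpy as np
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="facebook/wav2vec2-base-960h")

sampling_rate = 16000
chunk = np.zeros(sampling_rate, dtype=np.float32)  # 1 s of silence as a stand-in for mic data

# One streamed chunk: `stride` is (n, left, right) in samples; `partial`
# marks a chunk that is still growing, so its text may be revised later.
out = asr({
    "raw": chunk,
    "sampling_rate": sampling_rate,
    "stride": (len(chunk), 0, 0),
    "partial": False,
})
print(out["text"])
```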
-
Patrick von Platen authored
-
NielsRogge authored
* Add torchvision's resize
* Rename torch_resize to default_to_square
* Apply suggestions from code review
* Add support for default_to_square and tuple of length 1
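A minimal sketch of the `default_to_square` semantics described above; `get_resize_output_size` is a hypothetical helper, not the library's implementation.

```python
from typing import Tuple, Union

def get_resize_output_size(
    height: int, width: int, size: Union[int, Tuple[int, ...]], default_to_square: bool = True
) -> Tuple[int, int]:
    """Hypothetical helper: compute the (height, width) to resize to."""
    if isinstance(size, (list, tuple)):
        if len(size) == 2:
            return tuple(size)
        size = size[0]  # a tuple of length 1 behaves like a bare int
    if default_to_square:
        return (size, size)  # int -> square output
    # otherwise match the shorter edge to `size`, keeping the aspect ratio
    short, long = (width, height) if width <= height else (height, width)
    new_short, new_long = size, int(size * long / short)
    return (new_long, new_short) if width <= height else (new_short, new_long)
```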
-
Steven Liu authored
* first draft of pipeline, autoclass, preprocess tutorials
* apply review feedback
* 🖍 apply feedback from patrick/niels
* 📝 add output image to preprocessed image
* 🖍 apply feedback from patrick
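A minimal example in the spirit of the pipeline tutorial; the task string is real pipeline API, while the input sentence is illustrative and the call downloads a default checkpoint.

```python
from transformers import pipeline

# Instantiating a pipeline by task name pulls a default model for that task.
classifier = pipeline("sentiment-analysis")
print(classifier("Technical editing is surprisingly fun."))
```
-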
Steven Liu authored
* add fine-tune tutorial
* make edits, fix style
* 📝 make edits
* 🖍 fix code format links to external libraries
* 🔄 revert code formatting
* 🖍 use DefaultDataCollator instead of DataCollatorWithPadding
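A minimal sketch of the collator swap mentioned above, assuming the features are already padded to equal length (which is what lets `DefaultDataCollator` replace `DataCollatorWithPadding`); the toy features are illustrative.

```python
from transformers import DefaultDataCollator

data_collator = DefaultDataCollator()  # simple batching; no tokenizer needed
batch = data_collator([
    {"input_ids": [101, 2023, 102], "label": 0},
    {"input_ids": [101, 2045, 102], "label": 1},
])
print(batch["input_ids"].shape)  # torch.Size([2, 3])
```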
-
- 01 Feb, 2022 11 commits
-
Sylvain Gugger authored
* Harder check for IndexErrors in QA scripts
* Make test stronger
-
Sylvain Gugger authored
-
Suraj Patil authored
* refactor bart tokenizers
* doc
* replace assert with ValueError
-
Yih-Dar authored
* use mean instead of elementwise_mean
* make style

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
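Assuming this refers to PyTorch's loss `reduction` argument, a one-line illustration: `"elementwise_mean"` is the long-deprecated spelling of `"mean"`.

```python
import torch

# "elementwise_mean" was deprecated in favor of "mean".
loss_fct = torch.nn.CrossEntropyLoss(reduction="mean")
```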
-
SaulLu authored
fix the `tokenizer_config.json` file for the slow tokenizer when a fast version is available (#15319)
* add new test
* update test
* remove `tokenizer_file` from `additional_files_names` in `tokenization_utils_base.py`
* add `tokenizer_file` for the fast only tokenizer
* change global variables layoutxml
* remove `"tokenizer_file"` from DPR tokenizer's Global variables
* remove `tokenizer_file` from herbert slow tokenizer init
* `"tokenizer_file"` from LED tokenizer's Global variables
* remove `tokenizer_file` from mbart slow tokenizer init
* remove `tokenizer_file` from slow tokenizer template
* adapt to versioning
* adapt the `test_tokenizer_mismatch_warning` test
* clean test
* clarify `VOCAB_FILES_NAMES` in tokenization_utils_fast.py
* Revert "remove `tokenizer_file` from mbart slow tokenizer init" (reverts commit 0dbb723fa9c7599d4640fe30b3647a74eb4a64e1)
* Revert "`"tokenizer_file"` from LED tokenizer's Global variables" (reverts commit 5a3f879bdd651233f3d74a3d1146c34cde82b0c2)
* Revert "remove `tokenizer_file` from herbert slow tokenizer init" (reverts commit f5e10007b7b0ec5345e015b9de7ffec72c5407fd)
* Revert "remove `"tokenizer_file"` from DPR tokenizer's Global variables" (reverts commit da0895330bedfafc81ae3073470a9348c669f032)
* set `tokenizer_file` in super `__init__` of mbart
-
SaulLu authored
* replace assert with exception for `padding_side` arg in `PreTrainedTokenizerBase` `__init__`
* add test
* fix kwargs
* reformat test
* format
* fix typo to render the documentation
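A minimal sketch of the assert-to-exception change described above; `validate_padding_side` is a hypothetical stand-in for the check performed in `PreTrainedTokenizerBase.__init__`.

```python
def validate_padding_side(padding_side: str) -> str:
    # Raise a user-facing error instead of an opaque AssertionError.
    if padding_side not in ("right", "left"):
        raise ValueError(
            f"Padding side should be selected between 'right' and 'left', got {padding_side!r}"
        )
    return padding_side
```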
-
Kamal Raj authored
fix typo
-
Suraj Patil authored
-
Yih-Dar authored
Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
-
Yih-Dar authored
* Fix TF Causal LM models' returned logits
* Fix expected shape in the tests

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
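A minimal sketch of the causal-LM convention behind this fix, with random stand-in tensors: the loss is computed on logits and labels shifted by one position, while the returned logits should keep the full sequence length.

```python
import tensorflow as tf

logits = tf.random.normal((1, 8, 100))                          # (batch, seq_len, vocab)
labels = tf.random.uniform((1, 8), maxval=100, dtype=tf.int32)  # (batch, seq_len)

# Shift only for the loss: position t predicts token t + 1.
loss = tf.keras.losses.sparse_categorical_crossentropy(
    labels[:, 1:], logits[:, :-1], from_logits=True
)
# The model output itself should expose the unshifted (1, 8, 100) logits.
```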
-
Yih-Dar authored
Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
-
- 31 Jan, 2022 24 commits
-
Stas Bekman authored
-
Suraj Patil authored
-
Sylvain Gugger authored
-
peregilk authored
* Update modeling_wav2vec2.py: with very tiny sound files (less than 0.1 seconds) the num_masked_span can be too long. The issue is described in issue #15366 and discussed with @patrickvonplaten.
* correct errors with mask time indices
* remove bogus file
* make fix-copies

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
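A minimal sketch of the bound motivating the fix, simplified from the span-mask sampling logic; the function name and the rounding are illustrative.

```python
def capped_num_masked_span(sequence_length: int, mask_prob: float, mask_length: int) -> int:
    num_masked_span = int(mask_prob * sequence_length / mask_length + 0.5)
    # On very tiny inputs the sampled count must be capped so that
    # num_masked_span * mask_length never exceeds the sequence length.
    return min(num_masked_span, sequence_length // mask_length)
```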
-
Tavin Turner authored
* Add 'with torch.no_grad()' to BEiT integration test forward pass
* Fix inconsistent use of tabs and spaces in indentation
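A minimal sketch of the test pattern above: an inference-only forward pass wrapped in `torch.no_grad()` so no autograd graph is built (the linear layer is a stand-in for BEiT).

```python
import torch

model = torch.nn.Linear(4, 2).eval()
inputs = torch.randn(1, 4)

with torch.no_grad():
    outputs = model(inputs)

assert not outputs.requires_grad  # no gradients were tracked
```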
-
Matt authored
* Fix spurious warning in TF TokenClassification models
* Fixing one last spurious warning
* Removing outdated warning altogether
-
Suraj Patil authored
* refactor roberta tokenizer
* refactor fast tokenizer
* remove old comment
-
Suraj Patil authored
-
Yih-Dar authored
* fix tf led
* fix
* Add test_pt_tf_model_equivalence_extra for TFLED
* add a (temporary) test

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
-
Suraj Patil authored
* add a section about GPUs
* Apply suggestions from code review

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
-
Patrick von Platen authored
* [Trainer] suppress warning for length-related columns
* improve message
* Update src/transformers/trainer.py
-
Sylvain Gugger authored
* Change REALM checkpoint to new ones
* Last checkpoint missing
-
Matt authored
-
Yih-Dar authored
* Fix loss calculation in TFFunnelForTokenClassification
* revert the change in TFFunnelForTokenClassification
* fix FunnelForTokenClassification loss
* fix other TokenClassification loss
* fix more
* add num_labels to ElectraForTokenClassification
* revert the change to research projects

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
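A minimal sketch of the standard token-classification loss these fixes converge on (PyTorch shown for brevity): logits are flattened with the correct `num_labels` before the cross-entropy.

```python
import torch
from torch.nn import CrossEntropyLoss

num_labels = 5
logits = torch.randn(2, 7, num_labels)         # (batch, seq_len, num_labels)
labels = torch.randint(0, num_labels, (2, 7))  # (batch, seq_len)

loss = CrossEntropyLoss()(logits.view(-1, num_labels), labels.view(-1))
```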
-
Stas Bekman authored
* [deepspeed doc] fix import, extra notes
* typo
-
NielsRogge authored
-
Sylvain Gugger authored
-
Ogundepo Odunayo authored
-
NielsRogge authored
* Fix Swin model outputs
* Rename pooler
-
Suraj Patil authored
-
Jonatas Grosman authored
-
Kamal Raj authored
fix typo
-
Julien Plu authored
* Add Luke training
* Fix true label tags
* Update the data collator for Luke
* Some training refactor for Luke
* Improve data collator for Luke
* Fix import
* Fix datasets concatenation
* Add the --max_entity_length argument for Luke models
* Remove unused code
* Fix style issues
* Move the Luke training into a separate folder
* Fix style
* Fix naming
* Fix filtering
* Fix filter
* Update some preprocessing
* Move luke to research_projects
* Checkstyle
* Address comments
* Fix style
-
François REMY authored
(This is an editorial change only)
-