Commits · b2e4b091f08f1aaf21855d588c6c8d284baba9eb · chenpangpang / transformers


29 Jul, 2022 1 commit

Replace `as_target` context managers by direct calls (#18325) · 986526a0

Sylvain Gugger authored Jul 29, 2022



* Preliminary work on tokenizers

* Quality + fix tests

* Treat processors

* Fix pad

* Remove all uses of  in tests, docs and examples

* Replace all as_target_tokenizer

* Fix tests

* Fix quality

* Update examples/flax/image-captioning/run_image_captioning_flax.py
Co-authored-by: amyeroberts <amy@huggingface.co>

* Style
Co-authored-by: amyeroberts <amy@huggingface.co>

986526a0

28 Jul, 2022 1 commit
- Fix codeparrot deduplication - ignore whitespaces (#18023) · 286a18fa
  Loubna Ben Allal authored Jul 28, 2022
```
* ignore whitespaces for hash

* reformat code

* Update README.md
```
  286a18fa
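
The idea named in this commit, hashing file content with whitespace ignored so that files differing only in formatting count as duplicates, can be sketched as follows. This is a minimal illustration and not the CodeParrot script itself: the function names `content_hash` and `drop_exact_duplicates` and the choice of SHA-256 are assumptions made for the example.

```python
import hashlib
import re

# Matches any run of whitespace (spaces, tabs, newlines).
WHITESPACE = re.compile(r"\s+")


def content_hash(text: str) -> str:
    """Hash a file's content after deleting all whitespace (hypothetical helper)."""
    normalized = WHITESPACE.sub("", text)
    # SHA-256 is an assumption; any stable hash works for this sketch.
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def drop_exact_duplicates(files):
    """Keep the first file seen for each whitespace-insensitive hash."""
    seen = set()
    unique = []
    for text in files:
        digest = content_hash(text)
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique
```

With this normalization, `"x = 1"` and `"x=1"` produce the same digest, so only the first of the two is kept.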
27 Jul, 2022 2 commits

Remove all uses of six (#18318) · cf32b2ee
Sylvain Gugger authored Jul 27, 2022
```
* Remove all uses of six

* fix quality
```
cf32b2ee

Update CodeParrot readme to include training in Megatron (#17798) · 1d71ad89

Loubna Ben Allal authored Jul 27, 2022



* add info about megatron training

* upload models and datasets from CodeParrot organization

* upload models and datasets from CodeParrot organization

* Update examples/research_projects/codeparrot/README.md
Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>

* Update examples/research_projects/codeparrot/README.md
Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>

* Update examples/research_projects/codeparrot/README.md
Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>

* Update examples/research_projects/codeparrot/README.md
Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>

* Update examples/research_projects/codeparrot/README.md
Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>

* fix typo and add comment about codeparrot vs megatron
Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>

1d71ad89

11 Jul, 2022 2 commits

Fix RESOURCE_EXHAUSTED error when dealing with large datasets in Flax example scripts (#18069) · 1e8140ca

Duong A. Nguyen authored Jul 11, 2022

* Fix RESOURCE_EXHAUSTED error for large datasets on Flax example scripts

* using np.permutation for creating batch_idx

* train_samples_idx -> training_samples_idx

* fix type hints

1e8140ca
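
One way to read the "batch_idx" bullet above is that the loader shuffles a list of sample indices once and then hands out one batch of indices at a time, so only a single batch needs to be materialized per step. The sketch below illustrates that pattern under stated assumptions: it uses the standard-library `random` module in place of NumPy's permutation, and `batch_indices` is a hypothetical name, not a function from the Flax example scripts.

```python
import random
from typing import Iterator, List


def batch_indices(num_samples: int, batch_size: int, seed: int = 0) -> Iterator[List[int]]:
    """Yield shuffled index batches one at a time (hypothetical helper).

    The last partial batch is dropped so every batch has the same size.
    """
    indices = list(range(num_samples))
    # Stand-in for a NumPy permutation of the sample indices.
    random.Random(seed).shuffle(indices)
    steps = num_samples // batch_size
    for step in range(steps):
        yield indices[step * batch_size:(step + 1) * batch_size]
```

A training loop would then look up only the rows named by each yielded batch, instead of building every batch up front.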

Fix some typos. (#17560) · 95113d13

Yulv-git authored Jul 11, 2022



* Fix some typos.
Signed-off-by: Yulv-git <yulvchi@qq.com>

* Fix typo.
Signed-off-by: Yulv-git <yulvchi@qq.com>

* make fixup.

95113d13

29 Jun, 2022 1 commit
- Fix all is_torch_tpu_available issues (#17936) · 7c4c6f60
  Zachary Mueller authored Jun 29, 2022
```
* Fix all is_torch_tpu_available 
```
  7c4c6f60
22 Jun, 2022 2 commits

Bump numpy from 1.21.0 to 1.22.0 in /examples/research_projects/lxmert (#17817) · c366ce10

dependabot[bot] authored Jun 22, 2022

Bumps [numpy](https://github.com/numpy/numpy) from 1.21.0 to 1.22.0.
- [Release notes](https://github.com/numpy/numpy/releases)
- [Changelog](https://github.com/numpy/numpy/blob/main/doc/HOWTO_RELEASE.rst)
- [Commits](https://github.com/numpy/numpy/compare/v1.21.0...v1.22.0)

---
updated-dependencies:
- dependency-name: numpy
  dependency-type: direct:production
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

c366ce10

Bump numpy in /examples/research_projects/visual_bert (#17816) · af0d21e7

dependabot[bot] authored Jun 22, 2022

Bumps [numpy](https://github.com/numpy/numpy) from 1.21.0 to 1.22.0.
- [Release notes](https://github.com/numpy/numpy/releases)
- [Changelog](https://github.com/numpy/numpy/blob/main/doc/HOWTO_RELEASE.rst)
- [Commits](https://github.com/numpy/numpy/compare/v1.21.0...v1.22.0)

---
updated-dependencies:
- dependency-name: numpy
  dependency-type: direct:production
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

af0d21e7

21 Jun, 2022 1 commit

[CodeParrot] Near-deduplication with jaccard similarity (#17054) · da2bd2ae

Jia LI authored Jun 21, 2022



* deduplication draft

* update style

* update style test

* dummy test main

* rename modules

* rename functions

* return extremes in deduplicate_clusters

* update style

* cast str for gzip

* update doc string

* time processing

* use dataset map to compute minhash

* fill value for short token

* remove da map method

* update style

* use share object to multiprocess

* update style

* use f-string and minor fix
Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>
Co-authored-by: Loubna Ben Allal <44069155+loubnabnl@users.noreply.github.com>

* update style

* use module parameters

* change ds_dedup to ds_filter

* save ds_dedup

* mv test to script tests

* make jaccard threshold a parameter of deduplicate_dataset

* update style

* add doc strings

* update style

* add doc string for DuplicationIndex

* save files into data dir

* update readme

* Update examples/research_projects/codeparrot/README.md
Co-authored-by: Loubna Ben Allal <44069155+loubnabnl@users.noreply.github.com>

* make near deduplication optional

* move near deduplication in README

* Update examples/research_projects/codeparrot/README.md
Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>

* use f string
Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>
Co-authored-by: Loubna Ben Allal <44069155+loubnabnl@users.noreply.github.com>

da2bd2ae
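
For readers unfamiliar with the measure named in this commit, Jaccard similarity is the size of the intersection of two sets divided by the size of their union. The sketch below applies it to the token sets of two code snippets. It is an illustration only: it compares the sets exactly rather than through the MinHash estimate the commit messages mention, and the names `tokens`, `jaccard_similarity`, `is_near_duplicate` and the default threshold of 0.85 are assumptions for the example.

```python
import re

# Split on anything that is not a letter, digit, or underscore.
NON_ALPHA = re.compile(r"[^A-Za-z_0-9]")


def tokens(code: str) -> set:
    """Return the set of identifier-like tokens in a snippet."""
    return {t for t in NON_ALPHA.split(code) if t}


def jaccard_similarity(code_a: str, code_b: str) -> float:
    """|A ∩ B| / |A ∪ B| over the two token sets."""
    a, b = tokens(code_a), tokens(code_b)
    if not a and not b:
        return 1.0  # two empty snippets are treated as identical
    return len(a & b) / len(a | b)


def is_near_duplicate(code_a: str, code_b: str, threshold: float = 0.85) -> bool:
    """Flag a pair whose similarity reaches the (assumed) threshold."""
    return jaccard_similarity(code_a, code_b) >= threshold
```

Because the comparison is over sets, token order and repetition do not affect the score.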

17 Jun, 2022 2 commits

Bump notebook in /examples/research_projects/lxmert (#17743) · e44a569f

dependabot[bot] authored Jun 17, 2022

Bumps [notebook](http://jupyter.org) from 6.4.10 to 6.4.12.

---
updated-dependencies:
- dependency-name: notebook
  dependency-type: direct:production
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

e44a569f

Bump notebook in /examples/research_projects/visual_bert (#17742) · 5089a2d4

dependabot[bot] authored Jun 17, 2022

Bumps [notebook](http://jupyter.org) from 6.4.10 to 6.4.12.

---
updated-dependencies:
- dependency-name: notebook
  dependency-type: direct:production
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

5089a2d4

14 Jun, 2022 1 commit

Rag end2end new (#17650) · 9068fa6c

Shamane Siri authored Jun 15, 2022

* check

* update the RAG-end2end with new PL and RAY

* removed unwanted comments

9068fa6c

10 Jun, 2022 3 commits

update README.md (#17657) · 3114df41
Loubna Ben Allal authored Jun 10, 2022
```
- use CodeParrot scores of v1.1
- change evaluation command to use accelerate
```
3114df41

🐛 Properly raise `RepoNotFoundError` when not authenticated (#17651) · c99ddcc4

Simon Brandeis authored Jun 10, 2022

* Raise RepoNotFoundError in case of 401

* Include changes from revert-17646-skip_repo_not_found

* Add a comment

* 💄 Code quality

* 💚 Update `get_from_cache` test

* 💚 Code quality & skip failing test

c99ddcc4

Bump cookiecutter in /examples/research_projects/decision_transformer (#17645) · 1d463303

dependabot[bot] authored Jun 10, 2022

Bumps [cookiecutter](https://github.com/cookiecutter/cookiecutter) from 1.7.2 to 2.1.1.
- [Release notes](https://github.com/cookiecutter/cookiecutter/releases)
- [Changelog](https://github.com/cookiecutter/cookiecutter/blob/master/HISTORY.md)
- [Commits](https://github.com/cookiecutter/cookiecutter/compare/1.7.2...2.1.1)

---
updated-dependencies:
- dependency-name: cookiecutter
  dependency-type: direct:production
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

1d463303

24 May, 2022 2 commits

Bump tensorflow in /examples/research_projects/decision_transformer (#17400) · 1ef9a1ed

dependabot[bot] authored May 24, 2022

Bumps [tensorflow](https://github.com/tensorflow/tensorflow) from 2.8.0 to 2.8.1.
- [Release notes](https://github.com/tensorflow/tensorflow/releases)
- [Changelog](https://github.com/tensorflow/tensorflow/blob/master/RELEASE.md)
- [Commits](https://github.com/tensorflow/tensorflow/compare/v2.8.0...v2.8.1)

---
updated-dependencies:
- dependency-name: tensorflow
  dependency-type: direct:production
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

1ef9a1ed

Add LayoutLMv3 (#17060) · 31ee80d5

NielsRogge authored May 24, 2022



* Make forward pass work

* More improvements

* Remove unused imports

* Remove timm dependency

* Improve loss calculation of token classifier

* Fix most tests

* Add docs

* Add model integration test

* Make all tests pass

* Add LayoutLMv3FeatureExtractor

* Improve integration test + make fixup

* Add example script

* Fix style

* Add LayoutLMv3Processor

* Fix style

* Add option to add visual labels

* Make more tokenizer tests pass

* Fix more tests

* Make more tests pass

* Fix bug and improve docs

* Fix import of processors

* Improve docstrings

* Fix toctree and improve docs

* Fix auto tokenizer

* Move tests to model folder

* Move tests to model folder

* change default behavior add_prefix_space

* add prefix space for fast

* add_prefix_space set to True for Fast

* no space before `unique_no_split` token

* add test to highlight special treatment of added tokens

* fix `test_batch_encode_dynamic_overflowing` by building a long enough example

* fix `test_full_tokenizer` with add_prefix_token

* Fix tokenizer integration test

* Make the code more readable

* Add tests for LayoutLMv3Processor

* Fix style

* Add model to README and update init

* Apply suggestions from code review

* Replace asserts by value errors

* Add suggestion by @ducviet00

* Add model to doc tests

* Simplify script

* Improve README

* a step ahead to fix

* Update pair_input_test

* Make all tokenizer tests pass - phew

* Make style

* Add LayoutLMv3 to CI job

* Fix auto mapping

* Fix CI job name

* Make all processor tests pass

* Make tests of LayoutLMv2 and LayoutXLM consistent

* Add copied from statements to fast tokenizer

* Add copied from statements to slow tokenizer

* Remove add_visual_labels attribute

* Fix tests

* Add link to notebooks

* Improve docs of LayoutLMv3Processor

* Fix reference to section
Co-authored-by: SaulLu <lucilesaul.com@gmail.com>
Co-authored-by: Niels Rogge <nielsrogge@Nielss-MacBook-Pro.local>

31ee80d5

23 May, 2022 1 commit

Fix CodeParrot training script (#17291) · b48ac1a0

Loubna Ben Allal authored May 23, 2022



* average loss over batches and accumulated steps for tracking

* fix layernorm weight decay

* use AdamW from Pytorch instead of Transformers

* add shuffling of sequences inside the batches

* add shuffling of sequences inside the batches

* add logging dir and reformat code

* fix lr tracking

* remove Mistral scaling

* keep Mistral scaling

* reformat code

* fix error

* fix error

* use shuffling function from Pytorch

* remove argument for shuffling batch sequences as it isn't optional

* update package versions and install accelerate from source

* remove unused package

* Update loss average over accumulated steps
Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>

* Update loss average over accumulated steps
Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>

* use one shuffle buffer argument

* compute avg_loss in one line
Co-authored-by: Loubna ben allal <loubnabenallal@gmail.com>
Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>

b48ac1a0
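
The first bullet of this commit concerns how the tracked loss is averaged when gradients are accumulated over several micro-batches. A minimal sketch of that bookkeeping, assuming each optimizer step should log the mean loss of its micro-batches, is shown below; `average_losses` is a hypothetical helper and does not reproduce the training script.

```python
from typing import List, Sequence


def average_losses(micro_batch_losses: Sequence[float], accumulation_steps: int) -> List[float]:
    """Return one averaged loss per optimizer step (hypothetical helper).

    Each micro-batch loss is scaled by 1 / accumulation_steps, so the running
    sum at the end of a group equals the mean of that group. Trailing
    micro-batches that do not fill a whole group are ignored in this sketch.
    """
    averaged = []
    running = 0.0
    for i, loss in enumerate(micro_batch_losses, start=1):
        running += loss / accumulation_steps
        if i % accumulation_steps == 0:
            averaged.append(running)
            running = 0.0
    return averaged
```

For example, with two accumulation steps the losses 1.0 and 3.0 are logged as a single value of 2.0.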

19 May, 2022 1 commit
- Fix bug in Wav2Vec2 pretrain example (#17326) · 48c22691
  ddobokki authored May 20, 2022
  
  48c22691
18 May, 2022 2 commits

Fix style · 47107028
Sylvain Gugger authored May 18, 2022

47107028

Add Information Gain Filtration algorithm (#16953) · 5fdb54ec

mraunak authored May 18, 2022



* Add information gain filtration algorithm

* Complying with black requirements

* Added author

* Fixed import order

* flake8 corrections
Co-authored-by: Javier Turek <javier.turek@intel.com>

5fdb54ec

16 May, 2022 3 commits

CodeParrot data pretokenization (#16932) · 05a90579

Loubna Ben Allal authored May 16, 2022



* add pretokenization arguments

* add pretokenization script

* add support for pretokenized data

* reformat code

* fix run command for training

* fix model call from config

* remove a package

* add comments on pretokenization in the readme

* remove explicit parallelization
Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>

* update readme
Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>

* update readme -remove username
Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>

* update readme -remove username
Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>

* keep data parallelization

* reformat code

* reformat code

* update readme

* reformat code

* Update examples/research_projects/codeparrot/README.md
Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>
Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>
Co-authored-by: Loubna ben allal <loubnabenallal@gmail.com>

05a90579

Update codeparrot data preprocessing (#16944) · e730e125

Loubna Ben Allal authored May 16, 2022



* add new preprocessing arguments

* add new filters

* add new filters to readme

* fix config and test count, update function names and docstrings

* reformat code

* update readme

* Update readme

* rename config_test filter
Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>

* rename few_assignments filter
Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>

* rename tokenizer in arguments
Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>

* rename functions and add limit_line argument for config_test filter

* update threshold for config_test filter
Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>
Co-authored-by: Loubna ben allal <loubnabenallal@gmail.com>

e730e125

fixed bug in run_mlm_flax_stream.py (#17203) · 71d18d08

Kenneth Enevoldsen authored May 16, 2022



* fixed bug run_mlm_flax_stream.py

Fixed a bug caused by a change in recent transformers versions (between `4.6.2` and `4.18.0`) that added extra keys to the tokenizer output.

* Update run_mlm_flax_stream.py

* adding missing parenthesis

* formatted to black

* remove cols from dataset instead

* reformat to black

* moved rem. columns to map

* formatted to black
Co-authored-by: KennethEnevoldsen <kennethcenevolsen@gmail.com>

71d18d08

12 May, 2022 1 commit

Black preview (#17217) · afe5d42d

Sylvain Gugger authored May 12, 2022

* Black preview

* Fixup too!

* Fix check copies

* Use the same version as the CI

* Bump black

afe5d42d

09 May, 2022 1 commit
- Fix all docs for accelerate install directions (#17145) · d719bcd4
  Zachary Mueller authored May 09, 2022
  
  d719bcd4
04 May, 2022 3 commits

Bump notebook from 6.4.1 to 6.4.10 in /examples/research_projects/lxmert (#16634) · 2bf95e2b

dependabot[bot] authored May 04, 2022

Bumps [notebook](http://jupyter.org) from 6.4.1 to 6.4.10.

---
updated-dependencies:
- dependency-name: notebook
  dependency-type: direct:production
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

2bf95e2b

Bump notebook in /examples/research_projects/visual_bert (#16635) · 7a229ef4

dependabot[bot] authored May 04, 2022

Bumps [notebook](http://jupyter.org) from 6.4.1 to 6.4.10.

---
updated-dependencies:
- dependency-name: notebook
  dependency-type: direct:production
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

7a229ef4

Fix hashing for deduplication (#17048) · db034660
Thomas Wang authored May 04, 2022

db034660

03 May, 2022 1 commit
- Remove device parameter from create_extended_attention_mask_for_decoder (#16894) · 39f8eafc
  Pavel Belevich authored May 03, 2022
  
  39f8eafc
28 Apr, 2022 1 commit
- Add parameter --config_overrides for run_mlm_wwm.py (#16961) · 1be8d56e
  conan1024hao authored Apr 28, 2022
```
* Add parameter --config_overrides for run_mlm_wwm.py

* linter
```
  1be8d56e
27 Apr, 2022 1 commit
- [Research] Speed up evaluation for XTREME-S (#16785) · a4a88fa0
  Anton Lozhkov authored Apr 27, 2022
```
* Avoid repeated per-lang filtering

* Language groups and logits preprocessing

* Style
```
  a4a88fa0
25 Apr, 2022 2 commits
- Fix issue probably-meant-fstring found at https://codereview.doctor (#16913) · 65687520
  code-review-doctor authored Apr 25, 2022
  
  65687520
- Replace deprecated logger.warn with warning (#16876) · fea94d67
  Sanchit Gandhi authored Apr 25, 2022
  
  fea94d67
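
As background for the `logger.warn` replacement above: in Python's standard `logging` module, `Logger.warn` is a deprecated alias of `Logger.warning`, so the change is a rename with no difference in the emitted message. The sketch below shows the preferred call; the logger name `example_script` and the helper `build_logger` are made up for the illustration.

```python
import io
import logging


def build_logger(stream) -> logging.Logger:
    """Create a logger that writes bare messages to the given stream."""
    logger = logging.getLogger("example_script")  # hypothetical logger name
    logger.setLevel(logging.WARNING)
    logger.handlers.clear()
    logger.addHandler(logging.StreamHandler(stream))
    logger.propagate = False  # keep output out of the root logger
    return logger


stream = io.StringIO()
logger = build_logger(stream)

# Preferred spelling; logger.warn(...) is the deprecated alias.
logger.warning("Using %d devices", 2)
```

After this runs, `stream` holds the single line `Using 2 devices`.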
21 Apr, 2022 1 commit

New features for CodeParrot training script (#16851) · d9184131

Loubna Ben Allal authored Apr 21, 2022



* add tflops logging and fix grad accumulation

* add accelerate tracking and checkpointing

* scale loss of last batch correctly

* fix typo

* compress loss computation
Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>

* add resume from checkpoint argument

* add load_state accelerate from checkpoint, register lr scheduler and add tflops function

* reformat code

* reformat code

* add condition on path for resume checkpoint

* combine if conditions
Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>

* add source for tflops formula
Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>

d9184131

19 Apr, 2022 1 commit

[Flax] improve large model init and loading (#16148) · d3bd9ac7

Suraj Patil authored Apr 19, 2022



* begin do_init

* add params_shape_tree

* raise error if params are accessed when do_init is False

* don't allow do_init=False when keys are missing

* make shape tree a property

* assign self._params at the end

* add test for do_init

* add do_init arg to all flax models

* fix param setting

* disable do_init for composite models

* update test

* add do_init in FlaxBigBirdForMultipleChoice

* better names and errors

* improve test

* style

* add a warning when do_init=False

* remove extra if

* set params after _required_params

* add test for from_pretrained

* do_init => _do_init

* change warning to info

* fix typo

* add params in init_weights

* add params to gpt neo init

* add params to init_weights

* update do_init test

* Trigger CI

* Apply suggestions from code review
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* update template

* trigger CI

* style

* style

* fix template
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

d3bd9ac7

13 Apr, 2022 1 commit

Add self training code for text classification (#16738) · 34ef029d

Tu Vu authored Apr 13, 2022

* Add self-training code for text-classification

* Add self-training code for text-classification

* Add self-training code for text-classification

* Add self-training code for text-classification

* Add self-training code for text-classification

* Delete strata

34ef029d

12 Apr, 2022 1 commit

Qdqbert example add benchmark script with ORT-TRT (#16592) · 14daa610

Shang Zhang authored Apr 12, 2022

* add ort-trt benchmark script

* Update README.md

* ort version can be newer

* formatting

* specify ORT version

14daa610

11 Apr, 2022 1 commit

Fix example logs repeating themselves (#16669) · 69233cf0

Zachary Mueller authored Apr 11, 2022

Move the declaration of log streams before the tests so that results do not accumulate across tests

69233cf0