1. 10 May, 2023 1 commit
  2. 24 Apr, 2023 1 commit
  3. 03 Apr, 2023 1 commit
    • Fix llama tokenizer (#22402) · c0f99b4d
      Arthur authored
      * draft
      
      * update tokenization llama and conversion script
      
      * more updates
      
      * initial commit
      
      * style
      
      * default pad to None
      
      * draft tokenization tests
      
      * update test
      
      * update tokenization tests
      
      * nits
      
      * update
      
      * versioning test
      
      * major fix
      
      * fix more tests
      
      * finish fixing special masks
      
      * last nit
      
      * more nits
      
      * add encode decode tests
      
      * add more
      
      * fix token type ids
      
      * style
  4. 29 Mar, 2023 1 commit
  5. 09 Mar, 2023 1 commit
  6. 07 Feb, 2023 1 commit
    • Cleanup quality (#21493) · 67d07487
      Sylvain Gugger authored
      * Remove mentions of flake8/isort
      
      * Clean up inits
      
      * Deal with all other inits
      
      * Last special rule for dummy files
  7. 06 Feb, 2023 1 commit
    • Update quality tooling for formatting (#21480) · 6f79d264
      Sylvain Gugger authored
      * Result of black 23.1
      
      * Update target to Python 3.7
      
      * Switch flake8 to ruff
      
      * Configure isort
      
      * Configure isort
      
      * Apply isort with line limit
      
      * Put the right black version
      
      * adapt black in check copies
      
      * Fix copies
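The tooling switch in this commit (black 23.1, ruff replacing flake8, isort with a line limit) amounts to a few lines of packaging configuration. A hedged sketch of what such a `pyproject.toml` fragment can look like; the values here are illustrative, not the repo's exact settings:

```toml
[tool.black]
line-length = 119
target-version = ["py37"]

[tool.ruff]
# ruff takes over the lint checks previously run through flake8
line-length = 119

[tool.isort]
profile = "black"
line_length = 119
```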
  8. 02 Nov, 2022 1 commit
    • 🚨 🚨 🚨 Fix Issue 15003: SentencePiece Tokenizers Not Adding Special Tokens in... · 9f9ddcc2
      Ben Eyal authored
      🚨 🚨 🚨 Fix Issue 15003: SentencePiece Tokenizers Not Adding Special Tokens in `convert_tokens_to_string` (#15775)
      
      * Add test for SentencePiece not adding special tokens to strings
      
      * Add SentencePieceStringConversionMixin to fix issue 15003
      
      * Fix conversion from tokens to string for most SentencePiece tokenizers
      
      Tokenizers fixed:
      - AlbertTokenizer
      - BarthezTokenizer
      - CamembertTokenizer
      - FNetTokenizer
      - M2M100Tokenizer
      - MBart50Tokenizer
      - PegasusTokenizer
      - Speech2TextTokenizer
      
      * Fix MarianTokenizer, adjust SentencePiece test to accommodate vocab
      
      * Fix DebertaV2Tokenizer
      
      * Ignore LayoutXLMTokenizer in SentencePiece string conversion test
      
      * Run 'make style' and 'make quality'
      
      * Clean convert_tokens_to_string test
      
      Instead of explicitly ignoring LayoutXLMTokenizer in the test,
      override the test in LayoutLMTokenizationTest and do nothing in it.
      
      * Remove commented out code
      
      * Improve robustness of convert_tokens_to_string test
      
      Instead of comparing lengths of re-tokenized text and input_ids,
      check that converting all special tokens to string yields a string
      with all special tokens.
      
      * Inline and remove SentencePieceStringConversionMixin
      
      The convert_tokens_to_string method is now implemented
      in each relevant SentencePiece tokenizer.
      
      * Run 'make style' and 'make quality'
      
      * Revert removal of space in convert_tokens_to_string
      
      * Remove redundant import
      
      * Revert test text to original
      
      * Uncomment the lowercasing of the reverse_text variable
      
      * Mimic Rust tokenizer behavior for tokenizers
      
      - Albert
      - Barthez
      - Camembert
      - MBart50
      - T5
      
      * Fix accidentally skipping test in wrong tokenizer
      
      * Add test for equivalent Rust and slow tokenizer behavior
      
      * Override _decode in BigBirdTokenizer to mimic Rust behavior
      
      * Override _decode in FNetTokenizer to mimic Rust behavior
      
      * Override _decode in XLNetTokenizer to mimic Rust behavior
      
      * Remove unused 're' import
      
      * Update DebertaV2Tokenizer to mimic Rust tokenizer
      
      * Deberta tokenizer now behaves like Albert and its `convert_tokens_to_string` is not tested.
      
      * Ignore problematic tests in Deberta V2
      
      * Add comment on why the Deberta V2 tests are skipped
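The underlying issue is easy to state: a SentencePiece model only knows its own vocabulary, so decoding a token list naively silently drops special tokens such as `[CLS]` and `[SEP]`. A minimal, self-contained sketch of the splice-back pattern the fix applies; all names here are toy stand-ins, not the transformers implementation:

```python
# Toy stand-ins for a SentencePiece-backed tokenizer (illustrative only).
SPECIAL_TOKENS = {"[CLS]", "[SEP]", "[MASK]"}

def sp_decode(tokens):
    # stand-in for SentencePieceProcessor decoding: tokens outside the
    # model's vocabulary (here: all special tokens) are simply ignored,
    # and the "\u2581" word-boundary marker becomes a space
    return "".join(
        t for t in tokens if t not in SPECIAL_TOKENS
    ).replace("\u2581", " ").strip()

def convert_tokens_to_string(tokens):
    # the fixed pattern: decode runs of ordinary tokens with the
    # SentencePiece model, and splice special tokens back in verbatim
    out, current = [], []
    for tok in tokens:
        if tok in SPECIAL_TOKENS:
            if current:
                out.append(sp_decode(current))
                current = []
            out.append(tok)
        else:
            current.append(tok)
    if current:
        out.append(sp_decode(current))
    return " ".join(out)

tokens = ["[CLS]", "\u2581hello", "\u2581world", "[SEP]"]
print(sp_decode(tokens))                 # -> "hello world" (special tokens lost)
print(convert_tokens_to_string(tokens))  # -> "[CLS] hello world [SEP]"
```

The same splice logic is what lets the slow tokenizers match the Rust tokenizers' decode output in the tests above.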
  9. 25 Oct, 2022 1 commit
  10. 14 Oct, 2022 1 commit
  11. 27 Sep, 2022 1 commit
  12. 16 Sep, 2022 2 commits
  13. 15 Sep, 2022 1 commit
  14. 29 Aug, 2022 1 commit
  15. 24 Aug, 2022 1 commit
    • add warning to let the user know that the `__call__` method is faster than... · 6667b0d7
      SaulLu authored
      add warning to let the user know that the `__call__` method is faster than `encode` + `pad` for a fast tokenizer (#18693)
      
      * add warning to let the user know that the `encode` + `pad` method is slower than `__call__` for a fast tokenizer
      
      * user warnings
      
      * fix layoutlmv2
      
      * fix layout*
      
      * change warnings into logger.warning
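The behavioral point behind this warning: with a fast tokenizer, one batched `__call__` tokenizes and pads in a single pass, while `encode` followed by `pad` round-trips through Python per example. A toy illustration of the warn-and-nudge pattern; the class and its "tokenization" are stand-ins, not the transformers API:

```python
import logging

logger = logging.getLogger("tokenization_sketch")

class FastTokenizerSketch:
    # toy stand-in illustrating the advice added in #18693
    def __call__(self, texts, padding=False):
        # the recommended path: one call handles the whole batch
        encoded = [self.encode(t) for t in texts]
        return self._pad(encoded) if padding else encoded

    def encode(self, text):
        # placeholder "tokenization": one id per character
        return [ord(c) for c in text]

    def pad(self, encoded):
        # users calling encode() + pad() by hand get nudged toward
        # the single __call__, mirroring the logger.warning in the PR
        logger.warning(
            "You are calling encode() followed by pad(); for a fast "
            "tokenizer, calling the tokenizer on the whole batch "
            "(e.g. tokenizer(texts, padding=True)) is faster."
        )
        return self._pad(encoded)

    def _pad(self, encoded):
        width = max(len(ids) for ids in encoded)
        return [ids + [0] * (width - len(ids)) for ids in encoded]
```

Usage: `FastTokenizerSketch()(["ab", "a"], padding=True)` pads silently, while the manual `pad()` path logs the warning first.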
  16. 05 Aug, 2022 1 commit
    • Use new huggingface_hub tools for download models (#18438) · 5cd40323
      Sylvain Gugger authored
      * Draft new cached_file
      
      * Initial draft for config and model
      
      * Small fixes
      
      * Fix first batch of tests
      
      * Look in cache when internet is down
      
      * Fix last tests
      
      * Bad black, not fixing all quality errors
      
      * Make diff less
      
      * Implement change for TF and Flax models
      
      * Add tokenizer and feature extractor
      
      * For compatibility with main
      
      * Add utils to move the cache and auto-do it at first use.
      
      * Quality
      
      * Deal with empty commit shas
      
      * Deal with empty etag
      
      * Address review comments
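The "Look in cache when internet is down" bullet is the interesting design point: resolve from the network when possible, fall back to the local cache on connection errors, and raise only when neither source has the file. A minimal sketch of that fallback logic; `cached_file`, `fetch`, and the dict cache are hypothetical stand-ins, not the `huggingface_hub` API:

```python
def cached_file(filename, fetch, cache):
    """Return the file's content, preferring a fresh download.

    `fetch` is any callable that downloads and may raise OSError;
    `cache` is a dict-like local store (both hypothetical stand-ins).
    """
    try:
        content = fetch(filename)
    except OSError:
        # offline or the hub is unreachable: serve the cached copy, if any
        if filename in cache:
            return cache[filename]
        raise
    cache[filename] = content  # refresh the cache on a successful download
    return content
```

The same shape explains the "Deal with empty commit shas / empty etag" bullets: cache keys must stay valid even when the remote metadata used to build them is missing.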
  17. 01 Aug, 2022 1 commit
  18. 11 Jul, 2022 1 commit
  19. 23 Jun, 2022 1 commit
  20. 21 Jun, 2022 1 commit
  21. 31 May, 2022 1 commit
  22. 12 May, 2022 1 commit
  23. 13 Apr, 2022 1 commit
  24. 04 Apr, 2022 1 commit
  25. 23 Mar, 2022 1 commit
  26. 15 Feb, 2022 1 commit
  27. 02 Feb, 2022 2 commits
  28. 01 Feb, 2022 2 commits
    • fix the `tokenizer_config.json` file for the slow tokenizer when a fast... · 7b8bdd86
      SaulLu authored
      fix the `tokenizer_config.json` file for the slow tokenizer when a fast version is available (#15319)
      
      * add new test
      
      * update test
      
      * remove `tokenizer_file` from `additional_files_names` in `tokenization_utils_base.py`
      
      * add `tokenizer_file` for the fast only tokenizer
      
      * change global variables layoutxlm
      
      * remove `"tokenizer_file"` from DPR tokenizer's Global variables
      
      * remove `tokenizer_file` from herbert slow tokenizer init
      
      * `"tokenizer_file"` from LED tokenizer's Global variables
      
      * remove `tokenizer_file` from mbart slow tokenizer init
      
      * remove `tokenizer_file` from slow tokenizer template
      
      * adapt to versioning
      
      * adapt the `test_tokenizer_mismatch_warning` test
      
      * clean test
      
      * clarify `VOCAB_FILES_NAMES` in tokenization_utils_fast.py
      
      * Revert "remove `tokenizer_file` from mbart slow tokenizer init"
      
      This reverts commit 0dbb723fa9c7599d4640fe30b3647a74eb4a64e1.
      
      * Revert "`"tokenizer_file"` from LED tokenizer's Global variables"
      
      This reverts commit 5a3f879bdd651233f3d74a3d1146c34cde82b0c2.
      
      * Revert "remove `tokenizer_file` from herbert slow tokenizer init"
      
      This reverts commit f5e10007b7b0ec5345e015b9de7ffec72c5407fd.
      
      * Revert "remove `"tokenizer_file"` from DPR tokenizer's Global variables"
      
      This reverts commit da0895330bedfafc81ae3073470a9348c669f032.
      
      * set `tokenizer_file` in super `__init__` of mbart
    • replace assert with exception for padding_side arg in `PreTrainedTokenizerBase` `__init__` (#15454) · 6d585fe0
      SaulLu authored
      * replace assert with exception for `padding_side` arg in `PreTrainedTokenizerBase` `__init__`
      
      * add test
      
      * fix kwargs
      
      * reformat test
      
      * format
      
      * format
      
      * fix typo to render the documentation
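The motivation for this kind of change is that `assert` statements are stripped when Python runs with `-O`, so an invalid argument would slip through silently; a real `ValueError` survives optimization and gives the user an actionable message. A hedged sketch of the pattern as a free function, not the actual `__init__` code:

```python
def check_padding_side(padding_side):
    # before the fix (roughly): assert padding_side in ["right", "left"]
    # after: an explicit exception that survives `python -O`
    if padding_side not in ("right", "left"):
        raise ValueError(
            "Padding side should be selected between 'right' and 'left', "
            f"current value: {padding_side}"
        )
    return padding_side
```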
  29. 27 Jan, 2022 1 commit
    • improve saving strategy of sentencepiece tokenizer (#15328) · ade7371a
      SaulLu authored
      
      
      * add new test
      
      * add a feature to save the sentencepiece tokenizer model when the init file was deleted
      
      * update marian
      
      * update m2m_100
      
      * fix marian
      
      * update speech to text
      
      * override test for layoutxlm
      
      * fix saving bartpho
      
      * remove hardcoded values bartpho
      
      * special token string version
      
      * finish bartpho
      
      * override layoutxlm test
      
      * add mbart
      
      * move special tokens list
      
      * format
      
      * Revert "format"
      
      This reverts commit 37a40df37903a932c2f951cbd33acb684246bae7.
      
      * simplify list of string of special tokens
      
      * Re-write `self.fairseq_tokens_to_ids ` initialization logic with special tokens
      Co-authored-by: Sylvain Gugger <sylvain.gugger@gmail.com>
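The saving-strategy improvement can be pictured as a two-branch save: copy the original SentencePiece `.model` file when it is still on disk, otherwise rebuild it from the serialized model bytes held in memory (which is what "save the tokenizer model when the init file was deleted" amounts to). A self-contained sketch under that assumption; the file name and helper are illustrative:

```python
import os
import shutil

def save_sentencepiece_model(save_directory, vocab_file, serialized_proto):
    # copy the original file when it still exists; if it was deleted
    # (e.g. it lived in a temporary directory), rewrite it from the
    # in-memory serialized model bytes instead of failing
    out_path = os.path.join(save_directory, "spiece.model")
    if os.path.isfile(vocab_file):
        shutil.copyfile(vocab_file, out_path)
    else:
        with open(out_path, "wb") as f:
            f.write(serialized_proto)
    return out_path
```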
  30. 06 Jan, 2022 1 commit
  31. 03 Jan, 2022 1 commit
  32. 30 Dec, 2021 1 commit
  33. 03 Dec, 2021 1 commit
    • Improve tokenizer tests (#13594) · 66ea7391
      Li-Huai (Allan) Lin authored
      * Use new method to acquire tokenizers
      
      * Resolve TODOs.
      
      * Style
      
      * Fix
      
      * Enable do_lower_case in test_tokenize_special_tokens
      
      * Apply suggestion from code review
      
      * Fix mask token handling
      
      * Revert "Fix mask token handling"
      
      This reverts commit daaa3f5291b1f71e5bc3604ca281c000000c4648.
      
      * Fix FNet mask token tokenization
      
      * Complete everything
      
      * Apply suggestions from code review
  34. 10 Nov, 2021 1 commit
  35. 08 Nov, 2021 1 commit
  36. 02 Nov, 2021 1 commit
  37. 11 Oct, 2021 1 commit