Commits · 70527ba69423df24e5c05e78bd337239569709ec · chenpangpang / transformers

11 Dec, 2020 1 commit
- Fix PreTrainedTokenizer.pad when first inputs are empty (#9018) · 70527ba6
  Sylvain Gugger authored Dec 11, 2020
```
* Fix PreTrainedTokenizer.pad when first inputs are empty

* Handle empty inputs case
```
  70527ba6
10 Dec, 2020 1 commit
- Refactor FLAX tests (#9034) · 8d4bb020
  Sylvain Gugger authored Dec 10, 2020
  
  8d4bb020
08 Dec, 2020 1 commit
- Fix interaction of return_token_type_ids and add_special_tokens (#8854) · b7cdd00f
  Lysandre Debut authored Dec 08, 2020
  
  b7cdd00f
03 Dec, 2020 1 commit
- Avoid erasing the attention mask when double padding (#8915) · 8453201c
  Sylvain Gugger authored Dec 03, 2020
  
  8453201c
02 Dec, 2020 1 commit

Warning about too long input for fast tokenizers too (#8799) · a8c3f9aa

Nicolas Patry authored Dec 02, 2020

* Warning about too long input for fast tokenizers too

If truncation is not set in tokenizers, but the tokenization is too long
for the model (`model_max_length`), we used to trigger a warning that

The input would probably fail (which it most likely will).

This PR re-enables the warning for fast tokenizers too and uses common
code for the trigger to make sure it's consistent across.

* Checking for pair of inputs too.

* Making the function private and adding it's doc.

* Remove formatting ?? in odd place.

* Missed uppercase.

a8c3f9aa

01 Dec, 2020 1 commit

Prevent BatchEncoding from blindly passing casts down to the tensors it... · 9c18f156

Adam Pocock authored Dec 01, 2020

Prevent BatchEncoding from blindly passing casts down to the tensors it contains. Fixes #6582. (#8860)

Update src/transformers/tokenization_utils_base.py with review fix
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

9c18f156

30 Nov, 2020 1 commit
- Correct docstring. (#8845) · cc983cd9
  Fraser Greenlee authored Nov 30, 2020
```
Related issue: https://github.com/huggingface/transformers/issues/8837
```
  cc983cd9
27 Nov, 2020 1 commit

Extend typing to path-like objects in `PretrainedConfig` and `PreTrainedModel` (#8770) · f9a2a9e3

Giovanni Compagnoni authored Nov 27, 2020

* update configuration_utils.py typing to allow pathlike objects when sensible

* update modeling_utils.py typing to allow pathlike objects when sensible

* black

* update tokenization_utils_base.py typing to allow pathlike objects when sensible

* update tokenization_utils_fast.py typing to allow pathlike objects when sensible

* update configuration_auto.py typing to allow pathlike objects when sensible

* update configuration_auto.py docstring to allow pathlike objects when sensible

* update tokenization_auto.py docstring to allow pathlike objects when sensible

* black

f9a2a9e3

19 Nov, 2020 1 commit
- [tokenizers] convert_to_tensors: don't reconvert when the type is already right (#8283) · 42111f1d
  Stas Bekman authored Nov 19, 2020
```
* don't reconvert when the type is already right

* better name

* adjust logic as suggested

* merge
```
  42111f1d
17 Nov, 2020 3 commits

Remove deprecated (#8604) · dd52804f

Sylvain Gugger authored Nov 17, 2020



* Remove old deprecated arguments
Co-authored-by: LysandreJik <lysandre.debut@reseau.eseo.fr>

* Remove needless imports

* Fix tests
Co-authored-by: LysandreJik <lysandre.debut@reseau.eseo.fr>

dd52804f

Tokenizers should be framework agnostic (#8599) · 3095ee9d

Lysandre Debut authored Nov 17, 2020



* Tokenizers should be framework agnostic

* Run the slow tests

* Not testing

* Fix documentation

* Apply suggestions from code review
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

3095ee9d

Tokenizers: ability to load from model subfolder (#8586) · 042a6aa7

Julien Chaumond authored Nov 17, 2020



* <small>tiny typo</small>

* Tokenizers: ability to load from model subfolder

* use subfolder for local files as well

* Uniformize model shortcut name => model id

* from s3 => from huggingface.co
Co-authored-by: Quentin Lhoest <lhoest.q@gmail.com>

042a6aa7

15 Nov, 2020 1 commit

[breaking|pipelines|tokenizers] Adding slow-fast tokenizers equivalence tests... · f4e04cd2

Thomas Wolf authored Nov 15, 2020


[breaking|pipelines|tokenizers] Adding slow-fast tokenizers equivalence tests pipelines - Removing sentencepiece as a required dependency (#8073)

* Fixing roberta for slow-fast tests

* WIP getting equivalence on pipelines

* slow-to-fast equivalence - working on question-answering pipeline

* optional FAISS tests

* Pipeline Q&A

* Move pipeline tests to their own test job again

* update tokenizer to add sequence id methods

* update to tokenizers 0.9.4

* set sentencepiecce as optional

* clean up squad

* clean up pipelines to use sequence_ids

* style/quality

* wording

* Switch to use_fast = True by default

* update tests for use_fast at True by default

* fix rag tokenizer test

* removing protobuf from required dependencies

* fix NER test for use_fast = True by default

* fixing example tests (Q&A examples use slow tokenizers for now)

* protobuf in main deps extras["sentencepiece"] and example deps

* fix protobug install test

* try to fix seq2seq by switching to slow tokenizers for now

* Update src/transformers/tokenization_utils_base.py
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

* Update src/transformers/tokenization_utils_base.py
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

f4e04cd2

10 Nov, 2020 1 commit

Model versioning (#8324) · 70f622fa

Julien Chaumond authored Nov 10, 2020

* fix typo

* rm use_cdn & references, and implement new hf_bucket_url

* I'm pretty sure we don't need to `read` this file

* same here

* [BIG] file_utils.networking: do not gobble up errors anymore

* Fix CI 😇



* Apply suggestions from code review
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Tiny doc tweak

* Add doc + pass kwarg everywhere

* Add more tests and explain

cc @sshleifer let me know if better
Co-Authored-By: Sam Shleifer <sshleifer@gmail.com>

* Also implement revision in pipelines

In the case where we're passing a task name or a string model identifier

* Fix CI 😇



* Fix CI

* [hf_api] new methods + command line implem

* make style

* Final endpoints post-migration

* Fix post-migration

* Py3.6 compat

cc @stefan-it

Thank you @stas00
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: Sam Shleifer <sshleifer@gmail.com>

70f622fa

29 Oct, 2020 1 commit

Fix doc errors and typos across the board (#8139) · 969859d5

Santiago Castro authored Oct 29, 2020

* Fix doc errors and typos across the board

* Fix a typo

* Fix the CI

* Fix more typos

* Fix CI

* More fixes

* Fix CI

* More fixes

* More fixes

969859d5

26 Oct, 2020 2 commits

Doc styling (#8067) · 08f534d2

Sylvain Gugger authored Oct 26, 2020

* Important files

* Styling them all

* Revert "Styling them all"

This reverts commit 7d029395fdae8513b8281cbc2a6c239f8093503e.

* Syling them for realsies

* Fix syntax error

* Fix benchmark_utils

* More fixes

* Fix modeling auto and script

* Remove new line

* Fixes

* More fixes

* Fix more files

* Style

* Add FSMT

* More fixes

* More fixes

* More fixes

* More fixes

* Fixes

* More fixes

* More fixes

* Last fixes

* Make sphinx happy

08f534d2

[tokenizers] Fixing #8001 - Adding tests on tokenizers serialization (#8006) · 79eb3915
Thomas Wolf authored Oct 26, 2020
```
* fixing #8001

* make T5 tokenizer serialization more robust - style
```
79eb3915

24 Oct, 2020 1 commit
- [doc prepare_seq2seq_batch] fix docs (#8013) · 38f6739c
  Suraj Patil authored Oct 25, 2020
  
  38f6739c
23 Oct, 2020 2 commits

Fix BatchEncoding.word_to_tokens for removed tokens (#7939) · 5e323017
Anthony MOI authored Oct 23, 2020

5e323017

[tests|tokenizers] Refactoring pipelines test backbone - Small tokenizers... · 3a40cdf5

Thomas Wolf authored Oct 23, 2020


[tests|tokenizers] Refactoring pipelines test backbone - Small tokenizers improvements - General tests speedups (#7970)

* WIP refactoring pipeline tests - switching to fast tokenizers

* fix dialog pipeline and fill-mask

* refactoring pipeline tests backbone

* make large tests slow

* fix tests (tf Bart inactive for now)

* fix doc...

* clean up for merge

* fixing tests - remove bart from summarization until there is TF

* fix quality and RAG

* Add new translation pipeline tests - fix JAX tests

* only slow for dialog

* Fixing the missing TF-BART imports in modeling_tf_auto

* spin out pipeline tests in separate CI job

* adding pipeline test to CI YAML

* add slow pipeline tests

* speed up tf and pt join test to avoid redoing all the standalone pt and tf tests

* Update src/transformers/tokenization_utils_base.py
Co-authored-by: Sam Shleifer <sshleifer@gmail.com>

* Update src/transformers/pipelines.py
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update src/transformers/pipelines.py
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

* Update src/transformers/testing_utils.py
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* add require_torch and require_tf in is_pt_tf_cross_test
Co-authored-by: Sam Shleifer <sshleifer@gmail.com>
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

3a40cdf5

21 Oct, 2020 1 commit
- fix 'encode_plus' docstring for 'special_tokens_mask' (0s and 1s were reversed) (#7949) · 16da8771
  Evan Pete Walsh authored Oct 21, 2020
```
* fix docstring for 'special_tokens_mask'

* revert auto formatter changes

* revert another auto format

* revert another auto format
```
  16da8771
19 Oct, 2020 2 commits

Integrate Bert-like model on Flax runtime. (#3722) · 8f8f8d99

Funtowicz Morgan authored Oct 19, 2020



* WIP flax bert

* Initial commit Bert Jax/Flax implementation.

* Embeddings working and equivalent to PyTorch.

* Move embeddings in its own module BertEmbeddings

* Added jax.jit annotation on forward call

* BertEncoder on par with PyTorch ! :D

* Add BertPooler on par with PyTorch !!

* Working Jax+Flax implementation of BertModel with < 1e-5 differences on the last layer.

* Fix pooled output to take only the first token of the sequence.

* Refactoring to use BertConfig from transformers.

* Renamed FXBertModel to FlaxBertModel

* Model is now initialized in FlaxBertModel constructor and reused.

* WIP JaxPreTrainedModel

* Cleaning up the code of FlaxBertModel

* Added ability to load Flax model saved through save_pretrained()

* Added ability to convert Pytorch Bert model to FlaxBert

* FlaxBert can now load every Pytorch Bert model with on-the-fly conversion

* Fix hardcoded shape values in conversion scripts.

* Improve the way we handle LayerNorm conversion from PyTorch to Flax.

* Added positional embeddings as parameter of BertModel with default to np.arange.

* Let's roll FlaxRoberta !

* Fix missing position_ids parameters on predict for Bert

* Flax backend now supports batched inputs
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Make it possible to load msgpacked model on convert from pytorch in last resort.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Moved save_pretrained to Jax base class along with more constructor parameters.

* Use specialized, model dependent conversion functio.

* Expose `is_flax_available` in file_utils.

* Added unittest for Flax models.

* Added run_tests_flax to the CI.

* Introduce FlaxAutoModel

* Added more unittests

* Flax model reference the _MODEL_ARCHIVE_MAP from PyTorch model.

* Addressing review comments.

* Expose seed in both Bert and Roberta

* Fix typo suggested by @stefan-it
Co-Authored-By: Stefan Schweter <stefan@schweter.it>

* Attempt to make style

* Attempt to make style in tests too

* Added jax & jaxlib to the flax optional dependencies.

* Attempt to fix flake8 warnings ...

* Redo black again and again

* When black and flake8 fight each other for a space ... 💥 💥 💥

* Try removing trailing comma to make both black and flake happy!

* Fix invalid is_<framework>_available call, thanks @LysandreJik 🎉



* Fix another invalid import in flax_roberta test

* Bump and pin flax release to 0.1.0.

* Make flake8 happy, remove unused jax import

* Change the type of the catch for msgpack.

* Remove unused import.

* Put seed as optional constructor parameter.

* trigger ci again

* Fix too much parameters in BertAttention.

* Formatting.

* Simplify Flax unittests to avoid machine crashes.

* Fix invalid number of arguments when raising issue for an unknown model.

* Address @bastings comment in PR, moving jax.jit decorated outside of __call__

* Fix incorrect path to require_flax/require_pytorch functions.
Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Attempt to make style.
Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Correct rebasing of circle-ci dependencies
Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Fix import sorting.
Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Fix unused imports.
Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Again import sorting...
Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Installing missing nlp dependency for flax unittests.
Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Fix laoding of model for Flax implementations.
Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* jit the inner function call to make JAX-compatible
Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Format !
Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Flake one more time 🎶

Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Rewrites BERT in Flax to the new Linen API (#7211)

* Rewrite Flax HuggingFace PR to Linen

* Some fixes

* Fix tests

* Fix CI with change of name of nlp (#7054)

* nlp -> datasets

* More nlp -> datasets

* Woopsie

* More nlp -> datasets

* One last

* Expose `is_flax_available` in file_utils.

* Added run_tests_flax to the CI.

* Attempt to make style

* trigger ci again

* Fix import sorting.
Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Revert "Rewrites BERT in Flax to the new Linen API (#7211)"

This reverts commit 23703a5eb3364e26a1cbc3ee34b4710d86a674b0.

* Remove jnp.lax references
Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Make style.
Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Reintroduce Linen changes ...
Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Make style.
Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Use jax native's gelu function.
Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Renaming BertModel to BertModule to highlight the fact this is the Flax Module object.
Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Rewrite FlaxAutoModel test to not rely on pretrained_model_archive_map
Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Remove unused variable in BertModule.
Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Remove unused variable in BertModule again
Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Attempt to have is_flax_available working again.
Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Introduce JAX TensorType
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Improve ImportError message when trying to convert to various TensorType format.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Makes Flax model jittable.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Ensure flax models are jittable in unittests.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Remove unused imports.
Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Ensure jax imports are guarded behind is_flax_available.
Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Make style.
Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Make style again
Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Make style again again
Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Make style again again again
Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Update src/transformers/file_utils.py
Co-authored-by: Marc van Zee <marcvanzee@gmail.com>

* Bump flax to it's latest version
Co-authored-by: Marc van Zee <marcvanzee@gmail.com>

* Bump jax version to at least 0.2.0
Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Style.
Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Update the unittest to use TensorType.JAX
Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* isort import in tests.
Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Match new flax parameters name "params"
Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Remove unused imports.
Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Add flax models to transformers __init__
Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Attempt to address all CI related comments.
Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Correct circle.yml indent.
Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Correct circle.yml indent (2)
Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Remove coverage from flax tests
Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Addressing many naming suggestions from comments
Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Simplify for loop logic to interate over layers in FlaxBertLayerCollection
Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* use f-string syntax for formatting logs.
Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Use config property from FlaxPreTrainedModel.
Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* use "cls_token" instead of "first_token" variable name.
Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* use "hidden_state" instead of "h" variable name.
Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Correct class reference in docstring to link to Flax related modules.
Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Added HF + Google Flax team copyright.
Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Make Roberta independent from Bert
Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Move activation functions to flax_utils.
Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Move activation functions to flax_utils for bert.
Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Added docstring for BERT
Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Update import for Bert and Roberta tokenizers
Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Make style.
Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* fix-copies
Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Correct FlaxRobertaLayer to match PyTorch.
Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Use the same store_artifact for flax unittest
Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Style.
Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Make sure gradient are disabled only locally for flax unittest using torch equivalence.
Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Use relative imports
Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>
Co-authored-by: Stefan Schweter <stefan@schweter.it>
Co-authored-by: Marc van Zee <marcvanzee@gmail.com>
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

8f8f8d99

Fix small type hinting error (#7820) · 406a49df

AndreaSottana authored Oct 19, 2020



* Fix small type hinting error

* Update tokenization_utils_base.py

* Update src/transformers/tokenization_utils_base.py
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

406a49df

18 Oct, 2020 1 commit

[Dependencies|tokenizers] Make both SentencePiece and Tokenizers optional dependencies (#7659) · ba8c4d0a

Thomas Wolf authored Oct 18, 2020

* splitting fast and slow tokenizers [WIP]

* [WIP] splitting sentencepiece and tokenizers dependencies

* update dummy objects

* add name_or_path to models and tokenizers

* prefix added to file names

* prefix

* styling + quality

* spliting all the tokenizer files - sorting sentencepiece based ones

* update tokenizer version up to 0.9.0

* remove hard dependency on sentencepiece 🎉

* and removed hard dependency on tokenizers 🎉



* update conversion script

* update missing models

* fixing tests

* move test_tokenization_fast to main tokenization tests - fix bugs

* bump up tokenizers

* fix bert_generation

* update ad fix several tokenizers

* keep sentencepiece in deps for now

* fix funnel and deberta tests

* fix fsmt

* fix marian tests

* fix layoutlm

* fix squeezebert and gpt2

* fix T5 tokenization

* fix xlnet tests

* style

* fix mbart

* bump up tokenizers to 0.9.2

* fix model tests

* fix tf models

* fix seq2seq examples

* fix tests without sentencepiece

* fix slow => fast  conversion without sentencepiece

* update auto and bert generation tests

* fix mbart tests

* fix auto and common test without tokenizers

* fix tests without tokenizers

* clean up tests lighten up when tokenizers + sentencepiece are both off

* style quality and tests fixing

* add sentencepiece to doc/examples reqs

* leave sentencepiece on for now

* style quality split hebert and fix pegasus

* WIP Herbert fast

* add sample_text_no_unicode and fix hebert tokenization

* skip FSMT example test for now

* fix style

* fix fsmt in example tests

* update following Lysandre and Sylvain's comments

* Update src/transformers/testing_utils.py
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update src/transformers/testing_utils.py
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update src/transformers/tokenization_utils_base.py
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update src/transformers/tokenization_utils_base.py
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

ba8c4d0a

13 Oct, 2020 2 commits
- fixed lots of typos. (#7758) · 7e73c128
  Tiger authored Oct 13, 2020
  
  7e73c128
- [Rag] Fix loading of pretrained Rag Tokenizer (#7756) · 82b09a84
  Patrick von Platen authored Oct 13, 2020
```
* fix rag

* Update tokenizer save_pretrained
Co-authored-by: Thomas Wolf <thomwolf@users.noreply.github.com>
```
  82b09a84
12 Oct, 2020 1 commit

Update tokenization_utils_base.py (#7696) · 34fcfb44

AndreaSottana authored Oct 12, 2020

Minor spelling corrections in docstrings. "information" is uncountable in English and has no plural.

34fcfb44

08 Oct, 2020 1 commit

Adding Fast tokenizers for SentencePiece based tokenizers - Breaking: remove... · 9aeacb58

Thomas Wolf authored Oct 08, 2020


Adding Fast tokenizers for SentencePiece based tokenizers - Breaking: remove Transfo-XL fast tokenizer (#7141)

* [WIP] SP tokenizers

* fixing tests for T5

* WIP tokenizers

* serialization

* update T5

* WIP T5 tokenization

* slow to fast conversion script

* Refactoring to move tokenzier implementations inside transformers

* Adding gpt - refactoring - quality

* WIP adding several tokenizers to the fast world

* WIP Roberta - moving implementations

* update to dev4 switch file loading to in-memory loading

* Updating and fixing

* advancing on the tokenizers - updating do_lower_case

* style and quality

* moving forward with tokenizers conversion and tests

* MBart, T5

* dumping the fast version of transformer XL

* Adding to autotokenizers + style/quality

* update init and space_between_special_tokens

* style and quality

* bump up tokenizers version

* add protobuf

* fix pickle Bert JP with Mecab

* fix newly added tokenizers

* style and quality

* fix bert japanese

* fix funnel

* limite tokenizer warning to one occurence

* clean up file

* fix new tokenizers

* fast tokenizers deep tests

* WIP adding all the special fast tests on the new fast tokenizers

* quick fix

* adding more fast tokenizers in the fast tests

* all tokenizers in fast version tested

* Adding BertGenerationFast

* bump up setup.py for CI

* remove BertGenerationFast (too early)

* bump up tokenizers version

* Clean old docstrings

* Typo

* Update following Lysandre comments
Co-authored-by: Sylvain Gugger <sylvain.gugger@gmail.com>

9aeacb58

06 Oct, 2020 1 commit
- Fix tokenizer UnboundLocalError when padding is set to PaddingStrategy.MAX_LENGTH (#7610) · e084089e
  Gabriele Picco authored Oct 06, 2020
```
* Fix UnboundLocalError when PaddingStrategy is MAX_LENGTH

* Fix UnboundLocalError for TruncationStrategy
```
  e084089e
23 Sep, 2020 1 commit

Models doc (#7345) · 3323146e

Sylvain Gugger authored Sep 23, 2020



* Clean up model documentation

* Formatting

* Preparation work

* Long lines

* Main work on rst files

* Cleanup all config files

* Syntax fix

* Clean all tokenizers

* Work on first models

* Models beginning

* FaluBERT

* All PyTorch models

* All models

* Long lines again

* Fixes

* More fixes

* Update docs/source/model_doc/bert.rst
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

* Update docs/source/model_doc/electra.rst
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

* Last fixes
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

3323146e

22 Sep, 2020 1 commit
- is_pretokenized -> is_split_into_words (#7236) · 21ca1480
  Sylvain Gugger authored Sep 22, 2020
```
* is_pretokenized -> is_split_into_words

* Fix tests
```
  21ca1480
16 Sep, 2020 1 commit
- fix the warning message of overflowed sequence (#7151) · 08bfc171
  Xi Ye authored Sep 16, 2020
  
  08bfc171
14 Sep, 2020 1 commit

Add Mirror Option for Downloads (#6679) · 90cde2e9

Kevin Canwen Xu authored Sep 14, 2020

* Add Tuna Mirror for Downloads from China

* format fix

* Use preset instead of hardcoding URL

* Fix

* make style

* update the mirror option doc

* update the mirror

90cde2e9

09 Sep, 2020 1 commit

Batch encore plus and overflowing tokens fails when non existing overflowing... · 15478c12

Lysandre Debut authored Sep 09, 2020

Batch encore plus and overflowing tokens fails when non existing overflowing tokens for a sequence (#6677)

* Patch and test

* Fix tests

15478c12

08 Sep, 2020 1 commit

typo (#7001) · c18f5916

Stas Bekman authored Sep 07, 2020

apologies for the tiny PRs, just sending those as I find them.

c18f5916

04 Sep, 2020 1 commit
- typo (#6952) · eff274d6
  Stas Bekman authored Sep 04, 2020
  
  eff274d6
26 Aug, 2020 1 commit

Centralize logging (#6434) · 77abd1e7

Lysandre Debut authored Aug 26, 2020



* Logging

* Style

* hf_logging > utils.logging

* Address @thomwolf's comments

* Update test

* Update src/transformers/benchmark/benchmark_utils.py
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Revert bad change
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

77abd1e7

24 Aug, 2020 1 commit
- Update repo to isort v5 (#6686) · a5737779
  Sylvain Gugger authored Aug 24, 2020
```
* Run new isort

* More changes

* Update CI, CONTRIBUTING and benchmarks
```
  a5737779
19 Aug, 2020 1 commit
- Fix #6575 (#6596) · 18ca0e91
  Sylvain Gugger authored Aug 19, 2020
  
  18ca0e91
12 Aug, 2020 1 commit

Fixes to make life easier with the nlp library (#6423) · e9c30314

Sylvain Gugger authored Aug 12, 2020



* allow using tokenizer.pad as a collate_fn in pytorch

* allow using tokenizer.pad as a collate_fn in pytorch

* Add documentation and tests

* Make attention mask the right shape

* Better test
Co-authored-by: Thomas Wolf <thomwolf@users.noreply.github.com>

e9c30314