Commits · 08f534d2da47875a4b7eb1c125cfa7f0f3b79642 · chenpangpang / transformers

"vscode:/vscode.git/clone" did not exist on "5daca95dddf940139d749b1ca42c59ebc5191979"

26 Oct, 2020 1 commit

Sylvain Gugger authored Oct 26, 2020

* Important files

* Styling them all

* Revert "Styling them all"

This reverts commit 7d029395fdae8513b8281cbc2a6c239f8093503e.

* Syling them for realsies

* Fix syntax error

* Fix benchmark_utils

* More fixes

* Fix modeling auto and script

* Remove new line

* Fixes

* More fixes

* Fix more files

* Style

* Add FSMT

* More fixes

* More fixes

* More fixes

* More fixes

* Fixes

* More fixes

* More fixes

* Last fixes

* Make sphinx happy

08f534d2

18 Oct, 2020 1 commit

[Dependencies|tokenizers] Make both SentencePiece and Tokenizers optional dependencies (#7659) · ba8c4d0a

Thomas Wolf authored Oct 18, 2020

* splitting fast and slow tokenizers [WIP]

* [WIP] splitting sentencepiece and tokenizers dependencies

* update dummy objects

* add name_or_path to models and tokenizers

* prefix added to file names

* prefix

* styling + quality

* spliting all the tokenizer files - sorting sentencepiece based ones

* update tokenizer version up to 0.9.0

* remove hard dependency on sentencepiece 🎉

* and removed hard dependency on tokenizers 🎉



* update conversion script

* update missing models

* fixing tests

* move test_tokenization_fast to main tokenization tests - fix bugs

* bump up tokenizers

* fix bert_generation

* update ad fix several tokenizers

* keep sentencepiece in deps for now

* fix funnel and deberta tests

* fix fsmt

* fix marian tests

* fix layoutlm

* fix squeezebert and gpt2

* fix T5 tokenization

* fix xlnet tests

* style

* fix mbart

* bump up tokenizers to 0.9.2

* fix model tests

* fix tf models

* fix seq2seq examples

* fix tests without sentencepiece

* fix slow => fast  conversion without sentencepiece

* update auto and bert generation tests

* fix mbart tests

* fix auto and common test without tokenizers

* fix tests without tokenizers

* clean up tests lighten up when tokenizers + sentencepiece are both off

* style quality and tests fixing

* add sentencepiece to doc/examples reqs

* leave sentencepiece on for now

* style quality split hebert and fix pegasus

* WIP Herbert fast

* add sample_text_no_unicode and fix hebert tokenization

* skip FSMT example test for now

* fix style

* fix fsmt in example tests

* update following Lysandre and Sylvain's comments

* Update src/transformers/testing_utils.py
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update src/transformers/testing_utils.py
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update src/transformers/tokenization_utils_base.py
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update src/transformers/tokenization_utils_base.py
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

ba8c4d0a

08 Oct, 2020 1 commit

Adding Fast tokenizers for SentencePiece based tokenizers - Breaking: remove... · 9aeacb58

Thomas Wolf authored Oct 08, 2020


Adding Fast tokenizers for SentencePiece based tokenizers - Breaking: remove Transfo-XL fast tokenizer (#7141)

* [WIP] SP tokenizers

* fixing tests for T5

* WIP tokenizers

* serialization

* update T5

* WIP T5 tokenization

* slow to fast conversion script

* Refactoring to move tokenzier implementations inside transformers

* Adding gpt - refactoring - quality

* WIP adding several tokenizers to the fast world

* WIP Roberta - moving implementations

* update to dev4 switch file loading to in-memory loading

* Updating and fixing

* advancing on the tokenizers - updating do_lower_case

* style and quality

* moving forward with tokenizers conversion and tests

* MBart, T5

* dumping the fast version of transformer XL

* Adding to autotokenizers + style/quality

* update init and space_between_special_tokens

* style and quality

* bump up tokenizers version

* add protobuf

* fix pickle Bert JP with Mecab

* fix newly added tokenizers

* style and quality

* fix bert japanese

* fix funnel

* limite tokenizer warning to one occurence

* clean up file

* fix new tokenizers

* fast tokenizers deep tests

* WIP adding all the special fast tests on the new fast tokenizers

* quick fix

* adding more fast tokenizers in the fast tests

* all tokenizers in fast version tested

* Adding BertGenerationFast

* bump up setup.py for CI

* remove BertGenerationFast (too early)

* bump up tokenizers version

* Clean old docstrings

* Typo

* Update following Lysandre comments
Co-authored-by: Sylvain Gugger <sylvain.gugger@gmail.com>

9aeacb58

23 Sep, 2020 1 commit

Models doc (#7345) · 3323146e

Sylvain Gugger authored Sep 23, 2020



* Clean up model documentation

* Formatting

* Preparation work

* Long lines

* Main work on rst files

* Cleanup all config files

* Syntax fix

* Clean all tokenizers

* Work on first models

* Models beginning

* FaluBERT

* All PyTorch models

* All models

* Long lines again

* Fixes

* More fixes

* Update docs/source/model_doc/bert.rst
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

* Update docs/source/model_doc/electra.rst
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

* Last fixes
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

3323146e

26 Aug, 2020 1 commit

Centralize logging (#6434) · 77abd1e7

Lysandre Debut authored Aug 26, 2020



* Logging

* Style

* hf_logging > utils.logging

* Address @thomwolf's comments

* Update test

* Update src/transformers/benchmark/benchmark_utils.py
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Revert bad change
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

77abd1e7

28 May, 2020 1 commit
- [Longformer] more models + model cards (#4628) · ef03ae87
  Iz Beltagy authored May 28, 2020
```
* adding freeze roberta models

* model cards

* lint
```
  ef03ae87
26 May, 2020 1 commit
- [Longformer For Question Answering] Conversion script, doc, small fixes (#4593) · c589eae2
  Patrick von Platen authored May 26, 2020
```
* add new longformer for question answering model

* add new config as well

* fix links

* fix links part 2
```
  c589eae2
25 May, 2020 1 commit
- LongformerTokenizerFast (#4547) · 51397336
  Suraj Patil authored May 26, 2020
  
  51397336
19 May, 2020 1 commit

Longformer (#4352) · 8f1d0471

Iz Beltagy authored May 19, 2020

* first commit

* bug fixes

* better examples

* undo padding

* remove wrong VOCAB_FILES_NAMES

* License

* make style

* make isort happy

* unit tests

* integration test

* make `black` happy by undoing `isort` changes!!

* lint

* no need for the padding value

* batch_size not bsz

* remove unused type casting

* seqlen not seq_len

* staticmethod

* `bert` selfattention instead of `n2`

* uint8 instead of bool + lints

* pad inputs_embeds using embeddings not a constant

* black

* unit test with padding

* fix unit tests

* remove redundant unit test

* upload model weights

* resolve todo

* simpler _mask_invalid_locations without lru_cache + backward compatible masked_fill_

* increase unittest coverage

8f1d0471

18 May, 2020 1 commit
- [MbartTokenizer] save to sentencepiece.bpe.model (#4335) · 8581a670
  Mehrad Moradshahi authored May 18, 2020
  
  8581a670
10 Apr, 2020 1 commit
- Multilingual BART - (#3602) · 7a7fdf71
  Sam Shleifer authored Apr 10, 2020
```
- support mbart-en-ro weights
- add MBartTokenizer
```
  7a7fdf71
29 Mar, 2020 1 commit
- [BART] add bart-large-xsum weights (#3422) · f6a23d19
  Sam Shleifer authored Mar 29, 2020
  
  f6a23d19
02 Mar, 2020 1 commit

Bart-CNN (#3059) · b54ef78d

Sam Shleifer authored Mar 02, 2020

`generate` code that produces 99% identical summarizations to fairseq on CNN test data, with caching.

b54ef78d

20 Feb, 2020 1 commit

New BartModel (#2745) · 53ce3854

Sam Shleifer authored Feb 20, 2020

* Results same as fairseq
* Wrote a ton of tests
* Struggled with api signatures
* added some docs

53ce3854