1. 09 Oct, 2020 5 commits
  2. 08 Oct, 2020 1 commit
    • Adding Fast tokenizers for SentencePiece based tokenizers - Breaking: remove Transfo-XL fast tokenizer (#7141) · 9aeacb58
      Thomas Wolf authored
      
      * [WIP] SP tokenizers
      
      * fixing tests for T5
      
      * WIP tokenizers
      
      * serialization
      
      * update T5
      
      * WIP T5 tokenization
      
      * slow to fast conversion script
      
      * Refactoring to move tokenizer implementations inside transformers
      
      * Adding gpt - refactoring - quality
      
      * WIP adding several tokenizers to the fast world
      
      * WIP Roberta - moving implementations
      
      * update to dev4, switch file loading to in-memory loading
      
      * Updating and fixing
      
      * advancing on the tokenizers - updating do_lower_case
      
      * style and quality
      
      * moving forward with tokenizers conversion and tests
      
      * MBart, T5
      
      * dumping the fast version of transformer XL
      
      * Adding to autotokenizers + style/quality
      
      * update init and space_between_special_tokens
      
      * style and quality
      
      * bump up tokenizers version
      
      * add protobuf
      
      * fix pickle Bert JP with Mecab
      
      * fix newly added tokenizers
      
      * style and quality
      
      * fix bert japanese
      
      * fix funnel
      
      * limit tokenizer warning to one occurrence
      
      * clean up file
      
      * fix new tokenizers
      
      * fast tokenizers deep tests
      
      * WIP adding all the special fast tests on the new fast tokenizers
      
      * quick fix
      
      * adding more fast tokenizers in the fast tests
      
      * all tokenizers in fast version tested
      
      * Adding BertGenerationFast
      
      * bump up setup.py for CI
      
      * remove BertGenerationFast (too early)
      
      * bump up tokenizers version
      
      * Clean old docstrings
      
      * Typo
      
      * Update following Lysandre comments
      Co-authored-by: Sylvain Gugger <sylvain.gugger@gmail.com>
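
      For context, a minimal sketch of what this change enables, assuming the `t5-small` checkpoint and the `use_fast` flag of `AutoTokenizer.from_pretrained` (both taken from the library's documented API of this period, not from this commit):

      ```
      from transformers import AutoTokenizer

      # T5 uses a SentencePiece vocabulary; with this PR it gains a Rust-backed
      # "fast" implementation converted from the slow SentencePiece model.
      tokenizer = AutoTokenizer.from_pretrained("t5-small", use_fast=True)

      enc = tokenizer("Translate English to German: The house is wonderful.")
      print(enc.tokens())       # token strings, exposed by fast tokenizers
      print(enc["input_ids"])
      ```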
  3. 30 Sep, 2020 1 commit
  4. 29 Sep, 2020 1 commit
  5. 28 Sep, 2020 1 commit
  6. 22 Sep, 2020 2 commits
    • RAG (#6813) · c754c41c
      Ola Piktus authored
      * added rag WIP
      
      * path fix
      
      * Formatting / renaming prior to actual work
      
      * added rag WIP
      
      * path fix
      
      * Formatting / renaming prior to actual work
      
      * added rag WIP
      
      * path fix
      
      * Formatting / renaming prior to actual work
      
      * added rag WIP
      
      * Formatting / renaming prior to actual work
      
      * First commit
      
      * improve comments
      
      * Retrieval evaluation scripts
      
      * refactor to include modeling outputs + MPI retriever
      
      * Fix rag-token model + refactor
      
      * Various fixes + finetuning logic
      
      * use_bos fix
      
      * Retrieval refactor
      
      * Finetuning refactoring and cleanup
      
      * Add documentation and cleanup
      
      * Remove set_up_rag_env.sh file
      
      * Fix retrieval with HF index
      
      * Fix import errors
      
      * Fix quality errors
      
      * Refactor as per suggestions in https://github.com/huggingface/transformers/pull/6813#issuecomment-687208867
      
      
      
      * fix quality
      
      * Fix RAG Sequence generation
      
      * minor cleanup plus initial tests
      
      * fix test
      
      * fix tests 2
      
      * Comments fix
      
      * post-merge fixes
      
      * Improve readme + post-rebase refactor
      
      * Extra dependencies for tests
      
      * Fix tests
      
      * Fix tests 2
      
      * Refactor test requirements
      
      * Fix tests 3
      
      * Post-rebase refactor
      
      * rename nlp->datasets
      
      * RAG integration tests
      
      * add tokenizer to slow integration test and allow retriever to run on cpu
      
      * add tests; fix position ids warning
      
      * change structure
      
      * change structure
      
      * add from encoder generator
      
      * save working solution
      
      * make all integration tests pass
      
      * add RagTokenizer.save/from_pretrained and RagRetriever.save/from_pretrained
      
      * don't save paths
      
      * delete unnecessary imports
      
      * pass config to AutoTokenizer.from_pretrained for Rag tokenizers
      
      * init wiki_dpr only once
      
      * hardcode legacy index and passages paths (todo: add the right urls)
      
      * finalize config
      
      * finalize retriever API and config API
      
      * LegacyIndex index download refactor
      
      * add dpr to autotokenizer
      
      * make from pretrained more flexible
      
      * fix ragfortokengeneration
      
      * small name changes in tokenizer
      
      * add labels to models
      
      * change default index name
      
      * add retrieval tests
      
      * finish token generate
      
      * align test with previous version and make all tests pass
      
      * add tests
      
      * finalize tests
      
      * implement Thom's suggestions
      
      * add first version of test
      
      * make first tests work
      
      * make retriever platform agnostic
      
      * naming
      
      * style
      
      * add legacy index URL
      
      * docstrings + simple retrieval test for distributed
      
      * clean model api
      
      * add doc_ids to retriever's outputs
      
      * fix retrieval tests
      
      * finish model outputs
      
      * finalize model api
      
      * fix generate problem for rag
      
      * fix generate for other models
      
      * fix some tests
      
      * save intermediate
      
      * set generate to default
      
      * big refactor generate
      
      * delete rag_api
      
      * correct pip faiss install
      
      * fix auto tokenization test
      
      * fix faiss install
      
      * fix test
      
      * move the distributed logic to examples
      
      * model page
      
      * docs
      
      * finish tests
      
      * fix dependencies
      
      * fix import in __init__
      
      * Refactor eval_rag and finetune scripts
      
      * start docstring
      
      * add psutil to test
      
      * fix tf test
      
      * move require torch to top
      
      * fix retrieval test
      
      * align naming
      
      * finish automodel
      
      * fix repo consistency
      
      * test ragtokenizer save/load
      
      * add rag model output docs
      
      * fix ragtokenizer save/load from pretrained
      
      * fix tokenizer dir
      
      * remove torch in retrieval
      
      * fix docs
      
      * fix finetune scripts
      
      * finish model docs
      
      * finish docs
      
      * remove auto model for now
      
      * add require torch
      
      * remove solved todos
      
      * integrate Sylvain's suggestions
      
      * Sam's comments
      
      * correct mistake on purpose
      
      * improve README
      
      * Add generation test cases
      
      * fix rag token
      
      * clean token generate
      
      * fix test
      
      * add note to test
      
      * fix attention mask
      
      * add t5 test for rag
      
      * Fix handling prefix in finetune.py
      
      * don't overwrite index_name
      Co-authored-by: Patrick Lewis <plewis@fb.com>
      Co-authored-by: Aleksandra Piktus <piktus@devfair0141.h2.fair>
      Co-authored-by: Aleksandra Piktus <piktus@learnfair5102.h2.fair>
      Co-authored-by: Aleksandra Piktus <piktus@learnfair5067.h2.fair>
      Co-authored-by: Your Name <you@example.com>
      Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
      Co-authored-by: Quentin Lhoest <lhoest.q@gmail.com>
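
      For context, a hedged sketch of the API the commits above converge on (`RagTokenizer`/`RagRetriever`/`RagSequenceForGeneration` with `from_pretrained`); the `facebook/rag-sequence-nq` checkpoint and the `index_name`/`use_dummy_dataset` flags come from the model documentation of this period and should be read as assumptions:

      ```
      from transformers import RagRetriever, RagSequenceForGeneration, RagTokenizer

      tokenizer = RagTokenizer.from_pretrained("facebook/rag-sequence-nq")
      # use_dummy_dataset avoids downloading the full wiki_dpr index
      retriever = RagRetriever.from_pretrained(
          "facebook/rag-sequence-nq", index_name="exact", use_dummy_dataset=True
      )
      model = RagSequenceForGeneration.from_pretrained(
          "facebook/rag-sequence-nq", retriever=retriever
      )

      inputs = tokenizer("who holds the record in 100m freestyle", return_tensors="pt")
      generated = model.generate(input_ids=inputs["input_ids"])
      print(tokenizer.batch_decode(generated, skip_special_tokens=True))
      ```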
    • Release: v3.2.0 · 3ebb1b3a
      Lysandre authored
  7. 14 Sep, 2020 1 commit
  8. 07 Sep, 2020 2 commits
    • match CI's version of flake8 (#6941) · 159ef07e
      Stas Bekman authored
      My flake8 wasn't up to date, so `make quality` wasn't reporting the same things CI did - this PR adds the actual required version.
      
      Thinking more about some of these minimal versions - CI will always install afresh and thus will always run the latest version. Is there a way to tell pip to always install the latest versions of certain dependencies on `pip install -e ".[dev]"`, rather than hardcoding minimal versions, which quickly become outdated?
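
      A sketch of the kind of pin this refers to; the version number and package layout below are illustrative placeholders, not a copy of the repository's actual setup.py:

      ```
      # Illustrative setup.py: pinning a minimal flake8 in the dev extras so a
      # local `make quality` agrees with what CI installs afresh.
      from setuptools import find_packages, setup

      setup(
          name="example-package",
          version="0.0.1",
          packages=find_packages(),
          extras_require={
              "dev": [
                  "flake8>=3.8.3",  # placeholder: keep in sync with CI's version
              ]
          },
      )
      ```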
    • [testing] add dependency: parametrize (#6958) · b4a9c95f
      Stas Bekman authored
      unittest doesn't support pytest's super-handy `@pytest.mark.parametrize`. I researched, and there are many proposed workarounds, most tedious at best. If we include https://pypi.org/project/parameterized/ in the dev dependencies, it provides a very easy way to write parameterized tests - much like pytest's fixtures, plus quite a few other styles.
      
      Example:
      ```
      import math

      from parameterized import parameterized

      @parameterized([
          (2, 2, 4),
          (2, 3, 8),
          (1, 9, 1),
          (0, 9, 0),
      ])
      def test_pow(base, exponent, expected):
          assert math.pow(base, exponent) == expected
      ```
      (with an extra `self` argument if inside a test class - see the unittest-style sketch below)
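
      For unittest-style test classes, a sketch using `parameterized.expand`, which generates one independent test method per parameter tuple (the class name and values are illustrative):

      ```
      import math
      import unittest

      from parameterized import parameterized


      class TestPow(unittest.TestCase):
          # expand() creates one test method per tuple, so each case passes or
          # fails on its own.
          @parameterized.expand([
              (2, 2, 4),
              (2, 3, 8),
              (0, 9, 0),
          ])
          def test_pow(self, base, exponent, expected):
              self.assertEqual(math.pow(base, exponent), expected)
      ```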
      
      As a reminder, the pytest style is slightly different:
      ```
      import pytest

      @pytest.mark.parametrize("test_input,expected", [("3+5", 8), ("2+4", 6), ("6*9", 42)])
      def test_eval(test_input, expected):
          assert eval(test_input) == expected
      ```
      More examples here: https://pypi.org/project/parameterized
      
      May I suggest that it will make it much easier to write some types of tests?
  9. 01 Sep, 2020 1 commit
  10. 28 Aug, 2020 1 commit
  11. 25 Aug, 2020 1 commit
  12. 24 Aug, 2020 1 commit
  13. 17 Aug, 2020 1 commit
  14. 31 Jul, 2020 1 commit
    • Replace mecab-python3 with fugashi for Japanese tokenization (#6086) · cf3cf304
      Paul O'Leary McCann authored
      
      
      * Replace mecab-python3 with fugashi
      
      This replaces mecab-python3 with fugashi for Japanese tokenization. I am
      the maintainer of both projects.
      
      Both projects are MeCab wrappers, so the underlying C++ code is the
      same. fugashi is the newer wrapper and doesn't use SWIG, so for basic
      use of the MeCab API it's easier to use.
      
      This code ensures the use of a version of ipadic installed via pip,
      which should make versioning and tracking down issues easier.
      
      fugashi has wheels for Windows, OSX, and Linux, which will help with
      issues with installing old versions of mecab-python3 on Windows.
      Compared to mecab-python3, because fugashi doesn't use SWIG, it doesn't
      require a C++ runtime to be installed on Windows.
      
      In adding this change I removed some code dealing with `cursor`,
      `token_start`, and `token_end` variables. These variables didn't seem to
      be used for anything; it is unclear to me why they were there.
      
      I ran the tests and they passed, though I couldn't figure out how to run
      the slow tests (`--runslow` gave an error) and didn't try testing with
      Tensorflow.
      
      * Style fix
      
      * Remove unused variable
      
      Forgot to delete this...
      
      * Adapt doc with install instructions
      
      * Fix typo
      Co-authored-by: sgugger <sylvain.gugger@gmail.com>
      Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
  15. 29 Jul, 2020 1 commit
  16. 27 Jul, 2020 1 commit
  17. 18 Jul, 2020 1 commit
  18. 06 Jul, 2020 3 commits
  19. 03 Jul, 2020 2 commits
  20. 02 Jul, 2020 1 commit
  21. 30 Jun, 2020 1 commit
  22. 29 Jun, 2020 2 commits
  23. 25 Jun, 2020 1 commit
  24. 23 Jun, 2020 1 commit
    • Tokenizers API developments (#5103) · 11fdde02
      Thomas Wolf authored
      
      
      * Add return lengths
      
      * make pad a bit more flexible so it can be used as collate_fn
      
      * check all kwargs sent to encoding method are known
      
      * fixing kwargs in encodings
      
      * New AddedToken class in python
      
      This class lets you specify specific tokenization behaviors for some special tokens. It is used in particular for GPT2 and RoBERTa, to control how whitespace is stripped around special tokens (see the sketch below).
      
      * style and quality
      
      * switched to the huggingface tokenizers library for AddedTokens
      
      * up to tokenizer 0.8.0-rc3 - update API to use AddedToken state
      
      * style and quality
      
      * do not raise an error on additional or unused kwargs for tokenize() but only a warning
      
      * transfo-xl pretrained model requires torch
      
      * Update src/transformers/tokenization_utils.py
      Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
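
      A sketch of the AddedToken behavior described above, assuming the `gpt2` checkpoint and the `AddedToken` class imported from the tokenizers library (the token string and flag values are illustrative):

      ```
      from tokenizers import AddedToken
      from transformers import GPT2Tokenizer

      tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
      # lstrip=True strips the whitespace to the left of the special token;
      # rstrip=False leaves whitespace on its right untouched.
      tokenizer.add_tokens([AddedToken("<new_token>", lstrip=True, rstrip=False)])
      print(tokenizer.tokenize("hello <new_token> world"))
      ```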
  25. 18 Jun, 2020 1 commit
  26. 15 Jun, 2020 1 commit
    • [HUGE] Refactoring tokenizers backend - padding - truncation - pre-tokenized pipeline - fast tokenizers - tests (#4510) · 36434220
      Anthony MOI authored
      
      * Use tokenizers pre-tokenized pipeline
      
      * failing pretokenized test
      
      * Fix is_pretokenized in python
      
      * add pretokenized tests
      
      * style and quality
      
      * better tests for batched pretokenized inputs
      
      * tokenizers clean up - new padding_strategy - split the files
      
      * [HUGE] refactoring tokenizers - padding - truncation - tests
      
      * style and quality
      
      * bump up required tokenizers version to 0.8.0-rc1
      
      * switched padding/truncation API - simpler better backward compat
      
      * updating tests for custom tokenizers
      
      * style and quality - tests on pad
      
      * fix QA pipeline
      
      * fix backward compatibility for max_length only
      
      * style and quality
      
      * Various cleanups - add verbose
      
      * fix tests
      
      * update docstrings
      
      * Fix tests
      
      * Docs reformatted
      
      * __call__ method documented
      Co-authored-by: Thomas Wolf <thomwolf@users.noreply.github.com>
      Co-authored-by: Lysandre <lysandre.debut@reseau.eseo.fr>
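
      A minimal sketch of the padding/truncation API this refactor lands on, via the documented `__call__` method; the checkpoint is illustrative, and `is_pretokenized` is assumed to be this era's name for the pre-tokenized input flag:

      ```
      from transformers import AutoTokenizer

      tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

      # The switched API: padding and truncation are plain, composable flags.
      batch = tokenizer(
          ["a short sentence", "a slightly longer sentence to pad against"],
          padding="longest",   # pad to the longest sequence in the batch
          truncation=True,     # truncate to the model max length if needed
          return_tensors="pt",
      )
      print(batch["input_ids"].shape)

      # Pre-tokenized pipeline: pass an already-split sequence of words.
      pre_split = tokenizer(["Hello", "world", "!"], is_pretokenized=True)
      print(pre_split["input_ids"])
      ```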
  27. 09 Jun, 2020 1 commit
    • [Benchmark] add tpu and torchscript for benchmark (#4850) · 2cfb947f
      Patrick von Platen authored
      
      
      * add tpu and torchscript for benchmark
      
      * fix name in tests
      
      * "fix email"
      
      * make style
      
      * better log message for tpu
      
      * add more print and info for tpu
      
      * allow possibility to print tpu metrics
      
      * correct cpu usage
      
      * fix test for non-install
      
      * remove bogus file
      
      * include psutil in testing
      
      * run a couple of times before tracing in torchscript
      
      * do not allow tpu memory tracing for now
      
      * make style
      
      * add torchscript to env
      
      * better name for torch tpu
      Co-authored-by: Patrick von Platen <patrick@huggingface.co>
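
      For context, a hedged sketch of the benchmark utilities this PR extends; the argument names follow the library's documented PyTorchBenchmark API of this period, with `torchscript` being the flag added here, and the model/sizes are illustrative:

      ```
      from transformers import PyTorchBenchmark, PyTorchBenchmarkArguments

      args = PyTorchBenchmarkArguments(
          models=["bert-base-uncased"],
          batch_sizes=[8],
          sequence_lengths=[128],
          torchscript=True,  # trace the model with TorchScript before timing
      )
      benchmark = PyTorchBenchmark(args)
      results = benchmark.run()  # prints speed/memory results
      ```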
  28. 02 Jun, 2020 2 commits
  29. 26 May, 2020 1 commit
    • Make transformers-cli cross-platform (#4131) · 8cc6807e
      Bram Vanroy authored
      * make transformers-cli cross-platform
      
      Using "scripts" in setup.py is useful, particularly when you want access to non-Python scripts. However, in this case we want an entry point into some of our own Python scripts. To do this in a concise, cross-platform way, we can use entry_points.console_scripts (see the sketch below). This change is necessary to provide the CLI on all platforms, which "scripts" does not ensure. Usage remains the same, but the "transformers-cli" script has to be moved (become part of the library) and renamed (underscore + extension).
      
      * make style & quality
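
      A sketch of the setup.py mechanism described above; the module path follows the pattern the message describes and is an assumption, not a verbatim copy of the repository's setup.py:

      ```
      # Illustrative setup.py: console_scripts makes setuptools generate a
      # platform-appropriate launcher (including an .exe shim on Windows) that
      # invokes the named Python function.
      from setuptools import setup

      setup(
          name="transformers",
          entry_points={
              "console_scripts": [
                  # assumed path, following the pattern described above
                  "transformers-cli=transformers.commands.transformers_cli:main",
              ]
          },
      )
      ```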