Commits · 039d8d65fc19ac74a8c7917233eb2828c46c0fa7 · chenpangpang / transformers

20 Aug, 2020 2 commits

add intro to nlp lib & dataset links to custom datasets tutorial (#6583) · 039d8d65
Joe Davison authored Aug 20, 2020
```
* add intro to nlp lib + links

* unique links...
```
039d8d65

Docs copy button misses ... prefixed code (#6518) · cabfdfaf

Romain Rigaux authored Aug 20, 2020

Tested in a local build of the docs.

E.g. just above https://huggingface.co/transformers/task_summary.html#causal-language-modeling

The copy button will now copy the full code, e.g.

for token in top_5_tokens:
    print(sequence.replace(tokenizer.mask_token, tokenizer.decode([token])))

Instead of only the first line, as it currently does:

for token in top_5_tokens:


>>> for token in top_5_tokens:
...     print(sequence.replace(tokenizer.mask_token, tokenizer.decode([token])))
Distilled models are smaller than the models they mimic. Using them instead of the large versions would help reduce our carbon footprint.
Distilled models are smaller than the models they mimic. Using them instead of the large versions would help increase our carbon footprint.
Distilled models are smaller than the models they mimic. Using them instead of the large versions would help decrease our carbon footprint.
Distilled models are smaller than the models they mimic. Using them instead of the large versions would help offset our carbon footprint.
Distilled models are smaller than the models they mimic. Using them instead of the large versions would help improve our carbon footprint.

Docs for the option used in the fix:
https://sphinx-copybutton.readthedocs.io/en/latest/

cabfdfaf
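
The behaviour described in the commit above (copying the `...` continuation lines as well as the `>>>` line, and leaving the printed output out) can be sketched in plain Python. This is only an illustration, not the extension's code: the helper name `copyable_code` and the regular expression are assumptions. In sphinx-copybutton itself the prompt pattern is configured through `copybutton_prompt_text` together with `copybutton_prompt_is_regexp`; the exact values used by this commit are not shown here.

```python
import re

# Assumed prompt pattern: the primary ">>> " prompt or the "... " continuation prompt.
PROMPT_RE = re.compile(r"^(>>> |\.\.\. )")


def copyable_code(snippet: str) -> str:
    """Keep only the prompted lines of a doctest-style snippet, with prompts removed."""
    kept = []
    for line in snippet.splitlines():
        match = PROMPT_RE.match(line)
        if match:
            # Drop the prompt but keep the indentation that follows it.
            kept.append(line[match.end():])
    return "\n".join(kept)


snippet = (
    ">>> for token in top_5_tokens:\n"
    "...     print(sequence.replace(tokenizer.mask_token, tokenizer.decode([token])))\n"
    "Distilled models are smaller than the models they mimic.\n"
)
print(copyable_code(snippet))
```

With a pattern that matched only `>>> `, the second line would be dropped, which is the truncated copy the commit message describes.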

19 Aug, 2020 1 commit
- Fix #6575 (#6596) · 18ca0e91
  Sylvain Gugger authored Aug 19, 2020
  
  18ca0e91
18 Aug, 2020 4 commits
- [Pegasus Doc] minor typo (#6579) · fb6844af
  Suraj Patil authored Aug 18, 2020
```
Minor typo correction
@sshleifer
```
  fb6844af
- [docs] Fix number of 'ug' occurrences in tokenizer_summary (#6574) · 7516bcf2
  Romain Rigaux authored Aug 18, 2020
  
  7516bcf2
- [docs] Fix wrong newline in the middle of a paragraph (#6573) · 5a5af22e
  Romain Rigaux authored Aug 18, 2020
  
  5a5af22e
- [marian] converter supports models from new Tatoeba project (#6342) · 12d76241
  Sam Shleifer authored Aug 17, 2020
  
  12d76241
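
The 'ug' count touched by commit 7516bcf2 above comes from a byte-pair-encoding walkthrough, where adjacent symbol pairs are counted across a word-frequency table. The commit title does not state the corrected number, so the sketch below only shows how such a count can be checked. The toy frequencies are assumed to match the tokenizer summary's example, and `count_pairs` is a hypothetical helper, not a library function.

```python
from collections import Counter

# Toy word frequencies, assumed to match the BPE example in the tokenizer summary.
word_freqs = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}


def count_pairs(word_freqs):
    """Count adjacent symbol pairs, weighting each word by its frequency."""
    pairs = Counter()
    for word, freq in word_freqs.items():
        symbols = list(word)
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs


pairs = count_pairs(word_freqs)
print(pairs[("u", "g")])  # 10 (hug) + 5 (pug) + 5 (hugs) = 20
```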
17 Aug, 2020 9 commits

[Doc] add more MBart and other doc (#6490) · c9564f53

Suraj Patil authored Aug 17, 2020

* add mbart example

* add Pegasus and MBart in readme

* typo

* add MBart in Pretrained models

* add pre-proc doc

* add DPR in readme

* fix indent

* doc fix

c9564f53

replace _ with __ rst links (#6541) · f68c8731
Stas Bekman authored Aug 17, 2020

f68c8731

[doc] multiple corrections to "Summary of the tasks" (#6509) · b732e7e1

Stas Bekman authored Aug 17, 2020

* [doc] multiple corrections to "Summary of the tasks"

* fix indentation

* correction

* fix links, add links to examples/seq2seq/README.md instead of non-existing script

b732e7e1

[doc] make the text more readable, fix some typos, add some disambiguation (#6508) · 84d33317

Stas Bekman authored Aug 17, 2020



* [doc] make the text more readable, fix some typos, add some disambiguation

* Update docs/source/glossary.rst
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

84d33317

add custom datasets tutorial (#6466) · d0c2389f

Joe Davison authored Aug 17, 2020



* add custom datasets tutorial

* python -> bash code blocks

* Apply suggestions from code review
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* minor review feedback changes

* add working native QA snippet
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

d0c2389f

fix pegasus doc (#6533) · 36010cb1
Patrick von Platen authored Aug 17, 2020

36010cb1
[doc] Summary of the models fixes (#6511) · 49d8076f
Stas Bekman authored Aug 17, 2020
```
* [doc] Summary of the models fixes

* correction
```
49d8076f
[doc] fix invalid env vars (#6504) · 423eb5b1
Stas Bekman authored Aug 16, 2020
```
- remove invalid `ENV_` prefix.
- add a few ':' while at it
```
423eb5b1
typos (#6505) · df15c7c2
Stas Bekman authored Aug 16, 2020

df15c7c2

14 Aug, 2020 3 commits

Generation doc (#6470) · 895ed8f4

Sylvain Gugger authored Aug 14, 2020



* Generation doc

* MBartForConditionalGeneration (#6441)

* add MBartForConditionalGeneration

* style

* rebase and fixes

* add mbart test in TEST_FILES_WITH_NO_COMMON_TESTS

* fix docs

* don't ignore mbart

* doc

* fix mbart fairseq link

* put mbart before bart

* apply doc suggestions

* Use hash to clean the test dirs (#6475)

* Use hash to clean the test dirs

* Use hash to clean the test dirs

* Use hash to clean the test dirs

* fix

* [EncoderDecoder] Add Cross Attention for GPT2 (#6415)

* add cross attention layers for gpt2

* make gpt2 cross attention work

* finish bert2gpt2

* add explicit comments

* remove attention mask since not yet supported

* revert attn mask in pipeline

* Update src/transformers/modeling_gpt2.py
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update src/transformers/modeling_encoder_decoder.py
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Sort unique_no_split_tokens to make it deterministic (#6461)

* change unique_no_split_tokens's type to set

* use sorted list instead of set

* style

* Import accuracy_score (#6480)

* Apply suggestions from code review
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

* Address comments

* Styling

* Generation doc

* Apply suggestions from code review
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

* Address comments

* Styling
Co-authored-by: Suraj Patil <surajp815@gmail.com>
Co-authored-by: Kevin Canwen Xu <canwenxu@126.com>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
Co-authored-by: gijswijnholds <gijswijnholds@gmail.com>
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

895ed8f4
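
The squashed entry "Sort unique_no_split_tokens to make it deterministic (#6461)" in the commit above moves from a set to a sorted list. The reason is general Python behaviour: iteration order over a set of strings can differ between interpreter runs because string hashing is randomized by default, while a sorted list is stable. A minimal sketch, assuming a hypothetical helper `deterministic_unique` (this is not the library's code):

```python
def deterministic_unique(tokens):
    """Deduplicate tokens and return them in a stable, sorted order."""
    return sorted(set(tokens))


special = ["[SEP]", "[CLS]", "[PAD]", "[CLS]", "[MASK]"]
print(deterministic_unique(special))  # ['[CLS]', '[MASK]', '[PAD]', '[SEP]']
```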

Import accuracy_score (#6480) · b5ba758b
gijswijnholds authored Aug 14, 2020

b5ba758b

MBartForConditionalGeneration (#6441) · 680f1337

Suraj Patil authored Aug 14, 2020

* add MBartForConditionalGeneration

* style

* rebase and fixes

* add mbart test in TEST_FILES_WITH_NO_COMMON_TESTS

* fix docs

* don't ignore mbart

* doc

* fix mbart fairseq link

* put mbart before bart

* apply doc suggestions

680f1337

12 Aug, 2020 3 commits
- [EncoderDecoder] Add encoder-decoder for roberta/ vanilla longformer (#6411) · 0735def8
  Patrick von Platen authored Aug 12, 2020
```
* add encoder-decoder for roberta

* fix headmask

* apply Sylvain's suggestions

* fix typo

* Apply suggestions from code review
```
  0735def8
- Activate check on the CI (#6427) · a8db954c
  Sylvain Gugger authored Aug 12, 2020
```
* Activate check on the CI

* Fix repo inconsistencies

* Don't document too much
```
  a8db954c
- Move prediction_loss_only to TrainingArguments (#6426) · 34fabe16
  Sylvain Gugger authored Aug 12, 2020
  
  34fabe16
11 Aug, 2020 2 commits
- rename prepare_translation_batch -> prepare_seq2seq_batch (#6103) · be1520d3
  Sam Shleifer authored Aug 11, 2020
  
  be1520d3
- PegasusForConditionalGeneration (torch version) (#6340) · 66fa8cea
  Sam Shleifer authored Aug 11, 2020
```
Co-authored-by: Jingqing Zhang <jingqing.zhang15@imperial.ac.uk>
```
  66fa8cea
10 Aug, 2020 4 commits

TF Longformer (#5764) · 00bb0b25

Patrick von Platen authored Aug 10, 2020



* improve names and tests longformer

* more and better tests for longformer

* add first tf test

* finalize tf basic op functions

* fix merge

* tf shape test passes

* narrow down discrepancies

* make longformer local attn tf work

* correct tf longformer

* add first global attn function

* add more global longformer func

* advance tf longformer

* finish global attn

* upload big model

* finish all tests

* correct false any statement

* fix common tests

* make all tests pass except keras save load

* fix some tests

* fix torch test import

* finish tests

* fix test

* fix torch tf tests

* add docs

* finish docs

* Update src/transformers/modeling_longformer.py
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

* Update src/transformers/modeling_tf_longformer.py
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* apply Lysandre's suggestions

* reverse to assert statement because function will fail otherwise

* applying Sylvain's recommendations

* Update src/transformers/modeling_longformer.py
Co-authored-by: Sam Shleifer <sshleifer@gmail.com>

* Update src/transformers/modeling_tf_longformer.py
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: Sam Shleifer <sshleifer@gmail.com>

00bb0b25

Fix links for open in colab (#6391) · 06bc347c
Sylvain Gugger authored Aug 10, 2020

06bc347c
Colab button (#6389) · 3e0fe3cf
Sylvain Gugger authored Aug 10, 2020
```
* Add colab button

* Add colab link for tutorials
```
3e0fe3cf
Small docfile fixes (#6328) · 6028ed92
Sylvain Gugger authored Aug 10, 2020

6028ed92

07 Aug, 2020 1 commit

Add a script to check all models are tested and documented (#6298) · 6ba540b7

Sylvain Gugger authored Aug 07, 2020



* Add a script to check all models are tested and documented

* Apply suggestions from code review
Co-authored-by: Kevin Canwen Xu <canwenxu@126.com>

* Address comments
Co-authored-by: Kevin Canwen Xu <canwenxu@126.com>

6ba540b7

05 Aug, 2020 1 commit

Tf model outputs (#6247) · c67d1a02

Sylvain Gugger authored Aug 05, 2020

* TF outputs and test on BERT

* Albert to DistilBert

* All remaining TF models except T5

* Documentation

* One file forgotten

* Add new models and fix issues

* Quality improvements

* Add T5

* A bit of cleanup

* Fix for slow tests

* Style

c67d1a02

04 Aug, 2020 1 commit
- fix zero shot pipeline docs (#6245) · 972535ea
  Joe Davison authored Aug 04, 2020
  
  972535ea
03 Aug, 2020 2 commits

Remove outdated BERT tips (#6217) · 3c289fb3

Kevin Canwen Xu authored Aug 04, 2020

* Remove out-dated BERT tips

* Update modeling_outputs.py

* Update bert.rst

* Update bert.rst

3c289fb3

Doc pipelines (#6175) · e4920c92

Sylvain Gugger authored Aug 03, 2020



* Init work on pipelines doc

* Work in progress

* Work in progress

* Doc pipelines

* Rm unwanted default

* Apply suggestions from code review

Lysandre comments
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

e4920c92

01 Aug, 2020 1 commit
- Fixed typo in Longformer (#6180) · a39dfe4f
  Faiaz Rahman authored Aug 01, 2020
  
  a39dfe4f
31 Jul, 2020 3 commits

Harmonize both Trainers API (#6157) · 86caab1e
Sylvain Gugger authored Jul 31, 2020
```
* Harmonize both Trainers API

* Fix test

* main_prcess -> process_zero
```
86caab1e

Replace mecab-python3 with fugashi for Japanese tokenization (#6086) · cf3cf304

Paul O'Leary McCann authored Jul 31, 2020



* Replace mecab-python3 with fugashi

This replaces mecab-python3 with fugashi for Japanese tokenization. I am
the maintainer of both projects.

Both projects are MeCab wrappers, so the underlying C++ code is the
same. fugashi is the newer wrapper and doesn't use SWIG, so for basic
use of the MeCab API it's easier to use.

This code ensures the use of a version of ipadic installed via pip,
which should make versioning and tracking down issues easier.

fugashi has wheels for Windows, OSX, and Linux, which will help with
issues with installing old versions of mecab-python3 on Windows.
Compared to mecab-python3, because fugashi doesn't use SWIG, it doesn't
require a C++ runtime to be installed on Windows.

In adding this change I removed some code dealing with `cursor`,
`token_start`, and `token_end` variables. These variables didn't seem to
be used for anything; it is unclear to me why they were there.

I ran the tests and they passed, though I couldn't figure out how to run
the slow tests (`--runslow` gave an error) and didn't try testing with
Tensorflow.

* Style fix

* Remove unused variable

Forgot to delete this...

* Adapt doc with install instructions

* Fix typo
Co-authored-by: sgugger <sylvain.gugger@gmail.com>
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

cf3cf304
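
As a rough illustration of the setup the commit above describes (fugashi as the MeCab wrapper, with the dictionary coming from the pip-installed `ipadic` package), a tokenization helper might look like the sketch below. It is not the tokenizer code from the commit: `tokenize_japanese` and its `tagger` parameter are hypothetical, and the default branch assumes both `fugashi` and `ipadic` are installed.

```python
def tokenize_japanese(text, tagger=None):
    """Split Japanese text into surface forms using a MeCab-style tagger."""
    if tagger is None:
        # Imported lazily so the helper can be defined without the packages installed.
        import fugashi
        import ipadic

        # ipadic.MECAB_ARGS points MeCab at the dictionary shipped with the pip package.
        tagger = fugashi.GenericTagger(ipadic.MECAB_ARGS)
    return [word.surface for word in tagger(text)]


# Example call (requires: pip install fugashi ipadic):
# tokenize_japanese("日本語のテキスト")
```

Passing a `tagger` explicitly is only a convenience for testing or for reusing one tagger across many calls.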

Enable ONNX/ONNXRuntime optimizations through converter script (#6131) · 7231f7b5

Funtowicz Morgan authored Jul 31, 2020



* Add onnxruntime transformers optimization support
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Added Optimization section in ONNX/ONNXRuntime documentation.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Improve note reference
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Fixing imports order.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Add warning about different levels of optimization between torch and tf export.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Address @LysandreJik wording suggestion
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

* Address @LysandreJik wording suggestion
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

* Always optimize model before quantization for maximum performance.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Address comments on the documentation.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Improve TensorFlow optimization message as suggested by @yufenglee
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Removed --optimize parameter
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Warn the user about current quantization limitation when model is larger than 2GB.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Trigger CI for last check

* Small change in print for the optimization section.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

7231f7b5

30 Jul, 2020 3 commits

Doc tokenizer (#6110) · f3065abd

Sylvain Gugger authored Jul 30, 2020



* Start doc tokenizers

* Tokenizer documentation

* Formatting after rebase

* Formatting after merge

* Update docs/source/main_classes/tokenizer.rst
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

* Address comment

* Update src/transformers/tokenization_utils_base.py
Co-authored-by: Thomas Wolf <thomwolf@users.noreply.github.com>

* Address Thom's comments
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
Co-authored-by: Thomas Wolf <thomwolf@users.noreply.github.com>

f3065abd

Addition of a DialoguePipeline (#5516) · e642c789

guillaume-be authored Jul 30, 2020



* initial commit for pipeline implementation

Addition of input processing and history concatenation

* Conversation pipeline tested and working for single & multiple conversation inputs

* Added docstrings for dialogue pipeline

* Addition of dialogue pipeline integration tests

* Delete test_t5.py

* Fixed max code length

* Updated styling

* Fixed test broken by formatting tools

* Removed unused import

* Added unit test for DialoguePipeline

* Fixed Tensorflow compatibility

* Fixed multi-framework support using framework flag

* - Fixed docstring
- Added `min_length_for_response` as an initialization parameter
- Renamed `*args` to `conversations`, `conversations` being a `Conversation` or a `List[Conversation]`
- Updated truncation to truncate entire segments of conversations, instead of cutting in the middle of a user/bot input

* - renamed pipeline name from dialogue to conversational
- removed hardcoded default value of 1000 and use config.max_length instead
- added `append_response` and `set_history` method to the Conversation class to avoid direct fields mutation
- fixed bug in history truncation method

* - Updated ConversationalPipeline to accept only active conversations (otherwise a ValueError is raised)

* - Simplified input tensor conversion

* - Updated attention_mask value for Tensorflow compatibility

* - Updated last dialogue reference to conversational & fixed integration tests

* Fixed conflict with master

* Updates following review comments

* Updated formatting

* Added Conversation and ConversationalPipeline to the library __init__, addition of docstrings for Conversation, added both to the docs

* Update src/transformers/pipelines.py

Updated docstring following review
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

e642c789
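
One of the changes listed in the commit above is truncating entire segments of a conversation instead of cutting in the middle of a user/bot input. The idea can be sketched as follows. This is not the pipeline's implementation: `truncate_history` is a hypothetical helper, segments are modelled as plain lists of token ids, and dropping the oldest segments first is an assumption.

```python
from typing import List


def truncate_history(segments: List[List[int]], max_length: int) -> List[List[int]]:
    """Drop the oldest whole segments until the total token count fits max_length."""
    kept = list(segments)
    total = sum(len(segment) for segment in kept)
    # Always keep the most recent segment, even if it alone exceeds the budget.
    while len(kept) > 1 and total > max_length:
        total -= len(kept[0])
        kept = kept[1:]
    return kept


history = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
print(truncate_history(history, max_length=6))  # [[4, 5], [6, 7, 8, 9]]
```

No segment is ever split, so each remaining user or bot turn stays intact.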

Switch from return_tuple to return_dict (#6138) · 91cb9546

Sylvain Gugger authored Jul 30, 2020



* Switch from return_tuple to return_dict

* Fix test

* [WIP] Test TF Flaubert + Add {XLM, Flaubert}{TokenClassification, MultipleC… (#5614)

* Test TF Flaubert + Add {XLM, Flaubert}{TokenClassification, MultipleChoice} models and tests

* AutoModels


Tiny tweaks

* Style

* Final changes before merge

* Re-order for simpler review

* Final fixes

* Addressing @sgugger's comments

* Test MultipleChoice

* Rework TF trainer (#6038)

* Fully rework training/prediction loops

* fix method name

* Fix variable name

* Fix property name

* Fix scope

* Fix method name

* Fix tuple index

* Fix tuple index

* Fix indentation

* Fix variable name

* fix eval before log

* Add drop remainder for test dataset

* Fix step number + fix logging datetime

* fix eval loss value

* use global step instead of step + fix logging at step 0

* Fix logging datetime

* Fix global_step usage

* Fix breaking loop + logging datetime

* Fix step in prediction loop

* Fix step breaking

* Fix train/test loops

* Force TF at least 2.2 for the trainer

* Use assert_cardinality to facilitate the dataset size computation

* Log steps per epoch

* Make tfds compliant with TPU

* Make tfds compliant with TPU

* Use TF dataset enumerate instead of the Python one

* revert previous commit

* Fix data_dir

* Apply style

* rebase on master

* Address Sylvain's comments

* Address Sylvain's and Lysandre's comments

* Trigger CI

* Remove unused import

* Switch from return_tuple to return_dict

* Fix test

* Add recent model
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
Co-authored-by: Julien Plu <plu.julien@gmail.com>

91cb9546