- 20 Apr, 2020 4 commits
-
-
Patrick von Platen authored
* remove max_length = tokenizer.max_length when encoding
* make style
-
Mohamed El-Geish authored
* exbert links for my albert model cards
* Added exbert tag to the metadata block
* Adding "how to cite"
-
Sam Shleifer authored
-
ahotrod authored
-
- 18 Apr, 2020 6 commits
-
-
Patrick von Platen authored
-
Thomas Wolf authored
* First pass on utility classes and python tokenizers
* finishing cleanup pass
* style and quality
* Fix tests
* Updating following @mfuntowicz comment
* style and quality
* Fix Roberta
* fix batch_size/seq_length in BatchEncoding
* add alignment methods + tests
* Fix OpenAI and Transfo-XL tokenizers
* adding trim_offsets=True default for GPT2 and RoBERTa
* style and quality
* fix tests
* add_prefix_space in roberta
* bump up tokenizers to rc7
* style
* unfortunately tensorflow does not like these - removing shape/seq_len for now
* Update src/transformers/tokenization_utils.py
* Adding doc and docstrings
* making flake8 happy

Co-authored-by: Stefan Schweter <stefan@schweter.it>
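For reference, a minimal sketch of the `BatchEncoding` alignment methods this change introduces, assuming a fast (Rust-backed) tokenizer; the checkpoint name and example text are illustrative:

```python
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
encoding = tokenizer.encode_plus("Transformers are great", return_offsets_mapping=True)

# Map the first character of the text to its token index,
# then recover the character span that this token covers.
token_index = encoding.char_to_token(0)
char_span = encoding.token_to_chars(token_index)
print(token_index, char_span)
```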
-
-
Julien Chaumond authored
-
Benjamin Muller authored
-
Patrick von Platen authored
* better config serialization
* finish configuration utils
-
- 17 Apr, 2020 8 commits
-
-
Lysandre Debut authored
* XLM tokenizer should encode with bos token
* Update tests
-
Patrick von Platen authored
-
Patrick von Platen authored
-
Harutaka Kawamura authored
-
Santiago Castro authored
* Add support for the null answer in `QuestionAnsweringPipeline`
* black
* Fix min null score computation
* Fix a PR comment
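A minimal sketch of how the null answer can surface through the pipeline; the `handle_impossible_answer` flag and the SQuAD2-style checkpoint are assumptions based on the current API, not part of this commit:

```python
from transformers import pipeline

# A SQuAD2-trained model can predict that a question has no answer in the context.
qa = pipeline("question-answering", model="deepset/roberta-base-squad2")

result = qa(
    question="What color is the sky on Mars?",
    context="The sky on Earth is blue because of Rayleigh scattering.",
    handle_impossible_answer=True,  # assumed flag name: allow the empty/null answer
)
print(result)  # an empty answer string signals the null answer
```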
-
Simon Böhm authored
The `token_type_ids` are converted into the segment embedding. For question answering, this needs to indicate whether a token belongs to sequence 0 or 1. `encode_plus` takes care of setting this parameter correctly and automatically.
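A minimal sketch of that behavior, assuming a BERT-style tokenizer; the question/context strings are illustrative:

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# encode_plus builds "[CLS] question [SEP] context [SEP]" and sets
# token_type_ids to 0 for the question segment and 1 for the context.
encoding = tokenizer.encode_plus(
    "Who wrote Hamlet?",
    "Hamlet is a tragedy written by William Shakespeare.",
)
print(encoding["token_type_ids"])  # 0s for the question, 1s for the context
```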
-
Pierric Cistac authored
* Add TFAlbertForQuestionAnswering
* Add TFRobertaForQuestionAnswering
* Update TFAutoModel with Roberta/Albert for QA
* Clean `super` TF Albert calls
-
Patrick von Platen authored
-
- 16 Apr, 2020 12 commits
-
-
Sam Shleifer authored
renames `run_bart_sum.py` to `finetune.py`
-
Jonathan Sum authored
Changing from "fine-grained token-leven" to "fine-grained token-level"
-
Aryansh Omray authored
-
Sam Shleifer authored
-
Patrick von Platen authored
* Refactored use of newstest2013 to newstest2014. Fixed bug where argparse consumed the first command line argument as the model_size argument rather than using the default model_size, by forcing explicit --model_size flag inclusion
* More pythonic file handling through 'with' context
* COSMETIC - ran Black and isort
* Fixed reference to number of lines in newstest2014
* Fixed failing test. More pythonic file handling
* finish PR from tholiao
* remove commented-out lines
* make style
* make isort happy

Co-authored-by: Thomas Liao <tholiao@gmail.com>
-
Lysandre Debut authored
-
Davide Fiocco authored
-
Patrick von Platen authored
-
Patrick von Platen authored
* correct gpt2 test inputs
* make style
* delete modeling_gpt2 change in test file
* translate from pytorch
* correct tests
* fix conflicts
* fix conflicts
* fix conflicts
* fix conflicts
* make tensorflow t5 caching work
* make style
* clean reorder cache
* remove unnecessary spaces
* fix test
-
Patrick von Platen authored
-
Sam Shleifer authored
* Delete some copy-pasted code
-
Patrick von Platen authored
* add dialoGPT
* update README.md
* fix conflict
* update readme
* add code links to docs
* Update README.md
* Update dialo_gpt2.rst
* Update pretrained_models.rst
* Update docs/source/model_doc/dialo_gpt2.rst
* change filename of dialogpt

Co-authored-by: Julien Chaumond <chaumond@gmail.com>
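A minimal generation sketch for the DialoGPT model added here, assuming the `microsoft/DialoGPT-small` checkpoint and the current Auto classes; DialoGPT reuses the GPT-2 architecture, and each dialogue turn is terminated with the EOS token:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-small")
model = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-small")

# Each dialogue turn ends with the EOS token.
input_ids = tokenizer.encode("Hello, how are you?" + tokenizer.eos_token, return_tensors="pt")
reply_ids = model.generate(input_ids, max_length=100, pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(reply_ids[0, input_ids.shape[-1]:], skip_special_tokens=True))
```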
-
- 15 Apr, 2020 2 commits
-
-
Sam Shleifer authored
- adds pytorch-lightning dependency
-
Patrick von Platen authored
-
- 14 Apr, 2020 2 commits
-
-
Patrick von Platen authored
* remove output_past from pt
* make style
* add optional input length for gpt2
* add use cache to prepare input
* save memory in gpt2
* correct gpt2 test inputs
* make past input optional for gpt2
* finish use_cache for all models
* make style
* delete modeling_gpt2 change in test file
* correct docstring
* correct is true statements for gpt2
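A minimal sketch of the caching behavior `use_cache` controls, written against the current API (the argument is named `past_key_values` today; at the time of this commit it was `past`); feeding the cache back avoids recomputing attention over the whole prefix:

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

input_ids = tokenizer.encode("The cache saves", return_tensors="pt")
out = model(input_ids, use_cache=True)
past = out.past_key_values  # cached key/value tensors for every layer

# On the next step, only the newest token has to be fed to the model.
next_token = out.logits[:, -1:].argmax(dim=-1)
out = model(next_token, past_key_values=past, use_cache=True)
```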
-
Patrick von Platen authored
-
- 13 Apr, 2020 2 commits
-
-
Teven authored
* Shifting labels inside TransfoXLLMHead
* Changed doc to reflect change
* Updated pytorch test
* removed IDE whitespace changes
* black reformat

Co-authored-by: TevenLeScao <teven.lescao@gmail.com>
-
elk-cloner authored
-
- 11 Apr, 2020 2 commits
-
-
HenrykBorzymowski authored
* added model_cards for Polish SQuAD models
* corrected mistake in Polish design cards
* updated model_cards for squad2_dutch model
* added links to benchmark models

Co-authored-by: Henryk Borzymowski <henryk.borzymowski@pwc.com>
-
HUSEIN ZOLKEPLI authored
* add bert bahasa readme
* update readme
* update readme
* added xlnet
* added tiny-bert and fix xlnet readme
* added albert base
* added albert tiny
-
- 10 Apr, 2020 2 commits
-
-
Jin Young Sohn authored
-
Anthony MOI authored
-