- 18 Apr, 2020 5 commits
-
-
Thomas Wolf authored
* First pass on utility classes and python tokenizers
* finishing cleanup pass
* style and quality
* Fix tests
* Updating following @mfuntowicz comment
* style and quality
* Fix Roberta
* fix batch_size/seq_length in BatchEncoding
* add alignment methods + tests
* Fix OpenAI and Transfo-XL tokenizers
* adding trim_offsets=True default for GPT2 and RoBERTa
* style and quality
* fix tests
* add_prefix_space in roberta
* bump up tokenizers to rc7
* style
* unfortunately tensorflow does not like these - removing shape/seq_len for now
* Update src/transformers/tokenization_utils.py Co-Authored-By: Stefan Schweter <stefan@schweter.it>
* Adding doc and docstrings
* making flake8 happy
Co-authored-by: Stefan Schweter <stefan@schweter.it>
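For context, a minimal sketch of the fast-tokenizer path these changes target. The checkpoint name, the `return_offsets_mapping` flag, and the `char_to_token` helper are assumptions drawn from the commit description, not guaranteed API for every version:

```python
# Illustrative only: a fast tokenizer returning a BatchEncoding with offsets and
# an alignment helper, as described in the commit above.
from transformers import RobertaTokenizerFast

tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base", add_prefix_space=True)

encoding = tokenizer.encode_plus("Hello world", return_offsets_mapping=True)

print(encoding["input_ids"])
print(encoding["offset_mapping"])   # (char_start, char_end) span for each token
# One of the "alignment methods" mentioned above (assumed name): map a character
# position in the original string back to its token index.
print(encoding.char_to_token(6))    # character 6 is the "w" of "world"
```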
-
-
Julien Chaumond authored
-
Benjamin Muller authored
-
Patrick von Platen authored
* better config serialization * finish configuration utils
-
- 17 Apr, 2020 8 commits
-
-
Lysandre Debut authored
* XLM tokenizer should encode with bos token * Update tests
-
Patrick von Platen authored
-
Patrick von Platen authored
-
Harutaka Kawamura authored
-
Santiago Castro authored
* Add support for the null answer in `QuestionAnsweringPipeline`
* black
* Fix min null score computation
* Fix a PR comment
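A rough usage sketch of the null-answer support; the `handle_impossible_answer` flag name is an assumption here, not quoted from the commit:

```python
from transformers import pipeline

qa = pipeline("question-answering")
result = qa(
    question="Who wrote the book?",
    # The context deliberately does not contain an answer.
    context="The weather in Paris was sunny all week.",
    handle_impossible_answer=True,  # assumed flag: allow an empty/null answer to win
)
print(result)  # e.g. {"score": ..., "start": 0, "end": 0, "answer": ""} when no span is found
```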
-
Simon Böhm authored
token_type_ids is converted into the segment embedding. For question answering, it needs to indicate whether a token belongs to sequence 0 or 1. encode_plus takes care of setting this parameter correctly and automatically.
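A minimal sketch of the pattern being fixed, with an arbitrary BERT checkpoint:

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Let encode_plus build token_type_ids for the (question, context) pair instead of
# constructing segment ids by hand.
inputs = tokenizer.encode_plus("Who wrote it?", "It was written by Jane.", return_tensors="pt")

# 0 marks tokens of sequence 0 (the question), 1 marks tokens of sequence 1 (the context).
print(inputs["token_type_ids"])
```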
-
Pierric Cistac authored
* Add TFAlbertForQuestionAnswering
* Add TFRobertaForQuestionAnswering
* Update TFAutoModel with Roberta/Albert for QA
* Clean `super` TF Albert calls
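A hedged sketch of how the new TF question-answering heads are meant to be used; the checkpoint is a placeholder without a trained QA head, and the tuple output layout is assumed for this version:

```python
import tensorflow as tf
from transformers import RobertaTokenizer, TFRobertaForQuestionAnswering

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
model = TFRobertaForQuestionAnswering.from_pretrained("roberta-base")  # QA head is untrained here

inputs = tokenizer.encode_plus("Who wrote it?", "It was written by Jane.", return_tensors="tf")
start_logits, end_logits = model(inputs)[:2]  # (start_logits, end_logits) layout assumed

start = int(tf.argmax(start_logits, axis=-1)[0])
end = int(tf.argmax(end_logits, axis=-1)[0])
ids = inputs["input_ids"][0].numpy()
print(tokenizer.decode(ids[start : end + 1]))
```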
-
Patrick von Platen authored
-
- 16 Apr, 2020 12 commits
-
-
Sam Shleifer authored
renames `run_bart_sum.py` to `finetune.py`
-
Jonathan Sum authored
Changing from "fine-grained token-leven" to "fine-grained token-level"
-
Aryansh Omray authored
-
Sam Shleifer authored
-
Patrick von Platen authored
* Refactored use of newstest2013 to newstest2014. Fixed a bug where argparse consumed the first command-line argument as the model_size argument instead of using the default, by requiring an explicit --model_size flag
* More pythonic file handling through 'with' context
* COSMETIC - ran Black and isort
* Fixed reference to number of lines in newstest2014
* Fixed failing test. More pythonic file handling
* finish PR from tholiao
* remove commented-out lines
* make style
* make isort happy
Co-authored-by: Thomas Liao <tholiao@gmail.com>
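A generic illustration of the argparse fix described above (not the script's actual code): making model_size an explicit optional flag with a default, so a positional argument such as the input file can no longer be swallowed as the model size:

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("input_file", help="file to evaluate")
parser.add_argument(
    "--model_size",
    default="base",
    choices=["base", "large"],
    help="must be passed explicitly as a flag; falls back to the default otherwise",
)
args = parser.parse_args()
print(args.input_file, args.model_size)
```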
-
Lysandre Debut authored
-
Davide Fiocco authored
-
Patrick von Platen authored
-
Patrick von Platen authored
* correct gpt2 test inputs
* make style
* delete modeling_gpt2 change in test file
* translate from pytorch
* correct tests
* fix conflicts
* fix conflicts
* fix conflicts
* fix conflicts
* make tensorflow t5 caching work
* make style
* clean reorder cache
* remove unnecessary spaces
* fix test
-
Patrick von Platen authored
-
Sam Shleifer authored
* Delete some copy pasted code
-
Patrick von Platen authored
* add dialoGPT
* update README.md
* fix conflict
* update readme
* add code links to docs
* Update README.md
* Update dialo_gpt2.rst
* Update pretrained_models.rst
* Update docs/source/model_doc/dialo_gpt2.rst Co-Authored-By: Julien Chaumond <chaumond@gmail.com>
* change filename of dialogpt
Co-authored-by: Julien Chaumond <chaumond@gmail.com>
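A quick sketch of loading the newly documented checkpoints; the hub name `microsoft/DialoGPT-medium` and the single-turn setup are assumptions for illustration:

```python
from transformers import AutoModelWithLMHead, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-medium")
model = AutoModelWithLMHead.from_pretrained("microsoft/DialoGPT-medium")

# Encode one user turn, append the end-of-sequence token, and let the model reply.
input_ids = tokenizer.encode("Hello, how are you?" + tokenizer.eos_token, return_tensors="pt")
reply_ids = model.generate(input_ids, max_length=100, pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(reply_ids[0, input_ids.shape[-1]:], skip_special_tokens=True))
```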
-
- 15 Apr, 2020 2 commits
-
-
Sam Shleifer authored
- adds pytorch-lightning dependency
-
Patrick von Platen authored
-
- 14 Apr, 2020 2 commits
-
-
Patrick von Platen authored
* remove output_past from pt
* make style
* add optional input length for gpt2
* add use cache to prepare input
* save memory in gpt2
* correct gpt2 test inputs
* make past input optional for gpt2
* finish use_cache for all models
* make style
* delete modeling_gpt2 change in test file
* correct docstring
* correct is true statements for gpt2
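A hedged sketch of the incremental-decoding pattern use_cache enables; the `past` argument name and the (logits, past) output layout reflect the API of this period and may differ in other versions:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

input_ids = tokenizer.encode("The cache avoids recomputing", return_tensors="pt")
with torch.no_grad():
    # First pass over the whole prompt builds the key/value cache.
    logits, past = model(input_ids, use_cache=True)[:2]
    next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)
    # Later passes feed only the newly generated token plus the cached keys/values.
    logits, past = model(next_id, past=past, use_cache=True)[:2]
```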
-
Patrick von Platen authored
-
- 13 Apr, 2020 2 commits
-
-
Teven authored
* Shifting labels inside TransfoXLLMHead
* Changed doc to reflect change
* Updated pytorch test
* removed IDE whitespace changes
* black reformat
Co-authored-by: TevenLeScao <teven.lescao@gmail.com>
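A minimal sketch of the behavior described above: with the shift done inside the model, the caller passes labels identical to input_ids. The `labels` argument name and the return layout are assumptions for this version:

```python
import torch
from transformers import TransfoXLLMHeadModel, TransfoXLTokenizer

tokenizer = TransfoXLTokenizer.from_pretrained("transfo-xl-wt103")
model = TransfoXLLMHeadModel.from_pretrained("transfo-xl-wt103")

input_ids = torch.tensor([tokenizer.encode("the quick brown fox jumps")])
outputs = model(input_ids, labels=input_ids)  # labels are shifted internally now
losses = outputs[0]                           # per-token losses (assumed layout)
print(losses.mean())
```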
-
elk-cloner authored
-
- 11 Apr, 2020 2 commits
-
-
HenrykBorzymowski authored
* added model_cards for polish squad models
* corrected mistake in polish design cards
* updated model_cards for squad2_dutch model
* added links to benchmark models
Co-authored-by: Henryk Borzymowski <henryk.borzymowski@pwc.com>
-
HUSEIN ZOLKEPLI authored
* add bert bahasa readme
* update readme
* update readme
* added xlnet
* added tiny-bert and fix xlnet readme
* added albert base
* added albert tiny
-
- 10 Apr, 2020 7 commits
-
-
Jin Young Sohn authored
-
Anthony MOI authored
-
Jin Young Sohn authored
* Initial commit to get BERT + run_glue.py on TPU
* Add README section for TPU and address comments.
* Cleanup TPU bits from run_glue.py (#3) TPU runner is currently implemented in: https://github.com/pytorch-tpu/transformers/blob/tpu/examples/run_glue_tpu.py. We plan to upstream this directly into `huggingface/transformers` (either `master` or `tpu`) branch once it's been more thoroughly tested.
* No need to call `xm.mark_step()` explicitly (#4) Since for gradient accumulation we're accumulating on batches from the `ParallelLoader` instance, which marks the step itself on next().
* Resolve R/W conflicts from multiprocessing (#5)
* Add XLNet in list of models for `run_glue_tpu.py` (#6)
* Add RoBERTa to list of models in TPU GLUE (#7)
* Add RoBERTa and DistilBert to list of models in TPU GLUE (#8)
* Use barriers to reduce duplicate work/resources (#9)
* Shard eval dataset and aggregate eval metrics (#10)
* Shard eval dataset and aggregate eval metrics Also, instead of calling `eval_loss.item()` every time, do summation with tensors on device.
* Change defaultdict to float
* Reduce the pred, label tensors instead of metrics As brought up during review, some metrics like f1 cannot be aggregated via averaging. GLUE task metrics depend largely on the dataset, so instead we sync the prediction and label tensors so that the metrics can be computed accurately on those.
* Only use tb_writer from master (#11)
* Apply huggingface black code formatting
* Style
* Remove `--do_lower_case` as example uses cased
* Add option to specify tensorboard logdir This is needed for our testing framework which checks regressions against key metrics written by the summary writer.
* Using configuration for `xla_device`
* Prefix TPU specific comments.
* num_cores clarification and namespace eval metrics
* Cache features file under `args.cache_dir` Instead of under `args.data_dir`. This is needed as our test infra uses data_dir with a read-only filesystem.
* Rename `run_glue_tpu` to `run_tpu_glue`
Co-authored-by: LysandreJik <lysandre.debut@reseau.eseo.fr>
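A rough, simplified sketch of the torch_xla pattern the runner relies on (device placement, ParallelLoader, and the XLA optimizer step); this is not the actual run_glue_tpu.py code:

```python
import torch_xla.core.xla_model as xm
import torch_xla.distributed.parallel_loader as pl


def train_one_epoch(model, optimizer, dataloader):
    device = xm.xla_device()
    model.to(device)
    # ParallelLoader feeds batches to the XLA device and marks the step as the
    # iterator advances, so no explicit xm.mark_step() is needed in the loop.
    para_loader = pl.ParallelLoader(dataloader, [device])
    for batch in para_loader.per_device_loader(device):
        optimizer.zero_grad()
        loss = model(**batch)[0]
        loss.backward()
        xm.optimizer_step(optimizer)  # all-reduces gradients across TPU cores
```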
-
Julien Chaumond authored
-
Julien Chaumond authored
* [examples] Generate argparsers from type hints on dataclasses
* [HfArgumentParser] way simpler API
* Restore run_language_modeling.py for easier diff
* [HfArgumentParser] final tweaks from code review
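A small sketch of the dataclass-driven parsing; the field names here are illustrative, not the example scripts' actual arguments:

```python
from dataclasses import dataclass, field
from typing import Optional

from transformers import HfArgumentParser


@dataclass
class ModelArguments:
    model_name_or_path: str = field(metadata={"help": "Path to pretrained model or model id"})
    cache_dir: Optional[str] = field(default=None, metadata={"help": "Where to cache downloads"})


@dataclass
class MyTrainingArguments:
    learning_rate: float = field(default=5e-5)
    num_train_epochs: int = field(default=3)


# Each dataclass becomes a group of argparse arguments generated from the type hints.
parser = HfArgumentParser((ModelArguments, MyTrainingArguments))
model_args, training_args = parser.parse_args_into_dataclasses()
print(model_args.model_name_or_path, training_args.learning_rate)
```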
-
Sam Shleifer authored
- support mbart-en-ro weights - add MBartTokenizer
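A heavily hedged sketch of the new tokenizer; the hub name `facebook/mbart-large-en-ro` is an assumption, as is the detail that a language-code special token is appended:

```python
from transformers import MBartTokenizer

tokenizer = MBartTokenizer.from_pretrained("facebook/mbart-large-en-ro")  # assumed checkpoint name
ids = tokenizer.encode("UN Chief Says There Is No Military Solution in Syria")
print(ids)  # sentencepiece ids, with the language-code special token appended (assumed)
```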
-
Julien Chaumond authored
* Big cleanup of `glue_convert_examples_to_features`
* Use batch_encode_plus
* Cleaner wrapping of glue_convert_examples_to_features for TF @lysandrejik
* Cleanup syntax, thanks to @mfuntowicz
* Raise explicit error in case of user error
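A simplified sketch of the batch_encode_plus-based approach (not the actual glue_convert_examples_to_features code); the padding keyword reflects the API of this period:

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

pairs = [("The cat sat.", "A cat was sitting."), ("It rained.", "The sun was out.")]
batch = tokenizer.batch_encode_plus(
    pairs,
    max_length=128,
    pad_to_max_length=True,  # later versions spell this padding="max_length"
)
print(len(batch["input_ids"]), len(batch["input_ids"][0]))
```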
-