Commits · 5282b31df4aea9849a1c51240096781bed6f30ec · chenpangpang / transformers

02 May, 2020 2 commits
- Update run_pl_ner.py (#4118) · 5282b31d
  William Falcon authored May 02, 2020
  
  5282b31d
- NER: parse args from .args file or JSON (#4110) · 1e616c0a
  Stefan Schweter authored May 02, 2020
```
* ner: parse args from .args file or JSON

* examples: mention json-based configuration file support for run_ner script
```
  1e616c0a
01 May, 2020 1 commit
- Merge pull request #3934 from huggingface/examples_args_from_files · b8686174
  Julien Chaumond authored Apr 30, 2020
```
[qol] example scripts: parse args from .args file or JSON
```
  b8686174
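The two commits above let example scripts read their arguments from a JSON file. The following is a minimal sketch of that idea using only the standard library; it is not the scripts' actual implementation. The `NerArguments` dataclass, its fields, and `args_from_json` are hypothetical names chosen for illustration.

```python
import json
from dataclasses import dataclass, fields


@dataclass
class NerArguments:
    # Hypothetical argument set; the real scripts define their own fields.
    model_name_or_path: str
    max_seq_length: int = 128
    do_train: bool = False


def args_from_json(json_text: str) -> NerArguments:
    """Build the argument dataclass from a JSON object, rejecting unknown keys."""
    data = json.loads(json_text)
    known = {f.name for f in fields(NerArguments)}
    unknown = set(data) - known
    if unknown:
        raise ValueError(f"Unknown configuration keys: {sorted(unknown)}")
    return NerArguments(**data)


# In practice the text would come from a file, e.g. open(path).read().
example_config = '{"model_name_or_path": "bert-base-cased", "max_seq_length": 256}'
ner_args = args_from_json(example_config)
print(ner_args)
```

Keys missing from the JSON fall back to the dataclass defaults, so a configuration file only needs to list the values that differ.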
29 Apr, 2020 1 commit

Julien Chaumond authored Apr 28, 2020

* [file_utils] use_cdn + documentation

* Move to cdn. urls for weights

* [urls] Hotfix for bert-base-japanese

455c6390

28 Apr, 2020 2 commits
- [isort] add known 3rd party to setup.cfg (#4053) · d714dfea
  Sam Shleifer authored Apr 28, 2020
```
* add known 3rd party to setup.cfg

* comment

* Update CONTRIBUTING.md
Co-authored-by: Julien Chaumond <chaumond@gmail.com>
```
  d714dfea
- [Generation] Generation should allow to start with empty prompt (#3993) · 18058574
  Patrick von Platen authored Apr 28, 2020
```
* fix empty prompt

* fix length in generation pipeline
```
  18058574
24 Apr, 2020 2 commits
- [examples] For convenience, also save the tokenizer · c8115260
  Julien Chaumond authored Apr 24, 2020
```
Close #3921
```
  c8115260
- Shuffle train subset for summarization example (#3909) · b0167632
  Cola authored Apr 24, 2020
```
* Shuffle train subset

* Cleaner shuffle
```
  b0167632
22 Apr, 2020 2 commits

- Fixes #3877 · 1dc9b3c7
  Julien Chaumond authored Apr 22, 2020
  
  1dc9b3c7

Trainer (#3800) · dd9d483d

Julien Chaumond authored Apr 21, 2020

* doc

* [tests] Add sample files for a regression task

* [HUGE] Trainer

* Feedback from @sshleifer

* Feedback from @thomwolf + logging tweak

* [file_utils] when downloading concurrently, get_from_cache will use the cached file for subsequent processes

* [glue] Use default max_seq_length of 128 like before

* [glue] move DataTrainingArguments around

* [ner] Change interface of InputExample, and align run_{tf,pl}

* Re-align the pl scripts a little bit

* ner

* [ner] Add integration test

* Fix language_modeling with API tweak

* [ci] Tweak loss target

* Don't break console output

* amp.initialize: model must be on right device before

* [multiple-choice] update for Trainer

* Re-align to 827d6d6e

dd9d483d

20 Apr, 2020 3 commits
- Fix bug in examples: double wrap into DataParallel during eval · b1ff0b2a
  Andrey Kulagin authored Apr 17, 2020
  
  b1ff0b2a
- Add `qas_id` to SquadResult and SquadExample (#3745) · c79b550d
  Jared T Nielsen authored Apr 20, 2020
```
* Add qas_id

* Fix incorrect name in squad.py

* Make output files optional for squad eval
```
  c79b550d
- [examples] fix summarization do_predict (#3866) · a504cb49
  Sam Shleifer authored Apr 20, 2020
  
  a504cb49
18 Apr, 2020 1 commit

Cleanup fast tokenizers integration (#3706) · 827d6d6e

Thomas Wolf authored Apr 18, 2020



* First pass on utility classes and python tokenizers

* finishing cleanup pass

* style and quality

* Fix tests

* Updating following @mfuntowicz comment

* style and quality

* Fix Roberta

* fix batch_size/seq_length in BatchEncoding

* add alignment methods + tests

* Fix OpenAI and Transfo-XL tokenizers

* adding trim_offsets=True default for GPT2 and RoBERTa

* style and quality

* fix tests

* add_prefix_space in roberta

* bump up tokenizers to rc7

* style

* unfortunately tensorflow doesn't like these - removing shape/seq_len for now

* Update src/transformers/tokenization_utils.py
Co-Authored-By: Stefan Schweter <stefan@schweter.it>

* Adding doc and docstrings

* making flake8 happy
Co-authored-by: Stefan Schweter <stefan@schweter.it>

827d6d6e

16 Apr, 2020 3 commits

- [examples] summarization/bart/finetune.py supports t5 (#3824) · f0c96faf
  Sam Shleifer authored Apr 16, 2020
```
renames `run_bart_sum.py` to `finetune.py`
```
  f0c96faf

[Examples, T5] Change newstest2013 to newstest2014 and clean up (#3817) · 80a16945

Patrick von Platen authored Apr 16, 2020



* Refactored use of newstest2013 to newstest2014. Fixed a bug where argparse consumed the first command-line argument as model_size instead of using the default model_size; the fix requires an explicit --model_size flag

* More pythonic file handling through 'with' context

* COSMETIC - ran Black and isort

* Fixed reference to number of lines in newstest2014

* Fixed failing test. More pythonic file handling

* finish PR from tholiao

* remove commented-out lines

* make style

* make isort happy
Co-authored-by: Thomas Liao <tholiao@gmail.com>

80a16945
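The argparse bug described in the commit above can be reproduced in isolation. The sketch below assumes the cause was a positional `model_size` argument declared with a `default`; argparse ignores the default of a positional unless `nargs="?"` is set, so the first command-line value is always consumed. The `input_path` and `output_path` arguments and the `build_parser` helper are hypothetical.

```python
import argparse


def build_parser(explicit_flag: bool) -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    if explicit_flag:
        # Fixed variant: an optional flag, so the default really applies.
        parser.add_argument("--model_size", default="t5-base")
    else:
        # Buggy variant: a positional without nargs is required,
        # so its default is never used.
        parser.add_argument("model_size", default="t5-base")
    parser.add_argument("input_path")
    parser.add_argument("output_path", nargs="?", default=None)
    return parser


argv = ["src.txt", "out.txt"]
buggy = build_parser(explicit_flag=False).parse_args(argv)
fixed = build_parser(explicit_flag=True).parse_args(argv)

print(buggy.model_size, buggy.input_path, buggy.output_path)
print(fixed.model_size, fixed.input_path, fixed.output_path)
```

With the positional variant, `src.txt` lands in `model_size` and every later argument shifts by one; with the flag variant, the default model size is kept and the paths go where the caller intended.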

- Typo fix (#3821) · b1e2368b
  Davide Fiocco authored Apr 16, 2020
  
  b1e2368b

15 Apr, 2020 1 commit
- [examples] unit test for run_bart_sum (#3544) · c59b1e68
  Sam Shleifer authored Apr 15, 2020
```
- adds pytorch-lightning dependency
```
  c59b1e68
14 Apr, 2020 1 commit

[Config, Caching] Remove `output_past` everywhere and replace by `use_cache` argument (#3734) · 01c37dcd

Patrick von Platen authored Apr 14, 2020

* remove output_past from pt

* make style

* add optional input length for gpt2

* add use cache to prepare input

* save memory in gpt2

* correct gpt2 test inputs

* make past input optional for gpt2

* finish use_cache for all models

* make style

* delete modeling_gpt2 change in test file

* correct docstring

* correct is true statements for gpt2

01c37dcd

13 Apr, 2020 1 commit
- fix dataset shuffling for Distributed training (huggingface#3721) (#3766) · 5ebd8989
  elk-cloner authored Apr 13, 2020
  
  5ebd8989
10 Apr, 2020 5 commits

- Fix glue_convert_examples_to_features API breakage (#3742) · 700ccf6e
  Jin Young Sohn authored Apr 10, 2020
  
  700ccf6e

Add `run_glue_tpu.py` that trains models on TPUs (#3702) · 551b4505

Jin Young Sohn authored Apr 10, 2020

* Initial commit to get BERT + run_glue.py on TPU

* Add README section for TPU and address comments.

* Cleanup TPU bits from run_glue.py (#3)

TPU runner is currently implemented in:
https://github.com/pytorch-tpu/transformers/blob/tpu/examples/run_glue_tpu.py.

We plan to upstream this directly into `huggingface/transformers`
(either `master` or `tpu`) branch once it's been more thoroughly tested.

* Cleanup TPU bits from run_glue.py

TPU runner is currently implemented in:
https://github.com/pytorch-tpu/transformers/blob/tpu/examples/run_glue_tpu.py.

We plan to upstream this directly into `huggingface/transformers`
(either `master` or `tpu`) branch once it's been more thoroughly tested.

* No need to call `xm.mark_step()` explicitly (#4)

For gradient accumulation we accumulate over batches from a
`ParallelLoader` instance, which marks the step itself on next().

* Resolve R/W conflicts from multiprocessing (#5)

* Add XLNet in list of models for `run_glue_tpu.py` (#6)

* Add RoBERTa to list of models in TPU GLUE (#7)

* Add RoBERTa and DistilBert to list of models in TPU GLUE (#8)

* Use barriers to reduce duplicate work/resources (#9)

* Shard eval dataset and aggregate eval metrics (#10)

* Shard eval dataset and aggregate eval metrics

Also, instead of calling `eval_loss.item()` every time, do the summation
with tensors on device.

* Change defaultdict to float

* Reduce the pred, label tensors instead of metrics

As brought up during review, some metrics like f1 cannot be aggregated
via averaging. GLUE task metrics depend largely on the dataset, so
we sync the prediction and label tensors so that the metrics can
be computed accurately on those instead.

* Only use tb_writer from master (#11)

* Apply huggingface black code formatting

* Style

* Remove `--do_lower_case` as example uses cased

* Add option to specify tensorboard logdir

This is needed for our testing framework which checks regressions
against key metrics written by the summary writer.

* Using configuration for `xla_device`

* Prefix TPU specific comments.

* num_cores clarification and namespace eval metrics

* Cache features file under `args.cache_dir`

Instead of under `args.data_dir`. This is needed as our test infra uses
data_dir with a read-only filesystem.

* Rename `run_glue_tpu` to `run_tpu_glue`
Co-authored-by: LysandreJik <lysandre.debut@reseau.eseo.fr>

551b4505
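The note in the commit above about reducing prediction and label tensors instead of metrics can be checked with a small calculation. This is a plain-Python sketch with made-up shard data, not the TPU script's code: it compares the mean of per-shard F1 scores with the F1 computed on the pooled predictions.

```python
def binary_f1(preds, labels):
    """F1 for the positive class: 2*TP / (2*TP + FP + FN)."""
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical eval shards, as if each core evaluated one slice of the data.
shard_preds = [[1, 1, 0, 0], [1, 1, 1, 1]]
shard_labels = [[1, 0, 0, 0], [1, 1, 1, 1]]

# Averaging per-shard metrics.
per_shard = [binary_f1(p, y) for p, y in zip(shard_preds, shard_labels)]
mean_of_f1 = sum(per_shard) / len(per_shard)

# Gathering predictions and labels first, then computing the metric once.
all_preds = [p for shard in shard_preds for p in shard]
all_labels = [y for shard in shard_labels for y in shard]
pooled_f1 = binary_f1(all_preds, all_labels)

print(f"mean of shard F1: {mean_of_f1:.4f}")
print(f"F1 on pooled data: {pooled_f1:.4f}")
```

The two numbers differ because F1 is a ratio of counts, not a per-example average, so only the pooled value matches what a single-device evaluation would report.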

- [docs] The use of `do_lower_case` in scripts is on its way to deprecation (#3738) · cbad305c
  Julien Chaumond authored Apr 10, 2020
  
  cbad305c

[examples] Generate argparsers from type hints on dataclasses (#3669) · b169ac9c

Julien Chaumond authored Apr 10, 2020

* [examples] Generate argparsers from type hints on dataclasses

* [HfArgumentParser] way simpler API

* Restore run_language_modeling.py for easier diff

* [HfArgumentParser] final tweaks from code review

b169ac9c
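The idea named in the commit above, generating an argument parser from type hints on a dataclass, can be sketched with the standard library alone. This is an illustrative stand-in and not the `HfArgumentParser` implementation; `TrainingArgs`, its fields, and `parser_from_dataclass` are hypothetical, and only simple `int`, `float`, and `str` fields are handled.

```python
import argparse
from dataclasses import MISSING, dataclass, fields


@dataclass
class TrainingArgs:
    # Hypothetical fields; each type hint doubles as the argparse converter.
    output_dir: str
    learning_rate: float = 5e-5
    max_seq_length: int = 128


def parser_from_dataclass(cls) -> argparse.ArgumentParser:
    """Create one --flag per dataclass field, typed by the field's annotation."""
    parser = argparse.ArgumentParser()
    for f in fields(cls):
        kwargs = {"type": f.type}
        if f.default is MISSING:
            kwargs["required"] = True  # no default means the flag is mandatory
        else:
            kwargs["default"] = f.default
        parser.add_argument(f"--{f.name}", **kwargs)
    return parser


namespace = parser_from_dataclass(TrainingArgs).parse_args(
    ["--output_dir", "out", "--max_seq_length", "256"]
)
parsed = TrainingArgs(**vars(namespace))
print(parsed)
```

Declaring arguments once as a dataclass keeps names, types, and defaults in a single place, and the parsed result is a typed object instead of a loose namespace.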

Big cleanup of `glue_convert_examples_to_features` (#3688) · f98d0ef2

Julien Chaumond authored Apr 10, 2020

* Big cleanup of `glue_convert_examples_to_features`

* Use batch_encode_plus

* Cleaner wrapping of glue_convert_examples_to_features for TF

@lysandrejik

* Cleanup syntax, thanks to @mfuntowicz

* Raise explicit error in case of user error

f98d0ef2

07 Apr, 2020 3 commits
- [Bart] Replace config.output_past with use_cache kwarg (#3632) · 715aa5b1
  Sam Shleifer authored Apr 07, 2020
  
  715aa5b1
- [examples] SummarizationDataset cleanup (#3451) · e344e3d4
  Sam Shleifer authored Apr 07, 2020
  
  e344e3d4
- [Examples, Benchmark] Improve benchmark utils (#3674) · 80fa0f78
  Patrick von Platen authored Apr 07, 2020
```
* improve and add features to benchmark utils

* update benchmark style

* remove output files
```
  80fa0f78
06 Apr, 2020 1 commit

Fix RoBERTa/XLNet Pad Token in run_multiple_choice.py (#3631) · e52d1258

Ethan Perez authored Apr 06, 2020

* Fix RoBERTa/XLNet Pad Token in run_multiple_choice.py

`convert_examples_to_features` sets `pad_token=0` by default, which is correct for BERT but incorrect for RoBERTa (`pad_token=1`) and XLNet (`pad_token=5`). I think the other arguments to `convert_examples_to_features` are correct, but it might be helpful if someone more familiar with this part of the codebase checked.

* Simplifying change to match recent commits

e52d1258
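The pad-token issue in the commit above comes down to padding with a model-specific id instead of a hardcoded `0`. The sketch below uses the three ids quoted in the commit message; the `PAD_TOKEN_ID` table and `pad_input_ids` helper are hypothetical, and real code would more likely read the id from the tokenizer than from a table.

```python
# Pad ids as quoted in the commit message above.
PAD_TOKEN_ID = {"bert": 0, "roberta": 1, "xlnet": 5}


def pad_input_ids(input_ids, max_length, model_type):
    """Right-pad input_ids to max_length and return (ids, attention_mask)."""
    pad_id = PAD_TOKEN_ID[model_type]
    padding_length = max_length - len(input_ids)
    if padding_length < 0:
        raise ValueError("input_ids is longer than max_length")
    padded = list(input_ids) + [pad_id] * padding_length
    # The mask marks real tokens with 1 and padding with 0.
    attention_mask = [1] * len(input_ids) + [0] * padding_length
    return padded, attention_mask


for model_type in ("bert", "roberta", "xlnet"):
    print(model_type, pad_input_ids([7, 8], 4, model_type))
```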

02 Apr, 2020 3 commits
- Resizing embedding matrix before sending it to the optimizer. (#3532) · c50aa67b
  Nicolas authored Apr 02, 2020
```
* Resizing the embedding matrix after sending it to the optimizer prevents the newly resized matrix from being updated.

* Remove a space for style reasons
```
  c50aa67b
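The ordering problem fixed by the commit above can be illustrated with a toy model of how an optimizer holds on to parameters. This is a pure-Python analogy and not PyTorch code: `ToyOptimizer` and `resize` are hypothetical, and the sketch assumes that resizing allocates a new parameter object, so an optimizer built earlier keeps updating the old one.

```python
class ToyOptimizer:
    """Hypothetical stand-in: remembers the parameter objects it was given."""

    def __init__(self, params, lr=0.1):
        self.params = list(params)
        self.lr = lr

    def step(self):
        for p in self.params:
            p["value"] = [v - self.lr * g for v, g in zip(p["value"], p["grad"])]


def resize(param, new_size):
    """Return a NEW parameter object with extra zero-initialized entries."""
    extra = new_size - len(param["value"])
    return {"value": param["value"] + [0.0] * extra, "grad": [1.0] * new_size}


# Wrong order: the optimizer captures the old parameter, then we resize.
old = {"value": [1.0, 1.0], "grad": [1.0, 1.0]}
opt = ToyOptimizer([old])
resized_late = resize(old, 3)
opt.step()

# Right order: resize first, then hand the new parameter to the optimizer.
resized_early = resize({"value": [1.0, 1.0], "grad": [1.0, 1.0]}, 3)
opt = ToyOptimizer([resized_early])
opt.step()

print("resized after optimizer: ", resized_late["value"])
print("resized before optimizer:", resized_early["value"])
```

In the first case the step only touches the stale object, so the resized parameter keeps its initial values; in the second case all three entries are updated.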
- Adding should_continue check for retraining (#3509) · 1b101599
  Mark Kockerbeck authored Apr 02, 2020
  
  1b101599
- [T5, examples] replace heavy t5 models with tiny random models (#3556) · ab5d06a0
  Patrick von Platen authored Apr 02, 2020
```
* replace heavy t5 models with tiny random models as was done by sshleifer

* fix isort
```
  ab5d06a0
01 Apr, 2020 1 commit
- Tokenizers: Start cleaning examples a little (#3455) · 50e15c82
  Julien Chaumond authored Apr 01, 2020
```
* Start cleaning examples

* Fixup
```
  50e15c82
31 Mar, 2020 1 commit

[Examples] Clean summarization and translation example testing files for T5 and Bart (#3514) · ae6834e0

Patrick von Platen authored Mar 31, 2020

* fix conflicts

* add model size argument to summarization

* correct wrong import

* fix isort

* correct imports

* other isort make style

* make style

ae6834e0

30 Mar, 2020 3 commits

[Bug fix] Using loaded checkpoint with --do_predict (instead of… (#3437) · e5c393dc

Ethan Perez authored Mar 30, 2020

* Using loaded checkpoint with --do_predict

Without this fix, I'm getting near-random validation performance for a trained model, and the validation performance differs per validation run. I think this happens since the `model` variable isn't set with the loaded checkpoint, so I'm using a randomly initialized model. Looking at the model activations, they differ each time I run evaluation (but they don't with this fix).

* Update checkpoint loading

* Fixing model loading

e5c393dc

- [bart-tiny-random] Put a 5MB model on S3 to allow faster exampl… (#3488) · 8deff3ac
  Sam Shleifer authored Mar 30, 2020
  
  8deff3ac

Update the NER TF script (#3511) · d38bbb22

Julien Plu authored Mar 30, 2020



* Update the NER TF script to remove the softmax and set the pad token label id to -1

* Reformat the quality and style
Co-authored-by: Julien Plu <julien.plu@adevinta.com>

d38bbb22

29 Mar, 2020 1 commit
- [Docs] examples/summarization/bart: Simplify CNN/DM preprocessi… (#3516) · 33ef7002
  Sam Shleifer authored Mar 29, 2020
  
  33ef7002
27 Mar, 2020 2 commits

Fix circle ci flaky fail of wmt example (#3485) · 17dceae7

Patrick von Platen authored Mar 27, 2020

* force bleu

* fix wrong file name

* rename file

* different filenames for each example test

* test files should clean up after themselves

* test files should clean up after themselves

* do not force bleu

* correct typo

* fix isort

17dceae7

run_ner.py / bert-base-multilingual-cased can output empty tokens (#2991) · b08259a1

Funtowicz Morgan authored Mar 27, 2020



* Use tokenizer.num_added_tokens to count number of added special_tokens instead of hardcoded numbers.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* run_ner.py - Do not add a label to the labels_ids if word_tokens is empty.

This can happen when using bert-base-multilingual-cased with an input containing a single space.
In this case, the tokenizer outputs an empty word_tokens list, leading to inconsistent behavior:
labels_ids ends up with one more entry than the tokens vector.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

b08259a1
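The empty-token case described in the commit above can be reproduced with a toy tokenizer. This is a sketch under stated assumptions and not the `run_ner.py` code: `toy_tokenize`, `align_labels`, the label map, and the `pad_label_id` value are all hypothetical, and the toy tokenizer simply returns no tokens for whitespace-only input.

```python
def toy_tokenize(word):
    """Hypothetical tokenizer: 3-character pieces, nothing for whitespace."""
    word = word.strip()
    return [word[i:i + 3] for i in range(0, len(word), 3)]


def align_labels(words, labels, label_map, pad_label_id=-100, skip_empty=True):
    """Label the first sub-token of each word; pad the remaining sub-tokens."""
    tokens, label_ids = [], []
    for word, label in zip(words, labels):
        word_tokens = toy_tokenize(word)
        if skip_empty and not word_tokens:
            continue  # the guard: no tokens means no label either
        tokens.extend(word_tokens)
        label_ids.extend([label_map[label]] + [pad_label_id] * (len(word_tokens) - 1))
    return tokens, label_ids


label_map = {"O": 0, "B-LOC": 1}
words = ["Berlin", " ", "is"]
labels = ["B-LOC", "O", "O"]

tokens_ok, labels_ok = align_labels(words, labels, label_map)
tokens_bad, labels_bad = align_labels(words, labels, label_map, skip_empty=False)

print(len(tokens_ok), len(labels_ok))
print(len(tokens_bad), len(labels_bad))
```

Without the guard, the whitespace-only word contributes a label but no token, which leaves the label list one entry longer than the token list.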