1. 14 May, 2020 2 commits
  2. 13 May, 2020 2 commits
  3. 12 May, 2020 1 commit
    • Add MultipleChoice to TFTrainer [WIP] (#4270) · e4512aab
      Viktor Alm authored
      
      * Catch GPU list of length 1 and set to gpu0
      
      * Add MPC to trainer
      
      * Add MPC for TF
      
      * fix TF automodel for MPC and add Albert
      
      * Apply style
      
      * Fix import
      
      * Note to self: double check
      
      * Make output shapes (None, None) for the dataset generator
      
      * Add from_pt bool, which doesn't seem to work
      
      * Original checkpoint dir
      
      * Fix docstrings for automodel
      
      * Update readme and apply style
      
      * Colab should probably not be from users
      
      * Colabs should probably not be from users
      
      * Add colab
      
      * Update README.md
      
      * Update README.md
      
      * Cleanup __init__
      
      * Cleanup flake8 trailing comma
      
      * Update src/transformers/training_args_tf.py
      
      * Update src/transformers/modeling_tf_auto.py
      Co-authored-by: Viktor Alm <viktoralm@pop-os.localdomain>
      Co-authored-by: Julien Chaumond <chaumond@gmail.com>
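      The "(None, None)" output-shapes bullet above is easiest to see in code. A minimal sketch, assuming a toy generator and feature names rather than the PR's actual dataset code:

      ```python
      import tensorflow as tf

      def gen():
          # (num_choices, seq_len) input_ids plus a scalar label per example
          yield {"input_ids": [[0, 1, 2], [3, 4, 5]]}, 0

      dataset = tf.data.Dataset.from_generator(
          gen,
          output_types=({"input_ids": tf.int32}, tf.int64),
          # (None, None): neither the choice count nor the sequence length is
          # fixed, so differently padded batches all fit one dataset signature.
          output_shapes=({"input_ids": tf.TensorShape([None, None])}, tf.TensorShape([])),
      )
      ```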
  4. 11 May, 2020 1 commit
  5. 08 May, 2020 1 commit
  6. 07 May, 2020 5 commits
  7. 06 May, 2020 2 commits
    • TF version of the trainer (#4017) · aad50151
      Julien Plu authored
      * First commit to add a TF version of the trainer.
      
      * Make the TF trainer closer to what the PT trainer looks like
      
      * Refactoring common code between the PT and TF trainers into a util file.
      
      * Some bugfix + better similarity with the PT trainer
      
      * Add missing class in transformers init
      
      * Bugfix over prediction + use classification report instead of simple metrics
      
      * Fix name error
      
      * Fix optimization tests + style
      
      * Apply style
      
      * Several bugfix for multi-gpu training
      
      * Apply style
      
      * Apply style
      
      * Add glue example for the TF trainer
      
      * Several bugfixes + address the reviews
      
      * Fix on the TF training args file
      
      * Add a debug mode
      
      * Bugfix in utils_ner.py when segment_ids is None
      
      * Apply style
      
      * Apply style
      
      * Add TPU strategy
      
      * Fix selection strategy
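      The multi-GPU and TPU-strategy bullets above amount to picking a `tf.distribute` strategy. A hedged sketch of that selection logic; the function name and arguments are illustrative, not the trainer's actual internals:

      ```python
      import tensorflow as tf

      def get_strategy(tpu_name=None, n_gpus=1):
          if tpu_name is not None:
              resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu=tpu_name)
              tf.config.experimental_connect_to_cluster(resolver)
              tf.tpu.experimental.initialize_tpu_system(resolver)
              return tf.distribute.experimental.TPUStrategy(resolver)
          if n_gpus > 1:
              return tf.distribute.MirroredStrategy()
          return tf.distribute.OneDeviceStrategy(device="/gpu:0")

      strategy = get_strategy()
      with strategy.scope():
          pass  # build the model and optimizer here so variables are strategy-aware
      ```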
    • Simone Primarosa
  8. 02 May, 2020 3 commits
  9. 01 May, 2020 1 commit
  10. 29 Apr, 2020 1 commit
    • CDN urls (#4030) · 455c6390
      Julien Chaumond authored
      * [file_utils] use_cdn + documentation
      
      * Move to cdn. urls for weights
      
      * [urls] Hotfix for bert-base-japanese
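      A rough reconstruction of what the `use_cdn` flag switches between; the endpoint constants are from memory of that era's `file_utils` and may differ:

      ```python
      S3_PREFIX = "https://s3.amazonaws.com/models.huggingface.co/bert"  # assumed
      CDN_PREFIX = "https://cdn.huggingface.co"                          # assumed

      def hf_bucket_url(model_id: str, filename: str, use_cdn: bool = True) -> str:
          endpoint = CDN_PREFIX if use_cdn else S3_PREFIX
          return f"{endpoint}/{model_id}/{filename}"

      print(hf_bucket_url("bert-base-uncased", "pytorch_model.bin"))
      ```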
  11. 28 Apr, 2020 2 commits
  12. 24 Apr, 2020 2 commits
  13. 22 Apr, 2020 2 commits
    • Fixes #3877 · 1dc9b3c7
      Julien Chaumond authored
    • Trainer (#3800) · dd9d483d
      Julien Chaumond authored
      * doc
      
      * [tests] Add sample files for a regression task
      
      * [HUGE] Trainer
      
      * Feedback from @sshleifer
      
      * Feedback from @thomwolf + logging tweak
      
      * [file_utils] when downloading concurrently, get_from_cache will use the cached file for subsequent processes
      
      * [glue] Use default max_seq_length of 128 like before
      
      * [glue] move DataTrainingArguments around
      
      * [ner] Change interface of InputExample, and align run_{tf,pl}
      
      * Re-align the pl scripts a little bit
      
      * ner
      
      * [ner] Add integration test
      
      * Fix language_modeling with API tweak
      
      * [ci] Tweak loss target
      
      * Don't break console output
      
      * amp.initialize: model must be on the right device beforehand
      
      * [multiple-choice] update for Trainer
      
      * Re-align to 827d6d6e
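      The concurrent-download bullet above is a lock-then-check pattern. A minimal sketch assuming the `filelock` package and a hypothetical `download` helper; this is not file_utils' exact code:

      ```python
      import os
      from filelock import FileLock

      def get_from_cache(url: str, cache_path: str) -> str:
          # The first process to take the lock downloads; processes queued
          # behind it find the finished file and simply reuse it.
          with FileLock(cache_path + ".lock"):
              if not os.path.exists(cache_path):
                  download(url, cache_path)  # hypothetical helper
          return cache_path
      ```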
  14. 20 Apr, 2020 3 commits
  15. 18 Apr, 2020 1 commit
    • Cleanup fast tokenizers integration (#3706) · 827d6d6e
      Thomas Wolf authored
      
      * First pass on utility classes and python tokenizers
      
      * finishing cleanup pass
      
      * style and quality
      
      * Fix tests
      
      * Updating following @mfuntowicz comment
      
      * style and quality
      
      * Fix Roberta
      
      * fix batch_size/seq_length in BatchEncoding
      
      * add alignment methods + tests
      
      * Fix OpenAI and Transfo-XL tokenizers
      
      * adding trim_offsets=True default for GPT2 and RoBERTa
      
      * style and quality
      
      * fix tests
      
      * add_prefix_space in roberta
      
      * bump up tokenizers to rc7
      
      * style
      
      * unfortunately TensorFlow doesn't like these - removing shape/seq_len for now
      
      * Update src/transformers/tokenization_utils.py
      Co-Authored-By: Stefan Schweter <stefan@schweter.it>
      
      * Adding doc and docstrings
      
      * making flake8 happy
      Co-authored-by: Stefan Schweter <stefan@schweter.it>
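      A hedged usage sketch of the offset/alignment features mentioned above, shown with today's fast-tokenizer API (the era's exact method names may have differed). With trim_offsets=True, the default these commits set for GPT-2 and RoBERTa, leading whitespace is excluded from each token's character span:

      ```python
      from transformers import GPT2TokenizerFast

      text = "Hello world"
      tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
      enc = tokenizer(text, return_offsets_mapping=True)

      # Each token maps back to a character span of the original string;
      # with trimmed offsets, the span for " world" starts at the 'w'.
      for token_id, (start, end) in zip(enc["input_ids"], enc["offset_mapping"]):
          print(token_id, repr(text[start:end]))
      ```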
  16. 16 Apr, 2020 3 commits
  17. 15 Apr, 2020 1 commit
  18. 14 Apr, 2020 1 commit
  19. 13 Apr, 2020 1 commit
  20. 10 Apr, 2020 5 commits
    • Jin Young Sohn
    • Add `run_glue_tpu.py` that trains models on TPUs (#3702) · 551b4505
      Jin Young Sohn authored
      * Initial commit to get BERT + run_glue.py on TPU
      
      * Add README section for TPU and address comments.
      
      * Cleanup TPU bits from run_glue.py (#3)
      
      TPU runner is currently implemented in:
      https://github.com/pytorch-tpu/transformers/blob/tpu/examples/run_glue_tpu.py.
      
      We plan to upstream this directly into `huggingface/transformers`
      (either `master` or `tpu`) branch once it's been more thoroughly tested.
      
      * No need to call `xm.mark_step()` explicitly (#4)
      
      For gradient accumulation we're accumulating on batches from a
      `ParallelLoader` instance, which marks the step itself on next().
      
      * Resolve R/W conflicts from multiprocessing (#5)
      
      * Add XLNet in list of models for `run_glue_tpu.py` (#6)
      
      * Add RoBERTa to list of models in TPU GLUE (#7)
      
      * Add RoBERTa and DistilBert to list of models in TPU GLUE (#8)
      
      * Use barriers to reduce duplicate work/resources (#9)
      
      * Shard eval dataset and aggregate eval metrics (#10)
      
      * Shard eval dataset and aggregate eval metrics
      
      Also, instead of calling `eval_loss.item()` every time, do summation with
      tensors on device.
      
      * Change defaultdict to float
      
      * Reduce the pred, label tensors instead of metrics
      
      As brought up during review, some metrics like F1 cannot be aggregated
      via averaging. GLUE task metrics depend largely on the dataset, so we
      instead sync the prediction and label tensors so that the metrics can
      be computed accurately on those.
      
      * Only use tb_writer from master (#11)
      
      * Apply huggingface black code formatting
      
      * Style
      
      * Remove `--do_lower_case` as example uses cased
      
      * Add option to specify tensorboard logdir
      
      This is needed for our testing framework, which checks regressions
      against key metrics written by the summary writer.
      
      * Using configuration for `xla_device`
      
      * Prefix TPU specific comments.
      
      * num_cores clarification and namespace eval metrics
      
      * Cache features file under `args.cache_dir`
      
      Instead of under `args.data_dir`. This is needed as our test infra uses
      a `data_dir` on a read-only filesystem.
      
      * Rename `run_glue_tpu` to `run_tpu_glue`
      Co-authored-by: LysandreJik <lysandre.debut@reseau.eseo.fr>
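      Two of the bullets above are concrete torch_xla idioms. A hedged sketch assuming `model`, `optimizer`, `train_dataloader`, `preds`, and `labels` already exist and that batches are dicts; this is not the script's exact code:

      ```python
      import torch
      import torch_xla.core.xla_model as xm
      import torch_xla.distributed.parallel_loader as pl

      device = xm.xla_device()

      # "No need to call xm.mark_step() explicitly": the per-device loader
      # marks the XLA step on each next(), even while accumulating gradients.
      loader = pl.ParallelLoader(train_dataloader, [device]).per_device_loader(device)
      for batch in loader:
          loss = model(**batch)[0]
          loss.backward()
          xm.optimizer_step(optimizer)

      # "Reduce the pred, label tensors instead of metrics": F1 and friends
      # cannot be averaged across shards, so gather the raw tensors from all
      # cores and compute the metric once on the full set.
      preds = xm.mesh_reduce("eval_preds", preds, torch.cat)
      labels = xm.mesh_reduce("eval_labels", labels, torch.cat)
      ```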
    • Julien Chaumond
    • [examples] Generate argparsers from type hints on dataclasses (#3669) · b169ac9c
      Julien Chaumond authored
      * [examples] Generate argparsers from type hints on dataclasses
      
      * [HfArgumentParser] way simpler API
      
      * Restore run_language_modeling.py for easier diff
      
      * [HfArgumentParser] final tweaks from code review
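      A usage sketch of the feature this PR introduces; the dataclass below is a made-up example, not one of the library's argument classes:

      ```python
      from dataclasses import dataclass, field
      from transformers import HfArgumentParser

      @dataclass
      class ExampleArguments:
          output_dir: str = field(metadata={"help": "Where to write outputs."})
          learning_rate: float = 5e-5
          do_train: bool = False

      # Each field's type hint and default becomes an argparse argument, e.g.:
      #   python run.py --output_dir /tmp/out --learning_rate 3e-5 --do_train
      parser = HfArgumentParser(ExampleArguments)
      (args,) = parser.parse_args_into_dataclasses()
      print(args.learning_rate)
      ```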
    • Big cleanup of `glue_convert_examples_to_features` (#3688) · f98d0ef2
      Julien Chaumond authored
      * Big cleanup of `glue_convert_examples_to_features`
      
      * Use batch_encode_plus
      
      * Cleaner wrapping of glue_convert_examples_to_features for TF
      
      @lysandrejik
      
      * Cleanup syntax, thanks to @mfuntowicz
      
      * Raise explicit error in case of user error
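      A hedged sketch of the batch_encode_plus change: tokenize all (text_a, text_b) pairs in one call instead of looping example by example. Argument names follow the 2020-era API:

      ```python
      from transformers import BertTokenizer

      tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

      # One call for the whole batch of sentence pairs, padded to max_length,
      # replacing per-example encode_plus calls.
      batch = tokenizer.batch_encode_plus(
          [("premise one", "hypothesis one"), ("premise two", "hypothesis two")],
          max_length=128,
          pad_to_max_length=True,
      )
      print(len(batch["input_ids"]), len(batch["input_ids"][0]))  # 2 128
      ```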