1. 13 Apr, 2020 2 commits
  2. 11 Apr, 2020 2 commits
  3. 10 Apr, 2020 7 commits
      Update tokenizers to 0.7.0-rc5 (#3705) · b7cf9f43
      Anthony MOI authored
      Add `run_glue_tpu.py` that trains models on TPUs (#3702) · 551b4505
      Jin Young Sohn authored
      * Initial commit to get BERT + run_glue.py on TPU
      
      * Add README section for TPU and address comments.
      
      * Cleanup TPU bits from run_glue.py (#3)
      
      TPU runner is currently implemented in:
      https://github.com/pytorch-tpu/transformers/blob/tpu/examples/run_glue_tpu.py.
      
      We plan to upstream this directly into `huggingface/transformers`
      (either `master` or `tpu`) branch once it's been more thoroughly tested.
      
      * No need to call `xm.mark_step()` explicitly (#4)
      
      For gradient accumulation we accumulate over batches from a
      `ParallelLoader` instance, which already marks the step itself on `next()`.
      
      * Resolve R/W conflicts from multiprocessing (#5)
      
      * Add XLNet in list of models for `run_glue_tpu.py` (#6)
      
      * Add RoBERTa to list of models in TPU GLUE (#7)
      
      * Add RoBERTa and DistilBert to list of models in TPU GLUE (#8)
      
      * Use barriers to reduce duplicate work/resources (#9)
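The barrier idea can be sketched with stdlib threading (the real code would use an XLA rendezvous; the names and data here are purely illustrative): one rank does the expensive one-time work, and the others wait at the barrier instead of repeating it.

```python
import threading

NUM_WORKERS = 4
barrier = threading.Barrier(NUM_WORKERS)
cache = {}

def worker(rank, results):
    # Only rank 0 does the expensive one-time work (stand-in for e.g.
    # building the feature cache); everyone else waits at the barrier.
    if rank == 0:
        cache["features"] = [i * i for i in range(10)]
    barrier.wait()  # all ranks meet here; the cache now exists for everyone
    results[rank] = cache["features"][-1]

results = {}
threads = [threading.Thread(target=worker, args=(r, results))
           for r in range(NUM_WORKERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# every rank reads the same cached value instead of recomputing it
assert all(v == 81 for v in results.values())
```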
      
      * Shard eval dataset and aggregate eval metrics (#10)
      
      * Shard eval dataset and aggregate eval metrics
      
      Also, instead of calling `eval_loss.item()` on every step, do the
      summation with tensors on device.
      
      * Change defaultdict to float
      
      * Reduce the pred, label tensors instead of metrics
      
      As brought up during review, some metrics like F1 cannot be aggregated
      by averaging. Which metric a GLUE task uses depends largely on the
      dataset, so instead we sync the prediction and label tensors so that
      the metrics can be computed accurately on those.
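A tiny pure-Python example (toy data, not the actual GLUE pipeline) shows why averaging per-shard F1 scores is wrong, while computing F1 once over the gathered predictions and labels is not:

```python
def f1(preds, labels):
    # Standard binary F1 from true positives / false positives / false negatives.
    tp = sum(p == l == 1 for p, l in zip(preds, labels))
    fp = sum(p == 1 and l == 0 for p, l in zip(preds, labels))
    fn = sum(p == 0 and l == 1 for p, l in zip(preds, labels))
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Two eval shards, as if the dataset were split across TPU cores.
shard1 = dict(preds=[0, 0, 0, 0], labels=[1, 0, 0, 0])
shard2 = dict(preds=[1, 1, 1, 0], labels=[1, 1, 1, 0])

# Wrong: average the per-shard F1 scores.
avg_f1 = (f1(**shard1) + f1(**shard2)) / 2  # (0.0 + 1.0) / 2 = 0.5

# Right: gather predictions and labels, then compute F1 once.
global_f1 = f1(shard1["preds"] + shard2["preds"],
               shard1["labels"] + shard2["labels"])  # 6/7

assert avg_f1 != global_f1
```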
      
      * Only use tb_writer from master (#11)
      
      * Apply huggingface black code formatting
      
      * Style
      
      * Remove `--do_lower_case` as example uses cased
      
      * Add option to specify tensorboard logdir
      
      This is needed for our testing framework, which checks regressions
      against key metrics written by the summary writer.
      
      * Using configuration for `xla_device`
      
      * Prefix TPU specific comments.
      
      * num_cores clarification and namespace eval metrics
      
      * Cache features file under `args.cache_dir`
      
      Instead of under `args.data_dir`. This is needed as our test infra uses
      data_dir with a read-only filesystem.
      
      * Rename `run_glue_tpu` to `run_tpu_glue`
      Co-authored-by: LysandreJik <lysandre.debut@reseau.eseo.fr>
      [examples] Generate argparsers from type hints on dataclasses (#3669) · b169ac9c
      Julien Chaumond authored
      * [examples] Generate argparsers from type hints on dataclasses
      
      * [HfArgumentParser] way simpler API
      
      * Restore run_language_modeling.py for easier diff
      
      * [HfArgumentParser] final tweaks from code review
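The idea — derive an argparse parser from a dataclass's type hints and defaults — can be sketched as follows. This is a simplified stand-in, not the actual `HfArgumentParser` API, and the dataclass fields are hypothetical:

```python
import argparse
from dataclasses import MISSING, dataclass, fields

@dataclass
class ExampleArguments:
    # Hypothetical fields for illustration; not the real TrainingArguments.
    model_name: str              # no default -> required CLI flag
    learning_rate: float = 5e-5
    num_epochs: int = 3

def parser_from_dataclass(cls):
    """Build an argparse parser from a dataclass's type hints and defaults."""
    parser = argparse.ArgumentParser()
    for f in fields(cls):
        if f.default is MISSING:
            parser.add_argument(f"--{f.name}", type=f.type, required=True)
        else:
            parser.add_argument(f"--{f.name}", type=f.type, default=f.default)
    return parser

ns = parser_from_dataclass(ExampleArguments).parse_args(
    ["--model_name", "bert-base-cased", "--learning_rate", "3e-5"]
)
args = ExampleArguments(**vars(ns))
assert args.learning_rate == 3e-5 and args.num_epochs == 3
```

Each field's annotation doubles as the `type=` converter, and the presence of a default decides whether the flag is required.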
      Multilingual BART (#3602) · 7a7fdf71
      Sam Shleifer authored
      - support mbart-en-ro weights
      - add MBartTokenizer
      Big cleanup of `glue_convert_examples_to_features` (#3688) · f98d0ef2
      Julien Chaumond authored
      * Big cleanup of `glue_convert_examples_to_features`
      
      * Use batch_encode_plus
      
      * Cleaner wrapping of glue_convert_examples_to_features for TF
      
      @lysandrejik
      
      * Cleanup syntax, thanks to @mfuntowicz
      
      * Raise explicit error in case of user error
  4. 09 Apr, 2020 5 commits
  5. 08 Apr, 2020 6 commits
  6. 07 Apr, 2020 8 commits
  7. 06 Apr, 2020 10 commits
    • Teven authored · 0a9d09b4
      Tokenizers v3.0.0 (#3185) · 96ab75b8
      Funtowicz Morgan authored
      * Renamed num_added_tokens to num_special_tokens_to_add
      Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
      
      * Cherry-pick: partially fix space-only input without special tokens added to the output (#3091)
      Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
      
      * Added property is_fast on PretrainedTokenizer and PretrainedTokenizerFast
      Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
      
      * Make fast tokenizers unittests work on Windows.
      
      * Entirely refactored unittests for fast tokenizers.
      
      * Remove ABC class for CommonFastTokenizerTest
      
      * Added embeded_special_tokens tests from allenai @dirkgr
      
      * Make embeded_special_tokens tests from allenai more generic
      
      * Uniformize vocab_size as a property for both Fast and normal tokenizers
      
      * Move special tokens handling out of PretrainedTokenizer (SpecialTokensMixin)
      
      * Ensure providing None input raises the same ValueError as the Python tokenizer, + tests.
      
      * Fix invalid input for assert_padding when testing batch_encode_plus
      
      * Move add_special_tokens from constructor to tokenize/encode/[batch_]encode_plus methods parameter.
      
      * Ensure tokenize() correctly forwards add_special_tokens to Rust.
      
      * Add None checking on top of encode / encode_batch for TransfoXLTokenizerFast.
      Avoid stripping on None values.
      
      * unittests ensure tokenize() also throws a ValueError if provided None
      
      * Added add_special_tokens unittest for all supported models.
      
      * Style
      
      * Make sure TransfoXL tests run only if PyTorch is provided.
      
      * Split up tokenizers tests for each model type.
      
      * Fix invalid unittest with new tokenizers API.
      
      * Filter out Roberta openai detector models from unittests.
      
      * Introduce BatchEncoding on fast tokenizers path.
      
      This new structure exposes all the mappings retrieved from Rust.
      It also keeps the current behavior with model forward.
      
      * Introduce BatchEncoding on slow tokenizers path.
      
      Backward compatibility.
      
      * Improve error message on BatchEncoding for slow path
      
      * Make add_prefix_space True by default on Roberta fast to match Python in majority of cases.
      
      * Style and format.
      
      * Added typing on all methods for PretrainedTokenizerFast
      
      * Style and format
      
      * Added path for feeding pretokenized (List[str]) input to PretrainedTokenizerFast.
      
      * Style and format
      
      * encode_plus now supports pretokenized inputs.
      
      * Remove user warning about add_special_tokens when working on pretokenized inputs.
      
      * Always go through the post processor.
      
      * Added support for pretokenized input pairs on encode_plus
      
      * Added is_pretokenized flag on encode_plus for clarity and improved error message on input TypeError.
      
      * Added pretokenized inputs support on batch_encode_plus
      
      * Update BatchEncoding methods name to match Encoding.
      
      * Bump setup.py tokenizers dependency to 0.7.0rc1
      
      * Remove unused parameters in BertTokenizerFast
      
      * Make sure Roberta returns token_type_ids for unittests.
      
      * Added missing typings
      
      * Update add_tokens prototype to match tokenizers side and allow AddedToken
      
      * Bumping tokenizers to 0.7.0rc2
      
      * Added documentation for BatchEncoding
      
      * Added (unused) is_pretokenized parameter on PreTrainedTokenizer encode_plus/batch_encode_plus methods.
      
      * Added higher-level typing for tokenize / encode_plus / batch_encode_plus.
      
      * Fix unittests failing because add_special_tokens was defined as a constructor parameter on Rust Tokenizers.
      
      * Fix text-classification pipeline using the wrong tokenizer
      
      * Make pipelines work with BatchEncoding
      
      * Turn off add_special_tokens on tokenize by default.
      Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
      
      * Remove add_prefix_space from tokenize call in unittest.
      Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
      
      * Style and quality
      Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
      
      * Correct message for batch_encode_plus none input exception.
      Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
      
      * Fix invalid list comprehension for offset_mapping overriding content every iteration.
      Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
      
      * TransfoXL uses Strip normalizer.
      Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
      
      * Bump tokenizers dependency to 0.7.0rc3
      Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
      
      * Support AddedTokens for special_tokens and use left stripping on mask for Roberta.
      Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
      
      * SpecialTokensMixin can use slots for faster access to underlying attributes.
      Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
      
      * Remove update_special_tokens from fast tokenizers.
      
      * Ensure TransfoXL unittests are run only when torch is available.
      
      * Style.
      Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
      
      * Style
      
      * Style 🙏🙏
      
      * Remove slots on SpecialTokensMixin, need deep dive into pickle protocol.
      
      * Remove Roberta warning on __init__.
      
      * Move documentation to Google style.
      Co-authored-by: LysandreJik <lysandre.debut@reseau.eseo.fr>
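The BatchEncoding idea — a dict-like object that models can consume unchanged while the extra mappings from the Rust tokenizer stay accessible — can be sketched like this (a simplified stand-in, not the real class; the ids and offsets are illustrative):

```python
class BatchEncoding(dict):
    """Dict-like container: behaves like the plain dict models expect
    (so `model(**batch)` keeps working), while also carrying the extra
    alignment info that fast tokenizers return."""

    def __init__(self, data, offsets=None):
        super().__init__(data)
        self._offsets = offsets or []

    def token_to_chars(self, i):
        # Map token index i back to its (start, end) character span.
        return self._offsets[i]

enc = BatchEncoding(
    {"input_ids": [101, 7592, 2088, 102], "attention_mask": [1, 1, 1, 1]},
    offsets=[(0, 0), (0, 5), (6, 11), (0, 0)],
)
assert enc["input_ids"][0] == 101        # still plain dict access
assert enc.token_to_chars(2) == (6, 11)  # plus the offset mapping
```

Because it subclasses dict, existing code that unpacks the encoding into a model forward keeps working unmodified, which is the backward-compatibility point the commit makes.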
      Fix RoBERTa/XLNet Pad Token in run_multiple_choice.py (#3631) · e52d1258
      Ethan Perez authored
      * Fix RoBERTa/XLNet Pad Token in run_multiple_choice.py
      
      `convert_examples_to_features` sets `pad_token=0` by default, which is correct for BERT but incorrect for RoBERTa (`pad_token=1`) and XLNet (`pad_token=5`). I think the other arguments to `convert_examples_to_features` are correct, but it might be helpful if someone more familiar with this part of the codebase checked.
      
      * Simplifying change to match recent commits
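A minimal sketch of the bug: the pad ids come from the commit message above, but the padding helper and token ids are hypothetical, not the library's API.

```python
# Pad token ids differ per model (values from the commit message):
PAD_ID = {"bert": 0, "roberta": 1, "xlnet": 5}

def pad_batch(sequences, pad_token_id, max_len):
    """Right-pad each sequence to max_len with the given pad id."""
    return [seq + [pad_token_id] * (max_len - len(seq)) for seq in sequences]

batch = [[101, 2023, 102], [101, 102]]  # toy token ids

# Wrong for RoBERTa: hard-coding pad_token=0 (BERT's pad id)
wrong = pad_batch(batch, 0, 4)
# Right: look up the model's actual pad id
right = pad_batch(batch, PAD_ID["roberta"], 4)

assert wrong[1] == [101, 102, 0, 0]
assert right[1] == [101, 102, 1, 1]
```

With the wrong pad id, padding positions look like real tokens to the model (and real tokens with id 0/1 look like padding), which silently skews attention masks and predictions.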
    • Create README.md · 0ac33ddd
      ktrapeznikov authored
    • Add model card · 326e6eba
      Manuel Romero authored
    • Add model card · 43eca3f8
      Manuel Romero authored
    • Create README.md · 6bec88ca
      Manuel Romero authored
    • Add model card (#3655) · 769b60f9
      Manuel Romero authored
      * Add model card
      
      * Fix model name in fine-tuning script
    • Create model card (#3654) · c4bcb019
      Manuel Romero authored
      * Create model card
      
      * Fix model name in fine-tuning script
    • Create README.md · 6903a987
      Manuel Romero authored