Commits · 06a6cb6f360e88866afdac5c0c4e295ab7da2c9b · chenpangpang / transformers

05 Mar, 2020 2 commits
- Refactor BartModel so that input checks are handled within BartEncoder and BartDecoder · 06a6cb6f
  Tom Hosking authored Mar 05, 2020
  
  06a6cb6f
- Fix failing doc samples · 07a79db5
  Lysandre authored Mar 04, 2020
  
  07a79db5
04 Mar, 2020 6 commits
- rename variables named 'word' to 'token' in generate fn (#3119) · 006097f8
  Patrick von Platen authored Mar 04, 2020
```
* fix conflicts

* fixed naming bug

* make style
```
  006097f8
- correct beam search sampling · 7a89a3e4
  Patrick von Platen authored Mar 04, 2020
  
  7a89a3e4
- make GPT2 and CTRL shape consistent between torch and TF · c4c4c999
  Patrick von Platen authored Mar 04, 2020
  
  c4c4c999
- set reorder past sort dimension to its default · 2529b2d3
  patrickvonplaten authored Mar 04, 2020
  
  2529b2d3
- added beam_search generation for tf 2.0 · 61fef6e9
  patrickvonplaten authored Mar 04, 2020
  
  61fef6e9
- fix beam_search behavior when sampling (#3106) · 6701fb78
  Patrick von Platen authored Mar 04, 2020
```
* fix beam_search behavior when sampling

* delete print

* make correct style
```
  6701fb78
03 Mar, 2020 5 commits

- fix: passing config as Layer trainable param · b1116fd6
  Gunnlaugur Thor Briem authored Mar 03, 2020
```
Lurking bugs discovered while working on other stuff.
```
  b1116fd6
- BartForSequenceClassification: fix num_labels, add test (#3110) · e9e6efdc
  Sam Shleifer authored Mar 03, 2020
  
  e9e6efdc

[ci] Re-run integration ground truth from fairseq · f631e01d

Julien Chaumond authored Mar 03, 2020

Adopted best practice set by @patrickvonplaten of commenting lines run on fairseq, for easy comparison

also see #3020

f631e01d

- [Bart] dont call .forward (#3094) · 5c5af879
  Sam Shleifer authored Mar 03, 2020
  
  5c5af879

Add generate() functionality to TF 2.0 (#3063) · 41341003

Patrick von Platen authored Mar 03, 2020

* add first copy-paste test to tf 2 generate

* add tf top_k_top_p_filter fn

* add generate function for TF

* add generate function for TF

* implemented generate for all models except transfoXL

* implemented generate for all models except transfoXL

* implemented generate for all models except transfoXL

* make style

* change permission of test file to correct ones

* delete ipdb

* delete ipdb

* fix bug and finish simple gpt2 integration test

* clean test file

* clean test file

* make style

* make style

* make style

* make style

* change import style

* change import style

* make style

* make style

* add decorators

* add decorators

* fix tf ctrl bug dim => axis in TF

* make style

* make style

* refactored test file

* refactored test file

* take out test_torch_tf_conversion if nothing is defined

* take out test_torch_tf_conversion if nothing is defined

* remove useless files

* remove useless files

* fix conflicts

* fix conflicts

* fix conflicts

* fix conflicts

* fix conflicts

* solve conflicts

* solve conflicts

* fix conflicts

* fix conflicts

* merge conflicts

* delete ipdb

* exposed top_k_top_p_filtering fns

* delete weirdly created w! file

* add comment to test tf common modeling

* fix conflicts

* fix conflicts

* make style

* merge conflicts

* make style

* change tf.tensor.shape to shape_list(tensor)

41341003
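
The bullets above mention a `top_k_top_p_filter` function added for the TF `generate()`. As a rough illustration of what top-k and top-p (nucleus) filtering do to a vector of logits, here is a minimal pure-Python sketch. It is not the code from this commit: the function body, the plain-list input, the default arguments, and the tie handling are all assumptions made for illustration.

```python
import math


def top_k_top_p_filter(logits, top_k=0, top_p=1.0, filter_value=-math.inf):
    """Return a copy of `logits` with filtered-out entries set to `filter_value`.

    Illustrative sketch only; works on one plain list of floats.
    """
    filtered = list(logits)

    if top_k > 0:
        k = min(top_k, len(filtered))
        # The k-th largest logit is the cut-off; anything strictly below it is dropped.
        # Ties with the cut-off are kept, so more than k entries may survive.
        threshold = sorted(filtered, reverse=True)[k - 1]
        filtered = [x if x >= threshold else filter_value for x in filtered]

    if top_p < 1.0:
        # Softmax over the surviving logits (exp(-inf) evaluates to 0.0).
        largest = max(filtered)
        exps = [math.exp(x - largest) for x in filtered]
        total = sum(exps)
        probs = [e / total for e in exps]

        # Walk tokens from most to least likely and keep them until the
        # cumulative probability first exceeds top_p.
        order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
        keep = set()
        cumulative = 0.0
        for i in order:
            keep.add(i)
            cumulative += probs[i]
            if cumulative > top_p:
                break
        filtered = [x if i in keep else filter_value for i, x in enumerate(filtered)]

    return filtered
```

In this sketch the most likely token always survives, so sampling from the filtered logits never sees an empty distribution.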

02 Mar, 2020 5 commits
- [BART] to each its own config + make BART compatible w/ Pipelines · eec5ec80
  Julien Chaumond authored Mar 02, 2020
```
cc @sshleifer
```
  eec5ec80
- Pipeline doc (#3055) · d3eb7d23
  Lysandre Debut authored Mar 02, 2020
```
* Pipeline doc initial commit

* pipeline abstraction

* Remove modelcard argument from pipeline

* Task-specific pipelines can be instantiated with no model or tokenizer

* All pipelines doc
```
  d3eb7d23
- correct greedy generation when doing beam search (#3078) · 2fdc7f6c
  Patrick von Platen authored Mar 02, 2020
```
* correct greedy generation when doing beam search

* improve comment
```
  2fdc7f6c
- Force pad_token_id to be set before padding for standard tokenizer (#3035) · c0135194
  Patrick von Platen authored Mar 02, 2020
```
* force pad_token_id to be set before padding

* fix tests and forbid padding without having a padding_token_id set
```
  c0135194
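
The commit above forbids padding when no `pad_token_id` is set. A minimal sketch of that kind of guard, assuming a hypothetical `pad_batch` helper that is not part of the library, could look like this:

```python
def pad_batch(batch_ids, pad_token_id=None):
    """Right-pad lists of token ids to a common length (hypothetical helper).

    Raises instead of silently choosing a padding value when none is configured.
    """
    if pad_token_id is None:
        raise ValueError("Cannot pad: no pad_token_id is set.")

    max_len = max(len(ids) for ids in batch_ids)
    # Padded positions get the pad id in input_ids and 0 in the attention mask.
    input_ids = [ids + [pad_token_id] * (max_len - len(ids)) for ids in batch_ids]
    attention_mask = [[1] * len(ids) + [0] * (max_len - len(ids)) for ids in batch_ids]
    return input_ids, attention_mask
```

Failing early here avoids batches padded with an arbitrary id that the model would treat as a real token.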
- Bart-CNN (#3059) · b54ef78d
  Sam Shleifer authored Mar 02, 2020
```
`generate` code that produces 99% identical summarizations to fairseq on CNN test data, with caching.
```
  b54ef78d
27 Feb, 2020 2 commits
- spelling: strictly (#3042) · 6a375880
  Sam Shleifer authored Feb 27, 2020
  
  6a375880
- Fix batch_encode_plus (#3041) · f4ff44a6
  Cola authored Feb 27, 2020
  
  f4ff44a6
26 Feb, 2020 4 commits
- Changes from reviews. · 9495d38b
  Martin Malmsten authored Feb 26, 2020
  
  9495d38b
- Fix attn mask gpt2 when using past (#3033) · fdd61b19
  Patrick von Platen authored Feb 26, 2020
```
* fix issue and add some tests

* fix issue and add some tests

* updated doc string gpt2
```
  fdd61b19
- Fix (non-slow) tests on GPU (torch) (#3024) · 9cda3620
  Julien Chaumond authored Feb 26, 2020
```
* Fix tests on GPU (torch)

* Fix bart slow tests
Co-authored-by: Sam Shleifer <sshleifer@gmail.com>
```
  9cda3620
- Delete all mentions of Model2Model (#3019) · 9df74b8b
  Sam Shleifer authored Feb 26, 2020
  
  9df74b8b
25 Feb, 2020 2 commits

Documentation (#2989) · bb7c4685

Lysandre Debut authored Feb 25, 2020

* All Tokenizers

BertTokenizer + few fixes
RobertaTokenizer
OpenAIGPTTokenizer + Fixes
GPT2Tokenizer + fixes
TransfoXLTokenizer
Correct rst for TransformerXL
XLMTokenizer + fixes
XLNet Tokenizer + Style
DistilBERT + Fix XLNet RST
CTRLTokenizer
CamemBERT Tokenizer
FlaubertTokenizer
XLMRobertaTokenizer
cleanup

* cleanup

bb7c4685

- Change masking to direct labeling for TPU support. (#2982) · e8ce63ff
  srush authored Feb 25, 2020
```
* change masking to direct labelings

* fix black

* switch to ignore index

* .

* fix black
```
  e8ce63ff
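
The commit body above mentions switching from masking to direct labels with an ignore index. The following pure-Python sketch shows the general idea, not the code from this commit. It assumes the ignore value is -100, which is the default `ignore_index` of PyTorch's `CrossEntropyLoss`; the helper names `direct_labels` and `masked_lm_loss` are hypothetical.

```python
import math

IGNORE_INDEX = -100  # assumption: PyTorch's default CrossEntropyLoss ignore_index


def direct_labels(input_ids, masked_positions, ignore_index=IGNORE_INDEX):
    """Build a label list the same length as the input.

    Positions that were masked keep their original token id as the label;
    every other position gets `ignore_index` so the loss skips it.
    """
    return [tok if i in masked_positions else ignore_index
            for i, tok in enumerate(input_ids)]


def masked_lm_loss(log_probs, labels, ignore_index=IGNORE_INDEX):
    """Mean negative log-likelihood over positions whose label is not ignored.

    `log_probs[i][v]` is the log-probability of vocabulary id `v` at position `i`.
    """
    losses = [-row[label] for row, label in zip(log_probs, labels)
              if label != ignore_index]
    return sum(losses) / len(losses) if losses else 0.0
```

Because the labels always have the same length as the input, tensor shapes stay fixed regardless of how many tokens are masked, which is presumably what makes this form friendlier to TPUs than selecting positions with a boolean mask.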

24 Feb, 2020 11 commits

- False by default (#3002) · 3716c3d8
  Lysandre Debut authored Feb 24, 2020
  
  3716c3d8
- Release: v2.5.1 · f9ec5ca9
  Lysandre authored Feb 24, 2020
  
  f9ec5ca9

Fix for fast tokenizers save_pretrained compatibility with Python. (#2933) · 4cd9c097

Funtowicz Morgan authored Feb 25, 2020



* Renamed file generated by tokenizers when calling save_pretrained to match python.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Added save_vocabulary tests.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Remove python quick and dirty fix for clean Rust impl.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Bump tokenizers dependency to 0.5.1
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* TransfoXLTokenizerFast uses a json vocabulary file + warning about incompatibility between Python and Rust
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Added some save_pretrained / from_pretrained unittests.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Update tokenizers to 0.5.2
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Quality and format.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* flake8
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Making sure there is really a bug in unittest

* Fix TransfoXL constructor vocab_file / pretrained_vocab_file mixin.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

4cd9c097

- fix _update_memory fn call in transformer-xl (#2971) · ee60840e
  Sandro Cavallari authored Feb 25, 2020
  
  ee60840e
- add explaining example to XLNet LM modeling (#2997) · 6a50d501
  Patrick von Platen authored Feb 24, 2020
```
* add explaining example to XLNet LM modeling

* improve docstring for xlnet
```
  6a50d501

Add preprocessing step for transfo-xl tokenization to avoid tokenizing words... · 65d74c49

Patrick von Platen authored Feb 24, 2020

Add preprocessing step for transfo-xl tokenization to avoid tokenizing words followed by punctuation to <unk> (#2987)

* add preprocessing to add space before punctuation for transfo_xl

* improve warning messages

* make style

* compile regex at instantiation of tokenizer object

65d74c49
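
The commit above adds a space before punctuation so that a word followed by punctuation is not mapped to `<unk>`, and compiles the regex once when the tokenizer object is created. A minimal sketch of that pattern follows. The class name `PunctuationSpacer`, the default punctuation set, and the exact regex are assumptions for illustration, not the library's implementation.

```python
import re


class PunctuationSpacer:
    """Hypothetical helper that separates punctuation from the preceding word."""

    def __init__(self, punctuation=".,!?;:"):
        # Compile once at construction time rather than on every call.
        # The lookbehind only matches punctuation that directly follows
        # a non-space character.
        self._pattern = re.compile(r"(?<=\S)([{}])".format(re.escape(punctuation)))

    def __call__(self, text):
        # Insert a single space in front of each matched punctuation mark.
        return self._pattern.sub(r" \1", text)
```

With this sketch, `PunctuationSpacer()("Hello, world!")` yields `"Hello , world !"`, so a whitespace-based tokenizer sees `Hello` and `,` as separate tokens.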

Add local_files_only parameter to pretrained items (#2930) · a143d947

Bram Vanroy authored Feb 24, 2020

* Add disable_outgoing to pretrained items

Setting disable_outgoing=True disables outgoing traffic:
- etags are not looked up
- models are not downloaded

* parameter name change

* Remove forgotten print

a143d947
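
The commit above describes a flag (first named `disable_outgoing`, then renamed; `local_files_only` per the commit title) under which ETags are not looked up and models are not downloaded. The sketch below illustrates that behaviour with a hypothetical `resolve_file` function. The cache layout, the `.etag` sidecar file, and the `fetch_etag` / `download` callbacks are all assumptions and do not reflect the library's actual cache code.

```python
import os


def resolve_file(name, cache_dir, fetch_etag, download, local_files_only=False):
    """Return a local path for `name` (hypothetical resolver).

    `fetch_etag(name)` and `download(name, dest)` stand in for network calls.
    """
    cached = os.path.join(cache_dir, name)

    if local_files_only:
        # No outgoing traffic: skip the ETag lookup and never download.
        if os.path.exists(cached):
            return cached
        raise FileNotFoundError(
            "{!r} is not cached and local_files_only=True".format(name)
        )

    # Online path: look up the remote ETag, then download if the cached
    # copy is missing or was stored under a different ETag.
    etag = fetch_etag(name)
    etag_file = cached + ".etag"
    stale = True
    if os.path.exists(cached) and os.path.exists(etag_file):
        with open(etag_file) as f:
            stale = f.read() != etag
    if stale:
        download(name, cached)
        with open(etag_file, "w") as f:
            f.write(etag)
    return cached
```

The design choice in this sketch is that an offline request for an uncached file fails loudly rather than falling back to the network.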

- kwargs are passed to both model and configuration in AutoModels (#2998) · 7984a70e
  Lysandre Debut authored Feb 24, 2020
  
  7984a70e

Testing that batch_encode_plus is the same as encode_plus (#2973) · 21d8b6a3

Lysandre Debut authored Feb 24, 2020

* Testing that encode_plus and batch_encode_plus behave the same way

Spoiler alert: they don't

* Testing rest of arguments in batch_encode_plus

* Test tensor return in batch_encode_plus

* Addressing Sam's comments

* flake8

* Simplified with `num_added_tokens`

21d8b6a3

Add slow generate tests for pretrained lm models (#2909) · 17c45c39

Patrick von Platen authored Feb 24, 2020

* add slow generate lm_model tests

* fix conflicts

* merge conflicts

* fix conflicts

* add slow generate lm_model tests

* make style

* delete unused variable

* fix conflicts

* fix conflicts

* fix conflicts

* delete unused variable

* fix conflicts

* finished hard coded tests

17c45c39

Warning on `add_special_tokens` (#2966) · 8194df8e

Lysandre Debut authored Feb 24, 2020

Warning on `add_special_tokens` when passed to `encode`, `encode_plus` and `batch_encode_plus`

8194df8e

23 Feb, 2020 3 commits

* Added support for Albert when fine-tuning for NER · 869b66f6

Martin Malmsten authored Feb 23, 2020

* Added support for Albert in NER pipeline

* Added command-line options to examples/ner/run_ner.py to better control tokenization

* Added class AlbertForTokenClassification

* Changed output for NerPipeline to use .convert_ids_to_tokens(...) instead of .decode(...) to better reflect tokens

869b66f6

- Delete untested, broken Model2LSTM (#2968) · 129f0604
  Sam Shleifer authored Feb 23, 2020
  
  129f0604
- Correct `special_tokens_mask` when `add_special_tokens=False` (#2965) · 0e84559d
  Lysandre Debut authored Feb 23, 2020
```
Don't know of a use case where that would be useful, but this is more consistent
```
  0e84559d