Commits · 7246d3c2f93c4461f3ec8ada7a26a002d8f196ea · chenpangpang / transformers

12 Nov, 2019 2 commits

Consider do_lower_case in PreTrainedTokenizer · 7246d3c2

Michael Watkins authored Nov 06, 2019

As pointed out in #1545, when using an uncased model and adding
a new uncased token, the tokenizer does not recognize that token
if the input text contains it in a cased form.

For instance, if we load bert-base-uncased into BertTokenizer, and
then use .add_tokens() to add "cool-token", we get the expected
result for .tokenize('this is a cool-token'). However, we get a
possibly unexpected result for .tokenize('this is a cOOl-Token'):
it matches what .tokenize('this is a cool-token') returned before
the new token was added.

This commit adds
- functionality to PreTrainedTokenizer to handle this
  situation in case a tokenizer (currently Bert, DistilBert,
  and XLNet) has the do_lower_case=True kwarg, by:
    1) lowercasing tokens added with .add_tokens()
    2) lowercasing text at the beginning of .tokenize()
- a new common test case for tokenizers

https://github.com/huggingface/transformers/issues/1545
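
The two steps described in this commit message can be illustrated with a minimal sketch. It is not the code from the commit: ToyTokenizer, its _base_tokenize helper, and the regex-based splitting are hypothetical stand-ins, assumed here only to show why lowercasing both the added tokens and the input text lets a cased input match an added token.

```python
import re


class ToyTokenizer:
    """Hypothetical stand-in for a tokenizer that supports added tokens."""

    def __init__(self, do_lower_case=False):
        self.do_lower_case = do_lower_case
        self.added_tokens = []

    def add_tokens(self, new_tokens):
        for token in new_tokens:
            # Step 1: lowercase added tokens when the tokenizer is uncased.
            if self.do_lower_case:
                token = token.lower()
            if token not in self.added_tokens:
                self.added_tokens.append(token)

    def tokenize(self, text):
        # Step 2: lowercase the text before looking for added tokens.
        if self.do_lower_case:
            text = text.lower()
        if not self.added_tokens:
            return self._base_tokenize(text)
        # Split the text around added tokens, keeping them as whole pieces.
        pattern = "(" + "|".join(re.escape(t) for t in self.added_tokens) + ")"
        pieces = []
        for chunk in re.split(pattern, text):
            if chunk in self.added_tokens:
                pieces.append(chunk)
            else:
                pieces.extend(self._base_tokenize(chunk))
        return pieces

    @staticmethod
    def _base_tokenize(text):
        # Assumed base behavior: split into word runs and punctuation marks.
        return re.findall(r"\w+|[^\w\s]", text)


uncased = ToyTokenizer(do_lower_case=True)
uncased.add_tokens(["Cool-Token"])
print(uncased.tokenize("this is a cOOl-Token"))
# ['this', 'is', 'a', 'cool-token']

cased = ToyTokenizer(do_lower_case=False)
cased.add_tokens(["cool-token"])
print(cased.tokenize("this is a cOOl-Token"))
# ['this', 'is', 'a', 'cOOl', '-', 'Token']
```

With do_lower_case=False the added token is matched case-sensitively, so the cased input falls through to the base splitting, which corresponds to the behavior reported in the issue.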


fix #1789 · 8aba81a0
thomwolf authored Nov 12, 2019


11 Nov, 2019 1 commit
- Fix #1784 · b5d330d1
  Lysandre authored Nov 11, 2019
  
08 Nov, 2019 1 commit
- Fix run_bertology.py · 7a9aae10
  Adrian Bauer authored Nov 07, 2019
```
Make imports and args.overwrite_cache match run_glue.py
```
06 Nov, 2019 7 commits
- Add RoBERTa-based GPT-2 Output Detector from OpenAI · 1c542df7
  Julien Chaumond authored Nov 06, 2019
```
converted from https://github.com/openai/gpt-2-output-dataset/tree/master/detector

Co-Authored-By: Lysandre Debut <lysandre.debut@reseau.eseo.fr>
Co-Authored-By: Jong Wook Kim <jongwook@nyu.edu>
Co-Authored-By: Jeff Wu <wuthefwasthat@gmail.com>
```
- Fix other PyTorch models · 2f3a4210
  Julien Chaumond authored Nov 06, 2019
  
- Fix BERT · d5319793
  Julien Chaumond authored Nov 06, 2019
  
- [tests] Flag to test on cuda · 27e015bd
  Julien Chaumond authored Nov 06, 2019
  
- [tests] get rid of warning · 13d9135f
  Julien Chaumond authored Nov 06, 2019
```
cf. https://docs.pytest.org/en/latest/example/simple.html
```
- [run_tf_glue] Add comment for context · f88c104d
  Julien Chaumond authored Nov 05, 2019
  
- misc doc · 30968d70
  Julien Chaumond authored Nov 05, 2019
  
05 Nov, 2019 11 commits
- Updating docblocks in optimizers.py · de890ae6
  Dom Hudson authored Nov 05, 2019
  
- GPT-2 XL · d7d36181
  Lysandre authored Nov 05, 2019
  
- Merge pull request #1695 from huggingface/models_inputs_embeds · 7daacf00
  Julien Chaumond authored Nov 05, 2019
```
model forwards can take an inputs_embeds param
```
- add authors for models · a44f112f
  Clement authored Nov 05, 2019
  
- Merge pull request #1734 from orena1/patch-1 · e99071f1
  Thomas Wolf authored Nov 05, 2019
```
add progress bar to convert_examples_to_features
```
- Merge pull request #1553 from WilliamTambellini/timeSquadInference · ba973342
  Thomas Wolf authored Nov 05, 2019
```
Add speed log to examples/run_squad.py
```
- Merge pull request #1709 from oneraghavan/master · 237fad33
  Thomas Wolf authored Nov 05, 2019
```
Fixing mode in evaluate during training
```
- Fix #1686 · f1e4db2a
  thomwolf authored Nov 05, 2019
  
- add progress bar for convert_examples_to_features · d7906165
  Oren Amsalem authored Nov 05, 2019
```
It takes a considerable amount of time (~10 min) to parse the examples into features, so it is good to have a progress bar to track this.
```
- Merge pull request #1723 from huggingface/fix-1623 · d2e2577d
  Thomas Wolf authored Nov 05, 2019
```
Fix #1623
```
- [inputs_embeds] All PyTorch models · 00337e96
  Julien Chaumond authored Nov 05, 2019
  
04 Nov, 2019 12 commits
- docstring + check · 9eddf44b
  Julien Chaumond authored Nov 04, 2019
  
- model forwards can take an inputs_embeds param · 8e11de0e
  Julien Chaumond authored Nov 01, 2019
  
- Add `model.train()` line to ReadMe training example · 68f7064a
  Lysandre authored Nov 04, 2019
```
Co-Authored-By: Santosh-Gupta <San.Gupta.ML@gmail.com>
```
- Merge pull request #1721 from huggingface/common_attributes · c8f27121
  Thomas Wolf authored Nov 04, 2019
```
Add common getter and setter for input_embeddings & output_embeddings
```
- Fix #1623 · 89d62728
  thomwolf authored Nov 04, 2019
  
- fix tests - flagged as slow all the tests downloading from AWS · b340a910
  thomwolf authored Nov 04, 2019
  
- fix tests · f02805da
  thomwolf authored Nov 04, 2019
  
- Merge pull request #1549 from hlums/master · 1d4d0702
  Thomas Wolf authored Nov 04, 2019
```
Fix token order in xlnet preprocessing for SQuAD
```
- switch from properties to methods · 1724cee8
  thomwolf authored Nov 04, 2019
  
- Add common properties input_embeddings and output_embeddings · 9b45d0f8
  thomwolf authored Nov 04, 2019
  
- Merge branch 'master' into master · 9a3b173c
  Thomas Wolf authored Nov 04, 2019
  
- Update example readme · ad908686
  thomwolf authored Nov 04, 2019
  
03 Nov, 2019 1 commit
- Fixing mode in evaluate during training · e5b1048b
  Raghavan authored Nov 03, 2019
  
01 Nov, 2019 2 commits
- Merge pull request #1679 from cregouby/master · 8a628355
  Thomas Wolf authored Nov 01, 2019
```
Fix https://github.com/huggingface/transformers/issues/1673
```
- Close #1654 · 93d2fff0
  Julien Chaumond authored Nov 01, 2019
  
31 Oct, 2019 3 commits
- run_tf_glue MRPC evaluation only for MRPC · 1a2b40cb
  Lysandre authored Oct 31, 2019
  
- Added mixed precision support to benchmarks.py · be36cf92
  Timothy Liu authored Oct 30, 2019
  
- Merge branch 'mataney-fix_top_k_top_p_filtering' · 2a5663c2
  Julien Chaumond authored Oct 31, 2019
  