Commits · 82462c5cba0ec07a3eeb1e9455d229ceaf43b5f2 · chenpangpang / transformers

30 Aug, 2019 1 commit
- Added option to setup pretrained tokenizer arguments · 82462c5c
  thomwolf authored Aug 30, 2019
  
28 Aug, 2019 1 commit
- Match order of casing in OSS XLM; Improve document; Clean up dependency · ca4baf8c
  Shijie Wu authored Aug 27, 2019
  
24 Aug, 2019 1 commit
- Add custom tokenizer for zh and ja · e85123d3
  Shijie Wu authored Aug 23, 2019
  
23 Aug, 2019 1 commit
- Tokenization behaves the same as original XLM preprocessing for most languages... · 436ce072
  Shijie Wu authored Aug 23, 2019
  
  Tokenization behaves the same as the original XLM preprocessing for most languages except zh, ja and th. Change API to allow specifying language in `tokenize`.
20 Aug, 2019 2 commits
- Update tokenization_xlm.py · 388e3251
  Guillem García Subies authored Aug 20, 2019
  
- Update tokenization_xlm.py · bfd75056
  Guillem García Subies authored Aug 20, 2019
  
12 Aug, 2019 1 commit
- Added documentation and changed parameters for special_tokens_sentences_pair. · 22ac004a
  LysandreJik authored Aug 12, 2019
  
09 Aug, 2019 1 commit
- Tokenization encode/decode class-based sequence handling · 14e970c2
  LysandreJik authored Aug 09, 2019
  
16 Jul, 2019 1 commit
- update readme and pretrained model weight files · 1849aa7d
  thomwolf authored Jul 16, 2019
  
15 Jul, 2019 1 commit
- update tokenizer - update squad example for xlnet · 15d8b126
  thomwolf authored Jul 15, 2019
  
10 Jul, 2019 3 commits
- Added the two CLM XLM pretrained checkpoints. · 7fdbc478
  LysandreJik authored Jul 10, 2019
```
Fixed file extensions for config/vocab/merges of XLM models.
```
- Fixed XLM weights conversion script. Added 5 new checkpoints for XLM. · dee3e45b
  LysandreJik authored Jul 10, 2019
  
- Fixed all links. Removed TPU. Changed CLI to Converting TF models. Many minor... · f773faa2
  LysandreJik authored Jul 10, 2019
```
Fixed all links. Removed TPU. Changed CLI to Converting TF models. Many minor formatting adjustments. Added "TODO Lysandre filled" where necessary.
```
09 Jul, 2019 2 commits
- adding tests to examples - updating summary module - coverage update · d5481cbe
  thomwolf authored Jul 09, 2019
  
- unified tokenizer api and serialization + tests · b1978698
  thomwolf authored Jul 09, 2019
  
05 Jul, 2019 3 commits
- tokenization abstract class - tests for examples · 36bca545
  thomwolf authored Jul 05, 2019
  
- [BIG] name change · 0bab55d5
  thomwolf authored Jul 05, 2019
  
- standardizing tokenizers API and adding tests · e75c3f70
  thomwolf authored Jul 05, 2019
  
03 Jul, 2019 2 commits
- updating tests · 8fa3a1f0
  thomwolf authored Jul 03, 2019
  
- WIP XLM + refactoring · c41f2bad
  thomwolf authored Jul 03, 2019
  
02 Jul, 2019 1 commit
- xlm · 288be7b7
  thomwolf authored Jul 02, 2019
  
17 Jun, 2019 1 commit
- better error messages · 8415a38b
  thomwolf authored Jun 17, 2019
  
08 May, 2019 1 commit
- clean up in tokenization · 366a3b02
  thomwolf authored May 08, 2019
  
16 Apr, 2019 2 commits
- updating GPT tokenization · bdaba189
  thomwolf authored Apr 16, 2019
  
- improving GPT2 tokenization and adding tests · 18a8a15f
  thomwolf authored Apr 16, 2019
  
15 Apr, 2019 4 commits
- fix openai special tokens loading · d6160224
  thomwolf authored Apr 15, 2019
  
- fixing tests · e8568a3b
  thomwolf authored Apr 15, 2019
  
- added tokenizers serialization tests · 870b734b
  thomwolf authored Apr 15, 2019
  
- add serialization semantics to tokenizers - fix transfo-xl tokenizer · 3e65f255
  thomwolf authored Apr 15, 2019
  
06 Mar, 2019 1 commit
- fix typo - logger info · 5c85fc39
  thomwolf authored Mar 06, 2019
  
03 Mar, 2019 1 commit
- Allow tokenization of sequences > 512 for caching · 9775b2eb
  Catalin Voss authored Mar 02, 2019
  
  For many applications requiring randomized data access, it's easier to cache the tokenized representations than the words. So why not turn this into a warning?
13 Feb, 2019 1 commit
- OpenAI GPT Tokenizer can fallback on using BERT BasicTokenizer · c6bea084
  thomwolf authored Feb 13, 2019
  
11 Feb, 2019 1 commit
- added tests for OpenAI GPT and Transformer-XL tokenizers · b514a60c
  thomwolf authored Feb 11, 2019
  
07 Feb, 2019 1 commit
- docstrings · f99f2fb6
  thomwolf authored Feb 07, 2019
  
05 Feb, 2019 1 commit
- python 2 compatibility · 448937c0
  thomwolf authored Feb 06, 2019
  
04 Feb, 2019 4 commits
- clean up tokenization spaces · 6179f537
  thomwolf authored Feb 04, 2019
  
- strip decoded outputs · 850da1cc
  thomwolf authored Feb 04, 2019
  
- more options on special tokens · 01a3966b
  thomwolf authored Feb 04, 2019
  
- logging · 05f96184
  thomwolf authored Feb 04, 2019
  
28 Jan, 2019 1 commit
- directly load from TF checkpoints + code cleanup · d77dd62f
  thomwolf authored Jan 28, 2019
  