Commits · c754c41c6193565fecaf411b1de385bf90ab5c70 · chenpangpang / transformers

22 Sep, 2020 1 commit

RAG (#6813) · c754c41c

Ola Piktus authored Sep 22, 2020

* added rag WIP

* path fix

* Formatting / renaming prior to actual work

* added rag WIP

* path fix

* Formatting / renaming prior to actual work

* added rag WIP

* path fix

* Formatting / renaming prior to actual work

* added rag WIP

* Formatting / renaming prior to actual work

* First commit

* improve comments

* Retrieval evaluation scripts

* refactor to include modeling outputs + MPI retriever

* Fix rag-token model + refactor

* Various fixes + finetuning logic

* use_bos fix

* Retrieval refactor

* Finetuning refactoring and cleanup

* Add documentation and cleanup

* Remove set_up_rag_env.sh file

* Fix retrieval with HF index

* Fix import errors

* Fix quality errors

* Refactor as per suggestions in https://github.com/huggingface/transformers/pull/6813#issuecomment-687208867



* fix quality

* Fix RAG Sequence generation

* minor cleanup plus initial tests

* fix test

* fix tests 2

* Comments fix

* post-merge fixes

* Improve readme + post-rebase refactor

* Extra dependencies for tests

* Fix tests

* Fix tests 2

* Refactor test requirements

* Fix tests 3

* Post-rebase refactor

* rename nlp->datasets

* RAG integration tests

* add tokenizer to slow integration test and allow retriever to run on cpu

* add tests; fix position ids warning

* change structure

* change structure

* add from encoder generator

* save working solution

* make all integration tests pass

* add RagTokenizer.save/from_pretrained and RagRetriever.save/from_pretrained

* don't save paths

* delete unnecessary imports

* pass config to AutoTokenizer.from_pretrained for Rag tokenizers

* init wiki_dpr only once

* hardcode legacy index and passages paths (todo: add the right urls)

* finalize config

* finalize retriever api and config api

* LegacyIndex index download refactor

* add dpr to autotokenizer

* make from pretrained more flexible

* fix ragfortokengeneration

* small name changes in tokenizer

* add labels to models

* change default index name

* add retrieval tests

* finish token generate

* align test with previous version and make all tests pass

* add tests

* finalize tests

* implement thoms suggestions

* add first version of test

* make first tests work

* make retriever platform agnostic

* naming

* style

* add legacy index URL

* docstrings + simple retrieval test for distributed

* clean model api

* add doc_ids to retriever's outputs

* fix retrieval tests

* finish model outputs

* finalize model api

* fix generate problem for rag

* fix generate for other models

* fix some tests

* save intermediate

* set generate to default

* big refactor generate

* delete rag_api

* correct pip faiss install

* fix auto tokenization test

* fix faiss install

* fix test

* move the distributed logic to examples

* model page

* docs

* finish tests

* fix dependencies

* fix import in __init__

* Refactor eval_rag and finetune scripts

* start docstring

* add psutil to test

* fix tf test

* move require torch to top

* fix retrieval test

* align naming

* finish automodel

* fix repo consistency

* test ragtokenizer save/load

* add rag model output docs

* fix ragtokenizer save/load from pretrained

* fix tokenizer dir

* remove torch in retrieval

* fix docs

* fix finetune scripts

* finish model docs

* finish docs

* remove auto model for now

* add require torch

* remove solved todos

* integrate sylvains suggestions

* sams comments

* correct mistake on purpose

* improve README

* Add generation test cases

* fix rag token

* clean token generate

* fix test

* add note to test

* fix attention mask

* add t5 test for rag

* Fix handling prefix in finetune.py

* don't overwrite index_name

Co-authored-by: Patrick Lewis <plewis@fb.com>
Co-authored-by: Aleksandra Piktus <piktus@devfair0141.h2.fair>
Co-authored-by: Aleksandra Piktus <piktus@learnfair5102.h2.fair>
Co-authored-by: Aleksandra Piktus <piktus@learnfair5067.h2.fair>
Co-authored-by: Your Name <you@example.com>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
Co-authored-by: Quentin Lhoest <lhoest.q@gmail.com>


10 Sep, 2020 1 commit
- Fix CI with change of name of nlp (#7054) · 51448673
  Sylvain Gugger authored Sep 10, 2020
```
* nlp -> datasets

* More nlp -> datasets

* Woopsie

* More nlp -> datasets

* One last
```
24 Aug, 2020 1 commit
- Update repo to isort v5 (#6686) · a5737779
  Sylvain Gugger authored Aug 24, 2020
```
* Run new isort

* More changes

* Update CI, CONTRIBUTING and benchmarks
```
13 Aug, 2020 1 commit

Add POS tagging and Phrase chunking token classification examples (#6457) · eda07efa

vblagoje authored Aug 13, 2020

* Add more token classification examples

* POS tagging example

* Phrase chunking example

* PR review fixes

* Add conllu to third party list (used in token classification examples)


31 Jul, 2020 1 commit

Replace mecab-python3 with fugashi for Japanese tokenization (#6086) · cf3cf304

Paul O'Leary McCann authored Jul 31, 2020



* Replace mecab-python3 with fugashi

This replaces mecab-python3 with fugashi for Japanese tokenization. I am
the maintainer of both projects.

Both projects are MeCab wrappers, so the underlying C++ code is the
same. fugashi is the newer wrapper and doesn't use SWIG, so for basic
use of the MeCab API it's easier to use.

This code ensures the use of a version of ipadic installed via pip,
which should make versioning and tracking down issues easier.

fugashi has wheels for Windows, OSX, and Linux, which will help with
issues with installing old versions of mecab-python3 on Windows.
Compared to mecab-python3, because fugashi doesn't use SWIG, it doesn't
require a C++ runtime to be installed on Windows.

In adding this change I removed some code dealing with `cursor`,
`token_start`, and `token_end` variables. These variables didn't seem to
be used for anything; it is unclear to me why they were there.

I ran the tests and they passed, though I couldn't figure out how to run
the slow tests (`--runslow` gave an error) and didn't try testing with
Tensorflow.

* Style fix

* Remove unused variable

Forgot to delete this...

* Adapt doc with install instructions

* Fix typo

Co-authored-by: sgugger <sylvain.gugger@gmail.com>
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>


27 Jul, 2020 1 commit
- Add fire to setup.cfg to make isort happy (#6066) · fd347e0d
  Sylvain Gugger authored Jul 27, 2020
  
24 Jul, 2020 1 commit
- Model utils doc (#6005) · 3b44aa93
  Sylvain Gugger authored Jul 24, 2020
```
* Document TF modeling utils

* Document all model utils
```
07 Jul, 2020 1 commit

Add mbart-large-cc25, support translation finetuning (#5129) · 353b8f1e

Sam Shleifer authored Jul 07, 2020

* improve unittests for finetuning, especially w.r.t testing frozen parameters

* fix freeze_embeds for T5

* add streamlit setup.cfg


25 Jun, 2020 1 commit
- examples/seq2seq supports translation (#5202) · 40457bce
  Sam Shleifer authored Jun 24, 2020
  
22 Jun, 2020 1 commit

Benchmarks (#4912) · fa0be6d7

Patrick von Platen authored Jun 22, 2020

* finish benchmark

* fix isort

* fix setup cfg

* retab

* fix time measuring of tf graph mode

* fix tf cuda

* clean code

* better error message


17 Jun, 2020 1 commit
- add pandas to setup.cfg (#5093) · f1a3d037
  Sam Shleifer authored Jun 17, 2020
  
05 Jun, 2020 1 commit
- [isort] add matplotlib to known 3rd party dependencies (#4800) · 875288b3
  Sam Shleifer authored Jun 05, 2020
  
14 May, 2020 1 commit
- Fix: unpin flake8 and fix cs errors (#4367) · 448c4672
  Julien Chaumond authored May 14, 2020
```
* Fix: unpin flake8 and fix cs errors

* Ok we still need to quote those
```
01 May, 2020 1 commit
- [testing] add timeout_decorator (#3543) · 18db92dd
  Sam Shleifer authored May 01, 2020
  
28 Apr, 2020 2 commits
- MarianMTModel.from_pretrained('Helsinki-NLP/opus-marian-en-de') (#3908) · 847e7f33
  Sam Shleifer authored Apr 28, 2020
```
Co-Authored-By: Stefan Schweter <stefan@schweter.it>
```
- [isort] add known 3rd party to setup.cfg (#4053) · d714dfea
  Sam Shleifer authored Apr 28, 2020
```
* add known 3rd party to setup.cfg

* comment

* Update CONTRIBUTING.md

Co-authored-by: Julien Chaumond <chaumond@gmail.com>
```
20 Feb, 2020 1 commit

Support for torch-lightning in NER examples (#2890) · b662f0e6

srush authored Feb 20, 2020



* initial pytorch lightning commit

* tested multigpu

* Fix learning rate schedule

* black formatting

* fix flake8

* isort

* isort

* .

Co-authored-by: Check your git settings! <chris@chris-laptop>


13 Jan, 2020 1 commit
- Py35 doesn't like inline variable types · 3c86b6f3
  Julien Chaumond authored Jan 13, 2020
  
10 Jan, 2020 2 commits
- keep list sorted · fd842332
  Julien Chaumond authored Jan 10, 2020
  
- [isort] declare more third-parties in case no tf install · 0cd81fb9
  Julien Chaumond authored Jan 10, 2020
  
06 Jan, 2020 2 commits
- GPU text generation: Moved the encoded_prompt to correct device · 81d6841b
  alberduris authored Dec 31, 2019
  
- Moved the encoded_prompts to correct device · dd4df80f
  alberduris authored Dec 31, 2019
  
23 Dec, 2019 2 commits
- Enable F841 warning in flake8. · e74c73a8
  Aymeric Augustin authored Dec 23, 2019
  
- Include all optional dependencies in extras. · 76a1417f
  Aymeric Augustin authored Dec 22, 2019
```
Take advantage of this to simplify the Circle CI configuration.

Don't bother with tensorboardX: it's a fallback for PyTorch < 1.1.0.
```
22 Dec, 2019 4 commits
- Sort imports for optional third-party libraries. · c11b3e29
  Aymeric Augustin authored Dec 22, 2019
```
These libraries aren't always installed in the virtual environment where
isort is running. Declaring them properly avoids mixing these
third-party imports with local imports.
```
- Stabilize import order for packaging. · 2a34d5b7
  Aymeric Augustin authored Dec 22, 2019
```
I don't want to consider it a dependency of transformers, but it's
usually there in local development and usually not there in CI.
```
- Disable flake8 F841 in CI to get a passing run. · c9270086
  Aymeric Augustin authored Dec 21, 2019
```
I'll fix it later.
```
- Add black-compatible flake8 configuration. · 1efa0a75
  Aymeric Augustin authored Dec 21, 2019
  
21 Dec, 2019 1 commit
- Add black-compatible isort configuration. · bc1715c1
  Aymeric Augustin authored Dec 21, 2019
```
lines_after_imports = 2 is a matter of taste; I like it.
```
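
Several commits in this history adjust the `[isort]` section of `setup.cfg`: declaring third-party packages, keeping that list sorted, and setting `lines_after_imports = 2`. As a rough illustration only, the Python sketch below parses a hypothetical fragment of such a section with the standard-library `configparser`. The fragment is an assumption assembled from package names mentioned in the commit titles above, not the repository's actual file, and `read_isort_settings` is a hypothetical helper.

```python
import configparser

# Hypothetical setup.cfg fragment (an assumption, not the repository's real file).
# The option names are standard isort settings; the package list is taken from
# names that appear in the commit titles above.
SETUP_CFG = """
[isort]
include_trailing_comma = True
known_third_party =
    conllu
    fire
    matplotlib
    pandas
lines_after_imports = 2
multi_line_output = 3
use_parentheses = True
"""


def read_isort_settings(text):
    """Parse the [isort] section and report the settings discussed in the log."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    section = parser["isort"]
    # Indented continuation lines form one multi-line value; split on whitespace.
    third_party = section["known_third_party"].split()
    return {
        "known_third_party": third_party,
        "lines_after_imports": section.getint("lines_after_imports"),
        # True when the declared list is already in alphabetical order.
        "is_sorted": third_party == sorted(third_party),
    }


settings = read_isort_settings(SETUP_CFG)
print(settings)
```

Declaring packages under `known_third_party` matters when isort runs in an environment where they are not installed, which is the situation the "Sort imports for optional third-party libraries" commit describes.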