1. 22 Sep, 2020 1 commit
    • Ola Piktus's avatar
      RAG (#6813) · c754c41c
      Ola Piktus authored
      * added rag WIP
      
      * path fix
      
      * Formatting / renaming prior to actual work
      
      * added rag WIP
      
      * path fix
      
      * Formatting / renaming prior to actual work
      
      * added rag WIP
      
      * path fix
      
      * Formatting / renaming prior to actual work
      
      * added rag WIP
      
      * Formatting / renaming prior to actual work
      
      * First commit
      
      * improve comments
      
      * Retrieval evaluation scripts
      
      * refactor to include modeling outputs + MPI retriever
      
      * Fix rag-token model + refactor
      
      * Various fixes + finetuning logic
      
      * use_bos fix
      
      * Retrieval refactor
      
      * Finetuning refactoring and cleanup
      
      * Add documentation and cleanup
      
      * Remove set_up_rag_env.sh file
      
      * Fix retrieval wit HF index
      
      * Fix import errors
      
      * Fix quality errors
      
      * Refactor as per suggestions in https://github.com/huggingface/transformers/pull/6813#issuecomment-687208867
      
      
      
      * fix quality
      
      * Fix RAG Sequence generation
      
      * minor cleanup plus initial tests
      
      * fix test
      
      * fix tests 2
      
      * Comments fix
      
      * post-merge fixes
      
      * Improve readme + post-rebase refactor
      
      * Extra dependencied for tests
      
      * Fix tests
      
      * Fix tests 2
      
      * Refactor test requirements
      
      * Fix tests 3
      
      * Post-rebase refactor
      
      * rename nlp->datasets
      
      * RAG integration tests
      
      * add tokenizer to slow integration test and allow retriever to run on cpu
      
      * add tests; fix position ids warning
      
      * change structure
      
      * change structure
      
      * add from encoder generator
      
      * save working solution
      
      * make all integration tests pass
      
      * add RagTokenizer.save/from_pretrained and RagRetriever.save/from_pretrained
      
      * don't save paths
      
      * delete unnecessary imports
      
      * pass config to AutoTokenizer.from_pretrained for Rag tokenizers
      
      * init wiki_dpr only once
      
      * hardcode legacy index and passages paths (todo: add the right urls)
      
      * finalize config
      
      * finalize retriver api and config api
      
      * LegacyIndex index download refactor
      
      * add dpr to autotokenizer
      
      * make from pretrained more flexible
      
      * fix ragfortokengeneration
      
      * small name changes in tokenizer
      
      * add labels to models
      
      * change default index name
      
      * add retrieval tests
      
      * finish token generate
      
      * align test with previous version and make all tests pass
      
      * add tests
      
      * finalize tests
      
      * implement thoms suggestions
      
      * add first version of test
      
      * make first tests work
      
      * make retriever platform agnostic
      
      * naming
      
      * style
      
      * add legacy index URL
      
      * docstrings + simple retrieval test for distributed
      
      * clean model api
      
      * add doc_ids to retriever's outputs
      
      * fix retrieval tests
      
      * finish model outputs
      
      * finalize model api
      
      * fix generate problem for rag
      
      * fix generate for other modles
      
      * fix some tests
      
      * save intermediate
      
      * set generate to default
      
      * big refactor generate
      
      * delete rag_api
      
      * correct pip faiss install
      
      * fix auto tokenization test
      
      * fix faiss install
      
      * fix test
      
      * move the distributed logic to examples
      
      * model page
      
      * docs
      
      * finish tests
      
      * fix dependencies
      
      * fix import in __init__
      
      * Refactor eval_rag and finetune scripts
      
      * start docstring
      
      * add psutil to test
      
      * fix tf test
      
      * move require torch to top
      
      * fix retrieval test
      
      * align naming
      
      * finish automodel
      
      * fix repo consistency
      
      * test ragtokenizer save/load
      
      * add rag model output docs
      
      * fix ragtokenizer save/load from pretrained
      
      * fix tokenizer dir
      
      * remove torch in retrieval
      
      * fix docs
      
      * fixe finetune scripts
      
      * finish model docs
      
      * finish docs
      
      * remove auto model for now
      
      * add require torch
      
      * remove solved todos
      
      * integrate sylvains suggestions
      
      * sams comments
      
      * correct mistake on purpose
      
      * improve README
      
      * Add generation test cases
      
      * fix rag token
      
      * clean token generate
      
      * fix test
      
      * add note to test
      
      * fix attention mask
      
      * add t5 test for rag
      
      * Fix handling prefix in finetune.py
      
      * don't overwrite index_name
      Co-authored-by: default avatarPatrick Lewis <plewis@fb.com>
      Co-authored-by: default avatarAleksandra Piktus <piktus@devfair0141.h2.fair>
      Co-authored-by: default avatarAleksandra Piktus <piktus@learnfair5102.h2.fair>
      Co-authored-by: default avatarAleksandra Piktus <piktus@learnfair5067.h2.fair>
      Co-authored-by: default avatarYour Name <you@example.com>
      Co-authored-by: default avatarPatrick von Platen <patrick.v.platen@gmail.com>
      Co-authored-by: default avatarQuentin Lhoest <lhoest.q@gmail.com>
      c754c41c
  2. 10 Sep, 2020 1 commit
  3. 24 Aug, 2020 1 commit
  4. 13 Aug, 2020 1 commit
  5. 31 Jul, 2020 1 commit
    • Paul O'Leary McCann's avatar
      Replace mecab-python3 with fugashi for Japanese tokenization (#6086) · cf3cf304
      Paul O'Leary McCann authored
      
      
      * Replace mecab-python3 with fugashi
      
      This replaces mecab-python3 with fugashi for Japanese tokenization. I am
      the maintainer of both projects.
      
      Both projects are MeCab wrappers, so the underlying C++ code is the
      same. fugashi is the newer wrapper and doesn't use SWIG, so for basic
      use of the MeCab API it's easier to use.
      
      This code insures the use of a version of ipadic installed via pip,
      which should make versioning and tracking down issues easier.
      
      fugashi has wheels for Windows, OSX, and Linux, which will help with
      issues with installing old versions of mecab-python3 on Windows.
      Compared to mecab-python3, because fugashi doesn't use SWIG, it doesn't
      require a C++ runtime to be installed on Windows.
      
      In adding this change I removed some code dealing with `cursor`,
      `token_start`, and `token_end` variables. These variables didn't seem to
      be used for anything, it is unclear to me why they were there.
      
      I ran the tests and they passed, though I couldn't figure out how to run
      the slow tests (`--runslow` gave an error) and didn't try testing with
      Tensorflow.
      
      * Style fix
      
      * Remove unused variable
      
      Forgot to delete this...
      
      * Adapt doc with install instructions
      
      * Fix typo
      Co-authored-by: default avatarsgugger <sylvain.gugger@gmail.com>
      Co-authored-by: default avatarSylvain Gugger <35901082+sgugger@users.noreply.github.com>
      cf3cf304
  6. 27 Jul, 2020 1 commit
  7. 24 Jul, 2020 1 commit
  8. 07 Jul, 2020 1 commit
  9. 25 Jun, 2020 1 commit
  10. 22 Jun, 2020 1 commit
    • Patrick von Platen's avatar
      Benchmarks (#4912) · fa0be6d7
      Patrick von Platen authored
      * finish benchmark
      
      * fix isort
      
      * fix setup cfg
      
      * retab
      
      * fix time measuring of tf graph mode
      
      * fix tf cuda
      
      * clean code
      
      * better error message
      fa0be6d7
  11. 17 Jun, 2020 1 commit
  12. 05 Jun, 2020 1 commit
  13. 14 May, 2020 1 commit
  14. 01 May, 2020 1 commit
  15. 28 Apr, 2020 2 commits
  16. 20 Feb, 2020 1 commit
  17. 13 Jan, 2020 1 commit
  18. 10 Jan, 2020 2 commits
  19. 06 Jan, 2020 2 commits
  20. 23 Dec, 2019 2 commits
  21. 22 Dec, 2019 4 commits
  22. 21 Dec, 2019 1 commit