  1. 24 Aug, 2020 1 commit
  2. 17 Aug, 2020 1 commit
  3. 31 Jul, 2020 1 commit
    • Replace mecab-python3 with fugashi for Japanese tokenization (#6086) · cf3cf304
      Paul O'Leary McCann authored
      
      
      * Replace mecab-python3 with fugashi
      
      This replaces mecab-python3 with fugashi for Japanese tokenization. I am
      the maintainer of both projects.
      
      Both projects are MeCab wrappers, so the underlying C++ code is the
      same. fugashi is the newer wrapper and doesn't use SWIG, so for basic
      use of the MeCab API it's easier to use.
      
      This code ensures the use of a version of ipadic installed via pip,
      which should make versioning and tracking down issues easier.
      
      fugashi has wheels for Windows, OSX, and Linux, which will help with
      issues with installing old versions of mecab-python3 on Windows.
      Unlike mecab-python3, fugashi doesn't use SWIG, so it doesn't
      require a C++ runtime to be installed on Windows.
      
      In adding this change I removed some code dealing with the `cursor`,
      `token_start`, and `token_end` variables. These variables didn't seem to
      be used for anything, and it is unclear to me why they were there.
      
      I ran the tests and they passed, though I couldn't figure out how to run
      the slow tests (`--runslow` gave an error) and didn't try testing with
      Tensorflow.
      
      * Style fix
      
      * Remove unused variable
      
      Forgot to delete this...
      
      * Adapt doc with install instructions
      
      * Fix typo
      Co-authored-by: sgugger <sylvain.gugger@gmail.com>
      Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
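The tokenizer setup described in the commit message above can be sketched roughly as follows. This is a minimal illustration and not the code from the commit: it assumes the `fugashi` and `ipadic` packages are installed from pip, and the helper names `build_mecab_args` and `tokenize_japanese` are hypothetical.

```python
import os


def build_mecab_args(dic_dir, extra_args=""):
    """Build a MeCab option string that pins the dictionary to `dic_dir`.

    Hypothetical helper: `-d` selects the dictionary directory and `-r`
    selects the mecabrc file expected inside that directory.
    """
    mecabrc = os.path.join(dic_dir, "mecabrc")
    return "-d {} -r {} {}".format(dic_dir, mecabrc, extra_args).strip()


def tokenize_japanese(text):
    """Split `text` into surface forms using fugashi and the pip-installed ipadic."""
    # Imported lazily so this module loads even without the optional dependencies.
    import fugashi
    import ipadic

    # GenericTagger is used because ipadic is not a UniDic-format dictionary.
    tagger = fugashi.GenericTagger(build_mecab_args(ipadic.DICDIR))
    return [word.surface for word in tagger(text)]
```

Pointing MeCab at `ipadic.DICDIR` explicitly is what ties tokenization to the pip-installed dictionary rather than to whatever dictionary happens to be installed system-wide.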
  4. 29 Jul, 2020 1 commit
  5. 27 Jul, 2020 1 commit
  6. 18 Jul, 2020 1 commit
  7. 06 Jul, 2020 3 commits
  8. 03 Jul, 2020 2 commits
  9. 02 Jul, 2020 1 commit
  10. 30 Jun, 2020 1 commit
  11. 29 Jun, 2020 2 commits
  12. 25 Jun, 2020 1 commit
  13. 23 Jun, 2020 1 commit
    • Tokenizers API developments (#5103) · 11fdde02
      Thomas Wolf authored
      
      
      * Add return lengths
      
      * make pad a bit more flexible so it can be used as collate_fn
      
      * check all kwargs sent to encoding method are known
      
      * fixing kwargs in encodings
      
      * New AddedToken class in python
      
      This class lets you specify specific tokenization behaviors for some special tokens. It is used in particular for GPT2 and Roberta, to control how whitespace is stripped around special tokens.
      
      * style and quality
      
      * switched to huggingface tokenizers library for AddedTokens
      
      * up to tokenizer 0.8.0-rc3 - update API to use AddedToken state
      
      * style and quality
      
      * do not raise an error on additional or unused kwargs for tokenize() but only a warning
      
      * transfo-xl pretrained model requires torch
      
      * Update src/transformers/tokenization_utils.py
      Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
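To illustrate what "controlling how whitespace is stripped around special tokens" can mean, here is a small standalone sketch. It does not use the library's `AddedToken` class; `SketchAddedToken` and `split_on_token` are hypothetical names, and the exact semantics of the real class may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SketchAddedToken:
    """Hypothetical stand-in for an added-token description (not the library class)."""

    content: str
    lstrip: bool = False  # absorb whitespace to the left of the token
    rstrip: bool = False  # absorb whitespace to the right of the token


def split_on_token(text, token):
    """Split `text` around each occurrence of `token.content`.

    Whitespace next to the token is dropped from the neighbouring piece only
    on the sides where the corresponding flag is set.
    """
    pieces = text.split(token.content)
    out = []
    for i, piece in enumerate(pieces):
        if i > 0:
            out.append(token.content)
            if token.rstrip:
                piece = piece.lstrip()  # text right of the token loses leading spaces
        if i < len(pieces) - 1 and token.lstrip:
            piece = piece.rstrip()  # text left of the token loses trailing spaces
        if piece:
            out.append(piece)
    return out
```

With `lstrip=True`, the space before a token such as `<mask>` is treated as part of the token instead of being left on the preceding text, which matters for models whose vocabulary encodes leading spaces.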
  14. 18 Jun, 2020 1 commit
  15. 15 Jun, 2020 1 commit
    • [HUGE] Refactoring tokenizers backend - padding - truncation - pre-tokenized... · 36434220
      Anthony MOI authored
      
      [HUGE] Refactoring tokenizers backend - padding - truncation - pre-tokenized pipeline - fast tokenizers - tests (#4510)
      
      * Use tokenizers pre-tokenized pipeline
      
      * failing pretokenized test
      
      * Fix is_pretokenized in python
      
      * add pretokenized tests
      
      * style and quality
      
      * better tests for batched pretokenized inputs
      
      * tokenizers clean up - new padding_strategy - split the files
      
      * [HUGE] refactoring tokenizers - padding - truncation - tests
      
      * style and quality
      
      * bump up required tokenizers version to 0.8.0-rc1
      
      * switched padding/truncation API - simpler better backward compat
      
      * updating tests for custom tokenizers
      
      * style and quality - tests on pad
      
      * fix QA pipeline
      
      * fix backward compatibility for max_length only
      
      * style and quality
      
      * Various clean-ups - add verbose
      
      * fix tests
      
      * update docstrings
      
      * Fix tests
      
      * Docs reformatted
      
      * __call__ method documented
      Co-authored-by: Thomas Wolf <thomwolf@users.noreply.github.com>
      Co-authored-by: Lysandre <lysandre.debut@reseau.eseo.fr>
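The padding and truncation refactoring listed above is only summarized in the commit message. As a rough, library-independent sketch of what a padding strategy does, the function below pads a batch either to its longest sequence or to a fixed length. The name `pad_batch`, the strategy strings, and the output keys are assumptions made for this illustration, not the code introduced by the commit.

```python
def pad_batch(batch, padding="longest", max_length=None, pad_id=0):
    """Pad (and, if needed, truncate) a batch of token-id lists to one length.

    padding="longest"    pads to the longest sequence in the batch.
    padding="max_length" pads or truncates every sequence to `max_length`.
    """
    if padding == "longest":
        target = max(len(ids) for ids in batch)
    elif padding == "max_length":
        if max_length is None:
            raise ValueError("max_length is required when padding='max_length'")
        target = max_length
    else:
        raise ValueError("unknown padding strategy: {!r}".format(padding))

    input_ids, attention_mask = [], []
    for ids in batch:
        ids = list(ids)[:target]  # truncate anything longer than the target
        n_pad = target - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks real tokens, 0 marks padding positions.
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}
```

Because it takes a list of examples and returns one padded batch, a function of this shape can also be passed as a collate function to a data loader.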
  16. 09 Jun, 2020 1 commit
    • [Benchmark] add tpu and torchscript for benchmark (#4850) · 2cfb947f
      Patrick von Platen authored
      
      
      * add tpu and torchscript for benchmark
      
      * fix name in tests
      
      * "fix email"
      
      * make style
      
      * better log message for tpu
      
      * add more print and info for tpu
      
      * allow possibility to print tpu metrics
      
      * correct cpu usage
      
      * fix test for non-install
      
      * remove bogus file
      
      * include psutil in testing
      
      * run a couple of times before tracing in torchscript
      
      * do not allow tpu memory tracing for now
      
      * make style
      
      * add torchscript to env
      
      * better name for torch tpu
      Co-authored-by: Patrick von Platen <patrick@huggingface.co>
  17. 02 Jun, 2020 2 commits
  18. 26 May, 2020 1 commit
    • Make transformers-cli cross-platform (#4131) · 8cc6807e
      Bram Vanroy authored
      * make transformers-cli cross-platform
      
      Using "scripts" is a useful option in setup.py, particularly when you want to get access to non-Python scripts. However, in this case we want to have an entry point into some of our own Python scripts. To do this in a concise, cross-platform way, we can use entry_points.console_scripts. This change is necessary to provide the CLI on different platforms, which "scripts" does not ensure. Usage remains the same, but the "transformers-cli" script has to be moved (to be part of the library) and renamed (underscore + extension).
      
      * make style & quality
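The `entry_points.console_scripts` mechanism mentioned in the commit message above can be sketched as follows. The `SETUP_KWARGS` dictionary shows the general shape of the declaration; the module path in it is an assumption based on the "moved and renamed" description, and `parse_console_script` and `load_console_script` are hypothetical helpers that only illustrate how a `name=module:function` spec is resolved.

```python
import importlib


# Illustrative setup() keyword arguments; the module path is an assumption.
SETUP_KWARGS = {
    "entry_points": {
        "console_scripts": [
            "transformers-cli=transformers.commands.transformers_cli:main",
        ]
    }
}


def parse_console_script(spec):
    """Split a 'name=package.module:function' spec into its three parts."""
    name, _, target = spec.partition("=")
    module_path, _, func_name = target.partition(":")
    return name.strip(), module_path.strip(), func_name.strip()


def load_console_script(spec):
    """Import the module named in `spec` and return the callable it points to."""
    _, module_path, func_name = parse_console_script(spec)
    return getattr(importlib.import_module(module_path), func_name)
```

At install time, the packaging tools generate a small platform-appropriate launcher (for example an `.exe` wrapper on Windows) that performs this import and call, which is why the target has to be an importable module inside the library instead of a loose script file.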
  19. 22 May, 2020 3 commits
  20. 14 May, 2020 3 commits
    • Conversion script to export transformers models to ONNX IR. (#4253) · db0076a9
      Funtowicz Morgan authored
      * Added generic ONNX conversion script for PyTorch model.
      
      * WIP initial TF support.
      
      * TensorFlow/Keras ONNX export working.
      
      * Print framework version info
      
      * Add possibility to check the model is correctly loading on ONNX runtime.
      
      * Remove quantization option.
      
      * Specify ONNX opset version when exporting.
      
      * Formatting.
      
      * Remove unused imports.
      
      * Make functions more generally reusable from other parts of the code.
      
      * isort happy.
      
      * flake happy
      
      * Export only feature-extraction for now
      
      * Correctly check inputs order / filter before export.
      
      * Removed task variable
      
      * Fix invalid args call in load_graph_from_args.
      
      * Fix invalid args call in convert.
      
      * Fix invalid args call in infer_shapes.
      
      * Raise exception and catch in caller function instead of exit.
      
      * Add 04-onnx-export.ipynb notebook
      
      * More WIP on the notebook
      
      * Remove unused imports
      
      * Simplify & remove unused constants.
      
      * Export with constant_folding in PyTorch
      
      * Let's try to put function args in the right order this time ...
      
      * Disable external_data_format temporarily
      
      * ONNX notebook draft ready.
      
      * Updated notebooks charts + wording
      
      * Correct error while exporting last chart in notebook.
      
      * Addressing @LysandreJik comment.
      
      * Set ONNX opset to 11 as default value.
      
      * Set opset param mandatory
      
      * Added ONNX export unittests
      
      * Quality.
      
      * flake8 happy
      
      * Add keras2onnx dependency on extras["tf"]
      
      * Pin keras2onnx on github master to v1.6.5
      
      * Second attempt.
      
      * Third attempt.
      
      * Use the right repo URL this time ...
      
      * Do the same for onnxconverter-common
      
      * Added keras2onnx and onnxconverter-common to 1.7.0 to support TF2.2
      
      * Correct commit hash.
      
      * Addressing PR review: Optimizations are enabled by default.
      
      * Addressing PR review: small changes in the notebook
      
      * setup.py comment about keras2onnx versioning.
    • Fix: unpin flake8 and fix cs errors (#4367) · 448c4672
      Julien Chaumond authored
      * Fix: unpin flake8 and fix cs errors
      
      * Ok we still need to quote those
    • [ci skip] Pin isort · 015f7812
      Julien Chaumond authored
  21. 13 May, 2020 1 commit
  22. 12 May, 2020 2 commits
  23. 11 May, 2020 1 commit
  24. 07 May, 2020 2 commits
  25. 05 May, 2020 1 commit
    • Pytorch 1.5.0 (#3973) · 79b1c696
      Lysandre Debut authored
      * Standard deviation can no longer be set to 0
      
      * Remove torch pinned version
      
      * 9th instead of 10th, silly me
  26. 01 May, 2020 1 commit
  27. 27 Apr, 2020 1 commit
  28. 22 Apr, 2020 1 commit
  29. 21 Apr, 2020 1 commit