1. 06 Apr, 2020 1 commit
    • Tokenizers v3.0.0 (#3185) · 96ab75b8
      Funtowicz Morgan authored
      
      
      * Renamed num_added_tokens to num_special_tokens_to_add
      Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
      
      * Cherry-pick: partially fix space-only input without special tokens added to the output #3091
      Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
      
      * Added property is_fast on PretrainedTokenizer and PretrainedTokenizerFast
      Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
      
      * Make fast tokenizers unittests work on Windows.
      
      * Entirely refactored unittest for tokenizers fast.
      
      * Remove ABC class for CommonFastTokenizerTest
      
      * Added embeded_special_tokens tests from allenai @dirkgr
      
      * Make embeded_special_tokens tests from allenai more generic
      
      * Uniformize vocab_size as a property for both Fast and normal tokenizers
      
      * Move special tokens handling out of PretrainedTokenizer (SpecialTokensMixin)
      
      * Ensure providing None input raises the same ValueError as the Python tokenizer + tests.
      
      * Fix invalid input for assert_padding when testing batch_encode_plus
      
      * Move add_special_tokens from constructor to tokenize/encode/[batch_]encode_plus methods parameter.
      
      * Ensure tokenize() correctly forwards add_special_tokens to Rust.
      
      * Add None checking on top of encode / encode_batch for TransfoXLTokenizerFast.
      Avoid stripping on None values.
      
      * unittests ensure tokenize() also throws a ValueError if provided None
      
      * Added add_special_tokens unittest for all supported models.
      
      * Style
      
      * Make sure the TransfoXL test runs only if PyTorch is provided.
      
      * Split up tokenizers tests for each model type.
      
      * Fix invalid unittest with new tokenizers API.
      
      * Filter out Roberta openai detector models from unittests.
      
      * Introduce BatchEncoding on fast tokenizers path.
      
      This new structure exposes all the mappings retrieved from Rust.
      It also keeps the current behavior when fed to the model's forward pass.
      
      * Introduce BatchEncoding on slow tokenizers path.
      
      Backward compatibility.
      
      * Improve error message on BatchEncoding for slow path
      
      * Make add_prefix_space True by default on Roberta fast to match the Python tokenizer in the majority of cases.
      
      * Style and format.
      
      * Added typing on all methods for PretrainedTokenizerFast
      
      * Style and format
      
      * Added path for feeding pretokenized (List[str]) input to PretrainedTokenizerFast.
      
      * Style and format
      
      * encode_plus now supports pretokenized inputs.
      
      * Remove user warning about add_special_tokens when working on pretokenized inputs.
      
      * Always go through the post processor.
      
      * Added support for pretokenized input pairs on encode_plus
      
      * Added is_pretokenized flag on encode_plus for clarity and improved error message on input TypeError.
      
      * Added pretokenized inputs support on batch_encode_plus
      
      * Update BatchEncoding method names to match Encoding.
      
      * Bump setup.py tokenizers dependency to 0.7.0rc1
      
      * Remove unused parameters in BertTokenizerFast
      
      * Make sure Roberta returns token_type_ids for unittests.
      
      * Added missing typings
      
      * Update add_tokens prototype to match the tokenizers side and allow AddedToken
      
      * Bumping tokenizers to 0.7.0rc2
      
      * Added documentation for BatchEncoding
      
      * Added (unused) is_pretokenized parameter on PreTrainedTokenizer encode_plus/batch_encode_plus methods.
      
      * Added higher-level typing for tokenize / encode_plus / batch_encode_plus.
      
      * Fix unittests failing because add_special_tokens was defined as a constructor parameter on Rust Tokenizers.
      
      * Fix text-classification pipeline using the wrong tokenizer
      
      * Make pipelines work with BatchEncoding
      
      * Turn off add_special_tokens on tokenize by default.
      Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
      
      * Remove add_prefix_space from tokenize call in unittest.
      Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
      
      * Style and quality
      Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
      
      * Correct message for batch_encode_plus None input exception.
      Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
      
      * Fix invalid list comprehension for offset_mapping overriding content on every iteration.
      Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
      
      * TransfoXL uses Strip normalizer.
      Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
      
      * Bump tokenizers dependency to 0.7.0rc3
      Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
      
      * Support AddedTokens for special_tokens and use left stripping on mask for Roberta.
      Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
      
      * SpecialTokensMixin can use slots for faster access to underlying attributes.
      Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
      
      * Remove update_special_tokens from fast tokenizers.
      
      * Ensure TransfoXL unittests are run only when torch is available.
      
      * Style.
      Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
      
      * Style
      
      * Style 🙏🙏
      
      
      
      * Remove slots on SpecialTokensMixin; needs a deeper dive into the pickle protocol.
      
      * Remove Roberta warning on __init__.
      
      * Move documentation to Google style.
      Co-authored-by: LysandreJik <lysandre.debut@reseau.eseo.fr>
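      
      A minimal sketch of the API surface described above, assuming a transformers build containing this change and tokenizers 0.7.x; the checkpoint name and example strings are illustrative only:
      
          from transformers import BertTokenizerFast
          
          # Any fast (Rust-backed) tokenizer; the checkpoint is only an example.
          tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
          
          print(tokenizer.is_fast)                      # new is_fast property
          print(tokenizer.num_special_tokens_to_add())  # renamed from num_added_tokens
          
          # add_special_tokens is now passed to encode/encode_plus, not the constructor.
          enc = tokenizer.encode_plus("Hello world", add_special_tokens=True)
          
          # encode_plus returns a BatchEncoding, which still behaves like the old dict
          # when fed to a model's forward pass.
          print(enc["input_ids"], enc["attention_mask"])
          
          # Pre-tokenized (List[str]) input goes through the same methods via is_pretokenized.
          pre = tokenizer.encode_plus(["Hello", "world"], is_pretokenized=True)
          print(pre["input_ids"])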
  2. 26 Mar, 2020 2 commits
    • Adds translation pipeline (#3419) · 022e8fab
      Patrick von Platen authored
      * fix merge conflicts
      
      * add t5 summarization example
      
      * change parameters for t5 summarization
      
      * make style
      
      * add first code snippet for translation
      
      * only add prefixes
      
      * add prefix patterns
      
      * make style
      
      * renaming
      
      * fix conflicts
      
      * remove unused patterns
      
      * solve conflicts
      
      * fix merge conflicts
      
      * remove translation example
      
      * remove summarization example
      
      * make sure tensors are in numpy for float comparison
      
      * re-add t5 config
      
      * fix t5 import config typo
      
      * make style
      
      * remove unused numpy statements
      
      * update docstring
      
      * import translation pipeline
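      
      A sketch of what using the new translation pipeline could look like, assuming the en→de task is registered under "translation_en_to_de" and backed by a T5 checkpoint with its "translate English to German: ..." prefix; the generation parameter is illustrative:
      
          from transformers import pipeline
          
          # Task-prefix based translation (T5-style prefixes mentioned above).
          translator = pipeline("translation_en_to_de")
          print(translator("Hugging Face is a company based in New York.", max_length=40))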
    • Add t5 to pipeline(task='summarization') (#3413) · 9c683ef0
      Patrick von Platen authored
      * solve conflicts
      
      * move warnings below
      
      * incorporate changes
      
      * add pad_to_max_length to pipelines
      
      * add bug fix for T5 beam search
      
      * add prefix patterns
      
      * make style
      
      * fix conflicts
      
      * adapt pipelines for task specific parameters
      
      * improve docstring
      
      * remove unused patterns
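      
      A sketch of the T5 path added to the summarization pipeline, with task-specific generation parameters passed at call time; the checkpoint name, text, and kwargs are illustrative:
      
          from transformers import pipeline
          
          ARTICLE = (
              "The tower is 324 metres tall, about the same height as an 81-storey "
              "building, and the tallest structure in Paris."
          )
          
          # Explicit T5 model/tokenizer; min_length/max_length are task-specific parameters.
          summarizer = pipeline("summarization", model="t5-small", tokenizer="t5-small")
          print(summarizer(ARTICLE, min_length=5, max_length=30))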
  3. 17 Mar, 2020 1 commit
    • Add Summarization to Pipelines (#3128) · 38a555a8
      Sam Shleifer authored
      * passing
      
      * Undo stupid chg
      
      * docs
      
      * undo rename
      
      * delete-cruft
      
      * only import if you have torch
      
      * Don't rely on dict ordering
      
      * Fix dict ordering upstream
      
      * docstring link
      
      * docstring link
      
      * remove trailing comma for 3.5 compat
      
      * new name
      
      * delegate kwarging
      
      * Update kwargs
  4. 09 Mar, 2020 1 commit
  5. 02 Mar, 2020 1 commit
    • Pipeline doc (#3055) · d3eb7d23
      Lysandre Debut authored
      * Pipeline doc initial commit
      
      * pipeline abstraction
      
      * Remove modelcard argument from pipeline
      
      * Task-specific pipelines can be instantiated with no model or tokenizer
      
      * All pipelines doc
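      
      A sketch of the two instantiation paths the doc covers: a task-specific pipeline built with no model or tokenizer (library defaults), and one built from an explicit pair, with the modelcard argument gone; the checkpoint name is illustrative:
      
          from transformers import pipeline, AutoModelForSequenceClassification, AutoTokenizer
          
          # 1) Task-specific pipeline with no model or tokenizer: defaults are used.
          nlp = pipeline("sentiment-analysis")
          
          # 2) Explicit model/tokenizer pair (no modelcard argument anymore).
          name = "distilbert-base-uncased-finetuned-sst-2-english"
          model = AutoModelForSequenceClassification.from_pretrained(name)
          tok = AutoTokenizer.from_pretrained(name)
          nlp = pipeline("sentiment-analysis", model=model, tokenizer=tok)
          
          print(nlp("This library keeps getting better."))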
  6. 19 Feb, 2020 1 commit
  7. 18 Feb, 2020 1 commit
  8. 13 Feb, 2020 1 commit
    • Preserve spaces in GPT-2 tokenizers (#2778) · f1e8a51f
      Joe Davison authored
      * Preserve spaces in GPT-2 tokenizers
      
      Preserves spaces after special tokens in GPT-2 and inherited (RoBERTa)
      tokenizers, enabling correct BPE encoding. Automatically inserts a space
      in front of the first token in the encode function when adding special tokens.
      
      * Add tokenization preprocessing method
      
      * Add framework argument to pipeline factory
      
      Also fixes a pipeline test issue. Each test input is now treated as a
      distinct sequence.
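      
      A small sketch of why the space handling matters for byte-level BPE; the checkpoint name is illustrative and the exact token strings depend on the vocabulary:
      
          from transformers import GPT2Tokenizer
          
          tok = GPT2Tokenizer.from_pretrained("gpt2")
          
          # Byte-level BPE encodes a leading space into the token itself (the "Ġ" marker),
          # so dropping or adding a space around special tokens changes the resulting ids.
          print(tok.tokenize("world"))    # no leading space
          print(tok.tokenize(" world"))   # leading space preserved, typically "Ġworld"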
  9. 07 Feb, 2020 2 commits
  10. 30 Jan, 2020 1 commit
    • fill_mask helper (#2576) · 9fa836a7
      Julien Chaumond authored
      * fill_mask helper
      
      * [poc] FillMaskPipeline
      
      * Revert "[poc] FillMaskPipeline"
      
      This reverts commit 67eeea55b0f97b46c2b828de0f4ee97d87338335.
      
      * Revert "fill_mask helper"
      
      This reverts commit cacc17b884e14bb6b07989110ffe884ad9e36eaa.
      
      * README: clarify that Pipelines can also do text-classification
      
      cf. question at the AI&ML meetup last week, @mfuntowicz
      
      * Fix test: test feature-extraction pipeline
      
      * Test tweaks
      
      * Slight refactor of existing pipeline (in preparation for the new FillMaskPipeline)
      
      * Extraneous doc
      
      * More robust way of doing this
      
      @mfuntowicz as we don't rely on the model name anymore (see AutoConfig)
      
      * Also add RobertaConfig as a quickfix for wrong token_type_ids
      
      * cs
      
      * [BIG] FillMaskPipeline
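      
      A sketch of the resulting FillMaskPipeline usage, assuming it is registered under the "fill-mask" task; the model name and the RoBERTa-style <mask> token are illustrative:
      
          from transformers import pipeline
          
          fill_mask = pipeline("fill-mask", model="distilroberta-base")
          
          # Returns the top candidate tokens for the masked position with their scores.
          print(fill_mask("HuggingFace is creating a <mask> that the community uses."))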
  11. 15 Jan, 2020 2 commits
  12. 06 Jan, 2020 2 commits
  13. 22 Dec, 2019 4 commits
  14. 21 Dec, 2019 1 commit
    • Reformat source code with black. · fa84ae26
      Aymeric Augustin authored
      This is the result of:
      
          $ black --line-length 119 examples templates transformers utils hubconf.py setup.py
      
      There are a lot of fairly long lines in the project. As a consequence, I'm
      picking the longest widely accepted line length, 119 characters.
      
      This is also Thomas' preference, because it allows for explicit variable
      names that make the code easier to understand.
  15. 20 Dec, 2019 3 commits
  16. 10 Dec, 2019 2 commits