1. 16 Jun, 2020 1 commit
  2. 03 Jun, 2020 1 commit
    • Pipelines: miscellanea of QoL improvements and small features... (#4632) · 99207bd1
      Julien Chaumond authored
      * [hf_api] Attach all unknown attributes for future-proof compatibility
      
      * [Pipeline] NerPipeline is really a TokenClassificationPipeline
      
      * modelcard.py: I don't think we need to force the download
      
      * Remove config, tokenizer from SUPPORTED_TASKS as we're moving to one model = one weight + one tokenizer
      
      * FillMaskPipeline: also output token in string form
      
      * TextClassificationPipeline: option to return all scores, not just the argmax
      
      * Update docs/source/main_classes/pipelines.rst
  3. 02 Jun, 2020 2 commits
  4. 22 May, 2020 1 commit
  5. 17 May, 2020 1 commit
    • Allow the creation of "entity groups" for NerPipeline #3548 (#3957) · 18d233d5
      Lorenzo Ampil authored
      * Add index to be returned by NerPipeline to allow for the creation of
      
      * Add entity groups
      
      * Convert entity list to dict
      
      * Add entity to entity_group_disagg after updating entity groups
      
      * Change 'group' parameter to 'grouped_entities'
      
      * Add unit tests for grouped NER pipeline case
      
      * Correct variable name typo for NER_FINETUNED_MODELS
      
      * Sync grouped tests to recent test updates
  6. 14 May, 2020 2 commits
  7. 11 May, 2020 1 commit
  8. 08 May, 2020 1 commit
  9. 07 May, 2020 1 commit
  10. 02 May, 2020 1 commit
  11. 28 Apr, 2020 1 commit
  12. 22 Apr, 2020 1 commit
    • Pipeline for Text Generation: GenerationPipeline (#3758) · f16540fc
      Lorenzo Ampil authored
      
      
      * Add GenerationPipeline
      
      * Fix parameter names
      
      * Correct __call__ parameters
      
      * Add model type attribute and correct function calls for prepare_input
      
      * Take out trailing commas from init attributes
      
      * Remove unnecessary tokenization line
      
      * Implement support for multiple text inputs
      
      * Apply generation support for multiple input text prompts
      
      * Take out tensor coercion
      
      * Take out batch index
      
      * Add text prompt to return sequence
      
      * Squeeze token tensor before decoding
      
      * Return only a single list of sequences if only one prompt was used
      
      * Correct results variable name
      
      * Add GenerationPipeline to SUPPORTED_TASKS with the alias , initialized w GPT2

      * Registered AutoModelWithLMHead for both pt and t
      
      * Update docstring for GenerationPipeline
      
      * Add kwargs parameter to model.generate
      
      * Take out kwargs parameter after all
      
      * Add generation pipeline example in pipeline docstring
      
      * Fix max length by squeezing tokens tensor
      
      * Apply ensure_tensor_on_device to pytorch tensor
      
      * Include generation step in torch.no_grad
      
      * Take out input from prepare_xlm_input and set 'en' as default xlm_language
      
      * Apply framework specific encoding during prepare_input
      
      * Format w make style
      
      * Move GenerationPipeline import to follow proper import sorting
      
      * Take out trailing comma from generation dict
      
      * Apply requested changes
      
      * Change name to TextGenerationPipeline
      
      * Apply TextGenerationPipeline rename to __init__
      
      * Changing alias to
      
      * Set input mapping as input to ensure_tensor_on_device
      
      * Fix assertion placement
      
      * Add test_text_generation
      
      * Add TextGenerationPipeline to PipelineCommonTests
      
      * Take out whitespace
      
      * Format __init__ w black
      
      * Fix __init__ style
      
      * Format __init__
      
      * Add line to end of __init__
      
      * Correct model tokenizer set for test_text_generation
      
      * Ensure to return list of list, not list of string (to pass test)
      
      * Limit test models to only 3 to limit runtime to address circleCI timeout error
      
      * Update src/transformers/pipelines.py
      Co-Authored-By: Patrick von Platen <patrick.v.platen@gmail.com>

      * Update src/transformers/pipelines.py
      Co-Authored-By: Patrick von Platen <patrick.v.platen@gmail.com>

      * Update src/transformers/pipelines.py
      Co-Authored-By: Patrick von Platen <patrick.v.platen@gmail.com>

      * Update src/transformers/pipelines.py
      Co-Authored-By: Patrick von Platen <patrick.v.platen@gmail.com>

      * Update src/transformers/pipelines.py
      Co-Authored-By: Patrick von Platen <patrick.v.platen@gmail.com>

      * Update tests/test_pipelines.py
      Co-Authored-By: Patrick von Platen <patrick.v.platen@gmail.com>

      * Update src/transformers/pipelines.py
      Co-Authored-By: Patrick von Platen <patrick.v.platen@gmail.com>

      * Update src/transformers/pipelines.py
      Co-Authored-By: Patrick von Platen <patrick.v.platen@gmail.com>

      * Update src/transformers/pipelines.py
      Co-Authored-By: Patrick von Platen <patrick.v.platen@gmail.com>
      
      * Remove argument docstring, __init__, add additional __call__ arguments, and reformat results to list of dict
      
      * Fix blank result list
      
      * Add TextGenerationPipeline to pipelines.rst
      
      * Update src/transformers/pipelines.py
      Co-Authored-By: Patrick von Platen <patrick.v.platen@gmail.com>

      * Update src/transformers/pipelines.py
      Co-Authored-By: Patrick von Platen <patrick.v.platen@gmail.com>
      
      * Fix typos from adding PADDING_TEXT_TOKEN_LENGTH
      
      * Fix incorrectly moved result list
      
      * Update src/transformers/pipelines.py
      Co-Authored-By: Patrick von Platen <patrick.v.platen@gmail.com>
      
      * Update src/transformers/pipelines.py
      
      * Update src/transformers/pipelines.py
      
      * Update src/transformers/pipelines.py
      
      * Update src/transformers/pipelines.py
      
      * Update src/transformers/pipelines.py
      
      * Update src/transformers/pipelines.py
      
      * Update src/transformers/pipelines.py
      
      * Update src/transformers/pipelines.py
      
      * Update src/transformers/pipelines.py
      
      * Update src/transformers/pipelines.py
      
      * Update src/transformers/pipelines.py
      
      * Update src/transformers/pipelines.py
      Co-Authored-By: Patrick von Platen <patrick.v.platen@gmail.com>
      
      * Add back generation line and make style
      
      * Take out blank whitespace
      
      * Apply new alias, text-generation, to test_pipelines
      
      * Fix text generation alias in test
      
      * Update src/transformers/pipelines.py
      Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
      Co-authored-by: Julien Chaumond <chaumond@gmail.com>
  13. 20 Apr, 2020 2 commits
  14. 17 Apr, 2020 1 commit
  15. 16 Apr, 2020 1 commit
  16. 08 Apr, 2020 2 commits
  17. 06 Apr, 2020 1 commit
    • Tokenizers v3.0.0 (#3185) · 96ab75b8
      Funtowicz Morgan authored
      
      
      * Renamed num_added_tokens to num_special_tokens_to_add
      Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

      * Cherry-Pick: Partially fix space only input without special tokens added to the output #3091
      Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

      * Added property is_fast on PretrainedTokenizer and PretrainedTokenizerFast
      Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
      
      * Make fast tokenizers unittests work on Windows.
      
      * Entirely refactored unittest for tokenizers fast.
      
      * Remove ABC class for CommonFastTokenizerTest
      
      * Added embeded_special_tokens tests from allenai @dirkgr
      
      * Make embeded_special_tokens tests from allenai more generic
      
      * Uniformize vocab_size as a property for both Fast and normal tokenizers
      
      * Move special tokens handling out of PretrainedTokenizer (SpecialTokensMixin)
      
      * Ensure providing None input raises the same ValueError as the Python tokenizer + tests.
      
      * Fix invalid input for assert_padding when testing batch_encode_plus
      
      * Move add_special_tokens from constructor to tokenize/encode/[batch_]encode_plus methods parameter.
      
      * Ensure tokenize() correctly forward add_special_tokens to rust.
      
      * Adding None checking on top of encode / encode_batch for TransfoXLTokenizerFast.
      Avoid stripping on None values.
      
      * unittests ensure tokenize() also throws a ValueError if provided None
      
      * Added add_special_tokens unittest for all supported models.
      
      * Style
      
      * Make sure TransfoXL test run only if PyTorch is provided.
      
      * Split up tokenizers tests for each model type.
      
      * Fix invalid unittest with new tokenizers API.
      
      * Filter out Roberta openai detector models from unittests.
      
      * Introduce BatchEncoding on fast tokenizers path.
      
      This new structure exposes all the mappings retrieved from Rust.
      It also keeps the current behavior with model forward.
      
      * Introduce BatchEncoding on slow tokenizers path.
      
      Backward compatibility.
      
      * Improve error message on BatchEncoding for slow path
      
      * Make add_prefix_space True by default on Roberta fast to match Python in majority of cases.
      
      * Style and format.
      
      * Added typing on all methods for PretrainedTokenizerFast
      
      * Style and format
      
      * Added path for feeding pretokenized (List[str]) input to PretrainedTokenizerFast.
      
      * Style and format
      
      * encode_plus now supports pretokenized inputs.
      
      * Remove user warning about add_special_tokens when working on pretokenized inputs.
      
      * Always go through the post processor.
      
      * Added support for pretokenized input pairs on encode_plus
      
      * Added is_pretokenized flag on encode_plus for clarity and improved error message on input TypeError.
      
      * Added pretokenized inputs support on batch_encode_plus
      
      * Update BatchEncoding methods name to match Encoding.
      
      * Bump setup.py tokenizers dependency to 0.7.0rc1
      
      * Remove unused parameters in BertTokenizerFast
      
      * Make sure Roberta returns token_type_ids for unittests.
      
      * Added missing typings
      
      * Update add_tokens prototype to match tokenizers side and allow AddedToken
      
      * Bumping tokenizers to 0.7.0rc2
      
      * Added documentation for BatchEncoding
      
      * Added (unused) is_pretokenized parameter on PreTrainedTokenizer encode_plus/batch_encode_plus methods.
      
      * Added higher-level typing for tokenize / encode_plus / batch_encode_plus.
      
      * Fix unittests failing because add_special_tokens was defined as a constructor parameter on Rust Tokenizers.
      
      * Fix text-classification pipeline using the wrong tokenizer
      
      * Make pipelines work with BatchEncoding
      
      * Turn off add_special_tokens on tokenize by default.
      Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

      * Remove add_prefix_space from tokenize call in unittest.
      Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

      * Style and quality
      Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

      * Correct message for batch_encode_plus none input exception.
      Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

      * Fix invalid list comprehension for offset_mapping overriding content every iteration.
      Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

      * TransfoXL uses Strip normalizer.
      Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

      * Bump tokenizers dependency to 0.7.0rc3
      Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

      * Support AddedTokens for special_tokens and use left stripping on mask for Roberta.
      Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

      * SpecialTokensMixin can use slots for faster access to underlying attributes.
      Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
      
      * Remove update_special_tokens from fast tokenizers.
      
      * Ensure TransfoXL unittests are run only when torch is available.
      
      * Style.
      Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
      
      * Style
      
      * Style 🙏🙏
      
      
      
      * Remove slots on SpecialTokensMixin, need deep dive into pickle protocol.
      
      * Remove Roberta warning on __init__.
      
      * Move documentation to Google style.
      Co-authored-by: LysandreJik <lysandre.debut@reseau.eseo.fr>
  18. 01 Apr, 2020 1 commit
  19. 26 Mar, 2020 3 commits
    • rename string in pipeline · 31197054
      Patrick von Platen authored
    • Adds translation pipeline (#3419) · 022e8fab
      Patrick von Platen authored
      * fix merge conflicts
      
      * add t5 summarization example
      
      * change parameters for t5 summarization
      
      * make style
      
      * add first code snippet for translation
      
      * only add prefixes
      
      * add prefix patterns
      
      * make style
      
      * renaming
      
      * fix conflicts
      
      * remove unused patterns
      
      * solve conflicts
      
      * fix merge conflicts
      
      * remove translation example
      
      * remove summarization example
      
      * make sure tensors are in numpy for float comparison
      
      * re-add t5 config
      
      * fix t5 import config typo
      
      * make style
      
      * remove unused numpy statements
      
      * update docstring
      
      * import translation pipeline
    • Add t5 to pipeline(task='summarization') (#3413) · 9c683ef0
      Patrick von Platen authored
      * solve conflicts
      
      * move warnings below
      
      * incorporate changes
      
      * add pad_to_max_length to pipelines
      
      * add bug fix for T5 beam search
      
      * add prefix patterns
      
      * make style
      
      * fix conflicts
      
      * adapt pipelines for task specific parameters
      
      * improve docstring
      
      * remove unused patterns
  20. 17 Mar, 2020 1 commit
    • Add Summarization to Pipelines (#3128) · 38a555a8
      Sam Shleifer authored
      * passing
      
      * Undo stupid chg
      
      * docs
      
      * undo rename
      
      * delete-cruft
      
      * only import if you have torch
      
      * Don't rely on dict ordering
      
      * Fix dict ordering upstream
      
      * docstring link
      
      * docstring link
      
      * remove trailing comma for 3.5 compat
      
      * new name
      
      * delegate kwarging
      
      * Update kwargs
  21. 06 Mar, 2020 1 commit
  22. 02 Mar, 2020 2 commits
  23. 23 Feb, 2020 1 commit
    • * Added support for Albert when fine-tuning for NER · 869b66f6
      Martin Malmsten authored
      * Added support for Albert in NER pipeline
      
      * Added command-line options to examples/ner/run_ner.py to better control tokenization
      
      * Added class AlbertForTokenClassification
      
      * Changed output for NerPipeline to use .convert_ids_to_tokens(...) instead of .decode(...) to better reflect tokens
  24. 19 Feb, 2020 1 commit
  25. 14 Feb, 2020 1 commit
  26. 13 Feb, 2020 1 commit
    • Preserve spaces in GPT-2 tokenizers (#2778) · f1e8a51f
      Joe Davison authored
      * Preserve spaces in GPT-2 tokenizers
      
      Preserves spaces after special tokens in GPT-2 and inherited (RoBERTa)
      tokenizers, enabling correct BPE encoding. Automatically inserts a space
      in front of first token in encode function when adding special tokens.
      
      * Add tokenization preprocessing method
      
      * Add framework argument to pipeline factory
      
      Also fixes pipeline test issue. Each test input now treated as a
      distinct sequence.
  27. 07 Feb, 2020 3 commits
  28. 03 Feb, 2020 1 commit
  29. 30 Jan, 2020 1 commit
    • fill_mask helper (#2576) · 9fa836a7
      Julien Chaumond authored
      * fill_mask helper
      
      * [poc] FillMaskPipeline
      
      * Revert "[poc] FillMaskPipeline"
      
      This reverts commit 67eeea55b0f97b46c2b828de0f4ee97d87338335.
      
      * Revert "fill_mask helper"
      
      This reverts commit cacc17b884e14bb6b07989110ffe884ad9e36eaa.
      
      * README: clarify that Pipelines can also do text-classification
      
      cf. question at the AI&ML meetup last week, @mfuntowicz
      
      * Fix test: test feature-extraction pipeline
      
      * Test tweaks
      
      * Slight refactor of existing pipeline (in preparation of new FillMaskPipeline)
      
      * Extraneous doc
      
      * More robust way of doing this
      
      @mfuntowicz as we don't rely on the model name anymore (see AutoConfig)
      
      * Also add RobertaConfig as a quickfix for wrong token_type_ids
      
      * cs
      
      * [BIG] FillMaskPipeline
  30. 15 Jan, 2020 2 commits