1. 07 Sep, 2020 1 commit
  2. 02 Sep, 2020 1 commit
    • [pipelines] Text2TextGenerationPipeline (#6744) · 4230d30f
      Suraj Patil authored
      * add Text2TextGenerationPipeline
      
      * remove max length warning
      
      * remove comments
      
      * remove input_length
      
      * fix typo
      
      * add tests
      
      * use TFAutoModelForSeq2SeqLM
      
      * doc
      
      * typo
      
      * add the doc below TextGenerationPipeline
      
      * doc nit
      
      * style
      
      * delete comment
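
      A minimal usage sketch of the new pipeline (the "text2text-generation" task alias follows this PR; the t5-small checkpoint is an illustrative choice, not mandated by the commit):

          from transformers import pipeline

          # Seq2seq generation through the task alias added in #6744.
          text2text = pipeline("text2text-generation", model="t5-small")
          result = text2text("translate English to French: The house is wonderful.")
          print(result)  # e.g. [{'generated_text': '...'}]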
  3. 26 Aug, 2020 1 commit
  4. 12 Aug, 2020 1 commit
  5. 30 Jul, 2020 1 commit
    • Addition of a DialoguePipeline (#5516) · e642c789
      guillaume-be authored
      * initial commit for pipeline implementation
      
      Addition of input processing and history concatenation
      
      * Conversation pipeline tested and working for single & multiple conversation inputs
      
      * Added docstrings for dialogue pipeline
      
      * Addition of dialogue pipeline integration tests
      
      * Delete test_t5.py
      
      * Fixed max code length
      
      * Updated styling
      
      * Fixed test broken by formatting tools
      
      * Removed unused import
      
      * Added unit test for DialoguePipeline
      
      * Fixed Tensorflow compatibility
      
      * Fixed multi-framework support using framework flag
      
      * - Fixed docstring
      - Added `min_length_for_response` as an initialization parameter
      - Renamed `*args` to `conversations`, `conversations` being a `Conversation` or a `List[Conversation]`
      - Updated truncation to truncate entire segments of conversations, instead of cutting in the middle of a user/bot input
      
      * - renamed pipeline name from dialogue to conversational
      - removed hardcoded default value of 1000 and use config.max_length instead
      - added `append_response` and `set_history` method to the Conversation class to avoid direct fields mutation
      - fixed bug in history truncation method
      
      * - Updated ConversationalPipeline to accept only active conversations (otherwise a ValueError is raised)
      
      * - Simplified input tensor conversion
      
      * - Updated attention_mask value for Tensorflow compatibility
      
      * - Updated last dialogue reference to conversational & fixed integration tests
      
      * Fixed conflict with master
      
      * Updates following review comments
      
      * Updated formatting
      
      * Added Conversation and ConversationalPipeline to the library __init__, addition of docstrings for Conversation, added both to the docs
      
      * Update src/transformers/pipelines.py
      
      Updated docstring following review
      Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
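
      A minimal sketch of the conversational pipeline as it landed (the DialoGPT checkpoint is illustrative, not the commit's pinned default):

          from transformers import Conversation, pipeline

          # Each Conversation object tracks its own user inputs and generated history.
          conversational = pipeline("conversational", model="microsoft/DialoGPT-medium")
          conversation = Conversation("Going to the movies tonight - any suggestions?")
          conversation = conversational(conversation)
          print(conversation.generated_responses[-1])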
  6. 27 Jul, 2020 1 commit
    • Zero shot classification pipeline (#5760) · 3deffc1d
      Joe Davison authored
      * add initial zero-shot pipeline
      
      * change default args
      
      * update default template
      
      * add label string splitting
      
      * add str labels support, remove nli from name
      
      * style
      
      * add input validation and working tf defaults
      
      * tests
      
      * quality check
      
      * add docstring to __call__
      
      * add slow tests
      
      * Change truncation to only_first
      
      also lower precision on tests for readability
      
      * style
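
      A minimal sketch of the zero-shot pipeline (candidate_labels is the argument this PR adds; the underlying NLI checkpoint is whatever default the task registers):

          from transformers import pipeline

          classifier = pipeline("zero-shot-classification")
          result = classifier(
              "Who are you voting for in 2020?",
              candidate_labels=["politics", "economics", "public health"],
          )
          # Labels come back sorted by descending score.
          print(result["labels"][0], result["scores"][0])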
  7. 08 Jul, 2020 1 commit
    • Fix Inconsistent NER Grouping (Pipeline) (#4987) · 0cc4eae0
      Lorenzo Ampil authored
      * Add B I handling to grouping
      
      * Add fix to include separate entity as last token
      
      * move last_idx definition outside loop
      
      * Use first entity in entity group as reference for entity type
      
      * Add test cases
      
      * Take out extra class accidentally added
      
      * Return tf ner grouped test to original
      
      * Take out redundant last entity
      
      * Get last_idx safely
      Co-authored-by: ColleterVi <36503688+ColleterVi@users.noreply.github.com>
      
      * Fix first entity comment
      
      * Create separate functions for group_sub_entities and group_entities (splitting call method to testable functions)
      
      * Take out unnecessary last_idx
      
      * Remove additional forward pass test
      
      * Move token classification basic tests to separate class
      
      * Move token classification basic tests back to monocolumninputtestcase
      
      * Move base ner tests to nerpipelinetests
      
      * Take out unused kwargs
      
      * Add back mandatory_keys argument
      
      * Add unitary tests for group_entities in _test_ner_pipeline
      
      * Fix last entity handling
      
      * Fix grouping function used
      
      * Add typing to group_sub_entities and group_entities
      Co-authored-by: ColleterVi <36503688+ColleterVi@users.noreply.github.com>
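
      A minimal sketch of grouped NER output after this fix (grouped_entities is the flag introduced in #3957 below; the default NER checkpoint is assumed):

          from transformers import pipeline

          # With grouped_entities=True, B-/I- tagged sub-tokens are merged into
          # whole entities carrying an "entity_group" key.
          ner = pipeline("ner", grouped_entities=True)
          for entity in ner("Hugging Face is based in New York City"):
              print(entity["entity_group"], entity["word"])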
  8. 01 Jul, 2020 2 commits
  9. 30 Jun, 2020 1 commit
  10. 26 Jun, 2020 1 commit
  11. 18 May, 2020 1 commit
  12. 17 May, 2020 1 commit
    • Allow the creation of "entity groups" for NerPipeline #3548 (#3957) · 18d233d5
      Lorenzo Ampil authored
      * Add index to be returned by NerPipeline to allow for the creation of
      
      * Add entity groups
      
      * Convert entity list to dict
      
      * Add entity to entity_group_disagg after updating entity groups
      
      * Change 'group' parameter to 'grouped_entities'
      
      * Add unit tests for grouped NER pipeline case
      
      * Correct variable name typo for NER_FINETUNED_MODELS
      
      * Sync grouped tests to recent test updates
  13. 14 May, 2020 1 commit
  14. 08 May, 2020 1 commit
  15. 07 May, 2020 1 commit
  16. 22 Apr, 2020 1 commit
    • Pipeline for Text Generation: GenerationPipeline (#3758) · f16540fc
      Lorenzo Ampil authored
      * Add GenerationPipeline
      
      * Fix parameter names
      
      * Correct __call__ parameters
      
      * Add model type attribute and correct function calls for prepare_input
      
      * Take out trailing commas from init attributes
      
      * Remove unnecessary tokenization line
      
      * Implement support for multiple text inputs
      
      * Apply generation support for multiple input text prompts
      
      * Take out tensor coercion
      
      * Take out batch index
      
      * Add text prompt to return sequence
      
      * Squeeze token tensor before decoding
      
      * Return only a single list of sequences if only one prompt was used
      
      * Correct results variable name
      
      * Add GenerationPipeline to SUPPORTED_TASKS with the alias , initialized with GPT2
      
      * Registered AutoModelWithLMHead for both pt and tf
      
      * Update docstring for GenerationPipeline
      
      * Add kwargs parameter to model.generate
      
      * Take out kwargs parameter after all
      
      * Add generation pipeline example in pipeline docstring
      
      * Fix max length by squeezing tokens tensor
      
      * Apply ensure_tensor_on_device to pytorch tensor
      
      * Include generation step in torch.no_grad
      
      * Take out input from prepare_xlm_input and set 'en' as default xlm_language
      
      * Apply framework specific encoding during prepare_input
      
      * Format with make style
      
      * Move GenerationPipeline import to follow proper import sorting
      
      * Take out trailing comma from generation dict
      
      * Apply requested changes
      
      * Change name to TextGenerationPipeline
      
      * Apply TextGenerationPipeline rename to __init__
      
      * Changing alias to
      
      * Set input mapping as input to ensure_tensor_on_device
      
      * Fix assertion placement
      
      * Add test_text_generation
      
      * Add TextGenerationPipeline to PipelineCommonTests
      
      * Take out whitespace
      
      * Format __init__ w black
      
      * Fix __init__ style
      
      * Format __init__
      
      * Add line to end of __init__
      
      * Correct model tokenizer set for test_text_generation
      
      * Ensure to return list of list, not list of string (to pass test)
      
      * Limit test models to only 3 to limit runtime to address circleCI timeout error
      
      * Update src/transformers/pipelines.py
      Co-Authored-By: Patrick von Platen <patrick.v.platen@gmail.com>
      
      * Update src/transformers/pipelines.py
      Co-Authored-By: Patrick von Platen <patrick.v.platen@gmail.com>
      
      * Update src/transformers/pipelines.py
      Co-Authored-By: Patrick von Platen <patrick.v.platen@gmail.com>
      
      * Update src/transformers/pipelines.py
      Co-Authored-By: Patrick von Platen <patrick.v.platen@gmail.com>
      
      * Update src/transformers/pipelines.py
      Co-Authored-By: Patrick von Platen <patrick.v.platen@gmail.com>
      
      * Update tests/test_pipelines.py
      Co-Authored-By: Patrick von Platen <patrick.v.platen@gmail.com>
      
      * Update src/transformers/pipelines.py
      Co-Authored-By: Patrick von Platen <patrick.v.platen@gmail.com>
      
      * Update src/transformers/pipelines.py
      Co-Authored-By: Patrick von Platen <patrick.v.platen@gmail.com>
      
      * Update src/transformers/pipelines.py
      Co-Authored-By: Patrick von Platen <patrick.v.platen@gmail.com>
      
      * Remove argument docstring, __init__, add additional __call__ arguments, and reformat results to list of dict
      
      * Fix blank result list
      
      * Add TextGenerationPipeline to pipelines.rst
      
      * Update src/transformers/pipelines.py
      Co-Authored-By: Patrick von Platen <patrick.v.platen@gmail.com>
      
      * Update src/transformers/pipelines.py
      Co-Authored-By: Patrick von Platen <patrick.v.platen@gmail.com>
      
      * Fix typos from adding PADDING_TEXT_TOKEN_LENGTH
      
      * Fix incorrectly moved result list
      
      * Update src/transformers/pipelines.py
      Co-Authored-By: Patrick von Platen <patrick.v.platen@gmail.com>
      
      * Update src/transformers/pipelines.py
      
      * Update src/transformers/pipelines.py
      
      * Update src/transformers/pipelines.py
      
      * Update src/transformers/pipelines.py
      
      * Update src/transformers/pipelines.py
      
      * Update src/transformers/pipelines.py
      
      * Update src/transformers/pipelines.py
      
      * Update src/transformers/pipelines.py
      
      * Update src/transformers/pipelines.py
      
      * Update src/transformers/pipelines.py
      
      * Update src/transformers/pipelines.py
      
      * Update src/transformers/pipelines.py
      Co-Authored-By: Patrick von Platen <patrick.v.platen@gmail.com>
      
      * Add back generation line and make style
      
      * Take out blank whitespace
      
      * Apply new alias, text-generation, to test_pipelines
      
      * Fix text generation alias in test
      
      * Update src/transformers/pipelines.py
      Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
      Co-authored-by: Julien Chaumond <chaumond@gmail.com>
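
      A minimal sketch of the renamed TextGenerationPipeline (GPT-2 is the default the commit mentions; max_length here is illustrative):

          from transformers import pipeline

          generator = pipeline("text-generation", model="gpt2")
          outputs = generator("Once upon a time,", max_length=30)
          # Results are a list of dicts, per the reformatting noted above.
          print(outputs[0]["generated_text"])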
  17. 16 Apr, 2020 1 commit
  18. 07 Apr, 2020 1 commit
  19. 06 Apr, 2020 1 commit
    • Tokenizers v3.0.0 (#3185) · 96ab75b8
      Funtowicz Morgan authored
      * Renamed num_added_tokens to num_special_tokens_to_add
      Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
      
      * Cherry-Pick: Partially fix space only input without special tokens added to the output #3091
      Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
      
      * Added property is_fast on PretrainedTokenizer and PretrainedTokenizerFast
      Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
      
      * Make fast tokenizers unittests work on Windows.
      
      * Entirely refactored the unittests for fast tokenizers.
      
      * Remove ABC class for CommonFastTokenizerTest
      
      * Added embeded_special_tokens tests from allenai @dirkgr
      
      * Make embeded_special_tokens tests from allenai more generic
      
      * Uniformize vocab_size as a property for both Fast and normal tokenizers
      
      * Move special tokens handling out of PretrainedTokenizer (SpecialTokensMixin)
      
      * Ensure providing None input raises the same ValueError as the Python tokenizer + tests.
      
      * Fix invalid input for assert_padding when testing batch_encode_plus
      
      * Move add_special_tokens from constructor to tokenize/encode/[batch_]encode_plus methods parameter.
      
      * Ensure tokenize() correctly forwards add_special_tokens to Rust.
      
      * Adding None checking on top of encode / encode_batch for TransfoXLTokenizerFast.
      Avoid stripping on None values.
      
      * unittests ensure tokenize() also throws a ValueError if provided None
      
      * Added add_special_tokens unittest for all supported models.
      
      * Style
      
      * Make sure TransfoXL test run only if PyTorch is provided.
      
      * Split up tokenizers tests for each model type.
      
      * Fix invalid unittest with new tokenizers API.
      
      * Filter out Roberta openai detector models from unittests.
      
      * Introduce BatchEncoding on fast tokenizers path.
      
      This new structure exposes all the mappings retrieved from Rust.
      It also keeps the current behavior with model forward.
      
      * Introduce BatchEncoding on slow tokenizers path.
      
      Backward compatibility.
      
      * Improve error message on BatchEncoding for slow path
      
      * Make add_prefix_space True by default on Roberta fast to match Python in majority of cases.
      
      * Style and format.
      
      * Added typing on all methods for PretrainedTokenizerFast
      
      * Style and format
      
      * Added path for feeding pretokenized (List[str]) input to PretrainedTokenizerFast.
      
      * Style and format
      
      * encode_plus now supports pretokenized inputs.
      
      * Remove user warning about add_special_tokens when working on pretokenized inputs.
      
      * Always go through the post processor.
      
      * Added support for pretokenized input pairs on encode_plus
      
      * Added is_pretokenized flag on encode_plus for clarity and improved error message on input TypeError.
      
      * Added pretokenized inputs support on batch_encode_plus
      
      * Update BatchEncoding methods name to match Encoding.
      
      * Bump setup.py tokenizers dependency to 0.7.0rc1
      
      * Remove unused parameters in BertTokenizerFast
      
      * Make sure Roberta returns token_type_ids for unittests.
      
      * Added missing typings
      
      * Update add_tokens prototype to match tokenizers side and allow AddedToken
      
      * Bumping tokenizers to 0.7.0rc2
      
      * Added documentation for BatchEncoding
      
      * Added (unused) is_pretokenized parameter on PreTrainedTokenizer encode_plus/batch_encode_plus methods.
      
      * Added higher-level typing for tokenize / encode_plus / batch_encode_plus.
      
      * Fix unittests failing because add_special_tokens was defined as a constructor parameter on Rust Tokenizers.
      
      * Fix text-classification pipeline using the wrong tokenizer
      
      * Make pipelines work with BatchEncoding
      
      * Turn off add_special_tokens on tokenize by default.
      Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
      
      * Remove add_prefix_space from tokenize call in unittest.
      Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
      
      * Style and quality
      Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
      
      * Correct message for batch_encode_plus none input exception.
      Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
      
      * Fix invalid list comprehension for offset_mapping overriding content every iteration.
      Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
      
      * TransfoXL uses Strip normalizer.
      Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
      
      * Bump tokenizers dependency to 0.7.0rc3
      Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
      
      * Support AddedTokens for special_tokens and use left stripping on mask for Roberta.
      Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
      
      * SpecialTokensMixin can use slots for faster access to underlying attributes.
      Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
      
      * Remove update_special_tokens from fast tokenizers.
      
      * Ensure TransfoXL unittests are run only when torch is available.
      
      * Style.
      Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
      
      * Style
      
      * Style 🙏🙏
      
      * Remove slots on SpecialTokensMixin, need deep dive into pickle protocol.
      
      * Remove Roberta warning on __init__.
      
      * Move documentation to Google style.
      Co-authored-by: LysandreJik <lysandre.debut@reseau.eseo.fr>
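
      A minimal sketch of the BatchEncoding path this release introduces (the checkpoint is illustrative; offset mappings are only exposed by the fast, Rust-backed tokenizers):

          from transformers import AutoTokenizer

          tokenizer = AutoTokenizer.from_pretrained("bert-base-cased", use_fast=True)
          encoding = tokenizer.encode_plus("Hello world", return_offsets_mapping=True)
          print(encoding["input_ids"])       # token ids, as before
          print(encoding["offset_mapping"])  # character spans from the Rust tokenizer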
  20. 26 Mar, 2020 2 commits
    • Adds translation pipeline (#3419) · 022e8fab
      Patrick von Platen authored
      * fix merge conflicts
      
      * add t5 summarization example
      
      * change parameters for t5 summarization
      
      * make style
      
      * add first code snippet for translation
      
      * only add prefixes
      
      * add prefix patterns
      
      * make style
      
      * renaming
      
      * fix conflicts
      
      * remove unused patterns
      
      * solve conflicts
      
      * fix merge conflicts
      
      * remove translation example
      
      * remove summarization example
      
      * make sure tensors are in numpy for float comparison
      
      * re-add t5 config
      
      * fix t5 import config typo
      
      * make style
      
      * remove unused numpy statements
      
      * update docstring
      
      * import translation pipeline
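
      A minimal sketch of the translation task this commit registers (the "translation_en_to_de" name follows the prefix patterns above; T5 as the underlying default is an assumption):

          from transformers import pipeline

          translator = pipeline("translation_en_to_de")
          print(translator("The weather is nice today."))
          # e.g. [{'translation_text': '...'}]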
    • Add t5 to pipeline(task='summarization') (#3413) · 9c683ef0
      Patrick von Platen authored
      * solve conflicts
      
      * move warnings below
      
      * incorporate changes
      
      * add pad_to_max_length to pipelines
      
      * add bug fix for T5 beam search
      
      * add prefix patterns
      
      * make style
      
      * fix conflicts
      
      * adapt pipelines for task specific parameters
      
      * improve docstring
      
      * remove unused patterns
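
      A minimal sketch of T5 driven through the summarization task (the checkpoint and length parameters are illustrative):

          from transformers import pipeline

          summarizer = pipeline("summarization", model="t5-small", tokenizer="t5-small")
          article = (
              "The tower is 324 metres tall, about the same height as an "
              "81-storey building, and the tallest structure in Paris."
          )
          print(summarizer(article, min_length=5, max_length=40))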
  21. 17 Mar, 2020 1 commit
    • Add Summarization to Pipelines (#3128) · 38a555a8
      Sam Shleifer authored
      * passing
      
      * Undo stupid chg
      
      * docs
      
      * undo rename
      
      * delete-cruft
      
      * only import if you have torch
      
      * Dont rely on dict ordering
      
      * Fix dict ordering upstream
      
      * docstring link
      
      * docstring link
      
      * remove trailing comma for 3.5 compat
      
      * new name
      
      * delegate kwarging
      
      * Update kwargs
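
      A minimal sketch of the summarization pipeline as first added (BART as the default checkpoint is an assumption here):

          from transformers import pipeline

          summarizer = pipeline("summarization")
          article = (
              "America has changed dramatically during recent years. The number "
              "of engineering graduates in the U.S. has declined while demand "
              "for engineers has continued to grow."
          )
          print(summarizer(article, max_length=60)[0]["summary_text"])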
  22. 09 Mar, 2020 1 commit
  23. 02 Mar, 2020 1 commit
    • Pipeline doc (#3055) · d3eb7d23
      Lysandre Debut authored
      * Pipeline doc initial commit
      
      * pipeline abstraction
      
      * Remove modelcard argument from pipeline
      
      * Task-specific pipelines can be instantiated with no model or tokenizer
      
      * All pipelines doc
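
      A minimal sketch of the task-only instantiation documented here: with no model or tokenizer supplied, the task's default checkpoint is loaded (the default shown is an assumption):

          from transformers import pipeline

          nlp = pipeline("sentiment-analysis")  # no model or tokenizer arguments
          print(nlp("This movie was great!"))
          # e.g. [{'label': 'POSITIVE', 'score': 0.99}]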
  24. 19 Feb, 2020 1 commit
  25. 18 Feb, 2020 1 commit
  26. 13 Feb, 2020 1 commit
    • Preserve spaces in GPT-2 tokenizers (#2778) · f1e8a51f
      Joe Davison authored
      * Preserve spaces in GPT-2 tokenizers
      
      Preserves spaces after special tokens in GPT-2 and inherited (RoBERTa)
      tokenizers, enabling correct BPE encoding. Automatically inserts a space
      in front of the first token in the encode function when adding special tokens.
      
      * Add tokenization preprocessing method
      
      * Add framework argument to pipeline factory
      
      Also fixes a pipeline test issue: each test input is now treated as a
      distinct sequence.
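
      A minimal sketch of the byte-level space handling this fix keeps consistent (the Ġ marker is GPT-2's encoding of a leading space):

          from transformers import GPT2Tokenizer

          tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
          print(tokenizer.tokenize("Hello world"))
          # ['Hello', 'Ġworld']: the space before "world" survives BPE encoding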
  27. 07 Feb, 2020 2 commits
  28. 30 Jan, 2020 1 commit
    • fill_mask helper (#2576) · 9fa836a7
      Julien Chaumond authored
      * fill_mask helper
      
      * [poc] FillMaskPipeline
      
      * Revert "[poc] FillMaskPipeline"
      
      This reverts commit 67eeea55b0f97b46c2b828de0f4ee97d87338335.
      
      * Revert "fill_mask helper"
      
      This reverts commit cacc17b884e14bb6b07989110ffe884ad9e36eaa.
      
      * README: clarify that Pipelines can also do text-classification
      
      cf. question at the AI&ML meetup last week, @mfuntowicz
      
      * Fix test: test feature-extraction pipeline
      
      * Test tweaks
      
      * Slight refactor of existing pipeline (in preparation of new FillMaskPipeline)
      
      * Extraneous doc
      
      * More robust way of doing this
      
      @mfuntowicz as we don't rely on the model name anymore (see AutoConfig)
      
      * Also add RobertaConfig as a quickfix for wrong token_type_ids
      
      * cs
      
      * [BIG] FillMaskPipeline
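
      A minimal sketch of the FillMaskPipeline this lands (RoBERTa's <mask> token shown; the checkpoint is an illustrative choice):

          from transformers import pipeline

          fill_mask = pipeline("fill-mask", model="roberta-base")
          for prediction in fill_mask("The goal of life is <mask>."):
              # Each prediction carries the filled-in sequence and its score.
              print(prediction["sequence"], prediction["score"])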
  29. 15 Jan, 2020 2 commits
  30. 06 Jan, 2020 2 commits
  31. 22 Dec, 2019 4 commits
  32. 21 Dec, 2019 1 commit
    • Reformat source code with black. · fa84ae26
      Aymeric Augustin authored
      This is the result of:
      
          $ black --line-length 119 examples templates transformers utils hubconf.py setup.py
      
      There are a lot of fairly long lines in the project. As a consequence, I'm
      picking the longest widely accepted line length, 119 characters.
      
      This is also Thomas' preference, because it allows for explicit variable
      names, to make the code easier to understand.