- 15 Oct, 2020 1 commit
-
-
Nicolas Patry authored
* Improving Pipelines by defaulting to framework='tf' when PyTorch seems unavailable. * Actually changing the default resolution order to account for model defaults. Adding a new test for each pipeline to check that pipeline(task) also works without manually specifying the framework.
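A minimal sketch of what the changed resolution order enables: calling pipeline(task) with no explicit framework and letting it fall back to whichever backend is installed and to the task's default model. The task and the explicit framework="tf" call below are illustrative, not part of this commit.

```python
from transformers import pipeline

# No framework argument: the factory picks PyTorch or TensorFlow depending
# on which backend is actually importable, and uses the task's default model.
classifier = pipeline("sentiment-analysis")
print(classifier("This library keeps getting easier to use."))

# The framework can still be forced explicitly when both backends are present.
tf_classifier = pipeline("sentiment-analysis", framework="tf")
```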
-
- 22 Sep, 2020 1 commit
-
-
Sylvain Gugger authored
* Mark big downloads as slow * Add import * Right order for slow decorator * More slow tests
-
- 07 Sep, 2020 1 commit
-
-
Boris Dayma authored
* feat: allow padding_text for any generative model * docs(pipelines.py): correct typo * Update src/transformers/pipelines.py Co-authored-by:
Sam Shleifer <sshleifer@gmail.com> * feat: rename padding_text to prefix * fix: cannot tokenize empty text * fix: pass prefix arg to pipeline * test: add prefix to text-generation pipeline * style: fix style * style: clean code and make variable names more explicit * set arg docstring to optional Co-authored-by:
Sylvain Gugger <35901082+sgugger@users.noreply.github.com> Co-authored-by:
Sam Shleifer <sshleifer@gmail.com> Co-authored-by:
Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
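A hedged usage sketch of the renamed argument: prefix (formerly padding_text) is prepended to the prompt before generation. The model name and prefix string are illustrative only.

```python
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

# `prefix` (formerly `padding_text`) is prepended to the prompt before
# generation; it is mainly useful for models such as XLNet that expect
# some padding text in front of short prompts.
outputs = generator(
    "Today the weather is",
    prefix="The following is a short weather report.",
    max_length=40,
)
print(outputs[0]["generated_text"])
```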
-
- 02 Sep, 2020 1 commit
-
-
Suraj Patil authored
* add Text2TextGenerationPipeline * remove max length warning * remove comments * remove input_length * fix typo * add tests * use TFAutoModelForSeq2SeqLM * doc * typo * add the doc below TextGenerationPipeline * doc nit * style * delete comment
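A short sketch of the new pipeline, assuming the "text2text-generation" alias registered here; the T5 checkpoint is just an example of a seq2seq model it can wrap.

```python
from transformers import pipeline

# Text2TextGenerationPipeline wraps encoder-decoder (seq2seq) models behind
# the generic "text2text-generation" task alias.
t2t = pipeline("text2text-generation", model="t5-small")
print(t2t("translate English to German: The table is round."))
# -> [{'generated_text': ...}]
```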
-
- 26 Aug, 2020 1 commit
-
-
Lysandre authored
-
- 12 Aug, 2020 1 commit
-
-
Joe Davison authored
* add targets arg to fill-mask pipeline * add tests and more error handling * quality * update docstring
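A small sketch of the targets argument added here; the checkpoint and candidate words are illustrative.

```python
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="distilroberta-base")
masked = f"The capital of France is {fill_mask.tokenizer.mask_token}."

# `targets` restricts scoring to the given candidate words instead of the
# whole vocabulary; unknown targets go through the added error handling.
print(fill_mask(masked, targets=["Paris", "Lyon"]))
```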
-
- 30 Jul, 2020 1 commit
-
-
guillaume-be authored
* initial commit for pipeline implementation. Addition of input processing and history concatenation * Conversation pipeline tested and working for single & multiple conversation inputs * Added docstrings for dialogue pipeline * Addition of dialogue pipeline integration tests * Delete test_t5.py * Fixed max code length * Updated styling * Fixed test broken by formatting tools * Removed unused import * Added unit test for DialoguePipeline * Fixed Tensorflow compatibility * Fixed multi-framework support using framework flag * - Fixed docstring - Added `min_length_for_response` as an initialization parameter - Renamed `*args` to `conversations`, `conversations` being a `Conversation` or a `List[Conversation]` - Updated truncation to truncate entire segments of conversations, instead of cutting in the middle of a user/bot input * - renamed pipeline name from dialogue to conversational - removed hardcoded default value of 1000 and use config.max_length instead - added `append_response` and `set_history` methods to the Conversation class to avoid direct field mutation - fixed bug in history truncation method * - Updated ConversationalPipeline to accept only active conversations (otherwise a ValueError is raised) * - Simplified input tensor conversion * - Updated attention_mask value for Tensorflow compatibility * - Updated last dialogue reference to conversational & fixed integration tests * Fixed conflict with master * Updates following review comments * Updated formatting * Added Conversation and ConversationalPipeline to the library __init__, addition of docstrings for Conversation, added both to the docs * Update src/transformers/pipelines.py Updated docstring following review Co-authored-by:
Sylvain Gugger <35901082+sgugger@users.noreply.github.com> Co-authored-by:
Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
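A hedged usage sketch of the Conversation object and the "conversational" alias described above; the DialoGPT checkpoint is only an example of a compatible model.

```python
from transformers import Conversation, pipeline

chatbot = pipeline("conversational", model="microsoft/DialoGPT-medium")

# A Conversation tracks user inputs and generated responses; the pipeline
# only accepts "active" conversations, i.e. ones with an unprocessed user input.
conversation = Conversation("Going to the movies tonight - any suggestions?")
conversation = chatbot(conversation)
print(conversation.generated_responses[-1])

# Follow-up turns are appended to the same object before calling the pipeline again.
conversation.add_user_input("Is it an action movie?")
conversation = chatbot(conversation)
```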
-
- 27 Jul, 2020 1 commit
-
-
Joe Davison authored
* add initial zero-shot pipeline * change default args * update default template * add label string splitting * add str labels support, remove nli from name * style * add input validation and working tf defaults * tests * quality check * add docstring to __call__ * add slow tests * Change truncation to only_first; also lower precision on tests for readability * style
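A short sketch of the zero-shot classification pipeline added here, using the default checkpoint; the labels and the custom hypothesis template below are illustrative.

```python
from transformers import pipeline

classifier = pipeline("zero-shot-classification")

result = classifier(
    "Who are you voting for in 2020?",
    candidate_labels=["politics", "sports", "business"],
    # Each label is wrapped into an NLI hypothesis via this template; a
    # default template is used when none is passed.
    hypothesis_template="This example is about {}.",
)
print(result["labels"][0], result["scores"][0])
```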
-
- 08 Jul, 2020 1 commit
-
-
Lorenzo Ampil authored
* Add B I handling to grouping * Add fix to include separate entity as last token * move last_idx definition outside loop * Use first entity in entity group as reference for entity type * Add test cases * Take out extra class accidentally added * Return tf ner grouped test to original * Take out redundant last entity * Get last_idx safely Co-authored-by:
ColleterVi <36503688+ColleterVi@users.noreply.github.com> * Fix first entity comment * Create separate functions for group_sub_entities and group_entities (splitting call method into testable functions) * Take out unnecessary last_idx * Remove additional forward pass test * Move token classification basic tests to separate class * Move token classification basic tests back to monocolumninputtestcase * Move base ner tests to nerpipelinetests * Take out unused kwargs * Add back mandatory_keys argument * Add unitary tests for group_entities in _test_ner_pipeline * Fix last entity handling * Fix grouping function used * Add typing to group_sub_entities and group_entities Co-authored-by:
ColleterVi <36503688+ColleterVi@users.noreply.github.com>
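A hedged sketch of the grouping behaviour this refines: grouped_entities=True (introduced in an earlier commit) routes predictions through group_entities, which merges consecutive B-/I- tagged tokens of the same type into one span. The input sentence is illustrative.

```python
from transformers import pipeline

# grouped_entities=True merges consecutive B-/I- tagged tokens of the same
# entity type into a single entity span via group_entities().
ner = pipeline("ner", grouped_entities=True)

for entity in ner("Hugging Face is based in New York City."):
    print(entity["entity_group"], entity["word"], round(entity["score"], 3))
```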
-
- 01 Jul, 2020 2 commits
-
-
Funtowicz Morgan authored
* Added PipelineException Signed-off-by:
Morgan Funtowicz <funtowiczmo@gmail.com> * fill-mask pipeline raises exception when more than one mask_token detected. Signed-off-by:
Morgan Funtowicz <funtowiczmo@gmail.com> * Put everything in a function. Signed-off-by:
Morgan Funtowicz <funtowiczmo@gmail.com> * Added tests on pipeline fill-mask when input has != 1 mask_token Signed-off-by:
Morgan Funtowicz <funtowiczmo@gmail.com> * Fix numel() computation for TF Signed-off-by:
Morgan Funtowicz <funtowiczmo@gmail.com> * Addressing PR comments. Signed-off-by:
Morgan Funtowicz <funtowiczmo@gmail.com> * Remove function typing to avoid import on specific framework. Signed-off-by:
Morgan Funtowicz <funtowiczmo@gmail.com> * Quality. Signed-off-by:
Morgan Funtowicz <funtowiczmo@gmail.com> * Retry typing with @julien-c tip. Signed-off-by:
Morgan Funtowicz <funtowiczmo@gmail.com> * Quality². Signed-off-by:
Morgan Funtowicz <funtowiczmo@gmail.com> * Simplify fill-mask mask_token checking. Signed-off-by:
Morgan Funtowicz <funtowiczmo@gmail.com> * Trigger CI
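A small sketch of the check this adds, assuming PipelineException is importable from transformers.pipelines as introduced here; the default fill-mask checkpoint is used.

```python
from transformers import pipeline
from transformers.pipelines import PipelineException

fill_mask = pipeline("fill-mask")
mask = fill_mask.tokenizer.mask_token

# With more (or fewer) than exactly one mask token per input, the pipeline
# now raises a PipelineException instead of returning ill-defined predictions.
try:
    fill_mask(f"Paris is the {mask} of the {mask}.")
except PipelineException as err:
    print("rejected:", err)
```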
-
Sam Shleifer authored
-
- 30 Jun, 2020 1 commit
-
-
Sam Shleifer authored
-
- 26 Jun, 2020 1 commit
-
-
Sam Shleifer authored
-
- 18 May, 2020 1 commit
-
-
Sam Shleifer authored
-
- 17 May, 2020 1 commit
-
-
Lorenzo Ampil authored
* Add index to be returned by NerPipeline to allow for the creation of entity groups * Add entity groups * Convert entity list to dict * Add entity to entity_group_disagg after updating entity groups * Change 'group' parameter to 'grouped_entities' * Add unit tests for grouped NER pipeline case * Correct variable name typo for NER_FINETUNED_MODELS * Sync grouped tests to recent test updates
-
- 14 May, 2020 1 commit
-
-
Sam Shleifer authored
Covers torch and tf. Also fixes a failing @slow test.
-
- 08 May, 2020 1 commit
-
-
Patrick von Platen authored
* fix PR * move tests to correct place
-
- 07 May, 2020 1 commit
-
-
Funtowicz Morgan authored
* Rewritten batch support in pipelines. Signed-off-by:
Morgan Funtowicz <morgan@huggingface.co> * Fix imports sorting
🔧 Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * Set pad_to_max_length=True by default on Pipeline. * Set pad_to_max_length=False for generation pipelines. Most generation models don't have a padding token. * Address @joeddav review comment: Uniformized *args. Signed-off-by:
Morgan Funtowicz <morgan@huggingface.co> * Address @joeddav review comment: Uniformized *args (second). Signed-off-by:
Morgan Funtowicz <morgan@huggingface.co>
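A minimal sketch of the batching this reworks: passing a list of inputs runs them as a single padded batch, which is what the pad_to_max_length default enables for non-generation pipelines. Task and sentences are illustrative.

```python
from transformers import pipeline

classifier = pipeline("sentiment-analysis")

# A list of inputs is encoded and run as one padded batch.
results = classifier([
    "This is great.",
    "This, on the other hand, is a fairly long and rather disappointing sentence.",
])
for r in results:
    print(r["label"], round(r["score"], 3))
```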
-
- 22 Apr, 2020 1 commit
-
-
Lorenzo Ampil authored
* Add GenerationPipeline * Fix parameter names * Correct __call__ parameters * Add model type attribute and correct function calls for prepare_input * Take out trailing commas from init attributes * Remove unnecessary tokenization line * Implement support for multiple text inputs * Apply generation support for multiple input text prompts * Take out tensor coercion * Take out batch index * Add text prompt to return sequence * Squeeze token tensor before decoding * Return only a single list of sequences if only one prompt was used * Correct results variable name * Add GenerationPipeline to SUPPORTED_TASKS with the alias, initialized with GPT2 * Registered AutoModelWithLMHead for both pt and tf * Update docstring for GenerationPipeline * Add kwargs parameter to model.generate * Take out kwargs parameter after all * Add generation pipeline example in pipeline docstring * Fix max length by squeezing tokens tensor * Apply ensure_tensor_on_device to pytorch tensor * Include generation step in torch.no_grad * Take out input from prepare_xlm_input and set 'en' as default xlm_language * Apply framework specific encoding during prepare_input * Format with make style * Move GenerationPipeline import to follow proper import sorting * Take out trailing comma from generation dict * Apply requested changes * Change name to TextGenerationPipeline * Apply TextGenerationPipeline rename to __init__ * Changing alias to * Set input mapping as input to ensure_tensor_on_device * Fix assertion placement * Add test_text_generation * Add TextGenerationPipeline to PipelineCommonTests * Take out whitespace * Format __init__ with black * Fix __init__ style * Format __init__ * Add line to end of __init__ * Correct model tokenizer set for test_text_generation * Ensure a list of lists is returned, not a list of strings (to pass test) * Limit test models to only 3 to limit runtime and address the CircleCI timeout error * Update src/transformers/pipelines.py Co-Authored-By:
Patrick von Platen <patrick.v.platen@gmail.com> * Update src/transformers/pipelines.py Co-Authored-By:
Patrick von Platen <patrick.v.platen@gmail.com> * Update src/transformers/pipelines.py Co-Authored-By:
Patrick von Platen <patrick.v.platen@gmail.com> * Update src/transformers/pipelines.py Co-Authored-By:
Patrick von Platen <patrick.v.platen@gmail.com> * Update src/transformers/pipelines.py Co-Authored-By:
Patrick von Platen <patrick.v.platen@gmail.com> * Update tests/test_pipelines.py Co-Authored-By:
Patrick von Platen <patrick.v.platen@gmail.com> * Update src/transformers/pipelines.py Co-Authored-By:
Patrick von Platen <patrick.v.platen@gmail.com> * Update src/transformers/pipelines.py Co-Authored-By:
Patrick von Platen <patrick.v.platen@gmail.com> * Update src/transformers/pipelines.py Co-Authored-By:
Patrick von Platen <patrick.v.platen@gmail.com> * Remove argument docstring, __init__, add additional __call__ arguments, and reformat results to list of dict * Fix blank result list * Add TextGenerationPipeline to pipelines.rst * Update src/transformers/pipelines.py Co-Authored-By:
Patrick von Platen <patrick.v.platen@gmail.com> * Update src/transformers/pipelines.py Co-Authored-By:
Patrick von Platen <patrick.v.platen@gmail.com> * Fix typos from adding PADDING_TEXT_TOKEN_LENGTH * Fix incorrectly moved result list * Update src/transformers/pipelines.py Co-Authored-By:
Patrick von Platen <patrick.v.platen@gmail.com> * Update src/transformers/pipelines.py * Update src/transformers/pipelines.py * Update src/transformers/pipelines.py * Update src/transformers/pipelines.py * Update src/transformers/pipelines.py * Update src/transformers/pipelines.py * Update src/transformers/pipelines.py * Update src/transformers/pipelines.py * Update src/transformers/pipelines.py * Update src/transformers/pipelines.py * Update src/transformers/pipelines.py * Update src/transformers/pipelines.py Co-Authored-By:
Patrick von Platen <patrick.v.platen@gmail.com> * Add back generation line and make style * Take out blank whitespace * Apply new alias, text-generation, to test_pipelines * Fix text generation alias in test * Update src/transformers/pipelines.py Co-authored-by:
Patrick von Platen <patrick.v.platen@gmail.com> Co-authored-by:
Julien Chaumond <chaumond@gmail.com>
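A hedged usage sketch of the pipeline introduced here under the "text-generation" alias, including the multi-prompt support and list-of-lists return mentioned above; the checkpoint and prompts are illustrative.

```python
from transformers import pipeline

# "text-generation" is the alias this change settled on; GPT-2 is used here
# as an example causal LM.
generator = pipeline("text-generation", model="gpt2")

# Multiple prompts are supported; the result is one list of generated
# sequences per prompt (a list of lists of dicts).
prompts = ["In a shocking finding,", "The recipe calls for"]
for prompt_outputs in generator(prompts, max_length=30):
    print(prompt_outputs[0]["generated_text"])
```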
-
- 16 Apr, 2020 1 commit
-
-
Patrick von Platen authored
-
- 07 Apr, 2020 1 commit
-
-
Sam Shleifer authored
-
- 06 Apr, 2020 1 commit
-
-
Funtowicz Morgan authored
* Renamed num_added_tokens to num_special_tokens_to_add Signed-off-by:
Morgan Funtowicz <morgan@huggingface.co> * Cherry-Pick: Partially fix space only input without special tokens added to the output #3091 Signed-off-by:
Morgan Funtowicz <morgan@huggingface.co> * Added property is_fast on PretrainedTokenizer and PretrainedTokenizerFast Signed-off-by:
Morgan Funtowicz <morgan@huggingface.co> * Make fast tokenizers unittests work on Windows. * Entirely refactored unittests for fast tokenizers. * Remove ABC class for CommonFastTokenizerTest * Added embeded_special_tokens tests from allenai @dirkgr * Make embeded_special_tokens tests from allenai more generic * Uniformize vocab_size as a property for both Fast and normal tokenizers * Move special tokens handling out of PretrainedTokenizer (SpecialTokensMixin) * Ensure providing None input raises the same ValueError as the Python tokenizer + tests. * Fix invalid input for assert_padding when testing batch_encode_plus * Move add_special_tokens from constructor to tokenize/encode/[batch_]encode_plus methods parameter. * Ensure tokenize() correctly forwards add_special_tokens to rust. * Adding None checking on top of encode / encode_batch for TransfoXLTokenizerFast. Avoid stripping on None values. * unittests ensure tokenize() also throws a ValueError if provided None * Added add_special_tokens unittest for all supported models. * Style * Make sure TransfoXL tests run only if PyTorch is provided. * Split up tokenizers tests for each model type. * Fix invalid unittest with new tokenizers API. * Filter out Roberta openai detector models from unittests. * Introduce BatchEncoding on fast tokenizers path. This new structure exposes all the mappings retrieved from Rust. It also keeps the current behavior with model forward. * Introduce BatchEncoding on slow tokenizers path. Backward compatibility. * Improve error message on BatchEncoding for slow path * Make add_prefix_space True by default on Roberta fast to match Python in the majority of cases. * Style and format. * Added typing on all methods for PretrainedTokenizerFast * Style and format * Added path for feeding pretokenized (List[str]) input to PretrainedTokenizerFast. * Style and format * encode_plus now supports pretokenized inputs. * Remove user warning about add_special_tokens when working on pretokenized inputs. * Always go through the post processor. * Added support for pretokenized input pairs on encode_plus * Added is_pretokenized flag on encode_plus for clarity and improved error message on input TypeError. * Added pretokenized inputs support on batch_encode_plus * Update BatchEncoding method names to match Encoding. * Bump setup.py tokenizers dependency to 0.7.0rc1 * Remove unused parameters in BertTokenizerFast * Make sure Roberta returns token_type_ids for unittests. * Added missing typings * Update add_tokens prototype to match tokenizers side and allow AddedToken * Bumping tokenizers to 0.7.0rc2 * Added documentation for BatchEncoding * Added (unused) is_pretokenized parameter on PreTrainedTokenizer encode_plus/batch_encode_plus methods. * Added higher-level typing for tokenize / encode_plus / batch_encode_plus. * Fix unittests failing because add_special_tokens was defined as a constructor parameter on Rust Tokenizers. * Fix text-classification pipeline using the wrong tokenizer * Make pipelines work with BatchEncoding * Turn off add_special_tokens on tokenize by default. Signed-off-by:
Morgan Funtowicz <morgan@huggingface.co> * Remove add_prefix_space from tokenize call in unittest. Signed-off-by:
Morgan Funtowicz <morgan@huggingface.co> * Style and quality Signed-off-by:
Morgan Funtowicz <morgan@huggingface.co> * Correct message for batch_encode_plus none input exception. Signed-off-by:
Morgan Funtowicz <morgan@huggingface.co> * Fix invalid list comprehension for offset_mapping overriding content every iteration. Signed-off-by:
Morgan Funtowicz <morgan@huggingface.co> * TransfoXL uses Strip normalizer. Signed-off-by:
Morgan Funtowicz <morgan@huggingface.co> * Bump tokenizers dependency to 0.7.0rc3 Signed-off-by:
Morgan Funtowicz <morgan@huggingface.co> * Support AddedTokens for special_tokens and use left stripping on mask for Roberta. Signed-off-by:
Morgan Funtowicz <morgan@huggingface.co> * SpecialTokensMixin can use slots for faster access to underlying attributes. Signed-off-by:
Morgan Funtowicz <morgan@huggingface.co> * Remove update_special_tokens from fast tokenizers. * Ensure TransfoXL unittests are run only when torch is available. * Style. Signed-off-by:
Morgan Funtowicz <morgan@huggingface.co> * Style * Style
🙏 🙏 * Remove slots on SpecialTokensMixin, need deep dive into pickle protocol. * Remove Roberta warning on __init__. * Move documentation to Google style. Co-authored-by: LysandreJik <lysandre.debut@reseau.eseo.fr>
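A brief sketch of the surface touched by this refactor: the is_fast property, the renamed num_special_tokens_to_add helper, and the BatchEncoding return type. The bert-base-uncased checkpoint is only an example.

```python
from transformers import BertTokenizer, BertTokenizerFast

slow = BertTokenizer.from_pretrained("bert-base-uncased")
fast = BertTokenizerFast.from_pretrained("bert-base-uncased")

# is_fast tells the Rust-backed implementation apart from the Python one.
print(slow.is_fast, fast.is_fast)  # False True

# num_special_tokens_to_add is the renamed helper (formerly num_added_tokens);
# for BERT single sequences this is 2 ([CLS] ... [SEP]).
print(fast.num_special_tokens_to_add(pair=False))

# Both paths now return a BatchEncoding instead of a plain dict.
encoding = fast.encode_plus("Hello world", add_special_tokens=True)
print(type(encoding).__name__, encoding["input_ids"])
```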
-
- 26 Mar, 2020 2 commits
-
-
Patrick von Platen authored
* fix merge conflicts * add t5 summarization example * change parameters for t5 summarization * make style * add first code snippet for translation * only add prefixes * add prefix patterns * make style * renaming * fix conflicts * remove unused patterns * solve conflicts * fix merge conflicts * remove translation example * remove summarization example * make sure tensors are in numpy for float comparison * re-add t5 config * fix t5 import config typo * make style * remove unused numpy statements * update docstring * import translation pipeline
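A hedged sketch of the T5 prefix handling this wires in: the translation and summarization pipelines add the task prefixes themselves, so the caller only passes plain text. The t5-small checkpoint and inputs are illustrative.

```python
from transformers import pipeline

# Translation: the "translate English to German:" prefix is added for T5
# by the pipeline, not by the caller.
translator = pipeline("translation_en_to_de", model="t5-small")
print(translator("The house is wonderful.", max_length=40))

# Summarization works the same way with the "summarize:" prefix.
summarizer = pipeline("summarization", model="t5-small")
print(summarizer(
    "The tower is 324 metres tall, about the same height as an 81-storey "
    "building, and is the tallest structure in Paris.",
    max_length=30, min_length=5,
))
```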
-
Patrick von Platen authored
* solve conflicts * move warnings below * incorporate changes * add pad_to_max_length to pipelines * add bug fix for T5 beam search * add prefix patterns * make style * fix conflicts * adapt pipelines for task specific parameters * improve docstring * remove unused patterns
-
- 17 Mar, 2020 1 commit
-
-
Sam Shleifer authored
* passing * Undo stupid change * docs * undo rename * delete-cruft * only import if you have torch * Don't rely on dict ordering * Fix dict ordering upstream * docstring link * docstring link * remove trailing comma for 3.5 compat * new name * delegate kwarging * Update kwargs
-
- 09 Mar, 2020 1 commit
-
-
Lysandre authored
-
- 02 Mar, 2020 1 commit
-
-
Lysandre Debut authored
* Pipeline doc initial commit * pipeline abstraction * Remove modelcard argument from pipeline * Task-specific pipelines can be instantiated with no model or tokenizer * All pipelines doc
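A small sketch of the "no model or tokenizer" instantiation documented here: with neither argument given, the task's default checkpoint is downloaded and the matching tokenizer is inferred from it. The task and inputs are illustrative.

```python
from transformers import pipeline

# No model and no tokenizer: the task's default checkpoint is used.
qa = pipeline("question-answering")
print(qa(question="Where do I live?",
         context="My name is Sarah and I live in London."))
# -> {'score': ..., 'start': ..., 'end': ..., 'answer': 'London'}
```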
-
- 19 Feb, 2020 1 commit
-
-
Funtowicz Morgan authored
* Implemented fast version of tokenizers Signed-off-by:
Morgan Funtowicz <morgan@huggingface.co> * Bumped tokenizers version requirements to latest 0.2.1 Signed-off-by:
Morgan Funtowicz <morgan@huggingface.co> * Added matching tests Signed-off-by:
Morgan Funtowicz <morgan@huggingface.co> * Matching OpenAI GPT tokenization ! Signed-off-by:
Morgan Funtowicz <morgan@huggingface.co> * Matching GPT2 on tokenizers Signed-off-by:
Morgan Funtowicz <morgan@huggingface.co> * Expose add_prefix_space as constructor parameter for GPT2 Signed-off-by:
Morgan Funtowicz <morgan@huggingface.co> * Matching Roberta tokenization ! Signed-off-by:
Morgan Funtowicz <morgan@huggingface.co> * Removed fast implementation of CTRL. Signed-off-by:
Morgan Funtowicz <morgan@huggingface.co> * Binding TransformerXL tokenizers to Rust. Signed-off-by:
Morgan Funtowicz <morgan@huggingface.co> * Updating tests accordingly. Signed-off-by:
Morgan Funtowicz <morgan@huggingface.co> * Added tokenizers as top-level modules. Signed-off-by:
Morgan Funtowicz <morgan@huggingface.co> * Black & isort. Signed-off-by:
Morgan Funtowicz <morgan@huggingface.co> * Rename LookupTable to WordLevel to match Rust side. Signed-off-by:
Morgan Funtowicz <morgan@huggingface.co> * Black. Signed-off-by:
Morgan Funtowicz <morgan@huggingface.co> * Use "fast" suffix instead of "ru" for rust tokenizers implementations. Signed-off-by:
Morgan Funtowicz <morgan@huggingface.co> * Introduce tokenize() method on fast tokenizers. Signed-off-by:
Morgan Funtowicz <morgan@huggingface.co> * encode_plus dispatches to batch_encode_plus Signed-off-by:
Morgan Funtowicz <morgan@huggingface.co> * batch_encode_plus now dispatches to encode if there is only one input element. Signed-off-by:
Morgan Funtowicz <morgan@huggingface.co> * Bind all the encode_plus parameters to the forwarded batch_encode_plus call. Signed-off-by:
Morgan Funtowicz <morgan@huggingface.co> * Bump tokenizers dependency to 0.3.0 Signed-off-by:
Morgan Funtowicz <morgan@huggingface.co> * Formatting. Signed-off-by:
Morgan Funtowicz <morgan@huggingface.co> * Fix tokenization_auto with support for new (python, fast) mapping schema. Signed-off-by:
Morgan Funtowicz <morgan@huggingface.co> * Give correct fixtures path in test_tokenization_fast.py for the CLI. Signed-off-by:
Morgan Funtowicz <morgan@huggingface.co> * Expose max_len_ properties on BertTokenizerFast Signed-off-by:
Morgan Funtowicz <morgan@huggingface.co> * Move max_len_ properties to PreTrainedTokenizerFast and override in specific subclasses. Signed-off-by:
Morgan Funtowicz <morgan@huggingface.co> * _convert_encoding should keep the batch axis tensor if only one sample in the batch. Signed-off-by:
Morgan Funtowicz <morgan@huggingface.co> * Add warning message for RobertaTokenizerFast if used for MLM. Signed-off-by:
Morgan Funtowicz <morgan@huggingface.co> * Added use_fast (bool) parameter on AutoTokenizer.from_pretrained(). This makes it easy to enable/disable Rust-based tokenizer instantiation. Signed-off-by:
Morgan Funtowicz <morgan@huggingface.co> * Let's tokenizers handle all the truncation and padding stuff. Signed-off-by:
Morgan Funtowicz <morgan@huggingface.co> * Allow to provide tokenizer arguments during pipeline creation. Signed-off-by:
Morgan Funtowicz <morgan@huggingface.co> * Update test_fill_mask pipeline to not use fast tokenizers. Signed-off-by:
Morgan Funtowicz <morgan@huggingface.co> * Fix too much parameters for convert_encoding. Signed-off-by:
Morgan Funtowicz <morgan@huggingface.co> * When enabling padding, max_length should be set to None. Signed-off-by:
Morgan Funtowicz <morgan@huggingface.co> * Avoid returning nested tensors of length 1 when calling encode_plus Signed-off-by:
Morgan Funtowicz <morgan@huggingface.co> * Ensure output is padded when return_tensor is not None. Tensor creation requires the initial list input to be of the exact same size. Signed-off-by:
Morgan Funtowicz <morgan@huggingface.co> * Disable transfoxl unittest if pytorch is not available (required to load the model) Signed-off-by:
Morgan Funtowicz <morgan@huggingface.co> * encode_plus should not remove the leading batch axis if return_tensor is set Signed-off-by:
Morgan Funtowicz <morgan@huggingface.co> * Temporary disable fast tokenizers on QA pipelines. Signed-off-by:
Morgan Funtowicz <morgan@huggingface.co> * Fix formatting issues. Signed-off-by:
Morgan Funtowicz <morgan@huggingface.co> * Update tokenizers to 0.4.0 * Update style * Enable truncation + stride unit test on fast tokenizers. Signed-off-by:
Morgan Funtowicz <morgan@huggingface.co> * Add unittest ensuring special_tokens set match between Python and Rust. Signed-off-by:
Morgan Funtowicz <morgan@huggingface.co> * Ensure special_tokens are correctly set during construction. Signed-off-by:
Morgan Funtowicz <morgan@huggingface.co> * Give more warning feedback to the user in case of padding without pad_token. Signed-off-by:
Morgan Funtowicz <morgan@huggingface.co> * quality & format. Signed-off-by:
Morgan Funtowicz <morgan@huggingface.co> * Added possibility to add a single token as str Signed-off-by:
Morgan Funtowicz <morgan@huggingface.co> * Added unittest for add_tokens and add_special_tokens on fast tokenizers. Signed-off-by:
Morgan Funtowicz <morgan@huggingface.co> * Fix rebase mismatch on pipelines qa default model. QA requires cased input while the tokenizers would be uncased. Signed-off-by:
Morgan Funtowicz <morgan@huggingface.co> * Addressing review comment: Using offset mapping relative to the original string + unittest. Signed-off-by:
Morgan Funtowicz <morgan@huggingface.co> * Addressing review comment: save_vocabulary requires folder and file name Signed-off-by:
Morgan Funtowicz <morgan@huggingface.co> * Addressing review comment: Simplify import for Bert. Signed-off-by:
Morgan Funtowicz <morgan@huggingface.co> * Addressing review comment: truncate_and_pad disables padding according to the same heuristic as the one enabling padding. Signed-off-by:
Morgan Funtowicz <morgan@huggingface.co> * Addressing review comment: Remove private member access in tokenize() Signed-off-by:
Morgan Funtowicz <morgan@huggingface.co> * Addressing review comment: Bump tokenizers dependency to 0.4.2 Signed-off-by:
Morgan Funtowicz <morgan@huggingface.co> * format & quality. Signed-off-by:
Morgan Funtowicz <morgan@huggingface.co> * Addressing review comment: Use named arguments when applicable. Signed-off-by:
Morgan Funtowicz <morgan@huggingface.co> * Addressing review comment: Add Github link to Roberta/GPT2 space issue on masked input. Signed-off-by:
Morgan Funtowicz <morgan@huggingface.co> * Addressing review comment: Move max_len_single_sentence / max_len_sentences_pair to PreTrainedTokenizerFast + tests. Signed-off-by:
Morgan Funtowicz <morgan@huggingface.co> * Addressing review comment: Relax type checking to include tuple and list object. Signed-off-by:
Morgan Funtowicz <morgan@huggingface.co> * Addressing review comment: Document the truncate_and_pad manager behavior. Signed-off-by:
Morgan Funtowicz <morgan@huggingface.co> * Raise an exception if return_offsets_mapping is not available with the current tokenizer. Signed-off-by:
Morgan Funtowicz <morgan@huggingface.co> * Ensure padding is set on the tokenizers before setting any padding strategy + unittest. Signed-off-by:
Morgan Funtowicz <morgan@huggingface.co> * On pytorch we need to stack tensor to get proper new axis. Signed-off-by:
Morgan Funtowicz <morgan@huggingface.co> * Generalize tests to different framework removing hard written return_tensors="..." Signed-off-by:
Morgan Funtowicz <morgan@huggingface.co> * Bump tokenizer dependency for num_special_tokens_to_add Signed-off-by:
Morgan Funtowicz <morgan@huggingface.co> * Overflowing tokens in batch_encode_plus are now stacked over the batch axis. Signed-off-by:
Morgan Funtowicz <morgan@huggingface.co> * Improved error message for padding strategy without pad token. Signed-off-by:
Morgan Funtowicz <morgan@huggingface.co> * Bumping tokenizers dependency to 0.5.0 for release. Signed-off-by:
Morgan Funtowicz <morgan@huggingface.co> * Optimizing convert_encoding around 4x improvement.
🚀 Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * expose pad_to_max_length in encode_plus to avoid duplicating the parameters in kwargs Signed-off-by:
Morgan Funtowicz <morgan@huggingface.co> * Generate a proper overflow_to_sampling_mapping when return_overflowing_tokens is True. Signed-off-by:
Morgan Funtowicz <morgan@huggingface.co> * Fix unittests for overflow_to_sampling_mapping not being returned as tensor. Signed-off-by:
Morgan Funtowicz <morgan@huggingface.co> * Format & quality. Signed-off-by:
Morgan Funtowicz <morgan@huggingface.co> * Remove perfect alignment constraint for Roberta (allowing 1% difference max) Signed-off-by:
Morgan Funtowicz <morgan@huggingface.co> * Triggering final CI Co-authored-by:
MOI Anthony <xn1t0x@gmail.com>
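A hedged sketch of the two user-facing pieces mentioned above: the use_fast flag on AutoTokenizer.from_pretrained and the offset mappings only the Rust-backed tokenizers can return. The checkpoint and input string are illustrative.

```python
from transformers import AutoTokenizer

# use_fast switches between the Rust-backed and pure-Python implementations.
fast_tok = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=True)
slow_tok = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=False)
print(type(fast_tok).__name__, type(slow_tok).__name__)

# Character offsets relative to the original string are only available on
# the fast path; requesting them from a Python tokenizer raises an error.
enc = fast_tok.encode_plus("Offsets, please!", return_offsets_mapping=True)
print(enc["offset_mapping"])
```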
-
- 18 Feb, 2020 1 commit
-
-
Sam Shleifer authored
* Skip flaky test * Style
-
- 13 Feb, 2020 1 commit
-
-
Joe Davison authored
* Preserve spaces in GPT-2 tokenizers. Preserves spaces after special tokens in GPT-2 and inherited (RoBERTa) tokenizers, enabling correct BPE encoding. Automatically inserts a space in front of the first token in the encode function when adding special tokens. * Add tokenization preprocessing method * Add framework argument to pipeline factory. Also fixes pipeline test issue. Each test input is now treated as a distinct sequence.
-
- 07 Feb, 2020 2 commits
-
-
VictorSanh authored
-
VictorSanh authored
-
- 30 Jan, 2020 1 commit
-
-
Julien Chaumond authored
* fill_mask helper * [poc] FillMaskPipeline * Revert "[poc] FillMaskPipeline" This reverts commit 67eeea55b0f97b46c2b828de0f4ee97d87338335. * Revert "fill_mask helper" This reverts commit cacc17b884e14bb6b07989110ffe884ad9e36eaa. * README: clarify that Pipelines can also do text-classification cf. question at the AI&ML meetup last week, @mfuntowicz * Fix test: test feature-extraction pipeline * Test tweaks * Slight refactor of existing pipeline (in preparation of new FillMaskPipeline) * Extraneous doc * More robust way of doing this @mfuntowicz as we don't rely on the model name anymore (see AutoConfig) * Also add RobertaConfig as a quickfix for wrong token_type_ids * cs * [BIG] FillMaskPipeline
-
- 15 Jan, 2020 2 commits
-
-
Julien Chaumond authored
-
Julien Chaumond authored
-
- 06 Jan, 2020 2 commits
-
-
alberduris authored
-
alberduris authored
-
- 22 Dec, 2019 3 commits
-
-
Aymeric Augustin authored
This construct isn't used anymore. Running `python tests/test_foo.py` puts the tests/ directory on PYTHONPATH, which isn't representative of how we run tests. Use `python -m unittest tests/test_foo.py` instead.
-
Aymeric Augustin authored
-
Aymeric Augustin authored
-