Unverified Commit c63fcabf authored by Nicolas Patry, committed by GitHub

[Large PR] Entire rework of pipelines. (#13308)



* Enabling dataset iteration on pipelines.

Enabling dataset iteration on pipelines.

Unifying parameters under `set_parameters` function.

Small fix.

Last fixes after rebase

Remove print.

Fixing text2text `generate_kwargs`

No more `self.max_length`.

Fixing tf only conversational.

Consistency in start/stop index over TF/PT.

Speeding up drastically on TF (nasty bug where max_length would increase
a ton.)

Adding test for support for non fast tokenizers.

Fixing GPU usage on zero-shot.

Fix working on TF.

Update src/transformers/pipelines/base.py
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

Update src/transformers/pipelines/base.py
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

Small cleanup.

Remove all asserts + simple format.

* Fixing audio-classification for large PR.

* Overly explicit null checking.

* Encapsulating GPU/CPU pytorch manipulation directly within `base.py`.

* Removed internal state for parameters of the pipeline.

Instead of implicitly overriding internal state, we moved
to real named arguments on every `preprocess`, `_forward`,
and `postprocess` function.

Instead, `_sanitize_parameters` will be used to split all kwargs
of both `__init__` and `__call__` into the 3 kinds of named parameters.

* Move import warnings.

* Small fixes.

* Quality.

* Another small fix, using the CI to debug faster.

* Last fixes.

* Last fix.

* Small cleanup of tensor moving.

* is not None.

* Adding a bunch of docs + a iteration test.

* Fixing doc style.

* KeyDataset = None guard.

* Removing the CUDA test for pipelines (was testing).

* Even more simple iteration test.

* Correct import.

* Long day.

* Fixes in docs.

* [WIP] migrating object detection.

* Fixed the target_size bug.

* Fixup.

* Bad variable name.

* Fixing `ensure_on_device` so it respects the original ModelOutput.
parent 09549aa1
..
Copyright 2020 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
How to add a pipeline to 🤗 Transformers?
=======================================================================================================================
First and foremost, you need to decide the raw entries the pipeline will be able to take. They can be strings, raw bytes,
dictionaries or whatever seems to be the most likely desired input. Try to keep these inputs as pure Python as
possible, as that makes compatibility easier (even through other languages via JSON). Those will be the :obj:`inputs` of
the pipeline (:obj:`preprocess`).
Then define the :obj:`outputs`. Same policy as the :obj:`inputs`. The simpler, the better. Those will be the outputs of
the :obj:`postprocess` method.
Start by inheriting the base class :obj:`Pipeline`, implementing the 4 methods needed: :obj:`preprocess`,
:obj:`_forward`, :obj:`postprocess` and :obj:`_sanitize_parameters`.
.. code-block::

    from transformers import Pipeline


    class MyPipeline(Pipeline):
        def _sanitize_parameters(self, **kwargs):
            preprocess_kwargs = {}
            if "maybe_arg" in kwargs:
                preprocess_kwargs["maybe_arg"] = kwargs["maybe_arg"]
            return preprocess_kwargs, {}, {}

        def preprocess(self, inputs, maybe_arg=2):
            model_input = Tensor(....)
            return {"model_input": model_input}

        def _forward(self, model_inputs):
            # model_inputs == {"model_input": model_input}
            outputs = self.model(**model_inputs)
            # Maybe {"logits": Tensor(...)}
            return outputs

        def postprocess(self, model_outputs):
            best_class = model_outputs["logits"].softmax(-1)
            return best_class
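Once such a class is defined, calling an instance chains the three methods automatically. The snippet below is only an
illustrative sketch: the constructor arguments (``model``, ``tokenizer``) are placeholders and depend on the actual
task and model you use.

.. code-block::

    # Hypothetical usage sketch: a pipeline instance is a callable.
    my_pipeline = MyPipeline(model=model, tokenizer=tokenizer)
    # __call__ runs preprocess -> _forward (via forward) -> postprocess.
    best_class = my_pipeline("This is a test")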
The point of this breakdown is to support relatively seamless CPU/GPU handling, while also allowing the
pre/postprocessing to be done on the CPU, on different threads.
:obj:`preprocess` will take the originally defined inputs, and turn them into something feedable to the model. It might
contain more information and is usually a :obj:`Dict`.
:obj:`_forward` is the implementation detail and is not meant to be called directly. :obj:`forward` is the preferred
way to call it, as it contains safeguards to make sure everything is working on the expected device. Anything that is
linked to a real model belongs in the :obj:`_forward` method, anything else goes in preprocess/postprocess.
The :obj:`postprocess` method will take the output of :obj:`_forward` and turn it into the final output decided
earlier.
:obj:`_sanitize_parameters` exists to allow users to pass any parameters whenever they wish, be it at initialization
time ``pipeline(...., maybe_arg=4)`` or at call time ``pipe = pipeline(...); output = pipe(...., maybe_arg=4)``.
The returns of :obj:`_sanitize_parameters` are the 3 dicts of kwargs that will be passed directly to :obj:`preprocess`,
:obj:`_forward` and :obj:`postprocess`. Don't fill anything if the caller didn't call with any extra parameter. That
allows keeping the default arguments in the function definition, which is always more "natural".
A classic example would be a :obj:`top_k` argument in the post processing of classification tasks.
.. code-block::

    >>> pipe = pipeline("my-new-task")
    >>> pipe("This is a test")
    [{"label": "1-star", "score": 0.8}, {"label": "2-star", "score": 0.1}, {"label": "3-star", "score": 0.05},
    {"label": "4-star", "score": 0.025}, {"label": "5-star", "score": 0.025}]

    >>> pipe("This is a test", top_k=2)
    [{"label": "1-star", "score": 0.8}, {"label": "2-star", "score": 0.1}]
In order to achieve that, we'll update our :obj:`postprocess` method with a default parameter of :obj:`5`, and edit
:obj:`_sanitize_parameters` to allow this new parameter.
.. code-block::

    def postprocess(self, model_outputs, top_k=5):
        best_class = model_outputs["logits"].softmax(-1)
        # Add logic to handle top_k
        return best_class

    def _sanitize_parameters(self, **kwargs):
        preprocess_kwargs = {}
        if "maybe_arg" in kwargs:
            preprocess_kwargs["maybe_arg"] = kwargs["maybe_arg"]

        postprocess_kwargs = {}
        if "top_k" in kwargs:
            postprocess_kwargs["top_k"] = kwargs["top_k"]
        return preprocess_kwargs, {}, postprocess_kwargs
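With this in place, :obj:`top_k` can be supplied either when the pipeline is built or on each call, and call-time
values take precedence over the ones given at initialization. A minimal sketch, assuming the task is registered as
``"my-new-task"``:

.. code-block::

    >>> pipe = pipeline("my-new-task", top_k=2)  # stored by _sanitize_parameters at init time
    >>> pipe("This is a test")  # uses top_k=2
    >>> pipe("This is a test", top_k=3)  # the call-time value overrides the init-time one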
Try to keep the inputs/outputs very simple and ideally JSON-serializable, as it makes the pipeline usage very easy
without requiring users to understand new kinds of objects. It's also relatively common to support many different types
of arguments for ease of use (audio files can be filenames, URLs or pure bytes, for instance).
Adding it to the list of supported tasks
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Go to ``src/transformers/pipelines/__init__.py`` and fill in :obj:`SUPPORTED_TASKS` with your newly created pipeline.
If possible it should provide a default model.
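The exact schema of :obj:`SUPPORTED_TASKS` evolves between versions, so the entry below is only a sketch of its general
shape (an implementation class, the supported model classes per framework and, if possible, a default checkpoint). The
keys and names are illustrative; copy an existing entry for the real format.

.. code-block::

    # Hypothetical SUPPORTED_TASKS entry, keys and names are illustrative only.
    "my-new-task": {
        "impl": MyPipeline,
        "tf": (),
        "pt": (AutoModelForSequenceClassification,) if is_torch_available() else (),
        "default": {"model": {"pt": "user/my-default-model"}},
    },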
Adding tests
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Create a new file ``tests/test_pipelines_MY_PIPELINE.py``, following the example of the other tests.
The :obj:`run_pipeline_test` function will be very generic and run on small random models on every possible
architecture as defined by :obj:`model_mapping` and :obj:`tf_model_mapping`.
This is very important to test future compatibility, meaning if someone adds a new model for
:obj:`XXXForQuestionAnswering` then the pipeline test will attempt to run on it. Because the models are random it's
impossible to check for actual values, that's why there is a helper :obj:`ANY` that will simply attempt to match the
TYPE of the pipeline output.
You also *need* to implement 2 (ideally 4) tests.
- :obj:`test_small_model_pt` : Define 1 small model for this pipeline (doesn't matter if the results don't make sense)
and test the pipeline outputs. The results should be the same as :obj:`test_small_model_tf`.
- :obj:`test_small_model_tf` : Define 1 small model for this pipeline (doesn't matter if the results don't make sense)
and test the pipeline outputs. The results should be the same as :obj:`test_small_model_pt`.
- :obj:`test_large_model_pt` (:obj:`optional`): Tests the pipeline on a real model where the results are supposed to
make sense. These tests are slow and should be marked as such. Here the goal is to showcase the pipeline and to make
sure there is no drift in future releases.
- :obj:`test_large_model_tf` (:obj:`optional`): Tests the pipeline on a real model where the results are supposed to
make sense. These tests are slow and should be marked as such. Here the goal is to showcase the pipeline and to make
sure there is no drift in future releases.
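A minimal skeleton for such a test file might look like the following. The model name, imports and assertions are
illustrative only; mirror an existing ``tests/test_pipelines_*.py`` file for the exact common test mixins used in your
version.

.. code-block::

    # Hypothetical skeleton, adapt from an existing tests/test_pipelines_*.py file.
    import unittest

    from transformers import pipeline
    from transformers.testing_utils import require_torch, slow


    class MyNewTaskPipelineTests(unittest.TestCase):
        @require_torch
        def test_small_model_pt(self):
            # Tiny random model: outputs are meaningless, only the structure is checked.
            pipe = pipeline("my-new-task", model="hf-internal-testing/tiny-random-model")
            outputs = pipe("This is a test")
            self.assertIsInstance(outputs, list)

        @slow
        @require_torch
        def test_large_model_pt(self):
            # Default (real) model: results should make sense and stay stable across releases.
            pipe = pipeline("my-new-task")
            outputs = pipe("This is a test")
            self.assertIsInstance(outputs, list)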
@@ -506,6 +506,7 @@ Flax), PyTorch, and/or TensorFlow.
migration
contributing
add_new_model
add_new_pipeline
fast_tokenizers
performance
parallelism
@@ -46,12 +46,53 @@ The pipeline abstraction
The `pipeline` abstraction is a wrapper around all the other available pipelines. It is instantiated as any other
pipeline but requires an additional argument which is the `task`.
Simple call on one item:
.. code-block::

    >>> pipe = pipeline("text-classification")
    >>> pipe("This restaurant is awesome")
    [{'label': 'POSITIVE', 'score': 0.9998743534088135}]
To call a pipeline on many items, you can call it with a `list`.
.. code-block::

    >>> pipe = pipeline("text-classification")
    >>> pipe(["This restaurant is awesome", "This restaurant is awful"])
    [{'label': 'POSITIVE', 'score': 0.9998743534088135},
    {'label': 'NEGATIVE', 'score': 0.9996669292449951}]
To iterate over full datasets it is recommended to use a :obj:`dataset` directly. This means you don't need to allocate
the whole dataset at once, nor do you need to do batching yourself. This should work just as fast as custom loops on
GPU. If it doesn't, don't hesitate to create an issue.
.. code-block::

    import datasets
    import tqdm

    from transformers import pipeline
    from transformers.pipelines.base import KeyDataset  # in this PR, KeyDataset is defined in pipelines/base.py

    pipe = pipeline("automatic-speech-recognition", model="facebook/wav2vec2-base-960h", device=0)
    dataset = datasets.load_dataset("superb", name="asr", split="test")

    # KeyDataset (only `pt`) will simply return the item in the dict returned by the dataset item
    # as we're not interested in the `target` part of the dataset.
    for out in tqdm.tqdm(pipe(KeyDataset(dataset, "file"))):
        print(out)
        # {"text": "NUMBER TEN FRESH NELLY IS WAITING ON YOU GOOD NIGHT HUSBAND"}
        # {"text": ....}
        # ....
.. autofunction:: transformers.pipeline
Implementing a pipeline
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
:doc:`Implementing a new pipeline <../add_new_pipeline>`
The task specific pipelines
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
AudioClassificationPipeline
=======================================================================================================================
@@ -67,8 +67,8 @@ make them readable. For instance:
>>> classifier('We are very happy to show you the 🤗 Transformers library.')
[{'label': 'POSITIVE', 'score': 0.9998}]
That's encouraging! You can use it on a list of sentences, which will be preprocessed then fed to the model as a That's encouraging! You can use it on a list of sentences, which will be preprocessed then fed to the model, returning
`batch`, returning a list of dictionaries like this one: a list of dictionaries like this one:
.. code-block::
@@ -79,6 +79,8 @@ That's encouraging! You can use it on a list of sentences, which will be preproc
label: POSITIVE, with score: 0.9998
label: NEGATIVE, with score: 0.5309
To use it with a large dataset, look at :doc:`iterating over a pipeline <./main_classes/pipelines>`.
You can see the second sentence has been classified as negative (it needs to be positive or negative) but its score is
fairly neutral.
@@ -249,18 +249,22 @@ def check_task(task: str) -> Tuple[Dict, Any]:
task (:obj:`str`):
The task defining which pipeline will be returned. Currently accepted tasks are:
- :obj:`"audio-classification"`
- :obj:`"automatic-speech-recognition"`
- :obj:`"conversational"`
- :obj:`"feature-extraction"` - :obj:`"feature-extraction"`
- :obj:`"text-classification"`
- :obj:`"sentiment-analysis"` (alias of :obj:`"text-classification")
- :obj:`"token-classification"`
- :obj:`"ner"` (alias of :obj:`"token-classification")
- :obj:`"question-answering"`
- :obj:`"fill-mask"` - :obj:`"fill-mask"`
- :obj:`"summarization"` - :obj:`"image-classification"`
- :obj:`"translation_xx_to_yy"` - :obj:`"question-answering"`
- :obj:`"translation"` - :obj:`"table-question-answering"`
- :obj:`"text2text-generation"`
- :obj:`"text-classification"` (alias :obj:`"sentiment-analysis" available)
- :obj:`"text-generation"` - :obj:`"text-generation"`
- :obj:`"conversational"` - :obj:`"token-classification"` (alias :obj:`"ner"` available)
- :obj:`"translation"`
- :obj:`"translation_xx_to_yy"`
- :obj:`"summarization"`
- :obj:`"zero-shot-classification"`
Returns:
(task_defaults:obj:`dict`, task_options: (:obj:`tuple`, None)) The actual dictionary required to initialize the
@@ -312,21 +316,26 @@ def pipeline(
task (:obj:`str`):
The task defining which pipeline will be returned. Currently accepted tasks are:
- :obj:`"feature-extraction"`: will return a :class:`~transformers.FeatureExtractionPipeline`. - :obj:`"audio-classification"`: will return a :class:`~transformers.AudioClassificationPipeline`:.
- :obj:`"text-classification"`: will return a :class:`~transformers.TextClassificationPipeline`. - :obj:`"automatic-speech-recognition"`: will return a
- :obj:`"sentiment-analysis"`: (alias of :obj:`"text-classification"`) will return a :class:`~transformers.AutomaticSpeechRecognitionPipeline`:.
:class:`~transformers.TextClassificationPipeline`. - :obj:`"conversational"`: will return a :class:`~transformers.ConversationalPipeline`:.
- :obj:`"token-classification"`: will return a :class:`~transformers.TokenClassificationPipeline`. - :obj:`"feature-extraction"`: will return a :class:`~transformers.FeatureExtractionPipeline`:.
- :obj:`"ner"` (alias of :obj:`"token-classification"`): will return a - :obj:`"fill-mask"`: will return a :class:`~transformers.FillMaskPipeline`:.
:class:`~transformers.TokenClassificationPipeline`. - :obj:`"image-classification"`: will return a :class:`~transformers.ImageClassificationPipeline`:.
- :obj:`"question-answering"`: will return a :class:`~transformers.QuestionAnsweringPipeline`. - :obj:`"question-answering"`: will return a :class:`~transformers.QuestionAnsweringPipeline`:.
- :obj:`"fill-mask"`: will return a :class:`~transformers.FillMaskPipeline`. - :obj:`"table-question-answering"`: will return a :class:`~transformers.TableQuestionAnsweringPipeline`:.
- :obj:`"summarization"`: will return a :class:`~transformers.SummarizationPipeline`. - :obj:`"text2text-generation"`: will return a :class:`~transformers.Text2TextGenerationPipeline`:.
- :obj:`"translation_xx_to_yy"`: will return a :class:`~transformers.TranslationPipeline`. - :obj:`"text-classification"` (alias :obj:`"sentiment-analysis" available): will return a
- :obj:`"text2text-generation"`: will return a :class:`~transformers.Text2TextGenerationPipeline`. :class:`~transformers.TextClassificationPipeline`:.
- :obj:`"text-generation"`: will return a :class:`~transformers.TextGenerationPipeline`. - :obj:`"text-generation"`: will return a :class:`~transformers.TextGenerationPipeline`:.
- :obj:`"zero-shot-classification:`: will return a :class:`~transformers.ZeroShotClassificationPipeline`. - :obj:`"token-classification"` (alias :obj:`"ner"` available): will return a
- :obj:`"conversational"`: will return a :class:`~transformers.ConversationalPipeline`. :class:`~transformers.TokenClassificationPipeline`:.
- :obj:`"translation"`: will return a :class:`~transformers.TranslationPipeline`:.
- :obj:`"translation_xx_to_yy"`: will return a :class:`~transformers.TranslationPipeline`:.
- :obj:`"summarization"`: will return a :class:`~transformers.SummarizationPipeline`:.
- :obj:`"zero-shot-classification"`: will return a :class:`~transformers.ZeroShotClassificationPipeline`:.
model (:obj:`str` or :obj:`~transformers.PreTrainedModel` or :obj:`~transformers.TFPreTrainedModel`, `optional`):
The model that will be used by the pipeline to make predictions. This can be a model identifier or an
actual instance of a pretrained model inheriting from :class:`~transformers.PreTrainedModel` (for PyTorch)
@@ -12,23 +12,16 @@
# See the License for the specific language governing permissions and # See the License for the specific language governing permissions and
# limitations under the License. # limitations under the License.
import subprocess import subprocess
from typing import TYPE_CHECKING, Optional, Union from typing import Union
import numpy as np import numpy as np
from ..feature_extraction_utils import PreTrainedFeatureExtractor
from ..file_utils import add_end_docstrings, is_torch_available from ..file_utils import add_end_docstrings, is_torch_available
from ..utils import logging from ..utils import logging
from .base import PIPELINE_INIT_ARGS, Pipeline from .base import PIPELINE_INIT_ARGS, Pipeline
if TYPE_CHECKING:
from ..modeling_tf_utils import TFPreTrainedModel
from ..modeling_utils import PreTrainedModel
if is_torch_available(): if is_torch_available():
import torch
from ..models.auto.modeling_auto import MODEL_FOR_AUDIO_CLASSIFICATION_MAPPING from ..models.auto.modeling_auto import MODEL_FOR_AUDIO_CLASSIFICATION_MAPPING
logger = logging.get_logger(__name__) logger = logging.get_logger(__name__)
@@ -84,14 +77,10 @@ class AudioClassificationPipeline(Pipeline):
<https://huggingface.co/models?filter=audio-classification>`__. <https://huggingface.co/models?filter=audio-classification>`__.
""" """
def __init__( def __init__(self, *args, **kwargs):
self, # Default, might be overriden by the model.config.
model: Union["PreTrainedModel", "TFPreTrainedModel"], kwargs["top_k"] = 5
feature_extractor: PreTrainedFeatureExtractor, super().__init__(*args, **kwargs)
framework: Optional[str] = None,
**kwargs
):
super().__init__(model, feature_extractor=feature_extractor, framework=framework, **kwargs)
if self.framework != "pt": if self.framework != "pt":
raise ValueError(f"The {self.__class__} is only available in PyTorch.") raise ValueError(f"The {self.__class__} is only available in PyTorch.")
@@ -101,7 +90,6 @@ class AudioClassificationPipeline(Pipeline):
def __call__( def __call__(
self, self,
inputs: Union[np.ndarray, bytes, str], inputs: Union[np.ndarray, bytes, str],
top_k: Optional[int] = None,
**kwargs, **kwargs,
): ):
""" """
@@ -126,6 +114,18 @@ class AudioClassificationPipeline(Pipeline):
- **label** (:obj:`str`) -- The label predicted. - **label** (:obj:`str`) -- The label predicted.
- **score** (:obj:`float`) -- The corresponding probability. - **score** (:obj:`float`) -- The corresponding probability.
""" """
return super().__call__(inputs, **kwargs)
def _sanitize_parameters(self, top_k=None, **kwargs):
# No parameters on this pipeline right now
postprocess_params = {}
if top_k is not None:
if top_k > self.model.config.num_labels:
top_k = self.model.config.num_labels
postprocess_params["top_k"] = top_k
return {}, {}, postprocess_params
def preprocess(self, inputs):
if isinstance(inputs, str): if isinstance(inputs, str):
with open(inputs, "rb") as f: with open(inputs, "rb") as f:
inputs = f.read() inputs = f.read()
@@ -136,24 +136,23 @@ class AudioClassificationPipeline(Pipeline):
if not isinstance(inputs, np.ndarray): if not isinstance(inputs, np.ndarray):
raise ValueError("We expect a numpy ndarray as input") raise ValueError("We expect a numpy ndarray as input")
if len(inputs.shape) != 1: if len(inputs.shape) != 1:
raise ValueError("We expect a single channel audio input for AudioClassificationPipeline") raise ValueError("We expect a single channel audio input for AutomaticSpeechRecognitionPipeline")
if top_k is None or top_k > self.model.config.num_labels:
top_k = self.model.config.num_labels
processed = self.feature_extractor( processed = self.feature_extractor(
inputs, sampling_rate=self.feature_extractor.sampling_rate, return_tensors="pt" inputs, sampling_rate=self.feature_extractor.sampling_rate, return_tensors="pt"
) )
processed = self.ensure_tensor_on_device(**processed) return processed
with torch.no_grad(): def _forward(self, model_inputs):
outputs = self.model(**processed) model_outputs = self.model(**model_inputs)
return model_outputs
probs = outputs.logits[0].softmax(-1) def postprocess(self, model_outputs, top_k=5):
scores, ids = probs.topk(top_k) probs = model_outputs.logits[0].softmax(-1)
scores, ids = probs.topk(top_k)
scores = scores.tolist() scores = scores.tolist()
ids = ids.tolist() ids = ids.tolist()
labels = [{"score": score, "label": self.model.config.id2label[_id]} for score, _id in zip(scores, ids)] labels = [{"score": score, "label": self.model.config.id2label[_id]} for score, _id in zip(scores, ids)]
@@ -124,6 +124,13 @@ class AutomaticSpeechRecognitionPipeline(Pipeline):
- **text** (:obj:`str`) -- The recognized text. - **text** (:obj:`str`) -- The recognized text.
""" """
return super().__call__(inputs, **kwargs)
def _sanitize_parameters(self, **kwargs):
# No parameters on this pipeline right now
return {}, {}, {}
def preprocess(self, inputs):
if isinstance(inputs, str): if isinstance(inputs, str):
with open(inputs, "rb") as f: with open(inputs, "rb") as f:
inputs = f.read() inputs = f.read()
@@ -131,27 +138,34 @@ class AutomaticSpeechRecognitionPipeline(Pipeline):
if isinstance(inputs, bytes): if isinstance(inputs, bytes):
inputs = ffmpeg_read(inputs, self.feature_extractor.sampling_rate) inputs = ffmpeg_read(inputs, self.feature_extractor.sampling_rate)
assert isinstance(inputs, np.ndarray), "We expect a numpy ndarray as input" if not isinstance(inputs, np.ndarray):
assert len(inputs.shape) == 1, "We expect a single channel audio input for AutomaticSpeechRecognitionPipeline" raise ValueError("We expect a numpy ndarray as input")
if len(inputs.shape) != 1:
raise ValueError("We expect a single channel audio input for AutomaticSpeechRecognitionPipeline")
processed = self.feature_extractor( processed = self.feature_extractor(
inputs, sampling_rate=self.feature_extractor.sampling_rate, return_tensors="pt" inputs, sampling_rate=self.feature_extractor.sampling_rate, return_tensors="pt"
) )
processed = self.ensure_tensor_on_device(**processed) return processed
def _forward(self, model_inputs):
name = self.model.__class__.__name__ name = self.model.__class__.__name__
if name.endswith("ForConditionalGeneration") or name.endswith("EncoderDecoderModel"): if name.endswith("ForConditionalGeneration") or name.endswith("EncoderDecoderModel"):
encoder = self.model.get_encoder() encoder = self.model.get_encoder()
# we need to pass `processed.get("attention_mask")` here since audio encoder # we need to pass `processed.get("attention_mask")` here since audio encoder
# attention mask length is different from expected text decoder `encoder_attention_mask` length # attention mask length is different from expected text decoder `encoder_attention_mask` length
# `generate` magic to create the mask automatically won't work, we basically need to help
# it here.
tokens = self.model.generate( tokens = self.model.generate(
encoder_outputs=encoder(**processed), attention_mask=processed.get("attention_mask") encoder_outputs=encoder(**model_inputs), attention_mask=model_inputs.get("attention_mask")
) )
tokens = tokens.squeeze(0) tokens = tokens.squeeze(0)
elif name.endswith("ForCTC"): elif name.endswith("ForCTC"):
outputs = self.model(**processed) outputs = self.model(**model_inputs)
tokens = outputs.logits.squeeze(0).argmax(dim=-1) tokens = outputs.logits.squeeze(0).argmax(dim=-1)
return tokens
def postprocess(self, model_outputs):
skip_special_tokens = False if "CTC" in self.tokenizer.__class__.__name__ else True skip_special_tokens = False if "CTC" in self.tokenizer.__class__.__name__ else True
recognized_string = self.tokenizer.decode(tokens, skip_special_tokens=skip_special_tokens) recognized_string = self.tokenizer.decode(model_outputs, skip_special_tokens=skip_special_tokens)
return {"text": recognized_string} return {"text": recognized_string}
@@ -20,18 +20,21 @@ import pickle
import sys import sys
import warnings import warnings
from abc import ABC, abstractmethod from abc import ABC, abstractmethod
from collections import UserDict
from contextlib import contextmanager from contextlib import contextmanager
from os.path import abspath, exists from os.path import abspath, exists
from typing import TYPE_CHECKING, Any, Dict, List, Optional, Tuple, Union from typing import TYPE_CHECKING, Any, Dict, List, Optional, Tuple, Union
from ..feature_extraction_utils import PreTrainedFeatureExtractor from ..feature_extraction_utils import PreTrainedFeatureExtractor
from ..file_utils import add_end_docstrings, is_tf_available, is_torch_available from ..file_utils import ModelOutput, add_end_docstrings, is_tf_available, is_torch_available
from ..modelcard import ModelCard from ..modelcard import ModelCard
from ..models.auto.configuration_auto import AutoConfig from ..models.auto.configuration_auto import AutoConfig
from ..tokenization_utils import PreTrainedTokenizer, TruncationStrategy from ..tokenization_utils import PreTrainedTokenizer
from ..utils import logging from ..utils import logging
GenericTensor = Union[List["GenericTensor"], "torch.Tensor", "tf.Tensor"]
if is_tf_available(): if is_tf_available():
import tensorflow as tf import tensorflow as tf
@@ -39,8 +42,12 @@ if is_tf_available():
if is_torch_available(): if is_torch_available():
import torch import torch
from torch.utils.data import DataLoader, Dataset, IterableDataset
from ..models.auto.modeling_auto import AutoModel from ..models.auto.modeling_auto import AutoModel
else:
Dataset = None
KeyDataset = None
if TYPE_CHECKING: if TYPE_CHECKING:
from ..modeling_tf_utils import TFPreTrainedModel from ..modeling_tf_utils import TFPreTrainedModel
@@ -50,6 +57,12 @@ if TYPE_CHECKING:
logger = logging.get_logger(__name__) logger = logging.get_logger(__name__)
def collate_fn(items):
if len(items) != 1:
raise ValueError("This collate_fn is meant to be used with batch_size=1")
return items[0]
def infer_framework_load_model( def infer_framework_load_model(
model, model,
config: AutoConfig, config: AutoConfig,
@@ -585,6 +598,51 @@ PIPELINE_INIT_ARGS = r"""
Flag indicating if the output the pipeline should happen in a binary format (i.e., pickle) or as raw text. Flag indicating if the output the pipeline should happen in a binary format (i.e., pickle) or as raw text.
""" """
if is_torch_available():
class PipelineDataset(Dataset):
def __init__(self, dataset, process, params):
self.dataset = dataset
self.process = process
self.params = params
def __len__(self):
return len(self.dataset)
def __getitem__(self, i):
item = self.dataset[i]
processed = self.process(item, **self.params)
return processed
class PipelineIterator(IterableDataset):
def __init__(self, loader, infer, params):
self.loader = loader
self.infer = infer
self.params = params
def __len__(self):
return len(self.loader)
def __iter__(self):
self.iterator = iter(self.loader)
return self
def __next__(self):
item = next(self.iterator)
processed = self.infer(item, **self.params)
return processed
class KeyDataset(Dataset):
def __init__(self, dataset: Dataset, key: str):
self.dataset = dataset
self.key = key
def __len__(self):
return len(self.dataset)
def __getitem__(self, i):
return self.dataset[i][self.key]
@add_end_docstrings(PIPELINE_INIT_ARGS) @add_end_docstrings(PIPELINE_INIT_ARGS)
class Pipeline(_ScikitCompat): class Pipeline(_ScikitCompat):
@@ -618,6 +676,7 @@ class Pipeline(_ScikitCompat):
args_parser: ArgumentHandler = None, args_parser: ArgumentHandler = None,
device: int = -1, device: int = -1,
binary_output: bool = False, binary_output: bool = False,
**kwargs,
): ):
if framework is None: if framework is None:
@@ -641,6 +700,9 @@ class Pipeline(_ScikitCompat):
if task_specific_params is not None and task in task_specific_params: if task_specific_params is not None and task in task_specific_params:
self.model.config.update(task_specific_params.get(task)) self.model.config.update(task_specific_params.get(task))
self.call_count = 0
self._preprocess_params, self._forward_params, self._postprocess_params = self._sanitize_parameters(**kwargs)
def save_pretrained(self, save_directory: str): def save_pretrained(self, save_directory: str):
""" """
Save the pipeline's model and tokenizer. Save the pipeline's model and tokenizer.
@@ -707,15 +769,31 @@ class Pipeline(_ScikitCompat):
Ensure PyTorch tensors are on the specified device. Ensure PyTorch tensors are on the specified device.
Args: Args:
inputs (keyword arguments that should be :obj:`torch.Tensor`): The tensors to place on :obj:`self.device`. inputs (keyword arguments that should be :obj:`torch.Tensor`, the rest is ignored): The tensors to place on :obj:`self.device`.
Recursive on lists **only**.
Return: Return:
:obj:`Dict[str, torch.Tensor]`: The same as :obj:`inputs` but on the proper device. :obj:`Dict[str, torch.Tensor]`: The same as :obj:`inputs` but on the proper device.
""" """
return { return self._ensure_tensor_on_device(inputs, self.device)
name: tensor.to(self.device) if isinstance(tensor, torch.Tensor) else tensor
for name, tensor in inputs.items() def _ensure_tensor_on_device(self, inputs, device):
} if isinstance(inputs, ModelOutput):
return ModelOutput(
{name: self._ensure_tensor_on_device(tensor, device) for name, tensor in inputs.items()}
)
elif isinstance(inputs, dict):
return {name: self._ensure_tensor_on_device(tensor, device) for name, tensor in inputs.items()}
elif isinstance(inputs, UserDict):
return UserDict({name: self._ensure_tensor_on_device(tensor, device) for name, tensor in inputs.items()})
elif isinstance(inputs, list):
return [self._ensure_tensor_on_device(item, device) for item in inputs]
elif isinstance(inputs, tuple):
return tuple([self._ensure_tensor_on_device(item, device) for item in inputs])
elif isinstance(inputs, torch.Tensor):
return inputs.to(self.device)
else:
return inputs
def check_model_type(self, supported_models: Union[List[str], dict]): def check_model_type(self, supported_models: Union[List[str], dict]):
""" """
@@ -739,65 +817,108 @@ class Pipeline(_ScikitCompat):
f"The model '{self.model.__class__.__name__}' is not supported for {self.task}. Supported models are {supported_models}." f"The model '{self.model.__class__.__name__}' is not supported for {self.task}. Supported models are {supported_models}."
) )
def _parse_and_tokenize( @abstractmethod
self, inputs, padding=True, add_special_tokens=True, truncation=TruncationStrategy.DO_NOT_TRUNCATE, **kwargs def _sanitize_parameters(self, **pipeline_parameters):
):
""" """
Parse arguments and tokenize _sanitize_parameters will be called with any excessive named arguments from either `__init__` or `__call__`
""" methods. It should return 3 dictionnaries of the resolved parameters used by the various `preprocess`,
# Parse arguments `forward` and `postprocess` methods. Do not fill dictionnaries if the caller didn't specify a kwargs. This
if getattr(self.tokenizer, "pad_token", None) is None: let's you keep defaults in function signatures, which is more "natural".
padding = False
inputs = self.tokenizer(
inputs,
add_special_tokens=add_special_tokens,
return_tensors=self.framework,
padding=padding,
truncation=truncation,
)
return inputs
def __call__(self, inputs, *args, **kwargs):
try:
model_inputs = self._parse_and_tokenize(inputs, *args, **kwargs)
outputs = self._forward(model_inputs)
return outputs
except ValueError:
# XXX: Some tokenizer do NOT have a pad token, hence we cannot run the inference
# in a batch, instead we run everything sequentially
if isinstance(inputs, list):
values = []
for input_ in inputs:
model_input = self._parse_and_tokenize(input_, padding=False, *args, **kwargs)
value = self._forward(model_input)
values.append(value.squeeze(0))
else:
model_input = self._parse_and_tokenize(inputs, padding=False, *args, **kwargs)
values = self._forward(model_input)
return values
def _forward(self, inputs, return_tensors=False): It is not meant to be called directly, it will be automatically called and the final parameters resolved by
`__init__` and `__call__`
""" """
Internal framework specific forward dispatching raise NotImplementedError("_sanitize_parameters not implemented")
Args: @abstractmethod
inputs: dict holding all the keyword arguments for required by the model forward method. def preprocess(self, input_: Any, **preprocess_parameters: Dict) -> Dict[str, GenericTensor]:
return_tensors: Whether to return native framework (pt/tf) tensors rather than numpy array """
Preprocess will take the `input_` of a specific pipeline and return a dictionary of everything necessary for
`_forward` to run properly. It should contain at least one tensor, but might have arbitrary other items.
"""
raise NotImplementedError("preprocess not implemented")
Returns: @abstractmethod
Numpy array def _forward(self, input_tensors: Dict[str, GenericTensor], **forward_parameters: Dict) -> ModelOutput:
"""
_forward will receive the prepared dictionary from `preprocess` and run it on the model. This method might
involve the GPU or the CPU and should be agnostic to it. Isolating this function is the reason for `preprocess`
and `postprocess` to exist, so that this method, the hot path, can generally run as fast as possible.
It is not meant to be called directly, `forward` is preferred. It is basically the same but contains additional
code surrounding `_forward` making sure tensors and models are on the same device, disabling the training part
of the code (leading to faster inference).
"""
raise NotImplementedError("_forward not implemented")
@abstractmethod
def postprocess(self, model_outputs: ModelOutput, **postprocess_parameters: Dict) -> Any:
"""
Postprocess will receive the raw outputs of the `_forward` method, generally tensors, and reformat them into
something more friendly. Generally it will output a list or a dict of results (containing just strings and
numbers).
""" """
# Encode for forward raise NotImplementedError("postprocess not implemented")
def forward(self, model_inputs, **forward_params):
with self.device_placement(): with self.device_placement():
if self.framework == "tf": if self.framework == "tf":
# TODO trace model model_inputs["training"] = False
predictions = self.model(inputs.data, training=False)[0] model_outputs = self._forward(model_inputs, **forward_params)
else: elif self.framework == "pt":
with torch.no_grad(): with torch.no_grad():
inputs = self.ensure_tensor_on_device(**inputs) model_inputs = self._ensure_tensor_on_device(model_inputs, device=self.device)
predictions = self.model(**inputs)[0].cpu() model_outputs = self._forward(model_inputs, **forward_params)
model_outputs = self._ensure_tensor_on_device(model_outputs, device=torch.device("cpu"))
if return_tensors: else:
return predictions raise ValueError(f"Framework {self.framework} is not supported")
return model_outputs
def get_iterator(self, inputs, num_workers: int, preprocess_params, forward_params, postprocess_params):
if "TOKENIZERS_PARALLELISM" not in os.environ:
logger.info("Disabling tokenizer parallelism, we're using DataLoader multithreading already")
os.environ["TOKENIZERS_PARALLELISM"] = "false"
dataset = PipelineDataset(inputs, self.preprocess, preprocess_params)
dataloader = DataLoader(dataset, num_workers=num_workers, batch_size=1, collate_fn=collate_fn)
model_iterator = PipelineIterator(dataloader, self.forward, forward_params)
final_iterator = PipelineIterator(model_iterator, self.postprocess, postprocess_params)
return final_iterator
def __call__(self, inputs, *args, num_workers=8, **kwargs):
if args:
logger.warning(f"Ignoring args : {args}")
preprocess_params, forward_params, postprocess_params = self._sanitize_parameters(**kwargs)
# Fuse __init__ params and __call__ params without modifying the __init__ ones.
preprocess_params = {**self._preprocess_params, **preprocess_params}
forward_params = {**self._forward_params, **forward_params}
postprocess_params = {**self._postprocess_params, **postprocess_params}
self.call_count += 1
if self.call_count > 10 and self.framework == "pt" and self.device.type == "cuda":
warnings.warn(
"You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset",
UserWarning,
)
if isinstance(inputs, list):
if self.framework == "pt":
final_iterator = self.get_iterator(
inputs, num_workers, preprocess_params, forward_params, postprocess_params
)
outputs = [output for output in final_iterator]
return outputs
else:
return self.run_multi(inputs, preprocess_params, forward_params, postprocess_params)
elif Dataset is not None and isinstance(inputs, Dataset):
return self.get_iterator(inputs, num_workers, preprocess_params, forward_params, postprocess_params)
else: else:
return predictions.numpy() return self.run_single(inputs, preprocess_params, forward_params, postprocess_params)
def run_multi(self, inputs, preprocess_params, forward_params, postprocess_params):
return [self.run_single(item, preprocess_params, forward_params, postprocess_params) for item in inputs]
def run_single(self, inputs, preprocess_params, forward_params, postprocess_params):
model_inputs = self.preprocess(inputs, **preprocess_params)
model_outputs = self.forward(model_inputs, **forward_params)
outputs = self.postprocess(model_outputs, **postprocess_params)
return outputs
@@ -190,23 +190,34 @@ class ConversationalPipeline(Pipeline):
conversational_pipeline([conversation_1, conversation_2]) conversational_pipeline([conversation_1, conversation_2])
""" """
def __init__(self, min_length_for_response=32, minimum_tokens=10, *args, **kwargs): def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs) super().__init__(*args, **kwargs)
# We need at least an eos_token
# assert self.tokenizer.eos_token_id is not None, "ConversationalPipeline tokenizer should have an EOS token set"
if self.tokenizer.pad_token_id is None: if self.tokenizer.pad_token_id is None:
self.tokenizer.pad_token = self.tokenizer.eos_token self.tokenizer.pad_token = self.tokenizer.eos_token
self.min_length_for_response = min_length_for_response def _sanitize_parameters(
self.minimum_tokens = minimum_tokens self, min_length_for_response=None, minimum_tokens=None, clean_up_tokenization_spaces=None, **generate_kwargs
def __call__(
self,
conversations: Union[Conversation, List[Conversation]],
clean_up_tokenization_spaces=True,
**generate_kwargs
): ):
preprocess_params = {}
forward_params = {}
postprocess_params = {}
if min_length_for_response is not None:
preprocess_params["min_length_for_response"] = min_length_for_response
if minimum_tokens is not None:
forward_params["minimum_tokens"] = minimum_tokens
if "max_length" in generate_kwargs:
forward_params["max_length"] = generate_kwargs["max_length"]
# self.max_length = generate_kwargs.get("max_length", self.model.config.max_length)
if clean_up_tokenization_spaces is not None:
postprocess_params["clean_up_tokenization_spaces"] = clean_up_tokenization_spaces
if generate_kwargs:
forward_params.update(generate_kwargs)
return preprocess_params, forward_params, postprocess_params
def __call__(self, conversations: Union[Conversation, List[Conversation]], num_workers=0, **kwargs):
r""" r"""
Generate responses for the conversation(s) given as inputs. Generate responses for the conversation(s) given as inputs.
@@ -223,117 +234,67 @@ class ConversationalPipeline(Pipeline):
:class:`~transformers.Conversation` or a list of :class:`~transformers.Conversation`: Conversation(s) with :class:`~transformers.Conversation` or a list of :class:`~transformers.Conversation`: Conversation(s) with
updated generated responses for those containing a new user input. updated generated responses for those containing a new user input.
""" """
# XXX: num_workers==0 is required to be backward compatible
# Otherwise the threads will require a Conversation copy.
# This will definitely hinder performance on GPU, but has to be opted
# in because of this BC change.
outputs = super().__call__(conversations, num_workers=num_workers, **kwargs)
if isinstance(outputs, list) and len(outputs) == 1:
return outputs[0]
return outputs
if isinstance(conversations, Conversation): def preprocess(self, conversation: Conversation) -> Dict[str, Any]:
conversations = [conversations] if not isinstance(conversation, Conversation):
# Input validation raise ValueError("ConversationalPipeline, expects Conversation as inputs")
if isinstance(conversations, list): if conversation.new_user_input is None:
for conversation in conversations: raise ValueError(
assert isinstance( f"Conversation with UUID {type(conversation.uuid)} does not contain new user input to process. "
conversation, Conversation "Add user inputs with the conversation's `add_user_input` method"
), "ConversationalPipeline expects a Conversation or list of Conversations as an input"
if conversation.new_user_input is None:
raise ValueError(
f"Conversation with UUID {type(conversation.uuid)} does not contain new user input to process. "
"Add user inputs with the conversation's `add_user_input` method"
)
assert (
self.tokenizer.pad_token_id is not None or self.tokenizer.eos_token_id is not None
), "Please make sure that the tokenizer has a pad_token_id or eos_token_id when using a batch input"
else:
raise ValueError("ConversationalPipeline expects a Conversation or list of Conversations as an input")
with self.device_placement():
inputs = self._parse_and_tokenize(conversations)
if self.framework == "pt":
inputs = self.ensure_tensor_on_device(**inputs)
input_length = inputs["input_ids"].shape[-1]
elif self.framework == "tf":
input_length = tf.shape(inputs["input_ids"])[-1].numpy()
max_length = generate_kwargs.get("max_length", self.model.config.max_length)
n = inputs["input_ids"].shape[1]
if max_length - self.minimum_tokens < n:
logger.warning(
f"Conversation input is to long ({n}), trimming it to ({max_length} - {self.minimum_tokens})"
)
trim = max_length - self.minimum_tokens
inputs["input_ids"] = inputs["input_ids"][:, -trim:]
inputs["attention_mask"] = inputs["attention_mask"][:, -trim:]
generated_responses = self.model.generate(
inputs["input_ids"],
attention_mask=inputs["attention_mask"],
**generate_kwargs,
) )
if hasattr(self.tokenizer, "_build_conversation_input_ids"):
input_ids = self.tokenizer._build_conversation_input_ids(conversation)
else:
# If the tokenizer cannot handle conversations, we default to only the old version
input_ids = self._legacy_parse_and_tokenize(conversation)
if self.framework == "pt":
input_ids = torch.LongTensor([input_ids])
elif self.framework == "tf":
input_ids = tf.constant([input_ids])
return {"input_ids": input_ids, "conversation": conversation}
def _forward(self, model_inputs, minimum_tokens=10, **generate_kwargs):
max_length = generate_kwargs.get("max_length", self.model.config.max_length)
n = model_inputs["input_ids"].shape[1]
if max_length - minimum_tokens < n:
logger.warning(f"Conversation input is to long ({n}), trimming it to ({max_length} - {minimum_tokens})")
trim = max_length - minimum_tokens
model_inputs["input_ids"] = model_inputs["input_ids"][:, -trim:]
if "attention_mask" in model_inputs:
model_inputs["attention_mask"] = model_inputs["attention_mask"][:, -trim:]
conversation = model_inputs.pop("conversation")
model_inputs["max_length"] = max_length
output_ids = self.model.generate(**model_inputs, **generate_kwargs)
if self.model.config.is_encoder_decoder:
start_position = 1
else:
start_position = n
return {"output_ids": output_ids[0, start_position:], "conversation": conversation}
def postprocess(self, model_outputs, clean_up_tokenization_spaces=True):
output_ids = model_outputs["output_ids"]
answer = self.tokenizer.decode(
output_ids,
skip_special_tokens=True,
clean_up_tokenization_spaces=clean_up_tokenization_spaces,
)
conversation = model_outputs["conversation"]
conversation.mark_processed()
conversation.append_response(answer)
return conversation
if self.model.config.is_encoder_decoder: def _legacy_parse_and_tokenize(self, conversation: Conversation) -> Dict:
if self.framework == "pt":
history = torch.cat((inputs["input_ids"], generated_responses[:, 1:]), 1)
elif self.framework == "tf":
history = tf.concat([inputs["input_ids"], generated_responses[:, 1:]], 1)
else:
history = generated_responses
history = self._clean_padding_history(history)
if self.model.config.is_encoder_decoder:
start_position = 1
else:
start_position = input_length
output = []
for conversation_index, conversation in enumerate(conversations):
conversation.mark_processed()
conversation.generated_responses.append(
self.tokenizer.decode(
generated_responses[conversation_index][start_position:],
skip_special_tokens=True,
clean_up_tokenization_spaces=clean_up_tokenization_spaces,
)
)
output.append(conversation)
if len(output) == 1:
return output[0]
else:
return output
def _clean_padding_history(self, generated_tensor) -> List[List[int]]:
"""
Cleans the padding history. Padding may be generated in two places when multiple conversations are provided as
an input:
- at the end of the concatenated history and new user input, so that all input to the model have the same
length
- at the end of the generated response, as some responses will be longer than others
This method cleans up these padding token so that the history for each conversation is not impacted by the
batching process.
"""
outputs = []
for sequence in generated_tensor:
sequence_tokens = []
is_previous_pad = False
for token in sequence:
if token == self.tokenizer.pad_token_id:
if self.tokenizer.pad_token_id != self.tokenizer.eos_token_id:
continue
if is_previous_pad:
continue
else:
is_previous_pad = True
else:
is_previous_pad = False
if self.framework == "pt":
sequence_tokens.append(token.item())
else:
sequence_tokens.append(int(token.numpy()))
outputs.append(sequence_tokens)
return outputs
def _legacy_parse_and_tokenize(self, conversation: List[Conversation]) -> List[int]:
eos_token_id = self.tokenizer.eos_token_id eos_token_id = self.tokenizer.eos_token_id
input_ids = [] input_ids = []
for is_user, text in conversation.iter_texts(): for is_user, text in conversation.iter_texts():
@@ -345,14 +306,3 @@ class ConversationalPipeline(Pipeline):
if len(input_ids) > self.tokenizer.model_max_length: if len(input_ids) > self.tokenizer.model_max_length:
input_ids = input_ids[-self.tokenizer.model_max_length :] input_ids = input_ids[-self.tokenizer.model_max_length :]
return input_ids return input_ids
def _parse_and_tokenize(self, conversations: List[Conversation]) -> Dict[str, Any]:
if hasattr(self.tokenizer, "_build_conversation_input_ids"):
input_ids = [self.tokenizer._build_conversation_input_ids(conversation) for conversation in conversations]
else:
# If the tokenizer cannot handle conversations, we default to only the old version
input_ids = [self._legacy_parse_and_tokenize(conversation) for conversation in conversations]
inputs = self.tokenizer.pad(
{"input_ids": input_ids}, padding="longest", return_attention_mask=True, return_tensors=self.framework
)
return inputs
from typing import TYPE_CHECKING, Optional, Union from typing import Dict
from ..feature_extraction_utils import PreTrainedFeatureExtractor from .base import GenericTensor, Pipeline
from ..modelcard import ModelCard
from ..tokenization_utils import PreTrainedTokenizer
from .base import ArgumentHandler, Pipeline
if TYPE_CHECKING:
from ..modeling_tf_utils import TFPreTrainedModel
from ..modeling_utils import PreTrainedModel
# Can't use @add_end_docstrings(PIPELINE_INIT_ARGS) here because this one does not accept `binary_output` # Can't use @add_end_docstrings(PIPELINE_INIT_ARGS) here because this one does not accept `binary_output`
@@ -49,28 +41,24 @@ class FeatureExtractionPipeline(Pipeline):
the associated CUDA device id. the associated CUDA device id.
""" """
def __init__( def _sanitize_parameters(self, **kwargs):
self, return {}, {}, {}
model: Union["PreTrainedModel", "TFPreTrainedModel"],
tokenizer: PreTrainedTokenizer, def preprocess(self, inputs) -> Dict[str, GenericTensor]:
feature_extractor: Optional[PreTrainedFeatureExtractor] = None, return_tensors = self.framework
modelcard: Optional[ModelCard] = None, model_inputs = self.tokenizer(inputs, return_tensors=return_tensors)
framework: Optional[str] = None, return model_inputs
args_parser: ArgumentHandler = None,
device: int = -1, def _forward(self, model_inputs):
task: str = "", model_outputs = self.model(**model_inputs)
): return model_outputs
super().__init__(
model=model, def postprocess(self, model_outputs):
tokenizer=tokenizer, # [0] is the first available tensor, logits or last_hidden_state.
feature_extractor=feature_extractor, if self.framework == "pt":
modelcard=modelcard, return model_outputs[0].tolist()
framework=framework, elif self.framework == "tf":
args_parser=args_parser, return model_outputs[0].numpy().tolist()
device=device,
binary_output=True,
task=task,
)
def __call__(self, *args, **kwargs): def __call__(self, *args, **kwargs):
""" """
@@ -82,10 +70,4 @@ class FeatureExtractionPipeline(Pipeline):
Return: Return:
A nested list of :obj:`float`: The features computed by the model. A nested list of :obj:`float`: The features computed by the model.
""" """
results = super().__call__(*args, **kwargs) return super().__call__(*args, **kwargs)
if isinstance(results, list):
# Sequential run
results = [r.tolist() for r in results]
else:
results = results.tolist()
return results
from typing import TYPE_CHECKING, Dict, List, Optional, Union from typing import Dict
import numpy as np import numpy as np
from ..file_utils import add_end_docstrings, is_tf_available, is_torch_available from ..file_utils import add_end_docstrings, is_tf_available, is_torch_available
from ..modelcard import ModelCard
from ..tokenization_utils import PreTrainedTokenizer
from ..utils import logging from ..utils import logging
from .base import PIPELINE_INIT_ARGS, ArgumentHandler, Pipeline, PipelineException from .base import PIPELINE_INIT_ARGS, GenericTensor, Pipeline, PipelineException
GenericTensor = Union[List["GenericTensor"], "torch.Tensor", "tf.Tensor"]
if TYPE_CHECKING:
from ..modeling_tf_utils import TFPreTrainedModel
from ..modeling_utils import PreTrainedModel
if is_tf_available(): if is_tf_available():
import tensorflow as tf import tensorflow as tf
from ..models.auto.modeling_tf_auto import TF_MODEL_FOR_MASKED_LM_MAPPING
if is_torch_available(): if is_torch_available():
import torch import torch
from ..models.auto.modeling_auto import MODEL_FOR_MASKED_LM_MAPPING
logger = logging.get_logger(__name__) logger = logging.get_logger(__name__)
@@ -58,39 +47,6 @@ class FillMaskPipeline(Pipeline):
This pipeline only works for inputs with exactly one token masked. This pipeline only works for inputs with exactly one token masked.
""" """
def __init__(
self,
model: Union["PreTrainedModel", "TFPreTrainedModel"],
tokenizer: PreTrainedTokenizer,
modelcard: Optional[ModelCard] = None,
framework: Optional[str] = None,
args_parser: ArgumentHandler = None,
device: int = -1,
top_k=5,
targets=None,
task: str = "",
):
super().__init__(
model=model,
tokenizer=tokenizer,
modelcard=modelcard,
framework=framework,
args_parser=args_parser,
device=device,
binary_output=True,
task=task,
)
self.check_model_type(
TF_MODEL_FOR_MASKED_LM_MAPPING if self.framework == "tf" else MODEL_FOR_MASKED_LM_MAPPING
)
self.top_k = top_k
self.targets = targets
if self.tokenizer.mask_token_id is None:
raise PipelineException(
"fill-mask", self.model.base_model_prefix, "The tokenizer does not define a `mask_token`."
)
def get_masked_index(self, input_ids: GenericTensor) -> np.ndarray: def get_masked_index(self, input_ids: GenericTensor) -> np.ndarray:
if self.framework == "tf": if self.framework == "tf":
masked_index = tf.where(input_ids == self.tokenizer.mask_token_id).numpy() masked_index = tf.where(input_ids == self.tokenizer.mask_token_id).numpy()
@@ -124,17 +80,124 @@ class FillMaskPipeline(Pipeline):
for input_ids in model_inputs["input_ids"]: for input_ids in model_inputs["input_ids"]:
self._ensure_exactly_one_mask_token(input_ids) self._ensure_exactly_one_mask_token(input_ids)
def get_model_inputs(self, inputs, *args, **kwargs) -> Dict: def preprocess(self, inputs, return_tensors=None, **preprocess_parameters) -> Dict[str, GenericTensor]:
if isinstance(inputs, list) and self.tokenizer.pad_token is None: if return_tensors is None:
model_inputs = [] return_tensors = self.framework
for input_ in inputs: model_inputs = self.tokenizer(inputs, return_tensors=return_tensors)
model_input = self._parse_and_tokenize(input_, padding=False, *args, **kwargs) self.ensure_exactly_one_mask_token(model_inputs)
model_inputs.append(model_input)
else:
model_inputs = self._parse_and_tokenize(inputs, *args, **kwargs)
return model_inputs return model_inputs
def __call__(self, inputs, *args, targets=None, top_k: Optional[int] = None, **kwargs): def _forward(self, model_inputs):
model_outputs = self.model(**model_inputs)
model_outputs["input_ids"] = model_inputs["input_ids"][0]
return model_outputs
def postprocess(self, model_outputs, top_k=5, target_ids=None):
# Cap top_k if there are targets
if target_ids is not None and target_ids.shape[0] < top_k:
top_k = target_ids.shape[0]
input_ids = model_outputs["input_ids"]
outputs = model_outputs["logits"]
result = []
if self.framework == "tf":
masked_index = tf.where(input_ids == self.tokenizer.mask_token_id).numpy()
# Fill mask pipeline supports only one ${mask_token} per sample
logits = outputs[0, masked_index.item(), :]
probs = tf.nn.softmax(logits)
if target_ids is not None:
probs = tf.gather_nd(probs, tf.reshape(target_ids, (-1, 1)))
topk = tf.math.top_k(probs, k=top_k)
values, predictions = topk.values.numpy(), topk.indices.numpy()
else:
masked_index = torch.nonzero(input_ids == self.tokenizer.mask_token_id, as_tuple=False)
# Fill mask pipeline supports only one ${mask_token} per sample
logits = outputs[0, masked_index.item(), :]
probs = logits.softmax(dim=0)
if target_ids is not None:
probs = probs[..., target_ids]
values, predictions = probs.topk(top_k)
for v, p in zip(values.tolist(), predictions.tolist()):
tokens = input_ids.numpy()
if target_ids is not None:
p = target_ids[p].tolist()
tokens[masked_index] = p
# Filter padding out:
tokens = tokens[np.where(tokens != self.tokenizer.pad_token_id)]
result.append(
{
"sequence": self.tokenizer.decode(tokens, skip_special_tokens=True),
"score": v,
"token": p,
"token_str": self.tokenizer.decode(p),
}
)
return result
def get_target_ids(self, targets, top_k=None):
if isinstance(targets, str):
targets = [targets]
try:
vocab = self.tokenizer.get_vocab()
except Exception:
vocab = {}
target_ids = []
for target in targets:
id_ = vocab.get(target, None)
if id_ is None:
input_ids = self.tokenizer(
target,
add_special_tokens=False,
return_attention_mask=False,
return_token_type_ids=False,
max_length=1,
truncation=True,
)["input_ids"]
if len(input_ids) == 0:
logger.warning(
f"The specified target token `{target}` does not exist in the model vocabulary. "
f"We cannot replace it with anything meaningful, ignoring it"
)
continue
id_ = input_ids[0]
# XXX: If users reach this code path, tokenization becomes pretty slow,
# so this warning lets them fix their input to get faster performance.
logger.warning(
f"The specified target token `{target}` does not exist in the model vocabulary. "
f"Replacing with `{self.tokenizer.convert_ids_to_tokens(id_)}`."
)
target_ids.append(id_)
target_ids = list(set(target_ids))
if len(target_ids) == 0:
raise ValueError("At least one target must be provided when passed.")
target_ids = np.array(target_ids)
return target_ids
def _sanitize_parameters(self, top_k=None, targets=None):
postprocess_params = {}
if targets is not None:
target_ids = self.get_target_ids(targets, top_k)
postprocess_params["target_ids"] = target_ids
if top_k is not None:
postprocess_params["top_k"] = top_k
if self.tokenizer.mask_token_id is None:
raise PipelineException(
"fill-mask", self.model.base_model_prefix, "The tokenizer does not define a `mask_token`."
)
return {}, {}, postprocess_params
def __call__(self, inputs, *args, **kwargs):
""" """
Fill the masked token in the text(s) given as inputs. Fill the masked token in the text(s) given as inputs.
@@ -156,126 +219,4 @@ class FillMaskPipeline(Pipeline):
- **token** (:obj:`int`) -- The predicted token id (to replace the masked one).
- **token_str** (:obj:`str`) -- The predicted token (to replace the masked one).
"""
model_inputs = self.get_model_inputs(inputs, *args, **kwargs) return super().__call__(inputs, **kwargs)
self.ensure_exactly_one_mask_token(model_inputs)
if isinstance(model_inputs, list):
outputs = []
for model_input in model_inputs:
output = self._forward(model_input, return_tensors=True)
outputs.append(output)
batch_size = len(model_inputs)
else:
outputs = self._forward(model_inputs, return_tensors=True)
batch_size = outputs.shape[0] if self.framework == "tf" else outputs.size(0)
# top_k must be defined
if top_k is None:
top_k = self.top_k
results = []
if targets is None and self.targets is not None:
targets = self.targets
if targets is not None:
if isinstance(targets, str):
targets = [targets]
try:
vocab = self.tokenizer.get_vocab()
except Exception:
vocab = {}
target_ids = []
for target in targets:
id_ = vocab.get(target, None)
if id_ is None:
input_ids = self.tokenizer(
target,
add_special_tokens=False,
return_attention_mask=False,
return_token_type_ids=False,
max_length=1,
truncation=True,
)["input_ids"]
if len(input_ids) == 0:
logger.warning(
f"The specified target token `{target}` does not exist in the model vocabulary. "
f"We cannot replace it with anything meaningful, ignoring it"
)
continue
id_ = input_ids[0]
# XXX: If users reach this code path, tokenization becomes pretty slow,
# so this warning lets them fix their input to get faster performance.
logger.warning(
f"The specified target token `{target}` does not exist in the model vocabulary. "
f"Replacing with `{self.tokenizer.convert_ids_to_tokens(id_)}`."
)
target_ids.append(id_)
target_ids = list(set(target_ids))
if len(target_ids) == 0:
raise ValueError("At least one target must be provided when passed.")
target_ids = np.array(target_ids)
# Cap top_k if there are targets
if top_k > target_ids.shape[0]:
top_k = target_ids.shape[0]
for i in range(batch_size):
if isinstance(model_inputs, list):
input_ids = model_inputs[i]["input_ids"][0]
else:
input_ids = model_inputs["input_ids"][i]
result = []
if self.framework == "tf":
masked_index = tf.where(input_ids == self.tokenizer.mask_token_id).numpy()
# Fill mask pipeline supports only one ${mask_token} per sample
if isinstance(outputs, list):
logits = outputs[i][0, masked_index.item(), :]
else:
logits = outputs[i, masked_index.item(), :]
probs = tf.nn.softmax(logits)
if targets is not None:
probs = tf.gather_nd(probs, tf.reshape(target_ids, (-1, 1)))
topk = tf.math.top_k(probs, k=top_k)
values, predictions = topk.values.numpy(), topk.indices.numpy()
else:
masked_index = torch.nonzero(input_ids == self.tokenizer.mask_token_id, as_tuple=False)
# Fill mask pipeline supports only one ${mask_token} per sample
if isinstance(outputs, list):
logits = outputs[i][0, masked_index.item(), :]
else:
logits = outputs[i, masked_index.item(), :]
probs = logits.softmax(dim=0)
if targets is not None:
probs = probs[..., target_ids]
values, predictions = probs.topk(top_k)
for v, p in zip(values.tolist(), predictions.tolist()):
tokens = input_ids.numpy()
if targets is not None:
p = target_ids[p].tolist()
tokens[masked_index] = p
# Filter padding out:
tokens = tokens[np.where(tokens != self.tokenizer.pad_token_id)]
result.append(
{
"sequence": self.tokenizer.decode(tokens, skip_special_tokens=True),
"score": v,
"token": p,
"token_str": self.tokenizer.decode(p),
}
)
# Append
results += [result]
if len(results) == 1:
return results[0]
return results
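A minimal usage sketch of the reworked fill-mask pipeline (editor's example, not part of the diff; the checkpoint name is only an assumption). Because `_sanitize_parameters` routes `top_k` and `targets` to `postprocess`, both can be passed at construction time or per call:

.. code-block::

    from transformers import pipeline

    fill_mask = pipeline("fill-mask", model="distilroberta-base")  # example checkpoint
    prompt = f"Paris is the {fill_mask.tokenizer.mask_token} of France."

    # `top_k` is forwarded to `postprocess` through `_sanitize_parameters`.
    for prediction in fill_mask(prompt, top_k=3):
        print(prediction["token_str"], round(prediction["score"], 4))

    # `targets` restricts candidates and caps `top_k` to the number of valid targets.
    # A warning may be logged if a target does not exactly match a vocabulary token.
    print(fill_mask(prompt, targets=["capital"]))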
import os import os
from typing import TYPE_CHECKING, List, Optional, Union from typing import List, Union
import requests import requests
from ..feature_extraction_utils import PreTrainedFeatureExtractor
from ..file_utils import add_end_docstrings, is_torch_available, is_vision_available, requires_backends from ..file_utils import add_end_docstrings, is_torch_available, is_vision_available, requires_backends
from ..utils import logging from ..utils import logging
from .base import PIPELINE_INIT_ARGS, Pipeline from .base import PIPELINE_INIT_ARGS, Pipeline
if TYPE_CHECKING:
from ..modeling_tf_utils import TFPreTrainedModel
from ..modeling_utils import PreTrainedModel
if is_vision_available(): if is_vision_available():
from PIL import Image from PIL import Image
if is_torch_available(): if is_torch_available():
import torch
from ..models.auto.modeling_auto import MODEL_FOR_IMAGE_CLASSIFICATION_MAPPING from ..models.auto.modeling_auto import MODEL_FOR_IMAGE_CLASSIFICATION_MAPPING
logger = logging.get_logger(__name__) logger = logging.get_logger(__name__)
@@ -37,24 +30,15 @@ class ImageClassificationPipeline(Pipeline):
<https://huggingface.co/models?filter=image-classification>`__.
"""
def __init__( def __init__(self, *args, **kwargs):
self, super().__init__(*args, **kwargs)
model: Union["PreTrainedModel", "TFPreTrainedModel"],
feature_extractor: PreTrainedFeatureExtractor,
framework: Optional[str] = None,
**kwargs
):
super().__init__(model, feature_extractor=feature_extractor, framework=framework, **kwargs)
if self.framework == "tf": if self.framework == "tf":
raise ValueError(f"The {self.__class__} is only available in PyTorch.") raise ValueError(f"The {self.__class__} is only available in PyTorch.")
requires_backends(self, "vision") requires_backends(self, "vision")
self.check_model_type(MODEL_FOR_IMAGE_CLASSIFICATION_MAPPING) self.check_model_type(MODEL_FOR_IMAGE_CLASSIFICATION_MAPPING)
self.feature_extractor = feature_extractor
@staticmethod @staticmethod
def load_image(image: Union[str, "Image.Image"]): def load_image(image: Union[str, "Image.Image"]):
if isinstance(image, str): if isinstance(image, str):
@@ -77,7 +61,13 @@ class ImageClassificationPipeline(Pipeline):
image = image.convert("RGB")
return image
def __call__(self, images: Union[str, List[str], "Image", List["Image"]], top_k=5): def _sanitize_parameters(self, top_k=None):
postprocess_params = {}
if top_k is not None:
postprocess_params["top_k"] = top_k
return {}, {}, postprocess_params
def __call__(self, images: Union[str, List[str], "Image", List["Image"]], **kwargs):
""" """
Assign labels to the image(s) passed as inputs. Assign labels to the image(s) passed as inputs.
@@ -106,34 +96,23 @@ class ImageClassificationPipeline(Pipeline):
- **label** (:obj:`str`) -- The label identified by the model.
- **score** (:obj:`float`) -- The score attributed by the model for that label.
"""
is_batched = isinstance(images, list) return super().__call__(images, **kwargs)
if not is_batched: def preprocess(self, image):
images = [images] image = self.load_image(image)
model_inputs = self.feature_extractor(images=image, return_tensors="pt")
return model_inputs
images = [self.load_image(image) for image in images] def _forward(self, model_inputs):
model_outputs = self.model(**model_inputs)
return model_outputs
def postprocess(self, model_outputs, top_k=5):
if top_k > self.model.config.num_labels: if top_k > self.model.config.num_labels:
top_k = self.model.config.num_labels top_k = self.model.config.num_labels
probs = model_outputs.logits.softmax(-1)[0]
scores, ids = probs.topk(top_k)
with torch.no_grad(): scores = scores.tolist()
inputs = self.feature_extractor(images=images, return_tensors="pt") ids = ids.tolist()
outputs = self.model(**inputs) return [{"score": score, "label": self.model.config.id2label[_id]} for score, _id in zip(scores, ids)]
probs = outputs.logits.softmax(-1)
scores, ids = probs.topk(top_k)
scores = scores.tolist()
ids = ids.tolist()
if not is_batched:
scores, ids = scores[0], ids[0]
labels = [{"score": score, "label": self.model.config.id2label[_id]} for score, _id in zip(scores, ids)]
else:
labels = []
for scores, ids in zip(scores, ids):
labels.append(
[{"score": score, "label": self.model.config.id2label[_id]} for score, _id in zip(scores, ids)]
)
return labels
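A short usage sketch for the migrated image-classification pipeline (editor's example, not part of the diff; the COCO image URL is an assumption). `top_k` is no longer an explicit `__call__` argument but is still accepted and routed to `postprocess`:

.. code-block::

    from transformers import pipeline

    classifier = pipeline("image-classification")  # falls back to the task's default checkpoint
    url = "http://images.cocodataset.org/val2017/000000039769.jpg"  # example image

    # Values of `top_k` larger than `num_labels` are capped inside `postprocess`.
    for prediction in classifier(url, top_k=3):
        print(prediction["label"], round(prediction["score"], 4))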
import os import os
from typing import TYPE_CHECKING, Any, Dict, List, Optional, Union from typing import Any, Dict, List, Union
import requests import requests
from ..feature_extraction_utils import PreTrainedFeatureExtractor
from ..file_utils import add_end_docstrings, is_torch_available, is_vision_available, requires_backends from ..file_utils import add_end_docstrings, is_torch_available, is_vision_available, requires_backends
from ..utils import logging from ..utils import logging
from .base import PIPELINE_INIT_ARGS, Pipeline from .base import PIPELINE_INIT_ARGS, Pipeline
if TYPE_CHECKING:
from ..modeling_utils import PreTrainedModel
if is_vision_available(): if is_vision_available():
from PIL import Image from PIL import Image
@@ -40,24 +36,15 @@ class ObjectDetectionPipeline(Pipeline):
<https://huggingface.co/models?filter=object-detection>`__.
"""
def __init__( def __init__(self, *args, **kwargs):
self, super().__init__(*args, **kwargs)
model: "PreTrainedModel",
feature_extractor: PreTrainedFeatureExtractor,
framework: Optional[str] = None,
**kwargs
):
super().__init__(model, feature_extractor=feature_extractor, framework=framework, **kwargs)
if self.framework == "tf": if self.framework == "tf":
raise ValueError(f"The {self.__class__} is only available in PyTorch.") raise ValueError(f"The {self.__class__} is only available in PyTorch.")
requires_backends(self, "vision") requires_backends(self, "vision")
self.check_model_type(MODEL_FOR_OBJECT_DETECTION_MAPPING) self.check_model_type(MODEL_FOR_OBJECT_DETECTION_MAPPING)
self.feature_extractor = feature_extractor
@staticmethod @staticmethod
def load_image(image: Union[str, "Image.Image"]): def load_image(image: Union[str, "Image.Image"]):
if isinstance(image, str): if isinstance(image, str):
@@ -80,11 +67,13 @@ class ObjectDetectionPipeline(Pipeline):
image = image.convert("RGB")
return image
def __call__( def _sanitize_parameters(self, **kwargs):
self, postprocess_kwargs = {}
images: Union[str, List[str], "Image", List["Image"]], if "threshold" in kwargs:
threshold: Optional[float] = 0.9, postprocess_kwargs["threshold"] = kwargs["threshold"]
) -> Union[Predictions, List[Prediction]]: return {}, {}, postprocess_kwargs
def __call__(self, *args, **kwargs) -> Union[Predictions, List[Prediction]]:
""" """
Detect objects (bounding boxes & classes) in the image(s) passed as inputs. Detect objects (bounding boxes & classes) in the image(s) passed as inputs.
@@ -112,47 +101,42 @@ class ObjectDetectionPipeline(Pipeline):
- **score** (:obj:`float`) -- The score attributed by the model for that label.
- **box** (:obj:`List[Dict[str, int]]`) -- The bounding box of the detected object in the image's original size.
"""
is_batched = isinstance(images, list)
if not is_batched:
images = [images]
images = [self.load_image(image) for image in images]
with torch.no_grad():
inputs = self.feature_extractor(images=images, return_tensors="pt")
outputs = self.model(**inputs)
if self.framework == "pt":
target_sizes = torch.IntTensor([[im.height, im.width] for im in images])
else:
raise ValueError("The ObjectDetectionPipeline is only available in PyTorch.")
raw_annotations = self.feature_extractor.post_process(outputs, target_sizes)
annotations = []
for annotation in raw_annotations:
keep = annotation["scores"] > threshold
scores = annotation["scores"][keep]
labels = annotation["labels"][keep]
boxes = annotation["boxes"][keep]
annotation["scores"] = scores.tolist()
annotation["labels"] = [self.model.config.id2label[label.item()] for label in labels]
annotation["boxes"] = [self._get_bounding_box(box) for box in boxes]
# {"scores": [...], ...} --> [{"score":x, ...}, ...]
keys = ["score", "label", "box"]
annotation = [
dict(zip(keys, vals))
for vals in zip(annotation["scores"], annotation["labels"], annotation["boxes"])
]
annotations.append(annotation)
if not is_batched:
return annotations[0]
return annotations return super().__call__(*args, **kwargs)
def preprocess(self, image):
image = self.load_image(image)
target_size = torch.IntTensor([[image.height, image.width]])
inputs = self.feature_extractor(images=[image], return_tensors="pt")
inputs["target_size"] = target_size
return inputs
def _forward(self, model_inputs):
target_size = model_inputs.pop("target_size")
outputs = self.model(**model_inputs)
model_outputs = {"outputs": outputs, "target_size": target_size}
return model_outputs
def postprocess(self, model_outputs, threshold=0.9):
raw_annotations = self.feature_extractor.post_process(model_outputs["outputs"], model_outputs["target_size"])
raw_annotation = raw_annotations[0]
keep = raw_annotation["scores"] > threshold
scores = raw_annotation["scores"][keep]
labels = raw_annotation["labels"][keep]
boxes = raw_annotation["boxes"][keep]
raw_annotation["scores"] = scores.tolist()
raw_annotation["labels"] = [self.model.config.id2label[label.item()] for label in labels]
raw_annotation["boxes"] = [self._get_bounding_box(box) for box in boxes]
# {"scores": [...], ...} --> [{"score":x, ...}, ...]
keys = ["score", "label", "box"]
annotation = [
dict(zip(keys, vals))
for vals in zip(raw_annotation["scores"], raw_annotation["labels"], raw_annotation["boxes"])
]
return annotation
def _get_bounding_box(self, box: "torch.Tensor") -> Dict[str, int]:
"""
...
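A usage sketch of the object-detection pipeline after the split into `preprocess`, `_forward` and `postprocess` (editor's example, not part of the diff; the checkpoint and image URL are assumptions). The detection `threshold` is consumed by `postprocess`:

.. code-block::

    from transformers import pipeline

    detector = pipeline("object-detection", model="facebook/detr-resnet-50")  # example checkpoint
    url = "http://images.cocodataset.org/val2017/000000039769.jpg"  # example image

    # Each prediction contains a label, a score and a box in the image's original size.
    for prediction in detector(url, threshold=0.9):
        box = prediction["box"]
        print(prediction["label"], prediction["score"], box["xmin"], box["ymin"], box["xmax"], box["ymax"])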
@@ -17,7 +17,7 @@ class TableQuestionAnsweringArgumentHandler(ArgumentHandler):
Handles arguments for the TableQuestionAnsweringPipeline
"""
def __call__(self, table=None, query=None, sequential=False, padding=True, truncation=True): def __call__(self, table=None, query=None, **kwargs):
# Returns tqa_pipeline_inputs of shape: # Returns tqa_pipeline_inputs of shape:
# [ # [
# {"table": pd.DataFrame, "query": List[str]}, # {"table": pd.DataFrame, "query": List[str]},
@@ -60,7 +60,7 @@ class TableQuestionAnsweringArgumentHandler(ArgumentHandler):
tqa_pipeline_input["table"] = pd.DataFrame(tqa_pipeline_input["table"])
return tqa_pipeline_inputs, sequential, padding, truncation return tqa_pipeline_inputs
@add_end_docstrings(PIPELINE_INIT_ARGS)
@@ -235,52 +235,76 @@ class TableQuestionAnsweringPipeline(Pipeline):
- **cells** (:obj:`List[str]`) -- List of strings made up of the answer cell values.
- **aggregator** (:obj:`str`) -- If the model has an aggregator, this returns the aggregator.
"""
pipeline_inputs, sequential, padding, truncation = self._args_parser(*args, **kwargs) pipeline_inputs = self._args_parser(*args, **kwargs)
batched_answers = []
for pipeline_input in pipeline_inputs: results = super().__call__(pipeline_inputs, **kwargs)
table, query = pipeline_input["table"], pipeline_input["query"] if len(results) == 1:
if table.empty: return results[0]
raise ValueError("table is empty") return results
if not query:
raise ValueError("query is empty") def _sanitize_parameters(self, sequential=None, padding=None, truncation=None, **kwargs):
inputs = self.tokenizer( preprocess_params = {}
table, query, return_tensors=self.framework, truncation="drop_rows_to_fit", padding=padding if padding is not None:
) preprocess_params["padding"] = padding
if truncation is not None:
outputs = self.sequential_inference(**inputs) if sequential else self.batch_inference(**inputs) preprocess_params["truncation"] = truncation
if self.aggregate: forward_params = {}
logits, logits_agg = outputs[:2] if sequential is not None:
predictions = self.tokenizer.convert_logits_to_predictions(inputs, logits.detach(), logits_agg) forward_params["sequential"] = sequential
answer_coordinates_batch, agg_predictions = predictions return preprocess_params, forward_params, {}
aggregators = {i: self.model.config.aggregation_labels[pred] for i, pred in enumerate(agg_predictions)}
def preprocess(self, pipeline_input, sequential=None, padding=True, truncation="drop_rows_to_fit"):
no_agg_label_index = self.model.config.no_aggregation_label_index table, query = pipeline_input["table"], pipeline_input["query"]
aggregators_prefix = { if table.empty:
i: aggregators[i] + " > " for i, pred in enumerate(agg_predictions) if pred != no_agg_label_index raise ValueError("table is empty")
} if query is None or query == "":
else: raise ValueError("query is empty")
logits = outputs[0] inputs = self.tokenizer(table, query, return_tensors=self.framework, truncation=truncation, padding=padding)
predictions = self.tokenizer.convert_logits_to_predictions(inputs, logits.detach()) inputs["table"] = table
answer_coordinates_batch = predictions[0] return inputs
aggregators = {}
aggregators_prefix = {} def _forward(self, model_inputs, sequential=False):
table = model_inputs.pop("table")
answers = [] outputs = self.sequential_inference(**model_inputs) if sequential else self.batch_inference(**model_inputs)
for index, coordinates in enumerate(answer_coordinates_batch): model_outputs = {"model_inputs": model_inputs, "table": table, "outputs": outputs}
cells = [table.iat[coordinate] for coordinate in coordinates] return model_outputs
aggregator = aggregators.get(index, "")
aggregator_prefix = aggregators_prefix.get(index, "") def postprocess(self, model_outputs):
answer = { inputs = model_outputs["model_inputs"]
"answer": aggregator_prefix + ", ".join(cells), table = model_outputs["table"]
"coordinates": coordinates, outputs = model_outputs["outputs"]
"cells": [table.iat[coordinate] for coordinate in coordinates], if self.aggregate:
} logits, logits_agg = outputs[:2]
if aggregator: predictions = self.tokenizer.convert_logits_to_predictions(inputs, logits.detach(), logits_agg)
answer["aggregator"] = aggregator answer_coordinates_batch, agg_predictions = predictions
aggregators = {i: self.model.config.aggregation_labels[pred] for i, pred in enumerate(agg_predictions)}
answers.append(answer)
if len(answer) == 0: no_agg_label_index = self.model.config.no_aggregation_label_index
raise PipelineException("Empty answer") aggregators_prefix = {
batched_answers.append(answers if len(answers) > 1 else answers[0]) i: aggregators[i] + " > " for i, pred in enumerate(agg_predictions) if pred != no_agg_label_index
return batched_answers if len(batched_answers) > 1 else batched_answers[0] }
else:
logits = outputs[0]
predictions = self.tokenizer.convert_logits_to_predictions(inputs, logits.detach())
answer_coordinates_batch = predictions[0]
aggregators = {}
aggregators_prefix = {}
answers = []
for index, coordinates in enumerate(answer_coordinates_batch):
cells = [table.iat[coordinate] for coordinate in coordinates]
aggregator = aggregators.get(index, "")
aggregator_prefix = aggregators_prefix.get(index, "")
answer = {
"answer": aggregator_prefix + ", ".join(cells),
"coordinates": coordinates,
"cells": [table.iat[coordinate] for coordinate in coordinates],
}
if aggregator:
answer["aggregator"] = aggregator
answers.append(answer)
if len(answer) == 0:
raise PipelineException("Empty answer")
return answers if len(answers) > 1 else answers[0]
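A sketch of calling the table-question-answering pipeline after the rework (editor's example, not part of the diff; the checkpoint is an assumption). `padding`/`truncation` now reach `preprocess` and `sequential` reaches `_forward` via `_sanitize_parameters`:

.. code-block::

    from transformers import pipeline

    tqa = pipeline("table-question-answering", model="google/tapas-base-finetuned-wtq")  # example checkpoint
    table = {
        "Repository": ["Transformers", "Datasets", "Tokenizers"],
        "Stars": ["36542", "4512", "3934"],
    }

    # A single (table, query) pair returns one answer dict; a list of queries returns a list.
    result = tqa(table=table, query="How many stars does the Transformers repository have?")
    print(result["answer"])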
from typing import Optional import enum
from ..file_utils import add_end_docstrings, is_tf_available, is_torch_available from ..file_utils import add_end_docstrings, is_tf_available, is_torch_available
from ..tokenization_utils import TruncationStrategy from ..tokenization_utils import TruncationStrategy
@@ -17,6 +17,11 @@ if is_torch_available():
logger = logging.get_logger(__name__)
class ReturnType(enum.Enum):
TENSORS = 0
TEXT = 1
@add_end_docstrings(PIPELINE_INIT_ARGS) @add_end_docstrings(PIPELINE_INIT_ARGS)
class Text2TextGenerationPipeline(Pipeline): class Text2TextGenerationPipeline(Pipeline):
""" """
@@ -46,6 +51,32 @@ class Text2TextGenerationPipeline(Pipeline):
else MODEL_FOR_SEQ_TO_SEQ_CAUSAL_LM_MAPPING
)
def _sanitize_parameters(
self,
return_tensors=None,
return_text=None,
return_type=None,
clean_up_tokenization_spaces=None,
truncation=None,
**generate_kwargs
):
preprocess_params = {}
if truncation is not None:
preprocess_params["truncation"] = truncation
forward_params = generate_kwargs
postprocess_params = {}
if return_tensors is not None and return_type is None:
return_type = ReturnType.TENSORS if return_tensors else ReturnType.TEXT
if return_type is not None:
postprocess_params["return_type"] = return_type
if clean_up_tokenization_spaces is not None:
postprocess_params["clean_up_tokenization_spaces"] = clean_up_tokenization_spaces
return preprocess_params, forward_params, postprocess_params
def check_inputs(self, input_length: int, min_length: int, max_length: int): def check_inputs(self, input_length: int, min_length: int, max_length: int):
""" """
Checks whether there might be something wrong with given input with regard to the model. Checks whether there might be something wrong with given input with regard to the model.
@@ -55,9 +86,8 @@ class Text2TextGenerationPipeline(Pipeline):
def _parse_and_tokenize(self, *args, truncation):
prefix = self.model.config.prefix if self.model.config.prefix is not None else ""
if isinstance(args[0], list):
assert ( if self.tokenizer.pad_token_id is None:
self.tokenizer.pad_token_id is not None raise ValueError("Please make sure that the tokenizer has a pad_token_id when using a batch input")
), "Please make sure that the tokenizer has a pad_token_id when using a batch input"
args = ([prefix + arg for arg in args[0]],) args = ([prefix + arg for arg in args[0]],)
padding = True padding = True
@@ -68,21 +98,13 @@ class Text2TextGenerationPipeline(Pipeline):
raise ValueError(
f" `args[0]`: {args[0]} has the wrong format. It should be either of type `str` or type `list`"
)
inputs = super()._parse_and_tokenize(*args, padding=padding, truncation=truncation) inputs = self.tokenizer(*args, padding=padding, truncation=truncation, return_tensors=self.framework)
# This is produced by tokenizers but is an invalid generate kwargs # This is produced by tokenizers but is an invalid generate kwargs
if "token_type_ids" in inputs: if "token_type_ids" in inputs:
del inputs["token_type_ids"] del inputs["token_type_ids"]
return inputs return inputs
def __call__( def __call__(self, *args, **kwargs):
self,
*args,
return_tensors=False,
return_text=True,
clean_up_tokenization_spaces=False,
truncation=TruncationStrategy.DO_NOT_TRUNCATE,
**generate_kwargs
):
r""" r"""
Generate the output text(s) using text(s) given as inputs. Generate the output text(s) using text(s) given as inputs.
@@ -111,43 +133,40 @@ class Text2TextGenerationPipeline(Pipeline):
-- The token ids of the generated text.
"""
assert return_tensors or return_text, "You must specify return_tensors=True or return_text=True" result = super().__call__(*args, **kwargs)
if isinstance(result, dict):
return [result]
return result
with self.device_placement(): def preprocess(self, inputs, truncation=TruncationStrategy.DO_NOT_TRUNCATE, **kwargs):
inputs = self._parse_and_tokenize(*args, truncation=truncation) inputs = self._parse_and_tokenize(inputs, truncation=truncation, **kwargs)
return self._generate(inputs, return_tensors, return_text, clean_up_tokenization_spaces, generate_kwargs) return inputs
def _generate( def _forward(self, model_inputs, **generate_kwargs):
self, inputs, return_tensors: bool, return_text: bool, clean_up_tokenization_spaces: bool, generate_kwargs
):
if self.framework == "pt": if self.framework == "pt":
inputs = self.ensure_tensor_on_device(**inputs) input_length = model_inputs["input_ids"].shape[-1]
input_length = inputs["input_ids"].shape[-1]
elif self.framework == "tf": elif self.framework == "tf":
input_length = tf.shape(inputs["input_ids"])[-1].numpy() input_length = tf.shape(model_inputs["input_ids"])[-1].numpy()
min_length = generate_kwargs.get("min_length", self.model.config.min_length) generate_kwargs["min_length"] = generate_kwargs.get("min_length", self.model.config.min_length)
max_length = generate_kwargs.get("max_length", self.model.config.max_length) generate_kwargs["max_length"] = generate_kwargs.get("max_length", self.model.config.max_length)
self.check_inputs(input_length, min_length, max_length) self.check_inputs(input_length, generate_kwargs["min_length"], generate_kwargs["max_length"])
output_ids = self.model.generate(**model_inputs, **generate_kwargs)
generate_kwargs.update(inputs) return {"output_ids": output_ids}
generations = self.model.generate( def postprocess(self, model_outputs, return_type=ReturnType.TEXT, clean_up_tokenization_spaces=False):
**generate_kwargs, record = {}
) if return_type == ReturnType.TENSORS:
results = [] record = {f"{self.return_name}_token_ids": model_outputs}
for generation in generations: elif return_type == ReturnType.TEXT:
record = {} record = {
if return_tensors: f"{self.return_name}_text": self.tokenizer.decode(
record[f"{self.return_name}_token_ids"] = generation model_outputs["output_ids"][0],
if return_text:
record[f"{self.return_name}_text"] = self.tokenizer.decode(
generation,
skip_special_tokens=True, skip_special_tokens=True,
clean_up_tokenization_spaces=clean_up_tokenization_spaces, clean_up_tokenization_spaces=clean_up_tokenization_spaces,
) )
results.append(record) }
return results return record
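A brief sketch of the reworked text2text pipeline (editor's example, not part of the diff; `t5-small` is only an assumed checkpoint). Remaining keyword arguments become `generate_kwargs` for `_forward`, while `return_tensors`/`return_text` map onto `ReturnType` for `postprocess`:

.. code-block::

    from transformers import pipeline

    text2text = pipeline("text2text-generation", model="t5-small")  # example checkpoint

    # `max_length` is a generate kwarg and therefore ends up in `forward_params`.
    outputs = text2text("translate English to German: The table is round.", max_length=40)
    print(outputs[0]["generated_text"])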
@add_end_docstrings(PIPELINE_INIT_ARGS) @add_end_docstrings(PIPELINE_INIT_ARGS)
@@ -239,23 +258,6 @@ class TranslationPipeline(Text2TextGenerationPipeline):
# Used in the return key of the pipeline.
return_name = "translation"
src_lang: Optional[str] = None
tgt_lang: Optional[str] = None
def __init__(self, *args, src_lang=None, tgt_lang=None, **kwargs):
super().__init__(*args, **kwargs)
if src_lang is not None:
self.src_lang = src_lang
if tgt_lang is not None:
self.tgt_lang = tgt_lang
if src_lang is None and tgt_lang is None:
# Backward compatibility, direct arguments use is preferred.
task = kwargs.get("task", "")
items = task.split("_")
if task and len(items) == 4:
# translation, XX, to YY
self.src_lang = items[1]
self.tgt_lang = items[3]
def check_inputs(self, input_length: int, min_length: int, max_length: int): def check_inputs(self, input_length: int, min_length: int, max_length: int):
if input_length > 0.9 * max_length: if input_length > 0.9 * max_length:
@@ -265,25 +267,31 @@ class TranslationPipeline(Text2TextGenerationPipeline):
)
return True
def _parse_and_tokenize(self, *args, src_lang, tgt_lang, truncation): def preprocess(self, *args, truncation=TruncationStrategy.DO_NOT_TRUNCATE, src_lang=None, tgt_lang=None):
if getattr(self.tokenizer, "_build_translation_inputs", None): if getattr(self.tokenizer, "_build_translation_inputs", None):
return self.tokenizer._build_translation_inputs( return self.tokenizer._build_translation_inputs(
*args, return_tensors=self.framework, src_lang=src_lang, tgt_lang=tgt_lang, truncation=truncation *args, return_tensors=self.framework, truncation=truncation, src_lang=src_lang, tgt_lang=tgt_lang
) )
else: else:
return super()._parse_and_tokenize(*args, truncation=truncation) return super()._parse_and_tokenize(*args, truncation=truncation)
def __call__( def _sanitize_parameters(self, src_lang=None, tgt_lang=None, **kwargs):
self, preprocess_params, forward_params, postprocess_params = super()._sanitize_parameters(**kwargs)
*args, if src_lang is not None:
return_tensors=False, preprocess_params["src_lang"] = src_lang
return_text=True, if tgt_lang is not None:
clean_up_tokenization_spaces=False, preprocess_params["tgt_lang"] = tgt_lang
truncation=TruncationStrategy.DO_NOT_TRUNCATE, if src_lang is None and tgt_lang is None:
src_lang=None, # Backward compatibility, direct arguments use is preferred.
tgt_lang=None, task = kwargs.get("task", self.task)
**generate_kwargs items = task.split("_")
): if task and len(items) == 4:
# translation, XX, to YY
preprocess_params["src_lang"] = items[1]
preprocess_params["tgt_lang"] = items[3]
return preprocess_params, forward_params, postprocess_params
def __call__(self, *args, **kwargs):
r""" r"""
Translate the text(s) given as inputs. Translate the text(s) given as inputs.
@@ -313,10 +321,4 @@ class TranslationPipeline(Text2TextGenerationPipeline):
- **translation_token_ids** (:obj:`torch.Tensor` or :obj:`tf.Tensor`, present when ``return_tensors=True``)
-- The token ids of the translation.
"""
assert return_tensors or return_text, "You must specify return_tensors=True or return_text=True" return super().__call__(*args, **kwargs)
src_lang = src_lang if src_lang is not None else self.src_lang
tgt_lang = tgt_lang if tgt_lang is not None else self.tgt_lang
with self.device_placement():
inputs = self._parse_and_tokenize(*args, truncation=truncation, src_lang=src_lang, tgt_lang=tgt_lang)
return self._generate(inputs, return_tensors, return_text, clean_up_tokenization_spaces, generate_kwargs)
from typing import Optional from typing import Dict
import numpy as np import numpy as np
from ..file_utils import ExplicitEnum, add_end_docstrings, is_tf_available, is_torch_available from ..file_utils import ExplicitEnum, add_end_docstrings, is_tf_available, is_torch_available
from .base import PIPELINE_INIT_ARGS, Pipeline from .base import PIPELINE_INIT_ARGS, GenericTensor, Pipeline
if is_tf_available(): if is_tf_available():
@@ -61,9 +61,10 @@ class TextClassificationPipeline(Pipeline):
<https://huggingface.co/models?filter=text-classification>`__.
"""
task = "text-classification" return_all_scores = False
function_to_apply = ClassificationFunction.NONE
def __init__(self, return_all_scores: bool = None, function_to_apply: str = None, **kwargs): def __init__(self, **kwargs):
super().__init__(**kwargs) super().__init__(**kwargs)
self.check_model_type( self.check_model_type(
@@ -72,22 +73,24 @@ class TextClassificationPipeline(Pipeline):
else MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING
)
def _sanitize_parameters(self, return_all_scores=None, function_to_apply=None, **tokenizer_kwargs):
preprocess_params = tokenizer_kwargs
postprocess_params = {}
if hasattr(self.model.config, "return_all_scores") and return_all_scores is None: if hasattr(self.model.config, "return_all_scores") and return_all_scores is None:
return_all_scores = self.model.config.return_all_scores return_all_scores = self.model.config.return_all_scores
if hasattr(self.model.config, "function_to_apply") and function_to_apply is None: if return_all_scores is not None:
function_to_apply = self.model.config.function_to_apply postprocess_params["return_all_scores"] = return_all_scores
if isinstance(function_to_apply, str):
function_to_apply = ClassificationFunction[function_to_apply.upper()]
self.return_all_scores = return_all_scores if return_all_scores is not None else False if function_to_apply is not None:
self.function_to_apply = function_to_apply if function_to_apply is not None else None postprocess_params["function_to_apply"] = function_to_apply
return preprocess_params, {}, postprocess_params
def __call__( def __call__(self, *args, **kwargs):
self,
*args,
return_all_scores: Optional[bool] = None,
function_to_apply: Optional[ClassificationFunction] = None,
**kwargs
):
""" """
Classify the text(s) given as inputs. Classify the text(s) given as inputs.
@@ -120,19 +123,32 @@ class TextClassificationPipeline(Pipeline):
If ``self.return_all_scores=True``, one such dictionary is returned per label.
"""
outputs = super().__call__(*args, **kwargs) return super().__call__(*args, **kwargs)
def preprocess(self, inputs, **tokenizer_kwargs) -> Dict[str, GenericTensor]:
return_tensors = self.framework
return self.tokenizer(inputs, return_tensors=return_tensors, **tokenizer_kwargs)
return_all_scores = return_all_scores if return_all_scores is not None else self.return_all_scores def _forward(self, model_inputs):
function_to_apply = function_to_apply if function_to_apply is not None else self.function_to_apply return self.model(**model_inputs)
def postprocess(self, model_outputs, function_to_apply=None, return_all_scores=False):
# Default value before `set_parameters`
if function_to_apply is None: if function_to_apply is None:
if self.model.config.problem_type == "multi_label_classification" or self.model.config.num_labels == 1: if self.model.config.problem_type == "multi_label_classification" or self.model.config.num_labels == 1:
function_to_apply = ClassificationFunction.SIGMOID function_to_apply = ClassificationFunction.SIGMOID
elif self.model.config.problem_type == "single_label_classification" or self.model.config.num_labels > 1: elif self.model.config.problem_type == "single_label_classification" or self.model.config.num_labels > 1:
function_to_apply = ClassificationFunction.SOFTMAX function_to_apply = ClassificationFunction.SOFTMAX
elif hasattr(self.model.config, "function_to_apply") and function_to_apply is None:
if isinstance(function_to_apply, str): function_to_apply = self.model.config.function_to_apply
function_to_apply = ClassificationFunction[function_to_apply.upper()] else:
function_to_apply = ClassificationFunction.NONE
outputs = model_outputs["logits"][0]
if self.framework == "pt":
outputs = outputs.cpu().numpy()
else:
outputs = outputs.numpy()
if function_to_apply == ClassificationFunction.SIGMOID: if function_to_apply == ClassificationFunction.SIGMOID:
scores = sigmoid(outputs) scores = sigmoid(outputs)
@@ -144,11 +160,13 @@ class TextClassificationPipeline(Pipeline):
raise ValueError(f"Unrecognized `function_to_apply` argument: {function_to_apply}")
if return_all_scores:
return [ return [{"label": self.model.config.id2label[i], "score": score.item()} for i, score in enumerate(scores)]
[{"label": self.model.config.id2label[i], "score": score.item()} for i, score in enumerate(item)]
for item in scores
]
else: else:
return [ return {"label": self.model.config.id2label[scores.argmax().item()], "score": scores.max().item()}
{"label": self.model.config.id2label[item.argmax()], "score": item.max().item()} for item in scores
] def run_multi(self, inputs, preprocess_params, forward_params, postprocess_params):
return [self.run_single(item, preprocess_params, forward_params, postprocess_params)[0] for item in inputs]
def run_single(self, inputs, preprocess_params, forward_params, postprocess_params):
"This pipeline is odd, and return a list when single item is run"
return [super().run_single(inputs, preprocess_params, forward_params, postprocess_params)]
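A usage sketch for the text classification pipeline after `return_all_scores` and `function_to_apply` stopped being constructor-only state (editor's example, not part of the diff; it assumes the task's default sentiment checkpoint):

.. code-block::

    from transformers import pipeline

    classifier = pipeline("sentiment-analysis")  # resolves to the default text-classification checkpoint

    # A single string still comes back as a one-element list, as `run_single` above documents.
    print(classifier("This rework makes pipeline parameters much easier to follow."))

    # `return_all_scores` is now a call-time keyword routed to `postprocess`.
    print(classifier("Not sure how I feel about this.", return_all_scores=True))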
import enum
from transformers import MODEL_FOR_CAUSAL_LM_MAPPING, TF_MODEL_FOR_CAUSAL_LM_MAPPING from transformers import MODEL_FOR_CAUSAL_LM_MAPPING, TF_MODEL_FOR_CAUSAL_LM_MAPPING
from ..file_utils import add_end_docstrings from ..file_utils import add_end_docstrings
from .base import PIPELINE_INIT_ARGS, Pipeline from .base import PIPELINE_INIT_ARGS, Pipeline
class ReturnType(enum.Enum):
TENSORS = 0
NEW_TEXT = 1
FULL_TEXT = 2
@add_end_docstrings(PIPELINE_INIT_ARGS) @add_end_docstrings(PIPELINE_INIT_ARGS)
class TextGenerationPipeline(Pipeline): class TextGenerationPipeline(Pipeline):
""" """
@@ -32,29 +40,72 @@ class TextGenerationPipeline(Pipeline):
begging for his blessing. <eod> </s> <eos>
"""
ALLOWED_MODELS = [ def __init__(self, *args, **kwargs):
"XLNetLMHeadModel",
"TransfoXLLMHeadModel",
"ReformerModelWithLMHead",
"GPT2LMHeadModel",
"GPTJForCausalLM",
"GPTNeoForCausalLM",
"OpenAIGPTLMHeadModel",
"CTRLLMHeadModel",
"TFXLNetLMHeadModel",
"TFTransfoXLLMHeadModel",
"TFGPT2LMHeadModel",
"TFOpenAIGPTLMHeadModel",
"TFCTRLLMHeadModel",
]
def __init__(self, *args, return_full_text=True, **kwargs):
super().__init__(*args, **kwargs) super().__init__(*args, **kwargs)
self.check_model_type( self.check_model_type(
TF_MODEL_FOR_CAUSAL_LM_MAPPING if self.framework == "tf" else MODEL_FOR_CAUSAL_LM_MAPPING TF_MODEL_FOR_CAUSAL_LM_MAPPING if self.framework == "tf" else MODEL_FOR_CAUSAL_LM_MAPPING
) )
if "prefix" not in self._preprocess_params:
self.return_full_text = return_full_text # This is very specific. The logic is quite complex and needs to be done
# as a "default".
# It also defines both some preprocess_kwargs and generate_kwargs
# which is why we cannot put them in their respective methods.
prefix = None
if self.model.config.prefix is not None:
prefix = self.model.config.prefix
if prefix is None and self.model.__class__.__name__ in [
"XLNetLMHeadModel",
"TransfoXLLMHeadModel",
"TFXLNetLMHeadModel",
"TFTransfoXLLMHeadModel",
]:
# For XLNet and TransformerXL we add an article to the prompt to give more state to the model.
prefix = self.XL_PREFIX
if prefix is not None:
# Recalculate some generate_kwargs linked to prefix.
preprocess_params, forward_params, _ = self._sanitize_parameters(prefix=prefix, **self._forward_params)
self._preprocess_params = {**self._preprocess_params, **preprocess_params}
self._forward_params = {**self._forward_params, **forward_params}
def _sanitize_parameters(
self,
return_full_text=None,
return_tensors=None,
return_text=None,
return_type=None,
clean_up_tokenization_spaces=None,
prefix=None,
**generate_kwargs
):
preprocess_params = {}
if prefix is not None:
preprocess_params["prefix"] = prefix
if prefix:
prefix_inputs = self.tokenizer(
prefix, padding=False, add_special_tokens=False, return_tensors=self.framework
)
prefix_length = prefix_inputs["input_ids"].shape[-1]
if "max_length" in generate_kwargs:
generate_kwargs["max_length"] += prefix_length
else:
generate_kwargs["max_length"] = self.model.config.max_length + prefix_length
if "min_length" in generate_kwargs:
generate_kwargs["min_length"] += prefix_length
forward_params = generate_kwargs
postprocess_params = {}
if return_full_text is not None and return_type is None:
return_type = ReturnType.FULL_TEXT if return_full_text else ReturnType.NEW_TEXT
if return_tensors is not None and return_type is None:
return_type = ReturnType.TENSORS
if return_type is not None:
postprocess_params["return_type"] = return_type
if clean_up_tokenization_spaces is not None:
postprocess_params["clean_up_tokenization_spaces"] = clean_up_tokenization_spaces
return preprocess_params, forward_params, postprocess_params
# overriding _parse_and_tokenize to allow for unusual language-modeling tokenizer arguments # overriding _parse_and_tokenize to allow for unusual language-modeling tokenizer arguments
def _parse_and_tokenize(self, *args, **kwargs): def _parse_and_tokenize(self, *args, **kwargs):
@@ -67,16 +118,7 @@ class TextGenerationPipeline(Pipeline):
return super()._parse_and_tokenize(*args, **kwargs)
def __call__( def __call__(self, text_inputs, **kwargs):
self,
text_inputs,
return_tensors=False,
return_text=True,
return_full_text=None,
clean_up_tokenization_spaces=False,
prefix=None,
**generate_kwargs
):
""" """
Complete the prompt(s) given as inputs. Complete the prompt(s) given as inputs.
@@ -105,95 +147,58 @@ class TextGenerationPipeline(Pipeline):
- **generated_token_ids** (:obj:`torch.Tensor` or :obj:`tf.Tensor`, present when ``return_tensors=True``)
-- The token ids of the generated text.
"""
prefix = prefix if prefix is not None else self.model.config.prefix return super().__call__(text_inputs, **kwargs)
return_full_text = return_full_text if return_full_text is not None else self.return_full_text
if isinstance(text_inputs, str):
text_inputs = [text_inputs]
results = []
for prompt_text in text_inputs:
# Manage correct placement of the tensors
with self.device_placement():
if prefix is None and self.model.__class__.__name__ in [
"XLNetLMHeadModel",
"TransfoXLLMHeadModel",
"TFXLNetLMHeadModel",
"TFTransfoXLLMHeadModel",
]:
# For XLNet and TransformerXL we add an article to the prompt to give more state to the model.
prefix = self.XL_PREFIX
if prefix:
prefix_inputs = self._parse_and_tokenize(prefix, padding=False, add_special_tokens=False)
# This impacts max_length and min_length argument that need adjusting.
prefix_length = prefix_inputs["input_ids"].shape[-1]
if generate_kwargs.get("max_length", None) is not None:
generate_kwargs["max_length"] += prefix_length
else:
generate_kwargs["max_length"] = self.model.config.max_length + prefix_length
if generate_kwargs.get("min_length", None) is not None:
generate_kwargs["min_length"] += prefix_length
prefix = prefix or ""
inputs = self._parse_and_tokenize(prefix + prompt_text, padding=False, add_special_tokens=False)
# set input_ids to None to allow empty prompt
if inputs["input_ids"].shape[-1] == 0:
inputs["input_ids"] = None
inputs["attention_mask"] = None
if self.framework == "pt" and inputs["input_ids"] is not None:
inputs = self.ensure_tensor_on_device(**inputs)
input_ids = inputs["input_ids"]
# Ensure that batch size = 1 (batch generation not allowed for now)
assert (
input_ids is None or input_ids.shape[0] == 1
), "Batch generation is currently not supported. See https://github.com/huggingface/transformers/issues/3021 for more information."
output_sequences = self.model.generate(input_ids=input_ids, **generate_kwargs) # BS x SL
result = []
for generated_sequence in output_sequences:
if self.framework == "pt" and generated_sequence is not None:
generated_sequence = generated_sequence.cpu()
generated_sequence = generated_sequence.numpy().tolist()
record = {}
if return_tensors:
record["generated_token_ids"] = generated_sequence
if return_text:
# Decode text
text = self.tokenizer.decode(
generated_sequence,
skip_special_tokens=True,
clean_up_tokenization_spaces=clean_up_tokenization_spaces,
)
# Remove PADDING prompt of the sequence if XLNet or Transfo-XL model is used def preprocess(self, prompt_text, prefix=""):
if input_ids is None: inputs = self.tokenizer(
prompt_length = 0 prefix + prompt_text, padding=False, add_special_tokens=False, return_tensors=self.framework
else: )
prompt_length = len( inputs["prompt_text"] = prompt_text
self.tokenizer.decode( return inputs
input_ids[0],
skip_special_tokens=True, def _forward(self, model_inputs, **generate_kwargs):
clean_up_tokenization_spaces=clean_up_tokenization_spaces, input_ids = model_inputs["input_ids"]
) prompt_text = model_inputs.pop("prompt_text")
generated_sequence = self.model.generate(input_ids=input_ids, **generate_kwargs) # BS x SL
return {"generated_sequence": generated_sequence, "input_ids": input_ids, "prompt_text": prompt_text}
def postprocess(self, model_outputs, return_type=ReturnType.FULL_TEXT, clean_up_tokenization_spaces=True):
generated_sequence = model_outputs["generated_sequence"]
input_ids = model_outputs["input_ids"]
prompt_text = model_outputs["prompt_text"]
if self.framework == "pt" and generated_sequence is not None:
generated_sequence = generated_sequence.cpu()
generated_sequence = generated_sequence.numpy().tolist()
if return_type == ReturnType.TENSORS:
record = {"generated_token_ids": generated_sequence}
elif return_type in {ReturnType.NEW_TEXT, ReturnType.FULL_TEXT}:
# Decode text
record = []
for sequence in generated_sequence:
text = self.tokenizer.decode(
sequence,
skip_special_tokens=True,
clean_up_tokenization_spaces=clean_up_tokenization_spaces,
)
# Remove PADDING prompt of the sequence if XLNet or Transfo-XL model is used
if input_ids is None:
prompt_length = 0
else:
prompt_length = len(
self.tokenizer.decode(
input_ids[0],
skip_special_tokens=True,
clean_up_tokenization_spaces=clean_up_tokenization_spaces,
) )
)
if return_full_text: if return_type == ReturnType.FULL_TEXT:
all_text = prompt_text + text[prompt_length:] all_text = prompt_text + text[prompt_length:]
else: else:
all_text = text[prompt_length:] all_text = text[prompt_length:]
record["generated_text"] = all_text
result.append(record)
results += [result]
if len(results) == 1: item = {"generated_text": all_text}
return results[0] record.append(item)
return results return record
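A sketch of the text generation pipeline after prefix handling moved into `__init__`/`_sanitize_parameters` (editor's example, not part of the diff; `gpt2` is only an assumed checkpoint):

.. code-block::

    from transformers import pipeline

    generator = pipeline("text-generation", model="gpt2")  # example checkpoint

    # Generate kwargs reach `_forward`; `return_full_text=False` selects `ReturnType.NEW_TEXT`.
    outputs = generator(
        "In this pull request, pipelines were reworked so that",
        max_length=40,
        do_sample=False,
        return_full_text=False,
    )
    print(outputs[0]["generated_text"])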
import warnings import warnings
from typing import TYPE_CHECKING, List, Optional, Tuple, Union from typing import List, Optional, Tuple, Union
import numpy as np import numpy as np
from ..file_utils import ExplicitEnum, add_end_docstrings, is_tf_available, is_torch_available from ..file_utils import ExplicitEnum, add_end_docstrings, is_tf_available, is_torch_available
from ..modelcard import ModelCard
from ..models.bert.tokenization_bert import BasicTokenizer from ..models.bert.tokenization_bert import BasicTokenizer
from ..tokenization_utils import PreTrainedTokenizer
from .base import PIPELINE_INIT_ARGS, ArgumentHandler, Pipeline from .base import PIPELINE_INIT_ARGS, ArgumentHandler, Pipeline
if TYPE_CHECKING:
from ..modeling_tf_utils import TFPreTrainedModel
from ..modeling_utils import PreTrainedModel
if is_tf_available(): if is_tf_available():
from ..models.auto.modeling_tf_auto import TF_MODEL_FOR_TOKEN_CLASSIFICATION_MAPPING from ..models.auto.modeling_tf_auto import TF_MODEL_FOR_TOKEN_CLASSIFICATION_MAPPING
if is_torch_available(): if is_torch_available():
import torch
from ..models.auto.modeling_auto import MODEL_FOR_TOKEN_CLASSIFICATION_MAPPING from ..models.auto.modeling_auto import MODEL_FOR_TOKEN_CLASSIFICATION_MAPPING
@@ -104,31 +95,9 @@ class TokenClassificationPipeline(Pipeline):
default_input_names = "sequences"
def __init__( def __init__(self, args_parser=TokenClassificationArgumentHandler(), *args, **kwargs):
self, self.ignore_labels = ["O"]
model: Union["PreTrainedModel", "TFPreTrainedModel"], super().__init__(*args, **kwargs)
tokenizer: PreTrainedTokenizer,
modelcard: Optional[ModelCard] = None,
framework: Optional[str] = None,
args_parser: ArgumentHandler = TokenClassificationArgumentHandler(),
device: int = -1,
binary_output: bool = False,
ignore_labels=["O"],
task: str = "",
grouped_entities: Optional[bool] = None,
ignore_subwords: Optional[bool] = None,
aggregation_strategy: Optional[AggregationStrategy] = None,
):
super().__init__(
model=model,
tokenizer=tokenizer,
modelcard=modelcard,
framework=framework,
device=device,
binary_output=binary_output,
task=task,
)
self.check_model_type( self.check_model_type(
TF_MODEL_FOR_TOKEN_CLASSIFICATION_MAPPING TF_MODEL_FOR_TOKEN_CLASSIFICATION_MAPPING
if self.framework == "tf" if self.framework == "tf"
@@ -137,40 +106,49 @@ class TokenClassificationPipeline(Pipeline):
self._basic_tokenizer = BasicTokenizer(do_lower_case=False)
self._args_parser = args_parser
self.ignore_labels = ignore_labels
if aggregation_strategy is None: def _sanitize_parameters(
aggregation_strategy = AggregationStrategy.NONE self,
if grouped_entities is not None or ignore_subwords is not None: ignore_labels=None,
grouped_entities: Optional[bool] = None,
ignore_subwords: Optional[bool] = None,
aggregation_strategy: Optional[AggregationStrategy] = None,
):
if grouped_entities and ignore_subwords: postprocess_params = {}
aggregation_strategy = AggregationStrategy.FIRST if grouped_entities is not None or ignore_subwords is not None:
elif grouped_entities and not ignore_subwords: if grouped_entities and ignore_subwords:
aggregation_strategy = AggregationStrategy.SIMPLE aggregation_strategy = AggregationStrategy.FIRST
else: elif grouped_entities and not ignore_subwords:
aggregation_strategy = AggregationStrategy.NONE aggregation_strategy = AggregationStrategy.SIMPLE
else:
if grouped_entities is not None: aggregation_strategy = AggregationStrategy.NONE
warnings.warn(
f'`grouped_entities` is deprecated and will be removed in version v5.0.0, defaulted to `aggregation_strategy="{aggregation_strategy}"` instead.' if grouped_entities is not None:
) warnings.warn(
if ignore_subwords is not None: f'`grouped_entities` is deprecated and will be removed in version v5.0.0, defaulted to `aggregation_strategy="{aggregation_strategy}"` instead.'
warnings.warn( )
f'`ignore_subwords` is deprecated and will be removed in version v5.0.0, defaulted to `aggregation_strategy="{aggregation_strategy}"` instead.' if ignore_subwords is not None:
) warnings.warn(
if isinstance(aggregation_strategy, str): f'`ignore_subwords` is deprecated and will be removed in version v5.0.0, defaulted to `aggregation_strategy="{aggregation_strategy}"` instead.'
aggregation_strategy = AggregationStrategy[aggregation_strategy.upper()] )
if ( if aggregation_strategy is not None:
aggregation_strategy in {AggregationStrategy.FIRST, AggregationStrategy.MAX, AggregationStrategy.AVERAGE} if isinstance(aggregation_strategy, str):
and not self.tokenizer.is_fast aggregation_strategy = AggregationStrategy[aggregation_strategy.upper()]
): if (
raise ValueError( aggregation_strategy
"Slow tokenizers cannot handle subwords. Please set the `aggregation_strategy` option" in {AggregationStrategy.FIRST, AggregationStrategy.MAX, AggregationStrategy.AVERAGE}
'to `"simple"` or use a fast tokenizer.' and not self.tokenizer.is_fast
) ):
raise ValueError(
self.aggregation_strategy = aggregation_strategy "Slow tokenizers cannot handle subwords. Please set the `aggregation_strategy` option"
'to `"simple"` or use a fast tokenizer.'
)
postprocess_params["aggregation_strategy"] = aggregation_strategy
if ignore_labels is not None:
postprocess_params["ignore_labels"] = ignore_labels
return {}, {}, postprocess_params
def __call__(self, inputs: Union[str, List[str]], **kwargs): def __call__(self, inputs: Union[str, List[str]], **kwargs):
""" """
@@ -198,56 +176,65 @@ class TokenClassificationPipeline(Pipeline):
"""
_inputs, offset_mappings = self._args_parser(inputs, **kwargs)
self.offset_mappings = offset_mappings
return super().__call__(inputs, **kwargs)
def preprocess(self, sentence):
truncation = True if self.tokenizer.model_max_length and self.tokenizer.model_max_length > 0 else False
model_inputs = self.tokenizer(
sentence,
return_attention_mask=False,
return_tensors=self.framework,
truncation=truncation,
return_special_tokens_mask=True,
return_offsets_mapping=self.tokenizer.is_fast,
)
if self.offset_mappings:
offset_mapping = self.offset_mappings[0]
model_inputs["offset_mapping"] = offset_mapping
answers = [] model_inputs["sentence"] = sentence
for i, sentence in enumerate(_inputs):
# Manage correct placement of the tensors return model_inputs
with self.device_placement():
tokens = self.tokenizer( def _forward(self, model_inputs):
sentence, # Forward
return_attention_mask=False, special_tokens_mask = model_inputs.pop("special_tokens_mask")
return_tensors=self.framework, offset_mapping = model_inputs.pop("offset_mapping", None)
truncation=True, sentence = model_inputs.pop("sentence")
return_special_tokens_mask=True, if self.framework == "tf":
return_offsets_mapping=self.tokenizer.is_fast, outputs = self.model(model_inputs.data)[0][0].numpy()
) else:
if self.tokenizer.is_fast: outputs = self.model(**model_inputs)[0][0].numpy()
offset_mapping = tokens.pop("offset_mapping").cpu().numpy()[0] return {
elif offset_mappings: "outputs": outputs,
offset_mapping = offset_mappings[i] "special_tokens_mask": special_tokens_mask,
else: "offset_mapping": offset_mapping,
offset_mapping = None "sentence": sentence,
**model_inputs,
}
special_tokens_mask = tokens.pop("special_tokens_mask").cpu().numpy()[0] def postprocess(self, model_outputs, aggregation_strategy=AggregationStrategy.NONE):
outputs = model_outputs["outputs"]
sentence = model_outputs["sentence"]
input_ids = model_outputs["input_ids"][0]
offset_mapping = model_outputs["offset_mapping"][0] if model_outputs["offset_mapping"] is not None else None
special_tokens_mask = model_outputs["special_tokens_mask"][0].numpy()
# Forward scores = np.exp(outputs) / np.exp(outputs).sum(-1, keepdims=True)
if self.framework == "tf": pre_entities = self.gather_pre_entities(
entities = self.model(tokens.data)[0][0].numpy() sentence, input_ids, scores, offset_mapping, special_tokens_mask, aggregation_strategy
input_ids = tokens["input_ids"].numpy()[0] )
else: grouped_entities = self.aggregate(pre_entities, aggregation_strategy)
with torch.no_grad(): # Filter anything that is in self.ignore_labels
tokens = self.ensure_tensor_on_device(**tokens) entities = [
entities = self.model(**tokens)[0][0].cpu().numpy() entity
input_ids = tokens["input_ids"].cpu().numpy()[0] for entity in grouped_entities
if entity.get("entity", None) not in self.ignore_labels
scores = np.exp(entities) / np.exp(entities).sum(-1, keepdims=True) and entity.get("entity_group", None) not in self.ignore_labels
pre_entities = self.gather_pre_entities(sentence, input_ids, scores, offset_mapping, special_tokens_mask) ]
grouped_entities = self.aggregate(pre_entities, self.aggregation_strategy) return entities
# Filter anything that is in self.ignore_labels
entities = [
entity
for entity in grouped_entities
if entity.get("entity", None) not in self.ignore_labels
and entity.get("entity_group", None) not in self.ignore_labels
]
answers.append(entities)
if len(answers) == 1:
return answers[0]
return answers
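Continuing the sketch above, the value returned by :obj:`postprocess` is a plain list of dicts; the exact keys depend on the aggregation strategy (values below are illustrative):

.. code-block::

    result = token_classifier("My name is Wolfgang and I live in Berlin.")
    # With aggregation_strategy="simple", entries look roughly like:
    #   {"entity_group": "PER", "score": 0.999, "word": "Wolfgang", "start": 11, "end": 19}
    # With the default AggregationStrategy.NONE, per-token entries carry
    #   "entity", "score", "index", "word", "start" and "end" instead.
    print(result)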
    def gather_pre_entities(
        self,
... @@ -256,6 +243,7 @@ class TokenClassificationPipeline(Pipeline):
        scores: np.ndarray,
        offset_mapping: Optional[List[Tuple[int, int]]],
        special_tokens_mask: np.ndarray,
        aggregation_strategy: AggregationStrategy,
    ) -> List[dict]:
        """Fuse various numpy arrays into dicts with all the information needed for aggregation"""
        pre_entities = []
... @@ -269,6 +257,12 @@ class TokenClassificationPipeline(Pipeline):
            word = self.tokenizer.convert_ids_to_tokens(int(input_ids[idx]))
            if offset_mapping is not None:
                start_ind, end_ind = offset_mapping[idx]
                if self.framework == "pt":
                    start_ind = start_ind.item()
                    end_ind = end_ind.item()
                else:
                    start_ind = int(start_ind.numpy())
                    end_ind = int(end_ind.numpy())
                word_ref = sentence[start_ind:end_ind]
                if getattr(self.tokenizer._tokenizer.model, "continuing_subword_prefix", None):
                    # This is a BPE, word aware tokenizer, there is a correct way
... @@ -276,7 +270,7 @@ class TokenClassificationPipeline(Pipeline):
                    is_subword = len(word) != len(word_ref)
                else:
                    # This is a fallback heuristic. This will fail most likely on any kind of text + punctuation mixtures that will be considered "words". Non word aware models cannot do better than this unfortunately.
                    if aggregation_strategy in {
                        AggregationStrategy.FIRST,
                        AggregationStrategy.AVERAGE,
                        AggregationStrategy.MAX,
... @@ -362,10 +356,11 @@ class TokenClassificationPipeline(Pipeline):
        Example: micro|soft| com|pany| B-ENT I-NAME I-ENT I-ENT will be rewritten with first strategy as microsoft|
        company| B-ENT I-ENT
        """
        if aggregation_strategy in {
            AggregationStrategy.NONE,
            AggregationStrategy.SIMPLE,
        }:
            raise ValueError("NONE and SIMPLE strategies are invalid for word aggregation")

        word_entities = []
        word_group = None
...
... @@ -2,15 +2,12 @@ from typing import List, Union
import numpy as np

from ..file_utils import add_end_docstrings
from ..tokenization_utils import TruncationStrategy
from ..utils import logging
from .base import PIPELINE_INIT_ARGS, ArgumentHandler, Pipeline


logger = logging.get_logger(__name__)
... @@ -22,7 +19,7 @@ class ZeroShotClassificationArgumentHandler(ArgumentHandler):
    def _parse_labels(self, labels):
        if isinstance(labels, str):
            labels = [label.strip() for label in labels.split(",") if label.strip()]
        return labels

    def __call__(self, sequences, labels, hypothesis_template):
... @@ -38,13 +35,12 @@ class ZeroShotClassificationArgumentHandler(ArgumentHandler):
        if isinstance(sequences, str):
            sequences = [sequences]

        sequence_pairs = []
        for sequence in sequences:
            sequence_pairs.extend([[sequence, hypothesis_template.format(label)] for label in labels])

        return sequence_pairs, sequences
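As a worked example (illustrative strings), the argument handler expands every (sequence, label) pair into an NLI premise/hypothesis pair using :obj:`hypothesis_template`, and now also returns the original sequences:

.. code-block::

    handler = ZeroShotClassificationArgumentHandler()
    pairs, sequences = handler(
        "Who are you voting for in 2020?",
        labels=["politics", "economics"],
        hypothesis_template="This example is {}.",
    )
    # pairs     == [["Who are you voting for in 2020?", "This example is politics."],
    #               ["Who are you voting for in 2020?", "This example is economics."]]
    # sequences == ["Who are you voting for in 2020?"]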
@add_end_docstrings(PIPELINE_INIT_ARGS)
... @@ -66,8 +62,8 @@ class ZeroShotClassificationPipeline(Pipeline):
    """

    def __init__(self, args_parser=ZeroShotClassificationArgumentHandler(), *args, **kwargs):
        self._args_parser = args_parser
        super().__init__(*args, **kwargs)
        if self.entailment_id == -1:
            logger.warning(
                "Failed to determine 'entailment' label id from the label2id mapping in the model config. Setting to "
... @@ -82,19 +78,11 @@ class ZeroShotClassificationPipeline(Pipeline):
        return -1

    def _parse_and_tokenize(
        self, sequence_pairs, padding=True, add_special_tokens=True, truncation=TruncationStrategy.ONLY_FIRST, **kwargs
    ):
        """
        Parse arguments and tokenize only_first so that hypothesis (label) is not truncated
        """
        return_tensors = self.framework
        if getattr(self.tokenizer, "pad_token", None) is None:
            # XXX some tokenizers do not have a padding token, we use simple lists
... @@ -141,55 +129,27 @@ class ZeroShotClassificationPipeline(Pipeline):
        return inputs
    def _sanitize_parameters(self, **kwargs):
        if kwargs.get("multi_class", None) is not None:
            kwargs["multi_label"] = kwargs["multi_class"]
            logger.warning(
                "The `multi_class` argument has been deprecated and renamed to `multi_label`. "
                "`multi_class` will be removed in a future version of Transformers."
            )
        preprocess_params = {}
        if "candidate_labels" in kwargs:
            preprocess_params["candidate_labels"] = self._args_parser._parse_labels(kwargs["candidate_labels"])
        if "hypothesis_template" in kwargs:
            preprocess_params["hypothesis_template"] = kwargs["hypothesis_template"]

        postprocess_params = {}
        if "multi_label" in kwargs:
            postprocess_params["multi_label"] = kwargs["multi_label"]
        return preprocess_params, {}, postprocess_params
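A minimal usage sketch (assuming the standard :obj:`pipeline` factory and its default NLI checkpoint); :obj:`candidate_labels` and :obj:`hypothesis_template` become preprocess parameters while :obj:`multi_label` is forwarded to :obj:`postprocess`:

.. code-block::

    from transformers import pipeline

    zero_shot = pipeline("zero-shot-classification")
    print(
        zero_shot(
            "Who are you voting for in 2020?",
            candidate_labels=["politics", "economics", "public health"],
            hypothesis_template="This text is about {}.",
            multi_label=True,
        )
    )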
    def __call__(
        self,
        sequences: Union[str, List[str]],
        **kwargs,
    ):
        """
... @@ -222,53 +182,78 @@ class ZeroShotClassificationPipeline(Pipeline):
            - **labels** (:obj:`List[str]`) -- The labels sorted by order of likelihood.
            - **scores** (:obj:`List[float]`) -- The probabilities for each of the labels.
        """
        result = super().__call__(sequences, **kwargs)
        if len(result) == 1:
            return result[0]
        return result
    def preprocess(self, inputs, candidate_labels=None, hypothesis_template="This example is {}."):
        sequence_pairs, sequences = self._args_parser(inputs, candidate_labels, hypothesis_template)
        model_inputs = self._parse_and_tokenize(sequence_pairs)

        prepared_inputs = {
            "candidate_labels": candidate_labels,
            "sequences": sequences,
            "inputs": model_inputs,
        }
        return prepared_inputs

    def _forward(self, inputs):
        candidate_labels = inputs["candidate_labels"]
        sequences = inputs["sequences"]
        model_inputs = inputs["inputs"]
        if isinstance(model_inputs, list):
            outputs = []
            for input_ in model_inputs:
                prediction = self.model(**input_)[0].cpu()
                outputs.append(prediction)
        else:
            outputs = self.model(**model_inputs)

        model_outputs = {"candidate_labels": candidate_labels, "sequences": sequences, "outputs": outputs}
        return model_outputs

    def postprocess(self, model_outputs, multi_label=False):
        candidate_labels = model_outputs["candidate_labels"]
        sequences = model_outputs["sequences"]
        outputs = model_outputs["outputs"]
        if self.framework == "pt":
            if isinstance(outputs, list):
                logits = np.concatenate([output.cpu().numpy() for output in outputs], axis=0)
            else:
                logits = outputs["logits"].cpu().numpy()
        else:
            if isinstance(outputs, list):
                logits = np.concatenate([output.numpy() for output in outputs], axis=0)
            else:
                logits = outputs["logits"].numpy()
        N = logits.shape[0]
        n = len(candidate_labels)
        num_sequences = N // n
        reshaped_outputs = logits.reshape((num_sequences, n, -1))

        if multi_label or len(candidate_labels) == 1:
            # softmax over the entailment vs. contradiction dim for each label independently
            entailment_id = self.entailment_id
            contradiction_id = -1 if entailment_id == 0 else 0
            entail_contr_logits = reshaped_outputs[..., [contradiction_id, entailment_id]]
            scores = np.exp(entail_contr_logits) / np.exp(entail_contr_logits).sum(-1, keepdims=True)
            scores = scores[..., 1]
        else:
            # softmax the "entailment" logits over all candidate labels
            entail_logits = reshaped_outputs[..., self.entailment_id]
            scores = np.exp(entail_logits) / np.exp(entail_logits).sum(-1, keepdims=True)

        result = []
        for iseq in range(num_sequences):
            top_inds = list(reversed(scores[iseq].argsort()))
            result.append(
                {
                    "sequence": sequences[iseq],
                    "labels": [candidate_labels[i] for i in top_inds],
                    "scores": scores[iseq, top_inds].tolist(),
                }
            )

        return result
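Continuing the sketch above, a single sequence yields one dict while a list of sequences yields a list of dicts, matching the unwrapping done in :obj:`__call__` (scores illustrative):

.. code-block::

    single = zero_shot("I love this movie", candidate_labels=["positive", "negative"])
    # {"sequence": "I love this movie", "labels": ["positive", "negative"], "scores": [0.99, 0.01]}

    batch = zero_shot(
        ["I love this movie", "This was a waste of time"],
        candidate_labels=["positive", "negative"],
    )
    # A list of such dicts, in input order.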