Unverified Commit c63fcabf authored by Nicolas Patry, committed by GitHub

[Large PR] Entire rework of pipelines. (#13308)



* Enabling dataset iteration on pipelines.

Enabling dataset iteration on pipelines.

Unifying parameters under `set_parameters` function.

Small fix.

Last fixes after rebase

Remove print.

Fixing text2text `generate_kwargs`

No more `self.max_length`.

Fixing tf only conversational.

Consistency in start/stop index over TF/PT.

Speeding up drastically on TF (nasty bug where max_length would increase
a ton.)

Adding a test for support of non-fast tokenizers.

Fixing GPU usage on zero-shot.

Fix working on TF.

Update src/transformers/pipelines/base.py
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

Update src/transformers/pipelines/base.py
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

Small cleanup.

Remove all asserts + simple format.

* Fixing audio-classification for large PR.

* Overly explicit null checking.

* Encapsulating GPU/CPU pytorch manipulation directly within `base.py`.

* Removed internal state for parameters of the pipeline.

Instead of implicitly overriding internal state, we moved
to real named arguments on every `preprocess`, `_forward`,
`postprocess` function.

Instead, `_sanitize_parameters` will be used to split all kwargs
of both `__init__` and `__call__` into the 3 kinds of named parameters.

* Move import warnings.

* Small fixes.

* Quality.

* Another small fix, using the CI to debug faster.

* Last fixes.

* Last fix.

* Small cleanup of tensor moving.

* is not None.

* Adding a bunch of docs + a iteration test.

* Fixing doc style.

* KeyDataset = None guard.

* Removing the CUDA test for pipelines (was testing).

* Even more simple iteration test.

* Correct import.

* Long day.

* Fixes in docs.

* [WIP] migrating object detection.

* Fixed the target_size bug.

* Fixup.

* Bad variable name.

* Fixing `ensure_on_device` so it respects the original ModelOutput.
parent 09549aa1
..
Copyright 2020 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
How to add a pipeline to 🤗 Transformers?
=======================================================================================================================
First and foremost, you need to decide the raw entries the pipeline will be able to take. They can be strings, raw
bytes, dictionaries or whatever seems to be the most likely desired input. Try to keep these inputs as pure Python as
possible, as that makes compatibility easier (even through other languages via JSON). Those will be the :obj:`inputs` of
the pipeline (:obj:`preprocess`).
Then define the :obj:`outputs`. Same policy as the :obj:`inputs`. The simpler, the better. Those will be the outputs of
the :obj:`postprocess` method.

Start by inheriting the base class :obj:`Pipeline` and implementing the 4 required methods: :obj:`preprocess`,
:obj:`_forward`, :obj:`postprocess` and :obj:`_sanitize_parameters`.
.. code-block::

    from transformers import Pipeline


    class MyPipeline(Pipeline):
        def _sanitize_parameters(self, **kwargs):
            preprocess_kwargs = {}
            if "maybe_arg" in kwargs:
                preprocess_kwargs["maybe_arg"] = kwargs["maybe_arg"]
            return preprocess_kwargs, {}, {}

        def preprocess(self, inputs, maybe_arg=2):
            model_input = Tensor(....)
            return {"model_input": model_input}

        def _forward(self, model_inputs):
            # model_inputs == {"model_input": model_input}
            outputs = self.model(**model_inputs)
            # Maybe {"logits": Tensor(...)}
            return outputs

        def postprocess(self, model_outputs):
            best_class = model_outputs["logits"].softmax(-1)
            return best_class
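Once those 4 methods are implemented, parameters can be passed either when the pipeline is instantiated or when it is
called, and :obj:`_sanitize_parameters` routes them to the right method. A minimal sketch (assuming :obj:`model` and
:obj:`tokenizer` are already loaded, and that :obj:`preprocess` above really builds model inputs rather than the
pseudocode shown):

.. code-block::

    # Both of these calls end up passing `maybe_arg=3` to `preprocess`,
    # courtesy of `_sanitize_parameters`.
    my_pipeline = MyPipeline(model=model, tokenizer=tokenizer, maybe_arg=3)
    out = my_pipeline("This is a test")

    my_pipeline = MyPipeline(model=model, tokenizer=tokenizer)
    out = my_pipeline("This is a test", maybe_arg=3)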
The structure of this breakdown is to support relatively seamless CPU/GPU handling, while also supporting doing the
pre/postprocessing on the CPU, on different threads.

:obj:`preprocess` will take the originally defined inputs and turn them into something feedable to the model. It might
contain more information and is usually a :obj:`Dict`.

:obj:`_forward` is the implementation detail and is not meant to be called directly. :obj:`forward` is the preferred
calling method as it contains safeguards to make sure everything is working on the expected device. Anything that is
linked to a real model belongs in the :obj:`_forward` method; anything else goes in preprocess/postprocess.

:obj:`postprocess` will take the output of :obj:`_forward` and turn it into the final output decided earlier.

:obj:`_sanitize_parameters` exists to allow users to pass any parameters whenever they wish, be it at initialization
time ``pipeline(...., maybe_arg=4)`` or at call time ``pipe = pipeline(...); output = pipe(...., maybe_arg=4)``. The
returns of :obj:`_sanitize_parameters` are the 3 dicts of kwargs that will be passed directly to :obj:`preprocess`,
:obj:`_forward` and :obj:`postprocess`. Don't fill anything if the caller didn't call with any extra parameter. That
allows keeping the default arguments in the function definitions, which is always more "natural".

A classic example would be a :obj:`top_k` argument in the post processing of classification tasks.
.. code-block::

    >>> pipe = pipeline("my-new-task")
    >>> pipe("This is a test")
    [{"label": "1-star", "score": 0.8}, {"label": "2-star", "score": 0.1}, {"label": "3-star", "score": 0.05},
     {"label": "4-star", "score": 0.025}, {"label": "5-star", "score": 0.025}]

    >>> pipe("This is a test", top_k=2)
    [{"label": "1-star", "score": 0.8}, {"label": "2-star", "score": 0.1}]
In order to achieve that, we'll update our :obj:`postprocess` method with a default parameter of :obj:`5`, and edit
:obj:`_sanitize_parameters` to allow this new parameter.
.. code-block::

    def postprocess(self, model_outputs, top_k=5):
        best_class = model_outputs["logits"].softmax(-1)
        # Add logic to handle top_k
        return best_class


    def _sanitize_parameters(self, **kwargs):
        preprocess_kwargs = {}
        if "maybe_arg" in kwargs:
            preprocess_kwargs["maybe_arg"] = kwargs["maybe_arg"]

        postprocess_kwargs = {}
        if "top_k" in kwargs:
            postprocess_kwargs["top_k"] = kwargs["top_k"]
        return preprocess_kwargs, {}, postprocess_kwargs
Try to keep the inputs/outputs very simple and ideally JSON-serializable, as it makes the pipeline usage very easy
without requiring users to understand new kinds of objects. It's also relatively common to support many different types
of arguments for ease of use (audio files can be filenames, URLs or pure bytes).
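A hedged sketch of what such a permissive :obj:`preprocess` could look like for an audio pipeline, loosely mirroring
the audio pipelines touched in this PR (:obj:`decode_audio` is a hypothetical helper standing in for real audio
decoding, and URL support would need an extra download step):

.. code-block::

    import numpy as np

    # Inside your Pipeline subclass:
    def preprocess(self, inputs):
        # Accept a filename, raw bytes, or an already decoded waveform;
        # everything is normalized to a 1D numpy array before feature extraction.
        if isinstance(inputs, str):
            with open(inputs, "rb") as f:
                inputs = f.read()
        if isinstance(inputs, bytes):
            inputs = decode_audio(inputs)  # hypothetical decoding helper
        if not isinstance(inputs, np.ndarray):
            raise ValueError("We expect a filename, raw bytes or a numpy ndarray as input")
        return self.feature_extractor(
            inputs, sampling_rate=self.feature_extractor.sampling_rate, return_tensors="pt"
        )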
Adding it to the list of supported tasks
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Go to ``src/transformers/pipelines/__init__.py`` and fill in :obj:`SUPPORTED_TASKS` with your newly created pipeline.
If possible it should provide a default model.
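A registration entry roughly follows the pattern below. This is a hedged sketch: the task name, model class and default
checkpoint are placeholders, and the exact keys expected by :obj:`SUPPORTED_TASKS` should be checked against the
existing entries in that file.

.. code-block::

    # Inside SUPPORTED_TASKS in src/transformers/pipelines/__init__.py
    "my-new-task": {
        "impl": MyPipeline,
        "tf": (),
        "pt": (AutoModelForSequenceClassification,) if is_torch_available() else (),
        "default": {"model": {"pt": "user/my-default-checkpoint"}},
    },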
Adding tests
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Create a new file ``tests/test_pipelines_MY_PIPELINE.py``, following the example of the other pipeline tests.
The :obj:`run_pipeline_test` function will be very generic and run on small random models on every possible
architecture as defined by :obj:`model_mapping` and :obj:`tf_model_mapping`.
This is very important to test future compatibility, meaning that if someone adds a new model for
:obj:`XXXForQuestionAnswering` then the pipeline test will attempt to run on it. Because the models are random it's
impossible to check for actual values; that's why there is a helper :obj:`ANY` that will simply attempt to match the
TYPE of the pipeline output.
You also *need* to implement 2 (ideally 4) tests.
- :obj:`test_small_model_pt` : Define 1 small model for this pipeline (it doesn't matter if the results don't make
  sense) and test the pipeline outputs. The results should be the same as :obj:`test_small_model_tf` (a skeleton is
  sketched after this list).
- :obj:`test_small_model_tf` : Define 1 small model for this pipeline (it doesn't matter if the results don't make
  sense) and test the pipeline outputs. The results should be the same as :obj:`test_small_model_pt`.
- :obj:`test_large_model_pt` (:obj:`optional`): Tests the pipeline on a real model where the results are supposed to
  make sense. These tests are slow and should be marked as such. Here the goal is to showcase the pipeline and to make
  sure there is no drift in future releases.
- :obj:`test_large_model_tf` (:obj:`optional`): Tests the pipeline on a real model where the results are supposed to
  make sense. These tests are slow and should be marked as such. Here the goal is to showcase the pipeline and to make
  sure there is no drift in future releases.
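A hedged skeleton of what :obj:`test_small_model_pt` could look like; the class name, task string and checkpoint are
placeholders, and the real test files additionally wire in :obj:`model_mapping` / :obj:`tf_model_mapping` as described
above.

.. code-block::

    import unittest

    from transformers import pipeline
    from transformers.testing_utils import require_torch


    class MyNewTaskPipelineTests(unittest.TestCase):
        @require_torch
        def test_small_model_pt(self):
            # "user/tiny-random-my-model" is a placeholder for a tiny random checkpoint.
            pipe = pipeline("my-new-task", model="user/tiny-random-my-model")
            outputs = pipe("This is a test")
            # With a random model, only the structure of the output is meaningful.
            self.assertIsInstance(outputs, list)
            self.assertIn("label", outputs[0])
            self.assertIn("score", outputs[0])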
@@ -506,6 +506,7 @@ Flax), PyTorch, and/or TensorFlow.
   migration
   contributing
   add_new_model
   add_new_pipeline
   fast_tokenizers
   performance
   parallelism
......
@@ -46,12 +46,53 @@ The pipeline abstraction
The `pipeline` abstraction is a wrapper around all the other available pipelines. It is instantiated as any other
pipeline but requires an additional argument which is the `task`.
Simple call on one item:

.. code-block::

    >>> pipe = pipeline("text-classification")
    >>> pipe("This restaurant is awesome")
    [{'label': 'POSITIVE', 'score': 0.9998743534088135}]
To call a pipeline on many items, you can call it with a `list`:
.. code-block::

    >>> pipe = pipeline("text-classification")
    >>> pipe(["This restaurant is awesome", "This restaurant is aweful"])
    [{'label': 'POSITIVE', 'score': 0.9998743534088135},
     {'label': 'NEGATIVE', 'score': 0.9996669292449951}]
To iterate over full datasets it is recommended to use a :obj:`dataset` directly. This means you don't need to allocate
the whole dataset at once, nor do you need to do batching yourself. This should work just as fast as custom loops on
GPU. If it doesn't, don't hesitate to create an issue.
.. code-block::

    pipe = pipeline("automatic-speech-recognition", model="facebook/wav2vec2-base-960h", device=0)
    dataset = datasets.load_dataset("superb", name="asr", split="test")

    # KeyDataset (only `pt`) will simply return the item in the dict returned by the dataset item
    # as we're not interested in the `target` part of the dataset.
    for out in tqdm.tqdm(pipe(KeyDataset(dataset, "file"))):
        print(out)
        # {"text": "NUMBER TEN FRESH NELLY IS WAITING ON YOU GOOD NIGHT HUSBAND"}
        # {"text": ....}
        # ....
.. autofunction:: transformers.pipeline
Implementing a pipeline
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
:doc:`Implementing a new pipeline <../add_new_pipeline>`
The task specific pipelines
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

AudioClassificationPipeline
=======================================================================================================================
......
@@ -67,8 +67,8 @@ make them readable. For instance:
>>> classifier('We are very happy to show you the 🤗 Transformers library.')
[{'label': 'POSITIVE', 'score': 0.9998}]
That's encouraging! You can use it on a list of sentences, which will be preprocessed then fed to the model as a That's encouraging! You can use it on a list of sentences, which will be preprocessed then fed to the model, returning
`batch`, returning a list of dictionaries like this one: a list of dictionaries like this one:
.. code-block::
@@ -79,6 +79,8 @@ That's encouraging! You can use it on a list of sentences, which will be preproc
label: POSITIVE, with score: 0.9998
label: NEGATIVE, with score: 0.5309
To use with a large dataset, look at :doc:`iterating over a pipeline <./main_classes/pipelines>`
You can see the second sentence has been classified as negative (it needs to be positive or negative) but its score is
fairly neutral.
......
...@@ -249,18 +249,22 @@ def check_task(task: str) -> Tuple[Dict, Any]: ...@@ -249,18 +249,22 @@ def check_task(task: str) -> Tuple[Dict, Any]:
task (:obj:`str`):
The task defining which pipeline will be returned. Currently accepted tasks are:
- :obj:`"audio-classification"`
- :obj:`"automatic-speech-recognition"`
- :obj:`"conversational"`
- :obj:`"feature-extraction"` - :obj:`"feature-extraction"`
- :obj:`"text-classification"`
- :obj:`"sentiment-analysis"` (alias of :obj:`"text-classification")
- :obj:`"token-classification"`
- :obj:`"ner"` (alias of :obj:`"token-classification")
- :obj:`"question-answering"`
- :obj:`"fill-mask"` - :obj:`"fill-mask"`
- :obj:`"summarization"` - :obj:`"image-classification"`
- :obj:`"translation_xx_to_yy"` - :obj:`"question-answering"`
- :obj:`"translation"` - :obj:`"table-question-answering"`
- :obj:`"text2text-generation"`
- :obj:`"text-classification"` (alias :obj:`"sentiment-analysis" available)
- :obj:`"text-generation"` - :obj:`"text-generation"`
- :obj:`"conversational"` - :obj:`"token-classification"` (alias :obj:`"ner"` available)
- :obj:`"translation"`
- :obj:`"translation_xx_to_yy"`
- :obj:`"summarization"`
- :obj:`"zero-shot-classification"`
Returns:
(task_defaults:obj:`dict`, task_options: (:obj:`tuple`, None)) The actual dictionary required to initialize the
@@ -312,21 +316,26 @@ def pipeline(
task (:obj:`str`):
The task defining which pipeline will be returned. Currently accepted tasks are:
- :obj:`"feature-extraction"`: will return a :class:`~transformers.FeatureExtractionPipeline`. - :obj:`"audio-classification"`: will return a :class:`~transformers.AudioClassificationPipeline`:.
- :obj:`"text-classification"`: will return a :class:`~transformers.TextClassificationPipeline`. - :obj:`"automatic-speech-recognition"`: will return a
- :obj:`"sentiment-analysis"`: (alias of :obj:`"text-classification"`) will return a :class:`~transformers.AutomaticSpeechRecognitionPipeline`:.
:class:`~transformers.TextClassificationPipeline`. - :obj:`"conversational"`: will return a :class:`~transformers.ConversationalPipeline`:.
- :obj:`"token-classification"`: will return a :class:`~transformers.TokenClassificationPipeline`. - :obj:`"feature-extraction"`: will return a :class:`~transformers.FeatureExtractionPipeline`:.
- :obj:`"ner"` (alias of :obj:`"token-classification"`): will return a - :obj:`"fill-mask"`: will return a :class:`~transformers.FillMaskPipeline`:.
:class:`~transformers.TokenClassificationPipeline`. - :obj:`"image-classification"`: will return a :class:`~transformers.ImageClassificationPipeline`:.
- :obj:`"question-answering"`: will return a :class:`~transformers.QuestionAnsweringPipeline`. - :obj:`"question-answering"`: will return a :class:`~transformers.QuestionAnsweringPipeline`:.
- :obj:`"fill-mask"`: will return a :class:`~transformers.FillMaskPipeline`. - :obj:`"table-question-answering"`: will return a :class:`~transformers.TableQuestionAnsweringPipeline`:.
- :obj:`"summarization"`: will return a :class:`~transformers.SummarizationPipeline`. - :obj:`"text2text-generation"`: will return a :class:`~transformers.Text2TextGenerationPipeline`:.
- :obj:`"translation_xx_to_yy"`: will return a :class:`~transformers.TranslationPipeline`. - :obj:`"text-classification"` (alias :obj:`"sentiment-analysis" available): will return a
- :obj:`"text2text-generation"`: will return a :class:`~transformers.Text2TextGenerationPipeline`. :class:`~transformers.TextClassificationPipeline`:.
- :obj:`"text-generation"`: will return a :class:`~transformers.TextGenerationPipeline`. - :obj:`"text-generation"`: will return a :class:`~transformers.TextGenerationPipeline`:.
- :obj:`"zero-shot-classification:`: will return a :class:`~transformers.ZeroShotClassificationPipeline`. - :obj:`"token-classification"` (alias :obj:`"ner"` available): will return a
- :obj:`"conversational"`: will return a :class:`~transformers.ConversationalPipeline`. :class:`~transformers.TokenClassificationPipeline`:.
- :obj:`"translation"`: will return a :class:`~transformers.TranslationPipeline`:.
- :obj:`"translation_xx_to_yy"`: will return a :class:`~transformers.TranslationPipeline`:.
- :obj:`"summarization"`: will return a :class:`~transformers.SummarizationPipeline`:.
- :obj:`"zero-shot-classification"`: will return a :class:`~transformers.ZeroShotClassificationPipeline`:.
model (:obj:`str` or :obj:`~transformers.PreTrainedModel` or :obj:`~transformers.TFPreTrainedModel`, `optional`):
The model that will be used by the pipeline to make predictions. This can be a model identifier or an
actual instance of a pretrained model inheriting from :class:`~transformers.PreTrainedModel` (for PyTorch)
......
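As a quick reminder of how the task strings documented above are consumed, a minimal usage sketch (this mirrors the
examples already shown in the pipeline docs; nothing here is new to this PR):

.. code-block::

    from transformers import pipeline

    # The task string selects the pipeline class; leaving `model` unset
    # falls back to the default checkpoint registered for that task.
    classifier = pipeline("text-classification")
    classifier("This restaurant is awesome")
    # e.g. [{'label': 'POSITIVE', 'score': 0.9998743534088135}], as in the docs above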
...@@ -12,23 +12,16 @@ ...@@ -12,23 +12,16 @@
# See the License for the specific language governing permissions and # See the License for the specific language governing permissions and
# limitations under the License. # limitations under the License.
import subprocess import subprocess
from typing import TYPE_CHECKING, Optional, Union from typing import Union
import numpy as np import numpy as np
from ..feature_extraction_utils import PreTrainedFeatureExtractor
from ..file_utils import add_end_docstrings, is_torch_available from ..file_utils import add_end_docstrings, is_torch_available
from ..utils import logging from ..utils import logging
from .base import PIPELINE_INIT_ARGS, Pipeline from .base import PIPELINE_INIT_ARGS, Pipeline
if TYPE_CHECKING:
from ..modeling_tf_utils import TFPreTrainedModel
from ..modeling_utils import PreTrainedModel
if is_torch_available(): if is_torch_available():
import torch
from ..models.auto.modeling_auto import MODEL_FOR_AUDIO_CLASSIFICATION_MAPPING from ..models.auto.modeling_auto import MODEL_FOR_AUDIO_CLASSIFICATION_MAPPING
logger = logging.get_logger(__name__) logger = logging.get_logger(__name__)
...@@ -84,14 +77,10 @@ class AudioClassificationPipeline(Pipeline): ...@@ -84,14 +77,10 @@ class AudioClassificationPipeline(Pipeline):
<https://huggingface.co/models?filter=audio-classification>`__. <https://huggingface.co/models?filter=audio-classification>`__.
""" """
def __init__( def __init__(self, *args, **kwargs):
self, # Default, might be overriden by the model.config.
model: Union["PreTrainedModel", "TFPreTrainedModel"], kwargs["top_k"] = 5
feature_extractor: PreTrainedFeatureExtractor, super().__init__(*args, **kwargs)
framework: Optional[str] = None,
**kwargs
):
super().__init__(model, feature_extractor=feature_extractor, framework=framework, **kwargs)
if self.framework != "pt": if self.framework != "pt":
raise ValueError(f"The {self.__class__} is only available in PyTorch.") raise ValueError(f"The {self.__class__} is only available in PyTorch.")
...@@ -101,7 +90,6 @@ class AudioClassificationPipeline(Pipeline): ...@@ -101,7 +90,6 @@ class AudioClassificationPipeline(Pipeline):
def __call__( def __call__(
self, self,
inputs: Union[np.ndarray, bytes, str], inputs: Union[np.ndarray, bytes, str],
top_k: Optional[int] = None,
**kwargs, **kwargs,
): ):
""" """
...@@ -126,6 +114,18 @@ class AudioClassificationPipeline(Pipeline): ...@@ -126,6 +114,18 @@ class AudioClassificationPipeline(Pipeline):
- **label** (:obj:`str`) -- The label predicted. - **label** (:obj:`str`) -- The label predicted.
- **score** (:obj:`float`) -- The corresponding probability. - **score** (:obj:`float`) -- The corresponding probability.
""" """
return super().__call__(inputs, **kwargs)
def _sanitize_parameters(self, top_k=None, **kwargs):
# No parameters on this pipeline right now
postprocess_params = {}
if top_k is not None:
if top_k > self.model.config.num_labels:
top_k = self.model.config.num_labels
postprocess_params["top_k"] = top_k
return {}, {}, postprocess_params
def preprocess(self, inputs):
if isinstance(inputs, str): if isinstance(inputs, str):
with open(inputs, "rb") as f: with open(inputs, "rb") as f:
inputs = f.read() inputs = f.read()
...@@ -136,20 +136,19 @@ class AudioClassificationPipeline(Pipeline): ...@@ -136,20 +136,19 @@ class AudioClassificationPipeline(Pipeline):
if not isinstance(inputs, np.ndarray): if not isinstance(inputs, np.ndarray):
raise ValueError("We expect a numpy ndarray as input") raise ValueError("We expect a numpy ndarray as input")
if len(inputs.shape) != 1: if len(inputs.shape) != 1:
raise ValueError("We expect a single channel audio input for AudioClassificationPipeline") raise ValueError("We expect a single channel audio input for AutomaticSpeechRecognitionPipeline")
if top_k is None or top_k > self.model.config.num_labels:
top_k = self.model.config.num_labels
processed = self.feature_extractor( processed = self.feature_extractor(
inputs, sampling_rate=self.feature_extractor.sampling_rate, return_tensors="pt" inputs, sampling_rate=self.feature_extractor.sampling_rate, return_tensors="pt"
) )
processed = self.ensure_tensor_on_device(**processed) return processed
with torch.no_grad(): def _forward(self, model_inputs):
outputs = self.model(**processed) model_outputs = self.model(**model_inputs)
return model_outputs
probs = outputs.logits[0].softmax(-1) def postprocess(self, model_outputs, top_k=5):
probs = model_outputs.logits[0].softmax(-1)
scores, ids = probs.topk(top_k) scores, ids = probs.topk(top_k)
scores = scores.tolist() scores = scores.tolist()
......
...@@ -124,6 +124,13 @@ class AutomaticSpeechRecognitionPipeline(Pipeline): ...@@ -124,6 +124,13 @@ class AutomaticSpeechRecognitionPipeline(Pipeline):
- **text** (:obj:`str`) -- The recognized text. - **text** (:obj:`str`) -- The recognized text.
""" """
return super().__call__(inputs, **kwargs)
def _sanitize_parameters(self, **kwargs):
# No parameters on this pipeline right now
return {}, {}, {}
def preprocess(self, inputs):
if isinstance(inputs, str): if isinstance(inputs, str):
with open(inputs, "rb") as f: with open(inputs, "rb") as f:
inputs = f.read() inputs = f.read()
...@@ -131,27 +138,34 @@ class AutomaticSpeechRecognitionPipeline(Pipeline): ...@@ -131,27 +138,34 @@ class AutomaticSpeechRecognitionPipeline(Pipeline):
if isinstance(inputs, bytes): if isinstance(inputs, bytes):
inputs = ffmpeg_read(inputs, self.feature_extractor.sampling_rate) inputs = ffmpeg_read(inputs, self.feature_extractor.sampling_rate)
assert isinstance(inputs, np.ndarray), "We expect a numpy ndarray as input" if not isinstance(inputs, np.ndarray):
assert len(inputs.shape) == 1, "We expect a single channel audio input for AutomaticSpeechRecognitionPipeline" raise ValueError("We expect a numpy ndarray as input")
if len(inputs.shape) != 1:
raise ValueError("We expect a single channel audio input for AutomaticSpeechRecognitionPipeline")
processed = self.feature_extractor( processed = self.feature_extractor(
inputs, sampling_rate=self.feature_extractor.sampling_rate, return_tensors="pt" inputs, sampling_rate=self.feature_extractor.sampling_rate, return_tensors="pt"
) )
processed = self.ensure_tensor_on_device(**processed) return processed
def _forward(self, model_inputs):
name = self.model.__class__.__name__ name = self.model.__class__.__name__
if name.endswith("ForConditionalGeneration") or name.endswith("EncoderDecoderModel"): if name.endswith("ForConditionalGeneration") or name.endswith("EncoderDecoderModel"):
encoder = self.model.get_encoder() encoder = self.model.get_encoder()
# we need to pass `processed.get("attention_mask")` here since audio encoder # we need to pass `processed.get("attention_mask")` here since audio encoder
# attention mask length is different from expected text decoder `encoder_attention_mask` length # attention mask length is different from expected text decoder `encoder_attention_mask` length
# `generate` magic to create the mask automatically won't work, we basically need to help
# it here.
tokens = self.model.generate( tokens = self.model.generate(
encoder_outputs=encoder(**processed), attention_mask=processed.get("attention_mask") encoder_outputs=encoder(**model_inputs), attention_mask=model_inputs.get("attention_mask")
) )
tokens = tokens.squeeze(0) tokens = tokens.squeeze(0)
elif name.endswith("ForCTC"): elif name.endswith("ForCTC"):
outputs = self.model(**processed) outputs = self.model(**model_inputs)
tokens = outputs.logits.squeeze(0).argmax(dim=-1) tokens = outputs.logits.squeeze(0).argmax(dim=-1)
return tokens
def postprocess(self, model_outputs):
skip_special_tokens = False if "CTC" in self.tokenizer.__class__.__name__ else True skip_special_tokens = False if "CTC" in self.tokenizer.__class__.__name__ else True
recognized_string = self.tokenizer.decode(tokens, skip_special_tokens=skip_special_tokens) recognized_string = self.tokenizer.decode(model_outputs, skip_special_tokens=skip_special_tokens)
return {"text": recognized_string} return {"text": recognized_string}
...@@ -20,18 +20,21 @@ import pickle ...@@ -20,18 +20,21 @@ import pickle
import sys import sys
import warnings import warnings
from abc import ABC, abstractmethod from abc import ABC, abstractmethod
from collections import UserDict
from contextlib import contextmanager from contextlib import contextmanager
from os.path import abspath, exists from os.path import abspath, exists
from typing import TYPE_CHECKING, Any, Dict, List, Optional, Tuple, Union from typing import TYPE_CHECKING, Any, Dict, List, Optional, Tuple, Union
from ..feature_extraction_utils import PreTrainedFeatureExtractor from ..feature_extraction_utils import PreTrainedFeatureExtractor
from ..file_utils import add_end_docstrings, is_tf_available, is_torch_available from ..file_utils import ModelOutput, add_end_docstrings, is_tf_available, is_torch_available
from ..modelcard import ModelCard from ..modelcard import ModelCard
from ..models.auto.configuration_auto import AutoConfig from ..models.auto.configuration_auto import AutoConfig
from ..tokenization_utils import PreTrainedTokenizer, TruncationStrategy from ..tokenization_utils import PreTrainedTokenizer
from ..utils import logging from ..utils import logging
GenericTensor = Union[List["GenericTensor"], "torch.Tensor", "tf.Tensor"]
if is_tf_available(): if is_tf_available():
import tensorflow as tf import tensorflow as tf
...@@ -39,8 +42,12 @@ if is_tf_available(): ...@@ -39,8 +42,12 @@ if is_tf_available():
if is_torch_available(): if is_torch_available():
import torch import torch
from torch.utils.data import DataLoader, Dataset, IterableDataset
from ..models.auto.modeling_auto import AutoModel from ..models.auto.modeling_auto import AutoModel
else:
Dataset = None
KeyDataset = None
if TYPE_CHECKING: if TYPE_CHECKING:
from ..modeling_tf_utils import TFPreTrainedModel from ..modeling_tf_utils import TFPreTrainedModel
...@@ -50,6 +57,12 @@ if TYPE_CHECKING: ...@@ -50,6 +57,12 @@ if TYPE_CHECKING:
logger = logging.get_logger(__name__) logger = logging.get_logger(__name__)
def collate_fn(items):
if len(items) != 1:
raise ValueError("This collate_fn is meant to be used with batch_size=1")
return items[0]
def infer_framework_load_model( def infer_framework_load_model(
model, model,
config: AutoConfig, config: AutoConfig,
...@@ -585,6 +598,51 @@ PIPELINE_INIT_ARGS = r""" ...@@ -585,6 +598,51 @@ PIPELINE_INIT_ARGS = r"""
Flag indicating if the output the pipeline should happen in a binary format (i.e., pickle) or as raw text. Flag indicating if the output the pipeline should happen in a binary format (i.e., pickle) or as raw text.
""" """
if is_torch_available():
class PipelineDataset(Dataset):
def __init__(self, dataset, process, params):
self.dataset = dataset
self.process = process
self.params = params
def __len__(self):
return len(self.dataset)
def __getitem__(self, i):
item = self.dataset[i]
processed = self.process(item, **self.params)
return processed
class PipelineIterator(IterableDataset):
def __init__(self, loader, infer, params):
self.loader = loader
self.infer = infer
self.params = params
def __len__(self):
return len(self.loader)
def __iter__(self):
self.iterator = iter(self.loader)
return self
def __next__(self):
item = next(self.iterator)
processed = self.infer(item, **self.params)
return processed
class KeyDataset(Dataset):
def __init__(self, dataset: Dataset, key: str):
self.dataset = dataset
self.key = key
def __len__(self):
return len(self.dataset)
def __getitem__(self, i):
return self.dataset[i][self.key]
@add_end_docstrings(PIPELINE_INIT_ARGS) @add_end_docstrings(PIPELINE_INIT_ARGS)
class Pipeline(_ScikitCompat): class Pipeline(_ScikitCompat):
...@@ -618,6 +676,7 @@ class Pipeline(_ScikitCompat): ...@@ -618,6 +676,7 @@ class Pipeline(_ScikitCompat):
args_parser: ArgumentHandler = None, args_parser: ArgumentHandler = None,
device: int = -1, device: int = -1,
binary_output: bool = False, binary_output: bool = False,
**kwargs,
): ):
if framework is None: if framework is None:
...@@ -641,6 +700,9 @@ class Pipeline(_ScikitCompat): ...@@ -641,6 +700,9 @@ class Pipeline(_ScikitCompat):
if task_specific_params is not None and task in task_specific_params: if task_specific_params is not None and task in task_specific_params:
self.model.config.update(task_specific_params.get(task)) self.model.config.update(task_specific_params.get(task))
self.call_count = 0
self._preprocess_params, self._forward_params, self._postprocess_params = self._sanitize_parameters(**kwargs)
def save_pretrained(self, save_directory: str): def save_pretrained(self, save_directory: str):
""" """
Save the pipeline's model and tokenizer. Save the pipeline's model and tokenizer.
...@@ -707,15 +769,31 @@ class Pipeline(_ScikitCompat): ...@@ -707,15 +769,31 @@ class Pipeline(_ScikitCompat):
Ensure PyTorch tensors are on the specified device. Ensure PyTorch tensors are on the specified device.
Args: Args:
inputs (keyword arguments that should be :obj:`torch.Tensor`): The tensors to place on :obj:`self.device`. inputs (keyword arguments that should be :obj:`torch.Tensor`, the rest is ignored): The tensors to place on :obj:`self.device`.
Recursive on lists **only**.
Return: Return:
:obj:`Dict[str, torch.Tensor]`: The same as :obj:`inputs` but on the proper device. :obj:`Dict[str, torch.Tensor]`: The same as :obj:`inputs` but on the proper device.
""" """
return { return self._ensure_tensor_on_device(inputs, self.device)
name: tensor.to(self.device) if isinstance(tensor, torch.Tensor) else tensor
for name, tensor in inputs.items() def _ensure_tensor_on_device(self, inputs, device):
} if isinstance(inputs, ModelOutput):
return ModelOutput(
{name: self._ensure_tensor_on_device(tensor, device) for name, tensor in inputs.items()}
)
elif isinstance(inputs, dict):
return {name: self._ensure_tensor_on_device(tensor, device) for name, tensor in inputs.items()}
elif isinstance(inputs, UserDict):
return UserDict({name: self._ensure_tensor_on_device(tensor, device) for name, tensor in inputs.items()})
elif isinstance(inputs, list):
return [self._ensure_tensor_on_device(item, device) for item in inputs]
elif isinstance(inputs, tuple):
return tuple([self._ensure_tensor_on_device(item, device) for item in inputs])
elif isinstance(inputs, torch.Tensor):
return inputs.to(self.device)
else:
return inputs
def check_model_type(self, supported_models: Union[List[str], dict]): def check_model_type(self, supported_models: Union[List[str], dict]):
""" """
...@@ -739,65 +817,108 @@ class Pipeline(_ScikitCompat): ...@@ -739,65 +817,108 @@ class Pipeline(_ScikitCompat):
f"The model '{self.model.__class__.__name__}' is not supported for {self.task}. Supported models are {supported_models}." f"The model '{self.model.__class__.__name__}' is not supported for {self.task}. Supported models are {supported_models}."
) )
def _parse_and_tokenize( @abstractmethod
self, inputs, padding=True, add_special_tokens=True, truncation=TruncationStrategy.DO_NOT_TRUNCATE, **kwargs def _sanitize_parameters(self, **pipeline_parameters):
):
""" """
Parse arguments and tokenize _sanitize_parameters will be called with any excessive named arguments from either `__init__` or `__call__`
""" methods. It should return 3 dictionnaries of the resolved parameters used by the various `preprocess`,
# Parse arguments `forward` and `postprocess` methods. Do not fill dictionnaries if the caller didn't specify a kwargs. This
if getattr(self.tokenizer, "pad_token", None) is None: let's you keep defaults in function signatures, which is more "natural".
padding = False
inputs = self.tokenizer(
inputs,
add_special_tokens=add_special_tokens,
return_tensors=self.framework,
padding=padding,
truncation=truncation,
)
return inputs
def __call__(self, inputs, *args, **kwargs): It is not meant to be called directly, it will be automatically called and the final parameters resolved by
try: `__init__` and `__call__`
model_inputs = self._parse_and_tokenize(inputs, *args, **kwargs) """
outputs = self._forward(model_inputs) raise NotImplementedError("_sanitize_parameters not implemented")
return outputs
except ValueError:
# XXX: Some tokenizer do NOT have a pad token, hence we cannot run the inference
# in a batch, instead we run everything sequentially
if isinstance(inputs, list):
values = []
for input_ in inputs:
model_input = self._parse_and_tokenize(input_, padding=False, *args, **kwargs)
value = self._forward(model_input)
values.append(value.squeeze(0))
else:
model_input = self._parse_and_tokenize(inputs, padding=False, *args, **kwargs)
values = self._forward(model_input)
return values
def _forward(self, inputs, return_tensors=False): @abstractmethod
def preprocess(self, input_: Any, **preprocess_parameters: Dict) -> Dict[str, GenericTensor]:
""" """
Internal framework specific forward dispatching Preprocess will take the `input_` of a specific pipeline and return a dictionnary of everything necessary for
`_forward` to run properly. It should contain at least one tensor, but might have arbitrary other items.
"""
raise NotImplementedError("preprocess not implemented")
Args: @abstractmethod
inputs: dict holding all the keyword arguments for required by the model forward method. def _forward(self, input_tensors: Dict[str, GenericTensor], **forward_parameters: Dict) -> ModelOutput:
return_tensors: Whether to return native framework (pt/tf) tensors rather than numpy array """
_forward will receive the prepared dictionnary from `preprocess` and run it on the model. This method might
involve the GPU or the CPU and should be agnostic to it. Isolating this function is the reason for `preprocess`
and `postprocess` to exist, so that the hot path, this method generally can run as fast as possible.
Returns: It is not meant to be called directly, `forward` is preferred. It is basically the same but contains additional
Numpy array code surrounding `_forward` making sure tensors and models are on the same device, disabling the training part
of the code (leading to faster inference).
"""
raise NotImplementedError("_forward not implemented")
@abstractmethod
def postprocess(self, model_outputs: ModelOutput, **postprocess_parameters: Dict) -> Any:
"""
Postprocess will receive the raw outputs of the `_forward` method, generally tensors, and reformat them into
something more friendly. Generally it will output a list or a dict or results (containing just strings and
numbers).
""" """
# Encode for forward raise NotImplementedError("postprocess not implemented")
def forward(self, model_inputs, **forward_params):
with self.device_placement(): with self.device_placement():
if self.framework == "tf": if self.framework == "tf":
# TODO trace model model_inputs["training"] = False
predictions = self.model(inputs.data, training=False)[0] model_outputs = self._forward(model_inputs, **forward_params)
else: elif self.framework == "pt":
with torch.no_grad(): with torch.no_grad():
inputs = self.ensure_tensor_on_device(**inputs) model_inputs = self._ensure_tensor_on_device(model_inputs, device=self.device)
predictions = self.model(**inputs)[0].cpu() model_outputs = self._forward(model_inputs, **forward_params)
model_outputs = self._ensure_tensor_on_device(model_outputs, device=torch.device("cpu"))
if return_tensors:
return predictions
else: else:
return predictions.numpy() raise ValueError(f"Framework {self.framework} is not supported")
return model_outputs
def get_iterator(self, inputs, num_workers: int, preprocess_params, forward_params, postprocess_params):
if "TOKENIZERS_PARALLELISM" not in os.environ:
logger.info("Disabling tokenizer parallelism, we're using DataLoader multithreading already")
os.environ["TOKENIZERS_PARALLELISM"] = "false"
dataset = PipelineDataset(inputs, self.preprocess, preprocess_params)
dataloader = DataLoader(dataset, num_workers=num_workers, batch_size=1, collate_fn=collate_fn)
model_iterator = PipelineIterator(dataloader, self.forward, forward_params)
final_iterator = PipelineIterator(model_iterator, self.postprocess, postprocess_params)
return final_iterator
def __call__(self, inputs, *args, num_workers=8, **kwargs):
if args:
logger.warning(f"Ignoring args : {args}")
preprocess_params, forward_params, postprocess_params = self._sanitize_parameters(**kwargs)
# Fuse __init__ params and __call__ params without modifying the __init__ ones.
preprocess_params = {**self._preprocess_params, **preprocess_params}
forward_params = {**self._forward_params, **forward_params}
postprocess_params = {**self._postprocess_params, **postprocess_params}
self.call_count += 1
if self.call_count > 10 and self.framework == "pt" and self.device.type == "cuda":
warnings.warn(
"You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset",
UserWarning,
)
if isinstance(inputs, list):
if self.framework == "pt":
final_iterator = self.get_iterator(
inputs, num_workers, preprocess_params, forward_params, postprocess_params
)
outputs = [output for output in final_iterator]
return outputs
else:
return self.run_multi(inputs, preprocess_params, forward_params, postprocess_params)
elif Dataset is not None and isinstance(inputs, Dataset):
return self.get_iterator(inputs, num_workers, preprocess_params, forward_params, postprocess_params)
else:
return self.run_single(inputs, preprocess_params, forward_params, postprocess_params)
def run_multi(self, inputs, preprocess_params, forward_params, postprocess_params):
return [self.run_single(item, preprocess_params, forward_params, postprocess_params) for item in inputs]
def run_single(self, inputs, preprocess_params, forward_params, postprocess_params):
model_inputs = self.preprocess(inputs, **preprocess_params)
model_outputs = self.forward(model_inputs, **forward_params)
outputs = self.postprocess(model_outputs, **postprocess_params)
return outputs
...@@ -190,23 +190,34 @@ class ConversationalPipeline(Pipeline): ...@@ -190,23 +190,34 @@ class ConversationalPipeline(Pipeline):
conversational_pipeline([conversation_1, conversation_2]) conversational_pipeline([conversation_1, conversation_2])
""" """
def __init__(self, min_length_for_response=32, minimum_tokens=10, *args, **kwargs): def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs) super().__init__(*args, **kwargs)
# We need at least an eos_token
# assert self.tokenizer.eos_token_id is not None, "ConversationalPipeline tokenizer should have an EOS token set"
if self.tokenizer.pad_token_id is None: if self.tokenizer.pad_token_id is None:
self.tokenizer.pad_token = self.tokenizer.eos_token self.tokenizer.pad_token = self.tokenizer.eos_token
self.min_length_for_response = min_length_for_response def _sanitize_parameters(
self.minimum_tokens = minimum_tokens self, min_length_for_response=None, minimum_tokens=None, clean_up_tokenization_spaces=None, **generate_kwargs
def __call__(
self,
conversations: Union[Conversation, List[Conversation]],
clean_up_tokenization_spaces=True,
**generate_kwargs
): ):
preprocess_params = {}
forward_params = {}
postprocess_params = {}
if min_length_for_response is not None:
preprocess_params["min_length_for_response"] = min_length_for_response
if minimum_tokens is not None:
forward_params["minimum_tokens"] = minimum_tokens
if "max_length" in generate_kwargs:
forward_params["max_length"] = generate_kwargs["max_length"]
# self.max_length = generate_kwargs.get("max_length", self.model.config.max_length)
if clean_up_tokenization_spaces is not None:
postprocess_params["clean_up_tokenization_spaces"] = clean_up_tokenization_spaces
if generate_kwargs:
forward_params.update(generate_kwargs)
return preprocess_params, forward_params, postprocess_params
def __call__(self, conversations: Union[Conversation, List[Conversation]], num_workers=0, **kwargs):
r""" r"""
Generate responses for the conversation(s) given as inputs. Generate responses for the conversation(s) given as inputs.
...@@ -223,117 +234,67 @@ class ConversationalPipeline(Pipeline): ...@@ -223,117 +234,67 @@ class ConversationalPipeline(Pipeline):
:class:`~transformers.Conversation` or a list of :class:`~transformers.Conversation`: Conversation(s) with :class:`~transformers.Conversation` or a list of :class:`~transformers.Conversation`: Conversation(s) with
updated generated responses for those containing a new user input. updated generated responses for those containing a new user input.
""" """
# XXX: num_workers==0 is required to be backward compatible
# Otherwise the threads will require a Conversation copy.
# This will definitely hinder performance on GPU, but has to be opted
# in because of this BC change.
outputs = super().__call__(conversations, num_workers=num_workers, **kwargs)
if isinstance(outputs, list) and len(outputs) == 1:
return outputs[0]
return outputs
if isinstance(conversations, Conversation): def preprocess(self, conversation: Conversation) -> Dict[str, Any]:
conversations = [conversations] if not isinstance(conversation, Conversation):
# Input validation raise ValueError("ConversationalPipeline, expects Conversation as inputs")
if isinstance(conversations, list):
for conversation in conversations:
assert isinstance(
conversation, Conversation
), "ConversationalPipeline expects a Conversation or list of Conversations as an input"
if conversation.new_user_input is None: if conversation.new_user_input is None:
raise ValueError( raise ValueError(
f"Conversation with UUID {type(conversation.uuid)} does not contain new user input to process. " f"Conversation with UUID {type(conversation.uuid)} does not contain new user input to process. "
"Add user inputs with the conversation's `add_user_input` method" "Add user inputs with the conversation's `add_user_input` method"
) )
assert ( if hasattr(self.tokenizer, "_build_conversation_input_ids"):
self.tokenizer.pad_token_id is not None or self.tokenizer.eos_token_id is not None input_ids = self.tokenizer._build_conversation_input_ids(conversation)
), "Please make sure that the tokenizer has a pad_token_id or eos_token_id when using a batch input"
else: else:
raise ValueError("ConversationalPipeline expects a Conversation or list of Conversations as an input") # If the tokenizer cannot handle conversations, we default to only the old version
input_ids = self._legacy_parse_and_tokenize(conversation)
with self.device_placement():
inputs = self._parse_and_tokenize(conversations)
if self.framework == "pt": if self.framework == "pt":
inputs = self.ensure_tensor_on_device(**inputs) input_ids = torch.LongTensor([input_ids])
input_length = inputs["input_ids"].shape[-1]
elif self.framework == "tf": elif self.framework == "tf":
input_length = tf.shape(inputs["input_ids"])[-1].numpy() input_ids = tf.constant([input_ids])
return {"input_ids": input_ids, "conversation": conversation}
def _forward(self, model_inputs, minimum_tokens=10, **generate_kwargs):
max_length = generate_kwargs.get("max_length", self.model.config.max_length) max_length = generate_kwargs.get("max_length", self.model.config.max_length)
n = inputs["input_ids"].shape[1]
if max_length - self.minimum_tokens < n:
logger.warning(
f"Conversation input is to long ({n}), trimming it to ({max_length} - {self.minimum_tokens})"
)
trim = max_length - self.minimum_tokens
inputs["input_ids"] = inputs["input_ids"][:, -trim:]
inputs["attention_mask"] = inputs["attention_mask"][:, -trim:]
generated_responses = self.model.generate(
inputs["input_ids"],
attention_mask=inputs["attention_mask"],
**generate_kwargs,
)
if self.model.config.is_encoder_decoder: n = model_inputs["input_ids"].shape[1]
if self.framework == "pt": if max_length - minimum_tokens < n:
history = torch.cat((inputs["input_ids"], generated_responses[:, 1:]), 1) logger.warning(f"Conversation input is to long ({n}), trimming it to ({max_length} - {minimum_tokens})")
elif self.framework == "tf": trim = max_length - minimum_tokens
history = tf.concat([inputs["input_ids"], generated_responses[:, 1:]], 1) model_inputs["input_ids"] = model_inputs["input_ids"][:, -trim:]
else: if "attention_mask" in model_inputs:
history = generated_responses model_inputs["attention_mask"] = model_inputs["attention_mask"][:, -trim:]
conversation = model_inputs.pop("conversation")
history = self._clean_padding_history(history) model_inputs["max_length"] = max_length
output_ids = self.model.generate(**model_inputs, **generate_kwargs)
if self.model.config.is_encoder_decoder: if self.model.config.is_encoder_decoder:
start_position = 1 start_position = 1
else: else:
start_position = input_length start_position = n
return {"output_ids": output_ids[0, start_position:], "conversation": conversation}
output = [] def postprocess(self, model_outputs, clean_up_tokenization_spaces=True):
for conversation_index, conversation in enumerate(conversations): output_ids = model_outputs["output_ids"]
conversation.mark_processed() answer = self.tokenizer.decode(
conversation.generated_responses.append( output_ids,
self.tokenizer.decode(
generated_responses[conversation_index][start_position:],
skip_special_tokens=True, skip_special_tokens=True,
clean_up_tokenization_spaces=clean_up_tokenization_spaces, clean_up_tokenization_spaces=clean_up_tokenization_spaces,
) )
) conversation = model_outputs["conversation"]
output.append(conversation) conversation.mark_processed()
if len(output) == 1: conversation.append_response(answer)
return output[0] return conversation
else:
return output
def _clean_padding_history(self, generated_tensor) -> List[List[int]]:
"""
Cleans the padding history. Padding may be generated in two places when multiple conversations are provided as
an input:
- at the end of the concatenated history and new user input, so that all input to the model have the same
length
- at the end of the generated response, as some responses will be longer than others
This method cleans up these padding token so that the history for each conversation is not impacted by the
batching process.
"""
outputs = []
for sequence in generated_tensor:
sequence_tokens = []
is_previous_pad = False
for token in sequence:
if token == self.tokenizer.pad_token_id:
if self.tokenizer.pad_token_id != self.tokenizer.eos_token_id:
continue
if is_previous_pad:
continue
else:
is_previous_pad = True
else:
is_previous_pad = False
if self.framework == "pt":
sequence_tokens.append(token.item())
else:
sequence_tokens.append(int(token.numpy()))
outputs.append(sequence_tokens)
return outputs
def _legacy_parse_and_tokenize(self, conversation: List[Conversation]) -> List[int]: def _legacy_parse_and_tokenize(self, conversation: Conversation) -> Dict:
eos_token_id = self.tokenizer.eos_token_id eos_token_id = self.tokenizer.eos_token_id
input_ids = [] input_ids = []
for is_user, text in conversation.iter_texts(): for is_user, text in conversation.iter_texts():
...@@ -345,14 +306,3 @@ class ConversationalPipeline(Pipeline): ...@@ -345,14 +306,3 @@ class ConversationalPipeline(Pipeline):
if len(input_ids) > self.tokenizer.model_max_length: if len(input_ids) > self.tokenizer.model_max_length:
input_ids = input_ids[-self.tokenizer.model_max_length :] input_ids = input_ids[-self.tokenizer.model_max_length :]
return input_ids return input_ids
def _parse_and_tokenize(self, conversations: List[Conversation]) -> Dict[str, Any]:
if hasattr(self.tokenizer, "_build_conversation_input_ids"):
input_ids = [self.tokenizer._build_conversation_input_ids(conversation) for conversation in conversations]
else:
# If the tokenizer cannot handle conversations, we default to only the old version
input_ids = [self._legacy_parse_and_tokenize(conversation) for conversation in conversations]
inputs = self.tokenizer.pad(
{"input_ids": input_ids}, padding="longest", return_attention_mask=True, return_tensors=self.framework
)
return inputs
from typing import TYPE_CHECKING, Optional, Union from typing import Dict
from ..feature_extraction_utils import PreTrainedFeatureExtractor from .base import GenericTensor, Pipeline
from ..modelcard import ModelCard
from ..tokenization_utils import PreTrainedTokenizer
from .base import ArgumentHandler, Pipeline
if TYPE_CHECKING:
from ..modeling_tf_utils import TFPreTrainedModel
from ..modeling_utils import PreTrainedModel
# Can't use @add_end_docstrings(PIPELINE_INIT_ARGS) here because this one does not accept `binary_output` # Can't use @add_end_docstrings(PIPELINE_INIT_ARGS) here because this one does not accept `binary_output`
...@@ -49,28 +41,24 @@ class FeatureExtractionPipeline(Pipeline): ...@@ -49,28 +41,24 @@ class FeatureExtractionPipeline(Pipeline):
the associated CUDA device id. the associated CUDA device id.
""" """
def __init__( def _sanitize_parameters(self, **kwargs):
self, return {}, {}, {}
model: Union["PreTrainedModel", "TFPreTrainedModel"],
tokenizer: PreTrainedTokenizer, def preprocess(self, inputs) -> Dict[str, GenericTensor]:
feature_extractor: Optional[PreTrainedFeatureExtractor] = None, return_tensors = self.framework
modelcard: Optional[ModelCard] = None, model_inputs = self.tokenizer(inputs, return_tensors=return_tensors)
framework: Optional[str] = None, return model_inputs
args_parser: ArgumentHandler = None,
device: int = -1, def _forward(self, model_inputs):
task: str = "", model_outputs = self.model(**model_inputs)
): return model_outputs
super().__init__(
model=model, def postprocess(self, model_outputs):
tokenizer=tokenizer, # [0] is the first available tensor, logits or last_hidden_state.
feature_extractor=feature_extractor, if self.framework == "pt":
modelcard=modelcard, return model_outputs[0].tolist()
framework=framework, elif self.framework == "tf":
args_parser=args_parser, return model_outputs[0].numpy().tolist()
device=device,
binary_output=True,
task=task,
)
def __call__(self, *args, **kwargs): def __call__(self, *args, **kwargs):
""" """
...@@ -82,10 +70,4 @@ class FeatureExtractionPipeline(Pipeline): ...@@ -82,10 +70,4 @@ class FeatureExtractionPipeline(Pipeline):
Return: Return:
A nested list of :obj:`float`: The features computed by the model. A nested list of :obj:`float`: The features computed by the model.
""" """
results = super().__call__(*args, **kwargs) return super().__call__(*args, **kwargs)
if isinstance(results, list):
# Sequential run
results = [r.tolist() for r in results]
else:
results = results.tolist()
return results
from typing import TYPE_CHECKING, Dict, List, Optional, Union from typing import Dict
import numpy as np import numpy as np
from ..file_utils import add_end_docstrings, is_tf_available, is_torch_available from ..file_utils import add_end_docstrings, is_tf_available, is_torch_available
from ..modelcard import ModelCard
from ..tokenization_utils import PreTrainedTokenizer
from ..utils import logging from ..utils import logging
from .base import PIPELINE_INIT_ARGS, ArgumentHandler, Pipeline, PipelineException from .base import PIPELINE_INIT_ARGS, GenericTensor, Pipeline, PipelineException
GenericTensor = Union[List["GenericTensor"], "torch.Tensor", "tf.Tensor"]
if TYPE_CHECKING:
from ..modeling_tf_utils import TFPreTrainedModel
from ..modeling_utils import PreTrainedModel
if is_tf_available(): if is_tf_available():
import tensorflow as tf import tensorflow as tf
from ..models.auto.modeling_tf_auto import TF_MODEL_FOR_MASKED_LM_MAPPING
if is_torch_available(): if is_torch_available():
import torch import torch
from ..models.auto.modeling_auto import MODEL_FOR_MASKED_LM_MAPPING
logger = logging.get_logger(__name__) logger = logging.get_logger(__name__)
...@@ -58,39 +47,6 @@ class FillMaskPipeline(Pipeline): ...@@ -58,39 +47,6 @@ class FillMaskPipeline(Pipeline):
This pipeline only works for inputs with exactly one token masked. This pipeline only works for inputs with exactly one token masked.
""" """
def __init__(
self,
model: Union["PreTrainedModel", "TFPreTrainedModel"],
tokenizer: PreTrainedTokenizer,
modelcard: Optional[ModelCard] = None,
framework: Optional[str] = None,
args_parser: ArgumentHandler = None,
device: int = -1,
top_k=5,
targets=None,
task: str = "",
):
super().__init__(
model=model,
tokenizer=tokenizer,
modelcard=modelcard,
framework=framework,
args_parser=args_parser,
device=device,
binary_output=True,
task=task,
)
self.check_model_type(
TF_MODEL_FOR_MASKED_LM_MAPPING if self.framework == "tf" else MODEL_FOR_MASKED_LM_MAPPING
)
self.top_k = top_k
self.targets = targets
if self.tokenizer.mask_token_id is None:
raise PipelineException(
"fill-mask", self.model.base_model_prefix, "The tokenizer does not define a `mask_token`."
)
def get_masked_index(self, input_ids: GenericTensor) -> np.ndarray: def get_masked_index(self, input_ids: GenericTensor) -> np.ndarray:
if self.framework == "tf": if self.framework == "tf":
masked_index = tf.where(input_ids == self.tokenizer.mask_token_id).numpy() masked_index = tf.where(input_ids == self.tokenizer.mask_token_id).numpy()
...@@ -124,63 +80,69 @@ class FillMaskPipeline(Pipeline): ...@@ -124,63 +80,69 @@ class FillMaskPipeline(Pipeline):
for input_ids in model_inputs["input_ids"]: for input_ids in model_inputs["input_ids"]:
self._ensure_exactly_one_mask_token(input_ids) self._ensure_exactly_one_mask_token(input_ids)
    def preprocess(self, inputs, return_tensors=None, **preprocess_parameters) -> Dict[str, GenericTensor]:
        if return_tensors is None:
            return_tensors = self.framework
        model_inputs = self.tokenizer(inputs, return_tensors=return_tensors)
        self.ensure_exactly_one_mask_token(model_inputs)
        return model_inputs

    def _forward(self, model_inputs):
        model_outputs = self.model(**model_inputs)
        model_outputs["input_ids"] = model_inputs["input_ids"][0]
        return model_outputs

    def postprocess(self, model_outputs, top_k=5, target_ids=None):
        # Cap top_k if there are targets
        if target_ids is not None and target_ids.shape[0] < top_k:
            top_k = target_ids.shape[0]
        input_ids = model_outputs["input_ids"]
        outputs = model_outputs["logits"]
        result = []

        if self.framework == "tf":
            masked_index = tf.where(input_ids == self.tokenizer.mask_token_id).numpy()
            # Fill mask pipeline supports only one ${mask_token} per sample
            logits = outputs[0, masked_index.item(), :]
            probs = tf.nn.softmax(logits)
            if target_ids is not None:
                probs = tf.gather_nd(probs, tf.reshape(target_ids, (-1, 1)))

            topk = tf.math.top_k(probs, k=top_k)
            values, predictions = topk.values.numpy(), topk.indices.numpy()
        else:
            masked_index = torch.nonzero(input_ids == self.tokenizer.mask_token_id, as_tuple=False)
            # Fill mask pipeline supports only one ${mask_token} per sample
            logits = outputs[0, masked_index.item(), :]
            probs = logits.softmax(dim=0)
            if target_ids is not None:
                probs = probs[..., target_ids]
            values, predictions = probs.topk(top_k)

        for v, p in zip(values.tolist(), predictions.tolist()):
            tokens = input_ids.numpy()
            if target_ids is not None:
                p = target_ids[p].tolist()
            tokens[masked_index] = p
            # Filter padding out:
            tokens = tokens[np.where(tokens != self.tokenizer.pad_token_id)]
            result.append(
                {
                    "sequence": self.tokenizer.decode(tokens, skip_special_tokens=True),
                    "score": v,
                    "token": p,
                    "token_str": self.tokenizer.decode(p),
                }
            )
        return result

    def get_target_ids(self, targets, top_k=None):
        if isinstance(targets, str):
            targets = [targets]
        try:
            vocab = self.tokenizer.get_vocab()
        except Exception:
@@ -217,65 +179,44 @@ class FillMaskPipeline(Pipeline):
        if len(target_ids) == 0:
            raise ValueError("At least one target must be provided when passed.")
        target_ids = np.array(target_ids)
        return target_ids

    def _sanitize_parameters(self, top_k=None, targets=None):
        postprocess_params = {}

        if targets is not None:
            target_ids = self.get_target_ids(targets, top_k)
            postprocess_params["target_ids"] = target_ids

        if top_k is not None:
            postprocess_params["top_k"] = top_k

        if self.tokenizer.mask_token_id is None:
            raise PipelineException(
                "fill-mask", self.model.base_model_prefix, "The tokenizer does not define a `mask_token`."
            )
        return {}, {}, postprocess_params
values, predictions = probs.topk(top_k) def __call__(self, inputs, *args, **kwargs):
"""
Fill the masked token in the text(s) given as inputs.
for v, p in zip(values.tolist(), predictions.tolist()): Args:
tokens = input_ids.numpy() args (:obj:`str` or :obj:`List[str]`):
if targets is not None: One or several texts (or one list of prompts) with masked tokens.
p = target_ids[p].tolist() targets (:obj:`str` or :obj:`List[str]`, `optional`):
tokens[masked_index] = p When passed, the model will limit the scores to the passed targets instead of looking up in the whole
# Filter padding out: vocab. If the provided targets are not in the model vocab, they will be tokenized and the first
tokens = tokens[np.where(tokens != self.tokenizer.pad_token_id)] resulting token will be used (with a warning, and that might be slower).
result.append( top_k (:obj:`int`, `optional`):
{ When passed, overrides the number of predictions to return.
"sequence": self.tokenizer.decode(tokens, skip_special_tokens=True),
"score": v,
"token": p,
"token_str": self.tokenizer.decode(p),
}
)
# Append Return:
results += [result] A list or a list of list of :obj:`dict`: Each result comes as list of dictionaries with the following keys:
if len(results) == 1: - **sequence** (:obj:`str`) -- The corresponding input with the mask token prediction.
return results[0] - **score** (:obj:`float`) -- The corresponding probability.
return results - **token** (:obj:`int`) -- The predicted token id (to replace the masked one).
- **token** (:obj:`str`) -- The predicted token (to replace the masked one).
"""
return super().__call__(inputs, **kwargs)
import os
from typing import List, Union

import requests

from ..file_utils import add_end_docstrings, is_torch_available, is_vision_available, requires_backends
from ..utils import logging
from .base import PIPELINE_INIT_ARGS, Pipeline


if is_vision_available():
    from PIL import Image

if is_torch_available():
    import torch

    from ..models.auto.modeling_auto import MODEL_FOR_IMAGE_CLASSIFICATION_MAPPING


logger = logging.get_logger(__name__)
@@ -37,24 +30,15 @@ class ImageClassificationPipeline(Pipeline):
    <https://huggingface.co/models?filter=image-classification>`__.
    """
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)

        if self.framework == "tf":
            raise ValueError(f"The {self.__class__} is only available in PyTorch.")

        requires_backends(self, "vision")
        self.check_model_type(MODEL_FOR_IMAGE_CLASSIFICATION_MAPPING)

    @staticmethod
    def load_image(image: Union[str, "Image.Image"]):
        if isinstance(image, str):
@@ -77,7 +61,13 @@ class ImageClassificationPipeline(Pipeline):
            image = image.convert("RGB")
        return image
    def _sanitize_parameters(self, top_k=None):
        postprocess_params = {}
        if top_k is not None:
            postprocess_params["top_k"] = top_k
        return {}, {}, postprocess_params

    def __call__(self, images: Union[str, List[str], "Image", List["Image"]], **kwargs):
        """
        Assign labels to the image(s) passed as inputs.
@@ -106,34 +96,23 @@ class ImageClassificationPipeline(Pipeline):
            - **label** (:obj:`str`) -- The label identified by the model.
            - **score** (:obj:`float`) -- The score attributed by the model for that label.
        """
        return super().__call__(images, **kwargs)

    def preprocess(self, image):
        image = self.load_image(image)
        model_inputs = self.feature_extractor(images=image, return_tensors="pt")
        return model_inputs

    def _forward(self, model_inputs):
        model_outputs = self.model(**model_inputs)
        return model_outputs

    def postprocess(self, model_outputs, top_k=5):
        if top_k > self.model.config.num_labels:
            top_k = self.model.config.num_labels

        probs = model_outputs.logits.softmax(-1)[0]
        scores, ids = probs.topk(top_k)

        scores = scores.tolist()
        ids = ids.tolist()
        return [{"score": score, "label": self.model.config.id2label[_id]} for score, _id in zip(scores, ids)]
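A hedged usage sketch of the new single-image ``preprocess``/``_forward``/``postprocess`` flow; the checkpoint and file path below are placeholders, not values taken from this diff.

.. code-block::

    from transformers import pipeline

    # google/vit-base-patch16-224 is an example image-classification checkpoint.
    classifier = pipeline("image-classification", model="google/vit-base-patch16-224")
    outputs = classifier("path/to/cat.png", top_k=3)
    # A list of at most top_k dicts, each with "score" and "label".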
import os
from typing import Any, Dict, List, Union

import requests

from ..file_utils import add_end_docstrings, is_torch_available, is_vision_available, requires_backends
from ..utils import logging
from .base import PIPELINE_INIT_ARGS, Pipeline


if is_vision_available():
    from PIL import Image
@@ -40,24 +36,15 @@ class ObjectDetectionPipeline(Pipeline):
    <https://huggingface.co/models?filter=object-detection>`__.
    """
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)

        if self.framework == "tf":
            raise ValueError(f"The {self.__class__} is only available in PyTorch.")

        requires_backends(self, "vision")
        self.check_model_type(MODEL_FOR_OBJECT_DETECTION_MAPPING)

    @staticmethod
    def load_image(image: Union[str, "Image.Image"]):
        if isinstance(image, str):
@@ -80,11 +67,13 @@ class ObjectDetectionPipeline(Pipeline):
            image = image.convert("RGB")
        return image
    def _sanitize_parameters(self, **kwargs):
        postprocess_kwargs = {}
        if "threshold" in kwargs:
            postprocess_kwargs["threshold"] = kwargs["threshold"]
        return {}, {}, postprocess_kwargs

    def __call__(self, *args, **kwargs) -> Union[Predictions, List[Prediction]]:
        """
        Detect objects (bounding boxes & classes) in the image(s) passed as inputs.
@@ -112,47 +101,42 @@ class ObjectDetectionPipeline(Pipeline):
            - **score** (:obj:`float`) -- The score attributed by the model for that label.
            - **box** (:obj:`List[Dict[str, int]]`) -- The bounding box of detected object in image's original size.
        """
        return super().__call__(*args, **kwargs)

    def preprocess(self, image):
        image = self.load_image(image)
        target_size = torch.IntTensor([[image.height, image.width]])
        inputs = self.feature_extractor(images=[image], return_tensors="pt")
        inputs["target_size"] = target_size
        return inputs

    def _forward(self, model_inputs):
        target_size = model_inputs.pop("target_size")
        outputs = self.model(**model_inputs)
        model_outputs = {"outputs": outputs, "target_size": target_size}
        return model_outputs

    def postprocess(self, model_outputs, threshold=0.9):
        raw_annotations = self.feature_extractor.post_process(model_outputs["outputs"], model_outputs["target_size"])
        raw_annotation = raw_annotations[0]
        keep = raw_annotation["scores"] > threshold
        scores = raw_annotation["scores"][keep]
        labels = raw_annotation["labels"][keep]
        boxes = raw_annotation["boxes"][keep]

        raw_annotation["scores"] = scores.tolist()
        raw_annotation["labels"] = [self.model.config.id2label[label.item()] for label in labels]
        raw_annotation["boxes"] = [self._get_bounding_box(box) for box in boxes]

        # {"scores": [...], ...} --> [{"score":x, ...}, ...]
        keys = ["score", "label", "box"]
        annotation = [
            dict(zip(keys, vals))
            for vals in zip(raw_annotation["scores"], raw_annotation["labels"], raw_annotation["boxes"])
        ]

        return annotation

    def _get_bounding_box(self, box: "torch.Tensor") -> Dict[str, int]:
        """
...
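A minimal sketch of calling the migrated object-detection pipeline (PyTorch only); the checkpoint, image path and threshold value below are illustrative.

.. code-block::

    from transformers import pipeline

    # facebook/detr-resnet-50 is an example object-detection checkpoint.
    detector = pipeline("object-detection", model="facebook/detr-resnet-50")
    outputs = detector("path/to/street_scene.jpg", threshold=0.9)
    # Each entry carries "score", "label" and a "box" built by _get_bounding_box.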
import warnings
from collections.abc import Iterable
from typing import TYPE_CHECKING, Dict, List, Optional, Tuple, Union
@@ -109,6 +110,7 @@ class QuestionAnsweringPipeline(Pipeline):
    """

    default_input_names = "question,context"
    handle_impossible_answer = False

    def __init__(
        self,
@@ -158,6 +160,44 @@ class QuestionAnsweringPipeline(Pipeline):
        else:
            return SquadExample(None, question, context, None, None, None)
    def _sanitize_parameters(
        self,
        padding=None,
        topk=None,
        top_k=None,
        doc_stride=None,
        max_answer_len=None,
        max_seq_len=None,
        max_question_len=None,
        handle_impossible_answer=None,
        **kwargs
    ):
        # Set defaults values
        preprocess_params = {}
        if padding is not None:
            preprocess_params["padding"] = padding
        if doc_stride is not None:
            preprocess_params["doc_stride"] = doc_stride
        if max_question_len is not None:
            preprocess_params["max_question_len"] = max_question_len

        postprocess_params = {}
        if topk is not None and top_k is None:
            warnings.warn("topk parameter is deprecated, use top_k instead", UserWarning)
            top_k = topk
        if top_k is not None:
            if top_k < 1:
                raise ValueError(f"top_k parameter should be >= 1 (got {top_k})")
            postprocess_params["top_k"] = top_k
        if max_answer_len is not None:
            if max_answer_len < 1:
                raise ValueError(f"max_answer_len parameter should be >= 1 (got {max_answer_len})")
        if max_answer_len is not None:
            postprocess_params["max_answer_len"] = max_answer_len
        if handle_impossible_answer is not None:
            postprocess_params["handle_impossible_answer"] = handle_impossible_answer
        return preprocess_params, {}, postprocess_params
    def __call__(self, *args, **kwargs):
        """
        Answer the question(s) given as inputs by using the context(s).
@@ -201,50 +241,36 @@ class QuestionAnsweringPipeline(Pipeline):
            - **end** (:obj:`int`) -- The character end index of the answer (in the tokenized version of the input).
            - **answer** (:obj:`str`) -- The answer to the question.
        """
# Set defaults values
kwargs.setdefault("padding", "longest" if getattr(self.tokenizer, "pad_token", None) is not None else False)
kwargs.setdefault("topk", 1)
kwargs.setdefault("doc_stride", 128)
kwargs.setdefault("max_answer_len", 15)
kwargs.setdefault("max_seq_len", 384)
kwargs.setdefault("max_question_len", 64)
kwargs.setdefault("handle_impossible_answer", False)
if kwargs["topk"] < 1:
raise ValueError(f"topk parameter should be >= 1 (got {kwargs['topk']})")
if kwargs["max_answer_len"] < 1:
raise ValueError(f"max_answer_len parameter should be >= 1 (got {(kwargs['max_answer_len'])}")
        # Convert inputs to features
        examples = self._args_parser(*args, **kwargs)
        if len(examples) == 1:
            return super().__call__(examples[0], **kwargs)
        return super().__call__(examples, **kwargs)

    def preprocess(self, example, padding="do_not_pad", doc_stride=128, max_question_len=64, max_seq_len=384):
        if not self.tokenizer.is_fast:
            features = squad_convert_examples_to_features(
                examples=[example],
                tokenizer=self.tokenizer,
                max_seq_length=max_seq_len,
                doc_stride=doc_stride,
                max_query_length=max_question_len,
                padding_strategy=PaddingStrategy.MAX_LENGTH,
                is_training=False,
                tqdm_enabled=False,
            )
        else:
            # Define the side we want to truncate / pad and the text/pair sorting
            question_first = self.tokenizer.padding_side == "right"

            encoded_inputs = self.tokenizer(
                text=example.question_text if question_first else example.context_text,
                text_pair=example.context_text if question_first else example.question_text,
                padding=padding,
                truncation="only_second" if question_first else "only_first",
                max_length=max_seq_len,
                stride=doc_stride,
                return_tensors="np",
                return_token_type_ids=True,
                return_overflowing_tokens=True,
@@ -297,31 +323,40 @@ class QuestionAnsweringPipeline(Pipeline):
                    qas_id=None,
                )
            )
        return {"features": features, "example": example}
    def _forward(self, model_inputs):
        features = model_inputs["features"]
        example = model_inputs["example"]
        model_input_names = self.tokenizer.model_input_names
        fw_args = {k: [feature.__dict__[k] for feature in features] for k in model_input_names}

        if self.framework == "tf":
            fw_args = {k: tf.constant(v) for (k, v) in fw_args.items()}
            start, end = self.model(fw_args)[:2]
            start, end = start.numpy(), end.numpy()
        elif self.framework == "pt":
            # Retrieve the score for the context tokens only (removing question tokens)
            fw_args = {k: torch.tensor(v, device=self.device) for (k, v) in fw_args.items()}
            # On Windows, the default int type in numpy is np.int32 so we get some non-long tensors.
            fw_args = {k: v.long() if v.dtype == torch.int32 else v for (k, v) in fw_args.items()}
            start, end = self.model(**fw_args)[:2]
            start, end = start.cpu().numpy(), end.cpu().numpy()
        return {"start": start, "end": end, "features": features, "example": example}
    def postprocess(
        self,
        model_outputs,
        top_k=1,
        handle_impossible_answer=False,
        max_answer_len=15,
    ):
        min_null_score = 1000000  # large and positive
        answers = []
        start_ = model_outputs["start"][0]
        end_ = model_outputs["end"][0]
        feature = model_outputs["features"][0]
        example = model_outputs["example"]
        # Ensure padded tokens & question tokens cannot belong to the set of candidate answers.
        undesired_tokens = np.abs(np.array(feature.p_mask) - 1) & feature.attention_mask
@@ -336,15 +371,13 @@ class QuestionAnsweringPipeline(Pipeline):
        start_ = np.exp(start_ - np.log(np.sum(np.exp(start_), axis=-1, keepdims=True)))
        end_ = np.exp(end_ - np.log(np.sum(np.exp(end_), axis=-1, keepdims=True)))

        if handle_impossible_answer:
            min_null_score = min(min_null_score, (start_[0] * end_[0]).item())

        # Mask CLS
        start_[0] = end_[0] = 0.0

        starts, ends, scores = self.decode(start_, end_, top_k, max_answer_len, undesired_tokens)
        if not self.tokenizer.is_fast:
            char_to_word = np.array(example.char_to_word_offset)
@@ -397,15 +430,13 @@ class QuestionAnsweringPipeline(Pipeline):
                }
            )

        if handle_impossible_answer:
            answers.append({"score": min_null_score, "start": 0, "end": 0, "answer": ""})

        answers = sorted(answers, key=lambda x: x["score"], reverse=True)[:top_k]
        if len(answers) == 1:
            return answers[0]
        return answers

    def decode(
        self, start: np.ndarray, end: np.ndarray, topk: int, max_answer_len: int, undesired_tokens: np.ndarray
...
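A small usage sketch of the reworked question-answering pipeline, assuming an illustrative SQuAD checkpoint; ``topk`` still works but is now routed through ``_sanitize_parameters`` and emits a deprecation warning in favour of ``top_k``.

.. code-block::

    from transformers import pipeline

    # distilbert-base-cased-distilled-squad is an example extractive-QA checkpoint.
    qa = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")
    answer = qa(question="Where do I live?", context="My name is Sarah and I live in London.")
    # A dict with "score", "start", "end" and "answer" for a single question/context pair.
    answers = qa(question="Where do I live?", context="My name is Sarah and I live in London.", top_k=2)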
@@ -17,7 +17,7 @@ class TableQuestionAnsweringArgumentHandler(ArgumentHandler):
    Handles arguments for the TableQuestionAnsweringPipeline
    """

    def __call__(self, table=None, query=None, **kwargs):
        # Returns tqa_pipeline_inputs of shape:
        # [
        #   {"table": pd.DataFrame, "query": List[str]},
@@ -60,7 +60,7 @@ class TableQuestionAnsweringArgumentHandler(ArgumentHandler):
            tqa_pipeline_input["table"] = pd.DataFrame(tqa_pipeline_input["table"])

        return tqa_pipeline_inputs


@add_end_docstrings(PIPELINE_INIT_ARGS)
@@ -235,20 +235,45 @@ class TableQuestionAnsweringPipeline(Pipeline):
            - **cells** (:obj:`List[str]`) -- List of strings made up of the answer cell values.
            - **aggregator** (:obj:`str`) -- If the model has an aggregator, this returns the aggregator.
        """
        pipeline_inputs = self._args_parser(*args, **kwargs)

        results = super().__call__(pipeline_inputs, **kwargs)
        if len(results) == 1:
            return results[0]
        return results

    def _sanitize_parameters(self, sequential=None, padding=None, truncation=None, **kwargs):
        preprocess_params = {}
        if padding is not None:
            preprocess_params["padding"] = padding
        if truncation is not None:
            preprocess_params["truncation"] = truncation

        forward_params = {}
        if sequential is not None:
            forward_params["sequential"] = sequential
        return preprocess_params, forward_params, {}

    def preprocess(self, pipeline_input, sequential=None, padding=True, truncation="drop_rows_to_fit"):
        table, query = pipeline_input["table"], pipeline_input["query"]
        if table.empty:
            raise ValueError("table is empty")
        if query is None or query == "":
            raise ValueError("query is empty")
        inputs = self.tokenizer(table, query, return_tensors=self.framework, truncation=truncation, padding=padding)
        inputs["table"] = table
        return inputs

    def _forward(self, model_inputs, sequential=False):
        table = model_inputs.pop("table")
        outputs = self.sequential_inference(**model_inputs) if sequential else self.batch_inference(**model_inputs)
        model_outputs = {"model_inputs": model_inputs, "table": table, "outputs": outputs}
        return model_outputs

    def postprocess(self, model_outputs):
        inputs = model_outputs["model_inputs"]
        table = model_outputs["table"]
        outputs = model_outputs["outputs"]
        if self.aggregate:
            logits, logits_agg = outputs[:2]
            predictions = self.tokenizer.convert_logits_to_predictions(inputs, logits.detach(), logits_agg)
@@ -282,5 +307,4 @@ class TableQuestionAnsweringPipeline(Pipeline):
            answers.append(answer)
        if len(answer) == 0:
            raise PipelineException("Empty answer")
        return answers if len(answers) > 1 else answers[0]
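A hedged sketch of the table-question-answering flow after the split into ``preprocess``, ``_forward`` and ``postprocess``; the TAPAS checkpoint and table content are illustrative only.

.. code-block::

    import pandas as pd

    from transformers import pipeline

    # google/tapas-base-finetuned-wtq is an example TAPAS checkpoint.
    table_qa = pipeline("table-question-answering", model="google/tapas-base-finetuned-wtq")
    table = pd.DataFrame({"Repository": ["Transformers", "Datasets"], "Stars": ["36542", "4512"]})
    result = table_qa(table=table, query="How many stars does the Transformers repository have?")
    # `sequential=True` is now forwarded to `_forward` and switches to sequential_inference.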
import enum

from ..file_utils import add_end_docstrings, is_tf_available, is_torch_available
from ..tokenization_utils import TruncationStrategy
@@ -17,6 +17,11 @@ if is_torch_available():
logger = logging.get_logger(__name__)


class ReturnType(enum.Enum):
    TENSORS = 0
    TEXT = 1


@add_end_docstrings(PIPELINE_INIT_ARGS)
class Text2TextGenerationPipeline(Pipeline):
    """
@@ -46,6 +51,32 @@ class Text2TextGenerationPipeline(Pipeline):
            else MODEL_FOR_SEQ_TO_SEQ_CAUSAL_LM_MAPPING
        )
    def _sanitize_parameters(
        self,
        return_tensors=None,
        return_text=None,
        return_type=None,
        clean_up_tokenization_spaces=None,
        truncation=None,
        **generate_kwargs
    ):
        preprocess_params = {}
        if truncation is not None:
            preprocess_params["truncation"] = truncation

        forward_params = generate_kwargs

        postprocess_params = {}
        if return_tensors is not None and return_type is None:
            return_type = ReturnType.TENSORS if return_tensors else ReturnType.TEXT
        if return_type is not None:
            postprocess_params["return_type"] = return_type

        if clean_up_tokenization_spaces is not None:
            postprocess_params["clean_up_tokenization_spaces"] = clean_up_tokenization_spaces

        return preprocess_params, forward_params, postprocess_params
    def check_inputs(self, input_length: int, min_length: int, max_length: int):
        """
        Checks whether there might be something wrong with given input with regard to the model.
@@ -55,9 +86,8 @@ class Text2TextGenerationPipeline(Pipeline):

    def _parse_and_tokenize(self, *args, truncation):
        prefix = self.model.config.prefix if self.model.config.prefix is not None else ""
        if isinstance(args[0], list):
            if self.tokenizer.pad_token_id is None:
                raise ValueError("Please make sure that the tokenizer has a pad_token_id when using a batch input")
            args = ([prefix + arg for arg in args[0]],)
            padding = True
@@ -68,21 +98,13 @@ class Text2TextGenerationPipeline(Pipeline):
            raise ValueError(
                f" `args[0]`: {args[0]} has the wrong format. It should be either of type `str` or type `list`"
            )
        inputs = self.tokenizer(*args, padding=padding, truncation=truncation, return_tensors=self.framework)
        # This is produced by tokenizers but is an invalid generate kwargs
        if "token_type_ids" in inputs:
            del inputs["token_type_ids"]
        return inputs
    def __call__(self, *args, **kwargs):
        r"""
        Generate the output text(s) using text(s) given as inputs.
@@ -111,43 +133,40 @@ class Text2TextGenerationPipeline(Pipeline):
              -- The token ids of the generated text.
        """
        result = super().__call__(*args, **kwargs)
        if isinstance(result, dict):
            return [result]
        return result

    def preprocess(self, inputs, truncation=TruncationStrategy.DO_NOT_TRUNCATE, **kwargs):
        inputs = self._parse_and_tokenize(inputs, truncation=truncation, **kwargs)
        return inputs

    def _forward(self, model_inputs, **generate_kwargs):
        if self.framework == "pt":
            input_length = model_inputs["input_ids"].shape[-1]
        elif self.framework == "tf":
            input_length = tf.shape(model_inputs["input_ids"])[-1].numpy()

        generate_kwargs["min_length"] = generate_kwargs.get("min_length", self.model.config.min_length)
        generate_kwargs["max_length"] = generate_kwargs.get("max_length", self.model.config.max_length)
        self.check_inputs(input_length, generate_kwargs["min_length"], generate_kwargs["max_length"])
        output_ids = self.model.generate(**model_inputs, **generate_kwargs)
        return {"output_ids": output_ids}

    def postprocess(self, model_outputs, return_type=ReturnType.TEXT, clean_up_tokenization_spaces=False):
        record = {}
        if return_type == ReturnType.TENSORS:
            record = {f"{self.return_name}_token_ids": model_outputs}
        elif return_type == ReturnType.TEXT:
            record = {
                f"{self.return_name}_text": self.tokenizer.decode(
                    model_outputs["output_ids"][0],
                    skip_special_tokens=True,
                    clean_up_tokenization_spaces=clean_up_tokenization_spaces,
                )
            }
        return record
@add_end_docstrings(PIPELINE_INIT_ARGS)
@@ -239,23 +258,6 @@ class TranslationPipeline(Text2TextGenerationPipeline):
    # Used in the return key of the pipeline.
    return_name = "translation"
src_lang: Optional[str] = None
tgt_lang: Optional[str] = None
def __init__(self, *args, src_lang=None, tgt_lang=None, **kwargs):
super().__init__(*args, **kwargs)
if src_lang is not None:
self.src_lang = src_lang
if tgt_lang is not None:
self.tgt_lang = tgt_lang
if src_lang is None and tgt_lang is None:
# Backward compatibility, direct arguments use is preferred.
task = kwargs.get("task", "")
items = task.split("_")
if task and len(items) == 4:
# translation, XX, to YY
self.src_lang = items[1]
self.tgt_lang = items[3]
    def check_inputs(self, input_length: int, min_length: int, max_length: int):
        if input_length > 0.9 * max_length:
@@ -265,25 +267,31 @@ class TranslationPipeline(Text2TextGenerationPipeline):
            )
        return True
    def preprocess(self, *args, truncation=TruncationStrategy.DO_NOT_TRUNCATE, src_lang=None, tgt_lang=None):
        if getattr(self.tokenizer, "_build_translation_inputs", None):
            return self.tokenizer._build_translation_inputs(
                *args, return_tensors=self.framework, truncation=truncation, src_lang=src_lang, tgt_lang=tgt_lang
            )
        else:
            return super()._parse_and_tokenize(*args, truncation=truncation)

    def _sanitize_parameters(self, src_lang=None, tgt_lang=None, **kwargs):
        preprocess_params, forward_params, postprocess_params = super()._sanitize_parameters(**kwargs)
        if src_lang is not None:
            preprocess_params["src_lang"] = src_lang
        if tgt_lang is not None:
            preprocess_params["tgt_lang"] = tgt_lang
        if src_lang is None and tgt_lang is None:
            # Backward compatibility, direct arguments use is preferred.
            task = kwargs.get("task", self.task)
            items = task.split("_")
            if task and len(items) == 4:
                # translation, XX, to YY
                preprocess_params["src_lang"] = items[1]
                preprocess_params["tgt_lang"] = items[3]
        return preprocess_params, forward_params, postprocess_params

    def __call__(self, *args, **kwargs):
        r"""
        Translate the text(s) given as inputs.
@@ -313,10 +321,4 @@ class TranslationPipeline(Text2TextGenerationPipeline):
            - **translation_token_ids** (:obj:`torch.Tensor` or :obj:`tf.Tensor`, present when ``return_tensors=True``)
              -- The token ids of the translation.
        """
        return super().__call__(*args, **kwargs)
src_lang = src_lang if src_lang is not None else self.src_lang
tgt_lang = tgt_lang if tgt_lang is not None else self.tgt_lang
with self.device_placement():
inputs = self._parse_and_tokenize(*args, truncation=truncation, src_lang=src_lang, tgt_lang=tgt_lang)
return self._generate(inputs, return_tensors, return_text, clean_up_tokenization_spaces, generate_kwargs)
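A short sketch of the translation pipeline with the new call-time ``src_lang``/``tgt_lang`` handling; the T5 checkpoint and sentence below are illustrative.

.. code-block::

    from transformers import pipeline

    # t5-small is an example seq2seq checkpoint that understands the en->fr task prefix.
    translator = pipeline("translation_en_to_fr", model="t5-small")
    print(translator("How old are you?"))
    # [{"translation_text": "..."}]
    # When src_lang/tgt_lang are omitted, _sanitize_parameters falls back to parsing
    # the task name "translation_XX_to_YY".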
from typing import Dict

import numpy as np

from ..file_utils import ExplicitEnum, add_end_docstrings, is_tf_available, is_torch_available
from .base import PIPELINE_INIT_ARGS, GenericTensor, Pipeline


if is_tf_available():
@@ -61,9 +61,10 @@ class TextClassificationPipeline(Pipeline):
    <https://huggingface.co/models?filter=text-classification>`__.
    """
    return_all_scores = False
    function_to_apply = ClassificationFunction.NONE

    def __init__(self, **kwargs):
        super().__init__(**kwargs)

        self.check_model_type(
@@ -72,22 +73,24 @@ class TextClassificationPipeline(Pipeline):
            else MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING
        )

    def _sanitize_parameters(self, return_all_scores=None, function_to_apply=None, **tokenizer_kwargs):
        preprocess_params = tokenizer_kwargs

        postprocess_params = {}
        if hasattr(self.model.config, "return_all_scores") and return_all_scores is None:
            return_all_scores = self.model.config.return_all_scores

        if return_all_scores is not None:
            postprocess_params["return_all_scores"] = return_all_scores

        if isinstance(function_to_apply, str):
            function_to_apply = ClassificationFunction[function_to_apply.upper()]

        if function_to_apply is not None:
            postprocess_params["function_to_apply"] = function_to_apply
        return preprocess_params, {}, postprocess_params

    def __call__(self, *args, **kwargs):
        """
        Classify the text(s) given as inputs.
@@ -120,19 +123,32 @@ class TextClassificationPipeline(Pipeline):
            If ``self.return_all_scores=True``, one such dictionary is returned per label.
        """
        return super().__call__(*args, **kwargs)

    def preprocess(self, inputs, **tokenizer_kwargs) -> Dict[str, GenericTensor]:
        return_tensors = self.framework
        return self.tokenizer(inputs, return_tensors=return_tensors, **tokenizer_kwargs)
    def _forward(self, model_inputs):
        return self.model(**model_inputs)

    def postprocess(self, model_outputs, function_to_apply=None, return_all_scores=False):
        # Default value before `set_parameters`
        if function_to_apply is None:
            if self.model.config.problem_type == "multi_label_classification" or self.model.config.num_labels == 1:
                function_to_apply = ClassificationFunction.SIGMOID
            elif self.model.config.problem_type == "single_label_classification" or self.model.config.num_labels > 1:
                function_to_apply = ClassificationFunction.SOFTMAX
            elif hasattr(self.model.config, "function_to_apply") and function_to_apply is None:
                function_to_apply = self.model.config.function_to_apply
            else:
                function_to_apply = ClassificationFunction.NONE

        outputs = model_outputs["logits"][0]
        if self.framework == "pt":
            outputs = outputs.cpu().numpy()
        else:
            outputs = outputs.numpy()

        if function_to_apply == ClassificationFunction.SIGMOID:
            scores = sigmoid(outputs)
@@ -144,11 +160,13 @@ class TextClassificationPipeline(Pipeline):
            raise ValueError(f"Unrecognized `function_to_apply` argument: {function_to_apply}")

        if return_all_scores:
            return [{"label": self.model.config.id2label[i], "score": score.item()} for i, score in enumerate(scores)]
        else:
            return {"label": self.model.config.id2label[scores.argmax().item()], "score": scores.max().item()}

    def run_multi(self, inputs, preprocess_params, forward_params, postprocess_params):
        return [self.run_single(item, preprocess_params, forward_params, postprocess_params)[0] for item in inputs]

    def run_single(self, inputs, preprocess_params, forward_params, postprocess_params):
        "This pipeline is odd and returns a list even when a single item is run."
        return [super().run_single(inputs, preprocess_params, forward_params, postprocess_params)]
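A minimal sketch of the text-classification pipeline after the rework; the sentiment checkpoint below is illustrative. Note the ``run_single`` override above, which keeps the historical behaviour of returning a list even for a single input.

.. code-block::

    from transformers import pipeline

    # distilbert-base-uncased-finetuned-sst-2-english is an example sentiment checkpoint.
    classifier = pipeline("text-classification", model="distilbert-base-uncased-finetuned-sst-2-english")
    print(classifier("This movie was great!"))
    # [{"label": ..., "score": ...}]
    print(classifier("This movie was great!", return_all_scores=True))
    # [[{"label": ..., "score": ...}, {"label": ..., "score": ...}]]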
import enum

from transformers import MODEL_FOR_CAUSAL_LM_MAPPING, TF_MODEL_FOR_CAUSAL_LM_MAPPING

from ..file_utils import add_end_docstrings
from .base import PIPELINE_INIT_ARGS, Pipeline


class ReturnType(enum.Enum):
    TENSORS = 0
    NEW_TEXT = 1
    FULL_TEXT = 2


@add_end_docstrings(PIPELINE_INIT_ARGS)
class TextGenerationPipeline(Pipeline):
    """
@@ -32,29 +40,72 @@ class TextGenerationPipeline(Pipeline):
    begging for his blessing. <eod> </s> <eos>
    """
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.check_model_type(
            TF_MODEL_FOR_CAUSAL_LM_MAPPING if self.framework == "tf" else MODEL_FOR_CAUSAL_LM_MAPPING
        )
        if "prefix" not in self._preprocess_params:
            # This is very specific. The logic is quite complex and needs to be done
            # as a "default".
            # It also defines both some preprocess_kwargs and generate_kwargs
            # which is why we cannot put them in their respective methods.
            prefix = None
            if self.model.config.prefix is not None:
                prefix = self.model.config.prefix
            if prefix is None and self.model.__class__.__name__ in [
                "XLNetLMHeadModel",
                "TransfoXLLMHeadModel",
                "TFXLNetLMHeadModel",
                "TFTransfoXLLMHeadModel",
            ]:
                # For XLNet and TransformerXL we add an article to the prompt to give more state to the model.
                prefix = self.XL_PREFIX
            if prefix is not None:
                # Recalculate some generate_kwargs linked to prefix.
                preprocess_params, forward_params, _ = self._sanitize_parameters(prefix=prefix, **self._forward_params)
                self._preprocess_params = {**self._preprocess_params, **preprocess_params}
                self._forward_params = {**self._forward_params, **forward_params}
    def _sanitize_parameters(
        self,
        return_full_text=None,
        return_tensors=None,
        return_text=None,
        return_type=None,
        clean_up_tokenization_spaces=None,
        prefix=None,
        **generate_kwargs
    ):
        preprocess_params = {}
        if prefix is not None:
            preprocess_params["prefix"] = prefix
        if prefix:
            prefix_inputs = self.tokenizer(
                prefix, padding=False, add_special_tokens=False, return_tensors=self.framework
            )
            prefix_length = prefix_inputs["input_ids"].shape[-1]
            if "max_length" in generate_kwargs:
                generate_kwargs["max_length"] += prefix_length
            else:
                generate_kwargs["max_length"] = self.model.config.max_length + prefix_length

            if "min_length" in generate_kwargs:
                generate_kwargs["min_length"] += prefix_length

        forward_params = generate_kwargs

        postprocess_params = {}
        if return_full_text is not None and return_type is None:
            return_type = ReturnType.FULL_TEXT if return_full_text else ReturnType.NEW_TEXT
        if return_tensors is not None and return_type is None:
            return_type = ReturnType.TENSORS
        if return_type is not None:
            postprocess_params["return_type"] = return_type
        if clean_up_tokenization_spaces is not None:
            postprocess_params["clean_up_tokenization_spaces"] = clean_up_tokenization_spaces

        return preprocess_params, forward_params, postprocess_params
    # overriding _parse_and_tokenize to allow for unusual language-modeling tokenizer arguments
    def _parse_and_tokenize(self, *args, **kwargs):
@@ -67,16 +118,7 @@ class TextGenerationPipeline(Pipeline):
        return super()._parse_and_tokenize(*args, **kwargs)

    def __call__(self, text_inputs, **kwargs):
        """
        Complete the prompt(s) given as inputs.
@@ -105,68 +147,36 @@ class TextGenerationPipeline(Pipeline):
            - **generated_token_ids** (:obj:`torch.Tensor` or :obj:`tf.Tensor`, present when ``return_tensors=True``)
              -- The token ids of the generated text.
        """
        return super().__call__(text_inputs, **kwargs)
return_full_text = return_full_text if return_full_text is not None else self.return_full_text
if isinstance(text_inputs, str):
text_inputs = [text_inputs]
results = []
for prompt_text in text_inputs:
# Manage correct placement of the tensors
with self.device_placement():
if prefix is None and self.model.__class__.__name__ in [
"XLNetLMHeadModel",
"TransfoXLLMHeadModel",
"TFXLNetLMHeadModel",
"TFTransfoXLLMHeadModel",
]:
# For XLNet and TransformerXL we add an article to the prompt to give more state to the model.
prefix = self.XL_PREFIX
if prefix: def preprocess(self, prompt_text, prefix=""):
prefix_inputs = self._parse_and_tokenize(prefix, padding=False, add_special_tokens=False) inputs = self.tokenizer(
# This impacts max_length and min_length argument that need adjusting. prefix + prompt_text, padding=False, add_special_tokens=False, return_tensors=self.framework
prefix_length = prefix_inputs["input_ids"].shape[-1] )
if generate_kwargs.get("max_length", None) is not None: inputs["prompt_text"] = prompt_text
generate_kwargs["max_length"] += prefix_length return inputs
else:
generate_kwargs["max_length"] = self.model.config.max_length + prefix_length def _forward(self, model_inputs, **generate_kwargs):
input_ids = model_inputs["input_ids"]
if generate_kwargs.get("min_length", None) is not None: prompt_text = model_inputs.pop("prompt_text")
generate_kwargs["min_length"] += prefix_length generated_sequence = self.model.generate(input_ids=input_ids, **generate_kwargs) # BS x SL
return {"generated_sequence": generated_sequence, "input_ids": input_ids, "prompt_text": prompt_text}
prefix = prefix or ""
inputs = self._parse_and_tokenize(prefix + prompt_text, padding=False, add_special_tokens=False) def postprocess(self, model_outputs, return_type=ReturnType.FULL_TEXT, clean_up_tokenization_spaces=True):
generated_sequence = model_outputs["generated_sequence"]
# set input_ids to None to allow empty prompt input_ids = model_outputs["input_ids"]
if inputs["input_ids"].shape[-1] == 0: prompt_text = model_outputs["prompt_text"]
inputs["input_ids"] = None
inputs["attention_mask"] = None
if self.framework == "pt" and inputs["input_ids"] is not None:
inputs = self.ensure_tensor_on_device(**inputs)
input_ids = inputs["input_ids"]
# Ensure that batch size = 1 (batch generation not allowed for now)
assert (
input_ids is None or input_ids.shape[0] == 1
), "Batch generation is currently not supported. See https://github.com/huggingface/transformers/issues/3021 for more information."
output_sequences = self.model.generate(input_ids=input_ids, **generate_kwargs) # BS x SL
result = []
for generated_sequence in output_sequences:
if self.framework == "pt" and generated_sequence is not None: if self.framework == "pt" and generated_sequence is not None:
generated_sequence = generated_sequence.cpu() generated_sequence = generated_sequence.cpu()
generated_sequence = generated_sequence.numpy().tolist() generated_sequence = generated_sequence.numpy().tolist()
record = {} if return_type == ReturnType.TENSORS:
if return_tensors: record = {"generated_token_ids": generated_sequence}
record["generated_token_ids"] = generated_sequence elif return_type in {ReturnType.NEW_TEXT, ReturnType.FULL_TEXT}:
if return_text:
# Decode text # Decode text
record = []
for sequence in generated_sequence:
text = self.tokenizer.decode( text = self.tokenizer.decode(
generated_sequence, sequence,
skip_special_tokens=True, skip_special_tokens=True,
clean_up_tokenization_spaces=clean_up_tokenization_spaces, clean_up_tokenization_spaces=clean_up_tokenization_spaces,
) )
...@@ -183,17 +193,12 @@ class TextGenerationPipeline(Pipeline): ...@@ -183,17 +193,12 @@ class TextGenerationPipeline(Pipeline):
) )
) )
if return_full_text: if return_type == ReturnType.FULL_TEXT:
all_text = prompt_text + text[prompt_length:] all_text = prompt_text + text[prompt_length:]
else: else:
all_text = text[prompt_length:] all_text = text[prompt_length:]
record["generated_text"] = all_text item = {"generated_text": all_text}
record.append(item)
result.append(record)
results += [result]
if len(results) == 1:
return results[0]
return results return record
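For context, a minimal usage sketch of how the reworked hooks compose at call time (assuming the public `pipeline` factory and the `gpt2` checkpoint, both purely illustrative): `__call__` only dispatches, and as the signatures above suggest, `prefix` feeds `preprocess`, generate kwargs feed `_forward`, and `return_full_text`/`return_tensors` drive `postprocess`.

```python
# Usage sketch only; assumes the public `pipeline` factory and the "gpt2" checkpoint.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

# Call-time kwargs are split once by `_sanitize_parameters`: `max_length` is forwarded
# to `generate` in `_forward`, while `return_full_text` selects the postprocess behavior.
outputs = generator("Hello, I'm a language model,", max_length=30, return_full_text=True)
print(outputs[0]["generated_text"])
```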
import warnings
-from typing import TYPE_CHECKING, List, Optional, Tuple, Union
+from typing import List, Optional, Tuple, Union

import numpy as np

from ..file_utils import ExplicitEnum, add_end_docstrings, is_tf_available, is_torch_available
-from ..modelcard import ModelCard
from ..models.bert.tokenization_bert import BasicTokenizer
-from ..tokenization_utils import PreTrainedTokenizer
from .base import PIPELINE_INIT_ARGS, ArgumentHandler, Pipeline

-if TYPE_CHECKING:
-    from ..modeling_tf_utils import TFPreTrainedModel
-    from ..modeling_utils import PreTrainedModel
-
if is_tf_available():
    from ..models.auto.modeling_tf_auto import TF_MODEL_FOR_TOKEN_CLASSIFICATION_MAPPING

if is_torch_available():
-    import torch
-
    from ..models.auto.modeling_auto import MODEL_FOR_TOKEN_CLASSIFICATION_MAPPING
@@ -104,31 +95,9 @@ class TokenClassificationPipeline(Pipeline):

    default_input_names = "sequences"

-    def __init__(
-        self,
-        model: Union["PreTrainedModel", "TFPreTrainedModel"],
-        tokenizer: PreTrainedTokenizer,
-        modelcard: Optional[ModelCard] = None,
-        framework: Optional[str] = None,
-        args_parser: ArgumentHandler = TokenClassificationArgumentHandler(),
-        device: int = -1,
-        binary_output: bool = False,
-        ignore_labels=["O"],
-        task: str = "",
-        grouped_entities: Optional[bool] = None,
-        ignore_subwords: Optional[bool] = None,
-        aggregation_strategy: Optional[AggregationStrategy] = None,
-    ):
-        super().__init__(
-            model=model,
-            tokenizer=tokenizer,
-            modelcard=modelcard,
-            framework=framework,
-            device=device,
-            binary_output=binary_output,
-            task=task,
-        )
+    def __init__(self, args_parser=TokenClassificationArgumentHandler(), *args, **kwargs):
+        self.ignore_labels = ["O"]
+        super().__init__(*args, **kwargs)
        self.check_model_type(
            TF_MODEL_FOR_TOKEN_CLASSIFICATION_MAPPING
            if self.framework == "tf"
@@ -137,12 +106,17 @@ class TokenClassificationPipeline(Pipeline):

        self._basic_tokenizer = BasicTokenizer(do_lower_case=False)
        self._args_parser = args_parser
-        self.ignore_labels = ignore_labels

-        if aggregation_strategy is None:
-            aggregation_strategy = AggregationStrategy.NONE
-
-        if grouped_entities is not None or ignore_subwords is not None:
+    def _sanitize_parameters(
+        self,
+        ignore_labels=None,
+        grouped_entities: Optional[bool] = None,
+        ignore_subwords: Optional[bool] = None,
+        aggregation_strategy: Optional[AggregationStrategy] = None,
+    ):
+        postprocess_params = {}
+        if grouped_entities is not None or ignore_subwords is not None:
            if grouped_entities and ignore_subwords:
                aggregation_strategy = AggregationStrategy.FIRST
            elif grouped_entities and not ignore_subwords:
@@ -158,19 +132,23 @@ class TokenClassificationPipeline(Pipeline):
                warnings.warn(
                    f'`ignore_subwords` is deprecated and will be removed in version v5.0.0, defaulted to `aggregation_strategy="{aggregation_strategy}"` instead.'
                )

+        if aggregation_strategy is not None:
            if isinstance(aggregation_strategy, str):
                aggregation_strategy = AggregationStrategy[aggregation_strategy.upper()]
            if (
-                aggregation_strategy in {AggregationStrategy.FIRST, AggregationStrategy.MAX, AggregationStrategy.AVERAGE}
+                aggregation_strategy
+                in {AggregationStrategy.FIRST, AggregationStrategy.MAX, AggregationStrategy.AVERAGE}
                and not self.tokenizer.is_fast
            ):
                raise ValueError(
                    "Slow tokenizers cannot handle subwords. Please set the `aggregation_strategy` option"
                    'to `"simple"` or use a fast tokenizer.'
                )
+            postprocess_params["aggregation_strategy"] = aggregation_strategy
-        self.aggregation_strategy = aggregation_strategy
+        if ignore_labels is not None:
+            postprocess_params["ignore_labels"] = ignore_labels
+        return {}, {}, postprocess_params

    def __call__(self, inputs: Union[str, List[str]], **kwargs):
        """
@@ -198,44 +176,57 @@ class TokenClassificationPipeline(Pipeline):
        """

        _inputs, offset_mappings = self._args_parser(inputs, **kwargs)
+        self.offset_mappings = offset_mappings

-        answers = []
-
-        for i, sentence in enumerate(_inputs):
-            # Manage correct placement of the tensors
-            with self.device_placement():
-                tokens = self.tokenizer(
-                    sentence,
-                    return_attention_mask=False,
-                    return_tensors=self.framework,
-                    truncation=True,
-                    return_special_tokens_mask=True,
-                    return_offsets_mapping=self.tokenizer.is_fast,
-                )
-                if self.tokenizer.is_fast:
-                    offset_mapping = tokens.pop("offset_mapping").cpu().numpy()[0]
-                elif offset_mappings:
-                    offset_mapping = offset_mappings[i]
-                else:
-                    offset_mapping = None
-
-                special_tokens_mask = tokens.pop("special_tokens_mask").cpu().numpy()[0]
-
-                # Forward
-                if self.framework == "tf":
-                    entities = self.model(tokens.data)[0][0].numpy()
-                    input_ids = tokens["input_ids"].numpy()[0]
-                else:
-                    with torch.no_grad():
-                        tokens = self.ensure_tensor_on_device(**tokens)
-                        entities = self.model(**tokens)[0][0].cpu().numpy()
-                        input_ids = tokens["input_ids"].cpu().numpy()[0]
-
-            scores = np.exp(entities) / np.exp(entities).sum(-1, keepdims=True)
-            pre_entities = self.gather_pre_entities(sentence, input_ids, scores, offset_mapping, special_tokens_mask)
-            grouped_entities = self.aggregate(pre_entities, self.aggregation_strategy)
+        return super().__call__(inputs, **kwargs)
+
+    def preprocess(self, sentence):
+        truncation = True if self.tokenizer.model_max_length and self.tokenizer.model_max_length > 0 else False
+        model_inputs = self.tokenizer(
+            sentence,
+            return_attention_mask=False,
+            return_tensors=self.framework,
+            truncation=truncation,
+            return_special_tokens_mask=True,
+            return_offsets_mapping=self.tokenizer.is_fast,
+        )
+        if self.offset_mappings:
+            offset_mapping = self.offset_mappings[0]
+            model_inputs["offset_mapping"] = offset_mapping
+
+        model_inputs["sentence"] = sentence
+
+        return model_inputs
+
+    def _forward(self, model_inputs):
+        # Forward
+        special_tokens_mask = model_inputs.pop("special_tokens_mask")
+        offset_mapping = model_inputs.pop("offset_mapping", None)
+        sentence = model_inputs.pop("sentence")
+        if self.framework == "tf":
+            outputs = self.model(model_inputs.data)[0][0].numpy()
+        else:
+            outputs = self.model(**model_inputs)[0][0].numpy()
+        return {
+            "outputs": outputs,
+            "special_tokens_mask": special_tokens_mask,
+            "offset_mapping": offset_mapping,
+            "sentence": sentence,
+            **model_inputs,
+        }
+
+    def postprocess(self, model_outputs, aggregation_strategy=AggregationStrategy.NONE):
+        outputs = model_outputs["outputs"]
+        sentence = model_outputs["sentence"]
+        input_ids = model_outputs["input_ids"][0]
+        offset_mapping = model_outputs["offset_mapping"][0] if model_outputs["offset_mapping"] is not None else None
+        special_tokens_mask = model_outputs["special_tokens_mask"][0].numpy()
+
+        scores = np.exp(outputs) / np.exp(outputs).sum(-1, keepdims=True)
+        pre_entities = self.gather_pre_entities(
+            sentence, input_ids, scores, offset_mapping, special_tokens_mask, aggregation_strategy
+        )
+        grouped_entities = self.aggregate(pre_entities, aggregation_strategy)
        # Filter anything that is in self.ignore_labels
        entities = [
            entity
@@ -243,11 +234,7 @@ class TokenClassificationPipeline(Pipeline):
            if entity.get("entity", None) not in self.ignore_labels
            and entity.get("entity_group", None) not in self.ignore_labels
        ]
-            answers.append(entities)
-
-        if len(answers) == 1:
-            return answers[0]
-        return answers
+        return entities

    def gather_pre_entities(
        self,
@@ -256,6 +243,7 @@ class TokenClassificationPipeline(Pipeline):
        scores: np.ndarray,
        offset_mapping: Optional[List[Tuple[int, int]]],
        special_tokens_mask: np.ndarray,
+        aggregation_strategy: AggregationStrategy,
    ) -> List[dict]:
        """Fuse various numpy arrays into dicts with all the information needed for aggregation"""
        pre_entities = []
@@ -269,6 +257,12 @@ class TokenClassificationPipeline(Pipeline):
            word = self.tokenizer.convert_ids_to_tokens(int(input_ids[idx]))
            if offset_mapping is not None:
                start_ind, end_ind = offset_mapping[idx]
+                if self.framework == "pt":
+                    start_ind = start_ind.item()
+                    end_ind = end_ind.item()
+                else:
+                    start_ind = int(start_ind.numpy())
+                    end_ind = int(end_ind.numpy())
                word_ref = sentence[start_ind:end_ind]
                if getattr(self.tokenizer._tokenizer.model, "continuing_subword_prefix", None):
                    # This is a BPE, word aware tokenizer, there is a correct way
@@ -276,7 +270,7 @@ class TokenClassificationPipeline(Pipeline):
                    is_subword = len(word) != len(word_ref)
                else:
                    # This is a fallback heuristic. This will fail most likely on any kind of text + punctuation mixtures that will be considered "words". Non word aware models cannot do better than this unfortunately.
-                    if self.aggregation_strategy in {
+                    if aggregation_strategy in {
                        AggregationStrategy.FIRST,
                        AggregationStrategy.AVERAGE,
                        AggregationStrategy.MAX,
@@ -362,10 +356,11 @@ class TokenClassificationPipeline(Pipeline):
        Example: micro|soft| com|pany| B-ENT I-NAME I-ENT I-ENT will be rewritten with first strategy as microsoft|
        company| B-ENT I-ENT
        """
-        assert aggregation_strategy not in {
+        if aggregation_strategy in {
            AggregationStrategy.NONE,
            AggregationStrategy.SIMPLE,
-        }, "NONE and SIMPLE strategies are invalid"
+        }:
+            raise ValueError("NONE and SIMPLE strategies are invalid for word aggregation")
        word_entities = []
        word_group = None
...
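A quick sanity-check sketch for the token-classification rework (assuming the public `pipeline` factory and the `dslim/bert-base-NER` checkpoint, used purely as an example): `aggregation_strategy` is validated in `_sanitize_parameters` and handed to `postprocess`, while the deprecated `grouped_entities`/`ignore_subwords` flags are only translated into an equivalent strategy.

```python
# Usage sketch only; assumes the public `pipeline` factory and an example NER checkpoint.
from transformers import pipeline

ner = pipeline("token-classification", model="dslim/bert-base-NER")

# "simple" groups sub-tokens into whole entities; FIRST/MAX/AVERAGE additionally require
# a fast tokenizer, otherwise `_sanitize_parameters` raises the ValueError shown above.
entities = ner("My name is Wolfgang and I live in Berlin", aggregation_strategy="simple")
for entity in entities:
    print(entity["entity_group"], entity["word"], float(entity["score"]))
```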
@@ -2,15 +2,12 @@ from typing import List, Union

import numpy as np

-from ..file_utils import add_end_docstrings, is_torch_available
+from ..file_utils import add_end_docstrings
from ..tokenization_utils import TruncationStrategy
from ..utils import logging
from .base import PIPELINE_INIT_ARGS, ArgumentHandler, Pipeline

-if is_torch_available():
-    import torch
-
logger = logging.get_logger(__name__)
@@ -22,7 +19,7 @@ class ZeroShotClassificationArgumentHandler(ArgumentHandler):

    def _parse_labels(self, labels):
        if isinstance(labels, str):
-            labels = [label.strip() for label in labels.split(",")]
+            labels = [label.strip() for label in labels.split(",") if label.strip()]
        return labels

    def __call__(self, sequences, labels, hypothesis_template):
@@ -38,13 +35,12 @@ class ZeroShotClassificationArgumentHandler(ArgumentHandler):
        if isinstance(sequences, str):
            sequences = [sequences]
-        labels = self._parse_labels(labels)

        sequence_pairs = []
        for sequence in sequences:
            sequence_pairs.extend([[sequence, hypothesis_template.format(label)] for label in labels])

-        return sequence_pairs
+        return sequence_pairs, sequences


@add_end_docstrings(PIPELINE_INIT_ARGS)
@@ -66,8 +62,8 @@ class ZeroShotClassificationPipeline(Pipeline):
    """

    def __init__(self, args_parser=ZeroShotClassificationArgumentHandler(), *args, **kwargs):
-        super().__init__(*args, **kwargs)
        self._args_parser = args_parser
+        super().__init__(*args, **kwargs)
        if self.entailment_id == -1:
            logger.warning(
                "Failed to determine 'entailment' label id from the label2id mapping in the model config. Setting to "
@@ -82,19 +78,11 @@ class ZeroShotClassificationPipeline(Pipeline):
        return -1

    def _parse_and_tokenize(
-        self,
-        sequences,
-        candidate_labels,
-        hypothesis_template,
-        padding=True,
-        add_special_tokens=True,
-        truncation=TruncationStrategy.ONLY_FIRST,
-        **kwargs
+        self, sequence_pairs, padding=True, add_special_tokens=True, truncation=TruncationStrategy.ONLY_FIRST, **kwargs
    ):
        """
        Parse arguments and tokenize only_first so that hypothesis (label) is not truncated
        """
-        sequence_pairs = self._args_parser(sequences, candidate_labels, hypothesis_template)
        return_tensors = self.framework
        if getattr(self.tokenizer, "pad_token", None) is None:
            # XXX some tokenizers do not have a padding token, we use simple lists
@@ -141,55 +129,27 @@ class ZeroShotClassificationPipeline(Pipeline):

        return inputs

-    def _forward(self, inputs, return_tensors=False):
-        """
-        Internal framework specific forward dispatching
-
-        Args:
-            inputs: dict holding all the keyword arguments for required by the model forward method.
-            return_tensors: Whether to return native framework (pt/tf) tensors rather than numpy array
-
-        Returns:
-            Numpy array
-        """
-        # Encode for forward
-        with self.device_placement():
-            if self.framework == "tf":
-                if isinstance(inputs, list):
-                    predictions = []
-                    for input_ in inputs:
-                        prediction = self.model(input_.data, training=False)[0]
-                        predictions.append(prediction)
-                else:
-                    predictions = self.model(inputs.data, training=False)[0]
-            else:
-                with torch.no_grad():
-                    if isinstance(inputs, list):
-                        predictions = []
-                        for input_ in inputs:
-                            model_input = self.ensure_tensor_on_device(**input_)
-                            prediction = self.model(**model_input)[0].cpu()
-                            predictions.append(prediction)
-                    else:
-                        inputs = self.ensure_tensor_on_device(**inputs)
-                        predictions = self.model(**inputs)[0].cpu()
-
-        if return_tensors:
-            return predictions
-        else:
-            if isinstance(predictions, list):
-                predictions = np.array([p.numpy() for p in predictions])
-            else:
-                predictions = predictions.numpy()
-            return predictions
+    def _sanitize_parameters(self, **kwargs):
+        if kwargs.get("multi_class", None) is not None:
+            kwargs["multi_label"] = kwargs["multi_class"]
+            logger.warning(
+                "The `multi_class` argument has been deprecated and renamed to `multi_label`. "
+                "`multi_class` will be removed in a future version of Transformers."
+            )
+        preprocess_params = {}
+        if "candidate_labels" in kwargs:
+            preprocess_params["candidate_labels"] = self._args_parser._parse_labels(kwargs["candidate_labels"])
+        if "hypothesis_template" in kwargs:
+            preprocess_params["hypothesis_template"] = kwargs["hypothesis_template"]
+
+        postprocess_params = {}
+        if "multi_label" in kwargs:
+            postprocess_params["multi_label"] = kwargs["multi_label"]
+        return preprocess_params, {}, postprocess_params

    def __call__(
        self,
        sequences: Union[str, List[str]],
-        candidate_labels,
-        hypothesis_template="This example is {}.",
-        multi_label=False,
        **kwargs,
    ):
        """
@@ -222,53 +182,78 @@ class ZeroShotClassificationPipeline(Pipeline):
            - **labels** (:obj:`List[str]`) -- The labels sorted by order of likelihood.
            - **scores** (:obj:`List[float]`) -- The probabilities for each of the labels.
        """
-        if "multi_class" in kwargs and kwargs["multi_class"] is not None:
-            multi_label = kwargs.pop("multi_class")
-            logger.warning(
-                "The `multi_class` argument has been deprecated and renamed to `multi_label`. "
-                "`multi_class` will be removed in a future version of Transformers."
-            )
-
-        if sequences and isinstance(sequences, str):
-            sequences = [sequences]
-
-        outputs = super().__call__(sequences, candidate_labels, hypothesis_template)
-        if isinstance(outputs, list):
-            # XXX: Some tokenizers cannot handle batching because they don't
-            # have pad_token, so outputs will be a list, however, because outputs
-            # is only n logits and sequence_length is not present anymore, we
-            # can recreate a tensor out of outputs.
-            outputs = np.array(outputs)
-        num_sequences = len(sequences)
-        candidate_labels = self._args_parser._parse_labels(candidate_labels)
-        reshaped_outputs = outputs.reshape((num_sequences, len(candidate_labels), -1))
-
-        if len(candidate_labels) == 1:
-            multi_label = True
-
-        if not multi_label:
-            # softmax the "entailment" logits over all candidate labels
-            entail_logits = reshaped_outputs[..., self.entailment_id]
-            scores = np.exp(entail_logits) / np.exp(entail_logits).sum(-1, keepdims=True)
-        else:
-            # softmax over the entailment vs. contradiction dim for each label independently
-            entailment_id = self.entailment_id
-            contradiction_id = -1 if entailment_id == 0 else 0
-            entail_contr_logits = reshaped_outputs[..., [contradiction_id, entailment_id]]
-            scores = np.exp(entail_contr_logits) / np.exp(entail_contr_logits).sum(-1, keepdims=True)
-            scores = scores[..., 1]
-
-        result = []
-        for iseq in range(num_sequences):
-            top_inds = list(reversed(scores[iseq].argsort()))
-            result.append(
-                {
-                    "sequence": sequences if isinstance(sequences, str) else sequences[iseq],
-                    "labels": [candidate_labels[i] for i in top_inds],
-                    "scores": scores[iseq][top_inds].tolist(),
-                }
-            )
-
-        if len(result) == 1:
-            return result[0]
-        return result
+        result = super().__call__(sequences, **kwargs)
+        if len(result) == 1:
+            return result[0]
+        return result
+
+    def preprocess(self, inputs, candidate_labels=None, hypothesis_template="This example is {}."):
+        sequence_pairs, sequences = self._args_parser(inputs, candidate_labels, hypothesis_template)
+        model_inputs = self._parse_and_tokenize(sequence_pairs)
+
+        prepared_inputs = {
+            "candidate_labels": candidate_labels,
+            "sequences": sequences,
+            "inputs": model_inputs,
+        }
+        return prepared_inputs
+
+    def _forward(self, inputs):
+        candidate_labels = inputs["candidate_labels"]
+        sequences = inputs["sequences"]
+        model_inputs = inputs["inputs"]
+        if isinstance(model_inputs, list):
+            outputs = []
+            for input_ in model_inputs:
+                prediction = self.model(**input_)[0].cpu()
+                outputs.append(prediction)
+        else:
+            outputs = self.model(**model_inputs)
+
+        model_outputs = {"candidate_labels": candidate_labels, "sequences": sequences, "outputs": outputs}
+        return model_outputs
+
+    def postprocess(self, model_outputs, multi_label=False):
+        candidate_labels = model_outputs["candidate_labels"]
+        sequences = model_outputs["sequences"]
+        outputs = model_outputs["outputs"]
+
+        if self.framework == "pt":
+            if isinstance(outputs, list):
+                logits = np.concatenate([output.cpu().numpy() for output in outputs], axis=0)
+            else:
+                logits = outputs["logits"].cpu().numpy()
+        else:
+            if isinstance(outputs, list):
+                logits = np.concatenate([output.numpy() for output in outputs], axis=0)
+            else:
+                logits = outputs["logits"].numpy()
+
+        N = logits.shape[0]
+        n = len(candidate_labels)
+        num_sequences = N // n
+        reshaped_outputs = logits.reshape((num_sequences, n, -1))
+
+        if multi_label or len(candidate_labels) == 1:
+            # softmax over the entailment vs. contradiction dim for each label independently
+            entailment_id = self.entailment_id
+            contradiction_id = -1 if entailment_id == 0 else 0
+            entail_contr_logits = reshaped_outputs[..., [contradiction_id, entailment_id]]
+            scores = np.exp(entail_contr_logits) / np.exp(entail_contr_logits).sum(-1, keepdims=True)
+            scores = scores[..., 1]
+        else:
+            # softmax the "entailment" logits over all candidate labels
+            entail_logits = reshaped_outputs[..., self.entailment_id]
+            scores = np.exp(entail_logits) / np.exp(entail_logits).sum(-1, keepdims=True)
+
+        result = []
+        for iseq in range(num_sequences):
+            top_inds = list(reversed(scores[iseq].argsort()))
+            result.append(
+                {
+                    "sequence": sequences[iseq],
+                    "labels": [candidate_labels[i] for i in top_inds],
+                    "scores": scores[iseq, top_inds].tolist(),
+                }
+            )
+        return result
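To close the loop on the zero-shot rework, a minimal usage sketch (assuming the public `pipeline` factory and the `facebook/bart-large-mnli` checkpoint, both illustrative): `candidate_labels` and `hypothesis_template` are routed to `preprocess`, `multi_label` to `postprocess`, and a single input sequence is unwrapped by `__call__`.

```python
# Usage sketch only; assumes the public `pipeline` factory and an example NLI checkpoint.
from transformers import pipeline

classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

result = classifier(
    "Who are you voting for in 2020?",
    candidate_labels=["politics", "economics", "public health"],
    multi_label=True,  # score each label independently via entailment vs. contradiction
)
# Single input, so __call__ returns one dict with "sequence", "labels" and "scores".
print(result["labels"], result["scores"])
```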