Unverified Commit b88e0e01 authored by Nicolas Patry, committed by GitHub

[TokenClassification] Label realignment for subword aggregation (#11680)

* [TokenClassification] Label realignment for subword aggregation

Tentative replacement for https://github.com/huggingface/transformers/pull/11622/files



- Added `AggregationStrategy`
- The `ignore_subwords` and `grouped_entities` arguments are now fused
  into `aggregation_strategy` (see the usage sketch after this list).
  This makes more sense, because `ignore_subwords=True` with
  `grouped_entities=False` had no meaning.
- Added two new ways to aggregate: MAX and AVERAGE.
- AVERAGE requires a bit more information than the others; for now this
  case is slightly specific, and we should keep that in mind for future
  changes.
- Testing has been modified to reflect the new argument, and to check
  both the correct deprecation and the new aggregation_strategy.
- Put the test arguments and expected results for aggregation_strategy
  close together, so that readers can understand what is supposed to
  happen.
- `aggregate` is now only tested on a small model, as it is not
  meaningful to test it globally for all models.
- Previous tests are unchanged in desired output.
- Added a new test case that better showcases the difference between the
  FIRST, MAX and AVERAGE strategies.
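
For illustration only (not part of this commit): a minimal usage sketch of
the new argument. The mapping comments mirror the `__init__` deprecation
logic in the diff below; the input sentence is made up.

```python
from transformers import pipeline

# Deprecated flags -> new argument, per the __init__ logic in this diff:
#   grouped_entities=True,  ignore_subwords=True  -> aggregation_strategy="first"
#   grouped_entities=True,  ignore_subwords=False -> aggregation_strategy="simple"
#   everything else                               -> aggregation_strategy="none"
ner = pipeline("ner", aggregation_strategy="first")
print(ner("My name is Wolfgang and I live in Berlin."))
```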

* Wrong framework.

* Addressing three issues.

1- Tags might not follow the B-, I- convention, so any tag should work now
(a bare TAG is assumed to be B-TAG; a standalone sketch of this
normalization appears after the `group_entities` hunk below).
2- Fixed an issue with AVERAGE that led to a substantial code change.
3- The testing suite was not checking for the "index" key with the "none"
strategy. This is now fixed.

The issue is that "O" could not be chosen by the AVERAGE strategy, because
those tokens were filtered out beforehand, so their relative scores were
not counted in the average. Filtering on ignore_labels now happens at the
very end of the pipeline, fixing that issue (a toy example follows).

It's a bit hard to make sure this stays like that, because we do not have
an end-to-end test for that behavior.
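
A toy numpy sketch (hypothetical scores, not taken from the test suite) of
why the "O" column has to stay in the average:

```python
import numpy as np

labels = ["O", "B-TAG"]
# Hypothetical softmax scores for the two subword tokens of one word.
token_scores = np.array(
    [
        [0.7, 0.3],  # token 1 leans "O"
        [0.4, 0.6],  # token 2 leans "B-TAG"
    ]
)

# AVERAGE: average the full distributions first, then take the argmax.
average = token_scores.mean(axis=0)  # -> [0.55, 0.45]
print(labels[average.argmax()])  # "O"

# If "O"-predicted tokens were dropped *before* aggregation, only token 2
# would remain and the word would wrongly be tagged "B-TAG".
```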

* Formatting.

* Adding formatting to code + cleaner handling of B-, I- tags.

Co-authored-by: Francesco Rubbo <rubbo.francesco@gmail.com>
Co-authored-by: elk-cloner <rezakakhki.rk@gmail.com>

* Typo.

Co-authored-by: Francesco Rubbo <rubbo.francesco@gmail.com>
Co-authored-by: elk-cloner <rezakakhki.rk@gmail.com>
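
To see how the three word-level strategies can disagree on a single word,
here is a hand-computed sketch with made-up scores that mirrors the
`aggregate_word` logic in the diff below:

```python
import numpy as np

labels = ["O", "B-ENT", "B-NAME"]
# Made-up softmax scores for the three tokens of a single word.
tokens = [
    np.array([0.0, 0.80, 0.20]),
    np.array([0.0, 0.70, 0.30]),
    np.array([0.0, 0.05, 0.95]),
]

first = labels[tokens[0].argmax()]                          # "B-ENT": tag of the first token
max_ = labels[max(tokens, key=lambda s: s.max()).argmax()]  # "B-NAME": token with the highest peak (0.95)
avg = labels[np.stack(tokens).mean(axis=0).argmax()]        # "B-ENT": mean is [0.0, 0.52, 0.48]
print(first, max_, avg)
```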
parent c73e3532
src/transformers/pipelines/__init__.py
@@ -48,7 +48,12 @@ from .table_question_answering import TableQuestionAnsweringArgumentHandler, Tab
 from .text2text_generation import SummarizationPipeline, Text2TextGenerationPipeline, TranslationPipeline
 from .text_classification import TextClassificationPipeline
 from .text_generation import TextGenerationPipeline
-from .token_classification import NerPipeline, TokenClassificationArgumentHandler, TokenClassificationPipeline
+from .token_classification import (
+    AggregationStrategy,
+    NerPipeline,
+    TokenClassificationArgumentHandler,
+    TokenClassificationPipeline,
+)
 from .zero_shot_classification import ZeroShotClassificationArgumentHandler, ZeroShotClassificationPipeline
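
Since `AggregationStrategy` is exported here, callers can pass either the
enum or a plain string. A small sketch (assuming this commit is installed)
of the string-to-enum upgrade that `__init__` performs below:

```python
from transformers.pipelines import AggregationStrategy

# ExplicitEnum members can be looked up by upper-cased name or by value.
assert AggregationStrategy["first".upper()] is AggregationStrategy.FIRST
assert AggregationStrategy("max") is AggregationStrategy.MAX
```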
src/transformers/pipelines/token_classification.py
-from typing import TYPE_CHECKING, List, Optional, Union
+import warnings
+from typing import TYPE_CHECKING, List, Optional, Tuple, Union

 import numpy as np

-from ..file_utils import add_end_docstrings, is_tf_available, is_torch_available
+from ..file_utils import ExplicitEnum, add_end_docstrings, is_tf_available, is_torch_available
 from ..modelcard import ModelCard
 from ..models.bert.tokenization_bert import BasicTokenizer
 from ..tokenization_utils import PreTrainedTokenizer
@@ -48,13 +49,43 @@ class TokenClassificationArgumentHandler(ArgumentHandler):
         return inputs, offset_mapping


+class AggregationStrategy(ExplicitEnum):
+    """All the valid aggregation strategies for TokenClassificationPipeline"""
+
+    NONE = "none"
+    SIMPLE = "simple"
+    FIRST = "first"
+    AVERAGE = "average"
+    MAX = "max"
+
+
 @add_end_docstrings(
     PIPELINE_INIT_ARGS,
     r"""
         ignore_labels (:obj:`List[str]`, defaults to :obj:`["O"]`):
             A list of labels to ignore.
         grouped_entities (:obj:`bool`, `optional`, defaults to :obj:`False`):
-            Whether or not to group the tokens corresponding to the same entity together in the predictions or not.
+            DEPRECATED, use :obj:`aggregation_strategy` instead. Whether or not to group the tokens corresponding to
+            the same entity together in the predictions or not.
+        aggregation_strategy (:obj:`str`, `optional`, defaults to :obj:`"none"`):
+            The strategy to fuse (or not) tokens based on the model prediction.
+
+            - "none" : Will not attempt any aggregation and will simply return the raw results from the model.
+            - "simple" : Will attempt to group entities following the default schema. (A, B-TAG), (B, I-TAG), (C,
+              I-TAG), (D, B-TAG2), (E, B-TAG2) will end up being [{"word": ABC, "entity": "TAG"}, {"word": "D",
+              "entity": "TAG2"}, {"word": "E", "entity": "TAG2"}]. Notice that two consecutive B tags will end up as
+              different entities. On word based languages, we might end up splitting words undesirably: imagine
+              Microsoft being tagged as [{"word": "Micro", "entity": "ENTERPRISE"}, {"word": "soft", "entity":
+              "NAME"}]. Look at the FIRST, MAX and AVERAGE options for ways to mitigate this and disambiguate words
+              (on languages that support that meaning, which is basically tokens separated by a space). These
+              mitigations will only work on real words; "New york" might still be tagged with two different entities.
+            - "first" : (works only on word based models) Will use the :obj:`SIMPLE` strategy, except that words
+              cannot end up with different tags. Words will simply use the tag of the first token of the word when
+              there is ambiguity.
+            - "average" : (works only on word based models) Will use the :obj:`SIMPLE` strategy, except that words
+              cannot end up with different tags. Scores are averaged across tokens first, and then the label with the
+              maximum score is applied.
+            - "max" : (works only on word based models) Will use the :obj:`SIMPLE` strategy, except that words cannot
+              end up with different tags. The word entity will simply be the token with the maximum score.
     """,
 )
 class TokenClassificationPipeline(Pipeline):
@@ -84,8 +115,9 @@ class TokenClassificationPipeline(Pipeline):
         binary_output: bool = False,
         ignore_labels=["O"],
         task: str = "",
-        grouped_entities: bool = False,
-        ignore_subwords: bool = False,
+        grouped_entities: Optional[bool] = None,
+        ignore_subwords: Optional[bool] = None,
+        aggregation_strategy: Optional[AggregationStrategy] = None,
     ):
         super().__init__(
             model=model,
@@ -106,15 +138,40 @@ class TokenClassificationPipeline(Pipeline):
         self._basic_tokenizer = BasicTokenizer(do_lower_case=False)
         self._args_parser = args_parser
         self.ignore_labels = ignore_labels
-        self.grouped_entities = grouped_entities
-        self.ignore_subwords = ignore_subwords

-        if self.ignore_subwords and not self.tokenizer.is_fast:
+        if aggregation_strategy is None:
+            aggregation_strategy = AggregationStrategy.NONE
+        if grouped_entities is not None or ignore_subwords is not None:
+            if grouped_entities and ignore_subwords:
+                aggregation_strategy = AggregationStrategy.FIRST
+            elif grouped_entities and not ignore_subwords:
+                aggregation_strategy = AggregationStrategy.SIMPLE
+            else:
+                aggregation_strategy = AggregationStrategy.NONE
+
+            if grouped_entities is not None:
+                warnings.warn(
+                    f'`grouped_entities` is deprecated and will be removed in version v5.0.0, defaulted to `aggregation_strategy="{aggregation_strategy}"` instead.'
+                )
+            if ignore_subwords is not None:
+                warnings.warn(
+                    f'`ignore_subwords` is deprecated and will be removed in version v5.0.0, defaulted to `aggregation_strategy="{aggregation_strategy}"` instead.'
+                )
+        if isinstance(aggregation_strategy, str):
+            aggregation_strategy = AggregationStrategy[aggregation_strategy.upper()]
+        if (
+            aggregation_strategy in {AggregationStrategy.FIRST, AggregationStrategy.MAX, AggregationStrategy.AVERAGE}
+            and not self.tokenizer.is_fast
+        ):
             raise ValueError(
-                "Slow tokenizers cannot ignore subwords. Please set the `ignore_subwords` option"
-                "to `False` or use a fast tokenizer."
+                "Slow tokenizers cannot handle subwords. Please set the `aggregation_strategy` option "
+                'to `"simple"` or use a fast tokenizer.'
             )
+        self.aggregation_strategy = aggregation_strategy

     def __call__(self, inputs: Union[str, List[str]], **kwargs):
         """
         Classify each token of the text(s) given as inputs.
@@ -125,14 +182,14 @@ class TokenClassificationPipeline(Pipeline):

         Return:
             A list or a list of list of :obj:`dict`: Each result comes as a list of dictionaries (one for each token in
-            the corresponding input, or each entity if this pipeline was instantiated with
-            :obj:`grouped_entities=True`) with the following keys:
+            the corresponding input, or each entity if this pipeline was instantiated with an aggregation_strategy)
+            with the following keys:

             - **word** (:obj:`str`) -- The token/word classified.
             - **score** (:obj:`float`) -- The corresponding probability for :obj:`entity`.
             - **entity** (:obj:`str`) -- The entity predicted for that token/word (it is named `entity_group` when
-              `grouped_entities` is set to True.
-            - **index** (:obj:`int`, only present when ``self.grouped_entities=False``) -- The index of the
+              `aggregation_strategy` is not :obj:`"none"`).
+            - **index** (:obj:`int`, only present when ``aggregation_strategy="none"``) -- The index of the
               corresponding token in the sentence.
             - **start** (:obj:`int`, `optional`) -- The index of the start of the corresponding entity in the sentence.
               Only exists if the offsets are available within the tokenizer
@@ -176,57 +233,141 @@ class TokenClassificationPipeline(Pipeline):
                     entities = self.model(**tokens)[0][0].cpu().numpy()
                     input_ids = tokens["input_ids"].cpu().numpy()[0]

-            score = np.exp(entities) / np.exp(entities).sum(-1, keepdims=True)
-            labels_idx = score.argmax(axis=-1)
-
-            entities = []
-            # Filter to labels not in `self.ignore_labels`
-            # Filter special_tokens
-            filtered_labels_idx = [
-                (idx, label_idx)
-                for idx, label_idx in enumerate(labels_idx)
-                if (self.model.config.id2label[label_idx] not in self.ignore_labels) and not special_tokens_mask[idx]
-            ]
-
-            for idx, label_idx in filtered_labels_idx:
-                if offset_mapping is not None:
-                    start_ind, end_ind = offset_mapping[idx]
-                    word_ref = sentence[start_ind:end_ind]
-                    word = self.tokenizer.convert_ids_to_tokens([int(input_ids[idx])])[0]
-                    is_subword = len(word_ref) != len(word)
-
-                    if int(input_ids[idx]) == self.tokenizer.unk_token_id:
-                        word = word_ref
-                        is_subword = False
-                else:
-                    word = self.tokenizer.convert_ids_to_tokens(int(input_ids[idx]))
-                    start_ind = None
-                    end_ind = None
-
-                entity = {
-                    "word": word,
-                    "score": score[idx][label_idx].item(),
-                    "entity": self.model.config.id2label[label_idx],
-                    "index": idx,
-                    "start": start_ind,
-                    "end": end_ind,
-                }
-
-                if self.grouped_entities and self.ignore_subwords:
-                    entity["is_subword"] = is_subword
-
-                entities += [entity]
-
-            if self.grouped_entities:
-                answers += [self.group_entities(entities)]
-            # Append ungrouped entities
-            else:
-                answers += [entities]
+            scores = np.exp(entities) / np.exp(entities).sum(-1, keepdims=True)
+            pre_entities = self.gather_pre_entities(sentence, input_ids, scores, offset_mapping, special_tokens_mask)
+            grouped_entities = self.aggregate(pre_entities, self.aggregation_strategy)
+            # Filter anything that is in self.ignore_labels
+            entities = [
+                entity
+                for entity in grouped_entities
+                if entity.get("entity", None) not in self.ignore_labels
+                and entity.get("entity_group", None) not in self.ignore_labels
+            ]
+            answers.append(entities)

         if len(answers) == 1:
             return answers[0]
         return answers
+    def gather_pre_entities(
+        self,
+        sentence: str,
+        input_ids: np.ndarray,
+        scores: np.ndarray,
+        offset_mapping: Optional[List[Tuple[int, int]]],
+        special_tokens_mask: np.ndarray,
+    ) -> List[dict]:
+        """Fuse various numpy arrays into dicts with all the information needed for aggregation"""
+        pre_entities = []
+        for idx, token_scores in enumerate(scores):
+            # Filter special_tokens, they should only occur
+            # at the sentence boundaries since we're not encoding pairs of
+            # sentences so we don't have to keep track of those.
+            if special_tokens_mask[idx]:
+                continue
+
+            word = self.tokenizer.convert_ids_to_tokens(int(input_ids[idx]))
+            if offset_mapping is not None:
+                start_ind, end_ind = offset_mapping[idx]
+                word_ref = sentence[start_ind:end_ind]
+                is_subword = len(word_ref) != len(word)
+
+                if int(input_ids[idx]) == self.tokenizer.unk_token_id:
+                    word = word_ref
+                    is_subword = False
+            else:
+                start_ind = None
+                end_ind = None
+                is_subword = False
+
+            pre_entity = {
+                "word": word,
+                "scores": token_scores,
+                "start": start_ind,
+                "end": end_ind,
+                "index": idx,
+                "is_subword": is_subword,
+            }
+            pre_entities.append(pre_entity)
+        return pre_entities
+
+    def aggregate(self, pre_entities: List[dict], aggregation_strategy: AggregationStrategy) -> List[dict]:
+        if aggregation_strategy in {AggregationStrategy.NONE, AggregationStrategy.SIMPLE}:
+            entities = []
+            for pre_entity in pre_entities:
+                entity_idx = pre_entity["scores"].argmax()
+                score = pre_entity["scores"][entity_idx]
+                entity = {
+                    "entity": self.model.config.id2label[entity_idx],
+                    "score": score,
+                    "index": pre_entity["index"],
+                    "word": pre_entity["word"],
+                    "start": pre_entity["start"],
+                    "end": pre_entity["end"],
+                }
+                entities.append(entity)
+        else:
+            entities = self.aggregate_words(pre_entities, aggregation_strategy)
+
+        if aggregation_strategy == AggregationStrategy.NONE:
+            return entities
+        return self.group_entities(entities)
+
+    def aggregate_word(self, entities: List[dict], aggregation_strategy: AggregationStrategy) -> dict:
+        word = self.tokenizer.convert_tokens_to_string([entity["word"] for entity in entities])
+        if aggregation_strategy == AggregationStrategy.FIRST:
+            scores = entities[0]["scores"]
+            idx = scores.argmax()
+            score = scores[idx]
+            entity = self.model.config.id2label[idx]
+        elif aggregation_strategy == AggregationStrategy.MAX:
+            max_entity = max(entities, key=lambda entity: entity["scores"].max())
+            scores = max_entity["scores"]
+            idx = scores.argmax()
+            score = scores[idx]
+            entity = self.model.config.id2label[idx]
+        elif aggregation_strategy == AggregationStrategy.AVERAGE:
+            scores = np.stack([entity["scores"] for entity in entities])
+            average_scores = np.nanmean(scores, axis=0)
+            entity_idx = average_scores.argmax()
+            entity = self.model.config.id2label[entity_idx]
+            score = average_scores[entity_idx]
+        else:
+            raise ValueError("Invalid aggregation_strategy")
+        new_entity = {
+            "entity": entity,
+            "score": score,
+            "word": word,
+            "start": entities[0]["start"],
+            "end": entities[-1]["end"],
+        }
+        return new_entity
+
+    def aggregate_words(self, entities: List[dict], aggregation_strategy: AggregationStrategy) -> List[dict]:
+        """
+        Override tokens from a given word that disagree to force agreement on word boundaries.
+
+        Example: micro|soft| com|pany| B-ENT I-NAME I-ENT I-ENT will be rewritten with first strategy as microsoft|
+        company| B-ENT I-ENT
+        """
+        assert aggregation_strategy not in {
+            AggregationStrategy.NONE,
+            AggregationStrategy.SIMPLE,
+        }, "NONE and SIMPLE strategies are invalid"
+
+        word_entities = []
+        word_group = None
+        for entity in entities:
+            if word_group is None:
+                word_group = [entity]
+            elif entity["is_subword"]:
+                word_group.append(entity)
+            else:
+                word_entities.append(self.aggregate_word(word_group, aggregation_strategy))
+                word_group = [entity]
+        # Last item
+        word_entities.append(self.aggregate_word(word_group, aggregation_strategy))
+        return word_entities
+
     def group_sub_entities(self, entities: List[dict]) -> dict:
         """
@@ -249,6 +390,19 @@ class TokenClassificationPipeline(Pipeline):
         }
         return entity_group

+    def get_tag(self, entity_name: str) -> Tuple[str, str]:
+        if entity_name.startswith("B-"):
+            bi = "B"
+            tag = entity_name[2:]
+        elif entity_name.startswith("I-"):
+            bi = "I"
+            tag = entity_name[2:]
+        else:
+            # It's not in B-, I- format
+            bi = "B"
+            tag = entity_name
+        return bi, tag
+
     def group_entities(self, entities: List[dict]) -> List[dict]:
         """
         Find and group together the adjacent tokens with the same entity predicted.
@@ -260,45 +414,29 @@ class TokenClassificationPipeline(Pipeline):
         entity_groups = []
         entity_group_disagg = []

-        if entities:
-            last_idx = entities[-1]["index"]
-
         for entity in entities:
-            is_last_idx = entity["index"] == last_idx
-            is_subword = self.ignore_subwords and entity["is_subword"]
             if not entity_group_disagg:
-                entity_group_disagg += [entity]
-                if is_last_idx:
-                    entity_groups += [self.group_sub_entities(entity_group_disagg)]
+                entity_group_disagg.append(entity)
                 continue

-            # If the current entity is similar and adjacent to the previous entity, append it to the disaggregated entity group
-            # The split is meant to account for the "B" and "I" suffixes
+            # If the current entity is similar and adjacent to the previous entity,
+            # append it to the disaggregated entity group
+            # The split is meant to account for the "B" and "I" prefixes
+            # Shouldn't merge if both entities are B-type
-            if (
-                (
-                    entity["entity"].split("-")[-1] == entity_group_disagg[-1]["entity"].split("-")[-1]
-                    and entity["entity"].split("-")[0] != "B"
-                )
-                and entity["index"] == entity_group_disagg[-1]["index"] + 1
-            ) or is_subword:
-                # Modify subword type to be previous_type
-                if is_subword:
-                    entity["entity"] = entity_group_disagg[-1]["entity"].split("-")[-1]
-                    entity["score"] = np.nan  # set ignored scores to nan and use np.nanmean
-
-                entity_group_disagg += [entity]
-                # Group the entities at the last entity
-                if is_last_idx:
-                    entity_groups += [self.group_sub_entities(entity_group_disagg)]
-            # If the current entity is different from the previous entity, aggregate the disaggregated entity group
+            bi, tag = self.get_tag(entity["entity"])
+            last_bi, last_tag = self.get_tag(entity_group_disagg[-1]["entity"])
+            if tag == last_tag and bi != "B":
+                entity_group_disagg.append(entity)
             else:
-                entity_groups += [self.group_sub_entities(entity_group_disagg)]
+                # If the current entity is different from the previous entity
+                # aggregate the disaggregated entity group
+                entity_groups.append(self.group_sub_entities(entity_group_disagg))
                 entity_group_disagg = [entity]
-                # If it's the last entity, add it to the entity groups
-                if is_last_idx:
-                    entity_groups += [self.group_sub_entities(entity_group_disagg)]
+        if entity_group_disagg:
+            # it's the last entity, add it to the entity groups
+            entity_groups.append(self.group_sub_entities(entity_group_disagg))

         return entity_groups
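
To make the B-, I- handling concrete, here is a standalone copy of
`get_tag` with illustrative inputs (the tag names are made up):

```python
from typing import Tuple


def get_tag(entity_name: str) -> Tuple[str, str]:
    """Standalone copy of the method above, for illustration only."""
    if entity_name.startswith("B-"):
        return "B", entity_name[2:]
    if entity_name.startswith("I-"):
        return "I", entity_name[2:]
    # Not in B-, I- format: a bare tag behaves like a B- tag.
    return "B", entity_name


print(get_tag("B-PER"))  # ('B', 'PER')
print(get_tag("I-PER"))  # ('I', 'PER')
print(get_tag("PER"))    # ('B', 'PER'): adjacent bare "PER" tokens are
                         # therefore *not* merged by group_entities.
```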
src/transformers/testing_utils.py
@@ -1207,19 +1207,25 @@ def nested_simplify(obj, decimals=3):
     Simplifies an object by rounding float numbers, and downcasting tensors/numpy arrays to get simple equality test
     within tests.
     """
     import numpy as np

     from transformers.tokenization_utils import BatchEncoding

     if isinstance(obj, list):
         return [nested_simplify(item, decimals) for item in obj]
+    elif isinstance(obj, np.ndarray):
+        return nested_simplify(obj.tolist())
     elif isinstance(obj, (dict, BatchEncoding)):
         return {nested_simplify(k, decimals): nested_simplify(v, decimals) for k, v in obj.items()}
-    elif isinstance(obj, (str, int)):
+    elif isinstance(obj, (str, int, np.int64)):
         return obj
     elif is_torch_available() and isinstance(obj, torch.Tensor):
-        return nested_simplify(obj.tolist())
+        return nested_simplify(obj.tolist(), decimals)
     elif is_tf_available() and tf.is_tensor(obj):
         return nested_simplify(obj.numpy().tolist())
     elif isinstance(obj, float):
         return round(obj, decimals)
+    elif isinstance(obj, np.float32):
+        return nested_simplify(obj.item(), decimals)
     else:
         raise Exception(f"Not supported: {type(obj)}")
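
A quick sketch of what the extended test helper now accepts (illustrative
values; the numpy scalar types are the ones the pipeline output contains):

```python
import numpy as np

from transformers.testing_utils import nested_simplify

# np.int64 indices and np.float32 scores, as produced by the pipeline,
# now flow through the simplifier instead of raising.
result = {"entity": "B-PER", "score": np.float32(0.998765), "index": np.int64(4)}
print(nested_simplify(result))  # {'entity': 'B-PER', 'score': 0.999, 'index': 4}
```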