This metric computes the area under the curve (AUC) for the Receiver Operating Characteristic Curve (ROC). The return values represent how well the model used is predicting the correct classes, based on the input data. A score of `0.5` means that the model is predicting exactly at chance, i.e. the model's predictions are correct at the same rate as if the predictions were being decided by the flip of a fair coin or the roll of a fair die. A score above `0.5` indicates that the model is doing better than chance, while a score below `0.5` indicates that the model is doing worse than chance.
This metric has three separate use cases:
- binary: The case in which there are only two different label classes, and each example gets only one label. This is the default implementation.
- multiclass: The case in which there can be more than two different label classes, but each example still gets only one label.
- multilabel: The case in which there can be more than two different label classes, and each example can have more than one label.
---
# Metric Card for ROC AUC
## Metric Description
This metric computes the area under the curve (AUC) for the Receiver Operating Characteristic Curve (ROC). The return values represent how well the model used is predicting the correct classes, based on the input data. A score of `0.5` means that the model is predicting exactly at chance, i.e. the model's predictions are correct at the same rate as if the predictions were being decided by the flip of a fair coin or the roll of a fair die. A score above `0.5` indicates that the model is doing better than chance, while a score below `0.5` indicates that the model is doing worse than chance.
This metric has three separate use cases:
- **binary**: The case in which there are only two different label classes, and each example gets only one label. This is the default implementation.
- **multiclass**: The case in which there can be more than two different label classes, but each example still gets only one label.
- **multilabel**: The case in which there can be more than two different label classes, and each example can have more than one label.
## How to Use
At minimum, this metric requires references and prediction scores:
The default implementation of this metric is the **binary** implementation. If employing the **multiclass** or **multilabel** use cases, the keyword `"multiclass"` or `"multilabel"` must be specified when loading the metric:
- In the **multiclass** case, the metric is loaded with:
See the [Examples Section Below](#examples_section) for more extensive examples.
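
As a rough illustration of what the binary score measures, the sketch below computes ROC AUC in pure Python as the fraction of (positive, negative) pairs in which the positive example receives the higher score, counting ties as half. This is not how the metric is implemented (it wraps scikit-learn); the function name `pairwise_roc_auc` and the sample data are assumptions made for this example.

```python
def pairwise_roc_auc(references, prediction_scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    positives = [s for y, s in zip(references, prediction_scores) if y == 1]
    negatives = [s for y, s in zip(references, prediction_scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("ROC AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # tie counts as half a correct ranking
    return wins / (len(positives) * len(negatives))


# Hypothetical sample data: 3 positives, 3 negatives.
refs = [1, 0, 1, 1, 0, 0]
pred_scores = [0.5, 0.2, 0.99, 0.3, 0.1, 0.7]
auc = pairwise_roc_auc(refs, pred_scores)
print(round(auc, 3))  # 7 of 9 pairs are ranked correctly -> 0.778
```

The pairwise form is quadratic in the number of examples, so it is only suited to building intuition on small inputs.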
### Inputs
- **`references`** (array-like of shape (n_samples,) or (n_samples, n_classes)): Ground truth labels. Expects different inputs based on use case:
    - binary: expects an array-like of shape (n_samples,)
    - multiclass: expects an array-like of shape (n_samples,)
    - multilabel: expects an array-like of shape (n_samples, n_classes)
- **`prediction_scores`** (array-like of shape (n_samples,) or (n_samples, n_classes)): Model predictions. Expects different inputs based on use case:
    - binary: expects an array-like of shape (n_samples,)
    - multiclass: expects an array-like of shape (n_samples, n_classes). The probability estimates must sum to 1 across the possible classes.
    - multilabel: expects an array-like of shape (n_samples, n_classes)
- **`average`** (`str`): Type of averaging to apply. Ignored in the binary use case. Defaults to `'macro'`. Options are:
    - `'micro'`: Calculates metrics globally by considering each element of the label indicator matrix as a label. Only works with the multilabel use case.
    - `'macro'`: Calculate metrics for each label, and find their unweighted mean. This does not take label imbalance into account.
    - `'weighted'`: Calculate metrics for each label, and find their average, weighted by support (i.e. the number of true instances for each label).
    - `'samples'`: Calculate metrics for each instance, and find their average. Only works with the multilabel use case.
    - `None`: No average is calculated, and scores for each class are returned. Only works with the multilabel use case.
- **`sample_weight`** (array-like of shape (n_samples,)): Sample weights. Defaults to `None`.
- **`max_fpr`** (`float`): If not `None`, the standardized partial AUC over the range [0, `max_fpr`] is returned. Must be greater than `0` and less than or equal to `1`. Defaults to `None`. Note: For the multiclass use case, `max_fpr` should be either `None` or `1.0` as ROC AUC partial computation is not currently supported for `multiclass`.
- **`multi_class`** (`str`): Only used for multiclass targets, in which case it is required. Determines the type of configuration to use. Options are:
    - `'ovr'`: Stands for One-vs-rest. Computes the AUC of each class against the rest. This treats the multiclass case in the same way as the multilabel case. Sensitive to class imbalance even when `average == 'macro'`, because class imbalance affects the composition of each of the 'rest' groupings.
    - `'ovo'`: Stands for One-vs-one. Computes the average AUC of all possible pairwise combinations of classes. Insensitive to class imbalance when `average == 'macro'`.
- **`labels`** (array-like of shape (n_classes,)): Only used for multiclass targets. List of labels that index the classes in `prediction_scores`. If `None`, the numerical or lexicographical order of the labels in `references` is used. Defaults to `None`.
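
The difference between the `'macro'` and `'weighted'` options can be sketched in a few lines. The per-label scores and supports below are made-up numbers, and the helper names are hypothetical; the sketch only shows how the two averages combine per-label scores, not how the underlying library computes them.

```python
def macro_average(per_label_scores):
    """Unweighted mean: every label counts the same, regardless of frequency."""
    return sum(per_label_scores) / len(per_label_scores)


def weighted_average(per_label_scores, supports):
    """Mean weighted by support (number of true instances of each label)."""
    total = sum(supports)
    return sum(score * n for score, n in zip(per_label_scores, supports)) / total


# Hypothetical per-label ROC AUC scores and label supports.
per_label_auc = [0.8, 0.4, 0.9]
supports = [10, 30, 60]

macro = macro_average(per_label_auc)                  # (0.8 + 0.4 + 0.9) / 3
weighted = weighted_average(per_label_auc, supports)  # (8 + 12 + 54) / 100
print(round(macro, 2), round(weighted, 2))  # 0.7 0.74
```

With imbalanced labels the two values can diverge noticeably, which is why the choice of `average` matters when comparing reported scores.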
### Output Values
This metric returns a dict containing the `roc_auc` score. The score is a `float`, unless it is the multilabel case with `average=None`, in which case the score is a numpy `array` with entries of type `float`.
The output therefore generally takes the following format:
```python
{'roc_auc': 0.778}
```
In contrast, though, the output takes the following format in the multilabel case when `average=None`:
```python
{'roc_auc': array([0.83333333, 0.375, 0.94444444])}
```
ROC AUC scores can take on any value between `0` and `1`, inclusive.
## Citation
```bibtex
@article{scikit-learn,
  title={Scikit-learn: Machine Learning in {P}ython},
  author={Pedregosa, F. and Varoquaux, G. and Gramfort, A. and Michel, V. and Thirion, B. and Grisel, O. and Blondel, M. and Prettenhofer, P. and Weiss, R. and Dubourg, V. and Vanderplas, J. and Passos, A. and Cournapeau, D. and Brucher, M. and Perrot, M. and Duchesnay, E.},
  journal={Journal of Machine Learning Research},
  volume={12},
  pages={2825--2830},
  year={2011}
}
```
## Further References
This implementation is a wrapper around the [Scikit-learn implementation](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html). Much of the documentation here was adapted from their existing documentation, as well.
The [Guide to ROC and AUC](https://youtu.be/iCZJfO-7C5Q) video from the channel Data Science Bits is also very informative.
# Copyright 2020 The HuggingFace Datasets Authors and the current dataset script contributor.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""ROC AUC metric."""

import datasets
from sklearn.metrics import roc_auc_score

import evaluate


_DESCRIPTION = """
This metric computes the area under the curve (AUC) for the Receiver Operating Characteristic Curve (ROC). The return values represent how well the model used is predicting the correct classes, based on the input data. A score of `0.5` means that the model is predicting exactly at chance, i.e. the model's predictions are correct at the same rate as if the predictions were being decided by the flip of a fair coin or the roll of a fair die. A score above `0.5` indicates that the model is doing better than chance, while a score below `0.5` indicates that the model is doing worse than chance.
This metric has three separate use cases:
- binary: The case in which there are only two different label classes, and each example gets only one label. This is the default implementation.
- multiclass: The case in which there can be more than two different label classes, but each example still gets only one label.
- multilabel: The case in which there can be more than two different label classes, and each example can have more than one label.
"""
_KWARGS_DESCRIPTION = """
Args:
- references (array-like of shape (n_samples,) or (n_samples, n_classes)): Ground truth labels. Expects different input based on use case:
- binary: expects an array-like of shape (n_samples,)
- multiclass: expects an array-like of shape (n_samples,)
- multilabel: expects an array-like of shape (n_samples, n_classes)
- prediction_scores (array-like of shape (n_samples,) or (n_samples, n_classes)): Model predictions. Expects different inputs based on use case:
- binary: expects an array-like of shape (n_samples,)
- multiclass: expects an array-like of shape (n_samples, n_classes)
- multilabel: expects an array-like of shape (n_samples, n_classes)
- average (`str`): Type of average, and is ignored in the binary use case. Defaults to 'macro'. Options are:
- `'micro'`: Calculates metrics globally by considering each element of the label indicator matrix as a label. Only works with the multilabel use case.
- `'macro'`: Calculate metrics for each label, and find their unweighted mean. This does not take label imbalance into account.
- `'weighted'`: Calculate metrics for each label, and find their average, weighted by support (i.e. the number of true instances for each label).
- `'samples'`: Calculate metrics for each instance, and find their average. Only works with the multilabel use case.
- `None`: No average is calculated, and scores for each class are returned. Only works with the multilabel use case.
- sample_weight (array-like of shape (n_samples,)): Sample weights. Defaults to None.
- max_fpr (`float`): If not None, the standardized partial AUC over the range [0, `max_fpr`] is returned. Must be greater than `0` and less than or equal to `1`. Defaults to `None`. Note: For the multiclass use case, `max_fpr` should be either `None` or `1.0` as ROC AUC partial computation is not currently supported for `multiclass`.
- multi_class (`str`): Only used for multiclass targets, where it is required. Determines the type of configuration to use. Options are:
- `'ovr'`: Stands for One-vs-rest. Computes the AUC of each class against the rest. This treats the multiclass case in the same way as the multilabel case. Sensitive to class imbalance even when `average == 'macro'`, because class imbalance affects the composition of each of the 'rest' groupings.
- `'ovo'`: Stands for One-vs-one. Computes the average AUC of all possible pairwise combinations of classes. Insensitive to class imbalance when `average == 'macro'`.
- labels (array-like of shape (n_classes,)): Only used for multiclass targets. List of labels that index the classes in
`prediction_scores`. If `None`, the numerical or lexicographical order of the labels in
`references` is used. Defaults to `None`.
Returns:
roc_auc (`float` or array-like of shape (n_classes,)): Returns array if in multilabel use case and `average=None`. Otherwise, returns `float`.
# Metric Card for ROUGE
## Metric Description
ROUGE, or Recall-Oriented Understudy for Gisting Evaluation, is a set of metrics and a software package used for evaluating automatic summarization and machine translation software in natural language processing. The metrics compare an automatically produced summary or translation against one or more human-produced reference summaries or translations.
Note that ROUGE is case insensitive, meaning that upper case letters are treated the same way as lower case letters.
This metric is a wrapper around the [Google Research reimplementation of ROUGE](https://github.com/google-research/google-research/tree/master/rouge).
## How to Use
At minimum, this metric takes as input a list of predictions and a list of references:
- **predictions** (`list`): list of predictions to score. Each prediction should be a string with tokens separated by spaces.
- **references** (`list` or `list[list]`): list of references, one per prediction, or a list of several references per prediction. Each reference should be a string with tokens separated by spaces.
- **rouge_types** (`list`): A list of rouge types to calculate. Defaults to `['rouge1', 'rouge2', 'rougeL', 'rougeLsum']`.
    - Valid rouge types:
        - `"rouge1"`: unigram (1-gram) based scoring
        - `"rouge2"`: bigram (2-gram) based scoring
        - `"rougeL"`: longest common subsequence based scoring
        - `"rougeLsum"`: longest common subsequence based scoring at the summary level; splits text using `"\n"`
    - See [here](https://github.com/huggingface/datasets/issues/617) for more information
- **use_aggregator** (`boolean`): If `True`, returns aggregates. Defaults to `True`.
- **use_stemmer** (`boolean`): If `True`, uses Porter stemmer to strip word suffixes. Defaults to `False`.
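
To illustrate the n-gram based types (`rouge1`, `rouge2`), here is a minimal sketch of clipped n-gram overlap scoring. It assumes plain whitespace tokenization and lowercasing, and the helper names `ngrams` and `rouge_n` are hypothetical; the wrapped package has its own tokenizer, optional stemming, and aggregation, so its numbers can differ.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list (empty if the list is shorter than n)."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(prediction, reference, n=1):
    """Precision, recall and F-measure of clipped n-gram overlap."""
    pred_counts = ngrams(prediction.lower().split(), n)
    ref_counts = ngrams(reference.lower().split(), n)
    # Counter intersection keeps the minimum count per n-gram (clipping).
    overlap = sum((pred_counts & ref_counts).values())
    pred_total = sum(pred_counts.values())
    ref_total = sum(ref_counts.values())
    precision = overlap / pred_total if pred_total else 0.0
    recall = overlap / ref_total if ref_total else 0.0
    if precision + recall == 0:
        fmeasure = 0.0
    else:
        fmeasure = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "fmeasure": fmeasure}


# Hypothetical sentences: 5 of 6 unigrams and 3 of 5 bigrams overlap.
scores_1 = rouge_n("the cat sat on the mat", "the cat is on the mat", n=1)
scores_2 = rouge_n("the cat sat on the mat", "the cat is on the mat", n=2)
print(round(scores_1["fmeasure"], 3), round(scores_2["fmeasure"], 3))  # 0.833 0.6
```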
### Output Values
The output is a dictionary with one entry for each rouge type in the input list `rouge_types`. If `use_aggregator=False`, each dictionary entry is a list of scores, with one score for each sentence. E.g. if `rouge_types=['rouge1', 'rouge2']` and `use_aggregator=False`, the output holds one list of per-sentence scores under each of the keys `rouge1` and `rouge2`.
If `rouge_types=['rouge1', 'rouge2']` and `use_aggregator=True`, the output is of the following format:
```python
{'rouge1': 1.0, 'rouge2': 1.0}
```
The ROUGE values are in the range of 0 to 1.
#### Values from Popular Papers
### Examples
An example without aggregation:
```python
>>> rouge = evaluate.load('rouge')
>>> predictions = ["hello goodbye", "ankh morpork"]
>>> references = ["goodbye", "general kenobi"]
>>> results = rouge.compute(predictions=predictions,
...                         references=references,
...                         use_aggregator=False)
>>> print(list(results.keys()))
['rouge1', 'rouge2', 'rougeL', 'rougeLsum']
>>> print(results["rouge1"])
[0.5, 0.0]
```
The same example, but with aggregation:
```python
>>> rouge = evaluate.load('rouge')
>>> predictions = ["hello goodbye", "ankh morpork"]
>>> references = ["goodbye", "general kenobi"]
>>> results = rouge.compute(predictions=predictions,
...                         references=references,
...                         use_aggregator=True)
>>> print(list(results.keys()))
['rouge1', 'rouge2', 'rougeL', 'rougeLsum']
>>> print(results["rouge1"])
0.25
```
The same example, but only calculating `rouge1`:
```python
>>> rouge = evaluate.load('rouge')
>>> predictions = ["hello goodbye", "ankh morpork"]
>>> references = ["goodbye", "general kenobi"]
>>> results = rouge.compute(predictions=predictions,
...                         references=references,
...                         rouge_types=['rouge1'],
...                         use_aggregator=True)
>>> print(list(results.keys()))
['rouge1']
>>> print(results["rouge1"])
0.25
```
## Limitations and Bias
See [Schluter (2017)](https://aclanthology.org/E17-2007/) for an in-depth discussion of many of ROUGE's limits.
## Citation
```bibtex
@inproceedings{lin-2004-rouge,
title="{ROUGE}: A Package for Automatic Evaluation of Summaries",
author="Lin, Chin-Yew",
booktitle="Text Summarization Branches Out",
month=jul,
year="2004",
address="Barcelona, Spain",
publisher="Association for Computational Linguistics",
url="https://www.aclweb.org/anthology/W04-1013",
pages="74--81",
}
```
## Further References
- This metric is a wrapper around the [Google Research reimplementation of ROUGE](https://github.com/google-research/google-research/tree/master/rouge)
SacreBLEU provides hassle-free computation of shareable, comparable, and reproducible BLEU scores.
Inspired by Rico Sennrich's `multi-bleu-detok.perl`, it produces the official WMT scores but works with plain text.
It also knows all the standard test sets and handles downloading, processing, and tokenization for you.
See the [README.md] file at https://github.com/mjpost/sacreBLEU for more information.
---
# Metric Card for SacreBLEU
## Metric Description
SacreBLEU provides hassle-free computation of shareable, comparable, and reproducible BLEU scores. Inspired by Rico Sennrich's `multi-bleu-detok.perl`, it produces the official Workshop on Machine Translation (WMT) scores but works with plain text. It also knows all the standard test sets and handles downloading, processing, and tokenization.
See the [README.md] file at https://github.com/mjpost/sacreBLEU for more information.
## How to Use
This metric takes a set of predictions and a set of references as input, along with various optional parameters.
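```python
>>> predictions = ["hello there general kenobi", "foo bar foobar"]
>>> references = [["hello there general kenobi", "hello there !"],
...               ["foo bar foobar", "foo bar foobar"]]
>>> sacrebleu = evaluate.load("sacrebleu")
>>> results = sacrebleu.compute(predictions=predictions,
...                             references=references)
```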
- **`predictions`** (`list` of `str`): list of translations to score. Each translation should be tokenized into a list of tokens.
- **`references`** (`list` of `list` of `str`): A list of lists of references. The contents of the first sub-list are the references for the first prediction, the contents of the second sub-list are for the second prediction, etc. Note that there must be the same number of references for each prediction (i.e. all sub-lists must be of the same length).
- **`smooth_method`** (`str`): The smoothing method to use, defaults to `'exp'`. Possible values are:
    - `'none'`: no smoothing
    - `'floor'`: increment zero counts
    - `'add-k'`: increment num/denom by k for n>1
    - `'exp'`: exponential decay
- **`smooth_value`** (`float`): The smoothing value. Only valid when `smooth_method='floor'` (in which case `smooth_value` defaults to `0.1`) or `smooth_method='add-k'` (in which case `smooth_value` defaults to `1`).
- **`tokenize`** (`str`): Tokenization method to use for BLEU. If not provided, defaults to `'zh'` for Chinese, `'ja-mecab'` for Japanese and `'13a'` (mteval) otherwise. Possible values are:
    - `'none'`: No tokenization.
    - `'zh'`: Chinese tokenization.
    - `'13a'`: mimics the `mteval-v13a` script from Moses.
    - `'intl'`: International tokenization, mimics the `mteval-v14` script from Moses.
    - `'ja-mecab'`: Japanese tokenization. Uses the [MeCab tokenizer](https://pypi.org/project/mecab-python3).
- **`lowercase`** (`bool`): If `True`, lowercases the input, enabling case-insensitivity. Defaults to `False`.
- **`force`** (`bool`): If `True`, insists that your tokenized input is actually detokenized. Defaults to `False`.
- **`use_effective_order`** (`bool`): If `True`, stops including n-gram orders for which precision is 0. This should be `True` if sentence-level BLEU will be computed. Defaults to `False`.
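
The equal-length requirement on `references` can be checked before scoring. The sketch below is a hypothetical helper (the name `to_reference_streams` is not part of any library): it validates that every prediction has the same number of references and regroups them into one list per reference position, which is the layout the sacreBLEU Python API takes for multiple references. Treat it as an illustration under those assumptions rather than the wrapper's actual code.

```python
def to_reference_streams(references):
    """Validate per-prediction references and regroup them by reference position."""
    if not references:
        raise ValueError("references must not be empty")
    refs_per_prediction = len(references[0])
    if any(len(refs) != refs_per_prediction for refs in references):
        raise ValueError("every prediction needs the same number of references")
    # Stream i holds the i-th reference of every prediction.
    return [[refs[i] for refs in references] for i in range(refs_per_prediction)]


references = [["hello there general kenobi", "hello there !"],
              ["foo bar foobar", "foo bar foobar"]]
streams = to_reference_streams(references)
print(streams[0])  # ['hello there general kenobi', 'foo bar foobar']
print(streams[1])  # ['hello there !', 'foo bar foobar']
```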
SacreBLEU provides hassle-free computation of shareable, comparable, and reproducible BLEU scores.
Inspired by Rico Sennrich's `multi-bleu-detok.perl`, it produces the official WMT scores but works with plain text.
It also knows all the standard test sets and handles downloading, processing, and tokenization for you.
See the [README.md] file at https://github.com/mjpost/sacreBLEU for more information.
"""
_KWARGS_DESCRIPTION = """
Produces BLEU scores along with its sufficient statistics
from a source against one or more references.
Args:
predictions (`list` of `str`): list of translations to score. Each translation should be tokenized into a list of tokens.
references (`list` of `list` of `str`): A list of lists of references. The contents of the first sub-list are the references for the first prediction, the contents of the second sub-list are for the second prediction, etc. Note that there must be the same number of references for each prediction (i.e. all sub-lists must be of the same length).
smooth_method (`str`): The smoothing method to use, defaults to `'exp'`. Possible values are:
- `'none'`: no smoothing
- `'floor'`: increment zero counts
- `'add-k'`: increment num/denom by k for n>1
- `'exp'`: exponential decay
smooth_value (`float`): The smoothing value. Only valid when `smooth_method='floor'` (in which case `smooth_value` defaults to `0.1`) or `smooth_method='add-k'` (in which case `smooth_value` defaults to `1`).
tokenize (`str`): Tokenization method to use for BLEU. If not provided, defaults to `'zh'` for Chinese, `'ja-mecab'` for Japanese and `'13a'` (mteval) otherwise. Possible values are:
- `'none'`: No tokenization.
- `'zh'`: Chinese tokenization.
- `'13a'`: mimics the `mteval-v13a` script from Moses.
- `'intl'`: International tokenization, mimics the `mteval-v14` script from Moses
- `'ja-mecab'`: Japanese tokenization. Uses the [MeCab tokenizer](https://pypi.org/project/mecab-python3).
lowercase (`bool`): If `True`, lowercases the input, enabling case-insensitivity. Defaults to `False`.
force (`bool`): If `True`, insists that your tokenized input is actually detokenized. Defaults to `False`.
use_effective_order (`bool`): If `True`, stops including n-gram orders for which precision is 0. This should be `True`, if sentence-level BLEU will be computed. Defaults to `False`.
Returns:
'score': BLEU score,
'counts': Counts,
'totals': Totals,
'precisions': Precisions,
'bp': Brevity penalty,
'sys_len': predictions length,
'ref_len': reference length,
Examples:
Example 1:
>>> predictions = ["hello there general kenobi", "foo bar foobar"]
>>> references = [["hello there general kenobi", "hello there !"], ["foo bar foobar", "foo bar foobar"]]
SARI (***s**ystem output **a**gainst **r**eferences and against the **i**nput sentence*) is a metric used for evaluating automatic text simplification systems.
The metric compares the predicted simplified sentences against the reference and the source sentences. It explicitly measures the goodness of words that are added, deleted and kept by the system.
SARI can be computed as:
`sari = (F1_add + F1_keep + P_del) / 3`
where
- `F1_add` is the n-gram F1 score for add operations
- `F1_keep` is the n-gram F1 score for keep operations
- `P_del` is the n-gram precision score for delete operations

The maximum n-gram order, `n`, is 4, as in the original paper.
This implementation is adapted from [Tensorflow's tensor2tensor implementation](https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/utils/sari_hook.py).
It has two differences with the [original GitHub implementation](https://github.com/cocoxu/simplification/blob/master/SARI.py):
1) It defines 0/0=1 instead of 0 to give higher scores for predictions that match a target exactly.
2) It fixes an [alleged bug](https://github.com/cocoxu/simplification/issues/6) in the keep score computation.
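
The final combination step of the formula above can be sketched as follows. The precision and recall inputs are made-up numbers and the helper names are hypothetical; computing the add, keep and delete statistics from n-grams is the substantial part of SARI and is not shown here.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def combine_sari(f1_add, f1_keep, p_del):
    """Arithmetic mean of the three operation scores, on a 0-1 scale."""
    return (f1_add + f1_keep + p_del) / 3


# Hypothetical operation-level statistics.
f1_add = f1_score(0.25, 0.5)   # add operations
f1_keep = f1_score(0.5, 0.5)   # keep operations
p_del = 0.75                   # precision of delete operations

sari = combine_sari(f1_add, f1_keep, p_del)
# This card reports SARI on a 0-100 scale, hence the factor of 100.
print(round(100 * sari, 2))  # 52.78
```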
## How to use
The metric takes 3 inputs: `sources` (a list of source sentence strings), `predictions` (a list of predicted sentence strings) and `references` (a list of lists of reference sentence strings).
```python
from evaluate import load
sari = load("sari")
sources = ["About 95 species are currently accepted."]
predictions = ["About 95 you now get in."]
references = [["About 95 species are currently known.", "About 95 species are now accepted.", "95 species are now accepted."]]
sari_score = sari.compute(sources=sources, predictions=predictions, references=references)
```
This metric outputs a dictionary with the SARI score:
```
print(sari_score)
{'sari': 26.953601953601954}
```
The range of values for the SARI score is between 0 and 100 -- the higher the value, the better the performance of the model being evaluated, with a SARI of 100 being a perfect score.
### Values from popular papers
The [original paper that proposes the SARI metric](https://aclanthology.org/Q16-1029.pdf) reports scores ranging from 26 to 43 for different simplification systems and different datasets. They also find that the metric ranks all of the simplification systems and human references in the same order as the human assessment used as a comparison, and that it correlates reasonably with human judgments.
More recent SARI scores for text simplification can be found on leaderboards for datasets such as [TurkCorpus](https://paperswithcode.com/sota/text-simplification-on-turkcorpus) and [Newsela](https://paperswithcode.com/sota/text-simplification-on-newsela).
## Examples
Perfect match between prediction and reference:
```python
from evaluate import load
sari = load("sari")
sources = ["About 95 species are currently accepted ."]
predictions = ["About 95 species are currently accepted ."]
references = [["About 95 species are currently accepted ."]]
sari_score = sari.compute(sources=sources, predictions=predictions, references=references)
```
## Limitations and bias
SARI is a valuable measure for comparing different text simplification systems as well as one that can assist the iterative development of a system.
However, while the [original paper presenting SARI](https://aclanthology.org/Q16-1029.pdf) states that it captures "the notion of grammaticality and meaning preservation", this is a difficult claim to empirically validate.
## Citation
```bibtex
@inproceedings{xu-etal-2016-optimizing,
title={Optimizing Statistical Machine Translation for Text Simplification},
author={Xu, Wei and Napoles, Courtney and Pavlick, Ellie and Chen, Quanze and Callison-Burch, Chris},
journal={Transactions of the Association for Computational Linguistics},
volume={4},
year={2016},
url={https://www.aclweb.org/anthology/Q16-1029},
pages={401--415},
}
```
## Further References
- [NLP Progress -- Text Simplification](http://nlpprogress.com/english/simplification.html)
- [Hugging Face Hub -- Text Simplification Models](https://huggingface.co/datasets?filter=task_ids:text-simplification)
seqeval is a Python framework for sequence labeling evaluation.
seqeval can evaluate the performance of chunking tasks such as named-entity recognition, part-of-speech tagging, semantic role labeling and so on.
This is well-tested by using the Perl script conlleval, which can be used for
measuring the performance of a system that has processed the CoNLL-2000 shared task data.
seqeval supports the following formats:
IOB1
IOB2
IOE1
IOE2
IOBES
See the [README.md] file at https://github.com/chakki-works/seqeval for more information.
---
# Metric Card for seqeval
## Metric description
seqeval is a Python framework for sequence labeling evaluation. seqeval can evaluate the performance of chunking tasks such as named-entity recognition, part-of-speech tagging, semantic role labeling and so on.
## How to use
Seqeval produces labelling scores along with its sufficient statistics from a source against one or more references.
It takes two mandatory arguments:
- `predictions`: a list of lists of predicted labels, i.e. estimated targets as returned by a tagger.
- `references`: a list of lists of reference labels, i.e. the ground truth/target values.

It can also take several optional arguments:
- `suffix` (boolean): `True` if the IOB tag is a suffix (after type) instead of a prefix (before type), `False` otherwise. The default value is `False`, i.e. the IOB tag is a prefix (before type).
- `scheme`: the target tagging scheme, which can be one of [`IOB1`, `IOB2`, `IOE1`, `IOE2`, `IOBES`, `BILOU`]. The default value is `None`.
- `mode`: whether to count correct entity labels with incorrect I/B tags as true positives or not. If you want to only count exact matches, pass `mode="strict"` and a specific `scheme` value. The default is `None`.
- `sample_weight`: An array-like of shape (n_samples,) that provides weights for individual samples. The default is `None`.
- `zero_division`: Which value to substitute as a metric value when encountering zero division. Should be one of [`0`, `1`, `"warn"`]. `"warn"` acts as `0`, but also raises a warning.
This metric returns a dictionary with a summary of scores, both overall and per entity type:
- Overall:
    - `accuracy`: the average [accuracy](https://huggingface.co/metrics/accuracy), on a scale between 0.0 and 1.0.
    - `precision`: the average [precision](https://huggingface.co/metrics/precision), on a scale between 0.0 and 1.0.
    - `recall`: the average [recall](https://huggingface.co/metrics/recall), on a scale between 0.0 and 1.0.
    - `f1`: the average [F1 score](https://huggingface.co/metrics/f1), which is the harmonic mean of the precision and recall. It also has a scale of 0.0 to 1.0.
- Per type (e.g. `MISC`, `PER`, `LOC`,...):
    - `precision`: the average [precision](https://huggingface.co/metrics/precision), on a scale between 0.0 and 1.0.
    - `recall`: the average [recall](https://huggingface.co/metrics/recall), on a scale between 0.0 and 1.0.
    - `f1`: the average [F1 score](https://huggingface.co/metrics/f1), on a scale between 0.0 and 1.0.
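
To make the entity-level scores concrete, here is a minimal sketch that extracts entity spans from IOB2 tags and computes precision, recall and F1 over exact span matches. It is a simplified illustration, not seqeval's implementation: the function names are hypothetical, only the `B-`, `I-` and `O` tags of IOB2 are handled, and an `I-` tag that does not continue an open entity is ignored.

```python
def extract_entities(tags):
    """Return the set of (type, start, end) spans in one IOB2-tagged sentence."""
    entities = set()
    start, ent_type = None, None
    for i, tag in enumerate(tags + ["O"]):  # trailing "O" flushes the last entity
        prefix, _, label = tag.partition("-")
        # Close the open entity unless this tag continues it.
        if start is not None and not (prefix == "I" and label == ent_type):
            entities.add((ent_type, start, i - 1))
            start, ent_type = None, None
        if prefix == "B":
            start, ent_type = i, label
    return entities


def entity_scores(references, predictions):
    """Precision, recall and F1 over exact (sentence, type, start, end) matches."""
    true_entities, pred_entities = set(), set()
    for sent_id, (ref_tags, pred_tags) in enumerate(zip(references, predictions)):
        true_entities |= {(sent_id, *e) for e in extract_entities(ref_tags)}
        pred_entities |= {(sent_id, *e) for e in extract_entities(pred_tags)}
    correct = len(true_entities & pred_entities)
    precision = correct / len(pred_entities) if pred_entities else 0.0
    recall = correct / len(true_entities) if true_entities else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical data: the MISC span is off by one token, the PER span matches.
references = [["O", "O", "B-MISC", "I-MISC", "I-MISC", "I-MISC", "O"], ["B-PER", "I-PER", "O"]]
predictions = [["O", "O", "O", "B-MISC", "I-MISC", "I-MISC", "O"], ["B-PER", "I-PER", "O"]]
scores = entity_scores(references, predictions)
print(scores)  # {'precision': 0.5, 'recall': 0.5, 'f1': 0.5}
```

Because only exact span matches count, the shifted `MISC` prediction earns no partial credit even though most of its tokens are tagged correctly.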
### Values from popular papers
The 1995 "Text Chunking using Transformation-Based Learning" [paper](https://aclanthology.org/W95-0107) reported a baseline recall of 81.9% and a precision of 78.2% using methods not based on deep learning.
More recently, seqeval has continued to be used for reporting performance on tasks such as [named entity detection](https://www.mdpi.com/2306-5729/6/8/84/htm) and [information extraction](https://ieeexplore.ieee.org/abstract/document/9697942/).
seqeval supports the following IOB formats (short for inside, outside, beginning): `IOB1`, `IOB2`, `IOE1`, `IOE2`, `IOBES` (only in strict mode) and `BILOU` (only in strict mode).
For more information about IOB formats, refer to the [Wikipedia page](https://en.wikipedia.org/wiki/Inside%E2%80%93outside%E2%80%93beginning_(tagging)) and the description of the [CoNLL-2002 shared task](https://aclanthology.org/W02-2024).
## Citation
```bibtex
@inproceedings{ramshaw-marcus-1995-text,
title="Text Chunking using Transformation-Based Learning",
author="Ramshaw, Lance and
Marcus, Mitch",
booktitle="Third Workshop on Very Large Corpora",
year="1995",
url="https://www.aclweb.org/anthology/W95-0107",
}
```
```bibtex
@misc{seqeval,
title={{seqeval}: A Python framework for sequence labeling evaluation},
url={https://github.com/chakki-works/seqeval},
note={Software available from https://github.com/chakki-works/seqeval},
author={Hiroki Nakayama},
year={2018},
}
```
## Further References
- [README for seqeval at GitHub](https://github.com/chakki-works/seqeval)