Commit 25991f98 authored by hepj

Modify README

parent ac192496
---
title: ROC AUC
emoji: 🤗
colorFrom: blue
colorTo: red
sdk: gradio
sdk_version: 3.19.1
app_file: app.py
pinned: false
tags:
- evaluate
- metric
description: >-
This metric computes the area under the curve (AUC) for the Receiver Operating Characteristic Curve (ROC). The return values represent how well the model used is predicting the correct classes, based on the input data. A score of `0.5` means that the model is predicting exactly at chance, i.e. the model's predictions are correct at the same rate as if the predictions were being decided by the flip of a fair coin or the roll of a fair die. A score above `0.5` indicates that the model is doing better than chance, while a score below `0.5` indicates that the model is doing worse than chance.
This metric has three separate use cases:
- binary: The case in which there are only two different label classes, and each example gets only one label. This is the default implementation.
- multiclass: The case in which there can be more than two different label classes, but each example still gets only one label.
- multilabel: The case in which there can be more than two different label classes, and each example can have more than one label.
---
# Metric Card for ROC AUC
## Metric Description
This metric computes the area under the curve (AUC) for the Receiver Operating Characteristic Curve (ROC). The return values represent how well the model used is predicting the correct classes, based on the input data. A score of `0.5` means that the model is predicting exactly at chance, i.e. the model's predictions are correct at the same rate as if the predictions were being decided by the flip of a fair coin or the roll of a fair die. A score above `0.5` indicates that the model is doing better than chance, while a score below `0.5` indicates that the model is doing worse than chance.
This metric has three separate use cases:
- **binary**: The case in which there are only two different label classes, and each example gets only one label. This is the default implementation.
- **multiclass**: The case in which there can be more than two different label classes, but each example still gets only one label.
- **multilabel**: The case in which there can be more than two different label classes, and each example can have more than one label.
## How to Use
At minimum, this metric requires references and prediction scores:
```python
>>> roc_auc_score = evaluate.load("roc_auc")
>>> refs = [1, 0, 1, 1, 0, 0]
>>> pred_scores = [0.5, 0.2, 0.99, 0.3, 0.1, 0.7]
>>> results = roc_auc_score.compute(references=refs, prediction_scores=pred_scores)
>>> print(round(results['roc_auc'], 2))
0.78
```
The default implementation of this metric is the **binary** implementation. If employing the **multiclass** or **multilabel** use cases, the keyword `"multiclass"` or `"multilabel"` must be specified when loading the metric:
- In the **multiclass** case, the metric is loaded with:
```python
>>> roc_auc_score = evaluate.load("roc_auc", "multiclass")
```
- In the **multilabel** case, the metric is loaded with:
```python
>>> roc_auc_score = evaluate.load("roc_auc", "multilabel")
```
See the [Examples Section Below](#examples_section) for more extensive examples.
### Inputs
- **`references`** (array-like of shape (n_samples,) or (n_samples, n_classes)): Ground truth labels. Expects different inputs based on use case:
- binary: expects an array-like of shape (n_samples,)
- multiclass: expects an array-like of shape (n_samples,)
- multilabel: expects an array-like of shape (n_samples, n_classes)
- **`prediction_scores`** (array-like of shape (n_samples,) or (n_samples, n_classes)): Model predictions. Expects different inputs based on use case:
- binary: expects an array-like of shape (n_samples,)
- multiclass: expects an array-like of shape (n_samples, n_classes). The probability estimates must sum to 1 across the possible classes.
- multilabel: expects an array-like of shape (n_samples, n_classes)
- **`average`** (`str`): Type of averaging to perform; ignored in the binary use case. Defaults to `'macro'`. Options are:
- `'micro'`: Calculates metrics globally by considering each element of the label indicator matrix as a label. Only works with the multilabel use case.
- `'macro'`: Calculate metrics for each label, and find their unweighted mean. This does not take label imbalance into account.
- `'weighted'`: Calculate metrics for each label, and find their average, weighted by support (i.e. the number of true instances for each label).
- `'samples'`: Calculate metrics for each instance, and find their average. Only works with the multilabel use case.
    - `None`: No average is calculated, and scores for each class are returned. Only works with the multilabel use case.
- **`sample_weight`** (array-like of shape (n_samples,)): Sample weights. Defaults to None.
- **`max_fpr`** (`float`): If not None, the standardized partial AUC over the range [0, `max_fpr`] is returned. Must be greater than `0` and less than or equal to `1`. Defaults to `None`. Note: For the multiclass use case, `max_fpr` should be either `None` or `1.0` as ROC AUC partial computation is not currently supported for `multiclass`.
- **`multi_class`** (`str`): Only used for multiclass targets, in which case it is required. Determines the type of configuration to use. Options are:
- `'ovr'`: Stands for One-vs-rest. Computes the AUC of each class against the rest. This treats the multiclass case in the same way as the multilabel case. Sensitive to class imbalance even when `average == 'macro'`, because class imbalance affects the composition of each of the 'rest' groupings.
- `'ovo'`: Stands for One-vs-one. Computes the average AUC of all possible pairwise combinations of classes. Insensitive to class imbalance when `average == 'macro'`.
- **`labels`** (array-like of shape (n_classes,)): Only used for multiclass targets. List of labels that index the classes in `prediction_scores`. If `None`, the numerical or lexicographical order of the labels in `prediction_scores` is used. Defaults to `None`. A short sketch showing several of these arguments in use follows this list.
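A minimal sketch of how several of these arguments can be combined, reusing the data from the examples further down (exact scores are omitted, since they come directly from scikit-learn's `roc_auc_score`):
```python
import evaluate
# Binary use case: standardized partial AUC over the FPR range [0, 0.5] via `max_fpr`.
roc_auc_binary = evaluate.load("roc_auc")
refs = [1, 0, 1, 1, 0, 0]
pred_scores = [0.5, 0.2, 0.99, 0.3, 0.1, 0.7]
partial = roc_auc_binary.compute(references=refs,
                                 prediction_scores=pred_scores,
                                 max_fpr=0.5)
print(partial["roc_auc"])
# Multiclass use case: one-vs-one averaging with an explicit label ordering.
roc_auc_multiclass = evaluate.load("roc_auc", "multiclass")
refs_mc = [1, 0, 1, 2, 2, 0]
pred_scores_mc = [[0.3, 0.5, 0.2],
                  [0.7, 0.2, 0.1],
                  [0.005, 0.99, 0.005],
                  [0.2, 0.3, 0.5],
                  [0.1, 0.1, 0.8],
                  [0.1, 0.7, 0.2]]
results = roc_auc_multiclass.compute(references=refs_mc,
                                     prediction_scores=pred_scores_mc,
                                     multi_class="ovo",
                                     average="macro",
                                     labels=[0, 1, 2])
print(results["roc_auc"])
```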
### Output Values
This metric returns a dict containing the `roc_auc` score. The score is a `float`, unless it is the multilabel case with `average=None`, in which case the score is a numpy `array` with entries of type `float`.
The output therefore generally takes the following format:
```python
{'roc_auc': 0.778}
```
In contrast, though, the output takes the following format in the multilabel case when `average=None`:
```python
{'roc_auc': array([0.83333333, 0.375, 0.94444444])}
```
ROC AUC scores can take on any value between `0` and `1`, inclusive.
#### Values from Popular Papers
### <a name="examples_section"></a>Examples
Example 1, the **binary** use case:
```python
>>> roc_auc_score = evaluate.load("roc_auc")
>>> refs = [1, 0, 1, 1, 0, 0]
>>> pred_scores = [0.5, 0.2, 0.99, 0.3, 0.1, 0.7]
>>> results = roc_auc_score.compute(references=refs, prediction_scores=pred_scores)
>>> print(round(results['roc_auc'], 2))
0.78
```
Example 2, the **multiclass** use case:
```python
>>> roc_auc_score = evaluate.load("roc_auc", "multiclass")
>>> refs = [1, 0, 1, 2, 2, 0]
>>> pred_scores = [[0.3, 0.5, 0.2],
... [0.7, 0.2, 0.1],
... [0.005, 0.99, 0.005],
... [0.2, 0.3, 0.5],
... [0.1, 0.1, 0.8],
... [0.1, 0.7, 0.2]]
>>> results = roc_auc_score.compute(references=refs,
... prediction_scores=pred_scores,
... multi_class='ovr')
>>> print(round(results['roc_auc'], 2))
0.85
```
Example 3, the **multilabel** use case:
```python
>>> roc_auc_score = evaluate.load("roc_auc", "multilabel")
>>> refs = [[1, 1, 0],
... [1, 1, 0],
... [0, 1, 0],
... [0, 0, 1],
... [0, 1, 1],
... [1, 0, 1]]
>>> pred_scores = [[0.3, 0.5, 0.2],
... [0.7, 0.2, 0.1],
... [0.005, 0.99, 0.005],
... [0.2, 0.3, 0.5],
... [0.1, 0.1, 0.8],
... [0.1, 0.7, 0.2]]
>>> results = roc_auc_score.compute(references=refs,
... prediction_scores=pred_scores,
... average=None)
>>> print([round(res, 2) for res in results['roc_auc']])
[0.83, 0.38, 0.94]
```
## Limitations and Bias
## Citation
```bibtex
@article{doi:10.1177/0272989X8900900307,
author = {Donna Katzman McClish},
title ={Analyzing a Portion of the ROC Curve},
journal = {Medical Decision Making},
volume = {9},
number = {3},
pages = {190-195},
year = {1989},
doi = {10.1177/0272989X8900900307},
note ={PMID: 2668680},
URL = {https://doi.org/10.1177/0272989X8900900307},
eprint = {https://doi.org/10.1177/0272989X8900900307}
}
```
```bibtex
@article{10.1023/A:1010920819831,
author = {Hand, David J. and Till, Robert J.},
title = {A Simple Generalisation of the Area Under the ROC Curve for Multiple Class Classification Problems},
year = {2001},
issue_date = {November 2001},
publisher = {Kluwer Academic Publishers},
address = {USA},
volume = {45},
number = {2},
issn = {0885-6125},
url = {https://doi.org/10.1023/A:1010920819831},
doi = {10.1023/A:1010920819831},
journal = {Mach. Learn.},
month = {oct},
pages = {171–186},
numpages = {16},
keywords = {Gini index, AUC, error rate, ROC curve, receiver operating characteristic}
}
```
```bibtex
@article{scikit-learn,
title={Scikit-learn: Machine Learning in {P}ython},
author={Pedregosa, F. and Varoquaux, G. and Gramfort, A. and Michel, V. and Thirion, B. and Grisel, O. and Blondel, M. and Prettenhofer, P. and Weiss, R. and Dubourg, V. and Vanderplas, J. and Passos, A. and Cournapeau, D. and Brucher, M. and Perrot, M. and Duchesnay, E.},
journal={Journal of Machine Learning Research},
volume={12},
pages={2825--2830},
year={2011}
}
```
## Further References
This implementation is a wrapper around the [Scikit-learn implementation](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html). Much of the documentation here was adapted from their existing documentation, as well.
The [Guide to ROC and AUC](https://youtu.be/iCZJfO-7C5Q) video from the channel Data Science Bits is also very informative.
import evaluate
from evaluate.utils import launch_gradio_widget
module = evaluate.load("roc_auc")
launch_gradio_widget(module)
git+https://github.com/huggingface/evaluate@{COMMIT_PLACEHOLDER}
scikit-learn
# Copyright 2020 The HuggingFace Datasets Authors and the current dataset script contributor.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""ROC AUC metric."""
import datasets
from sklearn.metrics import roc_auc_score
import evaluate
_DESCRIPTION = """
This metric computes the area under the curve (AUC) for the Receiver Operating Characteristic Curve (ROC). The return values represent how well the model used is predicting the correct classes, based on the input data. A score of `0.5` means that the model is predicting exactly at chance, i.e. the model's predictions are correct at the same rate as if the predictions were being decided by the flip of a fair coin or the roll of a fair die. A score above `0.5` indicates that the model is doing better than chance, while a score below `0.5` indicates that the model is doing worse than chance.
This metric has three separate use cases:
- binary: The case in which there are only two different label classes, and each example gets only one label. This is the default implementation.
- multiclass: The case in which there can be more than two different label classes, but each example still gets only one label.
- multilabel: The case in which there can be more than two different label classes, and each example can have more than one label.
"""
_KWARGS_DESCRIPTION = """
Args:
- references (array-like of shape (n_samples,) or (n_samples, n_classes)): Ground truth labels. Expects different input based on use case:
- binary: expects an array-like of shape (n_samples,)
- multiclass: expects an array-like of shape (n_samples,)
- multilabel: expects an array-like of shape (n_samples, n_classes)
- prediction_scores (array-like of shape (n_samples,) or (n_samples, n_classes)): Model predictions. Expects different inputs based on use case:
- binary: expects an array-like of shape (n_samples,)
- multiclass: expects an array-like of shape (n_samples, n_classes)
- multilabel: expects an array-like of shape (n_samples, n_classes)
- average (`str`): Type of average, and is ignored in the binary use case. Defaults to 'macro'. Options are:
- `'micro'`: Calculates metrics globally by considering each element of the label indicator matrix as a label. Only works with the multilabel use case.
- `'macro'`: Calculate metrics for each label, and find their unweighted mean. This does not take label imbalance into account.
- `'weighted'`: Calculate metrics for each label, and find their average, weighted by support (i.e. the number of true instances for each label).
- `'samples'`: Calculate metrics for each instance, and find their average. Only works with the multilabel use case.
    - `None`: No average is calculated, and scores for each class are returned. Only works with the multilabel use case.
- sample_weight (array-like of shape (n_samples,)): Sample weights. Defaults to None.
- max_fpr (`float`): If not None, the standardized partial AUC over the range [0, `max_fpr`] is returned. Must be greater than `0` and less than or equal to `1`. Defaults to `None`. Note: For the multiclass use case, `max_fpr` should be either `None` or `1.0` as ROC AUC partial computation is not currently supported for `multiclass`.
- multi_class (`str`): Only used for multiclass targets, where it is required. Determines the type of configuration to use. Options are:
- `'ovr'`: Stands for One-vs-rest. Computes the AUC of each class against the rest. This treats the multiclass case in the same way as the multilabel case. Sensitive to class imbalance even when `average == 'macro'`, because class imbalance affects the composition of each of the 'rest' groupings.
- `'ovo'`: Stands for One-vs-one. Computes the average AUC of all possible pairwise combinations of classes. Insensitive to class imbalance when `average == 'macro'`.
- labels (array-like of shape (n_classes,)): Only used for multiclass targets. List of labels that index the classes in
`prediction_scores`. If `None`, the numerical or lexicographical order of the labels in
`prediction_scores` is used. Defaults to `None`.
Returns:
roc_auc (`float` or array-like of shape (n_classes,)): Returns array if in multilabel use case and `average='None'`. Otherwise, returns `float`.
Examples:
Example 1:
>>> roc_auc_score = evaluate.load("roc_auc")
>>> refs = [1, 0, 1, 1, 0, 0]
>>> pred_scores = [0.5, 0.2, 0.99, 0.3, 0.1, 0.7]
>>> results = roc_auc_score.compute(references=refs, prediction_scores=pred_scores)
>>> print(round(results['roc_auc'], 2))
0.78
Example 2:
>>> roc_auc_score = evaluate.load("roc_auc", "multiclass")
>>> refs = [1, 0, 1, 2, 2, 0]
>>> pred_scores = [[0.3, 0.5, 0.2],
... [0.7, 0.2, 0.1],
... [0.005, 0.99, 0.005],
... [0.2, 0.3, 0.5],
... [0.1, 0.1, 0.8],
... [0.1, 0.7, 0.2]]
>>> results = roc_auc_score.compute(references=refs, prediction_scores=pred_scores, multi_class='ovr')
>>> print(round(results['roc_auc'], 2))
0.85
Example 3:
>>> roc_auc_score = evaluate.load("roc_auc", "multilabel")
>>> refs = [[1, 1, 0],
... [1, 1, 0],
... [0, 1, 0],
... [0, 0, 1],
... [0, 1, 1],
... [1, 0, 1]]
>>> pred_scores = [[0.3, 0.5, 0.2],
... [0.7, 0.2, 0.1],
... [0.005, 0.99, 0.005],
... [0.2, 0.3, 0.5],
... [0.1, 0.1, 0.8],
... [0.1, 0.7, 0.2]]
>>> results = roc_auc_score.compute(references=refs, prediction_scores=pred_scores, average=None)
>>> print([round(res, 2) for res in results['roc_auc']])
[0.83, 0.38, 0.94]
"""
_CITATION = """\
@article{doi:10.1177/0272989X8900900307,
author = {Donna Katzman McClish},
title ={Analyzing a Portion of the ROC Curve},
journal = {Medical Decision Making},
volume = {9},
number = {3},
pages = {190-195},
year = {1989},
doi = {10.1177/0272989X8900900307},
note ={PMID: 2668680},
URL = {https://doi.org/10.1177/0272989X8900900307},
eprint = {https://doi.org/10.1177/0272989X8900900307}
}
@article{10.1023/A:1010920819831,
author = {Hand, David J. and Till, Robert J.},
title = {A Simple Generalisation of the Area Under the ROC Curve for Multiple Class Classification Problems},
year = {2001},
issue_date = {November 2001},
publisher = {Kluwer Academic Publishers},
address = {USA},
volume = {45},
number = {2},
issn = {0885-6125},
url = {https://doi.org/10.1023/A:1010920819831},
doi = {10.1023/A:1010920819831},
journal = {Mach. Learn.},
month = {oct},
pages = {171–186},
numpages = {16},
keywords = {Gini index, AUC, error rate, ROC curve, receiver operating characteristic}
}
@article{scikit-learn,
title={Scikit-learn: Machine Learning in {P}ython},
author={Pedregosa, F. and Varoquaux, G. and Gramfort, A. and Michel, V. and Thirion, B. and Grisel, O. and Blondel, M. and Prettenhofer, P. and Weiss, R. and Dubourg, V. and Vanderplas, J. and Passos, A. and Cournapeau, D. and Brucher, M. and Perrot, M. and Duchesnay, E.},
journal={Journal of Machine Learning Research},
volume={12},
pages={2825--2830},
year={2011}
}
"""
@evaluate.utils.file_utils.add_start_docstrings(_DESCRIPTION, _KWARGS_DESCRIPTION)
class ROCAUC(evaluate.Metric):
def _info(self):
return evaluate.MetricInfo(
description=_DESCRIPTION,
citation=_CITATION,
inputs_description=_KWARGS_DESCRIPTION,
features=datasets.Features(
{
"prediction_scores": datasets.Sequence(datasets.Value("float")),
"references": datasets.Value("int32"),
}
if self.config_name == "multiclass"
else {
"references": datasets.Sequence(datasets.Value("int32")),
"prediction_scores": datasets.Sequence(datasets.Value("float")),
}
if self.config_name == "multilabel"
else {
"references": datasets.Value("int32"),
"prediction_scores": datasets.Value("float"),
}
),
reference_urls=["https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html"],
)
def _compute(
self,
references,
prediction_scores,
average="macro",
sample_weight=None,
max_fpr=None,
multi_class="raise",
labels=None,
):
return {
"roc_auc": roc_auc_score(
references,
prediction_scores,
average=average,
sample_weight=sample_weight,
max_fpr=max_fpr,
multi_class=multi_class,
labels=labels,
)
}
---
title: ROUGE
emoji: 🤗
colorFrom: blue
colorTo: red
sdk: gradio
sdk_version: 3.19.1
app_file: app.py
pinned: false
tags:
- evaluate
- metric
description: >-
ROUGE, or Recall-Oriented Understudy for Gisting Evaluation, is a set of metrics and a software package used for
evaluating automatic summarization and machine translation software in natural language processing.
The metrics compare an automatically produced summary or translation against a reference or a set of references (human-produced) summary or translation.
Note that ROUGE is case insensitive, meaning that upper case letters are treated the same way as lower case letters.
This metrics is a wrapper around Google Research reimplementation of ROUGE:
https://github.com/google-research/google-research/tree/master/rouge
---
# Metric Card for ROUGE
## Metric Description
ROUGE, or Recall-Oriented Understudy for Gisting Evaluation, is a set of metrics and a software package used for evaluating automatic summarization and machine translation software in natural language processing. The metrics compare an automatically produced summary or translation against a reference or a set of references (human-produced) summary or translation.
Note that ROUGE is case insensitive, meaning that upper case letters are treated the same way as lower case letters.
This metric is a wrapper around the [Google Research reimplementation of ROUGE](https://github.com/google-research/google-research/tree/master/rouge).
## How to Use
At minimum, this metric takes as input a list of predictions and a list of references:
```python
>>> rouge = evaluate.load('rouge')
>>> predictions = ["hello there", "general kenobi"]
>>> references = ["hello there", "general kenobi"]
>>> results = rouge.compute(predictions=predictions,
... references=references)
>>> print(results)
{'rouge1': 1.0, 'rouge2': 1.0, 'rougeL': 1.0, 'rougeLsum': 1.0}
```
One can also pass a custom tokenizer, which is especially useful for non-Latin languages:
```python
>>> results = rouge.compute(predictions=predictions,
... references=references,
...                         tokenizer=lambda x: x.split())
>>> print(results)
{'rouge1': 1.0, 'rouge2': 1.0, 'rougeL': 1.0, 'rougeLsum': 1.0}
```
It can also deal with lists of references for each prediction:
```python
>>> rouge = evaluate.load('rouge')
>>> predictions = ["hello there", "general kenobi"]
>>> references = [["hello", "there"], ["general kenobi", "general yoda"]]
>>> results = rouge.compute(predictions=predictions,
... references=references)
>>> print(results)
{'rouge1': 0.8333, 'rouge2': 0.5, 'rougeL': 0.8333, 'rougeLsum': 0.8333}
```
### Inputs
- **predictions** (`list`): list of predictions to score. Each prediction
should be a string with tokens separated by spaces.
- **references** (`list` or `list[list]`): list containing one reference per prediction, or a list of several references per prediction. Each
reference should be a string with tokens separated by spaces.
- **rouge_types** (`list`): A list of rouge types to calculate. Defaults to `['rouge1', 'rouge2', 'rougeL', 'rougeLsum']`.
- Valid rouge types:
- `"rouge1"`: unigram (1-gram) based scoring
- `"rouge2"`: bigram (2-gram) based scoring
- `"rougeL"`: Longest common subsequence based scoring.
        - `"rougeLsum"`: splits text using `"\n"`
- See [here](https://github.com/huggingface/datasets/issues/617) for more information
- **use_aggregator** (`boolean`): If True, returns aggregates. Defaults to `True`.
- **use_stemmer** (`boolean`): If `True`, uses Porter stemmer to strip word suffixes. Defaults to `False`. See the sketch after this list for a stemming example.
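As referenced above, here is a minimal sketch of `use_stemmer`; the sentences are illustrative and the exact scores (which come from the underlying `rouge_score` package) are not asserted:
```python
import evaluate
rouge = evaluate.load("rouge")
predictions = ["the cats are running"]
references = ["the cat runs"]
# Without stemming, "cats"/"cat" and "running"/"runs" do not match.
plain = rouge.compute(predictions=predictions, references=references)
# With the Porter stemmer enabled, suffixes are stripped before matching.
stemmed = rouge.compute(predictions=predictions,
                        references=references,
                        use_stemmer=True)
print(plain["rouge1"], stemmed["rouge1"])
```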
### Output Values
The output is a dictionary with one entry for each rouge type in the input list `rouge_types`. If `use_aggregator=False`, each dictionary entry is a list of scores, with one score for each sentence. E.g. if `rouge_types=['rouge1', 'rouge2']` and `use_aggregator=False`, the output is:
```python
{'rouge1': [0.6666666666666666, 1.0], 'rouge2': [0.0, 1.0]}
```
If `rouge_types=['rouge1', 'rouge2']` and `use_aggregator=True`, the output is of the following format:
```python
{'rouge1': 1.0, 'rouge2': 1.0}
```
The ROUGE values are in the range of 0 to 1.
#### Values from Popular Papers
### Examples
An example without aggregation:
```python
>>> rouge = evaluate.load('rouge')
>>> predictions = ["hello goodbye", "ankh morpork"]
>>> references = ["goodbye", "general kenobi"]
>>> results = rouge.compute(predictions=predictions,
... references=references,
... use_aggregator=False)
>>> print(list(results.keys()))
['rouge1', 'rouge2', 'rougeL', 'rougeLsum']
>>> print(results["rouge1"])
[0.5, 0.0]
```
The same example, but with aggregation:
```python
>>> rouge = evaluate.load('rouge')
>>> predictions = ["hello goodbye", "ankh morpork"]
>>> references = ["goodbye", "general kenobi"]
>>> results = rouge.compute(predictions=predictions,
... references=references,
... use_aggregator=True)
>>> print(list(results.keys()))
['rouge1', 'rouge2', 'rougeL', 'rougeLsum']
>>> print(results["rouge1"])
0.25
```
The same example, but only calculating `rouge1`:
```python
>>> rouge = evaluate.load('rouge')
>>> predictions = ["hello goodbye", "ankh morpork"]
>>> references = ["goodbye", "general kenobi"]
>>> results = rouge.compute(predictions=predictions,
... references=references,
...                         rouge_types=['rouge1'],
... use_aggregator=True)
>>> print(list(results.keys()))
['rouge1']
>>> print(results["rouge1"])
0.25
```
## Limitations and Bias
See [Schluter (2017)](https://aclanthology.org/E17-2007/) for an in-depth discussion of many of ROUGE's limits.
## Citation
```bibtex
@inproceedings{lin-2004-rouge,
title = "{ROUGE}: A Package for Automatic Evaluation of Summaries",
author = "Lin, Chin-Yew",
booktitle = "Text Summarization Branches Out",
month = jul,
year = "2004",
address = "Barcelona, Spain",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/W04-1013",
pages = "74--81",
}
```
## Further References
- This metric is a wrapper around the [Google Research reimplementation of ROUGE](https://github.com/google-research/google-research/tree/master/rouge)
import evaluate
from evaluate.utils import launch_gradio_widget
module = evaluate.load("rouge")
launch_gradio_widget(module)
git+https://github.com/huggingface/evaluate@{COMMIT_PLACEHOLDER}
absl-py
nltk
rouge_score>=0.1.2
# Copyright 2020 The HuggingFace Evaluate Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" ROUGE metric from Google Research github repo. """
# The dependencies in https://github.com/google-research/google-research/blob/master/rouge/requirements.txt
import absl # Here to have a nice missing dependency error message early on
import datasets
import nltk # Here to have a nice missing dependency error message early on
import numpy # Here to have a nice missing dependency error message early on
import six # Here to have a nice missing dependency error message early on
from rouge_score import rouge_scorer, scoring
import evaluate
_CITATION = """\
@inproceedings{lin-2004-rouge,
title = "{ROUGE}: A Package for Automatic Evaluation of Summaries",
author = "Lin, Chin-Yew",
booktitle = "Text Summarization Branches Out",
month = jul,
year = "2004",
address = "Barcelona, Spain",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/W04-1013",
pages = "74--81",
}
"""
_DESCRIPTION = """\
ROUGE, or Recall-Oriented Understudy for Gisting Evaluation, is a set of metrics and a software package used for
evaluating automatic summarization and machine translation software in natural language processing.
The metrics compare an automatically produced summary or translation against a reference or a set of references (human-produced) summary or translation.
Note that ROUGE is case insensitive, meaning that upper case letters are treated the same way as lower case letters.
This metric is a wrapper around the Google Research reimplementation of ROUGE:
https://github.com/google-research/google-research/tree/master/rouge
"""
_KWARGS_DESCRIPTION = """
Calculates average rouge scores for a list of hypotheses and references
Args:
predictions: list of predictions to score. Each prediction
should be a string with tokens separated by spaces.
references: list of reference for each prediction. Each
reference should be a string with tokens separated by spaces.
rouge_types: A list of rouge types to calculate.
Valid names:
`"rouge{n}"` (e.g. `"rouge1"`, `"rouge2"`) where: {n} is the n-gram based scoring,
`"rougeL"`: Longest common subsequence based scoring.
`"rougeLsum"`: rougeLsum splits text using `"\n"`.
See details in https://github.com/huggingface/datasets/issues/617
use_stemmer: Bool indicating whether Porter stemmer should be used to strip word suffixes.
use_aggregator: Return aggregates if this is set to True
Returns:
rouge1: rouge_1 (f1),
rouge2: rouge_2 (f1),
rougeL: rouge_l (f1),
rougeLsum: rouge_lsum (f1)
Examples:
>>> rouge = evaluate.load('rouge')
>>> predictions = ["hello there", "general kenobi"]
>>> references = ["hello there", "general kenobi"]
>>> results = rouge.compute(predictions=predictions, references=references)
>>> print(results)
{'rouge1': 1.0, 'rouge2': 1.0, 'rougeL': 1.0, 'rougeLsum': 1.0}
"""
class Tokenizer:
"""Helper class to wrap a callable into a class with a `tokenize` method as used by rouge-score."""
def __init__(self, tokenizer_func):
self.tokenizer_func = tokenizer_func
def tokenize(self, text):
return self.tokenizer_func(text)
@evaluate.utils.file_utils.add_start_docstrings(_DESCRIPTION, _KWARGS_DESCRIPTION)
class Rouge(evaluate.Metric):
def _info(self):
return evaluate.MetricInfo(
description=_DESCRIPTION,
citation=_CITATION,
inputs_description=_KWARGS_DESCRIPTION,
features=[
datasets.Features(
{
"predictions": datasets.Value("string", id="sequence"),
"references": datasets.Sequence(datasets.Value("string", id="sequence")),
}
),
datasets.Features(
{
"predictions": datasets.Value("string", id="sequence"),
"references": datasets.Value("string", id="sequence"),
}
),
],
codebase_urls=["https://github.com/google-research/google-research/tree/master/rouge"],
reference_urls=[
"https://en.wikipedia.org/wiki/ROUGE_(metric)",
"https://github.com/google-research/google-research/tree/master/rouge",
],
)
def _compute(
self, predictions, references, rouge_types=None, use_aggregator=True, use_stemmer=False, tokenizer=None
):
if rouge_types is None:
rouge_types = ["rouge1", "rouge2", "rougeL", "rougeLsum"]
multi_ref = isinstance(references[0], list)
if tokenizer is not None:
tokenizer = Tokenizer(tokenizer)
scorer = rouge_scorer.RougeScorer(rouge_types=rouge_types, use_stemmer=use_stemmer, tokenizer=tokenizer)
if use_aggregator:
aggregator = scoring.BootstrapAggregator()
else:
scores = []
for ref, pred in zip(references, predictions):
if multi_ref:
score = scorer.score_multi(ref, pred)
else:
score = scorer.score(ref, pred)
if use_aggregator:
aggregator.add_scores(score)
else:
scores.append(score)
if use_aggregator:
result = aggregator.aggregate()
for key in result:
result[key] = result[key].mid.fmeasure
else:
result = {}
for key in scores[0]:
result[key] = list(score[key].fmeasure for score in scores)
return result
---
title: SacreBLEU
emoji: 🤗
colorFrom: blue
colorTo: red
sdk: gradio
sdk_version: 3.19.1
app_file: app.py
pinned: false
tags:
- evaluate
- metric
description: >-
SacreBLEU provides hassle-free computation of shareable, comparable, and reproducible BLEU scores.
Inspired by Rico Sennrich's `multi-bleu-detok.perl`, it produces the official WMT scores but works with plain text.
It also knows all the standard test sets and handles downloading, processing, and tokenization for you.
See the [README.md] file at https://github.com/mjpost/sacreBLEU for more information.
---
# Metric Card for SacreBLEU
## Metric Description
SacreBLEU provides hassle-free computation of shareable, comparable, and reproducible BLEU scores. Inspired by Rico Sennrich's `multi-bleu-detok.perl`, it produces the official Workshop on Machine Translation (WMT) scores but works with plain text. It also knows all the standard test sets and handles downloading, processing, and tokenization.
See the [sacreBLEU README](https://github.com/mjpost/sacreBLEU) for more information.
## How to Use
This metric takes a set of predictions and a set of references as input, along with various optional parameters.
```python
>>> predictions = ["hello there general kenobi", "foo bar foobar"]
>>> references = [["hello there general kenobi", "hello there !"],
... ["foo bar foobar", "foo bar foobar"]]
>>> sacrebleu = evaluate.load("sacrebleu")
>>> results = sacrebleu.compute(predictions=predictions,
... references=references)
>>> print(list(results.keys()))
['score', 'counts', 'totals', 'precisions', 'bp', 'sys_len', 'ref_len']
>>> print(round(results["score"], 1))
100.0
```
### Inputs
- **`predictions`** (`list` of `str`): list of translations to score. Each translation should be a plain-text string; sacreBLEU handles tokenization internally (see the `tokenize` argument below).
- **`references`** (`list` of `list` of `str`): A list of lists of references. The contents of the first sub-list are the references for the first prediction, the contents of the second sub-list are for the second prediction, etc. Note that there must be the same number of references for each prediction (i.e. all sub-lists must be of the same length).
- **`smooth_method`** (`str`): The smoothing method to use, defaults to `'exp'`. Possible values are:
- `'none'`: no smoothing
- `'floor'`: increment zero counts
- `'add-k'`: increment num/denom by k for n>1
- `'exp'`: exponential decay
- **`smooth_value`** (`float`): The smoothing value. Only valid when `smooth_method='floor'` (in which case `smooth_value` defaults to `0.1`) or `smooth_method='add-k'` (in which case `smooth_value` defaults to `1`).
- **`tokenize`** (`str`): Tokenization method to use for BLEU. If not provided, defaults to `'zh'` for Chinese, `'ja-mecab'` for Japanese and `'13a'` (mteval) otherwise. Possible values are:
- `'none'`: No tokenization.
- `'zh'`: Chinese tokenization.
- `'13a'`: mimics the `mteval-v13a` script from Moses.
- `'intl'`: International tokenization, mimics the `mteval-v14` script from Moses
- `'char'`: Language-agnostic character-level tokenization.
- `'ja-mecab'`: Japanese tokenization. Uses the [MeCab tokenizer](https://pypi.org/project/mecab-python3).
- **`lowercase`** (`bool`): If `True`, lowercases the input, enabling case-insensitivity. Defaults to `False`. See the sketch after this list for `tokenize` and `lowercase` in use.
- **`force`** (`bool`): If `True`, insists that your tokenized input is actually detokenized. Defaults to `False`.
- **`use_effective_order`** (`bool`): If `True`, stops including n-gram orders for which precision is 0. This should be `True` if sentence-level BLEU is being computed. Defaults to `False`.
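As referenced above, a minimal sketch of the `tokenize` and `lowercase` options; the sentences are adapted (with altered casing) from the quick-start example, and exact scores are not asserted:
```python
import evaluate
sacrebleu = evaluate.load("sacrebleu")
predictions = ["Hello there general Kenobi", "foo bar foobar"]
references = [["hello there general kenobi", "hello there !"],
              ["foo bar foobar", "foo bar foobar"]]
# By default, BLEU is case-sensitive, so the capitalization above hurts the score...
default_results = sacrebleu.compute(predictions=predictions,
                                    references=references)
# ...whereas `lowercase=True` ignores case; `tokenize` selects the internal tokenizer.
relaxed_results = sacrebleu.compute(predictions=predictions,
                                    references=references,
                                    tokenize="intl",
                                    lowercase=True)
print(default_results["score"], relaxed_results["score"])
```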
### Output Values
- `score`: corpus-level BLEU score
- `counts`: matching n-gram counts for n-gram orders 1-4
- `totals`: total n-gram counts for n-gram orders 1-4
- `precisions`: n-gram precisions (as percentages) for n-gram orders 1-4
- `bp`: brevity penalty
- `sys_len`: total length of the predictions (in tokens)
- `ref_len`: total length of the references (in tokens)
The output is in the following format:
```python
{'score': 39.76353643835252, 'counts': [6, 4, 2, 1], 'totals': [10, 8, 6, 4], 'precisions': [60.0, 50.0, 33.333333333333336, 25.0], 'bp': 1.0, 'sys_len': 10, 'ref_len': 7}
```
The score can take any value between `0.0` and `100.0`, inclusive.
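For intuition about these fields, the corpus-level `score` is the brevity penalty multiplied by the geometric mean of the four n-gram `precisions`; the example output above is consistent with that relationship:
```python
import math
precisions = [60.0, 50.0, 33.333333333333336, 25.0]
bp = 1.0
# Geometric mean of the n-gram precisions, scaled by the brevity penalty.
score = bp * math.exp(sum(math.log(p) for p in precisions) / len(precisions))
print(round(score, 4))  # ~39.7635, matching the 'score' field above
```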
#### Values from Popular Papers
### Examples
```python
>>> predictions = ["hello there general kenobi",
... "on our way to ankh morpork"]
>>> references = [["hello there general kenobi", "hello there !"],
... ["goodbye ankh morpork", "ankh morpork"]]
>>> sacrebleu = evaluate.load("sacrebleu")
>>> results = sacrebleu.compute(predictions=predictions,
... references=references)
>>> print(list(results.keys()))
['score', 'counts', 'totals', 'precisions', 'bp', 'sys_len', 'ref_len']
>>> print(round(results["score"], 1))
39.8
```
## Limitations and Bias
Because this metric computes BLEU scores, it shares BLEU's limitations; sacreBLEU's advantage is that its scores are shareable, comparable, and more easily reproducible.
## Citation
```bibtex
@inproceedings{post-2018-call,
title = "A Call for Clarity in Reporting {BLEU} Scores",
author = "Post, Matt",
booktitle = "Proceedings of the Third Conference on Machine Translation: Research Papers",
month = oct,
year = "2018",
address = "Belgium, Brussels",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/W18-6319",
pages = "186--191",
}
```
## Further References
- See the [sacreBLEU README.md file](https://github.com/mjpost/sacreBLEU) for more information.
import sys
import evaluate
from evaluate.utils import launch_gradio_widget
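# The sys.path shuffle below is presumably needed so that `import sacrebleu` inside the
# loaded metric script resolves to the pip-installed sacrebleu library rather than this
# Space's local `sacrebleu.py`; the original path is restored before launching the widget.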
sys.path = [p for p in sys.path if p != "/home/user/app"]
module = evaluate.load("sacrebleu")
sys.path = ["/home/user/app"] + sys.path
launch_gradio_widget(module)
git+https://github.com/huggingface/evaluate@{COMMIT_PLACEHOLDER}
sacrebleu
# Copyright 2020 The HuggingFace Evaluate Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" SACREBLEU metric. """
import datasets
import sacrebleu as scb
from packaging import version
import evaluate
_CITATION = """\
@inproceedings{post-2018-call,
title = "A Call for Clarity in Reporting {BLEU} Scores",
author = "Post, Matt",
booktitle = "Proceedings of the Third Conference on Machine Translation: Research Papers",
month = oct,
year = "2018",
address = "Belgium, Brussels",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/W18-6319",
pages = "186--191",
}
"""
_DESCRIPTION = """\
SacreBLEU provides hassle-free computation of shareable, comparable, and reproducible BLEU scores.
Inspired by Rico Sennrich's `multi-bleu-detok.perl`, it produces the official WMT scores but works with plain text.
It also knows all the standard test sets and handles downloading, processing, and tokenization for you.
See the [README.md] file at https://github.com/mjpost/sacreBLEU for more information.
"""
_KWARGS_DESCRIPTION = """
Produces BLEU scores along with its sufficient statistics
from a source against one or more references.
Args:
predictions (`list` of `str`): list of translations to score. Each translation should be tokenized into a list of tokens.
references (`list` of `list` of `str`): A list of lists of references. The contents of the first sub-list are the references for the first prediction, the contents of the second sub-list are for the second prediction, etc. Note that there must be the same number of references for each prediction (i.e. all sub-lists must be of the same length).
smooth_method (`str`): The smoothing method to use, defaults to `'exp'`. Possible values are:
- `'none'`: no smoothing
- `'floor'`: increment zero counts
- `'add-k'`: increment num/denom by k for n>1
- `'exp'`: exponential decay
smooth_value (`float`): The smoothing value. Only valid when `smooth_method='floor'` (in which case `smooth_value` defaults to `0.1`) or `smooth_method='add-k'` (in which case `smooth_value` defaults to `1`).
tokenize (`str`): Tokenization method to use for BLEU. If not provided, defaults to `'zh'` for Chinese, `'ja-mecab'` for Japanese and `'13a'` (mteval) otherwise. Possible values are:
- `'none'`: No tokenization.
- `'zh'`: Chinese tokenization.
- `'13a'`: mimics the `mteval-v13a` script from Moses.
- `'intl'`: International tokenization, mimics the `mteval-v14` script from Moses
- `'char'`: Language-agnostic character-level tokenization.
- `'ja-mecab'`: Japanese tokenization. Uses the [MeCab tokenizer](https://pypi.org/project/mecab-python3).
lowercase (`bool`): If `True`, lowercases the input, enabling case-insensitivity. Defaults to `False`.
force (`bool`): If `True`, insists that your tokenized input is actually detokenized. Defaults to `False`.
use_effective_order (`bool`): If `True`, stops including n-gram orders for which precision is 0. This should be `True`, if sentence-level BLEU will be computed. Defaults to `False`.
Returns:
'score': BLEU score,
'counts': Counts,
'totals': Totals,
'precisions': Precisions,
'bp': Brevity penalty,
'sys_len': predictions length,
'ref_len': reference length,
Examples:
Example 1:
>>> predictions = ["hello there general kenobi", "foo bar foobar"]
>>> references = [["hello there general kenobi", "hello there !"], ["foo bar foobar", "foo bar foobar"]]
>>> sacrebleu = evaluate.load("sacrebleu")
>>> results = sacrebleu.compute(predictions=predictions, references=references)
>>> print(list(results.keys()))
['score', 'counts', 'totals', 'precisions', 'bp', 'sys_len', 'ref_len']
>>> print(round(results["score"], 1))
100.0
Example 2:
>>> predictions = ["hello there general kenobi",
... "on our way to ankh morpork"]
>>> references = [["hello there general kenobi", "hello there !"],
... ["goodbye ankh morpork", "ankh morpork"]]
>>> sacrebleu = evaluate.load("sacrebleu")
>>> results = sacrebleu.compute(predictions=predictions,
... references=references)
>>> print(list(results.keys()))
['score', 'counts', 'totals', 'precisions', 'bp', 'sys_len', 'ref_len']
>>> print(round(results["score"], 1))
39.8
"""
@evaluate.utils.file_utils.add_start_docstrings(_DESCRIPTION, _KWARGS_DESCRIPTION)
class Sacrebleu(evaluate.Metric):
def _info(self):
if version.parse(scb.__version__) < version.parse("1.4.12"):
raise ImportWarning(
"To use `sacrebleu`, the module `sacrebleu>=1.4.12` is required, and the current version of `sacrebleu` doesn't match this condition.\n"
'You can install it with `pip install "sacrebleu>=1.4.12"`.'
)
return evaluate.MetricInfo(
description=_DESCRIPTION,
citation=_CITATION,
homepage="https://github.com/mjpost/sacreBLEU",
inputs_description=_KWARGS_DESCRIPTION,
features=[
datasets.Features(
{
"predictions": datasets.Value("string", id="sequence"),
"references": datasets.Sequence(datasets.Value("string", id="sequence"), id="references"),
}
),
datasets.Features(
{
"predictions": datasets.Value("string", id="sequence"),
"references": datasets.Value("string", id="sequence"),
}
),
],
codebase_urls=["https://github.com/mjpost/sacreBLEU"],
reference_urls=[
"https://github.com/mjpost/sacreBLEU",
"https://en.wikipedia.org/wiki/BLEU",
"https://towardsdatascience.com/evaluating-text-output-in-nlp-bleu-at-your-own-risk-e8609665a213",
],
)
def _compute(
self,
predictions,
references,
smooth_method="exp",
smooth_value=None,
force=False,
lowercase=False,
tokenize=None,
use_effective_order=False,
):
# if only one reference is provided make sure we still use list of lists
if isinstance(references[0], str):
references = [[ref] for ref in references]
references_per_prediction = len(references[0])
if any(len(refs) != references_per_prediction for refs in references):
raise ValueError("Sacrebleu requires the same number of references for each prediction")
transformed_references = [[refs[i] for refs in references] for i in range(references_per_prediction)]
output = scb.corpus_bleu(
predictions,
transformed_references,
smooth_method=smooth_method,
smooth_value=smooth_value,
force=force,
lowercase=lowercase,
use_effective_order=use_effective_order,
**(dict(tokenize=tokenize) if tokenize else {}),
)
output_dict = {
"score": output.score,
"counts": output.counts,
"totals": output.totals,
"precisions": output.precisions,
"bp": output.bp,
"sys_len": output.sys_len,
"ref_len": output.ref_len,
}
return output_dict
---
title: SARI
emoji: 🤗
colorFrom: blue
colorTo: red
sdk: gradio
sdk_version: 3.19.1
app_file: app.py
pinned: false
tags:
- evaluate
- metric
description: >-
SARI is a metric used for evaluating automatic text simplification systems.
The metric compares the predicted simplified sentences against the reference
and the source sentences. It explicitly measures the goodness of words that are
added, deleted and kept by the system.
Sari = (F1_add + F1_keep + P_del) / 3
where
F1_add: n-gram F1 score for add operation
F1_keep: n-gram F1 score for keep operation
P_del: n-gram precision score for delete operation
n = 4, as in the original paper.
This implementation is adapted from Tensorflow's tensor2tensor implementation [3].
It has two differences with the original GitHub [1] implementation:
(1) Defines 0/0=1 instead of 0 to give higher scores for predictions that match
a target exactly.
(2) Fixes an alleged bug [2] in the keep score computation.
[1] https://github.com/cocoxu/simplification/blob/master/SARI.py
(commit 0210f15)
[2] https://github.com/cocoxu/simplification/issues/6
[3] https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/utils/sari_hook.py
---
# Metric Card for SARI
## Metric description
SARI (***s**ystem output **a**gainst **r**eferences and against the **i**nput sentence*) is a metric used for evaluating automatic text simplification systems.
The metric compares the predicted simplified sentences against the reference and the source sentences. It explicitly measures the goodness of words that are added, deleted and kept by the system.
SARI can be computed as:
`sari = (F1_add + F1_keep + P_del) / 3`
where
`F1_add` is the n-gram F1 score for add operations
`F1_keep` is the n-gram F1 score for keep operations
`P_del` is the n-gram precision score for delete operations
The highest n-gram order, `n`, is 4, as in the original paper.
This implementation is adapted from [Tensorflow's tensor2tensor implementation](https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/utils/sari_hook.py).
It differs from the [original GitHub implementation](https://github.com/cocoxu/simplification/blob/master/SARI.py) in two ways:
1) It defines 0/0=1 instead of 0 to give higher scores for predictions that match a target exactly.
2) It fixes an [alleged bug](https://github.com/cocoxu/simplification/issues/6) in the keep score computation.
## How to use
The metric takes 3 inputs: sources (a list of source sentence strings), predictions (a list of predicted sentence strings) and references (a list of lists of reference sentence strings)
```python
from evaluate import load
sari = load("sari")
sources=["About 95 species are currently accepted."]
predictions=["About 95 you now get in."]
references=[["About 95 species are currently known.","About 95 species are now accepted.","95 species are now accepted."]]
sari_score = sari.compute(sources=sources, predictions=predictions, references=references)
```
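The metric can also be called on a whole corpus at once. Based on the implementation included further down in this document, the returned value is the mean of the per-sentence SARI scores, scaled to the 0-100 range. A minimal sketch with illustrative sentences:
```python
from evaluate import load
sari = load("sari")
sources = ["About 95 species are currently accepted .",
           "The cat perched on the mat ."]
predictions = ["About 95 species are now accepted .",
               "The cat sat on the mat ."]
references = [["About 95 species are now accepted ."],
              ["The cat sat on the mat ."]]
# One source, one prediction, and one list of references per example;
# the returned 'sari' value averages the per-sentence scores.
corpus_score = sari.compute(sources=sources, predictions=predictions, references=references)
print(corpus_score)
```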
## Output values
This metric outputs a dictionary with the SARI score:
```
print(sari_score)
{'sari': 26.953601953601954}
```
The range of values for the SARI score is between 0 and 100 -- the higher the value, the better the performance of the model being evaluated, with a SARI of 100 being a perfect score.
### Values from popular papers
The [original paper that proposes the SARI metric](https://aclanthology.org/Q16-1029.pdf) reports scores ranging from 26 to 43 for different simplification systems and different datasets. They also find that the metric ranks all of the simplification systems and human references in the same order as the human assessment used as a comparison, and that it correlates reasonably with human judgments.
More recent SARI scores for text simplification can be found on leaderboards for datasets such as [TurkCorpus](https://paperswithcode.com/sota/text-simplification-on-turkcorpus) and [Newsela](https://paperswithcode.com/sota/text-simplification-on-newsela).
## Examples
Perfect match between prediction and reference:
```python
from evaluate import load
sari = load("sari")
sources=["About 95 species are currently accepted ."]
predictions=["About 95 species are currently accepted ."]
references=[["About 95 species are currently accepted ."]]
sari_score = sari.compute(sources=sources, predictions=predictions, references=references)
print(sari_score)
{'sari': 100.0}
```
Partial match between prediction and reference:
```python
from evaluate import load
sari = load("sari")
sources=["About 95 species are currently accepted ."]
predictions=["About 95 you now get in ."]
references=[["About 95 species are currently known .","About 95 species are now accepted .","95 species are now accepted ."]]
sari_score = sari.compute(sources=sources, predictions=predictions, references=references)
print(sari_score)
{'sari': 26.953601953601954}
```
## Limitations and bias
SARI is a valuable measure for comparing different text simplification systems as well as one that can assist the iterative development of a system.
However, while the [original paper presenting SARI](https://aclanthology.org/Q16-1029.pdf) states that it captures "the notion of grammaticality and meaning preservation", this is a difficult claim to empirically validate.
## Citation
```bibtex
@inproceedings{xu-etal-2016-optimizing,
title = {Optimizing Statistical Machine Translation for Text Simplification},
author = {Xu, Wei and Napoles, Courtney and Pavlick, Ellie and Chen, Quanze and Callison-Burch, Chris},
journal = {Transactions of the Association for Computational Linguistics},
volume = {4},
year={2016},
url = {https://www.aclweb.org/anthology/Q16-1029},
pages = {401--415},
}
```
## Further References
- [NLP Progress -- Text Simplification](http://nlpprogress.com/english/simplification.html)
- [Hugging Face Hub -- Text Simplification Models](https://huggingface.co/datasets?filter=task_ids:text-simplification)
import evaluate
from evaluate.utils import launch_gradio_widget
module = evaluate.load("sari")
launch_gradio_widget(module)
git+https://github.com/huggingface/evaluate@{COMMIT_PLACEHOLDER}
sacrebleu
sacremoses
# Copyright 2020 The HuggingFace Datasets Authors and the current dataset script contributor.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" SARI metric."""
from collections import Counter
import datasets
import sacrebleu
import sacremoses
from packaging import version
import evaluate
_CITATION = """\
@inproceedings{xu-etal-2016-optimizing,
title = {Optimizing Statistical Machine Translation for Text Simplification},
author = {Xu, Wei and Napoles, Courtney and Pavlick, Ellie and Chen, Quanze and Callison-Burch, Chris},
journal = {Transactions of the Association for Computational Linguistics},
volume = {4},
year={2016},
url = {https://www.aclweb.org/anthology/Q16-1029},
pages = {401--415},
}
"""
_DESCRIPTION = """\
SARI is a metric used for evaluating automatic text simplification systems.
The metric compares the predicted simplified sentences against the reference
and the source sentences. It explicitly measures the goodness of words that are
added, deleted and kept by the system.
Sari = (F1_add + F1_keep + P_del) / 3
where
F1_add: n-gram F1 score for add operation
F1_keep: n-gram F1 score for keep operation
P_del: n-gram precision score for delete operation
n = 4, as in the original paper.
This implementation is adapted from Tensorflow's tensor2tensor implementation [3].
It has two differences with the original GitHub [1] implementation:
(1) Defines 0/0=1 instead of 0 to give higher scores for predictions that match
a target exactly.
(2) Fixes an alleged bug [2] in the keep score computation.
[1] https://github.com/cocoxu/simplification/blob/master/SARI.py
(commit 0210f15)
[2] https://github.com/cocoxu/simplification/issues/6
[3] https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/utils/sari_hook.py
"""
_KWARGS_DESCRIPTION = """
Calculates sari score (between 0 and 100) given a list of source and predicted
sentences, and a list of lists of reference sentences.
Args:
sources: list of source sentences where each sentence should be a string.
predictions: list of predicted sentences where each sentence should be a string.
references: list of lists of reference sentences where each sentence should be a string.
Returns:
sari: sari score
Examples:
>>> sources=["About 95 species are currently accepted ."]
>>> predictions=["About 95 you now get in ."]
>>> references=[["About 95 species are currently known .","About 95 species are now accepted .","95 species are now accepted ."]]
>>> sari = evaluate.load("sari")
>>> results = sari.compute(sources=sources, predictions=predictions, references=references)
>>> print(results)
{'sari': 26.953601953601954}
"""
def SARIngram(sgrams, cgrams, rgramslist, numref):
rgramsall = [rgram for rgrams in rgramslist for rgram in rgrams]
rgramcounter = Counter(rgramsall)
sgramcounter = Counter(sgrams)
sgramcounter_rep = Counter()
for sgram, scount in sgramcounter.items():
sgramcounter_rep[sgram] = scount * numref
cgramcounter = Counter(cgrams)
cgramcounter_rep = Counter()
for cgram, ccount in cgramcounter.items():
cgramcounter_rep[cgram] = ccount * numref
# KEEP
keepgramcounter_rep = sgramcounter_rep & cgramcounter_rep
keepgramcountergood_rep = keepgramcounter_rep & rgramcounter
keepgramcounterall_rep = sgramcounter_rep & rgramcounter
keeptmpscore1 = 0
keeptmpscore2 = 0
for keepgram in keepgramcountergood_rep:
keeptmpscore1 += keepgramcountergood_rep[keepgram] / keepgramcounter_rep[keepgram]
# Fix an alleged bug [2] in the keep score computation.
# keeptmpscore2 += keepgramcountergood_rep[keepgram] / keepgramcounterall_rep[keepgram]
keeptmpscore2 += keepgramcountergood_rep[keepgram]
# Define 0/0=1 instead of 0 to give higher scores for predictions that match
# a target exactly.
keepscore_precision = 1
keepscore_recall = 1
if len(keepgramcounter_rep) > 0:
keepscore_precision = keeptmpscore1 / len(keepgramcounter_rep)
if len(keepgramcounterall_rep) > 0:
# Fix an alleged bug [2] in the keep score computation.
# keepscore_recall = keeptmpscore2 / len(keepgramcounterall_rep)
keepscore_recall = keeptmpscore2 / sum(keepgramcounterall_rep.values())
keepscore = 0
if keepscore_precision > 0 or keepscore_recall > 0:
keepscore = 2 * keepscore_precision * keepscore_recall / (keepscore_precision + keepscore_recall)
# DELETION
delgramcounter_rep = sgramcounter_rep - cgramcounter_rep
delgramcountergood_rep = delgramcounter_rep - rgramcounter
delgramcounterall_rep = sgramcounter_rep - rgramcounter
deltmpscore1 = 0
deltmpscore2 = 0
for delgram in delgramcountergood_rep:
deltmpscore1 += delgramcountergood_rep[delgram] / delgramcounter_rep[delgram]
deltmpscore2 += delgramcountergood_rep[delgram] / delgramcounterall_rep[delgram]
# Define 0/0=1 instead of 0 to give higher scores for predictions that match
# a target exactly.
delscore_precision = 1
if len(delgramcounter_rep) > 0:
delscore_precision = deltmpscore1 / len(delgramcounter_rep)
# ADDITION
addgramcounter = set(cgramcounter) - set(sgramcounter)
addgramcountergood = set(addgramcounter) & set(rgramcounter)
addgramcounterall = set(rgramcounter) - set(sgramcounter)
addtmpscore = 0
for addgram in addgramcountergood:
addtmpscore += 1
# Define 0/0=1 instead of 0 to give higher scores for predictions that match
# a target exactly.
addscore_precision = 1
addscore_recall = 1
if len(addgramcounter) > 0:
addscore_precision = addtmpscore / len(addgramcounter)
if len(addgramcounterall) > 0:
addscore_recall = addtmpscore / len(addgramcounterall)
addscore = 0
if addscore_precision > 0 or addscore_recall > 0:
addscore = 2 * addscore_precision * addscore_recall / (addscore_precision + addscore_recall)
return (keepscore, delscore_precision, addscore)
def SARIsent(ssent, csent, rsents):
numref = len(rsents)
s1grams = ssent.split(" ")
c1grams = csent.split(" ")
s2grams = []
c2grams = []
s3grams = []
c3grams = []
s4grams = []
c4grams = []
r1gramslist = []
r2gramslist = []
r3gramslist = []
r4gramslist = []
for rsent in rsents:
r1grams = rsent.split(" ")
r2grams = []
r3grams = []
r4grams = []
r1gramslist.append(r1grams)
for i in range(0, len(r1grams) - 1):
if i < len(r1grams) - 1:
r2gram = r1grams[i] + " " + r1grams[i + 1]
r2grams.append(r2gram)
if i < len(r1grams) - 2:
r3gram = r1grams[i] + " " + r1grams[i + 1] + " " + r1grams[i + 2]
r3grams.append(r3gram)
if i < len(r1grams) - 3:
r4gram = r1grams[i] + " " + r1grams[i + 1] + " " + r1grams[i + 2] + " " + r1grams[i + 3]
r4grams.append(r4gram)
r2gramslist.append(r2grams)
r3gramslist.append(r3grams)
r4gramslist.append(r4grams)
for i in range(0, len(s1grams) - 1):
if i < len(s1grams) - 1:
s2gram = s1grams[i] + " " + s1grams[i + 1]
s2grams.append(s2gram)
if i < len(s1grams) - 2:
s3gram = s1grams[i] + " " + s1grams[i + 1] + " " + s1grams[i + 2]
s3grams.append(s3gram)
if i < len(s1grams) - 3:
s4gram = s1grams[i] + " " + s1grams[i + 1] + " " + s1grams[i + 2] + " " + s1grams[i + 3]
s4grams.append(s4gram)
for i in range(0, len(c1grams) - 1):
if i < len(c1grams) - 1:
c2gram = c1grams[i] + " " + c1grams[i + 1]
c2grams.append(c2gram)
if i < len(c1grams) - 2:
c3gram = c1grams[i] + " " + c1grams[i + 1] + " " + c1grams[i + 2]
c3grams.append(c3gram)
if i < len(c1grams) - 3:
c4gram = c1grams[i] + " " + c1grams[i + 1] + " " + c1grams[i + 2] + " " + c1grams[i + 3]
c4grams.append(c4gram)
(keep1score, del1score, add1score) = SARIngram(s1grams, c1grams, r1gramslist, numref)
(keep2score, del2score, add2score) = SARIngram(s2grams, c2grams, r2gramslist, numref)
(keep3score, del3score, add3score) = SARIngram(s3grams, c3grams, r3gramslist, numref)
(keep4score, del4score, add4score) = SARIngram(s4grams, c4grams, r4gramslist, numref)
avgkeepscore = sum([keep1score, keep2score, keep3score, keep4score]) / 4
avgdelscore = sum([del1score, del2score, del3score, del4score]) / 4
avgaddscore = sum([add1score, add2score, add3score, add4score]) / 4
finalscore = (avgkeepscore + avgdelscore + avgaddscore) / 3
return finalscore
def normalize(sentence, lowercase: bool = True, tokenizer: str = "13a", return_str: bool = True):
# Normalization is required for the ASSET dataset (one of the primary
# datasets in sentence simplification) to allow using space
# to split the sentence. Even though the Wiki-Auto and TURK datasets
# do not require normalization, we do it for consistency.
# Code adapted from the EASSE library [1] written by the authors of the ASSET dataset.
# [1] https://github.com/feralvam/easse/blob/580bba7e1378fc8289c663f864e0487188fe8067/easse/utils/preprocessing.py#L7
if lowercase:
sentence = sentence.lower()
if tokenizer in ["13a", "intl"]:
if version.parse(sacrebleu.__version__).major >= 2:
normalized_sent = sacrebleu.metrics.bleu._get_tokenizer(tokenizer)()(sentence)
else:
normalized_sent = sacrebleu.TOKENIZERS[tokenizer]()(sentence)
elif tokenizer == "moses":
normalized_sent = sacremoses.MosesTokenizer().tokenize(sentence, return_str=True, escape=False)
elif tokenizer == "penn":
normalized_sent = sacremoses.MosesTokenizer().penn_tokenize(sentence, return_str=True)
else:
normalized_sent = sentence
if not return_str:
normalized_sent = normalized_sent.split()
return normalized_sent
@evaluate.utils.file_utils.add_start_docstrings(_DESCRIPTION, _KWARGS_DESCRIPTION)
class Sari(evaluate.Metric):
def _info(self):
return evaluate.MetricInfo(
description=_DESCRIPTION,
citation=_CITATION,
inputs_description=_KWARGS_DESCRIPTION,
features=datasets.Features(
{
"sources": datasets.Value("string", id="sequence"),
"predictions": datasets.Value("string", id="sequence"),
"references": datasets.Sequence(datasets.Value("string", id="sequence"), id="references"),
}
),
codebase_urls=[
"https://github.com/cocoxu/simplification/blob/master/SARI.py",
"https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/utils/sari_hook.py",
],
reference_urls=["https://www.aclweb.org/anthology/Q16-1029.pdf"],
)
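# Corpus-level SARI: average the sentence-level scores over all examples and scale to the
# conventional 0-100 range.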
def _compute(self, sources, predictions, references):
if not (len(sources) == len(predictions) == len(references)):
raise ValueError("Sources length must match predictions and references lengths.")
sari_score = 0
for src, pred, refs in zip(sources, predictions, references):
sari_score += SARIsent(normalize(src), normalize(pred), [normalize(sent) for sent in refs])
sari_score = sari_score / len(predictions)
return {"sari": 100 * sari_score}
---
title: seqeval
emoji: 🤗
colorFrom: blue
colorTo: red
sdk: gradio
sdk_version: 3.19.1
app_file: app.py
pinned: false
tags:
- evaluate
- metric
description: >-
seqeval is a Python framework for sequence labeling evaluation.
seqeval can evaluate the performance of chunking tasks such as named-entity recognition, part-of-speech tagging, semantic role labeling and so on.
This is well-tested by using the Perl script conlleval, which can be used for
measuring the performance of a system that has processed the CoNLL-2000 shared task data.
seqeval supports the following formats:
IOB1
IOB2
IOE1
IOE2
IOBES
See the [README.md] file at https://github.com/chakki-works/seqeval for more information.
---
# Metric Card for seqeval
## Metric description
seqeval is a Python framework for sequence labeling evaluation. seqeval can evaluate the performance of chunking tasks such as named-entity recognition, part-of-speech tagging, semantic role labeling and so on.
## How to use
Seqeval produces labelling scores along with their sufficient statistics from a source against one or more references.
It takes two mandatory arguments:
`predictions`: a list of lists of predicted labels, i.e. estimated targets as returned by a tagger.
`references`: a list of lists of reference labels, i.e. the ground truth/target values.
It can also take several optional arguments:
`suffix` (boolean): `True` if the IOB tag is a suffix (after type) instead of a prefix (before type), `False` otherwise. The default value is `False`, i.e. the IOB tag is a prefix (before type).
`scheme`: the target tagging scheme, which can be one of [`IOB1`, `IOB2`, `IOE1`, `IOE2`, `IOBES`, `BILOU`]. The default value is `None`.
`mode`: whether to count correct entity labels with incorrect I/B tags as true positives or not. If you want to only count exact matches, pass `mode="strict"` and a specific `scheme` value. The default is `None`.
`sample_weight`: An array-like of shape (n_samples,) that provides weights for individual samples. The default is `None`.
`zero_division`: the value to substitute as a metric value when encountering zero division. Should be one of [`0`, `1`, `"warn"`]. `"warn"` acts as `0`, but a warning is also raised. The default value is `"warn"`.
```python
>>> seqeval = evaluate.load('seqeval')
>>> predictions = [['O', 'O', 'B-MISC', 'I-MISC', 'I-MISC', 'I-MISC', 'O'], ['B-PER', 'I-PER', 'O']]
>>> references = [['O', 'O', 'O', 'B-MISC', 'I-MISC', 'I-MISC', 'O'], ['B-PER', 'I-PER', 'O']]
>>> results = seqeval.compute(predictions=predictions, references=references)
```
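To count only exact entity matches under a specific tagging scheme, `mode="strict"` can be combined with a `scheme` value. A minimal sketch reusing the inputs above (scores depend on your data):
```python
>>> strict_results = seqeval.compute(predictions=predictions, references=references, mode="strict", scheme="IOB2")
```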
## Output values
This metric returns a dictionary with a summary of the overall and per-type scores:
Overall:
`accuracy`: the average [accuracy](https://huggingface.co/metrics/accuracy), on a scale between 0.0 and 1.0.
`precision`: the average [precision](https://huggingface.co/metrics/precision), on a scale between 0.0 and 1.0.
`recall`: the average [recall](https://huggingface.co/metrics/recall), on a scale between 0.0 and 1.0.
`f1`: the average [F1 score](https://huggingface.co/metrics/f1), which is the harmonic mean of the precision and recall. It also has a scale of 0.0 to 1.0.
Per type (e.g. `MISC`, `PER`, `LOC`,...):
`precision`: the average [precision](https://huggingface.co/metrics/precision), on a scale between 0.0 and 1.0.
`recall`: the average [recall](https://huggingface.co/metrics/recall), on a scale between 0.0 and 1.0.
`f1`: the average [F1 score](https://huggingface.co/metrics/f1), on a scale between 0.0 and 1.0.
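Continuing the non-strict example above, overall scores are stored under the `overall_*` keys, while per-type scores are nested under the entity type:
```python
>>> print(results["overall_f1"])
0.5
>>> print(results["PER"]["f1"])
1.0
```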
### Values from popular papers
The 1995 "Text Chunking using Transformation-Based Learning" [paper](https://aclanthology.org/W95-0107) reported a baseline recall of 81.9% and a precision of 78.2% using non Deep Learning-based methods.
More recently, seqeval continues being used for reporting performance on tasks such as [named entity detection](https://www.mdpi.com/2306-5729/6/8/84/htm) and [information extraction](https://ieeexplore.ieee.org/abstract/document/9697942/).
## Examples
Maximal values (full match):
```python
>>> seqeval = evaluate.load('seqeval')
>>> predictions = [['O', 'O', 'B-MISC', 'I-MISC', 'I-MISC', 'I-MISC', 'O'], ['B-PER', 'I-PER', 'O']]
>>> references = [['O', 'O', 'B-MISC', 'I-MISC', 'I-MISC', 'I-MISC', 'O'], ['B-PER', 'I-PER', 'O']]
>>> results = seqeval.compute(predictions=predictions, references=references)
>>> print(results)
{'MISC': {'precision': 1.0, 'recall': 1.0, 'f1': 1.0, 'number': 1}, 'PER': {'precision': 1.0, 'recall': 1.0, 'f1': 1.0, 'number': 1}, 'overall_precision': 1.0, 'overall_recall': 1.0, 'overall_f1': 1.0, 'overall_accuracy': 1.0}
```
Minimal values (no match):
```python
>>> seqeval = evaluate.load('seqeval')
>>> predictions = [['O', 'B-MISC', 'I-MISC'], ['B-PER', 'I-PER', 'O']]
>>> references = [['B-MISC', 'O', 'O'], ['I-PER', '0', 'I-PER']]
>>> results = seqeval.compute(predictions=predictions, references=references)
>>> print(results)
{'MISC': {'precision': 0.0, 'recall': 0.0, 'f1': 0.0, 'number': 1}, 'PER': {'precision': 0.0, 'recall': 0.0, 'f1': 0.0, 'number': 2}, '_': {'precision': 0.0, 'recall': 0.0, 'f1': 0.0, 'number': 1}, 'overall_precision': 0.0, 'overall_recall': 0.0, 'overall_f1': 0.0, 'overall_accuracy': 0.0}
```
Partial match:
```python
>>> seqeval = evaluate.load('seqeval')
>>> predictions = [['O', 'O', 'B-MISC', 'I-MISC', 'I-MISC', 'I-MISC', 'O'], ['B-PER', 'I-PER', 'O']]
>>> references = [['O', 'O', 'O', 'B-MISC', 'I-MISC', 'I-MISC', 'O'], ['B-PER', 'I-PER', 'O']]
>>> results = seqeval.compute(predictions=predictions, references=references)
>>> print(results)
{'MISC': {'precision': 0.0, 'recall': 0.0, 'f1': 0.0, 'number': 1}, 'PER': {'precision': 1.0, 'recall': 1.0, 'f1': 1.0, 'number': 1}, 'overall_precision': 0.5, 'overall_recall': 0.5, 'overall_f1': 0.5, 'overall_accuracy': 0.8}
```
## Limitations and bias
seqeval supports the following IOB formats (short for inside, outside, beginning): `IOB1`, `IOB2`, `IOE1`, `IOE2`, `IOBES` and `BILOU` (the latter two only in strict mode).
For more information about IOB formats, refer to the [Wikipedia page](https://en.wikipedia.org/wiki/Inside%E2%80%93outside%E2%80%93beginning_(tagging)) and the description of the [CoNLL-2000 shared task](https://aclanthology.org/W02-2024).
## Citation
```bibtex
@inproceedings{ramshaw-marcus-1995-text,
title = "Text Chunking using Transformation-Based Learning",
author = "Ramshaw, Lance and
Marcus, Mitch",
booktitle = "Third Workshop on Very Large Corpora",
year = "1995",
url = "https://www.aclweb.org/anthology/W95-0107",
}
```
```bibtex
@misc{seqeval,
title={{seqeval}: A Python framework for sequence labeling evaluation},
url={https://github.com/chakki-works/seqeval},
note={Software available from https://github.com/chakki-works/seqeval},
author={Hiroki Nakayama},
year={2018},
}
```
## Further References
- [README for seqeval at GitHub](https://github.com/chakki-works/seqeval)
- [CoNLL-2000 shared task](https://www.clips.uantwerpen.be/conll2002/ner/bin/conlleval.txt)
import sys
import evaluate
from evaluate.utils import launch_gradio_widget
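# Drop the Space's own directory from sys.path before loading the metric, presumably so that the
# local seqeval.py metric script does not shadow the installed `seqeval` package during the import;
# the path is restored immediately after loading.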
sys.path = [p for p in sys.path if p != "/home/user/app"]
module = evaluate.load("seqeval")
sys.path = ["/home/user/app"] + sys.path
launch_gradio_widget(module)
git+https://github.com/huggingface/evaluate@{COMMIT_PLACEHOLDER}
seqeval
# Copyright 2020 The HuggingFace Evaluate Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" seqeval metric. """
import importlib
from typing import List, Optional, Union
import datasets
from seqeval.metrics import accuracy_score, classification_report
import evaluate
_CITATION = """\
@inproceedings{ramshaw-marcus-1995-text,
title = "Text Chunking using Transformation-Based Learning",
author = "Ramshaw, Lance and
Marcus, Mitch",
booktitle = "Third Workshop on Very Large Corpora",
year = "1995",
url = "https://www.aclweb.org/anthology/W95-0107",
}
@misc{seqeval,
title={{seqeval}: A Python framework for sequence labeling evaluation},
url={https://github.com/chakki-works/seqeval},
note={Software available from https://github.com/chakki-works/seqeval},
author={Hiroki Nakayama},
year={2018},
}
"""
_DESCRIPTION = """\
seqeval is a Python framework for sequence labeling evaluation.
seqeval can evaluate the performance of chunking tasks such as named-entity recognition, part-of-speech tagging, semantic role labeling and so on.
This is well-tested by using the Perl script conlleval, which can be used for
measuring the performance of a system that has processed the CoNLL-2000 shared task data.
seqeval supports the following formats:
IOB1
IOB2
IOE1
IOE2
IOBES
See the [README.md] file at https://github.com/chakki-works/seqeval for more information.
"""
_KWARGS_DESCRIPTION = """
Produces labelling scores along with their sufficient statistics
from a source against one or more references.
Args:
predictions: List of List of predicted labels (Estimated targets as returned by a tagger)
references: List of List of reference labels (Ground truth (correct) target values)
suffix: True if the IOB tag is a suffix (placed after the type), False otherwise. default: False
scheme: Specify target tagging scheme. Should be one of ["IOB1", "IOB2", "IOE1", "IOE2", "IOBES", "BILOU"].
default: None
mode: Whether to count correct entity labels with incorrect I/B tags as true positives or not.
If you want to only count exact matches, pass mode="strict". default: None.
sample_weight: Array-like of shape (n_samples,), weights for individual samples. default: None
zero_division: Which value to substitute as a metric value when encountering zero division. Should be one of 0, 1,
"warn". "warn" acts as 0, but a warning is also raised. default: "warn"
Returns:
'scores': dict. Summary of the scores for overall and per type
Overall:
'accuracy': accuracy,
'precision': precision,
'recall': recall,
'f1': F1 score, also known as balanced F-score or F-measure,
Per type:
'precision': precision,
'recall': recall,
'f1': F1 score, also known as balanced F-score or F-measure
Examples:
>>> predictions = [['O', 'O', 'B-MISC', 'I-MISC', 'I-MISC', 'I-MISC', 'O'], ['B-PER', 'I-PER', 'O']]
>>> references = [['O', 'O', 'O', 'B-MISC', 'I-MISC', 'I-MISC', 'O'], ['B-PER', 'I-PER', 'O']]
>>> seqeval = evaluate.load("seqeval")
>>> results = seqeval.compute(predictions=predictions, references=references)
>>> print(list(results.keys()))
['MISC', 'PER', 'overall_precision', 'overall_recall', 'overall_f1', 'overall_accuracy']
>>> print(results["overall_f1"])
0.5
>>> print(results["PER"]["f1"])
1.0
"""
@evaluate.utils.file_utils.add_start_docstrings(_DESCRIPTION, _KWARGS_DESCRIPTION)
class Seqeval(evaluate.Metric):
def _info(self):
return evaluate.MetricInfo(
description=_DESCRIPTION,
citation=_CITATION,
homepage="https://github.com/chakki-works/seqeval",
inputs_description=_KWARGS_DESCRIPTION,
features=datasets.Features(
{
"predictions": datasets.Sequence(datasets.Value("string", id="label"), id="sequence"),
"references": datasets.Sequence(datasets.Value("string", id="label"), id="sequence"),
}
),
codebase_urls=["https://github.com/chakki-works/seqeval"],
reference_urls=["https://github.com/chakki-works/seqeval"],
)
def _compute(
self,
predictions,
references,
suffix: bool = False,
scheme: Optional[str] = None,
mode: Optional[str] = None,
sample_weight: Optional[List[int]] = None,
zero_division: Union[str, int] = "warn",
):
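# Resolve a scheme name such as "IOB2" to the corresponding class in seqeval.scheme.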
if scheme is not None:
try:
scheme_module = importlib.import_module("seqeval.scheme")
scheme = getattr(scheme_module, scheme)
except AttributeError:
raise ValueError(f"Scheme should be one of [IOB1, IOB2, IOE1, IOE2, IOBES, BILOU], got {scheme}")
report = classification_report(
y_true=references,
y_pred=predictions,
suffix=suffix,
output_dict=True,
scheme=scheme,
mode=mode,
sample_weight=sample_weight,
zero_division=zero_division,
)
report.pop("macro avg")
report.pop("weighted avg")
overall_score = report.pop("micro avg")
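# The remaining report entries are the individual entity types; keep their precision, recall,
# F1 and support (exposed as "number").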
scores = {
type_name: {
"precision": score["precision"],
"recall": score["recall"],
"f1": score["f1-score"],
"number": score["support"],
}
for type_name, score in report.items()
}
scores["overall_precision"] = overall_score["precision"]
scores["overall_recall"] = overall_score["recall"]
scores["overall_f1"] = overall_score["f1-score"]
scores["overall_accuracy"] = accuracy_score(y_true=references, y_pred=predictions)
return scores