title={SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems},
author={Wang, Alex and Pruksachatkun, Yada and Nangia, Nikita and Singh, Amanpreet and Michael, Julian and Hill, Felix and Levy, Omer and Bowman, Samuel R},
journal={arXiv preprint arXiv:1905.00537},
year={2019}
}
"""
_DESCRIPTION="""\
SuperGLUE (https://super.gluebenchmark.com/) is a new benchmark styled after
GLUE with a new set of more difficult language understanding tasks, improved
resources, and a new public leaderboard.
"""
_KWARGS_DESCRIPTION="""
Compute SuperGLUE evaluation metric associated to each SuperGLUE dataset.
Args:
predictions: list of predictions to score. Depending on the SuperGlUE subset:
- for 'record': list of question-answer dictionaries with the following keys:
- 'idx': index of the question as specified by the dataset
- 'prediction_text': the predicted answer text
- for 'multirc': list of question-answer dictionaries with the following keys:
- 'idx': index of the question-answer pair as specified by the dataset
- 'prediction': the predicted answer label
- otherwise: list of predicted labels
references: list of reference labels. Depending on the SuperGLUE subset:
- for 'record': list of question-answers dictionaries with the following keys:
- 'idx': index of the question as specified by the dataset
- 'answers': list of possible answers
- otherwise: list of reference labels
Returns: depending on the SuperGLUE subset:
- for 'record':
- 'exact_match': Exact match between answer and gold answer
- 'f1': F1 score
- for 'multirc':
- 'exact_match': Exact match between answer and gold answer
- 'f1_m': Per-question macro-F1 score
- 'f1_a': Average F1 score over all answers
- for 'axb':
'matthews_correlation': Matthew Correlation
- for 'cb':
- 'accuracy': Accuracy
- 'f1': F1 score
- for all others:
- 'accuracy': Accuracy
Examples:
>>> super_glue_metric = evaluate.load('super_glue', 'copa') # any of ["copa", "rte", "wic", "wsc", "wsc.fixed", "boolq", "axg"]
TER (Translation Edit Rate, also called Translation Error Rate) is a metric to quantify the edit operations that a
hypothesis requires to match a reference translation. We use the implementation that is already present in sacrebleu
(https://github.com/mjpost/sacreBLEU#ter), which in turn is inspired by the TERCOM implementation, which can be found
here: https://github.com/jhclark/tercom.
The implementation here is slightly different from sacrebleu in terms of the required input format. The length of
the references and hypotheses lists need to be the same, so you may need to transpose your references compared to
sacrebleu's required input format. See https://github.com/huggingface/datasets/issues/3154#issuecomment-950746534
See the README.md file at https://github.com/mjpost/sacreBLEU#ter for more information.
---
# Metric Card for TER
## Metric Description
TER (Translation Edit Rate, also called Translation Error Rate) is a metric to quantify the edit operations that a hypothesis requires to match a reference translation. We use the implementation that is already present in [sacrebleu](https://github.com/mjpost/sacreBLEU#ter), which in turn is inspired by the [TERCOM implementation](https://github.com/jhclark/tercom).
The implementation here is slightly different from sacrebleu in terms of the required input format. The length of the references and hypotheses lists need to be the same, so you may need to transpose your references compared to sacrebleu's required input format. See [this github issue](https://github.com/huggingface/datasets/issues/3154#issuecomment-950746534).
See the README.md file at https://github.com/mjpost/sacreBLEU#ter for more information.
## How to Use
This metric takes, at minimum, predicted sentences and reference sentences:
```python
>>>predictions=["does this sentence match??",
..."what about this sentence?",
..."What did the TER metric user say to the developer?"]
>>>references=[["does this sentence match","does this sentence match!?!"],
...["wHaT aBoUt ThIs SeNtEnCe?","wHaT aBoUt ThIs SeNtEnCe?"],
...["Your jokes are...","...TERrible"]]
>>>ter=evaluate.load("ter")
>>>results=ter.compute(predictions=predictions,
...references=references,
...case_sensitive=True)
>>>print(results)
{'score':150.0,'num_edits':15,'ref_length':10.0}
```
### Inputs
This metric takes the following as input:
-**`predictions`** (`list` of `str`): The system stream (a sequence of segments).
-**`references`** (`list` of `list` of `str`): A list of one or more reference streams (each a sequence of segments).
-**`normalized`** (`boolean`): If `True`, applies basic tokenization and normalization to sentences. Defaults to `False`.
-**`ignore_punct`** (`boolean`): If `True`, applies basic tokenization and normalization to sentences. Defaults to `False`.
-**`support_zh_ja_chars`** (`boolean`): If `True`, tokenization/normalization supports processing of Chinese characters, as well as Japanese Kanji, Hiragana, Katakana, and Phonetic Extensions of Katakana. Only applies if `normalized = True`. Defaults to `False`.
-**`case_sensitive`** (`boolean`): If `False`, makes all predictions and references lowercase to ignore differences in case. Defaults to `False`.
### Output Values
This metric returns the following:
-**`score`** (`float`): TER score (num_edits / sum_ref_lengths * 100)
-**`num_edits`** (`int`): The cumulative number of edits
-**`ref_length`** (`float`): The cumulative average reference length
The metric can take on any value `0` and above. `0` is a perfect score, meaning the predictions exactly match the references and no edits were necessary. Higher scores are worse. Scores above 100 mean that the cumulative number of edits, `num_edits`, is higher than the cumulative length of the references, `ref_length`.
#### Values from Popular Papers
### Examples
Basic example with only predictions and references as inputs:
```python
>>>predictions=["does this sentence match??",
..."what about this sentence?"]
>>>references=[["does this sentence match","does this sentence match!?!"],
...["wHaT aBoUt ThIs SeNtEnCe?","wHaT aBoUt ThIs SeNtEnCe?"]]
>>>ter=evaluate.load("ter")
>>>results=ter.compute(predictions=predictions,
...references=references,
...case_sensitive=True)
>>>print(results)
{'score':62.5,'num_edits':5,'ref_length':8.0}
```
Example with `normalization = True`:
```python
>>>predictions=["does this sentence match??",
..."what about this sentence?"]
>>>references=[["does this sentence match","does this sentence match!?!"],
...["wHaT aBoUt ThIs SeNtEnCe?","wHaT aBoUt ThIs SeNtEnCe?"]]
The TREC Eval metric combines a number of information retrieval metrics such as precision and nDCG. It is used to score rankings of retrieved documents with reference values.
---
# Metric Card for TREC Eval
## Metric Description
The TREC Eval metric combines a number of information retrieval metrics such as precision and normalized Discounted Cumulative Gain (nDCG). It is used to score rankings of retrieved documents with reference values.
Word error rate (WER) is a common metric of the performance of an automatic speech recognition system.
The general difficulty of measuring performance lies in the fact that the recognized word sequence can have a different length from the reference word sequence (supposedly the correct one). The WER is derived from the Levenshtein distance, working at the word level instead of the phoneme level. The WER is a valuable tool for comparing different systems as well as for evaluating improvements within one system. This kind of measurement, however, provides no details on the nature of translation errors and further work is therefore required to identify the main source(s) of error and to focus any research effort.
This problem is solved by first aligning the recognized word sequence with the reference (spoken) word sequence using dynamic string alignment. Examination of this issue is seen through a theory called the power law that states the correlation between perplexity and word error rate.
Word error rate can then be computed as:
WER = (S + D + I) / N = (S + D + I) / (S + D + C)
where
S is the number of substitutions,
D is the number of deletions,
I is the number of insertions,
C is the number of correct words,
N is the number of words in the reference (N=S+D+C).
This value indicates the average number of errors per reference word. The lower the value, the better the
performance of the ASR system with a WER of 0 being a perfect score.
---
# Metric Card for WER
## Metric description
Word error rate (WER) is a common metric of the performance of an automatic speech recognition (ASR) system.
The general difficulty of measuring the performance of ASR systems lies in the fact that the recognized word sequence can have a different length from the reference word sequence (supposedly the correct one). The WER is derived from the [Levenshtein distance](https://en.wikipedia.org/wiki/Levenshtein_distance), working at the word level.
This problem is solved by first aligning the recognized word sequence with the reference (spoken) word sequence using dynamic string alignment. Examination of this issue is seen through a theory called the power law that states the correlation between [perplexity](https://huggingface.co/metrics/perplexity) and word error rate (see [this article](https://www.cs.cmu.edu/~roni/papers/eval-metrics-bntuw-9802.pdf) for further information).
Word error rate can then be computed as:
`WER = (S + D + I) / N = (S + D + I) / (S + D + C)`
where
`S` is the number of substitutions,
`D` is the number of deletions,
`I` is the number of insertions,
`C` is the number of correct words,
`N` is the number of words in the reference (`N=S+D+C`).
## How to use
The metric takes two inputs: references (a list of references for each speech input) and predictions (a list of transcriptions to score).
This metric outputs a float representing the word error rate.
```
print(wer_score)
0.5
```
This value indicates the average number of errors per reference word.
The **lower** the value, the **better** the performance of the ASR system, with a WER of 0 being a perfect score.
### Values from popular papers
This metric is highly dependent on the content and quality of the dataset, and therefore users can expect very different values for the same model but on different datasets.
For example, datasets such as [LibriSpeech](https://huggingface.co/datasets/librispeech_asr) report a WER in the 1.8-3.3 range, whereas ASR models evaluated on [Timit](https://huggingface.co/datasets/timit_asr) report a WER in the 8.3-20.4 range.
See the leaderboards for [LibriSpeech](https://paperswithcode.com/sota/speech-recognition-on-librispeech-test-clean) and [Timit](https://paperswithcode.com/sota/speech-recognition-on-timit) for the most recent values.
WER is a valuable tool for comparing different systems as well as for evaluating improvements within one system. This kind of measurement, however, provides no details on the nature of translation errors and further work is therefore required to identify the main source(s) of error and to focus any research effort.
## Citation
```bibtex
@inproceedings{woodard1982,
author={Woodard, J.P. and Nelson, J.T.,
year = {1982},
journal = {Workshop on standardisation for speech I/O technology, Naval Air Development Center, Warminster, PA},
title = {An information theoretic measure of speech recognition performance}
}
```
```bibtex
@inproceedings{morris2004,
author={Morris, Andrew and Maier, Viktoria and Green, Phil},
year={2004},
month={01},
pages={},
title={From WER and RIL to MER and WIL: improved evaluation measures for connected speech recognition.}
# Copyright 2021 The HuggingFace Evaluate Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" Word Error Ratio (WER) metric. """
importdatasets
fromjiwerimportcompute_measures
importevaluate
_CITATION="""\
@inproceedings{inproceedings,
author = {Morris, Andrew and Maier, Viktoria and Green, Phil},
year = {2004},
month = {01},
pages = {},
title = {From WER and RIL to MER and WIL: improved evaluation measures for connected speech recognition.}
}
"""
_DESCRIPTION="""\
Word error rate (WER) is a common metric of the performance of an automatic speech recognition system.
The general difficulty of measuring performance lies in the fact that the recognized word sequence can have a different length from the reference word sequence (supposedly the correct one). The WER is derived from the Levenshtein distance, working at the word level instead of the phoneme level. The WER is a valuable tool for comparing different systems as well as for evaluating improvements within one system. This kind of measurement, however, provides no details on the nature of translation errors and further work is therefore required to identify the main source(s) of error and to focus any research effort.
This problem is solved by first aligning the recognized word sequence with the reference (spoken) word sequence using dynamic string alignment. Examination of this issue is seen through a theory called the power law that states the correlation between perplexity and word error rate.
Word error rate can then be computed as:
WER = (S + D + I) / N = (S + D + I) / (S + D + C)
where
S is the number of substitutions,
D is the number of deletions,
I is the number of insertions,
C is the number of correct words,
N is the number of words in the reference (N=S+D+C).
This value indicates the average number of errors per reference word. The lower the value, the better the
performance of the ASR system with a WER of 0 being a perfect score.
"""
_KWARGS_DESCRIPTION="""
Compute WER score of transcribed segments against references.
Args:
references: List of references for each speech input.
predictions: List of transcriptions to score.
concatenate_texts (bool, default=False): Whether to concatenate all input texts or compute WER iteratively.
Returns:
(float): the word error rate
Examples:
>>> predictions = ["this is the prediction", "there is an other sample"]
>>> references = ["this is the reference", "there is another one"]
WIKI_SPLIT is the combination of three metrics SARI, EXACT and SACREBLEU
It can be used to evaluate the quality of machine-generated texts.
---
# Metric Card for WikiSplit
## Metric description
WikiSplit is the combination of three metrics: [SARI](https://huggingface.co/metrics/sari), [exact match](https://huggingface.co/metrics/exact_match) and [SacreBLEU](https://huggingface.co/metrics/sacrebleu).
It can be used to evaluate the quality of sentence splitting approaches, which require rewriting a long sentence into two or more coherent short sentences, e.g. based on the [WikiSplit dataset](https://huggingface.co/datasets/wiki_split).
## How to use
The WIKI_SPLIT metric takes three inputs:
`sources`: a list of source sentences, where each sentence should be a string.
`predictions`: a list of predicted sentences, where each sentence should be a string.
`references`: a list of lists of reference sentences, where each sentence should be a string.
```python
>>>wiki_split=evaluate.load("wiki_split")
>>>sources=["About 95 species are currently accepted ."]
>>>predictions=["About 95 you now get in ."]
>>>references=[["About 95 species are currently known ."]]
This metric outputs a dictionary containing three scores:
`sari`: the [SARI](https://huggingface.co/metrics/sari) score, whose range is between `0.0` and `100.0` -- the higher the value, the better the performance of the model being evaluated, with a SARI of 100 being a perfect score.
`sacrebleu`: the [SacreBLEU](https://huggingface.co/metrics/sacrebleu) score, which can take any value between `0.0` and `100.0`, inclusive.
`exact`: the [exact match](https://huggingface.co/metrics/exact_match) score, which represents the sum of all of the individual exact match scores in the set, divided by the total number of predictions in the set. It ranges from `0.0` to `100`, inclusive. Here, `0.0` means no prediction/reference pairs were matches, while `100.0` means they all were.
This metric was initially used by [Rothe et al.(2020)](https://arxiv.org/pdf/1907.12461.pdf) to evaluate the performance of different split-and-rephrase approaches on the [WikiSplit dataset](https://huggingface.co/datasets/wiki_split). They reported a SARI score of 63.5, a SacreBLEU score of 77.2, and an EXACT_MATCH score of 16.3.
## Examples
Perfect match between prediction and reference:
```python
>>>wiki_split=evaluate.load("wiki_split")
>>>sources=["About 95 species are currently accepted ."]
>>>predictions=["About 95 species are currently accepted ."]
>>>references=[["About 95 species are currently accepted ."]]
This metric is not the official metric to evaluate models on the [WikiSplit dataset](https://huggingface.co/datasets/wiki_split). It was initially proposed by [Rothe et al.(2020)](https://arxiv.org/pdf/1907.12461.pdf), whereas the [original paper introducing the WikiSplit dataset (2018)](https://aclanthology.org/D18-1080.pdf) uses different metrics to evaluate performance, such as corpus-level [BLEU](https://huggingface.co/metrics/bleu) and sentence-level BLEU.
## Citation
```bibtex
@article{rothe2020leveraging,
title={Leveraging pre-trained checkpoints for sequence generation tasks},
author={Rothe, Sascha and Narayan, Shashi and Severyn, Aliaksei},
journal={Transactions of the Association for Computational Linguistics},
XNLI is a subset of a few thousand examples from MNLI which has been translated
into a 14 different languages (some low-ish resource). As with MNLI, the goal is
to predict textual entailment (does sentence A imply/contradict/neither sentence
B) and is a classification task (given two sentences, predict one of three
labels).
---
# Metric Card for XNLI
## Metric description
The XNLI metric allows to evaluate a model's score on the [XNLI dataset](https://huggingface.co/datasets/xnli), which is a subset of a few thousand examples from the [MNLI dataset](https://huggingface.co/datasets/glue/viewer/mnli) that have been translated into a 14 different languages, some of which are relatively low resource such as Swahili and Urdu.
As with MNLI, the task is to predict textual entailment (does sentence A imply/contradict/neither sentence B) and is a classification task (given two sentences, predict one of three labels).
## How to use
The XNLI metric is computed based on the `predictions` (a list of predicted labels) and the `references` (a list of ground truth labels).
The output of the XNLI metric is simply the `accuracy`, i.e. the proportion of correct predictions among the total number of cases processed, with a range between 0 and 1 (see [accuracy](https://huggingface.co/metrics/accuracy) for more information).
### Values from popular papers
The [original XNLI paper](https://arxiv.org/pdf/1809.05053.pdf) reported accuracies ranging from 59.3 (for `ur`) to 73.7 (for `en`) for the BiLSTM-max model.
For more recent model performance, see the [dataset leaderboard](https://paperswithcode.com/dataset/xnli).
While accuracy alone does give a certain indication of performance, it can be supplemented by error analysis and a better understanding of the model's mistakes on each of the categories represented in the dataset, especially if they are unbalanced.
While the XNLI dataset is multilingual and represents a diversity of languages, in reality, cross-lingual sentence understanding goes beyond translation, given that there are many cultural differences that have an impact on human sentiment annotations. Since the XNLI dataset was obtained by translation based on English sentences, it does not capture these cultural differences.