Symmetric Mean Absolute Percentage Error (sMAPE) is the symmetric mean of the absolute percentage difference between predicted and actual values, as defined by Chen and Yang (2004), based on the earlier metric of Armstrong (1985) and Makridakis (1993).
---
# Metric Card for sMAPE
## Metric Description
Symmetric Mean Absolute Percentage Error (sMAPE) is the symmetric mean of the absolute percentage difference between the predicted values $x_i$ and the actual values $y_i$:
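Written out (a standard formulation consistent with the 0.0-2.0 output range described below; implementations typically add a small epsilon to the denominator to avoid division by zero):

$$\mathrm{sMAPE} = \frac{1}{n} \sum_{i=1}^{n} \frac{2\,|x_i - y_i|}{|x_i| + |y_i|}$$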
## How to use
Mandatory arguments:
- `predictions`: numeric array-like of shape (`n_samples,`) or (`n_samples`, `n_outputs`), representing the estimated target values.
- `references`: numeric array-like of shape (`n_samples,`) or (`n_samples`, `n_outputs`), representing the ground truth (correct) target values.
Optional arguments:
- `sample_weight`: numeric array-like of shape (`n_samples,`) representing sample weights. The default is `None`.
- `multioutput`: `raw_values`, `uniform_average` or numeric array-like of shape (`n_outputs,`), which defines the aggregation of multiple output values. The default value is `uniform_average`.
    - `raw_values` returns a full set of errors in case of multioutput input.
    - `uniform_average` means that the errors of all outputs are averaged with uniform weight.
    - the array-like value defines weights used to average errors.
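For example, a minimal usage sketch (assuming the metric is available under the `smape` identifier in the `evaluate` library; the printed value is approximate):

```python
from evaluate import load

# Score a toy set of single-output predictions.
smape_metric = load("smape")
results = smape_metric.compute(predictions=[2.5, 0.0, 2.0, 8.0], references=[3.0, -0.5, 2.0, 7.0])
print(results)  # roughly {'smape': 0.58}
```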
### Output Values
This metric outputs a dictionary containing the symmetric mean absolute percentage error score, which is of type:
- `float`: if multioutput is `uniform_average` or an ndarray of weights, then the weighted average of all output errors is returned.
- numeric array-like of shape (`n_outputs,`): if multioutput is `raw_values`, then the score is returned for each output separately.
Each sMAPE `float` value ranges from `0.0` to `2.0`, with the best value being `0.0`.
This metric is called a measure of "percentage error" even though there is no multiplier of 100. The range is between 0 and 2, with the value reaching 2 when, for example, the target is zero and the prediction is not (or vice versa).
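To illustrate `multioutput` (same assumptions as the sketch above; printed values are approximate):

```python
from evaluate import load

# With multioutput="raw_values", one error is returned per output column.
smape_metric = load("smape")
results = smape_metric.compute(
    predictions=[[0.5, 1], [-1, 1], [7, -6]],
    references=[[0.1, 2], [-1, 2], [8, -5]],
    multioutput="raw_values",
)
print(results)  # roughly {'smape': array([0.49, 0.51])}
```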
## Citation(s)
```bibtex
@article{chen2004assessing,
  author = {Chen, Zhuo and Yang, Yuhong},
  title  = {Assessing forecast accuracy measures},
  year   = {2004},
  month  = {04}
}
```
## Further References
- [Symmetric mean absolute percentage error - Wikipedia](https://en.wikipedia.org/wiki/Symmetric_mean_absolute_percentage_error)
This metric wraps the official scoring script for version 1 of the Stanford Question Answering Dataset (SQuAD).
Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by
crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span,
from the corresponding reading passage, or the question might be unanswerable.
---
# Metric Card for SQuAD
## Metric description
This metric wraps the official scoring script for version 1 of the [Stanford Question Answering Dataset (SQuAD)](https://huggingface.co/datasets/squad).
SQuAD is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable.
## How to use
The metric takes two files or two lists of question-answer dictionaries as inputs: one with the predictions of the model and the other with the references to be compared to.
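For instance, a minimal sketch of the call (the `predictions` and `references` lists follow the SQuAD format shown under "Examples" below):

```python
from evaluate import load

# Load the SQuAD v1 metric; predictions and references are lists of
# question-answer dictionaries as described above.
squad_metric = load("squad")
results = squad_metric.compute(predictions=predictions, references=references)
```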
## Output values
This metric outputs a dictionary with two values: the average exact match score and the average [F1 score](https://huggingface.co/metrics/f1).
```
{'exact_match': 100.0, 'f1': 100.0}
```
The range of `exact_match` is 0-100, where 0.0 means no answers were matched and 100.0 means all answers were matched.
The range of `f1` is 0-100, as the example output above shows -- its lowest possible value is 0, if either the precision or the recall is 0, and its highest possible value is 100.0, which means perfect precision and recall.
### Values from popular papers
The [original SQuAD paper](https://nlp.stanford.edu/pubs/rajpurkar2016squad.pdf) reported an F1 score of 51.0% and an Exact Match score of 40.0%. They also report that human performance on the dataset represents an F1 score of 90.5% and an Exact Match score of 80.3%.
For more recent model performance, see the [dataset leaderboard](https://paperswithcode.com/dataset/squad).
## Examples
Maximal values for both exact match and F1 (perfect match):
```python
from evaluate import load

squad_metric = load("squad")
predictions = [{'prediction_text': '1976', 'id': '56e10a3be3433e1400422b22'}, {'prediction_text': 'Beyoncé and Bruno Mars', 'id': '56d2051ce7d4791d0090260b'}, {'prediction_text': 'climate change', 'id': '5733b5344776f419006610e1'}]
references = [{'answers': {'answer_start': [97], 'text': ['1976']}, 'id': '56e10a3be3433e1400422b22'}, {'answers': {'answer_start': [233], 'text': ['Beyoncé and Bruno Mars']}, 'id': '56d2051ce7d4791d0090260b'}, {'answers': {'answer_start': [891], 'text': ['climate change']}, 'id': '5733b5344776f419006610e1'}]
results = squad_metric.compute(predictions=predictions, references=references)
print(results)  # {'exact_match': 100.0, 'f1': 100.0}
```
## Limitations and bias
This metric works only with datasets that have the same format as the [SQuAD v.1 dataset](https://huggingface.co/datasets/squad).
The SQuAD dataset does contain a certain amount of noise, such as duplicate questions as well as missing answers, but these represent a minority of the 100,000 question-answer pairs. Also, neither exact match nor F1 score reflect whether models do better on certain types of questions (e.g. who questions) or those that cover a certain gender or geographical area -- carrying out more in-depth error analysis can complement these numbers.
## Citation
```bibtex
@inproceedings{Rajpurkar2016SQuAD10,
  title={SQuAD: 100,000+ Questions for Machine Comprehension of Text},
  author={Pranav Rajpurkar and Jian Zhang and Konstantin Lopyrev and Percy Liang},
  booktitle={EMNLP},
  year={2016}
}
```
This metric wraps the official scoring script for version 2 of the Stanford Question Answering Dataset (SQuAD).
Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by
crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span,
from the corresponding reading passage, or the question might be unanswerable.
SQuAD2.0 combines the 100,000 questions in SQuAD1.1 with over 50,000 unanswerable questions
written adversarially by crowdworkers to look similar to answerable ones.
To do well on SQuAD2.0, systems must not only answer questions when possible, but also
determine when no answer is supported by the paragraph and abstain from answering.
---
# Metric Card for SQuAD v2
## Metric description
This metric wraps the official scoring script for version 2 of the [Stanford Question Answering Dataset (SQuAD)](https://huggingface.co/datasets/squad_v2).
SQuAD is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable.
SQuAD 2.0 combines the 100,000 questions in SQuAD 1.1 with over 50,000 unanswerable questions written adversarially by crowdworkers to look similar to answerable ones. To do well on SQuAD 2.0, systems must not only answer questions when possible, but also determine when no answer is supported by the paragraph and abstain from answering.
## How to use
The metric takes two files or two lists as inputs: one representing model predictions and the other the references to compare them to. A short example of the call follows the format descriptions below.
*Predictions*: a list of question-answer dictionaries to score, each with the following key-value pairs:
* `'id'`: the identification field of the question and answer pair
* `'prediction_text'`: the text of the answer
* `'no_answer_probability'`: the probability that the question has no answer

*References*: a list of question-answer dictionaries, each with the following key-value pairs:
* `'id'`: id of the question-answer pair (see above)
* `'answers'`: a list of Dict {'text': text of the answer as a string}

The metric also takes an optional argument:
* `'no_answer_threshold'`: the probability threshold to decide that a question has no answer.
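A minimal sketch of the call, assuming `predictions` and `references` follow the formats above:

```python
from evaluate import load

# Load the SQuAD v2 metric and score predictions against references.
squad_v2_metric = load("squad_v2")
results = squad_v2_metric.compute(predictions=predictions, references=references)
```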
## Output values
This metric outputs a dictionary with the following key-value pairs:
* `'exact'`: Exact match (the normalized answer exactly matches the gold answer) (see the `exact_match` metric (forthcoming))
* `'f1'`: The average F1-score of predicted tokens versus the gold answer (see the [F1 score](https://huggingface.co/metrics/f1) metric)
* `'total'`: Number of scores considered
* `'HasAns_exact'`: Exact match (the normalized answer exactly matches the gold answer)
* `'HasAns_f1'`: The average F1-score of predicted tokens versus the gold answer
* `'HasAns_total'`: How many of the questions have answers
* `'NoAns_exact'`: Exact match (the normalized answer exactly matches the gold answer)
* `'NoAns_f1'`: The average F1-score of predicted tokens versus the gold answer
* `'NoAns_total'`: How many of the questions have no answers
* `'best_exact'`: Best exact match (with varying threshold)
* `'best_exact_thresh'`: No-answer probability threshold associated with the best exact match
* `'best_f1'`: Best F1 score (with varying threshold)
* `'best_f1_thresh'`: No-answer probability threshold associated with the best F1
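For example, a run over a single answerable question with a perfect prediction returns a dictionary along these lines (the `NoAns_*` keys appear only when unanswerable questions are present):

```
{'exact': 100.0, 'f1': 100.0, 'total': 1, 'HasAns_exact': 100.0, 'HasAns_f1': 100.0, 'HasAns_total': 1, 'best_exact': 100.0, 'best_exact_thresh': 0.0, 'best_f1': 100.0, 'best_f1_thresh': 0.0}
```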
The range of `exact` is 0-100, where 0.0 means no answers were matched and 100.0 means all answers were matched.
The range of `f1` is 0-100 -- its lowest possible value is 0, if either the precision or the recall is 0, and its highest possible value is 100.0, which means perfect precision and recall.
The range of `total` depends on the length of predictions/references: its minimal value is 0, and maximal value is the total number of questions in the predictions and references.
### Values from popular papers
The [SQuAD v2 paper](https://arxiv.org/pdf/1806.03822.pdf) reported an F1 score of 66.3% and an Exact Match score of 63.4%.
They also report that human performance on the dataset represents an F1 score of 89.5% and an Exact Match score of 86.9%.
For more recent model performance, see the [dataset leaderboard](https://paperswithcode.com/dataset/squad).
## Examples
Maximal values for both exact match and F1 (perfect match):
```python
from evaluate import load

squad_v2_metric = load("squad_v2")
predictions = [{'prediction_text': '1976', 'id': '56e10a3be3433e1400422b22', 'no_answer_probability': 0.}, {'prediction_text': 'Beyoncé and Bruno Mars', 'id': '56d2051ce7d4791d0090260b', 'no_answer_probability': 0.}, {'prediction_text': 'climate change', 'id': '5733b5344776f419006610e1', 'no_answer_probability': 0.}]
references = [{'answers': {'answer_start': [97], 'text': ['1976']}, 'id': '56e10a3be3433e1400422b22'}, {'answers': {'answer_start': [233], 'text': ['Beyoncé and Bruno Mars']}, 'id': '56d2051ce7d4791d0090260b'}, {'answers': {'answer_start': [891], 'text': ['climate change']}, 'id': '5733b5344776f419006610e1'}]
results = squad_v2_metric.compute(predictions=predictions, references=references)
print(results['exact'], results['f1'])  # 100.0 100.0
```
## Limitations and bias
This metric works only with datasets in the same format as the [SQuAD v.2 dataset](https://huggingface.co/datasets/squad_v2).
The SQuAD datasets do contain a certain amount of noise, such as duplicate questions as well as missing answers, but these represent a minority of the 100,000 question-answer pairs. Also, neither exact match nor F1 score reflect whether models do better on certain types of questions (e.g. who questions) or those that cover a certain gender or geographical area -- carrying out more in-depth error analysis can complement these numbers.
## Citation
```bibtex
@inproceedings{Rajpurkar2018SQuAD2,
  title={Know What You Don't Know: Unanswerable Questions for SQuAD},
  author={Pranav Rajpurkar and Jian Zhang and Percy Liang},
  booktitle={ACL},
  year={2018}
}
```
SuperGLUE (https://super.gluebenchmark.com/) is a new benchmark styled after GLUE with a new set of more difficult language understanding tasks, improved resources, and a new public leaderboard.
---
# Metric Card for SuperGLUE
## Metric description
This metric is used to compute the SuperGLUE evaluation metric associated with each of the subsets of the [SuperGLUE dataset](https://huggingface.co/datasets/super_glue).
SuperGLUE is a new benchmark styled after GLUE with a new set of more difficult language understanding tasks, improved resources, and a new public leaderboard.
## How to use
There are two steps: (1) loading the SuperGLUE metric relevant to the subset of the dataset being used for evaluation; and (2) calculating the metric.
1. **Loading the relevant SuperGLUE metric**: the subsets of SuperGLUE are the following: `boolq`, `cb`, `copa`, `multirc`, `record`, `rte`, `wic`, `wsc`, `wsc.fixed`, `axb`, `axg`.
More information about the different subsets of the SuperGLUE dataset can be found on the [SuperGLUE dataset page](https://huggingface.co/datasets/super_glue) and on the [official dataset website](https://super.gluebenchmark.com/).
2. **Calculating the metric**: the metric takes two inputs: one list with the predictions of the model to score and one list of reference labels. The structure of both inputs depends on the SuperGLUE subset being used (a short example follows the format descriptions below):
Format of `predictions`:
- for `record`: list of question-answer dictionaries with the following keys:
    - `idx`: index of the question as specified by the dataset
    - `prediction_text`: the predicted answer text
- for `multirc`: list of question-answer dictionaries with the following keys:
    - `idx`: index of the question-answer pair as specified by the dataset
    - `prediction`: the predicted answer label
- otherwise: list of predicted labels
Format of `references`:
- for `record`: list of question-answer dictionaries with the following keys:
    - `idx`: index of the question as specified by the dataset
    - `answers`: list of possible answers
- otherwise: list of reference labels
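For instance, a minimal sketch for the `record` subset (toy values; the `idx` dictionaries follow the dataset's passage/query indexing):

```python
from evaluate import load

# Load the ReCoRD variant of the SuperGLUE metric and score one toy prediction.
super_glue_metric = load('super_glue', 'record')
predictions = [{'idx': {'passage': 0, 'query': 0}, 'prediction_text': 'answer'}]
references = [{'idx': {'passage': 0, 'query': 0}, 'answers': ['answer', 'another_answer']}]
results = super_glue_metric.compute(predictions=predictions, references=references)
print(results)  # {'exact_match': 1.0, 'f1': 1.0}
```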
## Output values
The output of the metric depends on the SuperGLUE subset chosen, consisting of a dictionary that contains one or several of the following metrics:
`exact_match`: A given predicted string's exact match score is 1 if it is the exact same as its reference string, and is 0 otherwise. (See [Exact Match](https://huggingface.co/metrics/exact_match) for more information).
`f1`: the harmonic mean of the precision and recall (see [F1 score](https://huggingface.co/metrics/f1) for more information). Its range is 0-1 -- its lowest possible value is 0, if either the precision or the recall is 0, and its highest possible value is 1.0, which means perfect precision and recall.
`matthews_correlation`: a measure of the quality of binary and multiclass classifications (see [Matthews Correlation](https://huggingface.co/metrics/matthews_correlation) for more information). Its range of values is between -1 and +1, where a coefficient of +1 represents a perfect prediction, 0 an average random prediction and -1 an inverse prediction.
`accuracy`: the proportion of correct predictions among the total number of cases processed, with a range between 0 and 1 (see [Accuracy](https://huggingface.co/metrics/accuracy) for more information). Subsets such as `copa` and `boolq` report this value, as in the example below.
### Values from popular papers
The [original SuperGLUE paper](https://arxiv.org/pdf/1905.00537.pdf) reported average scores ranging from 47 to 71.5%, depending on the model used (with all evaluation values scaled by 100 to make computing the average possible).
For more recent model performance, see the [dataset leaderboard](https://super.gluebenchmark.com/leaderboard).
## Examples
Maximal values for the COPA subset (which outputs `accuracy`):
```python
from evaluate import load

super_glue_metric = load('super_glue', 'copa')  # any of ["copa", "rte", "wic", "wsc", "wsc.fixed", "boolq", "axg"]
predictions = [0, 1]  # toy labels that agree perfectly with the references
references = [0, 1]
results = super_glue_metric.compute(predictions=predictions, references=references)
print(results)  # {'accuracy': 1.0}
```
## Limitations and bias
This metric works only with datasets that have the same format as the [SuperGLUE dataset](https://huggingface.co/datasets/super_glue).
The dataset also includes Winogender, a subset of the dataset that is designed to measure gender bias in coreference resolution systems. However, as noted in the SuperGLUE paper, this subset has its limitations: *"It offers only positive predictive value: A poor bias score is clear evidence that a model exhibits gender bias, but a good score does not mean that the model is unbiased. [...] Also, Winogender does not cover all forms of social bias, or even all forms of gender. For instance, the version of the data used here offers no coverage of gender-neutral they or non-binary pronouns."*
## Citation
```bibtex
@article{wang2019superglue,
  title={Super{GLUE}: A Stickier Benchmark for General-Purpose Language Understanding Systems},
  author={Wang, Alex and Pruksachatkun, Yada and Nangia, Nikita and Singh, Amanpreet and Michael, Julian and Hill, Felix and Levy, Omer and Bowman, Samuel R},
  journal={arXiv preprint arXiv:1905.00537},
  year={2019}
}
```