The BLEU score has some undesirable properties when used for single
sentences, as it was designed to be a corpus measure. We therefore
use a slightly different score for our RL experiments which we call
the 'GLEU score'. For the GLEU score, we record all sub-sequences of
1, 2, 3 or 4 tokens in output and target sequence (n-grams). We then
compute a recall, which is the ratio of the number of matching n-grams
to the number of total n-grams in the target (ground truth) sequence,
and a precision, which is the ratio of the number of matching n-grams
to the number of total n-grams in the generated output sequence. The
GLEU score is then simply the minimum of recall and precision. This GLEU
score's range is always between 0 (no matches) and 1 (all match) and
it is symmetrical when switching output and target. According to
our experiments, GLEU score correlates quite well with the BLEU
metric on a corpus level but does not have its drawbacks for our
per-sentence reward objective.
---
# Metric Card for Google BLEU
## Metric Description
The BLEU score has some undesirable properties when used for single sentences, as it was designed to be a corpus measure. The Google BLEU score is designed to limit these undesirable properties when used for single sentences.
To calculate this score, all sub-sequences of 1, 2, 3 or 4 tokens in output and target sequence (n-grams) are recorded. The precision and recall, described below, are then computed.
- **precision:** the ratio of the number of matching n-grams to the number of total n-grams in the generated output sequence
- **recall:** the ratio of the number of matching n-grams to the number of total n-grams in the target (ground truth) sequence
The minimum value of precision and recall is then returned as the score.
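To make this definition concrete, here is a from-scratch sketch of the per-sentence computation on pre-tokenized input. The `google_bleu` helper below is purely illustrative and is not the library's implementation:
```python
from collections import Counter

def google_bleu(hypothesis, reference, min_len=1, max_len=4):
    """Illustrative sentence-level Google BLEU for pre-tokenized sequences."""
    def ngrams(tokens):
        # record every sub-sequence of min_len..max_len tokens
        counts = Counter()
        for n in range(min_len, max_len + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
        return counts

    hyp_counts, ref_counts = ngrams(hypothesis), ngrams(reference)
    # matching n-grams: the multiset intersection of the two n-gram counts
    matches = sum((hyp_counts & ref_counts).values())
    precision = matches / max(sum(hyp_counts.values()), 1)
    recall = matches / max(sum(ref_counts.values()), 1)
    return min(precision, recall)

print(google_bleu("the cat sat on the mat".split(),
                  "the cat is on the mat".split()))  # 0.5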
## Intended Uses
This metric is generally used to evaluate machine translation models. It is especially used when scores of individual (prediction, reference) sentence pairs are needed, as opposed to when averaging over the (prediction, reference) scores for a whole corpus. That being said, it can also be used when averaging over the scores for a whole corpus.
Because it performs better on individual sentence pairs as compared to BLEU, Google BLEU has also been used in RL experiments.
## How to Use
This metric takes a list of predicted sentences, as well as a list of references.
- **predictions** (list of str): list of translations to score.
- **references** (list of list of str): list of lists of references for each translation.
- **tokenizer**: approach used for tokenizing `predictions` and `references`.
The default tokenizer is `tokenizer_13a`, a minimal tokenization approach that is equivalent to `mteval-v13a`, used by WMT. This can be replaced by any function that takes a string as input and returns a list of tokens as output.
- **min_len** (int): the minimum order of n-gram this function should extract. Defaults to 1.
- **max_len** (int): the maximum order of n-gram this function should extract. Defaults to 4.
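A minimal call through the `evaluate` library might look as follows; the toy sentence pair is illustrative, and the printed value follows from the matching n-gram counts (it agrees with the from-scratch sketch above):
```python
>>> import evaluate
>>> google_bleu = evaluate.load("google_bleu")
>>> predictions = ["the cat sat on the mat"]
>>> references = [["the cat is on the mat"]]
>>> results = google_bleu.compute(predictions=predictions, references=references)
>>> print(results)
{'google_bleu': 0.5}
```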
### Output Values
This metric returns the following in a dict:
- **google_bleu** (float): the Google BLEU score
The output format is as follows:
```python
{'google_bleu': google_bleu_score}
```
This metric can take on values from 0 to 1, inclusive. Higher scores are better, with 0 indicating no matches, and 1 indicating a perfect match.
Note that this score is symmetrical when switching output and target. This means that, given two sentences `sentence1` and `sentence2`, the score obtained when `sentence1` is the prediction and `sentence2` is the reference is the same as when the roles are swapped. In code, this looks like:
```python
>>> predictions = ['It is a guide to action which ensures that the rubber duck always disobeys the commands of the cat', 'he read the book because he was interested in world history']
>>> references = [['It is the guiding principle which guarantees the rubber duck forces never being under the command of the cat'], ['he was interested in world history because he read the book']]
```
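Completing the snippet, a sketch of the symmetry check (the `swapped_*` names are illustrative, and the equality relies on each prediction having a single reference, as above):
```python
>>> google_bleu = evaluate.load("google_bleu")
>>> forward = google_bleu.compute(predictions=predictions, references=references)
>>> swapped_predictions = [refs[0] for refs in references]
>>> swapped_references = [[pred] for pred in predictions]
>>> backward = google_bleu.compute(predictions=swapped_predictions, references=swapped_references)
>>> forward == backward
True
```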
Example with multiple references for the first sample:
```python
>>> predictions = ['It is a guide to action which ensures that the rubber duck always disobeys the commands of the cat', 'he read the book because he was interested in world history']
>>> references = [['It is the guiding principle which guarantees the rubber duck forces never being under the command of the cat', 'It is a guide to action that ensures that the rubber duck will never heed the cat commands', 'It is the practical guide for the rubber duck army never to heed the directions of the cat'], ['he was interested in world history because he read the book']]
>>> results = google_bleu.compute(predictions=predictions, references=references)
```
Example with multiple references for the first sample, and with `min_len` adjusted to `2`, instead of the default `1`, which means that the function extracts n-grams of length `2` or longer:
```python
>>> predictions = ['It is a guide to action which ensures that the rubber duck always disobeys the commands of the cat', 'he read the book because he was interested in world history']
>>> references = [['It is the guiding principle which guarantees the rubber duck forces never being under the command of the cat', 'It is a guide to action that ensures that the rubber duck will never heed the cat commands', 'It is the practical guide for the rubber duck army never to heed the directions of the cat'], ['he was interested in world history because he read the book']]
>>> results = google_bleu.compute(predictions=predictions, references=references, min_len=2)
```
Example with multiple references for the first sample, with `min_len` adjusted to `2`, instead of the default `1`, and `max_len` adjusted to `6` instead of the default `4`:
```python
>>> predictions = ['It is a guide to action which ensures that the rubber duck always disobeys the commands of the cat', 'he read the book because he was interested in world history']
>>> references = [['It is the guiding principle which guarantees the rubber duck forces never being under the command of the cat', 'It is a guide to action that ensures that the rubber duck will never heed the cat commands', 'It is the practical guide for the rubber duck army never to heed the directions of the cat'], ['he was interested in world history because he read the book']]
>>> results = google_bleu.compute(predictions=predictions, references=references, min_len=2, max_len=6)
```
The GoogleBLEU metric does not come with a predefined tokenization function; previous versions simply used `split()` to split the input strings into tokens. Using a tokenizer such as the default one, `tokenizer_13a`, makes results more standardized and reproducible. The BLEU and sacreBLEU metrics also use this default tokenizer.
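Any callable that maps a string to a list of tokens can be passed instead; for instance, to reproduce the older whitespace-splitting behaviour (shown only as an illustration):
```python
>>> results = google_bleu.compute(predictions=predictions, references=references, tokenizer=str.split)
```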
## Citation
```bibtex
@misc{wu2016googles,
title={Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation},
author={Yonghui Wu and Mike Schuster and Zhifeng Chen and Quoc V. Le and Mohammad Norouzi and Wolfgang Macherey and Maxim Krikun and Yuan Cao and Qin Gao and Klaus Macherey and Jeff Klingner and Apurva Shah and Melvin Johnson and Xiaobing Liu and Łukasz Kaiser and Stephan Gouws and Yoshikiyo Kato and Taku Kudo and Hideto Kazawa and Keith Stevens and George Kurian and Nishant Patil and Wei Wang and Cliff Young and Jason Smith and Jason Riesa and Alex Rudnick and Oriol Vinyals and Greg Corrado and Macduff Hughes and Jeffrey Dean},
year={2016},
eprint={1609.08144},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```
## Further References
- This Hugging Face implementation uses the [nltk.translate.gleu_score implementation](https://www.nltk.org/_modules/nltk/translate/gleu_score.html)
IndicGLUE is a natural language understanding benchmark for Indian languages. It contains a wide
variety of tasks and covers 11 major Indian languages - as, bn, gu, hi, kn, ml, mr, or, pa, ta, te.
---
# Metric Card for IndicGLUE
## Metric description
This metric is used to compute the evaluation metric for the [IndicGLUE dataset](https://huggingface.co/datasets/indic_glue).
IndicGLUE is a natural language understanding benchmark for Indian languages. It contains a wide variety of tasks and covers 11 major Indian languages - Assamese (`as`), Bengali (`bn`), Gujarati (`gu`), Hindi (`hi`), Kannada (`kn`), Malayalam (`ml`), Marathi (`mr`), Oriya (`or`), Panjabi (`pa`), Tamil (`ta`) and Telugu (`te`).
## How to use
There are two steps: (1) loading the IndicGLUE metric relevant to the subset of the dataset being used for evaluation; and (2) calculating the metric.
1. **Loading the relevant IndicGLUE metric**: the subsets of IndicGLUE are the following: `wnli`, `copa`, `sna`, `csqa`, `wstp`, `inltkh`, `bbca`, `cvit-mkb-clsr`, `iitp-mr`, `iitp-pr`, `actsa-sc`, `md`, and `wiki-ner`.
More information about the different subsets of the Indic GLUE dataset can be found on the [IndicGLUE dataset page](https://indicnlp.ai4bharat.org/indic-glue/).
2. **Calculating the metric**: the metric takes two inputs: one list with the predictions of the model to score and one list of references, for all subsets of the dataset except for `cvit-mkb-clsr`, where each prediction and reference is a vector of floats.
The output of the metric depends on the IndicGLUE subset chosen, consisting of a dictionary that contains one or several of the following metrics:
`accuracy`: the proportion of correct predictions among the total number of cases processed, with a range between 0 and 1 (see [accuracy](https://huggingface.co/metrics/accuracy) for more information).
`f1`: the harmonic mean of the precision and recall (see [F1 score](https://huggingface.co/metrics/f1) for more information). Its range is 0-1 -- its lowest possible value is 0, if either the precision or the recall is 0, and its highest possible value is 1.0, which means perfect precision and recall.
`precision@10`: the fraction of the true examples among the top 10 predicted examples, with a range between 0 and 1 (see [precision](https://huggingface.co/metrics/precision) for more information).
The `cvit-mkb-clsr` subset returns `precision@10`, the `wiki-ner` subset returns `accuracy` and `f1`, and all other subsets of Indic GLUE return only accuracy.
### Values from popular papers
The [original IndicGLUE paper](https://aclanthology.org/2020.findings-emnlp.445.pdf) reported an average accuracy of 0.766 across the dataset's tasks; scores vary depending on the subset selected.
## Examples
Maximal values for the WNLI subset (which outputs `accuracy`):
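A sketch of such an example, assuming the standard `evaluate.load` interface with the subset name passed as the configuration; since the predictions match the references exactly, the accuracy is maximal:
```python
>>> indic_glue_metric = evaluate.load('indic_glue', 'wnli')
>>> references = [0, 1]
>>> predictions = [0, 1]
>>> results = indic_glue_metric.compute(predictions=predictions, references=references)
>>> print(results)
{'accuracy': 1.0}
```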
---
# Metric Card for MAE
## How to Use
Mandatory inputs:
- `predictions`: numeric array-like of shape (`n_samples,`) or (`n_samples`, `n_outputs`), representing the estimated target values.
- `references`: numeric array-like of shape (`n_samples,`) or (`n_samples`, `n_outputs`), representing the ground truth (correct) target values.
Optional arguments:
- `sample_weight`: numeric array-like of shape (`n_samples,`) representing sample weights. The default is `None`.
- `multioutput`: `raw_values`, `uniform_average` or numeric array-like of shape (`n_outputs,`), which defines the aggregation of multiple output values. The default value is `uniform_average`.
  - `raw_values` returns a full set of errors in case of multioutput input.
  - `uniform_average` means that the errors of all outputs are averaged with uniform weight.
  - the array-like value defines weights used to average errors.
### Output Values
This metric outputs a dictionary, containing the mean absolute error score, which is of type:
- `float`: if multioutput is `uniform_average` or an ndarray of weights, then the weighted average of all output errors is returned.
- numeric array-like of shape (`n_outputs,`): if multioutput is `raw_values`, then the score is returned for each output separately.
Each MAE `float` value ranges from `0.0` to `+inf`, with the best value being `0.0`.
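A minimal usage sketch, assuming the metric loads under the `mae` id; the numbers are a toy example, and the expected value follows from averaging the absolute errors 0.5, 0.5, 0 and 1:
```python
>>> import evaluate
>>> mae_metric = evaluate.load("mae")
>>> predictions = [2.5, 0.0, 2, 8]
>>> references = [3, -0.5, 2, 7]
>>> results = mae_metric.compute(predictions=predictions, references=references)
>>> print(results)
{'mae': 0.5}
```
With `multioutput='raw_values'`, the per-output errors are returned instead (here 0.5 for the first output and 1.0 for the second):
```python
>>> references = [[0.5, 1], [-1, 1], [7, -6]]
>>> predictions = [[0, 2], [-1, 2], [8, -5]]
>>> results = mae_metric.compute(predictions=predictions, references=references, multioutput='raw_values')
>>> print(results)
{'mae': array([0.5, 1. ])}
```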
## Limitations and Bias
One limitation of MAE is that the relative size of the error is not always obvious, meaning that it can be difficult to tell a big error from a smaller one -- metrics such as Mean Absolute Percentage Error (MAPE) have been proposed to calculate MAE in percentage terms.
Also, since it calculates the mean, MAE may underestimate the impact of big, but infrequent, errors -- metrics such as the Root Mean Square Error (RMSE) compensate for this by squaring the errors before averaging, which weights large errors more heavily.
## Citation(s)
```bibtex
@article{scikit-learn,
title={Scikit-learn: Machine Learning in {P}ython},
author={Pedregosa, F. and Varoquaux, G. and Gramfort, A. and Michel, V.
and Thirion, B. and Grisel, O. and Blondel, M. and Prettenhofer, P.
and Weiss, R. and Dubourg, V. and Vanderplas, J. and Passos, A. and
Cournapeau, D. and Brucher, M. and Perrot, M. and Duchesnay, E.},
journal={Journal of Machine Learning Research},
volume={12},
pages={2825--2830},
year={2011}
}
```
```bibtex
@article{willmott2005advantages,
title={Advantages of the mean absolute error (MAE) over the root mean square error (RMSE) in assessing average model performance},
# Copyright 2022 The HuggingFace Datasets Authors and the current dataset script contributor.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""MAE - Mean Absolute Error Metric"""
import datasets
from sklearn.metrics import mean_absolute_error

import evaluate
_CITATION = """\
@article{scikit-learn,
title={Scikit-learn: Machine Learning in {P}ython},
author={Pedregosa, F. and Varoquaux, G. and Gramfort, A. and Michel, V.
and Thirion, B. and Grisel, O. and Blondel, M. and Prettenhofer, P.
and Weiss, R. and Dubourg, V. and Vanderplas, J. and Passos, A. and
Cournapeau, D. and Brucher, M. and Perrot, M. and Duchesnay, E.},
journal={Journal of Machine Learning Research},
volume={12},
pages={2825--2830},
year={2011}
}
"""
_DESCRIPTION = """\
Mean Absolute Error (MAE) is the mean of the magnitude of difference between the predicted and actual
values.
"""
_KWARGS_DESCRIPTION = """
Args:
    predictions: array-like of shape (n_samples,) or (n_samples, n_outputs)
        Estimated target values.
    references: array-like of shape (n_samples,) or (n_samples, n_outputs)
        Ground truth (correct) target values.
    sample_weight: array-like of shape (n_samples,), default=None
        Sample weights.
    multioutput: {"raw_values", "uniform_average"} or array-like of shape (n_outputs,), default="uniform_average"
        Defines aggregating of multiple output values. Array-like value defines weights used to average errors.
        "raw_values" : Returns a full set of errors in case of multioutput input.
        "uniform_average" : Errors of all outputs are averaged with uniform weight.
Returns:
    mae : mean absolute error.
        If multioutput is "raw_values", then mean absolute error is returned for each output separately. If multioutput is "uniform_average" or an ndarray of weights, then the weighted average of all output errors is returned.
        MAE output is non-negative floating point. The best value is 0.0.
"""
```
---
# Metric Card for Mahalanobis Distance
## Metric Description
Mahalanobis distance is the distance between a point and a distribution (as opposed to the distance between two points), making it the multivariate equivalent of the Euclidean distance.
It is often used in multivariate anomaly detection, classification on highly imbalanced datasets and one-class classification.
## How to Use
At minimum, this metric requires two `list`s of datapoints:
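What follows is a minimal sketch, assuming this implementation's `X` (the queried data points) and `reference_distribution` argument names; the printed value is what the computation yields for this toy reference distribution:
```python
>>> import evaluate
>>> mahalanobis_metric = evaluate.load("mahalanobis")
>>> results = mahalanobis_metric.compute(reference_distribution=[[0, 1], [1, 0]], X=[[0, 1]])
>>> print(results)
{'mahalanobis': array([0.5])}
```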
## Limitations and Bias
The Mahalanobis distance is only able to capture linear relationships between the variables, which means it cannot capture all types of outliers. It also fails to faithfully represent data that is highly skewed or multimodal.
## Citation
```bibtex
@inproceedings{mahalanobis1936generalized,
title={On the generalized distance in statistics},
author={Mahalanobis, Prasanta Chandra},
year={1936},
organization={National Institute of Science of India}
}
```
```bibtex
@article{de2000mahalanobis,
title={The Mahalanobis distance},
author={De Maesschalck, Roy and Jouan-Rimbaud, Delphine and Massart, D{\'e}sir{\'e} L},
journal={Chemometrics and intelligent laboratory systems},
volume={50},
number={1},
pages={1--18},
year={2000},
publisher={Elsevier}
}
```
---
# Metric Card for MAPE
## How to Use
Mandatory inputs:
- `predictions`: numeric array-like of shape (`n_samples,`) or (`n_samples`, `n_outputs`), representing the estimated target values.
- `references`: numeric array-like of shape (`n_samples,`) or (`n_samples`, `n_outputs`), representing the ground truth (correct) target values.
Optional arguments:
- `sample_weight`: numeric array-like of shape (`n_samples,`) representing sample weights. The default is `None`.
- `multioutput`: `raw_values`, `uniform_average` or numeric array-like of shape (`n_outputs,`), which defines the aggregation of multiple output values. The default value is `uniform_average`.
  - `raw_values` returns a full set of errors in case of multioutput input.
  - `uniform_average` means that the errors of all outputs are averaged with uniform weight.
  - the array-like value defines weights used to average errors.
### Output Values
This metric outputs a dictionary, containing the mean absolute percentage error score, which is of type:
- `float`: if multioutput is `uniform_average` or an ndarray of weights, then the weighted average of all output errors is returned.
- numeric array-like of shape (`n_outputs,`): if multioutput is `raw_values`, then the score is returned for each output separately.
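A minimal usage sketch, assuming the metric loads under the `mape` id; note that the score is reported as a fraction rather than a percentage, so the value below corresponds to an error of roughly 33%:
```python
>>> import evaluate
>>> mape_metric = evaluate.load("mape")
>>> predictions = [2.5, 0.0, 2, 8]
>>> references = [3, -0.5, 2, 7]
>>> results = mape_metric.compute(predictions=predictions, references=references)
>>> print(round(results['mape'], 2))
0.33
```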
Each MAPE `float` value is positive, with the best value being `0.0`.
## Limitations and Bias
One limitation of MAPE is that it cannot be used if the ground truth is zero or close to zero. This metric is also asymmetric in that it puts a heavier penalty on predictions that fall below the ground truth than on predictions that exceed it, and it can therefore bias model selection toward methods that under-predict.
## Citation(s)
```bibtex
@article{scikit-learn,
title={Scikit-learn: Machine Learning in {P}ython},
author={Pedregosa, F. and Varoquaux, G. and Gramfort, A. and Michel, V.
and Thirion, B. and Grisel, O. and Blondel, M. and Prettenhofer, P.
and Weiss, R. and Dubourg, V. and Vanderplas, J. and Passos, A. and
Cournapeau, D. and Brucher, M. and Perrot, M. and Duchesnay, E.},
journal={Journal of Machine Learning Research},
volume={12},
pages={2825--2830},
year={2011}
}
```
```bibtex
@article{DEMYTTENAERE201638,
title={Mean Absolute Percentage Error for regression models},
journal={Neurocomputing},
volume={192},
pages={38--48},
year={2016},
author={Arnaud de Myttenaere and Boris Golden and B{\'e}n{\'e}dicte Le Grand and Fabrice Rossi},
note={Advances in artificial neural networks, machine learning and computational intelligence}
}
```