title={Scikit-learn: Machine Learning in {P}ython},
author={Pedregosa, F. and Varoquaux, G. and Gramfort, A. and Michel, V.
and Thirion, B. and Grisel, O. and Blondel, M. and Prettenhofer, P.
and Weiss, R. and Dubourg, V. and Vanderplas, J. and Passos, A. and
Cournapeau, D. and Brucher, M. and Perrot, M. and Duchesnay, E.},
journal={Journal of Machine Learning Research},
volume={12},
pages={2825--2830},
year={2011}
}
"""
_DESCRIPTION = """\
Mean Absolute Percentage Error (MAPE) is the mean of the absolute percentage differences between the predicted and actual values.
"""
_KWARGS_DESCRIPTION = """
Args:
predictions: array-like of shape (n_samples,) or (n_samples, n_outputs)
Estimated target values.
references: array-like of shape (n_samples,) or (n_samples, n_outputs)
Ground truth (correct) target values.
sample_weight: array-like of shape (n_samples,), default=None
Sample weights.
multioutput: {"raw_values", "uniform_average"} or array-like of shape (n_outputs,), default="uniform_average"
Defines aggregating of multiple output values. Array-like value defines weights used to average errors.
"raw_values" : Returns a full set of errors in case of multioutput input.
"uniform_average" : Errors of all outputs are averaged with uniform weight.
Returns:
mape : mean absolute percentage error.
If multioutput is "raw_values", then mean absolute percentage error is returned for each output separately. If multioutput is "uniform_average" or an ndarray of weights, then the weighted average of all output errors is returned.
MAPE output is non-negative floating point. The best value is 0.0.
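Examples:
    >>> # a minimal illustrative sketch (assumes the metric loads under the name "mape"; toy values)
    >>> mape_metric = evaluate.load("mape")
    >>> predictions = [2.5, 0.0, 2, 8]
    >>> references = [3, -0.5, 2, 7]
    >>> results = mape_metric.compute(predictions=predictions, references=references)
    >>> # results["mape"] is approximately 0.327 for these values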
Mean Absolute Scaled Error (MASE) is the mean absolute error of the forecast values, divided by the mean absolute error of the in-sample one-step naive forecast on the training set.
---
# Metric Card for MASE
## Metric Description
Mean Absolute Scaled Error (MASE) is the mean absolute error of the forecast values, divided by the mean absolute error of the in-sample one-step naive forecast. For prediction $x_i$ and corresponding ground truth $y_i$, as well as training data $z_t$ with seasonality $p$, the metric is given by:

$$\mathrm{MASE} = \frac{\frac{1}{n}\sum_{i=1}^{n} |x_i - y_i|}{\frac{1}{T-p}\sum_{t=p+1}^{T} |z_t - z_{t-p}|},$$

where $n$ is the number of forecasts and $T$ is the length of the training series.

This metric:
* is independent of the scale of the data;
* has predictable behavior when predicted/ground-truth data is near zero;
* is symmetric;
* is interpretable, as values greater than one indicate that in-sample one-step forecasts from the naïve method perform better than the forecast values under consideration.
## How to Use
At minimum, this metric requires predictions, references and training data as inputs.
- `predictions`: numeric array-like of shape (`n_samples,`) or (`n_samples`, `n_outputs`), representing the estimated target values.
- `references`: numeric array-like of shape (`n_samples,`) or (`n_samples`, `n_outputs`), representing the ground truth (correct) target values.
- `training`: numeric array-like of shape (`n_train_samples,`) or (`n_train_samples`, `n_outputs`), representing the in-sample training data.
Optional arguments:
- `periodicity`: the seasonal periodicity of the training data. The default is 1.
- `sample_weight`: numeric array-like of shape (`n_samples,`) representing sample weights. The default is `None`.
- `multioutput`: `raw_values`, `uniform_average` or numeric array-like of shape (`n_outputs,`), which defines the aggregation of multiple output values. The default value is `uniform_average`.
  - `raw_values` returns a full set of errors in case of multioutput input.
  - `uniform_average` means that the errors of all outputs are averaged with uniform weight.
  - an array-like value defines the weights used to average the errors.
### Output Values
This metric outputs a dictionary containing the mean absolute scaled error score, which is of type:
- `float`: if multioutput is `uniform_average` or an ndarray of weights, then the weighted average of all output errors is returned.
- numeric array-like of shape (`n_outputs,`): if multioutput is `raw_values`, then the score is returned for each output separately.
Each MASE `float` value is non-negative, with the best value being `0.0`; values greater than `1.0` indicate that the forecast performs worse than the naïve in-sample one-step forecast.
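For instance, a minimal sketch (toy values chosen for illustration; the metric is assumed to load under the name `mase`):

```python
>>> mase_metric = evaluate.load("mase")
>>> predictions = [2.5, 0.0, 2, 8, 1.25]
>>> references = [3, -0.5, 2, 7, 2]
>>> training = [5, 0.5, 4, 6, 3, 5, 2]
>>> results = mase_metric.compute(predictions=predictions, references=references, training=training)
>>> print(round(results["mase"], 3))  # MAE(forecast) = 0.55, MAE(naive in-sample) = 3.0
0.183
```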
Mean Absolute Scaled Error (MASE) is the mean absolute error of the forecast values, divided by the mean absolute error of the in-sample one-step naive forecast.
"""
_KWARGS_DESCRIPTION = """
Args:
predictions: array-like of shape (n_samples,) or (n_samples, n_outputs)
Estimated target values.
references: array-like of shape (n_samples,) or (n_samples, n_outputs)
Ground truth (correct) target values.
training: array-like of shape (n_train_samples,) or (n_train_samples, n_outputs)
In-sample training data for the naive forecast.
periodicity: int, default=1
Seasonal periodicity of training data.
sample_weight: array-like of shape (n_samples,), default=None
Sample weights.
multioutput: {"raw_values", "uniform_average"} or array-like of shape (n_outputs,), default="uniform_average"
Defines aggregating of multiple output values. Array-like value defines weights used to average errors.
"raw_values" : Returns a full set of errors in case of multioutput input.
"uniform_average" : Errors of all outputs are averaged with uniform weight.
Returns:
mase : mean absolute scaled error.
If multioutput is "raw_values", then mean absolute percentage error is returned for each output separately. If multioutput is "uniform_average" or an ndarray of weights, then the weighted average of all output errors is returned.
MASE output is non-negative floating point. The best value is 0.0.
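Examples:
    >>> # a minimal illustrative sketch (assumes the metric loads under the name "mase"; toy values)
    >>> mase_metric = evaluate.load("mase")
    >>> predictions = [2.5, 0.0, 2, 8, 1.25]
    >>> references = [3, -0.5, 2, 7, 2]
    >>> training = [5, 0.5, 4, 6, 3, 5, 2]
    >>> results = mase_metric.compute(predictions=predictions, references=references, training=training)
    >>> # results["mase"] is approximately 0.183 for these values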
- **`predictions`** (`list` of `int`s): Predicted class labels.
- **`references`** (`list` of `int`s): Ground truth labels.
- **`sample_weight`** (`list` of `int`s, `float`s, or `bool`s): Sample weights. Defaults to `None`.
- **`average`** (`None` or `macro`): For the multilabel case, whether to return one correlation coefficient per feature (`average=None`), or the average of them (`average='macro'`). Defaults to `None`.
### Output Values
- **`matthews_correlation`** (`float` or `list` of `float`s): Matthews correlation coefficient, or list of them in the multilabel case without averaging.
The metric output takes the following form:
```python
{'matthews_correlation': 0.54}
```
The Matthews correlation coefficient can take any value from -1 to +1, inclusive.
#### Values from Popular Papers
### Examples
A basic example with only predictions and references as inputs:
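The sketch below assumes the metric loads under the name `matthews_correlation`; the labels are toy values chosen so the coefficient is easy to verify by hand:

```python
>>> matthews_metric = evaluate.load("matthews_correlation")
>>> results = matthews_metric.compute(references=[0, 1, 1, 0], predictions=[0, 1, 0, 0])
>>> # TP=1, TN=2, FP=0, FN=1, so MCC = 2 / sqrt(12) ≈ 0.577
>>> print(round(results["matthews_correlation"], 3))
0.577
```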
## Citation
```bibtex
@article{scikit-learn,
title={Scikit-learn: Machine Learning in {P}ython},
author={Pedregosa, F. and Varoquaux, G. and Gramfort, A. and Michel, V.
and Thirion, B. and Grisel, O. and Blondel, M. and Prettenhofer, P.
and Weiss, R. and Dubourg, V. and Vanderplas, J. and Passos, A. and
Cournapeau, D. and Brucher, M. and Perrot, M. and Duchesnay, E.},
journal={Journal of Machine Learning Research},
volume={12},
pages={2825--2830},
year={2011}
}
```
## Further References
- This Hugging Face implementation uses [this scikit-learn implementation](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.matthews_corrcoef.html)
MAUVE is a measure of the statistical gap between two text distributions, e.g., how far the text written by a model is from the distribution of human text, using samples from both distributions.
MAUVE is obtained by computing Kullback–Leibler (KL) divergences between the two distributions in a quantized embedding space of a large language model. It can quantify differences in the quality of generated text based on the size of the model, the decoding algorithm, and the length of the generated text. MAUVE was found to correlate the strongest with human evaluations over baseline metrics for open-ended text generation.
---
# Metric Card for MAUVE
## Metric description
MAUVE is a measure of the gap between neural text and human text. It is computed using the [Kullback–Leibler (KL) divergences](https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence) between the two distributions of text in a quantized embedding space of a large language model. MAUVE can identify differences in quality arising from model sizes and decoding algorithms.
This metric is a wrapper around the [official implementation](https://github.com/krishnap25/mauve) of MAUVE.
For more details, consult the [MAUVE paper](https://arxiv.org/abs/2102.01454).
## How to use
The metric takes two lists of strings of tokens separated by spaces: one representing `predictions` (i.e. the text generated by the model) and the second representing `references` (a reference text for each prediction):
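A minimal sketch of the basic call (toy strings shown here; note that computing the score downloads the default `gpt2-large` featurization model and benefits from a GPU):

```python
>>> mauve = evaluate.load('mauve')
>>> predictions = ["hello there", "general kenobi"]
>>> references = ["hello there", "general kenobi"]
>>> mauve_results = mauve.compute(predictions=predictions, references=references)
```

It also has several optional arguments: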
`num_buckets`: the size of the histogram to quantize P and Q. Options: `auto` (default) or an integer.
`pca_max_data`: the number of data points to use for PCA dimensionality reduction prior to clustering. If -1, use all the data. The default is `-1`.
`kmeans_explained_var`: the amount of variance of the data to keep in dimensionality reduction by PCA. The default is `0.9`.
`kmeans_num_redo`: number of times to redo k-means clustering (the best objective is kept). The default is `5`.
`kmeans_max_iter`: maximum number of k-means iterations. The default is `500`.
`featurize_model_name`: name of the model from which features are obtained, from one of the following: `gpt2`, `gpt2-medium`, `gpt2-large`, `gpt2-xl`. The default is `gpt2-large`.
`device_id`: Device for featurization. Supply a GPU id (e.g. `0` or `3`) to use GPU. If no GPU with this id is found, the metric will use CPU.
`max_text_length`: maximum number of tokens to consider. The default is `1024`.
`divergence_curve_discretization_size`: the number of points to consider on the divergence curve. The default is `25`.
`mauve_scaling_factor`: Hyperparameter for scaling. The default is `5`.
`verbose`: If `True` (default), running the metric will print running time updates.
`seed`: random seed to initialize k-means cluster assignments, randomly assigned by default.
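For instance, an illustrative sketch of overriding the featurization model and the seed (argument names as listed above):

```python
>>> mauve_results = mauve.compute(
...     predictions=predictions,
...     references=references,
...     featurize_model_name="gpt2",  # smaller than the default gpt2-large
...     seed=42,                      # fix the k-means initialization for reproducibility
... )
```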
## Output values
This metric outputs a dictionary with 5 key-value pairs:
`mauve`: MAUVE score, which ranges between 0 and 1. **Larger** values indicate that P and Q are closer.
`frontier_integral`: Frontier Integral, which ranges between 0 and 1. **Smaller** values indicate that P and Q are closer.
`divergence_curve`: a numpy.ndarray of shape (m, 2); plot it with `matplotlib` to view the divergence curve.
`p_hist`: a discrete distribution, which is a quantized version of the text distribution `p_text`.
`q_hist`: same as above, but with `q_text`.
### Values from popular papers
The [original MAUVE paper](https://arxiv.org/abs/2102.01454) reported values ranging from 0.88 to 0.94 for open-ended text generation using a text completion task in the web text domain. The authors found that bigger models resulted in higher MAUVE scores and that MAUVE is correlated with human judgments.
## Limitations and bias
The [original MAUVE paper](https://arxiv.org/abs/2102.01454) did not analyze the inductive biases present in different embedding models, but related work has shown different kinds of biases exist in many popular generative language models including GPT-2 (see [Kirk et al., 2021](https://arxiv.org/pdf/2102.04130.pdf), [Abid et al., 2021](https://arxiv.org/abs/2101.05783)). The extent to which these biases can impact the MAUVE score has not been quantified.
Also, calculating the MAUVE metric involves downloading the model from which features are obtained -- the default model, `gpt2-large`, takes over 3GB of storage space and downloading it can take a significant amount of time depending on the speed of your internet connection. If this is an issue, choose a smaller model; for instance, `gpt2` is 523MB.
It is a good idea to use at least 1000 samples for each distribution to compute MAUVE (the original paper uses 5000).
MAUVE is unable to identify very small differences between different settings of generation (e.g., between top-p sampling with p=0.95 versus 0.96). It is important, therefore, to account for the randomness inside the generation (e.g., due to sampling) and within the MAUVE estimation procedure (see the `seed` parameter above). Concretely, it is a good idea to obtain generations using multiple random seeds and/or to rerun MAUVE with multiple values of the parameter `seed`.
For MAUVE to be large, the model distribution must be close to the human text distribution as seen by the embeddings. It is possible to have high-quality model text that still has a small MAUVE score (i.e., large gap) if it contains text about different topics/subjects, or uses a different writing style or vocabulary, or contains texts of a different length distribution. MAUVE summarizes the statistical gap (as measured by the large language model embeddings) --- this includes all these factors in addition to the quality-related aspects such as grammaticality.
See the [official implementation](https://github.com/krishnap25/mauve#best-practices-for-mauve) for more details about best practices.
## Citation
```bibtex
@inproceedings{pillutla-etal:mauve:neurips2021,
title={MAUVE: Measuring the Gap Between Neural Text and Human Text using Divergence Frontiers},
author={Pillutla, Krishna and Swayamdipta, Swabha and Zellers, Rowan and Thickstun, John and Welleck, Sean and Choi, Yejin and Harchaoui, Zaid},
booktitle={NeurIPS},
year={2021}
}
```
# Copyright 2020 The HuggingFace Evaluate Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" MAUVE metric from https://github.com/krishnap25/mauve. """
import datasets
import faiss  # Here to have a nice missing dependency error message early on
import numpy  # Here to have a nice missing dependency error message early on
import requests  # Here to have a nice missing dependency error message early on
import sklearn  # Here to have a nice missing dependency error message early on
import tqdm  # Here to have a nice missing dependency error message early on
from mauve import compute_mauve  # From: mauve-text

import evaluate
_CITATION = """\
@inproceedings{pillutla-etal:mauve:neurips2021,
title={{MAUVE: Measuring the Gap Between Neural Text and Human Text using Divergence Frontiers}},
author={Pillutla, Krishna and Swayamdipta, Swabha and Zellers, Rowan and Thickstun, John and Welleck, Sean and Choi, Yejin and Harchaoui, Zaid},
booktitle = {NeurIPS},
year = {2021}
}
@article{pillutla-etal:mauve:arxiv2022,
title={{MAUVE Scores for Generative Models: Theory and Practice}},
author={Pillutla, Krishna and Liu, Lang and Thickstun, John and Welleck, Sean and Swayamdipta, Swabha and Zellers, Rowan and Oh, Sewoong and Choi, Yejin and Harchaoui, Zaid},
journal={arXiv Preprint},
year={2022}
}
"""
_DESCRIPTION = """\
MAUVE is a measure of the statistical gap between two text distributions, e.g., how far the text written by a model is from the distribution of human text, using samples from both distributions.
MAUVE is obtained by computing Kullback–Leibler (KL) divergences between the two distributions in a quantized embedding space of a large language model.
It can quantify differences in the quality of generated text based on the size of the model, the decoding algorithm, and the length of the generated text.
MAUVE was found to correlate the strongest with human evaluations over baseline metrics for open-ended text generation.
This metric is a wrapper around the official implementation of MAUVE:
https://github.com/krishnap25/mauve
"""
_KWARGS_DESCRIPTION = """
Calculates MAUVE scores between two lists of generated text and reference text.
Args:
predictions: list of generated text to score. Each prediction
should be a string with tokens separated by spaces.
references: list of references, one for each prediction. Each
reference should be a string with tokens separated by spaces.
Optional Args:
num_buckets: the size of the histogram to quantize P and Q. Options: 'auto' (default) or an integer
pca_max_data: the number of data points to use for PCA dimensionality reduction prior to clustering. If -1, use all the data. Default -1
kmeans_explained_var: amount of variance of the data to keep in dimensionality reduction by PCA. Default 0.9
kmeans_num_redo: number of times to redo k-means clustering (the best objective is kept). Default 5
kmeans_max_iter: maximum number of k-means iterations. Default 500
featurize_model_name: name of the model from which features are obtained. Default 'gpt2-large' Use one of ['gpt2', 'gpt2-medium', 'gpt2-large', 'gpt2-xl'].
device_id: Device for featurization. Supply a GPU id (e.g. 0 or 3) to use GPU. If no GPU with this id is found, use CPU
max_text_length: maximum number of tokens to consider. Default 1024
divergence_curve_discretization_size: Number of points to consider on the divergence curve. Default 25
mauve_scaling_factor: "c" from the paper. Default 5.
verbose: If True (default), print running time updates
seed: random seed to initialize k-means cluster assignments.
Returns:
mauve: MAUVE score, a number between 0 and 1. Larger values indicate that P and Q are closer,
frontier_integral: Frontier Integral, a number between 0 and 1. Smaller values indicate that P and Q are closer,
divergence_curve: a numpy.ndarray of shape (m, 2); plot it with matplotlib to view the divergence curve,
p_hist: a discrete distribution, which is a quantized version of the text distribution p_text,
q_hist: same as above, but with q_text.
Examples:
>>> # faiss segfaults in doctest for some reason, so the .compute call is not tested with doctest
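>>> # A minimal usage sketch (assumes the mauve-text package is installed; calls are skipped by doctest):
>>> import evaluate
>>> mauve = evaluate.load('mauve')
>>> predictions = ["hello there", "general kenobi"]
>>> references = ["hello there", "general kenobi"]
>>> out = mauve.compute(predictions=predictions, references=references) # doctest: +SKIP
>>> print(out.mauve) # doctest: +SKIP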
IoU is the area of overlap between the predicted segmentation and the ground truth divided by the area of union
between the predicted segmentation and the ground truth. For binary (two classes) or multi-class segmentation,
the mean IoU of the image is calculated by taking the IoU of each class and averaging them.
---
# Metric Card for Mean IoU
## Metric Description
IoU (Intersection over Union) is the area of overlap between the predicted segmentation and the ground truth divided by the area of union between the predicted segmentation and the ground truth.
For binary (two classes) or multi-class segmentation, the *mean IoU* of the image is calculated by taking the IoU of each class and averaging them.
## How to Use
The Mean IoU metric takes two lists of numeric 2D arrays as input corresponding to the predicted and ground truth segmentations:
- `predictions` (`List[ndarray]`): List of predicted segmentation maps, each of shape (height, width). Each segmentation map can be of a different size.
- `references` (`List[ndarray]`): List of ground truth segmentation maps, each of shape (height, width). Each segmentation map can be of a different size.
- `num_labels` (`int`): Number of classes (categories).
- `ignore_index` (`int`): Index that will be ignored during evaluation.
**Optional inputs**
- `nan_to_num` (`int`): If specified, NaN values will be replaced by the number defined by the user.
- `label_map` (`dict`): If specified, dictionary mapping old label indices to new label indices.
- `reduce_labels` (`bool`): Whether or not to reduce all label values of segmentation maps by 1. Usually used for datasets where 0 is used for background, and background itself is not included in all classes of a dataset (e.g. ADE20k). The background label will be replaced by 255. The default value is `False`.
### Output Values
The metric returns a dictionary with the following elements:
- `mean_iou` (`float`): Mean Intersection-over-Union (IoU averaged over all categories).
- `mean_accuracy` (`float`): Mean accuracy (averaged over all categories).
- `overall_accuracy` (`float`): Overall accuracy on all images.
- `per_category_accuracy` (`ndarray` of shape `(num_labels,)`): Per category accuracy.
- `per_category_iou` (`ndarray` of shape `(num_labels,)`): Per category IoU.
The values of all of the reported scores range from `0.0` (minimum) to `1.0` (maximum).
The [leaderboard for the CityScapes dataset](https://paperswithcode.com/sota/semantic-segmentation-on-cityscapes) reports a Mean IOU ranging from 64 to 84; that of [ADE20k](https://paperswithcode.com/sota/semantic-segmentation-on-ade20k) ranges from 30 to a peak of 59.9, indicating that the dataset is more difficult for current approaches (as of 2022).
### Examples
```python
>>> import numpy as np
>>> mean_iou = evaluate.load("mean_iou")
>>> # suppose one has 3 different segmentation maps predicted
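>>> # an illustrative continuation: only one toy map shown for brevity (values are made up; 10 classes, 255 as ignore index)
>>> predicted = np.array([[2, 2, 3], [8, 2, 4], [3, 255, 2]])
>>> ground_truth = np.array([[1, 2, 2], [8, 2, 1], [3, 255, 1]])
>>> results = mean_iou.compute(predictions=[predicted], references=[ground_truth], num_labels=10, ignore_index=255)
>>> # results is a dictionary with mean_iou, mean_accuracy, overall_accuracy and the per-category arrays
```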
## Limitations and Bias
Mean IoU is an average metric, so it will not show you where model predictions differ from the ground truth (i.e. if there are particular regions or classes that the model does poorly on). Further error analysis is needed to gather actionable insights that can be used to inform model improvements.
METEOR is an automatic metric for machine translation evaluation
that is based on a generalized concept of unigram matching between the
machine-produced translation and human-produced reference translations.
Unigrams can be matched based on their surface forms, stemmed forms,
and meanings; furthermore, METEOR can be easily extended to include more
advanced matching strategies. Once all generalized unigram matches
between the two strings have been found, METEOR computes a score for
this matching using a combination of unigram-precision, unigram-recall, and
a measure of fragmentation that is designed to directly capture how
well-ordered the matched words in the machine translation are in relation
to the reference.
METEOR gets an R correlation value of 0.347 with human evaluation on the Arabic
data and 0.331 on the Chinese data. This is shown to be an improvement on
using simply unigram-precision, unigram-recall and their harmonic F1
combination.
---
# Metric Card for METEOR
## Metric description
METEOR (Metric for Evaluation of Translation with Explicit ORdering) is a machine translation evaluation metric, which is calculated based on the harmonic mean of precision and recall, with recall weighted more than precision.
METEOR is based on a generalized concept of unigram matching between the machine-produced translation and human-produced reference translations. Unigrams can be matched based on their surface forms, stemmed forms, and meanings. Once all generalized unigram matches between the two strings have been found, METEOR computes a score for this matching using a combination of unigram-precision, unigram-recall, and a measure of fragmentation that is designed to directly capture how well-ordered the matched words in the machine translation are in relation to the reference.
## How to use
METEOR has two mandatory arguments:
`predictions`: a `list` of predictions to score. Each prediction should be a string with tokens separated by spaces.
`references`: a `list` of references (in the case of one `reference` per `prediction`), or a `list` of `lists` of references (in the case of multiple `references` per `prediction`). Each reference should be a string with tokens separated by spaces.
It also has several optional parameters:
`alpha`: Parameter for controlling relative weights of precision and recall. The default value is `0.9`.
`beta`: Parameter for controlling shape of penalty as a function of fragmentation. The default value is `3`.
`gamma`: The relative weight assigned to fragmentation penalty. The default is `0.5`.
Refer to the [METEOR paper](https://aclanthology.org/W05-0909.pdf) for more information about parameter values and ranges.
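As a rough sketch of how these parameters enter the score (this follows the parameterization used in NLTK's `meteor_score`, a common implementation; treat it as illustrative rather than as the exact formula of this wrapper):

$$F_{mean} = \frac{P\,R}{\alpha P + (1-\alpha)R}, \qquad \text{Penalty} = \gamma \left(\frac{\#\text{chunks}}{\#\text{matches}}\right)^{\beta}, \qquad \text{METEOR} = F_{mean}\,(1 - \text{Penalty}),$$

where $P$ and $R$ are unigram precision and recall, and a chunk is a maximal run of matched unigrams that appear adjacent and in the same order in both strings.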
```python
>>> meteor = evaluate.load('meteor')
>>> predictions = ["It is a guide to action which ensures that the military always obeys the commands of the party"]
>>> references = ["It is a guide to action that ensures that the military will forever heed Party commands"]
## Output values
The metric outputs a dictionary containing the METEOR score. Its values range from 0 to 1, e.g.:
```
{'meteor': 0.9999142661179699}
```
### Values from popular papers
The [METEOR paper](https://aclanthology.org/W05-0909.pdf) does not report METEOR score values for different models, but it does report that METEOR gets an R correlation value of 0.347 with human evaluation on the Arabic data and 0.331 on the Chinese data.
## Examples
One `reference` per `prediction`:
```python
>>> meteor = evaluate.load('meteor')
>>> predictions = ["It is a guide to action which ensures that the military always obeys the commands of the party"]
>>> reference = ["It is a guide to action which ensures that the military always obeys the commands of the party"]
```

Multiple `references` per `prediction`:

```python
>>> predictions = ["It is a guide to action which ensures that the military always obeys the commands of the party"]
>>> references = [['It is a guide to action that ensures that the military will forever heed Party commands', 'It is the guiding principle which guarantees the military forces always being under the command of the Party', 'It is the practical guide for the army always to heed the directions of the party']]
```
Multiple `references` per `prediction`, partial match:
```python
>>> meteor = evaluate.load('meteor')
>>> predictions = ["It is a guide to action which ensures that the military always obeys the commands of the party"]
>>> references = [['It is a guide to action that ensures that the military will forever heed Party commands', 'It is the guiding principle which guarantees the military forces always being under the command of the Party', 'It is the practical guide for the army always to heed the directions of the party']]
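>>> # a sketch of the scoring call for the partial-match case (exact value depends on the implementation version)
>>> results = meteor.compute(predictions=predictions, references=references)
```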
## Limitations and bias
While the correlation between METEOR and human judgments was measured for Chinese and Arabic and found to be significant, further experimentation is needed to check its correlation for other languages.
Furthermore, while the alignment and matching done in METEOR is based on unigrams, using multiple word entities (e.g. bigrams) could contribute to improving its accuracy -- this has been proposed in [more recent publications](https://www.cs.cmu.edu/~alavie/METEOR/pdf/meteor-naacl-2010.pdf) on the subject.
## Citation
```bibtex
@inproceedings{banarjee2005,
title={{METEOR}: An Automatic Metric for {MT} Evaluation with Improved Correlation with Human Judgments},
author={Banerjee, Satanjeev and Lavie, Alon},
booktitle={Proceedings of the {ACL} Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization},
month=jun,
year={2005},
address={Ann Arbor, Michigan},
publisher={Association for Computational Linguistics},