🤗 Evaluate strives to be transparent and explicit about how it works, but this can be quite verbose at times. We have included a series of logging methods which allow you to easily adjust the level of verbosity of the entire library. Currently the default verbosity of the library is set to `WARNING`.
To change the level of verbosity, use one of the direct setters. For instance, here is how to change the verbosity to the `INFO` level:
```py
import evaluate
evaluate.logging.set_verbosity_info()
```
You can also use the environment variable `EVALUATE_VERBOSITY` to override the default verbosity, and set it to one of the following: `debug`, `info`, `warning`, `error`, `critical`:
```bash
EVALUATE_VERBOSITY=error ./myprogram.py
```
All the methods of this logging module are documented below. The main ones are:
- [`logging.get_verbosity`] to get the current level of verbosity in the logger
- [`logging.set_verbosity`] to set the verbosity to the level of your choice
In order from the least to the most verbose (with their corresponding `int` values):
1. `logging.CRITICAL` or `logging.FATAL` (int value, 50): only report the most critical errors.
2. `logging.ERROR` (int value, 40): only report errors.
3. `logging.WARNING` or `logging.WARN` (int value, 30): only reports errors and warnings. This is the default level used by the library.
4. `logging.INFO` (int value, 20): reports errors, warnings, and basic information.
5. `logging.DEBUG` (int value, 10): reports all information.
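For example, to inspect and change the verbosity programmatically with the methods listed above:
```py
import evaluate

# Returns the current level as an int (e.g. 30 for WARNING).
print(evaluate.logging.get_verbosity())

# Switch to DEBUG to see every message the library emits.
evaluate.logging.set_verbosity(evaluate.logging.DEBUG)
```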
By default, `tqdm` progress bars are displayed while evaluation modules are downloaded and data is processed. [`logging.disable_progress_bar`] and [`logging.enable_progress_bar`] can be used to disable or re-enable this behavior.
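For instance, to hide the progress bars and turn them back on later:
```py
import evaluate

# Suppress tqdm progress bars during downloads and processing.
evaluate.logging.disable_progress_bar()

# Re-enable them when needed.
evaluate.logging.enable_progress_bar()
```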
To run the scikit-learn examples make sure you have installed the following library:
```bash
pip install -U scikit-learn
```
The metrics in `evaluate` can be easily integrated with a Scikit-Learn estimator or [pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html#sklearn.pipeline.Pipeline).
However, these metrics require that we generate the predictions from the model. The predictions and labels from the estimators can be passed to `evaluate` metrics to compute the required values.
```python
import numpy as np
np.random.seed(0)
import evaluate
from sklearn.compose import ColumnTransformer
from sklearn.datasets import fetch_openml
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
```
Load data from https://www.openml.org/d/40945:
```python
X, y = fetch_openml("titanic", version=1, as_frame=True, return_X_y=True)
```
Alternatively, `X` and `y` can be obtained directly from the `frame` attribute of the fetched dataset:
```python
titanic = fetch_openml("titanic", version=1, as_frame=True)
X = titanic.frame.drop('survived', axis=1)
y = titanic.frame['survived']
```
We create the preprocessing pipelines for both numeric and categorical data, as shown in the sketch below. Note that `pclass` could be treated as either a categorical or a numeric feature.
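The following is a minimal sketch of such a pipeline, building on the imports and data loaded above; the selected feature columns and the choice of the `accuracy` metric are illustrative, not prescribed by the library:
```python
# Preprocess numeric and categorical columns separately (illustrative column choices).
numeric_features = ["age", "fare"]
numeric_transformer = Pipeline(
    steps=[("imputer", SimpleImputer(strategy="median")), ("scaler", StandardScaler())]
)

categorical_features = ["embarked", "sex", "pclass"]
categorical_transformer = OneHotEncoder(handle_unknown="ignore")

preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features),
    ]
)

# Append a classifier to the preprocessing pipeline to get a full prediction pipeline.
clf = Pipeline(steps=[("preprocessor", preprocessor), ("classifier", LogisticRegression())])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Pass the estimator's predictions and the reference labels to an `evaluate` metric.
accuracy = evaluate.load("accuracy")
print(accuracy.compute(predictions=list(y_pred.astype(int)), references=list(y_test.astype(int))))
```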
You can use any `evaluate` metric with the `Trainer` and `Seq2SeqTrainer` as long as they are compatible with the task and predictions. In case you don't want to train a model but just evaluate an existing model, you can replace `trainer.train()` with `trainer.evaluate()` in the above scripts.
The goal of the 🤗 Evaluate library is to support different types of evaluation, depending on different goals, datasets and models.
Here are the types of evaluations that are currently supported with a few examples for each:
## Metrics
A metric measures the performance of a model on a given dataset. This is often based on an existing ground truth (i.e. a set of references), but there are also *referenceless metrics* which allow evaluating generated text by leveraging a pretrained model such as [GPT-2](https://huggingface.co/gpt2).
Examples of metrics include:
- [Accuracy](https://huggingface.co/metrics/accuracy) : the proportion of correct predictions among the total number of cases processed.
- [Exact Match](https://huggingface.co/metrics/exact_match): the rate at which the input predicted strings exactly match their references.
- [Mean Intersection over Union (IoU)](https://huggingface.co/metrics/mean_iou): the area of overlap between the predicted segmentation of an image and the ground truth divided by the area of union between the predicted segmentation and the ground truth.
Metrics are often used to track model performance on benchmark datasets, and to report progress on tasks such as [machine translation](https://huggingface.co/tasks/translation) and [image classification](https://huggingface.co/tasks/image-classification).
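For example, a metric such as accuracy can be loaded by name and computed directly from lists of predictions and references:
```python
import evaluate

accuracy = evaluate.load("accuracy")
# Three of the four predictions match their references, so the score is 0.75.
results = accuracy.compute(references=[0, 1, 1, 0], predictions=[0, 1, 0, 0])
print(results)  # {'accuracy': 0.75}
```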
## Comparisons
Comparisons can be useful to compare the performance of two or more models on a single test dataset.
For instance, the [McNemar Test](https://github.com/huggingface/evaluate/tree/main/comparisons/mcnemar) is a paired nonparametric statistical hypothesis test that takes the predictions of two models and compares them, aiming to measure whether the models' predictions diverge or not. The p value it outputs, which ranges from `0.0` to `1.0`, indicates the difference between the two models' predictions, with a lower p value indicating a more significant difference.
Comparisons have yet to be used systematically when comparing and reporting model performance; however, they are useful tools for going beyond simply comparing leaderboard scores and for getting more information on the way model predictions differ.
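As an illustrative sketch (the exact output keys come from the comparison's own documentation and may differ), a comparison can be loaded and run on two sets of predictions against shared references:
```python
import evaluate

mcnemar = evaluate.load("mcnemar", module_type="comparison")
results = mcnemar.compute(
    predictions1=[0, 1, 1, 0],
    predictions2=[1, 1, 1, 0],
    references=[0, 1, 0, 1],
)
print(results)  # a test statistic and a p value
```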
## Measurements
In the 🤗 Evaluate library, measurements are tools for gaining more insights on datasets and model predictions.
For instance, in the case of datasets, it can be useful to calculate the [average word length](https://github.com/huggingface/evaluate/tree/main/measurements/word_length) of a dataset's entries, and how it is distributed -- this can help when choosing the maximum input length for [Tokenizer](https://huggingface.co/docs/transformers/main_classes/tokenizer).
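As a minimal sketch (the output key name is taken from the measurement's documentation and may vary), the word length measurement can be applied to a list of texts:
```python
import evaluate

word_length = evaluate.load("word_length", module_type="measurement")
results = word_length.compute(data=["hello world", "the quick brown fox jumps over the lazy dog"])
print(results)  # e.g. {'average_word_length': ...}
```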
In the case of model predictions, it can help to calculate the average [perplexity](https://huggingface.co/metrics/perplexity) of model predictions using different models such as [GPT-2](https://huggingface.co/gpt2) and [BERT](https://huggingface.co/bert-base-uncased), which can indicate the quality of generated text when no reference is available.
All three types of evaluation supported by the 🤗 Evaluate library are meant to be mutually complementary, and help our community carry out more mindful and responsible evaluation.
We will continue adding more types of metrics, measurements and comparisons in coming months, and are counting on community involvement (via [PRs](https://github.com/huggingface/evaluate/compare) and [issues](https://github.com/huggingface/evaluate/issues/new/choose)) to make the library as extensive and inclusive as possible!
The HONEST score is a multilingual score that aims to compute how likely each language model is to produce hurtful completions based on a predefined set of prompts.
---
# Measurement Card for HONEST
## Measurement description
The HONEST score aims to measure hurtful sentence completions in language models.
The score uses HurtLex, a multilingual lexicon of hurtful language, to evaluate the completions.
It aims to quantify how often sentences are completed with a hurtful word, and if there is a difference between
groups (e.g. genders, sexual orientations, etc.).
## How to use
When loading the measurement, specify the language of the prompts and completions.
The available languages are: 'it' (Italian), 'fr' (French), 'es' (Spanish), 'pt' (Portuguese), 'ro' (Romanian), 'en' (English).
```python
>>> honest = evaluate.load('honest', 'en')
```
Arguments:
**predictions** (list of list of `str`): a list of completions to [HONEST prompts](https://huggingface.co/datasets/MilaNLProc/honest)
**groups** (list of `str`) (*optional*): a list of the identity groups each list of completions belongs to.
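A minimal sketch of computing the score, using invented completions and group labels purely for illustration:
```python
>>> completions = [["CEO", "businessman", "father"], ["good", "excellent", "smart"], ["secretary", "waitress", "maid"], ["beautiful", "smart", "tall"]]
>>> groups = ["male", "male", "female", "female"]
>>> result = honest.compute(predictions=completions, groups=groups)
```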
## Output values
`honest_score`: the HONEST score, representing the average number of hurtful completions across all groups
`honest_score_per_group`: the HONEST score of each group separately.
### Values from popular papers
In the [original HONEST paper](https://aclanthology.org/2021.naacl-main.191.pdf), HONEST scores were calculated for a range of language models, with Top K referring to the number of model completions that were evaluated.
title={"{HONEST}: Measuring Hurtful Sentence Completion in Language Models"},
author="Nozza, Debora and Bianchi, Federico and Hovy, Dirk",
booktitle="Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies",
month=jun,
year="2021",
address="Online",
publisher="Association for Computational Linguistics",
title={Measuring Harmful Sentence Completion in Language Models for LGBTQIA+ Individuals},
author="Nozza, Debora and Bianchi, Federico and Lauscher, Anne and Hovy, Dirk",
booktitle="Proceedings of the Second Workshop on Language Technology for Equality, Diversity and Inclusion",
publisher="Association for Computational Linguistics",
year={2022}
}
```
## Further References
- Bassignana, Elisa, Valerio Basile, and Viviana Patti. ["Hurtlex: A multilingual lexicon of words to hurt."](http://ceur-ws.org/Vol-2253/paper49.pdf) 5th Italian Conference on Computational Linguistics, CLiC-it 2018. Vol. 2253. CEUR-WS, 2018.
- **data** (`list`): a list of integers or strings containing the data labels.
### Output Values
By default, this metric outputs a dictionary that contains:
- **label_distribution** (`dict`): a dictionary containing two sets of keys and values: `labels`, which includes the list of labels contained in the dataset, and `fractions`, which includes the fraction of each label.
- **label_skew** (`scalar`): the asymmetry of the label distribution.
If skewness is 0, the dataset is perfectly balanced; if it is less than -1 or greater than 1, the distribution is highly skewed; anything in between can be considered moderately skewed.
#### Values from Popular Papers
### Examples
Calculating the label distribution of a dataset with binary labels:
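A minimal sketch (the printed values are rounded and the label order may differ):
```python
>>> data = [1, 0, 1, 1, 0, 1, 0]
>>> distribution = evaluate.load("label_distribution")
>>> results = distribution.compute(data=data)
>>> print(results)
{'label_distribution': {'labels': [1, 0], 'fractions': [0.5714, 0.4286]}, 'label_skew': -0.2887}
```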
While the label distribution can be a useful signal for analyzing datasets and choosing metrics for measuring model performance, it is best accompanied by additional data exploration to better understand each subset of the dataset and how the subsets differ.
## Citation
## Further References
- [Facing Imbalanced Data Recommendations for the Use of Performance Metrics](https://sites.pitt.edu/~jeffcohn/skew/PID2829477.pdf)
# Copyright 2022 The HuggingFace Datasets Authors and the current dataset script contributor.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Label Distribution Measurement."""
from collections import Counter

import datasets
import pandas as pd
from scipy import stats

import evaluate
_DESCRIPTION = """
Returns the label ratios of the dataset labels, as well as a scalar for skewness.
"""
_KWARGS_DESCRIPTION = """
Args:
    `data`: a list containing the data labels

Returns:
    `label_distribution` (`dict`): a dictionary containing two sets of keys and values: `labels`, which includes the list of labels contained in the dataset, and `fractions`, which includes the fraction of each label.
    `label_skew` (`scalar`): the asymmetry of the label distribution.

Examples:
    >>> data = [1, 0, 1, 1, 0, 1, 0]
    >>> distribution = evaluate.load("label_distribution")
    >>> results = distribution.compute(data=data)
"""
Perplexity (PPL) can be used to evaluate the extent to which a dataset is similar to the distribution of text that a given model was trained on.
It is defined as the exponentiated average negative log-likelihood of a sequence, calculated with exponent base `e`.
For more information on perplexity, see [this tutorial](https://huggingface.co/docs/transformers/perplexity).
---
# Measurement Card for Perplexity
## Measurement Description
Given a model and an input text sequence, perplexity measures how likely the model is to generate the input text sequence.
As a measurement, it can be used to evaluate how well text matches the distribution of text that the input model was trained on.
In this case, `model_id` should be the trained model, and `data` should be the text to be evaluated.
This implementation of perplexity is calculated with log base `e`, as in `perplexity = e**(sum(losses) / num_tokenized_tokens)`, following recent convention in deep learning frameworks.
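As a tiny worked illustration of this formula, with made-up per-token losses:
```python
import math

# Hypothetical per-token negative log-likelihoods (natural log).
losses = [2.1, 1.8, 2.4, 2.0]

# perplexity = e**(sum(losses) / num_tokenized_tokens)
perplexity = math.exp(sum(losses) / len(losses))
print(round(perplexity, 2))  # ~7.96
```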
## Intended Uses
Dataset analysis or exploration.
## How to Use
The measurement takes a list of texts as input, as well as the name of the model used to compute the metric:
- **model_id** (str): model used for calculating Perplexity. NOTE: Perplexity can only be calculated for causal language models.
    - This includes models such as gpt2, causal variations of bert, causal versions of t5, and more (the full list can be found in the AutoModelForCausalLM documentation here: https://huggingface.co/docs/transformers/master/en/model_doc/auto#transformers.AutoModelForCausalLM )
- **data** (list of str): input text, where each separate text snippet is one list entry.
- **batch_size** (int): the batch size to run texts through the model. Defaults to 16.
- **add_start_token** (bool): whether to add the start token to the texts, so the perplexity can include the probability of the first word. Defaults to True.
- **device** (str): device to run on, defaults to `cuda` when available.
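A minimal sketch of running the measurement; the model name and texts are illustrative:
```python
import evaluate

perplexity = evaluate.load("perplexity", module_type="measurement")
data = ["lorem ipsum", "Happy Birthday!", "Bingo!"]
results = perplexity.compute(model_id="gpt2", data=data)

# The result contains one perplexity score per input text plus their average.
print(results)
```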
### Output Values
This measurement outputs a dictionary with the perplexity score for each text in the input list, as well as the average perplexity.
If one of the input texts is longer than the max input length of the model, then it is truncated to the max length for the perplexity computation.
Note that the output value is based heavily on what text the model was trained on. This means that perplexity scores are not comparable between models or datasets.
## Citation
```bibtex
@article{jelinek1977perplexity,
title={Perplexity—a measure of the difficulty of speech recognition tasks},
author={Jelinek, Fred and Mercer, Robert L and Bahl, Lalit R and Baker, James K},
journal={The Journal of the Acoustical Society of America},
volume={62},
number={S1},
pages={S63--S63},
year={1977},
publisher={Acoustical Society of America}
}
```
## Further References
- [Hugging Face Perplexity Blog Post](https://huggingface.co/docs/transformers/perplexity)