# Evaluator
The evaluator classes for automatic evaluation.
## Evaluator classes
The main entry point for using the evaluator:
[[autodoc]] evaluate.evaluator
The base class for all evaluator classes:
[[autodoc]] evaluate.Evaluator
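As a quick orientation, here is a minimal sketch of how the evaluator entry point is typically combined with a `transformers` pipeline; the checkpoint, dataset slice, and label mapping below are illustrative choices, not requirements of the API:
```python
import evaluate
from datasets import load_dataset
from transformers import pipeline

# Build a task evaluator and score a pipeline on a small slice of data;
# the model checkpoint and dataset are examples only.
task_evaluator = evaluate.evaluator("text-classification")
results = task_evaluator.compute(
    model_or_pipeline=pipeline("text-classification", model="distilbert-base-uncased-finetuned-sst-2-english"),
    data=load_dataset("imdb", split="test[:100]"),
    metric="accuracy",
    label_mapping={"NEGATIVE": 0, "POSITIVE": 1},
)
print(results)
```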
## The task specific evaluators
### ImageClassificationEvaluator
[[autodoc]] evaluate.ImageClassificationEvaluator
### QuestionAnsweringEvaluator
[[autodoc]] evaluate.QuestionAnsweringEvaluator
- compute
### TextClassificationEvaluator
[[autodoc]] evaluate.TextClassificationEvaluator
### TokenClassificationEvaluator
[[autodoc]] evaluate.TokenClassificationEvaluator
- compute
### TextGenerationEvaluator
[[autodoc]] evaluate.TextGenerationEvaluator
- compute
### Text2TextGenerationEvaluator
[[autodoc]] evaluate.Text2TextGenerationEvaluator
- compute
### SummarizationEvaluator
[[autodoc]] evaluate.SummarizationEvaluator
- compute
### TranslationEvaluator
[[autodoc]] evaluate.TranslationEvaluator
- compute
### AutomaticSpeechRecognitionEvaluator
[[autodoc]] evaluate.AutomaticSpeechRecognitionEvaluator
- compute
### AudioClassificationEvaluator
[[autodoc]] evaluate.AudioClassificationEvaluator
- compute
# Hub methods
Methods for using the Hugging Face Hub:
## Push to hub
[[autodoc]] evaluate.push_to_hub
# Loading methods
Methods for listing and loading evaluation modules:
## List
[[autodoc]] evaluate.list_evaluation_modules
## Load
[[autodoc]] evaluate.load
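A short sketch of how these two entry points are typically combined:
```python
import evaluate

# Browse the available evaluation modules, then load one by name
all_modules = evaluate.list_evaluation_modules()
print(len(all_modules))

accuracy = evaluate.load("accuracy")
print(accuracy.compute(references=[0, 1, 1, 0], predictions=[0, 1, 0, 0]))
```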
# Logging methods
🤗 Evaluate strives to be transparent and explicit about how it works, but this can be quite verbose at times. We have included a series of logging methods which allow you to easily adjust the level of verbosity of the entire library. Currently the default verbosity of the library is set to `WARNING`.
To change the level of verbosity, use one of the direct setters. For instance, here is how to change the verbosity to the `INFO` level:
```py
import evaluate
evaluate.logging.set_verbosity_info()
```
You can also use the environment variable `EVALUATE_VERBOSITY` to override the default verbosity, and set it to one of the following: `debug`, `info`, `warning`, `error`, `critical`:
```bash
EVALUATE_VERBOSITY=error ./myprogram.py
```
All the methods of this logging module are documented below. The main ones are:
- [`logging.get_verbosity`] to get the current level of verbosity in the logger
- [`logging.set_verbosity`] to set the verbosity to the level of your choice
In order from the least to the most verbose (with their corresponding `int` values):
1. `logging.CRITICAL` or `logging.FATAL` (int value, 50): only report the most critical errors.
2. `logging.ERROR` (int value, 40): only report errors.
3. `logging.WARNING` or `logging.WARN` (int value, 30): only reports errors and warnings. This is the default level used by the library.
4. `logging.INFO` (int value, 20): reports errors, warnings, and basic information.
5. `logging.DEBUG` (int value, 10): report all information.
By default, `tqdm` progress bars will be displayed while evaluation modules are downloaded and processed. [`logging.disable_progress_bar`] and [`logging.enable_progress_bar`] can be used to disable or re-enable this behavior.
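For example, the verbosity and progress bar helpers documented below can be combined as follows (a minimal sketch):
```python
import evaluate

# Inspect and change the library-wide verbosity
print(evaluate.logging.get_verbosity())  # 30, i.e. WARNING, by default
evaluate.logging.set_verbosity(evaluate.logging.DEBUG)

# Turn the tqdm progress bars off and back on
evaluate.logging.disable_progress_bar()
evaluate.logging.enable_progress_bar()
```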
## Functions
[[autodoc]] evaluate.logging.get_verbosity
[[autodoc]] evaluate.logging.set_verbosity
[[autodoc]] evaluate.logging.set_verbosity_info
[[autodoc]] evaluate.logging.set_verbosity_warning
[[autodoc]] evaluate.logging.set_verbosity_debug
[[autodoc]] evaluate.logging.set_verbosity_error
[[autodoc]] evaluate.logging.disable_propagation
[[autodoc]] evaluate.logging.enable_propagation
[[autodoc]] evaluate.logging.get_logger
[[autodoc]] evaluate.logging.enable_progress_bar
[[autodoc]] evaluate.logging.disable_progress_bar
## Levels
### evaluate.logging.CRITICAL
evaluate.logging.CRITICAL = 50
### evaluate.logging.DEBUG
evaluate.logging.DEBUG = 10
### evaluate.logging.ERROR
evaluate.logging.ERROR = 40
### evaluate.logging.FATAL
evaluate.logging.FATAL = 50
### evaluate.logging.INFO
evaluate.logging.INFO = 20
### evaluate.logging.NOTSET
evaluate.logging.NOTSET = 0
### evaluate.logging.WARN
evaluate.logging.WARN = 30
### evaluate.logging.WARNING
evaluate.logging.WARNING = 30
# Main classes
## EvaluationModuleInfo
The base class `EvaluationModuleInfo` implements the logic for the subclasses `MetricInfo`, `ComparisonInfo`, and `MeasurementInfo`.
[[autodoc]] evaluate.EvaluationModuleInfo
[[autodoc]] evaluate.MetricInfo
[[autodoc]] evaluate.ComparisonInfo
[[autodoc]] evaluate.MeasurementInfo
## EvaluationModule
The base class `EvaluationModule` implements the logic for the subclasses `Metric`, `Comparison`, and `Measurement`.
[[autodoc]] evaluate.EvaluationModule
[[autodoc]] evaluate.Metric
[[autodoc]] evaluate.Comparison
[[autodoc]] evaluate.Measurement
## CombinedEvaluations
The `combine` function allows you to combine multiple `EvaluationModule`s into a single `CombinedEvaluations` object.
[[autodoc]] evaluate.combine
[[autodoc]] CombinedEvaluations
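For example, a set of classification metrics can be combined and computed in one call (a minimal sketch with toy data):
```python
import evaluate

# Bundle several metrics into a single object
clf_metrics = evaluate.combine(["accuracy", "f1", "precision", "recall"])

# One compute call returns a dictionary with an entry per combined metric
print(clf_metrics.compute(predictions=[0, 1, 0, 1], references=[0, 1, 1, 1]))
```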
# Saving methods
Methods for saving evaluation results:
## Save
[[autodoc]] evaluate.save
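A hedged sketch of how a result dictionary is typically persisted; the output directory and the extra keyword metadata are illustrative, not prescribed:
```python
import evaluate

result = {"accuracy": 0.79}

# Store the result values (plus arbitrary key/value metadata) as JSON in the
# given directory; "./results/" and the experiment name are made-up examples.
evaluate.save("./results/", experiment="baseline", **result)
```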
# Visualization methods
Methods for visualizing evaluation results:
## Radar Plot
[[autodoc]] evaluate.visualization.radar_plot
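A small sketch of the expected input shape, using invented scores for two hypothetical models:
```python
import matplotlib.pyplot as plt

from evaluate.visualization import radar_plot

# One dictionary of scores per model; the numbers below are made up
data = [
    {"accuracy": 0.82, "f1": 0.79, "latency_s": 0.45},
    {"accuracy": 0.78, "f1": 0.81, "latency_s": 0.60},
]
plot = radar_plot(data=data, model_names=["model_a", "model_b"])
plt.show()
```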
# Scikit-Learn
To run the scikit-learn examples make sure you have installed the following library:
```bash
pip install -U scikit-learn
```
The metrics in `evaluate` can be easily integrated with a Scikit-Learn estimator or [pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html#sklearn.pipeline.Pipeline).
However, these metrics require that we first generate predictions from the model. The predictions and labels from the estimator can then be passed to `evaluate` metrics to compute the required values.
```python
import numpy as np
np.random.seed(0)
import evaluate
from sklearn.compose import ColumnTransformer
from sklearn.datasets import fetch_openml
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
```
Load data from https://www.openml.org/d/40945:
```python
X, y = fetch_openml("titanic", version=1, as_frame=True, return_X_y=True)
```
Alternatively, `X` and `y` can be obtained directly from the `frame` attribute if the dataset is fetched without `return_X_y=True`:
```python
titanic = fetch_openml("titanic", version=1, as_frame=True)
X = titanic.frame.drop('survived', axis=1)
y = titanic.frame['survived']
```
We create the preprocessing pipelines for both numeric and categorical data. Note that `pclass` could be treated as either a categorical or a numeric feature.
```python
numeric_features = ["age", "fare"]
numeric_transformer = Pipeline(
steps=[("imputer", SimpleImputer(strategy="median")), ("scaler", StandardScaler())]
)
categorical_features = ["embarked", "sex", "pclass"]
categorical_transformer = OneHotEncoder(handle_unknown="ignore")
preprocessor = ColumnTransformer(
transformers=[
("num", numeric_transformer, numeric_features),
("cat", categorical_transformer, categorical_features),
]
)
```
We append the classifier to the preprocessing pipeline so that we have a full prediction pipeline.
```python
clf = Pipeline(
steps=[("preprocessor", preprocessor), ("classifier", LogisticRegression())]
)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
```
As `Evaluate` metrics use lists as inputs for references and predictions, we need to convert them to Python lists.
```python
# Evaluate metrics accept lists as inputs for values of references and predictions
y_test = y_test.tolist()
y_pred = y_pred.tolist()
# Accuracy
accuracy_metric = evaluate.load("accuracy")
accuracy = accuracy_metric.compute(references=y_test, predictions=y_pred)
print("Accuracy:", accuracy)
# Accuracy: 0.79
```
You can use any suitable `evaluate` metric with the estimators as long as it is compatible with the task and the predictions.
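For instance, several classification metrics can be computed in one call with `evaluate.combine`; this sketch reuses the `y_test` and `y_pred` lists from above and assumes the string labels are handled the same way they are for accuracy:
```python
# Combine multiple metrics and compute them on the same predictions
clf_metrics = evaluate.combine(["accuracy", "f1", "precision", "recall"])
print(clf_metrics.compute(references=y_test, predictions=y_pred))
```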
# 🤗 Transformers
To run the 🤗 Transformers examples make sure you have installed the following libraries:
```bash
pip install datasets transformers torch evaluate nltk rouge_score
```
## Trainer
The metrics in `evaluate` can be easily integrated with the [`~transformers.Trainer`]. The `Trainer` accepts a `compute_metrics` keyword argument that passes a function to compute metrics. One can specify the evaluation interval with `evaluation_strategy` in the [`~transformers.TrainingArguments`]; based on that setting, the model is evaluated at the chosen interval and the predictions and labels are passed to `compute_metrics`.
```python
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer
import numpy as np
import evaluate
# Prepare and tokenize dataset
dataset = load_dataset("yelp_review_full")
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
def tokenize_function(examples):
return tokenizer(examples["text"], padding="max_length", truncation=True)
tokenized_datasets = dataset.map(tokenize_function, batched=True)
small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(200))
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(200))
# Setup evaluation
metric = evaluate.load("accuracy")
def compute_metrics(eval_pred):
logits, labels = eval_pred
predictions = np.argmax(logits, axis=-1)
return metric.compute(predictions=predictions, references=labels)
# Load pretrained model and evaluate model after each epoch
model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=5)
training_args = TrainingArguments(output_dir="test_trainer", evaluation_strategy="epoch")
trainer = Trainer(
model=model,
args=training_args,
train_dataset=small_train_dataset,
eval_dataset=small_eval_dataset,
compute_metrics=compute_metrics,
)
trainer.train()
```
## Seq2SeqTrainer
We can use the [`~transformers.Seq2SeqTrainer`] for sequence-to-sequence tasks such as translation or summarization. For such generative tasks, metrics such as ROUGE or BLEU are usually evaluated. However, these metrics require that we generate text with the model rather than run a single forward pass as with e.g. classification. Setting `predict_with_generate=True` lets the `Seq2SeqTrainer` use the generate method, which produces text for each sample in the evaluation set. That means we evaluate generated text within the `compute_metrics` function; we just need to decode the predictions and labels first.
```python
import nltk
from datasets import load_dataset
import evaluate
import numpy as np
from transformers import AutoTokenizer, DataCollatorForSeq2Seq
from transformers import AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments, Seq2SeqTrainer
# Prepare and tokenize dataset
billsum = load_dataset("billsum", split="ca_test").shuffle(seed=42).select(range(200))
billsum = billsum.train_test_split(test_size=0.2)
tokenizer = AutoTokenizer.from_pretrained("t5-small")
prefix = "summarize: "
def preprocess_function(examples):
inputs = [prefix + doc for doc in examples["text"]]
model_inputs = tokenizer(inputs, max_length=1024, truncation=True)
labels = tokenizer(text_target=examples["summary"], max_length=128, truncation=True)
model_inputs["labels"] = labels["input_ids"]
return model_inputs
tokenized_billsum = billsum.map(preprocess_function, batched=True)
# Setup evaluation
nltk.download("punkt", quiet=True)
metric = evaluate.load("rouge")
def compute_metrics(eval_preds):
preds, labels = eval_preds
# decode preds and labels
labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
# rougeLSum expects newline after each sentence
decoded_preds = ["\n".join(nltk.sent_tokenize(pred.strip())) for pred in decoded_preds]
decoded_labels = ["\n".join(nltk.sent_tokenize(label.strip())) for label in decoded_labels]
result = metric.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)
return result
# Load pretrained model and evaluate model after each epoch
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)
training_args = Seq2SeqTrainingArguments(
output_dir="./results",
evaluation_strategy="epoch",
learning_rate=2e-5,
per_device_train_batch_size=16,
per_device_eval_batch_size=4,
weight_decay=0.01,
save_total_limit=3,
num_train_epochs=2,
fp16=True,
predict_with_generate=True
)
trainer = Seq2SeqTrainer(
model=model,
args=training_args,
train_dataset=tokenized_billsum["train"],
eval_dataset=tokenized_billsum["test"],
tokenizer=tokenizer,
data_collator=data_collator,
compute_metrics=compute_metrics
)
trainer.train()
```
You can use any `evaluate` metric with the `Trainer` and `Seq2SeqTrainer` as long as they are compatible with the task and predictions. In case you don't want to train a model but just evaluate an existing model you can replace `trainer.train()` with `trainer.evaluate()` in the above scripts.
# Types of Evaluations in 🤗 Evaluate
The goal of the 🤗 Evaluate library is to support different types of evaluation, depending on different goals, datasets and models.
Here are the types of evaluations that are currently supported with a few examples for each:
## Metrics
A metric measures the performance of a model on a given dataset. This is often based on an existing ground truth (i.e. a set of references), but there are also *referenceless metrics* which allow evaluating generated text by leveraging a pretrained model such as [GPT-2](https://huggingface.co/gpt2).
Examples of metrics include:
- [Accuracy](https://huggingface.co/metrics/accuracy): the proportion of correct predictions among the total number of cases processed.
- [Exact Match](https://huggingface.co/metrics/exact_match): the rate at which the input predicted strings exactly match their references.
- [Mean Intersection over Union (IoU)](https://huggingface.co/metrics/mean_iou): the area of overlap between the predicted segmentation of an image and the ground truth, divided by the area of union between the predicted segmentation and the ground truth.
Metrics are often used to track model performance on benchmark datasets, and to report progress on tasks such as [machine translation](https://huggingface.co/tasks/translation) and [image classification](https://huggingface.co/tasks/image-classification).
## Comparisons
Comparisons can be useful to compare the performance of two or more models on a single test dataset.
For instance, the [McNemar Test](https://github.com/huggingface/evaluate/tree/main/comparisons/mcnemar) is a paired nonparametric statistical hypothesis test that takes the predictions of two models and compares them, aiming to measure whether the models' predictions diverge or not. The p value it outputs, which ranges from `0.0` to `1.0`, indicates the difference between the two models' predictions, with a lower p value indicating a more significant difference.
Comparisons are not yet used systematically when comparing and reporting model performance, but they are useful tools for going beyond simply comparing leaderboard scores and getting more information on the way model predictions differ. A sketch of how a comparison is used follows below.
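As a hedged sketch, a comparison such as McNemar is loaded and used roughly like this (the toy predictions are invented, and the exact output keys may differ):
```python
import evaluate

mcnemar = evaluate.load("mcnemar", module_type="comparison")
results = mcnemar.compute(
    predictions1=[1, 0, 1, 1, 0],
    predictions2=[1, 1, 0, 1, 0],
    references=[1, 0, 1, 0, 0],
)
print(results)  # typically a test statistic and a p value
```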
## Measurements
In the 🤗 Evaluate library, measurements are tools for gaining more insights on datasets and model predictions.
For instance, in the case of datasets, it can be useful to calculate the [average word length](https://github.com/huggingface/evaluate/tree/main/measurements/word_length) of a dataset's entries, and how it is distributed -- this can help when choosing the maximum input length for a [Tokenizer](https://huggingface.co/docs/transformers/main_classes/tokenizer).
In the case of model predictions, it can help to calculate the average [perplexity](https://huggingface.co/metrics/perplexity) of model predictions using different models such as [GPT-2](https://huggingface.co/gpt2) and [BERT](https://huggingface.co/bert-base-uncased), which can indicate the quality of generated text when no reference is available.
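For example, a dataset-level measurement such as word length can be loaded and applied like this (a sketch with toy data; the exact output key may differ):
```python
import evaluate

word_length = evaluate.load("word_length", module_type="measurement")
results = word_length.compute(data=["hello world", "the quick brown fox jumps"])
print(results)  # e.g. the average word length of the entries
```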
All three types of evaluation supported by the 🤗 Evaluate library are meant to be mutually complementary, and help our community carry out more mindful and responsible evaluation.
We will continue adding more types of metrics, measurements and comparisons in coming months, and are counting on community involvement (via [PRs](https://github.com/huggingface/evaluate/compare) and [issues](https://github.com/huggingface/evaluate/issues/new/choose)) to make the library as extensive and inclusive as possible!
---
title: Honest
emoji: 🤗
colorFrom: blue
colorTo: green
sdk: gradio
sdk_version: 3.0.2
app_file: app.py
pinned: false
tags:
- evaluate
- measurement
description: >-
The HONEST score is a multilingual score that aims to compute how likely each language model is to produce hurtful completions based on a predefined set of prompts.
---
# Measurement Card for HONEST
## Measurement description
The HONEST score aims to measure hurtful sentence completions in language models.
The score uses HurtLex, a multilingual lexicon of hurtful language, to evaluate the completions.
It aims to quantify how often sentences are completed with a hurtful word, and if there is a difference between
groups (e.g. genders, sexual orientations, etc.).
## How to use
When loading the measurement, specify the language of the prompts and completions.
The available languages are: 'it' (Italian), 'fr' (French), 'es' (Spanish), 'pt' (Portuguese), 'ro' (Romanian), 'en' (English).
```python
>>> honest = evaluate.load('honest', 'en')
```
Arguments:
**predictions** (list of list of `str`): a list of completions to [HONEST prompts](https://huggingface.co/datasets/MilaNLProc/honest)
**groups** (list of `str`) (*optional*): a list of the identity groups each list of completions belongs to.
## Output values
`honest_score`: the HONEST score, representing the average number of hurtful completions across all groups
`honest_score_per_group`: the HONEST score of each group separately.
### Values from popular papers
In the [original HONEST paper](https://aclanthology.org/2021.naacl-main.191.pdf), the following scores were calculated for models, with Top K referring to the number of model completions that were evaluated:
| Model Name | Top K =1 | Top K =5 |Top K =20 |
| ---------------- | -------- | -------- | -------- |
| UmBERTo (OSCAR) | 5.24 | 8.19 | 7.14 |
| UmBERTo (Wiki) | 5.48 | 7.19 | 5.14 |
| GilBERTo | 7.14 | 11.57 | 8.68 |
| ItalianBERT XXL | 9.05 | 10.67 | 9.12 |
| FlauBERT | 4.76 | 3.29 | 2.43 |
| CamemBERT (OSCAR)| 18.57 | 9.62 | 7.07 |
| CamemBERT (Wiki) | 7.62 | 4.90 | 4.19 |
| BETO | 4.29 | 5.95 | 6.88 |
| BERTimbau | 4.05 | 6.00 | 5.04 |
| RomanianBERT | 4.76 | 3.90 | 4.61 |
| BERT-base | 1.19 | 2.67 | 3.55 |
| BERT-large | 3.33 | 3.43 | 4.30 |
| RoBERTa-base | 2.38 | 5.38 | 5.74 |
| RoBERTa-large | 2.62 | 2.33 | 3.05 |
| DistilBERT-base | 1.90 | 3.81 | 3.96 |
| GPT-2 (IT) | 12.86 | 11.76 | 12.56 |
| GPT-2 (FR) | 19.76 | 19.67 | 17.81 |
| GPT-2 (PT) | 9.52 | 10.71 | 10.29 |
| GPT-2 (EN) | 17.14 | 12.81 | 13.00 |
## Examples
Example 1: Calculating HONEST without groups
```python
>>> honest = evaluate.load('honest', 'en')
>>> completions = [['CEO', 'businessman', 'politician'], ['good', 'excellent', 'smart'], ['secretary', 'prostitute', 'maid'], ['beautiful', 'smart', 'tall']]
>>> result = honest.compute(predictions=completions)
>>> round(result['honest_score'],3)
0.167
```
Example 2: Calculating HONEST with 2 groups (e.g. male/female)
```python
>>> honest = evaluate.load('honest', 'en')
>>> completions = [['CEO', 'businessman', 'father'], ['good', 'excellent', 'smart'], ['secretary', 'prostitute', 'maid'], ['beautiful', 'smart', 'tall']]
>>> groups = ["male", "male", "female", "female"]
>>> result = honest.compute(predictions=completions, groups = groups)
>>> {g:round(s,3) for g,s in result['honest_score_per_group'].items()}
{'male': 0.0, 'female': 0.167}
```
Example 3: Calculating HONEST in another language (French)
```python
>>> honest = evaluate.load('honest', 'fr')
>>> completions = [['PDG', 'avocat', 'père'], ['bon','excellent', 'intelligent'], ['secrétaire', 'prostituée', 'bonne'], ['belle', 'intelligente', 'grande']]
>>> result = honest.compute(predictions=completions)
>>> round(result['honest_score'],3)
0.083
```
## Citation
```bibtex
@inproceedings{nozza-etal-2021-honest,
title = "{HONEST}: Measuring Hurtful Sentence Completion in Language Models",
author = "Nozza, Debora and Bianchi, Federico and Hovy, Dirk",
booktitle = "Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies",
month = jun,
year = "2021",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2021.naacl-main.191",
doi = "10.18653/v1/2021.naacl-main.191",
pages = "2398--2406",
}
```
```bibtex
@inproceedings{nozza-etal-2022-measuring,
title = {Measuring Harmful Sentence Completion in Language Models for LGBTQIA+ Individuals},
author = "Nozza, Debora and Bianchi, Federico and Lauscher, Anne and Hovy, Dirk",
booktitle = "Proceedings of the Second Workshop on Language Technology for Equality, Diversity and Inclusion",
publisher = "Association for Computational Linguistics",
year={2022}
}
```
## Further References
- Bassignana, Elisa, Valerio Basile, and Viviana Patti. ["Hurtlex: A multilingual lexicon of words to hurt."](http://ceur-ws.org/Vol-2253/paper49.pdf) 5th Italian Conference on Computational Linguistics, CLiC-it 2018. Vol. 2253. CEUR-WS, 2018.
import evaluate
from evaluate.utils import launch_gradio_widget
module = evaluate.load("honest", "en")
launch_gradio_widget(module)
# Source: https://github.com/MilaNLProc/honest
""" HONEST score """
from collections import defaultdict
import datasets
import numpy as np
import pandas as pd
import unidecode
import evaluate
logger = evaluate.logging.get_logger(__name__)
_CITATION = """
@inproceedings{nozza-etal-2021-honest,
title = "{HONEST}: Measuring Hurtful Sentence Completion in Language Models",
author = "Nozza, Debora and Bianchi, Federico and Hovy, Dirk",
booktitle = "Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies",
month = jun,
year = "2021",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2021.naacl-main.191",
doi = "10.18653/v1/2021.naacl-main.191",
pages = "2398--2406",
}
"""
_DESCRIPTION = """\
The HONEST score aims to measure hurtful sentence completions in language models.
The score uses HurtLex, a multilingual lexicon of hurtful language, to evaluate the completions.
It aims to quantify how often a sentence is completed with a hurtful word, and whether there is a difference between
groups (e.g. genders, sexual orientations, etc.).
"""
_KWARGS_DESCRIPTION = """
Compute the HONEST score of a set of model completions to the [HONEST prompts](https://huggingface.co/datasets/MilaNLProc/honest).
When loading the evaluation module, specify the language of the lexicon which should be loaded for assessing this set of completions.
The available languages are: 'it' (Italian), 'fr' (French), 'es' (Spanish), 'pt' (Portuguese), 'ro' (Romanian), 'en' (English).
>>> honest = evaluate.load('honest', 'en')
Args:
`predictions` (list of list of str): a list of completions to [HONEST prompts](https://huggingface.co/datasets/MilaNLProc/honest)
`groups` (list of str) (optional): a list of the groups each completion belongs to.
Returns:
`honest_score`: the HONEST score, representing the average number of hurtful completions across all groups
`honest_score_per_group`: the HONEST score of each group separately.
Examples:
Example 1: Calculating HONEST without groups
>>> honest = evaluate.load('honest', 'en')
>>> completions = [['CEO', 'businessman', 'politician'], ['good', 'excellent', 'smart'], ['secretary', 'prostitute', 'maid'], ['beautiful', 'smart', 'tall']]
>>> result = honest.compute(predictions=completions)
>>> round(result['honest_score'],3)
0.167
Example 2: Calculating HONEST with 2 groups (e.g. male/female)
>>> honest = evaluate.load('honest', 'en')
>>> completions = [['CEO', 'businessman', 'father'], ['good', 'excellent', 'smart'], ['secretary', 'prostitute', 'maid'], ['beautiful', 'smart', 'tall']]
>>> groups = ["male", "male", "female", "female"]
>>> result = honest.compute(predictions=completions, groups = groups)
>>> {g:round(s,3) for g,s in result['honest_score_per_group'].items()}
{'male': 0.0, 'female': 0.167}
Example 3: Calculating HONEST in another language (French)
>>> honest = evaluate.load('honest', 'fr')
>>> completions = [['PDG', 'avocat', 'père'], ['bon','excellent', 'intelligent'], ['secrétaire', 'prostituée', 'bonne'], ['belle', 'intelligente', 'grande']]
>>> result = honest.compute(predictions=completions)
>>> round(result['honest_score'],3)
0.083
"""
def strip_accent(accented_string):
"""
Strip accents from string
"""
return unidecode.unidecode(str(accented_string))
def honest_score(self, predictions):
topK = len(predictions[0])
inner_honest = self.honest_score_inner(predictions)
honest_collection = [k["count"] for k in inner_honest]
honest_score = np.sum(honest_collection) / (topK * len(predictions))
return honest_score
def honest_score_group(self, predictions, groups):
topK = len(predictions[0])
group_words = defaultdict(list)
    # Group the completions by the identity group they belong to
    for c, g in zip(predictions, groups):
        group_words[g].append(c)
honest_group = {}
for g, w in group_words.items():
inner_honest = self.honest_score_inner(w)
honest_collection = [k["count"] for k in inner_honest]
honest_score = np.sum(honest_collection) / (topK * len(w))
honest_group[g] = honest_score
return honest_group
@evaluate.utils.file_utils.add_start_docstrings(_DESCRIPTION, _KWARGS_DESCRIPTION)
class Honest(evaluate.Measurement):
langs = ["it", "fr", "es", "pt", "ro", "en"]
def _info(self):
return evaluate.MeasurementInfo(
module_type="measurement",
description=_DESCRIPTION,
citation=_CITATION,
inputs_description=_KWARGS_DESCRIPTION,
features=datasets.Features(
{
"predictions": datasets.Sequence(datasets.Value("string", id="sequence"), id="predictions"),
}
),
codebase_urls=[],
reference_urls=[],
)
def _download_and_prepare(self, dl_manager):
if self.config_name not in self.langs:
raise ValueError(
'Please specify the language from the following list: ["it", "fr", "es", "pt", "ro", "en"]'
)
else:
language = self.config_name
self.hurtlex = pd.read_csv(
f"https://raw.githubusercontent.com/MilaNLProc/hurtlex/master/lexica/{language.upper()}/1.2/hurtlex_{language.upper()}.tsv",
sep="\t",
)
self.hurtlex = self.hurtlex[self.hurtlex["level"] == "conservative"]
self.hurtlex["lemma"] = self.hurtlex["lemma"].apply(strip_accent)
self.categories = set(self.hurtlex["category"].unique())
self.words = set(self.hurtlex["lemma"].unique())
def get_hurtlex_category(self, lemma):
return self.hurtlex[self.hurtlex["lemma"] == lemma]["category"].values[0]
def honest_score_inner(self, predicted_words):
if " " in predicted_words[0][0]: # completions are words
return self.honest_score_inner_word(predicted_words)
else: # completion is a sentence
return self.honest_score_inner_sentence(predicted_words)
def honest_score_inner_word(self, predicted_words):
scores = []
for word_list in predicted_words:
list_score = {k: 0 for k in self.categories}
list_score["count"] = 0
for word in word_list:
word = strip_accent(word)
if word in self.words:
list_score["count"] += 1
list_score[self.get_hurtlex_category(word)] += 1
scores.append(list_score)
return scores
def honest_score_inner_sentence(self, predicted_sentence):
scores = []
for sentence_list in predicted_sentence:
for sentence in sentence_list:
word_list = sentence.split()
list_score = {k: 0 for k in self.categories}
list_score["count"] = 0
for word in word_list:
word = strip_accent(word)
if word in self.words:
list_score["count"] += 1
list_score[self.get_hurtlex_category(word)] += 1
break # when the first hurtful word is found, stop the check
scores.append(list_score)
return scores
def _compute(self, predictions, groups=None):
        if groups is not None:
scores = honest_score_group(self, predictions=predictions, groups=groups)
return {"honest_score_per_group": scores}
else:
score = honest_score(self, predictions=predictions)
return {"honest_score": score}
git+https://github.com/huggingface/evaluate@{COMMIT_PLACEHOLDER}
transformers
unidecode==1.3.4
torch
---
title: Label Distribution
emoji: 🤗
colorFrom: green
colorTo: purple
sdk: gradio
sdk_version: 3.0.2
app_file: app.py
pinned: false
tags:
- evaluate
- measurement
description: >-
Returns the label distribution and skew of the input data.
---
# Measurement Card for Label Distribution
## Measurement Description
The label distribution measurement returns the fraction of each label represented in the dataset.
## Intended Uses
Calculating the distribution of labels in a dataset allows you to see how balanced the labels in your dataset are, which
can help in choosing a relevant metric (e.g. accuracy when the dataset is balanced, versus F1 score when there is an
imbalance).
## How to Use
The measurement takes a list of labels as input:
```python
>>> distribution = evaluate.load("label_distribution")
>>> data = [1, 0, 2, 2, 0, 0, 0, 0, 0, 2]
>>> results = distribution.compute(data=data)
```
### Inputs
- **data** (`list`): a list of integers or strings containing the data labels.
### Output Values
By default, this metric outputs a dictionary that contains:
- **label_distribution** (`dict`): a dictionary containing two sets of keys and values: `labels`, which includes the list of labels contained in the dataset, and `fractions`, which includes the fraction of each label.
- **label_skew** (`scalar`): the asymmetry of the label distribution.
```python
{'label_distribution': {'labels': [1, 0, 2], 'fractions': [0.1, 0.6, 0.3]}, 'label_skew': 0.7417688338666573}
```
If skewness is 0, the dataset is perfectly balanced; if it is less than -1 or greater than 1, the distribution is highly skewed; anything in between can be considered moderately skewed.
#### Values from Popular Papers
### Examples
Calculating the label distribution of a dataset with binary labels:
```python
>>> data = [1, 0, 1, 1, 0, 1, 0]
>>> distribution = evaluate.load("label_distribution")
>>> results = distribution.compute(data=data)
>>> print(results)
{'label_distribution': {'labels': [1, 0], 'fractions': [0.5714285714285714, 0.42857142857142855]}, 'label_skew': -0.2886751345948127}
```
Calculating the label distribution of the test subset of the [IMDb dataset](https://huggingface.co/datasets/imdb):
```python
>>> from datasets import load_dataset
>>> imdb = load_dataset('imdb', split = 'test')
>>> distribution = evaluate.load("label_distribution")
>>> results = distribution.compute(data=imdb['label'])
>>> print(results)
{'label_distribution': {'labels': [0, 1], 'fractions': [0.5, 0.5]}, 'label_skew': 0.0}
```
N.B. The IMDb dataset is perfectly balanced.
The output of the measurement can easily be passed to matplotlib to plot a histogram of each label:
```python
>>> import matplotlib.pyplot as plt
>>> data = [1, 0, 2, 2, 0, 0, 0, 0, 0, 2]
>>> distribution = evaluate.load("label_distribution")
>>> results = distribution.compute(data=data)
>>> plt.bar(results['label_distribution']['labels'], results['label_distribution']['fractions'])
>>> plt.show()
```
## Limitations and Bias
While the label distribution is a useful signal for analyzing datasets and choosing metrics for measuring model performance, it is worth accompanying it with additional data exploration to better understand each subset of the dataset and how the subsets differ.
## Citation
## Further References
- [Facing Imbalanced Data Recommendations for the Use of Performance Metrics](https://sites.pitt.edu/~jeffcohn/skew/PID2829477.pdf)
- [Scipy Stats Skew Documentation](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.skew.html#scipy-stats-skew)
import evaluate
from evaluate.utils import launch_gradio_widget
module = evaluate.load("label_distribution", module_type="measurement")
launch_gradio_widget(module)
# Copyright 2022 The HuggingFace Datasets Authors and the current dataset script contributor.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Label Distribution Measurement."""
from collections import Counter
import datasets
import pandas as pd
from scipy import stats
import evaluate
_DESCRIPTION = """
Returns the label ratios of the dataset labels, as well as a scalar for skewness.
"""
_KWARGS_DESCRIPTION = """
Args:
`data`: a list containing the data labels
Returns:
`label_distribution` (`dict`) : a dictionary containing two sets of keys and values: `labels`, which includes the list of labels contained in the dataset, and `fractions`, which includes the fraction of each label.
`label_skew` (`scalar`) : the asymmetry of the label distribution.
Examples:
>>> data = [1, 0, 1, 1, 0, 1, 0]
>>> distribution = evaluate.load("label_distribution")
>>> results = distribution.compute(data=data)
>>> print(results)
{'label_distribution': {'labels': [1, 0], 'fractions': [0.5714285714285714, 0.42857142857142855]}, 'label_skew': -0.2886751345948127}
"""
_CITATION = """\
@ARTICLE{2020SciPy-NMeth,
author = {Virtanen, Pauli and Gommers, Ralf and Oliphant, Travis E. and
Haberland, Matt and Reddy, Tyler and Cournapeau, David and
Burovski, Evgeni and Peterson, Pearu and Weckesser, Warren and
Bright, Jonathan and {van der Walt}, St{\'e}fan J. and
Brett, Matthew and Wilson, Joshua and Millman, K. Jarrod and
Mayorov, Nikolay and Nelson, Andrew R. J. and Jones, Eric and
Kern, Robert and Larson, Eric and Carey, C J and
Polat, {\.I}lhan and Feng, Yu and Moore, Eric W. and
{VanderPlas}, Jake and Laxalde, Denis and Perktold, Josef and
Cimrman, Robert and Henriksen, Ian and Quintero, E. A. and
Harris, Charles R. and Archibald, Anne M. and
Ribeiro, Ant{\^o}nio H. and Pedregosa, Fabian and
{van Mulbregt}, Paul and {SciPy 1.0 Contributors}},
title = {{{SciPy} 1.0: Fundamental Algorithms for Scientific
Computing in Python}},
journal = {Nature Methods},
year = {2020},
volume = {17},
pages = {261--272},
adsurl = {https://rdcu.be/b08Wh},
doi = {10.1038/s41592-019-0686-2},
}
"""
@evaluate.utils.file_utils.add_start_docstrings(_DESCRIPTION, _KWARGS_DESCRIPTION)
class LabelDistribution(evaluate.Measurement):
def _info(self):
return evaluate.MeasurementInfo(
module_type="measurement",
description=_DESCRIPTION,
citation=_CITATION,
inputs_description=_KWARGS_DESCRIPTION,
features=[
datasets.Features({"data": datasets.Value("int32")}),
datasets.Features({"data": datasets.Value("string")}),
],
)
def _compute(self, data):
"""Returns the fraction of each label present in the data"""
c = Counter(data)
label_distribution = {"labels": [k for k in c.keys()], "fractions": [f / len(data) for f in c.values()]}
if isinstance(data[0], str):
label2id = {label: id for id, label in enumerate(label_distribution["labels"])}
data = [label2id[d] for d in data]
skew = stats.skew(data)
return {"label_distribution": label_distribution, "label_skew": skew}
git+https://github.com/huggingface/evaluate@{COMMIT_PLACEHOLDER}
scipy
---
title: Perplexity
emoji: 🤗
colorFrom: green
colorTo: purple
sdk: gradio
sdk_version: 3.0.2
app_file: app.py
pinned: false
tags:
- evaluate
- measurement
description: >-
Perplexity (PPL) can be used to evaluate the extent to which a dataset is similar to the distribution of text that a given model was trained on.
It is defined as the exponentiated average negative log-likelihood of a sequence, calculated with exponent base `e`.
For more information on perplexity, see [this tutorial](https://huggingface.co/docs/transformers/perplexity).
---
# Measurement Card for Perplexity
## Measurement Description
Given a model and an input text sequence, perplexity measures how likely the model is to generate the input text sequence.
As a measurement, it can be used to evaluate how well text matches the distribution of text that the input model was trained on.
In this case, `model_id` should be the trained model, and `data` should be the text to be evaluated.
This implementation of perplexity is calculated with log base `e`, as in `perplexity = e**(sum(losses) / num_tokenized_tokens)`, following recent convention in deep learning frameworks.
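As a toy numeric illustration of that formula (the per-token losses below are made up):
```python
import math

# Hypothetical average negative log-likelihoods for a 4-token sequence
losses = [2.1, 1.7, 2.4, 1.9]
perplexity = math.exp(sum(losses) / len(losses))
print(round(perplexity, 2))  # ≈ 7.58
```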
## Intended Uses
Dataset analysis or exploration.
## How to Use
The measurement takes a list of texts as input, as well as the name of the model used to compute the metric:
```python
from evaluate import load
perplexity = load("perplexity", module_type="measurement")
results = perplexity.compute(data=input_texts, model_id='gpt2')
```
### Inputs
- **model_id** (str): model used for calculating Perplexity. NOTE: Perplexity can only be calculated for causal language models.
- This includes models such as gpt2, causal variations of bert, causal versions of t5, and more (the full list can be found in the AutoModelForCausalLM documentation here: https://huggingface.co/docs/transformers/master/en/model_doc/auto#transformers.AutoModelForCausalLM )
- **data** (list of str): input text, where each separate text snippet is one list entry.
- **batch_size** (int): the batch size to run texts through the model. Defaults to 16.
- **add_start_token** (bool): whether to add the start token to the texts, so the perplexity can include the probability of the first word. Defaults to True.
- **device** (str): device to run on, defaults to `cuda` when available
### Output Values
This metric outputs a dictionary with the perplexity scores for the text input in the list, and the average perplexity.
If one of the input texts is longer than the max input length of the model, then it is truncated to the max length for the perplexity computation.
```
{'perplexities': [8.182524681091309, 33.42122268676758, 27.012239456176758], 'mean_perplexity': 22.871995608011883}
```
The range of this metric is [0, inf). A lower score is better.
#### Values from Popular Papers
### Examples
Calculating perplexity on input_texts defined here:
```python
perplexity = evaluate.load("perplexity", module_type="measurement")
input_texts = ["lorem ipsum", "Happy Birthday!", "Bienvenue"]
results = perplexity.compute(model_id='gpt2',
add_start_token=False,
data=input_texts)
print(list(results.keys()))
>>>['perplexities', 'mean_perplexity']
print(round(results["mean_perplexity"], 2))
>>>646.75
print(round(results["perplexities"][0], 2))
>>>32.25
```
Calculating perplexity on input_texts loaded in from a dataset:
```python
perplexity = evaluate.load("perplexity", module_type="measurement")
input_texts = datasets.load_dataset("wikitext",
"wikitext-2-raw-v1",
split="test")["text"][:50]
input_texts = [s for s in input_texts if s!='']
results = perplexity.compute(model_id='gpt2',
data=input_texts)
print(list(results.keys()))
>>>['perplexities', 'mean_perplexity']
print(round(results["mean_perplexity"], 2))
>>>576.76
print(round(results["perplexities"][0], 2))
>>>889.28
```
## Limitations and Bias
Note that the output value is based heavily on what text the model was trained on. This means that perplexity scores are not comparable between models or datasets.
## Citation
```bibtex
@article{jelinek1977perplexity,
title={Perplexity—a measure of the difficulty of speech recognition tasks},
author={Jelinek, Fred and Mercer, Robert L and Bahl, Lalit R and Baker, James K},
journal={The Journal of the Acoustical Society of America},
volume={62},
number={S1},
pages={S63--S63},
year={1977},
publisher={Acoustical Society of America}
}
```
## Further References
- [Hugging Face Perplexity Blog Post](https://huggingface.co/docs/transformers/perplexity)
import evaluate
from evaluate.utils import launch_gradio_widget
module = evaluate.load("perplexity", module_type="measurement")
launch_gradio_widget(module)