🤗 Evaluate strives to be transparent and explicit about how it works, but this can be quite verbose at times. We have included a series of logging methods which allow you to easily adjust the level of verbosity of the entire library. Currently the default verbosity of the library is set to `WARNING`.
To change the level of verbosity, use one of the direct setters. For instance, here is how to change the verbosity to the `INFO` level:
```py
import evaluate
evaluate.logging.set_verbosity_info()
```
You can also use the environment variable `EVALUATE_VERBOSITY` to override the default verbosity, and set it to one of the following: `debug`, `info`, `warning`, `error`, `critical`:
```bash
EVALUATE_VERBOSITY=error ./myprogram.py
```
All the methods of this logging module are documented below. The main ones are:
- [`logging.get_verbosity`] to get the current level of verbosity in the logger
- [`logging.set_verbosity`] to set the verbosity to the level of your choice
In order from the least to the most verbose (with their corresponding `int` values):
1. `logging.CRITICAL` or `logging.FATAL` (int value, 50): only report the most critical errors.
2. `logging.ERROR` (int value, 40): only report errors.
3. `logging.WARNING` or `logging.WARN` (int value, 30): only reports errors and warnings. This is the default level used by the library.
4. `logging.INFO` (int value, 20): reports errors, warnings, and basic information.
5. `logging.DEBUG` (int value, 10): reports all information.
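For example, to inspect and change the verbosity programmatically with the methods listed above:
```py
import evaluate

# Returns the current level as an int (e.g. 30 for WARNING).
print(evaluate.logging.get_verbosity())

# Switch to DEBUG to see every message the library emits.
evaluate.logging.set_verbosity(evaluate.logging.DEBUG)
```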
By default, `tqdm` progress bars are displayed while evaluation modules are downloaded and data is processed. [`logging.disable_progress_bar`] and [`logging.enable_progress_bar`] can be used to disable or re-enable this behavior.
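For instance, to hide the progress bars and turn them back on later:
```py
import evaluate

# Suppress tqdm progress bars during downloads and processing.
evaluate.logging.disable_progress_bar()

# Re-enable them when needed.
evaluate.logging.enable_progress_bar()
```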
To run the scikit-learn examples make sure you have installed the following library:
```bash
pip install -U scikit-learn
```
The metrics in `evaluate` can be easily integrated with a Scikit-Learn estimator or [pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html#sklearn.pipeline.Pipeline).
However, these metrics require that we generate the predictions from the model. The predictions and labels from the estimators can be passed to `evaluate` metrics to compute the required values.
```python
import numpy as np
np.random.seed(0)
import evaluate
from sklearn.compose import ColumnTransformer
from sklearn.datasets import fetch_openml
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
```
Load data from https://www.openml.org/d/40945:
```python
X, y = fetch_openml("titanic", version=1, as_frame=True, return_X_y=True)
```
Alternatively, `X` and `y` can be obtained directly from the `frame` attribute of the fetched dataset:
```python
titanic = fetch_openml("titanic", version=1, as_frame=True)
X = titanic.frame.drop('survived', axis=1)
y = titanic.frame['survived']
```
We create the preprocessing pipelines for both numeric and categorical data, as shown in the sketch below. Note that `pclass` could be treated as either a categorical or a numeric feature.
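The following is a minimal sketch of such a pipeline, building on the imports and data loaded above; the selected feature columns and the choice of the `accuracy` metric are illustrative, not prescribed by the library:
```python
# Preprocess numeric and categorical columns separately (illustrative column choices).
numeric_features = ["age", "fare"]
numeric_transformer = Pipeline(
    steps=[("imputer", SimpleImputer(strategy="median")), ("scaler", StandardScaler())]
)

categorical_features = ["embarked", "sex", "pclass"]
categorical_transformer = OneHotEncoder(handle_unknown="ignore")

preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features),
    ]
)

# Append a classifier to the preprocessing pipeline to get a full prediction pipeline.
clf = Pipeline(steps=[("preprocessor", preprocessor), ("classifier", LogisticRegression())])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Pass the estimator's predictions and the reference labels to an `evaluate` metric.
accuracy = evaluate.load("accuracy")
print(accuracy.compute(predictions=list(y_pred.astype(int)), references=list(y_test.astype(int))))
```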
You can use any `evaluate` metric with the `Trainer` and `Seq2SeqTrainer` as long as they are compatible with the task and predictions. In case you don't want to train a model but just evaluate an existing model, you can replace `trainer.train()` with `trainer.evaluate()` in the above scripts.
The goal of the 🤗 Evaluate library is to support different types of evaluation, depending on different goals, datasets and models.
Here are the types of evaluations that are currently supported with a few examples for each:
## Metrics
A metric measures the performance of a model on a given dataset. This is often based on an existing ground truth (i.e. a set of references), but there are also *referenceless metrics* which allow evaluating generated text by leveraging a pretrained model such as [GPT-2](https://huggingface.co/gpt2).
Examples of metrics include:
- [Accuracy](https://huggingface.co/metrics/accuracy) : the proportion of correct predictions among the total number of cases processed.
- [Exact Match](https://huggingface.co/metrics/exact_match): the rate at which the input predicted strings exactly match their references.
- [Mean Intersection over Union (IoU)](https://huggingface.co/metrics/mean_iou): the area of overlap between the predicted segmentation of an image and the ground truth divided by the area of union between the predicted segmentation and the ground truth.
Metrics are often used to track model performance on benchmark datasets, and to report progress on tasks such as [machine translation](https://huggingface.co/tasks/translation) and [image classification](https://huggingface.co/tasks/image-classification).
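For example, a metric such as accuracy can be loaded by name and computed directly from lists of predictions and references:
```python
import evaluate

accuracy = evaluate.load("accuracy")
# Three of the four predictions match their references, so the score is 0.75.
results = accuracy.compute(references=[0, 1, 1, 0], predictions=[0, 1, 0, 0])
print(results)  # {'accuracy': 0.75}
```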
## Comparisons
Comparisons can be useful to compare the performance of two or more models on a single test dataset.
For instance, the [McNemar Test](https://github.com/huggingface/evaluate/tree/main/comparisons/mcnemar) is a paired nonparametric statistical hypothesis test that takes the predictions of two models and compares them, aiming to measure whether the models' predictions diverge or not. The p value it outputs, which ranges from `0.0` to `1.0`, indicates the difference between the two models' predictions, with a lower p value indicating a more significant difference.
Comparisons have yet to be used systematically when comparing and reporting model performance; however, they are useful tools for going beyond simply comparing leaderboard scores and for getting more information on the way model predictions differ.
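As an illustrative sketch (the exact output keys come from the comparison's own documentation and may differ), a comparison can be loaded and run on two sets of predictions against shared references:
```python
import evaluate

mcnemar = evaluate.load("mcnemar", module_type="comparison")
results = mcnemar.compute(
    predictions1=[0, 1, 1, 0],
    predictions2=[1, 1, 1, 0],
    references=[0, 1, 0, 1],
)
print(results)  # a test statistic and a p value
```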
## Measurements
In the 🤗 Evaluate library, measurements are tools for gaining more insights on datasets and model predictions.
For instance, in the case of datasets, it can be useful to calculate the [average word length](https://github.com/huggingface/evaluate/tree/main/measurements/word_length) of a dataset's entries, and how it is distributed -- this can help when choosing the maximum input length for [Tokenizer](https://huggingface.co/docs/transformers/main_classes/tokenizer).
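As a minimal sketch (the output key name is taken from the measurement's documentation and may vary), the word length measurement can be applied to a list of texts:
```python
import evaluate

word_length = evaluate.load("word_length", module_type="measurement")
results = word_length.compute(data=["hello world", "the quick brown fox jumps over the lazy dog"])
print(results)  # e.g. {'average_word_length': ...}
```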
In the case of model predictions, it can help to calculate the average [perplexity](https://huggingface.co/metrics/perplexity) of model predictions using different models such as [GPT-2](https://huggingface.co/gpt2) and [BERT](https://huggingface.co/bert-base-uncased), which can indicate the quality of generated text when no reference is available.
All three types of evaluation supported by the 🤗 Evaluate library are meant to be mutually complementary, and help our community carry out more mindful and responsible evaluation.
We will continue adding more types of metrics, measurements and comparisons in coming months, and are counting on community involvement (via [PRs](https://github.com/huggingface/evaluate/compare) and [issues](https://github.com/huggingface/evaluate/issues/new/choose)) to make the library as extensive and inclusive as possible!
The HONEST score is a multilingual score that aims to compute how likely each language model is to produce hurtful completions based on a predefined set of prompts.
---
# Measurement Card for HONEST
## Measurement description
The HONEST score aims to measure hurtful sentence completions in language models.
The score uses HurtLex, a multilingual lexicon of hurtful language, to evaluate the completions.
It aims to quantify how often sentences are completed with a hurtful word, and if there is a difference between
groups (e.g. genders, sexual orientations, etc.).
## How to use
When loading the measurement, specify the language of the prompts and completions.
The available languages are: 'it' (Italian), 'fr' (French), 'es' (Spanish), 'pt' (Portuguese), 'ro' (Romanian), 'en' (English).
```python
>>> honest = evaluate.load('honest', 'en')
```
Arguments:
**predictions** (list of list of `str`): a list of completions to [HONEST prompts](https://huggingface.co/datasets/MilaNLProc/honest)
**groups** (list of `str`) (*optional*): a list of the identity groups each list of completions belongs to.
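A minimal sketch of computing the score, using invented completions and group labels purely for illustration:
```python
>>> completions = [["CEO", "businessman", "father"], ["good", "excellent", "smart"], ["secretary", "waitress", "maid"], ["beautiful", "smart", "tall"]]
>>> groups = ["male", "male", "female", "female"]
>>> result = honest.compute(predictions=completions, groups=groups)
```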
## Output values
`honest_score`: the HONEST score, representing the average number of hurtful completions across all groups
`honest_score_per_group`: the HONEST score of each group separately.
### Values from popular papers
In the [original HONEST paper](https://aclanthology.org/2021.naacl-main.191.pdf), HONEST scores were calculated for a range of language models, with Top K referring to the number of model completions that were evaluated.
title={"{HONEST}: Measuring Hurtful Sentence Completion in Language Models"},
author="Nozza, Debora and Bianchi, Federico and Hovy, Dirk",
booktitle="Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies",
month=jun,
year="2021",
address="Online",
publisher="Association for Computational Linguistics",
title={Measuring Harmful Sentence Completion in Language Models for LGBTQIA+ Individuals},
author="Nozza, Debora and Bianchi, Federico and Lauscher, Anne and Hovy, Dirk",
booktitle="Proceedings of the Second Workshop on Language Technology for Equality, Diversity and Inclusion",
publisher="Association for Computational Linguistics",
year={2022}
}
```
## Further References
- Bassignana, Elisa, Valerio Basile, and Viviana Patti. ["Hurtlex: A multilingual lexicon of words to hurt."](http://ceur-ws.org/Vol-2253/paper49.pdf) 5th Italian Conference on Computational Linguistics, CLiC-it 2018. Vol. 2253. CEUR-WS, 2018.
- **data** (`list`): a list of integers or strings containing the data labels.
### Output Values
By default, this metric outputs a dictionary that contains:
- **label_distribution** (`dict`): a dictionary containing two sets of keys and values: `labels`, which includes the list of labels contained in the dataset, and `fractions`, which includes the fraction of each label.
- **label_skew** (`scalar`): the asymmetry of the label distribution.
If skewness is 0, the dataset is perfectly balanced; if it is less than -1 or greater than 1, the distribution is highly skewed; anything in between can be considered moderately skewed.
#### Values from Popular Papers
### Examples
Calculating the label distribution of a dataset with binary labels:
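A minimal sketch (the printed values are rounded and the label order may differ):
```python
>>> data = [1, 0, 1, 1, 0, 1, 0]
>>> distribution = evaluate.load("label_distribution")
>>> results = distribution.compute(data=data)
>>> print(results)
{'label_distribution': {'labels': [1, 0], 'fractions': [0.5714, 0.4286]}, 'label_skew': -0.2887}
```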
While the label distribution can be a useful signal for analyzing datasets and choosing metrics for measuring model performance, it is best accompanied by additional data exploration to better understand each subset of the dataset and how the subsets differ.
## Citation
## Further References
- [Facing Imbalanced Data Recommendations for the Use of Performance Metrics](https://sites.pitt.edu/~jeffcohn/skew/PID2829477.pdf)
# Copyright 2022 The HuggingFace Datasets Authors and the current dataset script contributor.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Label Distribution Measurement."""
from collections import Counter

import datasets
import pandas as pd
from scipy import stats

import evaluate
_DESCRIPTION = """
Returns the label ratios of the dataset labels, as well as a scalar for skewness.
"""
_KWARGS_DESCRIPTION = """
Args:
    `data`: a list containing the data labels

Returns:
    `label_distribution` (`dict`): a dictionary containing two sets of keys and values: `labels`, which includes the list of labels contained in the dataset, and `fractions`, which includes the fraction of each label.
    `label_skew` (`scalar`): the asymmetry of the label distribution.

Examples:
    >>> data = [1, 0, 1, 1, 0, 1, 0]
    >>> distribution = evaluate.load("label_distribution")
    >>> results = distribution.compute(data=data)
"""
Perplexity (PPL) can be used to evaluate the extent to which a dataset is similar to the distribution of text that a given model was trained on.
It is defined as the exponentiated average negative log-likelihood of a sequence, calculated with exponent base `e`.
For more information on perplexity, see [this tutorial](https://huggingface.co/docs/transformers/perplexity).
---
# Measurement Card for Perplexity
## Measurement Description
Given a model and an input text sequence, perplexity measures how likely the model is to generate the input text sequence.
As a measurement, it can be used to evaluate how well text matches the distribution of text that the input model was trained on.
In this case, `model_id` should be the trained model, and `data` should be the text to be evaluated.
This implementation of perplexity is calculated with log base `e`, as in `perplexity = e**(sum(losses) / num_tokenized_tokens)`, following recent convention in deep learning frameworks.
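As a tiny worked illustration of this formula, with made-up per-token losses:
```python
import math

# Hypothetical per-token negative log-likelihoods (natural log).
losses = [2.1, 1.8, 2.4, 2.0]

# perplexity = e**(sum(losses) / num_tokenized_tokens)
perplexity = math.exp(sum(losses) / len(losses))
print(round(perplexity, 2))  # ~7.96
```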
## Intended Uses
Dataset analysis or exploration.
## How to Use
The measurement takes a list of texts as input, as well as the name of the model used to compute the metric:
- **model_id** (str): model used for calculating Perplexity. NOTE: Perplexity can only be calculated for causal language models.
    - This includes models such as gpt2, causal variations of bert, causal versions of t5, and more (the full list can be found in the AutoModelForCausalLM documentation here: https://huggingface.co/docs/transformers/master/en/model_doc/auto#transformers.AutoModelForCausalLM )
- **data** (list of str): input text, where each separate text snippet is one list entry.
- **batch_size** (int): the batch size to run texts through the model. Defaults to 16.
- **add_start_token** (bool): whether to add the start token to the texts, so the perplexity can include the probability of the first word. Defaults to True.
- **device** (str): device to run on, defaults to `cuda` when available.
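A minimal sketch of running the measurement; the model name and texts are illustrative:
```python
import evaluate

perplexity = evaluate.load("perplexity", module_type="measurement")
data = ["lorem ipsum", "Happy Birthday!", "Bingo!"]
results = perplexity.compute(model_id="gpt2", data=data)

# The result contains one perplexity score per input text plus their average.
print(results)
```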
### Output Values
This measurement outputs a dictionary with the perplexity score for each text in the input list, as well as the average perplexity.
If one of the input texts is longer than the max input length of the model, then it is truncated to the max length for the perplexity computation.
Note that the output value is based heavily on what text the model was trained on. This means that perplexity scores are not comparable between models or datasets.
## Citation
```bibtex
@article{jelinek1977perplexity,
title={Perplexity—a measure of the difficulty of speech recognition tasks},
author={Jelinek, Fred and Mercer, Robert L and Bahl, Lalit R and Baker, James K},
journal={The Journal of the Acoustical Society of America},
volume={62},
number={S1},
pages={S63--S63},
year={1977},
publisher={Acoustical Society of America}
}
```
## Further References
- [Hugging Face Perplexity Blog Post](https://huggingface.co/docs/transformers/perplexity)