# Metric Card for CUAD
## Metric description
This metric wraps the official scoring script for version 1 of the [Contract Understanding Atticus Dataset (CUAD)](https://huggingface.co/datasets/cuad), which is a corpus of more than 13,000 labels in 510 commercial legal contracts that have been manually labeled to identify 41 categories of important clauses that lawyers look for when reviewing contracts in connection with corporate transactions.
The CUAD metric computes several scores: [Exact Match](https://huggingface.co/metrics/exact_match), [F1 score](https://huggingface.co/metrics/f1), Area Under the Precision-Recall Curve, [Precision](https://huggingface.co/metrics/precision) at 80% [recall](https://huggingface.co/metrics/recall) and Precision at 90% recall.
## How to use
The CUAD metric takes two inputs:

`predictions`, a list of question-answer dictionaries with the following key-values:
- `id`: the id of the question-answer pair as given in the references.
- `prediction_text`: a list of possible texts for the answer, as a list of strings depending on a threshold on the confidence probability of each prediction.

`references`: a list of question-answer dictionaries with the following key-values:
- `id`: the id of the question-answer pair (the same as above).
- `answers`: a dictionary *in the CUAD dataset format* with the following keys:
    - `text`: a list of possible texts for the answer, as a list of strings.
    - `answer_start`: a list of start positions for the answer, as a list of ints.

Note that `answer_start` values are not taken into account to compute the metric.
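A minimal usage sketch, assuming the `evaluate` library and the metric's dependencies are installed (the `answer_start` value below is an illustrative placeholder, since it is not used when computing the metric):

```python
import evaluate

cuad_metric = evaluate.load("cuad")

clause = "The Company appoints the Distributor as an exclusive distributor of Products in the Market, subject to the terms and conditions of this Agreement."
qa_id = "LIMEENERGYCO_09_09_1999-EX-10-DISTRIBUTOR AGREEMENT__Exclusivity_0"

# Predictions and references are matched on the same question-answer id.
predictions = [{"prediction_text": [clause], "id": qa_id}]
references = [{"answers": {"text": [clause], "answer_start": [0]}, "id": qa_id}]

results = cuad_metric.compute(predictions=predictions, references=references)
print(sorted(results))  # ['aupr', 'exact_match', 'f1', 'prec_at_80_recall', 'prec_at_90_recall']
```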
The output of the CUAD metric consists of a dictionary that contains one or several of the following metrics:
`exact_match`: The normalized answers that exactly match the reference answer, with a range between 0.0 and 1.0 (see [exact match](https://huggingface.co/metrics/exact_match) for more information).
`f1`: The harmonic mean of the precision and recall (see [F1 score](https://huggingface.co/metrics/f1) for more information). Its range is between 0.0 and 1.0 -- its lowest possible value is 0, if either the precision or the recall is 0, and its highest possible value is 1.0, which means perfect precision and recall.
`aupr`: The Area Under the Precision-Recall curve, with a range between 0.0 and 1.0, with a higher value representing both high recall and high precision, and a low value representing low values for both. See the [Wikipedia article](https://en.wikipedia.org/wiki/Receiver_operating_characteristic#Area_under_the_curve) for more information.
`prec_at_80_recall`: The fraction of true examples among the predicted examples at a recall rate of 80%. Its range is between 0.0 and 1.0. For more information, see [precision](https://huggingface.co/metrics/precision) and [recall](https://huggingface.co/metrics/recall).
`prec_at_90_recall`: The fraction of true examples among the predicted examples at a recall rate of 90%. Its range is between 0.0 and 1.0.
### Values from popular papers
The [original CUAD paper](https://arxiv.org/pdf/2103.06268.pdf) reports that a [DeBERTa model](https://huggingface.co/microsoft/deberta-base) attains
an AUPR of 47.8%, a Precision at 80% Recall of 44.0%, and a Precision at 90% Recall of 17.8% (they do not report F1 or Exact Match separately).
For more recent model performance, see the [dataset leaderboard](https://paperswithcode.com/dataset/cuad).
## Examples

Example `predictions` in the expected format (with one and two candidate answer texts, respectively):

```python
predictions = [{'prediction_text': ['The Company appoints the Distributor as an exclusive distributor of Products in the Market, subject to the terms and conditions of this Agreement.'], 'id': 'LIMEENERGYCO_09_09_1999-EX-10-DISTRIBUTOR AGREEMENT__Exclusivity_0'}]

predictions = [{'prediction_text': ['The Company appoints the Distributor as an exclusive distributor of Products in the Market, subject to the terms and conditions of this Agreement.', 'The buyer/End-User: Shenzhen LOHAS Supply Chain Management Co., Ltd.'], 'id': 'LohaCompanyltd_20191209_F-1_EX-10.16_11917878_EX-10.16_Supply Agreement__Parties'}]
```
## Limitations and bias

This metric works only with datasets that have the same format as the [CUAD dataset](https://huggingface.co/datasets/cuad). The limitations and biases of this dataset are not discussed, but it could exhibit annotation bias given the homogeneity of the annotators for this dataset.

In terms of the metric itself, the reliability of AUPR has been debated: its estimates are quite noisy, and reducing the Precision-Recall curve to a single number ignores the fact that the curve describes tradeoffs between different systems or operating points rather than the performance of an individual system. Reporting the original F1 and exact match scores is therefore useful to ensure a more complete representation of system performance.
## Citation
```bibtex
@article{hendrycks2021cuad,
title={CUAD: An Expert-Annotated NLP Dataset for Legal Contract Review},
author={Dan Hendrycks and Collin Burns and Anya Chen and Spencer Ball},
journal={arXiv preprint arXiv:2103.06268},
year={2021}
}
```
---
# Metric Card for Exact Match
## Metric Description
A given predicted string's exact match score is 1 if it is the exact same as its reference string, and is 0 otherwise.
- **Example 1**: The exact match score of prediction "Happy Birthday!" is 0, given its reference is "Happy New Year!".
- **Example 2**: The exact match score of prediction "The Colour of Magic (1983)" is 1, given its reference is also "The Colour of Magic (1983)".

The exact match score of a set of predictions is the sum of all of the individual exact match scores in the set, divided by the total number of predictions in the set.

- **Example**: The exact match score of the set {Example 1, Example 2} (above) is 0.5.
## How to Use
At minimum, this metric takes as input predictions and references:
- **`predictions`** (`list` of `str`): List of predicted texts.
- **`references`** (`list` of `str`): List of reference texts.

It also accepts several optional arguments:

- **`regexes_to_ignore`** (`list` of `str`): Regex expressions of characters to ignore when calculating the exact matches. Defaults to `None`. Note: the regex changes are applied before capitalization is normalized.
- **`ignore_case`** (`bool`): If `True`, turns everything to lowercase so that capitalization differences are ignored. Defaults to `False`.
- **`ignore_punctuation`** (`bool`): If `True`, removes punctuation before comparing strings. Defaults to `False`.
- **`ignore_numbers`** (`bool`): If `True`, removes all digits before comparing strings. Defaults to `False`.
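A minimal usage sketch, assuming the `evaluate` library is installed (it reuses Examples 1 and 2 from the description above):

```python
import evaluate

exact_match = evaluate.load("exact_match")
results = exact_match.compute(
    predictions=["Happy Birthday!", "The Colour of Magic (1983)"],
    references=["Happy New Year!", "The Colour of Magic (1983)"],
)
print(results)  # {'exact_match': 0.5} -- one of the two predictions matches exactly
```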
### Output Values
This metric outputs a dictionary with one value: the average exact match score.
```python
{'exact_match':1.0}
```
This metric's range is 0-1, inclusive. Here, 0.0 means no prediction/reference pairs were matches, while 1.0 means they all were.
#### Values from Popular Papers
The exact match metric is often included in other metrics, such as SQuAD. For example, the [original SQuAD paper](https://nlp.stanford.edu/pubs/rajpurkar2016squad.pdf) reported an Exact Match score of 40.0%. They also report that the human performance Exact Match score on the dataset was 80.3%.
## Limitations and Bias

This metric is limited in that it outputs the same score for something that is completely wrong as for something that is correct except for a single character. In other words, there is no award for being *almost* right.
## Citation
## Further References
- Also used in the [SQuAD metric](https://github.com/huggingface/datasets/tree/master/metrics/squad)
# Copyright 2020 The HuggingFace Datasets Authors and the current dataset script contributor.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Exact Match metric."""
import re
import string

import datasets
import numpy as np

import evaluate
_DESCRIPTION="""
Returns the rate at which the input predicted strings exactly match their references, ignoring any strings input as part of the regexes_to_ignore list.
"""
_KWARGS_DESCRIPTION="""
Args:
predictions: List of predicted texts.
references: List of reference texts.
regexes_to_ignore: List, defaults to None. Regex expressions of characters to
ignore when calculating the exact matches. Note: these regexes are removed
from the input data before the changes based on the options below (e.g. ignore_case,
ignore_punctuation, ignore_numbers) are applied.
ignore_case: Boolean, defaults to False. If true, turns everything
to lowercase so that capitalization differences are ignored.
ignore_punctuation: Boolean, defaults to False. If true, removes all punctuation before
comparing predictions and references.
ignore_numbers: Boolean, defaults to False. If true, removes all digits before
comparing predictions and references.
Returns:
exact_match: Dictionary containing exact_match rate. Possible values are between 0.0 and 1.0, inclusive.
"""

---

# Metric Card for F1

## How to Use

### Inputs

- **predictions** (`list` of `int`): Predicted labels.
- **references** (`list` of `int`): Ground truth labels.
- **labels** (`list` of `int`): The set of labels to include when `average` is not set to `'binary'`, and the order of the labels if `average` is `None`. Labels present in the data can be excluded, for example to calculate a multiclass average ignoring a majority negative class. Labels not present in the data will result in 0 components in a macro average. For multilabel targets, labels are column indices. By default, all labels in `predictions` and `references` are used in sorted order. Defaults to None.
- **pos_label** (`int`): The class to be considered the positive class, in the case where `average` is set to `binary`. Defaults to 1.
- **average** (`string`): This parameter is required for multiclass/multilabel targets. If set to `None`, the scores for each class are returned. Otherwise, this determines the type of averaging performed on the data. Defaults to `'binary'`.
    - 'binary': Only report results for the class specified by `pos_label`. This is applicable only if the classes found in `predictions` and `references` are binary.
    - 'micro': Calculate metrics globally by counting the total true positives, false negatives and false positives.
    - 'macro': Calculate metrics for each label, and find their unweighted mean. This does not take label imbalance into account.
    - 'weighted': Calculate metrics for each label, and find their average weighted by support (the number of true instances for each label). This alters `'macro'` to account for label imbalance. This option can result in an F-score that is not between precision and recall.
    - 'samples': Calculate metrics for each instance, and find their average (only meaningful for multilabel classification).
- **sample_weight** (`list` of `float`): Sample weights. Defaults to None.
### Output Values
- **f1** (`float` or `array` of `float`): F1 score or list of f1 scores, depending on the value passed to `average`. Minimum possible value is 0. Maximum possible value is 1. Higher f1 scores are better.
Output Example(s):
```python
{'f1':0.26666666666666666}
```
```python
{'f1':array([0.8,0.0,0.0])}
```
This metric outputs a dictionary, with either a single f1 score, of type `float`, or an array of f1 scores, with entries of type `float`.
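A minimal usage sketch, assuming the `evaluate` library (which relies on `scikit-learn` for this metric) is installed:

```python
import evaluate

f1_metric = evaluate.load("f1")

# With the default binary average, only the positive class (label 1) is scored:
# here precision and recall are both 0.5, so f1 is 0.5.
results = f1_metric.compute(predictions=[0, 1, 1, 0], references=[0, 1, 0, 1])
print(results)  # {'f1': 0.5}
```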
references (`list` of `int`): Ground truth labels.
labels (`list` of `int`): The set of labels to include when `average` is not set to `'binary'`, and the order of the labels if `average` is `None`. Labels present in the data can be excluded, for example to calculate a multiclass average ignoring a majority negative class. Labels not present in the data will result in 0 components in a macro average. For multilabel targets, labels are column indices. By default, all labels in `predictions` and `references` are used in sorted order. Defaults to None.
pos_label (`int`): The class to be considered the positive class, in the case where `average` is set to `binary`. Defaults to 1.
average (`string`): This parameter is required for multiclass/multilabel targets. If set to `None`, the scores for each class are returned. Otherwise, this determines the type of averaging performed on the data. Defaults to `'binary'`.
- 'binary': Only report results for the class specified by `pos_label`. This is applicable only if the classes found in `predictions` and `references` are binary.
- 'micro': Calculate metrics globally by counting the total true positives, false negatives and false positives.
- 'macro': Calculate metrics for each label, and find their unweighted mean. This does not take label imbalance into account.
- 'weighted': Calculate metrics for each label, and find their average weighted by support (the number of true instances for each label). This alters `'macro'` to account for label imbalance. This option can result in an F-score that is not between precision and recall.
- 'samples': Calculate metrics for each instance, and find their average (only meaningful for multilabel classification).
sample_weight (`list` of `float`): Sample weights. Defaults to None.
Returns:
f1 (`float` or `array` of `float`): F1 score or list of f1 scores, depending on the value passed to `average`. Minimum possible value is 0. Maximum possible value is 1. Higher f1 scores are better.
---
# Metric Card for FrugalScore

## Metric Description

FrugalScore is a reference-based metric for Natural Language Generation (NLG) model evaluation. It is based on a distillation approach that makes it possible to learn a fixed, low-cost version of any expensive NLG metric while retaining most of its original performance.

The FrugalScore models are obtained by continuing the pretraining of small models on a synthetic dataset constructed using summarization, backtranslation and denoising models. During training, the small models learn the internal mapping of the expensive metric, including any similarity function.
## How to use
When loading FrugalScore, you can indicate the model you wish to use to compute the score. The default model is `moussaKam/frugalscore_tiny_bert-base_bert-score`, and a full list of models can be found in the [Limitations and bias](#Limitations-and-bias) section.
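A minimal loading-and-scoring sketch, assuming the `evaluate` library plus its `torch` and `transformers` dependencies are installed (the second argument selects the model; here the default is spelled out explicitly, and the input strings are only illustrative):

```python
import evaluate

# Load FrugalScore with an explicit model checkpoint (here, the default tiny model).
frugalscore = evaluate.load("frugalscore", "moussaKam/frugalscore_tiny_bert-base_bert-score")

# Score each prediction against its corresponding reference.
results = frugalscore.compute(
    predictions=["hello there", "huggingface"],
    references=["hello world", "hugging face"],
)
print(results["scores"])  # one score per prediction-reference pair
```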
The output of FrugalScore is a dictionary with the list of scores for each prediction-reference pair:
```python
{'scores':[0.6307541,0.6449357]}
```
### Values from popular papers
The [original FrugalScore paper](https://arxiv.org/abs/2110.08559) reported that FrugalScore-Tiny retains 97.7/94.7% of the original performance compared to [BertScore](https://huggingface.co/metrics/bertscore), while running 54 times faster and having 84 times fewer parameters.
## Examples
Maximal values (exact match between `references` and `predictions`):
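A sketch of this case (the exact score depends on the model chosen, so no particular value is asserted):

```python
import evaluate

frugalscore = evaluate.load("frugalscore")  # defaults to moussaKam/frugalscore_tiny_bert-base_bert-score
results = frugalscore.compute(predictions=["hello world"], references=["hello world"])
print(results["scores"])  # close to the model's maximum, since prediction and reference are identical
```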
## Limitations and bias

FrugalScore is based on [BertScore](https://huggingface.co/metrics/bertscore) and [MoverScore](https://arxiv.org/abs/1909.02622), and the models used are based on the original models used for these scores.
The full list of available models for FrugalScore is:
Depending on the size of the model picked, the loading time will vary: the `tiny` models will load very quickly, whereas the `medium` ones can take several minutes, depending on your Internet connection.
## Citation
```bibtex
@article{eddine2021frugalscore,
title={FrugalScore: Learning Cheaper, Lighter and Faster Evaluation Metrics for Automatic Text Generation},
author={Eddine, Moussa Kamal and Shang, Guokan and Tixier, Antoine J-P and Vazirgiannis, Michalis},
journal={arXiv preprint arXiv:2110.08559},
year={2021}
}
"""
_DESCRIPTION="""\
FrugalScore is a reference-based metric for NLG models evaluation. It is based on a distillation approach that allows to learn a fixed, low cost version of any expensive NLG metric, while retaining most of its original performance.
"""
_KWARGS_DESCRIPTION="""
Calculates how good the predictions are given some references, using certain scores.
Args:
    predictions (list of str): list of predictions to score. Each prediction
        should be a string.
    references (list of str): list of references, one for each prediction. Each
        reference should be a string.
---
# Metric Card for GLUE
## Metric description
This metric is used to compute the GLUE evaluation metric associated with each subset of the [GLUE dataset](https://huggingface.co/datasets/glue).

GLUE, the General Language Understanding Evaluation benchmark, is a collection of resources for training, evaluating, and analyzing natural language understanding systems.
## How to use
There are two steps: (1) loading the GLUE metric relevant to the subset of the GLUE dataset being used for evaluation; and (2) calculating the metric.
1. **Loading the relevant GLUE metric**: the subsets of GLUE are the following: `sst2`, `mnli`, `mnli_mismatched`, `mnli_matched`, `qnli`, `rte`, `wnli`, `cola`, `stsb`, `mrpc`, `qqp`, and `hans`.

More information about the different subsets of the GLUE dataset can be found on the [GLUE dataset page](https://huggingface.co/datasets/glue).

2. **Calculating the metric**: the metric takes two inputs: one list with the predictions of the model to score and one list of references. A minimal sketch of both steps is shown below.
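This sketch assumes the `evaluate` library and the metric's `scikit-learn`/`scipy` dependencies are installed:

```python
import evaluate

# Step 1: load the metric for the GLUE subset being evaluated, e.g. "cola".
glue_metric = evaluate.load("glue", "cola")

# Step 2: compute the metric from the model's predictions and the references.
results = glue_metric.compute(predictions=[0, 1, 1, 0], references=[0, 1, 1, 0])
print(results)  # {'matthews_correlation': 1.0}
```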
The output of the metric depends on the GLUE subset chosen, consisting of a dictionary that contains one or several of the following metrics:
`accuracy`: the proportion of correct predictions among the total number of cases processed, with a range between 0 and 1 (see [accuracy](https://huggingface.co/metrics/accuracy) for more information).
`f1`: the harmonic mean of the precision and recall (see [F1 score](https://huggingface.co/metrics/f1) for more information). Its range is 0-1 -- its lowest possible value is 0, if either the precision or the recall is 0, and its highest possible value is 1.0, which means perfect precision and recall.
`pearson`: a measure of the linear relationship between two datasets (see [Pearson correlation](https://huggingface.co/metrics/pearsonr) for more information). Its range is between -1 and +1, with 0 implying no correlation, and -1/+1 implying an exact linear relationship. Positive correlations imply that as x increases, so does y, whereas negative correlations imply that as x increases, y decreases.
`spearmanr`: a nonparametric measure of the monotonicity of the relationship between two datasets (see [Spearman Correlation](https://huggingface.co/metrics/spearmanr) for more information). `spearmanr` has the same range as `pearson`.
`matthews_correlation`: a measure of the quality of binary and multiclass classifications (see [Matthews Correlation](https://huggingface.co/metrics/matthews_correlation) for more information). Its range of values is between -1 and +1, where a coefficient of +1 represents a perfect prediction, 0 an average random prediction and -1 an inverse prediction.
The `cola` subset returns `matthews_correlation`, the `stsb` subset returns `pearson` and `spearmanr`, the `mrpc` and `qqp` subsets return both `accuracy` and `f1`, and all other subsets of GLUE return only accuracy.
### Values from popular papers
The [original GLUE paper](https://arxiv.org/abs/1804.07461) reported average scores ranging from 58 to 64%, depending on the model used (with all evaluation values scaled by 100 to make computing the average possible).
For more recent model performance, see the [dataset leaderboard](https://paperswithcode.com/dataset/glue).
## Examples
Maximal values for the MRPC subset (which outputs `accuracy` and `f1`):
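A sketch of this case, with predictions identical to the references:

```python
import evaluate

mrpc_metric = evaluate.load("glue", "mrpc")
results = mrpc_metric.compute(predictions=[0, 1, 1, 0], references=[0, 1, 1, 0])
print(results)  # {'accuracy': 1.0, 'f1': 1.0}
```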
## Limitations and bias

This metric works only with datasets that have the same format as the [GLUE dataset](https://huggingface.co/datasets/glue).
While the GLUE dataset is meant to represent "General Language Understanding", the tasks represented in it are not necessarily representative of language understanding, and should not be interpreted as such.
Also, while the GLUE subtasks were considered challenging during its creation in 2019, they are no longer considered as such given the impressive progress made since then. A more complex (or "stickier") version of it, called [SuperGLUE](https://huggingface.co/datasets/super_glue), was subsequently created.
## Citation
```bibtex
@inproceedings{wang2019glue,
title={{GLUE}: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding},
author={Wang, Alex and Singh, Amanpreet and Michael, Julian and Hill, Felix and Levy, Omer and Bowman, Samuel R.},