Symmetric Mean Absolute Percentage Error (sMAPE) is the symmetric mean of the absolute percentage difference between predicted and actual values, as defined by Chen and Yang (2004), based on the earlier metric of Armstrong (1985) and Makridakis (1993).
---
# Metric Card for sMAPE
## Metric Description
Symmetric Mean Absolute Percentage Error (sMAPE) is the symmetric mean of the absolute percentage difference between the predicted values $x_i$ and the actual values $y_i$:
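Written out (a standard formulation consistent with the 0.0-2.0 output range described below; implementations typically add a small epsilon to the denominator to avoid division by zero):

$$\mathrm{sMAPE} = \frac{1}{n} \sum_{i=1}^{n} \frac{2\,|x_i - y_i|}{|x_i| + |y_i|}$$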
## How to use
Mandatory arguments:
- `predictions`: numeric array-like of shape (`n_samples,`) or (`n_samples`, `n_outputs`), representing the estimated target values.
- `references`: numeric array-like of shape (`n_samples,`) or (`n_samples`, `n_outputs`), representing the ground truth (correct) target values.
Optional arguments:
- `sample_weight`: numeric array-like of shape (`n_samples,`) representing sample weights. The default is `None`.
- `multioutput`: `raw_values`, `uniform_average` or numeric array-like of shape (`n_outputs,`), which defines the aggregation of multiple output values. The default value is `uniform_average`.
    - `raw_values` returns a full set of errors in case of multioutput input.
    - `uniform_average` means that the errors of all outputs are averaged with uniform weight.
    - the array-like value defines weights used to average errors.
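For example, a minimal usage sketch (assuming the metric is available under the `smape` identifier in the `evaluate` library; the printed value is approximate):

```python
from evaluate import load

# Score a toy set of single-output predictions.
smape_metric = load("smape")
results = smape_metric.compute(predictions=[2.5, 0.0, 2.0, 8.0], references=[3.0, -0.5, 2.0, 7.0])
print(results)  # roughly {'smape': 0.58}
```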
### Output Values
This metric outputs a dictionary containing the symmetric mean absolute percentage error score, which is of type:
- `float`: if multioutput is `uniform_average` or an ndarray of weights, then the weighted average of all output errors is returned.
- numeric array-like of shape (`n_outputs,`): if multioutput is `raw_values`, then the score is returned for each output separately.
Each sMAPE `float` value ranges from `0.0` to `2.0`, with the best value being `0.0`.
This metric is called a measure of "percentage error" even though there is no multiplier of 100. The range is between 0 and 2, with the value reaching 2 when, for example, the target is zero and the prediction is not (or vice versa).
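To illustrate `multioutput` (same assumptions as the sketch above; printed values are approximate):

```python
from evaluate import load

# With multioutput="raw_values", one error is returned per output column.
smape_metric = load("smape")
results = smape_metric.compute(
    predictions=[[0.5, 1], [-1, 1], [7, -6]],
    references=[[0.1, 2], [-1, 2], [8, -5]],
    multioutput="raw_values",
)
print(results)  # roughly {'smape': array([0.49, 0.51])}
```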
## Citation(s)
```bibtex
@article{chen2004assessing,
  author = {Chen, Zhuo and Yang, Yuhong},
  title  = {Assessing forecast accuracy measures},
  year   = {2004},
  month  = {04}
}
```
## Further References
- [Symmetric mean absolute percentage error - Wikipedia](https://en.wikipedia.org/wiki/Symmetric_mean_absolute_percentage_error)
This metric wraps the official scoring script for version 1 of the Stanford Question Answering Dataset (SQuAD).
Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by
crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span,
from the corresponding reading passage, or the question might be unanswerable.
---
# Metric Card for SQuAD
## Metric description
This metric wraps the official scoring script for version 1 of the [Stanford Question Answering Dataset (SQuAD)](https://huggingface.co/datasets/squad).
SQuAD is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable.
## How to use
The metric takes two files or two lists of question-answer dictionaries as inputs: one with the predictions of the model and the other with the references to be compared to.
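For instance, a minimal sketch of the call (the `predictions` and `references` lists follow the SQuAD format shown under "Examples" below):

```python
from evaluate import load

# Load the SQuAD v1 metric; predictions and references are lists of
# question-answer dictionaries as described above.
squad_metric = load("squad")
results = squad_metric.compute(predictions=predictions, references=references)
```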
## Output values
This metric outputs a dictionary with two values: the average exact match score and the average [F1 score](https://huggingface.co/metrics/f1).
```
{'exact_match': 100.0, 'f1': 100.0}
```
The range of `exact_match` is 0-100, where 0.0 means no answers were matched and 100.0 means all answers were matched.
The range of `f1` is 0-100, as the example output above shows -- its lowest possible value is 0, if either the precision or the recall is 0, and its highest possible value is 100.0, which means perfect precision and recall.
### Values from popular papers
The [original SQuAD paper](https://nlp.stanford.edu/pubs/rajpurkar2016squad.pdf) reported an F1 score of 51.0% and an Exact Match score of 40.0%. They also report that human performance on the dataset represents an F1 score of 90.5% and an Exact Match score of 80.3%.
For more recent model performance, see the [dataset leaderboard](https://paperswithcode.com/dataset/squad).
## Examples
Maximal values for both exact match and F1 (perfect match):
```python
from evaluate import load

squad_metric = load("squad")
predictions = [{'prediction_text': '1976', 'id': '56e10a3be3433e1400422b22'}, {'prediction_text': 'Beyoncé and Bruno Mars', 'id': '56d2051ce7d4791d0090260b'}, {'prediction_text': 'climate change', 'id': '5733b5344776f419006610e1'}]
references = [{'answers': {'answer_start': [97], 'text': ['1976']}, 'id': '56e10a3be3433e1400422b22'}, {'answers': {'answer_start': [233], 'text': ['Beyoncé and Bruno Mars']}, 'id': '56d2051ce7d4791d0090260b'}, {'answers': {'answer_start': [891], 'text': ['climate change']}, 'id': '5733b5344776f419006610e1'}]
results = squad_metric.compute(predictions=predictions, references=references)
print(results)  # {'exact_match': 100.0, 'f1': 100.0}
```
## Limitations and bias
This metric works only with datasets that have the same format as the [SQuAD v.1 dataset](https://huggingface.co/datasets/squad).
The SQuAD dataset does contain a certain amount of noise, such as duplicate questions as well as missing answers, but these represent a minority of the 100,000 question-answer pairs. Also, neither exact match nor F1 score reflect whether models do better on certain types of questions (e.g. who questions) or those that cover a certain gender or geographical area -- carrying out more in-depth error analysis can complement these numbers.
## Citation
```bibtex
@inproceedings{Rajpurkar2016SQuAD10,
  title={SQuAD: 100,000+ Questions for Machine Comprehension of Text},
  author={Pranav Rajpurkar and Jian Zhang and Konstantin Lopyrev and Percy Liang},
  booktitle={EMNLP},
  year={2016}
}
```
This metric wraps the official scoring script for version 2 of the Stanford Question Answering Dataset (SQuAD).
Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by
crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span,
from the corresponding reading passage, or the question might be unanswerable.
SQuAD2.0 combines the 100,000 questions in SQuAD1.1 with over 50,000 unanswerable questions
written adversarially by crowdworkers to look similar to answerable ones.
To do well on SQuAD2.0, systems must not only answer questions when possible, but also
determine when no answer is supported by the paragraph and abstain from answering.
---
# Metric Card for SQuAD v2
## Metric description
This metric wraps the official scoring script for version 2 of the [Stanford Question Answering Dataset (SQuAD)](https://huggingface.co/datasets/squad_v2).
SQuAD is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable.
SQuAD 2.0 combines the 100,000 questions in SQuAD 1.1 with over 50,000 unanswerable questions written adversarially by crowdworkers to look similar to answerable ones. To do well on SQuAD 2.0, systems must not only answer questions when possible, but also determine when no answer is supported by the paragraph and abstain from answering.
## How to use
The metric takes two files or two lists as inputs: one representing model predictions and the other the references to compare them to. A short example of the call follows the format descriptions below.
*Predictions*: a list of question-answer dictionaries to score, each with the following key-value pairs:
* `'id'`: the identification field of the question and answer pair
* `'prediction_text'`: the text of the answer
* `'no_answer_probability'`: the probability that the question has no answer

*References*: a list of question-answer dictionaries, each with the following key-value pairs:
* `'id'`: id of the question-answer pair (see above)
* `'answers'`: a list of Dict {'text': text of the answer as a string}

The metric also takes an optional argument:
* `'no_answer_threshold'`: the probability threshold to decide that a question has no answer.
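A minimal sketch of the call, assuming `predictions` and `references` follow the formats above:

```python
from evaluate import load

# Load the SQuAD v2 metric and score predictions against references.
squad_v2_metric = load("squad_v2")
results = squad_v2_metric.compute(predictions=predictions, references=references)
```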
## Output values
This metric outputs a dictionary with the following key-value pairs:
* `'exact'`: Exact match (the normalized answer exactly matches the gold answer) (see the `exact_match` metric (forthcoming))
* `'f1'`: The average F1-score of predicted tokens versus the gold answer (see the [F1 score](https://huggingface.co/metrics/f1) metric)
* `'total'`: Number of scores considered
* `'HasAns_exact'`: Exact match (the normalized answer exactly matches the gold answer)
* `'HasAns_f1'`: The average F1-score of predicted tokens versus the gold answer
* `'HasAns_total'`: How many of the questions have answers
* `'NoAns_exact'`: Exact match (the normalized answer exactly matches the gold answer)
* `'NoAns_f1'`: The average F1-score of predicted tokens versus the gold answer
* `'NoAns_total'`: How many of the questions have no answers
* `'best_exact'`: Best exact match (with varying threshold)
* `'best_exact_thresh'`: No-answer probability threshold associated with the best exact match
* `'best_f1'`: Best F1 score (with varying threshold)
* `'best_f1_thresh'`: No-answer probability threshold associated with the best F1
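For example, a run over a single answerable question with a perfect prediction returns a dictionary along these lines (the `NoAns_*` keys appear only when unanswerable questions are present):

```
{'exact': 100.0, 'f1': 100.0, 'total': 1, 'HasAns_exact': 100.0, 'HasAns_f1': 100.0, 'HasAns_total': 1, 'best_exact': 100.0, 'best_exact_thresh': 0.0, 'best_f1': 100.0, 'best_f1_thresh': 0.0}
```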
The range of `exact` is 0-100, where 0.0 means no answers were matched and 100.0 means all answers were matched.
The range of `f1` is 0-100 -- its lowest possible value is 0, if either the precision or the recall is 0, and its highest possible value is 100.0, which means perfect precision and recall.
The range of `total` depends on the length of predictions/references: its minimal value is 0, and maximal value is the total number of questions in the predictions and references.
### Values from popular papers
The [SQuAD v2 paper](https://arxiv.org/pdf/1806.03822.pdf) reported an F1 score of 66.3% and an Exact Match score of 63.4%.
They also report that human performance on the dataset represents an F1 score of 89.5% and an Exact Match score of 86.9%.
For more recent model performance, see the [dataset leaderboard](https://paperswithcode.com/dataset/squad).
## Examples
Maximal values for both exact match and F1 (perfect match):
```python
from evaluate import load

squad_v2_metric = load("squad_v2")
predictions = [{'prediction_text': '1976', 'id': '56e10a3be3433e1400422b22', 'no_answer_probability': 0.}, {'prediction_text': 'Beyoncé and Bruno Mars', 'id': '56d2051ce7d4791d0090260b', 'no_answer_probability': 0.}, {'prediction_text': 'climate change', 'id': '5733b5344776f419006610e1', 'no_answer_probability': 0.}]
references = [{'answers': {'answer_start': [97], 'text': ['1976']}, 'id': '56e10a3be3433e1400422b22'}, {'answers': {'answer_start': [233], 'text': ['Beyoncé and Bruno Mars']}, 'id': '56d2051ce7d4791d0090260b'}, {'answers': {'answer_start': [891], 'text': ['climate change']}, 'id': '5733b5344776f419006610e1'}]
results = squad_v2_metric.compute(predictions=predictions, references=references)
print(results['exact'], results['f1'])  # 100.0 100.0
```
## Limitations and bias
This metric works only with datasets in the same format as the [SQuAD v.2 dataset](https://huggingface.co/datasets/squad_v2).
The SQuAD datasets do contain a certain amount of noise, such as duplicate questions as well as missing answers, but these represent a minority of the 100,000 question-answer pairs. Also, neither exact match nor F1 score reflect whether models do better on certain types of questions (e.g. who questions) or those that cover a certain gender or geographical area -- carrying out more in-depth error analysis can complement these numbers.
## Citation
```bibtex
@inproceedings{Rajpurkar2018SQuAD2,
  title={Know What You Don't Know: Unanswerable Questions for SQuAD},
  author={Pranav Rajpurkar and Jian Zhang and Percy Liang},
  booktitle={ACL},
  year={2018}
}
```
SuperGLUE (https://super.gluebenchmark.com/) is a new benchmark styled after GLUE with a new set of more difficult language understanding tasks, improved resources, and a new public leaderboard.
---
# Metric Card for SuperGLUE
## Metric description
This metric is used to compute the SuperGLUE evaluation metric associated with each of the subsets of the [SuperGLUE dataset](https://huggingface.co/datasets/super_glue).
SuperGLUE is a new benchmark styled after GLUE with a new set of more difficult language understanding tasks, improved resources, and a new public leaderboard.
## How to use
There are two steps: (1) loading the SuperGLUE metric relevant to the subset of the dataset being used for evaluation; and (2) calculating the metric.
1. **Loading the relevant SuperGLUE metric**: the subsets of SuperGLUE are the following: `boolq`, `cb`, `copa`, `multirc`, `record`, `rte`, `wic`, `wsc`, `wsc.fixed`, `axb`, `axg`.
More information about the different subsets of the SuperGLUE dataset can be found on the [SuperGLUE dataset page](https://huggingface.co/datasets/super_glue) and on the [official dataset website](https://super.gluebenchmark.com/).
2. **Calculating the metric**: the metric takes two inputs: one list with the predictions of the model to score and one list of reference labels. The structure of both inputs depends on the SuperGLUE subset being used (a short example follows the format descriptions below):
Format of `predictions`:
- for `record`: list of question-answer dictionaries with the following keys:
    - `idx`: index of the question as specified by the dataset
    - `prediction_text`: the predicted answer text
- for `multirc`: list of question-answer dictionaries with the following keys:
    - `idx`: index of the question-answer pair as specified by the dataset
    - `prediction`: the predicted answer label
- otherwise: list of predicted labels
Format of `references`:
- for `record`: list of question-answer dictionaries with the following keys:
    - `idx`: index of the question as specified by the dataset
    - `answers`: list of possible answers
- otherwise: list of reference labels
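For instance, a minimal sketch for the `record` subset (toy values; the `idx` dictionaries follow the dataset's passage/query indexing):

```python
from evaluate import load

# Load the ReCoRD variant of the SuperGLUE metric and score one toy prediction.
super_glue_metric = load('super_glue', 'record')
predictions = [{'idx': {'passage': 0, 'query': 0}, 'prediction_text': 'answer'}]
references = [{'idx': {'passage': 0, 'query': 0}, 'answers': ['answer', 'another_answer']}]
results = super_glue_metric.compute(predictions=predictions, references=references)
print(results)  # {'exact_match': 1.0, 'f1': 1.0}
```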
## Output values
The output of the metric depends on the SuperGLUE subset chosen, consisting of a dictionary that contains one or several of the following metrics:
`exact_match`: A given predicted string's exact match score is 1 if it is the exact same as its reference string, and is 0 otherwise. (See [Exact Match](https://huggingface.co/metrics/exact_match) for more information).
`f1`: the harmonic mean of the precision and recall (see [F1 score](https://huggingface.co/metrics/f1) for more information). Its range is 0-1 -- its lowest possible value is 0, if either the precision or the recall is 0, and its highest possible value is 1.0, which means perfect precision and recall.
`matthews_correlation`: a measure of the quality of binary and multiclass classifications (see [Matthews Correlation](https://huggingface.co/metrics/matthews_correlation) for more information). Its range of values is between -1 and +1, where a coefficient of +1 represents a perfect prediction, 0 an average random prediction and -1 an inverse prediction.
`accuracy`: the proportion of correct predictions among the total number of cases processed, with a range between 0 and 1 (see [Accuracy](https://huggingface.co/metrics/accuracy) for more information). Subsets such as `copa` and `boolq` report this value, as in the example below.
### Values from popular papers
The [original SuperGLUE paper](https://arxiv.org/pdf/1905.00537.pdf) reported average scores ranging from 47 to 71.5%, depending on the model used (with all evaluation values scaled by 100 to make computing the average possible).
For more recent model performance, see the [dataset leaderboard](https://super.gluebenchmark.com/leaderboard).
## Examples
Maximal values for the COPA subset (which outputs `accuracy`):
```python
from evaluate import load

super_glue_metric = load('super_glue', 'copa')  # any of ["copa", "rte", "wic", "wsc", "wsc.fixed", "boolq", "axg"]
predictions = [0, 1]  # toy labels that agree perfectly with the references
references = [0, 1]
results = super_glue_metric.compute(predictions=predictions, references=references)
print(results)  # {'accuracy': 1.0}
```
## Limitations and bias
This metric works only with datasets that have the same format as the [SuperGLUE dataset](https://huggingface.co/datasets/super_glue).
The dataset also includes Winogender, a subset of the dataset that is designed to measure gender bias in coreference resolution systems. However, as noted in the SuperGLUE paper, this subset has its limitations: *"It offers only positive predictive value: A poor bias score is clear evidence that a model exhibits gender bias, but a good score does not mean that the model is unbiased. [...] Also, Winogender does not cover all forms of social bias, or even all forms of gender. For instance, the version of the data used here offers no coverage of gender-neutral they or non-binary pronouns."*
## Citation
```bibtex
@article{wang2019superglue,
  title={Super{GLUE}: A Stickier Benchmark for General-Purpose Language Understanding Systems},
  author={Wang, Alex and Pruksachatkun, Yada and Nangia, Nikita and Singh, Amanpreet and Michael, Julian and Hill, Felix and Levy, Omer and Bowman, Samuel R},
  journal={arXiv preprint arXiv:1905.00537},
  year={2019}
}
```