---
title: poseval
emoji: 🤗
colorFrom: blue
colorTo: red
sdk: gradio
sdk_version: 3.19.1
app_file: app.py
pinned: false
tags:
- evaluate
- metric
description: >-
The poseval metric can be used to evaluate POS taggers. Since seqeval does not work well with POS data
that is not in IOB format, poseval is an alternative. It treats each token in the dataset as an independent
observation and computes the precision, recall and F1-score irrespective of sentences. It uses scikit-learn's
classification report to compute the scores.
---
# Metric Card for poseval
## Metric description
The poseval metric can be used to evaluate POS taggers. Since seqeval does not work well with POS data that is not in IOB format (see e.g. [here](https://stackoverflow.com/questions/71327693/how-to-disable-seqeval-label-formatting-for-pos-tagging)), poseval is an alternative. It treats each token in the dataset as an independent observation and computes the precision, recall and F1-score irrespective of sentences. It uses scikit-learn's [classification report](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html) to compute the scores.
## How to use
Poseval produces labelling scores along with their sufficient statistics from a source against references.
It takes two mandatory arguments:
`predictions`: a list of lists of predicted labels, i.e. estimated targets as returned by a tagger.
`references`: a list of lists of reference labels, i.e. the ground truth/target values.
It can also take several optional arguments:
`zero_division`: Which value to substitute as a metric value when encountering zero division. Should be one of [`0`,`1`,`"warn"`]. `"warn"` acts as `0`, but a warning is also raised.
```python
>>> predictions = [['INTJ', 'ADP', 'PROPN', 'NOUN', 'PUNCT', 'INTJ', 'ADP', 'PROPN', 'VERB', 'SYM']]
>>> references = [['INTJ', 'ADP', 'PROPN', 'PROPN', 'PUNCT', 'INTJ', 'ADP', 'PROPN', 'PROPN', 'SYM']]
>>> poseval = evaluate.load("poseval")
>>> results = poseval.compute(predictions=predictions, references=references)
>>> print(list(results.keys()))
['ADP', 'INTJ', 'NOUN', 'PROPN', 'PUNCT', 'SYM', 'VERB', 'accuracy', 'macro avg', 'weighted avg']
>>> print(results["accuracy"])
0.8
>>> print(results["PROPN"]["recall"])
0.5
```
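For example, a hedged sketch of the `zero_division` argument (the exact scores depend on your data): when a tag never appears in the predictions, its precision involves a zero division, and passing `zero_division=0` substitutes `0` without raising a warning.
```python
>>> predictions = [['INTJ', 'ADP', 'PROPN', 'PROPN']]
>>> references = [['INTJ', 'ADP', 'PROPN', 'NOUN']]
>>> poseval = evaluate.load("poseval")
>>> # NOUN is never predicted, so its precision is 0/0; zero_division=0 substitutes 0 silently.
>>> results = poseval.compute(predictions=predictions, references=references, zero_division=0)
>>> print(results["NOUN"]["precision"])
0.0
```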
## Output values
This metric returns a classification report as a dictionary with a summary of scores, overall and per type:
Overall (weighted and macro avg):
`accuracy`: the average [accuracy](https://huggingface.co/metrics/accuracy), on a scale between 0.0 and 1.0.
`precision`: the average [precision](https://huggingface.co/metrics/precision), on a scale between 0.0 and 1.0.
`recall`: the average [recall](https://huggingface.co/metrics/recall), on a scale between 0.0 and 1.0.
`f1`: the average [F1 score](https://huggingface.co/metrics/f1), which is the harmonic mean of the precision and recall. It also has a scale of 0.0 to 1.0.
Per type (e.g. `MISC`, `PER`, `LOC`,...):
`precision`: the average [precision](https://huggingface.co/metrics/precision), on a scale between 0.0 and 1.0.
`recall`: the average [recall](https://huggingface.co/metrics/recall), on a scale between 0.0 and 1.0.
`f1`: the average [F1 score](https://huggingface.co/metrics/f1), on a scale between 0.0 and 1.0.
## Examples
```python
>>> predictions = [['INTJ', 'ADP', 'PROPN', 'NOUN', 'PUNCT', 'INTJ', 'ADP', 'PROPN', 'VERB', 'SYM']]
>>> references = [['INTJ', 'ADP', 'PROPN', 'PROPN', 'PUNCT', 'INTJ', 'ADP', 'PROPN', 'PROPN', 'SYM']]
>>> poseval = evaluate.load("poseval")
>>> results = poseval.compute(predictions=predictions, references=references)
>>> print(list(results.keys()))
['ADP', 'INTJ', 'NOUN', 'PROPN', 'PUNCT', 'SYM', 'VERB', 'accuracy', 'macro avg', 'weighted avg']
>>> print(results["accuracy"])
0.8
>>> print(results["PROPN"]["recall"])
0.5
```
## Limitations and bias
In contrast to [seqeval](https://github.com/chakki-works/seqeval), the poseval metric treats each token independently and computes the classification report over all concatenated sequences.
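Concretely, this amounts to flattening the nested label lists and handing the concatenated sequences to scikit-learn's `classification_report`. A minimal sketch with the toy data from the examples above illustrates the equivalence:
```python
>>> from sklearn.metrics import classification_report
>>> predictions = [['INTJ', 'ADP', 'PROPN', 'NOUN', 'PUNCT', 'INTJ', 'ADP', 'PROPN', 'VERB', 'SYM']]
>>> references = [['INTJ', 'ADP', 'PROPN', 'PROPN', 'PUNCT', 'INTJ', 'ADP', 'PROPN', 'PROPN', 'SYM']]
>>> y_pred = [label for seq in predictions for label in seq]  # sentence boundaries are discarded
>>> y_true = [label for seq in references for label in seq]
>>> report = classification_report(y_true=y_true, y_pred=y_pred, output_dict=True)
>>> print(round(report["accuracy"], 2))
0.8
```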
## Citation
```bibtex
@article{scikit-learn,
title={Scikit-learn: Machine Learning in {P}ython},
author={Pedregosa, F. and Varoquaux, G. and Gramfort, A. and Michel, V.
and Thirion, B. and Grisel, O. and Blondel, M. and Prettenhofer, P.
and Weiss, R. and Dubourg, V. and Vanderplas, J. and Passos, A. and
Cournapeau, D. and Brucher, M. and Perrot, M. and Duchesnay, E.},
journal={Journal of Machine Learning Research},
volume={12},
pages={2825--2830},
year={2011}
}
```
## Further References
- [README for seqeval at GitHub](https://github.com/chakki-works/seqeval)
- [Classification report](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html)
- [Issues with seqeval](https://stackoverflow.com/questions/71327693/how-to-disable-seqeval-label-formatting-for-pos-tagging)
import evaluate
from evaluate.utils import launch_gradio_widget
module = evaluate.load("poseval")
launch_gradio_widget(module)
# Copyright 2022 The HuggingFace Evaluate Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" seqeval metric. """
from typing import Union
import datasets
from sklearn.metrics import classification_report
import evaluate
_CITATION = """\
@article{scikit-learn,
title={Scikit-learn: Machine Learning in {P}ython},
author={Pedregosa, F. and Varoquaux, G. and Gramfort, A. and Michel, V.
and Thirion, B. and Grisel, O. and Blondel, M. and Prettenhofer, P.
and Weiss, R. and Dubourg, V. and Vanderplas, J. and Passos, A. and
Cournapeau, D. and Brucher, M. and Perrot, M. and Duchesnay, E.},
journal={Journal of Machine Learning Research},
volume={12},
pages={2825--2830},
year={2011}
}
"""
_DESCRIPTION = """\
The poseval metric can be used to evaluate POS taggers. Since seqeval does not work well with POS data \
that is not in IOB format (see e.g. [here](https://stackoverflow.com/questions/71327693/how-to-disable-seqeval-label-formatting-for-pos-tagging)), \
the poseval metric is an alternative. It treats each token in the dataset as an independent \
observation and computes the precision, recall and F1-score irrespective of sentences. It uses scikit-learn's \
[classification report](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html) \
to compute the scores.
"""
_KWARGS_DESCRIPTION = """
Computes the poseval metric.
Args:
predictions: List of List of predicted labels (Estimated targets as returned by a tagger)
references: List of List of reference labels (Ground truth (correct) target values)
zero_division: Which value to substitute as a metric value when encountering zero division. Should be one of 0, 1,
"warn". "warn" acts as 0, but a warning is also raised.
Returns:
'scores': dict. Summary of the scores for overall and per type
Overall (weighted and macro avg):
'accuracy': accuracy,
'precision': precision,
'recall': recall,
'f1': F1 score, also known as balanced F-score or F-measure,
Per type:
'precision': precision,
'recall': recall,
'f1': F1 score, also known as balanced F-score or F-measure
Examples:
>>> predictions = [['INTJ', 'ADP', 'PROPN', 'NOUN', 'PUNCT', 'INTJ', 'ADP', 'PROPN', 'VERB', 'SYM']]
>>> references = [['INTJ', 'ADP', 'PROPN', 'PROPN', 'PUNCT', 'INTJ', 'ADP', 'PROPN', 'PROPN', 'SYM']]
>>> poseval = evaluate.load("poseval")
>>> results = poseval.compute(predictions=predictions, references=references)
>>> print(list(results.keys()))
['ADP', 'INTJ', 'NOUN', 'PROPN', 'PUNCT', 'SYM', 'VERB', 'accuracy', 'macro avg', 'weighted avg']
>>> print(results["accuracy"])
0.8
>>> print(results["PROPN"]["recall"])
0.5
"""
@evaluate.utils.file_utils.add_start_docstrings(_DESCRIPTION, _KWARGS_DESCRIPTION)
class Poseval(evaluate.Metric):
def _info(self):
return evaluate.MetricInfo(
description=_DESCRIPTION,
citation=_CITATION,
homepage="https://scikit-learn.org",
inputs_description=_KWARGS_DESCRIPTION,
features=datasets.Features(
{
"predictions": datasets.Sequence(datasets.Value("string", id="label"), id="sequence"),
"references": datasets.Sequence(datasets.Value("string", id="label"), id="sequence"),
}
),
codebase_urls=["https://github.com/scikit-learn/scikit-learn"],
)
def _compute(
self,
predictions,
references,
zero_division: Union[str, int] = "warn",
):
report = classification_report(
y_true=[label for ref in references for label in ref],
y_pred=[label for pred in predictions for label in pred],
output_dict=True,
zero_division=zero_division,
)
return report
git+https://github.com/huggingface/evaluate@{COMMIT_PLACEHOLDER}
scikit-learn
---
title: Precision
emoji: 🤗
colorFrom: blue
colorTo: red
sdk: gradio
sdk_version: 3.19.1
app_file: app.py
pinned: false
tags:
- evaluate
- metric
description: >-
Precision is the fraction of correctly labeled positive examples out of all of the examples that were labeled as positive. It is computed via the equation:
Precision = TP / (TP + FP)
where TP is the True positives (i.e. the examples correctly labeled as positive) and FP is the False positive examples (i.e. the examples incorrectly labeled as positive).
---
# Metric Card for Precision
## Metric Description
Precision is the fraction of correctly labeled positive examples out of all of the examples that were labeled as positive. It is computed via the equation:
Precision = TP / (TP + FP)
where TP is the True positives (i.e. the examples correctly labeled as positive) and FP is the False positive examples (i.e. the examples incorrectly labeled as positive).
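As a quick, hand-rolled sketch of this formula (the metric itself wraps scikit-learn's `precision_score`), using the binary data from Example 1 further below:
```python
references = [0, 1, 0, 1, 0]
predictions = [0, 0, 1, 1, 0]

tp = sum(p == 1 and r == 1 for p, r in zip(predictions, references))  # correctly predicted positives -> 1
fp = sum(p == 1 and r == 0 for p, r in zip(predictions, references))  # negatives predicted as positive -> 1

print(tp / (tp + fp))  # 0.5, matching Example 1
```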
## How to Use
At minimum, precision takes as input a list of predicted labels, `predictions`, and a list of reference labels, `references`.
```python
>>> precision_metric = evaluate.load("precision")
>>> results = precision_metric.compute(references=[0, 1], predictions=[0, 1])
>>> print(results)
{'precision': 1.0}
```
### Inputs
- **predictions** (`list` of `int`): Predicted class labels.
- **references** (`list` of `int`): Actual class labels.
- **labels** (`list` of `int`): The set of labels to include when `average` is not set to `'binary'`. If `average` is `None`, it should be the label order. Labels present in the data can be excluded, for example to calculate a multiclass average ignoring a majority negative class. Labels not present in the data will result in 0 components in a macro average. For multilabel targets, labels are column indices. By default, all labels in `predictions` and `references` are used in sorted order. Defaults to None.
- **pos_label** (`int`): The class to be considered the positive class, in the case where `average` is set to `binary`. Defaults to 1.
- **average** (`string`): This parameter is required for multiclass/multilabel targets. If set to `None`, the scores for each class are returned. Otherwise, this determines the type of averaging performed on the data. Defaults to `'binary'`.
- 'binary': Only report results for the class specified by `pos_label`. This is applicable only if the classes found in `predictions` and `references` are binary.
- 'micro': Calculate metrics globally by counting the total true positives, false negatives and false positives.
- 'macro': Calculate metrics for each label, and find their unweighted mean. This does not take label imbalance into account.
- 'weighted': Calculate metrics for each label, and find their average weighted by support (the number of true instances for each label). This alters `'macro'` to account for label imbalance. This option can result in an F-score that is not between precision and recall.
- 'samples': Calculate metrics for each instance, and find their average (only meaningful for multilabel classification).
- **sample_weight** (`list` of `float`): Sample weights. Defaults to `None`.
- **zero_division** (`int` or `string`): Sets the value to return when there is a zero division. Defaults to `'warn'`.
- 0: Returns 0 when there is a zero division.
- 1: Returns 1 when there is a zero division.
- 'warn': Raises warnings and then returns 0 when there is a zero division.
### Output Values
- **precision**(`float` or `array` of `float`): Precision score or list of precision scores, depending on the value passed to `average`. Minimum possible value is 0. Maximum possible value is 1. Higher values indicate that fewer negative examples were incorrectly labeled as positive, which means that, generally, higher scores are better.
Output Example(s):
```python
{'precision': 0.2222222222222222}
```
```python
{'precision': array([0.66666667, 0.0, 0.0])}
```
#### Values from Popular Papers
### Examples
Example 1-A simple binary example
```python
>>> precision_metric = evaluate.load("precision")
>>> results = precision_metric.compute(references=[0, 1, 0, 1, 0], predictions=[0, 0, 1, 1, 0])
>>> print(results)
{'precision': 0.5}
```
Example 2-The same simple binary example as in Example 1, but with `pos_label` set to `0`.
```python
>>> precision_metric = evaluate.load("precision")
>>> results = precision_metric.compute(references=[0, 1, 0, 1, 0], predictions=[0, 0, 1, 1, 0], pos_label=0)
>>> print(round(results['precision'], 2))
0.67
```
Example 3-The same simple binary example as in Example 1, but with `sample_weight` included.
```python
>>> precision_metric = evaluate.load("precision")
>>> results = precision_metric.compute(references=[0, 1, 0, 1, 0], predictions=[0, 0, 1, 1, 0], sample_weight=[0.9, 0.5, 3.9, 1.2, 0.3])
>>> print(results)
{'precision': 0.23529411764705882}
```
Example 4-A multiclass example, with different values for the `average` input.
```python
>>> predictions = [0, 2, 1, 0, 0, 1]
>>> references = [0, 1, 2, 0, 1, 2]
>>> results = precision_metric.compute(predictions=predictions, references=references, average='macro')
>>> print(results)
{'precision': 0.2222222222222222}
>>> results = precision_metric.compute(predictions=predictions, references=references, average='micro')
>>> print(results)
{'precision': 0.3333333333333333}
>>> results = precision_metric.compute(predictions=predictions, references=references, average='weighted')
>>> print(results)
{'precision': 0.2222222222222222}
>>> results = precision_metric.compute(predictions=predictions, references=references, average=None)
>>> print([round(res, 2) for res in results['precision']])
[0.67, 0.0, 0.0]
```
## Limitations and Bias
[Precision](https://huggingface.co/metrics/precision) and [recall](https://huggingface.co/metrics/recall) are complementary and measure different aspects of model performance -- using both of them (or an averaged measure like the [F1 score](https://huggingface.co/metrics/F1)) gives a fuller picture of a model's performance. See [Wikipedia](https://en.wikipedia.org/wiki/Precision_and_recall) for more information.
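For reference, a minimal sketch of the averaged measure mentioned above: the F1 score is the harmonic mean of precision and recall (the helper name here is illustrative, not part of this module):
```python
def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; defined as 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(f1_from(0.5, 0.5))  # 0.5
```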
## Citation(s)
```bibtex
@article{scikit-learn,
title={Scikit-learn: Machine Learning in {P}ython},
author={Pedregosa, F. and Varoquaux, G. and Gramfort, A. and Michel, V.
and Thirion, B. and Grisel, O. and Blondel, M. and Prettenhofer, P.
and Weiss, R. and Dubourg, V. and Vanderplas, J. and Passos, A. and
Cournapeau, D. and Brucher, M. and Perrot, M. and Duchesnay, E.},
journal={Journal of Machine Learning Research},
volume={12},
pages={2825--2830},
year={2011}
}
```
## Further References
- [Wikipedia -- Precision and recall](https://en.wikipedia.org/wiki/Precision_and_recall)
import evaluate
from evaluate.utils import launch_gradio_widget
module = evaluate.load("precision")
launch_gradio_widget(module)
# Copyright 2020 The HuggingFace Datasets Authors and the current dataset script contributor.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Precision metric."""
import datasets
from sklearn.metrics import precision_score
import evaluate
_DESCRIPTION = """
Precision is the fraction of correctly labeled positive examples out of all of the examples that were labeled as positive. It is computed via the equation:
Precision = TP / (TP + FP)
where TP is the True positives (i.e. the examples correctly labeled as positive) and FP is the False positive examples (i.e. the examples incorrectly labeled as positive).
"""
_KWARGS_DESCRIPTION = """
Args:
predictions (`list` of `int`): Predicted class labels.
references (`list` of `int`): Actual class labels.
labels (`list` of `int`): The set of labels to include when `average` is not set to `'binary'`. If `average` is `None`, it should be the label order. Labels present in the data can be excluded, for example to calculate a multiclass average ignoring a majority negative class. Labels not present in the data will result in 0 components in a macro average. For multilabel targets, labels are column indices. By default, all labels in `predictions` and `references` are used in sorted order. Defaults to None.
pos_label (`int`): The class to be considered the positive class, in the case where `average` is set to `binary`. Defaults to 1.
average (`string`): This parameter is required for multiclass/multilabel targets. If set to `None`, the scores for each class are returned. Otherwise, this determines the type of averaging performed on the data. Defaults to `'binary'`.
- 'binary': Only report results for the class specified by `pos_label`. This is applicable only if the classes found in `predictions` and `references` are binary.
- 'micro': Calculate metrics globally by counting the total true positives, false negatives and false positives.
- 'macro': Calculate metrics for each label, and find their unweighted mean. This does not take label imbalance into account.
- 'weighted': Calculate metrics for each label, and find their average weighted by support (the number of true instances for each label). This alters `'macro'` to account for label imbalance. This option can result in an F-score that is not between precision and recall.
- 'samples': Calculate metrics for each instance, and find their average (only meaningful for multilabel classification).
sample_weight (`list` of `float`): Sample weights. Defaults to None.
zero_division (`int` or `string`): Sets the value to return when there is a zero division. Defaults to 'warn'.
- 0: Returns 0 when there is a zero division.
- 1: Returns 1 when there is a zero division.
- 'warn': Raises warnings and then returns 0 when there is a zero division.
Returns:
precision (`float` or `array` of `float`): Precision score or list of precision scores, depending on the value passed to `average`. Minimum possible value is 0. Maximum possible value is 1. Higher values indicate that fewer negative examples were incorrectly labeled as positive, which means that, generally, higher scores are better.
Examples:
Example 1-A simple binary example
>>> precision_metric = evaluate.load("precision")
>>> results = precision_metric.compute(references=[0, 1, 0, 1, 0], predictions=[0, 0, 1, 1, 0])
>>> print(results)
{'precision': 0.5}
Example 2-The same simple binary example as in Example 1, but with `pos_label` set to `0`.
>>> precision_metric = evaluate.load("precision")
>>> results = precision_metric.compute(references=[0, 1, 0, 1, 0], predictions=[0, 0, 1, 1, 0], pos_label=0)
>>> print(round(results['precision'], 2))
0.67
Example 3-The same simple binary example as in Example 1, but with `sample_weight` included.
>>> precision_metric = evaluate.load("precision")
>>> results = precision_metric.compute(references=[0, 1, 0, 1, 0], predictions=[0, 0, 1, 1, 0], sample_weight=[0.9, 0.5, 3.9, 1.2, 0.3])
>>> print(results)
{'precision': 0.23529411764705882}
Example 4-A multiclass example, with different values for the `average` input.
>>> predictions = [0, 2, 1, 0, 0, 1]
>>> references = [0, 1, 2, 0, 1, 2]
>>> results = precision_metric.compute(predictions=predictions, references=references, average='macro')
>>> print(results)
{'precision': 0.2222222222222222}
>>> results = precision_metric.compute(predictions=predictions, references=references, average='micro')
>>> print(results)
{'precision': 0.3333333333333333}
>>> results = precision_metric.compute(predictions=predictions, references=references, average='weighted')
>>> print(results)
{'precision': 0.2222222222222222}
>>> results = precision_metric.compute(predictions=predictions, references=references, average=None)
>>> print([round(res, 2) for res in results['precision']])
[0.67, 0.0, 0.0]
"""
_CITATION = """
@article{scikit-learn,
title={Scikit-learn: Machine Learning in {P}ython},
author={Pedregosa, F. and Varoquaux, G. and Gramfort, A. and Michel, V.
and Thirion, B. and Grisel, O. and Blondel, M. and Prettenhofer, P.
and Weiss, R. and Dubourg, V. and Vanderplas, J. and Passos, A. and
Cournapeau, D. and Brucher, M. and Perrot, M. and Duchesnay, E.},
journal={Journal of Machine Learning Research},
volume={12},
pages={2825--2830},
year={2011}
}
"""
@evaluate.utils.file_utils.add_start_docstrings(_DESCRIPTION, _KWARGS_DESCRIPTION)
class Precision(evaluate.Metric):
def _info(self):
return evaluate.MetricInfo(
description=_DESCRIPTION,
citation=_CITATION,
inputs_description=_KWARGS_DESCRIPTION,
features=datasets.Features(
{
"predictions": datasets.Sequence(datasets.Value("int32")),
"references": datasets.Sequence(datasets.Value("int32")),
}
if self.config_name == "multilabel"
else {
"predictions": datasets.Value("int32"),
"references": datasets.Value("int32"),
}
),
reference_urls=["https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_score.html"],
)
def _compute(
self,
predictions,
references,
labels=None,
pos_label=1,
average="binary",
sample_weight=None,
zero_division="warn",
):
score = precision_score(
references,
predictions,
labels=labels,
pos_label=pos_label,
average=average,
sample_weight=sample_weight,
zero_division=zero_division,
)
return {"precision": float(score) if score.size == 1 else score}
git+https://github.com/huggingface/evaluate@{COMMIT_PLACEHOLDER}
scikit-learn
---
title: r_squared
emoji: 🤗
colorFrom: blue
colorTo: red
sdk: gradio
sdk_version: 3.0.2
app_file: app.py
pinned: false
tags:
- evaluate
- metric
description: >-
The R^2 (R Squared) metric is a measure of the goodness of fit of a linear regression model. It is the proportion of the variance in the dependent variable that is predictable from the independent variable.
---
# Metric Card for R^2
## Metric description
R squared can be calculated using the following formula:
```python
r_squared = 1 - (Sum of Squared Errors / Sum of Squared Total)
```
where the Sum of Squared Errors is the sum of the squared differences between the predicted values and the true values, and the Sum of Squared Total is the sum of the squared differences between the true values and the mean of the true values.
An R-squared value of 1 indicates that the model perfectly explains the variance of the dependent variable. A value of 0 means that the model does not explain any of the variance. Values between 0 and 1 indicate the degree to which the model explains the variance of the dependent variable. For example, an R-squared value of 0.75 means that 75% of the variance in the dependent variable is explained by the model.
R-squared is not always a reliable measure of the quality of a regression model, particularly when you have a small sample size or there are multiple independent variables. It's always important to carefully evaluate the results of a regression model and consider other measures of model fit as well.
In practice, the R-squared value is computed in three steps:
* Calculate the residual sum of squares (RSS), which is the sum of the squared differences between the predicted values and the actual values.
* Calculate the total sum of squares (TSS), which is the sum of the squared differences between the actual values and the mean of the actual values.
* Calculate the R-squared value by taking 1 - (RSS / TSS).
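These three steps can be carried out directly with NumPy. The following is a minimal sketch using the same toy data as the usage example below; it mirrors what the module computes internally, up to rounding:
```python
import numpy as np

predictions = np.array([1, 2, 3, 4])
references = np.array([0.9, 2.1, 3.2, 3.8])

rss = np.sum((predictions - references) ** 2)           # residual sum of squares
tss = np.sum((references - np.mean(references)) ** 2)   # total sum of squares
r_squared = 1 - rss / tss

print(round(r_squared, 3))  # 0.98
```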
## How to Use
The `r_squared` metric can be used to compute the R^2 value for a given set of predictions and references. It takes two inputs: `predictions` (a list of predicted values) and `references` (a list of true values).
```python
>>> import evaluate
>>> r2_metric = evaluate.load("r_squared")
>>> r_squared = r2_metric.compute(predictions=[1, 2, 3, 4], references=[0.9, 2.1, 3.2, 3.8])
>>> print(r_squared)
0.98
```
Alternatively, if you want to see an example where there is a perfect match between the prediction and reference:
```python
>>> import evaluate
>>> r2_metric = evaluate.load("r_squared")
>>> r_squared = r2_metric.compute(predictions=[1, 2, 3, 4], references=[1, 2, 3, 4])
>>> print(r_squared)
1.0
```
## Limitations and Bias
R^2 is a statistical measure of the goodness of fit of a regression model. It represents the proportion of the variance in the dependent variable that is predictable from the independent variables. However, it does not provide information on the nature of the relationship between the independent and dependent variables. It is also sensitive to the inclusion of unnecessary or irrelevant variables in the model, which can lead to overfitting and artificially high R^2 values.
## Citation
```bibtex
@article{r_squared_model,
title={The R^2 Model Metric: A Comprehensive Guide},
author={John Doe},
journal={Journal of Model Evaluation},
volume={10},
number={2},
pages={101-112},
year={2022},
publisher={Model Evaluation Society}}
```
## Further References
- [The Open University: R-Squared](https://www.open.edu/openlearn/ocw/mod/oucontent/view.php?id=55450&section=3.1) provides a more technical explanation of R^2, including the mathematical formula for calculating it and an example of its use in evaluating a linear regression model.
- [Khan Academy: R-Squared](https://www.khanacademy.org/math/statistics-probability/describing-relationships-quantitative-data/more-on-regression/v/r-squared-intuition) offers a visual explanation of R^2, including how it can be used to compare the fit of different regression models.
import evaluate
from evaluate.utils import launch_gradio_widget
module = evaluate.load("r_squared")
launch_gradio_widget(module)
# Copyright 2020 The HuggingFace Datasets Authors and the current dataset script contributor.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""R squared metric."""
import datasets
import numpy as np
import evaluate
_CITATION = """
@article{williams2006relationship,
title={The relationship between R2 and the correlation coefficient},
author={Williams, James},
journal={Journal of Statistics Education},
volume={14},
number={2},
year={2006}
}
"""
_DESCRIPTION = """
R^2 (R Squared) is a statistical measure of the goodness of fit of a regression model. It represents the proportion of the variance in the dependent variable that is predictable from the independent variables.
The R^2 value ranges from 0 to 1, with a higher value indicating a better fit. A value of 0 means that the model does not explain any of the variance in the dependent variable, while a value of 1 means that the model explains all of the variance.
R^2 can be calculated using the following formula:
r_squared = 1 - (Sum of Squared Errors / Sum of Squared Total)
where the Sum of Squared Errors is the sum of the squared differences between the predicted values and the true values, and the Sum of Squared Total is the sum of the squared differences between the true values and the mean of the true values.
"""
_KWARGS_DESCRIPTION = """
Computes the R Squared metric.
Args:
predictions: List of predicted values of the dependent variable
references: List of true values of the dependent variable
Returns:
R^2 value ranging from 0 to 1, with a higher value indicating a better fit.
Examples:
>>> r2_metric = evaluate.load("r_squared")
>>> r_squared = r2_metric.compute(predictions=[1, 2, 3, 4], references=[0.9, 2.1, 3.2, 3.8])
>>> print(r_squared)
0.98
"""
@evaluate.utils.file_utils.add_start_docstrings(_DESCRIPTION, _KWARGS_DESCRIPTION)
class r_squared(evaluate.Metric):
def _info(self):
return evaluate.MetricInfo(
description=_DESCRIPTION,
citation=_CITATION,
inputs_description=_KWARGS_DESCRIPTION,
features=datasets.Features(
{
"predictions": datasets.Value("float", id="sequence"),
"references": datasets.Value("float", id="sequence"),
}
),
codebase_urls=["https://github.com/scikit-learn/scikit-learn/"],
reference_urls=[
"https://en.wikipedia.org/wiki/Coefficient_of_determination",
],
)
def _compute(self, predictions=None, references=None):
"""
Computes the coefficient of determination (R-squared) of predictions with respect to references.
Parameters:
predictions (List or np.ndarray): The predicted values.
references (List or np.ndarray): The true/reference values.
Returns:
float: The R-squared value, rounded to 3 decimal places.
"""
predictions = np.array(predictions)
references = np.array(references)
# Calculate mean of the references
mean_references = np.mean(references)
# Calculate sum of squared residuals
ssr = np.sum((predictions - references) ** 2)
# Calculate sum of squared total
sst = np.sum((references - mean_references) ** 2)
# Calculate R Squared
r_squared = 1 - (ssr / sst)
# Round off to 3 decimal places
rounded_r_squared = round(r_squared, 3)
return rounded_r_squared
git+https://github.com/huggingface/evaluate@{COMMIT_PLACEHOLDER}
---
title: Recall
emoji: 🤗
colorFrom: blue
colorTo: red
sdk: gradio
sdk_version: 3.19.1
app_file: app.py
pinned: false
tags:
- evaluate
- metric
description: >-
Recall is the fraction of the positive examples that were correctly labeled by the model as positive. It can be computed with the equation:
Recall = TP / (TP + FN)
Where TP is the true positives and FN is the false negatives.
---
# Metric Card for Recall
## Metric Description
Recall is the fraction of the positive examples that were correctly labeled by the model as positive. It can be computed with the equation:
Recall = TP / (TP + FN)
Where TP is the number of true positives and FN is the number of false negatives.
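As a quick, hand-rolled sketch of this formula (the metric itself wraps scikit-learn's `recall_score`), using the data from Example 1 further below:
```python
references = [0, 0, 1, 1, 1]
predictions = [0, 1, 0, 1, 1]

tp = sum(p == 1 and r == 1 for p, r in zip(predictions, references))  # positives labeled correctly -> 2
fn = sum(p == 0 and r == 1 for p, r in zip(predictions, references))  # positives missed by the model -> 1

print(tp / (tp + fn))  # 0.666..., matching Example 1
```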
## How to Use
At minimum, this metric takes as input two `list`s, each containing `int`s: predictions and references.
```python
>>> recall_metric = evaluate.load('recall')
>>> results = recall_metric.compute(references=[0, 1], predictions=[0, 1])
>>> print(results)
["{'recall': 1.0}"]
```
### Inputs
- **predictions** (`list` of `int`): The predicted labels.
- **references** (`list` of `int`): The ground truth labels.
- **labels** (`list` of `int`): The set of labels to include when `average` is not set to `binary`, and their order when average is `None`. Labels present in the data can be excluded in this input, for example to calculate a multiclass average ignoring a majority negative class, while labels not present in the data will result in 0 components in a macro average. For multilabel targets, labels are column indices. By default, all labels in y_true and y_pred are used in sorted order. Defaults to None.
- **pos_label** (`int`): The class label to use as the 'positive class' when calculating the recall. Defaults to `1`.
- **average** (`string`): This parameter is required for multiclass/multilabel targets. If None, the scores for each class are returned. Otherwise, this determines the type of averaging performed on the data. Defaults to `'binary'`.
- `'binary'`: Only report results for the class specified by `pos_label`. This is applicable only if the target labels and predictions are binary.
- `'micro'`: Calculate metrics globally by counting the total true positives, false negatives, and false positives.
- `'macro'`: Calculate metrics for each label, and find their unweighted mean. This does not take label imbalance into account.
- `'weighted'`: Calculate metrics for each label, and find their average weighted by support (the number of true instances for each label). This alters `'macro'` to account for label imbalance. Note that it can result in an F-score that is not between precision and recall.
- `'samples'`: Calculate metrics for each instance, and find their average (only meaningful for multilabel classification).
- **sample_weight** (`list` of `float`): Sample weights. Defaults to `None`.
- **zero_division** (`int` or `string`): Sets the value to return when there is a zero division. Defaults to `'warn'`.
- `'warn'`: If there is a zero division, the return value is `0`, but warnings are also raised.
- `0`: If there is a zero division, the return value is `0`.
- `1`: If there is a zero division, the return value is `1`.
### Output Values
- **recall**(`float`, or `array` of `float`, for multiclass targets): Either the general recall score, or the recall scores for individual classes, depending on the values input to `labels` and `average`. Minimum possible value is 0. Maximum possible value is 1. A higher recall means that more of the positive examples have been labeled correctly. Therefore, a higher recall is generally considered better.
Output Example(s):
```python
{'recall': 1.0}
```
```python
{'recall': array([1., 0., 0.])}
```
This metric outputs a dictionary with one entry, `'recall'`.
#### Values from Popular Papers
### Examples
Example 1-A simple example with some errors
```python
>>> recall_metric = evaluate.load('recall')
>>> results = recall_metric.compute(references=[0, 0, 1, 1, 1], predictions=[0, 1, 0, 1, 1])
>>> print(results)
{'recall': 0.6666666666666666}
```
Example 2-The same example as Example 1, but with `pos_label=0` instead of the default `pos_label=1`.
```python
>>> recall_metric = evaluate.load('recall')
>>> results = recall_metric.compute(references=[0, 0, 1, 1, 1], predictions=[0, 1, 0, 1, 1], pos_label=0)
>>> print(results)
{'recall': 0.5}
```
Example 3-The same example as Example 1, but with `sample_weight` included.
```python
>>> recall_metric = evaluate.load('recall')
>>> sample_weight = [0.9, 0.2, 0.9, 0.3, 0.8]
>>> results = recall_metric.compute(references=[0, 0, 1, 1, 1], predictions=[0, 1, 0, 1, 1], sample_weight=sample_weight)
>>> print(results)
{'recall': 0.55}
```
Example 4-A multiclass example, using different averages.
```python
>>> recall_metric = evaluate.load('recall')
>>> predictions = [0, 2, 1, 0, 0, 1]
>>> references = [0, 1, 2, 0, 1, 2]
>>> results = recall_metric.compute(predictions=predictions, references=references, average='macro')
>>> print(results)
{'recall': 0.3333333333333333}
>>> results = recall_metric.compute(predictions=predictions, references=references, average='micro')
>>> print(results)
{'recall': 0.3333333333333333}
>>> results = recall_metric.compute(predictions=predictions, references=references, average='weighted')
>>> print(results)
{'recall': 0.3333333333333333}
>>> results = recall_metric.compute(predictions=predictions, references=references, average=None)
>>> print(results)
{'recall': array([1., 0., 0.])}
```
## Limitations and Bias
## Citation(s)
```bibtex
@article{scikit-learn,
title={Scikit-learn: Machine Learning in {P}ython},
author={Pedregosa, F. and Varoquaux, G. and Gramfort, A. and Michel, V.
and Thirion, B. and Grisel, O. and Blondel, M. and Prettenhofer, P.
and Weiss, R. and Dubourg, V. and Vanderplas, J. and Passos, A. and
Cournapeau, D. and Brucher, M. and Perrot, M. and Duchesnay, E.},
journal={Journal of Machine Learning Research},
volume={12},
pages={2825--2830},
year={2011}
}
```
## Further References
import evaluate
from evaluate.utils import launch_gradio_widget
module = evaluate.load("recall")
launch_gradio_widget(module)
# Copyright 2020 The HuggingFace Datasets Authors and the current dataset script contributor.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Recall metric."""
import datasets
from sklearn.metrics import recall_score
import evaluate
_DESCRIPTION = """
Recall is the fraction of the positive examples that were correctly labeled by the model as positive. It can be computed with the equation:
Recall = TP / (TP + FN)
Where TP is the true positives and FN is the false negatives.
"""
_KWARGS_DESCRIPTION = """
Args:
- **predictions** (`list` of `int`): The predicted labels.
- **references** (`list` of `int`): The ground truth labels.
- **labels** (`list` of `int`): The set of labels to include when `average` is not set to `binary`, and their order when average is `None`. Labels present in the data can be excluded in this input, for example to calculate a multiclass average ignoring a majority negative class, while labels not present in the data will result in 0 components in a macro average. For multilabel targets, labels are column indices. By default, all labels in y_true and y_pred are used in sorted order. Defaults to None.
- **pos_label** (`int`): The class label to use as the 'positive class' when calculating the recall. Defaults to `1`.
- **average** (`string`): This parameter is required for multiclass/multilabel targets. If None, the scores for each class are returned. Otherwise, this determines the type of averaging performed on the data. Defaults to `'binary'`.
- `'binary'`: Only report results for the class specified by `pos_label`. This is applicable only if the target labels and predictions are binary.
- `'micro'`: Calculate metrics globally by counting the total true positives, false negatives, and false positives.
- `'macro'`: Calculate metrics for each label, and find their unweighted mean. This does not take label imbalance into account.
- `'weighted'`: Calculate metrics for each label, and find their average weighted by support (the number of true instances for each label). This alters `'macro'` to account for label imbalance. Note that it can result in an F-score that is not between precision and recall.
- `'samples'`: Calculate metrics for each instance, and find their average (only meaningful for multilabel classification).
- **sample_weight** (`list` of `float`): Sample weights. Defaults to `None`.
- **zero_division** (`int` or `string`): Sets the value to return when there is a zero division. Defaults to `'warn'`.
- `'warn'`: If there is a zero division, the return value is `0`, but warnings are also raised.
- `0`: If there is a zero division, the return value is `0`.
- `1`: If there is a zero division, the return value is `1`.
Returns:
- **recall** (`float`, or `array` of `float`): Either the general recall score, or the recall scores for individual classes, depending on the values input to `labels` and `average`. Minimum possible value is 0. Maximum possible value is 1. A higher recall means that more of the positive examples have been labeled correctly. Therefore, a higher recall is generally considered better.
Examples:
Example 1-A simple example with some errors
>>> recall_metric = evaluate.load('recall')
>>> results = recall_metric.compute(references=[0, 0, 1, 1, 1], predictions=[0, 1, 0, 1, 1])
>>> print(results)
{'recall': 0.6666666666666666}
Example 2-The same example as Example 1, but with `pos_label=0` instead of the default `pos_label=1`.
>>> recall_metric = evaluate.load('recall')
>>> results = recall_metric.compute(references=[0, 0, 1, 1, 1], predictions=[0, 1, 0, 1, 1], pos_label=0)
>>> print(results)
{'recall': 0.5}
Example 3-The same example as Example 1, but with `sample_weight` included.
>>> recall_metric = evaluate.load('recall')
>>> sample_weight = [0.9, 0.2, 0.9, 0.3, 0.8]
>>> results = recall_metric.compute(references=[0, 0, 1, 1, 1], predictions=[0, 1, 0, 1, 1], sample_weight=sample_weight)
>>> print(results)
{'recall': 0.55}
Example 4-A multiclass example, using different averages.
>>> recall_metric = evaluate.load('recall')
>>> predictions = [0, 2, 1, 0, 0, 1]
>>> references = [0, 1, 2, 0, 1, 2]
>>> results = recall_metric.compute(predictions=predictions, references=references, average='macro')
>>> print(results)
{'recall': 0.3333333333333333}
>>> results = recall_metric.compute(predictions=predictions, references=references, average='micro')
>>> print(results)
{'recall': 0.3333333333333333}
>>> results = recall_metric.compute(predictions=predictions, references=references, average='weighted')
>>> print(results)
{'recall': 0.3333333333333333}
>>> results = recall_metric.compute(predictions=predictions, references=references, average=None)
>>> print(results)
{'recall': array([1., 0., 0.])}
"""
_CITATION = """
@article{scikit-learn,
title={Scikit-learn: Machine Learning in {P}ython},
author={Pedregosa, F. and Varoquaux, G. and Gramfort, A. and Michel, V.
and Thirion, B. and Grisel, O. and Blondel, M. and Prettenhofer, P.
and Weiss, R. and Dubourg, V. and Vanderplas, J. and Passos, A. and
Cournapeau, D. and Brucher, M. and Perrot, M. and Duchesnay, E.},
journal={Journal of Machine Learning Research},
volume={12},
pages={2825--2830},
year={2011}
}
"""
@evaluate.utils.file_utils.add_start_docstrings(_DESCRIPTION, _KWARGS_DESCRIPTION)
class Recall(evaluate.Metric):
def _info(self):
return evaluate.MetricInfo(
description=_DESCRIPTION,
citation=_CITATION,
inputs_description=_KWARGS_DESCRIPTION,
features=datasets.Features(
{
"predictions": datasets.Sequence(datasets.Value("int32")),
"references": datasets.Sequence(datasets.Value("int32")),
}
if self.config_name == "multilabel"
else {
"predictions": datasets.Value("int32"),
"references": datasets.Value("int32"),
}
),
reference_urls=["https://scikit-learn.org/stable/modules/generated/sklearn.metrics.recall_score.html"],
)
def _compute(
self,
predictions,
references,
labels=None,
pos_label=1,
average="binary",
sample_weight=None,
zero_division="warn",
):
score = recall_score(
references,
predictions,
labels=labels,
pos_label=pos_label,
average=average,
sample_weight=sample_weight,
zero_division=zero_division,
)
return {"recall": float(score) if score.size == 1 else score}
git+https://github.com/huggingface/evaluate@{COMMIT_PLACEHOLDER}
scikit-learn
---
title: RL Reliability
emoji: 🤗
colorFrom: blue
colorTo: red
sdk: gradio
sdk_version: 3.19.1
app_file: app.py
pinned: false
tags:
- evaluate
- metric
description: >-
Computes the RL reliability metrics from a set of experiments. There is an `"online"` and `"offline"` configuration for evaluation.
---
# Metric Card for RL Reliability
## Metric Description
The RL Reliability Metrics library provides a set of metrics for measuring the reliability of reinforcement learning (RL) algorithms.
## How to Use
```python
import evaluate
import numpy as np
rl_reliability = evaluate.load("rl_reliability", "online")
results = rl_reliability.compute(
timesteps=[np.linspace(0, 2000000, 1000)],
rewards=[np.linspace(0, 100, 1000)]
)
rl_reliability = evaluate.load("rl_reliability", "offline")
results = rl_reliability.compute(
timesteps=[np.linspace(0, 2000000, 1000)],
rewards=[np.linspace(0, 100, 1000)]
)
```
### Inputs
- **timesteps** *(List[int]): For each run, a list/array with its timesteps.*
- **rewards** *(List[float]): For each run, a list/array with its rewards.*
KWARGS:
- **baseline="default"** *(Union[str, float]) Normalization used for curves. When `"default"` is passed the curves are normalized by their range in the online setting and by the median performance across runs in the offline case. When a float is passed the curves are divided by that value.*
- **eval_points=[50000, 150000, ..., 1950000]** *(List[int]) Statistics will be computed at these points.*
- **freq_thresh=0.01** *(float) Frequency threshold for low-pass filtering.*
- **window_size=100000** *(int) Defines a window centered at each eval point.*
- **window_size_trimmed=99000** *(int) Window size used to handle curves that are shortened by differencing.*
- **alpha=0.05** *(float) The "value at risk" (VaR) cutoff point, a float in the range [0,1].*
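A hedged usage sketch showing how these keyword arguments can be overridden; the specific values below are illustrative assumptions, not recommendations:
```python
import evaluate
import numpy as np

rl_reliability = evaluate.load("rl_reliability", "online")
results = rl_reliability.compute(
    timesteps=[np.linspace(0, 1000000, 500)],
    rewards=[np.linspace(0, 100, 500)],
    baseline=1.0,                       # divide curves by a fixed value instead of the default normalization
    eval_points=[200000, 400000, 600000, 800000],
    freq_thresh=0.01,
    window_size=50000,
    window_size_trimmed=49000,
    alpha=0.05,
)
```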
### Output Values
In `"online"` mode:
- HighFreqEnergyWithinRuns: High Frequency across Time (DT)
- IqrWithinRuns: IQR across Time (DT)
- MadWithinRuns: MAD across Time (DT)
- StddevWithinRuns: Stddev across Time (DT)
- LowerCVaROnDiffs: Lower CVaR on Differences (SRT)
- UpperCVaROnDiffs: Upper CVaR on Differences (SRT)
- MaxDrawdown: Max Drawdown (LRT)
- LowerCVaROnDrawdown: Lower CVaR on Drawdown (LRT)
- UpperCVaROnDrawdown: Upper CVaR on Drawdown (LRT)
- LowerCVaROnRaw: Lower CVaR on Raw
- UpperCVaROnRaw: Upper CVaR on Raw
- IqrAcrossRuns: IQR across Runs (DR)
- MadAcrossRuns: MAD across Runs (DR)
- StddevAcrossRuns: Stddev across Runs (DR)
- LowerCVaROnAcross: Lower CVaR across Runs (RR)
- UpperCVaROnAcross: Upper CVaR across Runs (RR)
- MedianPerfDuringTraining: Median Performance across Runs
In `"offline"` mode:
- MadAcrossRollouts: MAD across rollouts (DF)
- IqrAcrossRollouts: IQR across rollouts (DF)
- LowerCVaRAcrossRollouts: Lower CVaR across rollouts (RF)
- UpperCVaRAcrossRollouts: Upper CVaR across rollouts (RF)
- MedianPerfAcrossRollouts: Median Performance across rollouts
### Examples
First get the sample data from the repository:
```bash
wget https://storage.googleapis.com/rl-reliability-metrics/data/tf_agents_example_csv_dataset.tgz
tar -xvzf tf_agents_example_csv_dataset.tgz
```
Load the sample data:
```python
import pandas as pd

dfs = [pd.read_csv(f"./csv_data/sac_humanoid_{i}_train.csv") for i in range(1, 4)]
```
Compute the metrics:
```python
import evaluate

rl_reliability = evaluate.load("rl_reliability", "online")
rl_reliability.compute(timesteps=[df["Metrics/EnvironmentSteps"] for df in dfs],
rewards=[df["Metrics/AverageReturn"] for df in dfs])
```
## Limitations and Bias
This implementation of RL reliability metrics does not compute permutation tests to determine whether algorithms are statistically different in their metric values and also does not compute bootstrap confidence intervals on the rankings of the algorithms. See the [original library](https://github.com/google-research/rl-reliability-metrics/) for more resources.
## Citation
```bibtex
@conference{rl_reliability_metrics,
title = {Measuring the Reliability of Reinforcement Learning Algorithms},
author = {Stephanie CY Chan, Sam Fishman, John Canny, Anoop Korattikara, and Sergio Guadarrama},
booktitle = {International Conference on Learning Representations, Addis Ababa, Ethiopia},
year = 2020,
}
```
## Further References
- Homepage: https://github.com/google-research/rl-reliability-metrics
import evaluate
from evaluate.utils import launch_gradio_widget
module = evaluate.load("rl_reliability", "online")
launch_gradio_widget(module)
git+https://github.com/huggingface/evaluate@{COMMIT_PLACEHOLDER}
git+https://github.com/google-research/rl-reliability-metrics
scipy
tensorflow
gin-config
# Copyright 2020 The HuggingFace Datasets Authors and the current dataset script contributor.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Computes the RL Reliability Metrics."""
import datasets
import numpy as np
from rl_reliability_metrics.evaluation import eval_metrics
from rl_reliability_metrics.metrics import metrics_offline, metrics_online
import evaluate
logger = evaluate.logging.get_logger(__name__)
DEFAULT_EVAL_POINTS = [
50000,
150000,
250000,
350000,
450000,
550000,
650000,
750000,
850000,
950000,
1050000,
1150000,
1250000,
1350000,
1450000,
1550000,
1650000,
1750000,
1850000,
1950000,
]
N_RUNS_RECOMMENDED = 10
_CITATION = """\
@conference{rl_reliability_metrics,
title = {Measuring the Reliability of Reinforcement Learning Algorithms},
author = {Stephanie CY Chan, Sam Fishman, John Canny, Anoop Korattikara, and Sergio Guadarrama},
booktitle = {International Conference on Learning Representations, Addis Ababa, Ethiopia},
year = 2020,
}
"""
_DESCRIPTION = """\
Computes the RL reliability metrics from a set of experiments. There is an `"online"` and `"offline"` configuration for evaluation.
"""
_KWARGS_DESCRIPTION = """
Computes the RL reliability metrics from a set of experiments. There is an `"online"` and `"offline"` configuration for evaluation.
Args:
timesteps: list of timestep lists/arrays that serve as index.
rewards: list of reward lists/arrays of each experiment.
Returns:
dictionary: a set of reliability metrics
Examples:
>>> import numpy as np
>>> rl_reliability = evaluate.load("rl_reliability", "online")
>>> results = rl_reliability.compute(
... timesteps=[np.linspace(0, 2000000, 1000)],
... rewards=[np.linspace(0, 100, 1000)]
... )
>>> print(results["LowerCVaROnRaw"].round(4))
[0.0258]
"""
@evaluate.utils.file_utils.add_start_docstrings(_DESCRIPTION, _KWARGS_DESCRIPTION)
class RLReliability(evaluate.Metric):
"""Computes the RL Reliability Metrics."""
def _info(self):
if self.config_name not in ["online", "offline"]:
raise KeyError("""You should supply a configuration name selected in '["online", "offline"]'""")
return evaluate.MetricInfo(
module_type="metric",
description=_DESCRIPTION,
citation=_CITATION,
inputs_description=_KWARGS_DESCRIPTION,
features=datasets.Features(
{
"timesteps": datasets.Sequence(datasets.Value("int64")),
"rewards": datasets.Sequence(datasets.Value("float")),
}
),
homepage="https://github.com/google-research/rl-reliability-metrics",
)
def _compute(
self,
timesteps,
rewards,
baseline="default",
freq_thresh=0.01,
window_size=100000,
window_size_trimmed=99000,
alpha=0.05,
eval_points=None,
):
if len(timesteps) < N_RUNS_RECOMMENDED:
logger.warning(
f"For robust statistics it is recommended to use at least {N_RUNS_RECOMMENDED} runs whereas you provided {len(timesteps)}."
)
curves = []
for timestep, reward in zip(timesteps, rewards):
curves.append(np.stack([timestep, reward]))
if self.config_name == "online":
if baseline == "default":
baseline = "curve_range"
if eval_points is None:
eval_points = DEFAULT_EVAL_POINTS
metrics = [
metrics_online.HighFreqEnergyWithinRuns(thresh=freq_thresh),
metrics_online.IqrWithinRuns(
window_size=window_size_trimmed, eval_points=eval_points, baseline=baseline
),
metrics_online.IqrAcrossRuns(
lowpass_thresh=freq_thresh, eval_points=eval_points, window_size=window_size, baseline=baseline
),
metrics_online.LowerCVaROnDiffs(baseline=baseline),
metrics_online.LowerCVaROnDrawdown(baseline=baseline),
metrics_online.LowerCVaROnAcross(
lowpass_thresh=freq_thresh, eval_points=eval_points, window_size=window_size, baseline=baseline
),
metrics_online.LowerCVaROnRaw(alpha=alpha, baseline=baseline),
metrics_online.MadAcrossRuns(
lowpass_thresh=freq_thresh, eval_points=eval_points, window_size=window_size, baseline=baseline
),
metrics_online.MadWithinRuns(
eval_points=eval_points, window_size=window_size_trimmed, baseline=baseline
),
metrics_online.MaxDrawdown(),
metrics_online.StddevAcrossRuns(
lowpass_thresh=freq_thresh, eval_points=eval_points, window_size=window_size, baseline=baseline
),
metrics_online.StddevWithinRuns(
eval_points=eval_points, window_size=window_size_trimmed, baseline=baseline
),
metrics_online.UpperCVaROnAcross(
alpha=alpha,
lowpass_thresh=freq_thresh,
eval_points=eval_points,
window_size=window_size,
baseline=baseline,
),
metrics_online.UpperCVaROnDiffs(alpha=alpha, baseline=baseline),
metrics_online.UpperCVaROnDrawdown(alpha=alpha, baseline=baseline),
metrics_online.UpperCVaROnRaw(alpha=alpha, baseline=baseline),
metrics_online.MedianPerfDuringTraining(window_size=window_size, eval_points=eval_points),
]
else:
if baseline == "default":
baseline = "median_perf"
metrics = [
metrics_offline.MadAcrossRollouts(baseline=baseline),
metrics_offline.IqrAcrossRollouts(baseline=baseline),
metrics_offline.StddevAcrossRollouts(baseline=baseline),
metrics_offline.LowerCVaRAcrossRollouts(alpha=alpha, baseline=baseline),
metrics_offline.UpperCVaRAcrossRollouts(alpha=alpha, baseline=baseline),
metrics_offline.MedianPerfAcrossRollouts(baseline=None),
]
evaluator = eval_metrics.Evaluator(metrics=metrics)
result = evaluator.compute_metrics(curves)
return result