Unverified Commit b5b01a49 authored by Lintang Sutawika, committed by GitHub

Merge pull request #780 from EleutherAI/fix-metrics

[Refactor] Fix metrics in Greedy Until
parents 2f870265 48e32540
@@ -4,3 +4,4 @@ nin
 maka
 mor
 te
+ond
\ No newline at end of file
@@ -1044,37 +1044,37 @@ class ConfigurableTask(Task):
             else:
                 gold = str(gold)
 
-            for key, result in zip(self._metric_fn_list.keys(), results):
+            result = results[0]
+            for metric in self._metric_fn_list.keys():
                 if self.multiple_target:
                     # in the case where we have multiple targets,
                     # return true if any are true
                     # TODO: this may break for multipLe_target, non zero-or-1 metrics
                     scores = []
                     for gold_option in gold:
-                        res = self._metric_fn_list[key](
+                        res = self._metric_fn_list[metric](
                             references=[gold_option],
                             predictions=[result],
-                            **self._metric_fn_kwargs[key],
+                            **self._metric_fn_kwargs[metric],
                         )
                         if isinstance(res, dict):
                             # TODO: this handles the case where HF evaluate returns a dict.
-                            res = res[key]
+                            res = res[metric]
                         scores.append(res)
                     if any(scores):
                         result_score = 1.0
                     else:
                         result_score = 0.0
                 else:
-                    result_score = self._metric_fn_list[key](
+                    result_score = self._metric_fn_list[metric](
                         references=[gold],
                         predictions=[result],
-                        **self._metric_fn_kwargs[key],
+                        **self._metric_fn_kwargs[metric],
                     )
-
-                if isinstance(result_score, dict):
-                    result_dict.update(result_score)
-                else:
-                    result_dict[key] = result_score
+                if isinstance(result_score, dict):
+                    # TODO: this handles the case where HF evaluate returns a dict.
+                    result_score = result_score[metric]
+                result_dict[metric] = result_score
         else:
             raise ValueError(
                 f"Passed invalid output_type '{self.OUTPUT_TYPE}' ! Please use one of ",
...
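For readers skimming the hunk above: a `greedy_until` request yields exactly one generated string per document, so the loop now takes `results[0]` once and applies every configured metric to it; with multiple gold targets, the document scores 1.0 if any target matches. Below is a minimal standalone sketch of that flow only; `metric_fns` and `exact_match` are illustrative stand-ins, not the harness's internal API.

```python
# Minimal sketch of the scoring flow introduced by this hunk.
# `metric_fns` / `exact_match` are illustrative stand-ins, not harness internals.
def score_greedy_until(results, gold, metric_fns, multiple_target=False):
    result = results[0]  # greedy_until yields one generated string per document
    result_dict = {}
    for metric, fn in metric_fns.items():
        if multiple_target:
            # correct if the prediction matches any of the gold options
            scores = [fn(references=[g], predictions=[result]) for g in gold]
            result_dict[metric] = 1.0 if any(scores) else 0.0
        else:
            result_dict[metric] = fn(references=[gold], predictions=[result])
    return result_dict


def exact_match(references, predictions):
    # toy stand-in for a real exact-match metric
    return float(predictions[0].strip() == references[0].strip())


print(score_greedy_until(["Paris"], ["Paris", "paris"],
                         {"exact_match": exact_match}, multiple_target=True))
# {'exact_match': 1.0}
```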
@@ -15,5 +15,6 @@ metric_list:
     higher_is_better: true
     ignore_case: true
     ignore_punctuation: true
-  - metric: f1
-    aggregation: !function "aggregate.cb_multi_fi"
+  - metric: !function "t5_utils.mean_3class_f1"
+    aggregation: !function "t5_utils.agg_mean_3class_f1"
+    higher_is_better: true
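The hunk above swaps the built-in `f1` entry (and its `aggregate.cb_multi_fi` aggregation) for a custom pair loaded with `!function` from the task's `t5_utils` module. As a hedged sketch only (the real `t5_utils` helpers may differ), such a pair typically passes each (gold, prediction) through per document and computes a macro-averaged F1 over the three CB labels at aggregation time.

```python
# Hedged sketch of a mean_3class_f1 / agg_mean_3class_f1 pair; the actual
# t5_utils implementation in the harness may differ in details.
import sklearn.metrics


def mean_3class_f1(predictions, references):
    # per-document step: just collect the (gold, prediction) pair
    return (references[0], predictions[0])


def agg_mean_3class_f1(items):
    # aggregation step: macro-averaged F1 over all accumulated pairs
    golds, preds = zip(*items)
    return sklearn.metrics.f1_score(golds, preds, average="macro")
```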
# WMT16
### Paper
Title: `Findings of the 2016 Conference on Machine Translation`
Abstract: http://www.aclweb.org/anthology/W/W16/W16-2301
Homepage: https://huggingface.co/datasets/wmt16
### Citation
```
@InProceedings{bojar-EtAl:2016:WMT1,
author = {Bojar, Ond
{r}ej and Chatterjee, Rajen and Federmann, Christian and Graham, Yvette and Haddow, Barry and Huck, Matthias and Jimeno Yepes, Antonio and Koehn, Philipp and Logacheva, Varvara and Monz, Christof and Negri, Matteo and Neveol, Aurelie and Neves, Mariana and Popel, Martin and Post, Matt and Rubino, Raphael and Scarton, Carolina and Specia, Lucia and Turchi, Marco and Verspoor, Karin and Zampieri, Marcos},
title = {Findings of the 2016 Conference on Machine Translation},
booktitle = {Proceedings of the First Conference on Machine Translation},
month = {August},
year = {2016},
address = {Berlin, Germany},
publisher = {Association for Computational Linguistics},
pages = {131--198},
url = {http://www.aclweb.org/anthology/W/W16/W16-2301}
}
```
### Groups and Tasks
#### Groups
* `wmt-t5-prompt`: Group for all wmt tasks with prompt templates used for T5 (`Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer`)
#### Tasks
With specific prompt styles:
* `wmt-ro-en-t5-prompt`: WMT16 with the prompt template used for T5
### Checklist
For adding novel benchmarks/datasets to the library:
* [ ] Is the task an existing benchmark in the literature?
* [ ] Have you referenced the original paper that introduced the task?
* [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
If other tasks on this dataset are already supported:
* [ ] Is the "Main" variant of this task clearly denoted?
* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
import evaluate


def bleu(predictions, references):
    # Per-document step: pass the (prediction, reference) pair through unchanged
    # so BLEU can be computed at the corpus level during aggregation.
    return (predictions[0], references[0])


def agg_bleu(items):
    # Aggregation step: compute corpus-level BLEU over all accumulated pairs.
    bleu_fn = evaluate.load("bleu")
    predictions, references = zip(*items)
    return bleu_fn.compute(predictions=predictions, references=references)["bleu"]
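Outside the harness, the two helpers above compose as follows; the Romanian sentences are invented purely for illustration.

```python
# Build per-document pass-through items with `bleu`, then aggregate once.
items = [
    bleu(predictions=["o pisică stă pe covor"], references=["o pisică stă pe covor"]),
    bleu(predictions=["bună dimineața tuturor"], references=["bună ziua tuturor"]),
]
print(agg_bleu(items))  # single corpus-level BLEU score over both pairs
```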
group:
  - wmt-t5-prompt
task: wmt-ro-en-t5-prompt
dataset_path: wmt16
dataset_name: ro-en
training_split: train
validation_split: validation
output_type: greedy_until
doc_to_text: "translate English to Romanian: {{translation.en}}"
doc_to_target: "{{translation.ro}}"
metric_list:
  - metric: wer
    aggregation: mean
    higher_is_better: false
  - metric: !function metrics.bleu
    aggregation: !function metrics.agg_bleu
    higher_is_better: true
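To make the two Jinja templates concrete: each wmt16 `ro-en` row is a dict of the form `{"translation": {"en": ..., "ro": ...}}`, and the fields above render against it roughly as in this sketch (the sample sentence is invented, and the harness's own template handling may apply additional processing).

```python
# Rough sketch of how the YAML's doc_to_text / doc_to_target templates expand.
# The sample sentence is invented; real wmt16 ro-en rows share this structure.
from jinja2 import Template

doc = {"translation": {"en": "The cat sits on the mat.", "ro": "Pisica stă pe covor."}}

print(Template("translate English to Romanian: {{translation.en}}").render(**doc))
# translate English to Romanian: The cat sits on the mat.
print(Template("{{translation.ro}}").render(**doc))
# Pisica stă pe covor.
```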