Unverified Commit b5b01a49 authored by Lintang Sutawika, committed by GitHub

Merge pull request #780 from EleutherAI/fix-metrics

[Refactor] Fix metrics in Greedy Until
parents 2f870265 48e32540
@@ -4,3 +4,4 @@ nin
 maka
 mor
 te
+ond
\ No newline at end of file
@@ -1044,37 +1044,37 @@ class ConfigurableTask(Task):
             else:
                 gold = str(gold)
 
-            for key, result in zip(self._metric_fn_list.keys(), results):
+            result = results[0]
+            for metric in self._metric_fn_list.keys():
                 if self.multiple_target:
                     # in the case where we have multiple targets,
                     # return true if any are true
                     # TODO: this may break for multipLe_target, non zero-or-1 metrics
                     scores = []
                     for gold_option in gold:
-                        res = self._metric_fn_list[key](
+                        res = self._metric_fn_list[metric](
                             references=[gold_option],
                             predictions=[result],
-                            **self._metric_fn_kwargs[key],
+                            **self._metric_fn_kwargs[metric],
                         )
                         if isinstance(res, dict):
                             # TODO: this handles the case where HF evaluate returns a dict.
-                            res = res[key]
+                            res = res[metric]
                         scores.append(res)
                     if any(scores):
                         result_score = 1.0
                     else:
                         result_score = 0.0
                 else:
-                    result_score = self._metric_fn_list[key](
+                    result_score = self._metric_fn_list[metric](
                         references=[gold],
                         predictions=[result],
-                        **self._metric_fn_kwargs[key],
+                        **self._metric_fn_kwargs[metric],
                     )
-
-                if isinstance(result_score, dict):
-                    result_dict.update(result_score)
-                else:
-                    result_dict[key] = result_score
+                if isinstance(result_score, dict):
+                    # TODO: this handles the case where HF evaluate returns a dict.
+                    result_score = result_score[metric]
+                result_dict[metric] = result_score
         else:
             raise ValueError(
                 f"Passed invalid output_type '{self.OUTPUT_TYPE}' ! Please use one of ",
...
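For readers skimming the hunk above: a `greedy_until` request yields exactly one generated string per document, so the loop now takes `results[0]` once and applies every configured metric to it; with multiple gold targets, the document scores 1.0 if any target matches. Below is a minimal standalone sketch of that flow only; `metric_fns` and `exact_match` are illustrative stand-ins, not the harness's internal API.

```python
# Minimal sketch of the scoring flow introduced by this hunk.
# `metric_fns` / `exact_match` are illustrative stand-ins, not harness internals.
def score_greedy_until(results, gold, metric_fns, multiple_target=False):
    result = results[0]  # greedy_until yields one generated string per document
    result_dict = {}
    for metric, fn in metric_fns.items():
        if multiple_target:
            # correct if the prediction matches any of the gold options
            scores = [fn(references=[g], predictions=[result]) for g in gold]
            result_dict[metric] = 1.0 if any(scores) else 0.0
        else:
            result_dict[metric] = fn(references=[gold], predictions=[result])
    return result_dict


def exact_match(references, predictions):
    # toy stand-in for a real exact-match metric
    return float(predictions[0].strip() == references[0].strip())


print(score_greedy_until(["Paris"], ["Paris", "paris"],
                         {"exact_match": exact_match}, multiple_target=True))
# {'exact_match': 1.0}
```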
@@ -15,5 +15,6 @@ metric_list:
     higher_is_better: true
     ignore_case: true
     ignore_punctuation: true
-  - metric: f1
-    aggregation: !function "aggregate.cb_multi_fi"
+  - metric: !function "t5_utils.mean_3class_f1"
+    aggregation: !function "t5_utils.agg_mean_3class_f1"
+    higher_is_better: true
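The hunk above swaps the built-in `f1` entry (and its `aggregate.cb_multi_fi` aggregation) for a custom pair loaded with `!function` from the task's `t5_utils` module. As a hedged sketch only (the real `t5_utils` helpers may differ), such a pair typically passes each (gold, prediction) through per document and computes a macro-averaged F1 over the three CB labels at aggregation time.

```python
# Hedged sketch of a mean_3class_f1 / agg_mean_3class_f1 pair; the actual
# t5_utils implementation in the harness may differ in details.
import sklearn.metrics


def mean_3class_f1(predictions, references):
    # per-document step: just collect the (gold, prediction) pair
    return (references[0], predictions[0])


def agg_mean_3class_f1(items):
    # aggregation step: macro-averaged F1 over all accumulated pairs
    golds, preds = zip(*items)
    return sklearn.metrics.f1_score(golds, preds, average="macro")
```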
# WMT16
### Paper
Title: `Findings of the 2016 Conference on Machine Translation`
Abstract: http://www.aclweb.org/anthology/W/W16/W16-2301
Homepage: https://huggingface.co/datasets/wmt16
### Citation
```
@InProceedings{bojar-EtAl:2016:WMT1,
author = {Bojar, Ond
{r}ej and Chatterjee, Rajen and Federmann, Christian and Graham, Yvette and Haddow, Barry and Huck, Matthias and Jimeno Yepes, Antonio and Koehn, Philipp and Logacheva, Varvara and Monz, Christof and Negri, Matteo and Neveol, Aurelie and Neves, Mariana and Popel, Martin and Post, Matt and Rubino, Raphael and Scarton, Carolina and Specia, Lucia and Turchi, Marco and Verspoor, Karin and Zampieri, Marcos},
title = {Findings of the 2016 Conference on Machine Translation},
booktitle = {Proceedings of the First Conference on Machine Translation},
month = {August},
year = {2016},
address = {Berlin, Germany},
publisher = {Association for Computational Linguistics},
pages = {131--198},
url = {http://www.aclweb.org/anthology/W/W16/W16-2301}
}
```
### Groups and Tasks
#### Groups
* `wmt-t5-prompt`: Group for all wmt tasks with prompt templates used for T5 (`Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer`)
#### Tasks
With specific prompt styles:
* `wmt-ro-en-t5-prompt`: WMT16 with the prompt template used for T5
### Checklist
For adding novel benchmarks/datasets to the library:
* [ ] Is the task an existing benchmark in the literature?
* [ ] Have you referenced the original paper that introduced the task?
* [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
If other tasks on this dataset are already supported:
* [ ] Is the "Main" variant of this task clearly denoted?
* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
import evaluate


def bleu(predictions, references):
    # Per-document step: pass the (prediction, reference) pair through unchanged
    # so BLEU can be computed at the corpus level during aggregation.
    return (predictions[0], references[0])


def agg_bleu(items):
    # Aggregation step: compute corpus-level BLEU over all accumulated pairs.
    bleu_fn = evaluate.load("bleu")
    predictions, references = zip(*items)
    return bleu_fn.compute(predictions=predictions, references=references)["bleu"]
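Outside the harness, the two helpers above compose as follows; the Romanian sentences are invented purely for illustration.

```python
# Build per-document pass-through items with `bleu`, then aggregate once.
items = [
    bleu(predictions=["o pisică stă pe covor"], references=["o pisică stă pe covor"]),
    bleu(predictions=["bună dimineața tuturor"], references=["bună ziua tuturor"]),
]
print(agg_bleu(items))  # single corpus-level BLEU score over both pairs
```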
group:
  - wmt-t5-prompt
task: wmt-ro-en-t5-prompt
dataset_path: wmt16
dataset_name: ro-en
training_split: train
validation_split: validation
output_type: greedy_until
doc_to_text: "translate English to Romanian: {{translation.en}}"
doc_to_target: "{{translation.ro}}"
metric_list:
  - metric: wer
    aggregation: mean
    higher_is_better: false
  - metric: !function metrics.bleu
    aggregation: !function metrics.agg_bleu
    higher_is_better: true
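To make the two Jinja templates concrete: each wmt16 `ro-en` row is a dict of the form `{"translation": {"en": ..., "ro": ...}}`, and the fields above render against it roughly as in this sketch (the sample sentence is invented, and the harness's own template handling may apply additional processing).

```python
# Rough sketch of how the YAML's doc_to_text / doc_to_target templates expand.
# The sample sentence is invented; real wmt16 ro-en rows share this structure.
from jinja2 import Template

doc = {"translation": {"en": "The cat sits on the mat.", "ro": "Pisica stă pe covor."}}

print(Template("translate English to Romanian: {{translation.en}}").render(**doc))
# translate English to Romanian: The cat sits on the mat.
print(Template("{{translation.ro}}").render(**doc))
# Pisica stă pe covor.
```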