Commit a27e8ed1 authored by lintangsutawika's avatar lintangsutawika

Merge branch 'big-refactor' of https://github.com/EleutherAI/lm-evaluation-harness into squadv2

parents fc329d31 4cda3a1c
group:
- perplexity
- loglikelihood_rolling
task: wikitext
dataset_path: EleutherAI/wikitext_document_level
dataset_name: wikitext-2-raw-v1
......
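For context, a `loglikelihood_rolling` task scores each full document with the model and reports perplexity. A minimal sketch (helper name and inputs are hypothetical, not harness code) of how summed per-document log-likelihoods become word-level perplexity:

```python
import math


def word_perplexity(doc_loglikelihoods, docs):
    """Word-level perplexity from per-document summed log-likelihoods."""
    total_ll = sum(doc_loglikelihoods)
    # Normalize by the total word count across all documents.
    n_words = sum(len(d.split()) for d in docs)
    return math.exp(-total_ll / n_words)


print(word_perplexity([-10.0, -14.0], ["one two three", "four five six"]))
# exp(24 / 6) = exp(4.0) ≈ 54.598
```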
# WinoGrande
### Paper
Title: `WinoGrande: An Adversarial Winograd Schema Challenge at Scale`
Abstract: https://arxiv.org/abs/1907.10641
WinoGrande is a collection of 44k problems, inspired by the Winograd Schema Challenge
(Levesque, Davis, and Morgenstern 2011) but adjusted to improve both scale and
robustness against dataset-specific bias. Each problem is formulated as a
fill-in-the-blank task with two options, and choosing the right option for a
given sentence requires commonsense reasoning.
NOTE: This evaluation of WinoGrande uses partial evaluation as described by
Trinh & Le in `A Simple Method for Commonsense Reasoning` (2018).
See: https://arxiv.org/abs/1806.02847
Homepage: https://leaderboard.allenai.org/winogrande/submissions/public
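The partial-evaluation trick can be sketched as follows; `logprob_fn`, `partial_score`, and the toy scorer are hypothetical stand-ins, not the harness's implementation. Each option is substituted into the blank, and only the shared continuation after the blank is scored:

```python
def partial_score(logprob_fn, prefix_with_option, continuation):
    """Log-likelihood of `continuation` conditioned on the filled-in prefix."""
    return sum(logprob_fn(prefix_with_option, continuation))


def pick_option(logprob_fn, sentence, options):
    # Split at the blank; everything after it is shared by both options.
    prefix, _, continuation = sentence.partition("_")
    scores = [
        partial_score(logprob_fn, prefix + opt, continuation) for opt in options
    ]
    return max(range(len(options)), key=scores.__getitem__)


# Toy scorer (made up): pretends continuations after "trophy" are likelier.
def toy_logprob_fn(prefix, continuation):
    return [0.0] if "trophy" in prefix else [-1.0]


print(pick_option(toy_logprob_fn,
                  "The _ didn't fit because it was too big.",
                  ["trophy", "suitcase"]))  # -> 0
```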
### Citation
```
@article{sakaguchi2019winogrande,
  title   = {WinoGrande: An Adversarial Winograd Schema Challenge at Scale},
  author  = {Sakaguchi, Keisuke and Bras, Ronan Le and Bhagavatula, Chandra and Choi, Yejin},
  journal = {arXiv preprint arXiv:1907.10641},
  year    = {2019}
}
```
### Groups and Tasks
#### Groups
* Not part of a group yet.
#### Tasks
* `winogrande`
### Checklist
For adding novel benchmarks/datasets to the library:
* [ ] Is the task an existing benchmark in the literature?
* [ ] Have you referenced the original paper that introduced the task?
* [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
If other tasks on this dataset are already supported:
* [ ] Is the "Main" variant of this task clearly denoted?
* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
# WMT16
### Paper
Title: `Findings of the 2016 Conference on Machine Translation`
Abstract: http://www.aclweb.org/anthology/W/W16/W16-2301
Homepage: https://huggingface.co/datasets/wmt16
### Citation
```
@InProceedings{bojar-EtAl:2016:WMT1,
  author    = {Bojar, Ond{\v{r}}ej and Chatterjee, Rajen and Federmann, Christian and Graham, Yvette and Haddow, Barry and Huck, Matthias and Jimeno Yepes, Antonio and Koehn, Philipp and Logacheva, Varvara and Monz, Christof and Negri, Matteo and Neveol, Aurelie and Neves, Mariana and Popel, Martin and Post, Matt and Rubino, Raphael and Scarton, Carolina and Specia, Lucia and Turchi, Marco and Verspoor, Karin and Zampieri, Marcos},
  title     = {Findings of the 2016 Conference on Machine Translation},
  booktitle = {Proceedings of the First Conference on Machine Translation},
  month     = {August},
  year      = {2016},
  address   = {Berlin, Germany},
  publisher = {Association for Computational Linguistics},
  pages     = {131--198},
  url       = {http://www.aclweb.org/anthology/W/W16/W16-2301}
}
```
### Groups and Tasks
#### Groups
* `wmt-t5-prompt`: Group for all WMT tasks that use the prompt templates from T5 (`Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer`)
#### Tasks
With specific prompt styles:
* `wmt-ro-en-t5-prompt`: WMT16 ro-en evaluated with the T5 prompt template for English-to-Romanian translation
### Checklist
For adding novel benchmarks/datasets to the library:
* [ ] Is the task an existing benchmark in the literature?
* [ ] Have you referenced the original paper that introduced the task?
* [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
If other tasks on this dataset are already supported:
* [ ] Is the "Main" variant of this task clearly denoted?
* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
import evaluate


def bleu(predictions, references):
    # Per-instance "metric" is a pass-through: BLEU is a corpus-level score,
    # so we only collect the (prediction, reference) pair for aggregation.
    return (predictions[0], references[0])


def agg_bleu(items):
    # Corpus-level aggregation: compute BLEU over all collected pairs at once.
    bleu_fn = evaluate.load("bleu")
    predictions, references = zip(*items)
    return bleu_fn.compute(predictions=predictions, references=references)["bleu"]
group:
  - wmt-t5-prompt
task: wmt-ro-en-t5-prompt
dataset_path: wmt16
dataset_name: ro-en
training_split: train
validation_split: validation
output_type: greedy_until
doc_to_text: "translate English to Romanian: {{translation.en}}"
doc_to_target: "{{translation.ro}}"
metric_list:
  - metric: wer
    aggregation: mean
    higher_is_better: false
  - metric: !function metrics.bleu
    aggregation: !function metrics.agg_bleu
    higher_is_better: true
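The config above mixes a built-in metric (`wer`) with `!function` hooks. As a reminder of what word error rate measures, here is a minimal self-contained sketch (a hypothetical helper, not the harness's or `evaluate`'s implementation): word-level Levenshtein distance divided by reference length.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j].
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)


print(wer("a b c", "a x c"))  # one substitution out of 3 words -> 0.333...
```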
# XCOPA
### Paper
Title: `XCOPA: A Multilingual Dataset for Causal Commonsense Reasoning`
Abstract: https://ducdauge.github.io/files/xcopa.pdf
The Cross-lingual Choice of Plausible Alternatives dataset is a benchmark to evaluate the ability of machine learning models to transfer commonsense reasoning across languages.
The dataset is the translation and reannotation of the English COPA (Roemmele et al. 2011) and covers 11 languages from 11 families and several areas around the globe.
......@@ -8,6 +13,8 @@ All the details about the creation of XCOPA and the implementation of the baseli
Homepage: https://github.com/cambridgeltl/xcopa
### Citation
```
@inproceedings{ponti2020xcopa,
title={{XCOPA: A} Multilingual Dataset for Causal Commonsense Reasoning},
......@@ -17,3 +24,37 @@ Homepage: https://github.com/cambridgeltl/xcopa
url={https://ducdauge.github.io/files/xcopa.pdf}
}
```
### Groups and Tasks
#### Groups
* `xcopa`
#### Tasks
* `xcopa_et`: Estonian
* `xcopa_ht`: Haitian Creole
* `xcopa_id`: Indonesian
* `xcopa_it`: Italian
* `xcopa_qu`: Cusco-Collao Quechua
* `xcopa_sw`: Kiswahili
* `xcopa_ta`: Tamil
* `xcopa_th`: Thai
* `xcopa_tr`: Turkish
* `xcopa_vi`: Vietnamese
* `xcopa_zh`: Mandarin Chinese
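XCOPA examples pair a premise with two alternatives and a question type; a common prompt format joins the premise to each choice with a connective chosen by the question type ("cause" → "because", "effect" → "so"). A hypothetical sketch (the helper is illustrative; field names follow the HF dataset):

```python
# Connective per question type (illustrative formatting, not harness code).
CONNECTOR = {"cause": "because", "effect": "so"}


def doc_to_text(doc: dict) -> str:
    # Drop the premise's trailing period and append the connective;
    # each choice would then be scored as a continuation of this prompt.
    return doc["premise"].rstrip(".") + " " + CONNECTOR[doc["question"]]


doc = {
    "premise": "The man broke his toe.",
    "question": "cause",
    "choice1": "He got a hole in his sock.",
    "choice2": "He dropped a hammer on his foot.",
}
print(doc_to_text(doc))  # -> "The man broke his toe because"
```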
### Checklist
For adding novel benchmarks/datasets to the library:
* [ ] Is the task an existing benchmark in the literature?
* [ ] Have you referenced the original paper that introduced the task?
* [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
If other tasks on this dataset are already supported:
* [ ] Is the "Main" variant of this task clearly denoted?
* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
# XStoryCloze
### Paper
Title: `Few-shot Learning with Multilingual Language Models`
Abstract: https://arxiv.org/abs/2112.10668
XStoryCloze consists of professional translations of the [English StoryCloze dataset](https://cs.rochester.edu/nlp/rocstories/) (Spring 2016 version) into 10 non-English languages. This dataset is released by Meta AI.
Homepage: https://github.com/facebookresearch/fairseq/pull/4820
### Citation
```
@article{DBLP:journals/corr/abs-2112-10668,
  author     = {Xi Victoria Lin and Todor Mihaylov and Mikel Artetxe and Tianlu Wang and Shuohui Chen and Daniel Simig and Myle Ott and Naman Goyal and Shruti Bhosale and Jingfei Du and Ramakanth Pasunuru and Sam Shleifer and Punit Singh Koura and Vishrav Chaudhary and Brian O'Horo and Jeff Wang and Luke Zettlemoyer and Zornitsa Kozareva and Mona T. Diab and Veselin Stoyanov and Xian Li},
  title      = {Few-shot Learning with Multilingual Language Models},
  journal    = {CoRR},
  volume     = {abs/2112.10668},
  year       = {2021},
  url        = {https://arxiv.org/abs/2112.10668},
  eprinttype = {arXiv},
  eprint     = {2112.10668},
  timestamp  = {Tue, 04 Jan 2022 15:59:27 +0100},
  biburl     = {https://dblp.org/rec/journals/corr/abs-2112-10668.bib},
  bibsource  = {dblp computer science bibliography, https://dblp.org}
}
```
### Groups and Tasks
#### Groups
* `xstorycloze`
#### Tasks
* `xstorycloze_ar`: Arabic
* `xstorycloze_en`: English
* `xstorycloze_es`: Spanish
* `xstorycloze_eu`: Basque
* `xstorycloze_hi`: Hindi
* `xstorycloze_id`: Indonesian
* `xstorycloze_my`: Burmese
* `xstorycloze_ru`: Russian
* `xstorycloze_sw`: Swahili
* `xstorycloze_te`: Telugu
* `xstorycloze_zh`: Chinese
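Like other multiple-choice tasks in the harness, XStoryCloze is typically scored by comparing the log-likelihood the model assigns to each candidate ending. A toy sketch (illustrative helper with made-up numbers, not harness code):

```python
def accuracy(loglikelihoods, golds):
    """Accuracy given per-example lists of per-choice log-likelihoods."""
    # Predict the choice with the highest log-likelihood for each example.
    preds = [max(range(len(lls)), key=lls.__getitem__) for lls in loglikelihoods]
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)


# Two examples, two candidate endings each (values are made up).
print(accuracy([[-12.3, -9.1], [-7.5, -8.8]], [1, 0]))  # -> 1.0
```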
### Checklist
For adding novel benchmarks/datasets to the library:
* [ ] Is the task an existing benchmark in the literature?
* [ ] Have you referenced the original paper that introduced the task?
* [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
If other tasks on this dataset are already supported:
* [ ] Is the "Main" variant of this task clearly denoted?
* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
......@@ -31,7 +31,13 @@ Homepage: `https://huggingface.co/datasets/Muennighoff/xwinograd`
}
```
### Groups and Tasks
#### Groups
* `xwinograd`
#### Tasks
* `xwinograd_en`: Winograd schema challenges in English.
......
......@@ -2,9 +2,7 @@
# It doesn't have a yaml file extension as it is not meant to be imported directly
# by the harness.
group:
- xwinograd
dataset_path: Muennighoff/xwinograd
dataset_name: null # Overridden by language-specific config.
output_type: multiple_choice
......
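A language-specific config can then reuse these shared settings and override only what differs per language. A hypothetical example (the file name and `include` key are assumed to follow the harness's conventions, not taken from this diff):

```yaml
# Hypothetical xwinograd_en.yaml: pull in the shared settings above and
# override the language-specific fields.
include: xwinograd_common.yaml
task: xwinograd_en
dataset_name: en
```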
......@@ -2,7 +2,8 @@
### Paper
Title: `paper title goes here`
Abstract: `link to paper PDF or arXiv abstract goes here`
`Short description of paper / benchmark goes here:`
......@@ -16,11 +17,16 @@ Homepage: `homepage to the benchmark's website goes here, if applicable`
BibTeX-formatted citation goes here
```
### Groups and Tasks
#### Groups
* `group_name`: `Short description`
#### Tasks
List or describe tasks defined in this folder, and their names here:
* `task_name`: `1-sentence description of what this particular task does`
* `task_name2`: ...
### Checklist
......
from __future__ import annotations
import pytest
import numpy as np
from lm_eval.models.huggingface import HFLM
from lm_eval.api.instance import Instance
import lm_eval.tasks as tasks
class Test_HFLM:
    # Build a small batch of requests for each request type once, at class
    # construction time, so every test reuses the same instances.
    multiple_choice_task = tasks.TASK_REGISTRY.get("arc_easy")()  # type: ignore
    multiple_choice_task.build_all_requests(limit=10, rank=0, world_size=1)
    MULTIPLE_CH: list[Instance] = multiple_choice_task.instances

    greedy_until_task = tasks.TASK_REGISTRY.get("gsm8k_yaml")()  # type: ignore
    greedy_until_task.build_all_requests(limit=10, rank=0, world_size=1)
    greedy_until_task._config.generation_kwargs["max_gen_toks"] = 10
    GREEDY_UNTIL: list[Instance] = greedy_until_task.instances

    rolling_task = tasks.TASK_REGISTRY.get("wikitext")()  # type: ignore
    rolling_task.build_all_requests(limit=10, rank=0, world_size=1)
    ROLLING: list[Instance] = rolling_task.instances

    # Reference values produced by EleutherAI/pythia-70m on CPU in float32.
    MULTIPLE_CH_RES = [
        -41.902435302734375,
        -42.939308166503906,
        -33.914180755615234,
        -37.07139205932617,
        -22.95258331298828,
        -20.342208862304688,
        -14.818366050720215,
        -27.942853927612305,
        -15.80704116821289,
        -15.936427116394043,
        -13.052018165588379,
        -18.04828453063965,
        -13.345029830932617,
        -13.366025924682617,
        -12.127134323120117,
        -11.872495651245117,
        -47.10598373413086,
        -47.76410675048828,
        -36.4406852722168,
        -50.0289421081543,
        -16.72093963623047,
        -18.535587310791016,
        -26.46993637084961,
        -20.355995178222656,
        -17.757919311523438,
        -21.80595588684082,
        -33.1990852355957,
        -39.28636932373047,
        -14.759679794311523,
        -16.753942489624023,
        -11.486852645874023,
        -15.42177677154541,
        -13.15798282623291,
        -15.887393951416016,
        -15.28614616394043,
        -12.339089393615723,
        -44.59441375732422,
        -55.40888214111328,
        -52.70050811767578,
        -56.25089645385742,
    ]
    GREEDY_UNTIL_RES = [
        " The average of $2.50 each is $",
        " A robe takes 2 bolts of blue fiber and half",
        " $50,000 in repairs.",
        " He runs 1 sprint 3 times a week.",
        " They feed each of her chickens three cups of mixed",
        " The price of the glasses is $5, but",
        " The total percentage of students who said they like to",
        " Carla is downloading a 200 GB file. Normally",
        " John drives for 3 hours at a speed of 60",
        " Eliza sells 4 tickets to 5 friends so she",
    ]
    ROLLING_RES = [
        -3603.6328125,
        -19779.23974609375,
        -8834.16455078125,
        -27967.591796875,
        -7636.794982910156,
        -9491.93505859375,
        -41043.4248046875,
        -8397.689819335938,
        -45969.47155761719,
        -7158.90625,
    ]
    LM = HFLM(pretrained="EleutherAI/pythia-70m", device="cpu", dtype="float32")

    def test_loglikelihood(self) -> None:
        res = self.LM.loglikelihood(self.MULTIPLE_CH)
        _RES, _res = self.MULTIPLE_CH_RES, [r[0] for r in res]
        # change atol in case of consistent failure
        assert np.allclose(_res, _RES, atol=1e-4)
        # check argmax indices for multiple choice (4 options per question)
        argmax_RES, argmax_res = np.argmax(
            np.array(_RES).reshape(-1, 4), axis=1
        ), np.argmax(np.array(_res).reshape(-1, 4), axis=1)
        assert (argmax_RES == argmax_res).all()

    def test_greedy_until(self) -> None:
        res = self.LM.greedy_until(self.GREEDY_UNTIL)
        assert res == self.GREEDY_UNTIL_RES

    def test_loglikelihood_rolling(self) -> None:
        res = self.LM.loglikelihood_rolling(self.ROLLING)
        assert np.allclose(res, self.ROLLING_RES, atol=1e-2)

    def test_tok_encode(self) -> None:
        res = self.LM.tok_encode("foo bar")
        assert res == [12110, 2534]

    def test_tok_decode(self) -> None:
        res = self.LM.tok_decode([12110, 2534])
        assert res == "foo bar"

    def test_batch_encode(self) -> None:
        res = self.LM.tok_batch_encode(["foo bar", "bar foo"])[0].tolist()
        assert res == [[12110, 2534], [2009, 17374]]

    def test_model_generate(self) -> None:
        context = self.LM.tok_batch_encode(["foo bar"])[0]
        res = self.LM._model_generate(context, max_length=10, stop=["\n\n"])
        res = self.LM.tok_decode(res[0])
        assert res == "foo bar\n<bazhang>!info bar"