f'Both target_delimiter and target choice: "{choice}" do not have whitespace, ignore if the language you are evaluating on does not require/use whitespace'
)
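For context, here is a minimal, hypothetical sketch of the kind of validation that emits the warning above. The helper name, its signature, and the trigger condition are illustrative assumptions, not the harness's actual code:

```python
import logging

eval_logger = logging.getLogger(__name__)

# Hypothetical helper: warn when neither the delimiter between prompt and
# target nor the choice itself carries any whitespace, since the two would
# then be concatenated with no separator in the rendered prompt.
def check_target_whitespace(target_delimiter: str, choices: list[str]) -> None:
    delimiter_has_ws = target_delimiter != target_delimiter.strip()
    for choice in choices:
        choice_has_ws = choice != choice.strip()
        if not delimiter_has_ws and not choice_has_ws:
            eval_logger.warning(
                f'Both target_delimiter and target choice: "{choice}" do not have whitespace, '
                "ignore if the language you are evaluating on does not require/use whitespace"
            )

check_target_whitespace(target_delimiter=" ", choices=["yes", "no"])  # no warning
check_target_whitespace(target_delimiter="", choices=["yes"])         # warns
```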
def download(self, dataset_kwargs=None) -> None:
    self.dataset = datasets.load_dataset(
        path=self.DATASET_PATH,
...
...
@@ -838,7 +841,10 @@ class ConfigurableTask(Task):
                 and (target_string[0] == "[")
                 and (target_string[-1] == "]")
             ):
-                return ast.literal_eval(target_string)
+                try:
+                    return ast.literal_eval(target_string)
+                except (SyntaxError, ValueError):
+                    return target_string
             else:
                 return target_string
         elif type(doc_to_target) == list:
...
...
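To see why the guard is needed: `ast.literal_eval` raises `SyntaxError` or `ValueError` for bracketed strings that are not valid Python literals, which legitimately occur as plain-text targets. A standalone demonstration, standard library only:

```python
import ast

# A well-formed list literal parses cleanly:
print(ast.literal_eval('["yes", "no"]'))  # ['yes', 'no']

# A bracketed string that is not valid Python syntax raises SyntaxError:
try:
    ast.literal_eval("[citation needed]")
except (SyntaxError, ValueError) as err:
    print(type(err).__name__)  # SyntaxError

# A syntactically valid expression that is not a literal raises ValueError:
try:
    ast.literal_eval("[foo]")
except (SyntaxError, ValueError) as err:
    print(type(err).__name__)  # ValueError
```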
@@ -1067,6 +1073,9 @@ class ConfigurableTask(Task):
                 # it assumes that doc_to_target returns a number.
                 choices = self.doc_to_choice(doc)
                 gold = choices[gold]
+            # for multiple_target tasks, we expect gold to be a list.
+            elif self.multiple_target:
+                gold = list(gold)
             else:
                 gold = str(gold)
...
...
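The effect of the branch added above, illustrated outside the harness with made-up values:

```python
# doc_to_choice set: gold is an index into the choice list.
choices = ["yes", "no"]
gold = choices[1]                   # -> "no"

# multiple_target set: gold may arrive as any sequence (e.g. a tuple
# from a HF dataset) and is coerced to a list.
gold = list(("42", "forty-two"))    # -> ["42", "forty-two"]

# otherwise: a single target, normalized to a string.
gold = str(7)                       # -> "7"
```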
@@ -1077,6 +1086,10 @@ class ConfigurableTask(Task):
             # return true if any are true
             # TODO: this may break for multiple_target, non zero-or-1 metrics
             scores = []
+            if not isinstance(gold, list):
+                # sometimes, a multiple_target dataset has exceptions where one doc has only one string answer
+                gold = [gold]
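Taken together, the two hunks let the per-metric loop treat gold uniformly as a list. A hypothetical, self-contained version of that "any match counts" scoring follows; the function name and exact-match metric are illustrative, not the harness API:

```python
def score_any(pred: str, gold) -> float:
    """Return 1.0 if pred matches any acceptable gold answer, else 0.0."""
    if not isinstance(gold, list):
        # single-answer exception inside a multiple_target task
        gold = [gold]
    scores = [1.0 if pred == g else 0.0 for g in gold]
    # return true if any are true
    return max(scores)

assert score_any("42", ["42", "forty-two"]) == 1.0
assert score_any("41", "42") == 0.0
```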
### Citation
```bibtex
@misc{miao2021diverse,
title={A Diverse Corpus for Evaluating and Developing English Math Word Problem Solvers},
author={Shen-Yun Miao and Chao-Chun Liang and Keh-Yih Su},
year={2021},
eprint={2106.15772},
archivePrefix={arXiv},
primaryClass={cs.AI}
}
```
### Groups and Tasks
#### Groups
* Not part of a group yet.
#### Tasks
* `asdiv`
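As a usage sketch, the task can be run through the harness's Python entry point. The checkpoint name is just an example, and exact arguments may differ across harness versions:

```python
import lm_eval

# Evaluate a Hugging Face model on the asdiv task;
# "EleutherAI/pythia-160m" is a placeholder checkpoint.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=EleutherAI/pythia-160m",
    tasks=["asdiv"],
)
print(results["results"]["asdiv"])
```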
### Checklist
For adding novel benchmarks/datasets to the library:
* [ ] Is the task an existing benchmark in the literature?
* [ ] Have you referenced the original paper that introduced the task?
* [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
If other tasks on this dataset are already supported:
* [ ] Is the "Main" variant of this task clearly denoted?
* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
C-Eval: A Multi-Level Multi-Discipline Chinese Evaluation Suite for Foundation Models
https://arxiv.org/pdf/2305.08322.pdf
C-Eval is a comprehensive Chinese evaluation suite for foundation models.
It consists of 13948 multi-choice questions spanning 52 diverse disciplines
and four difficulty levels.
Homepage: https://cevalbenchmark.com/
### Citation
```bibtex
@article{huang2023ceval,
title={C-Eval: A Multi-Level Multi-Discipline Chinese Evaluation Suite for Foundation Models},
author={Huang, Yuzhen and Bai, Yuzhuo and Zhu, Zhihao and Zhang, Junlei and Zhang, Jinghan and Su, Tangjun and Liu, Junteng and Lv, Chuancheng and Zhang, Yikai and Lei, Jiayi and Fu, Yao and Sun, Maosong and He, Junxian},
journal={arXiv preprint arXiv:2305.08322},
year={2023}
}
```
### Groups and Tasks
#### Groups
- `ceval-valid`: All 52 subjects of the C-Eval dataset, evaluated following the methodology in MMLU's original implementation. This implementation consists solely of the validation set of C-Eval, as the test set requires submission of model predictions to an external site.
#### Tasks
The following tasks evaluate subjects in the C-Eval dataset using loglikelihood-based multiple-choice scoring (sketched below):
- `ceval-valid_{subject_english}`
### Checklist
* [x] Is the task an existing benchmark in the literature?
* [x] Have you referenced the original paper that introduced the task?
* [ ] If yes, does the original paper provide a reference implementation?
If other tasks on this dataset are already supported:
* [x] Is the "Main" variant of this task clearly denoted?
* [x] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
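For reference, a minimal sketch of what loglikelihood-based multiple-choice scoring means here: each candidate answer is scored by the log-probability the model assigns to it as a continuation of the question, and the highest-scoring choice is taken as the prediction. The `loglikelihood` callable below is an assumed interface, not the harness's actual API:

```python
from typing import Callable, Sequence

def pick_choice(
    loglikelihood: Callable[[str, str], float],
    question: str,
    choices: Sequence[str],
) -> int:
    """Return the index of the choice with the highest continuation log-prob."""
    scores = [loglikelihood(question, f" {c}") for c in choices]
    return max(range(len(choices)), key=scores.__getitem__)

# Toy stand-in model that simply prefers shorter continuations:
demo_ll = lambda ctx, cont: -float(len(cont))
print(pick_choice(demo_ll, "1+1=?", ["2", "11"]))  # -> 0
```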
CMMLU: Measuring massive multitask language understanding in Chinese
https://arxiv.org/abs/2306.09212
CMMLU is a comprehensive evaluation benchmark specifically designed to evaluate the knowledge and reasoning abilities of LLMs within the context of Chinese language and culture.
CMMLU covers a wide range of subjects, comprising 67 topics that span from elementary to advanced professional levels.
Homepage: https://github.com/haonan-li/CMMLU
### Citation
```bibtex
@misc{li2023cmmlu,
title={CMMLU: Measuring massive multitask language understanding in Chinese},
author={Haonan Li and Yixuan Zhang and Fajri Koto and Yifei Yang and Hai Zhao and Yeyun Gong and Nan Duan and Timothy Baldwin},
year={2023},
eprint={2306.09212},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```