Commit 88486e57 authored by lintangsutawika


Merge branch 'group-agg-rework' of https://github.com/EleutherAI/lm-evaluation-harness into multiprompt
parents 5971f2ca ba73d131
"dataset_name": "virology" "dataset_name": "virology"
"description": "فم بعملية التقييم في مجال علوم أخرى \n\n" "description": "以下是关于病毒学的单项选择题,请直接给出正确答案的选项。\n\n"
"include": "_default_template_yaml" "include": "_default_template_yaml"
"task": "ammlu_virology" "task": "cmmlu_virology"
"dataset_name": "world_history"
"description": "以下是关于世界历史的单项选择题,请直接给出正确答案的选项。\n\n"
"include": "_default_template_yaml"
"task": "cmmlu_world_history"
"dataset_name": "world_religions" "dataset_name": "world_religions"
"description": "فم بعملية التقييم في مجال العلوم الانسانية \n\n" "description": "以下是关于世界宗教的单项选择题,请直接给出正确答案的选项。\n\n"
"include": "_default_template_yaml" "include": "_default_template_yaml"
"task": "ammlu_world_religions" "task": "cmmlu_world_religions"
# CommonsenseQA
### Paper
Title: `COMMONSENSEQA: A Question Answering Challenge Targeting Commonsense Knowledge`

Abstract: https://arxiv.org/pdf/1811.00937.pdf
CommonsenseQA is a multiple-choice question answering dataset that requires different types of commonsense knowledge to predict the correct answers.
It contains 12,102 questions with one correct answer and four distractor answers.
Homepage: https://www.tau-nlp.org/commonsenseqa
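As a quick orientation to the data itself, here is a minimal sketch that loads one record with the Hugging Face `datasets` library; the field names match those referenced by the task config further below (`question`, `choices`, `answerKey`):

```python
# Minimal sketch: inspect one CommonsenseQA record.
from datasets import load_dataset

ds = load_dataset("tau/commonsense_qa", split="validation")
doc = ds[0]

print(doc["question"])          # the question text
print(doc["choices"]["label"])  # ["A", "B", "C", "D", "E"]
print(doc["choices"]["text"])   # five answers: one correct, four distractors
print(doc["answerKey"])         # gold label, e.g. "A"
```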
### Citation
```
@inproceedings{talmor-etal-2019-commonsenseqa,
title = "{C}ommonsense{QA}: A Question Answering Challenge Targeting Commonsense Knowledge",
author = "Talmor, Alon and
Herzig, Jonathan and
Lourie, Nicholas and
Berant, Jonathan",
booktitle = "Proceedings of the 2019 Conference of the North {A}merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)",
month = jun,
year = "2019",
address = "Minneapolis, Minnesota",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/N19-1421",
doi = "10.18653/v1/N19-1421",
pages = "4149--4158",
archivePrefix = "arXiv",
eprint = "1811.00937",
primaryClass = "cs",
}
```
### Groups and Tasks
#### Groups
* Not part of a group yet.
#### Tasks
* `commonsense_qa`: Represents the "random" split from the paper. Uses an MMLU-style prompt, as (presumably) used by Llama evaluations.
### Checklist
For adding novel benchmarks/datasets to the library:
* [x] Is the task an existing benchmark in the literature?
* [x] Have you referenced the original paper that introduced the task?
* [x] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
If other tasks on this dataset are already supported:
* [ ] Is the "Main" variant of this task clearly denoted?
* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
task: commonsense_qa
dataset_path: tau/commonsense_qa
training_split: train
validation_split: validation
output_type: multiple_choice
doc_to_text: "Question: {{ question.strip() }}\nA. {{choices['text'][0]}}\nB. {{choices['text'][1]}}\nC. {{choices['text'][2]}}\nD. {{choices['text'][3]}}\nE. {{choices['text'][4]}}\nAnswer:"
doc_to_target: answerKey
doc_to_choice: ['A', 'B', 'C', 'D', 'E']
metric_list:
- metric: acc
aggregation: mean
higher_is_better: true
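To make the template above concrete, here is a hedged sketch of how `doc_to_text` renders for a single document. The harness evaluates these fields as Jinja2 templates; the sample question below is invented purely for illustration:

```python
# Sketch: render the MMLU-style prompt for one (invented) document.
from jinja2 import Template

doc_to_text = (
    "Question: {{ question.strip() }}\n"
    "A. {{choices['text'][0]}}\n"
    "B. {{choices['text'][1]}}\n"
    "C. {{choices['text'][2]}}\n"
    "D. {{choices['text'][3]}}\n"
    "E. {{choices['text'][4]}}\nAnswer:"
)
doc = {
    "question": "Where would you put a plate after washing it?",
    "choices": {"text": ["cupboard", "table", "sink", "oven", "floor"]},
}
print(Template(doc_to_text).render(**doc))
# Question: Where would you put a plate after washing it?
# A. cupboard
# B. table
# C. sink
# D. oven
# E. floor
# Answer:
```

With `output_type: multiple_choice`, the model is then scored by comparing the loglikelihood of each choice letter from `doc_to_choice` as the continuation.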
- group: copal_id
+ tag: copal_id
  task: copal_id_standard
  task_alias: standard
  dataset_path: haryoaw/COPAL
......
- group:
+ tag:
  - crows_pairs
  - social_bias
  - loglikelihood
  task: crows_pairs_english
  dataset_path: BigScienceBiasEval/crows_pairs_multilingual
  dataset_name: english
......
group: csatqa
task:
- csatqa_gr
- csatqa_li
- csatqa_rch
- csatqa_rcs
- csatqa_rcss
- csatqa_wr
aggregate_metric_list:
- metric: acc
aggregation: mean
weight_by_size: true
- metric: acc_norm
aggregation: mean
weight_by_size: true
metadata:
version: 0.0
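The `weight_by_size: true` flag presumably makes the group score a size-weighted mean of the subtask scores rather than a plain average, so larger subtasks count proportionally more. A sketch of that arithmetic with invented numbers:

```python
# Sketch of size-weighted aggregation (assumed semantics of weight_by_size).
def size_weighted_mean(scores, sizes):
    # Each subtask's metric counts in proportion to its number of examples.
    return sum(s * n for s, n in zip(scores, sizes)) / sum(sizes)


# Two hypothetical csatqa subtasks: acc 0.50 on 100 docs, acc 0.70 on 300 docs.
print(size_weighted_mean([0.50, 0.70], [100, 300]))  # 0.65 (plain mean: 0.60)
```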
- group: csatqa
  dataset_path: EleutherAI/csatqa
  test_split: test
  output_type: multiple_choice
......
""" """
Take in a YAML, and output all other splits with this YAML Take in a YAML, and output all other splits with this YAML
""" """
import argparse import argparse
import os import os
......
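The utility above is collapsed in this view. For orientation, a minimal sketch of what such a per-subject YAML generator typically looks like; the subject list, flag names, and file naming below are hypothetical, not the script's actual contents:

```python
# Hypothetical sketch of a generator that emits one YAML per subject,
# each including a shared base template. Not the actual script.
import argparse
import os

import yaml

SUBJECTS = ["world_history", "world_religions"]  # hypothetical subset


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--base_yaml_path", required=True)
    parser.add_argument("--save_prefix_path", default="cmmlu")
    args = parser.parse_args()

    base_name = os.path.basename(args.base_yaml_path)
    for subject in SUBJECTS:
        config = {
            "include": base_name,
            "dataset_name": subject,
            "task": f"cmmlu_{subject}",
        }
        out_path = f"{args.save_prefix_path}_{subject}.yaml"
        with open(out_path, "w", encoding="utf-8") as f:
            yaml.dump(config, f, allow_unicode=True)


if __name__ == "__main__":
    main()
```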
"""
"""
import re import re
from typing import List from typing import List
...@@ -14,7 +12,7 @@ class FDA(ConfigurableTask): ...@@ -14,7 +12,7 @@ class FDA(ConfigurableTask):
DATASET_PATH = "hazyresearch/based-fda" DATASET_PATH = "hazyresearch/based-fda"
DATASET_NAME = "default" DATASET_NAME = "default"
def __init__(self): def __init__(self, **kwargs):
super().__init__(config={"metadata": {"version": self.VERSION}}) super().__init__(config={"metadata": {"version": self.VERSION}})
def has_training_docs(self): def has_training_docs(self):
......
@@ -38,18 +38,19 @@ Homepage: https://github.com/hitachi-nlp/FLD
  ### Groups and Tasks
- #### Groups
- * `fld`
- #### Tasks
  This release is the simplified version of FLD, where a model is required to predict only an answer.
  This setting is described by "answer accuracy" in the original paper.
+ #### Tasks in Group `fld`
  * `fld_default` is a basic task based on [FLD.v2](https://huggingface.co/datasets/hitachi-nlp/FLD.v2/viewer/star)
  * `fld_star` is a more challenging version based on [FLD.v2-star](https://huggingface.co/datasets/hitachi-nlp/FLD.v2/viewer/star)
+ #### Tasks in Group `fld_logical_formula`
+ Further, we have "logical formula" versions of the benchmarks, which evaluate LLMs' pure logical reasoning capabilities within the domain of logical formulas, rather than natural language:
+ * `fld_logical_formula_default`
+ * `fld_logical_formula_star`
  ### Checklist
  For adding novel benchmarks/datasets to the library:
......
- group:
- - fld
  task: fld_default
  dataset_path: hitachi-nlp/FLD.v2
  dataset_name: default
......
group:
- fld_logical_formula
task: fld_logical_formula_default
dataset_path: hitachi-nlp/FLD.v2
dataset_name: default
training_split: train
validation_split: validation
test_split: test
doc_to_text: "Based on the provided facts ($context$), either prove or disprove the hypothesis or state that it is unknown. The facts and the hypothesis are written in logical formulas as follows: capital letters such as \"{A}\", \"{B}\", \"{AB}\" are predicates, small letters such as \"{a}\", \"{b}\", \"{ab}\" are constants, \"&\" is logical conjunction, \"v\" is logical disjunction, \"¬\" is negation, \"->\" is implication, \"(x)\" is \"for all x\", and \"(Ex)\" is \"for some x\". $hypothesis$ = {{hypothesis_formula}} ; $context$ = {{context_formula}} ; $proof$ = "
doc_to_target: world_assump_label
metric_list:
- metric: exact_match
aggregation: mean
higher_is_better: true
filter_list:
- name: remove_whitespace
filter:
- function: remove_whitespace
- function: take_first
metadata:
version: 2.0
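The `filter_list` above post-processes each generation before `exact_match` is computed against `world_assump_label`. A sketch of the assumed semantics of this chain (strip leading whitespace from every candidate, then keep only the first candidate per document); this is an illustration, not the harness's implementation:

```python
# Assumed behavior of the remove_whitespace -> take_first filter chain.
def remove_whitespace(candidates):
    # Drop leading whitespace from each generated candidate.
    return [c.lstrip() for c in candidates]


def take_first(candidates):
    # Keep only the first candidate for scoring.
    return candidates[0]


generations = ["   PROVED", "UNKNOWN"]  # invented model outputs
print(take_first(remove_whitespace(generations)))  # "PROVED"
```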
include: fld_logical_formula_default.yaml
task: fld_logical_formula_star
dataset_name: star
@@ -20,9 +20,9 @@ This benchmark is constructed both from openly available datasets, as well as ne
  }
  ```
- ### Groups and Tasks
+ ### Groups, Tags, and Tasks
- #### Groups
+ #### Tags
  - `french_bench`: All tasks (non-perplexity based)
  - `french_bench_gen`: All official generative tasks
......
- group:
+ tag:
  - french_bench
  - french_bench_mc
  task: french_bench_arc_challenge
......
include: "_default_template_yaml" include: "_default_template_yaml"
group: tag:
- french_bench - french_bench
- french_bench_extra - french_bench_extra
description: "D'après l'information dans le contexte donné, quelle est la réponse à la question ?" description: "D'après l'information dans le contexte donné, quelle est la réponse à la question ?"
......
include: "_default_template_yaml" include: "_default_template_yaml"
group: tag:
- french_bench - french_bench
- french_bench_extra - french_bench_extra
description: "D'après l'information dans le contexte donné, donne la réponse à la question en citant quelques mots du contexte. Si il est impossible de répondre avec les informations du contexte, répond 'Impossible'." description: "D'après l'information dans le contexte donné, donne la réponse à la question en citant quelques mots du contexte. Si il est impossible de répondre avec les informations du contexte, répond 'Impossible'."
......
include: "_default_template_yaml" include: "_default_template_yaml"
group: tag:
- french_bench - french_bench
- french_bench_extra - french_bench_extra
description: "D'après l'information présente dans le contexte, est il possible de répondre à la question ?" description: "D'après l'information présente dans le contexte, est il possible de répondre à la question ?"
......