Unverified Commit 3d1b8f43 authored by Lintang Sutawika's avatar Lintang Sutawika Committed by GitHub
Browse files

Merge branch 'main' into group-agg-rework

parents e200c24e d855d0ba
task: bertaqa_en_mt_latxa-13b-v1
include: _bertaqa_template
dataset_name: en_mt_latxa-13b-v1
doc_to_text: "Question: {{question}}\nA: {{candidates[0]}}\nB: {{candidates[1]}}\nC: {{candidates[2]}}\nAnswer:"
task: bertaqa_en_mt_latxa-70b-v1.1
include: _bertaqa_template
dataset_name: en_mt_latxa-70b-v1.1
doc_to_text: "Question: {{question}}\nA: {{candidates[0]}}\nB: {{candidates[1]}}\nC: {{candidates[2]}}\nAnswer:"
task: bertaqa_en_mt_latxa-70b-v1
include: _bertaqa_template
dataset_name: en_mt_latxa-70b-v1
doc_to_text: "Question: {{question}}\nA: {{candidates[0]}}\nB: {{candidates[1]}}\nC: {{candidates[2]}}\nAnswer:"
task: bertaqa_en_mt_latxa-7b-v1.1
include: _bertaqa_template
dataset_name: en_mt_latxa-7b-v1.1
doc_to_text: "Question: {{question}}\nA: {{candidates[0]}}\nB: {{candidates[1]}}\nC: {{candidates[2]}}\nAnswer:"
task: bertaqa_en_mt_latxa-7b-v1
include: _bertaqa_template
dataset_name: en_mt_latxa-7b-v1
doc_to_text: "Question: {{question}}\nA: {{candidates[0]}}\nB: {{candidates[1]}}\nC: {{candidates[2]}}\nAnswer:"
task: bertaqa_en_mt_llama-2-13b
include: _bertaqa_template
dataset_name: en_mt_llama-2-13b
doc_to_text: "Question: {{question}}\nA: {{candidates[0]}}\nB: {{candidates[1]}}\nC: {{candidates[2]}}\nAnswer:"
task: bertaqa_en_mt_llama-2-70b
include: _bertaqa_template
dataset_name: en_mt_llama-2-70b
doc_to_text: "Question: {{question}}\nA: {{candidates[0]}}\nB: {{candidates[1]}}\nC: {{candidates[2]}}\nAnswer:"
task: bertaqa_en_mt_llama-2-7b
include: _bertaqa_template
dataset_name: en_mt_llama-2-7b
doc_to_text: "Question: {{question}}\nA: {{candidates[0]}}\nB: {{candidates[1]}}\nC: {{candidates[2]}}\nAnswer:"
task: bertaqa_en_mt_madlad
include: _bertaqa_template
dataset_name: en_mt_madlad
doc_to_text: "Question: {{question}}\nA: {{candidates[0]}}\nB: {{candidates[1]}}\nC: {{candidates[2]}}\nAnswer:"
task: bertaqa_en_mt_nllb
include: _bertaqa_template
dataset_name: en_mt_nllb
doc_to_text: "Question: {{question}}\nA: {{candidates[0]}}\nB: {{candidates[1]}}\nC: {{candidates[2]}}\nAnswer:"
task: bertaqa_eu
include: _bertaqa_template
dataset_name: eu
doc_to_text: "Galdera: {{question}}\nA: {{candidates[0]}}\nB: {{candidates[1]}}\nC: {{candidates[2]}}\nErantzuna:"
...@@ -8,6 +8,7 @@ Requires the installation of ...@@ -8,6 +8,7 @@ Requires the installation of
`pip install "bigbench @ https://storage.googleapis.com/public_research_data/bigbench/bigbench-0.0.1.tar.gz"` `pip install "bigbench @ https://storage.googleapis.com/public_research_data/bigbench/bigbench-0.0.1.tar.gz"`
and is included so that the bigbench dependency can be avoided. and is included so that the bigbench dependency can be avoided.
""" """
import bigbench.api.util as bb_utils import bigbench.api.util as bb_utils
import datasets import datasets
from tqdm import tqdm from tqdm import tqdm
......
""" """
Take in a YAML, and output all other splits with this YAML Take in a YAML, and output all other splits with this YAML
""" """
import argparse import argparse
import os import os
......
""" """
Take in a YAML, and output all other splits with this YAML Take in a YAML, and output all other splits with this YAML
""" """
import argparse import argparse
import os import os
......
# Task-name
### Paper
Title: `COMMONSENSEQA: A Question Answering Challenge Targeting
Commonsense Knowledge`
Abstract: https://arxiv.org/pdf/1811.00937.pdf
CommonsenseQA is a multiple-choice question answering dataset that requires different types of commonsense knowledge to predict the correct answers.
It contains 12,102 questions with one correct answer and four distractor answers.
Homepage: https://www.tau-nlp.org/commonsenseqa
### Citation
```
@inproceedings{talmor-etal-2019-commonsenseqa,
title = "{C}ommonsense{QA}: A Question Answering Challenge Targeting Commonsense Knowledge",
author = "Talmor, Alon and
Herzig, Jonathan and
Lourie, Nicholas and
Berant, Jonathan",
booktitle = "Proceedings of the 2019 Conference of the North {A}merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)",
month = jun,
year = "2019",
address = "Minneapolis, Minnesota",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/N19-1421",
doi = "10.18653/v1/N19-1421",
pages = "4149--4158",
archivePrefix = "arXiv",
eprint = "1811.00937",
primaryClass = "cs",
}
```
### Groups and Tasks
#### Groups
* Not part of a group yet.
#### Tasks
* `commonsense_qa`: Represents the "random" split from the paper. Uses an MMLU-style prompt, as (presumably) used by Llama evaluations.
### Checklist
For adding novel benchmarks/datasets to the library:
* [x] Is the task an existing benchmark in the literature?
* [x] Have you referenced the original paper that introduced the task?
* [x] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
If other tasks on this dataset are already supported:
* [ ] Is the "Main" variant of this task clearly denoted?
* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
task: commonsense_qa
dataset_path: tau/commonsense_qa
training_split: train
validation_split: validation
output_type: multiple_choice
doc_to_text: "Question: {{ question.strip() }}\nA. {{choices['text'][0]}}\nB. {{choices['text'][1]}}\nC. {{choices['text'][2]}}\nD. {{choices['text'][3]}}\nE. {{choices['text'][4]}}\nAnswer:"
doc_to_target: answerKey
doc_to_choice: ['A', 'B', 'C', 'D', 'E']
metric_list:
- metric: acc
aggregation: mean
higher_is_better: true
""" """
Take in a YAML, and output all other splits with this YAML Take in a YAML, and output all other splits with this YAML
""" """
import argparse import argparse
import os import os
......
"""
"""
import re import re
from typing import List from typing import List
......
...@@ -38,18 +38,19 @@ Homepage: https://github.com/hitachi-nlp/FLD ...@@ -38,18 +38,19 @@ Homepage: https://github.com/hitachi-nlp/FLD
### Groups and Tasks ### Groups and Tasks
#### Groups
* `fld`
#### Tasks
This release is the simplified version of FLD where a model is required to predict only an answer. This release is the simplified version of FLD where a model is required to predict only an answer.
This setting is described by "answer accuracy" in the original paper. This setting is described by "answer accuracy" in the original paper.
#### Tasks in Group `fld`
* `fld_default` is a basic task based on [FLD.v2](https://huggingface.co/datasets/hitachi-nlp/FLD.v2/viewer/star) * `fld_default` is a basic task based on [FLD.v2](https://huggingface.co/datasets/hitachi-nlp/FLD.v2/viewer/star)
* `fld_star`: is a more challenging version based on [FLD.v2-star](https://huggingface.co/datasets/hitachi-nlp/FLD.v2/viewer/star) * `fld_star`: is a more challenging version based on [FLD.v2-star](https://huggingface.co/datasets/hitachi-nlp/FLD.v2/viewer/star)
#### Tasks in Group `fld_logical_formula`
Further, we have "logical formula" versions of the benchmarks, which evaluate LLMs' pure logical reasoning capabilities within the domain of logical formulas, rather than natural language:
* `fld_logical_formula_default`
* `fld_logical_formula_fld_star`
### Checklist ### Checklist
For adding novel benchmarks/datasets to the library: For adding novel benchmarks/datasets to the library:
......
group:
- fld_logical_formula
task: fld_logical_formula_default
dataset_path: hitachi-nlp/FLD.v2
dataset_name: default
training_split: train
validation_split: validation
test_split: test
doc_to_text: "Based on the provided facts ($context$), either prove or disprove the hypothesis or state that it is unknown. The facts and the hypothesis are written in logical formulas as follows: capital letters such as \"{A}\", \"{B}\", \"{AB}\" are predicates, small letters such as \"{a}\", \"{b}\", \"{ab}\" are constants, \"&\" is logical conjunction, \"v\" is logical disjunction, \"¬\" is negation, \"->\" is implication, \"(x)\" is \"for all x\", and \"(Ex)\" is \"for some x\". $hypothesis$ = {{hypothesis_formula}} ; $context$ = {{context_formula}} ; $proof$ = "
doc_to_target: world_assump_label
metric_list:
- metric: exact_match
aggregation: mean
higher_is_better: true
filter_list:
- name: remove_whitespace
filter:
- function: remove_whitespace
- function: take_first
metadata:
version: 2.0
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment