Unverified Commit 3c8db1bb authored by Nathan Habib, committed by GitHub

Adds Open LLM Leaderboard Tasks (#2047)



* adds leaderboard tasks

* Delete lm_eval/tasks/leaderboard/leaderboard_chat_template.yaml

* add readme

* Delete lm_eval/tasks/leaderboard/mmlu_pro/mmlu_pro_chat_template.yaml

* modify readme

* fix bbh task

* fix bbh salient task

* modify the readme

* Delete lm_eval/tasks/leaderboard/ifeval/README.md

* Delete lm_eval/tasks/leaderboard/math/README.md

* add leaderboard to the tasks repertory

* add announcement about new leaderboard tasks

* linting

* Update README.md
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* installs ifeval dependency in new_task github workflow

---------
Co-authored-by: Nathan Habib <nathan.habib@huggingface.com>
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
parent 1870ee4e
@@ -56,7 +56,7 @@ jobs:
   if: steps.changed-tasks.outputs.tasks_any_modified == 'true' || steps.changed-tasks.outputs.api_any_modified == 'true'
   run: |
     python -m pip install --upgrade pip
-    pip install -e '.[dev]' --extra-index-url https://download.pytorch.org/whl/cpu
+    pip install -e '.[dev,ifeval]' --extra-index-url https://download.pytorch.org/whl/cpu
     # Install optional git dependencies
     # pip install bleurt@https://github.com/google-research/bleurt/archive/b610120347ef22b494b6d69b4316e303f5932516.zip#egg=bleurt
     # if [ -f requirements.txt ]; then pip install -r requirements.txt; fi
...
@@ -7,6 +7,7 @@
 New updates and features include:
+- **New Open LLM Leaderboard tasks have been added! You can find them under the [leaderboard](lm_eval/tasks/leaderboard/README.md) task group.**
 - Internal refactoring
 - Config-based task creation and configuration
 - Easier import and sharing of externally-defined task config YAMLs
...
@@ -56,6 +56,7 @@
 | [lambada_cloze](lambada_cloze/README.md) | Cloze-style LAMBADA dataset. | English |
 | [lambada_multilingual](lambada_multilingual/README.md) | Multilingual LAMBADA dataset. This is a legacy version of the multilingual dataset, and users should instead use `lambada_multilingual_stablelm`. | German, English, Spanish, French, Italian |
 | [lambada_multilingual_stablelm](lambada_multilingual_stablelm/README.md) | Multilingual LAMBADA dataset. Users should prefer evaluating on this version of the multilingual dataset instead of on `lambada_multilingual`. | German, English, Spanish, French, Italian, Dutch, Portuguese |
+| [leaderboard](leaderboard/README.md) | Task group used by Hugging Face's [Open LLM Leaderboard v2](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard). These tasks are static and will not change over time. | English |
 | [logiqa](logiqa/README.md) | Logical reasoning tasks requiring advanced inference and deduction. | English, Chinese |
 | [logiqa2](logiqa2/README.md) | Large-scale logical reasoning dataset adapted from the Chinese Civil Service Examination. | English, Chinese |
 | [mathqa](mathqa/README.md) | Question answering tasks involving mathematical reasoning and problem-solving. | English |
...
# Leaderboard evaluations
Our goal with this group is to provide a stable, unchanging set of evaluations
that will power the Open LLM Leaderboard on Hugging Face.
As we want to evaluate models across capabilities, the list currently contains:
- BBH (3-shot, multichoice)
- GPQA (0-shot, multichoice)
- MMLU-Pro (5-shot, multichoice)
- MuSR (0-shot, multichoice)
- IFEval (0-shot, generative)
- MATH-lvl-5 (4-shot, generative, Minerva version)
Details on the choice of these evals can be found [here](https://huggingface.co/spaces/open-llm-leaderboard/blog)!
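The whole group, or any subset of it, can be run through the harness as usual. Below is a minimal sketch using the Python entry point; the checkpoint name is only a placeholder and the exact arguments may differ slightly between harness versions:

```python
# Minimal sketch: evaluating a model on the leaderboard task group.
# `EleutherAI/pythia-160m` is just a placeholder checkpoint, not a recommendation.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=EleutherAI/pythia-160m",
    tasks=["leaderboard"],  # or a subset, e.g. ["leaderboard_bbh", "leaderboard_ifeval"]
    batch_size="auto",
)
print(results["results"])
```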
## BigBenchHard (BBH)
A suite of 23 challenging BIG-Bench tasks which we call BIG-Bench Hard (BBH).
These are the tasks for which prior language model evaluations did not
outperform the average human-rater.
### Paper
Title: Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them
BIG-Bench (Srivastava et al., 2022) is a diverse evaluation suite that focuses on tasks believed to be beyond the capabilities of current language models. Language models have already made good progress on this benchmark, with the best model in the BIG-Bench paper outperforming average reported human-rater results on 65% of the BIG-Bench tasks via few-shot prompting. But on what tasks do language models fall short of average human-rater performance, and are those tasks actually unsolvable by current language models?
In this work, we focus on a suite of 23 challenging BIG-Bench tasks which we call BIG-Bench Hard (BBH). These are the tasks for which prior language model evaluations did not outperform the average human-rater. We find that applying chain-of-thought (CoT) prompting to BBH tasks enables PaLM to surpass the average human-rater performance on 10 of the 23 tasks, and Codex (code-davinci-002) to surpass the average human-rater performance on 17 of the 23 tasks. Since many tasks in BBH require multi-step reasoning, few-shot prompting without CoT, as done in the BIG-Bench evaluations (Srivastava et al., 2022), substantially underestimates the best performance and capabilities of language models, which is better captured via CoT prompting. As further analysis, we explore the interaction between CoT and model scale on BBH, finding that CoT enables emergent task performance on several BBH tasks with otherwise flat scaling curves.
- Paper: https://huggingface.co/papers/2210.09261
- Homepage: https://github.com/suzgunmirac/BIG-Bench-Hard
### Citation
```
@article{suzgun2022challenging,
title={Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them},
  author={Suzgun, Mirac and Scales, Nathan and Sch{\"a}rli, Nathanael and Gehrmann, Sebastian and Tay, Yi and Chung, Hyung Won and Chowdhery, Aakanksha and Le, Quoc V and Chi, Ed H and Zhou, Denny and Wei, Jason},
journal={arXiv preprint arXiv:2210.09261},
year={2022}
}
```
### Groups
- `leaderboard_bbh`
### Tasks
- `leaderboard_bbh_boolean_expressions`
- `leaderboard_bbh_causal_judgement`
- `leaderboard_bbh_date_understanding`
- `leaderboard_bbh_disambiguation_qa`
- `leaderboard_bbh_dyck_languages`
- `leaderboard_bbh_formal_fallacies`
- `leaderboard_bbh_geometric_shapes`
- `leaderboard_bbh_hyperbaton`
- `leaderboard_bbh_logical_deduction_five_objects`
- `leaderboard_bbh_logical_deduction_seven_objects`
- `leaderboard_bbh_logical_deduction_three_objects`
- `leaderboard_bbh_movie_recommendation`
- `leaderboard_bbh_multistep_arithmetic_two`
- `leaderboard_bbh_navigate`
- `leaderboard_bbh_object_counting`
- `leaderboard_bbh_penguins_in_a_table`
- `leaderboard_bbh_reasoning_about_colored_objects`
- `leaderboard_bbh_ruin_names`
- `leaderboard_bbh_salient_translation_error_detection`
- `leaderboard_bbh_snarks`
- `leaderboard_bbh_sports_understanding`
- `leaderboard_bbh_temporal_sequences`
- `leaderboard_bbh_tracking_shuffled_objects_five_objects`
- `leaderboard_bbh_tracking_shuffled_objects_seven_objects`
- `leaderboard_bbh_tracking_shuffled_objects_three_objects`
- `leaderboard_bbh_web_of_lies`
- `leaderboard_bbh_word_sorting`
## GPQA
### Paper
Title: GPQA: A Graduate-Level Google-Proof Q&A Benchmark
We present GPQA, a challenging dataset of 448 multiple-choice questions written
by domain experts in biology, physics, and chemistry. We ensure that the
questions are high-quality and extremely difficult: experts who have or are
pursuing PhDs in the corresponding domains reach 65% accuracy (74% when
discounting clear mistakes the experts identified in retrospect), while highly
skilled non-expert validators only reach 34% accuracy, despite spending on
average over 30 minutes with unrestricted access to the web (i.e., the
questions are “Google-proof”). The questions are also difficult for
state-of-the-art AI systems, with our strongest GPT-4–based baseline achieving
39% accuracy. If we are to use future AI systems to help us answer very hard
questions—for example, when developing new scientific knowledge—we need to
develop scalable oversight methods that enable humans to supervise their
outputs, which may be difficult even if the supervisors are themselves skilled
and knowledgeable. The difficulty of GPQA both for skilled non-experts and
frontier AI systems should enable realistic scalable oversight experiments,
which we hope can help devise ways for human experts to reliably get truthful
information from AI systems that surpass human capabilities.
- Paper: https://huggingface.co/papers/2311.12022
- Homepage: https://github.com/idavidrein/gpqa/tree/main
### Citation
```
@misc{rein2023gpqa,
title={GPQA: A Graduate-Level Google-Proof Q&A Benchmark},
author={David Rein and Betty Li Hou and Asa Cooper Stickland and Jackson Petty and Richard Yuanzhe Pang and Julien Dirani and Julian Michael and Samuel R. Bowman},
year={2023},
eprint={2311.12022},
archivePrefix={arXiv},
primaryClass={cs.AI}
}
```
### Groups
- `leaderboard_gpqa`
### Tasks
- `leaderboard_gpqa_extended`
- `leaderboard_gpqa_diamond`
- `leaderboard_gpqa_main`
## IFEval
### Paper
Title: Instruction-Following Evaluation for Large Language Models
One core capability of Large Language Models (LLMs) is to follow natural
language instructions. However, the evaluation of such abilities is not
standardized: Human evaluations are expensive, slow, and not objectively
reproducible, while LLM-based auto-evaluation is potentially biased or limited
by the ability of the evaluator LLM. To overcome these issues, we introduce
Instruction-Following Eval (IFEval) for large language models. IFEval is a
straightforward and easy-to-reproduce evaluation benchmark. It focuses on a set
of "verifiable instructions" such as "write in more than 400 words" and
"mention the keyword of AI at least 3 times". We identified 25 types of those
verifiable instructions and constructed around 500 prompts, with each prompt
containing one or more verifiable instructions. We show evaluation results of
two widely available LLMs on the market.
- Paper: https://huggingface.co/papers/2311.07911
- Homepage: https://github.com/google-research/google-research/tree/master/instruction_following_eval
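As a toy illustration of what a "verifiable instruction" means in practice (the real `ifeval` dependency implements all 25 instruction types with far more care), such checks are deterministic functions of the model's response:

```python
# Toy examples of verifiable-instruction checks; not the ifeval package's implementation.
def follows_min_word_count(response: str, min_words: int = 400) -> bool:
    # "write in more than 400 words"
    return len(response.split()) > min_words

def mentions_keyword_at_least(response: str, keyword: str = "AI", n: int = 3) -> bool:
    # "mention the keyword of AI at least 3 times"
    return response.count(keyword) >= n
```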
### Citation
```
@article{zhou2023instructionfollowing,
title={Instruction-Following Evaluation for Large Language Models},
author={Jeffrey Zhou and Tianjian Lu and Swaroop Mishra and Siddhartha Brahma and Sujoy Basu and Yi Luan and Denny Zhou and Le Hou},
journal={arXiv preprint arXiv:2311.07911},
year={2023},
}
```
### Tasks
- `leaderboard_ifeval`
## MATH-hard
This is the 4-shot variant of Minerva MATH, keeping only the Level 5 questions.
### Paper
Title: Measuring Mathematical Problem Solving With the MATH Dataset
Many intellectual endeavors require mathematical problem solving, but this
skill remains beyond the capabilities of computers. To measure this ability in
machine learning models, we introduce MATH, a new dataset of 12,500 challenging
competition mathematics problems. Each problem in MATH has a full step-by-step
solution which can be used to teach models to generate answer derivations and
explanations.
NOTE: The few-shot prompting and the generated-answer extraction are based on
[Minerva](https://arxiv.org/abs/2206.14858), and exact-match equivalence is
checked using the `sympy` library. This requires additional dependencies, which
can be installed via the `lm-eval[math]` extra.
- Paper: https://huggingface.co/papers/2103.03874
- Homepage: https://github.com/hendrycks/math
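As a rough illustration of the `sympy`-based check mentioned above (the harness's actual Minerva-style scorer is more involved and also handles LaTeX normalization), two answer strings can be treated as equivalent when their symbolic difference simplifies to zero:

```python
# Toy sketch of sympy-based exact-match equivalence; not the harness's actual scorer.
import sympy

def answers_equivalent(pred: str, gold: str) -> bool:
    try:
        # Equivalent iff the symbolic difference simplifies to zero, e.g. "1/2" vs "0.5".
        return sympy.simplify(sympy.sympify(pred) - sympy.sympify(gold)) == 0
    except (sympy.SympifyError, TypeError):
        # Fall back to a plain string comparison if parsing fails.
        return pred.strip() == gold.strip()

print(answers_equivalent("1/2", "0.5"))  # True
```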
### Citation
```
@article{hendrycksmath2021,
title={Measuring Mathematical Problem Solving With the MATH Dataset},
author={Dan Hendrycks and Collin Burns and Saurav Kadavath and Akul Arora and Steven Basart and Eric Tang and Dawn Song and Jacob Steinhardt},
journal={NeurIPS},
year={2021}
}
@misc{2206.14858,
Author = {Aitor Lewkowycz and Anders Andreassen and David Dohan and Ethan Dye and Henryk Michalewski and Vinay Ramasesh and Ambrose Slone and Cem Anil and Imanol Schlag and Theo Gutman-Solo and Yuhuai Wu and Behnam Neyshabur and Guy Gur-Ari and Vedant Misra},
Title = {Solving Quantitative Reasoning Problems with Language Models},
Year = {2022},
Eprint = {arXiv:2206.14858},
}
```
### Groups
- `leaderboard_math_hard`
### Tasks
- `leaderboard_math_algebra_hard`
- `leaderboard_math_counting_and_prob_hard`
- `leaderboard_math_geometry_hard`
- `leaderboard_math_intermediate_algebra_hard`
- `leaderboard_math_num_theory_hard`
- `leaderboard_math_prealgebra_hard`
- `leaderboard_math_precalc_hard`
## MMLU-Pro
### Paper
Title: MMLU-Pro: A More Robust and Challenging Multi-Task Language
Understanding Benchmark
In the age of large-scale language models, benchmarks like the Massive
Multitask Language Understanding (MMLU) have been pivotal in pushing the
boundaries of what AI can achieve in language comprehension and reasoning
across diverse domains. However, as models continue to improve, their
performance on these benchmarks has begun to plateau, making it increasingly
difficult to discern differences in model capabilities. This paper introduces
MMLU-Pro, an enhanced dataset designed to extend the mostly knowledge-driven
MMLU benchmark by integrating more challenging, reasoning-focused questions and
expanding the choice set from four to ten options. Additionally, MMLU-Pro
eliminates the trivial and noisy questions in MMLU. Our experimental results
show that MMLU-Pro not only raises the challenge, causing a significant drop in
accuracy by 16% to 33% compared to MMLU but also demonstrates greater stability
under varying prompts. With 24 different prompt styles tested, the sensitivity
of model scores to prompt variations decreased from 4-5% in MMLU to just 2% in
MMLU-Pro. Additionally, we found that models utilizing Chain of Thought (CoT)
reasoning achieved better performance on MMLU-Pro compared to direct answering,
which is in stark contrast to the findings on the original MMLU, indicating
that MMLU-Pro includes more complex reasoning questions. Our assessments
confirm that MMLU-Pro is a more discriminative benchmark to better track
progress in the field.
- Paper: https://huggingface.co/papers/2406.01574
- Homepage: https://huggingface.co/datasets/TIGER-Lab/MMLU-Pro
### Citation
```
@misc{wang2024mmluprorobustchallengingmultitask,
title={MMLU-Pro: A More Robust and Challenging Multi-Task Language
Understanding Benchmark},
author={Yubo Wang and Xueguang Ma and Ge Zhang and Yuansheng Ni and Abhranil Chandra and Shiguang Guo and Weiming Ren and Aaran Arulraj and Xuan He and Ziyan Jiang and Tianle Li and Max Ku and Kai Wang and Alex Zhuang and Rongqi Fan and Xiang Yue and Wenhu Chen},
year={2024},
eprint={2406.01574},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2406.01574},
}
```
## MuSR
### Paper
Title: MuSR: Testing the Limits of Chain-of-thought with Multistep Soft
Reasoning
While large language models (LLMs) equipped with techniques like
chain-of-thought prompting have demonstrated impressive capabilities, they
still fall short in their ability to reason robustly in complex settings.
However, evaluating LLM reasoning is challenging because system capabilities
continue to grow while benchmark datasets for tasks like logical deduction have
remained static. We introduce MuSR, a dataset for evaluating language models on
multistep soft reasoning tasks specified in a natural language narrative. This
dataset has two crucial features. First, it is created through a novel
neurosymbolic synthetic-to-natural generation algorithm, enabling the
construction of complex reasoning instances that challenge GPT-4 (e.g., murder
mysteries roughly 1000 words in length) and which can be scaled further as more
capable LLMs are released. Second, our dataset instances are free text
narratives corresponding to real-world domains of reasoning; this makes it
simultaneously much more challenging than other synthetically-crafted
benchmarks while remaining realistic and tractable for human annotators to
solve with high accuracy. We evaluate a range of LLMs and prompting techniques
on this dataset and characterize the gaps that remain for techniques like
chain-of-thought to perform robust reasoning.
- Paper: https://huggingface.co/papers/2310.16049
- Homepage: https://zayne-sprague.github.io/MuSR/
### Citation
```
@misc{sprague2024musrtestinglimitschainofthought,
title={MuSR: Testing the Limits of Chain-of-thought with Multistep Soft
Reasoning},
author={Zayne Sprague and Xi Ye and Kaj Bostrom and Swarat Chaudhuri and Greg Durrett},
year={2024},
eprint={2310.16049},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2310.16049},
}
```
### Groups
- `leaderboard_musr`
### Tasks
- `leaderboard_musr_murder_mysteries`
- `leaderboard_musr_object_placements`
- `leaderboard_musr_team_allocation`
group: leaderboard_bbh
dataset_path: SaylorTwift/bbh
output_type: multiple_choice
test_split: test
doc_to_text: 'Q: {{input}}
A:'
doc_to_target: "{{target}}"
metric_list:
- metric: acc_norm
aggregation: mean
higher_is_better: true
num_fewshot: 3
fewshot_config:
sampler: first_n
metadata:
version: 0.0
dataset_name: boolean_expressions
description: 'Evaluate the result of a random Boolean expression.
'
doc_to_choice: ["False", "True"]
fewshot_config:
sampler: first_n
samples:
- input: not ( ( not not True ) ) is
target: 'False'
- input: True and False and not True and True is
target: 'False'
- input: not not ( not ( False ) ) is
target: 'True'
include: _fewshot_template_yaml
task: leaderboard_bbh_boolean_expressions
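For readers unfamiliar with the harness's YAML task format, the sketch below (assumed behaviour, not the actual code path) shows how the shared `_fewshot_template_yaml` above, combined with the `boolean_expressions` overrides, turns a dataset row into a multiple-choice request:

```python
# Sketch only: how doc_to_text / doc_to_target / doc_to_choice are applied to one row.
from jinja2 import Template

doc = {"input": "not ( ( not not True ) ) is", "target": "False"}
doc_to_text = Template("Q: {{input}}\nA:")   # from _fewshot_template_yaml
doc_to_target = Template("{{target}}")
doc_to_choice = ["False", "True"]            # from the boolean_expressions config

prompt = doc_to_text.render(**doc)           # "Q: not ( ( not not True ) ) is\nA:"
gold = doc_to_choice.index(doc_to_target.render(**doc))  # index 0 -> "False"

# The harness scores the log-likelihood of each choice as a continuation of the
# prompt and reports length-normalized accuracy (acc_norm), per metric_list above.
print(prompt, doc_to_choice[gold])
```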
dataset_name: causal_judgement
description: 'Answer questions about causal attribution.
'
doc_to_choice: ["Yes", "No"]
fewshot_config:
sampler: first_n
samples:
- input: 'How would a typical person answer each of the following questions about
causation?
Frank T., had an ongoing dispute with his neighbor over a stretch of land and
one day decided to shoot his neighbor in the body. Frank T. had no experience
with guns, his hand slipped on the barrel of the gun, and the shot went wild.
Nonetheless, the bullet bounced off a large boulder several feet away and hit
the neighbor''s body, causing significant injury. Did Frank T. intentionally
shoot his neighbor in the body?
Options:
- Yes
- No'
target: 'No'
- input: 'How would a typical person answer each of the following questions about
causation?
Suzy and Billy are working on a project that is very important for our nation''s
security. The boss tells them both: "Be sure that you are here at exactly 9
am. It is absolutely essential that you arrive at that time." Both Billy and
Suzy arrive at 9 am. As it happens, there was a motion detector installed in
the room where they arrived. The motion detector was set up to be triggered
if at least one person appeared in the room at the same time. So the motion
detector went off. Did Billy cause the motion detector to go off?
Options:
- Yes
- No'
target: 'Yes'
- input: 'How would a typical person answer each of the following questions about
causation?
George and his sister Lena reunite at their parents'' house for Thanksgiving.
Whereas George just got into medical school, Lena is unhappy in her marriage
and recently lost her job. Over the course of the day, George and Lena get into
a number of heated arguments. Later in the afternoon they play a game of darts.
They split the first two games, and the third game is close until the end. Who
will win comes down to George''s last shot. If he hits a high point region,
he wins; if he hits a low point region, Lena wins. George thinks of the difficult
time Lena is having, and he really wants to let her win. He aims the dart at
the low point region. He sets up his shot and the dart lands in the low point
region. After his shot, Lena wins the game and is very happy. Did George hit
the low point region intentionally?
Options:
- Yes
- No'
target: 'Yes'
include: _fewshot_template_yaml
task: leaderboard_bbh_causal_judgement
dataset_name: date_understanding
description: 'Infer the date from context.
'
doc_to_choice: ["(A)", "(B)", "(C)", "(D)", "(E)", "(F)"]
fewshot_config:
sampler: first_n
samples:
- input: 'Today is Christmas Eve of 1937. What is the date 10 days ago in MM/DD/YYYY?
Options:
(A) 12/14/2026
(B) 12/14/1950
(C) 12/14/2007
(D) 12/14/1937
(E) 07/14/1938
(F) 12/14/1988'
target: (D)
- input: 'Tomorrow is 11/12/2019. What is the date one year ago from today in MM/DD/YYYY?
Options:
(A) 09/04/2018
(B) 11/11/2018
(C) 08/25/2018
(D) 11/02/2018
(E) 11/04/2018'
target: (B)
- input: 'Jane and John married on Jan 2, 1958. It is their 5-year anniversary today.
What is the date tomorrow in MM/DD/YYYY?
Options:
(A) 01/11/1961
(B) 01/03/1963
(C) 01/18/1961
(D) 10/14/1960
(E) 01/03/1982
(F) 12/03/1960'
target: (B)
include: _fewshot_template_yaml
task: leaderboard_bbh_date_understanding
dataset_name: disambiguation_qa
description: 'Clarify the meaning of sentences with ambiguous pronouns.
'
doc_to_choice: ["(A)", "(B)", "(C)"]
fewshot_config:
sampler: first_n
samples:
- input: 'In the following sentences, explain the antecedent of the pronoun (which
thing the pronoun refers to), or state that it is ambiguous.
Sentence: The chief told the counselor that they took the day off.
Options:
(A) The chief took the day off
(B) The counselor took the day off
(C) Ambiguous'
target: (A)
- input: 'In the following sentences, explain the antecedent of the pronoun (which
thing the pronoun refers to), or state that it is ambiguous.
Sentence: The manager sent a message to the secretary, but he didn''t reply
yet.
Options:
(A) The secretary didn''t reply yet
(B) The manager didn''t reply yet
(C) Ambiguous'
target: (A)
- input: 'In the following sentences, explain the antecedent of the pronoun (which
thing the pronoun refers to), or state that it is ambiguous.
Sentence: Bailey will plan to meet the director at his office
Options:
(A) It will be Bailey''s office
(B) It will be the director''s office
(C) Ambiguous'
target: (C)
include: _fewshot_template_yaml
task: leaderboard_bbh_disambiguation_qa
dataset_name: formal_fallacies
description: 'Distinguish deductively valid arguments from formal fallacies.
'
doc_to_choice: ["valid", "invalid"]
fewshot_config:
sampler: first_n
samples:
- input: '"It is not always easy to see who is related to whom -- and in which ways.
The following argument pertains to this question: To begin with, Lesley is a
close friend of Fernando. Moreover, being a close friend of Fernando or a schoolmate
of Lowell is sufficient for being a great-grandfather of Leroy. It follows that
Lesley is a great-grandfather of Leroy."
Is the argument, given the explicitly stated premises, deductively valid or
invalid?
Options:
- valid
- invalid'
target: valid
- input: '"It is not always easy to see who is related to whom -- and in which ways.
The following argument pertains to this question: Whoever is not a great-grandfather
of Clyde is a stepbrother of Brian. Being an ancestor of Dana is sufficient
for not being a great-grandfather of Clyde. We may conclude: Everyone who is
an ancestor of Dana is a stepbrother of Brian, too."
Is the argument, given the explicitly stated premises, deductively valid or
invalid?
Options:
- valid
- invalid'
target: valid
- input: '"It is not always easy to grasp who is consuming which products. The following
argument pertains to this question: Every infrequent user of Paul Mitchell shampoo
is either a rare consumer of Nioxin shampoo or a loyal buyer of Caress soap,
or both. No regular consumer of Lush soap is a rare consumer of Nioxin shampoo
and, in the same time, a loyal buyer of Caress soap. It follows that whoever
is an infrequent user of Paul Mitchell shampoo is not a regular consumer of
Lush soap."
Is the argument, given the explicitly stated premises, deductively valid or
invalid?
Options:
- valid
- invalid'
target: invalid
include: _fewshot_template_yaml
task: leaderboard_bbh_formal_fallacies
dataset_name: geometric_shapes
description: 'Name geometric shapes from their SVG paths.
'
doc_to_choice: ["(A)","(B)","(C)","(D)","(E)","(F)","(G)","(H)","(I)","(J)","(K)"]
fewshot_config:
sampler: first_n
samples:
- input: 'This SVG path element <path d="M 31.00,73.00 L 32.00,59.00 L 44.00,50.00
L 49.00,41.00 L 64.00,37.00 L 71.00,55.00 L 64.00,76.00 L 52.00,61.00 L 31.00,73.00"/>
draws a
Options:
(A) circle
(B) heptagon
(C) hexagon
(D) kite
(E) line
(F) octagon
(G) pentagon
(H) rectangle
(I) sector
(J) triangle'
target: (F)
- input: 'This SVG path element <path d="M 14.19,26.04 L 51.43,39.21 L 58.44,36.69
L 56.63,30.17 L 48.53,26.66 L 14.19,26.04"/> draws a
Options:
(A) circle
(B) heptagon
(C) hexagon
(D) kite
(E) line
(F) octagon
(G) pentagon
(H) rectangle
(I) sector
(J) triangle'
target: (G)
- input: 'This SVG path element <path d="M 41.00,43.00 L 37.00,34.00 L 41.00,33.00
L 45.00,34.00 L 41.00,43.00"/> draws a
Options:
(A) circle
(B) heptagon
(C) hexagon
(D) kite
(E) line
(F) octagon
(G) pentagon
(H) rectangle
(I) sector
(J) triangle'
target: (D)
include: _fewshot_template_yaml
task: leaderboard_bbh_geometric_shapes
dataset_name: hyperbaton
description: 'Order adjectives correctly in English sentences.
'
doc_to_choice: ["(A)", "(B)"]
fewshot_config:
sampler: first_n
samples:
- input: 'Which sentence has the correct adjective order:
Options:
(A) rubber terrible ship
(B) terrible rubber ship'
target: (B)
- input: 'Which sentence has the correct adjective order:
Options:
(A) repulsive small Brazilian exercise ship
(B) Brazilian repulsive exercise small ship'
target: (A)
- input: 'Which sentence has the correct adjective order:
Options:
(A) blue gold wonderful square shoe
(B) wonderful square blue gold shoe'
target: (B)
include: _fewshot_template_yaml
task: leaderboard_bbh_hyperbaton
dataset_name: logical_deduction_five_objects
description: 'A logical deduction task which requires deducing the order of a sequence
of objects.
'
doc_to_choice: ["(A)","(B)","(C)","(D)","(E)"]
fewshot_config:
sampler: first_n
samples:
- input: 'The following paragraphs each describe a set of three objects arranged
in a fixed order. The statements are logically consistent within each paragraph.
In a golf tournament, there were three golfers: Amy, Eli, and Eve. Eve finished
above Amy. Eli finished below Amy.
Options:
(A) Amy finished last
(B) Eli finished last
(C) Eve finished last'
target: (B)
- input: 'The following paragraphs each describe a set of three objects arranged
in a fixed order. The statements are logically consistent within each paragraph.
On a shelf, there are three books: a white book, a green book, and an orange
book. The green book is to the right of the white book. The orange book is the
rightmost.
Options:
(A) The white book is the leftmost
(B) The green book is the leftmost
(C) The orange book is the leftmost'
target: (A)
- input: 'The following paragraphs each describe a set of three objects arranged
in a fixed order. The statements are logically consistent within each paragraph.
On a shelf, there are three books: a red book, a gray book, and a white book.
The white book is to the left of the gray book. The red book is the second from
the left.
Options:
(A) The red book is the leftmost
(B) The gray book is the leftmost
(C) The white book is the leftmost'
target: (C)
include: _fewshot_template_yaml
task: leaderboard_bbh_logical_deduction_five_objects
dataset_name: logical_deduction_seven_objects
description: 'A logical deduction task which requires deducing the order of a sequence
of objects.
'
doc_to_choice: ["(A)","(B)","(C)","(D)","(E)","(F)","(G)"]
fewshot_config:
sampler: first_n
samples:
- input: 'The following paragraphs each describe a set of three objects arranged
in a fixed order. The statements are logically consistent within each paragraph.
In a golf tournament, there were three golfers: Amy, Eli, and Eve. Eve finished
above Amy. Eli finished below Amy.
Options:
(A) Amy finished last
(B) Eli finished last
(C) Eve finished last'
target: (B)
- input: 'The following paragraphs each describe a set of three objects arranged
in a fixed order. The statements are logically consistent within each paragraph.
On a shelf, there are three books: a white book, a green book, and an orange
book. The green book is to the right of the white book. The orange book is the
rightmost.
Options:
(A) The white book is the leftmost
(B) The green book is the leftmost
(C) The orange book is the leftmost'
target: (A)
- input: 'The following paragraphs each describe a set of three objects arranged
in a fixed order. The statements are logically consistent within each paragraph.
On a shelf, there are three books: a red book, a gray book, and a white book.
The white book is to the left of the gray book. The red book is the second from
the left.
Options:
(A) The red book is the leftmost
(B) The gray book is the leftmost
(C) The white book is the leftmost'
target: (C)
include: _fewshot_template_yaml
task: leaderboard_bbh_logical_deduction_seven_objects
dataset_name: logical_deduction_three_objects
description: 'A logical deduction task which requires deducing the order of a sequence
of objects.
'
doc_to_choice: ["(A)","(B)","(C)"]
fewshot_config:
sampler: first_n
samples:
- input: 'The following paragraphs each describe a set of three objects arranged
in a fixed order. The statements are logically consistent within each paragraph.
In a golf tournament, there were three golfers: Amy, Eli, and Eve. Eve finished
above Amy. Eli finished below Amy.
Options:
(A) Amy finished last
(B) Eli finished last
(C) Eve finished last'
target: (B)
- input: 'The following paragraphs each describe a set of three objects arranged
in a fixed order. The statements are logically consistent within each paragraph.
On a shelf, there are three books: a white book, a green book, and an orange
book. The green book is to the right of the white book. The orange book is the
rightmost.
Options:
(A) The white book is the leftmost
(B) The green book is the leftmost
(C) The orange book is the leftmost'
target: (A)
- input: 'The following paragraphs each describe a set of three objects arranged
in a fixed order. The statements are logically consistent within each paragraph.
On a shelf, there are three books: a red book, a gray book, and a white book.
The white book is to the left of the gray book. The red book is the second from
the left.
Options:
(A) The red book is the leftmost
(B) The gray book is the leftmost
(C) The white book is the leftmost'
target: (C)
include: _fewshot_template_yaml
task: leaderboard_bbh_logical_deduction_three_objects
dataset_name: movie_recommendation
description: 'Recommend movies similar to the given list of movies.
'
doc_to_choice: ["(A)","(B)","(C)","(D)","(E)","(F)"]
fewshot_config:
sampler: first_n
samples:
- input: 'Find a movie similar to Star Wars Episode IV - A New Hope, Indiana Jones
and the Last Crusade, Star Wars Episode V - The Empire Strikes Back, The Big
Lebowski:
Options:
(A) Tetsuo
(B) the Ironman
(C) The Princess Bride
(D) The Barkley Marathons The Race That Eats Its Young
(E) Bug'
target: (C)
- input: 'Find a movie similar to Twister, The Silence of the Lambs, Independence
Day, Braveheart:
Options:
(A) They Shoot Horses
(B) Don''t They
(C) Forrest Gump
(D) The Salton Sea
(E) Extreme Days'
target: (C)
- input: 'Find a movie similar to Minority Report, Total Recall, Inside Out, Forrest
Gump:
Options:
(A) Phenomena
(B) Lilting
(C) Catwoman
(D) Edge of Tomorrow'
target: (D)
include: _fewshot_template_yaml
task: leaderboard_bbh_movie_recommendation
dataset_name: navigate
description: 'Given a series of navigation instructions, determine whether one would
end up back at the starting point.
'
doc_to_choice: ["Yes","No"]
fewshot_config:
sampler: first_n
samples:
- input: 'If you follow these instructions, do you return to the starting point?
Turn left. Turn around. Turn left. Take 7 steps. Take 2 steps. Take 4 steps.
Take 8 steps.
Options:
- Yes
- No'
target: 'No'
- input: 'If you follow these instructions, do you return to the starting point?
Turn around. Take 1 step. Take 6 steps. Turn around. Take 6 steps. Take 9 steps.
Take 1 step.
Options:
- Yes
- No'
target: 'No'
- input: 'If you follow these instructions, do you return to the starting point?
Always face forward. Take 2 steps right. Take 9 steps left. Take 7 steps right.
Options:
- Yes
- No'
target: 'Yes'
include: _fewshot_template_yaml
task: leaderboard_bbh_navigate
dataset_name: object_counting
description: 'Questions that involve enumerating objects and asking the model to count
them.
'
doc_to_choice: ["0","1","2","3","4","5","6","7","8","9","10", "11", "12", "13", "14", "15", "16", "17", "18"]
fewshot_config:
sampler: first_n
samples:
- input: I have a blackberry, a clarinet, a nectarine, a plum, a strawberry, a banana,
a flute, an orange, and a violin. How many fruits do I have?
target: '6'
- input: I have an orange, a raspberry, two peaches, a blackberry, an apple, a grape,
a nectarine, and three plums. How many fruits do I have?
target: '11'
- input: I have a lettuce head, a head of broccoli, an onion, a stalk of celery,
two carrots, a garlic, and a yam. How many vegetables do I have?
target: '8'
include: _fewshot_template_yaml
task: leaderboard_bbh_object_counting
dataset_name: penguins_in_a_table
description: 'Answer questions about a table of penguins and their attributes.
'
doc_to_choice: ["(A)","(B)","(C)","(D)","(E)"]
fewshot_config:
sampler: first_n
samples:
- input: 'Here is a table where the first line is a header and each subsequent line
is a penguin: name, age, height (cm), weight (kg) Louis, 7, 50, 11 Bernard,
5, 80, 13 Vincent, 9, 60, 11 Gwen, 8, 70, 15 For example: the age of Louis
is 7, the weight of Gwen is 15 kg, the height of Bernard is 80 cm. We now add
a penguin to the table:
James, 12, 90, 12
How many penguins are less than 8 years old?
Options:
(A) 1
(B) 2
(C) 3
(D) 4
(E) 5'
target: (B)
- input: 'Here is a table where the first line is a header and each subsequent line
is a penguin: name, age, height (cm), weight (kg) Louis, 7, 50, 11 Bernard,
5, 80, 13 Vincent, 9, 60, 11 Gwen, 8, 70, 15 For example: the age of Louis
is 7, the weight of Gwen is 15 kg, the height of Bernard is 80 cm. Which is
the youngest penguin?
Options:
(A) Louis
(B) Bernard
(C) Vincent
(D) Gwen
(E) James'
target: (B)
- input: 'Here is a table where the first line is a header and each subsequent line
is a penguin: name, age, height (cm), weight (kg) Louis, 7, 50, 11 Bernard,
5, 80, 13 Vincent, 9, 60, 11 Gwen, 8, 70, 15 For example: the age of Louis
is 7, the weight of Gwen is 15 kg, the height of Bernard is 80 cm. What is
the name of the second penguin sorted by alphabetic order?
Options:
(A) Louis
(B) Bernard
(C) Vincent
(D) Gwen
(E) James'
target: (D)
include: _fewshot_template_yaml
task: leaderboard_bbh_penguins_in_a_table
dataset_name: reasoning_about_colored_objects
description: 'Answer extremely simple questions about the colors of objects on a surface.
'
doc_to_choice: ["(A)","(B)","(C)","(D)","(E)","(F)","(G)","(H)","(I)","(J)","(K)","(L)","(M)","(N)","(O)","(P)","(Q)","(R)"]
fewshot_config:
sampler: first_n
samples:
- input: 'On the nightstand, there is a red pencil, a purple mug, a burgundy keychain,
a fuchsia teddy bear, a black plate, and a blue stress ball. What color is the
stress ball?
Options:
(A) red
(B) orange
(C) yellow
(D) green
(E) blue
(F) brown
(G) magenta
(H) fuchsia
(I) mauve
(J) teal
(K) turquoise
(L) burgundy
(M) silver
(N) gold
(O) black
(P) grey
(Q) purple
(R) pink'
target: (E)
- input: 'On the table, you see a bunch of objects arranged in a row: a purple paperclip,
a pink stress ball, a brown keychain, a green scrunchiephone charger, a mauve
fidget spinner, and a burgundy pen. What is the color of the object directly
to the right of the stress ball?
Options:
(A) red
(B) orange
(C) yellow
(D) green
(E) blue
(F) brown
(G) magenta
(H) fuchsia
(I) mauve
(J) teal
(K) turquoise
(L) burgundy
(M) silver
(N) gold
(O) black
(P) grey
(Q) purple
(R) pink'
target: (F)
- input: 'On the nightstand, you see the following items arranged in a row: a teal
plate, a burgundy keychain, a yellow scrunchiephone charger, an orange mug,
a pink notebook, and a grey cup. How many non-orange items do you see to the
left of the teal item?
Options:
(A) zero
(B) one
(C) two
(D) three
(E) four
(F) five
(G) six'
target: (A)
include: _fewshot_template_yaml
task: leaderboard_bbh_reasoning_about_colored_objects