Unverified Commit 3c8db1bb authored by Nathan Habib, committed by GitHub

Adds Open LLM Leaderboard Tasks (#2047)



* adds leaderboard tasks

* Delete lm_eval/tasks/leaderboard/leaderboard_chat_template.yaml

* add readme

* Delete lm_eval/tasks/leaderboard/mmlu_pro/mmlu_pro_chat_template.yaml

* modify readme

* fix bbh task

* fix bbh salient task

* modify the readme

* Delete lm_eval/tasks/leaderboard/ifeval/README.md

* Delete lm_eval/tasks/leaderboard/math/README.md

* add leaderboard to the tasks repertory

* add announcement about new leaderboard tasks

* linting

* Update README.md
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* installs ifeval dependency in new_task github workflow

---------
Co-authored-by: Nathan Habib <nathan.habib@huggingface.com>
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
parent 1870ee4e
dataset_name: ruin_names
description: 'Select the humorous edit that ''ruins'' the input movie or musical artist
  name.

  '
doc_to_choice: ["(A)","(B)","(C)","(D)","(E)","(F)"]
fewshot_config:
  sampler: first_n
  samples:
  - input: 'Which of the following is a humorous edit of this artist or movie name:
      ''whitesnake''?
      Options:
      (A) whitesnape
      (B) whitesnapke
      (C) whitesnuake
      (D) mwhitesnake'
    target: (A)
  - input: 'Which of the following is a humorous edit of this artist or movie name:
      ''one of our dinosaurs is missing''?
      Options:
      (A) ofne of our dinosaurs is missing
      (B) one af our dinosaurs is missing
      (C) one of our dinosaurs is pissing
      (D) one of our dinosaur is missing'
    target: (C)
  - input: 'Which of the following is a humorous edit of this artist or movie name:
      ''counting crows''?
      Options:
      (A) countingy crows
      (B) counting cows
      (C) courting crows
      (D) coutnting crows'
    target: (B)
include: _fewshot_template_yaml
task: leaderboard_bbh_ruin_names
dataset_name: salient_translation_error_detection
description: 'Detect the type of error in an English translation of a German source
  sentence.

  '
doc_to_choice: ["(A)","(B)","(C)","(D)","(E)","(F)"]
fewshot_config:
  sampler: first_n
  samples:
  - input: "The following translations from German to English contain a particular
      error. That error will be one of the following types: Named Entities: An entity
      (names, places, locations, etc.) is changed to a different entity. Numerical
      Values: Numerical values (ordinals or cardinals), dates, and/or units are changed.
      Modifiers or Adjectives: The modifiers and adjectives pertaining to a noun are
      changed. Negation or Antonyms: Introduce or remove a negation or change comparatives
      to their antonyms. Facts: Trivial factual errors not pertaining to the above
      classes are introduced in the translations. Dropped Content: A significant clause
      in the translation is removed. Please identify that error. Source: In der Liste
      der Baudenkmale in Lenzen (Elbe) sind alle Baudenkmale der brandenburgischen
      Stadt Lenzen (Elbe) und ihrer Ortsteile aufgelistet.
      Translation: In the list of architectural monuments in Lenzen all architectural
      monuments of the Brandenburg city of Lenzen and its districts are listed.
      The translation contains an error pertaining to
      Options:
      (A) Modifiers or Adjectives
      (B) Numerical Values
      (C) Negation or Antonyms
      (D) Named Entities
      (E) Dropped Content
      (F) Facts"
    target: (D)
  - input: "The following translations from German to English contain a particular
      error. That error will be one of the following types: Named Entities: An entity
      (names, places, locations, etc.) is changed to a different entity. Numerical
      Values: Numerical values (ordinals or cardinals), dates, and/or units are changed.
      Modifiers or Adjectives: The modifiers and adjectives pertaining to a noun are
      changed. Negation or Antonyms: Introduce or remove a negation or change comparatives
      to their antonyms. Facts: Trivial factual errors not pertaining to the above
      classes are introduced in the translations. Dropped Content: A significant clause
      in the translation is removed. Please identify that error. Source: Auf dieser
      Seite sind die Baudenkm\xE4ler der oberbayerischen Gro\xDFen Kreisstadt Landsberg
      am Lech zusammengestellt.
      Translation: On this page are compiled the architectural monuments of the town of
      Landsberg am Lech.
      The translation contains an error pertaining to
      Options:
      (A) Modifiers or Adjectives
      (B) Numerical Values
      (C) Negation or Antonyms
      (D) Named Entities
      (E) Dropped Content
      (F) Facts"
    target: (E)
  - input: "The following translations from German to English contain a particular
      error. That error will be one of the following types: Named Entities: An entity
      (names, places, locations, etc.) is changed to a different entity. Numerical
      Values: Numerical values (ordinals or cardinals), dates, and/or units are changed.
      Modifiers or Adjectives: The modifiers and adjectives pertaining to a noun are
      changed. Negation or Antonyms: Introduce or remove a negation or change comparatives
      to their antonyms. Facts: Trivial factual errors not pertaining to the above
      classes are introduced in the translations. Dropped Content: A significant clause
      in the translation is removed. Please identify that error. Source: \u0141eba ist
      eine Kleinstadt und ein Badeort im Powiat L\u0119borski der polnischen Woiwodschaft
      Pommern.
      Translation: Eba is not a small town and seaside resort in the Powiat L\xE9borski
      county of the Pomeranian Voivodeship of Poland.\nThe translation contains an error
      pertaining to
      Options:
      (A) Modifiers or Adjectives
      (B) Numerical Values
      (C) Negation or Antonyms
      (D) Named Entities
      (E) Dropped Content
      (F) Facts"
    target: (C)
include: _fewshot_template_yaml
task: leaderboard_bbh_salient_translation_error_detection
dataset_name: snarks
description: 'Determine which of two sentences is sarcastic.

  According to Cambridge University Dictionary, sarcasm is "the use of remarks that
  clearly mean the opposite of what they say, made in order to hurt someone''s feelings
  or to criticize something in a humorous way." Sarcastic sentences often contain
  satirical or ironic utterances, hyperboles, ambivalent or witty remarks.

  '
doc_to_choice: ["(A)","(B)"]
fewshot_config:
  sampler: first_n
  samples:
  - input: 'Which statement is sarcastic?
      Options:
      (A) Yes, because having interests and actively researching them is a huge waste
      (B) Yes, because having interests and actively researching them is a huge deal'
    target: (A)
  - input: 'Which statement is sarcastic?
      Options:
      (A) No one is going to disagree with you on this. Avoiding ad hominem attacks
      really help your case
      (B) No one is going to disagree with you on this. Ad hominem attacks really
      help your case'
    target: (B)
  - input: 'Which statement is sarcastic?
      Options:
      (A) Consistency in the league''s punishments? What do you think this is supposed
      to be, politics?
      (B) Consistency in the league''s punishments? What do you think this is supposed
      to be, moral?'
    target: (A)
include: _fewshot_template_yaml
task: leaderboard_bbh_snarks
dataset_name: sports_understanding
description: 'Determine whether an artificially constructed sentence relating to sports
  is plausible or not.

  '
doc_to_choice: ["yes","no"]
fewshot_config:
  sampler: first_n
  samples:
  - input: Is the following sentence plausible? "Bam Adebayo scored a reverse layup
      in the Western Conference Finals."
    target: 'yes'
  - input: Is the following sentence plausible? "Santi Cazorla scored a touchdown."
    target: 'no'
  - input: Is the following sentence plausible? "DeMar DeRozan was called for the
      goal tend."
    target: 'yes'
include: _fewshot_template_yaml
task: leaderboard_bbh_sports_understanding
dataset_name: temporal_sequences
description: 'Task description: Answer questions about which times certain events
  could have occurred.

  '
doc_to_choice: ["(A)","(B)","(C)","(D)"]
fewshot_config:
  sampler: first_n
  samples:
  - input: 'Today, Emily went to the museum. Between what times could they have gone?
      We know that:
      Emily woke up at 1pm.
      Elizabeth saw Emily reading at the library from 2pm to 4pm.
      Jessica saw Emily watching a movie at the theater from 4pm to 5pm.
      Leslie saw Emily waiting at the airport from 5pm to 6pm.
      William saw Emily buying clothes at the mall from 6pm to 7pm.
      The museum was closed after 7pm.
      Between what times could Emily have gone to the museum?
      Options:
      (A) 1pm to 2pm
      (B) 6pm to 7pm
      (C) 5pm to 6pm
      (D) 2pm to 4pm'
    target: (A)
  - input: 'Today, Elizabeth went to the amusement park. Between what times could
      they have gone?
      We know that:
      Elizabeth woke up at 7am.
      David saw Elizabeth fixing their computer at the electronic store from 1pm to
      2pm.
      Sarah saw Elizabeth playing tennis at the tennis court from 2pm to 3pm.
      Susan saw Elizabeth walking towards the Statue of Liberty from 3pm to 6pm.
      Andrew saw Elizabeth taking photos near the Eiffel Tower from 6pm to 9pm.
      Emily saw Elizabeth getting a coffee at the cafe from 9pm to 10pm.
      The amusement park was closed after 10pm.
      Between what times could Elizabeth have gone to the amusement park?
      Options:
      (A) 7am to 1pm
      (B) 9pm to 10pm
      (C) 1pm to 2pm
      (D) 3pm to 6pm'
    target: (A)
  - input: 'Today, Tiffany went to the beach. Between what times could they have gone?
      We know that:
      Tiffany woke up at 5am.
      Betty saw Tiffany getting a coffee at the cafe from 5am to 6am.
      Jessica saw Tiffany working at the office from 6am to 9am.
      John saw Tiffany stretching at a yoga studio from 9am to 12pm.
      Sean saw Tiffany sitting on a rooftop from 12pm to 2pm.
      Sarah saw Tiffany playing tennis at the tennis court from 2pm to 3pm.
      The beach was closed after 4pm.
      Between what times could Tiffany have gone to the beach?
      Options:
      (A) 9am to 12pm
      (B) 12pm to 2pm
      (C) 5am to 6am
      (D) 3pm to 4pm'
    target: (D)
include: _fewshot_template_yaml
task: leaderboard_bbh_temporal_sequences
dataset_name: tracking_shuffled_objects_five_objects
description: 'A task requiring determining the final positions of a set of objects
  given their initial positions and a description of a sequence of swaps.

  '
doc_to_choice: ["(A)","(B)","(C)","(D)","(E)"]
fewshot_config:
  sampler: first_n
  samples:
  - input: 'Alice, Bob, and Claire are playing a game. At the start of the game, they
      are each holding a ball: Alice has a yellow ball, Bob has a blue ball, and Claire
      has a pink ball.
      As the game progresses, pairs of players trade balls. First, Claire and Alice
      swap balls. Then, Alice and Bob swap balls. Finally, Claire and Bob swap balls.
      At the end of the game, Bob has the
      Options:
      (A) yellow ball
      (B) blue ball
      (C) pink ball'
    target: (A)
  - input: 'Alice, Bob, and Claire are playing a game. At the start of the game, they
      are each holding a ball: Alice has a white ball, Bob has a purple ball, and
      Claire has a pink ball.
      As the game progresses, pairs of players trade balls. First, Bob and Alice swap
      balls. Then, Bob and Claire swap balls. Finally, Bob and Alice swap balls. At
      the end of the game, Alice has the
      Options:
      (A) white ball
      (B) purple ball
      (C) pink ball'
    target: (C)
  - input: 'Alice, Bob, and Claire are dancers at a square dance. At the start of
      a song, they each have a partner: Alice is dancing with Lola, Bob is dancing
      with Rodrigo, and Claire is dancing with Patrick.
      Throughout the song, the dancers often trade partners. First, Alice and Bob
      switch partners. Then, Claire and Bob switch partners. Finally, Bob and Alice
      switch partners. At the end of the dance, Alice is dancing with
      Options:
      (A) Lola
      (B) Rodrigo
      (C) Patrick'
    target: (C)
include: _fewshot_template_yaml
task: leaderboard_bbh_tracking_shuffled_objects_five_objects
dataset_name: tracking_shuffled_objects_seven_objects
description: 'A task requiring determining the final positions of a set of objects
  given their initial positions and a description of a sequence of swaps.

  '
doc_to_choice: ["(A)","(B)","(C)","(D)","(E)","(F)","(G)"]
fewshot_config:
  sampler: first_n
  samples:
  - input: 'Alice, Bob, and Claire are playing a game. At the start of the game, they
      are each holding a ball: Alice has a yellow ball, Bob has a blue ball, and Claire
      has a pink ball.
      As the game progresses, pairs of players trade balls. First, Claire and Alice
      swap balls. Then, Alice and Bob swap balls. Finally, Claire and Bob swap balls.
      At the end of the game, Bob has the
      Options:
      (A) yellow ball
      (B) blue ball
      (C) pink ball'
    target: (A)
  - input: 'Alice, Bob, and Claire are playing a game. At the start of the game, they
      are each holding a ball: Alice has a white ball, Bob has a purple ball, and
      Claire has a pink ball.
      As the game progresses, pairs of players trade balls. First, Bob and Alice swap
      balls. Then, Bob and Claire swap balls. Finally, Bob and Alice swap balls. At
      the end of the game, Alice has the
      Options:
      (A) white ball
      (B) purple ball
      (C) pink ball'
    target: (C)
  - input: 'Alice, Bob, and Claire are dancers at a square dance. At the start of
      a song, they each have a partner: Alice is dancing with Lola, Bob is dancing
      with Rodrigo, and Claire is dancing with Patrick.
      Throughout the song, the dancers often trade partners. First, Alice and Bob
      switch partners. Then, Claire and Bob switch partners. Finally, Bob and Alice
      switch partners. At the end of the dance, Alice is dancing with
      Options:
      (A) Lola
      (B) Rodrigo
      (C) Patrick'
    target: (C)
include: _fewshot_template_yaml
task: leaderboard_bbh_tracking_shuffled_objects_seven_objects
dataset_name: tracking_shuffled_objects_three_objects
description: 'A task requiring determining the final positions of a set of objects
  given their initial positions and a description of a sequence of swaps.

  '
doc_to_choice: ["(A)","(B)","(C)"]
fewshot_config:
  sampler: first_n
  samples:
  - input: 'Alice, Bob, and Claire are playing a game. At the start of the game, they
      are each holding a ball: Alice has a yellow ball, Bob has a blue ball, and Claire
      has a pink ball.
      As the game progresses, pairs of players trade balls. First, Claire and Alice
      swap balls. Then, Alice and Bob swap balls. Finally, Claire and Bob swap balls.
      At the end of the game, Bob has the
      Options:
      (A) yellow ball
      (B) blue ball
      (C) pink ball'
    target: (A)
  - input: 'Alice, Bob, and Claire are playing a game. At the start of the game, they
      are each holding a ball: Alice has a white ball, Bob has a purple ball, and
      Claire has a pink ball.
      As the game progresses, pairs of players trade balls. First, Bob and Alice swap
      balls. Then, Bob and Claire swap balls. Finally, Bob and Alice swap balls. At
      the end of the game, Alice has the
      Options:
      (A) white ball
      (B) purple ball
      (C) pink ball'
    target: (C)
  - input: 'Alice, Bob, and Claire are dancers at a square dance. At the start of
      a song, they each have a partner: Alice is dancing with Lola, Bob is dancing
      with Rodrigo, and Claire is dancing with Patrick.
      Throughout the song, the dancers often trade partners. First, Alice and Bob
      switch partners. Then, Claire and Bob switch partners. Finally, Bob and Alice
      switch partners. At the end of the dance, Alice is dancing with
      Options:
      (A) Lola
      (B) Rodrigo
      (C) Patrick'
    target: (C)
include: _fewshot_template_yaml
task: leaderboard_bbh_tracking_shuffled_objects_three_objects
dataset_name: web_of_lies
description: 'Evaluate a random boolean function expressed as a word problem.

  '
doc_to_choice: ["Yes","No"]
fewshot_config:
  sampler: first_n
  samples:
  - input: 'Question: Fidel tells the truth. Jerry says Fidel tells the truth. Vina
      says Jerry tells the truth. Millicent says Vina lies. Raymond says Millicent
      lies. Does Raymond tell the truth?'
    target: 'Yes'
  - input: 'Question: Kristian lies. Millie says Kristian lies. Maybelle says Millie
      tells the truth. Fidel says Maybelle lies. Leda says Fidel lies. Does Leda tell
      the truth?'
    target: 'Yes'
  - input: 'Question: Kristian tells the truth. Michaela says Kristian lies. Raymond
      says Michaela tells the truth. Osvaldo says Raymond tells the truth. Jamey says
      Osvaldo tells the truth. Does Jamey tell the truth?'
    target: 'No'
include: _fewshot_template_yaml
task: leaderboard_bbh_web_of_lies
dataset_path: Idavidrein/gpqa
group: leaderboard_gpqa
output_type: multiple_choice
process_docs: !function utils.process_docs
training_split: train
# Because huggingface dataset only has train split
validation_split: train
test_split: null
doc_to_text: "What is the correct answer to this question:{{Question}}\nChoices:\n(A) {{choice1}}\n(B) {{choice2}}\n(C) {{choice3}}\n(D) {{choice4}}\nAnswer: "
doc_to_target: answer
doc_to_choice: ["(A)", "(B)", "(C)", "(D)"]
num_fewshot: 0
metric_list:
  - metric: acc_norm
    aggregation: mean
    higher_is_better: true
metadata:
  version: 1.0
fewshot_config:
  sampler: first_n
# Generated by _generate_configs.py
dataset_name: gpqa_diamond
include: _template_yaml
task: leaderboard_gpqa_diamond
# Generated by _generate_configs.py
dataset_name: gpqa_extended
include: _template_yaml
task: leaderboard_gpqa_extended
# Generated by _generate_configs.py
dataset_name: gpqa_main
include: _template_yaml
task: leaderboard_gpqa_main
import random
import re

import datasets


def preprocess(text):
    if text is None:
        return " "
    text = text.strip()
    text = text.replace(" [title]", ". ")
    text = re.sub("\\[.*?\\]", "", text)
    text = text.replace("  ", " ")
    return text


def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:
    def _process_doc(doc):
        # Shuffle the correct answer in among the three distractors, then record
        # which letter it landed on.
        choices = [
            preprocess(doc["Incorrect Answer 1"]),
            preprocess(doc["Incorrect Answer 2"]),
            preprocess(doc["Incorrect Answer 3"]),
            preprocess(doc["Correct Answer"]),
        ]
        random.shuffle(choices)
        correct_answer_index = choices.index(preprocess(doc["Correct Answer"]))
        out_doc = {
            "choice1": choices[0],
            "choice2": choices[1],
            "choice3": choices[2],
            "choice4": choices[3],
            "answer": f"({chr(65 + correct_answer_index)})",
        }
        return out_doc

    return dataset.map(_process_doc)
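For clarity, here is a minimal usage sketch (not part of the commit) of how process_docs reshapes one GPQA record into the choice1..choice4 and answer fields consumed by the template's doc_to_text and doc_to_target; the record below is invented, and the snippet assumes it runs alongside the module above:

# Hypothetical GPQA-style record, for illustration only; real rows come from Idavidrein/gpqa.
fake_doc = {
    "Correct Answer": "It increases",
    "Incorrect Answer 1": "It decreases",
    "Incorrect Answer 2": "It stays constant",
    "Incorrect Answer 3": "It oscillates",
}

random.seed(1234)  # make the in-place shuffle reproducible for this demo
processed = process_docs(datasets.Dataset.from_list([fake_doc]))[0]

# `processed` keeps the original columns and adds the shuffled "choice1".."choice4"
# plus "answer", e.g. "(C)", which doc_to_target compares against doc_to_choice.
print(processed["answer"], [processed[f"choice{i}"] for i in range(1, 5)])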
task: leaderboard_ifeval
group: leaderboard_instruction_following
dataset_path: wis-k/instruction-following-eval
dataset_name: null
output_type: generate_until
test_split: train
num_fewshot: 0
doc_to_text: prompt
doc_to_target: 0
generation_kwargs:
  until: []
  do_sample: false
  temperature: 0.0
  max_gen_toks: 1280
process_results: !function utils.process_results
metric_list:
  - metric: prompt_level_strict_acc
    aggregation: mean
    higher_is_better: true
  - metric: inst_level_strict_acc
    aggregation: !function utils.agg_inst_level_acc
    higher_is_better: true
  - metric: prompt_level_loose_acc
    aggregation: mean
    higher_is_better: true
  - metric: inst_level_loose_acc
    aggregation: !function utils.agg_inst_level_acc
    higher_is_better: true
metadata:
  version: 2.0
fewshot_config:
  sampler: first_n
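To make the plumbing concrete, here is a small sketch (not from the commit) of the record shape this config consumes; the field names mirror what utils.process_results reads further down, while the values are invented:

# Invented example row; real rows come from the wis-k/instruction-following-eval train split.
doc = {
    "key": 1000,
    "prompt": "Write a short note about tea. Do not use any commas.",
    "instruction_id_list": ["punctuation:no_comma"],
    "kwargs": [{}],  # per-instruction kwargs; None-valued entries are dropped before use
}

# doc_to_text: prompt  ->  the model sees doc["prompt"] verbatim, generation runs with the
# kwargs above (greedy, up to 1280 tokens), and the single completion is handed to
# utils.process_results(doc, [completion]) for strict and loose instruction checking.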
# Copyright 2023 The Google Research Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Registry of all instructions."""
from lm_eval.tasks.ifeval import instructions
_KEYWORD = "keywords:"
_LANGUAGE = "language:"
_LENGTH = "length_constraints:"
_CONTENT = "detectable_content:"
_FORMAT = "detectable_format:"
_MULTITURN = "multi-turn:"
_COMBINATION = "combination:"
_STARTEND = "startend:"
_CHANGE_CASES = "change_case:"
_PUNCTUATION = "punctuation:"
INSTRUCTION_DICT = {
    _KEYWORD + "existence": instructions.KeywordChecker,
    _KEYWORD + "frequency": instructions.KeywordFrequencyChecker,
    # TODO(jeffreyzhou): make a proper set of sentences to choose from
    # _KEYWORD + "key_sentences": instructions.KeySentenceChecker,
    _KEYWORD + "forbidden_words": instructions.ForbiddenWords,
    _KEYWORD + "letter_frequency": instructions.LetterFrequencyChecker,
    _LANGUAGE + "response_language": instructions.ResponseLanguageChecker,
    _LENGTH + "number_sentences": instructions.NumberOfSentences,
    _LENGTH + "number_paragraphs": instructions.ParagraphChecker,
    _LENGTH + "number_words": instructions.NumberOfWords,
    _LENGTH + "nth_paragraph_first_word": instructions.ParagraphFirstWordCheck,
    _CONTENT + "number_placeholders": instructions.PlaceholderChecker,
    _CONTENT + "postscript": instructions.PostscriptChecker,
    _FORMAT + "number_bullet_lists": instructions.BulletListChecker,
    # TODO(jeffreyzhou): Pre-create paragraph or use prompt to replace
    # _CONTENT + "rephrase_paragraph": instructions.RephraseParagraph,
    _FORMAT + "constrained_response": instructions.ConstrainedResponseChecker,
    _FORMAT + "number_highlighted_sections": (instructions.HighlightSectionChecker),
    _FORMAT + "multiple_sections": instructions.SectionChecker,
    # TODO(tianjianlu): Re-enable rephrasing with preprocessing the message.
    # _FORMAT + "rephrase": instructions.RephraseChecker,
    _FORMAT + "json_format": instructions.JsonFormat,
    _FORMAT + "title": instructions.TitleChecker,
    # TODO(tianjianlu): Re-enable with specific prompts.
    # _MULTITURN + "constrained_start": instructions.ConstrainedStartChecker,
    _COMBINATION + "two_responses": instructions.TwoResponsesChecker,
    _COMBINATION + "repeat_prompt": instructions.RepeatPromptThenAnswer,
    _STARTEND + "end_checker": instructions.EndChecker,
    _CHANGE_CASES + "capital_word_frequency": instructions.CapitalWordFrequencyChecker,
    _CHANGE_CASES + "english_capital": instructions.CapitalLettersEnglishChecker,
    _CHANGE_CASES + "english_lowercase": instructions.LowercaseLettersEnglishChecker,
    _PUNCTUATION + "no_comma": instructions.CommaChecker,
    _STARTEND + "quotation": instructions.QuotationChecker,
}


INSTRUCTION_CONFLICTS = {
    _KEYWORD + "existence": {_KEYWORD + "existence"},
    _KEYWORD + "frequency": {_KEYWORD + "frequency"},
    # TODO(jeffreyzhou): make a proper set of sentences to choose from
    # _KEYWORD + "key_sentences": instructions.KeySentenceChecker,
    _KEYWORD + "forbidden_words": {_KEYWORD + "forbidden_words"},
    _KEYWORD + "letter_frequency": {_KEYWORD + "letter_frequency"},
    _LANGUAGE + "response_language": {
        _LANGUAGE + "response_language",
        _FORMAT + "multiple_sections",
        _KEYWORD + "existence",
        _KEYWORD + "frequency",
        _KEYWORD + "forbidden_words",
        _STARTEND + "end_checker",
        _CHANGE_CASES + "english_capital",
        _CHANGE_CASES + "english_lowercase",
    },
    _LENGTH + "number_sentences": {_LENGTH + "number_sentences"},
    _LENGTH + "number_paragraphs": {
        _LENGTH + "number_paragraphs",
        _LENGTH + "nth_paragraph_first_word",
        _LENGTH + "number_sentences",
        _LENGTH + "nth_paragraph_first_word",
    },
    _LENGTH + "number_words": {_LENGTH + "number_words"},
    _LENGTH + "nth_paragraph_first_word": {
        _LENGTH + "nth_paragraph_first_word",
        _LENGTH + "number_paragraphs",
    },
    _CONTENT + "number_placeholders": {_CONTENT + "number_placeholders"},
    _CONTENT + "postscript": {_CONTENT + "postscript"},
    _FORMAT + "number_bullet_lists": {_FORMAT + "number_bullet_lists"},
    # TODO(jeffreyzhou): Pre-create paragraph or use prompt to replace
    # _CONTENT + "rephrase_paragraph": instructions.RephraseParagraph,
    _FORMAT + "constrained_response": set(INSTRUCTION_DICT.keys()),
    _FORMAT + "number_highlighted_sections": {_FORMAT + "number_highlighted_sections"},
    _FORMAT + "multiple_sections": {
        _FORMAT + "multiple_sections",
        _LANGUAGE + "response_language",
        _FORMAT + "number_highlighted_sections",
    },
    # TODO(tianjianlu): Re-enable rephrasing with preprocessing the message.
    # _FORMAT + "rephrase": instructions.RephraseChecker,
    _FORMAT + "json_format": set(INSTRUCTION_DICT.keys()).difference(
        {_KEYWORD + "forbidden_words", _KEYWORD + "existence"}
    ),
    _FORMAT + "title": {_FORMAT + "title"},
    # TODO(tianjianlu): Re-enable with specific prompts.
    # _MULTITURN + "constrained_start": instructions.ConstrainedStartChecker,
    _COMBINATION + "two_responses": set(INSTRUCTION_DICT.keys()).difference(
        {
            _KEYWORD + "forbidden_words",
            _KEYWORD + "existence",
            _LANGUAGE + "response_language",
            _FORMAT + "title",
            _PUNCTUATION + "no_comma",
        }
    ),
    _COMBINATION + "repeat_prompt": set(INSTRUCTION_DICT.keys()).difference(
        {_KEYWORD + "existence", _FORMAT + "title", _PUNCTUATION + "no_comma"}
    ),
    _STARTEND + "end_checker": {_STARTEND + "end_checker"},
    _CHANGE_CASES + "capital_word_frequency": {
        _CHANGE_CASES + "capital_word_frequency",
        _CHANGE_CASES + "english_lowercase",
        _CHANGE_CASES + "english_capital",
    },
    _CHANGE_CASES + "english_capital": {_CHANGE_CASES + "english_capital"},
    _CHANGE_CASES + "english_lowercase": {
        _CHANGE_CASES + "english_lowercase",
        _CHANGE_CASES + "english_capital",
    },
    _PUNCTUATION + "no_comma": {_PUNCTUATION + "no_comma"},
    _STARTEND + "quotation": {_STARTEND + "quotation", _FORMAT + "title"},
}


def conflict_make(conflicts):
    """Makes sure if A conflicts with B, B will conflict with A.

    Args:
      conflicts: Dictionary of potential conflicts where key is instruction id
        and value is set of instruction ids that it conflicts with.

    Returns:
      Revised version of the dictionary. All instructions conflict with
      themselves. If A conflicts with B, B will conflict with A.
    """
    for key in conflicts:
        for k in conflicts[key]:
            conflicts[k].add(key)
        conflicts[key].add(key)
    return conflicts
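As a quick illustration (not part of the commit), conflict_make makes every instruction conflict with itself and mirrors each conflict; a toy table with made-up ids behaves like this:

# Toy conflict table with invented ids; the real argument is INSTRUCTION_CONFLICTS above.
toy = {"a": {"b"}, "b": set(), "c": set()}
symmetric = conflict_make(toy)
# After the call: "a" -> {"a", "b"}, "b" -> {"a", "b"}, "c" -> {"c"}
# (set ordering in the printed output may vary).
print(symmetric)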
import dataclasses
from typing import Dict, Optional, Union

from lm_eval.tasks.ifeval import instructions_registry


@dataclasses.dataclass
class InputExample:
    key: int
    instruction_id_list: list[str]
    prompt: str
    kwargs: list[Dict[str, Optional[Union[str, int]]]]


@dataclasses.dataclass
class OutputExample:
    instruction_id_list: list[str]
    prompt: str
    response: str
    follow_all_instructions: bool
    follow_instruction_list: list[bool]


def test_instruction_following_strict(
    inp,
    response,
):
    """Tests response to see if instructions are followed."""
    instruction_list = inp.instruction_id_list
    is_following_list = []

    for index, instruction_id in enumerate(instruction_list):
        instruction_cls = instructions_registry.INSTRUCTION_DICT[instruction_id]
        instruction = instruction_cls(instruction_id)

        # Remove None values from kwargs to avoid unexpected keyword argument errors in build_description method.
        kwargs = {k: v for k, v in inp.kwargs[index].items() if v}
        instruction.build_description(**kwargs)
        args = instruction.get_instruction_args()
        if args and "prompt" in args:
            instruction.build_description(prompt=inp.prompt)

        if response.strip() and instruction.check_following(response):
            is_following_list.append(True)
        else:
            is_following_list.append(False)

    return OutputExample(
        instruction_id_list=inp.instruction_id_list,
        prompt=inp.prompt,
        response=response,
        follow_all_instructions=all(is_following_list),
        follow_instruction_list=is_following_list,
    )


def test_instruction_following_loose(
    inp,
    response,
):
    """Tests response for an upper bound for following instructions."""
    r = response.split("\n")
    response_remove_first = "\n".join(r[1:]).strip()
    response_remove_last = "\n".join(r[:-1]).strip()
    response_remove_both = "\n".join(r[1:-1]).strip()
    revised_response = response.replace("*", "")
    revised_response_remove_first = response_remove_first.replace("*", "")
    revised_response_remove_last = response_remove_last.replace("*", "")
    revised_response_remove_both = response_remove_both.replace("*", "")
    all_responses = [
        response,
        revised_response,
        response_remove_first,
        response_remove_last,
        response_remove_both,
        revised_response_remove_first,
        revised_response_remove_last,
        revised_response_remove_both,
    ]
    instruction_list = inp.instruction_id_list
    is_following_list = []

    for index, instruction_id in enumerate(instruction_list):
        instruction_cls = instructions_registry.INSTRUCTION_DICT[instruction_id]
        instruction = instruction_cls(instruction_id)

        # Remove None values from kwargs to avoid unexpected keyword argument errors in build_description method.
        kwargs = {k: v for k, v in inp.kwargs[index].items() if v}
        instruction.build_description(**kwargs)
        args = instruction.get_instruction_args()
        if args and "prompt" in args:
            instruction.build_description(prompt=inp.prompt)

        is_following = False
        for r in all_responses:
            if r.strip() and instruction.check_following(r):
                is_following = True
                break

        is_following_list.append(is_following)

    return OutputExample(
        instruction_id_list=inp.instruction_id_list,
        prompt=inp.prompt,
        response=response,
        follow_all_instructions=all(is_following_list),
        follow_instruction_list=is_following_list,
    )


def process_results(doc, results):
    inp = InputExample(
        key=doc["key"],
        instruction_id_list=doc["instruction_id_list"],
        prompt=doc["prompt"],
        kwargs=doc["kwargs"],
    )
    response = results[0]
    out_strict = test_instruction_following_strict(inp, response)
    out_loose = test_instruction_following_loose(inp, response)

    return {
        "prompt_level_strict_acc": out_strict.follow_all_instructions,
        "inst_level_strict_acc": out_strict.follow_instruction_list,
        "prompt_level_loose_acc": out_loose.follow_all_instructions,
        "inst_level_loose_acc": out_loose.follow_instruction_list,
    }


def agg_inst_level_acc(items):
    flat_items = [item for sublist in items for item in sublist]
    inst_level_acc = sum(flat_items) / len(flat_items)
    return inst_level_acc
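For intuition (not part of the commit), the inst_level_* metrics flatten every prompt's per-instruction booleans before averaging, so prompts with more instructions carry more weight; a toy example with invented values:

# Invented per-prompt results: two prompts with 2 and 3 instructions respectively.
# process_results emits one list of booleans like these per prompt.
per_prompt_results = [
    [True, False],        # prompt 1: 1 of 2 instructions followed
    [True, True, False],  # prompt 2: 2 of 3 instructions followed
]

# Flattened: 3 followed out of 5 -> 0.6, not the mean of the per-prompt rates (~0.583).
print(agg_inst_level_acc(per_prompt_results))  # 0.6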
group: leaderboard
task:
- leaderboard_mmlu_pro
- leaderboard_bbh
- leaderboard_gpqa
- leaderboard_math_hard
- leaderboard_ifeval
- leaderboard_musr
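Finally, a hedged sketch of how the new group can be invoked from Python; it assumes the harness's lm_eval.simple_evaluate entry point and uses a placeholder Hugging Face model id, and is not part of the commit:

import lm_eval

# "leaderboard" resolves to the group defined above and pulls in all six sub-groups.
results = lm_eval.simple_evaluate(
    model="hf",  # the harness's Hugging Face backend
    model_args="pretrained=your-org/your-model",  # placeholder model id
    tasks=["leaderboard"],
)
print(results["results"].keys())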