Unverified Commit 3c8db1bb authored by Nathan Habib, committed by GitHub

Adds Open LLM Leaderboard Tasks (#2047)



* adds leaderboard tasks

* Delete lm_eval/tasks/leaderboard/leaderboard_chat_template.yaml

* add readme

* Delete lm_eval/tasks/leaderboard/mmlu_pro/mmlu_pro_chat_template.yaml

* modify readme

* fix bbh task

* fix bbh salient task

* modify the readme

* Delete lm_eval/tasks/leaderboard/ifeval/README.md

* Delete lm_eval/tasks/leaderboard/math/README.md

* add leaderboard to the tasks repertory

* add announcement about new leaderboard tasks

* linting

* Update README.md
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* installs ifeval dependency in new_task github workflow

---------
Co-authored-by: Nathan Habib <nathan.habib@huggingface.com>
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
parent 1870ee4e
dataset_name: ruin_names
description: 'Select the humorous edit that ''ruins'' the input movie or musical artist
  name.

  '
doc_to_choice: ["(A)","(B)","(C)","(D)","(E)","(F)"]
fewshot_config:
  sampler: first_n
  samples:
  - input: 'Which of the following is a humorous edit of this artist or movie name:
      ''whitesnake''?
      Options:
      (A) whitesnape
      (B) whitesnapke
      (C) whitesnuake
      (D) mwhitesnake'
    target: (A)
  - input: 'Which of the following is a humorous edit of this artist or movie name:
      ''one of our dinosaurs is missing''?
      Options:
      (A) ofne of our dinosaurs is missing
      (B) one af our dinosaurs is missing
      (C) one of our dinosaurs is pissing
      (D) one of our dinosaur is missing'
    target: (C)
  - input: 'Which of the following is a humorous edit of this artist or movie name:
      ''counting crows''?
      Options:
      (A) countingy crows
      (B) counting cows
      (C) courting crows
      (D) coutnting crows'
    target: (B)
include: _fewshot_template_yaml
task: leaderboard_bbh_ruin_names
dataset_name: salient_translation_error_detection
description: 'Detect the type of error in an English translation of a German source
  sentence.

  '
doc_to_choice: ["(A)","(B)","(C)","(D)","(E)","(F)"]
fewshot_config:
  sampler: first_n
  samples:
  - input: "The following translations from German to English contain a particular
      error. That error will be one of the following types: Named Entities: An entity
      (names, places, locations, etc.) is changed to a different entity. Numerical
      Values: Numerical values (ordinals or cardinals), dates, and/or units are changed.
      Modifiers or Adjectives: The modifiers and adjectives pertaining to a noun are
      changed. Negation or Antonyms: Introduce or remove a negation or change comparatives
      to their antonyms. Facts: Trivial factual errors not pertaining to the above
      classes are introduced in the translations. Dropped Content: A significant clause
      in the translation is removed. Please identify that error. Source: In der Liste
      der Baudenkmale in Lenzen (Elbe) sind alle Baudenkmale der brandenburgischen
      Stadt Lenzen (Elbe) und ihrer Ortsteile aufgelistet.
      Translation: In the list of architectural monuments in Lenzen all architectural
      monuments of the Brandenburg city of Lenzen and its districts are listed.
      The translation contains an error pertaining to
      Options:
      (A) Modifiers or Adjectives
      (B) Numerical Values
      (C) Negation or Antonyms
      (D) Named Entities
      (E) Dropped Content
      (F) Facts"
    target: (D)
  - input: "The following translations from German to English contain a particular
      error. That error will be one of the following types: Named Entities: An entity
      (names, places, locations, etc.) is changed to a different entity. Numerical
      Values: Numerical values (ordinals or cardinals), dates, and/or units are changed.
      Modifiers or Adjectives: The modifiers and adjectives pertaining to a noun are
      changed. Negation or Antonyms: Introduce or remove a negation or change comparatives
      to their antonyms. Facts: Trivial factual errors not pertaining to the above
      classes are introduced in the translations. Dropped Content: A significant clause
      in the translation is removed. Please identify that error. Source: Auf dieser
      Seite sind die Baudenkm\xE4ler der oberbayerischen Gro\xDFen Kreisstadt Landsberg
      am Lech zusammengestellt.
      Translation: On this page are compiled the architectural monuments of the town of
      Landsberg am Lech.
      The translation contains an error pertaining to
      Options:
      (A) Modifiers or Adjectives
      (B) Numerical Values
      (C) Negation or Antonyms
      (D) Named Entities
      (E) Dropped Content
      (F) Facts"
    target: (E)
  - input: "The following translations from German to English contain a particular
      error. That error will be one of the following types: Named Entities: An entity
      (names, places, locations, etc.) is changed to a different entity. Numerical
      Values: Numerical values (ordinals or cardinals), dates, and/or units are changed.
      Modifiers or Adjectives: The modifiers and adjectives pertaining to a noun are
      changed. Negation or Antonyms: Introduce or remove a negation or change comparatives
      to their antonyms. Facts: Trivial factual errors not pertaining to the above
      classes are introduced in the translations. Dropped Content: A significant clause
      in the translation is removed. Please identify that error. Source: \u0141eba ist
      eine Kleinstadt und ein Badeort im Powiat L\u0119borski der polnischen Woiwodschaft
      Pommern.
      Translation: Eba is not a small town and seaside resort in the Powiat L\xE9borski
      county of the Pomeranian Voivodeship of Poland.\nThe translation contains an error
      pertaining to
      Options:
      (A) Modifiers or Adjectives
      (B) Numerical Values
      (C) Negation or Antonyms
      (D) Named Entities
      (E) Dropped Content
      (F) Facts"
    target: (C)
include: _fewshot_template_yaml
task: leaderboard_bbh_salient_translation_error_detection
dataset_name: snarks
description: 'Determine which of two sentences is sarcastic.

  According to Cambridge University Dictionary, sarcasm is "the use of remarks that
  clearly mean the opposite of what they say, made in order to hurt someone''s feelings
  or to criticize something in a humorous way." Sarcastic sentences often contain
  satirical or ironic utterances, hyperboles, ambivalent or witty remarks.

  '
doc_to_choice: ["(A)","(B)"]
fewshot_config:
  sampler: first_n
  samples:
  - input: 'Which statement is sarcastic?
      Options:
      (A) Yes, because having interests and actively researching them is a huge waste
      (B) Yes, because having interests and actively researching them is a huge deal'
    target: (A)
  - input: 'Which statement is sarcastic?
      Options:
      (A) No one is going to disagree with you on this. Avoiding ad hominem attacks
      really help your case
      (B) No one is going to disagree with you on this. Ad hominem attacks really
      help your case'
    target: (B)
  - input: 'Which statement is sarcastic?
      Options:
      (A) Consistency in the league''s punishments? What do you think this is supposed
      to be, politics?
      (B) Consistency in the league''s punishments? What do you think this is supposed
      to be, moral?'
    target: (A)
include: _fewshot_template_yaml
task: leaderboard_bbh_snarks
dataset_name: sports_understanding
description: 'Determine whether an artificially constructed sentence relating to sports
  is plausible or not.

  '
doc_to_choice: ["yes","no"]
fewshot_config:
  sampler: first_n
  samples:
  - input: Is the following sentence plausible? "Bam Adebayo scored a reverse layup
      in the Western Conference Finals."
    target: 'yes'
  - input: Is the following sentence plausible? "Santi Cazorla scored a touchdown."
    target: 'no'
  - input: Is the following sentence plausible? "DeMar DeRozan was called for the
      goal tend."
    target: 'yes'
include: _fewshot_template_yaml
task: leaderboard_bbh_sports_understanding
dataset_name: temporal_sequences
description: 'Task description: Answer questions about which times certain events
  could have occurred.

  '
doc_to_choice: ["(A)","(B)","(C)","(D)"]
fewshot_config:
  sampler: first_n
  samples:
  - input: 'Today, Emily went to the museum. Between what times could they have gone?
      We know that:
      Emily woke up at 1pm.
      Elizabeth saw Emily reading at the library from 2pm to 4pm.
      Jessica saw Emily watching a movie at the theater from 4pm to 5pm.
      Leslie saw Emily waiting at the airport from 5pm to 6pm.
      William saw Emily buying clothes at the mall from 6pm to 7pm.
      The museum was closed after 7pm.
      Between what times could Emily have gone to the museum?
      Options:
      (A) 1pm to 2pm
      (B) 6pm to 7pm
      (C) 5pm to 6pm
      (D) 2pm to 4pm'
    target: (A)
  - input: 'Today, Elizabeth went to the amusement park. Between what times could
      they have gone?
      We know that:
      Elizabeth woke up at 7am.
      David saw Elizabeth fixing their computer at the electronic store from 1pm to
      2pm.
      Sarah saw Elizabeth playing tennis at the tennis court from 2pm to 3pm.
      Susan saw Elizabeth walking towards the Statue of Liberty from 3pm to 6pm.
      Andrew saw Elizabeth taking photos near the Eiffel Tower from 6pm to 9pm.
      Emily saw Elizabeth getting a coffee at the cafe from 9pm to 10pm.
      The amusement park was closed after 10pm.
      Between what times could Elizabeth have gone to the amusement park?
      Options:
      (A) 7am to 1pm
      (B) 9pm to 10pm
      (C) 1pm to 2pm
      (D) 3pm to 6pm'
    target: (A)
  - input: 'Today, Tiffany went to the beach. Between what times could they have gone?
      We know that:
      Tiffany woke up at 5am.
      Betty saw Tiffany getting a coffee at the cafe from 5am to 6am.
      Jessica saw Tiffany working at the office from 6am to 9am.
      John saw Tiffany stretching at a yoga studio from 9am to 12pm.
      Sean saw Tiffany sitting on a rooftop from 12pm to 2pm.
      Sarah saw Tiffany playing tennis at the tennis court from 2pm to 3pm.
      The beach was closed after 4pm.
      Between what times could Tiffany have gone to the beach?
      Options:
      (A) 9am to 12pm
      (B) 12pm to 2pm
      (C) 5am to 6am
      (D) 3pm to 4pm'
    target: (D)
include: _fewshot_template_yaml
task: leaderboard_bbh_temporal_sequences
dataset_name: tracking_shuffled_objects_five_objects
description: 'A task requiring determining the final positions of a set of objects
  given their initial positions and a description of a sequence of swaps.

  '
doc_to_choice: ["(A)","(B)","(C)","(D)","(E)"]
fewshot_config:
  sampler: first_n
  samples:
  - input: 'Alice, Bob, and Claire are playing a game. At the start of the game, they
      are each holding a ball: Alice has a yellow ball, Bob has a blue ball, and Claire
      has a pink ball.
      As the game progresses, pairs of players trade balls. First, Claire and Alice
      swap balls. Then, Alice and Bob swap balls. Finally, Claire and Bob swap balls.
      At the end of the game, Bob has the
      Options:
      (A) yellow ball
      (B) blue ball
      (C) pink ball'
    target: (A)
  - input: 'Alice, Bob, and Claire are playing a game. At the start of the game, they
      are each holding a ball: Alice has a white ball, Bob has a purple ball, and
      Claire has a pink ball.
      As the game progresses, pairs of players trade balls. First, Bob and Alice swap
      balls. Then, Bob and Claire swap balls. Finally, Bob and Alice swap balls. At
      the end of the game, Alice has the
      Options:
      (A) white ball
      (B) purple ball
      (C) pink ball'
    target: (C)
  - input: 'Alice, Bob, and Claire are dancers at a square dance. At the start of
      a song, they each have a partner: Alice is dancing with Lola, Bob is dancing
      with Rodrigo, and Claire is dancing with Patrick.
      Throughout the song, the dancers often trade partners. First, Alice and Bob
      switch partners. Then, Claire and Bob switch partners. Finally, Bob and Alice
      switch partners. At the end of the dance, Alice is dancing with
      Options:
      (A) Lola
      (B) Rodrigo
      (C) Patrick'
    target: (C)
include: _fewshot_template_yaml
task: leaderboard_bbh_tracking_shuffled_objects_five_objects
dataset_name: tracking_shuffled_objects_seven_objects
description: 'A task requiring determining the final positions of a set of objects
  given their initial positions and a description of a sequence of swaps.

  '
doc_to_choice: ["(A)","(B)","(C)","(D)","(E)","(F)","(G)"]
fewshot_config:
  sampler: first_n
  samples:
  - input: 'Alice, Bob, and Claire are playing a game. At the start of the game, they
      are each holding a ball: Alice has a yellow ball, Bob has a blue ball, and Claire
      has a pink ball.
      As the game progresses, pairs of players trade balls. First, Claire and Alice
      swap balls. Then, Alice and Bob swap balls. Finally, Claire and Bob swap balls.
      At the end of the game, Bob has the
      Options:
      (A) yellow ball
      (B) blue ball
      (C) pink ball'
    target: (A)
  - input: 'Alice, Bob, and Claire are playing a game. At the start of the game, they
      are each holding a ball: Alice has a white ball, Bob has a purple ball, and
      Claire has a pink ball.
      As the game progresses, pairs of players trade balls. First, Bob and Alice swap
      balls. Then, Bob and Claire swap balls. Finally, Bob and Alice swap balls. At
      the end of the game, Alice has the
      Options:
      (A) white ball
      (B) purple ball
      (C) pink ball'
    target: (C)
  - input: 'Alice, Bob, and Claire are dancers at a square dance. At the start of
      a song, they each have a partner: Alice is dancing with Lola, Bob is dancing
      with Rodrigo, and Claire is dancing with Patrick.
      Throughout the song, the dancers often trade partners. First, Alice and Bob
      switch partners. Then, Claire and Bob switch partners. Finally, Bob and Alice
      switch partners. At the end of the dance, Alice is dancing with
      Options:
      (A) Lola
      (B) Rodrigo
      (C) Patrick'
    target: (C)
include: _fewshot_template_yaml
task: leaderboard_bbh_tracking_shuffled_objects_seven_objects
dataset_name: tracking_shuffled_objects_three_objects
description: 'A task requiring determining the final positions of a set of objects
  given their initial positions and a description of a sequence of swaps.

  '
doc_to_choice: ["(A)","(B)","(C)"]
fewshot_config:
  sampler: first_n
  samples:
  - input: 'Alice, Bob, and Claire are playing a game. At the start of the game, they
      are each holding a ball: Alice has a yellow ball, Bob has a blue ball, and Claire
      has a pink ball.
      As the game progresses, pairs of players trade balls. First, Claire and Alice
      swap balls. Then, Alice and Bob swap balls. Finally, Claire and Bob swap balls.
      At the end of the game, Bob has the
      Options:
      (A) yellow ball
      (B) blue ball
      (C) pink ball'
    target: (A)
  - input: 'Alice, Bob, and Claire are playing a game. At the start of the game, they
      are each holding a ball: Alice has a white ball, Bob has a purple ball, and
      Claire has a pink ball.
      As the game progresses, pairs of players trade balls. First, Bob and Alice swap
      balls. Then, Bob and Claire swap balls. Finally, Bob and Alice swap balls. At
      the end of the game, Alice has the
      Options:
      (A) white ball
      (B) purple ball
      (C) pink ball'
    target: (C)
  - input: 'Alice, Bob, and Claire are dancers at a square dance. At the start of
      a song, they each have a partner: Alice is dancing with Lola, Bob is dancing
      with Rodrigo, and Claire is dancing with Patrick.
      Throughout the song, the dancers often trade partners. First, Alice and Bob
      switch partners. Then, Claire and Bob switch partners. Finally, Bob and Alice
      switch partners. At the end of the dance, Alice is dancing with
      Options:
      (A) Lola
      (B) Rodrigo
      (C) Patrick'
    target: (C)
include: _fewshot_template_yaml
task: leaderboard_bbh_tracking_shuffled_objects_three_objects
dataset_name: web_of_lies
description: 'Evaluate a random boolean function expressed as a word problem.

  '
doc_to_choice: ["Yes","No"]
fewshot_config:
  sampler: first_n
  samples:
  - input: 'Question: Fidel tells the truth. Jerry says Fidel tells the truth. Vina
      says Jerry tells the truth. Millicent says Vina lies. Raymond says Millicent
      lies. Does Raymond tell the truth?'
    target: 'Yes'
  - input: 'Question: Kristian lies. Millie says Kristian lies. Maybelle says Millie
      tells the truth. Fidel says Maybelle lies. Leda says Fidel lies. Does Leda tell
      the truth?'
    target: 'Yes'
  - input: 'Question: Kristian tells the truth. Michaela says Kristian lies. Raymond
      says Michaela tells the truth. Osvaldo says Raymond tells the truth. Jamey says
      Osvaldo tells the truth. Does Jamey tell the truth?'
    target: 'No'
include: _fewshot_template_yaml
task: leaderboard_bbh_web_of_lies
dataset_path: Idavidrein/gpqa
group: leaderboard_gpqa
output_type: multiple_choice
process_docs: !function utils.process_docs
training_split: train
# Because huggingface dataset only has train split
validation_split: train
test_split: null
doc_to_text: "What is the correct answer to this question:{{Question}}\nChoices:\n(A) {{choice1}}\n(B) {{choice2}}\n(C) {{choice3}}\n(D) {{choice4}}\nAnswer: "
doc_to_target: answer
doc_to_choice: ["(A)", "(B)", "(C)", "(D)"]
num_fewshot: 0
metric_list:
  - metric: acc_norm
    aggregation: mean
    higher_is_better: true
metadata:
  version: 1.0
fewshot_config:
  sampler: first_n
# Generated by _generate_configs.py
dataset_name: gpqa_diamond
include: _template_yaml
task: leaderboard_gpqa_diamond
# Generated by _generate_configs.py
dataset_name: gpqa_extended
include: _template_yaml
task: leaderboard_gpqa_extended
# Generated by _generate_configs.py
dataset_name: gpqa_main
include: _template_yaml
task: leaderboard_gpqa_main
import random
import re

import datasets


def preprocess(text):
    if text is None:
        return " "
    text = text.strip()
    text = text.replace(" [title]", ". ")
    text = re.sub("\\[.*?\\]", "", text)
    text = text.replace("  ", " ")
    return text


def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:
    def _process_doc(doc):
        # Shuffle the correct answer in among the three distractors, then record
        # which letter it landed on.
        choices = [
            preprocess(doc["Incorrect Answer 1"]),
            preprocess(doc["Incorrect Answer 2"]),
            preprocess(doc["Incorrect Answer 3"]),
            preprocess(doc["Correct Answer"]),
        ]
        random.shuffle(choices)
        correct_answer_index = choices.index(preprocess(doc["Correct Answer"]))
        out_doc = {
            "choice1": choices[0],
            "choice2": choices[1],
            "choice3": choices[2],
            "choice4": choices[3],
            "answer": f"({chr(65 + correct_answer_index)})",
        }
        return out_doc

    return dataset.map(_process_doc)
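For clarity, here is a minimal usage sketch (not part of the commit) of how process_docs reshapes one GPQA record into the choice1..choice4 and answer fields consumed by the template's doc_to_text and doc_to_target; the record below is invented, and the snippet assumes it runs alongside the module above:

# Hypothetical GPQA-style record, for illustration only; real rows come from Idavidrein/gpqa.
fake_doc = {
    "Correct Answer": "It increases",
    "Incorrect Answer 1": "It decreases",
    "Incorrect Answer 2": "It stays constant",
    "Incorrect Answer 3": "It oscillates",
}

random.seed(1234)  # make the in-place shuffle reproducible for this demo
processed = process_docs(datasets.Dataset.from_list([fake_doc]))[0]

# `processed` keeps the original columns and adds the shuffled "choice1".."choice4"
# plus "answer", e.g. "(C)", which doc_to_target compares against doc_to_choice.
print(processed["answer"], [processed[f"choice{i}"] for i in range(1, 5)])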
task: leaderboard_ifeval
group: leaderboard_instruction_following
dataset_path: wis-k/instruction-following-eval
dataset_name: null
output_type: generate_until
test_split: train
num_fewshot: 0
doc_to_text: prompt
doc_to_target: 0
generation_kwargs:
  until: []
  do_sample: false
  temperature: 0.0
  max_gen_toks: 1280
process_results: !function utils.process_results
metric_list:
  - metric: prompt_level_strict_acc
    aggregation: mean
    higher_is_better: true
  - metric: inst_level_strict_acc
    aggregation: !function utils.agg_inst_level_acc
    higher_is_better: true
  - metric: prompt_level_loose_acc
    aggregation: mean
    higher_is_better: true
  - metric: inst_level_loose_acc
    aggregation: !function utils.agg_inst_level_acc
    higher_is_better: true
metadata:
  version: 2.0
fewshot_config:
  sampler: first_n
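To make the plumbing concrete, here is a small sketch (not from the commit) of the record shape this config consumes; the field names mirror what utils.process_results reads further down, while the values are invented:

# Invented example row; real rows come from the wis-k/instruction-following-eval train split.
doc = {
    "key": 1000,
    "prompt": "Write a short note about tea. Do not use any commas.",
    "instruction_id_list": ["punctuation:no_comma"],
    "kwargs": [{}],  # per-instruction kwargs; None-valued entries are dropped before use
}

# doc_to_text: prompt  ->  the model sees doc["prompt"] verbatim, generation runs with the
# kwargs above (greedy, up to 1280 tokens), and the single completion is handed to
# utils.process_results(doc, [completion]) for strict and loose instruction checking.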
# Copyright 2023 The Google Research Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Registry of all instructions."""
from lm_eval.tasks.ifeval import instructions
_KEYWORD = "keywords:"
_LANGUAGE = "language:"
_LENGTH = "length_constraints:"
_CONTENT = "detectable_content:"
_FORMAT = "detectable_format:"
_MULTITURN = "multi-turn:"
_COMBINATION = "combination:"
_STARTEND = "startend:"
_CHANGE_CASES = "change_case:"
_PUNCTUATION = "punctuation:"
INSTRUCTION_DICT = {
    _KEYWORD + "existence": instructions.KeywordChecker,
    _KEYWORD + "frequency": instructions.KeywordFrequencyChecker,
    # TODO(jeffreyzhou): make a proper set of sentences to choose from
    # _KEYWORD + "key_sentences": instructions.KeySentenceChecker,
    _KEYWORD + "forbidden_words": instructions.ForbiddenWords,
    _KEYWORD + "letter_frequency": instructions.LetterFrequencyChecker,
    _LANGUAGE + "response_language": instructions.ResponseLanguageChecker,
    _LENGTH + "number_sentences": instructions.NumberOfSentences,
    _LENGTH + "number_paragraphs": instructions.ParagraphChecker,
    _LENGTH + "number_words": instructions.NumberOfWords,
    _LENGTH + "nth_paragraph_first_word": instructions.ParagraphFirstWordCheck,
    _CONTENT + "number_placeholders": instructions.PlaceholderChecker,
    _CONTENT + "postscript": instructions.PostscriptChecker,
    _FORMAT + "number_bullet_lists": instructions.BulletListChecker,
    # TODO(jeffreyzhou): Pre-create paragraph or use prompt to replace
    # _CONTENT + "rephrase_paragraph": instructions.RephraseParagraph,
    _FORMAT + "constrained_response": instructions.ConstrainedResponseChecker,
    _FORMAT + "number_highlighted_sections": (instructions.HighlightSectionChecker),
    _FORMAT + "multiple_sections": instructions.SectionChecker,
    # TODO(tianjianlu): Re-enable rephrasing with preprocessing the message.
    # _FORMAT + "rephrase": instructions.RephraseChecker,
    _FORMAT + "json_format": instructions.JsonFormat,
    _FORMAT + "title": instructions.TitleChecker,
    # TODO(tianjianlu): Re-enable with specific prompts.
    # _MULTITURN + "constrained_start": instructions.ConstrainedStartChecker,
    _COMBINATION + "two_responses": instructions.TwoResponsesChecker,
    _COMBINATION + "repeat_prompt": instructions.RepeatPromptThenAnswer,
    _STARTEND + "end_checker": instructions.EndChecker,
    _CHANGE_CASES + "capital_word_frequency": instructions.CapitalWordFrequencyChecker,
    _CHANGE_CASES + "english_capital": instructions.CapitalLettersEnglishChecker,
    _CHANGE_CASES + "english_lowercase": instructions.LowercaseLettersEnglishChecker,
    _PUNCTUATION + "no_comma": instructions.CommaChecker,
    _STARTEND + "quotation": instructions.QuotationChecker,
}


INSTRUCTION_CONFLICTS = {
    _KEYWORD + "existence": {_KEYWORD + "existence"},
    _KEYWORD + "frequency": {_KEYWORD + "frequency"},
    # TODO(jeffreyzhou): make a proper set of sentences to choose from
    # _KEYWORD + "key_sentences": instructions.KeySentenceChecker,
    _KEYWORD + "forbidden_words": {_KEYWORD + "forbidden_words"},
    _KEYWORD + "letter_frequency": {_KEYWORD + "letter_frequency"},
    _LANGUAGE + "response_language": {
        _LANGUAGE + "response_language",
        _FORMAT + "multiple_sections",
        _KEYWORD + "existence",
        _KEYWORD + "frequency",
        _KEYWORD + "forbidden_words",
        _STARTEND + "end_checker",
        _CHANGE_CASES + "english_capital",
        _CHANGE_CASES + "english_lowercase",
    },
    _LENGTH + "number_sentences": {_LENGTH + "number_sentences"},
    _LENGTH + "number_paragraphs": {
        _LENGTH + "number_paragraphs",
        _LENGTH + "nth_paragraph_first_word",
        _LENGTH + "number_sentences",
        _LENGTH + "nth_paragraph_first_word",
    },
    _LENGTH + "number_words": {_LENGTH + "number_words"},
    _LENGTH + "nth_paragraph_first_word": {
        _LENGTH + "nth_paragraph_first_word",
        _LENGTH + "number_paragraphs",
    },
    _CONTENT + "number_placeholders": {_CONTENT + "number_placeholders"},
    _CONTENT + "postscript": {_CONTENT + "postscript"},
    _FORMAT + "number_bullet_lists": {_FORMAT + "number_bullet_lists"},
    # TODO(jeffreyzhou): Pre-create paragraph or use prompt to replace
    # _CONTENT + "rephrase_paragraph": instructions.RephraseParagraph,
    _FORMAT + "constrained_response": set(INSTRUCTION_DICT.keys()),
    _FORMAT + "number_highlighted_sections": {_FORMAT + "number_highlighted_sections"},
    _FORMAT + "multiple_sections": {
        _FORMAT + "multiple_sections",
        _LANGUAGE + "response_language",
        _FORMAT + "number_highlighted_sections",
    },
    # TODO(tianjianlu): Re-enable rephrasing with preprocessing the message.
    # _FORMAT + "rephrase": instructions.RephraseChecker,
    _FORMAT + "json_format": set(INSTRUCTION_DICT.keys()).difference(
        {_KEYWORD + "forbidden_words", _KEYWORD + "existence"}
    ),
    _FORMAT + "title": {_FORMAT + "title"},
    # TODO(tianjianlu): Re-enable with specific prompts.
    # _MULTITURN + "constrained_start": instructions.ConstrainedStartChecker,
    _COMBINATION + "two_responses": set(INSTRUCTION_DICT.keys()).difference(
        {
            _KEYWORD + "forbidden_words",
            _KEYWORD + "existence",
            _LANGUAGE + "response_language",
            _FORMAT + "title",
            _PUNCTUATION + "no_comma",
        }
    ),
    _COMBINATION + "repeat_prompt": set(INSTRUCTION_DICT.keys()).difference(
        {_KEYWORD + "existence", _FORMAT + "title", _PUNCTUATION + "no_comma"}
    ),
    _STARTEND + "end_checker": {_STARTEND + "end_checker"},
    _CHANGE_CASES + "capital_word_frequency": {
        _CHANGE_CASES + "capital_word_frequency",
        _CHANGE_CASES + "english_lowercase",
        _CHANGE_CASES + "english_capital",
    },
    _CHANGE_CASES + "english_capital": {_CHANGE_CASES + "english_capital"},
    _CHANGE_CASES + "english_lowercase": {
        _CHANGE_CASES + "english_lowercase",
        _CHANGE_CASES + "english_capital",
    },
    _PUNCTUATION + "no_comma": {_PUNCTUATION + "no_comma"},
    _STARTEND + "quotation": {_STARTEND + "quotation", _FORMAT + "title"},
}


def conflict_make(conflicts):
    """Makes sure if A conflicts with B, B will conflict with A.

    Args:
      conflicts: Dictionary of potential conflicts where key is instruction id
        and value is set of instruction ids that it conflicts with.

    Returns:
      Revised version of the dictionary. All instructions conflict with
      themselves. If A conflicts with B, B will conflict with A.
    """
    for key in conflicts:
        for k in conflicts[key]:
            conflicts[k].add(key)
        conflicts[key].add(key)
    return conflicts
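As a quick illustration (not part of the commit), conflict_make makes every instruction conflict with itself and mirrors each conflict; a toy table with made-up ids behaves like this:

# Toy conflict table with invented ids; the real argument is INSTRUCTION_CONFLICTS above.
toy = {"a": {"b"}, "b": set(), "c": set()}
symmetric = conflict_make(toy)
# After the call: "a" -> {"a", "b"}, "b" -> {"a", "b"}, "c" -> {"c"}
# (set ordering in the printed output may vary).
print(symmetric)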
import dataclasses
from typing import Dict, Optional, Union

from lm_eval.tasks.ifeval import instructions_registry


@dataclasses.dataclass
class InputExample:
    key: int
    instruction_id_list: list[str]
    prompt: str
    kwargs: list[Dict[str, Optional[Union[str, int]]]]


@dataclasses.dataclass
class OutputExample:
    instruction_id_list: list[str]
    prompt: str
    response: str
    follow_all_instructions: bool
    follow_instruction_list: list[bool]


def test_instruction_following_strict(
    inp,
    response,
):
    """Tests response to see if instructions are followed."""
    instruction_list = inp.instruction_id_list
    is_following_list = []

    for index, instruction_id in enumerate(instruction_list):
        instruction_cls = instructions_registry.INSTRUCTION_DICT[instruction_id]
        instruction = instruction_cls(instruction_id)

        # Remove None values from kwargs to avoid unexpected keyword argument errors in build_description method.
        kwargs = {k: v for k, v in inp.kwargs[index].items() if v}
        instruction.build_description(**kwargs)
        args = instruction.get_instruction_args()
        if args and "prompt" in args:
            instruction.build_description(prompt=inp.prompt)

        if response.strip() and instruction.check_following(response):
            is_following_list.append(True)
        else:
            is_following_list.append(False)

    return OutputExample(
        instruction_id_list=inp.instruction_id_list,
        prompt=inp.prompt,
        response=response,
        follow_all_instructions=all(is_following_list),
        follow_instruction_list=is_following_list,
    )


def test_instruction_following_loose(
    inp,
    response,
):
    """Tests response for an upper bound for following instructions."""
    r = response.split("\n")
    response_remove_first = "\n".join(r[1:]).strip()
    response_remove_last = "\n".join(r[:-1]).strip()
    response_remove_both = "\n".join(r[1:-1]).strip()
    revised_response = response.replace("*", "")
    revised_response_remove_first = response_remove_first.replace("*", "")
    revised_response_remove_last = response_remove_last.replace("*", "")
    revised_response_remove_both = response_remove_both.replace("*", "")
    all_responses = [
        response,
        revised_response,
        response_remove_first,
        response_remove_last,
        response_remove_both,
        revised_response_remove_first,
        revised_response_remove_last,
        revised_response_remove_both,
    ]
    instruction_list = inp.instruction_id_list
    is_following_list = []

    for index, instruction_id in enumerate(instruction_list):
        instruction_cls = instructions_registry.INSTRUCTION_DICT[instruction_id]
        instruction = instruction_cls(instruction_id)

        # Remove None values from kwargs to avoid unexpected keyword argument errors in build_description method.
        kwargs = {k: v for k, v in inp.kwargs[index].items() if v}
        instruction.build_description(**kwargs)
        args = instruction.get_instruction_args()
        if args and "prompt" in args:
            instruction.build_description(prompt=inp.prompt)

        is_following = False
        for r in all_responses:
            if r.strip() and instruction.check_following(r):
                is_following = True
                break

        is_following_list.append(is_following)

    return OutputExample(
        instruction_id_list=inp.instruction_id_list,
        prompt=inp.prompt,
        response=response,
        follow_all_instructions=all(is_following_list),
        follow_instruction_list=is_following_list,
    )


def process_results(doc, results):
    inp = InputExample(
        key=doc["key"],
        instruction_id_list=doc["instruction_id_list"],
        prompt=doc["prompt"],
        kwargs=doc["kwargs"],
    )
    response = results[0]
    out_strict = test_instruction_following_strict(inp, response)
    out_loose = test_instruction_following_loose(inp, response)

    return {
        "prompt_level_strict_acc": out_strict.follow_all_instructions,
        "inst_level_strict_acc": out_strict.follow_instruction_list,
        "prompt_level_loose_acc": out_loose.follow_all_instructions,
        "inst_level_loose_acc": out_loose.follow_instruction_list,
    }


def agg_inst_level_acc(items):
    flat_items = [item for sublist in items for item in sublist]
    inst_level_acc = sum(flat_items) / len(flat_items)
    return inst_level_acc
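For intuition (not part of the commit), the inst_level_* metrics flatten every prompt's per-instruction booleans before averaging, so prompts with more instructions carry more weight; a toy example with invented values:

# Invented per-prompt results: two prompts with 2 and 3 instructions respectively.
# process_results emits one list of booleans like these per prompt.
per_prompt_results = [
    [True, False],        # prompt 1: 1 of 2 instructions followed
    [True, True, False],  # prompt 2: 2 of 3 instructions followed
]

# Flattened: 3 followed out of 5 -> 0.6, not the mean of the per-prompt rates (~0.583).
print(agg_inst_level_acc(per_prompt_results))  # 0.6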
group: leaderboard
task:
- leaderboard_mmlu_pro
- leaderboard_bbh
- leaderboard_gpqa
- leaderboard_math_hard
- leaderboard_ifeval
- leaderboard_musr
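Finally, a hedged sketch of how the new group can be invoked from Python; it assumes the harness's lm_eval.simple_evaluate entry point and uses a placeholder Hugging Face model id, and is not part of the commit:

import lm_eval

# "leaderboard" resolves to the group defined above and pulls in all six sub-groups.
results = lm_eval.simple_evaluate(
    model="hf",  # the harness's Hugging Face backend
    model_args="pretrained=your-org/your-model",  # placeholder model id
    tasks=["leaderboard"],
)
print(results["results"].keys())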