# MATH
## Paper
Measuring Mathematical Problem Solving With the MATH Dataset
https://arxiv.org/abs/2103.03874
Many intellectual endeavors require mathematical problem solving, but this skill remains beyond the capabilities of computers. To measure this ability in machine learning models, we introduce MATH, a new dataset of 12,500 challenging competition mathematics problems. Each problem in MATH has a full step-by-step solution which can be used to teach models to generate answer derivations and explanations.
NOTE: This task corresponds to the MATH (`hendrycks_math`) implementation at https://github.com/EleutherAI/lm-evaluation-harness/tree/master. For the variant that uses the custom 4-shot prompt from the Minerva paper (https://arxiv.org/abs/2206.14858) and SymPy answer checking as done by Minerva, see `lm_eval/tasks/minerva_math`.
Homepage: https://github.com/hendrycks/math
## Citation
```
@article{hendrycksmath2021,
  title={Measuring Mathematical Problem Solving With the MATH Dataset},
  author={Dan Hendrycks and Collin Burns and Saurav Kadavath and Akul Arora and Steven Basart and Eric Tang and Dawn Song and Jacob Steinhardt},
  journal={NeurIPS},
  year={2021}
}
```
### Groups and Tasks
#### Groups
- `hendrycks_math`: the MATH benchmark from Hendrycks et al., run 0-shot or few-shot (a run sketch follows the task list below).
#### Tasks
- `hendrycks_math_algebra`
- `hendrycks_math_counting_and_prob`
- `hendrycks_math_geometry`
- `hendrycks_math_intermediate_algebra`
- `hendrycks_math_num_theory`
- `hendrycks_math_prealgebra`
- `hendrycks_math_precalc`
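
The group and its subtasks can be launched through the harness's Python entry point as well as the CLI. Below is a minimal sketch assuming lm-evaluation-harness v0.4+ is installed; the model name is only a placeholder, `limit` keeps the run small, and result-dict key names may vary by harness version.

```python
# Hedged sketch: evaluate one MATH subtask few-shot via lm-evaluation-harness.
# "EleutherAI/pythia-160m" is a placeholder model; swap in your own checkpoint.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=EleutherAI/pythia-160m",
    tasks=["hendrycks_math_algebra"],   # or ["hendrycks_math"] for the whole group
    num_fewshot=4,
    batch_size=8,
    limit=10,                           # small smoke test; drop for a full run
)
print(results["results"])               # per-task exact_match scores
```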
### Checklist
The checklist is the following:
For adding novel benchmarks/datasets to the library:
* [x] Is the task an existing benchmark in the literature?
* [x] Have you referenced the original paper that introduced the task?
* [x] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
* Answer extraction code is taken from the original MATH benchmark paper's repository.
If other tasks on this dataset are already supported:
* [x] Is the "Main" variant of this task clearly denoted?
* [x] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [x] Have you noted which, if any, published evaluation setups are matched by this variant?
group: hendrycks_math
task:
  - hendrycks_math_algebra
  - hendrycks_math_counting_and_prob
  - hendrycks_math_geometry
  - hendrycks_math_intermediate_algebra
  - hendrycks_math_num_theory
  - hendrycks_math_prealgebra
  - hendrycks_math_precalc
aggregate_metric_list:
  - metric: exact_match
    aggregation: mean
    weight_by_size: true
metadata:
  version: 1.0
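
In the group config above, `weight_by_size: true` makes the group score a size-weighted (micro) average: each subtask's `exact_match` contributes in proportion to its number of test documents rather than equally. A minimal sketch of that aggregation rule (not the harness's internal code; the scores and counts below are illustrative):

```python
# Size-weighted mean: sum(score_i * n_i) / sum(n_i).
# Scores and document counts are made-up placeholders.
def weighted_mean(scores: dict, sizes: dict) -> float:
    total = sum(sizes.values())
    return sum(scores[t] * sizes[t] for t in scores) / total

scores = {"hendrycks_math_algebra": 0.12, "hendrycks_math_geometry": 0.05}
sizes = {"hendrycks_math_algebra": 1200, "hendrycks_math_geometry": 480}
print(weighted_mean(scores, sizes))  # 0.1 -- pulled toward algebra, the larger subtask
```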
tag:
  - math_word_problems
task: hendrycks_math_algebra
dataset_path: EleutherAI/hendrycks_math
process_docs: !function utils.process_docs
dataset_name: algebra
output_type: generate_until
training_split: train
test_split: test
doc_to_text: "Problem: {{problem}}\nAnswer:"
process_results: !function utils.process_results
doc_to_target: "{{answer}}"
generation_kwargs:
  until:
    - "Problem:"
  do_sample: false
  temperature: 0
metric_list:
  - metric: exact_match
    aggregation: mean
    higher_is_better: true
metadata:
  version: 1.0
dataset_kwargs:
  trust_remote_code: true
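
In the task config above, `doc_to_text` turns each problem into a plain completion prompt and `generation_kwargs.until` cuts the continuation off at the next "Problem:" marker. A rough illustration of the rendered prompt and the truncation (the problem text and model continuation are invented; this is not the harness's own rendering code):

```python
# Mimics the "Problem: {{problem}}\nAnswer:" template and the "until" stop sequence.
doc = {"problem": "What is $1+2+\\cdots+10$?"}
prompt = "Problem: {problem}\nAnswer:".format(**doc)

raw_continuation = " The sum is $\\boxed{55}$.\nProblem: What is ..."
answer_text = raw_continuation.split("Problem:")[0]  # stop at the next "Problem:"
print(prompt)
print(answer_text)  # -> The sum is $\boxed{55}$.
```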
include: hendrycks_math_algebra.yaml
dataset_name: counting_and_probability
task: hendrycks_math_counting_and_prob
include: hendrycks_math_algebra.yaml
dataset_name: geometry
task: hendrycks_math_geometry
include: hendrycks_math_algebra.yaml
dataset_name: intermediate_algebra
task: hendrycks_math_intermediate_algebra
include: hendrycks_math_algebra.yaml
dataset_name: number_theory
task: hendrycks_math_num_theory
include: hendrycks_math_algebra.yaml
dataset_name: prealgebra
task: hendrycks_math_prealgebra
include: hendrycks_math_algebra.yaml
dataset_name: precalculus
task: hendrycks_math_precalc
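
Each of the per-subject configs above only overrides `dataset_name` and `task`; everything else (prompt, metrics, generation settings) is inherited from `hendrycks_math_algebra.yaml` through `include`. A simplified sketch of that override semantics (not the harness's actual config loader, which also resolves the `!function` tags):

```python
# Simplified include/override semantics: child keys replace base keys.
base = {                       # abridged hendrycks_math_algebra.yaml
    "task": "hendrycks_math_algebra",
    "dataset_name": "algebra",
    "output_type": "generate_until",
    "test_split": "test",
}
child = {                      # hendrycks_math_geometry.yaml
    "include": "hendrycks_math_algebra.yaml",
    "dataset_name": "geometry",
    "task": "hendrycks_math_geometry",
}

merged = {**base, **{k: v for k, v in child.items() if k != "include"}}
print(merged["task"], merged["dataset_name"])  # hendrycks_math_geometry geometry
```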
from typing import Dict, List

import datasets


def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:
    def _process_doc(doc: dict) -> dict:
        out_doc = {
            "problem": doc["problem"],
            "solution": doc["solution"],
            "answer": remove_boxed(last_boxed_only_string(doc["solution"])),
        }
        return out_doc

    return dataset.map(_process_doc)


def process_results(doc: dict, results: List[str]) -> Dict[str, int]:
    retval = 0
    # If the generation is wrapped in $...$, compare only the text between the
    # first and last dollar signs; otherwise compare the raw generation.
    indices = [pos for pos, char in enumerate(results[0]) if char == "$"]
    if len(indices) <= 1:
        answer = results[0]
    else:
        answer = results[0][indices[0] + 1 : indices[-1]]

    if is_equiv(answer, remove_boxed(last_boxed_only_string(doc["solution"]))):
        retval = 1

    results = {
        "exact_match": retval,
    }
    return results
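
For a concrete sense of `process_results`: when the model's generation is wrapped in `$...$`, only the text between the first and last dollar signs is compared against the gold boxed answer. A small example, assuming the helpers defined below are in scope (the solution and generation are invented):

```python
# Invented doc and model output, showing the $...$ stripping and the scoring.
doc = {"solution": "Adding the terms gives $\\boxed{\\frac{1}{2}}$."}
generation = [" $\\frac{1}{2}$"]
print(process_results(doc, generation))  # {'exact_match': 1}
```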
# string normalization from https://github.com/EleutherAI/lm-evaluation-harness/blob/master/lm_eval/tasks/hendrycks_math.py
def is_equiv(str1, str2, verbose=False):
    if str1 is None and str2 is None:
        print("WARNING: Both None")
        return True
    if str1 is None or str2 is None:
        return False

    try:
        ss1 = strip_string(str1)
        ss2 = strip_string(str2)
        if verbose:
            print(ss1, ss2)
        return ss1 == ss2
    except Exception:
        return str1 == str2


def remove_boxed(s):
    if "\\boxed " in s:
        left = "\\boxed "
        assert s[: len(left)] == left
        return s[len(left) :]

    left = "\\boxed{"
    assert s[: len(left)] == left
    assert s[-1] == "}"
    return s[len(left) : -1]


def last_boxed_only_string(string):
    idx = string.rfind("\\boxed")
    if "\\boxed " in string:
        return "\\boxed " + string.split("\\boxed ")[-1].split("$")[0]
    if idx < 0:
        idx = string.rfind("\\fbox")
        if idx < 0:
            return None

    i = idx
    right_brace_idx = None
    num_left_braces_open = 0
    while i < len(string):
        if string[i] == "{":
            num_left_braces_open += 1
        if string[i] == "}":
            num_left_braces_open -= 1
            if num_left_braces_open == 0:
                right_brace_idx = i
                break
        i += 1

    if right_brace_idx is None:
        retval = None
    else:
        retval = string[idx : right_brace_idx + 1]

    return retval
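
Together, `last_boxed_only_string` and `remove_boxed` pull the final boxed expression out of a full solution string; this is also how the gold `answer` field is built in `process_docs`. For example (solution text invented):

```python
# Extract the final answer from a made-up step-by-step solution.
solution = "First we simplify, and conclude that the answer is $\\boxed{7\\sqrt{2}}$."
boxed = last_boxed_only_string(solution)  # '\\boxed{7\\sqrt{2}}'
print(remove_boxed(boxed))                # 7\sqrt{2}
```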
def fix_fracs(string):
    substrs = string.split("\\frac")
    new_str = substrs[0]
    if len(substrs) > 1:
        substrs = substrs[1:]
        for substr in substrs:
            new_str += "\\frac"
            if substr[0] == "{":
                new_str += substr
            else:
                try:
                    assert len(substr) >= 2
                except AssertionError:
                    return string
                a = substr[0]
                b = substr[1]
                if b != "{":
                    if len(substr) > 2:
                        post_substr = substr[2:]
                        new_str += "{" + a + "}{" + b + "}" + post_substr
                    else:
                        new_str += "{" + a + "}{" + b + "}"
                else:
                    if len(substr) > 2:
                        post_substr = substr[2:]
                        new_str += "{" + a + "}" + b + post_substr
                    else:
                        new_str += "{" + a + "}" + b
    string = new_str
    return string
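
`fix_fracs` rewrites shorthand `\frac` arguments into fully braced form, for example:

```python
# Doctest-style checks of the brace normalization above.
assert fix_fracs("\\frac12") == "\\frac{1}{2}"
assert fix_fracs("\\frac1{72}") == "\\frac{1}{72}"
assert fix_fracs("\\frac{3}{4}") == "\\frac{3}{4}"  # already braced: unchanged
```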
def fix_a_slash_b(string):
    if len(string.split("/")) != 2:
        return string
    a = string.split("/")[0]
    b = string.split("/")[1]
    try:
        # int() raises ValueError for non-integer operands; is_equiv's broad
        # except handles that case by falling back to a raw string comparison.
        a = int(a)
        b = int(b)
        assert string == "{}/{}".format(a, b)
        new_string = "\\frac{" + str(a) + "}{" + str(b) + "}"
        return new_string
    except AssertionError:
        return string


def remove_right_units(string):
    # "\\text{ " only ever occurs (at least in the val set) when describing units
    if "\\text{ " in string:
        splits = string.split("\\text{ ")
        assert len(splits) == 2
        return splits[0]
    else:
        return string


def fix_sqrt(string):
    if "\\sqrt" not in string:
        return string
    splits = string.split("\\sqrt")
    new_string = splits[0]
    for split in splits[1:]:
        if split[0] != "{":
            a = split[0]
            new_substr = "\\sqrt{" + a + "}" + split[1:]
        else:
            new_substr = "\\sqrt" + split
        new_string += new_substr
    return new_string
def strip_string(string):
    # linebreaks
    string = string.replace("\n", "")

    # remove inverse spaces
    string = string.replace("\\!", "")

    # replace \\ with \
    string = string.replace("\\\\", "\\")

    # replace tfrac and dfrac with frac
    string = string.replace("tfrac", "frac")
    string = string.replace("dfrac", "frac")

    # remove \left and \right
    string = string.replace("\\left", "")
    string = string.replace("\\right", "")

    # Remove circ (degrees)
    string = string.replace("^{\\circ}", "")
    string = string.replace("^\\circ", "")

    # remove dollar signs
    string = string.replace("\\$", "")

    # remove units (on the right)
    string = remove_right_units(string)

    # remove percentage
    string = string.replace("\\%", "")
    string = string.replace("\%", "")  # noqa: W605

    # " 0." equivalent to " ." and "{0." equivalent to "{." Alternatively, add "0" if "." is the start of the string
    string = string.replace(" .", " 0.")
    string = string.replace("{.", "{0.")

    # if empty, return empty string
    if len(string) == 0:
        return string
    if string[0] == ".":
        string = "0" + string

    # to consider: get rid of e.g. "k = " or "q = " at beginning
    if len(string.split("=")) == 2:
        if len(string.split("=")[0]) <= 2:
            string = string.split("=")[1]

    # fix sqrt3 --> sqrt{3}
    string = fix_sqrt(string)

    # remove spaces
    string = string.replace(" ", "")

    # \frac1b or \frac12 --> \frac{1}{b} and \frac{1}{2}, etc. Even works with \frac1{72} (but not \frac{72}1). Also does a/b --> \\frac{a}{b}
    string = fix_fracs(string)

    # manually change 0.5 --> \frac{1}{2}
    if string == "0.5":
        string = "\\frac{1}{2}"

    # NOTE: X/Y changed to \frac{X}{Y} in dataset, but in simple cases fix in case the model output is X/Y
    string = fix_a_slash_b(string)

    return string
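
Putting the normalization together, `strip_string` and `is_equiv` treat answers that differ only in formatting as equal, for example:

```python
# Formatting-only differences are removed before comparison.
assert strip_string("0.5") == "\\frac{1}{2}"
assert strip_string("x = \\dfrac{1}{2}") == "\\frac{1}{2}"  # drops "x =", dfrac -> frac
assert is_equiv("1/2", "\\frac{1}{2}")                      # a/b rewritten as \frac{a}{b}
assert not is_equiv("\\frac{1}{2}", "\\frac{1}{3}")
```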
[Truncated side-by-side diff hunks for the IFEval instruction modules ("Library of instructions." and "Registry of all instructions."): one-line additions in the import sections; remainder elided.]
# inverse_scaling
### Paper
Title: `Inverse Scaling: When Bigger Isn't Better`
Abstract: `Work on scaling laws has found that large language models (LMs) show predictable improvements to overall loss with increased scale (model size, training data, and compute). Here, we present evidence for the claim that LMs may show inverse scaling, or worse task performance with increased scale, e.g., due to flaws in the training objective and data. We present empirical evidence of inverse scaling on 11 datasets collected by running a public contest, the Inverse Scaling Prize, with a substantial prize pool. Through analysis of the datasets, along with other examples found in the literature, we identify four potential causes of inverse scaling: (i) preference to repeat memorized sequences over following in-context instructions, (ii) imitation of undesirable patterns in the training data, (iii) tasks containing an easy distractor task which LMs could focus on, rather than the harder real task, and (iv) correct but misleading few-shot demonstrations of the task. We release the winning datasets at https://github.com/inverse-scaling/prize to allow for further investigation of inverse scaling. Our tasks have helped drive the discovery of U-shaped and inverted-U scaling trends, where an initial trend reverses, suggesting that scaling trends are less reliable at predicting the behavior of larger-scale models than previously understood. Overall, our results suggest that there are tasks for which increased model scale alone may not lead to progress, and that more careful thought needs to go into the data and objectives for training language models.`
Note: This is not an official implementation of the Inverse Scaling Prize. It was implemented by h-albert-lee with permission from the authors of the paper.
Homepage: https://github.com/inverse-scaling/prize
### Citation
    @article{mckenzie2023inverse,
      title={Inverse Scaling: When Bigger Isn't Better},
      author={Ian R. McKenzie and Alexander Lyzhov and Michael Pieler and Alicia Parrish and Aaron Mueller and Ameya Prabhu and Euan McLean and Aaron Kirtland and Alexis Ross and Alisa Liu and Andrew Gritsevskiy and Daniel Wurgaft and Derik Kauffman and Gabriel Recchia and Jiacheng Liu and Joe Cavanagh and Max Weiss and Sicong Huang and The Floating Droid and Tom Tseng and Tomasz Korbak and Xudong Shen and Yuhui Zhang and Zhengping Zhou and Najoung Kim and Samuel R. Bowman and Ethan Perez},
      journal={arXiv preprint arXiv:2306.09479},
      year={2023}
    }
### Groups and Tasks
#### Groups
* `inverse_scaling_mc`: all tasks of the Inverse Scaling Prize (currently all aside from Prompt Injection), matching their implementations on OPT for multiple-choice classification tasks. **These match the published dataset versions from the prize, which may differ slightly from the numbers in the paper, but they have been tested for equivalence to the OPT numbers reported at https://huggingface.co/inverse-scaling/opt-1.3b_eval for multiple model sizes.**
#### Tasks
- `inverse_scaling_hindsight_neglect_10shot`
- `inverse_scaling_redefine_math`
- `inverse_scaling_quote_repetition`
- `inverse_scaling_neqa`
- `inverse_scaling_winobias_antistereotype`: not an official Inverse Scaling Prize winner, but eval results for it are reported at https://huggingface.co/inverse-scaling/opt-1.3b_eval.
- `inverse_scaling_into_the_unknown`
- `inverse_scaling_memo_trap`
- `inverse_scaling_modus_tollens`
- `inverse_scaling_pattern_matching_suppression`
- `inverse_scaling_repetitive_algebra`
- `inverse_scaling_sig_figs`
### Checklist
For adding novel benchmarks/datasets to the library:
* [x] Is the task an existing benchmark in the literature?
* [x] Have you referenced the original paper that introduced the task?
* [x] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
If other tasks on this dataset are already supported:
* [ ] Is the "Main" variant of this task clearly denoted?
* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
tag:
  - inverse_scaling_mc
output_type: multiple_choice
test_split: train
doc_to_text: prompt
doc_to_choice: classes
doc_to_target: answer_index
target_delimiter: ""
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
  - metric: acc_norm
    aggregation: mean
    higher_is_better: true
metadata:
  version: 0
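
For `output_type: multiple_choice`, the harness scores each option in `classes` by its log-likelihood as a continuation of `prompt`: `acc` takes the raw argmax, while `acc_norm` normalizes each option's log-likelihood by its length before taking the argmax. A toy sketch of that scoring rule (the log-likelihoods are invented; in practice the model supplies them):

```python
# Toy scoring for one multiple-choice doc; log-likelihoods are made up.
choices = ["Yes, definitely", "No"]
logliks = [-6.0, -3.5]          # log P(choice | prompt) per option
answer_index = 0

pred = max(range(len(choices)), key=lambda i: logliks[i])
pred_norm = max(range(len(choices)),
                key=lambda i: logliks[i] / len(choices[i]))

acc = int(pred == answer_index)            # raw argmax picks "No" here -> 0
acc_norm = int(pred_norm == answer_index)  # length-normalized argmax picks "Yes..." -> 1
print(acc, acc_norm)
```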
# | Tasks |Version|Filter|n-shot| Metric |Value | |Stderr|
# |-------------------------------------------|-------|------|-----:|--------|-----:|---|-----:|
# | - inverse_scaling_hindsight_neglect_10shot| 0|none | 0|acc |0.4476|± |0.0281|
# | | |none | 0|acc_norm|0.4476|± |0.0281|
# |inverse_scaling_mc |N/A |none | 0|acc_norm|0.6273|± |0.0096|
# | | |none | 0|acc |0.6210|± |0.0095|
# | - inverse_scaling_neqa | 0|none | 0|acc |0.5300|± |0.0289|
# | | |none | 0|acc_norm|0.5300|± |0.0289|
# | - inverse_scaling_quote_repetition | 0|none | 0|acc |0.9367|± |0.0141|
# | | |none | 0|acc_norm|0.9367|± |0.0141|
# | - inverse_scaling_redefine_math | 0|none | 0|acc |0.7178|± |0.0150|
# | | |none | 0|acc_norm|0.7178|± |0.0150|
# | - inverse_scaling_winobias_antistereotype | 0|none | 0|acc |0.3786|± |0.0239|
# | | |none | 0|acc_norm|0.4126|± |0.0243|
# | Groups |Version|Filter|n-shot| Metric |Value | |Stderr|
# |------------------|-------|------|-----:|--------|-----:|---|-----:|
# |inverse_scaling_mc|N/A |none | 0|acc_norm|0.6273|± |0.0096|
# | | |none | 0|acc |0.6210|± |0.0095|
# hf (pretrained=facebook/opt-2.7b,add_bos_token=True,dtype=float32), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: auto (32)
# | Tasks |Version|Filter|n-shot| Metric |Value | |Stderr|
# |-------------------------------------------|-------|------|-----:|--------|-----:|---|-----:|
# | - inverse_scaling_hindsight_neglect_10shot| 0|none | 0|acc |0.4476|± |0.0281|
# | | |none | 0|acc_norm|0.4476|± |0.0281|
# |inverse_scaling_mc |N/A |none | 0|acc_norm|0.6291|± |0.0095|
# | | |none | 0|acc |0.6219|± |0.0095|
# | - inverse_scaling_neqa | 0|none | 0|acc |0.5267|± |0.0289|
# | | |none | 0|acc_norm|0.5267|± |0.0289|
# | - inverse_scaling_quote_repetition | 0|none | 0|acc |0.9433|± |0.0134|
# | | |none | 0|acc_norm|0.9433|± |0.0134|
# | - inverse_scaling_redefine_math | 0|none | 0|acc |0.7200|± |0.0150|
# | | |none | 0|acc_norm|0.7200|± |0.0150|
# | - inverse_scaling_winobias_antistereotype | 0|none | 0|acc |0.3762|± |0.0239|
# | | |none | 0|acc_norm|0.4150|± |0.0243|
# | Groups |Version|Filter|n-shot| Metric |Value | |Stderr|
# |------------------|-------|------|-----:|--------|-----:|---|-----:|
# |inverse_scaling_mc|N/A |none | 0|acc_norm|0.6291|± |0.0095|
# | | |none | 0|acc |0.6219|± |0.0095|
include: _inverse_scaling_mc_yaml
task: inverse_scaling_hindsight_neglect_10shot
dataset_path: inverse-scaling/hindsight-neglect-10shot
include: _inverse_scaling_mc_yaml
task: inverse_scaling_into_the_unknown
dataset_path: Albertmade/into-the-unknown
include: _inverse_scaling_mc_yaml
task: inverse_scaling_memo_trap
dataset_path: Albertmade/memo-trap
include: _inverse_scaling_mc_yaml
task: inverse_scaling_modus_tollens
dataset_path: Albertmade/modus-tollens
include: _inverse_scaling_mc_yaml
task: inverse_scaling_neqa
dataset_path: inverse-scaling/NeQA