Unverified commit eea16d36 authored by Jess, committed by GitHub

Merge branch 'EleutherAI:main' into main

parents 72f5f4b1 885f48d6
@@ -307,7 +307,7 @@ To save evaluation results provide an `--output_path`. We also support logging m
 Additionally, one can provide a directory with `--use_cache` to cache the results of prior runs. This allows you to avoid repeated execution of the same (model, task) pairs for re-scoring.
-To push results and samples to the Hugging Face Hub, first ensure an access token with write access is set in the `HF_TOKEN` environment variable. Then, use the --hf_hub_log_args flag to specify the organization, repository name, repository visibility, and whether to push results and samples to the Hub. For example:
+To push results and samples to the Hugging Face Hub, first ensure an access token with write access is set in the `HF_TOKEN` environment variable. Then, use the `--hf_hub_log_args` flag to specify the organization, repository name, repository visibility, and whether to push results and samples to the Hub - [example output](https://huggingface.co/datasets/KonradSzafer/lm-eval-results-demo/tree/main/microsoft__phi-2). For instance:
 ```bash
 lm_eval --model hf \
@@ -443,6 +443,7 @@ Extras dependencies can be installed via `pip install -e ".[NAME]"`
 | sentencepiece | For using the sentencepiece tokenizer |
 | sparseml | For using NM's SparseML models |
 | testing | For running library test suite |
+| unitxt | For IBM's unitxt dataset tasks |
 | vllm | For loading models with vLLM |
 | zeno | For visualizing results with Zeno |
 |---------------|---------------------------------------|
@@ -81,7 +81,7 @@ class EvaluationTracker:
     def __init__(
         self,
-        output_path: str = "",
+        output_path: str = None,
         hub_results_org: str = "",
         hub_repo_name: str = "",
         push_results_to_hub: bool = False,
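For reference, here is a minimal sketch of how the comma-separated `key=value` pairs passed to `--hf_hub_log_args` line up with the `EvaluationTracker` keyword arguments visible in this hunk. The import path and the example org/repo names are assumptions, not taken from this diff; the CLI normally constructs this object for you.

```python
# Hypothetical illustration: building the tracker directly with the keyword
# arguments shown in the diff above. The import path is an assumption.
from lm_eval.loggers import EvaluationTracker

tracker = EvaluationTracker(
    output_path="results",            # local directory for results/samples
    hub_results_org="my-org",         # placeholder Hub organization
    hub_repo_name="lm-eval-results",  # placeholder Hub repository name
    push_results_to_hub=True,         # requires HF_TOKEN with write access
)
```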
@@ -667,6 +667,8 @@ class HFLM(TemplateLM):
             max_cont_enc = len(continuation_enc[-(self.max_length + 1) :])
         else:
             max_length = self.max_length
+            max_context_enc = max_length
+            max_cont_enc = max_length
         # if OOM, then halves batch_size and tries again
         @find_executable_batch_size(starting_batch_size=self.max_batch_size)
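The two added assignments matter because code after the `if`/`else` reads `max_context_enc` and `max_cont_enc` regardless of which branch ran. A stripped-down sketch of the failure mode the change avoids (this is a standalone toy function, not the harness's actual method):

```python
# Minimal sketch of the branch structure above, with made-up request tuples.
# Without the two assignments in the else branch, the return line would raise
# UnboundLocalError whenever `requests` is empty.
def detect_lengths(requests, max_length=2048):
    if requests:
        context_enc, continuation_enc = requests[0]
        max_context_enc = len(context_enc[-(max_length + 1):])
        max_cont_enc = len(continuation_enc[-(max_length + 1):])
    else:
        max_context_enc = max_length
        max_cont_enc = max_length
    # both names must be bound on every path before they are used here
    return max_context_enc, max_cont_enc


print(detect_lengths([]))                      # (2048, 2048)
print(detect_lengths([([1, 2, 3], [4, 5])]))   # (3, 2)
```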
# MATH
## Paper
Measuring Mathematical Problem Solving With the MATH Dataset
https://arxiv.org/abs/2103.03874
Many intellectual endeavors require mathematical problem solving, but this skill remains beyond the capabilities of computers. To measure this ability in machine learning models, we introduce MATH, a new dataset of 12,500 challenging competition mathematics problems. Each problem in MATH has a full step-by-step solution which can be used to teach models to generate answer derivations and explanations.
NOTE: This task corresponds to the MATH (`hendrycks_math`) implementation at https://github.com/EleutherAI/lm-evaluation-harness/tree/master. For the variant which uses the custom 4-shot prompt from the Minerva paper (https://arxiv.org/abs/2206.14858) and SymPy answer checking as done by Minerva, see `lm_eval/tasks/minerva_math`.
Homepage: https://github.com/hendrycks/math
## Citation
```
@article{hendrycksmath2021,
title={Measuring Mathematical Problem Solving With the MATH Dataset},
author={Dan Hendrycks and Collin Burns and Saurav Kadavath and Akul Arora and Steven Basart and Eric Tang and Dawn Song and Jacob Steinhardt},
journal={NeurIPS},
year={2021}
}
```
### Groups and Tasks
#### Groups
- `hendrycks_math`: the MATH benchmark from Hendrycks et al., evaluated 0-shot or few-shot.
#### Tasks
- `hendrycks_math_algebra`
- `hendrycks_math_counting_and_prob`
- `hendrycks_math_geometry`
- `hendrycks_math_intermediate_algebra`
- `hendrycks_math_num_theory`
- `hendrycks_math_prealgebra`
- `hendrycks_math_precalc`
### Checklist
The checklist is the following:
For adding novel benchmarks/datasets to the library:
* [x] Is the task an existing benchmark in the literature?
* [x] Have you referenced the original paper that introduced the task?
* [x] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
* Answer extraction code is taken from the original MATH benchmark paper's repository.
If other tasks on this dataset are already supported:
* [x] Is the "Main" variant of this task clearly denoted?
* [x] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [x] Have you noted which, if any, published evaluation setups are matched by this variant?
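For orientation, here is a hedged sketch of running one of these tasks through the Python API. `lm_eval.simple_evaluate` is the harness's programmatic entry point, but the checkpoint and few-shot settings below are illustrative placeholders, not a prescribed setup:

```python
# Illustrative only: evaluate one MATH subject with a small HF model.
# The pretrained checkpoint, num_fewshot, and batch_size are placeholders.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=EleutherAI/pythia-160m",
    tasks=["hendrycks_math_algebra"],  # or "hendrycks_math" for the whole group
    num_fewshot=4,
    batch_size=8,
)
print(results["results"])
```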
# Group config (hendrycks_math)
group: hendrycks_math
task:
  - hendrycks_math_algebra
  - hendrycks_math_counting_and_prob
  - hendrycks_math_geometry
  - hendrycks_math_intermediate_algebra
  - hendrycks_math_num_theory
  - hendrycks_math_prealgebra
  - hendrycks_math_precalc
# Base task config (hendrycks_math_algebra); the other subject configs include this file
group:
  - math_word_problems
task: hendrycks_math_algebra
dataset_path: EleutherAI/hendrycks_math
process_docs: !function utils.process_docs
dataset_name: algebra
output_type: generate_until
training_split: train
test_split: test
doc_to_text: "Problem: {{problem}}\nAnswer:"
process_results: !function utils.process_results
doc_to_target: "{{answer}}"
generation_kwargs:
  until:
    - "Problem:"
  do_sample: false
  temperature: 0
metric_list:
  - metric: exact_match
    aggregation: mean
    higher_is_better: true
metadata:
  version: 1.0
dataset_kwargs:
  trust_remote_code: true
# Per-subject variants, each in its own config that includes the algebra config
include: hendrycks_math_algebra.yaml
dataset_name: counting_and_probability
task: hendrycks_math_counting_and_prob

include: hendrycks_math_algebra.yaml
dataset_name: geometry
task: hendrycks_math_geometry

include: hendrycks_math_algebra.yaml
dataset_name: intermediate_algebra
task: hendrycks_math_intermediate_algebra

include: hendrycks_math_algebra.yaml
dataset_name: number_theory
task: hendrycks_math_num_theory

include: hendrycks_math_algebra.yaml
dataset_name: prealgebra
task: hendrycks_math_prealgebra

include: hendrycks_math_algebra.yaml
dataset_name: precalculus
task: hendrycks_math_precalc
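To make the config concrete, here is a small sketch (with a made-up document) of how `doc_to_text`, `doc_to_target`, and the `until` stop sequence fit together for these tasks; the `answer` field is produced by `utils.process_docs`, shown next:

```python
# Hypothetical document after process_docs has added the extracted "answer" field.
doc = {
    "problem": "What is $1+1$?",
    "solution": "We compute $1+1=\\boxed{2}$.",
    "answer": "2",
}

# doc_to_text: "Problem: {{problem}}\nAnswer:"
prompt = f"Problem: {doc['problem']}\nAnswer:"

# doc_to_target: "{{answer}}" -- the gold target the generation is scored against
target = doc["answer"]

# generation stops when the model starts a new "Problem:" (the `until` setting)
stop_sequences = ["Problem:"]

print(prompt)
print(target)
```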
from typing import Dict, List

import datasets


def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:
    def _process_doc(doc: dict) -> dict:
        out_doc = {
            "problem": doc["problem"],
            "solution": doc["solution"],
            "answer": remove_boxed(last_boxed_only_string(doc["solution"])),
        }
        return out_doc

    return dataset.map(_process_doc)

def process_results(doc: dict, results: List[str]) -> Dict[str, int]:
    retval = 0
    indices = [pos for pos, char in enumerate(results[0]) if char == "$"]
    if len(indices) <= 1:
        answer = results[0]
    else:
        answer = results[0][indices[0] + 1 : indices[-1]]

    if is_equiv(answer, remove_boxed(last_boxed_only_string(doc["solution"]))):
        retval = 1

    results = {
        "exact_match": retval,
    }
    return results

# string normalization from https://github.com/EleutherAI/lm-evaluation-harness/blob/master/lm_eval/tasks/hendrycks_math.py
def is_equiv(str1, str2, verbose=False):
    if str1 is None and str2 is None:
        print("WARNING: Both None")
        return True
    if str1 is None or str2 is None:
        return False

    try:
        ss1 = strip_string(str1)
        ss2 = strip_string(str2)
        if verbose:
            print(ss1, ss2)
        return ss1 == ss2
    except Exception:
        return str1 == str2

def remove_boxed(s):
    if "\\boxed " in s:
        left = "\\boxed "
        assert s[: len(left)] == left
        return s[len(left) :]

    left = "\\boxed{"
    assert s[: len(left)] == left
    assert s[-1] == "}"

    return s[len(left) : -1]

def last_boxed_only_string(string):
    idx = string.rfind("\\boxed")
    if "\\boxed " in string:
        return "\\boxed " + string.split("\\boxed ")[-1].split("$")[0]
    if idx < 0:
        idx = string.rfind("\\fbox")
        if idx < 0:
            return None

    i = idx
    right_brace_idx = None
    num_left_braces_open = 0
    while i < len(string):
        if string[i] == "{":
            num_left_braces_open += 1
        if string[i] == "}":
            num_left_braces_open -= 1
            if num_left_braces_open == 0:
                right_brace_idx = i
                break
        i += 1

    if right_brace_idx is None:
        retval = None
    else:
        retval = string[idx : right_brace_idx + 1]

    return retval

def fix_fracs(string):
    substrs = string.split("\\frac")
    new_str = substrs[0]
    if len(substrs) > 1:
        substrs = substrs[1:]
        for substr in substrs:
            new_str += "\\frac"
            if substr[0] == "{":
                new_str += substr
            else:
                try:
                    assert len(substr) >= 2
                except AssertionError:
                    return string
                a = substr[0]
                b = substr[1]
                if b != "{":
                    if len(substr) > 2:
                        post_substr = substr[2:]
                        new_str += "{" + a + "}{" + b + "}" + post_substr
                    else:
                        new_str += "{" + a + "}{" + b + "}"
                else:
                    if len(substr) > 2:
                        post_substr = substr[2:]
                        new_str += "{" + a + "}" + b + post_substr
                    else:
                        new_str += "{" + a + "}" + b
    string = new_str
    return string

def fix_a_slash_b(string):
    if len(string.split("/")) != 2:
        return string
    a = string.split("/")[0]
    b = string.split("/")[1]
    try:
        a = int(a)
        b = int(b)
        assert string == "{}/{}".format(a, b)
        new_string = "\\frac{" + str(a) + "}{" + str(b) + "}"
        return new_string
    except AssertionError:
        return string

def remove_right_units(string):
    # "\\text{ " only ever occurs (at least in the val set) when describing units
    if "\\text{ " in string:
        splits = string.split("\\text{ ")
        assert len(splits) == 2
        return splits[0]
    else:
        return string

def fix_sqrt(string):
    if "\\sqrt" not in string:
        return string
    splits = string.split("\\sqrt")
    new_string = splits[0]
    for split in splits[1:]:
        if split[0] != "{":
            a = split[0]
            new_substr = "\\sqrt{" + a + "}" + split[1:]
        else:
            new_substr = "\\sqrt" + split
        new_string += new_substr
    return new_string

def strip_string(string):
    # linebreaks
    string = string.replace("\n", "")

    # remove inverse spaces
    string = string.replace("\\!", "")

    # replace \\ with \
    string = string.replace("\\\\", "\\")

    # replace tfrac and dfrac with frac
    string = string.replace("tfrac", "frac")
    string = string.replace("dfrac", "frac")

    # remove \left and \right
    string = string.replace("\\left", "")
    string = string.replace("\\right", "")

    # Remove circ (degrees)
    string = string.replace("^{\\circ}", "")
    string = string.replace("^\\circ", "")

    # remove dollar signs
    string = string.replace("\\$", "")

    # remove units (on the right)
    string = remove_right_units(string)

    # remove percentage
    string = string.replace("\\%", "")
    string = string.replace("\%", "")  # noqa: W605

    # " 0." equivalent to " ." and "{0." equivalent to "{." Alternatively, add "0" if "." is the start of the string
    string = string.replace(" .", " 0.")
    string = string.replace("{.", "{0.")

    # if empty, return empty string
    if len(string) == 0:
        return string
    if string[0] == ".":
        string = "0" + string

    # to consider: get rid of e.g. "k = " or "q = " at beginning
    if len(string.split("=")) == 2:
        if len(string.split("=")[0]) <= 2:
            string = string.split("=")[1]

    # fix sqrt3 --> sqrt{3}
    string = fix_sqrt(string)

    # remove spaces
    string = string.replace(" ", "")

    # \frac1b or \frac12 --> \frac{1}{b} and \frac{1}{2}, etc. Even works with \frac1{72} (but not \frac{72}1). Also does a/b --> \\frac{a}{b}
    string = fix_fracs(string)

    # manually change 0.5 --> \frac{1}{2}
    if string == "0.5":
        string = "\\frac{1}{2}"

    # NOTE: X/Y changed to \frac{X}{Y} in dataset, but in simple cases fix in case the model output is X/Y
    string = fix_a_slash_b(string)

    return string
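The normalization above is easiest to see on a few concrete strings. The following snippet (examples chosen here, not taken from the repository) exercises the helpers defined in this file, assuming it is run alongside `utils.py`:

```python
# Worked examples for the answer-extraction and normalization helpers above.
from utils import is_equiv, last_boxed_only_string, remove_boxed, strip_string

solution = "Thus the probability is $\\boxed{\\dfrac{1}{2}}$."

boxed = last_boxed_only_string(solution)  # "\\boxed{\\dfrac{1}{2}}"
gold = remove_boxed(boxed)                # "\\dfrac{1}{2}"

# strip_string canonicalizes LaTeX variants before comparison:
assert strip_string("\\dfrac{1}{2}") == "\\frac{1}{2}"
assert strip_string("0.5") == "\\frac{1}{2}"
assert strip_string("1/2") == "\\frac{1}{2}"

# so all of these model answers count as exact matches against the gold answer:
assert is_equiv("\\frac{1}{2}", gold)
assert is_equiv("0.5", gold)
assert is_equiv("1/2", gold)
```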
@@ -28,16 +28,11 @@ Eprint = {arXiv:2206.14858},
 }
 ```
-### Groups, Benchmarks and Tasks
-#### Benchmarks
-- `minerva_math`
+### Groups and Tasks
 #### Groups
-- `math_word_problems`
-- `generate_until`
+- `minerva_math`
 #### Tasks
include: unitxt_tasks.classification.multi_class
task: 20_newsgroups
dataset_name: card=cards.20_newsgroups,template=templates.classification.multi_class.title
# Unitxt
### Paper
Title: `Unitxt: Flexible, Shareable and Reusable Data Preparation and Evaluation for Generative AI`
Abstract: `https://arxiv.org/abs/2401.14019`
Unitxt is a library for customizable textual data preparation and evaluation tailored to generative language models. Unitxt natively integrates with common libraries like HuggingFace and LM-eval-harness and deconstructs processing flows into modular components, enabling easy customization and sharing between practitioners. These components encompass model-specific formats, task prompts, and many other comprehensive dataset processing definitions. These components are centralized in the Unitxt-Catalog, thus fostering collaboration and exploration in modern textual data workflows.
The full Unitxt catalog can be viewed in an online explorer. `https://unitxt.readthedocs.io/en/latest/docs/demo.html`
Homepage: https://unitxt.readthedocs.io/en/latest/index.html
### Citation
```
@misc{unitxt,
title={Unitxt: Flexible, Shareable and Reusable Data Preparation and Evaluation for Generative AI},
author={Elron Bandel and Yotam Perlitz and Elad Venezian and Roni Friedman-Melamed and Ofir Arviv and Matan Orbach and Shachar Don-Yehyia and Dafna Sheinwald and Ariel Gera and Leshem Choshen and Michal Shmueli-Scheuer and Yoav Katz},
year={2024},
eprint={2401.14019},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```
### Groups and Tasks
#### Groups
* `unitxt`: A subset of Unitxt tasks that were not previously in the LM-Eval Harness task catalog, including new task types such as multi-label classification, grammatical error correction, and named entity extraction.
#### Tasks
The full list of Unitxt tasks currently supported can be seen under `tasks/unitxt` directory.
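As with any other task, a generated Unitxt task can be run through the harness once the `unitxt` extra is installed (`pip install -e ".[unitxt]"`, per the extras table above). A hedged sketch using the Python API, with a placeholder model:

```python
# Illustrative only: evaluate the generated 20_newsgroups task (defined above)
# with a placeholder checkpoint. Requires the `unitxt` extra to be installed.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=google/flan-t5-small",
    tasks=["20_newsgroups"],
    num_fewshot=0,
    batch_size=8,
)
print(results["results"])
```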
### Adding tasks
You can add additional tasks from the Unitxt catalog by generating new LM-Eval yaml files for these datasets.
The Unitxt task yaml files are generated via the `generate_yamls.py` script in the `tasks/unitxt` directory.
To add a yaml file for an existing Unitxt dataset that is not yet in LM-Eval:
1. Add the card name to the `unitxt_datasets` file in the `tasks/unitxt` directory.
2. The `generate_yamls.py` script contains the default Unitxt [template](https://unitxt.readthedocs.io/en/latest/docs/adding_template.html) used for each kind of NLP task in the `default_template_per_task` dictionary. If the dataset is of a Unitxt task type that has not previously been used in LM-Eval, you will need to add a default template for it to the dictionary.
```
default_template_per_task = {
"tasks.classification.multi_label" : "templates.classification.multi_label.title" ,
"tasks.classification.multi_class" : "templates.classification.multi_class.title" ,
"tasks.summarization.abstractive" : "templates.summarization.abstractive.full",
"tasks.regression.two_texts" : "templates.regression.two_texts.simple",
"tasks.qa.with_context.extractive" : "templates.qa.with_context.simple",
"tasks.grammatical_error_correction" : "templates.grammatical_error_correction.simple",
"tasks.span_labeling.extraction" : "templates.span_labeling.extraction.title"
}
```
3. Run `python generate_yamls.py` (this will generate yaml files for all the datasets listed in `unitxt_datasets`).
If you want to add a new dataset to the Unitxt catalog, see the Unitxt documentation:
https://unitxt.readthedocs.io/en/latest/docs/adding_dataset.html
### Checklist
For adding novel benchmarks/datasets to the library:
* [ ] Is the task an existing benchmark in the literature?
* [ ] Have you referenced the original paper that introduced the task?
* [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
If other tasks on this dataset are already supported:
* [ ] Is the "Main" variant of this task clearly denoted?
* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
include: unitxt_tasks.classification.multi_class
task: ag_news
dataset_name: card=cards.ag_news,template=templates.classification.multi_class.title

include: unitxt_tasks.classification.multi_class
task: argument_topic
dataset_name: card=cards.argument_topic,template=templates.classification.multi_class.title

include: unitxt_tasks.span_labeling.extraction
task: atis
dataset_name: card=cards.atis,template=templates.span_labeling.extraction.title

include: unitxt_tasks.classification.multi_class
task: banking77
dataset_name: card=cards.banking77,template=templates.classification.multi_class.title