Unverified commit 3dcde17b authored by Lintang Sutawika, committed by GitHub

Merge branch 'big-refactor' into fix-gold-aliases

parents b8cb513b 6a000adb
# Eval Harness Documentation
Welcome to the docs for the LM Evaluation Harness!
## Table of Contents
* To learn how to add a new model API or implementation to the library, along with a quick explainer of the different ways an LM can be evaluated, see the [Model Guide](https://github.com/EleutherAI/lm-evaluation-harness/blob/big-refactor/docs/model_guide.md).
* For a crash course on adding new tasks to the library, see our [New Task Guide](https://github.com/EleutherAI/lm-evaluation-harness/blob/big-refactor/docs/new_task_guide.md).
* To learn more about pushing the limits of task configuration that the Eval Harness supports, see the [Advanced Task Guide](https://github.com/EleutherAI/lm-evaluation-harness/blob/big-refactor/docs/advanced_task_guide.md).
## Progress on Revamp
Tracking progress on revamping documentation pages for the refactor of LM-Evaluation-Harness.
### Desired Pages
* [ ] YAML explainer
* [ ] Explainer on filters + advanced features
# New Model Guide
The `lm-evaluation-harness` is intended to be a model-agnostic framework for evaluating language models. We provide first-class support for HuggingFace `AutoModelForCausalLM` and `AutoModelForSeq2SeqLM` models, but other model backends and APIs can be supported by writing a thin wrapper class, as described below.
This guide may be of special interest to users who use the library outside of this repository, e.g. by installing it from PyPI and calling `lm_eval.evaluator.evaluate()` to evaluate an existing model.
In order to properly evaluate a given LM, we require a wrapper class subclassing `lm_eval.api.model.LM` that defines how the Evaluation Harness should interface with your model. This guide walks through how to write this `LM` subclass and add it to the library!
## Setup
To get started contributing, fork the main repo, clone it, create a branch for your model, and install the project requirements in your environment:
```sh
# After forking...
git clone https://github.com/<YOUR-USERNAME>/lm-evaluation-harness.git
cd lm-evaluation-harness
git checkout big-refactor
git checkout -b <model-type>
pip install -e ".[dev]"
```
Now, we'll create a new file where we'll be adding our model:
```sh
touch lm_eval/models/<my_model_filename>.py
```
**Tip: this filename should not shadow package names! For example, naming your file `anthropic.py` is disallowed since the API's name on pypi is `anthropic`, but naming it `anthropic_llms.py` works with no problems.**
## Interface
All models must subclass the `lm_eval.api.model.LM` class.
The LM class enforces a common interface via which we can extract responses from a model:
```python
class MyCustomLM(LM):
    # ...
    def loglikelihood(self, requests):
        ...

    def loglikelihood_rolling(self, requests):
        ...

    def greedy_until(self, requests):
        ...
    # ...
```
We support three types of requests. Because all of them operate on strings rather than token IDs, the interface is tokenizer-agnostic: the harness makes no assumptions about how your model tokenizes its inputs. Each method receives a list of requests and must return one result per request, in the same order.

The three request types are:
- `greedy_until`: each request supplies an input string and stopping criteria (for example, stop sequences and/or a cap on generation length); the model returns the text produced by greedy decoding from that input.
- `loglikelihood`: each request supplies an input string and a target string; the model returns the log probability of the target conditioned on the input, along with a boolean indicating whether the target would also be produced by greedy decoding.
- `loglikelihood_rolling`: each request supplies a single string; the model returns the total log probability of that string. This is used for perplexity-style evaluations such as WikiText.
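To make the interface concrete, here is a minimal, illustrative sketch of an `LM` subclass. It is not a real model: it answers every request with a constant value, and the request/return shapes noted in the comments are assumptions based on the descriptions above, so consult `lm_eval.api.model.LM` for the authoritative signatures.

```python
from lm_eval.api.model import LM


class DummyConstantLM(LM):
    """Toy example only: returns fixed values for every request,
    purely to illustrate the shape of the three required methods."""

    def loglikelihood(self, requests):
        # Assumed shape: one (log-probability, is_greedy) pair per request.
        return [(-1.0, False) for _ in requests]

    def loglikelihood_rolling(self, requests):
        # Assumed shape: one total log-probability per request.
        return [-10.0 for _ in requests]

    def greedy_until(self, requests):
        # Assumed shape: one generated string per request.
        return ["dummy continuation" for _ in requests]
```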
## Registration
Congrats on implementing your model! Now it's time to test it out.
To make your model usable via the `lm-eval` command line interface (`main.py`), you'll need to tell `lm-eval` what your model's name is.
This is done via a *decorator*, `lm_eval.api.registry.register_model`. Using `register_model()`, you both alert `lm-eval` to your model's existence and tell the package which name(s) can be used to invoke it with `python main.py --model <name>`.
```python
from lm_eval.api.registry import register_model

@register_model("<name1>", "<name2>")
class MyCustomLM(LM):
    ...
```
Using this decorator results in the class being added to an accounting of the usable LM types maintained internally to the library at `lm_eval.api.registry.MODEL_REGISTRY`. See `lm_eval.api.registry` for more detail on what sorts of registries and decorators exist in the library!
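For example, if the toy model above were registered under the name `dummy-constant`, it could then be selected from the command line like any built-in model (the `--tasks` value here is just one of the tasks tracked in this refactor):

```sh
python main.py --model dummy-constant --tasks openbookqa
```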
## Other
**Pro tip**: So that the Evaluation Harness *overestimates* total runtime rather than underestimating it, HuggingFace models process data points in *descending order by total input length*, using `lm_eval.utils.Reorderer`. Take a look at `lm_eval.models.hf_causal.HFLM` to see how this is done, and see if you can implement it in your own model!
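A rough sketch of the idea (not the library's actual `Reorderer` implementation; it assumes each request is a tuple of strings and that `score` is a stand-in for whatever your model does per request):

```python
def run_longest_first(requests, score):
    # Process requests in descending order of total input length, then
    # return the results in the caller's original order.
    order = sorted(
        range(len(requests)),
        key=lambda i: sum(len(part) for part in requests[i]),
        reverse=True,
    )
    results = [None] * len(requests)
    for i in order:  # longest inputs run first, so time estimates start pessimistic
        results[i] = score(requests[i])
    return results
```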
## Conclusion
After reading this guide, you should be able to add new model APIs or implementations to the Eval Harness library!
@@ -63,10 +63,11 @@ class TaskConfig(dict):
     fewshot_split: str = None  # TODO: assert that this not None if num_fewshot > 0. (?) assert if this is same split as one evaling (?)
     template_aliases: str = None
+    aliases: Union[str, list] = None
     doc_to_text: Union[Callable, str] = None
     doc_to_target: Union[Callable, str] = None
     use_prompt: str = None
+    delimiter: str = "\n\n"
+    description: str = ""
     num_fewshot: int = 0
     batch_size: int = 1
@@ -434,35 +435,12 @@ class Task(abc.ABC):
         ), "A `random.Random` generator argument must be provided to `rnd`"

         if num_fewshot == 0:
-            labeled_examples = ""
+            # always prepend the (possibly empty) task description
+            labeled_examples = self._config.description
         else:
-            labeled_examples = self.sampler.get_context(doc, num_fewshot)
-            # for sets with no training docs, draw from other set *but ensure no overlap with current doc*
-            # if self.has_training_docs():
-            #     fewshotex = self.fewshot_examples(k=num_fewshot, rnd=rnd)
-            # else:
-            #     if self._fewshot_docs is None:
-            #         self._fewshot_docs = list(
-            #             self.validation_docs()
-            #             if self.has_validation_docs()
-            #             else self.test_docs()
-            #         )
-            #     fewshotex = rnd.sample(self._fewshot_docs, num_fewshot + 1)
-            #     # get rid of the doc that's the one we're evaluating, if it's in the fewshot
-            #     fewshotex = [x for x in fewshotex if x != doc][:num_fewshot]
-            #     labeled_examples = (
-            #         "\n\n".join(
-            #             [
-            #                 self.doc_to_text(doc) + self.doc_to_target(doc)
-            #                 for doc in fewshotex
-            #             ]
-            #         )
-            #         + "\n\n"
-            #     )
+            labeled_examples = self._config.description + self.sampler.get_context(
+                doc, num_fewshot
+            )

         example = self.doc_to_text(doc)
         return labeled_examples + example
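The net effect of the change above is that the task's (possibly empty) `description` from its config is now always prepended to the few-shot context. Roughly, with hypothetical strings that are not taken from any real task:

```python
# Schematic of what fewshot_context() now assembles (illustrative values only):
description = "The following are questions about elementary science.\n\n"  # self._config.description
fewshot_block = "Question: ...\nAnswer: ...\n\n"                           # self.sampler.get_context(doc, num_fewshot)
example = "Question: Which material conducts heat best?\nAnswer:"          # self.doc_to_text(doc)

prompt = description + fewshot_block + example  # returned value (just description + example when num_fewshot == 0)
```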
@@ -23,7 +23,7 @@ Boxes should be checked iff tasks are implemented in the refactor and tested for
 - [ ] LogiQA
 - [ ] HellaSwag
 - [ ] SWAG
-- [ ] OpenBookQA
+- [x] OpenBookQA
 - [ ] SQuADv2
 - [ ] RACE
 - [ ] HeadQA
group:
- multiple_choice
task: openbookqa
dataset_path: openbookqa
dataset_name: main
output_type: multiple_choice
training_split: train
validation_split: validation
test_split: test
template_aliases: "{% set answer_choices = choices['text'] %}{% set gold = choices.label.index(answerKey.lstrip()) %}" # set the list of possible answer choices, and set what this doc's gold answer is (set what ds column used, and what)
doc_to_text: "{{question_stem}}"
doc_to_target: "{{gold}}" # this will be cast to an int.
should_decontaminate: true
doc_to_decontamination_query: "{{question_stem}}"
metric_list:
- metric: acc
aggregation: mean
higher_is_better: true
- metric: acc_norm
aggregation: mean
higher_is_better: true
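As a sanity check of the templates above, here is a small sketch of how they render with Jinja2 on a hypothetical OpenBookQA-style record. It assumes, as the `template_aliases` comment suggests, that the aliases string is prepended to the other templates before rendering; the document values are made up.

```python
from jinja2 import Environment

# Hypothetical OpenBookQA-style document (made-up values, not from the dataset).
doc = {
    "question_stem": "Which of these materials lets heat pass through most easily?",
    "choices": {"text": ["wool", "steel", "wood", "plastic"], "label": ["A", "B", "C", "D"]},
    "answerKey": "B",
}

env = Environment()
aliases = "{% set answer_choices = choices['text'] %}{% set gold = choices.label.index(answerKey.lstrip()) %}"

print(env.from_string(aliases + "{{question_stem}}").render(**doc))
# -> Which of these materials lets heat pass through most easily?
print(env.from_string(aliases + "{{gold}}").render(**doc))
# -> 1   (index of the gold answer choice, later cast to an int)
```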
@@ -41,7 +41,6 @@ def parse_args():
     parser.add_argument("--data_sampling", type=float, default=None)
     parser.add_argument("--no_cache", action="store_true")
     parser.add_argument("--decontamination_ngrams_path", default=None)
-    parser.add_argument("--description_dict_path", default=None)
     parser.add_argument("--check_integrity", action="store_true")
     parser.add_argument("--write_out", action="store_true", default=False)
     parser.add_argument("--output_base_path", type=str, default=None)
@@ -78,12 +77,6 @@ def main():
     eval_logger.info(f"Selected Tasks: {task_names}")

-    # TODO: description_dict?
-    # description_dict = {}
-    # if args.description_dict_path:
-    #     with open(args.description_dict_path, "r") as f:
-    #         description_dict = json.load(f)
-
     results = evaluator.simple_evaluate(
         model=args.model,
         model_args=args.model_args,
@@ -94,7 +87,6 @@ def main():
         device=args.device,
         no_cache=args.no_cache,
         limit=args.limit,
-        # description_dict=description_dict,
         decontamination_ngrams_path=args.decontamination_ngrams_path,
         check_integrity=args.check_integrity,
         write_out=args.write_out,
@@ -13,12 +13,10 @@ def parse_args():
     parser = argparse.ArgumentParser()
     parser.add_argument("--output_base_path", required=True)
     parser.add_argument("--tasks", default="all_tasks")
-    parser.add_argument("--provide_description", action="store_true")
     parser.add_argument("--sets", type=str, default="val")  # example: val,test
     parser.add_argument("--num_fewshot", type=int, default=1)
    parser.add_argument("--seed", type=int, default=42)
     parser.add_argument("--num_examples", type=int, default=1)
-    parser.add_argument("--description_dict_path", default=None)
     return parser.parse_args()
@@ -32,11 +30,6 @@ def main():
     task_names = args.tasks.split(",")
     task_dict = tasks.get_task_dict(task_names)

-    # description_dict = {}
-    # if args.description_dict_path:
-    #     with open(args.description_dict_path, "r") as f:
-    #         description_dict = json.load(f)
-
     os.makedirs(args.output_base_path, exist_ok=True)
     for task_name, task in task_dict.items():
         rnd = random.Random()
@@ -55,12 +48,6 @@ def main():
         docs = join_iters(iters)

-        # description = (
-        #     description_dict[task_name]
-        #     if description_dict and task_name in description_dict
-        #     else ""
-        # )
-
         with open(os.path.join(args.output_base_path, task_name), "w") as f:
             for i, doc in (
                 zip(range(args.num_examples), docs)
@@ -72,7 +59,6 @@ def main():
                     doc=doc,
                     num_fewshot=args.num_fewshot,
                     rnd=rnd,
-                    # description=description,
                 )
                 f.write(ctx + "\n")
@@ -3,7 +3,7 @@ import lm_eval.tasks
 import lm_eval.models

-def test_description_dict():
+def test_description():
     seed = 42
     num_examples = 1
     task_names = ["hellaswag", "winogrande"]
@@ -37,6 +37,5 @@ def test_description_dict():
             doc=doc,
             num_fewshot=1,
             rnd=rnd,
-            description=description,
         )
         assert description in ctx
@@ -55,7 +55,6 @@ def test_evaluator(taskname, task_class):
         num_fewshot=0,
         limit=limit,
         bootstrap_iters=10,
-        description_dict=None,
     )
     e2 = evaluator.evaluate(
         lm=lm,
@@ -63,7 +62,6 @@ def test_evaluator(taskname, task_class):
         num_fewshot=0,
         limit=limit,
         bootstrap_iters=10,
-        description_dict=None,
     )
     # check that caching is working