"benchmark/vscode:/vscode.git/clone" did not exist on "4f42c8cd3e651bb4a7cb489015356571ae378e7b"
Commit b155946e authored by lintangsutawika's avatar lintangsutawika
Browse files

Merge branch 'big-refactor' of...

Merge branch 'big-refactor' of https://github.com/EleutherAI/lm-evaluation-harness into better-benchmark
parents 892f40a9 b8d1cef9
@@ -4,6 +4,7 @@ Welcome to the docs for the LM Evaluation Harness!

## Table of Contents

* To learn about the public interface of the library, as well as how to evaluate via the command line or as integrated into an external library, see the [Interface guide](https://github.com/EleutherAI/lm-evaluation-harness/blob/big-refactor/docs/user_guide.md).
* To learn how to add a new library, API, or model type to the library, as well as a quick explainer on the types of ways to evaluate an LM, see the [Model Guide](https://github.com/EleutherAI/lm-evaluation-harness/blob/big-refactor/docs/model_guide.md).
* For a crash course on adding new tasks to the library, see our [New Task Guide](https://github.com/EleutherAI/lm-evaluation-harness/blob/big-refactor/docs/new_task_guide.md).
* To learn more about pushing the limits of task configuration that the Eval Harness supports, see the [Advanced Task Guide](https://github.com/EleutherAI/lm-evaluation-harness/blob/big-refactor/docs/advanced_task_guide.md).
# User Guide
This document details the interface exposed by `lm-eval` and the flags available to users.
## Command-line Interface
A majority of users run the library by cloning it from GitHub and running the `main.py` script.
Equivalently, the library can be run via the `lm-eval` entrypoint at the command line.
This mode supports a number of command-line arguments, the details of which can also be seen by running with `-h` or `--help` (a complete example invocation is shown after the list below):
* `--model` : Selects which model type or provider is evaluated. Must be a string corresponding to the name of the model type/provider being used. See [the main README](https://github.com/EleutherAI/lm-evaluation-harness/tree/big-refactor#commercial-apis) for a full list of enabled model names and supported libraries or APIs.
* `--model_args` : Controls parameters passed to the model constructor. Accepts a string containing comma-separated keyword arguments to the model class of the format `"arg1=val1,arg2=val2,..."`, for example `--model_args pretrained=EleutherAI/pythia-160m,dtype=float32`. For a full list of supported keyword arguments, see the initialization of the `lm_eval.api.model.LM` subclass in question, e.g. [`HFLM`](https://github.com/EleutherAI/lm-evaluation-harness/blob/365fcda9b85bbb6e0572d91976b8daf409164500/lm_eval/models/huggingface.py#L66).
* `--tasks` : Determines which tasks or task groups are evaluated. Accepts a comma-separated list of task names or task group names, all of which must be valid tasks or groups.
* `--num_fewshot` : Sets the number of few-shot examples to place in context. Must be an integer.
* `--batch_size` : Sets the batch size used for evaluation. Can be a positive integer or `"auto"` to automatically select the largest batch size that will fit in memory, speeding up evaluation. One can pass `--batch_size auto:N` to re-select the maximum batch size `N` times during evaluation. This can help accelerate evaluation further, since `lm-eval` sorts documents in descending order of context length.
* `--max_batch_size` : Sets the maximum batch size to try to fit in memory, if `--batch_size auto` is passed.
* `--device` : Sets which device to place the model onto. Must be a string, for example, `"cuda", "cuda:0", "cpu", "mps"`. Defaults to "cuda", and can be ignored if running multi-GPU or running a non-local model type.
* `--output_path` : A string of the form `dir/file.jsonl` or `dir/`. Provides a path where high-level results will be saved, either into the file named or into the directory named. If `--log_samples` is passed as well, then per-document outputs and metrics will be saved into the directory as well.
* `--log_samples` : If this flag is passed, then the model's outputs, and the text fed into the model, will be saved at per-document granularity. Must be used with `--output_path`.
* `--limit` : Accepts an integer, or a float between 0.0 and 1.0. If passed, limits evaluation to the first X documents per task (if an integer) or to the first X% of documents per task (if a float). Useful for debugging, especially on costly API models.
* `--use_cache` : Should be a path prefix at which a SQLite database file can be written. Takes a string of the format `/path/to/sqlite_cache_` in order to create a cache database at `/path/to/sqlite_cache_rank{i}.db` for each process (0 through NUM_GPUS). This caches the results of prior runs, so that a given (model, task) pair need not be re-run in order to be re-scored.
* `--decontamination_ngrams_path` : Deprecated; see [this commit](https://github.com/EleutherAI/lm-evaluation-harness/commit/00209e10f6e27edf5d766145afaf894079b5fe10) or older for a working decontamination-checker tool.
* `--check_integrity` : If this flag is used, the library tests for each task selected are run to confirm task integrity.
* `--write_out` : Used for diagnostic purposes to observe the format of task documents passed to a model. If this flag is used, then prints the prompt and gold target string for the first document of each task.
* `--show_config` : If used, prints the full `lm_eval.api.task.TaskConfig` contents (the non-default settings from the task's YAML file) for each task that was run, at the completion of an evaluation. Useful when one is modifying a task's YAML configuration locally, in order to record the exact configuration used, whether for debugging or for reproducibility purposes.
* `--include_path` : Accepts a path to a folder. If passed, all YAML files in that folder containing `lm-eval`-compatible task configurations will be added to the task registry as available tasks. Useful when one is writing config files for their own tasks in a folder other than `lm_eval/tasks/`.
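Putting several of these flags together, an invocation might look like the sketch below. The task names are placeholders, and the `hf` model type name should be checked against the model list in the main README; substitute whatever model type, tasks, and paths apply to your run:

```bash
lm-eval \
    --model hf \
    --model_args pretrained=EleutherAI/pythia-160m,dtype=float32 \
    --tasks taskname1,taskname2 \
    --num_fewshot 0 \
    --batch_size auto \
    --device cuda:0 \
    --output_path results/ \
    --log_samples
```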
## External Library Usage
We also support using the library's Python API from within model training loops or other external scripts.
`lm_eval` supplies two functions for external import and use: `lm_eval.evaluate()` and `lm_eval.simple_evaluate()`.
`simple_evaluate()` can be used by simply creating an `lm_eval.api.model.LM` subclass that implements the methods described in the [Model Guide](https://github.com/EleutherAI/lm-evaluation-harness/tree/big-refactor/docs/model_guide.md), and wrapping your custom model in that class as follows:
```python
import lm_eval
...
my_model = initialize_my_model()  # create your model (could be running finetuning with some custom modeling code)
...
# Instantiate an LM subclass that takes your initialized model and can run
# `Your_LM.loglikelihood()`, `Your_LM.loglikelihood_rolling()`, `Your_LM.greedy_until()`
lm_obj = Your_LM(model=my_model, batch_size=16)

# call simple_evaluate
results = lm_eval.simple_evaluate(
    model=lm_obj,
    tasks=["taskname1", "taskname2"],
    num_fewshot=0,
    ...
)
```
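For reference, here is a minimal sketch of what such a wrapper might look like. The class name `Your_LM` and all method bodies are illustrative, not part of the library; the required methods and their expected return formats are described in the Model Guide:

```python
from lm_eval.api.model import LM


class Your_LM(LM):
    """A minimal sketch of a custom model wrapper (illustrative only)."""

    def __init__(self, model, batch_size=16):
        super().__init__()
        self.model = model
        self.batch_size = batch_size

    def loglikelihood(self, requests):
        # For each request, return a (log-probability, is-greedy) pair for the
        # target string conditioned on its context, computed with self.model.
        raise NotImplementedError

    def loglikelihood_rolling(self, requests):
        # Log-likelihood of each full string, e.g. for perplexity-style tasks.
        raise NotImplementedError

    def greedy_until(self, requests):
        # Greedily generate from each context until its stop condition is hit.
        raise NotImplementedError
```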
See https://github.com/EleutherAI/lm-evaluation-harness/blob/365fcda9b85bbb6e0572d91976b8daf409164500/lm_eval/evaluator.py#L35 for a full description of all available arguments. All keyword arguments to `simple_evaluate()` share the same role as the command-line flags described previously.
Additionally, the `evaluate()` function offers the core evaluation functionality provided by the library, but without some of the special handling, simplification, and abstraction provided by `simple_evaluate()`.
See https://github.com/EleutherAI/lm-evaluation-harness/blob/365fcda9b85bbb6e0572d91976b8daf409164500/lm_eval/evaluator.py#L173 for more details.
As a brief example usage of `evaluate()`:
```python
import lm_eval
from my_tasks import MyTask1  # suppose you've defined a custom lm_eval.api.Task subclass in your own external codebase
...
my_model = initialize_my_model()  # create your model (could be running finetuning with some custom modeling code)
...
# Instantiate an LM subclass wrapping your initialized model, as before.
lm_obj = Your_LM(model=my_model, batch_size=16)

results = lm_eval.evaluate(
    lm=lm_obj,
    task_dict={"mytask1": MyTask1()},  # map task names to instantiated Task objects
    ...
)
```
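Note that `evaluate()` expects `task_dict` to map task names to instantiated `Task` objects (see the docstring excerpt below). For tasks that are already registered with the harness, the `lm_eval.tasks.get_task_dict()` helper can build such a dictionary from a list of task names; check the version you have installed for its exact signature.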
from .evaluator import evaluate, simple_evaluate

@@ -186,7 +186,7 @@ def evaluate(
    :param lm: obj
        Language Model
    :param task_dict: dict[str, Task]
-        Dictionary of tasks. Tasks will be taken to have name task.EVAL_HARNESS_NAME if defined and type(task).__name__ otherwise.
+        Dictionary of tasks. Tasks will be taken to have name type(task).config.task.
    :param limit: int, optional
        Limit the number of examples per task (only use this for testing)
    :param bootstrap_iters:
@@ -21,3 +21,9 @@ metric_list:
  higher_is_better: true
  ignore_case: true
  ignore_punctuation: true
filter_list:
  - name: "get-answer"
    filter:
      - function: "regex"
        regex_pattern: "The answer is (\\-?[0-9\\.\\,]+)"
      - function: "take_first"
# Generated by utils.py
dataset_name: bn
doc_to_target: '{% if answer is not none %}{{answer[20+1:]}}{% else %}{{answer_number|string}}{% endif %}'
doc_to_text: '{% if answer is not none %}{{question+"\nStep-by-Step Answer:"}}{% else %}{{"প্রশ্ন: "+question+"\nStep-by-Step Answer:"}}{% endif %}'
include: cot_yaml
task: mgsm_bn_direct
# Generated by utils.py
dataset_name: de
doc_to_target: '{% if answer is not none %}{{answer[20+1:]}}{% else %}{{answer_number|string}}{% endif %}'
doc_to_text: '{% if answer is not none %}{{question+"\nStep-by-Step Answer:"}}{% else %}{{"Frage: "+question+"\nStep-by-Step Answer:"}}{% endif %}'
include: cot_yaml
task: mgsm_de_direct
# Generated by utils.py
dataset_name: en
doc_to_target: '{% if answer is not none %}{{answer[20+1:]}}{% else %}{{answer_number|string}}{% endif %}'
doc_to_text: '{% if answer is not none %}{{question+"\nStep-by-Step Answer:"}}{% else %}{{"Question: "+question+"\nStep-by-Step Answer:"}}{% endif %}'
-filter_list:
-  - name: get-answer
-    filter:
-      - function: regex
-        regex_pattern: The answer is (\-?[0-9\.\,]+)
-      - function: take_first
include: cot_yaml
task: mgsm_en_direct
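A note on the `doc_to_target` templates in these generated configs: the integer in `answer[20+1:]` matches the length of the language's answer prefix plus one, so the slice appears intended to strip the prefix and the separator character that follows it. A quick Python check of that arithmetic for the English config (the answer string below is made up):

```python
prefix = "Step-by-Step Answer:"
assert len(prefix) == 20  # matches the 20 in answer[20+1:]

# A made-up answer field in the dataset's format: prefix, separator, solution.
answer = "Step-by-Step Answer:\nRoger starts with 5 balls. The answer is 11."
print(answer[20 + 1:])  # -> "Roger starts with 5 balls. The answer is 11."
```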
# Generated by utils.py
dataset_name: es
doc_to_target: '{% if answer is not none %}{{answer[20+1:]}}{% else %}{{answer_number|string}}{% endif %}'
doc_to_text: '{% if answer is not none %}{{question+"\nStep-by-Step Answer:"}}{% else %}{{"Pregunta: "+question+"\nStep-by-Step Answer:"}}{% endif %}'
include: cot_yaml
task: mgsm_es_direct
# Generated by utils.py
dataset_name: fr
doc_to_target: '{% if answer is not none %}{{answer[20+1:]}}{% else %}{{answer_number|string}}{% endif %}'
doc_to_text: '{% if answer is not none %}{{question+"\nStep-by-Step Answer:"}}{% else %}{{"Question : "+question+"\nStep-by-Step Answer:"}}{% endif %}'
include: cot_yaml
task: mgsm_fr_direct
# Generated by utils.py
dataset_name: ja
doc_to_target: '{% if answer is not none %}{{answer[20+1:]}}{% else %}{{answer_number|string}}{% endif %}'
doc_to_text: '{% if answer is not none %}{{question+"\nStep-by-Step Answer:"}}{% else %}{{"問題: "+question+"\nStep-by-Step Answer:"}}{% endif %}'
include: cot_yaml
task: mgsm_ja_direct
# Generated by utils.py
dataset_name: ru
doc_to_target: '{% if answer is not none %}{{answer[20+1:]}}{% else %}{{answer_number|string}}{% endif %}'
doc_to_text: '{% if answer is not none %}{{question+"\nStep-by-Step Answer:"}}{% else %}{{"Задача: "+question+"\nStep-by-Step Answer:"}}{% endif %}'
include: cot_yaml
task: mgsm_ru_direct
# Generated by utils.py
dataset_name: sw
doc_to_target: '{% if answer is not none %}{{answer[20+1:]}}{% else %}{{answer_number|string}}{% endif %}'
doc_to_text: '{% if answer is not none %}{{question+"\nStep-by-Step Answer:"}}{% else %}{{"Swali: "+question+"\nStep-by-Step Answer:"}}{% endif %}'
include: cot_yaml
task: mgsm_sw_direct
# Generated by utils.py
dataset_name: te
doc_to_target: '{% if answer is not none %}{{answer[20+1:]}}{% else %}{{answer_number|string}}{% endif %}'
doc_to_text: '{% if answer is not none %}{{question+"\nStep-by-Step Answer:"}}{% else %}{{"ప్రశ్న: "+question+"\nStep-by-Step Answer:"}}{% endif %}'
include: cot_yaml
task: mgsm_te_direct
# Generated by utils.py
dataset_name: th
doc_to_target: '{% if answer is not none %}{{answer[20+1:]}}{% else %}{{answer_number|string}}{% endif %}'
doc_to_text: '{% if answer is not none %}{{question+"\nStep-by-Step Answer:"}}{% else %}{{"โจทย์: "+question+"\nStep-by-Step Answer:"}}{% endif %}'
include: cot_yaml
task: mgsm_th_direct
# Generated by utils.py
dataset_name: zh
doc_to_target: '{% if answer is not none %}{{answer[20+1:]}}{% else %}{{answer_number|string}}{% endif %}'
doc_to_text: '{% if answer is not none %}{{question+"\nStep-by-Step Answer:"}}{% else %}{{"问题: "+question+"\nStep-by-Step Answer:"}}{% endif %}'
include: cot_yaml
task: mgsm_zh_direct
# Generated by utils.py
dataset_name: bn
doc_to_target: '{% if answer is not none %}{{answer[16+1:]}}{% else %}{{answer_number|string}}{% endif %}'
doc_to_text: '{% if answer is not none %}{{question+"\nধাপে ধাপে উত্তর:"}}{% else %}{{"প্রশ্ন: "+question+"\nধাপে ধাপে উত্তর:"}}{% endif %}'
filter_list:
  - name: get-answer
    filter:
      - function: regex
        regex_pattern: The answer is (\-?[0-9\.\,]+)
      - function: take_first
include: cot_yaml
task: mgsm_bn_direct
# Generated by utils.py
dataset_name: de
doc_to_target: '{% if answer is not none %}{{answer[28+1:]}}{% else %}{{answer_number|string}}{% endif %}'
doc_to_text: '{% if answer is not none %}{{question+"\nSchritt-für-Schritt-Antwort:"}}{% else %}{{"Frage: "+question+"\nSchritt-für-Schritt-Antwort:"}}{% endif %}'
filter_list:
  - name: get-answer
    filter:
      - function: regex
        regex_pattern: The answer is (\-?[0-9\.\,]+)
      - function: take_first
include: cot_yaml
task: mgsm_de_direct
# Generated by utils.py
dataset_name: es
doc_to_target: '{% if answer is not none %}{{answer[22+1:]}}{% else %}{{answer_number|string}}{% endif %}'
doc_to_text: '{% if answer is not none %}{{question+"\nRespuesta paso a paso:"}}{% else %}{{"Pregunta: "+question+"\nRespuesta paso a paso:"}}{% endif %}'
filter_list:
  - name: get-answer
    filter:
      - function: regex
        regex_pattern: The answer is (\-?[0-9\.\,]+)
      - function: take_first
include: cot_yaml
task: mgsm_es_direct
# Generated by utils.py
dataset_name: fr
doc_to_target: '{% if answer is not none %}{{answer[25+1:]}}{% else %}{{answer_number|string}}{% endif %}'
doc_to_text: '{% if answer is not none %}{{question+"\nRéponse étape par étape :"}}{% else %}{{"Question : "+question+"\nRéponse étape par étape :"}}{% endif %}'
filter_list:
  - name: get-answer
    filter:
      - function: regex
        regex_pattern: The answer is (\-?[0-9\.\,]+)
      - function: take_first
include: cot_yaml
task: mgsm_fr_direct