@@ -4,7 +4,7 @@ Welcome to the docs for the LM Evaluation Harness!
## Table of Contents
* To learn about the public interface of the library, as well as how to evaluate via the command line or as integrated into an external library, see the [Interface](./interface.md)
* To learn how to add a new model type, API, or inference library, as well as a quick explainer on the different ways to evaluate an LM, see the [Model Guide](./model_guide.md).
* For a crash course on adding new tasks to the library, see our [New Task Guide](./new_task_guide.md).
* To learn more about pushing the limits of task configuration that the Eval Harness supports, see the [Task Configuration Guide](./task_guide.md).
@@ -42,7 +42,7 @@ This mode supports a number of command-line arguments, the details of which can
- `--show_config` : If used, prints the full `lm_eval.api.task.TaskConfig` contents (non-default settings in the task's YAML file) for each task that was run, at the completion of an evaluation. Useful when modifying a task's configuration YAML locally, to record the exact configuration used for debugging or reproducibility purposes.
- `--include_path` : Accepts a path to a folder. If passed, then all YAML files containing `lm-eval` compatible task configurations will be added to the task registry as available tasks. Used when writing config files for your own task in a folder other than `lm_eval/tasks/`.
- `--system_instruction`: Specifies a system instruction string to prepend to the prompt.
...
...
@@ -54,7 +54,14 @@ This mode supports a number of command-line arguments, the details of which can
* `--seed`: Set the seed for Python's `random`, `numpy`, and `torch`. Accepts a comma-separated list of three values for the Python `random`, `numpy`, and `torch` seeds, respectively, or a single integer to set the same seed for all three. Each value is either an integer or `None` to leave that seed unset. Default is `0,1234,1234` (for backward compatibility). E.g. `--seed 0,None,8` sets `random.seed(0)` and `torch.manual_seed(8)`; numpy's seed is not set since the second value is `None`. E.g., `--seed 42` sets all three seeds to 42 (see the sketch after this argument list).
* `--wandb_args`: Tracks logging to Weights and Biases for evaluation runs and includes args passed to `wandb.init`, such as `project` and `job_type`. Full list [here](https://docs.wandb.ai/ref/python/init). E.g., `--wandb_args project=test-project,name=test-run`
* `--hf_hub_log_args` : Logs evaluation results to the Hugging Face Hub. Accepts a string with the arguments separated by commas. Available arguments:
    * `hub_results_org` - organization name on the Hugging Face Hub, e.g., `EleutherAI`,
    * `hub_repo_name` - repository name on the Hugging Face Hub, e.g., `lm-eval-results`,
    * `push_results_to_hub` - whether to push results to the Hugging Face Hub, can be `True` or `False`,
    * `push_samples_to_hub` - whether to push sample results to the Hugging Face Hub, can be `True` or `False`. Requires `--log_samples` to be set,
    * `public_repo` - whether the repository is public, can be `True` or `False`,
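
To make the seed behavior concrete, here is a small illustrative sketch of what the two `--seed` examples above resolve to (the harness performs this internally; the sketch is only for clarity):

```python
import random

import numpy as np
import torch

# `--seed 0,None,8` resolves to:
random.seed(0)        # first value seeds Python's `random`
# numpy's seed is left unset because the second value is None
torch.manual_seed(8)  # third value seeds torch

# `--seed 42` resolves to the same seed for all three:
random.seed(42)
np.random.seed(42)
torch.manual_seed(42)
```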
@@ -6,7 +6,7 @@ In order to properly evaluate a given LM, we require implementation of a wrapper
## Setup
To get started contributing, go ahead and fork the main repo, clone it, create a branch with the name of your model, and install the project requirements in your environment:
**Important**: we now add `target_delimiter` between input and target, which defaults to `" "`, such that the full input-output string is `doc_to_text(doc) + target_delimiter + doc_to_target(doc)`. `doc_to_text` should not contain trailing whitespace on the right, and `doc_to_target` should not contain leading whitespace on the left.
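
To illustrate, here is a minimal sketch with a hypothetical document and conversion functions (the field names and functions below are illustrative, not taken from a real task):

```python
# Hypothetical example document, for illustration only.
doc = {"question": "What is the capital of France?", "answer": "Paris"}

def doc_to_text(doc):
    # Input side: no trailing whitespace on the right.
    return f"Question: {doc['question']}\nAnswer:"

def doc_to_target(doc):
    # Target side: no leading whitespace on the left.
    return doc["answer"]

target_delimiter = " "  # the default

full_string = doc_to_text(doc) + target_delimiter + doc_to_target(doc)
# -> "Question: What is the capital of France?\nAnswer: Paris"
```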
#### Multiple choice format
...
...
@@ -366,9 +366,7 @@ task:
## Beautifying Table Display
To avoid conflicts, each task needs to be registered with a unique name. Because of this, slight variations of a task still count as unique tasks and need to be named uniquely. This can be done by appending a qualifier that refers to the variation, as in MMLU, where the tasks evaluated with the Flan prompt template are distinguished from the default by the prefix `mmlu_flan_*`. Printing the full task names can easily clutter the results table at the end of an evaluation, especially when you have a long list of tasks or are using a benchmark that comprises many tasks. To make the table more legible, you can use `task_alias` and `group_alias` to provide an alternative task name and group name to be printed. For example, in `mmlu_abstract_algebra.yaml` we set `group_alias` to `stem` and `task_alias` to `abstract_algebra`; a configuration sketch combining these aliases with other common fields follows the field list below.
- **use_prompt** (`str`, *optional*) — Name of the prompt in promptsource to use. If defined, will overwrite `doc_to_text`, `doc_to_target`, and `doc_to_choice`.
- **description** (`str`, *optional*) — An optional Jinja2 template or string which will be prepended to the few-shot examples passed into the model, often describing the task or providing instructions to the model, such as `"The following are questions (with answers) about {{subject}}.\n\n"`. No delimiters or spacing are inserted between the description and the first few-shot example.
- **doc_to_text** (`Union[Callable, str]`, *optional*) — Jinja2 template, string, or function to process a sample into the appropriate input for the model.
- **doc_to_target** (`Union[Callable, str]`, *optional*) — Jinja2 template, string, or function to process a sample into the appropriate target output for the model. For multiple choice tasks, this should return an index into the answer choice list of the correct answer.
- **doc_to_choice** (`Union[Callable, str]`, *optional*) — Jinja2 template, string, or function to process a sample into a list of possible string choices for `multiple_choice` tasks. Left undefined for `generate_until` tasks.
- **fewshot_delimiter** (`str`, *optional*, defaults to `"\n\n"`) — String to insert between few-shot examples.
- **target_delimiter** (`str`, *optional*, defaults to `" "`) — String to insert between input and target output for the datapoint being tested.
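
As a sketch of how several of these fields (together with the aliases from the Beautifying Table Display section above) might be combined, here is a hypothetical multiple-choice configuration expressed as a Python dict whose keys mirror the documented fields; the values and the column names inside the templates are purely illustrative, and in practice these settings would live in a task YAML file:

```python
# Illustrative only: a hypothetical multiple-choice task configuration.
# Keys mirror the fields documented above; dataset column names such as
# `question`, `choices`, and `answer_index` are placeholders.
example_task_config = {
    "description": "The following are questions (with answers) about {{subject}}.\n\n",
    "doc_to_text": "{{question}}\nAnswer:",   # Jinja2 template for the model input
    "doc_to_choice": "{{choices}}",           # yields the list of candidate answer strings
    "doc_to_target": "{{answer_index}}",      # index of the correct answer in the choice list
    "fewshot_delimiter": "\n\n",              # default, shown for clarity
    "target_delimiter": " ",                  # default, shown for clarity
    "group_alias": "stem",                    # group name printed in the results table
    "task_alias": "abstract_algebra",         # task name printed in the results table
}
```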
# load tokenizer so we know tokenizer vocabulary size before loading model and PEFT
self._create_tokenizer(
    pretrained,
    tokenizer,
    revision=revision,
    trust_remote_code=trust_remote_code,
    use_fast_tokenizer=use_fast_tokenizer,
)

# if we passed `pretrained` as a string, initialize our model now
if isinstance(pretrained, str):
    self._create_model(
...
...
@@ -235,14 +244,6 @@ class HFLM(TemplateLM):
"Failed to place model onto specified device. This may be because the model is quantized via `bitsandbytes` or `device_map` is provided. If the desired GPU is being used, this message is safe to ignore."
eval_logger.info(f"Model config indicates vocab_size='{self._model.config.vocab_size}', but found tokenizer with vocab size '{len(self.tokenizer)}'. Resizing model embedding layer...")
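
The vocabulary-size reconciliation that the message above describes boils down to the standard `transformers` embedding-resize call. A minimal sketch (not the harness's exact code), assuming a `transformers` model and tokenizer:

```python
from transformers import PreTrainedModel, PreTrainedTokenizerBase

def sync_vocab_size(model: PreTrainedModel, tokenizer: PreTrainedTokenizerBase) -> None:
    # If the model config reports a different vocabulary size than the tokenizer,
    # resize the model's token embedding matrix so lookups for added tokens succeed.
    if model.config.vocab_size != len(tokenizer):
        model.resize_token_embeddings(len(tokenizer))
```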
Title: `COPAL-ID: Indonesian Language Reasoning with Local Culture and Nuances`
Abstract: `https://arxiv.org/abs/2311.01012`
`COPAL-ID is an Indonesian causal commonsense reasoning dataset that captures local nuances. It provides a more natural portrayal of day-to-day causal reasoning within the Indonesian (especially Jakartan) cultural sphere. Professionally written and validated from scratch by natives, COPAL-ID is more fluent and free from awkward phrases, unlike the translated XCOPA-ID.`
Homepage: `https://github.com/haryoa/copal-id`
### Citation
```
@article{wibowo2023copal,
title={COPAL-ID: Indonesian Language Reasoning with Local Culture and Nuances},
author={Wibowo, Haryo Akbarianto and Fuadi, Erland Hilman and Nityasya, Made Nindyatama and Prasojo, Radityo Eko and Aji, Alham Fikri},
journal={arXiv preprint arXiv:2311.01012},
year={2023}
}
```
### Groups and Tasks
#### Groups
* `copal_id`
#### Tasks
* `copal_id_standard`: `Standard version of the COPAL dataset; uses formal language and fewer local nuances`
* `copal_id_colloquial`: `Colloquial version of the COPAL dataset; uses informal language and more local nuances`
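
A minimal usage sketch for evaluating the `copal_id` group programmatically, assuming the Python interface described in the Interface guide (`lm_eval.simple_evaluate` with `model`, `model_args`, and `tasks`); the checkpoint name is a placeholder:

```python
import lm_eval

# Illustrative only: evaluate the `copal_id` group with a Hugging Face model.
# "your-org/your-model" is a placeholder checkpoint name.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=your-org/your-model",
    tasks=["copal_id"],
)
print(results["results"])
```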
### Checklist
For adding novel benchmarks/datasets to the library:
* [x] Is the task an existing benchmark in the literature?
* [x] Have you referenced the original paper that introduced the task?
* [x] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
If other tasks on this dataset are already supported:
* [ ] Is the "Main" variant of this task clearly denoted?
* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?