Unverified commit 3dcde17b authored by Lintang Sutawika, committed by GitHub

Merge branch 'big-refactor' into fix-gold-aliases

parents b8cb513b 6a000adb
# Eval Harness Documentation
Welcome to the docs for the LM Evaluation Harness!
## Table of Contents
* To learn how to add a new model API or implementation to the library, along with a quick explainer of the different ways an LM can be evaluated, see the [Model Guide](https://github.com/EleutherAI/lm-evaluation-harness/blob/big-refactor/docs/model_guide.md).
* For a crash course on adding new tasks to the library, see our [New Task Guide](https://github.com/EleutherAI/lm-evaluation-harness/blob/big-refactor/docs/new_task_guide.md).
* To learn more about pushing the limits of task configuration that the Eval Harness supports, see the [Advanced Task Guide](https://github.com/EleutherAI/lm-evaluation-harness/blob/big-refactor/docs/advanced_task_guide.md).
## Progress on Revamp
Tracking progress on revamping documentation pages for the refactor of LM-Evaluation-Harness.
### Desired Pages
* [ ] YAML explainer
* [ ] Explainer on filters + advanced features
# New Model Guide
The `lm-evaluation-harness` is intended to be a model-agnostic framework for evaluating language models. We provide first-class support for HuggingFace `AutoModelForCausalLM` and `AutoModelForSeq2SeqLM` models, but other model backends and APIs can be supported by writing a thin wrapper class, as described below.
This guide may be of special interest to users who use the library outside of this repository, e.g. by installing it from PyPI and calling `lm_eval.evaluator.evaluate()` to evaluate an existing model.
In order to properly evaluate a given LM, we require a wrapper class subclassing `lm_eval.api.model.LM` that defines how the Evaluation Harness should interface with your model. This guide walks through how to write this `LM` subclass and add it to the library!
## Setup
To get started contributing, fork the main repo, clone it, create a branch for your model, and install the project requirements in your environment:
```sh
# After forking...
git clone https://github.com/<YOUR-USERNAME>/lm-evaluation-harness.git
cd lm-evaluation-harness
git checkout big-refactor
git checkout -b <model-type>
pip install -e ".[dev]"
```
Now, we'll create a new file where we'll be adding our model:
```sh
touch lm_eval/models/<my_model_filename>.py
```
**Tip: this filename should not shadow package names! For example, naming your file `anthropic.py` is disallowed since the API's name on pypi is `anthropic`, but naming it `anthropic_llms.py` works with no problems.**
## Interface
All models must subclass the `lm_eval.api.model.LM` class.
The LM class enforces a common interface via which we can extract responses from a model:
```python
class MyCustomLM(LM):
    # ...
    def loglikelihood(self, requests):
        ...

    def loglikelihood_rolling(self, requests):
        ...

    def greedy_until(self, requests):
        ...
    # ...
```
We support three types of requests. Because all of them operate on strings rather than token IDs, the interface is tokenizer-agnostic: the harness makes no assumptions about how your model tokenizes its inputs. Each method receives a list of requests and must return one result per request, in the same order.

The three request types are:
- `greedy_until`: each request supplies an input string and stopping criteria (for example, stop sequences and/or a cap on generation length); the model returns the text produced by greedy decoding from that input.
- `loglikelihood`: each request supplies an input string and a target string; the model returns the log probability of the target conditioned on the input, along with a boolean indicating whether the target would also be produced by greedy decoding.
- `loglikelihood_rolling`: each request supplies a single string; the model returns the total log probability of that string. This is used for perplexity-style evaluations such as WikiText.
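To make the interface concrete, here is a minimal, illustrative sketch of an `LM` subclass. It is not a real model: it answers every request with a constant value, and the request/return shapes noted in the comments are assumptions based on the descriptions above, so consult `lm_eval.api.model.LM` for the authoritative signatures.

```python
from lm_eval.api.model import LM


class DummyConstantLM(LM):
    """Toy example only: returns fixed values for every request,
    purely to illustrate the shape of the three required methods."""

    def loglikelihood(self, requests):
        # Assumed shape: one (log-probability, is_greedy) pair per request.
        return [(-1.0, False) for _ in requests]

    def loglikelihood_rolling(self, requests):
        # Assumed shape: one total log-probability per request.
        return [-10.0 for _ in requests]

    def greedy_until(self, requests):
        # Assumed shape: one generated string per request.
        return ["dummy continuation" for _ in requests]
```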
## Registration
Congrats on implementing your model! Now it's time to test it out.
To make your model usable via the `lm-eval` command line interface (`main.py`), you'll need to tell `lm-eval` what your model's name is.
This is done via a *decorator*, `lm_eval.api.registry.register_model`. Using `register_model()`, you both alert `lm-eval` to your model's existence and tell the package which name(s) can be used to invoke it with `python main.py --model <name>`.
```python
from lm_eval.api.registry import register_model

@register_model("<name1>", "<name2>")
class MyCustomLM(LM):
    ...
```
Using this decorator results in the class being added to an accounting of the usable LM types maintained internally to the library at `lm_eval.api.registry.MODEL_REGISTRY`. See `lm_eval.api.registry` for more detail on what sorts of registries and decorators exist in the library!
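For example, if the toy model above were registered under the name `dummy-constant`, it could then be selected from the command line like any built-in model (the `--tasks` value here is just one of the tasks tracked in this refactor):

```sh
python main.py --model dummy-constant --tasks openbookqa
```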
## Other
**Pro tip**: So that the Evaluation Harness *overestimates* total runtime rather than underestimating it, HuggingFace models process data points in *descending order by total input length*, using `lm_eval.utils.Reorderer`. Take a look at `lm_eval.models.hf_causal.HFLM` to see how this is done, and see if you can implement it in your own model!
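A rough sketch of the idea (not the library's actual `Reorderer` implementation; it assumes each request is a tuple of strings and that `score` is a stand-in for whatever your model does per request):

```python
def run_longest_first(requests, score):
    # Process requests in descending order of total input length, then
    # return the results in the caller's original order.
    order = sorted(
        range(len(requests)),
        key=lambda i: sum(len(part) for part in requests[i]),
        reverse=True,
    )
    results = [None] * len(requests)
    for i in order:  # longest inputs run first, so time estimates start pessimistic
        results[i] = score(requests[i])
    return results
```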
## Conclusion
After reading this guide, you should be able to add new model APIs or implementations to the Eval Harness library!
@@ -63,10 +63,11 @@ class TaskConfig(dict):
     fewshot_split: str = None  # TODO: assert that this not None if num_fewshot > 0. (?) assert if this is same split as one evaling (?)
     template_aliases: str = None
+    aliases: Union[str, list] = None
     doc_to_text: Union[Callable, str] = None
     doc_to_target: Union[Callable, str] = None
     use_prompt: str = None
+    delimiter: str = "\n\n"
+    description: str = ""
     num_fewshot: int = 0
     batch_size: int = 1
@@ -434,35 +435,12 @@ class Task(abc.ABC):
         ), "A `random.Random` generator argument must be provided to `rnd`"

         if num_fewshot == 0:
-            labeled_examples = ""
+            # always prepend the (possibly empty) task description
+            labeled_examples = self._config.description
         else:
-            labeled_examples = self.sampler.get_context(doc, num_fewshot)
-            # for sets with no training docs, draw from other set *but ensure no overlap with current doc*
-            # if self.has_training_docs():
-            #     fewshotex = self.fewshot_examples(k=num_fewshot, rnd=rnd)
-            # else:
-            #     if self._fewshot_docs is None:
-            #         self._fewshot_docs = list(
-            #             self.validation_docs()
-            #             if self.has_validation_docs()
-            #             else self.test_docs()
-            #         )
-            #     fewshotex = rnd.sample(self._fewshot_docs, num_fewshot + 1)
-            #     # get rid of the doc that's the one we're evaluating, if it's in the fewshot
-            #     fewshotex = [x for x in fewshotex if x != doc][:num_fewshot]
-            #     labeled_examples = (
-            #         "\n\n".join(
-            #             [
-            #                 self.doc_to_text(doc) + self.doc_to_target(doc)
-            #                 for doc in fewshotex
-            #             ]
-            #         )
-            #         + "\n\n"
-            #     )
+            labeled_examples = self._config.description + self.sampler.get_context(
+                doc, num_fewshot
+            )

         example = self.doc_to_text(doc)
         return labeled_examples + example
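The net effect of the change above is that the task's (possibly empty) `description` from its config is now always prepended to the few-shot context. Roughly, with hypothetical strings that are not taken from any real task:

```python
# Schematic of what fewshot_context() now assembles (illustrative values only):
description = "The following are questions about elementary science.\n\n"  # self._config.description
fewshot_block = "Question: ...\nAnswer: ...\n\n"                           # self.sampler.get_context(doc, num_fewshot)
example = "Question: Which material conducts heat best?\nAnswer:"          # self.doc_to_text(doc)

prompt = description + fewshot_block + example  # returned value (just description + example when num_fewshot == 0)
```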
@@ -23,7 +23,7 @@ Boxes should be checked iff tasks are implemented in the refactor and tested for
 - [ ] LogiQA
 - [ ] HellaSwag
 - [ ] SWAG
-- [ ] OpenBookQA
+- [x] OpenBookQA
 - [ ] SQuADv2
 - [ ] RACE
 - [ ] HeadQA
group:
- multiple_choice
task: openbookqa
dataset_path: openbookqa
dataset_name: main
output_type: multiple_choice
training_split: train
validation_split: validation
test_split: test
template_aliases: "{% set answer_choices = choices['text'] %}{% set gold = choices.label.index(answerKey.lstrip()) %}" # set the list of possible answer choices, and set what this doc's gold answer is (set what ds column used, and what)
doc_to_text: "{{question_stem}}"
doc_to_target: "{{gold}}" # this will be cast to an int.
should_decontaminate: true
doc_to_decontamination_query: "{{question_stem}}"
metric_list:
- metric: acc
aggregation: mean
higher_is_better: true
- metric: acc_norm
aggregation: mean
higher_is_better: true
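As a sanity check of the templates above, here is a small sketch of how they render with Jinja2 on a hypothetical OpenBookQA-style record. It assumes, as the `template_aliases` comment suggests, that the aliases string is prepended to the other templates before rendering; the document values are made up.

```python
from jinja2 import Environment

# Hypothetical OpenBookQA-style document (made-up values, not from the dataset).
doc = {
    "question_stem": "Which of these materials lets heat pass through most easily?",
    "choices": {"text": ["wool", "steel", "wood", "plastic"], "label": ["A", "B", "C", "D"]},
    "answerKey": "B",
}

env = Environment()
aliases = "{% set answer_choices = choices['text'] %}{% set gold = choices.label.index(answerKey.lstrip()) %}"

print(env.from_string(aliases + "{{question_stem}}").render(**doc))
# -> Which of these materials lets heat pass through most easily?
print(env.from_string(aliases + "{{gold}}").render(**doc))
# -> 1   (index of the gold answer choice, later cast to an int)
```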
@@ -41,7 +41,6 @@ def parse_args():
     parser.add_argument("--data_sampling", type=float, default=None)
     parser.add_argument("--no_cache", action="store_true")
     parser.add_argument("--decontamination_ngrams_path", default=None)
-    parser.add_argument("--description_dict_path", default=None)
     parser.add_argument("--check_integrity", action="store_true")
     parser.add_argument("--write_out", action="store_true", default=False)
     parser.add_argument("--output_base_path", type=str, default=None)
@@ -78,12 +77,6 @@ def main():
     eval_logger.info(f"Selected Tasks: {task_names}")

-    # TODO: description_dict?
-    # description_dict = {}
-    # if args.description_dict_path:
-    #     with open(args.description_dict_path, "r") as f:
-    #         description_dict = json.load(f)
-
     results = evaluator.simple_evaluate(
         model=args.model,
         model_args=args.model_args,
@@ -94,7 +87,6 @@ def main():
         device=args.device,
         no_cache=args.no_cache,
         limit=args.limit,
-        # description_dict=description_dict,
         decontamination_ngrams_path=args.decontamination_ngrams_path,
         check_integrity=args.check_integrity,
         write_out=args.write_out,
@@ -13,12 +13,10 @@ def parse_args():
     parser = argparse.ArgumentParser()
     parser.add_argument("--output_base_path", required=True)
     parser.add_argument("--tasks", default="all_tasks")
-    parser.add_argument("--provide_description", action="store_true")
     parser.add_argument("--sets", type=str, default="val")  # example: val,test
     parser.add_argument("--num_fewshot", type=int, default=1)
    parser.add_argument("--seed", type=int, default=42)
     parser.add_argument("--num_examples", type=int, default=1)
-    parser.add_argument("--description_dict_path", default=None)
     return parser.parse_args()
@@ -32,11 +30,6 @@ def main():
     task_names = args.tasks.split(",")
     task_dict = tasks.get_task_dict(task_names)

-    # description_dict = {}
-    # if args.description_dict_path:
-    #     with open(args.description_dict_path, "r") as f:
-    #         description_dict = json.load(f)
-
     os.makedirs(args.output_base_path, exist_ok=True)
     for task_name, task in task_dict.items():
         rnd = random.Random()
@@ -55,12 +48,6 @@ def main():
         docs = join_iters(iters)

-        # description = (
-        #     description_dict[task_name]
-        #     if description_dict and task_name in description_dict
-        #     else ""
-        # )
-
         with open(os.path.join(args.output_base_path, task_name), "w") as f:
             for i, doc in (
                 zip(range(args.num_examples), docs)
@@ -72,7 +59,6 @@ def main():
                     doc=doc,
                     num_fewshot=args.num_fewshot,
                     rnd=rnd,
-                    # description=description,
                 )
                 f.write(ctx + "\n")
@@ -3,7 +3,7 @@ import lm_eval.tasks
 import lm_eval.models

-def test_description_dict():
+def test_description():
     seed = 42
     num_examples = 1
     task_names = ["hellaswag", "winogrande"]
@@ -37,6 +37,5 @@ def test_description_dict():
             doc=doc,
             num_fewshot=1,
             rnd=rnd,
-            description=description,
         )
         assert description in ctx
@@ -55,7 +55,6 @@ def test_evaluator(taskname, task_class):
         num_fewshot=0,
         limit=limit,
         bootstrap_iters=10,
-        description_dict=None,
     )
     e2 = evaluator.evaluate(
         lm=lm,
@@ -63,7 +62,6 @@ def test_evaluator(taskname, task_class):
         num_fewshot=0,
         limit=limit,
         bootstrap_iters=10,
-        description_dict=None,
     )
     # check that caching is working