This directory contains custom HuggingFace [dataset loading scripts](https://huggingface.co/docs/datasets/dataset_script). They are provided to maintain backward compatibility with the ad-hoc data downloaders used in earlier versions of the `lm-evaluation-harness`, before HuggingFace [`datasets`](https://huggingface.co/docs/datasets/index) was adopted as the default download manager. For example, some datasets in the HuggingFace `datasets` repository process features (e.g. whitespace stripping, lower-casing) in ways that the `lm-evaluation-harness` did not.
__NOTE__: We are __not__ accepting any additional loading scripts into the main branch! If you'd like to use a custom dataset, fork the repo and follow HuggingFace's loading script guide found [here](https://huggingface.co/docs/datasets/dataset_script). You can then override your `Task`'s `DATASET_PATH` attribute to point to that script's local path, as in the sketch below.
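For illustration, a minimal sketch of that override, assuming the pre-refactor `lm_eval.base.Task` interface; the script path and class body here are hypothetical, and a real task would also implement the usual `Task` methods:

```python
# A minimal sketch, not a drop-in task: `my_dataset.py` is a hypothetical
# local loading script; only the DATASET_PATH override is the point.
from lm_eval.base import Task


class MyCustomTask(Task):
    VERSION = 0
    # Point the task at a local loading script instead of a Hub dataset id.
    DATASET_PATH = "path/to/my_dataset.py"
    DATASET_NAME = None
```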
__WARNING__: A handful of loading scripts are included in this collection only because they have not yet been pushed to the HuggingFace Hub or a HuggingFace organization repo. We will remove these scripts once they have been pushed.
{"asdiv":{"description":"ASDiv (Academia Sinica Diverse MWP Dataset) is a diverse (in terms of both language\npatterns and problem types) English math word problem (MWP) corpus for evaluating\nthe capability of various MWP solvers. Existing MWP corpora for studying AI progress\nremain limited either in language usage patterns or in problem types. We thus present\na new English MWP corpus with 2,305 MWPs that cover more text patterns and most problem\ntypes taught in elementary school. Each MWP is annotated with its problem type and grade\nlevel (for indicating the level of difficulty).\n","citation":"@misc{miao2021diverse,\n title={A Diverse Corpus for Evaluating and Developing English Math Word Problem Solvers},\n author={Shen-Yun Miao and Chao-Chun Liang and Keh-Yih Su},\n year={2021},\n eprint={2106.15772},\n archivePrefix={arXiv},\n primaryClass={cs.AI}\n}\n","homepage":"https://github.com/chaochun/nlu-asdiv-dataset","license":"","features":{"body":{"dtype":"string","id":null,"_type":"Value"},"question":{"dtype":"string","id":null,"_type":"Value"},"solution_type":{"dtype":"string","id":null,"_type":"Value"},"answer":{"dtype":"string","id":null,"_type":"Value"},"formula":{"dtype":"string","id":null,"_type":"Value"}},"post_processed":null,"supervised_keys":null,"task_templates":null,"builder_name":"as_div","config_name":"asdiv","version":{"version_str":"0.0.1","description":null,"major":0,"minor":0,"patch":1},"splits":{"validation":{"name":"validation","num_bytes":501489,"num_examples":2305,"dataset_name":"as_div"}},"download_checksums":{"https://github.com/chaochun/nlu-asdiv-dataset/archive/55790e5270bb91ccfa5053194b25732534696b50.zip":{"num_bytes":440966,"checksum":"8f1fe4f6d5f170ec1e24ab78c244153c14c568b1bb2b1dad0324e71f37939a2d"}},"download_size":440966,"post_processing_size":null,"dataset_size":501489,"size_in_bytes":942455}}
"""Instantiate and evaluate a model on a list of tasks.
"""Instantiate and evaluate a model on a list of tasks.
...
@@ -117,10 +116,11 @@ def simple_evaluate(
...
@@ -117,10 +116,11 @@ def simple_evaluate(
task_dict=lm_eval.tasks.get_task_dict(tasks)
task_dict=lm_eval.tasks.get_task_dict(tasks)
fortask_nameintask_dict.keys():
fortask_nameintask_dict.keys():
task_obj=task_dict[task_name]
task_obj=task_dict[task_name]
iftype(task_obj)==tuple:
iftype(task_obj)==tuple:
group,task_obj=task_obj
group,task_obj=task_obj
iftask_objisNone:
continue
config=task_obj._config
config=task_obj._config
ifnum_fewshotisnotNone:
ifnum_fewshotisnotNone:
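For context, `get_task_dict` can map a name either to a bare task object or, when the task was selected through a group, to a `(group, task)` tuple; as the hunks here imply, group placeholder entries can carry `None` in the task slot, which is what the added guard skips. A self-contained sketch, with hypothetical task names and objects:

```python
# Self-contained sketch (hypothetical names/objects) of the two shapes
# the loop above unpacks.
class DummyTask:
    _config = {"num_fewshot": 0}


task_dict = {
    "arc_easy": DummyTask(),                # a bare task
    "math_algebra": ("math", DummyTask()),  # a task selected via the "math" group
    "math": ("math", None),                 # a group placeholder: nothing to run
}

for task_name in task_dict.keys():
    task_obj = task_dict[task_name]
    if type(task_obj) == tuple:
        group, task_obj = task_obj
        if task_obj is None:  # skip pure group entries
            continue
    config = task_obj._config
    print(task_name, config)
```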
```diff
...
@@ -175,17 +175,17 @@ def evaluate(
     lm,
     task_dict,
     limit=None,
-    bootstrap_iters=100000,
+    bootstrap_iters: int = 100000,
     decontamination_ngrams_path=None,
-    write_out=False,
-    log_samples=True,
+    write_out: bool = False,
+    log_samples: bool = True,
 ):
     """Instantiate and evaluate a model on a list of tasks.

     :param lm: obj
         Language Model
     :param task_dict: dict[str, Task]
-        Dictionary of tasks. Tasks will be taken to have name task.EVAL_HARNESS_NAME if defined and type(task).__name__ otherwise.
+        Dictionary of tasks. Tasks will be taken to have name type(task).config.task.
     :param limit: int, optional
         Limit the number of examples per task (only use this for testing)
     :param bootstrap_iters:
```
```diff
...
@@ -210,24 +210,30 @@ def evaluate(
     samples = collections.defaultdict(list)
     # tracks all Instances/requests a model must generate output on.
     requests = collections.defaultdict(list)
-    # Stores task scores based on task grouping.
-    aggregate = collections.defaultdict(dict)
-    # tracks if a task was chosen via user selecting a group containing it
-    task_groups = collections.defaultdict(dict)
+    # Aggregated task scores presented with groups
+    results_agg = collections.defaultdict(dict)
+    # Aggregated groups scores only
+    groups_agg = collections.defaultdict(dict)
     # stores the amount to pad out reqs per req. type so that
     # number of fwd passes per distributed rank is equal
     padding_requests = collections.defaultdict(int)
+    # store the hierarchy to do proper ordering
+    task_hierarchy = collections.defaultdict(list)
+    # Stores group related keys and values for group-aggregation
+    task_groups = collections.defaultdict(dict)
+    # store the ordering of tasks and groups
+    task_order = collections.defaultdict(int)
+    # store the aggregation for aggregating across tasks in the same group
+    sample_agg_fn = collections.defaultdict(dict)

     # get lists of each type of request
     for task_name, task in task_dict.items():
         if type(task) == tuple:
-            group, task = task
-            task_groups[task_name] = group
-            aggregate[task_name] = {}
+            group_name, task = task
+            task_hierarchy[group_name].append(task_name)
+        else:
+            task_hierarchy[task_name] = []
+
+        if task is None:
+            continue

         versions[task_name] = task.VERSION
         configs[task_name] = dict(task.dump_config())
```
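To make the new bookkeeping concrete, a small sketch (hypothetical task names, plain strings standing in for task objects) of what `task_hierarchy` holds after this loop: each group name maps to the list of its member tasks, while standalone tasks map to an empty list, which is what later drives ordered, group-aware aggregation.

```python
import collections

# Sketch (hypothetical names): the structure the loop above builds.
task_hierarchy = collections.defaultdict(list)

task_dict = {
    "math_algebra": ("math", "task-obj-1"),
    "math_geometry": ("math", "task-obj-2"),
    "arc_easy": "task-obj-3",
}

for task_name, task in task_dict.items():
    if type(task) == tuple:
        group_name, task = task
        task_hierarchy[group_name].append(task_name)
    else:
        task_hierarchy[task_name] = []

print(dict(task_hierarchy))
# -> {'math': ['math_algebra', 'math_geometry'], 'arc_easy': []}
```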
```diff
...
@@ -252,7 +258,8 @@ def evaluate(
             # print the prompt for the first few documents
             if inst.doc_id < 1:
                 eval_logger.info(
-                    f"Task: {task_name}; document {inst.doc_id}; context prompt (starting on next line):\n{inst.args[0]}\n(end of prompt on previous line)"
+                    f"Task: {task_name}; document {inst.doc_id}; context prompt (starting on next line):\
+\n{inst.args[0]}\n(end of prompt on previous line)\ntarget string or answer choice index (starting on next line):\n{task.doc_to_target(inst.doc)}\n(end of target on previous line)"
                 )
                 eval_logger.info(f"Request: {str(inst)}")
...
@@ -302,6 +309,8 @@ def evaluate(
     for task_name, task in task_dict.items():
         if type(task) == tuple:
             group, task = task
+            if task is None:
+                continue

         task.apply_filters()

     ### Collect values of metrics on all datapoints ###
...
@@ -311,6 +320,8 @@ def evaluate(
     for task_name, task in task_dict.items():
         if type(task) == tuple:
             group, task = task
+            if task is None:
+                continue

         # TODO: make it possible to use a different metric per filter
...
                     "Failed to place model onto specified device. This may be because the model is quantized via `bitsandbytes`. If the desired GPU is being used, this message is safe to ignore."
                 )
             else:
-                self._model = accelerator.prepare_model(
-                    self.model, evaluation_mode=True
-                )
+                assert accelerator.distributed_type in [
+                    DistributedType.FSDP,
+                    DistributedType.MULTI_GPU,
+                ], "Unsupported distributed type provided. Only DDP and FSDP are supported."
```