Commit a808c661 authored by lintangsutawika

Merge branch 'main' of https://github.com/EleutherAI/lm-evaluation-harness into standardize_metrics

parents 6117c507 42730d90
@@ -17,6 +17,7 @@ repos:
       - id: detect-private-key
       - id: end-of-file-fixer
       - id: no-commit-to-branch
+        always_run: false
       - id: requirements-txt-fixer
       - id: trailing-whitespace
         args: [--markdown-linebreak-ext=md]
...
-* @haileyschoelkopf @lintangsutawika @StellaAthena
+* @haileyschoelkopf @lintangsutawika
@@ -28,7 +28,7 @@ This project provides a unified framework to test generative language models on
 - Over 60 standard academic benchmarks for LLMs, with hundreds of subtasks and variants implemented.
 - Support for models loaded via [transformers](https://github.com/huggingface/transformers/) (including quantization via [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ)), [GPT-NeoX](https://github.com/EleutherAI/gpt-neox), and [Megatron-DeepSpeed](https://github.com/microsoft/Megatron-DeepSpeed/), with a flexible tokenization-agnostic interface.
 - Support for fast and memory-efficient inference with [vLLM](https://github.com/vllm-project/vllm).
-- Support for commercial APIs including [OpenAI](https://openai.com), [goose.ai](https://goose.ai), and [TextSynth](https://textsynth.com/).
+- Support for commercial APIs including [OpenAI](https://openai.com) and [TextSynth](https://textsynth.com/).
 - Support for evaluation on adapters (e.g. LoRA) supported in [HuggingFace's PEFT library](https://github.com/huggingface/peft).
 - Support for local models and benchmarks.
 - Evaluation with publicly available prompts ensures reproducibility and comparability between papers.
@@ -159,11 +159,10 @@ Note that for externally hosted models, configs such as `--device` and `--batch_
 | API or Inference Server | Implemented? | `--model <xxx>` name | Models supported: | Request Types: |
 |-------------------------|--------------|----------------------|-------------------|----------------|
-| OpenAI Completions | :heavy_check_mark: | `openai`, `openai-completions`, `gooseai` | up to `code-davinci-002` | `generate_until`, `loglikelihood`, `loglikelihood_rolling` |
+| OpenAI Completions | :heavy_check_mark: | `openai-completions` | up to `code-davinci-002` | `generate_until`, `loglikelihood`, `loglikelihood_rolling` |
 | OpenAI ChatCompletions | :x: Not yet - needs testing! | N/A | [All ChatCompletions API models](https://platform.openai.com/docs/guides/gpt) | `generate_until` (no logprobs) |
 | Anthropic | :heavy_check_mark: | `anthropic` | [Supported Anthropic Engines](https://docs.anthropic.com/claude/reference/selecting-a-model) | `generate_until` (no logprobs) |
-| GooseAI | :heavy_check_mark: (not separately maintained) | `openai`, `openai-completions`, `gooseai` (same interface as OpenAI Completions) | | `generate_until`, `loglikelihood`, `loglikelihood_rolling` |
 | Textsynth | :heavy_check_mark: | `textsynth` | [All supported engines](https://textsynth.com/documentation.html#engines) | `generate_until`, `loglikelihood`, `loglikelihood_rolling` |
 | Cohere | [:hourglass: - blocked on Cohere API bug](https://github.com/EleutherAI/lm-evaluation-harness/pull/395) | N/A | [All `cohere.generate()` engines](https://docs.cohere.com/docs/models) | `generate_until`, `loglikelihood`, `loglikelihood_rolling` |
 | [Llama.cpp](https://github.com/ggerganov/llama.cpp) (via [llama-cpp-python](https://github.com/abetlen/llama-cpp-python)) | :heavy_check_mark: | `gguf`, `ggml` | [All models supported by llama.cpp](https://github.com/ggerganov/llama.cpp) | `generate_until`, `loglikelihood`, `loglikelihood_rolling` |
...
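For orientation, the `--model` names in this table are what gets passed on the command line (e.g. `lm-eval --model openai-completions --model_args model=davinci --tasks <task>`) or to the library entry point. Below is a minimal sketch of the latter; it assumes `simple_evaluate` is exposed at the package root (otherwise it lives in `lm_eval.evaluator`) and that this backend accepts a `model=` key in `model_args`, so treat the argument names as illustrative rather than authoritative:

```python
import lm_eval

# Sketch only: evaluate an OpenAI Completions engine on a single task.
# Assumes OPENAI_API_KEY is set in the environment; the task name is just an example.
results = lm_eval.simple_evaluate(
    model="openai-completions",   # the `--model <xxx>` name from the table above
    model_args="model=davinci",   # comma-separated key=value pairs, as on the CLI
    tasks=["hellaswag"],
    num_fewshot=0,
    batch_size=1,
)
print(results["results"])
```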
@@ -219,6 +219,49 @@ Aggregation functions:
 * `weighted_perplexity`
 * `bits_per_byte`
 
+### Adding a Multiple Choice Metric
+
+Adding a multiple choice metric has a few steps. To get it working you need to:
+
+1. register a metric function
+2. register an aggregation function
+3. update the `Task` definition to make sure the correct arguments are passed
+
+The default metric and aggregation functions are in `lm_eval/api/metrics.py`, and you can add a function there if it's for general use. The metrics are towards the bottom of the file and look like this:
+
+```python
+@register_metric(
+    metric="mcc",
+    higher_is_better=True,
+    output_type="multiple_choice",
+    aggregation="matthews_corrcoef",
+)
+def mcc_fn(items):  # This is a passthrough function
+    return items
+```
+
+Note that many of these are passthrough functions; for multiple choice (at least), the metric function itself is never actually called.
+
+Aggregation functions are defined towards the top of the file. Here's an example:
+
+```python
+@register_aggregation("matthews_corrcoef")
+def matthews_corrcoef(items):
+    unzipped_list = list(zip(*items))
+    golds = unzipped_list[0]
+    preds = unzipped_list[1]
+    return sklearn.metrics.matthews_corrcoef(golds, preds)
+```
+
+This function returns a single numeric value. Its input is defined in `Task.process_results` in `lm_eval/api/task.py`, in a section that looks like this:
+
+```python
+result_dict = {
+    **({"acc": acc} if "acc" in use_metric else {}),
+    **({"f1": (gold, pred)} if "f1" in use_metric else {}),
+    **({"mcc": (gold, pred)} if "mcc" in use_metric else {}),
+    **({"acc_norm": acc_norm} if "acc_norm" in use_metric else {}),
+    **({"exact_match": exact_match} if "exact_match" in use_metric else {}),
+}
+```
+
+The value here determines the input to the aggregation function, while the key must match the name of the metric function. The metrics above have simple needs (the accuracy, or the gold and predicted values), but immediately below this section there are examples of metrics with more complicated inputs that you can use as reference.
+
 ## Good Reference Tasks
...
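Putting the three steps above together, a brand-new multiple choice metric could be wired up roughly as follows. This is a sketch, not harness code: the metric name `balanced_acc`, the use of scikit-learn's `balanced_accuracy_score`, and the import path for the decorators are all assumptions; follow the existing entries in `lm_eval/api/metrics.py` for the authoritative pattern.

```python
import sklearn.metrics

# Assumed import path; these are the same decorators used by the existing
# metrics in lm_eval/api/metrics.py.
from lm_eval.api.registry import register_aggregation, register_metric


# Step 2: the aggregation receives the list of (gold, pred) items collected
# across all documents and reduces them to a single number.
@register_aggregation("balanced_acc")
def balanced_acc_agg(items):
    golds, preds = zip(*items)
    return sklearn.metrics.balanced_accuracy_score(golds, preds)


# Step 1: the metric itself is a passthrough, exactly like mcc_fn above.
@register_metric(
    metric="balanced_acc",
    higher_is_better=True,
    output_type="multiple_choice",
    aggregation="balanced_acc",
)
def balanced_acc_fn(items):
    return items
```

Step 3 would then be a one-line addition to the `result_dict` shown above, e.g. `**({"balanced_acc": (gold, pred)} if "balanced_acc" in use_metric else {})`, so that each document contributes a `(gold, pred)` pair for the aggregation to consume.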
@@ -25,20 +25,23 @@ def _handle_non_serializable(o):
 def parse_eval_args() -> argparse.Namespace:
     parser = argparse.ArgumentParser(formatter_class=argparse.RawTextHelpFormatter)
-    parser.add_argument("--model", default="hf", help="Name of model e.g. `hf`")
+    parser.add_argument("--model", "-m", default="hf", help="Name of model e.g. `hf`")
     parser.add_argument(
         "--tasks",
+        "-t",
         default=None,
         metavar="task1,task2",
         help="To get full list of tasks, use the command lm-eval --tasks list",
     )
     parser.add_argument(
         "--model_args",
+        "-a",
         default="",
         help="Comma separated string arguments for model, e.g. `pretrained=EleutherAI/pythia-160m,dtype=float32`",
     )
     parser.add_argument(
         "--num_fewshot",
+        "-f",
         type=int,
         default=None,
         metavar="N",
@@ -46,6 +49,7 @@ def parse_eval_args() -> argparse.Namespace:
     )
     parser.add_argument(
         "--batch_size",
+        "-b",
         type=str,
         default=1,
         metavar="auto|auto:N|N",
@@ -66,6 +70,7 @@ def parse_eval_args() -> argparse.Namespace:
     )
     parser.add_argument(
         "--output_path",
+        "-o",
         default=None,
         type=str,
         metavar="DIR|DIR/file.json",
@@ -73,6 +78,7 @@ def parse_eval_args() -> argparse.Namespace:
     )
     parser.add_argument(
         "--limit",
+        "-L",
         type=float,
         default=None,
         metavar="N|0<N<1",
@@ -81,6 +87,7 @@ def parse_eval_args() -> argparse.Namespace:
     )
     parser.add_argument(
         "--use_cache",
+        "-c",
         type=str,
         default=None,
         metavar="DIR",
@@ -94,12 +101,14 @@ def parse_eval_args() -> argparse.Namespace:
     )
     parser.add_argument(
         "--write_out",
+        "-w",
         action="store_true",
         default=False,
         help="Prints the prompt for the first few documents.",
     )
     parser.add_argument(
         "--log_samples",
+        "-s",
         action="store_true",
         default=False,
         help="If True, write out all model outputs and documents for per-sample measurement and post-hoc analysis. Use with --output_path.",
@@ -127,7 +136,8 @@ def parse_eval_args() -> argparse.Namespace:
     )
     parser.add_argument(
         "--verbosity",
-        type=str,
+        "-v",
+        type=str.upper,
         default="INFO",
         metavar="CRITICAL|ERROR|WARNING|INFO|DEBUG",
         help="Controls the reported logging error level. Set to DEBUG when testing + adding new task configurations for comprehensive log output.",
...
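Two things stand out in this hunk. First, the common options gain short aliases, so compact invocations such as `lm-eval -m hf -a pretrained=EleutherAI/pythia-160m,dtype=float32 -t <task> -b 8 -o results/ -s` become possible. Second, `--verbosity` switches from `type=str` to `type=str.upper`, which normalizes lowercase input before it is used to configure logging. A minimal, self-contained illustration of that argparse pattern (not the harness's actual parser):

```python
import argparse

# The `type` callable is applied to the raw command-line string,
# so "debug" is normalized to "DEBUG" before it is stored.
parser = argparse.ArgumentParser()
parser.add_argument("--verbosity", "-v", type=str.upper, default="INFO")

print(parser.parse_args(["-v", "debug"]).verbosity)  # -> DEBUG
print(parser.parse_args([]).verbosity)               # -> INFO
```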
@@ -234,9 +234,6 @@ def evaluate(
     padding_requests = collections.defaultdict(int)
     # store the hierarchy to do proper ordering
     task_hierarchy = collections.defaultdict(list)
-    # store the ordering of tasks and groups
-    task_order = collections.defaultdict(int)
-    task_group_alias = collections.defaultdict(dict)
     # store num-fewshot value per task
     num_fewshot = collections.defaultdict(int)
@@ -264,14 +261,14 @@
         num_fewshot[task_name] = n_shot
 
         if "task_alias" in configs[task_name]:
-            task_group_alias[task_name] = configs[task_name]["task_alias"]
+            results[task_name]["alias"] = configs[task_name]["task_alias"]
 
         if (
             ("group_alias" in configs[task_name])
-            and (group_name not in task_group_alias)
+            and (group_name not in results)
             and (group_name is not None)
         ):
-            task_group_alias[group_name] = configs[task_name]["group_alias"]
+            results[group_name]["alias"] = configs[task_name]["group_alias"]
 
         if limit is not None:
             if task.has_test_docs():
@@ -443,32 +440,6 @@ def evaluate(
         vals = vals_torch
 
     if lm.rank == 0:
-        ### Get task ordering for correct sample-wide aggregation
-        group_to_task = {}
-        for group in task_hierarchy.keys():
-            if group not in task_order:
-                task_order[group] = 0
-
-            if len(task_hierarchy[group]) > 0:
-                group_to_task[group] = task_hierarchy[group].copy()
-
-                for task in task_hierarchy[group]:
-                    if task in task_order:
-                        task_order[task] += 1
-                    else:
-                        task_order[task] = 1 + task_order[group]
-
-                    if task in task_hierarchy:
-                        group_to_task[group].remove(task)
-                        group_to_task[group].extend(task_hierarchy[task])
-
-        task_to_group = {}
-        for group in group_to_task:
-            for task in group_to_task[group]:
-                if task in task_to_group:
-                    task_to_group[task].append(group)
-                else:
-                    task_to_group[task] = [group]
-
         ### Aggregate results over all datapoints ###
         # aggregate results ; run bootstrap CIs
@@ -509,7 +480,10 @@ def evaluate(
             total_size = 0
 
             for task in task_list:
-                metrics = results[task]
+                metrics = results[task].copy()
+
+                if "alias" in metrics:
+                    metrics.pop("alias")
 
                 current_size = metrics.pop("samples")
 
                 # TODO: There should be a way for users
@@ -557,71 +531,77 @@ def evaluate(
                 results[group]["samples"] = total_size
 
-        def print_tasks(task_hierarchy, task_order, task_version, task_group_alias):
+        def print_tasks(task_hierarchy, results, tab=0):
             results_agg = collections.defaultdict(dict)
             groups_agg = collections.defaultdict(dict)
-            for group_name, task_list in task_hierarchy.items():
-                order = task_order[group_name]
-                results_agg[group_name] = results[group_name].copy()
-                results_agg[group_name]["tab"] = order
-
-                if (order < max(task_order.values())) and (len(task_list) > 0):
-                    groups_agg[group_name] = results[group_name].copy()
-                    groups_agg[group_name]["tab"] = order
-
-                if task_list != []:
-                    for task in sorted(task_list):
-                        if task in task_hierarchy:
-                            _task_hierarchy = {task: task_hierarchy[task]}
-                        else:
-                            _task_hierarchy = {task: []}
-
-                        _results_agg, _groups_agg, task_version = print_tasks(
-                            _task_hierarchy, task_order, task_version, task_group_alias
-                        )
-
-                        results_agg = {**results_agg, **_results_agg}
-                        groups_agg = {**groups_agg, **_groups_agg}
-
-            return results_agg, groups_agg, task_version
-
-        results_agg, groups_agg, versions = print_tasks(
-            task_hierarchy, task_order, versions, task_group_alias
-        )
-
-        for task in results_agg:
-            task_results = results_agg[task]
-
-            if "samples" in task_results:
-                task_results.pop("samples")
-
-            tab_string = ""
-            if "tab" in task_results:
-                tab = task_results.pop("tab")
-                tab_string = " " * tab + "- " if tab > 0 else ""
-
-            if task in task_group_alias:
-                task_alias = task_group_alias[task]
-                results_agg[task]["alias"] = tab_string + task_alias
-            else:
-                results_agg[task]["alias"] = tab_string + task
-
-        for group in groups_agg:
-            group_results = groups_agg[group]
-
-            if "samples" in group_results:
-                group_results.pop("samples")
-
-            tab_string = ""
-            if "tab" in group_results:
-                tab = group_results.pop("tab")
-                tab_string = " " * tab + "- " if tab > 0 else ""
-
-            if group in task_group_alias:
-                group_alias = task_group_alias[group]
-                groups_agg[group]["alias"] = tab_string + group_alias
-            else:
-                groups_agg[group]["alias"] = tab_string + group
+
+            (group_name, task_list), *_ = task_hierarchy.items()
+            task_list = sorted(task_list)
+
+            results_agg[group_name] = results[group_name].copy()
+            # results_agg[group_name]["tab"] = tab
+            if "samples" in results_agg[group_name]:
+                results_agg[group_name].pop("samples")
+
+            tab_string = " " * tab + "- " if tab > 0 else ""
+
+            if "alias" in results_agg[group_name]:
+                results_agg[group_name]["alias"] = (
+                    tab_string + results_agg[group_name]["alias"]
+                )
+            else:
+                results_agg[group_name]["alias"] = tab_string + group_name
+
+            if len(task_list) > 0:
+                groups_agg[group_name] = results[group_name].copy()
+                # groups_agg[group_name]["tab"] = tab
+                if "samples" in groups_agg[group_name]:
+                    groups_agg[group_name].pop("samples")
+
+                if "alias" in groups_agg[group_name]:
+                    groups_agg[group_name]["alias"] = (
+                        tab_string + groups_agg[group_name]["alias"]
+                    )
+                else:
+                    groups_agg[group_name]["alias"] = tab_string + group_name
+
+                for task_name in task_list:
+                    if task_name in task_hierarchy:
+                        _task_hierarchy = {
+                            **{task_name: task_hierarchy[task_name]},
+                            **task_hierarchy,
+                        }
+                    else:
+                        _task_hierarchy = {
+                            **{task_name: []},
+                            **task_hierarchy,
+                        }
+
+                    _results_agg, _groups_agg = print_tasks(
+                        _task_hierarchy, results, tab + 1
+                    )
+                    results_agg = {**results_agg, **_results_agg}
+                    groups_agg = {**groups_agg, **_groups_agg}
+
+            return results_agg, groups_agg
+
+        results_agg = collections.defaultdict(dict)
+        groups_agg = collections.defaultdict(dict)
+        all_tasks_list = list(task_hierarchy.keys())
+        left_tasks_list = []
+
+        while True:
+            add_tasks_list = list(k for k in results_agg.keys())
+            left_tasks_list = sorted(list(set(all_tasks_list) - set(add_tasks_list)))
+            if len(left_tasks_list) == 0:
+                break
+
+            _task_hierarchy = {
+                k: v for k, v in task_hierarchy.items() if k in left_tasks_list
+            }
+            _results_agg, _groups_agg = print_tasks(_task_hierarchy, results)
+
+            results_agg = {**results_agg, **_results_agg}
+            groups_agg = {**groups_agg, **_groups_agg}
 
         for group_name, task_list in task_hierarchy.items():
             if task_list != []:
...
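The net effect of the rewritten `print_tasks` is that display aliases now carry their own indentation: a group keeps its name, and each member task is listed beneath it with a `- ` prefix whose depth grows with `tab`. A toy, self-contained sketch of that rendering scheme (not harness code; the task names are illustrative):

```python
# Toy hierarchy: one group ("glue") with two leaf tasks.
hierarchy = {"glue": ["cola", "sst2"], "cola": [], "sst2": []}


def render(name, tab=0):
    # Mirrors `tab_string = " " * tab + "- " if tab > 0 else ""` from the diff above.
    prefix = " " * tab + "- " if tab > 0 else ""
    print(prefix + name)
    for child in sorted(hierarchy.get(name, [])):
        render(child, tab + 1)


render("glue")
# glue
#  - cola
#  - sst2
```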
-include: fld.yaml
+include: fld_default.yaml
 task: fld_star
 dataset_name: star