Unverified commit 26bc3eab, authored by Lintang Sutawika, committed by GitHub

Merge branch 'big-refactor' into model-written-eval

parents 0d701496 cf617ab1
...@@ -43,7 +43,7 @@ jobs: ...@@ -43,7 +43,7 @@ jobs:
# # mypy turned off for now # # mypy turned off for now
# - name: Lint with mypy # - name: Lint with mypy
# run: mypy . --ignore-missing-imports --check-untyped-defs --explicit-package-bases --warn-unreachable # run: mypy . --ignore-missing-imports --check-untyped-defs --explicit-package-bases --warn-unreachable
Job 2 # Job 2
testcpu: testcpu:
name: CPU Tests name: CPU Tests
runs-on: ubuntu-latest runs-on: ubuntu-latest
......
...@@ -33,7 +33,6 @@ repos: ...@@ -33,7 +33,6 @@ repos:
rev: 22.3.0 rev: 22.3.0
hooks: hooks:
- id: black - id: black
language_version: python3.8
- repo: https://github.com/codespell-project/codespell - repo: https://github.com/codespell-project/codespell
rev: v2.1.0 rev: v2.1.0
hooks: hooks:
......
...@@ -23,8 +23,12 @@ Features: ...@@ -23,8 +23,12 @@ Features:
- Many tasks implemented, 200+ tasks [implemented in the old framework](https://github.com/EleutherAI/lm-evaluation-harness/blob/master/docs/task_table.md) which require porting to the new setup as described in [the new task guide](https://github.com/EleutherAI/lm-evaluation-harness/blob/big-refactor/docs/new_task_guide.md). - Many tasks implemented, 200+ tasks [implemented in the old framework](https://github.com/EleutherAI/lm-evaluation-harness/blob/master/docs/task_table.md) which require porting to the new setup as described in [the new task guide](https://github.com/EleutherAI/lm-evaluation-harness/blob/big-refactor/docs/new_task_guide.md).
- Support for models loaded via [transformers](https://github.com/huggingface/transformers/) (including quantization via [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ)), [GPT-NeoX](https://github.com/EleutherAI/gpt-neox), and [Megatron-DeepSpeed](https://github.com/microsoft/Megatron-DeepSpeed/), with a flexible tokenization-agnostic interface. - Support for models loaded via [transformers](https://github.com/huggingface/transformers/) (including quantization via [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ)), [GPT-NeoX](https://github.com/EleutherAI/gpt-neox), and [Megatron-DeepSpeed](https://github.com/microsoft/Megatron-DeepSpeed/), with a flexible tokenization-agnostic interface.
- Support for commercial APIs including [OpenAI](https://openai.com), [goose.ai](https://goose.ai), and [TextSynth](https://textsynth.com/). - Support for commercial APIs including [OpenAI](https://openai.com), [goose.ai](https://goose.ai), and [TextSynth](https://textsynth.com/).
- Support for evaluation on adapters (e.g. LoRa) supported in [HuggingFace's PEFT library](https://github.com/huggingface/peft). - Support for evaluation on adapters (e.g. LoRA) supported in [HuggingFace's PEFT library](https://github.com/huggingface/peft).
- Evaluating with publicly available prompts ensures reproducibility and comparability between papers. - Support for local models and benchmarks.
- Evaluation with publicly available prompts ensures reproducibility and comparability between papers.
The Language Model Evaluation Harness is the backend for 🤗 Hugging Face's popular [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard) and is used internally by dozens of companies including NVIDIA, Cohere, Booz Allen Hamilton, and Mosaic ML.
## Install ## Install
...@@ -86,7 +90,7 @@ python -m lm_eval \ ...@@ -86,7 +90,7 @@ python -m lm_eval \
--batch_size 8 --batch_size 8
``` ```
Models that are loaded via either `transformers.AutoModelForCausalLM` (autoregressive, decoder-only GPT style models) or `transformers.AutoModelForSeq2SeqLM` (such as encoder-decoder models like T5) in Huggingface are supported via Support for this model type is currently pending. Models that are loaded via both `transformers.AutoModelForCausalLM` (autoregressive, decoder-only GPT style models) and `transformers.AutoModelForSeq2SeqLM` (such as encoder-decoder models like T5) in Huggingface are supported.
Batch size selection can be automated by setting the ```--batch_size``` flag to ```auto```. This will perform automatic detection of the largest batch size that will fit on your device. On tasks where there is a large difference between the longest and shortest example, it can be helpful to periodically recompute the largest batch size, to gain a further speedup. To do this, append ```:N``` to above flag to automatically recompute the largest batch size ```N``` times. For example, to recompute the batch size 4 times, the command would be: Batch size selection can be automated by setting the ```--batch_size``` flag to ```auto```. This will perform automatic detection of the largest batch size that will fit on your device. On tasks where there is a large difference between the longest and shortest example, it can be helpful to periodically recompute the largest batch size, to gain a further speedup. To do this, append ```:N``` to above flag to automatically recompute the largest batch size ```N``` times. For example, to recompute the batch size 4 times, the command would be:
...@@ -151,14 +155,14 @@ A full accounting of the supported and planned libraries + APIs can be seen belo ...@@ -151,14 +155,14 @@ A full accounting of the supported and planned libraries + APIs can be seen belo
| API or Inference Server | Implemented? | `--model <xxx>` name | Models supported: | Request Types: | | API or Inference Server | Implemented? | `--model <xxx>` name | Models supported: | Request Types: |
|-----------------------------|---------------------------------|----------------------------------------------------------------------------------|--------------------------------------|----------------------------------------------------------| |-----------------------------|---------------------------------|----------------------------------------------------------------------------------|--------------------------------------|----------------------------------------------------------|
| OpenAI Completions | :heavy_check_mark: | `openai`, `openai-completions`, `gooseai` | up to `code-davinci-002` | `greedy_until`, `loglikelihood`, `loglikelihood_rolling` | | OpenAI Completions | :heavy_check_mark: | `openai`, `openai-completions`, `gooseai` | up to `code-davinci-002` | `generate_until`, `loglikelihood`, `loglikelihood_rolling` |
| OpenAI ChatCompletions | :x: Not yet - needs help! | N/A | (link here?) | `greedy_until` (no logprobs) | | OpenAI ChatCompletions | :x: Not yet - needs help! | N/A | (link here?) | `generate_until` (no logprobs) |
| Anthropic | :heavy_check_mark: | `anthropic` | [Supported Anthropic Engines](https://docs.anthropic.com/claude/reference/selecting-a-model) | `greedy_until` (no logprobs) | | Anthropic | :heavy_check_mark: | `anthropic` | [Supported Anthropic Engines](https://docs.anthropic.com/claude/reference/selecting-a-model) | `generate_until` (no logprobs) |
| GooseAI | :heavy_check_mark: (not separately maintained) | `openai`, `openai-completions`, `gooseai` (same interface as OpenAI Completions) | | `greedy_until`, `loglikelihood`, `loglikelihood_rolling` | | GooseAI | :heavy_check_mark: (not separately maintained) | `openai`, `openai-completions`, `gooseai` (same interface as OpenAI Completions) | | `generate_until`, `loglikelihood`, `loglikelihood_rolling` |
| Textsynth | Needs testing | `textsynth` | ??? | `greedy_until`, `loglikelihood`, `loglikelihood_rolling` | | Textsynth | Needs testing | `textsynth` | ??? | `generate_until`, `loglikelihood`, `loglikelihood_rolling` |
| Cohere | :hourglass: - blocked on Cohere API bug | N/A | [All `cohere.generate()` engines](https://docs.cohere.com/docs/models) | `greedy_until`, `loglikelihood`, `loglikelihood_rolling` | | Cohere | :hourglass: - blocked on Cohere API bug | N/A | [All `cohere.generate()` engines](https://docs.cohere.com/docs/models) | `generate_until`, `loglikelihood`, `loglikelihood_rolling` |
| GGML | :hourglass: [PR](https://github.com/EleutherAI/lm-evaluation-harness/pull/617) | N/A | ??? | `greedy_until`, `loglikelihood`, `loglikelihood_rolling` | | GGML | :hourglass: [PR](https://github.com/EleutherAI/lm-evaluation-harness/pull/617) | N/A | ??? | `generate_until`, `loglikelihood`, `loglikelihood_rolling` |
| vLLM | :x: Not yet - needs help! | N/A | All HF models | `greedy_until` (no logprobs) | | vLLM | :x: Not yet - needs help! | N/A | All HF models | `generate_until` (no logprobs) |
| Your inference server here! | ... | ... | ... | ... | | ... | | Your inference server here! | ... | ... | ... | ... | | ... |
It is on our roadmap to create task variants designed to enable models which do not serve logprobs/loglikelihoods to be compared with generation performance of open-source models. It is on our roadmap to create task variants designed to enable models which do not serve logprobs/loglikelihoods to be compared with generation performance of open-source models.
...@@ -232,7 +236,7 @@ We support wildcards in task names, for example you can run all of the machine-t ...@@ -232,7 +236,7 @@ We support wildcards in task names, for example you can run all of the machine-t
To implement a new task in the eval harness, see [this guide](./docs/new_task_guide.md). To implement a new task in the eval harness, see [this guide](./docs/new_task_guide.md).
As a start, we currently only support one prompt per task, which we strive to make the "standard" as defined by the benchmark's authors. If you would like to study how varying prompts causes changes in the evaluation score, we support prompts authored in the [Promptsource Library](https://github.com/bigscience-workshop/promptsource/tree/main) as described further in https://github.com/EleutherAI/lm-evaluation-harness/blob/big-refactor/lm_eval/docs/new_task_guide.md and https://github.com/EleutherAI/lm-evaluation-harness/blob/big-refactor/lm_eval/docs/advanced_task_guide.md and welcome contributions of novel task templates and task variants. As a start, we currently only support one prompt per task, which we strive to make the "standard" as defined by the benchmark's authors. If you would like to study how varying prompts causes changes in the evaluation score, we support prompts authored in the [Promptsource Library](https://github.com/bigscience-workshop/promptsource/tree/main) as described further in [the task guide](https://github.com/EleutherAI/lm-evaluation-harness/blob/big-refactor/lm_eval/docs/new_task_guide.md) and [the advanced task guide](https://github.com/EleutherAI/lm-evaluation-harness/blob/big-refactor/lm_eval/docs/advanced_task_guide.md) and welcome contributions of novel task templates and task variants.
## How to Contribute or Learn More? ## How to Contribute or Learn More?
...@@ -248,16 +252,23 @@ You can also ask for help, or discuss new features with the maintainers in the # ...@@ -248,16 +252,23 @@ You can also ask for help, or discuss new features with the maintainers in the #
@software{eval-harness, @software{eval-harness,
author = {Gao, Leo and author = {Gao, Leo and
Tow, Jonathan and Tow, Jonathan and
Abbasi, Baber and
Biderman, Stella and Biderman, Stella and
Black, Sid and Black, Sid and
DiPofi, Anthony and DiPofi, Anthony and
Foster, Charles and Foster, Charles and
Golding, Laurence and Golding, Laurence and
Hsu, Jeffrey and Hsu, Jeffrey and
Le Noac'h, Alain and
Li, Haonan and
McDonell, Kyle and McDonell, Kyle and
Muennighoff, Niklas and Muennighoff, Niklas and
Ociepa, Chris and
Phang, Jason and Phang, Jason and
Reynolds, Laria and Reynolds, Laria and
Schoelkopf, Hailey and
Skowron, Aviya and
Sutawika, Lintang and
Tang, Eric and Tang, Eric and
Thite, Anish and Thite, Anish and
Wang, Ben and Wang, Ben and
......
...@@ -57,7 +57,7 @@ import lm_eval ...@@ -57,7 +57,7 @@ import lm_eval
my_model = initialize_my_model() # create your model (could be running finetuning with some custom modeling code) my_model = initialize_my_model() # create your model (could be running finetuning with some custom modeling code)
... ...
lm_obj = Your_LM(model=my_model, batch_size=16) # instantiate an LM subclass that takes your initialized model and can run `Your_LM.loglikelihood()`, `Your_LM.loglikelihood_rolling()`, `Your_LM.greedy_until()` lm_obj = Your_LM(model=my_model, batch_size=16) # instantiate an LM subclass that takes your initialized model and can run `Your_LM.loglikelihood()`, `Your_LM.loglikelihood_rolling()`, `Your_LM.generate_until()`
results = lm_eval.simple_evaluate( # call simple_evaluate results = lm_eval.simple_evaluate( # call simple_evaluate
model=lm_obj, model=lm_obj,
...@@ -83,7 +83,7 @@ from my_tasks import MyTask1 # suppose you've defined a custom lm_eval.api.Task ...@@ -83,7 +83,7 @@ from my_tasks import MyTask1 # suppose you've defined a custom lm_eval.api.Task
my_model = initialize_my_model() # create your model (could be running finetuning with some custom modeling code) my_model = initialize_my_model() # create your model (could be running finetuning with some custom modeling code)
... ...
lm_obj = Your_LM(model=my_model, batch_size=16) # instantiate an LM subclass that takes your initialized model and can run `Your_LM.loglikelihood()`, `Your_LM.loglikelihood_rolling()`, `Your_LM.greedy_until()` lm_obj = Your_LM(model=my_model, batch_size=16) # instantiate an LM subclass that takes your initialized model and can run `Your_LM.loglikelihood()`, `Your_LM.loglikelihood_rolling()`, `Your_LM.generate_until()`
......
...@@ -44,26 +44,24 @@ class MyCustomLM(LM): ...@@ -44,26 +44,24 @@ class MyCustomLM(LM):
#... #...
def greedy_until(self, requests: list[Instance]) -> list[str]: def generate_until(self, requests: list[Instance]) -> list[str]:
#... #...
#... #...
``` ```
Where `Instance` is a dataclass defined in [`lm_eval.api.instance`](https://github.com/EleutherAI/lm-evaluation-harness/blob/big-refactor/lm_eval/api/instance.py) with property `args` which returns a tuple of (context, continuation). Where `Instance` is a dataclass defined in [`lm_eval.api.instance`](https://github.com/EleutherAI/lm-evaluation-harness/blob/big-refactor/lm_eval/api/instance.py) with property `args` which returns a tuple of (context, continuation).
We support We support three types of requests, consisting of different interactions / measurements with an autoregressive LM.
The three types of All three request types take as input `requests` of type `list[Instance]` that have a matching `Instance.request_type` to the method name.
- `generate_until`
- Each request contains `Instance.args : Tuple[str, dict]` containing 1. an input string to the LM and 2. a dictionary of keyword arguments used to control generation parameters.
-
- `loglikelihood`
-
smth smth tokenizer-agnostic - `loglikelihood_rolling`, and args passed to it
3 reqtypes
- greedy_until, and the arguments passed to it
- loglikelihood, and args passed to it
- loglikelihood_rolling, and args passed to it
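
The `generate_until` request shape described above can be pictured with a minimal sketch. `ToyInstance` and `EchoLM` are hypothetical stand-ins, not the harness's `Instance` or `LM` classes; only the `(input string, generation kwargs)` layout of `args` is taken from the text, and the uppercase "generation" is a placeholder for a real model call.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ToyInstance:
    """Hypothetical stand-in for lm_eval.api.instance.Instance."""

    request_type: str
    arguments: tuple

    @property
    def args(self) -> tuple:
        return self.arguments


class EchoLM:
    """Hypothetical model that 'generates' by uppercasing its input."""

    def generate_until(self, requests: List[ToyInstance]) -> List[str]:
        outputs = []
        for req in requests:
            context, gen_kwargs = req.args  # (input string, generation kwargs)
            text = context.upper()  # placeholder for a real model call
            for stop in gen_kwargs.get("until") or []:
                text = text.split(stop)[0]  # truncate at the first stop sequence
            outputs.append(text)
        return outputs


requests = [ToyInstance("generate_until", ("hello. world", {"until": ["."]}))]
outputs = EchoLM().generate_until(requests)
```

A real subclass would also implement `loglikelihood` and `loglikelihood_rolling`, each consuming requests whose `request_type` matches the method name.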
## Registration ## Registration
......
...@@ -32,7 +32,7 @@ Prompting / in-context formatting options: ...@@ -32,7 +32,7 @@ Prompting / in-context formatting options:
- **use_prompt** (`str`, *optional*) — Name of prompt in promptsource to use. if defined, will overwrite doc_to_text, doc_to_target, and doc_to_choice. - **use_prompt** (`str`, *optional*) — Name of prompt in promptsource to use. if defined, will overwrite doc_to_text, doc_to_target, and doc_to_choice.
- **doc_to_text** (`Union[Callable, str]`, *optional*) — Jinja2, f-string, or function to process a sample into the appropriate input for the model - **doc_to_text** (`Union[Callable, str]`, *optional*) — Jinja2, f-string, or function to process a sample into the appropriate input for the model
- **doc_to_target** (`Union[Callable, str]`, *optional*) — Jinja2, f-string, or function to process a sample into the appropriate target output for the model. For multiple choice tasks, this should return an index into - **doc_to_target** (`Union[Callable, str]`, *optional*) — Jinja2, f-string, or function to process a sample into the appropriate target output for the model. For multiple choice tasks, this should return an index into
- **doc_to_choice** (`Union[Callable, str]`, *optional*) — Jinja2, f-string, or function to process a sample into a list of possible string choices for `multiple_choice` tasks. Left undefined for `greedy_until` tasks. - **doc_to_choice** (`Union[Callable, str]`, *optional*) — Jinja2, f-string, or function to process a sample into a list of possible string choices for `multiple_choice` tasks. Left undefined for `generate_until` tasks.
- **fewshot_delimiter** (`str`, *optional*, defaults to "\n\n") — String to insert between few-shot examples. - **fewshot_delimiter** (`str`, *optional*, defaults to "\n\n") — String to insert between few-shot examples.
- **target_delimiter** (`str`, *optional*, defaults to `" "`) — String to insert between input and target output for the datapoint being tested. - **target_delimiter** (`str`, *optional*, defaults to `" "`) — String to insert between input and target output for the datapoint being tested.
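
As a rough illustration of how the two delimiter options interact, the sketch below joins few-shot examples into one prompt. `build_prompt` is a hypothetical helper written for this example; the harness's own prompt assembly may differ in detail.

```python
def build_prompt(fewshot_pairs, query, fewshot_delimiter="\n\n", target_delimiter=" "):
    """Join few-shot (input, target) pairs and the final query into one prompt string."""
    # target_delimiter sits between each input and its target output
    shots = [inp + target_delimiter + tgt for inp, tgt in fewshot_pairs]
    # fewshot_delimiter separates consecutive examples and the final query
    return fewshot_delimiter.join(shots + [query])


prompt = build_prompt([("Q: 1+1?", "2"), ("Q: 2+2?", "4")], "Q: 3+3?")
```

With the defaults, each example is separated by a blank line and each answer follows its question after a single space.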
...@@ -42,7 +42,7 @@ Runtime configuration options: ...@@ -42,7 +42,7 @@ Runtime configuration options:
Scoring details: Scoring details:
- **metric_list** (`str`, *optional*, defaults to None) — A list of metrics to use for evaluation. See docs for expected format. - **metric_list** (`str`, *optional*, defaults to None) — A list of metrics to use for evaluation. See docs for expected format.
- **output_type** (`str`, *optional*, defaults to "greedy_until") — Selects the type of model output for the given task. Options are `greedy_until`, `loglikelihood`, `loglikelihood_rolling`, and `multiple_choice`. - **output_type** (`str`, *optional*, defaults to "generate_until") — Selects the type of model output for the given task. Options are `generate_until`, `loglikelihood`, `loglikelihood_rolling`, and `multiple_choice`.
- **generation_kwargs** (`dict`, *optional*) — Auxiliary arguments for the `generate` function from HF transformers library. Advanced keyword arguments may not be supported for non-HF LM classes. - **generation_kwargs** (`dict`, *optional*) — Auxiliary arguments for the `generate` function from HF transformers library. Advanced keyword arguments may not be supported for non-HF LM classes.
- **repeats** (`int`, *optional*, defaults to 1) — Number of repeated runs through model for each sample. can be used for cases such as self-consistency. - **repeats** (`int`, *optional*, defaults to 1) — Number of repeated runs through model for each sample. can be used for cases such as self-consistency.
- **filter_list** (`Union[str, list]`, *optional*) — List of filters to postprocess model outputs. See below for further detail on the filter API. - **filter_list** (`Union[str, list]`, *optional*) — List of filters to postprocess model outputs. See below for further detail on the filter API.
...@@ -142,7 +142,7 @@ Our final filter pipeline, "maj@8", does majority voting across the first 8 of t ...@@ -142,7 +142,7 @@ Our final filter pipeline, "maj@8", does majority voting across the first 8 of t
- performing the same sequence of filters on these new sets of 8 responses, for each document. - performing the same sequence of filters on these new sets of 8 responses, for each document.
```yaml ```yaml
- name: "maj@8" - name: "maj@8"
filter: filter:
- function: "take_first_k" - function: "take_first_k"
k: 8 k: 8
- function: "regex" - function: "regex"
......
...@@ -98,20 +98,20 @@ def parse_eval_args() -> argparse.Namespace: ...@@ -98,20 +98,20 @@ def parse_eval_args() -> argparse.Namespace:
help="Additional path to include if there are external tasks to include.", help="Additional path to include if there are external tasks to include.",
) )
parser.add_argument( parser.add_argument(
"--verbose", "--verbosity",
type=bool, type=str,
default=False, default="INFO",
help="Log error when tasks are not registered.", help="Log error when tasks are not registered.",
) )
return parser.parse_args() return parser.parse_args()
def cli_evaluate(args: Union[argparse.Namespace, None] = None) -> None: def cli_evaluate(args: Union[argparse.Namespace, None] = None) -> None:
if not args: if not args:
# we allow for args to be passed externally, else we parse them ourselves # we allow for args to be passed externally, else we parse them ourselves
args = parse_eval_args() args = parse_eval_args()
eval_logger.setLevel(getattr(logging, f"{args.verbosity}"))
os.environ["TOKENIZERS_PARALLELISM"] = "false" os.environ["TOKENIZERS_PARALLELISM"] = "false"
if args.limit: if args.limit:
...@@ -138,19 +138,21 @@ def cli_evaluate(args: Union[argparse.Namespace, None] = None) -> None: ...@@ -138,19 +138,21 @@ def cli_evaluate(args: Union[argparse.Namespace, None] = None) -> None:
else: else:
tasks_list = args.tasks.split(",") tasks_list = args.tasks.split(",")
task_names = utils.pattern_match(tasks_list, ALL_TASKS) task_names = utils.pattern_match(tasks_list, ALL_TASKS)
task_missing = []
for task in [task for task in tasks_list if task not in task_names]: for task in [task for task in tasks_list if task not in task_names]:
if os.path.isfile(task): if os.path.isfile(task):
config = utils.load_yaml_config(task) config = utils.load_yaml_config(task)
task_names.append(config) task_names.append(config)
task_missing = [task for task in tasks_list if task not in task_names]
if task_missing != []:
missing = ", ".join(task_missing) if task_missing:
eval_logger.error( missing = ", ".join(task_missing)
f"Tasks were not found: {missing}\n" eval_logger.error(
f"{SPACING}Try `lm-eval -h` for list of available tasks", f"Tasks were not found: {missing}\n"
) f"{SPACING}Try `lm-eval -h` for list of available tasks",
raise ValueError(f"Tasks {missing} were not found.") )
raise ValueError(
f"Tasks {missing} were not found. Try `lm-eval -h` for list of available tasks."
)
if args.output_path: if args.output_path:
path = Path(args.output_path) path = Path(args.output_path)
......
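
The switch from a boolean `--verbose` flag to a string `--verbosity` option relies on looking up a level constant on the `logging` module by name. A minimal sketch of that pattern, assuming a hypothetical logger name `demo-eval` and an explicit argument list in place of the real command line:

```python
import argparse
import logging

parser = argparse.ArgumentParser()
parser.add_argument("--verbosity", type=str, default="INFO")
args = parser.parse_args(["--verbosity", "DEBUG"])  # explicit argv for the example

logger = logging.getLogger("demo-eval")  # hypothetical logger name
# Resolve the level name ("DEBUG", "INFO", ...) to the numeric constant on `logging`.
logger.setLevel(getattr(logging, args.verbosity))
```

Note that `getattr` raises `AttributeError` for a name that `logging` does not define, so restricting the option with argparse `choices` may be worth considering.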
...@@ -4,7 +4,7 @@ from typing import Literal, Tuple ...@@ -4,7 +4,7 @@ from typing import Literal, Tuple
@dataclass @dataclass
class Instance: class Instance:
request_type: Literal["loglikelihood", "loglikelihood_rolling", "greedy_until"] request_type: Literal["loglikelihood", "loglikelihood_rolling", "generate_until"]
doc: dict doc: dict
arguments: tuple arguments: tuple
idx: int idx: int
......
...@@ -212,7 +212,7 @@ def f1_fn(items): # This is a passthrough function ...@@ -212,7 +212,7 @@ def f1_fn(items): # This is a passthrough function
@register_metric( @register_metric(
metric="bleu", metric="bleu",
higher_is_better=True, higher_is_better=True,
output_type="greedy_until", output_type="generate_until",
aggregation="bleu", aggregation="bleu",
) )
def bleu_fn(items): # This is a passthrough function def bleu_fn(items): # This is a passthrough function
...@@ -222,7 +222,7 @@ def bleu_fn(items): # This is a passthrough function ...@@ -222,7 +222,7 @@ def bleu_fn(items): # This is a passthrough function
@register_metric( @register_metric(
metric="chrf", metric="chrf",
higher_is_better=True, higher_is_better=True,
output_type="greedy_until", output_type="generate_until",
aggregation="chrf", aggregation="chrf",
) )
def chrf_fn(items): # This is a passthrough function def chrf_fn(items): # This is a passthrough function
...@@ -232,7 +232,7 @@ def chrf_fn(items): # This is a passthrough function ...@@ -232,7 +232,7 @@ def chrf_fn(items): # This is a passthrough function
@register_metric( @register_metric(
metric="ter", metric="ter",
higher_is_better=True, higher_is_better=True,
output_type="greedy_until", output_type="generate_until",
aggregation="ter", aggregation="ter",
) )
def ter_fn(items): # This is a passthrough function def ter_fn(items): # This is a passthrough function
......
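
For readers unfamiliar with the decorator pattern used by `register_metric`, here is a self-contained sketch. The `METRICS` dictionary and the decorator body are assumptions made for illustration; the harness maintains its own registries and may store this metadata differently.

```python
# Hypothetical registry for this sketch; the harness keeps its own registries.
METRICS = {}


def register_metric(metric, higher_is_better, output_type, aggregation):
    """Return a decorator that records a metric function and its metadata."""

    def decorate(fn):
        METRICS[metric] = {
            "fn": fn,
            "higher_is_better": higher_is_better,
            "output_type": output_type,
            "aggregation": aggregation,
        }
        return fn

    return decorate


@register_metric(
    metric="bleu",
    higher_is_better=True,
    output_type="generate_until",
    aggregation="bleu",
)
def bleu_fn(items):  # passthrough: scoring is deferred to the aggregation step
    return items
```

Because `output_type` is stored as a plain string, renaming `greedy_until` to `generate_until` has to be applied at every registration site, which is what the hunks above do.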
...@@ -96,7 +96,7 @@ class LM(abc.ABC): ...@@ -96,7 +96,7 @@ class LM(abc.ABC):
# TODO: Add an optional max length # TODO: Add an optional max length
@abc.abstractmethod @abc.abstractmethod
def greedy_until(self, requests) -> List[str]: def generate_until(self, requests) -> List[str]:
"""Generate greedily until a stopping sequence """Generate greedily until a stopping sequence
:param requests: list[Instance] :param requests: list[Instance]
...@@ -211,12 +211,12 @@ class CachingLM: ...@@ -211,12 +211,12 @@ class CachingLM:
) )
for req in tqdm(requests): for req in tqdm(requests):
hsh = hash_args(attr, req.args) hsh = hash_args(attr, req.args)
if attr == "greedy_until" and req.args[1].get("do_sample", False): if attr == "generate_until" and req.args[1].get("do_sample", False):
# when we are doing non-greedy generation, don't use the cache # when we are doing non-greedy generation, don't use the cache
# (else every "randomly sampled" generation would be identical for repeats > 1). # (else every "randomly sampled" generation would be identical for repeats > 1).
if not warned: if not warned:
eval_logger.warning( eval_logger.warning(
f"Arguments to lm.greedy_until() '{req.args[1]}' include non-deterministic sampling. Caching will not be performed for such requests." f"Arguments to lm.generate_until() '{req.args[1]}' include non-deterministic sampling. Caching will not be performed for such requests."
) )
warned = True warned = True
res.append(None) res.append(None)
......
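
The cache bypass for sampled generation can be sketched independently of the harness. `cached_call`, `fake_generate`, and the JSON-based `hash_args` below are hypothetical; only the rule "skip the cache when `do_sample` is set on a `generate_until` request" comes from the diff.

```python
import hashlib
import json


def hash_args(attr, args):
    """Build a stable key from the method name and its arguments."""
    payload = json.dumps([attr, list(args)], sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def cached_call(cache, attr, args, compute):
    """Return a cached result unless the request asks for random sampling."""
    if attr == "generate_until" and args[1].get("do_sample", False):
        return compute(args)  # sampled outputs must differ across repeats
    key = hash_args(attr, args)
    if key not in cache:
        cache[key] = compute(args)
    return cache[key]


calls = []


def fake_generate(args):
    calls.append(args)
    return f"output-{len(calls)}"


cache = {}
greedy = ("prompt", {"do_sample": False})
sampled = ("prompt", {"do_sample": True})
first = cached_call(cache, "generate_until", greedy, fake_generate)
second = cached_call(cache, "generate_until", greedy, fake_generate)
third = cached_call(cache, "generate_until", sampled, fake_generate)
fourth = cached_call(cache, "generate_until", sampled, fake_generate)
```

The greedy request reaches the model once, while each sampled request triggers a fresh call.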
...@@ -81,7 +81,7 @@ DEFAULT_METRIC_REGISTRY = { ...@@ -81,7 +81,7 @@ DEFAULT_METRIC_REGISTRY = {
], ],
"loglikelihood_rolling": ["word_perplexity", "byte_perplexity", "bits_per_byte"], "loglikelihood_rolling": ["word_perplexity", "byte_perplexity", "bits_per_byte"],
"multiple_choice": ["acc", "acc_norm"], "multiple_choice": ["acc", "acc_norm"],
"greedy_until": ["exact_match"], "generate_until": ["exact_match"],
} }
...@@ -171,7 +171,6 @@ def is_higher_better(metric_name): ...@@ -171,7 +171,6 @@ def is_higher_better(metric_name):
try: try:
return HIGHER_IS_BETTER_REGISTRY[metric_name] return HIGHER_IS_BETTER_REGISTRY[metric_name]
except KeyError: except KeyError:
raise Warning(f"higher_is_better not specified for metric '{metric_name}'!")
eval_logger.warning( eval_logger.warning(
f"higher_is_better not specified for metric '{metric_name}'!" f"higher_is_better not specified for metric '{metric_name}'!"
) )
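
Removing the `raise Warning(...)` line makes the logger call reachable, so an unregistered metric is reported rather than aborting the run. A minimal sketch of the resulting behavior, assuming a hypothetical registry and logger; the explicit `return None` is added here only to make the fallthrough visible:

```python
import logging

logger = logging.getLogger("demo-metrics")  # hypothetical logger name
HIGHER_IS_BETTER_REGISTRY = {"acc": True, "perplexity": False}  # example entries


def is_higher_better(metric_name):
    try:
        return HIGHER_IS_BETTER_REGISTRY[metric_name]
    except KeyError:
        # Log instead of raising, so evaluation can continue.
        logger.warning("higher_is_better not specified for metric '%s'!", metric_name)
        return None


known = is_higher_better("acc")
unknown = is_higher_better("my_custom_metric")
```

Callers then need to handle a `None` result for metrics that never declared a direction.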
...@@ -44,7 +44,7 @@ ALL_OUTPUT_TYPES = [ ...@@ -44,7 +44,7 @@ ALL_OUTPUT_TYPES = [
"loglikelihood", "loglikelihood",
"multiple_choice", "multiple_choice",
"loglikelihood_rolling", "loglikelihood_rolling",
"greedy_until", "generate_until",
] ]
...@@ -80,7 +80,7 @@ class TaskConfig(dict): ...@@ -80,7 +80,7 @@ class TaskConfig(dict):
num_fewshot: int = 0 num_fewshot: int = 0
# scoring options # scoring options
metric_list: list = None metric_list: list = None
output_type: str = "greedy_until" output_type: str = "generate_until"
generation_kwargs: dict = None generation_kwargs: dict = None
repeats: int = 1 repeats: int = 1
filter_list: Union[str, list] = None filter_list: Union[str, list] = None
...@@ -97,11 +97,11 @@ class TaskConfig(dict): ...@@ -97,11 +97,11 @@ class TaskConfig(dict):
self.dataset_path = inspect.getfile(import_module(self.dataset_path)) self.dataset_path = inspect.getfile(import_module(self.dataset_path))
if self.generation_kwargs is not None: if self.generation_kwargs is not None:
if self.output_type != "greedy_until": if self.output_type != "generate_until":
eval_logger.warning( eval_logger.warning(
"passed `generation_kwargs`, but not using `output_type: greedy_until`!" f"[{self.task}] passed `generation_kwargs`, but not using `output_type: generate_until`!"
) )
assert self.output_type != "greedy_until" assert self.output_type != "generate_until"
if "temperature" in self.generation_kwargs: if "temperature" in self.generation_kwargs:
self.generation_kwargs["temperature"] = float( self.generation_kwargs["temperature"] = float(
...@@ -111,7 +111,7 @@ class TaskConfig(dict): ...@@ -111,7 +111,7 @@ class TaskConfig(dict):
if "until" not in self.generation_kwargs: if "until" not in self.generation_kwargs:
self.generation_kwargs["until"] = [self.fewshot_delimiter] self.generation_kwargs["until"] = [self.fewshot_delimiter]
else: else:
if self.output_type == "greedy_until": if self.output_type == "generate_until":
# ensure that we greedily generate in absence of explicit arguments otherwise # ensure that we greedily generate in absence of explicit arguments otherwise
self.generation_kwargs = { self.generation_kwargs = {
"until": None "until": None
...@@ -759,7 +759,6 @@ class ConfigurableTask(Task): ...@@ -759,7 +759,6 @@ class ConfigurableTask(Task):
return super().fewshot_docs() return super().fewshot_docs()
def apply_filters(self): def apply_filters(self):
if hasattr(self, "_filters"): if hasattr(self, "_filters"):
for f in self._filters: for f in self._filters:
f.apply(self._instances, self.task_docs) f.apply(self._instances, self.task_docs)
...@@ -959,7 +958,7 @@ class ConfigurableTask(Task): ...@@ -959,7 +958,7 @@ class ConfigurableTask(Task):
) )
return request_list return request_list
elif self.OUTPUT_TYPE == "greedy_until": elif self.OUTPUT_TYPE == "generate_until":
arguments = (ctx, self.config.generation_kwargs) arguments = (ctx, self.config.generation_kwargs)
return Instance( return Instance(
...@@ -967,7 +966,6 @@ class ConfigurableTask(Task): ...@@ -967,7 +966,6 @@ class ConfigurableTask(Task):
) )
def process_results(self, doc, results): def process_results(self, doc, results):
if callable(self.config.process_results): if callable(self.config.process_results):
return self.config.process_results(doc, results) return self.config.process_results(doc, results)
...@@ -1072,7 +1070,7 @@ class ConfigurableTask(Task): ...@@ -1072,7 +1070,7 @@ class ConfigurableTask(Task):
acc_mutual_info = 1.0 if np.argmax(lls_mutual_info) == gold else 0.0 acc_mutual_info = 1.0 if np.argmax(lls_mutual_info) == gold else 0.0
result_dict["acc_mutual_info"] = acc_mutual_info result_dict["acc_mutual_info"] = acc_mutual_info
elif self.OUTPUT_TYPE == "greedy_until": elif self.OUTPUT_TYPE == "generate_until":
gold = self.doc_to_target(doc) gold = self.doc_to_target(doc)
result = results[0] result = results[0]
if self.config.doc_to_choice is not None: if self.config.doc_to_choice is not None:
...@@ -1104,7 +1102,9 @@ class ConfigurableTask(Task): ...@@ -1104,7 +1102,9 @@ class ConfigurableTask(Task):
predictions=[result], predictions=[result],
**self._metric_fn_kwargs[metric], **self._metric_fn_kwargs[metric],
) )
except TypeError: # TODO: this is hacky and I don't want to do it except (
TypeError
): # TODO: this is hacky and I don't want to do it
result_score = self._metric_fn_list[metric]( result_score = self._metric_fn_list[metric](
[gold_option, result] [gold_option, result]
) )
...@@ -1123,7 +1123,9 @@ class ConfigurableTask(Task): ...@@ -1123,7 +1123,9 @@ class ConfigurableTask(Task):
predictions=[result], predictions=[result],
**self._metric_fn_kwargs[metric], **self._metric_fn_kwargs[metric],
) )
except TypeError: # needed for now in order to use a different interface between our own metrics and HF Evaluate metrics except (
TypeError
): # needed for now in order to use a different interface between our own metrics and HF Evaluate metrics
result_score = self._metric_fn_list[metric]([gold, result]) result_score = self._metric_fn_list[metric]([gold, result])
if isinstance(result_score, dict): if isinstance(result_score, dict):
# TODO: this handles the case where HF evaluate returns a dict. # TODO: this handles the case where HF evaluate returns a dict.
...@@ -1132,7 +1134,7 @@ class ConfigurableTask(Task): ...@@ -1132,7 +1134,7 @@ class ConfigurableTask(Task):
else: else:
raise ValueError( raise ValueError(
f"Passed invalid output_type '{self.OUTPUT_TYPE}' ! Please use one of ", f"Passed invalid output_type '{self.OUTPUT_TYPE}' ! Please use one of ",
"'loglikelihood', 'loglikelihood_rolling', 'greedy_until' or 'multiple_choice'", "'loglikelihood', 'loglikelihood_rolling', 'generate_until' or 'multiple_choice'",
) )
return result_dict return result_dict
......
import os
import yaml
from lm_eval import utils
from lm_eval.tasks import register_configurable_task, check_prompt_config
from lm_eval.logger import eval_logger
from lm_eval.api.registry import (
TASK_REGISTRY,
GROUP_REGISTRY,
ALL_TASKS,
)
def include_benchmarks(task_dir: str) -> None:
for root, subdirs, file_list in os.walk(task_dir):
if (subdirs == [] or "__pycache__" in subdirs) and (len(file_list) > 0):
for f in file_list:
if f.endswith(".yaml"):
try:
benchmark_path = os.path.join(root, f)
with open(benchmark_path, "rb") as file:
yaml_config = yaml.full_load(file)
if "prompts" in yaml_config:
continue # Skip it
assert "group" in yaml_config
group = yaml_config["group"]
all_task_list = yaml_config["task"]
config_list = [
task for task in all_task_list if type(task) != str
]
task_list = [
task for task in all_task_list if type(task) == str
]
for task_config in config_list:
yaml_dir = os.path.dirname(benchmark_path)
task_config = utils.load_yaml_config(
yaml_config=task_config, yaml_dir=yaml_dir
)
if "use_prompt" in task_config:
if "yaml" in task_config["use_prompt"]:
task_config["use_prompt"] = os.path.join(
root, task_config["use_prompt"]
)
var_configs = check_prompt_config(
{
**task_config,
**{"group": group},
}
)
for config in var_configs:
register_configurable_task(config)
task_names = utils.pattern_match(task_list, ALL_TASKS)
for task in task_names:
if task in TASK_REGISTRY:
if group in GROUP_REGISTRY:
GROUP_REGISTRY[group].append(task)
else:
GROUP_REGISTRY[group] = [task]
ALL_TASKS.add(group)
except Exception as error:
eval_logger.warning(
"Failed to load benchmark in\n"
f" {benchmark_path}\n"
" Benchmark will not be added to registry\n"
f" Error: {error}"
)
task_dir = os.path.dirname(os.path.abspath(__file__)) + "/"
include_benchmarks(task_dir)
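The group-registration step in `include_benchmarks` above can be sketched in isolation. This is a minimal illustration, not the harness's code: the three registries are plain in-memory stand-ins for the objects imported from `lm_eval.api.registry`, `pattern_match` is an assumed glob-style substitute for `utils.pattern_match`, and `register_group` is a hypothetical helper name. The YAML loading and the registration of inline task configs are omitted.

```python
import fnmatch
from typing import Dict, List, Set

# Hypothetical stand-ins for TASK_REGISTRY, GROUP_REGISTRY and ALL_TASKS.
TASK_REGISTRY: Dict[str, object] = {
    "arc_easy": object(),
    "arc_challenge": object(),
    "piqa": object(),
}
GROUP_REGISTRY: Dict[str, List[str]] = {}
ALL_TASKS: Set[str] = set(TASK_REGISTRY)


def pattern_match(patterns: List[str], names) -> List[str]:
    """Assumed substitute for utils.pattern_match: glob-match patterns against names."""
    matched = set()
    for pattern in patterns:
        matched.update(fnmatch.filter(sorted(names), pattern))
    return sorted(matched)


def register_group(yaml_config: dict) -> List[dict]:
    """Add already-registered tasks to a group; return the inline configs untouched."""
    group = yaml_config["group"]
    all_task_list = yaml_config["task"]
    # Inline task configs are dicts; references to existing tasks are strings.
    config_list = [t for t in all_task_list if not isinstance(t, str)]
    task_list = [t for t in all_task_list if isinstance(t, str)]

    for task in pattern_match(task_list, ALL_TASKS):
        if task in TASK_REGISTRY:
            GROUP_REGISTRY.setdefault(group, []).append(task)
            ALL_TASKS.add(group)
    return config_list


inline = register_group({"group": "demo_group", "task": ["arc_*", {"task": "custom"}]})
print(GROUP_REGISTRY)  # {'demo_group': ['arc_challenge', 'arc_easy']}
```

Using `isinstance` and `dict.setdefault` is a stylistic choice in this sketch; the diff itself compares with `type(task) == str` and branches explicitly on whether the group already exists.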
...@@ -138,7 +138,7 @@ please install anthropic via `pip install lm-eval[anthropic]` or `pip install -e ...@@ -138,7 +138,7 @@ please install anthropic via `pip install lm-eval[anthropic]` or `pip install -e
def _loglikelihood_tokens(self, requests, disable_tqdm: bool = False): def _loglikelihood_tokens(self, requests, disable_tqdm: bool = False):
raise NotImplementedError("No support for logits.") raise NotImplementedError("No support for logits.")
def greedy_until(self, requests) -> List[str]: def generate_until(self, requests) -> List[str]:
if not requests: if not requests:
return [] return []
...@@ -164,7 +164,7 @@ please install anthropic via `pip install lm-eval[anthropic]` or `pip install -e ...@@ -164,7 +164,7 @@ please install anthropic via `pip install lm-eval[anthropic]` or `pip install -e
) )
res.append(response) res.append(response)
self.cache_hook.add_partial("greedy_until", request, response) self.cache_hook.add_partial("generate_until", request, response)
except anthropic.APIConnectionError as e: # type: ignore # noqa: F821 except anthropic.APIConnectionError as e: # type: ignore # noqa: F821
eval_logger.critical(f"Server unreachable: {e.__cause__}") eval_logger.critical(f"Server unreachable: {e.__cause__}")
break break
...@@ -179,7 +179,7 @@ please install anthropic via `pip install lm-eval[anthropic]` or `pip install -e ...@@ -179,7 +179,7 @@ please install anthropic via `pip install lm-eval[anthropic]` or `pip install -e
raise NotImplementedError() raise NotImplementedError()
def _model_generate(self, context, max_length, eos_token_id): def _model_generate(self, context, max_length, eos_token_id):
# Isn't used because we override greedy_until # Isn't used because we override generate_until
raise NotImplementedError() raise NotImplementedError()
def loglikelihood(self, requests): def loglikelihood(self, requests):
......
...@@ -20,7 +20,7 @@ class DummyLM(LM): ...@@ -20,7 +20,7 @@ class DummyLM(LM):
return res return res
def greedy_until(self, requests): def generate_until(self, requests):
res = [] res = []
for ctx, _ in requests: for ctx, _ in requests:
......
...@@ -621,6 +621,23 @@ class HFLM(LM): ...@@ -621,6 +621,23 @@ class HFLM(LM):
return loglikelihoods return loglikelihoods
def _batch_scheduler(self, pos, n_reordered_requests):
sched = pos // int(len(n_reordered_requests) / self.batch_schedule)
if sched in self.batch_sizes:
return self.batch_sizes[sched]
if (len(self.batch_sizes) > 1) and (
self.batch_sizes[sched - 1] == self.max_batch_size
):
# if previous batch size is already maximal, skip recomputation
self.batch_sizes[sched] = self.max_batch_size
return self.batch_sizes[sched]
print(
f"Passed argument batch_size = auto:{self.batch_schedule}. Detecting largest batch size"
)
self.batch_sizes[sched] = self._detect_batch_size(n_reordered_requests, pos)
print(f"Determined largest batch size: {self.batch_sizes[sched]}")
return self.batch_sizes[sched]
def _loglikelihood_tokens( def _loglikelihood_tokens(
self, requests, disable_tqdm: bool = False, override_bs=None self, requests, disable_tqdm: bool = False, override_bs=None
): ):
...@@ -644,38 +661,13 @@ class HFLM(LM): ...@@ -644,38 +661,13 @@ class HFLM(LM):
# automatic (variable) batch size detection for vectorization # automatic (variable) batch size detection for vectorization
# pull longest context sample from request # pull longest context sample from request
def _batch_scheduler(pos): chunks = utils.chunks(
sched = pos // int(n_reordered_requests / self.batch_schedule) re_ord.get_reordered(),
if sched in self.batch_sizes: n=self.batch_size if self.batch_size != "auto" else override_bs if override_bs is not None else 0,
return self.batch_sizes[sched] fn=self._batch_scheduler if self.batch_size == "auto" and n_reordered_requests > 0 and not override_bs else None,
if (len(self.batch_sizes) > 1) and ( )
self.batch_sizes[sched - 1] == self.max_batch_size
):
# if previous batch size is already maximal, skip recomputation
self.batch_sizes[sched] = self.max_batch_size
return self.batch_sizes[sched]
print(
f"Passed argument batch_size = auto:{self.batch_schedule}. Detecting largest batch size"
)
self.batch_sizes[sched] = self._detect_batch_size(
re_ord.get_reordered(), pos
)
print(f"Determined largest batch size: {self.batch_sizes[sched]}")
return self.batch_sizes[sched]
for chunk in utils.chunks( for chunk in tqdm(chunks, disable=(disable_tqdm or (self.rank != 0))):
tqdm(re_ord.get_reordered(), disable=(disable_tqdm or (self.rank != 0))),
n=self.batch_size
if self.batch_size != "auto"
else override_bs
if override_bs is not None
else 0,
fn=_batch_scheduler
if self.batch_size == "auto"
and n_reordered_requests > 0
and not override_bs
else None,
):
inps = [] inps = []
cont_toks_list = [] cont_toks_list = []
inplens = [] inplens = []
...@@ -815,7 +807,7 @@ class HFLM(LM): ...@@ -815,7 +807,7 @@ class HFLM(LM):
return re_ord.get_original(res) return re_ord.get_original(res)
def greedy_until(self, requests): def generate_until(self, requests):
res = defaultdict(list) res = defaultdict(list)
re_ords = {} re_ords = {}
...@@ -838,13 +830,20 @@ class HFLM(LM): ...@@ -838,13 +830,20 @@ class HFLM(LM):
re_ords[key] = utils.Reorderer([req.args for req in reqs], _collate) re_ords[key] = utils.Reorderer([req.args for req in reqs], _collate)
pbar = tqdm(total=len(requests), disable=(self.rank != 0)) pbar = tqdm(total=len(requests), disable=(self.rank != 0))
if self.batch_size == "auto":
# using rolling window with maximum context
print("Passed argument batch_size = auto. Detecting largest batch size")
batch_size = self._detect_batch_size()
print(f"Determined Largest batch size: {batch_size}")
adaptive_batch_size = batch_size
# for each different set of kwargs, we execute all requests, by batch. # for each different set of kwargs, we execute all requests, by batch.
for key, re_ord in re_ords.items(): for key, re_ord in re_ords.items():
for chunk in utils.chunks( chunks = utils.chunks(
re_ord.get_reordered(), re_ord.get_reordered(),
self.batch_size, n=self.batch_size if self.batch_size != "auto" else adaptive_batch_size if adaptive_batch_size is not None else 0,
): fn=self._batch_scheduler if self.batch_size == "auto" and not adaptive_batch_size else None,
)
for chunk in tqdm(chunks, disable=self.rank != 0):
contexts, all_gen_kwargs = zip(*chunk) contexts, all_gen_kwargs = zip(*chunk)
# we assume all gen kwargs in the batch are the same # we assume all gen kwargs in the batch are the same
# this is safe to assume because the `grouper` object ensures it. # this is safe to assume because the `grouper` object ensures it.
...@@ -920,7 +919,7 @@ class HFLM(LM): ...@@ -920,7 +919,7 @@ class HFLM(LM):
res[key].append(s) res[key].append(s)
self.cache_hook.add_partial( self.cache_hook.add_partial(
"greedy_until", (context, gen_kwargs), s "generate_until", (context, gen_kwargs), s
) )
pbar.update(1) pbar.update(1)
# reorder this group of results back to original unsorted form # reorder this group of results back to original unsorted form
......
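The bucketed caching in `_batch_scheduler`, which this diff lifts from a nested function to a method on `HFLM`, can be illustrated on its own. The sketch below is an assumption-laden re-implementation rather than the harness's class: `BatchScheduler` and `fake_detect` are hypothetical names, `fake_detect` replaces `HFLM._detect_batch_size`, and the lookup of the previous bucket uses `.get` as a defensive deviation from the indexing shown in the diff.

```python
from typing import Callable, Dict, List, Sequence


class BatchScheduler:
    """Illustrative only: caches one detected batch size per schedule bucket."""

    def __init__(
        self,
        batch_schedule: int,
        max_batch_size: int,
        detect: Callable[[Sequence, int], int],
    ) -> None:
        self.batch_schedule = batch_schedule
        self.max_batch_size = max_batch_size
        self.detect = detect  # hypothetical stand-in for _detect_batch_size
        self.batch_sizes: Dict[int, int] = {}

    def __call__(self, pos: int, requests: Sequence) -> int:
        # Map the position in the reordered request list to a schedule bucket.
        sched = pos // int(len(requests) / self.batch_schedule)
        if sched in self.batch_sizes:
            return self.batch_sizes[sched]
        # If the previous bucket already reached the cap, skip re-detection.
        if (
            len(self.batch_sizes) > 1
            and self.batch_sizes.get(sched - 1) == self.max_batch_size
        ):
            self.batch_sizes[sched] = self.max_batch_size
            return self.batch_sizes[sched]
        self.batch_sizes[sched] = self.detect(requests, pos)
        return self.batch_sizes[sched]


calls: List[int] = []


def fake_detect(requests: Sequence, pos: int) -> int:
    """Pretend detection: a smaller batch for the first (longest) bucket."""
    calls.append(pos)
    return 32 if pos == 0 else 64


scheduler = BatchScheduler(batch_schedule=4, max_batch_size=64, detect=fake_detect)
requests = list(range(100))
sizes = [scheduler(pos, requests) for pos in (0, 10, 25, 50, 75)]
print(sizes)  # [32, 32, 64, 64, 64]
print(calls)  # [0, 25]
```

With 100 requests and a schedule of 4, each bucket spans 25 positions, so detection runs once for bucket 0 and once for bucket 1; later buckets reuse the cap. The sketch assumes the request list is at least as long as `batch_schedule`, since the bucket width would otherwise round down to zero.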
...@@ -203,7 +203,7 @@ class OpenaiCompletionsLM(LM): ...@@ -203,7 +203,7 @@ class OpenaiCompletionsLM(LM):
self.cache_hook.add_partial("loglikelihood", cache_key, answer) self.cache_hook.add_partial("loglikelihood", cache_key, answer)
return re_ord.get_original(res) return re_ord.get_original(res)
def greedy_until(self, requests) -> List[str]: def generate_until(self, requests) -> List[str]:
if not requests: if not requests:
return [] return []
res = [] res = []
...@@ -260,7 +260,7 @@ class OpenaiCompletionsLM(LM): ...@@ -260,7 +260,7 @@ class OpenaiCompletionsLM(LM):
# partial caching # partial caching
self.cache_hook.add_partial( self.cache_hook.add_partial(
"greedy_until", (context, {"until": until_}), s "generate_until", (context, {"until": until_}), s
) )
res.append(s) res.append(s)
...@@ -271,7 +271,7 @@ class OpenaiCompletionsLM(LM): ...@@ -271,7 +271,7 @@ class OpenaiCompletionsLM(LM):
raise NotImplementedError() raise NotImplementedError()
def _model_generate(self, context, max_length, eos_token_id): def _model_generate(self, context, max_length, eos_token_id):
# Isn't used because we override greedy_until # Isn't used because we override generate_until
raise NotImplementedError() raise NotImplementedError()
def loglikelihood_rolling(self, requests) -> List[float]: def loglikelihood_rolling(self, requests) -> List[float]:
......
...@@ -58,7 +58,7 @@ class TextSynthLM(LM): ...@@ -58,7 +58,7 @@ class TextSynthLM(LM):
@property @property
def eot_token_id(self): def eot_token_id(self):
# Isn't used because we override loglikelihood, loglikelihood_rolling and greedy_until # Isn't used because we override loglikelihood, loglikelihood_rolling and generate_until
raise NotImplementedError() raise NotImplementedError()
@property @property
...@@ -72,20 +72,20 @@ class TextSynthLM(LM): ...@@ -72,20 +72,20 @@ class TextSynthLM(LM):
@property @property
def batch_size(self): def batch_size(self):
# Isn't used because we override loglikelihood, loglikelihood_rolling and greedy_until # Isn't used because we override loglikelihood, loglikelihood_rolling and generate_until
raise NotImplementedError() raise NotImplementedError()
@property @property
def device(self): def device(self):
# Isn't used because we override loglikelihood, loglikelihood_rolling and greedy_until # Isn't used because we override loglikelihood, loglikelihood_rolling and generate_until
raise NotImplementedError() raise NotImplementedError()
def tok_encode(self, string: str): def tok_encode(self, string: str):
# Isn't used because we override loglikelihood, loglikelihood_rolling and greedy_until # Isn't used because we override loglikelihood, loglikelihood_rolling and generate_until
raise NotImplementedError() raise NotImplementedError()
def tok_decode(self, tokens): def tok_decode(self, tokens):
# Isn't used because we override loglikelihood, loglikelihood_rolling and greedy_until # Isn't used because we override loglikelihood, loglikelihood_rolling and generate_until
raise NotImplementedError() raise NotImplementedError()
def loglikelihood(self, requests): def loglikelihood(self, requests):
...@@ -122,7 +122,7 @@ class TextSynthLM(LM): ...@@ -122,7 +122,7 @@ class TextSynthLM(LM):
"input tokenization support from TextSynth." "input tokenization support from TextSynth."
) )
def greedy_until(self, requests): def generate_until(self, requests):
if not requests: if not requests:
return [] return []
...@@ -146,7 +146,7 @@ class TextSynthLM(LM): ...@@ -146,7 +146,7 @@ class TextSynthLM(LM):
s = resp["text"] s = resp["text"]
res.append(s) res.append(s)
self.cache_hook.add_partial("greedy_until", (inp, request_args), s) self.cache_hook.add_partial("generate_until", (inp, request_args), s)
else: else:
logger.error( logger.error(
f"The following response does not contain generated `text`. " f"The following response does not contain generated `text`. "
...@@ -160,5 +160,5 @@ class TextSynthLM(LM): ...@@ -160,5 +160,5 @@ class TextSynthLM(LM):
raise NotImplementedError() raise NotImplementedError()
def _model_generate(self, context, max_length, eos_token_id): def _model_generate(self, context, max_length, eos_token_id):
# Isn't used because we override greedy_until # Isn't used because we override generate_until
raise NotImplementedError() raise NotImplementedError()
...@@ -59,6 +59,7 @@ Boxes should be checked iff tasks are implemented in the refactor and tested for ...@@ -59,6 +59,7 @@ Boxes should be checked iff tasks are implemented in the refactor and tested for
- [x] MGSM - [x] MGSM
- [ ] SCROLLS - [ ] SCROLLS
- [x] Babi - [x] Babi
- [x] Belebele
# Novel Tasks # Novel Tasks
Tasks added in the revamped harness that were not previously available. Again, a strikethrough denotes checking performed *against the original task's implementation or published results introducing the task*. Tasks added in the revamped harness that were not previously available. Again, a strikethrough denotes checking performed *against the original task's implementation or published results introducing the task*.
......
...@@ -4,7 +4,6 @@ from typing import List, Union, Dict ...@@ -4,7 +4,6 @@ from typing import List, Union, Dict
from lm_eval import utils from lm_eval import utils
from lm_eval import prompts from lm_eval import prompts
from lm_eval.logger import eval_logger
from lm_eval.api.task import TaskConfig, Task, ConfigurableTask from lm_eval.api.task import TaskConfig, Task, ConfigurableTask
from lm_eval.api.registry import ( from lm_eval.api.registry import (
register_task, register_task,
...@@ -14,6 +13,9 @@ from lm_eval.api.registry import ( ...@@ -14,6 +13,9 @@ from lm_eval.api.registry import (
ALL_TASKS, ALL_TASKS,
) )
import logging
eval_logger = logging.getLogger('lm-eval')
def register_configurable_task(config: Dict[str, str]) -> int: def register_configurable_task(config: Dict[str, str]) -> int:
SubClass = type( SubClass = type(
...@@ -27,7 +29,9 @@ def register_configurable_task(config: Dict[str, str]) -> int: ...@@ -27,7 +29,9 @@ def register_configurable_task(config: Dict[str, str]) -> int:
register_task(task_name)(SubClass) register_task(task_name)(SubClass)
if "group" in config: if "group" in config:
if type(config["group"]) == str: if config["group"] == config["task"]:
raise ValueError("task and group name cannot be the same")
elif type(config["group"]) == str:
group_name = [config["group"]] group_name = [config["group"]]
else: else:
group_name = config["group"] group_name = config["group"]
...@@ -45,7 +49,6 @@ def register_configurable_group(config: Dict[str, str], yaml_path: str = None) - ...@@ -45,7 +49,6 @@ def register_configurable_group(config: Dict[str, str], yaml_path: str = None) -
task_list = [task for task in all_task_list if type(task) == str] task_list = [task for task in all_task_list if type(task) == str]
for task_config in config_list: for task_config in config_list:
task_config = utils.load_yaml_config(yaml_path, task_config) task_config = utils.load_yaml_config(yaml_path, task_config)
var_configs = check_prompt_config( var_configs = check_prompt_config(
{ {
...@@ -97,7 +100,7 @@ def check_prompt_config( ...@@ -97,7 +100,7 @@ def check_prompt_config(
] ]
) )
}, },
**{"output_type": "greedy_until"}, **{"output_type": "generate_until"},
} }
) )
else: else:
...@@ -137,20 +140,20 @@ def include_task_folder(task_dir: str, register_task: bool = True) -> None: ...@@ -137,20 +140,20 @@ def include_task_folder(task_dir: str, register_task: bool = True) -> None:
else: else:
if type(config["task"]) == list: if type(config["task"]) == list:
register_configurable_group(config, yaml_path) register_configurable_group(config, yaml_path)
except ModuleNotFoundError as e:
eval_logger.warning(
f"{yaml_path}: {e}. Config will not be added to registry."
)
except Exception as error: except Exception as error:
if eval_logger.verbose: import traceback
import traceback
eval_logger.debug(
eval_logger.warning( "Failed to load config in\n"
"Failed to load config in\n" f" {yaml_path}\n"
f" {yaml_path}\n" " Config will not be added to registry\n"
" Config will not be added to registry\n" f" Error: {error}\n"
f" Error: {error}\n" f" Traceback: {traceback.format_exc()}"
f" Traceback: {traceback.format_exc()}" )
)
else:
eval_logger.warning("Yaml failed to register {yaml_path}\n")
return 0 return 0
...@@ -190,7 +193,6 @@ def get_task_name_from_object(task_object): ...@@ -190,7 +193,6 @@ def get_task_name_from_object(task_object):
# TODO: pass num_fewshot and other cmdline overrides in a better way # TODO: pass num_fewshot and other cmdline overrides in a better way
def get_task_dict(task_name_list: List[Union[str, Dict, Task]], **kwargs): def get_task_dict(task_name_list: List[Union[str, Dict, Task]], **kwargs):
config = {**kwargs} config = {**kwargs}
task_name_from_registry_dict = {} task_name_from_registry_dict = {}
...@@ -202,7 +204,6 @@ def get_task_dict(task_name_list: List[Union[str, Dict, Task]], **kwargs): ...@@ -202,7 +204,6 @@ def get_task_dict(task_name_list: List[Union[str, Dict, Task]], **kwargs):
for task_element in task_name_list: for task_element in task_name_list:
if isinstance(task_element, str): if isinstance(task_element, str):
if task_element in GROUP_REGISTRY: if task_element in GROUP_REGISTRY:
group_name = task_element group_name = task_element
for task_name in GROUP_REGISTRY[task_element]: for task_name in GROUP_REGISTRY[task_element]:
...@@ -240,7 +241,6 @@ def get_task_dict(task_name_list: List[Union[str, Dict, Task]], **kwargs): ...@@ -240,7 +241,6 @@ def get_task_dict(task_name_list: List[Union[str, Dict, Task]], **kwargs):
} }
elif isinstance(task_element, Task): elif isinstance(task_element, Task):
task_name_from_object_dict = { task_name_from_object_dict = {
**task_name_from_object_dict, **task_name_from_object_dict,
get_task_name_from_object(task_element): task_element, get_task_name_from_object(task_element): task_element,
......