Commit 6ac42518 authored by lintangsutawika

Merge branch 'big-refactor' of https://github.com/EleutherAI/lm-evaluation-harness into openai_completions
parents 9c3ba7d4 e3644fcc
...@@ -63,10 +63,10 @@ jobs: ...@@ -63,10 +63,10 @@ jobs:
- name: Test with pytest - name: Test with pytest
# if new tasks are added, run tests on them # if new tasks are added, run tests on them
if: steps.changed-tasks.outputs.tasks_any_modified == 'true' if: steps.changed-tasks.outputs.tasks_any_modified == 'true'
run: python -m pytest tests/test_tasks.py -s -vv -n=auto run: python -m pytest tests/test_tasks.py -s -vv
# if api is modified, run tests on it # if api is modified, run tests on it
- name: Test more tasks with pytest - name: Test more tasks with pytest
env: env:
API: true API: true
if: steps.changed-tasks.outputs.api_any_modified == 'true' if: steps.changed-tasks.outputs.api_any_modified == 'true'
run: python -m pytest tests/test_tasks.py -s -vv -n=auto run: python -m pytest tests/test_tasks.py -s -vv
...@@ -22,10 +22,10 @@ jobs: ...@@ -22,10 +22,10 @@ jobs:
steps: steps:
- name: Checkout Code - name: Checkout Code
uses: actions/checkout@v3 uses: actions/checkout@v3
- name: Set up Python 3.9 - name: Set up Python 3.8
uses: actions/setup-python@v4 uses: actions/setup-python@v4
with: with:
python-version: 3.9 python-version: 3.8
cache: pip cache: pip
cache-dependency-path: setup.py cache-dependency-path: setup.py
- name: Install dependencies - name: Install dependencies
...@@ -40,7 +40,7 @@ jobs: ...@@ -40,7 +40,7 @@ jobs:
flake8 . --count --select=F,E9,E71,E72,E501,E112,E113,W6 --extend-ignore=F541 --show-source --statistics --exit-zero flake8 . --count --select=F,E9,E71,E72,E501,E112,E113,W6 --extend-ignore=F541 --show-source --statistics --exit-zero
# exit-zero treats all errors as warnings. The GitHub editor is 127 chars wide # exit-zero treats all errors as warnings. The GitHub editor is 127 chars wide
flake8 . --count --exit-zero --max-complexity=10 --max-line-length=127 --statistics flake8 . --count --exit-zero --max-complexity=10 --max-line-length=127 --statistics
# mypy turned off for now # # mypy turned off for now
# - name: Lint with mypy # - name: Lint with mypy
# run: mypy . --ignore-missing-imports --check-untyped-defs --explicit-package-bases --warn-unreachable # run: mypy . --ignore-missing-imports --check-untyped-defs --explicit-package-bases --warn-unreachable
# Job 2 # Job 2
...@@ -49,9 +49,8 @@ jobs: ...@@ -49,9 +49,8 @@ jobs:
runs-on: ubuntu-latest runs-on: ubuntu-latest
strategy: strategy:
matrix: matrix:
python-version: [ "3.9", "3.10", "3.11" ] python-version: [ "3.8", "3.9", "3.10", "3.11" ]
timeout-minutes: 30 timeout-minutes: 30
steps: steps:
- name: Checkout Code - name: Checkout Code
uses: actions/checkout@v3 uses: actions/checkout@v3
......
...@@ -33,14 +33,13 @@ repos: ...@@ -33,14 +33,13 @@ repos:
rev: 22.3.0 rev: 22.3.0
hooks: hooks:
- id: black - id: black
language_version: python3.9
- repo: https://github.com/codespell-project/codespell - repo: https://github.com/codespell-project/codespell
rev: v2.1.0 rev: v2.1.0
hooks: hooks:
- id: codespell - id: codespell
exclude: > exclude: >
(?x)^( (?x)^(
.*\.json|ignore.txt .*\.json|ignore.txt|lm_eval/tasks/.*|.*yaml
)$ )$
args: [--check-filenames, --check-hidden, --ignore-words=ignore.txt] args: [--check-filenames, --check-hidden, --ignore-words=ignore.txt]
- repo: https://github.com/pre-commit/mirrors-mypy - repo: https://github.com/pre-commit/mirrors-mypy
......
* @haileyschoelkopf @lintangsutawika * @haileyschoelkopf @lintangsutawika @StellaAthena
# Language Model Evaluation Harness # Language Model Evaluation Harness
## Notice to Users
(as of 6/15/23)
We have a revamp of the Evaluation Harness library internals staged on the [big-refactor](https://github.com/EleutherAI/lm-evaluation-harness/tree/big-refactor) branch! It is far along in progress, but before we start to move the `master` branch of the repository over to this new design with a new version release, we'd like to ensure that it's been tested by outside users and there are no glaring bugs.
We’d like your help to test it out! you can help by:
1. Trying out your current workloads on the big-refactor branch, and seeing if anything breaks or is counterintuitive,
2. Porting tasks supported in the previous version of the harness to the new YAML configuration format. Please check out our [task implementation guide](https://github.com/EleutherAI/lm-evaluation-harness/blob/big-refactor/docs/new_task_guide.md) for more information.
If you choose to port a task not yet completed according to [our checklist](https://github.com/EleutherAI/lm-evaluation-harness/blob/big-refactor/lm_eval/tasks/README.md), then you can contribute it by opening a PR containing [Refactor] in the name with:
- A command of the form `python main.py --model hf --model_args ..... --tasks <task name> ...` which will run the task in the `master` branch, and what the score is
- A command of the form `python main.py --model hf --model_args ..... --tasks <task name> ...` to run the task in your PR branch to `big-refactor`, and what the resulting score is, to show that we achieve equality between the two implementations.
Lastly, we'll no longer be accepting new feature requests beyond those that are already open to the master branch as we carry out this switch to the new version over the next week, though we will be accepting bugfixes to `master` branch and PRs to `big-refactor`. Feel free to reach out in the #lm-thunderdome channel of the EAI discord for more information.
## Overview ## Overview
This project provides a unified framework to test generative language models on a large number of different evaluation tasks. This project provides a unified framework to test generative language models on a large number of different evaluation tasks.
Features: Features:
- Many tasks implemented, 200+ tasks [implemented in the old framework](https://github.com/EleutherAI/lm-evaluation-harness/blob/master/docs/task_table.md) which require porting to the new setup as described in [the new task guide](https://github.com/EleutherAI/lm-evaluation-harness/blob/big-refactor/docs/new_task_guide.md). - Over 60 standard academic benchmarks for LLMs, with hundreds of subtasks and variants implemented.
- Support for models loaded via [transformers](https://github.com/huggingface/transformers/) (including quantization via [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ)), [GPT-NeoX](https://github.com/EleutherAI/gpt-neox), and [Megatron-DeepSpeed](https://github.com/microsoft/Megatron-DeepSpeed/), with a flexible tokenization-agnostic interface. - Support for models loaded via [transformers](https://github.com/huggingface/transformers/) (including quantization via [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ)), [GPT-NeoX](https://github.com/EleutherAI/gpt-neox), and [Megatron-DeepSpeed](https://github.com/microsoft/Megatron-DeepSpeed/), with a flexible tokenization-agnostic interface.
- Support for commercial APIs including [OpenAI](https://openai.com), [goose.ai](https://goose.ai), and [TextSynth](https://textsynth.com/). - Support for commercial APIs including [OpenAI](https://openai.com), [goose.ai](https://goose.ai), and [TextSynth](https://textsynth.com/).
- Support for evaluation on adapters (e.g. LoRa) supported in [HuggingFace's PEFT library](https://github.com/huggingface/peft). - Support for evaluation on adapters (e.g. LoRA) supported in [HuggingFace's PEFT library](https://github.com/huggingface/peft).
- Evaluating with publicly available prompts ensures reproducibility and comparability between papers. - Support for local models and benchmarks.
- Evaluation with publicly available prompts ensures reproducibility and comparability between papers.
The Language Model Evaluation Harness is the backend for 🤗 Hugging Face's popular [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard) and is used internally by dozens of companies including NVIDIA, Cohere, Booz Allen Hamilton, and Mosaic ML.
## Install ## Install
To install the `lm-eval` refactor branch from the github repository, run: To install the `lm-eval` package from the github repository, run:
```bash ```bash
git clone https://github.com/EleutherAI/lm-evaluation-harness git clone https://github.com/EleutherAI/lm-evaluation-harness
...@@ -54,7 +44,6 @@ To install the package with all extras, run ...@@ -54,7 +44,6 @@ To install the package with all extras, run
pip install -e ".[all]" pip install -e ".[all]"
``` ```
## Support ## Support
The best way to get support is to open an issue on this repo or join the [EleutherAI discord server](https://discord.gg/eleutherai). The `#lm-thunderdome` channel is dedicated to developing this project and the `#release-discussion` channel is for receiving support for our releases. The best way to get support is to open an issue on this repo or join the [EleutherAI discord server](https://discord.gg/eleutherai). The `#lm-thunderdome` channel is dedicated to developing this project and the `#release-discussion` channel is for receiving support for our releases.
...@@ -67,7 +56,7 @@ To evaluate a model hosted on the [HuggingFace Hub](https://huggingface.co/model ...@@ -67,7 +56,7 @@ To evaluate a model hosted on the [HuggingFace Hub](https://huggingface.co/model
```bash ```bash
python main.py \ python -m lm_eval \
--model hf \ --model hf \
--model_args pretrained=EleutherAI/gpt-j-6B \ --model_args pretrained=EleutherAI/gpt-j-6B \
--tasks hellaswag \ --tasks hellaswag \
...@@ -78,7 +67,7 @@ python main.py \ ...@@ -78,7 +67,7 @@ python main.py \
Additional arguments can be provided to the model constructor using the `--model_args` flag. Most notably, this supports the common practice of using the `revisions` feature on the Hub to store partially trained checkpoints, or to specify the datatype for running a model: Additional arguments can be provided to the model constructor using the `--model_args` flag. Most notably, this supports the common practice of using the `revisions` feature on the Hub to store partially trained checkpoints, or to specify the datatype for running a model:
```bash ```bash
python main.py \ python -m lm_eval \
--model hf \ --model hf \
--model_args pretrained=EleutherAI/pythia-160m,revision=step100000,dtype="float" \ --model_args pretrained=EleutherAI/pythia-160m,revision=step100000,dtype="float" \
--tasks lambada_openai,hellaswag \ --tasks lambada_openai,hellaswag \
...@@ -86,12 +75,12 @@ python main.py \ ...@@ -86,12 +75,12 @@ python main.py \
--batch_size 8 --batch_size 8
``` ```
Models that are loaded via either `transformers.AutoModelForCausalLM` (autoregressive, decoder-only GPT style models) or `transformers.AutoModelForSeq2SeqLM` (such as encoder-decoder models like T5) in Huggingface are supported via Support for this model type is currently pending. Models that are loaded via both `transformers.AutoModelForCausalLM` (autoregressive, decoder-only GPT style models) and `transformers.AutoModelForSeq2SeqLM` (such as encoder-decoder models like T5) in Huggingface are supported.
Batch size selection can be automated by setting the ```--batch_size``` flag to ```auto```. This will perform automatic detection of the largest batch size that will fit on your device. On tasks where there is a large difference between the longest and shortest example, it can be helpful to periodically recompute the largest batch size, to gain a further speedup. To do this, append ```:N``` to above flag to automatically recompute the largest batch size ```N``` times. For example, to recompute the batch size 4 times, the command would be: Batch size selection can be automated by setting the ```--batch_size``` flag to ```auto```. This will perform automatic detection of the largest batch size that will fit on your device. On tasks where there is a large difference between the longest and shortest example, it can be helpful to periodically recompute the largest batch size, to gain a further speedup. To do this, append ```:N``` to above flag to automatically recompute the largest batch size ```N``` times. For example, to recompute the batch size 4 times, the command would be:
```bash ```bash
python main.py \ python -m lm_eval \
--model hf \ --model hf \
--model_args pretrained=EleutherAI/pythia-160m,revision=step100000,dtype="float" \ --model_args pretrained=EleutherAI/pythia-160m,revision=step100000,dtype="float" \
--tasks lambada_openai,hellaswag \ --tasks lambada_openai,hellaswag \
...@@ -99,7 +88,7 @@ python main.py \ ...@@ -99,7 +88,7 @@ python main.py \
--batch_size auto:4 --batch_size auto:4
``` ```
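The `auto:N` form of `--batch_size` described above can be pictured with a small parser. This is a minimal sketch under stated assumptions, not the harness's own code: `parse_batch_size` is a hypothetical helper name, and treating a bare `auto` as zero recomputations is an assumption of this sketch.

```python
from typing import Tuple, Union


def parse_batch_size(value: str) -> Tuple[Union[int, str], int]:
    """Split a --batch_size value into (batch_size, recompute_count).

    Hypothetical helper for illustration only.
    """
    if value == "auto" or value.startswith("auto:"):
        # "auto:4" -> suffix "4"; plain "auto" -> empty suffix
        _, _, suffix = value.partition(":")
        # Assumption: plain "auto" means the batch size is detected once
        # and never recomputed.
        repeats = int(suffix) if suffix else 0
        return "auto", repeats
    # Otherwise the value must be a fixed integer batch size.
    return int(value), 0
```

A fixed value such as `8` passes through as an integer, while `auto:4` yields the marker `"auto"` together with the recompute count.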
Alternatively, you can use `lm-eval` instead of `python main.py` to call lm eval from anywhere. Alternatively, you can use `lm-eval` or `lm_eval` instead of `python -m lm_eval` to call lm eval from anywhere.
### Multi-GPU Evaluation with Hugging Face `accelerate` ### Multi-GPU Evaluation with Hugging Face `accelerate`
...@@ -108,7 +97,7 @@ To parallelize evaluation of HuggingFace models across multiple GPUs, we allow f ...@@ -108,7 +97,7 @@ To parallelize evaluation of HuggingFace models across multiple GPUs, we allow f
The first is performed by launching evaluation via the `accelerate` library as follows: The first is performed by launching evaluation via the `accelerate` library as follows:
``` ```
accelerate launch main.py \ accelerate launch -m lm_eval \
--model hf \ --model hf \
--tasks lambada_openai,arc_easy \ --tasks lambada_openai,arc_easy \
--batch_size 16 \ --batch_size 16 \
...@@ -121,7 +110,7 @@ If your model is *is too large to be run on a single one of your GPUs* then you ...@@ -121,7 +110,7 @@ If your model is *is too large to be run on a single one of your GPUs* then you
We also provide a second method to run these large models: use of the `parallelize` argument. We also provide a second method to run these large models: use of the `parallelize` argument.
``` ```
python main.py \ python -m lm_eval \
--model hf \ --model hf \
--model_args pretrained=EleutherAI/pythia-12b,parallelize=True \ --model_args pretrained=EleutherAI/pythia-12b,parallelize=True \
--tasks lambada_openai,arc_easy \ --tasks lambada_openai,arc_easy \
...@@ -136,7 +125,7 @@ To pass even more advanced keyword arguments to `accelerate`, we allow for the f ...@@ -136,7 +125,7 @@ To pass even more advanced keyword arguments to `accelerate`, we allow for the f
Note that this method naively splits models across GPUs, resulting in only a single GPU performing work at any point in time, and so is much slower than launching with `accelerate launch`, possibly by a factor of the total # of GPUs. Note that this method naively splits models across GPUs, resulting in only a single GPU performing work at any point in time, and so is much slower than launching with `accelerate launch`, possibly by a factor of the total # of GPUs.
**Note that this option requires launching evaluation via `python main.py` rather than `accelerate launch main.py`.** **Note that this option requires launching evaluation via `python -m lm_eval` rather than `accelerate launch -m lm_eval`.**
To use `accelerate` with the `lm-eval` command, use To use `accelerate` with the `lm-eval` command, use
``` ```
...@@ -151,14 +140,14 @@ A full accounting of the supported and planned libraries + APIs can be seen belo ...@@ -151,14 +140,14 @@ A full accounting of the supported and planned libraries + APIs can be seen belo
| API or Inference Server | Implemented? | `--model <xxx>` name | Models supported: | Request Types: | | API or Inference Server | Implemented? | `--model <xxx>` name | Models supported: | Request Types: |
|-----------------------------|---------------------------------|----------------------------------------------------------------------------------|--------------------------------------|----------------------------------------------------------| |-----------------------------|---------------------------------|----------------------------------------------------------------------------------|--------------------------------------|----------------------------------------------------------|
| OpenAI Completions | :heavy_check_mark: | `openai`, `openai-completions`, `gooseai` | up to `code-davinci-002` | `greedy_until`, `loglikelihood`, `loglikelihood_rolling` | | OpenAI Completions | :heavy_check_mark: | `openai`, `openai-completions`, `gooseai` | up to `code-davinci-002` | `generate_until`, `loglikelihood`, `loglikelihood_rolling` |
| OpenAI ChatCompletions | :x: Not yet - needs help! | N/A | (link here?) | `greedy_until` (no logprobs) | | OpenAI ChatCompletions | :x: Not yet - needs testing! | N/A | [All ChatCompletions API models](https://platform.openai.com/docs/guides/gpt) | `generate_until` (no logprobs) |
| Anthropic | :heavy_check_mark: | `anthropic` | [Supported Anthropic Engines](https://docs.anthropic.com/claude/reference/selecting-a-model) | `greedy_until` (no logprobs) | | Anthropic | :heavy_check_mark: | `anthropic` | [Supported Anthropic Engines](https://docs.anthropic.com/claude/reference/selecting-a-model) | `generate_until` (no logprobs) |
| GooseAI | :heavy_check_mark: (not separately maintained) | `openai`, `openai-completions`, `gooseai` (same interface as OpenAI Completions) | | `greedy_until`, `loglikelihood`, `loglikelihood_rolling` | | GooseAI | :heavy_check_mark: (not separately maintained) | `openai`, `openai-completions`, `gooseai` (same interface as OpenAI Completions) | | `generate_until`, `loglikelihood`, `loglikelihood_rolling` |
| Textsynth | Needs testing | `textsynth` | ??? | `greedy_until`, `loglikelihood`, `loglikelihood_rolling` | | Textsynth | Needs testing | `textsynth` | ??? | `generate_until`, `loglikelihood`, `loglikelihood_rolling` |
| Cohere | :hourglass: - blocked on Cohere API bug | N/A | [All `cohere.generate()` engines](https://docs.cohere.com/docs/models) | `greedy_until`, `loglikelihood`, `loglikelihood_rolling` | | Cohere | :hourglass: - blocked on Cohere API bug | N/A | [All `cohere.generate()` engines](https://docs.cohere.com/docs/models) | `generate_until`, `loglikelihood`, `loglikelihood_rolling` |
| GGML | :hourglass: [PR](https://github.com/EleutherAI/lm-evaluation-harness/pull/617) | N/A | ??? | `greedy_until`, `loglikelihood`, `loglikelihood_rolling` | | GGML | :hourglass: [PR](https://github.com/EleutherAI/lm-evaluation-harness/pull/617) | N/A | ??? | `generate_until`, `loglikelihood`, `loglikelihood_rolling` |
| vLLM | :x: Not yet - needs help! | N/A | All HF models | `greedy_until` (no logprobs) | | vLLM | :x: Not yet - needs help! | N/A | All HF models | `generate_until` (no logprobs) |
| Your inference server here! | ... | ... | ... | ... | | ... | | Your inference server here! | ... | ... | ... | ... | | ... |
It is on our roadmap to create task variants designed to enable models which do not serve logprobs/loglikelihoods to be compared with generation performance of open-source models. It is on our roadmap to create task variants designed to enable models which do not serve logprobs/loglikelihoods to be compared with generation performance of open-source models.
...@@ -167,7 +156,7 @@ Our library supports language models served via the OpenAI Completions API as fo ...@@ -167,7 +156,7 @@ Our library supports language models served via the OpenAI Completions API as fo
```bash ```bash
export OPENAI_API_SECRET_KEY=YOUR_KEY_HERE export OPENAI_API_SECRET_KEY=YOUR_KEY_HERE
python main.py \ python -m lm_eval \
--model openai-completions \ --model openai-completions \
--model_args engine=davinci \ --model_args engine=davinci \
--tasks lambada_openai,hellaswag --tasks lambada_openai,hellaswag
...@@ -198,7 +187,7 @@ This will write out one text file for each task. ...@@ -198,7 +187,7 @@ This will write out one text file for each task.
To verify the data integrity of the tasks you're performing in addition to running the tasks themselves, you can use the `--check_integrity` flag: To verify the data integrity of the tasks you're performing in addition to running the tasks themselves, you can use the `--check_integrity` flag:
```bash ```bash
python main.py \ python -m lm_eval \
--model openai \ --model openai \
--model_args engine=davinci \ --model_args engine=davinci \
--tasks lambada_openai,hellaswag \ --tasks lambada_openai,hellaswag \
...@@ -209,7 +198,7 @@ python main.py \ ...@@ -209,7 +198,7 @@ python main.py \
For models loaded with the HuggingFace `transformers` library, any arguments provided via `--model_args` get passed to the relevant constructor directly. This means that anything you can do with `AutoModel` can be done with our library. For example, you can pass a local path via `pretrained=` or use models finetuned with [PEFT](https://github.com/huggingface/peft) by taking the call you would run to evaluate the base model and add `,peft=PATH` to the `model_args` argument: For models loaded with the HuggingFace `transformers` library, any arguments provided via `--model_args` get passed to the relevant constructor directly. This means that anything you can do with `AutoModel` can be done with our library. For example, you can pass a local path via `pretrained=` or use models finetuned with [PEFT](https://github.com/huggingface/peft) by taking the call you would run to evaluate the base model and add `,peft=PATH` to the `model_args` argument:
```bash ```bash
python main.py \ python -m lm_eval \
--model hf \ --model hf \
--model_args pretrained=EleutherAI/gpt-j-6b,parallelize=True,load_in_4bit=True,peft=nomic-ai/gpt4all-j-lora \ --model_args pretrained=EleutherAI/gpt-j-6b,parallelize=True,load_in_4bit=True,peft=nomic-ai/gpt4all-j-lora \
--tasks openbookqa,arc_easy,winogrande,hellaswag,arc_challenge,piqa,boolq \ --tasks openbookqa,arc_easy,winogrande,hellaswag,arc_challenge,piqa,boolq \
...@@ -219,7 +208,7 @@ python main.py \ ...@@ -219,7 +208,7 @@ python main.py \
[GPTQ](https://github.com/PanQiWei/AutoGPTQ) quantized models can be loaded by specifying their file names in `,gptq=NAME` (or `,gptq=True` for default names) in the `model_args` argument: [GPTQ](https://github.com/PanQiWei/AutoGPTQ) quantized models can be loaded by specifying their file names in `,gptq=NAME` (or `,gptq=True` for default names) in the `model_args` argument:
```bash ```bash
python main.py \ python -m lm_eval \
--model hf \ --model hf \
--model_args pretrained=model-name-or-path,gptq=model.safetensors,gptq_use_triton=True \ --model_args pretrained=model-name-or-path,gptq=model.safetensors,gptq_use_triton=True \
--tasks hellaswag --tasks hellaswag
...@@ -227,12 +216,6 @@ python main.py \ ...@@ -227,12 +216,6 @@ python main.py \
We support wildcards in task names; for example, you can run all of the machine-translated lambada tasks via `--tasks lambada_openai_mt_*`. We support wildcards in task names; for example, you can run all of the machine-translated lambada tasks via `--tasks lambada_openai_mt_*`.
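As a rough illustration of how wildcard task names could be expanded, the sketch below uses shell-style matching from Python's standard `fnmatch` module. Whether the harness matches names this way is an assumption; `expand_task_patterns` and the `ILLUSTRATIVE_TASKS` list are hypothetical and exist only for this example.

```python
import fnmatch
from typing import Iterable, List


def expand_task_patterns(patterns: Iterable[str], known_tasks: Iterable[str]) -> List[str]:
    """Expand shell-style patterns against a list of task names.

    Hypothetical helper: returns matches in sorted order per pattern,
    without duplicates.
    """
    known = sorted(known_tasks)
    selected: List[str] = []
    for pattern in patterns:
        # fnmatch.filter keeps the names that match the pattern
        for name in fnmatch.filter(known, pattern):
            if name not in selected:
                selected.append(name)
    return selected


# Illustrative registry, not the harness's real task list.
ILLUSTRATIVE_TASKS = [
    "hellaswag",
    "lambada_openai",
    "lambada_openai_mt_de",
    "lambada_openai_mt_fr",
]

matched = expand_task_patterns(["lambada_openai_mt_*"], ILLUSTRATIVE_TASKS)
print(matched)
```

When passing a pattern on the command line, quoting it (for example `'lambada_openai_mt_*'`) keeps the shell from expanding the `*` against local file names first.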
## Implementing new tasks
To implement a new task in the eval harness, see [this guide](./docs/new_task_guide.md).
As a start, we currently only support one prompt per task, which we strive to make the "standard" as defined by the benchmark's authors. If you would like to study how varying prompts causes changes in the evaluation score, we support prompts authored in the [Promptsource Library](https://github.com/bigscience-workshop/promptsource/tree/main) as described further in https://github.com/EleutherAI/lm-evaluation-harness/blob/big-refactor/lm_eval/docs/new_task_guide.md and https://github.com/EleutherAI/lm-evaluation-harness/blob/big-refactor/lm_eval/docs/advanced_task_guide.md and welcome contributions of novel task templates and task variants.
## How to Contribute or Learn More? ## How to Contribute or Learn More?
...@@ -241,28 +224,19 @@ For more information on the library and how everything fits together, check out ...@@ -241,28 +224,19 @@ For more information on the library and how everything fits together, check out
You can also ask for help, or discuss new features with the maintainers in the #lm-thunderdome channel of the EleutherAI discord! If you've used the library and have had a positive (or negative) experience, we'd love to hear from you! You can also ask for help, or discuss new features with the maintainers in the #lm-thunderdome channel of the EleutherAI discord! If you've used the library and have had a positive (or negative) experience, we'd love to hear from you!
### Implementing new tasks
To implement a new task in the eval harness, see [this guide](./docs/new_task_guide.md).
As a start, we currently only support one prompt per task, which we strive to make the "standard" as defined by the benchmark's authors. If you would like to study how varying prompts causes changes in the evaluation score, we support prompts authored in the [Promptsource Library](https://github.com/bigscience-workshop/promptsource/tree/main) as described further in [the task guide](https://github.com/EleutherAI/lm-evaluation-harness/blob/big-refactor/lm_eval/docs/new_task_guide.md) and [the advanced task guide](https://github.com/EleutherAI/lm-evaluation-harness/blob/big-refactor/lm_eval/docs/advanced_task_guide.md) and welcome contributions of novel task templates and task variants.
## Cite as ## Cite as
``` ```
@software{eval-harness, @misc{eval-harness,
author = {Gao, Leo and author = {Gao, Leo and Tow, Jonathan and Abbasi, Baber and Biderman, Stella and Black, Sid and DiPofi, Anthony and Foster, Charles and Golding, Laurence and Hsu, Jeffrey and Le Noac'h, Alain and Li, Haonan and McDonell, Kyle and Muennighoff, Niklas and Ociepa, Chris and Phang, Jason and Reynolds, Laria and Schoelkopf, Hailey and Skowron, Aviya and Sutawika, Lintang and Tang, Eric and Thite, Anish and Wang, Ben and Wang, Kevin and Zou, Andy},
Tow, Jonathan and
Biderman, Stella and
Black, Sid and
DiPofi, Anthony and
Foster, Charles and
Golding, Laurence and
Hsu, Jeffrey and
McDonell, Kyle and
Muennighoff, Niklas and
Phang, Jason and
Reynolds, Laria and
Tang, Eric and
Thite, Anish and
Wang, Ben and
Wang, Kevin and
Zou, Andy},
title = {A framework for few-shot language model evaluation}, title = {A framework for few-shot language model evaluation},
month = sep, month = sep,
year = 2021, year = 2021,
......
...@@ -4,6 +4,7 @@ Welcome to the docs for the LM Evaluation Harness! ...@@ -4,6 +4,7 @@ Welcome to the docs for the LM Evaluation Harness!
## Table of Contents ## Table of Contents
* To learn about the public interface of the library, as well as how to evaluate via the commandline or as integrated into an external library, see the [Interface](https://github.com/EleutherAI/lm-evaluation-harness/blob/big-refactor/docs/user_guide.md)
* To learn how to add a new library, API, or model type to the library, as well as a quick explainer on the types of ways to evaluate an LM, see the [Model Guide](https://github.com/EleutherAI/lm-evaluation-harness/blob/big-refactor/docs/model_guide.md). * To learn how to add a new library, API, or model type to the library, as well as a quick explainer on the types of ways to evaluate an LM, see the [Model Guide](https://github.com/EleutherAI/lm-evaluation-harness/blob/big-refactor/docs/model_guide.md).
* For a crash course on adding new tasks to the library, see our [New Task Guide](https://github.com/EleutherAI/lm-evaluation-harness/blob/big-refactor/docs/new_task_guide.md). * For a crash course on adding new tasks to the library, see our [New Task Guide](https://github.com/EleutherAI/lm-evaluation-harness/blob/big-refactor/docs/new_task_guide.md).
* To learn more about pushing the limits of task configuration that the Eval Harness supports, see the [Advanced Task Guide](https://github.com/EleutherAI/lm-evaluation-harness/blob/big-refactor/docs/advanced_task_guide.md). * To learn more about pushing the limits of task configuration that the Eval Harness supports, see the [Advanced Task Guide](https://github.com/EleutherAI/lm-evaluation-harness/blob/big-refactor/docs/advanced_task_guide.md).
......
...@@ -2,11 +2,11 @@ ...@@ -2,11 +2,11 @@
## Usage ## Usage
Simply add a "--decontamination_ngrams_path" when running main.py. The provided directory should contain Simply add a "--decontamination_ngrams_path" when running \__main\__.py. The provided directory should contain
the ngram files and info.json produced in "Pile Ngram Generation" further down. the ngram files and info.json produced in "Pile Ngram Generation" further down.
```bash ```bash
python main.py \ python -m lm_eval \
--model gpt2 \ --model gpt2 \
--device 0 \ --device 0 \
--tasks sciq \ --tasks sciq \
......
# User Guide
This document details the interface exposed by `lm-eval` and provides details on what flags are available to users.
## Command-line Interface
A majority of users run the library by cloning it from GitHub, installing the package in editable mode, and running the `python -m lm_eval` script.
Equivalently, running the library can be done via the `lm-eval` entrypoint at the command line.
This mode supports a number of command-line arguments, the details of which can also be seen by running with `-h` or `--help`:
* `--model` : Selects which model type or provider is evaluated. Must be a string corresponding to the name of the model type/provider being used. See [the main README](https://github.com/EleutherAI/lm-evaluation-harness/tree/big-refactor#commercial-apis) for a full list of enabled model names and supported libraries or APIs.
* `--model_args` : Controls parameters passed to the model constructor. Accepts a string of comma-separated keyword arguments to the model class, in the format `"arg1=val1,arg2=val2,..."`, for example `--model_args pretrained=EleutherAI/pythia-160m,dtype=float32`. For a full list of supported keyword arguments, see the initialization of the relevant `lm_eval.api.model.LM` subclass, e.g. [`HFLM`](https://github.com/EleutherAI/lm-evaluation-harness/blob/365fcda9b85bbb6e0572d91976b8daf409164500/lm_eval/models/huggingface.py#L66).
* `--tasks` : Determines which tasks or task groups are evaluated. Accepts a comma-separated list of task names or task group names. Must be solely comprised of valid tasks/groups.
* `--num_fewshot` : Sets the number of few-shot examples to place in context. Must be an integer.
* `--batch_size` : Sets the batch size used for evaluation. Can be a positive integer or `"auto"` to automatically select the largest batch size that will fit in memory, speeding up evaluation. One can pass `--batch_size auto:N` to re-select the maximum batch size `N` times during evaluation. This can help accelerate evaluation further, since `lm-eval` sorts documents in descending order of context length.
* `--max_batch_size` : Sets the maximum batch size to try to fit in memory, if `--batch_size auto` is passed.
* `--device` : Sets which device to place the model onto. Must be a string, for example, `"cuda", "cuda:0", "cpu", "mps"`. Defaults to "cuda", and can be ignored if running multi-GPU or running a non-local model type.
* `--output_path` : A string of the form `dir/file.jsonl` or `dir/`. Provides a path where high-level results will be saved, either into the file named or into the directory named. If `--log_samples` is passed as well, then per-document outputs and metrics will be saved into the directory as well.
* `--log_samples` : If this flag is passed, then the model's outputs, and the text fed into the model, will be saved at per-document granularity. Must be used with `--output_path`.
* `--limit` : Accepts an integer, or a float between 0.0 and 1.0. If passed, limits evaluation to the first X documents per task (if an integer) or to that fraction of each task's documents, taken from the start (if a float). Useful for debugging, especially on costly API models.
* `--use_cache` : Should be a path where a sqlite db file can be written to. Takes a string of format `/path/to/sqlite_cache_` in order to create a cache db at `/path/to/sqlite_cache_rank{i}.db` for each process (0-NUM_GPUS). This allows results of prior runs to be cached, so that a given (model, task) pair can be re-scored or re-run without recomputing the model's outputs.
* `--decontamination_ngrams_path` : Deprecated, see (this commit)[https://github.com/EleutherAI/lm-evaluation-harness/commit/00209e10f6e27edf5d766145afaf894079b5fe10] or older for a working decontamination-checker tool.
* `--check_integrity` : If this flag is used, the library tests for each task selected are run to confirm task integrity.
* `--write_out` : Used for diagnostic purposes to observe the format of task documents passed to a model. If this flag is used, then prints the prompt and gold target string for the first document of each task.
* `--show_config` : If used, prints the full `lm_eval.api.task.TaskConfig` contents (non-default settings from the task YAML file) for each task which was run, at the completion of an evaluation. Useful when one is modifying a task's configuration YAML locally and wants to transmit the exact configurations used, for debugging or for reproducibility purposes.
* `--include_path` : Accepts a path to a folder. If passed, then all YAML files containing `lm-eval` compatible task configurations will be added to the task registry as available tasks. Used when one is writing config files for their own task in a folder other than `lm_eval/tasks/`.
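
To make the two forms of `--limit` concrete, here is a minimal sketch of how an integer or fractional limit could be turned into a document count. The helper name `resolve_limit` is hypothetical and is not part of `lm-eval`; the exact rounding and clamping used by the harness may differ.

```python
from typing import Optional, Union


def resolve_limit(limit: Optional[Union[int, float]], num_docs: int) -> int:
    """Hypothetical helper: how many documents to evaluate for one task.

    Assumption: a value below 1.0 is read as a fraction of the task's
    documents, anything else as an absolute document count.
    """
    if limit is None:
        # No limit passed: evaluate every document.
        return num_docs
    if limit < 1.0:
        # Fractional limit: keep the first `limit` share of the documents.
        return int(num_docs * limit)
    # Integer limit: keep the first `limit` documents, never more than exist.
    return min(int(limit), num_docs)


docs = list(range(200))                  # stand-in for a task's documents
subset = docs[: resolve_limit(0.25, len(docs))]
print(len(subset))
```

Because only the first documents are kept, scores computed this way describe a prefix of the dataset rather than the full benchmark, which is why the flag is meant for debugging.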
## External Library Usage
We also support using the library's external API for use within model training loops or other scripts.
`lm_eval` supplies two functions for external import and use: `lm_eval.evaluate()` and `lm_eval.simple_evaluate()`.
`simple_evaluate()` can be used by simply creating an `lm_eval.api.model.LM` subclass that implements the methods described in the [Model Guide](https://github.com/EleutherAI/lm-evaluation-harness/tree/big-refactor/docs/model_guide.md), and wrapping your custom model in that class as follows:
```python
import lm_eval
...
my_model = initialize_my_model() # create your model (could be running finetuning with some custom modeling code)
...
lm_obj = Your_LM(model=my_model, batch_size=16) # instantiate an LM subclass that takes your initialized model and can run `Your_LM.loglikelihood()`, `Your_LM.loglikelihood_rolling()`, `Your_LM.generate_until()`
lm_eval.tasks.initialize_tasks() # register all tasks from the `lm_eval/tasks` subdirectory. Alternatively, can call `lm_eval.tasks.include_path("path/to/my/custom/task/configs")` to only register a set of tasks in a separate directory.
results = lm_eval.simple_evaluate( # call simple_evaluate
model=lm_obj,
tasks=["taskname1", "taskname2"],
num_fewshot=0,
...
)
```
See https://github.com/EleutherAI/lm-evaluation-harness/blob/365fcda9b85bbb6e0572d91976b8daf409164500/lm_eval/evaluator.py#L35 for a full description of all arguments available. All keyword arguments to `simple_evaluate()` share the same role as the command-line flags described previously.
Additionally, the `evaluate()` function offers the core evaluation functionality provided by the library, but without some of the special handling and simplification + abstraction provided by `simple_evaluate()`.
See https://github.com/EleutherAI/lm-evaluation-harness/blob/365fcda9b85bbb6e0572d91976b8daf409164500/lm_eval/evaluator.py#L173 for more details.
As a brief example usage of `evaluate()`:
```python
import lm_eval
from my_tasks import MyTask1 # suppose you've defined a custom lm_eval.api.task.Task subclass in your own external codebase
...
my_model = initialize_my_model() # create your model (could be running finetuning with some custom modeling code)
...
lm_obj = Your_LM(model=my_model, batch_size=16) # instantiate an LM subclass that takes your initialized model and can run `Your_LM.loglikelihood()`, `Your_LM.loglikelihood_rolling()`, `Your_LM.generate_until()`
lm_eval.tasks.initialize_tasks() # register all tasks from the `lm_eval/tasks` subdirectory. Alternatively, can call `lm_eval.tasks.include_path("path/to/my/custom/task/configs")` to only register a set of tasks in a separate directory.
results = lm_eval.evaluate( # call evaluate() directly with an explicit task dictionary
    lm=lm_obj,
    task_dict={"mytask1": MyTask1},
    ...
)
```
...@@ -44,35 +44,56 @@ class MyCustomLM(LM): ...@@ -44,35 +44,56 @@ class MyCustomLM(LM):
#... #...
def greedy_until(self, requests: list[Instance]) -> list[str]: def generate_until(self, requests: list[Instance]) -> list[str]:
#... #...
#... #...
``` ```
Where `Instance` is a dataclass defined in [`lm_eval.api.instance`](https://github.com/EleutherAI/lm-evaluation-harness/blob/big-refactor/lm_eval/api/instance.py) with property `args` which returns a tuple of (context, continuation). Where `Instance` is a dataclass defined in [`lm_eval.api.instance`](https://github.com/EleutherAI/lm-evaluation-harness/blob/big-refactor/lm_eval/api/instance.py) with property `args` of request-dependent type signature described below.
We support We support three types of requests, consisting of different interactions / measurements with an autoregressive LM.
The three types of All three request types take as input `requests` of type `list[Instance]` that have a matching `Instance.request_type` to the method name.
- `generate_until`
- Each request contains `Instance.args : Tuple[str, dict]` containing 1. an input string to the LM and 2. a dictionary of keyword arguments used to control generation parameters.
- Using this input and these generation parameters, text will be sampled from the language model (typically until a maximum output length or specific stopping string sequences--for example, `{"until": ["\n\n", "."], "max_gen_toks": 128}`).
- The generated input+output text from the model will then be returned.
- `loglikelihood`
- Each request contains `Instance.args : Tuple[str, str]` containing 1. an input string to the LM and 2. a target string on which the loglikelihood of the LM producing this target, conditioned on the input, will be returned.
- Each request returns a result `(ll, is_greedy): Tuple[float, int]`, where `ll` is a floating point number representing the log probability of generating the target string conditioned on the input, and `is_greedy` is either `0` or `1`. It is `1` if and only if the target string *would be generated by greedy sampling from the LM* (that is, if the target string is the *most likely* N-token string to be output by the LM given the input).
smth smth tokenizer-agnostic - `loglikelihood_rolling`
- Each request contains `Instance.args : Tuple[str]`, which is an input string to the model whose *entire* loglikelihood, conditioned on purely the EOT token, will be calculated.
- This is used to evaluate *perplexity* on a data distribution.
- It should return `(ll,) : Tuple[float]`, that is, only the *loglikelihood* of producing each piece of text given no starting input.
3 reqtypes
- greedy_until, and the arguments passed to it
- loglikelihood, and args passed to it To allow a model to be evaluated on all types of tasks, you will need to implement these three types of measurements (note that `loglikelihood_rolling` is a special case of `loglikelihood`). For a reference implementation, check out `lm_eval/models/huggingface.py` !
- loglikelihood_rolling, and args passed to it **Tip: be careful of indexing in loglikelihood!**
LMs take in tokens in position `[0 1 2 ... N]` and output a probability distribution for token position `N+1`. We provide a simplified graphic here, excerpted from `huggingface.py`:
```
# how this all works (illustrated on a causal decoder-only setup):
# CTX CONT
# inp 0 1 2 3|4 5 6 7 8 9 <- last token is deleted by inp[:, :-1]
# model \ \
# logits 1 2 3|4 5 6 7 8 9 <- the ctx half gets tossed out by the
# cont_toks 4 5 6 7 8 9 [:, -len(continuation_enc):, :self.vocab_size] slice
```
The final token of the target is not passed into the LM, because we want the LM's predictions *up to but not past* that final target token. For more information, check out https://github.com/EleutherAI/lm-evaluation-harness/issues/942 .
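
The indexing above can be sketched in plain Python. This is an illustrative toy under stated assumptions, not the code in `huggingface.py`: `toy_model` and `score_continuation` are hypothetical names, tokens are plain integers, the "model" is a function that maps a token prefix to next-token logits, and the context is assumed to contain at least one token.

```python
import math
from typing import Callable, List, Sequence, Tuple

VOCAB_SIZE = 10  # toy vocabulary, assumed for this example only


def toy_model(prefix: Sequence[int]) -> List[float]:
    """Hypothetical LM: strongly prefers (last token + 1) as the next token."""
    logits = [0.0] * VOCAB_SIZE
    logits[(prefix[-1] + 1) % VOCAB_SIZE] = 2.0
    return logits


def log_softmax(logits: Sequence[float]) -> List[float]:
    # Numerically stable log-softmax over one row of logits.
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def score_continuation(
    model: Callable[[Sequence[int]], List[float]],
    context: Sequence[int],
    continuation: Sequence[int],
) -> Tuple[float, bool]:
    if not context or not continuation:
        raise ValueError("this sketch assumes non-empty context and continuation")
    tokens = list(context) + list(continuation)
    inp = tokens[:-1]  # the final target token is never fed to the model
    # Row i holds the prediction for the token at position i + 1.
    rows = [model(inp[: i + 1]) for i in range(len(inp))]
    cont_rows = rows[-len(continuation):]  # drop the rows that predict context
    ll, is_greedy = 0.0, True
    for row, target in zip(cont_rows, continuation):
        ll += log_softmax(row)[target]
        if max(range(len(row)), key=row.__getitem__) != target:
            is_greedy = False
    return ll, is_greedy


ll, is_greedy = score_continuation(toy_model, [0, 1, 2, 3], [4, 5, 6])
print(ll, is_greedy)
```

The slice `rows[-len(continuation):]` plays the role of the `[:, -len(continuation_enc):, ...]` slice in the graphic: after dropping the last input token, the final `len(continuation)` rows are exactly the predictions for the continuation tokens. The sketch returns `is_greedy` as a `bool` rather than `0`/`1` for brevity.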
## Registration ## Registration
Congrats on implementing your model! Now it's time to test it out. Congrats on implementing your model! Now it's time to test it out.
To make your model usable via the command line interface to `lm-eval` using `main.py`, you'll need to tell `lm-eval` what your model's name is. To make your model usable via the command line interface to `lm-eval` using `python -m lm_eval`, you'll need to tell `lm-eval` what your model's name is.
This is done via a *decorator*, `lm_eval.api.registry.register_model`. Using `register_model()`, one can both tell the package what the model's name(s) to be used are when invoking it with `python main.py --model <name>` and alert `lm-eval` to the model's existence. This is done via a *decorator*, `lm_eval.api.registry.register_model`. Using `register_model()`, one can both tell the package what the model's name(s) to be used are when invoking it with `python -m lm_eval --model <name>` and alert `lm-eval` to the model's existence.
```python ```python
from lm_eval.api.registry import register_model from lm_eval.api.registry import register_model
...@@ -83,7 +104,9 @@ class MyCustomLM(LM): ...@@ -83,7 +104,9 @@ class MyCustomLM(LM):
Using this decorator results in the class being added to an accounting of the usable LM types maintained internally to the library at `lm_eval.api.registry.MODEL_REGISTRY`. See `lm_eval.api.registry` for more detail on what sorts of registries and decorators exist in the library! Using this decorator results in the class being added to an accounting of the usable LM types maintained internally to the library at `lm_eval.api.registry.MODEL_REGISTRY`. See `lm_eval.api.registry` for more detail on what sorts of registries and decorators exist in the library!
## Testing
We also recommend that new model contributions be accompanied by short tests of their 3 core functionalities, at minimum. To see an example of such tests, look at https://github.com/EleutherAI/lm-evaluation-harness/blob/35bdecd379c0cefad6897e67db892f4a6026a128/tests/test_ggml.py .
## Other ## Other
......
...@@ -17,7 +17,7 @@ git checkout -b <task-name> ...@@ -17,7 +17,7 @@ git checkout -b <task-name>
pip install -e ".[dev]" pip install -e ".[dev]"
``` ```
As a concrete example, we'll walk through reimplementing the `gsm8k` benchmark (a *generative* task which requires sampling text from a model) and the `sciq` benchmark. (a *discriminative*, or *multiple choice*, task where the model picks the most likely of several fixed answer choices). In this document, we'll walk through the basics of implementing a static benchmark evaluation in two formats: a *generative* task which requires sampling text from a model, such as [`gsm8k`](https://github.com/EleutherAI/lm-evaluation-harness/blob/big-refactor/lm_eval/tasks/gsm8k/gsm8k.yaml), and a *discriminative*, or *multiple choice*, task where the model picks the most likely of several fixed answer choices, such as [`sciq`](https://github.com/EleutherAI/lm-evaluation-harness/blob/big-refactor/lm_eval/tasks/sciq/sciq.yaml).
## Creating a YAML file ## Creating a YAML file
...@@ -45,6 +45,16 @@ dataset_name: ... # the dataset configuration to use. Leave `null` if your datas ...@@ -45,6 +45,16 @@ dataset_name: ... # the dataset configuration to use. Leave `null` if your datas
dataset_kwargs: null # any extra keyword arguments that should be passed to the dataset constructor, e.g. `data_dir`. dataset_kwargs: null # any extra keyword arguments that should be passed to the dataset constructor, e.g. `data_dir`.
``` ```
------------------------------
**Tip:** To load a local dataset for evaluation, you can specify data files in the `dataset_kwargs` field, such as the following for JSON files:
```
dataset_path: json
dataset_name: null
dataset_kwargs:
data_files: /path/to/my/json
```
-------------------------------
Next, we'd like to tell our task what the dataset's train, validation, and test splits are named, if they exist: Next, we'd like to tell our task what the dataset's train, validation, and test splits are named, if they exist:
```yaml ```yaml
...@@ -116,7 +126,7 @@ doc_to_choice: ['No', 'Yes'] ...@@ -116,7 +126,7 @@ doc_to_choice: ['No', 'Yes']
We support the [Jinja 2](https://jinja.palletsprojects.com/en/3.1.x/) templating language for writing prompts. In practice, this means you can take your dataset's columns and do many basic string manipulations to place each document into prompted format. We support the [Jinja 2](https://jinja.palletsprojects.com/en/3.1.x/) templating language for writing prompts. In practice, this means you can take your dataset's columns and do many basic string manipulations to place each document into prompted format.
Take for example `super_glue/boolq`, as input, we'd like to use the features `passage` and `question` and string them together so that for a a sample line `doc`, the model sees something the format of: Take for example the dataset `super_glue/boolq`. As input, we'd like to use the features `passage` and `question` and string them together so that for a a sample line `doc`, the model sees something the format of:
``` ```
doc["passage"] doc["passage"]
Question: doc["question"]? Question: doc["question"]?
...@@ -214,7 +224,7 @@ metric_list: ...@@ -214,7 +224,7 @@ metric_list:
``` ```
`aggregation` and `higher_is_better` can optionally be left out to default to the manually-set defaults if using a natively supported metric, otherwise it must be defined explicitly (for example, when using a custom metric implemented as a function). `aggregation` and `higher_is_better` can optionally be left out to default to the manually-set defaults if using a natively supported metric, otherwise it must be defined explicitly (for example, when using a custom metric implemented as a function).
For a full list of natively supported metrics and aggregation functions see `docs/advanced_task_guide.md`. All metrics supported in [HuggingFace Evaluate](https://github.com/huggingface/evaluate/tree/main/metrics) can also be used, and will be loaded if a given metric name is not one natively supported in `lm-eval`. For a full list of natively supported metrics and aggregation functions see `docs/advanced_task_guide.md`. All metrics supported in [HuggingFace Evaluate](https://github.com/huggingface/evaluate/tree/main/metrics) can also be used, and will be loaded if a given metric name is not one natively supported in `lm-eval` or `hf_evaluate` is set to `true`.
### Optional, More Advanced Setup ### Optional, More Advanced Setup
...@@ -258,11 +268,29 @@ You can do this via adding the Python snippet ...@@ -258,11 +268,29 @@ You can do this via adding the Python snippet
from lm_eval.tasks import include_task_folder from lm_eval.tasks import include_task_folder
include_task_folder("/path/to/yaml/parent/folder") include_task_folder("/path/to/yaml/parent/folder")
``` ```
to the top of any Python file that is run or imported when performing evaluation, such as `main.py`. to the top of any Python file that is run or imported when performing evaluation, such as `\_\_main\_\_.py`.
Passing `--tasks /path/to/yaml/file` is also accepted. Passing `--tasks /path/to/yaml/file` is also accepted.
## Beautifying Table Display
To avoid conflicts, each task needs to be registered with a unique name. Because of this, slight variations of a task still count as distinct tasks and must be named uniquely. This can be done by adding a prefix or suffix that identifies the variation, as in MMLU, where the tasks that use the Flan prompt template are distinguished from the default ones by the prefix `mmlu_flan_*`. Printing the full task names can easily clutter the results table at the end of an evaluation, especially when you have a long list of tasks or are using a benchmark that comprises many tasks. To make the table more legible, you can use `task_alias` and `group_alias` to provide an alternative task name and group name that will be printed instead.
For example, in `mmlu_abstract_algebra.yaml` we set `group_alias` to `stem` and `task_alias` to `abstract_algebra`.
```
"dataset_name": "abstract_algebra"
"description": "The following are multiple choice questions (with answers) about abstract\
\ algebra.\n\n"
"group": "mmlu_stem"
"group_alias": "stem"
"include": "_default_template_yaml"
"task": "mmlu_abstract_algebra"
"task_alias": "abstract_algebra"
```
Note: Even though `group` can be a list, for now, `group_alias` can only be a single string.
## Checking validity ## Checking validity
After registering your task, you can now check on your data downloading and verify that the few-shot samples look as intended. Run the following command with your desired args: After registering your task, you can now check on your data downloading and verify that the few-shot samples look as intended. Run the following command with your desired args:
...@@ -285,7 +313,7 @@ It's now time to check models' performance on your task! In the evaluation harne ...@@ -285,7 +313,7 @@ It's now time to check models' performance on your task! In the evaluation harne
To enable this, we provide a checklist that should be completed when contributing a new task, to enable accurate book-keeping and to ensure that tasks added to the library are well-tested and, where applicable, precedented. To enable this, we provide a checklist that should be completed when contributing a new task, to enable accurate book-keeping and to ensure that tasks added to the library are well-tested and, where applicable, precedented.
### Task impl. checklist ### Task Validity Checklist
The checklist is the following: The checklist is the following:
......
...@@ -20,19 +20,19 @@ Task naming + registration: ...@@ -20,19 +20,19 @@ Task naming + registration:
Dataset configuration options: Dataset configuration options:
- **dataset_path** (`str`) — The name of the dataset as listed by HF in the datasets Hub. - **dataset_path** (`str`) — The name of the dataset as listed by HF in the datasets Hub.
- **dataset_name** (`str`, *optional*, defaults to None) — The name of, what HF calls, a “data instance” or sub-task of the benchmark. If your task does not contain any data instances, just leave this to default to None. (If you're familiar with the HF `datasets.load_dataset` function, these are just the first 2 arguments to it.) - **dataset_name** (`str`, *optional*, defaults to None) — The name of what HF calls a “data instance” or sub-task of the benchmark. If your task does not contain any data instances, just leave this to default to None. (If you're familiar with the HF `datasets.load_dataset` function, these are just the first 2 arguments to it.)
- **dataset_kwargs** (`dict`, *optional*) — Auxiliary arguments that `datasets.load_dataset` accepts. This can be used to specify arguments such as `data_files` or `data_dir` if you want to use local datafiles such as json or csv. - **dataset_kwargs** (`dict`, *optional*) — Auxiliary arguments that `datasets.load_dataset` accepts. This can be used to specify arguments such as `data_files` or `data_dir` if you want to use local datafiles such as json or csv.
- **training_split** (`str`, *optional*) — Split in the dataset to use as the training split. - **training_split** (`str`, *optional*) — Split in the dataset to use as the training split.
- **validation_split** (`str`, *optional*) — Split in the dataset to use as the validation split. - **validation_split** (`str`, *optional*) — Split in the dataset to use as the validation split.
- **test_split** (`str`, *optional*) — Split in the dataset to use as the test split. - **test_split** (`str`, *optional*) — Split in the dataset to use as the test split.
- **fewshot_split** (`str`, *optional*) — Split in the dataset to draw few-shot exemplars from. assert that this not None if num_fewshot > 0. (?) assert if this is same split as one evaling (?) - **fewshot_split** (`str`, *optional*) — Split in the dataset to draw few-shot exemplars from. assert that this not None if num_fewshot > 0.
- **process_docs** (`Callable`, *optional*) — Optionally define a function to apply to each HF dataset split, to preprocess all documents before being fed into prompt template rendering or other evaluation steps. Can be used to rename dataset columns, or to process documents into a format closer to the expected format expected by a prompt template. - **process_docs** (`Callable`, *optional*) — Optionally define a function to apply to each HF dataset split, to preprocess all documents before being fed into prompt template rendering or other evaluation steps. Can be used to rename dataset columns, or to process documents into a format closer to the expected format expected by a prompt template.
Prompting / in-context formatting options: Prompting / in-context formatting options:
- **use_prompt** (`str`, *optional*) — Name of prompt in promptsource to use. if defined, will overwrite doc_to_text, doc_to_target, and doc_to_choice. - **use_prompt** (`str`, *optional*) — Name of prompt in promptsource to use. if defined, will overwrite doc_to_text, doc_to_target, and doc_to_choice.
- **doc_to_text** (`Union[Callable, str]`, *optional*) — Jinja2, f-string, or function to process a sample into the appropriate input for the model - **doc_to_text** (`Union[Callable, str]`, *optional*) — Jinja2, f-string, or function to process a sample into the appropriate input for the model
- **doc_to_target** (`Union[Callable, str]`, *optional*) — Jinja2, f-string, or function to process a sample into the appropriate target output for the model. For multiple choice tasks, this should return an index into - **doc_to_target** (`Union[Callable, str]`, *optional*) — Jinja2, f-string, or function to process a sample into the appropriate target output for the model. For multiple choice tasks, this should return an index into
- **doc_to_choice** (`Union[Callable, str]`, *optional*) — Jinja2, f-string, or function to process a sample into a list of possible string choices for `multiple_choice` tasks. Left undefined for `greedy_until` tasks. - **doc_to_choice** (`Union[Callable, str]`, *optional*) — Jinja2, f-string, or function to process a sample into a list of possible string choices for `multiple_choice` tasks. Left undefined for `generate_until` tasks.
- **fewshot_delimiter** (`str`, *optional*, defaults to "\n\n") — String to insert between few-shot examples. - **fewshot_delimiter** (`str`, *optional*, defaults to "\n\n") — String to insert between few-shot examples.
- **target_delimiter** (`str`, *optional*, defaults to `" "`) — String to insert between input and target output for the datapoint being tested. - **target_delimiter** (`str`, *optional*, defaults to `" "`) — String to insert between input and target output for the datapoint being tested.
...@@ -42,7 +42,7 @@ Runtime configuration options: ...@@ -42,7 +42,7 @@ Runtime configuration options:
Scoring details: Scoring details:
- **metric_list** (`str`, *optional*, defaults to None) — A list of metrics to use for evaluation. See docs for expected format. - **metric_list** (`str`, *optional*, defaults to None) — A list of metrics to use for evaluation. See docs for expected format.
- **output_type** (`str`, *optional*, defaults to "greedy_until") — Selects the type of model output for the given task. Options are `greedy_until`, `loglikelihood`, `loglikelihood_rolling`, and `multiple_choice`. - **output_type** (`str`, *optional*, defaults to "generate_until") — Selects the type of model output for the given task. Options are `generate_until`, `loglikelihood`, `loglikelihood_rolling`, and `multiple_choice`.
- **generation_kwargs** (`dict`, *optional*) — Auxiliary arguments for the `generate` function from HF transformers library. Advanced keyword arguments may not be supported for non-HF LM classes. - **generation_kwargs** (`dict`, *optional*) — Auxiliary arguments for the `generate` function from HF transformers library. Advanced keyword arguments may not be supported for non-HF LM classes.
- **repeats** (`int`, *optional*, defaults to 1) — Number of repeated runs through model for each sample. can be used for cases such as self-consistency. - **repeats** (`int`, *optional*, defaults to 1) — Number of repeated runs through model for each sample. can be used for cases such as self-consistency.
- **filter_list** (`Union[str, list]`, *optional*) — List of filters to postprocess model outputs. See below for further detail on the filter API. - **filter_list** (`Union[str, list]`, *optional*) — List of filters to postprocess model outputs. See below for further detail on the filter API.
......
...@@ -5,3 +5,4 @@ maka ...@@ -5,3 +5,4 @@ maka
mor mor
te te
ond ond
extraversion
from .evaluator import evaluate, simple_evaluate
import os import os
import re import re
import sys
import json import json
import fnmatch
import jsonlines
import argparse
import logging import logging
import argparse
import numpy as np
from pathlib import Path from pathlib import Path
from typing import Union
from lm_eval import evaluator, utils from lm_eval import evaluator, utils
from lm_eval.tasks import initialize_tasks, include_path
from lm_eval.api.registry import ALL_TASKS from lm_eval.api.registry import ALL_TASKS
from lm_eval.logger import eval_logger, SPACING
from lm_eval.tasks import include_task_folder
from lm_eval.benchmarks import include_benchmarks
os.environ["TOKENIZERS_PARALLELISM"] = "false"
def _handle_non_serializable(o):
if isinstance(o, np.int64) or isinstance(o, np.int32):
return int(o)
elif isinstance(o, set):
return list(o)
else:
return str(o)
def parse_args() -> argparse.Namespace: def parse_eval_args() -> argparse.Namespace:
parser = argparse.ArgumentParser(formatter_class=argparse.RawTextHelpFormatter) parser = argparse.ArgumentParser(formatter_class=argparse.RawTextHelpFormatter)
parser.add_argument("--model", required=True, help="Name of model e.g. `hf`") parser.add_argument("--model", default="hf", help="Name of model e.g. `hf`")
parser.add_argument( parser.add_argument(
"--tasks", "--tasks",
default=None, default=None,
help="Available Tasks:\n - {}".format("\n - ".join(sorted(ALL_TASKS))), help="To get full list of tasks, use the command lm-eval --tasks list",
) )
parser.add_argument( parser.add_argument(
"--model_args", "--model_args",
...@@ -98,24 +105,43 @@ def parse_args() -> argparse.Namespace: ...@@ -98,24 +105,43 @@ def parse_args() -> argparse.Namespace:
default=None, default=None,
help="Additional path to include if there are external tasks to include.", help="Additional path to include if there are external tasks to include.",
) )
parser.add_argument(
"--verbosity",
type=str,
default="INFO",
help="Log error when tasks are not registered.",
)
return parser.parse_args() return parser.parse_args()
def main() -> None: def cli_evaluate(args: Union[argparse.Namespace, None] = None) -> None:
args = parse_args() if not args:
# we allow for args to be passed externally, else we parse them ourselves
args = parse_eval_args()
eval_logger = utils.eval_logger
eval_logger.setLevel(getattr(logging, f"{args.verbosity}"))
eval_logger.info(f"Verbosity set to {args.verbosity}")
os.environ["TOKENIZERS_PARALLELISM"] = "false"
initialize_tasks(args.verbosity)
if args.limit: if args.limit:
eval_logger.warning( eval_logger.warning(
" --limit SHOULD ONLY BE USED FOR TESTING." " --limit SHOULD ONLY BE USED FOR TESTING."
"REAL METRICS SHOULD NOT BE COMPUTED USING LIMIT." "REAL METRICS SHOULD NOT BE COMPUTED USING LIMIT."
) )
if args.include_path is not None: if args.include_path is not None:
eval_logger.info(f"Including path: {args.include_path}") eval_logger.info(f"Including path: {args.include_path}")
include_task_folder(args.include_path) include_path(args.include_path)
if args.tasks is None: if args.tasks is None:
task_names = ALL_TASKS task_names = ALL_TASKS
elif args.tasks == "list":
eval_logger.info(
"Available Tasks:\n - {}".format(f"\n - ".join(sorted(ALL_TASKS)))
)
sys.exit()
else: else:
if os.path.isdir(args.tasks): if os.path.isdir(args.tasks):
import glob import glob
...@@ -128,21 +154,25 @@ def main() -> None: ...@@ -128,21 +154,25 @@ def main() -> None:
else: else:
tasks_list = args.tasks.split(",") tasks_list = args.tasks.split(",")
task_names = utils.pattern_match(tasks_list, ALL_TASKS) task_names = utils.pattern_match(tasks_list, ALL_TASKS)
task_missing = []
for task in [task for task in tasks_list if task not in task_names]: for task in [task for task in tasks_list if task not in task_names]:
if os.path.isfile(task): if os.path.isfile(task):
config = utils.load_yaml_config(task) config = utils.load_yaml_config(task)
task_names.append(config) task_names.append(config)
else: task_missing = [
task_missing.append(task) task
for task in tasks_list
if task not in task_names and "*" not in task
] # we don't want errors if a wildcard ("*") task name was used
if task_missing != []: if task_missing:
missing = ", ".join(task_missing) missing = ", ".join(task_missing)
eval_logger.error( eval_logger.error(
f"Tasks were not found: {missing}\n" f"Tasks were not found: {missing}\n"
f"{SPACING}Try `lm-eval -h` for list of available tasks", f"{utils.SPACING}Try `lm-eval --tasks list` for list of available tasks",
)
raise ValueError(
f"Tasks {missing} were not found. Try `lm-eval --tasks list` for list of available tasks."
) )
raise ValueError(f"Tasks {missing} were not found.")
if args.output_path: if args.output_path:
path = Path(args.output_path) path = Path(args.output_path)
...@@ -185,7 +215,7 @@ def main() -> None: ...@@ -185,7 +215,7 @@ def main() -> None:
if results is not None: if results is not None:
if args.log_samples: if args.log_samples:
samples = results.pop("samples") samples = results.pop("samples")
dumped = json.dumps(results, indent=2, default=lambda o: str(o)) dumped = json.dumps(results, indent=2, default=_handle_non_serializable)
if args.show_config: if args.show_config:
print(dumped) print(dumped)
...@@ -200,18 +230,19 @@ def main() -> None: ...@@ -200,18 +230,19 @@ def main() -> None:
re.sub("/|=", "__", args.model_args), task_name re.sub("/|=", "__", args.model_args), task_name
) )
filename = path.joinpath(f"{output_name}.jsonl") filename = path.joinpath(f"{output_name}.jsonl")
samples_dumped = json.dumps(
with jsonlines.open(filename, "w") as f: samples[task_name], indent=2, default=_handle_non_serializable
f.write_all(samples[task_name]) )
filename.open("w").write(samples_dumped)
print( print(
f"{args.model} ({args.model_args}), limit: {args.limit}, num_fewshot: {args.num_fewshot}, " f"{args.model} ({args.model_args}), limit: {args.limit}, num_fewshot: {args.num_fewshot}, "
f"batch_size: {args.batch_size}{f' ({batch_sizes})' if batch_sizes else ''}" f"batch_size: {args.batch_size}{f' ({batch_sizes})' if batch_sizes else ''}"
) )
print(evaluator.make_table(results)) print(evaluator.make_table(results))
if "aggregate" in results: if "groups" in results:
print(evaluator.make_table(results, "aggregate")) print(evaluator.make_table(results, "groups"))
if __name__ == "__main__": if __name__ == "__main__":
main() cli_evaluate()
...@@ -4,7 +4,7 @@ from typing import Literal, Tuple ...@@ -4,7 +4,7 @@ from typing import Literal, Tuple
@dataclass @dataclass
class Instance: class Instance:
request_type: Literal["loglikelihood", "loglikelihood_rolling", "greedy_until"] request_type: Literal["loglikelihood", "loglikelihood_rolling", "generate_until"]
doc: dict doc: dict
arguments: tuple arguments: tuple
idx: int idx: int
......
...@@ -5,9 +5,13 @@ import numpy as np ...@@ -5,9 +5,13 @@ import numpy as np
import sacrebleu import sacrebleu
import sklearn.metrics import sklearn.metrics
import random import random
import evaluate
from lm_eval.api.registry import register_metric, register_aggregation from lm_eval.api.registry import register_metric, register_aggregation
import logging
eval_logger = logging.getLogger("lm-eval")
# Register Aggregations First # Register Aggregations First
@register_aggregation("mean") @register_aggregation("mean")
...@@ -135,6 +139,19 @@ def acc_mutual_info_fn(items): # This is a passthrough function ...@@ -135,6 +139,19 @@ def acc_mutual_info_fn(items): # This is a passthrough function
return items return items
exact_match = evaluate.load("exact_match")
@register_metric(
metric="exact_match",
higher_is_better=True,
output_type="generate_until",
aggregation="mean",
)
def exact_match_fn(**kwargs):
return exact_match.compute(**kwargs)
@register_metric( @register_metric(
metric="perplexity", metric="perplexity",
higher_is_better=False, higher_is_better=False,
...@@ -212,7 +229,7 @@ def f1_fn(items): # This is a passthrough function ...@@ -212,7 +229,7 @@ def f1_fn(items): # This is a passthrough function
@register_metric( @register_metric(
metric="bleu", metric="bleu",
higher_is_better=True, higher_is_better=True,
output_type="greedy_until", output_type="generate_until",
aggregation="bleu", aggregation="bleu",
) )
def bleu_fn(items): # This is a passthrough function def bleu_fn(items): # This is a passthrough function
...@@ -222,7 +239,7 @@ def bleu_fn(items): # This is a passthrough function ...@@ -222,7 +239,7 @@ def bleu_fn(items): # This is a passthrough function
@register_metric( @register_metric(
metric="chrf", metric="chrf",
higher_is_better=True, higher_is_better=True,
output_type="greedy_until", output_type="generate_until",
aggregation="chrf", aggregation="chrf",
) )
def chrf_fn(items): # This is a passthrough function def chrf_fn(items): # This is a passthrough function
...@@ -232,7 +249,7 @@ def chrf_fn(items): # This is a passthrough function ...@@ -232,7 +249,7 @@ def chrf_fn(items): # This is a passthrough function
@register_metric( @register_metric(
metric="ter", metric="ter",
higher_is_better=True, higher_is_better=True,
output_type="greedy_until", output_type="generate_until",
aggregation="ter", aggregation="ter",
) )
def ter_fn(items): # This is a passthrough function def ter_fn(items): # This is a passthrough function
......
import abc import abc
import os import os
from typing import Union, List, Tuple import torch
from typing import Union, List, Tuple, Optional, Type, TypeVar
from sqlitedict import SqliteDict from sqlitedict import SqliteDict
import json import json
import hashlib import hashlib
...@@ -9,7 +10,12 @@ import hashlib ...@@ -9,7 +10,12 @@ import hashlib
from tqdm import tqdm from tqdm import tqdm
from lm_eval import utils from lm_eval import utils
from lm_eval.logger import eval_logger
import logging
eval_logger = logging.getLogger("lm-eval")
T = TypeVar("T", bound="LM")
class LM(abc.ABC): class LM(abc.ABC):
...@@ -93,7 +99,7 @@ class LM(abc.ABC): ...@@ -93,7 +99,7 @@ class LM(abc.ABC):
# TODO: Add an optional max length # TODO: Add an optional max length
@abc.abstractmethod @abc.abstractmethod
def greedy_until(self, requests) -> List[str]: def generate_until(self, requests) -> List[str]:
"""Generate greedily until a stopping sequence """Generate greedily until a stopping sequence
:param requests: list[Instance] :param requests: list[Instance]
...@@ -111,11 +117,28 @@ class LM(abc.ABC): ...@@ -111,11 +117,28 @@ class LM(abc.ABC):
pass pass
@classmethod @classmethod
def create_from_arg_string(cls, arg_string, additional_config=None): def create_from_arg_string(
cls: Type[T], arg_string: str, additional_config: Optional[dict] = None
) -> T:
"""
Creates an instance of the LM class using the given argument string and additional config.
Parameters:
- arg_string: A string containing arguments in the format key1=value1,key2=value2.
- additional_config: Optional dictionary containing additional configuration parameters.
Returns:
- Instance of the LM class.
"""
additional_config = {} if additional_config is None else additional_config additional_config = {} if additional_config is None else additional_config
args = utils.simple_parse_args_string(arg_string) args = utils.simple_parse_args_string(arg_string)
args2 = {k: v for k, v in additional_config.items() if v is not None} args2 = {k: v for k, v in additional_config.items() if v is not None}
if args2.get("device") == "mps" or args.get("device") == "mps": # TODO: delete once float16 MPS is fixed in torch stable
if (
args2.get("device") in ("mps", "mps:0")
or args.get("device") in ("mps", "mps:0")
and "dev" not in torch.__version__
):
args["dtype"] = "float32" args["dtype"] = "float32"
return cls(**args, **args2) return cls(**args, **args2)
...@@ -191,12 +214,12 @@ class CachingLM: ...@@ -191,12 +214,12 @@ class CachingLM:
) )
for req in tqdm(requests): for req in tqdm(requests):
hsh = hash_args(attr, req.args) hsh = hash_args(attr, req.args)
if attr == "greedy_until" and req.args[1].get("do_sample", False): if attr == "generate_until" and req.args[1].get("do_sample", False):
# when we are doing non-greedy generation, don't use the cache # when we are doing non-greedy generation, don't use the cache
# (else every "randomly sampled" generation would be identical for repeats > 1). # (else every "randomly sampled" generation would be identical for repeats > 1).
if not warned: if not warned:
eval_logger.warning( eval_logger.warning(
f"Arguments to lm.greedy_until() '{req.args[1]}' include non-deterministic sampling. Caching will not be performed for such requests." f"Arguments to lm.generate_until() '{req.args[1]}' include non-deterministic sampling. Caching will not be performed for such requests."
) )
warned = True warned = True
res.append(None) res.append(None)
......
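A note on the device check in `create_from_arg_string` above: in Python, `and` binds more tightly than `or`, so a condition of the form `A or B and C` is evaluated as `A or (B and C)`. The sketch below shows how the two groupings can differ. It uses hypothetical helper names (`forces_float32_as_written`, `forces_float32_grouped`) and takes a plain version string as a stand-in for `torch.__version__`. The diff does not say which grouping the authors intended, so the second function is only an assumed alternative.

```python
MPS_DEVICES = ("mps", "mps:0")


def forces_float32_as_written(args, args2, torch_version):
    # Same shape as the condition in the diff: A or B and C,
    # which Python parses as A or (B and C).
    return (
        args2.get("device") in MPS_DEVICES
        or args.get("device") in MPS_DEVICES
        and "dev" not in torch_version
    )


def forces_float32_grouped(args, args2, torch_version):
    # Hypothetical alternative with explicit grouping: (A or B) and C.
    return (
        args2.get("device") in MPS_DEVICES or args.get("device") in MPS_DEVICES
    ) and "dev" not in torch_version


if __name__ == "__main__":
    dev_build = "2.2.0.dev20231010"  # made-up version string for illustration
    print(forces_float32_as_written({}, {"device": "mps"}, dev_build))
    print(forces_float32_grouped({}, {"device": "mps"}, dev_build))
```

With the ungrouped form, a device passed through `additional_config` forces `float32` regardless of the version string, while a device passed in the argument string is subject to the version check.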
import os import os
import evaluate import evaluate
from lm_eval.api.model import LM from lm_eval.api.model import LM
from lm_eval.logger import eval_logger
import logging
eval_logger = logging.getLogger("lm-eval")
MODEL_REGISTRY = {} MODEL_REGISTRY = {}
...@@ -68,10 +71,10 @@ def register_group(name): ...@@ -68,10 +71,10 @@ def register_group(name):
return decorate return decorate
AGGREGATION_REGISTRY = {}
DEFAULT_AGGREGATION_REGISTRY = {}
METRIC_REGISTRY = {}
OUTPUT_TYPE_REGISTRY = {} OUTPUT_TYPE_REGISTRY = {}
METRIC_REGISTRY = {}
METRIC_AGGREGATION_REGISTRY = {}
AGGREGATION_REGISTRY = {}
HIGHER_IS_BETTER_REGISTRY = {} HIGHER_IS_BETTER_REGISTRY = {}
DEFAULT_METRIC_REGISTRY = { DEFAULT_METRIC_REGISTRY = {
...@@ -81,7 +84,7 @@ DEFAULT_METRIC_REGISTRY = { ...@@ -81,7 +84,7 @@ DEFAULT_METRIC_REGISTRY = {
], ],
"loglikelihood_rolling": ["word_perplexity", "byte_perplexity", "bits_per_byte"], "loglikelihood_rolling": ["word_perplexity", "byte_perplexity", "bits_per_byte"],
"multiple_choice": ["acc", "acc_norm"], "multiple_choice": ["acc", "acc_norm"],
"greedy_until": ["exact_match"], "generate_until": ["exact_match"],
} }
...@@ -95,8 +98,7 @@ def register_metric(**args): ...@@ -95,8 +98,7 @@ def register_metric(**args):
for key, registry in [ for key, registry in [
("metric", METRIC_REGISTRY), ("metric", METRIC_REGISTRY),
("higher_is_better", HIGHER_IS_BETTER_REGISTRY), ("higher_is_better", HIGHER_IS_BETTER_REGISTRY),
# ("output_type", OUTPUT_TYPE_REGISTRY), ("aggregation", METRIC_AGGREGATION_REGISTRY),
("aggregation", DEFAULT_AGGREGATION_REGISTRY),
]: ]:
if key in args: if key in args:
...@@ -117,23 +119,22 @@ def register_metric(**args): ...@@ -117,23 +119,22 @@ def register_metric(**args):
return decorate return decorate
def get_metric(name): def get_metric(name, hf_evaluate_metric=False):
try: if not hf_evaluate_metric:
if name in METRIC_REGISTRY:
return METRIC_REGISTRY[name] return METRIC_REGISTRY[name]
except KeyError: else:
# TODO: change this print to logging? eval_logger.warning(
print( f"Could not find registered metric '{name}' in lm-eval, searching in HF Evaluate library..."
f"Could not find registered metric '{name}' in lm-eval, \
searching in HF Evaluate library..."
) )
try: try:
metric_object = evaluate.load(name) metric_object = evaluate.load(name)
return metric_object.compute return metric_object.compute
except Exception: except Exception:
eval_logger.error( eval_logger.error(
"{} not found in the evaluate library!".format(name), f"{name} not found in the evaluate library! Please check https://huggingface.co/evaluate-metric",
"Please check https://huggingface.co/evaluate-metric",
) )
...@@ -159,12 +160,13 @@ def get_aggregation(name): ...@@ -159,12 +160,13 @@ def get_aggregation(name):
) )
def get_default_aggregation(metric_name): def get_metric_aggregation(name):
try: try:
return DEFAULT_AGGREGATION_REGISTRY[metric_name] return METRIC_AGGREGATION_REGISTRY[name]
except KeyError: except KeyError:
eval_logger.warning( eval_logger.warning(
f"No default aggregation metric for metric '{metric_name}'!" "{} metric is not assigned a default aggregation!".format(name),
) )
...@@ -172,7 +174,6 @@ def is_higher_better(metric_name): ...@@ -172,7 +174,6 @@ def is_higher_better(metric_name):
try: try:
return HIGHER_IS_BETTER_REGISTRY[metric_name] return HIGHER_IS_BETTER_REGISTRY[metric_name]
except KeyError: except KeyError:
raise Warning(f"higher_is_better not specified for metric '{metric_name}'!")
eval_logger.warning( eval_logger.warning(
f"higher_is_better not specified for metric '{metric_name}'!" f"higher_is_better not specified for metric '{metric_name}'!"
) )
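As far as the flattened diff can be read, the new `get_metric` consults the local registry first unless `hf_evaluate_metric` is set, and otherwise falls through to `evaluate.load`. The following is a minimal sketch of that lookup order, not the project's implementation. `get_metric_sketch` and its `external_loader` parameter are hypothetical, the latter standing in for `evaluate.load` so the example runs offline, and the registry contents are placeholders.

```python
import logging

eval_logger = logging.getLogger("lm-eval")

# Placeholder registry; the real one is filled by @register_metric.
METRIC_REGISTRY = {"acc": lambda items: items}


def get_metric_sketch(name, hf_evaluate_metric=False, external_loader=None):
    """Return a registered metric, else fall back to an external loader.

    `external_loader` is a hypothetical injection point standing in for
    `evaluate.load`; it must return an object with a `.compute` attribute.
    """
    if not hf_evaluate_metric:
        if name in METRIC_REGISTRY:
            return METRIC_REGISTRY[name]
        eval_logger.warning(
            "Could not find registered metric '%s', trying the external loader...",
            name,
        )
    try:
        return external_loader(name).compute
    except Exception:
        # Mirrors the diff: log the failure and fall through, returning None.
        eval_logger.error("%s not found by the external loader!", name)
        return None
```

Setting `hf_evaluate_metric=True` skips the registry entirely, which lets a task configuration request the external implementation even when a metric of the same name is registered locally.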
class Sampler: class ContextSampler:
def __init__(self, docs, task, fewshot_indices=None, rnd=None) -> None: def __init__(self, docs, task, fewshot_indices=None, rnd=None) -> None:
self.rnd = rnd self.rnd = rnd
assert self.rnd, "must pass rnd to FewShotSampler!" assert self.rnd, "must pass rnd to FewShotSampler!"
...@@ -46,14 +46,14 @@ class Sampler: ...@@ -46,14 +46,14 @@ class Sampler:
) )
+ self.target_delimiter + self.target_delimiter
+ ( + (
self.doc_to_target(doc)[0] str(self.doc_to_target(doc)[0])
if type(self.doc_to_target(doc)) is list if type(self.doc_to_target(doc)) is list
else self.doc_to_target(doc) else self.doc_to_target(doc)
if ( if (
self.config.doc_to_choice is None self.config.doc_to_choice is None
or type(self.doc_to_target(doc)) is str or type(self.doc_to_target(doc)) is str
) )
else self.doc_to_choice(doc)[self.doc_to_target(doc)] else str(self.doc_to_choice(doc)[self.doc_to_target(doc)])
) )
for doc in selected_docs for doc in selected_docs
] ]
...@@ -71,7 +71,19 @@ class Sampler: ...@@ -71,7 +71,19 @@ class Sampler:
return self.rnd.sample(self.docs, n) return self.rnd.sample(self.docs, n)
class BalancedSampler(Sampler): class FirstNSampler(ContextSampler):
def sample(self, n) -> None:
"""
Draw the first `n` samples in order from the specified split.
Used for tasks with "canonical" ordered fewshot examples, such as MMLU and CMMLU.
"""
assert n <= len(
self.docs
), f"Error: number of fewshot samples requested exceeds the {len(self.docs)} that are available."
return self.docs[:n]
class BalancedSampler(ContextSampler):
def sample(self, n) -> None: def sample(self, n) -> None:
""" """
TODO: this should return approximately class-balanced samples from our fewshot examples. TODO: this should return approximately class-balanced samples from our fewshot examples.
...@@ -81,12 +93,27 @@ class BalancedSampler(Sampler): ...@@ -81,12 +93,27 @@ class BalancedSampler(Sampler):
pass pass
class ManualSampler(Sampler): class ManualSampler(ContextSampler):
def sample(self, n) -> None: def sample(self, n) -> None:
""" """ """ """
pass pass
SAMPLER_REGISTRY = {
"default": ContextSampler,
"first_n": FirstNSampler,
}
def get_sampler(name):
try:
return SAMPLER_REGISTRY[name]
except KeyError:
raise ValueError(
f"Attempted to use contextsampler '{name}', but no sampling strategy for this name found! Supported sampler names: {', '.join(SAMPLER_REGISTRY.keys())}"
)
# TODO: how should we do design here? might be better to have a single sampler and pass more kwargs at init. # TODO: how should we do design here? might be better to have a single sampler and pass more kwargs at init.
# Depends what's easier for new user to add own functionality on top of # Depends what's easier for new user to add own functionality on top of
......
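The sampler registry above, together with the `fewshot_config` lookup that `ConfigurableTask` performs later in the diff, can be sketched as follows. This is a simplified, self-contained illustration. The `*Sketch` classes and `build_sampler` are hypothetical names, and the constructors drop the `task` argument that the real samplers take.

```python
import random


class ContextSamplerSketch:
    """Default strategy: draw n few-shot docs at random."""

    def __init__(self, docs, rnd=None):
        assert rnd, "must pass rnd to the sampler"
        self.docs = docs
        self.rnd = rnd

    def sample(self, n):
        return self.rnd.sample(self.docs, n)


class FirstNSamplerSketch(ContextSamplerSketch):
    """Take the first n docs in order, for tasks with canonical few-shot examples."""

    def sample(self, n):
        assert n <= len(self.docs), (
            f"number of fewshot samples requested exceeds the {len(self.docs)} available"
        )
        return self.docs[:n]


SAMPLER_REGISTRY = {
    "default": ContextSamplerSketch,
    "first_n": FirstNSamplerSketch,
}


def get_sampler(name):
    try:
        return SAMPLER_REGISTRY[name]
    except KeyError:
        raise ValueError(
            f"No sampling strategy named '{name}'. "
            f"Supported sampler names: {', '.join(SAMPLER_REGISTRY)}"
        )


def build_sampler(fewshot_config, docs):
    # Same fallback as in the diff: use "default" when no fewshot_config is given.
    name = fewshot_config.get("sampler", "default") if fewshot_config else "default"
    return get_sampler(name)(docs, rnd=random.Random(1234))
```

Keeping the lookup in a dictionary means a new strategy only needs a class and one registry entry, and a task selects it by name through its `fewshot_config`.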
...@@ -4,6 +4,7 @@ from dataclasses import dataclass, field, asdict ...@@ -4,6 +4,7 @@ from dataclasses import dataclass, field, asdict
import re import re
import ast import ast
import yaml import yaml
import logging
import evaluate import evaluate
import random import random
import itertools import itertools
...@@ -21,7 +22,6 @@ from lm_eval.api import samplers ...@@ -21,7 +22,6 @@ from lm_eval.api import samplers
from lm_eval.api.instance import Instance from lm_eval.api.instance import Instance
from lm_eval.api.filter import FilterEnsemble from lm_eval.api.filter import FilterEnsemble
from lm_eval.logger import eval_logger
from lm_eval.prompts import get_prompt from lm_eval.prompts import get_prompt
from lm_eval.filters import build_filter_ensemble from lm_eval.filters import build_filter_ensemble
from lm_eval.api.metrics import ( from lm_eval.api.metrics import (
...@@ -33,7 +33,7 @@ from lm_eval.api.metrics import ( ...@@ -33,7 +33,7 @@ from lm_eval.api.metrics import (
from lm_eval.api.registry import ( from lm_eval.api.registry import (
get_metric, get_metric,
get_aggregation, get_aggregation,
get_default_aggregation, get_metric_aggregation,
is_higher_better, is_higher_better,
DEFAULT_METRIC_REGISTRY, DEFAULT_METRIC_REGISTRY,
OUTPUT_TYPE_REGISTRY, OUTPUT_TYPE_REGISTRY,
...@@ -44,15 +44,20 @@ ALL_OUTPUT_TYPES = [ ...@@ -44,15 +44,20 @@ ALL_OUTPUT_TYPES = [
"loglikelihood", "loglikelihood",
"multiple_choice", "multiple_choice",
"loglikelihood_rolling", "loglikelihood_rolling",
"greedy_until", "generate_until",
] ]
eval_logger = logging.getLogger("lm-eval")
@dataclass @dataclass
class TaskConfig(dict): class TaskConfig(dict):
# task naming/registry # task naming/registry
task: str = None task: str = None
task_alias: str = None
group: Union[str, list] = None group: Union[str, list] = None
group_alias: Union[str, list] = None
# HF dataset options. # HF dataset options.
# which dataset to use, # which dataset to use,
# and what splits for what purpose # and what splits for what purpose
...@@ -69,17 +74,17 @@ class TaskConfig(dict): ...@@ -69,17 +74,17 @@ class TaskConfig(dict):
doc_to_text: Union[Callable, str] = None doc_to_text: Union[Callable, str] = None
doc_to_target: Union[Callable, str] = None doc_to_target: Union[Callable, str] = None
doc_to_choice: Union[Callable, str, dict, list] = None doc_to_choice: Union[Callable, str, dict, list] = None
gold_alias: Union[Callable, str] = None
process_results: Union[Callable, str] = None process_results: Union[Callable, str] = None
use_prompt: str = None use_prompt: str = None
description: str = "" description: str = ""
target_delimiter: str = " " target_delimiter: str = " "
fewshot_delimiter: str = "\n\n" fewshot_delimiter: str = "\n\n"
fewshot_config: dict = None
# runtime configuration options # runtime configuration options
num_fewshot: int = 0 num_fewshot: int = 0
# scoring options # scoring options
metric_list: list = None metric_list: list = None
output_type: str = "greedy_until" output_type: str = "generate_until"
generation_kwargs: dict = None generation_kwargs: dict = None
repeats: int = 1 repeats: int = 1
filter_list: Union[str, list] = None filter_list: Union[str, list] = None
...@@ -89,18 +94,18 @@ class TaskConfig(dict): ...@@ -89,18 +94,18 @@ class TaskConfig(dict):
metadata: str = None # by default, not used in the code. allows for users to pass arbitrary info to tasks metadata: str = None # by default, not used in the code. allows for users to pass arbitrary info to tasks
def __post_init__(self) -> None: def __post_init__(self) -> None:
if "." in self.dataset_path: if self.dataset_path and ("." in self.dataset_path):
import inspect import inspect
from importlib import import_module from importlib import import_module
self.dataset_path = inspect.getfile(import_module(self.dataset_path)) self.dataset_path = inspect.getfile(import_module(self.dataset_path))
if self.generation_kwargs is not None: if self.generation_kwargs is not None:
if self.output_type != "greedy_until": if self.output_type != "generate_until":
eval_logger.warning( eval_logger.warning(
"passed `generation_kwargs`, but not using `output_type: greedy_until`!" f"[{self.task}] passed `generation_kwargs`, but not using `output_type: generate_until`!"
) )
assert self.output_type != "greedy_until" assert self.output_type != "generate_until"
if "temperature" in self.generation_kwargs: if "temperature" in self.generation_kwargs:
self.generation_kwargs["temperature"] = float( self.generation_kwargs["temperature"] = float(
...@@ -110,14 +115,13 @@ class TaskConfig(dict): ...@@ -110,14 +115,13 @@ class TaskConfig(dict):
if "until" not in self.generation_kwargs: if "until" not in self.generation_kwargs:
self.generation_kwargs["until"] = [self.fewshot_delimiter] self.generation_kwargs["until"] = [self.fewshot_delimiter]
else: else:
if self.output_type == "greedy_until": if self.output_type == "generate_until":
# ensure that we greedily generate in absence of explicit arguments otherwise # ensure that we greedily generate in absence of explicit arguments otherwise
self.generation_kwargs = { self.generation_kwargs = {
"until": None "until": None
if self.fewshot_delimiter is None if self.fewshot_delimiter is None
else [self.fewshot_delimiter], else [self.fewshot_delimiter],
"do_sample": False, "do_sample": False,
"temperature": 0.0,
} }
# TODO: how to make TaskConfigs be de- and re-serializable, even when using the !function constructor? # TODO: how to make TaskConfigs be de- and re-serializable, even when using the !function constructor?
...@@ -203,19 +207,9 @@ class Task(abc.ABC): ...@@ -203,19 +207,9 @@ class Task(abc.ABC):
self._fewshot_docs = None self._fewshot_docs = None
self._instances = None self._instances = None
self._config = TaskConfig(**config) if config else TaskConfig() self._config = TaskConfig({**config}) if config else TaskConfig()
if not hasattr(self, "_filters"):
self._filters = []
for name, components in self._config.get(
"filters", [["none", [["take_first", None]]]]
):
filter_pipeline = build_filter_ensemble(name, components)
self._filters.append(filter_pipeline)
self.sampler = samplers.Sampler( self._filters = [build_filter_ensemble("none", [["take_first", None]])]
list(self.fewshot_docs()), self, rnd=random.Random(1234)
)
def download(self, data_dir=None, cache_dir=None, download_mode=None) -> None: def download(self, data_dir=None, cache_dir=None, download_mode=None) -> None:
"""Downloads and returns the task dataset. """Downloads and returns the task dataset.
...@@ -250,6 +244,11 @@ class Task(abc.ABC): ...@@ -250,6 +244,11 @@ class Task(abc.ABC):
download_mode=download_mode, download_mode=download_mode,
) )
@property
def config(self):
"""Returns the TaskConfig associated with this class."""
return self._config
@abc.abstractmethod @abc.abstractmethod
def has_training_docs(self): def has_training_docs(self):
"""Whether the task has a training set""" """Whether the task has a training set"""
...@@ -351,9 +350,7 @@ class Task(abc.ABC): ...@@ -351,9 +350,7 @@ class Task(abc.ABC):
False False
), f"Task dataset (path={self.DATASET_PATH}, name={self.DATASET_NAME}) must have valid or test docs!" ), f"Task dataset (path={self.DATASET_PATH}, name={self.DATASET_NAME}) must have valid or test docs!"
eval_logger.info( eval_logger.info(f"Building contexts for task on rank {rank}...")
f"Building contexts for task '{self._config.task}' on rank {rank}..."
)
instances = [] instances = []
for doc_id, doc in utils.create_iterator( for doc_id, doc in utils.create_iterator(
...@@ -362,14 +359,14 @@ class Task(abc.ABC): ...@@ -362,14 +359,14 @@ class Task(abc.ABC):
# sample fewshot context #TODO: need to offset doc_id by rank now! # sample fewshot context #TODO: need to offset doc_id by rank now!
fewshot_ctx = self.fewshot_context( fewshot_ctx = self.fewshot_context(
doc, doc,
self._config.num_fewshot, self.config.num_fewshot,
) )
# TODO: we should override self._config.repeats if doing greedy gen so users don't waste time+compute # TODO: we should override self.config.repeats if doing greedy gen so users don't waste time+compute
inst = self.construct_requests( inst = self.construct_requests(
doc=doc, doc=doc,
ctx=fewshot_ctx, ctx=fewshot_ctx,
metadata=(self._config["task"], doc_id, self._config.repeats), metadata=(self.config["task"], doc_id, self.config.repeats),
) )
if not isinstance(inst, list): if not isinstance(inst, list):
...@@ -443,7 +440,13 @@ class Task(abc.ABC): ...@@ -443,7 +440,13 @@ class Task(abc.ABC):
return len(re.split(r"\s+", doc)) return len(re.split(r"\s+", doc))
@utils.positional_deprecated @utils.positional_deprecated
def fewshot_context(self, doc, num_fewshot): def fewshot_context(
self,
doc,
num_fewshot,
rnd=random.Random(1234),
description=None,
):
"""Returns a fewshot context string that is made up of a prepended description """Returns a fewshot context string that is made up of a prepended description
(if provided), the `num_fewshot` number of examples, and an appended prompt example. (if provided), the `num_fewshot` number of examples, and an appended prompt example.
...@@ -451,34 +454,56 @@ class Task(abc.ABC): ...@@ -451,34 +454,56 @@ class Task(abc.ABC):
The document as returned from training_docs, validation_docs, or test_docs. The document as returned from training_docs, validation_docs, or test_docs.
:param num_fewshot: int :param num_fewshot: int
The number of fewshot examples to provide in the returned context string. The number of fewshot examples to provide in the returned context string.
:param rnd: random.Random
The pseudo-random number generator used to randomly sample examples.
WARNING: This is currently a required arg although it's optionalized with a default `None`.
:param description: str
The task's description that will be prepended to the fewshot examples.
:returns: str :returns: str
The fewshot context. The fewshot context.
""" """
assert (
rnd is not None
), "A `random.Random` generator argument must be provided to `rnd`"
description = description if description else ""
if num_fewshot == 0: if num_fewshot == 0:
# always prepend the (possibly empty) task description labeled_examples = ""
labeled_examples = self._config.description
else: else:
labeled_examples = self._config.description + self.sampler.get_context( # for sets with no training docs, draw from other set *but ensure no overlap with current doc*
doc, num_fewshot if self.has_training_docs():
fewshotex = self.fewshot_examples(k=num_fewshot, rnd=rnd)
else:
if self._fewshot_docs is None:
self._fewshot_docs = list(
self.validation_docs()
if self.has_validation_docs()
else self.test_docs()
)
fewshotex = rnd.sample(self._fewshot_docs, num_fewshot + 1)
# get rid of the doc that's the one we're evaluating, if it's in the fewshot
fewshotex = [x for x in fewshotex if x != doc][:num_fewshot]
labeled_examples = (
"\n\n".join(
[
self.doc_to_text(doc) + self.doc_to_target(doc)
for doc in fewshotex
]
)
+ "\n\n"
) )
example = self.doc_to_text(doc) example = self.doc_to_text(doc)
if type(example) == str: return description + labeled_examples + example
return labeled_examples + example
elif type(example) == list:
return [labeled_examples + ex for ex in example]
elif type(example) == int:
if self._config.doc_to_choice is not None:
choices = self.doc_to_choice(doc)
return labeled_examples + choices[example]
else:
return labeled_examples + str(example)
def apply_filters(self): def apply_filters(self):
if hasattr(self, "_filters"): if hasattr(self, "_filters"):
for f in self._filters: for f in self._filters:
f.apply(self._instances) f.apply(self._instances, None)
else: else:
eval_logger.warning("No filter defined, passing through instances") eval_logger.warning("No filter defined, passing through instances")
return self._instances return self._instances
...@@ -491,7 +516,7 @@ class Task(abc.ABC): ...@@ -491,7 +516,7 @@ class Task(abc.ABC):
""" """
# TODO: this should only return the overrides applied to a non-YAML task's configuration. # TODO: this should only return the overrides applied to a non-YAML task's configuration.
# (num_fewshot) # (num_fewshot)
return self._config.to_dict() return self.config.to_dict()
class ConfigurableTask(Task): class ConfigurableTask(Task):
...@@ -506,53 +531,60 @@ class ConfigurableTask(Task): ...@@ -506,53 +531,60 @@ class ConfigurableTask(Task):
self._config = self.CONFIG self._config = self.CONFIG
# Use new configurations if there was no preconfiguration # Use new configurations if there was no preconfiguration
if self._config is None: if self.config is None:
self._config = TaskConfig(**config) self._config = TaskConfig(**config)
# Overwrite configs # Overwrite configs
else: else:
if config is not None: if config is not None:
self._config.__dict__.update(config) self._config.__dict__.update(config)
if self._config is None: if self.config is None:
raise ValueError( raise ValueError(
"Must pass a config to ConfigurableTask, either in cls.CONFIG or `config` kwarg" "Must pass a config to ConfigurableTask, either in cls.CONFIG or `config` kwarg"
) )
if self._config.output_type is not None: if self.config.output_type is not None:
assert self._config.output_type in ALL_OUTPUT_TYPES assert self.config.output_type in ALL_OUTPUT_TYPES
self.OUTPUT_TYPE = self._config.output_type self.OUTPUT_TYPE = self.config.output_type
if self._config.dataset_path is not None: if self.config.dataset_path is not None:
self.DATASET_PATH = self._config.dataset_path self.DATASET_PATH = self.config.dataset_path
if self._config.dataset_name is not None: if self.config.dataset_name is not None:
self.DATASET_NAME = self._config.dataset_name self.DATASET_NAME = self.config.dataset_name
self._metric_fn_list = {} self._metric_fn_list = {}
self._metric_fn_kwargs = {} self._metric_fn_kwargs = {}
self._aggregation_list = {} self._aggregation_list = {}
self._higher_is_better = {} self._higher_is_better = {}
_metric_list = DEFAULT_METRIC_REGISTRY[self._config.output_type] if self.config.metric_list is None:
if self._config.metric_list is None:
# TODO: handle this in TaskConfig.__post_init__ ? # TODO: handle this in TaskConfig.__post_init__ ?
_metric_list = DEFAULT_METRIC_REGISTRY[self.config.output_type]
for metric_name in _metric_list: for metric_name in _metric_list:
self._metric_fn_list[metric_name] = get_metric(metric_name) self._metric_fn_list[metric_name] = get_metric(metric_name)
self._aggregation_list[metric_name] = get_default_aggregation( self._metric_fn_kwargs[metric_name] = {}
self._aggregation_list[metric_name] = get_metric_aggregation(
metric_name metric_name
) )
self._higher_is_better[metric_name] = is_higher_better(metric_name) self._higher_is_better[metric_name] = is_higher_better(metric_name)
else: else:
for metric_config in self._config.metric_list: for metric_config in self.config.metric_list:
assert "metric" in metric_config assert "metric" in metric_config
metric_name = metric_config["metric"] metric_name = metric_config["metric"]
kwargs = { kwargs = {
key: metric_config[key] key: metric_config[key]
for key in metric_config for key in metric_config
if key not in ["metric", "aggregation", "higher_is_better"] if key
not in ["metric", "aggregation", "higher_is_better", "hf_evaluate"]
} }
hf_evaluate_metric = (
"hf_evaluate" in metric_config
and metric_config["hf_evaluate"] is True
)
if self._config.process_results is not None: if self.config.process_results is not None:
self._metric_fn_list[metric_name] = None self._metric_fn_list[metric_name] = None
self._metric_fn_kwargs[metric_name] = {} self._metric_fn_kwargs[metric_name] = {}
elif callable(metric_name): elif callable(metric_name):
...@@ -561,7 +593,9 @@ class ConfigurableTask(Task): ...@@ -561,7 +593,9 @@ class ConfigurableTask(Task):
self._metric_fn_list[metric_name] = metric_fn self._metric_fn_list[metric_name] = metric_fn
self._metric_fn_kwargs[metric_name] = kwargs self._metric_fn_kwargs[metric_name] = kwargs
else: else:
self._metric_fn_list[metric_name] = get_metric(metric_name) self._metric_fn_list[metric_name] = get_metric(
metric_name, hf_evaluate_metric
)
self._metric_fn_kwargs[metric_name] = kwargs self._metric_fn_kwargs[metric_name] = kwargs
if "aggregation" in metric_config: if "aggregation" in metric_config:
...@@ -574,9 +608,9 @@ class ConfigurableTask(Task): ...@@ -574,9 +608,9 @@ class ConfigurableTask(Task):
] ]
else: else:
INV_AGG_REGISTRY = {v: k for k, v in AGGREGATION_REGISTRY.items()} INV_AGG_REGISTRY = {v: k for k, v in AGGREGATION_REGISTRY.items()}
metric_agg = get_default_aggregation(metric_name) metric_agg = get_metric_aggregation(metric_name)
eval_logger.warning( eval_logger.warning(
f"metric {metric_name} is defined, but aggregation is not. " f"[Task: {self._config.task}] metric {metric_name} is defined, but aggregation is not. "
f"using default " f"using default "
f"aggregation={INV_AGG_REGISTRY[metric_agg]}" f"aggregation={INV_AGG_REGISTRY[metric_agg]}"
) )
...@@ -588,19 +622,19 @@ class ConfigurableTask(Task): ...@@ -588,19 +622,19 @@ class ConfigurableTask(Task):
] ]
else: else:
eval_logger.warning( eval_logger.warning(
f"metric {metric_name} is defined, but higher_is_better is not. " f"[Task: {self._config.task}] metric {metric_name} is defined, but higher_is_better is not. "
f"using default " f"using default "
f"higher_is_better={is_higher_better(metric_name)}" f"higher_is_better={is_higher_better(metric_name)}"
) )
self._higher_is_better[metric_name] = is_higher_better(metric_name) self._higher_is_better[metric_name] = is_higher_better(metric_name)
self.download(self._config.dataset_kwargs) self.download(self.config.dataset_kwargs)
self._training_docs = None self._training_docs = None
self._fewshot_docs = None self._fewshot_docs = None
if self._config.filter_list is not None: if self.config.filter_list is not None:
self._filters = [] self._filters = []
for filter_config in self._config.filter_list: for filter_config in self.config.filter_list:
for filter_pipeline in filter_config: for filter_pipeline in filter_config:
filter_name = filter_config["name"] filter_name = filter_config["name"]
filter_functions = filter_config["filter"] filter_functions = filter_config["filter"]
...@@ -615,18 +649,20 @@ class ConfigurableTask(Task): ...@@ -615,18 +649,20 @@ class ConfigurableTask(Task):
else: else:
self._filters = [build_filter_ensemble("none", [["take_first", None]])] self._filters = [build_filter_ensemble("none", [["take_first", None]])]
if self._config.use_prompt is not None: if self.config.use_prompt is not None:
eval_logger.info(f"loading prompt {self._config.use_prompt}") eval_logger.info(f"loading prompt {self.config.use_prompt}")
self.prompt = get_prompt( self.prompt = get_prompt(
self._config.use_prompt, self.DATASET_PATH, self.DATASET_NAME self.config.use_prompt, self.DATASET_PATH, self.DATASET_NAME
) )
else: else:
self.prompt = None self.prompt = None
if self.fewshot_docs() is not None: if self.fewshot_docs() is not None:
self.sampler = samplers.Sampler( self.sampler = samplers.get_sampler(
list(self.fewshot_docs()), self, rnd=random.Random(1234) self.config.fewshot_config.get("sampler", "default")
) if self.config.fewshot_config
else "default"
)(list(self.fewshot_docs()), self, rnd=random.Random(1234))
if self.has_test_docs(): if self.has_test_docs():
self.task_docs = self.test_docs() self.task_docs = self.test_docs()
...@@ -645,7 +681,7 @@ class ConfigurableTask(Task): ...@@ -645,7 +681,7 @@ class ConfigurableTask(Task):
test_text = self.doc_to_text(test_doc) test_text = self.doc_to_text(test_doc)
test_target = self.doc_to_target(test_doc) test_target = self.doc_to_target(test_doc)
if self._config.doc_to_choice is not None: if self.config.doc_to_choice is not None:
test_choice = self.doc_to_choice(test_doc) test_choice = self.doc_to_choice(test_doc)
if type(test_choice) is not list: if type(test_choice) is not list:
eval_logger.error("doc_to_choice must return list") eval_logger.error("doc_to_choice must return list")
...@@ -669,11 +705,14 @@ class ConfigurableTask(Task): ...@@ -669,11 +705,14 @@ class ConfigurableTask(Task):
check_choices = test_choice check_choices = test_choice
else: else:
check_choices = [test_target] check_choices = [test_target]
if self.config.doc_to_choice is not None:
for choice in check_choices: for choice in check_choices:
choice_has_whitespace = True if " " in choice else False choice_has_whitespace = True if choice[0].isspace() else False
delimiter_has_whitespace = ( delimiter_has_whitespace = (
True if " " in self._config.target_delimiter else False True
if self.config.target_delimiter.rstrip()
!= self.config.target_delimiter
else False
) )
if delimiter_has_whitespace and choice_has_whitespace: if delimiter_has_whitespace and choice_has_whitespace:
...@@ -682,7 +721,7 @@ class ConfigurableTask(Task): ...@@ -682,7 +721,7 @@ class ConfigurableTask(Task):
) )
elif (not delimiter_has_whitespace) and (not choice_has_whitespace): elif (not delimiter_has_whitespace) and (not choice_has_whitespace):
eval_logger.warning( eval_logger.warning(
f'Both target_delimiter and target choice: "{choice}" does not have whitespace, ignore if the language you are evaluating on does not require/use whitespace' f'Both target_delimiter "{self.config.target_delimiter}" and target choice: "{choice}" do not have whitespace, ignore if the language you are evaluating on does not require/use whitespace'
) )
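The revised whitespace check in the hunk above can be tried in isolation. This is a minimal sketch, not the harness's code: the function name `classify_whitespace` and its return labels are hypothetical, and the empty-string guard is an addition, since the diff indexes `choice[0]` directly.

```python
def classify_whitespace(target_delimiter: str, choice: str) -> str:
    """Hypothetical helper mirroring the updated check in the diff."""
    # New rule: only a *leading* whitespace character on the choice counts.
    # The `bool(choice)` guard is an assumption added for empty strings.
    choice_has_whitespace = bool(choice) and choice[0].isspace()
    # New rule: only *trailing* whitespace on the delimiter counts.
    delimiter_has_whitespace = target_delimiter.rstrip() != target_delimiter

    if delimiter_has_whitespace and choice_has_whitespace:
        return "double"   # both sides contribute whitespace
    if not delimiter_has_whitespace and not choice_has_whitespace:
        return "missing"  # neither side contributes whitespace
    return "ok"


print(classify_whitespace(" ", "yes"))
print(classify_whitespace("", "New York"))
```

Under the old `" " in choice` test, a choice such as `"New York"` counted as having whitespace because of its interior space; the sketch follows the new rule and looks only at the first character.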
def download(self, dataset_kwargs=None) -> None: def download(self, dataset_kwargs=None) -> None:
...@@ -693,59 +732,91 @@ class ConfigurableTask(Task): ...@@ -693,59 +732,91 @@ class ConfigurableTask(Task):
) )
def has_training_docs(self) -> bool: def has_training_docs(self) -> bool:
if self._config.training_split is not None: if self.config.training_split is not None:
return True return True
else: else:
return False return False
def has_validation_docs(self) -> bool: def has_validation_docs(self) -> bool:
if self._config.validation_split is not None: if self.config.validation_split is not None:
return True return True
else: else:
return False return False
def has_test_docs(self) -> bool: def has_test_docs(self) -> bool:
if self._config.test_split is not None: if self.config.test_split is not None:
return True return True
else: else:
return False return False
def training_docs(self) -> datasets.Dataset: def training_docs(self) -> datasets.Dataset:
if self.has_training_docs(): if self.has_training_docs():
if self._config.process_docs is not None: if self.config.process_docs is not None:
return self._config.process_docs( return self.config.process_docs(
self.dataset[self._config.training_split] self.dataset[self.config.training_split]
) )
return self.dataset[self._config.training_split] return self.dataset[self.config.training_split]
def validation_docs(self) -> datasets.Dataset: def validation_docs(self) -> datasets.Dataset:
if self.has_validation_docs(): if self.has_validation_docs():
if self._config.process_docs is not None: if self.config.process_docs is not None:
return self._config.process_docs( return self.config.process_docs(
self.dataset[self._config.validation_split] self.dataset[self.config.validation_split]
) )
return self.dataset[self._config.validation_split] return self.dataset[self.config.validation_split]
def test_docs(self) -> datasets.Dataset: def test_docs(self) -> datasets.Dataset:
if self.has_test_docs(): if self.has_test_docs():
if self._config.process_docs is not None: if self.config.process_docs is not None:
return self._config.process_docs(self.dataset[self._config.test_split]) return self.config.process_docs(self.dataset[self.config.test_split])
return self.dataset[self._config.test_split] return self.dataset[self.config.test_split]
def fewshot_docs(self): def fewshot_docs(self):
if self._config.fewshot_split is not None: if self.config.fewshot_split is not None:
return self.dataset[self._config.fewshot_split] return self.dataset[self.config.fewshot_split]
else: else:
if self._config.num_fewshot > 0: if self.config.num_fewshot > 0:
eval_logger.warning( eval_logger.warning(
f"Task '{self._config.task}': " f"Task '{self.config.task}': "
"num_fewshot > 0 but fewshot_split is None. " "num_fewshot > 0 but fewshot_split is None. "
"using preconfigured rule." "using preconfigured rule."
) )
return super().fewshot_docs() return super().fewshot_docs()
def apply_filters(self): @utils.positional_deprecated
def fewshot_context(self, doc, num_fewshot):
"""Returns a fewshot context string that is made up of a prepended description
(if provided), the `num_fewshot` number of examples, and an appended prompt example.
:param doc: str
The document as returned from training_docs, validation_docs, or test_docs.
:param num_fewshot: int
The number of fewshot examples to provide in the returned context string.
:returns: str
The fewshot context.
"""
if num_fewshot == 0:
# always prepend the (possibly empty) task description
labeled_examples = self.config.description
else:
labeled_examples = self.config.description + self.sampler.get_context(
doc, num_fewshot
)
example = self.doc_to_text(doc)
if type(example) == str:
return labeled_examples + example
elif type(example) == list:
return [labeled_examples + ex for ex in example]
elif type(example) == int:
if self.config.doc_to_choice is not None:
choices = self.doc_to_choice(doc)
return labeled_examples + choices[example]
else:
return labeled_examples + str(example)
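The type dispatch at the end of the new `fewshot_context` can be sketched as a standalone function. The name `build_context` and its parameters are hypothetical stand-ins for `self.config.description`, the sampler output, and `self.doc_to_text(doc)`; the final `TypeError` is an addition, as the method in the diff has no fallback branch.

```python
from typing import List, Optional, Union


def build_context(
    description: str,
    fewshot_block: str,
    example: Union[str, List[str], int],
    choices: Optional[List[str]] = None,
) -> Union[str, List[str]]:
    """Hypothetical standalone version of the dispatch in fewshot_context."""
    # The (possibly empty) description is always prepended.
    labeled_examples = description + fewshot_block

    # The diff compares with `type(example) == ...`; `type(...) is ...`
    # is the equivalent exact-type check.
    if type(example) is str:
        return labeled_examples + example
    if type(example) is list:
        # One context per input variant.
        return [labeled_examples + ex for ex in example]
    if type(example) is int:
        # An int is treated as an index into the choices when they exist.
        if choices is not None:
            return labeled_examples + choices[example]
        return labeled_examples + str(example)
    raise TypeError(f"unsupported example type: {type(example)!r}")


print(build_context("Answer briefly.\n", "Q: 1+1?\nA: 2\n\n", "Q: 2+2?\nA:"))
```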
def apply_filters(self):
if hasattr(self, "_filters"): if hasattr(self, "_filters"):
for f in self._filters: for f in self._filters:
f.apply(self._instances, self.task_docs) f.apply(self._instances, self.task_docs)
...@@ -754,15 +825,15 @@ class ConfigurableTask(Task): ...@@ -754,15 +825,15 @@ class ConfigurableTask(Task):
return self._instances return self._instances
def should_decontaminate(self): def should_decontaminate(self):
return self._config.should_decontaminate return self.config.should_decontaminate
def doc_to_decontamination_query(self, doc): def doc_to_decontamination_query(self, doc):
if self._config.should_decontaminate: if self.config.should_decontaminate:
if self._config.doc_to_decontamination_query in self.features: if self.config.doc_to_decontamination_query in self.features:
return doc[self._config.doc_to_decontamination_query] return doc[self.config.doc_to_decontamination_query]
else: else:
return ast.literal_eval( return ast.literal_eval(
utils.apply_template(self._config.doc_to_decontamination_query, doc) utils.apply_template(self.config.doc_to_decontamination_query, doc)
) )
def _process_doc(self, doc): def _process_doc(self, doc):
...@@ -780,13 +851,13 @@ class ConfigurableTask(Task): ...@@ -780,13 +851,13 @@ class ConfigurableTask(Task):
if self.prompt is not None: if self.prompt is not None:
doc_to_text = self.prompt doc_to_text = self.prompt
else: else:
doc_to_text = self._config.doc_to_text doc_to_text = self.config.doc_to_text
if type(doc_to_text) == int: if type(doc_to_text) == int:
return doc_to_text return doc_to_text
elif type(doc_to_text) == str: elif type(doc_to_text) == str:
if doc_to_text in self.features: if doc_to_text in self.features:
# if self._config.doc_to_choice is not None: # if self.config.doc_to_choice is not None:
# return self.doc_to_choice(doc)[doc[doc_to_text]] # return self.doc_to_choice(doc)[doc[doc_to_text]]
# else: # else:
return doc[doc_to_text] return doc[doc_to_text]
...@@ -805,7 +876,7 @@ class ConfigurableTask(Task): ...@@ -805,7 +876,7 @@ class ConfigurableTask(Task):
return applied_prompt[0] return applied_prompt[0]
else: else:
eval_logger.warning("Applied prompt returns empty string") eval_logger.warning("Applied prompt returns empty string")
return self._config.fewshot_delimiter return self.config.fewshot_delimiter
else: else:
print(type(doc_to_text)) print(type(doc_to_text))
raise TypeError raise TypeError
...@@ -814,13 +885,13 @@ class ConfigurableTask(Task): ...@@ -814,13 +885,13 @@ class ConfigurableTask(Task):
if self.prompt is not None: if self.prompt is not None:
doc_to_target = self.prompt doc_to_target = self.prompt
else: else:
doc_to_target = self._config.doc_to_target doc_to_target = self.config.doc_to_target
if type(doc_to_target) == int: if type(doc_to_target) == int:
return doc_to_target return doc_to_target
elif type(doc_to_target) == str: elif type(doc_to_target) == str:
if doc_to_target in self.features: if doc_to_target in self.features:
# if self._config.doc_to_choice is not None: # if self.config.doc_to_choice is not None:
# return self.doc_to_choice(doc)[doc[doc_to_target]] # return self.doc_to_choice(doc)[doc[doc_to_target]]
# else: # else:
return doc[doc_to_target] return doc[doc_to_target]
...@@ -833,7 +904,10 @@ class ConfigurableTask(Task): ...@@ -833,7 +904,10 @@ class ConfigurableTask(Task):
and (target_string[0] == "[") and (target_string[0] == "[")
and (target_string[-1] == "]") and (target_string[-1] == "]")
): ):
try:
return ast.literal_eval(target_string) return ast.literal_eval(target_string)
except (SyntaxError, ValueError):
return target_string
else: else:
return target_string return target_string
elif type(doc_to_target) == list: elif type(doc_to_target) == list:
...@@ -847,17 +921,17 @@ class ConfigurableTask(Task): ...@@ -847,17 +921,17 @@ class ConfigurableTask(Task):
return applied_prompt[1] return applied_prompt[1]
else: else:
eval_logger.warning("Applied prompt returns empty string") eval_logger.warning("Applied prompt returns empty string")
return self._config.fewshot_delimiter return self.config.fewshot_delimiter
else: else:
raise TypeError raise TypeError
def doc_to_choice(self, doc: Any) -> List[str]: def doc_to_choice(self, doc: Any) -> List[str]:
if self.prompt is not None: if self.prompt is not None:
doc_to_choice = self.prompt doc_to_choice = self.prompt
elif self._config.doc_to_choice is None: elif self.config.doc_to_choice is None:
eval_logger.error("doc_to_choice was called but not set in config") eval_logger.error("doc_to_choice was called but not set in config")
else: else:
doc_to_choice = self._config.doc_to_choice doc_to_choice = self.config.doc_to_choice
if type(doc_to_choice) == str: if type(doc_to_choice) == str:
return ast.literal_eval(utils.apply_template(doc_to_choice, doc)) return ast.literal_eval(utils.apply_template(doc_to_choice, doc))
...@@ -872,26 +946,6 @@ class ConfigurableTask(Task): ...@@ -872,26 +946,6 @@ class ConfigurableTask(Task):
else: else:
raise TypeError raise TypeError
def gold_alias(self, doc):
# returns a version of the gold target answer to a document,
# which should be passed into metric for scoring as the ground truth.
# in multiple_choice tasks, this should be castable to an int corresponding to the index
# within the answer choices, while doc_to_target is the string version of {{answer_choices[gold]}}.
if self._config.gold_alias is not None:
doc_to_target = self._config.gold_alias
else:
return self.doc_to_target(doc)
if type(doc_to_target) == str:
return utils.apply_template(doc_to_target, doc)
elif callable(doc_to_target):
return doc_to_target(doc)
elif hasattr(doc_to_target, "apply"):
return doc_to_target.apply(doc)[1]
else:
raise TypeError
def construct_requests( def construct_requests(
self, doc: dict, ctx: str, **kwargs self, doc: dict, ctx: str, **kwargs
) -> Union[List[Instance], Instance]: ) -> Union[List[Instance], Instance]:
...@@ -901,7 +955,7 @@ class ConfigurableTask(Task): ...@@ -901,7 +955,7 @@ class ConfigurableTask(Task):
arguments = (self.doc_to_target(doc),) arguments = (self.doc_to_target(doc),)
elif self.OUTPUT_TYPE == "multiple_choice": elif self.OUTPUT_TYPE == "multiple_choice":
choices = self.doc_to_choice(doc) choices = self.doc_to_choice(doc)
target_delimiter = self._config.target_delimiter target_delimiter = self.config.target_delimiter
if self.multiple_input: if self.multiple_input:
# If there are multiple inputs, choices are placed in the ctx # If there are multiple inputs, choices are placed in the ctx
cont = self.doc_to_target(doc) cont = self.doc_to_target(doc)
...@@ -942,16 +996,16 @@ class ConfigurableTask(Task): ...@@ -942,16 +996,16 @@ class ConfigurableTask(Task):
) )
return request_list return request_list
elif self.OUTPUT_TYPE == "greedy_until": elif self.OUTPUT_TYPE == "generate_until":
arguments = (ctx, self._config.generation_kwargs) arguments = (ctx, self.config.generation_kwargs)
return Instance( return Instance(
request_type=self.OUTPUT_TYPE, doc=doc, arguments=arguments, idx=0, **kwargs request_type=self.OUTPUT_TYPE, doc=doc, arguments=arguments, idx=0, **kwargs
) )
def process_results(self, doc, results): def process_results(self, doc, results):
if callable(self._config.process_results): if callable(self.config.process_results):
return self._config.process_results(doc, results) return self.config.process_results(doc, results)
result_dict = {} result_dict = {}
use_metric = list(self._metric_fn_list.keys()) use_metric = list(self._metric_fn_list.keys())
...@@ -1054,23 +1108,31 @@ class ConfigurableTask(Task): ...@@ -1054,23 +1108,31 @@ class ConfigurableTask(Task):
acc_mutual_info = 1.0 if np.argmax(lls_mutual_info) == gold else 0.0 acc_mutual_info = 1.0 if np.argmax(lls_mutual_info) == gold else 0.0
result_dict["acc_mutual_info"] = acc_mutual_info result_dict["acc_mutual_info"] = acc_mutual_info
elif self.OUTPUT_TYPE == "greedy_until": elif self.OUTPUT_TYPE == "generate_until":
gold = self.doc_to_target(doc) gold = self.doc_to_target(doc)
if self._config.doc_to_choice is not None: result = results[0]
if self.config.doc_to_choice is not None:
# If you set doc_to_choice, # If you set doc_to_choice,
# it assumes that doc_to_target returns a number. # it assumes that doc_to_target returns a number.
choices = self.doc_to_choice(doc) choices = self.doc_to_choice(doc)
gold = choices[gold] gold = choices[gold]
else: # we expect multiple_targets to be a list.
gold = str(gold) elif self.multiple_target:
gold = list(gold)
elif type(gold) != type(result):
# cast gold to the same type as result
gold = type(result)(gold)
result = results[0]
for metric in self._metric_fn_list.keys(): for metric in self._metric_fn_list.keys():
if self.multiple_target: if self.multiple_target:
# in the case where we have multiple targets, # in the case where we have multiple targets,
# return true if any are true # return true if any are true
# TODO: this may break for multipLe_target, non zero-or-1 metrics # TODO: this may break for multipLe_target, non zero-or-1 metrics
scores = [] scores = []
if not isinstance(gold, list):
# sometimes, a multiple_target dataset has exceptions where one doc has only one string answer
# print(gold)
gold = [gold]
for gold_option in gold: for gold_option in gold:
try: try:
result_score = self._metric_fn_list[metric]( result_score = self._metric_fn_list[metric](
...@@ -1078,7 +1140,9 @@ class ConfigurableTask(Task): ...@@ -1078,7 +1140,9 @@ class ConfigurableTask(Task):
predictions=[result], predictions=[result],
**self._metric_fn_kwargs[metric], **self._metric_fn_kwargs[metric],
) )
except TypeError: # TODO: this is hacky and I don't want to do it except (
TypeError
): # TODO: this is hacky and I don't want to do it
result_score = self._metric_fn_list[metric]( result_score = self._metric_fn_list[metric](
[gold_option, result] [gold_option, result]
) )
...@@ -1097,7 +1161,9 @@ class ConfigurableTask(Task): ...@@ -1097,7 +1161,9 @@ class ConfigurableTask(Task):
predictions=[result], predictions=[result],
**self._metric_fn_kwargs[metric], **self._metric_fn_kwargs[metric],
) )
except TypeError: # needed for now in order to use a different interface between our own metrics and HF Evaluate metrics except (
TypeError
): # needed for now in order to use a different interface between our own metrics and HF Evaluate metrics
result_score = self._metric_fn_list[metric]([gold, result]) result_score = self._metric_fn_list[metric]([gold, result])
if isinstance(result_score, dict): if isinstance(result_score, dict):
# TODO: this handles the case where HF evaluate returns a dict. # TODO: this handles the case where HF evaluate returns a dict.
...@@ -1106,7 +1172,7 @@ class ConfigurableTask(Task): ...@@ -1106,7 +1172,7 @@ class ConfigurableTask(Task):
else: else:
raise ValueError( raise ValueError(
f"Passed invalid output_type '{self.OUTPUT_TYPE}' ! Please use one of ", f"Passed invalid output_type '{self.OUTPUT_TYPE}' ! Please use one of ",
"'loglikelihood', 'loglikelihood_rolling', 'greedy_until' or 'multiple_choice'", "'loglikelihood', 'loglikelihood_rolling', 'generate_until' or 'multiple_choice'",
) )
return result_dict return result_dict
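The gold-normalization order introduced for `generate_until` in the last hunks can be summarized with a small sketch. The function `normalize_gold` and its keyword arguments are hypothetical; they stand in for `self.doc_to_choice(doc)` and `self.multiple_target`.

```python
def normalize_gold(gold, result, choices=None, multiple_target=False):
    """Hypothetical helper following the branch order shown in the diff."""
    if choices is not None:
        # doc_to_choice is set: gold is assumed to be an index into choices.
        return choices[gold]
    if multiple_target:
        # Multiple targets are expected to be list-like.
        return list(gold)
    if type(gold) != type(result):
        # Cast gold to the type of the model output, e.g. int -> str.
        return type(result)(gold)
    return gold


print(normalize_gold(1, "b", choices=["a", "b"]))
print(normalize_gold(42, "42"))
print(normalize_gold(("Paris", "paris"), "Paris", multiple_target=True))
```

Note that `list()` applied to a bare string yields its individual characters, so in this sketch a multiple-target document whose gold is a single string would need wrapping as `[gold]` before the `list(gold)` step.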