Commit e495e3a0 authored by gk

Merge branch 'master' into big-refactor-test

parents 6d355b85 9d06c953
......@@ -9,5 +9,5 @@ jobs:
- uses: actions/checkout@v3
- uses: actions/setup-python@v4
with:
python-version: 3.8
python-version: 3.9
- uses: pre-commit/action@v2.0.3
......@@ -12,7 +12,7 @@ repos:
- id: check-merge-conflict
- id: check-symlinks
- id: check-yaml
args: ['--unsafe']
args: ["--unsafe"]
- id: destroyed-symlinks
- id: detect-private-key
- id: end-of-file-fixer
......@@ -33,7 +33,7 @@ repos:
rev: 22.3.0
hooks:
- id: black
language_version: python3.8
language_version: python3.9
- repo: https://github.com/codespell-project/codespell
rev: v2.1.0
hooks:
......
* @jon-tow @StellaAthena
* @jon-tow @StellaAthena @haileyschoelkopf @lintangsutawika
......@@ -7,14 +7,17 @@
This project provides a unified framework to test generative language models on a large number of different evaluation tasks.
**Features:**
### Features
- 200+ tasks implemented. See the [task-table](./docs/task_table.md) for a complete list.
- Support for the Hugging Face `transformers` library, GPT-NeoX, Megatron-DeepSpeed, and the OpenAI API, with flexible tokenization-agnostic interface.
- Support for evaluation on adapters (e.g. LoRa) supported in [HuggingFace's PEFT library](https://github.com/huggingface/peft).
- Task versioning to ensure reproducibility.
- Support for the Hugging Face [transformers](https://github.com/huggingface/transformers) library, [GPT-NeoX](https://github.com/EleutherAI/gpt-neox), and [Megatron-DeepSpeed](https://github.com/microsoft/Megatron-DeepSpeed), with a flexible, tokenization-agnostic interface.
- Support for commercial APIs including [OpenAI](https://openai.com/), [goose.ai](https://goose.ai/), [Anthropic](https://www.anthropic.com/), and [TextSynth](https://textsynth.com/).
- Support for evaluation on adapters (e.g. LoRA) supported in [HuggingFace's PEFT library](https://github.com/huggingface/peft).
- Support for GPTQ quantized models via [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ).
- Evaluating with publicly available prompts ensures reproducibility and comparability between papers.
- Task versioning to ensure reproducibility when tasks are updated.
**Evaluation Overview**
### Evaluation Overview
`Task` and `Prompt` classes contain information that, when combined, produces the input to the language model. The language model is then queried to obtain an output. One or more `Filters` can then be applied to perform arbitrary operations on the model's raw output, such as selecting the final answer (for chain of thought) or calling an external API. This final output is then evaluated using a `Metric` to obtain the final result.
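The toy sketch below mirrors that flow end to end. Every class and function in it is an illustrative stand-in invented for this example, not the harness's actual API.
```python
from dataclasses import dataclass


@dataclass
class ToyTask:
    question: str
    gold: str


def build_prompt(task: ToyTask) -> str:
    # Task + Prompt: turn a document into the text fed to the model.
    return f"Q: {task.question}\nA:"


def toy_model(prompt: str) -> str:
    # Stand-in for querying the language model.
    return " The answer is 4 because 2 + 2 = 4."


def final_answer_filter(raw_output: str) -> str:
    # Filter: pull the final answer out of a chain-of-thought style response.
    return raw_output.strip().split()[3].rstrip(".")


def exact_match_metric(prediction: str, gold: str) -> float:
    # Metric: score the filtered output against the reference.
    return float(prediction == gold)


task = ToyTask(question="What is 2 + 2?", gold="4")
prediction = final_answer_filter(toy_model(build_prompt(task)))
print(exact_match_metric(prediction, task.gold))  # -> 1.0
```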
......@@ -37,7 +40,7 @@ graph LR;
O --> F
Me --> R:::empty
F --> R
```
```
## Install
......@@ -55,12 +58,19 @@ To install additional multilingual tokenization and text segmentation packages,
pip install -e ".[multilingual]"
```
To support loading GPTQ quantized models, install the package with the `auto-gptq` extra:
```bash
pip install -e ".[auto-gptq]"
```
## Basic Usage
> **Note**: When reporting results from eval harness, please include the task versions (shown in `results["versions"]`) for reproducibility. This allows bug fixes to tasks while also ensuring that previously reported scores are reproducible. See the [Task Versioning](#task-versioning) section for more info.
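For example, assuming you have saved the harness output to a JSON file (the `results.json` path below is hypothetical), the versions to report can be read straight from that dictionary:
```python
import json

# Hypothetical path: wherever you saved the harness's JSON output.
with open("results.json") as f:
    results = json.load(f)

for task, version in results["versions"].items():
    print(f"{task}: version {version}")
```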
To evaluate a model hosted on the [HuggingFace Hub](https://huggingface.co/models) (e.g. GPT-J-6B) you can use the following command:
### Hugging Face `transformers`
To evaluate a model hosted on the [HuggingFace Hub](https://huggingface.co/models) (e.g. GPT-J-6B) on `lambada_openai` and `hellaswag`, you can use the following command:
```bash
python main.py \
......@@ -70,21 +80,24 @@ python main.py \
--device cuda:0
```
Additional arguments can be provided to the model constructor using the `--model_args` flag. Most notably, this supports the common practice of using the `revisions` feature on the Hub to store partially trained checkpoints:
Additional arguments can be provided to the model constructor using the `--model_args` flag. Most notably, this supports the common practice of using the `revisions` feature on the Hub to store partially trained checkpoints, or specifying the datatype for running a model:
```bash
python main.py \
--model hf-causal \
--model_args pretrained=EleutherAI/pythia-160m,revision=step100000 \
--model_args pretrained=EleutherAI/pythia-160m,revision=step100000,dtype="float" \
--tasks lambada_openai,hellaswag \
--device cuda:0
```
To evaluate models that are called via `AutoSeq2SeqLM`, you instead use `hf-seq2seq`.
To evaluate models that are loaded via `AutoSeq2SeqLM`, you instead use `hf-seq2seq`.
> **Warning**: Choosing the wrong model type may produce erroneous outputs without raising an error.
Arguments provided via `--model_args` get passed to the relevant constructor directly. This means that anything you can do with `AutoModel` can be done with our library.
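Conceptually, the `--model_args` string is just a comma-separated list of `key=value` pairs that becomes constructor keyword arguments. The toy parser below is only a sketch of that idea, not the harness's own parsing code, which may differ in details such as type coercion:
```python
def parse_model_args(arg_string: str) -> dict:
    # Split "k1=v1,k2=v2" into {"k1": "v1", "k2": "v2"}; values stay strings in this sketch.
    kwargs = {}
    for pair in filter(None, arg_string.split(",")):
        key, _, value = pair.partition("=")
        kwargs[key.strip()] = value.strip()
    return kwargs


print(parse_model_args("pretrained=EleutherAI/pythia-160m,revision=step100000,dtype=float"))
# {'pretrained': 'EleutherAI/pythia-160m', 'revision': 'step100000', 'dtype': 'float'}
```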
To use with [PEFT](https://github.com/huggingface/peft), take the call you would run to evaluate the base model and add `,peft=PATH` to the `model_args` argument as shown below:
```bash
python main.py \
--model hf-causal \
......@@ -93,7 +106,18 @@ python main.py \
--device cuda:0
```
Our library also supports the OpenAI API:
GPTQ quantized models can be loaded by adding `,quantized=NAME` (or `,quantized=True` for the default file name) to the `model_args` argument:
```bash
python main.py \
--model hf-causal \
--model_args pretrained=model-name-or-path,quantized=model.safetensors,gptq_use_triton=True \
--tasks hellaswag
```
### Commercial APIs
Our library also supports language models served via the OpenAI API:
```bash
export OPENAI_API_SECRET_KEY=YOUR_KEY_HERE
......@@ -115,7 +139,9 @@ python main.py \
--check_integrity
```
To evaluate mesh-transformer-jax models that are not available on HF, please invoke eval harness through [this script](https://github.com/kingoflolz/mesh-transformer-jax/blob/master/eval_harness.py).
### Other Frameworks
A number of other libraries contain scripts for calling the eval harness from within their own codebases. These include [GPT-NeoX](https://github.com/EleutherAI/gpt-neox/blob/main/eval_tasks/eval_adapter.py), [Megatron-DeepSpeed](https://github.com/microsoft/Megatron-DeepSpeed/blob/main/examples/MoE/readme_evalharness.md), and [mesh-transformer-jax](https://github.com/kingoflolz/mesh-transformer-jax/blob/master/eval_harness.py).
💡 **Tip**: You can inspect what the LM inputs look like by running the following command:
......@@ -131,16 +157,17 @@ This will write out one text file for each task.
## Multi-GPU Evaluation
Multi-GPU evaluation is supported through [accelerate](https://github.com/huggingface/accelerate). To initialize the distributed environment, run ```accelerate config``` in terminal and follow the prompts. Once the environment is configured, evaluations can be launched with:
Multi-GPU evaluation is supported through [accelerate](https://github.com/huggingface/accelerate). To initialize the distributed environment, run `accelerate config` in your terminal and follow the prompts. Once the environment is configured, evaluations can be launched with:
```bash
accelerate launch main.py \
--model hf-causal \
--model_args pretrained=EleutherAI/pythia-12b \
--tasks lambada_openai,arc_easy \
--batch_size 16 \
--batch_size 16
```
**Warning**: Distributed evaluation requires launching multiple processes of the evaluation script. Running ```python main.py *args*``` instead of ```accelerate launch main.py *args*``` on machine with multiple GPUs will only run the evaluations on a single device.
**Warning**: Distributed evaluation requires launching multiple processes of the evaluation script. Running `python main.py *args*` instead of `accelerate launch main.py *args*` on a machine with multiple GPUs will only run the evaluations on a single device (unless you instead use `use_accelerate=True` in `--model_args`).
## Implementing new tasks
......@@ -154,7 +181,7 @@ When reporting eval harness results, please also report the version of each task
## Test Set Decontamination
To address concerns about train / test contamination, we provide utilities for comparing results on a benchmark using only the data points nto found in the model trainign set. Unfortunately, outside of models trained on the Pile and C4, its very rare that people who train models disclose the contents of the training data. However this utility can be useful to evaluate models you have trained on private data, provided you are willing to pre-compute the necessary indices. We provide computed indices for 13-gram exact match deduplication against the Pile, and plan to add additional precomputed dataset indices in the future (including C4 and min-hash LSH deduplication).
To address concerns about train / test contamination, we provide utilities for comparing results on a benchmark using only the data points not found in the model training set. Unfortunately, outside of models trained on the Pile and C4, it's very rare that people who train models disclose the contents of the training data. However, this utility can be useful to evaluate models you have trained on private data, provided you are willing to pre-compute the necessary indices. We provide computed indices for 13-gram exact match deduplication against the Pile, and plan to add additional precomputed dataset indices in the future (including C4 and min-hash LSH deduplication).
For details on text decontamination, see the [decontamination guide](./docs/decontamination.md).
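For intuition, here is a minimal, self-contained sketch of 13-gram exact-match decontamination. It assumes direct access to the training text and is not the harness's own index format or implementation:
```python
from typing import Iterable, Iterator, List, Set


def ngrams(tokens: List[str], n: int = 13) -> Iterator[tuple]:
    # Yield every contiguous n-gram in a token list.
    for i in range(len(tokens) - n + 1):
        yield tuple(tokens[i : i + n])


def build_index(training_texts: Iterable[str], n: int = 13) -> Set[int]:
    # Hash every 13-gram of the training data into a set (a stand-in for a precomputed index).
    return {hash(gram) for text in training_texts for gram in ngrams(text.split(), n)}


def is_contaminated(test_text: str, index: Set[int], n: int = 13) -> bool:
    # Flag a test document if any of its 13-grams appears in the training index.
    return any(hash(gram) in index for gram in ngrams(test_text.split(), n))


train = ["the quick brown fox jumps over the lazy dog and then runs far away home"]
index = build_index(train)
print(is_contaminated("brown fox jumps over the lazy dog and then runs far away home today", index))  # True
```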
......
......@@ -202,7 +202,7 @@ def construct_requests(self, doc, ctx):
#### What's a `Request`? What's a `doc`?
To reiterate, a `doc` is just a `Dict` object that contains information about a document from your corpus. It can contain things like a prompt, question type information, answers, and anything else you think will be needed in order to assess your model for a given task. Keep in mind that the fields of a `doc` can be basically whatever you want (you can sort this out in `training_docs` / `validation_docs` / `test_docs` if you need to customise things - see above); just remember to be consistent with them throughout the rest of the `Task` you write up.
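For example, a `doc` for a toy multiple-choice task might look like this (the field names are entirely your choice):
```python
doc = {
    "question": "What is the capital of France?",
    "choices": ["London", "Paris", "Berlin"],
    "gold": 1,  # index of the correct choice
}
```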
A `Request` is an object that takes the text prompt you want to present to a model and computes one of a few different types of response. These are evaluated lazily (meaning, only when the result is actually needed). If your task requires generating text you'll need to return an `rf.greedy_until` request; otherwise, an `rf.loglikelihood` across all labels in a classification task will do.
The function `construct_requests` can return a list of `Request`s or an iterable; it's perfectly fine to `yield` them from something or other. This is particularly handy if you are creating more than one request per `doc` (usually because you're up to something like multi-task learning). The objects this function returns then get consumed one by one and turned into result objects.
The function `construct_requests` returns a list or tuple of `Request`s, or a single `Request`. This is particularly handy if you are creating more than one request per `doc` (usually because you're up to something like multi-task learning). The objects this function returns then get consumed one by one and turned into result objects.
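As a hedged illustration, `construct_requests` for a classification-style task and for a generation-style task might look roughly like the following. The `doc` field names and the exact stop-sequence argument of `rf.greedy_until` are assumptions for this sketch, not canonical signatures:
```python
# `rf` is the harness's request factory mentioned above.
def construct_requests(self, doc, ctx):
    # Classification-style: one loglikelihood request per candidate label.
    return [rf.loglikelihood(ctx, " " + choice) for choice in doc["choices"]]


def construct_requests_for_generation(self, doc, ctx):
    # Generation-style: a single greedy_until request; check the task guide for
    # the exact stop-sequence format expected by your harness version.
    return rf.greedy_until(ctx, ["\n"])
```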
```python
......@@ -271,6 +271,19 @@ python main.py \
--num_fewshot K
```
### Checking the Model Outputs
The `write_out.py` script mentioned previously can be used to verify that the prompts look as intended. If you also want to save model outputs, you can use the `--write_out` parameter in `main.py` to dump JSON with prompts and completions. The output path can be chosen with `--output_base_path`. This is helpful for debugging and for exploring model outputs.
```sh
python main.py \
--model gpt2 \
--model_args device=<device-name> \
--tasks <task-name> \
--num_fewshot K \
--write_out \
--output_base_path <path>
```
### Running Unit Tests
To run the entire test suite, use:
......
......@@ -17,6 +17,27 @@
|arithmetic_4ds | |✓ | | 2000|acc |
|arithmetic_5da | |✓ | | 2000|acc |
|arithmetic_5ds | |✓ | | 2000|acc |
|bigbench_causal_judgement | | |✓ | 190|multiple_choice_grade, exact_str_match |
|bigbench_date_understanding | | |✓ | 369|multiple_choice_grade, exact_str_match |
|bigbench_disambiguation_qa | | |✓ | 258|multiple_choice_grade, exact_str_match |
|bigbench_dyck_languages | | |✓ | 1000|multiple_choice_grade, exact_str_match |
|bigbench_formal_fallacies_syllogisms_negation | | |✓ | 14200|multiple_choice_grade, exact_str_match |
|bigbench_geometric_shapes | | |✓ | 359|multiple_choice_grade, exact_str_match |
|bigbench_hyperbaton | | |✓ | 50000|multiple_choice_grade, exact_str_match |
|bigbench_logical_deduction_five_objects | | |✓ | 500|multiple_choice_grade, exact_str_match |
|bigbench_logical_deduction_seven_objects | | |✓ | 700|multiple_choice_grade, exact_str_match |
|bigbench_logical_deduction_three_objects | | |✓ | 300|multiple_choice_grade, exact_str_match |
|bigbench_movie_recommendation | | |✓ | 500|multiple_choice_grade, exact_str_match |
|bigbench_navigate | | |✓ | 1000|multiple_choice_grade, exact_str_match |
|bigbench_reasoning_about_colored_objects | | |✓ | 2000|multiple_choice_grade, exact_str_match |
|bigbench_ruin_names | | |✓ | 448|multiple_choice_grade, exact_str_match |
|bigbench_salient_translation_error_detection | | |✓ | 998|multiple_choice_grade, exact_str_match |
|bigbench_snarks | | |✓ | 181|multiple_choice_grade, exact_str_match |
|bigbench_sports_understanding | | |✓ | 986|multiple_choice_grade, exact_str_match |
|bigbench_temporal_sequences | | |✓ | 1000|multiple_choice_grade, exact_str_match |
|bigbench_tracking_shuffled_objects_five_objects | | |✓ | 1250|multiple_choice_grade, exact_str_match |
|bigbench_tracking_shuffled_objects_seven_objects | | |✓ | 1750|multiple_choice_grade, exact_str_match |
|bigbench_tracking_shuffled_objects_three_objects | | |✓ | 300|multiple_choice_grade, exact_str_match |
|blimp_adjunct_island | |✓ | | 1000|acc |
|blimp_anaphor_gender_agreement | |✓ | | 1000|acc |
|blimp_anaphor_number_agreement | |✓ | | 1000|acc |
......@@ -89,6 +110,28 @@
|cola |✓ |✓ | | 1043|mcc |
|copa |✓ |✓ | | 100|acc |
|coqa |✓ |✓ | | 500|f1, em |
|crows_pairs_english | |✓ | | 1677|likelihood_difference, pct_stereotype |
|crows_pairs_english_age | |✓ | | 91|likelihood_difference, pct_stereotype |
|crows_pairs_english_autre | |✓ | | 11|likelihood_difference, pct_stereotype |
|crows_pairs_english_disability | |✓ | | 65|likelihood_difference, pct_stereotype |
|crows_pairs_english_gender | |✓ | | 320|likelihood_difference, pct_stereotype |
|crows_pairs_english_nationality | |✓ | | 216|likelihood_difference, pct_stereotype |
|crows_pairs_english_physical_appearance | |✓ | | 72|likelihood_difference, pct_stereotype |
|crows_pairs_english_race_color | |✓ | | 508|likelihood_difference, pct_stereotype |
|crows_pairs_english_religion | |✓ | | 111|likelihood_difference, pct_stereotype |
|crows_pairs_english_sexual_orientation | |✓ | | 93|likelihood_difference, pct_stereotype |
|crows_pairs_english_socioeconomic | |✓ | | 190|likelihood_difference, pct_stereotype |
|crows_pairs_french | |✓ | | 1677|likelihood_difference, pct_stereotype |
|crows_pairs_french_age | |✓ | | 90|likelihood_difference, pct_stereotype |
|crows_pairs_french_autre | |✓ | | 13|likelihood_difference, pct_stereotype |
|crows_pairs_french_disability | |✓ | | 66|likelihood_difference, pct_stereotype |
|crows_pairs_french_gender | |✓ | | 321|likelihood_difference, pct_stereotype |
|crows_pairs_french_nationality | |✓ | | 253|likelihood_difference, pct_stereotype |
|crows_pairs_french_physical_appearance | |✓ | | 72|likelihood_difference, pct_stereotype |
|crows_pairs_french_race_color | |✓ | | 460|likelihood_difference, pct_stereotype |
|crows_pairs_french_religion | |✓ | | 115|likelihood_difference, pct_stereotype |
|crows_pairs_french_sexual_orientation | |✓ | | 91|likelihood_difference, pct_stereotype |
|crows_pairs_french_socioeconomic | |✓ | | 196|likelihood_difference, pct_stereotype |
|cycle_letters | |✓ | | 10000|acc |
|drop |✓ |✓ | | 9536|em, f1 |
|ethics_cm |✓ | |✓ | 3885|acc |
......@@ -161,13 +204,13 @@
|hendrycksTest-world_religions | |✓ |✓ | 171|acc, acc_norm |
|iwslt17-ar-en | | |✓ | 1460|bleu, chrf, ter |
|iwslt17-en-ar | | |✓ | 1460|bleu, chrf, ter |
|lambada_openai | | | | 5153|ppl, acc |
|lambada_openai_cloze | | | | 5153|ppl, acc |
|lambada_openai_mt_de | | | | 5153|ppl, acc |
|lambada_openai_mt_en | | | | 5153|ppl, acc |
|lambada_openai_mt_es | | | | 5153|ppl, acc |
|lambada_openai_mt_fr | | | | 5153|ppl, acc |
|lambada_openai_mt_it | | | | 5153|ppl, acc |
|lambada_openai | | | | 5153|ppl, acc |
|lambada_openai_cloze | | | | 5153|ppl, acc |
|lambada_openai_mt_de | | | | 5153|ppl, acc |
|lambada_openai_mt_en | | | | 5153|ppl, acc |
|lambada_openai_mt_es | | | | 5153|ppl, acc |
|lambada_openai_mt_fr | | | | 5153|ppl, acc |
|lambada_openai_mt_it | | | | 5153|ppl, acc |
|lambada_standard | |✓ |✓ | 5153|ppl, acc |
|lambada_standard_cloze | |✓ |✓ | 5153|ppl, acc |
|logiqa |✓ |✓ |✓ | 651|acc, acc_norm |
......@@ -181,6 +224,17 @@
|math_precalc |✓ | |✓ | 546|acc |
|mathqa |✓ |✓ |✓ | 2985|acc, acc_norm |
|mc_taco | |✓ |✓ | 9442|f1, em |
|mgsm_bn |✓ | |✓ | 250|acc |
|mgsm_de |✓ | |✓ | 250|acc |
|mgsm_en |✓ | |✓ | 250|acc |
|mgsm_es |✓ | |✓ | 250|acc |
|mgsm_fr |✓ | |✓ | 250|acc |
|mgsm_ja |✓ | |✓ | 250|acc |
|mgsm_ru |✓ | |✓ | 250|acc |
|mgsm_sw |✓ | |✓ | 250|acc |
|mgsm_te |✓ | |✓ | 250|acc |
|mgsm_th |✓ | |✓ | 250|acc |
|mgsm_zh |✓ | |✓ | 250|acc |
|mnli |✓ |✓ | | 9815|acc |
|mnli_mismatched |✓ |✓ | | 9832|acc |
|mrpc |✓ |✓ | | 408|acc, f1 |
......@@ -188,6 +242,13 @@
|mutual |✓ |✓ | | 886|r@1, r@2, mrr |
|mutual_plus |✓ |✓ | | 886|r@1, r@2, mrr |
|openbookqa |✓ |✓ |✓ | 500|acc, acc_norm |
|pawsx_de |✓ |✓ |✓ | 2000|acc |
|pawsx_en |✓ |✓ |✓ | 2000|acc |
|pawsx_es |✓ |✓ |✓ | 2000|acc |
|pawsx_fr |✓ |✓ |✓ | 2000|acc |
|pawsx_ja |✓ |✓ |✓ | 2000|acc |
|pawsx_ko |✓ |✓ |✓ | 2000|acc |
|pawsx_zh |✓ |✓ |✓ | 2000|acc |
|pile_arxiv | |✓ |✓ | 2407|word_perplexity, byte_perplexity, bits_per_byte |
|pile_bookcorpus2 | |✓ |✓ | 28|word_perplexity, byte_perplexity, bits_per_byte |
|pile_books3 | |✓ |✓ | 269|word_perplexity, byte_perplexity, bits_per_byte |
......@@ -228,6 +289,7 @@
|squad2 |✓ |✓ | | 11873|exact, f1, HasAns_exact, HasAns_f1, NoAns_exact, NoAns_f1, best_exact, best_f1 |
|sst |✓ |✓ | | 872|acc |
|swag |✓ |✓ | | 20006|acc, acc_norm |
|toxigen |✓ | |✓ | 940|acc, acc_norm |
|triviaqa |✓ |✓ | | 11313|acc |
|truthfulqa_gen | |✓ | | 817|bleurt_max, bleurt_acc, bleurt_diff, bleu_max, bleu_acc, bleu_diff, rouge1_max, rouge1_acc, rouge1_diff, rouge2_max, rouge2_acc, rouge2_diff, rougeL_max, rougeL_acc, rougeL_diff|
|truthfulqa_mc | |✓ | | 817|mc1, mc2 |
......@@ -266,3 +328,46 @@
|wnli |✓ |✓ | | 71|acc |
|wsc |✓ |✓ | | 104|acc |
|wsc273 | | |✓ | 273|acc |
|xcopa_et | |✓ |✓ | 500|acc |
|xcopa_ht | |✓ |✓ | 500|acc |
|xcopa_id | |✓ |✓ | 500|acc |
|xcopa_it | |✓ |✓ | 500|acc |
|xcopa_qu | |✓ |✓ | 500|acc |
|xcopa_sw | |✓ |✓ | 500|acc |
|xcopa_ta | |✓ |✓ | 500|acc |
|xcopa_th | |✓ |✓ | 500|acc |
|xcopa_tr | |✓ |✓ | 500|acc |
|xcopa_vi | |✓ |✓ | 500|acc |
|xcopa_zh | |✓ |✓ | 500|acc |
|xnli_ar |✓ |✓ |✓ | 5010|acc |
|xnli_bg |✓ |✓ |✓ | 5010|acc |
|xnli_de |✓ |✓ |✓ | 5010|acc |
|xnli_el |✓ |✓ |✓ | 5010|acc |
|xnli_en |✓ |✓ |✓ | 5010|acc |
|xnli_es |✓ |✓ |✓ | 5010|acc |
|xnli_fr |✓ |✓ |✓ | 5010|acc |
|xnli_hi |✓ |✓ |✓ | 5010|acc |
|xnli_ru |✓ |✓ |✓ | 5010|acc |
|xnli_sw |✓ |✓ |✓ | 5010|acc |
|xnli_th |✓ |✓ |✓ | 5010|acc |
|xnli_tr |✓ |✓ |✓ | 5010|acc |
|xnli_ur |✓ |✓ |✓ | 5010|acc |
|xnli_vi |✓ |✓ |✓ | 5010|acc |
|xnli_zh |✓ |✓ |✓ | 5010|acc |
|xstory_cloze_ar |✓ |✓ | | 1511|acc |
|xstory_cloze_en |✓ |✓ | | 1511|acc |
|xstory_cloze_es |✓ |✓ | | 1511|acc |
|xstory_cloze_eu |✓ |✓ | | 1511|acc |
|xstory_cloze_hi |✓ |✓ | | 1511|acc |
|xstory_cloze_id |✓ |✓ | | 1511|acc |
|xstory_cloze_my |✓ |✓ | | 1511|acc |
|xstory_cloze_ru |✓ |✓ | | 1511|acc |
|xstory_cloze_sw |✓ |✓ | | 1511|acc |
|xstory_cloze_te |✓ |✓ | | 1511|acc |
|xstory_cloze_zh |✓ |✓ | | 1511|acc |
|xwinograd_en | | |✓ | 2325|acc |
|xwinograd_fr | | |✓ | 83|acc |
|xwinograd_jp | | |✓ | 959|acc |
|xwinograd_pt | | |✓ | 263|acc |
|xwinograd_ru | | |✓ | 315|acc |
|xwinograd_zh | | |✓ | 504|acc |
ROUGE
rouge
nin
maka
mor
te
import abc
from typing import Union
from lm_eval import utils
......
......@@ -460,7 +460,7 @@ class Task(abc.ABC):
return self._instances
def dump_config(self):
"""Returns a dictionary representing the task's config.
"""Returns a dictionary representing the task's config.
:returns: dict
The task's configuration as a dictionary.
......
......@@ -30,14 +30,16 @@ def simple_evaluate(
tasks=[],
num_fewshot=0,
batch_size=None,
max_batch_size=None,
device=None,
no_cache=False,
limit=None,
bootstrap_iters=100000,
check_integrity=False,
decontamination_ngrams_path=None,
write_out=False,
output_base_path=None,
):
"""Instantiate and evaluate a model on a list of tasks.
:param model: Union[str, LM]
......@@ -49,18 +51,24 @@ def simple_evaluate(
List of task names or Task objects. Task objects will be taken to have name task.EVAL_HARNESS_NAME if defined and type(task).__name__ otherwise.
:param num_fewshot: int
Number of examples in few-shot context
:param batch_size: int, optional
:param batch_size: int or str, optional
Batch size for model
:param max_batch_size: int, optional
Maximal batch size to try with automatic batch size detection
:param device: str, optional
PyTorch device (e.g. "cpu" or "cuda:0") for running models
:param no_cache: bool
Whether or not to cache
:param limit: int, optional
Limit the number of examples per task (only use this for testing)
:param limit: int or float, optional
Limit the number of examples per task (only use this for testing). If <1, limit is interpreted as a fraction of the total number of examples.
:param bootstrap_iters:
Number of iterations for bootstrap statistics
:param check_integrity: bool
Whether to run the relevant part of the test suite for the tasks
:param write_out: bool
If True, write details about prompts and logits to json for all tasks
:param output_base_path: str, optional
Directory to which detailed eval info will be written. Defaults to the current working directory.
:return
Dictionary of results
"""
......@@ -73,7 +81,7 @@ def simple_evaluate(
if model_args is None:
model_args = ""
lm = lm_eval.api.registry.get_model(model).create_from_arg_string(
model_args, {"batch_size": batch_size, "device": device}
model_args, {"batch_size": batch_size, "max_batch_size": max_batch_size, "device": device}
)
else:
assert isinstance(model, lm_eval.api.model.LM)
......@@ -90,15 +98,18 @@ def simple_evaluate(
limit=limit,
bootstrap_iters=bootstrap_iters,
decontamination_ngrams_path=decontamination_ngrams_path,
write_out=write_out,
output_base_path=output_base_path,
)
if lm.rank == 0:
# add info about the model and few shot config
results["config"] = {
"model": model,
"model": model if isinstance(model, str) else model.model.config._name_or_path,
"model_args": model_args,
"num_fewshot": num_fewshot,
"batch_size": batch_size,
"batch_sizes": list(lm.batch_sizes.values()) if hasattr(lm, "batch_sizes") else [],
"device": device,
"no_cache": no_cache,
"limit": limit,
......@@ -120,6 +131,8 @@ def evaluate(
limit=None,
bootstrap_iters=100000,
decontamination_ngrams_path=None,
write_out=False,
output_base_path=None,
):
"""Instantiate and evaluate a model on a list of tasks.
......@@ -133,6 +146,10 @@ def evaluate(
Limit the number of examples per task (only use this for testing)
:param bootstrap_iters:
Number of iterations for bootstrap statistics
:param write_out: bool
If True, write all prompts, logits and metrics to json for offline analysis
:param output_base_path: str, optional
Directory to which detailed eval info will be written. Defaults to the current working directory.
:return
Dictionary of results
"""
......
import os
from lm_eval.base import BaseLM
from tqdm import tqdm
import time
def anthropic_completion(client, model, prompt, max_tokens_to_sample, temperature, stop):
"""Query Anthropic API for completion.
Retry with back-off until they respond
"""
import anthropic
backoff_time = 3
while True:
try:
response = client.completion(
prompt=f"{anthropic.HUMAN_PROMPT} {prompt}{anthropic.AI_PROMPT}",
model=model,
# NOTE: Claude really likes to do CoT, and overly aggressive stop sequences
# (e.g. gsm8k's ":") may truncate a lot of the input.
stop_sequences=[anthropic.HUMAN_PROMPT] + stop,
max_tokens_to_sample=max_tokens_to_sample,
temperature=temperature,
)
print(response)
return response["completion"]
except RuntimeError:
# TODO: I don't actually know what error Anthropic raises when it times out
# so update this error handling when we find out.
import traceback
traceback.print_exc()
time.sleep(backoff_time)
backoff_time *= 1.5
class AnthropicLM(BaseLM):
REQ_CHUNK_SIZE = 20
def __init__(self, model):
"""
:param model: str
Anthropic model e.g. claude-instant-v1
"""
super().__init__()
import anthropic
self.model = model
self.client = anthropic.Client(os.environ['ANTHROPIC_API_KEY'])
@property
def eot_token_id(self):
raise NotImplementedError("No idea about anthropic tokenization.")
@property
def max_length(self):
return 2048
@property
def max_gen_toks(self):
return 256
@property
def batch_size(self):
# Isn't used because we override _loglikelihood_tokens
raise NotImplementedError()
@property
def device(self):
# Isn't used because we override _loglikelihood_tokens
raise NotImplementedError()
def tok_encode(self, string: str):
raise NotImplementedError("No idea about anthropic tokenization.")
def tok_decode(self, tokens):
raise NotImplementedError("No idea about anthropic tokenization.")
def _loglikelihood_tokens(self, requests, disable_tqdm=False):
raise NotImplementedError("No support for logits.")
def greedy_until(self, requests):
if not requests:
return []
res = []
for request in tqdm(requests):
inp = request[0]
request_args = request[1]
until = request_args["until"]
response = anthropic_completion(
client=self.client,
model=self.model,
prompt=inp,
max_tokens_to_sample=self.max_gen_toks,
temperature=0.0,
stop=until,
)
res.append(response)
return res
def _model_call(self, inps):
# Isn't used because we override _loglikelihood_tokens
raise NotImplementedError()
def _model_generate(self, context, max_length, eos_token_id):
# Isn't used because we override greedy_until
raise NotImplementedError()
from typing import Iterable
from tqdm import tqdm
from accelerate import find_executable_batch_size
import math
import peft
from peft import __version__ as PEFT_VERSION
from pathlib import Path
from typing import List, Mapping, NewType, Optional, Tuple, Union
from tqdm import tqdm
import torch
import torch.nn.functional as F
import transformers
from typing import Optional, Union
from transformers import BatchEncoding
from lm_eval.api.model import LM
from lm_eval import utils
from abc import abstractmethod
class BaseLM(LM):
def __init__(self):
super().__init__()
self.batch_schedule = 1
self.batch_sizes = {}
self.max_batch_size = 512
@property
@abstractmethod
def eot_token_id(self):
pass
@property
@abstractmethod
def max_length(self):
pass
@property
@abstractmethod
def max_gen_toks(self):
pass
@property
@abstractmethod
def batch_size(self):
pass
@property
@abstractmethod
def device(self):
pass
@abstractmethod
def tok_encode(self, string: str):
pass
@abstractmethod
def tok_decode(self, tokens: Iterable[int]):
pass
@abstractmethod
def _model_generate(self, context, max_length, eos_token_id):
pass
@abstractmethod
def _model_call(self, inps):
"""
inps: a torch tensor of shape [batch, sequence]
the size of sequence may vary from call to call
returns: a torch tensor of shape [batch, sequence, vocab] with the
logits returned from the model
"""
pass
def _detect_batch_size(self, requests=None, pos=0):
if requests:
_, context_enc, continuation_enc = requests[pos]
max_length = len((context_enc + continuation_enc)[-(self.max_length + 1) :][:-1])
else:
max_length = self.max_length
# if OOM, then halves batch_size and tries again
@find_executable_batch_size(starting_batch_size=self.max_batch_size)
def forward_batch(batch_size):
test_batch = torch.ones((batch_size, max_length), device=self.device).long()
for _ in range(5):
_ = F.log_softmax(self._model_call(test_batch), dim=-1).cpu()
return batch_size
batch_size = forward_batch()
utils.clear_torch_cache()
return batch_size
# subclass must implement properties vocab_size, eot_token_id, max_gen_toks, batch_size, device, max_length.
# TODO: enforce this somehow
def _encode_pair(self, context, continuation):
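# Move any trailing whitespace on the context over to the continuation before
# tokenizing: BPE tokenizers usually fold a leading space into the following token,
# so leaving the space on the context would make the separately-encoded pieces
# disagree with the encoding of the concatenated string below.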
n_spaces = len(context) - len(context.rstrip())
if n_spaces > 0:
continuation = context[-n_spaces:] + continuation
context = context[:-n_spaces]
whole_enc = self.tok_encode(context + continuation)
context_enc = self.tok_encode(context)
context_enc_len = len(context_enc)
continuation_enc = whole_enc[context_enc_len:]
return context_enc, continuation_enc
def loglikelihood(self, requests):
new_reqs = []
for context, continuation in requests:
if context == "":
# end of text as context
context_enc, continuation_enc = [self.eot_token_id], self.tok_encode(continuation)
else:
context_enc, continuation_enc = self._encode_pair(context, continuation)
new_reqs.append(((context, continuation), context_enc, continuation_enc))
return self._loglikelihood_tokens(new_reqs)
def loglikelihood_rolling(self, requests):
# TODO: Implement caching once we've confirmed the perplexity implementation
# automatic batch size detection for vectorization
adaptive_batch_size = None
if self.batch_size == "auto":
# using rolling window with maximum context
print("Passed argument batch_size = auto. Detecting largest batch size")
batch_size = self._detect_batch_size()
print(f"Determined Largest batch size: {batch_size}")
adaptive_batch_size = batch_size
loglikelihoods = []
for (string,) in tqdm(requests):
rolling_token_windows = list(
map(
utils.make_disjoint_window,
utils.get_rolling_token_windows(
token_list=self.tok_encode(string),
prefix_token=self.eot_token_id,
max_seq_len=self.max_length,
context_len=1,
),
)
)
rolling_token_windows = [(None,) + x for x in rolling_token_windows]
# TODO: extract out this call so it only gets called once and also somehow figure out partial caching for
# that
string_nll = self._loglikelihood_tokens(
rolling_token_windows,
disable_tqdm=True,
override_bs=adaptive_batch_size,
)
# discard is_greedy
string_nll = [x[0] for x in string_nll]
string_nll = sum(string_nll)
loglikelihoods.append(string_nll)
return loglikelihoods
def _loglikelihood_tokens(self, requests, disable_tqdm=False, override_bs=None):
# TODO: implement some kind of efficient-request-middleware that lumps together requests with the same context
res = []
def _collate(x):
# the negative sign on len(toks) sorts descending - this has a few advantages:
# - time estimates will always be over not underestimates, which is more useful for planning
# - to know the size of a batch when going through the list, you know the first one is always the batch
# padded context length. this is useful to simplify the batching logic and more importantly to make
# automatic adaptive batches much much easier to implement
# - any OOMs will happen right away rather than near the end
toks = x[1] + x[2]
return -len(toks), tuple(toks)
re_ord = utils.Reorderer(requests, _collate)
reordered_requests = re_ord.get_reordered()
n_reordered_requests = len(reordered_requests)
# automatic (variable) batch size detection for vectorization
# pull longest context sample from request
def _batch_scheduler(pos):
sched = pos // int(n_reordered_requests / self.batch_schedule)
if sched in self.batch_sizes:
return self.batch_sizes[sched]
print(f"Passed argument batch_size = auto:{self.batch_schedule}. Detecting largest batch size")
self.batch_sizes[sched] = self._detect_batch_size(reordered_requests, pos)
print(f"Determined largest batch size: {self.batch_sizes[sched]}")
return self.batch_sizes[sched]
for chunk in utils.chunks(
tqdm(reordered_requests, disable=disable_tqdm),
n=self.batch_size if self.batch_size != "auto" else override_bs if override_bs is not None else 0,
fn=_batch_scheduler if self.batch_size == "auto" and n_reordered_requests > 0 else None,
):
inps = []
cont_toks_list = []
inplens = []
padding_length = None
# because vectorizing is annoying, we first convert each (context, continuation) pair to padded
# tensors, then we pack them together into a batch, call the model, and then pick it all apart
# again because vectorizing is annoying
for _, context_enc, continuation_enc in chunk:
# sanity check
assert len(context_enc) > 0
assert len(continuation_enc) > 0
assert len(continuation_enc) <= self.max_length
# how this all works:
# CTX CONT
# inp 0 1 2 3|4 5 6 7 8 9 <- last token is deleted by inp[:, :-1]
# gpt2 \ \
# logits 1 2 3|4 5 6 7 8 9 <- the ctx half gets tossed out by the
# cont_toks 4 5 6 7 8 9 [:, -len(continuation_enc):, :self.vocab_size] slice
# when too long to fit in context, truncate from the left
inp = torch.tensor(
(context_enc + continuation_enc)[-(self.max_length + 1) :][:-1],
dtype=torch.long,
).to(self.device)
(inplen,) = inp.shape
cont = continuation_enc
# since in _collate we make sure length is descending, the longest is always the first one.
padding_length = (
padding_length if padding_length is not None else inplen
)
# pad length from seq to padding_length
inp = torch.cat(
[
inp, # [seq]
torch.zeros(padding_length - inplen, dtype=torch.long).to(
inp.device
), # [padding_length - seq]
],
dim=0,
)
inps.append(inp.unsqueeze(0)) # [1, padding_length]
cont_toks_list.append(cont)
inplens.append(inplen)
batched_inps = torch.cat(inps, dim=0) # [batch, padding_length]
multi_logits = F.log_softmax(
self._model_call(batched_inps), dim=-1
).cpu() # [batch, padding_length, vocab]
for (cache_key, _, _), logits, inp, inplen, cont_toks in zip(
chunk, multi_logits, inps, inplens, cont_toks_list
):
# Slice to original seq length
contlen = len(cont_toks)
logits = logits[inplen - contlen : inplen].unsqueeze(
0
) # [1, seq, vocab]
# Check if per-token argmax is exactly equal to continuation
greedy_tokens = logits.argmax(dim=-1)
cont_toks = torch.tensor(cont_toks, dtype=torch.long).unsqueeze(
0
) # [1, seq]
max_equal = (greedy_tokens == cont_toks).all()
# Obtain log-probs at the corresponding continuation token indices
# last_token_slice = logits[:, -1, :].squeeze(0).tolist()
logits = torch.gather(logits, 2, cont_toks.unsqueeze(-1)).squeeze(
-1
) # [1, seq]
# Answer: (log prob, is-exact-match)
answer = (float(logits.sum()), bool(max_equal))
# partial caching
if cache_key is not None:
self.cache_hook.add_partial("loglikelihood", cache_key, answer)
res.append(answer)
return re_ord.get_original(res)
def greedy_until(self, requests):
# TODO: implement fully general `until` that handles until that are
# multiple tokens or that span multiple tokens correctly
# TODO: extract to TokenizedLM?
res = []
def _collate(x):
toks = self.tok_encode(x[0])
return len(toks), x[0]
re_ord = utils.Reorderer(requests, _collate)
for context, request_args in tqdm(re_ord.get_reordered()):
until = request_args["until"]
if isinstance(until, str):
until = [until]
if until:
(primary_until,) = self.tok_encode(until[0])
else:
primary_until = None
context_enc = torch.tensor(
[self.tok_encode(context)[self.max_gen_toks - self.max_length :]]
).to(self.device)
max_gen_tokens = min(
self.max_gen_toks, request_args.get("max_length", self.max_gen_toks)
)
cont = self._model_generate(
context_enc, context_enc.shape[1] + max_gen_tokens, primary_until
)
s = self.tok_decode(cont[0].tolist()[context_enc.shape[1] :])
for term in until:
s = s.split(term)[0]
# partial caching
self.cache_hook.add_partial("greedy_until", (context, until), s)
res.append(s)
return re_ord.get_original(res)
def _get_dtype(
dtype: Union[str, torch.dtype]
) -> torch.dtype:
"""Converts `dtype` from `str` to torch.dtype when possible. Does not use an instantiated HF AutoConfig"""
if isinstance(dtype, str) and dtype != "auto":
# Convert `str` args torch dtype: `float16` -> `torch.float16`
_torch_dtype = getattr(torch, dtype)
else:
_torch_dtype = dtype
return _torch_dtype
class HFLM(BaseLM):
_DEFAULT_MAX_LENGTH = 2048
def __init__(
self,
device="cuda",
pretrained="gpt2",
revision="main",
low_cpu_mem_usage=None,
subfolder=None,
tokenizer=None,
batch_size=1,
max_length=None,
load_in_8bit: Optional[bool] = False,
trust_remote_code: Optional[bool] = False,
dtype: Optional[Union[str, torch.dtype]]="auto",
):
super().__init__()
assert isinstance(device, str)
assert isinstance(pretrained, str)
assert isinstance(batch_size, (int, str))
device_list = set(
["cuda", "cpu"] + [f"cuda:{i}" for i in range(torch.cuda.device_count())]
)
if device and device in device_list:
self._device = torch.device(device)
print(f"Using device '{device}'")
else:
print("Device not specified")
print(f"Cuda Available? {torch.cuda.is_available()}")
self._device = (
torch.device("cuda")
if torch.cuda.is_available()
else torch.device("cpu")
)
# TODO: update this to be less of a hack once subfolder is fixed in HF
revision = revision + ("/" + subfolder if subfolder is not None else "")
self.gpt2 = transformers.AutoModelForCausalLM.from_pretrained(
pretrained,
load_in_8bit=load_in_8bit,
low_cpu_mem_usage=low_cpu_mem_usage,
revision=revision,
torch_dtype=_get_dtype(dtype),
trust_remote_code=trust_remote_code,
).eval()
if not load_in_8bit:
try:
self.gpt2.to(self.device)
except:
print("Failed to place model onto specified device. This may be because the model is quantized via `bitsandbytes`. If the desired GPU is being used, this message is safe to ignore.")
self.tokenizer = transformers.AutoTokenizer.from_pretrained(
pretrained if tokenizer is None else tokenizer,
revision=revision,
trust_remote_code=trust_remote_code,
)
self.vocab_size = self.tokenizer.vocab_size
# setup for automatic batch size detection
if batch_size == "auto":
self.batch_size_per_gpu = batch_size
else:
self.batch_size_per_gpu = int(batch_size)
self._max_length = max_length
@property
def eot_token_id(self):
# we use EOT because end of *text* is more accurate for what we're doing than end of *sentence*
return self.tokenizer.eos_token_id
@property
def max_length(self):
if self._max_length: # if max length manually set, return it
return self._max_length
seqlen_config_attrs = ("n_positions", "max_position_embeddings", "n_ctx")
for attr in seqlen_config_attrs:
if hasattr(self.gpt2.config, attr):
return getattr(self.gpt2.config, attr)
if hasattr(self.tokenizer, "model_max_length"):
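# transformers reports int(1e30) (1000000000000000019884624838656) as model_max_length
# when the tokenizer config does not specify one; treat that sentinel as "unset".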
if self.tokenizer.model_max_length == 1000000000000000019884624838656:
return self._DEFAULT_MAX_LENGTH
return self.tokenizer.model_max_length
return self._DEFAULT_MAX_LENGTH
@property
def max_gen_toks(self):
return 256
@property
def batch_size(self):
# TODO: fix multi-gpu
return self.batch_size_per_gpu # * gpus
@property
def device(self):
# TODO: fix multi-gpu
return self._device
def tok_encode(self, string: str):
return self.tokenizer.encode(string, add_special_tokens=False)
def tok_decode(self, tokens):
return self.tokenizer.decode(tokens)
def _model_call(self, inps):
"""
inps: a torch tensor of shape [batch, sequence]
the size of sequence may vary from call to call
returns: a torch tensor of shape [batch, sequence, vocab] with the
logits returned from the model
"""
with torch.no_grad():
return self.gpt2(inps)[0]
def _model_generate(self, context, max_length, eos_token_id):
generation_kwargs = {"do_sample": False, "max_length": max_length}
if eos_token_id is not None:
generation_kwargs['eos_token_id'] = eos_token_id
generation_kwargs['pad_token_id'] = eos_token_id # setting eos_token_id as pad token
return self.gpt2.generate(context, **generation_kwargs)
TokenSequence = Union[List[int], torch.LongTensor, torch.Tensor, BatchEncoding]
_DeviceMapping = NewType("DeviceMapping", Mapping[str, Union[int, str, torch.device]])
def _get_accelerate_args(
device_map_option: Optional[str] = "auto",
max_memory_per_gpu: Optional[Union[int, str]] = None,
max_cpu_memory: Optional[Union[int, str]] = None,
offload_folder: Optional[str] = "./offload",
) -> dict:
"""Returns the kwargs needed to apply `accelerate` in `AutoModel.from_pretrained`."""
max_memory = {}
if max_memory_per_gpu is not None:
max_memory_per_gpu_map = {
device_idx: max_memory_per_gpu
for device_idx in range(torch.cuda.device_count())
}
max_memory.update(max_memory_per_gpu_map)
if max_cpu_memory is not None:
max_memory["cpu"] = max_cpu_memory
args = {}
if max_memory:
args["max_memory"] = max_memory
args["device_map"] = device_map_option
args["offload_folder"] = offload_folder
return args
def _get_dtype(
dtype: Union[str, torch.dtype], config: Optional[transformers.AutoConfig] = None
) -> torch.dtype:
"""Converts `dtype` from `str` to torch.dtype when possible."""
if dtype is None and config is not None:
_torch_dtype = config.torch_dtype
elif isinstance(dtype, str) and dtype != "auto":
# Convert `str` args torch dtype: `float16` -> `torch.float16`
_torch_dtype = getattr(torch, dtype)
else:
_torch_dtype = dtype
return _torch_dtype
class HuggingFaceAutoLM(BaseLM):
AUTO_CONFIG_CLASS: transformers.AutoConfig = transformers.AutoConfig
AUTO_TOKENIZER_CLASS: transformers.AutoTokenizer = transformers.AutoTokenizer
AUTO_MODEL_CLASS: transformers.AutoModel = None
AUTO_PEFT_CLASS: peft.PeftModel = None
# Default max sequence length setting for when no `max_length` is provided
# or no max length config setting is found in the model or tokenizer.
_DEFAULT_MAX_LENGTH: int = 2048
def __init__(
self,
pretrained: str,
quantized: Optional[Union[bool, str]] = False,
tokenizer: Optional[str] = None,
subfolder: Optional[str] = None,
revision: Optional[str] = "main",
batch_size: Optional[Union[int, str]] = 1,
max_batch_size: Optional[int] = 512,
max_gen_toks: Optional[int] = 256,
max_length: Optional[int] = None,
add_special_tokens: Optional[bool] = None,
use_accelerate: Optional[bool] = False,
device_map_option: Optional[str] = "auto",
max_memory_per_gpu: Optional[Union[int, str]] = None,
max_cpu_memory: Optional[Union[int, str]] = None,
offload_folder: Optional[str] = "./offload",
dtype: Optional[Union[str, torch.dtype]] = None,
device: Optional[Union[int, str]] = "cuda",
peft: str = None,
load_in_8bit: Optional[bool] = False,
load_in_4bit: Optional[bool] = False,
trust_remote_code: Optional[bool] = False,
gptq_use_triton: Optional[bool] = False,
):
"""Initializes a HuggingFace `AutoModel` and `AutoTokenizer` for evaluation.
Args:
pretrained (str):
The HuggingFace Hub model ID name or the path to a pre-trained
model to load. This is effectively the `pretrained_model_name_or_path`
argument of `from_pretrained` in the HuggingFace `transformers` API.
quantized (str or bool, optional, defaults to False):
File name of a GPTQ quantized model to load. Set to `True` to use the
default name of the quantized model.
add_special_tokens (bool, optional, defaults to None):
Whether to add special tokens to the input sequences. If `None`, the
default value will be set to `True` for seq2seq models (e.g. T5) and
`False` for causal models.
WARNING: Evaluating causal models with `add_special_tokens=True` is
currently __not__ supported.
> Large model loading `accelerate` arguments
use_accelerate (bool, optional, defaults to False):
If True, uses the `accelerate` library to load a large model across
multiple devices.
device_map_option (str, optional, defaults to "auto"):
The device map option to use when loading the model with
`accelerate`.
Options:
"auto", "balanced", "balanced_low_0", "sequential"
See the `accelerate` docs for more details on these options:
https://huggingface.co/docs/transformers/main/en/main_classes/model#transformers.PreTrainedModel.from_pretrained.device_map
max_memory_per_gpu (Union[int, str], optional, defaults to None):
The maximum memory available for each GPU in bytes as `int` or in
the format f"{significand}{unit_symbol}" where {unit_symbol} is
any of ["GB", "MB", "GIB", "MIB"]. Refer to the `max_memory` arg in
the "Parameters for big model inference" section of the following
docs:
https://huggingface.co/docs/transformers/main/en/main_classes/model#transformers.PreTrainedModel.from_pretrained.max_memory
max_cpu_memory (Union[int, str], optional, defaults to None):
The maximum available CPU RAM in bytes as `int` or in the format
f"{significand}{unit_symbol}" where {unit_symbol} is any of
["GB", "MB", "GIB", "MIB"]. Refer to the `max_memory` arg in the
"Parameters for big model inference" section of the following docs:
https://huggingface.co/docs/transformers/main/en/main_classes/model#transformers.PreTrainedModel.from_pretrained.max_memory
offload_folder (str, optional, defaults to "./offload"):
The folder to offload weights into if `device_map` contains any
"disk" value.
dtype (Union[str, torch.dtype], optional, defaults to None):
Converts the model weights to `dtype`, if specified. Strings get
converted to `torch.dtype` objects (e.g. `float16` -> `torch.float16`).
Use `dtype="auto"` to derive the type from the model’s weights.
peft (str, optional, defaults to None):
Path of the adapter weights to load from Huggingface. This will usually
include a directory that includes the files `adapter_config.json` and
`adapter_model.bin`. Compatible with [PEFT](https://github.com/huggingface/peft)
load_in_8bit (bool, optional, defaults to False):
If True, will convert the loaded model into mixed-8bit quantized model. See:
https://huggingface.co/docs/transformers/main/en/main_classes/quantization#load-a-large-model-in-8bit
load_in_4bit (bool, optional, defaults to False):
If True, will convert the loaded model into mixed-4bit quantized model. See:
https://huggingface.co/docs/transformers/main/en/main_classes/quantization#load-a-large-model-in-4bit
trust_remote_code (bool, optional, defaults to False):
If True, will trust the remote code when loading the model.
gptq_use_triton (bool, optional, defaults to False):
Use Triton for GPTQ inference.
"""
super().__init__()
assert isinstance(pretrained, str)
assert isinstance(device, str)
assert isinstance(batch_size, (int, str))
if (
add_special_tokens is not None
and self.AUTO_MODEL_CLASS is transformers.AutoModelForCausalLM
):
# TODO: Support evaluating causal models with special tokens. Currently,
# this is not possible because the `_loglikelihood_tokens()` method for
# causal LMs makes a no-special-tokens assumption given that contexts
# and labels/continuations are tokenized separately without special
# tokens, concatenated, and then processed as inputs.
assert (
not add_special_tokens
), "Evaluating causal models with `add_special_tokens=True` is currently not supported."
# setup for automatic batch size detection
if str(batch_size).startswith("auto"):
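# batch_size may be "auto" or "auto:N"; N sets batch_schedule, i.e. how many times
# the largest workable batch size is re-detected over the course of a run
# (see BaseLM._batch_scheduler).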
batch_size = batch_size.split(":")
self._batch_size = batch_size[0]
self.batch_schedule = float(batch_size[1]) if len(batch_size) > 1 else 1
else:
self._batch_size = int(batch_size)
self.max_batch_size = max_batch_size
self._max_gen_toks = max_gen_toks
self._max_length = max_length
self._config = self.AUTO_CONFIG_CLASS.from_pretrained(
pretrained,
trust_remote_code=trust_remote_code,
revision=revision + ("/" + subfolder if subfolder is not None else ""),
)
self._add_special_tokens = add_special_tokens
self.tokenizer = self._create_auto_tokenizer(
pretrained=pretrained,
revision=revision,
subfolder=subfolder,
tokenizer=tokenizer,
)
self.tokenizer.model_max_length = self.max_length
model_kwargs = {}
if use_accelerate:
model_kwargs = _get_accelerate_args(
device_map_option,
max_memory_per_gpu,
max_cpu_memory,
offload_folder,
)
self.model = self._create_auto_model(
pretrained=pretrained,
quantized=quantized,
trust_remote_code=trust_remote_code,
revision=revision,
subfolder=subfolder,
torch_dtype=_get_dtype(dtype, self._config),
gptq_use_triton=gptq_use_triton,
load_in_8bit=load_in_8bit,
load_in_4bit=load_in_4bit,
**model_kwargs,
)
# note: peft_path can be different than pretrained model path
if peft is not None:
self.model = self._create_auto_model_peft(
model=self.model,
peft=peft,
revision=revision,
subfolder=subfolder,
load_in_4bit=load_in_4bit,
)
self.model.eval()
torch.set_grad_enabled(False)
self._device = device
if use_accelerate and "lm_head" in self.model.hf_device_map:
# `accelerate` can place `lm_head` weights on a different device than
# the user specified one so we force `self._device` to be the same as
# `lm_head`'s.
self._device = self.model.hf_device_map["lm_head"]
if not use_accelerate and not (load_in_4bit or load_in_8bit):
try:
self.model.to(self._device)
except:
print("Failed to place model onto specified device. This may be because the model is quantized via `bitsandbytes`. If the desired GPU is being used, this message is safe to ignore.")
def _create_auto_model(
self,
*,
pretrained: str,
quantized: Optional[Union[bool, str]] = False,
revision: str,
subfolder: str,
device_map: Optional[Union[str, _DeviceMapping]] = None,
max_memory: Optional[dict] = None,
offload_folder: Optional[str] = None,
load_in_8bit: Optional[bool] = False,
load_in_4bit: Optional[bool] = False,
trust_remote_code: Optional[bool] = False,
torch_dtype: Optional[Union[str, torch.dtype]] = None,
gptq_use_triton: Optional[bool] = False,
) -> transformers.AutoModel:
"""Returns a pre-trained pytorch model from a pre-trained model configuration."""
if not quantized:
if load_in_4bit:
assert transformers.__version__ >= "4.30.0", "load_in_4bit requires transformers >= 4.30.0"
model_kwargs = {}
if transformers.__version__ >= "4.30.0":
model_kwargs["load_in_4bit"] = load_in_4bit
model = self.AUTO_MODEL_CLASS.from_pretrained(
pretrained,
revision=revision + ("/" + subfolder if subfolder is not None else ""),
device_map=device_map,
max_memory=max_memory,
offload_folder=offload_folder,
load_in_8bit=load_in_8bit,
trust_remote_code=trust_remote_code,
torch_dtype=torch_dtype,
**model_kwargs,
)
else:
from auto_gptq import AutoGPTQForCausalLM
model = AutoGPTQForCausalLM.from_quantized(
pretrained,
model_basename=None if quantized == True else Path(quantized).stem,
device_map=device_map,
max_memory=max_memory,
trust_remote_code=trust_remote_code,
use_safetensors=True if quantized == True else quantized.endswith('.safetensors'),
use_triton=gptq_use_triton,
warmup_triton=gptq_use_triton,
)
return model
def _create_auto_model_peft(
self,
*,
model: transformers.PreTrainedModel,
peft: str,
revision: str,
subfolder: str,
load_in_4bit: Optional[bool] = False,
):
if load_in_4bit:
assert PEFT_VERSION >= "0.4.0", "load_in_4bit requires peft >= 0.4.0"
model = self.AUTO_PEFT_CLASS.from_pretrained(
model,
peft,
revision=revision + ("/" + subfolder if subfolder is not None else ""),
)
return model
def _create_auto_tokenizer(
self,
*,
pretrained: str,
revision: str,
subfolder: str,
tokenizer: Optional[str] = None,
) -> transformers.PreTrainedTokenizer:
"""Returns a pre-trained tokenizer from a pre-trained tokenizer configuration."""
tokenizer = self.AUTO_TOKENIZER_CLASS.from_pretrained(
pretrained if tokenizer is None else tokenizer,
revision=revision + ("/" + subfolder if subfolder is not None else ""),
)
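# Many causal LM tokenizers (e.g. GPT-2) ship without a pad token; reuse the EOS
# token so that batched encoding with padding works.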
tokenizer.pad_token = tokenizer.eos_token
return tokenizer
@property
def add_special_tokens(self) -> bool:
"""Whether to include special tokens in encoded text. This should be
determined by whether or not the model was trained with special tokens.
TODO: Remove these conditionals once HuggingFace supports a way to
check whether or not an arbitrary model was trained with special tokens.
"""
if self._add_special_tokens is not None:
return self._add_special_tokens
elif self.AUTO_MODEL_CLASS is transformers.AutoModelForCausalLM:
return False
elif self.AUTO_MODEL_CLASS is transformers.AutoModelForSeq2SeqLM:
return True
else:
raise ValueError(
"Could not determine `add_special_tokens` value from the model "
"class. Set to `True` or `False` depending on whether the model "
"was pre-trained with special tokens."
)
@property
def eot_token(self) -> str:
return self.tokenizer.eos_token
@property
def eot_token_id(self) -> int:
return self.tokenizer.eos_token_id
@property
def max_gen_toks(self) -> int:
return self._max_gen_toks
@property
def max_length(self) -> int:
"""Return the maximum sequence length of the model.
NOTE: Different model configurations have different max sequence length
attribute names.
- n_positions: (CTRLConfig, T5Config)
- max_position_embeddings: (BartConfig, RoFormerConfig)
- n_ctx: (GPT2Config)
NOTE: For relative position encoded models you should specify the max
sequence length of the model in the constructor via `max_length`.
"""
if self._max_length is not None:
return self._max_length
# Try to get the sequence length from the model config.
seqlen_config_attrs = ("n_positions", "max_position_embeddings", "n_ctx")
for attr in seqlen_config_attrs:
if hasattr(self._config, attr):
return getattr(self._config, attr)
if hasattr(self.tokenizer, "model_max_length"):
if self.tokenizer.model_max_length == 1000000000000000019884624838656:
return self._DEFAULT_MAX_LENGTH
return self.tokenizer.model_max_length
return self._DEFAULT_MAX_LENGTH
@property
def batch_size(self) -> int:
# TODO: Add adaptive batch size.
return self._batch_size # * gpus
@property
def device(self) -> Union[int, str, torch.device]:
return self._device
def tok_encode(self, string: str) -> TokenSequence:
# TODO: Merge `tok_encode_batch` here.
return self.tokenizer.encode(string, add_special_tokens=self.add_special_tokens)
def tok_encode_batch(self, strings: List[str]) -> TokenSequence:
return self.tokenizer(
strings,
padding=True,
add_special_tokens=self.add_special_tokens,
return_tensors="pt",
)
def tok_decode(self, tokens: torch.LongTensor) -> List[str]:
return self.tokenizer.batch_decode(tokens, skip_special_tokens=True)
def greedy_until(
self, requests: List[Tuple[str, Union[List[str], str]]]
) -> List[str]:
def _collate(x):
tokens = self.tok_encode(x[0])
return len(tokens), x[0]
results = []
reorder = utils.Reorderer(requests, _collate)
adaptive_batch_size = None
if self.batch_size == "auto":
# using rolling window with maximum context
print("Passed argument batch_size = auto. Detecting largest batch size")
batch_size = self._detect_batch_size()
print(f"Determined Largest batch size: {batch_size}")
adaptive_batch_size = batch_size
for chunk in utils.chunks(
tqdm(reorder.get_reordered(), disable=False),
self.batch_size if self.batch_size != "auto" else adaptive_batch_size,
):
context = [c[0] for c in chunk]
request_args = chunk[0][1]
stop = request_args.get("until", None)
stop_sequences = stop if isinstance(stop, list) else [stop]
max_generation_length = request_args.get("max_length", None)
assert (
isinstance(max_generation_length, int) or max_generation_length is None
)
assert isinstance(stop_sequences, list) or stop_sequences is None
# TODO: Find a better way to handle stop sequences for 0-shot.
if stop_sequences is None:
until = [self.eot_token]
else:
until = stop_sequences + [self.eot_token]
if max_generation_length is None:
max_tokens = self.max_gen_toks
else:
max_tokens = max_generation_length
token_context = self.tok_encode_batch(context)
responses = self._model_generate(
inputs=token_context,
max_tokens=max_tokens,
stop=until,
)
responses = self.tok_decode(responses.tolist())
for response in responses:
# Ensure the generated responses do not contain the stop sequences.
for term in until:
response = response.split(term)[0]
# partial caching
self.cache_hook.add_partial("greedy_until", (context, until), response)
results.append(response)
return reorder.get_original(results)
class AutoCausalLM(HuggingFaceAutoLM):
"""Causal language modeling.
You can find a set of supported models in the HF documentation:
https://huggingface.co/docs/transformers/main/model_doc/auto#transformers.AutoModelForCausalLM
"""
AUTO_MODEL_CLASS = transformers.AutoModelForCausalLM
AUTO_PEFT_CLASS = peft.PeftModel
def _create_auto_tokenizer(
self,
*,
pretrained: str,
revision: str,
subfolder: str,
tokenizer: Optional[str] = None,
) -> transformers.PreTrainedTokenizer:
tokenizer = super()._create_auto_tokenizer(
pretrained=pretrained,
revision=revision,
subfolder=subfolder,
tokenizer=tokenizer,
)
tokenizer.padding_side = "left"
return tokenizer
def _model_call(
self, inputs: TokenSequence, labels: Optional[TokenSequence] = None
) -> TokenSequence:
return self.model(inputs)["logits"]
def _model_generate(
self,
inputs: transformers.BatchEncoding,
max_tokens: int,
stop: Optional[List[str]] = None,
) -> TokenSequence:
# Ensure that the context does not encroach into the `space`
# for the generation.
input_ids = inputs["input_ids"][:, self.max_gen_toks - self.max_length :]
attention_mask = inputs["attention_mask"][
:, self.max_gen_toks - self.max_length :
]
input_ids = input_ids.to(self.device)
attention_mask = attention_mask.to(self.device)
stopping_criteria = stop_sequences_criteria(
self.tokenizer, stop, input_ids.shape[1], input_ids.shape[0]
)
generations = self.model.generate(
input_ids=input_ids,
attention_mask=attention_mask,
# GPT style models require the `generate` `max_length` arg to include the
# context length, so we instead set `max_new_tokens` which is the number
# of new tokens to generate, excluding the current number of tokens.
max_new_tokens=max_tokens,
stopping_criteria=stopping_criteria,
do_sample=False,
)
return utils.select_continuation_from_batch_left_padding(
generations, max_context_size=inputs["input_ids"].size(1)
)
class AutoSeq2SeqLM(HuggingFaceAutoLM):
"""Seq2Seq language modeling.
You can find a set of supported models in the following documentation:
https://huggingface.co/docs/transformers/main/model_doc/auto#transformers.AutoModelForSeq2SeqLM
"""
AUTO_MODEL_CLASS = transformers.AutoModelForSeq2SeqLM
AUTO_PEFT_CLASS = peft.PeftModel
def loglikelihood(
self, requests: List[Tuple[str, str]]
) -> List[Tuple[float, bool]]:
new_requests = []
for chunk in utils.chunks(requests, self.batch_size):
context, continuation = zip(*chunk)
# Fill empty contexts with the EOT token.
context = [
f"{self.eot_token}" if len(text) == 0 else text for text in context
]
context_enc = self.tok_encode_batch(context)
for key in context_enc:
context_enc[key] = context_enc[key][:, -self.max_length :]
# Remove leading whitespace introduced by the default
# `text_target_separator` since the context and continuation
# will not be concatenated as a single (decoder) input.
continuation = [text.lstrip() for text in continuation]
continuation_enc = self.tok_encode_batch(list(continuation))
for key in continuation_enc:
continuation_enc[key] = continuation_enc[key][:, -self.max_length :]
new_requests.append(
((context, continuation), context_enc, continuation_enc)
)
return self._loglikelihood_tokens(new_requests)
def loglikelihood_rolling(self, requests: List[Tuple[str, str]]) -> List[float]:
loglikelihoods = []
for (string,) in tqdm(requests):
rolling_token_windows = list(
map(
utils.make_disjoint_window,
utils.get_rolling_token_windows(
token_list=self.tok_encode(string),
prefix_token=self.eot_token_id,
max_seq_len=self.max_length,
context_len=1,
),
)
)
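            # `rolling_token_windows` is a list of (context_tokens, continuation_tokens)
            # pairs whose continuations jointly cover the full string; each window is
            # scored below and the per-window log-likelihoods are summed.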
contexts, conts = utils.split_and_pad_windows(
rolling_token_windows,
pad_token_id=self.eot_token_id,
max_seq_len=self.max_length,
)
# Manually create BatchEncoding tensors with attention masks as
# expected by `self._model_call` in `self._loglikelihood_tokens`.
contexts_enc = torch.Tensor(contexts).long()
contexts_enc = transformers.tokenization_utils_base.BatchEncoding(
{
"input_ids": contexts_enc,
"attention_mask": (contexts_enc != self.eot_token_id).long(),
}
)
conts_enc = torch.Tensor(conts).long()
conts_enc = transformers.tokenization_utils_base.BatchEncoding(
{
"input_ids": conts_enc,
"attention_mask": (conts_enc != self.eot_token_id).long(),
}
)
            # TODO: Extract out this call so it only gets called once and also
            # somehow figure out partial caching for these rolling-window requests.
rolling_token_windows_request = [
((contexts, conts), contexts_enc, conts_enc)
]
string_nll = self._loglikelihood_tokens(
rolling_token_windows_request, disable_tqdm=True
)
string_nll = [x[0] for x in string_nll] # discard is_greedy
string_nll = sum(string_nll)
loglikelihoods.append(string_nll)
return loglikelihoods
def _loglikelihood_tokens(
self,
requests: List[Tuple[Tuple[str, str], TokenSequence, TokenSequence]],
disable_tqdm: Optional[bool] = False,
) -> List[Tuple[float, bool]]:
results = []
for chunk in tqdm(
requests, total=math.ceil(len(requests)), disable=disable_tqdm
):
cache_keys, inputs_tokens, targets_tokens = chunk
inputs_tokens = inputs_tokens.to(self.device)
targets_tokens = targets_tokens.to(self.device)
outputs = self._model_call(inputs=inputs_tokens, labels=targets_tokens)
log_softmaxes = F.log_softmax(outputs.logits, dim=-1)
output_iterator = zip(
zip(cache_keys[0], cache_keys[1]),
log_softmaxes,
targets_tokens["input_ids"],
targets_tokens["attention_mask"],
)
for cache_key, log_softmax, target_tokens, target_mask in output_iterator:
length = target_mask.sum()
log_softmax = log_softmax[:length]
target_tokens = target_tokens[:length]
greedy_tokens = log_softmax.argmax(dim=-1)
max_equal = (greedy_tokens == target_tokens).all()
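                # Gather the log-probability assigned to each gold target token.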
target_logits = torch.gather(
log_softmax, 1, target_tokens.unsqueeze(-1)
).squeeze(-1)
answer = (float(target_logits.sum()), bool(max_equal))
results.append(answer)
if cache_key is not None:
self.cache_hook.add_partial("loglikelihood", cache_key, answer)
return results
def _model_call(
self, inputs: TokenSequence, labels: Optional[TokenSequence] = None
) -> TokenSequence:
return self.model(**inputs, labels=labels["input_ids"])
def _model_generate(
self,
inputs: transformers.BatchEncoding,
max_tokens: int,
stop: Optional[List[str]] = None,
) -> TokenSequence:
input_ids = inputs["input_ids"][:, -self.max_length :].to(self.device)
attention_mask = inputs["attention_mask"][:, -self.max_length :].to(self.device)
# Generate one token to calculate the number of start tokens prepended to decoder_input_ids
# (leaving this here in case the below assumption is violated in the future)
# one_tok_gen = self.model.generate(
# input_ids=torch.zeros((1, 1), dtype=torch.int),
# min_length=2,
# max_new_tokens=1,
# ).squeeze()
# initial_decoder_input_length = len(one_tok_gen) - 1
        # Assume that there will always be only one start token in the decoder inputs;
        # this assumption holds for existing HF models.
stopping_criteria = stop_sequences_criteria(
self.tokenizer, stop, 1, input_ids.shape[0]
)
generations = self.model.generate(
input_ids=input_ids,
attention_mask=attention_mask,
max_new_tokens=max_tokens,
stopping_criteria=stopping_criteria,
do_sample=False,
)
return generations
class MultiTokenEOSCriteria(transformers.StoppingCriteria):
"""Criteria to stop on the specified multi-token sequence."""
def __init__(
self,
sequence: str,
tokenizer: transformers.PreTrainedTokenizer,
initial_decoder_input_length: int,
batch_size: int,
):
self.initial_decoder_input_length = initial_decoder_input_length
self.done_tracker = [False] * batch_size
self.sequence = sequence
self.sequence_ids = tokenizer.encode(sequence, add_special_tokens=False)
self.sequence_id_len = len(self.sequence_ids)
self.tokenizer = tokenizer
def __call__(self, input_ids, scores, **kwargs) -> bool:
# For efficiency, we compare the last n tokens where n is the number of tokens in the stop_sequence
lookback_ids_batch = input_ids[:, self.initial_decoder_input_length :][
:, -self.sequence_id_len :
]
lookback_tokens_batch = self.tokenizer.batch_decode(lookback_ids_batch)
for i, done in enumerate(self.done_tracker):
if not done:
self.done_tracker[i] = self.sequence in lookback_tokens_batch[i]
return False not in self.done_tracker
def stop_sequences_criteria(
tokenizer: transformers.PreTrainedTokenizer,
stop_sequences: List[str],
initial_decoder_input_length: int,
batch_size: int,
) -> transformers.StoppingCriteriaList:
return transformers.StoppingCriteriaList(
[
*[
MultiTokenEOSCriteria(
sequence, tokenizer, initial_decoder_input_length, batch_size
)
for sequence in stop_sequences
],
]
)
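# Illustrative usage (a sketch mirroring `AutoCausalLM._model_generate` above; the
# `tokenizer`, `model`, `input_ids`, and `attention_mask` names are assumed to exist):
#
#     criteria = stop_sequences_criteria(
#         tokenizer, ["\n\n", tokenizer.eos_token], input_ids.shape[1], input_ids.shape[0]
#     )
#     generations = model.generate(
#         input_ids=input_ids,
#         attention_mask=attention_mask,
#         max_new_tokens=64,
#         stopping_criteria=criteria,
#         do_sample=False,
#     )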
......@@ -125,7 +125,8 @@ class TextSynthLM(LM):
res = []
for request in tqdm(requests):
inp = request[0]
until = request[1]
request_args = request[1]
until = request_args["until"]
response = textsynth_completion(
url=self.api_url + "/v1/engines/" + self.engine + "/completions",
headers={"Authorization": "Bearer " + self.api_key},
......
......@@ -15,6 +15,9 @@ from lm_eval.api.registry import (
)
ALL_TASKS = sorted(list(TASK_REGISTRY.keys()) + list(GROUP_REGISTRY.keys()))
def get_task_name_from_config(task_config):
return "{dataset_path}_{dataset_name}".format(**task_config)
......
......@@ -8,13 +8,20 @@ import functools
import subprocess
import collections
import importlib.util
import fnmatch
from typing import List
from typing import List, Union
import gc
import torch
from omegaconf import OmegaConf
from jinja2 import BaseLoader, Environment, StrictUndefined
from itertools import islice
from lm_eval import tasks
from lm_eval.logger import eval_logger
class ExitCodeError(Exception):
pass
......@@ -25,6 +32,29 @@ def sh(x):
raise ExitCodeError()
def escaped_split(text, sep_char, maxsplit=-1):
"""Split text into a list on occurrences of the given separation
character `sep_char`. The separation character may be escaped by a
backslash to avoid splitting at that location.
The separation character must be a string of size 1.
If `maxsplit` is given, at most `maxsplit` splits are done (thus,
the list will have at most `maxsplit + 1` elements). If `maxsplit`
is not specified or less than 0, then there is no limit on the
number of splits (all possible splits are made).
"""
assert (
len(sep_char) == 1
), "separation string must be a single character for escaped splitting"
    if maxsplit == 0:
        return [text]
    maxsplit = max(0, maxsplit)
    # Escape `sep_char` so that regex metacharacters (e.g. ".") are treated literally.
    return re.split(r"(?<!\\)" + re.escape(sep_char), text, maxsplit)
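# Illustrative example: a backslash-escaped separator is not split on.
#   escaped_split(r"arc_easy,desc=foo\,bar,num_fewshot=5", ",")
#   -> ["arc_easy", r"desc=foo\,bar", "num_fewshot=5"]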
def simple_parse_args_string(args_string):
"""
Parses something like
......@@ -44,11 +74,11 @@ def join_iters(iters):
yield from iter
def chunks(iter, n):
def chunks(iter, n=0, fn=None):
arr = []
for x in iter:
for i, x in enumerate(iter):
arr.append(x)
if len(arr) == n:
if len(arr) == (fn(i) if fn else n):
yield arr
arr = []
......@@ -65,6 +95,35 @@ def group(arr, fn):
return list(res.values())
class MultiChoice:
def __init__(self, choices):
self.choices = choices
# Simple wildcard support (linux filename patterns)
def __contains__(self, values):
for value in values.split(","):
if len(fnmatch.filter(self.choices, value)) == 0:
eval_logger.warning("{} is not in task list.".format(value))
eval_logger.info(f"Available tasks to choose:")
for choice in self.choices:
eval_logger.info(f" - {choice}")
return True
def __iter__(self):
for choice in self.choices:
yield choice
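# Illustrative argparse usage (mirrors the `--tasks` flag in __main__.py below):
#   parser.add_argument("--tasks", choices=MultiChoice(tasks.ALL_TASKS))
#   A value such as "hellaswag,arc_*" is accepted because each comma-separated entry
#   is matched against the available choices with fnmatch-style wildcards.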
# Returns a list containing all values of the source_list that
# match at least one of the patterns
def pattern_match(patterns, source_list):
task_names = set()
for pattern in patterns:
for matching in fnmatch.filter(source_list, pattern):
task_names.add(matching)
return sorted(list(task_names))
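# Illustrative example (hypothetical task list):
#   pattern_match(["arc_*"], ["arc_easy", "arc_challenge", "boolq"])
#   -> ["arc_challenge", "arc_easy"]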
def general_detokenize(string):
string = string.replace(" n't", "n't")
string = string.replace(" )", ")")
......@@ -110,8 +169,8 @@ def get_rolling_token_windows(token_list, prefix_token, max_seq_len, context_len
window_end = predicted + window_pred_len
yield (
token_list[window_end - max_seq_len - 1 : window_end - 1],
token_list[window_end - window_pred_len : window_end],
token_list[window_end - max_seq_len - 1: window_end - 1],
token_list[window_end - window_pred_len: window_end],
)
predicted += window_pred_len
......@@ -122,6 +181,26 @@ def make_disjoint_window(pair):
return a[: len(a) - (len(b) - 1)], b
def select_continuation_from_batch_left_padding(
generations: Union[List[List[int]], torch.Tensor], max_context_size: int
):
"""Select the continuation from the batch, removing prompts of different lengths.
Args:
generations (Union[List[List[int]], torch.Tensor]):
A tensor or list-of-lists of shape [batch_size, sequence length].
max_context_size (int):
The size of the biggest context; generations will proceed from that
index.
Example:
PAD PAD Continue : The dog chased the cat [every day of the week]
Riddle me this : The dog chased the cat [yesterday] PAD PAD PAD PAD
Output:
[every day of the week]
[yesterday] PAD PAD PAD PAD
"""
return generations[:, max_context_size:]
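# Illustrative example: for a left-padded batch of shape [2, 7] whose longest prompt
# is 4 tokens (max_context_size=4), the call returns generations[:, 4:], i.e. only
# the newly generated continuation tokens (plus any trailing padding, as in the
# docstring example above).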
class Reorderer:
def __init__(self, arr, fn):
self.size = len(arr)
......@@ -336,3 +415,8 @@ def create_iterator(raw_iterator, rank, world_size, limit=None):
among ranks in multigpu setting or only pulling a sample of documents
"""
return islice(raw_iterator, rank, limit, world_size)
def clear_torch_cache():
gc.collect()
torch.cuda.empty_cache()
import os
import json
import fnmatch
import argparse
from lm_eval import evaluator, utils
from lm_eval.api.registry import GROUP_REGISTRY, TASK_REGISTRY
from lm_eval import tasks, evaluator, utils
from lm_eval.logger import eval_logger
os.environ["TOKENIZERS_PARALLELISM"] = "false"
ALL_TASKS = sorted(list(TASK_REGISTRY.keys()) + list(GROUP_REGISTRY.keys()))
class MultiChoice:
def __init__(self, choices):
self.choices = choices
# Simple wildcard support (linux filename patterns)
def __contains__(self, values):
for value in values.split(","):
if len(fnmatch.filter(self.choices, value)) == 0:
eval_logger.warning("{} is not in task list.".format(value))
eval_logger.info(f"Available tasks to choose:")
# for choice in self.choices:
# eval_logger.info(f" {choice}")
eval_logger.info(ALL_TASKS)
return True
def __iter__(self):
for choice in self.choices:
yield choice
os.environ["TOKENIZERS_PARALLELISM"] = "false"
def parse_args():
parser = argparse.ArgumentParser()
parser.add_argument("--model", required=True)
parser.add_argument("--model_args", default="")
parser.add_argument("--tasks", default=None, choices=MultiChoice(ALL_TASKS))
parser.add_argument("--tasks", default=None, choices=utils.MultiChoice(tasks.ALL_TASKS))
parser.add_argument("--config", default=None)
parser.add_argument("--provide_description", action="store_true")
parser.add_argument("--num_fewshot", type=int, default=0)
parser.add_argument("--batch_size", type=int, default=1)
parser.add_argument("--batch_size", type=str, default=None)
parser.add_argument("--max_batch_size", type=int, default=None,
help="Maximal batch size to try with --batch_size auto")
parser.add_argument("--device", type=str, default=None)
parser.add_argument("--output_path", default=None)
parser.add_argument("--limit", type=int, default=None)
parser.add_argument("--limit", type=float, default=None,
help="Limit the number of examples per task. "
"If <1, limit is a percentage of the total number of examples.")
parser.add_argument("--data_sampling", type=float, default=None)
parser.add_argument("--no_cache", action="store_true")
parser.add_argument("--decontamination_ngrams_path", default=None)
parser.add_argument("--description_dict_path", default=None)
parser.add_argument("--check_integrity", action="store_true")
parser.add_argument("--write_out", action="store_true", default=False)
parser.add_argument("--output_base_path", type=str, default=None)
return parser.parse_args()
# Returns a list containing all values of the source_list that
# match at least one of the patterns
def pattern_match(patterns, source_list):
task_names = set()
for pattern in patterns:
for matching in fnmatch.filter(source_list, pattern):
task_names.add(matching)
return sorted(list(task_names))
def main():
args = parse_args()
......@@ -68,7 +43,9 @@ def main():
"REAL METRICS SHOULD NOT BE COMPUTED USING LIMIT."
)
if args.tasks is not None:
if args.tasks is None:
task_names = tasks.ALL_TASKS
else:
if os.path.isdir(args.tasks):
import glob
......@@ -79,7 +56,7 @@ def main():
task_names.append(config)
else:
tasks_list = args.tasks.split(",")
task_names = pattern_match(tasks_list, ALL_TASKS)
task_names = utils.pattern_match(tasks_list, tasks.ALL_TASKS)
for task in [task for task in tasks_list if task not in task_names]:
if os.path.isfile(task):
config = utils.load_yaml_config(task)
......@@ -87,28 +64,42 @@ def main():
eval_logger.info(f"Selected Tasks: {task_names}")
# TODO: description_dict?
# description_dict = {}
# if args.description_dict_path:
# with open(args.description_dict_path, "r") as f:
# description_dict = json.load(f)
results = evaluator.simple_evaluate(
model=args.model,
model_args=args.model_args,
tasks=task_names,
num_fewshot=args.num_fewshot,
batch_size=args.batch_size,
max_batch_size=args.max_batch_size,
device=args.device,
no_cache=args.no_cache,
limit=args.limit,
# description_dict=description_dict,
decontamination_ngrams_path=args.decontamination_ngrams_path,
check_integrity=args.check_integrity,
write_out=args.write_out,
output_base_path=args.output_base_path,
)
if results is not None:
dumped = json.dumps(results, indent=2)
print(dumped)
if args.output_path:
os.makedirs(os.path.dirname(args.output_path), exist_ok=True)
with open(args.output_path, "w") as f:
f.write(dumped)
batch_sizes = ",".join(map(str, results["config"]["batch_sizes"]))
print(
f"{args.model} ({args.model_args}), limit: {args.limit}, provide_description: {args.provide_description}, "
f"num_fewshot: {args.num_fewshot}, batch_size: {args.batch_size}"
f"{args.model} ({args.model_args}), limit: {args.limit}, num_fewshot: {args.num_fewshot}, "
f"batch_size: {args.batch_size}{f' ({batch_sizes})' if batch_sizes else ''}"
)
print(evaluator.make_table(results))
......
# bloom-1b1
## bloom-1b1_common_sense_reasoning_0-shot.json
| Task |Version| Metric |Value| |Stderr|
|-------------|------:|--------|----:|---|-----:|
|arc_challenge| 0|acc |23.63|± | 1.24|
| | |acc_norm|25.68|± | 1.28|
|arc_easy | 0|acc |51.47|± | 1.03|
| | |acc_norm|45.45|± | 1.02|
|boolq | 1|acc |59.08|± | 0.86|
|copa | 0|acc |68.00|± | 4.69|
|hellaswag | 0|acc |34.63|± | 0.47|
| | |acc_norm|41.77|± | 0.49|
|mc_taco | 0|em |14.49| | |
| | |f1 |32.43| | |
|openbookqa | 0|acc |19.60|± | 1.78|
| | |acc_norm|29.40|± | 2.04|
|piqa | 0|acc |67.14|± | 1.10|
| | |acc_norm|67.14|± | 1.10|
|prost | 0|acc |23.41|± | 0.31|
| | |acc_norm|30.50|± | 0.34|
|swag | 0|acc |43.43|± | 0.35|
| | |acc_norm|58.28|± | 0.35|
|winogrande | 0|acc |54.93|± | 1.40|
|wsc273 | 0|acc |68.50|± | 2.82|
## bloom-1b1_gsm8k_8-shot.json
|Task |Version|Metric|Value| |Stderr|
|-----|------:|------|----:|---|-----:|
|gsm8k| 0|acc | 0.83|± | 0.25|
## bloom-1b1_mathematical_reasoning_few_shot_5-shot.json
| Task |Version| Metric |Value| |Stderr|
|-------------------------|------:|--------|----:|---|-----:|
|drop | 1|em | 1.38|± | 0.12|
| | |f1 | 4.01|± | 0.15|
|gsm8k | 0|acc | 0.00|± | 0.00|
|math_algebra | 1|acc | 0.00|± | 0.00|
|math_counting_and_prob | 1|acc | 0.21|± | 0.21|
|math_geometry | 1|acc | 0.21|± | 0.21|
|math_intermediate_algebra| 1|acc | 0.00|± | 0.00|
|math_num_theory | 1|acc | 0.19|± | 0.19|
|math_prealgebra | 1|acc | 0.11|± | 0.11|
|math_precalc | 1|acc | 0.00|± | 0.00|
|mathqa | 0|acc |23.55|± | 0.78|
| | |acc_norm|23.62|± | 0.78|
## bloom-1b1_pawsx_0-shot.json
| Task |Version|Metric|Value| |Stderr|
|--------|------:|------|----:|---|-----:|
|pawsx_de| 0|acc |46.95|± | 1.12|
|pawsx_en| 0|acc |52.45|± | 1.12|
|pawsx_es| 0|acc |51.50|± | 1.12|
|pawsx_fr| 0|acc |46.15|± | 1.11|
|pawsx_ja| 0|acc |48.40|± | 1.12|
|pawsx_ko| 0|acc |49.90|± | 1.12|
|pawsx_zh| 0|acc |48.95|± | 1.12|
## bloom-1b1_question_answering_0-shot.json
| Task |Version| Metric |Value| |Stderr|
|-------------|------:|------------|----:|---|-----:|
|headqa_en | 0|acc |26.44|± | 0.84|
| | |acc_norm |30.49|± | 0.88|
|headqa_es | 0|acc |24.43|± | 0.82|
| | |acc_norm |28.30|± | 0.86|
|logiqa | 0|acc |18.89|± | 1.54|
| | |acc_norm |25.65|± | 1.71|
|squad2 | 1|exact | 4.17| | |
| | |f1 | 6.60| | |
| | |HasAns_exact| 2.19| | |
| | |HasAns_f1 | 7.05| | |
| | |NoAns_exact | 6.14| | |
| | |NoAns_f1 | 6.14| | |
| | |best_exact |50.07| | |
| | |best_f1 |50.07| | |
|triviaqa | 1|acc | 2.68|± | 0.15|
|truthfulqa_mc| 1|mc1 |25.34|± | 1.52|
| | |mc2 |41.80|± | 1.46|
|webqs | 0|acc | 1.38|± | 0.26|
## bloom-1b1_reading_comprehension_0-shot.json
|Task|Version|Metric|Value| |Stderr|
|----|------:|------|----:|---|-----:|
|coqa| 1|f1 |45.57|± | 1.88|
| | |em |32.98|± | 1.95|
|drop| 1|em | 3.31|± | 0.18|
| | |f1 | 8.63|± | 0.22|
|race| 1|acc |32.63|± | 1.45|
## bloom-1b1_xcopa_0-shot.json
| Task |Version|Metric|Value| |Stderr|
|--------|------:|------|----:|---|-----:|
|xcopa_et| 0|acc | 50.6|± | 2.24|
|xcopa_ht| 0|acc | 53.0|± | 2.23|
|xcopa_id| 0|acc | 64.8|± | 2.14|
|xcopa_it| 0|acc | 50.8|± | 2.24|
|xcopa_qu| 0|acc | 51.2|± | 2.24|
|xcopa_sw| 0|acc | 54.4|± | 2.23|
|xcopa_ta| 0|acc | 57.0|± | 2.22|
|xcopa_th| 0|acc | 53.2|± | 2.23|
|xcopa_tr| 0|acc | 53.0|± | 2.23|
|xcopa_vi| 0|acc | 62.4|± | 2.17|
|xcopa_zh| 0|acc | 59.4|± | 2.20|
## bloom-1b1_xnli_0-shot.json
| Task |Version|Metric|Value| |Stderr|
|-------|------:|------|----:|---|-----:|
|xnli_ar| 0|acc |33.93|± | 0.67|
|xnli_bg| 0|acc |34.13|± | 0.67|
|xnli_de| 0|acc |39.64|± | 0.69|
|xnli_el| 0|acc |34.03|± | 0.67|
|xnli_en| 0|acc |51.48|± | 0.71|
|xnli_es| 0|acc |47.98|± | 0.71|
|xnli_fr| 0|acc |47.15|± | 0.71|
|xnli_hi| 0|acc |42.32|± | 0.70|
|xnli_ru| 0|acc |40.46|± | 0.69|
|xnli_sw| 0|acc |35.29|± | 0.68|
|xnli_th| 0|acc |33.75|± | 0.67|
|xnli_tr| 0|acc |34.79|± | 0.67|
|xnli_ur| 0|acc |37.33|± | 0.68|
|xnli_vi| 0|acc |44.45|± | 0.70|
|xnli_zh| 0|acc |36.23|± | 0.68|
## bloom-1b1_xstory_cloze_0-shot.json
| Task |Version|Metric|Value| |Stderr|
|---------------|------:|------|----:|---|-----:|
|xstory_cloze_ar| 0|acc |52.88|± | 1.28|
|xstory_cloze_en| 0|acc |62.54|± | 1.25|
|xstory_cloze_es| 0|acc |58.31|± | 1.27|
|xstory_cloze_eu| 0|acc |54.33|± | 1.28|
|xstory_cloze_hi| 0|acc |55.53|± | 1.28|
|xstory_cloze_id| 0|acc |57.91|± | 1.27|
|xstory_cloze_my| 0|acc |46.19|± | 1.28|
|xstory_cloze_ru| 0|acc |48.25|± | 1.29|
|xstory_cloze_sw| 0|acc |50.56|± | 1.29|
|xstory_cloze_te| 0|acc |56.39|± | 1.28|
|xstory_cloze_zh| 0|acc |58.04|± | 1.27|
## bloom-1b1_xwinograd_0-shot.json
| Task |Version|Metric|Value| |Stderr|
|------------|------:|------|----:|---|-----:|
|xwinograd_en| 0|acc |69.98|± | 0.95|
|xwinograd_fr| 0|acc |66.27|± | 5.22|
|xwinograd_jp| 0|acc |52.87|± | 1.61|
|xwinograd_pt| 0|acc |63.12|± | 2.98|
|xwinograd_ru| 0|acc |54.29|± | 2.81|
|xwinograd_zh| 0|acc |69.25|± | 2.06|
{
"results": {
"boolq": {
"acc": 0.5908256880733945,
"acc_stderr": 0.008599563442397352
},
"arc_easy": {
"acc": 0.5147306397306397,
"acc_stderr": 0.010255329977562096,
"acc_norm": 0.45454545454545453,
"acc_norm_stderr": 0.010217299762709435
},
"openbookqa": {
"acc": 0.196,
"acc_stderr": 0.017770751227744862,
"acc_norm": 0.294,
"acc_norm_stderr": 0.020395095484936614
},
"hellaswag": {
"acc": 0.3463453495319657,
"acc_stderr": 0.004748324319714264,
"acc_norm": 0.4177454690300737,
"acc_norm_stderr": 0.004921798492608764
},
"swag": {
"acc": 0.43431970408877335,
"acc_stderr": 0.0035044592489844794,
"acc_norm": 0.5828251524542637,
"acc_norm_stderr": 0.0034862531772295617
},
"arc_challenge": {
"acc": 0.2363481228668942,
"acc_stderr": 0.012414960524301834,
"acc_norm": 0.2568259385665529,
"acc_norm_stderr": 0.0127669237941168
},
"mc_taco": {
"em": 0.1448948948948949,
"f1": 0.32425976796237205
},
"wsc273": {
"acc": 0.684981684981685,
"acc_stderr": 0.028165854394193602
},
"winogrande": {
"acc": 0.5493291239147593,
"acc_stderr": 0.013983928869040239
},
"prost": {
"acc": 0.23409479077711356,
"acc_stderr": 0.003093545711826552,
"acc_norm": 0.3049743808710504,
"acc_norm_stderr": 0.003363606918420179
},
"copa": {
"acc": 0.68,
"acc_stderr": 0.04688261722621504
},
"piqa": {
"acc": 0.6713819368879217,
"acc_stderr": 0.010959127105167048,
"acc_norm": 0.6713819368879217,
"acc_norm_stderr": 0.010959127105167044
}
},
"versions": {
"boolq": 1,
"arc_easy": 0,
"openbookqa": 0,
"hellaswag": 0,
"swag": 0,
"arc_challenge": 0,
"mc_taco": 0,
"wsc273": 0,
"winogrande": 0,
"prost": 0,
"copa": 0,
"piqa": 0
},
"config": {
"model": "hf-causal-experimental",
"model_args": "pretrained=bigscience/bloom-1b1,use_accelerate=True",
"num_fewshot": 0,
"batch_size": "auto",
"device": "cuda:0",
"no_cache": true,
"limit": null,
"bootstrap_iters": 100000,
"description_dict": {}
}
}
{
"results": {
"gsm8k": {
"acc": 0.008339651250947688,
"acc_stderr": 0.002504942226860508
}
},
"versions": {
"gsm8k": 0
},
"config": {
"model": "hf-causal-experimental",
"model_args": "pretrained=bigscience/bloom-1b1,use_accelerate=True",
"num_fewshot": 8,
"batch_size": "auto",
"device": "cuda",
"no_cache": true,
"limit": null,
"bootstrap_iters": 100000,
"description_dict": {}
}
}
{
"results": {
"mathqa": {
"acc": 0.2355108877721943,
"acc_stderr": 0.007767687364650971,
"acc_norm": 0.23618090452261306,
"acc_norm_stderr": 0.0077753193787470495
},
"gsm8k": {
"acc": 0.0,
"acc_stderr": 0.0
},
"drop": {
"em": 0.013842281879194632,
"em_stderr": 0.001196510970060749,
"f1": 0.040085989932885986,
"f1_stderr": 0.0014841664758736023
},
"math_geometry": {
"acc": 0.0020876826722338203,
"acc_stderr": 0.0020876826722338315
},
"math_counting_and_prob": {
"acc": 0.002109704641350211,
"acc_stderr": 0.002109704641350211
},
"math_prealgebra": {
"acc": 0.001148105625717566,
"acc_stderr": 0.0011481056257175708
},
"math_num_theory": {
"acc": 0.001851851851851852,
"acc_stderr": 0.0018518518518518448
},
"math_precalc": {
"acc": 0.0,
"acc_stderr": 0.0
},
"math_algebra": {
"acc": 0.0,
"acc_stderr": 0.0
},
"math_intermediate_algebra": {
"acc": 0.0,
"acc_stderr": 0.0
}
},
"versions": {
"mathqa": 0,
"gsm8k": 0,
"drop": 1,
"math_geometry": 1,
"math_counting_and_prob": 1,
"math_prealgebra": 1,
"math_num_theory": 1,
"math_precalc": 1,
"math_algebra": 1,
"math_intermediate_algebra": 1
},
"config": {
"model": "hf-causal-experimental",
"model_args": "pretrained=bigscience/bloom-1b1,use_accelerate=True",
"num_fewshot": 5,
"batch_size": "auto",
"device": "cuda:0",
"no_cache": true,
"limit": null,
"bootstrap_iters": 100000,
"description_dict": {}
}
}