Commit e61a3159 authored by LucWeber's avatar LucWeber

Add tinyBenchmarks post-processing

parent 157b4661
......@@ -20,23 +20,36 @@ All configs and utils mirror the ones from their original dataset!
#### Tasks
* `tinyArc`, `tinyGSM8k`, `tinyHellaswag`, `tinyMMLU`, `tinyTruthfulQA`, `tinyWinogrande`,
* `tinyArc`, `tinyGSM8k`, `tinyHellaswag`, `tinyMMLU`, `tinyTruthfulQA`, `tinyWinogrande`
### Usage
*tinyBenchmarks* can evaluate different benchmarks with a fraction of their examples.
To obtain accurate results, the score vectors obtained with this task require post-processing using the *tinyBenchmarks*-package (see [here](https://github.com/felipemaiapolo/tinyBenchmarks/blob/main/README.md?plain=1)).
You can install our package by running the following commands on the terminal
To obtain accurate results, this task applies post-processing using the *tinyBenchmarks*-package.
You can install the package by running the following command in the terminal (for more information, see [here](https://github.com/felipemaiapolo/tinyBenchmarks/blob/main/README.md?plain=1)):
```sh
$ pip install git+https://github.com/felipemaiapolo/tinyBenchmarks
pip install git+https://github.com/felipemaiapolo/tinyBenchmarks
```
Through the package, the ability parameter $\theta$ from the IRT model will be estimated using all the available data. For `benchmark='lb'` or `benchmark='helm_lite'`, the dimension of `y` should be 600 and 1000, respectively, where the correctness values must obey the following order
- For the Open LLM Leaderboard: TruthfulQA, GSM8K, Winogrande, ARC, HellaSwag, and MMLU;
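For illustration, a leaderboard-style score vector could be assembled as in the sketch below; the random values only stand in for real per-example correctness scores, and it assumes each of the six sub-benchmarks contributes its 100 values in the order listed above:
```python
import numpy as np
import tinyBenchmarks as tb

# Stand-in correctness vectors (100 examples each), concatenated in the
# order required for benchmark='lb':
# TruthfulQA, GSM8K, Winogrande, ARC, HellaSwag, MMLU.
parts = [np.random.randint(0, 2, size=100).astype(float) for _ in range(6)]
y = np.concatenate(parts)  # shape (600,)

# Returns IRT-based accuracy estimates for each sub-benchmark.
estimates = tb.evaluate(y, 'lb')
```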
The value returned by the task corresponds to the **IRT++** method from the [original paper](https://arxiv.org/abs/2402.14992).
Evaluate specific tasks individually (e.g. `--tasks tinyHellaswag`) or all [open LLM leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard) tasks by specifying `--tasks tinyBenchmarks`.
### Advanced usage
To obtain the estimated accuracies from all methods of the original paper, the *tinyBenchmarks* package has to be applied manually.
To do so, run the evaluation with the `--log_samples` and `--output_path` arguments. For example:
```bash
lm_eval --model hf \
    --model_args pretrained="mistralai/Mistral-7B-Instruct-v0.2" \
    --tasks tinyHellaswag \
    --batch_size 4 \
    --output_path '<output_path>' \
    --log_samples
```
For all other benchmarks, the dimension of `y` should be 100.
Afterwards, include the correct `file_path` and run the following script:
```python
import json
......@@ -45,11 +58,11 @@ import numpy as np
# Choose benchmark (e.g. hellaswag)
benchmark = 'hellaswag' # possible benchmarks:
# ['lb','mmlu','alpaca','helm_lite','truthfulqa',
# 'gsm8k', 'winogrande', 'arc', 'hellaswag']
# ['mmlu','truthfulqa', 'gsm8k',
# 'winogrande', 'arc', 'hellaswag']
# Get score vector from output-file (the metric [here `acc_norm`] depends on the benchmark)
file_path = '<your-output-file.jsonl>'
file_path = '<output_path>/<output-file.jsonl>'
with open(file_path, 'r') as file:
    outputs = json.load(file)
```
......
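The remainder of the script is collapsed in this view; a minimal sketch of the full post-processing, assuming the logged samples form a JSON list whose entries carry a `doc_id` and the chosen metric (here `acc_norm`), could look like this:
```python
import json

import numpy as np
import tinyBenchmarks as tb

benchmark = 'hellaswag'
metric = 'acc_norm'  # the metric key depends on the benchmark

# Assumption: the logged samples are a JSON list of per-example records.
file_path = '<output_path>/<output-file.jsonl>'
with open(file_path, 'r') as file:
    outputs = json.load(file)

# Keep the examples in their original order so the correctness values
# line up with the IRT anchor points.
outputs = sorted(outputs, key=lambda x: x['doc_id'])

# 100-dimensional correctness vector for this tiny benchmark.
y = np.array([float(item[metric]) for item in outputs])

# Estimated full-benchmark accuracies (IRT, p-IRT, gp-IRT).
print(tb.evaluate(y, benchmark)[benchmark])
```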
# agg_functions.py: aggregation hooks referenced from the tiny* task configs
# via `!function agg_functions.<name>`; each one feeds the per-example
# correctness values to tinyBenchmarks and returns the estimated
# full-benchmark accuracy.
from typing import List

import numpy as np
import tinyBenchmarks as tb


def agg_pirt(items: List[float], benchmark: str) -> float:
    items = np.array(items)
    predictions = tb.evaluate(items, benchmark)
    return predictions[benchmark]["pirt"]


def agg_gpirt_arc(items: List[float], benchmark: str = "arc") -> float:
    items = np.array(items)
    predictions = tb.evaluate(items, benchmark)
    return predictions[benchmark]["gpirt"]


def agg_gpirt_gsm8k(items: List[float], benchmark: str = "gsm8k") -> float:
    items = np.array(items)
    predictions = tb.evaluate(items, benchmark)
    return predictions[benchmark]["gpirt"]


def agg_gpirt_hellaswag(items: List[float], benchmark: str = "hellaswag") -> float:
    items = np.array(items)
    predictions = tb.evaluate(items, benchmark)
    return predictions[benchmark]["gpirt"]


def agg_gpirt_mmlu(items: List[float], benchmark: str = "mmlu") -> float:
    items = np.array(items)
    predictions = tb.evaluate(items, benchmark)
    return predictions[benchmark]["gpirt"]


def agg_gpirt_truthfulqa(items: List[float], benchmark: str = "truthfulqa") -> float:
    items = np.array(items)
    predictions = tb.evaluate(items, benchmark)
    return predictions[benchmark]["gpirt"]


def agg_gpirt_winogrande(items: List[float], benchmark: str = "winogrande") -> float:
    items = np.array(items)
    predictions = tb.evaluate(items, benchmark)
    return predictions[benchmark]["gpirt"]
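Outside of the harness, calling one of these hooks directly would look roughly like the sketch below; the random scores are only a stand-in for the per-example correctness values that lm-eval collects:
```python
import numpy as np
from agg_functions import agg_gpirt_arc

# Stand-in for the 100 per-example correctness values (0.0 or 1.0) that
# lm-eval would pass to the aggregation function for tinyArc.
scores = np.random.randint(0, 2, size=100).astype(float).tolist()

# gp-IRT estimate of the accuracy on the full ARC benchmark.
print(agg_gpirt_arc(scores))
```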
......@@ -14,11 +14,8 @@ doc_to_choice: "{{choices.text}}"
should_decontaminate: true
doc_to_decontamination_query: "Question: {{question}}\nAnswer:"
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
  - metric: acc_norm
    aggregation: mean
    aggregation: !function agg_functions.agg_gpirt_arc
    higher_is_better: true
metadata:
  version: 0.0
......@@ -13,7 +13,7 @@ doc_to_text: "Question: {{question}}\nAnswer:"
doc_to_target: "{{answer}}" #" {{answer.split('### ')[-1].rstrip()}}"
metric_list:
  - metric: exact_match
    aggregation: mean
    aggregation: !function agg_functions.agg_gpirt_gsm8k
    higher_is_better: true
    ignore_case: true
    ignore_punctuation: false
......
......@@ -13,11 +13,8 @@ doc_to_text: "{{query}}"
doc_to_target: "{{label}}"
doc_to_choice: "choices"
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
  - metric: acc_norm
    aggregation: mean
    aggregation: !function agg_functions.agg_gpirt_hellaswag
    higher_is_better: true
metadata:
  version: 0.0
......@@ -12,8 +12,8 @@ doc_to_choice: ["A", "B", "C", "D"]
doc_to_target: answer
num_fewshot: 0
metric_list:
  - metric: acc
    aggregation: mean
  - metric: acc_norm
    aggregation: !function agg_functions.agg_gpirt_mmlu
    higher_is_better: true
metadata:
  version: 0.0
......@@ -7,7 +7,7 @@ should_decontaminate: True
doc_to_decontamination_query: question
metric_list:
  - metric: acc
    aggregation: mean
    aggregation: !function agg_functions.agg_gpirt_truthfulqa
    higher_is_better: true
metadata:
  version: 0.0
......@@ -11,8 +11,8 @@ doc_to_choice: !function utils_winogrande.doc_to_choice
should_decontaminate: true
doc_to_decontamination_query: sentence
metric_list:
  - metric: acc
    aggregation: mean
  - metric: acc_norm
    aggregation: !function agg_functions.agg_gpirt_winogrande
    higher_is_better: true
metadata:
  version: 0.0