"...resnet50_tensorflow.git" did not exist on "8c6df6412ae7141f46e7752be2a2a16c519b8ccc"
Commit e61a3159 authored by LucWeber

Add tinyBenchmarks post-processing

parent 157b4661
@@ -20,23 +20,36 @@ All configs and utils mirror the ones from their original dataset!
 #### Tasks
-* `tinyArc`, `tinyGSM8k`, `tinyHellaswag`, `tinyMMLU`, `tinyTruthfulQA`, `tinyWinogrande`,
+* `tinyArc`, `tinyGSM8k`, `tinyHellaswag`, `tinyMMLU`, `tinyTruthfulQA`, `tinyWinogrande`
 ### Usage
 *tinyBenchmarks* can evaluate different benchmarks with a fraction of their examples.
-To obtain accurate results, the score vectors obtained with this task require post-processing using the *tinyBenchmarks*-package (see [here](https://github.com/felipemaiapolo/tinyBenchmarks/blob/main/README.md?plain=1)).
-You can install our package by running the following commands on the terminal
+To obtain accurate results, this task applies post-processing using the *tinyBenchmarks* package.
+You can install the package by running the following command in the terminal (for more information see [here](https://github.com/felipemaiapolo/tinyBenchmarks/blob/main/README.md?plain=1)):
 ```sh
-$ pip install git+https://github.com/felipemaiapolo/tinyBenchmarks
+pip install git+https://github.com/felipemaiapolo/tinyBenchmarks
 ```
-Through the package, the ability parameter $\theta$ from the IRT model will be estimated using all the available data. For `benchmark='lb'` or `benchmark='helm_lite'`, the dimension of `y` should be 600 and 1000, respectively, where the correctness values must obey the following order
-- For the Open LLM Leaderboard: TruthfulQA, GSM8K, Winogrande, ARC, HellaSwag, and MMLU;
+The value returned by the task corresponds to the '**IRT++**' method from the [original paper](https://arxiv.org/abs/2402.14992).
+Evaluate specific tasks individually (e.g. `--tasks tinyHellaswag`) or all [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard) tasks by specifying `--tasks tinyBenchmarks`.
+
+### Advanced usage
+To obtain the estimated accuracies from all methods of the original paper, the *tinyBenchmarks* package has to be applied manually.
+To do so, run the evaluation with the `--log_samples` and `--output_path` arguments. For example:
+```bash
+lm_eval --model hf \
+    --model_args pretrained="mistralai/Mistral-7B-Instruct-v0.2" \
+    --tasks tinyHellaswag \
+    --batch_size 4 \
+    --output_path '<output_path>' \
+    --log_samples
+```
-For all other benchmarks, the dimension of `y` should be 100.
+Afterwards, fill in the correct `file_path` and run the following script:
 ```python
 import json
@@ -45,11 +58,11 @@ import numpy as np
 # Choose benchmark (e.g. hellaswag)
 benchmark = 'hellaswag' # possible benchmarks:
-                        # ['lb','mmlu','alpaca','helm_lite','truthfulqa',
-                        #  'gsm8k', 'winogrande', 'arc', 'hellaswag']
+                        # ['mmlu','truthfulqa', 'gsm8k',
+                        #  'winogrande', 'arc', 'hellaswag']
 # Get score vector from output-file (the metric [here `acc_norm`] depends on the benchmark)
-file_path = '<your-output-file.jsonl>'
+file_path = '<output_path>/<output-file.jsonl>'
 with open(file_path, 'r') as file:
     outputs = json.load(file)
...
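The README's post-processing script is truncated at this point in the diff. Purely as an illustration of the remaining step (not part of this commit), the score vector can be handed to `tinyBenchmarks.evaluate` the same way the `agg_*` helpers below do; the `acc_norm` field name and the structure of `outputs` are assumptions, not taken from this diff:
```python
import numpy as np
import tinyBenchmarks as tb

# `outputs` and `benchmark` come from the README script above. The field name
# 'acc_norm' is an assumption; it depends on the benchmark's metric
# (e.g. 'exact_match' for gsm8k).
y = np.array([float(sample["acc_norm"]) for sample in outputs])

# tb.evaluate(y, benchmark) returns estimates keyed by benchmark and method;
# 'gpirt' is the entry the task configs report via the agg_gpirt_* functions,
# 'pirt' the one used by agg_pirt.
estimates = tb.evaluate(y, benchmark)
print(estimates[benchmark]["gpirt"])
```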
# Aggregation functions for the tinyBenchmarks tasks. The task configs below
# reference them via `!function agg_functions.<name>`: each one receives the
# list of per-example correctness scores and returns the tinyBenchmarks
# estimate ('pirt' or 'gpirt') of accuracy on the full benchmark.
from typing import List

import numpy as np
import tinyBenchmarks as tb


def agg_pirt(items: List[float], benchmark: str) -> float:
    items = np.array(items)
    predictions = tb.evaluate(items, benchmark)
    return predictions[benchmark]["pirt"]


def agg_gpirt_arc(items: List[float], benchmark: str = "arc") -> float:
    items = np.array(items)
    predictions = tb.evaluate(items, benchmark)
    return predictions[benchmark]["gpirt"]


def agg_gpirt_gsm8k(items: List[float], benchmark: str = "gsm8k") -> float:
    items = np.array(items)
    predictions = tb.evaluate(items, benchmark)
    return predictions[benchmark]["gpirt"]


def agg_gpirt_hellaswag(items: List[float], benchmark: str = "hellaswag") -> float:
    items = np.array(items)
    predictions = tb.evaluate(items, benchmark)
    return predictions[benchmark]["gpirt"]


def agg_gpirt_mmlu(items: List[float], benchmark: str = "mmlu") -> float:
    items = np.array(items)
    predictions = tb.evaluate(items, benchmark)
    return predictions[benchmark]["gpirt"]


def agg_gpirt_truthfulqa(items: List[float], benchmark: str = "truthfulqa") -> float:
    items = np.array(items)
    predictions = tb.evaluate(items, benchmark)
    return predictions[benchmark]["gpirt"]


def agg_gpirt_winogrande(items: List[float], benchmark: str = "winogrande") -> float:
    items = np.array(items)
    predictions = tb.evaluate(items, benchmark)
    return predictions[benchmark]["gpirt"]
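For reference, these helpers can also be exercised outside the harness on a plain list of per-example scores. A minimal sketch, with made-up scores and assuming 100 examples per tiny dataset (the count mentioned in the replaced README text):
```python
import numpy as np

from agg_functions import agg_gpirt_hellaswag  # the module defined above

# Made-up 0/1 correctness scores for 100 tinyHellaswag examples; both the
# values and the example count are illustrative only.
rng = np.random.default_rng(seed=0)
scores = rng.integers(0, 2, size=100).astype(float).tolist()

# gpirt-based estimate of accuracy on the full HellaSwag benchmark.
print(agg_gpirt_hellaswag(scores))
```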
@@ -14,11 +14,8 @@ doc_to_choice: "{{choices.text}}"
 should_decontaminate: true
 doc_to_decontamination_query: "Question: {{question}}\nAnswer:"
 metric_list:
-  - metric: acc
-    aggregation: mean
-    higher_is_better: true
   - metric: acc_norm
-    aggregation: mean
+    aggregation: !function agg_functions.agg_gpirt_arc
     higher_is_better: true
 metadata:
   version: 0.0
@@ -13,7 +13,7 @@ doc_to_text: "Question: {{question}}\nAnswer:"
 doc_to_target: "{{answer}}" #" {{answer.split('### ')[-1].rstrip()}}"
 metric_list:
   - metric: exact_match
-    aggregation: mean
+    aggregation: !function agg_functions.agg_gpirt_gsm8k
     higher_is_better: true
     ignore_case: true
     ignore_punctuation: false
...
@@ -13,11 +13,8 @@ doc_to_text: "{{query}}"
 doc_to_target: "{{label}}"
 doc_to_choice: "choices"
 metric_list:
-  - metric: acc
-    aggregation: mean
-    higher_is_better: true
   - metric: acc_norm
-    aggregation: mean
+    aggregation: !function agg_functions.agg_gpirt_hellaswag
     higher_is_better: true
 metadata:
   version: 0.0
@@ -12,8 +12,8 @@ doc_to_choice: ["A", "B", "C", "D"]
 doc_to_target: answer
 num_fewshot: 0
 metric_list:
-  - metric: acc
-    aggregation: mean
+  - metric: acc_norm
+    aggregation: !function agg_functions.agg_gpirt_mmlu
     higher_is_better: true
 metadata:
   version: 0.0
@@ -7,7 +7,7 @@ should_decontaminate: True
 doc_to_decontamination_query: question
 metric_list:
   - metric: acc
-    aggregation: mean
+    aggregation: !function agg_functions.agg_gpirt_truthfulqa
     higher_is_better: true
 metadata:
   version: 0.0
@@ -11,8 +11,8 @@ doc_to_choice: !function utils_winogrande.doc_to_choice
 should_decontaminate: true
 doc_to_decontamination_query: sentence
 metric_list:
-  - metric: acc
-    aggregation: mean
+  - metric: acc_norm
+    aggregation: !function agg_functions.agg_gpirt_winogrande
     higher_is_better: true
 metadata:
   version: 0.0