Commit e61a3159 authored by LucWeber's avatar LucWeber

Add tinyBenchmarks post-processing

parent 157b4661
......@@ -20,23 +20,36 @@ All configs and utils mirror the ones from their original dataset!
#### Tasks
* `tinyArc`, `tinyGSM8k`, `tinyHellaswag`, `tinyMMLU`, `tinyTruthfulQA`, `tinyWinogrande`,
* `tinyArc`, `tinyGSM8k`, `tinyHellaswag`, `tinyMMLU`, `tinyTruthfulQA`, `tinyWinogrande`
### Usage
*tinyBenchmarks* can evaluate different benchmarks with a fraction of their examples.
To obtain accurate results, the score vectors obtained with this task require post-processing using the *tinyBenchmarks*-package (see [here](https://github.com/felipemaiapolo/tinyBenchmarks/blob/main/README.md?plain=1)).
You can install our package by running the following commands on the terminal
To obtain accurate results, this task applies post-processing using the *tinyBenchmarks*-package.
You can install the package by running the following command in the terminal (for more information, see [here](https://github.com/felipemaiapolo/tinyBenchmarks/blob/main/README.md?plain=1)):
```sh
$ pip install git+https://github.com/felipemaiapolo/tinyBenchmarks
pip install git+https://github.com/felipemaiapolo/tinyBenchmarks
```
Through the package, the ability parameter $\theta$ from the IRT model will be estimated using all the available data. For `benchmark='lb'` or `benchmark='helm_lite'`, the dimension of `y` should be 600 and 1000, respectively, where the correctness values must obey the following order
- For the Open LLM Leaderboard: TruthfulQA, GSM8K, Winogrande, ARC, HellaSwag, and MMLU;
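For illustration, a leaderboard-style score vector could be assembled as in the sketch below; the random values only stand in for real per-example correctness scores, and it assumes each of the six sub-benchmarks contributes its 100 values in the order listed above:
```python
import numpy as np
import tinyBenchmarks as tb

# Stand-in correctness vectors (100 examples each), concatenated in the
# order required for benchmark='lb':
# TruthfulQA, GSM8K, Winogrande, ARC, HellaSwag, MMLU.
parts = [np.random.randint(0, 2, size=100).astype(float) for _ in range(6)]
y = np.concatenate(parts)  # shape (600,)

# Returns IRT-based accuracy estimates for each sub-benchmark.
estimates = tb.evaluate(y, 'lb')
```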
The value returned by the task corresponds to the **IRT++** method from the [original paper](https://arxiv.org/abs/2402.14992).
Evaluate specific tasks individually (e.g. `--tasks tinyHellaswag`) or all [open LLM leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard) tasks by specifying `--tasks tinyBenchmarks`.
### Advanced usage
To obtain the estimated accuracies from all methods of the original paper, the *tinyBenchmarks* package has to be applied manually.
To do so, run the evaluation with the `--log_samples` and `--output_path` arguments. For example:
```bash
lm_eval --model hf \
    --model_args pretrained="mistralai/Mistral-7B-Instruct-v0.2" \
    --tasks tinyHellaswag \
    --batch_size 4 \
    --output_path '<output_path>' \
    --log_samples
```
For all other benchmarks, the dimension of `y` should be 100.
Afterwards, include the correct `file_path` and run the following script:
```python
import json
......@@ -45,11 +58,11 @@ import numpy as np
# Choose benchmark (e.g. hellaswag)
benchmark = 'hellaswag' # possible benchmarks:
# ['lb','mmlu','alpaca','helm_lite','truthfulqa',
# 'gsm8k', 'winogrande', 'arc', 'hellaswag']
# ['mmlu','truthfulqa', 'gsm8k',
# 'winogrande', 'arc', 'hellaswag']
# Get score vector from output-file (the metric [here `acc_norm`] depends on the benchmark)
file_path = '<your-output-file.jsonl>'
file_path = '<output_path>/<output-file.jsonl>'
with open(file_path, 'r') as file:
    outputs = json.load(file)
```
......
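The remainder of the script is collapsed in this view; a minimal sketch of the full post-processing, assuming the logged samples form a JSON list whose entries carry a `doc_id` and the chosen metric (here `acc_norm`), could look like this:
```python
import json

import numpy as np
import tinyBenchmarks as tb

benchmark = 'hellaswag'
metric = 'acc_norm'  # the metric key depends on the benchmark

# Assumption: the logged samples are a JSON list of per-example records.
file_path = '<output_path>/<output-file.jsonl>'
with open(file_path, 'r') as file:
    outputs = json.load(file)

# Keep the examples in their original order so the correctness values
# line up with the IRT anchor points.
outputs = sorted(outputs, key=lambda x: x['doc_id'])

# 100-dimensional correctness vector for this tiny benchmark.
y = np.array([float(item[metric]) for item in outputs])

# Estimated full-benchmark accuracies (IRT, p-IRT, gp-IRT).
print(tb.evaluate(y, benchmark)[benchmark])
```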
# agg_functions.py: aggregation hooks referenced from the tiny* task configs
# via `!function agg_functions.<name>`; each one feeds the per-example
# correctness values to tinyBenchmarks and returns the estimated
# full-benchmark accuracy.
from typing import List

import numpy as np
import tinyBenchmarks as tb


def agg_pirt(items: List[float], benchmark: str) -> float:
    items = np.array(items)
    predictions = tb.evaluate(items, benchmark)
    return predictions[benchmark]["pirt"]


def agg_gpirt_arc(items: List[float], benchmark: str = "arc") -> float:
    items = np.array(items)
    predictions = tb.evaluate(items, benchmark)
    return predictions[benchmark]["gpirt"]


def agg_gpirt_gsm8k(items: List[float], benchmark: str = "gsm8k") -> float:
    items = np.array(items)
    predictions = tb.evaluate(items, benchmark)
    return predictions[benchmark]["gpirt"]


def agg_gpirt_hellaswag(items: List[float], benchmark: str = "hellaswag") -> float:
    items = np.array(items)
    predictions = tb.evaluate(items, benchmark)
    return predictions[benchmark]["gpirt"]


def agg_gpirt_mmlu(items: List[float], benchmark: str = "mmlu") -> float:
    items = np.array(items)
    predictions = tb.evaluate(items, benchmark)
    return predictions[benchmark]["gpirt"]


def agg_gpirt_truthfulqa(items: List[float], benchmark: str = "truthfulqa") -> float:
    items = np.array(items)
    predictions = tb.evaluate(items, benchmark)
    return predictions[benchmark]["gpirt"]


def agg_gpirt_winogrande(items: List[float], benchmark: str = "winogrande") -> float:
    items = np.array(items)
    predictions = tb.evaluate(items, benchmark)
    return predictions[benchmark]["gpirt"]
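Outside of the harness, calling one of these hooks directly would look roughly like the sketch below; the random scores are only a stand-in for the per-example correctness values that lm-eval collects:
```python
import numpy as np
from agg_functions import agg_gpirt_arc

# Stand-in for the 100 per-example correctness values (0.0 or 1.0) that
# lm-eval would pass to the aggregation function for tinyArc.
scores = np.random.randint(0, 2, size=100).astype(float).tolist()

# gp-IRT estimate of the accuracy on the full ARC benchmark.
print(agg_gpirt_arc(scores))
```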
......@@ -14,11 +14,8 @@ doc_to_choice: "{{choices.text}}"
should_decontaminate: true
doc_to_decontamination_query: "Question: {{question}}\nAnswer:"
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
  - metric: acc_norm
    aggregation: mean
    aggregation: !function agg_functions.agg_gpirt_arc
    higher_is_better: true
metadata:
  version: 0.0
......@@ -13,7 +13,7 @@ doc_to_text: "Question: {{question}}\nAnswer:"
doc_to_target: "{{answer}}" #" {{answer.split('### ')[-1].rstrip()}}"
metric_list:
  - metric: exact_match
    aggregation: mean
    aggregation: !function agg_functions.agg_gpirt_gsm8k
    higher_is_better: true
    ignore_case: true
    ignore_punctuation: false
......
......@@ -13,11 +13,8 @@ doc_to_text: "{{query}}"
doc_to_target: "{{label}}"
doc_to_choice: "choices"
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
  - metric: acc_norm
    aggregation: mean
    aggregation: !function agg_functions.agg_gpirt_hellaswag
    higher_is_better: true
metadata:
  version: 0.0
......@@ -12,8 +12,8 @@ doc_to_choice: ["A", "B", "C", "D"]
doc_to_target: answer
num_fewshot: 0
metric_list:
  - metric: acc
    aggregation: mean
  - metric: acc_norm
    aggregation: !function agg_functions.agg_gpirt_mmlu
    higher_is_better: true
metadata:
  version: 0.0
......@@ -7,7 +7,7 @@ should_decontaminate: True
doc_to_decontamination_query: question
metric_list:
  - metric: acc
    aggregation: mean
    aggregation: !function agg_functions.agg_gpirt_truthfulqa
    higher_is_better: true
metadata:
  version: 0.0
......@@ -11,8 +11,8 @@ doc_to_choice: !function utils_winogrande.doc_to_choice
should_decontaminate: true
doc_to_decontamination_query: sentence
metric_list:
  - metric: acc
    aggregation: mean
  - metric: acc_norm
    aggregation: !function agg_functions.agg_gpirt_winogrande
    higher_is_better: true
metadata:
  version: 0.0