Add docs on adding a multiple choice metric (#1147)

Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>

8e87eff4 · Paul McCann · GitHub · 13fbfef7 · 8e87eff4
Commit 8e87eff4 (unverified), authored Dec 19, 2023 by Paul McCann, committed by GitHub Dec 18, 2023

Showing with 43 additions and 0 deletions

docs/task_guide.md +43 -0

--- a/docs/task_guide.md
+++ b/docs/task_guide.md
@@ -219,6 +219,49 @@ Aggregation functions:
 * `weighted_perplexity`
 * `bits_per_byte`
+### Adding a Multiple Choice Metric
+Adding a multiple choice metric takes three steps:
+1. register a metric function
+2. register an aggregation function
+3. update the `Task` definition to make sure the correct arguments are passed
+The default metric and aggregation functions are in `lm_eval/api/metrics.py`, and you can add a function there if it's for general use. The metrics are towards the bottom of the file and look like this:
+    @register_metric(
+        metric="mcc",
+        higher_is_better=True,
+        output_type="multiple_choice",
+        aggregation="matthews_corrcoef",
+    )
+    def mcc_fn(items):  # This is a passthrough function
+        return items
+Note that many of these are passthrough functions, and for multiple choice (at least) this function is never actually called.
+Aggregation functions are defined towards the top of the file. Here's an example:
+    @register_aggregation("matthews_corrcoef")
+    def matthews_corrcoef(items):
+        unzipped_list = list(zip(*items))
+        golds = unzipped_list[0]
+        preds = unzipped_list[1]
+        return sklearn.metrics.matthews_corrcoef(golds, preds)
+This function returns a single numeric value. Its input is a list with one entry per document, and each entry is built in `Task.process_results` in `lm_eval/api/task.py`. There's a section that looks like this:
+    result_dict = {
+        **({"acc": acc} if "acc" in use_metric else {}),
+        **({"f1": (gold, pred)} if "f1" in use_metric else {}),
+        **({"mcc": (gold, pred)} if "mcc" in use_metric else {}),
+        **({"acc_norm": acc_norm} if "acc_norm" in use_metric else {}),
+        **({"exact_match": exact_match} if "exact_match" in use_metric else {}),
+    }
+Each value in this dict becomes one item in the input to the aggregation function, while each key must match the name of the registered metric. These metrics have simple needs: each takes either the accuracy or the gold and predicted values. Immediately below this section there are examples of metrics with more complicated needs that you can use as a reference.
 ## Good Reference Tasks
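
To make the data flow in the new section concrete, here is a minimal, self-contained sketch of the three steps. It does not import `lm_eval`: the two registries, the `evaluate` loop, the `toy_mcc` metric name, and the `binary_mcc` aggregation are hypothetical stand-ins written for illustration, and the real decorators in `lm_eval/api/metrics.py` may differ in detail. The MCC here is a pure-Python, binary-only substitute for `sklearn.metrics.matthews_corrcoef`.

```python
from collections import defaultdict
from math import sqrt

# Hypothetical stand-ins for the registries in lm_eval/api/metrics.py.
METRIC_REGISTRY = {}
AGGREGATION_REGISTRY = {}


def register_aggregation(name):
    def decorate(fn):
        AGGREGATION_REGISTRY[name] = fn
        return fn
    return decorate


def register_metric(metric, higher_is_better, output_type, aggregation):
    def decorate(fn):
        METRIC_REGISTRY[metric] = {
            "fn": fn,
            "higher_is_better": higher_is_better,
            "output_type": output_type,
            "aggregation": aggregation,
        }
        return fn
    return decorate


# Step 2: aggregation functions reduce a list of per-document items to one number.
@register_aggregation("mean")
def mean(items):
    return sum(items) / len(items)


@register_aggregation("binary_mcc")
def binary_mcc(items):
    # items is a list of (gold, pred) pairs with labels in {0, 1}.
    tp = sum(g == 1 and p == 1 for g, p in items)
    tn = sum(g == 0 and p == 0 for g, p in items)
    fp = sum(g == 0 and p == 1 for g, p in items)
    fn = sum(g == 1 and p == 0 for g, p in items)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Step 1: metric functions; both are passthroughs, as in the docs above.
@register_metric(metric="acc", higher_is_better=True,
                 output_type="multiple_choice", aggregation="mean")
def acc_fn(items):
    return items


@register_metric(metric="toy_mcc", higher_is_better=True,
                 output_type="multiple_choice", aggregation="binary_mcc")
def toy_mcc_fn(items):
    return items


# Step 3: a toy version of the result_dict section in process_results.
# The key names the metric; the value is what the aggregation will receive.
def process_results(gold, pred, use_metric):
    return {
        **({"acc": float(gold == pred)} if "acc" in use_metric else {}),
        **({"toy_mcc": (gold, pred)} if "toy_mcc" in use_metric else {}),
    }


def evaluate(docs, use_metric):
    # Collect one value per document and metric, then aggregate each list.
    collected = defaultdict(list)
    for gold, pred in docs:
        for name, value in process_results(gold, pred, use_metric).items():
            collected[name].append(value)
    return {
        name: AGGREGATION_REGISTRY[METRIC_REGISTRY[name]["aggregation"]](values)
        for name, values in collected.items()
    }


docs = [(1, 1), (0, 0), (1, 0), (0, 0)]  # (gold, pred) per document
scores = evaluate(docs, ["acc", "toy_mcc"])
print(scores)
```

Note that the passthrough metric functions are never called in this sketch either; only the registered aggregation name is looked up, which mirrors the remark in the new docs about multiple choice metrics.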