Unverified Commit 8e87eff4 authored by Paul McCann's avatar Paul McCann Committed by GitHub
Browse files

Add docs on adding a multiple choice metric (#1147)


Co-authored-by: default avatarPaul O'Leary McCann <polm@dampfkraft.com>
parent 13fbfef7
......@@ -219,6 +219,49 @@ Aggregation functions:
* `weighted_perplexity`
* `bits_per_byte`
### Adding a Multiple Choice Metric
Adding a multiple choice metric has a few steps. To get it working you need to:
1. register a metric function
2. register an aggregation function
3. update the `Task` definition to make sure the correct arguments are passed
The default metric and aggregation functions are in `lm_eval/api/metrics.py`, and you can add a function there if it's for general use. The metrics are towards the bottom of the file and look like this:
@register_metric(
metric="mcc",
higher_is_better=True,
output_type="multiple_choice",
aggregation="matthews_corrcoef",
)
def mcc_fn(items): # This is a passthrough function
return items
Note that many of these are passthrough functions, and for multiple choice (at least) this function is never actually called.
Aggregation functions are defined towards the top of the file, here's an example:
@register_aggregation("matthews_corrcoef")
def matthews_corrcoef(items):
unzipped_list = list(zip(*items))
golds = unzipped_list[0]
preds = unzipped_list[1]
return sklearn.metrics.matthews_corrcoef(golds, preds)
This function returns a single numeric value. The input is defined in `Task.process_results` in `lm_eval/api/task.py`. There's a section that looks like this:
result_dict = {
**({"acc": acc} if "acc" in use_metric else {}),
**({"f1": (gold, pred)} if "f1" in use_metric else {}),
**({"mcc": (gold, pred)} if "mcc" in use_metric else {}),
**({"acc_norm": acc_norm} if "acc_norm" in use_metric else {}),
**({"exact_match": exact_match} if "exact_match" in use_metric else {}),
}
The value here determines the input to the aggregation function, though the name used matches the metric function. These metrics all have simple needs and just need the accuracy or gold and predicted values, but immediately below this there are examples of metrics with more complicated needs you can use as reference.
## Good Reference Tasks
......
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment