Unverified Commit b7fccef5 authored by Michele Resta, committed by GitHub

Adding the Evalita-LLM benchmark (#2681)



* feat: initial commit with templates for evalita evaluation

* fix: change rule for generate_until

* feat: modified yaml to use reduced version of NER test datasets

* feat: added templates to use reduced dataset for summarization (fanpage and ilpost)

* Add Six Prompts for Each Multiple-Choice Task

* feat: modified fewshot split for textual entailment task

* fix: new doc_to_target function for NER tasks

* Update prompt

* Add partition for few-shot evaluation

* Add partition for few-shot evaluation

* Add partition for few-shot evaluation

* Add partition for few-shot evaluation

* Update prompt

* Add partition for few-shot evaluation

* Rename file

Rename file from _evalita-mp_ner_adg_p1 .yaml to _evalita-mp_ner_adg_p1.yaml

* Add partition for few-shot evaluation

* Add partition for few-shot evaluation

* Enhance lexical substitution management

- Improve scorer calculation for better accuracy
- Update model output postprocessing for clearer results
- Add support for few-shot relation extraction task

* Add F1 macro measure for the document dating task

* Add F1-macro measure to evaluate document dating

* Use the whole dataset

* Small changes

* Add the two prompts for the task of lexical substitution

* Add few-shot split configuration

* Add few-shot split configuration

* Add function for handling few-shot learning setup

* Fix prompt

* Remove configuration file

* Update dataset from test_same to test_cross for evaluations

* Remove whitespace at end of prompt

* Fix configuration error: corrected parameter name for the dataset used in few-shot

* Fix: Check if results is not empty before processing in lexical substitution task

* added the prompts and functions for correct NER and RE execution

* Add accuracy measure

* Add tasks for the EVALITA-LLM benchmark evaluation

* Small changes

Add the alias of the task name that will be printed in the final results table.

* Updated the prompts to reflect changes made to the extended dataset for the Admission Test task

* chore: cleaned templates before PR; feat: add configuration to run generation/ppl tasks.

* fix: add information on Evalita-LLM for PR

* fix: rename folders and files

* fix: remove unused imports

* chore: run pre-commit

* chore: add task description

---------
Co-authored-by: rzanoli <zanoli@fbk.eu>
Co-authored-by: Marco Madeddu <marco.madeddu.bra@gmail.com>
parent a40fe42a
dataset_path: evalitahf/hatespeech_detection
output_type: multiple_choice
test_split: test_all
fewshot_split: dev
validation_split: dev
doc_to_target: hs # 0 = Falso, 1 = Vero
doc_to_choice: ["Falso", "Vero"]
metadata:
  version: 1
dataset_path: evalitahf/lexical_substitution
test_split: test
validation_split: dev
fewshot_split: dev
output_type: generate_until
generation_kwargs:
  until:
    - "</s>"
doc_to_target: !function utils.ls_doc_to_target
process_results: !function utils.ls_process_results
metric_list:
  - metric: f1
    higher_is_better: True
    aggregation: !function metrics._aggreg_ls
metadata:
  version: 1
dataset_path: evalitahf/entity_recognition
output_type: generate_until
generation_kwargs:
  until:
    - "</s>"
    - "\n"
doc_to_target: !function utils.ner_doc_to_target
process_results: !function utils.ner_process_results
metric_list:
  - metric: f1
    higher_is_better: True
    aggregation: !function metrics._aggreg_ner
metadata:
  version: 1
dataset_path: evalitahf/relation_extraction
test_split: test
output_type: generate_until
generation_kwargs:
  until:
    - "</s>"
doc_to_target: !function utils.re_doc_to_target
process_results: !function utils.rel_process_results_v3
metric_list:
  - metric: f1
    higher_is_better: True
    aggregation: !function metrics._aggreg_rel
metadata:
  version: 1
dataset_path: evalitahf/sentiment_analysis
output_type: multiple_choice
test_split: test
fewshot_split: train
validation_split: test
doc_to_target: !function utils.sa_doc_to_target_v2
doc_to_choice: ["positivo", "negativo", "neutrale", "misto"]
metadata:
  version: 1
dataset_path: evalitahf/sentiment_analysis
output_type: multiple_choice
test_split: test
fewshot_split: train
validation_split: test
doc_to_target: !function utils.sa_doc_to_target
doc_to_choice: !function utils.sa_doc_to_choice
metadata:
  version: 1
dataset_path: evalitahf/summarization-fp
output_type: generate_until
generation_kwargs:
  until:
    - "</s>"
test_split: test_100
fewshot_split: dev
doc_to_target: "{{target}}"
metadata:
  version: 1
dataset_path: ARTeLab/fanpage
output_type: generate_until
generation_kwargs:
  until:
    - "</s>"
test_split: test
doc_to_target: "{{target}}"
metadata:
  version: 1.0
dataset_path: silvia-casola/WITS
output_type: generate_until
generation_kwargs:
  until:
    - "</s>"
test_split: test_100
fewshot_split: dev
#test_split: train
doc_to_target: "{{summary}}"
metadata:
  version: 1
dataset_path: evalitahf/textual_entailment
output_type: multiple_choice
test_split: test
fewshot_split: dev
validation_split: dev
doc_to_target: "{{ 0 if entailment == 'SI' else 1 }}"
doc_to_choice: ["Sì", "No"]
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
metadata:
  version: 1
dataset_path: evalitahf/word_in_context
dataset_name: default
output_type: multiple_choice
test_split: test
fewshot_split: dev
validation_split: dev
doc_to_target: label # 0: No, 1: Si
doc_to_choice: ["No", "Sì"]
metric_list:
  - metric: f1
    higher_is_better: true
    aggregation: f1
metadata:
  version: 1.0
import torch
from sklearn.metrics import f1_score, precision_score, recall_score
inference_decorator = (
torch.inference_mode if torch.__version__ >= "2.0.0" else torch.no_grad
)
def _aggreg_ls(predictions):
    """
    Custom aggregation to compute corpus-level metrics for the lexical substitution task.
    predictions is a list of tuples (prec, has_answ, has_annotation):
    - prec is the precision contribution before dividing by |A|
    - has_answ is 0 if the model did not produce any answer
    - has_annotation is 0 if the gold answer is empty (no synonyms from the annotators)
    """
    # |A| = number of items the model answered, |T| = number of items with gold annotations
    A = sum([p[1] for p in predictions])
    T = sum([p[2] for p in predictions])
    # compute the final precision and recall, guarding against division by zero
    if A == 0:
        prec = sum([p[0] for p in predictions]) / 1
    else:
        prec = sum([p[0] for p in predictions]) / A
    if T == 0:
        rec = sum([p[0] for p in predictions]) / 1
    else:
        rec = sum([p[0] for p in predictions]) / T
    # compute the final F1 score
    f1 = 0
    if prec + rec != 0:
        f1 = (2 * prec * rec) / (prec + rec)
    return f1
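# Worked example (hypothetical per-item scores, not taken from the dataset):
# for predictions = [(0.5, 1, 1), (0.0, 0, 1)] we get |A| = 1 answered item and
# |T| = 2 annotated items, so precision = 0.5 / 1 = 0.5, recall = 0.5 / 2 = 0.25
# and F1 = 2 * 0.5 * 0.25 / (0.5 + 0.25) = 1/3.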
def _aggreg_sa_v2(predictions):
"""
This aggregation considers the sentiment analysis task as a multiple choice one with four classes
the f1 score is computed as the average of the f1 scores for each class weighted by the number of samples
See sklearn.metrics.f1_score for more details
"""
predictions, references = zip(*predictions)
f1 = f1_score(references, predictions, average="weighted")
return f1
def _aggreg_sa(predictions):
"""
Custom aggregation function for the sentiment analysis task
The original tasks compute the F1 score for each class and then average them
Since the prompt cast the task to a multple choice one we need to aggregate the results in a different way
"""
# split the predictions and references in two lists (pred is a tuple)
predictions, references = zip(*predictions)
"""
Class 0: positivo -> 'opos': 1, 'oneg': 0
Class 1: negativo -> 'opos': 0, 'oneg': 1
etc.
"""
def _map_to_original_labels(x):
"""
Return two separate list of labels for opos and oneg
x is a list of integers
"""
opos = []
oneg = []
for i in x:
if i == 0:
# positive
opos.append(1)
oneg.append(0)
elif i == 1:
# negative
opos.append(0)
oneg.append(1)
elif i == 2:
# neutral
opos.append(0)
oneg.append(0)
elif i == 3:
# mixed
opos.append(1)
oneg.append(1)
else:
pass
return opos, oneg
pred_opos, pred_oneg = _map_to_original_labels(predictions)
ref_opos, ref_oneg = _map_to_original_labels(references)
    opos_f1 = f1_score(ref_opos, pred_opos, average=None)
    opos_f1_c0 = opos_f1[0]
    if len(opos_f1) > 1:
        opos_f1_c1 = opos_f1[1]
    else:
        opos_f1_c1 = 0
# oneg class
oneg_prec_c0, oneg_prec_c1 = precision_score(
ref_oneg, pred_oneg, labels=[0, 1], average=None
)
oneg_rec_c0, oneg_rec_c1 = recall_score(
ref_oneg, pred_oneg, labels=[0, 1], average=None
)
    oneg_f1 = f1_score(ref_oneg, pred_oneg, average=None)
    oneg_f1_c0 = oneg_f1[0]
    if len(oneg_f1) > 1:
        oneg_f1_c1 = oneg_f1[1]
    else:
        oneg_f1_c1 = 0
# average f1 score for each class (opos and oneg)
f1_score_opos = (opos_f1_c0 + opos_f1_c1) / 2
f1_score_oneg = (oneg_f1_c0 + oneg_f1_c1) / 2
# average f1 score for the two classes
f1_final = (f1_score_opos + f1_score_oneg) / 2
return f1_final
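# Worked example (hypothetical item): a prediction/reference pair (0, 3), i.e. the
# model predicts "positivo" while the gold label is "misto", maps to opos pred 1 vs
# gold 1 (correct) and oneg pred 0 vs gold 1 (wrong), so the error is only counted
# on the oneg dimension before the per-class F1 scores are averaged.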
def _aggreg_ner(predictions):
    """
    Corpus-level aggregation for NER: each item is a (predicted_labels, gold_labels)
    pair as produced by utils.ner_process_results.
    """
    pred, ref = zip(*predictions)
    # concatenate all the predictions and references
    all_pred = []
    for p in pred:
        all_pred.extend(p)
    all_ref = []
    for r in ref:
        all_ref.extend(r)
    # average the per-class F1 scores, dropping the last (highest-valued) class,
    # which is intended to exclude the catch-all "O" label of NER_MAPPING
    f1 = f1_score(all_ref, all_pred, average=None)
    if len(f1) > 1:
        f1_sum = sum(f1[:-1]) / (len(f1) - 1)
    else:
        f1_sum = f1[0]
    return f1_sum
def _aggreg_rel(predictions):
pred, ref = zip(*predictions)
# concat all the predictions and references
all_pred = []
for p in pred:
all_pred.extend(p)
all_ref = []
for r in ref:
all_ref.extend(r)
# compute the F1 score
f1 = f1_score(all_ref, all_pred, average="macro")
return f1
# ------------------------ DOCUMENT DATING ---------------------------
def _aggreg_dd(items):
unzipped_list = list(zip(*items))
golds = unzipped_list[0]
preds = unzipped_list[1]
fscore = f1_score(golds, preds, average="macro")
return fscore
import logging
from evaluate import load
from sklearn.metrics import f1_score
eval_logger = logging.getLogger("lm-eval")
# ---------------------- SENTIMENT ANALYSIS ----------------------
def sa_doc_to_target(x):
"""
Function to extract the target from the dataset for sentiment analysis
"""
opos = x["opos"]
oneg = x["oneg"]
    # the returned index matches the choices in sa_doc_to_choice
if opos == "1" and oneg == "0":
return 0
elif opos == "0" and oneg == "1":
return 1
elif opos == "0" and oneg == "0":
return 2
elif opos == "1" and oneg == "1":
return 3
else:
pass
def sa_doc_to_target_v2(x):
"""
Function to extract the target from the dataset for sentiment analysis
"""
opos = x["opos"]
oneg = x["oneg"]
    # the returned index matches the choices in sa_doc_to_choice
if opos == "1" and oneg == "0":
return 0
elif opos == "0" and oneg == "1":
return 1
elif opos == "0" and oneg == "0":
return 2
elif opos == "1" and oneg == "1":
return 3
else:
pass
def sa_doc_to_choice(x):
"""
Function to return the choices from the dataset for sentiment analysis
"""
return ["Positivo", "Negativo", "Neutrale", "Misto"]
# ---------------------- LEXICAL SUBSTITUTION ----------------------
NO_SYN_STRING = "&&NOSYN&&"
def _ls_gold_to_target(x):
"""
    Build the gold answer string for the lexical substitution task
"""
# all_answers = [(i["word"], i["count"]) for i in x["answers"]]
if len(x["answers"]) == 0:
return NO_SYN_STRING
ans_str = ""
for i in x["answers"]:
ans_str += i["word"] + "$$" + str(i["count"]) + "::"
if len(ans_str) != 0 and ans_str[-2] == ":":
ans_str = ans_str[:-2]
# print(ans_str)
return ans_str
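# Example (hypothetical annotations): answers = [{"word": "abitazione", "count": 3},
# {"word": "dimora", "count": 1}] is encoded as "abitazione$$3::dimora$$1";
# an item with no annotated synonyms is encoded as the NO_SYN_STRING sentinel.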
def ls_doc_to_target(x):
"""
    Generate the target for the lexical substitution task
"""
if len(x["answers"]) == 0:
return NO_SYN_STRING
ans_str = ""
for i in x["answers"]:
ans_str += i["word"] + ", "
if len(ans_str) != 0 and ans_str[-2] == ",":
ans_str = ans_str[:-2]
return ans_str
def _ls_split_gold(x):
"""
Split the gold string into a list of tuples
"""
if x == NO_SYN_STRING:
return [], []
answers = x.split("::")
words = []
freqs = []
if len(answers) != 0:
for a in answers:
if "$$" in a:
word, count = a.split("$$")
words.append(word)
try:
freqs.append(int(count))
except ValueError:
freqs.append(0)
return words, freqs
def ls_process_results(doc, results):
"""
Process the results of the evaluation for the lexical substitution task
look at coqa for another example
"""
gold_to_target = _ls_gold_to_target(doc)
words, freqs = _ls_split_gold(gold_to_target)
prec = 0
# Considering a maximum of the first 10 synonyms
results = split_text_with_regex(results[0], LS_SPLIT_REGEX)
results = results[: min(10, len(results))]
# Remove non-alphabetic characters from the word at the end of the list
if results: # Check if results is not empty
results[-1] = "".join(char for char in results[-1] if char.isalpha())
has_answ = 0 if len(results) == 0 else 1 # so we can compute |A|
has_annotation = 0 if len(words) == 0 else 1 # so we can compute |T|
matching_res = [] # for debugging
for r in results:
if r in words:
# get frequency of the synonyms from annotators
idx = words.index(r.strip())
prec += freqs[idx]
matching_res.append(r)
# In the case of the OOT (out of ten) subtask, this normalization should not be applied
# ai = len(results) if len(results) != 0 else 1
# prec = prec / ai
Hi = sum(freqs)
if Hi != 0:
prec = prec / Hi
else:
eval_logger.debug("H_i is 0")
return {"f1": (prec, has_answ, has_annotation)}
# ---------------------- NER ----------------------
NO_ENT_STRING = "&&NOENT&&"
NER_ENTITY_SEPARATOR = ","
NER_TYPE_SEPARATOR = "$"
NER_MAPPING_V2 = {"PER": 0, "LOC": 1, "ORG": 2, NO_ENT_STRING: 3, "O": 4}
NER_MAPPING = {"PER": 0, "LOC": 1, "ORG": 2, "O": 3}
def _ner_gold_to_target(x: list) -> list:
"""
Convert the gold entities to the target format according to the NER_MAPPING
"""
res = [NER_MAPPING[e["type"]] for e in x]
return res
def _ner_gold_to_target_v2(x: list) -> list:
"""
Convert the gold entities to the target format according to the NER_MAPPING
"""
res = [NER_MAPPING[e["type"]] for e in x]
return res
def ner_doc_to_target(doc):
ents = doc["entities"]
targ_str = ""
# Entità$Tipo%Entità$Tipo.
if ents == []:
return NO_ENT_STRING
else:
for e in ents:
targ_str += (
e["entity_text"] + NER_TYPE_SEPARATOR + e["type"] + NER_ENTITY_SEPARATOR
)
return targ_str[:-1]
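# Example (hypothetical entities): a document annotated with
# [{"entity_text": "Mario Rossi", "type": "PER"}, {"entity_text": "Roma", "type": "LOC"}]
# yields the target string "Mario Rossi$PER,Roma$LOC", while a document with no
# entities yields the NO_ENT_STRING sentinel.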
def ner_process_results(doc, results):
"""
Process the results of the Named Entity Recognition task
"""
# each document has a list of entities with the following format:
# [{"entity_text": "string", "type": "string"}]
gold = doc["entities"]
raw_results = results[0]
results = _ner_process_raw_output(raw_results)
gold_labels = _ner_gold_to_target(gold)
res_labels = [0] * len(gold_labels)
matched_gold_idx = []
if len(results) > len(gold):
for r in results:
r_text = r[0]
r_type = r[1]
for i in range(len(gold)):
if r_text == gold[i]["entity_text"] and r_type == gold[i]["type"]:
res_labels[i] = NER_MAPPING[r_type]
matched_gold_idx.append(i)
# Since we have more results than gold, we artificially set to false positive the remaining labels
# extend gold label list
for i in range(len(results) - len(gold)):
gold_labels.append(3)
res_labels.append(2)
elif len(results) == 0 and len(gold) == 0:
res_labels = [3]
gold_labels = res_labels
else: # len(results) <= len(gold)
for r in results:
r_text = r[0]
r_type = r[1]
for i in range(len(gold)):
if r_text == gold[i]["entity_text"] and r_type == gold[i]["type"]:
res_labels[i] = NER_MAPPING[r_type]
matched_gold_idx.append(i)
# we map all wrong predictions to the "O" class
for i in range(len(gold_labels)):
if i in matched_gold_idx:
continue
if gold_labels[i] == 1:
res_labels[i] = 3
elif gold_labels[i] == 0:
res_labels[i] = 3
else:
res_labels[i] = 3
assert len(gold_labels) == len(res_labels)
return {"f1": (res_labels, gold_labels)}
def ner_process_results_v2(doc, results):
"""
Process the results of the Named Entity Recognition task
This version considers and score explicitly when the model responds that there are no entities
"""
# each document has a list of entities with the following format:
# [{"entity_text": "string", "type": "string"}]
gold = doc["entities"]
raw_results = results[0]
results = _ner_process_raw_output_v2(raw_results)
# eval_logger.debug(f"results {results}")
# eval_logger.debug(f"gold {gold}")
gold_labels = _ner_gold_to_target_v2(gold)
res_labels = [0] * len(gold_labels)
matched_gold_idx = []
if len(results) > len(gold):
for r in results:
# print(r)
r_text = r[0]
r_type = r[1]
for i in range(len(gold)):
if r_text == gold[i]["entity_text"] and r_type == gold[i]["type"]:
res_labels[i] = NER_MAPPING[r_type]
matched_gold_idx.append(i)
# Since we have more results than gold, we artificially set to false positive the remaining labels
# extend gold label list
for i in range(len(results) - len(gold)):
# gold_labels.append(3)
# res_labels.append(2)
gold_labels.append(4)
res_labels.append(3)
elif len(results) == 0 and len(gold) == 0:
# res_labels = [random.choice([0, 1, 2, 3])]
res_labels = [3]
gold_labels = res_labels
elif len(results) == 1 and results[0] == NO_ENT_STRING:
# res_labels = [3]
res_labels = [4]
gold_labels = res_labels
else: # len(results) <= len(gold)
for r in results:
r_text = r[0]
r_type = r[1]
for i in range(len(gold)):
if r_text == gold[i]["entity_text"] and r_type == gold[i]["type"]:
res_labels[i] = NER_MAPPING[r_type]
matched_gold_idx.append(i)
# we map all wrong predictions to the "O" class
for i in range(len(gold_labels)):
if i in matched_gold_idx:
continue
if gold_labels[i] == 1:
# res_labels[i] = 2
res_labels[i] = 4
elif gold_labels[i] == 0:
# res_labels[i] = 1
res_labels[i] = 4
else:
res_labels[i] = 4
assert len(gold_labels) == len(res_labels)
return {"f1": (res_labels, gold_labels)}
def _ner_process_raw_output(llm_result: str) -> list[tuple]:
if NO_ENT_STRING in llm_result:
return []
if llm_result == "":
return ["WRONG"]
tmp_results = llm_result.split(NER_ENTITY_SEPARATOR)
results = []
for res in tmp_results:
r = res.strip()
# split on type separator
r_text = ""
r_type = ""
r_splitted = r.split(NER_TYPE_SEPARATOR)
if len(r_splitted) < 2:
r_text = r_splitted[0]
r_type = ""
else:
r_text = r_splitted[0]
r_type = r_splitted[1]
if r_text != "":
results.append((r_text, r_type.upper()))
return results
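# Example (hypothetical model output): "Mario Rossi$per, Roma$LOC" is parsed into
# [("Mario Rossi", "PER"), ("Roma", "LOC")]; an empty completion becomes ["WRONG"]
# and a completion containing NO_ENT_STRING becomes an empty list.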
def _ner_process_raw_output_v2(llm_result: str) -> list[tuple]:
if NO_ENT_STRING in llm_result:
return [NO_ENT_STRING]
if llm_result == "":
return ["WRONG"]
tmp_results = llm_result.split(NER_ENTITY_SEPARATOR)
results = []
for res in tmp_results:
r = res.strip()
# split on type separator
r_text = ""
r_type = ""
r_splitted = r.split(NER_TYPE_SEPARATOR)
if len(r_splitted) < 2:
r_text = r_splitted[0]
r_type = ""
else:
r_text = r_splitted[0]
r_type = r_splitted[1]
if r_text != "":
results.append((r_text, r_type.upper()))
return results
# ---------------------- RELATION EXTRACTION ----------------------
def _rel_process_raw_output(llm_result: str) -> list[str]:
if NO_REL_STRING in llm_result:
return []
if llm_result == "":
return ["WRONG"]
tmp_results = llm_result.split(INTER_REL_SEPARATOR)
relations = []
for res in tmp_results:
r_text1 = ""
r_text2 = ""
r_splitted = res.split(INTRA_REL_SEPARATOR)
if len(r_splitted) < 2:
r_text1 = r_splitted[0].strip()
r_text2 = ""
else:
r_text1 = r_splitted[0].strip()
r_text2 = r_splitted[1].strip()
relations.append((r_text1, r_text2))
assert len(relations) == len(tmp_results)
return relations
INTER_REL_SEPARATOR = "%"
INTRA_REL_SEPARATOR = "$"
NO_REL_STRING = "&&NOREL&&"
def re_doc_to_target(doc):
ents = doc["relations"]
targ_str = ""
    # relation pairs are serialized as text1$text2%text3$text4.
if ents == []:
return NO_ENT_STRING
else:
for e in ents:
targ_str += e[0] + INTRA_REL_SEPARATOR + e[1] + INTER_REL_SEPARATOR
return targ_str[:-1]
def _rel_gold_to_target(x: list) -> list:
if x == []:
return [0]
else:
return [1] * len(x)
def rel_doc_to_target(doc):
rel = doc["relations"]
targ_str = ""
    # misura1$result1%misura2$result2.
if rel == []:
return NO_REL_STRING
else:
for r in rel:
targ_str += r[0] + "$" + r[1] + "%"
return targ_str[:-1]
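# Example (hypothetical relations): doc["relations"] = [["pressione arteriosa", "120/80"],
# ["frequenza cardiaca", "70 bpm"]] yields the target string
# "pressione arteriosa$120/80%frequenza cardiaca$70 bpm"; an empty list yields NO_REL_STRING.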
def _extract_relations(results):
relations = []
for r in results:
r_text1 = ""
r_text2 = ""
r_splitted = r.split(INTRA_REL_SEPARATOR)
if len(r_splitted) < 2:
r_text1 = r_splitted[0]
r_text2 = ""
else:
r_text1 = r_splitted[0]
r_text2 = r_splitted[1]
relations.append((r_text1, r_text2))
assert len(relations) == len(results)
return relations
def rel_process_results_v3(doc, results):
"""
Process the results of the Relation extraction task not considering the order of the relation extracted
"""
# each document has a list of relation with the following format:
# [[text1, text2], [text3, text4]]
gold = doc["relations"]
raw_results = results[0]
has_results = 0 if NO_REL_STRING in raw_results else 1
has_gold = 1 if gold != [] else 0
res_labels = []
gold_labels = []
if has_results == 0 and has_gold:
# False negative
gold_labels = _rel_gold_to_target(gold)
res_labels = [0] * len(gold_labels)
elif has_results == 0 and has_gold == 0:
# True negative
gold_labels = _rel_gold_to_target(gold)
res_labels = gold_labels
elif has_results and has_gold == 0:
# False positive
gold_labels = _rel_gold_to_target(gold)
res_labels = [1] * len(gold_labels)
else:
results = _rel_process_raw_output(raw_results)
# results = raw_results.split(INTER_REL_SEPARATOR)
gold_labels = _rel_gold_to_target(gold)
res_labels = [0] * len(gold_labels)
assert len(gold) > 0
for i in range(len(gold)):
for j in range(len(results)):
r_text1 = results[j][0]
r_text2 = results[j][1]
if r_text1 == gold[i][0] and r_text2 == gold[i][1]: # list of lists
res_labels[i] = 1
results[j] = ("DELETED", "DELETED")
elif r_text1 == "DELETED" and r_text2 == "DELETED":
continue
else:
pass
# if there are more predictions than gold, we set the remaining predictions to false positive
if len(results) - len(gold) > 0:
for i in range(len(results) - len(gold)):
if results[i] == ("DELETED", "DELETED"):
continue
res_labels.append(1)
gold_labels.append(0)
assert len(gold_labels) == len(res_labels)
return {"f1": (res_labels, gold_labels)}
LS_SPLIT_REGEX = r"[^,]+"
def split_text_with_regex(text, pattern):
"""
pattern: str - a regex pattern to match the text
text: str - the text to split
"""
import re
# Get text with model-generated words for comparison with the gold standard
text = text.split("\n")[0]
# Find all matches for the pattern
matches = re.findall(pattern, text)
# Split each matched segment further if it contains a comma and is quoted
result = []
for match in matches:
if match.startswith('"') and match.endswith('"'):
# Remove the quotes and split inside the quoted string
inner_matches = re.findall(r"[^,]+", match[1:-1])
result.extend(inner_matches)
else:
result.append(match)
# Strip leading and trailing whitespaces from each element
result = [element.strip().replace('"', "") for element in result]
return result
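# Example (hypothetical completion): split_text_with_regex("casa, abitazione, dimora\nAltro testo",
# LS_SPLIT_REGEX) keeps only the first line and returns ["casa", "abitazione", "dimora"].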
# ---------------------- SUMMARIZATION ----------------------
def rouge1_score(references, predictions, **kwargs):
"""
suboptimal way of compute rouge because of the following issue:
https://github.com/EleutherAI/lm-evaluation-harness/issues/1302
"""
rouge = load("rouge")
return rouge.compute(predictions=predictions, references=references, **kwargs)[
"rouge1"
]
def process_results_sum(doc, results):
"""
Process the results of the Evalita summarization task
"""
ref = doc["summary"] if "summary" in doc.keys() else doc["target"]
rouge_scorer = load("rouge", keep_in_memory=True)
r1score = rouge_scorer.compute(predictions=results, references=[ref])["rouge1"]
return {
"rouge1": r1score,
}
def faq_doc_to_target(x):
if x["correct_answer"] == "A":
return 0
elif x["correct_answer"] == "B":
return 1
elif x["correct_answer"] == "C":
return 2
elif x["correct_answer"] == "D":
return 3
else:
eval_logger.warning(
'WARNING: correct answer not found or not in ["A", "B", "C", "D"]'
)
def ht_doc_to_target(x):
if x["source"] == "ilgiornale":
return 0
elif x["source"] == "repubblica":
return 1
else:
eval_logger.warning(
'WARNING: source not found or not in ["ilgiornale", "repubblica"]'
)
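A minimal usage sketch, assuming the task group registered by the YAML files above is exposed under a name such as "evalita-llm" (the exact task and group names come from the task configs, so treat the identifiers below as placeholders) and using the harness's programmatic entry point lm_eval.simple_evaluate:

import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=<hf-model-id>",  # any HuggingFace causal LM identifier
    tasks=["evalita-llm"],                   # assumed group name, adjust to the registered one
    num_fewshot=5,
)
print(results["results"])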