Adding the Evalita-LLM benchmark (#2681)

* feat: initial commit with templates for evalita evaluation
* fix: change rule for generate_until
* feat: modified yaml to use reduced version of NER test datasets
* feat: added templates to use reduced dataset for summarization (fanpage and ilpost)
* Add Six Prompts for Each Multiple-Choice Task
* feat: modified fewshot split for textual entailment task
* fix: new doc_to_target function for NER tasks
* Update prompt
* Add partition for few-shot evaluation
* Add partition for few-shot evaluation
* Add partition for few-shot evaluation
* Add partition for few-shot evaluatio
* Update prompt
* Add partition for few-shot evaluation
* Rename file

  Rename file from _evalita-mp_ner_adg_p1 .yaml to _evalita-mp_ner_adg_p1.yaml
* Add partition for few-shot evaluation
* Add partition for few-shot evaluation
* Enhance lexical substitution management
  - Improve scorer calculation for better accuracy
  - Update model output postprocessing for clearer results
  - Add support for few-shot relation extraction task
* Add F1 macro measure for the document dating task
* Add F1-macro measure to evaluate document dating
* Use the whole dataset
* Small changes
* Add the two prompts for the task of lexical substitution
* Add few-shot split configuration
* Add few-shot split configuration
* Add function for handling few-shot learning setup
* Fix prompt
* Remove configuration file
* Update dataset from test_same to test_cross for evaluations
* Remove whitespace at end of prompt
* Fix configuration error: corrected parameter name for the dataset used in few-shot
* Fix: Check if results is not empty before processing in lexical substitution task
* added the prompts and functions for correct NER and RE execution
* Add accuracy measure
* Add tasks for the EVALITA-LLM benchmark evaluation
* Small changes

  Add the alias of the task name that will be printed in the final table results.
* Updated the prompts to reflect changes made to the extended dataset for the Admission Test task
* chore: cleaned templates before PR; feat: add configuration to run generation/ppl tasks.
* fix: add information on Evalita-LLM for PR
* fix: rename folders and files
* fix: remove unused imports
* chore: run pre-commit
* chore: add task description

---------

Co-authored-by: rzanoli <zanoli@fbk.eu>
Co-authored-by: Marco Madeddu <marco.madeddu.bra@gmail.com>

Commit b7fccef5 (unverified), authored Feb 11, 2025 by Michele Resta, committed by GitHub Feb 11, 2025
20 changed files
--- a/lm_eval/tasks/evalita_llm/_evalita-mp_hs_p2.yaml
+++ b/lm_eval/tasks/evalita_llm/_evalita-mp_hs_p2.yaml
+tag: evalita-mp_hs_tasks
+include: _hs_template_yaml
+task: evalita-mp_hs_prompt-2
+task_alias: prompt-2
+#doc_to_text: "Dato il seguente testo, rispondi Vero se il testo contiene hate speech, altrimenti rispondi Falso. Testo:\n{{full_text}} Risposta:"
+doc_to_text: "Devi svolgere un compito di identificazione di incitamento all'odio. C'è incitamento all'odio nel seguente tweet: '{{full_text}}'?"
+metric_list:
+  - metric: f1
+    higher_is_better: true
+    average: macro
+    aggregation: f1
+metadata:
+  version: 1.0
--- a/lm_eval/tasks/evalita_llm/_evalita-mp_hs_p3.yaml
+++ b/lm_eval/tasks/evalita_llm/_evalita-mp_hs_p3.yaml
+tag: evalita-mp_hs_tasks
+include: _hs_template_yaml
+task: evalita-mp_hs_prompt-3
+task_alias: prompt-3
+doc_to_choice: ["B", "A"]
+#doc_to_text: "Dato il seguente testo, rispondi Vero se il testo contiene hate speech, altrimenti rispondi Falso. Testo:\n{{full_text}} Risposta:"
+doc_to_text: "C'è incitamento all'odio nel seguente tweet: '{{full_text}}'?\nA: Vero\nB: Falso\nRisposta:"
+metric_list:
+  - metric: f1
+    higher_is_better: true
+    average: macro
+    aggregation: f1
+metadata:
+  version: 1.0
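The `doc_to_choice: ["B", "A"]` ordering in this prompt can be read as follows. This is a minimal sketch under two assumptions that the diff does not confirm: the gold label is an integer index into `doc_to_choice` (0 for no hate speech, 1 for hate speech), and the prediction is the choice whose continuation has the highest log-likelihood. The helper names `gold_choice` and `predict` are hypothetical and are not part of the harness.

```python
# Mirrors `doc_to_choice: ["B", "A"]` from the config above.
# Assumption: label 0 = no hate speech ("B: Falso"), label 1 = hate speech ("A: Vero").
DOC_TO_CHOICE = ["B", "A"]


def gold_choice(label: int) -> str:
    """Return the answer letter that corresponds to an integer gold label."""
    return DOC_TO_CHOICE[label]


def predict(loglikelihoods: list[float]) -> int:
    """Return the index of the choice with the highest log-likelihood."""
    if len(loglikelihoods) != len(DOC_TO_CHOICE):
        raise ValueError("expected one log-likelihood per choice")
    return max(range(len(loglikelihoods)), key=loglikelihoods.__getitem__)


# Hypothetical log-likelihoods for the continuations "B" and "A".
scores = [-2.3, -0.4]
predicted = predict(scores)
print(DOC_TO_CHOICE[predicted], predicted == 1)
```

Listing "B" first keeps index 0 aligned with the negative class even though the prompt text presents option A first.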
--- a/lm_eval/tasks/evalita_llm/_evalita-mp_hs_p4.yaml
+++ b/lm_eval/tasks/evalita_llm/_evalita-mp_hs_p4.yaml
+tag: evalita-mp_hs_tasks
+include: _hs_template_yaml
+task: evalita-mp_hs_prompt-4
+task_alias: prompt-4
+doc_to_choice: ["B", "A"]
+#doc_to_text: "Dato il seguente testo, rispondi Vero se il testo contiene hate speech, altrimenti rispondi Falso. Testo:\n{{full_text}} Risposta:"
+doc_to_text: "Devi svolgere un compito di identificazione di incitamento all'odio. C'è incitamento all'odio nel seguente tweet: '{{full_text}}'?\nA: Sì\nB: No\nRisposta:"
+metric_list:
+  - metric: f1
+    higher_is_better: true
+    average: macro
+    aggregation: f1
+metadata:
+  version: 1.0
--- a/lm_eval/tasks/evalita_llm/_evalita-mp_hs_p5.yaml
+++ b/lm_eval/tasks/evalita_llm/_evalita-mp_hs_p5.yaml
+tag: evalita-mp_hs_tasks
+include: _hs_template_yaml
+task: evalita-mp_hs_prompt-5
+task_alias: prompt-5
+doc_to_choice: ["non contiene incitamento all'odio", "contiene incitamento all'odio"]
+#doc_to_text: "Dato il seguente testo, rispondi Vero se il testo contiene hate speech, altrimenti rispondi Falso. Testo:\n{{full_text}} Risposta:"
+doc_to_text: "Il tweet: '{{full_text}}'"
+metric_list:
+  - metric: f1
+    higher_is_better: true
+    average: macro
+    aggregation: f1
+metadata:
+  version: 1.0
--- a/lm_eval/tasks/evalita_llm/_evalita-mp_hs_p6.yaml
+++ b/lm_eval/tasks/evalita_llm/_evalita-mp_hs_p6.yaml
+tag: evalita-mp_hs_tasks
+include: _hs_template_yaml
+task: evalita-mp_hs_prompt-6
+task_alias: prompt-6
+doc_to_choice: ["non contiene incitamento all'odio", "contiene incitamento all'odio"]
+#doc_to_text: "Dato il seguente testo, rispondi Vero se il testo contiene hate speech, altrimenti rispondi Falso. Testo:\n{{full_text}} Risposta:"
+doc_to_text: "Devi svolgere un compito di identificazione di incitamento all'odio. Il tweet: '{{full_text}}'"
+metric_list:
+  - metric: f1
+    higher_is_better: true
+    average: macro
+    aggregation: f1
+metadata:
+  version: 1.0
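Each hate-speech prompt above is scored with `metric: f1` and `average: macro`. The sketch below shows the standard definition of macro-averaged F1 in plain Python. It is an illustration only: the function name `macro_f1` is hypothetical, and the harness may compute the metric through a different code path.

```python
def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over every class seen in gold or pred."""
    labels = sorted(set(gold) | set(pred))
    scores = []
    for label in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores) if scores else 0.0


# Toy labels: 1 = hate speech, 0 = no hate speech (made-up data).
gold = [1, 0, 1, 1, 0]
pred = [1, 0, 0, 1, 1]
print(round(macro_f1(gold, pred), 4))
```

Macro averaging gives both classes equal weight, which matters when hate-speech examples are a minority of the test set.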
--- a/lm_eval/tasks/evalita_llm/_evalita-mp_hs_task.yaml
+++ b/lm_eval/tasks/evalita_llm/_evalita-mp_hs_task.yaml
+group: evalita-mp_hs
+group_alias: hate-speech-detection
+task:
+  - evalita-mp_hs_tasks
+aggregate_metric_list:
+  - metric: f1
+    weight_by_size: True
+
+metadata:
+  version: 1
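The group config sets `weight_by_size: True`, which can be read as averaging the per-prompt scores in proportion to the number of examples each subtask evaluated. The sketch below illustrates that idea; the function name `aggregate` and the numbers are made up, and the harness's own aggregation code may differ in detail.

```python
def aggregate(scores, sizes, weight_by_size=True):
    """Combine per-subtask scores into one group score.

    With weight_by_size=True each score is weighted by its sample count;
    otherwise every subtask counts equally.
    """
    if len(scores) != len(sizes):
        raise ValueError("scores and sizes must have the same length")
    if not scores:
        raise ValueError("at least one subtask is required")
    if not weight_by_size:
        return sum(scores) / len(scores)
    total = sum(sizes)
    return sum(score * size for score, size in zip(scores, sizes)) / total


# Two hypothetical subtasks: 300 examples at F1 0.5, 100 examples at F1 1.0.
print(aggregate([0.5, 1.0], [300, 100]))         # size-weighted mean
print(aggregate([0.5, 1.0], [300, 100], False))  # plain mean
```

When all prompts of a group run on the same test split, the sizes are equal and both settings yield the same value.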
--- a/lm_eval/tasks/evalita_llm/_evalita-mp_ls_p1.yaml
+++ b/lm_eval/tasks/evalita_llm/_evalita-mp_ls_p1.yaml
+tag: evalita-mp_ls_tasks
+include: _ls_template_yaml
+task: evalita-mp_ls_prompt-1
+task_alias: prompt-1
+#doc_to_text: "Sostituisci la parola tra i tag <head> con sinonimi appropriati per il contesto. Separa i sinonimi con virgole. Testo:\n{{context}}"
+doc_to_text: "Trova 10 parole che possono sostituire la parola racchiusa tra i marcatori <head> nella seguente frase: '{{context}}', mantenendo lo stesso significato. Elenca i lemmi (forme base) di queste parole, separandoli con una virgola, ad esempio: lemma1, lemma2, lemma3, lemma4, lemma5. Non aggiungere commenti o altro testo. Risposta:"
--- a/lm_eval/tasks/evalita_llm/_evalita-mp_ls_p2.yaml
+++ b/lm_eval/tasks/evalita_llm/_evalita-mp_ls_p2.yaml
+tag: evalita-mp_ls_tasks
+include: _ls_template_yaml
+task: evalita-mp_ls_prompt-2
+task_alias: prompt-2
+#doc_to_text: "Sostituisci la parola tra i tag <head> con sinonimi appropriati per il contesto. Separa i sinonimi con virgole. Testo:\n{{context}}"
+doc_to_text: "Devi risolvere un compito di sostituzione lessicale. Trova 10 parole che possono sostituire la parola racchiusa tra i marcatori <head> nella seguente frase: '{{context}}', mantenendo lo stesso significato. Elenca i lemmi (forme base) di queste parole, separandoli con una virgola, ad esempio: lemma1, lemma2, lemma3, lemma4, lemma5. Non aggiungere commenti o altro testo. Risposta:"
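Both lexical-substitution prompts ask for a comma-separated list of lemmas with no extra text. The commit message mentions postprocessing of the model output, but that code is not part of this excerpt, so the following is only a rough sketch of how such an answer could be normalized. The function name `parse_lemmas` and the cleanup rules (lowercasing, dropping duplicates, capping at 10 items) are assumptions.

```python
def parse_lemmas(output, limit=10):
    """Split a comma-separated answer into unique, lowercased lemmas."""
    lemmas = []
    for piece in output.split(","):
        # Trim surrounding whitespace and a trailing period, then lowercase.
        lemma = piece.strip().strip(".").lower()
        if lemma and lemma not in lemmas:
            lemmas.append(lemma)
    return lemmas[:limit]


# Made-up model answer.
print(parse_lemmas(" casa, Abitazione, casa, dimora."))
```

Keeping the original order matters if the scorer treats earlier candidates as the model's preferred substitutes.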
--- a/lm_eval/tasks/evalita_llm/_evalita-mp_ls_task.yaml
+++ b/lm_eval/tasks/evalita_llm/_evalita-mp_ls_task.yaml
+group: evalita-mp_ls
+group_alias: lexical-substitution
+task:
+- evalita-mp_ls_tasks
+aggregate_metric_list:
+  - metric: f1
+    weight_by_size: True
+
+metadata:
+  version: 1
--- a/lm_eval/tasks/evalita_llm/_evalita-mp_mc.yaml
+++ b/lm_eval/tasks/evalita_llm/_evalita-mp_mc.yaml
+group: evalita-mp_mc
+group_alias: Evalita-LLM - PPL-based
+task:
+  - evalita-mp_te
+  - evalita-mp_sa
+  - evalita-mp_wic
+  - evalita-mp_hs
+  - evalita-mp_at
+  - evalita-mp_faq
+aggregate_metric_list:
+  - metric: acc
+    weight_by_size: True
+metadata:
+  version: 1
--- a/lm_eval/tasks/evalita_llm/_evalita-mp_ner-adg_group.yaml
+++ b/lm_eval/tasks/evalita_llm/_evalita-mp_ner-adg_group.yaml
+group: evalita-mp_ner_adg_group
+group_alias: 'evalita NER: ADG'
+task:
+  - evalita-mp_ner-v2_tasks_adg
+aggregate_metric_list:
+  - metric: f1
+    weight_by_size: True
+metadata:
+  version: 1
--- a/lm_eval/tasks/evalita_llm/_evalita-mp_ner-adg_group_p1.yaml
+++ b/lm_eval/tasks/evalita_llm/_evalita-mp_ner-adg_group_p1.yaml
+include: _ner_template_yaml
+dataset_name: adg
+test_split: reduced_test
+fewshot_split: dev
+task_alias: prompt-1
+tag: evalita-mp_ner-v2_tasks_adg
+task: evalita-mp_ner-v2_adg_p1
+
+# English
+#doc_to_text: "Given the following text, write the entity mentions in the text, indicating their type: [PER] (person), [LOC] (location), [ORG] (organization). Respond with the following format: Entity$Type. Separate each entity-type pair with the '%' character. Text: {{text}}"
+# Italian
+doc_to_text: "Dato il seguente testo, scrivi le menzioni di entità nel testo, indicandone il tipo: PER (persona), LOC (luogo), ORG (organizzazione). Rispondi con il seguente formato: Entità$Tipo%Entità$Tipo. Separa ogni coppia entità-tipo con il carattere '%' ad esempio:  Entità_2$Tipo%Entità_2$Tipo. In caso non ci siano entita' rispondi '&&NOENT&&'.
+Testo: {{text}}"
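The NER prompt asks the model to answer as `Entità$Tipo` pairs joined by `%`, or with `&&NOENT&&` when the text contains no entities. A response in that format could be parsed roughly as shown below. This is a sketch, not the benchmark's actual utility code: the function name `parse_entities` and the decision to silently drop malformed pairs are assumptions.

```python
NO_ENTITY = "&&NOENT&&"
VALID_TYPES = {"PER", "LOC", "ORG"}


def parse_entities(output):
    """Parse 'Mention$TYPE%Mention$TYPE' into a list of (mention, type) tuples."""
    text = output.strip()
    if not text or text == NO_ENTITY:
        return []
    pairs = []
    for chunk in text.split("%"):
        # Split on the last '$' so that a '$' inside the mention is preserved.
        mention, sep, entity_type = chunk.rpartition("$")
        mention = mention.strip()
        entity_type = entity_type.strip().upper()
        if sep and mention and entity_type in VALID_TYPES:
            pairs.append((mention, entity_type))
    return pairs


# Made-up model answers.
print(parse_entities("Mario Rossi$PER%Roma$LOC"))
print(parse_entities("&&NOENT&&"))
```

Pairs with a missing separator or an unknown type are skipped here so that a partially malformed answer can still earn credit for its valid pairs.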
--- a/lm_eval/tasks/evalita_llm/_evalita-mp_ner-adg_group_p2.yaml
+++ b/lm_eval/tasks/evalita_llm/_evalita-mp_ner-adg_group_p2.yaml
+include: _ner_template_yaml
+dataset_name: adg
+test_split: reduced_test
+fewshot_split: dev
+task_alias: prompt-2
+tag: evalita-mp_ner-v2_tasks_adg
+task: evalita-mp_ner-v2_adg_p2
+
+# English
+#doc_to_text: "Given the following text, write the entity mentions in the text, indicating their type: [PER] (person), [LOC] (location), [ORG] (organization). Respond with the following format: Entity$Type. Separate each entity-type pair with the '%' character. Text: {{text}}"
+# Italian
+doc_to_text: "Dato il seguente testo, scrivi le menzioni di entità nel testo, indicandone il tipo: PER (persona), LOC (luogo), ORG (organizzazione). Rispondi con il seguente formato: Entità$Tipo%Entità$Tipo. Separa ogni coppia entità-tipo con il carattere '%' ad esempio:  Entità_2$Tipo%Entità_2$Tipo. In caso non ci siano entita' rispondi '&&NOENT&&'.
+Testo: {{text}}"
--- a/lm_eval/tasks/evalita_llm/_evalita-mp_ner-fic_group.yaml
+++ b/lm_eval/tasks/evalita_llm/_evalita-mp_ner-fic_group.yaml
+group: evalita-mp_ner_fic_group
+group_alias: 'evalita NER: FIC'
+task:
+  - evalita-mp_ner-v2_tasks_fic
+aggregate_metric_list:
+  - metric: f1
+    weight_by_size: True
+metadata:
+  version: 1
--- a/lm_eval/tasks/evalita_llm/_evalita-mp_ner-fic_group_p1.yaml
+++ b/lm_eval/tasks/evalita_llm/_evalita-mp_ner-fic_group_p1.yaml
+include: _ner_template_yaml
+dataset_name: fic
+test_split: reduced_test
+#test_split: dev
+fewshot_split: dev
+task_alias: prompt-1
+tag: evalita-mp_ner-v2_tasks_fic
+task: evalita-mp_ner-v2_fic_p1
+
+#doc_to_target: !function utils.filter_per_entities_from_lines
+doc_to_target: entities
+
+# English
+#doc_to_text: "Given the following text, write the entity mentions in the text, indicating their type: [PER] (person), [LOC] (location), [ORG] (organization). Respond with the following format: Entity$Type. Separate each entity-type pair with the '%' character. Text: {{text}}"
+# Italian
+doc_to_text: "Dato il seguente testo, scrivi le menzioni di entità nel testo, indicandone il tipo: PER (persona), LOC (luogo), ORG (organizzazione). Rispondi con il seguente formato: Entità$Tipo%Entità$Tipo. Separa ogni coppia entità-tipo con il carattere '%' ad esempio:  Entità_2$Tipo%Entità_2$Tipo. In caso non ci siano entita' rispondi '&&NOENT&&'. Testo: {{text}}"
--- a/lm_eval/tasks/evalita_llm/_evalita-mp_ner-fic_group_p2.yaml
+++ b/lm_eval/tasks/evalita_llm/_evalita-mp_ner-fic_group_p2.yaml
+include: _ner_template_yaml
+dataset_name: fic
+test_split: reduced_test
+fewshot_split: dev
+task_alias: prompt-2
+tag: evalita-mp_ner-v2_tasks_fic
+task: evalita-mp_ner-v2_fic_p2
+
+# English
+#doc_to_text: "Given the following text, write the entity mentions in the text, indicating their type: [PER] (person), [LOC] (location), [ORG] (organization). Respond with the following format: Entity$Type. Separate each entity-type pair with the '%' character. Text: {{text}}"
+# Italian
+doc_to_text: "Dato il seguente testo, scrivi le menzioni di entità nel testo, indicandone il tipo: PER (persona), LOC (luogo), ORG (organizzazione). Rispondi con il seguente formato: Entità$Tipo%Entità$Tipo. Separa ogni coppia entità-tipo con il carattere '%' ad esempio:  Entità_2$Tipo%Entità_2$Tipo. In caso non ci siano entita' rispondi '&&NOENT&&'.
+Testo: {{text}}"
--- a/lm_eval/tasks/evalita_llm/_evalita-mp_ner-wn_group.yaml
+++ b/lm_eval/tasks/evalita_llm/_evalita-mp_ner-wn_group.yaml
+group: evalita-mp_ner_wn_group
+group_alias: 'evalita NER: WN'
+task:
+  - evalita-mp_ner-v2_tasks_wn
+aggregate_metric_list:
+  - metric: f1
+    weight_by_size: True
+metadata:
+  version: 1
--- a/lm_eval/tasks/evalita_llm/_evalita-mp_ner-wn_group_p1.yaml
+++ b/lm_eval/tasks/evalita_llm/_evalita-mp_ner-wn_group_p1.yaml
+include: _ner_template_yaml
+dataset_name: wn
+test_split: reduced_test
+fewshot_split: dev
+task_alias: prompt-1
+tag: evalita-mp_ner-v2_tasks_wn
+task: evalita-mp_ner-v2_wn_p1
+
+# English
+#doc_to_text: "Given the following text, write the entity mentions in the text, indicating their type: [PER] (person), [LOC] (location), [ORG] (organization). Respond with the following format: Entity$Type. Separate each entity-type pair with the '%' character. Text: {{text}}"
+# Italian
+doc_to_text: "Dato il seguente testo, scrivi le menzioni di entità nel testo, indicandone il tipo: PER (persona), LOC (luogo), ORG (organizzazione). Rispondi con il seguente formato: Entità$Tipo%Entità$Tipo. Separa ogni coppia entità-tipo con il carattere '%' ad esempio:  Entità_2$Tipo%Entità_2$Tipo. In caso non ci siano entita' rispondi '&&NOENT&&'.
+Testo: {{text}}"
--- a/lm_eval/tasks/evalita_llm/_evalita-mp_ner-wn_group_p2.yaml
+++ b/lm_eval/tasks/evalita_llm/_evalita-mp_ner-wn_group_p2.yaml
+include: _ner_template_yaml
+dataset_name: wn
+test_split: reduced_test
+fewshot_split: dev
+task_alias: prompt-2
+tag: evalita-mp_ner-v2_tasks_wn
+task: evalita-mp_ner-v2_wn_p2
+
+# English
+#doc_to_text: "Given the following text, write the entity mentions in the text, indicating their type: [PER] (person), [LOC] (location), [ORG] (organization). Respond with the following format: Entity$Type. Separate each entity-type pair with the '%' character. Text: {{text}}"
+# Italian
+doc_to_text: "Dato il seguente testo, scrivi le menzioni di entità nel testo, indicandone il tipo: PER (persona), LOC (luogo), ORG (organizzazione). Rispondi con il seguente formato: Entità$Tipo%Entità$Tipo. Separa ogni coppia entità-tipo con il carattere '%' ad esempio:  Entità_2$Tipo%Entità_2$Tipo. In caso non ci siano entita' rispondi '&&NOENT&&'.
+Testo: {{text}}"
--- a/lm_eval/tasks/evalita_llm/_evalita-mp_ner_adg.yaml
+++ b/lm_eval/tasks/evalita_llm/_evalita-mp_ner_adg.yaml
+group: evalita-mp_ner_tasks_adg
+group_alias: evalita NER adg
+aggregate_metric_list:
+  - metric: f1
+    weight_by_size: True
+metadata:
+  version: 1