Adding the Evalita-LLM benchmark (#2681)

* feat: initial commit with templates for evalita evaluation
* fix: change rule for generate_until
* feat: modified yaml to use reduced version of NER test datasets
* feat: added templates to use reduced dataset for summarization (fanpage and ilpost)
* Add Six Prompts for Each Multiple-Choice Task
* feat: modified fewshot split for textual entailment task
* fix: new doc_to_target function for NER tasks
* Update prompt
* Add partition for few-shot evaluation
* Add partition for few-shot evaluation
* Add partition for few-shot evaluation
* Add partition for few-shot evaluation
* Update prompt
* Add partition for few-shot evaluation
* Rename file: rename from '_evalita-mp_ner_adg_p1 .yaml' to '_evalita-mp_ner_adg_p1.yaml'
* Add partition for few-shot evaluation
* Add partition for few-shot evaluation
* Enhance lexical substitution management
  - Improve scorer calculation for better accuracy
  - Update model output postprocessing for clearer results
  - Add support for few-shot relation extraction task
* Add F1 macro measure for the document dating task
* Add F1-macro measure to evaluate document dating
* Use the whole dataset
* Small changes
* Add the two prompts for the task of lexical substitution
* Add few-shot split configuration
* Add few-shot split configuration
* Add function for handling few-shot learning setup
* Fix prompt
* Remove configuration file
* Update dataset from test_same to test_cross for evaluations
* Remove whitespace at end of prompt
* Fix configuration error: corrected parameter name for the dataset used in few-shot
* Fix: check that results is not empty before processing in lexical substitution task
* Added the prompts and functions for correct NER and RE execution
* Add accuracy measure
* Add tasks for the EVALITA-LLM benchmark evaluation
* Small changes: add the alias of the task name that will be printed in the final results table
* Updated the prompts to reflect changes made to the extended dataset for the Admission Test task
* chore: cleaned templates before PR; feat: add configuration to run generation/ppl tasks
* fix: add information on Evalita-LLM for PR
* fix: rename folders and files
* fix: remove unused imports
* chore: run pre-commit
* chore: add task description

---------

Co-authored-by: rzanoli <zanoli@fbk.eu>
Co-authored-by: Marco Madeddu <marco.madeddu.bra@gmail.com>

Commit b7fccef5, authored Feb 11, 2025 by Michele Resta; committed by GitHub Feb 11, 2025
20 changed files
--- a/lm_eval/tasks/evalita_llm/_evalita-mp_ner_adg_p1.yaml
+++ b/lm_eval/tasks/evalita_llm/_evalita-mp_ner_adg_p1.yaml
+include: _ner_template_yaml
+dataset_name: adg
+test_split: reduced_test
+fewshot_split: trial
+task_alias: ADG prompt-1
+tag: evalita-mp_ner_tasks_adg
+task: evalita-mp_ner_adg_p1
+#p1
+doc_to_text: "Estrai tutte le entità di tipo PER (persona), LOC (luogo) e ORG (organizzazione) dal testo seguente. Riporta ogni entità con il formato: Entità$Tipo, separando ciascuna coppia con ','. Se non ci sono entità da estrarre, rispondi con '&&NOENT&&'.
+Testo: '{{text}}'
+Entità:"
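
The NER prompt above asks the model for `Entità$Tipo` pairs separated by `,`, or the sentinel `&&NOENT&&` when nothing is found. As a rough illustration of what that output contract implies for post-processing, here is a minimal Python sketch. The function name `parse_entities` is hypothetical and this is not the benchmark's own `doc_to_target` or scoring code, which is not shown in this part of the diff; the handling of malformed chunks is an assumption.

```python
# Hypothetical helper: NOT the benchmark's own post-processing, only an
# illustration of the output format requested by the prompt.
NO_ENTITIES = "&&NOENT&&"          # sentinel taken from the prompt text
VALID_TYPES = {"PER", "LOC", "ORG"}  # entity types named in the prompt


def parse_entities(output: str) -> list[tuple[str, str]]:
    """Turn 'Entity$Type, Entity$Type' into a list of (entity, type) pairs."""
    text = output.strip()
    if not text or text == NO_ENTITIES:
        return []
    pairs = []
    for chunk in text.split(","):
        # Split on the last '$' so an entity containing '$' is kept whole.
        entity, sep, label = chunk.strip().rpartition("$")
        entity, label = entity.strip(), label.strip().upper()
        # Assumption: chunks without a '$' or with an unknown type are dropped.
        if sep and entity and label in VALID_TYPES:
            pairs.append((entity, label))
    return pairs


if __name__ == "__main__":
    print(parse_entities("Mario Rossi$PER, Roma$LOC"))
    print(parse_entities("&&NOENT&&"))
```

One consequence of the comma separator is that an entity whose surface form itself contains a comma cannot be recovered by a simple split; a real scorer would need to decide how to treat that case.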
--- a/lm_eval/tasks/evalita_llm/_evalita-mp_ner_adg_p2.yaml
+++ b/lm_eval/tasks/evalita_llm/_evalita-mp_ner_adg_p2.yaml
+include: _ner_template_yaml
+dataset_name: adg
+test_split: reduced_test
+fewshot_split: trial
+task_alias: ADG prompt-2
+tag: evalita-mp_ner_tasks_adg
+task: evalita-mp_ner_adg_p2
+#p8
+doc_to_text: "Devi svolgere un compito di riconoscimento delle entità nei testi. Estrai tutte le entità di tipo PER (persona), LOC (luogo) e ORG (organizzazione) dal testo seguente. Riporta ogni entità con il formato: Entità$Tipo, separando ciascuna coppia con ','. Se non ci sono entità da estrarre, rispondi con '&&NOENT&&'.
+Testo: '{{text}}'
+Entità:"
--- a/lm_eval/tasks/evalita_llm/_evalita-mp_ner_fic.yaml
+++ b/lm_eval/tasks/evalita_llm/_evalita-mp_ner_fic.yaml
+group: evalita-mp_ner_tasks_fic
+group_alias: evalita NER fic
+task_alias: NER fic
--- a/lm_eval/tasks/evalita_llm/_evalita-mp_ner_fic_p1.yaml
+++ b/lm_eval/tasks/evalita_llm/_evalita-mp_ner_fic_p1.yaml
+include: _ner_template_yaml
+dataset_name: fic
+test_split: reduced_test
+fewshot_split: trial
+task_alias: FIC prompt-1
+tag: evalita-mp_ner_tasks_fic
+task: evalita-mp_ner_fic_p1
+#p1
+doc_to_text: "Estrai tutte le entità di tipo PER (persona), LOC (luogo) e ORG (organizzazione) dal testo seguente. Riporta ogni entità con il formato: Entità$Tipo, separando ciascuna coppia con ','. Se non ci sono entità da estrarre, rispondi con '&&NOENT&&'.
+Testo: '{{text}}'
+Entità:"
--- a/lm_eval/tasks/evalita_llm/_evalita-mp_ner_fic_p2.yaml
+++ b/lm_eval/tasks/evalita_llm/_evalita-mp_ner_fic_p2.yaml
+include: _ner_template_yaml
+dataset_name: fic
+test_split: reduced_test
+fewshot_split: trial
+task_alias: FIC prompt-2
+tag: evalita-mp_ner_tasks_fic
+task: evalita-mp_ner_fic_p2
+#p8
+doc_to_text: "Devi svolgere un compito di riconoscimento delle entità nei testi. Estrai tutte le entità di tipo PER (persona), LOC (luogo) e ORG (organizzazione) dal testo seguente. Riporta ogni entità con il formato: Entità$Tipo, separando ciascuna coppia con ','. Se non ci sono entità da estrarre, rispondi con '&&NOENT&&'.
+Testo: '{{text}}'
+Entità:"
--- a/lm_eval/tasks/evalita_llm/_evalita-mp_ner_group.yaml
+++ b/lm_eval/tasks/evalita_llm/_evalita-mp_ner_group.yaml
+group: evalita-mp_ner_group
+group_alias: evalita NER
+task:
+  - evalita-mp_ner_tasks_fic
+  - evalita-mp_ner_tasks_adg
+  - evalita-mp_ner_tasks_wn
+aggregate_metric_list:
+  - metric: f1
+    weight_by_size: True
+metadata:
+  version: 1
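
The group config above aggregates `f1` across subtasks with `weight_by_size: True`. Assuming this means each subtask score is weighted by its number of evaluated examples (micro-style averaging rather than a plain mean over subtasks), the arithmetic can be sketched as follows. The function `weighted_mean` and the example numbers are illustrative only and are not taken from the harness or from any real run.

```python
# Illustrative arithmetic only; the harness computes the aggregate itself.
def weighted_mean(scores: list[float], sizes: list[int]) -> float:
    """Average subtask scores, weighting each by its number of examples."""
    if len(scores) != len(sizes):
        raise ValueError("scores and sizes must have the same length")
    total = sum(sizes)
    if total == 0:
        raise ValueError("at least one subtask must have examples")
    return sum(score * size for score, size in zip(scores, sizes)) / total


if __name__ == "__main__":
    # Made-up F1 values and subtask sizes, purely for illustration.
    f1_by_subtask = [0.5, 0.8]
    examples_by_subtask = [100, 300]
    print(weighted_mean(f1_by_subtask, examples_by_subtask))
```

With size weighting, a large subtask dominates the group score; an unweighted mean would instead treat every subtask equally regardless of how many examples it has.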
--- a/lm_eval/tasks/evalita_llm/_evalita-mp_ner_wn.yaml
+++ b/lm_eval/tasks/evalita_llm/_evalita-mp_ner_wn.yaml
+group: evalita-mp_ner_tasks_wn
+group_alias: evalita NER wn
+aggregate_metric_list:
+  - metric: f1
+    weight_by_size: True
+metadata:
+  version: 1
--- a/lm_eval/tasks/evalita_llm/_evalita-mp_ner_wn_p1.yaml
+++ b/lm_eval/tasks/evalita_llm/_evalita-mp_ner_wn_p1.yaml
+include: _ner_template_yaml
+dataset_name: wn
+test_split: reduced_test
+fewshot_split: trial
+task_alias: WN prompt-1
+tag: evalita-mp_ner_tasks_wn
+task: evalita-mp_ner_wn_p1
+doc_to_text: "Estrai tutte le entità di tipo PER (persona), LOC (luogo) e ORG (organizzazione) dal testo seguente. Riporta ogni entità con il formato: Entità$Tipo, separando ciascuna coppia con ','. Se non ci sono entità da estrarre, rispondi con '&&NOENT&&'.
+Testo: '{{text}}'
+Entità:"
--- a/lm_eval/tasks/evalita_llm/_evalita-mp_ner_wn_p2.yaml
+++ b/lm_eval/tasks/evalita_llm/_evalita-mp_ner_wn_p2.yaml
+include: _ner_template_yaml
+dataset_name: wn
+test_split: reduced_test
+fewshot_split: trial
+task_alias: WN prompt-2
+tag: evalita-mp_ner_tasks_wn
+task: evalita-mp_ner_wn_p2
+doc_to_text: "Devi svolgere un compito di riconoscimento delle entità nei testi. Estrai tutte le entità di tipo PER (persona), LOC (luogo) e ORG (organizzazione) dal testo seguente. Riporta ogni entità con il formato: Entità$Tipo, separando ciascuna coppia con ','. Se non ci sono entità da estrarre, rispondi con '&&NOENT&&'.
+Testo: '{{text}}'
+Entità:"
--- a/lm_eval/tasks/evalita_llm/_evalita-mp_re_p1.yaml
+++ b/lm_eval/tasks/evalita_llm/_evalita-mp_re_p1.yaml
+tag: evalita-mp_re_tasks
+include: _re_template_yaml
+task: evalita-mp_re_prompt-1
+fewshot_split: dev
+task_alias: prompt-1
+#p4
+doc_to_text: "Dato un documento medico devi estrarre tutte le misurazioni degli esami medici presenti. Riporta ogni relazione nel formato: misurazione$esame, separando ciascuna coppia con '%'. Se non ci sono relazioni da estrarre, rispondi con '&&NOREL&&'.
+Testo: '{{text}}'
+Relazioni:"
--- a/lm_eval/tasks/evalita_llm/_evalita-mp_re_p2.yaml
+++ b/lm_eval/tasks/evalita_llm/_evalita-mp_re_p2.yaml
+tag: evalita-mp_re_tasks
+include: _re_template_yaml
+fewshot_split: dev
+task: evalita-mp_re_prompt-2
+task_alias: prompt-2
+#p5
+doc_to_text: "Devi svolgere un compito di estrazione di relazioni da documenti medici. Dato un documento medico devi estrarre tutte le misurazioni degli esami medici presenti. Riporta ogni relazione nel formato: misurazione$esame, separando ciascuna coppia con '%'. Se non ci sono relazioni da estrarre, rispondi con '&&NOREL&&'.
+Testo: '{{text}}'
+Relazioni:"
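
The two relation-extraction prompts above request `misurazione$esame` pairs separated by `%`, or `&&NOREL&&` when there is nothing to extract. The sketch below shows one way such output could be parsed and compared with gold relations using a set-based F1. It is an assumption-laden illustration: `parse_relations` and `relation_f1` are hypothetical names, exact string matching is assumed, and the benchmark's actual scorer (referenced through `_re_template_yaml`) is not shown in this part of the diff.

```python
# Hypothetical helpers illustrating the requested output format; they are
# not the benchmark's scorer.
NO_RELATIONS = "&&NOREL&&"  # sentinel taken from the prompt text


def parse_relations(output: str) -> set[tuple[str, str]]:
    """Turn 'measurement$exam%measurement$exam' into a set of pairs."""
    text = output.strip()
    if not text or text == NO_RELATIONS:
        return set()
    relations = set()
    for chunk in text.split("%"):
        measurement, sep, exam = chunk.partition("$")
        measurement, exam = measurement.strip(), exam.strip()
        # Assumption: chunks missing either side of '$' are ignored.
        if sep and measurement and exam:
            relations.add((measurement, exam))
    return relations


def relation_f1(predicted: set, gold: set) -> float:
    """Exact-match F1 between predicted and gold relation sets."""
    if not predicted and not gold:
        return 1.0  # assumption: an empty prediction for an empty gold set is correct
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    pred = parse_relations("120/80 mmHg$pressione%72 bpm$frequenza cardiaca")
    gold = {("120/80 mmHg", "pressione"), ("37 C", "temperatura")}
    print(relation_f1(pred, gold))
```

Because `%` is the separator, a measurement that is itself a percentage would be split in two by this naive parser; how the real post-processing handles that case is not visible here.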
--- a/lm_eval/tasks/evalita_llm/_evalita-mp_re_task.yaml
+++ b/lm_eval/tasks/evalita_llm/_evalita-mp_re_task.yaml
+group: evalita-mp_re
+group_alias: relation-extraction
+task:
+- evalita-mp_re_tasks
+aggregate_metric_list:
+  - metric: f1
+    weight_by_size: True
+metadata:
+  version: 1
--- a/lm_eval/tasks/evalita_llm/_evalita-mp_sa_p1.yaml
+++ b/lm_eval/tasks/evalita_llm/_evalita-mp_sa_p1.yaml
+tag: evalita-mp_sa_tasks
+include: _sa_template_yaml
+task: evalita-mp_sa_prompt-1
+task_alias: prompt-1
+#doc_to_text: "Opinione: '{{text}}' Determinare la sentiment dell'opinione data. Possibili risposte: A – neutrale B – negativo C – positivo D - misto Risposta:"
+doc_to_text: "Qual è il sentiment espresso nel seguente tweet: '{{text}}'?"
+metric_list:
+  - metric: f1
+    higher_is_better: True
+    aggregation: !function metrics._aggreg_sa
--- a/lm_eval/tasks/evalita_llm/_evalita-mp_sa_p2.yaml
+++ b/lm_eval/tasks/evalita_llm/_evalita-mp_sa_p2.yaml
+tag: evalita-mp_sa_tasks
+include: _sa_template_yaml
+task: evalita-mp_sa_prompt-2
+task_alias: prompt-2
+#doc_to_text: "Opinione: '{{text}}' Determinare la sentiment dell'opinione data. Possibili risposte: A – neutrale B – negativo C – positivo D - misto Risposta:"
+doc_to_text: "Devi svolgere un compito di analisi del sentiment. Qual è il sentiment espresso nel seguente tweet: '{{text}}'?"
+metric_list:
+  - metric: f1
+    higher_is_better: True
+    aggregation: !function metrics._aggreg_sa
--- a/lm_eval/tasks/evalita_llm/_evalita-mp_sa_p3.yaml
+++ b/lm_eval/tasks/evalita_llm/_evalita-mp_sa_p3.yaml
+tag: evalita-mp_sa_tasks
+include: _sa_template_yaml
+task: evalita-mp_sa_prompt-3
+task_alias: prompt-3
+#doc_to_text: "Opinione: '{{text}}' Determinare la sentiment dell'opinione data. Possibili risposte: A – neutrale B – negativo C – positivo D - misto Risposta:"
+doc_to_choice: ["A", "B", "C", "D"]
+doc_to_text: "Qual è il sentiment espresso nel seguente tweet: '{{text}}'?\nA: Positivo\nB: Negativo\nC: Neutro\nD: Misto\nRisposta:"
+metric_list:
+  - metric: f1
+    higher_is_better: True
+    aggregation: !function metrics._aggreg_sa
--- a/lm_eval/tasks/evalita_llm/_evalita-mp_sa_p4.yaml
+++ b/lm_eval/tasks/evalita_llm/_evalita-mp_sa_p4.yaml
+tag: evalita-mp_sa_tasks
+include: _sa_template_yaml
+task: evalita-mp_sa_prompt-4
+task_alias: prompt-4
+#doc_to_text: "Opinione: '{{text}}' Determinare la sentiment dell'opinione data. Possibili risposte: A – neutrale B – negativo C – positivo D - misto Risposta:"
+doc_to_choice: ["A", "B", "C", "D"]
+doc_to_text: "Devi svolgere un compito di analisi del sentiment. Qual è il sentiment espresso nel seguente tweet: '{{text}}'?\nA: Positivo\nB: Negativo\nC: Neutro\nD: Misto\nRisposta:"
+metric_list:
+  - metric: f1
+    higher_is_better: True
+    aggregation: !function metrics._aggreg_sa
--- a/lm_eval/tasks/evalita_llm/_evalita-mp_sa_p5.yaml
+++ b/lm_eval/tasks/evalita_llm/_evalita-mp_sa_p5.yaml
+tag: evalita-mp_sa_tasks
+include: _sa_template_yaml
+task: evalita-mp_sa_prompt-5
+task_alias: prompt-5
+#doc_to_text: "Opinione: '{{text}}' Determinare la sentiment dell'opinione data. Possibili risposte: A – neutrale B – negativo C – positivo D - misto Risposta:"
+#doc_to_choice: ["A", "B", "C", "D"]
+doc_to_text: "Il seguente tweet: '{{text}}' esprime un sentiment"
+metric_list:
+  - metric: f1
+    higher_is_better: True
+    aggregation: !function metrics._aggreg_sa
--- a/lm_eval/tasks/evalita_llm/_evalita-mp_sa_p6.yaml
+++ b/lm_eval/tasks/evalita_llm/_evalita-mp_sa_p6.yaml
+tag: evalita-mp_sa_tasks
+include: _sa_template_yaml
+task: evalita-mp_sa_prompt-6
+task_alias: prompt-6
+#doc_to_text: "Opinione: '{{text}}' Determinare la sentiment dell'opinione data. Possibili risposte: A – neutrale B – negativo C – positivo D - misto Risposta:"
+#doc_to_choice: ["A", "B", "C", "D"]
+doc_to_text: "Devi svolgere un compito di analisi del sentiment. Il seguente tweet: '{{text}}' esprime un sentiment"
+metric_list:
+  - metric: f1
+    higher_is_better: True
+    aggregation: !function metrics._aggreg_sa
--- a/lm_eval/tasks/evalita_llm/_evalita-mp_sa_tasks.yaml
+++ b/lm_eval/tasks/evalita_llm/_evalita-mp_sa_tasks.yaml
+group: evalita-mp_sa
+group_alias: sentiment-analysis
+task:
+  - evalita-mp_sa_tasks # Each of the tasks has to have a matching tag in its own yaml file
+aggregate_metric_list:
+  - metric: f1
+    weight_by_size: True
+metadata:
+  version: 1
--- a/lm_eval/tasks/evalita_llm/_evalita-mp_sum_fp-small_p1.yaml
+++ b/lm_eval/tasks/evalita_llm/_evalita-mp_sum_fp-small_p1.yaml
+tag: evalita-mp_sum_fp-small_tasks
+include: _sum_template_fp-small_yaml
+task: evalita-sp_sum_task_fp-small_p1
+task_alias: prompt-1
+#doc_to_text: >
+#  "Crea un sommario del seguente testo. Testo: {{source}}\nSommario: "
+doc_to_text: "Riassumi il seguente articolo di giornale: '{{source}}'\nRiassunto:"
+process_results: !function utils.process_results_sum
+metric_list:
+  - metric: rouge1
+    higher_is_better: true
+    aggregation: mean