Unverified Commit b7fccef5 authored by Michele Resta, committed by GitHub

Adding the Evalita-LLM benchmark (#2681)



* feat: initial commit with templates for evalita evaluation

* fix: change rule for generate_until

* feat: modified yaml to use reduced version of NER test datasets

* feat: added templates to use reduced dataset for summarization (fanpage and ilpost)

* Add Six Prompts for Each Multiple-Choice Task

* feat: modified fewshot split for textual entailment task

* fix: new doc_to_target function for NER tasks

* Update prompt

* Add partition for few-shot evaluation

* Add partition for few-shot evaluation

* Add partition for few-shot evaluation

* Add partition for few-shot evaluation

* Update prompt

* Add partition for few-shot evaluation

* Rename file

Rename file from _evalita-mp_ner_adg_p1 .yaml to _evalita-mp_ner_adg_p1.yaml

* Add partition for few-shot evaluation

* Add partition for few-shot evaluation

* Enhance lexical substitution management

- Improve scorer calculation for better accuracy
- Update model output postprocessing for clearer results
- Add support for few-shot relation extraction task

* Add F1 macro measure for the document dating task

* Add F1-macro measure to evaluate document dating

* Use the whole dataset

* Small changes

* Add the two prompts for the task of lexical substitution

* Add few-shot split configuration

* Add few-shot split configuration

* Add function for handling few-shot learning setup

* Fix prompt

* Remove configuration file

* Update dataset from test_same to test_cross for evaluations

* Remove whitespace at end of prompt

* Fix configuration error: corrected parameter name for the dataset used in few-shot

* Fix: Check if results is not empty before processing in lexical substitution task

* added the prompts and functions for correct NER and RE execution

* Add accuracy measure

* Add tasks for the EVALITA-LLM benchmark evaluation

* Small changes

Add the alias of the task name that will be printed in the final results table.

* Updated the prompts to reflect changes made to the extended dataset for the Admission Test task

* chore: cleaned templates before PR; feat: add configuration to run generation/ppl tasks.

* fix: add information on Evalita-LLM for PR

* fix: rename folders and files

* fix: remove unused imports

* chore: run pre-commit

* chore: add task description

---------
Co-authored-by: rzanoli <zanoli@fbk.eu>
Co-authored-by: Marco Madeddu <marco.madeddu.bra@gmail.com>
parent a40fe42a
@@ -42,6 +42,7 @@
| [eus_proficiency](eus_proficiency/README.md) | Tasks designed to test proficiency in the Basque language across various topics. | Basque |
| [eus_reading](eus_reading/README.md) | Reading comprehension tasks specifically designed for the Basque language. | Basque |
| [eus_trivia](eus_trivia/README.md) | Trivia and knowledge testing tasks in the Basque language. | Basque |
| [evalita-LLM](evalita-LLM/README.md) | A native Italian benchmark with diverse task formats and multiple prompts. | Italian |
| [fda](fda/README.md) | Tasks for extracting key-value pairs from FDA documents to test information extraction. | English |
| [fld](fld/README.md) | Tasks involving free-form and directed dialogue understanding. | English |
| [french_bench](french_bench/README.md) | Set of tasks designed to assess language model performance in French. | French |
# Evalita-LLM
### Paper
Evalita-LLM is a new benchmark designed to evaluate Large Language Models (LLMs) on Italian tasks. Its distinguishing and innovative features are the following: (i) all tasks are native Italian, avoiding translation issues and potential cultural biases; (ii) in addition to well-established multiple-choice tasks, the benchmark includes generative tasks, enabling more natural interaction with LLMs; (iii) all tasks are evaluated against multiple prompts, mitigating model sensitivity to specific prompts and allowing a fairer, more objective evaluation.
### Citation
```bibtex
@misc{magnini2025evalitallmbenchmarkinglargelanguage,
title={Evalita-LLM: Benchmarking Large Language Models on Italian},
author={Bernardo Magnini and Roberto Zanoli and Michele Resta and Martin Cimmino and Paolo Albano and Marco Madeddu and Viviana Patti},
year={2025},
eprint={2502.02289},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2502.02289},
}
```
### Groups
- `evalita-mp`: All tasks (perplexity and non-perplexity based).
- `evalita-mp_gen`: Only generative tasks.
- `evalita-mp_mc`: Only multiple-choice (perplexity-based) tasks.
#### Tasks
The following Evalita-LLM tasks can also be evaluated in isolation:
- `evalita-mp_te`: Textual Entailment
- `evalita-mp_sa`: Sentiment Analysis
- `evalita-mp_wic`: Word in Context
- `evalita-mp_hs`: Hate Speech Detection
- `evalita-mp_at`: Admission Tests
- `evalita-mp_faq`: FAQ
- `evalita-mp_sum_fp`: Summarization
- `evalita-mp_ls`: Lexical Substitution
- `evalita-mp_ner_group`: Named Entity Recognition
- `evalita-mp_re`: Relation Extraction
### Usage
```bash
lm_eval --model hf --model_args pretrained=meta-llama/Llama-2-7b-hf --tasks evalita-mp --device cuda:0 --batch_size auto
```
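
The harness can also be driven from Python. The snippet below is a minimal, illustrative sketch assuming the harness's `lm_eval.simple_evaluate` entry point; the checkpoint, device, and batch size are placeholders to adapt to your setup.

```python
import lm_eval

# Minimal sketch: run a single Evalita-LLM task through the Python API.
# Model, device, and batch size are placeholders; adjust to your hardware.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=meta-llama/Llama-2-7b-hf",
    tasks=["evalita-mp_te"],  # or "evalita-mp" for the full benchmark
    device="cuda:0",
    batch_size="auto",
)
print(results["results"])
```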
### Checklist
* [x] Is the task an existing benchmark in the literature?
* [x] Have you referenced the original paper that introduced the task?
* [x] If yes, does the original paper provide a reference implementation?
* [x] Yes, original implementation contributed by author of the benchmark
If other tasks on this dataset are already supported:
* [x] Is the "Main" variant of this task clearly denoted?
* [x] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [x] Have you noted which, if any, published evaluation setups are matched by this variant?
dataset_path: evalitahf/admission_test
output_type: multiple_choice
test_split: test
fewshot_split: dev
validation_split: test
doc_to_target: Correct
doc_to_choice: ["A", "B", "C", "D", "E"]
metadata:
  version: 1
group: evalita-mp
group_alias: Evalita-LLM
task:
- evalita-mp_te
- evalita-mp_sa
- evalita-mp_wic
- evalita-mp_hs
- evalita-mp_at
- evalita-mp_faq
- evalita-mp_sum_fp
- evalita-mp_ls
- evalita-mp_ner_group
- evalita-mp_re
aggregate_metric_list:
  - metric: acc
    weight_by_size: True
metadata:
  version: 1
tag: evalita-mp_at_tasks
include: _at_template_yaml
task: evalita-mp_at_prompt-1
task_alias: prompt-1
#doc_to_text: "Rispondi alla domanda a scelta multipla considerando le informazioni del testo seguente.\nTesto: {{background}}\nDomanda: {{domanda}}\nOpzioni: A: {{A}} B: {{B}} C: {{C}} D: {{D}}"
#doc_to_text: "Dato il seguente caso clinico: '{{background}}' qual è la risposta corretta alla domanda: '{{domanda}}'?"
doc_to_text: "Dato il seguente quesito di medicina: '{{Question}}' qual è la risposta corretta?"
doc_to_choice: "{{[A,B,C,D,E]}}"
doc_to_target: "{{ A if Correct == 'A' else B if Correct == 'B' else C if Correct == 'C' else D if Correct == 'D' else E}}"
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
  - metric: acc_norm
    aggregation: mean
    higher_is_better: true
metadata:
  version: 1
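
The `doc_to_text`, `doc_to_choice`, and `doc_to_target` fields above are Jinja templates rendered against each dataset record. The sketch below uses a hypothetical Admission Test record with the field names from this config (`Question`, `A`-`E`, `Correct`) to show how the target template resolves to the text of the correct option.

```python
from jinja2 import Template

# Hypothetical Admission Test record; field names match the config above,
# but the content is invented for illustration.
doc = {
    "Question": "Quale organo produce l'insulina?",
    "A": "Il fegato", "B": "Il pancreas", "C": "La milza",
    "D": "Il rene", "E": "Il polmone",
    "Correct": "B",
}

doc_to_text = Template(
    "Dato il seguente quesito di medicina: '{{Question}}' qual è la risposta corretta?"
)
doc_to_target = Template(
    "{{ A if Correct == 'A' else B if Correct == 'B' else C if Correct == 'C' "
    "else D if Correct == 'D' else E }}"
)

print(doc_to_text.render(**doc))    # prompt shown to the model
print(doc_to_target.render(**doc))  # "Il pancreas": the gold choice among the five options
```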
tag: evalita-mp_at_tasks
include: _at_template_yaml
task: evalita-mp_at_prompt-2
task_alias: prompt-2
#doc_to_text: "Rispondi alla domanda a scelta multipla considerando le informazioni del testo seguente.\nTesto: {{background}}\nDomanda: {{domanda}}\nOpzioni: A: {{A}} B: {{B}} C: {{C}} D: {{D}}"
#doc_to_text: "Devi risolvere un compito di risposte a domande. Dato il seguente caso clinico: '{{background}}' qual è la risposta corretta alla domanda: '{{domanda}}'?"
doc_to_text: "Devi risolvere un compito di risposte a domande. Dato il seguente quesito di medicina: '{{Question}}' qual è la risposta corretta?"
doc_to_choice: "{{[A,B,C,D,E]}}"
doc_to_target: "{{ A if Correct == 'A' else B if Correct == 'B' else C if Correct == 'C' else D if Correct == 'D' else E}}"
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
  - metric: acc_norm
    aggregation: mean
    higher_is_better: true
metadata:
  version: 1
tag: evalita-mp_at_tasks
include: _at_template_yaml
task: evalita-mp_at_prompt-3
task_alias: prompt-3
#doc_to_text: "Rispondi alla domanda a scelta multipla considerando le informazioni del testo seguente.\nTesto: {{background}}\nDomanda: {{domanda}}\nOpzioni: A: {{A}} B: {{B}} C: {{C}} D: {{D}}"
#doc_to_text: "Dato il seguente caso clinico: '{{background}}', qual è la risposta corretta alla domanda: '{{domanda}}'?\nA: {{A}}\nB: {{B}}\nC: {{C}}\nD: {{D}}\nE: {{E}}\nRisposta:"
doc_to_text: "Dato il seguente quesito di medicina: '{{Question}}' qual è la risposta corretta?\nA: {{A}}\nB: {{B}}\nC: {{C}}\nD: {{D}}\nE: {{E}}\nRisposta:"
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
  - metric: acc_norm
    aggregation: mean
    higher_is_better: true
metadata:
  version: 1
tag: evalita-mp_at_tasks
include: _at_template_yaml
task: evalita-mp_at_prompt-4
task_alias: prompt-4
#doc_to_text: "Rispondi alla domanda a scelta multipla considerando le informazioni del testo seguente.\nTesto: {{background}}\nDomanda: {{domanda}}\nOpzioni: A: {{A}} B: {{B}} C: {{C}} D: {{D}}"
doc_to_text: "Devi risolvere un compito a scelta multipla. Dato il seguente quesito di medicina: '{{Question}}' qual è la risposta corretta?\nA: {{A}}\nB: {{B}}\nC: {{C}}\nD: {{D}}\nE: {{E}}\nRisposta:"
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
  - metric: acc_norm
    aggregation: mean
    higher_is_better: true
metadata:
  version: 1
tag: evalita-mp_at_tasks
include: _at_template_yaml
task: evalita-mp_at_prompt-5
task_alias: prompt-5
#doc_to_text: "Rispondi alla domanda a scelta multipla considerando le informazioni del testo seguente.\nTesto: {{background}}\nDomanda: {{domanda}}\nOpzioni: A: {{A}} B: {{B}} C: {{C}} D: {{D}}"
#doc_to_text: "Dato il seguente caso clinico: '{{background}}'. La risposta corretta alla domanda: '{{domanda}}' è:"
doc_to_text: "Dato il seguente quesito di medicina '{{Question}}' la risposta corretta è:"
doc_to_choice: "{{[A,B,C,D,E]}}"
doc_to_target: "{{ A if Correct == 'A' else B if Correct == 'B' else C if Correct == 'C' else D if Correct == 'D' else E}}"
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
  - metric: acc_norm
    aggregation: mean
    higher_is_better: true
metadata:
  version: 1
tag: evalita-mp_at_tasks
include: _at_template_yaml
task: evalita-mp_at_prompt-6
task_alias: prompt-6
#doc_to_text: "Rispondi alla domanda a scelta multipla considerando le informazioni del testo seguente.\nTesto: {{background}}\nDomanda: {{domanda}}\nOpzioni: A: {{A}} B: {{B}} C: {{C}} D: {{D}}"
#doc_to_text: "Devi risolvere un compito di risposte a domande. Dato il seguente caso clinico: '{{background}}'. La risposta corretta alla domanda: '{{domanda}}' è:"
doc_to_text: "Devi risolvere un compito di risposte a domande. Dato il seguente quesito di medicina '{{Question}}' la risposta corretta è:"
doc_to_choice: "{{[A,B,C,D,E]}}"
doc_to_target: "{{ A if Correct == 'A' else B if Correct == 'B' else C if Correct == 'C' else D if Correct == 'D' else E}}"
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
  - metric: acc_norm
    aggregation: mean
    higher_is_better: true
metadata:
  version: 1
group: evalita-mp_at
group_alias: admission-test
task:
- evalita-mp_at_tasks # Each of the tasks has to have a matching tag in its own yaml file
aggregate_metric_list:
  - metric: acc
    weight_by_size: True
    aggregation: mean
metadata:
  version: 1.0
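
With `weight_by_size: True`, the per-prompt accuracies are combined into the group score weighted by the number of evaluated samples rather than as a plain average. The standalone sketch below, with invented numbers, illustrates that weighting; it is not the harness's internal code.

```python
# Illustration of size-weighted aggregation (not the harness's internal code).
# Accuracies and sample counts are invented numbers.
prompt_results = [
    {"task": "evalita-mp_at_prompt-1", "acc": 0.62, "n_samples": 500},
    {"task": "evalita-mp_at_prompt-2", "acc": 0.58, "n_samples": 500},
    {"task": "evalita-mp_at_prompt-3", "acc": 0.55, "n_samples": 450},
]

total = sum(r["n_samples"] for r in prompt_results)
weighted_acc = sum(r["acc"] * r["n_samples"] for r in prompt_results) / total
print(f"evalita-mp_at acc (weight_by_size=True): {weighted_acc:.3f}")
```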
tag: evalita-mp_faq_tasks
include: _faq_template_yaml
task: evalita-mp_faq_prompt-1
task_alias: prompt-1
#doc_to_text: "Data la seguente domanda {{question}}, individua la risposta corretta tra le seguenti opzioni:\n A: {{A}}\nB: {{B}}\nC: {{C}}\nD: {{D}} Risposta:"
doc_to_text: "Rispondi alla seguente domanda: '{{question}}'"
doc_to_choice: "{{[A,B,C,D]}}"
doc_to_target: "{{ A if correct_answer == 'A' else B if correct_answer == 'B' else C if correct_answer == 'C' else D}}"
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
metadata:
  version: 1
tag: evalita-mp_faq_tasks
include: _faq_template_yaml
task: evalita-mp_faq_prompt-2
task_alias: prompt-2
#doc_to_text: "Data la seguente domanda {{question}}, individua la risposta corretta tra le seguenti opzioni:\n A: {{A}}\nB: {{B}}\nC: {{C}}\nD: {{D}} Risposta:"
doc_to_text: "Devi risolvere un compito di risposte a domande. Rispondi alla seguente domanda: '{{question}}'"
doc_to_choice: "{{[A,B,C,D]}}"
doc_to_target: "{{ A if correct_answer == 'A' else B if correct_answer == 'B' else C if correct_answer == 'C' else D}}"
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
metadata:
  version: 1
tag: evalita-mp_faq_tasks
include: _faq_template_yaml
task: evalita-mp_faq_prompt-3
task_alias: prompt-3
#doc_to_text: "Data la seguente domanda {{question}}, individua la risposta corretta tra le seguenti opzioni:\n A: {{A}}\nB: {{B}}\nC: {{C}}\nD: {{D}} Risposta:"
doc_to_text: "Rispondi alla seguente domanda: '{{question}}'\nA: {{A}}\nB: {{B}}\nC: {{C}}\nD: {{D}}\nRisposta:"
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
metadata:
  version: 1
tag: evalita-mp_faq_tasks
include: _faq_template_yaml
task: evalita-mp_faq_prompt-4
task_alias: prompt-4
#doc_to_text: "Data la seguente domanda {{question}}, individua la risposta corretta tra le seguenti opzioni:\n A: {{A}}\nB: {{B}}\nC: {{C}}\nD: {{D}} Risposta:"
doc_to_text: "Devi risolvere un compito a scelta multipla. Rispondi alla seguente domanda: '{{question}}'\nA: {{A}}\nB: {{B}}\nC: {{C}}\nD: {{D}}\nRisposta:"
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
metadata:
  version: 1
tag: evalita-mp_faq_tasks
include: _faq_template_yaml
task: evalita-mp_faq_prompt-5
task_alias: prompt-5
#doc_to_text: "Data la seguente domanda {{question}}, individua la risposta corretta tra le seguenti opzioni:\n A: {{A}}\nB: {{B}}\nC: {{C}}\nD: {{D}} Risposta:"
#doc_to_text: "La risposta alla domanda: '{{question}}' è:"
doc_to_text: "La risposta alla domanda: '{{question}}' è:"
doc_to_choice: "{{[A,B,C,D]}}"
doc_to_target: "{{ A if correct_answer == 'A' else B if correct_answer == 'B' else C if correct_answer == 'C' else D }}"
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
metadata:
  version: 1
tag: evalita-mp_faq_tasks
include: _faq_template_yaml
task: evalita-mp_faq_prompt-6
task_alias: prompt-6
#doc_to_text: "Data la seguente domanda {{question}}, individua la risposta corretta tra le seguenti opzioni:\n A: {{A}}\nB: {{B}}\nC: {{C}}\nD: {{D}} Risposta:"
doc_to_text: "Devi risolvere un compito di risposte a domande. La risposta alla domanda: '{{question}}' è:"
doc_to_choice: "{{[A,B,C,D]}}"
doc_to_target: "{{ A if correct_answer == 'A' else B if correct_answer == 'B' else C if correct_answer == 'C' else D }}"
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
metadata:
  version: 1
group: evalita-mp_faq
group_alias: faq
task:
- evalita-mp_faq_tasks # Each of the tasks has to have a matching tag in its own yaml file
aggregate_metric_list:
  - metric: acc
    weight_by_size: True
metadata:
  version: 1
group: evalita-mp_gen
group_alias: Evalita-LLM - Generative
task:
- evalita-mp_sum_fp
- evalita-mp_ls
- evalita-mp_ner_group
- evalita-mp_re
aggregate_metric_list:
  - metric: acc
    weight_by_size: True
metadata:
  version: 1
tag: evalita-mp_hs_tasks
include: _hs_template_yaml
task: evalita-mp_hs_prompt-1
task_alias: prompt-1
#doc_to_text: "Dato il seguente testo, rispondi Vero se il testo contiene hate speech, altrimenti rispondi Falso. Testo:\n{{full_text}} Risposta:"
doc_to_text: "C'è incitamento all'odio nel seguente tweet: '{{full_text}}'?"
metric_list:
  - metric: f1
    higher_is_better: true
    average: macro
    aggregation: f1
metadata:
  version: 1.0
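
Unlike the multiple-choice tasks above, the hate-speech prompts are scored with macro-averaged F1, which averages the per-class F1 scores so that the minority class counts as much as the majority class. A minimal illustration of the measure using scikit-learn, with invented labels rather than Evalita data:

```python
from sklearn.metrics import f1_score

# Invented gold/predicted labels for the binary hate-speech task
# (0 = no hate speech, 1 = hate speech); not real Evalita-LLM data.
gold = [1, 0, 0, 1, 0, 1, 0, 0]
pred = [1, 0, 1, 1, 0, 0, 0, 0]

# average="macro" computes F1 per class and takes the unweighted mean,
# so both classes count equally even if the data is imbalanced.
print(f1_score(gold, pred, average="macro"))
```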