Unverified commit b7fccef5 authored by Michele Resta, committed by GitHub

Adding the Evalita-LLM benchmark (#2681)



* feat: initial commit with templates for evalita evaluation

* fix: change rule for generate_until

* feat: modified yaml to use reduced version of NER test datasets

* feat: added templates to use reduced dataset for summarization (fanpage and ilpost)

* Add Six Prompts for Each Multiple-Choice Task

* feat: modified fewshot split for textual entailment task

* fix: new doc_to_target function for NER tasks

* Update prompt

* Add partition for few-shot evaluation

* Add partition for few-shot evaluation

* Add partition for few-shot evaluation

* Add partition for few-shot evaluation

* Update prompt

* Add partition for few-shot evaluation

* Rename file

Rename file from _evalita-mp_ner_adg_p1 .yaml to _evalita-mp_ner_adg_p1.yaml

* Add partition for few-shot evaluation

* Add partition for few-shot evaluation

* Enhance lexical substitution management

- Improve scorer calculation for better accuracy
- Update model output postprocessing for clearer results
- Add support for few-shot relation extraction task

* Add F1 macro measure for the document dating task

* Add F1-macro measure to evaluate document dating

* Use the whole dataset

* Small changes

* Add the two prompts for the task of lexical substitution

* Add few-shot split configuration

* Add few-shot split configuration

* Add function for handling few-shot learning setup

* Fix prompt

* Remove configuration file

* Update dataset from test_same to test_cross for evaluations

* Remove whitespace at end of prompt

* Fix configuration error: corrected parameter name for the dataset used in few-shot

* Fix: Check if results is not empty before processing in lexical substitution task

* added the prompts and functions for correct NER and RE execution

* Add accuracy measure

* Add tasks for the EVALITA-LLM benchmark evaluation

* Small changes

Add the alias of the task name that will be printed in the final results table.

* Updated the prompts to reflect changes made to the extended dataset for the Admission Test task

* chore: cleaned templates before PR; feat: add configuration to run generation/ppl tasks.

* fix: add information on Evalita-LLM for PR

* fix: rename folders and files

* fix: remove unused imports

* chore: run pre-commit

* chore: add task description

---------
Co-authored-by: rzanoli <zanoli@fbk.eu>
Co-authored-by: Marco Madeddu <marco.madeddu.bra@gmail.com>
parent a40fe42a

tag: evalita-mp_sum_fp-small_tasks
include: _sum_template_fp-small_yaml
task: evalita-sp_sum_task_fp-small_p2
task_alias: prompt-2
#doc_to_text: >
# "Crea un sommario del seguente testo. Testo: {{source}}\nSommario: "
doc_to_text: "Devi risolvere un compito di sintesi automatica del testo. Riassumi il seguente articolo di giornale: '{{source}}'\nRiassunto:"
process_results: !function utils.process_results_sum
metric_list:
  - metric: rouge1
    higher_is_better: true
    aggregation: mean

group: evalita-mp_sum_fp
group_alias: summarization-fanpage
task:
  - evalita-mp_sum_fp-small_tasks
aggregate_metric_list:
  - metric: rouge1
    weight_by_size: True
metadata:
  version: 0.0
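The summarization tasks delegate scoring to `utils.process_results_sum` instead of a built-in metric. The PR's actual implementation is not shown on this page; below is a minimal sketch of what such a hook could look like, assuming the reference summary lives in a `target` field (the field name is an assumption) and using the `rouge-score` package:

```python
# Hypothetical sketch of utils.process_results_sum, not the PR's code.
# Assumes doc["target"] holds the reference summary and that the
# rouge-score package is installed (pip install rouge-score).
from rouge_score import rouge_scorer

_scorer = rouge_scorer.RougeScorer(["rouge1"], use_stemmer=False)

def process_results_sum(doc, results):
    """Score one generated summary; the harness passes the model's
    generations for this doc as the list `results`."""
    prediction = results[0].strip()
    reference = doc["target"]  # assumed field name
    scores = _scorer.score(reference, prediction)
    # Keys must match the metric names declared in metric_list above.
    return {"rouge1": scores["rouge1"].fmeasure}
```
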
tag: evalita-mp_sum_fp_tasks
include: _sum_template_fp_yaml
task: evalita-sp_sum_task_fp_p1
task_alias: prompt-1
doc_to_text: "Riassumi il seguente articolo di giornale: '{{source}}'\nRiassunto:"
process_results: !function utils.process_results_sum
metric_list:
  - metric: rouge1
    higher_is_better: true
    aggregation: mean

tag: evalita-mp_sum_fp_tasks
include: _sum_template_fp_yaml
task: evalita-sp_sum_task_fp_p2
task_alias: prompt-2
doc_to_text: "Devi risolvere un compito di sintesi automatica del testo. Riassumi il seguente articolo di giornale: '{{source}}'\nRiassunto:"
process_results: !function utils.process_results_sum
metric_list:
  - metric: rouge1
    higher_is_better: true
    aggregation: mean

group: evalita-mp_sum_fp
group_alias: summarization-fanpage
task:
  - evalita-mp_sum_fp_tasks
aggregate_metric_list:
  - metric: rouge1
    weight_by_size: True
metadata:
  version: 1
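With the group config in place, the benchmark runs like any other harness task. A sketch using the harness's Python entry point; the model identifier is a placeholder, not part of the PR:

```python
# Sketch: evaluating the Fanpage summarization group with the
# lm-evaluation-harness Python API. The pretrained id is a placeholder.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=your-org/your-italian-model",
    tasks=["evalita-mp_sum_fp"],
    num_fewshot=0,
)
print(results["results"])  # per-task and aggregated rouge1
```
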
tag: evalita-mp_te_tasks
include: _te_template_yaml
task: evalita-mp_te_prompt-1
task_alias: prompt-1
#doc_to_text: "Task di Text Entailment. Rispondi Vero o Falso in base alla correttezza dell'ipotesi rispetto al testo.\nTesto:{{text1}}\nIpotesi: {{text2}}\nRisposta:"
doc_to_text: "La frase: '{{text1}}' implica logicamente che la frase: '{{text2}}' sia vera?"
#metric_list:
#  - metric: acc
#    higher_is_better: true

tag: evalita-mp_te_tasks
include: _te_template_yaml
task: evalita-mp_te_prompt-2
task_alias: prompt-2
doc_to_text: "Devi risolvere un compito di inferenza semantica. La frase: '{{text1}}' implica logicamente che la frase: '{{text2}}' sia vera?"

tag: evalita-mp_te_tasks
include: _te_template_yaml
task: evalita-mp_te_prompt-3
task_alias: prompt-3
doc_to_choice: ["A", "B"]
doc_to_text: "La frase: '{{text1}}' implica logicamente che la frase: '{{text2}}' sia vera?\nA: Sì\nB: No\nRisposta:"

tag: evalita-mp_te_tasks
include: _te_template_yaml
task: evalita-mp_te_prompt-4
task_alias: prompt-4
doc_to_choice: ["A", "B"]
doc_to_text: "Devi risolvere un compito di inferenza semantica. La frase: '{{text1}}' implica logicamente che la frase: '{{text2}}' sia vera?\nA: Sì\nB: No\nRisposta:"

tag: evalita-mp_te_tasks
include: _te_template_yaml
task: evalita-mp_te_prompt-5
task_alias: prompt-5
doc_to_choice: ["La frase 1 implica logicamente che la frase 2 sia vera", "La frase 1 non implica logicamente che la frase 2 sia vera"]
doc_to_text: "Frase 1: '{{text1}}' Frase 2: '{{text2}}'"

tag: evalita-mp_te_tasks
include: _te_template_yaml
task: evalita-mp_te_prompt-6
task_alias: prompt-6
doc_to_choice: ["La frase 1 implica logicamente che la frase 2 sia vera", "La frase 1 non implica logicamente che la frase 2 sia vera"]
doc_to_text: "Devi risolvere un compito di inferenza semantica. Frase 1: '{{text1}}' Frase 2: '{{text2}}'"

group: evalita-mp_te
group_alias: text-entailment
task:
  - evalita-mp_te_tasks # this has to match the tag in the task yaml file
aggregate_metric_list:
  - metric: acc
    weight_by_size: True
metadata:
  version: 1
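These entailment variants are scored as multiple choice (prompts 1–2 inherit their `doc_to_choice` from `_te_template_yaml`): the harness computes the loglikelihood of each `doc_to_choice` continuation after the rendered prompt and takes the argmax, with no free-form generation involved. A rough sketch of that selection step; `score_fn` is a hypothetical stand-in for a model call, not the harness's internals:

```python
# Rough sketch of loglikelihood-based answer selection for
# multiple_choice tasks; score_fn is a hypothetical stand-in for the
# model's conditional loglikelihood of a continuation given a prompt.
from typing import Callable, Sequence

def pick_choice(prompt: str,
                choices: Sequence[str],
                score_fn: Callable[[str, str], float]) -> int:
    """Return the index of the highest-loglikelihood continuation."""
    # The harness joins prompt and choice with a space by default
    # (its target_delimiter), mirrored here with f" {choice}".
    scores = [score_fn(prompt, f" {choice}") for choice in choices]
    return max(range(len(choices)), key=scores.__getitem__)
```
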
tag: evalita-mp_wic_tasks
task: evalita-mp_wic_prompt-1
task_alias: prompt-1
include: _wic_template_yaml
doc_to_text: "La parola: '{{sentence1[start1:end1]}}' nella frase: '{{sentence1}}' ha lo stesso significato della parola: '{{sentence2[start2:end2]}}' nella frase: '{{sentence2}}'?"

tag: evalita-mp_wic_tasks
task: evalita-mp_wic_prompt-2
task_alias: prompt-2
include: _wic_template_yaml
doc_to_text: "Devi determinare se una stessa parola usata in due frasi differenti ha lo stesso significato in entrambi i contesti. La parola: '{{sentence1[start1:end1]}}' nella frase: '{{sentence1}}' ha lo stesso significato della parola: '{{sentence2[start2:end2]}}' nella frase: '{{sentence2}}'?"

tag: evalita-mp_wic_tasks
task: evalita-mp_wic_prompt-3
task_alias: prompt-3
include: _wic_template_yaml
doc_to_text: "La parola '{{sentence1[start1:end1]}}' nella frase '{{sentence1}}' ha lo stesso significato della parola '{{sentence2[start2:end2]}}' nella frase '{{sentence2}}'?\nA: Sì\nB: No\nRisposta:"
doc_to_choice: ["B", "A"]

tag: evalita-mp_wic_tasks
task: evalita-mp_wic_prompt-4
task_alias: prompt-4
include: _wic_template_yaml
doc_to_text: "Devi determinare se una stessa parola usata in due frasi differenti ha lo stesso significato in entrambi i contesti. La parola '{{sentence1[start1:end1]}}' nella frase '{{sentence1}}' ha lo stesso significato della parola '{{sentence2[start2:end2]}}' nella frase '{{sentence2}}'?\nA: Sì\nB: No\nRisposta:"
doc_to_choice: ["B", "A"]

tag: evalita-mp_wic_tasks
task: evalita-mp_wic_prompt-5
task_alias: prompt-5
include: _wic_template_yaml
doc_to_text: "La parola: '{{sentence1[start1:end1]}}' nella frase: '{{sentence1}}' e la parola: '{{sentence2[start2:end2]}}' nella frase: '{{sentence2}}'"
doc_to_choice: ["non hanno lo stesso significato", "hanno lo stesso significato"]

tag: evalita-mp_wic_tasks
task: evalita-mp_wic_prompt-6
task_alias: prompt-6
include: _wic_template_yaml
doc_to_text: "Devi determinare se una stessa parola usata in due frasi differenti ha lo stesso significato in entrambi i contesti. La parola: '{{sentence1[start1:end1]}}' nella frase: '{{sentence1}}' e la parola: '{{sentence2[start2:end2]}}' nella frase: '{{sentence2}}'"
doc_to_choice: ["non hanno lo stesso significato", "hanno lo stesso significato"]

group: evalita-mp_wic
group_alias: word-in-context
task:
  - evalita-mp_wic_tasks # this has to match the tag in the task yaml file
aggregate_metric_list:
  - metric: f1
    weight_by_size: True
metadata:
  version: 1
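The WiC templates rely on Jinja2's support for Python-style string slicing, so `{{sentence1[start1:end1]}}` extracts the target word by character offsets. A quick self-contained check; the example sentences and offsets below are made up, while the field names match the config:

```python
# Demonstrates the character-offset slicing used by the WiC templates.
# The example document values are illustrative only.
from jinja2 import Template

doc = {
    "sentence1": "Il calcio è uno sport molto popolare.",
    "start1": 3, "end1": 9,    # "calcio" (the sport)
    "sentence2": "Manca il calcio nella sua dieta.",
    "start2": 9, "end2": 15,   # "calcio" (the mineral)
}
tmpl = Template(
    "La parola: '{{sentence1[start1:end1]}}' nella frase: '{{sentence1}}' "
    "ha lo stesso significato della parola: '{{sentence2[start2:end2]}}' "
    "nella frase: '{{sentence2}}'?"
)
print(tmpl.render(**doc))
```
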
dataset_path: evalitahf/faq
test_split: test_1
fewshot_split: dev_1
doc_to_target: !function utils.faq_doc_to_target
doc_to_choice: ["A", "B", "C", "D"]
output_type: multiple_choice
metadata:
  version: 1
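For a `multiple_choice` task the target must resolve to the index of the gold option within `doc_to_choice`. The PR implements this in `utils.faq_doc_to_target`; a plausible sketch, assuming the dataset stores the gold answer as a letter in a `label` field (both the field name and format are assumptions):

```python
# Hypothetical sketch of utils.faq_doc_to_target, not the PR's code.
# Assumes the gold answer is stored as a letter, e.g. doc["label"] == "C".
def faq_doc_to_target(doc):
    return ["A", "B", "C", "D"].index(doc["label"])
```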