Commit 527a4352 authored by Baber

Merge branch 'main' into longcxt

# Conflicts:
#	lm_eval/tasks/README.md
parents 6042f622 52df63b7
group: bigbench_multiple_choice
tag: bigbench_multiple_choice_a
dataset_path: hails/bigbench
dataset_kwargs:
# num_shots: 0 # TODO: num of shots for `bigbench` HF dataset should be controlled through this, not through the typical methods
......
group: bigbench_multiple_choice
tag: bigbench_multiple_choice_b
dataset_path: hails/bigbench
dataset_kwargs:
# num_shots: 0 # TODO: num of shots for `bigbench` HF dataset should be controlled through this, not through the typical methods
......
......
@@ -9,5 +9,7 @@ should_decontaminate: true
doc_to_decontamination_query: "{{sentence_good}} {{sentence_bad}}"
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
metadata:
  version: 1.0
# Evalita-LLM
### Paper
Evalita-LLM is a new benchmark designed to evaluate Large Language Models (LLMs) on Italian tasks.
The distinguishing and innovative features of Evalita-LLM are the following: (i) all tasks are native Italian, avoiding translation issues and potential cultural biases; (ii) in addition to well-established multiple-choice tasks, the benchmark includes generative tasks, enabling more natural interaction with LLMs; (iii) all tasks are evaluated against multiple prompts, thereby mitigating model sensitivity to specific prompts and allowing a fairer and more objective evaluation.
### Citation
```bibtex
@misc{magnini2025evalitallmbenchmarkinglargelanguage,
title={Evalita-LLM: Benchmarking Large Language Models on Italian},
author={Bernardo Magnini and Roberto Zanoli and Michele Resta and Martin Cimmino and Paolo Albano and Marco Madeddu and Viviana Patti},
year={2025},
eprint={2502.02289},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2502.02289},
}
```
### Groups
- `evalita-mp`: All tasks (both perplexity-based multiple-choice and generative).
- `evalita-mp_gen`: Generative tasks only.
- `evalita-mp_mc`: Perplexity-based multiple-choice tasks only.
#### Tasks
The following Evalita-LLM tasks can also be evaluated in isolation:
- `evalita-mp_te`: Textual Entailment
- `evalita-mp_sa`: Sentiment Analysis
- `evalita-mp_wic`: Word in Context
- `evalita-mp_hs`: Hate Speech Detection
- `evalita-mp_at`: Admission Tests
- `evalita-mp_faq`: FAQ
- `evalita-mp_sum_fp`: Summarization
- `evalita-mp_ls`: Lexical Substitution
- `evalita-mp_ner_group`: Named Entity Recognition
- `evalita-mp_re`: Relation Extraction
### Usage
```bash
lm_eval --model hf --model_args pretrained=meta-llama/Llama-2-7b-hf --tasks evalita-mp --device cuda:0 --batch_size auto
```
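
A single group or task from the lists above can be selected the same way. For instance, this illustrative invocation evaluates only the Admission Tests group, averaging over its six prompt variants:
```bash
lm_eval --model hf --model_args pretrained=meta-llama/Llama-2-7b-hf --tasks evalita-mp_at --device cuda:0 --batch_size auto
```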
### Checklist
* [x] Is the task an existing benchmark in the literature?
* [x] Have you referenced the original paper that introduced the task?
* [x] If yes, does the original paper provide a reference implementation?
  * [x] Yes, the original implementation was contributed by the authors of the benchmark
If other tasks on this dataset are already supported:
* [x] Is the "Main" variant of this task clearly denoted?
* [x] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [x] Have you noted which, if any, published evaluation setups are matched by this variant?
dataset_path: evalitahf/admission_test
output_type: multiple_choice
test_split: test
fewshot_split: dev
validation_split: test
doc_to_target: Correct
doc_to_choice: ["A", "B", "C", "D", "E"]
metadata:
  version: 1
group: evalita-mp
group_alias: Evalita-LLM
task:
- evalita-mp_te
- evalita-mp_sa
- evalita-mp_wic
- evalita-mp_hs
- evalita-mp_at
- evalita-mp_faq
- evalita-mp_sum_fp
- evalita-mp_ls
- evalita-mp_ner_group
- evalita-mp_re
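# `weight_by_size: True` below is intended to make the group-level `acc` a size-weighted
# average over the sub-tasks listed above, so each sub-task contributes in proportion to
# its number of documents rather than uniformly.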
aggregate_metric_list:
  - metric: acc
    weight_by_size: True
metadata:
  version: 1
tag: evalita-mp_at_tasks
include: _at_template_yaml
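# `include` pulls the shared dataset and split configuration in from _at_template_yaml;
# the `tag` above is what the evalita-mp_at group uses to collect all six prompt variants.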
task: evalita-mp_at_prompt-1
task_alias: prompt-1
#doc_to_text: "Rispondi alla domanda a scelta multipla considerando le informazioni del testo seguente.\nTesto: {{background}}\nDomanda: {{domanda}}\nOpzioni: A: {{A}} B: {{B}} C: {{C}} D: {{D}}"
#doc_to_text: "Dato il seguente caso clinico: '{{background}}' qual è la risposta corretta alla domanda: '{{domanda}}'?"
doc_to_text: "Dato il seguente quesito di medicina: '{{Question}}' qual è la risposta corretta?"
doc_to_choice: "{{[A,B,C,D,E]}}"
doc_to_target: "{{ A if Correct == 'A' else B if Correct == 'B' else C if Correct == 'C' else D if Correct == 'D' else E}}"
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
  - metric: acc_norm
    aggregation: mean
    higher_is_better: true
metadata:
  version: 1
tag: evalita-mp_at_tasks
include: _at_template_yaml
task: evalita-mp_at_prompt-2
task_alias: prompt-2
#doc_to_text: "Rispondi alla domanda a scelta multipla considerando le informazioni del testo seguente.\nTesto: {{background}}\nDomanda: {{domanda}}\nOpzioni: A: {{A}} B: {{B}} C: {{C}} D: {{D}}"
#doc_to_text: "Devi risolvere un compito di risposte a domande. Dato il seguente caso clinico: '{{background}}' qual è la risposta corretta alla domanda: '{{domanda}}'?"
doc_to_text: "Devi risolvere un compito di risposte a domande. Dato il seguente quesito di medicina: '{{Question}}' qual è la risposta corretta?"
doc_to_choice: "{{[A,B,C,D,E]}}"
doc_to_target: "{{ A if Correct == 'A' else B if Correct == 'B' else C if Correct == 'C' else D if Correct == 'D' else E}}"
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
  - metric: acc_norm
    aggregation: mean
    higher_is_better: true
metadata:
  version: 1
tag: evalita-mp_at_tasks
include: _at_template_yaml
task: evalita-mp_at_prompt-3
task_alias: prompt-3
#doc_to_text: "Rispondi alla domanda a scelta multipla considerando le informazioni del testo seguente.\nTesto: {{background}}\nDomanda: {{domanda}}\nOpzioni: A: {{A}} B: {{B}} C: {{C}} D: {{D}}"
#doc_to_text: "Dato il seguente caso clinico: '{{background}}', qual è la risposta corretta alla domanda: '{{domanda}}'?\nA: {{A}}\nB: {{B}}\nC: {{C}}\nD: {{D}}\nE: {{E}}\nRisposta:"
doc_to_text: "Dato il seguente quesito di medicina: '{{Question}}' qual è la risposta corretta?\nA: {{A}}\nB: {{B}}\nC: {{C}}\nD: {{D}}\nE: {{E}}\nRisposta:"
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
  - metric: acc_norm
    aggregation: mean
    higher_is_better: true
metadata:
  version: 1
tag: evalita-mp_at_tasks
include: _at_template_yaml
task: evalita-mp_at_prompt-4
task_alias: prompt-4
#doc_to_text: "Rispondi alla domanda a scelta multipla considerando le informazioni del testo seguente.\nTesto: {{background}}\nDomanda: {{domanda}}\nOpzioni: A: {{A}} B: {{B}} C: {{C}} D: {{D}}"
doc_to_text: "Devi risolvere un compito a scelta multipla. Dato il seguente quesito di medicina: '{{Question}}' qual è la risposta corretta?\nA: {{A}}\nB: {{B}}\nC: {{C}}\nD: {{D}}\nE: {{E}}\nRisposta:"
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
  - metric: acc_norm
    aggregation: mean
    higher_is_better: true
metadata:
  version: 1
tag: evalita-mp_at_tasks
include: _at_template_yaml
task: evalita-mp_at_prompt-5
task_alias: prompt-5
#doc_to_text: "Rispondi alla domanda a scelta multipla considerando le informazioni del testo seguente.\nTesto: {{background}}\nDomanda: {{domanda}}\nOpzioni: A: {{A}} B: {{B}} C: {{C}} D: {{D}}"
#doc_to_text: "Dato il seguente caso clinico: '{{background}}'. La risposta corretta alla domanda: '{{domanda}}' è:"
doc_to_text: "Dato il seguente quesito di medicina '{{Question}}' la risposta corretta è:"
doc_to_choice: "{{[A,B,C,D,E]}}"
doc_to_target: "{{ A if Correct == 'A' else B if Correct == 'B' else C if Correct == 'C' else D if Correct == 'D' else E}}"
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
  - metric: acc_norm
    aggregation: mean
    higher_is_better: true
metadata:
  version: 1
tag: evalita-mp_at_tasks
include: _at_template_yaml
task: evalita-mp_at_prompt-6
task_alias: prompt-6
#doc_to_text: "Rispondi alla domanda a scelta multipla considerando le informazioni del testo seguente.\nTesto: {{background}}\nDomanda: {{domanda}}\nOpzioni: A: {{A}} B: {{B}} C: {{C}} D: {{D}}"
#doc_to_text: "Devi risolvere un compito di risposte a domande. Dato il seguente caso clinico: '{{background}}'. La risposta corretta alla domanda: '{{domanda}}' è:"
doc_to_text: "Devi risolvere un compito di risposte a domande. Dato il seguente quesito di medicina '{{Question}}' la risposta corretta è:"
doc_to_choice: "{{[A,B,C,D,E]}}"
doc_to_target: "{{ A if Correct == 'A' else B if Correct == 'B' else C if Correct == 'C' else D if Correct == 'D' else E}}"
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
  - metric: acc_norm
    aggregation: mean
    higher_is_better: true
metadata:
  version: 1
group: evalita-mp_at
group_alias: admission-test
task:
- evalita-mp_at_tasks # Each of the tasks has to have a matching tag in its own yaml file
aggregate_metric_list:
  - metric: acc
    weight_by_size: True
    aggregation: mean
metadata:
  version: 1.0
tag: evalita-mp_faq_tasks
include: _faq_template_yaml
task: evalita-mp_faq_prompt-1
task_alias: prompt-1
#doc_to_text: "Data la seguente domanda {{question}}, individua la risposta corretta tra le seguenti opzioni:\n A: {{A}}\nB: {{B}}\nC: {{C}}\nD: {{D}} Risposta:"
doc_to_text: "Rispondi alla seguente domanda: '{{question}}'"
doc_to_choice: "{{[A,B,C,D]}}"
doc_to_target: "{{ A if correct_answer == 'A' else B if correct_answer == 'B' else C if correct_answer == 'C' else D}}"
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
metadata:
  version: 1
tag: evalita-mp_faq_tasks
include: _faq_template_yaml
task: evalita-mp_faq_prompt-2
task_alias: prompt-2
#doc_to_text: "Data la seguente domanda {{question}}, individua la risposta corretta tra le seguenti opzioni:\n A: {{A}}\nB: {{B}}\nC: {{C}}\nD: {{D}} Risposta:"
doc_to_text: "Devi risolvere un compito di risposte a domande. Rispondi alla seguente domanda: '{{question}}'"
doc_to_choice: "{{[A,B,C,D]}}"
doc_to_target: "{{ A if correct_answer == 'A' else B if correct_answer == 'B' else C if correct_answer == 'C' else D}}"
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
metadata:
  version: 1
tag: evalita-mp_faq_tasks
include: _faq_template_yaml
task: evalita-mp_faq_prompt-3
task_alias: prompt-3
#doc_to_text: "Data la seguente domanda {{question}}, individua la risposta corretta tra le seguenti opzioni:\n A: {{A}}\nB: {{B}}\nC: {{C}}\nD: {{D}} Risposta:"
doc_to_text: "Rispondi alla seguente domanda: '{{question}}'\nA: {{A}}\nB: {{B}}\nC: {{C}}\nD: {{D}}\nRisposta:"
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
metadata:
  version: 1
tag: evalita-mp_faq_tasks
include: _faq_template_yaml
task: evalita-mp_faq_prompt-4
task_alias: prompt-4
#doc_to_text: "Data la seguente domanda {{question}}, individua la risposta corretta tra le seguenti opzioni:\n A: {{A}}\nB: {{B}}\nC: {{C}}\nD: {{D}} Risposta:"
doc_to_text: "Devi risolvere un compito a scelta multipla. Rispondi alla seguente domanda: '{{question}}'\nA: {{A}}\nB: {{B}}\nC: {{C}}\nD: {{D}}\nRisposta:"
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
metadata:
  version: 1
tag: evalita-mp_faq_tasks
include: _faq_template_yaml
task: evalita-mp_faq_prompt-5
task_alias: prompt-5
#doc_to_text: "Data la seguente domanda {{question}}, individua la risposta corretta tra le seguenti opzioni:\n A: {{A}}\nB: {{B}}\nC: {{C}}\nD: {{D}} Risposta:"
#doc_to_text: "La risposta alla domanda: '{{question}}' è:"
doc_to_text: "La risposta alla domanda: '{{question}}' è:"
doc_to_choice: "{{[A,B,C,D]}}"
doc_to_target: "{{ A if correct_answer == 'A' else B if correct_answer == 'B' else C if correct_answer == 'C' else D }}"
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
metadata:
  version: 1
tag: evalita-mp_faq_tasks
include: _faq_template_yaml
task: evalita-mp_faq_prompt-6
task_alias: prompt-6
#doc_to_text: "Data la seguente domanda {{question}}, individua la risposta corretta tra le seguenti opzioni:\n A: {{A}}\nB: {{B}}\nC: {{C}}\nD: {{D}} Risposta:"
doc_to_text: "Devi risolvere un compito di risposte a domande. La risposta alla domanda: '{{question}}' è:"
doc_to_choice: "{{[A,B,C,D]}}"
doc_to_target: "{{ A if correct_answer == 'A' else B if correct_answer == 'B' else C if correct_answer == 'C' else D }}"
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
metadata:
  version: 1
group: evalita-mp_faq
group_alias: faq
task:
- evalita-mp_faq_tasks # Each of the tasks has to have a matching tag in its own yaml file
aggregate_metric_list:
  - metric: acc
    weight_by_size: True
metadata:
  version: 1