README.md

# Evalita-LLM

### Paper

Evalita-LLM, a new benchmark designed to evaluate Large Language
Models (LLMs) on Italian tasks. The distinguishing and innovative features of
Evalita-LLM are the following: (i) all tasks are native Italian, avoiding issues of
translating from Italian and potential cultural biases; (ii) in addition to well established multiple-choice tasks, the benchmark includes generative tasks, enabling more natural interaction with LLMs; (iii) all tasks are evaluated against multiple prompts, this way mitigating the model sensitivity to specific prompts and allowing a fairer and objective evaluation.

### Citation

```bibtex
@misc{magnini2025evalitallmbenchmarkinglargelanguage,
      title={Evalita-LLM: Benchmarking Large Language Models on Italian},
      author={Bernardo Magnini and Roberto Zanoli and Michele Resta and Martin Cimmino and Paolo Albano and Marco Madeddu and Viviana Patti},
      year={2025},
      eprint={2502.02289},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2502.02289},
}
```

### Groups

- `evalita-mp`: All tasks (perplexity and non-perplexity based).
- `evalita-mp_gen`: Only generative tasks.
- `evalita-mp_mc`: Only perplexity-based tasks.

#### Tasks

The following Evalita-LLM tasks can also be evaluated in isolation:
  - `evalita-mp_te`: Textual Entailment
  - `evalita-mp_sa`: Sentiment Analysis
  - `evalita-mp_wic`: Word in Context
  - `evalita-mp_hs`: Hate Speech Detection
  - `evalita-mp_at`: Admission Tests
  - `evalita-mp_faq`: FAQ
  - `evalita-mp_sum_fp`:  Summarization
  - `evalita-mp_ls`: Lexical Substitution
  - `evalita-mp_ner_group`: Named Entity Recognition
  - `evalita-mp_re`: Relation Extraction


### Usage

```bash

lm_eval --model hf --model_args pretrained=meta-llama/Llama-2-7b-hf --tasks evalita-mp --device cuda:0 --batch_size auto
```

### Checklist

* [x] Is the task an existing benchmark in the literature?
  * [x] Have you referenced the original paper that introduced the task?
  * [x] If yes, does the original paper provide a reference implementation?
    * [x] Yes, original implementation contributed by author of the benchmark

If other tasks on this dataset are already supported:
* [x] Is the "Main" variant of this task clearly denoted?
* [x] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [x] Have you noted which, if any, published evaluation setups are matched by this variant?