Unverified Commit 71f2954b authored by Vladislav Mikhailov, committed by GitHub

Added NorEval, a novel Norwegian benchmark (#2919)

* added noreval

* added a checklist for noreval

* run pre-commit

* changed imports and added short noreval description

* fixed norsumm path

* refactored multi-folder tasks

* refactored multi-folder tasks
# 🇳🇴 NorEval
### Paper
* Title: `NorEval: A Norwegian Language Understanding and Generation Evaluation Benchmark`
* Abstract: [arxiv.org/abs/2504.07749](https://arxiv.org/abs/2504.07749)
* Homepage: [github.com/ltgoslo/noreval](https://github.com/ltgoslo/noreval/tree/main)
![noreval](noreval.jpg)
**Overview of the NorEval design.** 😼 denotes datasets used in [NorBench](https://aclanthology.org/2023.nodalida-1.61/), [NLEBench](https://aclanthology.org/2024.emnlp-main.317/), [ScandEval](https://aclanthology.org/2023.nodalida-1.20/), and [SEB](https://proceedings.neurips.cc/paper_files/paper/2024/file/4746bb91bd073ec7eef930d5775122ba-Paper-Datasets_and_Benchmarks_Track.pdf); 🚀 represents datasets that have not been used in the existing Norwegian benchmarks; and 😎 denotes our novel datasets introduced as part of NorEval. EN=English; BM=Norwegian Bokmål; NN=Norwegian Nynorsk.
🇳🇴 NorEval is a multi-task Norwegian language understanding and generation evaluation benchmark that combines 19 existing peer-reviewed datasets with five datasets created from scratch. NorEval covers nine diverse task categories: sentiment analysis, Norwegian language knowledge, Norwegian-specific & world knowledge, machine reading comprehension, commonsense reasoning, machine translation, text summarization, instruction following, and truthfulness. Our main evaluation principles are:
- 🌐 **Linguistic diversity**: support for both of the official written standards of Norwegian: Bokmål and Nynorsk (the minority variant).
- 📊 **Task diversity**: coverage of tasks that are least addressed for Norwegian. In particular, only three out of 24 NorEval datasets are included in the existing Norwegian benchmarks to date: [NorBench](https://aclanthology.org/2023.nodalida-1.61/), [NLEBench](https://aclanthology.org/2024.emnlp-main.317/), [ScandEval](https://aclanthology.org/2023.nodalida-1.20/), and [SEB](https://proceedings.neurips.cc/paper_files/paper/2024/file/4746bb91bd073ec7eef930d5775122ba-Paper-Datasets_and_Benchmarks_Track.pdf).
- 🧠 **Data quality**: focus only on peer-reviewed, human-created datasets to ensure reliable evaluation in the context of the Norwegian language, culture, and values.
- 📏 **Prompt sensitivity**: evaluation across 100+ human-written prompts to account for prompt sensitivity.
- 👩🏻‍🔬 **Standardized evaluation**: integration of NorEval into LM Evaluation Harness for flexible and reproducible evaluation; a usage sketch follows below.
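
The harness CLI is shown in the evaluation example further down; for programmatic use, here is a minimal sketch with the harness Python API (an illustration, not part of this PR). It runs the five `norbelebele` prompt variants defined in this diff and collects per-prompt accuracy; the checkpoint is the example model used later in this README, and the metric key assumes the lm-eval v0.4 results layout.

```python
# A minimal sketch (not part of this PR): probing prompt sensitivity by
# running each norbelebele prompt variant through the harness Python API.
import lm_eval

per_prompt = {}
for i in range(5):  # norbelebele_p0 ... norbelebele_p4, defined in this PR
    task = f"norbelebele_p{i}"
    out = lm_eval.simple_evaluate(
        model="hf",
        model_args="pretrained=AI-Sweden-Models/Llama-3-8B",  # example model from this README
        tasks=[task],
        num_fewshot=0,
        batch_size="auto",
    )
    # "acc,none" is the metric key format used by lm-eval v0.4 results dicts.
    per_prompt[task] = out["results"][task]["acc,none"]

for task, acc in sorted(per_prompt.items()):
    print(f"{task}: acc={acc:.4f}")
```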
### Tasks
|Name |Bokmål | Nynorsk |*k*-shot | Task type | Task category |
|:---|:---|:---|:---|:---|:---|
|[NoReC Sentence](https://huggingface.co/datasets/ltg/norec_sentence) | `norec_sentence` | ❌ | ✅ | Text classification | Sentiment analysis |
|[NoReC Document](https://huggingface.co/datasets/ltg/norec_document) | `norec_document` | ❌ | ✅ | Text classification | Sentiment analysis |
|[NCB](https://huggingface.co/datasets/hcfa/ncb) | `ncb` | ❌ | ❌ | Sentence ranking | Norwegian language knowledge |
|[NorIdiom](https://huggingface.co/datasets/Sprakbanken/Norwegian_idioms) | `noridiom_nob` | `noridiom_nno` | ❌ | Sentence completion | Norwegian language knowledge |
|[Belebele](https://huggingface.co/datasets/facebook/belebele) | `norbelebele` | ❌ | ❌ | Multiple-choice question answering | Machine reading comprehension |
|[NRK-Quiz-QA](https://huggingface.co/datasets/ltg/nrk_quiz_qa) | `nrk_quiz_qa_nob` | `nrk_quiz_qa_nno` | ❌ | Multiple-choice question answering | Norwegian-specific & world knowledge |
|[NorOpenBookQA](https://huggingface.co/datasets/ltg/noropenbookqa) | `noropenbookqa_nob` | `noropenbookqa_nno` | ✅ | Multiple-choice question answering | Norwegian-specific & world knowledge |
|[NorCommonsenseQA](https://huggingface.co/datasets/ltg/norcommonsenseqa) | `norcommonsenseqa_nob` | `norcommonsenseqa_nno` | ❌ | Multiple-choice question answering | Commonsense reasoning |
|[NorTruthfulQA Multiple choice](https://huggingface.co/datasets/ltg/nortruthfulqa_mc) | `nortruthfulqa_mc_nob` | `nortruthfulqa_mc_nno` | ❌ | Multiple-choice question answering | Truthfulness |
|[NorQuAD](https://huggingface.co/datasets/ltg/norquad) | `norquad` | ❌ | ✅ | Generative question answering | Machine reading comprehension |
|[NorTruthfulQA Generation](https://huggingface.co/datasets/ltg/nortruthfulqa_gen) | `nortruthfulqa_gen_nob` | `nortruthfulqa_gen_nno` | ❌ | Generative question answering | Truthfulness |
|[ASK-GEC](https://huggingface.co/datasets/ltg/ask-gec) | `ask_gec` | ❌ | ✅ | Sequence-to-sequence generation | Norwegian language knowledge |
|[NorSumm](https://huggingface.co/datasets/SamiaT/NorSumm) | `norsumm_nob` | `norsumm_nno` | ✅ | Sequence-to-sequence generation | Text summarization |
|[Tatoeba (English → Bokmål/Nynorsk)](https://huggingface.co/datasets/Helsinki-NLP/tatoeba_mt) | `tatoeba_eng_nob` | `tatoeba_eng_nno` | ✅ | Sequence-to-sequence generation | Machine translation |
|[Tatoeba (Bokmål/Nynorsk → English)](https://huggingface.co/datasets/Helsinki-NLP/tatoeba_mt) | `tatoeba_nob_eng` | `tatoeba_nno_eng` | ✅ | Sequence-to-sequence generation | Machine translation |
|[NorRewrite-Instruct](https://huggingface.co/datasets/ltg/norrewrite-instruct) | `norrewrite_instruct` | ❌ | ❌ | Sequence-to-sequence generation | Instruction following |
|[NorSummarize-Instruct](https://huggingface.co/datasets/ltg/norsummarize-instruct) | `norsummarize_instruct` | ❌ | ❌ | Sequence-to-sequence generation | Instruction following |
<details open>
<summary><b>Table description</b></summary>

* **Name**: the dataset name, linked to its HuggingFace page.
* **Bokmål**: the LM Evaluation Harness task name for the Norwegian Bokmål dataset.
* **Nynorsk**: the LM Evaluation Harness task name for the Norwegian Nynorsk dataset, if available.
* **k-shot**: support for *k*-shot evaluation regimes with *k* > 0. We follow the original datasets' design and focus mainly on zero-shot evaluation by default.
  * ✅ means the evaluation can be run in both the zero-shot and *k*-shot regimes.
  * ❌ means only zero-shot evaluation is available, because there is no training or validation set to sample demonstration examples from. Technically, *k*-shot evaluation on the test set is possible using sampling without replacement, provided the model is not proprietary and not accessed via an API.
* **Task type**: the task type.
* **Task category**: the task category.

</details>
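
To cross-check the task names in the table against what is actually registered in the harness, you can query the task registry. A small sketch, assuming a recent lm-eval release where `TaskManager` is available:

```python
# A sketch: list the registered NorEval task variants by name prefix.
from lm_eval.tasks import TaskManager

tm = TaskManager()
prefixes = ("norbelebele", "norquad", "ask_gec", "nrk_quiz_qa")
for name in sorted(t for t in tm.all_tasks if t.startswith(prefixes)):
    print(name)
```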
##### Comments on Belebele
Belebele for Norwegian Bokmål is already available in LM Evaluation Harness as `belebele_nob_Latn`. However, our version (`norbelebele`) supports five prompt templates written by native Norwegian speakers, which differ from the default Belebele prompt template.
### Citation
```
@article{mikhailov2025noreval,
  title={NorEval: A Norwegian Language Understanding and Generation Evaluation Benchmark},
  author={Mikhailov, Vladislav and Enstad, Tita and Samuel, David and Farseth{\aa}s, Hans Christian and Kutuzov, Andrey and Velldal, Erik and {\O}vrelid, Lilja},
  journal={arXiv preprint arXiv:2504.07749},
  year={2025}
}
```
### Checklist
* [x] Is the task an existing benchmark in the literature?
  * [x] Have you referenced the original paper that introduced the task?
  * [ ] If yes, does the original paper provide a reference implementation?
    * [ ] Yes, original implementation contributed by author of the benchmark

If other tasks on this dataset are already supported:
* [x] Is the "Main" variant of this task clearly denoted?
* [x] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
### Evaluation example
Here, we use the `--predict_only` argument to generate the predictions and then compute the performance metric as described below.
**Step 1: Generate the predictions**
```bash
lm_eval \
    --model hf \
    --model_args pretrained=AI-Sweden-Models/Llama-3-8B \
    --tasks ask_gec \
    --output_path results/ask_gec/0-shot/ \
    --log_samples \
    --show_config \
    --write_out \
    --predict_only \
    --batch_size auto \
    --num_fewshot 0
```
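
Before scoring, it can help to sanity-check what `--predict_only` wrote to disk. A sketch of inspecting the samples file (the path is the one used in Step 2, and the field layout matches what `errant.py` below consumes):

```python
# A sketch: peek at the predictions produced in Step 1.
import pandas as pd

samples = pd.read_json(
    "results/ask_gec/0-shot/AI-Sweden-Models__Llama-3-8B/"
    "samples_ask_gec_p0_2025-01-28T01-08-13.454441.jsonl",
    lines=True,
)
row = samples.iloc[0]
print(row["doc"]["source"])      # original learner text
print(row["doc"]["correction"])  # gold correction
print(row["resps"][0][0])        # model prediction
```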
**Step 2: Evaluate the predictions with ERRANT**
* Please refer to the installation instructions [here](https://github.com/chrisjbryant/errant/tree/main).
* Run the following:
```bash
python3 ask_gec/errant.py --fpath results/ask_gec/0-shot/AI-Sweden-Models__Llama-3-8B/samples_ask_gec_p0_2025-01-28T01-08-13.454441.jsonl --out_fdir results/ask_gec/0-shot/AI-Sweden-Models__Llama-3-8B/
```
* The results will be saved as `results/ask_gec/0-shot/AI-Sweden-Models__Llama-3-8B/samples_ask_gec_p0_2025-01-28T01-08-13.454441_errant.json`.
The shared base configuration for ASK-GEC (`_ask_gec_yaml`, pulled in by the per-prompt task files via `include`):

```yaml
tag: ask_gec
dataset_path: ltg/ask-gec
output_type: generate_until
training_split: train
validation_split: validation
test_split: test
doc_to_target: correction
generation_kwargs:
  until:
    - "\n"
  do_sample: false
  num_beams: 1
  max_new_tokens: 256
metadata:
  version: 1.0
```

The five prompt variants are separate task files that differ only in `doc_to_text`:

```yaml
task: ask_gec_p0
doc_to_text: "Tekst: {{source}}\nKorreksjon:"
include: _ask_gec_yaml
```

```yaml
task: ask_gec_p1
doc_to_text: "Tekst: {{source}}\nRettet versjon:"
include: _ask_gec_yaml
```

```yaml
task: ask_gec_p2
doc_to_text: "Skriv om følgende tekst slik at den blir grammatisk korrekt: {{source}}\nKorreksjon:"
include: _ask_gec_yaml
```

```yaml
task: ask_gec_p3
doc_to_text: "Original versjon: {{source}}\nKorrekturlest og rettet versjon:"
include: _ask_gec_yaml
```

```yaml
task: ask_gec_p4
doc_to_text: "Rett opp grammatiske feil i denne teksten: {{source}}\nKorreksjon:"
include: _ask_gec_yaml
```
The ERRANT evaluation script (`ask_gec/errant.py`):

```python
import argparse
import json
import os
import subprocess

import pandas as pd


def parse_args():
    """
    Parses arguments.

    Returns:
        Arguments containing the path of the prediction file and the directory for saving the evaluation results.
    """
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--fpath",
        type=str,
        help="path to a model output file in the lm-evaluation-harness format.",
    )
    parser.add_argument(
        "--out_fdir",
        type=str,
        help="path to an output directory for saving the results.",
    )
    args = parser.parse_args()
    return args


def read_examples(fpath: str):
    """
    Reads examples from the prediction file.

    Args:
        fpath: A path to the prediction file.

    Returns:
        Lists of the sources, targets, and predictions.
    """
    examples = pd.read_json(fpath, lines=True)
    sources, targets, predictions = [], [], []
    for i, example in examples.iterrows():
        sources.append(example["doc"]["source"])
        targets.append(example["doc"]["correction"])
        predictions.append(example["resps"][0][0].replace("\n\n", "\n"))
    return sources, targets, predictions


def save_results(fpath: str, obj: dict):
    """
    Saves the evaluation results.

    Args:
        fpath: A path for the output file for saving the results.
        obj: The evaluation results.
    """
    with open(fpath, "w+", encoding="utf-8") as out:
        json.dump(obj, out, indent=3)


def evaluate(fpath: str, out_fpath: str):
    """
    Runs the evaluation based on the ERRANT performance metric.

    Args:
        fpath: A path to the prediction file.
        out_fpath: A path for the output file for saving the results.
    """
    tmp_name = fpath.replace(".jsonl", "").replace("/", "-")
    os.makedirs("tmp", exist_ok=True)
    sources, targets, predictions = read_examples(fpath=fpath)
    with open(f"tmp/{tmp_name}_sources.txt", "w+") as f:
        f.write("\n".join(sources))
    with open(f"tmp/{tmp_name}_targets.txt", "w+") as f:
        f.write("\n".join(targets))
    with open(f"tmp/{tmp_name}_predictions.txt", "w+") as f:
        f.write("\n".join(predictions))
    # Align the sources with the gold corrections and with the predictions,
    # producing the M2 files consumed by errant_compare.
    subprocess.run(
        f"errant_parallel -orig tmp/{tmp_name}_sources.txt -cor tmp/{tmp_name}_targets.txt -out tmp/{tmp_name}_targets.m2 -lev -tok",
        shell=True,
    )
    subprocess.run(
        f"errant_parallel -orig tmp/{tmp_name}_sources.txt -cor tmp/{tmp_name}_predictions.txt -out tmp/{tmp_name}_predictions.m2 -lev -tok",
        shell=True,
    )
    output = subprocess.check_output(
        f"errant_compare -ref tmp/{tmp_name}_targets.m2 -hyp tmp/{tmp_name}_predictions.m2",
        shell=True,
    )
    # The span-level F0.5 score sits in the last column of the penultimate
    # line of errant_compare's output.
    f_05 = float(output.decode().strip().split("\n")[-2].split()[-1].strip())
    print(f"Prediction fpath: {fpath}\n\nERRANT: {f_05}", flush=True)
    print(f"Saving to: {out_fpath}", flush=True)
    save_results(obj={"errant": f_05}, fpath=out_fpath)
    subprocess.run(f"rm tmp/{tmp_name}_*", shell=True)


def main():
    args = parse_args()
    fpath = args.fpath
    print(f"Out: {args.out_fdir}", flush=True)
    # The results JSON is written next to the input predictions file.
    out_fpath = fpath.replace(".jsonl", "_errant.json")
    evaluate(fpath=fpath, out_fpath=out_fpath)


if __name__ == "__main__":
    print(
        "\nWARNING: make sure you have ERRANT installed to run the evaluation! Available here: https://github.com/chrisjbryant/errant\n\n",
        flush=True,
    )
    main()
```
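
Since each prompt variant produces its own `*_errant.json`, prompt sensitivity for `ask_gec` can be summarized by aggregating the five scores. A sketch (not part of the benchmark code; paths follow the layout used above):

```python
# A sketch: aggregate ERRANT F0.5 across the ask_gec prompt variants.
import glob
import json
import statistics

scores = []
for fpath in sorted(glob.glob("results/ask_gec/0-shot/*/samples_ask_gec_p*_errant.json")):
    with open(fpath, encoding="utf-8") as f:
        scores.append(json.load(f)["errant"])

print(f"n={len(scores)}  mean F0.5={statistics.mean(scores):.4f}")
if len(scores) > 1:
    print(f"std F0.5={statistics.stdev(scores):.4f}")
```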
The `ncb` task config:

```yaml
task: ncb
dataset_path: hcfa/ncb
output_type: multiple_choice
test_split: train
doc_to_text: ""
doc_to_target: 0
doc_to_choice: "{{[correct, wrong]}}"
num_fewshot: 0
metric_list:
  - metric: acc
    higher_is_better: true
metadata:
  version: 1.0
```
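
Because `doc_to_text` is empty and `doc_to_choice` is `[correct, wrong]`, the harness effectively asks which of the two full sentences the model assigns the higher log-likelihood. An illustrative sketch of that comparison with plain `transformers` (the checkpoint and the sentence pair are placeholders, not from the dataset):

```python
# Illustration of the ncb sentence-ranking setup: compare the total
# log-likelihood of the correct and the corrupted sentence.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "AI-Sweden-Models/Llama-3-8B"  # placeholder checkpoint
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)
model.eval()


def sentence_logprob(text: str) -> float:
    """Sum of log-probabilities of all tokens in `text` (no prompt)."""
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    logprobs = torch.log_softmax(logits[0, :-1], dim=-1)
    targets = ids[0, 1:]
    return logprobs.gather(1, targets.unsqueeze(1)).sum().item()


correct = "Hun har bodd i Oslo i mange år."        # made-up example pair
wrong = "Hun har bodd i Oslo i mange år siden."
print("hit" if sentence_logprob(correct) > sentence_logprob(wrong) else "miss")
```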
The shared base configuration for Belebele (`_norbelebele_yaml`):

```yaml
tag: norbelebele
dataset_path: facebook/belebele
dataset_name: nob_Latn
test_split: test
fewshot_split: test
fewshot_config:
  sampler: first_n
output_type: multiple_choice
doc_to_target: "{{['1', '2', '3', '4'].index(correct_answer_num)}}"
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
  - metric: acc_norm
    aggregation: mean
    higher_is_better: true
metadata:
  version: 1.0
```

The five prompt variants; note that `p0`, `p2`, and `p4` score the letter options A–D, while `p1` and `p3` score the full answer strings:

```yaml
task: norbelebele_p0
include: _norbelebele_yaml
doc_to_text: "Tekst: {{flores_passage}}\nSpørsmål: {{question}}\nA: {{mc_answer1}}\nB: {{mc_answer2}}\nC: {{mc_answer3}}\nD: {{mc_answer4}}\nSvar:"
doc_to_choice: ["A", "B", "C", "D"]
```

```yaml
task: norbelebele_p1
include: _norbelebele_yaml
doc_to_text: "Bakgrunn: {{flores_passage}}\nSpørsmål:{{question}}\nSvaralternativer:\n- {{mc_answer1}}\n- {{mc_answer2}}\n- {{mc_answer3}}\n- {{mc_answer4}}\nRiktig svar:"
doc_to_choice: "{{[mc_answer1, mc_answer2, mc_answer3, mc_answer4]}}"
```

```yaml
task: norbelebele_p2
include: _norbelebele_yaml
doc_to_text: "{{question}}\nHvilket av følgende mulige svar er det riktige?\nA: {{mc_answer1}}\nB: {{mc_answer2}}\nC: {{mc_answer3}}\nD: {{mc_answer4}}\nSvar:"
doc_to_choice: ["A", "B", "C", "D"]
```

```yaml
task: norbelebele_p3
include: _norbelebele_yaml
doc_to_text: "Svar følgende spørsmål: {{question}}\nSvaret skal baseres følgende tekst:\n{{flores_passage}}\nVelg et svar fra denne listen:\n {{mc_answer1}}\n {{mc_answer2}},\n {{mc_answer3}}\n {{mc_answer4}}"
doc_to_choice: "{{[mc_answer1, mc_answer2, mc_answer3, mc_answer4]}}"
target_delimiter: "\n"
```

```yaml
task: norbelebele_p4
include: _norbelebele_yaml
doc_to_text: "{{flores_passage}}\n\n{{question}}\n\nA: {{mc_answer1}}\nB: {{mc_answer2}}\nC: {{mc_answer3}}\nD: {{mc_answer4}}\n\nEr det riktige svaret A, B, C, eller D?"
doc_to_choice: ["A", "B", "C", "D"]
```
The shared base configuration for NorCommonsenseQA (`_norcommonsenseqa_yaml`):

```yaml
dataset_path: ltg/norcommonsenseqa
output_type: multiple_choice
training_split: null
validation_split: null
test_split: train
doc_to_target: "{{choices.label.index(answer)}}"
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
  - metric: acc_norm
    aggregation: mean
    higher_is_better: true
metadata:
  version: 1.0
```

The Nynorsk `p0` task file, which lives in a subfolder and pulls in the base config via a relative `include` path (the multi-folder layout mentioned in the commit log):

```yaml
tag: norcommonsenseqa_nno
dataset_name: nn
task: norcommonsenseqa_nno_p0
include: ../_norcommonsenseqa_yaml
doc_to_text: "Spørsmål: {{question}}\n\nSvar:"
doc_to_choice: "{{choices.text}}"
```