Unverified Commit 71f2954b authored by Vladislav Mikhailov, committed by GitHub

Added NorEval, a novel Norwegian benchmark (#2919)

* added noreval

* added a checklist for noreval

* run pre-commit

* changed imports and added short noreval description

* fixed norsumm path

* refactored multi-folder tasks

* refactored multi-folder tasks
# 🇳🇴 NorEval
### Paper
* Title: `NorEval: A Norwegian Language Understanding and Generation Evaluation Benchmark`
* Abstract: [arxiv.org/abs/2504.07749](https://arxiv.org/abs/2504.07749)
* Homepage: [github.com/ltgoslo/noreval](https://github.com/ltgoslo/noreval/tree/main)
![noreval](noreval.jpg)
**Overview of the NorEval design.** 😼 denotes datasets used in [NorBench](https://aclanthology.org/2023.nodalida-1.61/), [NLEBench](https://aclanthology.org/2024.emnlp-main.317/), [ScandEval](https://aclanthology.org/2023.nodalida-1.20/), and [SEB](https://proceedings.neurips.cc/paper_files/paper/2024/file/4746bb91bd073ec7eef930d5775122ba-Paper-Datasets_and_Benchmarks_Track.pdf); 🚀 represents datasets that have not been used in the existing Norwegian benchmarks; and 😎 denotes our novel datasets introduced as part of NorEval. EN=English; BM=Norwegian Bokmål; NN=Norwegian Nynorsk.
🇳🇴 NorEval is a multi-task Norwegian language understanding and generation evaluation benchmark that combines 19 existing peer-reviewed datasets with five datasets created from scratch. NorEval covers nine diverse task categories: sentiment analysis, Norwegian language knowledge, Norwegian-specific & world knowledge, machine reading comprehension, commonsense reasoning, machine translation, text summarization, instruction following, and truthfulness. Our main evaluation principles are:
- 🌐 **Linguistic diversity**: support for both of the official written standards of Norwegian: Bokmål and Nynorsk (the minority variant).
- 📊 **Task diversity**: coverage of tasks that are least addressed for Norwegian. In particular, only three out of 24 NorEval datasets are included in the existing Norwegian benchmarks to date: [NorBench](https://aclanthology.org/2023.nodalida-1.61/), [NLEBench](https://aclanthology.org/2024.emnlp-main.317/), [ScandEval](https://aclanthology.org/2023.nodalida-1.20/), and [SEB](https://proceedings.neurips.cc/paper_files/paper/2024/file/4746bb91bd073ec7eef930d5775122ba-Paper-Datasets_and_Benchmarks_Track.pdf).
- 🧠 **Data quality**: focus only on peer-reviewed, human-created datasets to ensure reliable evaluation in the context of the Norwegian language, culture, and values.
- 📏 **Prompt sensitivity**: evaluation across 100+ human-written prompts to account for prompt sensitivity.
- 👩🏻‍🔬 **Standardized evaluation**: integration of NorEval into LM Evaluation Harness for flexible and reproducible evaluation; a usage sketch follows below.
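
The harness CLI is shown in the evaluation example further down; for programmatic use, here is a minimal sketch with the harness Python API (an illustration, not part of this PR). It runs the five `norbelebele` prompt variants defined in this diff and collects per-prompt accuracy; the checkpoint is the example model used later in this README, and the metric key assumes the lm-eval v0.4 results layout.

```python
# A minimal sketch (not part of this PR): probing prompt sensitivity by
# running each norbelebele prompt variant through the harness Python API.
import lm_eval

per_prompt = {}
for i in range(5):  # norbelebele_p0 ... norbelebele_p4, defined in this PR
    task = f"norbelebele_p{i}"
    out = lm_eval.simple_evaluate(
        model="hf",
        model_args="pretrained=AI-Sweden-Models/Llama-3-8B",  # example model from this README
        tasks=[task],
        num_fewshot=0,
        batch_size="auto",
    )
    # "acc,none" is the metric key format used by lm-eval v0.4 results dicts.
    per_prompt[task] = out["results"][task]["acc,none"]

for task, acc in sorted(per_prompt.items()):
    print(f"{task}: acc={acc:.4f}")
```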
### Tasks
|Name |Bokmål | Nynorsk |*k*-shot | Task type | Task category |
|:---|:---|:---|:---|:---|:---|
|[NoReC Sentence](https://huggingface.co/datasets/ltg/norec_sentence) | `norec_sentence` | ❌ | ✅ | Text classification | Sentiment analysis |
|[NoReC Document](https://huggingface.co/datasets/ltg/norec_document) | `norec_document` | ❌ | ✅ | Text classification | Sentiment analysis |
|[NCB](https://huggingface.co/datasets/hcfa/ncb) | `ncb` | ❌ | ❌ | Sentence ranking | Norwegian language knowledge |
|[NorIdiom](https://huggingface.co/datasets/Sprakbanken/Norwegian_idioms) | `noridiom_nob` | `noridiom_nno` | ❌ | Sentence completion | Norwegian language knowledge |
|[Belebele](https://huggingface.co/datasets/facebook/belebele) | `norbelebele` | ❌ | ❌ | Multiple-choice question answering | Machine reading comprehension |
|[NRK-Quiz-QA](https://huggingface.co/datasets/ltg/nrk_quiz_qa) | `nrk_quiz_qa_nob` | `nrk_quiz_qa_nno` | ❌ | Multiple-choice question answering | Norwegian-specific & world knowledge |
|[NorOpenBookQA](https://huggingface.co/datasets/ltg/noropenbookqa) | `noropenbookqa_nob` | `noropenbookqa_nno` | ✅ | Multiple-choice question answering | Norwegian-specific & world knowledge |
|[NorCommonsenseQA](https://huggingface.co/datasets/ltg/norcommonsenseqa) | `norcommonsenseqa_nob` | `norcommonsenseqa_nno` | ❌ | Multiple-choice question answering | Commonsense reasoning |
|[NorTruthfulQA Multiple choice](https://huggingface.co/datasets/ltg/nortruthfulqa_mc) | `nortruthfulqa_mc_nob` | `nortruthfulqa_mc_nno` | ❌ | Multiple-choice question answering | Truthfulness |
|[NorQuAD](https://huggingface.co/datasets/ltg/norquad) | `norquad` | ❌ | ✅ | Generative question answering | Machine reading comprehension |
|[NorTruthfulQA Generation](https://huggingface.co/datasets/ltg/nortruthfulqa_gen) | `nortruthfulqa_gen_nob` | `nortruthfulqa_gen_nno` | ❌ | Generative question answering | Truthfulness |
|[ASK-GEC](https://huggingface.co/datasets/ltg/ask-gec) | `ask_gec` | ❌ | ✅ | Sequence-to-sequence generation | Norwegian language knowledge |
|[NorSumm](https://huggingface.co/datasets/SamiaT/NorSumm) | `norsumm_nob` | `norsumm_nno` | ✅ | Sequence-to-sequence generation | Text summarization |
|[Tatoeba (English → Bokmål/Nynorsk)](https://huggingface.co/datasets/Helsinki-NLP/tatoeba_mt) | `tatoeba_eng_nob` | `tatoeba_eng_nno` | ✅ | Sequence-to-sequence generation | Machine translation |
|[Tatoeba (Bokmål/Nynorsk → English)](https://huggingface.co/datasets/Helsinki-NLP/tatoeba_mt) | `tatoeba_nob_eng` | `tatoeba_nno_eng` | ✅ | Sequence-to-sequence generation | Machine translation |
|[NorRewrite-Instruct](https://huggingface.co/datasets/ltg/norrewrite-instruct) | `norrewrite_instruct` | ❌ | ❌ | Sequence-to-sequence generation | Instruction following |
|[NorSummarize-Instruct](https://huggingface.co/datasets/ltg/norsummarize-instruct) | `norsummarize_instruct` | ❌ | ❌ | Sequence-to-sequence generation | Instruction following |
<details open>
<summary><b>Table description</b></summary>

* **Name**: the dataset name, linked to its HuggingFace page.
* **Bokmål**: the LM Evaluation Harness task name for the Norwegian Bokmål dataset.
* **Nynorsk**: the LM Evaluation Harness task name for the Norwegian Nynorsk dataset, if available.
* **k-shot**: support for *k*-shot evaluation regimes with *k* > 0. We follow the original datasets' design and focus mainly on zero-shot evaluation by default.
  * ✅ means the evaluation can be run in both the zero-shot and *k*-shot regimes.
  * ❌ means only zero-shot evaluation is available, because there is no training or validation set to sample demonstration examples from. Technically, *k*-shot evaluation on the test set is possible using sampling without replacement, provided the model is not proprietary and not accessed via an API.
* **Task type**: the task type.
* **Task category**: the task category.

</details>
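
To cross-check the task names in the table against what is actually registered in the harness, you can query the task registry. A small sketch, assuming a recent lm-eval release where `TaskManager` is available:

```python
# A sketch: list the registered NorEval task variants by name prefix.
from lm_eval.tasks import TaskManager

tm = TaskManager()
prefixes = ("norbelebele", "norquad", "ask_gec", "nrk_quiz_qa")
for name in sorted(t for t in tm.all_tasks if t.startswith(prefixes)):
    print(name)
```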
##### Comments on Belebele
Belebele for Norwegian Bokmål is already available in LM Evaluation Harness as `belebele_nob_Latn`. However, our version (`norbelebele`) supports five prompt templates written by native Norwegian speakers, which differ from the default Belebele prompt template.
### Citation
```
@article{mikhailov2025noreval,
  title={NorEval: A Norwegian Language Understanding and Generation Evaluation Benchmark},
  author={Mikhailov, Vladislav and Enstad, Tita and Samuel, David and Farseth{\aa}s, Hans Christian and Kutuzov, Andrey and Velldal, Erik and {\O}vrelid, Lilja},
  journal={arXiv preprint arXiv:2504.07749},
  year={2025}
}
```
### Checklist
* [x] Is the task an existing benchmark in the literature?
  * [x] Have you referenced the original paper that introduced the task?
  * [ ] If yes, does the original paper provide a reference implementation?
    * [ ] Yes, original implementation contributed by author of the benchmark

If other tasks on this dataset are already supported:
* [x] Is the "Main" variant of this task clearly denoted?
* [x] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
### Evaluation example
Here, we use the `--predict_only` argument to generate the predictions and then compute the performance metric as described below.
**Step 1: Generate the predictions**
```bash
lm_eval \
    --model hf \
    --model_args pretrained=AI-Sweden-Models/Llama-3-8B \
    --tasks ask_gec \
    --output_path results/ask_gec/0-shot/ \
    --log_samples \
    --show_config \
    --write_out \
    --predict_only \
    --batch_size auto \
    --num_fewshot 0
```
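
Before scoring, it can help to sanity-check what `--predict_only` wrote to disk. A sketch of inspecting the samples file (the path is the one used in Step 2, and the field layout matches what `errant.py` below consumes):

```python
# A sketch: peek at the predictions produced in Step 1.
import pandas as pd

samples = pd.read_json(
    "results/ask_gec/0-shot/AI-Sweden-Models__Llama-3-8B/"
    "samples_ask_gec_p0_2025-01-28T01-08-13.454441.jsonl",
    lines=True,
)
row = samples.iloc[0]
print(row["doc"]["source"])      # original learner text
print(row["doc"]["correction"])  # gold correction
print(row["resps"][0][0])        # model prediction
```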
**Step 2: Evaluate the predictions with ERRANT**
* Please refer to the installation instructions [here](https://github.com/chrisjbryant/errant/tree/main).
* Run the following:
```bash
python3 ask_gec/errant.py --fpath results/ask_gec/0-shot/AI-Sweden-Models__Llama-3-8B/samples_ask_gec_p0_2025-01-28T01-08-13.454441.jsonl --out_fdir results/ask_gec/0-shot/AI-Sweden-Models__Llama-3-8B/
```
* The results will be saved as `results/ask_gec/0-shot/AI-Sweden-Models__Llama-3-8B/samples_ask_gec_p0_2025-01-28T01-08-13.454441_errant.json`.
The shared base configuration for ASK-GEC (`_ask_gec_yaml`, pulled in by the per-prompt task files via `include`):

```yaml
tag: ask_gec
dataset_path: ltg/ask-gec
output_type: generate_until
training_split: train
validation_split: validation
test_split: test
doc_to_target: correction
generation_kwargs:
  until:
    - "\n"
  do_sample: false
  num_beams: 1
  max_new_tokens: 256
metadata:
  version: 1.0
```

The five prompt variants are separate task files that differ only in `doc_to_text`:

```yaml
task: ask_gec_p0
doc_to_text: "Tekst: {{source}}\nKorreksjon:"
include: _ask_gec_yaml
```

```yaml
task: ask_gec_p1
doc_to_text: "Tekst: {{source}}\nRettet versjon:"
include: _ask_gec_yaml
```

```yaml
task: ask_gec_p2
doc_to_text: "Skriv om følgende tekst slik at den blir grammatisk korrekt: {{source}}\nKorreksjon:"
include: _ask_gec_yaml
```

```yaml
task: ask_gec_p3
doc_to_text: "Original versjon: {{source}}\nKorrekturlest og rettet versjon:"
include: _ask_gec_yaml
```

```yaml
task: ask_gec_p4
doc_to_text: "Rett opp grammatiske feil i denne teksten: {{source}}\nKorreksjon:"
include: _ask_gec_yaml
```
The ERRANT evaluation script (`ask_gec/errant.py`):

```python
import argparse
import json
import os
import subprocess

import pandas as pd


def parse_args():
    """
    Parses arguments.

    Returns:
        Arguments containing the path of the prediction file and the directory for saving the evaluation results.
    """
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--fpath",
        type=str,
        help="path to a model output file in the lm-evaluation-harness format.",
    )
    parser.add_argument(
        "--out_fdir",
        type=str,
        help="path to an output directory for saving the results.",
    )
    args = parser.parse_args()
    return args


def read_examples(fpath: str):
    """
    Reads examples from the prediction file.

    Args:
        fpath: A path to the prediction file.

    Returns:
        Lists of the sources, targets, and predictions.
    """
    examples = pd.read_json(fpath, lines=True)
    sources, targets, predictions = [], [], []
    for i, example in examples.iterrows():
        sources.append(example["doc"]["source"])
        targets.append(example["doc"]["correction"])
        predictions.append(example["resps"][0][0].replace("\n\n", "\n"))
    return sources, targets, predictions


def save_results(fpath: str, obj: dict):
    """
    Saves the evaluation results.

    Args:
        fpath: A path for the output file for saving the results.
        obj: The evaluation results.
    """
    with open(fpath, "w+", encoding="utf-8") as out:
        json.dump(obj, out, indent=3)


def evaluate(fpath: str, out_fpath: str):
    """
    Runs the evaluation based on the ERRANT performance metric.

    Args:
        fpath: A path to the prediction file.
        out_fpath: A path for the output file for saving the results.
    """
    tmp_name = fpath.replace(".jsonl", "").replace("/", "-")
    os.makedirs("tmp", exist_ok=True)
    sources, targets, predictions = read_examples(fpath=fpath)
    with open(f"tmp/{tmp_name}_sources.txt", "w+") as f:
        f.write("\n".join(sources))
    with open(f"tmp/{tmp_name}_targets.txt", "w+") as f:
        f.write("\n".join(targets))
    with open(f"tmp/{tmp_name}_predictions.txt", "w+") as f:
        f.write("\n".join(predictions))
    # Align the sources with the gold corrections and with the predictions,
    # producing the M2 files consumed by errant_compare.
    subprocess.run(
        f"errant_parallel -orig tmp/{tmp_name}_sources.txt -cor tmp/{tmp_name}_targets.txt -out tmp/{tmp_name}_targets.m2 -lev -tok",
        shell=True,
    )
    subprocess.run(
        f"errant_parallel -orig tmp/{tmp_name}_sources.txt -cor tmp/{tmp_name}_predictions.txt -out tmp/{tmp_name}_predictions.m2 -lev -tok",
        shell=True,
    )
    output = subprocess.check_output(
        f"errant_compare -ref tmp/{tmp_name}_targets.m2 -hyp tmp/{tmp_name}_predictions.m2",
        shell=True,
    )
    # The span-level F0.5 score sits in the last column of the penultimate
    # line of errant_compare's output.
    f_05 = float(output.decode().strip().split("\n")[-2].split()[-1].strip())
    print(f"Prediction fpath: {fpath}\n\nERRANT: {f_05}", flush=True)
    print(f"Saving to: {out_fpath}", flush=True)
    save_results(obj={"errant": f_05}, fpath=out_fpath)
    subprocess.run(f"rm tmp/{tmp_name}_*", shell=True)


def main():
    args = parse_args()
    fpath = args.fpath
    print(f"Out: {args.out_fdir}", flush=True)
    # The results JSON is written next to the input predictions file.
    out_fpath = fpath.replace(".jsonl", "_errant.json")
    evaluate(fpath=fpath, out_fpath=out_fpath)


if __name__ == "__main__":
    print(
        "\nWARNING: make sure you have ERRANT installed to run the evaluation! Available here: https://github.com/chrisjbryant/errant\n\n",
        flush=True,
    )
    main()
```
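
Since each prompt variant produces its own `*_errant.json`, prompt sensitivity for `ask_gec` can be summarized by aggregating the five scores. A sketch (not part of the benchmark code; paths follow the layout used above):

```python
# A sketch: aggregate ERRANT F0.5 across the ask_gec prompt variants.
import glob
import json
import statistics

scores = []
for fpath in sorted(glob.glob("results/ask_gec/0-shot/*/samples_ask_gec_p*_errant.json")):
    with open(fpath, encoding="utf-8") as f:
        scores.append(json.load(f)["errant"])

print(f"n={len(scores)}  mean F0.5={statistics.mean(scores):.4f}")
if len(scores) > 1:
    print(f"std F0.5={statistics.stdev(scores):.4f}")
```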
The `ncb` task config:

```yaml
task: ncb
dataset_path: hcfa/ncb
output_type: multiple_choice
test_split: train
doc_to_text: ""
doc_to_target: 0
doc_to_choice: "{{[correct, wrong]}}"
num_fewshot: 0
metric_list:
  - metric: acc
    higher_is_better: true
metadata:
  version: 1.0
```
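
Because `doc_to_text` is empty and `doc_to_choice` is `[correct, wrong]`, the harness effectively asks which of the two full sentences the model assigns the higher log-likelihood. An illustrative sketch of that comparison with plain `transformers` (the checkpoint and the sentence pair are placeholders, not from the dataset):

```python
# Illustration of the ncb sentence-ranking setup: compare the total
# log-likelihood of the correct and the corrupted sentence.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "AI-Sweden-Models/Llama-3-8B"  # placeholder checkpoint
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)
model.eval()


def sentence_logprob(text: str) -> float:
    """Sum of log-probabilities of all tokens in `text` (no prompt)."""
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    logprobs = torch.log_softmax(logits[0, :-1], dim=-1)
    targets = ids[0, 1:]
    return logprobs.gather(1, targets.unsqueeze(1)).sum().item()


correct = "Hun har bodd i Oslo i mange år."        # made-up example pair
wrong = "Hun har bodd i Oslo i mange år siden."
print("hit" if sentence_logprob(correct) > sentence_logprob(wrong) else "miss")
```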
The shared base configuration for Belebele (`_norbelebele_yaml`):

```yaml
tag: norbelebele
dataset_path: facebook/belebele
dataset_name: nob_Latn
test_split: test
fewshot_split: test
fewshot_config:
  sampler: first_n
output_type: multiple_choice
doc_to_target: "{{['1', '2', '3', '4'].index(correct_answer_num)}}"
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
  - metric: acc_norm
    aggregation: mean
    higher_is_better: true
metadata:
  version: 1.0
```

The five prompt variants; note that `p0`, `p2`, and `p4` score the letter options A–D, while `p1` and `p3` score the full answer strings:

```yaml
task: norbelebele_p0
include: _norbelebele_yaml
doc_to_text: "Tekst: {{flores_passage}}\nSpørsmål: {{question}}\nA: {{mc_answer1}}\nB: {{mc_answer2}}\nC: {{mc_answer3}}\nD: {{mc_answer4}}\nSvar:"
doc_to_choice: ["A", "B", "C", "D"]
```

```yaml
task: norbelebele_p1
include: _norbelebele_yaml
doc_to_text: "Bakgrunn: {{flores_passage}}\nSpørsmål:{{question}}\nSvaralternativer:\n- {{mc_answer1}}\n- {{mc_answer2}}\n- {{mc_answer3}}\n- {{mc_answer4}}\nRiktig svar:"
doc_to_choice: "{{[mc_answer1, mc_answer2, mc_answer3, mc_answer4]}}"
```

```yaml
task: norbelebele_p2
include: _norbelebele_yaml
doc_to_text: "{{question}}\nHvilket av følgende mulige svar er det riktige?\nA: {{mc_answer1}}\nB: {{mc_answer2}}\nC: {{mc_answer3}}\nD: {{mc_answer4}}\nSvar:"
doc_to_choice: ["A", "B", "C", "D"]
```

```yaml
task: norbelebele_p3
include: _norbelebele_yaml
doc_to_text: "Svar følgende spørsmål: {{question}}\nSvaret skal baseres følgende tekst:\n{{flores_passage}}\nVelg et svar fra denne listen:\n {{mc_answer1}}\n {{mc_answer2}},\n {{mc_answer3}}\n {{mc_answer4}}"
doc_to_choice: "{{[mc_answer1, mc_answer2, mc_answer3, mc_answer4]}}"
target_delimiter: "\n"
```

```yaml
task: norbelebele_p4
include: _norbelebele_yaml
doc_to_text: "{{flores_passage}}\n\n{{question}}\n\nA: {{mc_answer1}}\nB: {{mc_answer2}}\nC: {{mc_answer3}}\nD: {{mc_answer4}}\n\nEr det riktige svaret A, B, C, eller D?"
doc_to_choice: ["A", "B", "C", "D"]
```
The shared base configuration for NorCommonsenseQA (`_norcommonsenseqa_yaml`):

```yaml
dataset_path: ltg/norcommonsenseqa
output_type: multiple_choice
training_split: null
validation_split: null
test_split: train
doc_to_target: "{{choices.label.index(answer)}}"
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
  - metric: acc_norm
    aggregation: mean
    higher_is_better: true
metadata:
  version: 1.0
```

The Nynorsk `p0` task file, which lives in a subfolder and pulls in the base config via a relative `include` path (the multi-folder layout mentioned in the commit log):

```yaml
tag: norcommonsenseqa_nno
dataset_name: nn
task: norcommonsenseqa_nno_p0
include: ../_norcommonsenseqa_yaml
doc_to_text: "Spørsmål: {{question}}\n\nSvar:"
doc_to_choice: "{{choices.text}}"
```