Add Histoires Morales task (#2662)

* Add Histoires Morales task * Histoires Morales task: fix mixed line endings * Histoires Morales task: fix mixed line endings * Remove tag for a single task * Add some MT for Histoires Morales

Add Histoires Morales task (#2662)
* Add Histoires Morales task * Histoires Morales task: fix mixed line endings * Histoires Morales task: fix mixed line endings * Remove tag for a single task * Add some MT for Histoires Morales
1208afd3 · Irina Proskurina · GitHub · fe9c5707 · 1208afd3 · 1208afd3
Unverified Commit 1208afd3 authored Jan 29, 2025 by Irina Proskurina Committed by GitHub Jan 29, 2025
4 changed files
--- a/lm_eval/tasks/README.md
+++ b/lm_eval/tasks/README.md
@@ -5,137 +5,138 @@

 For more information, including a full list of task names and their precise meanings or sources, follow the links provided to the individual README.md files for each subfolder.

-| Task Family | Description | Language(s)                                                                                                                   |
-|-------------|-------------|-------------------------------------------------------------------------------------------------------------------------------|
-| [aclue](aclue/README.md) | Tasks focusing on ancient Chinese language understanding and cultural aspects. | Ancient Chinese                                                                                                               |
-| [aexams](aexams/README.md) | Tasks in Arabic related to various academic exams covering a range of subjects. | Arabic                                                                                                                        |
-| [agieval](agieval/README.md) | Tasks involving historical data or questions related to history and historical texts. | English, Chinese                                                                                                              |
-| [anli](anli/README.md) | Adversarial natural language inference tasks designed to test model robustness. | English                                                                                                                       |
-| [arabic_leaderboard_complete](arabic_leaderboard_complete/README.md) | A full version of the tasks in the Open Arabic LLM Leaderboard, focusing on the evaluation of models that reflect the characteristics of Arabic language understanding and comprehension, culture, and heritage. Note that some of these tasks are machine-translated. | Arabic (Some MT)                                                                                                              |
-| [arabic_leaderboard_light](arabic_leaderboard_light/README.md) | A light version of the tasks in the Open Arabic LLM Leaderboard (i.e., 10% samples of the test set in the original benchmarks), focusing on the evaluation of models that reflect the characteristics of Arabic language understanding and comprehension, culture, and heritage. Note that some of these tasks are machine-translated. | Arabic (Some MT)                                                                                                              |
-| [arabicmmlu](arabicmmlu/README.md) | Localized Arabic version of MMLU with multiple-choice questions from 40 subjects. | Arabic                                                                                                                        |
-| [AraDICE](aradice/README.md) | A collection of multiple tasks carefully designed to evaluate dialectal and cultural capabilities in large language models (LLMs). | Arabic                                                                                                                        |
-| [arc](arc/README.md) | Tasks involving complex reasoning over a diverse set of questions.  | English                                                                                                                       |
-| [arithmetic](arithmetic/README.md) | Tasks involving numerical computations and arithmetic reasoning. | English                                                                                                                       |
-| [asdiv](asdiv/README.md) | Tasks involving arithmetic and mathematical reasoning challenges. | English                                                                                                                       |
-| [babi](babi/README.md) | Tasks designed as question and answering challenges based on simulated stories. | English                                                                                                                       |
-| [basque_bench](basque_bench/README.md) | Collection of tasks in Basque encompassing various evaluation areas. | Basque                                                                                                                        |
-| [basqueglue](basqueglue/README.md) | Tasks designed to evaluate language understanding in Basque language. | Basque                                                                                                                        |
-| [bbh](bbh/README.md) | Tasks focused on deep semantic understanding through hypothesization and reasoning. | English, German                                                                                                               |
-| [belebele](belebele/README.md) | Language understanding tasks in a variety of languages and scripts. | Multiple (122 languages)                                                                                                      |
-| benchmarks | General benchmarking tasks that test a wide range of language understanding capabilities. |                                                                                                                               |
-| [bertaqa](bertaqa/README.md) | Local Basque cultural trivia QA tests in English and Basque languages. | English, Basque, Basque (MT)                                                                                                  |
-| [bigbench](bigbench/README.md) | Broad tasks from the BIG-bench benchmark designed to push the boundaries of large models. | Multiple                                                                                                                      |
-| [blimp](blimp/README.md) | Tasks testing grammatical phenomena to evaluate language model's linguistic capabilities. | English                                                                                                                       |
-| [catalan_bench](catalan_bench/README.md) | Collection of tasks in Catalan encompassing various evaluation areas. | Catalan                                                                                                                       |
-| [ceval](ceval/README.md) | Tasks that evaluate language understanding and reasoning in an educational context. | Chinese                                                                                                                       |
-| [cmmlu](cmmlu/README.md) | Multi-subject multiple choice question tasks for comprehensive academic assessment. | Chinese                                                                                                                       |
-| code_x_glue | Tasks that involve understanding and generating code across multiple programming languages. | Go, Java, JS, PHP, Python, Ruby                                                                                               |
-| [commonsense_qa](commonsense_qa/README.md) | CommonsenseQA, a multiple-choice QA dataset for measuring commonsense knowledge. | English                                                                                                                       |
-| [copal_id](copal_id/README.md) | Indonesian causal commonsense reasoning dataset that captures local nuances. | Indonesian                                                                                                                    |
-| [coqa](coqa/README.md) | Conversational question answering tasks to test dialog understanding. | English                                                                                                                       |
-| [crows_pairs](crows_pairs/README.md) | Tasks designed to test model biases in various sociodemographic groups. | English, French                                                                                                               |
-| csatqa | Tasks related to SAT and other standardized testing questions for academic assessment. | Korean                                                                                                                        |
-| [drop](drop/README.md) | Tasks requiring numerical reasoning, reading comprehension, and question answering. | English                                                                                                                       |
-| [eq_bench](eq_bench/README.md) | Tasks focused on equality and ethics in question answering and decision-making. | English                                                                                                                       |
-| [eus_exams](eus_exams/README.md) | Tasks based on various professional and academic exams in the Basque language. | Basque                                                                                                                        |
-| [eus_proficiency](eus_proficiency/README.md) | Tasks designed to test proficiency in the Basque language across various topics. | Basque                                                                                                                        |
-| [eus_reading](eus_reading/README.md) | Reading comprehension tasks specifically designed for the Basque language. | Basque                                                                                                                        |
-| [eus_trivia](eus_trivia/README.md) | Trivia and knowledge testing tasks in the Basque language. | Basque                                                                                                                        |
-| [fda](fda/README.md) | Tasks for extracting key-value pairs from FDA documents to test information extraction. | English                                                                                                                       |
-| [fld](fld/README.md) | Tasks involving free-form and directed dialogue understanding. | English                                                                                                                       |
-| [french_bench](french_bench/README.md) | Set of tasks designed to assess language model performance in French. | French                                                                                                                        |
-| [galician_bench](galician_bench/README.md) | Collection of tasks in Galician encompassing various evaluation areas. | Galician                                                                                                                      |
-| [global_mmlu](global_mmlu/README.md) | Collection of culturally sensitive and culturally agnostic MMLU tasks in 15 languages with human translations or post-edits. | Multiple (15 languages)                                                                                                       |
-| [glue](glue/README.md) | General Language Understanding Evaluation benchmark to test broad language abilities. | English                                                                                                                       |
-| [gpqa](gpqa/README.md) | Tasks designed for general public question answering and knowledge verification. | English                                                                                                                       |
-| [gsm8k](gsm8k/README.md) | A benchmark of grade school math problems aimed at evaluating reasoning capabilities. | English                                                                                                                       |
-| [haerae](haerae/README.md) | Tasks focused on assessing detailed factual and historical knowledge. | Korean                                                                                                                        |
-| [headqa](headqa/README.md) | A high-level education-based question answering dataset to test specialized knowledge. | Spanish, English                                                                                                              |
-| [hellaswag](hellaswag/README.md) | Tasks to predict the ending of stories or scenarios, testing comprehension and creativity. | English                                                                                                                       |
-| [hendrycks_ethics](hendrycks_ethics/README.md)     | Tasks designed to evaluate the ethical reasoning capabilities of models. | English                                                                                                                       |
-| [hendrycks_math](hendrycks_math/README.md) | Mathematical problem-solving tasks to test numerical reasoning and problem-solving. | English                                                                                                                       |
-| [hrm8k](hrm8k/README.md) | A challenging bilingual math reasoning benchmark for Korean and English. | Korean (Some MT), English (Some MT)                                                                                           |
-| [humaneval](humaneval/README.md) | Code generation task that measure functional correctness for synthesizing programs from docstrings. | Python                                                                                                                        |
-| [ifeval](ifeval/README.md) | Interactive fiction evaluation tasks for narrative understanding and reasoning. | English                                                                                                                       |
-| [inverse_scaling](inverse_scaling/README.md) | Multiple-choice tasks from the Inverse Scaling Prize, designed to find settings where larger language models perform worse. | English                                                                                                                       |
-| [japanese_leaderboard](japanese_leaderboard/README.md) | Japanese language understanding tasks to benchmark model performance on various linguistic aspects. | Japanese                                                                                                                      |
-| [kbl](kbl/README.md) | Korean Benchmark for Legal Language Understanding. | Korean                                                                                                                        |
-| [kmmlu](kmmlu/README.md) | Knowledge-based multi-subject multiple choice questions for academic evaluation. | Korean                                                                                                                        |
-| [kobest](kobest/README.md) | A collection of tasks designed to evaluate understanding in Korean language. | Korean                                                                                                                        |
-| [kormedmcqa](kormedmcqa/README.md) | Medical question answering tasks in Korean to test specialized domain knowledge. | Korean                                                                                                                        |
-| [lambada](lambada/README.md) | Tasks designed to predict the endings of text passages, testing language prediction skills. | English                                                                                                                       |
-| [lambada_cloze](lambada_cloze/README.md) | Cloze-style LAMBADA dataset. | English                                                                                                                       |
-| [lambada_multilingual](lambada_multilingual/README.md) | Multilingual LAMBADA dataset. This is a legacy version of the multilingual dataset, and users should instead use `lambada_multilingual_stablelm`. | German, English, Spanish, French, Italian                                                                                     |
+| Task Family                                                              | Description | Language(s)                                                                                                                   |
+|--------------------------------------------------------------------------|-------------|-------------------------------------------------------------------------------------------------------------------------------|
+| [aclue](aclue/README.md)                                                 | Tasks focusing on ancient Chinese language understanding and cultural aspects. | Ancient Chinese                                                                                                               |
+| [aexams](aexams/README.md)                                               | Tasks in Arabic related to various academic exams covering a range of subjects. | Arabic                                                                                                                        |
+| [agieval](agieval/README.md)                                             | Tasks involving historical data or questions related to history and historical texts. | English, Chinese                                                                                                              |
+| [anli](anli/README.md)                                                   | Adversarial natural language inference tasks designed to test model robustness. | English                                                                                                                       |
+| [arabic_leaderboard_complete](arabic_leaderboard_complete/README.md)     | A full version of the tasks in the Open Arabic LLM Leaderboard, focusing on the evaluation of models that reflect the characteristics of Arabic language understanding and comprehension, culture, and heritage. Note that some of these tasks are machine-translated. | Arabic (Some MT)                                                                                                              |
+| [arabic_leaderboard_light](arabic_leaderboard_light/README.md)           | A light version of the tasks in the Open Arabic LLM Leaderboard (i.e., 10% samples of the test set in the original benchmarks), focusing on the evaluation of models that reflect the characteristics of Arabic language understanding and comprehension, culture, and heritage. Note that some of these tasks are machine-translated. | Arabic (Some MT)                                                                                                              |
+| [arabicmmlu](arabicmmlu/README.md)                                       | Localized Arabic version of MMLU with multiple-choice questions from 40 subjects. | Arabic                                                                                                                        |
+| [AraDICE](aradice/README.md)                                             | A collection of multiple tasks carefully designed to evaluate dialectal and cultural capabilities in large language models (LLMs). | Arabic                                                                                                                        |
+| [arc](arc/README.md)                                                     | Tasks involving complex reasoning over a diverse set of questions.  | English                                                                                                                       |
+| [arithmetic](arithmetic/README.md)                                       | Tasks involving numerical computations and arithmetic reasoning. | English                                                                                                                       |
+| [asdiv](asdiv/README.md)                                                 | Tasks involving arithmetic and mathematical reasoning challenges. | English                                                                                                                       |
+| [babi](babi/README.md)                                                   | Tasks designed as question and answering challenges based on simulated stories. | English                                                                                                                       |
+| [basque_bench](basque_bench/README.md)                                   | Collection of tasks in Basque encompassing various evaluation areas. | Basque                                                                                                                        |
+| [basqueglue](basqueglue/README.md)                                       | Tasks designed to evaluate language understanding in Basque language. | Basque                                                                                                                        |
+| [bbh](bbh/README.md)                                                     | Tasks focused on deep semantic understanding through hypothesization and reasoning. | English, German                                                                                                               |
+| [belebele](belebele/README.md)                                           | Language understanding tasks in a variety of languages and scripts. | Multiple (122 languages)                                                                                                      |
+| benchmarks                                                               | General benchmarking tasks that test a wide range of language understanding capabilities. |                                                                                                                               |
+| [bertaqa](bertaqa/README.md)                                             | Local Basque cultural trivia QA tests in English and Basque languages. | English, Basque, Basque (MT)                                                                                                  |
+| [bigbench](bigbench/README.md)                                           | Broad tasks from the BIG-bench benchmark designed to push the boundaries of large models. | Multiple                                                                                                                      |
+| [blimp](blimp/README.md)                                                 | Tasks testing grammatical phenomena to evaluate language model's linguistic capabilities. | English                                                                                                                       |
+| [catalan_bench](catalan_bench/README.md)                                 | Collection of tasks in Catalan encompassing various evaluation areas. | Catalan                                                                                                                       |
+| [ceval](ceval/README.md)                                                 | Tasks that evaluate language understanding and reasoning in an educational context. | Chinese                                                                                                                       |
+| [cmmlu](cmmlu/README.md)                                                 | Multi-subject multiple choice question tasks for comprehensive academic assessment. | Chinese                                                                                                                       |
+| code_x_glue                                                              | Tasks that involve understanding and generating code across multiple programming languages. | Go, Java, JS, PHP, Python, Ruby                                                                                               |
+| [commonsense_qa](commonsense_qa/README.md)                               | CommonsenseQA, a multiple-choice QA dataset for measuring commonsense knowledge. | English                                                                                                                       |
+| [copal_id](copal_id/README.md)                                           | Indonesian causal commonsense reasoning dataset that captures local nuances. | Indonesian                                                                                                                    |
+| [coqa](coqa/README.md)                                                   | Conversational question answering tasks to test dialog understanding. | English                                                                                                                       |
+| [crows_pairs](crows_pairs/README.md)                                     | Tasks designed to test model biases in various sociodemographic groups. | English, French                                                                                                               |
+| csatqa                                                                   | Tasks related to SAT and other standardized testing questions for academic assessment. | Korean                                                                                                                        |
+| [drop](drop/README.md)                                                   | Tasks requiring numerical reasoning, reading comprehension, and question answering. | English                                                                                                                       |
+| [eq_bench](eq_bench/README.md)                                           | Tasks focused on equality and ethics in question answering and decision-making. | English                                                                                                                       |
+| [eus_exams](eus_exams/README.md)                                         | Tasks based on various professional and academic exams in the Basque language. | Basque                                                                                                                        |
+| [eus_proficiency](eus_proficiency/README.md)                             | Tasks designed to test proficiency in the Basque language across various topics. | Basque                                                                                                                        |
+| [eus_reading](eus_reading/README.md)                                     | Reading comprehension tasks specifically designed for the Basque language. | Basque                                                                                                                        |
+| [eus_trivia](eus_trivia/README.md)                                       | Trivia and knowledge testing tasks in the Basque language. | Basque                                                                                                                        |
+| [fda](fda/README.md)                                                     | Tasks for extracting key-value pairs from FDA documents to test information extraction. | English                                                                                                                       |
+| [fld](fld/README.md)                                                     | Tasks involving free-form and directed dialogue understanding. | English                                                                                                                       |
+| [french_bench](french_bench/README.md)                                   | Set of tasks designed to assess language model performance in French. | French                                                                                                                        |
+| [galician_bench](galician_bench/README.md)                               | Collection of tasks in Galician encompassing various evaluation areas. | Galician                                                                                                                      |
+| [global_mmlu](global_mmlu/README.md)                                     | Collection of culturally sensitive and culturally agnostic MMLU tasks in 15 languages with human translations or post-edits. | Multiple (15 languages)                                                                                                       |
+| [glue](glue/README.md)                                                   | General Language Understanding Evaluation benchmark to test broad language abilities. | English                                                                                                                       |
+| [gpqa](gpqa/README.md)                                                   | Tasks designed for general public question answering and knowledge verification. | English                                                                                                                       |
+| [gsm8k](gsm8k/README.md)                                                 | A benchmark of grade school math problems aimed at evaluating reasoning capabilities. | English                                                                                                                       |
+| [haerae](haerae/README.md)                                               | Tasks focused on assessing detailed factual and historical knowledge. | Korean                                                                                                                        |
+| [headqa](headqa/README.md)                                               | A high-level education-based question answering dataset to test specialized knowledge. | Spanish, English                                                                                                              |
+| [hellaswag](hellaswag/README.md)                                         | Tasks to predict the ending of stories or scenarios, testing comprehension and creativity. | English                                                                                                                       |
+| [hendrycks_ethics](hendrycks_ethics/README.md)                           | Tasks designed to evaluate the ethical reasoning capabilities of models. | English                                                                                                                       |
+| [hendrycks_math](hendrycks_math/README.md)                               | Mathematical problem-solving tasks to test numerical reasoning and problem-solving. | English                                                                                                                       |
+| [histoires_morales](histoires_morales/README.md)                         | A dataset of structured narratives that describe normative and norm-divergent actions taken by individuals to accomplish certain intentions in concrete situations.  | French (Some MT)                                                                                                                        |
+| [hrm8k](hrm8k/README.md)                                                 | A challenging bilingual math reasoning benchmark for Korean and English. | Korean (Some MT), English (Some MT)                                                                                           |
+| [humaneval](humaneval/README.md)                                         | Code generation task that measure functional correctness for synthesizing programs from docstrings. | Python                                                                                                                        |
+| [ifeval](ifeval/README.md)                                               | Interactive fiction evaluation tasks for narrative understanding and reasoning. | English                                                                                                                       |
+| [inverse_scaling](inverse_scaling/README.md)                             | Multiple-choice tasks from the Inverse Scaling Prize, designed to find settings where larger language models perform worse. | English                                                                                                                       |
+| [japanese_leaderboard](japanese_leaderboard/README.md)                   | Japanese language understanding tasks to benchmark model performance on various linguistic aspects. | Japanese                                                                                                                      |
+| [kbl](kbl/README.md)                                                     | Korean Benchmark for Legal Language Understanding. | Korean                                                                                                                        |
+| [kmmlu](kmmlu/README.md)                                                 | Knowledge-based multi-subject multiple choice questions for academic evaluation. | Korean                                                                                                                        |
+| [kobest](kobest/README.md)                                               | A collection of tasks designed to evaluate understanding in Korean language. | Korean                                                                                                                        |
+| [kormedmcqa](kormedmcqa/README.md)                                       | Medical question answering tasks in Korean to test specialized domain knowledge. | Korean                                                                                                                        |
+| [lambada](lambada/README.md)                                             | Tasks designed to predict the endings of text passages, testing language prediction skills. | English                                                                                                                       |
+| [lambada_cloze](lambada_cloze/README.md)                                 | Cloze-style LAMBADA dataset. | English                                                                                                                       |
+| [lambada_multilingual](lambada_multilingual/README.md)                   | Multilingual LAMBADA dataset. This is a legacy version of the multilingual dataset, and users should instead use `lambada_multilingual_stablelm`. | German, English, Spanish, French, Italian                                                                                     |
 | [lambada_multilingual_stablelm](lambada_multilingual_stablelm/README.md) | Multilingual LAMBADA dataset. Users should prefer evaluating on this version of the multilingual dataset instead of on `lambada_multilingual`. | German, English, Spanish, French, Italian, Dutch, Portuguese                                                                  |
-| [leaderboard](leaderboard/README.md) | Task group used by Hugging Face's [Open LLM Leaderboard v2](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard). Those tasks are static and will not change through time | English                                                                                                                       |
-| [lingoly](lingoly/README.md) | Challenging logical reasoning benchmark in low-resource languages with controls for memorization | English, Multilingual                                                                                                         |
-| [logiqa](logiqa/README.md) | Logical reasoning tasks requiring advanced inference and deduction. | English, Chinese                                                                                                              |
-| [logiqa2](logiqa2/README.md) | Large-scale logical reasoning dataset adapted from the Chinese Civil Service Examination. | English, Chinese                                                                                                              |
-| [mathqa](mathqa/README.md) | Question answering tasks involving mathematical reasoning and problem-solving. | English                                                                                                                       |
-| [mbpp](mbpp/README.md) | A benchmark designed to measure the ability to synthesize short Python programs from natural language descriptions. | Python |
-| [mc_taco](mc_taco/README.md) | Question-answer pairs that require temporal commonsense comprehension. | English                                                                                                                       |
-| [med_concepts_qa](med_concepts_qa/README.md) | Benchmark for evaluating LLMs on their abilities to interpret medical codes and distinguish between medical concept. | English                                                                                                                       |
-| [metabench](metabench/README.md) | Distilled versions of six popular benchmarks which are highly predictive of overall benchmark performance and of a single general ability latent trait. | English                                                                                                                       |
-| medmcqa | Medical multiple choice questions assessing detailed medical knowledge. | English                                                                                                                       |
-| medqa | Multiple choice question answering based on the United States Medical License Exams. |                                                                                                                               |
-| [mgsm](mgsm/README.md) | Benchmark of multilingual grade-school math problems. | Spanish, French, German, Russian, Chinese, Japanese, Thai, Swahili, Bengali, Telugu                                           |
-| [minerva_math](minerva_math/README.md) | Mathematics-focused tasks requiring numerical reasoning and problem-solving skills. | English                                                                                                                       |
-| [mlqa](mlqa/README.md) | MultiLingual Question Answering benchmark dataset for evaluating cross-lingual question answering performance. | English, Arabic, German, Spanish, Hindi, Vietnamese, Simplified Chinese |
-| [mmlu](mmlu/README.md) | Massive Multitask Language Understanding benchmark for broad domain language evaluation. Several variants are supported. | English                                                                                                                       |
-| [mmlu_pro](mmlu_pro/README.md) | A refined set of MMLU, integrating more challenging, reasoning-focused questions and expanding the choice set from four to ten options. | English                                                                                                                       |
-| [mmlusr](mmlusr/README.md) | Variation of MMLU designed to be more rigorous. | English                                                                                                                       |
-| model_written_evals | Evaluation tasks auto-generated for evaluating a collection of AI Safety concerns. |                                                                                                                               |
-| [moral_stories](moral_stories/README.md) | A crowd-sourced dataset of structured narratives that describe normative and norm-divergent actions taken by individuals to accomplish certain intentions in concrete situations. | English  
-| [mutual](mutual/README.md) | A retrieval-based dataset for multi-turn dialogue reasoning. | English                                                                                                                       |
-| [nq_open](nq_open/README.md) | Open domain question answering tasks based on the Natural Questions dataset. | English                                                                                                                       |
-| [okapi/arc_multilingual](okapi/arc_multilingual/README.md) | Tasks that involve reading comprehension and information retrieval challenges. | Multiple (31 languages) **Machine Translated.**                                                                               |
-| [okapi/hellaswag_multilingual](okapi/hellaswag_multilingual/README.md) | Tasks that involve reading comprehension and information retrieval challenges. | Multiple (30 languages) **Machine Translated.**                                                                               |
-| okapi/mmlu_multilingual | Tasks that involve reading comprehension and information retrieval challenges. | Multiple (34 languages) **Machine Translated.**                                                                               |
+| [leaderboard](leaderboard/README.md)                                     | Task group used by Hugging Face's [Open LLM Leaderboard v2](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard). Those tasks are static and will not change through time | English                                                                                                                       |
+| [lingoly](lingoly/README.md)                                             | Challenging logical reasoning benchmark in low-resource languages with controls for memorization | English, Multilingual                                                                                                         |
+| [logiqa](logiqa/README.md)                                               | Logical reasoning tasks requiring advanced inference and deduction. | English, Chinese                                                                                                              |
+| [logiqa2](logiqa2/README.md)                                             | Large-scale logical reasoning dataset adapted from the Chinese Civil Service Examination. | English, Chinese                                                                                                              |
+| [mathqa](mathqa/README.md)                                               | Question answering tasks involving mathematical reasoning and problem-solving. | English                                                                                                                       |
+| [mbpp](mbpp/README.md)                                                   | A benchmark designed to measure the ability to synthesize short Python programs from natural language descriptions. | Python                                                                                                                        |
+| [mc_taco](mc_taco/README.md)                                             | Question-answer pairs that require temporal commonsense comprehension. | English                                                                                                                       |
+| [med_concepts_qa](med_concepts_qa/README.md)                             | Benchmark for evaluating LLMs on their abilities to interpret medical codes and distinguish between medical concept. | English                                                                                                                       |
+| [metabench](metabench/README.md)                                         | Distilled versions of six popular benchmarks which are highly predictive of overall benchmark performance and of a single general ability latent trait. | English                                                                                                                       |
+| medmcqa                                                                  | Medical multiple choice questions assessing detailed medical knowledge. | English                                                                                                                       |
+| medqa                                                                    | Multiple choice question answering based on the United States Medical License Exams. |                                                                                                                               |
+| [mgsm](mgsm/README.md)                                                   | Benchmark of multilingual grade-school math problems. | Spanish, French, German, Russian, Chinese, Japanese, Thai, Swahili, Bengali, Telugu                                           |
+| [minerva_math](minerva_math/README.md)                                   | Mathematics-focused tasks requiring numerical reasoning and problem-solving skills. | English                                                                                                                       |
+| [mlqa](mlqa/README.md)                                                   | MultiLingual Question Answering benchmark dataset for evaluating cross-lingual question answering performance. | English, Arabic, German, Spanish, Hindi, Vietnamese, Simplified Chinese                                                       |
+| [mmlu](mmlu/README.md)                                                   | Massive Multitask Language Understanding benchmark for broad domain language evaluation. Several variants are supported. | English                                                                                                                       |
+| [mmlu_pro](mmlu_pro/README.md)                                           | A refined set of MMLU, integrating more challenging, reasoning-focused questions and expanding the choice set from four to ten options. | English                                                                                                                       |
+| [mmlusr](mmlusr/README.md)                                               | Variation of MMLU designed to be more rigorous. | English                                                                                                                       |
+| model_written_evals                                                      | Evaluation tasks auto-generated for evaluating a collection of AI Safety concerns. |                                                                                                                               |
+| [moral_stories](moral_stories/README.md)                                 | A crowd-sourced dataset of structured narratives that describe normative and norm-divergent actions taken by individuals to accomplish certain intentions in concrete situations. | English  
+| [mutual](mutual/README.md)                                               | A retrieval-based dataset for multi-turn dialogue reasoning. | English                                                                                                                       |
+| [nq_open](nq_open/README.md)                                             | Open domain question answering tasks based on the Natural Questions dataset. | English                                                                                                                       |
+| [okapi/arc_multilingual](okapi/arc_multilingual/README.md)               | Tasks that involve reading comprehension and information retrieval challenges. | Multiple (31 languages) **Machine Translated.**                                                                               |
+| [okapi/hellaswag_multilingual](okapi/hellaswag_multilingual/README.md)   | Tasks that involve reading comprehension and information retrieval challenges. | Multiple (30 languages) **Machine Translated.**                                                                               |
+| okapi/mmlu_multilingual                                                  | Tasks that involve reading comprehension and information retrieval challenges. | Multiple (34 languages) **Machine Translated.**                                                                               |
 | [okapi/truthfulqa_multilingual](okapi/truthfulqa_multilingual/README.md) | Tasks that involve reading comprehension and information retrieval challenges. | Multiple (31 languages) **Machine Translated.**                                                                               |
-| [openbookqa](openbookqa/README.md) | Open-book question answering tasks that require external knowledge and reasoning. | English                                                                                                                       |
-| [paloma](paloma/README.md) | Paloma is a comprehensive benchmark designed to evaluate open language models across a wide range of domains, ranging from niche artist communities to mental health forums on Reddit. | English                                                                                                                       |
-| [paws-x](paws-x/README.md) | Paraphrase Adversaries from Word Scrambling, focusing on cross-lingual capabilities. | English, French, Spanish, German, Chinese, Japanese, Korean                                                                   |
-| [pile](pile/README.md) | Open source language modelling data set that consists of 22 smaller, high-quality datasets. | English                                                                                                                       |
-| [pile_10k](pile_10k/README.md) | The first 10K elements of The Pile, useful for debugging models trained on it. | English                                                                                                                       |
-| [piqa](piqa/README.md) | Physical Interaction Question Answering tasks to test physical commonsense reasoning. | English                                                                                                                       |
-| [polemo2](polemo2/README.md) | Sentiment analysis and emotion detection tasks based on Polish language data. | Polish                                                                                                                        |
-| [portuguese_bench](portuguese_bench/README.md) | Collection of tasks in European Portuguese encompassing various evaluation areas. | Portuguese                                                                                                                    |
-| [prost](prost/README.md) | Tasks requiring understanding of professional standards and ethics in various domains. | English                                                                                                                       |
-| [pubmedqa](pubmedqa/README.md) | Question answering tasks based on PubMed research articles for biomedical understanding. | English                                                                                                                       |
-| [qa4mre](qa4mre/README.md) | Question Answering for Machine Reading Evaluation, assessing comprehension and reasoning. | English                                                                                                                       |
-| [qasper](qasper/README.md) | Question Answering dataset based on academic papers, testing in-depth scientific knowledge. | English                                                                                                                       |
-| [race](race/README.md) | Reading comprehension assessment tasks based on English exams in China. | English                                                                                                                       |
-| realtoxicityprompts | Tasks to evaluate language models for generating text with potential toxicity. |                                                                                                                               |
-| [sciq](sciq/README.md) | Science Question Answering tasks to assess understanding of scientific concepts. | English                                                                                                                       |
-| [score](score/README.md) | Systematic consistency and robustness evaluation for LLMs on 3 datasets(MMLU-Pro, Agi Eval and MATH) | English                                                                                                                       |
-| [scrolls](scrolls/README.md) | Tasks that involve long-form reading comprehension across various domains. | English                                                                                                                       |
-| [siqa](siqa/README.md) | Social Interaction Question Answering to evaluate common sense and social reasoning.  | English                                                                                                                       |
-| [spanish_bench](spanish_bench/README.md) | Collection of tasks in Spanish encompassing various evaluation areas. | Spanish                                                                                                                       |
-| [squad_completion](squad_completion/README.md) | A variant of the SQuAD question answering task designed for zero-shot evaluation of small LMs. | English                                                                                                                       |
-| [squadv2](squadv2/README.md) | Stanford Question Answering Dataset version 2, a reading comprehension benchmark. | English                                                                                                                       |
-| [storycloze](storycloze/README.md) | Tasks to predict story endings, focusing on narrative logic and coherence. | English                                                                                                                       |
-| [super_glue](super_glue/README.md) | A suite of challenging tasks designed to test a range of language understanding skills. | English                                                                                                                       |
-| [swag](swag/README.md) | Situations With Adversarial Generations, predicting the next event in videos. | English                                                                                                                       |
-| [swde](swde/README.md) | Information extraction tasks from semi-structured web pages. | English                                                                                                                       |
-| [tinyBenchmarks](tinyBenchmarks/README.md) | Evaluation of large language models with fewer examples using tiny versions of popular benchmarks. | English                                                                                                                       |
-| [tmmluplus](tmmluplus/README.md) | An extended set of tasks under the TMMLU framework for broader academic assessments. | Traditional Chinese                                                                                                           |
-| [toxigen](toxigen/README.md) | Tasks designed to evaluate language models on their propensity to generate toxic content. | English                                                                                                                       |
-| [translation](translation/README.md) | Tasks focused on evaluating the language translation capabilities of models. | Arabic, English, Spanish, Basque, Hindi, Indonesian, Burmese, Russian, Swahili, Telugu, Chinese                               |
-| [triviaqa](triviaqa/README.md) | A large-scale dataset for trivia question answering to test general knowledge. | English                                                                                                                       |
-| [truthfulqa](truthfulqa/README.md) | A QA task aimed at evaluating the truthfulness and factual accuracy of model responses. | English                                                                                                                       |
-| [turkishmmlu](turkishmmlu/README.md) | A multiple-choice QA test modeled after MMLU, written in Turkish based on Turkish high-school level exams. | Turkish                                                                                                                       |
-| [unitxt](unitxt/README.md) | A number of tasks implemented using the unitxt library for flexible, shareable, and reusable data preparation and evaluation for generative AI. | English                                                                                                                       |
-| [unscramble](unscramble/README.md) | Tasks involving the rearrangement of scrambled sentences to test syntactic understanding. | English                                                                                                                       |
-| [webqs](webqs/README.md) | Web-based question answering tasks designed to evaluate internet search and retrieval. | English                                                                                                                       |
-| [wikitext](wikitext/README.md) | Tasks based on text from Wikipedia articles to assess language modeling and generation. | English                                                                                                                       |
-| [winogrande](winogrande/README.md) | A large-scale dataset for coreference resolution, inspired by the Winograd Schema Challenge. | English                                                                                                                       |
-| [wmdp](wmdp/README.md) | A benchmark with the objective of minimizing performance, based on potentially-sensitive multiple-choice knowledge questions. | English                                                                                                                       |
-| [wmt2016](wmt2016/README.md) | Tasks from the WMT 2016 shared task, focusing on translation between multiple languages. | English, Czech, German, Finnish, Russian, Romanian, Turkish                                                                   |
-| [wsc273](wsc273/README.md) | The Winograd Schema Challenge, a test of commonsense reasoning and coreference resolution. | English                                                                                                                       |
-| [xcopa](xcopa/README.md) | Cross-lingual Choice of Plausible Alternatives, testing reasoning in multiple languages. | Estonian, Haitian, Indonesian, Italian, Quechua, Swahili, Tamil, Thai, Turkish, Vietnamese, Chinese                           |
-| [xnli](xnli/README.md) | Cross-Lingual Natural Language Inference to test understanding across different languages. | Arabic, Bulgarian, German, Greek, English, Spanish, French, Hindi, Russian, Swahili, Thai, Turkish, Urdu, Vietnamese, Chinese |
-| [xnli_eu](xnli_eu/README.md) | Cross-lingual Natural Language Inference tasks in Basque. | Basque                                                                                                                        |
-| [xquad](xquad/README.md) | Cross-lingual Question Answering Dataset in multiple languages. | Arabic, German, Greek, English, Spanish, Hindi, Romanian, Russian, Thai, Turkish, Vietnamese, Chinese                         |
-| [xstorycloze](xstorycloze/README.md) | Cross-lingual narrative understanding tasks to predict story endings in multiple languages. | Russian, Simplified Chinese, Spanish, Arabic, Hindi, Indonesian, Telugu, Swahili, Basque, Burmese                             |
-| [xwinograd](xwinograd/README.md) | Cross-lingual Winograd schema tasks for coreference resolution in multiple languages. | English, French, Japanese, Portuguese, Russian, Chinese                                                                       |
+| [openbookqa](openbookqa/README.md)                                       | Open-book question answering tasks that require external knowledge and reasoning. | English                                                                                                                       |
+| [paloma](paloma/README.md)                                               | Paloma is a comprehensive benchmark designed to evaluate open language models across a wide range of domains, ranging from niche artist communities to mental health forums on Reddit. | English                                                                                                                       |
+| [paws-x](paws-x/README.md)                                               | Paraphrase Adversaries from Word Scrambling, focusing on cross-lingual capabilities. | English, French, Spanish, German, Chinese, Japanese, Korean                                                                   |
+| [pile](pile/README.md)                                                   | Open source language modelling data set that consists of 22 smaller, high-quality datasets. | English                                                                                                                       |
+| [pile_10k](pile_10k/README.md)                                           | The first 10K elements of The Pile, useful for debugging models trained on it. | English                                                                                                                       |
+| [piqa](piqa/README.md)                                                   | Physical Interaction Question Answering tasks to test physical commonsense reasoning. | English                                                                                                                       |
+| [polemo2](polemo2/README.md)                                             | Sentiment analysis and emotion detection tasks based on Polish language data. | Polish                                                                                                                        |
+| [portuguese_bench](portuguese_bench/README.md)                           | Collection of tasks in European Portuguese encompassing various evaluation areas. | Portuguese                                                                                                                    |
+| [prost](prost/README.md)                                                 | Tasks requiring understanding of professional standards and ethics in various domains. | English                                                                                                                       |
+| [pubmedqa](pubmedqa/README.md)                                           | Question answering tasks based on PubMed research articles for biomedical understanding. | English                                                                                                                       |
+| [qa4mre](qa4mre/README.md)                                               | Question Answering for Machine Reading Evaluation, assessing comprehension and reasoning. | English                                                                                                                       |
+| [qasper](qasper/README.md)                                               | Question Answering dataset based on academic papers, testing in-depth scientific knowledge. | English                                                                                                                       |
+| [race](race/README.md)                                                   | Reading comprehension assessment tasks based on English exams in China. | English                                                                                                                       |
+| realtoxicityprompts                                                      | Tasks to evaluate language models for generating text with potential toxicity. |                                                                                                                               |
+| [sciq](sciq/README.md)                                                   | Science Question Answering tasks to assess understanding of scientific concepts. | English                                                                                                                       |
+| [score](score/README.md)                                                 | Systematic consistency and robustness evaluation for LLMs on 3 datasets(MMLU-Pro, Agi Eval and MATH) | English                                                                                                                       |
+| [scrolls](scrolls/README.md)                                             | Tasks that involve long-form reading comprehension across various domains. | English                                                                                                                       |
+| [siqa](siqa/README.md)                                                   | Social Interaction Question Answering to evaluate common sense and social reasoning.  | English                                                                                                                       |
+| [spanish_bench](spanish_bench/README.md)                                 | Collection of tasks in Spanish encompassing various evaluation areas. | Spanish                                                                                                                       |
+| [squad_completion](squad_completion/README.md)                           | A variant of the SQuAD question answering task designed for zero-shot evaluation of small LMs. | English                                                                                                                       |
+| [squadv2](squadv2/README.md)                                             | Stanford Question Answering Dataset version 2, a reading comprehension benchmark. | English                                                                                                                       |
+| [storycloze](storycloze/README.md)                                       | Tasks to predict story endings, focusing on narrative logic and coherence. | English                                                                                                                       |
+| [super_glue](super_glue/README.md)                                       | A suite of challenging tasks designed to test a range of language understanding skills. | English                                                                                                                       |
+| [swag](swag/README.md)                                                   | Situations With Adversarial Generations, predicting the next event in videos. | English                                                                                                                       |
+| [swde](swde/README.md)                                                   | Information extraction tasks from semi-structured web pages. | English                                                                                                                       |
+| [tinyBenchmarks](tinyBenchmarks/README.md)                               | Evaluation of large language models with fewer examples using tiny versions of popular benchmarks. | English                                                                                                                       |
+| [tmmluplus](tmmluplus/README.md)                                         | An extended set of tasks under the TMMLU framework for broader academic assessments. | Traditional Chinese                                                                                                           |
+| [toxigen](toxigen/README.md)                                             | Tasks designed to evaluate language models on their propensity to generate toxic content. | English                                                                                                                       |
+| [translation](translation/README.md)                                     | Tasks focused on evaluating the language translation capabilities of models. | Arabic, English, Spanish, Basque, Hindi, Indonesian, Burmese, Russian, Swahili, Telugu, Chinese                               |
+| [triviaqa](triviaqa/README.md)                                           | A large-scale dataset for trivia question answering to test general knowledge. | English                                                                                                                       |
+| [truthfulqa](truthfulqa/README.md)                                       | A QA task aimed at evaluating the truthfulness and factual accuracy of model responses. | English                                                                                                                       |
+| [turkishmmlu](turkishmmlu/README.md)                                     | A multiple-choice QA test modeled after MMLU, written in Turkish based on Turkish high-school level exams. | Turkish                                                                                                                       |
+| [unitxt](unitxt/README.md)                                               | A number of tasks implemented using the unitxt library for flexible, shareable, and reusable data preparation and evaluation for generative AI. | English                                                                                                                       |
+| [unscramble](unscramble/README.md)                                       | Tasks involving the rearrangement of scrambled sentences to test syntactic understanding. | English                                                                                                                       |
+| [webqs](webqs/README.md)                                                 | Web-based question answering tasks designed to evaluate internet search and retrieval. | English                                                                                                                       |
+| [wikitext](wikitext/README.md)                                           | Tasks based on text from Wikipedia articles to assess language modeling and generation. | English                                                                                                                       |
+| [winogrande](winogrande/README.md)                                       | A large-scale dataset for coreference resolution, inspired by the Winograd Schema Challenge. | English                                                                                                                       |
+| [wmdp](wmdp/README.md)                                                   | A benchmark with the objective of minimizing performance, based on potentially-sensitive multiple-choice knowledge questions. | English                                                                                                                       |
+| [wmt2016](wmt2016/README.md)                                             | Tasks from the WMT 2016 shared task, focusing on translation between multiple languages. | English, Czech, German, Finnish, Russian, Romanian, Turkish                                                                   |
+| [wsc273](wsc273/README.md)                                               | The Winograd Schema Challenge, a test of commonsense reasoning and coreference resolution. | English                                                                                                                       |
+| [xcopa](xcopa/README.md)                                                 | Cross-lingual Choice of Plausible Alternatives, testing reasoning in multiple languages. | Estonian, Haitian, Indonesian, Italian, Quechua, Swahili, Tamil, Thai, Turkish, Vietnamese, Chinese                           |
+| [xnli](xnli/README.md)                                                   | Cross-Lingual Natural Language Inference to test understanding across different languages. | Arabic, Bulgarian, German, Greek, English, Spanish, French, Hindi, Russian, Swahili, Thai, Turkish, Urdu, Vietnamese, Chinese |
+| [xnli_eu](xnli_eu/README.md)                                             | Cross-lingual Natural Language Inference tasks in Basque. | Basque                                                                                                                        |
+| [xquad](xquad/README.md)                                                 | Cross-lingual Question Answering Dataset in multiple languages. | Arabic, German, Greek, English, Spanish, Hindi, Romanian, Russian, Thai, Turkish, Vietnamese, Chinese                         |
+| [xstorycloze](xstorycloze/README.md)                                     | Cross-lingual narrative understanding tasks to predict story endings in multiple languages. | Russian, Simplified Chinese, Spanish, Arabic, Hindi, Indonesian, Telugu, Swahili, Basque, Burmese                             |
+| [xwinograd](xwinograd/README.md)                                         | Cross-lingual Winograd schema tasks for coreference resolution in multiple languages. | English, French, Japanese, Portuguese, Russian, Chinese                                                                       |
--- a/lm_eval/tasks/histoires_morales/README.md
+++ b/lm_eval/tasks/histoires_morales/README.md
+# Histoires Morales
+
+### Paper
+
+Title: `Histoires Morales: A French Dataset for Assessing Moral Alignment`
+
+Abstract: `https://arxiv.org/pdf/2501.17117`
+
+⚖ Histoires Morales is the first dataset for moral model alignment evaluation in French. It consists of narratives describing normative and norm-divergent actions taken by individuals to achieve certain intentions in concrete situations, along with their respective consequences.
+Each of the 12,000 stories (histoires) follows the same seven-sentence structure as the Moral Stories dataset:
+
+Context:
+
+1. Norm: A guideline for social conduct generally observed by most people in everyday situations.
+2. Situation: The setting of the story, introducing participants and describing their environment.
+3. Intention: A reasonable goal that one of the story participants (the actor) wants to achieve.
+
+Normative path:
+4. Normative action: An action by the actor that fulfills the intention while observing the norm.
+5. Normative consequence: A possible effect of the normative action on the actor’s environment.
+
+Norm-divergent path:
+6. Divergent action: An action by the actor that fulfills the intention but diverges from the norm.
+7. Divergent consequence: A possible effect of the divergent action on the actor’s environment.
+
+Histoires Morales is adapted to French from the widely used Moral Stories dataset.
+We translated the Moral Stories dataset and refined these translations through manual annotations.
+See paper for more details.
+
+Homepage: `https://huggingface.co/datasets/LabHC/histoires_morales`
+
+
+### Citation
+
+Coming soon (accepted to NAACL 2025)
+
+### Groups, Tags, and Tasks
+
+#### Groups
+
+* Not part of a group yet
+
+#### Tags
+
+No tags, since there is a single task.
+
+#### Tasks
+
+* `histoires_morales.yaml`
+
+### Checklist
+
+For adding novel benchmarks/datasets to the library:
+* [x] Is the task an existing benchmark in the literature?
+  * [ ] Have you referenced the original paper that introduced the task?
+  * [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
+
+
+If other tasks on this dataset are already supported:
+* [ ] Is the "Main" variant of this task clearly denoted?
+* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
+* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
--- a/lm_eval/tasks/histoires_morales/histoires_morales.yaml
+++ b/lm_eval/tasks/histoires_morales/histoires_morales.yaml
+task: histoires_morales
+dataset_path: LabHC/histoires_morales
+output_type: multiple_choice
+test_split: train
+process_docs: !function utils.process_docs
+doc_to_text: "{{query}}"
+doc_to_target: "{{label}}"
+doc_to_choice: "choices"
+metric_list:
+  - metric: acc
+    aggregation: mean
+    higher_is_better: true
+  - metric: acc_norm
+    aggregation: mean
+    higher_is_better: true
+metadata:
+  version: 1.0
--- a/lm_eval/tasks/histoires_morales/utils.py
+++ b/lm_eval/tasks/histoires_morales/utils.py
+import datasets
+
+
+def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:
+    def _process_doc(doc):
+        ctx = (
+            doc["norm"].capitalize()
+            + " "
+            + doc["situation"].capitalize()
+            + " "
+            + doc["intention"].capitalize()
+        )
+        choices = [doc["moral_action"], doc["immoral_action"]]
+        out_doc = {
+            "query": ctx,
+            "choices": choices,
+            "label": 0,
+        }
+        return out_doc
+
+    return dataset.map(_process_doc)