Commit 527a4352 authored by Baber

Merge branch 'main' into longcxt

# Conflicts:
#	lm_eval/tasks/README.md
parents 6042f622 52df63b7
......@@ -29,7 +29,7 @@ repos:
- id: mixed-line-ending
args: [--fix=lf]
- repo: https://github.com/astral-sh/ruff-pre-commit
rev: v0.9.2
rev: v0.9.3
hooks:
# Run the linter.
- id: ruff
......@@ -38,7 +38,7 @@ repos:
# Run the formatter.
- id: ruff-format
- repo: https://github.com/codespell-project/codespell
rev: v2.3.0
rev: v2.4.1
hooks:
- id: codespell
exclude: >
......
......@@ -489,7 +489,8 @@ Extras dependencies can be installed via `pip install -e ".[NAME]"`
| api | For using api models (Anthropic, OpenAI API) |
| deepsparse | For running NM's DeepSparse models |
| dev | For linting PRs and contributions |
| gptq | For loading models with GPTQ |
| gptq | For loading models with AutoGPTQ |
| gptqmodel | For loading models with GPTQModel |
| hf_transfer | For speeding up HF Hub file downloads |
| ifeval | For running the IFEval task |
| ibm_watsonx_ai | For using IBM watsonx.ai model apis |
......
......@@ -8,7 +8,7 @@ A majority of users run the library by cloning it from Github, installing the pa
Equivalently, running the library can be done via the `lm-eval` entrypoint at the command line.
This mode supports a number of command-line arguments, the details of which can be also be seen via running with `-h` or `--help`:
This mode supports a number of command-line arguments, the details of which can also be seen via running with `-h` or `--help`:
- `--model` : Selects which model type or provider is evaluated. Must be a string corresponding to the name of the model type/provider being used. See [the main README](https://github.com/EleutherAI/lm-evaluation-harness/tree/main#model-apis-and-inference-servers) for a full list of enabled model names and supported libraries or APIs.
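For orientation, the following is a minimal sketch of the equivalent programmatic flow through `simple_evaluate`; the checkpoint and task names are illustrative, not prescribed by this commit.

```python
# Minimal sketch of the programmatic equivalent of the `lm-eval` CLI.
# The checkpoint and task names below are illustrative only.
from lm_eval import evaluator
from lm_eval.models.huggingface import HFLM

lm = HFLM(pretrained="EleutherAI/pythia-70m")  # roughly: --model hf --model_args pretrained=...
results = evaluator.simple_evaluate(model=lm, tasks=["arc_easy"])  # roughly: --tasks arc_easy
print(results["results"])
```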
......
......@@ -143,7 +143,7 @@ The next thing we need to do is decide what format to use when presenting the da
To write a prompt, users will use `doc_to_text`, `doc_to_target`, and `doc_to_choice` (Optional when certain conditions are met).
`doc_to_text` defines the input string a model will be given while `doc_to_target` and `doc_to_choice` will be used to generate the target text. `doc_to_target` can be either a text string that refers to the target string or an integer that refers to the index of the correct label. When it is set as an index, `doc_to_choice` must be also be set with the appropriate list of possible choice strings.
`doc_to_text` defines the input string a model will be given while `doc_to_target` and `doc_to_choice` will be used to generate the target text. `doc_to_target` can be either a text string that refers to the target string or an integer that refers to the index of the correct label. When it is set as an index, `doc_to_choice` must also be set with the appropriate list of possible choice strings.
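As a rough sketch of how the three keys fit together for a `multiple_choice` task (the column names `question`, `label`, and `choices` are hypothetical):

```python
# Sketch of the three prompt keys for a multiple_choice task.
# The column names ("question", "label", "choices") are hypothetical.
prompt_spec = {
    "doc_to_text": "{{question}}",  # Jinja template for the model input
    "doc_to_target": "label",       # column holding the index of the gold choice
    "doc_to_choice": "choices",     # column holding the list of answer strings
}
```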
### Basic prompts
......@@ -172,7 +172,7 @@ doc_to_choice: choices
We support the [Jinja 2](https://jinja.palletsprojects.com/en/3.1.x/) templating language for writing prompts. In practice, this means you can take your dataset's columns and do many basic string manipulations to place each document into prompted format.
Take for example the dataset `super_glue/boolq`. As input, we'd like to use the features `passage` and `question` and string them together so that for a a sample line `doc`, the model sees something the format of:
Take for example the dataset `super_glue/boolq`. As input, we'd like to use the features `passage` and `question` and string them together so that for a sample line `doc`, the model sees something in the format of:
```
doc["passage"]
Question: doc["question"]?
```
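A hedged sketch of expressing that format as a Jinja 2 template and rendering it for one made-up `doc`:

```python
# Sketch: rendering a Jinja2 prompt template for one boolq-style document.
# The sample `doc` values are made up; the template mirrors the format shown above.
from jinja2 import Template

doc = {
    "passage": "The aurora is caused by charged particles from the sun...",
    "question": "is the aurora caused by the sun",
}
template = Template("{{passage}}\nQuestion: {{question}}?")
print(template.render(**doc))
```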
......@@ -284,7 +284,7 @@ As a heuristic check:
* Do you expect to compute metrics after applying multiple such processing steps on your model outputs?
* Does your task rely on metrics that need a custom implementation?
For more detail on the task system and advanced features, see [`docs/task_guide.md`](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/task_guide.md) . If none of the above sound like they apply to your task, it's time to continue onto checking your task performance!
For more detail on the task system and advanced features, see [`docs/task_guide.md`](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/task_guide.md). If none of the above sounds like it applies to your task, it's time to continue on to checking your task performance!
### Task name + tags (registering a task)
......@@ -383,7 +383,7 @@ task:
### Configuring python classes
There can occasions when yaml-based tasks cannot accommodate how a task is handled. LM-Eval supports the manually implementing tasks as was previously done before `0.4.x`. To register the task, you can simply make a yaml with the name of the task in `task` and the class object in `class` using the `!function` prefix.
There can be occasions when yaml-based tasks cannot accommodate how a task is handled. LM-Eval supports manually implementing tasks, as was previously done before `0.4.x`. To register the task, you can simply make a yaml with the name of the task in `task` and the class object in `class` using the `!function` prefix.
```yaml
task: squadv2
```
......@@ -486,7 +486,7 @@ If other tasks on this dataset are already supported:
It is recommended to include a filled-out copy of this checklist in the README.md for the subfolder you are creating, if you have created a new subfolder in `lm_eval/tasks`.
**Finally, please add a short description of your task(s), along with a link to its subfolder in lm_eval/tasks , to [`lm_eval/tasks/README.md`](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/lm_eval/tasks/README.md) so that users can discover your task in the library, and follow the link to your README for more information about the variants supported, their task names, and the original source of the dataset and/or evaluation setup.**
**Finally, please add a short description of your task(s), along with a link to its subfolder in lm_eval/tasks, to [`lm_eval/tasks/README.md`](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/lm_eval/tasks/README.md) so that users can discover your task in the library, and follow the link to your README for more information about the variants supported, their task names, and the original source of the dataset and/or evaluation setup.**
## Submitting your task
......
......@@ -6,7 +6,7 @@ These YAML configuration files, along with the current codebase commit hash, are
While adding a standard evaluation task on a new dataset can occasionally be as simple as swapping out a Hugging Face dataset path in an existing file, more specialized evaluation setups also exist. Here we'll provide a crash course on the more advanced logic available to users in YAML form.
If your intended task relies on features beyond what are described in this guide, we'd love to hear about it! Feel free to open an issue describing the scenario on Github, create a PR to the project with a proposed implementation, or ask in the `#lm-thunderdome` channel on the EleutherAI discord.
If your intended task relies on features beyond what is described in this guide, we'd love to hear about it! Feel free to open an issue describing the scenario on Github, create a PR to the project with a proposed implementation, or ask in the `#lm-thunderdome` channel on the EleutherAI discord.
## Configurations
......@@ -37,7 +37,7 @@ Prompting / in-context formatting options:
- **doc_to_choice** (`Union[Callable, str]`, *optional*) — Jinja2 template, string, or function to process a sample into a list of possible string choices for `multiple_choice` tasks. Left undefined for `generate_until` tasks.
- **fewshot_delimiter** (`str`, *optional*, defaults to "\n\n") — String to insert between few-shot examples.
- **target_delimiter** (`str`, *optional*, defaults to `" "`) — String to insert between input and target output for the datapoint being tested.
- **assistant_prefill** (`str`, *optional*) — String to append after the <|assistant|> token. For example, if the task is to generate a question, the assistant_prefill could be "The answer is: " to prompt the model to generate an answer to the question. If not using a chat template then this string will be appended to the end of the prompt.
- **gen_prefix** (`str`, *optional*) — String to append after the <|assistant|> token. For example, if the task is to generate a question, the gen_prefix could be "The answer is: " to prompt the model to generate an answer to the question. If not using a chat template then this string will be appended to the end of the prompt.
Runtime configuration options:
- **num_fewshot** (`int`, *optional*, defaults to 0) — Number of few-shot examples before the input.
......@@ -47,7 +47,7 @@ Scoring details:
- **metric_list** (`str`, *optional*, defaults to None) — A list of metrics to use for evaluation. See docs for expected format.
- **output_type** (`str`, *optional*, defaults to "generate_until") — Selects the type of model output for the given task. Options are `generate_until`, `loglikelihood`, `loglikelihood_rolling`, and `multiple_choice`.
- **generation_kwargs** (`dict`, *optional*) — Auxiliary arguments for the `generate` function from HF transformers library. Advanced keyword arguments may not be supported for non-HF LM classes.
- **repeats** (`int`, *optional*, defaults to 1) — Number of repeated runs through model for each sample. can be used for cases such as self-consistency.
- **repeats** (`int`, *optional*, defaults to 1) — Number of repeated runs through model for each sample. Can be used for cases such as self-consistency.
- **filter_list** (`Union[str, list]`, *optional*) — List of filters to postprocess model outputs. See below for further detail on the filter API.
- **should_decontaminate** (`bool`, *optional*, defaults to False) - Whether to decontaminate or not.
- **doc_to_decontamination_query** (`str`, *optional*) — Query for decontamination if `should_decontaminate` is True. If `should_decontaminate` is True but `doc_to_decontamination_query` is `None`, `doc_to_decontamination_query` will follow `doc_to_text`.
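A hedged sketch pulling several of the options above into one illustrative fragment; the values, regex, and metric choices are examples rather than recommendations, and the filter layout follows the common pattern rather than this commit:

```python
# Illustrative fragment combining prompting and scoring options described above.
# All concrete values (regex, metric, names) are examples only.
import yaml

fragment = yaml.safe_load("""
gen_prefix: "The answer is: "
target_delimiter: " "
output_type: generate_until
repeats: 4                      # e.g. for self-consistency style sampling
metric_list:
  - metric: exact_match
    aggregation: mean
    higher_is_better: true
filter_list:
  - name: strip-answer
    filter:
      - function: regex
        regex_pattern: "The answer is: (.*)"
      - function: take_first
""")
```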
......@@ -185,7 +185,7 @@ The prior implementation method of new tasks was to subclass `Task`. While we in
## Including a Base YAML
You can base a YAML on another YAML file as a template. This can be handy when you need to just change the prompt for `doc_to_text` but keep the rest the same or change `filters` to compare which is better. Simply use `include` in the YAML file and write the name of the template you want to base from. This assumes that the base temeplate is in the same directory. Otherwise, You will need to define the full path.
You can base a YAML on another YAML file as a template. This can be handy when you need to just change the prompt for `doc_to_text` but keep the rest the same or change `filters` to compare which is better. Simply use `include` in the YAML file and write the name of the template you want to build on. This assumes that the base template is in the same directory. Otherwise, you will need to define the full path.
```
include: <YAML filename or with full path>
...
```
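A hedged sketch of a derived config built on a base file; `base_task.yaml` and the overridden prompt are hypothetical:

```python
# Sketch of a derived config that includes a base YAML and overrides one field.
# "base_task.yaml" and the new prompt are hypothetical.
import yaml

derived = yaml.safe_load("""
include: base_task.yaml          # resolved relative to this file's directory
task: my_task_prompt_v2
doc_to_text: "Q: {{question}}\\nA:"
""")
```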
......@@ -297,7 +297,7 @@ Tasks using complex filtering:
# Group Configuration
When evaluating a language model, it's is not unusual to test across a number of tasks that may not be related to one another in order to assess a variety of capabilities. To this end, it may be combursome to have to list the set of tasks or add a new group name to each yaml of each individual task.
When evaluating a language model, it is not unusual to test across a number of tasks that may not be related to one another in order to assess a variety of capabilities. To this end, it may be cumbersome to have to list the set of tasks or add a new group name to each yaml of each individual task.
To solve this, we can create a **group** yaml config. This is a config that contains the names of the tasks that should be included in a particular group. The config consists of two main keys: a `group` key which denotes the name of the group (as it would be called from the command line, e.g. `mmlu`) and a `task` key which is where we can list the tasks. The tasks listed in `task` are the task names that have been registered. A good example of a group yaml config can be found at [../lm_eval/tasks/mmlu/default/_mmlu.yaml]. See also the [New Task Guide](./new_task_guide.md) for a more in-depth and tutorial-esque explanation of how to write complex GroupConfigs.
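A minimal sketch of such a group config, with illustrative group and task names (the listed tasks must already be registered):

```python
# Sketch of a group config: a `group` name plus the registered task names it bundles.
# The group name is illustrative; the tasks are existing registered task names.
import yaml

group_config = yaml.safe_load("""
group: my_knowledge_suite
task:
  - arc_easy
  - hellaswag
  - mmlu
""")
```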
......
import warnings

import torch
import torch.nn as nn
from transformer_lens import HookedTransformer
from transformers import AutoConfig

from lm_eval import evaluator
from lm_eval.models.huggingface import HFLM


def evaluate_lm_eval(lens_model: HookedTransformer, tasks: list[str], **kwargs):
    class HFLikeModelAdapter(nn.Module):
        """Adapts HookedTransformer to match the HuggingFace interface expected by lm-eval"""

        def __init__(self, model: HookedTransformer):
            super().__init__()
            self.model = model
            self.tokenizer = model.tokenizer
            self.config = AutoConfig.from_pretrained(model.cfg.tokenizer_name)
            self.device = model.cfg.device
            self.tie_weights = lambda: self

        def forward(self, input_ids=None, attention_mask=None, **kwargs):
            output = self.model(input_ids, attention_mask=attention_mask, **kwargs)
            # Make sure output has the expected .logits attribute
            if not hasattr(output, "logits"):
                if isinstance(output, torch.Tensor):
                    output.logits = output
            return output

        # Only delegate specific attributes we know we need
        def to(self, *args, **kwargs):
            return self.model.to(*args, **kwargs)

        def eval(self):
            self.model.eval()
            return self

        def train(self, mode=True):
            self.model.train(mode)
            return self

    model = HFLikeModelAdapter(lens_model)
    warnings.filterwarnings("ignore", message="Failed to get model SHA for")
    results = evaluator.simple_evaluate(
        model=HFLM(pretrained=model, tokenizer=model.tokenizer),
        tasks=tasks,
        verbosity="WARNING",
        **kwargs,
    )
    return results


if __name__ == "__main__":
    # Load base model
    model = HookedTransformer.from_pretrained("pythia-70m")
    res = evaluate_lm_eval(model, tasks=["arc_easy"])
    print(res["results"])
import logging
import warnings
from functools import partial
from typing import TYPE_CHECKING, Iterable, Optional, Union
......@@ -9,6 +11,8 @@ if TYPE_CHECKING:
from lm_eval.api.task import ConfigurableTask, Task
eval_logger = logging.getLogger("lm-eval")
class ContextSampler:
def __init__(
......@@ -97,6 +101,13 @@ class ContextSampler:
labeled_examples += self.doc_to_choice(doc)[doc_content]
if doc_target != "":
if self.target_delimiter.isspace() and str(doc_target)[0].isspace():
# TODO: add logger warn once here.
warnings.warn(
"Both target_delimiter and target start with a space. This may cause issues.",
Warning,
stacklevel=2,
)
labeled_examples += self.target_delimiter
labeled_examples += prefix
labeled_examples += (
......
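The added warning fires when the few-shot delimiter and the rendered target both begin with whitespace; a minimal standalone illustration of the condition it guards against (values are made up):

```python
# Minimal illustration of the doubled-whitespace case the new warning flags.
# The delimiter and target values are made up.
target_delimiter = " "
doc_target = " True"   # target already rendered with a leading space

if target_delimiter.isspace() and str(doc_target)[0].isspace():
    # the concatenated few-shot example would contain "  True" (two spaces)
    print("warning: delimiter and target both start with whitespace")
```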
......@@ -458,6 +458,7 @@ class Task(abc.ABC):
ctx=fewshot_ctx,
metadata=(self.config["task"], doc_id, self.config.repeats),
apply_chat_template=apply_chat_template,
chat_template=chat_template,
)
if not isinstance(inst, list):
......@@ -1063,6 +1064,8 @@ class ConfigurableTask(Task):
Whether to provide the fewshot examples as a multiturn conversation or a single user turn.
:param chat_template:
callable (from lm.apply_chat_template) that takes in a list[Dict] chat transcript and renders it into a string.
:param gen_prefix:
String to append after the <|assistant|> token.
:returns: str
The fewshot context.
"""
......@@ -1113,6 +1116,8 @@ class ConfigurableTask(Task):
if apply_chat_template:
if self.multiple_input:
# TODO: append prefill?
if not labeled_examples:
return ""
return chat_template(labeled_examples)
if isinstance(example, str):
self.append_target_question(
......@@ -1365,6 +1370,7 @@ class ConfigurableTask(Task):
self, doc: dict, ctx: str, **kwargs
) -> Union[List[Instance], Instance]:
apply_chat_template = kwargs.pop("apply_chat_template", False)
chat_template: Callable | None = kwargs.pop("chat_template", None)
aux_arguments = None
......@@ -1379,9 +1385,20 @@ class ConfigurableTask(Task):
target_delimiter = ""
if self.multiple_input:
# If there are multiple inputs, choices are placed in the ctx
# apply chat_template to choices if apply_chat_template
cont = self.doc_to_target(doc)
arguments = [
(ctx + choice, f"{target_delimiter}{cont}") for choice in choices
(
ctx
+ (
chat_template([{"role": "user", "content": choice}])
if apply_chat_template
else choice
),
f"{target_delimiter}{cont}",
)
for choice in choices
]
else:
# Otherwise they are placed in the continuation
......@@ -1626,7 +1643,7 @@ class ConfigurableTask(Task):
# This allows for multiple metrics to be returned from the same function
for k, v in result_score.items():
result_dict[k] = v
return result_dict
else:
result_dict[metric] = result_score
else:
raise ValueError(
......
......@@ -265,7 +265,7 @@ class TemplateAPI(TemplateLM):
)
else:
# bit of a hack. We'll load back before sending to the API
return JsonChatStr(json.dumps(chat_history))
return JsonChatStr(json.dumps(chat_history, ensure_ascii=False))
@cached_property
def eot_token_id(self) -> Optional[int]:
......
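The `ensure_ascii=False` change keeps non-ASCII characters readable in the serialized chat history instead of escaping them; a quick standalone illustration:

```python
# Standalone illustration of the ensure_ascii change: non-ASCII content stays
# readable in the serialized chat history instead of being \u-escaped.
import json

chat_history = [{"role": "user", "content": "¿Cuál es la capital de España?"}]
print(json.dumps(chat_history))                      # ...\u00bfCu\u00e1l es la capital de Espa\u00f1a?...
print(json.dumps(chat_history, ensure_ascii=False))  # ...¿Cuál es la capital de España?...
```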
......@@ -75,7 +75,6 @@ class VLLM(TemplateLM):
"Please install vllm via `pip install lm-eval[vllm]` or `pip install -e .[vllm]`"
)
assert "cuda" in device or device is None, "vLLM only supports CUDA"
assert max_length is None or max_model_len is None, (
"Either max_length or max_model_len may be provided, but not both"
)
......@@ -110,7 +109,7 @@ class VLLM(TemplateLM):
eval_logger.warning(
"You might experience occasional issues with model weight downloading when data_parallel is in use. To ensure stable performance, run with data_parallel_size=1 until the weights are downloaded and cached."
)
self.model_args["worker_use_ray"] = True
self.model_args["distributed_executor_backend"] = "ray"
self.batch_size = "auto"
eval_logger.info("Manual batching is not compatible with data parallelism.")
......@@ -247,9 +246,7 @@ class VLLM(TemplateLM):
# vLLM hangs if tensor_parallel > 1 and resources are set in ray.remote
# also seems to only work with decorator and not with ray.remote() fn
# see https://github.com/vllm-project/vllm/issues/973
# note: this has changed on 0.3.3, and it only works now if num_gpus are set.
# but then tensor_parallel breaks
@ray.remote
@ray.remote(num_gpus=1 if self.tensor_parallel_size == 1 else None)
def run_inference_one_model(
model_args: dict,
sampling_params,
......
......@@ -109,9 +109,7 @@ class VLLM_VLM(VLLM):
# vLLM hangs if tensor_parallel > 1 and resources are set in ray.remote
# also seems to only work with decorator and not with ray.remote() fn
# see https://github.com/vllm-project/vllm/issues/973
# note: this has changed on 0.3.3, and it only works now if num_gpus are set.
# but then tensor_parallel breaks
@ray.remote
@ray.remote(num_gpus=1 if self.tensor_parallel_size == 1 else None)
def run_inference_one_model(
model_args: dict, sampling_params, requests: List[List[dict]]
):
......@@ -271,7 +269,9 @@ class VLLM_VLM(VLLM):
left_truncate_len=max_ctx_len,
)
cont = self._model_generate(inputs, stop=until, generate=True, **kwargs)
cont = self._model_generate(
inputs, stop=until, generate=True, max_tokens=max_gen_toks, **kwargs
)
for output, context in zip(cont, contexts):
generated_text = output.outputs[0].text
......
......@@ -6,7 +6,7 @@
For more information, including a full list of task names and their precise meanings or sources, follow the links provided to the individual README.md files for each subfolder.
| Task Family | Description | Language(s) |
|--------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------------|
|--------------------------------------------------------------------------|-------------|-------------------------------------------------------------------------------------------------------------------------------|
| [aclue](aclue/README.md) | Tasks focusing on ancient Chinese language understanding and cultural aspects. | Ancient Chinese |
| [aexams](aexams/README.md) | Tasks in Arabic related to various academic exams covering a range of subjects. | Arabic |
| [agieval](agieval/README.md) | Tasks involving historical data or questions related to history and historical texts. | English, Chinese |
......@@ -42,9 +42,10 @@
| [eus_proficiency](eus_proficiency/README.md) | Tasks designed to test proficiency in the Basque language across various topics. | Basque |
| [eus_reading](eus_reading/README.md) | Reading comprehension tasks specifically designed for the Basque language. | Basque |
| [eus_trivia](eus_trivia/README.md) | Trivia and knowledge testing tasks in the Basque language. | Basque |
| [evalita-LLM](evalita-LLM/README.md) | A native Italian benchmark with diverse task formats and multiple prompts. | Italian |
| [fda](fda/README.md) | Tasks for extracting key-value pairs from FDA documents to test information extraction. | English |
| [fld](fld/README.md) | Tasks involving free-form and directed dialogue understanding. | English |
| [french_bench](french_bench/README.md) | Set of tasks designed to assess language model performance in French. | French|
| [french_bench](french_bench/README.md) | Set of tasks designed to assess language model performance in French. | French |
| [galician_bench](galician_bench/README.md) | Collection of tasks in Galician encompassing various evaluation areas. | Galician |
| [global_mmlu](global_mmlu/README.md) | Collection of culturally sensitive and culturally agnostic MMLU tasks in 15 languages with human translations or post-edits. | Multiple (15 languages) |
| [glue](glue/README.md) | General Language Understanding Evaluation benchmark to test broad language abilities. | English |
......@@ -55,6 +56,8 @@
| [hellaswag](hellaswag/README.md) | Tasks to predict the ending of stories or scenarios, testing comprehension and creativity. | English |
| [hendrycks_ethics](hendrycks_ethics/README.md) | Tasks designed to evaluate the ethical reasoning capabilities of models. | English |
| [hendrycks_math](hendrycks_math/README.md) | Mathematical problem-solving tasks to test numerical reasoning and problem-solving. | English |
| [histoires_morales](histoires_morales/README.md) | A dataset of structured narratives that describe normative and norm-divergent actions taken by individuals to accomplish certain intentions in concrete situations. | French (Some MT) |
| [hrm8k](hrm8k/README.md) | A challenging bilingual math reasoning benchmark for Korean and English. | Korean (Some MT), English (Some MT) |
| [humaneval](humaneval/README.md) | Code generation task that measures functional correctness for synthesizing programs from docstrings. | Python |
| [ifeval](ifeval/README.md) | Interactive fiction evaluation tasks for narrative understanding and reasoning. | English |
| [inverse_scaling](inverse_scaling/README.md) | Multiple-choice tasks from the Inverse Scaling Prize, designed to find settings where larger language models perform worse. | English |
......@@ -83,8 +86,10 @@
| [mlqa](mlqa/README.md) | MultiLingual Question Answering benchmark dataset for evaluating cross-lingual question answering performance. | English, Arabic, German, Spanish, Hindi, Vietnamese, Simplified Chinese |
| [mmlu](mmlu/README.md) | Massive Multitask Language Understanding benchmark for broad domain language evaluation. Several variants are supported. | English |
| [mmlu_pro](mmlu_pro/README.md) | A refined set of MMLU, integrating more challenging, reasoning-focused questions and expanding the choice set from four to ten options. | English |
| [mmlu-pro-plus](mmlu-pro-plus/README.md) | A new test set for evaluating shortcut learning and higher-order reasoning of LLMs. | English |
| [mmlusr](mmlusr/README.md) | Variation of MMLU designed to be more rigorous. | English |
| model_written_evals | Evaluation tasks auto-generated for evaluating a collection of AI Safety concerns. | |
| [moral_stories](moral_stories/README.md) | A crowd-sourced dataset of structured narratives that describe normative and norm-divergent actions taken by individuals to accomplish certain intentions in concrete situations. | English |
| [mutual](mutual/README.md) | A retrieval-based dataset for multi-turn dialogue reasoning. | English |
| [nq_open](nq_open/README.md) | Open domain question answering tasks based on the Natural Questions dataset. | English |
| [okapi/arc_multilingual](okapi/arc_multilingual/README.md) | Tasks that involve reading comprehension and information retrieval challenges. | Multiple (31 languages) **Machine Translated.** |
......
......@@ -9,4 +9,4 @@ aggregate_metric_list:
- metric: acc
weight_by_size: True
metadata:
version: 0
version: 1
......@@ -6,4 +6,4 @@ aggregate_metric_list:
- metric: acc
weight_by_size: True
metadata:
version: 0
version: 1
......@@ -6,4 +6,4 @@ aggregate_metric_list:
- metric: acc
weight_by_size: True
metadata:
version: 0
version: 1
......@@ -6,4 +6,4 @@ aggregate_metric_list:
- metric: acc
weight_by_size: True
metadata:
version: 0
version: 1
......@@ -6,4 +6,4 @@ aggregate_metric_list:
- metric: acc
weight_by_size: True
metadata:
version: 0
version: 1
......@@ -6,4 +6,4 @@ aggregate_metric_list:
- metric: acc
weight_by_size: True
metadata:
version: 0
version: 1
dataset_path: yazeed7/ArabicMMLU
dataset_path: MBZUAI/ArabicMMLU
test_split: test
fewshot_split: dev
fewshot_config:
......@@ -12,4 +12,4 @@ metric_list:
aggregation: mean
higher_is_better: true
metadata:
version: 0.0
version: 1.0
......@@ -14,46 +14,46 @@ eval_logger = logging.getLogger("lm-eval")
SUBJECTS = {
"Driving Test": "other",
"High Geography": "social_science",
"High History": "humanities",
"Islamic Studies": "humanities",
"Univ Accounting": "social_science",
"Primary General Knowledge": "other",
"Univ Political Science": "social_science",
"Primary Math": "stem",
"Middle General Knowledge": "other",
"High Biology": "stem",
"Primary Natural Science": "stem",
"High Economics": "social_science",
"Middle Natural Science": "stem",
"Middle Geography": "social_science",
"Primary Social Science": "social_science",
"Middle Computer Science": "stem",
"Middle Islamic Studies": "humanities",
"Primary Computer Science": "stem",
"High Physics": "stem",
"Middle Social Science": "social_science",
"Middle Civics": "social_science",
"High Computer Science": "stem",
"Driving Test": "other",
"Natural Science (Middle School)": "stem",
"Natural Science (Primary School)": "stem",
"History (Primary School)": "humanities",
"History (Middle School)": "humanities",
"History (High School)": "humanities",
"General Knowledge": "other",
"High Civics": "social_science",
"Prof Law": "humanities",
"High Islamic Studies": "humanities",
"Primary Arabic Language": "language",
"High Arabic Language": "language",
"Arabic Language (Grammar)": "language",
"Primary History": "humanities",
"Middle History": "humanities",
"Univ Economics": "social_science",
"General Knowledge (Primary School)": "other",
"General Knowledge (Middle School)": "other",
"Law (Professional)": "humanities",
"Physics (High School)": "stem",
"Social Science (Middle School)": "social_science",
"Social Science (Primary School)": "social_science",
"Management (University)": "other",
"Arabic Language (Primary School)": "language",
"Arabic Language (Middle School)": "language",
"Arabic Language (High School)": "language",
"Political Science (University)": "social_science",
"Philosophy (High School)": "humanities",
"Accounting (University)": "social_science",
"Computer Science (University)": "stem",
"Computer Science (Middle School)": "stem",
"Computer Science (Primary School)": "stem",
"Computer Science (High School)": "stem",
"Geography (Primary School)": "social_science",
"Geography (Middle School)": "social_science",
"Geography (High School)": "social_science",
"Math (Primary School)": "stem",
"Biology (High School)": "stem",
"Economics (University)": "social_science",
"Economics (Middle School)": "social_science",
"Economics (High School)": "social_science",
"Arabic Language (General)": "language",
"Univ Computer Science": "stem",
"Primary Islamic Studies": "humanities",
"Primary Geography": "social_science",
"High Philosophy": "humanities",
"Middle Arabic Language": "language",
"Middle Economics": "social_science",
"Univ Management": "other",
"Arabic Language (Grammar)": "language",
"Islamic Studies (High School)": "humanities",
"Islamic Studies (Middle School)": "humanities",
"Islamic Studies (Primary School)": "humanities",
"Civics (Middle School)": "social_science",
"Civics (High School)": "social_science",
}
......@@ -69,8 +69,9 @@ if __name__ == "__main__":
# get filename of base_yaml so we can `"include": ` it in our "other" YAMLs.
base_yaml_name = os.path.split(args.base_yaml_path)[-1]
with open(args.base_yaml_path, encoding="utf-8") as f:
base_yaml = yaml.full_load(f)
# with open(args.base_yaml_path, encoding="utf-8") as f:
# base_yaml = yaml.full_load(f)
ALL_CATEGORIES = []
for subject, category in tqdm(SUBJECTS.items()):
......@@ -81,8 +82,8 @@ if __name__ == "__main__":
yaml_dict = {
"include": base_yaml_name,
"tag": f"arabicmmlu_{category}",
"task": f"arabicmmlu_{subject.lower().replace(' ', '_')}",
"tag": f"arabicmmlu_{category}_tasks",
"task": f"arabicmmlu_{subject.lower().replace(' ', '_').replace('(', '').replace(')', '')}",
"task_alias": subject,
"dataset_name": subject,
# "description": description,
......
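The renamed subjects feed the updated task-name expression above; a quick sketch of the resulting transformation for one subject:

```python
# Sketch of the task-name transformation applied to the renamed subjects above.
subject = "Math (Primary School)"
task_name = (
    f"arabicmmlu_{subject.lower().replace(' ', '_').replace('(', '').replace(')', '')}"
)
print(task_name)  # -> arabicmmlu_math_primary_school
```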