Commit 60c9c170 authored by haileyschoelkopf

Merge branch 'main' into inverse-scaling-tasks

parents 4b2d565b b4cd85d4
# Generated by utils.py
dataset_name: eu_osakidetza3e
include: eus_exams_eu
task: eus_exams_eu_osakidetza3e
# Generated by utils.py
dataset_name: eu_osakidetza5e
include: eus_exams_eu
task: eus_exams_eu_osakidetza5e
# Generated by utils.py
dataset_name: eu_osakidetza6e
include: eus_exams_eu
task: eus_exams_eu_osakidetza6e
# Generated by utils.py
dataset_name: eu_osakidetza7e
include: eus_exams_eu
task: eus_exams_eu_osakidetza7e
import datasets


def process_docs(dataset: datasets.Dataset):
    """Filter out examples with no answer."""

    def valid_example(example: dict) -> bool:
        """Check if an example is valid."""
        if example["answer"] not in [0, 1, 2, 3]:
            return False
        if example["candidates"] == ["", "", "", ""]:
            return False
        return True

    return dataset.filter(valid_example)
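A quick way to sanity-check the filter is to run it on a toy in-memory dataset. The field values below are invented; only the field names mirror those used in `valid_example`, so treat this as an illustrative sketch rather than real data:

```python
import datasets

# Toy dataset: the second row has an out-of-range answer and the third has
# empty candidates, so only the first row should survive the filter.
toy = datasets.Dataset.from_dict(
    {
        "question": ["Q1", "Q2", "Q3"],
        "candidates": [["a", "b", "c", "d"], ["a", "b", "c", "d"], ["", "", "", ""]],
        "answer": [0, 7, 1],
    }
)

filtered = process_docs(toy)
print(len(filtered))  # expected: 1
```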
# EusProficiency
### Paper
Title: Latxa: An Open Language Model and Evaluation Suite for Basque
Abstract: https://arxiv.org/abs/2403.20266
EusProficiency comprises 5,169 exercises on different topics from past EGA exams, the official C1-level certificate of proficiency in Basque. We collected the atarikoa exercises from EGA exams from 1998 to 2008. Atarikoa is the first qualifying test of EGA, which measures different aspects of language competency, such as reading comprehension, grammar, vocabulary, spelling, and writing. Each test generally has 85 multiple-choice questions, with 4 choices and a single correct answer.
Homepage: https://github.com/hitz-zentroa/latxa
### Citation
```
@misc{etxaniz2024latxa,
      title={Latxa: An Open Language Model and Evaluation Suite for Basque},
      author={Julen Etxaniz and Oscar Sainz and Naiara Perez and Itziar Aldabe and German Rigau and Eneko Agirre and Aitor Ormazabal and Mikel Artetxe and Aitor Soroa},
      year={2024},
      eprint={2403.20266},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```
### Groups and Tasks
#### Groups
There are no groups.
#### Tasks
* `eus_proficiency`: EusProficiency comprises 5,169 exercises on different topics from past EGA exams, the official C1-level certificate of proficiency in Basque.
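For reference, one way to run the task is through the harness's Python entry point. The model below is only a placeholder and the few-shot count is illustrative; the exact keyword arguments may differ slightly across harness versions:

```python
import lm_eval

# Sketch invocation: "EleutherAI/pythia-70m" is just a placeholder model name.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=EleutherAI/pythia-70m",
    tasks=["eus_proficiency"],
    num_fewshot=5,
)
print(results["results"]["eus_proficiency"])
```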
### Checklist
For adding novel benchmarks/datasets to the library:
* [ ] Is the task an existing benchmark in the literature?
* [ ] Have you referenced the original paper that introduced the task?
* [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
If other tasks on this dataset are already supported:
* [ ] Is the "Main" variant of this task clearly denoted?
* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
dataset_path: HiTZ/EusProficiency
dataset_name: default
task: eus_proficiency
doc_to_text: "Galdera: {{question}}\nA: {{candidates[0]}}\nB: {{candidates[1]}}\nC: {{candidates[2]}}\nD: {{candidates[3]}}\nErantzuna:"
doc_to_choice: ["A", "B", "C", "D"]
validation_split: null
test_split: test
fewshot_split: test
output_type: multiple_choice
doc_to_target: answer
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
metadata:
  version: 0.0
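The `doc_to_text` field in the config above is a Jinja-style template, so its output can be previewed directly with `jinja2` (which the harness uses for string templates). The Basque question and candidates below are invented placeholders:

```python
from jinja2 import Template

# Invented example document; real EusProficiency docs follow the same schema.
doc = {
    "question": "Zein da zuzena?",
    "candidates": ["aukera bat", "aukera bi", "aukera hiru", "aukera lau"],
}

template = Template(
    "Galdera: {{question}}\nA: {{candidates[0]}}\nB: {{candidates[1]}}\n"
    "C: {{candidates[2]}}\nD: {{candidates[3]}}\nErantzuna:"
)
print(template.render(**doc))
```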
# EusReading
### Paper
Title: Latxa: An Open Language Model and Evaluation Suite for Basque
Abstract: https://arxiv.org/abs/2403.20266
EusReading consists of 352 reading comprehension exercises (irakurmena) sourced from the set of past EGA exams from 1998 to 2008. Each test generally has 10 multiple-choice questions, with 4 choices and a single correct answer. These exercises are more challenging than Belebele due to the complexity and length of the input texts. As a result, EusReading is useful for measuring the long-context understanding of models.
Homepage: https://github.com/hitz-zentroa/latxa
### Citation
```
@misc{etxaniz2024latxa,
      title={Latxa: An Open Language Model and Evaluation Suite for Basque},
      author={Julen Etxaniz and Oscar Sainz and Naiara Perez and Itziar Aldabe and German Rigau and Eneko Agirre and Aitor Ormazabal and Mikel Artetxe and Aitor Soroa},
      year={2024},
      eprint={2403.20266},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```
### Groups and Tasks
#### Groups
There are no groups.
#### Tasks
* `eus_reading`: EusReading consists of 352 reading comprehension exercises (irakurmena) sourced from the set of past EGA exams from 1998 to 2008.
### Checklist
For adding novel benchmarks/datasets to the library:
* [ ] Is the task an existing benchmark in the literature?
* [ ] Have you referenced the original paper that introduced the task?
* [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
If other tasks on this dataset are already supported:
* [ ] Is the "Main" variant of this task clearly denoted?
* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
dataset_path: HiTZ/EusReading
dataset_name: default
task: eus_reading
doc_to_text: !function utils.doc_to_text_context
doc_to_choice: !function utils.doc_to_choice
validation_split: null
test_split: test
fewshot_split: test
output_type: multiple_choice
doc_to_target: answer
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
metadata:
  version: 0.0
from typing import List

letters = ["A", "B", "C", "D"]


def doc_to_text_context(doc) -> str:
    """
    Converts a document to a formatted string.

    Args:
        doc (dict): A dictionary containing the document information.

    Returns:
        str: A formatted string containing the question and answer choices.
    """
    candidates = doc["candidates"]
    num_choices = len(candidates)
    if num_choices < 2:
        raise ValueError("Invalid number of candidates")
    choices = letters[:num_choices]
    formatted_choices = "\n".join(
        [f"{choice}: {candidates[i]}" for i, choice in enumerate(choices)]
    )
    return f"Pasartea: {doc['context']}\n\nGaldera: {doc['question']}\n{formatted_choices}\nErantzuna:"


def doc_to_choice(doc) -> List[str]:
    """
    Returns the answer choices for a document.

    Args:
        doc (dict): A dictionary containing the document information.

    Returns:
        list: A list of strings containing the answer choices.
    """
    num_choices = len(doc["candidates"])
    if num_choices < 2:
        raise ValueError("Invalid number of candidates")
    return letters[:num_choices]
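As a rough illustration of the prompt format, the helpers prepend the reading passage before the question. The passage, question, and options below are made up:

```python
# Made-up EusReading-style document with three answer options.
doc = {
    "context": "Euskara hizkuntza bat da.",
    "question": "Zer da euskara?",
    "candidates": ["hizkuntza bat", "herri bat", "mendi bat"],
}

print(doc_to_text_context(doc))
# Pasartea: Euskara hizkuntza bat da.
#
# Galdera: Zer da euskara?
# A: hizkuntza bat
# B: herri bat
# C: mendi bat
# Erantzuna:
```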
# EusTrivia
### Paper
Title: Latxa: An Open Language Model and Evaluation Suite for Basque
Abstract: https://arxiv.org/abs/2403.20266
EusTrivia consists of 1,715 trivia questions from multiple online sources. 56.3% of the questions are elementary level (grades 3-6), while the rest are considered challenging. A significant portion of the questions focuses specifically on the Basque Country, its language and culture. Each multiple-choice question contains two, three or four choices (3.84 on average) and a single correct answer. Five areas of knowledge are covered:
- **Humanities and Natural Sciences** (27.8%): This category encompasses questions about history, geography, biology, ecology and other social and natural sciences.
- **Leisure and Art** (24.5%): This category includes questions on sports and athletes, performative and plastic arts and artists, architecture, cultural events, and related topics.
- **Music** (16.0%): This category groups all questions about music and musicians, both classical and contemporary.
- **Language and Literature** (17.1%): This category is concerned with all kinds of literature productions and writers, as well as metalinguistic questions (e.g., definitions, synonyms, and word usage).
- **Mathematics and ICT** (14.5%): This category covers mathematical problems and questions about ICT, as well as questions about people known for their contributions to these fields of knowledge.
Homepage: https://github.com/hitz-zentroa/latxa
### Citation
```
@misc{etxaniz2024latxa,
      title={Latxa: An Open Language Model and Evaluation Suite for Basque},
      author={Julen Etxaniz and Oscar Sainz and Naiara Perez and Itziar Aldabe and German Rigau and Eneko Agirre and Aitor Ormazabal and Mikel Artetxe and Aitor Soroa},
      year={2024},
      eprint={2403.20266},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```
### Groups and Tasks
#### Groups
There are no groups.
#### Tasks
* `eus_trivia`: EusTrivia consists of 1,715 trivia questions from multiple online sources.
### Checklist
For adding novel benchmarks/datasets to the library:
* [ ] Is the task an existing benchmark in the literature?
* [ ] Have you referenced the original paper that introduced the task?
* [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
If other tasks on this dataset are already supported:
* [ ] Is the "Main" variant of this task clearly denoted?
* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
dataset_path: HiTZ/EusTrivia
dataset_name: default
task: eus_trivia
doc_to_text: !function utils.doc_to_text
doc_to_choice: !function utils.doc_to_choice
validation_split: null
test_split: test
fewshot_split: test
output_type: multiple_choice
doc_to_target: answer
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
metadata:
  version: 0.0
from typing import List

letters = ["A", "B", "C", "D"]


def doc_to_text(doc) -> str:
    """
    Converts a document to a formatted string.

    Args:
        doc (dict): A dictionary containing the document information.

    Returns:
        str: A formatted string containing the question and answer choices.
    """
    candidates = doc["candidates"]
    num_choices = len(candidates)
    if num_choices < 2:
        raise ValueError("Invalid number of candidates")
    choices = letters[:num_choices]
    formatted_choices = "\n".join(
        [f"{choice}: {candidates[i]}" for i, choice in enumerate(choices)]
    )
    return f"Galdera: {doc['question']}\n{formatted_choices}\nErantzuna:"


def doc_to_choice(doc) -> List[str]:
    """
    Returns the answer choices for a document.

    Args:
        doc (dict): A dictionary containing the document information.

    Returns:
        list: A list of strings containing the answer choices.
    """
    num_choices = len(doc["candidates"])
    if num_choices < 2:
        raise ValueError("Invalid number of candidates")
    return letters[:num_choices]
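Because EusTrivia questions carry between two and four candidates, the helpers truncate the letter list to match. A small invented two-choice example:

```python
# Invented two-choice trivia question.
doc = {"question": "Bai ala ez?", "candidates": ["bai", "ez"]}

print(doc_to_text(doc))
# Galdera: Bai ala ez?
# A: bai
# B: ez
# Erantzuna:

print(doc_to_choice(doc))  # ['A', 'B']
```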
# FDA
### Paper
Title: Language Models Enable Simple Systems for Generating Structured Views of Heterogeneous Data Lakes
Abstract: A long-standing goal of the data management community is to develop general, automated systems that ingest semi-structured documents and output queryable tables without human effort or domain-specific customization. Given the sheer variety of potential documents, state-of-the-art systems make simplifying assumptions and use domain-specific training. In this work, we ask whether we can maintain generality by using large language models (LLMs). LLMs, which are pretrained on broad data, can perform diverse downstream tasks simply conditioned on natural language task descriptions. We propose and evaluate EVAPORATE, a simple, prototype system powered by LLMs. We identify two fundamentally different strategies for implementing this system: prompt the LLM to directly extract values from documents or prompt the LLM to synthesize code that performs the extraction. Our evaluations show a cost-quality tradeoff between these two approaches. Code synthesis is cheap, but far less accurate than directly processing each document with the LLM. To improve quality while maintaining low cost, we propose an extended code synthesis implementation, EVAPORATE-CODE+, which achieves better quality than direct extraction. Our key insight is to generate many candidate functions and ensemble their extractions using weak supervision. EVAPORATE-CODE+ not only outperforms the state-of-the-art systems, but does so using a sublinear pass over the documents with the LLM. This equates to a 110× reduction in the number of tokens the LLM needs to process, averaged across 16 real-world evaluation settings of 10k documents each.
A task for LMs to perform Information Extraction, as implemented by Based.
Homepage: https://github.com/HazyResearch/based-evaluation-harness
Description:
> FDA (Information Extraction). The task is to extract key-value pairs from a set of PDFs scraped from the FDA website. We use the dataset and labels collected in Arora et al. 2023. We break apart the documents into chunks of 1,920 tokens. For every key-value pair that appears in the chunk, we create a zero-shot prompt using the simple prompt template: {chunk} \n {key}: We allow the model to generate a fixed number of tokens after the prompt and check (with case insensitivity) if the value is contained within the generation. We report accuracy, the fraction of prompts for which the generation contains the value.
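The scoring described above boils down to building `{chunk} \n {key}:` prompts and checking whether the gold value appears, case-insensitively, in the generation. A minimal, non-authoritative sketch of that loop, with a stand-in `generate` callable and an assumed per-example field layout (`chunk`, `key`, `value`):

```python
def fda_accuracy(examples, generate, max_gen_toks=48):
    """Sketch of the FDA protocol. `generate` stands in for the model call and
    each example is assumed to carry a document chunk, a key, and a gold value."""
    hits = 0
    for ex in examples:
        prompt = f"{ex['chunk']}\n{ex['key']}:"
        completion = generate(prompt, max_gen_toks)
        # Case-insensitive containment check of the gold value.
        hits += int(ex["value"].lower() in completion.lower())
    return hits / len(examples)
```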
### Citation
```
@misc{arora2024simple,
      title={Simple linear attention language models balance the recall-throughput tradeoff},
      author={Simran Arora and Sabri Eyuboglu and Michael Zhang and Aman Timalsina and Silas Alberti and Dylan Zinsley and James Zou and Atri Rudra and Christopher Ré},
      year={2024},
      eprint={2402.18668},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

@misc{arora2023language,
      title={Language Models Enable Simple Systems for Generating Structured Views of Heterogeneous Data Lakes},
      author={Simran Arora and Brandon Yang and Sabri Eyuboglu and Avanika Narayan and Andrew Hojel and Immanuel Trummer and Christopher Ré},
      year={2023},
      eprint={2304.09433},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```
### Groups and Tasks
#### Tasks
* `fda`: the FDA task as implemented in the paper "Simple linear attention language models balance the recall-throughput tradeoff". Designed for zero-shot evaluation of small LMs.
### Checklist
For adding novel benchmarks/datasets to the library:
* [x] Is the task an existing benchmark in the literature?
* [x] Have you referenced the original paper that introduced the task?
* [x] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
If other tasks on this dataset are already supported:
* [x] Is the "Main" variant of this task clearly denoted?
* [x] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [x] Have you noted which, if any, published evaluation setups are matched by this variant?
task: fda
class: !function task.FDA
"""
"""
import re
from typing import List

import numpy as np

from lm_eval.api.instance import Instance
from lm_eval.api.task import ConfigurableTask


class FDA(ConfigurableTask):
    VERSION = 0
    DATASET_PATH = "hazyresearch/based-fda"
    DATASET_NAME = "default"

    def __init__(self):
        super().__init__(config={"metadata": {"version": self.VERSION}})

    def has_training_docs(self):
        return False

    def has_validation_docs(self):
        return True

    def has_test_docs(self):
        return False

    def validation_docs(self):
        return self.dataset["validation"]

    def doc_to_text(self, doc):
        return doc["text"]

    def doc_to_target(self, doc):
        return doc["value"]

    def construct_requests(self, doc, ctx, **kwargs):
        """Uses RequestFactory to construct Requests and returns an iterable of
        Requests which will be sent to the LM.

        :param doc:
            The document as returned from training_docs, validation_docs, or test_docs.
        :param ctx: str
            The context string, generated by fewshot_context. This includes the natural
            language description, as well as the few-shot examples, and the question
            part of the document for `doc`.
        """
        return [
            Instance(
                request_type="generate_until",
                doc=doc,
                arguments=(ctx, {"until": ["\n"], "max_gen_toks": 48}),
                idx=0,
                **kwargs,
            )
        ]

    def process_results(self, doc, results):
        """Take a single document and the LM results and evaluates, returning a
        dict where keys are the names of submetrics and values are the values of
        the metric for that one document.

        :param doc:
            The document as returned from training_docs, validation_docs, or test_docs.
        :param results:
            The results of the requests created in construct_requests.
        """
        # continuation, (logprob_unanswerable, _) = results
        continuation = results
        return {"contains": contains_score(continuation[0], [doc["value"]])}

    def aggregation(self):
        """
        :returns: {str: [float] -> float}
            A dictionary where keys are the names of submetrics and values are
            functions that aggregate a list of metrics
        """
        return {
            "contains": np.mean,  # Fraction of documents whose generation contains the gold value
        }

    def higher_is_better(self):
        """
        :returns: {str: bool}
            A dictionary where keys are the names of submetrics and values are
            whether a higher value of the submetric is better
        """
        return {
            "contains": True,
        }


def contains_score(prediction: str, labels: List[str]):
    return max(
        int(bool(re.search(re.compile(re.escape(label), re.IGNORECASE), prediction)))
        for label in labels
    )
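For example, `contains_score` treats any case-insensitive occurrence of a gold label as a hit; the strings below are made-up illustrations:

```python
print(contains_score("The approval date was 2017-05-12.", ["2017-05-12"]))  # 1
print(contains_score("No date given.", ["2017-05-12"]))                     # 0
print(contains_score("APPROVAL DATE: 2017-05-12", ["2017-05-12", "n/a"]))   # 1
```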
@@ -5,11 +5,11 @@ dataset_name: qqp
 output_type: multiple_choice
 training_split: train
 validation_split: validation
-doc_to_text: "\nSentence 1: {{question1}}\nSentence 2: {{question2}}\nAnswer:"
+doc_to_text: "Question 1: {{question1}}\nQuestion 2: {{question2}}\nQuestion: Do both questions ask the same thing?\nAnswer:"
 doc_to_target: label
 doc_to_choice: ["no", "yes"]
 metric_list:
 - metric: acc
 - metric: f1
 metadata:
-  version: 1.0
+  version: 2.0
# MATH
## Paper
Measuring Mathematical Problem Solving With the MATH Dataset
https://arxiv.org/abs/2103.03874
Many intellectual endeavors require mathematical problem solving, but this skill remains beyond the capabilities of computers. To measure this ability in machine learning models, we introduce MATH, a new dataset of 12,500 challenging competition mathematics problems. Each problem in MATH has a full step-by-step solution which can be used to teach models to generate answer derivations and explanations.
NOTE: This task corresponds to the MATH (`hendrycks_math`) implementation at https://github.com/EleutherAI/lm-evaluation-harness/tree/master . For the variant which uses the custom 4-shot prompt in the Minerva paper (https://arxiv.org/abs/2206.14858), and SymPy answer checking as done by Minerva, see `lm_eval/tasks/minerva_math`.
Homepage: https://github.com/hendrycks/math
## Citation
```
@article{hendrycksmath2021,
title={Measuring Mathematical Problem Solving With the MATH Dataset},
author={Dan Hendrycks and Collin Burns and Saurav Kadavath and Akul Arora and Steven Basart and Eric Tang and Dawn Song and Jacob Steinhardt},
journal={NeurIPS},
year={2021}
}
```
### Groups and Tasks
#### Groups
- `hendrycks_math`: the MATH benchmark from Hendrycks et al., evaluated zero- or few-shot.
#### Tasks
- `hendrycks_math_algebra`
- `hendrycks_math_counting_and_prob`
- `hendrycks_math_geometry`
- `hendrycks_math_intermediate_algebra`
- `hendrycks_math_num_theory`
- `hendrycks_math_prealgebra`
- `hendrycks_math_precalc`
### Checklist
The checklist is the following:
For adding novel benchmarks/datasets to the library:
* [x] Is the task an existing benchmark in the literature?
* [x] Have you referenced the original paper that introduced the task?
* [x] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
* Answer extraction code is taken from the original MATH benchmark paper's repository.
If other tasks on this dataset are already supported:
* [x] Is the "Main" variant of this task clearly denoted?
* [x] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [x] Have you noted which, if any, published evaluation setups are matched by this variant?
group: hendrycks_math
task:
- hendrycks_math_algebra
- hendrycks_math_counting_and_prob
- hendrycks_math_geometry
- hendrycks_math_intermediate_algebra
- hendrycks_math_num_theory
- hendrycks_math_prealgebra
- hendrycks_math_precalc
group:
- math_word_problems
task: hendrycks_math_algebra
dataset_path: EleutherAI/hendrycks_math
process_docs: !function utils.process_docs
dataset_name: algebra
output_type: generate_until
training_split: train
test_split: test
doc_to_text: "Problem: {{problem}}\nAnswer:"
process_results: !function utils.process_results
doc_to_target: "{{answer}}"
generation_kwargs:
  until:
    - "Problem:"
  do_sample: false
  temperature: 0
metric_list:
  - metric: exact_match
    aggregation: mean
    higher_is_better: true
metadata:
  version: 1.0
dataset_kwargs:
  trust_remote_code: true
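The config delegates scoring to `utils.process_results`, which is not shown in this commit. As a rough, non-authoritative sketch of the idea behind the `exact_match` metric: MATH solutions place the final answer inside a `\boxed{...}` command, so the core of the check is extracting the last boxed expression from the generation and comparing it to the gold answer string. The helper and result shape below are illustrative assumptions; the repository's actual answer-extraction code (taken from the original MATH codebase) is more elaborate.

```python
from typing import Optional


def last_boxed_answer(text: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in `text`, handling nested braces.

    Illustrative helper only; the harness's own extraction code may differ.
    """
    start = text.rfind("\\boxed{")
    if start == -1:
        return None
    i = start + len("\\boxed{")
    depth = 1
    out = []
    while i < len(text) and depth > 0:
        ch = text[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                break
        out.append(ch)
        i += 1
    return "".join(out) if depth == 0 else None


def sketch_process_results(doc: dict, results: list) -> dict:
    # Hypothetical shape: compare the extracted answer against doc["answer"].
    pred = last_boxed_answer(results[0]) or results[0].strip()
    return {"exact_match": int(pred == doc["answer"])}
```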