Commit 60c9c170 authored by haileyschoelkopf

Merge branch 'main' into inverse-scaling-tasks

parents 4b2d565b b4cd85d4
# Generated by utils.py
dataset_name: eu_osakidetza3e
include: eus_exams_eu
task: eus_exams_eu_osakidetza3e
# Generated by utils.py
dataset_name: eu_osakidetza5e
include: eus_exams_eu
task: eus_exams_eu_osakidetza5e
# Generated by utils.py
dataset_name: eu_osakidetza6e
include: eus_exams_eu
task: eus_exams_eu_osakidetza6e
# Generated by utils.py
dataset_name: eu_osakidetza7e
include: eus_exams_eu
task: eus_exams_eu_osakidetza7e
import datasets


def process_docs(dataset: datasets.Dataset):
    """Filter out examples with no answer."""

    def valid_example(example: dict) -> bool:
        """Check if an example is valid."""
        if example["answer"] not in [0, 1, 2, 3]:
            return False
        if example["candidates"] == ["", "", "", ""]:
            return False
        return True

    return dataset.filter(valid_example)
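A quick way to sanity-check the filter is to run it on a toy in-memory dataset. The field values below are invented; only the field names mirror those used in `valid_example`, so treat this as an illustrative sketch rather than real data:

```python
import datasets

# Toy dataset: the second row has an out-of-range answer and the third has
# empty candidates, so only the first row should survive the filter.
toy = datasets.Dataset.from_dict(
    {
        "question": ["Q1", "Q2", "Q3"],
        "candidates": [["a", "b", "c", "d"], ["a", "b", "c", "d"], ["", "", "", ""]],
        "answer": [0, 7, 1],
    }
)

filtered = process_docs(toy)
print(len(filtered))  # expected: 1
```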
# EusProficiency
### Paper
Title: Latxa: An Open Language Model and Evaluation Suite for Basque
Abstract: https://arxiv.org/abs/2403.20266
EusProficiency comprises 5,169 exercises on different topics from past EGA exams, the official C1-level certificate of proficiency in Basque. We collected the atarikoa exercises from EGA exams from 1998 to 2008. Atarikoa is the first qualifying test of EGA, which measures different aspects of language competency, such as reading comprehension, grammar, vocabulary, spelling, and writing. Each test generally has 85 multiple-choice questions, with 4 choices and a single correct answer.
Homepage: https://github.com/hitz-zentroa/latxa
### Citation
```
@misc{etxaniz2024latxa,
      title={Latxa: An Open Language Model and Evaluation Suite for Basque},
      author={Julen Etxaniz and Oscar Sainz and Naiara Perez and Itziar Aldabe and German Rigau and Eneko Agirre and Aitor Ormazabal and Mikel Artetxe and Aitor Soroa},
      year={2024},
      eprint={2403.20266},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```
### Groups and Tasks
#### Groups
There are no groups.
#### Tasks
* `eus_proficiency`: EusProficiency comprises 5,169 exercises on different topics from past EGA exams, the official C1-level certificate of proficiency in Basque.
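For reference, one way to run the task is through the harness's Python entry point. The model below is only a placeholder and the few-shot count is illustrative; the exact keyword arguments may differ slightly across harness versions:

```python
import lm_eval

# Sketch invocation: "EleutherAI/pythia-70m" is just a placeholder model name.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=EleutherAI/pythia-70m",
    tasks=["eus_proficiency"],
    num_fewshot=5,
)
print(results["results"]["eus_proficiency"])
```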
### Checklist
For adding novel benchmarks/datasets to the library:
* [ ] Is the task an existing benchmark in the literature?
* [ ] Have you referenced the original paper that introduced the task?
* [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
If other tasks on this dataset are already supported:
* [ ] Is the "Main" variant of this task clearly denoted?
* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
dataset_path: HiTZ/EusProficiency
dataset_name: default
task: eus_proficiency
doc_to_text: "Galdera: {{question}}\nA: {{candidates[0]}}\nB: {{candidates[1]}}\nC: {{candidates[2]}}\nD: {{candidates[3]}}\nErantzuna:"
doc_to_choice: ["A", "B", "C", "D"]
validation_split: null
test_split: test
fewshot_split: test
output_type: multiple_choice
doc_to_target: answer
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
metadata:
  version: 0.0
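The `doc_to_text` field in the config above is a Jinja-style template, so its output can be previewed directly with `jinja2` (which the harness uses for string templates). The Basque question and candidates below are invented placeholders:

```python
from jinja2 import Template

# Invented example document; real EusProficiency docs follow the same schema.
doc = {
    "question": "Zein da zuzena?",
    "candidates": ["aukera bat", "aukera bi", "aukera hiru", "aukera lau"],
}

template = Template(
    "Galdera: {{question}}\nA: {{candidates[0]}}\nB: {{candidates[1]}}\n"
    "C: {{candidates[2]}}\nD: {{candidates[3]}}\nErantzuna:"
)
print(template.render(**doc))
```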
# EusReading
### Paper
Title: Latxa: An Open Language Model and Evaluation Suite for Basque
Abstract: https://arxiv.org/abs/2403.20266
EusReading consists of 352 reading comprehension exercises (irakurmena) sourced from the set of past EGA exams from 1998 to 2008. Each test generally has 10 multiple-choice questions, with 4 choices and a single correct answer. These exercises are more challenging than Belebele due to the complexity and length of the input texts. As a result, EusReading is useful for measuring the long-context understanding of models.
Homepage: https://github.com/hitz-zentroa/latxa
### Citation
```
@misc{etxaniz2024latxa,
      title={Latxa: An Open Language Model and Evaluation Suite for Basque},
      author={Julen Etxaniz and Oscar Sainz and Naiara Perez and Itziar Aldabe and German Rigau and Eneko Agirre and Aitor Ormazabal and Mikel Artetxe and Aitor Soroa},
      year={2024},
      eprint={2403.20266},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```
### Groups and Tasks
#### Groups
There are no groups.
#### Tasks
* `eus_reading`: EusReading consists of 352 reading comprehension exercises (irakurmena) sourced from the set of past EGA exams from 1998 to 2008.
### Checklist
For adding novel benchmarks/datasets to the library:
* [ ] Is the task an existing benchmark in the literature?
* [ ] Have you referenced the original paper that introduced the task?
* [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
If other tasks on this dataset are already supported:
* [ ] Is the "Main" variant of this task clearly denoted?
* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
dataset_path: HiTZ/EusReading
dataset_name: default
task: eus_reading
doc_to_text: !function utils.doc_to_text_context
doc_to_choice: !function utils.doc_to_choice
validation_split: null
test_split: test
fewshot_split: test
output_type: multiple_choice
doc_to_target: answer
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
metadata:
  version: 0.0
from typing import List

letters = ["A", "B", "C", "D"]


def doc_to_text_context(doc) -> str:
    """
    Converts a document to a formatted string.

    Args:
        doc (dict): A dictionary containing the document information.

    Returns:
        str: A formatted string containing the question and answer choices.
    """
    candidates = doc["candidates"]
    num_choices = len(candidates)
    if num_choices < 2:
        raise ValueError("Invalid number of candidates")
    choices = letters[:num_choices]
    formatted_choices = "\n".join(
        [f"{choice}: {candidates[i]}" for i, choice in enumerate(choices)]
    )
    return f"Pasartea: {doc['context']}\n\nGaldera: {doc['question']}\n{formatted_choices}\nErantzuna:"


def doc_to_choice(doc) -> List[str]:
    """
    Returns the answer choices for a document.

    Args:
        doc (dict): A dictionary containing the document information.

    Returns:
        list: A list of strings containing the answer choices.
    """
    num_choices = len(doc["candidates"])
    if num_choices < 2:
        raise ValueError("Invalid number of candidates")
    return letters[:num_choices]
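As a rough illustration of the prompt format, the helpers prepend the reading passage before the question. The passage, question, and options below are made up:

```python
# Made-up EusReading-style document with three answer options.
doc = {
    "context": "Euskara hizkuntza bat da.",
    "question": "Zer da euskara?",
    "candidates": ["hizkuntza bat", "herri bat", "mendi bat"],
}

print(doc_to_text_context(doc))
# Pasartea: Euskara hizkuntza bat da.
#
# Galdera: Zer da euskara?
# A: hizkuntza bat
# B: herri bat
# C: mendi bat
# Erantzuna:
```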
# EusTrivia
### Paper
Title: Latxa: An Open Language Model and Evaluation Suite for Basque
Abstract: https://arxiv.org/abs/2403.20266
EusTrivia consists of 1,715 trivia questions from multiple online sources. 56.3% of the questions are elementary level (grades 3-6), while the rest are considered challenging. A significant portion of the questions focuses specifically on the Basque Country, its language and culture. Each multiple-choice question contains two, three or four choices (3.84 on average) and a single correct answer. Five areas of knowledge are covered:
- **Humanities and Natural Sciences** (27.8%): This category encompasses questions about history, geography, biology, ecology and other social and natural sciences.
- **Leisure and Art** (24.5%): This category includes questions on sports and athletes, performative and plastic arts and artists, architecture, cultural events, and related topics.
- **Music** (16.0%): This category groups all questions about music and musicians, both classical and contemporary.
- **Language and Literature** (17.1%): This category is concerned with all kinds of literature productions and writers, as well as metalinguistic questions (e.g., definitions, synonyms, and word usage).
- **Mathematics and ICT** (14.5%): This category covers mathematical problems and questions about ICT, as well as questions about people known for their contributions to these fields of knowledge.
Homepage: https://github.com/hitz-zentroa/latxa
### Citation
```
@misc{etxaniz2024latxa,
      title={Latxa: An Open Language Model and Evaluation Suite for Basque},
      author={Julen Etxaniz and Oscar Sainz and Naiara Perez and Itziar Aldabe and German Rigau and Eneko Agirre and Aitor Ormazabal and Mikel Artetxe and Aitor Soroa},
      year={2024},
      eprint={2403.20266},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```
### Groups and Tasks
#### Groups
There are no groups.
#### Tasks
* `eus_trivia`: EusTrivia consists of 1,715 trivia questions from multiple online sources.
### Checklist
For adding novel benchmarks/datasets to the library:
* [ ] Is the task an existing benchmark in the literature?
* [ ] Have you referenced the original paper that introduced the task?
* [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
If other tasks on this dataset are already supported:
* [ ] Is the "Main" variant of this task clearly denoted?
* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
dataset_path: HiTZ/EusTrivia
dataset_name: default
task: eus_trivia
doc_to_text: !function utils.doc_to_text
doc_to_choice: !function utils.doc_to_choice
validation_split: null
test_split: test
fewshot_split: test
output_type: multiple_choice
doc_to_target: answer
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
metadata:
  version: 0.0
from typing import List

letters = ["A", "B", "C", "D"]


def doc_to_text(doc) -> str:
    """
    Converts a document to a formatted string.

    Args:
        doc (dict): A dictionary containing the document information.

    Returns:
        str: A formatted string containing the question and answer choices.
    """
    candidates = doc["candidates"]
    num_choices = len(candidates)
    if num_choices < 2:
        raise ValueError("Invalid number of candidates")
    choices = letters[:num_choices]
    formatted_choices = "\n".join(
        [f"{choice}: {candidates[i]}" for i, choice in enumerate(choices)]
    )
    return f"Galdera: {doc['question']}\n{formatted_choices}\nErantzuna:"


def doc_to_choice(doc) -> List[str]:
    """
    Returns the answer choices for a document.

    Args:
        doc (dict): A dictionary containing the document information.

    Returns:
        list: A list of strings containing the answer choices.
    """
    num_choices = len(doc["candidates"])
    if num_choices < 2:
        raise ValueError("Invalid number of candidates")
    return letters[:num_choices]
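Because EusTrivia questions carry between two and four candidates, the helpers truncate the letter list to match. A small invented two-choice example:

```python
# Invented two-choice trivia question.
doc = {"question": "Bai ala ez?", "candidates": ["bai", "ez"]}

print(doc_to_text(doc))
# Galdera: Bai ala ez?
# A: bai
# B: ez
# Erantzuna:

print(doc_to_choice(doc))  # ['A', 'B']
```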
# FDA
### Paper
Title: Language Models Enable Simple Systems for Generating Structured Views of Heterogeneous Data Lakes
Abstract: A long-standing goal of the data management community is to develop general, automated systems that ingest semi-structured documents and output queryable tables without human effort or domain-specific customization. Given the sheer variety of potential documents, state-of-the-art systems make simplifying assumptions and use domain-specific training. In this work, we ask whether we can maintain generality by using large language models (LLMs). LLMs, which are pretrained on broad data, can perform diverse downstream tasks simply conditioned on natural language task descriptions. We propose and evaluate EVAPORATE, a simple, prototype system powered by LLMs. We identify two fundamentally different strategies for implementing this system: prompt the LLM to directly extract values from documents or prompt the LLM to synthesize code that performs the extraction. Our evaluations show a cost-quality tradeoff between these two approaches. Code synthesis is cheap, but far less accurate than directly processing each document with the LLM. To improve quality while maintaining low cost, we propose an extended code synthesis implementation, EVAPORATE-CODE+, which achieves better quality than direct extraction. Our key insight is to generate many candidate functions and ensemble their extractions using weak supervision. EVAPORATE-CODE+ not only outperforms the state-of-the-art systems, but does so using a sublinear pass over the documents with the LLM. This equates to a 110× reduction in the number of tokens the LLM needs to process, averaged across 16 real-world evaluation settings of 10k documents each.
A task for LMs to perform Information Extraction, as implemented by Based.
Homepage: https://github.com/HazyResearch/based-evaluation-harness
Description:
> FDA (Information Extraction). The task is to extract key-value pairs from a set of PDFs scraped from the FDA website. We use the dataset and labels collected in Arora et al. 2023. We break apart the documents into chunks of 1,920 tokens. For every key-value pair that appears in the chunk, we create a zero-shot prompt using the simple prompt template: {chunk} \n {key}: We allow the model to generate a fixed number of tokens after the prompt and check (with case insensitivity) if the value is contained within the generation. We report accuracy, the fraction of prompts for which the generation contains the value.
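The scoring described above boils down to building `{chunk} \n {key}:` prompts and checking whether the gold value appears, case-insensitively, in the generation. A minimal, non-authoritative sketch of that loop, with a stand-in `generate` callable and an assumed per-example field layout (`chunk`, `key`, `value`):

```python
def fda_accuracy(examples, generate, max_gen_toks=48):
    """Sketch of the FDA protocol. `generate` stands in for the model call and
    each example is assumed to carry a document chunk, a key, and a gold value."""
    hits = 0
    for ex in examples:
        prompt = f"{ex['chunk']}\n{ex['key']}:"
        completion = generate(prompt, max_gen_toks)
        # Case-insensitive containment check of the gold value.
        hits += int(ex["value"].lower() in completion.lower())
    return hits / len(examples)
```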
### Citation
```
@misc{arora2024simple,
      title={Simple linear attention language models balance the recall-throughput tradeoff},
      author={Simran Arora and Sabri Eyuboglu and Michael Zhang and Aman Timalsina and Silas Alberti and Dylan Zinsley and James Zou and Atri Rudra and Christopher Ré},
      year={2024},
      eprint={2402.18668},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

@misc{arora2023language,
      title={Language Models Enable Simple Systems for Generating Structured Views of Heterogeneous Data Lakes},
      author={Simran Arora and Brandon Yang and Sabri Eyuboglu and Avanika Narayan and Andrew Hojel and Immanuel Trummer and Christopher Ré},
      year={2023},
      eprint={2304.09433},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```
### Groups and Tasks
#### Tasks
* `fda`: the FDA task as implemented in the paper "Simple linear attention language models balance the recall-throughput tradeoff". Designed for zero-shot evaluation of small LMs.
### Checklist
For adding novel benchmarks/datasets to the library:
* [x] Is the task an existing benchmark in the literature?
* [x] Have you referenced the original paper that introduced the task?
* [x] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
If other tasks on this dataset are already supported:
* [x] Is the "Main" variant of this task clearly denoted?
* [x] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [x] Have you noted which, if any, published evaluation setups are matched by this variant?
task: fda
class: !function task.FDA
"""
"""
import re
from typing import List

import numpy as np

from lm_eval.api.instance import Instance
from lm_eval.api.task import ConfigurableTask


class FDA(ConfigurableTask):
    VERSION = 0
    DATASET_PATH = "hazyresearch/based-fda"
    DATASET_NAME = "default"

    def __init__(self):
        super().__init__(config={"metadata": {"version": self.VERSION}})

    def has_training_docs(self):
        return False

    def has_validation_docs(self):
        return True

    def has_test_docs(self):
        return False

    def validation_docs(self):
        return self.dataset["validation"]

    def doc_to_text(self, doc):
        return doc["text"]

    def doc_to_target(self, doc):
        return doc["value"]

    def construct_requests(self, doc, ctx, **kwargs):
        """Uses RequestFactory to construct Requests and returns an iterable of
        Requests which will be sent to the LM.

        :param doc:
            The document as returned from training_docs, validation_docs, or test_docs.
        :param ctx: str
            The context string, generated by fewshot_context. This includes the natural
            language description, as well as the few-shot examples, and the question
            part of the document for `doc`.
        """
        return [
            Instance(
                request_type="generate_until",
                doc=doc,
                arguments=(ctx, {"until": ["\n"], "max_gen_toks": 48}),
                idx=0,
                **kwargs,
            )
        ]

    def process_results(self, doc, results):
        """Take a single document and the LM results and evaluates, returning a
        dict where keys are the names of submetrics and values are the values of
        the metric for that one document.

        :param doc:
            The document as returned from training_docs, validation_docs, or test_docs.
        :param results:
            The results of the requests created in construct_requests.
        """
        # continuation, (logprob_unanswerable, _) = results
        continuation = results
        return {"contains": contains_score(continuation[0], [doc["value"]])}

    def aggregation(self):
        """
        :returns: {str: [float] -> float}
            A dictionary where keys are the names of submetrics and values are
            functions that aggregate a list of metrics
        """
        return {
            "contains": np.mean,  # Fraction of documents whose generation contains the gold value
        }

    def higher_is_better(self):
        """
        :returns: {str: bool}
            A dictionary where keys are the names of submetrics and values are
            whether a higher value of the submetric is better
        """
        return {
            "contains": True,
        }


def contains_score(prediction: str, labels: List[str]):
    return max(
        int(bool(re.search(re.compile(re.escape(label), re.IGNORECASE), prediction)))
        for label in labels
    )
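For example, `contains_score` treats any case-insensitive occurrence of a gold label as a hit; the strings below are made-up illustrations:

```python
print(contains_score("The approval date was 2017-05-12.", ["2017-05-12"]))  # 1
print(contains_score("No date given.", ["2017-05-12"]))                     # 0
print(contains_score("APPROVAL DATE: 2017-05-12", ["2017-05-12", "n/a"]))   # 1
```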
@@ -5,11 +5,11 @@ dataset_name: qqp
 output_type: multiple_choice
 training_split: train
 validation_split: validation
-doc_to_text: "\nSentence 1: {{question1}}\nSentence 2: {{question2}}\nAnswer:"
+doc_to_text: "Question 1: {{question1}}\nQuestion 2: {{question2}}\nQuestion: Do both questions ask the same thing?\nAnswer:"
 doc_to_target: label
 doc_to_choice: ["no", "yes"]
 metric_list:
 - metric: acc
 - metric: f1
 metadata:
-  version: 1.0
+  version: 2.0
# MATH
## Paper
Measuring Mathematical Problem Solving With the MATH Dataset
https://arxiv.org/abs/2103.03874
Many intellectual endeavors require mathematical problem solving, but this skill remains beyond the capabilities of computers. To measure this ability in machine learning models, we introduce MATH, a new dataset of 12,500 challenging competition mathematics problems. Each problem in MATH has a full step-by-step solution which can be used to teach models to generate answer derivations and explanations.
NOTE: This task corresponds to the MATH (`hendrycks_math`) implementation at https://github.com/EleutherAI/lm-evaluation-harness/tree/master . For the variant which uses the custom 4-shot prompt in the Minerva paper (https://arxiv.org/abs/2206.14858), and SymPy answer checking as done by Minerva, see `lm_eval/tasks/minerva_math`.
Homepage: https://github.com/hendrycks/math
## Citation
```
@article{hendrycksmath2021,
title={Measuring Mathematical Problem Solving With the MATH Dataset},
author={Dan Hendrycks and Collin Burns and Saurav Kadavath and Akul Arora and Steven Basart and Eric Tang and Dawn Song and Jacob Steinhardt},
journal={NeurIPS},
year={2021}
}
```
### Groups and Tasks
#### Groups
- `hendrycks_math`: the MATH benchmark from Hendrycks et al., evaluated zero- or few-shot.
#### Tasks
- `hendrycks_math_algebra`
- `hendrycks_math_counting_and_prob`
- `hendrycks_math_geometry`
- `hendrycks_math_intermediate_algebra`
- `hendrycks_math_num_theory`
- `hendrycks_math_prealgebra`
- `hendrycks_math_precalc`
### Checklist
The checklist is the following:
For adding novel benchmarks/datasets to the library:
* [x] Is the task an existing benchmark in the literature?
* [x] Have you referenced the original paper that introduced the task?
* [x] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
* Answer extraction code is taken from the original MATH benchmark paper's repository.
If other tasks on this dataset are already supported:
* [x] Is the "Main" variant of this task clearly denoted?
* [x] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [x] Have you noted which, if any, published evaluation setups are matched by this variant?
group: hendrycks_math
task:
- hendrycks_math_algebra
- hendrycks_math_counting_and_prob
- hendrycks_math_geometry
- hendrycks_math_intermediate_algebra
- hendrycks_math_num_theory
- hendrycks_math_prealgebra
- hendrycks_math_precalc
group:
- math_word_problems
task: hendrycks_math_algebra
dataset_path: EleutherAI/hendrycks_math
process_docs: !function utils.process_docs
dataset_name: algebra
output_type: generate_until
training_split: train
test_split: test
doc_to_text: "Problem: {{problem}}\nAnswer:"
process_results: !function utils.process_results
doc_to_target: "{{answer}}"
generation_kwargs:
  until:
    - "Problem:"
  do_sample: false
  temperature: 0
metric_list:
  - metric: exact_match
    aggregation: mean
    higher_is_better: true
metadata:
  version: 1.0
dataset_kwargs:
  trust_remote_code: true
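The config delegates scoring to `utils.process_results`, which is not shown in this commit. As a rough, non-authoritative sketch of the idea behind the `exact_match` metric: MATH solutions place the final answer inside a `\boxed{...}` command, so the core of the check is extracting the last boxed expression from the generation and comparing it to the gold answer string. The helper and result shape below are illustrative assumptions; the repository's actual answer-extraction code (taken from the original MATH codebase) is more elaborate.

```python
from typing import Optional


def last_boxed_answer(text: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in `text`, handling nested braces.

    Illustrative helper only; the harness's own extraction code may differ.
    """
    start = text.rfind("\\boxed{")
    if start == -1:
        return None
    i = start + len("\\boxed{")
    depth = 1
    out = []
    while i < len(text) and depth > 0:
        ch = text[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                break
        out.append(ch)
        i += 1
    return "".join(out) if depth == 0 else None


def sketch_process_results(doc: dict, results: list) -> dict:
    # Hypothetical shape: compare the extracted answer against doc["answer"].
    pred = last_boxed_answer(results[0]) or results[0].strip()
    return {"exact_match": int(pred == doc["answer"])}
```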