Commit efb46937 authored by Baber

Merge branch 'main' into convert_gen

# Conflicts:
#	lm_eval/__main__.py
#	lm_eval/evaluator.py
parents 7fbf899c ade01428
# GroundCocoa
### Paper
Title: `GroundCocoa: A Benchmark for Evaluating Compositional & Conditional Reasoning in Language Models`
Abstract: https://arxiv.org/abs/2404.04237
The rapid progress of large language models (LLMs) has seen them excel and frequently surpass human performance on standard benchmarks. This has enabled many downstream applications, such as LLM agents, to rely on their reasoning to address complex task requirements. However, LLMs are known to unexpectedly falter in simple tasks and under seemingly straightforward circumstances - underscoring the need for better and more diverse evaluation setups to measure their true capabilities. To this end, we choose to study compositional and conditional reasoning, two aspects that are central to human cognition, and introduce GroundCocoa - a lexically diverse benchmark connecting these reasoning skills to the real-world problem of flight booking. Our task involves aligning detailed user preferences with available flight options presented in a multiple-choice format. Results indicate a significant disparity in performance among current state-of-the-art LLMs with even the best performing model, GPT-4 Turbo, not exceeding 67% accuracy despite advanced prompting techniques.
Homepage: `https://osu-nlp-group.github.io/GroundCocoa/`
### Citation
```
@misc{kohli2025groundcocoabenchmarkevaluatingcompositional,
      title={GroundCocoa: A Benchmark for Evaluating Compositional & Conditional Reasoning in Language Models},
      author={Harsh Kohli and Sachin Kumar and Huan Sun},
      year={2025},
      eprint={2404.04237},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2404.04237},
```
### Groups and Tasks
#### Groups
- Not part of a group yet
#### Tasks
- `groundcocoa`
### Checklist
For adding novel benchmarks/datasets to the library:
* [x] Is the task an existing benchmark in the literature?
* [x] Have you referenced the original paper that introduced the task?
* [x] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
If other tasks on this dataset are already supported:
* [ ] Is the "Main" variant of this task clearly denoted?
* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
task: groundcocoa
dataset_path: harsh147/GroundCocoa
output_type: multiple_choice
training_split: null
validation_split: validation
test_split: test
process_docs: !function utils.process_docs
doc_to_text: "{{criteria}}"
doc_to_target: gold
doc_to_choice: "choices"
target_delimiter: ""
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
dataset_kwargs:
  trust_remote_code: true
  streaming: true
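To sanity-check the new task end to end, something along these lines should work. This is only a rough sketch against the harness's Python entry point; the model checkpoint and the `limit` value are placeholders, not part of this change.

```python
# Rough smoke test for the new groundcocoa task (placeholder model, small doc limit).
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=EleutherAI/pythia-160m",  # placeholder checkpoint
    tasks=["groundcocoa"],
    limit=10,  # only score a handful of docs for the smoke test
)
print(results["results"]["groundcocoa"])
```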
import datasets
import pandas as pd
from datasets import Dataset


def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:
    # Materialize the (streaming) split so every sample can be turned into a prompt.
    cocoa_dataset = [sample for sample in dataset]
    processed = []
    for doc in cocoa_dataset:
        # Fixed instruction, followed by the user criteria and the five flight options.
        question = "A user has specified certain criteria for booking a flight. Below are five different flight options labeled 'A', 'B', 'C', 'D', and 'E'. Review these options and select the one that best matches the user requirements. Respond with a single option and the phrase 'The answer is Option ' followed by the correct letter - 'A', 'B', 'C', 'D', or 'E'\n\n"
        question = question + "User Criteria: " + doc["query"]
        question = question + "\n\n Option A: " + str(doc["Option A"]) + "\n"
        question = question + "\n Option B: " + str(doc["Option B"]) + "\n"
        question = question + "\n Option C: " + str(doc["Option C"]) + "\n"
        question = question + "\n Option D: " + str(doc["Option D"]) + "\n"
        question = question + "\n Option E: " + str(doc["Option E"]) + "\n"
        out_doc = {
            # Prompt string consumed by doc_to_text.
            "criteria": question,
            # Candidate continuations scored as the multiple-choice options.
            "choices": [
                "The answer is Option A",
                "The answer is Option B",
                "The answer is Option C",
                "The answer is Option D",
                "The answer is Option E",
            ],
            # Gold target (doc_to_target) matching one of the choices above.
            "gold": "The answer is Option " + doc["Answer"],
        }
        processed.append(out_doc)
    # Return a regular in-memory Dataset for the harness.
    df = pd.DataFrame(processed)
    dataset = Dataset.from_pandas(df)
    return dataset
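As a quick check of the preprocessing, `process_docs` can be exercised on a single made-up record; the field values below are invented for illustration and are not taken from the actual dataset.

```python
# Toy record run through process_docs; all values are made up.
from datasets import Dataset

toy = Dataset.from_list([{
    "query": "A non-stop morning flight under $300.",
    "Option A": {"price": 250, "stops": 0, "departure": "08:10"},
    "Option B": {"price": 450, "stops": 1, "departure": "21:40"},
    "Option C": {"price": 320, "stops": 0, "departure": "09:05"},
    "Option D": {"price": 280, "stops": 2, "departure": "06:30"},
    "Option E": {"price": 500, "stops": 0, "departure": "13:20"},
    "Answer": "A",
}])

processed = process_docs(toy)
print(processed[0]["criteria"][:120], "...")
print(processed[0]["gold"])  # -> The answer is Option A
```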
include: humaneval.yaml
task: humaneval_plus
dataset_path: evalplus/humanevalplus
...
@@ -2,7 +2,6 @@ import dataclasses
 from typing import Dict, Optional, Union
 from lm_eval.tasks.ifeval import instructions_registry
-from lm_eval.utils import eval_logger
 @dataclasses.dataclass
...
+tag:
+  - kobest
 task: kobest_boolq
 dataset_path: skt/kobest_v1
 dataset_name: boolq
...
+tag:
+  - kobest
 task: kobest_copa
 dataset_path: skt/kobest_v1
 dataset_name: copa
...
+tag:
+  - kobest
 task: kobest_hellaswag
 dataset_path: skt/kobest_v1
 dataset_name: hellaswag
...
+tag:
+  - kobest
 task: kobest_sentineg
 dataset_path: skt/kobest_v1
 dataset_name: sentineg
...
+tag:
+  - kobest
 task: kobest_wic
 dataset_path: skt/kobest_v1
 dataset_name: wic
...
-dataset_path: lighteval/MATH-Hard
+dataset_path: DigitalLearningGmbH/MATH-lighteval
 process_docs: !function utils.process_docs
 output_type: generate_until
 training_split: train
...
+import logging
 import re
 import signal
 from typing import Dict, List, Optional
 import datasets
-from lm_eval.utils import eval_logger
+eval_logger = logging.getLogger(__name__)
 try:
     import sympy
...
include: mbpp.yaml
task: mbpp_plus
dataset_path: evalplus/mbppplus
dataset_name: null
doc_to_text: "You are an expert Python programmer, and here is your task: {{prompt if prompt is defined else text}} Your code should pass these tests:\n\n{{test_list[0]}}\n{{test_list[1]}}\n{{test_list[2]}}\n[BEGIN]\n"
...
@@ -63,3 +63,6 @@ If other tasks on this dataset are already supported:
 ### Variant Wishlist
 - [ ] zero-shot variant
+
+### Changelog
+version 2.0: (21-Feb-2025); added math_verify (extraction) metric. For details [see](https://huggingface.co/blog/math_verify_leaderboard)
...
@@ -19,9 +19,12 @@ metric_list:
   - metric: exact_match
     aggregation: mean
     higher_is_better: true
+  - metric: math_verify
+    aggregation: mean
+    higher_is_better: true
 num_fewshot: 4
 metadata:
-  version: 1.0
+  version: 2.0
 dataset_kwargs:
   trust_remote_code: true
 fewshot_config:
...
+import logging
 import re
 import signal
+from importlib.metadata import version
 from typing import Dict, List, Optional
 import datasets
-from lm_eval.utils import eval_logger
+eval_logger = logging.getLogger(__name__)
 try:
+    import antlr4
     import sympy
+    from math_verify import parse, verify
     from sympy.parsing.latex import parse_latex
-except ModuleNotFoundError:
-    raise ModuleNotFoundError(
-        "`sympy` is required for generating translation task prompt templates. \
-please install sympy via pip install lm-eval[math] or pip install -e .[math]",
-    )
+
+    assert version("antlr4-python3-runtime").startswith("4.11")
+except (ModuleNotFoundError, AssertionError) as e:
+    raise type(e)(
+        "`sympy`, `math_verify` and `antlr4-python3-runtime==4.11` are required for generating translation task prompt templates. "
+        "Please install the required packages via pip install lm-eval[math] or pip install -e .[math]"
+    ) from e
 # taken from
...
@@ -75,8 +82,13 @@ def process_results(doc: dict, results: List[str]) -> Dict[str, int]:
     else:
         retval = 0
+
+    # math_verify
+    res = verify(parse(doc["answer"]), parse(candidates))
+    mathval = 1 if res else 0
+
     results = {
         "exact_match": retval,
+        "math_verify": mathval,
     }
     return results
...
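For reference, the new `math_verify` scoring boils down to parsing both the gold answer and the extracted candidate, then checking them for mathematical equivalence. A minimal sketch of that check (the example expressions are invented):

```python
# Minimal sketch of the math_verify equivalence check behind the new metric.
# The expressions below are made-up examples, not taken from the dataset.
from math_verify import parse, verify

gold = parse("$\\frac{1}{2}$")
candidate = parse("$0.5$")

print(verify(gold, candidate))  # expected: True, since 1/2 and 0.5 are equivalent
```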
...
@@ -11,7 +11,7 @@ import yaml
 from tqdm import tqdm
-eval_logger = logging.getLogger("lm-eval")
+eval_logger = logging.getLogger(__name__)
 SUBJECTS = {
...
@@ -10,7 +10,7 @@ import yaml
 from tqdm import tqdm
-eval_logger = logging.getLogger("lm-eval")
+eval_logger = logging.getLogger(__name__)
 SUBJECTS = {
...
...
@@ -14,7 +14,40 @@ The datasets included in PortugueseBench are:
 ### Citation
-Paper for PortugueseBench coming soon.
+```
+@inproceedings{baucells-etal-2025-iberobench,
+    title = "{I}bero{B}ench: A Benchmark for {LLM} Evaluation in {I}berian Languages",
+    author = "Baucells, Irene and
+      Aula-Blasco, Javier and
+      de-Dios-Flores, Iria and
+      Paniagua Su{\'a}rez, Silvia and
+      Perez, Naiara and
+      Salles, Anna and
+      Sotelo Docio, Susana and
+      Falc{\~a}o, J{\'u}lia and
+      Saiz, Jose Javier and
+      Sepulveda Torres, Robiert and
+      Barnes, Jeremy and
+      Gamallo, Pablo and
+      Gonzalez-Agirre, Aitor and
+      Rigau, German and
+      Villegas, Marta",
+    editor = "Rambow, Owen and
+      Wanner, Leo and
+      Apidianaki, Marianna and
+      Al-Khalifa, Hend and
+      Eugenio, Barbara Di and
+      Schockaert, Steven",
+    booktitle = "Proceedings of the 31st International Conference on Computational Linguistics",
+    month = jan,
+    year = "2025",
+    address = "Abu Dhabi, UAE",
+    publisher = "Association for Computational Linguistics",
+    url = "https://aclanthology.org/2025.coling-main.699/",
+    pages = "10491--10519",
+}
+```
 ### Groups and Tasks
...
...
@@ -15,6 +15,7 @@ The datasets included in SpanishBench that have been made public in previous pub
 | Task | Category | Paper title | Homepage |
 |:-------------:|:-----:|:-------------:|:-----:|
 | Belebele_es | Reading Comprehension | [The Belebele Benchmark: a Parallel Reading Comprehension Dataset in 122 Language Variants](https://arxiv.org/abs/2308.16884) | https://huggingface.co/datasets/facebook/belebele |
+| Cocoteros_es | Commonsense Reasoning | [COCOTEROS: A Spanish Corpus with Contextual Knowledge for Natural Language Generation](https://besaya.infor.uva.es/sepln24/paper04.pdf) | https://huggingface.co/datasets/gplsi/cocoteros |
 | EsCoLA | Linguistic Acceptability | [EsCoLA: Spanish Corpus of Linguistic Acceptability](https://aclanthology.org/2024.lrec-main.554/) | https://huggingface.co/datasets/nbel/EsCoLA |
 | FLORES_es | Translation | [The FLORES-101 Evaluation Benchmark for Low-Resource and Multilingual Machine Translation](https://arxiv.org/abs/2106.03193) | https://huggingface.co/datasets/facebook/flores |
 | MGSM_es | Math | [Language Models are Multilingual Chain-of-Thought Reasoners](https://arxiv.org/abs/2210.03057) | https://huggingface.co/datasets/juletxara/mgsm |
...
@@ -28,7 +29,40 @@ The datasets included in SpanishBench that have been made public in previous pub
 ### Citation
-Paper for SpanishBench coming soon.
+```
+@inproceedings{baucells-etal-2025-iberobench,
+    title = "{I}bero{B}ench: A Benchmark for {LLM} Evaluation in {I}berian Languages",
+    author = "Baucells, Irene and
+      Aula-Blasco, Javier and
+      de-Dios-Flores, Iria and
+      Paniagua Su{\'a}rez, Silvia and
+      Perez, Naiara and
+      Salles, Anna and
+      Sotelo Docio, Susana and
+      Falc{\~a}o, J{\'u}lia and
+      Saiz, Jose Javier and
+      Sepulveda Torres, Robiert and
+      Barnes, Jeremy and
+      Gamallo, Pablo and
+      Gonzalez-Agirre, Aitor and
+      Rigau, German and
+      Villegas, Marta",
+    editor = "Rambow, Owen and
+      Wanner, Leo and
+      Apidianaki, Marianna and
+      Al-Khalifa, Hend and
+      Eugenio, Barbara Di and
+      Schockaert, Steven",
+    booktitle = "Proceedings of the 31st International Conference on Computational Linguistics",
+    month = jan,
+    year = "2025",
+    address = "Abu Dhabi, UAE",
+    publisher = "Association for Computational Linguistics",
+    url = "https://aclanthology.org/2025.coling-main.699/",
+    pages = "10491--10519",
+}
+```
 ### Groups and Tasks
...
@@ -44,6 +78,7 @@ Paper for SpanishBench coming soon.
 The following tasks evaluate tasks on SpanishBench dataset using various scoring methods.
 - `belebele_spa_Latn`
+- `cocoteros_es`
 - `copa_es`
 - `escola`
 - `flores_es`
...