Commit f47ddaf8 authored by Jonas Golde, committed by GitHub

Add MastermindEval (#2788)

* add MastermindEval benchmark

* fill out checklist
@@ -77,6 +77,7 @@
| [lingoly](lingoly/README.md) | Challenging logical reasoning benchmark in low-resource languages with controls for memorization | English, Multilingual |
| [logiqa](logiqa/README.md) | Logical reasoning tasks requiring advanced inference and deduction. | English, Chinese |
| [logiqa2](logiqa2/README.md) | Large-scale logical reasoning dataset adapted from the Chinese Civil Service Examination. | English, Chinese |
| [mastermind](mastermind/README.md) | Reasoning benchmark based on the board game Mastermind. | English |
| [mathqa](mathqa/README.md) | Question answering tasks involving mathematical reasoning and problem-solving. | English |
| [mbpp](mbpp/README.md) | A benchmark designed to measure the ability to synthesize short Python programs from natural language descriptions. | Python |
| [meddialog](meddialog/README.md) | Medical open-ended QA and Question Entailment stemming from the MedDialog dataset. | English |
# MastermindEval
### Paper
Title: MastermindEval: A Simple But Scalable Reasoning Benchmark
Abstract: https://arxiv.org/abs/2503.05891
In Mastermind, the player has to deduce a hidden sequence of symbols by iteratively
guessing and using the feedback provided by the game master. MastermindEval contains
pre-played games of the board game Mastermind generated with Knuth's algorithm. Each
game is pre-played until only one possible, valid solution remains, and the task is to
derive the hidden sequence of symbols by combining the information provided in the
prompt. We offer splits of varying difficulty: 24 (code length 2, 4 possible colors),
35 (code length 3, 5 possible colors) and 46 (code length 4, 6 possible colors). Each
split comes in two variants, easy and hard, containing either random codes as wrong
answer options or codes that are very close to the correct code (only one symbol is
changed). We further offer an agentic evaluation, in which the LLM plays the game from
scratch, in the GitHub repository linked below.
GitHub repository: https://github.com/flairNLP/mastermind
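
The splits can be inspected directly from the Hugging Face Hub. The snippet below is a
minimal sketch, assuming the `datasets` library is installed; the field names
(`instruction`, `options`, `answerKey`) are taken from the task templates further down.

```python
from datasets import load_dataset

# Load the easiest split (code length 2, 4 colors) with random wrong answer options.
ds = load_dataset("flair/mastermind_24_mcq_random", split="test")

sample = ds[0]
print(sample["instruction"])        # pre-played game transcript plus question
print(sample["options"]["text"])    # candidate secret codes
print(sample["answerKey"])          # label of the correct option

# Index of the correct option, as computed by the task's doc_to_target template.
gold = sample["options"]["label"].index(sample["answerKey"])
print("gold index:", gold)
```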
### Citation
```
@inproceedings{
golde2025mastermindeval,
title={MastermindEval: A Simple But Scalable Reasoning Benchmark},
author={Jonas Golde and Patrick Haller and Fabio Barth and Alan Akbik},
booktitle={Workshop on Reasoning and Planning for Large Language Models},
year={2025},
url={https://openreview.net/forum?id=H4donosutm}
}
```
### Groups, Tags, and Tasks
#### Groups
None.
#### Tags
* `mastermind`: Evaluates all settings.
* `mastermind_easy`: Evaluates all easy settings (random wrong answer options).
* `mastermind_hard`: Evaluates all hard settings (wrong answer options differ in one symbol from the secret code).
#### Tasks
* `mastermind_24_easy`
* `mastermind_24_hard`
* `mastermind_35_easy`
* `mastermind_35_hard`
* `mastermind_46_easy`
* `mastermind_46_hard`
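
Any of the tags or tasks above can be passed to the harness as usual. The snippet below
is a sketch of a programmatic run via `lm_eval.simple_evaluate`; the model name `gpt2`
is only a placeholder, and argument names may differ slightly across harness versions.

```python
import lm_eval

# Evaluate one split with a placeholder Hugging Face model.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=gpt2",
    tasks=["mastermind_24_easy"],  # tags such as "mastermind_easy" also expand to tasks
)
print(results["results"])
```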
### Checklist
For adding novel benchmarks/datasets to the library:
* [x] Is the task an existing benchmark in the literature?
* [x] Have you referenced the original paper that introduced the task?
* [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
If other tasks on this dataset are already supported:
* [ ] Is the "Main" variant of this task clearly denoted?
* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
tag:
- mastermind
- mastermind_easy
task: mastermind_24_easy
dataset_path: flair/mastermind_24_mcq_random
dataset_name: null
output_type: multiple_choice
training_split: train
validation_split: validation
test_split: test
doc_to_text: "{{instruction}}\n\nThe secret code is:"
doc_to_target: "{{options.label.index(answerKey)}}"
doc_to_choice: "{{options.text}}"
should_decontaminate: true
doc_to_decontamination_query: "{{instruction}}\n\nThe secret code is:"
metric_list:
- metric: acc
aggregation: mean
higher_is_better: true
metadata:
version: 1.0
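
To make the Jinja templates in the `mastermind_24_easy` config above concrete, the
following sketch shows how one document is rendered into prompt, choices, and gold
index. The `doc` dictionary is hypothetical; only its field names come from the config.

```python
# Hypothetical document mirroring the fields referenced by the config above.
doc = {
    "instruction": "Guess: A B -> 1 black peg, 0 white pegs ...",
    "options": {"text": ["A C", "B A", "C D", "D B"], "label": ["A", "B", "C", "D"]},
    "answerKey": "B",
}

prompt = f"{doc['instruction']}\n\nThe secret code is:"         # doc_to_text
choices = doc["options"]["text"]                                # doc_to_choice
gold = doc["options"]["label"].index(doc["answerKey"])          # doc_to_target
print(prompt, choices, gold)
```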
tag:
- mastermind
- mastermind_hard
task: mastermind_24_hard
dataset_path: flair/mastermind_24_mcq_close
dataset_name: null
output_type: multiple_choice
training_split: train
validation_split: validation
test_split: test
doc_to_text: "{{instruction}}\n\nThe secret code is:"
doc_to_target: "{{options.label.index(answerKey)}}"
doc_to_choice: "{{options.text}}"
should_decontaminate: true
doc_to_decontamination_query: "{{instruction}}\n\nThe secret code is:"
metric_list:
- metric: acc
aggregation: mean
higher_is_better: true
metadata:
version: 1.0

include: mastermind_24_easy.yaml
task: mastermind_35_easy
dataset_path: flair/mastermind_35_mcq_random

include: mastermind_24_hard.yaml
task: mastermind_35_hard
dataset_path: flair/mastermind_35_mcq_close

include: mastermind_24_easy.yaml
task: mastermind_46_easy
dataset_path: flair/mastermind_46_mcq_random

include: mastermind_24_hard.yaml
task: mastermind_46_hard
dataset_path: flair/mastermind_46_mcq_close
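
The remaining task files only override `task` and `dataset_path` relative to the two
base configs referenced by their `include` lines. A quick way to confirm that, as a
sketch assuming the files sit in the same directory and are named after their tasks,
is to compare them with PyYAML:

```python
import yaml

# Assumed file names; the base config is the one referenced by the include line.
base = yaml.safe_load(open("mastermind_24_easy.yaml"))
derived = yaml.safe_load(open("mastermind_35_easy.yaml"))

# Keys that the derived config adds or overrides relative to its base.
overrides = {k: v for k, v in derived.items() if base.get(k) != v}
print(overrides)  # expected: include, task, dataset_path
```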