| [lingoly](lingoly/README.md) | Challenging logical reasoning benchmark in low-resource languages with controls for memorization. | English, Multilingual |
| [logiqa](logiqa/README.md) | Logical reasoning tasks requiring advanced inference and deduction. | English, Chinese |
| [logiqa2](logiqa2/README.md) | Large-scale logical reasoning dataset adapted from the Chinese Civil Service Examination. | English, Chinese |
| [mastermind](mastermind/README.md) | Reasoning benchmark based on the board game Mastermind. | English |
| [mathqa](mathqa/README.md) | Question answering tasks involving mathematical reasoning and problem-solving. | English |
| [mbpp](mbpp/README.md) | A benchmark designed to measure the ability to synthesize short Python programs from natural language descriptions. | Python |
| [meddialog](meddialog/README.md) | Medical open-ended QA and question entailment tasks derived from the MedDialog dataset. | English |
title={MastermindEval: A Simple But Scalable Reasoning Benchmark},
author={Jonas Golde and Patrick Haller and Fabio Barth and Alan Akbik},
booktitle={Workshop on Reasoning and Planning for Large Language Models},
year={2025},
url={https://openreview.net/forum?id=H4donosutm}
}
```
### Groups, Tags, and Tasks
#### Groups
None.
#### Tags
* `mastermind`: Evaluates all settings.
* `mastermind_easy`: Evaluates all easy settings (random wrong answer options).
* `mastermind_hard`: Evaluates all hard settings (wrong answer options differ in one symbol from the secret code).
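The easy/hard distinction above can be illustrated with a small sketch of how wrong answer options might be constructed. This is not the benchmark's reference implementation — the function names and symbol alphabet are made up for illustration — but it captures the stated difference: easy distractors are arbitrary non-secret codes, while hard distractors differ from the secret code in exactly one symbol.

```python
import random


def hamming_distance(a, b):
    """Number of positions at which two codes differ."""
    return sum(x != y for x, y in zip(a, b))


def easy_distractor(secret, symbols, rng):
    """Easy setting: any random code that is not the secret."""
    while True:
        candidate = tuple(rng.choice(symbols) for _ in secret)
        if candidate != secret:
            return candidate


def hard_distractor(secret, symbols, rng):
    """Hard setting: a code differing from the secret in exactly one symbol."""
    code = list(secret)
    pos = rng.randrange(len(code))
    code[pos] = rng.choice([s for s in symbols if s != code[pos]])
    return tuple(code)


rng = random.Random(0)
symbols = ("A", "B", "C", "D")   # hypothetical 4-symbol alphabet
secret = ("A", "B")              # hypothetical code of length 2

easy = easy_distractor(secret, symbols, rng)
hard = hard_distractor(secret, symbols, rng)
assert easy != secret
assert hamming_distance(secret, hard) == 1
```

Hard distractors are harder to eliminate because a model must verify the code against the full game state rather than spot an obviously inconsistent option.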
#### Tasks
* `mastermind_24_easy`
* `mastermind_24_hard`
* `mastermind_35_easy`
* `mastermind_35_hard`
* `mastermind_46_easy`
* `mastermind_46_hard`
### Checklist
For adding novel benchmarks/datasets to the library:
* [x] Is the task an existing benchmark in the literature?
* [x] Have you referenced the original paper that introduced the task?
* [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
If other tasks on this dataset are already supported:
* [ ] Is the "Main" variant of this task clearly denoted?
* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?