Unverified commit 938a4fb3 authored by James A. Michaelov, committed by GitHub

Add LM-SynEval Benchmark (#3184)

* add lm_syneval

* edit readme

* update task readme

* formatting fixes

* run linting

* add descriptions and examples

* clean readme formatting
parent d355eac0
@@ -87,6 +87,7 @@
| [leaderboard](leaderboard/README.md) | Task group used by Hugging Face's [Open LLM Leaderboard v2](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard). Those tasks are static and will not change through time | English |
| [lingoly](lingoly/README.md) | Challenging logical reasoning benchmark in low-resource languages with controls for memorization | English, Multilingual |
| [libra](libra/README.md) | Evaluates long-context understanding in Russian across four complexity levels | Russian (MT) |
| [lm_syneval](lm_syneval/README.md) | Evaluates the syntactic capabilities of language models. | English |
| [logiqa](logiqa/README.md) | Logical reasoning tasks requiring advanced inference and deduction. | English, Chinese |
| [logiqa2](logiqa2/README.md) | Large-scale logical reasoning dataset adapted from the Chinese Civil Service Examination. | English, Chinese |
| [mastermind](mastermind/README.md) | Reasoning benchmark based on the board game of Mastermind. | English |
# _template_yaml
dataset_path: jmichaelov/lm_syneval
output_type: multiple_choice
test_split: test
doc_to_text: ""
target_delimiter: ""
doc_to_target: 0
doc_to_choice: "{{[sentence_good, sentence_bad]}}"
num_fewshot: 0
metric_list:
- metric: acc
aggregation: mean
higher_is_better: true
metadata:
version: 1.0
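
The template turns each minimal pair into a two-way forced choice: with `doc_to_text: ""` and `target_delimiter: ""` the model gets no prompt and each full sentence is scored on its own, `doc_to_choice` yields the pair `[sentence_good, sentence_bad]`, and `doc_to_target: 0` marks the grammatical sentence as the correct choice. A minimal sketch of the equivalent scoring logic, where `sentence_logprob` is a hypothetical callable (any function mapping a sentence to the model's total log-probability), not part of the harness:

```python
from typing import Callable


def score_item(sentence_logprob: Callable[[str], float], doc: dict) -> bool:
    """True iff the model assigns higher probability to the grammatical sentence."""
    # doc_to_choice: "{{[sentence_good, sentence_bad]}}" -> the two candidates
    choices = [doc["sentence_good"], doc["sentence_bad"]]
    scores = [sentence_logprob(c) for c in choices]
    # doc_to_target: 0 -> the grammatical sentence is choice index 0
    return scores.index(max(scores)) == 0


def accuracy(sentence_logprob: Callable[[str], float], docs: list[dict]) -> float:
    """metric: acc with aggregation: mean over the test split."""
    return sum(score_item(sentence_logprob, d) for d in docs) / len(docs)
```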
dataset_name: lm_syneval__agreement__long_vp_coord__plur_MS_LMV_LMV
include: _template_yaml
task: lm_syneval__agreement__long_vp_coord__plur_MS_LMV_LMV

dataset_name: lm_syneval__agreement__long_vp_coord__sing_MS_LMV_LMV
include: _template_yaml
task: lm_syneval__agreement__long_vp_coord__sing_MS_LMV_LMV

dataset_name: lm_syneval__agreement__obj_rel_across_anim__plur_MS_MV_plur_ES_EV
include: _template_yaml
task: lm_syneval__agreement__obj_rel_across_anim__plur_MS_MV_plur_ES_EV

dataset_name: lm_syneval__agreement__obj_rel_across_anim__plur_MS_MV_sing_ES_EV
include: _template_yaml
task: lm_syneval__agreement__obj_rel_across_anim__plur_MS_MV_sing_ES_EV

dataset_name: lm_syneval__agreement__obj_rel_across_anim__sing_MS_MV_plur_ES_EV
include: _template_yaml
task: lm_syneval__agreement__obj_rel_across_anim__sing_MS_MV_plur_ES_EV

dataset_name: lm_syneval__agreement__obj_rel_across_anim__sing_MS_MV_sing_ES_EV
include: _template_yaml
task: lm_syneval__agreement__obj_rel_across_anim__sing_MS_MV_sing_ES_EV

dataset_name: lm_syneval__agreement__obj_rel_across_inanim__plur_IS_IV_plur_ES_EV
include: _template_yaml
task: lm_syneval__agreement__obj_rel_across_inanim__plur_IS_IV_plur_ES_EV

dataset_name: lm_syneval__agreement__obj_rel_across_inanim__plur_IS_IV_sing_ES_EV
include: _template_yaml
task: lm_syneval__agreement__obj_rel_across_inanim__plur_IS_IV_sing_ES_EV

dataset_name: lm_syneval__agreement__obj_rel_across_inanim__sing_IS_IV_plur_ES_EV
include: _template_yaml
task: lm_syneval__agreement__obj_rel_across_inanim__sing_IS_IV_plur_ES_EV

dataset_name: lm_syneval__agreement__obj_rel_across_inanim__sing_IS_IV_sing_ES_EV
include: _template_yaml
task: lm_syneval__agreement__obj_rel_across_inanim__sing_IS_IV_sing_ES_EV

dataset_name: lm_syneval__agreement__obj_rel_no_comp_across_anim__plur_MS_MV_plur_ES_EV
include: _template_yaml
task: lm_syneval__agreement__obj_rel_no_comp_across_anim__plur_MS_MV_plur_ES_EV

dataset_name: lm_syneval__agreement__obj_rel_no_comp_across_anim__plur_MS_MV_sing_ES_EV
include: _template_yaml
task: lm_syneval__agreement__obj_rel_no_comp_across_anim__plur_MS_MV_sing_ES_EV

dataset_name: lm_syneval__agreement__obj_rel_no_comp_across_anim__sing_MS_MV_plur_ES_EV
include: _template_yaml
task: lm_syneval__agreement__obj_rel_no_comp_across_anim__sing_MS_MV_plur_ES_EV

dataset_name: lm_syneval__agreement__obj_rel_no_comp_across_anim__sing_MS_MV_sing_ES_EV
include: _template_yaml
task: lm_syneval__agreement__obj_rel_no_comp_across_anim__sing_MS_MV_sing_ES_EV

dataset_name: lm_syneval__agreement__obj_rel_no_comp_across_inanim__plur_IS_IV_plur_ES_EV
include: _template_yaml
task: lm_syneval__agreement__obj_rel_no_comp_across_inanim__plur_IS_IV_plur_ES_EV

dataset_name: lm_syneval__agreement__obj_rel_no_comp_across_inanim__plur_IS_IV_sing_ES_EV
include: _template_yaml
task: lm_syneval__agreement__obj_rel_no_comp_across_inanim__plur_IS_IV_sing_ES_EV

dataset_name: lm_syneval__agreement__obj_rel_no_comp_across_inanim__sing_IS_IV_plur_ES_EV
include: _template_yaml
task: lm_syneval__agreement__obj_rel_no_comp_across_inanim__sing_IS_IV_plur_ES_EV
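
The per-task files above all follow the same three-line pattern (include the shared template, then set the matching dataset config and task name), so they are mechanical to produce. An illustrative generator under that assumption; the script and `CONDITIONS` list are hypothetical, not tooling from this PR:

```python
# Illustrative generator for the three-line per-task configs above.
# The file layout and CONDITIONS list are assumptions, not this PR's tooling.
from pathlib import Path

CONDITIONS = [
    "lm_syneval__agreement__long_vp_coord__plur_MS_LMV_LMV",
    "lm_syneval__agreement__long_vp_coord__sing_MS_LMV_LMV",
    # ... one entry per dataset config listed above
]

for name in CONDITIONS:
    Path(f"{name}.yaml").write_text(
        f"dataset_name: {name}\n"
        "include: _template_yaml\n"
        f"task: {name}\n"
    )
```

Once registered, any of these tasks can be selected by name with the harness CLI, e.g. `lm_eval --tasks lm_syneval__agreement__long_vp_coord__sing_MS_LMV_LMV ...`.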