Commit 948f120f authored by Baber

Merge branch 'main' into autobatchtest

# Conflicts:
#	lm_eval/models/huggingface.py
parents a5b1c7a8 bd80a6c0
"dataset_name": "professional_medicine"
"description": "以下是关于专业医学的单项选择题,请直接给出正确答案的选项。\n\n"
"include": "_default_template_yaml"
"task": "cmmlu_professional_medicine"
"dataset_name": "professional_psychology"
"description": "以下是关于专业心理学的单项选择题,请直接给出正确答案的选项。\n\n"
"include": "_default_template_yaml"
"task": "cmmlu_professional_psychology"
"dataset_name": "public_relations"
"description": "以下是关于公共关系的单项选择题,请直接给出正确答案的选项。\n\n"
"include": "_default_template_yaml"
"task": "cmmlu_public_relations"
"dataset_name": "security_study"
"description": "以下是关于安全研究的单项选择题,请直接给出正确答案的选项。\n\n"
"include": "_default_template_yaml"
"task": "cmmlu_security_study"
"dataset_name": "sociology"
"description": "以下是关于社会学的单项选择题,请直接给出正确答案的选项。\n\n"
"include": "_default_template_yaml"
"task": "cmmlu_sociology"
"dataset_name": "sports_science"
"description": "以下是关于体育学的单项选择题,请直接给出正确答案的选项。\n\n"
"include": "_default_template_yaml"
"task": "cmmlu_sports_science"
"dataset_name": "traditional_chinese_medicine"
"description": "以下是关于中医中药的单项选择题,请直接给出正确答案的选项。\n\n"
"include": "_default_template_yaml"
"task": "cmmlu_traditional_chinese_medicine"
"dataset_name": "virology"
"description": "以下是关于病毒学的单项选择题,请直接给出正确答案的选项。\n\n"
"include": "_default_template_yaml"
"task": "cmmlu_virology"
"dataset_name": "world_history"
"description": "以下是关于世界历史的单项选择题,请直接给出正确答案的选项。\n\n"
"include": "_default_template_yaml"
"task": "cmmlu_world_history"
"dataset_name": "world_religions"
"description": "以下是关于世界宗教的单项选择题,请直接给出正确答案的选项。\n\n"
"include": "_default_template_yaml"
"task": "cmmlu_world_religions"
group: copal_id
tag: copal_id
task: copal_id_standard
task_alias: standard
dataset_path: haryoaw/COPAL
......
group:
tag:
- crows_pairs
- social_bias
- loglikelihood
task: crows_pairs_english
dataset_path: BigScienceBiasEval/crows_pairs_multilingual
dataset_name: english
......
group: csatqa
task:
- csatqa_gr
- csatqa_li
- csatqa_rch
- csatqa_rcs
- csatqa_rcss
- csatqa_wr
aggregate_metric_list:
- metric: acc
aggregation: mean
weight_by_size: true
- metric: acc_norm
aggregation: mean
weight_by_size: true
metadata:
version: 0.0
group: csatqa
dataset_path: EleutherAI/csatqa
dataset_path: HAERAE-HUB/csatqa
test_split: test
output_type: multiple_choice
process_docs: !function utils.process_docs
......
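The csatqa group config above aggregates its subtasks with `aggregation: mean` and `weight_by_size: true`. A minimal sketch (an assumption about what the setting implies, not the harness's own aggregation code) of a size-weighted group mean, with made-up subtask numbers:

```python
# Sketch: size-weighted mean over subtask scores, as implied by
# `aggregation: mean` with `weight_by_size: true`. Numbers are illustrative.
def weighted_group_mean(subtask_scores: dict[str, float],
                        subtask_sizes: dict[str, int]) -> float:
    total = sum(subtask_sizes.values())
    return sum(subtask_scores[t] * subtask_sizes[t] for t in subtask_scores) / total


scores = {"csatqa_gr": 0.40, "csatqa_wr": 0.60}   # hypothetical subtask accuracies
sizes = {"csatqa_gr": 150, "csatqa_wr": 50}       # hypothetical sample counts
print(weighted_group_mean(scores, sizes))          # 0.45, not the unweighted 0.50
```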
......
@@ -2,7 +2,6 @@ import re
import string
import numpy as np
from scipy.optimize import linear_sum_assignment
_ARTICLES = re.compile(r"\b(a|an|the)\b", re.UNICODE)
......
@@ -117,6 +116,8 @@ def _align_bags(predicted, gold):
Takes gold and predicted answer sets and first finds the optimal 1-1 alignment
between them and gets maximum metric values over all the answers.
"""
from scipy.optimize import linear_sum_assignment
scores = np.zeros([len(gold), len(predicted)])
for gold_index, gold_item in enumerate(gold):
for pred_index, pred_item in enumerate(predicted):
......
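The docstring above describes finding an optimal 1-1 alignment between gold and predicted answer bags and taking the best metric value per answer. A minimal sketch of that idea, assuming a toy `pairwise_f1` metric in place of the harness's per-pair scoring:

```python
# Sketch only: illustrates how scipy.optimize.linear_sum_assignment picks the
# optimal 1-1 alignment once pairwise scores are known. Not the exact DROP code.
import numpy as np
from scipy.optimize import linear_sum_assignment


def pairwise_f1(pred: str, gold: str) -> float:
    """Toy token-level F1 between two answer strings (illustrative only)."""
    pred_tokens, gold_tokens = set(pred.split()), set(gold.split())
    if not pred_tokens or not gold_tokens:
        return 0.0
    common = len(pred_tokens & gold_tokens)
    if common == 0:
        return 0.0
    precision, recall = common / len(pred_tokens), common / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def align_and_score(predicted: list[str], gold: list[str]) -> list[float]:
    # Build the pairwise score matrix, then maximize the total score by
    # minimizing its negation (linear_sum_assignment solves a minimization).
    scores = np.zeros([len(gold), len(predicted)])
    for gi, g in enumerate(gold):
        for pi, p in enumerate(predicted):
            scores[gi, pi] = pairwise_f1(p, g)
    row_ind, col_ind = linear_sum_assignment(-scores)
    max_scores = np.zeros([max(len(gold), len(predicted))])
    for r, c in zip(row_ind, col_ind):
        max_scores[r] = max(max_scores[r], scores[r, c])
    return max_scores.tolist()


print(align_and_score(["the red car", "a dog"], ["red car", "dog"]))
```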
......
@@ -16,8 +16,8 @@ Homepage: https://eqbench.com/
NOTE: There are some key differences between the lm-evaluation-harness version and the implementation described in the EQ-Bench paper (These have been OK'd by the author):
- The lm-eval version uses the EQ-Bench v2 test set (171 questions) and score calculation. It does not incorporate the revision part of the prompt, as per v2.1 (https://github.com/EQ-bench/EQ-Bench)
- No retries in lm-eval version (EQ-Bench pipeline retries with successively higher temps if it encounters unparseable answers)
- In the original implementation, unparseable answers are excluded from the final score, and 83% of answers have to be parseable or a fail is returned. The lm-eval version instead assigns 0 to unparsable answers and has no fail criteria. So for lower performing models, there may be differences with the EQ-Bench leaderboard.
- No retries in lm-eval version (EQ-Bench pipeline retries with successively higher temps if it encounters unparsable answers)
- In the original implementation, unparsable answers are excluded from the final score, and 83% of answers have to be parseable or a fail is returned. The lm-eval version instead assigns 0 to unparsable answers and has no fail criteria. So for lower performing models, there may be differences with the EQ-Bench leaderboard.
### Citation
......
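The EQ-Bench README fragment above notes one behavioral difference: the original pipeline excludes unparsable answers and fails a run if fewer than 83% parse, while the lm-eval version scores unparsable answers as 0 with no fail criterion. A minimal sketch of that contrast; `parse_answer` and `score_answer` are hypothetical stand-ins for the actual EQ-Bench parsing and per-question scoring:

```python
# Sketch of the two scoring policies described above. Only the treatment of
# unparsable answers is the point; the helpers are hypothetical placeholders.
from typing import Callable, Optional


def original_pipeline_score(
    responses: list[str],
    references: list[dict],
    parse_answer: Callable[[str], Optional[dict]],
    score_answer: Callable[[dict, dict], float],
) -> Optional[float]:
    """Original EQ-Bench: exclude unparsable answers; fail if <83% parse."""
    parsed = [parse_answer(r) for r in responses]
    usable = [(p, ref) for p, ref in zip(parsed, references) if p is not None]
    if len(usable) / len(responses) < 0.83:
        return None  # run reported as a failure
    return sum(score_answer(p, ref) for p, ref in usable) / len(usable)


def lm_eval_style_score(
    responses: list[str],
    references: list[dict],
    parse_answer: Callable[[str], Optional[dict]],
    score_answer: Callable[[dict, dict], float],
) -> float:
    """lm-eval version: unparsable answers score 0; no fail criterion."""
    total = 0.0
    for r, ref in zip(responses, references):
        p = parse_answer(r)
        total += score_answer(p, ref) if p is not None else 0.0
    return total / len(responses)
```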
......
@@ -26,7 +26,7 @@ Homepage: https://github.com/hitz-zentroa/latxa
### Groups and Tasks
#### Groups
#### Tags
* `eus_exams_eu`: The Basque version of the exams.
* `eus_exams_es`: The Spanish version of the exams.
......
include: eus_exams
group:
tag:
- eus_exams_es
doc_to_text: "Pregunta: {{question}}\nA: {{candidates[0]}}\nB: {{candidates[1]}}\nC: {{candidates[2]}}\nD: {{candidates[3]}}\nRespuesta:"
include: eus_exams
group:
tag:
- eus_exams_eu
doc_to_text: "Galdera: {{question}}\nA: {{candidates[0]}}\nB: {{candidates[1]}}\nC: {{candidates[2]}}\nD: {{candidates[3]}}\nErantzuna:"
......
@@ -12,7 +12,7 @@ class FDA(ConfigurableTask):
DATASET_PATH = "hazyresearch/based-fda"
DATASET_NAME = "default"
def __init__(self):
def __init__(self, **kwargs):
super().__init__(config={"metadata": {"version": self.VERSION}})
def has_training_docs(self):
......