Commit 2106fbeb authored by Baber

Merge branch 'main' into mathvista

# Conflicts:
#	lm_eval/models/openai_completions.py
parents 4354fe46 703fbffd
tag:
- kbl
- kbl_knowledge_em
description: '당신은 사용자의 질문에 친절하고 논리적으로 답변해 주는 법률 전문가 챗봇 입니다.\n'
dataset_path: lbox/kbl
test_split: test
output_type: generate_until
doc_to_target: "{{label}}"
metric_list:
- metric: exact_match
aggregation: mean
higher_is_better: true
ignore_case: true
ignore_punctuation: true
filter_list:
- name: "get-answer"
filter:
- function: "regex"
regex_pattern: "([A-E]).*"
- function: "take_first"
task: kbl_common_legal_mistake_qa_em
dataset_name: kbl_knowledge_common_legal_mistake_qa
doc_to_text: "### 질문: {{question}}\nA. {{A}}\nB. {{B}}\nC. {{C}}\n'A', 'B', 'C' 하나를 선택하여 ''답변: A'' 같이 단답식으로 답해 주세요."
include: _kbl_knowledge_yaml
task: kbl_common_legal_mistake_qa_reasoning_em
dataset_name: kbl_knowledge_common_legal_mistake_qa_reasoning
doc_to_text: "### 질문: {{question}}\nA. {{A}}\nB. {{B}}\nC. {{C}}\n'A', 'B', 'C' 하나를 선택하여 ''답변: A'' 같이 단답식으로 답해 주세요."
include: _kbl_knowledge_yaml
task: kbl_legal_concept_qa_em
dataset_name: kbl_knowledge_legal_concept_qa
doc_to_text: "### 질문: {{question}}\nA. {{A}}\nB. {{B}}\nC. {{C}}\nD. {{D}}\nE. {{E}}\n'A', 'B', 'C', 'D', 'E' 하나를 선택하여 ''답변: A'' 같이 단답식으로 답해 주세요."
include: _kbl_knowledge_yaml
task: kbl_offense_component_qa_em
dataset_name: kbl_knowledge_offense_component_qa
doc_to_text: "### 질문: {{question}}\n다음 선택지를 읽고 선택지 하나를 골라 ''답변: A'' 같이 단답식으로 답해 주세요. ### 선택지: A. {{A}}\nB. {{B}}."
include: _kbl_knowledge_yaml
task: kbl_query_and_statute_matching_qa_em
dataset_name: kbl_knowledge_query_and_statute_matching_qa
doc_to_text: "### 질문: {{question}}\nA. {{A}}\nB. {{B}}\nC. {{C}}\nD. {{D}}\nE. {{E}}\nA, B, C, D, E 하나를 선택하여 ''답변: A'' 같이 단답식으로 답해 주세요."
include: _kbl_knowledge_yaml
task: kbl_statute_hallucination_qa_em
dataset_name: kbl_knowledge_statute_hallucination_qa
doc_to_text: "### 질문: {{question}}\nA. {{A}}\nB. {{B}}\nC. {{C}}\nD. {{D}}\n'A', 'B', 'C', 'D' 하나를 선택하여 ''답변: A'' 같이 단답식으로 답해 주세요."
include: _kbl_knowledge_yaml
task: kbl_statute_number_and_content_matching_qa_em
dataset_name: kbl_knowledge_statute_number_and_content_matching_qa
doc_to_text: "### 질문: {{question}}\nA. {{A}}\nB. {{B}}\nC. {{C}}\nD. {{D}}\nE. {{E}}\n A, B, C, D, E 하나를 선택하여 ''답변: A'' 같이 단답식으로 답해 주세요."
include: _kbl_knowledge_yaml
tag:
- kbl
- kbl_reasoning_em
description: '당신은 사용자의 질문에 친절하고 논리적으로 답변해 주는 법률 전문가 챗봇 입니다.\n'
dataset_path: lbox/kbl
test_split: test
output_type: generate_until
metric_list:
- metric: exact_match
aggregation: mean
higher_is_better: true
ignore_case: true
ignore_punctuation: false
filter_list:
- name: "get-answer"
filter:
- function: "regex"
regex_pattern: "([A-E]).*"
- function: "take_first"
task: kbl_case_relevance_qa_p_em
dataset_name: kbl_reasoning_case_relevance_qa_p
doc_to_text: "### 질문: {{question}}\n\n[첫번째 판결문 상고인]\n{{query_case_appellant}}\n[첫번째 판결문 사실관계]\n{{query_case_fact}}\n[첫번째 판결문 당사자들의 주장]\n{{query_case_claim}}\n[첫번째 판결문 판사의 의견]\n{{query_case_judicial_opinion}}\n\n[두번째 판결문 상고인]\n{{retrieved_case_appellant}}\n[두번째 판결문 사실관계]\n{{retrieved_case_fact}}\n[두번째 판결문 당사자들의 주장]\n{{retrieved_case_claim}}\n[두번째 판결문 판사의 의견]\n{{retrieved_case_judicial_opinion}}\n\nA: {{A}}, B: {{B}}\n 하나를 선택하여 '답변: A'과 같이 단답식으로 답해주세요."
doc_to_target: "{{label}}"
include: _kbl_reasoning_yaml
task: kbl_case_relevance_qa_q_em
dataset_name: kbl_reasoning_case_relevance_qa_q
doc_to_text: "### 질문: {{question}}\n[의뢰인의 주장]\n{{query}}\n\n[판결문]\n- 상고인\n{{retrieved_case_appellant}}\n- 사실관계\n{{retrieved_case_fact}}\n- 당사자들의 주장\n{{retrieved_case_claim}}\n- 판사의 의견\n{{retrieved_case_judicial_opinion}}\n\nA: {{A}}, B: {{B}}\n 하나를 선택하여 '답변: A'과 같이 단답식으로 답해주세요."
doc_to_target: "{{label}}"
include: _kbl_reasoning_yaml
task: kbl_causal_reasoning_qa_em
dataset_name: kbl_reasoning_causal_reasoning_qa
doc_to_text: "### 질문: {{question}}\n검사의 공소사실: {{facts_charged}}\n피고인의 주장: {{defendant_claim}}\n증거: {{facts_accepted}}\nX, Y를 각각\nX: {{cause}})\nY: {{effect}}\n라고 X와 Y 사이의 관계를\nA: {{A}}, B: {{B}}\n 하나를 선택하여 '답변: A'과 같이 단답식으로 답해주세요."
doc_to_target: label
include: _kbl_reasoning_yaml
task: kbl_statement_consistency_qa_em
dataset_name: kbl_reasoning_statement_consistency_qa
doc_to_text: "### 질문: {{question}}\n진술1: {{statement1}}\n진술2: {{statement2}}\nA: {{A}}, B: {{B}}\n 하나를 선택하여 '답변: A'과 같이 단답식으로 답해주세요."
doc_to_target: label
include: _kbl_reasoning_yaml
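One deliberate difference between the two shared templates: the knowledge tasks score with `ignore_punctuation: true`, while the reasoning tasks use `ignore_punctuation: false`. A rough sketch of the normalization this implies for `exact_match` (our simplification, not the harness's exact code):

```python
import string

def normalize(text: str, ignore_case: bool, ignore_punctuation: bool) -> str:
    if ignore_case:
        text = text.lower()
    if ignore_punctuation:
        text = text.translate(str.maketrans("", "", string.punctuation))
    return text.strip()

# Knowledge tasks: "B." and "b" both reduce to "b" -> match.
assert normalize("B.", True, True) == normalize("b", True, True)
# Reasoning tasks keep punctuation, so "B." != "B".
assert normalize("B.", True, False) != normalize("B", True, False)
```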
@@ -18,7 +18,7 @@ metric_list:
     higher_is_better: true
 num_fewshot: 4
 metadata:
-  version: 1.0
+  version: 2.0
 dataset_kwargs:
   trust_remote_code: true
 fewshot_config:
...
@@ -17,6 +17,9 @@ please install sympy via pip install lm-eval[math] or pip install -e .[math]",
 )
 
 
+INVALID_ANSWER = "[invalidanswer]"
+
+
 # taken from
 # https://github.com/wellecks/lm-evaluation-harness/blob/master/lm_eval/tasks/minerva_math.py
 def doc_to_text(doc: dict) -> str:
@@ -70,7 +73,10 @@ def process_results(doc: dict, results: List[str]) -> Dict[str, int]:
     unnormalized_answer = get_unnormalized_answer(candidates)
     answer = normalize_final_answer(unnormalized_answer)
 
-    if is_equiv(answer, doc["answer"]):
+    if answer == INVALID_ANSWER:
+        return {"exact_match": 0}
+
+    if answer.strip() == doc["answer"].strip() or is_equiv(answer, doc["answer"]):
         retval = 1
     else:
         retval = 0
@@ -112,17 +118,19 @@ def last_boxed_only_string(string: str) -> Optional[str]:
 
 def remove_boxed(s: str) -> str:
-    if "\\boxed " in s:
-        left = "\\boxed "
-        assert s[: len(left)] == left
-        return s[len(left) :]
-    left = "\\boxed{"
-    assert s[: len(left)] == left
-    assert s[-1] == "}"
-    return s[len(left) : -1]
+    try:
+        if "\\boxed " in s:
+            left = "\\boxed "
+            assert s[: len(left)] == left
+            return s[len(left) :]
+        left = "\\boxed{"
+        assert s[: len(left)] == left
+        assert s[-1] == "}"
+        return s[len(left) : -1]
+    except AssertionError:
+        return INVALID_ANSWER
 
 
 class timeout:
@@ -146,7 +154,7 @@ def is_equiv(x1: str, x2: str) -> bool:
     x1 and x2 are normalized latex string
     """
     try:
-        with timeout(seconds=5):
+        with timeout(seconds=1):
             try:
                 parsed_x1 = parse_latex(x1)
                 parsed_x2 = parse_latex(x2)
@@ -185,7 +193,6 @@ def is_equiv(x1: str, x2: str) -> bool:
 
 def get_unnormalized_answer(text: str) -> str:
-    INVALID_ANSWER = "[invalidanswer]"
     end_seq = "I hope it is correct."
     text += end_seq
     match = re.search(
...
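Read together, these hunks hoist `INVALID_ANSWER` to module scope, make `remove_boxed` return the sentinel instead of raising on malformed `\boxed{...}` spans, short-circuit scoring for invalid answers, try a cheap string comparison before the sympy equivalence check, and tighten the `parse_latex` timeout from 5 s to 1 s. Condensed into a runnable sketch (`score` is our name; `is_equiv` is stubbed):

```python
INVALID_ANSWER = "[invalidanswer]"

def is_equiv(x1: str, x2: str) -> bool:
    # Stand-in for the file's sympy check (parse_latex under a
    # 1-second timeout); a plain comparison so the sketch runs.
    return x1 == x2

def score(answer: str, gold: str) -> int:
    if answer == INVALID_ANSWER:
        return 0  # remove_boxed flagged a malformed \boxed{...}
    # New fast path: exact string match before the symbolic check.
    if answer.strip() == gold.strip():
        return 1
    return 1 if is_equiv(answer, gold) else 0

print(score("[invalidanswer]", "42"))  # -> 0
print(score(" 42 ", "42"))             # -> 1, sympy never invoked
```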
# LLAMA Evals
### Paper
Title: LLAMA Evals
Abstract: Evals reproducing those provided by the LLAMA team in the Hugging Face repo.
Homepage: `https://huggingface.co/collections/meta-llama/llama-31-evals-66a2c5a14c2093e58298ac7f`
Note: The tasks are formatted to be run with `apply_chat_template` and `fewshot_as_multiturn`.
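For instance, through the harness's Python entry point (a sketch; the checkpoint name is illustrative and `mmlu_llama` is the task listed below):

```python
import lm_eval

# Both flags below matter for these tasks: the prompts assume chat
# formatting, with fewshot examples delivered as prior turns.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=meta-llama/Llama-3.1-8B-Instruct",
    tasks=["mmlu_llama"],
    apply_chat_template=True,
    fewshot_as_multiturn=True,
)
print(results["results"])
```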
### Citation
```
BibTeX-formatted citation goes here
```
### Groups, Tags, and Tasks
#### Groups
* `group_name`: `Short description`
#### Tags
* `tag_name`: `Short description`
#### Tasks
* `mmlu_llama`: `generation variant of MMLU`
* `arc_challenge_chat`: `generation variant of ARC-Challenge using the MMLU format`
### Checklist
For adding novel benchmarks/datasets to the library:
* [ ] Is the task an existing benchmark in the literature?
* [ ] Have you referenced the original paper that introduced the task?
* [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
If other tasks on this dataset are already supported:
* [ ] Is the "Main" variant of this task clearly denoted?
* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
dataset_path: hails/mmlu_no_train # a copy of `cais/mmlu` with no auxiliary_train split
output_type: generate_until
test_split: test
fewshot_split: dev
fewshot_config:
sampler: first_n
doc_to_text: "Given the following question and four candidate answers (A, B, C and D), choose the best answer.\nQuestion: {{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nYour response should end with \"The best answer is [the_answer_letter]\" where the [the_answer_letter] is one of A, B, C or D."
assistant_prefill: "The best answer is"
doc_to_target: "{{['A.','B.','C.','D.'][answer]}}"
num_fewshot: 5
metric_list:
- metric: exact_match
aggregation: mean
higher_is_better: true
ignore_case: true
ignore_punctuation: true
regexes_to_ignore:
- "\\$"
- "\\.$"
generation_kwargs:
until:
- "."
max_gen_toks: 10
filter_list:
- name: strict_match
filter:
- function: remove_whitespace
- function: take_first
metadata:
version: 1.0
dataset_kwargs:
trust_remote_code: true
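`doc_to_text` above is a Jinja2 template rendered against each dataset row, and `assistant_prefill` seeds the assistant turn with "The best answer is" so the model only has to emit the letter. A quick way to sanity-check the rendered prompt (a sketch; the sample row is made up):

```python
from jinja2 import Template

# Mirrors doc_to_text above (re-typed here; the YAML is the source of truth).
DOC_TO_TEXT = (
    "Given the following question and four candidate answers "
    "(A, B, C and D), choose the best answer.\n"
    "Question: {{question.strip()}}\n"
    "A. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\n"
    "Your response should end with \"The best answer is [the_answer_letter]\" "
    "where the [the_answer_letter] is one of A, B, C or D."
)

doc = {  # made-up row in the MMLU schema
    "question": " What is the capital of France? ",
    "choices": ["Lyon", "Paris", "Nice", "Lille"],
    "answer": 1,
}
print(Template(DOC_TO_TEXT).render(**doc))
# doc_to_target indexes ['A.','B.','C.','D.'] with `answer`, so the
# gold target here is "B.", matched after the strict_match filter.
```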
group: mmlu_llama_humanities
group_alias: humanities
task:
- mmlu_llama_humanities_tasks
aggregate_metric_list:
- metric: exact_match
aggregation: mean
weight_by_size: True
filter_list: [strict_match]
metadata:
version: 1
group: mmlu_llama_other
group_alias: other
task:
- mmlu_llama_other_tasks
aggregate_metric_list:
- metric: exact_match
aggregation: mean
weight_by_size: True
filter_list: [strict_match]
metadata:
version: 1
group: mmlu_llama_social_sciences
group_alias: social sciences
task:
- mmlu_llama_social_sciences_tasks
aggregate_metric_list:
- metric: exact_match
aggregation: mean
weight_by_size: True
filter_list: [strict_match]
metadata:
version: 1
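Each group aggregates subtask `exact_match` with `weight_by_size: True`, i.e. a mean weighted by the number of documents per subtask rather than a plain average of subtask scores. Schematically (numbers made up):

```python
# (exact_match, num_docs) per subtask -- made-up numbers
subtasks = [(0.81, 1534), (0.64, 310), (0.72, 895)]

weighted = sum(s * n for s, n in subtasks) / sum(n for _, n in subtasks)
unweighted = sum(s for s, _ in subtasks) / len(subtasks)
print(f"weight_by_size=True:  {weighted:.4f}")   # doc-weighted mean
print(f"weight_by_size=False: {unweighted:.4f}")  # plain mean
```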