Commit c1e63555 authored by Yu Shi Jie's avatar Yu Shi Jie

Merge branch 'upstream' into 'mmlu-pro'

add tokenizer logs info (#1731)

See merge request shijie.yu/lm-evaluation-harness!4
parents e361687c 42dc2448
import string


def doc_to_text(doc):
    text = f"{doc['question']}\n"
    for i in range(len(doc["options"])):
        text += f"{string.ascii_uppercase[i]}. {doc['options'][i]}\n"
    text += "Answer:"
    return text


def doc_to_choice(doc):
    return [string.ascii_uppercase[i] for i in range(len(doc["options"]))]
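As a quick sanity check, the two helpers can be exercised on a hand-made doc in the dataset's shape (the `question`/`options` keys mirror the dataset rows; the sample values here are invented):

```python
import string


def doc_to_text(doc):
    # Render the question, then lettered options, then an answer cue.
    text = f"{doc['question']}\n"
    for i in range(len(doc["options"])):
        text += f"{string.ascii_uppercase[i]}. {doc['options'][i]}\n"
    text += "Answer:"
    return text


def doc_to_choice(doc):
    # One uppercase letter per option: A, B, C, ...
    return [string.ascii_uppercase[i] for i in range(len(doc["options"]))]


doc = {"question": "What is 2 + 2?", "options": ["3", "4", "5"]}
print(doc_to_text(doc))   # What is 2 + 2?\nA. 3\nB. 4\nC. 5\nAnswer:
print(doc_to_choice(doc)) # ['A', 'B', 'C']
```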
group: leaderboard_musr
task:
  - leaderboard_musr_murder_mysteries
  - leaderboard_musr_object_placements
  - leaderboard_musr_team_allocation
dataset_path: TAUR-Lab/MuSR
output_type: multiple_choice
doc_to_text: !function utils.doc_to_text
doc_to_target: "{{answer_choice}}"
doc_to_choice: "{{choices}}"
metric_list:
  - metric: acc_norm
    aggregation: mean
    higher_is_better: true
metadata:
  version: 1.0
include: "_template_yaml"
task: leaderboard_musr_murder_mysteries
test_split: murder_mysteries

include: "_template_yaml"
task: leaderboard_musr_object_placements
test_split: object_placements

include: "_template_yaml"
task: leaderboard_musr_team_allocation
test_split: team_allocation
import ast


def doc_to_choice(doc):
    """
    Convert a doc to a choice.
    """
    return ast.literal_eval(doc["choices"])


DOC_TO_TEXT = "{narrative}\n\n" "{question}\n\n" "{choices}\n" "Answer:"


def doc_to_text(doc):
    """
    Convert a doc to text.
    """
    choices = ""
    for i, choice in enumerate(ast.literal_eval(doc["choices"])):
        choices += f"{i+1} - {choice}\n"
    text = DOC_TO_TEXT.format(
        narrative=doc["narrative"], question=doc["question"], choices=choices
    )
    return text
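Note that MuSR stores the answer options as a stringified Python list, which is why both helpers go through `ast.literal_eval`. A minimal sketch with an invented doc (the `narrative`/`question`/`choices` keys follow the dataset schema; the sample values are made up):

```python
import ast

# Prompt template: narrative, question, numbered choices, answer cue.
DOC_TO_TEXT = "{narrative}\n\n" "{question}\n\n" "{choices}\n" "Answer:"


def doc_to_text(doc):
    # The "choices" field is a string like "['a', 'b']", so parse it first.
    choices = ""
    for i, choice in enumerate(ast.literal_eval(doc["choices"])):
        choices += f"{i+1} - {choice}\n"
    return DOC_TO_TEXT.format(
        narrative=doc["narrative"], question=doc["question"], choices=choices
    )


doc = {
    "narrative": "Ana left her keys somewhere in the house.",
    "question": "Where are the keys?",
    "choices": "['kitchen', 'garden']",
}
print(doc_to_text(doc))
```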
group:
tag:
  - math_word_problems
task: mathqa
dataset_path: math_qa
......
# MedConceptsQA
### Paper
Title: `MedConceptsQA: Open Source Medical Concepts QA Benchmark`
Abstract: https://arxiv.org/abs/2405.07348
MedConceptsQA is a dedicated open source benchmark for medical concepts question answering. The benchmark comprises questions about various medical concepts across different vocabularies: diagnoses, procedures, and drugs.
The questions are categorized into three levels of difficulty: easy, medium, and hard.
Our benchmark serves as a valuable resource for evaluating the
abilities of Large Language Models to interpret medical codes and distinguish
between medical concepts.
### Citation
```
@article{shoham2024medconceptsqa,
title={MedConceptsQA--Open Source Medical Concepts QA Benchmark},
author={Shoham, Ofir Ben and Rappoport, Nadav},
journal={arXiv preprint arXiv:2405.07348},
year={2024}
}
```
### Groups and Tasks
#### Groups
* `med_concepts_qa`: Contains all the QA tasks (diagnoses, procedures, and drugs).
#### Tasks
* `med_concepts_qa_icd9cm` - ICD9-CM (diagnosis codes, ICD9 format) question-answering. This involves providing information, clarifications, and answering questions related to ICD-9-CM (International Classification of Diseases, 9th Revision, Clinical Modification) diagnosis codes.
* `med_concepts_qa_icd10cm` - ICD10-CM (diagnosis codes, ICD10 format) question-answering. This involves providing information, clarifications, and answering questions related to ICD-10-CM (International Classification of Diseases, 10th Revision, Clinical Modification) diagnosis codes.
* `med_concepts_qa_icd9proc` - ICD9-Proc (procedure codes, ICD9 format) question-answering. This involves providing information, clarifications, and answering questions related to ICD-9-PCS (International Classification of Diseases, 9th Revision, Procedure Coding System) procedure codes.
* `med_concepts_qa_icd10proc` - ICD10-Proc (procedure codes, ICD10 format) question-answering. This involves providing information, clarifications, and answering questions related to ICD-10-PCS (International Classification of Diseases, 10th Revision, Procedure Coding System) procedure codes.
* `med_concepts_qa_atc` - ATC (Anatomical Therapeutic Chemical Classification System) question-answering. This involves providing information, clarifications, and answering questions related to the ATC classification system, which is used for the classification of drugs and other medical products according to the organ or system on which they act and their therapeutic, pharmacological, and chemical properties.
dataset_path: ofir408/MedConceptsQA
output_type: multiple_choice
description: "Answer A,B,C,D according to the answer to this multiple choice question.\n"
fewshot_split: dev
fewshot_config:
  sampler: first_n
num_fewshot: 4
test_split: test
doc_to_text: "{{question}}\nAnswer:"
doc_to_target: answer_id
doc_to_choice: ['A', 'B', 'C', 'D']
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
from typing import List

import yaml


def generate_yaml_content(vocab_name: str, level: str):
    content = {
        "dataset_name": f"{vocab_name}_{level}",
        "tag": f"med_concepts_qa_{vocab_name}_tasks",
        "include": "_default_template_yaml",
        "task": f"med_concepts_qa_{vocab_name}_{level}",
        "task_alias": f"{vocab_name}_{level}",
    }
    return content


def generate_yaml_files(
    vocab_names: List[str], levels: List[str], file_name_prefix: str
):
    for vocab_name in vocab_names:
        for level in levels:
            yaml_content = generate_yaml_content(vocab_name, level)
            filename = f"{file_name_prefix}_{vocab_name}_{level}.yaml"
            with open(filename, "w") as yaml_file:
                yaml.dump(yaml_content, yaml_file, default_flow_style=False)
            print(f"Generated {filename}")


if __name__ == "__main__":
    generate_yaml_files(
        vocab_names=["icd9cm", "icd10cm", "icd9proc", "icd10proc", "atc"],
        levels=["easy", "medium", "hard"],
        file_name_prefix="med_concepts_qa",
    )
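For a single vocabulary/level pair, the generator produces a small mapping that mirrors the per-task YAML files shown below. A minimal check (the function body is copied from the script above):

```python
def generate_yaml_content(vocab_name: str, level: str):
    # Build the per-task config mapping that gets dumped to YAML.
    return {
        "dataset_name": f"{vocab_name}_{level}",
        "tag": f"med_concepts_qa_{vocab_name}_tasks",
        "include": "_default_template_yaml",
        "task": f"med_concepts_qa_{vocab_name}_{level}",
        "task_alias": f"{vocab_name}_{level}",
    }


content = generate_yaml_content("atc", "easy")
print(content["task"])        # med_concepts_qa_atc_easy
print(content["task_alias"])  # atc_easy
```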
group: med_concepts_qa
task:
  - med_concepts_qa_icd9cm
  - med_concepts_qa_icd10cm
  - med_concepts_qa_icd9proc
  - med_concepts_qa_icd10proc
  - med_concepts_qa_atc
aggregate_metric_list:
  - metric: acc
    aggregation: mean

group: med_concepts_qa_atc
task:
  - med_concepts_qa_atc_tasks
aggregate_metric_list:
  - metric: acc
    aggregation: mean

group: med_concepts_qa_icd10cm
task:
  - med_concepts_qa_icd10cm_tasks
aggregate_metric_list:
  - metric: acc
    aggregation: mean

group: med_concepts_qa_icd10proc
task:
  - med_concepts_qa_icd10proc_tasks
aggregate_metric_list:
  - metric: acc
    aggregation: mean

group: med_concepts_qa_icd9cm
task:
  - med_concepts_qa_icd9cm_tasks
aggregate_metric_list:
  - metric: acc
    aggregation: mean

group: med_concepts_qa_icd9proc
task:
  - med_concepts_qa_icd9proc_tasks
aggregate_metric_list:
  - metric: acc
    aggregation: mean

dataset_name: atc_easy
include: _default_template_yaml
tag: med_concepts_qa_atc_tasks
task: med_concepts_qa_atc_easy
task_alias: atc_easy

dataset_name: atc_hard
include: _default_template_yaml
tag: med_concepts_qa_atc_tasks
task: med_concepts_qa_atc_hard
task_alias: atc_hard

dataset_name: atc_medium
include: _default_template_yaml
tag: med_concepts_qa_atc_tasks
task: med_concepts_qa_atc_medium
task_alias: atc_medium