Unverified commit 7c9fbcf8, authored by PabloAgustin and committed by GitHub

New healthcare benchmark: careqa (#2714)



* New healthcare benchmark: careqa

* LAUNCH_MN5_ACC <python main.py --config config/mn5.yml --models Llama-3.2-1B-Instruct --tasks careqa_open --num_fewshot 0>

* Add fixes, READMES, and remove task_list.txt

* pre-commit passed, add formatting updates; add nanmean agg_metric

* Fix import error.

* Wrapped imports in try excepts

* Wrapped imports in try excepts; also metrics to catch bert_score import error

* Try except to catch ImportErrors as well

* use np.nan

* pre-commit

---------
Co-authored-by: PabloAgustin <pablo.martin@bsc.es>
Co-authored-by: Baber <baber@hey.com>
parent 2c8ffb80
...@@ -21,6 +21,13 @@ def bypass_agg(arr):
    return 999


@register_aggregation("nanmean")
def nanmean(arr):
    if len(arr) == 0 or all(np.isnan(arr)):
        return np.nan
    return np.nanmean(arr)


@register_aggregation("mean")
def mean(arr):
    return sum(arr) / len(arr)
...@@ -498,6 +505,7 @@ def stderr_for_metric(metric, bootstrap_iters: int):
        bleu,
        chrf,
        ter,
        nanmean,
    ]
    if metric in bootstrappable:
...
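For context (not part of the diff), a minimal sketch of why a `nanmean` aggregation matters here: when a per-document metric fails and the task's `process_results` records `np.nan` for it, a plain `mean` would turn the whole task score into NaN, whereas `nanmean` simply skips the failed documents.

```python
import numpy as np

# Hypothetical per-document scores: two documents failed to score and were
# recorded as NaN by the task's process_results function.
scores = [0.21, np.nan, 0.35, np.nan, 0.18]

print(np.mean(scores))     # nan  -- a plain mean is contaminated by the failures
print(np.nanmean(scores))  # 0.2466...  -- nanmean averages only the valid scores
```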
# CareQA
### Paper
Title: `Automatic Evaluation of Healthcare LLMs Beyond Question-Answering`
Abstract: [https://arxiv.org/abs/2502.06666](https://arxiv.org/abs/2502.06666)
CareQA originates from the Spanish Specialised Healthcare Training (MIR) exams administered
by the Spanish Ministry of Health. The close-ended version is a multiple-choice question
answering (MCQA) dataset comprising 5,621 QA pairs across six categories: medicine, nursing,
biology, chemistry, psychology, and pharmacology, sourced from the 2020 to 2024 exam
editions. CareQA is available in both English and Spanish. The open-ended version
(English only) contains 3,730 QA pairs.
Homepage: \
[https://huggingface.co/datasets/HPAI-BSC/CareQA](https://huggingface.co/datasets/HPAI-BSC/CareQA)
#### Tasks
* `careqa_en`: MCQA in English.
* `careqa_es`: MCQA in Spanish.
* `careqa_open`: Open-ended QA in English.
* `careqa_open_perplexity`: Open-ended QA in English, evaluated with perplexity.
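For orientation, a hedged sketch of how one of these tasks might be run through the harness's Python API (`lm_eval.simple_evaluate`); the checkpoint below is only an example and any Hugging Face model id can be substituted:

```python
# Sketch only: assumes the lm-evaluation-harness Python API and hardware able
# to load the chosen Hugging Face checkpoint.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=meta-llama/Llama-3.2-1B-Instruct",
    tasks=["careqa_en"],
    num_fewshot=0,
)
print(results["results"]["careqa_en"])  # accuracy on the English MCQA split
```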
### Citation
```bibtex
@misc{ariasduart2025automaticevaluationhealthcarellms,
title={Automatic Evaluation of Healthcare LLMs Beyond Question-Answering},
author={Anna Arias-Duart and Pablo Agustin Martin-Torres and Daniel Hinjos and Pablo Bernabeu-Perez and Lucia Urcelay Ganzabal and Marta Gonzalez Mallo and Ashwin Kumar Gururajan and Enrique Lopez-Cuena and Sergio Alvarez-Napagao and Dario Garcia-Gasulla},
year={2025},
eprint={2502.06666},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2502.06666},
}
```
task: careqa_en
dataset_path: HPAI-BSC/CareQA
dataset_name: CareQA_en
test_split: test
output_type: multiple_choice
doc_to_text: !function utils.doc_to_text
doc_to_target: !function utils.doc_to_target
doc_to_choice: ['A', 'B', 'C', 'D']
metric_list:
- metric: acc
aggregation: mean
higher_is_better: true
metadata:
version: 1.0
include: careqa_en.yaml
task: careqa_es
dataset_name: CareQA_es
task: careqa_open
dataset_path: HPAI-BSC/CareQA
dataset_name: CareQA_en_open
description: >
Instructions: The following text is a medical question. Answer it in the most factual, concise and informative way possible.
output_type: generate_until
test_split: test
doc_to_text: !function utils_open.doc_to_text
doc_to_target: !function utils_open.doc_to_target
process_results: !function utils_open.process_results_gen
generation_kwargs:
until:
- "\n\n"
metric_list:
- metric: bleu
aggregation: nanmean
higher_is_better: true
- metric: rouge1
aggregation: nanmean
higher_is_better: true
- metric: rouge2
aggregation: nanmean
higher_is_better: true
- metric: rougeL
aggregation: nanmean
higher_is_better: true
- metric: bleurt
aggregation: nanmean
higher_is_better: true
- metric: bert_score
aggregation: nanmean
higher_is_better: true
metadata:
version: 1.0
include: careqa_open.yaml
task: careqa_open_perplexity
output_type: loglikelihood_rolling
doc_to_text: ""
doc_to_target: !function utils_open.doc_to_target
process_results: !function utils_perplexity.process_results
metric_list:
- metric: word_perplexity
higher_is_better: false
- metric: byte_perplexity
higher_is_better: false
- metric: bits_per_byte
higher_is_better: false
metadata:
version: 1.0
generation_kwargs: null
def doc_to_text(doc) -> str:
"""
Question: <question>
Choices:
A. <choice1>
B. <choice2>
C. <choice3>
D. <choice4>
Answer:
"""
if doc["question"] is None:
doc = {
"question": "In relation to the immune mechanism involved in the rejection of transplanted solid organs, indicate the incorrect answer:",
"op1": "Acute T-cell mediated rejection can be controlled through the use of drugs such as cyclosporine A or corticosteroids.",
"exam_id": 36,
"op3": "Chronic rejection or chronic graft injury is associated with endothelial damage mediated by anti-HLA antibodies.",
"category": "Medicine",
"unique_id": "5636d1af-e0b1-43b0-8a04-6f127dcf6785",
"op4": "Hyperacute rejection is mediated by cytotoxic T lymphocytes against donor antigens present in the recipient.",
"op2": "The presence of specific antibodies against the donor (DSA) in the recipient prior to transplantation is a contraindication for it.",
"cop": 4,
"year": 2024,
}
choices = [doc["op1"], doc["op2"], doc["op3"], doc["op4"]]
option_choices = {
"A": choices[0],
"B": choices[1],
"C": choices[2],
"D": choices[3],
}
prompt = "Question: " + doc["question"] + "\nChoices:\n"
for choice, option in option_choices.items():
prompt += f"{choice.upper()}. {option}\n"
prompt += "Answer:"
return prompt
def doc_to_target(doc) -> int:
return doc["cop"] - 1
import numpy as np
try:
import evaluate
bleu = evaluate.load("bleu")
rouge = evaluate.load("rouge")
bertscore = evaluate.load("bertscore")
bleurt = evaluate.load("bleurt", "bleurt-base-512", module_type="metric")
except (ModuleNotFoundError, ImportError):
raise ModuleNotFoundError(
"Please install evaluation metrics via pip install evaluate and pip install bert-score",
)
except Exception as e:
raise RuntimeError(
f"Error loading evaluation metrics: {str(e)}. Please check your installation."
)
def doc_eval(pred, refs):
try:
bleu_results = bleu.compute(predictions=pred, references=refs)
except Exception as e:
print(f"Bleu error: {e}")
bleu_results = {"bleu": np.NAN}
try:
rouge_results = rouge.compute(predictions=pred, references=refs)
except Exception as e:
print(f"Rouge error: {e}")
rouge_results = {"rouge1": np.NAN, "rouge2": np.NAN, "rougeL": np.NAN}
try:
bleurt_scores = bleurt.compute(predictions=pred, references=refs)["scores"]
except Exception as e:
print(f"Bleurt error: {e}")
bleurt_scores = [np.nan]
try:
bert_scores = bertscore.compute(predictions=pred, references=refs, lang="en")[
"f1"
]
except Exception as e:
print(f"Bert error: {e}")
bert_scores = [np.nan]
if bleu_results["bleu"] == 0:
# Sometimes bleu is 0.0 and this breaks the stderr computation.
bleu_results["bleu"] += 1e-5
results = {
"bleu": bleu_results["bleu"],
"rouge1": rouge_results["rouge1"],
"rouge2": rouge_results["rouge2"],
"rougeL": rouge_results["rougeL"],
"bleurt": np.mean(bleurt_scores),
"bert_score": np.mean(bert_scores),
}
return results
def doc_to_text(doc) -> str:
return doc["question"]
def doc_to_target(doc) -> str:
return doc["answer"]
def process_results_gen(doc, results):
pred, refs = [results[0]], [doc_to_target(doc)]
if len(refs[0]) < 1 or len(pred[0]) < 1:
return {
"bleu": np.NAN,
"rouge1": np.NAN,
"rouge2": np.NAN,
"rougeL": np.NAN,
"bleurt": np.NAN,
"bert_score": np.NAN,
}
results = doc_eval(pred, refs)
return {
"bleu": results["bleu"],
"rouge1": results["rouge1"],
"rouge2": results["rouge2"],
"rougeL": results["rougeL"],
"bleurt": results["bleurt"],
"bert_score": results["bert_score"],
}
def process_results_gen_w_repeats(doc, results):
pred, refs = [results[0]], [doc_to_target(doc)]
if len(refs[0]) < 1 or len(pred[0]) < 1:
return {
"bleu": np.NAN,
"rouge1": np.NAN,
"rouge2": np.NAN,
"rougeL": np.NAN,
"bleurt": np.NAN,
"bert_score": np.NAN,
}
results = doc_eval(pred, refs)
return {
"bleu": results["bleu"],
"rouge1": results["rouge1"],
"rouge2": results["rouge2"],
"rougeL": results["rougeL"],
"bleurt": results["bleurt"],
"bert_score": results["bert_score"],
}
import math
import re
def doc_to_target(doc) -> str:
return doc["answer"]
def process_results(doc, results):
(loglikelihood,) = results
_words = len(re.split(r"\s+", doc_to_target(doc)))
_bytes = len(doc_to_target(doc).encode("utf-8"))
print(f"perplexity: {math.exp(-loglikelihood / _words)}")
return {
"word_perplexity": (loglikelihood, _words),
"byte_perplexity": (loglikelihood, _bytes),
"bits_per_byte": (loglikelihood, _bytes),
}
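For context, a rough sketch of what the harness does with these (loglikelihood, count) pairs when aggregating `word_perplexity`: per-document log-likelihoods and word counts are summed over the split and combined as exp(-total_loglikelihood / total_words). The numbers below are made up and this is not the harness's actual implementation:

```python
import math

# Hypothetical (loglikelihood, word_count) pairs returned by process_results
# for three documents.
per_doc = [(-42.7, 31), (-55.1, 40), (-38.9, 27)]

total_ll = sum(ll for ll, _ in per_doc)
total_words = sum(n for _, n in per_doc)

# Corpus-level word perplexity, assuming a weighted aggregation of this form.
print(math.exp(-total_ll / total_words))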
group: med_prescriptions
task: med_prescriptions_easy
dataset_path: devlocalhost/prescription-full
output_type: multiple_choice
training_split: train
validation_split: train
test_split: train
process_docs: !function utils.process_docs
doc_to_text: !function utils.doc_to_text_easy
doc_to_choice: !function utils.doc_to_choice_easy
doc_to_target: !function utils.doc_to_target
generation_kwargs:
until:
- "\n\n"
metric_list:
- metric: acc
aggregation: mean
higher_is_better: true
metadata:
version: 1.0
include: med_prescriptions_easy.yaml
task: med_prescriptions_hard
doc_to_text: !function utils.doc_to_text_hard
doc_to_choice: !function utils.doc_to_choice_hard
group: med_text_classification
task: med_text_classification_easy
dataset_path: csv
dataset_name: null
dataset_kwargs:
data_files:
train: /gpfs/projects/bsc70/heka/data/datasets/med_text_class_train.csv
output_type: multiple_choice
training_split: train
validation_split: train
test_split: train
process_docs: !function utils.process_docs
doc_to_text: !function utils.doc_to_text_easy
doc_to_choice: !function utils.doc_to_choice_easy
doc_to_target: !function utils.doc_to_target_easy
generation_kwargs:
until:
- "\n\n"
metric_list:
- metric: acc
aggregation: mean
higher_is_better: true
metadata:
version: 1.0
include: med_text_classification_easy.yaml
task: med_text_classification_hard
dataset_kwargs:
data_files:
train: /gpfs/projects/bsc70/heka/data/datasets/mtsamples.csv
process_docs: !function utils.process_docs_hard
doc_to_text: !function utils.doc_to_text_hard
doc_to_choice: !function utils.doc_to_choice_hard
doc_to_target: !function utils.doc_to_target_hard
import random
import datasets
def process_docs_hard(dataset: datasets.Dataset):
return dataset
def process_docs(dataset: datasets.Dataset):
def _helper(doc):
return doc
num_entries = len(dataset)
ten_percent_index = int(0.1 * num_entries)
# Select the first 10% of the dataset
filtered_dataset = dataset.select(range(ten_percent_index))
return filtered_dataset.map(_helper)
def doc_to_choice_easy(doc):
return [
"neoplasms",
"digestive system diseases",
"nervous system diseases",
"cardiovascular diseases",
"general pathological conditions",
]
def doc_to_text_easy(doc) -> str:
choices = doc_to_choice_easy(doc)
prompt = (
"Classify the topic of the following medical text into one of the following choices. \n"
"Text: {} \n"
"Choices: \n"
"A. {} \n"
"B. {} \n"
"C. {} \n"
"D. {} \n"
"E. {} \n Answer:".format(
doc["text"], choices[0], choices[1], choices[2], choices[3], choices[4]
)
)
return prompt
def doc_to_target_easy(doc):
return int(doc["class"]) - 1
def doc_to_text_hard(doc) -> str:
choices = doc_to_choice_hard(doc)
prompt = (
"Select the medical specialty the following text is talking about among the following choices. \n"
"Text: {} \n"
"Choices: {}\n"
" Answer:".format(doc["transcription"], choices)
)
return prompt
def doc_to_choice_hard(doc):
choices_list = [
" Bariatrics",
" Allergy / Immunology",
" Dentistry",
" Cardiovascular / Pulmonary",
" Urology",
" Hospice - Palliative Care",
" Radiology",
" Pediatrics - Neonatal",
" Neurology",
" Neurosurgery",
" Emergency Room Reports",
" IME-QME-Work Comp etc.",
" Office Notes",
" Surgery",
" Letters",
" Ophthalmology",
" Hematology - Oncology",
" Endocrinology",
" Cosmetic / Plastic Surgery",
" Diets and Nutritions",
" Rheumatology",
" Nephrology",
" Physical Medicine - Rehab",
" Podiatry",
" Chiropractic",
" Lab Medicine - Pathology",
" Orthopedic",
" Autopsy",
" Psychiatry / Psychology",
" Speech - Language",
" ENT - Otolaryngology",
" Sleep Medicine",
" Dermatology",
" SOAP / Chart / Progress Notes",
" General Medicine",
" Consult - History and Phy.",
" Obstetrics / Gynecology",
" Gastroenterology",
" Pain Management",
" Discharge Summary",
]
return choices_list
def doc_to_target_hard(doc):
choices = doc_to_choice_hard(doc)
gold = doc["medical_specialty"]
idx = choices.index(gold)
return idx
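Worth noting (an observation, not documented in the PR): every entry in `doc_to_choice_hard` keeps a leading space, presumably because the raw `medical_specialty` values in mtsamples.csv carry one; otherwise `choices.index(gold)` would raise a ValueError. A tiny illustration with a hypothetical row:

```python
# Hypothetical row: the leading space in the label is intentional.
sample = {"medical_specialty": " Cardiovascular / Pulmonary"}
print(doc_to_target_hard(sample))  # 3, the position of that label in doc_to_choice_hard
```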
# Meddialog
### Paper
Title: `MedDialog: Large-scale Medical Dialogue Datasets`
Abstract: [https://aclanthology.org/2020.emnlp-main.743/](https://aclanthology.org/2020.emnlp-main.743/)
This task group contains the English version of the MedDialog medical dialogue dataset, divided into two tasks:
question entailment, and open-ended question answering (QA).
#### Tasks
* `meddialog_qsumm`: Question entailment in English.
* `meddialog_qsumm_perplexity`: Question entailment in English, evaluated with perplexity.
* `meddialog_raw_dialogues`: Open-ended QA in English.
* `meddialog_raw_perplexity`: Open-ended QA in English, evaluated with perplexity.
### Citation
```bibtex
@inproceedings{zeng-etal-2020-meddialog,
title = "{M}ed{D}ialog: Large-scale Medical Dialogue Datasets",
author = "Zeng, Guangtao and
Yang, Wenmian and
Ju, Zeqian and
Yang, Yue and
Wang, Sicheng and
Zhang, Ruisi and
Zhou, Meng and
Zeng, Jiaqi and
Dong, Xiangyu and
Zhang, Ruoyu and
Fang, Hongchao and
Zhu, Penghui and
Chen, Shu and
Xie, Pengtao",
editor = "Webber, Bonnie and
Cohn, Trevor and
He, Yulan and
Liu, Yang",
booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
month = nov,
year = "2020",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2020.emnlp-main.743/",
doi = "10.18653/v1/2020.emnlp-main.743",
pages = "9241--9250",
abstract = "Medical dialogue systems are promising in assisting in telemedicine to increase access to healthcare services, improve the quality of patient care, and reduce medical costs. To facilitate the research and development of medical dialogue systems, we build large-scale medical dialogue datasets {--} MedDialog, which contain 1) a Chinese dataset with 3.4 million conversations between patients and doctors, 11.3 million utterances, 660.2 million tokens, covering 172 specialties of diseases, and 2) an English dataset with 0.26 million conversations, 0.51 million utterances, 44.53 million tokens, covering 96 specialties of diseases. To our best knowledge, MedDialog is the largest medical dialogue dataset to date. We pretrain several dialogue generation models on the Chinese MedDialog dataset, including Transformer, GPT, BERT-GPT, and compare their performance. It is shown that models trained on MedDialog are able to generate clinically correct and doctor-like medical dialogues. We also study the transferability of models trained on MedDialog to low-resource medical dialogue generation tasks. It is shown that via transfer learning which finetunes the models pretrained on MedDialog, the performance on medical dialogue generation tasks with small datasets can be greatly improved, as shown in human evaluation and automatic evaluation. The datasets and code are available at \url{https://github.com/UCSD-AI4H/Medical-Dialogue-System}"
}
```
group: meddialog
include: meddialog_raw_dialogues.yaml
task: meddialog_qsumm
dataset_path: lighteval/med_dialog
dataset_name: icliniq
description: >
Instructions: The following text contains a medical question. Extract and summarize the question.
output_type: generate_until
training_split: train
validation_split: validation
test_split: test
doc_to_text: !function utils.doc_to_text_qsumm
doc_to_target: !function utils.doc_to_target_qsumm
process_results: !function utils.process_results_gen_qsumm
include: meddialog_qsumm.yaml
task: meddialog_qsumm_perplexity
output_type: loglikelihood_rolling
doc_to_text: ""
process_results: !function utils_perplexity.process_results_qsumm
metric_list:
- metric: word_perplexity
higher_is_better: false
- metric: byte_perplexity
higher_is_better: false
- metric: bits_per_byte
higher_is_better: false
metadata:
version: 1.0
group: meddialog
task: meddialog_raw_dialogues
dataset_path: bigbio/meddialog
description: >
Instructions: The following text is from a collection of medical dialogues. What follows is the patient's question. Answer how a doctor would, trying to be as helpful as possible.
output_type: generate_until
training_split: train
validation_split: train
test_split: train
doc_to_text: !function utils.doc_to_text_raw
doc_to_target: !function utils.doc_to_target_raw
process_results: !function utils.process_results_gen_raw
generation_kwargs:
until:
- "\n\n"
metric_list:
- metric: bleu
aggregation: nanmean
higher_is_better: true
- metric: rouge1
aggregation: nanmean
higher_is_better: true
- metric: rouge2
aggregation: nanmean
higher_is_better: true
- metric: rougeL
aggregation: nanmean
higher_is_better: true
- metric: bleurt
aggregation: nanmean
higher_is_better: true
- metric: bert_score
aggregation: nanmean
higher_is_better: true
metadata:
version: 1.0